SVI-Part1 readjusted with multinomial instead of bernoulli doesn't work?

Hello

I have my own model and data for inferring parameters of a multinomial distribution, but it doesn't work.
So I tried redoing the SVI Part I example with a multinomial to see if that would behave better, but it doesn't seem to work either, so I'm trying to figure out what's wrong.

data=[]
for i in range(2000):
    data.append(pyro.sample("obs_{}".format(i),pdist.Multinomial(10,probs=torch.Tensor([0.6,0.4]))))

def model(data):
    alpha0 = 10
    alpha1 = 10
    f = pyro.sample("latent",pdist.Beta(alpha0, alpha1))
    ok = torch.Tensor([f,1-f])
    for i in range(len(data)):
        pyro.sample("obs_{}".format(i), pdist.Multinomial(10,probs=ok),obs=data[i])

def guide(data):
    qalpha0 = pyro.param("qalpha0",torch.Tensor([15]),constraint=constraints.positive)
    qalpha1 = pyro.param("qalpha1",torch.Tensor([15]),constraint=constraints.positive)
    pyro.sample("latent",pdist.Beta(qalpha0, qalpha1))

adam_params = {"lr": 0.001, "betas": (0.9, 0.999)}

optimizer = pyro.optim.Adam(adam_params)

svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

n_steps = 4000
for step in range(n_steps):
    svi.step(data)
    if step % 100 == 0:
        print(step)
        print(pyro.param("qalpha0")/(pyro.param("qalpha0")+pyro.param("qalpha1")))

Basically I'm doing the same thing as in the tutorial, except that the shape of the data and the distribution change (multinomial instead of bernoulli). The problem is that the inferred f given by the last line is roughly 0.5 instead of 0.6. I've already thought about the size of the data, but with len(data)=2000 the Bernoulli version from SVI Part I infers 0.6 perfectly; also, the computation time here is pretty high.

What's wrong?

Hi @telecom, I’m not sure what’s wrong but here are two things I noticed:

  1. You should use torch.tensor(15.) instead of torch.Tensor([15]) to initialize the qalphas. Note the lower-case t, the decimal point (which makes the result a float tensor), and the lack of brackets (which makes the result have .dim()==0 rather than .dim()==1); see the quick shape/dtype check after the snippet below.
  2. You can get faster debugging cycles by vectorizing the data and the observe statement:
data = pdist.Multinomial(10, probs=torch.tensor([0.6, 0.4]).expand_by([2000]).sample()
assert data.shape == (2000, 2)

def model(data):
    ...
    with pyro.iarange("data", len(data)):  # faster than a for loop
        pyro.sample("obs", pdist.Multinomial(10, probs=ok).expand_by([len(data)]),
                    obs=data)
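For the first point, a quick shape/dtype check shows the difference:

print(torch.Tensor([15]).shape, torch.Tensor([15]).dtype)  # torch.Size([1]) torch.float32
print(torch.tensor(15).shape, torch.tensor(15).dtype)      # torch.Size([]) torch.int64
print(torch.tensor(15.).shape, torch.tensor(15.).dtype)    # torch.Size([]) torch.float32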

Also, what are your final pyro.param('qalpha0') values? I believe the posterior should be approximately

expected_qalpha0 = 12010.0
expected_qalpha1 = 8010.0
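(this is just the rough conjugate update: the Beta(10, 10) prior plus the expected counts from 2000 draws of 10 trials each, i.e. 10 + 2000 * 10 * 0.6 ≈ 12010 and 10 + 2000 * 10 * 0.4 ≈ 8010)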

Hi fritzo and thanks for the reply.

I think that for your second point you meant:
data = pdist.Multinomial(10, probs=torch.tensor([0.6, 0.4]).expand(2000,-1)).sample()

I resolved the problem by using a Dirichlet instead of a Beta, so that the parameter is a 2-element tensor, and then it works (before that I had qalpha0=12.6 and qalpha1=13.1, to answer your question):

pyro.clear_param_store()
data = pdist.Multinomial(10, probs=torch.tensor([0.6, 0.4]).expand(2000,-1)).sample()
assert data.shape == (2000, 2)

def model(data):
    alpha = torch.tensor([10., 10.])
    f = pyro.sample("latent",pdist.Dirichlet(concentration=alpha))
    with pyro.iarange("data", len(data)):  
        pyro.sample("obs", pdist.Multinomial(10, probs=f.expand(len(data),-1)),
                    obs=data)

def guide(data):
    qalpha = pyro.param("qalpha", torch.tensor([15., 15.]), constraint=constraints.positive)
    pyro.sample("latent",pdist.Dirichlet(qalpha))

(After inference I obtain qalpha = tensor([21., 14.]), so E[X1] = 21/(21+14) = 0.6 and E[X2] = 14/(21+14) = 0.4; it's also much faster this way.)
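In code, the check is just:

qalpha = pyro.param("qalpha")
print(qalpha / qalpha.sum())   # roughly tensor([0.6000, 0.4000])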

I'm now working on my own problem (maybe I will use Categorical instead of Multinomial, we'll see). The tricky thing is that I have to put the latent variable in a list and apply a function to that list:

(pseudocode)
blabla1 = torch.tensor(blabla)
blabla2 = torch.tensor(blabla)

def model(data):
    alpha = torch.tensor([10., 10.])
    f = pyro.sample("latent", pdist.Dirichlet(concentration=alpha))
    target = fun([f, blabla1, blabla2])
    for i in range(len(data)):
        pyro.sample("obs_{}".format(i), pdist.Multinomial(10, target), obs=data[i])

Assuming the function correctly returns a tensor, I hope that will work. (?)
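For example, a minimal stand-in for fun that keeps everything as tensor operations, so gradients can still flow through f (the weights below are just placeholders for blabla1/blabla2):

w1 = torch.tensor([0.3, 0.7])   # placeholder for blabla1
w2 = torch.tensor([0.2, 0.8])   # placeholder for blabla2

def fun(args):
    f, b1, b2 = args
    # any chain of differentiable tensor ops is fine; rebuilding with torch.Tensor([...]) would break the graph
    mix = f * b1 + (1 - f) * b2
    return mix / mix.sum()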

Edit: Okay, that works well with the regular functions I've tried so far. I will now try more complex functions involving convolutions and so on. I'll update the topic if any issues come up :wink:

Hey @telecom or any of the many other very smart people on this forum,

Were you able to get the multinomial/categorical version of the problem to work? I’m getting started with Pyro and trying to accomplish something similar with a Categorical distribution.

There are possibilities 0, 1, and 2 in the categorical distribution, and I'm feeding it a dataset that should be extremely skewed towards outcome 0. It compiles and runs, but the model doesn't seem to be training at all. I would expect prob0 to go up and prob1/prob2 to go down.

One thought I had (and am unsure how to debug) is that the Categorical distribution normalizes the probs argument automatically, but I'm not sure whether that transformation of my pyro.sample statements is somehow breaking backprop through the rest of the model.

Dataset:

data = torch.cat((torch.zeros(500), torch.ones(5), torch.empty(5).fill_(2.)))
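(A quick sanity check of the empirical frequencies the model should roughly recover:)

counts = torch.bincount(data.long(), minlength=3).float()
print(counts / counts.sum())   # tensor([0.9804, 0.0098, 0.0098])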

Here’s my model and guide:

def model(data):
    # Prior beta with top probability ~ 0.3
    alpha0 = alpha1 = alpha2 = torch.tensor(6.)
    beta0 = beta1 = beta2 = torch.tensor(10.)
    
    prob0 = pyro.sample("prob0", dist.Beta(alpha0, beta0))
    prob1 = pyro.sample("prob1", dist.Beta(alpha1, beta1))
    prob2 = pyro.sample("prob2", dist.Beta(alpha2, beta2))
    
    # loop over the observed data
    with pyro.iarange("data_loop", len(data)):
        pyro.sample("obs", dist.Categorical(
            probs=torch.tensor([prob0, prob1, prob2])), 
            obs=data
        )

def guide(data):
    alpha_q0 = pyro.param("alpha_q0", torch.tensor(6.0),
                         constraint=constraints.positive)
    beta_q0 = pyro.param("beta_q0", torch.tensor(15.0),
                        constraint=constraints.positive)
    alpha_q1 = pyro.param("alpha_q1", torch.tensor(6.0),
                         constraint=constraints.positive)
    beta_q1 = pyro.param("beta_q1", torch.tensor(15.0),
                        constraint=constraints.positive)
    alpha_q2 = pyro.param("alpha_q2", torch.tensor(6.0),
                         constraint=constraints.positive)
    beta_q2 = pyro.param("beta_q2", torch.tensor(15.0),
                        constraint=constraints.positive)
    
    pyro.sample("prob0", dist.Beta(alpha_q0, beta_q0))
    pyro.sample("prob1", dist.Beta(alpha_q1, beta_q1))
    pyro.sample("prob2", dist.Beta(alpha_q2, beta_q2))

I'm using the same SVI training code as in the SVI Part I tutorial:

svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

n_steps = 5000
# do gradient steps
for step in range(n_steps):
    svi.step(data)

And the output:

[ step 0 loss: 572.641185760498 probabilities: [0.3333333333333333, 0.3333333333333333, 0.3333333333333333] ]
...
[ step 4900 loss: 634.6478080749512 probabilities: [0.328363784009581, 0.33784123633053903, 0.33379497965988] ]

inferred final probabilities:
prob0: 0.329 +- 0.090
prob1: 0.337 +- 0.090
prob2: 0.334 +- 0.090

Any thoughts, advice, or pointers to resources appreciated!

for one, the gradients are being killed when you construct a new tensor from the sampled values. why not do this in a vectorized way? e.g.:

probs = pyro.sample('probs', dist.Beta(6., 10.).expand(3))
with pyro.iarange("data_loop", len(data)):
        pyro.sample("obs", dist.Categorical(probs), obs=data)

and have something analogous in your guide


:open_mouth: I was unaware of that type of usage for expand. Wish I had known that sooner, thanks for the tip!

The gradient dying when the new tensor is constructed makes sense to me now in the sense that it “breaks” the link between the pyro.sample(...obs=data) statement and the original samples. Are there formal rules around this behavior that a beginner could look out for? i.e. what manipulations are and are not allowed on Pyro samples in order to preserve the gradient? I know I’ve at least done scalar operations on them in the past and maintained a working model.

Thanks for your help!

this is a general pytorch thing and not a pyro specific one. in pre-0.4 pytorch, there were Variables, which were tensors with additional metadata that tracked the gradients. if you extracted a value from the Variable, the gradient essentially wouldn't flow through that point. in pytorch 0.4, Variable and Tensor have been combined, but the same principle holds: if you extract a value from a Tensor and operate on it, the gradients won't be tracked in the autograd graph.
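a tiny illustration of the same point (the first line below is roughly what the torch.tensor([prob0, prob1, prob2]) construction was doing):

f = torch.tensor(0.6, requires_grad=True)

detached = torch.tensor([f.item(), 1 - f.item()])  # values copied out: no grad_fn, gradients stop here
kept = torch.stack([f, 1 - f])                     # pure tensor ops: autograd history is kept
(kept * torch.tensor([1.0, 2.0])).sum().backward()
print(detached.requires_grad, f.grad)              # False tensor(-1.)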


Gotcha. I guess it's not that intuitive that making a sample part of a larger tensor, like you called out in my initial question, counts as an operation that "extracts" the value from the tensor, but it's something I can get used to.

Running into a couple of things trying to move forward with your suggestion using the following code:

def model(data):
    alpha = torch.tensor(6.)
    beta = torch.tensor(10.)    
    
    probs = pyro.sample('probs', dist.Beta(alpha, beta).expand(3))

    with pyro.iarange("data_loop", len(data)):
        pyro.sample("obs", dist.Categorical(probs=probs), obs=data)

def guide(data):
    alpha = pyro.param('alpha', torch.tensor(6.))
    beta = pyro.param('beta', torch.tensor(15.))
  
    pyro.sample('probs', dist.Beta(alpha, beta).expand(3))

I’m getting the following error when attempting to run SVI:

ValueError: at site "probs", invalid log_prob shape
  Expected [], actual [3]
  Try one of the following fixes:
  - enclose the batched tensor in a with iarange(...): context
  - .independent(...) the distribution being sampled
  - .permute() data dimensions

When I try the first fix and move the probs sample into the iarange context, I get the same error, except with Expected [511], actual [3]. The data shape is [511], so I guess that's the conflict in the iarange context.

I tried reading up on the other suggested fixes and while I don’t fully understand them, it doesn’t seem like they would apply to my problem which is fully in 1D space since they seem to deal with dimensionality.

On a higher level, the usage of expand is great and I will be using it a lot moving forward but I’m having a hard time reconciling it with this problem. I am looking to create three distinct probability distributions for three different outcomes, and I don’t see how that would work with just one beta distribution. My original broken “solution” had six trainable hyperparameters for three distributions, which maps directly to the three separate probabilities, and it doesn’t seem like I could get the same result with one expanded distribution?

try tacking on independent() to your sample statement in iarange, i.e.:

pyro.sample("obs", dist.Categorical(probs=probs).independent(1), obs=data)

and yes, your “vectorized” guide is not equivalent to your original guide; you would need to wrap a tensor of the same size (in this case (3)) in pyro.param so that each param can be learned separately.


Got the following model and guide to train as expected:

def model(data):
  alpha = torch.tensor(6.)
  beta = torch.tensor(10.)

  probs = pyro.sample('probs', dist.Beta(alpha, beta).expand(3).independent(1))

  with pyro.iarange("data_loop", len(data)):
    pyro.sample("obs", dist.Categorical(probs=probs), obs=data)

def guide(data):
  alphas = pyro.param('alphas', torch.tensor((6., 6., 6.)), constraint=constraints.positive)
  betas = pyro.param('betas', torch.tensor((10., 10., 10.)), constraint=constraints.positive)

  pyro.sample('probs', dist.Beta(alphas, betas).independent(1))

I ended up applying independent to the Beta distributions instead of the Categorical distribution as originally suggested. When I tried applying it to the Categorical distribution, I got an unpleasant ValueError: Expected reinterpreted_batch_ndims <= len(sample_shape + base_dist.batch_shape), actual 1 vs 0.

My tensor-fu is pretty weak, so I'm still trying to reason out how the respective batch shapes and event shapes affect what's happening here. Will try to report back once I straighten that out. If there is a more intuitive explanation for this behavior (and why independent is a fix for the invalid log_prob shape in the first place) I'd love to hear it.
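So far, this is what I can see by poking at the shapes of the expanded Beta from the guide above:

d = dist.Beta(torch.tensor([6., 6., 6.]), torch.tensor([10., 10., 10.]))
print(d.batch_shape, d.event_shape)    # torch.Size([3]) torch.Size([])  -> log_prob of a draw has shape [3]

di = d.independent(1)                  # reinterpret the batch dimension as part of the event
print(di.batch_shape, di.event_shape)  # torch.Size([]) torch.Size([3])  -> log_prob is a single scalar

which seems to be why the "Expected [], actual [3]" error at the probs site goes away once independent(1) is applied.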

Thanks for all of your help @jpchen!