Strange convergence behavior when inferring parameters of a categorical distribution in a model containing independent observable

Dear Pyro community,

I was playing around with pyro trying to implement inference over a categorical variable when I noticed strange convergence behavior in the posterior estimates (e.g. ELBO increases over time instead of decreasing). I am providing here a simplified version of the code with the hope that someone can tell me what is going on under the hood, or if I am doing something wrong.

So lets look first at the version of the code which behaves as expected:

n1 = 2
n2 = 3
n = 100

w = ones(n1,n2)
w[n1//2:,-1] = 0.
w /= w.sum(dim=-1)[:,None]

def model():
    sample('depth', dist.Categorical(w).independent(1))
 
def guide():
    probs = param('probs', ones(n1,n2)/n2, constraint=constraints.simplex)
    sample('depth', dist.Categorical(probs=probs).independent(1))

I use the following code to do the inference and estimat values of ‘probs’ variable

clear_param_store()

num_particles = 10
n_iterations = 2000

svi = SVI(model=model,
          guide=guide,
          optim=Adam({'lr':0.1}),
          loss=Trace_ELBO(num_particles=num_particles))

losses = []
with tqdm(total = n_iterations, file = sys.stdout) as pbar:
    for step in range(n_iterations):
    losses.append(svi.step())
            pbar.set_description("Mean ELBO %6.2f" % torch.Tensor(losses[-20:]).mean())
            pbar.update(1)
    
results = {}
for name, value in get_param_store().named_parameters():
    results[name] = value

In this case I get near zero loss values, and almost perfect estimate of the ‘probs’ values

print(softmax(results['probs'], dim=-1))
tensor([[ 0.3333,  0.3333,  0.3333],
        [ 0.5000,  0.5000,  0.0001]])

However, a slight change in the model

def model():
    sample('depth', dist.Categorical(w).independent(1))
    with iarange('data', size = n):
        sample('obs', dist.Categorical(probs = ones(n,2)/2), obs = ones(n, dtype=torch.int64))

leads to very noisy estimates of the loss (it looks like it does not converge at all) and
very bad estimates of the ‘probs’ values

print(softmax(results['probs'], dim=-1))
tensor([[ 0.0193,  0.9756,  0.0051],
        [ 0.9391,  0.0605,  0.0003]])

What is confusing for me is that adding an observable which is independent from the variable which I am interested in completely messes up the inference process. Looking carefully how loss changes over time I notice that it slightly increases with each new step in the inference.

I would appreciate if someone could explain what causes this interaction between independent variables (in my understanding ‘obs’ just adds a constant to the computation of the loss) and what could be the source of increased noise in computing the loss.

Am I making here some trivial mistake or is the silence sign that no one has an idea what could be going on?

Hi @pyroman, the sample('obs', ...) statement in your second model is not doing anything, since its probs are fixed. Did you instead intend to

def model():
   probs = pyro.sample('probs', dist.Dirichlet(...))
   with pyro.iarange('data', size=n):
        sample('obs', dist.Categorical(probs), obs=...)

and learn probs in the guide?

Hi @fritzo, thanks for the reply. This was intentional. I am trying to debug a more complicated model, so here I just showed a simplified version of that. I was getting very nosy posterior estimates and tried to figure out where this is coming from. I identified the problem in that an independent observable influences the inference of the variable I am interested in. The ‘obs’ variable should just behave as a constant in the ELBO, hence I do not understand why it messes up the posterior estimate once it is included into the model.

I tried reducing the learning rate (e.g. lr: 0.001) but the process just becomes slower, I still see increase in the ELBO with each iteration. I also tried using more particles (e.g. n_particles = 100) this just impacts the variability of the ELBO estimate, and the trend away from the minimum is still obvious. Finally, i tried initiating the ‘probs’ in the guide already at the expected solution, and I still find increasing ELBO estimates with each iteration.