SVI converges in complex discrete model but infer_discrete results are nonsense

Charlie.m · January 7, 2022, 7:38pm

So when you add the dimension of 88 to it by doing the indexing then it just retains its position essentially (presumably for summing)?

martinjankowiak · January 7, 2022, 8:04pm

i suggest you focus your attention on the shapes of the log_prob factors accompanying each sample statement (as opposed to the shape of x_t). these are the fundamental ingredients used to construct the elbo. the values of the enumerated discrete latent variables x_t are effectively intermediate quantities. x_t should be manipulated so as to get the appropriate log_prob shapes.

broadcasting is used to aggregate log_prob factors and do variable elimination using the algorithm described here
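As a rough illustration of the point about log_prob shapes, here is a minimal sketch of how factors with different shapes broadcast together and how the enumerated variables are then summed out. It is not Pyro's implementation. The two-step chain, the parameter values, and the helper names (`logsumexp`, `normal_logpdf`) are assumptions made for the example, and NumPy is used only to keep it dependency-light (torch follows the same broadcasting rules).

```python
import numpy as np

def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def normal_logpdf(y, loc):
    # Log density of Normal(loc, 1) evaluated at y.
    return -0.5 * (y - loc) ** 2 - 0.5 * np.log(2 * np.pi)

# Hypothetical two-step chain: x_0 -> x_1, each emitting one observation.
init = np.log(np.array([0.6, 0.4]))                   # log p(x_0)
trans = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))    # log p(x_1 | x_0), indexed [x_0, x_1]
locs = np.array([-1.0, 1.0])
y = np.array([0.3, 1.2])

# Enumerated values: x_0 lives at dim -1, x_1 one dimension further left at dim -2.
x0 = np.arange(2)             # shape (2,)
x1 = np.arange(2)[:, None]    # shape (2, 1)

lp_x0 = init[x0]                          # shape (2,)
lp_y0 = normal_logpdf(y[0], locs[x0])     # shape (2,)
lp_x1 = trans[x0, x1]                     # shape (2, 2): dim -2 is x_1, dim -1 is x_0
lp_y1 = normal_logpdf(y[1], locs[x1])     # shape (2, 1)

# Broadcasting lines the factors up along the enumeration dims.
joint = lp_x0 + lp_y0 + lp_x1 + lp_y1     # shape (2, 2)

# Eliminate x_1 (dim -2), then x_0 (dim -1).
log_evidence = logsumexp(logsumexp(joint, axis=-2), axis=-1)
```

The values of `x0` and `x1` only matter through the shapes of the factors they produce. If `x1` were left with shape `(2,)`, `trans[x0, x1]` would have shape `(2,)` and silently pair `x_0` with `x_1` instead of enumerating all combinations.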

Charlie.m · January 7, 2022, 8:40pm

Thanks, so is that a no to my question just above? In that case, I’m not sure I understand the original reply.

Thanks for the paper, I have read the paper/theory behind it (the reason why I was drawn to using the framework in the first place). From my recollection the paper doesn’t describe exactly where this summing/elimination takes place (in a number of places I guess: obs sample sites and markov plates for two). The algorithm being mainly about being sure to reduce along all dimensions of independence everywhere on the graph.

I have looked through the source code (minipyro is helpful in cutting the time spent down) and observed the ELBO as the difference between the log_probs of the guide and model before. Not really sure how that means that the log_probs are more of an important object since you as a developer can do the calculation however you like really as long as you keep track of everything. Are the dimensioning requirements more obvious there?

I mean from my perspective it would be useful to have some rules of thumb to follow in the documentation regarding how the (enumeration) dimensions are supposed to match when indexing using tensors like simple values if the magic doesn’t just happen behind the scenes when using that syntax.

Charlie.m · January 7, 2022, 8:50pm

Or is the emphasis on log_probs to do with the inference side of enumeration a la Vindex?

martinjankowiak · January 10, 2022, 2:59pm

i’m sorry @Charlie.m but i don’t know how to give rules of thumb that would be more helpful than what can be gleaned from Inference with Discrete Latent Variables and hmm.py. (note that the tensor shapes tutorial is also very relevant).

i suggest you shorten the length of your time series and remove the use of markov. this plus format_shapes() should make it clear how the enumeration dimensions are allocated one at a time towards the left. when you turn on markov the enumeration dimensions will be eliminated greedily.

another strategy would be to take a model whose correctness we vouch for (e.g. model_1 in hmm.py) and mess up the indexing in various ways to see what can go wrong.

please note that if i didn’t appear to answer any of your particular questions that’s likely because i didn’t understand what you were asking.

Charlie.m · January 14, 2022, 10:24am

Hi @martinjankowiak, thanks for the tips. Eventually I simplified the model, which let me iterate faster and think more clearly about initial conditions closer to sensible parameter values for the model. That, plus decaying the learning rate, has got the model converging to the expected results. I haven’t done much probabilistic programming before, and convergence seems a bit harder to achieve with SVI than with other gradient descent algorithms in deep learning. What sort of value (or change?) for SVI indicates robust convergence as a rule of thumb? On a similar topic, what sort of size model would you be looking at to run MCMC in a reasonable time in Pyro?

martinjankowiak · January 14, 2022, 2:47pm

glad to hear it!

What sort of value (or change?) for SVI indicates robust convergence as a rule of thumb?

it’s hard to say in general but it’s often useful to normalize the elbo by the number of datapoints (or rather the number of observations times the dimension of each observation). in that normalization the elbo is expected to be O(1), so changes smaller than about 0.01 can be considered fairly small. but usually it’s best to also monitor the convergence of certain parameters etc.
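A minimal sketch of that normalization, assuming the losses have been collected into a plain Python list (for example from repeated `svi.step(...)` calls, which return the negative ELBO estimate). The helper names, the window size, and the use of a windowed mean are assumptions for illustration. Only the 0.01 scale comes from the reply above, and it is a rough guide rather than a guarantee.

```python
def normalized_losses(losses, num_obs, obs_dim=1):
    # Divide each loss by (number of observations) x (dimension of each observation).
    scale = float(num_obs * obs_dim)
    return [loss / scale for loss in losses]

def looks_converged(losses, num_obs, obs_dim=1, window=100, tol=0.01):
    # Compare the mean normalized loss over the last `window` steps with the
    # mean over the `window` steps before that. Hypothetical heuristic only.
    norm = normalized_losses(losses, num_obs, obs_dim)
    if len(norm) < 2 * window:
        return False
    recent = sum(norm[-window:]) / window
    previous = sum(norm[-2 * window:-window]) / window
    return abs(previous - recent) < tol
```

Averaging over a window smooths out the noise in the stochastic ELBO estimate. A check like this is best combined with monitoring the parameters themselves.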

On a similar topic, what sort of size model would you be looking at to run MCMC in a reasonable time in Pyro?

this is also hard to answer in general. pyro probably has the best mcmc support for models that only contain continuous latent variables (or where any discrete latent variables can be summed out relatively cheaply). in that case hmc/nuts may work well. hard to give a rule of thumb, but if the latent dimension is more than ~100 it’s unlikely to be particularly fast (and may not work reliably at all depending on details).