Question about the SVI documentation

In Part I of the SVI tutorial, in the model learning section, the way the documentation is worded makes it sound like, out of the model parameters, the observations, and the latent variables, we are aiming to find the maximum likelihood value of the model parameters and the distribution of the latent variables. Isn’t it the case that in variational inference we learn the distribution of both the model parameters and the latent variables? And at the end of the section, the documentation states

Variational inference offers a scheme for finding θmax and computing an approximation to the posterior pθmax(z|x).

but shouldn’t we be optimising the variational parameters phi, which are introduced in the guide? Am I missing something? This reads more like the EM algorithm than like VI.

Also, in the guide section the docs state

The basic idea is that we introduce a parameterized distribution qϕ(z), where ϕ are known as the variational parameters.

This makes it feel like the variational parameters are only used to parametrise the latent variables, but not the model parameters. Shouldn’t it be both? For example, in the intro scale example there are no latent variables, and the variational parameters are used to parametrise the posterior distribution of the mean and the standard deviation, which are model parameters.

Basically, I don’t understand why the model parameters are treated like the variational parameters, i.e. optimised to point values, rather than like the latent variables, for which a distribution is learned.

Hi @olivierma, if you want to make a model parameter a latent variable, you simply set a prior for it. Learning a model parameter directly is quite similar to doing MAP inference with a very wide prior. In many situations (e.g. in neural networks), if we only want to learn the optimal value of a parameter, there is no need to set a prior, add a Delta guide, etc. for it. See the sketch below for the two options side by side.
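
Here is a minimal sketch (not from the tutorial; the names `loc`, `loc_map` and the toy data are made up for illustration) of what I mean: either register the quantity with `pyro.param` and let SVI learn a point value, or give it a wide prior in the model and a Delta guide, which is essentially MAP.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

data = torch.randn(100) + 3.0  # toy observations

# Option 1: treat the location as an ordinary model parameter.
# pyro.param registers it for optimization; SVI learns a point value (MLE-style).
def model_mle(data):
    loc = pyro.param("loc", torch.tensor(0.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.0), obs=data)

def guide_mle(data):
    pass  # no latent variables, so the guide is empty

# Option 2: give the same quantity a very wide prior in the model and a
# Delta guide -- now it is a latent variable and SVI does MAP-style inference.
def model_map(data):
    loc = pyro.sample("loc", dist.Normal(0.0, 100.0))  # very wide prior
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.0), obs=data)

def guide_map(data):
    loc_map = pyro.param("loc_map", torch.tensor(0.0))
    pyro.sample("loc", dist.Delta(loc_map))

svi = SVI(model_map, guide_map, Adam({"lr": 0.01}), loss=Trace_ELBO())
for _ in range(1000):
    svi.step(data)
```

In practice the two variants learn essentially the same value for the location, which is why the tutorial treats plain parameter learning and MAP-with-a-wide-prior as close cousins.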


You could take a look at the VAE example, which might clarify this. Note that the parameters in the decoder are registered for optimization. As @fehiepsi mentioned, the tutorial talks about the general case where you might have learnable parameters in your model, as in the VAE example, i.e. parameters for which you just want to do maximum likelihood inference. This is akin to placing a wide uniform prior in the model and a Delta distribution in your guide. You are of course free to place priors on these parameters and be fully Bayesian (though whether that would be computationally feasible is another question), in which case your model will only have latent variables; a sketch of what that looks like is below.
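
For concreteness, here is a rough sketch (not from the VAE tutorial; names and toy data are made up) of the fully Bayesian variant of @fehiepsi’s example: the former "parameter" gets a prior in the model, and the guide learns a non-degenerate Normal approximation whose loc/scale are the variational parameters ϕ.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
from torch.distributions import constraints

data = torch.randn(100) + 3.0  # toy observations

def model(data):
    # the quantity we previously optimised directly now has a prior
    loc = pyro.sample("loc", dist.Normal(0.0, 10.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.0), obs=data)

def guide(data):
    # q_phi(loc): the variational parameters phi are loc_q and scale_q
    loc_q = pyro.param("loc_q", torch.tensor(0.0))
    scale_q = pyro.param("scale_q", torch.tensor(1.0),
                         constraint=constraints.positive)
    pyro.sample("loc", dist.Normal(loc_q, scale_q))

svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for _ in range(1000):
    svi.step(data)
```

Whether you go this route for every weight in a large neural network is mostly a question of computational budget; for decoder weights in the VAE example, point estimation via `pyro.param`/`pyro.module` is the pragmatic choice.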


Thanks to both of you, @fehiepsi and @neerajprad! I’m only familiar with “standard” VI (coming from an MCMC background), so the wording sounded a little strange to me, but thinking about it, since Pyro is built with very deep models with very large numbers of parameters in mind, it makes perfect sense. I’m still early in my journey through the documentation, so I’ll try out these techniques later :slight_smile: