I have a noob question. I am going over Introduction to Pyro — Pyro Tutorials 1.8.4 documentation and I see that the hyperparameters in the model and the variational parameters in the guide are different in the given example. For example, a is defined in custom_guide as
Furthermore, I see in the same example that the distributions for sigma are different in the model and the guide (it is Uniform in the model and Normal in the guide).
I have two questions following this:
Is there any particular reason why distribution parameters/distributions are different in the guide and model?
When should I consider choosing the distribution parameters/distributions in the model and guide differently?
Hi @dilara, from a Bayesian perspective, model distributions are priors and guide distributions are approximate posterior distributions. Prior distributions express your belief before seeing any data, and posterior distributions are learned to capture both prior information and data. Symbolically, we write p(z) for a prior, p(x|z) for a likelihood, p(z|x) for the true posterior, and q(z|x) or simply q(z) for the variational posterior, and we try to make q(z) close to p(z|x) by minimizing KL(q(z) || p(z|x)).
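To spell out why fitting the guide amounts to minimizing that KL divergence, here is the standard decomposition of the log evidence, written with the same symbols as above. The name ELBO is not used in the post itself; it is the quantity that stochastic variational inference maximizes.

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x \mid z) + \log p(z) - \log q(z)\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x)\right)
```

Since log p(x) does not depend on q, raising the ELBO lowers KL(q(z) || p(z|x)) by the same amount. The prior p(z) enters only as one term inside the expectation, which is why the guide has no reason to share the prior's family or parameters.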
Model distributions should be weakly informative and are often from standard families. I usually use Normal, diagonal Normal(...).to_event(1), LogNormal, or Laplace priors. Guide distributions should generally be more flexible and allow for correlations among latent variables, since the model's dependencies couple them in the posterior. Guide distributions are often transformed MultivariateNormal or LowRankMultivariateNormal distributions, or are parameterized by neural networks. I usually start with a simple AutoNormal or AutoLowRankMultivariateNormal guide.
Hi @fritzo, thank you so much for your answers! I have two follow-up questions.
You suggested that model distributions are often from standard families and I wonder how strict this is. For example, if I have a prior belief before seeing the data that the prior distribution is not from a standard family but is a more “sophisticated” distribution, it would be better to use the latter distribution, right?
I understand that guides should be flexible to open up the possibilities to capture the problem-specific structure of the true posterior. But if I strongly believe that a certain latent variable should be sampled from a less flexible distribution and use that distribution in the model, wouldn’t it make sense to use it in the guide as well?
No, in general posteriors are not simple, even when the prior is. In the large-data limit, the Bernstein-von Mises theorem says posteriors should be approximately multivariate normal with small variance, but the variables may still be correlated.
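A small worked example may make this concrete. The sketch below assumes a toy linear regression with known noise variance and independent Normal priors on the intercept and slope, a conjugate case where the posterior is exactly Gaussian and its covariance can be computed by hand. The function name `posterior_summary`, the input grid, and the variance values are all hypothetical choices for illustration.

```python
import math


def posterior_summary(n, noise_var=1.0, prior_var=100.0):
    """Exact Gaussian posterior over (intercept a, slope b) for y = a + b*x + noise.

    Assumes known noise variance and independent Normal(0, prior_var) priors.
    Returns (std of a, std of b, correlation between a and b).
    """
    xs = [1.0 + i / n for i in range(n)]  # hypothetical inputs in [1, 2)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # Posterior precision matrix = prior precision + X^T X / noise_var.
    p11 = n / noise_var + 1.0 / prior_var
    p12 = sx / noise_var
    p22 = sxx / noise_var + 1.0 / prior_var
    det = p11 * p22 - p12 * p12
    # Invert the 2x2 precision matrix to get the posterior covariance.
    var_a, var_b, cov_ab = p22 / det, p11 / det, -p12 / det
    corr = cov_ab / math.sqrt(var_a * var_b)
    return math.sqrt(var_a), math.sqrt(var_b), corr


for n in (10, 100, 1000):
    std_a, std_b, corr = posterior_summary(n)
    print(f"n={n:5d}  std(a)={std_a:.3f}  std(b)={std_b:.3f}  corr={corr:.3f}")
```

The priors here are independent, yet the posterior standard deviations shrink as n grows while the correlation between intercept and slope stays strongly negative. A guide that copies the independent prior family cannot represent that coupling, which is the reason for preferring a more flexible guide.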