Understanding the connection between model parameters and log p

Hello,

If I understand correctly, Pyro implements SVI optimization with the ELBO loss in the following way:

  1. Decreasing the estimate of the expected log q(z) over the variational parameters (phi), which are specified via pyro.param statements in the guide function - this term of the loss can be obtained from guide_tr.log_prob_sum()
  2. Increasing the estimate of the expected log p(x|z) plus log p(z), i.e. log p(x, z), over the model parameters (theta) - this is model_tr.log_prob_sum() (see the sketch below)
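
Concretely, my mental picture is something like the sketch below (simplified - I know Trace_ELBO does more under the hood; elbo_estimate is just a name I made up for illustration):

import pyro
from pyro import poutine

# rough single-sample ELBO estimate, only to show where the two
# log_prob_sum() terms come from (not what Trace_ELBO literally does)
def elbo_estimate(model, guide, *args, **kwargs):
    # run the guide to draw z ~ q(z) and record log q(z)
    guide_tr = poutine.trace(guide).get_trace(*args, **kwargs)
    # replay the model against the guide's samples to get log p(x, z)
    model_tr = poutine.trace(
        poutine.replay(model, trace=guide_tr)
    ).get_trace(*args, **kwargs)
    return model_tr.log_prob_sum() - guide_tr.log_prob_sum()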

My question is about the theta parameters. As expected, there are none in pyro.get_param_store() if nothing was explicitly declared in the model function with pyro.param. But even in this case model_tr.log_prob_sum() increases during training - which ‘degrees of freedom’ are used for that? What is actually being optimized?

Thanks a lot!

@yozhikoff I think that whether theta is available or not, phi will be optimized to minimize the ELBO loss, which is log q(z) - log p(x, z). Although minimizing the ELBO loss does not guarantee that log p(x, z) increases, the two are related (if a = b + c, increasing a will likely increase c). By optimizing phi, z moves to better areas, which in turn (likely) increases the joint probability p(x, z). However, p(x, z) will not necessarily increase at every step, for two reasons:

  • The ELBO loss is stochastic (at the very least, the term log p(x|z) is computed using a sample z drawn from q(z))
  • Increasing a might not increase c. For example, forgetting about x for simplicity and taking q(z) = N(1, phi) and p(z) = N(0, 1), optimization will move phi to the minimizer of KL(q, p). At this minimum phi_0, each SVI step still generates a random sample z from N(1, phi_0), and there is no guarantee that this sample will be closer to 0 at later steps (maximizing p(z) amounts to moving z toward 0). A toy sketch follows below.
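
Here is a toy sketch of that second point (made-up code, a guide with the mean fixed at 1 and only the scale phi learned):

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model():
    pyro.sample("z", dist.Normal(0., 1.))       # p(z) = N(0, 1)

def guide():
    phi = pyro.param("phi", torch.tensor(2.),
                     constraint=dist.constraints.positive)
    pyro.sample("z", dist.Normal(1., phi))      # q(z) = N(1, phi), mean fixed

svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
for step in range(1000):
    svi.step()

# phi converges to the scale that minimizes KL(q, p), but every step still draws
# a fresh z ~ q(z), so the model term log p(z) is noisy and need not increase.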

@fehiepsi Thanks!
Does that mean that, when we are not sure about the model priors, it is always reasonable to parametrize them using pyro.param?
I mean something like

# inside the model; the constraint keeps the Exponential rate positive during optimization
a = pyro.param('a', torch.tensor(5.), constraint=pyro.distributions.constraints.positive)
sample = pyro.sample('sample', pyro.distributions.Exponential(a))

in order to let VI optimize the model log-probability?

To be honest, I rarely use param in a Bayesian model (unless I am working with an nn.Module). For a hyperparameter like a, I would set a prior (hyperprior) for it and define a guide for a (e.g. we can use the simplest Delta guide for a, which amounts to learning a point estimate - essentially MAP, or maximum likelihood under a flat hyperprior).
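
For example, something along these lines (a sketch with made-up names, reusing your Exponential rate a):

import torch
import pyro
import pyro.distributions as dist

def model(data):
    # hyperprior on the rate instead of a pyro.param
    a = pyro.sample("a", dist.Gamma(2., 1.))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Exponential(a), obs=data)

def guide(data):
    # point-mass (Delta) guide for a -> learns a point estimate of the rate
    a_map = pyro.param("a_map", torch.tensor(5.),
                       constraint=dist.constraints.positive)
    pyro.sample("a", dist.Delta(a_map))

In recent Pyro versions, pyro.infer.autoguide.AutoDelta builds this kind of Delta guide automatically.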

Thanks a lot, now it’s clearer to me.
What confused me was that, in the absence of model parameters, the model term of the loss can only improve through the guide sampling better latent values; but it seems that this, in combination with optimizing phi, is usually enough for good convergence.