How to implement a tempered model in (Num)Pyro

Hi

I am wondering how I can obtain a tempered posterior using Pyro/NumPyro.
For example: the original model is p(w, z) = p(w | z) p(z), and I would like to change the likelihood term to p(w | z)^T.

This tempering operation is similar to the KL-annealing trick used in VAEs and the cold-posterior trick used in BNN training.

Thanks

if the temperature is a fixed parameter that you set then you can simply use poutine.scale:

with pyro.poutine.scale(scale=T):
    pyro.sample("w", ...)

this multiplies the enclosed log_probs by T
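
for example, a minimal end-to-end sketch of a tempered-likelihood model might look like this (the specific prior, likelihood, and names below are just for illustration):

import pyro
import pyro.distributions as dist

T = 0.5  # fixed temperature

def model(data):
    # prior p(z) is left untempered
    z = pyro.sample("z", dist.Normal(0., 1.))
    # likelihood p(w | z)^T: scale multiplies this site's log_prob by T
    with pyro.poutine.scale(scale=T):
        with pyro.plate("data", len(data)):
            pyro.sample("w", dist.Normal(z, 1.), obs=data)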

What if I want to treat the temperature as a local, data-dependent latent variable (e.g. [1411.1810] Variational Tempering)? Can poutine.scale still help, or do I need to use other methods?

Thanks

i’m not sure, it depends. certainly scale can be local, but whether or not it can be treated as e.g. a latent will depend. can you be (much) more specific as to what you want to do? e.g. point to a specific objective function and corresponding inference algorithm?

For example, a model like this, in which T_i stands for the temperature

And now I would like to perform MCMC on this model.
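
For reference, one way to sketch such a model in NumPyro (with hypothetical priors and names, and ignoring the temperature-dependent normalizing constant of the tempered likelihood) is:

import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(data):
    mu = numpyro.sample("mu", dist.Normal(0., 10.))  # global latent
    with numpyro.plate("data", len(data)):
        # local latent temperature T_i in (0, 1), hypothetical Beta prior
        t = numpyro.sample("t", dist.Beta(2., 2.))
        # scale accepts an array, so each likelihood term is scaled by its own t_i
        with numpyro.handlers.scale(scale=t):
            numpyro.sample("obs", dist.Normal(mu, 1.), obs=data)

data = jnp.array([0.3, -1.2, 0.8])
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(jax.random.PRNGKey(0), data)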

Problem solved: I implemented a wrapper class around a distribution, which returns a scaled version of the log-likelihood:

import numpyro.distributions as dist

class tempered_XXX(dist.XXX):  # XXX stands for any distribution class
    def __init__(self, T, *args, **kwargs):
        self.T = T
        super().__init__(*args, **kwargs)

    def log_prob(self, value):
        # log p(value)^(1/T) = (1/T) * log p(value)
        return 1 / self.T * super().log_prob(value)
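
As a concrete instance of this pattern (a hypothetical sketch with dist.Normal standing in for XXX):

import numpyro
import numpyro.distributions as dist

class TemperedNormal(dist.Normal):
    def __init__(self, T, *args, **kwargs):
        self.T = T
        super().__init__(*args, **kwargs)

    def log_prob(self, value):
        return 1 / self.T * super().log_prob(value)

def model(data, T=2.0):
    mu = numpyro.sample("mu", dist.Normal(0., 10.))
    with numpyro.plate("data", len(data)):
        # tempered likelihood; the prior on mu stays untempered
        numpyro.sample("obs", TemperedNormal(T, mu, 1.), obs=data)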

> return 1 / self.T * super().log_prob(value)

This is what scale does. Probably the documentation entry scale (float) is not clear and makes you think we need a scalar there. We should change it to scale (float or ndarray) and mention that its shape should be broadcastable to the batch shape (i.e. the log_prob shape) of each site under its context. If this does what you wanted, could you open a PR to enhance the docs? :slight_smile:
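
For example, an array scale broadcastable to the plate's batch shape might look like this (the per-observation weights here are illustrative):

import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.handlers import scale

def model(data):
    mu = numpyro.sample("mu", dist.Normal(0., 1.))
    # one scale factor per observation, broadcastable to the batch shape
    temps = jnp.linspace(0.1, 1.0, data.shape[0])
    with numpyro.plate("data", data.shape[0]), scale(scale=temps):
        numpyro.sample("obs", dist.Normal(mu, 1.), obs=data)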

Sure, I can do that.

Hi

I am sorry to restart this conversation. I would appreciate it if you could clarify two points for me.

  1. Is there any difference between the two realizations:

with numpyro.plate('data', len(data)), scale(scale=T):
    numpyro.sample('obs', dist.Normal(locs, sigma), obs=data)

and

nuts_kernel = NUTS(scale(model, scale=T))

And if I use both, will the log-likelihood be multiplied by T^2 or only by T?

  2. The scale wrapper multiplies the log-likelihood by scale (not by 1/scale), am I right?

Thank you!

Hi @ChernovAndrey, the first scale only scales the likelihood, while the second one scales all sample sites in the model. If you use both, the likelihood will be scaled by T^2. You are right about the second question.
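
A quick way to check this (a hypothetical sketch using numpyro.infer.util.log_density):

import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.handlers import scale
from numpyro.infer.util import log_density

T = 0.5
data = jnp.array([0.3, -1.2, 0.8])

def model(data):
    mu = numpyro.sample("mu", dist.Normal(0., 1.))
    with numpyro.plate("data", len(data)), scale(scale=T):
        numpyro.sample("obs", dist.Normal(mu, 1.), obs=data)

# the inner handler scales only the likelihood by T; wrapping the whole
# model scales every site again, so the likelihood picks up T * T = T^2
# (and the prior a single factor of T)
ld_inner, _ = log_density(model, (data,), {}, {"mu": jnp.array(0.)})
ld_both, _ = log_density(scale(model, scale=T), (data,), {}, {"mu": jnp.array(0.)})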

So the second one scales the prior distributions too, am I right?

Yes, you are right.