Why are samples from the Categorical and Bernoulli distributions missing after MCMC?

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def model_test():
  pyro.sample('Categorical', dist.Categorical(probs=torch.tensor([0.5, 0.5])))
  pyro.sample('Uniform', dist.Uniform(0, 1))
  pyro.sample('Bernoulli', dist.Bernoulli(0.1))
  pyro.sample('Normal', dist.Normal(1., 1.))

  with pyro.plate('data', 3):
    val = pyro.sample('obs', dist.Normal(1., 1.))

  return val

conditioned_model = pyro.condition(model_test, data={
    "obs": torch.tensor([1.,0.,1.])
    }) 

pyro.clear_param_store() 

kernel = NUTS(conditioned_model)
posterior = MCMC(kernel, num_samples=2, warmup_steps=1)
posterior.run();

posterior.get_samples() returns {'Normal': tensor([0.7652, 0.2905]), 'Uniform': tensor([0.4800, 0.6557])}. Why are there no samples from Categorical and Bernoulli? How can I get them?

Hi @odats, I believe the discrete random variables were marginalized out during inference (via enumeration), so they won’t appear in the trace. However I believe you can use Predictive to get the remaining sites (or all sites by setting the return_sites kwarg).
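For example, something along these lines should work (an untested sketch reusing the model_test and posterior objects above; by default Predictive returns only the sites missing from posterior_samples, and return_sites lets you request all of them explicitly):

from pyro.infer import Predictive

predictive = Predictive(
    model_test,
    posterior_samples=posterior.get_samples(),
    return_sites=("Categorical", "Bernoulli", "Uniform", "Normal", "obs"),
)
samples = predictive()  # dict mapping site names to sampled tensors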


Thank you @fritzo, you are right, the discrete random variables were marginalized. Predictive helped to get the missing sites:

predictive = pyro.infer.predictive.Predictive(model_test, posterior.get_samples())
predictions = predictive.forward()

predictions now contains: {'Bernoulli': tensor([[0.], [0.]]), 'Categorical': tensor([[1], [1]]), 'obs': ...

May I ask you two more questions:

  1. In the documentation I found TracePredictive (deprecated). Should I sample from Predictive (forward) as an alternative to EmpiricalMarginal?

  2. Let's imagine I have obtained the posterior P(Z|X) and I get a new observation x_11. How do I use Predictive to get the latent variables Z? There should be something better than Predictive(model_test, posterior.get_samples() + my_new_observation).

  1. Yes, you should sample from Predictive using .__call__() (not .forward()):

     predictive = Predictive(...)
     samples = predictive(...)

  2. That depends on your inference method. If you are using MCMC or standard variational inference (e.g. autoguides), then you will need to re-train on the new data (a rough sketch follows below). If you have completely amortized inference you can re-run on the extended data (but in that case I'm not sure Predictive is helpful).
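For the re-training case, a rough sketch might look like the following (the model is purely illustrative, and old_data / new_data are placeholder tensors of 0./1. observations, not objects from your code):

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def coin_model(data):
    # illustrative model: one latent, plate sized by the data
    f = pyro.sample("latent_fairness", dist.Beta(1., 1.))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Bernoulli(f), obs=data)

# re-train from scratch on the old observations extended with the new one
all_data = torch.cat([old_data, new_data])
mcmc = MCMC(NUTS(coin_model), num_samples=500, warmup_steps=200)
mcmc.run(all_data)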
posterior.get_samples() contains samples from the posterior (the latent variables Z). How should I use these values as a new prior in my model to re-train on the new data? I want to use MCMC to estimate a new posterior P(Z_new|X_new_observations).

there’s no easy way to estimate a new posterior that somehow incorporates information from the old MCMC samples. your best bet in general is to “warm start” the new MCMC chain with a sample from the old chain. you would still need to collect 100s (or more depending on your problem) of samples in the new chain to get good results (at least if you’re keen on closely approximating the true posterior).

Can you please provide some basic code or a link to the documentation? I have found only initial_params. I suppose it is the starting point for MCMC, and the best option is to set it to the mean?

initial_params = {
    'latent_fairness': posterior.get_samples()['latent_fairness'].mean()
}

This only helps MCMC get started; it does not replace the prior with the learned posterior.

I am more interested in some sort of Bayesian online learning, where I can forget about old data and continuously update my prior with new observations. The closest idea is a conjugate prior with sufficient statistics, where the posterior contains all the needed information: for example Beta(a_post, b_post), which I can then use with new observations.
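For example, the conjugate update I have in mind looks like this (a toy sketch with made-up batches of 0/1 observations):

import torch

batches = [torch.tensor([1., 0., 1.]), torch.tensor([1., 1., 0., 1.])]  # streaming data (made up)

alpha, beta = 1.0, 1.0                  # Beta(1, 1) prior
for batch in batches:
    alpha += batch.sum().item()         # count of successes
    beta += (1. - batch).sum().item()   # count of failures
    # Beta(alpha, beta) is the exact posterior and serves as the prior for the next batch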

@martinjankowiak @fritzo
please find a simple prototype (code):

  1. Use MCMC to estimate P(Z|X).
  2. Use the samples from MCMC to replace the true posterior distribution with a simpler parametric distribution, in my case a Beta.
  3. Use the posterior Beta as my new prior:

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

PRIOR_ALPHA = torch.tensor(1.0)
PRIOR_BETA = torch.tensor(1.0)

def model(data):
    # sample f from the Beta prior, then condition on the Bernoulli observations
    f = pyro.sample("latent_fairness", dist.Beta(PRIOR_ALPHA, PRIOR_BETA))
    for i in range(len(data)):
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])

# train 1: estimate P(Z|X) on the first batch
# (data is a 1-d tensor of 0./1. observations, defined elsewhere)
pyro.clear_param_store()
kernel = NUTS(model, jit_compile=True, ignore_jit_warnings=True, max_tree_depth=3)
posterior = MCMC(kernel, num_samples=2000, warmup_steps=500)
posterior.run(data)

def estimate_beta(obs):
    # fit Beta(alpha0, beta0) to the posterior samples by stochastic gradient
    # ascent on the log likelihood (i.e. maximum likelihood estimation)
    alpha0 = torch.tensor(1.01, requires_grad=True)
    beta0 = torch.tensor(1.01, requires_grad=True)

    for i in range(10):
        for o in obs:
            prior = torch.distributions.beta.Beta(alpha0, beta0)
            prob = prior.log_prob(o)
            prob.backward()
            alpha0.data += 0.01 * alpha0.grad
            beta0.data += 0.01 * beta0.grad

            alpha0.grad.data.zero_()
            beta0.grad.data.zero_()

    return alpha0.detach(), beta0.detach()

# update prior := posterior
PRIOR_ALPHA, PRIOR_BETA = estimate_beta(posterior.get_samples()['latent_fairness'])

# train 2: re-run MCMC on the next batch, starting from the updated prior
# (NEW_DATA is the next batch of 0./1. observations)
pyro.clear_param_store()
kernel = NUTS(model, jit_compile=True, ignore_jit_warnings=True, max_tree_depth=3)
posterior = MCMC(kernel, num_samples=2000, warmup_steps=500)
posterior.run(NEW_DATA)

yes, that looks right. mean, median, final sample all might be reasonable choices—depends on your goals and the problem.

please note that MCMC is "non-parametric" in that it represents the posterior with a bag of samples. in general, online learning algorithms have to make parametric assumptions if they want to avoid revisiting old data. the conjugate case you shared is very special; most models are not conjugate. so in most cases you need to make parametric assumptions. one option is variational inference. if you want to stick to MCMC you need to somehow convert your old samples into a density, for example by fitting a mixture distribution to them and using that density as your new prior. whether something like that is likely to work well will depend on the details of your problem, e.g. the dimensionality of the latent space.
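for a one-dimensional latent like the coin-fairness example, that might look roughly like the sketch below (the sklearn mixture fit and the logit transform to unconstrained space are just illustrative choices, not the only option):

import torch
import pyro
import pyro.distributions as dist
from sklearn.mixture import GaussianMixture

# fit a 2-component gaussian mixture to the old samples in unconstrained (logit) space
z = torch.logit(posterior.get_samples()["latent_fairness"])
gmm = GaussianMixture(n_components=2).fit(z.reshape(-1, 1).numpy())

mix_prior = dist.MixtureSameFamily(
    dist.Categorical(torch.as_tensor(gmm.weights_, dtype=torch.float)),
    dist.Normal(
        torch.as_tensor(gmm.means_[:, 0], dtype=torch.float),
        torch.as_tensor(gmm.covariances_[:, 0, 0], dtype=torch.float).sqrt(),
    ),
)

def new_model(data):
    z_new = pyro.sample("latent_logit", mix_prior)  # mixture density as the new prior
    f = torch.sigmoid(z_new)                        # map back to (0, 1)
    for i in range(len(data)):
        pyro.sample("obs_{}".format(i), dist.Bernoulli(f), obs=data[i])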


I did that in my last comment. Is there a Pyro approach that does it automatically? I mean, can Pyro combine samples from MCMC with the prior distributions?

no, there’s no mechanism to do it automatically, since the space of choices is very large. what you’re doing above is using the same parametric family of distributions as in the model (the beta distribution) and choosing a member of that family by maximum likelihood estimation. you could certainly encode that loop in pyro; you can use SVI to do MLE. note that in the general case the prior parametric family is not going to be sufficiently flexible to encode the information in the posterior samples.
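as a sketch, that MLE loop could be encoded with SVI roughly as follows (parameter names are illustrative; with an empty guide and all sites observed, maximizing the ELBO is just maximum likelihood):

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def beta_mle_model(samples):
    # learnable Beta parameters, constrained to be positive
    alpha = pyro.param("alpha_mle", torch.tensor(1.0), constraint=dist.constraints.positive)
    beta = pyro.param("beta_mle", torch.tensor(1.0), constraint=dist.constraints.positive)
    with pyro.plate("samples", len(samples)):
        pyro.sample("z", dist.Beta(alpha, beta), obs=samples)

def empty_guide(samples):
    pass  # no latent sites left, so there is nothing to approximate

pyro.clear_param_store()
svi = SVI(beta_mle_model, empty_guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
z_old = posterior.get_samples()["latent_fairness"]
for _ in range(1000):
    svi.step(z_old)

new_alpha = pyro.param("alpha_mle").detach()
new_beta = pyro.param("beta_mle").detach()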


The point about using SVI to do MLE is very helpful.

What happens when we switch to SVI? Don’t we have the same problem of encoding the posterior with a parametric family?

i’m not sure i understand your question. yes, it’s the case that variational inference is in general biased and cannot recover the true posterior. this is the consequence of making a parametric approximation.
