Model selection in variational inference

Hi, thanks to the Pyro team; it is awesome. This might be a research question, but I don’t have a clear idea, so I would like to ask for your opinions.

Q. What is a practical approach for model selection in variational inference?

Let’s say we have a problem with data = {train, test} and candidate models = {m1, m2}.

For simplicity, m1 is a neural network with hidden layer 10, and m2 is exactly the same as m1 except that it has hidden layer 20. The same priors are placed on all parameters, and the network structure and activations are otherwise identical.

Here are my approaches. They are not 100% theoretically correct, but I think these steps can be used in practical engineering applications (please correct me if I am wrong).

[A] In a frequentist way,

  1. Train m1 and m2 on train and calculate a loss (e.g., L2 norm) on test.
  2. Compare loss_m1(test) and loss_m2(test). Choose whichever of m1 and m2 shows the lower loss.

[B] In a fully Bayesian way (assuming the same prior distributions),

  1. Run MCMC for m1 and m2 on train.
  2. Run posterior predictive checks for m1 and m2 on test or train data.
  3. If both are acceptable, calculate metrics such as LOO, WAIC, or the Bayes factor (evidence, e.g. via bridge sampling) on test or train data.

[C] In variational inference (with Pyro),

  1. Train m1 and m2 on train via SVI.
  2. Calculate the loss (a stochastic estimate of the negative ELBO) via SVI.evaluate_loss() on test data for m1 and m2. Because the estimate is stochastic, we may repeat it several times and average, or use a large num_particles in the ELBO loss (e.g. Trace_ELBO).
  3. Choose the one that shows the lower loss (because the ELBO is a lower-bound approximation of the log evidence).

I am particularly interested in [C]. I think it is viable both theoretically and practically, because that is what the ELBO means, at least for practical engineering applications.
I am aware of alternatives such as Bayesian dropout in VI or spike-and-slab priors in the Bayesian setting, but I would like to know whether [C] is acceptable in a simple situation such as comparing two models.
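
For reference, the standard identity behind step 3 of [C] can be written as follows. The symbols are chosen here for illustration: x is the data, z the latent variables (here the network weights), m the model, and q the guide.

```latex
\log p(x \mid m)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x, z \mid m) - \log q(z)\right]}_{\mathrm{ELBO}(q;\, m)}
  + \underbrace{\mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, m)\right)}_{\text{variational gap} \;\ge\; 0}
```

Comparing ELBO(q1; m1) with ELBO(q2; m2) therefore compares the two log evidences only up to the difference of the two KL terms, which cannot be read off from the ELBO values alone.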

I think [C] can be reasonable depending on the context. An important point is that the ELBO is a lower bound on the log evidence, but the variational gap will vary between different (model, guide) pairs. So, e.g., if you compare two Bayesian neural networks with different numbers of hidden layers, the variational gap will probably be larger for the larger network, and so ELBO comparisons may be misleading. In any case, ELBO comparisons are probably most suspect for models with very complex posteriors and many parameters, where ELBOs can be quite poor estimates of the log evidence. If you care about prediction, it generally makes more sense to look at quantities like the predictive log likelihood on held-out data. By and large the ELBO isn’t a great way to compare models, although in some scenarios it makes sense.
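
As a rough sketch of the held-out predictive log likelihood mentioned above: given S draws z_s from a fitted guide (or posterior), the score for N test points is the sum over i of log((1/S) * sum_s p(y_i | z_s)), and higher is better. The function names, the Gaussian likelihood, and the synthetic draws below are illustrative assumptions, not part of any library.

```python
import numpy as np


def heldout_log_predictive(log_lik):
    """Sum over test points of log((1/S) * sum_s p(y_i | z_s)).

    log_lik: array of shape (S, N) holding log p(y_i | z_s) for S draws z_s
             from the posterior approximation and N held-out points.
    """
    s = log_lik.shape[0]
    a_max = log_lik.max(axis=0)  # subtract the column max for a stable log-sum-exp
    log_mean = a_max + np.log(np.exp(log_lik - a_max).sum(axis=0)) - np.log(s)
    return float(log_mean.sum())


def gaussian_log_lik(y, mean, scale):
    """Elementwise log N(y | mean, scale^2); (S, 1) draws broadcast against (N,) data."""
    return -0.5 * np.log(2 * np.pi * scale**2) - 0.5 * ((y - mean) / scale) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_test = rng.normal(1.0, 1.0, size=200)  # synthetic held-out data
    # Stand-ins for draws from two fitted guides (purely illustrative numbers).
    draws_m1 = rng.normal(1.0, 0.1, size=(500, 1))
    draws_m2 = rng.normal(0.0, 0.1, size=(500, 1))
    for name, draws in [("m1", draws_m1), ("m2", draws_m2)]:
        score = heldout_log_predictive(gaussian_log_lik(y_test, draws, 1.0))
        print(name, round(score, 1))
```

With Pyro, the (S, N) matrix would come from evaluating the likelihood of each test point under weights sampled from the trained guide; how to extract it depends on the model and is left out of this sketch.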