# Model selection in variational inference

Hi, thanks to the Pyro team. Pyro is awesome. This might be a research question, but I don't have a clear idea, so I would like to ask your opinions.

Q. What is a practical approach for model selection in variational inference?

Let’s say there is a problem of `data={train, test}` and `model={m1,m2}`.

For simplicity, `m1` is a neural network with a hidden layer of 10 units, and `m2` is exactly the same as `m1` except that its hidden layer has 20 units. All parameters get the same priors, and the network structure and activations are otherwise identical.

Here are my approaches. They are not theoretically 100% correct, but we could use these steps in practical engineering applications (please correct me if I am wrong).

[A] In a frequentist way,

1. Train `m1` with `train` and calculate a loss (e.g., L2 loss) on `test`.
2. Compare `loss_m1(test)` and `loss_m2(test)`, and choose whichever of `m1` or `m2` shows the lower loss.
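The two steps above can be sketched in a few lines. Everything here is a hypothetical placeholder: `preds_m1` / `preds_m2` stand in for the two trained models' predictions on `test`, and `y_test` for the held-out targets.

```python
# Minimal sketch of frequentist model selection by held-out loss.
# preds_m1 / preds_m2 are hypothetical placeholders for the trained
# models' predictions on `test`; y_test for the held-out targets.

def mse(preds, targets):
    """L2 (mean squared error) loss on held-out data."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

y_test   = [1.0, 2.0, 3.0]
preds_m1 = [1.1, 2.1, 2.8]   # e.g. from the 10-unit network
preds_m2 = [1.4, 1.5, 3.6]   # e.g. from the 20-unit network

loss_m1 = mse(preds_m1, y_test)
loss_m2 = mse(preds_m2, y_test)
chosen = "m1" if loss_m1 < loss_m2 else "m2"
```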

[B] In a full Bayesian way (assuming the same prior distributions),

1. Run MCMC for `m1` and `m2` with `train`.
2. Do posterior predictive checks with `m1` and `m2` on `test` or `train` data.
3. If both are acceptable, calculate metrics such as LOO, WAIC, or the Bayes factor (model evidence, e.g. via bridge sampling) on `test` or `train` data.
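As a concrete version of step 3, here is a pure-Python sketch of WAIC computed from a matrix of pointwise log-likelihoods (rows = posterior samples, columns = data points); the tiny `ll_m1` / `ll_m2` matrices are made up purely for illustration, and in practice you would get them from your MCMC samples (or use a library such as ArviZ).

```python
import math

def waic(log_lik):
    """WAIC from pointwise log-likelihoods.
    log_lik[s][i] = log p(y_i | theta_s) for posterior sample s, data point i."""
    S = len(log_lik)
    n = len(log_lik[0])
    lppd = 0.0      # log pointwise predictive density
    p_waic = 0.0    # effective number of parameters (variance form)
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        m = max(col)                     # log-sum-exp for numerical stability
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / S)
        mean = sum(col) / S
        p_waic += sum((c - mean) ** 2 for c in col) / (S - 1)
    return -2.0 * (lppd - p_waic)

# Tiny made-up example: 3 posterior samples, 2 data points, for two models.
ll_m1 = [[-1.0, -1.2], [-1.1, -1.0], [-0.9, -1.1]]
ll_m2 = [[-0.5, -2.5], [-2.0, -0.4], [-1.5, -1.8]]
# Lower WAIC is better; here m1's log-likelihoods are both higher on
# average and less variable across samples, so it wins.
```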

[C] In variational inference (with Pyro)

1. Train `m1` and `m2` with `train` via SVI.
2. Calculate the ELBO via `SVI.evaluate_loss()` on `test` data for `m1` and `m2`. (This gives a stochastic ELBO estimate, so we may repeat it several times or use a large value for the `num_particles` argument.)
3. Choose the one that shows the lower loss, i.e. the higher ELBO (since the loss is the negative ELBO, and the ELBO is a lower bound on the log evidence).
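To make steps 2–3 concrete without requiring Pyro, here is a pure-Python toy that illustrates the two facts they rely on: the Monte Carlo ELBO estimate is stochastic (hence repeating or using many particles, as `Trace_ELBO(num_particles=...)` does in Pyro), and it sits below the true log evidence by the KL gap. The model is a conjugate Gaussian chosen so the exact log evidence is known; the guide parameters are deliberately imperfect and made up.

```python
import math
import random

random.seed(0)

# Toy conjugate model where the exact log evidence is known:
#   z ~ Normal(0, 1),  x | z ~ Normal(z, 1),  observed x = 1.0
# Marginally x ~ Normal(0, 2), so log p(x) is available in closed form.
x = 1.0
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0

def log_normal(v, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (v - mu)**2 / (2 * sigma**2)

# A deliberately imperfect Gaussian guide q(z); the exact posterior is
# Normal(0.5, sqrt(0.5)), and these guide parameters are made up.
mu_q, sigma_q = 0.4, 0.6

def elbo_estimate(num_particles):
    """Monte Carlo ELBO: average of log p(x, z) - log q(z) over z ~ q."""
    total = 0.0
    for _ in range(num_particles):
        z = random.gauss(mu_q, sigma_q)
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, mu_q, sigma_q)
    return total / num_particles

# A single-particle estimate is noisy; averaging many particles gives a
# stable value that sits below log_evidence by the KL(q || posterior) gap.
elbo = elbo_estimate(20000)
```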

I am particularly interested in [C]. I think it is viable both theoretically and practically, because that is what the ELBO means in practical engineering applications.
Bayesian dropout in VI or spike-and-slab priors come to mind, but I would like to know whether [C] is acceptable in a simple situation such as a two-model comparison.

---

I think [C] can be reasonable depending on the context. An important point is that the ELBO is a lower bound to the evidence, but the variational gap will vary between different (model, guide) pairs. So, e.g., if you compare two Bayesian neural networks of different sizes, the variational gap will probably be larger for the larger network, and so ELBO comparisons may be misleading. In any case, ELBO comparisons are probably most suspect for models with very complex posteriors with many parameters, where ELBOs can be quite poor estimates of the evidence. If you care about prediction, it generally makes more sense to look at quantities like predictive log likelihood on held-out data. By and large the ELBO isn't a great way to compare models, although in some scenarios it makes sense.
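The predictive log likelihood suggested above can be sketched in pure Python. The posterior samples and test data below are hypothetical placeholders; the example assumes a simple Gaussian observation model with known noise scale, where each sample is a posterior draw of the predictive mean.

```python
import math

def held_out_log_pred(post_samples, x_test, sigma=1.0):
    """Mean held-out log predictive density:
    (1/n) * sum_i log( (1/S) * sum_s Normal(x_i | mu_s, sigma^2) )."""
    total = 0.0
    for xi in x_test:
        # log-mean-exp over posterior samples, via log-sum-exp for stability
        lls = [
            -0.5 * math.log(2 * math.pi * sigma**2)
            - (xi - mu) ** 2 / (2 * sigma**2)
            for mu in post_samples
        ]
        m = max(lls)
        total += m + math.log(sum(math.exp(l - m) for l in lls) / len(lls))
    return total / len(x_test)

# Hypothetical posterior draws of the predictive mean for two models:
samples_m1 = [0.9, 1.0, 1.1]
samples_m2 = [2.4, 2.5, 2.6]
x_test = [1.0, 1.2, 0.8]
# Prefer the model with the higher held-out predictive log likelihood;
# here m1's draws sit near the test data, so it wins.
```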