Does Pyro support Model Selection for Mixture Distributions using ARD?

This paper describes an application of Automatic Relevance Determination (ARD) to Gaussian mixture models using variational inference (CAVI). I’d like to implement a Pyro solution using SVI, but I’m troubled by an apparent inconsistency with the GMM tutorial. In the tutorial, all optimized parameters appear to be hyperparameters of the guide, while all Pyro ‘sites’ in the model are of the sample() variety.

In the paper, by contrast, the parameters being optimized include the mixture weights, which of course appear in the likelihood function itself, not the guide. From equation (12) of the paper, the guide is a variational posterior only over the component means, variances and class membership vectors, not the weights.

My question is whether Pyro’s SVI with an ELBO loss will work here: is it required that all optimized parameters be in the guide, or can some of them be in the model? (The reason I’m concerned is that the ELBO API documentation indicates its implementation follows this source paper, which assumes the gradient of the likelihood with respect to the parameters vanishes identically.)

You can have parameters in the model that get optimized. I have an implementation of ARD for factor models here (starting on line 243), in case you’re interested. I have a horseshoe prior on the factor loadings matrix, and optimize its parameters along with the variational parameters in the guide (I also have other parameters in the model that control other priors, a la empirical Bayes). Hope that helps.


thank you!

A follow-on question relating to ARD for mixtures. I’m concerned about the ability of ARD to discern model ambiguity in the posterior distribution over mixture weights, such as multi-modality. I believe ARD is a maximum likelihood estimator; all MLEs are vulnerable to similar criticism. It seems one way to address this would be to use a Dirichlet prior over the mixture weights. (The optimization now is not over the weights themselves, but the Dirichlet prior parameter vector.)

I have never attempted this, nor have I seen a paper describing this approach, so I’m curious how you address this issue, as someone with obviously more experience in this area than myself. Thanks again.

That should work, I think. It seems to work with a shrinkage prior on the Dirichlet parameter vector: see the notebook. It generates data with 3 components, of which one is shrunk to zero, and fits a mixture with 10 components, of which all but 2 get shrunk, and the fit is fine.