Bayesian model selection

Hi everyone,

I’ve been looking into horseshoe regression for the purpose of model selection.
http://num.pyro.ai/en/stable/examples/horseshoe_regression.html

In my case, I have several equations, and for each of them I would like to independently “select”/shrink towards the best model.
For instance, one of the equations could be (in the linear case):
Blood_pressure = a1 * BMI + a2 * Physical_inactivity + a3 * Smoking

In this case I’m quite confident about the signs that a1-a3 should have, so I might use an informative prior, like an inverse gamma prior, whose support is restricted to positive values.

But I also want to include possible second-order interaction terms, so the full model would be:
Blood_pressure = a1 * BMI + a2 * Physical_inactivity + a3 * Smoking + a4 * BMI * Physical_inactivity + a5 * BMI * Smoking + a6 * Physical_inactivity * Smoking.

For these interaction terms (a4-a6), I would like to use horseshoe priors that are centered at zero.
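To make the structure concrete, here is a rough NumPyro sketch of what I have in mind (a minimal sketch only: the function name, variable names, and hyperparameter values are placeholders I picked for illustration, not carefully chosen settings):

```python
# Sketch: informative inverse gamma priors (positive support) on the main effects a1-a3,
# horseshoe priors centred at zero on the interaction terms a4-a6.
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist


def blood_pressure_model(bmi, inact, smoke, bp=None):
    # informative, strictly positive priors for the main effects (placeholder hyperparameters)
    a_main = numpyro.sample("a_main", dist.InverseGamma(3.0 * jnp.ones(3), 2.0 * jnp.ones(3)))

    # horseshoe priors for the interactions: global scale tau, local scales lam
    tau = numpyro.sample("tau", dist.HalfCauchy(1.0))
    lam = numpyro.sample("lam", dist.HalfCauchy(jnp.ones(3)))
    a_int = numpyro.sample("a_int", dist.Normal(jnp.zeros(3), tau * lam))

    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mean = (a_main[0] * bmi + a_main[1] * inact + a_main[2] * smoke
            + a_int[0] * bmi * inact + a_int[1] * bmi * smoke + a_int[2] * inact * smoke)
    numpyro.sample("obs", dist.Normal(mean, sigma), obs=bp)
```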

First: Is this a good way of conducting model selection and selecting the interaction terms?

Second: Should I simply use horseshoe priors for all parameters? It would be nice if I can somehow encode my confidence in positive values.

Third: I would like to combine these equations later on into a system of equations. In that case, it would make sense to drop any terms whose parameters have been heavily shrunk. Is there any criterion I could use to exclude a term from subsequent analyses (e.g., the 91% credible interval lies within [-0.01, 0.01] or something), or is this dropping of terms against the Bayesian model selection spirit :stuck_out_tongue: ?
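As an illustration of the kind of criterion I mean, working directly on posterior draws of one coefficient (the numbers here are made up):

```python
# Drop a term if its equal-tailed 91% credible interval lies entirely inside [-0.01, 0.01].
import numpy as np

samples = np.random.normal(0.0, 0.002, size=4000)   # stand-in for posterior draws of a coefficient
lo, hi = np.percentile(samples, [4.5, 95.5])         # equal-tailed 91% interval
drop = (lo > -0.01) and (hi < 0.01)
print(lo, hi, drop)
```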

First: Is this a good way of conducting model selection and selecting the interaction terms?

seems reasonable. sometimes people make a “strong hierarchy” assumption so that interaction terms are only “turned on” when each of their constituent parts is “turned on”, basically by setting the regularizer for an interaction term as lambda_{i,j} = lambda_i * lambda_j (or similar)
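e.g. something along these lines (just a sketch of one way to encode it; the prior scales are arbitrary):

```python
# Strong-hierarchy horseshoe sketch: each main effect j has a local scale lam[j],
# and the interaction between i and j reuses lam[i] * lam[j], so an interaction is
# only weakly regularised when both of its parents are.
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist


def hierarchical_horseshoe(X_main, X_int, pairs, y=None):
    # X_main: (n, p) main-effect columns; X_int: (n, q) interaction columns;
    # pairs: list of (i, j) tuples saying which main effects form each interaction.
    p, q = X_main.shape[1], X_int.shape[1]
    tau = numpyro.sample("tau", dist.HalfCauchy(1.0))
    lam = numpyro.sample("lam", dist.HalfCauchy(jnp.ones(p)))
    beta_main = numpyro.sample("beta_main", dist.Normal(jnp.zeros(p), tau * lam))

    # interaction scale = product of the parents' local scales
    idx_i = jnp.array([i for i, _ in pairs])
    idx_j = jnp.array([j for _, j in pairs])
    lam_int = lam[idx_i] * lam[idx_j]
    beta_int = numpyro.sample("beta_int", dist.Normal(jnp.zeros(q), tau * lam_int))

    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mean = X_main @ beta_main + X_int @ beta_int
    numpyro.sample("obs", dist.Normal(mean, sigma), obs=y)
```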

Second: Should I simply use horseshoe priors for all parameters? It would be nice if I can somehow encode my confidence in positive values.

how much data do you have? you can use a HalfNormal prior or the like if you think you know the right sign but i’m not sure how much this will get you. after all with enough data the likelihood overwhelms the prior.

In this case, it would make sense to drop any terms that have highly shrunk parameters.

one of the downsides of something like the horseshoe is that it makes hard in/out decisions a bit tricky. in that scenario it may be preferable to use discrete latent variables. i’ve released a small library that supports that flavor of model selection for a particular class of models. no idea if it’d cover your use case
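for illustration, here's a generic spike-and-slab sketch in numpyro with discrete inclusion indicators (this is just the general idea, not the library i mentioned; all numbers are made up):

```python
# Each coefficient gets a Bernoulli inclusion indicator gamma_j, giving hard in/out
# decisions via posterior inclusion probabilities (PIPs).
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, DiscreteHMCGibbs


def spike_and_slab(X, y=None, expected_included=2.0):
    n, p = X.shape
    # prior inclusion probability chosen so that `expected_included` covariates are
    # included a priori (analogous to an "expected number of covariates" hyperparameter)
    gamma = numpyro.sample("gamma", dist.Bernoulli(probs=jnp.full(p, expected_included / p)))
    beta = numpyro.sample("beta", dist.Normal(jnp.zeros(p), jnp.ones(p)))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mean = X @ (gamma * beta)
    numpyro.sample("obs", dist.Normal(mean, sigma), obs=y)


# DiscreteHMCGibbs alternates HMC on the continuous parameters with Gibbs updates
# of the discrete inclusion indicators.
X = jax.random.normal(jax.random.PRNGKey(0), (100, 6))
y = X[:, 0] - X[:, 1] + 0.1 * jax.random.normal(jax.random.PRNGKey(1), (100,))
mcmc = MCMC(DiscreteHMCGibbs(NUTS(spike_and_slab)), num_warmup=500, num_samples=1000)
mcmc.run(jax.random.PRNGKey(2), X, y)
pip = mcmc.get_samples()["gamma"].mean(axis=0)   # posterior inclusion probabilities
print(pip)
```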

Thanks for your elaborate answer, @martinjankowiak!

I agree that the discrete latent variables option is neater for making in/out decisions. I’ve experimented a bit with the library and ran the NormalLikelihoodVariableSelector with an isotropic prior. Very nice package! Thanks for making it.

I’m wondering about a few things:

  • The package does not seem to be compatible with using informative priors for some of the parameters (i.e., the linear terms I mentioned above), or would it be possible to somehow include that?

  • What would be a reasonable PIP cutoff? I saw in the documentation that spurious features had a maximum PIP of 0.0028, but that would leave me with too many variables, so I’m currently using 0.01 as a cutoff (a sketch of how I’d apply such a cutoff is below this list). Or should I try to validate this using metrics like WAIC?

  • It’s not completely clear to me how this ‘expected number of covariates’ hyperparameter works, or whether I should tune it somehow. I don’t necessarily have a theoretical reason to prefer 2 over 5 covariates, for instance.

  • How can I access the posterior samples? I didn’t find that in the documentation.
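Here is the kind of cutoff filtering I mean, assuming the selector exposes a summary table with a PIP column (I’m faking a small table here since I’m not sure of the exact format the package returns):

```python
# Hypothetical sketch: keep only covariates whose PIP exceeds the 0.01 cutoff.
import pandas as pd

# summary = selector.summary          # placeholder: whatever the package actually returns
summary = pd.DataFrame({"PIP": [0.002, 0.015, 0.9]}, index=["x1", "x2", "x1:x2"])
kept = summary[summary["PIP"] >= 0.01].index.tolist()
print(kept)   # covariates retained for the downstream system of equations
```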

Then regarding my data, which you asked about: it really depends (it’s different for each equation), but generally it’s between 70 and 600 samples. I’m also a bit confused about the statement that the likelihood would overwhelm the HalfNormal prior: if the likelihood ranges over [-1, -0.1] and I use a HalfNormal prior, doesn’t that prior ascribe zero probability to that range, so that the posterior would end up very similar to the prior (assuming the likelihood is normal, so that a very small amount of probability is still ascribed to the support of the prior)?

if you have questions about the package, please post in the corresponding GitHub issues

I’m also a bit confused about the statement that the likelihood would overwhelm the HalfNormal prior

i don’t know what prior you have in mind or what your data looks like. my point is just that if you have strong signal in your data that strongly prefers that a particular coefficient is positive, any prior which puts non-negligible prior mass on positive coefficients will lead to similar posteriors. in other words if you get a posterior with mean +- std of 0.7 +- 0.1 using e.g. a normal prior then the posterior you’d get for a halfnormal prior would probably be very similar. of course if your prior specification got the sign wrong then you’d get wildly different results.
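here's a quick self-contained check of that claim on made-up data (all values arbitrary):

```python
# With a strongly positive signal, a Normal(0, 1) prior and a HalfNormal(1) prior on the
# coefficient give very similar posteriors.
import jax
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

x = jax.random.normal(jax.random.PRNGKey(0), (200,))
y = 0.7 * x + 0.1 * jax.random.normal(jax.random.PRNGKey(1), (200,))


def model(x, y, prior):
    beta = numpyro.sample("beta", prior)
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("obs", dist.Normal(beta * x, sigma), obs=y)


for prior in [dist.Normal(0.0, 1.0), dist.HalfNormal(1.0)]:
    mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
    mcmc.run(jax.random.PRNGKey(2), x, y, prior)
    beta = mcmc.get_samples()["beta"]
    print(type(prior).__name__, float(beta.mean()), float(beta.std()))
```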

Thanks, that makes it clearer. I’ve opened a topic on the GitHub page!