Choosing your prior: learning the prior, and meta-learning

Hello, I am new to Pyro and BNNs and I am still trying to properly understand the prior/posterior duality.

So far, I have understood that the prior is supposed to represent the knowledge I have, my intuitions BEFORE learning, so it is often chosen as an isotropic Gaussian. Yet since the prior plays a role in the prediction function, a bad prior could result in bad predictions, so you can obtain a better-adapted prior by:

  • splitting your dataset into two parts, some kind of x_train_1 and x_train_2, learning the posterior on x_train_1, and then injecting that posterior into the prior before redefining the posterior. You then learn the posterior once again on x_train_2 and end up with both a good prior and a good posterior (more or less the principle of online learning; see the first sketch after this list).

  • first learning a classical (non-Bayesian) model N times, collecting the weights of those N models, and inferring a probability distribution over each weight (given that you now have N possible values for each parameter). You can then properly define the prior of your BNN and proceed with your learning (Bayesian Neural Network Priors Revisited, https://arxiv.org/pdf/2102.06571.pdf; see the second sketch after this list).
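
To make the first point concrete, here is roughly what I have in mind in Pyro. This is only a rough sketch with a toy Bayesian linear model and made-up data, and it assumes that `AutoNormal` exposes its learned locations and scales as `guide.locs.<site>` / `guide.scales.<site>` (I am not 100% sure of those attribute names):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

D = 5  # toy dimensionality

def make_model(prior_loc, prior_scale):
    # Bayesian linear model standing in for one BNN layer
    def model(x, y=None):
        w = pyro.sample("w", dist.Normal(prior_loc, prior_scale).to_event(1))
        mean = x @ w
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Normal(mean, 0.1), obs=y)
    return model

def fit(model, x, y, steps=2000):
    guide = AutoNormal(model)
    svi = SVI(model, guide, Adam({"lr": 1e-2}), Trace_ELBO())
    for _ in range(steps):
        svi.step(x, y)
    return guide

x1, y1 = torch.randn(100, D), torch.randn(100)  # x_train_1 (placeholder data)
x2, y2 = torch.randn(100, D), torch.randn(100)  # x_train_2 (placeholder data)

# Stage 1: isotropic Gaussian prior, posterior learned on the first chunk
guide1 = fit(make_model(torch.zeros(D), torch.ones(D)), x1, y1)
post_loc = guide1.locs.w.detach().clone()
post_scale = guide1.scales.w.detach().clone()

pyro.clear_param_store()  # avoid reusing stage-1 parameter names

# Stage 2: the stage-1 posterior over w becomes the prior for the second chunk
guide2 = fit(make_model(post_loc, post_scale), x2, y2)
```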
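
And for the second point, this is the kind of procedure I imagine (again a rough sketch with a toy network and made-up data, not the exact protocol of the paper):

```python
import torch
import torch.nn as nn

def train_once(x, y, seed):
    # one classical (non-Bayesian) training run; only the seed changes
    torch.manual_seed(seed)
    net = nn.Linear(5, 1)
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        loss = ((net(x).squeeze(-1) - y) ** 2).mean()
        loss.backward()
        opt.step()
    # flatten all weights of this run into one vector
    return torch.cat([p.detach().reshape(-1) for p in net.parameters()])

x, y = torch.randn(100, 5), torch.randn(100)  # placeholder data
samples = torch.stack([train_once(x, y, seed) for seed in range(20)])  # N = 20 runs

# empirical per-weight Gaussian: these would become the prior loc/scale of the BNN
prior_loc = samples.mean(dim=0)
prior_scale = samples.std(dim=0).clamp_min(1e-3)
```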

But then the article Hands-on Bayesian Neural Networks – A Tutorial for Deep Learning Users (https://arxiv.org/pdf/2007.06823.pdf) suggests that you could try to learn the prior (section V.D, Learning the prior) with a strategy called empirical Bayes. If I understand the algorithm correctly, you simultaneously backpropagate through your prior and your posterior, but this seems strange… It would mean that the posterior and the prior become decorrelated, right? How can SVI still converge to a minimum?
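
For reference, this is how I currently picture the empirical Bayes setup in Pyro (my own sketch, using `pyro.param` inside the model so that SVI updates the prior's parameters together with the guide's; the names and toy data are mine):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

D = 5

def model(x, y=None):
    # prior hyperparameters declared as learnable model parameters
    prior_loc = pyro.param("prior_loc", torch.zeros(D))
    prior_scale = pyro.param("prior_scale", torch.ones(D),
                             constraint=constraints.positive)
    w = pyro.sample("w", dist.Normal(prior_loc, prior_scale).to_event(1))
    mean = x @ w
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Normal(mean, 0.1), obs=y)

guide = AutoNormal(model)
svi = SVI(model, guide, Adam({"lr": 1e-2}), Trace_ELBO())

x, y = torch.randn(100, D), torch.randn(100)  # placeholder data
for _ in range(2000):
    # a single ELBO objective: both the prior params (in the model) and the
    # posterior params (in the guide) receive gradients from the same loss
    svi.step(x, y)
```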

Moreover, I came across the notion of meta-learning (https://gaussianprocess.org/gpml/chapters/RW.pdf), where the prior is defined by hyperparameters that are learned beforehand on previous datasets. But I don't understand: to me these hyperparameters are just parameters, so aren't we back in the case of the first bullet point I wrote? Maybe there is a particularity due to the fact that the article deals with Gaussian processes, and I am not really comfortable with this question.
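
For the GP case, my current reading is that "learning the prior" means maximizing the marginal likelihood p(y | hyperparameters) with respect to the kernel hyperparameters (type-II maximum likelihood). Here is a small self-contained sketch of what I mean, with made-up data:

```python
import math
import torch

def rbf_kernel(x1, x2, lengthscale, variance):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * torch.exp(-0.5 * sqdist / lengthscale ** 2)

# hyperparameters of the GP prior, log-parameterised so they stay positive
log_lengthscale = torch.zeros(1, requires_grad=True)
log_variance = torch.zeros(1, requires_grad=True)
log_noise = torch.tensor([-2.0], requires_grad=True)

# toy dataset (placeholder)
x = torch.linspace(0, 1, 20)
y = torch.sin(2 * math.pi * x) + 0.1 * torch.randn(20)

opt = torch.optim.Adam([log_lengthscale, log_variance, log_noise], lr=0.05)
for step in range(500):
    opt.zero_grad()
    K = rbf_kernel(x, x, log_lengthscale.exp(), log_variance.exp())
    K = K + log_noise.exp() * torch.eye(len(x))
    # negative log marginal likelihood: -log p(y | lengthscale, variance, noise)
    nll = -torch.distributions.MultivariateNormal(
        torch.zeros(len(x)), covariance_matrix=K).log_prob(y)
    nll.backward()
    opt.step()
```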

Could you please help me understand the last two strategies for choosing the prior? Should I stick with the first two I mentioned (I have only tried the first one so far and am about to try the second one), or do the last two strategies have an interest of their own?

Thank you very much and have a great day

Rémi