The current pyro.contrib.gp package and most literature I found conducted variational inference by approximating the posterior of hidden (latent) variables. I feel like we could also do VI by approximating the posterior over the hyperparameters. (Although the performance may be heavly related the likelihood function if we use reprameterizable variational families). The motivation is for many applications, I feel like the hyper-parameters of kernels may be a better “representation” of the data (it captures the signal variance and length scale, for example). And the variatioanal approximation of posterior of hyperparameters provides a natural way of eavaluating the uncertainty of hyperparameter values (though it could be bad depend on the gap of approximation). Thus my question is would it be useful if we implement a VI of GP by approximating the posterior of hyperparameters?
@ciaobladoo Did you mean that instead of using Delta guide for kernel’s parameters (as in http://pyro.ai/examples/gp.html#Fit-the-model-using-MAP), you want to define another guide? If it is the case, then I will support it in a few weeks (I am a bit busy now).
Hmmm. I am actually not sure if that is what I mean. The guide method in all GP model classes are a distribution of the hidden variable f, what I am trying to say is why don’t we treat the hyperparameters as latent variable and use a guide that is a distribution over hyperparameters. When we do inference, we then maximize the ELBO that minimizes the KL divergence between the posterior of hyperparameters given data and the variational approximation of this posterior (the hidden varible f should be maginalized). But maybe it is equavilent to what you are suggesting? I need to be more familiar with probablistic programming.
@ciaobladoo As in the example I mentioned above, we can set priors to your hyperparameters and the
guide method of a GP model will (automatically) set up a Delta guides for them. These hyperparameters will be latent variables but the inference just gives MAP points for them (posterior with Delta distribution). I think that what you want is to define other guides for your hyperparameters. Just wait for a while, I will implement the ability to do it.
Thank you very much for your help! But I am still not sure if we are talking about the same thing. What I am thinking is a kernel learning setting. Let’s say we have a GP model with RBF kernel, and we have hyperparameters “signal variance” and “length scale”, let us lump them together and call it z. So basically my GP model is composed of the following: A kernel function: z->K(z)(), a GP: f~GP(0, K(z)(x)) (let us only consider zero mean), and a likelihood y~p(y|f). The inference I want to do is to approximate the posterior of hyperparameters p(z|x,y). The motivation is that it makes more sence to use the hyperparameter as a “representation” of the data since it captures the structure of the data and can be intepretable. So here my objective is minimize KL(q(z)||p(z|x,y)). The difficulty here is that when optimizing the corresponding ELBO we need to calculate the gradient of p((x,y),z), which is an integral over hidden variable f, and is intractable unless our likelihood is Gaussian. But I think we may still have a chance to have a good MCMC estimator for this integral. In your current implementation, for example in vgp.py, the guide is a distribution q(f), so I am not sure if it does the same thing I described. Of course whether what I described is useful or not is debatable, but I think learn a varitaional distribution q(z) rather than q(f) could be useful, say quantify uncertainty of hyperparameters and enable more robust comparison between data.
I also could not find something simliar to what I described in the literature. I guess most automatic differentiation variational inference assume that you can easily evaluate p(x,z) while here in order to evaluate it you probably need to calculate an intractable intergral. In any case I am just puting it here to stimulate discussion in case it is useful.
In any case thanks again for the implementation of GP it is really helpful!
I think that we are talking about the same thing. In the current implementation,
q(z) = Delta(z_MAP), where
z_MAP is a parameter in
guide learned using SVI. What you need is defining an arbitrary guide for
q (not just
Delta), am I right?
For reference in code, this is the place we define guide for
q(z). I intend to make a method
set_guide there to let users define their own guide for latent variables.
I don’t know where using other distributions for guides of hyperparameters is good or not. Anyway, Pyro is flexible, so supporting this is definitely necessary.
Hi @fehiepsi, thank you very much! I think maybe I understand it better now. So basically, in here you defined guide for f_scale_tril. Given the data, this is equivalent to defining a guide for hyperparameter z, as K()(X) should be an inverterble transformation of z. Furthermore, to approximate the posterior of hyperparameter p(z|y), we need to evaluate the gradient of p(y|z), which requires marginalization over hidden variable f. When p(y|f) is not Gaussian, we need another Monte Carlo estimator for this. And this can be done by sampling f ~ N(f_loc, f_scale_tril).
I guess I still need some mathematical derivation to convience myself that what you are trying to implement is equivalent to what I described but if the last paragraph is correct understanding then I think I am getting there.
It is almost there. As an example, if we set prior to
lengthscale, then we want to find
q(lengthscale, f), which is
q(lengthscale) = Delta(lengthscale_MAP) and
q(f) = Normal(f_loc, f_scale). In non-variational models,
f is not a latent variable. So we simply find
Thank you very much. I understand now. The guide was set iteratively for the kernel parameters when set_mode. Did not understand this part of the code before. Still not exactly the same as what I was thinking though. I was thinking about approximating the marignal posterior of hyper-parameters. That is we will not define a guide for f. But the model needs to be redefined and to calculate the likelihood (and it’s gradient) we need another monte carlo estimator to integrate over f, which may not be stable. So practically the currently implementation should be prefered.