Thank you very much for your help! But I am still not sure if we are talking about the same thing. What I have in mind is a kernel learning setting. Say we have a GP model with an RBF kernel, with hyperparameters "signal variance" and "length scale"; let us lump them together and call them z. So my GP model is composed of the following: a kernel function z -> K(z)(.), a GP f ~ GP(0, K(z)(x)) (let us only consider a zero mean), and a likelihood y ~ p(y|f). The inference I want to do is to approximate the posterior over the hyperparameters, p(z|x,y). The motivation is that it makes more sense to use the hyperparameters as a "representation" of the data, since they capture the structure of the data and are interpretable. So my objective is to minimize KL(q(z)||p(z|x,y)). The difficulty is that optimizing the corresponding ELBO requires the gradient of log p(x,y,z), which involves an integral over the latent function f and is intractable unless the likelihood is Gaussian. But I think we may still have a chance of getting a good Monte Carlo estimator for this integral (see the sketch below). In your current implementation, for example in vgp.py, the guide is a distribution q(f), so I am not sure it does the same thing I described. Of course, whether what I described is useful is debatable, but I think learning a variational distribution q(z) rather than q(f) could be useful, e.g. to quantify uncertainty in the hyperparameters and enable more robust comparisons between datasets.
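To make it more concrete, here is a rough sketch of what I have in mind, written directly in Pyro without using the gp module. The Bernoulli likelihood, the LogNormal priors on the hyperparameters, and the site names are all placeholders I made up for illustration, and log p(y|x,z) is estimated with a simple log-mean-exp over prior draws of f, which is only a biased lower bound, so please treat this as a sketch of the idea rather than a working recipe:

```python
import torch
import pyro
import pyro.distributions as dist


def rbf_kernel(x, variance, lengthscale, jitter=1e-4):
    # K[i, j] = variance * exp(-0.5 * (x_i - x_j)^2 / lengthscale^2), 1-d inputs
    d2 = (x.unsqueeze(-1) - x.unsqueeze(-2)) ** 2
    return variance * torch.exp(-0.5 * d2 / lengthscale ** 2) + jitter * torch.eye(x.shape[0])


def model(x, y, num_mc=16):
    # z = (signal variance, length scale); LogNormal priors are just placeholders
    variance = pyro.sample("variance", dist.LogNormal(0.0, 1.0))
    lengthscale = pyro.sample("lengthscale", dist.LogNormal(0.0, 1.0))

    # p(f | x, z) = N(0, K(z)(x))
    f_dist = dist.MultivariateNormal(
        torch.zeros(x.shape[0]),
        covariance_matrix=rbf_kernel(x, variance, lengthscale),
    )

    # Monte Carlo estimate of log p(y | x, z) = log E_{f ~ p(f|x,z)}[p(y | f)],
    # via log-mean-exp over prior draws of f; note this is a biased lower bound.
    f = f_dist.rsample((num_mc,))                            # (S, N)
    log_lik = dist.Bernoulli(logits=f).log_prob(y).sum(-1)   # (S,), non-Gaussian likelihood
    log_marginal = torch.logsumexp(log_lik, dim=0) - torch.log(torch.tensor(float(num_mc)))

    # fold the estimated log-marginal into the joint density, so that the only
    # latent sites left for the guide are the hyperparameters z
    pyro.factor("log_marginal_lik", log_marginal)
```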
I also could not find anything similar to what I described in the literature. I guess most automatic differentiation variational inference methods assume that you can easily evaluate p(x,z), whereas here evaluating it requires computing an intractable integral over f. In any case, I am just putting it here to stimulate discussion in case it is useful.
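For completeness, with the sketch above the q(z) part would then just be a standard autoguide over the two hyperparameter sites, fitted with ordinary SVI (again only an illustration, with made-up toy data):

```python
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

# toy classification data, purely for illustration
x = torch.linspace(-3.0, 3.0, 50)
y = (torch.sin(x) > 0).float()

guide = AutoDiagonalNormal(model)   # this plays the role of q(z)
svi = SVI(model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(x, y)

print(guide.median())   # point summary of the approximate posterior over z
```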
Thanks again for the GP implementation, it is really helpful!