GPLVM - priors and constraints for inputs, X

Hello,

First of all, I want to thank the Pyro team for creating the GP contrib section. I find it really easy to use, and it sets a nice standard for software that lets researchers explore GP algorithms, something we desperately need in the GP community.

Problem

I am trying to verify some things that have been said in this paper (Damianou 2014) for my research. They claim that we can use the BGPLVM model in the case of modeling uncertain inputs. So, instead of having deterministic inputs X, the inputs come from a distribution e.g. N(mu_x, Sigma_x).

In the paper, they claim that we can do one of two things:

  • Set a prior distribution to X (the model(?) as you call it in Pyro)
  • Set a prior distribution to the variational parameter q(X) (or the guide as you call it in Pyro)

I like the example in the tutorial (from the GrandPrix 2019 paper), as it is very similar in the sense that you are explicitly putting priors on X. My problem is slightly different because I’m not reducing the dimensionality, but I think it’s a similar situation regardless; I imagine I just need to finesse the priors and decide which parameters require gradients. However, I’m having trouble understanding how we set the priors and what exactly they represent within the GPLVM model.

Code - Setting Priors

Continuing from the tutorial where you set a prior to the gplvm class (line [6]),

#...
# we use `.to_event()` to reinterpret all batch dimensions as event dimensions,
# so the prior distribution for X has no batch_shape
gplvm.set_prior("X", dist.Normal(X_prior_mean, 0.1).to_event())
gplvm.autoguide("X", dist.Normal)

the autoguide puzzles me a bit as I cannot really understand where to access the parameters. I could be wrong but doing a simple inspection of the model attributes

gplvm.mode = 'model'
model_X_loc = gplvm.X_loc.cpu().detach().numpy()
model_X_scale_unconstrained = gplvm.X_scale_unconstrained.cpu().detach().numpy()

and similarly a simple inspection of the guide attributes

gplvm.mode = 'guide'
guide_X_loc = gplvm.X_loc.cpu().detach().numpy()
guide_X_scale_unconstrained = gplvm.X_scale_unconstrained.cpu().detach().numpy()

gives the exact same output

assert np.allclose(model_X_loc, guide_X_loc)
assert np.allclose(model_X_scale_unconstrained, guide_X_scale_unconstrained)

So my intuition is that I simply don’t understand where the parameters are stored in the model or the guide, because if I look at the min, mean, and max of the values for the model and the guide

print(model_X_loc.min(), 
      model_X_loc.mean(), 
      model_X_loc.max())
print(model_X_scale_unconstrained.min(), 
      model_X_scale_unconstrained.mean(), 
      model_X_scale_unconstrained.max())

they return

0.0 0.38005337 1.0
0.0 0.0 0.0

which doesn’t make sense to me because we explicitly set the prior to have a 0.1 variance.

Question

Would anyone be able to help me and give me some more intuition or perhaps point me in the direction of some tutorials about how I can do the following:

  • Set a prior distribution to the parameter X
  • Constrain the mean and/or variance of the distribution of the prior for X (e.g. positive, zero_grad)
  • Set a prior distribution to the parameter q (the guide)
  • Constrain the mean and/or variance of the distribution of the prior for the guide (e.g. positive, zero_grad)

Thank you in advance.
J. Emmanuel Johnson

Hi @jejjohnson, I am really happy to see your interest in the gp module! I’ll try to answer your questions but if there is any point which is not clear, please let me know. There might be something wrong with my understanding so further discussions would be very helpful for me. :slight_smile:

First of all, if you set a prior distribution for X with

gplvm.set_prior("X", dist.Normal(X_prior_mean, 0.1).to_event())

then the mean of the prior is X_prior_mean and the scale is 0.1 (so the variance is 0.01). Unless you want to learn the prior’s mean/variance, these tensors will always be constant. Under the hood, dist.Normal(X_prior_mean, 0.1).to_event() is stored as-is.
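To make the `.to_event()` remark concrete, here is a small sketch using `torch.distributions.Independent`, which is what `.to_event()` wraps under the hood; the 10×2 latent shape is made up for illustration:

```python
import torch
from torch.distributions import Normal, Independent

# hypothetical 10x2 latent space; scale 0.1 means std 0.1, i.e. variance 0.01
X_prior_mean = torch.zeros(10, 2)

# Independent(..., 2) reinterprets both batch dims as event dims,
# which is the effect of `.to_event()` on a Pyro distribution
prior = Independent(Normal(X_prior_mean, 0.1), 2)

assert prior.batch_shape == torch.Size([])       # no batch dimensions left
assert prior.event_shape == torch.Size([10, 2])  # the whole X is one event
assert prior.log_prob(X_prior_mean).dim() == 0   # a single scalar log-density
```

So the prior scores the entire latent matrix X with one log-density rather than one per entry, which is what the sampled latent variable needs.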

When you call gplvm.autoguide("X", dist.Normal), the module will create the variational parameters X_loc and X_scale for you. However, we need X_scale to be positive, so under the hood the “root/raw” parameters are X_loc and X_scale_unconstrained. These parameters are used to generate a sample X from the guide distribution dist.Normal(X_loc, X_scale); they play no role in the prior.
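To illustrate the constrained/unconstrained split, here is a small sketch (not the gp module’s actual code) using torch’s constraint registry, where `biject_to(constraints.positive)` maps the raw parameter through the exponential so the resulting scale is always positive:

```python
import torch
from torch.distributions import constraints, biject_to

# the raw parameter lives in unconstrained space (any real number is valid)
X_scale_unconstrained = torch.zeros(5)

# biject_to(positive) is the exponential map in torch.distributions
transform = biject_to(constraints.positive)
X_scale = transform(X_scale_unconstrained)

assert torch.allclose(X_scale, torch.ones(5))   # exp(0) == 1
assert bool((transform(torch.randn(100)) > 0).all())  # always strictly positive
```

This is why the optimizer sees X_scale_unconstrained while the guide distribution sees a positive X_scale.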

If you want to constrain X_loc to be positive, you might call gplvm.set_constraint('X_loc', constraints.positive). Then the root/raw parameter of X_loc will be X_loc_unconstrained.

1 Like

Hello,

Apologies for the late reply. I believe you answered my question. I would just like to clarify a few things for any readers (and more myself if anything):

  • Set a prior distribution to X - this is done as in the tutorial with .set_prior("X")… and these parameters are fixed.
  • The X prior is already constrained and fixed with the set_prior("X") call. Is there a way to unfix this and let these parameters be learned? (Impractical, I know, but it’s nice to know for the future.)
  • Define a distribution q - as in the tutorial, call gplvm.autoguide("X")… This will create the parameters X_loc and X_scale which are learned.
  • We constrain q automatically via the gplvm.autoguide("X"). The X_loc and X_scale_unconstrained are created and we are free to modify them as we see fit. However, these parameters are learned. It is possible that we could fix them by setting them with a X_loc = Parameter(torch.Tensor(0.1), requires_grad=False), for example, correct?

If everything I said above is correct then I think I finally understand how everything works together. Thank you again.

Best,
J. Emmanuel Johnson

1 Like

Hi @jejjohnson, I think you got everything right. Thanks for your detailed clarifications! For your questions,

It is possible that we could fix them by setting them with a X_loc = Parameter(torch.Tensor(0.1), requires_grad=False)

To fix them, I use gplvm.X_loc.requires_grad_(False), which is equivalent to what you did. :slight_smile: For example, in this benchmark test, I fixed the inducing points so they are not learned.
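A toy illustration of this freezing pattern (ToyGuide is a made-up stand-in for the gplvm module, not Pyro code): after requires_grad_(False), the frozen parameter can also be filtered out when building the optimizer:

```python
import torch
from torch import nn

# stand-in module mimicking the guide's X_loc / X_scale_unconstrained
class ToyGuide(nn.Module):
    def __init__(self):
        super().__init__()
        self.X_loc = nn.Parameter(torch.zeros(3))
        self.X_scale_unconstrained = nn.Parameter(torch.zeros(3))

guide = ToyGuide()
guide.X_loc.requires_grad_(False)  # freeze the mean in place

# only hand the still-trainable parameters to the optimizer
trainable = [p for p in guide.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=0.01)

assert not guide.X_loc.requires_grad
assert guide.X_scale_unconstrained.requires_grad
assert len(trainable) == 1
```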

Is there a way to unfix this and let these parameters be learned?

Sure, you can do it but you need a bit more effort (not much though):

class LearnedPriorGP(gp.parameterized.Parameterized):
    def __init__(self, gplvm):
        super().__init__()
        self.gplvm = gplvm
        self.prior_loc = nn.Parameter(...)
        self.prior_scale = nn.Parameter(...)
        self.set_constraint("prior_scale", constraints.positive)

    def model(self):
        self.mode = "model"
        # sample X from the learnable prior, then feed it to the wrapped gplvm
        X = pyro.sample("X", dist.Normal(self.prior_loc, self.prior_scale).to_event())
        self.gplvm.set_data(X, y)  # y is the observed data
        self.gplvm.model()

    def guide(self):
        self.gplvm.guide()
and use this class instead of gplvm for inference.

Hope that it helps! The design pattern I had in mind when making the gp module is to make it modular (like PyTorch nn.Module) and flexible, so it is easy to combine parts together into a probabilistic model (rather than focusing on analytic derivations as in other frameworks). Please let me know if something does not work. :slight_smile:

1 Like

Hey @fehiepsi,

Thank you for the pseudocode and for confirming my understanding. I believe I have plenty to continue my experiments on uncertain inputs for GPLVMs.

Once again, thank you for the replies and thank you for all your work on the contrib library. I appreciate it even more with every additional element of understanding!

Thanks!
Emmanuel

1 Like

Hello @fehiepsi,

So I’ve been working on the uncertain-inputs problem mentioned in this thread for a while now, and I have a question about what the inference method is actually doing for the latent variables in the GPLVM model.

If you recall, I was using the GPLVM tutorial. In my problem I assume that I know the noise in my inputs. To encode that, I put a prior on my X where I fix the X_prior_mean and plug in my known X_prior_scale. Then I set the guide to be a Normal distribution with a X_mean and a diagonal X_scale term. The mean is fixed because I assume my observations are true. My project has been to run experiments with different combinations of fixing or freeing the scale term for both the prior and the guide, e.g. X_scale fixed, X_scale not fixed, etc. In all of my experiments I’m using the TraceMeanField_ELBO inference method.

So behind the scenes, the prior for X is fixed, and the only role it plays in the ELBO objective is in the KL-divergence term between the prior p(X) and the variational distribution q(X). The variational parameters (the guide) are what is actually being changed. So I’d like to know whether it is simply a reparameterization where, for q(X), X = X_loc + X_scale @ Normal(0, I), and the variational parameters X_loc and X_scale are then updated just like any other parameter in our model? I just wanted to confirm that this is the case. It’s actually not very common to see this formulation for the latent-variable priors in the GP literature; it’s very common for the kernel parameters, f, and the inducing points Z, but I haven’t seen this formulation specifically for the latent variables X.

Thank you in advance!

Best,
Emmanuel

P.S. If anyone is interested in seeing my initial results, they’re more than welcome to look at the Google Colab notebook I created.

Hi @jejjohnson, IIUC your question is how to fix X_mean? The parameter names of the guide are X_loc and X_scale_unconstrained. To not optimize X_loc, I think you can use this method (probably using module.named_parameters() to filter by name instead of using module.parameters()). This also works for X_scale_unconstrained. Otherwise, IIRC you can also use

del model.X_loc
model.X_loc = some_fixed_tensor

if it is simply a reparameterization where for q(X), X = X_loc + X_scale @ Normal(0,I)

Actually, the KL is computed directly from p(X) and Normal(X_loc, X_scale), so the KL is a function of X_loc and X_scale (which are what we want to optimize).
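To make this concrete, here is a small sketch using `torch.distributions.kl_divergence`, the same analytic Gaussian KL that TraceMeanField_ELBO can rely on; the shapes and values are made up:

```python
import torch
from torch.distributions import Normal, kl_divergence

# fixed prior p(X) and a learnable guide q(X) = Normal(X_loc, X_scale)
p = Normal(torch.zeros(4), 0.1)
X_loc = torch.zeros(4, requires_grad=True)
X_scale = torch.full((4,), 0.5, requires_grad=True)
q = Normal(X_loc, X_scale)

# closed-form KL(q || p) for two Gaussians:
# log(s_p / s_q) + (s_q^2 + (m_q - m_p)^2) / (2 s_p^2) - 1/2
kl = kl_divergence(q, p)
manual = (torch.log(torch.tensor(0.1) / X_scale)
          + (X_scale ** 2 + (X_loc - 0.0) ** 2) / (2 * 0.1 ** 2) - 0.5)
assert torch.allclose(kl, manual)

# the KL is differentiable in the variational parameters, so SVI can update them
kl.sum().backward()
assert X_loc.grad is not None and X_scale.grad is not None
```

No sample of X is needed for this term at all, which is why it is computed analytically rather than stochastically.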

1 Like

Hi,

Thank you for your insight.

So actually, the second point you mentioned was my question: you compute the KLD between q(X) and p(X), and you compute the likelihood using the q(X) parameters, just like in the standard VI literature.

Sorry if my question was a bit convoluted but that was the gist of what I wanted to know.

Thanks again!

1 Like

Glad that it helps! If you want to make sure that the KL is computed analytically (instead of stochastically), you can add a print statement print(name) at this line. :smiley: