GP kernel ignoring lengthscale?

fonnesbeck · May 8, 2021, 7:08pm

I have a GP that I am fitting in Pyro where I am setting the lengthscale to a large constant, based on prior knowledge:

    amp = pyro.sample('amp', Gamma(torch.DoubleTensor([2.]).to(device), 
                                        torch.DoubleTensor([0.5]).to(device)))

    K = gp.kernels.RBF(
        input_dim=1, 
        variance=amp, 
        lengthscale=torch.tensor(100.).to(device)
    )
    
    cov_beta = K(torch.DoubleTensor(days).to(device))
    cov_beta.view(-1)[::D+1] += jitter

    beta = pyro.sample('beta', MultivariateNormal((torch.ones(D).to(device)), 
        covariance_matrix=cov_beta))

however, when I fit this model, a plot of beta looks as follows:

clearly the lengthscale is much smaller than 100 here. Does SVI change the values of the lengthscale, even though I have specified it, with no prior? It seems like I am doing something wrong here.

I even tried putting a uniform prior on the lengthscale with a lower bound of 100, and I get a similar result:

ls = pyro.sample('ls', Uniform(torch.DoubleTensor([100.]).to(device), torch.DoubleTensor([200]).to(device)))

Any ideas?

fehiepsi · May 8, 2021, 8:16pm

Hi @fonnesbeck, like PyTorch nn.Module, to filter out some nn parameters, you can use this method:

optimizer.SGD(filter(filter_fn, model.parameters()), lr=1e-3)

Alternatively, I guess you can do

K.lengthscale_unconstrained.requires_grad_(False)

fonnesbeck · May 8, 2021, 10:15pm

OK, that makes sense. Shouldn’t a uniform prior with a large lower bound also prevent the lengthscale from getting small, though? I’ve tried all sorts of strong priors on the lengthscale, but they don’t seem to do any good.

fonnesbeck · May 9, 2021, 12:03am

Ah, I see: you have to set requires_grad_ to False AND set the prior. I was confused.

fehiepsi · May 9, 2021, 4:05am

How did you set prior for the kernel? The recommended way is to use PyroSample. It would be strange if we get values out of support of the prior.

fonnesbeck · May 9, 2021, 2:59pm

My setup is as follows:

    ls = pyro.sample('ls', Gamma(torch.DoubleTensor([20.]).to(device), torch.DoubleTensor([0.5]).to(device)))
    amp = pyro.sample('amp', Gamma(torch.DoubleTensor([2.]).to(device), torch.DoubleTensor([0.5]).to(device)))

    K = gp.kernels.RBF(
        input_dim=1, 
        variance=amp, 
        lengthscale=ls
    )
    K.lengthscale_unconstrained.requires_grad_(False)
    K.variance_unconstrained.requires_grad_(False)

It was not clear how to use PyroSample in the context of how I am using the GP (as a latent variable rather than in a GPRegression or a custom model class. This clearly does not work:

ls = pyro.nn.PyroSample(Gamma(torch.DoubleTensor([20.]).to(device), torch.DoubleTensor([0.5]).to(device)))
amp = pyro.nn.PyroSample(Gamma(torch.DoubleTensor([2.]).to(device), torch.DoubleTensor([0.5]).to(device)))
K = gp.kernels.RBF(
        input_dim=1, 
        variance=amp, 
        lengthscale=ls
    )

fehiepsi · May 9, 2021, 7:05pm

How about this? (you can see more examples in Pyro Modules docs

K = gp.kernels.RBF(...)
K.lengthscale = PyroSample(...)

fehiepsi · May 9, 2021, 7:11pm

@fonnesbeck I think this notebook might be helpful for you. It shows how to set priors, set autoguide, add random effects, inspect parameters, customize optimizers… for a gp model.

fonnesbeck · May 9, 2021, 7:43pm

That gives me a RuntimeError:

RuntimeError: Multiple sample sites named 'lengthscale'

Thanks for the notebook link; that looks very similar to the sort of modeling I am doing, so I think it will be useful.

fehiepsi · May 9, 2021, 8:10pm

It is hard to guess what’s wrong without some code to replicate the error. I think you might need to use pyro_method like GPR model. You can also find an explanation for it in the above notebook.

fonnesbeck · May 9, 2021, 10:23pm

I’m going to try an emulate the regression example and use PyroModule. Its not clear what needs to be defined in __init__ as PyroSample and what is in forward using pyro.sample. I will try all of the GP parameters in the former and the non-GP parameters in the latter.

My model is a little unique because I am using GPs as latent functions with MultivariateNormal variables, and combining them as an expected value for a likelihood, rather than GPRegression, which is only a marginal GP. So, I have several GPs that I am combining.

fonnesbeck · May 9, 2021, 11:07pm

fonnesbeck:

That gives me a RuntimeError:
RuntimeError: Multiple sample sites named 'lengthscale'
Thanks for the notebook link; that looks very similar to the sort of modeling I am doing, so I think it will be useful.

I think I discoverd the cause of this error: I have multiple kernels in my model (since I have multiple latent GPs). If I comment out the other kernels, it runs fine. Strange that this occurs even though I am using the K. namespace (the other kernel is K_w).

fonnesbeck · May 10, 2021, 1:27am

OK, here is some runnable code that illustrates the issue:

jitter = torch.tensor(1.e-5).to(device)

def model(X, y):

    K1 = gp.kernels.RBF(input_dim=1)
    K1.lengthscale = pyro.nn.PyroSample(Gamma(torch.DoubleTensor([20.]).to(device), torch.DoubleTensor([0.5]).to(device)))
    K1.variance = pyro.nn.PyroSample(Gamma(torch.DoubleTensor([2.]).to(device), torch.DoubleTensor([0.5]).to(device)))

    cov_beta = K1(torch.DoubleTensor(X).to(device))
    cov_beta.view(-1)[::len(X)+1] += jitter

    beta = pyro.sample('beta', MultivariateNormal((torch.zeros(len(y)).to(device)), covariance_matrix=cov_beta), obs=torch.DoubleTensor(y).to(device))

    K2 = gp.kernels.Matern32(input_dim=1)
    K2.lengthscale = pyro.nn.PyroSample(Gamma(torch.DoubleTensor([20.]).to(device), torch.DoubleTensor([0.5]).to(device)))
    K2.variance = pyro.nn.PyroSample(Gamma(torch.DoubleTensor([2.]).to(device), torch.DoubleTensor([0.5]).to(device)))

    cov_alpha = K2(torch.DoubleTensor(X).to(device))
    cov_alpha.view(-1)[::len(X)+1] += jitter

    alpha = pyro.sample('alpha', MultivariateNormal((torch.zeros(len(y)).to(device)), covariance_matrix=cov_alpha), obs=torch.DoubleTensor(y).to(device))

So, any time you have two kernels in a model, it seems like this issue will come up. There appears to be a namespace issue here, as the parameters of K1 and K2 should be distinct.

fehiepsi · May 10, 2021, 3:03pm

I see. I think all you need to do is to make your model a pyro module, with submodules K1, K2. Like PyTorch nn Module, parameter names of those submodules will be “K1.lengthscale”, “K2.lengthscale”. We designed Pyro module as a subclass of PyTorch nn module, so constructing, naming,… are very similar to PyTorch ways.

fonnesbeck · May 12, 2021, 2:08am

Are there any more complex examples of using a Pyro module for constructing models? For example, does it support plate notation in the initialization?

fonnesbeck · May 12, 2021, 5:32pm

I did manage to get the model converted to a PyroModule, but it appears to run about 10x slower. Is there known overhead for using a module? The only change I made to how the model is run was to add a retain_graph=True argument to Trace_ELBO to avoid the RuntimeError I was getting without it. Would that be responsible for a slowdown?

fehiepsi · May 12, 2021, 8:01pm

Would that be responsible for a slowdown?

I think so, some results from google search mentioned that in pytorch, using retain_graph=True is ways slower than the default option. I guess it is better to try to resolve the RuntimeError.

Is there known overhead for using a module

When I converted the old gp modules to PyroModules, I didn’t observe such regression (if I remember correctly).

fonnesbeck · May 12, 2021, 8:25pm

Yeah, its not obvious why the RuntimeError is occurring, as I have just re-expressed my existing model (which does not raise an error) as a PyroModule. The running logic is straight from the docs:

N_STEPS = 1000

model = VenueModel(days, venue_list)
guide = autoguides.AutoDiagonalNormal(model)

optimizer = pyro.optim.Adam({"lr": 0.01})
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

pyro.clear_param_store()
for j in range(N_STEPS):
    loss = svi.step(X, y)
    if not j % 10:
        print("[iteration %04d] loss: %.4f" % (j + 1, loss / len(y)))

The failure occurs at the call to step:

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
--> 147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    148 
    149 

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

and the only way I can get around it is to set retain_graph=True.

fehiepsi · May 12, 2021, 8:38pm

This seems like a bug. Could you simplify the model to isolate the problem and make a github issue?

fonnesbeck · May 12, 2021, 10:45pm

Well, I tried to make a runnable simple example that fails, but cannot. However, I notice that if I build up my model incrementally, its when I add the following GP components to the __init__ and forward methods that results in the error:

class VenueModel(PyroModule);

    def __init__(self):

        ...

        self.ls_venue = pyro.nn.PyroSample(
            lambda self: Gamma(
                torch.DoubleTensor([2.0]).to(device),
                torch.DoubleTensor([0.5]).to(device),
            )
        )

        self.amp_venue = pyro.nn.PyroSample(
            lambda self: Gamma(
                torch.DoubleTensor([2.0]).to(device),
                torch.DoubleTensor([0.5]).to(device),
            )
        )

        self.noise_venue = pyro.nn.PyroSample(
            lambda self: HalfCauchy(torch.DoubleTensor([1.0]).to(device))
        )

        self.K_w = gp.kernels.Matern32(
            input_dim=1, variance=self.amp_venue, lengthscale=self.ls_venue
        )

        self.cov_alpha = self.K_w(torch.DoubleTensor(days).to(device))
        self.cov_alpha.view(-1)[:: self.game_days.shape[0] + 1] += self.noise_venue

        ...

    def forward(self, X, y):

        ...

        with pyro.plate("venues", len(self.venues)):

            # g, _ = self.alpha.model()
            g = pyro.sample(
                "g",
                MultivariateNormal(
                    torch.zeros_like(self.game_days).to(device),
                    covariance_matrix=self.cov_alpha,
                ),
            )

        ...

So again, there is much more to the model than this, but removing these components allows the model to run without error. Does anything jump out at you as being problematic?