Best Practice for Variational Inference for Gaussian Processes

I am working through the Gaussian Processes tutorial (Gaussian Processes — Pyro Tutorials 1.8.4 documentation).
I was able to reproduce the exact result on the first dataset of the tutorial (N=20) with GPRegression, using the code given there.

import os

import torch

import pyro
import pyro.contrib.gp as gp
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

smoke_test = ('CI' in os.environ)  # ignore; used for checking code integrity in the Pyro repo
pyro.enable_validation(True)       # can help with debugging
pyro.set_rng_seed(0)

N = 20
X = dist.Uniform(0.0, 5.0).sample(sample_shape=(N,))
y = 0.5 * torch.sin(3*X) + dist.Normal(0.0, 0.2).sample(sample_shape=(N,))

kernel = gp.kernels.RBF(input_dim=1)
gpr = gp.models.GPRegression(X, y, kernel, noise=torch.tensor(1.))

optim = Adam({"lr": 0.005})
svi = SVI(gpr.model, gpr.guide, optim, loss=Trace_ELBO())
losses = []
num_steps = 2500 if not smoke_test else 2
for i in range(num_steps):
    losses.append(svi.step())

However, when I try to do inference on the same data with Variational Inference, using the following code,

kernel = gp.kernels.RBF(input_dim=1)
likelihood = gp.likelihoods.Gaussian()
vgp = gp.models.VariationalGP(X, y, kernel, likelihood=likelihood, whiten=True)

pyro.clear_param_store()
num_steps = 2500 if not smoke_test else 2
losses = vgp.optimize(optimizer=Adam({"lr": 0.01}), num_steps=num_steps)

It almost always breaks because the Cholesky decomposition cannot be performed due to numerical issues:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/contrib/gp/models/model.py", line 199, in optimize
    losses.append(svi.step())
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/infer/svi.py", line 75, in step
    loss = self.loss_and_grads(self.model, self.guide, *args, **kwargs)
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/infer/trace_elbo.py", line 104, in loss_and_grads
    for model_trace, guide_trace in self._get_traces(model, guide, *args, **kwargs):
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/infer/trace_elbo.py", line 52, in _get_traces
    model_trace = poutine.trace(poutine.replay(model, trace=guide_trace)).get_trace(*args, **kwargs)
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/poutine/trace_messenger.py", line 198, in get_trace
    self(*args, **kwargs)
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/poutine/trace_messenger.py", line 182, in __call__
    ret = self.fn(*args, **kwargs)
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/poutine/messenger.py", line 148, in __call__
    return self.fn(*args, **kwargs)
  File "/home/fengc/anaconda2/envs/pyro/lib/python3.6/site-packages/pyro/contrib/gp/models/vsgp.py", line 117, in model
    Luu = Kuu.potrf(upper=False)
RuntimeError: Lapack Error in potrf : the leading minor of order 14 is not positive definite at /pytorch/aten/src/TH/generic/THTensorLapack.c:617

I also tried adding variance to the Gaussian likelihood, using Rprop instead of Adam, and changing the jitter to 1e-5 or 1e-4, but it does not help; the optimization still stops at some point in the iterations.

I am wondering what the best practice is for doing variational inference for Gaussian processes.

@ciaobladoo How about putting pyro.clear_param_store() before redefining the kernel?

To deal with the Cholesky problem, increasing the jitter or casting your tensors to torch.float64 is usually enough.
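A minimal sketch of both fixes, assuming the same X, y, and imports as above (jitter is a constructor argument of the GP models, added to the diagonal of the covariance matrix to stabilize its Cholesky decomposition):

torch.set_default_dtype(torch.float64)  # work in double precision from here on

X64, y64 = X.double(), y.double()       # cast the data as well
kernel = gp.kernels.RBF(input_dim=1)
likelihood = gp.likelihoods.Gaussian()
# a larger jitter makes Kuu better conditioned before the factorization
vgp = gp.models.VariationalGP(X64, y64, kernel, likelihood=likelihood,
                              whiten=True, jitter=1e-4)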

Hi @fehiepsi

Thanks a lot for your help!

I don’t think pyro.clear_param_store() is the problem. Also, I don’t want to increase the jitter beyond 1e-4, and that does not help anyway. I think I should try torch.float64. Another approach I remember using before is to put a prior on the lengthscale to bound it from above, since the problem usually appears when gradient descent reaches a high lengthscale value.
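In recent Pyro versions, where the GP modules are PyroModules, such a prior can be attached directly to the kernel. A minimal sketch, with an illustrative LogNormal prior (as in the tutorial) that softly penalizes large lengthscales; a Uniform prior would give a hard upper bound instead:

from pyro.nn import PyroSample

kernel = gp.kernels.RBF(input_dim=1)
# illustrative choice: keeps the lengthscale positive and puts little
# mass on large values; tune the parameters to the scale of your inputs
kernel.lengthscale = PyroSample(dist.LogNormal(0.0, 1.0))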

Yup, adding priors is a good way to prevent unrealistic parameters. However, I ran your code on my machine and didn’t hit the error you got.

Thanks @fehiepsi, that is interesting! Did you use the exact code, or did you make some changes? How about the MAP values of the hyper-parameters, are they reasonably close to the MLE results? If everything is fine on your machine, it is probably because I somehow got float32.

(Sorry, I don’t have access to my machine right now, so I can’t do quick runs.)

Thanks a lot!!

@ciaobladoo I just ran exactly your code (skipping the gpr model), using float32. Here are the learned parameters, FYI:

variance: tensor(0.1643)
lengthscale: tensor(0.2120)
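(In recent Pyro these can be read off the fitted model directly; a sketch, assuming the vgp object defined above:)

print("variance:", vgp.kernel.variance)
print("lengthscale:", vgp.kernel.lengthscale)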
