RuntimeError during Cholesky Decomposition

milost · August 10, 2019, 12:30pm

Hi,

I am currently working on a system that uses DKL in an active learning setup (Variance for Active Learning). The core of the system is based on the DKL example that is given in the examples section (Example: Deep Kernel Learning — Pyro Tutorials 1.8.4 documentation). So far everything seems to work out except that sometimes during model training I get this exception.

RuntimeError: cholesky_cpu: U(65,65) is zero, singular U.

I have looked around a bit to see how to deal with this problem and have come across the following solutions:

Increase the noise / jitter or use torch.float64 instead of torch.float32 tensors (Cholesky decomposition during GPRegression model optimization - #2 by fehiepsi)
Set the lengthscale prior to be strictly positive (U(1,1) is zero, singular U with GP Kernel Prior · Issue #1863 · pyro-ppl/pyro · GitHub)

So far I have tried increasing the jitter, which in some cases lets the model complete training; in other cases I get the same error message, only later. Next, I tried using torch.float64 tensors instead of torch.float32, which ultimately failed because I could not get the gpmodule (see below) to work with torch.float64 tensors.

Maybe someone from the pyro team could help out here?
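For readers wondering why jitter helps at all, here is a minimal sketch in plain PyTorch (it is not taken from the code in this thread). The `rbf_kernel` function and the `cholesky_succeeds` wrapper are hypothetical helpers written for this illustration, and the duplicated feature rows are an assumed failure mode, not a diagnosis of the model above. The sketch only shows that identical inputs make the kernel matrix exactly singular, and that adding `jitter` to the diagonal makes it positive definite again.

```python
import torch


def rbf_kernel(x, lengthscale=1.0):
    """Plain RBF kernel matrix: exp(-0.5 * ||x_i - x_j||^2 / lengthscale^2)."""
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = diff.pow(2).sum(-1)
    return torch.exp(-0.5 * sq_dist / lengthscale ** 2)


def cholesky_succeeds(matrix):
    """Hypothetical helper: True if the Cholesky factorization goes through."""
    try:
        torch.linalg.cholesky(matrix)
        return True
    except RuntimeError:  # failures surface as RuntimeError or a subclass of it
        return False


# Three points that a feature extractor has mapped to the same 2-D location
# (an assumed failure mode, used here only for illustration).
features = torch.tensor(
    [[0.3, -1.2], [0.3, -1.2], [0.3, -1.2]], dtype=torch.float64
)
K = rbf_kernel(features)  # every entry is exactly 1, so K has rank 1

# Adding a small multiple of the identity lifts the zero eigenvalues.
jitter = 1e-6
K_jittered = K + jitter * torch.eye(K.shape[0], dtype=K.dtype)

print(cholesky_succeeds(K))
print(cholesky_succeeds(K_jittered))
```

If the inputs are only nearly identical rather than exactly identical, whether the factorization succeeds depends on the precision and on the size of the jitter, which would be consistent with the error appearing later in training rather than disappearing.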

After that I looked at the solution where the prior for the lengthscale is limited to strictly positive values. Although this approach seems to me to be the most effective, as it would probably guarantee that the error mentioned above would not occur again, I lack the necessary knowledge to implement this solution. In addition, it may also be that this solution does not even apply to my case.

Therefore I have posted below the code I am currently using in my project. I can imagine that solving this problem is not trivial, but maybe there is a good way to deal with it. Perhaps one can adjust the warp_core so that it provides “better” values for the Gaussian process? I can imagine that other users experience similar problems, so we might be able to develop some kind of best-practice approach.

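import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import pyro.contrib.gp as gp
import pyro.infer as infer

# train_dataset, test_dataset, number_inducing, learning_rate and cuda
# are defined elsewhere in the project.

# Define neural net which is used as warp core.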
class WarpCore(nn.Module):
    def __init__(self, dims):
        super(WarpCore, self).__init__()
        self.fc1 = nn.Linear(dims, 100)
        self.fc2 = nn.Linear(100, 50)
        self.fc3 = nn.Linear(50, 50)
        self.fc4 = nn.Linear(50, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

# Define data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=len(test_dataset), shuffle=False)

# Get inducing points 
batches = []
for i, (data, _) in enumerate(train_loader):
    batches.append(data)
    if i >= ((number_inducing - 1) // 64):
        break
inducing_points = torch.cat(batches)[:number_inducing].clone()

# Define loss function
elbo = infer.TraceMeanField_ELBO()
loss_fn = elbo.differentiable_loss

# Define likelihood
likelihood = gp.likelihoods.Binary()

# Create deep kernel
warp_core = WarpCore(100)
kernel_fn = gp.kernels.RBF(input_dim=2, lengthscale=torch.ones(2))
deep_kernel = gp.kernels.Warping(kernel_fn, iwarping_fn=warp_core)

# Set up VariationalSparseGP
gpmodule = gp.models.VariationalSparseGP(X=inducing_points, y=None, kernel=deep_kernel,
                                         Xu=inducing_points, likelihood=likelihood,
                                         latent_shape=torch.Size([]),
                                         num_data=len(train_dataset),
                                         whiten=True, jitter=1e-2)

# Set up optimizer
optimizer_params = {"lr": learning_rate}
optimizer = torch.optim.Adam(gpmodule.parameters(), **optimizer_params)

# Define training loop
epochs = 800
for epoch in range(1, epochs + 1):
    epoch_loss = torch.Tensor()
    for batch_idx, (data, target) in enumerate(train_loader):
        if cuda:
            data, target = data.cuda(), target.cuda()
        target = target.float()

        gpmodule.set_data(data, target)
        optimizer.zero_grad()
        loss = loss_fn(gpmodule.model, gpmodule.guide)
        loss.backward()
        optimizer.step()
        epoch_loss = torch.cat([epoch_loss, torch.Tensor([loss.item()])])
    print(epoch_loss.mean())

Hope someone can help out with this …

fehiepsi · August 10, 2019, 1:56pm

Hi @milost, could you post the error you got when using float64? Something like target.float() in your code will not work with float64 tensors. Make sure that you set the default tensor type to float64 too. If you want to set lengthscale to another constraint, I guess you can simply use (see docs):

kernel_fn.set_constraint("lengthscale", constraints.greater_than(0.01))
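The float64 advice in this reply can be sketched in plain PyTorch as follows. This is an illustration under assumptions, not a fix for the exact float64 error milost ran into (that error was never posted): the small `nn.Linear` layer stands in for the warp core, and the `data` and `target` tensors stand in for one batch from the data loader.

```python
import torch
import torch.nn as nn

# Make float64 the default before any module or tensor is created, so that
# newly created parameters and floating-point tensors use double precision.
torch.set_default_dtype(torch.float64)

layer = nn.Linear(4, 2)  # stand-in for the warp core

# Stand-ins for one batch: float32 inputs and integer class labels.
data = torch.randn(3, 4, dtype=torch.float32)
target = torch.tensor([0, 1, 1])

# .float() would cast to float32 and reintroduce a dtype mismatch;
# .double() keeps the batch in float64, matching the parameters.
data = data.double()
target = target.double()

output = layer(data)
print(layer.weight.dtype, target.dtype, output.dtype)
```

The ordering matters in this sketch: the default dtype is set first, so the layer's parameters are created in float64, and the batch is cast to match before it is used.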

gtorres · February 21, 2020, 5:53pm

@fehiepsi I am also getting a similar Cholesky error as above. When I try your suggestion:

kernel_fn.set_constraint("lengthscale", constraints.greater_than(0.01))

it is saying that no such attribute exists. Is there another way to enforce the positivity constraint? Thank you!

fehiepsi · February 21, 2020, 6:48pm

@gtorres From Pyro 1.0, you can simply set a constraint for a parameter with

kernel_fn.variance = pyro.nn.PyroParam(torch.tensor(1.), constraints.greater_than(0.01))

gtorres · February 21, 2020, 7:06pm

Thank you! I had to prepend torch.distributions. to constraints.greater_than, for others seeing this thread in the future.