DKL + Embedding Layer

Hi,

I am currently trying out Deep Kernel Learning with Pyro by following the tutorial (https://pyro.ai/examples/dkl.html). I changed the warping network to follow the one described in the following paper (http://proceedings.mlr.press/v51/wilson16.pdf), which results in:

class WarpCore(nn.Module):
    def __init__(self, dims):
        super(WarpCore, self).__init__()
        self.fc1 = nn.Linear(dims, 1000)
        self.fc2 = nn.Linear(1000, 500)
        self.fc3 = nn.Linear(500, 50)
        self.fc4 = nn.Linear(50, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

This works fine. Now I was thinking of including an nn.Embedding layer in the WarpCore, like so:

class WarpCore(nn.Module):
    def __init__(self, dims):
        super(WarpCore, self).__init__()
        self.embs = nn.Embedding(1000, 1000)
        self.fc1 = nn.Linear(1000, 1000)
        self.fc2 = nn.Linear(1000, 500)
        self.fc3 = nn.Linear(500, 50)
        self.fc4 = nn.Linear(50, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(self.embs(x)))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

This on its own also works fine until I get to the part where I need to specify the inducing points.

batches = []
for i, (data, _) in enumerate(train_loader):
    batches.append(data)
    if i >= ((args.num_inducing - 1) // args.batch_size):
        break
Xu = torch.cat(batches)[:args.num_inducing].clone()

For me it's not quite clear how to specify the inducing points, as they will be "produced" by the embedding layer. Also, the embedding layer requires me to pass a tensor of type torch.long, whereas Xu needs to be a float tensor because it is defined as self.Xu = Parameter(Xu) in vsgp.py.
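The dtype mismatch described above can be seen in isolation with plain PyTorch. The following is only a minimal sketch: the sizes and the names num_inducing, vocab_size and emb_dim are hypothetical stand-ins (for args.num_inducing and the 1000 x 1000 embedding), and the explicit detach() is an assumption made here to show that the looked-up vectors can serve as an independent float tensor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_inducing = 5             # hypothetical, stands in for args.num_inducing
vocab_size, emb_dim = 20, 8  # small stand-ins for the 1000 x 1000 embedding

embs = nn.Embedding(vocab_size, emb_dim)

# Integer ids as a data loader would yield them (torch.randint gives torch.long).
batches = [torch.randint(0, vocab_size, (4,)) for _ in range(3)]
idx = torch.cat(batches)[:num_inducing]

# The lookup returns a float tensor. detach() + clone() (an assumption of this
# sketch) makes Xu an independent leaf tensor instead of part of the
# embedding's autograd graph.
Xu = embs(idx).detach().clone()

# Mirrors the self.Xu = Parameter(Xu) line mentioned for vsgp.py.
Xu_param = nn.Parameter(Xu)
```

So the integer ids never need to become a Parameter themselves; only their float embeddings do.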

I was thinking that maybe someone else tried something similar and could give me a pointer on where to look or how to do that?

Thanks …

Hi @milost, I think you don't need the inducing points to lie in the space of integers. How about making Xu lie in the output domain of the Embedding layer? Something like this:

class EmbeddingAndGP(gp.parameterized.Parameterized):
    def __init__(...):
        super(EmbeddingAndGP, self).__init__()  # needed before assigning submodules
        self.embs = nn.Embedding(1000, 1000)
        Xu = self.embs(torch.cat(batches)[:args.num_inducing])
        self.gp_module = gp.models.VariationalSparseGP(X=Xu, y=None, Xu=Xu, ...)
    
    def model(self, X, y):
        #gp_X = self.embs(X)
        #self.gp_module.set_data(gp_X, y)
        self.gp_module.model()

    def guide(self, X, y):
        gp_X = self.embs(X)
        self.gp_module.set_data(gp_X, y)
        self.gp_module.guide()

Now that I am looking at it, should the points in Xu be coming from the nn.Embedding as in self.embs(...) or should they be the result of the forward pass of the warp_core as in

self.warp_core = WarpCore()
Xu = self.warp_core(torch.cat(batches)[:args.num_inducing])

… as the embeddings will be part of the warp_core

I think it depends on your choice. For the DKL example, I found it better to let Xu lie in the input (i.e. image) domain of the CNN. You can make Xu lie in the output domain of the CNN (as in the DKL paper), but I guess you would have to tune hyperparameters to reproduce the results in the DKL paper.

About WarpCore: my previous comment suggested separating the Embedding layer from the remaining layers of your warp_core. If you want to combine them, then it is better to let Xu lie in the output space of warp_core, as you did. The corresponding code will be something like

class WarpCoreAndGP(gp.parameterized.Parameterized):
    def __init__(...):
        super(WarpCoreAndGP, self).__init__()  # needed before assigning submodules
        self.warp_core = WarpCore()
        Xu = self.warp_core(torch.cat(batches)[:args.num_inducing])
        self.gp_module = gp.models.VariationalSparseGP(X=Xu, y=None, Xu=Xu, ...)

    def model(self, X, y):
        #gp_X = self.warp_core(X)
        #self.gp_module.set_data(gp_X, y)
        self.gp_module.model()

    def guide(self, X, y):
        gp_X = self.warp_core(X)
        self.gp_module.set_data(gp_X, y)
        self.gp_module.guide()

To construct the deep kernel for gp_module in my previous comment, we should warp the RBF kernel with the remaining layers of warp_core (everything except the embedding). If you put Xu in the output domain of warp_core, then you don't need to warp; your kernel will not be a deep kernel.
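A rough way to see the two placements of Xu side by side is to look at the shapes involved. This is a sketch in plain PyTorch, not Pyro code: the sizes are hypothetical, and rest is an assumed small stand-in for the WarpCore layers without the embedding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim, num_inducing = 20, 8, 5  # hypothetical small sizes

embs = nn.Embedding(vocab_size, emb_dim)
# Assumed stand-in for the remaining WarpCore layers (no embedding inside).
rest = nn.Sequential(nn.Linear(emb_dim, 16), nn.ReLU(), nn.Linear(16, 2))

idx = torch.arange(num_inducing)  # integer ids used to initialise Xu

# Option 1: Xu lives in the embedding's output space. The kernel still has
# to apply `rest` (the warp) before the RBF sees the points.
Xu_emb = embs(idx).detach()

# Option 2: Xu lives in the output space of the whole network. The points
# are already 2-dimensional, so a plain RBF over 2 dimensions is enough.
Xu_out = rest(embs(idx)).detach()

print(Xu_emb.shape, Xu_out.shape)
```

With option 1 the kernel input has emb_dim features and the warp is part of the kernel; with option 2 the kernel input has 2 features and the network acts as a feature extractor in front of the GP.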

Ahh I see. So what you suggested was something along the lines of:

… define the WarpCore as follows:

class WarpCore(nn.Module):
    def __init__(self):
        super(WarpCore, self).__init__()
        self.fc1 = nn.Linear(1000, 1000)
        self.fc2 = nn.Linear(1000, 500)
        self.fc3 = nn.Linear(500, 50)
        self.fc4 = nn.Linear(50, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

… and then use it during the construction of the GP model like so:

class EmbeddingAndGP(gp.parameterized.Parameterized):
    def __init__(...):
        super(EmbeddingAndGP, self).__init__()  # needed before assigning submodules
        self.embs = nn.Embedding(1000, 1000)
        rbf = gp.kernels.RBF(input_dim=2, lengthscale=torch.ones(2))
        deep_kernel = gp.kernels.Warping(rbf, iwarping_fn=WarpCore())
        Xu = self.embs(torch.cat(batches)[:args.num_inducing])
        self.gp_module = gp.models.VariationalSparseGP(X=Xu, y=None, Xu=Xu, kernel=deep_kernel, ...)
    
    def model(self, X, y):
        #gp_X = self.embs(X)
        #self.gp_module.set_data(gp_X, y)
        self.gp_module.model()

    def guide(self, X, y):
        gp_X = self.embs(X) # this will be called before X is passed through the warp_core correct?
        self.gp_module.set_data(gp_X, y)
        self.gp_module.guide()

Does this reflect your idea? Why are gp_X = self.embs(X) and self.gp_module.set_data(gp_X, y) commented out in the model method? Or should the embeddings only be used in the guide method?

Also, if it's not too bothersome, could you elaborate a little more on what you meant in your previous reply?

Why wouldn't the kernel be deep anymore? Sorry if this is a dumb question, I am just starting to wrap my head around the concepts.

Thanks

Does this reflect your idea?

Yup!

Why are gp_X = self.embs(X) and self.gp_module.set_data(gp_X, y) commented out in the model method?

I thought it was not necessary when you use SVI, because SVI runs the guide first. In the guide we already set gp_X and y, so we might not need to set the data again in the model. However, if you use the model for something else (e.g. in GPLVM, the input X in the guide is different from the input X in the model), you might have to uncomment these lines. Anyway, I guess embs is cheap, so you can uncomment them. :)

Why wouldn’t the kernel be deep anymore?

In case Xu lies in the output space of warp_core, I think we only need to use the RBF kernel gp.kernels.RBF(input_dim=2, lengthscale=torch.ones(2)) (no warping). In my language (it might differ from some of the literature):

  • x, z -> RBF(x, z): usual kernel
  • x, z -> RBF(net(x), net(z)): deep kernel

If I remember correctly, the kernel in the DKL paper works like

  • x, z -> net(x), z -> RBF(net(x), z)

which looks like a usual RBF kernel applied to features extracted by net, rather than a deep kernel.
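The three kernel forms listed above can be written out with a hand-rolled RBF in plain PyTorch. This is only an illustrative sketch: rbf and net are assumed helpers defined here (they are not Pyro's gp.kernels.RBF or the WarpCore from the thread), and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and of b."""
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-0.5 * d2 / lengthscale**2)

# Assumed stand-in for the warping network (8 input features -> 2 outputs).
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

x = torch.randn(6, 8)  # data points in the input space of net
z = torch.randn(3, 8)  # inducing points in the input space of net

with torch.no_grad():
    k_usual = rbf(x, z)            # x, z -> RBF(x, z)
    k_deep = rbf(net(x), net(z))   # x, z -> RBF(net(x), net(z))

    zu = net(z)                    # inducing points in the output space of net
    k_feat = rbf(net(x), zu)       # x, zu -> RBF(net(x), zu)
```

When zu is initialised as net(z), the deep-kernel form and the feature-extraction form give the same matrix. They differ in what is treated as the learnable inducing parameter: z in the input space of net for the deep kernel, or zu directly in the 2-dimensional output space for the feature-extraction form.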

@fehiepsi A big thank you! You really helped me out!