VAE for a multi-label dataset


I tried to implement a VAE for a custom dataset.

Information about the problem and the dataset.

  1. Image dataset with each image resized to 200 x 200; there are 23 classes.
  2. Each image can belong to multiple classes at the same time (multi-label).
  3. I have train and test data loaders set up; __getitem__ returns an image tensor and a label tensor with the necessary transforms already applied.

I modified the shapes in the encoder and decoder from the MNIST example (instead of 28 x 28 = 784, I use 200 x 200 = 40000).

When I run inference, the training and test losses become NaN from the second epoch (sometimes from the first). (It runs only when I convert the images to grayscale and squeeze out the channel dimension.)

[epoch 000]  average training loss: 13551.6894
[epoch 000]  average test loss: 7712.9928
/home/mancunian92/anaconda3/lib/python3.6/site-packages/pyro/infer/ UserWarning: Encountered NaN: loss
  warn_if_nan(loss, "loss")
[epoch 001]  average training loss: nan
[epoch 002]  average training loss: nan


  1. For multi-label images, what modifications should I make?
  2. Right now I have converted the images to grayscale. What should I do if I have to keep them in RGB?

@jpchen @martinjankowiak @fritzo @eb8680

FYI @osazuwa

Hi, can you clarify what your model is supposed to do? What are the latent and observed variables? How does the model represent multiple labels, as a bit vector? Which example have you modified?

As an initial direction, what do your encoder and decoder neural network architectures look like? With larger images you’ll probably want something a bit larger and more specialized than the small MLP in the example. If increasing the hidden layer width or adding another couple of layers doesn’t help, you might try googling around for PyTorch convolutional VAE example implementations to get a sense of what works; here’s a (completely unvetted) example I just found:

You might also try tinkering with the learning rate and batch size.

If loc_img.shape == x.shape == (batch_size, 3, 400, 400), you need to declare all three trailing dimensions (channel, height, width) as event dimensions with .to_event(3):

pyro.sample("obs", dist.Bernoulli(loc_img).to_event(3), obs=x)

See the tensor shapes tutorial for background.

Thanks @eb8680_2, it worked.

When we do SVI, is there any specific reason for getting a NaN loss? (Is the log-likelihood part of the ELBO causing the issue?)

The loss decreases over each epoch and then suddenly goes to NaN. Would I have to make any changes if I encounter this situation?

Is there any specific reason for getting a NaN loss?

There’s no single reason; the problem you’re working on (learning a generative model of 400x400 images) sounds quite hard and there’s no guarantee the approach you have in mind will work.

You could try changing your prior or your guide initialization, using multiple particles in your ELBO estimator, annealing your prior as in the DMM example, lowering your learning rate (either from the beginning or with a scheduler), clipping gradients, using batch normalization or other standard deep learning tricks, or looking at the experimental details in papers or code that address similar problems.

Hi @eb8680_2, I took your advice and my VAE captured the data-generating process much better than I anticipated. However, to achieve what I initially set out to do, I need to include the labels too. I am trying to do something similar to the SSVAE tutorial (the only difference is that I don’t have any unsupervised images; everything is supervised).

While following the SSVAE tutorial, I see that the first step is concatenating two different tensors (either xs and ys in the guide, or zs and ys in the model). I was having trouble adapting this to my images, so I looked at the shapes of the tensors produced in the tutorial.

In the SSVAE tutorial, the batch size is 200, and in the data loader the image size is (num_samples, 784) and the label size is (num_samples, 10). But in the guide, the shape of ys changes to (10, batch_size, 10) so that the shapes broadcast correctly and the tensors can be concatenated. Where is this transformation done, i.e. the labels changing from (batch_size, 10) to (10, batch_size, 10)?

In my dataset, the image size is (batch_size, 3, 400, 400) and the label size is (batch_size, 11). How do I concatenate these tensors?

How do I concatenate these tensors?

You don’t have to concatenate the input tensors immediately. Just add a pathway in your encoder that applies a CNN to convert an input image into a feature vector and concatenate that with the label, or with the output of another pathway that converts the label into features.

For my VAE, the shapes from my encoder to decoder go like this.

<------------------Encoder ----------------><-----Decoder-------------------------->
Image -> Hidden -> (Mu, Sigma) -> Z -> Hidden -> Reconstructed Image
(3,400,400) -> (1,1024) -> (1,32), (1,32) -> (1,32) -> (1,1024) -> (3,400,400)

If I understood you correctly, all I need to do is concatenate in the hidden space?
<------------------Encoder ----------------><-----Decoder-------------------------->
Image -> Hidden -> (Mu, Sigma) -> Z -> Hidden -> Reconstructed Image
(3,400,400) -> (1,1024+labels) -> (1,1024) -> (1,32), (1,32) -> (1,32) -> (1,1024+labels) -> (1,1024) -> (3,400,400)

Is my understanding correct?

Yes, that’s right. Note that there’s nothing special about the particular neural network architectures used in any of the tutorials; you can change them however you like to fit your use case.
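The encoder half of that layout could be sketched roughly as follows. This is an illustration under assumptions, not the poster's code: the class name LabelConditionedEncoder, the convolution sizes, and the pooling step are all hypothetical, while the 1024-wide hidden layer, the 32-dimensional latent, and the 11 labels come from the shapes quoted in the thread.

```python
import torch
from torch import nn

class LabelConditionedEncoder(nn.Module):
    """Hypothetical encoder: CNN features concatenated with a multi-hot label."""

    def __init__(self, num_labels=11, hidden_dim=1024, z_dim=32):
        super().__init__()
        # Image pathway: CNN that turns (3, H, W) into a feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # fixed 4x4 map regardless of input size
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, hidden_dim), nn.ReLU(),
        )
        # Joint pathway: (hidden_dim + num_labels) -> hidden_dim.
        self.joint = nn.Sequential(
            nn.Linear(hidden_dim + num_labels, hidden_dim), nn.ReLU()
        )
        self.loc = nn.Linear(hidden_dim, z_dim)
        self.log_scale = nn.Linear(hidden_dim, z_dim)

    def forward(self, x, y):
        features = self.cnn(x)                        # (batch, hidden_dim)
        joined = torch.cat([features, y], dim=-1)     # (batch, hidden_dim + num_labels)
        hidden = self.joint(joined)                   # (batch, hidden_dim)
        # exp keeps the scale strictly positive.
        return self.loc(hidden), torch.exp(self.log_scale(hidden))

encoder = LabelConditionedEncoder()
x = torch.rand(2, 3, 400, 400)                 # fake image batch
y = torch.randint(0, 2, (2, 11)).float()       # fake multi-hot labels
z_loc, z_scale = encoder(x, y)
print(z_loc.shape, z_scale.shape)
```

The decoder would mirror this by concatenating the sampled z with the same label vector before its first linear layer.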