Loss becomes increasingly negative after changing the VAE model definition

Hi All,

While experimenting with the VAE tutorial, I changed the generative distribution of the pixel values to Gaussians.

    def model(self, x):
        # register PyTorch module `decoder` with Pyro
        pyro.module("decoder", self.decoder)
        with pyro.iarange("data", x.size(0)):
            # setup hyperparameters for prior p(z)
            z_loc = x.new_zeros(torch.Size((x.size(0), self.z_dim)))
            z_scale = x.new_ones(torch.Size((x.size(0), self.z_dim)))
            # sample from prior (value will be sampled by guide when computing the ELBO)
            z = pyro.sample("latent", dist.Normal(z_loc, z_scale).independent(1))
            # decode the latent code z
            loc_img = self.decoder.forward(z)
            # score against actual images
            
            #here is the change
            sigmas = x.new_ones(torch.Size((x.size(0), 784)))*0.1
            pyro.sample("obs", dist.Normal(loc_img, sigmas).independent(1), obs=x.reshape(-1, 784))
            # return the loc so we can visualize it later
            return loc_img

However, when I try to train, the loss keeps consistently decreasing towards more and more negative values.
A typical training run looks like this:

[epoch 000]  average training loss: 935.3164
[epoch 000] average test loss: 62.3496
[epoch 001]  average training loss: -158.1064
[epoch 002]  average training loss: -398.4471
[epoch 003]  average training loss: -506.6157
[epoch 004]  average training loss: -573.4166
[epoch 005]  average training loss: -618.9186
[epoch 005] average test loss: -647.4464
[epoch 006]  average training loss: -652.3466
[epoch 007]  average training loss: -677.4514
[epoch 008]  average training loss: -696.5506
[epoch 009]  average training loss: -711.8633
[epoch 010]  average training loss: -724.7451
[epoch 010] average test loss: -735.1249

Notice also the initial positive values.

What could be the reason for this? Is the model not supposed to work with a normal distribution for the pixels? I understand that the Bernoulli assumption is effectively computed as a cross-entropy (reconstruction loss), but it seems to me that a normal distribution is another valid way to model the pixel distribution.
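To see why the sign of the loss flips, it can help to compare the two likelihoods numerically. The sketch below (plain Python, no Pyro needed; the values 0.9, 0.5, and 0.1 are just illustrative) shows that a Bernoulli log-probability is always non-positive, while a Normal log *density* with a small scale can be positive:

```python
import math

def bernoulli_log_prob(x, p):
    # log Bernoulli(x; p); for gray-scale pixels x in [0, 1] this is exactly
    # the negative binary cross-entropy the tutorial effectively computes.
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def normal_log_density(x, loc, scale):
    # Log density of Normal(loc, scale) at x; its peak value is
    # -log(scale) - 0.5*log(2*pi), which exceeds 0 once scale < 1/sqrt(2*pi).
    return (-math.log(scale) - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((x - loc) / scale) ** 2)

print(bernoulli_log_prob(1.0, 0.9))       # log(0.9) ~ -0.105, never positive
print(normal_log_density(0.5, 0.5, 0.1))  # ~ +1.384 at the peak for scale 0.1
```

So under Normal(loc, 0.1) each well-reconstructed pixel contributes *positive* log likelihood, whereas under the Bernoulli it contributes at most 0; summed over all pixels this is what drives the reported loss (the negative ELBO) below zero.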

I suppose this converges fine? Try reconstructing an image to check whether the encoder/decoder weights have been learnt correctly. With regard to the negative loss, note that the score reported is the negative of the ELBO. One reason why you might observe a positive value for the ELBO (i.e. a negative loss) is that |log p(X, Z)| is smaller than the entropy term in the ELBO. This could be because of the value chosen for the normal scale parameter (0.1), which correspondingly results in higher (less negative) values for log p(X, Z).
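A rough back-of-the-envelope check of the magnitudes involved (the 784 pixels and the 0.1 scale come from the model code above; no Pyro needed):

```python
import math

# Peak per-pixel log density under Normal(., scale): -log(scale) - 0.5*log(2*pi).
scale = 0.1
peak = -math.log(scale) - 0.5 * math.log(2 * math.pi)  # ~ +1.384 per pixel

# Summed over the 784 pixels of a 28x28 image, a near-perfect reconstruction
# can contribute on the order of +1085 nats to log p(X, Z), easily outweighing
# the entropy/KL terms and pushing the reported loss (-ELBO) well below zero.
print(784 * peak)  # ~ 1084.8
```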

Thanks for the response; indeed, the reconstruction works fine.

About the loss: does the progression towards more and more negative values make sense in this case? I thought the value calculated in the VAE tutorial run was the ELBO (being maximized, i.e. progressing towards less negative values). Swapping the Bernoulli for a Normal gives the opposite behaviour in the loss/ELBO curves.

My intention was to compare which model was better using the loss. How can this be done?

I see the issue: the value reported by svi.step() is the negative of the ELBO and would ordinarily be positive (the graph plotted in the tutorial is the actual ELBO, and hence negative). You are right that in general the ELBO should take on less negative values. In this case the ELBO is indeed increasing, but it is becoming more and more positive. That seems problematic at first, because KL(q(z|x) || p(z|x)) is non-negative and the ELBO would need to be negative for the sum to equal the model log evidence log p(x). I also noticed that using a higher value, say 0.5, for the observation's scale makes the model well behaved. Edit: a positive log density should not be a problem per se, since you have switched to a continuous distribution. Also see: Large negative training loss in VAE - #5 by rgreen1995
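The effect of the observation scale can be checked directly. With scale 0.5, even a perfectly reconstructed pixel scores a negative log density, so the summed likelihood (and hence the reported loss) stays in the familiar sign regime; a quick sketch, no Pyro needed:

```python
import math

def peak_log_density(scale):
    # Highest achievable per-pixel log density under Normal(loc, scale),
    # attained when the reconstruction matches the pixel value exactly.
    return -math.log(scale) - 0.5 * math.log(2 * math.pi)

print(peak_log_density(0.1))  # ~ +1.384: well-fit pixels add positive mass
print(peak_log_density(0.5))  # ~ -0.226: even a perfect fit scores negative
```

The crossover is at scale = 1/sqrt(2*pi) ~ 0.399: below that, well-fit pixels push the loss negative; above it, they cannot.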

> My intention was to compare which model was better using the loss. How can this be done?

While ELBO has been used for model selection in practice (see Beal, 2002), I’m not sure if it is theoretically very well grounded.