I've been playing around with Pyro for a bit. The code works fine and the results look great, but the ELBO loss is always quite high. When the objective is to generate images, the loss returned by
svi.step() seems to be summed over the entire batch and over all pixels. In my previous experience with PyTorch, the default reduction is usually the mean over all elements rather than the sum.
For 64x64 images and a batch size of 32, the sum is larger than the mean by roughly 5 orders of magnitude (64 * 64 * 32 = 131072 elements per channel), and the gradients used in the update step scale the same way. However, the tutorials (VAE, AIR, etc.) all use a learning rate of
1e-3 with the Adam optimizer and get great results. Am I misunderstanding something here?
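
For reference, the arithmetic behind that estimate can be sketched as below. This is only an illustration in plain Python: it assumes single-channel images and a made-up constant per-pixel loss of 0.5, neither of which comes from my actual model or from Pyro itself.

```python
import math

# Sizes from the post; a single channel is an assumption.
height, width, batch_size = 64, 64, 32
num_elements = height * width * batch_size

# Hypothetical constant per-pixel loss, chosen only for illustration.
per_pixel_loss = 0.5

# "Sum" reduction: add up the loss over the batch and all pixels.
summed_loss = per_pixel_loss * num_elements
# "Mean" reduction: average the loss over the batch and all pixels.
mean_loss = summed_loss / num_elements

# The two reductions differ by exactly the number of elements.
ratio = summed_loss / mean_loss
orders_of_magnitude = math.log10(ratio)

print("elements:", num_elements)
print("sum / mean:", ratio)
print("orders of magnitude:", round(orders_of_magnitude, 2))
```

The ratio depends only on the element count, not on the per-pixel value, so any gradient computed from the summed loss would be larger than the mean-based gradient by the same factor.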