Dear Pyro Community,

I am working on a Deep Markov Model similar to this example. Unlike that example, my observations are continuous rather than categorical. I am aware that for continuous variables the pdf of p(x|z) can grow without bound (approaching a Dirac delta), which makes the ELBO unbounded.
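To make the unboundedness concrete, here is a tiny plain-Python sketch (no Pyro, names are my own) showing that the Normal log-density evaluated at its own mean, -log(scale * sqrt(2*pi)), grows without bound as the scale shrinks:

```python
import math

def normal_log_pdf(x, loc, scale):
    """Log-density of a univariate Normal(loc, scale) at x."""
    return -0.5 * math.log(2 * math.pi * scale ** 2) - (x - loc) ** 2 / (2 * scale ** 2)

# At x == loc the log-density reduces to -log(scale * sqrt(2*pi)),
# so it diverges to +infinity as the scale goes to zero:
for scale in [1.0, 1e-2, 1e-4, 1e-8]:
    print(scale, normal_log_pdf(0.0, 0.0, scale))
```

This is exactly the term the ELBO's reconstruction part rewards, so shrinking the scale on well-fit observations can drive -ELBO toward -infinity.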

In my case, I modeled the DMM so that the Emitter module learns the *loc* and *scale* of each observation, just as the GatedTransition module does for the latent state. However, after some iterations I noticed that the model tends to shrink the scale of correctly/closely predicted observations even though it performs much worse on the others. In other words, overfitting to some observations compensates for performing worse on the rest, so the loss (-ELBO) keeps decreasing towards -infinity.

One possible way to overcome this is to discretize the continuous observations into bins and treat them as categorical variables, as was done in the original DMM paper, the Beta-VAE paper, the Info-VAE paper, etc. I tried this for my case as well, and it worked reasonably well.
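For context, my binning was roughly along these lines (a hypothetical NumPy sketch with made-up data and equal-width bins over [0, 1]; the bin count and range are placeholders):

```python
import numpy as np

# Hypothetical continuous observations, assumed to lie in [0, 1].
observations = np.array([0.03, 0.41, 0.77, 0.99, 0.52])
num_bins = 10
edges = np.linspace(0.0, 1.0, num_bins + 1)

# np.digitize returns 1-based bin indices; subtract 1 for zero-based
# category labels and clip so boundary values stay in range.
labels = np.clip(np.digitize(observations, edges) - 1, 0, num_bins - 1)
print(labels)  # one of num_bins categories per observation
```

The resulting integer labels can then be modeled with a Categorical likelihood, whose pmf is bounded by 1, so the ELBO stays bounded.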

However, I want to use the hidden states z for a classification task, and I want to exploit the full information in my observations (in other words, I don't want to lose information by discretizing them). I would highly appreciate any recommendations for using continuous observations in a DMM while keeping the ELBO bounded. One straightforward approach could be to enforce a minimum scale (standard deviation) so that the pdf can never exceed a certain value, but I wanted to get your suggestions before starting the implementation.
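For what it's worth, the minimum-scale idea I have in mind would look roughly like this (a framework-agnostic sketch; `MIN_SCALE` is a hypothetical hyperparameter, and in the actual Emitter the same transform would be applied to the network's scale head, e.g. with `torch.nn.functional.softplus`):

```python
import math

MIN_SCALE = 1e-3  # hypothetical floor; caps the pdf at ~1 / (MIN_SCALE * sqrt(2*pi))

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def bounded_scale(raw):
    """Map an unconstrained network output to a scale >= MIN_SCALE."""
    return MIN_SCALE + softplus(raw)

# Even for very negative raw outputs, the scale never collapses to zero:
for raw in [-20.0, 0.0, 3.0]:
    print(raw, bounded_scale(raw))
```

Since the Normal density at its mean is 1 / (scale * sqrt(2*pi)), flooring the scale this way directly bounds the per-observation log-likelihood and hence the ELBO.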

Thanks in advance!