NaNs in SVI when using SGD instead of Adam

This problem applies to both the VAE tutorial and to my own model, so I’m filing it under Tutorials.

The tutorial (and my model) work okay when I use the Adam optimizer, but when I try to use SGD they quickly run into NaNs in the log_prob calculation. In the case of SGD this seems to happen essentially upon initialization, which makes me think that something about how it gets started is incompatible with the structure of the VAEs.

I don’t understand why SGD would be incompatible with these models, but I haven’t dug too deeply into potential differences in how the two are running. When I set validate_args=True I immediately run into some errors, but that happens with Adam as well, so it doesn’t seem like the culprit.

what learning rate are you using? why would you expect SGD to work well?

It’s not that I necessarily expect it to work well, but I expected it to not fail immediately. Just trying different things out and was surprised there was such a difference.

I used a learning rate of 1e-4 for both when I had this issue. It looks like if I use SGD with {'lr': 1e-5, 'momentum': 0.9} it works without any NaNs. I’m a little confused why it fails so quickly and repeatedly with the higher learning rate.
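Poking at it a bit more, the sharp cutoff reminds me of the classic step-size stability threshold for gradient descent: on a badly scaled objective, plain SGD diverges geometrically (eventually overflowing to inf/nan) once the learning rate exceeds 2 divided by the largest curvature, and converges just below it. Here is a toy sketch with a made-up curvature value (not my actual model, just the mechanism):

```python
import math

def sgd_on_quadratic(lr, curvature=3e4, steps=2000, x0=1.0):
    """Plain gradient descent on f(x) = 0.5 * curvature * x**2.
    The update is x <- x * (1 - lr * curvature), which diverges
    whenever lr > 2 / curvature."""
    x = x0
    for _ in range(steps):
        grad = curvature * x        # f'(x)
        x = x - lr * grad
        if math.isinf(x) or math.isnan(x):
            return x                # blew up
    return x

# the stability threshold here is 2 / 3e4 ~ 6.7e-5
diverged = sgd_on_quadratic(lr=1e-4)   # just above threshold: overflows
converged = sgd_on_quadratic(lr=1e-5)  # just below threshold: decays toward 0
```

If something like this is going on, Adam’s per-parameter step-size normalization would make it much less sensitive to the threshold, which would be consistent with 1e-4 working for Adam but not for SGD.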

Is there a guide for debugging this kind of problem? I struggled a lot just tracking down where the NaN values were being generated (I’m still not totally sure to be honest).

i’m not aware of any guide. how do you initialize the variances of the normal distribution in the guide? it’s possible that SGD could perform slightly more robustly if the variances were initialized at particularly small values; also using something like a softplus parameterization could help. in any case there’s no reason to expect SGD to be particularly robust in these kinds of settings.
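to make the softplus suggestion concrete: instead of learning the scale directly under a positivity constraint, you can learn an unconstrained parameter and pass it through softplus, choosing the initial raw value so the transformed scale starts small. a framework-agnostic sketch of just the arithmetic (names and the 0.01 target are made up for illustration):

```python
import math

def softplus(x):
    # numerically stable log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def inv_softplus(y):
    # inverse of softplus for y > 0: log(exp(y) - 1)
    return math.log(math.expm1(y))

# pick the unconstrained init so the transformed scale starts small
target_scale = 0.01                    # hypothetical small initial std dev
raw_init = inv_softplus(target_scale)  # ~ -4.6
scale = softplus(raw_init)             # recovers ~0.01 at the first step
```

in a pyro guide this would look like registering the unconstrained raw parameter with pyro.param and applying torch's softplus before using it as the scale of the Normal, rather than registering the scale itself with a positivity constraint.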