ELBO log_pdf overflow

I’ve encountered this error multiple times, but it is non-deterministic and only occurs occasionally.

Traceback (most recent call last):
  File "train.py", line 33, in <module>
    _, loss_dict = model.train(*data)
  File "/home/luoa/junting/video_prediction/models/air_base_model.py", line 389, in train
    loss = svi.loss_and_grads(svi.model, svi.guide, input, output)
  File "/home/luoa/anaconda3/lib/python3.6/site-packages/pyro/infer/elbo.py", line 65, in loss_and_grads
    return self.which_elbo.loss_and_grads(model, guide, *args, **kwargs)
  File "/home/luoa/anaconda3/lib/python3.6/site-packages/pyro/infer/trace_elbo.py", line 133, in loss_and_grads
    for weight, model_trace, guide_trace, log_r in self._get_traces(model, guide, *args, **kwargs):
  File "/home/luoa/anaconda3/lib/python3.6/site-packages/pyro/infer/trace_elbo.py", line 87, in _get_traces
    log_r = model_trace.log_pdf() - guide_trace.log_pdf()
  File "/home/luoa/anaconda3/lib/python3.6/site-packages/pyro/poutine/trace.py", line 71, in log_pdf
    site["value"], *args, **kwargs) * site["scale"]
  File "/home/luoa/anaconda3/lib/python3.6/site-packages/pyro/distributions/random_primitive.py", line 42, in log_pdf
    return self.dist_class(*args, **kwargs).log_pdf(x)
  File "/home/luoa/anaconda3/lib/python3.6/site-packages/pyro/distributions/distribution.py", line 185, in log_pdf
    return torch.sum(self.batch_log_pdf(x, *args, **kwargs))
RuntimeError: value cannot be converted to type double without overflow: -inf

I don’t know how to debug this, since sometimes when I re-run the same thing, the error might not occur. I’m using Adam optimizer with learning rate 2e-4 (or 1e-4).

The only thing that I can think of is that I use softplus to get standard deviation for sampling. I have code that looks like this:

x = self.model(...)
x_mu = x[:, :N]
x_sigma = F.softplus(x[:, N:])
x = pyro.sample('name', dist.normal, x_mu, x_sigma)

I think softplus is the standard way of doing this though. The VAE tutorial an AIR tutorial both use softplus to get the standard deviation. So this might not be the problem.

Any suggestions on how to debug this?

you can try doing the same thing in log space with the LogNormal distribution and see if that helps. you might also try clipping your softplus.

Actually, PyTorch’s softplus has a default threshold 20. It reverts to the linear function for inputs above the threshold. So I think the overflow error is from something else.

how big are x_mu and x_sigma? some more details on your model would be helpful

My model is a similar to a VAE. The x_mu and x_sigma have size (80, 256).

I just need some help on why this happens and how to debug. From the error message, the log_pdf reaches -inf, so the probability reaches 0. The only thing that might cause this (that I can think of) is the standard deviation, since I’m sampling from normal distribution. If x has a very negative value, then x_sigma will be 0.

this seems more like a more generic problem and not a pyro one. you should inspect the numerical values of things. how big/small are typical elements of x_sigma? how different are the x’s being scored in the log pdf from the mean x_mu? does this happen in one of the first iterations? if so it might be an initialization problem or you may have too high a learning rate? etc etc