GPU memory usage increasing across batches


#1

Hi, I have a model that is an expanded version of the SS-VAE example. The main change I’ve made is to replace the categorical latent “y” with a grid of Bernoulli latents.

As I increase the size of the input images, I’ve noticed that GPU memory usage grows as I process batches: it starts small and increases steadily until the GPU runs out of memory and the run crashes.

If I make all batches “supervised”, so that the latent y is always observed, the problem doesn’t occur. It only appears when executing the “unsupervised” loss function.

Does anyone have any idea what is happening?

Reading up on similar PyTorch problems, it sounds like the loss (and its computation graph) may be held onto for too long, so memory can’t be released at the end of a batch? (https://discuss.pytorch.org/t/cuda-memory-continuously-increases-when-net-images-called-in-every-iteration/501/5)
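To illustrate the kind of problem I mean: objects caught in a reference cycle aren’t freed by Python’s reference counting alone, so anything they hold (such as a loss tensor and its graph) stays alive until the cyclic collector runs. A pure-Python sketch (`Node` and `leak_a_cycle` are just illustrative names, not Pyro code):

```python
import gc

class Node:
    """Toy object that can participate in a reference cycle."""
    def __init__(self):
        self.ref = None

def leak_a_cycle():
    # a and b refer to each other, so their refcounts never reach zero
    # when this function returns -- only the cyclic GC can free them.
    a, b = Node(), Node()
    a.ref, b.ref = b, a

def cycle_is_reclaimed_by_collect():
    gc.disable()          # mimic the cyclic GC not having run between batches
    gc.collect()          # start from a clean slate
    leak_a_cycle()        # two Nodes are now unreachable but still allocated
    freed = gc.collect()  # a manual pass finds and frees the cycle
    gc.enable()
    return freed          # number of unreachable objects the pass reclaimed
```

If a loss tensor were hanging off such a cycle, its GPU memory would accumulate across batches in exactly this way until a collection pass happened to run.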


#2

Hi Paul,

We’ve also seen some memory growth in some Pyro models, and we’re planning to do some memory profiling before the Pyro 0.2 release to resolve these issues.

In the short term, could you try inserting gc.collect() calls every few training iterations? Let me know if this helps; if it does, it suggests we have reference cycles, which will help us locate the leak.
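Concretely, something like this (a minimal sketch; `step_fn` stands in for whatever runs one training step, e.g. svi.step(batch), and `collect_every` is a knob you can tune):

```python
import gc

def train(num_iters, step_fn, collect_every=50):
    """Run step_fn each iteration; periodically force a GC pass to break
    reference cycles that may be keeping loss graphs (and GPU tensors) alive."""
    passes = 0
    for i in range(num_iters):
        step_fn(i)  # one training step
        if (i + 1) % collect_every == 0:
            gc.collect()  # reclaims cyclic garbage that refcounting misses
            passes += 1
    return passes
```

Collecting every iteration would work too, but gc.collect() is not free, so every few dozen iterations is usually a good compromise.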

-Fritz


#3

Thanks Fritz. gc.collect() did the trick.