Memory usage grows with iterations in svi.step()

  • What tutorial are you running?
  • What version of Pyro are you using?
  • Please link or paste relevant code, and steps to reproduce.

Hello,
I have been testing a model somewhat similar to the GMM found in the tutorial https://pyro.ai/examples/gmm.html. Since I am using larger datasets, I need to run many more iterations of svi.step(). However, I ran into memory problems: with many iterations of svi.step(), memory use gradually increases.

I tested this out with the exact code from the GMM tutorial to make sure it was not an issue with my model. The only two differences from the tutorial are that (A) the dataset was increased to 20,000 points instead of 5, and (B) the number of svi.step() iterations was set to 100,000. Memory use started increasing as early as the 1,000th iteration, and seemed to grow in jumps of ~0.01 GB. I don’t know whether the memory allocation ever stabilises (I can’t test that on my machine - it crashes at 64 GB, which is my limit).
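
To quantify the growth, here is a minimal sketch of per-iteration memory monitoring with the standard-library tracemalloc module. The `leaky_step()` function is a stand-in of my own for the real `svi.step(data)` call; it deliberately retains data on every call to mimic the growth I observed:

```python
import tracemalloc

history = []

def leaky_step():
    # Stand-in for svi.step() that deliberately retains data each call,
    # mimicking the gradual per-iteration growth seen in the tutorial run.
    history.append([0.0] * 1000)

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()
for step in range(1000):
    leaky_step()
    if step % 250 == 0:
        current, _ = tracemalloc.get_traced_memory()
        print(f"step {step}: {(current - baseline) / 1024:.0f} KiB above baseline")
tracemalloc.stop()
```

With the real `svi.step(data)` substituted in, the printed deltas should stay roughly flat if memory is stable, and climb steadily if something is being retained across steps.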

I tried two versions of Pyro, 1.8.1 and 1.8.0. They both result in the same issue.

I am wondering if this is an expected result, and if it is, what causes it? If it is unavoidable, what might be the best way to circumvent these memory problems?

Thank you very much, and any help is greatly appreciated.

P.S. Pyro is super cool.

Hannah

Hi @hhschede, the memory growth might be caused by that example’s saving of gradient norms:

from collections import defaultdict

import pyro

# Register hooks to monitor gradient norms.
gradient_norms = defaultdict(list)
for name, value in pyro.get_param_store().named_parameters():
    value.register_hook(lambda g, name=name: gradient_norms[name].append(g.norm().item()))

Does removing the hook reduce memory growth?
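
If you want to keep the diagnostics, one way to bound their memory is to cap the stored history. A minimal stdlib sketch (the deque cap and the `MAX_HISTORY` value are my suggestion, not part of the tutorial):

```python
from collections import defaultdict, deque

# Store gradient norms in fixed-length deques instead of ever-growing
# lists, so the bookkeeping uses constant memory across svi.step() calls.
MAX_HISTORY = 1000
gradient_norms = defaultdict(lambda: deque(maxlen=MAX_HISTORY))

# The hook body stays exactly as in the tutorial:
#   gradient_norms[name].append(g.norm().item())
# old entries are now discarded automatically.
for i in range(5000):
    gradient_norms["locs"].append(float(i))
print(len(gradient_norms["locs"]))
```

Alternatively, `value.register_hook(...)` returns a `torch.utils.hooks.RemovableHandle`, so keeping the handles and calling `handle.remove()` detaches the hooks entirely.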

Thanks for the response. Unfortunately this does not change the memory growth!

Hmm, does it help to call gc.collect() between steps? You could get an idea of which tensors are leaking by using this trick. :thinking:
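
The usual shape of that trick - diffing live-object counts between steps - can be sketched with only the standard library (the `leak` list here is a stand-in for whatever a step might retain):

```python
import gc
from collections import Counter

def type_counts():
    # Snapshot of live objects by type name; diffing two snapshots taken
    # around svi.step() shows which object types are accumulating.
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())

before = type_counts()
leak = [list(range(10)) for _ in range(500)]  # stand-in for a leaking step
after = type_counts()
growth = {t: n - before[t] for t, n in after.items() if n - before[t] > 100}
print(growth)
```

With PyTorch available, filtering `gc.get_objects()` with `torch.is_tensor(obj)` and printing `obj.size()` narrows this down to the leaking tensors specifically.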

I had tried gc.collect() before, but it does not seem to change anything in this case.
Printing the tensors and their memory usage indicates that three float32 tensors consistently increase in size. I have no experience dealing with memory leaks :frowning:

Update: the tracemalloc package reports the following allocation growth:

/opt/miniconda3/envs/brainatlas/lib/python3.8/site-packages/pyro/ops/einsum/adjoint.py:37: size=518 KiB (+516 KiB), count=9036 (+9009), average=59 B
/opt/miniconda3/envs/brainatlas/lib/python3.8/site-packages/pyro/ops/einsum/torch_log.py:26: size=212 KiB (+211 KiB), count=3012 (+3003), average=72 B
/opt/miniconda3/envs/brainatlas/lib/python3.8/site-packages/pyro/infer/util.py:287: size=212 KiB (+211 KiB), count=3012 (+3003), average=72 B
/opt/miniconda3/envs/brainatlas/lib/python3.8/site-packages/pyro/infer/util.py:288: size=188 KiB (+188 KiB), count=3012 (+3003), average=64 B
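
Output like the above can be produced by diffing tracemalloc snapshots; a minimal sketch, where the bytearray allocations are just a stand-in for the work done by a batch of svi.step() calls:

```python
import tracemalloc

tracemalloc.start()
snap1 = tracemalloc.take_snapshot()
data = [bytearray(1024) for _ in range(100)]  # stand-in for svi.step() work
snap2 = tracemalloc.take_snapshot()
tracemalloc.stop()

# Print the top allocation sites by growth between the two snapshots.
for stat in snap2.compare_to(snap1, "lineno")[:4]:
    print(stat)
```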

This suggests that adjoint.py, torch_log.py, or infer/util.py contains the bug. If I am not mistaken, the modules within ops/einsum are related to summing across the enumeration axes.
Interestingly, I noticed that the memory issue only occurs when TraceEnum_ELBO is used, which would support that.

I can reproduce the strange behaviour on my machine (MacBook 2019 running macOS 10.15) with the GMM tutorial: memory usage just accumulates during training. Should an issue be opened on GitHub?
