Predictive model memory skyrockets

Hello, thank you very much for the nice library!

I have implemented the ProdLDA model (link to the Pyro tutorial) with 31000 legal documents for a big Public Administration (PA).

The main reason is to sort a huge mass of PDF documents accumulated over the years. The goal is to provide probabilistic tags (i.e. “theta” in LDA terminology) for each document. This should help the PA in searching for similar documents and using them as templates for future documentation.

Since it was missing from the original tutorial, I have added the following lines to the ProdLDA model in order to compute “theta” (as percentages) for each document.

import torch
from pyro.infer import Predictive

# sub-sample prodLDA's results via the posterior predictive
predictive = Predictive(model=prodLDA.model,
                        guide=prodLDA.guide,
                        num_samples=2000,
                        return_sites=["logtheta"])
samples = predictive(docs)

# extract "theta" (percentages): average "logtheta" over the samples,
# then softmax over the topic dimension -> shape [num_docs, num_topics]
theta_percentages_numpy = torch.nn.functional.softmax(
    torch.mean(samples.get("logtheta").cpu(), dim=0), dim=-1).numpy()
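
For completeness, this is roughly how I turn the resulting matrix into tags downstream (a minimal sketch; the top_k cutoff and the variable names below are just illustrative):

import numpy as np

# theta_percentages_numpy has shape [num_docs, num_topics]; each row sums to ~1
top_k = 3  # illustrative cutoff for the number of tags per document
top_topic_ids = np.argsort(theta_percentages_numpy, axis=-1)[:, ::-1][:, :top_k]
top_topic_probs = np.take_along_axis(theta_percentages_numpy, top_topic_ids, axis=-1)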

I am able to run the whole pipeline successfully with up to 25000 documents on a single GPU (16 GB). As a reference, with 25000 documents “svi.step()” consumes 3-4 GB for 2-3 hours at 60-70 % “Volatile GPU-Util” in “nvidia-smi”, while Predictive() consumes 12-13 GB at ~90 % “Volatile GPU-Util” and takes 2-3 minutes. Results are great!

Above 25000 documents the Predictive() step triggers an Out of Memory (OOM) error. I have noticed that with 31000 documents “svi.step()” still runs smoothly at around 4-5 GB for 3-4 hours. That’s why I can always get the “beta” (namely the wordcloud), regardless of the size of the data-set. Nevertheless, I do need the “theta” as well. Unfortunately, with 31000 documents Predictive() breaks the whole pipeline, because GPU memory consumption grows 3-4 times within a few minutes.

That’s unfortunate, since it looks like a waste of computational power: the final three minutes of computation spoil the preceding 3-4 hours…

As I have understood from the documentation, Predictive() is essentially a huge plate() on top of the actual model, which would explain the OOM error.
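
To make that concrete, here is a back-of-the-envelope estimate of the tensors involved (the number of topics and the vocabulary size below are placeholders for illustration, not my actual values):

num_samples = 2000
num_docs = 31000
num_topics = 20      # placeholder
vocab_size = 10000   # placeholder
bytes_per_float = 4

# stacked "logtheta" samples returned by Predictive(): [num_samples, num_docs, num_topics]
logtheta_gb = num_samples * num_docs * num_topics * bytes_per_float / 1e9

# a single forward pass over the whole corpus also materializes the decoder
# output of shape [num_docs, vocab_size], plus intermediate activations
decoder_pass_gb = num_docs * vocab_size * bytes_per_float / 1e9

print(f"logtheta samples: ~{logtheta_gb:.1f} GB, one decoder pass: ~{decoder_pass_gb:.1f} GB")

Even with these modest placeholder values, the collected samples alone approach 5 GB, before counting the model’s activations.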

Below are the strategies I have tried, unsuccessfully, to avoid the OOM with 31000 documents:

  1. Decreasing the “batch_size” (e.g. from 512 to 16). It works, but “svi.step()” becomes extremely slow (24 hours). I consider this a suboptimal strategy because during “svi.step()” the GPU works at only 10-20 % of its potential, leaving most of the “Volatile GPU-Util” idle for the whole 24 hours.

  2. I have tried to move Predictive() to the CPU. It works, but only single-threaded: it takes 8 hours to run Predictive(), whereas on the GPU it took 3 minutes! Setting “parallel=True” triggers an error, as mentioned in this post (link here). It looks like the problem is the “batch_size” inherited when I saved the trained “prodLDA” model.
    Ideal solution: it would have been wonderful to use “batch_size = 512” for “svi.step()” but a much smaller batch (even “batch_size = 1”) for Predictive(); see the first sketch after this list for what I mean. This would have used the GPU at its fullest potential. Unfortunately, this strategy does not seem possible/easy.

  3. I was thinking that the problem was GPU memory accumulating between “svi.step()” and “Predictive()”. I saved the ProdLDA model (not an easy task in itself…), deleted it and forced a hard GPU flush with “gc.collect()” and “torch.cuda.empty_cache()” before entering Predictive() (roughly the sequence in the second sketch after this list). The problem still persists, so I no longer think the issue is memory accumulating between the “svi.step()” and “Predictive()” steps.

  4. Use multiple GPUs via “horovod”. I have not tried it yet.
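
To illustrate point 2, this is roughly what I was hoping to do: keep “batch_size = 512” for training, but stream the documents through Predictive() in small chunks so that only one chunk lives on the GPU at a time. A minimal sketch, assuming prodLDA.model/guide accept a document batch of any size (which, as noted above, does not seem to be the case for my saved model); the chunk size is arbitrary:

import torch
from pyro.infer import Predictive

predictive = Predictive(model=prodLDA.model,
                        guide=prodLDA.guide,
                        num_samples=2000,
                        return_sites=["logtheta"])

chunk_size = 1000  # arbitrary, small enough to fit in GPU memory
logtheta_means = []
for start in range(0, docs.shape[0], chunk_size):
    chunk = docs[start:start + chunk_size]
    samples = predictive(chunk)
    # reduce over the sample dimension immediately and move the result off the GPU
    logtheta_means.append(samples["logtheta"].mean(dim=0).cpu())
    del samples
    torch.cuda.empty_cache()

theta_percentages_numpy = torch.nn.functional.softmax(
    torch.cat(logtheta_means, dim=0), dim=-1).numpy()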
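
For completeness, this is roughly the flushing sequence from point 3 (the saving call and the file name are just placeholders for whatever saving mechanism is actually used):

import gc
import torch
import pyro

# persist the learned parameters, drop the model, then flush the GPU caches
pyro.get_param_store().save("prodlda_params.pt")  # placeholder path
del prodLDA
gc.collect()
torch.cuda.empty_cache()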

Honestly, I am running out of ideas on how to solve the OOM error with Predictive().

In conclusion, it looks like the culprit of the GPU OOM error is Predictive().