Thanks for your response @eb8680_2. I want to train an SVI instance on a machine with four GPUs. I’m more concerned about variance reduction during training than about accelerating the computation itself. On a single GPU, I can increase the number of particles used to estimate the ELBO (per epoch and per loaded data point) until I max out its memory. However, I would like to increase the total number of particles (per epoch and per data point) so that all four GPUs take part in estimating the ELBO gradient. For example: copy the same data to all four GPUs, draw 10 different particles on each one, and combine the results on GPU0 once all GPUs have finished.

Although the ELBO gradient is a much more complicated estimator, my main motivation comes from the simple case of estimating the mean of a Gaussian random variable: using N samples should, in theory, reduce the variance of the estimate by a factor of N (and its standard deviation by a factor of \sqrt{N}). Unfortunately, I don’t have enough memory on a single GPU to fit enough particles for an order-of-magnitude variance reduction.
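To make the intuition concrete, here is a toy sketch in plain Python (no Pyro, no GPUs). It assumes each "worker" stands in for one GPU and each "particle" is just an independent draw from a unit Gaussian, so the function names and the 4 x 10 setup are purely illustrative. Under those assumptions, averaging four independent 10-particle estimates should behave like one 40-particle estimate, with variance roughly \sigma^2 / 40 instead of \sigma^2 / 10.

```python
import random
import statistics


def particle_estimate(rng, num_particles, mu=0.0, sigma=1.0):
    """Average of num_particles Gaussian draws (toy stand-in for one estimate)."""
    return sum(rng.gauss(mu, sigma) for _ in range(num_particles)) / num_particles


def combined_estimate(rng, num_workers, particles_per_worker):
    """Average the per-worker estimates, as if each worker drew its own particles."""
    per_worker = [
        particle_estimate(rng, particles_per_worker) for _ in range(num_workers)
    ]
    return sum(per_worker) / num_workers


def empirical_variance(estimator, trials=20000):
    """Variance of the estimator across many independent repetitions."""
    return statistics.pvariance([estimator() for _ in range(trials)])


rng = random.Random(0)  # fixed seed so the comparison is reproducible
var_single = empirical_variance(lambda: particle_estimate(rng, 10))
var_combined = empirical_variance(lambda: combined_estimate(rng, 4, 10))

print(f"1 worker  x 10 particles: variance ~ {var_single:.4f} (theory 1/10)")
print(f"4 workers x 10 particles: variance ~ {var_combined:.4f} (theory 1/40)")
```

This only covers the idealized case of independent, identically distributed particles; whether the ELBO gradient estimate follows the same scaling in my model is exactly what I’d like to find out.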

In fact, I’ve noticed that when I increase the number of particles used for the gradient estimate, the GPU memory usage increases as well. Is there a way to keep the memory footprint from growing as num_particles increases?

I hope this clarifies things a bit more.

Thanks