Motivation to use SteinVI over SVI in NumPyro?

Hi! I have been looking into the SteinVI code in the contributed module (Contributed Code — NumPyro documentation), and it seems very interesting. I know some papers are referenced there, but it would be great to have some comments or tips on the practical details.

Concretely:

  • Are there particular cases where SteinVI is expected to work better than SVI (or worse)?
  • What is the intuition behind kernel selection? numpyro.contrib.einstein.RBFKernel is the recommended default. When should we look for an alternative?
  • Are there rules of thumb for the number of particles and the other parameters?

Thanks!

cc @OlaRonning

Hi Juan,

Thanks for your interest and for asking here :smiley:

I have evidence that SteinVI is generally preferable to SVI for small BNNs (one hidden layer with 50-100 hidden units). I’m currently looking at SteinVI performance on ResNet-sized BNNs (11-25k params), but I don’t have results to share yet.

SteinVI with one particle is SVI. With multiple particles, we get a mixture approximation of the posterior. Because SVI tends to be mode-seeking with light tails, I would expect SteinVI to be preferable when we want mass-covering behavior; however, this is still only a hypothesis.

The kernel’s role is to ensure that mixture components don’t collapse onto each other and the nearest mode (and to smooth the score, but that’s less important). Because each particle parameterizes a guide, the desired kernel behavior is different from that in SVGD (and ASVGD). In particular, we want the kernel to ensure that the guides (parameterized by the particles) don’t overlap, not just that the particles repel each other.

The probability product kernel has the desired behavior of acting on guides; however, it only works for a mixture of Gaussians (AutoNormal). I have an idea for a kernel that works on a general guide, but I don’t understand it well enough to recommend it yet.
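For intuition, with rho = 1 the probability product kernel between two diagonal Gaussian guides has a simple closed form (the overlap integral of the two densities). A small sketch of that quantity, not the contrib implementation:

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def expected_likelihood_kernel(loc1, scale1, loc2, scale2):
    """Probability product kernel (rho = 1) between two diagonal Gaussian guides,
    i.e. the integral of N(x; loc1, scale1) * N(x; loc2, scale2) over x.

    It compares whole guides: it is large only when the two Gaussians overlap,
    not merely when their location parameters are close."""
    return jnp.prod(norm.pdf(loc1, loc2, jnp.sqrt(scale1**2 + scale2**2)))
```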

I’m currently recommending the RBF kernel because it’s my recommendation for the “simpler” SVGD and ASVGD (they correspond to SteinVI with AutoDelta, and AutoDelta with annealing, respectively; see the sketch below). However, it’s pretty simple to show that the RBF kernel (and probably most SVGD kernels) is actually ill-suited for SteinVI with guides that control the shape of the guide distribution (like the variance of a Gaussian).
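To make that correspondence concrete, the difference is essentially just which guide you hand to SteinVI. A toy sketch (the model is only a stand-in for whatever model you are fitting):

```python
import numpyro
import numpyro.distributions as dist
from numpyro.infer.autoguide import AutoDelta, AutoNormal

def model():
    # Stand-in model with a single latent site.
    numpyro.sample("theta", dist.Normal(0.0, 1.0))

svgd_like_guide = AutoDelta(model)   # particles are point estimates -> SVGD-like behavior
stein_mix_guide = AutoNormal(model)  # each particle carries a location *and* a scale,
                                     # so the kernel has to keep whole guides apart,
                                     # not just particle locations
```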

I’m aware of some research on kernel choice for SVGD, both vector- and matrix-valued kernels, where the matrix-valued ones allow pre-conditioning to correct poor geometry; I don’t have anything like that for SteinVI yet. If your model is Gaussian, use a linear kernel. Random feature kernels are mainly used because they are easy to work with theoretically. For the RBF kernel, bandwidth selection is important (just like for GPs). I currently only have the median heuristic implemented, but I plan to expand this.
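For reference, by the median heuristic I mean the usual SVGD-style bandwidth rule. A generic sketch (not the exact code inside RBFKernel):

```python
import jax.numpy as jnp

def median_heuristic_bandwidth(particles):
    """particles: (n, d) array of Stein particles."""
    n = particles.shape[0]
    # Pairwise squared Euclidean distances between particles.
    diffs = particles[:, None, :] - particles[None, :, :]
    sq_dists = jnp.sum(diffs**2, axis=-1)
    # Median squared distance, scaled so each particle "sees" roughly n neighbours.
    return jnp.median(sq_dists) / jnp.log(n + 1)

def rbf(x, y, bandwidth):
    # RBF kernel between two particles with the chosen bandwidth.
    return jnp.exp(-jnp.sum((x - y) ** 2) / bandwidth)
```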

I usually see quite good results with 3-5 Stein particles and ~100 ELBO draws (see the sketch below). I would not recommend changing the loss or repulsion temperature because SteinVI does not optimize an ELBO. For BNNs, you will want to initialize the network weights as you would for a regular NN (e.g., He init); this holds for SVI, SteinVI, and SVGD.
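Putting that together, a minimal setup would look roughly like this. The SteinVI constructor has changed between NumPyro releases, so treat the argument names (the explicit loss, num_particles, etc.) as a sketch and check the einstein docs for the signature in your version:

```python
from jax import random
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.contrib.einstein import SteinVI, RBFKernel
from numpyro.infer import Trace_ELBO
from numpyro.infer.autoguide import AutoNormal
from numpyro.optim import Adagrad

def model(y=None):
    # Toy model; substitute your BNN here.
    loc = numpyro.sample("loc", dist.Normal(0.0, 10.0))
    numpyro.sample("y", dist.Normal(loc, 1.0), obs=y)

stein = SteinVI(
    model,
    AutoNormal(model),
    Adagrad(0.5),
    Trace_ELBO(num_particles=100),  # ~100 ELBO draws per step
    RBFKernel(),                    # median-heuristic bandwidth by default
    num_particles=5,                # 3-5 Stein particles usually suffice
)
result = stein.run(random.PRNGKey(0), 1_000, y=jnp.array([0.3, -0.2, 0.1]))
```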

If you try out SteinVI, please let me know how it goes.


Thank you very much for sharing your thoughts! It is much appreciated! I will give it a try :slight_smile: