High variance in PSIS Diagnostic

Using the following simple model, I get quite a bit of variance in the PSIS diagnostic.

import numpy as np
from torch import tensor

from pyro import distributions as dist
from pyro import sample, plate

from pyro.optim import Adam
from pyro.infer import autoguide, SVI, TraceMeanField_ELBO
from pyro.infer.autoguide import init_to_median
from pyro.infer.importance import psis_diagnostic

# Synthetic data: n draws from a Normal whose parameters we then try to recover
n = 1000
mu = 3.14
sigma = 2.7
y = tensor(np.random.normal(mu, sigma, n)).float()

def model(y):
    mu = sample("mu", dist.Normal(0, 1))
    sigma = sample("sigma", dist.Gamma(1, 1))
    # declare conditional independence across the n observations
    with plate("data", len(y)):
        sample("y", dist.Normal(mu, sigma), obs=y)

optimizer = Adam({"lr": 5e-2})
guide = autoguide.AutoDiagonalNormal(model, init_loc_fn=init_to_median())
svi = SVI(model, guide, optimizer, loss=TraceMeanField_ELBO())

for i in range(5000):
    if i % 100 == 0:
        print(i)
    svi.step(y)

psis_diagnostic(model, guide, y, num_particles=1000)
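
For reference, the scalar returned by psis_diagnostic is the estimated shape k-hat of a generalized Pareto distribution fitted to the largest importance ratios p(y, z)/q(z); the usual rule of thumb is that values below roughly 0.5 are good, 0.5 to 0.7 borderline, and above 0.7 a sign that the variational approximation is unreliable. Below is a minimal sketch of computing the underlying log weights by hand, reusing the model, guide, and data above. manual_log_weights is a helper introduced here, and scipy's maximum-likelihood genpareto.fit is only a rough stand-in for the regularized Pareto fit PSIS actually uses, so treat its output as illustrative.

import torch
from scipy import stats
from pyro import poutine

# Hand-rolled importance weights: draw z ~ q(z) from the guide, then score the
# same z under the model; log w = log p(y, z) - log q(z). This mirrors what
# psis_diagnostic does internally (modulo vectorization).
def manual_log_weights(model, guide, y, num_particles=1000):
    log_w = []
    for _ in range(num_particles):
        guide_trace = poutine.trace(guide).get_trace(y)
        model_trace = poutine.trace(
            poutine.replay(model, trace=guide_trace)
        ).get_trace(y)
        log_w.append(model_trace.log_prob_sum() - guide_trace.log_prob_sum())
    return torch.stack(log_w).detach()

log_w = manual_log_weights(model, guide, y, num_particles=1000)
ratios = torch.exp(log_w - log_w.max()).numpy()   # rescaled importance ratios
tail = np.sort(ratios)[-200:]                     # largest ~20% of the 1000 ratios
# ML fit of a generalized Pareto to the tail; its shape parameter plays the role
# of k-hat (PSIS uses a regularized fit, so this is only a rough cross-check)
k_hat, _, _ = stats.genpareto.fit(tail - tail[0], floc=0.0)
print(k_hat)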

Repeating the psis_diagnostic call several times gives:

In [102]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[102]: 0.12475437759230361

In [103]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[103]: 0.03406547836026321

In [104]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[104]: 0.15812618171283743

In [105]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[105]: 0.24965923846260274

In [106]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[106]: 0.08239155431661215

In [107]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[107]: -0.13488150137448549

In [108]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[108]: -0.16388070392834722

In [109]: psis_diagnostic(model, guide, y, num_particles=1000)
Out[109]: -0.0156825685265973

Is this expected behavior or a limitation of using PSIS?

Repeated with the same data?

I don't necessarily find this surprising, since in my experience the PSIS diagnostic is pretty high variance. It may be that there's something wrong with the PSIS code, but I'm not aware of any specific problems. It'd be interesting to see if you get similar results with Stan (AFAIK they implement PSIS); one way to cross-check from Python is sketched after this reply.

Do your results improve if you use more particles?
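
One way to cross-check Pyro's implementation without setting up Stan is to feed hand-computed log weights to ArviZ's psislw, an independent implementation of PSIS (the same smoothing behind Stan's loo package). This assumes ArviZ is installed, reuses the manual_log_weights helper sketched above, and assumes psislw returns the smoothed log weights together with k-hat.

import arviz as az

# fresh batch of particles, scored by hand as above
lw = manual_log_weights(model, guide, y, num_particles=10000).numpy()
smoothed_lw, k_hat = az.psislw(lw)  # assumed to return (smoothed log weights, k-hat)
print(float(k_hat))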

Yes, using the same data. It does seem to stabilize if you use more particles (see below), though I was surprised by how many are needed! One might need tens of thousands of particles if k-hat is near a decision threshold like 0.7. It would be interesting to compare with Stan; I'll try that at some point.

In [126]: tensor([psis_diagnostic(model, guide, y, num_particles=100) for i in range(100)]).std()
Out[126]: tensor(0.1876)

In [127]: tensor([psis_diagnostic(model, guide, y, num_particles=1000) for i in range(100)]).std()
Out[127]: tensor(0.1007)

In [128]: tensor([psis_diagnostic(model, guide, y, num_particles=10000) for i in range(100)]).std()
Out[128]: tensor(0.0557)
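
A small extension of the experiment above that makes the trade-off explicit: repeat the diagnostic at increasing particle counts until the spread is small relative to the gap between the mean k-hat and the 0.7 threshold. The repeat and particle counts below are arbitrary choices, and the largest setting may be slow.

for n_particles in (1_000, 10_000, 50_000):
    ks = tensor([psis_diagnostic(model, guide, y, num_particles=n_particles)
                 for _ in range(20)])
    # a threshold decision is only trustworthy once ks.std() is well below |0.7 - ks.mean()|
    print(n_particles, float(ks.mean()), float(ks.std()))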