Difference in BNN results between Pyro and NumPyro

I’m comparing Pyro and NumPyro on a 1-layer (50 hidden dim) BNN with an AutoNormal guide, and I’m seeing substantially better results with Pyro on the UCI datasets with the standard splits from *On the Importance of Strong Baselines in Bayesian Deep Learning*. The same trend holds across all of the UCI regression benchmarks. If I use AutoDelta, the difference disappears and the results are on par with the literature.

Does Pyro implement optimizations not in NumPyro that could explain the difference?

Pyro Yacht results

Like is short for likelihood and corresponds to samples from the `sample` site `y`; loc corresponds to the network outputs, i.e., the `deterministic` site `y_loc` in the model below.

NumPyro Yacht results

Notice that with NumPyro the precision is lower and the network outputs are further from the ground truth than with Pyro. Both would lead to poor performance, but it is not clear whether one causes the other.

Training

```python
from numpyro import prng_key
from numpyro.handlers import seed
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal
from numpyro.optim import Adam


def train_svi(x, y):
    with seed(rng_seed=0):  # changing the seed doesn't affect the difference
        svi = SVI(bnn, AutoNormal(bnn), Adam(1e-3), Trace_ELBO())
        res = svi.run(prng_key(), STEPS, x, y, subsample=100)  # the Pyro version uses `step`

    return svi, res
```

BNN

```python
import jax.numpy as jnp
from jax import nn
from numpyro import deterministic, plate, sample
from numpyro.distributions import Gamma, Normal


def bnn(x, y, subsample):
    """BNN described in Appendix D of [1].

    **References:**
    1. *Understanding the Variance Collapse of SVGD in High Dimensions*
       Jimmy Ba, Murat A. Erdogdu, Marzyeh Ghassemi, Shengyang Sun,
       Taiji Suzuki, Denny Wu, Tianzong Zhang
    """
    hdim = 50
    prec = sample("prec", Gamma(2.0, 2.0))  # observation noise precision

    w1 = sample("nn_w1", Normal(0.0, 1.0).expand((x.shape[1], hdim)).to_event(2))  # prior on layer-1 weights
    b1 = sample("nn_b1", Normal(0.0, 1.0).expand((hdim,)).to_event(1))  # prior on layer-1 bias

    w2 = sample("nn_w2", Normal(0.0, 1.0).expand((hdim, hdim)).to_event(2))  # prior on layer-2 weights
    b2 = sample("nn_b2", Normal(0.0, 1.0).expand((hdim,)).to_event(1))  # prior on layer-2 bias

    w3 = sample("nn_w3", Normal(0.0, 1.0).expand((hdim,)).to_event(1))  # prior on output weights
    b3 = sample("nn_b3", Normal(0.0, 1.0))  # prior on output bias

    with plate(
        "data", x.shape[0],
        subsample_size=subsample if subsample is not None else x.shape[0],
    ) as idx:
        x_batch = x[idx]
        y_batch = y[idx] if y is not None else y

        # two hidden layers with ReLU activations
        loc_y = deterministic(
            "y_loc", nn.relu(nn.relu(x_batch @ w1 + b1) @ w2 + b2) @ w3 + b3
        )

        sample("y", Normal(loc_y, jnp.sqrt(1 / prec)), obs=y_batch)
```

One difference I’m aware of: NumPyro’s `AutoNormal` uses the `init_to_uniform` init strategy by default, whereas Pyro’s uses `init_to_feasible`.

I see, thanks. That made a large difference, though the results are still not quite on par.

NumPyro with `init_to_feasible`

Do you think the difference is significant now? If not, it likely comes down to optimizer parameters, numerical computations, or similar differences between the frameworks.

The gap is still fairly large; however, so was the improvement, so you are likely right that it comes down to numerics and the like rather than some significant departure between the frameworks.