I’m comparing Pyro and NumPyro on a 1-layer (50 hidden dim) BNN with an AutoNormal guide and seeing substantially better results with Pyro on the UCI datasets with the standard splits from *On the Importance of Strong Baselines in Bayesian Deep Learning*. I see the same trend across all of the UCI regression benchmarks. If I use AutoDelta instead, the difference disappears, and the results are on par with the literature.

Does Pyro implement optimizations not in NumPyro that could explain the difference?

### Pyro Yacht results

*Like* is short for likelihood and corresponds to samples from the `sample` site **y**, and *loc* corresponds to the network outputs, i.e., the `deterministic` site **y_loc** in the network below.

### NumPyro Yacht results

Notice that the precision is lower, and the network locations are further from the ground truth, with NumPyro than with Pyro. Both lead to poor performance; however, it is not clear whether one causes the other.

### Training

```
from numpyro import prng_key
from numpyro.handlers import seed
from numpyro.infer import SVI, Trace_ELBO
from numpyro.infer.autoguide import AutoNormal
from numpyro.optim import Adam


def train_svi(x, y):
    with seed(rng_seed=0):  # changing the seed doesn't affect the difference
        svi = SVI(model, AutoNormal(model), Adam(1e-3), Trace_ELBO())
        res = svi.run(prng_key(), STEPS, x, y, subsample=100)  # the Pyro version uses svi.step
    return svi, res
```

### BNN

```
from jax import nn
import jax.numpy as jnp
from numpyro import deterministic, plate, sample
from numpyro.distributions import Gamma, Normal


def bnn(x, y, subsample):
    """BNN described in Appendix D of [1].

    **References:**
        1. *Understanding the Variance Collapse of SVGD in High Dimensions*,
           Jimmy Ba, Murat A. Erdogdu, Marzyeh Ghassemi, Shengyang Sun,
           Taiji Suzuki, Denny Wu, Tianzong Zhang
    """
    hdim = 50
    prec = sample("prec", Gamma(2.0, 2.0))  # observation precision
    w1 = sample(
        "nn_w1",
        Normal(0.0, 1.0).expand((x.shape[1], hdim)).to_event(2))  # prior on layer-1 weights
    b1 = sample("nn_b1", Normal(0.0, 1.0).expand((hdim,)).to_event(1))  # prior on layer-1 bias
    w2 = sample("nn_w2", Normal(0.0, 1.0).expand((hdim, hdim)).to_event(2))  # prior on layer-2 weights
    b2 = sample("nn_b2", Normal(0.0, 1.0).expand((hdim,)).to_event(1))  # prior on layer-2 bias
    w3 = sample("nn_w3", Normal(0.0, 1.0).expand((hdim,)).to_event(1))  # prior on output weights
    b3 = sample("nn_b3", Normal(0.0, 1.0))  # prior on output bias
    with plate(
        "data",
        x.shape[0],
        subsample_size=subsample if subsample is not None else x.shape[0],
    ) as idx:
        x_batch = x[idx]
        y_batch = y[idx] if y is not None else y
        # two hidden layers with ReLU activations
        loc_y = deterministic(
            "y_loc", nn.relu(nn.relu(x_batch @ w1 + b1) @ w2 + b2) @ w3 + b3
        )
        sample("y", Normal(loc_y, jnp.sqrt(1.0 / prec)), obs=y_batch)
```
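For clarity, the mean function this model computes (the `y_loc` site) is a two-hidden-layer ReLU network. A plain NumPy sketch of just that forward pass, with the input dimension and batch size chosen arbitrarily for illustration:

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def bnn_forward(x, w1, b1, w2, b2, w3, b3):
    # mirrors the deterministic "y_loc" computation in the model above
    h1 = relu(x @ w1 + b1)       # layer 1: (batch, hdim)
    h2 = relu(h1 @ w2 + b2)      # layer 2: (batch, hdim)
    return h2 @ w3 + b3          # output: (batch,)


rng = np.random.default_rng(0)
hdim = 50
x = rng.normal(size=(4, 6))                  # 4 inputs with 6 features
w1 = rng.normal(size=(6, hdim))
b1 = rng.normal(size=(hdim,))
w2 = rng.normal(size=(hdim, hdim))
b2 = rng.normal(size=(hdim,))
w3 = rng.normal(size=(hdim,))
b3 = 0.0
print(bnn_forward(x, w1, b1, w2, b2, w3, b3).shape)  # (4,)
```

In the model, a single draw of `(w1, b1, w2, b2, w3, b3)` from the guide produces one such mean vector; the observation noise is `Normal(loc, sqrt(1/prec))` around it.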