TracePredictive worse than sampling guides?

druhe · February 6, 2019, 10:31am

Hey! First off: thanks to all the contributors to Pyro for allowing laymen like me to engage in the world of probabilistic programming. I have a question regarding making predictions.

What tutorial are you running?
The Bayesian Regression tutorial.
What version of Pyro are you using?
0.3.0
Please link or paste relevant code, and steps to reproduce.

When doing predictions, I’ve seen multiple approaches. In the tutorials, TracePredictive is used. However, I’ve also seen Jupyter Notebooks and blogposts in which the authors just sample the Guide n times and use those to predict. Which one is better? When applying both to my own dataset My r^2 increases with 0.15 when using the latter approach.

Code using the predictive distribution:

get_marginal = lambda traces, sites:EmpiricalMarginal(traces, sites)._get_samples_and_weights()[0].detach().cpu().numpy()

def summary(traces, sites):
    marginal = get_marginal(traces, sites)
    site_stats = {}
    for i in range(marginal.shape[1]):
        site_name = sites[i]
        marginal_site = pd.DataFrame(marginal[:, i]).transpose()
        describe = partial(pd.Series.describe, percentiles=[.05, 0.25, 0.5, 0.75, 0.95])
        site_stats[site_name] = marginal_site.apply(describe, axis=1) \
            [["mean", "std", "5%", "25%", "50%", "75%", "95%"]]
    return site_stats

def wrapped_model(x_data, y_data):
    pyro.sample("prediction", Delta(model(*test_data.tensors)))

posterior = svi.run(*train_data.tensors)

# posterior predictive distribution we can get samples from
trace_pred = TracePredictive(wrapped_model,
                             posterior,
                             num_samples=1000)
post_pred = trace_pred.run(x_test.tensors[0], None)
post_summary = summary(post_pred, sites= ['prediction'])
y = post_summary["prediction"]
predictions = pd.DataFrame({
    "y_mean": y["mean"],
    "y_std": y["std"],
    "y_perc_5": y["5%"],
    "y_perc_95": y["95%"],
    "true_los": y_data.squeeze(),
})

r2_score(predictions['true_los'], predictions['y_mean'])

Out: 0.306

Sampling the guide:

preds = []

for i in range(1000):
    sampled_reg_model = guide(None, None)
    pred = sampled_reg_model(X_test_tensor).data.numpy().flatten()
    preds.append(pred)

all_preds = np.stack(preds).mean(0)
r2_score(y_test, all_preds)

Out: 0.472

Help would be greatly appreciated!

Thanks.

neerajprad · February 7, 2019, 12:44am

Which one is better? When applying both to my own dataset My r^2 increases with 0.15 when using the latter approach.

Both should be the same. For SVI, you can sample latents from the guide and use these to replay against your model. I am not sure what is the guide doing in your case, but generally I think you will have to run something like the following to generate samples from your predictive distribution.

for _ in range(1000):
    guide_trace = guide(x)
    # assuming that the original model took in data as (x, y) where y is observed
    preds.append(poutine.replay(model, guide_trace)(x, None))

TracePredictive will be doing the same behind the scenes, but by default it only samples 10 values from the empirical distribution. Did you change num_samples to 1000 in your call to SVI?

druhe · February 7, 2019, 9:00am

Hi Neeraj,

Thanks for helping out! I modified your code a bit and managed to get the same performance using poutine approach (although I’m not entire sure what is meant by ‘replaying’ the guides against the model). Here is the code I used for that:

for _ in range(1000):
    guide_trace = poutine.trace(guide).get_trace(X_test_tensor, None)
    # assuming that the original model took in data as (x, y) where y is observed
    
    lifted_reg_model = poutine.replay(model, guide_trace)
    
    preds.append(lifted_reg_model(X_test_tensor, None))

I still find it strange that the built-in TracePredictive wasn’t working properly. I think I was already sampling 1000 times?

FYI, my guide and model:

def guide(X_data, y_data):
    # Create unit normal priors over the parameters
    weight_loc = torch.zeros_like(regression_model.linear.weight)
    weight_scale = torch.ones_like(regression_model.linear.weight)
    bias_loc = torch.zeros_like(regression_model.linear.bias)
    bias_scale = torch.ones_like(regression_model.linear.bias)
    mw_param = pyro.param("guide_mean_weight", weight_loc)
    sw_param = softplus(pyro.param("guide_log_scale_weight", weight_scale))
    mb_param = pyro.param("guide_mean_bias", bias_loc)
    sb_param = softplus(pyro.param("guide_log_scale_bias", bias_scale))
    # gaussian guide distributions for w and b
    w_dist = Normal(mw_param, sw_param).to_event(1)
    b_dist = Normal(mb_param, sb_param).to_event(1)
    
    dists = {'linear.weight': w_dist, 'linear.bias': b_dist}
    
    # overloading the parameters in the module with random samples from the guide distributions
    lifted_module = pyro.random_module("module", regression_model, dists)
    # sample a regressor
    return lifted_module()

def model(X_data, y_data):
    # Create unit normal priors over the parameters
    weight_loc = torch.zeros_like(regression_model.linear.weight)
    weight_scale = torch.ones_like(regression_model.linear.weight)
    bias_loc = torch.zeros_like(regression_model.linear.bias)
    bias_scale = torch.ones_like(regression_model.linear.bias)
    w_prior = Normal(weight_loc, weight_scale).to_event(1)
    b_prior = Normal(bias_loc, bias_scale).to_event(1)
    priors = {'linear.weight': w_prior, 'linear.bias': b_prior}
    # lift module parameters to random variables sampled from the priors
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a regressor (which also samples w and b)
    lifted_reg_model = lifted_module()

    with pyro.plate("map"):
        # run the regressor forward conditioned on inputs
        prediction_mean = lifted_reg_model(X_data).squeeze()
        
        pyro.sample("obs", Normal(prediction_mean, 1),
                    obs=y_data)
        
        return prediction_mean

Cheers,
Druhe

neerajprad · February 7, 2019, 7:08pm

Hi Druhe,
I see what is perhaps causing the confusion. When you pass in num_samples=1000 to TracePredictive, it randomly samples from the empirical distribution over the approximate posterior to generate samples from the predictive distribution. It does this 1000 times. However, unless you had also invoked SVI(.., num_samples=1000) in your call to SVI, the empirical distribution from your approximate posterior itself only has 10 samples which is the default value for num_samples in SVI’s __init__. So you’ll be sampling with replacement from these 10 samples to generate 1000 samples! You can verify this also by looking at posterior.exec_traces, i.e. the second argument in TracePredictive which probably has a size of 10 right now.

lifted_reg_model = poutine.replay(model, guide_trace)

This line is substituting/replaying values from the approximate posterior when it encounters a pyro.sample() statement in the model. However in this case, you are actually sampling 1000 times from the approximate posterior, whereas with your TracePredictive method you are only effectively using 10 samples. I think the gap should close once you initialize SVI with num_samples=1000. Let me know if that doesn’t work.

neerajprad · February 8, 2019, 12:32am

We will also try to address this confusing behavior as part of Make AbstractInfer classes lazily consume traces · Issue #1725 · pyro-ppl/pyro · GitHub.

druhe · February 8, 2019, 8:28am

Hi Neeraj,

You were right, I was initialising SVI with n_samples=10.

It all works as expected now.

Thanks very much for helping out,
Druhe