Confused how to implement simple Bayesian NN

From the docs I gathered that to create a simple 2 hidden layer classifier with 3 inputs and 12 nodes in each layer, it looks like:

import torch.nn as nn
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

model = PyroModule[nn.Sequential](
    PyroModule[nn.Linear](3, 12),
    PyroModule[nn.Sigmoid](),
    PyroModule[nn.Linear](12, 12),
    PyroModule[nn.Sigmoid](),
    PyroModule[nn.Linear](12, 1),
    PyroModule[nn.Sigmoid]()
)
assert isinstance(model, nn.Sequential)
assert isinstance(model, PyroModule)

# Now we can be Bayesian about the weights in each layer.
# (nn.Linear stores its weight as [out_features, in_features], so the prior shapes match that.)
model[0].weight = PyroSample(
    prior=dist.Normal(0., 1.).expand([12, 3]).to_event(2))
model[2].weight = PyroSample(
    prior=dist.Normal(0., 1.).expand([12, 12]).to_event(2))
model[4].weight = PyroSample(
    prior=dist.Normal(0., 1.).expand([1, 12]).to_event(2))

I have no idea what the next step is after defining the network, and I can't seem to find a single full example of one. I have made a few other Pyro models, but am new to the nn module.

Hi @thecity2, you can find a full example of how to train a BNN in the Bayesian regression tutorial. Let me know if you need me to clarify anything.
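Roughly, the next step is to wrap the network in a model that adds a likelihood, pick an autoguide, and run SVI. A rough sketch along those lines (the Bernoulli likelihood and the bnn_model / x_data / y_data names are just illustrative placeholders, not from the tutorial):

import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal

def bnn_model(x, y=None):
    # Calling the PyroModule inside a model samples the weight priors defined above.
    probs = model(x).squeeze(-1)
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Bernoulli(probs=probs), obs=y)
    return probs

guide = AutoDiagonalNormal(bnn_model)
svi = SVI(bnn_model, guide, pyro.optim.Adam({"lr": 0.01}), Trace_ELBO())
for step in range(1000):
    loss = svi.step(x_data, y_data)  # x_data: [N, 3] inputs, y_data: [N] 0/1 labels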


Thank you @fehiepsi. I have been going through that example; I am trying to use subsampling in the plate:

def forward(self, x, y=None):
    sigma = pyro.sample("sigma", dist.Uniform(0., 10.))
    with pyro.plate("data", x.shape[0], subsample_size=10) as ind:
        mean = self.linear(x[ind]).squeeze(-1)
        obs = pyro.sample("obs", dist.Normal(mean, sigma), obs=y[ind])
    return mean

This gives the error:
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyro/poutine/trace_messenger.py in __call__(self, *args, **kwargs)
    164         try:
--> 165             ret = self.fn(*args, **kwargs)
    166         except (ValueError, RuntimeError) as e:

[... 56 frames omitted ...]

RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D
     Trace Shapes:         
      Param Sites:         
     Sample Sites:         
        sigma dist    |    
             value    |    
linear.weight dist 10 | 1 3
             value 10 | 1 3
  linear.bias dist 10 | 1  
             value 10 | 1  

What is the cause of this error?


I think performing self.linear(x[ind]) under the data plate will add an additional batch dimension to the weights. See this caution in the Pyro Modules tutorial. One solution you can try is to add a load_pyro_samples method and call it before the plate statement.
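For example, one way to restructure the forward pass is to run the network on the full x before the plate and only index into the result inside it (a rough sketch of that idea, not tested):

def forward(self, x, y=None):
    sigma = pyro.sample("sigma", dist.Uniform(0., 10.))
    # Run the linear layer *outside* the plate so the sampled weight keeps its 2D shape.
    mean = self.linear(x).squeeze(-1)
    with pyro.plate("data", x.shape[0], subsample_size=10) as ind:
        pyro.sample("obs", dist.Normal(mean[ind], sigma), obs=y[ind])
    return mean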

So what I ended up doing is using a DataLoader to train the model, thus avoiding the issue. It would be nice to know how to do it with subsampling, though.

using DataLoader to train the model

This is a better solution. Just make sure to scale your likelihood (see this tutorial for more explanation) with:

with pyro.poutine.scale(scale=num_full_data / batch_size):
    obs = pyro.sample("obs", dist.Normal(mean, sigma), obs=y_batch)

@fehiepsi I don’t see anything about pyro.poutine.scale in that link. Is that a replacement for using pyro.plate?

That tutorial explains why we need to scale. You can use the scale poutine as in my last comment.

Hi, I know this is an old topic, but I ran into the same problem today. I don't see how using a data loader solves the problem (as I am using one). I was hoping that someone has come up with a simple solution to this problem. Thanks!

I think he means that a dataloader does the subsampling/batching before you input the data into the model. This could be seen as better because the linear layers in the network only ever operate on the batches. In contrast, if you use subsampling in the plate, your linear layers have already performed computations on the full dataset, and the plate is then subsampling at that stage when doing its likelihood scoring.

Although the dataloader route can be more computationally efficient, your likelihood will assume the batch is the full dataset when constructing the ELBO loss, so it isn't properly scaled (i.e., it gives your priors too much weight and your data not enough weight); hence you need to poutine.scale the likelihood (scale it up) if you go this route.

I do happen to have a working Bayesian NN below that uses subsampling. I don't know why the above poster's version doesn't work, but this version does, though it may require additional tuning to get optimal performance. I've noticed that when many people talk about Bayesian deep learning, they don't necessarily mean just putting priors on the parameters of a neural network (that would be the simplest approach, I think). Although that can give good model performance in some areas, one shouldn't assume it automatically means you're correctly exploring/capturing the full posterior of a giant neural network (admittedly, this is an active research area). Deep kernel learning, SWAG/MultiSWAG, etc. also fall under Bayesian deep learning and can get really good performance without a bunch of tuning.

import torch.nn as nn
import torch.nn.functional as F
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal

class Bayesian_Network(PyroModule):
    def __init__(self, in_size, out_size):
        super().__init__()
        # Neural network layers (converts nn.Modules to PyroModules).
        self.fc1 = PyroModule[nn.Linear](in_size, 100)
        self.fc2 = PyroModule[nn.Linear](100, 150)
        self.fc3 = PyroModule[nn.Linear](150, 100)
        self.fc4 = PyroModule[nn.Linear](100, out_size)
        # Priors of parameters (replaces nn.Parameters with PyroSamples).
        self.fc1.weight = PyroSample(dist.Normal(0., 1.).expand([100, in_size]).to_event(2))
        self.fc1.bias = PyroSample(dist.Normal(0., 10.).expand([100]).to_event(1))
        self.fc2.weight = PyroSample(dist.Normal(0., 1.).expand([150, 100]).to_event(2))
        self.fc2.bias = PyroSample(dist.Normal(0., 10.).expand([150]).to_event(1))
        self.fc3.weight = PyroSample(dist.Normal(0., 1.).expand([100, 150]).to_event(2))
        self.fc3.bias = PyroSample(dist.Normal(0., 10.).expand([100]).to_event(1))
        self.fc4.weight = PyroSample(dist.Normal(0., 1.).expand([out_size, 100]).to_event(2))
        self.fc4.bias = PyroSample(dist.Normal(0., 10.).expand([out_size]).to_event(1))

    def forward(self, x, y=None):
        # Neural network computation.
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        mean = self.fc4(x).squeeze(-1)  # squeeze() makes `mean` 1D (instead of 2D with rightmost dim having size 1)
        # Prior of observation sigma.
        sigma = pyro.sample('sigma', dist.Uniform(0., 10.))
        # Likelihood.
        with pyro.plate('data', x.shape[0], subsample_size=1000) as ind:
            y_sub = None if y is None else y.index_select(0, ind)  # allow calling with y=None at prediction time
            obs = pyro.sample('obs', dist.Normal(mean.index_select(0, ind), sigma), obs=y_sub)
        return mean

# Train model.
pyro.clear_param_store()

bayesian_network = Bayesian_Network(5, 1)
guide = AutoNormal(bayesian_network)
optimizer = pyro.optim.Adam({'lr': 0.01})
svi = SVI(bayesian_network, guide, optimizer, Trace_ELBO())
for step in range(501):
    loss = svi.step(x, y) / y.numel()
    if step % 100 == 0:
        print(f"Step {step}, loss = {loss}")

@student_12 this is very helpful.

I don’t know why the above poster’s version doesn’t work

This is because the Bayesian NN computation is done in the plate context, thereby adding a dimension to the parameters of the BNN. The solution would be to apply the BNN before the plate context, but as you point out, this can be very wasteful in the case of subsampling.

In my case, I have a VAE, with a model like this

def model(x):
    batch_size = x.shape[0]
    with pyro.plate("data", batch_size), poutine.scale(scale=num_full_data / batch_size):
        z = pyro.sample("z", dist.Normal(torch.zeros(dim_z), torch.ones(dim_z)).to_event(1)
        x_mean, x_sd = decoder_net(z)
        pyro.sample("x", dist.Normal(x_mean, x_sd), obs=x)

Now if I make decoder_net Bayesian, I have to add a new plate

def model(x):
    batch_size = x.shape[0]
    with pyro.plate("latent", batch_size), poutine.scale(scale=num_full_data / batch_size):
        z = pyro.sample("z", dist.Normal(torch.zeros(dim_z), torch.ones(dim_z)).to_event(1)
    # compute x_mean and x_sd outside the plate context
    x_mean, x_sd = bayesian_decoder_net(z)
    # now compute the data likelihood
    with pyro.plate("data", batch_size), poutine.scale(scale=num_full_data / batch_size):
        pyro.sample("x", dist.Normal(x_mean, x_sd), obs=x)

One question I had is: are there any downsides of “splitting” these plates?

I’ve noticed when many talk about Bayesian deep learning, they don’t necessarily mean just putting priors on the parameters of a neural network

Yes, that’s a good point. My goal is just to add some regularization, not learn the posterior of the NN parameters. I know there are other ways to regularize the NN parameters, but I like specifying priors better than adding some terms to the SVI objective as explained in this tutorial.

@chvandorp
I don’t see anything wrong with using two separate plates on the same dimension like you’re doing; I’ve seen other models/have used models that do that (though not necessarily on the obs dim=-1 that you’re doing). It will just be putting one extra plate message/node in any trace you run on the model. But I would make sure to use mini-batching instead of subsampling so you don’t have to worry about the two plates not subsampling the same indices.

Also, if your decoder network wasn’t Bayesian, you could get rid of the poutine.scales (i.e., I don’t think your first model needs the scale) and put both the decoder and the obs pyro.sample in the same plate, since you’d only have local/obs-level latents in that case (the NN parameters would be treated as variational parameters instead of latent variables).