How to get prediction scores with Pyro models?

Hello,
Say I have myPyroModel defined as follows:

myPyroModel

# define parameters for training      
guide = guides.AutoDiagonalNormal(myPyroModel)
optimizer_1 = Adam({"lr": 0.000000055})
# StepLR expects the torch optimizer class plus its args; step_size is a placeholder value
scheduler_1 = pyro.optim.StepLR({'optimizer': torch.optim.Adam,
                                 'optim_args': {'lr': 0.000000055},
                                 'step_size': 1})
svi = SVI(myPyroModel, guide, optimizer_1, loss=Trace_ELBO())

Given the svi, guide and myPyroModel, is there any way that I can calculate and extract the prediction scores (i.e. for y’s)?

Thank you,

Hi, can you clarify what you mean by “prediction scores” and “y”?

If you don’t mind a bit of unsolicited advice, I would also highly recommend reading through the SVI tutorial and the Bayesian regression tutorial. Especially relevant is this section on Bayesian regression with SVI, which is a complete worked example of SVI with a (much simpler) PyroModule-based model and an autoguide.

I think that might help clear things up conceptually for you - for example, based on this and other questions I suspect your guide is not being trained correctly, since it doesn’t seem like you’re including the likelihood (the data loss) in the ELBO via a pyro.factor or observed pyro.sample site analogous to the obs site in the Bayesian regression example linked above.

Hello,

  • What I am trying to do is train a Bayesian Pyro neural network. Based on the contents of the webpage Deep Markov Model — Pyro Tutorials 1.8.4 documentation, I don’t think I need to pass in any extra information for the ELBO … if my understanding is correct, svi.step() itself automatically calculates loss from the KL-divergence, so I really only need to perform the gradient descent by doing svi.step(), don’t I? Below is my code, including the full training loop as well as my attempt to calculate the prediction scores (see the second bullet below). If my code is still wrong, please correct me:
model = RobertaForMultipleChoice.from_pretrained('roberta-large')

module.to_pyro_module_(model)
        
model.roberta._dummy_param = nn.Parameter(
    torch.tensor(0.).to(dtype=model.dtype, device=model.device))

# Now we can attempt to be fully Bayesian:
for m in model.modules():
    for name, value in list(m.named_parameters(recurse=False)):
         if name != "_dummy_param":
              setattr(m, name, module.PyroSample(prior=dist.Normal(0, 1)
                                                 .expand(value.shape)
                                                 .to_event(value.dim())))        
# define guide     
guide = guides.AutoMultivariateNormal(model)
        
# parameters for training
optimizer = Adam({"lr": 0.000005200}) 
scheduler = pyro.optim.StepLR({'optimizer': torch.optim.Adam,
                               'optim_args': {'lr': 0.000005200},
                               'step_size': 1})  # step_size is a placeholder value
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())
 
# initialize the best_guide    
best_guide = None
best_svi_loss = float("inf")

# turn on a training mode
model.train()

for ep in range(epoch):
    # reset the running loss at the start of each epoch
    total_svi_loss = 0

    for i in range(num_iter):
        # calculate the loss and take a gradient step
        svi_loss = svi.step(input_ids = input,
                            attention_mask = attention_mask,
                            labels = mc_labels)

        # update the running total with the calculated loss
        total_svi_loss = total_svi_loss + svi_loss

        if i % log_interval == 0 and i > 0:
            cur_svi_loss = total_svi_loss / log_interval
            print('| epoch {:3d}  | loss {:5.4f} |'.format(
                    ep, cur_svi_loss))

            total_svi_loss = 0

            # track the best loss seen so far
            if cur_svi_loss < best_svi_loss:
                best_svi_loss = cur_svi_loss
                best_guide = guide  # note: this stores a reference, not a snapshot

### MAKING PREDICTIONS
# Turn on the evaluation mode
model.eval()

# calculate prediction scores
pred_obj = Predictive(model, guide=best_guide, num_samples = 100,
                      return_sites = ("_RETURN",))
samples = pred_obj(input_ids=test_input,
                   attention_mask = attention_mask_test)
prediction_scores = samples["_RETURN"].detach()
  • As for your comment, by “prediction scores”, I mean the following:
    suppose our task is classification, and we want to classify an observation into one of 4 possible classes. A vector of prediction scores is then a vector of length 4 (= number of classes) whose elements give the probability (or likelihood) that the observation falls into the corresponding class. For example, if the 3rd element of the prediction score vector for an observation A has the highest value among the 4 elements, the neural network predicts that A most likely belongs to the 3rd class. To make predictions, a neural network first computes prediction scores for all classes and then generates a prediction (y) by picking the index of the element with the highest score (see the small sketch after this list).

  • My last question is: if my model has both fixed (untrained) parameters (as in a frequentist model) and random parameters (whose distributions are also untrained), would it be okay to use only svi.step() to train the entire model (frequentist + Bayesian), or do I need to train the frequentist and Bayesian components of my model separately?
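
To illustrate the first bullet with made-up numbers, here is a tiny sketch of what I mean by “prediction scores” and picking the highest-scoring class:

import torch
import torch.nn.functional as F

logits = torch.tensor([0.2, -1.3, 2.1, 0.5])  # made-up scores for 4 classes
scores = F.softmax(logits, dim=-1)            # prediction scores (probabilities)
prediction = scores.argmax(dim=-1)            # predicted class index (here: 2)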

Thank you,

if my understanding is correct, svi.step() itself automatically calculates loss from the KL-divergence, so I really only need to perform the gradient descent by doing svi.step(), don’t I?

I’m afraid you’ve interpreted this incorrectly. Your code does not include a likelihood, so all it’s doing is training your guide to match the prior. You need to include a pyro.factor or pyro.sample statement in your model that computes the likelihood of your data; only then is your understanding correct:

class MyModel(PyroModule):

    def forward(self, ...):
        actual_outputs = self.my_actual_model(...)
        ...
        y_dist: Distribution = make_my_output_distribution(actual_outputs, ...)
        pyro.sample("y", y_dist, obs=y)  # or pyro.factor("y_loss", y_loss_tensor)
        return ...
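
For instance, a hypothetical 4-class classifier following this pattern might look like the sketch below (MyClassifier and backbone are placeholder names, not your actual model):

import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule

class MyClassifier(PyroModule):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone  # any module that returns per-class logits

    def forward(self, x, y=None):
        logits = self.backbone(x)  # shape: (batch, 4)
        with pyro.plate("data", logits.shape[0]):
            # this observed sample site puts the data likelihood into the ELBO
            pyro.sample("y", dist.Categorical(logits=logits), obs=y)
        return logits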

I am not familiar enough with the details of your model to fix the code for you, but this is covered in detail in the tutorials I linked to, especially the Bayesian regression with SVI section of the Bayesian regression tutorial. It is also discussed in the PyroModule tutorial.

do I need to train the frequentist and Bayesian components of my model separately?

No, as long as you correctly incorporate your data loss into the ELBO as described above. This is discussed in detail in the SVI tutorial I linked to.
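
As a minimal sketch of what that looks like (not your model): on a PyroModule, a plain nn.Parameter attribute is registered as a pyro.param and receives a point estimate, while a PyroSample attribute is a latent variable whose posterior is fit by the guide; a single svi.step() updates both.

import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class MixedModel(PyroModule):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(1))  # "frequentist" part: point estimate
        self.weight = PyroSample(dist.Normal(0., 1.))   # Bayesian part: gets a posterior

    def forward(self, x, y=None):
        mean = self.weight * x + self.bias
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Normal(mean, 0.1), obs=y)
        return mean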

prediction scores

I’m not sure what the output of your neural network is supposed to be and I’m not very familiar with BERT and related models, so I can’t say whether your code is correct. If your model returns prediction scores, then your code should probably work provided you’ve trained the model and guide correctly.
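
If it does, here is a sketch of how you might reduce the posterior samples to a single score vector (reusing your variable names, which I’m assuming exist):

predictive = Predictive(model, guide=best_guide, num_samples=100,
                        return_sites=("_RETURN",))
samples = predictive(input_ids=test_input, attention_mask=attention_mask_test)
mean_scores = samples["_RETURN"].mean(dim=0)  # posterior-mean prediction scores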

Also, another unsolicited observation - applied to your model, AutoMultivariateNormal creates O(n²) trainable variational parameters, where n is the number of weights, and must multiply an n × n matrix by an n-vector at each step of SVI. Are you sure that’s what you want, computationally and algorithmically?
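
If not, the cheaper autoguides are a one-line swap (the rank value below is just an illustrative choice):

guide = guides.AutoDiagonalNormal(model)                      # O(n) variational parameters
# or, as a middle ground between diagonal and full covariance:
guide = guides.AutoLowRankMultivariateNormal(model, rank=10)  # O(n * rank)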

More broadly, I think maybe you’re expecting a lot of intelligent BNN-specific behavior from pyro.nn.PyroModule and Pyro’s autoguides that just isn’t there. PyroModule isn’t very smart, and its primary purpose is to allow the use of the PyTorch JIT compiler to serve Pyro models, while the autoguide library wasn’t designed with BNNs specifically in mind.

Thank you again for your reply. I am trying to read the documentation and understand the code as much as I can, and I can follow all the math in the model descriptions. But I am struggling with how to implement my model in Python code, because I only started learning Python a couple of months ago and I haven’t done much Bayesian computing before. This task is (probably too much of) a big jump ahead for me, but I need to finish this project. Thank you for your patience.

I tried to modify my code as below. If you don’t mind, could you please take a look at my code and suggest any fixes if needed?

In particular, I am wondering whether I have done the enumeration right for my discrete latent variable y (every parameter of my model is continuous except y, which has a Multinomial distribution).

Thank you very much once again for your help.

model = RobertaForMultipleChoice.from_pretrained('roberta-large')

module.to_pyro_module_(model)
        
model.roberta._dummy_param = nn.Parameter(
    torch.tensor(0.).to(dtype=model.dtype, device=model.device))

# Now we can attempt to be fully Bayesian:
for m in model.modules():
    for name, value in list(m.named_parameters(recurse=False)):
         if name != "_dummy_param":
              setattr(m, name, module.PyroSample(prior=dist.Normal(0, 1)
                                                 .expand(value.shape)
                                                 .to_event(value.dim())))  

# add a likelihood function to the existing frequentist Transformer model.
class MyModel(PyroModule):
    
    def __init__(self,  model, name=""):
        self._pyro_name = name
        self._pyro_context = pyro.nn.module._Context()
        self._pyro_params = model.parameters()
        self._modules = model.modules()
        super(MyModel, self).__init__()

    def forward(self, model, input_ids, attention_mask, mc_labels = None):
        # retrieve prediction_scores (y)
        if mc_labels is not None:
            prediction_scores = model(input_ids=input_ids,
                                       attention_mask=attention_mask,
                                       mc_labels=mc_labels)[2]

            softmax_tensor = nn.Softmax(dim=-1)(prediction_scores)

            # one-hot encode the label, e.g. 2 -> [[0., 0., 1., 0.]]
            mc_label_tensor = nn.functional.one_hot(mc_labels,
                                                    num_classes=4).float()

        else:
            prediction_scores = model(input_ids=input_ids,
                                    attention_mask=attention_mask)[1]

            softmax_tensor = nn.Softmax(dim=-1)(prediction_scores)

            # no labels were passed: leave the site unobserved below
            mc_label_tensor = None
  
        # for each observation, y has 4 classes, hence a multinomial
        # distribution with total_count = 1 and
        # probs = softmax(prediction_scores)
        pyro.sample('y',
                    dist.Multinomial(1, probs = softmax_tensor),
                    obs = mc_label_tensor)

        return prediction_scores

### ERROR OCCURS HERE
my_model = MyModel(model)

# define guide     
guide = guides.AutoDiagonalNormal(poutine.block(my_model, hide = ['y']))
        
# parameters for training
optimizer = Adam({"lr": 0.000005200}) 
scheduler = pyro.optim.StepLR({'optimizer': torch.optim.Adam,
                               'optim_args': {'lr': 0.000005200},
                               'step_size': 1})  # step_size is a placeholder value
svi = SVI(my_model, guide, optimizer, loss=TraceEnum_ELBO(max_plate_nesting=0))
 
# initialize the best_guide    
best_guide = None
best_svi_loss = float("inf")

# turn on a training mode
my_model.train()

for ep in range(epoch):
    # reset the running loss at the start of each epoch
    total_svi_loss = 0

    for i in range(num_iter):
        # calculate the loss and take a gradient step
        svi_loss = svi.step(model, input_ids = input,
                            attention_mask = attention_mask,
                            mc_labels = mc_labels)

        # update the running total with the calculated loss
        total_svi_loss = total_svi_loss + svi_loss

        if i % log_interval == 0 and i > 0:
            cur_svi_loss = total_svi_loss / log_interval
            print('| epoch {:3d}  | loss {:5.4f} |'.format(
                    ep, cur_svi_loss))

            total_svi_loss = 0

            # track the best loss seen so far
            if cur_svi_loss < best_svi_loss:
                best_svi_loss = cur_svi_loss
                best_guide = guide  # note: this stores a reference, not a snapshot

### MAKING PREDICTIONS
# Turn on the evaluation mode
my_model.eval()

# calculate prediction scores
pred_obj = Predictive(my_model, guide=best_guide, num_samples = 100,
                      return_sites = ("_RETURN",))
samples = pred_obj(model, input_ids=test_input,
                   attention_mask = attention_mask_test)
prediction_scores = samples["_RETURN"].detach()

In particular, I am wondering whether I have done the enumeration right for my discrete latent variable y

y is observed, so there’s no need to enumerate it. The distinction between latent and observed variables is covered in the SVI intro tutorial I linked to above.
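
Since there are no enumerated latent variables left, you also don’t need TraceEnum_ELBO; the plain ELBO is enough (reusing your names):

svi = SVI(my_model, guide, optimizer, loss=Trace_ELBO())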

My reply on the other thread (How to avoid this error?) is helpful here too (unsurprisingly given I was trying to achieve the same thing). All I’ve added here is the Predictive bit again from the tutorials.

neural_network = nn.Sequential(
    nn.Linear(28 * 28, 100),
    nn.Sigmoid(),
    nn.Linear(100, 100),
    nn.Sigmoid(),
    nn.Linear(100, 1),
)

module.to_pyro_module_(neural_network)

for m in neural_network.modules():
    for name, value in list(m.named_parameters(recurse=False)):
        setattr(m, name, module.PyroSample(prior=dist.Normal(0, 1)
                                           .expand(value.shape)
                                           .to_event(value.dim())))

This bit is important

class BayesianNeuralNetwork(PyroModule):
    def __init__(self, neural_network):
        super().__init__()
        self.neural_network = neural_network

    def forward(self, x, y=None):
        sigma = pyro.sample("sigma", dist.Uniform(0., 10.))
        mean = self.neural_network(x).squeeze(-1)
        with pyro.plate("data", x.shape[0]):
            obs = pyro.sample("obs", dist.Normal(mean, sigma), obs=y)
        return mean

model = BayesianNeuralNetwork(neural_network)
 
guide = guides.AutoDiagonalNormal(model)

optimizer = Adam({"lr": 0.03})

svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

# dummy data matching the 28 * 28 input size of the first layer
X = torch.rand(85, 28 * 28)
y = torch.rand(85)

pyro.clear_param_store()
svi.step(X, y)

predictive = Predictive(model, guide=guide, num_samples=100,
                        return_sites=("obs", "_RETURN"))
samples = predictive(X)
mean = samples['_RETURN'].mean(dim=0)
sigma = samples['_RETURN'].std(dim=0)

Where mean is your prediction and sigma is the standard deviation of the predictions. Hope this helps @h56cho