Converting a Hugging Face transformer to a fully Bayesian model


I would like to convert a pretrained transformer (ideally GPT-2) to a Pyro neural network, and I would like that network to be fully Bayesian.

I have made several starter attempts, but I will hold off on posting them for now because I think I am approaching the problem incorrectly. My plan is roughly this workflow:

1: load transformer model
2: to_pyro_model(transformer)
3: guide = auto_diagonal_normal
4: apply SVI

All of my attempts have failed due to technical issues. Beyond that, I am unsure about my choice of guide: what is the best guide?

How do we best select a guide for a neural net, specifically a sequence model? What are the best practice and the theory here, and which guides provide which advantages?

Secondly, can someone please help me get this up and running? I’ve been dreaming about getting this working for months now. I’m really excited and, I won’t lie, a little frustrated. I’ve been trying to contact the poster who asked a similar question: Unable to do next(model.parameters()) with Pyro models

I believe the person in the topic you linked did a reasonable job setting up their model in the end, so a narrow, Hugging Face-specific answer to your second question would be to start from their more recent code snippets in that and other topics. Since you also asked for general advice, here’s a bit of context on why you’re having trouble and why that approach is unlikely to succeed:

Bayesian neural networks are a subject of active research. Broadly speaking, there are only two problem regimes where people who are not experts in this research can expect them to work reliably given current inference technology.

  1. When the number of datapoints is (ideally much) larger than the number of parameters, variational inference using local reparametrization with independent variational distributions per layer or per parameter may produce reasonable predictive uncertainty estimates.
  2. When running HMC for a long time is computationally feasible, i.e. when your model and dataset are small enough that you can run forward passes on your entire dataset and store many (tens to hundreds of) copies of your weights for prediction.
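
For concreteness, local reparametrization in regime 1 can be written as follows. This is a sketch for a single linear layer, assuming a fully factorized Gaussian variational distribution over its weights. The symbols (input activations $a_i$, weights $w_{ij}$, pre-activations $b_j$) are introduced here for illustration only.

```latex
\begin{align}
q(w_{ij}) &= \mathcal{N}\!\left(\mu_{ij},\, \sigma_{ij}^2\right),
  \qquad b_j = \sum_i a_i\, w_{ij} \\
\Rightarrow\quad b_j &\sim \mathcal{N}\!\left(\gamma_j,\, \delta_j\right),
  \qquad \gamma_j = \sum_i a_i\, \mu_{ij},
  \qquad \delta_j = \sum_i a_i^2\, \sigma_{ij}^2 \\
b_j &= \gamma_j + \sqrt{\delta_j}\;\epsilon_j,
  \qquad \epsilon_j \sim \mathcal{N}(0, 1)
\end{align}
```

Noise is drawn once per pre-activation and per datapoint instead of once per weight, which gives lower-variance gradient estimates than sampling a single weight matrix for the whole minibatch.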

Other approaches remain unproven at best and have generally not been evaluated or scaled up beyond small feed-forward networks on a few toy datasets.

Unfortunately, neither of these two regimes matches your particular problem (post-hoc calibration of a very large pretrained neural network). Even if it were much easier to construct such a model in Pyro, it is unlikely that variational inference would produce sensible posterior or predictive uncertainty estimates. Nor are there other off-the-shelf techniques, even non-Bayesian ones, that would be likely to do any better at calibrating something as large and complex as GPT-2.

For that reason the Pyro core team have tended not to invest a lot of developer time in Bayesian neural network tooling, though we’re certainly open to community contributions in this direction - see TyXe for an example of a Pyro library that addresses the problems of BNN prior and guide creation/initialization and automatic local reparametrization.



There is a recent paper that uses the local reparameterization trick (sparse variational dropout) on a Transformer, and the authors have open-sourced their code:


There is a recent paper that uses local reparameterization trick (sparse variational dropout) on Transformer

It’s certainly possible to apply standard Bayesian neural network approaches like variational dropout (a special case of my point 1 above) to large models and even approach comparable performance to the non-Bayesian case on the usual metrics like accuracy or test likelihood, but that doesn’t mean the resulting posterior or predictive uncertainty estimates are accurate or well-calibrated, and indeed the paper makes no such claim.

I don’t want to discourage people from playing around with BNNs, but when doing so it’s worth understanding that neural network calibration remains an open research problem with no easy solutions and adjusting expectations accordingly.
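
To make "well-calibrated" concrete, one common way to check it is the expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its accuracy. The sketch below is a plain-Python illustration. The function name, the equal-width binning, and the demo numbers are assumptions made here, not something taken from the paper or from Pyro.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: a model that is 95% confident but right half the time.
demo_confidences = [0.95, 0.95, 0.95, 0.95]
demo_correct = [True, True, False, False]
demo_ece = expected_calibration_error(demo_confidences, demo_correct)
```

A model can match the accuracy of its non-Bayesian counterpart and still have a large ECE, which is the distinction being drawn above.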


Oops, it looks like I missed the context :rofl:: Convert huggingface transformer model to bayesian nn

My purpose in bringing up this paper was to show that training a huge model like a Bayesian Transformer is possible and straightforward; a mean-field Gaussian could do the job.


Many thanks! Their approach is intuitive and their results are interesting. Is any implementation source code available?