Weight decay during optimization

Should I use weight decay during SVI? Should I experiment with the optimizer options suggested in the PyTorch docs (torch.optim, PyTorch 1.11.0 documentation)?

I’ve been using Adam, only setting the learning rate.

I have not used weight decay. Has anyone else?

It depends on what you’re doing, but if you’re doing “canonical” probabilistic modeling you should probably not use weight decay, because doing so in effect changes your prior: the decay term pulls parameters toward zero on top of whatever prior your model already specifies. If you’re doing something wackier that is more along the lines of Bayesian deep learning, then all bets are off and do whatever works.
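To make the "changing your prior" point concrete, here is a minimal toy sketch in plain Python (no Pyro or PyTorch). It assumes a point-estimated parameter (MAP), a Normal prior, and L2-style weight decay that adds `weight_decay * theta` to the gradient, which is how the `weight_decay` argument of optimizers like `torch.optim.Adam` behaves. The function name, data, and numbers are made up for illustration. With a full variational guide the decay acts on the variational parameters instead, so the correspondence is not this clean.

```python
def map_estimate(data, prior_precision, weight_decay=0.0, lr=0.01, steps=5000):
    """Gradient descent on the negative log joint of the toy model
    theta ~ Normal(0, 1 / prior_precision), y_i ~ Normal(theta, 1).
    (Hypothetical helper, for illustration only.)"""
    theta = 0.0
    for _ in range(steps):
        # Gradient of the negative log likelihood plus negative log prior.
        grad = sum(theta - y for y in data) + prior_precision * theta
        # L2-style weight decay: add weight_decay * theta to the gradient.
        grad += weight_decay * theta
        theta -= lr * grad
    return theta


data = [1.8, 2.1, 2.4, 1.9, 2.3]

# The prior you wrote down, no weight decay.
no_decay = map_estimate(data, prior_precision=1.0)

# Same prior, but with weight decay switched on in the optimizer.
with_decay = map_estimate(data, prior_precision=1.0, weight_decay=4.0)

# No weight decay, but a tighter prior (precision 1.0 + 4.0).
tighter_prior = map_estimate(data, prior_precision=5.0)

print(no_decay, with_decay, tighter_prior)
```

In this toy setting `with_decay` and `tighter_prior` land on the same estimate, while `no_decay` sits closer to the data mean. In other words, turning on weight decay here is indistinguishable from silently fitting a model with a narrower prior than the one you specified.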