Weights diverge when there are too many parameters

Hi everyone,

I have designed a simple Bayesian neural network with one hidden layer for a multi-class classification problem.

With a small number of neurons in the hidden layer (about 16), it works very well on very easy classification tasks.

When the classification problem gets harder and more neurons are needed in the hidden layer, the variational means and standard deviations of some neurons tend to infinity, leading to invalid values in the network output.

I could not find similar issues on the forum. Has anyone experienced this problem? Do you have any tips?

(I tried constraining the parameters to a bounded interval. That prevents the training stage from crashing, but then the loss does not converge…)
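
For reference, a minimal sketch of what such a constrained mean-field guide could look like in Pyro (the layer sizes, site names, and interval bounds below are placeholders, not my actual values):

```python
import torch
import pyro
import pyro.distributions as dist
from torch.distributions import constraints

def guide(x, y=None):
    in_dim, hidden = 4, 16  # placeholder sizes
    # Variational mean constrained to a bounded interval, and the
    # standard deviation kept positive and bounded above, so that
    # neither can run off to infinity during training.
    w1_loc = pyro.param("w1_loc", torch.zeros(hidden, in_dim),
                        constraint=constraints.interval(-5.0, 5.0))
    w1_scale = pyro.param("w1_scale", 0.1 * torch.ones(hidden, in_dim),
                          constraint=constraints.interval(1e-5, 5.0))
    # The site name must match the corresponding pyro.sample in the model.
    pyro.sample("w1", dist.Normal(w1_loc, w1_scale).to_event(2))
    # ... same pattern for the biases and the output layer ...
```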

Thank you very much.
Matheo

this is probably expected. unfortunately no one has invented a machine that magically does bayesian inference for large bayesian neural networks. if you want specific hints that might make things work a bit better in some intermediate regime that is still viable, you will need to provide more details.

Thanks for the reply.

Do you have any hints that would help me understand why increasing the size of the network makes variational inference fail? Any insightful papers?

if you want specific hints that might make things work a bit better in some intermediate regime that is still viable, you will need to provide more details.

What kind of information do you need?

Thanks,
Matheo

i see. the language you were using led me to believe you were using mcmc. one option might be to try tyxe, which is a bayesian neural network library built on top of pyro.
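
roughly, following the example in the tyxe readme (the network, data, and sizes below are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import pyro
import pyro.distributions as dist
import tyxe

# placeholder data: 1000 points, 4 features, 3 classes
x = torch.randn(1000, 4)
y = torch.randint(0, 3, (1000,))
loader = DataLoader(TensorDataset(x, y), batch_size=64)

# an ordinary deterministic network that tyxe turns into a bnn
net = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 3))

prior = tyxe.priors.IIDPrior(dist.Normal(0.0, 1.0))
# the likelihood takes the dataset size so the elbo is scaled correctly
likelihood = tyxe.likelihoods.Categorical(1000)
guide = tyxe.guides.AutoNormal
bnn = tyxe.VariationalBNN(net, prior, likelihood, guide)

optim = pyro.optim.Adam({"lr": 1e-3})
bnn.fit(loader, optim, 100)  # data loader, optimizer, number of epochs
```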

Thanks, it seems to work! I will have to take a look at the TyXe source code to understand its benefits and the differences from my own code.

At first sight, I don’t see what’s different. It seems that num_particles in the ELBO is 1 in both cases, and the guide seems the same, although I implemented mine from scratch whereas TyXe uses an autoguide.
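
For concreteness, this is roughly the setup on the Pyro side of my comparison (the model below is a toy stand-in, not my actual network):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal

def model(x, y=None):
    # toy stand-in: a single Bayesian linear layer, 4 features -> 3 classes
    w = pyro.sample("w", dist.Normal(torch.zeros(3, 4), 1.0).to_event(2))
    b = pyro.sample("b", dist.Normal(torch.zeros(3), 1.0).to_event(1))
    logits = x @ w.t() + b
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Categorical(logits=logits), obs=y)

# AutoNormal builds a mean-field Gaussian guide automatically; its
# init_scale argument (default 0.1) sets the initial posterior scales.
guide = AutoNormal(model, init_scale=0.1)
svi = SVI(model, guide, pyro.optim.Adam({"lr": 1e-3}),
          Trace_ELBO(num_particles=1))
```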

Do you see any obvious practical or theoretical reasons why TyXe works better?

Thanks a lot,
Matheo

well i don’t know anything about your code so it’s impossible to say. probably the “local reparameterization trick” plays a role, perhaps initialization, perhaps other factors.
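
for reference, the local reparameterization trick (kingma, salimans & welling, 2015) samples a layer’s pre-activations instead of its weight matrix, which lowers the variance of the elbo gradients. a minimal pytorch sketch of the idea (not tyxe’s actual implementation):

```python
import torch

def linear_local_reparam(x, w_loc, w_scale, b_loc, b_scale):
    # a factorized gaussian over the weights induces a gaussian over the
    # pre-activations, so we can sample those directly:
    #   mean = x @ mu^T + b_mu,  var = x^2 @ sigma^2^T + b_sigma^2
    act_loc = x @ w_loc.t() + b_loc
    act_var = (x ** 2) @ (w_scale ** 2).t() + b_scale ** 2
    return act_loc + act_var.sqrt() * torch.randn_like(act_loc)

# placeholder shapes: batch of 32, 4 inputs, 16 hidden units
x = torch.randn(32, 4)
h = torch.tanh(linear_local_reparam(
    x, torch.zeros(16, 4), 0.1 * torch.ones(16, 4),
    torch.zeros(16), 0.1 * torch.ones(16)))
```

if i remember its readme correctly, tyxe turns this on via the tyxe.poutine.local_reparameterization() context manager.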

Ok, thank you. I am slowly reading through the references in the Pyro SVI docs.