How does TyXe's evaluate() handle models?

cc: @karalets

I converted the resnet example to a Colab-friendly version so that I could play with the BNN it produces. Training on the CIFAR-10 dataset, the code reports a test error of about 8%.

When I manually recreate the functionality of the b.evaluate() function, I get a much higher error: 20-30%. I narrowed the cause down to not calling .eval() to freeze the batchnorm layers in resnet18. With .eval() applied, the errors are close to the evaluate() errors, roughly 9-10%.
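For anyone reproducing this: the discrepancy above is consistent with how BatchNorm behaves in the two modes. This isn't TyXe code, just a minimal PyTorch sketch with a single BatchNorm layer standing in for the ones inside resnet18, showing that train mode normalizes with the current batch's statistics while eval mode uses the stored running statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One BatchNorm layer stands in for those inside resnet18.
bn = nn.BatchNorm1d(4)

# Warm up the running statistics with a few "training" batches.
for _ in range(10):
    bn(torch.randn(32, 4) * 3 + 5)

x = torch.randn(8, 4) * 3 + 5

bn.train()
out_train = bn(x)  # normalizes with this batch's own mean/var
bn.eval()
out_eval = bn(x)   # normalizes with the accumulated running mean/var

# The two modes generally disagree, which is why skipping .eval()
# at test time changes the reported error.
print(torch.allclose(out_train, out_eval))
```

So a manual evaluation loop that leaves the network in train mode is effectively evaluating a slightly different function than evaluate() does, which would explain the 20-30% vs ~8% gap.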

My question: I couldn't see the model being put into .eval() mode anywhere in the evaluate() code. How is evaluate() handling batchnorm?