Model converges on CPU but not GPU

I am currently working on implementing quite a complex Bayesian hierarchical model in numpyro, which is being trained using NUTS. As expected, I see a considerable performance improvement when training on a GPU compared to a CPU. Unfortunately, when training on the GPU, some of the chains often get stuck at their initial values, meaning the model does not converge at all. When I train on the CPU, however, the model converges very well, with the expected hyperparameter values for this data set.

This occurs with both init_to_median() and init_to_sample(), and with the same random seed for the PRNG key across both the CPU and GPU runs. When I use init_to_value() and start the GPU training at roughly the expected parameter values, the model does then converge, but going forward this is not desirable, as it obviously won't generalise well to new data. Has anyone else encountered something similar, or does anyone know why training would behave so differently on GPU compared to CPU?
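For reference, a minimal sketch of the kind of setup described above; the model, data, and init values here are hypothetical stand-ins for the actual hierarchical model:

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS, init_to_median, init_to_value

# allow 4 parallel chains on CPU; must run before any JAX computation
numpyro.set_host_device_count(4)

def model(y):
    # hypothetical two-level stand-in for the real hierarchical model
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(5.0))
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)

y = jnp.array([0.8, 1.2, 0.9, 1.1])

# same PRNG key on both CPU and GPU; only the backend differs
kernel = NUTS(model, init_strategy=init_to_median())
# the workaround that converges on GPU (values are illustrative):
# kernel = NUTS(model, init_strategy=init_to_value(values={"mu": 1.0, "sigma": 0.2}))

mcmc = MCMC(kernel, num_warmup=500, num_samples=1000, num_chains=4)
mcmc.run(random.PRNGKey(0), y)
mcmc.print_summary()
```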

are you using 64-bit precision? differences between cpu and gpu computations tend to be larger at lower precision and smaller at higher precision.
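for anyone landing here later, a minimal sketch of how to switch on 64-bit precision in numpyro (the script name in the last line is hypothetical):

```python
import numpyro

# enable 64-bit floats/ints globally in JAX (and hence numpyro);
# this must run at the top of the script, before any arrays are created
numpyro.enable_x64()

# equivalent alternatives:
#   import jax
#   jax.config.update("jax_enable_x64", True)
# or, without touching the code:
#   JAX_ENABLE_X64=True python train.py
```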

No, I wasn't. I've now run it again on the GPU with 64-bit precision and it did converge, which is reassuring; thanks for the suggestion! The downside is that the runtime was roughly twice as long this way. Since the model does converge on the GPU with 32-bit precision given careful initialisation, I was curious whether there are any differences in how the initialisation works between CPU and GPU? That seems to be the main difference in this case.
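One way to probe this directly is to inspect the initial parameter values each backend actually produces. A sketch using numpyro's initialize_model utility, with a hypothetical stand-in model and a hypothetical script name check_init.py, run once per backend:

```python
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import init_to_median
from numpyro.infer.util import initialize_model

def model(y):
    # hypothetical stand-in for the real hierarchical model
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(5.0))
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)

y = jnp.array([0.8, 1.2, 0.9, 1.1])

# run this script twice and compare the printed values, e.g.
#   JAX_PLATFORM_NAME=cpu python check_init.py
#   JAX_PLATFORM_NAME=gpu python check_init.py
param_info, _, _, _ = initialize_model(
    random.PRNGKey(0), model, model_args=(y,), init_strategy=init_to_median()
)
print(jax.default_backend(), param_info.z)
```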

Thanks for your help!

i generally recommend using 64-bit precision for hmc/nuts.

numpyro code is generally agnostic to the hardware. that said, what the hardware does when it executes a floating point operation is up to it, and i generally don't have much insight into that. one thing that can happen is that gpu computations are sometimes performed in a non-deterministic way because that's faster, with the consequence that the order of aggregation isn't static. i suggest using google to learn more e.g.
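to make the aggregation-order point concrete, here's a tiny self-contained illustration (plain jax.numpy, nothing numpyro-specific): floating point addition is not associative, so a reduction whose order varies between runs or devices need not give bit-identical results.

```python
import jax.numpy as jnp

# the same three numbers summed in two different orders give
# different float32 results, because fp addition is not associative
a = jnp.float32(1e8)
b = jnp.float32(-1e8)
c = jnp.float32(1.0)

print((a + b) + c)  # 1.0 -- a and b cancel first, then c is added
print(a + (b + c))  # 0.0 -- c is absorbed into b at float32 precision

# over a long nuts trajectory, many such last-bit discrepancies in the
# potential energy and its gradients can compound, so a float32 chain
# may follow a different path on gpu than on cpu
```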