# Proper implementation of Bayesian Regression

I’m rather new to the world of PyTorch, Pyro, and GPU computing, though not so new to Bayesian modeling. I’m unsure how to write a really basic Bayesian logistic regression for (efficient) use in Pyro.

I have looked at resources around here and at the docs but have not found a clear answer. Consider the code below and the following two issues:

1. The code works as is. But setting num_chains > 1 to sample from more than one Markov chain in the MCMC statement below doesn’t work (there is no error, but sampling never seems to start and no computation is done). Are there any known reasons why?

2. By adding “.cuda()” to all the tensors defined, I can get the code to run on the GPU. However, it is much slower than on the CPU, and the GPU is hardly used (less than 10% load). Granted, my problem is small (X has shape (500, 10)), but I’d like to know whether this is the proper way to implement a GPU model, or what else causes this slowdown on the GPU.

Using PyTorch 1.0.1, CUDA 10, Pyro 0.3.1.

Thanks for looking!

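    import numpy as np
    import torch
    import pyro
    from pyro.distributions import Bernoulli, StudentT
    from pyro.infer.mcmc import MCMC, NUTS

    def bayes_logistic(X, y, loc_intercept, loc_beta, scale_intercept, scale_beta):
        # Student-t(3) priors on the intercept and the coefficient vector
        intercept = pyro.sample("intercept", StudentT(3, loc=loc_intercept, scale=scale_intercept))
        beta = pyro.sample("beta", StudentT(3, loc=loc_beta, scale=scale_beta))
        # one Bernoulli observation per row of X
        with pyro.plate("outcome", len(X)):
            pyro.sample("y_hat", Bernoulli(logits=intercept + X.matmul(beta)), obs=y)

    # df (features) and y (binary labels) are the poster's own data, not shown here
    df_tensor = torch.tensor(np.array(df)).float()
    y_tensor = torch.tensor(y.astype('uint8')).float()
    loc_beta = torch.zeros(df_tensor.size(1))
    scale_beta = torch.ones(df_tensor.size(1)) * 5
    loc_intercept = torch.zeros(1)
    scale_intercept = torch.ones(1)

    # the kernel definition was not shown in the post; a plain NUTS kernel is assumed
    nuts_kernel = NUTS(bayes_logistic)

    hmc_posterior = MCMC(nuts_kernel, num_samples=1000, warmup_steps=500).run(
        df_tensor, y_tensor,
        loc_intercept, loc_beta,
        scale_intercept, scale_beta)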

Sorry for the late reply on this.

• Did multiple chains not work on the CPU or CUDA? CUDA with multiprocessing has many known issues, and I think is best avoided, but CPU should work fine. Also what OS are you on?
• You can also set the default tensor type to CUDA by using torch.set_default_tensor_type("torch.cuda.FloatTensor"). For HMC, there is a significant overhead to running on the GPU. Unless you are operating on very large tensors and have many linear algebra computations, you are likely going to do better on the CPU.
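
As an alternative to changing the global default, the tensors from the original post can be created on an explicit device. The following is only a minimal sketch, not code from this thread: the random `X` and `y` are stand-ins for the poster's data, and `device` is a name chosen here that falls back to the CPU when CUDA is unavailable.

```python
import torch

# Pick the GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data with the shape mentioned in the question (500 rows, 10 features).
torch.manual_seed(0)
X = torch.randn(500, 10, device=device)
y = (torch.rand(500, device=device) < 0.5).float()

# Prior hyperparameters must live on the same device as the data;
# combining CPU and CUDA tensors in one operation raises a RuntimeError.
loc_beta = torch.zeros(X.size(1), device=device)
scale_beta = torch.ones(X.size(1), device=device) * 5
loc_intercept = torch.zeros(1, device=device)
scale_intercept = torch.ones(1, device=device)

# Same linear predictor as in the model: intercept + X @ beta.
logits = loc_intercept + X.matmul(loc_beta)
print(logits.shape, logits.device)
```

Creating everything with `device=device` keeps the same script usable for a CPU vs. GPU comparison by changing a single line.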

Thanks a bunch for the reply! Yes, sorry for not stating this: I’m running Pyro on Windows 7, fully aware of the difficulties this could mean for multicore processing.

1. Multiple chains did not work on either device. On GPU I get an immediate runtime error (cuda runtime error 71, operation not supported), on CPU I get no error message but sampling simply never starts and seems to hang. Any ideas what could be the cause and why a single chain can use all cores but two chains don’t even start?

2. Thank you for clearing that up. I feared as much (i.e. that the GPU is not worth it for such a problem when using HMC). Out of interest, can you explain where that overhead comes from or do you have some pointers where I could look into the reasons?

I would have some follow-up questions then:
Would using variational inference work better with less overhead on GPU?

I can get VI to work on the CPU using e.g. an AutoMultivariateNormal guide, but this breaks down on the GPU: “RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2”. This message led me to try to implement my own guide, which at least seems to run on the GPU but currently gives badly wrong results.
My question would be: are AutoGuides supposed to work with the GPU?

> Any ideas what could be the cause and why a single chain can use all cores but two chains don’t even start?

The multiprocessing functionality is experimental and not well tested on Windows (the docstring contains a disclaimer about this). I still think that with a couple of tweaks it might work on the CPU. The first thing to check is that it is actually hanging, and that this isn’t a problem with tqdm’s progress bar, which has been known to have issues on Windows. One way to check is to reduce the number of samples and see whether the process terminates within a reasonable time.
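
The check described above can be sketched with a small timing helper. This is only an illustration under stated assumptions: `finishes_within` and `tiny_run` are hypothetical names, and the body of `tiny_run` is a placeholder to be replaced with a short sampling run (for example 20 samples and 20 warmup steps with two chains).

```python
import threading
import time


def finishes_within(fn, timeout_s):
    """Run fn() in a daemon thread and report (status, elapsed_seconds)."""
    result = {}

    def target():
        try:
            fn()
            result["status"] = "finished"
        except Exception as exc:  # surface errors instead of mistaking them for a hang
            result["status"] = "raised " + type(exc).__name__

    worker = threading.Thread(target=target, daemon=True)
    start = time.perf_counter()
    worker.start()
    worker.join(timeout_s)  # wait at most timeout_s seconds
    elapsed = time.perf_counter() - start
    return result.get("status", "still running"), elapsed


def tiny_run():
    # Hypothetical stand-in: replace with a short sampling run, e.g.
    # MCMC(nuts_kernel, num_samples=20, warmup_steps=20, num_chains=2).run(...)
    time.sleep(0.05)


if __name__ == "__main__":
    status, elapsed = finishes_within(tiny_run, timeout_s=60.0)
    print(status, round(elapsed, 2))
```

If the tiny run finishes, the earlier "hang" was more likely slow sampling or a progress-bar display problem. Separately, Python's multiprocessing on Windows generally requires the script's entry point to sit under an `if __name__ == "__main__":` guard, as in the sketch; whether that is the cause here is not established in this thread.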

> Out of interest, can you explain where that overhead comes from or do you have some pointers where I could look into the reasons?

The more general issue of smaller models being faster on the CPU than on the GPU isn’t specific to Pyro. Individual CPU cores are much faster than individual GPU cores, and a GPU only pulls ahead when there are many matrix operations that can be parallelized. Judging by the GPU utilization you report, that isn’t the case here. You could increase the size of the tensors to see how the CPU vs. GPU comparison holds up. The other issue is that the leapfrog integrator is inherently sequential, so unless the GPU is actually faster per gradient computation, you won’t see any benefit in practice (and, going back to the earlier point, that is unlikely for small models).
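
To make the sequential dependency concrete, here is a minimal pure-Python sketch of a leapfrog integrator for a one-dimensional standard normal target. It is an illustration, not Pyro's implementation; `leapfrog`, `CountingGrad`, and the step settings are names and values chosen for this example.

```python
class CountingGrad:
    """Gradient of U(q) = q**2 / 2 (standard normal target) that counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, q):
        self.calls += 1
        return q


def leapfrog(q, p, grad_U, step_size, num_steps):
    """Integrate H(q, p) = U(q) + p**2 / 2 for num_steps leapfrog steps."""
    p = p - 0.5 * step_size * grad_U(q)      # initial half step for momentum
    for i in range(num_steps):
        q = q + step_size * p                # full position step uses the latest p
        if i < num_steps - 1:
            p = p - step_size * grad_U(q)    # full momentum step needs the new q
    p = p - 0.5 * step_size * grad_U(q)      # final half step for momentum
    return q, p


def energy(q, p):
    return 0.5 * q * q + 0.5 * p * p


grad = CountingGrad()
q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, grad, step_size=0.1, num_steps=20)
print("gradient calls:", grad.calls)
print("energy change:", energy(q1, p1) - energy(q0, p0))
```

Every gradient call needs the position produced by the previous update, so the calls within one trajectory cannot run in parallel. The wall-clock time of a trajectory is therefore roughly the number of gradient calls times the cost of one call, which is where per-call GPU overhead shows up for small models.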

We also haven’t done extensive profiling on the GPU, but I am interested in revisiting this to make sure that we aren’t leaving anything on the table.