I’m rather new to the world of PyTorch, Pyro, and GPU computing, albeit not so new to Bayesian modeling. I’m unsure how to write down a really basic Bayesian logistic regression for (efficient) use in Pyro.
I have looked at resources around here and at the docs but have not found a definitive answer. Consider the following code and the following two issues:

The code works as is. But setting num_chains > 1 in the MCMC statement below, to sample from more than one Markov chain, doesn’t work: there is no error, but sampling never seems to start and no computation is done. Are there any known reasons why?

By adding “.cuda()” to all the tensors defined, I can get the code to run on the GPU. However, it is much slower than on the CPU, and the GPU is hardly used (less than 10% load). Granted, my problem is small (X has shape (500, 10)), but I’d like to know whether this is the proper way to put a model on the GPU, and if so, what else could cause the slowdown.
Using PyTorch 1.0.1, CUDA 10, Pyro 0.3.1.
Thanks for looking!
import numpy as np
import torch
import pyro
from pyro.distributions import Bernoulli, StudentT
from pyro.infer.mcmc import MCMC, NUTS

def bayes_logistic(X, y, loc_intercept, loc_beta, scale_intercept, scale_beta):
    # Student-t priors (3 degrees of freedom) on the intercept and coefficients
    intercept = pyro.sample("intercept", StudentT(3, loc=loc_intercept, scale=scale_intercept))
    beta = pyro.sample("beta", StudentT(3, loc=loc_beta, scale=scale_beta))
    # Bernoulli likelihood, one observation per row of X
    with pyro.plate("outcome", len(X)):
        pyro.sample("y_hat", Bernoulli(logits=intercept + X.matmul(beta)), obs=y)

# df holds the predictors, y the binary outcome
df_tensor = torch.tensor(np.array(df)).float()
y_tensor = torch.tensor(y.astype('uint8')).float()

# Prior locations and scales
loc_beta = torch.zeros(df_tensor.size(1))
scale_beta = torch.ones(df_tensor.size(1)) * 5
loc_intercept = torch.zeros(1)
scale_intercept = torch.ones(1)

nuts_kernel = NUTS(bayes_logistic, adapt_step_size=True)
hmc_posterior = MCMC(nuts_kernel, num_samples=1000, warmup_steps=500).run(
    df_tensor, y_tensor,
    loc_intercept, loc_beta,
    scale_intercept, scale_beta)
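To make the second issue concrete, the GPU variant of the tensor setup looks roughly like the sketch below. It is not the exact script: random placeholder data stands in for `df` and `y`, and the `device` variable with its `torch.cuda.is_available()` fallback is an addition for illustration, so the snippet also runs on a machine without a GPU.

```python
import torch

# Assumption for illustration: fall back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data with the shape mentioned in the question: X is (500, 10).
torch.manual_seed(0)
X = torch.randn(500, 10).to(device)
y = torch.bernoulli(torch.full((500,), 0.5)).to(device)

# The prior parameters have to live on the same device as the data.
loc_beta = torch.zeros(X.size(1), device=device)
scale_beta = torch.ones(X.size(1), device=device) * 5
loc_intercept = torch.zeros(1, device=device)
scale_intercept = torch.ones(1, device=device)

# Same logits expression as in the model, evaluated at the prior locations.
logits = loc_intercept + X.matmul(loc_beta)
print(logits.shape, logits.device.type)
```

The model function and the MCMC call stay the same; only the tensors passed to `run` change.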