I don't understand why my NUTS code is not working (Bayesian Methods for Hackers example)

Hi. I am a new user of Pyro.
I ported the Chapter 1 example of Probabilistic-Programming-and-Bayesian-Methods-for-Hackers to Pyro, but the code is not working well.

TensorFlow Probability code

I simply translated the TFP code into Pyro.

import torch
import pyro
import pyro.distributions as dist
from pyro.infer.mcmc import MCMC, NUTS

def model(data):
    # empirical rate for the Exponential priors on lambda1 and lambda2
    alpha = (1. / data.mean())

    lambda1 = pyro.sample("lambda1", dist.Exponential(rate=alpha))
    lambda2 = pyro.sample("lambda2", dist.Exponential(rate=alpha))

    # switch point, as a fraction of the series length
    tau = pyro.sample("tau", dist.Uniform(0, 1))

    # lambda1 before the switch point, lambda2 after it
    lambda_ = torch.gather(torch.tensor([lambda1, lambda2]), 0,
              (tau * data.size(0) <= torch.arange(len(data)).float()).long())

    with pyro.plate("data", data.size(0)):
        pyro.sample("obs", dist.Poisson(lambda_), obs=data)

nuts_kernel = NUTS(model,
                   jit_compile=True,
                   adapt_step_size=True,
                   transforms={
                       "lambda1": torch.distributions.transforms.ExpTransform(),
                       "lambda2": torch.distributions.transforms.ExpTransform(),
                       "tau": torch.distributions.transforms.SigmoidTransform()
                   }
                  )
posterior = MCMC(nuts_kernel, 
                 num_samples=1000, 
                 warmup_steps=300,
                ).run(count_data)

I think TFP requires writing the log-joint probability manually, whereas Pyro only needs the generative model and derives the log-joint automatically. Therefore, because lambda_ = torch.gather(...) does not come from pyro.sample(), the MCMC sampler cannot get the log-joint probability correctly, right?
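(For what it's worth, my understanding is that Pyro assembles the log-joint from the pyro.sample statements in the model; something like this sketch, assuming model and count_data from above, should print it:)

from pyro import poutine

# trace one execution of the generative model and sum the log-probabilities
# of every pyro.sample site (including the observed "obs" site)
trace = poutine.trace(model).get_trace(count_data)
print(trace.log_prob_sum())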

How can I modify this code? Please teach me. Thank you.

Hi, can you provide more information about the problem you’re having? Are you getting an error, or seeing inference results that don’t make sense? If so, can you post enough code to reproduce the error? Or are you just asking if your model is correctly specified?

I'm not sure what your question is, as @eb8680_2 pointed out, but here are some things that might help:

  • No need to define transforms, NUTS will choose suitable transforms for you automatically.
  • No need to use gather. PyTorch (and hence Pyro) is a dynamic-computation-graph framework, so you can use if ... else ... instead of the gather(...) hack needed for TensorFlow. Edit: I might have been wrong; the reason to use gather is to get a vector-valued lambda_. I'm not sure whether gather works, but we could define lambda_ as:
lambda1_size = int((tau * data.size(0)).item())
lambda2_size = data.size(0) - lambda1_size
lambda_ = torch.cat([lambda1.expand(lambda1_size), lambda2.expand(lambda2_size)])

It would be strange to me if Pyro's NUTS did not work well on this problem. Please let us know what problem you have so we can help.

Thank you for your reply.

The problems are the following two:

  1. The calculation is much slower than with TensorFlow Probability using eager execution.
    PyTorch is often faster than TF eager execution, so I don't know why Pyro is slower than TFP. That is why I thought something was wrong with my code.

  2. The code runs, but its result differs from TFP and PyMC3 (TFP and PyMC3 give the same result). And there are some warnings, shown below.

TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can’t record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

Remove the CWD from sys.path while we load stuff.

TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.

Full code

Full_code_at_gist

Sorry for the confusion. The way of defining lambda_ in my last comment won't work because we need to backpropagate the gradient to tau. Thanks for sharing the Colab notebook (my first time using it ^^). I will look into this problem and get back to you soon.

I can recover similar posteriors to TFP for lambda1 and lambda2. The running time is pretty fast.

count_data = torch.tensor([
    13,  24,   8,  24,   7,  35,  14,  11,  15,  11,  22,  22,  11,  57,  
    11,  19,  29,   6,  19,  12,  22,  12,  18,  72,  32,   9,   7,  13,  
    19,  23,  27,  20,   6,  17,  13,  10,  14,   6,  16,  15,   7,   2,  
    15,  15,  19,  70,  49,   7,  53,  22,  21,  31,  19,  11,  18,  20,  
    12,  35,  17,  23,  17,   4,   2,  31,  30,  13,  27,   0,  39,  37,   
    5,  14,  13,  22,
], dtype=torch.float)

def model(data):
    alpha = (1. / data.mean())
    lambda1 = pyro.sample("lambda1", dist.Exponential(rate=alpha))
    lambda2 = pyro.sample("lambda2", dist.Exponential(rate=alpha))

    tau = pyro.sample("tau", dist.Uniform(0, 1))
    lambda1_size = int(tau.item() * data.size(0)) + 1
    lambda2_size = data.size(0) - lambda1_size
    lambda_ = torch.cat([lambda1.expand((lambda1_size,)), lambda2.expand((lambda2_size,))])
    
    with pyro.plate("data", data.size(0)):
        pyro.sample("obs", dist.Poisson(lambda_), obs=data)

nuts_kernel = NUTS(model, jit_compile=True)
posterior = MCMC(nuts_kernel, num_samples=10000, warmup_steps=5000, num_chains=1).run(count_data)

marginal = posterior.marginal(sites=["lambda1", "lambda2", "tau"]).support(flatten=True)
lambda_1_samples = marginal["lambda1"]
lambda_2_samples = marginal["lambda2"]
tau_samples = marginal["tau"]

However, the posterior for tau in Pyro is roughly Uniform over the interval [0, 1]. @eb8680_2 @neerajprad, do you have any ideas how to backpropagate loss information to tau?


Thank you for your experiment. I got the same results for lambda1, lambda2, and tau. But in my Google Colab notebook, the running time is very slow. Was that experiment run on Google Colab (i.e. the same environment as mine), or in your local environment?

@hellocyber I first ran your Colab notebook but got some errors (e.g. when num_chains > 1) which I didn't understand, so I reran it in my local environment (which has the Pyro dev version).

I think the model in my last comment is correct (sorry for the confusion @eb8680_2 @neerajprad). Because the gradient of the loss w.r.t. tau is 0 (the prior of tau is uniform), the velocity of tau will be constant along a leapfrog trajectory. So the turning condition for NUTS will depend only on lambda1's and lambda2's velocities. When the trajectory triggers a U-turn, tau might have gone a long way from its initial value.
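To make this concrete, here is a generic leapfrog sketch (plain pseudocode, not Pyro's internal implementation): if the tau component of grad_U is always 0, the tau component of the momentum p never changes, so tau drifts at constant velocity for the whole trajectory.

def leapfrog(q, p, grad_U, step_size, num_steps):
    # q: positions (lambda1, lambda2, tau in unconstrained space), p: momenta
    p = p - 0.5 * step_size * grad_U(q)      # half-step momentum update
    for _ in range(num_steps - 1):
        q = q + step_size * p                # full position update
        p = p - step_size * grad_U(q)        # full momentum update
    q = q + step_size * p                    # last position update
    p = p - 0.5 * step_size * grad_U(q)      # final half-step for momentum
    return q, p
# If grad_U(q) is 0 in the tau coordinate everywhere, p's tau coordinate never
# changes, so q's tau coordinate moves linearly with the number of steps.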

I think that is why, in the TFP notebook, the author sets the initial value of tau to 0.5 and only uses num_leapfrog_steps=2. I'll confirm my theory with a real experiment.

@hellocyber Are you able to replicate the same result in PyMC3 with NUTS? I’m curious about it. :slight_smile:

@fehiepsi Oh, sorry. I had the same error; with num_chains=1 the code runs. I just forgot to modify that.

This is a snippet of the inference in PyMC3.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

n_count_data = len(count_data)  # number of observations

with pm.Model() as model:
    alpha = 1.0 / count_data.mean()  # recall count_data is the
                                     # variable that holds our txt counts
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)

    # tau = pm.Uniform("tau", lower=0, upper=n_count_data - 1)  # continuous version (very slow for me)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)


with model:
    idx = np.arange(n_count_data) # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

with model:
    observation = pm.Poisson("obs", lambda_, observed=count_data)

with model:
    step = pm.NUTS()
    trace = pm.sample(10000, tune=5000, step=step)

lambda_1_samples = trace['lambda_1']
lambda_2_samples = trace['lambda_2']
tau_samples = trace['tau']

This calculation was so slow that I couldn't wait for it…! However, using DiscreteUniform instead of Uniform, the calculation became very fast and I got the correct result.

Does this change of distribution have anything to do with the result?

PyMC3 uses a different sampling method for discrete variables. HMC/NUTS only works for continuous variables.
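For example, with the DiscreteUniform version of tau you can just let pm.sample pick the step methods (a sketch, I haven't run this exact snippet): PyMC3 assigns NUTS to the continuous lambda_1/lambda_2 and a Metropolis-style sampler to the discrete tau.

with model:
    # no explicit `step`: PyMC3 chooses NUTS for the continuous variables and a
    # Metropolis-style sampler for the discrete tau automatically
    trace = pm.sample(10000, tune=5000)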

I did my own experiments with the settings max_tree_depth=1, adapt_step_size=False, step_size=0.05 and was able to get the same pattern for tau as in TFP. However, I got different results for different numbers of warmup_steps. This suggests that the posterior for tau being Uniform over (0, 1) can be true.
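Roughly this configuration, with model and count_data as above (the sample/warmup counts here are placeholders, not the exact numbers I used):

nuts_kernel = NUTS(model,
                   max_tree_depth=1,
                   adapt_step_size=False,
                   step_size=0.05)
posterior = MCMC(nuts_kernel, num_samples=1000, warmup_steps=300).run(count_data)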

In addition, the inference result of that model in TFP/PyMC3 looks unreliable to me. There is no way the probability of a high value of tau (tau > 44) is negligible given this data. In my opinion, the author does some tricks to make sure the sampler gives such results. Because Bayesian Methods for Hackers is popular, I suggest raising a question there to ask its author about these technical details. There is a chance that I am wrong, so please ping me if you have some discussion with its author.

@fehiepsi
Thank you very much. This discussion helped me understand much more than before. And when using the NUTS algorithm, configuring it is more difficult than I thought.

In Pyro, will the various parameters of the NUTS algorithm, for example the number of leapfrog steps, become manually configurable in the future?

I think it is good to stick with the default arguments. Unless there are bugs, the default settings work well for most small models. They work for the above model too, and I think the uniform posterior of tau is a better result than the one from PyMC3/TFP.

NUTS does not have a leapfrog-steps parameter (you can reduce max_tree_depth, which is roughly the log2 of the number of leapfrog steps). But you can specify it in HMC, where that parameter is named num_steps, if I remember correctly.
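Something like this (parameter names from memory, so please double-check the docs):

from pyro.infer.mcmc import HMC, NUTS

hmc_kernel = HMC(model, step_size=0.05, num_steps=4)  # a fixed number of leapfrog steps
nuts_kernel = NUTS(model, max_tree_depth=5)           # tree doubling capped at depth 5 (at most ~2**5 steps)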

OK. Unless there are special circumstances, I will use the default settings. And I think I should study the NUTS algorithm itself in more depth.

I have mathematical knowledge of HMC, but I only have a little hands-on experience with NUTS. So, finally, I would be really happy if you could tell me about the pros and cons of NUTS compared to HMC.

As I understand it, NUTS automatically determines the number of steps and the learning rate (step size), so if I set max_tree_depth, this becomes a constraint on the number of steps, right? Is there a rule of thumb for judging whether max_tree_depth is needed, or for choosing HMC instead?
By analogy with the SGD and Adam optimizers, does this depend on the problem, the data, the experiment, and one's own intuition?

Thanks for the great advice.

@hellocyber Yes, you are right. In NUTS, we double the tree at each “doubling” step until the trajectory makes a U-turn. max_tree_depth is used to guarantee that the “doubling” tree does not grow too large (when it can't make a U-turn). I personally only modify max_tree_depth to speed things up. Theoretically, the larger max_tree_depth is, the better the result will be. However, if I see that it is wasteful for some tricky model, I will decrease it to improve speed.

About the pros and cons of NUTS compared to HMC, I don't think I can answer definitively. I mostly use NUTS so that I don't have to worry about tuning num_steps/trajectory_length (step_size is tuned for both methods with the dual averaging scheme). I only use HMC for debugging. :sweat:


My apologies for chiming in late on this thread.

@hellocyber, @fehiepsi - I think the model above is giving a flat posterior because of the line lambda1_size = int(tau.item() * data.size(0)) + 1, which effectively blocks the effect of changing tau. You will find that the model gives identical answers if you change that line to lambda1_size = (tau * data.size(0) + 1).long(), i.e. you'll get a mean in the neighborhood of 0.59 that tallies with the 44 days you see for TF and PyMC3. It's actually kind of neat that this works for NUTS, and even with the PyTorch JIT. :slight_smile:
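For reference, here is the earlier model with only that line changed (a sketch; everything else is as in @fehiepsi's code above):

import torch
import pyro
import pyro.distributions as dist

def model(data):
    alpha = 1. / data.mean()
    lambda1 = pyro.sample("lambda1", dist.Exponential(rate=alpha))
    lambda2 = pyro.sample("lambda2", dist.Exponential(rate=alpha))

    tau = pyro.sample("tau", dist.Uniform(0, 1))
    # keep tau as a tensor so its value still reaches the potential energy,
    # instead of being baked in as a constant by .item() under the JIT trace
    lambda1_size = (tau * data.size(0) + 1).long()
    lambda2_size = data.size(0) - lambda1_size
    lambda_ = torch.cat([lambda1.expand((lambda1_size,)),
                         lambda2.expand((lambda2_size,))])

    with pyro.plate("data", data.size(0)):
        pyro.sample("obs", dist.Poisson(lambda_), obs=data)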


Amazing! I can't believe my eyes that it works and information can pass through indexing. Do you have any idea why the tau samples from the posterior are less than 0.59, @neerajprad? I looked at the data plot and couldn't identify it. I guess somehow when tau values pass that threshold, the potential energy makes a big decrease? I believe I'm missing something here. Edit: Somehow I figured it out. I expected the posterior to be continuous, but it does not need to be in the case of tau here. The potential energy can make a big jump when tau passes through bins (multiples of 1/num_data) because we only use its discrete values to calculate the potential energy.
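Here is a quick way to see those bins (a small check, assuming 74 observations as in count_data above): lambda1_size is a step function of tau, so the potential energy only changes when tau crosses a multiple of 1/num_data.

import torch

n = 74  # number of observations in count_data
for tau in torch.tensor([0.590, 0.595, 0.600, 0.605, 0.610]):
    # prints 44, 45, 45, 45, 46: several different values of tau give the same
    # lambda1_size, so the potential energy is piecewise constant in tau
    print(round(tau.item(), 3), (tau * n + 1).long().item())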

This is the first time I've seen this phenomenon. It is good to let other users know about it too.

I've made a gist here for reference. Happy learning!


That’s pretty cool, and good to know! Thanks for plotting this.


@fehiepsi
Thanks for publishing the code and the visualization!

@neerajprad
I now understand that when coding in Pyro (or PyTorch), we must keep data in the torch.tensor family (torch.long, torch.float); otherwise the program cannot trace the computation graph and cannot get the gradients, right?

That is correct. PyTorch can only track operations performed on tensor objects, as you would expect. This isn't specific to PyTorch; even TensorFlow behaves analogously.
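For example, here is a tiny illustration of the difference (arbitrary numbers; the .item() version triggers the TracerWarning quoted earlier and freezes the value in the trace):

import torch

def f(tau):
    k = int(tau.item() * 74)      # Python int: recorded as a constant by the tracer
    return torch.arange(74) < k

def g(tau):
    k = (tau * 74).long()         # stays a tensor: the trace keeps the dependence on tau
    return torch.arange(74) < k

traced_f = torch.jit.trace(f, torch.tensor(0.5))  # warns; k is frozen at 37
traced_g = torch.jit.trace(g, torch.tensor(0.5))
print(traced_f(torch.tensor(0.9)).sum())  # still 37, tau's new value is ignored
print(traced_g(torch.tensor(0.9)).sum())  # 66, follows the new tau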

Thank you very much.
I have learned a lot from this discussion.