Problems transforming a pymc3 model to pyro (mcmc)


I am working on transforming a stochastic volatility model implemented in pymc3 to pyro.

I was using the following pyro code. The result was very different than what I got in pymc3. My acceptance rate is very low nearly zero. Can anyone help me with it? Thank you in advance!

def model(data):
    sigma = pyro.sample('sigma', dist.Exponential(50.))
    nu = pyro.sample('nu', dist.Exponential(.1))
    #sample volatilities
    h_t = [pyro.sample("h_{}".format(0), dist.Normal(0., sigma))]
    y = [pyro.sample("return_{}".format(0), dist.StudentT(nu, torch.zeros(1), torch.exp(2*h_t[0])), obs = torch.tensor(data[0]))]
    for t in range(T-1):
        h = pyro.sample("h_{}".format(t+1), dist.Normal(h_t[t], sigma))
        y.append(pyro.sample("return_{}".format(t+1), dist.StudentT(nu, torch.zeros(1), torch.exp(2*h_t[t+1])), obs= torch.tensor(data[t+1])))
    return y

nuts_kernel = NUTS(model, step_size=1.0, adapt_step_size=True)
mcmc_run = MCMC(nuts_kernel, num_samples=500, warmup_steps=300).run(data)

i havent looked at the pymc3 model very closely, but isnt the lambda to your student T distribution supposed to be a negative exponent:

- torch.exp(2*h_t[t+1]))
+ torch.exp(-2*h_t[t+1]))

As an aside, I think a straight transformation like this will be very inefficient - right now each time the model runs, it sequentially samples for T time steps. And NUTS needs to run this model many times to even generate a single sample.

Thank you for the reply! I’ve looked at the distributions in pymc3. It seems like they are using lambda instead of standard deviation in the student distribution (lambda seems to be the inverse of std) while pyro is using std.

Thank you for the fast response! Is there another (more efficient) way of implementing this sequential
random-walk behavior?

I looked at the DMM Example but it seems to be using a loop for T steps as well (although that example is using SVI rather than MCMC).

Examining the implementation of pymc3’s GaussianRandomWalk it appears that they treat the process as a distribution and define the log-probability by offsetting the x by one position. But it is not clear to me how to translate this into Pyro.

Any guidance would be greatly appreciated!

After working through some of the math behind the model I was able to come up with this version (which is equivalent as far as I can tell).

def model(data):
    T = len(data)
    sigma = pyro.sample('sigma', dist.Exponential(50.))
    nu = pyro.sample('nu', dist.Exponential(.1))
    shifts = pyro.sample("h_t", dist.Normal(loc=torch.zeros(T), scale=torch.ones(T) * sigma))
    h_t = torch.cumsum(shifts, dim=0)
    y = pyro.sample("returns", dist.StudentT(df=nu, loc=torch.zeros(1), scale=(torch.exp(2*h_t) + 1e-12)), obs=torch.tensor(data))
    return y

This does run much faster than the previous version, but my acceptance rate is still zero (or so small that I can’t see the non-zero digits on the logs). Any suggestions as to how I could alleviate this?

One thing that you could try is to see if HMC runs on this with a small step size, and turn off step size adaptation. If that runs fine, it could indicate some issue with our step size adaptation.

Your simplified version can be simplified more by using the property of Gaussian random walk
h_t = pyro.sample('h_t", dist.Normal(0, T * sigma)) (just heuristic, you should correct the shape here). So no need for torch.cumsum.

Ref: Random walk - Wikipedia

Edit: I am wrong here.

What happened if you change it to scale=torch.exp(h_t)?

Thank you again for the reply. I turned off the step size adaption and tried a few small step sizes as you suggested. It seemed to have no improvement on the acceptance rate. Any reasons you could think of?

I tried this and it barely had any influence on the acceptance rate from what I observed… (Adding 1e-12 is just to make the function more stable so that it won’t complain about invalid scale values.)

It is interesting. It seems that accumulating a long sequence causes a precision issue. I will try to implement GaussianRandomWalk and will get back to you.

Thank you for your time :slight_smile:

I tried to implement GaussianRandomWalk but still cannot do mcmc when data.shape > 200 (less than 200 is fine). :frowning:

Updated: I changed random seed and use very small step_size and be able to make mcmc run on full data. I have tried to debug to see why for some seeds it failed and observed that gradient blow up at those time. I don’t know how to deal with this issue. :frowning:

class GaussianRandomWalk(dist.TorchDistribution):
    has_rsample = True
    arg_constraints = {'scale': constraints.positive}
    support = constraints.real

    def __init__(self, scale, num_steps=1):
        self.scale = scale
        batch_shape, event_shape = scale.shape, torch.Size([num_steps])
        super(GaussianRandomWalk, self).__init__(batch_shape, event_shape)

    def rsample(self, sample_shape=torch.Size()):
        shape = sample_shape + self.batch_shape + self.event_shape
        walks = self.scale.new_empty(shape).normal_()
        return walks.cumsum(-1) * self.scale.unsqueeze(-1)

    def log_prob(self, x):
        init_prob = dist.Normal(self.scale.new_tensor(0.), self.scale).log_prob(x[..., 0])
        step_probs = dist.Normal(x[..., :-1], self.scale).log_prob(x[..., 1:])
        return init_prob + step_probs.sum(-1)

def model(data):
    T = len(data)
    sigma = pyro.sample('sigma', dist.Exponential(50.))
    nu = pyro.sample('nu', dist.Exponential(0.1))
    h = pyro.sample("h", GaussianRandomWalk(scale=sigma, num_steps=T))
    y = pyro.sample("returns", dist.StudentT(df=nu, loc=torch.zeros(T),
                                             scale=torch.exp(h)).independent(1), obs=torch.tensor(data))
    return y


nuts_kernel = NUTS(model, step_size=0.00001, adapt_step_size=False)
mcmc_run = MCMC(nuts_kernel, num_samples=1, warmup_steps=2).run(data)

I also noticed that this runs with HMC with small step size. e.g.

hmc_kernel = HMC(model, step_size=0.001, num_steps=200)
mcmc_run = MCMC(hmc_kernel, num_samples=100, warmup_steps=50).run(data)

Due to the small step size, the time taken per sample is quite high. This is a good example to study in more depth, and see what improvements if any would make this run more efficiently and consistently.

Thank you for your time!

Is there anything else I can do to help investigate this issue?

I am afraid we’ll need to look under the hood to investigate further; but you are of course welcome to investigate yourself, and let us know what you notice. NUTS/HMC is very much an experimental feature, and this issue tracks further improvements to make it practically useful beyond small datasets. For example, it could be the case that we need to adapt the mass matrix to do well on this problem. If you would like to dig deeper into the code and do not understand any part of it, let me know and I will be happy to guide you!

Ah, I see. Thanks!

I did some digging in the code and narrowed down the source of the issue a little bit, but I don’t know if I’m really hitting the root of the issue.

It seems that doubling is always aborted during the first iteration of NUTS because the base tree is marked as diverging. The culprit is the joint_prob which is becoming HUGE.

Similarly, in HMC the delta_energy is pretty erratic (extremely large, zero, or nan).

I investigated further and determined the cause is the potential energy having a large magnitude (which is then exponentiated to obtain the joint_prob). Specifically, the log probabilities for both h and returns (y) have very large magnitudes (in the 100s or 1000s).

Taking a look at the velocity_verlet gradients in HMC I observed something similar to what @fehiepsi described (the gradients appear to explode). The h and y gradients appear to be the worst (~-4000) sigma and nu are in the 100s (I’m not sure if that’s considered exploded or not).

I’m not really sure how to proceed… Any ideas for other steps I could take?

Hello, I’m a friend of @mwang101 working on the same example. I’ve taken a closer look at the gradients and it appears what’s going on with the h is something akin to what happens with vanishing gradients during backprop in a neural network. The gradients for the elements near the end of h (h[-1], h[-2] ...) are quite reasonable, but they accumulate as you go father back with h[0], h[1] … having the most extreme gradients. Given this pattern I assume this is related to autograd’s handling of the cumsum function, but I haven’t investigated yet.

My first thought would be to use something like gradient clipping, but I don’t know whether that would affect the theoretical properties of the HMC sampling process or not…

I used GaussianRandomWalk and didn’t observe that behavior of h's gradient. As far as I can tell, there are many sources of gradient blow up. I don’t think that gradient clipping will solve this problem because when a parameter in its extremal case (e.g. sigma << 1 or sigma very near boundary of constrained distributions such as Uniform if we define Uniform prior on sigma), the wrong direction of the gradient will make things worse, no matter how large it is (of course, when sigma << 1, h’s gradient will be easily blow up <- this is another source).

To move a parameter out of its extremal location (in unconstrained space), we have to use large momentum. Currently, in Pyro, the starting momentum just follows N(0, 1), which is so small to deal with this problem (relevant ref: Section 4.1 of Neal’s review paper). As @neerajprad pointed out, we need to adapt the mass matrix to do well on this problem.