Problems transforming a pymc3 model to pyro (mcmc)

Thank you for the fast response! Is there another (more efficient) way of implementing this sequential
random-walk behavior?

I looked at the DMM Example but it seems to be using a loop for T steps as well (although that example is using SVI rather than MCMC).

Examining the implementation of pymc3’s GaussianRandomWalk it appears that they treat the process as a distribution and define the log-probability by offsetting the x by one position. But it is not clear to me how to translate this into Pyro.

Any guidance would be greatly appreciated!

After working through some of the math behind the model I was able to come up with this version (which is equivalent as far as I can tell).

def model(data):
    T = len(data)
    sigma = pyro.sample('sigma', dist.Exponential(50.))
    nu = pyro.sample('nu', dist.Exponential(.1))
    shifts = pyro.sample("h_t", dist.Normal(loc=torch.zeros(T), scale=torch.ones(T) * sigma))
    h_t = torch.cumsum(shifts, dim=0)
    y = pyro.sample("returns", dist.StudentT(df=nu, loc=torch.zeros(1), scale=(torch.exp(2*h_t) + 1e-12)), obs=torch.tensor(data))
    return y

This does run much faster than the previous version, but my acceptance rate is still zero (or so small that I can’t see the non-zero digits on the logs). Any suggestions as to how I could alleviate this?

One thing that you could try is to see if HMC runs on this with a small step size, and turn off step size adaptation. If that runs fine, it could indicate some issue with our step size adaptation.

Your simplified version can be simplified more by using the property of Gaussian random walk
h_t = pyro.sample('h_t", dist.Normal(0, T * sigma)) (just heuristic, you should correct the shape here). So no need for torch.cumsum.


Edit: I am wrong here.

What happened if you change it to scale=torch.exp(h_t)?

Thank you again for the reply. I turned off the step size adaption and tried a few small step sizes as you suggested. It seemed to have no improvement on the acceptance rate. Any reasons you could think of?

I tried this and it barely had any influence on the acceptance rate from what I observed… (Adding 1e-12 is just to make the function more stable so that it won’t complain about invalid scale values.)

It is interesting. It seems that accumulating a long sequence causes a precision issue. I will try to implement GaussianRandomWalk and will get back to you.

Thank you for your time :slight_smile:

I tried to implement GaussianRandomWalk but still cannot do mcmc when data.shape > 200 (less than 200 is fine). :frowning:

Updated: I changed random seed and use very small step_size and be able to make mcmc run on full data. I have tried to debug to see why for some seeds it failed and observed that gradient blow up at those time. I don’t know how to deal with this issue. :frowning:

class GaussianRandomWalk(dist.TorchDistribution):
    has_rsample = True
    arg_constraints = {'scale': constraints.positive}
    support = constraints.real

    def __init__(self, scale, num_steps=1):
        self.scale = scale
        batch_shape, event_shape = scale.shape, torch.Size([num_steps])
        super(GaussianRandomWalk, self).__init__(batch_shape, event_shape)

    def rsample(self, sample_shape=torch.Size()):
        shape = sample_shape + self.batch_shape + self.event_shape
        walks = self.scale.new_empty(shape).normal_()
        return walks.cumsum(-1) * self.scale.unsqueeze(-1)

    def log_prob(self, x):
        init_prob = dist.Normal(self.scale.new_tensor(0.), self.scale).log_prob(x[..., 0])
        step_probs = dist.Normal(x[..., :-1], self.scale).log_prob(x[..., 1:])
        return init_prob + step_probs.sum(-1)

def model(data):
    T = len(data)
    sigma = pyro.sample('sigma', dist.Exponential(50.))
    nu = pyro.sample('nu', dist.Exponential(0.1))
    h = pyro.sample("h", GaussianRandomWalk(scale=sigma, num_steps=T))
    y = pyro.sample("returns", dist.StudentT(df=nu, loc=torch.zeros(T),
                                             scale=torch.exp(h)).independent(1), obs=torch.tensor(data))
    return y


nuts_kernel = NUTS(model, step_size=0.00001, adapt_step_size=False)
mcmc_run = MCMC(nuts_kernel, num_samples=1, warmup_steps=2).run(data)

I also noticed that this runs with HMC with small step size. e.g.

hmc_kernel = HMC(model, step_size=0.001, num_steps=200)
mcmc_run = MCMC(hmc_kernel, num_samples=100, warmup_steps=50).run(data)

Due to the small step size, the time taken per sample is quite high. This is a good example to study in more depth, and see what improvements if any would make this run more efficiently and consistently.

Thank you for your time!

Is there anything else I can do to help investigate this issue?

I am afraid we’ll need to look under the hood to investigate further; but you are of course welcome to investigate yourself, and let us know what you notice. NUTS/HMC is very much an experimental feature, and this issue tracks further improvements to make it practically useful beyond small datasets. For example, it could be the case that we need to adapt the mass matrix to do well on this problem. If you would like to dig deeper into the code and do not understand any part of it, let me know and I will be happy to guide you!

Ah, I see. Thanks!

I did some digging in the code and narrowed down the source of the issue a little bit, but I don’t know if I’m really hitting the root of the issue.

It seems that doubling is always aborted during the first iteration of NUTS because the base tree is marked as diverging. The culprit is the joint_prob which is becoming HUGE.

Similarly, in HMC the delta_energy is pretty erratic (extremely large, zero, or nan).

I investigated further and determined the cause is the potential energy having a large magnitude (which is then exponentiated to obtain the joint_prob). Specifically, the log probabilities for both h and returns (y) have very large magnitudes (in the 100s or 1000s).

Taking a look at the velocity_verlet gradients in HMC I observed something similar to what @fehiepsi described (the gradients appear to explode). The h and y gradients appear to be the worst (~-4000) sigma and nu are in the 100s (I’m not sure if that’s considered exploded or not).

I’m not really sure how to proceed… Any ideas for other steps I could take?

Hello, I’m a friend of @mwang101 working on the same example. I’ve taken a closer look at the gradients and it appears what’s going on with the h is something akin to what happens with vanishing gradients during backprop in a neural network. The gradients for the elements near the end of h (h[-1], h[-2] ...) are quite reasonable, but they accumulate as you go father back with h[0], h[1] … having the most extreme gradients. Given this pattern I assume this is related to autograd’s handling of the cumsum function, but I haven’t investigated yet.

My first thought would be to use something like gradient clipping, but I don’t know whether that would affect the theoretical properties of the HMC sampling process or not…

I used GaussianRandomWalk and didn’t observe that behavior of h's gradient. As far as I can tell, there are many sources of gradient blow up. I don’t think that gradient clipping will solve this problem because when a parameter in its extremal case (e.g. sigma << 1 or sigma very near boundary of constrained distributions such as Uniform if we define Uniform prior on sigma), the wrong direction of the gradient will make things worse, no matter how large it is (of course, when sigma << 1, h’s gradient will be easily blow up <- this is another source).

To move a parameter out of its extremal location (in unconstrained space), we have to use large momentum. Currently, in Pyro, the starting momentum just follows N(0, 1), which is so small to deal with this problem (relevant ref: Section 4.1 of Neal’s review paper). As @neerajprad pointed out, we need to adapt the mass matrix to do well on this problem.

Ah, that makes sense. Thank you for your help!

I guess for now I’ll stick to using SVI in Pyro (although I’m looking forward to trying out more models like this in the future if MCMC is developed further!).

I recently implement this model with pyro, however with SVI,

after doing some math, I replace the gauss random walk with a multivariate normal with a special lower triangular matrix, as specified by the following model code:

################## vectorized version
def model_vec(data_vec):
    N = len(data_vec)
    rv_sigma = pyro.sample("rv_sigma", dist.Exponential(torch.tensor(50.)))
    rv_nu = pyro.sample("rv_nu", dist.Exponential(torch.tensor(0.1)))
    # random walks model corresponds to scale_tril of tril of all one matrix
    rv_s = pyro.sample("rv_s", dist.MultivariateNormal(loc=torch.zeros(N)*0.0, scale_tril=torch.ones((N,N)).tril()))
                dist.StudentT(rv_nu, loc=torch.tensor(0.), scale=rv_s.exp()).independent(),

def guide_vec(data_vec):
    N = len(data_vec)
    ##### params
    p_sigma_1 = pyro.param("p_sigma_1", torch.tensor(0.))
    p_sigma_2 = pyro.param("p_sigma_2", torch.tensor(1.),
    p_nu_1 = pyro.param("p_nu_1", torch.tensor(0.))
    p_nu_2 = pyro.param("p_nu_2", torch.tensor(1.),

    p_s_loc = pyro.param("p_s_loc", torch.zeros(N)) 
    p_s_scale = pyro.param("p_s_scale", torch.ones(N),

    ##### rvs
    rvq_sigma = pyro.sample("rv_sigma", dist.LogNormal(p_sigma_1, p_sigma_2))
    rvq_nu = pyro.sample("rv_nu", dist.LogNormal(p_nu_1, p_nu_2))
    pyro.sample("rv_s", dist.Normal(p_s_loc, p_s_scale).independent())

I tried on the sp500 data used by the pymc3 tutorial, however the estimated s is quite unsmooth as shown in the following plot

Any idea of how to improve the inference?

First, I find initialization to be quite important in SVI. You might try initializing p_s_loc to data and p_s_scale to either overestimate or underestimate variance

pyro.param("p_s_loc", torch.abs(data_vec).log1p())  # or similar
pyro.param("p_s_scale", 0.1 * torch.ones(N),

or maybe

pyro.param("p_s_scale", 10.0 * torch.ones(N),

Second, I believe your guide can be automatically constructed via

guide = pyro.contrib.autoguide.AutoDiagonalNormal(model)

though you would need to interact with it a bit differently.

Thanks @fritzo for the quick reply. I tried both of the initialization strategy.

  • init with small scale gives similar result
  • init with large s_scale takes longer to converge.

However the resulting exp(s) are both unsmooth, as previous result.

I am thinking if fully factorized approximation of s over simplified the problem. In the model, s_t is correlated with s_{t-1}. Is it possible to specify the following approximated posterior in an efficient way in pyro?

p(s) = p(s_0)p(s_1|s_0)p(s_2|s_1)\cdotsp(s_n|s_{n-1})
where each conditional p is a Normal?