Am I using HMC/NUTS/MCMC correctly?

ellery · January 8, 2019, 6:04am

Hey;
I’m interested in Bayesian Hierarchical Logistic Regression.

I have predictors X (dimension = D) and my response variable Y={0,1}.

I want to learned the posterior distribution of my beta (some calls it weights or coefficients, etc) where 1/(1+exp(-X@beta) gives me the probability of deciding 1 or 0

In term of Hierarchical model, my posterior distribution is

log_p(beta, sigma | data) proportional to log_likelihood(data) + log_Normal(beta | 0, sigma) + log_Normal(sigma | 0, 10.)

My implementation of NUTS is

def model(X,Y):
    sigma = pyro.sample('sigma', dist.Normal(torch.zeros(N), torch.ones(N)))
    beta = pyro.sample('beta', dist.Normal(torch.zeros(D), torch.ones(D) * sigma ** 2) )


    y = pyro.sample('y', pyro.distributions.Bernoulli(logits=(X*beta).sum(-1)), obs=Y)



nuts_kernel = NUTS(model)
mcmc_run = MCMC(nuts_kernel,
                 num_samples=100,
                 warmup_steps=100,
                 num_chains=1).run(X,Y) # I assume that argument in run() are the argument passed into model
marginal = mcmc_run.marginal(sites=["sigma", "beta"])

Are this correct high level implementation ?

Also I’m interesting in many other posterior distribution where I only have unnormalized log probability function and I don’t know the “model”. How do I specify in Pyro ? Just like one would do in tensor flow; such as

mcmc.kernel( target_log_prob_fn=target_log_prob_fn, # just a callable function current_state=[0.], step_size=[0.3], seed=3)

fehiepsi · January 8, 2019, 8:42pm

It looks like that you have a right model (though I think that put a positive-support prior on sigma is better).

You can use unnormalized log prob with the following trick:

# assume that you have computed `unnormalized_log_prob`
pyro.sample("forward_prob", dist.Bernoulli(unnormalized_log_prob.exp()), obs=torch.tensor(1.))

ellery · January 9, 2019, 7:56pm

Thanks for the reply; in term of specifying unnormalized_log_prob as target distribution.

What is the general approach here?

For example, consider I have an log unnormalized distribution (say log just for numerical stability ) such as

(x@b - (x**2).sum(-1)) + (exp(phi) + exp(-phi)).sum(-1) # phi is some function where I don’t know any sampling procedure so that I couldn’t able to specify my model. Nor do I know any inference procedures; All I want is to collect unbiased samples fro this distribution

How do I approach such problems?

fehiepsi · January 9, 2019, 8:48pm

I don’t quite get it. I guess that you want to get samples for b and phi? In that case, just set flat priors for them (for example, Normal(0, 100)).

ellery · January 9, 2019, 11:20pm

Hey; this is just an example.

The high level of my problem is that: I only have access to potential energy function (or minus log unnormalized function); nothing else.

I’m given an arbitrary positive function with finite integral over its domain

Given a function f:R^n -> R where integral f(x) over R^n = Z is finite and f(x) / Z > 0 (So its unnormalized probability density function). I’m interested in getting the samples from this its normalized distribution.

Unlike logistic regression or any other bayesian inference, I don’t know what is the model looks like; How do I get samples with HMC ? In tensor flow version of HMC, one can put f as an argument where in Pyro; it seems to me I have no such choice.

fehiepsi · January 10, 2019, 1:52am

I think you can set a flat prior for your x. Whether f normalized or unnormalized is not important for HMC.

ellery · January 10, 2019, 3:22am

@fehiepsi

I’m still quite confused; could you give me a bit more information ?

let’s say I have a black box function (can be in any form such as a neural network)

def log_target_dist(pts): # it can query a given pts and return the log of unnormalized/normalized density of pts

How do I use HMC in pyro to get samples from it?

fehiepsi · January 10, 2019, 5:09am

To get samples of a variable, you have to define a prior for it. In this case, I guess you dont have much information on the variable, so just set a flat prior on it, say x = pyro.sample(“x”, dist.Normal(0, 100)).

Then you can use the above trick to tell HMC that you want to use f(x) as potential energy.

ellery · January 10, 2019, 7:41am

Thanks!

So the code for model would be

x = pyro.sample(“x”, dist.Normal(0, 100)) y = pyro.sample("y", f(x))

Does this mean that the Chain will have its initial position sampled from N(0,100) and starts to explore the target space from there ? because I also kind of interested to set my Chain starts from somewhere that isn’t N(0,1) but some other distribution such as mixture or etc.

fehiepsi · January 10, 2019, 1:57pm

You can specify initial positions by changing initial_trace as in this test.

And to inject unnormalized prob, you can use the trick which I mentioned above

# assume that you have computed `unnormalized_log_prob` from `x`
pyro.sample("forward_prob", dist.Bernoulli(unnormalized_log_prob.exp()), obs=torch.tensor(1.))

There might be the case unnormalized_log_prob > 0, then you can further do

unnormalized_log_prob = unnormalized_log_prob - M

where M is a upper bound for unnormalized_log_prob.

fehiepsi · January 10, 2019, 2:03pm

If you don’t want to use the above trick, then you can follow this issue to get a solution (if any) in the future. Or you can modify potential_energy as you like.

ellery · January 10, 2019, 4:43pm

thank you. if I modify potential_energy (as I found using trick requires to search for the upper bound and can be tedious if target is in high dimension)

My plan is to

class HMC(TraceKernel):

     def __init__(self,
                 model,
                 log_target=None # my modification
                 step_size=1,
                 trajectory_length=None,
                 num_steps=None,
                 adapt_step_size=True,
                 adapt_mass_matrix=True,
                 full_mass=False,
                 transforms=None,
                 max_plate_nesting=None,
                 jit_compile=False,
                 ignore_jit_warnings=False):
       self.log_target = log_target # my modification

In the corresponding class method potential energy function I do following

def _potential_energy(self, z):
    if self.log_target is not None: # my modification
         potential_energy = - self.log_target
         return potential_energy

    # the rest just original code
    if self._jit_compile:
        return self._potential_energy_jit(z)
    # Since the model is specified in the constrained space, transform the
    # unconstrained R.V.s `z` to the constrained space.
    z_constrained = z.copy()
    for name, transform in self.transforms.items():
        z_constrained[name] = transform.inv(z_constrained[name])
    trace = self._get_trace(z_constrained)
    potential_energy = -self._compute_trace_log_prob(trace)
    # adjust by the jacobian for this transformation.
    for name, transform in self.transforms.items():
        potential_energy += transform.log_abs_det_jacobian(z_constrained[name], z[name]).sum()
    return potential_energy

So these will be all modification I needed ? I trace down to integrator.py file where they call this self._potential_energy as long as I keep my log_target can be autograd; these should be fine.

However, in term of model arguments when instantiate HMC class; I’m not sure how to deal with it. As I really don’t need it but pretty much all other methods (generate sample ) requires trace and requires model to perform.

fehiepsi · January 10, 2019, 8:03pm

Yes, you only need to create a subclass of HMC/NUTS and modify potential energy method. No need to change other things AFAIK.

neerajprad · January 10, 2019, 8:14pm

As an alternate approach, I think you could also define a distribution class with log_prob that gives you unnormalized density and use it in your model. Sample code:

import torch
from torch.distributions import constraints

import pyro
from pyro.distributions import TorchDistribution
from pyro.infer.mcmc import MCMC, NUTS


class UnnormalizedDist(TorchDistribution):
    support = constraints.real
    has_rsample = False

    def __init__(self):
        self.mu = torch.tensor(2.)
        self.std = torch.tensor(3.)
        super(TorchDistribution, self).__init__()

    # HACK: Only used for model initialization.
    def sample(self, sample_shape=torch.Size()):
        return torch.tensor(1.)

    def log_prob(self, value):
        return -((value - self.mu) ** 2) / (2 * self.std ** 2)


def model():
    pyro.sample("y", UnnormalizedDist())


if __name__ == "__main__":
    kernel = NUTS(model, jit_compile=True, ignore_jit_warnings=True)
    mcmc = MCMC(kernel=kernel, num_samples=10000, warmup_steps=200).run()
    marginal = mcmc.marginal(["y"]).empirical["y"]
    print(marginal.mean)
    print(marginal.variance)

As a note for us, we can facilitate such use cases by not sampling from the model for the initial setup in case an initial_trace is provided to the kernel, thereby avoiding the constant sampling hack above.

ellery · January 10, 2019, 9:11pm

Thanks for alternative approach; for def sample() method; I assume that it really doesn’t matter the return value; simply there to make sure that poutine.trace(self.model).get_trace and poutine.enum(config_enumerate(self.model),
first_available_dim=-1 - self.max_plate_nesting) can be properly called?

neerajprad · January 10, 2019, 9:13pm

The return value matters in the following way - it is used to initiate the sampler, so it should be included in the distribution’s support (in particular, it should have a finite value under log_prob), otherwise we will be stuck and the kernel would throw an error.

ellery · January 10, 2019, 9:37pm

Thanks so much !

ellery · January 12, 2019, 9:34pm

Hey; I’ve running two different approach to Funnel Distribution (Neal’s Paper)

Where the joint distribution p(x,v) = N(v| 0,9) * prod_i N(x_i | 0, exp(v))

Its unnormalized form (in log space)

   def log_prob(x):

    dim = x.shape[-1]
    x = x.reshape(-1,dim)

    xx = torch.sum(x[:,:-1]**2,dim=1)
    v = x[:,-1]

    res = - xx / (2. * torch.exp(v)) - v ** 2/18

    return res

I use this in your approach and gets bad results for hmc (which is reasonable as funnel is particularly designed to fail HMC)

However, using another approach by

def model(dim):
    v = pyro.sample('v', dist.Normal(0., 3.))
    x = pyro.sample('x', dist.Normal(torch.zeros(dim), torch.sqrt(torch.exp(v))))
    return x
nuts_kernel = HMC(model)
mcmc_run = MCMC(nuts_kernel,
                 num_samples=1000,
                 warmup_steps=1000,
                 num_chains=1).run(dim=1000) # I assume that argument in run() are the argument passed into model
marginal = mcmc_run.marginal(sites=["x", "v"])

I have amazing results even up to 1000 dimension of funnel ! so I believe that I’m using the second approach incorrectly; could you specify where did I go wrong (this is approach I see how other use in hierarchical model )?

neerajprad · January 14, 2019, 7:11am

The second approach is correct and would be the idiomatic way of expressing your distribution of interest using the model. Since p(x, y) is well defined, you should use the normal distribution as is.

The first one should work fine too, but your log_prob term doesn’t seem right (see below for the corrected one, not tested though). I wouldn’t recommend this when you are not dealing with unnormalized distributions (as in this example), since it is easy to trip up on broadcasting or missing some term, and much harder to debug when things go wrong. If you use pre-defined distributions, this is taken care of for you.

    def log_prob(self, x):
        x, std = x[:-1], x[-1]
        res = (-(x**2) / (2. * std.exp()**2) - std).sum() - std ** 2 / 18
        return res

ellery · January 15, 2019, 12:56am

Hey; Thanks so much for point out the bug for me !!!

The reason I use unnormalized log form because I have many other problems that I only have access to its unnormalized log distribution. So I start with funnel which is easier to implement to make sure I’m comfortable with such execution.

here I assume x[-1].exp() is the variance for other variables not std (as in approach 1; I have torch.sqrt(torch.exp(v))); so I think the correct implementation is

x, v = x[:-1], x[-1] res = (-(x ** 2) / (2. * v.exp()) - v / 2).sum() - v ** 2 / 18

I’ve found that approach 1 is slower than approach 2