Mixture of poisson's with large sample size

I’m attempting to fit a mixture of discrete values where there is a large sample size but many repeated values. For example, in this mixture of Poisson’s with 10,000 observations, there are only 24 unique values

from numpy.random import choice, poisson, seed
from numpy import array, unique
from torch import tensor

from pyro.infer.mcmc.api import MCMC
from pyro.infer.mcmc import NUTS
from pyro.infer import config_enumerate
from pyro import distributions as dist
from pyro import plate, sample, set_rng_seed, poutine


set_rng_seed(1)
seed(1)

n = 1000
alpha = array([0.4, 0.6])
lam = array([3., 10.])

z = choice(len(alpha), size=n, p=alpha)
data = poisson(lam[z], n)
values, counts = unique(data, axis=0, return_counts=True)

In Stan, it is possible to directly modify the likelihood by multiplying by the number of times that value has been seen, e.g.

model {
    real contributions[K];

    // priors....

    for (n in 1:N) {
        for (k in 1:K) {
            contributions[k] = log(theta[k]) + normal_lpdf(values[n] | mu[k], sigma[k]);
        }

        target += log_sum_exp(contributions) * count[n];
    }
}

My understanding of pyro is that you can use pyro.poutine.scale to accomplish something similar

@config_enumerate
def model(values, counts, mu0, sigma0, pi0):
    # priors
    mu = sample("mu", dist.Normal(mu0, sigma0).to_event(1))
    theta = sample("theta", dist.Dirichlet(pi0))
     
    with plate('data'), poutine.scale(scale=counts):
        Z = sample('Z', dist.Categorical(theta))
        sample('obs', dist.Poisson(mu[Z]), obs=values)

However, I think this throws off the enumeration of Z’s and now during enumeration, all observations of a given value are assigned Z=1, etc. When I fit the model to data without poutine.scale, the true mu’s and theta’s are recovered well, but that does not happen with scaling. Is there any way around this issue?

I think that the model I want will require a custom implementation of a Mixture of Poisson’s distribution. I started with a mixture of normals to test:

from numpy.random import choice, normal, seed
from numpy import array
from torch import tensor, Size

from pyro.infer.mcmc.api import MCMC
from pyro.infer.mcmc import NUTS
from pyro import distributions as dist
from pyro import plate, sample


class MixtureNormal(dist.TorchDistribution):
    support = dist.Normal.support
    has_rsample = False

    def __init__(self, mu, sigma, theta):
        self._normal = dist.Normal(mu, sigma)
        self.theta = theta
        super(dist.TorchDistribution, self).__init__()

    def sample(self, sample_shape=Size()):
        return tensor(1.)

    def log_prob(self, value):
        contributions = self._normal.log_prob(value) + self.theta.log()
        return contributions.logsumexp(0)


def model(data, mu0, sigma0, pi0, alpha0, beta0):
    # priors
    mu = sample("mu", dist.Normal(mu0, sigma0).to_event(1))
    theta = sample("theta", dist.Dirichlet(pi0))
    sigma = sample("sigma", dist.Gamma(alpha0, beta0))
     
    # with plate('data', len(data)):
        # sample('obs', MixtureNormal(mu, sigma, theta), obs=data)
    for i in range(len(data)):
        sample('obs_{}'.format(i), MixtureNormal(mu, sigma, theta), obs=data[i])

if __name__ == "__main__":
    n = 100
    alpha = array([0.4, 0.6])
    mu = array([1.8, 5.7])
    sigma = 0.1

    z = choice(len(alpha), size=n, p=alpha)
    data = normal(mu[z], sigma, n)

    kernel = NUTS(model)
    mcmc = MCMC(kernel, num_samples=100, warmup_steps=100, num_chains=1)
    mcmc.run(tensor(data).float(),
             tensor(mu).float(),
             tensor(0.5),
             tensor(alpha * 10).float(),
             tensor(1.), tensor(1.))

    mcmc.summary()

The above works, but if I remove the for loop in model and uncomment the plate, I get an error saying RuntimeError: The size of tensor a (100) must match the size of tensor b (2) at non-singleton dimension 0. I think I need to implement an expand method for this class, but am not clear on how to. Any suggestions?

yes plate uses expand under the hood. you might look at other pyro distributions for example expand methods, e.g. studentT.

1 Like

Probably the issue at this line. logsumexp(0) does not work with a batch of values. Probably you want something like logsumexp(-1). Another issue is your sample method returns a scalar, which is not True in general. I believe you need to declare batch_shape, event_shape in the constructor and return torch.ones(sample_shape + self.shape()) in the sample method.

I think you don’t need to implement expand method because .expand(...) will return an ExpandedDistribution for you (unless you want to overwrite it with some custom implementations).

Thanks @fehiepsi.

I’m having some issues getting two different priors set on the component means.

What I would like to represent with my priors is:

mu[1] ~ Normal(1.8, 0.1)
mu[2] ~ Normal(5.7, 0.1)

But when I run multiple chains, some chains end up with posteriors centered at 1.8 in component 1 mean and some centered at 5.7 component 1 mean. Any suggestions on how to fix this? Do I need some type of plate around the mus?

For what it’s worth, it does not seem like the chains are initialized by sampling from the prior, as described in the documentation

initial_params … If not specified, parameter values will be sampled from the prior.

from numpy.random import choice, normal, seed
from numpy import array
from torch import tensor, Size, ones

from pyro.infer.mcmc.api import MCMC
from pyro.infer.mcmc import NUTS
from pyro import distributions as dist
from pyro import plate, sample, set_rng_seed


class MixtureNormal(dist.TorchDistribution):
    support = dist.Normal.support
    has_rsample = False

    def __init__(self, mu, sigma, theta):
        self._normal = dist.Normal(mu, sigma)
        self.theta = theta
        self._num_component = theta.shape[-1]

        cdbs = self._normal.batch_shape[:-1]
        event_shape = self._normal.event_shape
        self._event_ndims = len(event_shape)

        super(MixtureNormal, self).__init__(batch_shape=cdbs,
                                            event_shape=event_shape,
                                            validate_args=False)

    def sample(self, sample_shape=Size()):
        return ones(sample_shape + self.shape())

    def log_prob(self, value):
        contributions = self._normal.log_prob(value) + self.theta.log()

        return contributions.logsumexp(-1)


def model(data, mu0, sigma0, pi0):
    # priors
    mu = sample("mu", dist.Normal(mu0, sigma0))
    theta = sample("theta", dist.Dirichlet(pi0))
    sigma = tensor(0.1)
     
    with plate('data', len(data)):
        sample('obs', MixtureNormal(mu, sigma, theta), obs=data)

seed(1)
set_rng_seed(1)

n = 100
alpha = array([0.4, 0.6])
mu = array([1.8, 5.7])
sigma = 0.1

z = choice(len(alpha), size=n, p=alpha)
data = normal(mu[z], sigma, n)

mu0 = tensor(mu).float()
sigma0 = tensor(0.1)
alpha = tensor(alpha * 100).float()

kernel = NUTS(model)
mcmc = MCMC(kernel, num_samples=10, warmup_steps=0, num_chains=4)
mcmc.run(tensor(data).float().unsqueeze(1),
         mu0,
         sigma0,
         alpha)

print(mcmc.summary())

print(mcmc.get_samples(group_by_chain=True)['mu'].mean(1))
print(mcmc.get_samples(group_by_chain=True)['mu'])
                mean       std    median      5.0%     95.0%     n_eff     r_hat
     mu[0]      2.55      1.99      1.81      1.09      6.36      2.39      3.12
     mu[1]      4.84      2.02      5.70      0.17      5.77      2.40      2.82
  theta[0]      0.56      0.18      0.47      0.30      0.76      2.12     23.79
  theta[1]      0.44      0.18      0.31      0.24      0.70      2.12     23.79

Number of divergences: 3
None
tensor([[1.4932, 5.8903],
        [1.5514, 5.8562],
        [1.4103, 5.9970],
        [5.7619, 1.6005]])
tensor([[[-0.6392,  7.8258],
         [ 1.2329,  5.3973],
         [ 1.6752,  5.7484],
         [ 1.7760,  5.7006],
         [ 1.8168,  5.7074],
         [ 1.8129,  5.7082],
         [ 1.8145,  5.7005],
         [ 1.8147,  5.7048],
         [ 1.8147,  5.7048],
         [ 1.8132,  5.7057]],

        [[-0.1903,  7.4315],
         [ 1.3376,  5.4560],
         [ 1.7005,  5.7405],
         [ 1.7884,  5.7023],
         [ 1.8100,  5.7074],
         [ 1.8130,  5.7053],
         [ 1.8133,  5.7071],
         [ 1.8163,  5.7027],
         [ 1.8132,  5.7023],
         [ 1.8122,  5.7066]],

        [[-1.2729,  9.0464],
         [ 1.0872,  5.2244],
         [ 1.6405,  5.7740],
         [ 1.7750,  5.6915],
         [ 1.8069,  5.7103],
         [ 1.8117,  5.7038],
         [ 1.8117,  5.7038],
         [ 1.8152,  5.7058],
         [ 1.8146,  5.7044],
         [ 1.8133,  5.7056]],

        [[ 6.3648,  0.1720],
         [ 5.6095,  1.4249],
         [ 5.7201,  1.7247],
         [ 5.7014,  1.7931],
         [ 5.7057,  1.8105],
         [ 5.7018,  1.8185],
         [ 5.7048,  1.8165],
         [ 5.7031,  1.8123],
         [ 5.7049,  1.8164],
         [ 5.7031,  1.8156]]])

@jacobcvt12 I am not sure if using multi-chains will be helpful here. AFAIK multi-chains are mainly used to get more samples and to diagnose mixing issues. For multi-modal posteriors, if each chain covers one mode, it is still unclear on how to merge samples from those chains (for examples, assume your data concentrates on 2 groups, but one group contains 90 points, the other group contains 10 points; you run 2-chain MCMC and somehow get 1000 samples from the first chain concentrates around the first group and 1000 samples from the second chain concentrate around the second group; if you combine them, you will get equal distribution over those two groups because each of them has 1000 samples - this violates the assumption that the distribution of them should be proportional to 9-1). So to solve multi-modal posteriors, in my opinion, it is better to construct the model (e.g. using mixture models) so that it can capture those modes or using some specific MCMC algorithms (to my knowledge, using normalization flows are helpful here).

If you use pyro dev version, you can use various initialization strategies for the argument init_strategy of NUTS kernel. init_to_sample will be what you are looking for.

Thanks for suggestions @fehiepsi. I’m still confused on why the chains converge differently though. If I had an exchangeable prior on the mu’s (or even week non-exchangeable priors), I would expect that some chains might have mu[1] approx 1.8, mu[2] approx 5.7 at the posterior and other chains vice versa. But given the strong priors, I’m surprised that this is happening. My guess is that the way I am specifying the prior does not imply ordering, whereas in Stan, you can specify e.g.

for (k in 1:K) {
    mu[k] ~ Normal(mu0[k], sigma0);
}

Is there a way to do something similar in pyro?

@jacobcvt12 You are right that with the default init_strategy, chains might start at points far from priors. If you are familiar with Stan, then you might know that initial values in Stan are generated randomly from -2 to 2 in “unconstrained support” (see init section in Stan doc). The default init_strategy in Pyro is init_to_uniform(radius=2), which is the same as Stan. That means, if your prior is Normal(100, 1), the initial value still belongs to (-2, 2) with the default init strategy in Pyro and Stan. In Pyro, you can use init_strategy=init_to_sample to get initial values from priors. I’m not sure if Stan supports such functionality.

the chains converge differently

I guess with num_samples=10, warmup_steps=0, the particles haven’t traveled far from initial values yet. Could you try higher num_samples or warmup_steps?