Multi-Label Classification from Time-Series data

Hey everyone,

I’m trying to get my head around plate notation in Pyro, but unfortunately I haven’t been able to fully understand it from the documentation.

At a high level, I’m trying to model a time-series dataset as a Deep Markov Model, where the transition and emission statistics are parametrized by neural networks. This is a multi-task dataset, and one of the tasks is multi-label classification at the end of each sequence (25 binary labels). The inputs are mini-batches of padded time series (just like in the DMM tutorial), and pheno holds the binary multi-label targets.

    with pyro.plate('X_plate', X.size(1), device=X.device):  # plate over the mini-batch
      for t in range(T_max):
        # mask out padded time steps beyond each sequence's length L
        with poutine.mask(mask=(t < L)):
          h_t, z_mu, z_log_var = self.transition(z_prev, h_prev)
          z_dist = dist.Normal(z_mu, z_log_var.exp()).to_event(1)

          z_t = pyro.sample('Z_{}'.format(t + 1), z_dist)

          # TODO: How do we decide df?
          x_mu, x_log_var = self.emitter(z_t)
          x_dist = dist.StudentT(2, x_mu, x_log_var.exp()).to_event(1)

          pyro.sample('X_{}'.format(t + 1), x_dist,
                      obs=X[t, :, :13])

        # only score the 25 binary labels at the final step of each sequence
        with poutine.mask(mask=(t == L)):
          pheno_p = self.pheno(z_t)
          pheno_dist = dist.Bernoulli(pheno_p).to_event(1)

          pyro.sample('P_{}'.format(t + 1), pheno_dist,
                      obs=pheno)

        h_prev = h_t
        z_prev = z_t

I have the following questions:

  1. Is there a difference between Normal and MultivariateNormal distributions?
  2. At the last time step of each time series, I would like to solve 25 independent binary classification problems. So I added another sample site with a Bernoulli parametrized by another NN, declaring independence along the columns of pheno_p. Is this the right way to proceed?
  3. Which parameters is pyro.module exactly capturing? My NN parameters are not something that I want to be Bayesian about.

Thanks!

EDIT: Upgraded the code to Pyro 0.3.0 and limited the scope of the questions.

Can somebody please help me verify that what I’m doing here is right?

cc @fritzo, @neerajprad, @eb8680_2

Hi @activatedgeek,

Is there a difference between Normal and MultivariateNormal distributions?

Yes. Normal is an elementwise normal distribution, whereas MultivariateNormal allows for correlations along the rightmost tensor dimension.
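
A small illustration (the shapes here are just for the example): both distributions end up with the same event_shape, but only MultivariateNormal can encode correlations across that dimension.

    import torch
    import pyro.distributions as dist

    loc = torch.zeros(3)

    # Elementwise normal with the last dim declared as an event dim:
    # three independent coordinates (a diagonal covariance, in effect).
    d1 = dist.Normal(loc, torch.ones(3)).to_event(1)
    print(d1.event_shape)  # torch.Size([3])

    # MultivariateNormal: scale_tril (or covariance_matrix) can couple the
    # coordinates along that rightmost dimension.
    scale_tril = torch.tensor([[1.0, 0.0, 0.0],
                               [0.5, 1.0, 0.0],
                               [0.2, 0.3, 1.0]])
    d2 = dist.MultivariateNormal(loc, scale_tril=scale_tril)
    print(d2.event_shape)  # torch.Size([3])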

At the last time step of each time-series …

This is a really interesting pattern. I think what you’re doing is correct, but it seems wasteful to evaluate the classification net at every step of the loop (and mask out the sequences that haven’t just ended). I think it would be a bit neater to build up a z_final as you go and then run the nn once, something like this:

    with pyro.plate('X_plate', X.size(1), device=X.device):
        # accumulates each sequence's latent state at its final step
        z_final = torch.empty_like(z_prev)  # TODO: appropriate size here (same shape as each z_t)
        for t in range(T_max):
            with poutine.mask(mask=(t < L)):
                h_t, z_mu, z_log_var = self.transition(z_prev, h_prev)
                z_dist = dist.Normal(z_mu, z_log_var.exp()).to_event(1)
                z_t = pyro.sample('Z_{}'.format(t + 1), z_dist)

                # Let optimizer learn df :-)
                # (needs: from torch.distributions import constraints)
                df = pyro.param("df", torch.tensor(2.), constraint=constraints.positive)
                x_mu, x_log_var = self.emitter(z_t)
                x_dist = dist.StudentT(df, x_mu, x_log_var.exp()).to_event(1)
                pyro.sample('X_{}'.format(t + 1), x_dist,
                            obs=X[t, :, :13])

            is_final = (t == L)
            z_final[is_final] = z_t[is_final]  # save last step for later
            h_prev = h_t
            z_prev = z_t

        # run the classification net once, on the final latent of each sequence
        pheno_p = self.pheno(z_final)
        pheno_dist = dist.Bernoulli(pheno_p).to_event(1)
        pyro.sample('P_final', pheno_dist,
                    obs=pheno)

Let us know how this works!

Which parameters is pyro.module exactly capturing? My NN parameters are not something that I want to be Bayesian about.

pyro.module captures all nn parameters, but they are optimized rather than treated in a Bayesian way. Pyro is only Bayesian about parameters if you tell it to via poutine.lift.
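
Concretely, here is a minimal sketch (the module name and shapes are just for illustration) of what pyro.module does: it registers the module’s parameters in the param store so that SVI can optimize them.

    import torch.nn as nn
    import pyro

    emitter = nn.Linear(16, 13)  # stand-in for one of your networks

    # Registering the module puts emitter.weight and emitter.bias in Pyro's
    # param store (under names like "emitter$$$weight"), where the SVI
    # optimizer updates them as ordinary point-estimated parameters.
    pyro.module("emitter", emitter)

    print(pyro.get_param_store().get_all_param_names())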

I have a few follow up questions.

MultivariateNormal allows for correlations along the rightmost tensor dimension.

Ah, yes. In this case, does the call to to_event(1) let me replicate that behavior (albeit with a diagonal covariance)?

I think it would be a bit neater to build up a z_final as you go.

Thanks! Certainly much cleaner. Should shave off an hour or two hopefully.

Let optimizer learn df :-)

That’s much easier I guess.

pyro.module captures all nn parameters, but they are optimized rather than treated in a Bayesian way.

Ok that clarifies it now.

Let us know how this works!

At this stage, I am using an empty guide (which I believe is equivalent to MAP). I was able to get the loss well below zero (and it seemed to still be trending downward). Before I add the guide, I have some follow-up questions.

  1. Does using pyro.markov make sense here in the time loop? My structure satisfies the Markov property, and I was wondering whether it would give me some speed-up during training. Something like:

         for t in pyro.markov(range(T_max)):
             ...
  2. How exactly are constraints enforced during optimization? Is this related to Lagrange duals or some sort of projection back into the feasible space?

  3. From a syntax perspective, I’m unclear how to make predictions in both the MAP and posterior predictive cases. Would you mind pointing me to the right functions (or relevant portions of the examples)?

Thank you so much! I should have much better clarity once I have these follow-ups resolved.

1. Does using pyro.markov make sense here in the time loop?

Currently (in the Pyro 0.3 release) pyro.markov is only useful when enumerating discrete latent variables; it is mainly used internally to let Pyro know when it can reuse a tensor dimension. Beware that torch.cat()ing the z_finals won’t work if you’re enumerating, but maybe you could append them to a list instead?
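
For reference, here is a minimal sketch (all names and shapes are illustrative, and it assumes training with TraceEnum_ELBO) of the kind of model where pyro.markov does help: a chain of enumerated discrete latents, where it lets Pyro recycle tensor dimensions across non-adjacent time steps.

    import torch
    import pyro
    import pyro.distributions as dist
    from torch.distributions import constraints

    def chain_model(data, num_states=3):
        # learned transition matrix and per-state emission means
        trans = pyro.param("trans", torch.ones(num_states, num_states) / num_states,
                           constraint=constraints.simplex)
        locs = pyro.param("locs", torch.randn(num_states))
        s = 0
        for t in pyro.markov(range(len(data))):
            # an enumerated discrete latent chain; pyro.markov lets Pyro
            # recycle enumeration dims instead of allocating one per step
            s = pyro.sample('s_{}'.format(t), dist.Categorical(trans[s]),
                            infer={"enumerate": "parallel"})
            pyro.sample('x_{}'.format(t), dist.Normal(locs[s], 1.0),
                        obs=data[t])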

2. How exactly are constraints enforced during optimization?

Constraints are enforced by storing an unconstrained parameter under the hood and mapping it into the feasible space with the bijective transform returned by transform_to(); gradients are taken with respect to the unconstrained value, so there are no Lagrange duals or explicit projection steps involved.
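
Roughly, a pyro.param with a constraint stores an unconstrained tensor and returns the transformed value. Here is a standalone sketch of just the transform part:

    import torch
    from torch.distributions import constraints, transform_to

    unconstrained = torch.tensor(-3.0, requires_grad=True)  # lives in R, optimized directly

    # transform_to(...) returns a bijection onto the constrained space;
    # for `positive` this is essentially exp(), so the result is always > 0.
    df = transform_to(constraints.positive)(unconstrained)
    print(df)  # ~0.0498, still differentiable w.r.t. `unconstrained`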

3. I’m unclear how to make predictions in both the MAP and posterior predictive cases.

It depends on what you want to predict. If you are predicting continuous global variables, use an AutoDelta guide for MAP inference, or an AutoMultivariateNormal guide (or similar) for a posterior approximation. If you are predicting discrete local variables, you can use infer_discrete() for both MAP and posterior predictive inference (set temperature=0 for MAP, temperature=1 for posterior sampling). Take a look at the GMM example.
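
For the continuous case the setup looks roughly like this; a sketch, assuming your model is exposed as a function model(X, L, pheno) (in the 0.3 release the autoguides live under pyro.contrib.autoguide; in later releases they moved to pyro.infer.autoguide):

    import pyro
    from pyro.contrib.autoguide import AutoDelta  # or AutoMultivariateNormal
    from pyro.infer import SVI, Trace_ELBO
    from pyro.optim import Adam

    guide = AutoDelta(model)  # point estimates for every latent site in `model`
    svi = SVI(model, guide, Adam({"lr": 1e-3}), loss=Trace_ELBO())

    for step in range(1000):
        svi.step(X, L, pheno)

    # After training, calling the guide returns the MAP values of the latents,
    # which you can feed back through the model (or your emitter/pheno nets)
    # to form predictions.
    map_estimates = guide(X, L, pheno)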

This is great! Thank you so much for all the help. I think I’ll pose any further questions in a separate thread.