# Multi-Label Classification from Time-Series data

Hey everyone,

I’m trying to get my head around plate notation in Pyro but unfortunately haven’t been able to work it out from the documentation.

At a high level, I’m trying to model a time-series dataset as a Deep Markov Model where the transition and emission statistics are parametrized by NNs. This is a multi-task dataset, and one of the tasks is multi-label classification at the end of each sequence (25 binary labels). The inputs are mini-batches of padded time series (just like in the DMM tutorial), and `pheno` holds the binary multi-label targets.

```python
with pyro.plate('X_plate', X.size(1), device=X.device):
    for t in range(T_max):
        h_t, z_mu, z_log_var = self.transition(z_prev, h_prev)
        z_dist = dist.Normal(z_mu, z_log_var.exp()).to_event(1)

        z_t = pyro.sample('Z_{}'.format(t + 1), z_dist)

        # TODO: How do we decide df?
        x_mu, x_log_var = self.emitter(z_t)
        x_dist = dist.StudentT(2, x_mu, x_log_var.exp()).to_event(1)

        pyro.sample('X_{}'.format(t + 1), x_dist,
                    obs=X[t, :, :13])

        pheno_p = self.pheno(z_t)
        pheno_dist = dist.Bernoulli(pheno_p).to_event(1)

        pyro.sample('P_{}'.format(t + 1), pheno_dist,
                    obs=pheno)

        h_prev = h_t
        z_prev = z_t
```

I have the following questions:

1. Is there a difference between `Normal` and `MultivariateNormal` distributions?
2. At the last time step of each time-series, I would like to solve 25 independent binary classification problems. So, I instantiated the plate notation again with a Bernoulli parametrized by another NN. I’ve specified its independence along the columns of `pheno_p`. Is this the right way to proceed?
3. Which parameters is `pyro.module` exactly capturing? My NN parameters are not something that I want to be Bayesian about.

Thanks!

EDITS: Upgraded code to `Pyro 0.3.0` and limited the scope of questions.

Can somebody please help me verify if what I’m thinking is doing the right thing?

Is there a difference between Normal and MultivariateNormal distributions?

Yes. `Normal` is an elementwise normal distribution, whereas `MultivariateNormal` allows for correlations along the rightmost tensor dimension.
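A minimal sketch of the distinction, using `torch.distributions` directly (Pyro’s distributions wrap these). `Normal(...).to_event(1)` (spelled `Independent(Normal(...), 1)` in raw torch) declares the dims independent with diagonal covariance, while `MultivariateNormal` can encode correlations; with an identity covariance the two agree:

```python
import torch
from torch.distributions import Normal, MultivariateNormal, Independent

loc = torch.zeros(3)
# Elementwise normal over 3 dims; Independent(..., 1) reinterprets the
# rightmost batch dim as an event dim -- this is what .to_event(1) does in Pyro.
diag = Independent(Normal(loc, torch.ones(3)), 1)
# Full multivariate normal over the same 3 dims; off-diagonal covariance allowed.
full = MultivariateNormal(loc, covariance_matrix=torch.eye(3))

x = torch.tensor([0.5, -1.0, 2.0])
# Same event shape; identical log_prob only because the covariance is identity.
# They diverge once off-diagonal terms appear, which Normal cannot express.
print(diag.event_shape, full.event_shape)
print(diag.log_prob(x), full.log_prob(x))
```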

At the last time step of each time-series …

This is a really interesting pattern. I think what you’re doing is correct, but it seems weird to evaluate every step of the loop (and mask out sequences that haven’t just ended). I think it would be a bit neater to build up a `z_final` as you go and then run the nn once, something like this:

```python
with pyro.plate('X_plate', X.size(1), device=X.device):
    z_final = torch.empty("TODO appropriate size here")
    for t in range(T_max):
        h_t, z_mu, z_log_var = self.transition(z_prev, h_prev)
        z_dist = dist.Normal(z_mu, z_log_var.exp()).to_event(1)
        z_t = pyro.sample('Z_{}'.format(t + 1), z_dist)

        # Let optimizer learn df :-)
        df = pyro.param("df", torch.tensor(2.),
                        constraint=constraints.positive)
        x_mu, x_log_var = self.emitter(z_t)
        x_dist = dist.StudentT(df, x_mu, x_log_var.exp()).to_event(1)
        pyro.sample('X_{}'.format(t + 1), x_dist,
                    obs=X[t, :, :13])

        is_final = (t == L)
        z_final[is_final] = z_t[is_final]  # save last step for later
        h_prev = h_t
        z_prev = z_t

    pheno_p = self.pheno(z_final)
    pheno_dist = dist.Bernoulli(pheno_p).to_event(1)
    pyro.sample('P_final', pheno_dist,
                obs=pheno)
```

Let us know how this works!

Which parameters is pyro.module exactly capturing? My NN parameters are not something that I want to be Bayesian about.

`pyro.module` registers all of the NN's parameters with Pyro's param store, but they are optimized as point estimates rather than treated in a Bayesian way. Pyro is only Bayesian about parameters if you tell it to, e.g. via `poutine.lift`.

I have a few follow up questions.

`MultivariateNormal` allows for correlations along the rightmost tensor dimension.

Ah yes. In this case, does the call to `to_event(1)` allow me to replicate that behavior (albeit with diagonal covariance)?

I think it would be a bit neater to build up a `z_final` as you go.

Thanks! Certainly much cleaner. Should shave off an hour or two hopefully.

Let optimizer learn df

That’s much easier I guess.

`pyro.module` captures all nn parameters, but they are optimized rather than treated in a Bayesian way.

Ok that clarifies it now.

Let us know how this works!

At this stage, I am using an empty guide (which I believe is equivalent to MAP). I was able to get the loss well below zero, and it seemed to be trending further downward. Before I add the guide, I have some follow-up questions.

1. Does using `pyro.markov` make sense here in the time loop? My structure does satisfy the Markov property and I was wondering if that helps give me some speed up during training. So something like
```python
for t in pyro.markov(range(T_max)):
    ...
```
2. How exactly are constraints enforced during optimization? Is this related to Lagrange duals or some sort of projection back into the feasible space?

3. From a syntax perspective, I’m unclear how to make predictions in both the MAP and the posterior predictive case. Would you mind pointing to the right functions (or portions of examples)?

Thank you so much! I should have much better clarity once I have these follow-ups resolved.

1. Does using `pyro.markov` make sense here in the time loop?

Currently (as of the Pyro 0.3 release) `pyro.markov` is only useful when enumerating discrete latent variables. It is mainly used internally to let Pyro know when it can reuse a tensor dimension. Beware that `torch.cat()`ing the `z_final`s won’t work if you’re enumerating, but maybe you could append them to a list instead?

2. How exactly are constraints enforced during optimization?

Constraints are enforced via `transform_to()`: each constrained parameter is stored as an unconstrained tensor and mapped through a bijection into the feasible set, so the optimizer always works in unconstrained space. No Lagrange duals involved.
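A quick sketch with `torch.distributions.transform_to`, which is what Pyro uses under the hood; assuming the default constraint registry, `constraints.positive` maps through an exp-style transform:

```python
import torch
from torch.distributions import constraints, transform_to

# Pyro stores an unconstrained value and maps it into the feasible set.
t = transform_to(constraints.positive)
unconstrained = torch.tensor(-3.0)
df = t(unconstrained)   # strictly positive, differentiable w.r.t. the input
roundtrip = t.inv(df)   # recovers the unconstrained value
print(df, roundtrip)
```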

3. I’m unclear how to make predictions in both the MAP and the posterior predictive case.

It depends what you want to predict. If you are predicting continuous global variables, use an `AutoDelta` guide for MAP inference or an `AutoMultivariateNormal` guide (or similar) for the posterior. If you are predicting discrete local variables, you can use `infer_discrete()` for both MAP and posterior predictive inference (set `temperature=0` for MAP, `temperature=1` for posterior sampling). Take a look at the GMM example.

This is great! Thank you so much for all the help. I think I’ll pose any further questions in a separate thread.