Shape problem with mixture model, enumeration and Predictive

I am working with a relatively simple extension of the model from the GMM tutorial and am running into problems with tensor dimensions. Model evaluation works (model(X, y)) and TraceEnum_ELBO evaluation works, but calling Predictive works only when the number of test samples matches the number of training samples.

I do not use any squeeze, to_event, or .T calls, or anything else that might mess with the dimensions; it’s all very basic.

Here’s the fully reproducible example:

import torch
import pyro
import pyro.distributions as dist
from pyro import poutine
from pyro.infer import TraceEnum_ELBO, config_enumerate, Predictive
from pyro.infer.autoguide import AutoDiagonalNormal

# Notice the different number of samples (10 vs 5) in the training / test "datasets"
x_train = torch.randn(10)
y_train = torch.bernoulli(torch.ones(10)*0.5)
x_test = torch.randn(5)
y_test = torch.bernoulli(torch.ones(5)*0.5)

@config_enumerate
def model_LR_aleatoric(X, y, num_mixture_components=3):
    beta_coef = pyro.sample("beta", dist.Normal(0.0, 5.0))  # size []
    risk_logits = X * beta_coef  # broadcasting to [len(X),] here
    
    weights_dist = dist.Dirichlet(0.5 * torch.ones(num_mixture_components))
    weights = pyro.sample('weights', weights_dist)
    # location and scale parameters of the 3 noise components
    with pyro.plate('components', num_mixture_components):
        locs = pyro.sample('locs', dist.Normal(0., 0.5))
        scales = pyro.sample('scales', dist.LogNormal(0., 0.5))
    
    # Actual observation model
    with pyro.plate("data", y.shape[0]):
        # which noise component to sample from
        assignment = pyro.sample('assignment', dist.Categorical(weights))
        risk_noise_dist = dist.Normal(locs[assignment], scales[assignment])
        risk_noise = pyro.sample('risk_noise', risk_noise_dist)
        actual_risk_logits = pyro.deterministic("actual_risk_logits", risk_logits + risk_noise)
        outcome_dist = dist.Bernoulli(logits=actual_risk_logits)
        observation = pyro.sample("obs", outcome_dist, obs=y)

guide_LR_aleatoric = AutoDiagonalNormal(poutine.block(model_LR_aleatoric, hide=['assignment']))

pyro.clear_param_store()
model_LR_aleatoric(x_train, y_train) # works fine
elbo = TraceEnum_ELBO(max_plate_nesting=1)
elbo.loss(model_LR_aleatoric, guide_LR_aleatoric, x_train, y_train); # works fine
predictive = Predictive(model=model_LR_aleatoric, guide=guide_LR_aleatoric, num_samples=4)
samples = predictive(x_train, y_train) # works fine
samples = predictive(x_test, y_test) # fails

The last command fails with the error message The size of tensor a (5) must match the size of tensor b (10) at non-singleton dimension 0, raised in the line actual_risk_logits = ....

So, of course, I started tracing the shapes of the different variables throughout the various calls.

During the first model call, I find these sizes, which all seem to make sense:

ic| X.shape: torch.Size([10])
ic| y.shape: torch.Size([10])
ic| weights_dist.batch_shape: torch.Size([])
ic| weights_dist.event_shape: torch.Size([3])
ic| weights.shape: torch.Size([3])
ic| locs.shape: torch.Size([3])
ic| assignment.shape: torch.Size([10])
ic| locs[assignment].shape: torch.Size([10])
ic| scales[assignment].shape: torch.Size([10])
ic| risk_noise_dist.batch_shape: torch.Size([10])
ic| risk_noise_dist.event_shape: torch.Size([])
ic| risk_logits.shape: torch.Size([10])
ic| risk_noise.shape: torch.Size([10])

During the elbo.loss call, exactly as outlined here, I find that the model is called twice. In the first call (by the guide), shapes are as above, whereas in the second call we have assignment.shape: torch.Size([3, 1]) due to the enumeration (same for locs[assignment].shape and scales[assignment].shape). So far, so good, I guess?
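For reference, the format_shapes recipe from the tensor shapes tutorial prints all of these shapes in one place. A minimal sketch, assuming max_plate_nesting=1 so that enumeration starts at dim -2:

guide_trace = poutine.trace(guide_LR_aleatoric).get_trace(x_train, y_train)
model_trace = poutine.trace(
    poutine.replay(poutine.enum(model_LR_aleatoric, first_available_dim=-2),
                   trace=guide_trace)
).get_trace(x_train, y_train)
print(model_trace.format_shapes())  # one row per sample site, incl. enum dims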

During the call predictive(x_train, y_train), I find that the model is called six times (num_samples is four; can someone explain the discrepancy?). During the first two calls, shapes are exactly as outlined above, whereas during the latter four calls I have

ic| weights_dist.batch_shape: torch.Size([])
ic| weights_dist.event_shape: torch.Size([3])
ic| weights.shape: torch.Size([1, 3])
ic| locs.shape: torch.Size([3])

… can someone explain where the additional 1-dimension for the weights comes from?

And, finally, during the call predictive(x_test, y_test), the first two model calls are again like above (except that all 10s are replaced by 5s, the size of the test dataset). In the next model call, we then have these sizes, before the error mentioned above occurs:

ic| X.shape: torch.Size([5])
ic| y.shape: torch.Size([5])
ic| weights_dist.batch_shape: torch.Size([])
ic| weights_dist.event_shape: torch.Size([3])
ic| weights.shape: torch.Size([1, 3])
ic| locs.shape: torch.Size([3])
ic| assignment.shape: torch.Size([5])
ic| locs[assignment].shape: torch.Size([5])
ic| scales[assignment].shape: torch.Size([5])
ic| risk_noise_dist.batch_shape: torch.Size([5])
ic| risk_noise_dist.event_shape: torch.Size([])
ic| risk_logits.shape: torch.Size([5])
ic| risk_noise.shape: torch.Size([10])

It is completely opaque to me where the [10] in the last line comes from when everything else has shape [5]…?!

Any help is greatly appreciated! I know it takes significant time to sift through these complex problem descriptions.

(I am aware that there are other current threads on very similar topics - 1, 2 - , but I did not see how the solutions discussed there would solve my problem / apply to my setting.)

I believe the two additional calls are due to this line, where it is called once for the guide and once for the model (_guess_max_plate_nesting guesses max plate nesting by running the input model).

Predictive works by running the model conditioned on latent samples from posterior_samples. If a guide is provided, then posterior samples are obtained from the guide.

Note that your guide is conditioned on particular data (x_train, y_train) with size 10. Specifically, the local variable risk_noise in the guide has a distribution risk_noise_dist with size equal to 10. If you want to use Predictive for new data (which can potentially have a different size), you probably need to use amortized inference. But I’m also not sure what you are trying to use Predictive for here?

On a technical level, Predictive runs your model while replacing the values at pyro.sample sites with 1) values from posterior_samples, or 2) values sampled from the guide distribution (in both of these cases it does not sample from the model distribution), or 3) values sampled directly from the model distribution if neither 1) nor 2) applies. For example, notice that since the assignment variable is marginalized out in the model, it is not present in the guide; it is therefore sampled directly from the model distribution and has a size of 5 (because y_test.shape[0] == 5). Hope this clarifies it a bit more.
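To make these three cases concrete, here is a hand-rolled version of a single Predictive draw (a sketch reusing the objects defined above; poutine.replay plays the guide's sampled values back into the model):

guide_trace = poutine.trace(guide_LR_aleatoric).get_trace(x_train, y_train)
replayed = poutine.replay(model_LR_aleatoric, trace=guide_trace)
# 'assignment' is hidden from the guide, so it falls through to case 3
# and is freshly sampled from the model at the size of the data plate.
replayed(x_train, y_train)  # fine: the replayed risk_noise has size 10
replayed(x_test, y_test)    # same size-mismatch error as Predictive:
                            # risk_noise (10) meets risk_logits (5)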


That certainly clarifies a lot, thank you so much!

What I am ultimately trying to do is

  1. estimate p(\text{beta\_coef}, \text{weights}, \text{locs}, \text{scales} \mid x_{\text{train}})
  2. use that to estimate p(\text{actual\_risk\_logits} \mid x_{\text{test}}).

I somehow thought Predictive would be the right tool for that, but maybe it is not? Maybe I should just condition the model on the values of beta_coef, weights, locs, scales inferred during training, pass in x_{\text{test}}, draw a few samples, and look at the resulting distribution of actual_risk_logits?
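A minimal sketch of what I have in mind, assuming the site names from my model above (the local sites assignment and risk_noise would then be re-drawn at the size of x_test):

num_draws = 100
logit_draws = []
for _ in range(num_draws):
    # one posterior draw of the globals from the trained guide
    guide_trace = poutine.trace(guide_LR_aleatoric).get_trace(x_train, y_train)
    global_values = {name: guide_trace.nodes[name]["value"]
                     for name in ["beta", "weights", "locs", "scales"]}
    # pin the globals; the locals are re-sampled at test-set size
    conditioned = poutine.condition(model_LR_aleatoric, data=global_values)
    test_trace = poutine.trace(conditioned).get_trace(x_test, y_test)
    logit_draws.append(test_trace.nodes["actual_risk_logits"]["value"])
logits = torch.stack(logit_draws)  # shape [num_draws, len(x_test)]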

On a side note, I think this section of the introductory tutorial might be misleading in this regard, because it starts off by saying “we will compare the posterior predictive distribution over possible new data induced by our model to the existing observed data”, but all the examples that follow just use Predictive on the training data, i.e., no new data are used anywhere, I believe? If you agree that this is misleading and I’m not just misunderstanding something, where should I suggest an improvement? Would that be a bug report on GitHub?

It seems clear to me what is meant by new data here, at least given the context. WDYT, @fritzo?

Maybe your confusion comes from wanting to use Predictive for inference. I think Predictive should be used for generating new data/samples while conditioning the latent variables on some samples or on a guide. It does not do inference by updating probabilities via Bayes’ rule. When you want to calculate p(... | x_test), you are doing inference. You can do that either by 1) running inference again with the new test data (use TraceEnum_ELBO as you did for the training data, but replace the global priors with the updated priors after training), or 2) amortized inference, if you want to learn a function from any data point to the variational parameters of the latent variables.
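For option 1, a rough sketch (a crude variant that, instead of fitting proper updated priors, freezes the globals at a single posterior draw from the trained guide and re-runs SVI on the test data, so that the local sites get the right size):

from pyro.infer import SVI
from pyro.optim import Adam

# one posterior draw of the globals from the guide trained on x_train
guide_trace = poutine.trace(guide_LR_aleatoric).get_trace(x_train, y_train)
global_values = {name: guide_trace.nodes[name]["value"].detach()
                 for name in ["beta", "weights", "locs", "scales"]}

pyro.clear_param_store()  # fresh start; avoids clashes with the old autoguide's params
model_test = poutine.condition(model_LR_aleatoric, data=global_values)
guide_test = AutoDiagonalNormal(poutine.block(model_test, hide=["assignment"]))
svi = SVI(model_test, guide_test, Adam({"lr": 0.01}),
          loss=TraceEnum_ELBO(max_plate_nesting=1))
for _ in range(1000):
    svi.step(x_test, y_test)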