Modeling missingness indicators

Hey everyone,

Does someone have a code sample showing how to model missing values? In theory, I understand that missing values can be treated just like any other random variables. However, it would be great to get a head start on how to write this in Pyro.

P.S.: This is my first time applying theory using a PPL, and I'm still trying to calibrate the theory/practice transfer.

There are many ways to model missing data in a PPL like Pyro. I think the main techniques are:

  • make partial observations sequentially via pyro.sample(..., obs=x) where x is either a tensor or None.
  • make partial observations in parallel using poutine.mask to include only observed data in the log prob
  • optionally model missingness via pyro.sample("present", Bernoulli(p_observed), obs=present)

For example, suppose you have a dataset of inputs x and partially observed outputs y:

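import torch
import pyro
from pyro import poutine
from pyro.distributions import Bernoulli, Normal

# my_loc_nn and my_presence_nn are torch.nn.Module instances defined elsewhere.
# Missing entries of y should hold any finite placeholder value (not NaN).
def model(x, y, y_present):
    assert x.dtype == torch.float
    assert y.dtype == torch.float
    assert x.shape == y.shape
    assert y_present.dtype == torch.bool  # recent Pyro versions expect a boolean mask
    with pyro.plate("data", len(x)):

        # Model the data that is observed:
        with poutine.mask(mask=y_present):
            pyro.module("loc_nn", my_loc_nn)
            loc = my_loc_nn(x)
            pyro.sample("y", Normal(loc, 1.), obs=y)

        # Model whether data is observed:
        pyro.module("presence_nn", my_presence_nn)
        p_present = my_presence_nn(x)
        pyro.sample("y_present", Bernoulli(p_present),
                    obs=y_present.float())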

That makes sense. I think for a start, I’m choosing to model missingness directly via independent Bernoulli(s).

On this note, one thing that comes up is that sometimes I might want to integrate out my missing values. If I want to put that down in Pyro PPL,

  1. For discrete RVs, does enumeration in the model equal integrating out those missing values?

  2. For continuous RVs, I would like to think of something like a Monte-Carlo EM. Should this be part of the model? If yes, how?

Thank you so much for the inputs!

  1. Yes, enumerating in the model is equivalent to integrating out the variables.
  2. I think so: you would sample the missing values as latent variables in the model and "Monte Carlo integrate them out", but I'm not sure.

I'm unclear on what this implementation would look like inside the model. Do you mind providing rough hints as to what I should be doing in Pyro?