How to predict new data with missing data imputation?

dfd · March 14, 2022, 7:25pm

I’ve tried to follow a few examples of missing value imputation with NumPyro, but I cannot find one that works on new data with missing values. What seems to happen instead is that the missing-value index derived during training does not change when I run new data through the model, and this causes an error that starts with:
“ValueError: Incompatible shapes for broadcasting”

I started with this tutorial, and it works (with minor API changes):

github.com

RMichae1/PyroStudies/blob/master/Bayesian_Imputation.ipynb


But I made my own version (starting from copy of the original) to demonstrate the prediction problem here:

github.com

dfd/numpyro_missing/blob/master/Bayesian_Imputation.ipynb


I would appreciate it if anyone could either point out a way to fix the problem in my notebook or point me to a working example that uses Bayesian missing value imputation and predicts on new data with missing values. Thanks in advance!

fehiepsi · March 15, 2022, 12:39am

So you are trying to infer missing values from the training set and then use that information for “inferring” missing values in the test set? Is your model capable of doing so? If not, I would suggest using a model that can fill in the missing values for you.

dfd · March 16, 2022, 5:48pm

@fehiepsi

I’m trying to fit a model on training data that has missing values in one column, and then use the fitted model to make predictions on new data with missing values in the same column. In the example I posted, I merely take the original training data and randomly choose new entries to be missing.

This time I used the missing value imputation example straight from the NumPyro documentation, and it also fails when I reshuffle the data and add one more missing value, as you can see here: numpyro_missing/docs_bayesian_imputation.ipynb at master · dfd/numpyro_missing · GitHub

It blows up on the line:

    age = jnp.asarray(age).at[age_nanidx].set(age_impute)

with error:

ValueError: Incompatible shapes for broadcasting: (177,) and requested shape (178,)

In the notebook, I print out the indices of the missing values within the model, and we can see they differ between train and test. But for some reason, it is not adjusting the size of age_impute.
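A likely explanation, offered as an assumption rather than something confirmed in this thread: when posterior samples are passed to Predictive, every sample site whose name appears in them is fixed to the stored values. Each draw of age_impute therefore keeps its training length (177), while age_nanidx is recomputed from the new data (178). The minimal JAX sketch below, with made-up ages and a hypothetical helper `fill_missing`, reproduces the same kind of shape mismatch without any model:

```python
import numpy as np
import jax.numpy as jnp


def fill_missing(age, age_impute):
    """Write age_impute into the NaN slots of age, like the failing line."""
    age_nanidx = np.nonzero(np.isnan(age))[0]
    return jnp.asarray(age).at[age_nanidx].set(age_impute)


# Made-up data: 2 missing values at training time, 3 at prediction time.
age_train = np.array([22.0, np.nan, 35.0, np.nan, 54.0])
age_test = np.array([np.nan, 38.0, np.nan, 27.0, np.nan])

# One stored draw of age_impute has one entry per *training* NaN.
age_impute_draw = jnp.array([30.0, 41.0])

# Shapes agree: 2 values for 2 NaN slots.
filled_train = fill_missing(age_train, age_impute_draw)

# Shapes disagree: 2 values for 3 NaN slots.
try:
    fill_missing(age_test, age_impute_draw)
    error_message = None
except (ValueError, TypeError) as err:
    error_message = str(err)

print(error_message)
```

The first call succeeds because the number of imputed values matches the number of NaN slots; the second fails because the stored draw is one entry short.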

Could someone please suggest how to augment that documentation example to handle predictions on new data with different missing values?

Or is there any working example of fitting a model with missing data and then using that model to do predictions on different data (or the same data with different missing values)?

fehiepsi · March 16, 2022, 8:57pm

Hi @dfd, the model there is used to infer missing values from “some” data. It is not meant to be used directly in a train/test fashion, where you train a model on incomplete training data and test it on incomplete test data. I would suggest constructing a model for your purpose first, rather than trying to apply the model from that tutorial to your problem. It may help to search for references first to see what people have done for this problem. I’m curious how Bayesian methods are applied in such a situation.

“fitting a model with missing data and then using that model to do predictions on different data”

If you want to infer the missing data in the new dataset, I guess the easiest way is to rerun your inference code on the new dataset. Otherwise, you can construct a model that learns the loc and scale of the missing entries and use Normal(loc, scale) to simulate missing values on the test data (it is better to use variational inference for this approach).
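The second option can be sketched roughly as follows. This is only an illustration under stated assumptions: `simulate_missing` is a hypothetical helper, and `learned_loc` and `learned_scale` are made-up numbers standing in for values a model would learn from the training data.

```python
import numpy as np
import jax.numpy as jnp
from jax import random


def simulate_missing(key, x, loc, scale):
    """Fill the NaN entries of x with draws from Normal(loc, scale)."""
    nan_idx = np.nonzero(np.isnan(x))[0]
    # One draw per NaN found in *this* array, so the size always matches.
    draws = loc + scale * random.normal(key, (len(nan_idx),))
    return jnp.asarray(x).at[nan_idx].set(draws)


# Hypothetical stand-ins for a loc and scale learned on the training data.
learned_loc, learned_scale = 29.7, 14.5

# Made-up test data with a different missing pattern than training.
age_test = np.array([np.nan, 38.0, np.nan, 27.0, np.nan])
age_filled = simulate_missing(
    random.PRNGKey(0), age_test, learned_loc, learned_scale
)
```

Because the number of draws is taken from the NaN positions of whatever array is passed in, the test set can have any number of missing entries.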

dfd · March 23, 2022, 1:49am

Hi @fehiepsi

It looks like you posted an example of exactly what I want to do a couple of years ago in the Kaggle forum (my messages keep getting blocked by the spam filter, so I had to remove the link). It looks like the key to making it work is to pop the imputation samples off the posterior:

    posterior.pop("age_impute")

So that seems to allow the model to infer new missing values while making predictions.

Something in that code no longer works, though:

    if survived is not None:
        age_impute = numpyro.param("age_impute", np.zeros(age_isnan.sum()))
    else:  # for prediction, we sample `age_impute` from Normal(age_mu, age_sigma)
        age_impute = numpyro.sample("age_impute", dist.Normal(age_mu[age_nanidx], age_sigma[age_nanidx]))

It seems that using “param” doesn’t sample (it does not include age_impute in the trace). However, if I get rid of the param and just use sample, it seems to work and can predict on new data with missing values (again, I would post a link to working code, but I keep getting blocked by the filter).

I’m still relatively new to the framework, so I’m not sure what has changed with param.