How efficient is Bayesian imputation?

Hi everyone,
I’m wondering what the limit of Bayesian imputation is in terms of the number of values I can impute.

I have a model that already has many uncertain parameters, and I may want to impute up to 1,000 missing values (in a longitudinal data set with 35 or so features). I could do that relatively efficiently with an iterative imputer like the one in scikit-learn, but I would like to use Bayesian imputation to take uncertainty into account.
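For context, this is roughly the point-estimate baseline I have in mind, sketched with scikit-learn’s `IterativeImputer` on placeholder data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Placeholder data: ~1400 rows, 35 features, ~10% of the cells missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(1400, 35))
X[rng.random(X.shape) < 0.1] = np.nan

# Point estimates only: fast, but no uncertainty on the imputed values.
X_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```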

What would be a reasonable number of missing values to impute? And how do Bayesian imputed missing values scale computationally compared to ordinary uncertain parameters, for instance?

Hi @yunus, imputation ability is very problem-dependent. At one extreme, consider a table whose rows tend to be either all zeros or all ones: you need only O(1) cells to impute an entire row. At the other extreme, consider a table whose rows implement the XOR function: you need all but one value to impute the final remaining value.
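A toy illustration of the two extremes (made-up binary tables, nothing to do with your data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Extreme 1: every row is all zeros or all ones, so a single cell determines
# the whole row.
redundant = np.repeat(rng.integers(0, 2, size=(100, 1)), 5, axis=1)

# Extreme 2: the last column is the parity (XOR) of the others, so imputing it
# requires knowing every other value in the row.
rest = rng.integers(0, 2, size=(100, 4))
xor = np.column_stack([rest, rest.sum(axis=1) % 2])

print(np.corrcoef(redundant, rowvar=False)[0, -1])  # 1.0
print(np.corrcoef(xor, rowvar=False)[0, -1])        # roughly 0.0
```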

Hi @fritzo, thanks for your response! In the kind of data set I’m using, a lot of the features are correlated, and they also interact in my model, which is a large system of ODEs. Is there any general guideline for how many values can feasibly be imputed? E.g., “imputing 100 values will add about 30 minutes to 1,000 NUTS samples on an average CPU”, or something along those lines?

I don’t know of any general statements about either the statistical strength or the computational complexity of imputation, basically because imputation is so model-dependent. Can you say anything specific about your model class?

I’m sorry, but what do you mean by ‘model class’?
As I said, it’s a large system of ODEs, actually a differential-algebraic equation (DAE) system with about 10 differential and 20 algebraic equations. I have data for most, but not all, of the variables, but with very substantial missingness, especially in the time series (the least missingness is at baseline).

Now I would like to use Bayesian imputation when sampling from the model, but I’m unsure what the limits are. In theory I could try to impute 50,000 values or even more (1400 individuals, 30 variables, 12 time-points), but practically speaking that seems impossible.

You’ll need to answer this yourself empirically. Different models support vastly different amounts of imputation. In some models it is easy to impute 50,000 values; in some models it is impossible to impute even a single value. One way to try to answer this is to sample from the prior and see how much correlation there is between the values you know and the values you want to impute. If the correlation is close to 1, you’re in luck. If the correlation is near 0, it will be difficult to impute.
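For instance, something like this, with a toy two-feature model standing in for yours (all names, shapes, and distributions here are placeholders):

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import Predictive

# Toy stand-in for your DAE model: each row has two features whose prior
# correlation is controlled by rho.
def model(n_rows=200):
    rho = numpyro.sample("rho", dist.Uniform(0.5, 0.99))
    cov = jnp.array([[1.0, rho], [rho, 1.0]])
    with numpyro.plate("rows", n_rows):
        numpyro.sample("y", dist.MultivariateNormal(jnp.zeros(2), cov))

# Sample the full data table from the prior predictive (no data conditioned on),
# then check how correlated a cell you observe is with a cell you'd impute.
draws = Predictive(model, num_samples=1000)(random.PRNGKey(0))["y"]
obs_cell, miss_cell = draws[:, 0, 0], draws[:, 0, 1]   # one row, two cells
print(jnp.corrcoef(obs_cell, miss_cell)[0, 1])          # near 1 is good news
```

With your real model, you would draw the prior predictive for the full data table and correlate the cells you observe against the cells you want to impute.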

Ah, that’s nice to hear, thanks! I will definitely try that out. Any indication of how much extra time imputing 50,000 values would add? (Say that I do have some correlation, but maybe not 1.0 :stuck_out_tongue: )

Just to be sure: we’re talking about the imputation that is done by NumPyro when I assign a prior to a NaN value in the data set, right?

(As shown here: Bayesian Imputation — NumPyro documentation)
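I.e., roughly this pattern, boiled down to a single Normal feature (a minimal sketch; the data and names are placeholders, and my real model is the DAE system):

```python
import jax
import jax.numpy as jnp
import numpy as np
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

# Placeholder data: one feature with missing entries marked as NaN.
x = np.array([1.2, np.nan, 0.7, np.nan, 1.9])
nan_idx = np.nonzero(np.isnan(x))[0]          # positions of the missing cells

def model(x):
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(5.0))
    # One latent per NaN; NUTS samples these alongside the other parameters.
    x_impute = numpyro.sample(
        "x_impute", dist.Normal(mu, sigma).expand([len(nan_idx)]).mask(False)
    )
    x_filled = jnp.asarray(x).at[nan_idx].set(x_impute)
    numpyro.sample("x_obs", dist.Normal(mu, sigma), obs=x_filled)

mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(jax.random.PRNGKey(0), x)
print(mcmc.get_samples()["x_impute"].shape)    # (1000, number of NaNs)
```

With ~50,000 NaNs, that `x_impute` site would become a 50,000-dimensional latent for NUTS, which is where my computational question comes from.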