Bayesian Updating/Incremental Learning in NumPyro

Hi everyone,

I am working on a problem where I am trying to build a Gaussian process (GP) model. The main difficulty I am facing is that this GP must be updated incrementally as new data arrives. Namely, I want to take the posterior distribution of the GP from the last iteration and feed it in as the prior for the next iteration, which is computed when new data arrives. I would like to know if there is an easy way to do this in NumPyro. I also tried doing this in PyMC3, but it seems really complicated to do it with a GP in a way that is both scalable and simple.
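To make the idea concrete, here is a toy (non-GP) sketch of the kind of updating I mean, with a conjugate 1D Gaussian and made-up numbers, where the posterior from one batch becomes the prior for the next:

```python
import numpy as np

# Toy example: inferring a constant mean with known observation noise.
# The Gaussian posterior from batch 1 is used as the prior for batch 2.
sigma = 0.5                      # known observation noise (made up)
mu0, tau0 = 0.0, 1.0             # prior mean and std on the latent mean

def update(mu, tau, y, sigma):
    """Conjugate normal-normal update: posterior after observing batch y."""
    prec = 1.0 / tau**2 + len(y) / sigma**2
    mu_new = (mu / tau**2 + y.sum() / sigma**2) / prec
    return mu_new, 1.0 / np.sqrt(prec)

y1 = np.array([0.9, 1.1, 1.0])   # first batch
y2 = np.array([1.2, 0.8])        # second (streaming) batch

mu1, tau1 = update(mu0, tau0, y1, sigma)   # posterior after D_1
mu2, tau2 = update(mu1, tau1, y2, sigma)   # posterior after D_1 reused as prior for D_2
print(mu2, tau2)
```

This is exactly the posterior -> new prior -> posterior loop, just in the simplest conjugate setting instead of a GP.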

Thanks in advance for your attention 🙂

Hi @leonardofoliv, I think the online GP model in this reference would be helpful to you. I would suggest starting with Pyro's GP module and implementing a StreamingSparseGP model (the math in the conditional utility might be helpful). I'm not sure if there is an easy way to do this in NumPyro.
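As a starting point, a minimal Pyro sparse GP looks roughly like this (illustrative data, shapes, and hyperparameters; the streaming update on top of it is the part you would implement following that paper):

```python
import torch
import pyro
import pyro.contrib.gp as gp

# Illustrative data: 100 points in 2-D space (replace with your own).
X = torch.randn(100, 2)
y = torch.sin(X[:, 0]) + 0.1 * torch.randn(100)

kernel = gp.kernels.RBF(input_dim=2)
Xu = X[::10].clone()  # inducing points, here just a subset of X
vsgp = gp.models.VariationalSparseGP(
    X, y, kernel, Xu=Xu, likelihood=gp.likelihoods.Gaussian()
)

# Standard ELBO training loop over the model/guide pair.
optimizer = torch.optim.Adam(vsgp.parameters(), lr=0.01)
loss_fn = pyro.infer.TraceMeanField_ELBO().differentiable_loss
for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(vsgp.model, vsgp.guide)
    loss.backward()
    optimizer.step()

mean, var = vsgp(torch.randn(5, 2), full_cov=False)  # predictive mean/variance
```

The streaming variant would replace the fixed (X, y) with per-batch updates of the variational distribution over the inducing points.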


Hi @fehiepsi, thank you very much for your answer.

This seems quite advanced for my current knowledge of GPs and NumPyro. But I will dig into it and see what I can take from it, since it seems there is no easier way to get started.

Again, thanks for your help.

@leonardofoliv how much data do you expect?

Hi @martinjankowiak,

Thanks for your interest in answering my question. I'll try to give some more context on what I'm trying to do.

I am working with a spatiotemporal dataset of measurements collected across a city. Basically, I have multiple time series that are spatially correlated. Since some locations in the city are not covered during data collection, I need a model that estimates the variable at the locations where measurements are missing. In addition, the model must be able to update its outputs incrementally as new (streaming) data arrives.

I tried some other ML approaches, but after some research it seems to me that a Bayesian approach is the most suitable here, given the nature of the problem. I was drawn to modeling this dataset as a GP because it lets me model the spatiotemporal dependency between my measurements explicitly: it is a natural way to correlate the outputs of the model (based on the distance between the locations at which the measurements are collected, for example). I don't know if I'm missing some obvious approach, but that's how I imagined solving the problem.
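For example, the kind of kernel I have in mind is a separable spatiotemporal kernel, along these lines (a sketch only, with made-up lengthscales):

```python
import jax.numpy as jnp

def rbf(sq_dist, lengthscale):
    # Squared-exponential kernel evaluated on squared distances.
    return jnp.exp(-0.5 * sq_dist / lengthscale**2)

def spatiotemporal_kernel(loc1, t1, loc2, t2, ls_space=1.0, ls_time=1.0):
    # Separable kernel: spatial RBF on (x, y) locations times temporal RBF
    # on timestamps, so nearby sensors at nearby times are most correlated.
    sq_space = jnp.sum((loc1[:, None, :] - loc2[None, :, :]) ** 2, axis=-1)
    sq_time = (t1[:, None] - t2[None, :]) ** 2
    return rbf(sq_space, ls_space) * rbf(sq_time, ls_time)
```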

Right now I am working with a GP whose multivariate normal distribution is 1200-dimensional, though this dimensionality could grow in the future. I got a first model working in PyMC3, but I couldn't manage to do Bayesian updating/incremental learning with it efficiently. If I could have a simple GP model that performs this incremental updating (posterior -> new prior -> posterior -> new prior -> ...), that would already be a good advance for me.

Thanks again for your reply!

i’m still unsure about the structure of your model; however, keep the following in mind. if you’re using a GP and doing exact inference, then the following are equivalent:

  • start with a prior, condition on dataset D_1, compute the posterior, use that posterior as the new prior, condition on dataset D_2, compute the new posterior
  • start with a prior, condition on dataset D_1+D_2, compute the posterior

in other words, if you’re in this setup you can just skip to the second formulation and not bother with the “previous posterior as current prior” formulation
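here’s a quick numerical check of that claim with exact gaussian conditioning (toy 1-d data; kernel hyperparameters held fixed, since the equivalence is for exact inference over f):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1  # observation noise variance

def k(a, b, ls=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def condition(mean, cov, y, n_obs):
    """Condition a joint Gaussian on noisy observations of its first n_obs dims."""
    A = cov[:n_obs, :n_obs] + noise * np.eye(n_obs)
    B = cov[:n_obs, n_obs:]
    sol = np.linalg.solve(A, np.column_stack([y - mean[:n_obs], B]))
    return mean[n_obs:] + B.T @ sol[:, 0], cov[n_obs:, n_obs:] - B.T @ sol[:, 1:]

X1, X2 = rng.uniform(0, 5, 8), rng.uniform(0, 5, 6)
Xt = np.linspace(0, 5, 4)  # test points
y1 = np.sin(X1) + noise**0.5 * rng.normal(size=8)
y2 = np.sin(X2) + noise**0.5 * rng.normal(size=6)

Z = np.concatenate([X1, X2, Xt])
prior_mean, prior_cov = np.zeros(len(Z)), k(Z, Z)

# (a) sequential: condition on D_1, use that posterior as the prior for D_2
m1, S1 = condition(prior_mean, prior_cov, y1, len(X1))  # posterior over [X2, Xt]
m_seq, S_seq = condition(m1, S1, y2, len(X2))           # posterior over Xt

# (b) batch: condition once on D_1 + D_2
m_bat, S_bat = condition(
    prior_mean, prior_cov, np.concatenate([y1, y2]), len(X1) + len(X2)
)

print(np.allclose(m_seq, m_bat), np.allclose(S_seq, S_bat))  # True True
```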

But wouldn’t this be more computationally expensive, since I would always have to refit the model on the whole dataset at every update? My hope was to encode the knowledge acquired from the previous data (D_1) in the prior so that this wouldn’t be necessary.

Thanks again for the reply

it depends on exactly how you do it, but either way you end up working with N×N covariance matrices. in any case, if the number of datapoints you have is less than 5000ish, it may not be worth the trouble.
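for a rough sense of scale, the cholesky factorization that dominates exact GP inference scales as O(N^3); a quick (hardware-dependent, purely illustrative) timing sketch:

```python
import time
import numpy as np

# rough illustration of the O(N^3) cost that dominates exact GP inference
for n in (1200, 5000):
    K = np.random.randn(n, n)
    K = K @ K.T + n * np.eye(n)  # a well-conditioned SPD stand-in for a kernel matrix
    t0 = time.perf_counter()
    np.linalg.cholesky(K)
    print(f"N = {n}: cholesky took {time.perf_counter() - t0:.2f} s")
```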