Sparse variational GP: do we need to provide entire dataset at initialization?

Hi, I’m new to this forum. While trying to use sparse variational GPs (and later on deep GPs) with Pyro, I ran into some questions. I would appreciate it if somebody could help!

I’m working with large datasets that cannot be loaded into memory, so I am using mini-batch training, following the examples here: Inferences for Deep Gaussian Process models in Pyro | fehiepsi's blog. However, the GP API in Pyro still requires X and y at model initialization, where X and y are the features and labels of the entire dataset. I looked into the code, and it seems X and y are used in the “conditional” computation: https://docs.pyro.ai/en/0.3.1/_modules/pyro/contrib/gp/models/vsgp.html.

I was wondering whether the initial X and y we provide matter. Can we simply provide some dummy X and y and then set the data in each mini-batch training step? What are the recommended practices? Thank you in advance.

It is unnecessary to provide the entire dataset. You can use some dummy input, as in the deep kernel learning example.
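For concreteness, here is a minimal sketch of that pattern, assuming a recent Pyro release where the GP models are torch modules. The sizes N and D, the batch size, the learning rate, and the in-memory DataLoader are all placeholders standing in for your real data pipeline; what matters is passing dummy X/y at construction, setting num_data to the true dataset size so the mini-batch ELBO is scaled correctly, and calling set_data on each batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pyro
import pyro.contrib.gp as gp

# Placeholder sizes/hyperparameters -- replace with your own.
N, D, num_inducing = 100_000, 5, 100

# Stand-in for your real (out-of-memory) data pipeline.
dataset = TensorDataset(torch.randn(N, D), torch.randn(N))
loader = DataLoader(dataset, batch_size=1024, shuffle=True)

kernel = gp.kernels.RBF(input_dim=D)
Xu = torch.randn(num_inducing, D)  # random init; k-means centers are another common choice

# Dummy X/y are only used to build the model; real data comes in via set_data.
X_dummy = torch.zeros(1, D)
y_dummy = torch.zeros(1)

gpmodel = gp.models.VariationalSparseGP(
    X_dummy, y_dummy, kernel, Xu,
    likelihood=gp.likelihoods.Gaussian(),
    num_data=N,  # total dataset size, so the ELBO is scaled for mini-batches
)

optimizer = torch.optim.Adam(gpmodel.parameters(), lr=0.01)
loss_fn = pyro.infer.TraceMeanField_ELBO().differentiable_loss

for X_batch, y_batch in loader:
    gpmodel.set_data(X_batch, y_batch)  # swap in the current mini-batch
    optimizer.zero_grad()
    loss = loss_fn(gpmodel.model, gpmodel.guide)
    loss.backward()
    optimizer.step()
```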


I don’t think you need to provide them, so long as you provide num_data.

In practice, it’s usually a good idea to initialize the inducing point locations Xu using k-means on the training data X (or, in your case, likely a random subset of X).
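A rough sketch of that initialization is below. It uses scikit-learn’s KMeans, which is just one convenient option rather than anything the Pyro API requires; the subset size, num_inducing, and the random stand-in for X are placeholders.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_inducing_points(X_subset, num_inducing=100):
    """Pick inducing point locations as k-means centers of a data subset."""
    km = KMeans(n_clusters=num_inducing, n_init=10).fit(X_subset.numpy())
    return torch.as_tensor(km.cluster_centers_, dtype=X_subset.dtype)

X = torch.randn(100_000, 5)               # stand-in for your (N, D) feature tensor
idx = torch.randperm(len(X))[:10_000]     # random subset that fits in memory
Xu = kmeans_inducing_points(X[idx], num_inducing=100)
```

The resulting Xu can then be passed as the inducing inputs when constructing the VariationalSparseGP model.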
