GPLVM on Astronomical Data

Hello, I have a pretty high-level question related to my research using Astronomical Stellar Spectra.

Background

I am involved with an open-source project to do inference on spectroscopic data from stars. Spectra have a lot of information baked in, but statistical fitting is tough because the only sources of comparison we have are spectral models. These models require intensive simulation and are therefore only created on a fixed grid of intrinsic stellar parameters. In our research, these parameters are (T, logg, Z).

Because we want to fit our data using points that lie in between the model grid points, we need a way of interpolating. In our specific case, we also want to forward-propagate our errors for use in the inference, and we want to dimensionally reduce our model spectra for more efficient fitting.

What we’ve currently been doing is taking our model spectra and performing a PCA decomposition on them to create “eigenspectra” and corresponding “weights” (the projection coefficients) used to recreate the original model grid. Currently, this looks a bit like this:

ncomps = 7
X = fluxes # Shape (Ngrid, Npix), usually like (100, 30000)
Xcentered = X - X.mean(0)
Xwhitened = Xcentered / Xcentered.std(0)
U, S, V = torch.svd(Xwhitened)
components = V.mm(torch.diag(S))[:, :ncomps] # Shape (Npix, ncomps)
weights = U[:, :ncomps] # Shape (Ngrid, ncomps)

Using this decomposition, we can recreate any of our models on the grid with

reconstructed = weights[0].matmul(components.t())
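
(To get back to physical fluxes, we just undo the whitening from above, e.g.)

approx_whitened = weights.matmul(components.t()) # shape (Ngrid, Npix), rank-ncomps approximation of Xwhitened
approx_fluxes = approx_whitened * Xcentered.std(0) + X.mean(0)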

The problem

So, given this dimensionality reduction, we need to find a function to map our stellar parameters (T, logg, Z) to the weights used in reconstruction. We’ve previously been using a Gaussian process to sample functions for this mapping. More importantly, we want to fit this function using a loss metric that accurately portrays how well the function maps the stellar parameters to the original grid fluxes (rather than a Gaussian likelihood that measures how well the parameters recreate our PCA weights). I’ve done some reading around, and it seems like GPLVM is supposed to be the solution to this problem.

So, basically, I’m looking for a way to use GPLVM to map the stellar parameters to reconstructed model spectra through the dimensionally reduced latent “eigenspectra”, where I can get both the reconstructed spectra and the covariance of that reconstruction. The covariance matrix we’ve previously used looks something like

c = kernel(X, X) # shape (M, M)
# create a block diagonal from a batch of the above kernels, one per PCA component
covs = batch_kernel(X, X) # shape (ncomps, M, M)
C = block_diag(covs) # shape (ncomps*M, ncomps*M)
# we typically don't evaluate this covariance directly because of its large size;
# schematically, the weight covariance gets pushed through the eigenspectra:
cov = torch.chain_matmul(components, C, components.t()) # shape (Npix, Npix), once C is reduced to the (ncomps, ncomps) weight covariance at the point of interest

What I want

My ideal workflow would look something like this

X = grid_points # shape (M, 3)
Y = fluxes # shape(M, Npix) (Npix is ~ 30000)
pca = PCA(ncomps=7)
comps, weights = pca(Y)
comps.shape # (Npix, ncomps)
weights.shape # (M, ncomps)
kernel = gp.kernels.RBF(input_dim=3, lengthscale=torch.tensor([300., 30., 30.]))
gpr = gp.models.GPRegression(X, weights, kernel)
gplvm = gp.models.GPLVM(gpr, comps, Y)
gp.util.train(gplvm)
# sample a model
fluxes, cov = gplvm(torch.tensor([[7230, 4.32, 0.0]]), diag=False)
fluxes.shape # (1, Npix)
cov.shape # (1, Npix, Npix)

where cov is the forward-propagated covariance from the interpolation and reconstruction using the reduced dimensions (not just the self-covariance of the fluxes)
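
To make that concrete, the propagation I have in mind looks roughly like this (a sketch only; gprs here would be a hypothetical list of per-component GPs already fit to the weights):

params = torch.tensor([[7230., 4.32, 0.0]])
preds = [gpr(params, full_cov=True) for gpr in gprs]  # each: loc (1,), cov (1, 1)
w_mu = torch.stack([loc for loc, _ in preds]).squeeze(-1)  # (ncomps,)
w_var = torch.stack([c.squeeze() for _, c in preds])  # (ncomps,) since the GPs are independent
flux = comps.mv(w_mu)  # (Npix,)
cov = comps.mm(torch.diag(w_var)).mm(comps.t())  # (Npix, Npix), the propagated covariance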

Does this exist in the pyro framework? Can we make it exist in the pyro framework?


@mileslucas I think that you want to use this kernel: http://docs.pyro.ai/en/0.3.0-release/contrib.gp.html#verticalscaling . It gives you the kernel you want.

A side note: the GPLVM class is just an example to show the flexibility of the Pyro GP module. By default, it sets a N(0, 1) prior on X. To get more control, you can look at the Gaussian Process Latent Variable Model tutorial (Pyro Tutorials 1.8.4 documentation), where we don’t use the GPLVM class to do GPLVM.
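
For reference, the default usage is just a sketch like this (reusing your X and weights from above; GPRegression expects the output’s last dimension to be the number of data points, hence the transpose, and whether you really want X to be latent in your case is a separate question):

kernel = gp.kernels.RBF(input_dim=3, lengthscale=torch.tensor([300., 30., 30.]))
gpr = gp.models.GPRegression(X, weights.t(), kernel, noise=torch.tensor(0.01))
gplvm = gp.models.GPLVM(gpr)  # puts the default N(0, 1) prior on gpr.X and learns X
losses = gp.util.train(gplvm)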

Btw, your problem setup is interesting. Thanks for sharing!

@fehiepsi

Thanks for the reply.

I don’t quite understand how to apply such a kernel to my problem. If it helps at all, the equations I’m refactoring are in this paper. Specifically, see the appendix on page 19, starting at the second paragraph, which explains the motivation behind wanting to optimize the GP as though it were the latent function. Eqn. A22, I think, shows the same kind of math that GPLVMs use. You can also see here why I am keen on batching (eqns. A.12-A.13).

I think you can define cov as follows:

def scale(x):
    return components  # or create a class with a `components` property and return it in `__call__()`
    # or create an nn.Module with a `components` parameter and return it in `forward()` if you want to train it

kernel = gp.kernels.VerticalScaling(gp.kernels.RBF(3, lengthscale=...), scale)

Edit: let me think about this more; it seems I understood the problem wrong. Can you clarify the following points:

  • comps, weights = pca(X): here, it is pca(Y) right?
  • C = Kernel(X, X) # shape (ncomps, ncomps): I thought C.shape = M x M?

Good catch on the PCA: you are correct, I’m doing the PCA decomposition of the fluxes, so

comps, weights = pca(fluxes) # fluxes shape (M, Npix)
comps.shape # Npix, ncomps
weights.shape # M, ncomps

You are also correct about the kernel shape, sorry:

C = kernel(X, X) # shape (M, M), since X has shape (M, 3)
# if batched:
C = batch_kernel(X, X) # shape (ncomps, M, M)

I’ve also edited the original post to make these clarifications.

Also, to clarify my “ideal usage”: I want a GP regression model for the mapping from the stellar parameters to the PCA weights, (n, 3) -> (n, ncomps). I want to train this model based on the marginal likelihood of the reconstructed fluxes rather than the marginal likelihood of the PCA weights. I have such a likelihood, but it is poorly suited to tensor math (from the paper I linked above; see eqn. A14), and I am curious whether there are existing models that achieve the same effect.
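
Schematically (glossing over the paper’s exact parameterization), the objective I’m after is the marginal likelihood of the grid fluxes under the linearly reconstructed GP, not of the weights themselves:

$$\log \mathcal{L} \;=\; \log \mathcal{N}\!\left(F \,\middle|\, \mathbf{0},\; \Phi\, \Sigma_w \Phi^{\top} + \sigma^{2} I \right),$$

where F stacks the (standardized) grid fluxes, Φ stacks the eigenspectra, and Σ_w is the block-diagonal covariance of the weight GPs.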

We want to do this not because we want a probabilistic PCA model, but because we want to do our inference down the line using the given PCA. Our final “emulator” takes in parameters, samples from the GP regression model to get a set of weights, and takes the dot product of those weights with the PCA components to create an interpolation of our original model grid, without having to carry around and do math on M model spectra. Please let me know if any of this is still confusing; sometimes I do a poor job of explaining these problems since I am still coming to terms with them myself.

It is not GPLVM in my opinion. Here is my sketch for replicating that paper:

Data

# X.shape = N x 3
# y.shape = N x C
# weights.shape = N x 7
# components.shape = C x 7
X = X.repeat(7, 1, 1).reshape(7 * N, 3)
eigen_onehot = torch.eye(7).reshape(7, 1, 7).repeat(1, N, 1).reshape(7 * N, 7)
X = torch.cat([X, eigen_onehot], dim=-1)  # 7N x 10
weights = weights.t().reshape(7 * N)  # formula A12
components = torch.eye(N).reshape(-1).ger(components.t().reshape(-1)).reshape(N, N, 7, C)
components = components.permute(0, 3, 2, 1).reshape(C * N, 7 * N)  # formula A16
noise_scale = components.t().matmul(components).inverse()  # weight_noise = noise * noise_scale
# for batch Gaussian Processes, take block diagonal and reshape noise_scale to 7 x N x N
noise_init = torch.tensor(0.0001)  # formula A.15

Kernel

# formula A12
# it is better to make a loop here; I write it explicitly for clarity
# it is even better to use Coregionalize kernel; because the paper does not use it, I skip
rbf1 = Product(RBF(3), gp.kernels.Linear(1, active_dims=[3]))
rbf2 = Product(RBF(3), gp.kernels.Linear(1, active_dims=[4]))
...
kernel = Sum(Sum(Sum(Sum(Sum(Sum(rbf1, rbf2), rbf3), rbf4), rbf5), rbf6), rbf7)
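
For completeness, the same kernel built with a loop (assuming Sum, Product, RBF are imported from gp.kernels as in the snippet above):

from functools import reduce

rbfs = [Product(RBF(3), gp.kernels.Linear(1, active_dims=[3 + i])) for i in range(7)]
kernel = reduce(Sum, rbfs)  # Sum(...Sum(Sum(rbf1, rbf2), rbf3)..., rbf7)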

Model

# formula A13
# formula A22
class CustomGPR(GPModel):
    def __init__(self, X, weights, kernel, mean_function=None, noise=noise_init, jitter=1e-6, noise_scale=None, components=None):
        super(CustomGPR, self).__init__(X, weights, kernel, mean_function, jitter)
        self.noise = Parameter(noise)
        self.set_constraint("noise", constraints.positive)
        self.noise_scale = torch.eye(X.size(0)) if noise_scale is None else noise_scale
        self.components = components

    @autoname.scope(prefix="GPR")
    def model(self):
        self.set_mode("model")
        N = self.X.size(0)
        Kff = self.kernel(self.X)
        Kff.view(-1)[::N + 1] += self.jitter  # add jitter to the diagonal
        Kff = Kff + self.noise * self.noise_scale  # A22
        Lff = Kff.cholesky()

        zero_loc = self.X.new_zeros(self.X.size(0))
        return pyro.sample("weights", dist.MultivariateNormal(zero_loc, scale_tril=Lff), obs=self.y)

    @autoname.scope(prefix="GPR")
    def guide(self):
        self.set_mode("guide")

    def forward(self, Xnew, full_cov=False):
        self.set_mode("guide")
        N = self.X.size(0)
        Kff = self.kernel(self.X)
        Kff.view(-1)[::N + 1] += self.jitter  # add jitter to the diagonal
        Kff = Kff + self.noise * self.noise_scale  # V11 term, it seems like a typo in the paper (missing inverse)
        Lff = Kff.cholesky()  # I recommend to cache Lff to avoid re-computing for next prediction

        weight_loc, weight_cov = conditional(Xnew, self.X, self.kernel, self.y, None, Lff,
                                             full_cov, jitter=self.jitter)  # A28 -> A32
        y_loc = self.components.matmul(weight_loc)
        if full_cov:
            y_cov = self.components.matmul(weight_cov).matmul(self.components.t())
        else:
            y_cov = ...  # exercise for you
        return y_loc, y_cov

If you train with HMC, there is no need to touch the Linear.variance parameters (they default to 1).

If you train with VI, you can filter out these parameters (so they are not optimized, because they are correlated with the RBF’s variance):

rbf1.kern1.variance_unconstrained.requires_grad_(False)

First off, thank you for the help, this is all really good stuff.

In regards to the kernel: I’m not wedded to sticking to the paper verbatim. Gaussian processes are still new to me as well, so if there is a better approach to the problem, I’m all for it, and I’d love to hear more about the Coregionalize kernel. From the paper, though, these GPs are supposed to be independent, since our PCA components form a basis for our library “space”.

From the code sample you’ve given me, it’s unfortunately impossible to compute. Just looking at

components = torch.eye(N).reshape(-1).ger(components.t().reshape(-1)).reshape(N, N, 7, C)

this tensor has N * N * 7 * C elements (and N in our use cases can be 1000 or more), which quickly runs into billions of elements. Actually running it errors out asking for 68GB of RAM.

In our code, we’ve previously done these calculations using for-loops and a “skinny Kronecker”. However, as I mentioned above, rather than perfectly recreating the likelihoods presented in the paper, I would really like to redo our code so that it is much more tensor-oriented. For example, instead of making the stacked vector (mM) from eqn. A12, I’d rather work with a matrix (m, M). I’ve been working with the primary author (Czekala) as well as the original creator of the likelihood we use (Habib) on revisiting this math.
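
To make the (m, M) idea concrete, the full grid reconstruction then becomes a single matmul, and the big Kronecker-structured matrix never needs to be materialized (just a sketch):

W = weights.t() # shape (m, M)
recon = components.mm(W) # shape (Npix, M); column j reconstructs grid model j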

By the way, I have an implementation of what I interpreted the paper to mean; here it is.

import numpy as np
import torch
from torch import nn
import pyro
import pyro.distributions as dist
import pyro.contrib.gp as gp
import pyro.optim
# (PCA and config are our own project helpers)

def block_diag(tensor, fill=0):
    if tensor.dim() != 3:
        raise ValueError('Cannot form block matrix unless shape is bxnxn')
    batch_size = tensor.shape[0]
    side_length = tensor.shape[1]
    size = batch_size * side_length
    blocked = fill * tensor.new_ones(size, size)
    for i, block in enumerate(tensor):
        tl = i * side_length
        br = (i + 1) * side_length
        blocked[tl:br, tl:br] = block
        
    return blocked

class Emulator(nn.Module):
    
    __constants__ = ['m', 'M', 'n']
    
    def __init__(self, grid, n_comps=None):
        super().__init__()
        
        self.wl = torch.from_numpy(grid.wl)
        self.grid_points = torch.from_numpy(grid.grid_points)
        self.fluxes = torch.tensor(list(grid.fluxes), dtype=torch.float64)
        self.flux_mean = self.fluxes.mean(0)
        self.flux_sd = self.fluxes.std(0)
        self.pca = PCA(n_comps=n_comps)
        self.components, self.weights = self.pca(self.fluxes)
        self.components = self.components.to(torch.float64)
        self.M, self.n = self.grid_points.shape
        self.m = self.weights.shape[-1]
        
        conf_priors = torch.tensor(config.PCA['priors']).t()
        
        self.lam_xi = nn.Parameter(pyro.sample('lam_xi', dist.Gamma(1, 1)))
        l_prior = dist.Gamma(conf_priors[0], conf_priors[1]).expand((self.m, self.n)).to_event(0)
        lengthscale = pyro.sample('lengthscale', l_prior)
        a_prior = dist.Normal(300, 100).expand((self.m,))
        amplitude = pyro.sample('amplitude', a_prior)
        self.kernels = nn.ModuleList()
        self.gprs = nn.ModuleList()
        for i in range(self.m):
            k = gp.kernels.RBF(input_dim=self.n, variance=amplitude[i], lengthscale=lengthscale[i])
            self.kernels.append(k)
            self.gprs.append(gp.models.GPRegression(self.grid_points, self.weights[:, i], k)) 
    
        # Gram matrix of the eigenspectra (components has shape (Npix, m), so use the columns)
        dots = self.components.t().mm(self.components)
        self.PhiPhi = torch.zeros(self.m * self.M, self.m * self.M)
        for i in range(self.M * self.m):
            for jj in range(self.m):
                ii = i // self.M
                j = jj * self.M + (i % self.M)
                self.PhiPhi[i, j] = dots[ii, jj]
                
        out = self.components.t().mm(self.fluxes.t().to(torch.float64)).view(self.M * self.m)
        self.w_hat = self.PhiPhi.inverse().mv(out)
           
    def forward(self, params, *args, **kwargs):
        if params.dim() < 2:
            params = params.unsqueeze(0)
        mus = params.new_empty(self.m, params.shape[0])
        covs = params.new_empty(self.m, params.shape[0], params.shape[0])
        for i, gpr in enumerate(self.gprs):
            m, c = gpr(params, *args, **kwargs)
            mus[i] = m
            covs[i] = c
        return mus, covs

    
    def loss(self):
        sigs = self.grid_points.new_empty(self.m, self.M, self.M)
        for i, k in enumerate(self.kernels):
            sigs[i] = k(self.grid_points)
        sig_w = block_diag(sigs)
        C = (1. / self.lam_xi) * self.PhiPhi + sig_w
        ld = torch.logdet(C)
        # quadratic form w_hat^T C^{-1} w_hat (a scalar), not an elementwise product
        central = self.w_hat.dot(C.inverse().mv(self.w_hat))
        lnp = -0.5 * (ld + central + self.M * self.m * np.log(2. * np.pi))
        return -lnp.sum()
        
    
    def load_flux(self, params, get_cov=False):
        mu, cov = self(params, full_cov=get_cov)
        comp = self.components.matmul(mu).t().squeeze()
        if get_cov:
            # the per-component GPs are independent, so for a single parameter vector the weight
            # covariance is diagonal; push it through the eigenspectra
            C = torch.chain_matmul(self.components, torch.diag(cov.squeeze()), self.components.t())
            return comp * self.flux_sd + self.flux_mean, C
        else:
            return comp * self.flux_sd + self.flux_mean
        

This is a lot more than just a subclass of a GPModel because this is how we plan to deploy it in our code base. However, I wouldn’t mind creating a custom class and calling it from within this Emulator class for training and prediction.

I train this using a simple loop like

emu = Emulator(grid)
optim = torch.optim.Adam(emu.parameters(), lr=.01)
num_steps = int(1e3)
losses = []
for i in tqdm.trange(num_steps):
    optim.zero_grad()
    loss = emu.loss()
    loss.backward()
    optim.step()
    losses.append(loss.item())

Unfortunately, I haven’t had any success training this over a variety of learning rates. I always end up with losses around 10^23 and erratic behavior. I also test by plotting the GPs of the weights (very similar to the method from the GPRegression tutorial), and I get crap means and variances ~1000x too large to capture the actual shape of the weights.

Coregionalize gives you a dense covariance matrix instead of a block-diagonal one (formula A12). About the component computation (formula A16): because we only use it to do matmuls, it is enough to store just the original components with shape C x 7 and make a function matmul(components, weight) or matmul(components, components.t()) that does all the math under the hood.
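
A sketch of what I mean, assuming the weights are kept as an (m, N) matrix instead of the stacked (m*N,) vector:

def phi_matmul(components, W):
    # equivalent, up to a reshape, to multiplying the big Phi of formula A16 by the stacked weights
    return components.mm(W)  # (C, m) x (m, N) -> (C, N)

def phiT_phi(components):
    # the m x m Gram matrix of the eigenspectra; the full Phi^T Phi is block-structured from these entries
    return components.t().mm(components)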

I think that your approach is much more efficient. But there is no need to replicate as much of the paper’s math as your code does. In addition, the way you compute the loss seems strange to me, and it does not seem to include all the prior log_probs.

To do batch GPs and save computation cost, you just need to make a couple of changes to my code:

def model():
    # if using my method, X has shape 7N x 10
    Kff = self.kernel(self.X)  # shape: 7N x 7N
    Kff = block_diagonal_matrix(Kff)  # shape: 7 x N x N
    # if you want to use the original X with shape N x 3 together with an nn.ModuleList of 7 RBF kernels:
    Kff = torch.stack([self.kernels[i](self.X) for i in range(7)])  # faster!
    # either way, after this step you get a matrix with shape 7 x N x N
    Kff = Kff + self.noise * self.noise_scale  # make sure noise_scale has shape 7 x N x N
    Lff = Kff.cholesky()  # cholesky supports batches!
    # if you use precision instead of noise, then do noise=1/precision and set a very high init precision as in the paper

    # self.y has shape 7 x N
    zero_loc = self.X.new_zeros(7, N)
    return pyro.sample("weights", dist.MultivariateNormal(zero_loc, scale_tril=Lff).to_event(), obs=self.y)

def forward(...):
    # make a loop to do condition for each kernel
    weight1_loc, weight1_cov = conditional(Xnew, self.X, self.kernels[0], self.y[0], ...)

This seems like the equivalent of creating 7 GPRegression models and optimizing/calling them all within a loop. However, this doesn’t tackle the original problem, which is to train these GPs using a loss based on how well the spectra reconstruct rather than how well we fit the PCA weights.

The loss function doesn’t take into account any log priors because, in its current state, it is MLE rather than MAP. The loss I’ve coded is basically the marginalized density of two multivariate normals multiplied together; from the paper, it is A23. The paper does this work in the likelihood to reduce the dimensionality from C x C to mM x mM, which is a factor of 1,000,000 if not more.

I think that my code does what is presented in the paper. self.noise is the noise for the fluxes; then we have to scale it back to the noise of the weights (A21). That’s what is presented in A21, A22, and A23. Maybe you want to use A20? I think that would not be efficient, and you would have to implement such a likelihood yourself.
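
In symbols, the scaling that the noise_scale line in my earlier sketch encodes is

$$\Sigma^{\mathrm{noise}}_{w} \;=\; \sigma^{2}\,\big(\Phi^{\top}\Phi\big)^{-1},$$

i.e. the flux-space noise mapped back into weight space.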

Ah, I see. Sorry for the confusion. I will implement this and let you know what I see!