Hi All,
First of all, thank you for the amazing probabilistic programming language; it has been of great help in my research! I have a question about how variational inference is carried out in VariationalGP models. In particular, I am not sure I completely understand the role of the sampling statements in the model definition below:
if self.whiten:
    identity = eye_like(self.X, N)
    pyro.sample("f",
                dist.MultivariateNormal(zero_loc, scale_tril=identity)
                    .to_event(zero_loc.dim() - 1))
    f_scale_tril = Lff.matmul(self.f_scale_tril)
    f_loc = Lff.matmul(self.f_loc.unsqueeze(-1)).squeeze(-1)
else:
    pyro.sample("f",
                dist.MultivariateNormal(zero_loc, scale_tril=Lff)
                    .to_event(zero_loc.dim() - 1))
    f_scale_tril = self.f_scale_tril
    f_loc = self.f_loc

f_loc = f_loc + self.mean_function(self.X)
f_var = f_scale_tril.pow(2).sum(dim=-1)

if self.y is None:
    return f_loc, f_var
else:
    return self.likelihood(f_loc, f_var, self.y)
Why are we not using the sample drawn at the latent site "f" in the definition of the likelihood (in the later stages of model), but instead leaving the task of defining the observed variable y entirely to the trainable parameters f_loc and f_scale_tril? Specifically, following a generative view of Gaussian Processes, it would seem natural to use a sampled latent variable, e.g. fs = pyro.sample("f", ...), in the likelihood of the model, N(y | fs, sigma).
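To make the question concrete, here is a minimal sketch of the generative model I have in mind (this is not the pyro.contrib.gp code; kernel and noise are hypothetical placeholders for a function returning an N x N covariance matrix and an observation noise scale):

import torch
import pyro
import pyro.distributions as dist

def generative_model(X, kernel, noise, y=None):
    N = X.size(0)
    # Prior covariance over latent function values, with jitter on the diagonal.
    Kff = kernel(X) + 1e-6 * torch.eye(N)
    Lff = torch.linalg.cholesky(Kff)
    # Sample the latent function values and actually *use* them downstream.
    fs = pyro.sample("f", dist.MultivariateNormal(torch.zeros(N), scale_tril=Lff))
    # Likelihood N(y | fs, sigma): the observation depends on the sampled latent.
    return pyro.sample("y", dist.Normal(fs, noise).to_event(1), obs=y)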
Also, I was wondering whether this could be related to the fact that the following training scheme is used (as in the Gaussian Processes introduction in the documentation):
optimizer = torch.optim.Adam(gp.parameters(), lr=0.005)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = 2500 if not smoke_test else 2
for i in range(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gp.model, gp.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
as opposed to a more “pyro-classical” SVI training procedure, as follows:
optimizer = pyro.optim.Adam({"lr": 0.01})
svi = SVI(gp.model, gp.guide, optimizer, loss=Trace_ELBO())
for i in range(1000):
    svi.step()
In the first scheme, if I understand correctly, we are optimizing the model parameters gp.parameters(), which also contain the guide parameters f_loc and f_scale_tril (these are not explicit pyro.params in the guide, but rather trainable torch Parameters used in both model and guide). Does this mean we are not explicitly using a guide from which we can sample through, for example, svi.run() + EmpiricalMarginal (which we could do if we used the second training scheme)?
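To be explicit about what I mean by sampling from the guide, here is a sketch assuming the older TracePosterior-style API, where SVI accepts num_steps and num_samples and svi.run() both optimizes and collects posterior traces (this API may be deprecated in recent Pyro versions, and the site name may carry a module prefix rather than being the plain "f" from the snippet above):

import torch
from pyro.infer import SVI, Trace_ELBO, EmpiricalMarginal

svi = SVI(gp.model, gp.guide, pyro.optim.Adam({"lr": 0.01}),
          loss=Trace_ELBO(), num_steps=1000, num_samples=100)
posterior = svi.run()                    # optimize, then collect posterior traces
# "f" assumes the unprefixed site name; it may differ for a GPModel instance.
marginal = EmpiricalMarginal(posterior, sites="f")
f_samples = marginal.sample(torch.Size([50]))  # posterior draws of the latent f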
My idea is that the second training scheme would somehow require the use of latent samples (i.e. fs = pyro.sample("f", ...)) in defining the observations y, to allow for a feasible posterior approximation through the guide.
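In other words, for the second scheme I would have expected an explicit guide along these lines, with f_loc and f_scale_tril registered via pyro.param and used in a sampling statement that mirrors the model's latent site (again a hypothetical sketch, reusing the imports from the generative_model snippet above):

def explicit_guide(X, kernel, noise, y=None):
    N = X.size(0)
    # Variational parameters registered explicitly in the param store.
    f_loc = pyro.param("f_loc", torch.zeros(N))
    f_scale_tril = pyro.param("f_scale_tril", torch.eye(N),
                              constraint=dist.constraints.lower_cholesky)
    # Approximate posterior over the latent site "f" in the model.
    pyro.sample("f", dist.MultivariateNormal(f_loc, scale_tril=f_scale_tril))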
I hope I have managed to be sufficiently clear. Thank you very much in advance for your help!