SVI, evidence and ELBO loss

Hello, I’m a bit lost in the reasoning of SVI even after reading the tutorial, and I am seeking help…
I want to do time series forecasting without using the Forecaster class, as I need more flexibility. I have a time series, I learn on the past of the series and I want to predict the next day…

So I have the time series \{y_i\}_{0 \leq i\leq T}. I then define X_T = \{y_i\}_{0 \leq i \leq T-1} = \{x_i\}_{1 \leq i \leq T} and Y_T = \{y_i\}_{1 \leq i \leq T}. I then want to compute \displaystyle p\big(y_{t+1} \big|\{y_i\}_{i\leq t}\big) = p\big(y_{t+1} \big|x_{t+1},\big(Y_t, X_t\big)\big) = \int_{\theta'}p\big(y_{t+1}\big|x_{t+1}, \theta'\big) \cdot p\big(\theta'\big|\big(Y_t, X_t\big)\big)\, \mathrm{d}\theta', and to do so I take as definition of the posterior \displaystyle p\big(\theta\big|\big(Y_t, X_t\big)\big) = p\big(\theta\big|\{y_i\}_{i\leq t }\big) = \frac{p\big(Y_t\big|X_t, \theta \big)\, p(\theta)}{ \displaystyle \int_{\theta'}p\big(Y_t\big|X_t, \theta'\big) \, p(\theta')\, \mathrm{d}\theta'}.

Okay, so my prior p(\theta) is the model, that much I understand.
But how is the likelihood computed, the p\big(Y_t\big|X_t, \theta \big), or p\big(y_{t+1}\big|x_{t+1}, \theta\big)? At first I thought it was the guide, but the guide is already taken for something else. So how is the likelihood computed? We need it when we use a Predictive object.
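To make my setup concrete, here is roughly the kind of code I have in mind (just a minimal sketch: the Gaussian likelihood, the AutoNormal guide, the dummy data and all the names are illustrative, not my actual model):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO, Predictive
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam


def model(x, y=None):
    # prior p(theta): here theta = (weight, bias, sigma), purely illustrative
    weight = pyro.sample("weight", dist.Normal(0.0, 1.0))
    bias = pyro.sample("bias", dist.Normal(0.0, 1.0))
    sigma = pyro.sample("sigma", dist.LogNormal(0.0, 1.0))
    mean = weight * x + bias
    # likelihood p(Y_t | X_t, theta): one observed factor per day
    with pyro.plate("time", x.shape[0]):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=y)


guide = AutoNormal(model)  # q_phi(theta), the variational approximation

# x_i = y_{i-1}, y_i = value at day i (dummy data standing in for the real series)
y_series = torch.randn(101)
x_t, y_t = y_series[:-1], y_series[1:]

pyro.clear_param_store()
svi = SVI(model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(x_t, y_t)  # maximizes the ELBO with respect to phi

# posterior predictive p(y_{T+1} | x_{T+1}, data): Predictive draws
# theta' ~ q_phi(theta) from the guide and then runs the model, i.e. the
# likelihood p(y_{T+1} | x_{T+1}, theta'), with y=None
predictive = Predictive(model, guide=guide, num_samples=1000)
x_next = y_series[-1:]
samples = predictive(x_next)["obs"]  # (1000, 1) samples of y_{T+1}
```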

Then the evidence is intractable, which is why a guide is needed to approximate the posterior…

But in my formula of the ELBO, I have \displaystyle \mathrm{ELBO} = \int_{\theta '} q_\phi(\theta ') \cdot \log \Bigg( \frac{p\big(\theta ' , (X_t, Y_t)\big)}{q_\phi (\theta ' )}\Bigg) \, \mathrm{d}\theta '. Okay, here the guide is the approximate posterior, it is q_\phi (\theta), but then how is p\big(\theta ' , (X_t, Y_t)\big) computed? Is it decomposed into p(X_t,Y_t) \cdot p(\theta |X_t, Y_t)?
How come the ELBO optimizes both the parameters \theta and \phi, given that the prior doesn’t evolve during training? To me the backpropagation is only done on \phi, even if it is parameterized by \theta.

Thank you very much and have a great day

I am not sure if I’m understanding the question, but I think you are asking:

  • Whether we maximize both \theta and \phi in the ELBO
  • How we compute the joint p(\theta, (Y_t, X_t)).

For the first point: we don’t. We only optimize the parameters of the variational approximation q_{\phi}(\theta), which are denoted by \phi.
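For intuition, a standard rewriting of the ELBO (same notation as above, nothing Pyro-specific): \displaystyle \mathrm{ELBO}(\phi) = \log p\big(Y_t\big|X_t\big) - \mathrm{KL}\Big(q_\phi(\theta)\,\Big\|\,p\big(\theta\big|Y_t, X_t\big)\Big). The evidence \log p(Y_t|X_t) does not depend on \phi, so maximizing the ELBO over \phi is exactly minimizing the KL divergence between the guide and the true posterior; that is why optimizing \phi alone is what makes the guide approximate the posterior.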

For the second point: technically, when we say “model” in the Bayesian context, it means both the prior p(\theta) and the likelihood p(Y_t \mid X_t, \theta). Thus, by design, you have both available and can compute the joint p(\theta, (Y_t, X_t)) = p(Y_t \mid X_t, \theta)\, p(\theta) directly.
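In Pyro terms, a model function that contains latent pyro.sample statements (the prior) and an observed pyro.sample statement (the likelihood) gives you exactly this joint: tracing one run of the model and summing the log-probabilities of all its sample sites yields \log p(\theta) + \log p(Y_t \mid X_t, \theta), which is what Trace_ELBO works with internally. A minimal sketch (the Gaussian model is only illustrative):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine


def model(x, y=None):
    theta = pyro.sample("theta", dist.Normal(0.0, 1.0))         # prior p(theta)
    with pyro.plate("time", x.shape[0]):
        pyro.sample("obs", dist.Normal(theta * x, 1.0), obs=y)  # likelihood p(Y_t | X_t, theta)


x = torch.randn(10)
y = torch.randn(10)

# Trace one run of the model: theta is drawn from the prior,
# "obs" is clamped to the observed y.
trace = poutine.trace(model).get_trace(x, y)

# Summing the log-probabilities of all sample sites gives
# log p(theta) + log p(Y_t | X_t, theta) = log of the joint,
# evaluated at the sampled theta -- no separate decomposition needed.
log_joint = trace.log_prob_sum()
print(float(log_joint))
```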

While keeping \theta fixed is common in “classical Bayesian” applications, jointly maximizing \theta and \phi is common practice when doing so-called “type II maximum likelihood” or when considering models with neural networks (e.g. variational autoencoders).
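In Pyro this distinction is mostly mechanical: SVI updates every pyro.param it encounters, whether it appears in the guide or in the model. So if you give the model learnable parameters (type II maximum likelihood style), they are updated together with \phi; if the model has no pyro.param at all, only \phi is updated, which is the “classical Bayesian” case above. A hypothetical sketch (a learnable prior scale is just an example of a model-side parameter):

```python
import torch
import pyro
import pyro.distributions as dist
from torch.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam


def model(x, y=None):
    # a learnable model-side parameter: registered via pyro.param,
    # so SVI updates it together with the guide parameters
    prior_scale = pyro.param("prior_scale", torch.tensor(1.0),
                             constraint=constraints.positive)
    theta = pyro.sample("theta", dist.Normal(0.0, prior_scale))
    with pyro.plate("time", x.shape[0]):
        pyro.sample("obs", dist.Normal(theta * x, 1.0), obs=y)


def guide(x, y=None):
    # q_phi(theta) with phi = (q_loc, q_scale)
    loc = pyro.param("q_loc", torch.tensor(0.0))
    scale = pyro.param("q_scale", torch.tensor(1.0),
                       constraint=constraints.positive)
    pyro.sample("theta", dist.Normal(loc, scale))


pyro.clear_param_store()
svi = SVI(model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())
x, y = torch.randn(20), torch.randn(20)
for _ in range(200):
    svi.step(x, y)  # updates "prior_scale" jointly with "q_loc" and "q_scale"
```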


Thank you very much for your time
But in terms of mathematics, p(\theta) \cdot p(Y_t | X_t, \theta) = p(\theta, Y_t | X_t), right? I feel like we would have something more like p(\theta,(X_t,Y_t)) = p((Y_t , X_t) | \theta) \cdot p(\theta). Do we get the result of this formula from the model directly? In fact, do we decompose this formula into two parts, or is it directly the output of the model?
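If it helps, one way to reconcile the two: the inputs X_t are treated as fixed covariates that the model never puts a distribution on, so there is no p(X_t) anywhere. Under that reading (and with \theta independent of X_t a priori), the joint appearing in the ELBO is really \displaystyle p\big(\theta, Y_t \big| X_t\big) = p\big(Y_t \big| X_t, \theta\big) \cdot p\big(\theta\big), and this product is exactly what one run of the model evaluates: the prior factor comes from the latent sample statements and the likelihood factor from the observed one, so there is nothing extra to decompose.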

Okay, thank you very much for the clarification.