I am new to both Pyro and probabilistic programming, but I tried to do my homework before raising this issue, so please bear with me if it is really basic.
I was going through the Bayesian regression tutorial and don't quite understand the difference between the model function and the guide function. Both of them create a linear model whose parameters are defined as distributions.
My understanding is that the model function simply adds sample statements (to set up the prior that the model is sampled from), while the guide function only returns a (trainable?) regression model sampled from the parameter distributions.
However, it's written that the guide function is the one with the parameters that actually get trained.

TL;DR

What is the point of defining two sets of parameters (in the model and the guide)?
Why are the parameters in the model drawn from a constant distribution?
Why do we need an argument for the guide function if it's never referenced during training and inference?
Thank you

Good questions! If you're new to Bayesian methods in general, I recommend reading the SVI tutorials to get a basic understanding first.

What is the point of defining two sets of parameters (in the model and the guide)?

By parameters, I assume you mean the parameters of the priors (correct me if I'm wrong). The Gaussian distributions in the model are our priors p(z). The analogous distributions in the guide are our approximating distributions q(z). We sample from q(z) when running VI to estimate the ELBO (see below).
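
To make the "sample from q(z) to estimate the ELBO" step concrete, here is a minimal sketch in plain Python (not Pyro). The toy model z ~ N(0, 1), x | z ~ N(z, 1), the observed value x_obs = 1.0, and the helper names normal_logpdf and elbo_estimate are all assumptions made up for illustration; none of them come from the tutorial.

```python
import math
import random


def normal_logpdf(value, mean, std):
    """Log density of a univariate Normal(mean, std) at value."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)


def elbo_estimate(x, q_mean, q_std, num_samples=5000, seed=0):
    """Monte Carlo ELBO for the toy model z ~ N(0, 1), x | z ~ N(z, 1).

    The guide is q(z) = N(q_mean, q_std). Latents are drawn from the
    guide, never from the prior.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)             # sample from the guide q(z)
        log_prior = normal_logpdf(z, 0.0, 1.0)   # log p(z), the model's prior
        log_lik = normal_logpdf(x, z, 1.0)       # log p(x | z)
        log_q = normal_logpdf(z, q_mean, q_std)  # log q(z)
        total += log_prior + log_lik - log_q
    return total / num_samples


x_obs = 1.0
# For this conjugate toy model the evidence is known: x ~ N(0, sqrt(2)).
log_evidence = normal_logpdf(x_obs, 0.0, math.sqrt(2.0))
# Guide equal to the prior: a poor approximation of the posterior.
elbo_prior_guide = elbo_estimate(x_obs, 0.0, 1.0)
# Guide equal to the exact posterior N(x / 2, sqrt(1 / 2)).
elbo_exact_guide = elbo_estimate(x_obs, x_obs / 2, math.sqrt(0.5))
print(log_evidence, elbo_prior_guide, elbo_exact_guide)
```

With the guide set to the exact posterior, the ELBO matches the log evidence; with the prior as the guide, it is lower. Closing that gap is what training the guide's parameters does.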

Why are the parameters in the model drawn from a constant distribution?

Not sure what you mean by this. We initialize the parameters of the Gaussian priors, then place those priors over the parameters of our neural net module.

Why do we need an argument for the guide function if it's never referenced during training and inference?

Good catch. It actually isn't used in the guide in this case, but Pyro requires the model and the guide to take the same arguments.

In general, in SVI, you are trying to minimize the KL divergence between your approximating distribution and the true (unknown) posterior. That's the purpose of the guide. Your model describes how the latent variables relate to the data you observe.
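
For reference, this is the standard identity behind that statement, written with q_ϕ for the guide and p_θ for the model (the notation is an assumption chosen to match the θ and ϕ used in the SVI tutorial):

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z)}\!\left[\log p_\theta(x, z) - \log q_\phi(z)\right]}_{\text{ELBO}}
  + \mathrm{KL}\!\left(q_\phi(z) \,\|\, p_\theta(z \mid x)\right)
```

The left-hand side does not depend on ϕ, so maximizing the ELBO over ϕ is the same as minimizing the KL term, even though the posterior itself is never computed.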

"In the general case we take gradient steps in both θ and ϕ space simultaneously so that the guide and model play chase, with the guide tracking a moving posterior log pθ(z|x)."

My question is this: if the model has trainable parameters that appear before the sampling of a latent variable, are those parameters optimized? If so, how? (Sampling?)
Explicitly, will theta here be optimized by SVI?
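
As far as I understand it, yes: SVI takes gradient steps on every parameter that appears in the model as well as in the guide, and the gradient with respect to θ does not require differentiating through the model's own sampling, because during SVI the latent is drawn from the guide and only scored by the model. Below is a toy sketch of that in plain Python. It is not Pyro code and not a reconstruction of your snippet. The model z ~ N(θ, 1), x | z ~ N(z, 1), the fixed guide standard deviation, the function name fit, and all hyperparameters are assumptions made up for illustration.

```python
import math
import random


def fit(x, steps=3000, lr=0.02, num_samples=16, seed=0):
    """Joint stochastic gradient ascent on the ELBO in theta and phi.

    Hypothetical toy model: z ~ N(theta, 1), x | z ~ N(z, 1).
    Guide: q(z) = N(phi, q_std), with q_std held fixed for simplicity.
    """
    rng = random.Random(seed)
    theta = 0.0              # model parameter, appears before z is sampled
    phi = 0.0                # guide parameter (mean of q)
    q_std = math.sqrt(0.5)   # assumed fixed; only the guide mean is trained
    for _ in range(steps):
        grad_theta = 0.0
        grad_phi = 0.0
        for _ in range(num_samples):
            # z comes from the guide (reparameterized), not from the model prior.
            z = phi + q_std * rng.gauss(0.0, 1.0)
            # d/dtheta of log p_theta(z): z is treated as given, so no
            # differentiation through sampling is needed.
            grad_theta += z - theta
            # d/dphi via dz/dphi = 1: derivative in z of log p_theta(z) + log p(x | z).
            # The entropy of q does not depend on phi, so it contributes nothing.
            grad_phi += -(z - theta) + (x - z)
        theta += lr * grad_theta / num_samples
        phi += lr * grad_phi / num_samples
    return theta, phi


theta_hat, phi_hat = fit(x=3.0)
print(theta_hat, phi_hat)
```

In this toy the marginal likelihood is maximized at θ = x, and the posterior mean at that θ is also x, so both values should drift toward 3.0 while the guide chases the posterior that θ keeps moving.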