What is SVI doing to the model-side params?

I’m a little bit confused about model-side parameters.

In the SVI tutorial, it says

So, for a fixed θ, as we take steps in ϕ space that increase the ELBO, we decrease the KL divergence between the guide and the posterior, i.e. we move the guide towards the posterior. In the general case we take gradient steps in both θ and ϕ space simultaneously so that the guide and model play chase, with the guide tracking a moving posterior log p_θ(z|x).

I’m wondering, in the general case, what model-side parameter values does SVI give us? Is it an MLE of those parameters?

Say we have a latent variable in the model whose mean and variance are set by two param sites, and in the guide this latent variable also has its mean and variance set by two param sites. In my understanding, SVI will optimize the guide-side params to make the guide converge to the posterior; however, this posterior also depends on the model-side parameters, so what kind of model-side parameter values will SVI give us?


model parameters are driven to maximize the ELBO, which is a lower bound on the log evidence, so in effect model parameters are MLE estimates.
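For a model with no latent variables at all, the ELBO has no KL term and reduces to the plain log likelihood, so maximizing it is exactly classical MLE. A minimal pure-Python sketch (made-up data, no Pyro dependency):

```python
# Made-up data, no Pyro: with no latent variables the ELBO has no KL term
# and is just the log likelihood, so maximizing it is classical MLE.
import math

data = [1.2, 0.8, 1.5, 0.9, 1.1]  # hypothetical observations

def log_likelihood(mu):
    # log p(data | mu) for x_i ~ Normal(mu, 1)
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in data)

# gradient ascent on the log likelihood (the "ELBO" of this latent-free model)
mu = 0.0
for _ in range(2000):
    grad = sum(x - mu for x in data)  # d/dmu of the log likelihood
    mu += 0.01 * grad

print(mu)                     # gradient ascent lands on the sample mean
print(sum(data) / len(data))  # the closed-form Gaussian MLE
```

Gradient ascent converges to the sample mean, which is the closed-form MLE for a Gaussian mean with known variance.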

whether tying model parameters and guide parameters to be the same (i.e. using the same named param statement in both places) makes sense depends on the particular model/guide. regardless of whether it makes sense, in all cases the quantity being maximized is the ELBO.

Does that mean that whether we have a full guide or an empty guide, SVI will give us the same model-side parameter values, i.e. the MLE values?

I tried a simple model and found that with a full guide and with an empty guide, SVI gave quite different values for the model-side parameters.
This suggests that in the case where both the guide and the model have trainable parameters, the model parameters may not be the simple MLE values.

i don’t really understand. i suggest you formulate your question about a specific model and guide and include code for them.

SVI maximizes the ELBO. how you want to interpret that is up to you.

I tried this small model,
def model(data):
    with pyro.plate('data', size=data.shape[0], dim=-1):
        ...

def mle_guide(data):
    ...

def full_guide(data):
    ...

def generate_data():
    data = []
    for i in range(5000):
        ...
    return torch.tensor(data)

So if you run

adam = pyro.optim.Adam({'lr': 0.1})
elbo = pyro.infer.TraceEnum_ELBO(num_particles=10)
svi = pyro.infer.SVI(model, guide, adam, loss=elbo)  # guide = mle_guide or full_guide
data = generate_data()
for step in range(100):
    svi.step(data)

print("a0: {}".format(pyro.param("a0").item()))
print("a1: {}".format(pyro.param("a1").item()))

you will find that in these two situations, the printed values are quite different.

I think using the empty mle guide gives us the MLE of the model-side parameters. If that’s true, then the values produced by using the full guide are not the MLE values, because they are different.
It seems that when both the model side and the guide side have trainable parameters, the trained model-side parameter values are not MLE; it is something between MLE and MAP, i guess?
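One way to see where a gap can come from: for fixed model parameters, the ELBO under any guide is a lower bound on the log evidence, and the bound is tight only when the guide equals the exact posterior. A tiny hand-computed check with a single binary latent (hypothetical numbers, no Pyro):

```python
# Hypothetical numbers, no Pyro: one binary latent z and one observation x.
# For fixed model parameters, ELBO(q) <= log p(x) for every guide q, with
# equality exactly when q is the true posterior p(z | x).
import math

p_z = [0.7, 0.3]            # model prior p(z)
p_x_given_z = [0.2, 0.9]    # likelihood p(x | z) at the observed x

log_evidence = math.log(sum(pz * px for pz, px in zip(p_z, p_x_given_z)))

def elbo(q1):
    """ELBO for a guide that puts probability q1 on z = 1."""
    q = [1.0 - q1, q1]
    return sum(q[z] * (math.log(p_z[z] * p_x_given_z[z]) - math.log(q[z]))
               for z in (0, 1))

posterior1 = p_z[1] * p_x_given_z[1] / math.exp(log_evidence)  # p(z=1 | x)

print(log_evidence - elbo(0.5))        # positive gap for a mismatched guide
print(log_evidence - elbo(posterior1)) # ~0 when the guide is the posterior
```

So when the guide (or the optimizer) can’t exactly recover the posterior, the model parameters maximize a strictly lower bound rather than the true log evidence, and can land somewhere other than the exact MLE.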

mle and map are explained here.

if the model has latent variables and the guide is a true variational distribution (i.e. has sample statements for each sample statement in the model) then parameters in the model are point estimates whose values are determined by maximizing the ELBO. these are “MLE” in the loose sense that they are point estimates whose value is fixed by maximizing the ELBO, which is a lower bound on the log evidence. they are not MLE in the sense used here, which specifically refers to latent variables, although these are obviously closely related since both involve point estimates.
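The identity behind this, for any guide q_ϕ:

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z)}\!\left[\log p_\theta(x, z) - \log q_\phi(z)\right]}_{\mathrm{ELBO}(\theta,\,\phi)}
  + \mathrm{KL}\!\left(q_\phi(z) \,\middle\|\, p_\theta(z \mid x)\right)
  \;\ge\; \mathrm{ELBO}(\theta, \phi)
```

Maximizing the ELBO over θ coincides with true MLE exactly when the guide family can represent the posterior (KL term zero); otherwise the θ point estimate maximizes a strictly lower bound on the log evidence.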

I see, it seems that classical MLE is a special case of ELBO MLE. They both aim to maximize the model evidence of the observed data, and classical MLE is the special case where we don’t have priors on the latent variables.