On the guide in the SS-VAE tutorial

Hi everyone, I’m new to probabilistic programming.

I recently studied the SS-VAE tutorial with great interest, but I have a doubt about the way the guide is defined. In particular, I do not fully understand why “encoder_z” receives, besides the input x, a “guessed” y even when y is not provided (i.e., in the case of an unsupervised batch). For clarity, this is the code I’m referring to:

def guide(self, xs, ys=None):
    with pyro.plate("data"):
        # if the class label (the digit) is not supervised, sample
        # (and score) the digit with the variational distribution
        # q(y|x) = categorical(alpha(x))
        if ys is None:
            alpha = self.encoder_y(xs)
            ys = pyro.sample("y", dist.OneHotCategorical(alpha))
        # sample (and score) the latent handwriting-style with the variational
        # distribution q(z|x,y) = normal(loc(x,y),scale(x,y))
        loc, scale = self.encoder_z([xs, ys])
        pyro.sample("z", dist.Normal(loc, scale).to_event(1))

Is this required for some theoretical reason that I have not fully understood? I’m a little confused on this point, since the original approach in Kingma et al. (2014) factorizes the full posterior q(z, y | x) as a product of a posterior on z (q(z | x)) and one on y (q(y | x)), i.e., it assumes that the two latent variables are conditionally independent given x.

Otherwise, if it is not formally required, does it instead facilitate the implementation in some way? In this way, “encoder_z” always receives two components to concatenate (i.e., x and the observed or guessed y), instead of having to handle one case where it receives only x and another where it receives both x and y. Is this the real reason behind this choice?
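Just to make the alternative I have in mind concrete, here is a rough sketch of a fully factorized guide q(z | x) q(y | x) in which z never sees y. The name “encoder_z_x” is my own invention (a network taking only x), not something from the tutorial:

def guide(self, xs, ys=None):
    with pyro.plate("data"):
        # q(y|x): y is still sampled (and scored) when the label is
        # unobserved, but its value is never fed to the z encoder
        if ys is None:
            alpha = self.encoder_y(xs)
            pyro.sample("y", dist.OneHotCategorical(alpha))
        # q(z|x): hypothetical encoder_z_x conditions on x alone
        loc, scale = self.encoder_z_x(xs)
        pyro.sample("z", dist.Normal(loc, scale).to_event(1))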

Thank you so much.

since the original approach in Kingma et al. (2014) factorizes the full posterior q(z, y | x) as a product of a posterior on z (q(z | x)) and one on y (q(y | x)), i.e., it assumes that the two latent variables are conditionally independent given x.

I don’t believe this is the case. Please refer to Section 3.1.2.

The high-level reason for these kinds of choices is usually what you want the latent variables to do. As far as the model is concerned, it simply has two latent variables: one continuous and one discrete. It doesn’t know anything about the human concept of discrete digits. Depending on how the model and guide are structured, the z variable might start learning about the concept of digits. However, this isn’t what one wants here; instead, one wants z to pick up on other factors of variation, like digit thickness. Hence the chosen factorization.
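To make that concrete, here is a simplified sketch of the model side (the attribute names and priors are placeholders rather than the tutorial’s exact code). The decoder reconstructs x from both latents, so nothing in the model itself forces y to mean “digit” and z to mean “style”; that division of labor comes from how the guide and the supervised labels are wired up.

def model(self, xs, ys=None):
    with pyro.plate("data"):
        # p(z): standard normal prior on the continuous latent
        prior_loc = xs.new_zeros([xs.size(0), self.z_dim])
        prior_scale = xs.new_ones([xs.size(0), self.z_dim])
        zs = pyro.sample("z", dist.Normal(prior_loc, prior_scale).to_event(1))
        # p(y): uniform prior on the discrete latent; observed when labeled
        alpha_prior = xs.new_ones([xs.size(0), self.output_size]) / self.output_size
        ys = pyro.sample("y", dist.OneHotCategorical(alpha_prior), obs=ys)
        # p(x|y,z): the decoder sees both latents
        loc = self.decoder([zs, ys])
        pyro.sample("x", dist.Bernoulli(loc).to_event(1), obs=xs)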

Thank you for your clarification.

To reiterate: I’m new to probabilistic ML and probabilistic programming, so I apologize if my question seems trivial to the experts who make up the typical audience of this forum.

Anyway, let me try to express my doubt more clearly: should y affect z even when y is latent? If y and z are supposed to capture two distinct (orthogonal?) factors of variation, why do we allow y to affect z by passing both x and y to “encoder_z”? (The behavior when y is observed is clear to me.)

One last question: does your tutorial implement the exact idea in Kingma et al. (2014)? I noticed that the tutorial page also refers to another work, “Learning Disentangled Representations with Semi-Supervised Deep Generative Models”. Would reading this paper help me better understand some of the choices you made when implementing your SS-VAE approach?

Thank you so much.

To the best of my knowledge, the tutorial closely follows “Semi-Supervised Learning with Deep Generative Models”, apart from possible small differences in neural network architecture. Reading any number of papers could be helpful: this is a large area of current research.

What does “should” mean? Even if a dataset is generated using two independent factors of variation, in practice, for a finite dataset, you will never recover exact orthogonality: the true posterior will exhibit correlations. Consequently, it makes sense to factorize the approximate posterior as either q(z | x, y) q(y | x) or q(y | x, z) q(z | x). In any case, there are no simple answers as to what one should or shouldn’t do here, since the details depend on the precise goals, the precise dataset, etc. In addition, this is an active area of research, especially when it comes to “disentangled” representations; please refer to that line of research for more details.
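Purely as an illustration of the reversed ordering (the tutorial does not do this, and “encoder_z_only” / “encoder_y_z” are made-up names), a guide factorized as q(z | x) q(y | x, z) would sample z first and then feed it to the classifier:

def guide(self, xs, ys=None):
    with pyro.plate("data"):
        # q(z|x): z is sampled from x alone
        loc, scale = self.encoder_z_only(xs)
        zs = pyro.sample("z", dist.Normal(loc, scale).to_event(1))
        # q(y|x,z): the classifier conditions on x and the sampled z
        if ys is None:
            alpha = self.encoder_y_z([xs, zs])
            pyro.sample("y", dist.OneHotCategorical(alpha))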
