Generation network in CVAE, p(y|z) or p(y|x, z)?

I have recently been trying to understand the CVAE by reading the Pyro example: Conditional Variational Auto-encoder — Pyro Tutorials 1.7.0 documentation

I found that the generation network seems to take only z as input, rather than both z and x:

# the output y is generated from the distribution pθ(y|x, z)
loc = self.generation_net(zs)

Does this mean the decoder models p(y|z) instead of p(y|x, z)? If so, what difference would it make to feed both x and z to the decoder?


Hi @enhuiz, good observation. In the example, z is sampled conditionally on x, so y depends on x only indirectly, through z. You’re right that the code as written samples from p(z | x) * p(y | z), not from p(z | x) * p(y | x, z) as the comment suggests.
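If you want to try the p(y | x, z) variant, one common way is to concatenate x and z before the first layer of the decoder. Below is a minimal sketch in plain PyTorch, not the tutorial's actual code: the class name `Decoder`, the `condition_on_x` flag, and all layer sizes are assumptions made up for illustration.

```python
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """Hypothetical decoder that can model either p(y | z) or p(y | x, z)."""

    def __init__(self, x_dim, z_dim, hidden_dim, y_dim, condition_on_x=True):
        super().__init__()
        self.condition_on_x = condition_on_x
        # When conditioning on x, the first layer sees [z, x] concatenated.
        in_dim = z_dim + x_dim if condition_on_x else z_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, y_dim),
            nn.Sigmoid(),  # outputs in [0, 1], e.g. Bernoulli means per pixel
        )

    def forward(self, z, x=None):
        if self.condition_on_x:
            if x is None:
                raise ValueError("x is required when condition_on_x=True")
            z = torch.cat([z, x], dim=-1)
        return self.net(z)


# Illustrative sizes only (assumed, not taken from the tutorial).
batch, x_dim, z_dim, y_dim = 4, 16, 8, 16
z = torch.randn(batch, z_dim)
x = torch.randn(batch, x_dim)

dec_z_only = Decoder(x_dim, z_dim, 32, y_dim, condition_on_x=False)  # p(y | z)
dec_full = Decoder(x_dim, z_dim, 32, y_dim, condition_on_x=True)     # p(y | x, z)

loc_z_only = dec_z_only(z, x)  # x is ignored here
loc_full = dec_full(z, x)      # x enters the decoder directly
```

In the model, the call would then become something like `loc = self.generation_net(zs, xs)`, assuming `xs` is the same flattened input that the prior network sees. With the z-only decoder, everything y needs to know about x has to be squeezed through z; with the conditioned decoder, z only has to carry what x does not already determine. Whether that helps in this particular example is something you would have to check empirically.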

You’re welcome to play around with the example notebook and change the network architecture, and submit a pull request if you find that it produces better results!