Hi all,

We are using a Multimodal Variational Autoencoder inspired by this code and paper to predict conversational engagement (continuous labels between 0 and 1) from facial features and context data, e.g. a binary gender label and numerical personality ratings across five dimensions.

The paper above models the following graph with a Multimodal Variational Autoencoder:

An encoder maps from an outcome x to the latent space z, and a decoder maps back to the outcome. The same happens for facial expressions and emotion ratings, so we can think of it as training three VAEs with a joint latent space. We would also like to encode other causes of emotion, i.e. instead of modelling a continuous outcome x with a Normal distribution, we model a binary variable x such as gender (male/female) with a Bernoulli distribution. We model the features capturing facial expressions with a Normal distribution, rather than using raw images under a Bernoulli assumption as done in the paper.
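To make the two likelihood assumptions concrete, here is a minimal sketch of the decoder heads we have in mind. It is not our actual model: the layer names, the latent size of 8, the 17 facial features, and the use of plain `torch.distributions` with a logits-parameterised Bernoulli are all assumptions made for illustration.

```python
import torch
from torch.distributions import Bernoulli, Normal

torch.manual_seed(0)

# Hypothetical sizes: batch of 4, latent dim 8, 17 facial features.
z = torch.randn(4, 8)

# Hypothetical decoder heads (single linear layers for brevity).
gender_head = torch.nn.Linear(8, 1)   # outputs a logit for the binary label
face_loc = torch.nn.Linear(8, 17)     # mean of the Normal over facial features
face_scale = torch.nn.Linear(8, 17)   # log-scale of the Normal

# Toy observations standing in for real data.
gender = torch.tensor([[0.0], [1.0], [1.0], [0.0]])
face = torch.randn(4, 17)

# Binary modality: Bernoulli log-likelihood, summed over the feature dim.
log_p_gender = Bernoulli(logits=gender_head(z)).log_prob(gender).sum(-1)

# Continuous modality: Normal log-likelihood with the clamped scale.
scale = torch.exp(face_scale(z)).clamp(min=1.0e-5)
log_p_face = Normal(face_loc(z), scale).log_prob(face).sum(-1)

print(log_p_gender.shape, log_p_face.shape)
```

Each modality contributes one log-likelihood term per example, and these terms are what enter the reconstruction part of the loss.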

**The Problem:** When training the Multimodal VAE, the loss explodes after 6-7 iterations, and the weights of the encoder/decoder that map between the binary outcome variable and the latent space z become NaN. This happens even though we clamp the “scale” variable everywhere in the decoder, which applies torch.exp(), to keep it from exploding, e.g.

scale = torch.exp(self.scale_layer(hidden)).clamp(min=1.0e-5)
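For reference, here is a minimal sketch of what that clamp does in isolation. The helper names and the bounds of -10 and 10 are hypothetical and not part of our model. As far as we can tell, `clamp(min=...)` only bounds the scale from below, so a large pre-activation still overflows in `torch.exp()`. We are not sure whether this is related to the NaNs we see.

```python
import torch

def scale_min_only(raw: torch.Tensor) -> torch.Tensor:
    # Same pattern as in our decoder: lower bound applied after exp().
    return torch.exp(raw).clamp(min=1.0e-5)

def scale_bounded(raw: torch.Tensor, lo: float = -10.0, hi: float = 10.0) -> torch.Tensor:
    # Hypothetical variant: bound the pre-activation on both sides
    # so that exp() can neither underflow to 0 nor overflow to inf.
    return torch.exp(raw.clamp(min=lo, max=hi))

# Stand-in for the output of self.scale_layer(hidden).
raw = torch.tensor([-100.0, 0.0, 100.0])

print(scale_min_only(raw))  # last entry overflows float32
print(scale_bounded(raw))   # all entries stay finite
```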

Q1 Intuitively, is there something wrong with the modelling assumptions, e.g. the Bernoulli assumption for binary outcomes and the Normal assumption for visual features like action units?

Q1b How would we model causes that are categorical with more than two possible values?

Q3 Intuitively, what would be sensible priors to set for the role modality and the facial features?

Any help would be greatly appreciated!