Latents in DMM Model

Hello everyone,

I am using the DMM model from Pyro. I am training the model on my inputs to learn the latents.
Then I use those latents as inputs to another model (an LSTM), which I train in a supervised way.

So, to take the learned latents from the DMM Model, what is the ideal approach to follow for both training and testing data?

Is it fine if I take the z from line 332 in the guide function, for all samples in the last epoch?
Just want to make sure whether it is correct or not.

Thanks.

why do you want to do that? sounds a bit strange.

you can certainly take the final z_t but there’s no reason to expect it will tell you everything about the sequence (rather it’ll tell you something about the hidden state at the final time slice)

I thought those latents would represent the original inputs better by capturing all the hidden patterns in a higher-dimensional space, similar to how a VAE works. If they don't, what would represent the original inputs better?

you can certainly try but you probably want all {z_t} and not just the last one. also you need to be mindful if your sequences have different lengths, because if so, padding means that the final z_t will be meaningless.
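For the variable-length point above, a minimal sketch of keeping only the non-padded z_t's might look like this (the shapes, `z`, and `seq_lengths` are hypothetical stand-ins, not the DMM example's actual variables):

```python
import numpy as np

# hypothetical shapes: z holds one latent trajectory per sequence,
# padded out to max_len; seq_lengths gives each sequence's true length
batch, max_len, z_dim = 3, 5, 2
z = np.arange(batch * max_len * z_dim, dtype=float).reshape(batch, max_len, z_dim)
seq_lengths = np.array([5, 3, 4])

# keep only the valid (non-padded) z_t's for each sequence
valid_z = [z[i, :seq_lengths[i]] for i in range(batch)]

# the last *meaningful* z_t per sequence, not the padded final slot
last_z = np.stack([z[i, seq_lengths[i] - 1] for i in range(batch)])
```

This way the downstream model never sees latents that correspond to padding.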

Yes, I have tried that: I have taken all the z_t for all samples in the last epoch (as those are the most up-to-date values), and I have taken care of the different lengths. I was wondering whether this is the correct way of taking the values, or is there another way?

Apart from these z values, what other things are learned?

I was wondering whether it is the correct method of taking the values or is there any other way?

i don’t think there’s a well defined notion of “correct” here. it depends on your downstream application. one thing to keep in mind is that these z’s are stochastic. depending on what you’re doing downstream you may instead want to do something else. e.g. sample many, many z’s and then find the mean value of z_t at each time slice.
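The posterior-mean idea above can be sketched as follows. This is a toy stand-in where a plain Normal plays the role of the guide's distribution over z_t at one time slice (in the real DMM, the location and scale would come from the trained Combiner, and you would draw by running the guide repeatedly):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for the guide's distribution over z_t at one time slice
z_loc, z_scale = 1.5, 0.3
num_samples = 10_000

# draw many stochastic z's and average them to estimate the posterior mean
z_samples = rng.normal(loc=z_loc, scale=z_scale, size=num_samples)
z_mean_estimate = z_samples.mean()
# the Monte Carlo average converges to z_loc as num_samples grows
```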

Apart from these z values, what other things are learned?

i’m afraid this question is too vague to answer

Yeah, good point thanks - I will sample many.

Apart from these z values, what other things are learned?

Sorry for not being precise. We have parameters for the model and the guide that are learned during training, right? Theta refers to the parameters of the model and phi to the parameters of the guide. Are the z_loc and z_scale coming from the Combiner theta values, or something else?

Hi,

When I tried to take many samples and average them using the code below in the model and the guide, the runtime shot up dramatically: around 30 minutes per epoch, versus around 2 to 3 minutes per epoch before this change. I am only running 60 epochs, though in the example they ran it for 5000 epochs.

    num_samples_total = 50
    z_t_total = 0.0  # accumulator must be (re)initialized before the loops
    if len(self.iafs) > 0:
        # in the output of a normalizing flow, all dimensions are
        # correlated (the event shape is not empty)
        for i in range(1, num_samples_total + 1):
            z_t = pyro.sample("z_%d_%d" % (t, i),
                              z_dist.mask(mini_batch_mask[:, t - 1]))
            z_t_total += z_t
        z_t = z_t_total / num_samples_total
    else:
        # when no normalizing flow is used, ".to_event(1)" indicates
        # the latent dimensions are independent
        for i in range(1, num_samples_total + 1):
            z_t = pyro.sample("z_%d_%d" % (t, i),
                              z_dist.mask(mini_batch_mask[:, t - 1:t])
                              .to_event(1))
            z_t_total += z_t
        z_t = z_t_total / num_samples_total

What do you think about the implementation? Can it be optimized?

Thanks

i don’t understand what you’re trying to do. what do you do with z_t? why do you need to compute it at each training step?

To get the latent and hidden patterns of my data, so that I can use them to train a neural network to do predictions instead of directly feeding my input to the neural network.

why are you training them jointly? why not learn the dmm first and then learn the second network? this will presumably be much faster

I am not training them jointly. I am first training the DMM, storing all z values at the last epoch and using them later for the second model.

why don’t you just use a single sample (no averaging) obtained at the end of training and use that to learn your downstream model? i mentioned finding the posterior mean as a possibility, not as a requirement. why not see if your approach works at all before doing the more complicated thing?

yes, I have tried that and it is working. But sampling more to improve the accuracy is a very good idea, as you suggested, so I am trying to implement that. I just wanted to know whether my approach can be optimized to reduce the computation time.

note that the code is already fully vectorized. after training you can just run all the data through the dmm but repeat the training inputs. e.g. instead of having a mini-batch {input_1, input_2, …, input_20} have a mini-batch {input_1, input_1, …, input_1}.
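The repetition trick above might be sketched like this (the shapes and the stand-in for the guide's stochastic draws are assumptions for illustration, not the DMM example's actual code):

```python
import numpy as np

num_samples, seq_len, input_dim, z_dim = 50, 10, 4, 3

# one training sequence of shape (seq_len, input_dim)
single_input = np.random.default_rng(0).normal(size=(seq_len, input_dim))

# mini-batch {input_1, input_1, ..., input_1}: tile the same sequence
# across the batch dimension, so one vectorized pass through the guide
# yields num_samples independent stochastic draws of {z_t}
repeated_batch = np.tile(single_input, (num_samples, 1, 1))

# stand-in for the z's the guide would sample: (num_samples, seq_len, z_dim)
z_draws = np.random.default_rng(1).normal(size=(num_samples, seq_len, z_dim))

# average over the sample axis -> one mean latent trajectory per time step
z_mean = z_draws.mean(axis=0)
```

This keeps the per-step Python loop out of training entirely: the averaging happens once, after the DMM is trained.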