Inference using One Hot Categorical Distribution

Currently I’m working on a DMM where the observed variables are RelaxedOneHotCategorical and the latent variables are OneHotCategorical (similar to the DMM example from the tutorials). I have been trying to use infer_discrete for inference of the hidden states, but I get a KeyError no matter which first_available_dim I use for enumeration. This got me thinking that, strictly speaking, I should be using .to_event(1) for all the one-hot categorical sample sites, since the last dimension of any sampled tensor carries a dependency (there is exactly one 1 in that dimension)? Would this be correct please? It does, however, lead to an “enumeration over a Cartesian product is not implemented” error. I can post the code up if that’s helpful? Thanks.
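Roughly, the call that fails looks like this (a minimal sketch; model and data are stand-ins for my actual model function and data):

from pyro.infer import infer_discrete

# sketch of the call that raises the KeyError; -2 leaves dim -1 free for my single plate
inferred_model = infer_discrete(model, first_available_dim=-2, temperature=1)
z_values = inferred_model(data)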

I’m guessing, from the Cartesian product error, that I have in fact converted what should be the (only) batch dimension into an event dimension.

In actual fact, from the error output it looks like, when running infer_discrete, the backward pass is reusing the enumerated dimensions generated during the forward pass (the dimension grows as expected when I choose a different first_available_dim), and this is what is causing a shape error?

missing context but you likely do not want to declare .to_event(1) on your one hot latents because pyro “already knows” that the rightmost dimension is dependent for such latents. to_event is primarily for taking something like a batch of scalar normal distributions and declaring them as a multivariate gaussian distribution with a diagonal covariance (for example)
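for example, a quick sketch of the shape bookkeeping:

import torch
import pyro.distributions as dist

# OneHotCategorical already puts the rightmost dimension in event_shape
d = dist.OneHotCategorical(torch.ones(5, 3) / 3)
print(d.batch_shape, d.event_shape)  # torch.Size([5]) torch.Size([3])

# to_event reinterprets batch dims as event dims, e.g. a batch of independent
# normals becomes one diagonal multivariate normal
n = dist.Normal(torch.zeros(5, 3), 1.0).to_event(1)
print(n.batch_shape, n.event_shape)  # torch.Size([5]) torch.Size([3])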

Thanks :+1: I initially thought that was probably the case with the OneHotCategorical distribution, but I was struggling to find the source of the problems I was having with infer_discrete.

With regard to the same model: if we have a supervised DMM, i.e. one where we know the values of the latent variables as well as the observed variables, can I train the model by marking the latent variables as observed (in addition to the observed variables)? Trying this I get a bad_sites error due to variables in the model not being in the guide. I would have assumed that simply marking the sample site with obs=some_labels would be enough to generate an is_observed message? Or is there some more sophisticated reasoning going on that negates this when another (observed) variable (x in this case) depends on an observed value?

My thinking was that all the variable values in the trace would simply be replaced with their observed values, and the distributional parameters (the trans and emitter networks) would then be trained to maximise the probability of those observed values. Then I can freeze the trans and emitter networks, train the guide, and finally perform inference with the trained model. That is, I use all the observed values to get an MLE of the parameters, train the approximate posterior (using different networks) under this model, and then perform inference.

Or would a better approach be to use an obs_mask which allows observations for the training phase and then masks any observations when the model is used for inferring the latent variables z?
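What I have in mind is something along these lines (a sketch; labels_t and is_labelled are placeholders for my actual label tensor and a boolean mask over the batch):

# sketch: partially/fully observe z_t via obs_mask; the mask would be all True
# while training on labelled data and all False when inferring the latent states
z = pyro.sample("z_%d" % t,
                dist.OneHotCategorical(z_probs),
                obs=labels_t,
                obs_mask=is_labelled)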

I have tried both approaches: I run into the obs errors with the former and into the infer_discrete errors described above with the latter, so it hasn’t been possible to test the two approaches against each other. I’m just wondering where it is best to invest my time? (Currently working on the first approach.)

@Charlie.m can you please provide some code sketches? more often than not, purely verbal descriptions lack the precision required for giving helpful pointers

Thanks. So in pseudo-code to avoid writing hundreds of lines:

I’m making a DMM (similar to the tutorial) that learns its parameters from supervised data of the true hidden states (and observed states), and then, once the model has been trained, infers the hidden states from the observed states. So would the best approach be:

class DMM(nn.Module):

    def __init__(self):
        super().__init__()
        # some networks inheriting from nn.Module
        self.trans = some_network
        self.emitter = some_other_network
        self.combiner = network_that_combines_previous_hidden_state_and_all_observations

    def model(self, data_args):
        # register the networks with pyro so their parameters are trained
        pyro.module("dmm", self)

        z_prev = some_trainable_parameter
        z_out = []
        if data_args_hidden_labels is not None:
            with pyro.plate("z_minibatch", len(mini_batch)) as batch:
                for t in pyro.markov(the_range_of_t):
                    z_probs = self.trans(z_prev)

                    z = sample("z_%d" % t,dist.OneHotCategorical(z_probs),obs=data_args_hidden_labels[batch, t - 1, :],)
                    z_prev = z
                    z_out.append(z)
                    emission_probs_t = self.emitter(z)
                    x = pyro.sample("obs_x_%d" % t,
                                    dist.RelaxedOneHotCategorical(torch.tensor(2.2), emission_probs_t),
                                    obs=data_args_observed[batch, t - 1, :],
                                    )
            return z_out
        else:
            with pyro.plate("z_minibatch", len(mini_batch)) as batch:
                for t in pyro.markov(the_range_of_t):
                    z_probs = self.trans(z_prev)

                    z = sample("z_%d" % t, dist.OneHotCategorical(z_probs),infer={"enumerate": "parallel"}, )
                    emission_probs_t = self.emitter(z)
                    x = pyro.sample("obs_x_%d" % t,
                                    dist.RelaxedOneHotCategorical(torch.tensor(2.2), emission_probs_t),
                                    obs=mini_batch[batch, t - 1, :],
                                    )

                    z_prev = z
                    z_out.append(z)
            return z_out

    def pass_guide(self, data_args):
        # empty guide: in the supervised phase every site in the model is observed
        pass

    @config_enumerate
    def inference_guide(self, data_args):
        pyro.module("dmm", self)
        z_prev = some_trainable_parameter  # initial state, as in the model
        z_out = []
        with pyro.plate("z_minibatch", len(mini_batch)) as batch:
            for t in pyro.markov(the_range_of_t):

                z_loc = self.combiner(z_prev, rnn_output(data_args_observed_data))

                z = pyro.sample("z_%d" % t, dist.OneHotCategorical(z_loc),infer={"enumerate": "parallel"},)
                z_out.append(z)

                z_prev = z
        return z_out

    # SVI: train the model with Trace_ELBO using pass_guide

    # freeze all parameters except the combiner network and the initial state parameter used by the guide

    # SVI: train the posterior (guide), enumerating the sites in the guide

    # run inference using the now-trained model and infer_discrete

Details omitted for brevity. In the first stage of training (of the model) I’m maximising the likelihood of the observed outcomes by adjusting the model’s parameters; in the second stage (of the guide) I’m finding the closest posterior to the model that the guide’s restrictions allow; then I simply run inference using the model’s best guess. In other words, I use all the observed values to get an MLE of the parameters, train the approximate posterior (with different networks) under that model, and then perform inference. I guess this makes the guide superfluous in this use case? But could it then be used to train further with more unlabelled examples (semi-supervised)?
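Concretely, the last four comment lines in the sketch correspond to a pipeline something like this (again a sketch rather than working code; num_steps, labelled_data_args and unlabelled_data_args are placeholders, and I’m assuming requires_grad_(False) is enough to keep the model networks frozen):

import pyro
from pyro.infer import SVI, Trace_ELBO, TraceEnum_ELBO, infer_discrete
from pyro.optim import Adam

dmm = DMM()
optim = Adam({"lr": 1e-3})

# stage 1: supervised MLE of trans/emitter; all sites are observed, so the guide is empty
svi = SVI(dmm.model, dmm.pass_guide, optim, loss=Trace_ELBO())
for _ in range(num_steps):
    svi.step(labelled_data_args)

# stage 2: freeze the model networks, then fit the guide with enumeration
for p in list(dmm.trans.parameters()) + list(dmm.emitter.parameters()):
    p.requires_grad_(False)
svi = SVI(dmm.model, dmm.inference_guide, optim,
          loss=TraceEnum_ELBO(max_plate_nesting=1))
for _ in range(num_steps):
    svi.step(unlabelled_data_args)

# stage 3: MAP inference of the hidden states with infer_discrete
# (first_available_dim=-2 leaves dim -1 for the z_minibatch plate)
serving_model = infer_discrete(dmm.model, first_available_dim=-2, temperature=0)
z_map = serving_model(unlabelled_data_args)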

For some reason, I also keep getting KeyErrors from dim_to_symbol with this model. I’ll investigate this, as it occurs in all the models of this type (supervised and unsupervised) that I’ve been running. The model does seem to handle enumeration correctly when running an enumerated trace, and it is clearly picking up the enumeration instruction, since the trace reports an enumeration dimension for the relevant nodes.

On debugging, it looks like an unenumerated tensor that should have been enumerated has ended up with an extra dimension.