About the optimization results

Hello, I’m building an HMM topic model implemented in Pyro. Before applying the model to the real dataset, I ran a synthetic experiment: I generated a batch of synthetic data from known parameters, then trained the model on that data to evaluate how well the parameters are recovered. Training runs very smoothly in terms of the loss curve:
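For readers unfamiliar with this kind of check, a minimal sketch of generating synthetic data from known HMM parameters might look like the following. This is not the author's actual model (which is a topic model on top of the HMM); the parameter values and function name here are purely illustrative.

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng):
    """Sample one length-T observation sequence from a categorical HMM.

    pi: (K,)   initial state probabilities
    A:  (K, K) transition matrix, rows sum to 1
    B:  (K, V) emission matrix, rows sum to 1
    """
    K, V = B.shape
    states = np.empty(T, dtype=int)
    obs = np.empty(T, dtype=int)
    states[0] = rng.choice(K, p=pi)
    obs[0] = rng.choice(V, p=B[states[0]])
    for t in range(1, T):
        states[t] = rng.choice(K, p=A[states[t - 1]])
        obs[t] = rng.choice(V, p=B[states[t]])
    return states, obs

# Toy ground-truth parameters (illustrative only)
rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
states, obs = sample_hmm(pi, A, B, T=100, rng=rng)
```

The model is then fit to `obs` alone, and the fitted parameters are compared against `pi`, `A`, and `B`.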


Then I compared the recovered parameters against the true values: some of them were reproduced almost perfectly, but others were completely different from the true values:




These parameters are the distributions over topics (which I call motivations here), and the number of topics is manually set to 10. I used the AutoDelta guide, and I don’t know whether this result is enough to show that my implementation is correct and the model is valid.
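One thing worth ruling out before concluding that recovery failed: topic models are identifiable only up to a permutation of the topic labels (label switching), so the estimated topic distributions should be aligned to the true ones before comparing element-wise. A minimal sketch of such an alignment, with hypothetical toy arrays (not the poster's actual parameters):

```python
import itertools
import numpy as np

def match_topics(true_topics, est_topics):
    """Find the relabeling of estimated topics closest (total L1) to the true ones.

    Brute force over permutations -- fine for small K. For K around 10,
    build the pairwise L1 cost matrix and use
    scipy.optimize.linear_sum_assignment instead.
    """
    K = true_topics.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(K)):
        cost = np.abs(true_topics - est_topics[list(perm)]).sum()
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), est_topics[list(best_perm)]

# Toy example: the estimate is nearly correct but with labels swapped
true_topics = np.array([[0.80, 0.10, 0.10], [0.10, 0.10, 0.80]])
est_topics = np.array([[0.12, 0.09, 0.79], [0.78, 0.12, 0.10]])
perm, aligned = match_topics(true_topics, est_topics)
```

If the recovered distributions match well only after such a relabeling, the model is fine and the apparent mismatch is just label switching.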

Regarding the parameters that differ substantially from the true values, I can currently think of the following possible causes:

  • the batch size of the synthetic data is too small; the current batch size is 2500;
  • the manually specified parameters are unreasonable. One of the latent variables in my model is a continuous random variable of shape (Topic_num, Hidden_state_num, X), and each sample’s topic assignment depends directly on it. For some hidden states, the values along this dimension may be too large, so the generated data may not cover all of the topics;
  • AutoDelta may not be expressive enough to train this complex model. I tried AutoNormal, but encountered errors with the Simplex() constraint;
  • the learning rate is unreasonable. ClippedAdam with init_lr=1e-2 and gamma=0.1 is applied to my model, and the total number of SVI steps is set to 1200.
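The second point above is easy to check directly before training: count how often each topic actually appears in the generated data. A short sketch, with hypothetical variable names (the actual assignments would come from the poster's data-generation code):

```python
import numpy as np

def topic_coverage(topic_assignments, num_topics):
    """Report how often each topic occurs in the synthetic data.

    topic_assignments: 1-D integer array of per-sample topic indices.
    Returns (counts, missing), where missing lists topics never sampled.
    """
    counts = np.bincount(topic_assignments, minlength=num_topics)
    missing = np.flatnonzero(counts == 0)
    return counts, missing

# Illustrative example: a skewed topic distribution leaves topics unsampled,
# so their distributions cannot be recovered no matter how training goes.
rng = np.random.default_rng(0)
probs = np.array([0.3, 0.3, 0.2, 0.1, 0.05, 0.05, 0.0, 0.0, 0.0, 0.0])
assignments = rng.choice(10, size=2500, p=probs)
counts, missing = topic_coverage(assignments, 10)
```

If `missing` is nonempty (or some counts are tiny), the corresponding topic parameters are simply unidentifiable from this batch, which would explain near-perfect recovery of some topics and arbitrary values for others.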

I would appreciate any advice. Thank you very much.