Text generation Markov Model

maamli · August 23, 2022, 9:34pm

Hi everyone,
I am trying to implement a basic Markov model that predicts the next character (an English letter) given the previous character (a pairwise model).
The data is an array of ordinal values (0-25), one per character, and I need a probabilistic way to generate the next character:
P(c2|c1), where c1 is the given character and c2 is the following character.
Although I understand this can be done by counting and computing frequencies, I am learning Pyro, so I am trying to understand how to set up this model so that I can build on it to create more complex models.
Questions:

Is it necessary to convert the data to a tensor of shape (26,26) holding the counts to get things set up, or can the model be designed to learn from one row at a time?
Assuming the count matrix is set up, does the code below make sense?

num_characters = 26
def model(counts):
    next_ch_probs = pyro.sample('next_ch_probs', dist.Dirichlet(torch.ones(num_characters,num_characters)/num_characters))
    pyro.sample('counts', dist.Multinomial(26*26, next_ch_probs), obs=counts)

If you have a better way of framing the problem or an example to share, please do. I am trying to learn.

fritzo · August 25, 2022, 7:17pm

Hmm, I think you'll want to fit a plate full of multinomials:

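import torch
import pyro
import pyro.distributions as dist

num_characters = 26

def model(counts):
    # counts: (26, 26) tensor; counts[i, j] = number of times character j follows character i.
    next_ch_probs = pyro.sample(
        'next_ch_probs',
        # .to_event(1) declares the 26 rows as one joint sample, since this site sits outside the plate.
        dist.Dirichlet(torch.ones(num_characters, num_characters) / num_characters).to_event(1),
    )
    with pyro.plate("characters", num_characters):
        pyro.sample(
            'counts',
            dist.Multinomial(probs=next_ch_probs, validate_args=False),
            obs=counts,
        )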

where validate_args=False works around Multinomial's lack of support for heterogeneous total counts (each row of counts can sum to a different total).
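
For the first question, here is a minimal sketch of one way to build the (26, 26) count matrix from the array of ordinals, using only torch. The helper name bigram_counts and the toy sequence are hypothetical and not from the thread. The sketch also checks that, with validation off, Multinomial.log_prob accepts rows whose totals differ. Finally, as a sanity check, it computes the closed-form posterior mean implied by the Dirichlet prior above, which a fitted Pyro model should roughly agree with.

```python
import math
import torch

num_characters = 26

def bigram_counts(chars, num_characters=num_characters):
    """Return a (K, K) tensor where entry [i, j] counts how often j follows i."""
    chars = torch.as_tensor(chars, dtype=torch.long)
    prev, nxt = chars[:-1], chars[1:]
    # Map each (prev, next) pair to one flat index, histogram, then reshape.
    flat = prev * num_characters + nxt
    counts = torch.bincount(flat, minlength=num_characters ** 2)
    return counts.reshape(num_characters, num_characters).float()

# Hypothetical toy data: the string "abab" encoded as ordinals 0-25.
data = torch.tensor([0, 1, 0, 1])
counts = bigram_counts(data)

# Row totals differ per character (here 2, 1, and 0 for the rest).
row_totals = counts.sum(-1)

# With validate_args=False, log_prob works on each row regardless of its total.
uniform = torch.full((num_characters, num_characters), 1.0 / num_characters)
log_prob = torch.distributions.Multinomial(
    probs=uniform, validate_args=False
).log_prob(counts)

# Conjugate sanity check: Dirichlet(alpha) prior + multinomial counts
# gives a Dirichlet(alpha + counts) posterior for each row.
alpha = torch.ones(num_characters, num_characters) / num_characters
posterior = alpha + counts
posterior_mean = posterior / posterior.sum(-1, keepdim=True)

# Draw a next character given the previous character 'a' (ordinal 0).
next_char = torch.multinomial(posterior_mean[0], num_samples=1).item()
```

The resulting counts tensor has the shape that model(counts) above expects, so it could be passed in directly as the observation.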