MCMC for biology

Diesdas · November 14, 2023, 10:14am

Dear all,

I am very much a beginner in probabilistic programming. My problem comes from genetics and I hope it might have a nice solution in pyro, as it can not, to my knowledge, be solved analytically (for larger data).

A formulation:
I have a sequence of n genetic sites, each site having the value one. This sequence now undergoes multiple changes. At every iteration, each sequence site can change by adding or subtracting a 1 or leaving the site unchanged. However, changes are not independent between sites, but correlate with changes at neighboring sites (which I model with a Markov Chain). Now, given an input sequence and an output sequence, I would like to know, how many iterations have happened.

Here is my attempt:

import torch as tc
import warnings; warnings.simplefilter("ignore", FutureWarning)
import pyro
import pyro.distributions as dist
from pyro.infer.mcmc import MCMC, NUTS

pyro.set_rng_seed(1)

I first build a function from which I can sample a markov chain. The output is the change of my sequence in each iteration:

First defining the transition probabilities

transition_prob = tc.tensor([[0.7,0.3,0.0],[0.025,0.95,0.025],[0.0,0.3,0.7]])

Defining the MC and sampling an example of length 20:

def mc(transition_prob, length):
    category = tc.tensor(1)  # start with one category
    final_markov_chain = []
    for t in range(length+1):
        if t > 0:
            category = pyro.sample("category_{}".format(t), dist.Categorical(transition_prob[category]))
            final_markov_chain.append(category)
            
    return ((tc.stack(final_markov_chain))-1).float()

mc(transition_prob, 20)
Out:
tensor([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
             0.,  0.,  0.,  0., -1., -1.])

So the last two sites would be changed with this mc output and at this iteration. Now i model a system that receives an input and conducts these changes ‘max_iter’ times. max_iter is sampled from a Poisson distribution

def model_system(inp):
    poisson_prior = tc.tensor(4.0) #pyro.sample("poisson_prior", dist.HalfNormal(6.0,3.0))
    
    max_iter = pyro.sample("poisson", dist.Poisson(poisson_prior))
    
    iteration = 0
    output_vector = inp.clone()
    while iteration <= max_iter:
        iteration +=1
        change = mc(transition_prob, length = inp.shape[0])
        
       #as an additional tweak, sites that have the value zero are stuck at zero
        output_vector = tc.where(output_vector == 0, tc.tensor(0.0), output_vector + change)
        
    return output_vector

model_system(tc.ones(100))

Out:
tensor([1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 3.,
        3., 2., 2., 1., 2., 2., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2., 2., 2., 2., 2.,
        1., 0., 1., 2., 2., 2., 2., 2., 1., 2., 0., 0., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 2., 1.,
        1., 0., 1., 1., 1., 1., 1., 1., 2., 1.])

This returns nicely what I expect as output.

My problem is the following:
Given The output vector, input vector and transition probabilities, how can I find the parameter max_iter (i.e. how often did the input sequence undergo change)? How would I encode the observed output vector in pyro, since the output of model_system is not sampled directly from a distribution?

Thanks a lot!