Hi everyone,
I’m trying to build a non-standard mixture model.
Simplifying slightly, my data looks like this:
obs, sample_id, replicate_id
0.1, A, 1
0.05, A, 1
0.15, A, 1
1.3, A, 2
1.2, A, 2
1.1, A, 2
15, B, 1
15.1, B, 1
15.1, B, 2
15.1, B, 2
14.9, B, 3
14.9, B, 3
And I want my model to look like this:
assigned_cluster # depends on sample_id AND replicate_id
locs # depends on sample_id alone
y_i ~ locs[sample_id_i, assigned_cluster_{ij}] # this isn't proper notation
The challenge I’m dealing with is that different sample_ids have different maximum numbers of clusters, so expressing this in plate notation is complicated.
In Python pseudocode:
n_max_replicates = data.groupby("sample_id")["replicate_id"].max().to_dict()
for _, row in data.iterrows():
    sample_id = row["sample_id"]
    replicate_id = row["replicate_id"]
    # both of these have length n_max_replicates[sample_id]
    cluster_probas = assignments[sample_id][replicate_id]
    cluster_means = locs[sample_id]
I had a few ideas on how to proceed:
- The maximum number of clusters is not that high (it’s 4). Maybe I can split my likelihood into 4 blocks, one per cluster count, and group the observations by the number of clusters of their sample_id?
- Generate loc as an array of shape (n_sample_id, max_cluster_size), and then ignore all the unused elements of loc?
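To make the second idea concrete, here is a minimal sketch of what I have in mind, written in plain NumPy rather than with the library's own primitives. It pads loc to the largest cluster count and sets the log-weights of the unused slots to -inf, so they drop out of the mixture likelihood. The Gaussian likelihood, the shared scale, the parameter values, and the names n_clusters, valid, logits and mixture_log_lik are all assumptions for illustration, not something taken from my real model.

```python
import numpy as np

# Toy data from above, with sample_id encoded as A=0, B=1 and replicate_id made 0-based.
obs = np.array([0.1, 0.05, 0.15, 1.3, 1.2, 1.1, 15.0, 15.1, 15.1, 15.1, 14.9, 14.9])
sample_idx = np.array([0] * 6 + [1] * 6)
replicate_idx = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2, 2])

# Number of clusters per sample (here: the max replicate_id per sample).
n_clusters = np.array([2, 3])
max_k = n_clusters.max()
# valid[s, k] is True when cluster k exists for sample s; False marks padding.
valid = np.arange(max_k)[None, :] < n_clusters[:, None]

# Hypothetical parameter values; locs[0, 2] is padding and should never matter.
locs = np.array([[0.1, 1.2, 0.0], [14.9, 15.0, 15.1]])
scale = 0.1  # assumed shared Gaussian scale
# Unnormalised cluster scores per (sample, replicate, cluster).
logits = np.zeros((2, 3, max_k))


def masked_log_softmax(logits, valid):
    """Log-softmax over the last axis, giving padded slots log-weight -inf."""
    masked = np.where(valid, logits, -np.inf)
    m = masked.max(axis=-1, keepdims=True)
    return masked - m - np.log(np.exp(masked - m).sum(axis=-1, keepdims=True))


def mixture_log_lik(obs, sample_idx, replicate_idx, locs, log_w, scale):
    """Per-observation log-likelihood with the cluster assignment summed out."""
    mu = locs[sample_idx]                  # (n_obs, max_k)
    lw = log_w[sample_idx, replicate_idx]  # (n_obs, max_k)
    z = (obs[:, None] - mu) / scale
    log_norm = -0.5 * z**2 - np.log(scale) - 0.5 * np.log(2 * np.pi)
    joint = lw + log_norm                  # padded slots stay at -inf
    m = joint.max(axis=-1, keepdims=True)
    return m[:, 0] + np.log(np.exp(joint - m).sum(axis=-1))


log_w = masked_log_softmax(logits, valid[:, None, :])  # mask broadcasts over replicates
log_lik = mixture_log_lik(obs, sample_idx, replicate_idx, locs, log_w, scale)
```

Because the padded slots carry zero weight, changing the unused entries of locs (to any finite value) leaves log_lik unchanged. They would still be sampled from their prior, though, which is the part I'm unsure about.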
Thanks for all the work on this great library!
Federico