Hi everyone!

I am dealing with a very specific biochemical dataset, where I have a set of molecules that are constructed out of building blocks, sort of like legos.

A molecule will be made out of three building blocks (bb1, bb2, bb3), and each building block can be one of hundreds~thousands.

Therefore the actual space of possible molecules is quite large.

The dataset I have is a set of pairs [[(bb1, bb2, bb3), activity], …] and I want to fit a Poisson model like:

```
molecule_activity ~ Poisson(lambda_mno)
lambda_mno ~ sum_{m in [i, j, l]}[(onehot bb_m).alpha_m] + \sum_{m,n in [i, j, k]}[(onehot bbm, bbn).alpha_mn] + alpha_{mno} (onehothot bbm, bbn, bbo)
alpha_* ~ Gamma() # enforcing sparsity with a high prior on 0
```

The idea here is to figure out which subset of building blocks is responsible for the `molecule_activity`

by looking at the posterior values of `alpha_m`

, `alpha_mn`

, and `alpha_mno`

.

I have a small scale POC which works very well, but not sure what would the best way to intelligently deal with the sparsity here - for the full scale model, the `1-hot_mn`

vectors would be up to several hundred-thousand dimensional, and the `1-hot_mno`

vectors would be million-dimensional. The dataset itself has several 10s millions of lines.

The sparsity structure (for a given line in the dataset, only three possible 2-interactions [mn, mo, and on] and only one possible 3-interaction term [mno] are non-zero) doesn’t seem amenable to solutions I’ve seen in the litterature:

- The Kernel Interaction Trick needs strong hierarchy to work (where an interaction term (m,n) will be non-zero if one of the mono-terms m or n is non-zero) - but the data doesn’t respect that. It is quite possible that an (m, n) molecule will have strong activity but that its individual components will not.
- Hierachical modeling: I’m thinking we could represent the data as a three-layer tree with a high branching factor, but I think that this will only output conditionals, but the data doesn’t necessarily have a dependence structure that fits this (i.e., at the third step, a building block
`o`

could have high activity, regardless of what happens at the previous steps)

Any guidance in how best to model this combinatorial dataset would be much appreciated!