Dealing with molecular combinatorial chemistry data

hpalacci · October 17, 2022, 4:55pm

Hi everyone!
I am dealing with a very specific biochemical dataset, where I have a set of molecules that are constructed out of building blocks, sort of like legos.
A molecule will be made out of three building blocks (bb1, bb2, bb3), and each building block can be one of hundreds~thousands.
Therefore the actual space of possible molecules is quite large.
The dataset I have is a set of pairs [[(bb1, bb2, bb3), activity], …] and I want to fit a Poisson model like:

molecule_activity ~ Poisson(lambda_mno)
lambda_mno ~ sum_{m in [i, j, l]}[(onehot bb_m).alpha_m] + \sum_{m,n in [i, j, k]}[(onehot bbm, bbn).alpha_mn] + alpha_{mno} (onehothot bbm, bbn, bbo)
alpha_* ~ Gamma() # enforcing sparsity with a high prior on 0

The idea here is to figure out which subset of building blocks is responsible for the molecule_activity by looking at the posterior values of alpha_m, alpha_mn, and alpha_mno.

I have a small scale POC which works very well, but not sure what would the best way to intelligently deal with the sparsity here - for the full scale model, the 1-hot_mn vectors would be up to several hundred-thousand dimensional, and the 1-hot_mno vectors would be million-dimensional. The dataset itself has several 10s millions of lines.

The sparsity structure (for a given line in the dataset, only three possible 2-interactions [mn, mo, and on] and only one possible 3-interaction term [mno] are non-zero) doesn’t seem amenable to solutions I’ve seen in the litterature:

The Kernel Interaction Trick needs strong hierarchy to work (where an interaction term (m,n) will be non-zero if one of the mono-terms m or n is non-zero) - but the data doesn’t respect that. It is quite possible that an (m, n) molecule will have strong activity but that its individual components will not.
Hierachical modeling: I’m thinking we could represent the data as a three-layer tree with a high branching factor, but I think that this will only output conditionals, but the data doesn’t necessarily have a dependence structure that fits this (i.e., at the third step, a building block o could have high activity, regardless of what happens at the previous steps)

Any guidance in how best to model this combinatorial dataset would be much appreciated!