Ideas for subsampling features?

austinv11 · April 24, 2023, 2:48pm

Hello, I have a rather large dataset with some sparsity.

I know that I can subsample the rows of the data. But is there a suggested approach for subsampling features? I don’t want to waste GPU resources on features that are not present in my random mini batch of rows.

Thanks!

martinjankowiak · April 24, 2023, 3:52pm

you would need to provide more details. generally speaking, subsampling features will lead to bias, which is not the case for data subsampling (when suitable conditional independence structure is present)
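
To illustrate the data-subsampling half of that statement, here is a minimal sketch, assuming the log-likelihood is a sum of independent per-row terms. The per-row values are random stand-ins rather than output from any real model, and all variable names are hypothetical. It enumerates every possible mini batch and checks that the rescaled mini-batch sum averages out to the full-data sum.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
N, B = 6, 2  # toy dataset size and mini-batch size (assumed values)

# Stand-in for the per-row terms log p(x_i | theta); not from a real model.
per_row_loglik = rng.normal(size=N)
full = per_row_loglik.sum()

# Rescaled mini-batch estimate for every possible batch of B rows.
estimates = [
    (N / B) * per_row_loglik[list(idx)].sum()
    for idx in combinations(range(N), B)
]

# Averaging over all equally likely batches recovers the full-data sum.
mean_estimate = float(np.mean(estimates))
```

No comparable rescaling is available in general when features are dropped, because the features do not usually enter the model as independent additive terms.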

austinv11 · April 24, 2023, 4:30pm

So in my case I am trying to model single-cell data. For each row (cell), there is high sparsity in its features (genes) either due to random missed observations or due to a cell simply not expressing a gene.
The upshot is that there are many 0s and it is unclear how informative these 0s are.
In my model, gene expression is given as input, but when a value is 0 I would expect that feature to be unimportant in my downstream model process, so rather than loading a large model onto the GPU I would prefer to work only with the features that matter.

The tricky thing is that the features that are 0 for all data rows can vary with each mini batch depending on how it samples the data.

Does that help provide clarity?

martinjankowiak · April 25, 2023, 10:50am

i don’t think there’s much you can do without introducing bias one way or another.

why can’t you do what is, afaik, standard in the field, which is to remove dimensions that are less variable using some heuristic, e.g. scanpy.pp.highly_variable_genes (Scanpy 1.9.3 documentation)
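
A minimal sketch of the idea behind that suggestion, written in plain NumPy rather than with scanpy itself: pick one fixed gene subset from the full dataset using a variability heuristic, then reuse that same subset for every mini batch. The function name `select_variable_genes`, the simple variance/mean criterion, and the toy count matrix are all assumptions for illustration. scanpy's `highly_variable_genes` uses more elaborate procedures and is not reproduced here.

```python
import numpy as np


def select_variable_genes(counts, n_top):
    """Return sorted column indices of the n_top genes with the highest
    dispersion (variance / mean), computed over all cells."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    # Genes that are zero in every cell have mean 0; assign them dispersion 0.
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    top = np.argsort(dispersion)[::-1][:n_top]
    return np.sort(top)


rng = np.random.default_rng(0)

# Toy cells-by-genes count matrix (hypothetical data).
counts = rng.poisson(0.3, size=(1000, 50)).astype(float)
counts[:, 5] = 0.0  # a gene never observed in any cell
counts[:, 7] = rng.negative_binomial(1, 0.1, size=1000)  # an overdispersed gene

# Select the feature subset once, up front, from the whole dataset.
keep = select_variable_genes(counts, n_top=10)
reduced = counts[:, keep]

# Every mini batch of rows then shares the same, smaller feature set.
batch = reduced[rng.choice(len(reduced), size=64, replace=False)]
```

Because the subset is chosen once instead of per batch, the model's input dimension stays fixed throughout training, and only the reduced matrix has to be moved to the GPU.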