Gene length per cell in in silico perturb

#514
by Tak0329 - opened

Hello. I have one question.
Why does the number of genes per cell sometimes differ when I run in silico perturbation on tokenized public scRNA-seq data?
For example, when I analyze 1000 cells with the 4096 model, the first cells have 4096 genes (a length of 4096), but the number of genes gradually gets smaller in later cells. Is this a quality issue with the scRNA-seq data I used? Also, should cells with such short lengths be excluded from the analysis?

Thanks for your time.

Thanks for your question! In scRNA-seq data, each cell has a different number of genes detected, so it is normal for the number of genes to vary by cell. In the in silico perturbation, we order the cells from the largest number of genes to the smallest so that any memory constraints are encountered earlier rather than later. The order does not impact the results, since in silico perturbation is inference-only and does not involve training the model. As for general scRNA-seq quality control, it should be done with standard methods before analysis with Geneformer, as needed for the particular dataset.
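To make the ordering concrete, here is a minimal toy sketch in plain Python. It is not Geneformer's actual code: the names `tokenized_length` and `sort_cells_by_length`, the example gene counts, and the assumption that a cell's length is simply its detected-gene count capped at the model input size (ignoring any special tokens) are all illustrative.

```python
# Toy illustration only; not the Geneformer implementation.
MODEL_INPUT_SIZE = 4096  # input size of the 4096 model mentioned in the question


def tokenized_length(n_detected_genes, max_len=MODEL_INPUT_SIZE):
    """Assumed rule: length = number of detected genes, capped at max_len."""
    return min(n_detected_genes, max_len)


def sort_cells_by_length(cells):
    """Order cells from longest to shortest so the largest inputs run first."""
    return sorted(cells, key=lambda c: c["length"], reverse=True)


# Hypothetical detected-gene counts for six cells.
detected = [1200, 5300, 4096, 800, 6100, 2500]
cells = [
    {"cell_id": i, "length": tokenized_length(n)} for i, n in enumerate(detected)
]

ordered = sort_cells_by_length(cells)
lengths = [c["length"] for c in ordered]
order = [c["cell_id"] for c in ordered]

print(lengths)  # lengths start at the cap and then decrease
print(order)    # original cell indices, so results can be mapped back
```

Under these assumptions, the first cells all sit at 4096 and later ones get shorter, which matches the pattern described in the question. Since each cell is only passed through the model for inference, the order changes when a cell is processed, not what is computed for it.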

ctheodoris changed discussion status to closed
