ctheodoris/Geneformer · Genecorpus-95M availability

18 days ago

Hi,

I am wondering whether the training data for Geneformer-95M has been posted online yet.

I also have a question about the Geneformer-95M tokenizer. I noticed that the token dictionary has ~20,000 entries for 95M but ~25,000 for 30M. Is there a particular reason the new token dictionary is smaller?

Thanks!

ctheodoris

Owner 18 days ago

Thank you for your question. The training data will be added to a dataset repository analogous to the Geneformer-30M repository and linked here when available. The dictionary sizes differ because of the genes included: the 30M dictionary included protein-coding genes as well as miRNA and lincRNA genes, whereas the 95M dictionary contains only protein-coding genes. This change was made because many scRNAseq methods do not quantify miRNA/lincRNA gene expression, and that inconsistency could leave the model unable to tell whether those genes were deleted in certain samples or simply missing because of the method used for that sample.
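
As a rough illustration of why restricting the vocabulary to protein-coding genes shrinks the dictionary, here is a minimal sketch. It is not the actual Geneformer tokenizer code: the gene names, biotype labels, special tokens, and the helper functions `build_token_dictionary` and `protein_coding_only` are all hypothetical placeholders.

```python
# Toy illustration only: gene IDs, biotypes, and special tokens are made up
# and do not come from the real Geneformer token dictionaries.
SPECIAL_TOKENS = ["<pad>", "<mask>"]  # placeholder special tokens

# Hypothetical gene -> biotype annotation table.
gene_biotype = {
    "GENE_A": "protein_coding",
    "GENE_B": "protein_coding",
    "GENE_C": "lincRNA",
    "GENE_D": "miRNA",
    "GENE_E": "protein_coding",
}


def build_token_dictionary(genes, special_tokens=SPECIAL_TOKENS):
    """Assign consecutive integer IDs: special tokens first, then sorted genes."""
    vocab = list(special_tokens) + sorted(genes)
    return {token: idx for idx, token in enumerate(vocab)}


def protein_coding_only(biotypes):
    """Return the genes annotated as protein coding."""
    return [gene for gene, biotype in biotypes.items() if biotype == "protein_coding"]


# Dictionary over all annotated genes (analogous in spirit to the larger vocabulary).
full_dict = build_token_dictionary(gene_biotype.keys())

# Dictionary restricted to protein-coding genes (analogous to the smaller vocabulary).
pc_dict = build_token_dictionary(protein_coding_only(gene_biotype))

# Genes that no longer receive a token after the restriction.
dropped = sorted(set(full_dict) - set(pc_dict))

print(len(full_dict), len(pc_dict), dropped)
```

In this toy setup, the non-coding genes simply have no token in the restricted dictionary, so a sample processed with a method that never measures them looks the same to the model as any other sample.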

ctheodoris changed discussion status to closed 18 days ago