Recreating embeddings from cellxgene
Hi!
I am currently experimenting with the Geneformer 95M model that was used to create the embeddings for the cells in CELLxGENE. As a sanity check, I wanted to see if I could recreate the same embeddings, but this has not been successful so far.
My process has been to download the dataset (with its Geneformer embeddings) via the CELLxGENE API, tokenize the raw data, and then create embeddings using the extract-embeddings script located in "examples". Is that the wrong approach? And if so, how should I do it?
This is a warning I get when I load the model in the extract-embeddings script (I am using model type CellClassifier):
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /Geneformer/fine_tuned_models/gf-12L-95M-i4096_MTLCellClassifier_CELLxGENE_240522 and are newly initialized:
['classifier.bias', 'classifier.weight']
Could this be the reason? Or should I have loaded it differently? I just need to know how to replicate your embeddings for any given dataset!
The default dictionary was also changed to the 95M one.*
Update: I created a single-cell dataset and tokenized it twice, once from a dataset downloaded from the CELLxGENE website and once from a dataset downloaded through the API. The two tokenizations were identical, so my theory is that the issue lies in the EmbExtractor; it may be that I am using the wrong parameters when creating the embeddings.
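For what it's worth, this is roughly how I'm comparing the two embedding runs (a minimal sketch, assuming both runs are loaded as NumPy arrays with cells in the same row order, which may itself be part of the problem):

```python
import numpy as np

def compare_embeddings(a, b):
    """Compare two embedding matrices row by row (cells assumed aligned)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Per-cell maximum absolute difference between the two runs.
    max_diff = np.abs(a - b).max(axis=1).max()
    # Per-cell cosine similarity; the minimum shows the worst-matching cell.
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return max_diff, cos.min()

# Toy check: identical embeddings should agree exactly.
emb = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
max_diff, min_cos = compare_embeddings(emb, emb.copy())
```

If the minimum cosine similarity is near 1 but the raw differences are nonzero, that would point at numerical/ordering issues rather than a genuinely different model.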
Thank you for your question. You can load the MTLCellClassifier CELLxGENE model as Pretrained and extract embeddings from the last layer to replicate the CELLxGENE-hosted ones. If you load it as a CellClassifier, a randomly initialized classification head is added, which can produce random embeddings if you extract from that layer.
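To illustrate, a sketch of what that looks like with the EmbExtractor from the Geneformer package; the parameter values here (batch size, nproc, paths) are illustrative assumptions, and the emb_layer convention is worth double-checking against the docs for your installed version:

```python
from geneformer import EmbExtractor

# Load the fine-tuned model as "Pretrained" so no new classification
# head is initialized, and take embeddings from the last layer.
embex = EmbExtractor(
    model_type="Pretrained",  # NOT "CellClassifier": that adds a random head
    num_classes=0,            # no classification head for Pretrained
    emb_layer=0,              # 0 = last layer in EmbExtractor's convention
    max_ncells=None,          # process all cells
    forward_batch_size=50,    # illustrative value
    nproc=4,                  # illustrative value
)

embs = embex.extract_embs(
    "/Geneformer/fine_tuned_models/gf-12L-95M-i4096_MTLCellClassifier_CELLxGENE_240522",
    "path/to/tokenized.dataset",  # hypothetical path to your tokenized dataset
    "output_dir",
    "output_prefix",
)
```

This is a configuration sketch, not a verified run; it needs the Geneformer package and the downloaded model files in place.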
Thank you! And the rest of the settings, should they be similar to the ones in the script? Another question: the embeddings should come out in the same order, right? I tried the Pretrained setting as well and they still differed!
Please share the settings you are using so we can check whether any others differ from what we would expect.