Fixed number of perturbed genes per cell

#529
by ZYSK-huggingface - opened

Hi!

I noticed that in the in silico perturbation output files, each cell appears to have a fixed number of perturbed genes (e.g., 120), regardless of its input length or total number of expressed genes.

Given that the 95M model should allow up to 4096 genes per cell, this suggests that only the top-ranked ~120 genes are being perturbed in each cell. This could artificially constrain the gene coverage and affect metrics like n_detections.

Could you confirm whether this is expected behavior, and if so, at which step the restriction is applied?

Best regards

To be specific, I checked several of the intermediate pickle files directly and counted the number of perturbed genes under both the start and goal states. It's always 120 for the 95M model and 60 for the 30M model. I’m sure that my code and parameter settings are correct.
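For reference, this is roughly how I counted them (a minimal sketch; the directory name is a placeholder, and it assumes each intermediate pickle holds a dict whose keys are tuples with the perturbed gene token as the first element, which is what I saw in my files):

```python
import pickle
from pathlib import Path

# Count the distinct perturbed genes recorded in each intermediate pickle.
# Assumption (matches what I saw in my files): each pickle holds a dict whose
# keys are tuples with the perturbed gene token as the first element.
def count_perturbed_genes(pickle_path):
    with open(pickle_path, "rb") as f:
        results = pickle.load(f)
    return len({key[0] for key in results})

for path in sorted(Path("in_silico_output").glob("*.pickle")):  # placeholder directory
    print(path.name, count_perturbed_genes(path))  # always prints 120 (95M) / 60 (30M)
```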

Thanks for your question. If you are running it with "all" for genes_to_perturb, it should not be restricting the perturbation to only the top-ranked 120 genes. The output files are saved at regular intervals during the run, with the save frequency determined by the batch size you selected for your memory allowance, so each file does not necessarily represent one cell.
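To illustrate the save pattern (a simplified sketch, not the actual Geneformer implementation; the helper and numbering are illustrative, and real runs also emit a batch-1 file):

```python
import pickle

# Simplified sketch of why one cell can span several output files:
# genes are processed in chunks of the forward batch size, and each
# chunk's results are written to its own pickle file.
FORWARD_BATCH_SIZE = 120  # illustrative; matches the file size observed above

def perturb_and_score(gene_token):
    # Placeholder for the real perturbation + embedding-shift computation.
    return 0.0

def save_in_batches(cell_index, gene_tokens, output_dir="in_silico_output"):
    for batch_num, start in enumerate(range(0, len(gene_tokens), FORWARD_BATCH_SIZE)):
        chunk = gene_tokens[start:start + FORWARD_BATCH_SIZE]
        results = {token: perturb_and_score(token) for token in chunk}
        with open(f"{output_dir}/cell_embs_{cell_index}batch{batch_num}.pickle", "wb") as f:
            pickle.dump(results, f)
```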

ctheodoris changed discussion status to closed

Moreover, I didn’t find any filtering logic in the code that would limit which genes are perturbed per cell, so I find this result puzzling. As far as I understand, each intermediate pickle file corresponds to a single cell, and all genes expressed in that cell should have been perturbed and recorded individually.


Thanks for the clarification.

I double-checked my setup: I set genes_to_perturb="all" and did not apply any filtering. I perturbed 10,000 cells in the start state, and I noticed that the number of intermediate pickle files is also close to 10,000, which initially led me to think that each file corresponds to one cell.


To give a concrete example, I'm running the perturbations on different GPUs, each assigned a range of cell indices. GPU 0 was assigned 1,250 cells to perturb, yet it generated 2,696 intermediate pickle files, each containing perturbation results for exactly 120 genes. That means, on average, each cell had only around 260 genes perturbed (2,696 × 120 / 1,250), which doesn't align with the actual number of genes expressed per cell in my dataset. I'm trying to understand where this discrepancy comes from.

Hi, the pickle files for each cell are identified by the cell index between "cell_embs_" and "batch" in the filename. Each cell can have multiple pickle files (batch-1, batch0, batch1, ...). The genes in all of a cell's pickle files combined are the genes perturbed for that cell. Are you only looking at one of the batch files for each cell (I'm guessing batch-1)?

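For example, you can group the files by cell index to see all the batches belonging to each cell (a sketch; the directory name is a placeholder, and the regex follows the naming convention described above):

```python
import re
from collections import defaultdict
from pathlib import Path

# Group the output pickles by cell index using the naming convention
# cell_embs_<cell_index>batch<batch_number> described above.
pattern = re.compile(r"cell_embs_(\d+)batch(-?\d+)")
files_per_cell = defaultdict(list)

for path in Path("in_silico_output").iterdir():  # placeholder directory name
    match = pattern.search(path.name)
    if match:
        cell_index, batch_num = match.groups()
        files_per_cell[int(cell_index)].append(int(batch_num))

for cell_index in sorted(files_per_cell):
    print(cell_index, sorted(files_per_cell[cell_index]))  # e.g. 0 [-1, 0, 1]
```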

The reason I initially dug into this is that I noticed a mismatch between the n_detections in the in silico stats results and the actual number of cells expressing each gene in my input (based on token presence). This discrepancy doesn't occur with the 30M model, where n_detections aligns well with the actual token-level expression.

So to investigate further, I randomly checked the pickle files and now understand that each cell can have multiple batches. However, the total number of batch files still seems insufficient, which leads me to suspect that some cells may not have been perturbed at all — again, something I didn’t observe in the 30M perturbation runs.

Thank you for following up.

One further suggestion is to test perturbing a single cell and then check that the genes in the intermediate pickle files match the N_Detections output. Then perturb two single cells and check the same. If there is an issue with either of those, please update this discussion and also send the code you are using and the outputs you find problematic.
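A minimal sketch of that test, following the pattern in the repo's example notebook (paths and parameter values below are placeholders; keep your own cell_states_to_model and other settings identical to your full run):

```python
from geneformer import InSilicoPerturber

# Perturb exactly one cell, then check that the genes across that cell's
# intermediate pickle files match N_Detections in the stats output.
# Paths and values are placeholders; reuse the settings from your full run
# (cell_states_to_model etc. omitted here for brevity).
isp = InSilicoPerturber(
    perturb_type="delete",
    genes_to_perturb="all",
    emb_mode="cell",
    max_ncells=1,             # repeat with max_ncells=2 as a second check
    forward_batch_size=120,   # placeholder; set per your memory allowance
    nproc=4,
)
isp.perturb_data(
    "path/to/model",          # placeholder paths
    "path/to/tokenized_data",
    "path/to/output_dir",
    "single_cell_test",
)
```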

Another aspect to confirm is that the correct dictionary is used for tokenizing the data as well as for each subsequent step (including fine-tuning, extracting embeddings, in silico perturbation, the stats module, etc.). Since the default dictionaries should be aligned with the new model, the only place this could go awry is if you are using data you previously tokenized for the 30M model with that dictionary now with the new model, though I assume this is not the case.
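A quick consistency check along those lines (a sketch; the filenames and paths are placeholders, and it assumes the standard "input_ids" field of the tokenized dataset):

```python
import pickle
from datasets import load_from_disk

# Sanity check: verify that the token dictionary used for tokenization covers
# every token in the dataset, since the 30M and 95M models ship different
# dictionaries. Filenames/paths below are placeholders.
with open("token_dictionary.pkl", "rb") as f:  # use the dictionary for your model version
    token_dict = pickle.load(f)

data = load_from_disk("path/to/tokenized_data")  # placeholder path
max_token = max(max(example["input_ids"]) for example in data)
print(f"vocab size: {len(token_dict)}, max token id in data: {max_token}")
assert max_token < len(token_dict), "data was tokenized with a different (larger) dictionary"
```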

To answer the second part of your question: you can count how many cells were actually perturbed from the number of unique cell indices in the pickle filenames. Here's a bash one-liner to do it.

ls dict_cell_embs | sed -E 's/.*_([0-9]+)batch.*/\1/' | sort -u | wc -l
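The sed expression keeps only the digits between the last underscore and "batch" in each filename (the cell index), then sort -u deduplicates them and wc -l counts the distinct cells.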


Thanks for your suggestion, I will try perturbing cells one by one. I have ensured that my dictionaries, inputs, and all other steps are correct.


Hi, based on the file names, it seems that each of my cells corresponds to only one batch file: I don't see cases where the same cell index appears with multiple batch numbers. So it looks like most of my cells have just one corresponding pickle file rather than being split across multiple batches.

In total, I'm perturbing 1,250 cells, but the highest cell index in the pickle files is 1347, and the total number of pickle files is 2,696, which exactly matches 1,348 cells each having two pickle files: one with batch-1 and one with batchXX. So it looks like the perturbation was applied to 1,348 cells in total, and for each cell, only a single batch of ~120 genes was perturbed.



Alternatively, if the number before "batch" doesn't represent the cell index, and all genes in each of the 1,250 cells were in fact perturbed, then 2,696 pickle files would definitely not be enough to hold all the results, especially since each file only includes about 120 perturbations.
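As a rough sanity check on the file counts (illustrative numbers only; genes_per_cell below is an assumed average, not measured from my data):

```python
import math

# Rough estimate of how many pickle files a full "all genes" run should produce.
n_cells = 1250
genes_per_cell = 2048   # illustrative average; my cells express far more than 120 genes
genes_per_file = 120    # observed file size

expected_files = n_cells * math.ceil(genes_per_cell / genes_per_file)
print(expected_files)   # 22500, far more than the 2,696 files observed
```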
