Cell subtypes 47/53 classes: how?
How did you define the 47/53 cell subtypes? In the obs column 'cl295v11SubFull' of the GSE178341
dataset, I can see 87 unique cell subtypes. Could you please clarify how you chose the 47/53 subtypes here?
Thank you for your question. Some of the cell types were consolidated to match other data sources we are using for ongoing work. For example, all B cell subtypes were consolidated into B cells.
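As a rough illustration of what such a consolidation might look like, here is a minimal sketch that collapses subtype codes into coarser classes. The mapping is hypothetical: only the B cell merge is stated in this thread, and `CONSOLIDATION` and `consolidate` are illustrative names, not part of the repository.

```python
# Hypothetical consolidation map: only the B cell merge is confirmed in this thread.
CONSOLIDATION = {
    "cB1": "B cells",
    "cB2": "B cells",
    "cB3": "B cells",
}


def consolidate(label):
    """Return the consolidated class for a subtype code, or the code itself if unmapped."""
    return CONSOLIDATION.get(label, label)


# Toy input: three B cell subtypes plus two unmapped subtypes.
labels = ["cB1", "cB2", "cB3", "cE01", "cM01"]
consolidated = [consolidate(x) for x in labels]
n_classes = len(set(consolidated))
print(consolidated, n_classes)
```

With a pandas column such as anndata.obs.cl295v11SubShort, the same function could be applied with `Series.map(consolidate)`. Any further merges would need to be added to the dictionary once the authors' grouping is known.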
This is not clear in the paper.
Could you please explain it here?
Even after consolidating all B subtypes into B cells, I am still left with 84 subtypes.
I can see only 3 B cell types, cB1, cB2 and cB3, in column 'cl295v11SubShort':
anndata.obs.cl295v11SubShort.unique()
{'cTNI15', 'cM05', 'cTNI22', 'cB2', 'cE05', 'cTNI08', 'cTNI19', 'cTNI11', 'cS11', 'cE02', 'cS05', 'cP3', 'cS04', 'cTNI06', 'cM10', 'cTNI12', 'cP1', 'cTNI10', 'cTNI05', 'cM06', 'cS07', 'cE09', 'cE04', 'cE06', 'cTNI24', 'cTNI14', 'cS26', 'cTNI26', 'cTNI02', 'cS23', 'cS15', 'cTNI18', 'cS17', 'cS08', 'cB3', 'cS14', 'cM09', 'cE07', 'cS30', 'cTNI20', 'cM04', 'cM03', 'cS19', 'cM02', 'cS31', 'cS28', 'cS27', 'cS03', 'cTNI17', 'cE01', 'cTNI13', 'cM01', 'cS20', 'cP2', 'cS29', 'cS13', 'cE11', 'cTNI16', 'cS09', 'cS25', 'cS10', 'cS02', 'cTNI25', 'cM08', 'cTNI04', 'cM07', 'cS16', 'cS21', 'cE08', 'cMA01', 'cTNI01', 'cE10', 'cTNI07', 'cTNI09', 'cS01', 'cS06', 'cE03', 'cTNI23', 'cS18', 'cS22', 'cS12', 'cB1', 'cS33', 'cS32', 'cTNI21', 'cS24', 'cTNI03'}
This is one example where cell types were consolidated. Other cell types were also consolidated. This discussion page is for discussions related to the code on this repository. If you have further questions regarding the biological analyses in the paper, please email. Thank you!
Can you please clarify this? I am still waiting for your reply.
How did you come up with 47/53 classes?
In the original dataset, I can see that the tumour patients have 87 unique cell subtypes, as listed above and below.
cell_subtypes from public dataset
['cE01 (Stem/TA-like)',
'cE03 (Stem/TA-like prolif)',
'cTNI05 (CD4+ IL17+)',
'cTNI04 (CD4+ IL7R+CCL5+)',
'cE05 (Enterocyte 2)',
'cS17 (Pericyte)',
'cM02 (Macrophage-like)',
'cE04 (Enterocyte 1)',
'cE02 (Stem/TA-like/Immature Goblet)',
'cTNI09 (CD4+ Treg prolif)',
'cE08 (Goblet)',
'cM09 (mregDC)',
'cTNI22 (cTNI22)',
'cS13 (Endo venous-like)',
'cM01 (Monocyte)',
'cS27 (CXCL14+ CAF)',
'cTNI11 (CD8+GZMK+)',
'cE09 (Best4)',
'cTNI08 (CD4+ Treg)',
'cM04 (DC2)',
'cTNI24 (NK GZMK+)',
'cB1 (B IGD+IgM+)',
'cMA01 (Mast)',
'cP2 (Plasma IgG)',
'cTNI10 (CD8+ IL7R+)',
'cB2 (B GC-like)',
'cTNI20 (PLZF+ T)',
'cE11 (Enteroendocrine)',
'cS08 (Endo arterial-like)',
'cTNI02 (CD4+ IL7R+SELL+)',
'cTNI13 (CD8+ T IL17+)',
'cM05 (DC2 C1Q+)',
'cB3 (B CD40+ GC-like)',
'cS29 (MMP3+ CAF)',
'cS02 (Endo capillary)',
'cS12 (Endo)',
'cP1 (Plasma IgA)',
'cS15 (Pericyte)',
'cTNI25 (NK XCL1+)',
'cTNI18 (gd-like T PDCD1+)',
'cE06 (Immature Goblet)',
'cTNI17 (gd-like T)',
'cTNI16 (CD8+ CXCL13+ prolif)',
'cS18 (Pericyte)',
'cS04 (Endo)',
'cE10 (Tuft)',
'cTNI23 (NK CD16A+)',
'cM10 (Granulocyte)',
'cM06 (DC IL22RA2)',
'cS11 (Endo proif)',
'cS09 (Endo)',
'cTNI03 (CD4+ IL7R+HSP+)',
'cS19 (Pericyte)',
'cTNI06 (CD4+ TFH)',
'cTNI01 (CD4+ IL7R+)',
'cM03 (DC1)',
'cTNI07 (CD4+ CXCL13+)',
'cS25 (Fibro CCL8+)',
'cS28 (GREM1+ CAF)',
'cE07 (Goblet/Enterocyte)',
'cP3 (Plasma IgG prolif)',
'cS10 (Endo tip cells)',
'cTNI26 (ILC3)',
'cTNI14 (CD8+ CXCL13+)',
'cS32 (Smooth Muscle)',
'cTNI21 (PLZF+ T prolif)',
'cM07 (pDC)',
'cTNI12 (CD8+ IL7R+)',
'cS24 (Fibro BMP-producing)',
'cTNI15 (CD8+ CXCL13+ HSP+)',
'cS01 (Endo arterial)',
'cS30 (CAF CCL8 Fibro-like)',
'cM08 (AS-DC)',
'cS20 (Pericyte prolif)',
'cS14 (Endo)',
'cTNI19 (gd-like T prolif)',
'cS06 (Endo lymphatic)',
'cS16 (Pericyte)',
'cS26 (Myofibro)',
'cS31 (CAF stem niche Fibro-like)',
'cS22 (Fibro stem cell niche)',
'cS33 (Schwann)',
'cS05 (Endo venous)',
'cS03 (Endo capillary)',
'cS23 (Fibro BMP-producing)',
'cS21 (Fibro stem cell niche)',
'cS07 (Endo capillary-like)']
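To make the counting easier to follow, here is a small sketch that tallies unique labels by the lineage prefix of their code (the letters before the digits). The helper names `lineage_prefix` and `tally_by_prefix` are made up for illustration, and grouping by prefix is only one possible reading of the labels, not the authors' stated method.

```python
import re
from collections import Counter


def lineage_prefix(label):
    """Extract the leading letters of a subtype label, e.g. 'cTNI05 (CD4+ IL17+)' -> 'cTNI'."""
    match = re.match(r"[A-Za-z]+", label)
    if match is None:
        raise ValueError(f"label has no alphabetic prefix: {label!r}")
    return match.group(0)


def tally_by_prefix(labels):
    """Count unique labels per lineage prefix (duplicates are ignored)."""
    return Counter(lineage_prefix(x) for x in set(labels))


# Small sample taken from the list above; 'cB1' is repeated on purpose.
sample = [
    "cB1 (B IGD+IgM+)",
    "cB2 (B GC-like)",
    "cE01 (Stem/TA-like)",
    "cMA01 (Mast)",
    "cM01 (Monocyte)",
    "cB1 (B IGD+IgM+)",
]
counts = tally_by_prefix(sample)
print(dict(counts))
```

Applied to the full list of 87 labels above, this grouping gives 26 cTNI, 33 cS, 11 cE, 10 cM, 3 cB, 3 cP and 1 cMA.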
You clarified in one correspondence that you treated all epithelial cells (11 subtypes) as malignant. How did you group the others (46)? This is really confusing. Even if you treat these epithelial cells as one class, the sum should be 46 + 11, not 53.
Or did you group the Epi cells as follows?
Epi_sub_group 1: 'cE01 (Stem/TA-like)', 'cE03 (Stem/TA-like prolif)', 'cE02 (Stem/TA-like/Immature Goblet)'
Epi_sub_group 2: 'cE02 (Stem/TA-like/Immature Goblet)', 'cE08 (Goblet)', 'cE07 (Goblet/Enterocyte)', 'cE06 (Immature Goblet)'
Epi_sub_group 3: 'cE05 (Enterocyte 2)'
Epi_sub_group 4: 'cE04 (Enterocyte 1)'
Epi_sub_group 5: 'cE09 (Best4)'
Epi_sub_group 6: 'cE11 (Enteroendocrine)'
Epi_sub_group 7: 'cE10 (Tuft)'
Can you please clarify this?
@shakeel604
I also have questions about how the subtypes were divided into 47/53 categories. Have you solved this problem, and can you share your ideas on the division?
Thanks for your question. This discussion page is for discussions related to the code on this repository. If you have further questions regarding the biological analyses in the paper, please email. Thank you!
@ctheodoris
I have a question about a confusing result. I conducted a perturbation experiment on T cells, randomly selecting 1000 cells in the starting state MMRp. However, the calculated cosine shift was very different from the supplementary data provided in the article. For example, the maximum cosine shift among genes with a statistically significant shift was only 0.021 (while the maximum in the data provided in the article was 0.1291). Is it meaningful to screen genes with such a small shift? My code is shown below:
from geneformer import InSilicoPerturber, EmbExtractor, InSilicoPerturberStats
# Define paths
model_directory = "./Geneformer/examples/model_path/colon/GeneformerMultiTask"
input_data_file = "./Geneformer/examples/data/colon/T_CD8_MMRd_MMRp.dataset"
output_directory = "./Geneformer/examples/delete_result/colon"
output_prefix = "mtl_perturbationTcells"
# Define parameters
perturb_type = "overexpress"  # or "delete"
# Define cell states to model
cell_states_to_model = {
    "state_key": "MMRStatus",
    "start_state": "MMRp",
    "goal_state": "MMRd",
}
# Define filter data
filter_data_dict = {
    "MMRStatus": ["MMRp", "MMRd"]
}
# Initialize EmbExtractor
embex = EmbExtractor(
    model_type="CellClassifier",
    num_classes=2,
    filter_data=filter_data_dict,
    max_ncells=22000,  # Number of cells to extract embeddings for
    emb_layer=0,  # 0 = last layer (-1 would be the second to last layer)
    emb_mode="cls",
    summary_stat="exact_mean",
    forward_batch_size=8,  # Adjust based on available GPU memory
    nproc=4,
)
# Extract state embeddings
state_embs_dict = embex.get_state_embs(
    cell_states_to_model,
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_directory,
    output_prefix=output_prefix,
)
input_data_file = "./Geneformer/examples/data/colon/T_CD8_MMRd_MMRp.dataset"
output_directory = "./Geneformer/examples/delete_result/colon/perturb_dataTcell"
output_prefix = "mtl_perturbation_Quantized"
# Initialize InSilicoPerturber
isp = InSilicoPerturber(
    perturb_type=perturb_type,
    genes_to_perturb="all",  # Perturb all genes
    model_type="MTLCellClassifier-Quantized",  # Use quantized MTL model
    emb_mode="cls",  # Use CLS token embedding
    cell_states_to_model=cell_states_to_model,
    state_embs_dict=state_embs_dict,
    max_ncells=1000,  # Number of cells to perturb (larger number increases power)
    emb_layer=0,
    forward_batch_size=8,  # Adjust based on available GPU memory
    nproc=16,
)
# Run perturbation and output intermediate files
isp.perturb_data(
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_directory,
    output_prefix=output_prefix,
)
# Initialize InSilicoPerturberStats
ispstats = InSilicoPerturberStats(
    mode="goal_state_shift",
    genes_perturbed="all",
    combos=0,
    anchor_gene=None,
    cell_states_to_model=cell_states_to_model,
)
input_data_file = "./Geneformer/examples/data/colon/T_CD8_MMRd_MMRp.dataset"
output_directory = "./Geneformer/examples/delete_result/colon/perturb_dataTcell"
output_prefix = "T_CD8_MMRd_MMRp"
# Process stats and output final .csv
ispstats.get_stats(
    input_data_file,
    None,
    output_directory,
    output_prefix,
)
Thank you for your question. When we perform the in silico perturbations for T cells, we are targeting shifts towards the activated T cell state in MMRd. It is unclear whether you are also subsetting to the activated T cell state. Also, if your model was fine-tuned differently, this could impact the in silico perturbations. In terms of magnitude, we expect the shift to be small in absolute value, since the comparison reflects a change in a single gene out of the 4096 represented in a given cell. More important than the absolute magnitude is the statistical significance of the shift caused by a given gene perturbation compared to the shifts from perturbing random genes, which also bears on whether the shift is directed towards the given goal state.
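To make the two quantities in this reply concrete, here is a minimal sketch, not Geneformer's implementation. It assumes the shift is measured as the change in cosine similarity to the goal state embedding, and it uses a simple empirical p-value against random-gene shifts as a stand-in for the library's statistics, which are computed by `InSilicoPerturberStats`. The function names and toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_shift(original, perturbed, goal):
    """Positive values mean the perturbation moved the cell toward the goal state."""
    return cosine(perturbed, goal) - cosine(original, goal)


def empirical_p(gene_shift, random_shifts):
    """Share of random-gene shifts at least as large as gene_shift (with +1 smoothing)."""
    n_at_least = sum(1 for s in random_shifts if s >= gene_shift)
    return (n_at_least + 1) / (len(random_shifts) + 1)


# Toy 2D embeddings: the start state lies along x, the goal state along y.
original = [1.0, 0.0]
goal = [0.0, 1.0]
perturbed = [1.0, 0.1]  # a small nudge toward the goal

shift = cosine_shift(original, perturbed, goal)

# Hypothetical null distribution: shifts from perturbing 100 random genes.
random_shifts = [0.001 * i for i in range(-50, 50)]
p_value = empirical_p(shift, random_shifts)
print(shift, p_value)
```

The point of the toy example is that a shift can be small in absolute terms and still stand out clearly against the random-gene distribution, which is the comparison emphasized above.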