Jan 31

How did you define the 47/53 cell subtypes? In the cell subtype column 'obs cl295v11SubFull' of the GSE178341
dataset I can see 87 unique cell subtypes. Could you please clarify how you chose the 47/53 subtypes here?

@ctheodoris

ctheodoris

Owner Jan 31

Thank you for your question. Some of the cell types were consolidated to match other data sources we are using for ongoing work. For example, all B cell subtypes were consolidated into B cells.
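
For readers following along, the kind of consolidation described in this reply can be sketched as a plain label mapping. This is only an illustration: the B cell rule is the single example given above, the name `CONSOLIDATION` and the helper `consolidate` are hypothetical, and the full mapping used for the paper is not stated in this thread.

```python
# Hypothetical mapping: only the B cell rule comes from the reply above.
# Labels without an entry pass through unchanged.
CONSOLIDATION = {
    "cB1 (B IGD+IgM+)": "B cells",
    "cB2 (B GC-like)": "B cells",
    "cB3 (B CD40+ GC-like)": "B cells",
}


def consolidate(labels, mapping=CONSOLIDATION):
    """Replace each label by its consolidated name, keeping unmapped labels."""
    return [mapping.get(label, label) for label in labels]


labels = [
    "cB1 (B IGD+IgM+)",
    "cB2 (B GC-like)",
    "cB3 (B CD40+ GC-like)",
    "cMA01 (Mast)",
    "cE10 (Tuft)",
]
merged = consolidate(labels)
print(sorted(set(merged)))
```

On an AnnData object the same dictionary could be applied to the label column with pandas `Series.replace`, assuming the column holds the full label strings.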

ctheodoris changed discussion status to closed Jan 31

shakeel604

Feb 3

This is not clear in the paper.
Could you please explain it here?

@ctheodoris

shakeel604

Feb 3

edited Feb 3

This is not clear in the paper.
Could you please explain it here?

Even after consolidating all B subtypes into B cells, I am still left with 84 subtypes.

I can see only 3 B cell types, cB1, cB2 and cB3, in the column 'cl295v11SubShort':

anndata.obs.cl295v11SubShort.unique()

{'cTNI15', 'cM05', 'cTNI22', 'cB2', 'cE05', 'cTNI08', 'cTNI19', 'cTNI11', 'cS11', 'cE02', 'cS05', 'cP3', 'cS04', 'cTNI06', 'cM10', 'cTNI12', 'cP1', 'cTNI10', 'cTNI05', 'cM06', 'cS07', 'cE09', 'cE04', 'cE06', 'cTNI24', 'cTNI14', 'cS26', 'cTNI26', 'cTNI02', 'cS23', 'cS15', 'cTNI18', 'cS17', 'cS08', 'cB3', 'cS14', 'cM09', 'cE07', 'cS30', 'cTNI20', 'cM04', 'cM03', 'cS19', 'cM02', 'cS31', 'cS28', 'cS27', 'cS03', 'cTNI17', 'cE01', 'cTNI13', 'cM01', 'cS20', 'cP2', 'cS29', 'cS13', 'cE11', 'cTNI16', 'cS09', 'cS25', 'cS10', 'cS02', 'cTNI25', 'cM08', 'cTNI04', 'cM07', 'cS16', 'cS21', 'cE08', 'cMA01', 'cTNI01', 'cE10', 'cTNI07', 'cTNI09', 'cS01', 'cS06', 'cE03', 'cTNI23', 'cS18', 'cS22', 'cS12', 'cB1', 'cS33', 'cS32', 'cTNI21', 'cS24', 'cTNI03'}
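
A quick way to see how the 87 codes split by lineage is to count them per letter prefix. The sketch below rebuilds the codes from the numbering ranges visible in the set above (an assumption about those ranges, not a read of the actual file), and the helper `lineage` is hypothetical.

```python
import re
from collections import Counter

# (count, zero-pad width) per prefix, inferred from the codes printed above.
RANGES = {
    "cB": (3, 1), "cE": (11, 2), "cM": (10, 2), "cMA": (1, 2),
    "cP": (3, 1), "cS": (33, 2), "cTNI": (26, 2),
}
codes = [
    f"{prefix}{i:0{width}d}"
    for prefix, (n, width) in RANGES.items()
    for i in range(1, n + 1)
]


def lineage(code):
    """Return the letter part of a short code, e.g. 'cTNI05' -> 'TNI'."""
    match = re.fullmatch(r"c([A-Z]+)(\d+)", code)
    if match is None:
        raise ValueError(f"unexpected code: {code}")
    return match.group(1)


counts = Counter(lineage(code) for code in codes)
print(len(codes))    # 87
print(dict(counts))  # {'B': 3, 'E': 11, 'M': 10, 'MA': 1, 'P': 3, 'S': 33, 'TNI': 26}

# Merging only the B codes into a single label removes 3 labels and adds 1.
after_b_merge = len(codes) - counts["B"] + 1
print(after_b_merge)  # 85
```

By this count, merging the B codes alone cannot bring the total down to 47 or 53, so other lineages must have been merged as well.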

@ctheodoris

ctheodoris

Owner Feb 3

This is one example where cell types were consolidated. Other cell types were also consolidated. This discussion page is for discussions related to the code on this repository. If you have further questions regarding the biological analyses in the paper, please email. Thank you!

shakeel604

Feb 14

@ctheodoris

Can you please clarify this? I am still waiting for your reply.
How did you come up with the 47/53 classes?
In the original dataset I can see that the tumour patients have 87 unique cell subtypes, as mentioned above and listed below.

cell_subtypes from the public dataset:
['cE01 (Stem/TA-like)',
'cE03 (Stem/TA-like prolif)',
'cTNI05 (CD4+ IL17+)',
'cTNI04 (CD4+ IL7R+CCL5+)',
'cE05 (Enterocyte 2)',
'cS17 (Pericyte)',
'cM02 (Macrophage-like)',
'cE04 (Enterocyte 1)',
'cE02 (Stem/TA-like/Immature Goblet)',
'cTNI09 (CD4+ Treg prolif)',
'cE08 (Goblet)',
'cM09 (mregDC)',
'cTNI22 (cTNI22)',
'cS13 (Endo venous-like)',
'cM01 (Monocyte)',
'cS27 (CXCL14+ CAF)',
'cTNI11 (CD8+GZMK+)',
'cE09 (Best4)',
'cTNI08 (CD4+ Treg)',
'cM04 (DC2)',
'cTNI24 (NK GZMK+)',
'cB1 (B IGD+IgM+)',
'cMA01 (Mast)',
'cP2 (Plasma IgG)',
'cTNI10 (CD8+ IL7R+)',
'cB2 (B GC-like)',
'cTNI20 (PLZF+ T)',
'cE11 (Enteroendocrine)',
'cS08 (Endo arterial-like)',
'cTNI02 (CD4+ IL7R+SELL+)',
'cTNI13 (CD8+ T IL17+)',
'cM05 (DC2 C1Q+)',
'cB3 (B CD40+ GC-like)',
'cS29 (MMP3+ CAF)',
'cS02 (Endo capillary)',
'cS12 (Endo)',
'cP1 (Plasma IgA)',
'cS15 (Pericyte)',
'cTNI25 (NK XCL1+)',
'cTNI18 (gd-like T PDCD1+)',
'cE06 (Immature Goblet)',
'cTNI17 (gd-like T)',
'cTNI16 (CD8+ CXCL13+ prolif)',
'cS18 (Pericyte)',
'cS04 (Endo)',
'cE10 (Tuft)',
'cTNI23 (NK CD16A+)',
'cM10 (Granulocyte)',
'cM06 (DC IL22RA2)',
'cS11 (Endo proif)',
'cS09 (Endo)',
'cTNI03 (CD4+ IL7R+HSP+)',
'cS19 (Pericyte)',
'cTNI06 (CD4+ TFH)',
'cTNI01 (CD4+ IL7R+)',
'cM03 (DC1)',
'cTNI07 (CD4+ CXCL13+)',
'cS25 (Fibro CCL8+)',
'cS28 (GREM1+ CAF)',
'cE07 (Goblet/Enterocyte)',
'cP3 (Plasma IgG prolif)',
'cS10 (Endo tip cells)',
'cTNI26 (ILC3)',
'cTNI14 (CD8+ CXCL13+)',
'cS32 (Smooth Muscle)',
'cTNI21 (PLZF+ T prolif)',
'cM07 (pDC)',
'cTNI12 (CD8+ IL7R+)',
'cS24 (Fibro BMP-producing)',
'cTNI15 (CD8+ CXCL13+ HSP+)',
'cS01 (Endo arterial)',
'cS30 (CAF CCL8 Fibro-like)',
'cM08 (AS-DC)',
'cS20 (Pericyte prolif)',
'cS14 (Endo)',
'cTNI19 (gd-like T prolif)',
'cS06 (Endo lymphatic)',
'cS16 (Pericyte)',
'cS26 (Myofibro)',
'cS31 (CAF stem niche Fibro-like)',
'cS22 (Fibro stem cell niche)',
'cS33 (Schwann)',
'cS05 (Endo venous)',
'cS03 (Endo capillary)',
'cS23 (Fibro BMP-producing)',
'cS21 (Fibro stem cell niche)',
'cS07 (Endo capillary-like)']
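
Several of the labels above share the same description (for example, five codes are all annotated "Pericyte"). One possible consolidation, shown here only as a guess and not as the method used in the paper, is to merge codes whose descriptions are identical. The sketch uses a subset of the labels listed above; `split_label` is a hypothetical helper.

```python
import re
from collections import defaultdict

# A subset of the full labels listed above.
labels = [
    "cS15 (Pericyte)", "cS16 (Pericyte)", "cS17 (Pericyte)",
    "cS18 (Pericyte)", "cS19 (Pericyte)", "cS20 (Pericyte prolif)",
    "cS04 (Endo)", "cS09 (Endo)", "cS12 (Endo)", "cS14 (Endo)",
    "cS21 (Fibro stem cell niche)", "cS22 (Fibro stem cell niche)",
]


def split_label(label):
    """Split 'cS15 (Pericyte)' into ('cS15', 'Pericyte')."""
    match = re.fullmatch(r"(c[A-Z]+\d+) \((.+)\)", label)
    if match is None:
        raise ValueError(f"unexpected label: {label}")
    return match.group(1), match.group(2)


# Group the short codes by their shared description.
by_description = defaultdict(list)
for label in labels:
    code, description = split_label(label)
    by_description[description].append(code)

for description, codes in by_description.items():
    print(f"{description}: {codes}")
```

In this subset, 12 labels collapse to 4 groups. Whether the consolidated classes in the paper follow this rule is not stated in the thread.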

You clarified in one correspondence that you treated all epithelial cells (11 subtypes) as malignant. How did you subgroup the others (46)? This is really confusing. Even if you treat these epithelial cells as one class, the sum should be 46 + 11, not 53.
Or did you group the epithelial cells as follows?

Epi_sub_group 1: 'cE01 (Stem/TA-like)', 'cE03 (Stem/TA-like prolif)', 'cE02 (Stem/TA-like/Immature Goblet)'
Epi_sub_group 2: 'cE02 (Stem/TA-like/Immature Goblet)', 'cE08 (Goblet)', 'cE07 (Goblet/Enterocyte)', 'cE06 (Immature Goblet)'
Epi_sub_group 3: 'cE05 (Enterocyte 2)'
Epi_sub_group 4: 'cE04 (Enterocyte 1)'
Epi_sub_group 5: 'cE09 (Best4)'
Epi_sub_group 6: 'cE11 (Enteroendocrine)'
Epi_sub_group 7: 'cE10 (Tuft)'

Can you please clarify this
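
A proposed grouping like the one above can be checked for overlaps and gaps before it is used as a label set. The sketch below encodes the seven groups by short code only; the assumption that the epithelial codes run from cE01 to cE11 is taken from the subtype list earlier in this post.

```python
from collections import Counter

# The seven groups proposed above, written with short codes only.
proposed = {
    1: ["cE01", "cE03", "cE02"],
    2: ["cE02", "cE08", "cE07", "cE06"],
    3: ["cE05"],
    4: ["cE04"],
    5: ["cE09"],
    6: ["cE11"],
    7: ["cE10"],
}
# Assumed full set of epithelial codes, cE01 through cE11.
all_epithelial = {f"cE{i:02d}" for i in range(1, 12)}

assigned = Counter(code for members in proposed.values() for code in members)
duplicated = sorted(code for code, n in assigned.items() if n > 1)
missing = sorted(all_epithelial - set(assigned))

print("assigned to more than one group:", duplicated)  # ['cE02']
print("not assigned to any group:", missing)           # []
```

As written, cE02 sits in both group 1 and group 2, so it would need a single home before the groups could serve as classifier classes.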

syc821

Apr 7

@shakeel604
I also have questions about how the subtypes were divided into categories (47/53). Have you solved this problem, and can you share any ideas on the division?

ctheodoris

Owner Apr 7

Thanks for your question. This discussion page is for discussions related to the code on this repository. If you have further questions regarding the biological analyses in the paper, please email. Thank you!

syc821

about 1 month ago

@ctheodoris
I have a question about a confusing result. I ran a perturbation experiment on T cells, randomly selecting 1000 cells in the starting state MMRp. However, the calculated cosine shifts were very different from the supplementary data provided in the article. For example, the maximum cosine shift among genes with a statistically significant shift was only 0.021, while the maximum in the article's data was 0.1291. Is it meaningful to screen genes with such a small shift? My code is shown below:

from geneformer import InSilicoPerturber, EmbExtractor, InSilicoPerturberStats

# Define paths
model_directory = "./Geneformer/examples/model_path/colon/GeneformerMultiTask"
input_data_file = "./Geneformer/examples/data/colon/T_CD8_MMRd_MMRp.dataset"
output_directory = "./Geneformer/examples/delete_result/colon"
output_prefix = "mtl_perturbationTcells"

# Define parameters
perturb_type = "overexpress"  # or "delete"

# Define cell states to model
cell_states_to_model = {
    "state_key": "MMRStatus",
    "start_state": "MMRp",
    "goal_state": "MMRd",
}

# Define filter data
filter_data_dict = {
    "MMRStatus": ["MMRp", "MMRd"]
}

# Initialize EmbExtractor
embex = EmbExtractor(
    model_type="CellClassifier",
    num_classes=2,
    filter_data=filter_data_dict,
    max_ncells=22000,  # Number of cells to extract embeddings for
    emb_layer=0,  # 0 = last layer (-1 would be the second to last layer)
    emb_mode="cls",
    summary_stat="exact_mean",
    forward_batch_size=8,  # Adjust based on available GPU memory
    nproc=4,
)

# Extract state embeddings
state_embs_dict = embex.get_state_embs(
    cell_states_to_model,
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_directory,
    output_prefix=output_prefix,
)

input_data_file = "./Geneformer/examples/data/colon/T_CD8_MMRd_MMRp.dataset"
output_directory = "./Geneformer/examples/delete_result/colon/perturb_dataTcell"
output_prefix = "mtl_perturbation_Quantized"

# Initialize InSilicoPerturber
isp = InSilicoPerturber(
    perturb_type=perturb_type,
    genes_to_perturb="all",  # Perturb all genes
    model_type="MTLCellClassifier-Quantized",  # Use quantized MTL model
    emb_mode="cls",  # Use CLS token embedding
    cell_states_to_model=cell_states_to_model,
    state_embs_dict=state_embs_dict,
    max_ncells=1000,  # Number of cells to perturb (larger number increases power)
    emb_layer=0,
    forward_batch_size=8,  # Adjust based on available GPU memory
    nproc=16,
)

# Run perturbation and output intermediate files
isp.perturb_data(
    model_directory=model_directory,
    input_data_file=input_data_file,
    output_directory=output_directory,
    output_prefix=output_prefix,
)

# Initialize InSilicoPerturberStats
ispstats = InSilicoPerturberStats(
    mode="goal_state_shift",
    genes_perturbed="all",
    combos=0,
    anchor_gene=None,
    cell_states_to_model=cell_states_to_model,
)
input_data_file = "./Geneformer/examples/data/colon/T_CD8_MMRd_MMRp.dataset"
output_directory = "./Geneformer/examples/delete_result/colon/perturb_dataTcell"
output_prefix = "T_CD8_MMRd_MMRp"

# Process stats and output final .csv
ispstats.get_stats(
    input_data_file,
    None,
    output_directory,
    output_prefix,
)

ctheodoris

Owner 29 days ago

Thank you for your question. When we perform the in silico perturbations for T cells, we are targeting shifts towards the activated T cell state in MMRd. It is unclear whether you are also subsetting to the activated T cell state. Also, if your model was fine-tuned differently, this could impact the in silico perturbations. In terms of magnitude, we expect the shift to be small in absolute value, since the comparison reflects a change in a single gene out of the 4096 represented in a given cell. More important than the absolute magnitude is the statistical significance of the shift due to a given gene perturbation compared to the shifts from perturbing random genes, which is also relevant to the direction of the shift being towards a given goal state.
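
The point about magnitude versus significance can be illustrated with a toy calculation. This is a minimal sketch with made-up vectors and a simulated null distribution; it is not Geneformer's implementation or its statistical test, and the helpers `shift_toward_goal` and `empirical_p` are hypothetical names.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shift_toward_goal(original, perturbed, goal):
    """Change in cosine similarity to the goal state after a perturbation."""
    return cosine(perturbed, goal) - cosine(original, goal)


def empirical_p(observed, null_shifts):
    """One-sided empirical p-value: how often a null shift is at least as large."""
    at_least = sum(1 for s in null_shifts if s >= observed)
    return (1 + at_least) / (1 + len(null_shifts))


# Made-up embeddings: the perturbation nudges the cell slightly toward the goal.
original = [1.0, 0.0, 0.0]
goal = [0.0, 1.0, 0.0]
perturbed = [1.0, 0.02, 0.0]
observed = shift_toward_goal(original, perturbed, goal)

# Simulated shifts standing in for "perturbing random genes" (an assumption).
random.seed(0)
null_shifts = [random.gauss(0.0, 0.005) for _ in range(999)]

print(f"observed shift: {observed:.4f}")
print(f"empirical p-value: {empirical_p(observed, null_shifts):.4f}")
```

Here the observed shift is only about 0.02 in absolute terms, yet it stands well outside the simulated null spread, which is the sense in which a small shift can still be meaningful.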

ctheodoris
/

Geneformer

Cell SubTypes 47/ 53 classes : How?
