CellReasoner: A reasoning-enhanced large language model for cell type annotation 🧬🧠





🔬 Key Highlights

  • Only a few expert-level reasoning samples are needed to activate reasoning in a 7B LLM.
  • CellReasoner achieves expert-level interpretability and zero-/few-shot generalization.
  • Demonstrated superior performance across various scRNA-seq and scATAC-seq datasets.
  • Compatible with marker-by-marker annotation, ontology mapping, and biological reasoning.

🧠 Less data, more reasoning: CellReasoner achieves accurate, interpretable, and scalable cell annotation with minimal supervision.


🔑 Key Results

PDAC dataset

| Model | Score |
|---|---|
| Deepseek-V3 | 0.50 |
| Deepseek-R1 | 0.53 |
| ChatGPT-o3 | 0.58 |
| ChatGPT-4o | 0.63 |
| singleR | 0.68 |
| CellReasoner-7B | 0.73 |
| CellReasoner-32B | 0.74 |

PBMC3K dataset

| Model | Score |
|---|---|
| Deepseek-V3 | 0.52 |
| Deepseek-R1 | 0.52 |
| ChatGPT-4o | 0.76 |
| ChatGPT-o3 | 0.85 |
| singleR | 0.83 |
| CellReasoner-7B | 0.87 |
| CellReasoner-32B | 0.84 |

🧠 Model Zoo

Our CellReasoner models are available on Hugging Face 🤗:

| Model | Backbone | Link |
|---|---|---|
| CellReasoner-7B | Qwen2.5-7B-Instruct | 🤗 |
| CellReasoner-32B | QwQ-32B | 🤗 |

πŸ‹οΈβ€β™‚οΈ Training

We use the LLaMA-Factory framework for fine-tuning. It offers a flexible and efficient pipeline for supervised fine-tuning, LoRA, and multi-stage training strategies.
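As a sketch of how such a run might be configured with LLaMA-Factory (the field names follow LLaMA-Factory's YAML schema, but the dataset name, hyperparameters, and paths below are illustrative assumptions, not the paper's exact training recipe):

```yaml
# Hypothetical LoRA SFT config for LLaMA-Factory; tune values to your setup.
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: cellreasoner_sft        # must be registered in data/dataset_info.json
template: qwen
cutoff_len: 8192
output_dir: saves/cellreasoner-7b-lora
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

Such a config would then be launched with `llamafactory-cli train <config>.yaml`.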


🚀 Usage

🛠️ Step 1: Prepare Conda Environment

Make sure you have a working conda environment with the necessary dependencies installed. We recommend:

conda create -n cellreasoner python=3.11
conda activate cellreasoner
pip install -r requirements.txt

🧪 Step 2: Preprocess Input Data

If your input is in Seurat .rds format, use the R preprocessing script:

Rscript s01.process_rds.R ./demo_data/pbmc_demo.rds ./output/ ./data/ranked_hvg.list

If your input is in AnnData .h5ad format, use the Python script:

python s01.process_h5ad.py \
    --input_file ./demo_data/pbmc_demo.h5ad \
    --output_path ./output_h5ad \
    --ranked_hvg_list ./data/ranked_hvg.list

Both pipelines will generate the following output files:

output/
├── pbmc_demo.h5
└── pbmc_demo.meta.csv
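The two files are meant to line up cell-for-cell: an HDF5 expression matrix over the ranked HVGs, plus a per-cell metadata table. A minimal sketch of that assumed layout (the dataset and column names here are illustrative, not CellReasoner's actual schema):

```python
# Build and sanity-check a toy version of the assumed two-file output:
# an HDF5 matrix (cells x ranked HVGs) and a per-cell metadata CSV.
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

outdir = tempfile.mkdtemp()
h5_path = os.path.join(outdir, "pbmc_demo.h5")
meta_path = os.path.join(outdir, "pbmc_demo.meta.csv")

# Toy data: 3 cells x 4 highly variable genes.
barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
genes = ["CD3D", "MS4A1", "NKG7", "LYZ"]
expr = np.random.default_rng(0).random((3, 4)).astype("float32")

with h5py.File(h5_path, "w") as f:
    f.create_dataset("matrix", data=expr)
    f.create_dataset("barcodes", data=np.array(barcodes, dtype="S"))
    f.create_dataset("genes", data=np.array(genes, dtype="S"))

pd.DataFrame({"barcode": barcodes,
              "n_genes": (expr > 0).sum(axis=1)}).to_csv(meta_path, index=False)

# Sanity check: every cell in the matrix has exactly one metadata row.
with h5py.File(h5_path, "r") as f:
    n_cells = f["matrix"].shape[0]
meta = pd.read_csv(meta_path)
assert len(meta) == n_cells  # 3 cells in both files
```

A check like this (matrix rows vs. metadata rows) is a quick way to catch preprocessing mismatches before building the dataset in Step 3.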

🧱 Step 3: Build Dataset for CellReasoner

Build the model input file using:

python s02.build_dataset.py \
    --h5_path ./output/pbmc_demo.h5 \
    --output_path ./output/ \
    --meta_file_path ./output/pbmc_demo.meta.csv

If your metadata includes cell type labels (for scoring), specify the column name:

python s02.build_dataset.py \
    --h5_path ./output/pbmc_demo.h5 \
    --output_path ./output/ \
    --meta_file_path ./output/pbmc_demo.meta.csv \
    --cell_type_column "seurat_annotations"

This will generate:

output/
└── pbmc_demo_for_CellReasoner.json
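Each record in this JSON is a prompt/response pair for the model. The exact schema is produced by s02.build_dataset.py; the instruction/input/output shape below (LLaMA-Factory style) and its field contents are assumptions for illustration only:

```python
# Hypothetical example of one dataset record; the real keys and prompt text
# come from s02.build_dataset.py and may differ.
import json

record = {
    "instruction": "Annotate the cell type based on the ranked marker genes.",
    "input": "Top-ranked genes: CD3D, CD3E, IL7R, TRAC",
    "output": "CD4+ T cell",  # present only when --cell_type_column is given
}
blob = json.dumps([record], indent=2)  # what a one-cell file might look like
parsed = json.loads(blob)
assert parsed[0]["output"] == "CD4+ T cell"
```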

🤖 Step 4: Run Inference with CellReasoner

python s03.inference.py \
    --model "CellReasoner-7B" \
    --output_path "./output" \
    --input_json "./output/pbmc_demo_for_CellReasoner.json" \
    --batch_size 2

Result:

output/
└── pbmc_demo_CellReasoner_result.csv
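If reference labels were supplied in Step 3, a simple agreement score can be computed from the result file. A minimal sketch, assuming hypothetical column names ("cell_type" for the reference label, "predicted" for the model output) — check the actual headers in the generated CSV:

```python
# Score = fraction of cells where the predicted label matches the reference.
# Column names below are placeholders, not the script's guaranteed headers.
import io

import pandas as pd

csv_text = """barcode,cell_type,predicted
AAAC-1,CD4+ T cell,CD4+ T cell
AAAG-1,B cell,B cell
AACT-1,NK cell,CD8+ T cell
"""
df = pd.read_csv(io.StringIO(csv_text))
score = (df["predicted"] == df["cell_type"]).mean()
print(round(score, 2))  # 0.67 (2 of 3 cells match)
```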

📊 Evaluation and Reasoning Visualization

To compute scores, generate plots, or view reasoning outputs, refer to:

s03.inference.ipynb

Citation

@article{Cao2025.05.20.655112,
    author = {Cao, Guangshuo and Shen, Yi and Wu, Jianghong and Chao, Haoyu and Chen, Ming and Chen, Dijun},
    title = {CellReasoner: A reasoning-enhanced large language model for cell type annotation},
    elocation-id = {2025.05.20.655112},
    year = {2025},
    doi = {10.1101/2025.05.20.655112},
    URL = {https://www.biorxiv.org/content/early/2025/05/26/2025.05.20.655112},
    eprint = {https://www.biorxiv.org/content/early/2025/05/26/2025.05.20.655112.full.pdf},
    journal = {bioRxiv}
}
