---
license: apache-2.0
library_name: transformers
---
## Introduction
Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts (MoE) architecture with 321B total parameters and 38B activated per token.
It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning.
Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD),
Step3 maintains exceptional efficiency across both flagship and low-end accelerators.
### Step3 Model Card
| Config | Value |
|---|---|
| **Number of Layers (dense layers included)** | 61 |
| **Number of Dense Layers** | 5 |
| **Hidden Dimension** | 7168 |
| **Attention Mechanism** | MFA |
| **Low-Rank Query Dimension** | 2048 |
| **Number of Query Heads** | 64 |
| **Head Dimension** | 256 |
| **Number of Experts** | 48 |
| **Selected Experts per Token** | 3 |
| **Number of Shared Experts** | 1 |
| **Max Context Length** | 65536 |
| **Tokenizer** | DeepSeek V3 |
| **Total Parameters (LLM)** | 316B |
| **Activated Params per Token** | 38B |
| **Total Parameters (VLM)** | 321B |
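
The expert configuration above (48 routed experts, top-3 selection per token, plus 1 always-active shared expert) follows a common MoE routing pattern. The toy sketch below illustrates that routing pattern only; it is not Step3's implementation, and the dimensions are shrunk for readability.

```python
# Toy sketch of top-k MoE routing with a shared expert (illustrative only;
# not Step3's implementation -- gating and dispatch details may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, hidden=64, num_experts=48, top_k=3):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.Linear(hidden, 4 * hidden), nn.SiLU(), nn.Linear(4 * hidden, hidden)
            )
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        self.shared_expert = ffn()  # processed by every token, no routing

    def forward(self, x):                                   # x: (tokens, hidden)
        scores = F.softmax(self.router(x), dim=-1)          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)      # top-3 routed experts
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize gates
        out = self.shared_expert(x)
        for t in range(x.size(0)):                          # naive per-token dispatch
            for w, e in zip(weights[t], idx[t].tolist()):
                out[t] = out[t] + w * self.experts[e](x[t])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # -> torch.Size([4, 64])
```

Only the 3 selected experts (plus the shared one) run per token, which is why the activated parameter count (38B) is far below the total (321B).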
## Evaluation Results
| Model | Total Params. | MMMU | MathVision | ZeroBench(sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source VLM** | | | | | | | | | | | | |
| Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| ERNIE4.5-thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| **Open-Source LLM** | | | | | | | | | | | | |
| MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| **Proprietary VLM** | | | | | | | | | | | | |
| o3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |
Note: Parts of the evaluation results were reproduced using the same settings.

†: The reported results for Gemini 2.5 Flash (thinking) may understate its real performance, especially on MathVision, due to insufficient instruction-following ability.
## Deployment
> [!NOTE]
> Step3's API is available at https://platform.stepfun.com/, where we offer an OpenAI-compatible API.
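
For quick tests against the hosted endpoint, any OpenAI-compatible client should work. The sketch below uses the official `openai` Python package; the base URL and model identifier shown are assumptions, so confirm both on the platform before use.

```python
# Minimal sketch of calling the hosted API through the OpenAI-compatible
# interface. Base URL and model name are assumptions -- check
# https://platform.stepfun.com/ for the authoritative values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",         # issued at platform.stepfun.com
    base_url="https://api.stepfun.com/v1",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="step3",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}},
                {"type": "text", "text": "What's in this picture?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```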
### Inference with Hugging Face Transformers
This section shows how to run inference with our model using the `transformers` library. We recommend python=3.10, torch>=2.1.0, and transformers=4.54.0 as the development environment. Currently only bf16 inference is supported, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang.
```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint keys to the layout expected by the remote modeling code.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}

model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(decoded)
```
### Inference with vLLM and SGLang
Our model checkpoints are stored in bf16 and block-fp8 formats; you can find them on [Hugging Face](https://huggingface.co/collections/stepfun-ai/step3-688a3d652dbb45d868f9d42d).
Currently, it is recommended to run Step3 on the following inference engines:
* vLLM
* SGLang
Deployment and request examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
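
As a starting point, here is a minimal, text-only sketch of offline inference with vLLM's Python API. The tensor-parallel size is an assumption to adjust for your hardware; multimodal requests, the block-fp8 checkpoint, and recommended server launch flags are covered in the deployment guide rather than here.

```python
# Minimal offline-inference sketch with vLLM (text-only, illustrative
# settings); see docs/deploy_guidance.md for the recommended deployment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stepfun-ai/step3",  # bf16 checkpoint from the collection above
    trust_remote_code=True,
    tensor_parallel_size=8,    # assumption: adjust to your GPU count/memory
)

params = SamplingParams(temperature=0.7, max_tokens=512)

# Raw-prompt call for brevity; production requests should go through the
# chat template or the OpenAI-compatible server described in the guide.
outputs = llm.generate(["Explain Mixture-of-Experts decoding in two sentences."], params)
print(outputs[0].outputs[0].text)
```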
## Contact Us
If you have any questions, please reach out at [contact@stepfun.com](mailto:contact@stepfun.com).
## License
Both the code repository and the model weights are released under the [Apache License (Version 2.0)](./LICENSE).
## Citation
```bibtex
@misc{step3system,
title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
author={StepFun Team},
year={2025},
eprint={2507.19427},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.19427},
}
@misc{step3blog,
title={Step3: Cost-Effective Multimodal Intelligence},
author={StepFun Team},
url={https://stepfun.ai/research/step3},
}
```