---
license: apache-2.0
library_name: transformers
---
# StepFun: Cost-Effective Multimodal Intelligence

📰  Step3 Model Blog     |     📄  Step3 System Blog
## Introduction

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

### Step3 model card

| Config | Value |
|------------------------|---------|
| **Number of Layers (Dense layers included)** | 61 |
| **Number of Dense Layers** | 5 |
| **Hidden Dimension** | 7168 |
| **Attention Mechanism** | MFA |
| **Low-rank Query Dimension** | 2048 |
| **Number of Query Heads** | 64 |
| **Head Dimension** | 256 |
| **Number of Experts** | 48 |
| **Selected Experts per Token** | 3 |
| **Number of Shared Experts** | 1 |
| **Max Context Length** | 65536 |
| **Tokenizer** | Deepseek V3 |
| **Total Parameters (LLM)** | 316B |
| **Activated Params per Token** | 38B |
| **Total Parameters (VLM)** | 321B |
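To give a concrete feel for why only 38B of the parameters are active per token, the snippet below is a minimal sketch of generic top-k expert routing using the values from the table above (hidden dimension 7168, 48 routed experts, 3 selected per token, 1 shared expert). It illustrates the general MoE routing pattern only and is not Step3's actual router implementation.

```python
# Minimal illustrative sketch of generic top-k MoE routing, using the config
# values from the table above. This is NOT Step3's actual routing code.
import torch

hidden_dim, num_experts, top_k = 7168, 48, 3
router = torch.nn.Linear(hidden_dim, num_experts, bias=False)

tokens = torch.randn(4, hidden_dim)                    # 4 example token states
scores = router(tokens).softmax(dim=-1)                # routing probabilities
weights, expert_ids = scores.topk(top_k, dim=-1)       # select 3 experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights

# Each token would be processed by its 3 selected experts plus the shared
# expert, so only a small fraction of the total expert parameters is active.
print(expert_ids)
```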
## Evaluation Results

| Category | Model | Total Params. | MMMU | MathVision | ZeroBench(sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLM | Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| | ERNIE4.5-thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| | GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| | MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| | QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| | LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| Open-Source LLM | MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| | Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| | DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| | Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| Proprietary VLM | O3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| | Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| | Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| | Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| | Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| | Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |
Note: Parts of the evaluation results are reproduced using the same settings. †: Evaluation results of Gemini 2.5 Flash (thinking) may be lower than the model's real performance, especially on MathVision, due to insufficient instruction-following ability.

## Deployment

> [!Note]
> Step3's API is accessible at https://platform.stepfun.com/, where we offer an OpenAI-compatible API.

### Inference with Hugging Face Transformers

Below we show how to run the model at inference time with the `transformers` library. We recommend python=3.10, torch>=2.1.0, and transformers=4.54.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default. This behavior is aligned with vLLM and SGLang.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint keys to the transformers module layout.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}
model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(decoded)
```

### Inference with vLLM and SGLang

Our model checkpoints are stored in bf16 and block-fp8 formats; you can find them on [Hugging Face](https://huggingface.co/collections/stepfun-ai/step3-688a3d652dbb45d868f9d42d).

Currently, it is recommended to run Step3 on the following inference engines:

* vLLM
* SGLang

Deployment and request examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).

## Contact Us

If you have any questions, please reach out at [contact@stepfun.com](mailto:contact@stepfun.com).

## License

Both the code repository and the model weights are released under the [Apache License (Version 2.0)](./LICENSE).

## Citation

```
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427},
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence},
      author={StepFun Team},
      url={https://stepfun.ai/research/step3},
}
```