---
license: apache-2.0
library_name: transformers
---
# StepFun: Cost-Effective Multimodal Intelligence

📰  Step3 Model Blog     |     📄  Step3 System Blog
## Introduction

Step3 is our cutting-edge multimodal reasoning model, built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

### Step3 model card

| Config | Value |
|------------------------|---------|
| **Number of Layers (Dense layers included)** | 61 |
| **Number of Dense Layers** | 5 |
| **Hidden Dimension** | 7168 |
| **Attention Mechanism** | MFA |
| **Low-rank Query Dimension** | 2048 |
| **Number of Query Heads** | 64 |
| **Head Dimension** | 256 |
| **Number of Experts** | 48 |
| **Selected Experts per Token** | 3 |
| **Number of Shared Experts** | 1 |
| **Max Context Length** | 65536 |
| **Tokenizer** | Deepseek V3 |
| **Total Parameters (LLM)** | 316B |
| **Activated Params per Token** | 38B |
| **Total Parameters (VLM)** | 321B |
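To give a concrete feel for why only 38B of the parameters are active per token, the snippet below is a minimal sketch of generic top-k expert routing using the values from the table above (hidden dimension 7168, 48 routed experts, 3 selected per token, 1 shared expert). It illustrates the general MoE routing pattern only and is not Step3's actual router implementation.

```python
# Minimal illustrative sketch of generic top-k MoE routing, using the config
# values from the table above. This is NOT Step3's actual routing code.
import torch

hidden_dim, num_experts, top_k = 7168, 48, 3
router = torch.nn.Linear(hidden_dim, num_experts, bias=False)

tokens = torch.randn(4, hidden_dim)                    # 4 example token states
scores = router(tokens).softmax(dim=-1)                # routing probabilities
weights, expert_ids = scores.topk(top_k, dim=-1)       # select 3 experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights

# Each token would be processed by its 3 selected experts plus the shared
# expert, so only a small fraction of the total expert parameters is active.
print(expert_ids)
```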
## Evaluation Results

| Category | Model | Total Params. | MMMU | MathVision | ZeroBench(sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLM | Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| | ERNIE4.5-thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| | GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| | MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| | QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| | LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| Open-Source LLM | MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| | Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| | DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| | Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| Proprietary VLM | O3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| | Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| | Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| | Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| | Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| | Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |
Note: Parts of the evaluation results are reproduced using the same settings. †: Evaluation results of Gemini 2.5 Flash (thinking) may be lower than the model's real performance, especially on MathVision, due to insufficient instruction-following ability.

## Deployment

> [!Note]
> Step3's API is accessible at https://platform.stepfun.com/, where we offer an OpenAI-compatible API.

### Inference with Hugging Face Transformers

Below we show how to run the model at inference time with the `transformers` library. We recommend python=3.10, torch>=2.1.0, and transformers=4.54.0 as the development environment. We currently only support bf16 inference, and multi-patch image preprocessing is enabled by default. This behavior is aligned with vLLM and SGLang.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint keys to the transformers module layout.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}
model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(decoded)
```

### Inference with vLLM and SGLang

Our model checkpoints are stored in bf16 and block-fp8 formats; you can find them on [Hugging Face](https://huggingface.co/collections/stepfun-ai/step3-688a3d652dbb45d868f9d42d).

Currently, it is recommended to run Step3 on the following inference engines:

* vLLM
* SGLang

Deployment and request examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).

## Contact Us

If you have any questions, please reach out at [contact@stepfun.com](mailto:contact@stepfun.com).

## License

Both the code repository and the model weights are released under the [Apache License (Version 2.0)](./LICENSE).

## Citation

```
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427},
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence},
      author={StepFun Team},
      url={https://stepfun.ai/research/step3},
}
```