GLM-4.1V-9B-Thinking-GPTQ-Int4-Int8Mix

This is the quantized version of the GLM-4.1V-9B-Thinking model.

- Paper: GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
- Code: https://github.com/THUDM/GLM-4.1V-Thinking
- Hugging Face Demo: https://huggingface.co/spaces/THUDM/GLM-4.1V-9B-Thinking-API-Demo
- ModelScope Demo: https://modelscope.cn/studios/ZhipuAI/GLM-4.1V-9B-Thinking-Demo
- API Service: https://www.bigmodel.cn/dev/api/visual-reasoning-model/GLM-4.1V-Thinking
Model Introduction
Vision-Language Models (VLMs) have become foundational components of intelligent systems. As real-world AI tasks grow increasingly complex, VLMs must evolve beyond basic multimodal perception to enhance their reasoning capabilities in complex tasks. This involves improving accuracy, comprehensiveness, and intelligence, enabling applications such as complex problem solving, long-context understanding, and multimodal agents.
Based on the GLM-4-9B-0414 foundation model, we present the new open-source VLM model GLM-4.1V-9B-Thinking, designed to explore the upper limits of reasoning in vision-language models. By introducing a "thinking paradigm" and leveraging reinforcement learning, the model significantly enhances its capabilities. It achieves state-of-the-art performance among 10B-parameter VLMs, matching or even surpassing the 72B-parameter Qwen-2.5-VL-72B on 18 benchmark tasks. We are also open-sourcing the base model GLM-4.1V-9B-Base to support further research into the boundaries of VLM capabilities.
Compared to the previous generation models CogVLM2 and the GLM-4V series, GLM-4.1V-Thinking offers the following improvements:
- The first reasoning-focused model in the series, achieving world-leading performance not only in mathematics but also across various sub-domains.
- Supports 64k context length.
- Handles arbitrary aspect ratios and up to 4K image resolution.
- Provides an open-source version supporting both Chinese and English bilingual usage.
Model Information
Model Download Links
| Model | Download Links | Model Type |
|---|---|---|
| GLM-4.1V-9B-Thinking | 🤗 Hugging Face / 🤖 ModelScope | Reasoning Model |
| GLM-4.1V-9B-Base | 🤗 Hugging Face / 🤖 ModelScope | Base Model |
The model's algorithm implementation can be found in the official transformers repository.
Runtime Requirements
Inference
| Device (Single GPU) | Framework | Min Memory | Speed | Precision |
|---|---|---|---|---|
| NVIDIA A100 | transformers | 22GB | 14-22 tokens/s | BF16 |
| NVIDIA A100 | vLLM | 22GB | 60-70 tokens/s | BF16 |
Fine-tuning
The following results are based on image fine-tuning using the LLaMA-Factory toolkit.
| Device (Cluster) | Strategy | Min Memory / # of GPUs | Batch Size (per GPU) | Freezing |
|---|---|---|---|---|
| NVIDIA A100 | LoRA | 21GB / 1 GPU | 1 | Freeze ViT |
| NVIDIA A100 | Full ZeRO-2 | 280GB / 4 GPUs | 1 | Freeze ViT |
| NVIDIA A100 | Full ZeRO-3 | 192GB / 4 GPUs | 1 | Freeze ViT |
| NVIDIA A100 | Full ZeRO-2 | 304GB / 4 GPUs | 1 | No Freezing |
| NVIDIA A100 | Full ZeRO-3 | 210GB / 4 GPUs | 1 | No Freezing |
Note: Fine-tuning with ZeRO-2 may cause the loss to drop to zero; ZeRO-3 is recommended for stable training.
Benchmark Performance
GLM-4.1V-9B-Thinking introduces a "thinking" paradigm and leverages Reinforcement Learning with Curriculum Sampling (RLCS) to comprehensively enhance model capabilities. As a result, it achieves state-of-the-art performance among vision-language models at the 10B-parameter scale, matching or even surpassing the 72B Qwen-2.5-VL on 18 benchmark tasks.
Model Inference
Downloading the Quantized Model via ModelScope
```python
from modelscope import snapshot_download

snapshot_download('dengcao/GLM-4.1V-9B-Thinking-GPTQ-Int4-Int8Mix', cache_dir="your_local_path")
```
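After downloading, the checkpoint can be loaded for inference with `transformers`. The snippet below is a minimal sketch, not a script shipped with this repository: it assumes a recent `transformers` release that includes the GLM-4.1V (`Glm4v`) model classes, an installed GPTQ backend so the quantization config stored with the checkpoint is applied automatically, and placeholder values for the local path and image URL.

```python
from transformers import AutoProcessor, Glm4vForConditionalGeneration

# Placeholder: point this at the directory created by snapshot_download above.
MODEL_PATH = "your_local_path/dengcao/GLM-4.1V-9B-Thinking-GPTQ-Int4-Int8Mix"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/demo.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = Glm4vForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",   # the GPTQ quantization config is read from the checkpoint
    device_map="auto",
)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=8192)
print(processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```

For higher throughput, the same local checkpoint path can also be served with vLLM, as shown in the next section.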
Inference Scripts and Examples
All inference scripts are located in the `inference` folder of the GitHub repository and include:

- `trans_infer_cli.py`: A command-line interactive script using the `transformers` library as the backend. It supports multi-turn dialogue.
- `trans_infer_gradio.py`: A Gradio-based web UI script using the `transformers` backend. It supports multimodal inputs such as images, videos, PDFs, and PPTs.
- An OpenAI-compatible API service with `vllm`, along with a simple request example provided in `vllm_api_request.py` (see also the request sketch after this list):

  ```shell
  vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /
  ```

  - If `--limit-mm-per-prompt` is not specified, only 1 image is supported. The model supports at most 1 video or 300 images per input; it does not support simultaneous image and video inputs.
  - `--allowed-local-media-path` must be set to permit access to local multimodal inputs.
- `trans_infer_bench`: An academic benchmarking script for inference with `GLM-4.1V-9B-Thinking`. Key features:
  - Automatically interrupts thinking once it exceeds 8192 tokens and appends `</think><answer>` to prompt the model to generate a final answer.
  - Demonstrates video-based input; other modalities require modifications.
  - Only a `transformers` version is provided. For `vLLM`, a custom implementation is needed to support this logic.
- `vllm_request_gui_agent.py`: Demonstrates how to handle model responses and construct prompts for GUI Agent use cases. It covers strategies for mobile, desktop, and web environments and can be integrated into your application framework. For detailed GUI Agent documentation, please refer to this file.
- For Ascend NPU inference, check here.
Model Fine-tuning
LLaMA-Factory now supports fine-tuning of this model. Below is an example dataset using two images. Prepare your dataset in a `finetune.json` file like the following:
```json
[
  {
    "messages": [
      {
        "content": "<image>Who are they?",
        "role": "user"
      },
      {
        "content": "<think>The user asks me to observe the image and answer. I know they are Kane and Gretzka from Bayern Munich.</think><answer>They're Kane and Gretzka from Bayern Munich.</answer>",
        "role": "assistant"
      },
      {
        "content": "<image>What are they doing?",
        "role": "user"
      },
      {
        "content": "<think>I need to observe what these people are doing. They are celebrating on the soccer field.</think><answer>They are celebrating on the soccer field.</answer>",
        "role": "assistant"
      }
    ],
    "images": [
      "mllm_demo_data/1.jpg",
      "mllm_demo_data/2.jpg"
    ]
  }
]
```
- Content inside `<think> ... </think>` will not be stored in the conversation history or used during fine-tuning.
- The `<image>` tag will be replaced with actual image data during preprocessing.
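If your raw data lives in another format, a small script can assemble records matching the structure above before running LLaMA-Factory. The sketch below is purely illustrative; the `make_record` helper, file paths, and example answers are placeholders and not part of LLaMA-Factory.

```python
import json

def make_record(question, thought, answer, image_paths):
    """Build one training record in the format shown above.

    The "images" list must contain one path per <image> tag in the messages.
    """
    return {
        "messages": [
            {"content": "<image>" + question, "role": "user"},
            {"content": f"<think>{thought}</think><answer>{answer}</answer>", "role": "assistant"},
        ],
        "images": image_paths,
    }

records = [
    make_record(
        question="Who are they?",
        thought="The user asks me to identify the people in the image.",
        answer="They're Kane and Gretzka from Bayern Munich.",
        image_paths=["mllm_demo_data/1.jpg"],
    ),
]

with open("finetune.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```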
After preparing the dataset, you can proceed with fine-tuning using the standard LLaMA-Factory pipeline.
Model License
- The code in this repository is released under the Apache License 2.0.
- The models GLM-4.1V-9B-Thinking and GLM-4.1V-9B-Base are both licensed under the MIT License.
Citation
If you find our work helpful, please consider citing the following paper.
```bibtex
@misc{glmvteam2025glm41vthinkingversatilemultimodalreasoning,
      title={GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},
      author={GLM-V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Wenkai Li and Wei Jia and Xin Lyu and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuxuan Zhang and Zhanxiao Du and Zhenyu Hou and Zhao Xue and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},
      year={2025},
      eprint={2507.01006},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.01006},
}
```