base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- TIGER-Lab/ViRL39K
license: mit
library_name: transformers
pipeline_tag: video-text-to-text
tags:
- lvlm
- reasoning
- multimodal
- qwen
# Spark-VL-7B
⭐ If you find our code or model helpful, please consider giving us a star; your support means a lot! GitHub repository | Daily Paper | 🤗 Models | Paper
## Introduction
We propose SPARK, a unified framework that integrates policy and reward into a single model for joint and synchronous training. SPARK automatically derives reward and reflection data from verifiable rewards, enabling self-learning and self-evolution. We instantiate this framework on multiple backbones, training SPARK-VL-7B, SPARK-7B, and SPARK-VL-32B. This repository hosts SPARK-VL-7B.
## 📢 News
- [09/29/2025] We release our 🤗 datasets.
- [09/29/2025] We release Spark's paper.
- [09/29/2025] We upload our evaluation code and 🤗 models.
- [09/29/2025] We release the Spark GitHub repository.
## 💡 Highlights
- 🔥 Synergistic Policy-Reward Co-Evolving (SPARK): We introduce SPARK, a unified reinforcement fine-tuning framework that jointly optimizes policy and reward within a single model through on-policy co-evolution.
- 🔥 Recycling Rollouts: Unlike conventional RL pipelines that discard rollouts after policy updates, SPARK recycles RLVR rollouts into pointwise, pairwise, and reflection objectives, enabling the model itself to act as both a strong policy and a generative reward model (see the sketch after this list).
- 🔥 Co-Evolving Mechanism: Improved reward accuracy provides better gradients for policy learning, while stronger reasoning further refines reward judgment, forming a positive feedback loop that enhances reasoning, judgment, and reflection in synergy.
- 🔥 Efficient and Practical: SPARK requires no human preference data, teacher models, or external reward models, making it significantly more data- and compute-efficient than traditional RM-based RL pipelines.
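The actual objectives are defined in the GitHub repository and the paper. Purely as an illustration of the rollout-recycling idea above, the sketch below shows, under the assumption that one group of rollouts for the same question is scored by a binary verifiable reward, how such rollouts could be turned into pointwise, pairwise, and reflection examples. The `Rollout` class and prompt templates are hypothetical, not the released training code.

```python
# Illustration only: NOT the released SPARK training code.
# One group of rollouts = multiple responses sampled for the same question,
# each scored by a verifiable reward.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Rollout:
    question: str
    response: str
    reward: float  # e.g. 1.0 if the final answer is verifiably correct, else 0.0

def recycle(rollouts: list[Rollout]):
    pointwise, pairwise, reflection = [], [], []
    correct = [r for r in rollouts if r.reward > 0.5]
    wrong = [r for r in rollouts if r.reward <= 0.5]

    # Pointwise: judge a single response as correct / incorrect.
    for r in rollouts:
        pointwise.append({
            "prompt": f"Question: {r.question}\nResponse: {r.response}\nIs this response correct?",
            "target": "yes" if r.reward > 0.5 else "no",
        })

    # Pairwise: choose the better of two responses with different rewards.
    for a, b in combinations(rollouts, 2):
        if a.reward == b.reward:
            continue
        better, worse = (a, b) if a.reward > b.reward else (b, a)
        pairwise.append({
            "prompt": (f"Question: {a.question}\nResponse A: {better.response}\n"
                       f"Response B: {worse.response}\nWhich response is better?"),
            "target": "A",  # order kept fixed for brevity; shuffle in practice
        })

    # Reflection: given a failed attempt, reconsider and produce a correct answer.
    for w in wrong:
        for c in correct:
            reflection.append({
                "prompt": (f"Question: {w.question}\nPrevious attempt: {w.response}\n"
                           "Reflect on the attempt and answer again."),
                "target": c.response,
            })
    return pointwise, pairwise, reflection
```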
## 🛠️ Usage
### 🤗 Using Transformers
Our model is based on Qwen2.5-VL-7B-Instruct, so you can run inference with the same code as Qwen2.5-VL-7B-Instruct; see 🤗 Hugging Face for details.
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "internlm/Spark-VL-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("internlm/Spark-VL-7B")

image_path = "path/to/your/image.png"  # local path or URL of the input image
prompt = "Solve the problem shown in the image step by step."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": prompt},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate and decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
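The processor call above already passes `video_inputs` through, so video prompts follow the same pattern as images. A minimal message sketch, where the file path and `fps` value are placeholders:

```python
# Video input uses the same pipeline; process_vision_info handles "video"
# entries just like images. The file path below is a placeholder.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video step by step."},
        ],
    }
]
# Re-run apply_chat_template / process_vision_info / processor with these messages.
```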
### 📦 Using vLLM
We recommend using vLLM for faster inference; it yields significant speed-ups when evaluating full datasets.
```bash
PORT=8019
N_PROC=256
SERVE_NAME=spark_vl_7b
MODEL_PATH=/internlm/Spark-VL-7B

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --served-model-name $SERVE_NAME \
    --port $PORT \
    --max-num-seqs $N_PROC
```
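Once the server is up, it exposes an OpenAI-compatible endpoint. A minimal client sketch, assuming the port and served model name from the command above (the image URL and prompt are placeholders):

```python
# Query the vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8019/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="spark_vl_7b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},
                {"type": "text", "text": "Solve the problem shown in the image step by step."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```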
## Training
### Spark Training
After downloading the dataset, you can start training with the example bash script below. Our bash scripts are in /Spark/Lmm_XC/XC/scripts/spark_training. You need to change the dataset and model paths to your own locations.
```bash
export WORKSPACE_DIR="/fs-computility/....../Lmm_XC"                        # Path to project root directory
export DATASET_PATH="/fs-computility/....../infer_data_ViRL_19k.json"       # Path to your dataset
export PRETRAIN_MODEL_PATH="/fs-computility/....../Qwen2.5-VL-7B-Instruct"  # Path to pretrained model

export WANDB_PROJECT="Observation"                                          # Name for this project
export MODEL_CPK_NAME="Qwen2.5-VL-7B-GRPO-virl-19k-iar-reflection-hyb-diverse-bs64-e2"  # Name for this training run
export LOG_PATH='/fs-computility/....../Qwen2.5-VL-7B-GRPO-virl-19k-iar-reflection-hyb-diverse-bs64-e2.txt'  # Log file save path
export WANDB_API_KEY="......"

export SAVE_PATH="/fs-computility/....../${WANDB_PROJECT}/${MODEL_CPK_NAME}"  # Absolute path for everything saved by this run
export CKPT_PATH="${SAVE_PATH}/ckpt"                                         # Path to save checkpoints
export FINAL_CKPT_PATH="${SAVE_PATH}/final_ckpt"                             # Path to save final checkpoints
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)                                      # Timestamp
export CUR_LOG_DIR="${SAVE_PATH}/training_logs/${TIMESTAMP}"                 # Path to save current run logs
export LOG_DIR="${SAVE_PATH}/tb_logs"                                        # Path to save TensorBoard logs
```
⏰ Attention:

```bash
export DEV_MODE=0  # Set to 1 for debug mode on a single dev machine
```
## Evaluation
The integrated multimodal mathematics datasets can be downloaded from 🤗 datasets and evaluated with the scripts provided in the Evaluation folder. The evaluation results are saved to disk, and accuracy can then be computed with calculate_acc.py.

```bash
bash ./Evaluation/eval_spark_vl_7b.sh
python calculate_acc.py --result_path ./your_result_path.json
```
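For reference, the accuracy computation boils down to comparing predictions against ground-truth answers in the saved result file. The sketch below is a hypothetical stand-in, not the released calculate_acc.py; the field names "pred" and "answer" are assumptions, not the actual result schema.

```python
# Hypothetical sketch of an accuracy computation over a saved result file.
# Field names ("pred", "answer") are assumptions; see calculate_acc.py for
# the actual schema used by the released evaluation scripts.
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--result_path", required=True)
args = parser.parse_args()

with open(args.result_path) as f:
    results = json.load(f)

correct = sum(1 for r in results if str(r["pred"]).strip() == str(r["answer"]).strip())
print(f"Accuracy: {correct / len(results):.4f} ({correct}/{len(results)})")
```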
## Citation
```bibtex
@article{liu2025spark,
  title   = {SPARK: Synergistic Policy And Reward Co-Evolving Framework},
  author  = {Ziyu Liu and Yuhang Zang and Shengyuan Ding and Yuhang Cao and Xiaoyi Dong and Haodong Duan and Dahua Lin and Jiaqi Wang},
  journal = {arXiv preprint arXiv:2509.22624},
  year    = {2025}
}
```
## License
Usage and license notices: the data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International. Usage should also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
## Acknowledgement
We sincerely thank the lmm-r1 and OpenRLHF projects for providing their open-source resources.