---
license: mit
datasets:
- TIGER-Lab/ViRL39K
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---


# Spark-VL-7B

⭐ If you find our code or model helpful, please consider giving us a star; your support means a lot!

Github Repository | 📖 Daily Paper | 🤗 Models | 📖 Paper

## Introduction

We propose **SPARK**, **a unified framework that integrates the policy and the reward into a single model for joint and synchronous training**. SPARK automatically derives reward and reflection data from verifiable rewards, enabling **self-learning** and **self-evolution**. Furthermore, we instantiate this framework on multiple backbones, training SPARK-VL-7B, SPARK-7B, and SPARK-VL-32B. This repo contains **SPARK-VL-7B**.

## 📢 News

- 🚀 [09/29/2025] We release our 🤗 datasets.
- 🚀 [09/29/2025] We release the **Spark** 📖 paper.
- 🚀 [09/29/2025] We upload our evaluation code and 🤗 models.
- 🚀 [09/29/2025] We release the **Spark** Github repository.

## 💡 Highlights

- 🔥 **Synergistic Policy–Reward Co-Evolving (SPARK)**: We introduce SPARK, a unified reinforcement fine-tuning framework that jointly optimizes the policy and the reward within a single model through on-policy co-evolution.
- 🔥 **Recycling Rollouts**: Unlike conventional RL pipelines that discard rollouts after policy updates, SPARK recycles RLVR rollouts into pointwise, pairwise, and reflection objectives, enabling the model itself to act as both a strong policy and a generative reward model.
- 🔥 **Co-Evolving Mechanism**: Improved reward accuracy provides better gradients for policy learning, while stronger reasoning further refines reward judgment, forming a positive feedback loop that enhances reasoning, judgment, and reflection in synergy.
- 🔥 **Efficient and Practical**: SPARK requires no human preference data, teacher models, or external reward models, making it significantly more data- and compute-efficient than traditional RM-based RL pipelines.

## 🛠️ Usage

### 🤗 Using Transformers

Our model is based on Qwen2.5-VL-7B-Instruct, so you can run inference with the same code as Qwen2.5-VL-7B-Instruct; see the 🤗 Huggingface documentation for details.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "internlm/Spark-VL-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("internlm/Spark-VL-7B")

# Build a multimodal chat message (image_path and prompt are provided by the user)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
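Because SPARK trains the policy and the reward within a single model, the same checkpoint can also be prompted as a generative reward model, i.e., asked to judge a candidate answer. The snippet below is a minimal sketch of that usage: it reuses the `model` and `processor` loaded above, and the judging prompt, example question, and candidate answer are illustrative placeholders of our own rather than the official template from the paper.

```python
# Minimal sketch: prompting Spark-VL-7B as a generative reward model (judge).
# The judge prompt below is an illustrative placeholder, not the official template.
question = "How many apples are on the table?"    # hypothetical example question
candidate = "There are three apples."             # hypothetical candidate answer

judge_prompt = (
    "You are given a question about the image and a candidate answer.\n"
    f"Question: {question}\n"
    f"Candidate answer: {candidate}\n"
    "Judge whether the candidate answer is correct. Explain briefly, then give a verdict."
)

judge_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},  # same image as above
            {"type": "text", "text": judge_prompt},
        ],
    }
]

# Same preprocessing and generation pipeline as in the policy example above
text = processor.apply_chat_template(judge_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(judge_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True,
                             clean_up_tokenization_spaces=False)[0])
```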
### 🔦 Using vLLM

We recommend using **vLLM** for faster inference; it leads to significant speed improvements when evaluating full datasets.

```bash
PORT=8019
N_PROC=256
SERVE_NAME=spark_vl_7b
MODEL_PATH=/internlm/Spark-VL-7B

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --served-model-name $SERVE_NAME \
    --port $PORT \
    --max-num-seqs $N_PROC
```

## ✒️ Citation

```
TBD
```