---
title: Qwen2.5-VL | 📔 Storyteller
emoji: 📚
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
license: apache-2.0
datasets:
- daniel3303/StoryReasoning
base_model:
- daniel3303/QwenStoryteller
pipeline_tag: image-to-text
language:
- en
- zh
---

# QwenStoryteller

This HF Space is a simple implementation of [2505.10292](https://arxiv.org/abs/2505.10292) by Daniel A. P. Oliveira and David Martins de Matos; a BibTeX citation is provided below. The Space was created as a proof of concept; all other credit goes to Daniel and David.

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

**Base Model:** Qwen2.5-VL 7B

**Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)

**Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with a peak learning rate of 1×10⁻⁴, batch size 32, warmup over the first 3% of steps, the AdamW optimizer with weight decay 0.01, and bfloat16 precision. A hedged sketch of this configuration appears at the end of this card.

## System Prompt

The model was trained with the following system prompt, and we recommend using it as-is for inference (see the usage sketch at the end of this card).

```
You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Produces 12.3% fewer hallucinations than the non-fine-tuned base model

## Citation

```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292},
}
```
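
## Usage

Below is a minimal inference sketch using the Hugging Face `transformers` Qwen2.5-VL classes together with the `qwen-vl-utils` helper package. It assumes a recent `transformers` release with Qwen2.5-VL support; the frame paths and the user instruction are hypothetical placeholders, so adapt them to your own images.

```python
# Minimal inference sketch for QwenStoryteller (assumes transformers with
# Qwen2.5-VL support and the qwen-vl-utils package; frame paths are placeholders).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# The system prompt the model was trained with (see "System Prompt" above).
SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create "
    "creative narratives. First think step-by-step to analyze characters, objects, "
    "settings, and narrative structure. Then create a grounded story that maintains "
    "consistent character identity and object references across frames. Use "
    "<think></think> tags to show your reasoning process before writing the final story."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "frame1.jpg"},  # hypothetical frame paths
            {"type": "image", "image": "frame2.jpg"},
            {"type": "image", "image": "frame3.jpg"},
            {"type": "text", "text": "Generate a story based on these images."},
        ],
    },
]

# Build the prompt, collect the vision inputs, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Because the model emits its chain-of-thought before the grounded story, budget a generous `max_new_tokens` and, if you only want the narrative, strip the reasoning portion from the output.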
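
## Training Configuration (Sketch)

For reference, the hyperparameters reported above map onto a PEFT/`transformers` configuration roughly as follows. This is a hedged reconstruction, not the authors' training script: the target module names assume the standard self-attention projections of the language model, and the output path is a placeholder.

```python
# Hedged sketch of the reported LoRA fine-tuning setup; module names and the
# output directory are assumptions, not taken from the authors' code.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=2048,                                   # LoRA rank reported above
    lora_alpha=4096,                          # alpha scaling factor (alpha / r = 2)
    target_modules=["q_proj", "k_proj",       # assumed language-model
                    "v_proj", "o_proj"],      # self-attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen-storyteller-lora",       # hypothetical output path
    num_train_epochs=4,
    per_device_train_batch_size=32,           # reported batch size 32
    learning_rate=1e-4,                       # peak learning rate
    warmup_ratio=0.03,                        # warmup over the first 3% of steps
    weight_decay=0.01,                        # AdamW weight decay
    optim="adamw_torch",
    bf16=True,                                # bfloat16 precision
)
```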