---
title: Qwen2.5-VL | πŸ“” Storyteller
emoji: πŸ“š
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
tags:
  - vision-language-model
  - visual-storytelling
  - chain-of-thought
  - grounded-text-generation
  - cross-frame-consistency
  - storytelling
  - image-to-text
license: apache-2.0
datasets:
  - daniel3303/StoryReasoning
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
  - name: QwenStoryteller
    results:
      - task:
          type: visual-storytelling
          name: Visual Storytelling
        dataset:
          name: StoryReasoning
          type: daniel3303/StoryReasoning
          split: test
language:
  - en
  - zh
---

# QwenStoryteller

This HF Space is a simple implementation of arXiv:2505.10292 by Daniel A. P. Oliveira and David Martins de Matos; the BibTeX citation is provided below. The Space was created as a proof of concept, and all other credit goes to Daniel and David.

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency, capable of generating coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

- **Base Model:** Qwen2.5-VL 7B
- **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)
- **Training Dataset:** StoryReasoning

QwenStoryteller processes sequences of images to perform (see the usage sketch after this list):

- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references
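
Below is a minimal inference sketch following the standard Qwen2.5-VL workflow in the Transformers library. The repository id `daniel3303/QwenStoryteller`, the example image paths, and the use of the `qwen_vl_utils.process_vision_info` helper are assumptions not stated on this card; adjust them to your setup. The system prompt is the one reproduced in the "System Prompt" section below.

```python
# Minimal sketch: run QwenStoryteller on a sequence of images.
# Assumptions: repository id, image paths, and the qwen_vl_utils helper (from the Qwen2.5-VL examples).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "daniel3303/QwenStoryteller"  # assumed repository id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

# A sequence of frames to narrate (local paths or URLs) -- placeholder names.
images = ["frame1.jpg", "frame2.jpg", "frame3.jpg"]
messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {
        "role": "user",
        "content": [{"type": "image", "image": img} for img in images]
        + [{"type": "text", "text": "Generate a story based on these images."}],
    },
]

# Build the prompt, collect the vision inputs, and generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
story = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(story)
```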

The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training used a peak learning rate of 1×10⁻⁴, a batch size of 32, a warmup over the first 3% of steps, 4 epochs, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
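
A sketch of a matching PEFT/Transformers configuration is shown below, assuming the hyperparameters above. The `target_modules` names are an assumption (typical Qwen2.5-VL self-attention projection names), and the batch size is expressed as a per-device value because the accumulation/hardware split is not stated on this card.

```python
# Sketch of a LoRA + training configuration mirroring the reported hyperparameters.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=2048,
    lora_alpha=4096,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # self-attention projections (assumed names)
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen-storyteller-lora",
    learning_rate=1e-4,              # peak learning rate
    per_device_train_batch_size=32,  # reported batch size 32 (accumulation split not stated)
    num_train_epochs=4,
    warmup_ratio=0.03,               # warmup over the first 3% of steps
    weight_decay=0.01,
    optim="adamw_torch",
    bf16=True,
)
```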

## System Prompt

The model was trained with the following system prompt, and we recommend using it as-is for inference:

```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```
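
Because the prompt asks the model to wrap its reasoning in `<think></think>` tags, the reasoning trace can be separated from the final story after generation. The snippet below is a small post-processing sketch under that assumption; `story` is assumed to hold the decoded model output (as in the usage sketch above).

```python
import re

def split_reasoning(output_text: str) -> tuple[str, str]:
    """Split a generation into (reasoning, story) using the <think></think> tags."""
    match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Whatever remains outside the <think> block is treated as the final story.
    narrative = re.sub(r"<think>.*?</think>", "", output_text, flags=re.DOTALL).strip()
    return reasoning, narrative

reasoning, final_story = split_reasoning(story)  # `story` from the usage sketch above
```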

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model
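
When only plain prose is needed, the grounding markup can be stripped from the generated story. The sketch below removes XML-like tags with a generic pattern; the actual tag vocabulary is defined by the StoryReasoning annotation scheme and is not listed on this card, so the pattern here is an assumption.

```python
import re

def strip_grounding_tags(story_text: str) -> str:
    """Remove XML-like grounding tags, keeping only the narrative text.

    Assumption: grounding markup uses simple <tag ...>...</tag> style tags;
    the specific tag names come from the StoryReasoning annotation scheme.
    """
    return re.sub(r"</?[A-Za-z][^>]*>", "", story_text).strip()

plain_story = strip_grounding_tags(final_story)
```

## Citation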
```bibtex
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292},
}
```