---
title: Qwen2.5-VL | π Storyteller
emoji: π
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
license: apache-2.0
datasets:
- daniel3303/StoryReasoning
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoning
      type: daniel3303/StoryReasoning
      split: test
language:
- en
- zh
---
# QwenStoryteller
This HF Space is a simple implementation of [2505.10292](https://arxiv.org/abs/2505.10292) by Daniel A. P. Oliveira and David Martins de Matos; the BibTeX citation is provided below. The Space was created as a proof of concept, and all credit for the underlying work goes to Daniel and David.
QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency, capable of generating coherent narratives from multiple images while maintaining character and object identity throughout the story.
## Model Description
- **Base Model:** Qwen2.5-VL 7B
- **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)
- **Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)
QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references
The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training used a peak learning rate of 1×10⁻⁴ with a batch size of 32, a warmup over the first 3% of steps, 4 epochs, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
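As a rough sketch, the reported configuration maps onto the `peft` library roughly as follows; the `target_modules` list and the omitted trainer wiring are assumptions for illustration, not the authors' released training code.

```python
# Sketch of the reported LoRA setup with peft; the target_modules list is an
# assumption for Qwen2.5-VL's language-side self-attention projections.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,  # bfloat16 precision, as reported
)

lora_config = LoraConfig(
    r=2048,           # LoRA rank
    lora_alpha=4096,  # alpha scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Remaining reported hyperparameters (trainer and data pipeline omitted):
# AdamW, peak lr 1e-4, weight decay 0.01, batch size 32, 3% warmup, 4 epochs.
```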
## System Prompt
The model was trained with the following system prompt, and we recommend using it unchanged for inference.
```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```
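A minimal inference sketch with that prompt is shown below; the checkpoint id `daniel3303/QwenStoryteller` and the user message are assumptions, so substitute the repository that actually hosts the fine-tuned weights.

```python
# Minimal inference sketch; "daniel3303/QwenStoryteller" is an assumed checkpoint id.
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "daniel3303/QwenStoryteller"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [
        {"type": "image", "image": "frame_01.jpg"},  # paths or URLs to the image sequence
        {"type": "image", "image": "frame_02.jpg"},
        {"type": "text", "text": "Generate a story based on these images."},
    ]},
]

# Build the Qwen2.5-VL chat prompt and resolve the images.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
story = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(story)
```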
## Key Features
- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model
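Because the reasoning is wrapped in `<think></think>` tags ahead of the story, a small post-processing step can separate the two; the helper below is an illustrative sketch, not part of the released code.

```python
import re

def split_story(output: str) -> tuple[str, str]:
    """Split raw model output into the <think> reasoning block and the final story."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    narrative = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, narrative

reasoning, narrative = split_story(story)  # `story` from the inference sketch above
```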
## Citation
```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
author={Daniel A. P. Oliveira and David Martins de Matos},
year={2025},
eprint={2505.10292},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.10292},
}
```