---
title: Qwen2.5-VL | π Storyteller
emoji: π
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
license: apache-2.0
datasets:
- daniel3303/StoryReasoning
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoning
      type: daniel3303/StoryReasoning
      split: test
language:
- en
- zh
---

# QwenStoryteller

This HF Space is a simple implementation of [2505.10292](https://arxiv.org/abs/2505.10292) by Daniel A. P. Oliveira and David Martins de Matos; a BibTeX citation is provided below. The Space was created as a proof of concept, and all credit for the model and method goes to Daniel and David.

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

- **Base Model:** Qwen2.5-VL 7B
- **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)
- **Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:

- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training used a peak learning rate of 1×10⁻⁴, a batch size of 32, warmup over the first 3% of steps, 4 epochs, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
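
The LoRA hyperparameters above map onto a PEFT configuration roughly like the sketch below. The target module list and dropout are illustrative assumptions, not values taken from the released training script.

```python
from peft import LoraConfig

# Illustrative LoRA configuration matching the hyperparameters described above.
# target_modules and lora_dropout are assumptions, not the released training config.
lora_config = LoraConfig(
    r=2048,               # LoRA rank
    lora_alpha=4096,      # alpha scaling factor
    lora_dropout=0.0,     # assumed
    bias="none",
    task_type="CAUSAL_LM",
    # Self-attention projections of the language-model layers (assumed target set)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```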

## System Prompt

The model was trained with the following system prompt, and we recommend using it as-is for inference:

```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```
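
For inference outside this Space, a minimal sketch with Hugging Face Transformers and `qwen-vl-utils` could look like the following. The `daniel3303/QwenStoryteller` model id, the frame file names, the user prompt, and the generation settings are illustrative assumptions; adapt them to your setup.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

model_id = "daniel3303/QwenStoryteller"  # assumed repository id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# One user turn containing the frame sequence to narrate.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "frame1.jpg"},
            {"type": "image", "image": "frame2.jpg"},
            {"type": "image", "image": "frame3.jpg"},
            {"type": "text", "text": "Generate a story based on these images."},
        ],
    },
]

# Build the chat prompt and preprocess the images.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=2048)
story = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(story)
```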

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes, with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Produces 12.3% fewer hallucinations than the non-fine-tuned base model
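
The generated text places the model's reasoning inside `<think></think>` tags ahead of the story, so downstream code typically needs to separate the two. A small helper along these lines (a hypothetical sketch, not part of this Space's `app.py`) does the job:

```python
import re

def split_reasoning_and_story(output: str) -> tuple[str, str]:
    """Separate the <think></think> reasoning block from the final story text."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    story = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, story

example = "<think>Frame 1 introduces a lighthouse keeper...</think>\nThe keeper climbed the spiral stairs at dusk."
reasoning, story = split_reasoning_and_story(example)
print(reasoning)  # chain-of-thought analysis
print(story)      # grounded story
```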

## Citation

```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292},
}
```