LPX55 committed
Commit 7de8390 · verified · 1 Parent(s): 24db381

Update README.md

Files changed (1)
  1. README.md +79 -5
README.md CHANGED
@@ -1,12 +1,86 @@
  ---
- title: QwenStoryteller
  emoji: 📚
- colorFrom: indigo
- colorTo: indigo
  sdk: gradio
  sdk_version: 5.30.0
  app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Qwen2.5-VL | 📔 Storyteller
  emoji: 📚
+ colorFrom: red
+ colorTo: red
  sdk: gradio
  sdk_version: 5.30.0
  app_file: app.py
+ pinned: true
+ tags:
+ - vision-language-model
+ - visual-storytelling
+ - chain-of-thought
+ - grounded-text-generation
+ - cross-frame-consistency
+ - storytelling
+ - image-to-text
+ license: apache-2.0
+ datasets:
+ - daniel3303/StoryReasoning
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
+ pipeline_tag: image-to-text
+ model-index:
+ - name: QwenStoryteller
+   results:
+   - task:
+       type: visual-storytelling
+       name: Visual Storytelling
+     dataset:
+       name: StoryReasoning
+       type: daniel3303/StoryReasoning
+       split: test
+ language: en, zh
  ---

+
+ # QwenStoryteller
+
+ This HF Space is a simple implementation of [2505.10292](https://arxiv.org/abs/2505.10292) by Daniel A. P. Oliveira and David Martins de Matos; a BibTeX citation is provided below. The Space was created as a proof of concept, and all other credit goes to Daniel and David.
+
+ QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from multiple images while maintaining character and object identity throughout the story.
+
+ ## Model Description
+
+ **Base Model:** Qwen2.5-VL 7B
+ **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)
+ **Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)
+
+ QwenStoryteller processes sequences of images to perform:
+ - End-to-end object detection
+ - Cross-frame object re-identification
+ - Landmark detection
+ - Chain-of-thought reasoning for scene understanding
+ - Grounded story generation with explicit visual references
+
+ The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language component. Training used a peak learning rate of 1×10⁻⁴ with batch size 32, warmup over the first 3% of steps, 4 epochs, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
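The hyperparameters above map naturally onto a PEFT `LoraConfig` plus `TrainingArguments`. The sketch below only illustrates those reported values; the target module names, batch-size split, and output path are assumptions, not details taken from the actual training code.

```python
# Illustrative sketch of the reported setup with PEFT + Transformers.
# Only the numbers (rank, alpha, LR, warmup, epochs, weight decay, bf16)
# come from the card; module names and trainer wiring are assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=2048,                     # LoRA rank
    lora_alpha=4096,            # alpha scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed self-attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen-storyteller-lora",   # hypothetical output path
    learning_rate=1e-4,                   # peak learning rate
    per_device_train_batch_size=4,        # assumed split; effective batch size 32
    gradient_accumulation_steps=8,
    num_train_epochs=4,
    warmup_ratio=0.03,                    # warmup over the first 3% of steps
    weight_decay=0.01,                    # AdamW weight decay
    bf16=True,                            # bfloat16 precision
)
```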
+
+ ## System Prompt
+ The model was trained with the following system prompt, and we recommend using it as-is for inference.
+
+ ```
+ You are an AI storyteller that can analyze sequences of images and create creative narratives.
+ First think step-by-step to analyze characters, objects, settings, and narrative structure.
+ Then create a grounded story that maintains consistent character identity and object references across frames.
+ Use <think></think> tags to show your reasoning process before writing the final story.
+ ```
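For reference, here is a minimal inference sketch that follows the standard Qwen2.5-VL `transformers` recipe and passes the system prompt above as the system message. The model ID `daniel3303/QwenStoryteller`, the frame paths, and the user instruction are assumptions to adapt, not code taken from this Space's `app.py`.

```python
# Minimal inference sketch (standard Qwen2.5-VL recipe, not this Space's app.py).
# The model ID and image paths below are assumptions.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "daniel3303/QwenStoryteller"  # assumed checkpoint name
SYSTEM_PROMPT = """You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story."""

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "frame_01.jpg"},  # hypothetical frame paths
            {"type": "image", "image": "frame_02.jpg"},
            {"type": "text", "text": "Generate a story based on these images."},
        ],
    },
]

# Build the chat-formatted prompt and collect the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=2048)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```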
+
+ ## Key Features
+
+ - **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
+ - **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes, with explicit modeling of characters, objects, settings, and narrative structure (see the parsing sketch after this list)
+ - **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
+ - **Reduced Hallucinations:** Produces 12.3% fewer hallucinations than the non-fine-tuned base model
+
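Because the reasoning trace is wrapped in `<think></think>` tags (see the system prompt above), callers typically want to separate it from the final story before display. A small sketch, assuming the generation follows that format:

```python
import re

def split_reasoning(output_text: str) -> tuple[str, str]:
    """Split a generation into its <think>...</think> reasoning trace and the story."""
    match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    story = re.sub(r"<think>.*?</think>", "", output_text, flags=re.DOTALL).strip()
    return reasoning, story

# Example with a toy generation (real outputs come from the inference sketch above).
reasoning, story = split_reasoning("<think>Frame 1 shows ...</think>Once upon a time ...")
print(story)  # -> "Once upon a time ..."
```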
+ ```
+ @misc{oliveira2025storyreasoningdatasetusingchainofthought,
+   title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
+   author={Daniel A. P. Oliveira and David Martins de Matos},
+   year={2025},
+   eprint={2505.10292},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2505.10292},
+ }
+ ```