---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
datasets:
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
  - name: QwenStoryteller
    results:
      - task:
          type: visual-storytelling
          name: Visual Storytelling
        dataset:
          name: StoryReasoning
          type: daniel3303/StoryReasoning
          split: test
        metrics:
          - name: Character Precision
            type: precision
            value: 0.83
          - name: Object Precision
            type: precision
            value: 0.46
          - name: Total Precision
            type: precision
            value: 0.57
          - name: mAP
            type: mean_average_precision
            value: 0.27
          - name: Character Recall
            type: recall
            value: 0.62
          - name: Object Recall
            type: recall
            value: 0.25
          - name: Total Recall
            type: recall
            value: 0.40
          - name: METEOR
            type: meteor
            value: 0.14
          - name: ROUGE-L
            type: rouge-l
            value: 0.16
          - name: BLEU-4
            type: bleu-4
            value: 0.054
          - name: Description Accuracy
            type: accuracy
            value: 2.76
            description: "Rating on a scale of 1-5"
          - name: Average Hallucinations
            type: error_rate
            value: 3.56
            description: "Average number of hallucinations per story"
library_name: transformers
---

# QwenStoryteller

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency, capable of generating coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

**Base Model:** Qwen2.5-VL 7B  
**Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)  
**Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with rank 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with a peak learning rate of 1×10⁻⁴, a batch size of 32, a warmup over the first 3% of steps, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
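
For reference, a LoRA configuration matching these hyperparameters might look like the minimal sketch below using Hugging Face PEFT. The `target_modules` names are an assumption based on common Qwen2.5-VL attention-projection naming and are not taken from this card.

```python
# Hypothetical PEFT LoRA configuration mirroring the reported hyperparameters.
# target_modules names are assumed, not confirmed by this model card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=2048,            # LoRA rank reported above
    lora_alpha=4096,   # alpha scaling factor reported above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # self-attention of the language components (assumed names)
    lora_dropout=0.0,  # not reported; assumed
    bias="none",
    task_type="CAUSAL_LM",
)
```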

## System Prompt
The model was trained with the following system prompt, and we recommend using it as-is for inference.

```
You are an AI storyteller that can analyze sequences of images and create creative narratives. 
First think step-by-step to analyze characters, objects, settings, and narrative structure. 
Then create a grounded story that maintains consistent character identity and object references across frames. 
Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg")
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system", 
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs, 
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)
```

### Using vLLM for faster inference

For significantly faster inference, you can use vLLM to serve the model. Simply install vLLM and run:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller
```
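
vLLM exposes an OpenAI-compatible API, so the served model can be queried with the standard `openai` client. The sketch below assumes the default local endpoint (`http://localhost:8000/v1`) and sends images as base64 data URLs; the `to_data_url` helper and the sampling parameters are illustrative assumptions, not part of the original card.

```python
# Minimal sketch of querying the vLLM OpenAI-compatible endpoint.
# The base_url, data-URL encoding, and helper function are assumptions
# about a default `vllm serve` setup.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    # Encode a local image file as a base64 data URL accepted by the chat API.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": to_data_url("image1.jpg")}},
                {"type": "image_url", "image_url": {"url": to_data_url("image2.jpg")}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```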

## Output Format

QwenStoryteller produces two main outputs (a parsing sketch follows this list):

1. **Chain-of-Thought Analysis (`<think></think>`):** A structured analysis containing:
   - Character tables with consistent identity references, emotions, actions, and spatial locations
   - Object tables with functions, interactions, and spatial coordinates
   - Setting tables categorizing environmental elements
   - Narrative structure tables modeling story progression

2. **Grounded Story:** A narrative with specialized XML tags linking text to visual elements:
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags for character and object mentions
   - `<gda>`: Action tags for character actions
   - `<gdl>`: Location/landmark tags for background elements
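
As an illustration of post-processing, the hypothetical sketch below splits the chain-of-thought from the story and strips the grounding tags to obtain plain narrative text. The tag names come from the list above; the regular expressions and function name are assumptions.

```python
# Hypothetical post-processing of QwenStoryteller output: separate the
# <think> block from the story and remove <gdi>/<gdo>/<gda>/<gdl> tags.
import re

def parse_output(raw: str) -> tuple[str, str, str]:
    # Extract the <think>...</think> reasoning block, if present.
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    story_grounded = raw[match.end():].strip() if match else raw.strip()
    # Strip the grounding tags (keeping their inner text) for a plain story.
    story_plain = re.sub(r"</?gd[ioal][^>]*>", "", story_grounded)
    return reasoning, story_grounded, story_plain

# Example: `story` is the decoded output from the usage example above.
reasoning, grounded, plain = parse_output(story)
print(plain)
```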

## Limitations

- Re-identification relies primarily on object appearance rather than overall context, which can lead to confusion with similar-looking objects/persons
- Movie-derived training data introduces biases from cinematic composition that may not generalize to candid visual sequences
- Low grounding rates for first-person pronouns as they primarily appear in character dialogues
- May still produce hallucinations, albeit at a reduced rate compared to the base model

## Citation

```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, 
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292}, 
}
```

## Contact

For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira ([email protected])