---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
datasets:
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
  - name: QwenStoryteller
    results:
      - task:
          type: visual-storytelling
          name: Visual Storytelling
        dataset:
          name: StoryReasoning
          type: daniel3303/StoryReasoning
          split: test
        metrics:
          - name: Character Precision
            type: precision
            value: 0.83
          - name: Object Precision
            type: precision
            value: 0.46
          - name: Total Precision
            type: precision
            value: 0.57
          - name: mAP
            type: mean_average_precision
            value: 0.27
          - name: Character Recall
            type: recall
            value: 0.62
          - name: Object Recall
            type: recall
            value: 0.25
          - name: Total Recall
            type: recall
            value: 0.40
          - name: METEOR
            type: meteor
            value: 0.14
          - name: ROUGE-L
            type: rouge-l
            value: 0.16
          - name: BLEU-4
            type: bleu-4
            value: 0.054
          - name: Description Accuracy
            type: accuracy
            value: 2.76
            description: "Rating on a scale of 1-5"
          - name: Average Hallucinations
            type: error_rate
            value: 3.56
            description: "Average number of hallucinations per story"
library_name: transformers
---

# QwenStoryteller

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency, capable of generating coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

**Base Model:** Qwen2.5-VL 7B  
**Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)  
**Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with rank 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with a peak learning rate of 1×10⁻⁴, a batch size of 32, a warmup over the first 3% of steps, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
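
For reference, a LoRA configuration matching these hyperparameters might look like the minimal sketch below using Hugging Face PEFT. The `target_modules` names are an assumption based on common Qwen2.5-VL attention-projection naming and are not taken from this card.

```python
# Hypothetical PEFT LoRA configuration mirroring the reported hyperparameters.
# target_modules names are assumed, not confirmed by this model card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=2048,            # LoRA rank reported above
    lora_alpha=4096,   # alpha scaling factor reported above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # self-attention of the language components (assumed names)
    lora_dropout=0.0,  # not reported; assumed
    bias="none",
    task_type="CAUSAL_LM",
)
```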

## System Prompt
The model was trained with the following system prompt, and we recommend using it as-is for inference.

```
You are an AI storyteller that can analyze sequences of images and create creative narratives. 
First think step-by-step to analyze characters, objects, settings, and narrative structure. 
Then create a grounded story that maintains consistent character identity and object references across frames. 
Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg")
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system", 
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs, 
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)
```

### Using vLLM for faster inference

For significantly faster inference, you can use vLLM to serve the model. Simply install vLLM and run:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller
```
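
vLLM exposes an OpenAI-compatible API, so the served model can be queried with the standard `openai` client. The sketch below assumes the default local endpoint (`http://localhost:8000/v1`) and sends images as base64 data URLs; the `to_data_url` helper and the sampling parameters are illustrative assumptions, not part of the original card.

```python
# Minimal sketch of querying the vLLM OpenAI-compatible endpoint.
# The base_url, data-URL encoding, and helper function are assumptions
# about a default `vllm serve` setup.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    # Encode a local image file as a base64 data URL accepted by the chat API.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": to_data_url("image1.jpg")}},
                {"type": "image_url", "image_url": {"url": to_data_url("image2.jpg")}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```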

## Output Format

QwenStoryteller produces two main outputs (a parsing sketch follows this list):

1. **Chain-of-Thought Analysis (`<think></think>`):** A structured analysis containing:
   - Character tables with consistent identity references, emotions, actions, and spatial locations
   - Object tables with functions, interactions, and spatial coordinates
   - Setting tables categorizing environmental elements
   - Narrative structure tables modeling story progression

2. **Grounded Story:** A narrative with specialized XML tags linking text to visual elements:
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags for character and object mentions
   - `<gda>`: Action tags for character actions
   - `<gdl>`: Location/landmark tags for background elements
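
As an illustration of post-processing, the hypothetical sketch below splits the chain-of-thought from the story and strips the grounding tags to obtain plain narrative text. The tag names come from the list above; the regular expressions and function name are assumptions.

```python
# Hypothetical post-processing of QwenStoryteller output: separate the
# <think> block from the story and remove <gdi>/<gdo>/<gda>/<gdl> tags.
import re

def parse_output(raw: str) -> tuple[str, str, str]:
    # Extract the <think>...</think> reasoning block, if present.
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    story_grounded = raw[match.end():].strip() if match else raw.strip()
    # Strip the grounding tags (keeping their inner text) for a plain story.
    story_plain = re.sub(r"</?gd[ioal][^>]*>", "", story_grounded)
    return reasoning, story_grounded, story_plain

# Example: `story` is the decoded output from the usage example above.
reasoning, grounded, plain = parse_output(story)
print(plain)
```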

## Limitations

- Re-identification relies primarily on object appearance rather than overall context, which can lead to confusion with similar-looking objects/persons
- Movie-derived training data introduces biases from cinematic composition that may not generalize to candid visual sequences
- Low grounding rates for first-person pronouns as they primarily appear in character dialogues
- May still produce hallucinations, albeit at a reduced rate compared to the base model

## Citation

```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, 
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292}, 
}
```

## Contact

For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira ([email protected])