|
|
--- |
|
|
language: en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- vision-language-model |
|
|
- visual-storytelling |
|
|
- chain-of-thought |
|
|
- grounded-text-generation |
|
|
- cross-frame-consistency |
|
|
- storytelling |
|
|
- image-to-text |
|
|
- contrastive-learning |
|
|
- reinforcement-learning |
|
|
- entity-reidentification |
|
|
datasets: |
|
|
- daniel3303/StoryReasoningAdversarialDPO |
|
|
- daniel3303/StoryReasoning |
|
|
metrics: |
|
|
- precision |
|
|
- recall |
|
|
- bleu |
|
|
- meteor |
|
|
- rouge |
|
|
- map |
|
|
base_model: |
|
|
- daniel3303/QwenStoryteller |
|
|
pipeline_tag: image-to-text |
|
|
model-index: |
|
|
- name: QwenStoryteller2 |
|
|
results: |
|
|
- task: |
|
|
type: visual-storytelling |
|
|
name: Visual Storytelling |
|
|
dataset: |
|
|
name: StoryReasoningAdversarialDPO |
|
|
type: daniel3303/StoryReasoningAdversarialDPO |
|
|
split: test |
|
|
metrics: |
|
|
- name: Character Precision |
|
|
type: precision |
|
|
value: 0.78 |
|
|
- name: Object Precision |
|
|
type: precision |
|
|
value: 0.29 |
|
|
- name: Total Precision |
|
|
type: precision |
|
|
value: 0.45 |
|
|
- name: mAP |
|
|
type: mean_average_precision |
|
|
value: 0.31 |
|
|
- name: Character Recall |
|
|
type: recall |
|
|
value: 0.77 |
|
|
- name: Object Recall |
|
|
type: recall |
|
|
value: 0.28 |
|
|
- name: Total Recall |
|
|
type: recall |
|
|
value: 0.48 |
|
|
- name: F1 Score |
|
|
type: f1 |
|
|
value: 0.41 |
|
|
- name: METEOR |
|
|
type: meteor |
|
|
value: 0.17 |
|
|
- name: ROUGE-L |
|
|
type: rouge-l |
|
|
value: 0.18 |
|
|
- name: BLEU-4 |
|
|
type: bleu-4 |
|
|
value: 0.057 |
|
|
- name: Character Persistence (≥5 frames) |
|
|
type: accuracy |
|
|
value: 0.493 |
|
|
- name: Object Persistence (≥5 frames) |
|
|
type: accuracy |
|
|
value: 0.213 |
|
|
- name: Well-structured Stories |
|
|
type: accuracy |
|
|
value: 0.975 |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# QwenStoryteller2 |
|
|
|
|
|
QwenStoryteller2 is an improved version of QwenStoryteller, fine-tuned using contrastive reinforcement learning with Direct Preference Optimization (DPO) to strengthen entity re-identification and visual grounding in cross-frame storytelling.
|
|
|
|
|
## Model Description |
|
|
|
|
|
**Base Model:** QwenStoryteller (Qwen2.5-VL 7B) |
|
|
**Training Method:** Contrastive Reinforcement Learning with Direct Preference Optimization (LoRA rank 2048, alpha 4096) |
|
|
**Training Dataset:** [StoryReasoningAdversarialDPO](https://huggingface.co/datasets/daniel3303/StoryReasoningAdversarialDPO) |
|
|
|
|
|
QwenStoryteller2 builds upon the original QwenStoryteller by addressing critical limitations in cross-frame entity consistency through: |
|
|
- **Contrastive Learning:** Training on both real stories and synthetic negative examples (incoherent image sequences)
|
|
- **Enhanced Entity Re-identification:** Improved tracking of characters and objects across frames |
|
|
- **Better Grounding:** Superior alignment between narrative elements and visual entities |
|
|
- **Reduced Hallucinations:** More reliable entity connections and fewer spurious references |
|
|
|
|
|
The model employs a dual-component reward function that promotes appropriate entity connections in coherent sequences while discouraging incorrect connections in synthetic arrangements. |
|
|
|
|
|
## Key Improvements Over QwenStoryteller |
|
|
|
|
|
- **Grounding Performance:** mAP improved from 0.27 to 0.31 (+14.8%), F1 score from 0.35 to 0.41 (+17.1%) |
|
|
- **Cross-frame Consistency:** Character persistence on ≥5 frames increased from 37.7% to 49.3% (+30.8%) |
|
|
- **Pronoun Grounding:** Significant improvements across all pronoun types (he: 90.1%→99.1%, she: 91.1%→98.6%, they: 47.6%→68.8%) |
|
|
- **Structural Quality:** Well-structured stories increased from 79.1% to 97.5% (+23.3%) |
|
|
- **Entity Tracking:** Object persistence on ≥5 frames improved from 20.9% to 21.3% |
|
|
|
|
|
## System Prompt |
|
|
|
|
|
The model was trained with the following system prompt, and we recommend using it for optimal performance: |
|
|
|
|
|
``` |
|
|
You are an AI storyteller that can analyze sequences of images and create creative narratives. |
|
|
First think step-by-step to analyze characters, objects, settings, and narrative structure. |
|
|
Then create a grounded story that maintains consistent character identity and object references across frames. |
|
|
Use <think></think> tags to show your reasoning process before writing the final story. |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
|
from qwen_vl_utils import process_vision_info |
|
|
import torch |
|
|
from PIL import Image |
|
|
|
|
|
# Load the model |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
"daniel3303/QwenStoryteller2", torch_dtype="auto", device_map="auto" |
|
|
) |
|
|
|
|
|
# Load processor |
|
|
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller2") |
|
|
|
|
|
# Load images |
|
|
images = [ |
|
|
Image.open("image1.jpg"), |
|
|
Image.open("image2.jpg"), |
|
|
Image.open("image3.jpg"), |
|
|
Image.open("image4.jpg"), |
|
|
Image.open("image5.jpg") |
|
|
] |
|
|
|
|
|
# Create image content list |
|
|
image_content = [] |
|
|
for img in images: |
|
|
image_content.append({ |
|
|
"type": "image", |
|
|
"image": img, |
|
|
}) |
|
|
|
|
|
# Add text prompt at the end |
|
|
image_content.append({"type": "text", "text": "Generate a story based on these images."}) |
|
|
|
|
|
# Create messages with system prompt |
|
|
messages = [ |
|
|
{ |
|
|
"role": "system", |
|
|
"content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story." |
|
|
}, |
|
|
{ |
|
|
"role": "user", |
|
|
"content": image_content, |
|
|
} |
|
|
] |
|
|
|
|
|
# Preparation for inference |
|
|
text = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
image_inputs, video_inputs = process_vision_info(messages) |
|
|
inputs = processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
) |
|
|
inputs = inputs.to(model.device) |
|
|
|
|
|
# Inference: Generation of the output |
|
|
generated_ids = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=4096, |
|
|
do_sample=True, |
|
|
temperature=0.7, |
|
|
top_p=0.9 |
|
|
) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
story = processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
)[0] |
|
|
|
|
|
print(story) |
|
|
``` |
|
|
|
|
|
### Using vLLM for faster inference |
|
|
|
|
|
For significantly faster inference, you can use vLLM to serve the model: |
|
|
|
|
|
```bash |
|
|
# Install vLLM |
|
|
pip install vllm |
|
|
|
|
|
# Serve the model with vLLM |
|
|
vllm serve daniel3303/QwenStoryteller2 |
|
|
``` |
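
Once the server is running, it exposes an OpenAI-compatible API. The snippet below is a minimal client sketch, assuming the default port (8000), a placeholder API key, and base64 data URLs for local images; depending on your vLLM version you may also need to raise the per-prompt image limit when serving (for example via `--limit-mm-per-prompt`) so that five-image prompts are accepted.

```python
# Minimal client sketch for the vLLM server above, using its OpenAI-compatible
# API. The port, api_key value, and data-URL encoding of local images are
# assumptions, not part of this model card.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references "
    "across frames. Use <think></think> tags to show your reasoning process before writing the final story."
)


def to_data_url(path: str) -> str:
    # Encode a local image as a base64 data URL so it can be sent over HTTP.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


content = [
    {"type": "image_url", "image_url": {"url": to_data_url(f"image{i}.jpg")}}
    for i in range(1, 6)
]
content.append({"type": "text", "text": "Generate a story based on these images."})

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller2",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": content},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.9,
)
print(response.choices[0].message.content)
```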
|
|
|
|
|
## Training Methodology |
|
|
|
|
|
### Contrastive Learning Framework |
|
|
|
|
|
QwenStoryteller2 was trained using a novel contrastive reinforcement learning approach: |
|
|
|
|
|
1. **Synthetic Story Generation:** Extended the StoryReasoning dataset with 4,178 synthetic stories, generated by sampling images from different movies to form incoherent sequences
|
|
2. **Dual-Component Reward Function:** Combined entity re-identification (R_reid) and grounding (R_ground) rewards with structural validation |
|
|
3. **Direct Preference Optimization:** Used offline preference pairs generated from the reward function to train the model; a schematic example of one such pair is sketched below
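
For illustration, a single offline preference record might look like the sketch below. The `prompt`/`chosen`/`rejected` field names follow the convention expected by TRL's `DPOTrainer`; the exact schema of StoryReasoningAdversarialDPO and the placeholder story texts are assumptions, not actual dataset contents.

```python
# Schematic DPO preference record (illustrative only; not an actual row from
# StoryReasoningAdversarialDPO). Field names follow the prompt/chosen/rejected
# convention used by TRL's DPOTrainer.
preference_pair = {
    # Shared prompt: the storytelling instruction issued with the image sequence.
    "prompt": "Generate a story based on these images.",
    # Completion favored by the reward function: coherent cross-frame entity
    # re-identification and grounding on a real image sequence.
    "chosen": "<think>...</think> A grounded story whose entity tags stay consistent across frames ...",
    # Completion penalized by the reward function: spurious entity links, e.g.
    # re-identifying characters across images sampled from different movies.
    "rejected": "<think>...</think> A story that links unrelated entities across frames ...",
}
```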
|
|
|
|
|
### Reward Function Components |
|
|
|
|
|
- **Entity Re-identification Reward:** Tracks character and object persistence across frames, promoting connections in real stories while penalizing them in synthetic ones |
|
|
- **Grounding Reward:** Evaluates pronoun and proper noun grounding to visual entities |
|
|
- **Structure Validation:** Ensures generated outputs maintain the required format and consistency; a schematic sketch of how these reward components combine is shown below
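
The sketch below shows one way these components could be combined into a single scalar reward. The weights, normalization, and sign flip for synthetic sequences are illustrative assumptions, not the exact formulation used during training.

```python
def story_reward(
    r_reid: float,
    r_ground: float,
    is_real_sequence: bool,
    structure_valid: bool,
    w_reid: float = 0.5,
    w_ground: float = 0.5,
) -> float:
    """Schematic combination of the dual-component reward (illustrative only).

    r_reid   -- entity re-identification score, assumed in [0, 1]
    r_ground -- pronoun / proper-noun grounding score, assumed in [0, 1]
    The weights and sign convention are assumptions; the formulation actually
    used to train QwenStoryteller2 may differ.
    """
    if not structure_valid:
        # Outputs that violate the required structure receive no reward.
        return 0.0
    # Cross-frame entity connections are encouraged on real (coherent) sequences
    # and discouraged on synthetic (incoherent) ones.
    reid_term = r_reid if is_real_sequence else -r_reid
    return w_reid * reid_term + w_ground * r_ground
```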
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Method:** Direct Preference Optimization (DPO) with LoRA fine-tuning (see the sketch below)
|
|
- **LoRA Parameters:** Rank 2048, alpha 4096 |
|
|
- **Optimizer:** AdamW with learning rate 5×10⁻⁶ |
|
|
- **Batch Size:** 8 |
|
|
- **Epochs:** 3 |
|
|
- **Temperature Parameter (β):** 0.1 |
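
The configuration above maps roughly onto TRL's `DPOTrainer` with a PEFT/LoRA adapter. The sketch below is a minimal illustration under stated assumptions (target modules, dataset handling, and precision settings); it is not the authors' actual training script.

```python
# Minimal sketch mapping the configuration above onto TRL's DPOTrainer with a
# LoRA adapter. Target modules, dataset preprocessing, and precision settings
# are assumptions; this is not the authors' training script.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from trl import DPOConfig, DPOTrainer

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")
train_dataset = load_dataset("daniel3303/StoryReasoningAdversarialDPO", split="train")

peft_config = LoraConfig(
    r=2048,           # LoRA rank, as reported above
    lora_alpha=4096,  # LoRA alpha, as reported above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="qwen-storyteller2-dpo",
    beta=0.1,                       # DPO temperature parameter
    learning_rate=5e-6,             # AdamW learning rate
    per_device_train_batch_size=8,  # batch size of 8 (device layout assumed)
    num_train_epochs=3,
    bf16=True,                      # assumed precision
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=processor,
    peft_config=peft_config,
)
trainer.train()
```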
|
|
|
|
|
## Performance Metrics |
|
|
|
|
|
| Metric | QwenStoryteller | QwenStoryteller2 | Change |
|
|
|--------|-----------------|------------------|-------------| |
|
|
| Character Precision | 0.83 | 0.78 | -6.0% | |
|
|
| Object Precision | 0.46 | 0.29 | -37.0% | |
|
|
| Total Precision | 0.57 | 0.45 | -21.1% | |
|
|
| mAP | 0.27 | 0.31 | +14.8% | |
|
|
| Character Recall | 0.62 | 0.77 | +24.2% | |
|
|
| Object Recall | 0.25 | 0.28 | +12.0% | |
|
|
| Total Recall | 0.40 | 0.48 | +20.0% | |
|
|
| F1 Score | 0.35 | 0.41 | +17.1% | |
|
|
| METEOR | 0.14 | 0.17 | +21.4% | |
|
|
| ROUGE-L | 0.16 | 0.18 | +12.5% | |
|
|
| BLEU-4 | 0.054 | 0.057 | +5.6% | |
|
|
|
|
|
## Output Format |
|
|
|
|
|
QwenStoryteller2 produces enhanced outputs with improved consistency: |
|
|
|
|
|
1. **Chain-of-Thought Analysis (`<think></think>`):** More accurate structured analysis with: |
|
|
- Improved character tables with consistent identity references |
|
|
- Better object tracking with more accurate spatial coordinates
|
|
- More reliable setting categorization |
|
|
- Stronger narrative structure modeling |
|
|
|
|
|
2. **Grounded Story:** Enhanced narrative with specialized XML tags: |
|
|
- `<gdi>`: Image tags for specific frames |
|
|
- `<gdo>`: Entity reference tags with improved accuracy |
|
|
- `<gda>`: Action tags with better character-action alignment |
|
|
- `<gdl>`: Location/landmark tags with enhanced spatial grounding |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- **Enhanced Cross-Frame Consistency:** Superior character and object identity maintenance through contrastive learning |
|
|
- **Improved Pronoun Grounding:** Better alignment of pronouns with visual entities (up to 99.1% for "he", 98.6% for "she") |
|
|
- **Reduced Hallucinations:** Fewer incorrect entity connections and spurious references |
|
|
- **Robust Entity Discrimination:** Learned ability to distinguish when cross-frame connections are appropriate |
|
|
- **Better Structural Quality:** Near-perfect adherence to expected output format (97.5%) |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Precision scores are lower than the original model's, a trade-off that accompanies the substantially higher recall
|
|
- Training data derived from movies may introduce cinematic biases |
|
|
- Entity re-identification still relies primarily on visual similarity within bounding boxes |
|
|
- Performance has been validated only at the 7B parameter scale
|
|
- Optimal real-to-synthetic story ratio (2:1) may not generalize to all scenarios |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
TODO |
|
|
|
|
|
@misc{oliveira2025storyreasoningdatasetusingchainofthought, |
|
|
title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, |
|
|
author={Daniel A. P. Oliveira and David Martins de Matos}, |
|
|
year={2025}, |
|
|
eprint={2505.10292}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2505.10292} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or feedback regarding this model, please contact: |
|
|
- Daniel A. P. Oliveira ([email protected]) |