---
license: apache-2.0
language:
- en
- zh
library_name: transformers
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- trl
- VisionLanguageAttribution
- VisualUnderstanding
- text-generation-inference
- AttributeCaptioning
- VLA
datasets:
- prithivMLmods/blip3o-caption-mini-arrow
- prithivMLmods/Caption3o-Opt-v3
- prithivMLmods/Caption3o-Opt-v2
- >-
  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
---

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/y7C3BvR9PCOwy6I478tkY.png)

# **DeepCaption-VLA-7B**

> The **DeepCaption-VLA-7B** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, tailored for **Image Captioning** and **Vision Language Attribution**. This variant is designed to generate precise, highly descriptive captions with a focus on **defining visual properties, object attributes, and scene details** across a wide spectrum of images and aspect ratios.

# Key Highlights

1. **Vision Language Attribution (VLA):** Specially fine-tuned to attribute and define the visual properties of objects, scenes, and environments.
2. **Detailed Object Definitions:** Generates captions with rich attribute descriptions, making outputs more precise than those of generic captioners.
3. **High-Fidelity Descriptions:** Handles general, artistic, technical, abstract, and low-context images with descriptive depth.
4. **Robust Across Aspect Ratios:** Accurately captions images regardless of format: wide, tall, square, or irregular.
5. **Variational Detail Control:** Supports both concise summaries and fine-grained attributions depending on prompt structure (see the prompt sketch below).
6. **Foundation on the Qwen2.5-VL Architecture:** Leverages Qwen2.5-VL-7B's multimodal reasoning for visual comprehension and instruction-following.
7. **Multilingual Capability:** Defaults to English, but adaptable to multilingual captioning through prompt engineering.

> model type: experimental
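
Detail level and output language are steered purely through the prompt rather than separate settings. The snippet below is a minimal sketch of a few prompt variations packaged in the Qwen2.5-VL chat-message format; the exact wordings are illustrative assumptions, not prompts documented for this model.

```python
# Illustrative prompt variations (assumptions, not official prompts):
# the same model call is used throughout, only the user text changes.
PROMPTS = {
    "concise": "Write a one-sentence caption for this image.",
    "detailed": "Describe this image with detailed attributes and properties.",
    "multilingual_zh": "请用中文为这张图片写一个精确、详细的描述。",
}

def build_user_message(image_url: str, style: str = "detailed") -> dict:
    """Assemble a user turn in the Qwen2.5-VL chat format with the chosen prompt style."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": PROMPTS[style]},
        ],
    }
```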

# Training Details

This model was fine-tuned with a curated mix of datasets focused on **caption richness and object-attribute alignment**:

* [prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)
* [prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)
* [prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)
* [Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)

The training objective emphasized **Vision Language Attribution**: defining image properties, attributes, and objects with clarity, while preserving descriptive fluency.
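
For a quick look at the training mixture, the listed datasets can be pulled from the Hub with the `datasets` library. This is a small sketch, not part of the original card; the `train` split name is an assumption, and no column names are hard-coded since the card does not document the schemas.

```python
from datasets import load_dataset

# Stream one of the caption datasets listed above to avoid a full download.
# The "train" split name is assumed; print the first record rather than
# relying on specific column names.
ds = load_dataset("prithivMLmods/blip3o-caption-mini-arrow", split="train", streaming=True)
print(next(iter(ds)))
```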

---

## SYSTEM_PROMPT

```py
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.
   - Use the syntax: `{class_name==write_the_core_theme}`
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`

4. Maintain the following strict format in your output:
   - **Caption:** <one-sentence description>
   - **Attributes:** <comma-separated list of visual attributes>
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.

""".strip()
```
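
Because the system prompt fixes the response layout (`**Caption:**`, `**Attributes:**`, and a `{class_name==...}` tag), the generated text can be post-processed into structured fields. The parser below is a minimal sketch under the assumption that the model followed the format exactly; it is not part of the original card.

```python
import re

def parse_caption_output(text: str) -> dict:
    """Split a formatted response into caption, attributes, and class_name.

    Assumes the layout requested by CAPTION_SYSTEM_PROMPT:
      - **Caption:** <one-sentence description>
      - **Attributes:** <comma-separated list>
      - **{class_name==core_theme}**
    Missing parts come back as None / an empty list instead of raising.
    """
    caption = re.search(r"\*\*Caption:\*\*\s*(.+)", text)
    attributes = re.search(r"\*\*Attributes:\*\*\s*(.+)", text)
    class_name = re.search(r"\{class_name==([^}]+)\}", text)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": [a.strip() for a in attributes.group(1).split(",")] if attributes else [],
        "class_name": class_name.group(1).strip() if class_name else None,
    }
```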

---

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint and its processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-7B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

# Build the chat-formatted prompt and collect the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, trim the prompt tokens, and decode the caption
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
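
The quick start above sends only a user turn, so the model replies with a free-form description. To get the structured `Caption` / `Attributes` / `{class_name==...}` output, the `CAPTION_SYSTEM_PROMPT` defined earlier can be prepended as a system message. The following is a minimal sketch of that wiring, reusing the `model` and `processor` objects from the quick start; it is an assumption about usage rather than a separately documented recipe.

```python
# Sketch: reuse `model`, `processor`, and CAPTION_SYSTEM_PROMPT from above.
messages = [
    {"role": "system", "content": [{"type": "text", "text": CAPTION_SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Caption this image."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```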

# Intended Use

* Generating attribute-rich image captions for research, dataset creation, and AI training.
* Vision-language attribution for object detection, scene understanding, and dataset annotation.
* Supporting creative, artistic, and technical applications requiring detailed descriptions.
* Captioning across varied aspect ratios, unusual visual styles, and non-standard datasets.

# Limitations

* May over-attribute or infer properties that are not explicitly visible in ambiguous images.
* Output tone can vary depending on prompt phrasing.
* Not intended for safety-filtered captioning tasks; explicit or sensitive content may appear in outputs.
* Accuracy may degrade on synthetic or highly abstract visual domains.