prithivMLmods committed · Commit 6280c0e · verified · 1 Parent(s): c511daa

Update README.md

---
license: apache-2.0
---

# **Needle-2B-VL-Highlights**

> [!NOTE]
> **Needle-2B-VL-Highlights** is a fine-tuned version of *Qwen2-VL-2B-Instruct*, optimized for **image highlight extraction**, **messy handwriting recognition**, **Optical Character Recognition (OCR)**, **English language understanding**, and **math problem solving with LaTeX formatting**. It uses a conversational visual-language interface to handle multi-modal tasks effectively.

# **Key Enhancements**

* **State-of-the-art image comprehension** across varying resolutions and aspect ratios:
  Needle-2B-VL-Highlights delivers top-tier performance on benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Image Highlighting Expertise**:
  Specially tuned to **identify and summarize key visual elements** in an image, ideal for **creating visual highlights**, annotations, and summaries (a minimal sketch appears in the next section).

* **Handwriting OCR Enhanced**:
  Recognizes **messy and complex handwritten notes** with precision, making it well suited to digitizing real-world documents.

* **Video Content Understanding**:
  Capable of processing videos longer than 20 minutes for **context-aware Q&A, transcription**, and **highlight extraction** (see the video sketch after the demo code below).

* **Multi-device Integration**:
  Usable as an intelligent agent on mobile phones, robots, and other devices, able to **understand visual scenes and execute actions**.

* **Multilingual OCR Support**:
  In addition to English and Chinese, supports OCR for European languages, Japanese, Korean, Arabic, and Vietnamese.

# **Run with Transformers 🤗**

```py
%%capture
!pip install -q gradio spaces transformers accelerate
!pip install -q numpy requests torch torchvision
!pip install -q qwen-vl-utils av ipython reportlab
!pip install -q fpdf python-docx pillow huggingface_hub
```

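For a quick check without the Gradio UI, a minimal single-image call can look like the sketch below. It follows the standard Qwen2-VL usage pattern; `example.jpg` and the prompt are placeholders, and the float16/CUDA settings mirror the demo, so adjust them to your hardware.

```py
# Minimal highlight-extraction sketch (no Gradio); the image path is a placeholder.
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "prithivMLmods/Needle-2B-VL-Highlights"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},  # placeholder path
        {"type": "text", "text": "Create highlights of this image."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The full demo application follows.
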
```py
# Demo: Gradio app that answers questions about an uploaded image and
# exports the result as a PDF or DOCX document.
import gradio as gr
import spaces
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, TextIteratorStreamer
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image
import uuid
import io
from threading import Thread
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Image as RLImage, Paragraph, Spacer
from reportlab.lib.units import inch
import docx
from docx.enum.text import WD_ALIGN_PARAGRAPH

# Define model options
MODEL_OPTIONS = {
    "Needle-2B-VL-Highlights": "prithivMLmods/Needle-2B-VL-Highlights",
}

# Preload models and processors onto CUDA
models = {}
processors = {}
for name, model_id in MODEL_OPTIONS.items():
    print(f"Loading {name}...")
    models[name] = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id,
        trust_remote_code=True,
        torch_dtype=torch.float16
    ).to("cuda").eval()
    processors[name] = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image_extensions = Image.registered_extensions()

def identify_and_save_blob(blob_path):
    """Identifies whether the blob is an image and saves it to a temp file."""
    try:
        with open(blob_path, "rb") as file:
            blob_content = file.read()
        try:
            Image.open(io.BytesIO(blob_content)).verify()  # Check that it is a valid image
            extension = ".png"  # Default to PNG for saving
            media_type = "image"
        except (IOError, SyntaxError):
            raise ValueError("Unsupported media type. Please upload a valid image.")

        filename = f"temp_{uuid.uuid4()}_media{extension}"
        with open(filename, "wb") as f:
            f.write(blob_content)

        return filename, media_type

    except FileNotFoundError:
        raise ValueError(f"The file {blob_path} was not found.")
    except Exception as e:
        raise ValueError(f"An error occurred while processing the file: {e}")

@spaces.GPU
def qwen_inference(model_name, media_input, text_input=None):
    """Streams inference for the selected model."""
    model = models[model_name]
    processor = processors[model_name]

    if isinstance(media_input, str):
        media_path = media_input
        if media_path.endswith(tuple(image_extensions)):
            media_type = "image"
        else:
            try:
                media_path, media_type = identify_and_save_blob(media_input)
            except Exception:
                raise ValueError("Unsupported media type. Please upload a valid image.")
    else:
        raise ValueError("Unsupported input type. Please upload a valid image.")

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": media_type,
                    media_type: media_path
                },
                {"type": "text", "text": text_input},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")

    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=1024)

    # Run generation on a background thread so tokens can be streamed as they arrive
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Remove <|im_end|> or similar tokens from the output
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer

def format_plain_text(output_text):
    """Formats the output text as plain text by stripping LaTeX delimiters."""
    plain_text = output_text.replace("\\(", "").replace("\\)", "").replace("\\[", "").replace("\\]", "")
    return plain_text

def generate_document(media_path, output_text, file_format, font_size, line_spacing, alignment, image_size):
    """Generates a document containing the input image and the plain-text output."""
    plain_text = format_plain_text(output_text)
    if file_format == "pdf":
        return generate_pdf(media_path, plain_text, font_size, line_spacing, alignment, image_size)
    elif file_format == "docx":
        return generate_docx(media_path, plain_text, font_size, line_spacing, alignment, image_size)

def generate_pdf(media_path, plain_text, font_size, line_spacing, alignment, image_size):
    """Generates a PDF document."""
    filename = f"output_{uuid.uuid4()}.pdf"
    doc = SimpleDocTemplate(
        filename,
        pagesize=A4,
        rightMargin=inch,
        leftMargin=inch,
        topMargin=inch,
        bottomMargin=inch
    )
    styles = getSampleStyleSheet()
    styles["Normal"].fontSize = int(font_size)
    styles["Normal"].leading = int(font_size) * line_spacing
    styles["Normal"].alignment = {
        "Left": 0,
        "Center": 1,
        "Right": 2,
        "Justified": 4
    }[alignment]

    story = []

    # Add image with size adjustment
    image_sizes = {
        "Small": (200, 200),
        "Medium": (400, 400),
        "Large": (600, 600)
    }
    img = RLImage(media_path, width=image_sizes[image_size][0], height=image_sizes[image_size][1])
    story.append(img)
    story.append(Spacer(1, 12))

    # Add plain-text output
    text = Paragraph(plain_text, styles["Normal"])
    story.append(text)

    doc.build(story)
    return filename

def generate_docx(media_path, plain_text, font_size, line_spacing, alignment, image_size):
    """Generates a DOCX document."""
    filename = f"output_{uuid.uuid4()}.docx"
    doc = docx.Document()

    # Add image with size adjustment
    image_sizes = {
        "Small": docx.shared.Inches(2),
        "Medium": docx.shared.Inches(4),
        "Large": docx.shared.Inches(6)
    }
    doc.add_picture(media_path, width=image_sizes[image_size])
    doc.add_paragraph()

    # Add plain-text output
    paragraph = doc.add_paragraph()
    paragraph.paragraph_format.line_spacing = line_spacing
    paragraph.paragraph_format.alignment = {
        "Left": WD_ALIGN_PARAGRAPH.LEFT,
        "Center": WD_ALIGN_PARAGRAPH.CENTER,
        "Right": WD_ALIGN_PARAGRAPH.RIGHT,
        "Justified": WD_ALIGN_PARAGRAPH.JUSTIFY
    }[alignment]
    run = paragraph.add_run(plain_text)
    run.font.size = docx.shared.Pt(int(font_size))

    doc.save(filename)
    return filename

# CSS for output styling
css = """
#output {
    height: 500px;
    overflow: auto;
    border: 1px solid #ccc;
}
.submit-btn {
    background-color: #cf3434 !important;
    color: white !important;
}
.submit-btn:hover {
    background-color: #ff2323 !important;
}
.download-btn {
    background-color: #35a6d6 !important;
    color: white !important;
}
.download-btn:hover {
    background-color: #22bcff !important;
}
"""

# Gradio app setup
with gr.Blocks(css=css) as demo:
    gr.Markdown("# Qwen2VL Models: Vision and Language Processing")

    with gr.Tab(label="Image Input"):
        with gr.Row():
            with gr.Column():
                model_choice = gr.Dropdown(
                    label="Model Selection",
                    choices=list(MODEL_OPTIONS.keys()),
                    value="Needle-2B-VL-Highlights"
                )
                input_media = gr.File(
                    label="Upload Image", type="filepath"
                )
                text_input = gr.Textbox(label="Question", placeholder="Ask a question about the image...")
                submit_btn = gr.Button(value="Submit", elem_classes="submit-btn")

            with gr.Column():
                output_text = gr.Textbox(label="Output Text", lines=10)
                plain_text_output = gr.Textbox(label="Standardized Plain Text", lines=10)

        submit_btn.click(
            qwen_inference, [model_choice, input_media, text_input], [output_text]
        ).then(
            format_plain_text, [output_text], [plain_text_output]
        )

        # Document export settings
        with gr.Row():
            with gr.Column():
                line_spacing = gr.Dropdown(
                    choices=[0.5, 1.0, 1.15, 1.5, 2.0, 2.5, 3.0],
                    value=1.5,
                    label="Line Spacing"
                )
                font_size = gr.Dropdown(
                    choices=["8", "10", "12", "14", "16", "18", "20", "22", "24"],
                    value="18",
                    label="Font Size"
                )
                alignment = gr.Dropdown(
                    choices=["Left", "Center", "Right", "Justified"],
                    value="Justified",
                    label="Text Alignment"
                )
                image_size = gr.Dropdown(
                    choices=["Small", "Medium", "Large"],
                    value="Small",
                    label="Image Size"
                )
                file_format = gr.Radio(["pdf", "docx"], label="File Format", value="pdf")
                get_document_btn = gr.Button(value="Get Document", elem_classes="download-btn")

        get_document_btn.click(
            generate_document,
            [input_media, output_text, file_format, font_size, line_spacing, alignment, image_size],
            gr.File(label="Download Document")
        )

demo.launch(debug=True)
```
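The demo above handles images only. For the video capability listed under Key Enhancements, a sketch following the standard Qwen2-VL video message format is shown below. It reuses the `model` and `processor` from the quick-start sketch; `video.mp4` and the `fps` value are placeholders, and long videos may need a lower sampling rate to fit the memory budget of a 2B model.

```py
# Hedged video-Q&A sketch; assumes `model` and `processor` from the quick-start sketch.
from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "video.mp4", "fps": 1.0},  # placeholder path and sampling rate
        {"type": "text", "text": "Summarize the key highlights of this clip."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
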
### **Key Features**

1. **Visual Highlights Generator:**
   - Extracts **key objects, regions, and contextual clues** from images and turns them into meaningful **visual summaries**.

2. **Advanced Handwriting OCR:**
   - Excels at recognizing and transcribing **messy or cursive handwriting** into digital text.

3. **Vision-Language Fusion:**
   - Seamlessly integrates **visual input** with **language reasoning**, ideal for image captioning, description, and Q&A.

4. **Math and LaTeX Support:**
   - Understands math problems presented as images or text and outputs solutions in **LaTeX syntax**.

5. **Conversational AI:**
   - Supports **multi-turn dialogue** that retains prior inputs, useful for interactive problem-solving and explanations (see the sketch after this list).

6. **Multi-modal Input Capability:**
   - Accepts **images, text, or a combination**, and generates output tailored to the input.
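
Multi-turn dialogue works by replaying the conversation history on each call: append the assistant's reply and the next user message, then generate again. The sketch below assumes the `model` and `processor` objects from the quick-start sketch; `equation.png` is a placeholder scan of a math problem.

```py
# Multi-turn sketch: the full message history is resent on every call.
from qwen_vl_utils import process_vision_info

def chat(messages, max_new_tokens=512):
    """Runs one generation step over the accumulated conversation."""
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "equation.png"},  # placeholder scan of a math problem
        {"type": "text", "text": "Solve this problem and format the answer in LaTeX."},
    ],
}]
answer = chat(messages)
print(answer)

# Carry the reply forward and ask a follow-up about the same image.
messages.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Explain the second step in more detail."}]})
print(chat(messages))
```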