Blazer.1-7B-Vision
Blazer.1-7B-Vision (4-bit precision) is based on the Qwen2-VL model and is fine-tuned for raw document annotation extraction, optical character recognition (OCR), and solving math problems with LaTeX formatting. It combines a conversational approach with advanced visual and textual understanding to handle multi-modal tasks effectively.

Key enhancements include state-of-the-art (SoTA) performance in understanding images of various resolutions and aspect ratios, as demonstrated on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA. The model also excels at video comprehension and can process videos over 20 minutes long for high-quality video-based question answering, dialogue, and content creation. In addition, Blazer.1-7B-Vision can act as an intelligent agent that operates devices such as mobile phones and robots: its complex reasoning and decision-making abilities enable automatic operation based on the visual environment and text instructions. To serve global users, the model offers multilingual support and understands text in a wide range of languages, including English, Chinese, most European languages, Japanese, Korean, Arabic, and Vietnamese.
Use it with Transformers
The bitsandbytes library is a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8-bit and 4-bit quantization functions.
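Since the card advertises 4-bit precision, here is a minimal sketch of loading the model with bitsandbytes 4-bit quantization; the quantization type and compute dtype below are illustrative assumptions, not settings taken from the original card. The full inference example that follows loads the model in its default precision.

import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Illustrative 4-bit setup (assumed values); adjust quant type / compute dtype as needed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Blazer.1-7B-Vision",
    quantization_config=bnb_config,
    device_map="auto",
)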
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"prithivMLmods/Blazer.1-7B-Vision", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "prithivMLmods/Blazer.1-7B-Vision",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Blazer.1-7B-Vision")
# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
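Video inputs go through the same pipeline; below is a minimal sketch following the upstream Qwen2-VL message format. The file path, max_pixels, and fps values are placeholders, not recommendations from the original card.

video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder path
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# The remaining steps (apply_chat_template, process_vision_info, processor(...), model.generate(...)) are unchanged.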
Buffer Processing
buffer = ""
for new_text in streamer:
buffer += new_text
# Remove <|im_end|> or similar tokens from the output
buffer = buffer.replace("<|im_end|>", "")
yield buffer
Intended Use
Blazer.1-7B-Vision is designed for a variety of multi-modal tasks involving visual and textual data. Its primary use cases include:
- Document Annotation and Extraction: The model is fine-tuned for extracting structured information from raw documents, making it suitable for tasks like automated form processing, invoice extraction, and report generation.
- Optical Character Recognition (OCR): It can accurately recognize and extract text from images and documents in multiple languages, aiding in digitizing physical documents and image-based text extraction.
- Math Problem Solving with LaTeX Formatting: Blazer.1-7B-Vision can handle complex mathematical problems, generate step-by-step solutions, and present them in LaTeX format, making it useful for educational platforms and research support (see the prompt sketch after this list).
- Visual Question Answering (VQA): The model excels at answering questions about images and videos, enabling applications in content moderation, image-based search engines, and interactive virtual assistants.
- Video Comprehension: With the ability to process long videos (over 20 minutes), it is well-suited for video-based dialogue systems, summarization, and content analysis.
- Device Interaction: By integrating visual understanding with decision-making capabilities, the model can serve as an intelligent agent to operate devices like mobile phones and robots, facilitating automation and IoT applications.
- Multilingual Support: The model supports text recognition and understanding in multiple languages, making it ideal for global applications in document processing and OCR tasks.
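These use cases reuse the chat-template pipeline from the inference example above; only the prompt changes. A minimal sketch for math solving with LaTeX output follows; the image URL and prompt wording are illustrative assumptions.

math_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/problem.png"},  # placeholder URL
            {"type": "text", "text": "Solve the problem in this image step by step and give the final answer in LaTeX."},
        ],
    }
]
# Feed math_messages through the same apply_chat_template / process_vision_info / generate steps shown above.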
Limitations
- Performance on Low-Quality Images: Although it performs well on high-resolution images, the model may struggle with low-quality, blurry, or heavily distorted images, leading to errors in OCR or annotation tasks.
- Video Length Limitations: While it can handle videos over 20 minutes, processing very long videos may still result in degraded performance or increased latency, depending on computational resources.
- Generalization Issues: Despite being fine-tuned on various benchmarks, the model may face challenges when encountering data formats or visual environments significantly different from its training set.
- Language Variability: Although it supports multiple languages, the model may exhibit varying accuracy across different languages, with higher performance for those more prevalent in its training data (e.g., English and Chinese).
- Resource Intensive: As a large multi-modal model, it requires significant computational resources for both training and inference, which may limit its usability for smaller-scale deployments.
- Error Propagation in Complex Tasks: When performing tasks that involve both visual and textual understanding, errors in one modality (e.g., incorrect text recognition) can negatively impact the overall result.
- Bias and Safety Concerns: Since the model is trained on publicly available datasets, it may inherit biases present in the data and may occasionally generate unsafe or inappropriate responses in certain contexts.