Caption-Pro

Caption-Pro is an image captioning and annotation model optimized for generating detailed, structured JSON output. Built on the Qwen2-VL vision-language architecture with enhanced OCR and multilingual support, it extracts captions and annotations from images for integration into your applications.

Key Enhancements:

  • Advanced Image Understanding: Fine-tuned on millions of annotated images, Caption-Pro delivers precise comprehension and interpretation of visual content.
  • Optimized for JSON Output: Produces structured JSON containing captions and detailed annotations, suited to databases, APIs, and automation pipelines.
  • Enhanced OCR Capabilities: Accurately extracts textual content from images in multiple languages, including English, Chinese, Japanese, Korean, Arabic, and more.
  • Multimodal Processing: Accepts an image together with a text prompt and generates annotations for the provided image.
  • Multilingual Support: Recognizes and processes text within images across various languages.
  • Secure and Optimized Model Weights: Employs safetensors for efficient and secure model loading.

How to Use

import torch  # only needed for the optional bfloat16 configuration below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the Caption-Pro model with optimized parameters
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Caption-Pro", torch_dtype="auto", device_map="auto"
)

# Optional: load in bfloat16 with FlashAttention-2 for faster inference and lower
# memory use (requires the flash-attn package and a supported GPU):
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/Caption-Pro",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Load the default processor for Caption-Pro
processor = AutoProcessor.from_pretrained("prithivMLmods/Caption-Pro")

# Define the input messages with both an image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://flux-generated.com/sample_image.jpeg",
            },
            {"type": "text", "text": "Provide detailed captions and annotations for this image in JSON format."},
        ],
    }
]

# Prepare the input for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# batch_decode returns one string per input; print the single result
print(output_text[0])
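Because the model returns its JSON as plain text, the output usually needs to be parsed before it can be stored or passed to other code. The following is a minimal sketch of one way to do that, assuming the response may wrap the JSON in a Markdown code fence or surround it with extra prose. The helper name `extract_json` and the sample keys (`caption`, `objects`) are hypothetical; this card does not document the actual output schema.

```python
import json
import re

# Three backticks, built programmatically so this snippet can live inside a fence.
FENCE = "`" * 3

# Matches a Markdown code fence, optionally tagged "json", and captures its body.
_FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def extract_json(text):
    """Return the first JSON object or array found in `text`, or None."""
    # Prefer the contents of a code fence if the model emitted one.
    fenced = _FENCE_RE.search(text)
    candidate = fenced.group(1) if fenced else text

    decoder = json.JSONDecoder()
    # Try decoding from each '{' or '[' until one parses successfully.
    for match in re.finditer(r"[{\[]", candidate):
        try:
            value, _ = decoder.raw_decode(candidate, match.start())
            return value
        except json.JSONDecodeError:
            continue
    return None


# Hypothetical model response, for illustration only.
sample = (
    "Here is the result:\n"
    + FENCE
    + 'json\n{"caption": "a cat", "objects": ["cat"]}\n'
    + FENCE
)
parsed = extract_json(sample)
print(parsed)
```

With the usage example above, this would be called as `extract_json(output_text[0])`. If generation stops at `max_new_tokens` before the JSON is complete, parsing fails and the helper returns `None`, so raising `max_new_tokens` may be necessary for detailed annotations.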

Key Features

  1. Annotation-Ready Training Data

    • Trained on a diverse dataset of annotated images to support high-quality structured output.

  2. Optical Character Recognition (OCR)

    • Extracts and processes text from images in various languages and scripts.

  3. Structured JSON Output

    • Generates detailed captions and annotations in JSON format for easy downstream integration.

  4. Image & Text Processing

    • Handles both visual and textual inputs, delivering context-aware annotations.

  5. Conversational Annotation Generation

    • Supports multi-turn interactions, enabling iterative refinement of annotations.

  6. Secure and Efficient Model Weights

    • Uses safetensors for safer and faster loading of model weights.
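The multi-turn refinement described in item 5 can be sketched by extending the message list from the usage example with the model's previous reply and a new user instruction. This is a minimal illustration, assuming the same role/content message format shown in How to Use. The helper name `add_followup` and the sample texts are hypothetical.

```python
import copy


def add_followup(messages, assistant_text, user_text):
    """Return a new message list with the model's reply and a follow-up user turn."""
    # Copy so the caller's original conversation is left unchanged.
    extended = copy.deepcopy(messages)
    extended.append(
        {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]}
    )
    extended.append(
        {"role": "user", "content": [{"type": "text", "text": user_text}]}
    )
    return extended


# First turn, in the same format as the usage example (placeholder image URL).
first_turn = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/sample.jpg"},
            {"type": "text", "text": "Provide captions for this image in JSON format."},
        ],
    }
]

# Hypothetical previous reply and refinement request.
conversation = add_followup(
    first_turn,
    '{"caption": "a cat"}',
    "Shorten the caption to one sentence.",
)
print([m["role"] for m in conversation])
```

The extended list can then go through the same `apply_chat_template`, `process_vision_info`, and `generate` steps as the first turn.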

Caption-Pro streamlines image captioning and annotation, making it well suited to applications that require detailed visual content analysis and structured data integration.

Safetensors
Model size: 2.21B params
Tensor type: BF16
Model tree for prithivMLmods/Caption-Pro

Base model: Qwen/Qwen2-VL-2B