PaliGemma2 VRD Dhivehi OCR Model

Model Description

This is a fine-tuned version of the PaliGemma2 model optimized for Optical Character Recognition (OCR) of Dhivehi text in images. It is based on the google/paligemma2-3b-pt-224 checkpoint and was fine-tuned with QLoRA to read and transcribe Dhivehi text from images.

Model Details

  • Model type: Vision-Language Model
  • Base model: google/paligemma2-3b-pt-224
  • Fine-tuning approach: QLoRA
  • Input format: Images with text
  • Output format: Text transcription
  • Supported languages: Primarily Dhivehi

How to Use

Option 1: Direct Loading

from transformers.image_utils import load_image
import torch
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor

# Print GPU information
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
    print(f"GPU memory cached: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")

model_id = "alakxender/paligemma2-qlora-vrd-dhivehi-ocr-224-sm"
print("Loading model...")
# Load in bfloat16 so the weights match the bfloat16 inputs prepared below
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

print("Loading image...")
image = load_image("ocr1.png")

print("Processing image...")
prompt = "What text is written in this image?"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to("cuda")
input_len = model_inputs["input_ids"].shape[-1]

print("Model inputs device:", model_inputs["input_ids"].device)
print("Model device:", model.device)
print("Generating output...")
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

print("Done!")

Option 2: Memory-Efficient PEFT Loading

from transformers.image_utils import load_image
import torch
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
from peft import PeftModel, PeftConfig

# Define model ID
model_id = "alakxender/paligemma2-qlora-vrd-dhivehi-ocr-224-sm"

# Load the PEFT configuration to get the base model path
print("Loading PEFT configuration...")
peft_config = PeftConfig.from_pretrained(model_id)

# Load the base model
print(f"Loading base model: {peft_config.base_model_name_or_path}...")
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Load the adapter on top of the base model
print(f"Loading PEFT adapter: {model_id}...")
model = PeftModel.from_pretrained(base_model, model_id)

# Load the processor from the base model
processor = AutoProcessor.from_pretrained(peft_config.base_model_name_or_path)

print("Loading image...")
image = load_image("ocr1.png")

print("Processing image...")
prompt = "What text is written in this image?"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16)

# Move inputs to the same device as the model
if hasattr(model, 'device'):
    device = model.device
else:
    # If device isn't directly accessible, infer from model parameters
    device = next(model.parameters()).device
    
model_inputs = {k: v.to(device) for k, v in model_inputs.items()}

input_len = model_inputs["input_ids"].shape[-1]

print("Generating output...")
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)

print("Done!")

Training Details

  • Base Model: google/paligemma2-3b-pt-224

  • Dataset: alakxender/dhivehi-vrd-b1-img-questions

  • Training Configuration (see the configuration sketch after this list):

    • Batch size: 2 per device
    • Gradient accumulation steps: 8
    • Effective batch size: 16
    • Learning rate: 2e-5
    • Weight decay: 1e-6
    • Adam β2: 0.999
    • Warmup steps: 2
    • Training steps: 20,000
    • Epochs: 1
    • Mixed precision: bfloat16
  • QLoRA Configuration:

    • Quantization: 4-bit NF4
    • LoRA rank (r): 8
    • Target modules:
      • q_proj
      • k_proj
      • v_proj
      • o_proj
      • gate_proj
      • up_proj
      • down_proj
    • Task type: CAUSAL_LM
    • Optimizer: paged_adamw_8bit
  • Data Processing:

    • Image resize method: LANCZOS
    • Input format: RGB images
    • Text prompt format: "answer [question]"
  • Training Metrics:

    • Initial loss: ~15
    • Final loss: ~2
    • Learning rate: Decreasing from 1.5e-5 to 5e-6
    • Gradient norm: Stabilized around 20-60
    • Model checkpointing: Every 1000 steps
    • Logging frequency: Every 100 steps
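
For reference, the hyperparameters above roughly correspond to the following transformers/peft/bitsandbytes configuration objects. This is a reconstructed sketch, not the original training script, and the output directory name is illustrative:

import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization used for QLoRA (bfloat16 compute dtype assumed)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter over the attention and MLP projections listed above
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Trainer hyperparameters matching the values listed above
training_args = TrainingArguments(
    output_dir="paligemma2-dhivehi-ocr",  # illustrative path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,        # effective batch size 16
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    warmup_steps=2,
    max_steps=20_000,                     # the card lists 20,000 steps and 1 epoch; max_steps takes precedence
    bf16=True,
    optim="paged_adamw_8bit",
    save_steps=1000,
    logging_steps=100,
    report_to="wandb",
)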

Performance

The model showed consistent improvement during training:

  • Loss decreased significantly in the first 5k steps and stabilized afterwards
  • Gradient norms remained in a healthy range throughout training
  • Learning rate was automatically adjusted following a linear decay schedule
  • Training completed successfully, with the loss converging
  • Training progress was monitored using Weights & Biases

Model Architecture

This model uses Parameter-Efficient Fine-Tuning (PEFT) with QLoRA:

  • Quantization: 4-bit quantization for memory efficiency
  • LoRA Adaptation: Low-rank adaptation of key transformer components
  • Memory Optimization: Uses paged optimizer for efficient memory usage
  • Mixed Precision: bfloat16 for training stability and speed
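
If GPU memory is tight, the same 4-bit setup can be reused at inference time: quantize the frozen base model with bitsandbytes and attach the adapter on top, similar to Option 2 above. A sketch, assuming bitsandbytes is installed:

import torch
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

model_id = "alakxender/paligemma2-qlora-vrd-dhivehi-ocr-224-sm"
base_id = "google/paligemma2-3b-pt-224"

# Quantize the frozen base model to 4-bit NF4, then attach the LoRA adapter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, model_id)
processor = AutoProcessor.from_pretrained(base_id)

# Shows how few parameters the LoRA adapter adds on top of the quantized base
model.print_trainable_parameters()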

Limitations

  • Primarily optimized for Dhivehi text
  • Performance may vary with different image qualities and text styles
  • Performance on handwritten text is untested and may be limited

Dataset

  • Source Dataset: alakxender/dhivehi-vrd-images (VRD Batch 1)

  • Processed Dataset: alakxender/dhivehi-vrd-b1-img-questions

  • Dataset Size:

    • Total: 474,169 samples
    • Training set: 379,335 samples (80%)
    • Validation set: 94,834 samples (20%)
  • Question Types: The dataset uses a variety of question prompts for OCR tasks, including:

- "What text is written in this image?"
- "Can you read and transcribe the Dhivehi text shown in this image?"
- "What is the Dhivehi text visible in this image?"
- "Please read out the text content from this image"
- "What Dhivehi text can you see in this image?"
- "Is there any text visible in this image? If so, what does it say?"
- "Could you transcribe the Dhivehi text shown in this image?"
- "What does the text in this image say?"
- "Can you read the Dhivehi text in this image? What does it say?"
- "Please identify and transcribe any text visible in this image"
- "What Dhivehi text is present in this image?"
  • Dataset Format:
    • Features:
      • image: Image containing Dhivehi text
      • question: Randomly selected question from the question pool
      • answer: Ground truth Dhivehi text transcription
    • Processing: Memory-efficient chunked processing (10,000 samples per chunk)
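
A quick way to inspect the processed dataset is sketched below. The split name and column access are assumptions based on the feature list above; check the dataset card for the exact configuration:

from datasets import load_dataset

# Load the processed OCR dataset (the split name "train" is an assumption)
ds = load_dataset("alakxender/dhivehi-vrd-b1-img-questions", split="train")

sample = ds[0]
print(sample["question"])  # one of the question prompts listed above
print(sample["answer"])    # ground-truth Dhivehi transcription
sample["image"].save("sample.png")  # PIL image containing the Dhivehi text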