HordeVision: Open-Source Kazakh Vision-Language Model
HordeVision is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.
Model Description
HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at:
- Image Captioning: Generating detailed, contextual descriptions in Kazakh
- Visual Question Answering (VQA): Answering diverse questions about image content
- OCR: Extracting and reading Kazakh text from images
- Visual Reasoning: Making inferences about context, causality, and temporal states
- Instruction Following: Executing multi-step visual tasks based on user commands
Key Features
- First open-source Kazakh vision-language model
- Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
- Two-stage training: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with GRPO (Group Relative Policy Optimization)
- Ranks #1 on all five evaluation tasks among the multilingual models compared below
Model Performance Summary
| Model | caption | vqa | ocr | reason | instruct_follow | Avg Rank |
|---|---|---|---|---|---|---|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |
Comparison: HordeVision scores highest on every task in the table, ahead of ISSAI Qolda, Qwen3-VL-8B-Instruct, Google gemma-3-4b-it, Qwen2.5-VL-7B-Instruct, Llama-3.2-11B-Vision, and InternVL3-8B.
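The card does not define how the Avg Rank column is computed. One reading that is consistent with the table is to rank the models on each task and order them by mean rank. The sketch below checks that reading against the scores listed above; the ranking rule and the names `scores` and `mean_rank` are assumptions made for illustration, not the authors' evaluation code.

```python
# Scores copied from the table above, in column order:
# caption, vqa, ocr, reason, instruct_follow.
scores = {
    "horde-vision": [83.5, 68.1, 64.7, 77.4, 70.5],
    "Qolda": [75.2, 61.7, 60.6, 70.3, 62.2],
    "Qwen3-VL-8B-Instruct": [41.3, 53.6, 59.3, 55.5, 49.5],
    "gemma-3-4b-it": [42.0, 41.8, 50.3, 53.0, 42.5],
    "Qwen2.5-VL-7B-Instruct": [35.4, 41.6, 51.0, 44.6, 37.7],
    "Llama-3.2-11B-Vision": [36.2, 38.0, 15.0, 43.4, 36.4],
    "InternVL3-8B": [26.1, 29.0, 29.1, 27.3, 25.7],
}


def mean_rank(scores):
    """Rank models per task (1 = best score) and return each model's mean rank."""
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    totals = {m: 0 for m in models}
    for t in range(n_tasks):
        # Higher score is better, so sort in descending order for this task.
        ordered = sorted(models, key=lambda m: scores[m][t], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_tasks for m in models}


ranks = mean_rank(scores)
for model in sorted(ranks, key=ranks.get):
    print(f"{model}: mean rank {ranks[model]:.1f}")
```

Under this assumption the resulting order matches the Avg Rank column, from horde-vision first to InternVL3-8B last.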
Dataset
The training dataset was collected using a synthetic data generation pipeline:
- Size: 45k training images, 5k validation images
- Categories: 21 main categories, 104 subcategories, ~2,600 keyword phrases
- Coverage: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
- Quality: Deduplicated with imagededup and filtered by aesthetic score
- Annotation: Labeled using GPT-4.1 with structured prompts for consistent quality
- Split Strategy: Entity-level stratification to ensure models are tested on completely unseen entities
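The split strategy above can be illustrated with a grouped split, in which every entity is assigned entirely to either the training or the validation set. This is a minimal sketch, not the authors' pipeline: it groups by entity only and omits any per-category stratification, and the record keys `"image"` and `"entity"`, the function name `entity_level_split`, and the 10% validation fraction are all hypothetical.

```python
import random
from collections import defaultdict


def entity_level_split(samples, val_fraction=0.1, seed=0):
    """Split samples so each entity appears in train or validation, never both.

    `samples` is a list of dicts with hypothetical keys "image" and "entity".
    """
    by_entity = defaultdict(list)
    for sample in samples:
        by_entity[sample["entity"]].append(sample)

    # Shuffle entities (not images) so the split is decided per entity.
    entities = sorted(by_entity)
    random.Random(seed).shuffle(entities)

    n_val = max(1, round(len(entities) * val_fraction))
    val = [s for e in entities[:n_val] for s in by_entity[e]]
    train = [s for e in entities[n_val:] for s in by_entity[e]]
    return train, val


# Toy data: 100 images spread evenly over 20 made-up entities.
samples = [{"image": f"img_{i}.jpg", "entity": f"entity_{i % 20}"} for i in range(100)]
train, val = entity_level_split(samples)

train_entities = {s["entity"] for s in train}
val_entities = {s["entity"] for s in val}
print(len(train), len(val), train_entities & val_entities)
```

Because the shuffle operates on entity names, no validation image shares an entity with a training image, which is the property the split strategy aims for.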
Training Details
Supervised Fine-Tuning (SFT)
- Data: 46k images
- LoRA Rank: 128
- Epochs: 1
Reinforcement Learning (GRPO)
- Data: 5k images
- LoRA Rank: 64
- Epochs: 1
- Judge: GPT-4.1-mini with custom Kazakh evaluation prompts
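For readers unfamiliar with GRPO, its core step is to score several sampled responses to the same prompt and normalize each reward against the group's mean and standard deviation. The sketch below shows only that normalization, assuming one scalar judge score per response. The card does not describe the reward scale, group size, or any other hyperparameters, so the values and the name `group_relative_advantages` are hypothetical, and the use of the population standard deviation with a small epsilon is one common choice rather than a confirmed detail of this training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for four sampled answers to one image/question pair.
judge_scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(judge_scores)
print([round(a, 3) for a in advantages])
```

Responses scored above the group mean receive positive advantages and are reinforced; those below receive negative ones. When all responses in a group get the same score, every advantage is zero and that group contributes no learning signal.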
How to Use
import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kz-transformers/horde-vision", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "kz-transformers/horde-vision",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")

# One user turn with an image and a Kazakh prompt ("Describe this image.")
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Бұл суретті сипаттаңыз."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
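To run the same pipeline on other images or prompts, the message structure used above can be wrapped in a small helper. This is a convenience sketch only; `build_messages` is a hypothetical name and is not part of `transformers` or of this model's repository.

```python
def build_messages(image, prompt):
    """Return a single-turn chat message list with one image and one text prompt.

    `image` is passed through unchanged, for example the URL string used above.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Same image URL and Kazakh prompt as in the example above.
messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Бұл суретті сипаттаңыз.",
)
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` exactly as in the example above.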
Base model: Qwen/Qwen3-VL-8B-Instruct