HordeVision: Open-Source Kazakh Vision-Language Model

HordeVision is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.

Model Description

HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at:

Image Captioning: Generating detailed, contextual descriptions in Kazakh
Visual Question Answering (VQA): Answering diverse questions about image content
OCR: Extracting and reading Kazakh text from images
Visual Reasoning: Making inferences about context, causality, and temporal states
Instruction Following: Executing multi-step visual tasks based on user commands

Key Features

First open-source Kazakh vision-language model
Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
Two-stage training: Supervised Fine-Tuning (SFT) + Reinforcement Learning (GRPO)
Ranks #1 on every evaluation task among comparable multilingual models

Model Performance Summary

Model	caption	vqa	ocr	reason	instruct_follow	Avg Rank
horde-vision	83.5 (↑12.3%)	68.1 (↑5.3%)	64.7 (↑2.6%)	77.4 (↑5.7%)	70.5 (↑5.9%)	#1
Qolda	75.2 (↑8.7%)	61.7 (↑3.0%)	60.6 (↑2.0%)	70.3 (↑2.9%)	62.2 (↑2.8%)	#2
Qwen3-VL-8B-Instruct	41.3 (↑0.5%)	53.6 (↑1.1%)	59.3 (↑2.1%)	55.5 (↑0.7%)	49.5 (↑0.9%)	#3
gemma-3-4b-it	42.0 (↑0.1%)	41.8 (↑0.4%)	50.3 (↑2.3%)	53.0 (↑0.6%)	42.5 (↑0.5%)	#4
Qwen2.5-VL-7B-Instruct	35.4 (↑0.0%)	41.6 (↑0.4%)	51.0 (↑0.9%)	44.6 (↑0.3%)	37.7 (↑0.3%)	#5
Llama-3.2-11B-Vision	36.2 (↑0.1%)	38.0 (↑0.3%)	15.0 (↑0.1%)	43.4 (↑0.3%)	36.4 (↑0.3%)	#6
InternVL3-8B	26.1 (↑0.6%)	29.0 (↑0.0%)	29.1 (↑0.3%)	27.3 (↑0.0%)	25.7 (↑0.0%)	#7

Comparison: HordeVision outperforms Google gemma-3-4b-it, InternVL3-8B, Llama-3.2-11B-Vision, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, and ISSAI Qolda on all five tasks.
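The card does not say how the Avg Rank column is derived. The sketch below assumes it is the mean of per-task ranks (rank 1 for the highest score on each task) and recomputes it from the scores in the table; the function name `average_ranks` is illustrative. Under that assumption the resulting order matches the #1 to #7 column.

```python
# Scores copied from the performance table (the bracketed deltas are omitted).
TASKS = ["caption", "vqa", "ocr", "reason", "instruct_follow"]
SCORES = {
    "horde-vision": [83.5, 68.1, 64.7, 77.4, 70.5],
    "Qolda": [75.2, 61.7, 60.6, 70.3, 62.2],
    "Qwen3-VL-8B-Instruct": [41.3, 53.6, 59.3, 55.5, 49.5],
    "gemma-3-4b-it": [42.0, 41.8, 50.3, 53.0, 42.5],
    "Qwen2.5-VL-7B-Instruct": [35.4, 41.6, 51.0, 44.6, 37.7],
    "Llama-3.2-11B-Vision": [36.2, 38.0, 15.0, 43.4, 36.4],
    "InternVL3-8B": [26.1, 29.0, 29.1, 27.3, 25.7],
}


def average_ranks(scores):
    """Return {model: mean rank over tasks}; rank 1 is the highest score."""
    n_tasks = len(next(iter(scores.values())))
    totals = {model: 0 for model in scores}
    for t in range(n_tasks):
        # Sort models by their score on task t, best first.
        ordered = sorted(scores, key=lambda m: scores[m][t], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    return {model: totals[model] / n_tasks for model in scores}


avg = average_ranks(SCORES)
leaderboard = sorted(avg, key=avg.get)  # lowest mean rank first
for position, model in enumerate(leaderboard, start=1):
    print(f"#{position} {model}: mean rank {avg[model]:.1f}")
```

Ties are not handled specially because none occur in this table.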

Dataset

The dataset was built with a synthetic data generation pipeline:

Size: 45k training images, 5k validation images
Categories: 21 main categories, 104 subcategories, ~2,600 keyword phrases
Coverage: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
Quality: Deduplicated with imagededup and filtered by aesthetic score
Annotation: Labeled using GPT-4.1 with structured prompts for consistent quality
Split Strategy: Entity-level stratification to ensure models are tested on completely unseen entities
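The split code itself is not part of this card. As a minimal sketch of the property described under Split Strategy, the function below holds out whole entities so that no entity appears in both splits. It does not reproduce any stratification by category, and the function name, the `val_fraction` parameter, and the sample entities are all hypothetical.

```python
import random
from collections import defaultdict


def entity_level_split(samples, val_fraction=0.1, seed=0):
    """Split (image_id, entity) pairs so that no entity appears in both splits."""
    by_entity = defaultdict(list)
    for image_id, entity in samples:
        by_entity[entity].append(image_id)

    # Shuffle entities (not images) so the holdout is decided per entity.
    entities = sorted(by_entity)
    random.Random(seed).shuffle(entities)
    n_val = max(1, round(len(entities) * val_fraction))
    val_entities = set(entities[:n_val])

    train = [(i, e) for i, e in samples if e not in val_entities]
    val = [(i, e) for i, e in samples if e in val_entities]
    return train, val


# Hypothetical image ids and entity labels, for illustration only.
samples = [
    ("img_001", "dombra"), ("img_002", "dombra"),
    ("img_003", "yurt"), ("img_004", "yurt"),
    ("img_005", "bazaar"), ("img_006", "classroom"),
]
train, val = entity_level_split(samples, val_fraction=0.25)
print(len(train), "train images,", len(val), "validation images")
```

Splitting by entity rather than by image means a validation score reflects generalization to unseen entities, not recall of entities seen during training.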

Training Details

Supervised Fine-Tuning (SFT)

Data: 46k images
LoRA Rank: 128
Epochs: 1
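To give a feel for what the LoRA rank controls, the arithmetic below counts the parameters one adapter adds to a single weight matrix: a rank-r adapter on a d_in x d_out matrix adds r * (d_in + d_out) parameters. The 4096 x 4096 shape is an assumption for illustration; this card does not list which layers were adapted or their shapes.

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical square projection; real layer shapes are not given in this card.
D_IN = D_OUT = 4096
full = D_IN * D_OUT

# Rank 128 is the SFT setting; rank 64 is the GRPO setting listed in this card.
for rank in (128, 64):
    extra = lora_extra_params(D_IN, D_OUT, rank)
    print(f"rank {rank}: {extra:,} adapter params vs {full:,} frozen params")
```

Halving the rank halves the adapter size, so the rank-64 GRPO stage trains half as many adapter parameters per adapted matrix as the rank-128 SFT stage.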

Reinforcement Learning (GRPO)

Data: 5k images
LoRA Rank: 64
Epochs: 1
Judge: GPT-4.1-mini with custom Kazakh evaluation prompts
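GRPO (Group Relative Policy Optimization) scores several sampled responses to the same prompt and normalizes each reward against the rest of its group. The sketch below shows only that normalization step, with judge scores as the rewards. It is not the authors' training code, and the score values, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores in [0, 1] for four sampled answers to one image prompt.
judge_scores = [0.9, 0.4, 0.6, 0.1]
advantages = group_relative_advantages(judge_scores)
for score, adv in zip(judge_scores, advantages):
    print(f"judge score {score:.1f} -> advantage {adv:+.2f}")
```

Responses scored above the group mean get a positive advantage and are reinforced; those below the mean get a negative one.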

How to Use

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kz-transformers/horde-vision", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# This variant additionally needs `import torch` for the dtype argument.
# import torch
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "kz-transformers/horde-vision",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Бұл суретті сипаттаңыз."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
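For other tasks such as VQA or OCR, only the image and the text instruction in `messages` change. A small helper along these lines keeps the structure in one place; `build_messages` is a hypothetical name, and the dictionary layout simply mirrors the example above.

```python
def build_messages(image, prompt):
    """Return a single-turn chat with one image and one text instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Same image URL and Kazakh prompt as in the example above.
messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Бұл суретті сипаттаңыз.",
)
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` exactly as in the example above.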

Model Details

Model size: 9B params
Tensor types: F32, BF16
Base model: Qwen/Qwen3-VL-8B-Instruct