
HordeVision: Open-Source Kazakh Vision-Language Model

HordeVision is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.

Model Description

HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at:

  • Image Captioning: Generating detailed, contextual descriptions in Kazakh
  • Visual Question Answering (VQA): Answering diverse questions about image content
  • OCR: Extracting and reading Kazakh text from images
  • Visual Reasoning: Making inferences about context, causality, and temporal states
  • Instruction Following: Executing multi-step visual tasks based on user commands

Key Features

  • First open-source Kazakh vision-language model
  • Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
  • Two-stage training: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with GRPO (Group Relative Policy Optimization)
  • Ranks #1 across all evaluation tasks among comparable multilingual models

Model Performance Summary

| Model | Caption | VQA | OCR | Reasoning | Instruction Following | Avg. Rank |
|---|---|---|---|---|---|---|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |

Comparison: Outperforms Google Gemma 3-4B-IT, InternVL3-8B, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, and ISSAI Qolda across all tasks.

Dataset

The training dataset was built with a synthetic data generation pipeline:

  • Size: 45k training images, 5k validation images
  • Categories: 21 main categories, 104 subcategories, ~2,600 keyword phrases
  • Coverage: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
  • Quality: Deduplicated with imagededup and filtered by aesthetic scoring (see the deduplication sketch after this list)
  • Annotation: Labeled using GPT-4.1 with structured prompts for consistent quality
  • Split Strategy: Entity-level stratification to ensure models are tested on completely unseen entities
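
The filtering code itself is not published; as an illustration only, a minimal near-duplicate pass with imagededup's perceptual hashing could look like the following (the directory path and distance threshold are placeholder assumptions, not values from the pipeline):

from imagededup.methods import PHash

# Perceptual hashing catches near-duplicates, not just byte-identical copies.
phasher = PHash()

# Returns filenames whose hash lies within the Hamming-distance threshold
# of another image, i.e. candidates to drop from the training set.
to_remove = phasher.find_duplicates_to_remove(
    image_dir="data/images",       # placeholder path
    max_distance_threshold=10,     # placeholder threshold
)
print(f"Flagged {len(to_remove)} near-duplicate images for removal")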

Training Details

Supervised Fine-Tuning (SFT)

  • Data: 46k images
  • LoRA Rank: 128 (see the configuration sketch after this list)
  • Epochs: 1
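
The full SFT hyperparameters are not released; a minimal peft LoRA configuration consistent with the rank above might look like this (alpha, dropout, and target modules are assumptions, not published values):

from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                    # LoRA rank, as stated above
    lora_alpha=256,           # assumed; often set to 2 * r
    lora_dropout=0.05,        # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)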

Reinforcement Learning (GRPO)

  • Data: 5k images
  • LoRA Rank: 64
  • Epochs: 1
  • Judge: GPT-4.1-mini with custom Kazakh evaluation prompts (a reward-function sketch follows this list)
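
The judge-based reward itself is not released. As a sketch only, an LLM-judge reward function in the style TRL's GRPOTrainer accepts, querying GPT-4.1-mini through the openai client, might look like this (the judge prompt, scoring scale, and score parsing are all assumptions; the actual Kazakh evaluation prompts are unpublished):

from openai import OpenAI

client = OpenAI()

def judge_reward(prompts, completions, **kwargs):
    # TRL-style reward functions return one float per completion.
    rewards = []
    for prompt, completion in zip(prompts, completions):
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{
                "role": "user",
                # Illustrative English judge prompt; HordeVision used
                # custom Kazakh evaluation prompts.
                "content": f"Rate this Kazakh answer from 0 to 10. "
                           f"Reply with the number only.\n"
                           f"Question: {prompt}\nAnswer: {completion}",
            }],
        )
        try:
            score = float(response.choices[0].message.content.strip())
        except ValueError:
            score = 0.0  # fall back when the judge replies with non-numeric text
        rewards.append(score)
    return rewards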

How to Use

import torch  # used for torch.bfloat16 in the flash-attention variant below
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kz-transformers/horde-vision", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "kz-transformers/horde-vision",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Бұл суретті сипаттаңыз."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
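
The same message structure covers the other tasks; only the text prompt changes. For example, an OCR request might look like this (the image path is a placeholder, and the prompt simply means "Read the text in the image"):

# OCR example: reuse the pipeline above with an OCR instruction
ocr_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/or/url/to/document.jpg"},  # placeholder image
            {"type": "text", "text": "Суреттегі мәтінді оқыңыз."},  # "Read the text in the image."
        ],
    }
]
# Then run the same apply_chat_template / generate steps shown above.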

Citation

