Uzbek POS Tagger

This model predicts Universal Dependencies part-of-speech (POS) tags for Uzbek text.

Model details

The model was fine-tuned on a Universal Dependencies treebank containing approximately 600 annotated sentences. It is based on the XLM-RoBERTa base model (roughly 277M parameters) and adapted for token classification over the Universal Dependencies part-of-speech (UPOS) tag set.
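
For reference, a typical fine-tuning setup for this kind of tagger looks roughly like the sketch below. The base checkpoint (xlm-roberta-base) and the 17-tag UD UPOS inventory are standard; the label ordering, dataset preparation, and hyperparameters shown here are illustrative assumptions, not the exact recipe behind this checkpoint.

from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

# The 17 Universal Dependencies UPOS tags (standard inventory; the label order
# used by this particular model may differ)
upos_tags = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
             "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]
id2label = {i: tag for i, tag in enumerate(upos_tags)}
label2id = {tag: i for i, tag in enumerate(upos_tags)}

# Start from the multilingual XLM-RoBERTa base checkpoint with a fresh
# token-classification head sized to the tag set
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(upos_tags),
    id2label=id2label,
    label2id=label2id,
)

# Illustrative hyperparameters only; the treebank would be tokenized with
# word-aligned labels and passed to the Trainer as train_dataset
training_args = TrainingArguments(
    output_dir="uzbek-pos-tagger",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
)
# trainer = Trainer(model=model, args=training_args, train_dataset=...)
# trainer.train()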

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Arofat/uzbek-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("Arofat/uzbek-pos-tagger")

# Prepare text: split on whitespace (punctuation stays attached to the preceding word)
text = "Men O'zbekistonda yashayman."
tokens = text.split()

# Get predictions
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each subword position
predictions = torch.argmax(outputs.logits, dim=2)
id2label = model.config.id2label

# Map subword predictions back to words: XLM-RoBERTa splits words into subwords,
# so keep only the prediction for each word's first subtoken
pos_tags = []
word_ids = inputs.word_ids(batch_index=0)
prev_word_id = None
for idx, word_id in enumerate(word_ids):
    if word_id is None or word_id == prev_word_id:
        continue
    pos_tags.append(id2label[predictions[0, idx].item()])
    prev_word_id = word_id

# Print results
for token, tag in zip(tokens, pos_tags):
    print(f"{token}: {tag}")

Limitations

This model was trained on a relatively small dataset (approximately 600 sentences) and may not generalize well to all domains or genres of Uzbek text.
