Fine-tuned Wav2Vec2-XLS-R-300m for Uzbek Speech Recognition

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Automatic Speech Recognition (ASR) on the Uzbek language. It has been fine-tuned on the nickoo004/FeruzaSpeech_to_fine_tuning dataset, which contains high-quality narrated audio from an audiobook.

The goal of this project was to create a robust, publicly available model for transcribing Uzbek speech.

Model Description

  • Base Model: facebook/wav2vec2-xls-r-300m
  • Language: Uzbek (uz)
  • Task: Automatic Speech Recognition (ASR)
  • Training Data: nickoo004/FeruzaSpeech_to_fine_tuning
  • Author: Nicholas (nickoo004)

Intended Uses & Limitations

You can use this model to transcribe Uzbek audio files. For best results, the audio should be clean (minimal background noise) and sampled at 16 kHz.
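If your audio is at a different sampling rate, resample it before inference. A minimal sketch using torchaudio (this requires the torchaudio package; the file path is a placeholder):

import torchaudio
import torchaudio.functional as F

# Load an audio file (placeholder path) and resample to 16 kHz if needed
waveform, orig_sr = torchaudio.load("path/to/your/audio.wav")
if orig_sr != 16000:
    waveform = F.resample(waveform, orig_freq=orig_sr, new_freq=16000)

# Average channels to mono and flatten to a 1D tensor
audio_input = waveform.mean(dim=0)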

Limitations:

  • The model was trained on audiobook narration, so it will perform best on clear, formal speech. Performance may be lower on highly conversational, noisy, or technical audio.
  • The model's performance on different Uzbek dialects or accents outside of the training data distribution has not been formally evaluated.
  • This model is designed for transcription only and does not perform speaker identification or translation.

How to Use

You can use this model with the transformers library.

First, install the necessary dependencies:

pip install transformers torch

Then use the following Python code to transcribe an audio file:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Make sure to use your final repository ID
repo_id = "nickoo004/wav2vec2-feruza-uzbek-v1"

# Load the fine-tuned model and processor
model = Wav2Vec2ForCTC.from_pretrained(repo_id)
processor = Wav2Vec2Processor.from_pretrained(repo_id)

# --- Example: Transcribing an audio file ---
# You need to load your own audio file.
# Ensure your audio is a 1D tensor/numpy array and sampled at 16kHz.
# import librosa
# audio_input, sampling_rate = librosa.load("path/to/your/audio.wav", sr=16000)


# For this self-contained example, use 5 seconds of random noise as a placeholder
sampling_rate = 16000
dummy_audio_input = torch.randn(sampling_rate * 5)  # 1D tensor, 5 seconds at 16 kHz

# Preprocess the raw audio into model inputs
inputs = processor(dummy_audio_input, sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Convert the predicted IDs back to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Transcription:", transcription[0])
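
If you prefer a higher-level interface, the transformers pipeline API wraps the same steps. A short sketch (decoding audio files through the pipeline requires ffmpeg; the audio path is a placeholder):

from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint
asr = pipeline("automatic-speech-recognition", model="nickoo004/wav2vec2-feruza-uzbek-v1")

# The pipeline handles loading, resampling, and decoding internally
result = asr("path/to/your/audio.wav")
print(result["text"])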

Training Procedure

Training Data

The model was fine-tuned on the train split of the nickoo004/FeruzaSpeech_to_fine_tuning dataset, containing approximately 11,444 samples. The text was pre-processed by lowercasing it, removing all punctuation except apostrophes, and replacing spaces with the | delimiter used by the CTC tokenizer (see the sketches below).

Training Hyperparameters

  • Learning Rate: 3e-4
  • Optimizer: AdamW
  • Scheduler: Linear warmup (500 steps)
  • Effective Batch Size: 16
  • Epochs: 5
  • Mixed Precision: FP16

Training Results

  • Best Evaluation Loss (CTC Loss): 13.27
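
The exact preprocessing and training scripts are not published alongside the model; the following sketches show one plausible reading of the description above. First, the text normalization (the regex is an assumption; only the lowercase/punctuation/delimiter rules come from the description):

import re

def normalize_text(text: str) -> str:
    # Lowercase, drop all punctuation except apostrophes,
    # then replace spaces with the "|" word delimiter used by the CTC tokenizer
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    return text.replace(" ", "|")

print(normalize_text("Salom, dunyo!"))  # -> salom|dunyo

The listed hyperparameters map directly onto transformers.TrainingArguments; in this sketch the split of the effective batch size 16 into per-device size and gradient accumulation is an assumption:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-feruza-uzbek-v1",
    learning_rate=3e-4,             # reported learning rate (AdamW is the default optimizer)
    warmup_steps=500,               # linear warmup, as described
    lr_scheduler_type="linear",     # linear decay after warmup
    per_device_train_batch_size=8,  # assumption: 8 * 2 accumulation steps = effective 16
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    fp16=True,                      # mixed precision as described
)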

Citation

@misc{feruza_speech_asr_2025,
  author = {Nicholas},
  title = {Fine-tuned Wav2Vec2-XLS-R-300m for Uzbek Speech Recognition},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/nickoo004/wav2vec2-feruza-uzbek-v1}}
}