Model Card for flamme-vrm/whisper-fine-tuned-kk
This model is a fine-tuned version of OpenAI's Whisper-Small, adapted for Kazakh Automatic Speech Recognition (ASR). It was trained on the Mozilla Common Voice Kazakh dataset, which contains approximately 500 hours of Kazakh speech.
Training Details
Overall steps: 2000
Metrics from TensorBoard:
- Eval:
  - loss: 0.7475
  - wer: 53.51
- Train:
  - epoch: 28.99
  - grad_norm: 6.78e-3
  - learning_rate: 5.7229e-6
  - loss: 1e-4
  - batch_size: 8
Model Description
This model represents an effort to improve Speech-to-Text capabilities for the Kazakh language, a morphologically rich and agglutinative language that presents unique challenges for ASR systems. It was fine-tuned from the pre-trained openai/whisper-small model.
The training process involved monitoring key metrics, including Word Error Rate (WER) and training loss, to understand the model's performance on the specific characteristics of Kazakh speech data. While the model noticeably sped up recognition when integrated into a downstream application, the project also provided valuable insight into the difficulty of achieving high accuracy for complex languages with limited datasets.
- Developed by: Flamme-VRM (Shyngisbek Asylkhan)
- Funded by [optional]: N/A (Self-funded / Personal Project)
- Shared by [optional]: Flamme-VRM (Shyngisbek Asylkhan)
- Model type: Automatic Speech Recognition (ASR)
- Language(s) (NLP): Kazakh (kk)
- License: MIT
- Finetuned from model [optional]: openai/whisper-small
Factors
Performance may vary across speakers, accents, and recording conditions not well represented in the Common Voice dataset, as well as across linguistic domains (e.g., formal vs. informal speech, specific technical jargon).
Model Sources [optional]
- Repository: https://github.com/Flamme-VRM/whisperKK
- Paper [optional]: N/A (Personal Project, no formal paper)
- Demo [optional]: Currently none available, but planned.
Compute Used
This model was trained on a single NVIDIA RTX 3060 GPU (12 GB VRAM).
Direct Use
This model can be used directly to transcribe Kazakh audio into text with the Hugging Face transformers library. It offers a starting point for developers and researchers interested in Kazakh ASR.
Downstream Use [optional]
This model has been integrated into a personal chatbot project to enhance its speech recognition capabilities, leading to a significant improvement in processing speed for Kazakh voice commands. It can serve as a base for other applications requiring Kazakh STT, such as voice assistants, transcription services, or accessibility tools.
Out-of-Scope Use
This model is not intended for use in critical applications where high accuracy is paramount without further fine-tuning on a larger, more diverse dataset and rigorous testing. It is not designed for real-time translation (only transcription). It should not be used for malicious purposes, such as surveillance without consent.
Bias, Risks, and Limitations
As with all ASR models, this model may exhibit biases related to the demographic representation within its training data (Mozilla Common Voice). Performance may vary based on speaker accent, background noise, audio quality, and specific vocabulary/dialect.
Key Technical Limitation: The model's Word Error Rate (WER) on the validation set plateaued at approximately 53.5% during training, despite a very low training loss. This gap points to a generalization challenge, likely due to the morphological complexity of Kazakh combined with the limited scale of the available training data. Further research and more extensive, diverse datasets are required to significantly improve accuracy.
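For context, WER counts word-level substitutions, deletions, and insertions against a reference transcript, normalized by the reference length. A minimal reference implementation (not the evaluation code used for this card, which presumably relied on standard tooling such as `evaluate`/`jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent: (S + D + I) / reference word count * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of three reference words -> roughly 33.3% WER
print(wer("мен қазақша сөйлеймін", "мен қазақ сөйлеймін"))
```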
Recommendations
Users should be aware of the current accuracy limitations, especially for critical applications. For higher accuracy, further fine-tuning with larger and more diverse Kazakh datasets is recommended. Contributions to larger Kazakh ASR datasets would greatly benefit the community.
How to Get Started with the Model
Use the code below to get started with the model using the Hugging Face transformers library:

```python
from transformers import pipeline

# Load the fine-tuned Kazakh ASR model from the Hub
pipe = pipeline("automatic-speech-recognition", model="flamme-vrm/whisper-fine-tuned-kk")

audio_file = "your_kazakh_audio.wav"  # Replace with your audio file path
result = pipe(audio_file)
print(result["text"])  # The pipeline returns a dict with the transcription under "text"
```
Contact
Telegram: @Vermeei
GitHub Profile: Flamme-VRM