Model Card for flamme-vrm/whisper-fine-tuned-kk
This model is a fine-tuned version of OpenAI's Whisper-Small, adapted for Kazakh Automatic Speech Recognition (ASR). It was trained on the Mozilla Common Voice Kazakh dataset, which contains approximately 500 hours of Kazakh speech.
Training Details
Overall steps: 2000
Metrics from TensorBoard:
- Eval:
  - loss: 0.7475
  - wer: 53.51
- Train:
  - epoch: 28.99
  - grad_norm: 6.78e-3
  - learning_rate: 5.7229e-6
  - loss: 1e-4
  - batch_size: 8
Model Description
This model represents an effort to improve Speech-to-Text capabilities for the Kazakh language, a morphologically rich and agglutinative language that presents unique challenges for ASR systems. It was fine-tuned from the pre-trained openai/whisper-small model.
The training process involved monitoring key metrics, including Word Error Rate (WER) and training loss, to understand the model's performance on the specific characteristics of Kazakh speech data. While the model noticeably sped up recognition when integrated into a downstream application, the project also provided valuable insight into the difficulty of achieving high accuracy for complex languages with limited datasets.
- Developed by: Flamme-VRM (Shyngisbek Asylkhan)
- Funded by [optional]: N/A (Self-funded / Personal Project)
- Shared by [optional]: Flamme-VRM (Shyngisbek Asylkhan)
- Model type: Automatic Speech Recognition (ASR)
- Language(s) (NLP): Kazakh (kk)
- License: MIT
- Finetuned from model [optional]: openai/whisper-small
Factors
Performance may vary across speakers, accents, and recording conditions not well represented in the Common Voice dataset, as well as across linguistic domains (e.g., formal vs. informal speech, specific technical jargon).
Model Sources [optional]
- Repository: https://github.com/Flamme-VRM/whisperKK
- Paper [optional]: N/A (Personal Project, no formal paper)
- Demo [optional]: Currently none available, but planned.
Compute Used
This model was trained on a single NVIDIA RTX 3060 GPU (12 GB VRAM).
Direct Use
This model can be used directly to transcribe Kazakh audio into text with the Hugging Face transformers library. It offers a starting point for developers and researchers interested in Kazakh ASR.
Downstream Use [optional]
This model has been integrated into a personal chatbot project to enhance its speech recognition capabilities, leading to a significant improvement in processing speed for Kazakh voice commands. It can serve as a base for other applications requiring Kazakh STT, such as voice assistants, transcription services, or accessibility tools.
Out-of-Scope Use
This model is not intended for use in critical applications where high accuracy is paramount without further fine-tuning on a larger, more diverse dataset and rigorous testing. It is not designed for real-time translation (only transcription). It should not be used for malicious purposes, such as surveillance without consent.
Bias, Risks, and Limitations
As with all ASR models, this model may exhibit biases related to the demographic representation within its training data (Mozilla Common Voice). Performance may vary based on speaker accent, background noise, audio quality, and specific vocabulary/dialect.
Key Technical Limitation: The model's Word Error Rate (WER) on the validation set plateaued at approximately 53.5% during training, despite a very low training loss. This gap points to a generalization challenge, likely due to the morphological complexity of Kazakh combined with the limited scale of the available training data. Further research and more extensive, diverse datasets are required to significantly improve accuracy.
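For context, WER counts word-level substitutions, deletions, and insertions against a reference transcript, normalized by the reference length. A minimal reference implementation (not the evaluation code used for this card, which presumably relied on standard tooling such as `evaluate`/`jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent: (S + D + I) / reference word count * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of three reference words -> roughly 33.3% WER
print(wer("мен қазақша сөйлеймін", "мен қазақ сөйлеймін"))
```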
Recommendations
Users should be aware of the current accuracy limitations, especially for critical applications. For higher accuracy, further fine-tuning with larger and more diverse Kazakh datasets is recommended. Contributions to larger Kazakh ASR datasets would greatly benefit the community.
How to Get Started with the Model
Use the code below to get started with the model using the Hugging Face transformers library:

```python
from transformers import pipeline

# Load the fine-tuned Kazakh ASR model from the Hub
pipe = pipeline("automatic-speech-recognition", model="flamme-vrm/whisper-fine-tuned-kk")

audio_file = "your_kazakh_audio.wav"  # Replace with your audio file path
result = pipe(audio_file)
print(result["text"])  # The pipeline returns a dict with the transcription under "text"
```
Contact
Telegram: @Vermeei
GitHub Profile: Flamme-VRM