---
library_name: transformers
tags:
- Kazakh
- KZ
- Kazakhstani
- KazakhLanguage
- WhisperKZ
- KazakhFine-Tune
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
language:
- kk
metrics:
- wer
base_model:
- openai/whisper-small
---

# Model Card for flamme-vrm/whisper-fine-tuned-kk

This model is a fine-tuned version of OpenAI's Whisper-Small, adapted for **Kazakh Automatic Speech Recognition (ASR)**. It was trained on the Mozilla Common Voice Kazakh dataset, containing approximately 500 hours of Kazakh speech.

### Training Details

**Overall steps**: 2000

Metrics from **TensorBoard**:

**Eval**:
- loss: `0.7475`
- wer: `53.51`

**Train**:
- epoch: `28.99`
- grad_norm: `6.78e-3`
- learning_rate: `5.7229e-6`
- loss: `1e-4`
- batch_size: `8`

### Model Description

This model is an effort to improve speech-to-text for Kazakh, a morphologically rich, agglutinative language that presents particular challenges for ASR systems. It was fine-tuned from the pre-trained `openai/whisper-small` model. Word Error Rate (WER) and training loss were monitored throughout training to track how well the model handled the characteristics of Kazakh speech data. The model delivered a marked increase in recognition speed when integrated into a downstream application, and it also showed how difficult it is to reach high accuracy on a complex language with a limited dataset.

- **Developed by:** Flamme-VRM (Shyngisbek Asylkhan)
- **Funded by:** N/A (self-funded personal project)
- **Shared by:** Flamme-VRM (Shyngisbek Asylkhan)
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s) (NLP):** Kazakh (kk)
- **License:** MIT
- **Finetuned from model:** `openai/whisper-small`

### Factors

Performance may vary across speakers, accents, and recording conditions that are not well represented in the Common Voice dataset, and across linguistic domains (e.g., formal vs. informal speech, specific technical jargon).

### Model Sources

- **Repository:** https://github.com/Flamme-VRM/whisperKK
- **Paper:** N/A (personal project, no formal paper)
- **Demo:** Currently none available, but planned.

### Compute Used

This model was trained on an NVIDIA RTX 3060 GPU with 12 GB of VRAM.

### Direct Use

This model can be used directly to transcribe Kazakh audio into text with the Hugging Face `transformers` library. It offers a starting point for developers and researchers interested in Kazakh ASR.

### Downstream Use

This model has been integrated into a personal chatbot project to handle speech recognition, where it significantly improved processing speed for Kazakh voice commands. It can serve as a base for other applications that need Kazakh speech-to-text, such as voice assistants, transcription services, or accessibility tools.

### Out-of-Scope Use

This model is not intended for critical applications where high accuracy is paramount, unless it is first fine-tuned further on a larger, more diverse dataset and rigorously tested. It is designed for transcription only, not for real-time translation. It should not be used for malicious purposes, such as surveillance without consent.

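One way to judge whether accuracy is adequate for a given use is to compute WER on your own recordings with known transcripts. The sketch below is a plain-Python illustration of the metric (word-level edit distance divided by the number of reference words). It is not the evaluation script used to produce the figures on this card, and the function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is transcribed incorrectly.
reference = "мен қазақ тілінде сөйлеймін"
hypothesis = "мен қазақ тілінде сөйлейміз"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # WER: 25.00%
```

The card reports WER as a percentage (`53.51`), so multiply the returned fraction by 100 to compare. Scores also depend on text normalization such as casing and punctuation handling, which this sketch does not perform.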
## Bias, Risks, and Limitations

As with all ASR models, this model may exhibit biases that reflect the demographic makeup of its training data (Mozilla Common Voice). Performance may vary with speaker accent, background noise, audio quality, and specific vocabulary or dialect.

**Key Technical Limitation:** The model's word error rate (WER) on the validation set plateaued at approximately 53.5% during training, even though the training loss was very low. This gap indicates that the model generalizes poorly, likely because of the complexity of Kazakh morphology combined with the limited scale of the available training data. Further research and larger, more diverse datasets are needed to improve accuracy significantly.

### Recommendations

Users should be aware of the current accuracy limitations, especially for critical applications. For higher accuracy, further fine-tuning on larger and more diverse Kazakh datasets is recommended. Contributions to larger Kazakh ASR datasets would greatly benefit the community.

## How to Get Started with the Model

Use the code below to get started with the model using the Hugging Face `transformers` library:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="flamme-vrm/whisper-fine-tuned-kk")

audio_file = "your_kazakh_audio.wav"  # Replace with your audio file path
transcription = pipe(audio_file)

# The pipeline returns a dict; the transcript is under the "text" key
print(transcription["text"])
```

## Contact

- Telegram: @Vermeei
- GitHub profile: Flamme-VRM