---
license: cc-by-nc-3.0
datasets:
- google/fleurs
language:
- km
metrics:
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Whisper Fine-tuning for Khmer Language

This project provides a configurable way to fine-tune OpenAI's Whisper model on the Khmer language using the Google FLEURS dataset (`km_kh`).

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets

## Configuration

All parameters are configurable through the `config.yaml` file. The provided configuration is set up specifically for Khmer language training on the Google FLEURS dataset.

### Model Configuration

- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences

### Dataset Configuration

- Uses the Google FLEURS Khmer (`km_kh`) dataset
- Dataset sources and splits
- Language-specific settings
- Training subset ratio (25% of the data, for faster training)

### Training Configuration

- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters

### Environment Configuration

- CPU core limits
- Environment variables for optimization

### Pushing to Hub

By default, the configuration does not push to the Hugging Face Hub. You can enable this by setting `push_to_hub: true` in your config file.

## Usage

### Basic Usage

```bash
python finetune.py --config config.yaml
```

### Custom Configuration

```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training

Because the available training data is very limited (around 2.5 hours), multi-GPU training is not recommended.

## Configuration File Structure

The `config.yaml` file is organized into the following sections:

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Khmer language configuration
6. **datasets**: Google FLEURS Khmer dataset configuration
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings

## Customizing Your Training

### Adjusting Training Parameters

Modify the `training` section in `config.yaml` to:

- Change the learning rate, batch sizes, or training steps
- Adjust the evaluation frequency
- Configure multi-GPU settings

### Environment Optimization

Adjust the `environment` section to optimize for your system:

- Set CPU core limits
- Configure memory usage settings
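### Generating a Custom Config

If you prefer to derive a custom config programmatically rather than editing `config.yaml` by hand, a minimal sketch using PyYAML (listed under dependencies) is shown below. The keys inside `training` (`learning_rate`, `push_to_hub`) are assumptions based on the section names above; check your actual `config.yaml` for the real field names.

```python
import yaml  # PyYAML, listed in requirements.txt

# Load the base configuration shipped with the project.
with open("config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Hypothetical keys -- adjust to match the actual schema of config.yaml.
config["training"]["learning_rate"] = 1e-5
config["training"]["push_to_hub"] = False

# Write a custom variant to pass to finetune.py via --config.
with open("my_custom_config.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, allow_unicode=True)
```

You can then start training with `python finetune.py --config my_custom_config.yaml`, as described under Usage.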
## Training Commands

### Basic Training (Single GPU)

```bash
python finetune.py
```

## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:

- Model loading from the trained checkpoint
- An audio preprocessing pipeline
- Text generation with proper formatting
- Support for Khmer-language transcription

### Using the Trained Model

The inference script automatically handles:

- Loading the fine-tuned model weights
- Audio preprocessing at the proper sampling rate
- Generating transcriptions for Khmer speech
- Output formatting for evaluation metrics

If you want to load the checkpoint outside of `inference.py`, see the standalone sketch at the end of this README.

## Dependencies

Install the required packages:

```bash
pip install -r requirements.txt
```

Key dependencies:

- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)

## Evaluation Results

| Language | Metric | Error Rate |
|----------|:------:|-----------:|
| Khmer    |  CER   |     33.18% |

**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration.
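## Standalone Inference Sketch

As referenced in the Inference Guide, the following is a minimal standalone sketch of transcribing Khmer audio with a fine-tuned checkpoint via the `transformers` ASR pipeline. It is not the project's `inference.py`; the checkpoint path and audio file name are placeholders.

```python
# A minimal sketch, assuming the fine-tuned model was saved with
# save_pretrained() to a local directory. Not the project's inference.py.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-khmer-finetuned",  # placeholder: your trained checkpoint
    chunk_length_s=30,                  # Whisper operates on 30-second windows
)

# The pipeline resamples the input to the 16 kHz rate Whisper expects.
result = asr(
    "sample_khmer.wav",  # placeholder audio file
    generate_kwargs={"language": "khmer", "task": "transcribe"},
)
print(result["text"])
```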