# Whisper Fine-tuning for Khmer Language

This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the Khmer language using the Google FLEURS dataset (`km_kh`).
## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets
## Configuration

All parameters are configurable through the `config.yaml` file. This configuration is specifically set up for Khmer language training using the Google FLEURS dataset.
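Reading such a config comes down to a few lines of PyYAML. A minimal sketch (the keys accessed here follow the section names listed under "Configuration File Structure" below; how `finetune.py` itself consumes them is not shown):

```python
import yaml

# Read the training configuration from config.yaml.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Top-level sections (see "Configuration File Structure" below):
model_cfg = config["model"]        # checkpoint and sequence length settings
training_cfg = config["training"]  # learning rate, batch sizes, training steps
```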
### Model Configuration

- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences
### Dataset Configuration

- Uses the Google FLEURS Khmer (`km_kh`) dataset
- Dataset sources and splits
- Language-specific settings
- Training subset ratio (25% of the data, for faster training)
### Training Configuration

- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters
### Environment Configuration

- CPU core limits
- Environment variables for optimization
### Pushing to Hub

By default, the configuration does not push to the Hugging Face Hub. You can enable this by setting `push_to_hub: true` in your config file.
## Usage

### Basic Usage

```bash
python finetune.py --config config.yaml
```

### Custom Configuration

```bash
python finetune.py --config my_custom_config.yaml
```
### Multi-GPU Training

Since there are only about 2.5 hours of training data, multi-GPU training is not recommended.
## Configuration File Structure

The `config.yaml` file is organized into the following sections:

- `model`: Model checkpoint and sequence length settings
- `output`: Output directory configuration
- `environment`: Environment variables and CPU settings
- `audio`: Audio processing settings (sampling rate)
- `languages`: Khmer language configuration
- `datasets`: Google FLEURS Khmer dataset configuration
- `training`: All training hyperparameters
- `data_processing`: Data processing settings
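If you maintain your own config files, a quick sanity check against this structure catches typos early. A sketch (the section names come from the list above; `finetune.py`'s own validation, if any, may differ):

```python
import yaml

EXPECTED_SECTIONS = [
    "model", "output", "environment", "audio",
    "languages", "datasets", "training", "data_processing",
]

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Fail fast on missing top-level sections before training starts.
missing = [s for s in EXPECTED_SECTIONS if s not in cfg]
if missing:
    raise KeyError(f"config.yaml is missing sections: {missing}")
```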
## Customizing Your Training

### Adjusting Training Parameters

Modify the `training` section in `config.yaml` to:

- Change the learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings
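These values presumably end up in Hugging Face `Seq2SeqTrainingArguments`. The sketch below shows one plausible mapping; the YAML key names, defaults, and output path are assumptions, not the script's actual ones:

```python
import yaml
from transformers import Seq2SeqTrainingArguments

with open("config.yaml") as f:
    train = yaml.safe_load(f)["training"]

# Hypothetical key names; check your config.yaml for the real ones.
args = Seq2SeqTrainingArguments(
    output_dir="./whisper-khmer",  # hypothetical output path
    learning_rate=train.get("learning_rate", 1e-5),
    per_device_train_batch_size=train.get("train_batch_size", 8),
    max_steps=train.get("max_steps", 2000),
    eval_strategy="steps",
    eval_steps=train.get("eval_steps", 200),
    logging_steps=train.get("logging_steps", 25),
)
```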
### Environment Optimization

Adjust the `environment` section to optimize for your system:

- Set CPU core limits
- Configure memory usage settings
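CPU core limits are typically applied through standard environment variables, set before the heavy libraries are imported. A sketch (which variables the `environment` section actually maps to is an assumption):

```python
import os

# Set thread limits before importing torch/numpy so they take effect.
os.environ.setdefault("OMP_NUM_THREADS", "8")  # OpenMP threads
os.environ.setdefault("MKL_NUM_THREADS", "8")  # Intel MKL threads

import torch

torch.set_num_threads(8)  # cap PyTorch intra-op CPU parallelism
```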
## Training Commands

### Basic Training

```bash
python finetune.py
```

### Single GPU Training

With multiple GPUs visible, the script picks them up automatically; to force single-GPU training, limit device visibility:

```bash
CUDA_VISIBLE_DEVICES=0 python finetune.py
```
## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:

- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for Khmer language transcription
### Using the Trained Model

The inference script automatically handles:

- Loading the fine-tuned model weights
- Audio preprocessing with the proper sampling rate
- Generating transcriptions for Khmer speech
- Output formatting for evaluation metrics
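If you prefer to script inference yourself rather than use `inference.py`, the `transformers` ASR pipeline covers the same steps. A minimal sketch, assuming a local checkpoint directory and audio file (both paths are placeholders):

```python
from transformers import pipeline

# Point `model` at your fine-tuned checkpoint directory (hypothetical path).
asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-khmer",
    generate_kwargs={"language": "khmer", "task": "transcribe"},
)

# The pipeline resamples input audio to the model's 16 kHz sampling rate.
result = asr("sample.wav")
print(result["text"])
```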
## Dependencies

Install the required packages:

```bash
pip install -r requirements.txt
```

Key dependencies:

- PyYAML (configuration loading)
- torch, transformers, datasets
- librosa (audio processing)
- evaluate (metrics)
## Evaluation Results

| Language | Metric | Error Rate |
|----------|--------|------------|
| Khmer    | CER    | 33.18%     |
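For reference, this is how CER can be computed with the `evaluate` package listed above (the Khmer strings are made-up examples, not dataset outputs):

```python
import evaluate

cer = evaluate.load("cer")

# Character Error Rate: character-level edit distance over reference length.
score = cer.compute(
    predictions=["សួស្តី ពិភពលោក"],  # hypothetical model transcription
    references=["សួស្ដី ពិភពលោក"],   # hypothetical ground truth
)
print(f"CER: {score:.2%}")
```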
**Note:** If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration.