Whisper Fine-tuning with Configuration
This project provides a configurable way to fine-tune OpenAI's Whisper model on mixed multilingual datasets, including:
- FLEURS Cebuano (ceb_ph)
- FLEURS Khmer (km_kh)
- Switchboard1 English
- WenetSpeech Chinese
- EN-INDON-CS (English-Indonesian code-switching)
- EN-MALAY-CS (English-Malay code-switching)
Features
- Flexible Configuration: All parameters are configurable through YAML files
- Multi-GPU Support: Automatic detection and support for multiple GPUs
- Dynamic Language Selection: Train on any subset of supported languages
- On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
- Comprehensive Evaluation: Automatic evaluation on multiple test sets
Configuration
All parameters are configurable through the config.yaml file. This includes:
Model Configuration
- Model checkpoint (default: openai/whisper-large-v3)
- Maximum target length for sequences
Dataset Configuration
- Languages to include in training
- Dataset sources and splits
- Language-specific settings (subset ratios, validation sizes)
Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs single GPU settings
- Evaluation and logging parameters
Environment Configuration
- CPU core limits
- Environment variables for optimization
Pushing to Hub
- By default, the configuration does not push to the Hugging Face Hub. You can enable this by setting push_to_hub: true in your config file.
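As an illustrative sketch (assuming the flag sits under the training section; place it wherever your config.yaml actually defines it):
training:
  push_to_hub: true   # upload the fine-tuned model to the Hugging Face Hub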
Usage
Basic Usage
python finetune.py --config config.yaml
Custom Configuration
python finetune.py --config my_custom_config.yaml
Multi-GPU Training
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
Configuration File Structure
The config.yaml file is organized into the following sections:
- model: Model checkpoint and sequence length settings
- output: Output directory configuration
- environment: Environment variables and CPU settings
- audio: Audio processing settings (sampling rate)
- languages: Language-specific configurations
- datasets: Dataset source and split configurations
- training: All training hyperparameters
- data_processing: Data processing settings
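For orientation, the skeleton below sketches what those sections might contain. The section names are the project's own; the individual keys and example values are illustrative placeholders and may not match the real config.yaml.
model:
  checkpoint: openai/whisper-large-v3   # pretrained checkpoint to fine-tune
  max_target_length: 448                # maximum target sequence length (illustrative value)
output:
  dir: ./whisper-finetuned              # where checkpoints and logs are written
environment:
  num_cpu_cores: 8                      # CPU core limit (illustrative)
audio:
  sampling_rate: 16000                  # Whisper expects 16 kHz audio
languages:
  cebuano:
    subset_ratio: 1.0                   # fraction of the training data to use
    validation_size: 500                # illustrative validation set size
datasets:
  cebuano:
    source: google/fleurs
    config: ceb_ph
    train_split: train
    test_split: test
training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 8
  max_steps: 5000
  eval_steps: 500
  push_to_hub: false
data_processing:
  on_the_fly: true                      # preprocess audio dynamically to save memory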
Customizing Your Training
Adding a New Language
- Add the language configuration to the languages section
- Add the dataset configuration to the datasets section
- The script will automatically include it in training
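As a sketch, registering a hypothetical FLEURS Vietnamese subset could look like this; the key names follow the illustrative skeleton above and should be adapted to the real config.yaml:
languages:
  vietnamese:
    subset_ratio: 1.0        # use the full training subset
    validation_size: 500     # illustrative
datasets:
  vietnamese:
    source: google/fleurs
    config: vi_vn            # FLEURS-style language config name
    train_split: train
    test_split: test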
Adjusting Training Parameters
Modify the training section in config.yaml:
- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings
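For example, a conservative adjustment might look like the following (key names are illustrative; match them to your config.yaml):
training:
  learning_rate: 5.0e-6             # lower learning rate for a gentler fine-tune
  per_device_train_batch_size: 4    # reduce if you hit GPU out-of-memory errors
  gradient_accumulation_steps: 4    # keep the effective batch size constant
  max_steps: 8000                   # train for longer
  eval_steps: 1000                  # evaluate less frequently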
Environment Optimization
Adjust the environment section to optimize for your system:
- Set CPU core limits
- Configure memory usage settings
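A possible sketch, assuming the environment section holds a CPU core limit and a map of environment variables (the exact keys may differ in the real config):
environment:
  num_cpu_cores: 16                 # cap dataloader / preprocessing parallelism
  variables:
    OMP_NUM_THREADS: "8"            # limit OpenMP threads per process
    TOKENIZERS_PARALLELISM: "false" # silence tokenizer fork warnings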
Configuration Examples
- config.yaml: Full configuration with all 6 languages
Inference Guide
After training your model, you can use it for standalone inference. The project includes two inference template scripts:
- inference_template_wer.py: Template for English/Malay/Indonesian/Cebuano evaluation (computes WER)
- inference_template_cer.py: Template for Chinese/Khmer evaluation (computes CER)
These templates show how to load your trained model and perform batch inference on test datasets.
Dependencies
Install required packages:
pip install -r requirements.txt
Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)
Evaluation Results
Results from training on all 6 languages with the original configuration:
| Language | Metric | Error Rate |
|---|---|---|
| Cebuano | WER | 17.13% |
| Khmer | CER | 48.00% |
| English | WER | 8.11% |
| Chinese-NET | CER | 11.98% |
| Chinese-MTG | CER | 21.02% |
| En-Indon | WER | 17.99% |
| En-Malay | WER | 43.55% |
Note: If you encounter issues running finetune.py, you can use the finetune-backup.py file, which contains the original hardcoded configuration used to generate these evaluation metrics.