---
license: cc-by-nc-3.0
datasets:
- pengyizhou/SWB1
- pengyizhou/wenetspeech-subset-S
- pengyizhou/EN-MALAY-CS
- pengyizhou/EN-INDON-CS
- google/fleurs
language:
- en
- zh
- id
- ms
- km
metrics:
- cer
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Whisper Fine-tuning with Configuration

This project provides a configurable way to fine-tune OpenAI's Whisper model on mixed multilingual datasets, including:

- FLEURS Cebuano (ceb_ph)
- FLEURS Khmer (km_kh)
- Switchboard-1 English
- WenetSpeech Chinese
- EN-INDON-CS
- EN-MALAY-CS

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on multiple test sets

## Configuration

All parameters are configurable through the `config.yaml` file. This includes:

### Model Configuration

- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences

### Dataset Configuration

- Languages to include in training
- Dataset sources and splits
- Language-specific settings (subset ratios, validation sizes)

### Training Configuration

- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters

### Environment Configuration

- CPU core limits
- Environment variables for optimization

### Pushing to Hub

By default, the configuration does not push to the Hugging Face Hub. You can enable this by setting `push_to_hub: true` in your config file.
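The sections above might map onto a `config.yaml` laid out roughly as follows. This is a minimal sketch; the key names and values shown are illustrative, so consult the shipped `config.yaml` for the authoritative fields and defaults:

```yaml
# Illustrative config.yaml layout (field names are examples, not the
# project's authoritative schema).
model:
  checkpoint: openai/whisper-large-v3
  max_target_length: 448

output:
  dir: ./whisper-finetuned

environment:
  cpu_cores: 8

audio:
  sampling_rate: 16000

languages:
  english:
    subset_ratio: 1.0
    validation_size: 500

datasets:
  english:
    source: pengyizhou/SWB1
    train_split: train

training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 8
  max_steps: 10000
  push_to_hub: false
```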
## Usage

### Basic Usage

```bash
python finetune.py --config config.yaml
```

### Custom Configuration

```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training

```bash
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
```

## Configuration File Structure

The `config.yaml` file is organized into the following sections:

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Language-specific configurations
6. **datasets**: Dataset source and split configurations
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings

## Customizing Your Training

### Adding a New Language

1. Add the language configuration to the `languages` section
2. Add the dataset configuration to the `datasets` section
3. The script will automatically include it in training

### Adjusting Training Parameters

Modify the `training` section in `config.yaml`:

- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings

### Environment Optimization

Adjust the `environment` section to optimize for your system:

- Set CPU core limits
- Configure memory usage settings

## Configuration Examples

- `config.yaml`: Full configuration with all 6 languages

## Inference Guide

After training your model, you can use it for standalone inference. The project includes two inference template scripts:

- **`inference_template_wer.py`**: Template for English/Malay/Indonesian/Cebuano evaluation (computes WER)
- **`inference_template_cer.py`**: Template for Chinese/Khmer evaluation (computes CER)

These templates show how to load your trained model and perform batch inference on test datasets.
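A script consuming this file would typically load it with PyYAML and check that the expected top-level sections are present. The sketch below is illustrative of that pattern, assuming the eight sections listed above; the actual validation logic in `finetune.py` may differ:

```python
import yaml  # PyYAML, listed in requirements.txt

# Top-level sections described in "Configuration File Structure".
REQUIRED_SECTIONS = [
    "model", "output", "environment", "audio",
    "languages", "datasets", "training", "data_processing",
]

def load_config(path):
    """Load a YAML config file and verify the expected sections exist."""
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = [s for s in REQUIRED_SECTIONS if s not in config]
    if missing:
        raise KeyError(f"config is missing sections: {missing}")
    return config
```

Failing fast on a missing section gives a clearer error than a `KeyError` deep inside training.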
## Dependencies

Install required packages:

```bash
pip install -r requirements.txt
```

Key dependencies:

- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)

## Evaluation Results

Results from training on all 6 languages with the original configuration:

| Language    | Metric | Error Rate |
|-------------|:------:|-----------:|
| Cebuano     | WER    |     17.13% |
| Khmer       | CER    |     48.00% |
| English     | WER    |      8.11% |
| Chinese-NET | CER    |     11.98% |
| Chinese-MTG | CER    |     21.02% |
| En-Indon    | WER    |     17.99% |
| En-Malay    | WER    |     43.55% |

**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration that was used to generate these evaluation metrics.
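For reference, WER and CER are both edit-distance rates: the Levenshtein distance between hypothesis and reference, normalized by reference length, at the word level (WER) or character level (CER, used here for Chinese and Khmer). The inference templates compute these via the `evaluate` library; the pure-Python sketch below just illustrates the underlying computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (O(len(hyp)) memory)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of: deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def char_error_rate(reference, hypothesis):
    """CER: the same computation at the character level."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Note that both metrics can exceed 100% when the hypothesis contains many insertions relative to the reference.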