---
license: cc-by-nc-3.0
datasets:
- pengyizhou/EN-MALAY-CS
language:
- en
- ms
metrics:
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Whisper Fine-tuning for English-Malay Code-Switching

This project provides a configurable way to fine-tune OpenAI's Whisper model on the EN-MALAY-CS (English-Malay Code-Switching) dataset.

## Features

- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets

## Configuration

All parameters are configurable through the `config.yaml` file. The provided configuration is specifically set up for training on English-Malay code-switching data.

### Model Configuration

- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences

### Dataset Configuration

- Uses the EN-MALAY-CS dataset only
- Dataset sources and splits
- Language-specific settings (subset ratios, validation sizes)

### Training Configuration

- Learning rate, batch sizes, training steps
- Multi-GPU vs. single-GPU settings
- Evaluation and logging parameters

### Environment Configuration

- CPU core limits
- Environment variables for optimization

### Pushing to Hub

The configuration does not push to the Hugging Face Hub by default. You can enable this by setting `push_to_hub: true` in your config file.

## Usage

### Basic Usage

```bash
python finetune.py --config config.yaml
```

### Custom Configuration

```bash
python finetune.py --config my_custom_config.yaml
```

### Multi-GPU Training

```bash
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
```

## Configuration File Structure

The `config.yaml` file is organized into the following sections:

1. **model**: Model checkpoint and sequence length settings
2. **output**: Output directory configuration
3. **environment**: Environment variables and CPU settings
4. **audio**: Audio processing settings (sampling rate)
5. **languages**: Malay language configuration
6. **datasets**: EN-MALAY-CS dataset configuration
7. **training**: All training hyperparameters
8. **data_processing**: Data processing settings

## Customizing Your Training

### Adjusting Training Parameters

Modify the `training` section in `config.yaml`:

- Change the learning rate, batch sizes, or training steps
- Adjust the evaluation frequency
- Configure multi-GPU settings

### Environment Optimization

Adjust the `environment` section to optimize for your system (see the PyYAML sketch below):

- Set CPU core limits
- Configure memory usage settings
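As a quick illustration of working with this layout, here is a minimal sketch of loading and editing the config with PyYAML. Only the top-level section names are taken from the structure described above; the nested keys (such as `learning_rate`) are hypothetical placeholders and may not match the actual file.

```python
# Minimal sketch: load config.yaml, inspect its sections, and write a custom copy.
# Top-level section names follow the structure described above; nested keys such
# as "learning_rate" are hypothetical placeholders.
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# List the top-level sections (model, output, environment, audio, languages,
# datasets, training, data_processing)
for section, settings in config.items():
    print(f"{section}: {settings}")

# Example tweak: lower the learning rate (hypothetical key) and save a custom config
config["training"]["learning_rate"] = 1e-5
with open("my_custom_config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```

The resulting file can then be passed to the training script with `--config my_custom_config.yaml`, as shown in the Usage section above.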
## Training Commands

### Basic (Single GPU) Training

```bash
python finetune.py
```

### Multi-GPU Training

```bash
torchrun --nproc_per_node=2 finetune.py
```

## Inference Guide

After training your model, you can use the provided `inference.py` script for speech recognition:

```bash
python inference.py
```

The inference script includes:

- Model loading from the trained checkpoint
- An audio preprocessing pipeline
- Text generation with proper formatting
- Support for English-Malay code-switching

### Using the Trained Model

The inference script automatically handles:

- Loading the fine-tuned model weights
- Audio preprocessing with the proper sampling rate
- Generating transcriptions with code-switching support
- Output formatting for evaluation metrics

## Dependencies

Install the required packages:

```bash
pip install -r requirements.txt
```

Key dependencies:

- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)

## Evaluation Results

| Language | Metric | Error Rate |
|----------|:------:|-----------:|
| En-Malay |  WER   |     42.33% |

**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration that was used to generate these evaluation metrics.
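For reference, word error rates like the one above can be computed with the `evaluate` package listed in the dependencies. Below is a minimal sketch of such a computation; the `predictions` and `references` lists are placeholders standing in for model outputs and ground-truth transcripts from the EN-MALAY-CS test split.

```python
# Minimal sketch of scoring transcriptions with the evaluate package.
# The lists below are placeholders; in practice they would come from
# running inference.py on the EN-MALAY-CS test split.
import evaluate

wer_metric = evaluate.load("wer")

references = ["saya nak pergi to the meeting later"]  # ground-truth transcripts (placeholder)
predictions = ["saya nak pergi to meeting later"]     # model outputs (placeholder)

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")
```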