Whisper Fine-tuning for English (Switchboard-1)
This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the Switchboard-1 (SWB1) English conversational speech dataset.
Features
- Flexible Configuration: All parameters are configurable through YAML files
- Multi-GPU Support: Automatic detection and support for multiple GPUs
- Dynamic Language Selection: Train on any subset of supported languages
- On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
- Comprehensive Evaluation: Automatic evaluation on test sets
Configuration
All parameters are configurable through the config.yaml file. This configuration is specifically set up for English conversational speech training using the Switchboard-1 dataset.
Model Configuration
- Model checkpoint (default: openai/whisper-large-v3)
- Maximum target length for sequences
Dataset Configuration
- Uses Switchboard-1 (SWB1) English conversational speech dataset
- Dataset sources and splits
- Language-specific settings
- Training configuration optimized for conversational speech
Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs single GPU settings
- Evaluation and logging parameters
Environment Configuration
- CPU core limits
- Environment variables for optimization
Pushing to Hub
- The configuration does not push to the Hugging Face Hub by default. You can enable pushing by setting push_to_hub: true in your config file, for example:
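A minimal illustration; hub_model_id is an assumed companion key (mirroring the Hugging Face Trainer argument of the same name), not necessarily part of this repository's schema:

```yaml
training:
  push_to_hub: true                         # enable uploads to the Hub
  hub_model_id: your-username/whisper-swb1  # assumed key; target repo on the Hub
```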
Usage
Basic Usage
python finetune.py --config config.yaml
Custom Configuration
python finetune.py --config my_custom_config.yaml
Multi-GPU Training
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
Configuration File Structure
The config.yaml file is organized into the following sections (a sketch of the overall layout follows the list):
- model: Model checkpoint and sequence length settings
- output: Output directory configuration
- environment: Environment variables and CPU settings
- audio: Audio processing settings (sampling rate)
- languages: English language configuration
- datasets: Switchboard-1 dataset configuration
- training: All training hyperparameters
- data_processing: Data processing settings
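The section names in the sketch below match the list above, but the individual keys and values are illustrative assumptions rather than the exact schema shipped with this repository:

```yaml
model:
  checkpoint: openai/whisper-large-v3  # default checkpoint noted above
  max_target_length: 448               # assumed key; 448 is Whisper's decoder limit

output:
  dir: ./whisper-swb1-finetuned        # assumed output directory

environment:
  num_cpu_cores: 8                     # assumed key for the CPU core limit

audio:
  sampling_rate: 16000                 # Whisper models expect 16 kHz input

languages:
  - en

datasets:
  train_path: /path/to/swb1/train      # placeholder paths; point these at your SWB1 copy
  test_path: /path/to/swb1/test

training:
  learning_rate: 1.0e-5                # illustrative hyperparameters
  per_device_train_batch_size: 8
  max_steps: 5000
  eval_steps: 500
  push_to_hub: false                   # Hub pushing is disabled by default

data_processing:
  on_the_fly: true                     # dynamic audio preprocessing
```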
Customizing Your Training
Adjusting Training Parameters
Modify the training section in config.yaml (an example adjustment follows this list):
- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings
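The key names below follow common Hugging Face Trainer arguments and are assumptions; check your actual config.yaml for the exact spelling:

```yaml
training:
  learning_rate: 5.0e-6              # lower LR for a gentler fine-tune
  per_device_train_batch_size: 4     # per-GPU batch size
  gradient_accumulation_steps: 2     # assumed key; keeps effective batch size at 8
  max_steps: 8000
  eval_steps: 1000                   # evaluate less frequently
```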
Environment Optimization
Adjust the environment section to optimize for your system (see the sketch after this list):
- Set CPU core limits
- Configure memory usage settings
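For example, on a machine with limited cores you might constrain thread usage roughly as follows. The key names are assumptions; the environment variables themselves are standard:

```yaml
environment:
  num_cpu_cores: 4                    # assumed key for the CPU core limit
  variables:
    OMP_NUM_THREADS: "4"              # cap OpenMP worker threads
    MKL_NUM_THREADS: "4"              # cap MKL worker threads
    TOKENIZERS_PARALLELISM: "false"   # avoid tokenizer fork warnings
```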
The provided config.yaml is specifically configured for English SWB1 training.
Training Commands
Basic Training
python finetune.py
Single GPU Training
python finetune.py
Multi-GPU Training
torchrun --nproc_per_node=2 finetune.py
Inference Guide
After training your model, you can use the provided inference.py script for speech recognition:
python inference.py
The inference script includes:
- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for English conversational speech transcription
Using the Trained Model
The inference script automatically handles:
- Loading the fine-tuned model weights
- Audio preprocessing with proper sampling rate
- Generating transcriptions for conversational English speech
- Output formatting for evaluation metrics
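If you prefer to call the fine-tuned model directly rather than through inference.py, here is a minimal sketch using the transformers API. The checkpoint path ./whisper-switchboard1 and the audio file name are assumptions, not paths produced by this repository:

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the fine-tuned checkpoint (path is an assumed example)
checkpoint = "./whisper-switchboard1"
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
model.eval()

# Whisper expects 16 kHz mono audio
audio, _ = librosa.load("example_call.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate an English transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features, language="en", task="transcribe"
    )
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```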
SWB1 Evaluation with Text Normalization
Important: Switchboard-1 evaluation requires proper text normalization to handle mismatched terms between the training and evaluation datasets. Use the provided normalization script for accurate WER computation:
# Run inference first to generate predictions
python inference.py
# Then use the normalization script for proper evaluation
bash normalization_inference_results.sh
The normalization_inference_results.sh script performs:
- Hesitation Removal: Filters out hesitation markers from both reference and hypothesis
- Text Normalization: Applies SWB1-specific text filters (wer_ref_filter and wer_hyp_filter)
- WER Computation: Calculates word error rate with proper normalization
- Result Display: Shows the overall WER performance
Manual Evaluation Steps
If you need to customize the evaluation process:
# 1. Filter hesitations from reference and hypothesis
grep "sw_" ./SWB1.ref | grep -v "hesitation" > ./SWB1.ref.no_hesitation
grep "sw_" ./SWB1_finetuned.pred > ./SWB1_finetuned.pred.no_hesitation
# 2. Apply text normalization filters
cat ./SWB1.ref.no_hesitation | ./wer_ref_filter > ./SWB1.ref.filter.no_hesitation
cat ./SWB1_finetuned.pred.no_hesitation | ./wer_hyp_filter > ./SWB1_finetuned.pred.filter.no_hesitation
# 3. Compute WER with normalized text
./compute-wer.sh ./SWB1.ref.filter.no_hesitation ./SWB1_finetuned.pred.filter.no_hesitation ./SWB1_finetuned.wer
# 4. View results
grep Overall ./SWB1_finetuned.wer
Dependencies
Install required packages:
pip install -r requirements.txt
Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics; a usage sketch follows)
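A self-contained sketch of the WER computation with evaluate, using toy strings in place of real normalized SWB1 transcripts:

```python
import evaluate  # the "wer" metric requires the jiwer backend

# Load the word-error-rate metric
wer_metric = evaluate.load("wer")

# Toy reference/hypothesis pairs standing in for normalized SWB1 text
references = ["yeah i think that is right", "we talked about it yesterday"]
predictions = ["yeah i think that's right", "we talked about it yesterday"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")
```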
Evaluation Results (after normalization)
| Language | Metric | Error Rate |
|----------|--------|------------|
| English  | WER    | 10.20%     |
Note: If you encounter issues running finetune.py, you can use the finetune-backup.py file, which contains the original hardcoded configuration that was used to generate these evaluation metrics.