# Whisper Fine-tuning for Chinese (WenetSpeech)
This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the WenetSpeech Chinese speech dataset.
## Features
- **Flexible Configuration**: All parameters are configurable through YAML files
- **Multi-GPU Support**: Automatic detection and support for multiple GPUs
- **Dynamic Language Selection**: Train on any subset of supported languages
- **On-the-fly Processing**: Efficient memory usage with dynamic audio preprocessing
- **Comprehensive Evaluation**: Automatic evaluation on test sets
## Configuration
All parameters are configurable through the `config.yaml` file. This configuration is specifically set up for Chinese speech training using the WenetSpeech dataset.
### Model Configuration
- Model checkpoint (default: `openai/whisper-large-v3`)
- Maximum target length for sequences
### Dataset Configuration
- Uses WenetSpeech Chinese speech dataset
- Multiple dataset splits (train, validation, test_net, test_meeting)
- Language-specific settings
- Training configuration optimized for Chinese speech recognition
### Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs single GPU settings
- Evaluation and logging parameters
### Environment Configuration
- CPU core limits
- Environment variables for optimization
### Pushing to Hub
The configuration does not push to the Hugging Face Hub by default. You can enable this by setting `push_to_hub: true` in your config file.
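The relevant option in `config.yaml` might look like this (the key name comes from the text above; its exact placement within the file is an assumption):

```yaml
# Hugging Face Hub upload is disabled by default.
# Set to true to push checkpoints after training.
push_to_hub: false
```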
## Usage
### Basic Usage
```bash
python finetune.py --config config.yaml
```
### Custom Configuration
```bash
python finetune.py --config my_custom_config.yaml
```
### Multi-GPU Training
```bash
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
```
## Configuration File Structure
The `config.yaml` file is organized into the following sections:
- `model`: Model checkpoint and sequence length settings
- `output`: Output directory configuration
- `environment`: Environment variables and CPU settings
- `audio`: Audio processing settings (sampling rate)
- `languages`: Chinese language configuration
- `datasets`: WenetSpeech dataset configuration
- `training`: All training hyperparameters
- `data_processing`: Data processing settings
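A sketch of what a file with these sections might look like (the section names come from the list above; the individual keys and values are illustrative assumptions, not the actual repository file):

```yaml
model:
  checkpoint: openai/whisper-large-v3   # default from this README
  max_target_length: 448                # illustrative value
output:
  dir: ./whisper-wenetspeech-output
environment:
  num_cpu_cores: 8
audio:
  sampling_rate: 16000                  # the rate Whisper expects
languages:
  - zh
datasets:
  name: wenetspeech
  train_split: train
  eval_split: validation
training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 8
  max_steps: 5000
data_processing:
  on_the_fly: true
```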
## Customizing Your Training
### Adjusting Training Parameters
Modify the `training` section in `config.yaml`:
- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings
### Environment Optimization
Adjust the `environment` section to optimize for your system:
- Set CPU core limits
- Configure memory usage settings
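CPU core limits of this kind usually take effect through thread-count environment variables set before the numerical libraries are imported. A minimal illustration (the variable names are standard OpenMP/MKL knobs; how `finetune.py` actually applies the config is an assumption):

```python
import os

# Limit the number of threads used by OpenMP/MKL-backed libraries.
# These must be set before importing torch/numpy to take effect.
cpu_cores = 8  # would come from the `environment` section of config.yaml
os.environ["OMP_NUM_THREADS"] = str(cpu_cores)
os.environ["MKL_NUM_THREADS"] = str(cpu_cores)

print(os.environ["OMP_NUM_THREADS"])
```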
## Training Commands
The provided `config.yaml` is specifically configured for Chinese WenetSpeech training.
### Basic Training
```bash
python finetune.py
```
### Single GPU Training
```bash
python finetune.py
```
### Multi-GPU Training
```bash
torchrun --nproc_per_node=2 finetune.py
```
## Inference Guide
After training your model, you can use the provided `inference.py` script for speech recognition:
```bash
python inference.py
```
The inference script includes:
- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for Chinese speech transcription
### Using the Trained Model
The inference script automatically handles:
- Loading the fine-tuned model weights
- Audio preprocessing with proper sampling rate
- Generating transcriptions for Chinese speech
- Output formatting for evaluation metrics
## WenetSpeech Evaluation
**Evaluation Protocol**: WenetSpeech provides multiple test sets for comprehensive evaluation:
```bash
# Run inference to generate predictions
python inference.py
```
**Test Sets Available:**
- **DEV**: Development set for validation during training
- **TEST_NET**: Internet-sourced audio test set
- **TEST_MEETING**: Meeting audio test set
The evaluation uses Character Error Rate (CER), which is appropriate for Chinese speech recognition.
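CER is the character-level edit distance between hypothesis and reference, divided by the reference length. A self-contained illustration of the arithmetic (the project's scripts use the `evaluate` library; this pure-Python version is just for clarity):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        return float(n > 0)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

# One substituted character in a four-character reference -> CER of 0.25.
print(cer("今天天气", "今天天器"))  # 0.25
```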
## WenetSpeech Dataset
This model is specifically designed for Chinese speech recognition using the WenetSpeech corpus. Key characteristics:
- **Multi-domain Audio**: Internet videos, audiobooks, podcasts, and meeting recordings
- **High-quality Annotations**: Professional Chinese transcriptions with punctuation
- **Large Scale**: 10,000+ hours of labeled Chinese speech data (we use only subset-S for training)
- **Standard Benchmark**: Widely adopted for Chinese ASR research and development
### Dataset Characteristics
- **Audio Quality**: Various quality levels, from internet sources to studio recordings
- **Speaking Styles**: Read speech, spontaneous speech, and conversational audio
- **Vocabulary**: Large vocabulary covering diverse topics and domains
- **Language**: Mandarin Chinese with standard simplified Chinese characters
- **Evaluation Protocol**: Character Error Rate (CER) based evaluation
### Training Configuration
- **Training Data**: WenetSpeech subset-S for efficient training
- **Validation**: DEV_fixed split for model selection
- **Test Sets**: TEST_NET (internet audio) and TEST_MEETING (meeting audio)
- **Metric**: Character Error Rate (CER), optimized for Chinese script
## Dependencies
Install required packages:
```bash
pip install -r requirements.txt
```
Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics)
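Given the dependencies listed above, a `requirements.txt` along these lines would be expected (unpinned names only; the repository's own file is authoritative):

```text
PyYAML
torch
transformers
datasets
librosa
evaluate
```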
## Evaluation Results (after normalization)
| Language | Metric | Error Rate |
|---|---|---|
| Chinese-NET | CER | 14.20% |
| Chinese-MTG | CER | 22.40% |
**Note**: If you encounter issues running `finetune.py`, you can use the `finetune-backup.py` file, which contains the original hardcoded configuration used to generate these evaluation metrics.