Whisper Fine-tuning for Chinese (WenetSpeech)

This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the WenetSpeech Chinese speech dataset.

Features

  • Flexible Configuration: All parameters are configurable through YAML files
  • Multi-GPU Support: Automatic detection and support for multiple GPUs
  • Dynamic Language Selection: Train on any subset of supported languages
  • On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
  • Comprehensive Evaluation: Automatic evaluation on test sets

Configuration

All parameters are configurable through the config.yaml file. This configuration is specifically set up for Chinese speech training using the WenetSpeech dataset.

Model Configuration

  • Model checkpoint (default: openai/whisper-large-v3)
  • Maximum target length for sequences

Dataset Configuration

  • Uses WenetSpeech Chinese speech dataset
  • Multiple dataset splits (train, validation, test_net, test_meeting)
  • Language-specific settings
  • Training configuration optimized for Chinese speech recognition

Training Configuration

  • Learning rate, batch sizes, training steps
  • Multi-GPU vs single GPU settings
  • Evaluation and logging parameters

Environment Configuration

  • CPU core limits
  • Environment variables for optimization
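As a rough illustration, such limits can be exported from Python before heavy libraries are imported. The variable names below are standard threading knobs (OMP_NUM_THREADS, MKL_NUM_THREADS); the dictionary layout is illustrative, not this project's exact config schema:

```python
# Sketch: applying an "environment" section before heavy imports.
# The keys/values here are illustrative placeholders.
import os

env_config = {
    "OMP_NUM_THREADS": "8",             # cap OpenMP worker threads
    "MKL_NUM_THREADS": "8",             # cap MKL threads
    "TOKENIZERS_PARALLELISM": "false",  # silence the tokenizers fork warning
}

def apply_environment(config: dict) -> None:
    """Export each key/value pair so libraries imported later pick them up."""
    for key, value in config.items():
        os.environ[key] = str(value)

apply_environment(env_config)
```

These must be set before importing torch or transformers, since thread pools are sized at import time.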

Pushing to Hub

  • The configuration does not push to the Hugging Face Hub by default. You can enable this by setting push_to_hub: true in your config file.

Usage

Basic Usage

python finetune.py --config config.yaml

Custom Configuration

python finetune.py --config my_custom_config.yaml

Multi-GPU Training

# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml

Configuration File Structure

The config.yaml file is organized into the following sections:

  1. model: Model checkpoint and sequence length settings
  2. output: Output directory configuration
  3. environment: Environment variables and CPU settings
  4. audio: Audio processing settings (sampling rate)
  5. languages: Chinese language configuration
  6. datasets: WenetSpeech dataset configuration
  7. training: All training hyperparameters
  8. data_processing: Data processing settings
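A config.yaml following the sections above might look roughly like this. All keys and values here are placeholders to show the shape of the file, not the project's exact schema; consult the shipped config.yaml for the real keys:

```yaml
# Illustrative layout only; see the provided config.yaml for actual keys.
model:
  checkpoint: openai/whisper-large-v3
  max_target_length: 448
output:
  dir: ./whisper-wenetspeech
environment:
  cpu_cores: 8
audio:
  sampling_rate: 16000
languages:
  - zh
datasets:
  name: wenetspeech        # hypothetical dataset identifier
  train_split: train
  eval_split: validation
training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 16
  max_steps: 5000
  push_to_hub: false
data_processing:
  on_the_fly: true
```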

Customizing Your Training

Adjusting Training Parameters

Modify the training section in config.yaml:

  • Change learning rate, batch sizes, or training steps
  • Adjust evaluation frequency
  • Configure multi-GPU settings

Environment Optimization

Adjust the environment section to optimize for your system:

  • Set CPU core limits
  • Configure memory usage settings

Inference Guide

After training your model, you can use the provided inference.py script for speech recognition:

python inference.py

The inference script includes:

  • Model loading from the trained checkpoint
  • Audio preprocessing pipeline
  • Text generation with proper formatting
  • Support for Chinese speech transcription

Using the Trained Model

The inference script automatically handles:

  • Loading the fine-tuned model weights
  • Audio preprocessing with proper sampling rate
  • Generating transcriptions for Chinese speech
  • Output formatting for evaluation metrics
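A minimal sketch of what such an inference step can look like with the transformers pipeline API. The checkpoint path and audio filename below are placeholders, and inference.py may differ in its details:

```python
# Sketch: transcribing Chinese audio with a fine-tuned Whisper checkpoint.
# The checkpoint path and audio file are placeholders.

def build_asr_pipeline(checkpoint: str = "./whisper-wenetspeech-S"):
    """Create an automatic-speech-recognition pipeline from a local checkpoint."""
    from transformers import pipeline  # imported lazily to keep the sketch light
    return pipeline(
        "automatic-speech-recognition",
        model=checkpoint,
        generate_kwargs={"language": "zh", "task": "transcribe"},
    )

if __name__ == "__main__":
    asr = build_asr_pipeline()
    result = asr("sample.wav")  # placeholder audio file
    print(result["text"])
```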

WenetSpeech Evaluation

Evaluation Protocol: WenetSpeech provides multiple test sets for comprehensive evaluation:

# Run inference to generate predictions
python inference.py

Test Sets Available:

  • DEV: Development set for validation during training
  • TEST_NET: Internet-sourced audio test set
  • TEST_MEETING: Meeting audio test set

The evaluation uses Character Error Rate (CER) which is appropriate for Chinese speech recognition.
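Concretely, CER is the character-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. A minimal self-contained version for illustration (the actual evaluation uses the evaluate library):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    # Dynamic-programming Levenshtein distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / len(reference)

# One substituted character in a six-character reference -> 0.167
print(round(cer("今天天气很好", "今天天器很好"), 3))
```

CER is preferred over word error rate for Chinese because written Chinese has no whitespace word boundaries.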

WenetSpeech Dataset

This model is specifically designed for Chinese speech recognition using the WenetSpeech corpus. Key characteristics:

  • Multi-domain Audio: Internet videos, audiobooks, podcasts, and meeting recordings
  • High-quality Annotations: Professional Chinese transcriptions with punctuation
  • Large Scale: 10,000+ hours of labeled Chinese speech data (this project trains on subset-S only)
  • Standard Benchmark: Widely adopted for Chinese ASR research and development

Dataset Characteristics

  • Audio Quality: Various quality levels from internet sources to studio recordings
  • Speaking Styles: Read speech, spontaneous speech, and conversational audio
  • Vocabulary: Large vocabulary covering diverse topics and domains
  • Language: Mandarin Chinese with standard simplified Chinese characters
  • Evaluation Protocol: Character Error Rate (CER) based evaluation

Training Configuration

  • Training Data: WenetSpeech subset-S for efficient training
  • Validation: DEV_fixed split for model selection
  • Test Sets: TEST_NET (internet audio) and TEST_MEETING (meeting audio)
  • Metric: Character Error Rate (CER) optimized for Chinese script

Dependencies

Install required packages:

pip install -r requirements.txt

Key dependencies:

  • PyYAML (for configuration loading)
  • torch, transformers, datasets
  • librosa (for audio processing)
  • evaluate (for metrics)

Evaluation Results (after normalization)

| Language                   | Metric | Error Rate |
|----------------------------|--------|------------|
| Chinese-NET (TEST_NET)     | CER    | 14.20%     |
| Chinese-MTG (TEST_MEETING) | CER    | 22.40%     |
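The "after normalization" qualifier matters: references and hypotheses are typically stripped of punctuation and whitespace before CER is computed, so the scores reflect character accuracy only. A minimal illustrative normalizer (the project's exact normalization rules are not shown here):

```python
import re

def normalize(text: str) -> str:
    """Drop punctuation and whitespace, keeping only word characters
    (which in Unicode-aware regex includes CJK ideographs)."""
    return re.sub(r"[^\w]", "", text)

print(normalize("你好，世界！"))  # -> 你好世界
```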

Note: If you encounter issues running finetune.py, you can use the finetune-backup.py file which contains the original hardcoded configuration that was used to generate these evaluation metrics.
