Whisper Fine-tuning for English (Switchboard-1)
This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the Switchboard-1 (SWB1) English conversational speech dataset.
Features
- Flexible Configuration: All parameters are configurable through YAML files
- Multi-GPU Support: Automatic detection and support for multiple GPUs
- Dynamic Language Selection: Train on any subset of supported languages
- On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
- Comprehensive Evaluation: Automatic evaluation on test sets
Configuration
All parameters are configurable through the config.yaml file. This configuration is specifically set up for English conversational speech training using the Switchboard-1 dataset.
Model Configuration
- Model checkpoint (default: openai/whisper-large-v3)
- Maximum target length for sequences
Dataset Configuration
- Uses Switchboard-1 (SWB1) English conversational speech dataset
- Dataset sources and splits
- Language-specific settings
- Training configuration optimized for conversational speech
Training Configuration
- Learning rate, batch sizes, training steps
- Multi-GPU vs single GPU settings
- Evaluation and logging parameters
Environment Configuration
- CPU core limits
- Environment variables for optimization
Pushing to Hub
- The configuration does not push to the Hugging Face Hub by default. You can enable pushing by setting push_to_hub: true in your config file, for example:
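A minimal illustration; hub_model_id is an assumed companion key (mirroring the Hugging Face Trainer argument of the same name), not necessarily part of this repository's schema:

```yaml
training:
  push_to_hub: true                         # enable uploads to the Hub
  hub_model_id: your-username/whisper-swb1  # assumed key; target repo on the Hub
```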
Usage
Basic Usage
python finetune.py --config config.yaml
Custom Configuration
python finetune.py --config my_custom_config.yaml
Multi-GPU Training
# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml
Configuration File Structure
The config.yaml file is organized into the following sections (a sketch of the overall layout follows the list):
- model: Model checkpoint and sequence length settings
- output: Output directory configuration
- environment: Environment variables and CPU settings
- audio: Audio processing settings (sampling rate)
- languages: English language configuration
- datasets: Switchboard-1 dataset configuration
- training: All training hyperparameters
- data_processing: Data processing settings
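The section names in the sketch below match the list above, but the individual keys and values are illustrative assumptions rather than the exact schema shipped with this repository:

```yaml
model:
  checkpoint: openai/whisper-large-v3  # default checkpoint noted above
  max_target_length: 448               # assumed key; 448 is Whisper's decoder limit

output:
  dir: ./whisper-swb1-finetuned        # assumed output directory

environment:
  num_cpu_cores: 8                     # assumed key for the CPU core limit

audio:
  sampling_rate: 16000                 # Whisper models expect 16 kHz input

languages:
  - en

datasets:
  train_path: /path/to/swb1/train      # placeholder paths; point these at your SWB1 copy
  test_path: /path/to/swb1/test

training:
  learning_rate: 1.0e-5                # illustrative hyperparameters
  per_device_train_batch_size: 8
  max_steps: 5000
  eval_steps: 500
  push_to_hub: false                   # Hub pushing is disabled by default

data_processing:
  on_the_fly: true                     # dynamic audio preprocessing
```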
Customizing Your Training
Adjusting Training Parameters
Modify the training section in config.yaml (an example adjustment follows this list):
- Change learning rate, batch sizes, or training steps
- Adjust evaluation frequency
- Configure multi-GPU settings
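The key names below follow common Hugging Face Trainer arguments and are assumptions; check your actual config.yaml for the exact spelling:

```yaml
training:
  learning_rate: 5.0e-6              # lower LR for a gentler fine-tune
  per_device_train_batch_size: 4     # per-GPU batch size
  gradient_accumulation_steps: 2     # assumed key; keeps effective batch size at 8
  max_steps: 8000
  eval_steps: 1000                   # evaluate less frequently
```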
Environment Optimization
Adjust the environment section to optimize for your system (see the sketch after this list):
- Set CPU core limits
- Configure memory usage settings
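For example, on a machine with limited cores you might constrain thread usage roughly as follows. The key names are assumptions; the environment variables themselves are standard:

```yaml
environment:
  num_cpu_cores: 4                    # assumed key for the CPU core limit
  variables:
    OMP_NUM_THREADS: "4"              # cap OpenMP worker threads
    MKL_NUM_THREADS: "4"              # cap MKL worker threads
    TOKENIZERS_PARALLELISM: "false"   # avoid tokenizer fork warnings
```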
The provided config.yaml is specifically configured for English SWB1 training.
Training Commands
Basic Training
python finetune.py
Single GPU Training
python finetune.py
Multi-GPU Training
torchrun --nproc_per_node=2 finetune.py
Inference Guide
After training your model, you can use the provided inference.py script for speech recognition:
python inference.py
The inference script includes:
- Model loading from the trained checkpoint
- Audio preprocessing pipeline
- Text generation with proper formatting
- Support for English conversational speech transcription
Using the Trained Model
The inference script automatically handles:
- Loading the fine-tuned model weights
- Audio preprocessing with proper sampling rate
- Generating transcriptions for conversational English speech
- Output formatting for evaluation metrics
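If you prefer to call the fine-tuned model directly rather than through inference.py, here is a minimal sketch using the transformers API. The checkpoint path ./whisper-switchboard1 and the audio file name are assumptions, not paths produced by this repository:

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the fine-tuned checkpoint (path is an assumed example)
checkpoint = "./whisper-switchboard1"
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
model.eval()

# Whisper expects 16 kHz mono audio
audio, _ = librosa.load("example_call.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate an English transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features, language="en", task="transcribe"
    )
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```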
SWB1 Evaluation with Text Normalization
Important: Switchboard-1 evaluation requires proper text normalization to handle mismatched terms between the training and evaluation datasets. Use the provided normalization script for accurate WER computation:
# Run inference first to generate predictions
python inference.py
# Then use the normalization script for proper evaluation
bash normalization_inference_results.sh
The normalization_inference_results.sh script performs:
- Hesitation Removal: Filters out hesitation markers from both reference and hypothesis
- Text Normalization: Applies SWB1-specific text filters (wer_ref_filter and wer_hyp_filter)
- WER Computation: Calculates word error rate with proper normalization
- Result Display: Shows the overall WER performance
Manual Evaluation Steps
If you need to customize the evaluation process:
# 1. Filter hesitations from reference and hypothesis
grep "sw_" ./SWB1.ref | grep -v "hesitation" > ./SWB1.ref.no_hesitation
grep "sw_" ./SWB1_finetuned.pred > ./SWB1_finetuned.pred.no_hesitation
# 2. Apply text normalization filters
cat ./SWB1.ref.no_hesitation | ./wer_ref_filter > ./SWB1.ref.filter.no_hesitation
cat ./SWB1_finetuned.pred.no_hesitation | ./wer_hyp_filter > ./SWB1_finetuned.pred.filter.no_hesitation
# 3. Compute WER with normalized text
./compute-wer.sh ./SWB1.ref.filter.no_hesitation ./SWB1_finetuned.pred.filter.no_hesitation ./SWB1_finetuned.wer
# 4. View results
grep Overall ./SWB1_finetuned.wer
Dependencies
Install required packages:
pip install -r requirements.txt
Key dependencies:
- PyYAML (for configuration loading)
- torch, transformers, datasets
- librosa (for audio processing)
- evaluate (for metrics; a usage sketch follows)
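A self-contained sketch of the WER computation with evaluate, using toy strings in place of real normalized SWB1 transcripts:

```python
import evaluate  # the "wer" metric requires the jiwer backend

# Load the word-error-rate metric
wer_metric = evaluate.load("wer")

# Toy reference/hypothesis pairs standing in for normalized SWB1 text
references = ["yeah i think that is right", "we talked about it yesterday"]
predictions = ["yeah i think that's right", "we talked about it yesterday"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")
```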
Evaluation Results (after normalization)
| Language | Metric | Error Rate |
|----------|--------|------------|
| English  | WER    | 10.20%     |
Note: If you encounter issues running finetune.py, you can use the finetune-backup.py file, which contains the original hardcoded configuration that was used to generate these evaluation metrics.