
Whisper Fine-tuning for English (Switchboard-1)

This project provides a configurable pipeline for fine-tuning OpenAI's Whisper model on the Switchboard-1 (SWB1) English conversational speech dataset.

Features

  • Flexible Configuration: All parameters are configurable through YAML files
  • Multi-GPU Support: Automatic detection and support for multiple GPUs
  • Dynamic Language Selection: Train on any subset of supported languages
  • On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
  • Comprehensive Evaluation: Automatic evaluation on test sets

Configuration

All parameters are configurable through the config.yaml file. This configuration is specifically set up for English conversational speech training using the Switchboard-1 dataset.

Model Configuration

  • Model checkpoint (default: openai/whisper-large-v3)
  • Maximum target length for sequences

Dataset Configuration

  • Uses Switchboard-1 (SWB1) English conversational speech dataset
  • Dataset sources and splits
  • Language-specific settings
  • Training configuration optimized for conversational speech

Training Configuration

  • Learning rate, batch sizes, training steps
  • Multi-GPU vs single GPU settings
  • Evaluation and logging parameters

Environment Configuration

  • CPU core limits
  • Environment variables for optimization
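For example, CPU core limits are typically applied through standard threading environment variables before launching training. The variables below are common PyTorch/Hugging Face knobs, not confirmed names from this repository's config; check the environment section of config.yaml for the exact settings it applies:

```shell
# Common thread/parallelism caps; adjust the counts to your machine.
export OMP_NUM_THREADS=8             # OpenMP threads used by PyTorch CPU ops
export MKL_NUM_THREADS=8             # MKL BLAS threads
export TOKENIZERS_PARALLELISM=false  # silence the tokenizers fork warning
```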

Pushing to Hub

  • Pushing to the Hugging Face Hub is disabled by default. You can enable it by setting push_to_hub: true in your config file.

Usage

Basic Usage

python finetune.py --config config.yaml

Custom Configuration

python finetune.py --config my_custom_config.yaml

Multi-GPU Training

# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml

Configuration File Structure

The config.yaml file is organized into the following sections:

  1. model: Model checkpoint and sequence length settings
  2. output: Output directory configuration
  3. environment: Environment variables and CPU settings
  4. audio: Audio processing settings (sampling rate)
  5. languages: English language configuration
  6. datasets: Switchboard-1 dataset configuration
  7. training: All training hyperparameters
  8. data_processing: Data processing settings
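As a rough illustration, a config.yaml following this layout might look like the sketch below. Every key name and value here is an assumption based on the section list above; consult the shipped config.yaml for the authoritative names:

```yaml
# Illustrative only -- key names are assumed from the section list above.
model:
  checkpoint: openai/whisper-large-v3
  max_target_length: 448

output:
  dir: ./whisper-swb1-finetuned

environment:
  cpu_cores: 8

audio:
  sampling_rate: 16000

languages:
  - english

datasets:
  name: swb1

training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 8
  max_steps: 5000
  eval_steps: 500
  push_to_hub: false

data_processing:
  on_the_fly: true
```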

Customizing Your Training

Adjusting Training Parameters

Modify the training section in config.yaml:

  • Change learning rate, batch sizes, or training steps
  • Adjust evaluation frequency
  • Configure multi-GPU settings

Environment Optimization

Adjust the environment section to optimize for your system:

  • Set CPU core limits
  • Configure memory usage settings


Training Commands

Single GPU Training

python finetune.py

Multi-GPU Training

torchrun --nproc_per_node=2 finetune.py

Inference Guide

After training your model, you can use the provided inference.py script for speech recognition:

python inference.py

The inference script includes:

  • Model loading from the trained checkpoint
  • Audio preprocessing pipeline
  • Text generation with proper formatting
  • Support for English conversational speech transcription

Using the Trained Model

The inference script automatically handles:

  • Loading the fine-tuned model weights
  • Audio preprocessing with proper sampling rate
  • Generating transcriptions for conversational English speech
  • Output formatting for evaluation metrics

SWB1 Evaluation with Text Normalization

Important: Switchboard-1 evaluation requires proper text normalization to reconcile mismatched terms between the training and evaluation transcripts. Use the provided normalization script for accurate WER computation:

# Run inference first to generate predictions
python inference.py

# Then use the normalization script for proper evaluation
bash normalization_inference_results.sh

The normalization_inference_results.sh script performs:

  • Hesitation Removal: Filters out hesitation markers from both reference and hypothesis
  • Text Normalization: Applies SWB1-specific text filters (wer_ref_filter and wer_hyp_filter)
  • WER Computation: Calculates word error rate with proper normalization
  • Result Display: Shows the overall WER performance
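To make the normalization idea concrete, here is a minimal, self-contained sketch of hesitation filtering followed by WER via word-level edit distance. The hesitation marker set and normalization rules are illustrative assumptions; the repository's wer_ref_filter and wer_hyp_filter scripts define the real SWB1 rules:

```python
# Minimal WER sketch with hesitation filtering (marker set is an assumption).
HESITATIONS = {"uh", "um", "uh-huh", "hmm"}

def normalize(text: str) -> list[str]:
    """Lowercase, tokenize on whitespace, and drop hesitation tokens."""
    return [w for w in text.lower().split() if w not in HESITATIONS]

def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein distance over normalized tokens."""
    r, h = normalize(ref), normalize(hyp)
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("um yeah i think so", "yeah i think so"))  # hesitation removed -> 0.0
```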

Manual Evaluation Steps

If you need to customize the evaluation process:

# 1. Filter hesitations from reference and hypothesis
grep "sw_" ./SWB1.ref | grep -v "hesitation" > ./SWB1.ref.no_hesitation
grep "sw_" ./SWB1_finetuned.pred > ./SWB1_finetuned.pred.no_hesitation

# 2. Apply text normalization filters
cat ./SWB1.ref.no_hesitation | ./wer_ref_filter > ./SWB1.ref.filter.no_hesitation
cat ./SWB1_finetuned.pred.no_hesitation | ./wer_hyp_filter > ./SWB1_finetuned.pred.filter.no_hesitation

# 3. Compute WER with normalized text
./compute-wer.sh ./SWB1.ref.filter.no_hesitation ./SWB1_finetuned.pred.filter.no_hesitation ./SWB1_finetuned.wer

# 4. View results
grep Overall ./SWB1_finetuned.wer

Dependencies

Install required packages:

pip install -r requirements.txt

Key dependencies:

  • PyYAML (for configuration loading)
  • torch, transformers, datasets
  • librosa (for audio processing)
  • evaluate (for metrics)
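The list above corresponds to a requirements.txt along these lines (unpinned here; the shipped file may pin specific versions):

```
pyyaml
torch
transformers
datasets
librosa
evaluate
```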

Evaluation Results (after normalization)

  Language   Metric   Error Rate
  English    WER      10.20%

Note: If you encounter issues running finetune.py, you can fall back to finetune-backup.py, which contains the original hardcoded configuration used to generate these evaluation metrics.
