
Whisper Fine-tuning for English-Indonesian Code-Switching

This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the EN-INDON-CS (English-Indonesian Code-Switching) dataset.

Features

  • Flexible Configuration: All parameters are configurable through YAML files
  • Multi-GPU Support: Automatic detection and support for multiple GPUs
  • Dynamic Language Selection: Train on any subset of supported languages
  • On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
  • Comprehensive Evaluation: Automatic evaluation on test sets

Configuration

All parameters are configurable through the config.yaml file. This configuration is specifically set up for English-Indonesian code-switching data training.

Model Configuration

  • Model checkpoint (default: openai/whisper-large-v3)
  • Maximum target length for sequences
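
For example, the model section might look like this (key names and values are illustrative; check the shipped config.yaml for the actual schema):

model:
  checkpoint: openai/whisper-large-v3   # base model to fine-tune
  max_target_length: 448                # hypothetical cap on label sequence length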

Dataset Configuration

  • Uses the EN-INDON-CS dataset only
  • Dataset sources and splits
  • Language-specific settings (subset ratios, validation sizes)
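
A sketch of the dataset and language sections, with hypothetical keys and values:

datasets:
  name: EN-INDON-CS               # dataset identifier (illustrative)
  train_split: train
  test_split: test
languages:
  indonesian:
    subset_ratio: 1.0             # fraction of the training data to use
    validation_size: 500          # hypothetical held-out example count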

Training Configuration

  • Learning rate, batch sizes, training steps
  • Multi-GPU vs single GPU settings
  • Evaluation and logging parameters
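
An illustrative training section (values and key names are examples, not the shipped defaults):

training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 2
  max_steps: 5000
  eval_steps: 500                 # evaluation frequency
  logging_steps: 25
  multi_gpu: true                 # hypothetical single- vs multi-GPU toggle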

Environment Configuration

  • CPU core limits
  • Environment variables for optimization
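
A sketch of the environment section (these variable names are common optimization knobs, not necessarily the ones this project sets):

environment:
  cpu_cores: 8                    # hypothetical worker/thread limit
  env_vars:
    OMP_NUM_THREADS: "8"          # cap OpenMP threads
    TOKENIZERS_PARALLELISM: "false"  # silence tokenizer fork warnings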

Pushing to Hub

  • By default, the configuration does not push to the Hugging Face Hub. You can enable this by setting push_to_hub: true in your config file.
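
For example (placing the key under training and pairing it with hub_model_id are assumptions about the schema):

training:
  push_to_hub: true
  hub_model_id: your-username/whisper-eng-indon-cs   # hypothetical repo name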

Usage

Basic Usage

python finetune.py --config config.yaml

Custom Configuration

python finetune.py --config my_custom_config.yaml

Multi-GPU Training

# Using torchrun (recommended) for two GPUs
torchrun --nproc_per_node=2 finetune.py --config config.yaml

Configuration File Structure

The config.yaml file is organized into the following sections:

  1. model: Model checkpoint and sequence length settings
  2. output: Output directory configuration
  3. environment: Environment variables and CPU settings
  4. audio: Audio processing settings (sampling rate)
  5. languages: Indonesian language configuration
  6. datasets: EN-INDON-CS dataset configuration
  7. training: All training hyperparameters
  8. data_processing: Data processing settings

Customizing Your Training

Adjusting Training Parameters

Modify the training section in config.yaml:

  • Change learning rate, batch sizes, or training steps
  • Adjust evaluation frequency
  • Configure multi-GPU settings

Environment Optimization

Adjust the environment section to optimize for your system:

  • Set CPU core limits
  • Configure memory usage settings

Training Commands

These commands fall back to the default config.yaml when no --config flag is given.

Single GPU Training

python finetune.py

Multi-GPU Training

torchrun --nproc_per_node=2 finetune.py

Inference Guide

After training your model, you can use the provided inference.py script for speech recognition:

python inference.py

The inference script includes the following (a minimal standalone sketch appears after the list):

  • Model loading from the trained checkpoint
  • Audio preprocessing pipeline
  • Text generation with proper formatting
  • Support for English-Indonesian code-switching
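
If you want to run inference without the script, a minimal standalone sketch using the transformers API might look like this (the checkpoint path and audio file are placeholders, and this does not reproduce inference.py exactly):

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

checkpoint = "./whisper-finetuned"  # hypothetical output directory
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Whisper expects 16 kHz mono audio.
audio, _ = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate a transcription; code-switched output mixes English and
# Indonesian as the fine-tuned model sees fit.
predicted_ids = model.generate(inputs.input_features)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)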

Using the Trained Model

The inference script automatically handles:

  • Loading the fine-tuned model weights
  • Audio preprocessing with proper sampling rate
  • Generating transcriptions with code-switching support
  • Output formatting for evaluation metrics

Dependencies

Install required packages:

pip install -r requirements.txt

Key dependencies:

  • PyYAML (for configuration loading)
  • torch, transformers, datasets
  • librosa (for audio processing)
  • evaluate (for metrics)
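
For reference, loading the configuration with PyYAML amounts to a few lines; this sketch assumes the top-level sections described above, not the exact internals of finetune.py:

import yaml

# Read the YAML config; safe_load parses plain data without executing tags.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(config["training"])  # hypothetical top-level section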

Evaluation Results

Language    Metric    Error Rate
En-Indon    WER       30.20%

Note: If you encounter issues running finetune.py, you can use finetune-backup.py, which contains the original hardcoded configuration used to produce these evaluation metrics.
