Whisper Fine-tuning for Khmer Language

This project provides a configurable way to fine-tune OpenAI's Whisper model specifically on the Khmer language using the Google FLEURS dataset (km_kh).

Features

Flexible Configuration: All parameters are configurable through YAML files
Multi-GPU Support: Automatic detection and support for multiple GPUs
Dynamic Language Selection: Train on any subset of supported languages
On-the-fly Processing: Efficient memory usage with dynamic audio preprocessing
Comprehensive Evaluation: Automatic evaluation on test sets

Configuration

All parameters are configurable through the config.yaml file. This configuration is specifically set up for Khmer language training using the Google FLEURS dataset.

Model Configuration

Model checkpoint (default: openai/whisper-large-v3)
Maximum target length for sequences

Dataset Configuration

Uses Google FLEURS Khmer (km_kh) dataset
Dataset sources and splits
Language-specific settings
Training subset ratio (25% of data for faster training)
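As a rough illustration of the 25% training subset, the sketch below draws a reproducible random sample of row indices and applies it to the FLEURS Khmer training split. The function names, the seed, and the use of Dataset.select are assumptions made for illustration; the actual subsetting logic in finetune.py may differ.

```python
import random


def subset_indices(num_rows, ratio, seed=42):
    """Return a sorted, reproducible random sample of row indices."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    keep = max(1, int(num_rows * ratio))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), keep))


def load_khmer_subset(ratio=0.25, seed=42):
    """Load the FLEURS km_kh training split and keep a fraction of it."""
    # Imported here so subset_indices works without `datasets` installed.
    from datasets import load_dataset

    # Depending on the installed `datasets` version, this call may need
    # extra arguments (for example trust_remote_code=True).
    train = load_dataset("google/fleurs", "km_kh", split="train")
    return train.select(subset_indices(len(train), ratio, seed))
```

Fixing the seed keeps the subset identical between runs, which makes results easier to compare when only hyperparameters change.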

Training Configuration

Learning rate, batch sizes, training steps
Multi-GPU vs single GPU settings
Evaluation and logging parameters

Environment Configuration

CPU core limits
Environment variables for optimization

Pushing to Hub

By default, the configuration does not push the model to the Hugging Face Hub. To enable pushing, set push_to_hub: true in your config file.

Usage

Basic Usage

python finetune.py --config config.yaml

Custom Configuration

python finetune.py --config my_custom_config.yaml

Multi-GPU Training

Because the training set is very small (around 2.5 hours of audio), multi-GPU training is not recommended.

Configuration File Structure

The config.yaml file is organized into the following sections:

model: Model checkpoint and sequence length settings
output: Output directory configuration
environment: Environment variables and CPU settings
audio: Audio processing settings (sampling rate)
languages: Khmer language configuration
datasets: Google FLEURS Khmer dataset configuration
training: All training hyperparameters
data_processing: Data processing settings
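The section list above can be checked when the file is loaded, so a misspelled or missing section fails early instead of partway through training. This is a minimal sketch using PyYAML; the helper names are hypothetical and are not taken from finetune.py.

```python
REQUIRED_SECTIONS = (
    "model", "output", "environment", "audio",
    "languages", "datasets", "training", "data_processing",
)


def missing_sections(config):
    """Return the top-level section names absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


def load_config(path="config.yaml"):
    """Parse a YAML config file and fail early if a section is missing."""
    import yaml  # PyYAML, listed under Dependencies

    with open(path, "r", encoding="utf-8") as handle:
        config = yaml.safe_load(handle) or {}
    missing = missing_sections(config)
    if missing:
        raise KeyError(f"config is missing sections: {', '.join(missing)}")
    return config
```

yaml.safe_load is used rather than yaml.load because it only builds plain Python objects, which is all a configuration file needs.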

Customizing Your Training

Adjusting Training Parameters

Modify the training section in config.yaml:

Change learning rate, batch sizes, or training steps
Adjust evaluation frequency
Configure multi-GPU settings
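When changing batch sizes or training steps, it can help to check how the settings combine. The sketch below is an illustration only, not code from finetune.py: the argument names mirror the Hugging Face Trainer options per_device_train_batch_size, gradient_accumulation_steps, and max_steps, and the numbers in the usage note are made up.

```python
def effective_batch_size(per_device_batch_size,
                         gradient_accumulation_steps=1,
                         num_gpus=1):
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


def approx_epochs(max_steps, num_train_examples, batch_size):
    """Roughly how many passes over the data a step budget corresponds to."""
    if num_train_examples <= 0:
        raise ValueError("num_train_examples must be positive")
    return max_steps * batch_size / num_train_examples
```

For example, effective_batch_size(8, 2, 1) gives 16, and approx_epochs(1000, 800, 16) gives 20.0, meaning 1000 steps would pass over a hypothetical 800-example training set about 20 times. On a small dataset, a high epoch count like this is a common sign that the step budget should be reduced.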

Environment Optimization

Adjust the environment section to optimize for your system:

Set CPU core limits
Configure memory usage settings
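One common way to apply CPU core limits is to export thread-count variables before torch or numpy are imported. The sketch below assumes hypothetical config keys (max_cpu_cores and variables); the real key names in config.yaml may differ. OMP_NUM_THREADS, MKL_NUM_THREADS, and TOKENIZERS_PARALLELISM are standard environment variables read by the underlying libraries.

```python
import os


def apply_environment(env_config):
    """Export thread limits and extra variables from an environment section."""
    # `max_cpu_cores` and `variables` are hypothetical key names.
    cores = str(env_config.get("max_cpu_cores", os.cpu_count() or 1))
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[name] = cores
    # Any additional variables are passed through as strings.
    for name, value in env_config.get("variables", {}).items():
        os.environ[name] = str(value)
    return cores
```

Calling this at the very top of the training script matters, because most numerical libraries read these variables once at import time.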

Training Commands

Basic Training

python finetune.py

Single GPU Training

python finetune.py

Inference Guide

After training your model, you can use the provided inference.py script for speech recognition:

python inference.py

The inference script includes:

Model loading from the trained checkpoint
Audio preprocessing pipeline
Text generation with proper formatting
Support for Khmer language transcription

Using the Trained Model

The inference script automatically handles:

Loading the fine-tuned model weights
Audio preprocessing with proper sampling rate
Generating transcriptions for Khmer speech
Output formatting for evaluation metrics
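The steps listed above can be sketched with the transformers Whisper classes as follows. This is not the contents of inference.py: the function name and arguments are hypothetical, and the sketch assumes the processor files were saved in the same directory as the fine-tuned weights (if not, load the processor from the base checkpoint instead).

```python
TARGET_SAMPLING_RATE = 16_000  # Whisper feature extractors expect 16 kHz audio


def transcribe(audio_path, model_dir):
    """Transcribe one Khmer audio file with a fine-tuned Whisper checkpoint."""
    # Heavy imports are kept local so the module can be imported cheaply.
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)
    model.eval()

    # librosa resamples to the requested rate and returns mono float audio.
    audio, _ = librosa.load(audio_path, sr=TARGET_SAMPLING_RATE)
    inputs = processor(
        audio, sampling_rate=TARGET_SAMPLING_RATE, return_tensors="pt"
    )

    with torch.no_grad():
        token_ids = model.generate(
            inputs.input_features, language="km", task="transcribe"
        )
    text = processor.batch_decode(token_ids, skip_special_tokens=True)[0]
    return text.strip()
```

With the default processor settings used here, audio is padded or truncated to 30 seconds, so longer recordings would need to be split into chunks first.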

Dependencies

Install required packages:

pip install -r requirements.txt

Key dependencies:

PyYAML (for configuration loading)
torch, transformers, datasets
librosa (for audio processing)
evaluate (for metrics)

Evaluation Results

Language	Metric	Error Rate
Khmer	CER	33.18%
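For reference, character error rate is the number of character-level edits (insertions, deletions, substitutions) needed to turn the hypothesis into the reference, divided by the number of reference characters. The pure-Python sketch below illustrates the definition; it is not the project's evaluation code, which may rely on the evaluate library listed under Dependencies.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references contain no characters")
    return edits / total
```

Python strings are compared one Unicode code point at a time here, so Khmer vowel signs and other combining marks each count as a separate character. CER is a natural choice for Khmer because the script is normally written without spaces between words, which makes word error rate hard to define.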

Note: If you encounter issues running finetune.py, you can use the finetune-backup.py file which contains the original hardcoded configuration.
