Distil-Whisper Large v3.5: Fine-tuned for ATC Domain

Model Description

This model is a fine-tuned version of distil-whisper/distil-large-v3.5 optimized for transcribing Air Traffic Control (ATC) communications.

It was trained on the ATCOSIM dataset, which simulates realistic ATC conversations. Only the decoder was fine-tuned; the encoder was kept frozen.

The goal is to preserve general English speech recognition ability while improving accuracy on ATC-specific terminology.

Intended Use

This model is designed for:

Transcribing ATC radio communications
Supporting aviation safety analytics
Monitoring communication patterns for research or operational analysis
ATC phraseology recognition in controlled environments

Training Methodology

The model was trained using a domain-adaptive strategy with the following key characteristics:

Encoder frozen to retain general speech recognition capability
Decoder fine-tuned to adapt to ATC-specific language
Whisper processor and tokenizer unchanged to preserve general English vocabulary
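
The freezing scheme listed above can be sketched as a small helper. This is a minimal illustration, not the training script used for this model: `freeze_encoder` is a hypothetical name, and the helper assumes only the PyTorch module interface (`parameters()`, `numel()`, `requires_grad`) plus the `get_encoder()` accessor that Whisper models in `transformers` provide.

```python
def freeze_encoder(model):
    """Disable gradients for the encoder and report parameter counts.

    `model` is assumed to be a PyTorch-style seq2seq model exposing
    get_encoder() and parameters(), as transformers Whisper models do.
    Returns a (frozen, trainable) tuple of parameter counts.
    """
    # Encoder parameters stop receiving gradients, so the optimizer
    # never updates them; everything else stays trainable.
    for param in model.get_encoder().parameters():
        param.requires_grad = False

    frozen = 0
    trainable = 0
    for param in model.parameters():
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    return frozen, trainable
```

Called on a model loaded with `AutoModelForSpeechSeq2Seq.from_pretrained("distil-whisper/distil-large-v3.5")`, this would leave the decoder and output projection trainable, while the processor and tokenizer are not touched at all.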

Training configuration:

Learning rate: 3e-5
Epochs: 5
Warmup steps: 50
Batch size: 8 (×2 gradient accumulation)
Mixed precision: FP16
Model: distil-whisper/distil-large-v3.5
Evaluation metric: Word Error Rate (WER)
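
To make the numbers above concrete, the following sketch computes the effective batch size and the learning rate at a given optimizer step. Several things here are assumptions rather than facts from this card: a single training device, linear decay to zero after warmup (the default scheduler of the Hugging Face `Trainer`), and 2500 total steps, taken from the last step logged in the results. The function names are hypothetical.

```python
BASE_LR = 3e-5          # learning rate from the configuration above
WARMUP_STEPS = 50       # warmup steps from the configuration above
PER_DEVICE_BATCH = 8    # batch size from the configuration above
GRAD_ACCUM = 2          # gradient accumulation from the configuration above
TOTAL_STEPS = 2500      # assumption: last logged training step


def effective_batch_size(per_device=PER_DEVICE_BATCH,
                         accumulation=GRAD_ACCUM,
                         num_devices=1):
    """Samples contributing to one optimizer update (num_devices is assumed)."""
    return per_device * accumulation * num_devices


def learning_rate(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to base_lr, then (assumed) linear decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    remaining = max(0, total - step)
    return base_lr * remaining / (total - warmup)


print(effective_batch_size())   # samples per optimizer update
print(learning_rate(25))        # halfway through warmup
print(learning_rate(1500))      # around the best checkpoint
```

Under these assumptions each update sees 16 samples, and the learning rate peaks at step 50 before decaying.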

Performance

The model achieved the following results during fine-tuning:

Step	Training Loss	Validation Loss	WER
500	0.1131	0.0934	5.16%
1000	0.0654	0.0849	5.72%
1500	0.0208	0.0830	4.37% ✅
2000	0.0152	0.0859	4.94%
2500	0.0080	0.0861	4.85%

The best-performing checkpoint was at step 1500, which had both the lowest validation loss (0.0830) and the lowest WER (4.37%). That checkpoint is the model uploaded here.
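
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain-Python illustration for a single utterance. It is not necessarily the scoring code behind the table above, which may apply text normalization and aggregate errors over the whole evaluation set. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
score = word_error_rate(
    "descend flight level one two zero",
    "descend flight level one three zero",
)
print(f"{score:.2%}")
```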

Limitations

Optimized specifically for English ATC communications
Not tuned for general-purpose transcription
May underperform on non-standard phraseology or overlapping transmissions
Training data comes from a simulation corpus (ATCOSIM), which may limit performance on real-world accents and radio noise

Usage

Basic Usage with Pipeline

Whisper models expect audio sampled at 16 kHz.

from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="tclin/distil-large-v3.5-atcosim-finetune",
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device="cuda"         # or "cpu"
)

# The original Whisper generation config forces the <|en|><|transcribe|> tokens
# and suppresses a few special tokens at the start of decoding. These settings
# were removed during fine-tuning, so they must also be unset at inference:
transcriber.model.generation_config.forced_decoder_ids   = None
transcriber.model.generation_config.begin_suppress_tokens = None

result = transcriber("path_to_atc_audio.wav")
print("Transcription:", result["text"])

Advanced Usage with Audio Processing

import torch, torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

MODEL_ID = "tclin/distil-large-v3.5-atcosim-finetune"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.float16 if torch.cuda.is_available() else torch.float32

# 1. Load & pre-process audio
audio_path = "path_to_atc_audio.wav"
waveform, sr = torchaudio.load(audio_path)         # (channels, time)

# Down-mix stereo → mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

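# Resample to 16 kHz (high-quality filter)
if sr != 16_000:
    # "sinc_interp_hann" replaces the older "sinc_interpolation" name,
    # which recent torchaudio releases no longer accept
    waveform = torchaudio.transforms.Resample(
        sr, 16_000, lowpass_filter_width=64, rolloff=0.99,
        resampling_method="sinc_interp_hann"
    )(waveform)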

audio_np = waveform.squeeze(0).numpy()

# 2. Load model & processor 
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, torch_dtype=DTYPE, use_safetensors=True
).to(DEVICE)

# The original Whisper generation config forces the <|en|><|transcribe|> tokens
# and suppresses a few special tokens at the start of decoding. These settings
# were removed during fine-tuning, so they must also be unset at inference:
model.generation_config.forced_decoder_ids   = None
model.generation_config.begin_suppress_tokens = None

# 3. Feature extraction & generation
with torch.inference_mode():
    inputs = processor(audio_np, sampling_rate=16_000,
                       return_tensors="pt").to(DEVICE, DTYPE)
    ids = model.generate(**inputs, max_new_tokens=128)
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]

print("✈️  Transcription:", text)

Important Notes

Always ensure audio is resampled to 16 kHz before processing
Whisper expects mono input; convert stereo to mono if needed
The Whisper tokenizer is byte-level and does not require language-specific customization
The model performs best on clean ATC communications with standard phraseology
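
The first two notes can be checked before any model is loaded. The sketch below uses only the standard-library `wave` module to report whether a file needs resampling or down-mixing. `inspect_wav` is a hypothetical helper, and `wave` reads plain PCM WAV files only, so other containers or encodings would need a different reader.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate Whisper expects, in Hz


def inspect_wav(path):
    """Report sample rate and channel count of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "needs_resample": sample_rate != EXPECTED_RATE,
        "needs_downmix": channels > 1,
    }
```

A file for which both flags are `False` can be passed to the feature extractor as is; otherwise the resampling and down-mixing steps from the advanced usage example apply.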

Broader Application

This model can be used as part of an automated pipeline for:

Transcribing aviation communications
Extracting call signs, altitudes, and instructions
Analyzing communication patterns for traffic management or compliance
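
As a rough illustration of the extraction step, the sketch below pulls flight levels out of a transcript with a regular expression. It assumes the transcript spells digits as words (for example "flight level two eight zero"); the digit vocabulary, the function name, and the example sentence are illustrative assumptions, and a production system would need far more robust parsing.

```python
import re
from typing import List

# Spoken digits as they might appear in a transcript (assumed vocabulary).
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9", "niner": "9",
}

_DIGIT_WORD = "|".join(DIGITS)
# "flight level" followed by one or more spoken digits.
_FLIGHT_LEVEL = re.compile(
    rf"\bflight\s+level((?:\s+(?:{_DIGIT_WORD})\b)+)", re.IGNORECASE
)


def extract_flight_levels(transcript: str) -> List[int]:
    """Return every flight level mentioned in the transcript as an integer."""
    levels = []
    for match in _FLIGHT_LEVEL.finditer(transcript):
        words = match.group(1).lower().split()
        levels.append(int("".join(DIGITS[word] for word in words)))
    return levels


print(extract_flight_levels(
    "lufthansa four five six descend flight level two eight zero"
))
```

Call signs and other instructions could be handled with similar patterns, although their formats vary much more than flight levels do.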

Citation

If you use this model in your research, please cite:

@misc{ta-chun_lin_2025,
  author       = { Ta-Chun Lin },
  title        = { distil-whisper-large-v3.5-atcosim-finetune (Step 1500) },
  year         = 2025,
  url          = { https://huggingface.co/tclin/distil-whisper-large-v3.5-atcosim-finetune },
  doi          = { 10.57967/hf/5803 },
  publisher    = { Hugging Face }
}

Acknowledgements

OpenAI and Distil-Whisper for the base model
The ATCOSIM dataset contributors
Hugging Face and the open-source community for tooling and support
