Distil-Whisper Large v3.5: Fine-tuned for ATC Domain
Model Description
This model is a fine-tuned version of distil-whisper/distil-large-v3.5, optimized for transcribing Air Traffic Control (ATC) communications.
It was trained on the ATCOSIM dataset, which simulates realistic ATC conversations, with only the decoder being fine-tuned.
The goal is to preserve general English speech recognition ability while improving performance on ATC-specific terminology.
Intended Use
This model is designed for:
- Transcribing ATC radio communications
- Supporting aviation safety analytics
- Monitoring communication patterns for research or operational analysis
- ATC phraseology recognition in controlled environments
Training Methodology
The model was trained using a domain-adaptive strategy with the following key characteristics:
- Encoder frozen to retain general speech recognition capability
- Decoder fine-tuned to adapt to ATC-specific language
- Whisper processor and tokenizer unchanged to preserve general English vocabulary
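As a rough illustration of the strategy listed above, the sketch below freezes the encoder of a Whisper-style model so that only the remaining (decoder-side) weights receive gradients. It is not the author's training script. The helper names `freeze_encoder_params` and `load_and_freeze` are hypothetical, and the code only assumes that the model exposes `get_encoder()` (as Whisper models in transformers do) and PyTorch-style parameters with `requires_grad` and `numel()`.

```python
def freeze_encoder_params(model):
    """Disable gradients for every encoder parameter of `model`.

    Hypothetical helper: assumes `model.get_encoder()` returns the encoder
    module and that `model.parameters()` yields PyTorch-style parameters.
    Returns (trainable, frozen) parameter counts as a quick sanity check.
    """
    for param in model.get_encoder().parameters():
        param.requires_grad = False  # encoder keeps its pre-trained values

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


def load_and_freeze(model_id="distil-whisper/distil-large-v3.5"):
    """Load the base checkpoint and freeze its encoder (downloads weights)."""
    from transformers import AutoModelForSpeechSeq2Seq  # imported lazily

    model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
    trainable, frozen = freeze_encoder_params(model)
    print(f"trainable: {trainable:,} | frozen: {frozen:,}")
    return model
```

Printing the two counts before training is a cheap way to confirm that the optimizer will only update the intended part of the network.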
Training configuration:
- Learning rate: 3e-5
- Epochs: 5
- Warmup steps: 50
- Batch size: 8 (×2 gradient accumulation)
- Mixed precision: FP16
- Model: distil-whisper/distil-large-v3.5
- Evaluation metric: Word Error Rate (WER)
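For readers unfamiliar with the metric, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration only. The function name `word_error_rate` and the whitespace tokenization are assumptions, and the scores reported in this card may have been computed with a different tool or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "descend flight level two seven zero" with the hypothesis "descend flight level two six zero" gives one substitution over six reference words, a WER of about 0.167.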
Performance
The model achieved the following results during fine-tuning:
Step | Training Loss | Validation Loss | WER |
---|---|---|---|
500 | 0.1131 | 0.0934 | 5.16% |
1000 | 0.0654 | 0.0849 | 5.72% |
1500 | 0.0208 | 0.0830 | 4.37% ✅ |
2000 | 0.0152 | 0.0859 | 4.94% |
2500 | 0.0080 | 0.0861 | 4.85% |
The best-performing checkpoint was at step 1500 (4.37% WER, which also has the lowest validation loss); this is the model uploaded here.
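The checkpoint choice can be reproduced from the table with a few lines. The snippet below simply re-enters the reported numbers and picks the row with the lowest WER and the row with the lowest validation loss; the variable names are illustrative.

```python
# (step, training loss, validation loss, WER in %) copied from the table
results = [
    (500, 0.1131, 0.0934, 5.16),
    (1000, 0.0654, 0.0849, 5.72),
    (1500, 0.0208, 0.0830, 4.37),
    (2000, 0.0152, 0.0859, 4.94),
    (2500, 0.0080, 0.0861, 4.85),
]

best_by_wer = min(results, key=lambda row: row[3])
best_by_val_loss = min(results, key=lambda row: row[2])

print("best step by WER:", best_by_wer[0])
print("best step by validation loss:", best_by_val_loss[0])
```

Both criteria select the same row, while the training loss keeps falling in later steps, which is consistent with mild overfitting after that point.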
Limitations
- Optimized specifically for English ATC communications
- Not tuned for general-purpose transcription
- May underperform on non-standard phraseology or overlapping transmissions
- Synthetic training data may limit performance on real-world accents/noise
Usage
Basic Usage with Pipeline
Whisper expects input audio sampled at 16 kHz (16,000 Hz).
from transformers import pipeline
transcriber = pipeline(
    task="automatic-speech-recognition",
    model="tclin/distil-large-v3.5-atcosim-finetune",
    torch_dtype="auto",  # load weights in the dtype stored with the checkpoint
    device="cuda",  # or "cpu"
)
# The original Whisper generation config forces the <|en|><|transcribe|> tokens and
# suppresses a few special tokens. These settings were removed during fine-tuning,
# so they must also be unset at inference:
transcriber.model.generation_config.forced_decoder_ids = None
transcriber.model.generation_config.begin_suppress_tokens = None
result = transcriber("path_to_atc_audio.wav")
print("Transcription:", result["text"])
Advanced Usage with Audio Processing
import torch, torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
MODEL_ID = "tclin/distil-large-v3.5-atcosim-finetune"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.float16 if torch.cuda.is_available() else torch.float32
# 1. Load & pre-process audio
audio_path = "path_to_atc_audio.wav"
waveform, sr = torchaudio.load(audio_path) # (channels, time)
# Down-mix stereo → mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
# Resample to 16 kHz (high-quality filter)
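if sr != 16_000:
    # resampling_method is left at its default (Hann-windowed sinc); the older
    # "sinc_interpolation" name may be rejected by recent torchaudio releases.
    waveform = torchaudio.transforms.Resample(
        sr, 16_000, lowpass_filter_width=64, rolloff=0.99
    )(waveform)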
audio_np = waveform.squeeze(0).numpy()
# 2. Load model & processor
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, torch_dtype=DTYPE, use_safetensors=True
).to(DEVICE)
# The original Whisper generation config forces the <|en|><|transcribe|> tokens and
# suppresses a few special tokens. These settings were removed during fine-tuning,
# so they must also be unset at inference:
model.generation_config.forced_decoder_ids = None
model.generation_config.begin_suppress_tokens = None
# 3. Feature extraction & generation
with torch.inference_mode():
    inputs = processor(audio_np, sampling_rate=16_000,
                       return_tensors="pt").to(DEVICE, DTYPE)
    ids = model.generate(**inputs, max_new_tokens=128)
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print("✈️ Transcription:", text)
Important Notes
- Always ensure audio is resampled to 16 kHz before processing
- Whisper expects input in mono format; convert stereo to mono if needed
- Whisper tokenizer is byte-level and doesn't require language-specific customization
- The model performs best on clean ATC communications with standard phraseology
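The first two notes can be checked before any model is loaded. The sketch below uses only the standard-library `wave` module to read a WAV header and report whether the file is already 16 kHz mono. The helper name `check_wav_format` is hypothetical, and it only handles uncompressed PCM WAV files that `wave` can open.

```python
import wave

TARGET_RATE = 16_000  # sampling rate Whisper expects


def check_wav_format(path):
    """Return (sample_rate, channels, ok) for a PCM WAV file.

    `ok` is True only when the file is already 16 kHz mono, i.e. it needs
    neither resampling nor down-mixing before feature extraction.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return rate, channels, (rate == TARGET_RATE and channels == 1)
```

If `ok` comes back False, the down-mix and resample steps from the advanced usage example apply.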
Broader Application
This model can be used as part of an automated pipeline for:
- Transcribing aviation communications
- Extracting call signs, altitudes, and instructions
- Analyzing communication patterns for traffic management or compliance
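As one illustration of the extraction step, the sketch below pulls flight levels out of a transcript. It assumes numbers are transcribed as spelled-out digits (for example "flight level two seven zero"). That format, the `DIGITS` table, and the function name `extract_flight_levels` are assumptions made for this example and are not part of the model; a production pipeline would need a more robust parser.

```python
import re

# Spoken-digit vocabulary, including the ICAO variants "tree", "fife", "niner".
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
    "four": "4", "five": "5", "fife": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9", "niner": "9",
}

_DIGIT_WORD = "|".join(DIGITS)
_FL_PATTERN = re.compile(
    rf"\bflight level((?:\s+(?:{_DIGIT_WORD}))+)\b", re.IGNORECASE
)


def extract_flight_levels(transcript):
    """Return the flight levels found in `transcript` as integers, in order."""
    levels = []
    for match in _FL_PATTERN.finditer(transcript):
        words = match.group(1).lower().split()
        levels.append(int("".join(DIGITS[w] for w in words)))
    return levels
```

Digits that are not preceded by "flight level", such as those in a call sign, are ignored by this pattern.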
Citation
If you use this model in your research, please cite:
@misc{ta-chun_lin_2025,
  author = {Ta-Chun Lin},
  title = {distil-whisper-large-v3.5-atcosim-finetune (Step 1500)},
  year = 2025,
  url = {https://huggingface.co/tclin/distil-whisper-large-v3.5-atcosim-finetune},
  doi = {10.57967/hf/5803},
  publisher = {Hugging Face}
}
Acknowledgements
- OpenAI and Distil-Whisper for the base model
- The ATCOSIM dataset contributors
- Hugging Face and the open-source community for tooling and support