
Distil-Whisper Large v3.5: Fine-tuned for ATC Domain

Model Description

This model is a fine-tuned version of distil-whisper/distil-large-v3.5 optimized for transcribing Air Traffic Control (ATC) communications.

It was trained on the ATCOSIM dataset, which simulates realistic ATC conversations.

Only the decoder was fine-tuned, while the encoder was kept frozen. This strategy aims to preserve general English speech recognition ability while improving performance on ATC-specific terminology.

Intended Use

This model is designed for:

  • Transcribing ATC radio communications
  • Supporting aviation safety analytics
  • Monitoring communication patterns for research or operational analysis
  • ATC phraseology recognition in controlled environments

Training Methodology

The model was trained using a domain-adaptive strategy with the following key characteristics:

  • Encoder frozen to retain general speech recognition capability
  • Decoder fine-tuned to adapt to ATC-specific language
  • Whisper processor and tokenizer unchanged to preserve general English vocabulary
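
The freezing step above can be sketched as a name-based filter over the model's parameters. This is a minimal illustration, not the author's training code. It assumes encoder parameter names begin with `model.encoder.`, as they do in the Hugging Face Whisper implementation. The helper `freeze_by_prefix` and the stand-in parameters are hypothetical.

```python
from types import SimpleNamespace


def freeze_by_prefix(named_params, prefix="model.encoder."):
    """Disable gradients for every parameter whose name starts with `prefix`.

    `named_params` is any iterable of (name, parameter) pairs, for example
    `model.named_parameters()` on a PyTorch module. Returns two lists of
    names: the parameters that were frozen and the ones left untouched.
    """
    frozen, untouched = [], []
    for name, param in named_params:
        if name.startswith(prefix):
            param.requires_grad = False  # excluded from gradient updates
            frozen.append(name)
        else:
            untouched.append(name)
    return frozen, untouched


# Stand-in parameters so the sketch runs without downloading the model.
demo_params = [
    ("model.encoder.conv1.weight", SimpleNamespace(requires_grad=True)),
    ("model.decoder.embed_tokens.weight", SimpleNamespace(requires_grad=True)),
]
frozen, untouched = freeze_by_prefix(demo_params)
print("frozen:", frozen)
print("still trainable:", untouched)
```

With a real model loaded through `transformers`, passing `model.named_parameters()` to the same function should have the equivalent effect. The Whisper model classes also expose a `freeze_encoder()` method for this purpose.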

Training configuration:

  • Learning rate: 3e-5
  • Epochs: 5
  • Warmup steps: 50
  • Batch size: 8 (×2 gradient accumulation)
  • Mixed precision: FP16
  • Model: distil-whisper/distil-large-v3.5
  • Evaluation metric: Word Error Rate (WER)
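
For reference, the configuration above can be written as a plain dictionary. The key names follow the `Seq2SeqTrainingArguments` fields in `transformers`. That mapping is an assumption, since the original training script is not shown here. The helper `effective_batch_size` is hypothetical and assumes a single device by default.

```python
# Hyperparameters from the list above, keyed by the (assumed) matching
# Seq2SeqTrainingArguments field names.
training_config = {
    "learning_rate": 3e-5,
    "num_train_epochs": 5,
    "warmup_steps": 50,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "fp16": True,
}


def effective_batch_size(config, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_config))  # 16 on a single device
```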

Performance

The model achieved the following results during fine-tuning:

Step   Training Loss   Validation Loss   WER
500    0.1131          0.0934            5.16%
1000   0.0654          0.0849            5.72%
1500   0.0208          0.0830            4.37% ✅
2000   0.0152          0.0859            4.94%
2500   0.0080          0.0861            4.85%

The best-performing checkpoint, with the lowest WER (4.37%) and the lowest validation loss, was at step 1500. That checkpoint is the model uploaded here.
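
WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below shows that calculation in plain Python. It is an illustration of the metric only, not the evaluation code used for the figures above, and it applies no text normalization. The function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "lufthansa four five six climb flight level three two zero"
hypothesis = "lufthansa four five six climb flight level three three zero"
print(word_error_rate(reference, hypothesis))  # one substitution in ten words
```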

Limitations

  • Optimized specifically for English ATC communications
  • Not tuned for general-purpose transcription
  • May underperform on non-standard phraseology or overlapping transmissions
  • Training data comes from simulated ATC conversations, which may limit performance on real-world accents and noise

Usage

Basic Usage with Pipeline

The Whisper model requires audio sampled at 16,000 Hz:

from transformers import pipeline

transcriber = pipeline(
    task="automatic-speech-recognition",
    model="tclin/distil-large-v3.5-atcosim-finetune",
    torch_dtype="auto",   # load weights in the dtype stored in the checkpoint
    device="cuda"         # or "cpu"
)

# The original Whisper config forces the <|en|><|transcribe|> tokens and
# suppresses a few special tokens. These settings were removed during
# fine-tuning, so they must be unset at inference as well:
transcriber.model.generation_config.forced_decoder_ids   = None
transcriber.model.generation_config.begin_suppress_tokens = None

result = transcriber("path_to_atc_audio.wav")
print("Transcription:", result["text"])
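
To check whether a recording already matches the 16 kHz mono format before transcription, the WAV header can be inspected with the standard library. This is a minimal sketch. It only handles uncompressed PCM WAV files, which is all the `wave` module reads, and the helper names `wav_info` and `needs_preprocessing` are hypothetical.

```python
import wave

TARGET_RATE = 16_000  # sampling rate expected by Whisper


def wav_info(path):
    """Return (sample_rate, channels) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_preprocessing(path, target_rate=TARGET_RATE):
    """True if the file is not already mono at the target sampling rate."""
    rate, channels = wav_info(path)
    return rate != target_rate or channels != 1
```

If `needs_preprocessing("path_to_atc_audio.wav")` returns True, the audio needs to be resampled or down-mixed before it reaches the model.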

Advanced Usage with Audio Processing

import torch, torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

MODEL_ID = "tclin/distil-large-v3.5-atcosim-finetune"
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE    = torch.float16 if torch.cuda.is_available() else torch.float32

# 1. Load & pre-process audio
audio_path = "path_to_atc_audio.wav"
waveform, sr = torchaudio.load(audio_path)         # (channels, time)

# Down-mix stereo → mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz (high-quality filter)
if sr != 16_000:
    waveform = torchaudio.transforms.Resample(
        sr, 16_000, lowpass_filter_width=64, rolloff=0.99,
        resampling_method="sinc_interp_hann"  # named "sinc_interpolation" in older torchaudio releases
    )(waveform)

audio_np = waveform.squeeze(0).numpy()

# 2. Load model & processor 
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_ID, torch_dtype=DTYPE, use_safetensors=True
).to(DEVICE)

# The original Whisper config forces the <|en|><|transcribe|> tokens and
# suppresses a few special tokens. These settings were removed during
# fine-tuning, so they must be unset at inference as well:
model.generation_config.forced_decoder_ids   = None
model.generation_config.begin_suppress_tokens = None

# 3. Feature extraction & generation
with torch.inference_mode():
    inputs = processor(audio_np, sampling_rate=16_000,
                       return_tensors="pt").to(DEVICE, DTYPE)
    ids = model.generate(**inputs, max_new_tokens=128)
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]

print("✈️  Transcription:", text)

Important Notes

  • Always ensure audio is resampled to 16 kHz before processing
  • Whisper expects mono input; convert stereo to mono if needed
  • The Whisper tokenizer is byte-level and does not require language-specific customization
  • The model performs best on clean ATC communications with standard phraseology

Broader Application

This model can be used as part of an automated pipeline for:

  1. Transcribing aviation communications
  2. Extracting call signs, altitudes, and instructions
  3. Analyzing communication patterns for traffic management or compliance
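
As an illustration of the extraction step in such a pipeline, the sketch below pulls flight levels out of a transcript. It assumes transcripts are lowercase with digits spelled out as words (for example "flight level three two zero"). That format is an assumption and should be checked against the model's actual output. The function name and digit table are hypothetical.

```python
# Spelled-out digits, including the ICAO pronunciation "niner".
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "niner": "9",
}


def extract_flight_levels(transcript):
    """Return every flight level mentioned in the transcript as an integer."""
    words = transcript.lower().split()
    levels = []
    i = 0
    while i < len(words) - 1:
        if words[i] == "flight" and words[i + 1] == "level":
            # Collect the run of digit words that follows "flight level".
            j = i + 2
            digits = ""
            while j < len(words) and words[j] in DIGIT_WORDS:
                digits += DIGIT_WORDS[words[j]]
                j += 1
            if digits:
                levels.append(int(digits))
            i = j
        else:
            i += 1
    return levels


print(extract_flight_levels(
    "lufthansa four five six climb flight level three two zero"
))
```

Call signs and other instructions would need their own rules or a trained tagger. This example covers only the simplest case.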

Citation

If you use this model in your research, please cite:

@misc{ta-chun_lin_2025,
  author       = { Ta-Chun Lin },
  title        = { distil-whisper-large-v3.5-atcosim-finetune (Step 1500) },
  year         = 2025,
  url          = { https://huggingface.co/tclin/distil-whisper-large-v3.5-atcosim-finetune },
  doi          = { 10.57967/hf/5803 },
  publisher    = { Hugging Face }
}

Acknowledgements

  • OpenAI and Distil-Whisper for the base model
  • The ATCOSIM dataset contributors
  • Hugging Face and the open-source community for tooling and support