# CUPE: Contextless Universal Phoneme Encoder

A PyTorch model for contextless phoneme prediction from speech audio.

CUPE processes 120 ms frames independently, ensuring each frame's embeddings are acoustically pure, unlike transformer models that mix context across frames.
## Quick Links

- Bournemouth Forced Aligner: for phoneme/word timestamp alignment
- CUPE GitHub: source code repository
- CUPE Hugging Face: pre-trained models
## Trained Models

Three 30.1M-parameter models are available, all in the checkpoints directory.
### Model Performance

| Model | Languages | PER | GER | Description |
|---|---|---|---|---|
| English | English | 0.25 | 0.23 | Best quality for English speech |
| Multilingual MLS | 8 European | 0.31 | 0.26 | en, de, fr, es, pt, it, pl, nl |
| Multilingual MSWC | 38 languages | 0.49 | 0.39 | Broad language coverage |
### Detailed Metrics

English (`en_libri1000_uj01d`):

- PER (Phoneme Error Rate): 0.25
- GER (Phoneme Group Error Rate): 0.23

Multilingual MLS (`multi_MLS8_uh02`):

- PER: 0.31
- GER: 0.26

Multilingual MSWC (`multi_mswc38_ug20`):

- PER: 0.49
- GER: 0.39
Note: CUPE models are designed for contextless phoneme prediction and are not optimal for phoneme classification tasks that require contextual information. CUPE excels at extracting pure, frame-level embeddings that represent the acoustic properties of each phoneme independently of the surrounding context.
## Datasets

### Training Data Sources

- LibriSpeech ASR corpus (SLR12): 960 hours of English speech
- Multilingual LibriSpeech (MLS): 800 hours across 8 languages
- MSWC Multilingual Spoken Words: 240 hours from 50 languages
### Dataset Details

LibriSpeech ASR corpus (SLR12):

- 960 hours of English speech
- train-clean-100, train-clean-360, and train-other-500 splits
Multilingual LibriSpeech (MLS) (SLR94):

- 800 hours total (100 hours per language)
- 8 languages: `pl`, `pt`, `it`, `es`, `fr`, `nl`, `de`, `en`
MSWC Multilingual Spoken Words Corpus:

- 240 hours from 50 languages (max 10 hours per language)
- Training: 38 languages (`en`, `de`, `fr`, `ca`, `es`, `fa`, `it`, `ru`, `pl`, `eu`, `cy`, `eo`, `nl`, `pt`, `tt`, `cs`, `tr`, `et`, `ky`, `id`, `sv-SE`, `ar`, `el`, `ro`, `lv`, `sl`, `zh-CN`, `ga-IE`, `ta`, `vi`, `gn`, `or`)
- Testing: 6 languages (`lt`, `mt`, `ia`, `sk`, `ka`, `as`)
Need a new language? Start a new discussion and we'll train it for you!
## Installation

### Quick Start (Bournemouth Forced Aligner)

```bash
# Install the package
pip install bournemouth-forced-aligner

# Install system dependencies
apt-get install espeak-ng ffmpeg

# Show help
balign --help
```

See the complete BFA guide.
### Quick Start (CUPE)

```bash
# Install core dependencies
pip install torch torchaudio huggingface_hub
```
### Easy Usage with Automatic Download

Zero setup required: models are downloaded automatically from the Hugging Face Hub.
#### Example Output

Running with the sample audio `butterfly.wav`:

```
Loading CUPE english model...
Model loaded on cpu
Processing audio: 1.26s duration
Processed 75 frames (1200ms total)

Results:
Phoneme predictions shape: (75,)
Group predictions shape: (75,)
Model info: {'model_name': 'english', 'sample_rate': 16000, 'frames_per_second': 62.5}

First 10 frame predictions:
Frame 0: phoneme=66, group=16
Frame 1: phoneme=66, group=16
Frame 2: phoneme=29, group=7
...

Phonemes: ['b', 'ʌ', 't', 'h', 'ʌ', 'f', 'l', 'æ']...
Groups: ['voiced_stops', 'central_vowels', 'voiceless_stops']...
```
#### Python Code

```python
import sys
import importlib.util

import torch
import torchaudio
from huggingface_hub import hf_hub_download


def import_module_from_file(module_name, file_path):
    """Dynamically import a Python module from a downloaded file."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register so cross-module imports resolve
    spec.loader.exec_module(module)
    return module


def load_cupe_model(model_name="english", device="auto"):
    """Load a CUPE model, downloading its files from the Hugging Face Hub."""
    model_files = {
        "english": "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
        "multilingual-mls": "multi_MLS8_uh02_e36_val_GER=0.2334.ckpt",
        "multilingual-mswc": "multi_mswc38_ug20_e59_val_GER=0.5611.ckpt",
    }
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Download files automatically from the Hugging Face Hub
    repo_id = "Tabahi/CUPE-2i"
    model_file = hf_hub_download(repo_id=repo_id, filename="model2i.py")
    windowing_file = hf_hub_download(repo_id=repo_id, filename="windowing.py")
    checkpoint = hf_hub_download(repo_id=repo_id, filename=f"ckpt/{model_files[model_name]}")
    model_utils_file = hf_hub_download(repo_id=repo_id, filename="model_utils.py")

    # Import the modules dynamically (model2i depends on model_utils)
    import_module_from_file("model_utils", model_utils_file)
    model2i = import_module_from_file("model2i", model_file)
    windowing = import_module_from_file("windowing", windowing_file)

    # Initialize the model
    extractor = model2i.CUPEEmbeddingsExtractor(checkpoint, device=device)
    return extractor, windowing


# Example usage
extractor, windowing = load_cupe_model("english")

# Load your audio and resample to 16 kHz if necessary
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# Add a batch dimension and slice into 120 ms windows with an 80 ms stride
audio_batch = audio.unsqueeze(0)
windowed_audio = windowing.slice_windows(audio_batch, 16000, 120, 80)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# Get predictions
logits_phonemes, logits_groups = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)
print(f"Phoneme logits shape: {logits_phonemes.shape}")  # [num_windows, frames_per_window, 66]
print(f"Group logits shape: {logits_groups.shape}")      # [num_windows, frames_per_window, 16]
```
### Advanced Usage (Manual Setup)

For more control, see `run.py`:

```python
import torch
from model2i import CUPEEmbeddingsExtractor  # Main CUPE feature extractor
import windowing  # Provides slice_windows and stich_window_predictions

# Load the model from a local checkpoint
cupe_ckpt_path = "./ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
extractor = CUPEEmbeddingsExtractor(cupe_ckpt_path, device="cuda")

# Prepare audio (a silent dummy waveform stands in for real input here)
sample_rate = 16000
window_size_ms = 120
stride_ms = 80
max_wav_len = 10 * sample_rate  # 10 seconds
dummy_wav = torch.zeros(1, max_wav_len, dtype=torch.float32, device="cpu")
audio_batch = dummy_wav.unsqueeze(0)  # Add a batch dimension: [B, C, T]

# Window the audio
windowed_audio = windowing.slice_windows(
    audio_batch.to("cuda"),
    sample_rate,
    window_size_ms,
    stride_ms
)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# Get predictions
logits, _ = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)

# Reshape and stitch the window predictions back into one timeline
frames_per_window = logits.shape[1]
logits = logits.reshape(batch_size, num_windows, frames_per_window, -1)
logits = windowing.stich_window_predictions(
    logits,
    original_audio_length=audio_batch.size(2),
    cnn_output_size=frames_per_window,
    sample_rate=sample_rate,
    window_size_ms=window_size_ms,
    stride_ms=stride_ms
)

print(f"Output shape: {logits.shape}")  # [B, T, 66]
```
## Output Format

- Phoneme logits: `(time_frames, 66)` for the 66 IPA phoneme classes
- Group logits: `(time_frames, 16)` for the 16 phoneme groups
- Time resolution: 16 ms per frame (62.5 frames per second)
- Mapping: see `mapper.py` for the phoneme-to-index mapping
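Because each frame spans 16 ms, frame indices convert directly to timestamps. As a rough sketch (the `frames_to_segments` helper below is illustrative, not part of the repository), stitched per-frame predictions can be collapsed into timed runs:

```python
import torch

FRAME_MS = 1000.0 / 62.5  # 16 ms per frame

def frames_to_segments(logits: torch.Tensor):
    """Collapse per-frame argmax over [T, num_classes] logits into (token, start_ms, end_ms) runs."""
    preds = logits.argmax(dim=-1)
    segments, start = [], 0
    for t in range(1, len(preds) + 1):
        if t == len(preds) or preds[t] != preds[start]:
            segments.append((int(preds[start]), start * FRAME_MS, t * FRAME_MS))
            start = t
    return segments

# Example with random logits shaped like a stitched output [T, 66]
print(frames_to_segments(torch.randn(75, 66))[:3])
```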
## Key Features

- No manual downloads: automatic via the Hugging Face Hub
- Multiple languages: English + 37 other languages
- Real-time capable: faster than real-time on GPU
- Frame-level timing: 16 ms resolution
- Contextless: each frame is processed independently
## Custom Dataset for Training

### Training Setup

- See `mapper.py` for tokenization (66 phonemes + 16 groups)
- Use IPA-based grapheme-to-phoneme tools: espeak-ng
- Convert words to IPA sequences: phonemizer
- Map IPA phonemes to tokens: IPAPhonemeMapper

Token mapping (a small end-to-end sketch follows below):

- Token 0: silence
- Tokens 1-65: IPA phonemes
- Token 66: blank/noise
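As a rough illustration of this pipeline: the `ipa_to_token` dict below is a toy stand-in for the real 66-phoneme table in `mapper.py`, and a real mapper must also handle multi-character IPA symbols such as diphthongs.

```python
from phonemizer import phonemize  # standard phonemizer API with the espeak-ng backend

SILENCE, BLANK = 0, 66
# Toy stand-in for the IPA-to-token table in mapper.py (tokens 1-65)
ipa_to_token = {"h": 1, "ə": 2, "l": 3, "o": 4, "ʊ": 5}

ipa = phonemize("hello", language="en-us", backend="espeak", strip=True)
tokens = [ipa_to_token.get(ch, BLANK) for ch in ipa]  # unknown symbols fall back to blank
print(ipa, tokens)
```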
## Use Cases

- Timestamp alignment (examples coming soon)
- Speech analysis
- Phoneme recognition
- Audio processing
## Visual Results

- Sample probabilities timeline
- Multilingual confusion plot
- English-only confusion plot
## Citation

Paper: CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

```bibtex
@inproceedings{rehman2025cupe,
  title        = {CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing},
  author       = {Abdul Rehman and Jian-Jun Zhang and Xiaosong Yang},
  booktitle    = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)},
  year         = {2025},
  organization = {ICNLSP},
  publisher    = {International Conference on Natural Language and Speech Processing},
}
```