CUPE: Contextless Universal Phoneme Encoder

Hugging Face | GitHub | Paper | License: GPLv3

A PyTorch model for contextless phoneme prediction from speech audio

CUPE processes 120ms frames independently, ensuring each frame's embeddings are acoustically pure, unlike transformer models that mix context across frames.

Trained Models

Three 30.1M parameter models available

All models are available in the checkpoints directory.

Model Performance

| Model | Languages | PER | GER | Description |
|---|---|---|---|---|
| English | English | 0.25 | 0.23 | Best quality for English speech |
| Multilingual MLS | 8 European | 0.31 | 0.26 | en, de, fr, es, pt, it, pl, nl |
| Multilingual MSWC | 38 languages | 0.49 | 0.39 | Broad language coverage |

Detailed Metrics

English (en_libri1000_uj01d):

  • PER: 0.25 (Phoneme Error Rate)
  • GER: 0.23 (Phoneme Group Error Rate)

Multilingual MLS (multi_MLS8_uh02):

  • PER: 0.31
  • GER: 0.26

Multilingual MSWC (multi_mswc38_ug20):

  • PER: 0.49
  • GER: 0.39

Note: CUPE models are designed for contextless phoneme prediction and are not optimal for phoneme classification tasks that require contextual information. CUPE excels at extracting pure, frame-level embeddings that represent the acoustic properties of each phoneme independently of surrounding context.
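
The model card does not say how PER and GER were computed. A common definition, assumed here, is the Levenshtein edit distance between the predicted and reference sequences divided by the reference length. The sketch below illustrates that definition only; `edit_distance` and `phoneme_error_rate` are hypothetical helper names and the reference transcription is made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and hypothesis, for illustration only.
reference = ["b", "ʌ", "t", "ə", "f", "l", "aɪ"]
hypothesis = ["b", "ʌ", "t", "h", "ʌ", "f", "l", "æ"]
print(f"PER = {phoneme_error_rate(reference, hypothesis):.2f}")
```

The same function would give a group error rate if both sequences held phoneme-group labels instead of phonemes.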


Datasets

Training Data Sources

  • LibriSpeech ASR corpus (SLR12): 960 hours of English speech
  • Multilingual LibriSpeech (MLS): 800 hours across 8 languages
  • MSWC Multilingual Spoken Words: 240 hours from 50 languages

Dataset Details

LibriSpeech ASR corpus (SLR12):

  • 960 hours of English speech
  • train-clean-100, train-clean-360, and train-other-500 splits

Multilingual LibriSpeech (MLS) (SLR94):

  • 800 hours total (100 hours each)
  • 8 languages: pl, pt, it, es, fr, nl, de, en

MSWC Multilingual Spoken Words Corpus:

  • 240 hours from 50 languages (max 10 hours/language)
  • Training: 38 languages (en, de, fr, ca, es, fa, it, ru, pl, eu, cy, eo, nl, pt, tt, cs, tr, et, ky, id, sv-SE, ar, el, ro, lv, sl, zh-CN, ga-IE, ta, vi, gn, or)
  • Testing: 6 languages (lt, mt, ia, sk, ka, as)

Need a new language? Start a new discussion and we'll train it for you!


Installation

Quick Start (Bournemouth Forced Aligner)

# Install the package
pip install bournemouth-forced-aligner

# Install system dependencies
apt-get install espeak-ng ffmpeg

# Show help
balign --help

See the complete BFA guide.

Quick Start (CUPE)

# Install core dependencies
pip install torch torchaudio huggingface_hub

Easy Usage with Automatic Download

Zero setup required: model files download automatically from the Hugging Face Hub.

Example Output

Running with sample audio butterfly.wav:

Loading CUPE english model...
Model loaded on cpu
Processing audio: 1.26s duration
Processed 75 frames (1200ms total)

Results:
Phoneme predictions shape: (75,)
Group predictions shape: (75,)
Model info: {'model_name': 'english', 'sample_rate': 16000, 'frames_per_second': 62.5}

First 10 frame predictions:
Frame 0: phoneme=66, group=16
Frame 1: phoneme=66, group=16
Frame 2: phoneme=29, group=7
...

Phonemes: ['b', 'ʌ', 't', 'h', 'ʌ', 'f', 'l', 'æ']...
Groups: ['voiced_stops', 'central_vowels', 'voiceless_stops']...

Python Code

import importlib.util
import sys

import torch
import torchaudio
from huggingface_hub import hf_hub_download


def import_module_from_file(module_name, file_path):
    """Import a Python file as a module and register it in sys.modules."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    # Registering the module lets the other downloaded files import it by name.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


def load_cupe_model(model_name="english", device="auto"):
    """Load a CUPE model, downloading its files from the Hugging Face Hub."""

    model_files = {
        "english": "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
        "multilingual-mls": "multi_MLS8_uh02_e36_val_GER=0.2334.ckpt",
        "multilingual-mswc": "multi_mswc38_ug20_e59_val_GER=0.5611.ckpt",
    }

    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Download files automatically from Hugging Face Hub
    repo_id = "Tabahi/CUPE-2i"
    model_file = hf_hub_download(repo_id=repo_id, filename="model2i.py")
    windowing_file = hf_hub_download(repo_id=repo_id, filename="windowing.py")
    checkpoint = hf_hub_download(repo_id=repo_id, filename=f"ckpt/{model_files[model_name]}")
    model_utils_file = hf_hub_download(repo_id=repo_id, filename="model_utils.py")

    # Import modules dynamically (model_utils first, as in the original call order)
    import_module_from_file("model_utils", model_utils_file)
    model2i = import_module_from_file("model2i", model_file)
    windowing = import_module_from_file("windowing", windowing_file)

    # Initialize model
    extractor = model2i.CUPEEmbeddingsExtractor(checkpoint, device=device)
    return extractor, windowing


# Example usage
extractor, windowing = load_cupe_model("english")

# Load and process your audio
audio, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# Add batch dimension and process (120ms windows, 80ms stride)
audio_batch = audio.unsqueeze(0)
windowed_audio = windowing.slice_windows(audio_batch, 16000, 120, 80)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# Get predictions
logits_phonemes, logits_groups = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)

print(f"Phoneme logits shape: {logits_phonemes.shape}")  # [num_windows, frames_per_window, 67]
print(f"Group logits shape: {logits_groups.shape}")      # [num_windows, frames_per_window, 17]
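
The call `slice_windows(audio_batch, 16000, 120, 80)` cuts the audio into 120ms windows taken every 80ms, so neighbouring windows overlap by 40ms. The sketch below only works through that arithmetic in plain Python. `window_starts` is a hypothetical helper and it keeps full windows only; how the real `windowing.slice_windows` handles the tail of the audio (padding or truncation) is not stated here and may differ.

```python
def window_starts(num_samples, sample_rate=16000, window_ms=120, stride_ms=80):
    """Return (window_size, stride, start indices), counting full windows only."""
    window_size = sample_rate * window_ms // 1000  # 1920 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000       # 1280 samples at 16 kHz
    if num_samples < window_size:
        return window_size, stride, []
    last_start = num_samples - window_size
    return window_size, stride, list(range(0, last_start + 1, stride))


# One second of 16 kHz audio
window_size, stride, starts = window_starts(16000)
print(f"window={window_size} samples, stride={stride} samples, windows={len(starts)}")
print(f"overlap between neighbours: {window_size - stride} samples")
```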

Advanced Usage (Manual Setup)

Manual Setup Code

For more control, see run.py:

import torch
import torchaudio
from model2i import CUPEEmbeddingsExtractor  # Main CUPE model feature extractor
import windowing  # Provides slice_windows, stich_window_predictions

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model from local checkpoint
cupe_ckpt_path = "./ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
extractor = CUPEEmbeddingsExtractor(cupe_ckpt_path, device=device)

# Prepare audio
sample_rate = 16000
window_size_ms = 120
stride_ms = 80
max_wav_len = 10 * sample_rate  # 10 seconds

dummy_wav = torch.zeros(1, max_wav_len, dtype=torch.float32, device="cpu")
audio_batch = dummy_wav.unsqueeze(0)  # Add batch dimension -> [1, 1, max_wav_len]

# Window the audio
windowed_audio = windowing.slice_windows(
    audio_batch.to(device),
    sample_rate,
    window_size_ms,
    stride_ms
)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# Get predictions (phoneme logits only; group logits are discarded)
logits, _ = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)

# Reshape and stitch window predictions
frames_per_window = logits.shape[1]
logits = logits.reshape(batch_size, num_windows, frames_per_window, -1)
logits = windowing.stich_window_predictions(
    logits,
    original_audio_length=audio_batch.size(2),
    cnn_output_size=frames_per_window,
    sample_rate=sample_rate,
    window_size_ms=window_size_ms,
    stride_ms=stride_ms
)

print(f"Output shape: {logits.shape}")  # [B, T, 67]

Output Format

  • Phoneme logits: (time_frames, 67) - 67 classes (token indices 0-66; see Token Mapping)
  • Group logits: (time_frames, 17) - 17 classes (group indices 0-16)
  • Time resolution: 16ms per frame (62.5 FPS)
  • Mapping: See mapper.py for phoneme-to-index mapping
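
Logits are unnormalised scores, so a typical next step is to turn each frame's row into probabilities and pick the most likely class. Here is a minimal pure-Python sketch of that step on a toy 4-class frame. `softmax`, `decode_frame`, and the sample values are illustrative assumptions; with real model output the same thing is usually done with `torch.softmax` and `argmax` over the last dimension.

```python
import math

FRAME_MS = 16  # time resolution stated above


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_frame(logits):
    """Return (class index, probability) of the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Toy logits for two frames with four classes each (made-up values).
frames = [[0.1, 2.5, -1.0, 0.3], [3.0, 0.2, 0.1, -0.5]]
for index, frame_logits in enumerate(frames):
    class_id, prob = decode_frame(frame_logits)
    print(f"t={index * FRAME_MS}ms class={class_id} p={prob:.2f}")
```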

Key Features

  • No manual downloads - automatic via Hugging Face Hub
  • Multiple languages - English + 37 other languages
  • Real-time capable - faster than real-time on GPU
  • Frame-level timing - 16ms resolution
  • Contextless - each frame processed independently

Custom Dataset for Training

Training Setup

  • See mapper.py for tokenization (66 phonemes + 16 groups)
  • Use IPA-based grapheme-to-phoneme tools: Espeak-ng
  • Convert words to IPA sequences: phonemizer
  • Map IPA phonemes to tokens: IPAPhonemeMapper

Token Mapping:

  • Token 0: Silence
  • Tokens 1-65: IPA phonemes
  • Token 66: Blank/noise
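
The token ranges above can be expressed as a small lookup, which is handy for filtering non-speech frames before further analysis. This is a sketch based only on the three ranges listed. `token_category` and `keep_phonemes` are hypothetical helpers, and the actual symbol for each index lives in mapper.py, which is not reproduced here.

```python
SILENCE_ID = 0
FIRST_PHONEME_ID = 1
LAST_PHONEME_ID = 65
BLANK_ID = 66


def token_category(token_id):
    """Classify a token index using the ranges from the token mapping."""
    if token_id == SILENCE_ID:
        return "silence"
    if FIRST_PHONEME_ID <= token_id <= LAST_PHONEME_ID:
        return "phoneme"
    if token_id == BLANK_ID:
        return "blank/noise"
    raise ValueError(f"token id out of range: {token_id}")


def keep_phonemes(frame_ids):
    """Drop silence and blank/noise frames, keeping IPA phoneme tokens only."""
    return [t for t in frame_ids if token_category(t) == "phoneme"]


print(keep_phonemes([66, 66, 29, 29, 0, 5]))  # [29, 29, 5]
```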

Use Cases

  • Timestamp alignment (examples coming soon)
  • Speech analysis
  • Phoneme recognition
  • Audio processing

Visual Results

Sample Probabilities Timeline

Sample output logits plot

Multilingual Confusion Plot

Multilingual Confusion Plot (counts)

English-only Confusion Plot

English-only Confusion Plot (probabilities)


Citation

Paper: CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

@inproceedings{rehman2025cupe,
  title     = {CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing},
  author    = {Abdul Rehman and Jian-Jun Zhang and Xiaosong Yang},
  booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)},
  year      = {2025},
  organization = {ICNLSP},
  publisher = {International Conference on Natural Language and Speech Processing},
}

Star this repository if you find it helpful!


Evaluation results (self-reported)

  • Phoneme Error Rate on LibriSpeech: 0.250
  • Phoneme Group Error Rate on LibriSpeech: 0.230
  • Phoneme Error Rate on Multilingual LibriSpeech (MLS): 0.310
  • Phoneme Group Error Rate on Multilingual LibriSpeech (MLS): 0.260
  • Phoneme Error Rate on MSWC Multilingual Spoken Words Corpus: 0.490
  • Phoneme Group Error Rate on MSWC Multilingual Spoken Words Corpus: 0.390