# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

This repository hosts the SE-DiCoW model for target-speaker, multi-talker automatic speech recognition (TS-ASR), developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI.

## 🔧 Key Innovations

- **Self-Enrollment (SE):**
  Automatically selects the most informative segment of the target speaker within the conversation and integrates it via cross-attention at each encoder layer (see the first sketch after this list).
- **Improved Initialization & Segmentation:**
  Refined FDDT initialization and corrected data segmentation for more stable training.
- **Augmentations** (second sketch below):
  - Gaussian noise injection into the STNO masks
  - Segment-wise flipping of dominant STNO classes
  - Joint SpecAugment on the input features and STNO masks
  - MUSAN noise mixing
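
To make the self-enrollment idea concrete, here is a minimal sketch under assumptions: the window-selection heuristic, module names, and dimensions below are invented for illustration and are not SE-DiCoW's actual remote code.

```python
# Minimal sketch of self-enrollment (illustrative assumptions throughout):
# (1) pick the window where the target speaker is most cleanly active,
# (2) let every encoder layer cross-attend to features of that window.
import torch
import torch.nn as nn

def select_enrollment(stno: torch.Tensor, win: int = 100) -> slice:
    """stno: (T, 4) frame posteriors [silence, target, non-target, overlap].
    Assumed heuristic: maximize target-speaker mass over a sliding window."""
    score = stno[:, 1].unfold(0, win, 1).sum(-1)  # target mass per window
    start = int(score.argmax())
    return slice(start, start + win)

class EnrollmentCrossAttention(nn.Module):
    """Hypothetical per-encoder-layer block injecting enrollment context."""
    def __init__(self, d_model: int = 1280, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, enroll: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) encoder states; enroll: (B, T_e, D) enrollment features
        attended, _ = self.attn(self.norm(hidden), enroll, enroll)
        return hidden + attended  # residual injection at each encoder layer
```

The property this illustrates is the one described above: enrollment conditioning is injected at every encoder layer, not only at the input.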
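The first two STNO augmentations can likewise be sketched. Again a hedged illustration: the noise scale, flip probability, and segment length are hypothetical hyperparameters; joint SpecAugment and MUSAN mixing follow their standard recipes and are omitted here.

```python
# Hedged sketch of two STNO-mask augmentations: Gaussian noise injection
# and segment-wise flipping of the dominant class. Shapes, hyperparameters,
# and the renormalization step are assumptions; the training code may differ.
import torch

def augment_stno(stno: torch.Tensor, sigma: float = 0.1,
                 flip_prob: float = 0.1, seg_len: int = 50) -> torch.Tensor:
    """stno: (T, 4) per-frame probabilities [silence, target, non-target, overlap]."""
    # Gaussian noise injection, then renormalize back to a valid distribution
    noisy = stno + sigma * torch.randn_like(stno)
    noisy = noisy.clamp_min(0.0)
    noisy = noisy / noisy.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Segment-wise flipping: occasionally swap a segment's dominant STNO
    # class with a randomly chosen one
    for start in range(0, noisy.shape[0], seg_len):
        if torch.rand(()) < flip_prob:
            seg = noisy[start:start + seg_len]
            dom = int(seg.sum(dim=0).argmax())      # dominant class in segment
            other = int(torch.randint(0, 4, ()))
            seg[:, [dom, other]] = seg[:, [other, dom]]  # column swap (RHS is a copy)
    return noisy
```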

➡️ Together, these changes yield a 49.7% tcpWER reduction over the original DiCoW on the EMMA MT-ASR benchmark, with gains of over 70% on the heavily overlapped Libri3Mix.

*Figure: SE-DiCoW architecture.*

πŸ› οΈ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# trust_remote_code=True is required: SE-DiCoW ships custom modeling code
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
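
Because the model is diarization-conditioned, loading it is only half the story: inference additionally requires per-frame diarization (STNO) inputs handled by the remote code. The following is a hedged input-preparation sketch that assumes standard Whisper preprocessing applies; the exact conditioning argument names are defined by the repository's pipeline and are not reproduced here.

```python
# Hedged input-preparation sketch; assumes SE-DiCoW consumes standard
# Whisper log-mel features. The extra diarization (STNO) conditioning
# inputs are defined by the model's remote code and are NOT shown here.
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")  # assumption: compatible preprocessing

audio = np.zeros(16000 * 5, dtype=np.float32)  # placeholder: 5 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# ids = model.generate(inputs.input_features, ...)   # plus STNO conditioning per the repo's pipeline
# text = processor.batch_decode(ids, skip_special_tokens=True)
```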

➡️ Training and inference pipelines are available in the source repositories listed below.


πŸ† Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

- SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both oracle and real diarization, particularly in highly overlapped conditions (Libri3Mix).
- Achieves state-of-the-art performance, or performance comparable to domain-tuned systems, on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

🔗 EMMA MT-ASR Leaderboard


## 📦 Model Details


## 🧬 Source Repositories


## 📚 Related Publications

- 📰 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper, IEEE ICASSP 2026

- 📰 DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR, Computer Speech & Language, 2026

- 📰 Target Speaker ASR with Whisper, IEEE ICASSP 2025


πŸ“ Citation

If you use this model, please cite the following works:

```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year={2026},
  pages={1-5},
}
```

```bibtex
@article{POLOK2026101841,
  title={DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  author={Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  journal={Computer Speech & Language},
  volume={95},
  pages={101841},
  year={2026},
  doi={10.1016/j.csl.2025.101841},
}
```

```bibtex
@INPROCEEDINGS{polok2025target,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Target Speaker ASR with Whisper},
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887683},
}
```

## 📬 Contact

For questions or collaboration inquiries:

- 📧 Email: [email protected]
- 🏢 Affiliation: BUT Speech@FIT, Brno University of Technology
- 🔗 GitHub: BUTSpeechFIT
