# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper

This repository hosts the SE-DiCoW model for target-speaker, multi-talker automatic speech recognition (TS-ASR), developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI.

## 🔧 Key Innovations

- **Self-Enrollment (SE):**
  Automatically selects the most informative segment of the target speaker within the conversation and integrates it via cross-attention at each encoder layer (see the first sketch after this list).
- **Improved Initialization & Segmentation:**
  Refined FDDT initialization and corrected data segmentation for more stable training.
- **Augmentations** (second sketch below):
  - Gaussian noise injection into the STNO masks
  - Segment-wise flipping of dominant STNO classes
  - Joint SpecAugment on the input features and STNO masks
  - MUSAN noise mixing
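
To make the self-enrollment idea concrete, here is a minimal sketch under assumptions: the window-selection heuristic, module names, and dimensions below are invented for illustration and are not SE-DiCoW's actual remote code.

```python
# Minimal sketch of self-enrollment (illustrative assumptions throughout):
# (1) pick the window where the target speaker is most cleanly active,
# (2) let every encoder layer cross-attend to features of that window.
import torch
import torch.nn as nn

def select_enrollment(stno: torch.Tensor, win: int = 100) -> slice:
    """stno: (T, 4) frame posteriors [silence, target, non-target, overlap].
    Assumed heuristic: maximize target-speaker mass over a sliding window."""
    score = stno[:, 1].unfold(0, win, 1).sum(-1)  # target mass per window
    start = int(score.argmax())
    return slice(start, start + win)

class EnrollmentCrossAttention(nn.Module):
    """Hypothetical per-encoder-layer block injecting enrollment context."""
    def __init__(self, d_model: int = 1280, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, enroll: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) encoder states; enroll: (B, T_e, D) enrollment features
        attended, _ = self.attn(self.norm(hidden), enroll, enroll)
        return hidden + attended  # residual injection at each encoder layer
```

The property this illustrates is the one described above: enrollment conditioning is injected at every encoder layer, not only at the input.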
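The first two STNO augmentations can likewise be sketched. Again a hedged illustration: the noise scale, flip probability, and segment length are hypothetical hyperparameters; joint SpecAugment and MUSAN mixing follow their standard recipes and are omitted here.

```python
# Hedged sketch of two STNO-mask augmentations: Gaussian noise injection
# and segment-wise flipping of the dominant class. Shapes, hyperparameters,
# and the renormalization step are assumptions; the training code may differ.
import torch

def augment_stno(stno: torch.Tensor, sigma: float = 0.1,
                 flip_prob: float = 0.1, seg_len: int = 50) -> torch.Tensor:
    """stno: (T, 4) per-frame probabilities [silence, target, non-target, overlap]."""
    # Gaussian noise injection, then renormalize back to a valid distribution
    noisy = stno + sigma * torch.randn_like(stno)
    noisy = noisy.clamp_min(0.0)
    noisy = noisy / noisy.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Segment-wise flipping: occasionally swap a segment's dominant STNO
    # class with a randomly chosen one
    for start in range(0, noisy.shape[0], seg_len):
        if torch.rand(()) < flip_prob:
            seg = noisy[start:start + seg_len]
            dom = int(seg.sum(dim=0).argmax())      # dominant class in segment
            other = int(torch.randint(0, 4, ()))
            seg[:, [dom, other]] = seg[:, [other, dom]]  # column swap (RHS is a copy)
    return noisy
```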

➡️ Together, these changes yield a 49.7% tcpWER reduction over the original DiCoW on the EMMA MT-ASR benchmark, with gains of over 70% on the heavily overlapped Libri3Mix.

*Figure: SE-DiCoW architecture.*

πŸ› οΈ Model Usage

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# trust_remote_code=True is required: SE-DiCoW ships custom modeling code
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
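
Because the model is diarization-conditioned, loading it is only half the story: inference additionally requires per-frame diarization (STNO) inputs handled by the remote code. The following is a hedged input-preparation sketch that assumes standard Whisper preprocessing applies; the exact conditioning argument names are defined by the repository's pipeline and are not reproduced here.

```python
# Hedged input-preparation sketch; assumes SE-DiCoW consumes standard
# Whisper log-mel features. The extra diarization (STNO) conditioning
# inputs are defined by the model's remote code and are NOT shown here.
import numpy as np
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")  # assumption: compatible preprocessing

audio = np.zeros(16000 * 5, dtype=np.float32)  # placeholder: 5 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# ids = model.generate(inputs.input_features, ...)   # plus STNO conditioning per the repo's pipeline
# text = processor.batch_decode(ids, skip_special_tokens=True)
```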

➡️ Training and inference pipelines are available in the source repositories listed below.


πŸ† Performance

**Benchmark:** EMMA MT-ASR (multi-domain, multi-talker)

- SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both oracle and real diarization, particularly in highly overlapped conditions (Libri3Mix).
- Achieves state-of-the-art performance, or performance comparable to domain-tuned systems, on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.

🔗 EMMA MT-ASR Leaderboard


## 📦 Model Details


## 🧬 Source Repositories


## 📚 Related Publications

- 📰 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper, IEEE ICASSP 2026

- 📰 DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR, Computer Speech & Language, 2026

- 📰 Target Speaker ASR with Whisper, IEEE ICASSP 2025


πŸ“ Citation

If you use this model, please cite the following works:

```bibtex
@INPROCEEDINGS{polok2026sedicow,
  author={Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year={2026},
  pages={1-5},
}
```

```bibtex
@article{POLOK2026101841,
  title={DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  author={Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  journal={Computer Speech & Language},
  volume={95},
  pages={101841},
  year={2026},
  doi={10.1016/j.csl.2025.101841},
}
```

```bibtex
@INPROCEEDINGS{polok2025target,
  author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Target Speaker ASR with Whisper},
  year={2025},
  doi={10.1109/ICASSP49660.2025.10887683},
}
```

## 📬 Contact

For questions or collaboration inquiries:

- 📧 Email: [email protected]
- 🏢 Affiliation: BUT Speech@FIT, Brno University of Technology
- 🔗 GitHub: BUTSpeechFIT
