# 🧠 SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper
This repository hosts the SE-DiCoW model, developed by BUT Speech@FIT in collaboration with JHU CLSP/HLTCOE and CMU LTI, for target-speaker multi-talker automatic speech recognition (TS-ASR).
## 🧠 Key Innovations
- **Self-Enrollment (SE):** automatically selects the most informative segment of the target speaker within a conversation and integrates it via cross-attention at each encoder layer (illustrated in the first sketch below).
- **Improved Initialization & Segmentation:** refined FDDT initialization and corrected data segmentation for more stable training.
- **Augmentations** (see the second sketch below):
  - Gaussian noise injection into the STNO masks
  - Segment-wise flipping of dominant STNO classes
  - Joint SpecAugment on the input features and STNO masks
  - MUSAN noise mixing
➡️ Together, these yield a 49.7% tcpWER reduction over the original DiCoW on the EMMA MT-ASR benchmark, with gains of over 70% on the heavily overlapped Libri3Mix.
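To make the self-enrollment idea concrete, here is a minimal, illustrative PyTorch sketch of the conditioning mechanism described above: encoder frames (queries) attend to an embedding of the selected enrollment segment (keys/values). All module and parameter names are hypothetical; this is a sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class EnrollmentCrossAttention(nn.Module):
    """Hypothetical layer: encoder frames attend to an embedding of the
    selected target-speaker enrollment segment via cross-attention."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, enrollment: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, n_frames, d_model) encoder hidden states
        # enrollment: (batch, n_enroll, d_model) enrollment-segment embedding
        attended, _ = self.attn(query=hidden, key=enrollment, value=enrollment)
        return self.norm(hidden + attended)  # residual connection + layer norm

# Toy shapes only; the large-v3-turbo encoder itself uses d_model = 1280.
layer = EnrollmentCrossAttention(d_model=64, n_heads=4)
out = layer(torch.randn(2, 100, 64), torch.randn(2, 25, 64))
print(out.shape)  # torch.Size([2, 100, 64])
```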
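The mask-level augmentations are also easy to sketch. The snippet below assumes the STNO conditioning is a per-frame soft mask over the four classes (silence, target, non-target, overlap); the function names and the exact flipping scheme are illustrative, not the repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(stno: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Perturb a soft STNO mask with Gaussian noise, then renormalize rows."""
    noisy = np.clip(stno + rng.normal(0.0, sigma, stno.shape), 0.0, None)
    return noisy / noisy.sum(axis=-1, keepdims=True)

def flip_target_nontarget(stno: np.ndarray, start: int, end: int) -> np.ndarray:
    """Swap the target/non-target columns inside one segment — one possible
    realization of 'segment-wise flipping of dominant STNO classes'."""
    flipped = stno.copy()
    flipped[start:end] = flipped[start:end][:, [0, 2, 1, 3]]
    return flipped

# Fake 1500-frame mask: columns = (silence, target, non-target, overlap).
stno = rng.dirichlet(np.ones(4), size=1500)
stno = add_gaussian_noise(stno)
stno = flip_target_nontarget(stno, start=200, end=400)
print(stno.shape, stno.sum(axis=-1)[:3])  # rows still sum to 1
```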
## 🛠️ Model Usage
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/SE_DiCoW"

# trust_remote_code=True is required to load the custom
# diarization-conditioned model code shipped with the checkpoint.
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
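Beyond loading the model, a minimal transcription call might look like the sketch below. It assumes the standard Whisper-style `AutoProcessor` interface; the diarization (STNO) conditioning inputs are model-specific and are handled by the training and inference pipelines referenced under Source Repositories.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL_NAME = "BUT-FIT/SE_DiCoW"

model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Placeholder audio: replace with a real 16 kHz mono waveform.
waveform = torch.zeros(16000 * 30)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```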
➡️ Training and inference pipelines: see the Source Repositories section below.
## 📊 Performance
Benchmark: EMMA MT-ASR (multi-domain, multi-talker)
- SE-DiCoW outperforms DiCoW and DiCoW v3.2 under both oracle and real diarization, particularly in highly overlapped conditions (Libri3Mix).
- Achieves performance that is state-of-the-art or comparable to that of domain-tuned systems on AMI, NOTSOFAR-1, and synthetic LibriMix mixtures.
## 📦 Model Details
- Base Model: Whisper large-v3-turbo
- Training Datasets:
  - NOTSOFAR-1
  - AMI Meeting Corpus
  - Libri2Mix / Libri3Mix
  - LibriSpeech synthetic mixtures
𧬠Source Repositories
## 📚 Related Publications
- 📰 ICASSP 2026: SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper (IEEE ICASSP 2026)
- 📰 Journal Paper (CSL 2026): DiCoW: Diarization-Conditioned Whisper for Target Speaker ASR (Computer Speech & Language, 2026)
- 📰 ICASSP 2025: Target Speaker ASR with Whisper (IEEE ICASSP 2025)
## 📖 Citation
If you use this model, please cite the following works:
```bibtex
@inproceedings{polok2026sedicow,
  author    = {Polok, Alexander and Klement, Dominik and Cornell, Samuele and Wiesner, Matthew and Černocký, Jan and Khudanpur, Sanjeev and Burget, Lukáš},
  booktitle = {ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper},
  year      = {2026},
  pages     = {1-5},
}

@article{polok2026dicow,
  author  = {Polok, Alexander and Klement, Dominik and Kocour, Martin and Han, Jiangyu and Landini, Federico and Yusuf, Bolaji and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech \& Language},
  volume  = {95},
  pages   = {101841},
  year    = {2026},
  doi     = {10.1016/j.csl.2025.101841},
}

@inproceedings{polok2025target,
  author    = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {Target Speaker ASR with Whisper},
  year      = {2025},
  doi       = {10.1109/ICASSP49660.2025.10887683},
}
```
## 📬 Contact
For questions or collaboration inquiries:
- 📧 Email: [email protected]
- 🏢 Affiliation: BUT Speech@FIT, Brno University of Technology
- 🌐 GitHub: BUTSpeechFIT