🧠 DiCoW_v3.2: BUT-FIT Model for MT-ASR

This repository hosts the DiCoW_v3.2 model developed by BUT Speech@FIT, tailored for multi-talker automatic speech recognition (MT-ASR).

🔧 Key Improvements over DiCoW v1

  • FDDT (Frame-Level Diarization-Dependent Transformation) applied before the positional embeddings (see the sketch after this list)
  • Less strict suppressive initialization to ease early training dynamics
  • Enhanced sequential decoding with fallback seeking
  • Frozen decoder during fine-tuning to retain language-modeling capabilities
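
The FDDT block can be pictured as a per-frame conditioning layer: one affine transform per STNO class (silence, target, non-target, overlap), blended by the diarization posteriors and applied to the encoder features before the positional embeddings are added. The sketch below is a minimal illustration under those assumptions; the names, shapes, and initialization values are ours, not the exact DiCoW_v3.2 implementation.

import torch
import torch.nn as nn

class FDDT(nn.Module):
    # Illustrative frame-level diarization-dependent transformation.
    def __init__(self, d_model: int, n_classes: int = 4):
        super().__init__()
        # One affine transform per STNO class, initialized to identity.
        # A "less strict suppressive initialization" would start the
        # non-target scales near (rather than at) zero.
        self.scale = nn.Parameter(torch.ones(n_classes, d_model))
        self.bias = nn.Parameter(torch.zeros(n_classes, d_model))

    def forward(self, x: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # x:    (B, T, d_model) encoder features, before positional embeddings
        # stno: (B, T, n_classes) per-frame diarization posteriors
        scale = stno @ self.scale  # blend per-class scales -> (B, T, d_model)
        bias = stno @ self.bias
        return scale * x + bias

Because the transformation happens before the positional embeddings are added, the diarization conditioning rescales only the acoustic content of each frame, not its positional information.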

🧪 Augmentations

  • Random STNO noise injection
  • Segment-wise random class flipping of STNO tokens (both sketched after this list)
  • SpecAugment
  • MUSAN noise mixing
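
The two STNO augmentations can be pictured as below: posterior noise injection followed by occasional hard relabeling of a random segment. The noise level, flip probability, and hard one-hot relabeling are assumptions made for this sketch, not the actual training configuration.

import torch

def augment_stno(stno: torch.Tensor, noise_std: float = 0.05,
                 flip_prob: float = 0.1) -> torch.Tensor:
    # stno: (T, n_classes) per-frame STNO posteriors
    n_frames, n_classes = stno.shape
    # 1) Random noise injection: perturb the posteriors, then renormalize.
    noisy = (stno + noise_std * torch.randn_like(stno)).clamp(min=0.0)
    noisy = noisy / noisy.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    # 2) Segment-wise class flipping: occasionally relabel a random
    #    contiguous segment with a random class.
    if torch.rand(()) < flip_prob:
        t0 = int(torch.randint(0, n_frames, ()))
        t1 = int(torch.randint(t0, n_frames, ()))
        one_hot = torch.zeros(n_classes)
        one_hot[int(torch.randint(0, n_classes, ()))] = 1.0
        noisy[t0:t1 + 1] = one_hot
    return noisy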

βš™οΈ Optimization & Inference Enhancements

  • Updated learning-rate schedule
  • Improved hallucination detection & mitigation during inference (an illustrative check is sketched below)
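
As an illustration of what such a check can look like, the snippet below flags decoded segments whose short n-grams repeat suspiciously often, a common signature of Whisper-style looping. This heuristic and its thresholds are assumptions for illustration, not necessarily the mechanism used in DiCoW_v3.2.

from collections import Counter

def looks_hallucinated(tokens, max_ngram=3, max_repeats=4):
    # Flag a decoded segment if any short n-gram repeats too often.
    for n in range(1, max_ngram + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams and Counter(ngrams).most_common(1)[0][1] > max_repeats:
            return True
    return False

A segment flagged this way can then be re-decoded with different decoding parameters or dropped, depending on the pipeline.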

🛠️ Model Usage

from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_2"
# trust_remote_code=True is required: the DiCoW model class lives in this
# repository rather than in the transformers library itself.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
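
Since DiCoW builds on Whisper, audio preprocessing follows the usual Whisper recipe. The snippet below is a minimal sketch that assumes the checkpoint ships a Whisper-compatible processor; the diarization (STNO) conditioning inputs are model-specific and are handled by the inference pipeline linked below.

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

# 1 s of silence as a stand-in for real 16 kHz mono audio
audio = torch.zeros(16000)
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
# inputs.input_features holds log-mel features for the Whisper-style encoder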

➡️ For detailed inference pipelines, see: DiCoW GitHub (Inference)


πŸ† Performance

See how DiCoW_v3.2 performs on our multi-talker ASR benchmark:


📦 Model Details

  • Parameters: 958M (Safetensors checkpoint)
  • Tensor type: F32

πŸ“ Citation

If you use this model, please cite the following works:

@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech \& Language},
    volume = {95},
    pages = {101841},
    year = {2026},
    issn = {0885-2308},
    doi = {10.1016/j.csl.2025.101841},
    url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
    keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@inproceedings{10887683,
    author = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
    booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    title = {Target Speaker ASR with Whisper},
    year = {2025},
    pages = {1-5},
    keywords = {Transforms; Signal processing; Transformers; Acoustics; Speech processing; target-speaker ASR; diarization conditioning; multi-speaker ASR; Whisper},
    doi = {10.1109/ICASSP49660.2025.10887683},
}

@inproceedings{polok24_chime,
    title = {BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge},
    author = {Alexander Polok and Dominik Klement and Jiangyu Han and Šimon Sedláček and Bolaji Yusuf and Matthew Maciejewski and Matthew S. Wiesner and Lukáš Burget},
    year = {2024},
    booktitle = {8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024)},
    pages = {18--22},
    doi = {10.21437/CHiME.2024-4},
}

@misc{polok2025mlcslmchallenge,
    title = {BUT System for the MLC-SLM Challenge},
    author = {Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
    year = {2025},
    eprint = {2506.13414},
    archivePrefix = {arXiv},
    primaryClass = {eess.AS},
    url = {https://arxiv.org/abs/2506.13414},
}

📬 Contact

For questions or collaboration inquiries:

📧 Email: [email protected]

🏢 Affiliation: BUT Speech@FIT, Brno University of Technology

🔗 GitHub: BUTSpeechFIT
