# 🧠 DiCoW_v3.2 – BUT-FIT Model for MT-ASR
This repository hosts the DiCoW_v3.2 model developed by BUT Speech@FIT, tailored for multi-talker automatic speech recognition (MT-ASR).
## 🔧 Key Improvements over DiCoW v1
- FDDT (Frame-Level Diarization Dependent Transformation) applied before the positional embeddings (sketched below)
- Less strict suppressive initialization to ease early training dynamics
- Enhanced sequential decoding with fallback seeking
- Frozen decoder during fine-tuning to retain language modeling capabilities
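The FDDT layer above can be pictured as a per-frame affine transform selected by diarization class. Below is a minimal PyTorch sketch, assuming soft STNO (silence/target/non-target/overlap) posteriors and a per-class scale-and-bias parameterization; the layer's actual shape and initialization are defined in the training code linked under Source Repositories.

```python
import torch
import torch.nn as nn

class FDDT(nn.Module):
    """Frame-Level Diarization Dependent Transformation (illustrative sketch).

    Applies a per-class affine transform to encoder features, weighted by
    soft STNO posteriors, *before* positional embeddings are added.
    """

    def __init__(self, d_model: int, n_classes: int = 4):
        super().__init__()
        # Identity-like init (scale=1, bias=0) keeps early training close to
        # the pretrained encoder -- the "less strict suppressive
        # initialization" idea, loosely interpreted.
        self.scale = nn.Parameter(torch.ones(n_classes, d_model))
        self.bias = nn.Parameter(torch.zeros(n_classes, d_model))

    def forward(self, feats: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, d_model) encoder features
        # stno:  (batch, frames, n_classes) soft diarization posteriors
        scale = stno @ self.scale  # (batch, frames, d_model)
        bias = stno @ self.bias
        return feats * scale + bias
```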
## 🧪 Augmentations
- Random STNO noise injection
- Segment-wise random class flipping of STNO tokens (see the sketch after this list)
- SpecAugment
- MUSAN noise mixing
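As a concrete picture of the STNO flipping augmentation, here is a minimal sketch assuming hard integer labels, where each contiguous run of one class is a segment that gets randomly reassigned with a placeholder probability; the recipe's actual segmentation rule and probabilities may differ.

```python
import torch

def flip_stno_segments(stno: torch.Tensor, flip_prob: float = 0.1) -> torch.Tensor:
    """Segment-wise random class flipping of hard STNO labels (sketch).

    stno: (frames,) integer labels in {0: silence, 1: target,
    2: non-target, 3: overlap}.
    """
    stno = stno.clone()
    # Segment boundaries = positions where the label changes.
    change = torch.nonzero(stno[1:] != stno[:-1]).flatten() + 1
    starts = torch.cat([torch.tensor([0]), change])
    ends = torch.cat([change, torch.tensor([stno.numel()])])
    for s, e in zip(starts.tolist(), ends.tolist()):
        if torch.rand(()).item() < flip_prob:
            # Reassign the whole segment to a random STNO class.
            stno[s:e] = torch.randint(0, 4, (1,))
    return stno
```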
## ⚙️ Optimization & Inference Enhancements
- Updated learning-rate schedule
- Improved hallucination detection & mitigation during inference (a heuristic sketch follows this list)
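One common ingredient of hallucination detection is a repetition check on the decoded text; the sketch below uses Whisper's compression-ratio heuristic as a stand-in. This is an illustrative assumption, not DiCoW's exact logic, which lives in the inference code linked below.

```python
import zlib

def looks_hallucinated(text: str, max_compression_ratio: float = 2.4) -> bool:
    """Flag highly repetitive output (illustrative heuristic only).

    Repetitive hallucinations compress extremely well, so a high zlib
    compression ratio is a cheap warning signal; 2.4 mirrors Whisper's
    default threshold.
    """
    data = text.encode("utf-8")
    if not data:
        return False
    return len(data) / len(zlib.compress(data)) > max_compression_ratio
```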
## 🛠️ Model Usage
```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_2"

# trust_remote_code is required: DiCoW ships custom modeling code on top of Whisper.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
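Since the base model is Whisper large-v3-turbo, feature extraction can follow the standard Whisper processor flow (an assumption: the checkpoint is expected to ship a compatible processor config). Note that this snippet does not construct the per-frame diarization (STNO) inputs DiCoW is conditioned on; see the inference repository below for the full pipeline.

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Whisper-style log-mel feature extraction on 16 kHz audio.
audio = np.zeros(16_000, dtype=np.float32)  # placeholder: 1 s of silence
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
```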
➡️ For detailed inference pipelines, see: DiCoW GitHub (Inference)
## 📊 Performance
See how DiCoW_v3.2 performs on our multi-talker ASR benchmark.
## 📦 Model Details
- Base Model: Whisper large-v3-turbo
- Training Datasets:
## 🧬 Source Repositories
- 🔧 Training Code: TS-ASR-Whisper
- 🚀 Inference
## 📚 Related Publications
- 📰 Journal Paper: DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition, Computer Speech & Language, 2026
- 📰 ICASSP 2025: Target Speaker ASR with Whisper, IEEE ICASSP 2025
- 📰 CHiME-8 System Description: BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge, CHiME 2024 Proceedings
- 📰 MLC-SLM Challenge Submission: BUT System for the MLC-SLM Challenge, arXiv:2506.13414
## 📖 Citation
If you use this model, please cite the following works:
```bibtex
@article{POLOK2026101841,
title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
journal = {Computer Speech \& Language},
volume = {95},
pages = {101841},
year = {2026},
issn = {0885-2308},
doi = {10.1016/j.csl.2025.101841},
url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}
@INPROCEEDINGS{10887683,
author={Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Target Speaker ASR with Whisper},
year={2025},
pages={1-5},
keywords={Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
doi={10.1109/ICASSP49660.2025.10887683}
}
@inproceedings{polok24_chime,
title = {BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge},
author = {Alexander Polok and Dominik Klement and Jiangyu Han and Šimon Sedláček and Bolaji Yusuf and Matthew Maciejewski and Matthew S. Wiesner and Lukáš Burget},
year = {2024},
booktitle = {8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024)},
pages = {18--22},
doi = {10.21437/CHiME.2024-4},
}
@misc{polok2025mlcslmchallenge,
title={BUT System for the MLC-SLM Challenge},
author={Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
year={2025},
eprint={2506.13414},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2506.13414},
}
```
## 💬 Contact
For questions or collaboration inquiries:
- 📧 Email: [email protected]
- 🏢 Affiliation: BUT Speech@FIT, Brno University of Technology
- 🌐 GitHub: BUTSpeechFIT