🧠 DiCoW_v3.2: BUT-FIT Model for MT-ASR

This repository hosts the DiCoW_v3.2 model developed by BUT Speech@FIT, tailored for multi-talker automatic speech recognition (MT-ASR).

🔧 Key Improvements over DiCoW v1

  • FDDT (Frame-Level Diarization-Dependent Transformation) applied before the positional embeddings (see the sketch after this list)
  • Less strict suppressive initialization to ease early training dynamics
  • Enhanced sequential decoding with fallback seeking
  • Frozen decoder during fine-tuning to retain language-modeling capabilities
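
The FDDT block can be pictured as a per-frame conditioning layer: one affine transform per STNO class (silence, target, non-target, overlap), blended by the diarization posteriors and applied to the encoder features before the positional embeddings are added. The sketch below is a minimal illustration under those assumptions; the names, shapes, and initialization values are ours, not the exact DiCoW_v3.2 implementation.

import torch
import torch.nn as nn

class FDDT(nn.Module):
    # Illustrative frame-level diarization-dependent transformation.
    def __init__(self, d_model: int, n_classes: int = 4):
        super().__init__()
        # One affine transform per STNO class, initialized to identity.
        # A "less strict suppressive initialization" would start the
        # non-target scales near (rather than at) zero.
        self.scale = nn.Parameter(torch.ones(n_classes, d_model))
        self.bias = nn.Parameter(torch.zeros(n_classes, d_model))

    def forward(self, x: torch.Tensor, stno: torch.Tensor) -> torch.Tensor:
        # x:    (B, T, d_model) encoder features, before positional embeddings
        # stno: (B, T, n_classes) per-frame diarization posteriors
        scale = stno @ self.scale  # blend per-class scales -> (B, T, d_model)
        bias = stno @ self.bias
        return scale * x + bias

Because the transformation happens before the positional embeddings are added, the diarization conditioning rescales only the acoustic content of each frame, not its positional information.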

🧪 Augmentations

  • Random STNO noise injection
  • Segment-wise random class flipping of STNO tokens (both sketched after this list)
  • SpecAugment
  • MUSAN noise mixing
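
The two STNO augmentations can be pictured as below: posterior noise injection followed by occasional hard relabeling of a random segment. The noise level, flip probability, and hard one-hot relabeling are assumptions made for this sketch, not the actual training configuration.

import torch

def augment_stno(stno: torch.Tensor, noise_std: float = 0.05,
                 flip_prob: float = 0.1) -> torch.Tensor:
    # stno: (T, n_classes) per-frame STNO posteriors
    n_frames, n_classes = stno.shape
    # 1) Random noise injection: perturb the posteriors, then renormalize.
    noisy = (stno + noise_std * torch.randn_like(stno)).clamp(min=0.0)
    noisy = noisy / noisy.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    # 2) Segment-wise class flipping: occasionally relabel a random
    #    contiguous segment with a random class.
    if torch.rand(()) < flip_prob:
        t0 = int(torch.randint(0, n_frames, ()))
        t1 = int(torch.randint(t0, n_frames, ()))
        one_hot = torch.zeros(n_classes)
        one_hot[int(torch.randint(0, n_classes, ()))] = 1.0
        noisy[t0:t1 + 1] = one_hot
    return noisy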

βš™οΈ Optimization & Inference Enhancements

  • Updated learning-rate schedule
  • Improved hallucination detection & mitigation during inference (an illustrative check is sketched below)
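
As an illustration of what such a check can look like, the snippet below flags decoded segments whose short n-grams repeat suspiciously often, a common signature of Whisper-style looping. This heuristic and its thresholds are assumptions for illustration, not necessarily the mechanism used in DiCoW_v3.2.

from collections import Counter

def looks_hallucinated(tokens, max_ngram=3, max_repeats=4):
    # Flag a decoded segment if any short n-gram repeats too often.
    for n in range(1, max_ngram + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams and Counter(ngrams).most_common(1)[0][1] > max_repeats:
            return True
    return False

A segment flagged this way can then be re-decoded with different decoding parameters or dropped, depending on the pipeline.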

🛠️ Model Usage

from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_2"
# trust_remote_code=True is required: the DiCoW model class lives in this
# repository rather than in the transformers library itself.
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
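
Since DiCoW builds on Whisper, audio preprocessing follows the usual Whisper recipe. The snippet below is a minimal sketch that assumes the checkpoint ships a Whisper-compatible processor; the diarization (STNO) conditioning inputs are model-specific and are handled by the inference pipeline linked below.

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)

# 1 s of silence as a stand-in for real 16 kHz mono audio
audio = torch.zeros(16000)
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
# inputs.input_features holds log-mel features for the Whisper-style encoder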

➡️ For detailed inference pipelines, see: DiCoW GitHub (Inference)


πŸ† Performance

See how DiCoW_v3.2 performs on our multi-talker ASR benchmark:


📦 Model Details

  • Parameters: 958M (Safetensors checkpoint)
  • Tensor type: F32

πŸ“ Citation

If you use this model, please cite the following works:

@article{POLOK2026101841,
    title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
    journal = {Computer Speech \& Language},
    volume = {95},
    pages = {101841},
    year = {2026},
    issn = {0885-2308},
    doi = {10.1016/j.csl.2025.101841},
    url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
    author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
    keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@inproceedings{10887683,
    author = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
    booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    title = {Target Speaker ASR with Whisper},
    year = {2025},
    pages = {1-5},
    keywords = {Transforms; Signal processing; Transformers; Acoustics; Speech processing; target-speaker ASR; diarization conditioning; multi-speaker ASR; Whisper},
    doi = {10.1109/ICASSP49660.2025.10887683},
}

@inproceedings{polok24_chime,
    title = {BUT/JHU System Description for CHiME-8 NOTSOFAR-1 Challenge},
    author = {Alexander Polok and Dominik Klement and Jiangyu Han and Šimon Sedláček and Bolaji Yusuf and Matthew Maciejewski and Matthew S. Wiesner and Lukáš Burget},
    year = {2024},
    booktitle = {8th International Workshop on Speech Processing in Everyday Environments (CHiME 2024)},
    pages = {18--22},
    doi = {10.21437/CHiME.2024-4},
}

@misc{polok2025mlcslmchallenge,
    title = {BUT System for the MLC-SLM Challenge},
    author = {Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
    year = {2025},
    eprint = {2506.13414},
    archivePrefix = {arXiv},
    primaryClass = {eess.AS},
    url = {https://arxiv.org/abs/2506.13414},
}

📬 Contact

For questions or collaboration inquiries:

📧 Email: [email protected]

🏢 Affiliation: BUT Speech@FIT, Brno University of Technology

🔗 GitHub: BUTSpeechFIT
