MSA-ASR
Multilingual Speaker-Attributed Automatic Speech Recognition
Demo
Introduction
This repository provides an implementation of a speaker-attributed automatic speech recognition (SA-ASR) model. The model performs multilingual speech recognition and extracts speaker embeddings, which makes it possible to tell speakers apart in the transcript.
Model architecture
Setup
git clone git@github.com:nguyenvulebinh/MSA-ASR.git
cd MSA-ASR
conda create -n MSA-ASR python=3.10
conda activate MSA-ASR
pip install -r requirements.txt
Test script:
python infer.py
Training Dataset
From ASR to SA-ASR dataset:
- Segment ASR data into single-speaker turns.
- Group turns that likely come from the same speaker, using cosine similarity between speaker embeddings.
- Pick a few groups, and a few turns from each group.
- Concatenate turns in random order.
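The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the greedy grouping rule, the `threshold` value, the helper names (`group_turns`, `build_sample`), and the toy two-dimensional embeddings are all hypothetical, and text strings stand in for the audio turns that would actually be concatenated.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_turns(turns, threshold=0.7):
    """Greedily put each turn into the first group whose first member
    has a similar enough embedding (threshold is an assumed value)."""
    groups = []
    for turn in turns:
        for group in groups:
            if cosine_similarity(turn["embedding"], group[0]["embedding"]) >= threshold:
                group.append(turn)
                break
        else:
            # No existing group matched, so this turn starts a new one.
            groups.append([turn])
    return groups


def build_sample(groups, n_groups=2, turns_per_group=2, seed=0):
    """Pick a few groups, a few turns from each, and shuffle the result."""
    rng = random.Random(seed)
    chosen = rng.sample(groups, min(n_groups, len(groups)))
    picked = []
    for group_index, group in enumerate(chosen):
        for turn in rng.sample(group, min(turns_per_group, len(group))):
            picked.append({"group": group_index, "text": turn["text"]})
    rng.shuffle(picked)  # random order before concatenation
    return picked


# Toy single-speaker turns; real embeddings would come from a speaker model.
turns = [
    {"text": "hello there", "embedding": [1.0, 0.0]},
    {"text": "how are you", "embedding": [0.9, 0.1]},
    {"text": "fine thanks", "embedding": [0.0, 1.0]},
    {"text": "see you soon", "embedding": [0.1, 0.9]},
]
groups = group_turns(turns)
sample = build_sample(groups)
for item in sample:
    print(item["group"], item["text"])
```

In a real pipeline, the grouping would more likely rely on a proper clustering method over embeddings from a trained speaker model, and the selected turns would be joined at the waveform level.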
In total:
- 15.5M turns
- 14k audio hours
- English only
The dataset is openly available on Hugging Face Datasets.
Example
Audio
Label:
spk_1 A 0.00 1.58 »spk_1
spk_1 A 0.00 1.58 Pacifica
spk_1 A 1.58 0.68 continues
spk_1 A 2.27 0.52 today
spk_1 A 2.79 0.24 to
spk_1 A 3.03 0.20 be
spk_1 A 3.23 0.14 a
spk_1 A 3.37 0.54 listener
spk_1 A 3.91 0.80 supported
spk_1 A 4.71 0.70 network
spk_1 A 5.42 0.38 of
spk_2 A 5.80 0.12 »spk_2
spk_2 A 5.80 0.12 At
spk_2 A 5.92 0.42 home,
spk_2 A 6.34 0.18 an
spk_2 A 6.52 0.38 Aed
spk_2 A 6.90 0.26 is
spk_2 A 7.16 0.18 an
spk_2 A 7.34 0.56 automated
spk_2 A 7.90 0.60 external
spk_2 A 8.50 0.90 defibrillator.
spk_2 A 9.40 0.40 It's
spk_2 A 9.81 0.08 the
spk_2 A 9.89 0.36 device
spk_2 A 10.25 0.08 you
spk_2 A 10.33 0.16 use
spk_2 A 10.49 0.12 when
spk_2 A 10.61 0.10 your
spk_2 A 10.73 0.16 heart
spk_2 A 10.89 0.18 goes
spk_2 A 11.07 0.12 into
spk_2 A 11.19 0.38 cardiac
spk_2 A 11.57 0.38 arrest
spk_2 A 11.95 0.18 to
spk_2 A 12.13 0.36 shock
spk_2 A 12.49 0.14 it
spk_2 A 12.63 0.28 back
spk_2 A 12.91 0.22 into
spk_2 A 13.13 0.06 a
spk_2 A 13.19 0.32 normal
spk_2 A 13.51 0.88 rhythm.
spk_1 A 14.40 1.38 »spk_1
spk_1 A 14.40 1.38 stations.
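For readers who want to work with labels like the one above, here is a small parsing sketch. It assumes, based only on the example, that each line holds five whitespace-separated fields (speaker, channel, start time in seconds, duration in seconds, token) and that a token starting with `»` marks the beginning of a new speaker turn. The names `Token`, `parse_label`, and `speaker_turns` are hypothetical and not part of this repository.

```python
from dataclasses import dataclass


@dataclass
class Token:
    speaker: str
    channel: str
    start: float      # assumed to be seconds
    duration: float   # assumed to be seconds
    text: str


def parse_label(lines):
    """Parse label lines into Token objects, skipping malformed lines."""
    tokens = []
    for line in lines:
        parts = line.split(maxsplit=4)
        if len(parts) != 5:
            continue
        speaker, channel, start, duration, text = parts
        tokens.append(Token(speaker, channel, float(start), float(duration), text))
    return tokens


def speaker_turns(tokens, marker="»"):
    """Collect words into turns, starting a new turn at each marker token."""
    turns = []
    for token in tokens:
        if token.text.startswith(marker):
            turns.append((token.speaker, []))
            continue
        if not turns:
            # Words seen before any marker still need a turn to live in.
            turns.append((token.speaker, []))
        turns[-1][1].append(token.text)
    return [(speaker, " ".join(words)) for speaker, words in turns]


# A few lines taken from the example label above.
label_lines = [
    "spk_1 A 5.42 0.38 of",
    "spk_2 A 5.80 0.12 »spk_2",
    "spk_2 A 5.80 0.12 At",
    "spk_2 A 5.92 0.42 home,",
    "spk_1 A 14.40 1.38 »spk_1",
    "spk_1 A 14.40 1.38 stations.",
]
tokens = parse_label(label_lines)
turns = speaker_turns(tokens)
for speaker, text in turns:
    print(speaker, text)
```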
Citation
@INPROCEEDINGS{10889116,
author={Nguyen, Thai-Binh and Waibel, Alexander},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Training;Adaptation models;Limiting;Predictive models;Data models;Robustness;Multilingual;Data mining;Speech processing;Standards;speaker-attributed;asr;multilingual},
doi={10.1109/ICASSP49660.2025.10889116}}
@INPROCEEDINGS{10446589,
author={Nguyen, Thai-Binh and Waibel, Alexander},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Synthetic Conversations Improve Multi-Talker ASR},
year={2024},
volume={},
number={},
pages={10461-10465},
keywords={Systematics;Error analysis;Knowledge based systems;Oral communication;Signal processing;Data models;Acoustics;multi-talker;asr;synthetic conversation},
doi={10.1109/ICASSP48485.2024.10446589}}
License
CC-BY-NC 4.0
Contact
Contributions are welcome; feel free to create a PR or email me:
[Binh Nguyen](nguyenvulebinh[at]gmail.com)