nguyenvulebinh/MSA-ASR


nguyenvulebinh committed on Apr 10

Commit 2d8c1a6 · verified · 1 Parent(s): e2e0dbf

Update README.md

Files changed (1)

README.md +35 -10

README.md CHANGED

@@ -5,13 +5,17 @@ tags: []
 # MSA-ASR
 Multilingual Speaker-Attributed Automatic Speech Recognition
 ### Introduction
 This repository provides an implementation of a Speaker-Attributed Automatic Speech Recognition model. The model performs both multilingual speech recognition and speaker embedding extraction, enabling speaker differentiation.
 Model architecture
-![MSA-ASR Model](https://github.com/nguyenvulebinh/MSA-ASR/blob/679f7016c1b0610c5ae5f85fae2168096491b464/resource/model.png?raw=true)
 ### Setup
@@ -30,18 +34,39 @@ Test script:
 python infer.py
 ```
 ### Citation
 ```bibtex
-@misc{nguyen2025msaasrefficientmultilingualspeaker,
-      title={MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models},
-      author={Thai-Binh Nguyen and Alexander Waibel},
-      year={2025},
-      eprint={2411.18152},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2411.18152},
-}
 ```
 ### License

 # MSA-ASR
 Multilingual Speaker-Attributed Automatic Speech Recognition
+### Demo
+<video src="https://huggingface.co/nguyenvulebinh/MSA-ASR/resolve/main/demo_sa-asr.mp4" width="640" height="480" controls></video>
 ### Introduction
 This repository provides an implementation of a Speaker-Attributed Automatic Speech Recognition model. The model performs both multilingual speech recognition and speaker embedding extraction, enabling speaker differentiation.
 Model architecture
+![MSA-ASR Model](https://github.com/nguyenvulebinh/MSA-ASR/blob/main/resource/model.png?raw=true)
 ### Setup
 python infer.py
 ```
+### Training Dataset
+*From ASR to SA-ASR dataset:*
+- Segment ASR data into single-speaker turns.
+- Group turns that likely come from the same speaker, using cosine similarity between speaker embeddings.
+- Pick a few groups, and a few turns from each group.
+- Concatenate turns in random order.
+![MSA-ASR Dataset](https://github.com/nguyenvulebinh/MSA-ASR/blob/main/resource/sa_asr_data_pipeline.png?raw=true)
+*In total:*
+- 15.5M turns
+- 14k audio hours
+- English only
+The dataset is openly available as an [HF Dataset](https://huggingface.co/datasets/nguyenvulebinh/spk-attribute)
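The four steps in the "From ASR to SA-ASR dataset" list can be sketched roughly as follows. This is a minimal illustration, not the repository's actual pipeline: the `Turn` class, the greedy grouping strategy, the `0.8` similarity threshold, and the function names are all assumptions made for the example, and transcripts stand in for the audio segments that would really be concatenated.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """A single-speaker segment (hypothetical container for this sketch)."""
    text: str
    embedding: List[float]  # speaker embedding of the segment


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_turns(turns: List[Turn], threshold: float = 0.8) -> List[List[Turn]]:
    """Greedily put each turn into the first group whose first member is
    similar enough; otherwise start a new group (assumed strategy)."""
    groups: List[List[Turn]] = []
    for turn in turns:
        for group in groups:
            if cosine(turn.embedding, group[0].embedding) >= threshold:
                group.append(turn)
                break
        else:
            groups.append([turn])
    return groups


def make_sample(
    groups: List[List[Turn]],
    n_speakers: int = 2,
    turns_per_speaker: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, str]]:
    """Pick a few groups, a few turns from each, and shuffle the turns.
    Returns (speaker_label, text) pairs in the concatenation order."""
    rng = rng or random.Random()
    chosen = rng.sample(groups, k=min(n_speakers, len(groups)))
    sample: List[Tuple[int, str]] = []
    for speaker_label, group in enumerate(chosen):
        picked = rng.sample(group, k=min(turns_per_speaker, len(group)))
        sample.extend((speaker_label, turn.text) for turn in picked)
    rng.shuffle(sample)  # concatenate turns in random order
    return sample


# Toy usage with 2-dimensional embeddings (made-up values).
turns = [
    Turn("hello", [1.0, 0.0]),
    Turn("hi there", [0.9, 0.1]),
    Turn("good morning", [0.0, 1.0]),
    Turn("bye", [0.1, 0.9]),
]
groups = group_turns(turns)
sample = make_sample(groups, rng=random.Random(0))
```

Each element of `sample` pairs a pseudo-speaker label with a turn, which is the supervision a speaker-attributed model would need for the concatenated utterance.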
 ### Citation
 ```bibtex
+@INPROCEEDINGS{10889116,
+  author={Nguyen, Thai-Binh and Waibel, Alexander},
+  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  title={MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models},
+  year={2025},
+  volume={},
+  number={},
+  pages={1-5},
+  keywords={Training;Adaptation models;Limiting;Predictive models;Data models;Robustness;Multilingual;Data mining;Speech processing;Standards;speaker-attributed;asr;multilingual},
+  doi={10.1109/ICASSP49660.2025.10889116}}
 ```
 ### License