---
license: cc-by-4.0
language:
- ar
metrics:
- WER
- CER
tags:
- speech-recognition
- ASR
- Arabic
- Conformer
- Transducer
- CTC
- NeMo
- hf-asr-leaderboard
- speech
- audio
pipeline_tag: automatic-speech-recognition
library_name: NeMo
---
# Tawasul STT V0 (supports all Arabic dialects, with a focus on the Egyptian Arz dialect)
This model transcribes speech in the Arabic language with punctuation mark support.
It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters), trained with two losses: Transducer (default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model transcribes text in Arabic without diacritical marks and supports periods, Arabic commas and Arabic question marks.
This model is ready for commercial and non-commercial use.
## Model Architecture
FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with hybrid Transducer decoder (RNNT) and Connectionist Temporal Classification (CTC) loss.
You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
The model uses a [Google SentencePiece tokenizer](https://github.com/google/sentencepiece) [2] with a vocabulary size of 1024.
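Because the checkpoint is a hybrid model, the decoding branch can be chosen at inference time once the model has been downloaded and loaded (see the usage section below). A minimal sketch, assuming the checkpoint loads as a NeMo hybrid RNNT-CTC model and that `change_decoding_strategy` is available in the installed NeMo version:
```
import nemo.collections.asr as nemo_asr

# Load the hybrid checkpoint (path is a placeholder).
asr_model = nemo_asr.models.ASRModel.restore_from("path/to/tawasul_egy_stt.nemo")

# Decoding defaults to the Transducer (RNNT) branch.
# Switch to the CTC branch, which is typically faster to decode:
asr_model.change_decoding_strategy(decoder_type="ctc")

# Switch back to the Transducer branch:
asr_model.change_decoding_strategy(decoder_type="rnnt")
```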
### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed
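If your source audio is not already 16 kHz mono WAV, a quick conversion with `librosa` and `soundfile` (installed below) could look like the following sketch; the file names are placeholders:
```
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write a WAV the model can consume.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("sample_audio_16k.wav", audio, 16000)
```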
### Output
This model provides transcribed speech as a string for a given audio sample.
- **Output Type**: Text
- **Output Format:** String
- **Output Parameters:** One Dimensional (1D)
- **Other Properties Related to Output:** May Need Inverse Text Normalization; Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
## Limitations
- The model is non-streaming and outputs the speech as a string without diacritical marks.
- Not recommended for word-for-word transcription or punctuation-critical use, as accuracy varies with the characteristics of the input audio (unrecognized words, accents, noise, speech type, and context of speech).
## How to download and use the model
#### Installation
```
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip -q install soundfile librosa sentencepiece Cython packaging
$ pip -q install nemo_toolkit['asr']
```
#### Download the model
```
$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
#### Imports and usage
```
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.restore_from(
"path/to/tawasul_egy_stt.nemo",
)
```
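If a GPU is available, moving the model onto it speeds up transcription; this is plain PyTorch device handling, not a model-specific API:
```
import torch

# Use a GPU when available and switch to inference mode.
device = "cuda" if torch.cuda.is_available() else "cpu"
asr_model = asr_model.to(device)
asr_model.eval()
```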
### Transcribing using Python
Simply do:
```
predictions = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
print(predictions[0].text)
```
You can also pass more than one audio file for batch inference:
```
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```
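For larger workloads, `transcribe` also accepts a `batch_size` argument, and each returned hypothesis exposes a `.text` field. A sketch (verify the exact return type against your installed NeMo version):
```
# Batch transcription; batch_size and the .text attribute follow NeMo's
# transcribe API but should be checked against the installed version.
files = ['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav']
predictions = asr_model.transcribe(files, batch_size=4)

for path, hyp in zip(files, predictions):
    print(path, '->', hyp.text)
```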
## Training and Testing Datasets
### Training Datasets
The base model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
- Data Collection Method: Automated
- Labeling Method: Automated
- [Mozilla Common Voice 17.0 Arabic](https://commonvoice.mozilla.org/en/datasets) [65h]
- Data Collection Method: by Human
- Labeling Method: by Human
- [Google Fleurs Arabic](https://huggingface.co/datasets/google/fleurs) [5h]
- Data Collection Method: by Human
- Labeling Method: by Human
The model was then further fine-tuned on around 100 hours of private Egyptian-dialect (Arz) Arabic speech:
- The second-stage Egyptian training data is private; there is no intention to open-source it.
### Test Benchmark datasets
| Test Set | Num Dialects | Test (h) |
|-------------------------------------------------------------------------------------------------|----------------|-------------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022) | 10 | 10.7 |
| [Common Voice 18.0](https://commonvoice.mozilla.org/en/datasets) | 25 | 12.6 |
| [MASC (Clean-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) | 7 | 10.5 |
| [MASC (Noisy-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) | 8 | 14.9 |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html) | Unspecified | 9.6 |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca) | 8 | 7.7 |
### Test Benchmark results
- CommonVoice
- WER:
- CER:
- MASC
- Clean
- WER:
- CER:
- Noisy
- WER:
- CER:
- MGB-2
- WER:
- CER:
- Casablanca
- WER:
- CER:
- SADA
- WER:
- CER:
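The benchmark results above are tracked as WER and CER. To score the model against your own reference transcripts, one option is the `jiwer` package (`pip install jiwer`); the file paths and reference strings below are placeholders, not part of this repository:
```
from jiwer import wer, cer

# Placeholder benchmark audio and ground-truth transcripts.
audio_files = ['test_001.wav', 'test_002.wav']
references = ['reference transcript 1', 'reference transcript 2']

# Transcribe and pull out the text of each hypothesis.
hypotheses = [h.text for h in asr_model.transcribe(audio_files, batch_size=4)]

print('WER:', wer(references, hypotheses))
print('CER:', cer(references, hypotheses))
```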
### Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
### Runtime Engine
- NeMo 2.0.0
### Preferred Operating System
- Linux
## Explainability
- High-Level Application and Domain: Automatic Speech Recognition
- How this model works: The model transcribes audio input into text for the Arabic language
- Verified to have met prescribed quality standards: Yes
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of input audio (Domain, Use Case, Accent, Noise, Speech Type, Context of speech, etcetera).
## Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian (Arz) dialect
- Have any special measures been taken to mitigate unwanted bias? No
## Safety & Security
### Use Case Restrictions:
- Non-streaming ASR model
- Model outputs text in Arabic without diacritical marks
- Output text requires Inverse Text Normalization
- The model is noise-sensitive
- The model is further fine-tuned on the Egyptian dialect
## License
License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
## References
[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)
[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
[4] [Open Universal Arabic ASR Leaderboard](https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard)