---
license: cc-by-4.0
language:
- ar
metrics:
- WER
- CER
tags:
- speech-recognition
- ASR
- Arabic
- Conformer
- Transducer
- CTC
- NeMo
- hf-asr-leaderboard
- speech
- audio
pipeline_tag: automatic-speech-recognition
library_name: NeMo
---
# πŸŽ™οΈ Tawasul STT V0 (Supports all Arabic Dialects with more focus on Egyptian Arz dialect)
<style>
img {
  display: inline-table;
  vertical-align: middle;
  margin: 0;
  padding: 0;
}
</style>
| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture) 
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-ar-lightgrey#model-badge)](#datasets)|

This model transcribes Arabic speech with punctuation support.
It is a "large" FastConformer Transducer-CTC model (around 115M parameters) trained with two losses: Transducer (default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model transcribes Arabic text without diacritical marks and supports periods, Arabic commas, and Arabic question marks.

This model is ready for commercial and non-commercial use. βœ…

## πŸ—οΈ Model Architecture
FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with hybrid Transducer decoder (RNNT) and Connectionist Temporal Classification (CTC) loss.
You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

The model uses a [Google SentencePiece](https://github.com/google/sentencepiece) [2] tokenizer with a vocabulary size of 1024.

### πŸ“₯ Input
  - **Input Type:** Audio
  - **Input Format(s):** .wav files
  - **Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed
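
The 16000 Hz mono requirement above can be checked programmatically before inference. Below is a minimal sketch using only the Python standard library; `check_asr_input` is a hypothetical helper name, not part of NeMo:

```python
import wave

def check_asr_input(path: str) -> None:
    """Verify a .wav file matches the model's expected input
    format (16000 Hz, mono). Raises ValueError otherwise."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wf.getframerate()} Hz")
        if wf.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wf.getnchannels()} channels")
```

Audio in other formats or sample rates can be converted first, e.g. `ffmpeg -i input.mp3 -ac 1 -ar 16000 output.wav`.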

### πŸ“€ Output
This model provides transcribed speech as a string for a given audio sample.
  - **Output Type**: Text 
  - **Output Format:** String
  - **Output Parameters:** One Dimensional (1D)
  - **Other Properties Related to Output:** May Need Inverse Text Normalization; Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
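
Because the output contains no diacritical marks, it is common to strip diacritics (tashkeel) from reference transcripts before scoring against the model's output. A minimal sketch using only the standard library; `strip_diacritics` is a hypothetical helper, not part of NeMo:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop Arabic diacritical marks (harakat), which are Unicode
    combining characters of category 'Mn', keeping base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```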

## ⚠️ Limitations
- The model is non-streaming and outputs the transcript as a string without diacritical marks.
- It is not recommended for verbatim word-for-word transcription or precise punctuation, since accuracy varies with the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

## πŸš€ How to download and use the model
#### πŸ”§ Installations
```
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip install soundfile librosa sentencepiece Cython packaging
$ pip install nemo_toolkit['asr']
```

#### πŸ“₯ Download the model
```
$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
#### 🐍 Imports and usage
```
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(
    "path/to/tawasul_egy_stt.nemo",
)
```
### 🎯 Transcribing using Python
Simply do:
```
prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
print(prediction[0].text)
```
You can also pass multiple audio files for batch inference:
```
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```

## πŸ“Š Training and Testing Datasets
### πŸ‹οΈ Training Datasets
#### The base model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
    - Data Collection Method: Automated
    - Labeling Method: Automated
- [Mozilla Common Voice 17.0 Arabic](https://commonvoice.mozilla.org/en/datasets) [65h]
    - Data Collection Method: by Human
    - Labeling Method: by Human
- [Google Fleurs Arabic](https://huggingface.co/datasets/google/fleurs) [5h]
    - Data Collection Method: by Human
    - Labeling Method: by Human
#### The model was then fine-tuned on around 100 hours of private Egyptian-dialect Arabic speech
- The second-stage Egyptian training data is private; there is no intention to open-source it.

### πŸ§ͺ Test Benchmark datasets
| Test Set                                                                                        | Num Dialects   | Test (h)    |
|-------------------------------------------------------------------------------------------------|----------------|-------------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022)                                      | 10             | 10.7        |
| [Common Voice 18.0](https://commonvoice.mozilla.org/en/datasets)                                | 25             | 12.6        |
| [MASC (Clean-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus)    | 7              | 10.5        |
| [MASC (Noisy-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus)    | 8              | 14.9        |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html)                                                | Unspecified    | 9.6         |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca)                                | 8              | 7.7         |

### πŸ“ˆ Test Benchmark results
- CommonVoice
    - WER:
    - CER:
- MASC
  - Clean
    - WER:
    - CER:
  - Noisy
    - WER:
    - CER:
- MGB-2
    - WER:
    - CER:
- Casablanca
    - WER:
    - CER:
- SADA
    - WER:
    - CER:
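
The WER and CER figures above are word- and character-level edit distances divided by the reference length. A minimal self-contained sketch of both metrics (`edit_distance`, `wer`, and `cer` are illustrative helper names, not part of NeMo):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (Wagner-Fischer DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```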

### πŸ”§ Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta

### βš™οΈ Runtime Engine
- NeMo 2.0.0
  
### πŸ–₯️ Preferred Operating System
- Linux

## πŸ” Explainability
- High-Level Application and Domain: Automatic Speech Recognition
- How this model works: The model transcribes audio input into text for the Arabic language
- Verified to have met prescribed quality standards: Yes
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of input audio (Domain, Use Case, Accent, Noise, Speech Type, Context of speech, etcetera).

## βš–οΈ Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian dialect (arz)
- Have any special measures been taken to mitigate unwanted bias? No

## πŸ”’ Safety & Security
### Use Case Restrictions:
- Non-streaming ASR model
- Model outputs text in Arabic without diacritical marks
- Output text requires Inverse Text Normalization
- The model is noise-sensitive
- The model is further fine-tuned on the Egyptian dialect

## πŸ“„ License
Use of this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release version of the model, you accept its terms and conditions.

## πŸ“š References
[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google Sentencepiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[4] [Open Universal Arabic ASR Leaderboard](https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard)