---
license: cc-by-4.0
language:
- ar
metrics:
- wer
- cer
tags:
- speech-recognition
- ASR
- Arabic
- Conformer
- Transducer
- CTC
- NeMo
- hf-asr-leaderboard
- speech
- audio
pipeline_tag: automatic-speech-recognition
library_name: nemo
---
# Tawasul STT V0 (supports all Arabic dialects, with extra focus on the Egyptian dialect, Arz)

<style>
img {
 display: inline-table;
 vertical-align: middle;
 margin: 0;
 padding: 0;
}
</style>
| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-ar-lightgrey#model-badge)](#datasets) |

This model transcribes speech in the Arabic language with punctuation support.
It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters), trained with two losses: Transducer (the default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model outputs Arabic text without diacritical marks and supports periods, Arabic commas, and Arabic question marks.

This model is ready for commercial and non-commercial use.

## License

Use of this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release of the model, you accept the terms and conditions of that license.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[4] [Open Universal Arabic ASR Leaderboard](https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard)

<!-- ## NVIDIA NeMo: Training

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
We recommend you install it after installing the latest PyTorch version.
```
pip install nemo_toolkit['all']
```
-->
## Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with a hybrid Transducer (RNNT) decoder and a Connectionist Temporal Classification (CTC) loss.
You can find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

The model uses a [Google SentencePiece](https://github.com/google/sentencepiece) [2] tokenizer with a vocabulary size of 1024.

### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Other Properties Related to Input:** 16000 Hz mono-channel audio; no pre-processing needed

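If your source audio is not already 16 kHz mono .wav, here is a minimal conversion sketch using `librosa` and `soundfile` (both assumed installed; the file names are placeholders):
```python
import librosa
import soundfile as sf

# Load any audio file, downmix to mono, and resample to the 16 kHz the model expects.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a .wav file that can be passed directly to asr_model.transcribe().
sf.write("input_audio_16k.wav", audio, sr)
```
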
### Output

This model provides transcribed speech as a string for a given audio sample.
- **Output Type:** Text
- **Output Format:** String
- **Output Parameters:** One-dimensional (1D)
- **Other Properties Related to Output:** May need inverse text normalization; does not handle special characters; outputs Arabic text without diacritical marks

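For converting spoken-form output (numbers and the like) to written form, NeMo's companion `nemo_text_processing` package provides inverse text normalization. A sketch, assuming the version you install ships Arabic (`ar`) grammars:
```python
# pip install nemo_text_processing
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

# Assumption: Arabic ITN grammars are available in your installed version.
itn = InverseNormalizer(lang="ar")
print(itn.inverse_normalize("خمسة وعشرون", verbose=False))  # spoken "twenty-five" -> written numeral
```
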
## Limitations

The model is non-streaming and outputs the transcript as a string without diacritical marks.
It is not recommended for word-for-word transcription or punctuation-critical use, as accuracy varies with the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

## How to download and use the model
#### Installation
```bash
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip -q install soundfile librosa sentencepiece Cython packaging
$ pip -q install nemo_toolkit['asr']
```

#### Download the model
```bash
$ curl -L -o tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
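Alternatively, you can fetch the checkpoint with the `huggingface_hub` client; the repo id and filename below are taken from the URL above:
```python
from huggingface_hub import hf_hub_download

# Downloads the checkpoint to the local Hugging Face cache and returns its path.
model_path = hf_hub_download(
    repo_id="TawasulAI/tawasul-egy-stt",
    filename="tawasul_egy_stt_wer0.3543.nemo",
)
```
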
#### Imports and usage
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from("path/to/tawasul_egy_stt.nemo")
```
### Transcribing using Python
Simply do:
```python
prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])[0]
print(prediction.text)
```
You can also pass more than one audio file for batch inference:
```python
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```
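Since this is a hybrid Transducer-CTC checkpoint, you can also decode with the auxiliary CTC head. A minimal sketch, assuming the standard NeMo hybrid-model API applies to this checkpoint:
```python
# Switch to the auxiliary CTC decoder (Transducer/RNNT is the default).
asr_model.change_decoding_strategy(decoder_type="ctc")
print(asr_model.transcribe(['sample_audio_to_transcribe.wav'])[0].text)

# Switch back to the default Transducer decoder.
asr_model.change_decoding_strategy(decoder_type="rnnt")
```
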

## Training and Testing Datasets
### Training Datasets
#### The model was first trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
  - Data Collection Method: Automated
  - Labeling Method: Automated
- [Mozilla Common Voice 17.0 Arabic](https://commonvoice.mozilla.org/en/datasets) [65h]
  - Data Collection Method: by Human
  - Labeling Method: by Human
- [Google Fleurs Arabic](https://huggingface.co/datasets/google/fleurs) [5h]
  - Data Collection Method: by Human
  - Labeling Method: by Human
#### The model was then fine-tuned on around 100 hours of private Egyptian-dialect (Arz) Arabic speech
- The second-stage Egyptian training data is private; there is no intention to open-source it.

### Test Benchmark datasets
| Test Set | Num Dialects | Test (h) |
|----------|--------------|----------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022) | 10 | 10.7 |
| [Common Voice 18.0](https://commonvoice.mozilla.org/en/datasets) | 25 | 12.6 |
| [MASC (Clean-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) | 7 | 10.5 |
| [MASC (Noisy-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) | 8 | 14.9 |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html) | Unspecified | 9.6 |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca) | 8 | 7.7 |

### Test Benchmark results
- CommonVoice
  - WER:
  - CER:
- MASC
  - Clean
    - WER:
    - CER:
  - Noisy
    - WER:
    - CER:
- MGB-2
  - WER:
  - CER:
- Casablanca
  - WER:
  - CER:
- SADA
  - WER:
  - CER:

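WER and CER on these benchmarks can be computed with NeMo's built-in metric helper once reference transcripts are available; a minimal sketch (file names and reference strings are placeholders):
```python
from nemo.collections.asr.metrics.wer import word_error_rate

refs = ["النص المرجعي الأول", "النص المرجعي الثاني"]  # ground-truth transcripts
hyps = [h.text for h in asr_model.transcribe(["sample_audio_1.wav", "sample_audio_2.wav"])]

wer = word_error_rate(hypotheses=hyps, references=refs)               # word-level
cer = word_error_rate(hypotheses=hyps, references=refs, use_cer=True) # character-level
print(f"WER={wer:.4f}, CER={cer:.4f}")
```
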
## Software Integration

### Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine
- NeMo 2.0.0

### Preferred Operating System
- Linux

## Explainability

- High-Level Application and Domain: Automatic Speech Recognition
- Describe how this model works: The model transcribes audio input into text for the Arabic language
- Verified to have met prescribed quality standards: Yes
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of input audio (domain, use case, accent, noise, speech type, context of speech, etc.).

## Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then fine-tuned on the Egyptian dialect (Arz).
- Have any special measures been taken to mitigate unwanted bias? No.

## Safety & Security
### Use Case Restrictions:

- Non-streaming ASR model
- Model outputs text in Arabic without diacritical marks
- Output text requires inverse text normalization
- The model is noise-sensitive
- The model is further fine-tuned on the Egyptian dialect