pipeline_tag: automatic-speech-recognition
library_name: NeMo
---

# πŸŽ™οΈ Tawasul STT V0 (supports all Arabic dialects, with a focus on the Egyptian Arz dialect)

  | [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-ar-lightgrey#model-badge)](#datasets)|

  This model transcribes speech in the Arabic language with punctuation mark support.
It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters), trained on two losses: Transducer (default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model transcribes Arabic text without diacritical marks and supports periods, Arabic commas, and Arabic question marks.

This model is ready for commercial and non-commercial use. βœ…

## πŸ—οΈ Model Architecture

  FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with a hybrid Transducer (RNNT) decoder and a Connectionist Temporal Classification (CTC) loss.

The model uses a [Google SentencePiece](https://github.com/google/sentencepiece) [2] tokenizer with a vocabulary size of 1024.

### πŸ“₯ Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed (if your audio is not yet 16 kHz mono, see the conversion sketch below)
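
If your source audio is not already a 16 kHz mono .wav file, you can convert it first. A minimal sketch using `librosa` and `soundfile` (both installed in the steps below); the file names are placeholders:

```python
import librosa
import soundfile as sf

# Load any supported audio file, downmixing to mono and resampling to 16 kHz
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16 kHz mono .wav file the model can consume directly
sf.write("sample_audio_to_transcribe.wav", audio, sr)
```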
54
 
55
- ### Output
56
 
57
  This model provides transcribed speech as a string for a given audio sample.
58
  - **Output Type**: Text
@@ -60,33 +60,33 @@ This model provides transcribed speech as a string for a given audio sample.
60
  - **Output Parameters:** One Dimensional (1D)
61
  - **Other Properties Related to Output:** May Need Inverse Text Normalization; Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
62
 
## ⚠️ Limitations
The model is non-streaming and outputs speech as a string without diacritical marks.
It is not recommended for word-for-word transcription and punctuation, as accuracy varies with the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

## πŸš€ How to download and use the model
#### πŸ”§ Installations
```bash
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip install soundfile librosa sentencepiece Cython packaging
$ pip -q install nemo_toolkit['asr']
```

#### πŸ“₯ Download the model
```bash
$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
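
Alternatively, you can fetch the checkpoint from Python with the `huggingface_hub` client (typically available alongside the NeMo toolkit's dependencies); a minimal sketch:

```python
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the resolved file path
nemo_path = hf_hub_download(
    repo_id="TawasulAI/tawasul-egy-stt",
    filename="tawasul_egy_stt_wer0.3543.nemo",
)
print(nemo_path)
```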
#### 🐍 Imports and usage
```python
import nemo.collections.asr as nemo_asr

# Restore the model from the downloaded .nemo checkpoint
asr_model = nemo_asr.models.ASRModel.restore_from(
    "path/to/tawasul_egy_stt.nemo",
)
```

### 🎯 Transcribing using Python
Simply do:
```python
# transcribe() returns one result per input file
prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])[0]
print(prediction.text)
```
You can also pass more than one audio file for batch inference:
```python
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```
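
Since this is a hybrid Transducer/CTC model, you can also decode with the CTC branch instead of the default Transducer (RNNT) branch. A minimal sketch, assuming the checkpoint loads as a NeMo hybrid RNNT-CTC model exposing `change_decoding_strategy`:

```python
# Switch the hybrid model to its CTC decoder (Transducer is the default)
asr_model.change_decoding_strategy(decoder_type="ctc")
ctc_prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])[0]
print(ctc_prediction.text)

# Switch back to the Transducer (RNNT) decoder
asr_model.change_decoding_strategy(decoder_type="rnnt")
```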

## πŸ“Š Training and Testing Datasets
### πŸ‹οΈ Training Datasets
#### The model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
  - Data Collection Method: Automated

#### The model was then further fine-tuned on around 100 hours of private Egyptian-dialect Arabic speech
- The second-stage Egyptian training data is private; there is no intention to open-source it.

### πŸ§ͺ Test Benchmark datasets
| Test Set | Num Dialects | Test (h) |
|----------|--------------|----------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022) | 10 | 10.7 |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html) | Unspecified | 9.6 |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca) | 8 | 7.7 |

### πŸ“ˆ Test Benchmark results
- CommonVoice
  - WER:
  - CER:

## πŸ’» Software Integration

### πŸ”§ Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Turing
- NVIDIA Volta

### βš™οΈ Runtime Engine
- NeMo 2.0.0

### πŸ–₯️ Preferred Operating System
- Linux

## πŸ” Explainability

- High-Level Application and Domain: Automatic Speech Recognition
- Describe how this model works: The model transcribes audio input into text for the Arabic language
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor (see the WER sketch below)
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of input audio (domain, use case, accent, noise, speech type, context of speech, etc.).
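
For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words; CER is the same computed over characters. A minimal pure-Python sketch, illustrative only and not part of the model's tooling:

```python
def error_rate(reference: str, hypothesis: str, by_char: bool = False) -> float:
    """Word (or character) error rate via Levenshtein edit distance."""
    ref = list(reference) if by_char else reference.split()
    hyp = list(hypothesis) if by_char else hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution
        prev = curr
    return prev[len(hyp)] / max(len(ref), 1)

# Example: one substituted word out of four reference words -> WER = 0.25
print(error_rate("ΩƒΩŠΩ Ψ­Ψ§Ω„Ωƒ يا Ψ΅Ψ―ΩŠΩ‚ΩŠ", "ΩƒΩŠΩ Ψ­Ψ§Ω„Ω‡Ψ§ يا Ψ΅Ψ―ΩŠΩ‚ΩŠ"))
```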

## βš–οΈ Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian dialect (Arz).
- Have any special measures been taken to mitigate unwanted bias? No.

## πŸ”’ Safety & Security
### Use Case Restrictions:

- Non-streaming ASR model
- The model is noise-sensitive
- The model is further fine-tuned on the Egyptian dialect

## πŸ“„ License

License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

## πŸ“š References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)