pipeline_tag: automatic-speech-recognition
library_name: NeMo
---

# Tawasul STT V0 (Supports all Arabic Dialects with more focus on the Egyptian Arz dialect)

<style>
img {
  display: inline-table;
  vertical-align: small;
  margin: 0;
  padding: 0;
}
</style>

This model transcribes speech in the Arabic language with punctuation mark support.
It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters) and is trained on two losses: Transducer (default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model transcribes text in Arabic without diacritical marks and supports periods, Arabic commas, and Arabic question marks.

This model is ready for commercial and non-commercial use.

## Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with a hybrid Transducer (RNNT) decoder and Connectionist Temporal Classification (CTC) loss.
You may find more information on the details of FastConformer here: [Fast-Conformer](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer).

The model utilizes a [Google SentencePiece](https://github.com/google/sentencepiece) [2] tokenizer with a vocabulary size of 1024.

### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Other Properties Related to Input:** 16000 Hz Mono-channel Audio; Pre-Processing Not Needed (for other formats or sample rates, see the conversion sketch below)
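
The model expects 16 kHz mono WAV input, so audio in other formats or sample rates should be converted first. A minimal sketch using librosa and soundfile (both installed in the Installations step below); the file names are placeholders:

```
import librosa
import soundfile as sf

# Load any common audio format, resampling to 16 kHz and downmixing to mono.
# "input_audio.mp3" and "sample_16k.wav" are placeholder paths.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("sample_16k.wav", audio, sr)
```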

### Output

This model provides transcribed speech as a string for a given audio sample.
- **Output Type:** Text
- **Output Parameters:** One-Dimensional (1D)
- **Other Properties Related to Output:** May Need Inverse Text Normalization (see the sketch below); Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
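
Since the raw output may need inverse text normalization (e.g., turning spelled-out numbers into digits), one option is NeMo's `nemo_text_processing` package. A hedged sketch, assuming your installed version ships an Arabic (`lang='ar'`) ITN grammar:

```
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

# Assumption: Arabic grammars are available in your nemo_text_processing build.
itn = InverseNormalizer(lang='ar')

raw = "transcript returned by the model"  # placeholder transcript
normalized = itn.inverse_normalize(raw, verbose=False)
```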

## Limitations
The model is non-streaming and outputs speech as a string without diacritical marks.
It is not recommended for word-for-word transcription and punctuation, as accuracy varies based on the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

## How to download and use the model

#### Installations
```
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip -q install soundfile librosa sentencepiece Cython packaging
$ pip -q install nemo_toolkit['asr']
```

#### Download the model
```
$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
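
Alternatively, you can fetch the same file from Python with `huggingface_hub` (a sketch; assumes the `huggingface_hub` package is installed, which `nemo_toolkit` typically pulls in):

```
from huggingface_hub import hf_hub_download

# Downloads to the local Hugging Face cache and returns the resolved path.
model_path = hf_hub_download(
    repo_id="TawasulAI/tawasul-egy-stt",
    filename="tawasul_egy_stt_wer0.3543.nemo",
)
```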

#### Imports and usage
```
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(
    "path/to/tawasul_egy_stt.nemo",
)
```
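
NeMo ASR models are ordinary PyTorch modules, so standard device handling applies; you can optionally move the model to a GPU before transcribing:

```
import torch

# Use a GPU when available; .eval() disables dropout and other training-only behavior.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
asr_model = asr_model.to(device)
asr_model.eval()
```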

### Transcribing using Python
Simply do:
```
prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
print(prediction[0].text)
```
You can also pass more than one audio file for batch inference:
```
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```
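
Because this is a hybrid Transducer-CTC checkpoint, you can also decode with the CTC branch instead of the default Transducer one. A sketch assuming the checkpoint exposes NeMo's standard `change_decoding_strategy` API for hybrid models:

```
# Switch the hybrid model to its CTC decoder (the default is the Transducer/RNNT branch).
asr_model.change_decoding_strategy(decoder_type='ctc')
ctc_prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
print(ctc_prediction[0].text)

# Restore the default Transducer decoder.
asr_model.change_decoding_strategy(decoder_type='rnnt')
```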

## Training and Testing Datasets

### Training Datasets
#### The model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
  - Data Collection Method: Automated

#### The model was then further finetuned on around 100 hours of private Egyptian-dialect Arabic speech
- The second-stage Egyptian training data is private; there is no intention to open-source it

### Test Benchmark datasets
| Test Set | Num Dialects | Test (h) |
|----------|--------------|----------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022) | 10 | 10.7 |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html) | Unspecified | 9.6 |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca) | 8 | 7.7 |

### Test Benchmark results
- CommonVoice
  - WER:
  - CER:
  - WER:
  - CER:
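
To reproduce WER/CER numbers on your own labeled audio, NeMo bundles a `word_error_rate` helper; a sketch with placeholder reference transcripts, assuming this import path in your NeMo version:

```
from nemo.collections.asr.metrics.wer import word_error_rate

hypotheses = [h.text for h in asr_model.transcribe(['sample_audio_1.wav'])]
references = ["reference transcript for sample_audio_1"]  # placeholder ground truth

print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
print("CER:", word_error_rate(hypotheses=hypotheses, references=references, use_cer=True))
```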

## Software Integration

### Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine
- NeMo 2.0.0

### Preferred Operating System
- Linux

## Explainability

- High-Level Application and Domain: Automatic Speech Recognition
- Describe how this model works: The model transcribes audio input into text for the Arabic language
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of input audio (domain, use case, accent, noise, speech type, context of speech, etcetera).

## Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian dialect (Arz)
- Have any special measures been taken to mitigate unwanted bias? No

## Safety & Security
### Use Case Restrictions:

- Non-streaming ASR model
- The model is noise-sensitive
- The model is further finetuned for the Egyptian dialect

## License

License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release version of the model, you accept its terms and conditions.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)