Update README.md
---
license: cc-by-4.0
language:
- ar
metrics:
- WER
- CER
tags:
- speech-recognition
- ASR
- Arabic
- Conformer
- Transducer
- CTC
- NeMo
- hf-asr-leaderboard
- speech
- audio
pipeline_tag: automatic-speech-recognition
library_name: NeMo
---
# Tawasul STT V0 (supports all Arabic dialects, with a stronger focus on the Egyptian Arz dialect)

<style>
img {
  display: inline-table;
  vertical-align: middle;
  margin: 0;
  padding: 0;
}
</style>
| [](#model-architecture)
|
32 |
+
| [](#model-architecture)
|
33 |
+
| [](#datasets)|
|
34 |
+
|
35 |
+
This model transcribes speech in the Arabic language with punctuation support.
It is a "large" FastConformer Transducer-CTC model (around 115M parameters) trained on two losses: Transducer (default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model transcribes Arabic text without diacritical marks and supports periods, Arabic commas, and Arabic question marks.

This model is ready for commercial and non-commercial use.

## License

Use of this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release of the model, you accept its terms and conditions.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)

[3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[4] [Open Universal Arabic ASR Leaderboard](https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard)

<!-- ## NVIDIA NeMo: Training

To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).
We recommend you install it after you've installed the latest PyTorch version.
```
pip install nemo_toolkit['all']
```
-->
## Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with a hybrid Transducer (RNNT) decoder and Connectionist Temporal Classification (CTC) loss.
You may find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

The model uses a [Google SentencePiece](https://github.com/google/sentencepiece) [2] tokenizer with a vocabulary size of 1024.

### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Other Properties Related to Input:** 16000 Hz mono-channel audio; no pre-processing needed

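If your audio is in another container or sample rate, here is a minimal conversion sketch using the librosa and soundfile packages from the installation step below (file names are illustrative placeholders):

```python
import librosa
import soundfile as sf

# Load any supported audio file, downmixing to mono and resampling to 16 kHz
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)

# Write a 16 kHz mono-channel .wav that the model can consume directly
sf.write("sample_audio_to_transcribe.wav", audio, sr)
```
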
### Output

This model provides transcribed speech as a string for a given audio sample.
- **Output Type:** Text
- **Output Format:** String
- **Output Parameters:** One-dimensional (1D)
- **Other Properties Related to Output:** May need inverse text normalization; does not handle special characters; outputs Arabic text without diacritical marks

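Since the output may need inverse text normalization (ITN), here is a hedged sketch using NeMo's nemo_text_processing package; the availability of an Arabic (`lang="ar"`) grammar in your installed version is an assumption, so verify it before relying on it (any ITN tool can be substituted):

```python
# Hedged sketch: requires `pip install nemo_text_processing` and assumes the
# installed version ships an Arabic ('ar') inverse-text-normalization grammar.
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

inverse_normalizer = InverseNormalizer(lang="ar")

# Convert spoken-form tokens (e.g., spelled-out numbers) to written form
written = inverse_normalizer.inverse_normalize("some transcribed text", verbose=False)
print(written)
```
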
## Limitations

The model is non-streaming and outputs transcribed speech as a string without diacritical marks.
It is not recommended for word-for-word transcription or punctuation, as accuracy varies with the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

## How to download and use the model

#### Installations
```
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip install -q soundfile librosa sentencepiece Cython packaging
$ pip install -q nemo_toolkit['asr']
```

#### Download the model
```
$ curl -L -o tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
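Alternatively, here is a sketch of fetching the same checkpoint with the huggingface_hub Python client (assumes `pip install huggingface_hub`):

```python
from huggingface_hub import hf_hub_download

# Download the checkpoint into the local Hugging Face cache and return its path
nemo_path = hf_hub_download(
    repo_id="TawasulAI/tawasul-egy-stt",
    filename="tawasul_egy_stt_wer0.3543.nemo",
)
```
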
#### Imports and usage
```python
import nemo.collections.asr as nemo_asr

# Restore the model from the downloaded .nemo checkpoint
asr_model = nemo_asr.models.ASRModel.restore_from("path/to/tawasul_egy_stt.nemo")
```
### Transcribing using Python
Simply do:
```python
prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])[0]
print(prediction.text)
```
You can also pass more than one audio file for batch inference:
```python
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```

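Because this is a hybrid Transducer-CTC checkpoint, NeMo's hybrid ASR models can switch which decoder runs at inference time; a minimal sketch, assuming the checkpoint restores as a hybrid RNNT-CTC model class:

```python
# Transducer (RNNT) is the default decoder; switch to the CTC head,
# transcribe, then switch back. change_decoding_strategy(decoder_type=...)
# is the NeMo hybrid-model API.
asr_model.change_decoding_strategy(decoder_type="ctc")
ctc_prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])[0]

asr_model.change_decoding_strategy(decoder_type="rnnt")
```
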
## Training and Testing Datasets

### Training Datasets
#### The model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
  - Data Collection Method: Automated
  - Labeling Method: Automated
- [Mozilla Common Voice 17.0 Arabic](https://commonvoice.mozilla.org/en/datasets) [65h]
  - Data Collection Method: by Human
  - Labeling Method: by Human
- [Google Fleurs Arabic](https://huggingface.co/datasets/google/fleurs) [5h]
  - Data Collection Method: by Human
  - Labeling Method: by Human

#### The model was then further fine-tuned on around 100 hours of private Egyptian-dialect Arabic speech
- The second-stage Egyptian training data is private; there is no intention to open-source it.

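For reference, NeMo ASR training and fine-tuning jobs typically consume JSON-lines manifests; a hedged sketch of building one (paths, durations, and text are illustrative placeholders, not the actual training data):

```python
import json

# Each line of a NeMo manifest is a JSON object with the audio path,
# its duration in seconds, and the reference transcript.
samples = [
    {"audio_filepath": "clips/utt_0001.wav", "duration": 3.2, "text": "مرحبا بكم"},
    {"audio_filepath": "clips/utt_0002.wav", "duration": 5.7, "text": "صباح الخير"},
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```
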
### Test Benchmark Datasets

| Test Set                                                                                      | Num. Dialects | Test (h) |
|-----------------------------------------------------------------------------------------------|---------------|----------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022)                                      | 10            | 10.7     |
| [Common Voice 18.0](https://commonvoice.mozilla.org/en/datasets)                                | 25            | 12.6     |
| [MASC (Clean-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus)    | 7             | 10.5     |
| [MASC (Noisy-Test)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus)    | 8             | 14.9     |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html)                                                | Unspecified   | 9.6      |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca)                                | 8             | 7.7      |

### Test Benchmark Results
- CommonVoice
  - WER:
  - CER:
- MASC
  - Clean
    - WER:
    - CER:
  - Noisy
    - WER:
    - CER:
- MGB-2
  - WER:
  - CER:
- Casablanca
  - WER:
  - CER:
- SADA
  - WER:
  - CER:

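WER and CER on such benchmarks can be reproduced with any standard scorer; a minimal sketch using the jiwer package (an assumption, not part of this repo; `pip install jiwer`):

```python
import jiwer

# Reference transcripts and model predictions, aligned by index
references = ["النص المرجعي الأول", "النص المرجعي الثاني"]
hypotheses = [p.text for p in asr_model.transcribe(["audio_1.wav", "audio_2.wav"])]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```
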
## Software Integration

### Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine
- NeMo 2.0.0

### Preferred Operating System
- Linux

## Explainability

- High-Level Application and Domain: Automatic Speech Recognition
- How this model works: The model transcribes audio input into text for the Arabic language
- Verified to have met prescribed quality standards: Yes
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of the input audio (domain, use case, accent, noise, speech type, context of speech, etc.).

## Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian (Arz) dialect.
- Have any special measures been taken to mitigate unwanted bias? No.

## Safety & Security

### Use Case Restrictions:

- Non-streaming ASR model
- Model outputs text in Arabic without diacritical marks
- Output text requires inverse text normalization
- The model is noise-sensitive
- The model was further fine-tuned on the Egyptian dialect