pipeline_tag: automatic-speech-recognition
library_name: NeMo
---

# Tawasul STT V0 (Supports all Arabic Dialects with more focus on the Egyptian Arz dialect)

<style>
img {
  display: inline-table;
  vertical-align: small;
  margin: 0;
  padding: 0;
}
</style>

This model transcribes speech in the Arabic language with punctuation mark support.
It is a "large" version of the FastConformer Transducer-CTC model (around 115M parameters) and is trained on two losses: Transducer (default) and CTC.
See the [Model Architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer) for complete architecture details.
The model transcribes text in Arabic without diacritical marks and supports periods, Arabic commas, and Arabic question marks.

This model is ready for commercial and non-commercial use.

## Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling.
The model is trained in a multitask setup with a hybrid Transducer (RNNT) decoder and Connectionist Temporal Classification (CTC) loss.
You may find more information on the details of FastConformer here: [Fast-Conformer](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer).

The model utilizes a [Google SentencePiece](https://github.com/google/sentencepiece) [2] tokenizer with a vocabulary size of 1024.

### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Other Properties Related to Input:** 16000 Hz Mono-channel Audio; Pre-Processing Not Needed (for other formats or sample rates, see the conversion sketch below)
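
The model expects 16 kHz mono WAV input, so audio in other formats or sample rates should be converted first. A minimal sketch using librosa and soundfile (both installed in the Installations step below); the file names are placeholders:

```
import librosa
import soundfile as sf

# Load any common audio format, resampling to 16 kHz and downmixing to mono.
# "input_audio.mp3" and "sample_16k.wav" are placeholder paths.
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("sample_16k.wav", audio, sr)
```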

### Output

This model provides transcribed speech as a string for a given audio sample.
- **Output Type:** Text
- **Output Parameters:** One-Dimensional (1D)
- **Other Properties Related to Output:** May Need Inverse Text Normalization (see the sketch below); Does Not Handle Special Characters; Outputs text in Arabic without diacritical marks
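
Since the raw output may need inverse text normalization (e.g., turning spelled-out numbers into digits), one option is NeMo's `nemo_text_processing` package. A hedged sketch, assuming your installed version ships an Arabic (`lang='ar'`) ITN grammar:

```
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

# Assumption: Arabic grammars are available in your nemo_text_processing build.
itn = InverseNormalizer(lang='ar')

raw = "transcript returned by the model"  # placeholder transcript
normalized = itn.inverse_normalize(raw, verbose=False)
```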

## Limitations
The model is non-streaming and outputs speech as a string without diacritical marks.
It is not recommended for word-for-word transcription and punctuation, as accuracy varies based on the characteristics of the input audio (unrecognized words, accent, noise, speech type, and context of speech).

## How to download and use the model

#### Installations
```
$ apt-get update && apt-get install -y libsndfile1 ffmpeg
$ pip -q install soundfile librosa sentencepiece Cython packaging
$ pip -q install nemo_toolkit['asr']
```

#### Download the model
```
$ curl -L -o path/to/tawasul_egy_stt.nemo https://huggingface.co/TawasulAI/tawasul-egy-stt/resolve/main/tawasul_egy_stt_wer0.3543.nemo
```
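
Alternatively, you can fetch the same file from Python with `huggingface_hub` (a sketch; assumes the `huggingface_hub` package is installed, which `nemo_toolkit` typically pulls in):

```
from huggingface_hub import hf_hub_download

# Downloads to the local Hugging Face cache and returns the resolved path.
model_path = hf_hub_download(
    repo_id="TawasulAI/tawasul-egy-stt",
    filename="tawasul_egy_stt_wer0.3543.nemo",
)
```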

#### Imports and usage
```
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from(
    "path/to/tawasul_egy_stt.nemo",
)
```
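
NeMo ASR models are ordinary PyTorch modules, so standard device handling applies; you can optionally move the model to a GPU before transcribing:

```
import torch

# Use a GPU when available; .eval() disables dropout and other training-only behavior.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
asr_model = asr_model.to(device)
asr_model.eval()
```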

### Transcribing using Python
Simply do:
```
prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
print(prediction[0].text)
```
You can also pass more than one audio file for batch inference:
```
asr_model.transcribe(['sample_audio_1.wav', 'sample_audio_2.wav', 'sample_audio_3.wav'])
```
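
Because this is a hybrid Transducer-CTC checkpoint, you can also decode with the CTC branch instead of the default Transducer one. A sketch assuming the checkpoint exposes NeMo's standard `change_decoding_strategy` API for hybrid models:

```
# Switch the hybrid model to its CTC decoder (the default is the Transducer/RNNT branch).
asr_model.change_decoding_strategy(decoder_type='ctc')
ctc_prediction = asr_model.transcribe(['sample_audio_to_transcribe.wav'])
print(ctc_prediction[0].text)

# Restore the default Transducer decoder.
asr_model.change_decoding_strategy(decoder_type='rnnt')
```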

## Training and Testing Datasets

### Training Datasets
#### The model was trained by NVIDIA on a composite dataset comprising around 760 hours of Arabic speech:
- [Massive Arabic Speech Corpus (MASC)](https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus) [690h]
  - Data Collection Method: Automated

#### The model was then further finetuned on around 100 hours of private Egyptian-dialect Arabic speech
- The second-stage Egyptian training data is private; there is no intention to open-source it

### Test Benchmark datasets
| Test Set | Num Dialects | Test (h) |
|----------|--------------|----------|
| [SADA](https://www.kaggle.com/datasets/sdaiancai/sada2022) | 10 | 10.7 |
| [MGB-2](http://www.mgb-challenge.org/MGB-2.html) | Unspecified | 9.6 |
| [Casablanca](https://huggingface.co/datasets/UBC-NLP/Casablanca) | 8 | 7.7 |

### Test Benchmark results
- CommonVoice
  - WER:
  - CER:
  - WER:
  - CER:
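
To reproduce WER/CER numbers on your own labeled audio, NeMo bundles a `word_error_rate` helper; a sketch with placeholder reference transcripts, assuming this import path in your NeMo version:

```
from nemo.collections.asr.metrics.wer import word_error_rate

hypotheses = [h.text for h in asr_model.transcribe(['sample_audio_1.wav'])]
references = ["reference transcript for sample_audio_1"]  # placeholder ground truth

print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
print("CER:", word_error_rate(hypotheses=hypotheses, references=references, use_cer=True))
```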

## Software Integration

### Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine
- NeMo 2.0.0

### Preferred Operating System
- Linux

## Explainability

- High-Level Application and Domain: Automatic Speech Recognition
- Describe how this model works: The model transcribes audio input into text for the Arabic language
- Performance Metrics: Word Error Rate (WER), Character Error Rate (CER), Real-Time Factor
- Potential Known Risks: Transcripts may not be 100% accurate. Accuracy varies based on the characteristics of input audio (domain, use case, accent, noise, speech type, context of speech, etcetera).

## Bias
- Was the model trained with a specific accent? The model was trained on general Arabic dialects and then further fine-tuned on the Egyptian dialect (Arz)
- Have any special measures been taken to mitigate unwanted bias? No

## Safety & Security
### Use Case Restrictions:

- Non-streaming ASR model
- The model is noise-sensitive
- The model is further finetuned for the Egyptian dialect

## License

License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public release version of the model, you accept its terms and conditions.

## References

[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[2] [Google SentencePiece Tokenizer](https://github.com/google/sentencepiece)