---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/bam-asr-all
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: nvidia/parakeet-tdt_ctc-110m
model-index:
- name: soloni-114m-tdt-ctc
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: bam-asr-all
      type: RobotsMali/bam-asr-all
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER (TDT)
      type: wer
      value: 66.7
    - name: Test WER (CTC)
      type: wer
      value: 40.6
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# Soloni TDT-CTC 114M Bambara
`soloni-114m-tdt-ctc` is a fine-tuned version of NVIDIA's [`parakeet-tdt_ctc-110m`](https://huggingface.co/nvidia/parakeet-tdt_ctc-110m) that transcribes Bambara speech. Unlike its base model, it does not produce punctuation or capitalization, since neither was present in its training data.
The model was fine-tuned using **NVIDIA NeMo** and supports **both TDT (Token-and-Duration Transducer) and CTC (Connectionist Temporal Classification) decoding**.
## 🚨 Important Note
**Update (February 17th):** We observed a significantly lower WER **(\~36%)** for the TDT branch when using an external WER calculation method that relies solely on the predicted and reference transcriptions. However, the WER values reported in this model card are derived from the standard NeMo workflow using PyTorch Lightning's trainer, where the TDT branch yielded higher WER scores **(\~66%)**. Differences may arise due to variations in post-processing, alignment handling, or evaluation methodologies.
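For reference, a minimal sketch of such an external WER computation, assuming the [`jiwer`](https://github.com/jitsi/jiwer) package (not a dependency of this model) and hypothetical reference/prediction pairs:
```python
# pip install jiwer
import jiwer

# Hypothetical reference transcriptions and model predictions
references = ["i ni ce", "an ka taa"]
predictions = ["i ni ce", "an ka ta"]

# jiwer computes WER directly from the reference and predicted strings,
# independently of NeMo's evaluation pipeline
wer = jiwer.wer(references, predictions)
print(f"WER: {wer:.2%}")
```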
This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. Users should be aware that:
- **The model may not generalize well across all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**
## NVIDIA NeMo: Training
To fine-tune or otherwise experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest version of PyTorch.
```bash
pip install nemo_toolkit['asr']
```
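Once installed, a quick sanity check is to import the toolkit and print its version:
```python
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```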
## How to Use This Model
Note that this model has been released primarily for research purposes.
### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="RobotsMali/soloni-114m-tdt-ctc")
```
### Transcribe Audio
```python
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```
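The call returns one transcription per input file. Depending on your NeMo version, the returned items may be plain strings or `Hypothesis` objects that carry the text, so a defensive way to read the results looks like this:
```python
outputs = asr_model.transcribe(['sample_audio.wav'])

for out in outputs:
    # Newer NeMo versions may return Hypothesis objects; older ones return strings
    text = out.text if hasattr(out, "text") else out
    print(text)
```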
Note that the decoding strategy for the TDT decoder uses CUDA Graphs by default, but not all GPUs and CUDA versions support that feature. If you run into a `RuntimeError: CUDA error: invalid argument`, set that argument to `False` in the decoding strategy before calling `asr_model.transcribe()`:
```python
decoding_cfg = asr_model.cfg.decoding
# Disable CUDA Graphs
decoding_cfg.greedy.use_cuda_graph_decoder = False
# Then change the decoding strategy
asr_model.change_decoding_strategy(decoding_cfg=decoding_cfg)
```
### Input
This model accepts **16000 Hz mono-channel** audio (wav files) as input.
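If your audio is in another format or sample rate, you can convert it first, for example with `ffmpeg` (assuming it is installed; `input.mp3` is a hypothetical source file):
```bash
# Convert to 16 kHz, mono, 16-bit PCM WAV
ffmpeg -i input.mp3 -ac 1 -ar 16000 -c:a pcm_s16le sample_audio.wav
```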
### Output
This model provides transcribed speech as a string for a given audio sample.
## Model Architecture
This model uses a Hybrid FastConformer-TDT-CTC architecture. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
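As a quick sanity check on the model's size, you can count its parameters once it is loaded (this checkpoint has roughly 114M):
```python
# asr_model loaded as shown above
num_params = sum(p.numel() for p in asr_model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```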
## Training
The NeMo toolkit was used to fine-tune this model for **16,296 steps** from the `parakeet-tdt_ctc-110m` checkpoint, using this [base config](https://github.com/RobotsMali-AI/bambara-asr/blob/main/configs/parakeet-110m-config-v6.yaml). The full training configurations, scripts, and experimental logs are available here:
[Bambara-ASR Experiments](https://github.com/RobotsMali-AI/bambara-asr)
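For illustration, a NeMo fine-tuning launch with NVIDIA's standard hybrid-model training script looks roughly like this (a sketch, not the exact command used for this run; the manifest paths are placeholders):
```bash
python examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
    --config-path=<path to the config directory> \
    --config-name=parakeet-110m-config-v6 \
    model.train_ds.manifest_filepath=<path to train manifest> \
    model.validation_ds.manifest_filepath=<path to validation manifest> \
    +init_from_pretrained_model="nvidia/parakeet-tdt_ctc-110m"
```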
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
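A typical invocation of that script looks something like this (a sketch; the manifest path is a placeholder, and the 1024-token BPE vocabulary matches the table below):
```bash
python process_asr_text_tokenizer.py \
    --manifest=<path to train manifest> \
    --data_root=<output directory for the tokenizer> \
    --vocab_size=1024 \
    --tokenizer=spe \
    --spe_type=bpe \
    --log
```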
## Dataset
This model was fine-tuned on the [bam-asr-all](https://huggingface.co/datasets/RobotsMali/bam-asr-all) dataset, which consists of 37 hours of transcribed Bambara speech. The dataset is primarily (~87%) derived from the **Jeli-ASR** dataset.
## Performance
The performance of automatic speech recognition models is measured using Word Error Rate (WER). Since this model has two decoders operating independently, each decoder is evaluated separately.
The following table summarizes the performance of both decoding branches on the bam-asr-all test set, reported in terms of **Word Error Rate (WER%)**.
|**Decoder (Version)**|**Tokenizer**|**Vocabulary Size**|**WER (%) on bam-asr-all (test)**|
|---------|-----------------------|-----------------|---------|
| CTC (V6) | BPE | 1024 | 40.6 |
| TDT (V6) | BPE | 1024 | 66.7 |
These are greedy WER numbers without an external LM. By default, the main decoder branch is the TDT branch. If you would like to switch to the CTC decoder, run the following block of code before calling the `.transcribe()` method:
```python
# Retrieve the CTC decoding config
ctc_decoding_cfg = asr_model.cfg.aux_ctc.decoding
# Then change the decoding strategy
asr_model.change_decoding_strategy(decoder_type='ctc', decoding_cfg=ctc_decoding_cfg)
# Transcribe with the CTC decoder
asr_model.transcribe(['sample_audio.wav'])
```
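To switch back to the TDT branch afterwards, the same API should work with `decoder_type='rnnt'` (the transducer branch of a hybrid NeMo model is addressed as `rnnt`):
```python
# Restore the TDT (transducer) decoder as the active branch
asr_model.change_decoding_strategy(decoder_type='rnnt')
```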
## License
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.
---
More details are available in the **Experimental Technical Report**:
[Draft Technical Report - Weights & Biases](https://wandb.ai/yacoudiarra-wl/bam-asr-nemo-training/reports/Draft-Technical-Report-V1--VmlldzoxMTIyOTMzOA).
Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/RobotsMali-AI/bambara-asr/issues) on GitHub if you have any contributions.
---