---
license: cc-by-nc-4.0
base_model: nvidia/stt_hy_fastconformer_hybrid_large_pc
datasets:
- mteb/common_voice_20_0
metrics:
- wer
tags:
- nemo
- fastconformer
- armenian
- common-voice
- asr
model-index:
- name: fastconformer-hybrid-arm-asr
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_20_0
      type: common_voice
      config: hy-AM
      split: test
      args: hy-AM
    metrics:
    - name: WER
      type: wer
      value: 8.47
---

# FastConformer-Hybrid-ARM-ASR

This model is a fine-tuned version of [**nvidia/stt_hy_fastconformer_hybrid_large_pc**](https://huggingface.co/nvidia/stt_hy_fastconformer_hybrid_large_pc) for **Automatic Speech Recognition (ASR)** in **Armenian**. It was trained on the **Mozilla Common Voice 20.0** dataset (`hy-AM`) using the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo).

---

## Model Architecture

This model uses the **FastConformer-Hybrid** encoder, which combines:

- **Self-attention layers** (like transformers) for global context modeling
- **Convolutional modules** for capturing local patterns efficiently

For decoding, the model uses:

- **Transducer (RNN-T)** decoder – the main inference component
- **Auxiliary CTC loss** – used only during training to improve alignment and convergence

During inference (`transcribe()`), **only the Transducer decoder is used**, and its performance is what defines the model's WER.

---

## Training Configuration

- **Base model**: `nvidia/stt_hy_fastconformer_hybrid_large_pc`
- **Dataset**: Common Voice 20.0 (`hy-AM`)
- **Epochs**: 20
- **Batch size**: 32 (train), 16 (val/test)
- **Audio**: 16 kHz mono WAVs
- **Tokenizer**: BPE (Byte-Pair Encoding) – same as the base model
- **Augmentation**: SpecAugment
- **Loss**: Transducer + auxiliary CTC (`ctc_loss_weight: 0.3`)
- **Optimizer**: AdamW with cosine annealing
- **Precision**: Mixed 16-bit (fp16)

---

## Evaluation

Evaluated on the **Common Voice 20.0** Armenian (`hy-AM`) test split:

| Decoder Used | WER (%)  |
|--------------|----------|
| Transducer   | **8.47** |

> The model improves over the base model's original WER of **9.90%**, achieving a ~14% relative improvement.

---

## Files

| File / Folder                       | Description                                    |
|-------------------------------------|------------------------------------------------|
| `fastconformer-hybrid-arm-asr.nemo` | The fine-tuned ASR model checkpoint            |
| `config.yaml`                       | NeMo training configuration used for fine-tuning |
| `tokenizer/tokenizer.model`         | SentencePiece BPE tokenizer model              |
| `tokenizer/vocab.txt`               | Vocabulary used for decoding                   |
| `tokenizer/tokenizer.vocab`         | NeMo-compatible tokenizer vocabulary           |

---

## Usage Example

```python
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

model = EncDecHybridRNNTCTCBPEModel.restore_from("fastconformer-hybrid-arm-asr.nemo")
transcription = model.transcribe(["path_to_audio.wav"])
print(transcription[0])
```

> The input audio must be a 16 kHz mono WAV file. Other formats may result in degraded transcription quality or runtime errors.
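
If your source audio is not already in that format, one option is to convert it up front. The snippet below is a minimal sketch (not part of this repository) using the `librosa` and `soundfile` packages; the file names are placeholders, and it reuses the `model` object loaded above.

```python
import librosa
import soundfile as sf

def to_16k_mono_wav(src_path: str, dst_path: str) -> str:
    """Convert an arbitrary audio file to a 16 kHz mono WAV."""
    # librosa resamples to 16 kHz and downmixes to mono on load
    audio, sr = librosa.load(src_path, sr=16000, mono=True)
    # write 16-bit PCM WAV matching the model's expected input
    sf.write(dst_path, audio, sr, subtype="PCM_16")
    return dst_path

wav_path = to_16k_mono_wav("recording.mp3", "recording_16k.wav")  # hypothetical file names
print(model.transcribe([wav_path])[0])
```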

---

## Reproducibility

To fine-tune this model or adapt it to new datasets, you can reuse the included `config.yaml`. It defines:

- **Dataset loading** – Manifest paths, sampling rate, bucketing, batch sizes
- **Model architecture** – FastConformer encoder, RNNT decoder, joint module, auxiliary CTC decoder
- **Tokenizer setup** – BPE tokenizer (`tokenizer.model`, `vocab.txt`, `tokenizer.vocab`)
- **Loss functions** – Transducer (RNNT) as main loss + auxiliary CTC (`ctc_loss_weight = 0.3`)
- **Optimizer & scheduler** – AdamW optimizer with cosine annealing scheduler
- **Logging & checkpointing** – NeMo's `exp_manager` with optional checkpoint saving
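
As a starting point, here is a minimal fine-tuning sketch, not a verified recipe. It assumes the standard NeMo config layout (`model.train_ds`, `model.validation_ds`, and an optional `exp_manager` section), that the manifest paths in `config.yaml` point to your own data, and that the trainer settings match your hardware and Lightning/NeMo versions.

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.utils.exp_manager import exp_manager

# Load the shipped training configuration (edit manifest paths for your data first)
cfg = OmegaConf.load("config.yaml")

# fp16 mixed precision, as in the original fine-tuning run;
# newer Lightning versions expect precision="16-mixed"
trainer = pl.Trainer(max_epochs=20, accelerator="gpu", devices=1, precision=16)
exp_manager(trainer, cfg.get("exp_manager"))

# Warm-start from this checkpoint (or the base model) and attach the data loaders
model = EncDecHybridRNNTCTCBPEModel.restore_from("fastconformer-hybrid-arm-asr.nemo")
model.set_trainer(trainer)
model.setup_training_data(cfg.model.train_ds)
model.setup_validation_data(cfg.model.validation_ds)

trainer.fit(model)
```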