---
language: ar
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- asr
- audio
- arabic
- egyptian-arabic
datasets:
- MAdel121/arabic-egy-cleaned
metrics:
- wer
- cer
base_model: openai/whisper-medium
pipeline_tag: automatic-speech-recognition
library_name: transformers
model-index:
- name: whisper-medium-egy
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: MAdel121/arabic-egy-cleaned (validation split)
      type: MAdel121/arabic-egy-cleaned
      config: ar
      split: validation
    metrics:
    - name: WER
      type: wer
      value: 18.029990439289488
    - name: CER
      type: cer
      value: 13.375029793807732
---

# Whisper Medium Egyptian Arabic (whisper-medium-egy)

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on a custom dataset of 72 hours of Egyptian Arabic speech. It's designed for Automatic Speech Recognition (ASR) for the Egyptian Arabic dialect.

## Model Description

* **Base Model:** `openai/whisper-medium`
* **Language:** Arabic (ar), specifically focused on Egyptian dialect (arz)
* **Fine-tuning Dataset:** `MAdel121/arabic-egy-cleaned` (approx. 72 hours)
* **Total Training Steps:** 7299
* **Epochs:** 10

## Intended Uses & Limitations

This model is intended for transcribing speech in Egyptian Arabic.

**Intended Use:**

* Automatic transcription of audio recordings and live speech in Egyptian Arabic.
* Assisting with content creation, subtitling, and voice-controlled applications for Egyptian Arabic speakers.

**Limitations:**

* Performance may degrade in highly noisy environments or with very strong, non-Egyptian accents.
* The model was fine-tuned on a specific dataset; its performance on significantly different domains or audio characteristics might vary.
* The training data primarily consists of [describe your dataset sources/domains if possible, e.g., "YouTube videos", "audiobooks", "scripted conversations"]. Performance might be better on similar types of audio.

## How to Use

The simplest way to run this model is through the `pipeline` interface of the `transformers` library.

```python
import torch
from transformers import pipeline

# Use the first GPU when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    # Replace YOUR_HF_USERNAME with your Hugging Face username.
    model="YOUR_HF_USERNAME/whisper-medium-egy",
    device=device,
)

# Example with a local audio file
# audio_file = "path/to/your/egyptian_arabic_audio.wav"
# transcription = pipe(audio_file, generate_kwargs={"language": "arabic"})["text"]
# print(transcription)

# Example with a Hugging Face dataset audio sample
# from datasets import load_dataset
# ds = load_dataset("MAdel121/arabic-egy-cleaned", "ar", split="validation")  # Or your test split
# sample = ds[0]["audio"]  # Make sure your dataset has an "audio" column
# result = pipe(sample.copy(), generate_kwargs={"language": "arabic"})
# print(result["text"])
```

Make sure to replace `"YOUR_HF_USERNAME/whisper-medium-egy"` with the actual model ID after uploading. Passing `generate_kwargs={"language": "arabic"}` matters for Whisper models: it tells the decoder to transcribe in Arabic rather than relying on automatic language detection, which can pick the wrong language.

## Training Data

The model was fine-tuned on the `MAdel121/arabic-egy-cleaned` dataset, available on the Hugging Face Hub. This dataset contains approximately 72 hours of Egyptian Arabic audio paired with transcripts.

## Training Procedure

The model was trained using the `transformers` library.
The fine-tuning process involved the following key hyperparameters:

* **Base Model:** `openai/whisper-medium`
* **Optimizer:** AdamW
* **Learning Rate:** 1e-5 (0.00001)
* **Warmup Steps:** 1000
* **Weight Decay:** 0.05
* **Gradient Accumulation Factor:** 2
* **Batch Size (loader_batch_size):** 8 (effective batch size: 8 * 2 = 16)
* **Number of Epochs:** 10
* **Max Grad Norm:** 5
* **Augmentations Used:**
    * `use_drop_freq`: true
    * `use_drop_chunk`: true
    * `use_drop_bit_resolution`: true
    * Other augmentations (`use_add_noise`, `use_speed_perturb`, `use_pitch_shift`, `use_add_reverb`, `use_codec_augment`, `use_gain`) were set to `false`
* **Task:** transcribe
* **Language:** ar
* **Seed:** 1986

Training was done on 1x A100 (80GB) on Modal Labs.

The training was managed and tracked using Weights & Biases under the project `whisper-medium-egyptian-arabic` with resume ID `r3sz4v27`.

## Training Code

The training code is available [on GitHub](https://github.com/moadel321/Fine-tuning-whisper-on-Modal-Labs-with-speech-brain-augmentations-/blob/c85312785faa2b927cbc217fe43acb8ed660d2ee/train_whisper_modal.py).

## Weights & Biases

The run can be found here: https://wandb.ai/m-adelomar1/whisper-medium-egyptian-arabic/

## Evaluation Results

The model was evaluated on the `validation` split of the `MAdel121/arabic-egy-cleaned` dataset.

* **Word Error Rate (WER):** 18.03%
* **Character Error Rate (CER):** 13.38%

Both metrics measure performance on the validation set only; lower values are better.

### BibTeX Citation

```bibtex
% Replace the URL below with the actual model URL if it differs.
@misc{madel_2025_whisper_medium_egy,
  author       = {Madel},
  title        = {Whisper Medium Fine-tuned for Egyptian Arabic},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/MAdel121/whisper-medium-egy}}
}
```
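
### Computing WER and CER (illustrative sketch)

For readers who want to see what the WER and CER figures under Evaluation Results measure, the following is a minimal, dependency-free sketch of the standard definitions: total edit distance divided by total reference length, at word and character level respectively. It is not the evaluation code used for this model. The helper names `edit_distance` and `error_rate` are hypothetical, the sketch applies no text normalization (the actual evaluation may have), and counting spaces as characters for CER is an assumption.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str], unit: str = "word") -> float:
    """Corpus-level error rate in percent (WER for unit='word', CER for unit='char')."""
    edits = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        # Assumption: words are whitespace-separated; characters include spaces.
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no tokens")
    return 100.0 * edits / total


# Made-up example pair (not taken from the dataset): one wrong word out of four.
references = ["انا رايح الشغل دلوقتي"]
hypotheses = ["انا رايح البيت دلوقتي"]
print(f"WER: {error_rate(references, hypotheses, unit='word'):.2f}%")
print(f"CER: {error_rate(references, hypotheses, unit='char'):.2f}%")
```

In practice, `hypotheses` would hold the pipeline's transcriptions and `references` the dataset transcripts. Summing edits over the whole corpus before dividing, rather than averaging per-utterance rates, keeps short utterances from dominating the score.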