🐙 Octopus: Towards Building the Arabic Speech LLM Suite
📢 Overview
Octopus is a bilingual Audio-Language Model (Audio-LLM) family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:
- Automatic Speech Recognition (ASR) for Arabic & English 🗣️
- Speech Translation (Arabic → English and vice versa) 🌍
- Arabic Dialect Identification (DID) 🏷️
The lightweight variant, TinyOctopus, maintains the same modular design but is optimized for efficiency on smaller GPUs.
🧩 Architecture
Core Components
The Octopus family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.
Audio Encoders
- Distil-Whisper (distil-large-v3) – lightweight frozen encoder producing compact speech embeddings.
- Whisper-large-v3 – high-capacity encoder for robust transcription and multilingual coverage.
- BEATs (Microsoft) – self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
Alignment & Fusion
- Cross-Attention Projection Layer – a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.
Language / Decoder Models
- DeepSeek 1.5B – efficient generative decoder for reasoning, dialogue, and translation.
- LLaMA 3.2 1B – compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
- ALLaM 13B – large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.
Together, these components enable the Octopus line, from TinyOctopus (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full ALLaM-Octopus (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio-understanding and speech-to-text reasoning tasks across Arabic and English.
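A minimal PyTorch sketch of this cross-attention fusion is shown below. The class name, dimensions, and query count are illustrative assumptions, not the released configuration:

```python
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Illustrative bridge between a frozen audio encoder and an LLM decoder.

    Learnable queries attend over the encoder's output frames, and the
    attended features are projected into the decoder's embedding space.
    All sizes below are hypothetical placeholders.
    """

    def __init__(self, audio_dim=1280, text_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable query vectors that summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim))
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        # Map attended audio features into the LLM token-embedding space.
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim) from the frozen encoder.
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_states, audio_states)
        # (batch, num_queries, text_dim): prepended to the prompt's text embeddings.
        return self.proj(attended)
```

In this pattern the encoders stay frozen and the projector is trained to align the two modalities; the projected audio tokens are concatenated with the prompt's token embeddings before being fed to the decoder.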
📊 Training Datasets
The Octopus models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling ≈25,000 hours of high-quality data for ASR, translation, and dialect identification.
| Task / Domain | Dataset | Train (h) | Dev (h) | Description |
|---|---|---|---|---|
| ASR (Arabic) | QASR | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. |
| | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. |
| ASR (English) | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
| | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
| ASR (Ar–En Code-Switching) | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
| Translation (Ar→En) | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
| Dialect Identification | ADI17 | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |
Total Coverage: ≈25,000 hours of speech across Arabic, English, and mixed-language domains, enabling broad generalization for ASR, translation, and dialect identification.
These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel ArabicโEnglish pairs enabling bilingual text generation and translation.
🧮 Model Weights & Resources
The full set of model weights (including large checkpoints) is publicly available here:
➡️ Octopus Model Weights
⚙️ Installation & Usage
💻 Install Dependencies
pip install -r requirements.txt
Inference
from inference import transcribe
audio_path = "path/to/audio.wav" # Replace with your actual audio file
output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation"
print("Generated Text:", output)
🧪 Evaluation Results
🎙️ ASR Performance (WER ↓ / CER ↓)
| Dataset | Ar-Octopus | Bilingual-Octopus | Trans-Octopus | Whisper-large-v3 | SeamlessM4T |
|---|---|---|---|---|---|
| MGB2 (Arabic) | 16.5 / 6.5 | 15.2 / 6.8 | 13.3 / 5.9 | 16.2 / 7.9 | 17.2 / 8.4 |
| test-clean (English) | 82.5 / 92.4 | 2.6 / 1.4 | 67.3 / 79.4 | 2.86 / 0.98 | 2.68 / 0.88 |
| test-other (English) | 86.9 / 95.1 | 5.1 / 3.4 | 71.5 / 87.8 | 5.00 / 2.05 | 5.07 / 1.94 |
| tedlium (English) | 101.9 / 77.4 | 5.1 / 3.9 | 85.2 / 63.6 | 11.9 / 4.4 | 86.5 / 62.2 |
| Escwa (Code-Switched) | 42.5 / 26.3 | 40.8 / 27.1 | 41.8 / 25.1 | 47.3 / 31.0 | 52.0 / 35.3 |
| Mixat-ALL (Code-Switched) | 22.0 / 9.0 | 23.4 / 10.3 | 34.1 / 10.6 | 29.0 / 15.0 | 32.8 / 16.9 |
| Mixat-CS (Code-Switched) | 26.4 / 12.4 | 28.5 / 14.9 | 27.8 / 13.3 | 34.8 / 20.6 | 38.2 / 21.8 |
| In-house Long-form | 25.4 / 13.0 | 24.9 / 12.5 | 24.1 / 12.1 | 26.7 / 15.2 | 29.3 / 18.6 |
An 86% improvement on English was observed with the addition of language tokens in the bilingual and translation variants.
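For reference, WER/CER for new transcripts can be computed with a standard scorer such as jiwer; this is a generic sketch, not necessarily the exact evaluation stack behind the numbers above:

```python
# pip install jiwer
from jiwer import cer, wer

reference = "أهلا بكم مشاهدينا الكرام في حلقة جديدة"
hypothesis = "اهلا بكم مشاهدين الكرام في حلقه جديدة"

# Word- and character-level error rates (lower is better).
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

Note that Arabic scores are sensitive to orthographic normalization (diacritics, hamza and taa-marbuta variants), so references and hypotheses are usually normalized before scoring.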
🪶 TinyOctopus & Fine-Tuning (WER ↓ / CER ↓)
| Dataset | TinyOctopus LLaMA-3 1B | Fine-tuned LLaMA-3 1B | TinyOctopus DeepSeek 1.5B | Fine-tuned DeepSeek 1.5B |
|---|---|---|---|---|
| MGB2 (Arabic) | 22.6 / 15.7 | 16.1 / 9.5 | 23.2 / 15.8 | 15.5 / 9.2 |
| test-clean (English) | 7.5 / 5.7 | 3.1 / 1.3 | 7.7 / 5.8 | 7.6 / 5.7 |
| test-other (English) | 11.3 / 8.0 | 6.9 / 3.5 | 11.5 / 8.2 | 11.3 / 8.0 |
| Escwa (Code-Switched) | 42.5 / 26.9 | 40.3 / 24.4 | 43.6 / 27.8 | 41.8 / 26.3 |
| Mixat-All | 35.2 / 19.6 | 34.1 / 19.3 | 37.1 / 21.1 | 35.5 / 19.9 |
| Mixat-CS | 40.2 / 24.2 | 36.2 / 21.4 | 41.2 / 25.2 | 39.9 / 24.2 |
| In-house Long-files | 44.3 / 29.1 | 42.8 / 26.9 | 47.0 / 32.7 | 43.7 / 31.5 |
Code-switched TTS augmentation yielded an ≈20% WER reduction across the multilingual evaluation sets.
🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
| Model / System | CoVoST2 (ArโEn) | FLEURS (ArโEn) |
|---|---|---|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
| SeamlessM4T | 33.7 / 0.55 | 23.9 / 0.56 |
| Trans-Octopus | 38.6 / 0.64 | 23.2 / 0.58 |
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |
Trans-Octopus achieves the best BLEU and BERT-F1 on CoVoST2 and competitive results on FLEURS, surpassing SeamlessM4T in low-resource conditions.
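The BLEU / BERT-F1 metrics can be computed for new outputs with off-the-shelf scorers such as sacrebleu and bert-score; in this sketch the reference sentence is invented for illustration:

```python
# pip install sacrebleu bert-score
import sacrebleu
from bert_score import score as bert_score

hyps = ["I took a loan a certain amount of money to pay off the debt"]
refs = [["I borrowed a sum of money to pay off the debt"]]  # one reference stream

# Corpus-level BLEU (0-100 scale, higher is better).
bleu = sacrebleu.corpus_bleu(hyps, refs)

# BERTScore precision/recall/F1 against the aligned references.
_, _, f1 = bert_score(hyps, refs[0], lang="en")

print(f"BLEU: {bleu.score:.1f}")
print(f"BERT-F1: {f1.mean().item():.2f}")
```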
🏷️ Dialect Identification
For dialect identification, the TinyOctopus models achieved 87–89% accuracy across all 17 dialects in ADI17.
The confusion matrices reveal clear separation among Gulf, Levantine, North-African, and Egyptian clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
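Accuracy and confusion matrices of this kind can be derived from model predictions with scikit-learn; the labels below are a small hypothetical subset of the 17 ADI17 country codes:

```python
# pip install scikit-learn
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy predictions over a hypothetical subset of ADI17 labels.
labels = ["EGY", "JOR", "KSA", "MOR"]
y_true = ["KSA", "EGY", "JOR", "KSA", "MOR", "EGY"]
y_pred = ["KSA", "EGY", "KSA", "KSA", "MOR", "EGY"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=labels))
```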
Examples
Example 1: Speech Recognition (Arabic & English)
🎵 Audio Input (Arabic):
📝 User Prompt:
Transcribe the audio or قم بتفريغ المقطع الصوتي
💡 System Response:
أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
(English: "Welcome, dear viewers, to a new episode of the program Economy and the People.")
🎵 Audio Input (English):
📝 User Prompt:
Transcribe the audio or قم بتفريغ المقطع الصوتي
💡 System Response:
NO IT'S NOT TOO SOON
Example 2: Arabic to English Translation
🎵 Audio Input:
📝 User Prompt:
Translate the following Arabic speech into English or قم بترجمة المقطع للإنجليزية
💡 System Response:
I took a loan a certain amount of money to pay off the debt
Example 3: Dialect Identification
🎵 Audio Input:
📝 User Prompt:
Identify the dialect of the given speech or ما هي لهجة المتحدث؟
💡 System Response:
KSA
Model tree for ArabicSpeech/Octopus
Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B