🐙 Octopus: Towards Building the Arabic Speech LLM Suite
📢 Overview
Octopus is a bilingual Audio-Language Model (Audio-LLM) family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:
- Automatic Speech Recognition (ASR) for Arabic & English 🗣️
- Speech Translation (Arabic → English and vice versa) 🌍
- Arabic Dialect Identification (DID) 🏷️
The lightweight variant, TinyOctopus, maintains the same modular design but is optimized for efficiency on smaller GPUs.
🧩 Architecture
Core Components
The Octopus family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.
Audio Encoders
- Distil-Whisper (distil-large-v3) – lightweight frozen encoder producing compact speech embeddings.
- Whisper-large-v3 – high-capacity encoder for robust transcription and multilingual coverage.
- BEATs (Microsoft) – self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
Alignment & Fusion
- Cross-Attention Projection Layer – a trainable bridge that aligns audio representations with the text-language space through cross-modal attention.
Language / Decoder Models
- DeepSeek 1.5B – efficient generative decoder for reasoning, dialogue, and translation.
- LLaMA 3.2 1B – compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
- ALLaM 13B – large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.
Together, these components enable the Octopus line, from TinyOctopus (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full ALLaM-Octopus (Whisper-large-v3 + BEATs + ALLaM 13B), to handle diverse audio-understanding and speech-to-text reasoning tasks across Arabic and English.
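A minimal PyTorch sketch of this cross-attention fusion is shown below. The class name, dimensions, and query count are illustrative assumptions, not the released configuration:

```python
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Illustrative bridge between a frozen audio encoder and an LLM decoder.

    Learnable queries attend over the encoder's output frames, and the
    attended features are projected into the decoder's embedding space.
    All sizes below are hypothetical placeholders.
    """

    def __init__(self, audio_dim=1280, text_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        # Learnable query vectors that summarize the audio sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim))
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        # Map attended audio features into the LLM token-embedding space.
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim) from the frozen encoder.
        q = self.queries.unsqueeze(0).expand(audio_states.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_states, audio_states)
        # (batch, num_queries, text_dim): prepended to the prompt's text embeddings.
        return self.proj(attended)
```

In this pattern the encoders stay frozen and the projector is trained to align the two modalities; the projected audio tokens are concatenated with the prompt's token embeddings before being fed to the decoder.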
📊 Training Datasets
The Octopus models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling ≈25,000 hours of high-quality data for ASR, translation, and dialect identification.
| Task / Domain | Dataset | Train (h) | Dev (h) | Description |
|---|---|---|---|---|
| ASR (Arabic) | QASR | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. |
| | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North-African dialects. |
| ASR (English) | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
| | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
| ASR (Ar–En Code-Switching) | Synthetic (In-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
| Translation (Ar→En) | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
| | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
| Dialect Identification | ADI17 | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |
Total Coverage: ≈25,000 hours of speech across Arabic, English, and mixed-language domains, enabling broad generalization for ASR, translation, and dialect identification.
These datasets jointly provide:
- Balanced representation across dialects.
- Both natural and synthetic speech sources for enhanced robustness.
- Parallel ArabicโEnglish pairs enabling bilingual text generation and translation.
🧮 Model Weights & Resources
The full set of model weights (including large checkpoints) is publicly available here:
➡️ Octopus Model Weights
⚙️ Installation & Usage
💻 Install Dependencies
pip install -r requirements.txt
Inference
from inference import transcribe
audio_path = "path/to/audio.wav" # Replace with your actual audio file
output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation"
print("Generated Text:", output)
🧪 Evaluation Results
🎙️ ASR Performance (WER ↓ / CER ↓)
| Dataset | Ar-Octopus | Bilingual-Octopus | Trans-Octopus | Whisper-large-v3 | SeamlessM4T |
|---|---|---|---|---|---|
| MGB2 (Arabic) | 16.5 / 6.5 | 15.2 / 6.8 | 13.3 / 5.9 | 16.2 / 7.9 | 17.2 / 8.4 |
| test-clean (English) | 82.5 / 92.4 | 2.6 / 1.4 | 67.3 / 79.4 | 2.86 / 0.98 | 2.68 / 0.88 |
| test-other (English) | 86.9 / 95.1 | 5.1 / 3.4 | 71.5 / 87.8 | 5.00 / 2.05 | 5.07 / 1.94 |
| tedlium (English) | 101.9 / 77.4 | 5.1 / 3.9 | 85.2 / 63.6 | 11.9 / 4.4 | 86.5 / 62.2 |
| Escwa (Code-Switched) | 42.5 / 26.3 | 40.8 / 27.1 | 41.8 / 25.1 | 47.3 / 31.0 | 52.0 / 35.3 |
| Mixat-ALL (Code-Switched) | 22.0 / 9.0 | 23.4 / 10.3 | 34.1 / 10.6 | 29.0 / 15.0 | 32.8 / 16.9 |
| Mixat-CS (Code-Switched) | 26.4 / 12.4 | 28.5 / 14.9 | 27.8 / 13.3 | 34.8 / 20.6 | 38.2 / 21.8 |
| In-house Long-form | 25.4 / 13.0 | 24.9 / 12.5 | 24.1 / 12.1 | 26.7 / 15.2 | 29.3 / 18.6 |
An 86% improvement on English was observed with the addition of language tokens in the bilingual and translation variants.
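For reference, WER/CER for new transcripts can be computed with a standard scorer such as jiwer; this is a generic sketch, not necessarily the exact evaluation stack behind the numbers above:

```python
# pip install jiwer
from jiwer import cer, wer

reference = "أهلا بكم مشاهدينا الكرام في حلقة جديدة"
hypothesis = "اهلا بكم مشاهدين الكرام في حلقه جديدة"

# Word- and character-level error rates (lower is better).
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

Note that Arabic scores are sensitive to orthographic normalization (diacritics, hamza and taa-marbuta variants), so references and hypotheses are usually normalized before scoring.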
🪶 TinyOctopus & Fine-Tuning (WER ↓ / CER ↓)
| Dataset | TinyOctopus LLaMA-3 1B | Fine-tuned LLaMA-3 1B | TinyOctopus DeepSeek 1.5B | Fine-tuned DeepSeek 1.5B |
|---|---|---|---|---|
| MGB2 (Arabic) | 22.6 / 15.7 | 16.1 / 9.5 | 23.2 / 15.8 | 15.5 / 9.2 |
| test-clean (English) | 7.5 / 5.7 | 3.1 / 1.3 | 7.7 / 5.8 | 7.6 / 5.7 |
| test-other (English) | 11.3 / 8.0 | 6.9 / 3.5 | 11.5 / 8.2 | 11.3 / 8.0 |
| Escwa (Code-Switched) | 42.5 / 26.9 | 40.3 / 24.4 | 43.6 / 27.8 | 41.8 / 26.3 |
| Mixat-All | 35.2 / 19.6 | 34.1 / 19.3 | 37.1 / 21.1 | 35.5 / 19.9 |
| Mixat-CS | 40.2 / 24.2 | 36.2 / 21.4 | 41.2 / 25.2 | 39.9 / 24.2 |
| In-house Long-files | 44.3 / 29.1 | 42.8 / 26.9 | 47.0 / 32.7 | 43.7 / 31.5 |
Code-switched TTS augmentation yielded an ≈20% WER reduction across the multilingual evaluation sets.
🌍 Translation Performance (BLEU ↑ / BERT-F1 ↑)
| Model / System | CoVoST2 (ArโEn) | FLEURS (ArโEn) |
|---|---|---|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
| SeamlessM4T | 33.7 / 0.55 | 23.9 / 0.56 |
| Trans-Octopus | 38.6 / 0.64 | 23.2 / 0.58 |
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |
Trans-Octopus achieves the best BLEU and BERT-F1 on CoVoST2 and competitive results on FLEURS, surpassing SeamlessM4T in low-resource conditions.
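The BLEU / BERT-F1 metrics can be computed for new outputs with off-the-shelf scorers such as sacrebleu and bert-score; in this sketch the reference sentence is invented for illustration:

```python
# pip install sacrebleu bert-score
import sacrebleu
from bert_score import score as bert_score

hyps = ["I took a loan a certain amount of money to pay off the debt"]
refs = [["I borrowed a sum of money to pay off the debt"]]  # one reference stream

# Corpus-level BLEU (0-100 scale, higher is better).
bleu = sacrebleu.corpus_bleu(hyps, refs)

# BERTScore precision/recall/F1 against the aligned references.
_, _, f1 = bert_score(hyps, refs[0], lang="en")

print(f"BLEU: {bleu.score:.1f}")
print(f"BERT-F1: {f1.mean().item():.2f}")
```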
🏷️ Dialect Identification
For dialect identification, the TinyOctopus models achieved 87–89% accuracy across all 17 dialects in ADI17.
The confusion matrices reveal clear separation among Gulf, Levantine, North-African, and Egyptian clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
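Accuracy and confusion matrices of this kind can be derived from model predictions with scikit-learn; the labels below are a small hypothetical subset of the 17 ADI17 country codes:

```python
# pip install scikit-learn
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy predictions over a hypothetical subset of ADI17 labels.
labels = ["EGY", "JOR", "KSA", "MOR"]
y_true = ["KSA", "EGY", "JOR", "KSA", "MOR", "EGY"]
y_pred = ["KSA", "EGY", "KSA", "KSA", "MOR", "EGY"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=labels))
```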
Examples
Example 1: Speech Recognition (Arabic & English)
🎵 Audio Input (Arabic):
📝 User Prompt:
Transcribe the audio or قم بتفريغ المقطع الصوتي
💡 System Response:
أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
(English: "Welcome, dear viewers, to a new episode of the program Economy and the People.")
🎵 Audio Input (English):
📝 User Prompt:
Transcribe the audio or قم بتفريغ المقطع الصوتي
💡 System Response:
NO IT'S NOT TOO SOON
Example 2: Arabic to English Translation
🎵 Audio Input:
📝 User Prompt:
Translate the following Arabic speech into English or قم بترجمة المقطع للإنجليزية
💡 System Response:
I took a loan a certain amount of money to pay off the debt
Example 3: Dialect Identification
🎵 Audio Input:
📝 User Prompt:
Identify the dialect of the given speech or ما هي لهجة المتحدث؟
💡 System Response:
KSA
Model tree for ArabicSpeech/Octopus
Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B