๐Ÿ™ Octopus: Towards Building the Arabic Speech LLM Suite

📢 Overview

Octopus is a bilingual Audio-Language Model (Audio-LLM) family developed to understand, transcribe, translate, and reason over Arabic and English speech.
It unifies audio, text, and reasoning within one multimodal framework, supporting:

  • Automatic Speech Recognition (ASR) for Arabic & English 🗣️
  • Speech Translation (Arabic → English and vice versa) 🌍
  • Arabic Dialect Identification (DID) 🏷️

The lightweight variant, TinyOctopus, maintains the same modular design but is optimized for efficiency on smaller GPUs.

🧩 Architecture

Core Components

The Octopus family scales across several encoder–decoder configurations, combining complementary strengths in acoustic understanding and text generation.

  1. Audio Encoders

    • Distil-Whisper (distil-large-v3) → lightweight frozen encoder producing compact speech embeddings.
    • Whisper-large-v3 → high-capacity encoder for robust transcription and multilingual coverage.
    • BEATs (Microsoft) → self-supervised audio encoder capturing fine-grained acoustic cues such as timbre and speaker traits.
  2. Alignment & Fusion

    • Cross-Attention Projection Layer → a trainable bridge that aligns audio representations with the text-language space through cross-modal attention (a minimal sketch appears after this section).
  3. Language / Decoder Models

    • DeepSeek 1.5B → efficient generative decoder for reasoning, dialogue, and translation.
    • LLaMA 3.2 1B → compact Arabic–English language model variant optimized for code-switching and reasoning on limited hardware.
    • ALLaM 13B → large bilingual decoder offering high-fidelity generation and deeper contextual grounding for Arabic tasks.

Together, these components let the Octopus line scale from TinyOctopus (Distil-Whisper + LLaMA 3.2 1B or DeepSeek 1.5B) up to the full ALLaM-Octopus (Whisper-large-v3 + BEATs + ALLaM 13B), handling diverse audio-understanding and speech-to-text reasoning tasks across Arabic and English.
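
The projection layer is described only at a high level here; the following PyTorch sketch illustrates one common way such a cross-attention bridge is built. All names, dimensions, and the number of query vectors are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Illustrative audio-to-text bridge: learned query vectors attend over
    encoder frames, and the attended summary is projected into the
    decoder's embedding space."""

    def __init__(self, audio_dim=1280, text_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        # Trainable queries that pool variable-length audio into a fixed token set
        self.queries = nn.Parameter(torch.randn(num_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_feats):  # audio_feats: (batch, frames, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(attended)   # (batch, num_queries, text_dim)
```

In this pattern, the projected tokens would be prepended to the text-prompt embeddings before decoding.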

📚 Training Datasets

The Octopus models were trained and evaluated on a diverse collection of Arabic, English, and code-switching speech corpora, totaling ≈25,000 hours of high-quality data for ASR, translation, and dialect identification.

| Task / Domain | Dataset | Train (h) | Dev (h) | Description |
|---|---|---|---|---|
| ASR (Arabic) | QASR | 1,880.5 | 9.6 | Broadcast Arabic from Al-Jazeera; multi-dialect with punctuation and speaker tags. |
| ASR (Arabic) | In-house Arabic Corpus | 13,392.1 | 142.7 | Large internal Arabic dataset across Gulf, Levantine, and North African dialects. |
| ASR (English) | LibriSpeech | 960.0 | 10.5 | Read English corpus for ASR benchmarking. |
| ASR (English) | TED-LIUM | 453.8 | 1.6 | English TED-talk recordings for spontaneous speech recognition. |
| ASR (Ar–En Code-Switching) | Synthetic (in-house TTS) | 119.5 | – | Synthetic bilingual utterances generated via TTS to strengthen mixed-speech robustness. |
| Translation (Ar→En) | Translated QASR (via GPT-4o) | 1,858.4 | 9.6 | QASR corpus automatically translated to English for parallel supervision. |
| Translation (Ar→En) | Translated In-house Arabic (via GPT-4o) | 7,229.2 | 141.9 | In-house Arabic dataset machine-translated to English via GPT-4o. |
| Dialect Identification | ADI17 | 2,241.5 | 19.0 | YouTube-sourced Arabic speech across 17 dialects for dialect recognition and adaptation. |

Total Coverage: ≈25,000 hours of speech across Arabic, English, and mixed-language domains, enabling broad generalization for ASR, translation, and dialect identification.

These datasets jointly provide:

  • Balanced representation across dialects.
  • Both natural and synthetic speech sources for enhanced robustness.
  • Parallel Arabic–English pairs enabling bilingual text generation and translation.
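
The card does not publish the exact training-sample format; purely as an illustration, a multitask training record might pair each clip with a task tag, a prompt, and a target text. All field names below are hypothetical.

```python
# Hypothetical multitask training record; field names are illustrative only.
sample = {
    "audio": "clips/qasr_000123.wav",   # path to a speech segment
    "task": "translation",              # one of: "asr", "translation", "dialect"
    "prompt": "Translate the following Arabic speech into English",
    "target": "Welcome, dear viewers, to a new episode of Economy and People.",
}
```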

🧮 Model Weights & Resources

The full set of model weights (including large checkpoints) is publicly available here:
โžก๏ธ Octopus Model Weights

โš™๏ธ Installation & Usage

💻 Install Dependencies

pip install -r requirements.txt

Inference

from inference import transcribe

audio_path = "path/to/audio.wav"  # Replace with your actual audio file
output = transcribe(audio_path, task="asr")  # Options: "dialect", "asr", "translation"

print("Generated Text:", output)

🧪 Evaluation Results

๐ŸŽ™๏ธ ASR Performance (WER โ†“)

| Dataset | Ar-Octopus | Bilingual-Octopus | Trans-Octopus | Whisper-large-v3 | SeamlessM4T |
|---|---|---|---|---|---|
| MGB2 (Arabic) | 16.5 / 6.5 | 15.2 / 6.8 | 13.3 / 5.9 | 16.2 / 7.9 | 17.2 / 8.4 |
| test-clean (English) | 82.5 / 92.4 | 2.6 / 1.4 | 67.3 / 79.4 | 2.86 / 0.98 | 2.68 / 0.88 |
| test-other (English) | 86.9 / 95.1 | 5.1 / 3.4 | 71.5 / 87.8 | 5.00 / 2.05 | 5.07 / 1.94 |
| TED-LIUM (English) | 101.9 / 77.4 | 5.1 / 3.9 | 85.2 / 63.6 | 11.9 / 4.4 | 86.5 / 62.2 |
| ESCWA (Code-Switched) | 42.5 / 26.3 | 40.8 / 27.1 | 41.8 / 25.1 | 47.3 / 31.0 | 52.0 / 35.3 |
| Mixat-ALL (Code-Switched) | 22.0 / 9.0 | 23.4 / 10.3 | 34.1 / 10.6 | 29.0 / 15.0 | 32.8 / 16.9 |
| Mixat-CS (Code-Switched) | 26.4 / 12.4 | 28.5 / 14.9 | 27.8 / 13.3 | 34.8 / 20.6 | 38.2 / 21.8 |
| In-house Long-form | 25.4 / 13.0 | 24.9 / 12.5 | 24.1 / 12.1 | 26.7 / 15.2 | 29.3 / 18.6 |

An 86% improvement on English was observed after adding language tokens for the bilingual and translation variants.
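
The figures above (and in the next table) are standard word error rates; for your own outputs they can be computed with, for example, the jiwer package. This is only a sketch, and the text normalization applied for these tables is not documented in this card.

```python
import jiwer

reference  = "welcome dear viewers to a new episode"
hypothesis = "welcome viewers to new episode"

# WER = (substitutions + deletions + insertions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```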


🪶 TinyOctopus & Fine-Tuning (WER ↓)

| Dataset | TinyOctopus (LLaMA-3 1B) | Fine-tuned (LLaMA-3 1B) | TinyOctopus (DeepSeek 1.5B) | Fine-tuned (DeepSeek 1.5B) |
|---|---|---|---|---|
| MGB2 (Arabic) | 22.6 / 15.7 | 16.1 / 9.5 | 23.2 / 15.8 | 15.5 / 9.2 |
| test-clean (English) | 7.5 / 5.7 | 3.1 / 1.3 | 7.7 / 5.8 | 7.6 / 5.7 |
| test-other (English) | 11.3 / 8.0 | 6.9 / 3.5 | 11.5 / 8.2 | 11.3 / 8.0 |
| ESCWA (Code-Switched) | 42.5 / 26.9 | 40.3 / 24.4 | 43.6 / 27.8 | 41.8 / 26.3 |
| Mixat-ALL | 35.2 / 19.6 | 34.1 / 19.3 | 37.1 / 21.1 | 35.5 / 19.9 |
| Mixat-CS | 40.2 / 24.2 | 36.2 / 21.4 | 41.2 / 25.2 | 39.9 / 24.2 |
| In-house Long-form | 44.3 / 29.1 | 42.8 / 26.9 | 47.0 / 32.7 | 43.7 / 31.5 |

Code-switch TTS augmentation yielded a ≈20% WER reduction across multilingual evaluation sets.
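
The augmentation pipeline itself is not released; the sketch below only illustrates the general recipe of splicing aligned English translations into Arabic word sequences before synthesis. The `synthesize` call is a placeholder, not a real API.

```python
import random

def make_code_switched(arabic_tokens, english_tokens, switch_prob=0.3):
    """Randomly swap Arabic words for their aligned English translations
    to build a mixed-language sentence for TTS synthesis."""
    return " ".join(
        en if en and random.random() < switch_prob else ar
        for ar, en in zip(arabic_tokens, english_tokens)
    )

# text = make_code_switched(ar_words, en_words)
# wav = synthesize(text)  # placeholder for any bilingual TTS backend
```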


๐ŸŒ Translation Performance (BLEU โ†‘ / BERT-F1 โ†‘)

| Model / System | CoVoST2 (Ar→En) | FLEURS (Ar→En) |
|---|---|---|
| Whisper-large-v3 | 28.8 / 0.53 | 15.1 / 0.47 |
| SeamlessM4T | 33.7 / 0.55 | 23.9 / 0.56 |
| Trans-Octopus | 38.6 / 0.64 | 23.2 / 0.58 |
| TO-LLaMA-1B | 33.9 / 0.61 | 20.5 / 0.53 |
| TO-DeepSeek-1.5B | 33.6 / 0.61 | 20.8 / 0.53 |

Trans-Octopus achieves the best BLEU and BERT-F1 on CoVoST2; on FLEURS it leads all systems in BERT-F1 while trailing SeamlessM4T only slightly in BLEU.
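
Both metrics are standard and can be computed with off-the-shelf tooling; a minimal sketch using sacrebleu and bert-score follows (the reference string and settings here are illustrative and may differ from those used for the table):

```python
import sacrebleu
from bert_score import score

hyps = ["I took a loan a certain amount of money to pay off the debt"]
refs = ["I borrowed a sum of money to pay off the debt"]

bleu = sacrebleu.corpus_bleu(hyps, [refs])   # corpus-level BLEU
_, _, f1 = score(hyps, refs, lang="en")      # BERTScore precision/recall/F1
print(f"BLEU: {bleu.score:.1f}  BERT-F1: {f1.mean().item():.2f}")
```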


๐Ÿท๏ธ Dialect Identification

For dialect identification, the TinyOctopus models achieved 87–89% accuracy across all 17 dialects in ADI17.
The confusion matrices reveal clear separation among Gulf, Levantine, North African, and Egyptian clusters, showing that even compact models can internalize subtle dialectal cues when trained in a multitask setting.
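
Such confusion matrices can be reproduced for any model's dialect predictions with scikit-learn; the labels below are a small illustrative subset of the 17 ADI17 classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["KSA", "EGY", "LEV", "MOR"]   # illustrative subset of the 17 dialects
y_true = ["KSA", "EGY", "LEV", "MOR", "KSA", "EGY"]
y_pred = ["KSA", "EGY", "LEV", "KSA", "KSA", "EGY"]

print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=labels))
```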

Examples

Example 1: Arabic Speech Recognition

🎵 Audio Input (Arabic):

📝 User Prompt:

Transcribe the audio or قم بتفريغ المقطع الصوتي

💡 System Response:

أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
(English: "Welcome, dear viewers, to a new episode of the program Economy and People.")

🎵 Audio Input (English):

📝 User Prompt:

Transcribe the audio or قم بتفريغ المقطع الصوتي

💡 System Response:

NO IT'S NOT TOO SOON


Example 2: Arabic to English Translation

🎵 Audio Input:

📝 User Prompt:

Translate the following Arabic speech into English or قم بترجمة المقطع للإنجليزية

💡 System Response:

I took a loan a certain amount of money to pay off the debt


Example 3: Dialect Identification

🎵 Audio Input:

📝 User Prompt:

Identify the dialect of the given speech or ماهي لهجة المتحدث؟

💡 System Response:

KSA

