---
language:
- ar
metrics:
- wer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- arabic
- pytorch
license: apache-2.0
---
# WhisperLevantineArabic

**Fine-tuned Whisper model for the Levantine Dialect (Israeli-Arabic)**

Thanks to [ivrit.ai](https://github.com/ivrit-ai/ivrit.ai/tree/master) for providing the fine-tuning scripts!

## Model Description

This model is a fine-tuned version of [Whisper Large v3](https://github.com/openai/whisper) tailored specifically for transcribing Levantine Arabic, with a focus on the Israeli dialect. It is designed to improve automatic speech recognition (ASR) performance for this variant of Arabic.

- **Base Model**: Whisper Large V3
- **Fine-tuned for**: Levantine Arabic (Israeli Dialect)
- **WER on test set**: 33%
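The reported WER comes from the authors' held-out test set. For reference, word error rate can be computed with the [jiwer](https://github.com/jitsi/jiwer) package; the snippet below is a minimal sketch with made-up reference/hypothesis strings, not the actual evaluation pipeline:

```python
# pip install jiwer
import jiwer

# Hypothetical reference/hypothesis pair; the real test set is not published.
reference = "مرحبا كيف حالك اليوم"
hypothesis = "مرحبا كيف حالكم اليوم"

# jiwer.wer returns the word error rate as a float (0.0 means a perfect match).
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```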

## Training Data

The dataset used for training and fine-tuning consists of approximately 1,200 hours of transcribed audio, primarily Israeli Levantine Arabic along with some general Levantine Arabic content. The data sources include:

1. **Self-maintained Collection**: 1,200 hours of audio data curated by the team, covering a wide range of Israeli Levantine Arabic speech.

- **Total Dataset Size**: ~1,200 hours
- **Sampling Rate**: 8 kHz, upsampled to 16 kHz
- **Annotation**: Human-transcribed and annotated for high accuracy.

## How to Use
The fine-tuned model was converted with the [faster-whisper](https://github.com/SYSTRAN/faster-whisper) package, enabling inference up to 4× faster than OpenAI's original Whisper implementation.
The model expects 16 kHz audio input; make sure your files are resampled to this rate for best results.
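For context, faster-whisper models are produced by converting a Hugging Face Whisper checkpoint to the CTranslate2 format. If you want to reproduce the conversion for your own fine-tuned checkpoint, CTranslate2 provides the `ct2-transformers-converter` CLI; the command below is a sketch with placeholder paths, not the exact command used for this model:

```bash
# Convert a fine-tuned Whisper checkpoint (Hugging Face format) to CTranslate2.
# The --model and --output_dir paths are placeholders for your own checkpoint.
ct2-transformers-converter \
    --model path/to/finetuned-whisper \
    --output_dir path/to/ct2-model \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16
```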

To save a `.vtt` file with transcriptions and timestamps in `audio_dir`, run the provided transcriber script:

```bash
python transcriber.py --model_path path/to/model --audio_dir path/to/audio --word_timestamps True --vad_filter True
```
Alternatively, to load the model and print word-level transcriptions yourself:
```python
# pip install faster-whisper librosa
import faster_whisper
import librosa

model = faster_whisper.WhisperModel("model.bin")

# Load the audio and resample it to the 16 kHz rate the model expects.
audio_file = "your_audio_file.wav"
audio_data, _ = librosa.load(audio_file, sr=16000)

# word_timestamps=True populates segment.words with per-word timings.
segments, _ = model.transcribe(audio_data, language="ar", word_timestamps=True)

# transcribe() returns a generator, so collect the segments before reusing them.
segments = list(segments)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

transcript = " ".join(s.text for s in segments)
```
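faster-whisper runs on the CPU by default. On a CUDA-capable GPU you can typically speed things up further by constructing the model as, for example, `faster_whisper.WhisperModel("model.bin", device="cuda", compute_type="float16")`; the right device and compute type depend on your hardware.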