---
language: ar
tags:
- text-to-speech
- tts
- arabic
- styletts2
- pl-bert
license: mit
hardware: H100
---
# Model Card for Arabic StyleTTS2
This is an Arabic text-to-speech model based on the StyleTTS2 architecture, adapted specifically for Arabic synthesis. The model achieves good-quality Arabic speech, though not yet state-of-the-art, and further experimentation is needed to optimize it for Arabic. All training objectives from the original StyleTTS2 were maintained, except for the WavLM objectives, which were removed because they were designed primarily for English speech.
## Model Details
### Model Description
This model is a modified version of StyleTTS2, specifically adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was primarily designed for English).
- **Developed by:** Fadi (GitHub: [Fadi987](https://github.com/Fadi987))
- **Model type:** Text-to-Speech (StyleTTS2 architecture)
- **Language(s):** Arabic
- **License:** MIT
- **Finetuned from model:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)
### Model Sources
- **Repository:** [Fadi987/StyleTTS2](https://github.com/Fadi987/StyleTTS2)
- **Paper:** [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
- **PL-BERT Model:** [fadi77/pl-bert](https://huggingface.co/fadi77/pl-bert)
## Uses
### Direct Use
The model can be used for generating Arabic speech from text. To use the model:

1. Clone the StyleTTS2 repository:

   ```bash
   git clone https://github.com/Fadi987/StyleTTS2
   cd StyleTTS2
   ```

2. Install the requirements.
3. Download the model files from this repository.
4. Run inference using:

   ```bash
   python inference.py
   ```

Make sure to set the config path to point to the configuration file from this Hugging Face repository.
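If you prefer to script the download step, the files can be fetched with `huggingface_hub` before running `inference.py`. This is a minimal sketch, assuming a standard Hub file layout; the repo id and filenames below are placeholders, so substitute the actual names from this repository's file listing.

```python
# Minimal sketch: fetch the config and checkpoint from the Hugging Face Hub.
# NOTE: repo_id and the filenames are assumptions -- replace them with the
# actual entries shown in this repository's file listing.
from huggingface_hub import hf_hub_download

repo_id = "<this-repo-id>"  # hypothetical placeholder for this HF repository

config_path = hf_hub_download(repo_id=repo_id, filename="config.yml")  # assumed name
ckpt_path = hf_hub_download(repo_id=repo_id, filename="model.pth")     # assumed name

# Point inference.py (via its config path) at config_path, which in turn
# should reference the downloaded checkpoint.
print(config_path, ckpt_path)
```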
### Out-of-Scope Use
The model is designed specifically for Arabic text-to-speech synthesis and may not perform well on:
- Other languages
- Heavy dialectal variation
- Non-diacritized Arabic text (see the illustration below)
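Since the PL-BERT front end was trained on fully diacritized text (see Training Data below), input text should carry full diacritics. A hypothetical illustration of the distinction:

```python
# Illustration of the input constraint: the model expects fully diacritized
# Arabic; plain text without diacritics is out of scope for this model.
text_in_scope = "مَرْحَبًا بِكُمْ"  # fully diacritized -- supported
text_out_of_scope = "مرحبا بكم"  # non-diacritized -- expect degraded output
```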
## Training Details
### Training Data
- Training was performed on approximately 22 hours of Arabic audiobook data
- Dataset: [fadi77/arabic-audiobook-dataset-24khz](https://huggingface.co/datasets/fadi77/arabic-audiobook-dataset-24khz)
- The PL-BERT component was trained on fully diacritized Arabic Wikipedia text
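The audiobook data can be inspected with the `datasets` library. A small sketch, assuming the dataset exposes a standard `train` split (check the dataset card for the actual schema):

```python
# Sketch: load and inspect the Arabic audiobook training data from the Hub.
# The "train" split name is an assumption -- verify it on the dataset card.
from datasets import load_dataset

ds = load_dataset("fadi77/arabic-audiobook-dataset-24khz", split="train")
print(ds)  # number of rows and column names (e.g., audio and text fields)
```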
### Training Infrastructure
- **Hardware:** Single NVIDIA H100 GPU
- **Training Duration:** 20 epochs
- **Validation Metrics:** Identical to the original StyleTTS2 training methodology
### Training Procedure
#### Training Hyperparameters
- **Number of epochs:** 20
- **Diffusion training:** Started from epoch 5
- **Training objectives:** All original StyleTTS2 objectives maintained, except WavLM adversarial training
- **Validation methodology:** Identical to the original StyleTTS2 training process
- **Notable modifications:**
  - Removed the WavLM adversarial training component
  - Custom PL-BERT trained for Arabic
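As a rough sketch, the settings above could be summarized as follows. This is a hypothetical Python rendering for illustration only; the authoritative key names and values live in the configuration file shipped with this repository.

```python
# Hypothetical summary of the hyperparameters above; these key names are
# illustrative, not the repository's actual config schema.
training_settings = {
    "epochs": 20,                # total training epochs
    "diffusion_start_epoch": 5,  # style-diffusion objective joins at epoch 5
    "wavlm_adversarial": False,  # WavLM/SLM adversarial loss removed
    "plbert": "fadi77/pl-bert",  # custom Arabic PL-BERT front end
}
```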
## Technical Specifications
### Model Architecture and Objective
The model combines:
- A custom-trained Arabic PL-BERT model for text understanding
- The StyleTTS2 architecture for speech synthesis
- A modified training procedure without the WavLM adversarial component
### Compute Infrastructure
- **Hardware Type:** NVIDIA H100 GPU
- **Training Time:** Full training completed in 20 epochs
## Citation
**BibTeX:**

```bibtex
@article{styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2306.07691},
  year={2023}
}
```
## Example
Here is an example output from the model: