---
language: ar
tags:
- text-to-speech
- tts
- arabic
- styletts2
- pl-bert
license: mit
hardware: H100
---
# Model Card for Arabic StyleTTS2
This is an Arabic text-to-speech model based on the StyleTTS2 architecture, adapted specifically for Arabic synthesis. The model achieves good-quality Arabic speech, though not yet state-of-the-art, and further experimentation is needed to optimize it for Arabic. All training objectives from the original StyleTTS2 were maintained, except for the WavLM objectives, which were removed because they were designed primarily for English speech.
## Model Details
### Model Description
This model is a modified version of StyleTTS2, specifically adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was primarily designed for English).
- **Developed by:** Fadi (GitHub: [Fadi987](https://github.com/Fadi987))
- **Model type:** Text-to-Speech (StyleTTS2 architecture)
- **Language(s):** Arabic
- **License:** MIT
- **Finetuned from model:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)
### Model Sources
- **Repository:** [Fadi987/StyleTTS2](https://github.com/Fadi987/StyleTTS2)
- **Paper:** [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
- **PL-BERT Model:** [fadi77/pl-bert](https://huggingface.co/fadi77/pl-bert)
## Uses
### Direct Use
The model can be used for generating Arabic speech from text. To use the model:

1. Clone the StyleTTS2 repository:

   ```bash
   git clone https://github.com/Fadi987/StyleTTS2
   cd StyleTTS2
   ```

2. Install the requirements.
3. Download the model files from this repository.
4. Run inference using:

   ```bash
   python inference.py
   ```

Make sure to set the config path to point to the configuration file from this Hugging Face repository.
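If you prefer to script the download step, the files can be fetched with `huggingface_hub` before running `inference.py`. This is a minimal sketch, assuming a standard Hub file layout; the repo id and filenames below are placeholders, so substitute the actual names from this repository's file listing.

```python
# Minimal sketch: fetch the config and checkpoint from the Hugging Face Hub.
# NOTE: repo_id and the filenames are assumptions -- replace them with the
# actual entries shown in this repository's file listing.
from huggingface_hub import hf_hub_download

repo_id = "<this-repo-id>"  # hypothetical placeholder for this HF repository

config_path = hf_hub_download(repo_id=repo_id, filename="config.yml")  # assumed name
ckpt_path = hf_hub_download(repo_id=repo_id, filename="model.pth")     # assumed name

# Point inference.py (via its config path) at config_path, which in turn
# should reference the downloaded checkpoint.
print(config_path, ckpt_path)
```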
### Out-of-Scope Use
The model is designed specifically for Arabic text-to-speech synthesis and may not perform well on:
- Other languages
- Heavy dialectal variation
- Non-diacritized Arabic text (see the illustration below)
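Since the PL-BERT front end was trained on fully diacritized text (see Training Data below), input text should carry full diacritics. A hypothetical illustration of the distinction:

```python
# Illustration of the input constraint: the model expects fully diacritized
# Arabic; plain text without diacritics is out of scope for this model.
text_in_scope = "مَرْحَبًا بِكُمْ"  # fully diacritized -- supported
text_out_of_scope = "مرحبا بكم"  # non-diacritized -- expect degraded output
```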
## Training Details
### Training Data
- Training was performed on approximately 22 hours of Arabic audiobook data
- Dataset: [fadi77/arabic-audiobook-dataset-24khz](https://huggingface.co/datasets/fadi77/arabic-audiobook-dataset-24khz)
- The PL-BERT component was trained on fully diacritized Arabic Wikipedia text
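The audiobook data can be inspected with the `datasets` library. A small sketch, assuming the dataset exposes a standard `train` split (check the dataset card for the actual schema):

```python
# Sketch: load and inspect the Arabic audiobook training data from the Hub.
# The "train" split name is an assumption -- verify it on the dataset card.
from datasets import load_dataset

ds = load_dataset("fadi77/arabic-audiobook-dataset-24khz", split="train")
print(ds)  # number of rows and column names (e.g., audio and text fields)
```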
### Training Infrastructure
- **Hardware:** Single NVIDIA H100 GPU
- **Training Duration:** 20 epochs
- **Validation Metrics:** Identical to the original StyleTTS2 training methodology
### Training Procedure
#### Training Hyperparameters
- **Number of epochs:** 20
- **Diffusion training:** Started from epoch 5
- **Training objectives:** All original StyleTTS2 objectives maintained, except WavLM adversarial training
- **Validation methodology:** Identical to the original StyleTTS2 training process
- **Notable modifications:**
  - Removed the WavLM adversarial training component
  - Custom PL-BERT trained for Arabic
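As a rough sketch, the settings above could be summarized as follows. This is a hypothetical Python rendering for illustration only; the authoritative key names and values live in the configuration file shipped with this repository.

```python
# Hypothetical summary of the hyperparameters above; these key names are
# illustrative, not the repository's actual config schema.
training_settings = {
    "epochs": 20,                # total training epochs
    "diffusion_start_epoch": 5,  # style-diffusion objective joins at epoch 5
    "wavlm_adversarial": False,  # WavLM/SLM adversarial loss removed
    "plbert": "fadi77/pl-bert",  # custom Arabic PL-BERT front end
}
```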
## Technical Specifications
### Model Architecture and Objective
The model combines:
- A custom-trained Arabic PL-BERT model for text understanding
- The StyleTTS2 architecture for speech synthesis
- A modified training procedure without the WavLM adversarial component
### Compute Infrastructure
- **Hardware Type:** NVIDIA H100 GPU
- **Training Time:** Full training completed in 20 epochs
## Citation
**BibTeX:**

```bibtex
@article{styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2306.07691},
  year={2023}
}
```
## Example
Here is an example output from the model: