Afro-TTS

Afro-TTS is the first pan-African accented English speech synthesis system, capable of generating speech in 86 African accents. It includes 1000 personas that represent the rich phonological diversity across the continent, for applications in education, public health, and automated content creation. Afro-TTS lets you clone a voice into different African accents using just a 6-second audio clip. The model was adapted from the XTTS model, which was developed by Coqui Studio.

Read more about this model in our paper: https://arxiv.org/abs/2406.11727

Features

Supports 86 unique African accents
Voice cloning with just a 6-second audio clip
Emotion and style transfer by cloning
Multi-accent English speech generation
24kHz sampling rate for high-quality audio

Performance

Afro-TTS achieves near-ground-truth Mean Opinion Scores (MOS) for naturalness and accentedness. Objective and subjective evaluations showed that the model generates natural-sounding accented speech, helping to close the current gap in the representation of African voices in speech synthesis.

Languages

Afro-TTS supports only English for now. Stay tuned as we continue to add support for more languages. If you have any language requests, feel free to reach out!

Code

The codebase for the paper accompanying this model can be found here.

License

This model is licensed under the Coqui Public Model License (CPML). There's a lot that goes into a license for generative models, and you can read more about the origin story of CPML here.

Contact

Come and join our Bioramp Community. We're active on the Masakhane Slack server and on our website. You can also email the authors at [email protected], [email protected]

Using Afro-TTS

Install the Coqui TTS package:

pip install TTS

Run the following code:

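import torch
from scipy.io.wavfile import write
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model configuration and checkpoint from the model directory.
config = XttsConfig()
config.load_json("intronhealth/afro-tts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="intronhealth/afro-tts/", eval=True)

# Move the model to the GPU when one is available; otherwise it stays on the CPU.
if torch.cuda.is_available():
    model.cuda()

# Synthesize English speech in the accent of the reference clip.
outputs = model.synthesize(
    "Abraham said today is a good day to sound like an African.",
    config,
    speaker_wav="audios/reference_accent.wav",
    gpt_cond_len=3,
    language="en",
)

# Save the generated waveform at the model's 24 kHz sampling rate.
write("audios/output.wav", 24000, outputs["wav"])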
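Because cloning relies on a reference clip of about 6 seconds, it may help to check the clip's duration before synthesis. The following is a minimal sketch using only Python's standard-library wave module. The helper names wav_duration_seconds and is_long_enough and the constant MIN_REFERENCE_SECONDS are illustrative assumptions, not part of Afro-TTS or the Coqui TTS package, and the sketch only reads uncompressed PCM WAV files.

```python
import wave

# Assumption: threshold taken from the "6-second audio clip" guidance above.
MIN_REFERENCE_SECONDS = 6.0


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=MIN_REFERENCE_SECONDS):
    """Check whether a reference clip meets the assumed minimum length."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, is_long_enough("audios/reference_accent.wav") could be called before model.synthesize to warn about a clip that is shorter than the assumed threshold.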

BibTeX entry and citation info.

@misc{ogun20241000,
      title={1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis}, 
      author={Sewade Ogun and Abraham T. Owodunni and Tobi Olatunji and Eniola Alese and Babatunde Oladimeji and Tejumade Afonja and Kayode Olaleye and Naome A. Etori and Tosin Adewumi},
      year={2024},
      eprint={2406.11727},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}