Nigerian Accented Text to Speech Model
Table of Contents
- Model Summary
- Model Description
- Bias, Risks, and Limitations
- Speech Samples
- Training
- Citation
- Credits & References
Model Summary
This text-to-speech (TTS) model (v1) synthesizes Nigerian-accented English, producing high-quality, natural-sounding speech for applications such as narration and voice cloning.
Demo
Common Issues
- High-pitched background noise: caused by numerical floating-point differences across environments.
- Slow generation: synthesis speed depends on the compute used to run inference.
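If the high-pitched hiss is audible in your outputs, one possible workaround (not an official fix) is to apply a gentle low-pass filter to the generated waveform; the cutoff frequency and filter order below are illustrative, and the snippet assumes scipy, which ships with Colab.
import numpy as np
from scipy.signal import butter, sosfiltfilt
def denoise_high_pitch(wav: np.ndarray, sr: int = 24000, cutoff_hz: float = 10000.0) -> np.ndarray:
    # attenuate content above cutoff_hz while leaving the speech band untouched
    sos = butter(6, cutoff_hz, btype='low', fs=sr, output='sos')
    return sosfiltfilt(sos, wav)
# clean_wav = denoise_high_pitch(wav)  # where `wav` is the output of inference(...)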
How to use in Colab
This model can generate realistic audio when called with the right inference-function arguments. The model ships with three voices:
- Andy
- Ben
- Oge
- You can also clone any voice by supplying your own reference audio (see the cloning example after the synthesis demo below).
Please refer to the GitHub repository for the complete code.
# install libraries
!sudo apt-get update -y
!apt-get install build-essential -y
!pip install torch tensorboard transformers accelerate SoundFile torchaudio librosa phonemizer
!pip install einops einops-exts tqdm typing typing-extensions munch pydub pyyaml nltk matplotlib
!pip install git+https://github.com/resemble-ai/monotonic_align.git
!pip install hf_transfer -qU
!sudo apt-get install -y espeak-ng
# ____________Download model_____________
import os
from huggingface_hub import hf_hub_download

model_repo = 'benjaminogbonna/tts_demo_models'
model_path = 'Models'
target_files = ['config.yml', 'model_v1.pth']
local_dir = '.'

os.makedirs(local_dir, exist_ok=True)
downloaded_files = []
for file_name in target_files:
    file_path = hf_hub_download(repo_id=model_repo, filename=f'{model_path}/{file_name}', local_dir=local_dir)
    downloaded_files.append(file_path)
print('Downloaded files:', downloaded_files)
#________________________
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
model_folder = 'Models/'
# always pick the most recently trained checkpoint (highest epoch number in the file name)
def _epoch_num(name):
    stem = name.rsplit('.', 1)[0].split('_')[-1]
    return int(stem) if stem.isdigit() else -1
files = [f for f in os.listdir(model_folder) if f.endswith('.pth')]
sorted_files = sorted(files, key=_epoch_num)
print(sorted_files[-1])
#________________________
import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
import random
random.seed(0)
import numpy as np
np.random.seed(0)
# load packages
import time
import random
import yaml
from munch import Munch
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize
from models import *
from utils import *
from text_utils import TextCleaner
textclenaer = TextCleaner()
%matplotlib inline
#________________________
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask + 1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(path):
    # load the reference audio, trim silence, and encode it into a style vector
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)
    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))
    return torch.cat([ref_s, ref_p], dim=1)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)
config = yaml.safe_load(open(f"{model_folder}config.yml"))
# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)
# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)
# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)
model_params = recursive_munch(config['model_params'])
model = build_model(model_params, text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]
#________________________
params_whole = torch.load(f"{model_folder}" + sorted_files[-1], map_location='cpu')
params = params_whole['net']
#________________________
for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except Exception:
            # checkpoints saved with DataParallel prefix keys with `module.`; strip it and retry
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:]  # remove `module.`
                new_state_dict[name] = v
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]
#________________________
from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule
sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0),  # empirical parameters
    clamp=False
)
#________________________
def inference(text, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)
    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
        s = beta * s + (1 - beta) * ref_s[:, 128:]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)

        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(en)
            asr_new[:, :, 0] = en[:, :, 0]
            asr_new[:, :, 1:] = en[:, :, 0:-1]
            en = asr_new

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(asr)
            asr_new[:, :, 0] = asr[:, :, 0]
            asr_new[:, :, 1:] = asr[:, :, 0:-1]
            asr = asr_new

        out = model.decoder(asr, F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()[..., :-50]  # trim a spurious pulse at the end of the output; to be fixed later
#________________________
# Synthesize speech
text = "We are happy to invite you to join us on a journey to the future."
#________________________
reference_dicts = {}
reference_dicts['oge'] = "ref_audios/things_fall_apart_1.wav" # or use your own audio samples
reference_dicts['ben'] = "ref_audios/feels_good_to_be_odd_1.wav"
#________________________
import IPython.display as ipd

noise = torch.randn(1, 1, 256).to(device)
for k, path in reference_dicts.items():
    ref_s = compute_style(path)
    start = time.time()
    wav = inference(text, ref_s, alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
    rtf = (time.time() - start) / (len(wav) / 24000)
    print(f"RTF = {rtf:5f}")
    print(k + ' Synthesized:')
    display(ipd.Audio(wav, rate=24000, normalize=False))
    print('Reference:')
    display(ipd.Audio(path, rate=24000, normalize=False))
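To clone a custom voice, point the same pipeline at your own recording: a clean 24 kHz mono sample of a few seconds generally works best. The reference path and text below are placeholders.
my_ref = compute_style("ref_audios/my_voice_sample.wav")  # placeholder path to your own recording
wav = inference("This is my cloned voice speaking.", my_ref,
                alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
display(ipd.Audio(wav, rate=24000, normalize=False))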
Model Description
- Developed by: Benjamin
- Model type: Text-to-Speech
- Language(s): English (Nigerian accent)
- Finetuned from: StyleTTS
- Repository: Github Repository
Uses
Generate Nigerian-accented English speech.
Out-of-Scope Use
The model is not suitable for generating speech in languages other than English, or for producing accents other than Nigerian-accented English.
Bias, Risks, and Limitations
The model may not capture the full diversity of Nigerian accents and could exhibit biases based on the training dataset.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Feedback and diverse training data contributions are encouraged.
Speech Samples
Text Input | Audio Output | Notes
---|---|---
We are happy to invite you to join us on a journey to the future. | | Speaker: Ben (alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
We are happy to invite you to join us on a journey to the future. | | Speaker: Oge (alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
If the supply of fruit is greater than the family needs, it may be made a source of income by sending the fresh fruit to the market if there is one near enough, or by preserving, canning, and making jelly for sale. To make such an enterprise a success the fruit and work must be first class. There is magic in the word "Homemade," when the product appeals to the eye and the palate; but many careless and incompetent people have found to their sorrow that this word has not magic enough to float inferior goods on the market. As a rule large canning and preserving establishments are clean and have the best appliances, and they employ chemists and skilled labor. The home product must be very good to compete with the attractive goods that are sent out from such establishments. Yet for first class home made products there is a market in all large cities. All first-class grocers have customers who purchase such goods. | | Long narration, Speaker: Oge (alpha=0.3, beta=0.9, t=0.7, diffusion_steps=10, embedding_scale=1.5)
TODO
- Collect more data for the current speakers and retrain the model for better generalization
- Add more speakers for diversity
Training
Data
This model was trained on a proprietary audio dataset (5 hours) of the authors reading different categories of books.
Preprocessing
Audio files were preprocessed and resampled to 24 kHz.
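For reference, a minimal resampling sketch along these lines (the exact preprocessing pipeline is not published; paths and the helper name are illustrative):
import librosa
import soundfile as sf
def resample_to_24k(in_path: str, out_path: str, target_sr: int = 24000) -> None:
    # load at the native rate, downmix to mono, resample if needed, then write back out
    wav, sr = librosa.load(in_path, sr=None, mono=True)
    if sr != target_sr:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    sf.write(out_path, wav, target_sr)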
Training Hyperparameters
- Number of epochs: 6
- batch_size: 2
- use_lora: False
Hardware
- GPUs: NVIDIA L40S (3 hours)
Software
- Training Framework: PyTorch
Citation
BibTeX:
@misc{ogbonna2025nigeriantts,
  author = {Benjamin O. and Oge N. and Mathias E. and Daniel A.},
  title = {Nigerian-Accented Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminogbonna/tts_demo_models}
}
APA:
Benjamin O., Oge N., Mathias E., & Daniel A. (2025). Nigerian-Accented Text-to-Speech Model. Hugging Face. https://huggingface.co/benjaminogbonna/tts_demo_models