Nigerian Accented Text to Speech Model
Table of Contents
- Model Summary
- Model Description
- Bias, Risks, and Limitations
- Speech Samples
- Training
- Citation
- Credits & References
Model Summary
This text-to-speech (TTS) model (v1) synthesizes Nigerian-accented English, producing high-quality, natural-sounding speech for applications such as narration and voice cloning.
Demo
Common Issues
- High-pitched background noise: caused by numerical floating-point differences across environments.
- Slow generation: synthesis speed depends on the compute used to run inference.
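If the high-pitched hiss is audible in your outputs, one possible workaround (not an official fix) is to apply a gentle low-pass filter to the generated waveform; the cutoff frequency and filter order below are illustrative, and the snippet assumes scipy, which ships with Colab.
import numpy as np
from scipy.signal import butter, sosfiltfilt
def denoise_high_pitch(wav: np.ndarray, sr: int = 24000, cutoff_hz: float = 10000.0) -> np.ndarray:
    # attenuate content above cutoff_hz while leaving the speech band untouched
    sos = butter(6, cutoff_hz, btype='low', fs=sr, output='sos')
    return sosfiltfilt(sos, wav)
# clean_wav = denoise_high_pitch(wav)  # where `wav` is the output of inference(...)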
How to use in Colab
This model can generate realistic audio when called with the right inference-function arguments. The model ships with three voices:
- Andy
- Ben
- Oge
- You can also clone any voice by supplying your own reference audio (see the cloning example after the synthesis demo below).
Please refer to the GitHub repository for the complete code.
# install libraries
!sudo apt-get update -y
!apt-get install build-essential -y
!pip install torch tensorboard transformers accelerate SoundFile torchaudio librosa phonemizer
!pip install einops einops-exts tqdm typing typing-extensions munch pydub pyyaml nltk matplotlib
!pip install git+https://github.com/resemble-ai/monotonic_align.git
!pip install hf_transfer -qU
!sudo apt-get install -y espeak-ng
# ____________Download model_____________
import os
from huggingface_hub import hf_hub_download

model_repo = 'benjaminogbonna/tts_demo_models'
model_path = 'Models'
target_files = ['config.yml', 'model_v1.pth']
local_dir = '.'

os.makedirs(local_dir, exist_ok=True)
downloaded_files = []
for file_name in target_files:
    file_path = hf_hub_download(repo_id=model_repo, filename=f'{model_path}/{file_name}', local_dir=local_dir)
    downloaded_files.append(file_path)
print('Downloaded files:', downloaded_files)
#________________________
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
model_folder = 'Models/'
# always pick the most recently trained checkpoint (highest epoch number in the file name)
def _epoch_num(name):
    stem = name.rsplit('.', 1)[0].split('_')[-1]
    return int(stem) if stem.isdigit() else -1
files = [f for f in os.listdir(model_folder) if f.endswith('.pth')]
sorted_files = sorted(files, key=_epoch_num)
print(sorted_files[-1])
#________________________
import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
import random
random.seed(0)
import numpy as np
np.random.seed(0)
# load packages
import time
import random
import yaml
from munch import Munch
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize
from models import *
from utils import *
from text_utils import TextCleaner
textclenaer = TextCleaner()
%matplotlib inline
#________________________
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask + 1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(path):
    # load the reference audio, trim silence, and encode it into a style vector
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)
    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))
    return torch.cat([ref_s, ref_p], dim=1)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)
config = yaml.safe_load(open(f"{model_folder}config.yml"))
# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)
# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)
# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)
model_params = recursive_munch(config['model_params'])
model = build_model(model_params, text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]
#________________________
params_whole = torch.load(f"{model_folder}" + sorted_files[-1], map_location='cpu')
params = params_whole['net']
#________________________
for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except Exception:
            # checkpoints saved with DataParallel prefix keys with `module.`; strip it and retry
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:]  # remove `module.`
                new_state_dict[name] = v
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]
#________________________
from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule
sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0),  # empirical parameters
    clamp=False
)
#________________________
def inference(text, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)
    tokens = textclenaer(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
        s = beta * s + (1 - beta) * ref_s[:, 128:]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)

        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(en)
            asr_new[:, :, 0] = en[:, :, 0]
            asr_new[:, :, 1:] = en[:, :, 0:-1]
            en = asr_new

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(asr)
            asr_new[:, :, 0] = asr[:, :, 0]
            asr_new[:, :, 1:] = asr[:, :, 0:-1]
            asr = asr_new

        out = model.decoder(asr, F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()[..., :-50]  # trim a spurious pulse at the end of the output; to be fixed later
#________________________
# Synthesize speech
text = "We are happy to invite you to join us on a journey to the future."
#________________________
reference_dicts = {}
reference_dicts['oge'] = "ref_audios/things_fall_apart_1.wav" # or use your own audio samples
reference_dicts['ben'] = "ref_audios/feels_good_to_be_odd_1.wav"
#________________________
import IPython.display as ipd

noise = torch.randn(1, 1, 256).to(device)
for k, path in reference_dicts.items():
    ref_s = compute_style(path)
    start = time.time()
    wav = inference(text, ref_s, alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
    rtf = (time.time() - start) / (len(wav) / 24000)
    print(f"RTF = {rtf:5f}")
    print(k + ' Synthesized:')
    display(ipd.Audio(wav, rate=24000, normalize=False))
    print('Reference:')
    display(ipd.Audio(path, rate=24000, normalize=False))
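To clone a custom voice, point the same pipeline at your own recording: a clean 24 kHz mono sample of a few seconds generally works best. The reference path and text below are placeholders.
my_ref = compute_style("ref_audios/my_voice_sample.wav")  # placeholder path to your own recording
wav = inference("This is my cloned voice speaking.", my_ref,
                alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
display(ipd.Audio(wav, rate=24000, normalize=False))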
Model Description
- Developed by: Benjamin
- Model type: Text-to-Speech
- Language(s): English (Nigerian accent)
- Finetuned from: StyleTTS
- Repository: Github Repository
Uses
Generate Nigerian-accented English speech.
Out-of-Scope Use
The model is not suitable for generating speech in languages other than English, or for producing accents other than Nigerian-accented English.
Bias, Risks, and Limitations
The model may not capture the full diversity of Nigerian accents and could exhibit biases based on the training dataset.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Feedback and diverse training data contributions are encouraged.
Speech Samples
Text Input | Audio Output | Notes
---|---|---
We are happy to invite you to join us on a journey to the future. | | Speaker: Ben (alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
We are happy to invite you to join us on a journey to the future. | | Speaker: Oge (alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
If the supply of fruit is greater than the family needs, it may be made a source of income by sending the fresh fruit to the market if there is one near enough, or by preserving, canning, and making jelly for sale. To make such an enterprise a success the fruit and work must be first class. There is magic in the word "Homemade," when the product appeals to the eye and the palate; but many careless and incompetent people have found to their sorrow that this word has not magic enough to float inferior goods on the market. As a rule large canning and preserving establishments are clean and have the best appliances, and they employ chemists and skilled labor. The home product must be very good to compete with the attractive goods that are sent out from such establishments. Yet for first class home made products there is a market in all large cities. All first-class grocers have customers who purchase such goods. | | Long narration, Speaker: Oge (alpha=0.3, beta=0.9, t=0.7, diffusion_steps=10, embedding_scale=1.5)
TODO
- Collect more data for the current speakers and retrain the model for better generalization
- Add more speakers for diversity
Training
Data
This model was trained on a proprietary audio dataset (5 hours) of the authors reading different categories of books.
Preprocessing
Audio files were preprocessed and resampled to 24 kHz.
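For reference, a minimal resampling sketch along these lines (the exact preprocessing pipeline is not published; paths and the helper name are illustrative):
import librosa
import soundfile as sf
def resample_to_24k(in_path: str, out_path: str, target_sr: int = 24000) -> None:
    # load at the native rate, downmix to mono, resample if needed, then write back out
    wav, sr = librosa.load(in_path, sr=None, mono=True)
    if sr != target_sr:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    sf.write(out_path, wav, target_sr)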
Training Hyperparameters
- Number of epochs: 6
- batch_size: 2
- use_lora: False
Hardware
- GPUs: NVIDIA L40S (3 hours)
Software
- Training Framework: PyTorch
Citation
BibTeX:
@misc{ogbonna2025nigeriantts,
  author = {Benjamin O. and Oge N. and Mathias E. and Daniel A.},
  title = {Nigerian-Accented Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/benjaminogbonna/tts_demo_models}
}
APA:
Benjamin O., Oge N., Mathias E., & Daniel A. (2025). Nigerian-Accented Text-to-Speech Model. Hugging Face. https://huggingface.co/benjaminogbonna/tts_demo_models