alakxender committed
Commit 04143a9 · 0 Parent(s)
Files changed (5)
  1. .gitattributes +35 -0
  2. .gitignore +2 -0
  3. README.md +53 -0
  4. app.py +130 -0
  5. requirements.txt +4 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
+ lib/__pycache__
+ ./gradio
README.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ title: TTS Dhivehi Demo - MMS-TTS
+ emoji: ⚡
+ colorFrom: indigo
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.25.2
+ app_file: app.py
+ pinned: false
+ ---
+
+ ### Fine-tuned Text-to-Speech Model for Divehi
+
+ #### Overview
+
+ This project fine-tunes a Text-to-Speech (TTS) model for the Divehi language using Hugging Face tools and pre-trained Massively Multilingual Speech (MMS) checkpoints, which are built on the VITS architecture. Divehi, also known as Maldivian, is an Indo-Aryan language spoken in the Maldives. The aim of the fine-tuning is to improve the TTS system's ability to generate natural and accurate Divehi speech from text input.
+
+ #### Model Description
+
+ **Base Models:**
+ - **Massively Multilingual Speech (MMS):** A project from Meta AI (formerly Facebook AI Research) that provides pre-trained TTS checkpoints for more than 1,000 languages, giving a solid starting point for fine-tuning.
+ - **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech):** An end-to-end TTS model that combines variational inference with adversarial (GAN) training to produce high-quality, natural-sounding speech.
+
+ **Fine-tuning Process:**
+ 1. **Data Collection and Preparation:**
+    - **Text Corpus:** Compilation of a large and diverse Divehi text corpus to capture the language's phonetic and syntactic properties.
+    - **Audio Samples:** Collection of high-quality Divehi audio recordings, including:
+      - **Common Voice Dataset:** Mozilla's Common Voice dataset, which includes a substantial number of Divehi audio samples.
+      - **Synthesized Data:** Over 16 hours of synthesized Divehi speech data to augment the training set.
+    - **Alignment:** Ensuring the text-audio pairs are accurately aligned for effective training.
+
+ 2. **Training Setup:**
+    - **Hugging Face Transformers Library:** Used to load the pre-trained models and manage the fine-tuning process.
+    - **Tokenization:** Employing a tokenizer suitable for Divehi to convert text into token sequences that the model can process.
+    - **Model Configuration:** Adjusting training hyperparameters for Divehi, including learning rate, batch size, and number of epochs.
+
+ 3. **Fine-tuning:**
+    - **VITS Fine-tuning:** Fine-tuning the VITS architecture on Divehi-specific text-audio pairs, focusing on generating Divehi phonetics accurately.
+    - **MMS Fine-tuning:** Further fine-tuning the MMS checkpoint with a specific focus on Divehi.
+
+ 4. **Evaluation and Testing:**
+    - **Quality Assessment:** Using Mean Opinion Score (MOS), a subjective rating averaged over listeners, together with listening tests to evaluate the naturalness and accuracy of the generated Divehi speech.
+    - **Error Analysis:** Identifying and correcting common errors such as mispronunciations, intonation issues, and unnatural pacing.
+
+ #### Benefits
+
+ - **High-Quality Speech Synthesis:** Produces natural and intelligible Divehi speech, suitable for applications in virtual assistants, audiobooks, and accessibility tools.
+ - **Cultural Preservation:** Supports the digital presence and preservation of the Divehi language through speech technology.
+ - **Customizability:** Fine-tuning allows for further adjustments and improvements based on specific use cases and user feedback.
+
+ #### Conclusion
+
+ The fine-tuned Divehi Text-to-Speech model aims to make speech technology more accessible and usable for the Divehi-speaking community. It combines the MMS and VITS models with Hugging Face tooling and a dataset drawn from Common Voice and synthesized speech to deliver a TTS solution tailored to the characteristics of the Divehi language.
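The README's "Evaluation and Testing" step mentions Mean Opinion Score without saying how it is computed. A minimal sketch follows, assuming listeners rate each clip on the usual 1 to 5 scale and that a normal-approximation 95% confidence interval is acceptable. The function name and the ratings below are hypothetical and are not part of this commit.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.fmean(ratings)
    # Normal approximation: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings for one synthesized clip (not real data).
example_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

With only a handful of listeners the interval is wide, so scores for different voices are only comparable when the intervals are reported alongside the means.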
app.py ADDED
@@ -0,0 +1,130 @@
+ import gradio as gr
+ import torch
+ from transformers import VitsTokenizer, VitsModel, set_seed
+ import tempfile
+ import numpy as np
+ from scipy.io.wavfile import write
+ from dv_normalize.dv_sentence import spoken_dv
+
+ # HuggingFace models with default seeds
+ models = {
+     "MMS TTS Base": {"model": "alakxender/mms-tts-div", "seed": 555},
+     "Female F01 (CV)": {"model": "alakxender/mms-tts-div-finetuned-md-f01", "seed": 555},
+     "Female F02 (CV, pitch/tempo changed)": {"model": "alakxender/mms-tts-div-finetuned-md-f02", "seed": 555},
+     "Female F03 (CV, pitch/tempo changed)": {"model": "alakxender/mms-tts-div-finetuned-md-f03", "seed": 555},
+     "Female F04 (CV, rvc-test)": {"model": "alakxender/mms-tts-speak-f01", "seed": 555},
+     "Female F01 (z-test)": {"model": "alakxender/mms-tts-div-ft-spk01-f01", "seed": 555},
+     #"Female Unknown 👩🏽 (🤷‍♀️)": {"model": "alakxender/mms-tts-div-finetuned-sm-fu01", "seed": 555},
+     "Male M01 (CV) 👨🏽": {"model": "alakxender/mms-tts-div-finetuned-md-m01", "seed": 555},
+     #"Male M02 (javaabu/shaafiu)": {"model": "alakxender/mms-tts-div-finetuned-sm-mu01", "seed": 555},
+     "Male M02 (z-test)": {"model": "alakxender/mms-tts-div-ft-spk01-m01", "seed": 620},
+     "Male M02 (z-test)1": {"model": "alakxender/mms-tts-div-finetuned-m-spk01-t1", "seed": 555}
+ }
+
+ def tts(text: str, model_name: str, seed_value: int = None):
+     if (len(text) > 2000):
+         raise gr.Error(f"huh! using free cpu here!, try a small chunk of data. Yours is {len(text)}. try to fit to 2000 chars.")
+     if (model_name is None):
+         raise gr.Error("huh! not sure what to do without a model. select a model.")
+
+     # Use default seed if none provided
+     if seed_value is None:
+         seed_value = models[model_name]["seed"]
+
+     print(f"Loading...{models[model_name]['model']}")
+     # Load the MMS-TTS model
+     tokenizer = VitsTokenizer.from_pretrained(models[model_name]["model"])
+     model = VitsModel.from_pretrained(models[model_name]["model"])
+     print("Model loaded.")
+
+     # normalize the dv text from written to spoken
+     print(f"Normalizing: {text}")
+     text = spoken_dv(text)
+     print(f"Normalized: {text}")
+
+     # Preprocess the input text
+     inputs = tokenizer(text=text, return_tensors="pt")
+     print("Preprocess done.")
+
+     # Make the speech synthesis deterministic with user-defined seed
+     print(f"Setting seed to: {seed_value}")
+     set_seed(seed_value)
+
+     # Generate the audio waveform
+     print("Generating audio...")
+     with torch.no_grad():
+         outputs = model(**inputs)
+     waveform = outputs.waveform[0]
+     sample_rate = model.config.sampling_rate
+
+     with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
+         # Save the waveform to the temporary file
+         write(f.name, sample_rate, waveform.numpy().T)
+         # Get the file name
+         waveform_file = f.name
+     print("done.")
+     return waveform_file
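Review note: `tts` calls `from_pretrained` for both the tokenizer and the model on every request, so each synthesis pays the full load cost again. One possible way to avoid that is a small cache keyed by repository id. The sketch below is an assumption about how this could be done, not code from the commit. `make_cached_loader` and `fake_load` are hypothetical names, and the stand-in loader exists only so the example runs without downloading a checkpoint.

```python
from functools import lru_cache
from typing import Any, Callable


def make_cached_loader(load_fn: Callable[[str], Any], maxsize: int = 4) -> Callable[[str], Any]:
    """Wrap load_fn so each repo id is loaded once while it stays in the cache."""

    @lru_cache(maxsize=maxsize)
    def cached(repo_id: str) -> Any:
        return load_fn(repo_id)

    return cached


# Stand-in loader so the sketch runs offline. In app.py the loader would
# instead return (VitsTokenizer.from_pretrained(repo_id),
# VitsModel.from_pretrained(repo_id)).
load_calls = []


def fake_load(repo_id: str) -> str:
    load_calls.append(repo_id)
    return f"loaded:{repo_id}"


get_model = make_cached_loader(fake_load)
get_model("alakxender/mms-tts-div")
get_model("alakxender/mms-tts-div")  # second call is served from the cache
print(len(load_calls))
```

The `maxsize` bound matters on a free CPU Space: keeping every voice in memory at once may exceed the available RAM, so a small cache that evicts the least recently used model is a reasonable compromise.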
+
+ def get_default_seed(model_name):
+     return models[model_name]["seed"]
+
+ css = """
+ .textbox1 textarea {
+     font-size: 18px !important;
+     font-family: 'MV_Faseyha', 'Faruma', 'A_Faruma' !important;
+     line-height: 1.8 !important;
+ }
+ """
+
+ with gr.Blocks(css=css) as demo:
+     gr.Markdown("# <center> DV Text-To-Speech </center>")
+     gr.Markdown("This interface converts Divehi text into natural-sounding speech using a fine-tuned Text-to-Speech model built on the Massively Multilingual Speech (MMS) and VITS models. Text normalization is also applied to handle various input formats.")
+
+     with gr.Row():
+         with gr.Column(scale=3):
+             text = gr.TextArea(
+                 label="Input text",
+                 placeholder="ދިވެހި ބަހުން ކޮންމެވެސް އެއްޗެކޭ މިތާ ލިޔެބަލަ",
+                 rtl=True,
+                 elem_classes="textbox1"
+             )
+         with gr.Column(scale=1):
+             model_name = gr.Dropdown(
+                 choices=list(models.keys()),
+                 label="Select TTS Model",
+                 value=list(models.keys())[5]  # Default to "Female F01 (z-test)" (index 5)
+             )
+             seed_slider = gr.Slider(
+                 minimum=0,
+                 maximum=1000,
+                 value=555,  # Default value
+                 step=1,
+                 label="Seed Value (affects voice variation)"
+             )
+
+     # Update seed slider when model changes
+     model_name.change(
+         fn=get_default_seed,
+         inputs=[model_name],
+         outputs=[seed_slider]
+     )
+
+     btn = gr.Button("Text-To-Speech")
+     output_audio = gr.Audio(label="Speech Output")
+
+     # Add examples section
+     with gr.Accordion("Examples", open=True):
+         example_text = "އައްޑޫގެ ގުޅިފައިވާ ރަށްތަކުގައި އެންމެ މަތިން ކަރަންޓު ބޭނުންވާ ގަޑިތަކުގައި 12 މެގަވޮޓްގެ ކަރަންޓު ބޭނުންވެ އެވެ. ކަރަންޓު ފޯރުކޮށްދިނުމަށް ހިތަދޫގައި ބަހައްޓާފައި ވަނީ 20 ޖެނަރޭޓަރު ސެޓެވެ. އޭގެ ކެޕޭސިޓީއަކީ 26.8 މެގަވޮޓެވެ. އެކަމަކު އޭގެ ތެރެއިން ފަސް ޖެނަރޭޓަރު ހަލާކުވުމާ ގުޅިގެން އޭރު އުފެއްދުނީ 15 މެގަވޮޓެވެ."
+         gr.Examples(
+             [[example_text, list(models.keys())[5], models[list(models.keys())[5]]["seed"]]],
+             [text, model_name, seed_slider],
+             fn=tts,
+             outputs=output_audio
+         )
+
+     text.submit(fn=tts, inputs=[text, model_name, seed_slider], outputs=output_audio)
+     btn.click(fn=tts, inputs=[text, model_name, seed_slider], outputs=output_audio)
+
+ # Launch the Gradio app
+ if __name__ == "__main__":
+     demo.launch()
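Review note: `tts` rejects any input longer than 2000 characters. If longer passages ever need to be supported, one option is to split the text at sentence boundaries and synthesize each piece separately. The helper below is a hypothetical sketch of that splitting step only. It assumes sentences end with `.`, `!`, `?` or the Arabic question mark followed by whitespace, which may not cover every Divehi text, and it is not part of this commit.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?؟])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Tiny limit chosen only to make the behaviour visible.
print(split_into_chunks("A b. C d. E f.", max_chars=9))
```

Each chunk could then be passed through the existing normalization and synthesis path, with the resulting waveforms concatenated before writing the WAV file.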
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ dv-normalizer
+ torch
+ transformers
+ scipy