Spaces: alakxender/tts-dhivehi-demo-mms


alakxender committed on May 3

Commit

04143a9

0 Parent(s):

c


Files changed (5)

.gitattributes +35 -0
.gitignore +2 -0
README.md +53 -0
app.py +130 -0
requirements.txt +4 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
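
The rules above route large binary artifacts (model weights, archives, event logs) through Git LFS. As a rough illustration of which files they catch, the sketch below checks file names against a small subset of those patterns. It relies on the fact that gitattributes patterns without a slash are matched against the file's base name, and it approximates Git's matching with Python's `fnmatchcase`; the helper name `is_lfs_tracked` and the chosen subset are assumptions made for this example, not part of the commit.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Sequence

# A subset of the slash-free patterns from .gitattributes (illustrative only).
LFS_PATTERNS = ["*.bin", "*.pt", "*.safetensors", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(path: str, patterns: Sequence[str] = LFS_PATTERNS) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


print(is_lfs_tracked("model.safetensors"))
print(is_lfs_tracked("app.py"))
```

Patterns that contain a slash, such as `saved_model/**/*`, are matched against the full path by Git and are deliberately left out of this sketch.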

.gitignore ADDED

	@@ -0,0 +1,2 @@


+lib/__pycache__
+./gradio

README.md ADDED

	@@ -0,0 +1,53 @@

+---
+title: TTS Dhivehi Demo - MMS-TTS
+emoji: ⚡
+colorFrom: indigo
+colorTo: purple
+sdk: gradio
+sdk_version: 5.25.2
+app_file: app.py
+pinned: false
+---
+### Fine-tuned Text-to-Speech Model for Divehi
+#### Overview
+This project fine-tunes a Text-to-Speech (TTS) model for the Divehi language using Hugging Face tools and pre-trained models, including the Massively Multilingual Speech (MMS) and VITS frameworks. Divehi, also known as Maldivian, is an Indo-Aryan language spoken in the Maldives. The aim of fine-tuning is to improve the TTS system's ability to generate natural and accurate Divehi speech from text input.
+#### Model Description
+**Base Models:**
+- **Massively Multilingual Speech (MMS):** A pre-trained model from Facebook AI Research designed to handle TTS tasks across multiple languages. MMS provides a robust foundation with extensive language support and pre-learned phonetic nuances.
+- **VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech):** A TTS model that combines variational inference with adversarial (GAN-style) training to produce high-quality, natural-sounding speech.
+**Fine-tuning Process:**
+1. **Data Collection and Preparation:**
+   - **Text Corpus:** Compilation of a large and diverse Divehi text corpus to capture the language's phonetic and syntactic properties.
+   - **Audio Samples:** Collection of high-quality Divehi audio recordings, including:
+     - **Common Voice Dataset:** Leveraging Mozilla's Common Voice dataset, which includes a substantial number of Divehi audio samples.
+     - **Synthesized Data:** Utilizing over 16 hours of synthesized Divehi speech data to augment the training set.
+   - **Alignment:** Ensuring the text-audio pairs are accurately aligned for effective training.
+2. **Training Setup:**
+   - **HuggingFace's Transformers Library:** Utilizing HuggingFace’s easy-to-use interface for loading pre-trained models and managing the fine-tuning process.
+   - **Tokenization:** Employing a tokenizer suitable for Divehi to convert text into token sequences that the model can process.
+   - **Model Configuration:** Adjusting model parameters to optimize performance for Divehi, including learning rates, batch sizes, and epochs.
+3. **Fine-tuning:**
+   - **VITS Fine-tuning:** Leveraging the VITS model’s architecture to fine-tune on Divehi-specific text-audio pairs, focusing on improving the model’s ability to generate Divehi phonetics accurately.
+   - **MMS Fine-tuning:** Further fine-tuning the MMS model to enhance its multilingual capabilities with a specific focus on Divehi.
+4. **Evaluation and Testing:**
+   - **Quality Assessment:** Using Mean Opinion Score (MOS) ratings collected in subjective listening tests to evaluate the naturalness and accuracy of the generated Divehi speech.
+   - **Error Analysis:** Identifying and rectifying common errors such as mispronunciations, intonation issues, and unnatural pacing.
+#### Benefits
+- **High-Quality Speech Synthesis:** Produces natural and intelligible Divehi speech, suitable for applications in virtual assistants, audiobooks, and accessibility tools.
+- **Cultural Preservation:** Supports the digital presence and preservation of the Divehi language through advanced speech technology.
+- **Customizability:** Fine-tuning allows for further adjustments and improvements based on specific use cases and user feedback.
+#### Conclusion
+The fine-tuned Divehi Text-to-Speech model represents a significant advancement in the accessibility and usability of speech technology for the Divehi-speaking community. By combining the strengths of the MMS and VITS models with the flexibility of HuggingFace's tools, and leveraging a rich dataset including the Common Voice dataset and synthesized data, this project delivers a high-quality, linguistically accurate TTS solution tailored to the unique characteristics of the Divehi language.
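
The README's evaluation step mentions Mean Opinion Score (MOS). As a minimal sketch of how such a score is commonly aggregated, the snippet below averages listener ratings on the usual 1 to 5 scale and reports a 95% confidence half-width using a normal approximation. The function name `mean_opinion_score` and the ratings are hypothetical and purely illustrative; they are not results from this project.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.fmean(ratings)
    # Normal approximation: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from five listeners for one utterance (illustration only).
example_ratings = [4, 5, 3, 4, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the normal approximation is crude; a real listening test would typically use more raters and utterances.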

app.py ADDED

	@@ -0,0 +1,130 @@

+import gradio as gr
+import torch
+from transformers import VitsTokenizer, VitsModel, set_seed
+import tempfile
+import numpy as np
+from scipy.io.wavfile import write
+from dv_normalize.dv_sentence import spoken_dv
+# HuggingFace models with default seeds
+models = {
+    "MMS TTS Base": {"model": "alakxender/mms-tts-div", "seed": 555},
+    "Female F01 (CV)": {"model": "alakxender/mms-tts-div-finetuned-md-f01", "seed": 555},
+    "Female F02 (CV, pitch/tempo changed)": {"model": "alakxender/mms-tts-div-finetuned-md-f02", "seed": 555},
+    "Female F03 (CV, pitch/tempo changed)": {"model": "alakxender/mms-tts-div-finetuned-md-f03", "seed": 555},
+    "Female F04 (CV, rvc-test)": {"model": "alakxender/mms-tts-speak-f01", "seed": 555},
+    "Female F01 (z-test)": {"model": "alakxender/mms-tts-div-ft-spk01-f01", "seed": 555},
+    #"Female Unknown 👩🏽 (🤷‍♀️)": {"model": "alakxender/mms-tts-div-finetuned-sm-fu01", "seed": 555},
+    "Male M01 (CV) 👨🏽": {"model": "alakxender/mms-tts-div-finetuned-md-m01", "seed": 555},
+    #"Male M02 (javaabu/shaafiu)": {"model": "alakxender/mms-tts-div-finetuned-sm-mu01", "seed": 555},
+    "Male M02 (z-test)": {"model": "alakxender/mms-tts-div-ft-spk01-m01", "seed": 620},
+    "Male M02 (z-test)1": {"model": "alakxender/mms-tts-div-finetuned-m-spk01-t1", "seed": 555}
+}
+def tts(text: str, model_name: str, seed_value: int = None):
+    if (len(text) > 2000):
+        raise gr.Error(f"huh! using free cpu here!, try a small chunk of data. Yours is {len(text)}. try to fit to 2000 chars.")
+    if (model_name is None):
+        raise gr.Error("huh! not sure what to do without a model. select a model.")
+    # Use default seed if none provided
+    if seed_value is None:
+        seed_value = models[model_name]["seed"]
+    print(f"Loading...{models[model_name]['model']}")
+    # Load the MMS-TTS model
+    tokenizer = VitsTokenizer.from_pretrained(models[model_name]["model"])
+    model = VitsModel.from_pretrained(models[model_name]["model"])
+    print("Model loaded.")
+    # normalize the dv text from written to spoken
+    print(f"Normalizing: {text}")
+    text = spoken_dv(text)
+    print(f"Normalized: {text}")
+    # Preprocess the input text
+    inputs = tokenizer(text=text, return_tensors="pt")
+    print("Preprocess done.")
+    # Make the speech synthesis deterministic with user-defined seed
+    print(f"Setting seed to: {seed_value}")
+    set_seed(seed_value)
+    # Generate the audio waveform
+    print("Generating audio...")
+    with torch.no_grad():
+        outputs = model(**inputs)
+    waveform = outputs.waveform[0]
+    sample_rate = model.config.sampling_rate
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
+        # Save the waveform to the temporary file
+        write(f.name, sample_rate, waveform.numpy().T)
+        # Get the file name
+        waveform_file = f.name
+    print("done.")
+    return waveform_file
+def get_default_seed(model_name):
+    return models[model_name]["seed"]
+css = """
+.textbox1 textarea {
+    font-size: 18px !important;
+    font-family: 'MV_Faseyha', 'Faruma', 'A_Faruma' !important;
+    line-height: 1.8 !important;
+}
+"""
+with gr.Blocks(css=css) as demo:
+    gr.Markdown("# <center> DV Text-To-Speech </center>")
+    gr.Markdown("This interface converts Divehi text into natural-sounding speech using a fine-tuned Text-to-Speech model built on Massively Multilingual Speech (MMS) and VITS. Text normalization is also applied so that written forms such as numbers are read out as spoken Divehi.")
+    with gr.Row():
+        with gr.Column(scale=3):
+            text = gr.TextArea(
+                label="Input text",
+                placeholder="ދިވެހި ބަހުން ކޮންމެވެސް އެއްޗެކޭ މިތާ ލިޔެބަލަ",
+                rtl=True,
+                elem_classes="textbox1"
+            )
+        with gr.Column(scale=1):
+            model_name = gr.Dropdown(
+                choices=list(models.keys()),
+                label="Select TTS Model",
+                value=list(models.keys())[5]  # Default to the sixth entry, "Female F01 (z-test)"
+            )
+            seed_slider = gr.Slider(
+                minimum=0,
+                maximum=1000,
+                value=555,  # Default value
+                step=1,
+                label="Seed Value (affects voice variation)"
+            )
+    # Update seed slider when model changes
+    model_name.change(
+        fn=get_default_seed,
+        inputs=[model_name],
+        outputs=[seed_slider]
+    )
+    btn = gr.Button("Text-To-Speech")
+    output_audio = gr.Audio(label="Speech Output")
+    # Add examples section
+    with gr.Accordion("Examples", open=True):
+        example_text = "އައްޑޫގެ ގުޅިފައިވާ ރަށްތަކުގައި އެންމެ މަތިން ކަރަންޓު ބޭނުންވާ ގަޑިތަކުގައި 12 މެގަވޮޓްގެ ކަރަންޓު ބޭނުންވެ އެވެ. ކަރަންޓު ފޯރުކޮށްދިނުމަށް ހިތަދޫގައި ބަހައްޓާފައި ވަނީ 20 ޖެނަރޭޓަރު ސެޓެވެ. އޭގެ ކެޕޭސިޓީއަކީ 26.8 މެގަވޮޓެވެ. އެކަމަކު އޭގެ ތެރެއިން ފަސް ޖެނަރޭޓަރު ހަލާކުވުމާ ގުޅިގެން އޭރު އުފެއްދުނީ 15 މެގަވޮޓެވެ."
+        gr.Examples(
+            [[example_text, list(models.keys())[5], models[list(models.keys())[5]]["seed"]]],
+            [text, model_name, seed_slider],
+            fn=tts,
+            outputs=output_audio
+        )
+    text.submit(fn=tts, inputs=[text, model_name, seed_slider], outputs=output_audio)
+    btn.click(fn=tts, inputs=[text, model_name, seed_slider], outputs=output_audio)
+# Launch the Gradio app
+if __name__ == "__main__":
+    demo.launch()
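
In `app.py` as committed, `tts()` calls `from_pretrained` for the tokenizer and the model on every request, so each click reloads the checkpoint. One possible refinement, sketched below under stated assumptions, is to memoize the loader per repository id with `functools.lru_cache`. The names `make_cached_loader`, `load_vits`, and `get_model`, and the cache size of 2, are hypothetical choices for this example and are not part of the commit.

```python
from functools import lru_cache
from typing import Any, Callable, Tuple


def make_cached_loader(load_fn: Callable[[str], Any], maxsize: int = 2) -> Callable[[str], Any]:
    """Wrap load_fn so each repo id is loaded once and then served from cache."""

    @lru_cache(maxsize=maxsize)
    def cached(repo_id: str) -> Any:
        return load_fn(repo_id)

    return cached


def load_vits(repo_id: str) -> Tuple[Any, Any]:
    """Load tokenizer and model the same way tts() does in app.py."""
    # Imported lazily so the caching helper itself has no heavy dependencies.
    from transformers import VitsModel, VitsTokenizer

    tokenizer = VitsTokenizer.from_pretrained(repo_id)
    model = VitsModel.from_pretrained(repo_id)
    return tokenizer, model


# Nothing is downloaded until get_model(...) is actually called.
get_model = make_cached_loader(load_vits)
```

Inside `tts()`, the two `from_pretrained` calls could then become `tokenizer, model = get_model(models[model_name]["model"])`. A small `maxsize` keeps memory bounded when users switch between many voices, at the cost of occasional reloads.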

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+dv-normalizer
+torch
+transformers
+scipy