I created an API wrapper with web UI

#13
by devnen - opened

Seriously impressive work. The model quality is outstanding, especially for a 3-month effort from scratch.

Seeing some discussions about running Dia locally, I wanted to share a project I put together quickly that might make it easier to get started:
https://github.com/devnen/Dia-TTS-Server

It's an API server that wraps the Dia model. Setup is straightforward: a standard pip install -r requirements.txt works on Windows or Linux, and the server automatically downloads the model from Hugging Face. It includes:

- A simple web UI for generating speech, adjusting parameters, and testing voice cloning.
- An OpenAI-compatible API endpoint, plus a custom endpoint if you need full control.
- Support for either CUDA GPUs or plain CPU.
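For a quick test once the server is running, a minimal request against the OpenAI-compatible endpoint looks something like this. The port, payload fields, and voice name here are placeholders that may differ between versions, so check the README for the real values:

```python
import requests

# Placeholder address and payload shape for the OpenAI-compatible
# endpoint -- verify both against the Dia-TTS-Server README.
resp = requests.post(
    "http://localhost:8003/v1/audio/speech",
    json={
        "model": "dia",                   # placeholder model name
        "input": "[S1] Hello from Dia!",  # Dia uses [S1]/[S2] speaker tags
        "voice": "S1",                    # hypothetical voice identifier
        "response_format": "wav",         # ask for WAV if supported
    },
)
resp.raise_for_status()
with open("hello.wav", "wb") as f:
    f.write(resp.content)
```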

(Screenshot: the Dia TTS Server web UI)

The goal was to create a simple way to run and experiment with the model without needing to piece together the example scripts yourself.
Hope you find it useful!

Thanks for this! Is there a limit to the number of characters I can input? I have multiple 50k+ character docs I would like to convert to speech.

The model is unable to process more than 25 seconds of audio reliably. You will need to extract text from an input file, split it into manageable chunks, convert each chunk to speech independently, and then concatenate them into a single audio file.
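A minimal sketch of that pipeline, assuming the server returns WAV audio; the endpoint address and payload fields are placeholders, so adjust them to match your install:

```python
import io
import re
import wave
import requests

API_URL = "http://localhost:8003/v1/audio/speech"  # hypothetical address

def split_text(text, max_chars=300):
    """Split text into chunks on sentence boundaries, each <= max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize(chunk):
    """Convert one chunk to speech; payload fields may differ by version."""
    resp = requests.post(API_URL, json={"input": chunk, "response_format": "wav"})
    resp.raise_for_status()
    return resp.content  # WAV bytes

def concatenate(wav_blobs, out_path="output.wav"):
    """Join WAV blobs that share the same sample rate and format."""
    with wave.open(out_path, "wb") as out:
        for i, blob in enumerate(wav_blobs):
            with wave.open(io.BytesIO(blob), "rb") as w:
                if i == 0:
                    out.setparams(w.getparams())
                out.writeframes(w.readframes(w.getnframes()))

text = open("document.txt", encoding="utf-8").read()
concatenate([synthesize(c) for c in split_text(text)])
```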

This feature may be added to the Dia TTS Server UI in the next few days. However, I think the generated audio chunks will have inconsistent voices; for now, I see no way to control the voice output.

Here is a good video demonstrating the issue and a potential solution:
https://www.youtube.com/watch?v=tje3uAZqgV0

There is a new version of the server that is designed to handle long documents. I've implemented an automatic 'chunking' feature that intelligently splits the text into smaller segments, respecting sentence boundaries and speaker tags ([S1]/[S2]). Each segment is converted to speech individually and then seamlessly joined, which overcomes the previous length limitation.

For the best results with long documents, particularly for ensuring the voice remains consistent throughout, I strongly recommend using the 'Predefined Voices' mode combined with a fixed integer seed (any number other than -1). Using 'Voice Cloning' with a fixed seed is also a good option.
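As a rough illustration of the fixed-seed approach, a request to the custom endpoint might look like the sketch below; the endpoint path and field names ('voice_mode', 'seed', and so on) are placeholders, so check the server's API docs for the real ones:

```python
import requests

# Placeholder endpoint and field names -- consult the server's API docs.
payload = {
    "text": "[S1] Chapter one. [S2] A reading in two voices.",
    "voice_mode": "predefined",  # hypothetical: use a predefined voice
    "voice": "S1",               # hypothetical voice identifier
    "seed": 42,                  # any fixed integer other than -1
}
resp = requests.post("http://localhost:8003/tts", json=payload)
resp.raise_for_status()
with open("chapter1.wav", "wb") as f:
    f.write(resp.content)
```

Reusing the same seed for every chunk is what keeps the generated voice stable from segment to segment.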

Unfortunately, I do not think the model is production-ready. The output is somewhat unpredictable even with all these workarounds.

Nari Labs org

This is awesome :))) Thanks for this contribution. We are working to make the model more predictable as well.
Happy to hear any feedback!

thank you so much!!
