I created an API wrapper with web UI

#13
by devnen - opened

Seriously impressive work. The model quality is outstanding, especially for a 3-month effort from scratch.

Seeing some discussions about running Dia locally, I wanted to share a project I put together quickly that might make it easier to get started:
https://github.com/devnen/Dia-TTS-Server

It's an API server that wraps the Dia model. Setup is straightforward: a standard pip install -r requirements.txt works on Windows or Linux, and the server automatically downloads the model from Hugging Face. It includes:

- A simple web UI for generating speech, adjusting parameters, and testing voice cloning.
- An OpenAI-compatible API endpoint, plus a custom endpoint if you need full control.
- Support for either CUDA GPUs or plain CPU.
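For a quick test once the server is running, a minimal request against the OpenAI-compatible endpoint looks something like this. The port, payload fields, and voice name here are placeholders that may differ between versions, so check the README for the real values:

```python
import requests

# Placeholder address and payload shape for the OpenAI-compatible
# endpoint -- verify both against the Dia-TTS-Server README.
resp = requests.post(
    "http://localhost:8003/v1/audio/speech",
    json={
        "model": "dia",                   # placeholder model name
        "input": "[S1] Hello from Dia!",  # Dia uses [S1]/[S2] speaker tags
        "voice": "S1",                    # hypothetical voice identifier
        "response_format": "wav",         # ask for WAV if supported
    },
)
resp.raise_for_status()
with open("hello.wav", "wb") as f:
    f.write(resp.content)
```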

(Screenshot: the Dia TTS Server web UI)

The goal was to create a simple way to run and experiment with the model without needing to piece together the example scripts yourself.
Hope you find it useful!

Thanks for this! Is there a limit to the number of characters I can input? I have multiple 50k+ character docs I would like to convert to speech.

The model is unable to process more than 25 seconds of audio reliably. You will need to extract text from an input file, split it into manageable chunks, convert each chunk to speech independently, and then concatenate them into a single audio file.
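A minimal sketch of that pipeline, assuming the server returns WAV audio; the endpoint address and payload fields are placeholders, so adjust them to match your install:

```python
import io
import re
import wave
import requests

API_URL = "http://localhost:8003/v1/audio/speech"  # hypothetical address

def split_text(text, max_chars=300):
    """Split text into chunks on sentence boundaries, each <= max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize(chunk):
    """Convert one chunk to speech; payload fields may differ by version."""
    resp = requests.post(API_URL, json={"input": chunk, "response_format": "wav"})
    resp.raise_for_status()
    return resp.content  # WAV bytes

def concatenate(wav_blobs, out_path="output.wav"):
    """Join WAV blobs that share the same sample rate and format."""
    with wave.open(out_path, "wb") as out:
        for i, blob in enumerate(wav_blobs):
            with wave.open(io.BytesIO(blob), "rb") as w:
                if i == 0:
                    out.setparams(w.getparams())
                out.writeframes(w.readframes(w.getnframes()))

text = open("document.txt", encoding="utf-8").read()
concatenate([synthesize(c) for c in split_text(text)])
```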

This feature may be added to the Dia TTS Server UI in the next few days. However, I think the generated audio chunks will have inconsistent voices; for now, I see no way to control the voice output.

Here is a good video demonstrating the issue and a potential solution:
https://www.youtube.com/watch?v=tje3uAZqgV0

There is a new version of the server that is designed to handle long documents. I've implemented an automatic 'chunking' feature that intelligently splits the text into smaller segments, respecting sentence boundaries and speaker tags ([S1]/[S2]). Each segment is converted to speech individually and then seamlessly joined, which overcomes the previous length limitation.

For the best results with long documents, particularly for ensuring the voice remains consistent throughout, I strongly recommend using the 'Predefined Voices' mode combined with a fixed integer seed (any number other than -1). Using 'Voice Cloning' with a fixed seed is also a good option.
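As a rough illustration of the fixed-seed approach, a request to the custom endpoint might look like the sketch below; the endpoint path and field names ('voice_mode', 'seed', and so on) are placeholders, so check the server's API docs for the real ones:

```python
import requests

# Placeholder endpoint and field names -- consult the server's API docs.
payload = {
    "text": "[S1] Chapter one. [S2] A reading in two voices.",
    "voice_mode": "predefined",  # hypothetical: use a predefined voice
    "voice": "S1",               # hypothetical voice identifier
    "seed": 42,                  # any fixed integer other than -1
}
resp = requests.post("http://localhost:8003/tts", json=payload)
resp.raise_for_status()
with open("chapter1.wav", "wb") as f:
    f.write(resp.content)
```

Reusing the same seed for every chunk is what keeps the generated voice stable from segment to segment.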

Unfortunately, I do not think the model is production-ready. The output is somewhat unpredictable even with all these workarounds.

Nari Labs org

This is awesome :))) Thanks for this contribution. We are working to make the model more predictable as well.
Happy to hear any feedback!

thank you so much!!
