Why are you calling this style-controlled TTS End-to-End?
- Please refer to https://github.com/OpenBMB/MiniCPM-o?tab=readme-ov-file#mimick, which clearly demonstrates that the model is end-to-end.
- The model can also perform end-to-end voice cloning, which is not feasible for a non-end-to-end model; see the evaluation results at https://github.com/OpenBMB/MiniCPM-o.
If I understand correctly, QwenLM here is connected to the TTS's own Text-Encoder LM only through spk_emb and the plain text. The TTS module is separate and accepts the text plus the speaker-control embedding as input, so this is cascaded mel-spectrogram generation, not end-to-end.
From another perspective, QwenLM has no control over the output mel spectrogram here except via the speaker embeddings. This can hardly be considered end-to-end, as the LLM has neither knowledge of nor the ability to control mel-related aspects. This is a cascaded, style-controlled TTS.
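To make the objection concrete, the cascaded reading described above would look roughly like this (hypothetical function and variable names, not taken from the MiniCPM-o codebase): the LLM emits only text and a speaker embedding, and a separate TTS stack re-encodes the text and renders the mel spectrogram.

```python
# Hypothetical sketch of the *cascaded* reading described above; none of
# these names come from the MiniCPM-o codebase.
def cascaded_tts(qwen_lm, tts_text_encoder, tts_decoder, prompt_ids, spk_emb):
    """The only channels from the LLM into the TTS are plain text tokens
    and a fixed-size speaker embedding."""
    text_ids = qwen_lm.generate(prompt_ids)   # LLM produces text only
    text_feats = tts_text_encoder(text_ids)   # the TTS re-encodes that text itself
    mel = tts_decoder(text_feats, spk_emb)    # style enters only via spk_emb
    return mel                                # no feature/gradient path back into the LLM
```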
MiniCPM-o-2.6 is trained end-to-end: during training, gradients from the speech decoder flow back through the speech embedding to the backbone LLM. This setup gives the LLM significant control over the output mel spectrogram, and based on our evaluations and demos, this control is strong. The LLM can influence factors such as speed, emotion, accent, and speaker timbre, as well as other nuanced speech features the model has learned implicitly through end-to-end training.
It’s worth clarifying that the LLM does have control over the final mel spectrogram, though not at a fine-grained level for every individual word. The right way to think about it is that the LLM provides broad control over the speech output.
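To make "gradients flow back through the speech embedding" concrete, here is a minimal training-step sketch with toy sizes and made-up module names; it only illustrates the gradient path under the assumptions above, not the real MiniCPM-o architecture.

```python
# Minimal end-to-end training sketch with toy sizes and hypothetical module
# names; it illustrates the gradient path only, not the real MiniCPM-o code.
import torch
import torch.nn as nn

DIM = 64  # toy width; the real speech embedding is 3584-dimensional

class ToyBackboneLLM(nn.Module):
    def __init__(self, vocab=1000, dim=DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return hidden                                # per-position "speech embedding"

class ToySpeechDecoder(nn.Module):
    def __init__(self, dim=DIM, audio_vocab=256):
        super().__init__()
        self.proj = nn.Linear(dim, audio_vocab)

    def forward(self, speech_embedding):
        return self.proj(speech_embedding)           # logits over audio/codec units

llm, decoder = ToyBackboneLLM(), ToySpeechDecoder()
opt = torch.optim.AdamW(list(llm.parameters()) + list(decoder.parameters()), lr=1e-4)

text_ids = torch.randint(0, 1000, (2, 16))
audio_targets = torch.randint(0, 256, (2, 16))

logits = decoder(llm(text_ids))                      # decoder conditioned on LLM hidden states
loss = nn.functional.cross_entropy(                  # a plain CE objective, as noted below
    logits.reshape(-1, logits.size(-1)), audio_targets.reshape(-1))
loss.backward()                                      # gradients reach the backbone LLM
opt.step()                                           # all parameters updated together
```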
Additional points to consider:
- The speech embedding has a dimension of 3584, with each dimension represented as a floating-point number. This allows for broad control over the final mel spectrogram.
- No reference audio encoder is involved in training the model; only a cross-entropy (CE) loss is used (see the sketch below).
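A quick way to see why a plain CE loss on the decoder output still trains the backbone is to check that the loss populates gradients on the LLM-side parameters. The snippet below is a toy sanity check with hypothetical stand-in modules, not MiniCPM-o code.

```python
# Toy sanity check with stand-in modules (hypothetical names): a CE loss
# computed on the decoder's output also leaves gradients on the LLM side.
import torch
import torch.nn as nn

speech_embedding_layer = nn.Linear(32, 3584)   # stand-in for the LLM side (3584-dim embedding)
speech_decoder = nn.Linear(3584, 256)          # stand-in for the speech decoder

x = torch.randn(4, 32)
targets = torch.randint(0, 256, (4,))

logits = speech_decoder(speech_embedding_layer(x))
loss = nn.functional.cross_entropy(logits, targets)   # the only objective
loss.backward()

# Gradients exist on the LLM side, so the backbone is updated by the speech loss.
assert speech_embedding_layer.weight.grad is not None
```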
Also, we appreciate the insights you’ve shared. It’s true that our current architecture involves two encoding processes for text tokens. However, from a training standpoint, this does not hinder the end-to-end training process or the backpropagation of gradients.
At present, the LLM already provides broad control over the output. Since we have not yet released a technical report, it is understandable that one might assume we rely on a style-control objective as a loss function, resembling a multi-stage pipeline that trains different parts of the model separately. However, this is not the case: our training process is fully end-to-end, with all parameters optimized simultaneously.
In future iterations of the model, we aim to further refine the architecture. For instance, we plan to enhance the connection between the LLM and the voice decoder, making it more feature-rich and fine-grained.
I appreciate your explanation. My intention wasn't to interrogate you, but rather to express my dissatisfaction with the dual-LM structure. It seems to cause issues with latency and fine-grained control of the speech output.
I always gain valuable insights from your new models, and I'm looking forward to seeing your future developments.
We have always greatly enjoyed discussions with everyone in the open-source community, and your suggestions are invaluable to us. If you have any questions or suggestions, we warmly welcome you to share them with us! Let’s work together to create better models and a better community!