Did your team consider using an EnCodec + embedding model (e.g. like Moshi)?

#27
by RonanMcGovern - opened

Thanks for releasing this model. I'm curious why you went with an encoder-based approach rather than a tokenised one (would that be too slow)?

Also, the two-part transformer (think + talk) is quite unusual; did you try using a single unified transformer instead?
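For anyone less familiar with the distinction being asked about, here is a minimal numpy sketch, purely illustrative and with made-up shapes, names, and codebook sizes (not this model's or Moshi's actual pipeline), contrasting the two front-ends: feeding continuous encoder embeddings straight to the LLM versus quantising each frame into discrete codec token IDs (the EnCodec/Moshi-style tokenised route).

```python
# Toy contrast of the two audio front-ends discussed in this thread.
# All shapes, the codebook size, and the projection width are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "audio encoder" output: 50 frames of 128-dim continuous features.
frames = rng.normal(size=(50, 128)).astype(np.float32)

# --- Option A: embedding approach --------------------------------------------
# Continuous frames are projected and consumed directly by the LLM,
# so nothing is lost to quantisation.
proj = rng.normal(size=(128, 896)).astype(np.float32)  # hypothetical LLM width
llm_inputs = frames @ proj                              # (50, 896) soft "tokens"

# --- Option B: tokenised (codec) approach -------------------------------------
# Each frame is snapped to its nearest codebook entry and only the integer
# index is kept, giving a discrete sequence the LLM can model autoregressively.
codebook = rng.normal(size=(1024, 128)).astype(np.float32)           # 1024 entries
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (50, 1024)
token_ids = dists.argmin(axis=1)                                     # (50,) ints

print("embedding path:", llm_inputs.shape)   # continuous vectors
print("token path:    ", token_ids[:10])     # discrete IDs, lossy but compact
```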

I think it's primarily due to the modality gap (personal opinion). The Moshi paper also points out the limitations of two-stream training, specifically noting that intelligence tends to degrade in setups optimized for audio and full-duplex conversation.

In my view, Qwen2.5-Omni puts significant effort into managing the modality gap through its two-part transformer architecture (think + talk). However, like GPT-4o, it does not implement true full-duplex communication for audio-to-audio interaction; Moshi is unique in supporting true full-duplex conversation. For Qwen2.5-Omni and GPT-4o, voice activity detection (VAD) and significant engineering effort are required to accurately predict when the user has stopped speaking and when the assistant should start. Despite these limitations, Qwen2.5-Omni's approach preserves a higher level of intelligence. Ultimately, I believe this reflects a trade-off between intelligence preservation and the complexity of real-time conversational interaction.
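To make the turn-taking point concrete, below is a minimal sketch of the half-duplex, VAD-gated loop such systems rely on. It is my own toy (energy threshold, frame size, and the respond() stub are all assumptions, not any product's code): the assistant may only respond after the detector decides the user has finished, which is exactly the engineering burden a truly full-duplex model like Moshi avoids.

```python
# Toy half-duplex turn-taking: an energy-based VAD plus a silence "hangover"
# decides when the user's turn ends and the assistant may start speaking.
import numpy as np

ENERGY_THRESHOLD = 0.01           # assumed energy threshold for "speech"
SILENCE_FRAMES_TO_END_TURN = 10   # ~300 ms of silence at 30 ms frames

def is_speech(frame: np.ndarray) -> bool:
    """Crude VAD: mean energy above a fixed threshold counts as speech."""
    return float(np.mean(frame ** 2)) > ENERGY_THRESHOLD

def respond(user_audio: list) -> None:
    """Stub for the assistant's reply (ASR -> LLM -> TTS in a real system)."""
    print(f"assistant speaks after {len(user_audio)} user frames")

def turn_taking_loop(frames):
    user_buffer, silence_run, user_speaking = [], 0, False
    for frame in frames:
        if is_speech(frame):
            user_speaking, silence_run = True, 0
            user_buffer.append(frame)
        elif user_speaking:
            silence_run += 1
            if silence_run >= SILENCE_FRAMES_TO_END_TURN:
                respond(user_buffer)   # only now may the assistant talk
                user_buffer, silence_run, user_speaking = [], 0, False

# Synthetic demo: ~1 s of "speech" followed by ~0.5 s of silence.
rng = np.random.default_rng(0)
speech = [rng.normal(0, 0.3, 480) for _ in range(33)]
silence = [np.zeros(480) for _ in range(17)]
turn_taking_loop(speech + silence)
```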

@Seungyoun Hello, you say that Qwen2.5-Omni does not support true full-duplex conversations? I found that the demo can achieve real-time conversation, which confuses me.
