RealTime
Hi guys, great job! That's cool! Any suggestions for an open-source real-time mode? Thanks
I'm trying to see if it can work with the OpenAI Realtime Console
https://github.com/openai/openai-realtime-console
What are you trying it out on for real-time? @Kar0nte
@Someshfengde I was thinking about frameworks like LiveKit and FastRTC for real-time streaming. Do you think CSM-1B is fast enough for a WebRTC pipeline, or would we need additional optimization?
The demo gets around the limitations of the model by starting to process the input while the user is still talking, then it seems to stitch the responses together. You can basically force it to use the entire generation time by asking it a series of "repeat after me" in the same sentence, followed by something that triggers its guardrails (expletives, etc.)
To replicate the demo, you'd basically need a fast speech-to-text model, then fire the transcript off to the LLM for a response and feed that response to CSM for audio generation.
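For concreteness, here's a minimal sketch of that pipeline. Assumptions not from this thread: faster-whisper for the STT step, an OpenAI-compatible chat endpoint for the LLM, and the load_csm_1b / generate API from the sesame/csm README for the audio step.

```python
# Sketch only: STT -> LLM -> CSM, under the assumptions stated above.
import torchaudio
from faster_whisper import WhisperModel
from openai import OpenAI
from generator import load_csm_1b  # from the sesame/csm repo

stt = WhisperModel("base", device="cuda")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment
tts = load_csm_1b(device="cuda")

def respond(wav_path: str) -> str:
    # 1. Speech-to-text: transcribe the user's utterance.
    segments, _ = stt.transcribe(wav_path)
    user_text = " ".join(s.text for s in segments)

    # 2. LLM: generate a reply to the transcript.
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content

    # 3. CSM: synthesize the reply as audio.
    audio = tts.generate(text=reply, speaker=0, context=[],
                         max_audio_length_ms=10_000)
    torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), tts.sample_rate)
    return reply
```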
Hi @quadratrix, that's why I was thinking about frameworks like LiveKit, where you can plug in STT (like Whisper), an LLM, VAD, and TTS, and I wanted to understand whether it would be possible to use CSM there. But from what I'm finding out, the 1B model is very immature and needs a lot of optimization. What do you think? To host it locally you'd still need a lot of compute for acceptable latency. And it seems to only handle English, and not very well either. I think it will take a long time before we can use it well.
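On the VAD piece specifically, here's a rough, framework-agnostic sketch using Silero VAD to detect when the user has stopped talking, so the transcript can be fired off to the LLM. The chunk size and sample rate follow Silero's streaming example; nothing here is LiveKit-specific.

```python
# Sketch: end-of-speech detection with Silero VAD's streaming iterator.
import torch

vad_model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad",
                                  model="silero_vad")
(get_speech_timestamps, save_audio, read_audio,
 VADIterator, collect_chunks) = utils

vad = VADIterator(vad_model, sampling_rate=16000)

def end_of_speech(chunk: torch.Tensor) -> bool:
    """Feed 512-sample, 16 kHz mono chunks; returns True when speech ends."""
    event = vad(chunk, return_seconds=True)  # None, {'start': s}, or {'end': s}
    return bool(event) and "end" in event
```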
MrDragonFox on Discord seems to suggest he's got response times down to 3.3 RTR (millisecond response), but he doesn't upload an audio.wav file to validate that the output is usable, or offer anything else to back it up.
https://discord.com/channels/1349855029938487437/1349855193151569972
This guy seems to really know his stuff. I learned a lot from his code this evening and was able to get a working end-to-end voice chat similar to what you're asking about.
https://github.com/nytopop/csm
The audio sounded like crap, though. It needs a lot of tweaking to get the streaming parameters synced up. Still, very promising!
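One knob that helps with getting the chunks synced up is prebuffering: hold back playback until a few chunks are queued, so a slow generation step doesn't underrun the output device. A rough sketch with sounddevice; the chunk count and sizes here are guesses to tune, not values from that repo.

```python
# Sketch: prebuffered playback of streamed audio chunks.
import queue
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000       # CSM output rate
PREBUFFER_CHUNKS = 4      # start playback only once this many chunks are queued

audio_q: "queue.Queue[np.ndarray]" = queue.Queue()

def play_stream() -> None:
    # Producer side (elsewhere): audio_q.put(chunk.cpu().numpy().astype("float32"))
    # for each generated chunk, then audio_q.put(None) when finished.
    while audio_q.qsize() < PREBUFFER_CHUNKS:
        sd.sleep(10)  # wait until enough audio is buffered
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1,
                         dtype="float32") as out:
        while True:
            chunk = audio_q.get()
            if chunk is None:  # sentinel: generation finished
                break
            out.write(chunk)   # blocking write keeps playback continuous
```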
I honestly don't think "bootleg Maya" is all that far away. I have gotten the voice dialed in really well in my repo. Transcripts are key. I believe a changing set of context segments, grouped by emotion, is going to be needed to get anywhere near demo quality though.
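As an illustration of that idea, here's a sketch using the Segment/generate API from the sesame/csm README; the emotion tags, file paths, and selection rule are hypothetical, not the actual approach from any repo linked here.

```python
# Sketch: swap in a different context set depending on the desired emotion.
import torchaudio
from generator import Segment, load_csm_1b  # from the sesame/csm repo

generator = load_csm_1b(device="cuda")

def load_ref(path: str):
    # Load a reference clip and resample it to CSM's sample rate.
    audio, sr = torchaudio.load(path)
    return torchaudio.functional.resample(
        audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
    )

# Hypothetical bank of reference clips with transcripts, tagged by emotion.
context_bank = {
    "happy": [Segment(text="That's wonderful news!", speaker=0,
                      audio=load_ref("happy.wav"))],
    "calm": [Segment(text="Let me think about that for a moment.", speaker=0,
                     audio=load_ref("calm.wav"))],
}

def speak(text: str, emotion: str):
    # Condition generation on the context set that matches the target emotion.
    return generator.generate(
        text=text,
        speaker=0,
        context=context_bank[emotion],
        max_audio_length_ms=10_000,
    )
```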
@zenoran can you share your repo?
Mostly Copilot code; I haven't gone through and cleaned it up, but it's stable. It got so fast with the compiler that the sentences now overlap, so it needs some tweaking. Gotta do the day job though 😔
https://github.com/zenoran/sesameai-tts
BTW this is just console TTS; I haven't pushed the end-to-end web version I got working based on that guy's repo.
Thanks for your response. I'll try to run this on lightning.ai.
I've been trying to run this on macOS but no luck :( I raised an issue on their official repo.
I was able to stream in real time, but it's a bit slow. The model doesn't seem to work well for real-time streaming: even with one H200 it doesn't generate stream chunks fast enough, and the sound comes out a bit choppy.
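A quick way to put a number on "not fast enough" is the real-time factor: wall-clock generation time divided by seconds of audio produced, which must stay below 1.0 for gap-free streaming. A sketch using the generate API from the sesame/csm README:

```python
# Sketch: measure the real-time factor (RTF) of one generation call.
import time
from generator import load_csm_1b  # from the sesame/csm repo

generator = load_csm_1b(device="cuda")

start = time.perf_counter()
audio = generator.generate(
    text="Measuring generation speed for one sentence.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
elapsed = time.perf_counter() - start

audio_seconds = audio.shape[-1] / generator.sample_rate
print(f"RTF: {elapsed / audio_seconds:.2f} (below 1.0 = faster than real time)")
```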
Disabling watermarks improves performance by around 20%. However, I'm still not achieving real-time speed on an RTX 5070 Ti; I'm getting around 0.5x.
EDIT: I was able to achieve real time by compiling the decoder with inductor as the backend!
Try adding "model.decoder = torch.compile(model.decoder, fullgraph=True, mode='reduce-overhead')" on L178 in load_csm_1b.
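For context, here's roughly where that line lands, assuming a load_csm_1b shaped like the one in the sesame/csm repo's generator.py (the surrounding lines are illustrative, not an exact copy):

```python
# Sketch of the patch inside the CSM repo's generator.py.
import torch
from models import Model  # Model and Generator are the repo's own classes

def load_csm_1b(device: str = "cuda") -> "Generator":
    model = Model.from_pretrained("sesame/csm-1b")
    model.to(device=device, dtype=torch.bfloat16)
    # Compile only the audio decoder; 'reduce-overhead' enables CUDA graphs,
    # which pays off because decoding runs many small steps per audio frame.
    # Expect the first few generate() calls to be slow while inductor warms up.
    model.decoder = torch.compile(model.decoder, fullgraph=True,
                                  mode="reduce-overhead")
    return Generator(model)
```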
Can this be integrated with APIs to get real-time data, or to perform a task like making an appointment by calling a different API?