Seamless Speech-to-Speech Translation with Voice Replication (S3TVR)

S3TVR is a cascaded AI framework for real-time speech-to-speech translation that preserves the speaker's voice characteristics in a zero-shot fashion. The project balances latency against output quality, focuses on English and Spanish, and combines multiple open-source models and algorithms. The system is optimized for local execution, with an average latency of about 3 seconds per sentence. For the optimized model, see the GitHub repo below.
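
As a rough illustration of what "cascaded" means here, the sketch below chains three stages (speech recognition, text translation, and speech synthesis conditioned on the original audio) and times the whole pass. It is not the S3TVR implementation: the stage breakdown is the usual one for cascaded speech translation and is assumed rather than taken from the project, and `run_cascade`, `CascadeResult`, and the `fake_*` stages are hypothetical stand-ins for real models.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    transcript: str     # source-language text from the recognition stage
    translation: str    # target-language text from the translation stage
    audio: bytes        # synthesized target-language speech
    latency_s: float    # wall-clock time for the whole cascade


def run_cascade(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str, bytes], bytes],
) -> CascadeResult:
    """Run the three stages in sequence and measure end-to-end latency."""
    start = time.perf_counter()
    transcript = transcribe(audio_in)
    translation = translate(transcript)
    # The input audio is passed along as the voice reference (assumption:
    # this is how a zero-shot voice-cloning synthesizer would be conditioned).
    audio_out = synthesize(translation, audio_in)
    latency = time.perf_counter() - start
    return CascadeResult(transcript, translation, audio_out, latency)


# Hypothetical placeholder stages; a real system would call models here.
def fake_transcribe(audio: bytes) -> str:
    return "hello world"


def fake_translate(text: str) -> str:
    return {"hello world": "hola mundo"}.get(text, text)


def fake_synthesize(text: str, reference_audio: bytes) -> bytes:
    return text.encode("utf-8")


result = run_cascade(b"\x00\x01", fake_transcribe, fake_translate, fake_synthesize)
print(result.translation, f"{result.latency_s:.4f}s")
```

Because each stage is passed in as a plain callable, any one of them can be swapped or timed separately, which is where a latency versus quality trade-off would typically be tuned.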

NOTE: Local execution is streamed and fully optimized (unlike this demo).

Press and hold until the sentence is no longer RED.