First word always missing in speech-to-speech translation (demo and API) – is this a known limitation?
Hi everyone,
I’m using the facebook/seamless-m4t-v2-large model for speech-to-speech and speech-to-text translation.
In both my own API implementation and the official Hugging Face demo page, the first word I speak is always missing from the translation output. The model seems to start transcribing/translating from the second word, no matter how clearly I pronounce the first word or how long I wait after starting the recording before speaking.
I have checked my audio recordings: the full audio (including the first word) is present and clear.
I have tried trimming leading silence and even skipping the first 0.2 seconds of the recording, but the issue persists.
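For reference, the two preprocessing attempts look roughly like the sketch below. This is a simplified illustration, not my exact code: the function names, the 0.01 amplitude threshold, and the plain list-of-floats representation are assumptions made for the example, and the 16 kHz sample rate is assumed because that is the rate the model expects.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # assumed: mono audio resampled to 16 kHz before inference


def trim_leading_silence(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop every sample before the first one whose absolute amplitude exceeds `threshold`."""
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            return list(samples[i:])
    return []  # the whole clip stayed below the threshold


def skip_start(samples: Sequence[float], seconds: float = 0.2,
               sample_rate: int = SAMPLE_RATE) -> List[float]:
    """Discard the first `seconds` of audio, regardless of its content."""
    n_skip = int(round(seconds * sample_rate))
    return list(samples[n_skip:])
```

Neither variant changes the result: the first spoken word is still missing from the output.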
The same behavior occurs in both my local setup and the official Hugging Face demo:
https://huggingface.co/facebook/seamless-m4t-v2-large
Is this a known limitation of the model or its preprocessing?
Is there any workaround or recommended way to ensure the first word is included in the translation?
Does anyone else experience the same issue?
Thanks in advance for any insights or solutions!