Max sequence length
Hi,
Thanks for sharing this amazing model.
What's the maximum sequence length for both text and audio?
Whisper only supports 30 seconds max. Can we just concatenate 30-second segments sequentially to handle a 4-minute audio file? And how does that affect the overall sequence length?
Thanks.
Right now audio longer than 30 seconds will trigger an error. We are going to provide an update to handle this internally.
That would be great!
How many tokens are these 30 seconds?
For example, if I have a system prompt that is 20 tokens total and a user audio that is 30 seconds, how many tokens in total would the input be?
Right now the number of audio tokens is roughly 6.25 per second of audio (subject to change in future releases).
So for 30 seconds it should be roughly ceil(30 * 6.25) = 188 tokens (no extra prefix/suffix). With your 20-token system prompt, that's 208 tokens of content, but you'll also have a few extra tokens for the Llama chat template (e.g. to indicate the role and the start/end of each turn) as per usual.
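If it helps, here is a minimal sketch of that arithmetic in Python. It only encodes the numbers quoted in this thread (about 6.25 audio tokens per second, a 30-second limit); the function and constant names are made up for illustration and are not part of the model's API, and the chat-template overhead is deliberately left out because its exact size isn't stated here.

```python
import math

# Figures quoted in this thread; both may change in future releases.
AUDIO_TOKENS_PER_SECOND = 6.25
MAX_AUDIO_SECONDS = 30.0


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Rough audio token count for one clip (hypothetical helper)."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    if duration_seconds > MAX_AUDIO_SECONDS:
        # Mirrors the current behaviour described above: longer audio errors out.
        raise ValueError("audio longer than 30 seconds is not supported yet")
    return math.ceil(duration_seconds * AUDIO_TOKENS_PER_SECOND)


def estimate_content_tokens(text_tokens: int, audio_seconds: float) -> int:
    """Text tokens plus estimated audio tokens, excluding chat-template tokens."""
    return text_tokens + estimate_audio_tokens(audio_seconds)


if __name__ == "__main__":
    print(estimate_audio_tokens(30))        # 188
    print(estimate_content_tokens(20, 30))  # 208
```

Treat the result as a lower bound on the real input length, since the chat template adds a few tokens per turn on top of it.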