Max sequence length

by amgadhasan - opened Sep 9, 2024

Sep 9, 2024

Hi,

Thanks for sharing this amazing model.

What's the maximum sequence length for both text and audio?
Whisper only supports 30 seconds max. Can we just concatenate 30-second segments sequentially to handle a 4-minute audio file? And how does that affect the overall sequence length?

Thanks.

zqhuang

Fixie.ai org Sep 9, 2024

Right now audio larger than 30 seconds will trigger an error. We are going to provide an update to handle this internally.

amgadhasan

Sep 9, 2024

Right now audio larger than 30 seconds will trigger an error. We are going to provide an update to handle this internally.

That would be great!
How many tokens are these 30 seconds?

For example, if I have a system prompt that is 20 tokens total and a user audio that is 30 seconds, how many tokens in total would the input be?

farzadab

Sep 9, 2024

edited Sep 9, 2024

Right now the number of audio tokens is roughly 6.25 per second of audio (subject to change in future releases).

So for 30 seconds it should be roughly ceil(30 * 6.25) = 188 audio tokens (no extra prefix/suffix). Adding your 20-token system prompt, that's 208 tokens for content, but you'll also have a few extra tokens for the Llama chat template (e.g. to indicate the role and the start/end of each turn), as usual.
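
In case it helps, here is a rough Python sketch of that arithmetic. The function names and the `template_overhead` parameter are made up for illustration and are not part of the model's API; the 6.25 tokens/second rate and the 30-second limit are just the figures quoted in this thread and may change.

```python
import math

# Figures quoted in this thread; both are subject to change in future releases.
AUDIO_TOKENS_PER_SECOND = 6.25
MAX_AUDIO_SECONDS = 30.0


def estimate_audio_tokens(seconds: float) -> int:
    """Estimate the number of audio tokens for a clip of the given length."""
    if seconds < 0:
        raise ValueError("audio length must be non-negative")
    if seconds > MAX_AUDIO_SECONDS:
        # Mirrors the current behavior described above: longer audio is rejected.
        raise ValueError("audio longer than 30 seconds is not supported yet")
    return math.ceil(seconds * AUDIO_TOKENS_PER_SECOND)


def estimate_input_tokens(text_tokens: int, audio_seconds: float,
                          template_overhead: int = 0) -> int:
    """Text tokens + audio tokens + (hypothetical) chat-template overhead.

    The exact overhead depends on the chat template, so it is left as a
    parameter rather than guessed here.
    """
    return text_tokens + estimate_audio_tokens(audio_seconds) + template_overhead


print(estimate_audio_tokens(30))      # 188
print(estimate_input_tokens(20, 30))  # 208, before chat-template tokens
```

For an exact count, tokenize the fully templated prompt rather than relying on this estimate.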
