RTFx Calculation / Batch Size

#40
by lachln - opened

Was RTFx calculated based on the total audio across all samples in the batch? For example, if 8 samples were transcribed in parallel, each being 10 seconds long (80 seconds of audio total), and transcribing the batch took 4 seconds, is RTFx calculated as 80 / 4 = 20? I'm trying to understand the crazy high RTFx numbers.

Yes, that's how I interpret it.
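
To make the arithmetic from the question concrete, here is a minimal sketch of that interpretation: RTFx as total seconds of audio in the batch divided by wall-clock seconds spent transcribing it. The function name `rtfx` and its parameters are made up for illustration; this is not taken from any leaderboard or NeMo code.

```python
def rtfx(audio_durations_s, wall_clock_s):
    """Seconds of audio processed per second of wall-clock time.

    Hypothetical helper: assumes RTFx counts the summed duration of every
    sample in the batch, as described in the question above.
    """
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return sum(audio_durations_s) / wall_clock_s


# 8 clips of 10 seconds each, transcribed together in 4 seconds.
batch = [10.0] * 8
print(rtfx(batch, 4.0))  # 20.0
```

Under this reading, a larger batch raises RTFx whenever the GPU can process the extra clips without a proportional increase in wall-clock time.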

I recall reading that parakeet-tdt-0.6b-v2 used a batch size of 128 on an NVIDIA A10G 24GB GPU to reach 3000+ RTFx.

For reference, I transcribed 400 hours of podcast episodes (averaging 5-15 minutes each) in 24 minutes on my PC using parakeet and two GPUs:

  • RTX 4080 16GB: batch size 6, RTFx of 533x
  • RTX 3080 Ti 12GB: batch size 4, RTFx of 422x

That's a combined RTFx of 955x (533 + 422), so 3000+ RTFx with better GPUs, shorter audio files, and much larger batch sizes seems reasonable.
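
For anyone checking the numbers, here is a rough sketch of how the combined figure works out. It assumes both GPUs were running at the same time over the same wall-clock window, so their per-GPU RTFx values simply add. The helper names are hypothetical.

```python
def combined_rtfx(per_gpu_rtfx):
    """Sum per-GPU RTFx values.

    Assumption: all GPUs work concurrently over the same wall-clock
    window, so their throughputs add.
    """
    return sum(per_gpu_rtfx)


def overall_rtfx(audio_hours, wall_clock_minutes):
    """RTFx for a whole job: minutes of audio per minute of wall-clock time."""
    return (audio_hours * 60.0) / wall_clock_minutes


print(combined_rtfx([533, 422]))  # 955
print(overall_rtfx(400, 24))      # 1000.0
```

The end-to-end figure (400 hours in 24 minutes) lands in the same ballpark as the summed per-GPU figure, which is about what you would expect from rounded totals.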
