Steven Zheng

Steveeeeeeen

AI & ML interests

speech & audio

Organizations

Hugging Face, Hugging Test Lab, Whisper Distillation, Dynamic-SUPERB, Dynamic-SUPERB-Private, Hugging Face for Audio, huggingPartyParis, MLX Community, TTS AGI, Whisper Multilingual Distillation, Audio Collabs, open/ acc, MultiLlasa, fluxions-hf, nvidia-hf-collab

Steveeeeeeen's activity

New activity in deepseek-ai/DeepSeek-Prover-V2-7B about 14 hours ago

Add model card metadata

#1 opened about 14 hours ago by Steveeeeeeen

reacted to fdaudens's post with 👍🔥 2 days ago
Forget everything you know about transcription models - NVIDIA's parakeet-tdt-0.6b-v2 changed the game for me!

Just tested it with Steve Jobs' Stanford speech and was speechless (pun intended). The video isn’t sped up.

3 things that floored me:
- Transcription took just 10 seconds for a 15-min file
- Got a CSV with perfect timestamps, punctuation & capitalization
- Stunning accuracy (correctly captured "Reed College" and other specifics)

NVIDIA also released a demo where you can click any transcribed segment to play it instantly.

The improvement is significant: number 1 on the ASR Leaderboard, 6% error rate (best in class) with complete commercial freedom (cc-by-4.0 license).

Time to update those Whisper pipelines! H/t @Steveeeeeeen for the finding!

Model: nvidia/parakeet-tdt-0.6b-v2
Demo: nvidia/parakeet-tdt-0.6b-v2
ASR Leaderboard: hf-audio/open_asr_leaderboard