What if we segment the audio first and then transcribe? It's some extra compute to throw in, but IMO it would give better results!
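Roughly what I mean, as a minimal sketch and not a tested pipeline: split on silence with a simple frame-energy threshold, then transcribe each chunk on its own. The names `split_on_silence` and `transcribe_segments`, the threshold values, and the `transcribe` callback are all assumptions made up for illustration; `transcribe` stands in for whatever ASR model you actually call, and energy thresholding is only a crude substitute for a real voice activity detector.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    frame_len: int = 400,          # assumed frame size in samples (25 ms at 16 kHz)
    threshold: float = 0.01,       # assumed mean-square energy cutoff for "voiced"
    min_silence_frames: int = 5,   # assumed silence length that closes a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of the non-silent segments."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None    # start of the segment currently being built
    last_voiced_end = 0            # end of the most recent voiced frame
    silent_run = 0                 # consecutive silent frames seen so far

    for frame_start in range(0, len(samples), frame_len):
        frame = samples[frame_start:frame_start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            if start is None:
                start = frame_start
            last_voiced_end = frame_start + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Enough silence: close the segment at the last voiced frame.
                segments.append((start, last_voiced_end))
                start = None
                silent_run = 0

    if start is not None:
        segments.append((start, last_voiced_end))
    return segments


def transcribe_segments(
    samples: Sequence[float],
    transcribe: Callable[[Sequence[float]], str],  # hypothetical ASR callback
) -> List[str]:
    """Segment first, then run the ASR callback on each segment."""
    return [transcribe(samples[s:e]) for s, e in split_on_silence(samples)]
```

The extra compute is one pass over the samples for the energy check; the ASR model then only sees the voiced chunks instead of the whole recording.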
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Paper • 2412.10302 • Published Dec 13, 2024 • 18 • 10
IndicConformer Collection A collection of ASR models for 22 scheduled languages of India • 24 items • Updated Mar 14 • 12