nvidia/audio-flamingo-2 · What visual model would you use in tandem? Distallignation?

I see this as a necessary step toward some projects I've been waiting to try, but I need a vision-language model that can align well with it across video, if that makes sense. My data is about 250 GB of real-human footage of the same identity, mostly at 540p with some up to 1920x1080. Wan is currently the best candidate for the other side of things.

Is there a cross-modal distillation method that would work well for this task, and if so, can it be applied to quantized models? I understand audio has been harder to carry through quantization, but that doesn't rule out using, say, a 3B f16 audio model and distilling-and-aligning ("distalligning") it through a paired medium like video, does it?
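To make the question more concrete, here is a rough sketch of the kind of objective I have in mind. It is only an illustration, not anything taken from Audio Flamingo 2 or Wan: it assumes a frozen audio teacher and a trainable video student that each produce one embedding per clip, and it uses a symmetric contrastive (InfoNCE-style) loss so that audio and video embeddings from the same clip are pulled together. The function names, the toy embeddings, and the temperature value are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: query i should score highest against key i."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        # Cosine similarity of this query against every key in the batch.
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def alignment_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss over paired (audio, video) clip embeddings.

    Assumption: audio_emb comes from a frozen teacher and video_emb from the
    student being trained, with row i of each taken from the same clip.
    """
    return 0.5 * (
        info_nce(audio_emb, video_emb, temperature)
        + info_nce(video_emb, audio_emb, temperature)
    )


# Toy embeddings standing in for real model outputs (hypothetical values).
audio = [[1.0, 0.0], [0.0, 1.0]]
video_paired = [[0.9, 0.1], [0.1, 0.9]]    # clip i matches audio i
video_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # pairing deliberately broken

paired_loss = alignment_loss(audio, video_paired)
shuffled_loss = alignment_loss(audio, video_shuffled)
```

With correctly paired clips the loss should come out much lower than with the shuffled pairing. In this setup only the student would need gradients, so in principle the teacher could stay at f16 while the student is the quantized model, but whether that holds up in practice is exactly what I am asking.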