Generate descriptions from images and text prompts
Generate synthesized speech from text and an audio reference