Generate audio from text and transcribe audio to text
Localize elements in images based on instructions