Engage in multimodal conversations with images and videos
Chat with different models using various approaches
Generate depth maps from images
Generate text responses based on images and input text