Image-Text to Text
Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models (VLMs). Unlike image-to-text models, they accept an additional text input, so they are not restricted to specific use cases like image captioning, and they may also be trained to accept a conversation as input.
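For instance, a conversational input can interleave image and text content across turns. Below is a minimal sketch of such a message list, using the same OpenAI-style chat format as the API example later on this page; the image URL and the wording of the turns are placeholders:

# A hypothetical multi-turn conversation mixing an image turn and text turns
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder URL
            {"type": "text", "text": "What animal is in this picture?"},
        ],
    },
    {"role": "assistant", "content": "The picture shows a cat sitting on a windowsill."},
    {"role": "user", "content": "Can you tell what breed it might be?"},
]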
For more details about the image-text-to-text task, check out its dedicated page! You will find examples and related materials.
Recommended models
- meta-llama/Llama-3.2-11B-Vision-Instruct: Powerful vision language model with great visual understanding and reasoning capabilities.
- HuggingFaceM4/idefics2-8b-chatty: Cutting-edge conversational vision language model that can take multiple image inputs.
- microsoft/Phi-3.5-vision-instruct: Strong image-text-to-text model.
Explore all available models and find the one that suits you best here.
Using the API
Python
from huggingface_hub import InferenceClient

# Authenticate with your Hugging Face access token
client = InferenceClient(api_key="hf_***")

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"

# Stream the model's reply chunk by chunk
for message in client.chat_completion(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")
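If you prefer a single response instead of a stream, the same call works without stream=True. A minimal sketch, reusing client and image_url from the snippet above:

# Non-streaming variant: the full reply is returned as a single object
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
response = client.chat_completion(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=500,
)
print(response.choices[0].message.content)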
To use the Python client, see huggingface_hub's package reference.
API specification
For the API specification of conversational image-text-to-text models, please refer to the Chat Completion API documentation.
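For illustration, here is a sketch of the raw HTTP request behind the Python client, assuming the OpenAI-compatible /v1/chat/completions route is available for this model; the token and image URL are placeholders:

import requests

# Sketch of a raw chat completion request, assuming the OpenAI-compatible
# /v1/chat/completions route on the Serverless Inference API
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct/v1/chat/completions"
headers = {"Authorization": "Bearer hf_***"}  # placeholder token
payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder URL
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    "max_tokens": 500,
}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])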