Vision Agents
Vision Agents is an SDK for building real-time AI applications with support for text and vision language models, streaming responses, and video understanding.
Overview
Vision Agents natively supports Hugging Face Inference Providers, giving you access to thousands of models through a unified API. The integration supports multiple inference providers including Together AI, Groq, Cerebras, Replicate, and Fireworks.
The Hugging Face plugin offers two implementations:
- HuggingFace LLM: Text-only language model with streaming responses and function calling
- HuggingFace VLM: Vision language model with automatic video frame buffering for real-time video understanding
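Both implementations route their requests through Hugging Face Inference Providers. For context, a roughly equivalent direct call with the huggingface_hub client looks like the sketch below (you do not need this when using Vision Agents; the model ID is just the one reused in the examples later on):

```python
import os
from huggingface_hub import InferenceClient

# "auto" lets Inference Providers pick an available provider for the model;
# the client authenticates with the HF token from your environment (see Prerequisites).
client = InferenceClient(provider="auto", api_key=os.environ["HF_TOKEN"])

completion = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(completion.choices[0].message.content)
```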
Prerequisites
A Hugging Face account with an API token (needs “Make calls to Inference Providers” permission)
Export your token as an environment variable:
```bash
export HF_TOKEN=your_huggingface_token
```
Installation
Install Vision Agents with the Hugging Face plugin:
```bash
uv add vision-agents[huggingface]
```
Usage
LLM
When initializing the LLM, specify the model ID and optionally the provider:
```python
from vision_agents.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    provider="auto",  # Optional: "fastest", "cheapest", "groq", "together", etc.
)
```
Once llm is instantiated, you can use it like this:
response = await llm.simple_response("Hello, how are you?")
print(response.text)You can also register Python functions as tools to be called by the LLM:
```python
from vision_agents.plugins import huggingface

llm = huggingface.LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

@llm.register_function()
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is sunny."

response = await llm.simple_response("What's the weather in Paris?")
```
Looking for a compatible model? Check out this table for a full list of supported models with providers, pricing, context length, latency, throughput, etc.
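For reference, registered functions map onto the standard chat-completions tools schema that Inference Providers expose. Below is a hedged sketch of the equivalent raw request with huggingface_hub, with the JSON schema written by hand (the plugin presumably derives it from the function signature and docstring):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(provider="auto")  # uses your HF token from the environment

# Hand-written schema describing the same get_weather tool as above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
# If the model decides to call the tool, the call shows up here instead of plain text.
print(response.choices[0].message.tool_calls)
```

Whether a given provider honours the tools parameter depends on the model and provider, so check the supported-models table linked above.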
VLM
LLMs are nice, but now let's use Vision Agents to plug a VLM into a video stream for real-time video understanding!
```python
from vision_agents.plugins import huggingface, getstream, deepgram
from vision_agents.core import Agent, User

vlm = huggingface.VLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    fps=1,                    # frames per second to sample from the video stream
    frame_buffer_seconds=10,  # seconds of frames to keep as visual context
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful visual assistant.",
    llm=vlm,
)
```
Once the Agent is defined, simply ask the VLM what's in the video:
```python
response = await vlm.simple_response("What do you see?")
print(response.text)
```
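If you want continuous commentary rather than a one-off answer, you can call the same simple_response method in a loop while the frame buffer keeps refreshing. A minimal sketch, with an illustrative helper name and polling interval that are not part of the SDK:

```python
import asyncio

async def narrate(vlm, interval_seconds: float = 5.0) -> None:
    """Illustrative helper: describe the buffered video every few seconds."""
    while True:
        response = await vlm.simple_response("Briefly describe what you see right now.")
        print(response.text)
        await asyncio.sleep(interval_seconds)
```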