Moondream is a small vision language model designed to run efficiently everywhere.
This repository contains the latest (2025-06-21) release of Moondream, as well as historical releases. The model is updated frequently, so we recommend specifying a revision as shown below if you're using it in a production application.
## Usage

```python
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",
    trust_remote_code=True,
    device_map={"": "cuda"}  # ...or 'mps', on Apple Silicon
)

# Load the image to run inference on
image = Image.open("<IMAGE_PATH>")

# Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])
print("\nNormal caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    # Streaming generation example, supported for caption() and query()
    print(t, end="", flush=True)
print()
# Visual Querying
print("\nVisual query: 'How many people are in the image?'")
print(model.query(image, "How many people are in the image?")["answer"])
# Object Detection
print("\nObject detection: 'face'")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")
# Pointing
print("\nPointing: 'person'")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
```
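Both `detect()` and `point()` return coordinates normalized to the image dimensions. Below is a minimal sketch converting them to pixel coordinates; the field names (`x_min`/`y_min`/`x_max`/`y_max` for boxes, `x`/`y` for points) are assumptions based on the current output format, so inspect the returned dicts if your revision differs:

```python
# Scale normalized [0, 1] coordinates to pixels using the image size.
# Assumes boxes carry x_min/y_min/x_max/y_max and points carry x/y.
w, h = image.size

for obj in objects:
    box = (obj["x_min"] * w, obj["y_min"] * h, obj["x_max"] * w, obj["y_max"] * h)
    print(f"face box (pixels): {tuple(round(v) for v in box)}")

for pt in points:
    print(f"person point (pixels): ({round(pt['x'] * w)}, {round(pt['y'] * h)})")
```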
## Changelog
### 2025-06-21
(release notes coming soon)
### 2025-04-15 (full release notes)

- Improved chart understanding (ChartQA up from 74.8 to 77.5, 82.2 with PoT)
- Added temperature and nucleus sampling to reduce repetitive outputs
- Better OCR for documents and tables (prompt with "Transcribe the text" or "Transcribe the text in natural reading order"); see the sketch after this list
- Object detection supports document layout detection (figure, formula, text, etc.)
- UI understanding (ScreenSpot F1@0.5 up from 53.3 to 60.3)
- Improved text understanding (DocVQA up from 76.5 to 79.3, TextVQA up from 74.6 to 76.3)
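
The transcription prompts go through the regular `query()` API rather than a dedicated method. A minimal sketch, reusing the `model` and `image` from the usage section; the prompt strings come straight from the release notes:

```python
# OCR for documents and tables via visual querying.
print(model.query(image, "Transcribe the text in natural reading order")["answer"])
```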
 
### 2025-03-27 (full release notes)

- Added support for long-form captioning
- Open vocabulary image tagging
- Improved counting accuracy (e.g. CountBenchQA increased from 80 to 86.4)
- Improved text understanding (e.g. OCRBench increased from 58.3 to 61.2)
- Improved object detection, especially for small objects (e.g. COCO up from 30.5 to 51.2)
- Fixed token streaming bug affecting multi-byte unicode characters
- gpt-fast style `compile()` now supported in HF Transformers implementation; see the sketch below
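
For the `compile()` support, a minimal sketch; this assumes compilation is exposed as a `compile()` method on the loaded model (check the revision's remote code for the exact entry point), and note that the first generation after compiling is slow while kernels warm up:

```python
# gpt-fast style compilation: the first call pays the compile cost,
# subsequent calls run faster. compile() as a model method is an assumption.
model.compile()
print(model.caption(image, length="short")["caption"])  # triggers compilation, then runs
```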