Sapnous-VR-T1
Collection
Sapnous-VR : AIRAS'S models for visual reasoning
β’
4 items
β’
Updated
Sapnous-6B is a state-of-the-art vision-language model designed to enhance perception and understanding of the world through advanced multimodal capabilities. This model builds upon the success of previous vision-language architectures while introducing novel improvements in performance and efficiency.
Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B | Sapnous-MoE (Updated) | Sapnous-6B |
---|---|---|---|---|---|---|---|
MMMU_val | 56 | 50.4 | 60 | 54.1 | 58.6 | 64.4 | 60.2 |
MMMU-Pro_val | 34.3 | - | 37.6 | 30.5 | 41.0 | 44.9 | 40.7 |
DocVQA_test | 93 | 93 | - | 94.5 | 95.7 | 97.8 | 95.6 |
InfoVQA_test | 77.6 | - | - | 76.5 | 82.6 | 88.7 | 81.9 |
ChartQA_test | 84.8 | - | - | 83.0 | 87.3 | 94.2 | 87.2 |
TextVQA_val | 79.1 | 80.1 | - | 84.3 | 84.9 | 91.2 | 84.6 |
OCRBench | 822 | 852 | 785 | 845 | 864 | 929.0 | 861 |
CC_OCR | 57.7 | - | - | 61.6 | 77.8 | 83.7 | 77.3 |
MMStar | 62.8 | - | - | 60.7 | 63.9 | 69.3 | 63.6 |
MMBench-V1.1-En_test | 79.4 | 78.0 | 76.0 | 80.7 | 82.6 | 89.6 | 82.4 |
MMT-Bench_test | - | - | - | 63.7 | 63.6 | 69.0 | 63.3 |
MMStar | 61.5 | 57.5 | 54.8 | 60.7 | 63.9 | 69.2 | 63.6 |
MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | 67.1 | 73.3 | 67.2 |
HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 | 52.9 | 58.0 | 52.5 |
MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 | 68.2 | 74.0 | 67.9 |
MathVision | - | - | - | 16.3 | 25.07 | 27.7 | 24.8 |
Benchmark | # Shots | Metric | Llama 3.2 11B | Llama 3.2 90B | Sapnous-MoE (Updated) | Sapnous-6B |
---|---|---|---|---|---|---|
VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 | 80.3 | 74.1 |
Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 | 81.1 | 74.7 |
DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 | 77.2 | 71.0 |
MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 | 55.4 | 49.2 |
ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 | 61.0 | 54.1 |
InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 | 63.7 | 57.1 |
AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 | 82.3 | 75.6 |
MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 | 66.5 | 60.6 |
MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 | 50.0 | 45.5 |
MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 | 39.6 | 33.9 |
MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 | 63.0 | 57.5 |
ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 | 93.3 | 86.0 |
AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 | 100.9 | 93.5 |
DocVQA (test) | 0 | ANLS | 88.4 | 90.1 | 98.9 | 91.3 |
VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 | 86.0 | 79.0 |
MMLU (CoT) | 0 | Macro_avg/acc | 73.0 | 86.0 | 94.3 | 87.0 |
MATH (CoT) | 0 | Final_em | 51.9 | 68.0 | 75.2 | 68.5 |
GPQA | 0 | Accuracy | 32.8 | 46.7 | 52.2 | 46.7 |
MGSM (CoT) | 0 | em | 68.9 | 86.9 | 95.0 | 87.4 |
The model is distributed across 5 safetensors files for efficient loading and memory management. Each file contains specific layers and weights as documented in the model.safetensors.index.json.
from transformers import pipeline
import requests
from PIL import Image
from io import BytesIO
def process_image_from_url(image_url, text_prompt):
"""Processes an image from a URL using a Transformers pipeline."""
try:
# Fetch the image from the URL
response = requests.get(image_url, stream=True)
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
# Open the image using PIL
image = Image.open(BytesIO(response.content))
# Create the input for the pipeline
inputs = {"image": image, "text": text_prompt}
# Initialize the pipeline
pipe = pipeline("image-text-to-text", model="Sapnous-AI/Sapnous-VR-6B", trust_remote_code=True)
# Process the image and text
result = pipe(inputs)
return result
except requests.exceptions.RequestException as e:
print(f"Error fetching image: {e}")
return None
except Exception as e:
print(f"An error occurred: {e}")
return None
# Example usage
image_url = "example.com" #replace with your image url.
text_prompt = "What is in this image?"
result = process_image_from_url(image_url, text_prompt)
if result:
print(result)
@misc{sapnous-6b,
title = {Sapnous-6B},
author = {Sapnous AI Team},
year = {2025}
}
@article{Sapnous6B,
title={Sapnous-6B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Sapnous AI Team},
year={2025}
}
@article{Sapnous-VR,
title={Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Sapnous AI Team},
year={2025}
}
Please refer to the LICENSE file for terms of use and distribution.