Sapnous-6B: A Vision-Language Model for Enhanced World Perception

Sapnous-6B is a 6B-parameter vision-language model designed for multimodal perception and understanding of the world. It builds on earlier vision-language architectures and targets improvements in both performance and efficiency.

Model Architecture

Base Architecture: 6B parameters
Hidden Size: 4096
Attention Heads: 32
Key/Value Heads: 8
Hidden Layers: 28
Window Size: 32768
Vision Encoder:
- Depth: 32 layers
- Hidden Size: 1280
- Attention Heads: 16
- Patch Size: 14x14
- Window Size: 112
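
The figures above imply a few derived dimensions that are handy when reasoning about memory and attention layout. The sketch below is only arithmetic on the listed values, under stated assumptions: that each head's width is the hidden size divided by the number of heads, that the Key/Value head count indicates grouped-query attention, and that the vision encoder's window size of 112 is measured in pixels. None of these assumptions is confirmed by the model card, and the function names are illustrative.

```python
# Values copied from the architecture list above.
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 32
NUM_KEY_VALUE_HEADS = 8
VISION_HIDDEN_SIZE = 1280
VISION_ATTENTION_HEADS = 16
PATCH_SIZE = 14
VISION_WINDOW_SIZE = 112


def derived_dimensions():
    """Return dimensions implied by the listed configuration.

    Assumes head width = hidden size / head count, and that the vision
    window size is given in pixels (both are assumptions, not documented).
    """
    return {
        "head_dim": HIDDEN_SIZE // NUM_ATTENTION_HEADS,
        "query_heads_per_kv_head": NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS,
        "vision_head_dim": VISION_HIDDEN_SIZE // VISION_ATTENTION_HEADS,
        "patches_per_window_side": VISION_WINDOW_SIZE // PATCH_SIZE,
    }


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Count the whole patches that fit along each image axis."""
    return height // patch_size, width // patch_size


if __name__ == "__main__":
    for name, value in derived_dimensions().items():
        print(f"{name}: {value}")
    print("patch grid for a 448x448 image:", patch_grid(448, 448))
```

Under these assumptions, four query heads share each key/value head, which reduces the key/value cache to a quarter of what 32 independent key/value heads would need.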

Scores

📊 Benchmark Results

Multimodal Benchmarks

Benchmark	InternVL2.5-8B	MiniCPM-o 2.6	GPT-4o-mini	Qwen2-VL-7B	Qwen2.5-VL-7B	Sapnous-MoE (Updated)	Sapnous-6B
MMMU_val	56	50.4	60	54.1	58.6	64.4	60.2
MMMU-Pro_val	34.3	-	37.6	30.5	41.0	44.9	40.7
DocVQA_test	93	93	-	94.5	95.7	97.8	95.6
InfoVQA_test	77.6	-	-	76.5	82.6	88.7	81.9
ChartQA_test	84.8	-	-	83.0	87.3	94.2	87.2
TextVQA_val	79.1	80.1	-	84.3	84.9	91.2	84.6
OCRBench	822	852	785	845	864	929.0	861
CC_OCR	57.7	-	-	61.6	77.8	83.7	77.3
MMStar	62.8	-	-	60.7	63.9	69.3	63.6
MMBench-V1.1-En_test	79.4	78.0	76.0	80.7	82.6	89.6	82.4
MMT-Bench_test	-	-	-	63.7	63.6	69.0	63.3
MMStar	61.5	57.5	54.8	60.7	63.9	69.2	63.6
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0	67.1	73.3	67.2
HallBench_avg	45.2	48.1	46.1	50.6	52.9	58.0	52.5
MathVista_testmini	58.3	60.6	52.4	58.2	68.2	74.0	67.9
MathVision	-	-	-	16.3	25.07	27.7	24.8

Reasoning & Visual Understanding Benchmarks

Benchmark	Metric	Llama 3.2 11B	Llama 3.2 90B	Sapnous-MoE (Updated)	Sapnous-6B
VQAv2 (val)	Accuracy	66.8	73.6	80.3	74.1
Text VQA (val)	Relaxed accuracy	73.1	73.5	81.1	74.7
DocVQA (val, unseen)	ANLS	62.3	70.7	77.2	71.0
MMMU (val, 0-shot)	Micro average accuracy	41.7	49.3	55.4	49.2
ChartQA (test)	Accuracy	39.4	54.2	61.0	54.1
InfographicsQA (val, unseen)	ANLS	43.2	56.8	63.7	57.1
AI2 Diagram (test)	Accuracy	62.4	75.3	82.3	75.6
MMMU (val, CoT)	Micro average accuracy	50.7	60.3	66.5	60.6
MMMU-Pro, Standard (10 opts, test)	Accuracy	33.0	45.2	50.0	45.5
MMMU-Pro, Vision (test)	Accuracy	23.7	33.8	39.6	33.9
MathVista (testmini)	Accuracy	51.5	57.3	63.0	57.5
ChartQA (test, CoT)	Relaxed accuracy	83.4	85.5	93.3	86.0
AI2 Diagram (test)	Accuracy	91.1	92.3	100.9	93.5
DocVQA (test)	ANLS	88.4	90.1	98.9	91.3
VQAv2 (test)	Accuracy	75.2	78.1	86.0	79.0
MMLU (CoT)	Macro_avg/acc	73.0	86.0	94.3	87.0
MATH (CoT)	Final_em	51.9	68.0	75.2	68.5
GPQA	Accuracy	32.8	46.7	52.2	46.7
MGSM (CoT)	em	68.9	86.9	95.0	87.4

The model weights are distributed across 5 safetensors files for efficient loading and memory management. The layers and weights stored in each file are listed in model.safetensors.index.json.
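
To see which weights live in which file, the index can be inspected directly. The sketch below assumes the index follows the usual Hugging Face sharded-checkpoint layout, with a top-level "weight_map" object mapping tensor names to shard file names. The tensor and shard names in the example dictionary are hypothetical placeholders, not the actual contents of this repository.

```python
import json
from collections import Counter


def load_index(path="model.safetensors.index.json"):
    """Read a sharded-checkpoint index file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tensors_per_shard(index):
    """Count how many tensors the index assigns to each shard file."""
    return Counter(index["weight_map"].values())


# Hypothetical index used only for illustration; real names will differ.
example_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00005.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
        "lm_head.weight": "model-00005-of-00005.safetensors",
    },
}

if __name__ == "__main__":
    for shard, count in sorted(tensors_per_shard(example_index).items()):
        print(f"{shard}: {count} tensors")
```

With a local copy of the repository, `tensors_per_shard(load_index())` gives the same summary for the real index.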

Usage

from io import BytesIO

import requests
from PIL import Image
from transformers import pipeline

# Initialize the pipeline once so the model is not reloaded on every call.
pipe = pipeline("image-text-to-text", model="Sapnous-AI/Sapnous-VR-6B", trust_remote_code=True)

def process_image_from_url(image_url, text_prompt):
    """Processes an image from a URL using a Transformers pipeline."""
    try:
        # Fetch the image from the URL
        response = requests.get(image_url, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

        # Open the image using PIL and normalize it to RGB
        image = Image.open(BytesIO(response.content)).convert("RGB")

        # Process the image and text; the pipeline takes them as keyword arguments
        result = pipe(images=image, text=text_prompt)
        return result

    except requests.exceptions.RequestException as e:
        print(f"Error fetching image: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
image_url = "https://example.com/image.jpg"  # Replace with your image URL (scheme included).
text_prompt = "What is in this image?"

result = process_image_from_url(image_url, text_prompt)

if result is not None:
    print(result)

Model Capabilities

Multi-modal understanding and generation
Enhanced visual perception with advanced vision encoder
Efficient processing of long sequences
Robust performance across various vision-language tasks

Citations

@misc{sapnous-6b,
    title = {Sapnous-6B},
    author = {Sapnous AI Team},
    year = {2025}
}

@article{Sapnous6B,
    title={Sapnous-6B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
    author={Sapnous AI Team},
    year={2025}
}

@article{Sapnous-VR,
    title={Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
    author={Sapnous AI Team},
    year={2025}
}

License

Please refer to the LICENSE file for terms of use and distribution.
