# DeepEncoder (Extracted from DeepSeek-OCR)

## Overview

This directory contains the encoder components extracted from DeepSeek-OCR: a SAM ViT-B encoder, a CLIP-Large encoder, and a linear projector.

## Model Files

- `sam_encoder.pth`: SAM ViT-B encoder (95,569,152 params, 364.6 MB)
- `clip_encoder.pth`: CLIP-Large encoder (303,177,728 params, 1156.6 MB)
- `projector.pth`: Linear projector (2,622,720 params, 10.0 MB)
- `config.json`: Model configuration

**Total:** 401,369,600 parameters

## Architecture

```
Image (1024×1024) → SAM (95M) → 16× Conv → CLIP (303M) → Projector (3M) → 256 vision tokens
```

## Usage

```python
import torch
from easydict import EasyDict as adict

from deepencoder import build_sam_vit_b, build_clip_l, MlpProjector

# Load the SAM ViT-B encoder
sam = build_sam_vit_b(checkpoint=None)
sam.load_state_dict(torch.load('sam_encoder.pth'))

# Load the CLIP-Large encoder
clip = build_clip_l()
clip.load_state_dict(torch.load('clip_encoder.pth'))

# Load the linear projector (input_dim 2048 → n_embed 1280)
projector_cfg = adict({'projector_type': 'linear', 'input_dim': 2048, 'n_embed': 1280})
projector = MlpProjector(projector_cfg)
projector.load_state_dict(torch.load('projector.pth'))

# Run the encoder. `encode` is a placeholder for chaining the modules per the
# architecture diagram above (SAM → 16× Conv → CLIP → Projector); the actual
# wiring is implemented in the upstream DeepSeek-OCR code.
vision_tokens = encode(image)  # [1, 256, 1280]
```

A quick parameter-count check against the numbers listed above is sketched at the end of this README.

## Training

These weights are:

- Initialized from pretrained SAM (SA-1B) + CLIP (LAION-2B)
- Fine-tuned together on optical compression/OCR tasks
- Optimized for text preservation in compressed form

## Source

Extracted from: [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
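
## Sanity Check

After loading the three checkpoints from the Usage section, you can verify that the parameter counts match the numbers listed under Model Files. This is a minimal sketch, assuming `sam`, `clip`, and `projector` are the modules loaded above; `count_params` is a local helper written here, not part of the `deepencoder` API.

```python
def count_params(module):
    """Total number of parameters in a torch.nn.Module."""
    return sum(p.numel() for p in module.parameters())

# Expected values from the "Model Files" list above.
assert count_params(sam) == 95_569_152        # SAM ViT-B encoder
assert count_params(clip) == 303_177_728      # CLIP-Large encoder
assert count_params(projector) == 2_622_720   # linear projector

total = count_params(sam) + count_params(clip) + count_params(projector)
print(f"total parameters: {total:,}")  # 401,369,600
```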