Xianhang nielsr (HF Staff) committed
Commit 4aa4979 · verified · 1 Parent(s): 53f494a

Add model card (#1)


- Add model card (368a57e60da1118cff4fe547bb61edb9677d8713)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +15 -0
README.md ADDED
@@ -0,0 +1,15 @@
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-feature-extraction
+ ---
+
+ This repository contains the OpenVision model, a fully-open, cost-effective family of advanced vision encoders for multimodal learning, as described in the paper [OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning](https://huggingface.co/papers/2505.04601).
+
+ **Abstract:**
+
+ OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing work -- e.g., CLIPS for the training framework and Recap-DataComp-1B for the training data -- while revealing multiple key insights for enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
+
+ Project Page: https://ucsc-vlaa.github.io/OpenVision
+
+ Code: https://github.com/UCSC-VLAA/OpenVision