Update README.md
README.md
CHANGED
@@ -1,3 +1,60 @@
---
license: apache-2.0
---

# Model Details

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings are hidden inside the network](https://TBC)".

**Model Developer**: Meta

**Model Overview**: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but also internally produces strong, general features that scale for downstream tasks. With alignment tuning that capitalizes on those general features, PE allows large-scale contrastive pretraining to transfer to downstream tasks.

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />

| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 384 | 16 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 384 | 16 | 32 |
| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 392 | 14 | 72 |
| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 392 | 14 | 72 |

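
As a quick reading aid for the table, the Resolution and Patch Size columns together determine how many patch tokens a vision tower sees per image. The sketch below assumes square inputs split into non-overlapping square patches and does not count any extra tokens (such as a class token); the helper name `patch_grid` is ours for illustration and is not part of the PE codebase.

```python
# Vision-tower (resolution, patch size) pairs copied from the table above.
VISION_CONFIGS = {
    "B": (384, 16),
    "L": (336, 14),
    "G": (392, 14),
}


def patch_grid(resolution: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image.

    Assumption: non-overlapping patches and a resolution that is an exact
    multiple of the patch size.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size
    return side, side * side


for scale, (resolution, patch_size) in VISION_CONFIGS.items():
    side, total = patch_grid(resolution, patch_size)
    print(f"{scale}: {side} x {side} grid, {total} patch tokens")
```

Under these assumptions the B and L configurations both yield a 24 x 24 grid, while G yields a 28 x 28 grid.
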
# How to use

## PE codebase

We provide the pretraining code at https://github.com/meta-ai-research-fair/occhi.git

You can find more details in the GitHub repo.

# Evaluation

We evaluate the pretrained PE models on zero-shot image and video tasks.

## Zero-Shot Image Results

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_image.png" style="width: 100%; margin: 0;" />

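
For readers new to the protocol, zero-shot image classification with a contrastive image-text model is commonly scored by comparing an image embedding against text embeddings of the class names and picking the closest one. The following is a minimal sketch of that general idea, not the PE evaluation code: the three-dimensional vectors are toy placeholders standing in for real encoder outputs, and `zero_shot_predict`, the class names, and the temperature value are our own assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, class_text_embs, temperature=100.0):
    """Return (best class name, class probabilities) for one image embedding.

    `temperature` scales the cosine similarities before the softmax; the
    value used here is an assumption, not a PE setting.
    """
    names = list(class_text_embs)
    logits = [temperature * cosine(image_emb, class_text_embs[n]) for n in names]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    return max(probs, key=probs.get), probs


# Toy placeholders: in practice these would come from the text and vision towers.
class_text_embs = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image_emb = [0.1, 0.9, 0.2]  # closest in direction to the "dog" text embedding

label, probs = zero_shot_predict(image_emb, class_text_embs)
print(label)
```

No labeled training images are needed: the class set can be changed simply by swapping the text embeddings.
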
## Zero-Shot Video Results

<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_video.png" style="width: 90%; margin: 0" />

# Citation

If you find our code useful for your research, please consider citing:

@article{PE,
  title={Perception Encoder},
  author={},
  journal={arXiv:xxx.xxxxx},
  year={2025}
}