---
license: apache-2.0
---

# Model Details

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings are not at the output of the network](https://ai.meta.com/research/publications/perception-encoder-the-best-visual-embeddings-are-not-at-the-output-of-the-network/)".

**Model Developer**: Meta

| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
|       | Text   | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
|       | Text   | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
|       | Text   | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |
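
For concreteness, the rows of the table can be written down as plain configuration objects. This is a minimal sketch derived only from the table above; `TowerConfig` and `PE_CONFIGS` are illustrative names, not the actual configuration format used in the perception_models repo.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TowerConfig:
    """One tower (vision or text) of a PE model; values mirror the table above."""
    params: str          # approximate parameter count
    width: int           # transformer hidden size
    depth: int           # number of transformer blocks
    mlp: int             # MLP hidden size
    heads: int           # attention heads
    clip_dim: int        # shared CLIP embedding dimension
    resolution: int      # input image resolution (square)
    patch_size: int      # ViT patch size
    context_length: int  # text context length in tokens

# Illustrative registry keyed by scale; not the repo's actual config format.
PE_CONFIGS = {
    "B": {"vision": TowerConfig("0.09B",  768, 12, 3072, 12, 1024, 224, 16, 32),
          "text":   TowerConfig("0.31B", 1024, 24, 4096, 16, 1024, 224, 16, 32)},
    "L": {"vision": TowerConfig("0.32B", 1024, 24, 4096, 16, 1024, 336, 14, 32),
          "text":   TowerConfig("0.31B", 1024, 24, 4096, 16, 1024, 336, 14, 32)},
    "G": {"vision": TowerConfig("1.88B", 1536, 50, 8960, 16, 1280, 448, 14, 72),
          "text":   TowerConfig("0.47B", 1280, 24, 5120, 20, 1280, 448, 14, 72)},
}

# For example, the G vision tower processes (448 / 14)**2 = 1024 patch tokens per image.
g_vision = PE_CONFIGS["G"]["vision"]
print((g_vision.resolution // g_vision.patch_size) ** 2)  # 1024
```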
# How to use
## PE codebase
We provide the pretraining code at https://github.com/facebookresearch/perception_models
You can find more details in the GitHub repo.
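
As a rough sketch of how a CLIP-style encoder such as PE is typically used for zero-shot classification (assuming the repo is installed per its instructions): embed an image and a set of candidate captions into the shared space, then compare them by cosine similarity. The names `load_pe_model`, `preprocess`, `tokenize`, `encode_image`, and `encode_text` below are hypothetical placeholders, not the repo's confirmed API; see the GitHub repo for the real entry points and checkpoint names.

```python
import torch
from PIL import Image

# Hypothetical loader: the real entry points live in the perception_models
# repo and may be named differently; this sketch only shows the data flow.
model, preprocess, tokenize = load_pe_model("PE-Core-G14-448")

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # (1, 3, 448, 448) for the G model
texts = tokenize(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    img_emb = model.encode_image(image)  # (1, 1280): CLIP dim for scale G
    txt_emb = model.encode_text(texts)   # (2, 1280)

# Cosine similarity in the shared embedding space -> zero-shot probabilities.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)
print(probs)  # higher probability for the caption that matches the image
```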
# Citation
If you find our code useful for your research, please consider citing:

@article{PE,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={},
  journal={arXiv:xxx.xxxxx},
  year={2025}
}