rwightman HF Staff committed on
Commit
60bed17
·
verified ·
1 Parent(s): dbd24bf
Files changed (4)
  1. README.md +154 -0
  2. config.json +33 -0
  3. model.safetensors +3 -0
  4. pytorch_model.bin +3 -0
README.md ADDED
---
tags:
- image-classification
- timm
library_name: timm
license: apache-2.0
datasets:
- laion-en
- laion-zh
- coyo
- grit
- coco
- textcaps
- objects365
- openimages
- all-seeing
- wukong-ocr
- laioncoco-ocr
- other-ocr
---
# Model card for vit_intern300m_patch14_448.ogvl_dist

An InternViT image feature model. Pretrained by the paper authors with distillation from [InternViT-6B](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) on a wide variety of image-text data. Model weights have been converted from the original release at [OpenGVLab/InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) to the `timm` ViT format. NOTE: this ViT has no final norm layer before the features / head.

## Model Details
- **Model Type:** Image classification / feature backbone
- **Model Stats:**
  - Params (M): 304.0
  - GMACs: 362.0
  - Activations (M): 656.4
  - Image size: 448 x 448
- **Papers:**
  - InternVL2: Better than the Best: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/
  - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks: https://arxiv.org/abs/2312.14238
- **Original:** https://github.com/OpenGVLab/InternVL
- **Dataset:**
  - LAION-en
  - LAION-zh
  - COYO
  - GRIT
  - COCO
  - TextCaps
  - Objects365
  - OpenImages
  - All-Seeing
  - Wukong-OCR
  - LaionCOCO-OCR
  - other-OCR

## Model Usage
### Image Classification
```python
from urllib.request import urlopen
from PIL import Image
import timm
import torch

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model('vit_intern300m_patch14_448.ogvl_dist', pretrained=True)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

top5_probabilities, top5_class_indices = torch.topk(output.softmax(dim=1) * 100, k=5)
```

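Note: per the `config.json` included in this commit, the pretrained config for this checkpoint sets `num_classes=0`, so the model created above has no classifier head and `output` is a `(1, 1024)` feature tensor rather than class logits; the softmax / top-5 step is only meaningful once a classifier is attached and fine-tuned (e.g. by passing `num_classes=...` to `timm.create_model`).
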
### Feature Map Extraction
```python
from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    features_only=True,
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # unsqueeze single image into batch of 1

for o in output:
    # print shape of each feature map in output
    # e.g.:
    #  torch.Size([1, 1024, 32, 32])
    #  torch.Size([1, 1024, 32, 32])
    #  torch.Size([1, 1024, 32, 32])
    print(o.shape)
```

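If only a subset of feature maps is needed, `timm`'s features API also accepts `out_indices`; a minimal sketch (the choice of indices here is illustrative, not from the original card):

```python
import timm

# select only the last three blocks as feature maps (illustrative choice)
model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    features_only=True,
    out_indices=(-3, -2, -1),
)
print(model.feature_info.channels())  # expected [1024, 1024, 1024]
```
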
### Image Embeddings
```python
from urllib.request import urlopen
from PIL import Image
import timm

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

model = timm.create_model(
    'vit_intern300m_patch14_448.ogvl_dist',
    pretrained=True,
    num_classes=0,  # remove classifier nn.Linear
)
model = model.eval()

# get model specific transforms (normalization, resize)
data_config = timm.data.resolve_model_data_config(model)
transforms = timm.data.create_transform(**data_config, is_training=False)

output = model(transforms(img).unsqueeze(0))  # output is (batch_size, num_features) shaped tensor

# or equivalently (without needing to set num_classes=0)
output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 1025, 1024) shaped tensor

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor
```

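Since this checkpoint uses `token` pooling (`global_pool` in the `config.json` below) and has no final norm before the head, the pooled embedding is essentially the class-token row of the unpooled output. A minimal sketch of splitting the unpooled features, assuming the class token is prepended at index 0 as in standard `timm` ViTs:

```python
feats = model.forward_features(transforms(img).unsqueeze(0))  # (1, 1025, 1024)

cls_token = feats[:, 0]      # (1, 1024), should match forward_head(feats, pre_logits=True) for 'token' pooling
patch_tokens = feats[:, 1:]  # (1, 1024, 1024), one embedding per 14x14 patch
patch_grid = patch_tokens.reshape(1, 32, 32, -1)  # (1, 32, 32, 1024) spatial grid at 448 / 14 = 32
```
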
## Citation
```bibtex
@article{chen2023internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2312.14238},
  year={2023}
}
```
config.json ADDED
{
  "architecture": "vit_intern300m_patch14_448",
  "num_classes": 0,
  "num_features": 1024,
  "global_pool": "token",
  "pretrained_cfg": {
    "tag": "ogvl_dist",
    "custom_load": false,
    "input_size": [3, 448, 448],
    "fixed_input_size": true,
    "interpolation": "bicubic",
    "crop_pct": 1.0,
    "crop_mode": "center",
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
    "num_classes": 0,
    "pool_size": null,
    "first_conv": "patch_embed.proj",
    "classifier": "head"
  }
}
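The `pretrained_cfg` above is what drives the preprocessing used in the README snippets; a minimal sketch (not part of the original card) for checking that the resolved `timm` data config matches these values:

```python
import timm

model = timm.create_model('vit_intern300m_patch14_448.ogvl_dist', pretrained=True)
data_config = timm.data.resolve_model_data_config(model)

print(data_config['input_size'])  # expected (3, 448, 448)
print(data_config['mean'], data_config['std'])  # expected ImageNet mean / std as listed above
print(data_config['crop_pct'], data_config['interpolation'])  # expected 1.0, 'bicubic'
```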
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:b726abd80bb291c5fd05a6c802af03069f105a0dcd812991e2bb7cde080576c6
size 1216081280
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8ba232a769d709813e4c73e4bb7aab201f610d93153ea3265183aee1e8ace7e2
size 1216173658
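Both weight files are stored via Git LFS, so the pointer files above record only the expected sha256 digest and size. A minimal sketch for verifying a downloaded `model.safetensors` against the recorded digest; the repo id used here is an assumption based on the model name and is not stated in this commit:

```python
import hashlib
from huggingface_hub import hf_hub_download

# repo id is an assumption; adjust if the checkpoint lives elsewhere
path = hf_hub_download('timm/vit_intern300m_patch14_448.ogvl_dist', 'model.safetensors')

# hash the file in 1 MiB chunks to avoid loading ~1.2 GB into memory at once
sha256 = hashlib.sha256()
with open(path, 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        sha256.update(chunk)

# expected digest from the LFS pointer above
assert sha256.hexdigest() == 'b726abd80bb291c5fd05a6c802af03069f105a0dcd812991e2bb7cde080576c6'
```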