tuandunghcmut committed
Commit 4bb09e0 (verified) · 1 Parent(s): 391089d

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full set of changes.
Files changed (50)
  1. groundingLMM/GLaMM-FullScope/.gitattributes +35 -0
  2. groundingLMM/GLaMM-FullScope/README.md +33 -0
  3. groundingLMM/GLaMM-FullScope/added_tokens.json +9 -0
  4. groundingLMM/GLaMM-FullScope/config.json +60 -0
  5. groundingLMM/GLaMM-FullScope/generation_config.json +9 -0
  6. groundingLMM/GLaMM-FullScope/pytorch_model.bin.index.json +975 -0
  7. groundingLMM/GLaMM-FullScope/special_tokens_map.json +24 -0
  8. groundingLMM/GLaMM-FullScope/tokenizer_config.json +33 -0
  9. groundingLMM/GranD/README.md +73 -0
  10. groundingLMM/GranD/run_pipeline.sh +178 -0
  11. groundingLMM/LLaVA/.dockerignore +21 -0
  12. groundingLMM/LLaVA/.editorconfig +18 -0
  13. groundingLMM/LLaVA/.gitattributes +29 -0
  14. groundingLMM/LLaVA/.gitignore +35 -0
  15. groundingLMM/LLaVA/LICENSE +201 -0
  16. groundingLMM/LLaVA/README.md +463 -0
  17. groundingLMM/LLaVA/cog.yaml +37 -0
  18. groundingLMM/LLaVA/predict.py +155 -0
  19. groundingLMM/LLaVA/pyproject.toml +37 -0
  20. groundingLMM/dataset/dataset.py +236 -0
  21. groundingLMM/docs/GranD.md +53 -0
  22. groundingLMM/docs/datasets.md +327 -0
  23. groundingLMM/docs/evaluation.md +75 -0
  24. groundingLMM/docs/install.md +34 -0
  25. groundingLMM/docs/model_zoo.md +21 -0
  26. groundingLMM/docs/offline_demo.md +51 -0
  27. groundingLMM/docs/training.md +83 -0
  28. groundingLMM/eval/region_captioning/evaluate.py +51 -0
  29. groundingLMM/eval/region_captioning/infer.py +188 -0
  30. groundingLMM/eval/region_captioning/run_evaluation_VG.sh +28 -0
  31. groundingLMM/gradio-dev/.dockerignore +40 -0
  32. groundingLMM/gradio-dev/.editorconfig +8 -0
  33. groundingLMM/gradio-dev/.gitignore +65 -0
  34. groundingLMM/gradio-dev/CHANGELOG.md +0 -0
  35. groundingLMM/gradio-dev/CITATION.cff +45 -0
  36. groundingLMM/gradio-dev/CONTRIBUTING.md +138 -0
  37. groundingLMM/gradio-dev/LICENSE +201 -0
  38. groundingLMM/gradio-dev/README.md +94 -0
  39. groundingLMM/gradio-dev/README_old.md +290 -0
  40. groundingLMM/gradio-dev/SECURITY.md +5 -0
  41. groundingLMM/gradio-dev/app_box.py +18 -0
  42. groundingLMM/gradio-dev/globals.d.ts +31 -0
  43. groundingLMM/gradio-dev/package.json +85 -0
  44. groundingLMM/gradio-dev/pnpm-lock.yaml +0 -0
  45. groundingLMM/gradio-dev/pnpm-workspace.yaml +3 -0
  46. groundingLMM/gradio-dev/pyproject.toml +113 -0
  47. groundingLMM/gradio-dev/readme_template.md +68 -0
  48. groundingLMM/gradio-dev/render_readme.py +39 -0
  49. groundingLMM/gradio-dev/requirements.txt +26 -0
  50. groundingLMM/gradio-dev/style.md +160 -0
groundingLMM/GLaMM-FullScope/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
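
The `.gitattributes` rules above route large binary artifacts (checkpoints, archives, TensorBoard event files) through Git LFS rather than plain Git. As a rough illustration of how such glob patterns select files, here is a minimal Python sketch; it uses `fnmatch` on a small subset of the patterns as a stand-in for git's own attribute matching (which has extra rules, e.g. for `saved_model/**/*`), so treat it as illustrative only.

```python
# Illustrative only: approximate a few of the .gitattributes LFS patterns with fnmatch.
from fnmatch import fnmatch

LFS_PATTERNS = ["*.bin", "*.pt", "*.pth", "*.safetensors", "*.zip", "*tfevents*"]

def is_lfs_tracked(path: str) -> bool:
    """Return True if the file name matches one of the pattern subset above."""
    name = path.split("/")[-1]
    return any(fnmatch(name, pat) for pat in LFS_PATTERNS)

print(is_lfs_tracked("pytorch_model-00001-of-00002.bin"))  # True
print(is_lfs_tracked("config.json"))                       # False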
groundingLMM/GLaMM-FullScope/README.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ license: apache-2.0
+ ---
+
+ # 👁️ GLaMM-FullScope
+
+ ---
+ ## 📝 Description
+ GLaMM-FullScope encompasses all capabilities of GLaMM and is fine-tuned on a mix of many open-source datasets. "Full" signifies its comprehensive nature, incorporating the full range of GLaMM capabilities, including
+ Grounded Conversation Generation (GCG), Referring Expression Segmentation, Region-level Captioning, Image-level Captioning and Visual Question Answering.
+
+
+ ## 💻 Download
+ To get started with GLaMM-FullScope, follow these steps:
+ ```
+ git lfs install
+ git clone https://huggingface.co/MBZUAI/GLaMM-FullScope
+ ```
+
+ ## 📚 Additional Resources
+ - **Paper:** [ArXiv](https://arxiv.org/abs/2311.03356).
+ - **GitHub Repository:** For training and updates: [GitHub - GLaMM](https://github.com/mbzuai-oryx/groundingLMM).
+ - **Project Page:** For a detailed overview and insights into the project, visit our [Project Page - GLaMM](https://mbzuai-oryx.github.io/groundingLMM/).
+
+ ## 📜 Citations and Acknowledgments
+
+ ```bibtex
+ @article{hanoona2023GLaMM,
+ title={GLaMM: Pixel Grounding Large Multimodal Model},
+ author={Rasheed, Hanoona and Maaz, Muhammad and Shaji, Sahal and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Xing, Eric and Yang, Ming-Hsuan and Khan, Fahad S.},
+ journal={ArXiv 2311.03356},
+ year={2023}
+ }
+ ```
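
The README's download section uses `git lfs` directly. As a hedged alternative, the same files can be fetched from Python with `huggingface_hub.snapshot_download`; the snippet below assumes `huggingface_hub` is installed and the repository is publicly readable, and the `local_dir` name is only an example.

```python
# Alternative to the git-lfs clone shown in the README: fetch the same repository
# with huggingface_hub (assumes `pip install huggingface_hub` and a public repo).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="MBZUAI/GLaMM-FullScope",  # repo id from the README above
    local_dir="GLaMM-FullScope",       # example destination for config, tokenizer files, shards
)
print("Model files downloaded to:", local_path)
```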
groundingLMM/GLaMM-FullScope/added_tokens.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "</p>": 32006,
+ "<bbox>": 32002,
+ "<im_end>": 32001,
+ "<im_start>": 32000,
+ "<p>": 32005,
+ "<point>": 32003,
+ "[SEG]": 32004
+ }
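
`added_tokens.json` registers the grounding-specific special tokens (`<bbox>`, `<point>`, `[SEG]`, the `<p>`/`</p>` phrase markers and the image delimiters) on top of the base 32,000-token LLaMA vocabulary. A small sanity-check sketch follows; it assumes a local clone at `GLaMM-FullScope/` and simply verifies that the seven ids are contiguous and line up with the `vocab_size` of 32007 declared in `config.json`.

```python
# Consistency check (assumes a local clone of the repo in ./GLaMM-FullScope):
# the seven added tokens occupy ids 32000-32006, extending the 32000-token
# base LLaMA vocabulary to the vocab_size of 32007 declared in config.json.
import json

with open("GLaMM-FullScope/added_tokens.json") as f:
    added = json.load(f)

ids = sorted(added.values())
assert ids == list(range(32000, 32000 + len(added))), "added-token ids should be contiguous"
print(f"{len(added)} added tokens, ids {ids[0]}..{ids[-1]} -> expected vocab_size {ids[-1] + 1}")
```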
groundingLMM/GLaMM-FullScope/config.json ADDED
@@ -0,0 +1,60 @@
+ {
+ "_name_or_path": "MBZUAI/GLaMM-GranD-Pretrained",
+ "architectures": [
+ "GLaMMForCausalLM"
+ ],
+ "bbox_token_idx": 32002,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "freeze_mlp_adapter": true,
+ "freeze_mm_mlp_adapter": false,
+ "freeze_mm_vision_resampler": false,
+ "hidden_act": "silu",
+ "hidden_size": 4096,
+ "image_aspect": "square",
+ "image_aspect_ratio": "square",
+ "image_grid_pinpoints": null,
+ "image_grid_points": null,
+ "initializer_range": 0.02,
+ "intermediate_size": 11008,
+ "max_length": 4096,
+ "max_position_embeddings": 4096,
+ "mm_hidden_size": 1024,
+ "mm_projector_type": "mlp2x_gelu",
+ "mm_resampler_type": null,
+ "mm_use_im_patch_token": false,
+ "mm_use_im_start_end": true,
+ "mm_use_image_start_end": true,
+ "mm_vision_module": "openai/clip-vit-large-patch14-336",
+ "mm_vision_select_feature": "patch",
+ "mm_vision_select_layer": -2,
+ "mm_vision_tower": "openai/clip-vit-large-patch14-336",
+ "model_type": "llava",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 32,
+ "num_level_reg_features": 4,
+ "num_reg_features": 4,
+ "out_dim": 256,
+ "pad_token_id": 0,
+ "pretrain_mm_mlp_adapter": null,
+ "pretraining_tp": 1,
+ "rms_norm_eps": 1e-05,
+ "rope_scaling": null,
+ "select_feature_type": "patch",
+ "tie_word_embeddings": false,
+ "torch_dtype": "bfloat16",
+ "train_mask_decoder": true,
+ "transformers_version": "4.28.0.dev0",
+ "tune_mlp_adapter": false,
+ "tune_mm_mlp_adapter": false,
+ "tune_mm_vision_resampler": false,
+ "unfreeze_mm_vision_tower": false,
+ "use_cache": false,
+ "use_image_patch_token": false,
+ "use_mm_proj": true,
+ "vision_module": "openai/clip-vit-large-patch14-336",
+ "vision_tower": "openai/clip-vit-large-patch14-336",
+ "vocab_size": 32007,
+ "with_region": true
+ }
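
`config.json` describes a LLaVA-style model: a LLaMA-7B-scale language backbone (hidden size 4096, 32 layers, 32 attention heads), a CLIP ViT-L/14-336 vision tower, an `mlp2x_gelu` projector, and GLaMM-specific flags such as `with_region` and `train_mask_decoder`. The sketch below, which again assumes a local clone at `GLaMM-FullScope/`, just loads the file and prints the fields that identify this configuration.

```python
# Minimal sketch: read config.json from a local clone and surface the fields that
# identify the architecture (LLaMA-7B-scale LLaVA backbone + CLIP ViT-L/14-336 tower).
import json

with open("GLaMM-FullScope/config.json") as f:
    cfg = json.load(f)

for key in ("model_type", "hidden_size", "num_hidden_layers", "num_attention_heads",
            "vision_tower", "mm_projector_type", "vocab_size", "torch_dtype"):
    print(f"{key:>22}: {cfg[key]}")

head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]   # 4096 / 32 = 128
print(f"{'head_dim':>22}: {head_dim}")
```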
groundingLMM/GLaMM-FullScope/generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "max_length": 4096,
+ "pad_token_id": 0,
+ "transformers_version": "4.28.0.dev0",
+ "use_cache": false
+ }
groundingLMM/GLaMM-FullScope/pytorch_model.bin.index.json ADDED
@@ -0,0 +1,975 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 16752883392
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00002-of-00002.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
8
+ "model.grounding_encoder.image_encoder.blocks.0.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
9
+ "model.grounding_encoder.image_encoder.blocks.0.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
10
+ "model.grounding_encoder.image_encoder.blocks.0.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
11
+ "model.grounding_encoder.image_encoder.blocks.0.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
12
+ "model.grounding_encoder.image_encoder.blocks.0.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
13
+ "model.grounding_encoder.image_encoder.blocks.0.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
14
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
15
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
16
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
17
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
18
+ "model.grounding_encoder.image_encoder.blocks.0.norm1.bias": "pytorch_model-00002-of-00002.bin",
19
+ "model.grounding_encoder.image_encoder.blocks.0.norm1.weight": "pytorch_model-00002-of-00002.bin",
20
+ "model.grounding_encoder.image_encoder.blocks.0.norm2.bias": "pytorch_model-00002-of-00002.bin",
21
+ "model.grounding_encoder.image_encoder.blocks.0.norm2.weight": "pytorch_model-00002-of-00002.bin",
22
+ "model.grounding_encoder.image_encoder.blocks.1.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
23
+ "model.grounding_encoder.image_encoder.blocks.1.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
24
+ "model.grounding_encoder.image_encoder.blocks.1.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
25
+ "model.grounding_encoder.image_encoder.blocks.1.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
26
+ "model.grounding_encoder.image_encoder.blocks.1.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
27
+ "model.grounding_encoder.image_encoder.blocks.1.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
28
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
29
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
30
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
31
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
32
+ "model.grounding_encoder.image_encoder.blocks.1.norm1.bias": "pytorch_model-00002-of-00002.bin",
33
+ "model.grounding_encoder.image_encoder.blocks.1.norm1.weight": "pytorch_model-00002-of-00002.bin",
34
+ "model.grounding_encoder.image_encoder.blocks.1.norm2.bias": "pytorch_model-00002-of-00002.bin",
35
+ "model.grounding_encoder.image_encoder.blocks.1.norm2.weight": "pytorch_model-00002-of-00002.bin",
36
+ "model.grounding_encoder.image_encoder.blocks.10.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
37
+ "model.grounding_encoder.image_encoder.blocks.10.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
38
+ "model.grounding_encoder.image_encoder.blocks.10.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
39
+ "model.grounding_encoder.image_encoder.blocks.10.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
40
+ "model.grounding_encoder.image_encoder.blocks.10.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
41
+ "model.grounding_encoder.image_encoder.blocks.10.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
42
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
43
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
44
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
45
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
46
+ "model.grounding_encoder.image_encoder.blocks.10.norm1.bias": "pytorch_model-00002-of-00002.bin",
47
+ "model.grounding_encoder.image_encoder.blocks.10.norm1.weight": "pytorch_model-00002-of-00002.bin",
48
+ "model.grounding_encoder.image_encoder.blocks.10.norm2.bias": "pytorch_model-00002-of-00002.bin",
49
+ "model.grounding_encoder.image_encoder.blocks.10.norm2.weight": "pytorch_model-00002-of-00002.bin",
50
+ "model.grounding_encoder.image_encoder.blocks.11.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
51
+ "model.grounding_encoder.image_encoder.blocks.11.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
52
+ "model.grounding_encoder.image_encoder.blocks.11.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
53
+ "model.grounding_encoder.image_encoder.blocks.11.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
54
+ "model.grounding_encoder.image_encoder.blocks.11.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
55
+ "model.grounding_encoder.image_encoder.blocks.11.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
56
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
57
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
58
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
59
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
60
+ "model.grounding_encoder.image_encoder.blocks.11.norm1.bias": "pytorch_model-00002-of-00002.bin",
61
+ "model.grounding_encoder.image_encoder.blocks.11.norm1.weight": "pytorch_model-00002-of-00002.bin",
62
+ "model.grounding_encoder.image_encoder.blocks.11.norm2.bias": "pytorch_model-00002-of-00002.bin",
63
+ "model.grounding_encoder.image_encoder.blocks.11.norm2.weight": "pytorch_model-00002-of-00002.bin",
64
+ "model.grounding_encoder.image_encoder.blocks.12.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
65
+ "model.grounding_encoder.image_encoder.blocks.12.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
66
+ "model.grounding_encoder.image_encoder.blocks.12.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
67
+ "model.grounding_encoder.image_encoder.blocks.12.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
68
+ "model.grounding_encoder.image_encoder.blocks.12.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
69
+ "model.grounding_encoder.image_encoder.blocks.12.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
70
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
71
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
72
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
73
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
74
+ "model.grounding_encoder.image_encoder.blocks.12.norm1.bias": "pytorch_model-00002-of-00002.bin",
75
+ "model.grounding_encoder.image_encoder.blocks.12.norm1.weight": "pytorch_model-00002-of-00002.bin",
76
+ "model.grounding_encoder.image_encoder.blocks.12.norm2.bias": "pytorch_model-00002-of-00002.bin",
77
+ "model.grounding_encoder.image_encoder.blocks.12.norm2.weight": "pytorch_model-00002-of-00002.bin",
78
+ "model.grounding_encoder.image_encoder.blocks.13.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
79
+ "model.grounding_encoder.image_encoder.blocks.13.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
80
+ "model.grounding_encoder.image_encoder.blocks.13.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
81
+ "model.grounding_encoder.image_encoder.blocks.13.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
82
+ "model.grounding_encoder.image_encoder.blocks.13.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
83
+ "model.grounding_encoder.image_encoder.blocks.13.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
84
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
85
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
86
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
87
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
88
+ "model.grounding_encoder.image_encoder.blocks.13.norm1.bias": "pytorch_model-00002-of-00002.bin",
89
+ "model.grounding_encoder.image_encoder.blocks.13.norm1.weight": "pytorch_model-00002-of-00002.bin",
90
+ "model.grounding_encoder.image_encoder.blocks.13.norm2.bias": "pytorch_model-00002-of-00002.bin",
91
+ "model.grounding_encoder.image_encoder.blocks.13.norm2.weight": "pytorch_model-00002-of-00002.bin",
92
+ "model.grounding_encoder.image_encoder.blocks.14.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
93
+ "model.grounding_encoder.image_encoder.blocks.14.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
94
+ "model.grounding_encoder.image_encoder.blocks.14.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
95
+ "model.grounding_encoder.image_encoder.blocks.14.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
96
+ "model.grounding_encoder.image_encoder.blocks.14.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
97
+ "model.grounding_encoder.image_encoder.blocks.14.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
98
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
99
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
100
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
101
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
102
+ "model.grounding_encoder.image_encoder.blocks.14.norm1.bias": "pytorch_model-00002-of-00002.bin",
103
+ "model.grounding_encoder.image_encoder.blocks.14.norm1.weight": "pytorch_model-00002-of-00002.bin",
104
+ "model.grounding_encoder.image_encoder.blocks.14.norm2.bias": "pytorch_model-00002-of-00002.bin",
105
+ "model.grounding_encoder.image_encoder.blocks.14.norm2.weight": "pytorch_model-00002-of-00002.bin",
106
+ "model.grounding_encoder.image_encoder.blocks.15.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
107
+ "model.grounding_encoder.image_encoder.blocks.15.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
108
+ "model.grounding_encoder.image_encoder.blocks.15.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
109
+ "model.grounding_encoder.image_encoder.blocks.15.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
110
+ "model.grounding_encoder.image_encoder.blocks.15.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
111
+ "model.grounding_encoder.image_encoder.blocks.15.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
112
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
113
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
114
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
115
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
116
+ "model.grounding_encoder.image_encoder.blocks.15.norm1.bias": "pytorch_model-00002-of-00002.bin",
117
+ "model.grounding_encoder.image_encoder.blocks.15.norm1.weight": "pytorch_model-00002-of-00002.bin",
118
+ "model.grounding_encoder.image_encoder.blocks.15.norm2.bias": "pytorch_model-00002-of-00002.bin",
119
+ "model.grounding_encoder.image_encoder.blocks.15.norm2.weight": "pytorch_model-00002-of-00002.bin",
120
+ "model.grounding_encoder.image_encoder.blocks.16.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
121
+ "model.grounding_encoder.image_encoder.blocks.16.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
122
+ "model.grounding_encoder.image_encoder.blocks.16.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
123
+ "model.grounding_encoder.image_encoder.blocks.16.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
124
+ "model.grounding_encoder.image_encoder.blocks.16.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
125
+ "model.grounding_encoder.image_encoder.blocks.16.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
126
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
127
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
128
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
129
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
130
+ "model.grounding_encoder.image_encoder.blocks.16.norm1.bias": "pytorch_model-00002-of-00002.bin",
131
+ "model.grounding_encoder.image_encoder.blocks.16.norm1.weight": "pytorch_model-00002-of-00002.bin",
132
+ "model.grounding_encoder.image_encoder.blocks.16.norm2.bias": "pytorch_model-00002-of-00002.bin",
133
+ "model.grounding_encoder.image_encoder.blocks.16.norm2.weight": "pytorch_model-00002-of-00002.bin",
134
+ "model.grounding_encoder.image_encoder.blocks.17.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
135
+ "model.grounding_encoder.image_encoder.blocks.17.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
136
+ "model.grounding_encoder.image_encoder.blocks.17.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
137
+ "model.grounding_encoder.image_encoder.blocks.17.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
138
+ "model.grounding_encoder.image_encoder.blocks.17.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
139
+ "model.grounding_encoder.image_encoder.blocks.17.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
140
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
141
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
142
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
143
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
144
+ "model.grounding_encoder.image_encoder.blocks.17.norm1.bias": "pytorch_model-00002-of-00002.bin",
145
+ "model.grounding_encoder.image_encoder.blocks.17.norm1.weight": "pytorch_model-00002-of-00002.bin",
146
+ "model.grounding_encoder.image_encoder.blocks.17.norm2.bias": "pytorch_model-00002-of-00002.bin",
147
+ "model.grounding_encoder.image_encoder.blocks.17.norm2.weight": "pytorch_model-00002-of-00002.bin",
148
+ "model.grounding_encoder.image_encoder.blocks.18.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
149
+ "model.grounding_encoder.image_encoder.blocks.18.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
150
+ "model.grounding_encoder.image_encoder.blocks.18.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
151
+ "model.grounding_encoder.image_encoder.blocks.18.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
152
+ "model.grounding_encoder.image_encoder.blocks.18.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
153
+ "model.grounding_encoder.image_encoder.blocks.18.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
154
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
155
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
156
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
157
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
158
+ "model.grounding_encoder.image_encoder.blocks.18.norm1.bias": "pytorch_model-00002-of-00002.bin",
159
+ "model.grounding_encoder.image_encoder.blocks.18.norm1.weight": "pytorch_model-00002-of-00002.bin",
160
+ "model.grounding_encoder.image_encoder.blocks.18.norm2.bias": "pytorch_model-00002-of-00002.bin",
161
+ "model.grounding_encoder.image_encoder.blocks.18.norm2.weight": "pytorch_model-00002-of-00002.bin",
162
+ "model.grounding_encoder.image_encoder.blocks.19.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
163
+ "model.grounding_encoder.image_encoder.blocks.19.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
164
+ "model.grounding_encoder.image_encoder.blocks.19.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
165
+ "model.grounding_encoder.image_encoder.blocks.19.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
166
+ "model.grounding_encoder.image_encoder.blocks.19.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
167
+ "model.grounding_encoder.image_encoder.blocks.19.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
168
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
169
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
170
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
171
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
172
+ "model.grounding_encoder.image_encoder.blocks.19.norm1.bias": "pytorch_model-00002-of-00002.bin",
173
+ "model.grounding_encoder.image_encoder.blocks.19.norm1.weight": "pytorch_model-00002-of-00002.bin",
174
+ "model.grounding_encoder.image_encoder.blocks.19.norm2.bias": "pytorch_model-00002-of-00002.bin",
175
+ "model.grounding_encoder.image_encoder.blocks.19.norm2.weight": "pytorch_model-00002-of-00002.bin",
176
+ "model.grounding_encoder.image_encoder.blocks.2.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
177
+ "model.grounding_encoder.image_encoder.blocks.2.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
178
+ "model.grounding_encoder.image_encoder.blocks.2.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
179
+ "model.grounding_encoder.image_encoder.blocks.2.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
180
+ "model.grounding_encoder.image_encoder.blocks.2.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
181
+ "model.grounding_encoder.image_encoder.blocks.2.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
182
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
183
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
184
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
185
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
186
+ "model.grounding_encoder.image_encoder.blocks.2.norm1.bias": "pytorch_model-00002-of-00002.bin",
187
+ "model.grounding_encoder.image_encoder.blocks.2.norm1.weight": "pytorch_model-00002-of-00002.bin",
188
+ "model.grounding_encoder.image_encoder.blocks.2.norm2.bias": "pytorch_model-00002-of-00002.bin",
189
+ "model.grounding_encoder.image_encoder.blocks.2.norm2.weight": "pytorch_model-00002-of-00002.bin",
190
+ "model.grounding_encoder.image_encoder.blocks.20.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
191
+ "model.grounding_encoder.image_encoder.blocks.20.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
192
+ "model.grounding_encoder.image_encoder.blocks.20.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
193
+ "model.grounding_encoder.image_encoder.blocks.20.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
194
+ "model.grounding_encoder.image_encoder.blocks.20.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
195
+ "model.grounding_encoder.image_encoder.blocks.20.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
196
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
197
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
198
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
199
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
200
+ "model.grounding_encoder.image_encoder.blocks.20.norm1.bias": "pytorch_model-00002-of-00002.bin",
201
+ "model.grounding_encoder.image_encoder.blocks.20.norm1.weight": "pytorch_model-00002-of-00002.bin",
202
+ "model.grounding_encoder.image_encoder.blocks.20.norm2.bias": "pytorch_model-00002-of-00002.bin",
203
+ "model.grounding_encoder.image_encoder.blocks.20.norm2.weight": "pytorch_model-00002-of-00002.bin",
204
+ "model.grounding_encoder.image_encoder.blocks.21.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
205
+ "model.grounding_encoder.image_encoder.blocks.21.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
206
+ "model.grounding_encoder.image_encoder.blocks.21.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
207
+ "model.grounding_encoder.image_encoder.blocks.21.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
208
+ "model.grounding_encoder.image_encoder.blocks.21.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
209
+ "model.grounding_encoder.image_encoder.blocks.21.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
210
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
211
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
212
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
213
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
214
+ "model.grounding_encoder.image_encoder.blocks.21.norm1.bias": "pytorch_model-00002-of-00002.bin",
215
+ "model.grounding_encoder.image_encoder.blocks.21.norm1.weight": "pytorch_model-00002-of-00002.bin",
216
+ "model.grounding_encoder.image_encoder.blocks.21.norm2.bias": "pytorch_model-00002-of-00002.bin",
217
+ "model.grounding_encoder.image_encoder.blocks.21.norm2.weight": "pytorch_model-00002-of-00002.bin",
218
+ "model.grounding_encoder.image_encoder.blocks.22.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
219
+ "model.grounding_encoder.image_encoder.blocks.22.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
220
+ "model.grounding_encoder.image_encoder.blocks.22.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
221
+ "model.grounding_encoder.image_encoder.blocks.22.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
222
+ "model.grounding_encoder.image_encoder.blocks.22.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
223
+ "model.grounding_encoder.image_encoder.blocks.22.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
224
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
225
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
226
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
227
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
228
+ "model.grounding_encoder.image_encoder.blocks.22.norm1.bias": "pytorch_model-00002-of-00002.bin",
229
+ "model.grounding_encoder.image_encoder.blocks.22.norm1.weight": "pytorch_model-00002-of-00002.bin",
230
+ "model.grounding_encoder.image_encoder.blocks.22.norm2.bias": "pytorch_model-00002-of-00002.bin",
231
+ "model.grounding_encoder.image_encoder.blocks.22.norm2.weight": "pytorch_model-00002-of-00002.bin",
232
+ "model.grounding_encoder.image_encoder.blocks.23.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
233
+ "model.grounding_encoder.image_encoder.blocks.23.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
234
+ "model.grounding_encoder.image_encoder.blocks.23.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
235
+ "model.grounding_encoder.image_encoder.blocks.23.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
236
+ "model.grounding_encoder.image_encoder.blocks.23.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
237
+ "model.grounding_encoder.image_encoder.blocks.23.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
238
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
239
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
240
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
241
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
242
+ "model.grounding_encoder.image_encoder.blocks.23.norm1.bias": "pytorch_model-00002-of-00002.bin",
243
+ "model.grounding_encoder.image_encoder.blocks.23.norm1.weight": "pytorch_model-00002-of-00002.bin",
244
+ "model.grounding_encoder.image_encoder.blocks.23.norm2.bias": "pytorch_model-00002-of-00002.bin",
245
+ "model.grounding_encoder.image_encoder.blocks.23.norm2.weight": "pytorch_model-00002-of-00002.bin",
246
+ "model.grounding_encoder.image_encoder.blocks.24.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
247
+ "model.grounding_encoder.image_encoder.blocks.24.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
248
+ "model.grounding_encoder.image_encoder.blocks.24.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
249
+ "model.grounding_encoder.image_encoder.blocks.24.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
250
+ "model.grounding_encoder.image_encoder.blocks.24.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
251
+ "model.grounding_encoder.image_encoder.blocks.24.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
252
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
253
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
254
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
255
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
256
+ "model.grounding_encoder.image_encoder.blocks.24.norm1.bias": "pytorch_model-00002-of-00002.bin",
257
+ "model.grounding_encoder.image_encoder.blocks.24.norm1.weight": "pytorch_model-00002-of-00002.bin",
258
+ "model.grounding_encoder.image_encoder.blocks.24.norm2.bias": "pytorch_model-00002-of-00002.bin",
259
+ "model.grounding_encoder.image_encoder.blocks.24.norm2.weight": "pytorch_model-00002-of-00002.bin",
260
+ "model.grounding_encoder.image_encoder.blocks.25.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
261
+ "model.grounding_encoder.image_encoder.blocks.25.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
262
+ "model.grounding_encoder.image_encoder.blocks.25.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
263
+ "model.grounding_encoder.image_encoder.blocks.25.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
264
+ "model.grounding_encoder.image_encoder.blocks.25.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
265
+ "model.grounding_encoder.image_encoder.blocks.25.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
266
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
267
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
268
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
269
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
270
+ "model.grounding_encoder.image_encoder.blocks.25.norm1.bias": "pytorch_model-00002-of-00002.bin",
271
+ "model.grounding_encoder.image_encoder.blocks.25.norm1.weight": "pytorch_model-00002-of-00002.bin",
272
+ "model.grounding_encoder.image_encoder.blocks.25.norm2.bias": "pytorch_model-00002-of-00002.bin",
273
+ "model.grounding_encoder.image_encoder.blocks.25.norm2.weight": "pytorch_model-00002-of-00002.bin",
274
+ "model.grounding_encoder.image_encoder.blocks.26.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
275
+ "model.grounding_encoder.image_encoder.blocks.26.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
276
+ "model.grounding_encoder.image_encoder.blocks.26.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
277
+ "model.grounding_encoder.image_encoder.blocks.26.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
278
+ "model.grounding_encoder.image_encoder.blocks.26.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
279
+ "model.grounding_encoder.image_encoder.blocks.26.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
280
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
281
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
282
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
283
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
284
+ "model.grounding_encoder.image_encoder.blocks.26.norm1.bias": "pytorch_model-00002-of-00002.bin",
285
+ "model.grounding_encoder.image_encoder.blocks.26.norm1.weight": "pytorch_model-00002-of-00002.bin",
286
+ "model.grounding_encoder.image_encoder.blocks.26.norm2.bias": "pytorch_model-00002-of-00002.bin",
287
+ "model.grounding_encoder.image_encoder.blocks.26.norm2.weight": "pytorch_model-00002-of-00002.bin",
288
+ "model.grounding_encoder.image_encoder.blocks.27.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
289
+ "model.grounding_encoder.image_encoder.blocks.27.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
290
+ "model.grounding_encoder.image_encoder.blocks.27.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
291
+ "model.grounding_encoder.image_encoder.blocks.27.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
292
+ "model.grounding_encoder.image_encoder.blocks.27.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
293
+ "model.grounding_encoder.image_encoder.blocks.27.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
294
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
295
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
296
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
297
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
298
+ "model.grounding_encoder.image_encoder.blocks.27.norm1.bias": "pytorch_model-00002-of-00002.bin",
299
+ "model.grounding_encoder.image_encoder.blocks.27.norm1.weight": "pytorch_model-00002-of-00002.bin",
300
+ "model.grounding_encoder.image_encoder.blocks.27.norm2.bias": "pytorch_model-00002-of-00002.bin",
301
+ "model.grounding_encoder.image_encoder.blocks.27.norm2.weight": "pytorch_model-00002-of-00002.bin",
302
+ "model.grounding_encoder.image_encoder.blocks.28.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
303
+ "model.grounding_encoder.image_encoder.blocks.28.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
304
+ "model.grounding_encoder.image_encoder.blocks.28.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
305
+ "model.grounding_encoder.image_encoder.blocks.28.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
306
+ "model.grounding_encoder.image_encoder.blocks.28.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
307
+ "model.grounding_encoder.image_encoder.blocks.28.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
308
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
309
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
310
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
311
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
312
+ "model.grounding_encoder.image_encoder.blocks.28.norm1.bias": "pytorch_model-00002-of-00002.bin",
313
+ "model.grounding_encoder.image_encoder.blocks.28.norm1.weight": "pytorch_model-00002-of-00002.bin",
314
+ "model.grounding_encoder.image_encoder.blocks.28.norm2.bias": "pytorch_model-00002-of-00002.bin",
315
+ "model.grounding_encoder.image_encoder.blocks.28.norm2.weight": "pytorch_model-00002-of-00002.bin",
316
+ "model.grounding_encoder.image_encoder.blocks.29.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
317
+ "model.grounding_encoder.image_encoder.blocks.29.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
318
+ "model.grounding_encoder.image_encoder.blocks.29.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
319
+ "model.grounding_encoder.image_encoder.blocks.29.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
320
+ "model.grounding_encoder.image_encoder.blocks.29.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
321
+ "model.grounding_encoder.image_encoder.blocks.29.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
322
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
323
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
324
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
325
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
326
+ "model.grounding_encoder.image_encoder.blocks.29.norm1.bias": "pytorch_model-00002-of-00002.bin",
327
+ "model.grounding_encoder.image_encoder.blocks.29.norm1.weight": "pytorch_model-00002-of-00002.bin",
328
+ "model.grounding_encoder.image_encoder.blocks.29.norm2.bias": "pytorch_model-00002-of-00002.bin",
329
+ "model.grounding_encoder.image_encoder.blocks.29.norm2.weight": "pytorch_model-00002-of-00002.bin",
330
+ "model.grounding_encoder.image_encoder.blocks.3.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
331
+ "model.grounding_encoder.image_encoder.blocks.3.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
332
+ "model.grounding_encoder.image_encoder.blocks.3.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
333
+ "model.grounding_encoder.image_encoder.blocks.3.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
334
+ "model.grounding_encoder.image_encoder.blocks.3.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
335
+ "model.grounding_encoder.image_encoder.blocks.3.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
336
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
337
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
338
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
339
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
340
+ "model.grounding_encoder.image_encoder.blocks.3.norm1.bias": "pytorch_model-00002-of-00002.bin",
341
+ "model.grounding_encoder.image_encoder.blocks.3.norm1.weight": "pytorch_model-00002-of-00002.bin",
342
+ "model.grounding_encoder.image_encoder.blocks.3.norm2.bias": "pytorch_model-00002-of-00002.bin",
343
+ "model.grounding_encoder.image_encoder.blocks.3.norm2.weight": "pytorch_model-00002-of-00002.bin",
344
+ "model.grounding_encoder.image_encoder.blocks.30.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
345
+ "model.grounding_encoder.image_encoder.blocks.30.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
346
+ "model.grounding_encoder.image_encoder.blocks.30.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
347
+ "model.grounding_encoder.image_encoder.blocks.30.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
348
+ "model.grounding_encoder.image_encoder.blocks.30.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
349
+ "model.grounding_encoder.image_encoder.blocks.30.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
350
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
351
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
352
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
353
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
354
+ "model.grounding_encoder.image_encoder.blocks.30.norm1.bias": "pytorch_model-00002-of-00002.bin",
355
+ "model.grounding_encoder.image_encoder.blocks.30.norm1.weight": "pytorch_model-00002-of-00002.bin",
356
+ "model.grounding_encoder.image_encoder.blocks.30.norm2.bias": "pytorch_model-00002-of-00002.bin",
357
+ "model.grounding_encoder.image_encoder.blocks.30.norm2.weight": "pytorch_model-00002-of-00002.bin",
358
+ "model.grounding_encoder.image_encoder.blocks.31.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
359
+ "model.grounding_encoder.image_encoder.blocks.31.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
360
+ "model.grounding_encoder.image_encoder.blocks.31.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
361
+ "model.grounding_encoder.image_encoder.blocks.31.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
362
+ "model.grounding_encoder.image_encoder.blocks.31.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
363
+ "model.grounding_encoder.image_encoder.blocks.31.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
364
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
365
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
366
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
367
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
368
+ "model.grounding_encoder.image_encoder.blocks.31.norm1.bias": "pytorch_model-00002-of-00002.bin",
369
+ "model.grounding_encoder.image_encoder.blocks.31.norm1.weight": "pytorch_model-00002-of-00002.bin",
370
+ "model.grounding_encoder.image_encoder.blocks.31.norm2.bias": "pytorch_model-00002-of-00002.bin",
371
+ "model.grounding_encoder.image_encoder.blocks.31.norm2.weight": "pytorch_model-00002-of-00002.bin",
372
+ "model.grounding_encoder.image_encoder.blocks.4.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
373
+ "model.grounding_encoder.image_encoder.blocks.4.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
374
+ "model.grounding_encoder.image_encoder.blocks.4.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
375
+ "model.grounding_encoder.image_encoder.blocks.4.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
376
+ "model.grounding_encoder.image_encoder.blocks.4.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
377
+ "model.grounding_encoder.image_encoder.blocks.4.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
378
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
379
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
380
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
381
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
382
+ "model.grounding_encoder.image_encoder.blocks.4.norm1.bias": "pytorch_model-00002-of-00002.bin",
383
+ "model.grounding_encoder.image_encoder.blocks.4.norm1.weight": "pytorch_model-00002-of-00002.bin",
384
+ "model.grounding_encoder.image_encoder.blocks.4.norm2.bias": "pytorch_model-00002-of-00002.bin",
385
+ "model.grounding_encoder.image_encoder.blocks.4.norm2.weight": "pytorch_model-00002-of-00002.bin",
386
+ "model.grounding_encoder.image_encoder.blocks.5.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
387
+ "model.grounding_encoder.image_encoder.blocks.5.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
388
+ "model.grounding_encoder.image_encoder.blocks.5.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
389
+ "model.grounding_encoder.image_encoder.blocks.5.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
390
+ "model.grounding_encoder.image_encoder.blocks.5.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
391
+ "model.grounding_encoder.image_encoder.blocks.5.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
392
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
393
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
394
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
395
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
396
+ "model.grounding_encoder.image_encoder.blocks.5.norm1.bias": "pytorch_model-00002-of-00002.bin",
397
+ "model.grounding_encoder.image_encoder.blocks.5.norm1.weight": "pytorch_model-00002-of-00002.bin",
398
+ "model.grounding_encoder.image_encoder.blocks.5.norm2.bias": "pytorch_model-00002-of-00002.bin",
399
+ "model.grounding_encoder.image_encoder.blocks.5.norm2.weight": "pytorch_model-00002-of-00002.bin",
400
+ "model.grounding_encoder.image_encoder.blocks.6.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
401
+ "model.grounding_encoder.image_encoder.blocks.6.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
402
+ "model.grounding_encoder.image_encoder.blocks.6.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
403
+ "model.grounding_encoder.image_encoder.blocks.6.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
404
+ "model.grounding_encoder.image_encoder.blocks.6.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
405
+ "model.grounding_encoder.image_encoder.blocks.6.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
406
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
407
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
408
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
409
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
410
+ "model.grounding_encoder.image_encoder.blocks.6.norm1.bias": "pytorch_model-00002-of-00002.bin",
411
+ "model.grounding_encoder.image_encoder.blocks.6.norm1.weight": "pytorch_model-00002-of-00002.bin",
412
+ "model.grounding_encoder.image_encoder.blocks.6.norm2.bias": "pytorch_model-00002-of-00002.bin",
413
+ "model.grounding_encoder.image_encoder.blocks.6.norm2.weight": "pytorch_model-00002-of-00002.bin",
414
+ "model.grounding_encoder.image_encoder.blocks.7.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
415
+ "model.grounding_encoder.image_encoder.blocks.7.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
416
+ "model.grounding_encoder.image_encoder.blocks.7.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
417
+ "model.grounding_encoder.image_encoder.blocks.7.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
418
+ "model.grounding_encoder.image_encoder.blocks.7.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
419
+ "model.grounding_encoder.image_encoder.blocks.7.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
420
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
421
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
422
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
423
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
424
+ "model.grounding_encoder.image_encoder.blocks.7.norm1.bias": "pytorch_model-00002-of-00002.bin",
425
+ "model.grounding_encoder.image_encoder.blocks.7.norm1.weight": "pytorch_model-00002-of-00002.bin",
426
+ "model.grounding_encoder.image_encoder.blocks.7.norm2.bias": "pytorch_model-00002-of-00002.bin",
427
+ "model.grounding_encoder.image_encoder.blocks.7.norm2.weight": "pytorch_model-00002-of-00002.bin",
428
+ "model.grounding_encoder.image_encoder.blocks.8.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
429
+ "model.grounding_encoder.image_encoder.blocks.8.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
430
+ "model.grounding_encoder.image_encoder.blocks.8.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
431
+ "model.grounding_encoder.image_encoder.blocks.8.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
432
+ "model.grounding_encoder.image_encoder.blocks.8.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
433
+ "model.grounding_encoder.image_encoder.blocks.8.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
434
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
435
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
436
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
437
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
438
+ "model.grounding_encoder.image_encoder.blocks.8.norm1.bias": "pytorch_model-00002-of-00002.bin",
439
+ "model.grounding_encoder.image_encoder.blocks.8.norm1.weight": "pytorch_model-00002-of-00002.bin",
440
+ "model.grounding_encoder.image_encoder.blocks.8.norm2.bias": "pytorch_model-00002-of-00002.bin",
441
+ "model.grounding_encoder.image_encoder.blocks.8.norm2.weight": "pytorch_model-00002-of-00002.bin",
442
+ "model.grounding_encoder.image_encoder.blocks.9.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
443
+ "model.grounding_encoder.image_encoder.blocks.9.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
444
+ "model.grounding_encoder.image_encoder.blocks.9.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
445
+ "model.grounding_encoder.image_encoder.blocks.9.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
446
+ "model.grounding_encoder.image_encoder.blocks.9.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
447
+ "model.grounding_encoder.image_encoder.blocks.9.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
448
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
449
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
450
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
451
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
452
+ "model.grounding_encoder.image_encoder.blocks.9.norm1.bias": "pytorch_model-00002-of-00002.bin",
453
+ "model.grounding_encoder.image_encoder.blocks.9.norm1.weight": "pytorch_model-00002-of-00002.bin",
454
+ "model.grounding_encoder.image_encoder.blocks.9.norm2.bias": "pytorch_model-00002-of-00002.bin",
455
+ "model.grounding_encoder.image_encoder.blocks.9.norm2.weight": "pytorch_model-00002-of-00002.bin",
456
+ "model.grounding_encoder.image_encoder.neck.0.weight": "pytorch_model-00002-of-00002.bin",
457
+ "model.grounding_encoder.image_encoder.neck.1.bias": "pytorch_model-00002-of-00002.bin",
458
+ "model.grounding_encoder.image_encoder.neck.1.weight": "pytorch_model-00002-of-00002.bin",
459
+ "model.grounding_encoder.image_encoder.neck.2.weight": "pytorch_model-00002-of-00002.bin",
460
+ "model.grounding_encoder.image_encoder.neck.3.bias": "pytorch_model-00002-of-00002.bin",
461
+ "model.grounding_encoder.image_encoder.neck.3.weight": "pytorch_model-00002-of-00002.bin",
462
+ "model.grounding_encoder.image_encoder.patch_embed.proj.bias": "pytorch_model-00002-of-00002.bin",
463
+ "model.grounding_encoder.image_encoder.patch_embed.proj.weight": "pytorch_model-00002-of-00002.bin",
464
+ "model.grounding_encoder.image_encoder.pos_embed": "pytorch_model-00002-of-00002.bin",
465
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.0.bias": "pytorch_model-00002-of-00002.bin",
466
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.0.weight": "pytorch_model-00002-of-00002.bin",
467
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.1.bias": "pytorch_model-00002-of-00002.bin",
468
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.1.weight": "pytorch_model-00002-of-00002.bin",
469
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.2.bias": "pytorch_model-00002-of-00002.bin",
470
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.2.weight": "pytorch_model-00002-of-00002.bin",
471
+ "model.grounding_encoder.mask_decoder.iou_token.weight": "pytorch_model-00002-of-00002.bin",
472
+ "model.grounding_encoder.mask_decoder.mask_tokens.weight": "pytorch_model-00002-of-00002.bin",
473
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.0.bias": "pytorch_model-00002-of-00002.bin",
474
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.0.weight": "pytorch_model-00002-of-00002.bin",
475
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.1.bias": "pytorch_model-00002-of-00002.bin",
476
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.1.weight": "pytorch_model-00002-of-00002.bin",
477
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.2.bias": "pytorch_model-00002-of-00002.bin",
478
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.2.weight": "pytorch_model-00002-of-00002.bin",
479
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.0.bias": "pytorch_model-00002-of-00002.bin",
480
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.0.weight": "pytorch_model-00002-of-00002.bin",
481
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.1.bias": "pytorch_model-00002-of-00002.bin",
482
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.1.weight": "pytorch_model-00002-of-00002.bin",
483
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.2.bias": "pytorch_model-00002-of-00002.bin",
484
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.2.weight": "pytorch_model-00002-of-00002.bin",
485
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.0.bias": "pytorch_model-00002-of-00002.bin",
486
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.0.weight": "pytorch_model-00002-of-00002.bin",
487
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.1.bias": "pytorch_model-00002-of-00002.bin",
488
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.1.weight": "pytorch_model-00002-of-00002.bin",
489
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.2.bias": "pytorch_model-00002-of-00002.bin",
490
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.2.weight": "pytorch_model-00002-of-00002.bin",
491
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.0.bias": "pytorch_model-00002-of-00002.bin",
492
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.0.weight": "pytorch_model-00002-of-00002.bin",
493
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.1.bias": "pytorch_model-00002-of-00002.bin",
494
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.1.weight": "pytorch_model-00002-of-00002.bin",
495
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.2.bias": "pytorch_model-00002-of-00002.bin",
496
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.2.weight": "pytorch_model-00002-of-00002.bin",
497
+ "model.grounding_encoder.mask_decoder.output_upscaling.0.bias": "pytorch_model-00002-of-00002.bin",
498
+ "model.grounding_encoder.mask_decoder.output_upscaling.0.weight": "pytorch_model-00002-of-00002.bin",
499
+ "model.grounding_encoder.mask_decoder.output_upscaling.1.bias": "pytorch_model-00002-of-00002.bin",
500
+ "model.grounding_encoder.mask_decoder.output_upscaling.1.weight": "pytorch_model-00002-of-00002.bin",
501
+ "model.grounding_encoder.mask_decoder.output_upscaling.3.bias": "pytorch_model-00002-of-00002.bin",
502
+ "model.grounding_encoder.mask_decoder.output_upscaling.3.weight": "pytorch_model-00002-of-00002.bin",
503
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.k_proj.bias": "pytorch_model-00002-of-00002.bin",
504
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.k_proj.weight": "pytorch_model-00002-of-00002.bin",
505
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.out_proj.bias": "pytorch_model-00002-of-00002.bin",
506
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.out_proj.weight": "pytorch_model-00002-of-00002.bin",
507
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.q_proj.bias": "pytorch_model-00002-of-00002.bin",
508
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.q_proj.weight": "pytorch_model-00002-of-00002.bin",
509
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.v_proj.bias": "pytorch_model-00002-of-00002.bin",
510
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.v_proj.weight": "pytorch_model-00002-of-00002.bin",
511
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.k_proj.bias": "pytorch_model-00002-of-00002.bin",
512
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.k_proj.weight": "pytorch_model-00002-of-00002.bin",
513
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.out_proj.bias": "pytorch_model-00002-of-00002.bin",
514
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.out_proj.weight": "pytorch_model-00002-of-00002.bin",
515
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.q_proj.bias": "pytorch_model-00002-of-00002.bin",
516
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.q_proj.weight": "pytorch_model-00002-of-00002.bin",
517
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.v_proj.bias": "pytorch_model-00002-of-00002.bin",
518
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.v_proj.weight": "pytorch_model-00002-of-00002.bin",
519
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.k_proj.bias": "pytorch_model-00002-of-00002.bin",
520
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.k_proj.weight": "pytorch_model-00002-of-00002.bin",
521
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.out_proj.bias": "pytorch_model-00002-of-00002.bin",
522
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.out_proj.weight": "pytorch_model-00002-of-00002.bin",
523
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.q_proj.bias": "pytorch_model-00002-of-00002.bin",
524
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.q_proj.weight": "pytorch_model-00002-of-00002.bin",
525
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.v_proj.bias": "pytorch_model-00002-of-00002.bin",
526
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.v_proj.weight": "pytorch_model-00002-of-00002.bin",
527
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
528
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
529
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
530
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
531
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm1.bias": "pytorch_model-00002-of-00002.bin",
532
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm1.weight": "pytorch_model-00002-of-00002.bin",
533
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm2.bias": "pytorch_model-00002-of-00002.bin",
534
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm2.weight": "pytorch_model-00002-of-00002.bin",
535
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm3.bias": "pytorch_model-00002-of-00002.bin",
536
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm3.weight": "pytorch_model-00002-of-00002.bin",
537
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm4.bias": "pytorch_model-00002-of-00002.bin",
538
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm4.weight": "pytorch_model-00002-of-00002.bin",
539
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.k_proj.bias": "pytorch_model-00002-of-00002.bin",
540
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
541
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.out_proj.bias": "pytorch_model-00002-of-00002.bin",
542
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.out_proj.weight": "pytorch_model-00002-of-00002.bin",
543
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.q_proj.bias": "pytorch_model-00002-of-00002.bin",
544
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
545
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.v_proj.bias": "pytorch_model-00002-of-00002.bin",
546
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
547
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.k_proj.bias": "pytorch_model-00002-of-00002.bin",
548
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.k_proj.weight": "pytorch_model-00002-of-00002.bin",
549
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.out_proj.bias": "pytorch_model-00002-of-00002.bin",
550
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.out_proj.weight": "pytorch_model-00002-of-00002.bin",
551
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.q_proj.bias": "pytorch_model-00002-of-00002.bin",
552
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.q_proj.weight": "pytorch_model-00002-of-00002.bin",
553
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.v_proj.bias": "pytorch_model-00002-of-00002.bin",
554
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.v_proj.weight": "pytorch_model-00002-of-00002.bin",
555
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.k_proj.bias": "pytorch_model-00002-of-00002.bin",
556
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.k_proj.weight": "pytorch_model-00002-of-00002.bin",
557
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.out_proj.bias": "pytorch_model-00002-of-00002.bin",
558
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.out_proj.weight": "pytorch_model-00002-of-00002.bin",
559
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.q_proj.bias": "pytorch_model-00002-of-00002.bin",
560
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.q_proj.weight": "pytorch_model-00002-of-00002.bin",
561
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.v_proj.bias": "pytorch_model-00002-of-00002.bin",
562
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.v_proj.weight": "pytorch_model-00002-of-00002.bin",
563
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
564
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
565
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
566
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
567
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm1.bias": "pytorch_model-00002-of-00002.bin",
568
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm1.weight": "pytorch_model-00002-of-00002.bin",
569
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm2.bias": "pytorch_model-00002-of-00002.bin",
570
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm2.weight": "pytorch_model-00002-of-00002.bin",
571
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm3.bias": "pytorch_model-00002-of-00002.bin",
572
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm3.weight": "pytorch_model-00002-of-00002.bin",
573
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm4.bias": "pytorch_model-00002-of-00002.bin",
574
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm4.weight": "pytorch_model-00002-of-00002.bin",
575
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.k_proj.bias": "pytorch_model-00002-of-00002.bin",
576
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
577
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.out_proj.bias": "pytorch_model-00002-of-00002.bin",
578
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.out_proj.weight": "pytorch_model-00002-of-00002.bin",
579
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.q_proj.bias": "pytorch_model-00002-of-00002.bin",
580
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
581
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.v_proj.bias": "pytorch_model-00002-of-00002.bin",
582
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
583
+ "model.grounding_encoder.mask_decoder.transformer.norm_final_attn.bias": "pytorch_model-00002-of-00002.bin",
584
+ "model.grounding_encoder.mask_decoder.transformer.norm_final_attn.weight": "pytorch_model-00002-of-00002.bin",
585
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.0.bias": "pytorch_model-00002-of-00002.bin",
586
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.0.weight": "pytorch_model-00002-of-00002.bin",
587
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.1.bias": "pytorch_model-00002-of-00002.bin",
588
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.1.weight": "pytorch_model-00002-of-00002.bin",
589
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.3.bias": "pytorch_model-00002-of-00002.bin",
590
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.3.weight": "pytorch_model-00002-of-00002.bin",
591
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.4.bias": "pytorch_model-00002-of-00002.bin",
592
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.4.weight": "pytorch_model-00002-of-00002.bin",
593
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.6.bias": "pytorch_model-00002-of-00002.bin",
594
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.6.weight": "pytorch_model-00002-of-00002.bin",
595
+ "model.grounding_encoder.prompt_encoder.no_mask_embed.weight": "pytorch_model-00002-of-00002.bin",
596
+ "model.grounding_encoder.prompt_encoder.not_a_point_embed.weight": "pytorch_model-00002-of-00002.bin",
597
+ "model.grounding_encoder.prompt_encoder.pe_layer.positional_encoding_gaussian_matrix": "pytorch_model-00002-of-00002.bin",
598
+ "model.grounding_encoder.prompt_encoder.point_embeddings.0.weight": "pytorch_model-00002-of-00002.bin",
599
+ "model.grounding_encoder.prompt_encoder.point_embeddings.1.weight": "pytorch_model-00002-of-00002.bin",
600
+ "model.grounding_encoder.prompt_encoder.point_embeddings.2.weight": "pytorch_model-00002-of-00002.bin",
601
+ "model.grounding_encoder.prompt_encoder.point_embeddings.3.weight": "pytorch_model-00002-of-00002.bin",
602
+ "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
603
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
604
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
605
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
606
+ "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
607
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
608
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
609
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
610
+ "model.layers.0.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
611
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
612
+ "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
613
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
614
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
615
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
616
+ "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
617
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
618
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
619
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
620
+ "model.layers.1.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
621
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
622
+ "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
623
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
624
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
625
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
626
+ "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
627
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
628
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
629
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
630
+ "model.layers.10.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
631
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
632
+ "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
633
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
634
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
635
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
636
+ "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
637
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
638
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
639
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
640
+ "model.layers.11.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
641
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
642
+ "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
643
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
644
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
645
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
646
+ "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
647
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
648
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
649
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
650
+ "model.layers.12.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
651
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
652
+ "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
653
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
654
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
655
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
656
+ "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
657
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
658
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
659
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
660
+ "model.layers.13.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
661
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
662
+ "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
663
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
664
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
665
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
666
+ "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
667
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
668
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
669
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
670
+ "model.layers.14.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
671
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
672
+ "model.layers.15.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
673
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
674
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
675
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
676
+ "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
677
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
678
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
679
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
680
+ "model.layers.15.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
681
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
682
+ "model.layers.16.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
683
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
684
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
685
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
686
+ "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
687
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
688
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
689
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
690
+ "model.layers.16.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
691
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
692
+ "model.layers.17.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
693
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
694
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
695
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
696
+ "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
697
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
698
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
699
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
700
+ "model.layers.17.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
701
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
702
+ "model.layers.18.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
703
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
704
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
705
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
706
+ "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
707
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
708
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
709
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
710
+ "model.layers.18.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
711
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
712
+ "model.layers.19.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
713
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
714
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
715
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
716
+ "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
717
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
718
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
719
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
720
+ "model.layers.19.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
721
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
722
+ "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
723
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
724
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
725
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
726
+ "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
727
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
728
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
729
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
730
+ "model.layers.2.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
731
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
732
+ "model.layers.20.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
733
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
734
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
735
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
736
+ "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
737
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
738
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
739
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
740
+ "model.layers.20.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
741
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
742
+ "model.layers.21.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
743
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
744
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
745
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
746
+ "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
747
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
748
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
749
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
750
+ "model.layers.21.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
751
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
752
+ "model.layers.22.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
753
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
754
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
755
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
756
+ "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
757
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
758
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
759
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
760
+ "model.layers.22.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
761
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
762
+ "model.layers.23.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
763
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
764
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
765
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
766
+ "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
767
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
768
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
769
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
770
+ "model.layers.23.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
771
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
772
+ "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
773
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
774
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
775
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
776
+ "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
777
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
778
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
779
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
780
+ "model.layers.24.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
781
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
782
+ "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
783
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
784
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
785
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
786
+ "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
787
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
788
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
789
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
790
+ "model.layers.25.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
791
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
792
+ "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
793
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
794
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
795
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
796
+ "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
797
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
798
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
799
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
800
+ "model.layers.26.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
801
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
802
+ "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
803
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
804
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
805
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
806
+ "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
807
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
808
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
809
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
810
+ "model.layers.27.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
811
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
812
+ "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
813
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
814
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
815
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
816
+ "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
817
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
818
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
819
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
820
+ "model.layers.28.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
821
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
822
+ "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
823
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
824
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
825
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
826
+ "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
827
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
828
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
829
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
830
+ "model.layers.29.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
831
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
832
+ "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
833
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
834
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
835
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
836
+ "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
837
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
838
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
839
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
840
+ "model.layers.3.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
841
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
842
+ "model.layers.30.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
843
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
844
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
845
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
846
+ "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
847
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
848
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
849
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
850
+ "model.layers.30.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
851
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
852
+ "model.layers.31.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
853
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
854
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
855
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
856
+ "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
857
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
858
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
859
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
860
+ "model.layers.31.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
861
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
862
+ "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
863
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
864
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
865
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
866
+ "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
867
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
868
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
869
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
870
+ "model.layers.4.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
871
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
872
+ "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
873
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
874
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
875
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
876
+ "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
877
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
878
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
879
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
880
+ "model.layers.5.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
881
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
882
+ "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
883
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
884
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
885
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
886
+ "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
887
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
888
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
889
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
890
+ "model.layers.6.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
891
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
892
+ "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
893
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
894
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
895
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
896
+ "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
897
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
898
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
899
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
900
+ "model.layers.7.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
901
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
902
+ "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
903
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
904
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
905
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
906
+ "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
907
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
908
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
909
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
910
+ "model.layers.8.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
911
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
912
+ "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
913
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
914
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
915
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
916
+ "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
917
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
918
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
919
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
920
+ "model.layers.9.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
921
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
922
+ "model.mm_projector.0.bias": "pytorch_model-00002-of-00002.bin",
923
+ "model.mm_projector.0.weight": "pytorch_model-00002-of-00002.bin",
924
+ "model.mm_projector.2.bias": "pytorch_model-00002-of-00002.bin",
925
+ "model.mm_projector.2.weight": "pytorch_model-00002-of-00002.bin",
926
+ "model.norm.weight": "pytorch_model-00002-of-00002.bin",
927
+ "model.region_encoder.mlvl_fuse.fuse_convs.0.conv.weight": "pytorch_model-00002-of-00002.bin",
928
+ "model.region_encoder.mlvl_fuse.fuse_convs.0.gn.bias": "pytorch_model-00002-of-00002.bin",
929
+ "model.region_encoder.mlvl_fuse.fuse_convs.0.gn.weight": "pytorch_model-00002-of-00002.bin",
930
+ "model.region_encoder.mlvl_fuse.fuse_convs.1.conv.weight": "pytorch_model-00002-of-00002.bin",
931
+ "model.region_encoder.mlvl_fuse.fuse_convs.1.gn.bias": "pytorch_model-00002-of-00002.bin",
932
+ "model.region_encoder.mlvl_fuse.fuse_convs.1.gn.weight": "pytorch_model-00002-of-00002.bin",
933
+ "model.region_encoder.mlvl_fuse.fuse_convs.2.conv.weight": "pytorch_model-00002-of-00002.bin",
934
+ "model.region_encoder.mlvl_fuse.fuse_convs.2.gn.bias": "pytorch_model-00002-of-00002.bin",
935
+ "model.region_encoder.mlvl_fuse.fuse_convs.2.gn.weight": "pytorch_model-00002-of-00002.bin",
936
+ "model.region_encoder.mlvl_fuse.fuse_convs.3.conv.weight": "pytorch_model-00002-of-00002.bin",
937
+ "model.region_encoder.mlvl_fuse.fuse_convs.3.gn.bias": "pytorch_model-00002-of-00002.bin",
938
+ "model.region_encoder.mlvl_fuse.fuse_convs.3.gn.weight": "pytorch_model-00002-of-00002.bin",
939
+ "model.region_encoder.mlvl_fuse.fuse_convs.4.conv.weight": "pytorch_model-00002-of-00002.bin",
940
+ "model.region_encoder.mlvl_fuse.fuse_convs.4.gn.bias": "pytorch_model-00002-of-00002.bin",
941
+ "model.region_encoder.mlvl_fuse.fuse_convs.4.gn.weight": "pytorch_model-00002-of-00002.bin",
942
+ "model.region_encoder.mlvl_fuse.input_conv.0.bias": "pytorch_model-00002-of-00002.bin",
943
+ "model.region_encoder.mlvl_fuse.input_conv.0.weight": "pytorch_model-00002-of-00002.bin",
944
+ "model.region_encoder.mlvl_fuse.input_conv.1.bias": "pytorch_model-00002-of-00002.bin",
945
+ "model.region_encoder.mlvl_fuse.input_conv.1.weight": "pytorch_model-00002-of-00002.bin",
946
+ "model.region_encoder.mlvl_fuse.input_conv.2.bias": "pytorch_model-00002-of-00002.bin",
947
+ "model.region_encoder.mlvl_fuse.input_conv.2.weight": "pytorch_model-00002-of-00002.bin",
948
+ "model.region_encoder.mlvl_fuse.input_conv.3.bias": "pytorch_model-00002-of-00002.bin",
949
+ "model.region_encoder.mlvl_fuse.input_conv.3.weight": "pytorch_model-00002-of-00002.bin",
950
+ "model.region_encoder.roi_align.flatten_linear.bias": "pytorch_model-00002-of-00002.bin",
951
+ "model.region_encoder.roi_align.flatten_linear.weight": "pytorch_model-00002-of-00002.bin",
952
+ "model.region_encoder.roi_align.pconvs.0.bias": "pytorch_model-00002-of-00002.bin",
953
+ "model.region_encoder.roi_align.pconvs.0.weight": "pytorch_model-00002-of-00002.bin",
954
+ "model.region_encoder.roi_align.pconvs.1.bias": "pytorch_model-00002-of-00002.bin",
955
+ "model.region_encoder.roi_align.pconvs.1.weight": "pytorch_model-00002-of-00002.bin",
956
+ "model.region_encoder.roi_align.pconvs.2.bias": "pytorch_model-00002-of-00002.bin",
957
+ "model.region_encoder.roi_align.pconvs.2.weight": "pytorch_model-00002-of-00002.bin",
958
+ "model.region_encoder.roi_align.pconvs.3.bias": "pytorch_model-00002-of-00002.bin",
959
+ "model.region_encoder.roi_align.pconvs.3.weight": "pytorch_model-00002-of-00002.bin",
960
+ "model.region_encoder.roi_align.pos_embedd.0.bias": "pytorch_model-00002-of-00002.bin",
961
+ "model.region_encoder.roi_align.pos_embedd.0.weight": "pytorch_model-00002-of-00002.bin",
962
+ "model.region_encoder.roi_align.pos_embedd.2.bias": "pytorch_model-00002-of-00002.bin",
963
+ "model.region_encoder.roi_align.pos_embedd.2.weight": "pytorch_model-00002-of-00002.bin",
964
+ "model.region_encoder.roi_align.pos_embedd.3.bias": "pytorch_model-00002-of-00002.bin",
965
+ "model.region_encoder.roi_align.pos_embedd.3.weight": "pytorch_model-00002-of-00002.bin",
966
+ "model.region_encoder.roi_align.pos_embedd.5.bias": "pytorch_model-00002-of-00002.bin",
967
+ "model.region_encoder.roi_align.pos_embedd.5.weight": "pytorch_model-00002-of-00002.bin",
968
+ "model.region_encoder.roi_align.updims.bias": "pytorch_model-00002-of-00002.bin",
969
+ "model.region_encoder.roi_align.updims.weight": "pytorch_model-00002-of-00002.bin",
970
+ "model.text_hidden_fcs.0.0.bias": "pytorch_model-00002-of-00002.bin",
971
+ "model.text_hidden_fcs.0.0.weight": "pytorch_model-00002-of-00002.bin",
972
+ "model.text_hidden_fcs.0.2.bias": "pytorch_model-00002-of-00002.bin",
973
+ "model.text_hidden_fcs.0.2.weight": "pytorch_model-00002-of-00002.bin"
974
+ }
975
+ }
groundingLMM/GLaMM-FullScope/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "<unk>",
17
+ "unk_token": {
18
+ "content": "<unk>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
groundingLMM/GLaMM-FullScope/tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "bos_token": {
3
+ "__type": "AddedToken",
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ "clean_up_tokenization_spaces": false,
11
+ "eos_token": {
12
+ "__type": "AddedToken",
13
+ "content": "</s>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false
18
+ },
19
+ "legacy": false,
20
+ "model_max_length": 1536,
21
+ "pad_token": null,
22
+ "padding_side": "right",
23
+ "special_tokens_map_file": "special_tokens_map.json",
24
+ "tokenizer_class": "LlamaTokenizer",
25
+ "unk_token": {
26
+ "__type": "AddedToken",
27
+ "content": "<unk>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ }
33
+ }
groundingLMM/GranD/README.md ADDED
@@ -0,0 +1,73 @@
1
+ # GranD - Grounding Anything Dataset 🚀
2
+ For details on downloading the dataset, preprocessing annotations for pre-training, and the automated annotation pipeline, please refer to [GranD.md](../docs/GranD.md) in the documentation.
3
+
4
+ ## Running the GranD Automated Annotation Pipeline
5
+ The GranD automated annotation pipeline comprises four levels and a total of 23 steps. Each level utilizes multiple state-of-the-art (SoTA) vision-language models and pipeline scripts to construct image-scene graphs from raw predictions.
6
+
7
+ For a step-by-step guide on running the pipeline, refer to [run_pipeline.sh](run_pipeline.sh). The environments required are listed under [environments](environments).
8
+
9
+ ### Create All Environments
10
+ There are ten environment `.yml` files provided in the [environments](environments) directory. Create all ten environments using the following commands:
11
+
12
+ ```bash
13
+ conda env create -f grand_env_1.yml
14
+ conda env create -f grand_env_2.yml
15
+ ...
16
+ ...
17
+ conda env create -f grand_env_9.yml
18
+ conda env create -f grand_env_utils.yml
19
+ ```
20
+
21
+ **NOTE:** While creating any of the above environments, if one or more `pip` dependencies fail to install, you may need to remove those dependencies from the environment file and rerun the command.
22
+
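+ For example, a minimal recovery sketch (using `grand_env_3` as an arbitrary example; the package names are placeholders):
+
+ ```bash
+ # If a partially created environment was left behind, remove it first
+ conda env remove -n grand_env_3
+ # Re-create the environment after commenting out the failing pip entries in grand_env_3.yml
+ conda env create -f grand_env_3.yml
+ # Then install the removed packages manually inside the environment
+ conda activate grand_env_3
+ pip install <failed-package-1> <failed-package-2>
+ ```
+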
23
+ ### Download Model Checkpoints
24
+ Download all required model checkpoints to your `CKPT_DIR` directory:
25
+
26
+ ```bash
27
+ # For Landmark Detection
28
+ git lfs install
29
+ git clone https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3
30
+
31
+ # For Depth Estimation
32
+ wget https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt
33
+
34
+ # For Image Tagging
35
+ # Download from [recognize-anything/tag2text_swin_14m.pth](https://huggingface.co/spaces/xinyu1205/recognize-anything/blob/main/tag2text_swin_14m.pth) & [recognize-anything/ram_swin_large_14m.pth](https://huggingface.co/spaces/xinyu1205/recognize-anything/blob/main/ram_swin_large_14m.pth)
36
+
37
+ # For Co-DETR Detector
38
+ # Download using this [Google Drive link](https://drive.google.com/drive/folders/1asWoZ3SuM6APTL9D-QUF_YW9mjULNdh9?usp=sharing) to obtain the `co_deformable_detr_swin_large_900q_3x_coco.pth` checkpoints.
39
+
40
+ # For EVA-02 Detector
41
+ # Download from [eva02_L_lvis_sys.pth](https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/det/eva02_L_lvis_sys.pth) & [eva02_L_lvis_sys_o365.pth](https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/det/eva02_L_lvis_sys_o365.pth)
42
+
43
+ # For POMP
44
+ # Download from [Google Drive](https://drive.google.com/file/d/1C8oU6cWkJdU3Q3IHaqTcbIToRLo9bMnu/view?usp=sharing) & [Detic_LI_CLIP_R5021k_640b64_4x_ft4x_max-size_pomp.pth](https://drive.google.com/file/d/1TwrjcUYimkI_f9z9UZXCmLztdgv31Peu/view?usp=sharing)
45
+
46
+ # For GRIT
47
+ wget -c https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth
48
+
49
+ # For OV-SAM
50
+ # Download from [HarborYuan/ovsam_models/blob/main/sam2clip_vith_rn50x16.pth](https://huggingface.co/HarborYuan/ovsam_models/blob/main/sam2clip_vith_rn50x16.pth)
51
+
52
+ # For GPT4RoI
53
+ # Follow the instructions at [GPT4RoI/Weights](https://github.com/jshilong/GPT4RoI?tab=readme-ov-file#weights) to obtain the GPT4RoI weights.
54
+ ```
55
+
56
+ ### Automatically Annotate Images
57
+ Refer to the [run_pipeline.sh](run_pipeline.sh) script for details. Below is a sample command to run the pipeline:
58
+
59
+ ```bash
60
+ bash run_pipeline.sh $IMG_DIR $PRED_DIR $CKPT_DIR $SAM_ANNOTATIONS_DIR
61
+ ```
62
+
63
+ Where:
64
+
65
+ 1. `IMG_DIR` is the path to the directory containing images you wish to annotate.
66
+ 2. `PRED_DIR` is the path to the directory where the predictions will be saved.
67
+ 3. `CKPT_DIR` is the path to the directory containing all the checkpoints. For downloading the checkpoints, consult the README of each respective model.
68
+ 4. `SAM_ANNOTATIONS_DIR` is the path to the directory containing SAM annotations (.json file).
69
+
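+ For example, a hypothetical invocation (the paths below are placeholders; substitute your own) could look like:
+ 
+ ```bash
+ IMG_DIR=/data/grand/images
+ PRED_DIR=/data/grand/predictions
+ CKPT_DIR=/data/grand/checkpoints
+ SAM_ANNOTATIONS_DIR=/data/grand/sam_annotations
+ bash run_pipeline.sh "$IMG_DIR" "$PRED_DIR" "$CKPT_DIR" "$SAM_ANNOTATIONS_DIR"
+ ```
+ 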
70
+ **Note:** If you are not annotating SAM images, remove `ov-sam` from the pipeline and adjust the `add_masks_to_annotations.py` script accordingly. In this case, `SAM_ANNOTATIONS_DIR` will not be required.
71
+
72
+ ### Disclaimer
73
+ We acknowledge that the pipeline is complex due to the involvement of many different models with various dependencies. Contributions that simplify or improve the pipeline are welcome.
groundingLMM/GranD/run_pipeline.sh ADDED
@@ -0,0 +1,178 @@
1
+ #!/bin/bash
2
+
3
+ # Exit on error, uninitialized var, and ensure commands in pipes are all checked for success
4
+ set -euo pipefail
5
+
6
+ # Input arguments: image directory, output predictions directory, checkpoints directory (containing all model checkpoints), and the directory containing the original SAM annotation files
7
+ IMG_DIR=$1
8
+ PRED_DIR=$2
9
+ CKPT_DIR=$3
10
+ SAM_ANNOTATIONS_DIR=$4
11
+
12
+ # Adjust below configuration as per your setup
13
+ NUM_GPUs=1
14
+ GPU_IDs="0"
15
+ MASTER_PORT=1342
16
+
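+ # Example multi-GPU configuration (illustrative):
+ #   NUM_GPUs=4
+ #   GPU_IDs="0,1,2,3"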
17
+
18
+ # NOTE: The pipeline combines multiple models from different open-source projects, and their dependencies differ from one model to another. For this reason, we had to create ten different conda environments to run the complete pipeline. Please follow the instructions in the corresponding model directory to install the dependencies. Pull requests that simplify this process are welcome.
19
+
20
+
21
+ # The helper functions below activate the correct conda environment before running the given command
22
+ run_in_env() {
23
+ local env="$1"
24
+ shift
25
+ source "$(conda info --base)/etc/profile.d/conda.sh"
26
+ conda activate "$env"
27
+ "$@"
28
+ }
29
+
30
+ run_in_env_targeted() {
31
+ local env="$1"
32
+ shift
33
+ export CUDA_VISIBLE_DEVICES="$GPU_IDs"
34
+ source "$(conda info --base)/etc/profile.d/conda.sh"
35
+ conda activate "$env"
36
+ "$@"
37
+ }
38
+
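+ # Usage of the helpers (illustrative): pass the conda environment name first, then the command to run, e.g.
+ #   run_in_env grand_env_utils python utils/prepare_level_1.py --help
+ # run_in_env_targeted additionally restricts the visible GPUs to $GPU_IDs (via CUDA_VISIBLE_DEVICES) before activating the environment.
+ 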
39
+ # NOTE: Here we assume ten conda environments have been created, namely grand_env_1, grand_env_2, ..., grand_env_9, and grand_env_utils. The requirements for these environments are available under the environments directory.
40
+
41
+
42
+ # 1. Landmark
43
+ run_in_env grand_env_1 pushd level_1_inference/1_landmark_categorization
44
+ python infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs" --llava_model_path "$CKPT_DIR/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3"
45
+ popd
46
+
47
+ # 2. Depth Maps
48
+ run_in_env_targeted grand_env_2 pushd level_1_inference/2_depth_maps
49
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model_weights "$CKPT_DIR/dpt_beit_large_512.pt"
50
+ popd
51
+
52
+ # 3. Image Tagging
53
+ run_in_env_targeted grand_env_3 pushd level_1_inference/3_image_tagging
54
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model-type tag2text --checkpoint "$CKPT_DIR/tag2text_swin_14m.pth"
55
+
56
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model-type ram --checkpoint "$CKPT_DIR/ram_swin_large_14m.pth"
57
+ popd
58
+
59
+ # 4. Object Detection using Co-DETR
60
+ run_in_env grand_env_1 pushd level_1_inference/4_co_detr
61
+ python launch_codetr_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --ckpt_path "$CKPT_DIR/co_deformable_detr_swin_large_900q_3x_coco.pth" --gpu_ids "$GPU_IDs"
62
+ popd
63
+
64
+ # 5. Object Detection using EVA-02
65
+ run_in_env_targeted grand_env_4 pushd level_1_inference/5_eva_02
66
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model_name 'eva-02-01'
67
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model_name 'eva-02-02'
68
+ popd
69
+
70
+ # 6. Open Vocabulary Detection using OWL-ViT
71
+ run_in_env grand_env_1 pushd level_1_inference/6_owl_vit
72
+ python launch_owl_vit_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --tags_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs"
73
+ popd
74
+
75
+ # 7. Open Vocabulary Detection using POMP
76
+ run_in_env grand_env_4 pushd level_1_inference/7_pomp
77
+ python launch_pomp_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --tags_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs"
78
+ popd
79
+
80
+ # 8. Attribute Detection and Grounding using GRIT
81
+ run_in_env_targeted grand_env_3 pushd level_1_inference/8_grit
82
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR"
83
+ popd
84
+
85
+ # 9. Open Vocabulary Classification using OV-SAM
86
+ run_in_env grand_env_5 pushd level_1_inference/9_ov_sam
87
+ python launch_ov_sam_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --sam_annotations_dir "$SAM_ANNOTATIONS_DIR" --gpu_ids "$GPU_IDs"
88
+ popd
89
+
90
+ # 10. Generate Level-1 Scene Graph
91
+ run_in_env grand_env_utils
92
+ python utils/merge_json_level_1_with_nms.py --image_dir_path "$IMG_DIR" --predictions_dir_path "$PRED_DIR" --output_dir_path "$PRED_DIR/level-1-raw"
93
+
94
+ run_in_env grand_env_utils
95
+ python utils/prepare_level_1.py --image_dir_path "$IMG_DIR" --raw_dir_path "$PRED_DIR/level-1-raw" --output_dir_path "$PRED_DIR/level-1-processed"
96
+
97
+
98
+ # -------------------------------------------------------------------------------------------------------------------- #
99
+
100
+ # 11. Captioning using BLIP-2
101
+ run_in_env_targeted grand_env_3 pushd level_2_inference/1_blip-2
102
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR"
103
+ popd
104
+
105
+ # 12. Captioning using LLaVA
106
+ run_in_env grand_env_6 pushd level_2_inference/2_llava
107
+ python infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs" --llava_model_path "$CKPT_DIR/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3"
108
+ popd
109
+
110
+ # 13. Grounding using MDETR
111
+ run_in_env_targeted grand_env_7 pushd level_2_inference/3_mdetr
112
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --blip2_pred_path "$PRED_DIR/blip2" --llava_pred_path "$PRED_DIR/llava"
113
+ popd
114
+
115
+ # 14. Generate Level-2 Scene Graph and Update Level-1
116
+ run_in_env grand_env_utils
117
+ python utils/merge_json_level_2.py --predictions_dir_path "$PRED_DIR" --output_dir_path "$PRED_DIR/level-2-raw"
118
+
119
+ run_in_env grand_env_utils
120
+ python utils/prepare_level_2.py --raw_dir_path "$PRED_DIR/level-2-raw" --level_2_output_dir_path "$PRED_DIR/level-2-processed" --level_1_dir_path "$PRED_DIR/level-1-processed"
121
+
122
+
123
+ # -------------------------------------------------------------------------------------------------------------------- #
124
+
125
+ # 15. Enrich Attributes using GPT4RoI
126
+ run_in_env grand_env_8 pushd level_2_inference/4_gpt4roi/GPT4RoI
127
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env gpt4roi/infer.py --image_dir_path "$IMG_DIR" --level_2_pred_path "$PRED_DIR/level-2-processed" --output_dir_path "$PRED_DIR/level-2-processed_gpt4roi"
128
+ popd
129
+
130
+ # 16. Label Assignment using EVA-CLIP
131
+ run_in_env_targeted grand_env_4 pushd level_2_inference/5_label_assignment
132
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --level_2_dir_path "$PRED_DIR/level-2-processed_gpt4roi" --output_dir_path "$PRED_DIR/level-2-processed_eva_clip"
133
+ popd
134
+
135
+ # 17. Merge EVA-CLIP Assigned Labels & Calculate and Store Depths for All Objects
136
+ run_in_env_targeted grand_env_utils
137
+ python utils/merge_eva_labels.py --level_2_dir_path "$PRED_DIR/level-2-processed_gpt4roi" --labels_path "$PRED_DIR/level-2-processed_eva_clip" --output_dir_path "$PRED_DIR/level-2-processed_labelled" --store_depth --depth_map_dir "$PRED_DIR/midas"
138
+
139
+
140
+ # -------------------------------------------------------------------------------------------------------------------- #
141
+
142
+ # 18. Generate Level-3 Dense Captions
143
+ run_in_env grand_env_9 pushd level_3_dense_caption
144
+ python run.py --image_dir_path "$IMG_DIR" --level_2_dir_path "$PRED_DIR/level-2-processed_labelled" --output_dir_path "$PRED_DIR/level-3-vicuna-13B" --gpu_ids "$GPU_IDs" --job_id '111'
145
+ popd
146
+
147
+ # 19. Generate Level-4 Additional Context
148
+ run_in_env grand_env_9 pushd level_4_extra_context
149
+ python run.py --image_dir_path "$IMG_DIR" --level_2_dir_path "$PRED_DIR/level-2-processed_labelled" --output_dir_path "$PRED_DIR/level-4-vicuna-13B" --gpu_ids "$GPU_IDs" --job_id '111'
150
+ popd
151
+
152
+
153
+ # -------------------------------------------------------------------------------------------------------------------- #
154
+
155
+ # 20. Ground short & dense captions
156
+ run_in_env_targeted grand_env_utils
157
+ python utils/ground_short_captions.py --data_dir_path "$PRED_DIR/level-2-processed_labelled" --output_dir_path "$PRED_DIR/short_captions_grounded"
158
+
159
+ run_in_env_targeted grand_env_utils
160
+ python utils/ground_dense_caption.py --level_3_dense_caption_txt_dir_path "$PRED_DIR/level-3-vicuna-13B" --level_2_processed_json_path "$PRED_DIR/short_captions_grounded" --output_dir_path "$PRED_DIR/dense_captions_grounded"
161
+
162
+ # 21. Add Masks to the Annotations (sources: SAM Annotations & EVA Detector)
163
+ run_in_env_targeted grand_env_utils
164
+ python utils/add_masks_to_annotations.py --input_dir_path "$PRED_DIR/dense_captions_grounded" --sam_json_dir_path "$SAM_ANNOTATIONS_DIR" --eva_02_pred_dir_path "$PRED_DIR/eva-02-01" --output_dir_path "$PRED_DIR/level-3-processed"
165
+
166
+ # 22. Use HQ-SAM for the Rest of the Masks not Found in SAM Annotations or EVA Detections
167
+ run_in_env_targeted grand_env_1 pushd utils/hq_sam
168
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env run.py --image_dir_path "$IMG_DIR" --level_3_processed_path "$PRED_DIR/level-3-processed" --output_dir_path "$PRED_DIR/level-3-processed_with_masks" --checkpoints_path "$CKPT_DIR/sam_hq_vit_h.pth"
169
+ popd
170
+
171
+ # 23. Add Additional Context to the Annotations
172
+ run_in_env_targeted grand_env_utils
173
+ python utils/add_addional_context.py --annotations_dir_path "$PRED_DIR/level-3-processed_with_masks" --level_4_additional_context_path "$PRED_DIR/level-4-vicuna-13B" --output_dir_path "$PRED_DIR/level-4-processed"
174
+
175
+
176
+ # -------------------------------------------------------------------------------------------------------------------- #
177
+
178
+ echo "Pipeline inference is complete. Predictions are saved in $PRED_DIR/level-4-processed"
groundingLMM/LLaVA/.dockerignore ADDED
@@ -0,0 +1,21 @@
1
+ # The .dockerignore file excludes files from the container build process.
2
+ #
3
+ # https://docs.docker.com/engine/reference/builder/#dockerignore-file
4
+
5
+ # Exclude Git files
6
+ .git
7
+ .github
8
+ .gitignore
9
+
10
+ # Exclude Python cache files
11
+ __pycache__
12
+ .mypy_cache
13
+ .pytest_cache
14
+ .ruff_cache
15
+
16
+ # Exclude Python virtual environment
17
+ /venv
18
+
19
+ # Exclude some weights
20
+ /openai
21
+ /liuhaotian
groundingLMM/LLaVA/.editorconfig ADDED
@@ -0,0 +1,18 @@
1
+ root = true
2
+
3
+ # Unix-style newlines with a newline ending every file
4
+ [*]
5
+ end_of_line = lf
6
+ insert_final_newline = true
7
+ trim_trailing_whitespace = true
8
+ charset = utf-8
9
+
10
+ # 4 space indentation
11
+ [*.{py,json}]
12
+ indent_style = space
13
+ indent_size = 4
14
+
15
+ # 2 space indentation
16
+ [*.{md,sh,yaml,yml}]
17
+ indent_style = space
18
+ indent_size = 2
groundingLMM/LLaVA/.gitattributes ADDED
@@ -0,0 +1,29 @@
1
+ # https://git-scm.com/docs/gitattributes
2
+
3
+ # Set the default behavior, in case people don't have core.autocrlf set.
4
+ # https://git-scm.com/docs/gitattributes#_end_of_line_conversion
5
+ * text=auto
6
+
7
+ # common python attributes, taken from https://github.com/alexkaratarakis/gitattributes/blob/710900479a2bedeec7003d381719521ffbb18bf8/Python.gitattributes
8
+ # Source files
9
+ # ============
10
+ *.pxd text diff=python
11
+ *.py text diff=python
12
+ *.py3 text diff=python
13
+ *.pyw text diff=python
14
+ *.pyx text diff=python
15
+ *.pyz text diff=python
16
+ *.pyi text diff=python
17
+
18
+ # Binary files
19
+ # ============
20
+ *.db binary
21
+ *.p binary
22
+ *.pkl binary
23
+ *.pickle binary
24
+ *.pyc binary export-ignore
25
+ *.pyo binary export-ignore
26
+ *.pyd binary
27
+
28
+ # Jupyter notebook
29
+ *.ipynb text eol=lf
groundingLMM/LLaVA/.gitignore ADDED
@@ -0,0 +1,35 @@
1
+ # Python
2
+ __pycache__
3
+ *.pyc
4
+ *.egg-info
5
+ dist
6
+
7
+ # Log
8
+ *.log
9
+ *.log.*
10
+ *.json
11
+ *.jsonl
12
+
13
+ # Data
14
+ !**/alpaca-data-conversation.json
15
+
16
+ # Editor
17
+ .idea
18
+ *.swp
19
+
20
+ # Other
21
+ .DS_Store
22
+ wandb
23
+ output
24
+
25
+ checkpoints
26
+ ckpts*
27
+
28
+ .ipynb_checkpoints
29
+ *.ipynb
30
+
31
+ # DevContainer
32
+ !.devcontainer/*
33
+
34
+ # Demo
35
+ serve_images/
groundingLMM/LLaVA/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
groundingLMM/LLaVA/README.md ADDED
@@ -0,0 +1,463 @@
1
+ # 🌋 LLaVA: Large Language and Vision Assistant
2
+
3
+ *Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.*
4
+
5
+ [📢 [LLaVA-NeXT Blog](https://llava-vl.github.io/blog/2024-01-30-llava-next/)] [[Project Page](https://llava-vl.github.io/)] [[Demo](https://llava.hliu.cc/)] [[Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)] [[Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)]
6
+
7
+ 🤝Community Contributions: [[llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436)] [[Colab](https://github.com/camenduru/LLaVA-colab)] [[🤗Space](https://huggingface.co/spaces/badayvedat/LLaVA)] [[Replicate](https://replicate.com/yorickvp/llava-13b)] [[AutoGen](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb)] [[BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA)]
8
+
9
+ **Improved Baselines with Visual Instruction Tuning** [[Paper](https://arxiv.org/abs/2310.03744)] [[HF](https://huggingface.co/papers/2310.03744)] <br>
10
+ [Haotian Liu](https://hliu.cc), [Chunyuan Li](https://chunyuan.li/), [Yuheng Li](https://yuheng-li.github.io/), [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/)
11
+
12
+ **Visual Instruction Tuning** (NeurIPS 2023, **Oral**) [[Paper](https://arxiv.org/abs/2304.08485)] [[HF](https://huggingface.co/papers/2304.08485)] <br>
13
+ [Haotian Liu*](https://hliu.cc), [Chunyuan Li*](https://chunyuan.li/), [Qingyang Wu](https://scholar.google.ca/citations?user=HDiw-TsAAAAJ&hl=en/), [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/) (*Equal Contribution)
14
+
15
+ <!--p align="center">
16
+ <a href="https://llava.hliu.cc/"><img src="images/llava_logo.png" width="50%"></a> <br>
17
+ Generated by <a href="https://gligen.github.io/">GLIGEN</a> via "a cute lava llama with glasses" and box prompt
18
+ </p-->
19
+
20
+
21
+ ## Release
22
+
23
+ - [2024/05/10] 🔥 **LLaVA-NeXT** (Stronger) models are released, stronger LMM with support of LLama-3 (8B) and Qwen-1.5 (72B/110B). [[Blog](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/)] [[Checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff)] [[Demo](https://llava-next.lmms-lab.com/)] [[Code](https://github.com/LLaVA-VL/LLaVA-NeXT/)]
24
+ - [2024/05/10] 🔥 **LLaVA-NeXT** (Video) is released. The image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer. DPO training with AI feedback on videos can yield significant improvement. [[Blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)] [[Checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)] [[Code](https://github.com/LLaVA-VL/LLaVA-NeXT/)]
25
+ - [03/10] Releasing **LMMs-Eval**, a highly efficient evaluation pipeline we used when developing LLaVA-NeXT. It supports the evaluation of LMMs on dozens of public datasets and allows new dataset onboarding, making the dev of new LMMs much faster. [[Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/)] [[Codebase](https://github.com/EvolvingLMMs-Lab/lmms-eval)]
26
+ - [1/30] 🔥 **LLaVA-NeXT** (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-next/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon.
27
+ - [11/10] [LLaVA-Plus](https://llava-vl.github.io/llava-plus/) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [[Project Page](https://llava-vl.github.io/llava-plus/)] [[Demo](https://llavaplus.ngrok.io/)] [[Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)] [[Paper](https://arxiv.org/abs/2311.05437)]
28
+ - [11/2] [LLaVA-Interactive](https://llava-vl.github.io/llava-interactive/) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [[Project Page](https://llava-vl.github.io/llava-interactive/)] [[Demo](https://llavainteractive.ngrok.io/)] [[Code](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo)] [[Paper](https://arxiv.org/abs/2311.00571)]
29
+ - [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement ([ckpts](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md#llava-v15), [script](https://github.com/haotian-liu/LLaVA#train)). We also provide a [doc](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md) on how to finetune LLaVA-1.5 on your own dataset with LoRA.
30
+ - [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [[🤗 Demo](https://huggingface.co/spaces/etri-vilab/Ko-LLaVA)]
31
+ - [10/5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the [technical report](https://arxiv.org/abs/2310.03744), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), and evaluation scripts are released [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md)!
32
+ - [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project [[LLavA-RLHF]](https://llava-rlhf.github.io/)
33
+ - [9/22] [LLaVA](https://arxiv.org/abs/2304.08485) is accepted by NeurIPS 2023 as **oral presentation**, and [LLaVA-Med](https://arxiv.org/abs/2306.00890) is accepted by NeurIPS 2023 Datasets and Benchmarks Track as **spotlight presentation**.
34
+
35
+ <details>
36
+ <summary>More</summary>
37
+
38
+ - [11/6] Support **Intel** dGPU and CPU platforms. [More details here.](https://github.com/haotian-liu/LLaVA/tree/intel/docs/intel)
39
+ - [10/12] LLaVA is now supported in [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) with 4-bit / 5-bit quantization support!
40
+ - [10/11] The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), and evaluation scripts are released [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md)!
41
+ - [10/10] [Roboflow Deep Dive](https://blog.roboflow.com/first-impressions-with-llava-1-5/): First Impressions with LLaVA-1.5.
42
+ - [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a [note](https://arxiv.org/abs/2309.09958). Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper [``Multimodal Foundation Models: From Specialists to General-Purpose Assistants''.](https://arxiv.org/abs/2309.10020)
43
+ <p align="center">
44
+ <img src="https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings/blob/main/images/mfm_evolution.jpeg?raw=true" width=50%/>
45
+ </p>
46
+
47
+ - [7/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release [LLaVA Bench](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_Bench.md) for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out [LLaVA-from-LLaMA-2](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_from_LLaMA2.md), and our [model zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)!
48
+ - [6/26] [CVPR 2023 Tutorial](https://vlp-tutorial.github.io/) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**! Please check out [[Slides](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https://arxiv.org/abs/2306.14895)] [[YouTube](https://youtu.be/mkI7EPD1vp8)] [[Bilibli](https://www.bilibili.com/video/BV1Ng4y1T7v3/)].
49
+ - [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations [here](./docs/LoRA.md).
50
+ - [6/1] We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the [paper](https://arxiv.org/abs/2306.00890) and [page](https://github.com/microsoft/LLaVA-Med).
51
+ - [5/6] We are releasing [LLaVA-Lightning-MPT-7B-preview](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat! See [here](#LLaVA-MPT-7b) for more details.
52
+ - [5/2] 🔥 We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See [here](#train-llava-lightning) for more details.
53
+ - [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out [here](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/llava).
54
+ - [4/17] 🔥 We released **LLaVA: Large Language and Vision Assistant**. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the [paper](https://arxiv.org/abs/2304.08485) and [demo](https://llava.hliu.cc/).
55
+
56
+ </details>
57
+
58
+ <!-- <a href="https://llava.hliu.cc/"><img src="assets/demo.gif" width="70%"></a> -->
59
+
60
+ [![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
61
+ **Usage and License Notices**: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the [OpenAI Terms of Use](https://openai.com/policies/terms-of-use) for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. [Llama community license](https://ai.meta.com/llama/license/) for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
62
+
63
+
64
+ ## Contents
65
+ - [Install](#install)
66
+ - [LLaVA Weights](#llava-weights)
67
+ - [Demo](#Demo)
68
+ - [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)
69
+ - [Dataset](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)
70
+ - [Train](#train)
71
+ - [Evaluation](#evaluation)
72
+
73
+ ## Install
74
+
75
+ If you are not using Linux, do *NOT* proceed, see instructions for [macOS](https://github.com/haotian-liu/LLaVA/blob/main/docs/macOS.md) and [Windows](https://github.com/haotian-liu/LLaVA/blob/main/docs/Windows.md).
76
+
77
+ 1. Clone this repository and navigate to LLaVA folder
78
+ ```bash
79
+ git clone https://github.com/haotian-liu/LLaVA.git
80
+ cd LLaVA
81
+ ```
82
+
83
+ 2. Install Package
84
+ ```Shell
85
+ conda create -n llava python=3.10 -y
86
+ conda activate llava
87
+ pip install --upgrade pip # enable PEP 660 support
88
+ pip install -e .
89
+ ```
90
+
91
+ 3. Install additional packages for training cases
92
+ ```
93
+ pip install -e ".[train]"
94
+ pip install flash-attn --no-build-isolation
95
+ ```
96
+
97
+ ### Upgrade to latest code base
98
+
99
+ ```Shell
100
+ git pull
101
+ pip install -e .
102
+
103
+ # if you see some import errors when you upgrade,
104
+ # please try running the command below (without #)
105
+ # pip install flash-attn --no-build-isolation --no-cache-dir
106
+ ```
107
+
108
+ ### Quick Start With HuggingFace
109
+
110
+ <details>
111
+ <summary>Example Code</summary>
112
+
113
+ ```Python
114
+ from llava.model.builder import load_pretrained_model
115
+ from llava.mm_utils import get_model_name_from_path
116
+ from llava.eval.run_llava import eval_model
117
+
118
+ model_path = "liuhaotian/llava-v1.5-7b"
119
+
120
+ tokenizer, model, image_processor, context_len = load_pretrained_model(
121
+ model_path=model_path,
122
+ model_base=None,
123
+ model_name=get_model_name_from_path(model_path)
124
+ )
125
+ ```
126
+
127
+ Check out the details with the `load_pretrained_model` function in `llava/model/builder.py`.
128
+
129
+ You can also use the `eval_model` function in `llava/eval/run_llava.py` to get the output easily. By doing so, you can use this code on Colab directly after downloading this repository.
130
+
131
+ ``` python
132
+ model_path = "liuhaotian/llava-v1.5-7b"
133
+ prompt = "What are the things I should be cautious about when I visit here?"
134
+ image_file = "https://llava-vl.github.io/static/images/view.jpg"
135
+
136
+ args = type('Args', (), {
137
+ "model_path": model_path,
138
+ "model_base": None,
139
+ "model_name": get_model_name_from_path(model_path),
140
+ "query": prompt,
141
+ "conv_mode": None,
142
+ "image_file": image_file,
143
+ "sep": ",",
144
+ "temperature": 0,
145
+ "top_p": None,
146
+ "num_beams": 1,
147
+ "max_new_tokens": 512
148
+ })()
149
+
150
+ eval_model(args)
151
+ ```
152
+ </details>
153
+
154
+ ## LLaVA Weights
155
+ Please check out our [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) for all public LLaVA checkpoints, and the instructions of how to use the weights.
156
+
157
+ ## Demo
158
+
159
+ ### Gradio Web UI
160
+
161
+ To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server *ONCE*.
162
+
163
+ ```mermaid
164
+ flowchart BT
165
+ %% Declare Nodes
166
+ gws("Gradio (UI Server)")
167
+ c("Controller (API Server):<br/>PORT: 10000")
168
+ mw7b("Model Worker:<br/>llava-v1.5-7b<br/>PORT: 40000")
169
+ mw13b("Model Worker:<br/>llava-v1.5-13b<br/>PORT: 40001")
170
+ sglw13b("SGLang Backend:<br/>llava-v1.6-34b<br/>http://localhost:30000")
171
+ lsglw13b("SGLang Worker:<br/>llava-v1.6-34b<br/>PORT: 40002")
172
+
173
+ %% Declare Styles
174
+ classDef data fill:#3af,stroke:#48a,stroke-width:2px,color:#444
175
+ classDef success fill:#8f8,stroke:#0a0,stroke-width:2px,color:#444
176
+ classDef failure fill:#f88,stroke:#f00,stroke-width:2px,color:#444
177
+
178
+ %% Assign Styles
179
+ class id,od data;
180
+ class cimg,cs_s,scsim_s success;
181
+ class ncimg,cs_f,scsim_f failure;
182
+
183
+ subgraph Demo Connections
184
+ direction BT
185
+ c<-->gws
186
+
187
+ mw7b<-->c
188
+ mw13b<-->c
189
+ lsglw13b<-->c
190
+ sglw13b<-->lsglw13b
191
+ end
192
+ ```
193
+
194
+ #### Launch a controller
195
+ ```Shell
196
+ python -m llava.serve.controller --host 0.0.0.0 --port 10000
197
+ ```
198
+
199
+ #### Launch a gradio web server.
200
+ ```Shell
201
+ python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
202
+ ```
203
+ You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.
204
+
205
+ #### Launch a SGLang worker
206
+
207
+ This is the recommended way to serve the LLaVA model with high throughput, and you need to install SGLang first. Note that `4-bit` quantization is not currently supported on SGLang-LLaVA; if you have limited GPU VRAM, please check out the model worker with [quantization](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#launch-a-model-worker-4-bit-8-bit-inference-quantized).
208
+
209
+ ```Shell
210
+ pip install "sglang[all]"
211
+ ```
212
+
213
+ You'll first launch a SGLang backend worker which will execute the models on GPUs. Remember the `--port` you've set and you'll use that later.
214
+
215
+ ```Shell
216
+ # Single GPU
217
+ CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000
218
+
219
+ # Multiple GPUs with tensor parallel
220
+ CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-13b --tokenizer-path llava-hf/llava-1.5-13b-hf --port 30000 --tp 2
221
+ ```
222
+
223
+ Tokenizers (temporary): `llava-hf/llava-1.5-7b-hf`, `llava-hf/llava-1.5-13b-hf`, `liuhaotian/llava-v1.6-34b-tokenizer`.
224
+
225
+ You'll then launch a LLaVA-SGLang worker that will communicate between LLaVA controller and SGLang backend to route the requests. Set `--sgl-endpoint` to `http://127.0.0.1:port` where `port` is the one you just set (default: 30000).
226
+
227
+ ```Shell
228
+ python -m llava.serve.sglang_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --sgl-endpoint http://127.0.0.1:30000
229
+ ```
230
+
231
+ #### Launch a model worker
232
+
233
+ This is the actual *worker* that performs the inference on the GPU. Each worker is responsible for a single model specified in `--model-path`.
234
+
235
+ ```Shell
236
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
237
+ ```
238
+ Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
239
+
240
+ You can launch as many workers as you want, and compare between different model checkpoints in the same Gradio interface. Please keep the `--controller` the same, and modify the `--port` and `--worker` to a different port number for each worker.
241
+ ```Shell
242
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path <ckpt2>
243
+ ```
244
+
245
+ If you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the `--device` flag: `--device mps`.
246
+
247
+ #### Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)
248
+
249
+ If the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with `CUDA_VISIBLE_DEVICES`. Below is an example of running with the first two GPUs.
250
+
251
+ ```Shell
252
+ CUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
253
+ ```
254
+
255
+ #### Launch a model worker (4-bit, 8-bit inference, quantized)
256
+
257
+ You can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with reduced GPU memory footprint, potentially allowing you to run on a GPU with as few as 12GB VRAM. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append `--load-4bit` or `--load-8bit` to the **model worker** command that you are executing. Below is an example of running with 4-bit quantization.
258
+
259
+ ```Shell
260
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b --load-4bit
261
+ ```
262
+
263
+ #### Launch a model worker (LoRA weights, unmerged)
264
+
265
+ You can launch the model worker with LoRA weights, without merging them with the base checkpoint, to save disk space. There will be additional loading time, while the inference speed is the same as the merged checkpoints. Unmerged LoRA checkpoints do not have `lora-merge` in the model name, and are usually much smaller (less than 1GB) than the merged checkpoints (13G for 7B, and 25G for 13B).
266
+
267
+ To load unmerged LoRA weights, you simply need to pass an additional argument `--model-base`, which is the base LLM that is used to train the LoRA weights. You can check the base LLM of each LoRA weights in the [model zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md).
268
+
269
+ ```Shell
270
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base lmsys/vicuna-13b-v1.3
271
+ ```
272
+
273
+ ### CLI Inference
274
+
275
+ Chat about images using LLaVA without the need of Gradio interface. It also supports multiple GPUs, 4-bit and 8-bit quantized inference. With 4-bit quantization, for our LLaVA-1.5-7B, it uses less than 8GB VRAM on a single GPU.
276
+
277
+ ```Shell
278
+ python -m llava.serve.cli \
279
+ --model-path liuhaotian/llava-v1.5-7b \
280
+ --image-file "https://llava-vl.github.io/static/images/view.jpg" \
281
+ --load-4bit
282
+ ```
283
+
284
+ <img src="images/demo_cli.gif" width="70%">
285
+
286
+ ## Train
287
+
288
+ *Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to README of [this](https://github.com/haotian-liu/LLaVA/tree/v1.0.1) version for now. We'll add them in a separate doc later.*
289
+
290
+ LLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.
291
+
292
+ LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
293
+
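+ As a quick sanity check, you can compute the global batch size for your setup; the values below are only an illustration (not taken from our released scripts), chosen so that the product stays at 128:
+ 
+ ```Shell
+ # global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
+ per_device_train_batch_size=16
+ gradient_accumulation_steps=2
+ num_gpus=4
+ echo $((per_device_train_batch_size * gradient_accumulation_steps * num_gpus))   # 128
+ ```
+ 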
294
+ ### Hyperparameters
295
+ We use a similar set of hyperparameters as Vicuna in finetuning. Both hyperparameters used in pretraining and finetuning are provided below.
296
+
297
+ 1. Pretraining
298
+
299
+ | Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
300
+ | --- | ---: | ---: | ---: | ---: | ---: |
301
+ | LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |
302
+
303
+ 2. Finetuning
304
+
305
+ | Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
306
+ | --- | ---: | ---: | ---: | ---: | ---: |
307
+ | LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |
308
+
309
+ ### Download Vicuna checkpoints (automatically)
310
+
311
+ Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.
312
+
313
+ ### Pretrain (feature alignment)
314
+
315
+ Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
316
+
317
+ Pretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.
318
+
319
+ Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/pretrain.sh).
320
+
321
+ - `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
322
+ - `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
323
+
324
+ <details>
325
+ <summary>Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G)</summary>
326
+
327
+ We provide training script with DeepSpeed [here](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain_xformers.sh).
328
+ Tips:
329
+ - If you are using V100 which is not supported by FlashAttention, you can use the [memory-efficient attention](https://arxiv.org/abs/2112.05682) implemented in [xFormers](https://github.com/facebookresearch/xformers). Install xformers and replace `llava/train/train_mem.py` above with [llava/train/train_xformers.py](llava/train/train_xformers.py).
330
+ </details>
331
+
332
+ ### Visual Instruction Tuning
333
+
334
+ 1. Prepare data
335
+
336
+ Please download the annotation of the final mixture our instruction tuning data [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and download the images from constituting datasets:
337
+
338
+ - COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
339
+ - GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
340
+ - OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**
341
+ - TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
342
+ - VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
343
+
344
+ After downloading all of them, organize the data as follows in `./playground/data`,
345
+
346
+ ```
347
+ ├── coco
348
+ │ └── train2017
349
+ ├── gqa
350
+ │ └── images
351
+ ├── ocr_vqa
352
+ │ └── images
353
+ ├── textvqa
354
+ │ └── train_images
355
+ └── vg
356
+ ├── VG_100K
357
+ └── VG_100K_2
358
+ ```
359
+
360
+ 2. Start training!
361
+
362
+ You may download our pretrained projectors in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function/train as we expected.
363
+
364
+ Visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).
365
+
366
+ Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune.sh).
367
+
368
+ If you do not have enough GPU memory:
369
+
370
+ - Use LoRA: [`finetune_lora.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_lora.sh). We are able to fit 13B training in 8-A100-40G/8-A6000, and 7B training in 8-RTX3090. Make sure `per_device_train_batch_size*gradient_accumulation_steps` is the same as the provided script for best reproducibility.
371
+ - Replace `zero3.json` with `zero3_offload.json` which offloads some parameters to CPU RAM. This slows down the training speed.
372
+
373
+ If you are interested in finetuning the LLaVA model on your own task/data, please check out [`Finetune_Custom_Data.md`](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md).
374
+
375
+ New options to note:
376
+
377
+ - `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
378
+ - `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
379
+ - `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
380
+ - `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25%, and does not affect the final outcome.
381
+
382
+ ## Evaluation
383
+
384
+ In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process is consistent with the real-time chat demo.
385
+
386
+ See [Evaluation.md](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md).
387
+
388
+ ### GPT-assisted Evaluation
389
+
390
+ Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.
391
+
392
+ 1. Generate LLaVA responses
393
+
394
+ ```Shell
395
+ python model_vqa.py \
396
+ --model-path ./checkpoints/LLaVA-13B-v0 \
397
+ --question-file \
398
+ playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
399
+ --image-folder \
400
+ /path/to/coco2014_val \
401
+ --answers-file \
402
+ /path/to/answer-file-our.jsonl
403
+ ```
404
+
405
+ 2. Evaluate the generated responses. In our case, [`answer-file-ref.jsonl`](./playground/data/coco2014_val_qa_eval/qa90_gpt4_answer.jsonl) is the response generated by text-only GPT-4 (0314), with the context captions/boxes provided.
406
+
407
+ ```Shell
408
+ OPENAI_API_KEY="sk-***********************************" python llava/eval/eval_gpt_review_visual.py \
409
+ --question playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
410
+ --context llava/eval/table/caps_boxes_coco2014_val_80.jsonl \
411
+ --answer-list \
412
+ /path/to/answer-file-ref.jsonl \
413
+ /path/to/answer-file-our.jsonl \
414
+ --rule llava/eval/table/rule.json \
415
+ --output /path/to/review.json
416
+ ```
417
+
418
+ 3. Summarize the evaluation results
419
+
420
+ ```Shell
421
+ python summarize_gpt_review.py
422
+ ```
423
+
424
+ ## Citation
425
+
426
+ If you find LLaVA useful for your research and applications, please cite using this BibTeX:
427
+ ```bibtex
428
+ @misc{liu2024llavanext,
429
+ title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
430
+ url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
431
+ author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
432
+ month={January},
433
+ year={2024}
434
+ }
435
+
436
+ @misc{liu2023improvedllava,
437
+ title={Improved Baselines with Visual Instruction Tuning},
438
+ author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
439
+ publisher={arXiv:2310.03744},
440
+ year={2023},
441
+ }
442
+
443
+ @misc{liu2023llava,
444
+ title={Visual Instruction Tuning},
445
+ author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
446
+ publisher={NeurIPS},
447
+ year={2023},
448
+ }
449
+ ```
450
+
451
+ ## Acknowledgement
452
+
453
+ - [Vicuna](https://github.com/lm-sys/FastChat): the codebase we built upon, and our base model Vicuna-13B that has the amazing language capabilities!
454
+
455
+ ## Related Projects
456
+
457
+ - [Instruction Tuning with GPT-4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
458
+ - [LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](https://github.com/microsoft/LLaVA-Med)
459
+ - [Otter: In-Context Multi-Modal Instruction Tuning](https://github.com/Luodian/Otter)
460
+
461
+ For future project ideas, please check out:
462
+ - [SEEM: Segment Everything Everywhere All at Once](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
463
+ - [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) to detect, segment, and generate anything by marrying [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) and [Segment-Anything](https://github.com/facebookresearch/segment-anything).
groundingLMM/LLaVA/cog.yaml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Configuration for Cog ⚙️
2
+ # Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
3
+
4
+ build:
5
+ gpu: true
6
+
7
+ python_version: "3.11"
8
+
9
+ python_packages:
10
+ - "torch==2.0.1"
11
+ - "accelerate==0.21.0"
12
+ - "bitsandbytes==0.41.0"
13
+ - "deepspeed==0.9.5"
14
+ - "einops-exts==0.0.4"
15
+ - "einops==0.6.1"
16
+ - "gradio==3.35.2"
17
+ - "gradio_client==0.2.9"
18
+ - "httpx==0.24.0"
19
+ - "markdown2==2.4.10"
20
+ - "numpy==1.26.0"
21
+ - "peft==0.4.0"
22
+ - "scikit-learn==1.2.2"
23
+ - "sentencepiece==0.1.99"
24
+ - "shortuuid==1.0.11"
25
+ - "timm==0.6.13"
26
+ - "tokenizers==0.13.3"
27
+ - "torch==2.0.1"
28
+ - "torchvision==0.15.2"
29
+ - "transformers==4.31.0"
30
+ - "wandb==0.15.12"
31
+ - "wavedrom==2.0.3.post3"
32
+ - "Pygments==2.16.1"
33
+ run:
34
+ - curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.0.3/pget" && chmod +x /usr/local/bin/pget
35
+
36
+ # predict.py defines how predictions are run on your model
37
+ predict: "predict.py:Predictor"
groundingLMM/LLaVA/predict.py ADDED
@@ -0,0 +1,155 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+
3
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
4
+ from llava.conversation import conv_templates, SeparatorStyle
5
+ from llava.model.builder import load_pretrained_model
6
+ from llava.utils import disable_torch_init
7
+ from llava.mm_utils import tokenizer_image_token
8
+ from transformers.generation.streamers import TextIteratorStreamer
9
+
10
+ from PIL import Image
11
+
12
+ import requests
13
+ from io import BytesIO
14
+
15
+ from cog import BasePredictor, Input, Path, ConcatenateIterator
16
+ import time
17
+ import subprocess
18
+ from threading import Thread
19
+
20
+ import os
21
+ os.environ["HUGGINGFACE_HUB_CACHE"] = os.getcwd() + "/weights"
22
+
23
+ # url for the weights mirror
24
+ REPLICATE_WEIGHTS_URL = "https://weights.replicate.delivery/default"
25
+ # files to download from the weights mirrors
26
+ weights = [
27
+ {
28
+ "dest": "liuhaotian/llava-v1.5-13b",
29
+ # git commit hash from huggingface
30
+ "src": "llava-v1.5-13b/006818fc465ebda4c003c0998674d9141d8d95f8",
31
+ "files": [
32
+ "config.json",
33
+ "generation_config.json",
34
+ "pytorch_model-00001-of-00003.bin",
35
+ "pytorch_model-00002-of-00003.bin",
36
+ "pytorch_model-00003-of-00003.bin",
37
+ "pytorch_model.bin.index.json",
38
+ "special_tokens_map.json",
39
+ "tokenizer.model",
40
+ "tokenizer_config.json",
41
+ ]
42
+ },
43
+ {
44
+ "dest": "openai/clip-vit-large-patch14-336",
45
+ "src": "clip-vit-large-patch14-336/ce19dc912ca5cd21c8a653c79e251e808ccabcd1",
46
+ "files": [
47
+ "config.json",
48
+ "preprocessor_config.json",
49
+ "pytorch_model.bin"
50
+ ],
51
+ }
52
+ ]
53
+
54
+ def download_json(url: str, dest: Path):
55
+ res = requests.get(url, allow_redirects=True)
56
+ if res.status_code == 200 and res.content:
57
+ with dest.open("wb") as f:
58
+ f.write(res.content)
59
+ else:
60
+ print(f"Failed to download {url}. Status code: {res.status_code}")
61
+
62
+ def download_weights(baseurl: str, basedest: str, files: list[str]):
63
+ basedest = Path(basedest)
64
+ start = time.time()
65
+ print("downloading to: ", basedest)
66
+ basedest.mkdir(parents=True, exist_ok=True)
67
+ for f in files:
68
+ dest = basedest / f
69
+ url = os.path.join(REPLICATE_WEIGHTS_URL, baseurl, f)
70
+ if not dest.exists():
71
+ print("downloading url: ", url)
72
+ if dest.suffix == ".json":
73
+ download_json(url, dest)
74
+ else:
75
+ subprocess.check_call(["pget", url, str(dest)], close_fds=False)
76
+ print("downloading took: ", time.time() - start)
77
+
78
+ class Predictor(BasePredictor):
79
+ def setup(self) -> None:
80
+ """Load the model into memory to make running multiple predictions efficient"""
81
+ for weight in weights:
82
+ download_weights(weight["src"], weight["dest"], weight["files"])
83
+ disable_torch_init()
84
+
85
+ self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model("liuhaotian/llava-v1.5-13b", model_name="llava-v1.5-13b", model_base=None, load_8bit=False, load_4bit=False)
86
+
87
+ def predict(
88
+ self,
89
+ image: Path = Input(description="Input image"),
90
+ prompt: str = Input(description="Prompt to use for text generation"),
91
+ top_p: float = Input(description="When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens", ge=0.0, le=1.0, default=1.0),
92
+ temperature: float = Input(description="Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic", default=0.2, ge=0.0),
93
+ max_tokens: int = Input(description="Maximum number of tokens to generate. A word is generally 2-3 tokens", default=1024, ge=0),
94
+ ) -> ConcatenateIterator[str]:
95
+ """Run a single prediction on the model"""
96
+
97
+ conv_mode = "llava_v1"
98
+ conv = conv_templates[conv_mode].copy()
99
+
100
+ image_data = load_image(str(image))
101
+ image_tensor = self.image_processor.preprocess(image_data, return_tensors='pt')['pixel_values'].half().cuda()
102
+
103
+ # loop start
104
+
105
+ # just one turn, always prepend image token
106
+ inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
107
+ conv.append_message(conv.roles[0], inp)
108
+
109
+ conv.append_message(conv.roles[1], None)
110
+ prompt = conv.get_prompt()
111
+
112
+ input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
113
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
114
+ keywords = [stop_str]
115
+ streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, timeout=20.0)
116
+
117
+ with torch.inference_mode():
118
+ thread = Thread(target=self.model.generate, kwargs=dict(
119
+ inputs=input_ids,
120
+ images=image_tensor,
121
+ do_sample=True,
122
+ temperature=temperature,
123
+ top_p=top_p,
124
+ max_new_tokens=max_tokens,
125
+ streamer=streamer,
126
+ use_cache=True))
127
+ thread.start()
128
+ # workaround: second-to-last token is always " "
129
+ # but we want to keep it if it's not the second-to-last token
130
+ prepend_space = False
131
+ for new_text in streamer:
132
+ if new_text == " ":
133
+ prepend_space = True
134
+ continue
135
+ if new_text.endswith(stop_str):
136
+ new_text = new_text[:-len(stop_str)].strip()
137
+ prepend_space = False
138
+ elif prepend_space:
139
+ new_text = " " + new_text
140
+ prepend_space = False
141
+ if len(new_text):
142
+ yield new_text
143
+ if prepend_space:
144
+ yield " "
145
+ thread.join()
146
+
147
+
148
+ def load_image(image_file):
149
+ if image_file.startswith('http') or image_file.startswith('https'):
150
+ response = requests.get(image_file)
151
+ image = Image.open(BytesIO(response.content)).convert('RGB')
152
+ else:
153
+ image = Image.open(image_file).convert('RGB')
154
+ return image
155
+
groundingLMM/LLaVA/pyproject.toml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [build-system]
2
+ requires = ["setuptools>=61.0"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "llava"
7
+ version = "1.2.2.post1"
8
+ description = "Towards GPT-4 like large language and visual assistant."
9
+ readme = "README.md"
10
+ requires-python = ">=3.8"
11
+ classifiers = [
12
+ "Programming Language :: Python :: 3",
13
+ "License :: OSI Approved :: Apache Software License",
14
+ ]
15
+ dependencies = [
16
+ "torch==2.1.2", "torchvision==0.16.2",
17
+ "transformers==4.37.2", "tokenizers==0.15.1", "sentencepiece==0.1.99", "shortuuid",
18
+ "accelerate==0.21.0", "peft", "bitsandbytes",
19
+ "pydantic", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
20
+ "gradio==4.16.0", "gradio_client==0.8.1",
21
+ "requests", "httpx==0.24.0", "uvicorn", "fastapi",
22
+ "einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
23
+ ]
24
+
25
+ [project.optional-dependencies]
26
+ train = ["deepspeed==0.12.6", "ninja", "wandb"]
27
+ build = ["build", "twine"]
28
+
29
+ [project.urls]
30
+ "Homepage" = "https://llava-vl.github.io"
31
+ "Bug Tracker" = "https://github.com/haotian-liu/LLaVA/issues"
32
+
33
+ [tool.setuptools.packages.find]
34
+ exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
35
+
36
+ [tool.wheel]
37
+ exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
groundingLMM/dataset/dataset.py ADDED
@@ -0,0 +1,236 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+
4
+ from model.llava import conversation as conversation_lib
5
+ from model.llava.mm_utils import tokenizer_image_token
6
+ from dataset.caption_datasets.COCO_Caption_ds import CocoCapDataset
7
+ from dataset.caption_datasets.LLavaInstruct_vqa_ds import LLaVAInstructDataset
8
+ from dataset.region_datasets.Flickr_Region_ds import Flickr30kRegDataset
9
+ from dataset.segm_datasets.Semantic_Segm_ds import SemanticSegmDataset
10
+ from dataset.segm_datasets.RefCOCO_Segm_ds import ReferSegmDataset
11
+ from dataset.gcg_datasets.GranDf_gcg_ds import GranDfDataset, OpenPsgGCGDataset, Flickr30kGCGDataset, RefCOCOgGCGDataset
12
+ from dataset.region_datasets.RefCOCO_VG_Region_ds import (RefCocoRegDataset, RefCocoGRegDataset, RefCocoPRegDataset,
13
+ VisualGenomeRegDataset)
14
+ from dataset.caption_datasets.GranD_ShortCaption_ds import GrandShortCaptionDataset
15
+ from dataset.region_datasets.GranD_ReferringRegion_ds import GrandReferRegDataset
16
+ from dataset.segm_datasets.GranD_ReferringSegm_ds import GrandReferSegmDataset
17
+ from tools.utils import DEFAULT_IMAGE_TOKEN, IGNORE_INDEX, DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN
18
+
19
+
20
+ class HybridDatasetBase(torch.utils.data.Dataset):
21
+ PIXEL_MEAN = torch.tensor([123.675, 116.28, 103.53]).view(-1, 1, 1)
22
+ PIXEL_STD = torch.tensor([58.395, 57.12, 57.375]).view(-1, 1, 1)
23
+ IMG_SIZE = 1024
24
+ IGNORE_LABEL = 255
25
+
26
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config,
27
+ epoch_samples=500 * 8 * 2 * 10, batch_size=2, precision="fp32", image_size=224,
28
+ num_classes_per_sample=3, sample_rate=None):
29
+ self.dataset_dir = dataset_dir
30
+ self.tokenizer = tokenizer
31
+ self.global_image_encoder = global_image_encoder
32
+ self.dataset = dataset
33
+ self.datasets_config = datasets_config
34
+ self.epoch_samples = epoch_samples
35
+ self.batch_size = batch_size
36
+ self.precision = precision
37
+ self.image_size = image_size
38
+ self.num_classes_per_sample = num_classes_per_sample
39
+
40
+ self.dataset_list = dataset.split("||")
41
+ self.sample_rate = np.array(sample_rate or [1] * len(self.dataset_list), dtype=np.float64)  # float dtype so the in-place normalization below works for integer rate lists
42
+ self.sample_rate /= self.sample_rate.sum()
43
+ self.all_datasets = self.create_datasets()
44
+
45
+ def create_datasets(self):
46
+ datasets = []
47
+ for ds in self.dataset_list:
48
+ dataset_cls = self.datasets_config.get(ds)
49
+ if dataset_cls:
50
+ if ds == 'Semantic_Segm':
51
+ datasets.append(
52
+ dataset_cls(
53
+ self.dataset_dir, self.tokenizer, self.global_image_encoder, self.epoch_samples,
54
+ self.precision, self.image_size, self.num_classes_per_sample, self.semantic_segm_data, )
55
+ )
56
+ elif ds == 'Refer_Segm':
57
+ datasets.append(
58
+ dataset_cls(
59
+ self.dataset_dir, self.tokenizer, self.global_image_encoder, self.epoch_samples,
60
+ self.precision, self.image_size, self.num_classes_per_sample, self.refer_segm_data, )
61
+ )
62
+ else:
63
+ datasets.append(
64
+ dataset_cls(
65
+ self.dataset_dir, self.tokenizer, self.global_image_encoder, self.epoch_samples,
66
+ self.precision, self.image_size, self.num_classes_per_sample, )
67
+ )
68
+ return datasets
69
+
70
+ def __len__(self):
71
+ return self.epoch_samples
72
+
73
+ def __getitem__(self, idx):
74
+ dataset_idx = np.random.choice(len(self.dataset_list), p=self.sample_rate)
75
+ selected_dataset = self.all_datasets[dataset_idx]
76
+ data = selected_dataset[0]
77
+ return (*data,)
78
+
79
+
80
+ class HybridCapDataset(HybridDatasetBase):
81
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=500 * 8 * 2 * 10, batch_size=2,
82
+ precision="fp32", image_size=224, num_classes_per_sample=3,
83
+ dataset="CocoCap||LLaVaInstruct", sample_rate=[1, 1]):
84
+ datasets_config = {"CocoCap": CocoCapDataset,
85
+ "LLaVaInstruct": LLaVAInstructDataset,
86
+ "GrandCaptionDataset": GrandShortCaptionDataset,
87
+ # Add other dataset mappings here
88
+ }
89
+ super().__init__(
90
+ dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config, epoch_samples, batch_size,
91
+ precision, image_size, num_classes_per_sample, sample_rate
92
+ )
93
+
94
+
95
+ class HybridRegDataset(HybridDatasetBase):
96
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=500 * 8 * 2 * 10, batch_size=2,
97
+ precision="fp32", image_size=224, num_classes_per_sample=3,
98
+ dataset="RefCoco_Reg||RefCocoG_Reg||RefCocoP_Reg||VisGen_Reg||Flickr_Reg", sample_rate=[1, 1, 1, 1, 1]):
99
+ datasets_config = {"RefCoco_Reg": RefCocoRegDataset,
100
+ "RefCocoG_Reg": RefCocoGRegDataset,
101
+ "RefCocoP_Reg": RefCocoPRegDataset,
102
+ "VisGen_Reg": VisualGenomeRegDataset,
103
+ "Flickr_Reg": Flickr30kRegDataset,
104
+ "GrandRefer_Reg": GrandReferRegDataset,
105
+ # Add other dataset mappings here
106
+ }
107
+ super().__init__(
108
+ dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config, epoch_samples, batch_size,
109
+ precision, image_size, num_classes_per_sample, sample_rate
110
+ )
111
+
112
+
113
+ class HybridSegDataset(HybridDatasetBase):
114
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=500 * 8 * 2 * 10, batch_size=2,
115
+ precision="fp32", image_size=224, num_classes_per_sample=3,
116
+ dataset="Semantic_Segm||Refer_Segm||PSG_GCG||RefCoco_GCG||GranDf_GCG||Flickr_GCG",
117
+ sample_rate=[5,4,1,1,1,1],
118
+ semantic_segm_data="ade20k||cocostuff||pascal_part||paco_lvis||mapillary",
119
+ refer_segm_data="refcoco||refcocog||refcoco+||refclef"):
120
+ self.semantic_segm_data = semantic_segm_data
121
+ self.refer_segm_data = refer_segm_data
122
+ datasets_config = {"Semantic_Segm": SemanticSegmDataset,
123
+ "Refer_Segm": ReferSegmDataset,
124
+ "PSG_GCG": OpenPsgGCGDataset,
125
+ "RefCoco_GCG": RefCOCOgGCGDataset,
126
+ "GranDf_GCG": GranDfDataset,
127
+ "Flickr_GCG": Flickr30kGCGDataset,
128
+ "GrandRefer_Segm": GrandReferSegmDataset,
129
+ # Add other dataset mappings here
130
+ }
131
+ super().__init__(
132
+ dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config, epoch_samples, batch_size,
133
+ precision, image_size, num_classes_per_sample, sample_rate
134
+ )
135
+
136
+
137
+ def custom_collate_fn(batch, tokenizer=None, use_mm_start_end=True, inference=False, local_rank=-1):
138
+ # Initializing lists and counters
139
+ image_path_list, global_enc_image_list, grounding_enc_image_list = [], [], []
140
+ bboxes_list, conversation_list, masks_list = [], [], []
141
+ label_list, resize_list, questions_list = [], [], []
142
+ selected_labels_list, offset_list, inferences = [], [0], []
143
+ cnt = 0
144
+
145
+ # Iterating through the batch
146
+ for (image_path, global_enc_image, grounding_enc_image, bboxes, conversations, masks, label, resize, questions,
147
+ sampled_classes) in batch:
148
+ image_path_list.append(image_path)
149
+ global_enc_image_list.append(global_enc_image)
150
+ grounding_enc_image_list.append(grounding_enc_image)
151
+ bboxes_list.append(bboxes)
152
+ conversation_list.extend(conversations)
153
+ masks_list.append([] if masks is None else masks.float())
154
+ label_list.append(label)
155
+ resize_list.append(resize)
156
+ questions_list.append(questions)
157
+ selected_labels_list.append(sampled_classes)
158
+ offset_list.append(cnt := cnt + len(conversations))
159
+ inferences.append(inference)
160
+
161
+ # Handling the conversation list
162
+ if use_mm_start_end:
163
+ replace_token = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
164
+ conversation_list = [conv.replace(DEFAULT_IMAGE_TOKEN, replace_token) for conv in conversation_list]
165
+
166
+ # Tokenizing and padding input ids
167
+ input_ids = torch.nn.utils.rnn.pad_sequence(
168
+ [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversation_list],
169
+ batch_first=True, padding_value=tokenizer.pad_token_id
170
+ )
171
+ attention_masks = input_ids.ne(tokenizer.pad_token_id)
172
+
173
+ # Preparing targets and handling conversation types
174
+ conv = conversation_lib.default_conversation.copy()
175
+ targets = input_ids.clone()
176
+ # conv_type == "llava_v1"
177
+ sep = conv.sep + conv.roles[1] + ": "
178
+ sep2 = conv.sep2
179
+
180
+ for conversation, target in zip(conversation_list, targets):
181
+ _process_conversation(conversation, target, tokenizer, sep, sep2)
182
+
183
+ # Adjusting for inferences
184
+ if not inferences[0]:
185
+ truncate_len = tokenizer.model_max_length - 575
186
+ if input_ids.shape[1] > truncate_len:
187
+ input_ids, targets, attention_masks = map(
188
+ lambda x: x[:, :truncate_len], [input_ids, targets, attention_masks]
189
+ )
190
+
191
+ return {
192
+ "image_paths": image_path_list,
193
+ "global_enc_images": torch.stack(global_enc_image_list, dim=0),
194
+ "grounding_enc_images": None if grounding_enc_image_list[0] is None else torch.stack(grounding_enc_image_list, dim=0),
195
+ "bboxes": None if bboxes_list[0] is None else bboxes_list,
196
+ "input_ids": input_ids,
197
+ "labels": targets,
198
+ "attention_masks": attention_masks,
199
+ "masks_list": None if masks_list[0] is None else masks_list,
200
+ "label_list": None if label_list[0] is None else label_list,
201
+ "resize_list": None if resize_list[0] is None else resize_list,
202
+ "offset": torch.LongTensor(offset_list),
203
+ "questions_list": questions_list,
204
+ "sampled_classes_list": selected_labels_list,
205
+ "inference": inferences[0],
206
+ "conversation_list": conversation_list,
207
+ }
208
+
209
+
210
+ def _process_conversation(conversation, target, tokenizer, sep, sep2):
211
+ total_len = target.ne(tokenizer.pad_token_id).sum().item()
212
+ rounds = conversation.split(sep2)
213
+ cur_len = 1
214
+ target[:cur_len] = IGNORE_INDEX
215
+
216
+ for rou in rounds:
217
+ if not rou:
218
+ break
219
+
220
+ parts = rou.split(sep)
221
+ assert len(parts) == 2, (len(parts), rou)
222
+ parts[0] += sep
223
+
224
+ if DEFAULT_IMAGE_TOKEN in conversation:
225
+ round_len = len(tokenizer_image_token(rou, tokenizer))
226
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
227
+ else:
228
+ round_len = len(tokenizer(rou).input_ids)
229
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 2
230
+
231
+ target[cur_len: cur_len + instruction_len] = IGNORE_INDEX
232
+ cur_len += round_len
233
+
234
+ target[cur_len:] = IGNORE_INDEX
235
+ if cur_len < tokenizer.model_max_length:
236
+ assert cur_len == total_len
groundingLMM/docs/GranD.md ADDED
@@ -0,0 +1,53 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GranD - Grounding Anything Dataset 🚀
2
+ The [Grounding-anything](https://grounding-anything.com/) Dataset (GranD) dataset offers densely annotated data, acquired through an automated annotation pipeline that leverages state-of-the-art (SOTA) vision and V-L models. This documentation covers how to download the GranD dataset and a guide to the automated annotation pipeline used to create GranD.
3
+
4
+ ## Download GranD 📂
5
+ - Annotations: [MBZUAI/GranD](https://huggingface.co/datasets/MBZUAI/GranD)
6
+ - Images: [Download](https://ai.meta.com/datasets/segment-anything-downloads/)
7
+ GranD utilizes images from the SAM dataset.
8
+
9
+ Note: Annotations are being uploaded incrementally; more parts will be available soon.
10
+
11
+ ### Preparing the Pretraining Annotations from GranD 🛠️
12
+
13
+ After downloading the GranD annotations, utilize the scripts below to transform them into GLaMM pretraining data, or to prepare them for your specific tasks.
14
+
15
+ - For object-level tasks such as object detection and semantic segmentation: [prepare_object_lvl_data.py](../GranD/prepare_annotations/prepare_object_lvl_data.py)
16
+ - For image-level captioning and caption grounding: [prepare_grand_caption_grounding.py](../GranD/prepare_annotations/prepare_grand_caption_grounding.py)
17
+ - For referring expression generation and referring expression segmentation: [prepare_grand_referring_expression](../GranD/prepare_annotations/prepare_grand_referring_expression.py)
18
+
19
+ The above scripts generate annotations in JSON format. To convert these for use in pretraining datasets requiring LMDB format, please use the following scripts (a usage sketch follows the list):
20
+ - To convert to lmdb: [get_txt_for_lmdb.py](../GranD/prepare_annotations/get_txt_for_lmdb.py)
21
+ - To extract file names in txt format: [get_txt_for_lmdb.py](../GranD/prepare_annotations/get_txt_for_lmdb.py)
22
+
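+ A hedged sketch for getting started with these scripts: their exact argument names are not documented here, so inspecting each script's CLI first is the safest route (`--help` assumes the scripts use argparse):
+
+ ```bash
+ # Inspect the CLI of each preparation script before wiring it into your pipeline.
+ for script in prepare_object_lvl_data prepare_grand_caption_grounding \
+               prepare_grand_referring_expression get_txt_for_lmdb; do
+     python GranD/prepare_annotations/${script}.py --help
+ done
+ ```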
23
+ ### GranD Automated Annotation Pipeline
24
+
25
+ GranD is a comprehensive, multi-purpose image-text dataset offering a range of contextual information, from fine-grained to high-level details. The pipeline contains four distinct levels.
26
+ The code for all four levels is provided in [GranD](../GranD).
27
+
28
+ More detailed information:
29
+ - To run the entire pipeline: [run_pipeline.sh](../GranD/run_pipeline.sh)
30
+ - To set up the environments used in [run_pipeline.sh](../GranD/run_pipeline.sh), refer to: [environments](../GranD/environments)
31
+ - Level-1 : Object Localization and Attributes
32
+ - Landmark Categorization: [landmark](../GranD/level_1_inference/1_landmark_categorization/README.md)
33
+ - Depth Map Estimation: [Midas Depth Estimation](../GranD/level_1_inference/2_depth_maps/README.md)
34
+ - Image Tagging: [RAM Tag2Text Tagging](../GranD/level_1_inference/3_image_tagging/README.md)
35
+ - Standard Object Detection: [CO-DETR OD](../GranD/level_1_inference/4_co_detr/README.md), [EVA OD](../GranD/level_1_inference/4_co_detr/README.md)
36
+ - Open Vocabulary Object Detection: [OWL-ViT OVD](../GranD/level_1_inference/6_owl_vit), [POMP OVD](../GranD/level_1_inference/7_pomp)
37
+ - Attribute Detection and Grounding: [Attribute & Grounding GRiT](../GranD/level_1_inference/8_grit/README.md)
38
+ - Open Vocabulary Classification: [OV Classification OV-SAM](../GranD/level_1_inference/9_ov_sam/README.md)
39
+ - Combine the predictions: [Merging](../GranD/utils/merge_json_level_1_with_nms.py)
40
+ - Generate Level-1 Scene Graph: [Level-1 Scene Graph](../GranD/utils/prepare_level_1.py)
41
+ - Level-2: Relationships
42
+ - Captioning: [BLIP-2 Captioning](../GranD/level_2_inference/1_blip-2/README.md), [LLaVA Captioning](../GranD/level_2_inference/2_llava/README.md)
43
+ - Grounding Short Captions: [MDETR Grounding](../GranD/level_2_inference/3_mdetr/README.md)
44
+ - Combine the predictions: [Merging](../GranD/utils/merge_json_level_2.py)
45
+ - Generate Level-2 Scene Graph and Update Level-1: [Level-2 Scene Graph](../GranD/utils/prepare_level_2.py)
46
+ - Enrich Attributes: [GPT4-RoI Attributes](../GranD/level_2_inference/4_gpt4roi/README.md)
47
+ - Label Assignment: [EVA-CLIP Label Assignment](../GranD/level_2_inference/5_label_assignment/README.md)
48
+ - Level-3: Scene Graph and Dense Captioning
49
+ - Generate Dense Captions: [Scene graph dense captioning LLaVA](../GranD/level_3_dense_caption/README.md)
50
+ - Level-4: Extra Contextual Insight:
51
+ - Generate Level-4 Additional Context: [Extra Context](../GranD/level_4_extra_context/README.md)
52
+
53
+
groundingLMM/docs/datasets.md ADDED
@@ -0,0 +1,327 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Prepare Dataset 🚀
2
+ This guide outlines the datasets required for open-source fine-tuning of GLaMM, which encompasses tasks like Grounded Conversation Generation (GCG), image-level captioning, visual question answering, region-level captioning, and referring expression segmentation. These datasets are used for fine-tuning to obtain the model demonstrated in our demo. We also highlight the specific datasets needed for each task.
3
+
4
+ To achieve all the capabilities of GLaMM, the following dataset types are used:
5
+ 1. GranD-f Grounded Conversation Generation (GCG) Dataset
6
+ 2. Semantic Segmentation Datasets
7
+ 3. Referring Expression Datasets (Expression Comprehension)
8
+ 4. Region-level Captioning Datasets (Expression Generation)
9
+ 5. Image Captioning
10
+ 6. Visual Question Answering
11
+ 7. GranD pretraining Datasets
12
+
13
+ Overall, they must be arranged in the following format:
14
+ ```
15
+ ├── GranDf
16
+ │ ├── annotations
17
+ │ │ ├── train
18
+ │ │ │ ├── GranDf_HA_GCG_train.json
19
+ │ │ │ ├── OpenPsgGCG_train.json
20
+ │ │ │ ├── OpenPsgGCG_val.json
21
+ │ │ │ ├── RefCOCOg_GCG_train.json
22
+ │ │ │ ├── RefCOCOg_GCG_val.json
23
+ │ │ │ ├── flickr_mergedGT_GCG_train.json
24
+ │ │ │ ├── flickr_mergedGT_GCG_val.json
25
+ │ │ ├── val_test
26
+ │ │ │ ├── test_gcg_coco_caption_gt.json
27
+ │ │ │ ├── test_gcg_coco_mask_gt.json
28
+ │ │ │ ├── val_gcg_coco_caption_gt.json
29
+ │ │ │ ├── val_gcg_coco_mask_gt.json
30
+ ├── GranDf_HA_images
31
+ │ ├── train
32
+ │ │ ├── sa_10010541.jpg
33
+ │ │ ├── sa_10014079.jpg
34
+ │ ├── val_test
35
+ │ │ ├── sa_10010541.jpg
36
+ │ │ ├── sa_10014079.jpg
37
+
38
+ ├── Semantic_Segm
39
+ │ ├── ade20k
40
+ │ │ ├── annotations
41
+ │ │ │ ├── training
42
+ │ │ │ │ ├── ADE_train_00000001.png
43
+ │ │ │ │ ├── ADE_train_00000002.png
44
+ │ │ ├── images
45
+ │ │ │ ├── training
46
+ │ │ │ │ ├── ADE_train_00000001.jpg
47
+ │ │ │ │ ├── ADE_train_00000002.jpg
48
+ ├── coco_stuff
49
+ │ │ ├── train2017
50
+ │ │ │ ├── 000000000009.png
51
+ │ │ │ ├── 000000000025.png
52
+ ├── mapillary
53
+ │ │ ├── config_v2.0.json
54
+ │ │ ├── training
55
+ │ │ │ ├── v2.0
56
+ │ │ │ │ ├── labels
57
+ │ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.png
58
+ │ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.png
59
+ │ │ │ ├── images
60
+ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.jpg
61
+ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.jpg
62
+ ├── paco_lvis
63
+ │ │ ├── annotations
64
+ │ │ │ ├── paco_lvis_v1_train.json
65
+ ├── pascal_part
66
+ │ │ ├── train.json
67
+ │ │ ├── VOCdevkit
68
+ │ │ │ │ ├── VOC2010
69
+ │ │ │ │ │ ├── JPEGImages
70
+ │ │ │ │ │ │ ├── 2007_000027.jpg
71
+ │ │ │ │ │ │ ├── 2007_000032.jpg
72
+
73
+ ├── Refer_Segm
74
+ │ ├── refcoco
75
+ │ ├── refcoco+
76
+ │ ├── refcocog
77
+ │ ├── refclef
78
+ │ ├── images
79
+ │ │ ├── saiapr_tc-12
80
+ │ │ │ ├── 00
81
+ │ │ │ ├── 01
82
+
83
+ ├── RefCoco_Reg
84
+ │ ├── mdetr_annotations
85
+ │ │ ├── finetune_refcoco_train.json
86
+ │ │ ├── finetune_refcocog_train.json
87
+ │ │ ├── finetune_refcocog_val.json
88
+ │ │ ├── finetune_refcoco+_train.json
89
+ │ │ ├── final_flickr_mergedGT_train.json
90
+ ├── visual_genome
91
+ │ │ ├── test_caption.json
92
+ │ │ ├── train.json
93
+ │ │ ├── images
94
+ │ │ │ ├── 1000.jpg
95
+ │ │ │ ├── 1001.jpg
96
+
97
+ ├── llava_dataset
98
+ │ ├── llava_instruct_150k.json
99
+
100
+ ├── coco_2017
101
+ │ ├── train2017
102
+ │ │ ├── 000000000009.jpg
103
+ │ │ ├── 000000000025.jpg
104
+ │ ├── annotations
105
+ │ │ ├── captions_train2017.json
106
+ │ │ ├── captions_val2017.json
107
+
108
+ ├── coco_2014
109
+ │ ├── train2014
110
+ │ │ ├── COCO_train2014_000000000009.jpg
111
+ │ │ ├── COCO_train2014_000000000025.jpg
112
+
113
+ ├── flikcr_30k
114
+ │ ├── train
115
+ │ │ ├── 1000092795.jpg
116
+ │ │ ├── 10002456.jpg
117
+ ```
118
+
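+ A quick way to sanity-check this layout before training is to verify that the expected top-level folders exist. A minimal sketch follows; it assumes the data root is `./data/` (as used by the training commands), so adjust the path if yours differs:
+
+ ```bash
+ # Report any missing top-level dataset folders under the data root.
+ for d in GranDf GranDf_HA_images Semantic_Segm Refer_Segm RefCoco_Reg visual_genome \
+          llava_dataset coco_2017 coco_2014 flikcr_30k; do
+     [ -d "./data/$d" ] || echo "missing: ./data/$d"
+ done
+ ```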
119
+ ### 1) GranD-f Grounded Conversation Generation (GCG) Dataset
120
+ The [GranD-f](https://grounding-anything.com/GranD-f) collection comprises four datasets: one high-quality human-annotated set proposed in our GLaMM paper, and three other datasets repurposed for the GCG task.
121
+
122
+ Download links and structure:
123
+ - Annotations: [MBZUAI/GranD-f](https://huggingface.co/datasets/MBZUAI/GranD-f)
124
+ - Images: `GranDf_HA_images` [Download](https://drive.google.com/file/d/1abdxVhrbNQhjJQ8eAcuPrOUBzhGaFsF_/view?usp=drive_link)
125
+ - Other necessary datasets:
126
+ - Open-PSG GCG: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
127
+ - RefCOCO-g GCG: `coco_2014` - COCO-2014 ([train2014](http://images.cocodataset.org/zips/train2014.zip))
128
+ - Flickr-30k GCG: `flikcr_30k` (train) - Download the train images from the [Flickr30K webpage](https://shannon.cs.illinois.edu/DenotationGraph/) or use the following [link](https://drive.google.com/file/d/1iomUn-Ht0OBfieMuyoVqEFj5PEmXfQ0U/view?usp=drive_link).
129
+
130
+ ```
131
+ ├── GranDf
132
+ │ ├── annotations
133
+ │ │ ├── train
134
+ │ │ │ ├── GranDf_HA_GCG_train.json
135
+ │ │ │ ├── OpenPsgGCG_train.json
136
+ │ │ │ ├── OpenPsgGCG_val.json
137
+ │ │ │ ├── RefCOCOg_GCG_train.json
138
+ │ │ │ ├── RefCOCOg_GCG_val.json
139
+ │ │ │ ├── flickr_mergedGT_GCG_train.json
140
+ │ │ │ ├── flickr_mergedGT_GCG_val.json
141
+ │ │ ├── val_test
142
+ │ │ │ ├── test_gcg_coco_caption_gt.json
143
+ │ │ │ ├── test_gcg_coco_mask_gt.json
144
+ │ │ │ ├── val_gcg_coco_caption_gt.json
145
+ │ │ │ ├── val_gcg_coco_mask_gt.json
146
+ ├── GranDf_HA_images
147
+ │ ├── train
148
+ │ │ ├── sa_10010541.jpg
149
+ │ │ ├── sa_10014079.jpg
150
+ │ ├── val_test
151
+ │ │ ├── sa_10010541.jpg
152
+ │ │ ├── sa_10014079.jpg
153
+ ├── coco_2017
154
+ │ ├── train2017
155
+ │ │ ├── 000000000009.jpg
156
+ │ │ ├── 000000000025.jpg
157
+ ├── coco_2014
158
+ │ ├── train2014
159
+ │ │ ├── COCO_train2014_000000000009.jpg
160
+ │ │ ├── COCO_train2014_000000000025.jpg
161
+ ├── flikcr_30k
162
+ │ ├── train
163
+ │ │ ├── 1000092795.jpg
164
+ │ │ ├── 10002456.jpg
165
+ ```
166
+
167
+ ### 2) Semantic Segmentation Datasets
168
+ For semantic segmentation, we use five open-source datasets providing segmentation masks and semantic class labels: ADE20K, COCO-Stuff, PASCAL-Part, PACO-LVIS, and Mapillary.
169
+
170
+ Download links and structure:
171
+ - [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip)
172
+ - [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip)
173
+ - [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part)
174
+ - [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup)
175
+ - [Mapillary](https://www.mapillary.com/dataset/vistas)
176
+ - COCO images: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
177
+
178
+ Download and arrange as shown in the directory structure below.
179
+
180
+ ```
181
+ ├── Semantic_Segm
182
+ │ ├── ade20k
183
+ │ │ ├── annotations
184
+ │ │ │ ├── training
185
+ │ │ │ │ ├── ADE_train_00000001.png
186
+ │ │ │ │ ├── ADE_train_00000002.png
187
+ │ │ ├── images
188
+ │ │ │ ├── training
189
+ │ │ │ │ ├── ADE_train_00000001.jpg
190
+ │ │ │ │ ├── ADE_train_00000002.jpg
191
+ ├── coco_stuff
192
+ │ │ ├── train2017
193
+ │ │ │ ├── 000000000009.png
194
+ │ │ │ ├── 000000000025.png
195
+ ├── mapillary
196
+ │ │ ├── config_v2.0.json
197
+ │ │ ├── training
198
+ │ │ │ ├── v2.0
199
+ │ │ │ │ ├── labels
200
+ │ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.png
201
+ │ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.png
202
+ │ │ │ ├── images
203
+ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.jpg
204
+ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.jpg
205
+ ├── paco_lvis
206
+ │ │ ├── annotations
207
+ │ │ │ ├── paco_lvis_v1_train.json
208
+ ├── pascal_part
209
+ │ │ ├── train.json
210
+ │ │ ├── VOCdevkit
211
+ │ │ │ │ ├── VOC2010
212
+ │ │ │ │ │ ├── JPEGImages
213
+ │ │ │ │ │ │ ├── 2007_000027.jpg
214
+ │ │ │ │ │ │ ├── 2007_000032.jpg
215
+ ├── coco_2017
216
+ │ ├── train2017
217
+ │ │ ├── 000000000009.jpg
218
+ │ │ ├── 000000000025.jpg
219
+ ```
220
+
221
+ ### 3) Referring Expression Datasets
222
+ For referring expression segmentation, we use the referring expression comprehension datasets RefCOCO, RefCOCO+, RefCOCOg, and RefCLEF.
223
+
224
+ Download links and structure:
225
+ - [RefCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip)
226
+ - [RefCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip)
227
+ - [RefCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip)
228
+ - [RefCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip)
229
+ - RefCOCO images: `coco_2014` - COCO-2014 ([train2014](http://images.cocodataset.org/zips/train2014.zip))
230
+ - For RefCLEF, you also need the [saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip) images
231
+
232
+ Download the data from the source links, and arrange as follows:
233
+
234
+ ```
235
+ ├── Refer_Segm
236
+ │ ├── refcoco
237
+ │ ├── refcoco+
238
+ │ ├── refcocog
239
+ │ ├── refclef
240
+ │ ├── images
241
+ │ │ ├── saiapr_tc-12
242
+ │ │ │ ├── 00
243
+ │ │ │ ├── 01
244
+ ├── coco_2014
245
+ │ ├── train2014
246
+ │ │ ├── COCO_train2014_000000000009.jpg
247
+ │ │ ├── COCO_train2014_000000000025.jpg
248
+ ```
249
+
250
+ ### 4) Region-level Captioning Datasets (Expression Generation)
251
+ For region-level captioning, we use five open-source datasets with region (bbox) grounding: RefCOCO, RefCOCOg, RefCOCO+, Visual Genome (v1.2), and Flickr30K.
252
+
253
+ Download links and structure:
254
+ - Annotations - mdetr_annotations: [Download](https://drive.google.com/file/d/1gvH5ToNtmIr3qz7C9lNi_fDmElwAANsI/view?usp=drive_link)
255
+ - Visual Genome: [train.json](https://datarelease.blob.core.windows.net/grit/VG_preprocessed_annotations/train.json), [test_caption.json](https://drive.google.com/file/d/1zF3UGHU1rvgTujinqJ-hZtrCBVsfsuel/view?usp=sharing), [images](https://nlp.stanford.edu/data/gqa/images.zip)
256
+ - Flickr30k: Download the train images from the [Flickr30K webpage](https://shannon.cs.illinois.edu/DenotationGraph/) or use the following [link](https://drive.google.com/file/d/1iomUn-Ht0OBfieMuyoVqEFj5PEmXfQ0U/view?usp=drive_link).
257
+ - RefCOCO images: `coco_2014` - COCO-2014 ([train2014](http://images.cocodataset.org/zips/train2014.zip))
258
+ Download the data from the source links, and arrange as follows:
259
+
260
+ ```
261
+ ├── RefCoco_Reg
262
+ │ ├── mdetr_annotations
263
+ │ │ ├── finetune_refcoco_train.json
264
+ │ │ ├── finetune_refcocog_train.json
265
+ │ │ ├── finetune_refcocog_val.json
266
+ │ │ ├── finetune_refcoco+_train.json
267
+ │ │ ├── final_flickr_mergedGT_train.json
268
+ ├── visual_genome
269
+ │ │ ├── test_caption.json
270
+ │ │ ├── train.json
271
+ │ │ ├── images
272
+ │ │ │ ├── 1000.jpg
273
+ │ │ │ ├── 1001.jpg
274
+ ├── flikcr_30k
275
+ │ ├── train
276
+ │ │ ├── 1000092795.jpg
277
+ │ │ ├── 10002456.jpg
278
+ ├── coco_2014
279
+ │ ├── train2014
280
+ │ │ ├── COCO_train2014_000000000009.jpg
281
+ │ │ ├── COCO_train2014_000000000025.jpg
282
+ ```
283
+
284
+ ### 5) Image Captioning
285
+ We use the COCO caption dataset.
286
+
287
+ Download links and structure:
288
+ - Annotations - [COCO - 2017 annotations](http://images.cocodataset.org/annotations/annotations_trainval2017.zip)
289
+ - Images: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
290
+
291
+ Structure as shown in the directory structure above.
292
+
293
+ ```
294
+ ├── coco_2017
295
+ │ ├── train2017
296
+ │ │ ├── 000000000009.jpg
297
+ │ │ ├── 000000000025.jpg
298
+ │ ├── annotations
299
+ │ │ ├── captions_train2017.json
300
+ │ │ ├── captions_val2017.json
301
+ ```
302
+
303
+ ### 6) Visual Question Answering
304
+ We use the LLaVA-instruct-150k set for visual question answering. Download and arrange as detailed below.
305
+
306
+ Download links and structure:
307
+ - Annotations - [LLaVA-instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)
308
+ - Images: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
309
+
310
+ ```
311
+ ├── llava_dataset
312
+ │ ├── llava_instruct_150k.json
313
+ ├── coco_2017
314
+ │ ├── train2017
315
+ ```
316
+
317
+ ### 7) GranD pretraining Datasets
318
+
319
+ Depending on the task, we convert the GranD dataset into multiple LMDB-format annotation sets for pretraining. For details on how to prepare the annotations, please refer to: [Pretraining Annotations from GranD](../docs/GranD.md#preparing-the-pretraining-annotations-from-grand-).
320
+
321
+ - For image-level captioning:
322
+ - Short Captioning: [GrandShortCaptionDataset](../dataset/caption_datasets/GranD_ShortCaption_ds.py)
323
+ - For referring expression generation and referring expression segmentation:
324
+ - Region-level captioning (referring expression generation): [GrandReferRegDataset](../dataset/region_datasets/GranD_ReferringRegion_ds.py)
325
+ - Referring expression segmentation: [GrandReferSegmDataset](../dataset/segm_datasets/GranD_ReferringSegm_ds.py)
326
+
327
+
groundingLMM/docs/evaluation.md ADDED
@@ -0,0 +1,75 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Evaluating GLaMM 🔍
2
+ This guide provides instructions on evaluating the pretrained GLaMM models on downstream tasks, including Grounded Conversation Generation (GCG), referring expression segmentation, and region-level captioning.
3
+
4
+
5
+ ### 1) Grounded Conversation Generation (GCG) 🗨️
6
+ Run the following command to evaluate the GLaMM model on the GCG task:
7
+
8
+ ```bash
9
+ bash eval/gcg/run_evaluation.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
10
+ ```
11
+
12
+ <p align="center">
13
+ <img src="../images/tables/GCG_Table.png" alt="GCG_Table">
14
+ </p>
15
+
16
+
17
+ To evaluate the provided finetuned GCG model, run:
18
+
19
+ ```bash
20
+ bash eval/gcg/run_evaluation.sh 'MBZUAI/GLaMM-GCG' './results_gcg_finetuned'
21
+ ```
22
+ This will automatically download the `MBZUAI/GLaMM-GCG` from HuggingFace.
23
+
24
+
25
+ ### 2) Referring Expression Segmentation 🎯
26
+ Run the following command to evaluate the GLaMM model on the referring expression segmentation task:
27
+
28
+ ```bash
29
+ bash eval/referring_seg/run_evaluation.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
30
+ ```
31
+
32
+ To evaluate the provided finetuned RefSeg model, run:
33
+
34
+ ```bash
35
+ bash eval/referring_seg/run_evaluation.sh 'MBZUAI/GLaMM-RefSeg' './results_refseg_finetuned'
36
+ ```
37
+ This will automatically download the `MBZUAI/GLaMM-RefSeg` from HuggingFace.
38
+
39
+
40
+ <p align="center">
41
+ <img src="../images/tables/ReferSeg_Table.png" alt="Table_RefSeg">
42
+ </p>
43
+
44
+
45
+ ### 3) Region-level Captioning 🖼️
46
+ Run the following command to evaluate the GLaMM model on the region-level captioning task:
47
+
48
+ #### RefCOCOg
49
+ ```bash
50
+ bash eval/region_captioning/run_evaluation_RefCOCOg.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
51
+ ```
52
+
53
+ To evaluate the provided finetuned RefCOCOg model, run:
54
+
55
+ ```bash
56
+ bash eval/region_captioning/run_evaluation_RefCOCOg.sh 'MBZUAI/GLaMM-RegCap-RefCOCOg' './results_regcap_refcocog_finetuned'
57
+ ```
58
+ This will automatically download the `MBZUAI/GLaMM-RegCap-RefCOCOg` from HuggingFace.
59
+
60
+
61
+ #### Visual Genome
62
+ ```bash
63
+ bash eval/region_captioning/run_evaluation_VG.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
64
+ ```
65
+
66
+ To evaluate the provided finetuned VG model, run:
67
+
68
+ ```bash
69
+ bash eval/region_captioning/run_evaluation_VG.sh 'MBZUAI/GLaMM-RegCap-VG' './results_regcap_vg_finetuned'
70
+ ```
71
+ This will automatically download the `MBZUAI/GLaMM-RegCap-VG` from HuggingFace.
72
+
73
+ <p align="center">
74
+ <img src="../images/tables/Region_Cap_Table.png" alt="Table_RegionCap">
75
+ </p>
groundingLMM/docs/install.md ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Installation 🛠️
2
+ We recommend setting up a conda environment for the project:
3
+
4
+ ```bash
5
+ conda create --name=glamm python=3.10
6
+ conda activate glamm
7
+
8
+ git clone https://github.com/mbzuai-oryx/groundingLMM.git
9
+ cd groundingLMM
10
+ pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
11
+ pip install -r requirements.txt
12
+
13
+ # Install mmcv
14
+ git clone https://github.com/open-mmlab/mmcv
15
+ cd mmcv
16
+ git checkout v1.4.7
17
+ MMCV_WITH_OPS=1 pip install -e .
18
+
19
+ export PYTHONPATH="./:$PYTHONPATH"
20
+ ```
21
+
22
+ Alternatively, we also provide the conda environment contents in a `.zip` file. Follow the steps below to set it up:
23
+
24
+ 1. Download `glamm_conda_env.zip` from the [google_drive link](https://drive.google.com/file/d/1BN10oChcoKDDd0zC8tU88JcrfmLpKpkB/view?usp=sharing).
25
+ 2. Extract the downloaded `zip` file:
26
+ ```bash
27
+ unzip glamm_conda_env.zip
28
+ ```
29
+ 3. Activate the environment:
30
+ ```bash
31
+ conda activate glamm
32
+ ```
33
+
34
+
groundingLMM/docs/model_zoo.md ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GLaMM Model Zoo 🚀
2
+
3
+ Welcome to the GLaMM Model Zoo! This repository contains a collection of state-of-the-art models from the GLaMM (Pixel Grounding Large Multimodal Model) family. Each model is designed for specific tasks in the realm of multimodal learning, combining visual and textual data processing.
4
+
5
+ ## Models Overview
6
+
7
+ The following table provides an overview of the available models in our zoo. For each model, you can find links to its Hugging Face page.
8
+
9
+ - To evaluate the pretrained models, please follow the instructions at [evaluation.md](evaluation.md).
10
+ - To run offline demo, please follow the instructions at [offline_demo.md](offline_demo.md).
11
+
12
+ | Model Name | Hugging Face Link | Summary |
13
+ |----------------------|-----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
14
+ | GLaMM-GranD-Pretrained | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-GranD-Pretrained) | Pretrained on GranD dataset. |
15
+ | GLaMM-FullScope | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-FullScope) | Model recommended for offline demo. |
16
+ | GLaMM-GCG | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-GCG) | Finetuned on GranD-f dataset for GCG task. |
17
+ | GLaMM-RefSeg | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-RefSeg) | Finetuned on RefCOCO, RefCOCO+ and RefCOCOg datasets for referring expression segmentation task. |
18
+ | GLaMM-RegCap-RefCOCOg | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-RegCap-RefCOCOg) | Finetuned on RefCOCOg for region captioning task. |
19
+ | GLaMM-RegCap-VG | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-RegCap-VG) | Finetuned on Visual Genome dataset for region captioning task. |
20
+
21
+ Note that all models are finetuned on `GLaMM-GranD-Pretrained`.
groundingLMM/docs/offline_demo.md ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GLaMM Demo Installation and Usage Guide 🚀
2
+
3
+ Welcome to the GLaMM Demo! This guide will walk you through the process of setting up and running the GLaMM Demo on your local GPU machine. Please ensure that your system meets the necessary requirements before proceeding.
4
+
5
+ ## System Requirements
6
+ - GPU with at least 24 GB of memory
7
+ - Git and Git LFS installed
8
+ - [GLaMM environment](../docs/install.md)
9
+ - Install [gradio-box](https://github.com/ShoufaChen/gradio-box?tab=readme-ov-file#3-install-gradio): Follow the instructions below to install Gradio-Box.
10
+ ```bash
11
+ git clone https://github.com/ShoufaChen/gradio-dev.git
12
+ cd gradio-dev
13
+ bash scripts/build_frontend.sh
14
+ pip install -e .
15
+ ```
16
+ - Version Requirements: Your installation should include the following specific versions:
17
+ - Gradio version 3.35.2
18
+ - Gradio-Client version 0.2.7
19
+ ## Installation Steps
20
+
21
+ ### 1. Clone the GLaMM Repository
22
+ First, you need to clone the GLaMM repository from GitHub. Open your terminal and run the following command:
23
+
24
+ ```bash
25
+ git clone https://github.com/mbzuai-oryx/groundingLMM.git
26
+ ```
27
+
28
+ ### 2. Download GLaMM Weights
29
+ To download the GLaMM model weights, you will need Git LFS. If you haven't installed Git LFS, you can do so by running:
30
+
31
+ ```bash
32
+ git lfs install
33
+ ```
34
+ Once Git LFS is installed, proceed to clone the GLaMM FullScope model:
35
+
36
+ ```bash
37
+ git clone https://huggingface.co/MBZUAI/GLaMM-FullScope
38
+ ```
39
+
40
+ For more information on the GLaMM FullScope model, visit [this link](https://huggingface.co/MBZUAI/GLaMM-FullScope).
41
+
42
+
43
+ ### 3. Run the Demo
44
+
45
+ Navigate to the directory where the repository was cloned and run the demo using Python. Replace `path/to/GLaMM_FullScope_model` with the actual path to the downloaded GLaMM FullScope model:
46
+ ```bash
47
+ python app.py --version "path/to/GLaMM_FullScope_model"
48
+
49
+ ```
50
+
51
+ Once the demo is running, follow the on-screen instructions to open the demo dashboard in your web browser. The dashboard provides a user-friendly interface for interacting with the GLaMM model.
groundingLMM/docs/training.md ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Training GLaMM 🚀
2
+ GLaMM is pre-trained on the GranD dataset and then fine-tuned on multiple downstream tasks including Grounded Conversation Generation (GCG), referring expression segmentation, region-level captioning, and image-level captioning using OpenSource datasets.
3
+
4
+ ## Downstream Task-Specific Training 🛠️
5
+
6
+ This section explains how to perform downstream fine-tuning using the pretrained GLaMM model checkpoints.
7
+
8
+ ### Preparing the OpenSource Datasets 📂
9
+
10
+ Refer to the [datasets readme](../docs/datasets.md) for details on organizing the data.
11
+
12
+ Generic settings:
13
+ - Path to the GLaMM GranD pretrained Hugging Face model: `PRETRAINED_HF_PATH=MBZUAI/GLaMM-GranD-Pretrained`
14
+ - Path to the Grounding Image Encoder Checkpoints (SAM pretrained weights): `GROUNDING_ENC_CKPT_PATH=./checkpoints/sam_vit_h_4b8939.pth`
15
+
16
+ ### 1) Grounded Conversation Generation (GCG) 🗨️
17
+
18
+ For GCG, the model is fine-tuned on two types of datasets: (i) GranD-f Dataset and (ii) Semantic Segmentation Datasets.
19
+ - [GranD-f datasets](../docs/datasets.md#1-grand-f-grounded-conversation-generation-gcg-dataset): RefCoco_GCG, PSG_GCG, Flickr_GCG, GranDf_GCG
20
+ - [Semantic Segmentation Datasets](../docs/datasets.md#2-semantic-segmentation-datasets): ade20k, cocostuff, pascal_part, paco_lvis, mapillary
21
+
22
+ ```bash
23
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_segm_data --seg_dataset "Semantic_Segm||RefCoco_GCG||PSG_GCG||Flickr_GCG||GranDf_GCG" --segm_sample_rates "1,3,3,3,1" --val_dataset "FlickrGCGVal|RefCocoGCGVal|PsgGCGVal" --epochs 10 --steps_per_epoch 500 --mask_validation
24
+ ```
25
+
26
+ ### 2) Region-level Captioning 🖼️
27
+
28
+ For region-level captioning, the model is fine-tuned on specific datasets:
29
+ - [Region-level Captioning Dataset](../docs/datasets.md#4-region-level-captioning-datasets-expression-generation): RefCocoG_Reg, VisGenomeRegVal
30
+
31
+ For RefCOCOg:
32
+ ```bash
33
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_reg_data --reg_dataset 'RefCocoG_Reg' --reg_sample_rates "1" --val_dataset 'RefCOCOgRegVal' --epochs 5 --steps_per_epoch 500
34
+ ```
35
+ For Visual Genome:
36
+ ```bash
37
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_reg_data --reg_dataset 'VisGen_Reg' --reg_sample_rates "1" --val_dataset 'VisGenomeRegVal' --epochs 5 --steps_per_epoch 500
38
+ ```
39
+
40
+ ### 3) Referring Expression Segmentation 🎯
41
+
42
+ For results on RefCOCO, RefCOCO+ and RefCOCOg datasets, the model is fine-tuned using the following datasets:
43
+ - [Referring Expression Dataset](../docs/datasets.md#3-referring-expression-datasets): refcoco, refcoco+, refcocog
44
+
45
+ ```bash
46
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_segm_data --seg_dataset "Refer_Segm" --segm_sample_rates "1" --refer_segm_data "refcoco||refcoco+||refcocog" --val_dataset "RefCOCOgSegVal" --epochs 5 --steps_per_epoch 350 --mask_validation
47
+ ```
48
+
49
+ ### 4) Finetuning on Combined Tasks 🌍
50
+ To enable combined capabilities in tasks like Grounded Conversation Generation (GCG), Image-level captioning, Visual-question answering, Region-level captioning, and Referring Expression Segmentation, finetune GLaMM using a mix of open-source datasets. This training replicates the model used in the demo.
51
+
52
+ Refer to [datasets readme](../docs/datasets.md) for data preparation details.
53
+
54
+ The `train.py` script is pre-configured with default argument values optimized for combined open-source training. However, for clarity and customization, we detail all essential arguments below:
55
+
56
+ ```bash
57
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_cap_data --use_reg_data --use_segm_data --cap_dataset "CocoCap||LLaVaInstruct" --cap_sample_rate "1,2" --reg_dataset "RefCoco_Reg||RefCocoG_Reg||RefCocoP_Reg||VisGen_Reg||Flickr_Reg" --reg_sample_rates "1,1,1,1,1" --seg_dataset "Semantic_Segm||Refer_Segm||RefCoco_GCG||PSG_GCG||Flickr_GCG||GranDf_GCG" --segm_sample_rates "4,3,2,2,2,1" --val_dataset "FlickrGCGVal|RefCocoGCGVal|PsgGCGVal" --epochs 10 --steps_per_epoch 500
58
+ ```
59
+
60
+ ### Merge LORA Weights
61
+ We use LORA finetuning for downstream tasks. Please follow the instructions below to merge LORA weights after training.
62
+
63
+ After training, the saved checkpoints directory will look like:
64
+ ```
65
+ ├── global_step5000
66
+ │ ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
67
+ │ ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
68
+ │ ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
69
+ │ ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
70
+ │ ├── bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
71
+ │ ├── bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
72
+ │ ├── bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
73
+ │ ├── bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
74
+ ```
75
+ Run the following command to merge the LORA weights:
76
+
77
+ ```bash
78
+ python zero_to_fp32.py . ./pytorch_model.bin
79
+
80
+ # From the root directory
81
+ export PYTHONPATH="./:$PYTHONPATH"
82
+ python scripts/merge_lora_weights.py --version 'MBZUAI/GLaMM-GranD-Pretrained' --weight 'path/to/pytorch_model.bin' --save_path 'path/to/save/the/merged/model/in/HF/format'
83
+ ```
groundingLMM/eval/region_captioning/evaluate.py ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+ import argparse
4
+ from pycocotools.coco import COCO
5
+ from pycocoevalcap.eval import COCOEvalCap
6
+
7
+
8
+ def parse_args():
9
+ parser = argparse.ArgumentParser(description="GLaMM Inference - Region Captioning")
10
+
11
+ parser.add_argument("--annotation_file",
12
+ default="data/RefCoco_Reg/mdetr_annotations/finetune_refcocog_val_captions.json", type=str,
13
+ help="Replace with 'data/visual_genome/test_caption.json' for VG.")
14
+ parser.add_argument("--results_dir", default="results", type=str, help="The path to save the results.")
15
+
16
+ return parser.parse_args()
17
+
18
+
19
+ def main():
20
+ args = parse_args()
21
+
22
+ # Load the annotation file
23
+ coco = COCO(args.annotation_file)
24
+
25
+ # Merge and load the results files
26
+ all_results = []
27
+ for result_file in os.listdir(args.results_dir):
28
+ all_results += json.load(open(f"{args.results_dir}/{result_file}", "r"))
29
+ merged_file_path = f"{args.results_dir}/merged.json"
30
+ with open(merged_file_path, 'w') as f:
31
+ json.dump(all_results, f)
32
+ coco_result = coco.loadRes(merged_file_path)
33
+
34
+ # Create coco_eval object by taking coco and coco_result
35
+ coco_eval = COCOEvalCap(coco, coco_result)
36
+
37
+ # Evaluate results
38
+ coco_eval.params['image_id'] = coco_result.getImgIds()
39
+ coco_eval.evaluate()
40
+
41
+ # Print and save the output evaluation scores
42
+ output_file_path = f"{args.results_dir}/metrics.txt"
43
+ f = open(output_file_path, 'w')
44
+ for metric, score in coco_eval.eval.items():
45
+ print(f'{metric}: {score:.3f}')
46
+ f.write(f"{metric}: {score:.3f}\n")
47
+ f.close()
48
+
49
+
50
+ if __name__ == "__main__":
51
+ main()
groundingLMM/eval/region_captioning/infer.py ADDED
@@ -0,0 +1,188 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import re
2
+ import cv2
3
+ import json
4
+ import argparse
5
+ from tqdm import tqdm
6
+ from transformers import AutoTokenizer, CLIPImageProcessor
7
+ from torch.utils.data import DataLoader, DistributedSampler
8
+
9
+ from eval.utils import *
10
+ from eval.ddp import *
11
+ from model.GLaMM import GLaMMForCausalLM
12
+ from model.llava import conversation as conversation_lib
13
+ from model.llava.mm_utils import tokenizer_image_token
14
+ from model.SAM.utils.transforms import ResizeLongestSide
15
+ from tools.utils import DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
16
+
17
+
18
+ def parse_args():
19
+ parser = argparse.ArgumentParser(description="GLaMM Inference - Region Captioning")
20
+
21
+ parser.add_argument("--hf_model_path", required=True, help="The model path in huggingface format.")
22
+ parser.add_argument("--annotation_file",
23
+ default="data/RefCoco_Reg/mdetr_annotations/finetune_refcocog_val_captions.json", type=str,
24
+ help="Replace with 'data/visual_genome/test_caption.json' for VG.")
25
+ parser.add_argument("--image_dir", default="data/coco_2014/train2014", type=str,
26
+ help="Replace with 'data/visual_genome/images' for VG")
27
+ parser.add_argument("--dataset", default="refcocog", type=str, help="Options are 'refcocog', 'vg'")
28
+ parser.add_argument("--results_dir", default="results", type=str, help="The path to save the results.")
29
+
30
+
31
+ parser.add_argument("--image_size", default=1024, type=int, help="image size")
32
+ parser.add_argument("--model_max_length", default=512, type=int)
33
+ parser.add_argument("--use_mm_start_end", action="store_true", default=True)
34
+ parser.add_argument("--conv_type", default="llava_v1", type=str, choices=["llava_v1", "llava_llama_2"], )
35
+
36
+ # DDP Related parameters
37
+ parser.add_argument("--batch_size_per_gpu", required=False, default=1)
38
+ parser.add_argument('--world_size', default=1, type=int, help='number of distributed processes')
39
+ parser.add_argument('--local_rank', default=-1, type=int)
40
+ parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
41
+
42
+ return parser.parse_args()
43
+
44
+
45
+ def inference(instructions, inputs):
46
+ # Extract the inputs
47
+ bbox_img = inputs['boxes']
48
+ image_path = inputs['image']
49
+
50
+ instructions = instructions.replace('&lt;', '<').replace('&gt;', '>')
51
+
52
+ # Prepare prompt for model Inference
53
+ conv = conversation_lib.conv_templates[args.conv_type].copy()
54
+ conv.messages = []
55
+ begin_str = f"""The {DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n"""
56
+ prompt = begin_str + instructions
57
+ if args.use_mm_start_end:
58
+ replace_token = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN)
59
+ prompt = prompt.replace(DEFAULT_IMAGE_TOKEN, replace_token)
60
+ conv.append_message(conv.roles[0], prompt)
61
+ conv.append_message(conv.roles[1], "")
62
+ prompt = conv.get_prompt()
63
+
64
+ # Read and preprocess the image (Global image encoder - CLIP)
65
+ image_np = cv2.imread(image_path)
66
+ image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
67
+ original_size_list = [image_np.shape[:2]]
68
+ image_clip = (clip_image_processor.preprocess(image_np, return_tensors="pt")["pixel_values"][0].unsqueeze(0).cuda())
69
+ image_clip = image_clip.bfloat16() # Precision is bf16 by default
70
+
71
+ # Preprocess the image (Grounding image encoder)
72
+ image = transform.apply_image(image_np)
73
+ resize_list = [image.shape[:2]]
74
+ image = (
75
+ grounding_image_ecoder_preprocess(torch.from_numpy(image).permute(2, 0, 1).contiguous()).unsqueeze(0).cuda())
76
+ image = image.bfloat16() # Precision is bf16 by default
77
+
78
+ # Prepare inputs for inference
79
+ input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors="pt")
80
+ input_ids = input_ids.unsqueeze(0).cuda()
81
+ bboxes = None
82
+ if len(bbox_img) > 0:
83
+ height, width = original_size_list[0] # Original Image Dimensions
84
+
85
+ # Rescaling BBox to 336*336
86
+ x_scale, y_scale = 336 / width, 336 / height
87
+ bboxes_scaled = [[bbox[0] * x_scale, bbox[1] * y_scale,
88
+ bbox[2] * x_scale, bbox[3] * y_scale] for bbox in bbox_img]
89
+ ori_bboxes = np.array(bboxes_scaled, dtype=np.float64)
90
+ height_sc, width_sc = (336, 336) # To normalize the Image
91
+ norm_bboxes = ori_bboxes / np.array([width_sc, height_sc, width_sc, height_sc])
92
+ bboxes = [torch.tensor(norm_bboxes).cuda().half().to(torch.bfloat16)]
93
+
94
+ # Generate output
95
+ output_ids, pred_masks = model.evaluate(image_clip, image, input_ids, resize_list, original_size_list,
96
+ max_tokens_new=512, bboxes=bboxes)
97
+ output_ids = output_ids[0][output_ids[0] != IMAGE_TOKEN_INDEX]
98
+
99
+ # Post-processing
100
+ text_output = tokenizer.decode(output_ids, skip_special_tokens=False)
101
+ text_output = text_output.replace("\n", "").replace("  ", " ")  # drop newlines and collapse double spaces
102
+ text_output = text_output.split("ASSISTANT: ")[-1]
103
+
104
+ cleaned_str = re.sub(r'<.*?>', '', text_output)
105
+
106
+ # Remove the [SEG] token
107
+ cleaned_str = cleaned_str.replace('[SEG]', '')
108
+
109
+ # Strip unnecessary spaces
110
+ cleaned_str = ' '.join(cleaned_str.split()).strip("'")
111
+ cleaned_str = cleaned_str.strip()
112
+
113
+ return cleaned_str
114
+
115
+
116
+ def custom_collate_fn(batch):
117
+ image_id = [item[0] for item in batch]
118
+ filename = [item[1] for item in batch]
119
+ bbox = [item[2] for item in batch]
120
+ gt = [item[3] for item in batch]
121
+
122
+ return image_id, filename, bbox, gt
123
+
124
+
125
+ if __name__ == "__main__":
126
+ args = parse_args()
127
+ init_distributed_mode(args)
128
+
129
+ # Initialize tokenizer and model
130
+ tokenizer = AutoTokenizer.from_pretrained(args.hf_model_path, cache_dir=None,
131
+ model_max_length=args.model_max_length, padding_side="right",
132
+ use_fast=False)
133
+ tokenizer.pad_token = tokenizer.unk_token
134
+ seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
135
+ torch_dtype = torch.bfloat16 # By default, using bf16
136
+ kwargs = {"torch_dtype": torch_dtype}
137
+ model = GLaMMForCausalLM.from_pretrained(args.hf_model_path, low_cpu_mem_usage=True,
138
+ seg_token_idx=seg_token_idx, **kwargs)
139
+ # Update model config
140
+ model.config.eos_token_id = tokenizer.eos_token_id
141
+ model.config.bos_token_id = tokenizer.bos_token_id
142
+ model.config.pad_token_id = tokenizer.pad_token_id
143
+
144
+ # Initialize Global Image Encoder (CLIP)
145
+ model.get_model().initialize_vision_modules(model.get_model().config)
146
+ vision_tower = model.get_model().get_vision_tower()
147
+ vision_tower.to(dtype=torch_dtype)
148
+
149
+ # Transfer the model to GPU
150
+ model = model.bfloat16().cuda() # Replace with model = model.float().cuda() for 32 bit inference
151
+ vision_tower = model.get_model().get_vision_tower()
152
+ vision_tower.to(device="cuda")
153
+
154
+ # Initialize Image Processor for Global Image Encoder (CLIP)
155
+ clip_image_processor = CLIPImageProcessor.from_pretrained(model.config.vision_tower)
156
+ transform = ResizeLongestSide(args.image_size)
157
+
158
+ model.eval() # Model should be in evaluation mode for inference
159
+
160
+ # Prompt the model to perform the region captioning task
161
+ instruction = "Can you provide me with a detailed description of the region in the picture marked by <bbox>?"
162
+
163
+ # Intermediate results path is hard-coded (you may change it as per your needs)
164
+ os.makedirs(args.results_dir, exist_ok=True)
165
+ results_path = f"{args.results_dir}/{os.path.basename(args.hf_model_path)}_{args.dataset}_{args.rank}.json"
166
+
167
+ # Create DDP Dataset
168
+ dataset = RegionCapDDP(args.annotation_file)
169
+ distributed_sampler = DistributedSampler(dataset, rank=args.rank, shuffle=False)
170
+ dataloader = DataLoader(dataset, batch_size=args.batch_size_per_gpu, num_workers=2,
171
+ sampler=distributed_sampler, collate_fn=custom_collate_fn)
172
+
173
+ # Iterate over all the samples, perform inference and save results
174
+ results = []
175
+ for idx, (image_id, filename, bbox, gt) in enumerate(tqdm(dataloader)):
176
+ image_id, filename, bbox, gt = image_id[0], filename[0], bbox[0], gt[0]
177
+ image_path = os.path.join(args.image_dir, filename)
178
+ inputs = {'image': image_path, 'boxes': [bbox]}
179
+
180
+ result_caption = inference(instruction, inputs) # Perform inference
181
+
182
+ result_dict = {}
183
+ result_dict["image_id"] = image_id
184
+ result_dict["caption"] = result_caption
185
+ results.append(result_dict)
186
+
187
+ with open(results_path, 'w') as json_file:
188
+ json.dump(results, json_file, indent=2)
groundingLMM/eval/region_captioning/run_evaluation_VG.sh ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/sh
2
+
3
+ ## USAGE
4
+
5
+ ## bash eval/region_captioning/run_evaluation_VG.sh <path to the HF checkpoint> <path to the directory to save the evaluation results>
6
+
7
+ ## USAGE
8
+
9
+
10
+ export PYTHONPATH="./:$PYTHONPATH"
11
+ MASTER_PORT=24999
12
+ NUM_GPUS=1 # Adjust as per the number of available GPUs
13
+
14
+ # Positional arguments for the bash scripts
15
+ CKPT_PATH=$1
16
+ RESULT_PATH=$2
17
+
18
+ # Adjust if needed
19
+ ANNOTATION_FILE=./data/visual_genome/test_caption.json
20
+ IMAGE_DIR=./data/visual_genome/images
21
+ DATASET=vg
22
+
23
+ # Run Inference
24
+ torchrun --nnodes=1 --nproc_per_node="$NUM_GPUS" --master_port="$MASTER_PORT" eval/region_captioning/infer.py --hf_model_path "$CKPT_PATH" --annotation_file "$ANNOTATION_FILE" --image_dir "$IMAGE_DIR" --dataset "$DATASET" --results_dir "$RESULT_PATH"
25
+
26
+
27
+ # Evaluate
28
+ python eval/region_captioning/evaluate.py --annotation_file "$ANNOTATION_FILE" --results_dir "$RESULT_PATH"
groundingLMM/gradio-dev/.dockerignore ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python build
2
+ .eggs/
3
+ gradio.egg-info/*
4
+ !gradio.egg-info/requires.txt
5
+ !gradio.egg-info/PKG-INFO
6
+ dist/
7
+ *.pyc
8
+ __pycache__/
9
+ *.py[cod]
10
+ *$py.class
11
+ build/
12
+
13
+ # JS build
14
+ gradio/templates/frontend/static
15
+ gradio/templates/frontend/cdn
16
+
17
+ # Secrets
18
+ .env
19
+
20
+ # Gradio run artifacts
21
+ *.db
22
+ *.sqlite3
23
+ gradio/launches.json
24
+
25
+ # Tests
26
+ .coverage
27
+ coverage.xml
28
+ test.txt
29
+
30
+ # Demos
31
+ demo/tmp.zip
32
+ demo/flagged
33
+ demo/files/*.avi
34
+ demo/files/*.mp4
35
+
36
+ # Etc
37
+ .idea/*
38
+ .DS_Store
39
+ *.bak
40
+ workspace.code-workspace
groundingLMM/gradio-dev/.editorconfig ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+
2
+ root = true
3
+
4
+ [{js/**,client/js/**}]
5
+ end_of_line = lf
6
+ insert_final_newline = true
7
+ indent_style = tab
8
+ tab_width = 2
groundingLMM/gradio-dev/.gitignore ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python build
2
+ .eggs/
3
+ gradio.egg-info
4
+ dist/
5
+ *.pyc
6
+ __pycache__/
7
+ *.py[cod]
8
+ *$py.class
9
+ build/
10
+ __tmp/*
11
+
12
+ # JS build
13
+ gradio/templates/cdn
14
+ gradio/templates/frontend
15
+
16
+ # Secrets
17
+ .env
18
+
19
+ # Gradio run artifacts
20
+ *.db
21
+ *.sqlite3
22
+ gradio/launches.json
23
+ flagged/
24
+ gradio_cached_examples/
25
+ tmp.zip
26
+
27
+ # Tests
28
+ .coverage
29
+ coverage.xml
30
+ test.txt
31
+ **/snapshots/**/*.png
32
+
33
+ # Demos
34
+ demo/tmp.zip
35
+ demo/files/*.avi
36
+ demo/files/*.mp4
37
+ demo/all_demos/demos/*
38
+ demo/all_demos/requirements.txt
39
+ demo/*/config.json
40
+
41
+ # Etc
42
+ .idea/*
43
+ .vscode/*
44
+ .DS_Store
45
+ *.bak
46
+ workspace.code-workspace
47
+ *.h5
48
+
49
+ # dev containers
50
+ .pnpm-store/
51
+
52
+ # log files
53
+ .pnpm-debug.log
54
+
55
+ # Local virtualenv for devs
56
+ .venv*
57
+
58
+ # FRP
59
+ gradio/frpc_*
60
+
61
+ # js
62
+ node_modules
63
+ public/build/
64
+ test-results
65
+ client/js/test.js
groundingLMM/gradio-dev/CHANGELOG.md ADDED
The diff for this file is too large to render. See raw diff
 
groundingLMM/gradio-dev/CITATION.cff ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ cff-version: 1.2.0
2
+ message: Please cite this project using these metadata.
3
+ title: "Gradio: Hassle-free sharing and testing of ML models in the wild"
4
+ abstract: >-
5
+ Accessibility is a major challenge of machine learning (ML).
6
+ Typical ML models are built by specialists and require
7
+ specialized hardware/software as well as ML experience to
8
+ validate. This makes it challenging for non-technical
9
+ collaborators and endpoint users (e.g. physicians) to easily
10
+ provide feedback on model development and to gain trust in
11
+ ML. The accessibility challenge also makes collaboration
12
+ more difficult and limits the ML researcher's exposure to
13
+ realistic data and scenarios that occur in the wild. To
14
+ improve accessibility and facilitate collaboration, we
15
+ developed an open-source Python package, Gradio, which
16
+ allows researchers to rapidly generate a visual interface
17
+ for their ML models. Gradio makes accessing any ML model as
18
+ easy as sharing a URL. Our development of Gradio is informed
19
+ by interviews with a number of machine learning researchers
20
+ who participate in interdisciplinary collaborations. Their
21
+ feedback identified that Gradio should support a variety of
22
+ interfaces and frameworks, allow for easy sharing of the
23
+ interface, allow for input manipulation and interactive
24
+ inference by the domain expert, as well as allow embedding
25
+ the interface in iPython notebooks. We developed these
26
+ features and carried out a case study to understand Gradio's
27
+ usefulness and usability in the setting of a machine
28
+ learning collaboration between a researcher and a
29
+ cardiologist.
30
+ authors:
31
+ - family-names: Abid
32
+ given-names: Abubakar
33
+ - family-names: Abdalla
34
+ given-names: Ali
35
+ - family-names: Abid
36
+ given-names: Ali
37
+ - family-names: Khan
38
+ given-names: Dawood
39
+ - family-names: Alfozan
40
+ given-names: Abdulrahman
41
+ - family-names: Zou
42
+ given-names: James
43
+ doi: 10.48550/arXiv.1906.02569
44
+ date-released: 2019-06-06
45
+ url: https://arxiv.org/abs/1906.02569
groundingLMM/gradio-dev/CONTRIBUTING.md ADDED
@@ -0,0 +1,138 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Contributing to Gradio
2
+
3
+ Prerequisites:
4
+
5
+ - [Python 3.8+](https://www.python.org/downloads/)
6
+ - [Node.js v16.14+](https://nodejs.dev/en/download/package-manager/) (only needed if you are making changes to the frontend)
7
+ - [pnpm 8.1+](https://pnpm.io/8.x/installation) (only needed if you are making changes to the frontend)
8
+
9
+ More than 80 awesome developers have contributed to the `gradio` library, and we'd be thrilled if you would like to be the next `gradio` contributor! Start by cloning this repo and installing Gradio locally:
10
+
11
+ ### Install Gradio locally from the `main` branch
12
+
13
+ - Clone this repo
14
+ - Navigate to the repo folder and run
15
+
16
+ ```bash
17
+ bash scripts/install_gradio.sh
18
+ ```
19
+
20
+ - Build the front end
21
+
22
+ ```
23
+ bash scripts/build_frontend.sh
24
+ ```
25
+
26
+ ### Install development requirements
27
+
28
+ In order to be able to run the Python linter, formatter, and unit tests, do the following:
29
+
30
+ - Navigate to the repo folder and install test requirements (note that it is highly recommended to use a virtual environment running **Python 3.9** since the versions are pinned)
31
+
32
+ ```
33
+ bash scripts/install_test_requirements.sh
34
+ ```
35
+
36
+ - If you have a different Python version and conflicting packages during the installation, please first run:
37
+
38
+ ```
39
+ bash scripts/create_test_requirements.sh
40
+ ```
41
+
42
+ ### Using dev containers
43
+
44
+ Instead of the above steps, you can alternatively use dev containers. This is supported on all platforms (macOS/Windows/Linux).
45
+
46
+ Prerequisites:
47
+
48
+ - An editor which supports dev containers, like VS Code
49
+ - Docker support on the host computer:
50
+ - macOS: [Docker Desktop 2.0+](https://www.docker.com/products/docker-desktop/)
51
+ - Windows: [Docker Desktop 2.0+](https://www.docker.com/products/docker-desktop/)
52
+ - Linux: [Docker CE/EE 18.06+](https://docs.docker.com/get-docker/) and [Docker Compose 1.21+](https://docs.docker.com/compose/install/)
53
+ - If using VS Code, the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension
54
+
55
+ Steps:
56
+
57
+ - Clone repository
58
+ - Open it in editor
59
+ - For VS Code, execute `Dev Containers: Reopen in container` command
60
+
61
+ For detailed instructions, please see the [Dev Containers tutorial](https://code.visualstudio.com/docs/devcontainers/tutorial).
62
+
63
+ ### Extra tidbits
64
+
65
+ - You can run gradio scripts in reload mode, which will watch for changes in the `gradio` folder and reload the app when changes are made.
66
+
67
+ ```
68
+ gradio app.py
69
+ ```
70
+
71
+ - To develop the frontend app, you should also follow [js/README.md](js/README.md).
72
+
73
+ - To run all of the tests, do:
74
+
75
+ ```
76
+ bash scripts/run_all_tests.sh
77
+ ```
78
+
79
+ ### Structure of the Repository
80
+
81
+ It's helpful to know the overall structure of the repository so that you can focus on the part of the source code you'd like to contribute to:
82
+
83
+ - `/gradio`: contains the Python source code for the library
84
+ - `/gradio/interface.py`: contains the Python source code for the core `Interface` class
85
+ - `/gradio/blocks.py`: contains the Python source code for the core `Blocks` class
86
+ - `/gradio/components.py`: contains the Python source code for the `components`; you can add your custom components here.
87
+ - `/js`: contains the HTML/JS/CSS source code for the library ([start here for frontend changes](/js/README.md))
88
+ - `/test`: contains Python unit tests for the library
89
+ - `/demo`: contains demos that are used in the documentation; you can find `Gradio` examples here.
90
+ - `/website`: contains the code for the Gradio website (www.gradio.app). See the README in the `/website` folder for more details
91
+
92
+ ### Continuous Integration and Testing
93
+
94
+ All PRs must pass the continuous integration tests before merging. To test locally, you can run `python -m unittest` from the repo directory.
95
+
96
+ ## Submitting PRs
97
+
98
+ All PRs should be against `main`. Direct commits to main are blocked, and PRs require an approving review to merge into main. By convention, the Gradio maintainers will review PRs when:
99
+
100
+ - An initial review has been requested, and
101
+ - A description of the change (with a link to the GitHub PR) has been added to CHANGELOG.md, and
102
+ - A maintainer (@abidlabs, @aliabid94, @aliabd, @AK391, @dawoodkhan82, @pngwn, @freddyaboulton) is tagged in the PR comments and asked to complete a review
103
+
104
+ We ask that you make sure initial CI checks are passing before requesting a review. One of the Gradio maintainers will merge the PR when all the checks are passing.
105
+
106
+ Do not forget to format the backend and frontend before pushing:
107
+
108
+ ```
109
+ bash scripts/format_backend.sh
110
+ ```
111
+
112
+ ```
113
+ bash scripts/format_frontend.sh
114
+ ```
115
+
116
+ ## CI checks
117
+
118
+ Currently the following checks are run in CI:
119
+
120
+ ### Gradio library (`gradio` package)
121
+
122
+ ```
123
+ bash scripts/lint_backend.sh
124
+ bash scripts/type_check_backend.sh
125
+ python -m pytest -m "not flaky" --ignore=client
126
+ python -m pytest -m "flaky" --ignore=client
127
+ ```
128
+
129
+ ### Gradio client (`gradio_client` package)
130
+
131
+ ```
132
+ cd client/python
133
+ bash scripts/lint.sh
134
+ python -m pytest -m "not flaky"
135
+ python -m pytest -m "flaky"
136
+ ```
137
+
138
+ _Could these guidelines be clearer? Feel free to open a PR to help us facilitate open-source contributions!_
groundingLMM/gradio-dev/LICENSE ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
groundingLMM/gradio-dev/README.md ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ <div align="center">
3
+
4
+ # Gradio Box
5
+
6
+ This is the advanced gradio used in our paper, [GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest](https://arxiv.org/abs/2307.03601).
7
+ This is an extension to the official [gradio](https://gradio.app/) that adds support for drawing boxes on top of an image.
8
+ This feature was requested in https://github.com/gradio-app/gradio/issues/2316.
9
+
10
+ ![teaser](box_demo.gif)
11
+ </div>
12
+
13
+
14
+
15
+ ## Usage
16
+
17
+ See mini-demo:
18
+ ```
19
+ python app_box.py
20
+ ```
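+ 
+ At its core, the mini-demo relies on the `boxes` tool of the `Image` input component: the component passes a dict whose `image` key holds the PIL image and whose `mask` key holds the drawn boxes. A condensed version of `app_box.py`:
+ 
+ ```python
+ import gradio as gr
+ 
+ def predict(inp):
+     # 'image' is the PIL image; 'mask' holds the boxes drawn on top of it
+     image, boxes = inp['image'], inp['mask']
+     return [image.crop(box) for box in boxes]  # crop out each drawn box
+ 
+ demo = gr.Interface(fn=predict,
+                     inputs=gr.Image(tool="boxes", type="pil"),
+                     outputs=gr.Gallery())
+ demo.launch()
+ ```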
21
+
22
+
23
+ ## Install
24
+
25
+ ### 1. Install Node.js
26
+
27
+ We install it on Ubuntu with:
28
+
29
+ ```
30
+ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.3/install.sh | bash
31
+
32
+ source ~/.bashrc # or ~/.zshrc based on which one you use
33
+
34
+ nvm install v18.16.0
35
+ ```
36
+
37
+
38
+ ### 2. Install pnpm
39
+
40
+ ```
41
+ curl -fsSL https://get.pnpm.io/install.sh | sh -
42
+
43
+ source ~/.bashrc # or ~/.zshrc based on which one you use
44
+
45
+ pnpm --version # check if success
46
+ ```
47
+
48
+ ### 3. Install gradio
49
+
50
+ ```
51
+ git clone https://github.com/ShoufaChen/gradio-dev.git
52
+
53
+ cd gradio-dev
54
+
55
+ bash scripts/build_frontend.sh
56
+
57
+ pip install -e .
58
+ ```
59
+
60
+
61
+ ## Common Installation Issues
62
+
63
+
64
+ <details>
65
+ <summary>
66
+  ERR_PNPM_FETCH_404  GET https://packagecloud.io/github/git-lfs/npm/whatwg-url/-/whatwg-url-5.0.0.tgz: Not Found - 404
67
+ No authorization header was set for the request.
68
+ </summary>
69
+ <br/>
70
+ https://github.com/pnpm/pnpm/issues/2933#issuecomment-975886322
71
+
72
+ ```
73
+ # Add following in `~/.npmrc` file
74
+ @OWNER:registry=https://packagecloud.io/github/git-lfs/npm/
75
+ ```
76
+
77
+ </details>
78
+
79
+
80
+ <details>
81
+ <summary>
82
+ ERROR: File "setup.py" not found. Directory cannot be installed in editable mode:
83
+ </summary>
84
+ <br/>
85
+ Use pip version >= 23.0.1
86
+
87
+ </details>
88
+
89
+
90
+ ## Acknowledgement
91
+
92
+ Our implementation is mainly inspired by https://github.com/gradio-app/gradio/pull/3220, with several modifications for the latest gradio.
93
+ Many thanks to [CtrlAltDeplete](https://github.com/CtrlAltDeplete).
94
+
groundingLMM/gradio-dev/README_old.md ADDED
@@ -0,0 +1,290 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!-- DO NOT EDIT THIS FILE DIRECTLY. INSTEAD EDIT THE `readme_template.md` OR `guides/1)getting_started/1)quickstart.md` TEMPLATES AND THEN RUN `render_readme.py` SCRIPT. -->
2
+
3
+
4
+ <div align="center">
5
+
6
+ [<img src="readme_files/gradio.svg" alt="gradio" width=300>](https://gradio.app)<br>
7
+ <em>Build & share delightful machine learning apps easily</em>
8
+
9
+ [![gradio-backend](https://github.com/gradio-app/gradio/actions/workflows/backend.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/backend.yml)
10
+ [![gradio-ui](https://github.com/gradio-app/gradio/actions/workflows/ui.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/ui.yml)
11
+ [![PyPI](https://img.shields.io/pypi/v/gradio)](https://pypi.org/project/gradio/)
12
+ [![PyPI downloads](https://img.shields.io/pypi/dm/gradio)](https://pypi.org/project/gradio/)
13
+ ![Python version](https://img.shields.io/badge/python-3.8+-important)
14
+ [![Twitter follow](https://img.shields.io/twitter/follow/gradio?style=social&label=follow)](https://twitter.com/gradio)
15
+
16
+ [Website](https://gradio.app)
17
+ | [Documentation](https://gradio.app/docs/)
18
+ | [Guides](https://gradio.app/guides/)
19
+ | [Getting Started](https://gradio.app/getting_started/)
20
+ | [Examples](demo/)
21
+ | [中文](readme_files/zh-cn#readme)
22
+ </div>
23
+
24
+ # Gradio: Build Machine Learning Web Apps — in Python
25
+
26
+ Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
27
+
28
+ With Gradio, you can quickly create a beautiful user interface around your machine learning models or data science workflow and let people "try it out" by dragging-and-dropping in their own images,
29
+ pasting text, recording their own voice, and interacting with your demo, all through the browser.
30
+
31
+ ![Interface montage](readme_files/header-image.jpg)
32
+
33
+ Gradio is useful for:
34
+
35
+ - **Demoing** your machine learning models for clients/collaborators/users/students.
36
+
37
+ - **Deploying** your models quickly with automatic shareable links and getting feedback on model performance.
38
+
39
+ - **Debugging** your model interactively during development using built-in manipulation and interpretation tools.
40
+
41
+ ## Quickstart
42
+
43
+ **Prerequisite**: Gradio requires Python 3.8 or higher, that's all!
44
+
45
+ ### What Does Gradio Do?
46
+
47
+ One of the *best ways to share* your machine learning model, API, or data science workflow with others is to create an **interactive app** that allows your users or colleagues to try out the demo in their browsers.
48
+
49
+ Gradio allows you to **build demos and share them, all in Python.** And usually in just a few lines of code! So let's get started.
50
+
51
+ ### Hello, World
52
+
53
+ To get Gradio running with a simple "Hello, World" example, follow these three steps:
54
+
55
+ 1\. Install Gradio using pip:
56
+
57
+ ```bash
58
+ pip install gradio
59
+ ```
60
+
61
+ 2\. Run the code below as a Python script or in a Jupyter Notebook (or [Google Colab](https://colab.research.google.com/drive/18ODkJvyxHutTN0P5APWyGFO_xwNcgHDZ?usp=sharing)):
62
+
63
+ ```python
64
+ import gradio as gr
65
+
66
+ def greet(name):
67
+ return "Hello " + name + "!"
68
+
69
+ demo = gr.Interface(fn=greet, inputs="text", outputs="text")
70
+
71
+ demo.launch()
72
+ ```
73
+
74
+
75
+ 3\. The demo below will appear automatically within the Jupyter Notebook, or pop in a browser on [http://localhost:7860](http://localhost:7860) if running from a script:
76
+
77
+ ![`hello_world` demo](demo/hello_world/screenshot.gif)
78
+
79
+ When developing locally, if you want to run the code as a Python script, you can use the Gradio CLI to launch the application **in reload mode**, which will provide seamless and fast development. Learn more about reloading in the [Auto-Reloading Guide](https://gradio.app/developing-faster-with-reload-mode/).
80
+
81
+ ```bash
82
+ gradio app.py
83
+ ```
84
+
85
+ Note: you can also do `python app.py`, but it won't provide the automatic reload mechanism.
86
+
87
+ ### The `Interface` Class
88
+
89
+ You'll notice that in order to make the demo, we created a `gradio.Interface`. This `Interface` class can wrap any Python function with a user interface. In the example above, we saw a simple text-based function, but the function could be anything from music generator to a tax calculator to the prediction function of a pretrained machine learning model.
90
+
91
+ The core `Interface` class is initialized with three required parameters:
92
+
93
+ - `fn`: the function to wrap a UI around
94
+ - `inputs`: which component(s) to use for the input (e.g. `"text"`, `"image"` or `"audio"`)
95
+ - `outputs`: which component(s) to use for the output (e.g. `"text"`, `"image"` or `"label"`)
96
+
97
+ Let's take a closer look at these components used to provide input and output.
98
+
99
+ ### Components Attributes
100
+
101
+ We saw some simple `Textbox` components in the previous examples, but what if you want to change how the UI components look or behave?
102
+
103
+ Let's say you want to customize the input text field — for example, you wanted it to be larger and have a text placeholder. If we use the actual class for `Textbox` instead of using the string shortcut, you have access to much more customizability through component attributes.
104
+
105
+ ```python
106
+ import gradio as gr
107
+
108
+ def greet(name):
109
+ return "Hello " + name + "!"
110
+
111
+ demo = gr.Interface(
112
+ fn=greet,
113
+ inputs=gr.Textbox(lines=2, placeholder="Name Here..."),
114
+ outputs="text",
115
+ )
116
+ demo.launch()
117
+ ```
118
+
119
+ ![`hello_world_2` demo](demo/hello_world_2/screenshot.gif)
120
+
121
+ ### Multiple Input and Output Components
122
+
123
+ Suppose you had a more complex function, with multiple inputs and outputs. In the example below, we define a function that takes a string, boolean, and number, and returns a string and number. Take a look how you pass a list of input and output components.
124
+
125
+ ```python
126
+ import gradio as gr
127
+
128
+ def greet(name, is_morning, temperature):
129
+ salutation = "Good morning" if is_morning else "Good evening"
130
+ greeting = f"{salutation} {name}. It is {temperature} degrees today"
131
+ celsius = (temperature - 32) * 5 / 9
132
+ return greeting, round(celsius, 2)
133
+
134
+ demo = gr.Interface(
135
+ fn=greet,
136
+ inputs=["text", "checkbox", gr.Slider(0, 100)],
137
+ outputs=["text", "number"],
138
+ )
139
+ demo.launch()
140
+ ```
141
+
142
+ ![`hello_world_3` demo](demo/hello_world_3/screenshot.gif)
143
+
144
+ You simply wrap the components in a list. Each component in the `inputs` list corresponds to one of the parameters of the function, in order. Each component in the `outputs` list corresponds to one of the values returned by the function, again in order.
145
+
146
+ ### An Image Example
147
+
148
+ Gradio supports many types of components, such as `Image`, `DataFrame`, `Video`, or `Label`. Let's try an image-to-image function to get a feel for these!
149
+
150
+ ```python
151
+ import numpy as np
152
+ import gradio as gr
153
+
154
+ def sepia(input_img):
155
+ sepia_filter = np.array([
156
+ [0.393, 0.769, 0.189],
157
+ [0.349, 0.686, 0.168],
158
+ [0.272, 0.534, 0.131]
159
+ ])
160
+ sepia_img = input_img.dot(sepia_filter.T)
161
+ sepia_img /= sepia_img.max()
162
+ return sepia_img
163
+
164
+ demo = gr.Interface(sepia, gr.Image(shape=(200, 200)), "image")
165
+ demo.launch()
166
+ ```
167
+
168
+ ![`sepia_filter` demo](demo/sepia_filter/screenshot.gif)
169
+
170
+ When using the `Image` component as input, your function will receive a NumPy array with the shape `(height, width, 3)`, where the last dimension represents the RGB values. We'll return an image as well in the form of a NumPy array.
171
+
172
+ You can also set the datatype used by the component with the `type=` keyword argument. For example, if you wanted your function to take a file path to an image instead of a NumPy array, the input `Image` component could be written as:
173
+
174
+ ```python
175
+ gr.Image(type="filepath", shape=...)
176
+ ```
177
+
178
+ Also note that our input `Image` component comes with an edit button 🖉, which allows for cropping and zooming into images. Manipulating images in this way can help reveal biases or hidden flaws in a machine learning model!
179
+
180
+ You can read more about the many components and how to use them in the [Gradio docs](https://gradio.app/docs).
181
+
182
+ ### Blocks: More Flexibility and Control
183
+
184
+ Gradio offers two classes to build apps:
185
+
186
+ 1\. **Interface**, that provides a high-level abstraction for creating demos that we've been discussing so far.
187
+
188
+ 2\. **Blocks**, a low-level API for designing web apps with more flexible layouts and data flows. Blocks allows you to do things like feature multiple data flows and demos, control where components appear on the page, handle complex data flows (e.g. outputs can serve as inputs to other functions), and update properties/visibility of components based on user interaction — still all in Python. If this customizability is what you need, try `Blocks` instead!
189
+
190
+ ### Hello, Blocks
191
+
192
+ Let's take a look at a simple example. Note how the API here differs from `Interface`.
193
+
194
+ ```python
195
+ import gradio as gr
196
+
197
+ def greet(name):
198
+ return "Hello " + name + "!"
199
+
200
+ with gr.Blocks() as demo:
201
+ name = gr.Textbox(label="Name")
202
+ output = gr.Textbox(label="Output Box")
203
+ greet_btn = gr.Button("Greet")
204
+ greet_btn.click(fn=greet, inputs=name, outputs=output)
205
+
206
+ demo.launch()
207
+ ```
208
+
209
+ ![`hello_blocks` demo](demo/hello_blocks/screenshot.gif)
210
+
211
+ Things to note:
212
+
213
+ - `Blocks` are made with a `with` clause, and any component created inside this clause is automatically added to the app.
214
+ - Components appear vertically in the app in the order they are created. (Later we will cover customizing layouts!)
215
+ - A `Button` was created, and then a `click` event-listener was added to this button. The API for this should look familiar! Like an `Interface`, the `click` method takes a Python function, input components, and output components.
216
+
217
+ ### More Complexity
218
+
219
+ Here's an app to give you a taste of what's possible with `Blocks`:
220
+
221
+ ```python
222
+ import numpy as np
223
+ import gradio as gr
224
+
225
+
226
+ def flip_text(x):
227
+ return x[::-1]
228
+
229
+
230
+ def flip_image(x):
231
+ return np.fliplr(x)
232
+
233
+
234
+ with gr.Blocks() as demo:
235
+ gr.Markdown("Flip text or image files using this demo.")
236
+ with gr.Tab("Flip Text"):
237
+ text_input = gr.Textbox()
238
+ text_output = gr.Textbox()
239
+ text_button = gr.Button("Flip")
240
+ with gr.Tab("Flip Image"):
241
+ with gr.Row():
242
+ image_input = gr.Image()
243
+ image_output = gr.Image()
244
+ image_button = gr.Button("Flip")
245
+
246
+ with gr.Accordion("Open for More!"):
247
+ gr.Markdown("Look at me...")
248
+
249
+ text_button.click(flip_text, inputs=text_input, outputs=text_output)
250
+ image_button.click(flip_image, inputs=image_input, outputs=image_output)
251
+
252
+ demo.launch()
253
+ ```
254
+
255
+ ![`blocks_flipper` demo](demo/blocks_flipper/screenshot.gif)
256
+
257
+ A lot more going on here! We'll cover how to create complex `Blocks` apps like this in the [building with blocks](https://gradio.app/building_with_blocks) section for you.
258
+
259
+ Congrats, you're now familiar with the basics of Gradio! 🥳 Go to our [next guide](https://gradio.app/key_features) to learn more about the key features of Gradio.
260
+
261
+
262
+ ## Open Source Stack
263
+
264
+ Gradio is built with many wonderful open-source libraries, please support them as well!
265
+
266
+ [<img src="readme_files/huggingface_mini.svg" alt="huggingface" height=40>](https://huggingface.co)
267
+ [<img src="readme_files/python.svg" alt="python" height=40>](https://www.python.org)
268
+ [<img src="readme_files/fastapi.svg" alt="fastapi" height=40>](https://fastapi.tiangolo.com)
269
+ [<img src="readme_files/encode.svg" alt="encode" height=40>](https://www.encode.io)
270
+ [<img src="readme_files/svelte.svg" alt="svelte" height=40>](https://svelte.dev)
271
+ [<img src="readme_files/vite.svg" alt="vite" height=40>](https://vitejs.dev)
272
+ [<img src="readme_files/pnpm.svg" alt="pnpm" height=40>](https://pnpm.io)
273
+ [<img src="readme_files/tailwind.svg" alt="tailwind" height=40>](https://tailwindcss.com)
274
+
275
+ ## License
276
+
277
+ Gradio is licensed under the Apache License 2.0 found in the [LICENSE](LICENSE) file in the root directory of this repository.
278
+
279
+ ## Citation
280
+
281
+ Also check out the paper *[Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild](https://arxiv.org/abs/1906.02569), ICML HILL 2019*, and please cite it if you use Gradio in your work.
282
+
283
+ ```
284
+ @article{abid2019gradio,
285
+ title = {Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild},
286
+ author = {Abid, Abubakar and Abdalla, Ali and Abid, Ali and Khan, Dawood and Alfozan, Abdulrahman and Zou, James},
287
+ journal = {arXiv preprint arXiv:1906.02569},
288
+ year = {2019},
289
+ }
290
+ ```
groundingLMM/gradio-dev/SECURITY.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ # Security Policy
2
+
3
+ ## Reporting a Vulnerability
4
+
5
+ If you discover a security vulnerability, we would be very grateful if you could email us at [email protected]. This is the preferred approach instead of opening a public issue. We take all vulnerability reports seriously and will work to patch the vulnerability immediately. Whenever possible, we will credit the person or people who reported the vulnerability after it has been patched.
groundingLMM/gradio-dev/app_box.py ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gradio as gr
2
+
3
+
4
+ def predict(inp):
5
+ image = inp['image']
6
+ boxes = inp['mask']
7
+
8
+ sub_images = []
9
+ for box in boxes:
10
+ sub_images.append(image.crop(box))
11
+ return sub_images
12
+
13
+
14
+ demo = gr.Interface(fn=predict,
15
+ inputs=gr.Image(tool="boxes", type="pil"),
16
+ outputs=gr.Gallery())
17
+
18
+ demo.launch()
groundingLMM/gradio-dev/globals.d.ts ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ declare global {
2
+ interface Window {
3
+ __gradio_mode__: "app" | "website";
4
+ launchGradio: Function;
5
+ launchGradioFromSpaces: Function;
6
+ gradio_config: Config;
7
+ scoped_css_attach: (link: HTMLLinkElement) => void;
8
+ __is_colab__: boolean;
9
+ }
10
+ }
11
+
12
+ export interface Config {
13
+ auth_required: boolean | undefined;
14
+ auth_message: string;
15
+ components: any[];
16
+ css: string | null;
17
+ dependencies: any[];
18
+ dev_mode: boolean;
19
+ enable_queue: boolean;
20
+ layout: any;
21
+ mode: "blocks" | "interface";
22
+ root: string;
23
+ theme: string;
24
+ title: string;
25
+ version: string;
26
+ is_space: boolean;
27
+ is_colab: boolean;
28
+ show_api: boolean;
29
+ stylesheets: string[];
30
+ path: string;
31
+ }
groundingLMM/gradio-dev/package.json ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "name": "gradio-ui",
3
+ "version": "0.0.1",
4
+ "description": "Gradio UI packages",
5
+ "scripts": {
6
+ "workbench": "pnpm --filter @gradio/workbench dev",
7
+ "dev": "pnpm css && pnpm --filter @gradio/client build && pnpm --filter @gradio/app dev",
8
+ "css": "pnpm --filter @gradio/theme generate",
9
+ "build": "pnpm css && pnpm --filter @gradio/client build && pnpm --filter @gradio/app build:local --emptyOutDir",
10
+ "build:cdn": "pnpm --filter @gradio/client build && pnpm --filter @gradio/app build:cdn --emptyOutDir",
11
+ "build:website": "pnpm --filter @gradio/app build:website --emptyOutDir",
12
+ "build:cdn-local": "TEST_CDN=TRUE pnpm build:cdn",
13
+ "preview:cdn-server": "sirv ./gradio/templates/cdn --single --port=4321 --cors",
14
+ "preview:cdn-app": "pnpm --filter @gradio/cdn-test dev",
15
+ "preview:cdn-local": "run-p preview:cdn-server preview:cdn-app",
16
+ "format:check": "prettier --ignore-path .config/.prettierignore --check --plugin-search-dir=. .",
17
+ "format:write": "prettier --ignore-path .config/.prettierignore --write --plugin-search-dir=. .",
18
+ "ts:check": "svelte-check --tsconfig tsconfig.json",
19
+ "test": "pnpm --filter @gradio/client build && vitest dev --config .config/vitest.config.ts",
20
+ "test:run": "pnpm --filter @gradio/client build && vitest run --config .config/vitest.config.ts",
21
+ "test:node": "TEST_MODE=node pnpm vitest run --config .config/vitest.config.ts",
22
+ "test:browser": "pnpm --filter @gradio/app test:browser:full",
23
+ "test:browser:full": "run-s build test:browser",
24
+ "test:browser:debug": "pnpm --filter @gradio/app test:browser:debug",
25
+ "ci:publish": "pnpm publish --no-git-checks --access public -r",
26
+ "ci:version": "changeset version && pnpm i --lockfile-only"
27
+ },
28
+ "type": "module",
29
+ "author": "",
30
+ "license": "ISC",
31
+ "private": true,
32
+ "dependencies": {
33
+ "@changesets/changelog-github": "^0.4.8",
34
+ "@changesets/cli": "^2.26.1",
35
+ "@gradio/tootils": "workspace:^0.0.1",
36
+ "@playwright/test": "^1.27.1",
37
+ "@sveltejs/vite-plugin-svelte": "^1.0.0-next.44",
38
+ "@tailwindcss/forms": "^0.5.0",
39
+ "@testing-library/dom": "^8.11.3",
40
+ "@testing-library/jest-dom": "^5.16.5",
41
+ "@testing-library/svelte": "^3.1.0",
42
+ "@testing-library/user-event": "^13.5.0",
43
+ "autoprefixer": "^10.4.4",
44
+ "babylonjs": "^5.17.1",
45
+ "babylonjs-loaders": "^5.17.1",
46
+ "happy-dom": "^9.20.3",
47
+ "msw": "^1.0.0",
48
+ "node-html-parser": "^5.3.3",
49
+ "npm-run-all": "^4.1.5",
50
+ "playwright": "^1.27.1",
51
+ "plotly.js-dist-min": "^2.10.1",
52
+ "polka": "^1.0.0-next.22",
53
+ "pollen-css": "^4.6.1",
54
+ "postcss": "^8.4.6",
55
+ "postcss-custom-media": "8",
56
+ "postcss-nested": "^5.0.6",
57
+ "postcss-prefix-selector": "^1.16.0",
58
+ "prettier": "^2.6.2",
59
+ "prettier-plugin-css-order": "^1.3.0",
60
+ "prettier-plugin-svelte": "^2.10.0",
61
+ "sirv": "^2.0.2",
62
+ "sirv-cli": "^2.0.2",
63
+ "svelte": "^3.59.1",
64
+ "svelte-check": "^3.1.4",
65
+ "svelte-i18n": "^3.6.0",
66
+ "svelte-preprocess": "^5.0.3",
67
+ "tailwindcss": "^3.1.6",
68
+ "tinyspy": "^0.3.0",
69
+ "typescript": "^4.7.4",
70
+ "vite": "^4.2.1",
71
+ "vitest": "^0.29.8"
72
+ },
73
+ "devDependencies": {
74
+ "@types/three": "^0.138.0"
75
+ },
76
+ "prettier": {
77
+ "useTabs": true,
78
+ "singleQuote": false,
79
+ "trailingComma": "none",
80
+ "printWidth": 80,
81
+ "pluginSearchDirs": [
82
+ ".."
83
+ ]
84
+ }
85
+ }
groundingLMM/gradio-dev/pnpm-lock.yaml ADDED
The diff for this file is too large to render. See raw diff
 
groundingLMM/gradio-dev/pnpm-workspace.yaml ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ packages:
2
+ - 'js/*'
3
+ - 'client/js'
groundingLMM/gradio-dev/pyproject.toml ADDED
@@ -0,0 +1,113 @@
+ [build-system]
+ requires = ["hatchling", "hatch-requirements-txt", "hatch-fancy-pypi-readme>=22.5.0"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "gradio"
+ dynamic = ["version", "dependencies", "readme"]
+ description = "Python library for easily interacting with trained machine learning models"
+ license = "Apache-2.0"
+ requires-python = ">=3.8"
+ authors = [
+ { name = "Abubakar Abid", email = "[email protected]" },
+ { name = "Ali Abid", email = "[email protected]" },
+ { name = "Ali Abdalla", email = "[email protected]" },
+ { name = "Dawood Khan", email = "[email protected]" },
+ { name = "Ahsen Khaliq", email = "[email protected]" },
+ { name = "Pete Allen", email = "[email protected]" },
+ { name = "Ömer Faruk Özdemir", email = "[email protected]" },
+ ]
+ keywords = ["machine learning", "reproducibility", "visualization"]
+
+ classifiers = [
+ 'Development Status :: 5 - Production/Stable',
+ 'License :: OSI Approved :: Apache Software License',
+ 'Operating System :: OS Independent',
+ 'Programming Language :: Python :: 3',
+ 'Programming Language :: Python :: 3 :: Only',
+ 'Programming Language :: Python :: 3.8',
+ 'Programming Language :: Python :: 3.9',
+ 'Programming Language :: Python :: 3.10',
+ 'Programming Language :: Python :: 3.11',
+ 'Topic :: Scientific/Engineering',
+ 'Topic :: Scientific/Engineering :: Artificial Intelligence',
+ 'Topic :: Scientific/Engineering :: Visualization',
+ ]
+
+ [project.scripts]
+ gradio = "gradio.cli:cli"
+ upload_theme = "gradio.themes.upload_theme:main"
+
+ [project.urls]
+ Homepage = "https://github.com/gradio-app/gradio"
+
+ [tool.hatch.version]
+ path = "gradio/version.txt"
+ pattern = "(?P<version>.+)"
+
+ [tool.hatch.metadata.hooks.requirements_txt]
+ filename = "requirements.txt"
+
+ [tool.hatch.metadata.hooks.fancy-pypi-readme]
+ content-type = "text/markdown"
+ fragments = [
+ { path = "README.md" },
+ ]
+
+ [[tool.hatch.metadata.hooks.fancy-pypi-readme.substitutions]]
+ pattern = "(website/homepage|readme_files)/"
+ replacement = 'https://raw.githubusercontent.com/gradio-app/gradio/main/\g<1>/'
+
+ [[tool.hatch.metadata.hooks.fancy-pypi-readme.substitutions]]
+ pattern = 'demo/([\S]*.gif)'
+ replacement = 'https://raw.githubusercontent.com/gradio-app/gradio/main/demo/\g<1>'
+
+ [tool.hatch.build]
+ artifacts = [
+ "/gradio/templates",
+ ]
+
+
+ [tool.hatch.build.targets.sdist]
+ include = [
+ "/gradio",
+ "/test",
+ "/README.md",
+ "/requirements.txt",
+ ]
+
+ [tool.ruff]
+ target-version = "py37"
+ extend-select = [
+ "B",
+ "C",
+ "I",
+ "N",
+ "SIM",
+ "UP",
+ ]
+ ignore = [
+ "C901", # function is too complex (TODO: un-ignore this)
+ "B023", # function definition in loop (TODO: un-ignore this)
+ "B008", # function call in argument defaults
+ "B017", # pytest.raises considered evil
+ "B028", # explicit stacklevel for warnings
+ "E501", # from scripts/lint_backend.sh
+ "SIM105", # contextlib.suppress (has a performance cost)
+ "SIM117", # multiple nested with blocks (doesn't look good with gr.Row etc)
+ "UP007", # use X | Y for type annotations (TODO: can be enabled once Pydantic plays nice with them)
+ ]
+
+ [tool.ruff.per-file-ignores]
+ "demo/*" = [
+ "E402", # Demos may have imports not at the top
+ "E741", # Demos may have ambiguous variable names
+ "F405", # Demos may use star imports
+ "I", # Don't care about import order
+ ]
+ "gradio/__init__.py" = [
+ "F401", # "Imported but unused" (TODO: it would be better to be explicit and use __all__)
+ ]
+ "gradio/routes.py" = [
+ "UP006", # Pydantic on Python 3.7 requires old-style type annotations (TODO: drop when Python 3.7 is dropped)
+ ]
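For reference, the two `fancy-pypi-readme` substitutions above rewrite relative asset paths in the packaged README into absolute raw.githubusercontent.com URLs before the long description is published. Below is a minimal Python sketch of the equivalent transformation, using the same patterns declared in the TOML; the sample `readme` string is invented for illustration and this is not the hatch plugin's actual code.

```python
import re

# Illustrative README fragment containing the two kinds of relative paths.
readme = (
    "![montage](readme_files/header-image.jpg)\n"
    "![flip demo](demo/flip_text/screenshot.gif)"
)

# Mirror of the first substitution: readme_files/ and website/homepage/ paths
# become absolute raw GitHub URLs.
readme = re.sub(
    r"(website/homepage|readme_files)/",
    r"https://raw.githubusercontent.com/gradio-app/gradio/main/\g<1>/",
    readme,
)

# Mirror of the second substitution: the same treatment for demo GIFs.
readme = re.sub(
    r"demo/([\S]*.gif)",
    r"https://raw.githubusercontent.com/gradio-app/gradio/main/demo/\g<1>",
    readme,
)

print(readme)
```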
groundingLMM/gradio-dev/readme_template.md ADDED
@@ -0,0 +1,68 @@
+ <div align="center">
+
+ [<img src="readme_files/gradio.svg" alt="gradio" width=300>](https://gradio.app)<br>
+ <em>Build & share delightful machine learning apps easily</em>
+
+ [![gradio-backend](https://github.com/gradio-app/gradio/actions/workflows/backend.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/backend.yml)
+ [![gradio-ui](https://github.com/gradio-app/gradio/actions/workflows/ui.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/ui.yml)
+ [![PyPI](https://img.shields.io/pypi/v/gradio)](https://pypi.org/project/gradio/)
+ [![PyPI downloads](https://img.shields.io/pypi/dm/gradio)](https://pypi.org/project/gradio/)
+ ![Python version](https://img.shields.io/badge/python-3.8+-important)
+ [![Twitter follow](https://img.shields.io/twitter/follow/gradio?style=social&label=follow)](https://twitter.com/gradio)
+
+ [Website](https://gradio.app)
+ | [Documentation](https://gradio.app/docs/)
+ | [Guides](https://gradio.app/guides/)
+ | [Getting Started](https://gradio.app/getting_started/)
+ | [Examples](demo/)
+ | [中文](readme_files/zh-cn#readme)
+ </div>
+
+ # Gradio: Build Machine Learning Web Apps — in Python
+
+ Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
+
+ With Gradio, you can quickly create a beautiful user interface around your machine learning models or data science workflow and let people "try it out" by dragging-and-dropping in their own images,
+ pasting text, recording their own voice, and interacting with your demo, all through the browser.
+
+ ![Interface montage](readme_files/header-image.jpg)
+
+ Gradio is useful for:
+
+ - **Demoing** your machine learning models for clients/collaborators/users/students.
+
+ - **Deploying** your models quickly with automatic shareable links and getting feedback on model performance.
+
+ - **Debugging** your model interactively during development using built-in manipulation and interpretation tools.
+
+ $getting_started
+
+ ## Open Source Stack
+
+ Gradio is built with many wonderful open-source libraries, please support them as well!
+
+ [<img src="readme_files/huggingface_mini.svg" alt="huggingface" height=40>](https://huggingface.co)
+ [<img src="readme_files/python.svg" alt="python" height=40>](https://www.python.org)
+ [<img src="readme_files/fastapi.svg" alt="fastapi" height=40>](https://fastapi.tiangolo.com)
+ [<img src="readme_files/encode.svg" alt="encode" height=40>](https://www.encode.io)
+ [<img src="readme_files/svelte.svg" alt="svelte" height=40>](https://svelte.dev)
+ [<img src="readme_files/vite.svg" alt="vite" height=40>](https://vitejs.dev)
+ [<img src="readme_files/pnpm.svg" alt="pnpm" height=40>](https://pnpm.io)
+ [<img src="readme_files/tailwind.svg" alt="tailwind" height=40>](https://tailwindcss.com)
+
+ ## License
+
+ Gradio is licensed under the Apache License 2.0 found in the [LICENSE](LICENSE) file in the root directory of this repository.
+
+ ## Citation
+
+ Also check out the paper *[Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild](https://arxiv.org/abs/1906.02569), ICML HILL 2019*, and please cite it if you use Gradio in your work.
+
+ ```
+ @article{abid2019gradio,
+ title = {Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild},
+ author = {Abid, Abubakar and Abdalla, Ali and Abid, Ali and Khan, Dawood and Alfozan, Abdulrahman and Zou, James},
+ journal = {arXiv preprint arXiv:1906.02569},
+ year = {2019},
+ }
+ ```
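As a concrete companion to the README template above, which describes wrapping a model or function in a browser UI, a minimal Gradio demo looks roughly like the sketch below. The `greet` function is a toy stand-in for a real model; everything else uses Gradio's standard `Interface` API.

```python
import gradio as gr


def greet(name: str) -> str:
    # Stand-in for a model prediction; any Python callable works here.
    return f"Hello, {name}!"


# Map the function's inputs/outputs to UI components and serve them in the browser.
demo = gr.Interface(fn=greet, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch()  # share=True would additionally create a public, shareable link
```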
groundingLMM/gradio-dev/render_readme.py ADDED
@@ -0,0 +1,39 @@
+ #!/usr/bin/env python
+
+ import re
+ from pathlib import Path
+
+ README_TEMPLATE_FILEPATH = "readme_template.md"
+ GETTING_STARTED_TEMPLATE_FILEPATH = "guides/01_getting-started/01_quickstart.md"
+
+ readme_template = Path(README_TEMPLATE_FILEPATH).read_text()
+ getting_started_template = Path(GETTING_STARTED_TEMPLATE_FILEPATH).read_text()
+
+ # Extract all the code and demo tags from the getting started template
+ code_tags = re.findall(r"\$code_([^\s]+)", getting_started_template)
+ demo_tags = re.findall(r"\$demo_([^\s]+)", getting_started_template)
+ codes = {}
+ demos = {}
+
+ for src in code_tags:
+     context = Path(f"demo/{src}/run.py").read_text()
+     # Replace the condition to run the demo directly with actual launch code
+     context = re.sub(r"if __name__(.*[\n$]*)*", "demo.launch()", context)
+     codes[src] = f"```python\n{context}\n```\n"  # Convert to Markdown code block
+
+ for src in demo_tags:
+     demos[src] = f"![`{src}` demo](demo/{src}/screenshot.gif)"
+
+ # Replace the headers in the getting started template with a smaller header (e.g. H3 instead of H2) to
+ # make the README more readable and less cluttered.
+ getting_started_template = re.sub(r"^(#+)", r"#\1", getting_started_template, flags=re.MULTILINE)
+ readme_template = readme_template.replace("$getting_started", getting_started_template)
+
+ # Now put the codes and the screenshots in the README template
+ readme_template = re.sub(r"\$code_([^\s]+)", lambda x: codes[x.group(1)], readme_template)
+ readme_template = re.sub(r"\$demo_([^\s]+)", lambda x: demos[x.group(1)], readme_template)
+
+ # Save the README template to the actual README.md file (with a note about the editing)
+ EDITING_NOTE = ("<!-- DO NOT EDIT THIS FILE DIRECTLY. INSTEAD EDIT THE `readme_template.md` OR "
+                 "`guides/1)getting_started/1)quickstart.md` TEMPLATES AND THEN RUN `render_readme.py` SCRIPT. -->")
+ Path("README.md").write_text(f"{EDITING_NOTE}\n\n{readme_template}")
groundingLMM/gradio-dev/requirements.txt ADDED
@@ -0,0 +1,26 @@
+ aiofiles
+ aiohttp
+ altair>=4.2.0
+ fastapi
+ ffmpy
+ gradio_client>=0.2.7
+ httpx
+ huggingface_hub>=0.14.0
+ Jinja2
+ markdown-it-py[linkify]>=2.0.0
+ pygments>=2.12.0
+ mdit-py-plugins<=0.3.3
+ markupsafe
+ matplotlib
+ numpy
+ orjson
+ pandas
+ pillow
+ pydantic
+ python-multipart
+ pydub
+ pyyaml
+ requests
+ semantic_version
+ uvicorn>=0.14.0
+ websockets>=10.0
groundingLMM/gradio-dev/style.md ADDED
@@ -0,0 +1,160 @@
+ # component-styles
+
+ ## Textbox
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of text input          |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of text input          |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Number
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of text input          |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of text input          |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Slider
+
+ | name        | type   | description                    |
+ | ----------- | ------ | ------------------------------ |
+ | `container` | `bool` | show or hide the container box |
+
+ ## Checkbox
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of checkbox            |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of checkbox            |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Checkbox Group
+
+ | name             | type                                 | description                               |
+ | ---------------- | ------------------------------------ | ----------------------------------------- |
+ | `rounded`        | `bool` or `(bool, bool, bool, bool)` | corners of checkboxes                     |
+ | `container`      | `bool`                               | show or hide the container box            |
+ | `item_container` | `bool`                               | show or hide the checkbox container boxes |
+
+ ## Radio
+
+ | name             | type   | description                            |
+ | ---------------- | ------ | -------------------------------------- |
+ | `container`      | `bool` | show or hide the container box         |
+ | `item_container` | `bool` | show or hide the radio container boxes |
+
+ ## Dropdown
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of input               |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of input               |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Image
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Video
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Audio
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## File
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Dataframe
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Timeseries
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Label
+
+ | name        | type   | description                    |
+ | ----------- | ------ | ------------------------------ |
+ | `container` | `bool` | show or hide the container box |
+
+ ## HighlightedText
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of labels              |
+ | `color_map` | `Dict[str, str]`                     | color map of labels and colors |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## JSON
+
+ | name        | type   | description                    |
+ | ----------- | ------ | ------------------------------ |
+ | `container` | `bool` | show or hide the container box |
+
+ ## HTML
+
+ Nothing
+
+ ## Gallery
+
+ | name        | type                                      | description                         |
+ | ----------- | ----------------------------------------- | ----------------------------------- |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)`      | corners of images                   |
+ | `grid`      | `int` or `(int, int, int, int, int, int)` | grid for images                     |
+ | `height`    | `"auto"`                                  | height of gallery (auto or default) |
+ | `container` | `bool`                                    | show or hide the container box      |
+
+ ## Chatbot
+
+ | name        | type                                 | description                                      |
+ | ----------- | ------------------------------------ | ------------------------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of chat bubbles                          |
+ | `color_map` | `Dict[str, str]`                     | color map of user and bot color for chat bubbles |
+
+ ## Model3D
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Plot
+
+ Nothing (yet)
+
+ ## Markdown
+
+ Nothing
+
+ ## Button
+
+ | name         | type                                 | description                              |
+ | ------------ | ------------------------------------ | ---------------------------------------- |
+ | `rounded`    | `bool` or `(bool, bool, bool, bool)` | corners of button                        |
+ | `border`     | `bool` or `(bool, bool, bool, bool)` | borders of button                        |
+ | `full_width` | `bool`                               | whether button expand to fill container  |
+
+ ## Dataset
+
+ Nothing
+
+ ## Variable
+
+ Nothing
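The tables above document per-component style keyword arguments. In the Gradio 3.x line that this vendored `gradio-dev` copy tracks, such options were typically passed through a component's `.style(...)` method inside a Blocks layout; the exact set of accepted keywords varies by component and minor version, so treat the following as a hedged sketch rather than a definitive reference.

```python
import gradio as gr

with gr.Blocks() as demo:
    # Illustrative use of style kwargs from the tables above; keyword support
    # depends on the component and the Gradio 3.x version (the .style() API
    # was deprecated in later releases).
    gallery = gr.Gallery(label="results").style(grid=4)
    btn = gr.Button("Run").style(full_width=True)

if __name__ == "__main__":
    demo.launch()
```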