tuandunghcmut committed
Commit 4bb09e0 (verified) · 1 Parent(s): 391089d

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full set of changes.
Files changed (50)
  1. groundingLMM/GLaMM-FullScope/.gitattributes +35 -0
  2. groundingLMM/GLaMM-FullScope/README.md +33 -0
  3. groundingLMM/GLaMM-FullScope/added_tokens.json +9 -0
  4. groundingLMM/GLaMM-FullScope/config.json +60 -0
  5. groundingLMM/GLaMM-FullScope/generation_config.json +9 -0
  6. groundingLMM/GLaMM-FullScope/pytorch_model.bin.index.json +975 -0
  7. groundingLMM/GLaMM-FullScope/special_tokens_map.json +24 -0
  8. groundingLMM/GLaMM-FullScope/tokenizer_config.json +33 -0
  9. groundingLMM/GranD/README.md +73 -0
  10. groundingLMM/GranD/run_pipeline.sh +178 -0
  11. groundingLMM/LLaVA/.dockerignore +21 -0
  12. groundingLMM/LLaVA/.editorconfig +18 -0
  13. groundingLMM/LLaVA/.gitattributes +29 -0
  14. groundingLMM/LLaVA/.gitignore +35 -0
  15. groundingLMM/LLaVA/LICENSE +201 -0
  16. groundingLMM/LLaVA/README.md +463 -0
  17. groundingLMM/LLaVA/cog.yaml +37 -0
  18. groundingLMM/LLaVA/predict.py +155 -0
  19. groundingLMM/LLaVA/pyproject.toml +37 -0
  20. groundingLMM/dataset/dataset.py +236 -0
  21. groundingLMM/docs/GranD.md +53 -0
  22. groundingLMM/docs/datasets.md +327 -0
  23. groundingLMM/docs/evaluation.md +75 -0
  24. groundingLMM/docs/install.md +34 -0
  25. groundingLMM/docs/model_zoo.md +21 -0
  26. groundingLMM/docs/offline_demo.md +51 -0
  27. groundingLMM/docs/training.md +83 -0
  28. groundingLMM/eval/region_captioning/evaluate.py +51 -0
  29. groundingLMM/eval/region_captioning/infer.py +188 -0
  30. groundingLMM/eval/region_captioning/run_evaluation_VG.sh +28 -0
  31. groundingLMM/gradio-dev/.dockerignore +40 -0
  32. groundingLMM/gradio-dev/.editorconfig +8 -0
  33. groundingLMM/gradio-dev/.gitignore +65 -0
  34. groundingLMM/gradio-dev/CHANGELOG.md +0 -0
  35. groundingLMM/gradio-dev/CITATION.cff +45 -0
  36. groundingLMM/gradio-dev/CONTRIBUTING.md +138 -0
  37. groundingLMM/gradio-dev/LICENSE +201 -0
  38. groundingLMM/gradio-dev/README.md +94 -0
  39. groundingLMM/gradio-dev/README_old.md +290 -0
  40. groundingLMM/gradio-dev/SECURITY.md +5 -0
  41. groundingLMM/gradio-dev/app_box.py +18 -0
  42. groundingLMM/gradio-dev/globals.d.ts +31 -0
  43. groundingLMM/gradio-dev/package.json +85 -0
  44. groundingLMM/gradio-dev/pnpm-lock.yaml +0 -0
  45. groundingLMM/gradio-dev/pnpm-workspace.yaml +3 -0
  46. groundingLMM/gradio-dev/pyproject.toml +113 -0
  47. groundingLMM/gradio-dev/readme_template.md +68 -0
  48. groundingLMM/gradio-dev/render_readme.py +39 -0
  49. groundingLMM/gradio-dev/requirements.txt +26 -0
  50. groundingLMM/gradio-dev/style.md +160 -0
groundingLMM/GLaMM-FullScope/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
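
The `.gitattributes` rules above route large binary artifacts (checkpoints, archives, TensorBoard event files) through Git LFS rather than plain Git. As a rough illustration of how such glob patterns select files, here is a minimal Python sketch; it uses `fnmatch` on a small subset of the patterns as a stand-in for git's own attribute matching (which has extra rules, e.g. for `saved_model/**/*`), so treat it as illustrative only.

```python
# Illustrative only: approximate a few of the .gitattributes LFS patterns with fnmatch.
from fnmatch import fnmatch

LFS_PATTERNS = ["*.bin", "*.pt", "*.pth", "*.safetensors", "*.zip", "*tfevents*"]

def is_lfs_tracked(path: str) -> bool:
    """Return True if the file name matches one of the pattern subset above."""
    name = path.split("/")[-1]
    return any(fnmatch(name, pat) for pat in LFS_PATTERNS)

print(is_lfs_tracked("pytorch_model-00001-of-00002.bin"))  # True
print(is_lfs_tracked("config.json"))                       # False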
groundingLMM/GLaMM-FullScope/README.md ADDED
@@ -0,0 +1,33 @@
+ ---
+ license: apache-2.0
+ ---
+
+ # 👁️ GLaMM-FullScope
+
+ ---
+ ## 📝 Description
+ GLaMM-FullScope encompasses all capabilities of GLaMM and is fine-tuned on a mix of many open-source datasets. "Full" signifies its comprehensive nature, incorporating the full range of GLaMM capabilities, including
+ Grounded Conversation Generation (GCG), Referring Expression Segmentation, Region-level Captioning, Image-level Captioning and Visual Question Answering.
+
+
+ ## 💻 Download
+ To get started with GLaMM-FullScope, follow these steps:
+ ```
+ git lfs install
+ git clone https://huggingface.co/MBZUAI/GLaMM-FullScope
+ ```
+
+ ## 📚 Additional Resources
+ - **Paper:** [ArXiv](https://arxiv.org/abs/2311.03356).
+ - **GitHub Repository:** For training and updates: [GitHub - GLaMM](https://github.com/mbzuai-oryx/groundingLMM).
+ - **Project Page:** For a detailed overview and insights into the project, visit our [Project Page - GLaMM](https://mbzuai-oryx.github.io/groundingLMM/).
+
+ ## 📜 Citations and Acknowledgments
+
+ ```bibtex
+ @article{hanoona2023GLaMM,
+ title={GLaMM: Pixel Grounding Large Multimodal Model},
+ author={Rasheed, Hanoona and Maaz, Muhammad and Shaji, Sahal and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Xing, Eric and Yang, Ming-Hsuan and Khan, Fahad S.},
+ journal={ArXiv 2311.03356},
+ year={2023}
+ }
+ ```
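
The README's download section uses `git lfs` directly. As a hedged alternative, the same files can be fetched from Python with `huggingface_hub.snapshot_download`; the snippet below assumes `huggingface_hub` is installed and the repository is publicly readable, and the `local_dir` name is only an example.

```python
# Alternative to the git-lfs clone shown in the README: fetch the same repository
# with huggingface_hub (assumes `pip install huggingface_hub` and a public repo).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="MBZUAI/GLaMM-FullScope",  # repo id from the README above
    local_dir="GLaMM-FullScope",       # example destination for config, tokenizer files, shards
)
print("Model files downloaded to:", local_path)
```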
groundingLMM/GLaMM-FullScope/added_tokens.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "</p>": 32006,
+ "<bbox>": 32002,
+ "<im_end>": 32001,
+ "<im_start>": 32000,
+ "<p>": 32005,
+ "<point>": 32003,
+ "[SEG]": 32004
+ }
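
`added_tokens.json` registers the grounding-specific special tokens (`<bbox>`, `<point>`, `[SEG]`, the `<p>`/`</p>` phrase markers and the image delimiters) on top of the base 32,000-token LLaMA vocabulary. A small sanity-check sketch follows; it assumes a local clone at `GLaMM-FullScope/` and simply verifies that the seven ids are contiguous and line up with the `vocab_size` of 32007 declared in `config.json`.

```python
# Consistency check (assumes a local clone of the repo in ./GLaMM-FullScope):
# the seven added tokens occupy ids 32000-32006, extending the 32000-token
# base LLaMA vocabulary to the vocab_size of 32007 declared in config.json.
import json

with open("GLaMM-FullScope/added_tokens.json") as f:
    added = json.load(f)

ids = sorted(added.values())
assert ids == list(range(32000, 32000 + len(added))), "added-token ids should be contiguous"
print(f"{len(added)} added tokens, ids {ids[0]}..{ids[-1]} -> expected vocab_size {ids[-1] + 1}")
```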
groundingLMM/GLaMM-FullScope/config.json ADDED
@@ -0,0 +1,60 @@
+ {
+ "_name_or_path": "MBZUAI/GLaMM-GranD-Pretrained",
+ "architectures": [
+ "GLaMMForCausalLM"
+ ],
+ "bbox_token_idx": 32002,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "freeze_mlp_adapter": true,
+ "freeze_mm_mlp_adapter": false,
+ "freeze_mm_vision_resampler": false,
+ "hidden_act": "silu",
+ "hidden_size": 4096,
+ "image_aspect": "square",
+ "image_aspect_ratio": "square",
+ "image_grid_pinpoints": null,
+ "image_grid_points": null,
+ "initializer_range": 0.02,
+ "intermediate_size": 11008,
+ "max_length": 4096,
+ "max_position_embeddings": 4096,
+ "mm_hidden_size": 1024,
+ "mm_projector_type": "mlp2x_gelu",
+ "mm_resampler_type": null,
+ "mm_use_im_patch_token": false,
+ "mm_use_im_start_end": true,
+ "mm_use_image_start_end": true,
+ "mm_vision_module": "openai/clip-vit-large-patch14-336",
+ "mm_vision_select_feature": "patch",
+ "mm_vision_select_layer": -2,
+ "mm_vision_tower": "openai/clip-vit-large-patch14-336",
+ "model_type": "llava",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 32,
+ "num_key_value_heads": 32,
+ "num_level_reg_features": 4,
+ "num_reg_features": 4,
+ "out_dim": 256,
+ "pad_token_id": 0,
+ "pretrain_mm_mlp_adapter": null,
+ "pretraining_tp": 1,
+ "rms_norm_eps": 1e-05,
+ "rope_scaling": null,
+ "select_feature_type": "patch",
+ "tie_word_embeddings": false,
+ "torch_dtype": "bfloat16",
+ "train_mask_decoder": true,
+ "transformers_version": "4.28.0.dev0",
+ "tune_mlp_adapter": false,
+ "tune_mm_mlp_adapter": false,
+ "tune_mm_vision_resampler": false,
+ "unfreeze_mm_vision_tower": false,
+ "use_cache": false,
+ "use_image_patch_token": false,
+ "use_mm_proj": true,
+ "vision_module": "openai/clip-vit-large-patch14-336",
+ "vision_tower": "openai/clip-vit-large-patch14-336",
+ "vocab_size": 32007,
+ "with_region": true
+ }
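
`config.json` describes a LLaVA-style model: a LLaMA-7B-scale language backbone (hidden size 4096, 32 layers, 32 attention heads), a CLIP ViT-L/14-336 vision tower, an `mlp2x_gelu` projector, and GLaMM-specific flags such as `with_region` and `train_mask_decoder`. The sketch below, which again assumes a local clone at `GLaMM-FullScope/`, just loads the file and prints the fields that identify this configuration.

```python
# Minimal sketch: read config.json from a local clone and surface the fields that
# identify the architecture (LLaMA-7B-scale LLaVA backbone + CLIP ViT-L/14-336 tower).
import json

with open("GLaMM-FullScope/config.json") as f:
    cfg = json.load(f)

for key in ("model_type", "hidden_size", "num_hidden_layers", "num_attention_heads",
            "vision_tower", "mm_projector_type", "vocab_size", "torch_dtype"):
    print(f"{key:>22}: {cfg[key]}")

head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]   # 4096 / 32 = 128
print(f"{'head_dim':>22}: {head_dim}")
```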
groundingLMM/GLaMM-FullScope/generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "max_length": 4096,
+ "pad_token_id": 0,
+ "transformers_version": "4.28.0.dev0",
+ "use_cache": false
+ }
groundingLMM/GLaMM-FullScope/pytorch_model.bin.index.json ADDED
@@ -0,0 +1,975 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 16752883392
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "pytorch_model-00002-of-00002.bin",
7
+ "model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
8
+ "model.grounding_encoder.image_encoder.blocks.0.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
9
+ "model.grounding_encoder.image_encoder.blocks.0.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
10
+ "model.grounding_encoder.image_encoder.blocks.0.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
11
+ "model.grounding_encoder.image_encoder.blocks.0.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
12
+ "model.grounding_encoder.image_encoder.blocks.0.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
13
+ "model.grounding_encoder.image_encoder.blocks.0.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
14
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
15
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
16
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
17
+ "model.grounding_encoder.image_encoder.blocks.0.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
18
+ "model.grounding_encoder.image_encoder.blocks.0.norm1.bias": "pytorch_model-00002-of-00002.bin",
19
+ "model.grounding_encoder.image_encoder.blocks.0.norm1.weight": "pytorch_model-00002-of-00002.bin",
20
+ "model.grounding_encoder.image_encoder.blocks.0.norm2.bias": "pytorch_model-00002-of-00002.bin",
21
+ "model.grounding_encoder.image_encoder.blocks.0.norm2.weight": "pytorch_model-00002-of-00002.bin",
22
+ "model.grounding_encoder.image_encoder.blocks.1.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
23
+ "model.grounding_encoder.image_encoder.blocks.1.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
24
+ "model.grounding_encoder.image_encoder.blocks.1.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
25
+ "model.grounding_encoder.image_encoder.blocks.1.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
26
+ "model.grounding_encoder.image_encoder.blocks.1.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
27
+ "model.grounding_encoder.image_encoder.blocks.1.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
28
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
29
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
30
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
31
+ "model.grounding_encoder.image_encoder.blocks.1.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
32
+ "model.grounding_encoder.image_encoder.blocks.1.norm1.bias": "pytorch_model-00002-of-00002.bin",
33
+ "model.grounding_encoder.image_encoder.blocks.1.norm1.weight": "pytorch_model-00002-of-00002.bin",
34
+ "model.grounding_encoder.image_encoder.blocks.1.norm2.bias": "pytorch_model-00002-of-00002.bin",
35
+ "model.grounding_encoder.image_encoder.blocks.1.norm2.weight": "pytorch_model-00002-of-00002.bin",
36
+ "model.grounding_encoder.image_encoder.blocks.10.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
37
+ "model.grounding_encoder.image_encoder.blocks.10.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
38
+ "model.grounding_encoder.image_encoder.blocks.10.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
39
+ "model.grounding_encoder.image_encoder.blocks.10.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
40
+ "model.grounding_encoder.image_encoder.blocks.10.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
41
+ "model.grounding_encoder.image_encoder.blocks.10.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
42
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
43
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
44
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
45
+ "model.grounding_encoder.image_encoder.blocks.10.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
46
+ "model.grounding_encoder.image_encoder.blocks.10.norm1.bias": "pytorch_model-00002-of-00002.bin",
47
+ "model.grounding_encoder.image_encoder.blocks.10.norm1.weight": "pytorch_model-00002-of-00002.bin",
48
+ "model.grounding_encoder.image_encoder.blocks.10.norm2.bias": "pytorch_model-00002-of-00002.bin",
49
+ "model.grounding_encoder.image_encoder.blocks.10.norm2.weight": "pytorch_model-00002-of-00002.bin",
50
+ "model.grounding_encoder.image_encoder.blocks.11.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
51
+ "model.grounding_encoder.image_encoder.blocks.11.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
52
+ "model.grounding_encoder.image_encoder.blocks.11.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
53
+ "model.grounding_encoder.image_encoder.blocks.11.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
54
+ "model.grounding_encoder.image_encoder.blocks.11.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
55
+ "model.grounding_encoder.image_encoder.blocks.11.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
56
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
57
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
58
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
59
+ "model.grounding_encoder.image_encoder.blocks.11.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
60
+ "model.grounding_encoder.image_encoder.blocks.11.norm1.bias": "pytorch_model-00002-of-00002.bin",
61
+ "model.grounding_encoder.image_encoder.blocks.11.norm1.weight": "pytorch_model-00002-of-00002.bin",
62
+ "model.grounding_encoder.image_encoder.blocks.11.norm2.bias": "pytorch_model-00002-of-00002.bin",
63
+ "model.grounding_encoder.image_encoder.blocks.11.norm2.weight": "pytorch_model-00002-of-00002.bin",
64
+ "model.grounding_encoder.image_encoder.blocks.12.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
65
+ "model.grounding_encoder.image_encoder.blocks.12.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
66
+ "model.grounding_encoder.image_encoder.blocks.12.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
67
+ "model.grounding_encoder.image_encoder.blocks.12.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
68
+ "model.grounding_encoder.image_encoder.blocks.12.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
69
+ "model.grounding_encoder.image_encoder.blocks.12.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
70
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
71
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
72
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
73
+ "model.grounding_encoder.image_encoder.blocks.12.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
74
+ "model.grounding_encoder.image_encoder.blocks.12.norm1.bias": "pytorch_model-00002-of-00002.bin",
75
+ "model.grounding_encoder.image_encoder.blocks.12.norm1.weight": "pytorch_model-00002-of-00002.bin",
76
+ "model.grounding_encoder.image_encoder.blocks.12.norm2.bias": "pytorch_model-00002-of-00002.bin",
77
+ "model.grounding_encoder.image_encoder.blocks.12.norm2.weight": "pytorch_model-00002-of-00002.bin",
78
+ "model.grounding_encoder.image_encoder.blocks.13.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
79
+ "model.grounding_encoder.image_encoder.blocks.13.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
80
+ "model.grounding_encoder.image_encoder.blocks.13.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
81
+ "model.grounding_encoder.image_encoder.blocks.13.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
82
+ "model.grounding_encoder.image_encoder.blocks.13.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
83
+ "model.grounding_encoder.image_encoder.blocks.13.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
84
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
85
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
86
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
87
+ "model.grounding_encoder.image_encoder.blocks.13.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
88
+ "model.grounding_encoder.image_encoder.blocks.13.norm1.bias": "pytorch_model-00002-of-00002.bin",
89
+ "model.grounding_encoder.image_encoder.blocks.13.norm1.weight": "pytorch_model-00002-of-00002.bin",
90
+ "model.grounding_encoder.image_encoder.blocks.13.norm2.bias": "pytorch_model-00002-of-00002.bin",
91
+ "model.grounding_encoder.image_encoder.blocks.13.norm2.weight": "pytorch_model-00002-of-00002.bin",
92
+ "model.grounding_encoder.image_encoder.blocks.14.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
93
+ "model.grounding_encoder.image_encoder.blocks.14.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
94
+ "model.grounding_encoder.image_encoder.blocks.14.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
95
+ "model.grounding_encoder.image_encoder.blocks.14.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
96
+ "model.grounding_encoder.image_encoder.blocks.14.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
97
+ "model.grounding_encoder.image_encoder.blocks.14.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
98
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
99
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
100
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
101
+ "model.grounding_encoder.image_encoder.blocks.14.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
102
+ "model.grounding_encoder.image_encoder.blocks.14.norm1.bias": "pytorch_model-00002-of-00002.bin",
103
+ "model.grounding_encoder.image_encoder.blocks.14.norm1.weight": "pytorch_model-00002-of-00002.bin",
104
+ "model.grounding_encoder.image_encoder.blocks.14.norm2.bias": "pytorch_model-00002-of-00002.bin",
105
+ "model.grounding_encoder.image_encoder.blocks.14.norm2.weight": "pytorch_model-00002-of-00002.bin",
106
+ "model.grounding_encoder.image_encoder.blocks.15.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
107
+ "model.grounding_encoder.image_encoder.blocks.15.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
108
+ "model.grounding_encoder.image_encoder.blocks.15.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
109
+ "model.grounding_encoder.image_encoder.blocks.15.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
110
+ "model.grounding_encoder.image_encoder.blocks.15.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
111
+ "model.grounding_encoder.image_encoder.blocks.15.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
112
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
113
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
114
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
115
+ "model.grounding_encoder.image_encoder.blocks.15.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
116
+ "model.grounding_encoder.image_encoder.blocks.15.norm1.bias": "pytorch_model-00002-of-00002.bin",
117
+ "model.grounding_encoder.image_encoder.blocks.15.norm1.weight": "pytorch_model-00002-of-00002.bin",
118
+ "model.grounding_encoder.image_encoder.blocks.15.norm2.bias": "pytorch_model-00002-of-00002.bin",
119
+ "model.grounding_encoder.image_encoder.blocks.15.norm2.weight": "pytorch_model-00002-of-00002.bin",
120
+ "model.grounding_encoder.image_encoder.blocks.16.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
121
+ "model.grounding_encoder.image_encoder.blocks.16.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
122
+ "model.grounding_encoder.image_encoder.blocks.16.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
123
+ "model.grounding_encoder.image_encoder.blocks.16.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
124
+ "model.grounding_encoder.image_encoder.blocks.16.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
125
+ "model.grounding_encoder.image_encoder.blocks.16.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
126
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
127
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
128
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
129
+ "model.grounding_encoder.image_encoder.blocks.16.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
130
+ "model.grounding_encoder.image_encoder.blocks.16.norm1.bias": "pytorch_model-00002-of-00002.bin",
131
+ "model.grounding_encoder.image_encoder.blocks.16.norm1.weight": "pytorch_model-00002-of-00002.bin",
132
+ "model.grounding_encoder.image_encoder.blocks.16.norm2.bias": "pytorch_model-00002-of-00002.bin",
133
+ "model.grounding_encoder.image_encoder.blocks.16.norm2.weight": "pytorch_model-00002-of-00002.bin",
134
+ "model.grounding_encoder.image_encoder.blocks.17.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
135
+ "model.grounding_encoder.image_encoder.blocks.17.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
136
+ "model.grounding_encoder.image_encoder.blocks.17.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
137
+ "model.grounding_encoder.image_encoder.blocks.17.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
138
+ "model.grounding_encoder.image_encoder.blocks.17.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
139
+ "model.grounding_encoder.image_encoder.blocks.17.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
140
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
141
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
142
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
143
+ "model.grounding_encoder.image_encoder.blocks.17.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
144
+ "model.grounding_encoder.image_encoder.blocks.17.norm1.bias": "pytorch_model-00002-of-00002.bin",
145
+ "model.grounding_encoder.image_encoder.blocks.17.norm1.weight": "pytorch_model-00002-of-00002.bin",
146
+ "model.grounding_encoder.image_encoder.blocks.17.norm2.bias": "pytorch_model-00002-of-00002.bin",
147
+ "model.grounding_encoder.image_encoder.blocks.17.norm2.weight": "pytorch_model-00002-of-00002.bin",
148
+ "model.grounding_encoder.image_encoder.blocks.18.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
149
+ "model.grounding_encoder.image_encoder.blocks.18.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
150
+ "model.grounding_encoder.image_encoder.blocks.18.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
151
+ "model.grounding_encoder.image_encoder.blocks.18.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
152
+ "model.grounding_encoder.image_encoder.blocks.18.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
153
+ "model.grounding_encoder.image_encoder.blocks.18.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
154
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
155
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
156
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
157
+ "model.grounding_encoder.image_encoder.blocks.18.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
158
+ "model.grounding_encoder.image_encoder.blocks.18.norm1.bias": "pytorch_model-00002-of-00002.bin",
159
+ "model.grounding_encoder.image_encoder.blocks.18.norm1.weight": "pytorch_model-00002-of-00002.bin",
160
+ "model.grounding_encoder.image_encoder.blocks.18.norm2.bias": "pytorch_model-00002-of-00002.bin",
161
+ "model.grounding_encoder.image_encoder.blocks.18.norm2.weight": "pytorch_model-00002-of-00002.bin",
162
+ "model.grounding_encoder.image_encoder.blocks.19.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
163
+ "model.grounding_encoder.image_encoder.blocks.19.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
164
+ "model.grounding_encoder.image_encoder.blocks.19.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
165
+ "model.grounding_encoder.image_encoder.blocks.19.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
166
+ "model.grounding_encoder.image_encoder.blocks.19.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
167
+ "model.grounding_encoder.image_encoder.blocks.19.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
168
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
169
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
170
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
171
+ "model.grounding_encoder.image_encoder.blocks.19.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
172
+ "model.grounding_encoder.image_encoder.blocks.19.norm1.bias": "pytorch_model-00002-of-00002.bin",
173
+ "model.grounding_encoder.image_encoder.blocks.19.norm1.weight": "pytorch_model-00002-of-00002.bin",
174
+ "model.grounding_encoder.image_encoder.blocks.19.norm2.bias": "pytorch_model-00002-of-00002.bin",
175
+ "model.grounding_encoder.image_encoder.blocks.19.norm2.weight": "pytorch_model-00002-of-00002.bin",
176
+ "model.grounding_encoder.image_encoder.blocks.2.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
177
+ "model.grounding_encoder.image_encoder.blocks.2.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
178
+ "model.grounding_encoder.image_encoder.blocks.2.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
179
+ "model.grounding_encoder.image_encoder.blocks.2.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
180
+ "model.grounding_encoder.image_encoder.blocks.2.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
181
+ "model.grounding_encoder.image_encoder.blocks.2.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
182
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
183
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
184
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
185
+ "model.grounding_encoder.image_encoder.blocks.2.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
186
+ "model.grounding_encoder.image_encoder.blocks.2.norm1.bias": "pytorch_model-00002-of-00002.bin",
187
+ "model.grounding_encoder.image_encoder.blocks.2.norm1.weight": "pytorch_model-00002-of-00002.bin",
188
+ "model.grounding_encoder.image_encoder.blocks.2.norm2.bias": "pytorch_model-00002-of-00002.bin",
189
+ "model.grounding_encoder.image_encoder.blocks.2.norm2.weight": "pytorch_model-00002-of-00002.bin",
190
+ "model.grounding_encoder.image_encoder.blocks.20.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
191
+ "model.grounding_encoder.image_encoder.blocks.20.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
192
+ "model.grounding_encoder.image_encoder.blocks.20.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
193
+ "model.grounding_encoder.image_encoder.blocks.20.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
194
+ "model.grounding_encoder.image_encoder.blocks.20.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
195
+ "model.grounding_encoder.image_encoder.blocks.20.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
196
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
197
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
198
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
199
+ "model.grounding_encoder.image_encoder.blocks.20.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
200
+ "model.grounding_encoder.image_encoder.blocks.20.norm1.bias": "pytorch_model-00002-of-00002.bin",
201
+ "model.grounding_encoder.image_encoder.blocks.20.norm1.weight": "pytorch_model-00002-of-00002.bin",
202
+ "model.grounding_encoder.image_encoder.blocks.20.norm2.bias": "pytorch_model-00002-of-00002.bin",
203
+ "model.grounding_encoder.image_encoder.blocks.20.norm2.weight": "pytorch_model-00002-of-00002.bin",
204
+ "model.grounding_encoder.image_encoder.blocks.21.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
205
+ "model.grounding_encoder.image_encoder.blocks.21.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
206
+ "model.grounding_encoder.image_encoder.blocks.21.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
207
+ "model.grounding_encoder.image_encoder.blocks.21.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
208
+ "model.grounding_encoder.image_encoder.blocks.21.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
209
+ "model.grounding_encoder.image_encoder.blocks.21.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
210
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
211
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
212
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
213
+ "model.grounding_encoder.image_encoder.blocks.21.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
214
+ "model.grounding_encoder.image_encoder.blocks.21.norm1.bias": "pytorch_model-00002-of-00002.bin",
215
+ "model.grounding_encoder.image_encoder.blocks.21.norm1.weight": "pytorch_model-00002-of-00002.bin",
216
+ "model.grounding_encoder.image_encoder.blocks.21.norm2.bias": "pytorch_model-00002-of-00002.bin",
217
+ "model.grounding_encoder.image_encoder.blocks.21.norm2.weight": "pytorch_model-00002-of-00002.bin",
218
+ "model.grounding_encoder.image_encoder.blocks.22.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
219
+ "model.grounding_encoder.image_encoder.blocks.22.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
220
+ "model.grounding_encoder.image_encoder.blocks.22.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
221
+ "model.grounding_encoder.image_encoder.blocks.22.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
222
+ "model.grounding_encoder.image_encoder.blocks.22.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
223
+ "model.grounding_encoder.image_encoder.blocks.22.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
224
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
225
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
226
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
227
+ "model.grounding_encoder.image_encoder.blocks.22.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
228
+ "model.grounding_encoder.image_encoder.blocks.22.norm1.bias": "pytorch_model-00002-of-00002.bin",
229
+ "model.grounding_encoder.image_encoder.blocks.22.norm1.weight": "pytorch_model-00002-of-00002.bin",
230
+ "model.grounding_encoder.image_encoder.blocks.22.norm2.bias": "pytorch_model-00002-of-00002.bin",
231
+ "model.grounding_encoder.image_encoder.blocks.22.norm2.weight": "pytorch_model-00002-of-00002.bin",
232
+ "model.grounding_encoder.image_encoder.blocks.23.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
233
+ "model.grounding_encoder.image_encoder.blocks.23.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
234
+ "model.grounding_encoder.image_encoder.blocks.23.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
235
+ "model.grounding_encoder.image_encoder.blocks.23.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
236
+ "model.grounding_encoder.image_encoder.blocks.23.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
237
+ "model.grounding_encoder.image_encoder.blocks.23.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
238
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
239
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
240
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
241
+ "model.grounding_encoder.image_encoder.blocks.23.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
242
+ "model.grounding_encoder.image_encoder.blocks.23.norm1.bias": "pytorch_model-00002-of-00002.bin",
243
+ "model.grounding_encoder.image_encoder.blocks.23.norm1.weight": "pytorch_model-00002-of-00002.bin",
244
+ "model.grounding_encoder.image_encoder.blocks.23.norm2.bias": "pytorch_model-00002-of-00002.bin",
245
+ "model.grounding_encoder.image_encoder.blocks.23.norm2.weight": "pytorch_model-00002-of-00002.bin",
246
+ "model.grounding_encoder.image_encoder.blocks.24.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
247
+ "model.grounding_encoder.image_encoder.blocks.24.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
248
+ "model.grounding_encoder.image_encoder.blocks.24.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
249
+ "model.grounding_encoder.image_encoder.blocks.24.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
250
+ "model.grounding_encoder.image_encoder.blocks.24.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
251
+ "model.grounding_encoder.image_encoder.blocks.24.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
252
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
253
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
254
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
255
+ "model.grounding_encoder.image_encoder.blocks.24.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
256
+ "model.grounding_encoder.image_encoder.blocks.24.norm1.bias": "pytorch_model-00002-of-00002.bin",
257
+ "model.grounding_encoder.image_encoder.blocks.24.norm1.weight": "pytorch_model-00002-of-00002.bin",
258
+ "model.grounding_encoder.image_encoder.blocks.24.norm2.bias": "pytorch_model-00002-of-00002.bin",
259
+ "model.grounding_encoder.image_encoder.blocks.24.norm2.weight": "pytorch_model-00002-of-00002.bin",
260
+ "model.grounding_encoder.image_encoder.blocks.25.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
261
+ "model.grounding_encoder.image_encoder.blocks.25.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
262
+ "model.grounding_encoder.image_encoder.blocks.25.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
263
+ "model.grounding_encoder.image_encoder.blocks.25.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
264
+ "model.grounding_encoder.image_encoder.blocks.25.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
265
+ "model.grounding_encoder.image_encoder.blocks.25.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
266
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
267
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
268
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
269
+ "model.grounding_encoder.image_encoder.blocks.25.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
270
+ "model.grounding_encoder.image_encoder.blocks.25.norm1.bias": "pytorch_model-00002-of-00002.bin",
271
+ "model.grounding_encoder.image_encoder.blocks.25.norm1.weight": "pytorch_model-00002-of-00002.bin",
272
+ "model.grounding_encoder.image_encoder.blocks.25.norm2.bias": "pytorch_model-00002-of-00002.bin",
273
+ "model.grounding_encoder.image_encoder.blocks.25.norm2.weight": "pytorch_model-00002-of-00002.bin",
274
+ "model.grounding_encoder.image_encoder.blocks.26.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
275
+ "model.grounding_encoder.image_encoder.blocks.26.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
276
+ "model.grounding_encoder.image_encoder.blocks.26.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
277
+ "model.grounding_encoder.image_encoder.blocks.26.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
278
+ "model.grounding_encoder.image_encoder.blocks.26.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
279
+ "model.grounding_encoder.image_encoder.blocks.26.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
280
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
281
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
282
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
283
+ "model.grounding_encoder.image_encoder.blocks.26.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
284
+ "model.grounding_encoder.image_encoder.blocks.26.norm1.bias": "pytorch_model-00002-of-00002.bin",
285
+ "model.grounding_encoder.image_encoder.blocks.26.norm1.weight": "pytorch_model-00002-of-00002.bin",
286
+ "model.grounding_encoder.image_encoder.blocks.26.norm2.bias": "pytorch_model-00002-of-00002.bin",
287
+ "model.grounding_encoder.image_encoder.blocks.26.norm2.weight": "pytorch_model-00002-of-00002.bin",
288
+ "model.grounding_encoder.image_encoder.blocks.27.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
289
+ "model.grounding_encoder.image_encoder.blocks.27.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
290
+ "model.grounding_encoder.image_encoder.blocks.27.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
291
+ "model.grounding_encoder.image_encoder.blocks.27.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
292
+ "model.grounding_encoder.image_encoder.blocks.27.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
293
+ "model.grounding_encoder.image_encoder.blocks.27.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
294
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
295
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
296
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
297
+ "model.grounding_encoder.image_encoder.blocks.27.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
298
+ "model.grounding_encoder.image_encoder.blocks.27.norm1.bias": "pytorch_model-00002-of-00002.bin",
299
+ "model.grounding_encoder.image_encoder.blocks.27.norm1.weight": "pytorch_model-00002-of-00002.bin",
300
+ "model.grounding_encoder.image_encoder.blocks.27.norm2.bias": "pytorch_model-00002-of-00002.bin",
301
+ "model.grounding_encoder.image_encoder.blocks.27.norm2.weight": "pytorch_model-00002-of-00002.bin",
302
+ "model.grounding_encoder.image_encoder.blocks.28.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
303
+ "model.grounding_encoder.image_encoder.blocks.28.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
304
+ "model.grounding_encoder.image_encoder.blocks.28.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
305
+ "model.grounding_encoder.image_encoder.blocks.28.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
306
+ "model.grounding_encoder.image_encoder.blocks.28.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
307
+ "model.grounding_encoder.image_encoder.blocks.28.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
308
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
309
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
310
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
311
+ "model.grounding_encoder.image_encoder.blocks.28.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
312
+ "model.grounding_encoder.image_encoder.blocks.28.norm1.bias": "pytorch_model-00002-of-00002.bin",
313
+ "model.grounding_encoder.image_encoder.blocks.28.norm1.weight": "pytorch_model-00002-of-00002.bin",
314
+ "model.grounding_encoder.image_encoder.blocks.28.norm2.bias": "pytorch_model-00002-of-00002.bin",
315
+ "model.grounding_encoder.image_encoder.blocks.28.norm2.weight": "pytorch_model-00002-of-00002.bin",
316
+ "model.grounding_encoder.image_encoder.blocks.29.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
317
+ "model.grounding_encoder.image_encoder.blocks.29.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
318
+ "model.grounding_encoder.image_encoder.blocks.29.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
319
+ "model.grounding_encoder.image_encoder.blocks.29.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
320
+ "model.grounding_encoder.image_encoder.blocks.29.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
321
+ "model.grounding_encoder.image_encoder.blocks.29.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
322
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
323
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
324
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
325
+ "model.grounding_encoder.image_encoder.blocks.29.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
326
+ "model.grounding_encoder.image_encoder.blocks.29.norm1.bias": "pytorch_model-00002-of-00002.bin",
327
+ "model.grounding_encoder.image_encoder.blocks.29.norm1.weight": "pytorch_model-00002-of-00002.bin",
328
+ "model.grounding_encoder.image_encoder.blocks.29.norm2.bias": "pytorch_model-00002-of-00002.bin",
329
+ "model.grounding_encoder.image_encoder.blocks.29.norm2.weight": "pytorch_model-00002-of-00002.bin",
330
+ "model.grounding_encoder.image_encoder.blocks.3.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
331
+ "model.grounding_encoder.image_encoder.blocks.3.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
332
+ "model.grounding_encoder.image_encoder.blocks.3.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
333
+ "model.grounding_encoder.image_encoder.blocks.3.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
334
+ "model.grounding_encoder.image_encoder.blocks.3.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
335
+ "model.grounding_encoder.image_encoder.blocks.3.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
336
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
337
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
338
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
339
+ "model.grounding_encoder.image_encoder.blocks.3.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
340
+ "model.grounding_encoder.image_encoder.blocks.3.norm1.bias": "pytorch_model-00002-of-00002.bin",
341
+ "model.grounding_encoder.image_encoder.blocks.3.norm1.weight": "pytorch_model-00002-of-00002.bin",
342
+ "model.grounding_encoder.image_encoder.blocks.3.norm2.bias": "pytorch_model-00002-of-00002.bin",
343
+ "model.grounding_encoder.image_encoder.blocks.3.norm2.weight": "pytorch_model-00002-of-00002.bin",
344
+ "model.grounding_encoder.image_encoder.blocks.30.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
345
+ "model.grounding_encoder.image_encoder.blocks.30.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
346
+ "model.grounding_encoder.image_encoder.blocks.30.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
347
+ "model.grounding_encoder.image_encoder.blocks.30.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
348
+ "model.grounding_encoder.image_encoder.blocks.30.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
349
+ "model.grounding_encoder.image_encoder.blocks.30.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
350
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
351
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
352
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
353
+ "model.grounding_encoder.image_encoder.blocks.30.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
354
+ "model.grounding_encoder.image_encoder.blocks.30.norm1.bias": "pytorch_model-00002-of-00002.bin",
355
+ "model.grounding_encoder.image_encoder.blocks.30.norm1.weight": "pytorch_model-00002-of-00002.bin",
356
+ "model.grounding_encoder.image_encoder.blocks.30.norm2.bias": "pytorch_model-00002-of-00002.bin",
357
+ "model.grounding_encoder.image_encoder.blocks.30.norm2.weight": "pytorch_model-00002-of-00002.bin",
358
+ "model.grounding_encoder.image_encoder.blocks.31.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
359
+ "model.grounding_encoder.image_encoder.blocks.31.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
360
+ "model.grounding_encoder.image_encoder.blocks.31.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
361
+ "model.grounding_encoder.image_encoder.blocks.31.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
362
+ "model.grounding_encoder.image_encoder.blocks.31.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
363
+ "model.grounding_encoder.image_encoder.blocks.31.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
364
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
365
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
366
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
367
+ "model.grounding_encoder.image_encoder.blocks.31.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
368
+ "model.grounding_encoder.image_encoder.blocks.31.norm1.bias": "pytorch_model-00002-of-00002.bin",
369
+ "model.grounding_encoder.image_encoder.blocks.31.norm1.weight": "pytorch_model-00002-of-00002.bin",
370
+ "model.grounding_encoder.image_encoder.blocks.31.norm2.bias": "pytorch_model-00002-of-00002.bin",
371
+ "model.grounding_encoder.image_encoder.blocks.31.norm2.weight": "pytorch_model-00002-of-00002.bin",
372
+ "model.grounding_encoder.image_encoder.blocks.4.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
373
+ "model.grounding_encoder.image_encoder.blocks.4.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
374
+ "model.grounding_encoder.image_encoder.blocks.4.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
375
+ "model.grounding_encoder.image_encoder.blocks.4.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
376
+ "model.grounding_encoder.image_encoder.blocks.4.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
377
+ "model.grounding_encoder.image_encoder.blocks.4.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
378
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
379
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
380
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
381
+ "model.grounding_encoder.image_encoder.blocks.4.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
382
+ "model.grounding_encoder.image_encoder.blocks.4.norm1.bias": "pytorch_model-00002-of-00002.bin",
383
+ "model.grounding_encoder.image_encoder.blocks.4.norm1.weight": "pytorch_model-00002-of-00002.bin",
384
+ "model.grounding_encoder.image_encoder.blocks.4.norm2.bias": "pytorch_model-00002-of-00002.bin",
385
+ "model.grounding_encoder.image_encoder.blocks.4.norm2.weight": "pytorch_model-00002-of-00002.bin",
386
+ "model.grounding_encoder.image_encoder.blocks.5.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
387
+ "model.grounding_encoder.image_encoder.blocks.5.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
388
+ "model.grounding_encoder.image_encoder.blocks.5.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
389
+ "model.grounding_encoder.image_encoder.blocks.5.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
390
+ "model.grounding_encoder.image_encoder.blocks.5.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
391
+ "model.grounding_encoder.image_encoder.blocks.5.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
392
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
393
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
394
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
395
+ "model.grounding_encoder.image_encoder.blocks.5.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
396
+ "model.grounding_encoder.image_encoder.blocks.5.norm1.bias": "pytorch_model-00002-of-00002.bin",
397
+ "model.grounding_encoder.image_encoder.blocks.5.norm1.weight": "pytorch_model-00002-of-00002.bin",
398
+ "model.grounding_encoder.image_encoder.blocks.5.norm2.bias": "pytorch_model-00002-of-00002.bin",
399
+ "model.grounding_encoder.image_encoder.blocks.5.norm2.weight": "pytorch_model-00002-of-00002.bin",
400
+ "model.grounding_encoder.image_encoder.blocks.6.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
401
+ "model.grounding_encoder.image_encoder.blocks.6.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
402
+ "model.grounding_encoder.image_encoder.blocks.6.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
403
+ "model.grounding_encoder.image_encoder.blocks.6.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
404
+ "model.grounding_encoder.image_encoder.blocks.6.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
405
+ "model.grounding_encoder.image_encoder.blocks.6.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
406
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
407
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
408
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
409
+ "model.grounding_encoder.image_encoder.blocks.6.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
410
+ "model.grounding_encoder.image_encoder.blocks.6.norm1.bias": "pytorch_model-00002-of-00002.bin",
411
+ "model.grounding_encoder.image_encoder.blocks.6.norm1.weight": "pytorch_model-00002-of-00002.bin",
412
+ "model.grounding_encoder.image_encoder.blocks.6.norm2.bias": "pytorch_model-00002-of-00002.bin",
413
+ "model.grounding_encoder.image_encoder.blocks.6.norm2.weight": "pytorch_model-00002-of-00002.bin",
414
+ "model.grounding_encoder.image_encoder.blocks.7.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
415
+ "model.grounding_encoder.image_encoder.blocks.7.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
416
+ "model.grounding_encoder.image_encoder.blocks.7.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
417
+ "model.grounding_encoder.image_encoder.blocks.7.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
418
+ "model.grounding_encoder.image_encoder.blocks.7.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
419
+ "model.grounding_encoder.image_encoder.blocks.7.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
420
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
421
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
422
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
423
+ "model.grounding_encoder.image_encoder.blocks.7.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
424
+ "model.grounding_encoder.image_encoder.blocks.7.norm1.bias": "pytorch_model-00002-of-00002.bin",
425
+ "model.grounding_encoder.image_encoder.blocks.7.norm1.weight": "pytorch_model-00002-of-00002.bin",
426
+ "model.grounding_encoder.image_encoder.blocks.7.norm2.bias": "pytorch_model-00002-of-00002.bin",
427
+ "model.grounding_encoder.image_encoder.blocks.7.norm2.weight": "pytorch_model-00002-of-00002.bin",
428
+ "model.grounding_encoder.image_encoder.blocks.8.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
429
+ "model.grounding_encoder.image_encoder.blocks.8.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
430
+ "model.grounding_encoder.image_encoder.blocks.8.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
431
+ "model.grounding_encoder.image_encoder.blocks.8.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
432
+ "model.grounding_encoder.image_encoder.blocks.8.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
433
+ "model.grounding_encoder.image_encoder.blocks.8.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
434
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
435
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
436
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
437
+ "model.grounding_encoder.image_encoder.blocks.8.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
438
+ "model.grounding_encoder.image_encoder.blocks.8.norm1.bias": "pytorch_model-00002-of-00002.bin",
439
+ "model.grounding_encoder.image_encoder.blocks.8.norm1.weight": "pytorch_model-00002-of-00002.bin",
440
+ "model.grounding_encoder.image_encoder.blocks.8.norm2.bias": "pytorch_model-00002-of-00002.bin",
441
+ "model.grounding_encoder.image_encoder.blocks.8.norm2.weight": "pytorch_model-00002-of-00002.bin",
442
+ "model.grounding_encoder.image_encoder.blocks.9.attn.proj.bias": "pytorch_model-00002-of-00002.bin",
443
+ "model.grounding_encoder.image_encoder.blocks.9.attn.proj.weight": "pytorch_model-00002-of-00002.bin",
444
+ "model.grounding_encoder.image_encoder.blocks.9.attn.qkv.bias": "pytorch_model-00002-of-00002.bin",
445
+ "model.grounding_encoder.image_encoder.blocks.9.attn.qkv.weight": "pytorch_model-00002-of-00002.bin",
446
+ "model.grounding_encoder.image_encoder.blocks.9.attn.rel_pos_h": "pytorch_model-00002-of-00002.bin",
447
+ "model.grounding_encoder.image_encoder.blocks.9.attn.rel_pos_w": "pytorch_model-00002-of-00002.bin",
448
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
449
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
450
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
451
+ "model.grounding_encoder.image_encoder.blocks.9.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
452
+ "model.grounding_encoder.image_encoder.blocks.9.norm1.bias": "pytorch_model-00002-of-00002.bin",
453
+ "model.grounding_encoder.image_encoder.blocks.9.norm1.weight": "pytorch_model-00002-of-00002.bin",
454
+ "model.grounding_encoder.image_encoder.blocks.9.norm2.bias": "pytorch_model-00002-of-00002.bin",
455
+ "model.grounding_encoder.image_encoder.blocks.9.norm2.weight": "pytorch_model-00002-of-00002.bin",
456
+ "model.grounding_encoder.image_encoder.neck.0.weight": "pytorch_model-00002-of-00002.bin",
457
+ "model.grounding_encoder.image_encoder.neck.1.bias": "pytorch_model-00002-of-00002.bin",
458
+ "model.grounding_encoder.image_encoder.neck.1.weight": "pytorch_model-00002-of-00002.bin",
459
+ "model.grounding_encoder.image_encoder.neck.2.weight": "pytorch_model-00002-of-00002.bin",
460
+ "model.grounding_encoder.image_encoder.neck.3.bias": "pytorch_model-00002-of-00002.bin",
461
+ "model.grounding_encoder.image_encoder.neck.3.weight": "pytorch_model-00002-of-00002.bin",
462
+ "model.grounding_encoder.image_encoder.patch_embed.proj.bias": "pytorch_model-00002-of-00002.bin",
463
+ "model.grounding_encoder.image_encoder.patch_embed.proj.weight": "pytorch_model-00002-of-00002.bin",
464
+ "model.grounding_encoder.image_encoder.pos_embed": "pytorch_model-00002-of-00002.bin",
465
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.0.bias": "pytorch_model-00002-of-00002.bin",
466
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.0.weight": "pytorch_model-00002-of-00002.bin",
467
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.1.bias": "pytorch_model-00002-of-00002.bin",
468
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.1.weight": "pytorch_model-00002-of-00002.bin",
469
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.2.bias": "pytorch_model-00002-of-00002.bin",
470
+ "model.grounding_encoder.mask_decoder.iou_prediction_head.layers.2.weight": "pytorch_model-00002-of-00002.bin",
471
+ "model.grounding_encoder.mask_decoder.iou_token.weight": "pytorch_model-00002-of-00002.bin",
472
+ "model.grounding_encoder.mask_decoder.mask_tokens.weight": "pytorch_model-00002-of-00002.bin",
473
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.0.bias": "pytorch_model-00002-of-00002.bin",
474
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.0.weight": "pytorch_model-00002-of-00002.bin",
475
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.1.bias": "pytorch_model-00002-of-00002.bin",
476
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.1.weight": "pytorch_model-00002-of-00002.bin",
477
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.2.bias": "pytorch_model-00002-of-00002.bin",
478
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.0.layers.2.weight": "pytorch_model-00002-of-00002.bin",
479
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.0.bias": "pytorch_model-00002-of-00002.bin",
480
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.0.weight": "pytorch_model-00002-of-00002.bin",
481
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.1.bias": "pytorch_model-00002-of-00002.bin",
482
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.1.weight": "pytorch_model-00002-of-00002.bin",
483
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.2.bias": "pytorch_model-00002-of-00002.bin",
484
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.1.layers.2.weight": "pytorch_model-00002-of-00002.bin",
485
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.0.bias": "pytorch_model-00002-of-00002.bin",
486
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.0.weight": "pytorch_model-00002-of-00002.bin",
487
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.1.bias": "pytorch_model-00002-of-00002.bin",
488
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.1.weight": "pytorch_model-00002-of-00002.bin",
489
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.2.bias": "pytorch_model-00002-of-00002.bin",
490
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.2.layers.2.weight": "pytorch_model-00002-of-00002.bin",
491
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.0.bias": "pytorch_model-00002-of-00002.bin",
492
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.0.weight": "pytorch_model-00002-of-00002.bin",
493
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.1.bias": "pytorch_model-00002-of-00002.bin",
494
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.1.weight": "pytorch_model-00002-of-00002.bin",
495
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.2.bias": "pytorch_model-00002-of-00002.bin",
496
+ "model.grounding_encoder.mask_decoder.output_hypernetworks_mlps.3.layers.2.weight": "pytorch_model-00002-of-00002.bin",
497
+ "model.grounding_encoder.mask_decoder.output_upscaling.0.bias": "pytorch_model-00002-of-00002.bin",
498
+ "model.grounding_encoder.mask_decoder.output_upscaling.0.weight": "pytorch_model-00002-of-00002.bin",
499
+ "model.grounding_encoder.mask_decoder.output_upscaling.1.bias": "pytorch_model-00002-of-00002.bin",
500
+ "model.grounding_encoder.mask_decoder.output_upscaling.1.weight": "pytorch_model-00002-of-00002.bin",
501
+ "model.grounding_encoder.mask_decoder.output_upscaling.3.bias": "pytorch_model-00002-of-00002.bin",
502
+ "model.grounding_encoder.mask_decoder.output_upscaling.3.weight": "pytorch_model-00002-of-00002.bin",
503
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.k_proj.bias": "pytorch_model-00002-of-00002.bin",
504
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.k_proj.weight": "pytorch_model-00002-of-00002.bin",
505
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.out_proj.bias": "pytorch_model-00002-of-00002.bin",
506
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.out_proj.weight": "pytorch_model-00002-of-00002.bin",
507
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.q_proj.bias": "pytorch_model-00002-of-00002.bin",
508
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.q_proj.weight": "pytorch_model-00002-of-00002.bin",
509
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.v_proj.bias": "pytorch_model-00002-of-00002.bin",
510
+ "model.grounding_encoder.mask_decoder.transformer.final_attn_token_to_image.v_proj.weight": "pytorch_model-00002-of-00002.bin",
511
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.k_proj.bias": "pytorch_model-00002-of-00002.bin",
512
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.k_proj.weight": "pytorch_model-00002-of-00002.bin",
513
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.out_proj.bias": "pytorch_model-00002-of-00002.bin",
514
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.out_proj.weight": "pytorch_model-00002-of-00002.bin",
515
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.q_proj.bias": "pytorch_model-00002-of-00002.bin",
516
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.q_proj.weight": "pytorch_model-00002-of-00002.bin",
517
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.v_proj.bias": "pytorch_model-00002-of-00002.bin",
518
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_image_to_token.v_proj.weight": "pytorch_model-00002-of-00002.bin",
519
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.k_proj.bias": "pytorch_model-00002-of-00002.bin",
520
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.k_proj.weight": "pytorch_model-00002-of-00002.bin",
521
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.out_proj.bias": "pytorch_model-00002-of-00002.bin",
522
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.out_proj.weight": "pytorch_model-00002-of-00002.bin",
523
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.q_proj.bias": "pytorch_model-00002-of-00002.bin",
524
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.q_proj.weight": "pytorch_model-00002-of-00002.bin",
525
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.v_proj.bias": "pytorch_model-00002-of-00002.bin",
526
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.cross_attn_token_to_image.v_proj.weight": "pytorch_model-00002-of-00002.bin",
527
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
528
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
529
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
530
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
531
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm1.bias": "pytorch_model-00002-of-00002.bin",
532
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm1.weight": "pytorch_model-00002-of-00002.bin",
533
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm2.bias": "pytorch_model-00002-of-00002.bin",
534
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm2.weight": "pytorch_model-00002-of-00002.bin",
535
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm3.bias": "pytorch_model-00002-of-00002.bin",
536
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm3.weight": "pytorch_model-00002-of-00002.bin",
537
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm4.bias": "pytorch_model-00002-of-00002.bin",
538
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.norm4.weight": "pytorch_model-00002-of-00002.bin",
539
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.k_proj.bias": "pytorch_model-00002-of-00002.bin",
540
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
541
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.out_proj.bias": "pytorch_model-00002-of-00002.bin",
542
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.out_proj.weight": "pytorch_model-00002-of-00002.bin",
543
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.q_proj.bias": "pytorch_model-00002-of-00002.bin",
544
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
545
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.v_proj.bias": "pytorch_model-00002-of-00002.bin",
546
+ "model.grounding_encoder.mask_decoder.transformer.layers.0.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
547
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.k_proj.bias": "pytorch_model-00002-of-00002.bin",
548
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.k_proj.weight": "pytorch_model-00002-of-00002.bin",
549
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.out_proj.bias": "pytorch_model-00002-of-00002.bin",
550
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.out_proj.weight": "pytorch_model-00002-of-00002.bin",
551
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.q_proj.bias": "pytorch_model-00002-of-00002.bin",
552
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.q_proj.weight": "pytorch_model-00002-of-00002.bin",
553
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.v_proj.bias": "pytorch_model-00002-of-00002.bin",
554
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_image_to_token.v_proj.weight": "pytorch_model-00002-of-00002.bin",
555
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.k_proj.bias": "pytorch_model-00002-of-00002.bin",
556
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.k_proj.weight": "pytorch_model-00002-of-00002.bin",
557
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.out_proj.bias": "pytorch_model-00002-of-00002.bin",
558
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.out_proj.weight": "pytorch_model-00002-of-00002.bin",
559
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.q_proj.bias": "pytorch_model-00002-of-00002.bin",
560
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.q_proj.weight": "pytorch_model-00002-of-00002.bin",
561
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.v_proj.bias": "pytorch_model-00002-of-00002.bin",
562
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.cross_attn_token_to_image.v_proj.weight": "pytorch_model-00002-of-00002.bin",
563
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin1.bias": "pytorch_model-00002-of-00002.bin",
564
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin1.weight": "pytorch_model-00002-of-00002.bin",
565
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin2.bias": "pytorch_model-00002-of-00002.bin",
566
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.mlp.lin2.weight": "pytorch_model-00002-of-00002.bin",
567
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm1.bias": "pytorch_model-00002-of-00002.bin",
568
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm1.weight": "pytorch_model-00002-of-00002.bin",
569
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm2.bias": "pytorch_model-00002-of-00002.bin",
570
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm2.weight": "pytorch_model-00002-of-00002.bin",
571
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm3.bias": "pytorch_model-00002-of-00002.bin",
572
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm3.weight": "pytorch_model-00002-of-00002.bin",
573
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm4.bias": "pytorch_model-00002-of-00002.bin",
574
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.norm4.weight": "pytorch_model-00002-of-00002.bin",
575
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.k_proj.bias": "pytorch_model-00002-of-00002.bin",
576
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
577
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.out_proj.bias": "pytorch_model-00002-of-00002.bin",
578
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.out_proj.weight": "pytorch_model-00002-of-00002.bin",
579
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.q_proj.bias": "pytorch_model-00002-of-00002.bin",
580
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
581
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.v_proj.bias": "pytorch_model-00002-of-00002.bin",
582
+ "model.grounding_encoder.mask_decoder.transformer.layers.1.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
583
+ "model.grounding_encoder.mask_decoder.transformer.norm_final_attn.bias": "pytorch_model-00002-of-00002.bin",
584
+ "model.grounding_encoder.mask_decoder.transformer.norm_final_attn.weight": "pytorch_model-00002-of-00002.bin",
585
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.0.bias": "pytorch_model-00002-of-00002.bin",
586
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.0.weight": "pytorch_model-00002-of-00002.bin",
587
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.1.bias": "pytorch_model-00002-of-00002.bin",
588
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.1.weight": "pytorch_model-00002-of-00002.bin",
589
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.3.bias": "pytorch_model-00002-of-00002.bin",
590
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.3.weight": "pytorch_model-00002-of-00002.bin",
591
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.4.bias": "pytorch_model-00002-of-00002.bin",
592
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.4.weight": "pytorch_model-00002-of-00002.bin",
593
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.6.bias": "pytorch_model-00002-of-00002.bin",
594
+ "model.grounding_encoder.prompt_encoder.mask_downscaling.6.weight": "pytorch_model-00002-of-00002.bin",
595
+ "model.grounding_encoder.prompt_encoder.no_mask_embed.weight": "pytorch_model-00002-of-00002.bin",
596
+ "model.grounding_encoder.prompt_encoder.not_a_point_embed.weight": "pytorch_model-00002-of-00002.bin",
597
+ "model.grounding_encoder.prompt_encoder.pe_layer.positional_encoding_gaussian_matrix": "pytorch_model-00002-of-00002.bin",
598
+ "model.grounding_encoder.prompt_encoder.point_embeddings.0.weight": "pytorch_model-00002-of-00002.bin",
599
+ "model.grounding_encoder.prompt_encoder.point_embeddings.1.weight": "pytorch_model-00002-of-00002.bin",
600
+ "model.grounding_encoder.prompt_encoder.point_embeddings.2.weight": "pytorch_model-00002-of-00002.bin",
601
+ "model.grounding_encoder.prompt_encoder.point_embeddings.3.weight": "pytorch_model-00002-of-00002.bin",
602
+ "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
603
+ "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
604
+ "model.layers.0.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
605
+ "model.layers.0.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
606
+ "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
607
+ "model.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
608
+ "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
609
+ "model.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
610
+ "model.layers.0.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
611
+ "model.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
612
+ "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
613
+ "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
614
+ "model.layers.1.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
615
+ "model.layers.1.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
616
+ "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
617
+ "model.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
618
+ "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
619
+ "model.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
620
+ "model.layers.1.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
621
+ "model.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
622
+ "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
623
+ "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
624
+ "model.layers.10.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
625
+ "model.layers.10.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
626
+ "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
627
+ "model.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
628
+ "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
629
+ "model.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
630
+ "model.layers.10.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
631
+ "model.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
632
+ "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
633
+ "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
634
+ "model.layers.11.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
635
+ "model.layers.11.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
636
+ "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
637
+ "model.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
638
+ "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
639
+ "model.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
640
+ "model.layers.11.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
641
+ "model.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
642
+ "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
643
+ "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
644
+ "model.layers.12.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
645
+ "model.layers.12.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
646
+ "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
647
+ "model.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
648
+ "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
649
+ "model.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
650
+ "model.layers.12.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
651
+ "model.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
652
+ "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
653
+ "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
654
+ "model.layers.13.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
655
+ "model.layers.13.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
656
+ "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
657
+ "model.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
658
+ "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
659
+ "model.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
660
+ "model.layers.13.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
661
+ "model.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
662
+ "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
663
+ "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
664
+ "model.layers.14.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
665
+ "model.layers.14.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
666
+ "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
667
+ "model.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
668
+ "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
669
+ "model.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
670
+ "model.layers.14.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
671
+ "model.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
672
+ "model.layers.15.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
673
+ "model.layers.15.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
674
+ "model.layers.15.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
675
+ "model.layers.15.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
676
+ "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
677
+ "model.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
678
+ "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
679
+ "model.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
680
+ "model.layers.15.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
681
+ "model.layers.15.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
682
+ "model.layers.16.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
683
+ "model.layers.16.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
684
+ "model.layers.16.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
685
+ "model.layers.16.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
686
+ "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
687
+ "model.layers.16.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
688
+ "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
689
+ "model.layers.16.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
690
+ "model.layers.16.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
691
+ "model.layers.16.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
692
+ "model.layers.17.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
693
+ "model.layers.17.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
694
+ "model.layers.17.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
695
+ "model.layers.17.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
696
+ "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
697
+ "model.layers.17.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
698
+ "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
699
+ "model.layers.17.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
700
+ "model.layers.17.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
701
+ "model.layers.17.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
702
+ "model.layers.18.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
703
+ "model.layers.18.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
704
+ "model.layers.18.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
705
+ "model.layers.18.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
706
+ "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
707
+ "model.layers.18.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
708
+ "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
709
+ "model.layers.18.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
710
+ "model.layers.18.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
711
+ "model.layers.18.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
712
+ "model.layers.19.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
713
+ "model.layers.19.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
714
+ "model.layers.19.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
715
+ "model.layers.19.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
716
+ "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
717
+ "model.layers.19.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
718
+ "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
719
+ "model.layers.19.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
720
+ "model.layers.19.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
721
+ "model.layers.19.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
722
+ "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
723
+ "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
724
+ "model.layers.2.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
725
+ "model.layers.2.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
726
+ "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
727
+ "model.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
728
+ "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
729
+ "model.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
730
+ "model.layers.2.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
731
+ "model.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
732
+ "model.layers.20.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
733
+ "model.layers.20.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
734
+ "model.layers.20.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
735
+ "model.layers.20.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
736
+ "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
737
+ "model.layers.20.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
738
+ "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
739
+ "model.layers.20.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
740
+ "model.layers.20.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
741
+ "model.layers.20.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
742
+ "model.layers.21.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
743
+ "model.layers.21.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
744
+ "model.layers.21.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
745
+ "model.layers.21.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
746
+ "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
747
+ "model.layers.21.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
748
+ "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
749
+ "model.layers.21.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
750
+ "model.layers.21.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
751
+ "model.layers.21.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
752
+ "model.layers.22.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
753
+ "model.layers.22.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
754
+ "model.layers.22.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
755
+ "model.layers.22.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
756
+ "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
757
+ "model.layers.22.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
758
+ "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
759
+ "model.layers.22.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
760
+ "model.layers.22.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
761
+ "model.layers.22.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
762
+ "model.layers.23.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
763
+ "model.layers.23.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
764
+ "model.layers.23.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
765
+ "model.layers.23.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
766
+ "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
767
+ "model.layers.23.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
768
+ "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
769
+ "model.layers.23.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
770
+ "model.layers.23.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
771
+ "model.layers.23.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
772
+ "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
773
+ "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
774
+ "model.layers.24.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
775
+ "model.layers.24.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
776
+ "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
777
+ "model.layers.24.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
778
+ "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
779
+ "model.layers.24.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
780
+ "model.layers.24.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
781
+ "model.layers.24.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
782
+ "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
783
+ "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
784
+ "model.layers.25.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
785
+ "model.layers.25.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
786
+ "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
787
+ "model.layers.25.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
788
+ "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
789
+ "model.layers.25.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
790
+ "model.layers.25.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
791
+ "model.layers.25.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
792
+ "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
793
+ "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
794
+ "model.layers.26.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
795
+ "model.layers.26.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
796
+ "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
797
+ "model.layers.26.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
798
+ "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
799
+ "model.layers.26.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
800
+ "model.layers.26.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
801
+ "model.layers.26.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
802
+ "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
803
+ "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
804
+ "model.layers.27.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
805
+ "model.layers.27.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
806
+ "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
807
+ "model.layers.27.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
808
+ "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
809
+ "model.layers.27.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
810
+ "model.layers.27.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
811
+ "model.layers.27.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
812
+ "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
813
+ "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
814
+ "model.layers.28.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
815
+ "model.layers.28.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
816
+ "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
817
+ "model.layers.28.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
818
+ "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
819
+ "model.layers.28.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
820
+ "model.layers.28.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
821
+ "model.layers.28.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
822
+ "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
823
+ "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
824
+ "model.layers.29.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
825
+ "model.layers.29.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
826
+ "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
827
+ "model.layers.29.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
828
+ "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
829
+ "model.layers.29.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
830
+ "model.layers.29.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
831
+ "model.layers.29.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
832
+ "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
833
+ "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
834
+ "model.layers.3.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
835
+ "model.layers.3.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
836
+ "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
837
+ "model.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
838
+ "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
839
+ "model.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
840
+ "model.layers.3.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
841
+ "model.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
842
+ "model.layers.30.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
843
+ "model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
844
+ "model.layers.30.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
845
+ "model.layers.30.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
846
+ "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
847
+ "model.layers.30.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
848
+ "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
849
+ "model.layers.30.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
850
+ "model.layers.30.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
851
+ "model.layers.30.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
852
+ "model.layers.31.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
853
+ "model.layers.31.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
854
+ "model.layers.31.mlp.gate_proj.weight": "pytorch_model-00002-of-00002.bin",
855
+ "model.layers.31.mlp.up_proj.weight": "pytorch_model-00002-of-00002.bin",
856
+ "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
857
+ "model.layers.31.self_attn.k_proj.weight": "pytorch_model-00002-of-00002.bin",
858
+ "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
859
+ "model.layers.31.self_attn.q_proj.weight": "pytorch_model-00002-of-00002.bin",
860
+ "model.layers.31.self_attn.rotary_emb.inv_freq": "pytorch_model-00002-of-00002.bin",
861
+ "model.layers.31.self_attn.v_proj.weight": "pytorch_model-00002-of-00002.bin",
862
+ "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
863
+ "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
864
+ "model.layers.4.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
865
+ "model.layers.4.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
866
+ "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
867
+ "model.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
868
+ "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
869
+ "model.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
870
+ "model.layers.4.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
871
+ "model.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
872
+ "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
873
+ "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
874
+ "model.layers.5.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
875
+ "model.layers.5.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
876
+ "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
877
+ "model.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
878
+ "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
879
+ "model.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
880
+ "model.layers.5.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
881
+ "model.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
882
+ "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
883
+ "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
884
+ "model.layers.6.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
885
+ "model.layers.6.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
886
+ "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
887
+ "model.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
888
+ "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
889
+ "model.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
890
+ "model.layers.6.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
891
+ "model.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
892
+ "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
893
+ "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
894
+ "model.layers.7.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
895
+ "model.layers.7.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
896
+ "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
897
+ "model.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
898
+ "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
899
+ "model.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
900
+ "model.layers.7.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
901
+ "model.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
902
+ "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
903
+ "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
904
+ "model.layers.8.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
905
+ "model.layers.8.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
906
+ "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
907
+ "model.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
908
+ "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
909
+ "model.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
910
+ "model.layers.8.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
911
+ "model.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
912
+ "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
913
+ "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
914
+ "model.layers.9.mlp.gate_proj.weight": "pytorch_model-00001-of-00002.bin",
915
+ "model.layers.9.mlp.up_proj.weight": "pytorch_model-00001-of-00002.bin",
916
+ "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
917
+ "model.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
918
+ "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
919
+ "model.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
920
+ "model.layers.9.self_attn.rotary_emb.inv_freq": "pytorch_model-00001-of-00002.bin",
921
+ "model.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
922
+ "model.mm_projector.0.bias": "pytorch_model-00002-of-00002.bin",
923
+ "model.mm_projector.0.weight": "pytorch_model-00002-of-00002.bin",
924
+ "model.mm_projector.2.bias": "pytorch_model-00002-of-00002.bin",
925
+ "model.mm_projector.2.weight": "pytorch_model-00002-of-00002.bin",
926
+ "model.norm.weight": "pytorch_model-00002-of-00002.bin",
927
+ "model.region_encoder.mlvl_fuse.fuse_convs.0.conv.weight": "pytorch_model-00002-of-00002.bin",
928
+ "model.region_encoder.mlvl_fuse.fuse_convs.0.gn.bias": "pytorch_model-00002-of-00002.bin",
929
+ "model.region_encoder.mlvl_fuse.fuse_convs.0.gn.weight": "pytorch_model-00002-of-00002.bin",
930
+ "model.region_encoder.mlvl_fuse.fuse_convs.1.conv.weight": "pytorch_model-00002-of-00002.bin",
931
+ "model.region_encoder.mlvl_fuse.fuse_convs.1.gn.bias": "pytorch_model-00002-of-00002.bin",
932
+ "model.region_encoder.mlvl_fuse.fuse_convs.1.gn.weight": "pytorch_model-00002-of-00002.bin",
933
+ "model.region_encoder.mlvl_fuse.fuse_convs.2.conv.weight": "pytorch_model-00002-of-00002.bin",
934
+ "model.region_encoder.mlvl_fuse.fuse_convs.2.gn.bias": "pytorch_model-00002-of-00002.bin",
935
+ "model.region_encoder.mlvl_fuse.fuse_convs.2.gn.weight": "pytorch_model-00002-of-00002.bin",
936
+ "model.region_encoder.mlvl_fuse.fuse_convs.3.conv.weight": "pytorch_model-00002-of-00002.bin",
937
+ "model.region_encoder.mlvl_fuse.fuse_convs.3.gn.bias": "pytorch_model-00002-of-00002.bin",
938
+ "model.region_encoder.mlvl_fuse.fuse_convs.3.gn.weight": "pytorch_model-00002-of-00002.bin",
939
+ "model.region_encoder.mlvl_fuse.fuse_convs.4.conv.weight": "pytorch_model-00002-of-00002.bin",
940
+ "model.region_encoder.mlvl_fuse.fuse_convs.4.gn.bias": "pytorch_model-00002-of-00002.bin",
941
+ "model.region_encoder.mlvl_fuse.fuse_convs.4.gn.weight": "pytorch_model-00002-of-00002.bin",
942
+ "model.region_encoder.mlvl_fuse.input_conv.0.bias": "pytorch_model-00002-of-00002.bin",
943
+ "model.region_encoder.mlvl_fuse.input_conv.0.weight": "pytorch_model-00002-of-00002.bin",
944
+ "model.region_encoder.mlvl_fuse.input_conv.1.bias": "pytorch_model-00002-of-00002.bin",
945
+ "model.region_encoder.mlvl_fuse.input_conv.1.weight": "pytorch_model-00002-of-00002.bin",
946
+ "model.region_encoder.mlvl_fuse.input_conv.2.bias": "pytorch_model-00002-of-00002.bin",
947
+ "model.region_encoder.mlvl_fuse.input_conv.2.weight": "pytorch_model-00002-of-00002.bin",
948
+ "model.region_encoder.mlvl_fuse.input_conv.3.bias": "pytorch_model-00002-of-00002.bin",
949
+ "model.region_encoder.mlvl_fuse.input_conv.3.weight": "pytorch_model-00002-of-00002.bin",
950
+ "model.region_encoder.roi_align.flatten_linear.bias": "pytorch_model-00002-of-00002.bin",
951
+ "model.region_encoder.roi_align.flatten_linear.weight": "pytorch_model-00002-of-00002.bin",
952
+ "model.region_encoder.roi_align.pconvs.0.bias": "pytorch_model-00002-of-00002.bin",
953
+ "model.region_encoder.roi_align.pconvs.0.weight": "pytorch_model-00002-of-00002.bin",
954
+ "model.region_encoder.roi_align.pconvs.1.bias": "pytorch_model-00002-of-00002.bin",
955
+ "model.region_encoder.roi_align.pconvs.1.weight": "pytorch_model-00002-of-00002.bin",
956
+ "model.region_encoder.roi_align.pconvs.2.bias": "pytorch_model-00002-of-00002.bin",
957
+ "model.region_encoder.roi_align.pconvs.2.weight": "pytorch_model-00002-of-00002.bin",
958
+ "model.region_encoder.roi_align.pconvs.3.bias": "pytorch_model-00002-of-00002.bin",
959
+ "model.region_encoder.roi_align.pconvs.3.weight": "pytorch_model-00002-of-00002.bin",
960
+ "model.region_encoder.roi_align.pos_embedd.0.bias": "pytorch_model-00002-of-00002.bin",
961
+ "model.region_encoder.roi_align.pos_embedd.0.weight": "pytorch_model-00002-of-00002.bin",
962
+ "model.region_encoder.roi_align.pos_embedd.2.bias": "pytorch_model-00002-of-00002.bin",
963
+ "model.region_encoder.roi_align.pos_embedd.2.weight": "pytorch_model-00002-of-00002.bin",
964
+ "model.region_encoder.roi_align.pos_embedd.3.bias": "pytorch_model-00002-of-00002.bin",
965
+ "model.region_encoder.roi_align.pos_embedd.3.weight": "pytorch_model-00002-of-00002.bin",
966
+ "model.region_encoder.roi_align.pos_embedd.5.bias": "pytorch_model-00002-of-00002.bin",
967
+ "model.region_encoder.roi_align.pos_embedd.5.weight": "pytorch_model-00002-of-00002.bin",
968
+ "model.region_encoder.roi_align.updims.bias": "pytorch_model-00002-of-00002.bin",
969
+ "model.region_encoder.roi_align.updims.weight": "pytorch_model-00002-of-00002.bin",
970
+ "model.text_hidden_fcs.0.0.bias": "pytorch_model-00002-of-00002.bin",
971
+ "model.text_hidden_fcs.0.0.weight": "pytorch_model-00002-of-00002.bin",
972
+ "model.text_hidden_fcs.0.2.bias": "pytorch_model-00002-of-00002.bin",
973
+ "model.text_hidden_fcs.0.2.weight": "pytorch_model-00002-of-00002.bin"
974
+ }
975
+ }
groundingLMM/GLaMM-FullScope/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "<unk>",
17
+ "unk_token": {
18
+ "content": "<unk>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
groundingLMM/GLaMM-FullScope/tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "bos_token": {
3
+ "__type": "AddedToken",
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ "clean_up_tokenization_spaces": false,
11
+ "eos_token": {
12
+ "__type": "AddedToken",
13
+ "content": "</s>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false
18
+ },
19
+ "legacy": false,
20
+ "model_max_length": 1536,
21
+ "pad_token": null,
22
+ "padding_side": "right",
23
+ "special_tokens_map_file": "special_tokens_map.json",
24
+ "tokenizer_class": "LlamaTokenizer",
25
+ "unk_token": {
26
+ "__type": "AddedToken",
27
+ "content": "<unk>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ }
33
+ }
groundingLMM/GranD/README.md ADDED
@@ -0,0 +1,73 @@
1
+ # GranD - Grounding Anything Dataset 🚀
2
+ For details on downloading the dataset, preprocessing annotations for pre-training, and the automated annotation pipeline, please refer to [GranD.md](../docs/GranD.md) in the documentation.
3
+
4
+ ## Running the GranD Automated Annotation Pipeline
5
+ The GranD automated annotation pipeline comprises four levels and a total of 23 steps. Each level utilizes multiple state-of-the-art (SoTA) vision-language models and pipeline scripts to construct image-scene graphs from raw predictions.
6
+
7
+ For a step-by-step guide on running the pipeline, refer to [run_pipeline.sh](run_pipeline.sh). The environments required are listed under [environments](environments).
8
+
9
+ ### Create All Environments
10
+ There are ten environment `.yml` files provided in the [environments](environments) directory. Create all ten environments using the following commands:
11
+
12
+ ```bash
13
+ conda env create -f grand_env_1.yml
14
+ conda env create -f grand_env_2.yml
15
+ ...
16
+ ...
17
+ conda env create -f grand_env_9.yml
18
+ conda env create -f grand_env_utils.yml
19
+ ```
20
+
21
+ **NOTE:** While creating any of the above environments, if one or more `pip` dependencies fail to install, you may need to remove those dependencies from the environment file and rerun the command.
22
+
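+ For example, a minimal recovery sketch (using `grand_env_3` as an arbitrary example; the package names are placeholders):
+
+ ```bash
+ # If a partially created environment was left behind, remove it first
+ conda env remove -n grand_env_3
+ # Re-create the environment after commenting out the failing pip entries in grand_env_3.yml
+ conda env create -f grand_env_3.yml
+ # Then install the removed packages manually inside the environment
+ conda activate grand_env_3
+ pip install <failed-package-1> <failed-package-2>
+ ```
+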
23
+ ### Download Model Checkpoints
24
+ Download all required model checkpoints to your `CKPT_DIR` directory:
25
+
26
+ ```bash
27
+ # For Landmark Detection
28
+ git lfs install
29
+ git clone https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3
30
+
31
+ # For Depth Estimation
32
+ wget https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt
33
+
34
+ # For Image Tagging
35
+ # Download from [recognize-anything/tag2text_swin_14m.pth](https://huggingface.co/spaces/xinyu1205/recognize-anything/blob/main/tag2text_swin_14m.pth) & [recognize-anything/ram_swin_large_14m.pth](https://huggingface.co/spaces/xinyu1205/recognize-anything/blob/main/ram_swin_large_14m.pth)
36
+
37
+ # For Co-DETR Detector
38
+ # Download using this [Google Drive link](https://drive.google.com/drive/folders/1asWoZ3SuM6APTL9D-QUF_YW9mjULNdh9?usp=sharing) to obtain the `co_deformable_detr_swin_large_900q_3x_coco.pth` checkpoints.
39
+
40
+ # For EVA-02 Detector
41
+ # Download from [eva02_L_lvis_sys.pth](https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/det/eva02_L_lvis_sys.pth) & [eva02_L_lvis_sys_o365.pth](https://huggingface.co/Yuxin-CV/EVA-02/blob/main/eva02/det/eva02_L_lvis_sys_o365.pth)
42
+
43
+ # For POMP
44
+ # Download from [Google Drive](https://drive.google.com/file/d/1C8oU6cWkJdU3Q3IHaqTcbIToRLo9bMnu/view?usp=sharing) & [Detic_LI_CLIP_R5021k_640b64_4x_ft4x_max-size_pomp.pth](https://drive.google.com/file/d/1TwrjcUYimkI_f9z9UZXCmLztdgv31Peu/view?usp=sharing)
45
+
46
+ # For GRIT
47
+ wget -c https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth
48
+
49
+ # For OV-SAM
50
+ # Download from [HarborYuan/ovsam_models/blob/main/sam2clip_vith_rn50x16.pth](https://huggingface.co/HarborYuan/ovsam_models/blob/main/sam2clip_vith_rn50x16.pth)
51
+
52
+ # For GPT4RoI
53
+ # Follow the instructions at [GPT4RoI/Weights](https://github.com/jshilong/GPT4RoI?tab=readme-ov-file#weights) to obtain the GPT4RoI weights.
54
+ ```
55
+
56
+ ### Automatically Annotate Images
57
+ Refer to the [run_pipeline.sh](run_pipeline.sh) script for details. Below is a sample command to run the pipeline:
58
+
59
+ ```bash
60
+ bash run_pipeline.sh $IMG_DIR $PRED_DIR $CKPT_DIR $SAM_ANNOTATIONS_DIR
61
+ ```
62
+
63
+ Where:
64
+
65
+ 1. `IMG_DIR` is the path to the directory containing images you wish to annotate.
66
+ 2. `PRED_DIR` is the path to the directory where the predictions will be saved.
67
+ 3. `CKPT_DIR` is the path to the directory containing all the checkpoints. For downloading the checkpoints, consult the README of each respective model.
68
+ 4. `SAM_ANNOTATIONS_DIR` is the path to the directory containing SAM annotations (.json file).
69
+
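+ For example, a hypothetical invocation (the paths below are placeholders; substitute your own) could look like:
+ 
+ ```bash
+ IMG_DIR=/data/grand/images
+ PRED_DIR=/data/grand/predictions
+ CKPT_DIR=/data/grand/checkpoints
+ SAM_ANNOTATIONS_DIR=/data/grand/sam_annotations
+ bash run_pipeline.sh "$IMG_DIR" "$PRED_DIR" "$CKPT_DIR" "$SAM_ANNOTATIONS_DIR"
+ ```
+ 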
70
+ **Note:** If you are not annotating SAM images, remove `ov-sam` from the pipeline and adjust the `add_masks_to_annotations.py` script accordingly. In this case, `SAM_ANNOTATIONS_DIR` will not be required.
71
+
72
+ ### Disclaimer
73
+ We acknowledge that the pipeline is complex due to the involvement of many different models with various dependencies. Contributions that simplify or improve the pipeline are welcome.
groundingLMM/GranD/run_pipeline.sh ADDED
@@ -0,0 +1,178 @@
1
+ #!/bin/bash
2
+
3
+ # Exit on error, uninitialized var, and ensure commands in pipes are all checked for success
4
+ set -euo pipefail
5
+
6
+ # Input arguments: image directory, output predictions directory, checkpoints directory (containing all model checkpoints), and the directory containing the original SAM annotation files
7
+ IMG_DIR=$1
8
+ PRED_DIR=$2
9
+ CKPT_DIR=$3
10
+ SAM_ANNOTATIONS_DIR=$4
11
+
12
+ # Adjust below configuration as per your setup
13
+ NUM_GPUs=1
14
+ GPU_IDs="0"
15
+ MASTER_PORT=1342
16
+
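+ # Example multi-GPU configuration (illustrative):
+ #   NUM_GPUs=4
+ #   GPU_IDs="0,1,2,3"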
17
+
18
+ # NOTE: The pipeline combines multiple models from different open-source projects, and their dependencies differ from one model to another. For this reason, we had to create ten different conda environments to run the complete pipeline. Please follow the instructions in the corresponding model directory to install the dependencies. Pull requests that simplify this process are welcome.
19
+
20
+
21
+ # The helper functions below activate the correct conda environment before running the given command
22
+ run_in_env() {
23
+ local env="$1"
24
+ shift
25
+ source "$(conda info --base)/etc/profile.d/conda.sh"
26
+ conda activate "$env"
27
+ "$@"
28
+ }
29
+
30
+ run_in_env_targeted() {
31
+ local env="$1"
32
+ shift
33
+ export CUDA_VISIBLE_DEVICES="$GPU_IDs"
34
+ source "$(conda info --base)/etc/profile.d/conda.sh"
35
+ conda activate "$env"
36
+ "$@"
37
+ }
38
+
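+ # Usage of the helpers (illustrative): pass the conda environment name first, then the command to run, e.g.
+ #   run_in_env grand_env_utils python utils/prepare_level_1.py --help
+ # run_in_env_targeted additionally restricts the visible GPUs to $GPU_IDs (via CUDA_VISIBLE_DEVICES) before activating the environment.
+ 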
39
+ # NOTE: Here we assume ten conda environments have been created, namely grand_env_1, grand_env_2, ..., grand_env_9, and grand_env_utils. The requirements for these environments are available under the environments directory.
40
+
41
+
42
+ # 1. Landmark
43
+ run_in_env grand_env_1 pushd level_1_inference/1_landmark_categorization
44
+ python infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs" --llava_model_path "$CKPT_DIR/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3"
45
+ popd
46
+
47
+ # 2. Depth Maps
48
+ run_in_env_targeted grand_env_2 pushd level_1_inference/2_depth_maps
49
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model_weights "$CKPT_DIR/dpt_beit_large_512.pt"
50
+ popd
51
+
52
+ # 3. Image Tagging
53
+ run_in_env_targeted grand_env_3 pushd level_1_inference/3_image_tagging
54
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model-type tag2text --checkpoint "$CKPT_DIR/tag2text_swin_14m.pth"
55
+
56
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model-type ram --checkpoint "$CKPT_DIR/ram_swin_large_14m.pth"
57
+ popd
58
+
59
+ # 4. Object Detection using Co-DETR
60
+ run_in_env grand_env_1 pushd level_1_inference/4_co_detr
61
+ python launch_codetr_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --ckpt_path "$CKPT_DIR/co_deformable_detr_swin_large_900q_3x_coco.pth" --gpu_ids "$GPU_IDs"
62
+ popd
63
+
64
+ # 5. Object Detection using EVA-02
65
+ run_in_env_targeted grand_env_4 pushd level_1_inference/5_eva_02
66
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model_name 'eva-02-01'
67
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --model_name 'eva-02-02'
68
+ popd
69
+
70
+ # 6. Open Vocabulary Detection using OWL-ViT
71
+ run_in_env grand_env_1 pushd level_1_inference/6_owl_vit
72
+ python launch_owl_vit_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --tags_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs"
73
+ popd
74
+
75
+ # 7. Open Vocabulary Detection using POMP
76
+ run_in_env grand_env_4 pushd level_1_inference/7_pomp
77
+ python launch_pomp_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --tags_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs"
78
+ popd
79
+
80
+ # 8. Attribute Detection and Grounding using GRIT
81
+ run_in_env_targeted grand_env_3 pushd level_1_inference/8_grit
82
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR"
83
+ popd
84
+
85
+ # 9. Open Vocabulary Classification using OV-SAM
86
+ run_in_env grand_env_5 pushd level_1_inference/9_ov_sam
87
+ python launch_ov_sam_multi_gpu_inference.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --sam_annotations_dir "$SAM_ANNOTATIONS_DIR" --gpu_ids "$GPU_IDs"
88
+ popd
89
+
90
+ # 10. Generate Level-1 Scene Graph
91
+ run_in_env grand_env_utils
92
+ python utils/merge_json_level_1_with_nms.py --image_dir_path "$IMG_DIR" --predictions_dir_path "$PRED_DIR" --output_dir_path "$PRED_DIR/level-1-raw"
93
+
94
+ run_in_env grand_env_utils
95
+ python utils/prepare_level_1.py --image_dir_path "$IMG_DIR" --raw_dir_path "$PRED_DIR/level-1-raw" --output_dir_path "$PRED_DIR/level-1-processed"
96
+
97
+
98
+ # -------------------------------------------------------------------------------------------------------------------- #
99
+
100
+ # 11. Captioning using BLIP-2
101
+ run_in_env_targeted grand_env_3 pushd level_2_inference/1_blip-2
102
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR"
103
+ popd
104
+
105
+ # 12. Captioning using LLaVA
106
+ run_in_env grand_env_6 pushd level_2_inference/2_llava
107
+ python infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --gpu_ids "$GPU_IDs" --llava_model_path "$CKPT_DIR/llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3"
108
+ popd
109
+
110
+ # 13. Grounding using MDETR
111
+ run_in_env_targeted grand_env_7 pushd level_2_inference/3_mdetr
112
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --output_dir_path "$PRED_DIR" --blip2_pred_path "$PRED_DIR/blip2" --llava_pred_path "$PRED_DIR/llava"
113
+ popd
114
+
115
+ # 14. Generate Level-2 Scene Graph and Update Level-1
116
+ run_in_env grand_env_utils
117
+ python utils/merge_json_level_2.py --predictions_dir_path "$PRED_DIR" --output_dir_path "$PRED_DIR/level-2-raw"
118
+
119
+ run_in_env grand_env_utils
120
+ python utils/prepare_level_2.py --raw_dir_path "$PRED_DIR/level-2-raw" --level_2_output_dir_path "$PRED_DIR/level-2-processed" --level_1_dir_path "$PRED_DIR/level-1-processed"
121
+
122
+
123
+ # -------------------------------------------------------------------------------------------------------------------- #
124
+
125
+ # 15. Enrich Attributes using GPT4RoI
126
+ run_in_env grand_env_8 pushd level_2_inference/4_gpt4roi/GPT4RoI
127
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env gpt4roi/infer.py --image_dir_path "$IMG_DIR" --level_2_pred_path "$PRED_DIR/level-2-processed" --output_dir_path "$PRED_DIR/level-2-processed_gpt4roi"
128
+ popd
129
+
130
+ # 16. Label Assignment using EVA-CLIP
131
+ run_in_env_targeted grand_env_4 pushd level_2_inference/5_label_assignment
132
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env infer.py --image_dir_path "$IMG_DIR" --level_2_dir_path "$PRED_DIR/level-2-processed_gpt4roi" --output_dir_path "$PRED_DIR/level-2-processed_eva_clip"
133
+ popd
134
+
135
+ # 17. Merge EVA-CLIP Assigned Labels & Calculate and Store Depths for All Objects
136
+ run_in_env_targeted grand_env_utils
137
+ python utils/merge_eva_labels.py --level_2_dir_path "$PRED_DIR/level-2-processed_gpt4roi" --labels_path "$PRED_DIR/level-2-processed_eva_clip" --output_dir_path "$PRED_DIR/level-2-processed_labelled" --store_depth --depth_map_dir "$PRED_DIR/midas"
138
+
139
+
140
+ # -------------------------------------------------------------------------------------------------------------------- #
141
+
142
+ # 18. Generate Level-3 Dense Captions
143
+ run_in_env grand_env_9 pushd level_3_dense_caption
144
+ python run.py --image_dir_path "$IMG_DIR" --level_2_dir_path "$PRED_DIR/level-2-processed_labelled" --output_dir_path "$PRED_DIR/level-3-vicuna-13B" --gpu_ids "$GPU_IDs" --job_id '111'
145
+ popd
146
+
147
+ # 19. Generate Level-4 Additional Context
148
+ run_in_env grand_env_9 pushd level_4_extra_context
149
+ python run.py --image_dir_path "$IMG_DIR" --level_2_dir_path "$PRED_DIR/level-2-processed_labelled" --output_dir_path "$PRED_DIR/level-4-vicuna-13B" --gpu_ids "$GPU_IDs" --job_id '111'
150
+ popd
151
+
152
+
153
+ # -------------------------------------------------------------------------------------------------------------------- #
154
+
155
+ # 20. Ground short & dense captions
156
+ run_in_env_targeted grand_env_utils
157
+ python utils/ground_short_captions.py --data_dir_path "$PRED_DIR/level-2-processed_labelled" --output_dir_path "$PRED_DIR/short_captions_grounded"
158
+
159
+ run_in_env_targeted grand_env_utils
160
+ python utils/ground_dense_caption.py --level_3_dense_caption_txt_dir_path "$PRED_DIR/level-3-vicuna-13B" --level_2_processed_json_path "$PRED_DIR/short_captions_grounded" --output_dir_path "$PRED_DIR/dense_captions_grounded"
161
+
162
+ # 21. Add Masks to the Annotations (sources: SAM Annotations & EVA Detector)
163
+ run_in_env_targeted grand_env_utils
164
+ python utils/add_masks_to_annotations.py --input_dir_path "$PRED_DIR/dense_captions_grounded" --sam_json_dir_path "$SAM_ANNOTATIONS_DIR" --eva_02_pred_dir_path "$PRED_DIR/eva-02-01" --output_dir_path "$PRED_DIR/level-3-processed"
165
+
166
+ # 22. Use HQ-SAM for the Rest of the Masks not Found in SAM Annotations or EVA Detections
167
+ run_in_env_targeted grand_env_1 pushd utils/hq_sam
168
+ python -m torch.distributed.launch --nproc_per_node="$NUM_GPUs" --master_port="$MASTER_PORT" --use_env run.py --image_dir_path "$IMG_DIR" --level_3_processed_path "$PRED_DIR/level-3-processed" --output_dir_path "$PRED_DIR/level-3-processed_with_masks" --checkpoints_path "$CKPT_DIR/sam_hq_vit_h.pth"
169
+ popd
170
+
171
+ # 23. Add Additional Context to the Annotations
172
+ run_in_env_targeted grand_env_utils
173
+ python utils/add_addional_context.py --annotations_dir_path "$PRED_DIR/level-3-processed_with_masks" --level_4_additional_context_path "$PRED_DIR/level-4-vicuna-13B" --output_dir_path "$PRED_DIR/level-4-processed"
174
+
175
+
176
+ # -------------------------------------------------------------------------------------------------------------------- #
177
+
178
+ echo "Pipeline inference is complete. Predictions are saved in $PRED_DIR/level-4-processed"
groundingLMM/LLaVA/.dockerignore ADDED
@@ -0,0 +1,21 @@
1
+ # The .dockerignore file excludes files from the container build process.
2
+ #
3
+ # https://docs.docker.com/engine/reference/builder/#dockerignore-file
4
+
5
+ # Exclude Git files
6
+ .git
7
+ .github
8
+ .gitignore
9
+
10
+ # Exclude Python cache files
11
+ __pycache__
12
+ .mypy_cache
13
+ .pytest_cache
14
+ .ruff_cache
15
+
16
+ # Exclude Python virtual environment
17
+ /venv
18
+
19
+ # Exclude some weights
20
+ /openai
21
+ /liuhaotian
groundingLMM/LLaVA/.editorconfig ADDED
@@ -0,0 +1,18 @@
1
+ root = true
2
+
3
+ # Unix-style newlines with a newline ending every file
4
+ [*]
5
+ end_of_line = lf
6
+ insert_final_newline = true
7
+ trim_trailing_whitespace = true
8
+ charset = utf-8
9
+
10
+ # 4 space indentation
11
+ [*.{py,json}]
12
+ indent_style = space
13
+ indent_size = 4
14
+
15
+ # 2 space indentation
16
+ [*.{md,sh,yaml,yml}]
17
+ indent_style = space
18
+ indent_size = 2
groundingLMM/LLaVA/.gitattributes ADDED
@@ -0,0 +1,29 @@
1
+ # https://git-scm.com/docs/gitattributes
2
+
3
+ # Set the default behavior, in case people don't have core.autocrlf set.
4
+ # https://git-scm.com/docs/gitattributes#_end_of_line_conversion
5
+ * text=auto
6
+
7
+ # common python attributes, taken from https://github.com/alexkaratarakis/gitattributes/blob/710900479a2bedeec7003d381719521ffbb18bf8/Python.gitattributes
8
+ # Source files
9
+ # ============
10
+ *.pxd text diff=python
11
+ *.py text diff=python
12
+ *.py3 text diff=python
13
+ *.pyw text diff=python
14
+ *.pyx text diff=python
15
+ *.pyz text diff=python
16
+ *.pyi text diff=python
17
+
18
+ # Binary files
19
+ # ============
20
+ *.db binary
21
+ *.p binary
22
+ *.pkl binary
23
+ *.pickle binary
24
+ *.pyc binary export-ignore
25
+ *.pyo binary export-ignore
26
+ *.pyd binary
27
+
28
+ # Jupyter notebook
29
+ *.ipynb text eol=lf
groundingLMM/LLaVA/.gitignore ADDED
@@ -0,0 +1,35 @@
1
+ # Python
2
+ __pycache__
3
+ *.pyc
4
+ *.egg-info
5
+ dist
6
+
7
+ # Log
8
+ *.log
9
+ *.log.*
10
+ *.json
11
+ *.jsonl
12
+
13
+ # Data
14
+ !**/alpaca-data-conversation.json
15
+
16
+ # Editor
17
+ .idea
18
+ *.swp
19
+
20
+ # Other
21
+ .DS_Store
22
+ wandb
23
+ output
24
+
25
+ checkpoints
26
+ ckpts*
27
+
28
+ .ipynb_checkpoints
29
+ *.ipynb
30
+
31
+ # DevContainer
32
+ !.devcontainer/*
33
+
34
+ # Demo
35
+ serve_images/
groundingLMM/LLaVA/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
groundingLMM/LLaVA/README.md ADDED
@@ -0,0 +1,463 @@
1
+ # 🌋 LLaVA: Large Language and Vision Assistant
2
+
3
+ *Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.*
4
+
5
+ [📢 [LLaVA-NeXT Blog](https://llava-vl.github.io/blog/2024-01-30-llava-next/)] [[Project Page](https://llava-vl.github.io/)] [[Demo](https://llava.hliu.cc/)] [[Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)] [[Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)]
6
+
7
+ 🤝Community Contributions: [[llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436)] [[Colab](https://github.com/camenduru/LLaVA-colab)] [[🤗Space](https://huggingface.co/spaces/badayvedat/LLaVA)] [[Replicate](https://replicate.com/yorickvp/llava-13b)] [[AutoGen](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_llava.ipynb)] [[BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA)]
8
+
9
+ **Improved Baselines with Visual Instruction Tuning** [[Paper](https://arxiv.org/abs/2310.03744)] [[HF](https://huggingface.co/papers/2310.03744)] <br>
10
+ [Haotian Liu](https://hliu.cc), [Chunyuan Li](https://chunyuan.li/), [Yuheng Li](https://yuheng-li.github.io/), [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/)
11
+
12
+ **Visual Instruction Tuning** (NeurIPS 2023, **Oral**) [[Paper](https://arxiv.org/abs/2304.08485)] [[HF](https://huggingface.co/papers/2304.08485)] <br>
13
+ [Haotian Liu*](https://hliu.cc), [Chunyuan Li*](https://chunyuan.li/), [Qingyang Wu](https://scholar.google.ca/citations?user=HDiw-TsAAAAJ&hl=en/), [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/) (*Equal Contribution)
14
+
15
+ <!--p align="center">
16
+ <a href="https://llava.hliu.cc/"><img src="images/llava_logo.png" width="50%"></a> <br>
17
+ Generated by <a href="https://gligen.github.io/">GLIGEN</a> via "a cute lava llama with glasses" and box prompt
18
+ </p-->
19
+
20
+
21
+ ## Release
22
+
23
+ - [2024/05/10] 🔥 **LLaVA-NeXT** (Stronger) models are released, stronger LMM with support of LLama-3 (8B) and Qwen-1.5 (72B/110B). [[Blog](https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/)] [[Checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff)] [[Demo](https://llava-next.lmms-lab.com/)] [[Code](https://github.com/LLaVA-VL/LLaVA-NeXT/)]
24
+ - [2024/05/10] 🔥 **LLaVA-NeXT** (Video) is released. The image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer. DPO training with AI feedback on videos can yield significant improvement. [[Blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)] [[Checkpoints](https://huggingface.co/collections/lmms-lab/llava-next-video-661e86f5e8dabc3ff793c944)] [[Code](https://github.com/LLaVA-VL/LLaVA-NeXT/)]
25
+ - [03/10] Releasing **LMMs-Eval**, a highly efficient evaluation pipeline we used when developing LLaVA-NeXT. It supports the evaluation of LMMs on dozens of public datasets and allows new dataset onboarding, making the dev of new LMMs much faster. [[Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/)] [[Codebase](https://github.com/EvolvingLMMs-Lab/lmms-eval)]
26
+ - [1/30] 🔥 **LLaVA-NeXT** (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the [blog post](https://llava-vl.github.io/blog/2024-01-30-llava-next/), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). Training/eval data and scripts coming soon.
27
+ - [11/10] [LLaVA-Plus](https://llava-vl.github.io/llava-plus/) is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [[Project Page](https://llava-vl.github.io/llava-plus/)] [[Demo](https://llavaplus.ngrok.io/)] [[Code](https://github.com/LLaVA-VL/LLaVA-Plus-Codebase)] [[Paper](https://arxiv.org/abs/2311.05437)]
28
+ - [11/2] [LLaVA-Interactive](https://llava-vl.github.io/llava-interactive/) is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [[Project Page](https://llava-vl.github.io/llava-interactive/)] [[Demo](https://llavainteractive.ngrok.io/)] [[Code](https://github.com/LLaVA-VL/LLaVA-Interactive-Demo)] [[Paper](https://arxiv.org/abs/2311.00571)]
29
+ - [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement ([ckpts](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md#llava-v15), [script](https://github.com/haotian-liu/LLaVA#train)). We also provide a [doc](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md) on how to finetune LLaVA-1.5 on your own dataset with LoRA.
30
+ - [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [[🤗 Demo](https://huggingface.co/spaces/etri-vilab/Ko-LLaVA)]
31
+ - [10/5] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the [technical report](https://arxiv.org/abs/2310.03744), and explore the [demo](https://llava.hliu.cc/)! Models are available in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), and evaluation scripts are released [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md)!
32
+ - [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project [[LLavA-RLHF]](https://llava-rlhf.github.io/)
33
+ - [9/22] [LLaVA](https://arxiv.org/abs/2304.08485) is accepted by NeurIPS 2023 as **oral presentation**, and [LLaVA-Med](https://arxiv.org/abs/2306.00890) is accepted by NeurIPS 2023 Datasets and Benchmarks Track as **spotlight presentation**.
34
+
35
+ <details>
36
+ <summary>More</summary>
37
+
38
+ - [11/6] Support **Intel** dGPU and CPU platforms. [More details here.](https://github.com/haotian-liu/LLaVA/tree/intel/docs/intel)
39
+ - [10/12] LLaVA is now supported in [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/3436) with 4-bit / 5-bit quantization support!
40
+ - [10/11] The training data and scripts of LLaVA-1.5 are released [here](https://github.com/haotian-liu/LLaVA#train), and evaluation scripts are released [here](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md)!
41
+ - [10/10] [Roboflow Deep Dive](https://blog.roboflow.com/first-impressions-with-llava-1-5/): First Impressions with LLaVA-1.5.
42
+ - [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a [note](https://arxiv.org/abs/2309.09958). Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper [``Multimodal Foundation Models: From Specialists to General-Purpose Assistants''.](https://arxiv.org/abs/2309.10020)
43
+ <p align="center">
44
+ <img src="https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings/blob/main/images/mfm_evolution.jpeg?raw=true" width=50%/>
45
+ </p>
46
+
47
+ - [7/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release [LLaVA Bench](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_Bench.md) for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out [LLaVA-from-LLaMA-2](https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_from_LLaMA2.md), and our [model zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)!
48
+ - [6/26] [CVPR 2023 Tutorial](https://vlp-tutorial.github.io/) on **Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4**! Please check out [[Slides](https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf)] [[Notes](https://arxiv.org/abs/2306.14895)] [[YouTube](https://youtu.be/mkI7EPD1vp8)] [[Bilibli](https://www.bilibili.com/video/BV1Ng4y1T7v3/)].
49
+ - [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations [here](./docs/LoRA.md).
50
+ - [6/1] We released **LLaVA-Med: Large Language and Vision Assistant for Biomedicine**, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the [paper](https://arxiv.org/abs/2306.00890) and [page](https://github.com/microsoft/LLaVA-Med).
51
+ - [5/6] We are releasing [LLaVA-Lightning-MPT-7B-preview](https://huggingface.co/liuhaotian/LLaVA-Lightning-MPT-7B-preview), based on MPT-7B-Chat! See [here](#LLaVA-MPT-7b) for more details.
52
+ - [5/2] 🔥 We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See [here](#train-llava-lightning) for more details.
53
+ - [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out [here](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/llava).
54
+ - [4/17] 🔥 We released **LLaVA: Large Language and Vision Assistant**. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the [paper](https://arxiv.org/abs/2304.08485) and [demo](https://llava.hliu.cc/).
55
+
56
+ </details>
57
+
58
+ <!-- <a href="https://llava.hliu.cc/"><img src="assets/demo.gif" width="70%"></a> -->
59
+
60
+ [![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
61
+ **Usage and License Notices**: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the [OpenAI Terms of Use](https://openai.com/policies/terms-of-use) for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. [Llama community license](https://ai.meta.com/llama/license/) for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
62
+
63
+
64
+ ## Contents
65
+ - [Install](#install)
66
+ - [LLaVA Weights](#llava-weights)
67
+ - [Demo](#Demo)
68
+ - [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)
69
+ - [Dataset](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)
70
+ - [Train](#train)
71
+ - [Evaluation](#evaluation)
72
+
73
+ ## Install
74
+
75
+ If you are not using Linux, do *NOT* proceed, see instructions for [macOS](https://github.com/haotian-liu/LLaVA/blob/main/docs/macOS.md) and [Windows](https://github.com/haotian-liu/LLaVA/blob/main/docs/Windows.md).
76
+
77
+ 1. Clone this repository and navigate to LLaVA folder
78
+ ```bash
79
+ git clone https://github.com/haotian-liu/LLaVA.git
80
+ cd LLaVA
81
+ ```
82
+
83
+ 2. Install Package
84
+ ```Shell
85
+ conda create -n llava python=3.10 -y
86
+ conda activate llava
87
+ pip install --upgrade pip # enable PEP 660 support
88
+ pip install -e .
89
+ ```
90
+
91
+ 3. Install additional packages for training cases
92
+ ```
93
+ pip install -e ".[train]"
94
+ pip install flash-attn --no-build-isolation
95
+ ```
96
+
97
+ ### Upgrade to latest code base
98
+
99
+ ```Shell
100
+ git pull
101
+ pip install -e .
102
+
103
+ # if you see some import errors when you upgrade,
104
+ # please try running the command below (without #)
105
+ # pip install flash-attn --no-build-isolation --no-cache-dir
106
+ ```
107
+
108
+ ### Quick Start With HuggingFace
109
+
110
+ <details>
111
+ <summary>Example Code</summary>
112
+
113
+ ```Python
114
+ from llava.model.builder import load_pretrained_model
115
+ from llava.mm_utils import get_model_name_from_path
116
+ from llava.eval.run_llava import eval_model
117
+
118
+ model_path = "liuhaotian/llava-v1.5-7b"
119
+
120
+ tokenizer, model, image_processor, context_len = load_pretrained_model(
121
+ model_path=model_path,
122
+ model_base=None,
123
+ model_name=get_model_name_from_path(model_path)
124
+ )
125
+ ```
126
+
127
+ Check out the details with the `load_pretrained_model` function in `llava/model/builder.py`.
128
+
129
+ You can also use the `eval_model` function in `llava/eval/run_llava.py` to get the output easily. By doing so, you can use this code on Colab directly after downloading this repository.
130
+
131
+ ``` python
132
+ model_path = "liuhaotian/llava-v1.5-7b"
133
+ prompt = "What are the things I should be cautious about when I visit here?"
134
+ image_file = "https://llava-vl.github.io/static/images/view.jpg"
135
+
136
+ args = type('Args', (), {
137
+ "model_path": model_path,
138
+ "model_base": None,
139
+ "model_name": get_model_name_from_path(model_path),
140
+ "query": prompt,
141
+ "conv_mode": None,
142
+ "image_file": image_file,
143
+ "sep": ",",
144
+ "temperature": 0,
145
+ "top_p": None,
146
+ "num_beams": 1,
147
+ "max_new_tokens": 512
148
+ })()
149
+
150
+ eval_model(args)
151
+ ```
152
+ </details>
153
+
154
+ ## LLaVA Weights
155
+ Please check out our [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md) for all public LLaVA checkpoints, and the instructions of how to use the weights.
156
+
157
+ ## Demo
158
+
159
+ ### Gradio Web UI
160
+
161
+ To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server *ONCE*.
162
+
163
+ ```mermaid
164
+ flowchart BT
165
+ %% Declare Nodes
166
+ gws("Gradio (UI Server)")
167
+ c("Controller (API Server):<br/>PORT: 10000")
168
+ mw7b("Model Worker:<br/>llava-v1.5-7b<br/>PORT: 40000")
169
+ mw13b("Model Worker:<br/>llava-v1.5-13b<br/>PORT: 40001")
170
+ sglw13b("SGLang Backend:<br/>llava-v1.6-34b<br/>http://localhost:30000")
171
+ lsglw13b("SGLang Worker:<br/>llava-v1.6-34b<br/>PORT: 40002")
172
+
173
+ %% Declare Styles
174
+ classDef data fill:#3af,stroke:#48a,stroke-width:2px,color:#444
175
+ classDef success fill:#8f8,stroke:#0a0,stroke-width:2px,color:#444
176
+ classDef failure fill:#f88,stroke:#f00,stroke-width:2px,color:#444
177
+
178
+ %% Assign Styles
179
+ class id,od data;
180
+ class cimg,cs_s,scsim_s success;
181
+ class ncimg,cs_f,scsim_f failure;
182
+
183
+ subgraph Demo Connections
184
+ direction BT
185
+ c<-->gws
186
+
187
+ mw7b<-->c
188
+ mw13b<-->c
189
+ lsglw13b<-->c
190
+ sglw13b<-->lsglw13b
191
+ end
192
+ ```
193
+
194
+ #### Launch a controller
195
+ ```Shell
196
+ python -m llava.serve.controller --host 0.0.0.0 --port 10000
197
+ ```
198
+
199
+ #### Launch a gradio web server.
200
+ ```Shell
201
+ python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
202
+ ```
203
+ You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.
204
+
205
+ #### Launch a SGLang worker
206
+
207
+ This is the recommended way to serve the LLaVA model with high throughput, and you need to install SGLang first. Note that `4-bit` quantization is not currently supported on SGLang-LLaVA; if you have limited GPU VRAM, please check out the model worker with [quantization](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#launch-a-model-worker-4-bit-8-bit-inference-quantized).
208
+
209
+ ```Shell
210
+ pip install "sglang[all]"
211
+ ```
212
+
213
+ You'll first launch a SGLang backend worker which will execute the models on GPUs. Remember the `--port` you've set and you'll use that later.
214
+
215
+ ```Shell
216
+ # Single GPU
217
+ CUDA_VISIBLE_DEVICES=0 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000
218
+
219
+ # Multiple GPUs with tensor parallel
220
+ CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-13b --tokenizer-path llava-hf/llava-1.5-13b-hf --port 30000 --tp 2
221
+ ```
222
+
223
+ Tokenizers (temporary): `llava-hf/llava-1.5-7b-hf`, `llava-hf/llava-1.5-13b-hf`, `liuhaotian/llava-v1.6-34b-tokenizer`.
224
+
225
+ You'll then launch a LLaVA-SGLang worker that will communicate between LLaVA controller and SGLang backend to route the requests. Set `--sgl-endpoint` to `http://127.0.0.1:port` where `port` is the one you just set (default: 30000).
226
+
227
+ ```Shell
228
+ python -m llava.serve.sglang_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --sgl-endpoint http://127.0.0.1:30000
229
+ ```
230
+
231
+ #### Launch a model worker
232
+
233
+ This is the actual *worker* that performs the inference on the GPU. Each worker is responsible for a single model specified in `--model-path`.
234
+
235
+ ```Shell
236
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
237
+ ```
238
+ Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
239
+
240
+ You can launch as many workers as you want, and compare between different model checkpoints in the same Gradio interface. Please keep the `--controller` the same, and modify the `--port` and `--worker` to a different port number for each worker.
241
+ ```Shell
242
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path <ckpt2>
243
+ ```
244
+
245
+ If you are using an Apple device with an M1 or M2 chip, you can specify the mps device by using the `--device` flag: `--device mps`.
246
+
247
+ #### Launch a model worker (Multiple GPUs, when GPU VRAM <= 24GB)
248
+
249
+ If the VRAM of your GPU is less than 24GB (e.g., RTX 3090, RTX 4090, etc.), you may try running it with multiple GPUs. Our latest code base will automatically try to use multiple GPUs if you have more than one GPU. You can specify which GPUs to use with `CUDA_VISIBLE_DEVICES`. Below is an example of running with the first two GPUs.
250
+
251
+ ```Shell
252
+ CUDA_VISIBLE_DEVICES=0,1 python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
253
+ ```
254
+
255
+ #### Launch a model worker (4-bit, 8-bit inference, quantized)
256
+
257
+ You can launch the model worker with quantized bits (4-bit, 8-bit), which allows you to run the inference with reduced GPU memory footprint, potentially allowing you to run on a GPU with as few as 12GB VRAM. Note that inference with quantized bits may not be as accurate as the full-precision model. Simply append `--load-4bit` or `--load-8bit` to the **model worker** command that you are executing. Below is an example of running with 4-bit quantization.
258
+
259
+ ```Shell
260
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b --load-4bit
261
+ ```
262
+
263
+ #### Launch a model worker (LoRA weights, unmerged)
264
+
265
+ You can launch the model worker with LoRA weights, without merging them with the base checkpoint, to save disk space. There will be additional loading time, while the inference speed is the same as the merged checkpoints. Unmerged LoRA checkpoints do not have `lora-merge` in the model name, and are usually much smaller (less than 1GB) than the merged checkpoints (13G for 7B, and 25G for 13B).
266
+
267
+ To load unmerged LoRA weights, you simply need to pass an additional argument `--model-base`, which is the base LLM that is used to train the LoRA weights. You can check the base LLM of each LoRA weights in the [model zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md).
268
+
269
+ ```Shell
270
+ python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base lmsys/vicuna-13b-v1.3
271
+ ```
272
+
273
+ ### CLI Inference
274
+
275
+ Chat about images using LLaVA without the need of Gradio interface. It also supports multiple GPUs, 4-bit and 8-bit quantized inference. With 4-bit quantization, for our LLaVA-1.5-7B, it uses less than 8GB VRAM on a single GPU.
276
+
277
+ ```Shell
278
+ python -m llava.serve.cli \
279
+ --model-path liuhaotian/llava-v1.5-7b \
280
+ --image-file "https://llava-vl.github.io/static/images/view.jpg" \
281
+ --load-4bit
282
+ ```
283
+
284
+ <img src="images/demo_cli.gif" width="70%">
285
+
286
+ ## Train
287
+
288
+ *Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to README of [this](https://github.com/haotian-liu/LLaVA/tree/v1.0.1) version for now. We'll add them in a separate doc later.*
289
+
290
+ LLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a *frozen pretrained* vision encoder to a *frozen LLM*; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.
291
+
292
+ LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
293
+
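+ As a quick sanity check, you can compute the global batch size for your setup; the values below are only an illustration (not taken from our released scripts), chosen so that the product stays at 128:
+ 
+ ```Shell
+ # global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
+ per_device_train_batch_size=16
+ gradient_accumulation_steps=2
+ num_gpus=4
+ echo $((per_device_train_batch_size * gradient_accumulation_steps * num_gpus))   # 128
+ ```
+ 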
294
+ ### Hyperparameters
295
+ We use a similar set of hyperparameters as Vicuna in finetuning. Both hyperparameters used in pretraining and finetuning are provided below.
296
+
297
+ 1. Pretraining
298
+
299
+ | Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
300
+ | --- | ---: | ---: | ---: | ---: | ---: |
301
+ | LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |
302
+
303
+ 2. Finetuning
304
+
305
+ | Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
306
+ | --- | ---: | ---: | ---: | ---: | ---: |
307
+ | LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |
308
+
309
+ ### Download Vicuna checkpoints (automatically)
310
+
311
+ Our base model Vicuna v1.5, which is an instruction-tuned chatbot, will be downloaded automatically when you run our provided training scripts. No action is needed.
312
+
313
+ ### Pretrain (feature alignment)
314
+
315
+ Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper [here](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).
316
+
317
+ Pretrain takes around 5.5 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 3.5 hours for LLaVA-v1.5-7B.
318
+
319
+ Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/pretrain.sh).
320
+
321
+ - `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
322
+ - `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
323
+
324
+ <details>
325
+ <summary>Pretrain takes around 20 hours for LLaVA-7B on 8x V100 (32G)</summary>
326
+
327
+ We provide training script with DeepSpeed [here](https://github.com/haotian-liu/LLaVA/blob/main/scripts/pretrain_xformers.sh).
328
+ Tips:
329
+ - If you are using V100 which is not supported by FlashAttention, you can use the [memory-efficient attention](https://arxiv.org/abs/2112.05682) implemented in [xFormers](https://github.com/facebookresearch/xformers). Install xformers and replace `llava/train/train_mem.py` above with [llava/train/train_xformers.py](llava/train/train_xformers.py).
330
+ </details>
331
+
332
+ ### Visual Instruction Tuning
333
+
334
+ 1. Prepare data
335
+
336
+ Please download the annotation of the final mixture our instruction tuning data [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json), and download the images from constituting datasets:
337
+
338
+ - COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
339
+ - GQA: [images](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
340
+ - OCR-VQA: [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing), **we save all files as `.jpg`**
341
+ - TextVQA: [train_val_images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
342
+ - VisualGenome: [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
343
+
344
+ After downloading all of them, organize the data as follows in `./playground/data`,
345
+
346
+ ```
347
+ ├── coco
348
+ │ └── train2017
349
+ ├── gqa
350
+ │ └── images
351
+ ├── ocr_vqa
352
+ │ └── images
353
+ ├── textvqa
354
+ │ └── train_images
355
+ └── vg
356
+ ├── VG_100K
357
+ └── VG_100K_2
358
+ ```
359
+
360
+ 2. Start training!
361
+
362
+ You may download our pretrained projectors in [Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md). It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function/train as we expected.
363
+
364
+ Visual instruction tuning takes around 20 hours for LLaVA-v1.5-13B on 8x A100 (80G), due to the increased resolution to 336px. It takes around 10 hours for LLaVA-v1.5-7B on 8x A100 (40G).
365
+
366
+ Training script with DeepSpeed ZeRO-3: [`finetune.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune.sh).
367
+
368
+ If you do not have enough GPU memory:
369
+
370
+ - Use LoRA: [`finetune_lora.sh`](https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_lora.sh). We are able to fit 13B training in 8-A100-40G/8-A6000, and 7B training in 8-RTX3090. Make sure `per_device_train_batch_size*gradient_accumulation_steps` is the same as the provided script for best reproducibility.
371
+ - Replace `zero3.json` with `zero3_offload.json` which offloads some parameters to CPU RAM. This slows down the training speed.
372
+
373
+ If you are interested in finetuning the LLaVA model on your own task/data, please check out [`Finetune_Custom_Data.md`](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md).
374
+
375
+ New options to note:
376
+
377
+ - `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
378
+ - `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
379
+ - `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
380
+ - `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25%, and does not affect the final outcome.
381
+
382
+ ## Evaluation
383
+
384
+ In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process is consistent with the real-time chat demo.
385
+
386
+ See [Evaluation.md](https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md).
387
+
388
+ ### GPT-assisted Evaluation
389
+
390
+ Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.
391
+
392
+ 1. Generate LLaVA responses
393
+
394
+ ```Shell
395
+ python model_vqa.py \
396
+ --model-path ./checkpoints/LLaVA-13B-v0 \
397
+ --question-file \
398
+ playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
399
+ --image-folder \
400
+ /path/to/coco2014_val \
401
+ --answers-file \
402
+ /path/to/answer-file-our.jsonl
403
+ ```
404
+
405
+ 2. Evaluate the generated responses. In our case, [`answer-file-ref.jsonl`](./playground/data/coco2014_val_qa_eval/qa90_gpt4_answer.jsonl) is the response generated by text-only GPT-4 (0314), with the context captions/boxes provided.
406
+
407
+ ```Shell
408
+ OPENAI_API_KEY="sk-***********************************" python llava/eval/eval_gpt_review_visual.py \
409
+ --question playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
410
+ --context llava/eval/table/caps_boxes_coco2014_val_80.jsonl \
411
+ --answer-list \
412
+ /path/to/answer-file-ref.jsonl \
413
+ /path/to/answer-file-our.jsonl \
414
+ --rule llava/eval/table/rule.json \
415
+ --output /path/to/review.json
416
+ ```
417
+
418
+ 3. Summarize the evaluation results
419
+
420
+ ```Shell
421
+ python summarize_gpt_review.py
422
+ ```
423
+
424
+ ## Citation
425
+
426
+ If you find LLaVA useful for your research and applications, please cite using this BibTeX:
427
+ ```bibtex
428
+ @misc{liu2024llavanext,
429
+ title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
430
+ url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
431
+ author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
432
+ month={January},
433
+ year={2024}
434
+ }
435
+
436
+ @misc{liu2023improvedllava,
437
+ title={Improved Baselines with Visual Instruction Tuning},
438
+ author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
439
+ publisher={arXiv:2310.03744},
440
+ year={2023},
441
+ }
442
+
443
+ @misc{liu2023llava,
444
+ title={Visual Instruction Tuning},
445
+ author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
446
+ publisher={NeurIPS},
447
+ year={2023},
448
+ }
449
+ ```
450
+
451
+ ## Acknowledgement
452
+
453
+ - [Vicuna](https://github.com/lm-sys/FastChat): the codebase we built upon, and our base model Vicuna-13B that has the amazing language capabilities!
454
+
455
+ ## Related Projects
456
+
457
+ - [Instruction Tuning with GPT-4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
458
+ - [LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day](https://github.com/microsoft/LLaVA-Med)
459
+ - [Otter: In-Context Multi-Modal Instruction Tuning](https://github.com/Luodian/Otter)
460
+
461
+ For future project ideas, please check out:
462
+ - [SEEM: Segment Everything Everywhere All at Once](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)
463
+ - [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) to detect, segment, and generate anything by marrying [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) and [Segment-Anything](https://github.com/facebookresearch/segment-anything).
groundingLMM/LLaVA/cog.yaml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Configuration for Cog ⚙️
2
+ # Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
3
+
4
+ build:
5
+ gpu: true
6
+
7
+ python_version: "3.11"
8
+
9
+ python_packages:
10
+ - "torch==2.0.1"
11
+ - "accelerate==0.21.0"
12
+ - "bitsandbytes==0.41.0"
13
+ - "deepspeed==0.9.5"
14
+ - "einops-exts==0.0.4"
15
+ - "einops==0.6.1"
16
+ - "gradio==3.35.2"
17
+ - "gradio_client==0.2.9"
18
+ - "httpx==0.24.0"
19
+ - "markdown2==2.4.10"
20
+ - "numpy==1.26.0"
21
+ - "peft==0.4.0"
22
+ - "scikit-learn==1.2.2"
23
+ - "sentencepiece==0.1.99"
24
+ - "shortuuid==1.0.11"
25
+ - "timm==0.6.13"
26
+ - "tokenizers==0.13.3"
27
+ - "torch==2.0.1"
28
+ - "torchvision==0.15.2"
29
+ - "transformers==4.31.0"
30
+ - "wandb==0.15.12"
31
+ - "wavedrom==2.0.3.post3"
32
+ - "Pygments==2.16.1"
33
+ run:
34
+ - curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.0.3/pget" && chmod +x /usr/local/bin/pget
35
+
36
+ # predict.py defines how predictions are run on your model
37
+ predict: "predict.py:Predictor"
groundingLMM/LLaVA/predict.py ADDED
@@ -0,0 +1,155 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+
3
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
4
+ from llava.conversation import conv_templates, SeparatorStyle
5
+ from llava.model.builder import load_pretrained_model
6
+ from llava.utils import disable_torch_init
7
+ from llava.mm_utils import tokenizer_image_token
8
+ from transformers.generation.streamers import TextIteratorStreamer
9
+
10
+ from PIL import Image
11
+
12
+ import requests
13
+ from io import BytesIO
14
+
15
+ from cog import BasePredictor, Input, Path, ConcatenateIterator
16
+ import time
17
+ import subprocess
18
+ from threading import Thread
19
+
20
+ import os
21
+ os.environ["HUGGINGFACE_HUB_CACHE"] = os.getcwd() + "/weights"
22
+
23
+ # url for the weights mirror
24
+ REPLICATE_WEIGHTS_URL = "https://weights.replicate.delivery/default"
25
+ # files to download from the weights mirrors
26
+ weights = [
27
+ {
28
+ "dest": "liuhaotian/llava-v1.5-13b",
29
+ # git commit hash from huggingface
30
+ "src": "llava-v1.5-13b/006818fc465ebda4c003c0998674d9141d8d95f8",
31
+ "files": [
32
+ "config.json",
33
+ "generation_config.json",
34
+ "pytorch_model-00001-of-00003.bin",
35
+ "pytorch_model-00002-of-00003.bin",
36
+ "pytorch_model-00003-of-00003.bin",
37
+ "pytorch_model.bin.index.json",
38
+ "special_tokens_map.json",
39
+ "tokenizer.model",
40
+ "tokenizer_config.json",
41
+ ]
42
+ },
43
+ {
44
+ "dest": "openai/clip-vit-large-patch14-336",
45
+ "src": "clip-vit-large-patch14-336/ce19dc912ca5cd21c8a653c79e251e808ccabcd1",
46
+ "files": [
47
+ "config.json",
48
+ "preprocessor_config.json",
49
+ "pytorch_model.bin"
50
+ ],
51
+ }
52
+ ]
53
+
54
+ def download_json(url: str, dest: Path):
55
+ res = requests.get(url, allow_redirects=True)
56
+ if res.status_code == 200 and res.content:
57
+ with dest.open("wb") as f:
58
+ f.write(res.content)
59
+ else:
60
+ print(f"Failed to download {url}. Status code: {res.status_code}")
61
+
62
+ def download_weights(baseurl: str, basedest: str, files: list[str]):
63
+ basedest = Path(basedest)
64
+ start = time.time()
65
+ print("downloading to: ", basedest)
66
+ basedest.mkdir(parents=True, exist_ok=True)
67
+ for f in files:
68
+ dest = basedest / f
69
+ url = os.path.join(REPLICATE_WEIGHTS_URL, baseurl, f)
70
+ if not dest.exists():
71
+ print("downloading url: ", url)
72
+ if dest.suffix == ".json":
73
+ download_json(url, dest)
74
+ else:
75
+ subprocess.check_call(["pget", url, str(dest)], close_fds=False)
76
+ print("downloading took: ", time.time() - start)
77
+
78
+ class Predictor(BasePredictor):
79
+ def setup(self) -> None:
80
+ """Load the model into memory to make running multiple predictions efficient"""
81
+ for weight in weights:
82
+ download_weights(weight["src"], weight["dest"], weight["files"])
83
+ disable_torch_init()
84
+
85
+ self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model("liuhaotian/llava-v1.5-13b", model_name="llava-v1.5-13b", model_base=None, load_8bit=False, load_4bit=False)
86
+
87
+ def predict(
88
+ self,
89
+ image: Path = Input(description="Input image"),
90
+ prompt: str = Input(description="Prompt to use for text generation"),
91
+ top_p: float = Input(description="When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens", ge=0.0, le=1.0, default=1.0),
92
+ temperature: float = Input(description="Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic", default=0.2, ge=0.0),
93
+ max_tokens: int = Input(description="Maximum number of tokens to generate. A word is generally 2-3 tokens", default=1024, ge=0),
94
+ ) -> ConcatenateIterator[str]:
95
+ """Run a single prediction on the model"""
96
+
97
+ conv_mode = "llava_v1"
98
+ conv = conv_templates[conv_mode].copy()
99
+
100
+ image_data = load_image(str(image))
101
+ image_tensor = self.image_processor.preprocess(image_data, return_tensors='pt')['pixel_values'].half().cuda()
102
+
103
+ # loop start
104
+
105
+ # just one turn, always prepend image token
106
+ inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
107
+ conv.append_message(conv.roles[0], inp)
108
+
109
+ conv.append_message(conv.roles[1], None)
110
+ prompt = conv.get_prompt()
111
+
112
+ input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
113
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
114
+ keywords = [stop_str]
115
+ streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True, timeout=20.0)
116
+
117
+ with torch.inference_mode():
118
+ thread = Thread(target=self.model.generate, kwargs=dict(
119
+ inputs=input_ids,
120
+ images=image_tensor,
121
+ do_sample=True,
122
+ temperature=temperature,
123
+ top_p=top_p,
124
+ max_new_tokens=max_tokens,
125
+ streamer=streamer,
126
+ use_cache=True))
127
+ thread.start()
128
+ # workaround: second-to-last token is always " "
129
+ # but we want to keep it if it's not the second-to-last token
130
+ prepend_space = False
131
+ for new_text in streamer:
132
+ if new_text == " ":
133
+ prepend_space = True
134
+ continue
135
+ if new_text.endswith(stop_str):
136
+ new_text = new_text[:-len(stop_str)].strip()
137
+ prepend_space = False
138
+ elif prepend_space:
139
+ new_text = " " + new_text
140
+ prepend_space = False
141
+ if len(new_text):
142
+ yield new_text
143
+ if prepend_space:
144
+ yield " "
145
+ thread.join()
146
+
147
+
148
+ def load_image(image_file):
149
+ if image_file.startswith('http') or image_file.startswith('https'):
150
+ response = requests.get(image_file)
151
+ image = Image.open(BytesIO(response.content)).convert('RGB')
152
+ else:
153
+ image = Image.open(image_file).convert('RGB')
154
+ return image
155
+
groundingLMM/LLaVA/pyproject.toml ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [build-system]
2
+ requires = ["setuptools>=61.0"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "llava"
7
+ version = "1.2.2.post1"
8
+ description = "Towards GPT-4 like large language and visual assistant."
9
+ readme = "README.md"
10
+ requires-python = ">=3.8"
11
+ classifiers = [
12
+ "Programming Language :: Python :: 3",
13
+ "License :: OSI Approved :: Apache Software License",
14
+ ]
15
+ dependencies = [
16
+ "torch==2.1.2", "torchvision==0.16.2",
17
+ "transformers==4.37.2", "tokenizers==0.15.1", "sentencepiece==0.1.99", "shortuuid",
18
+ "accelerate==0.21.0", "peft", "bitsandbytes",
19
+ "pydantic", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
20
+ "gradio==4.16.0", "gradio_client==0.8.1",
21
+ "requests", "httpx==0.24.0", "uvicorn", "fastapi",
22
+ "einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
23
+ ]
24
+
25
+ [project.optional-dependencies]
26
+ train = ["deepspeed==0.12.6", "ninja", "wandb"]
27
+ build = ["build", "twine"]
28
+
29
+ [project.urls]
30
+ "Homepage" = "https://llava-vl.github.io"
31
+ "Bug Tracker" = "https://github.com/haotian-liu/LLaVA/issues"
32
+
33
+ [tool.setuptools.packages.find]
34
+ exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
35
+
36
+ [tool.wheel]
37
+ exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
groundingLMM/dataset/dataset.py ADDED
@@ -0,0 +1,236 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import numpy as np
2
+ import torch
3
+
4
+ from model.llava import conversation as conversation_lib
5
+ from model.llava.mm_utils import tokenizer_image_token
6
+ from dataset.caption_datasets.COCO_Caption_ds import CocoCapDataset
7
+ from dataset.caption_datasets.LLavaInstruct_vqa_ds import LLaVAInstructDataset
8
+ from dataset.region_datasets.Flickr_Region_ds import Flickr30kRegDataset
9
+ from dataset.segm_datasets.Semantic_Segm_ds import SemanticSegmDataset
10
+ from dataset.segm_datasets.RefCOCO_Segm_ds import ReferSegmDataset
11
+ from dataset.gcg_datasets.GranDf_gcg_ds import GranDfDataset, OpenPsgGCGDataset, Flickr30kGCGDataset, RefCOCOgGCGDataset
12
+ from dataset.region_datasets.RefCOCO_VG_Region_ds import (RefCocoRegDataset, RefCocoGRegDataset, RefCocoPRegDataset,
13
+ VisualGenomeRegDataset)
14
+ from dataset.caption_datasets.GranD_ShortCaption_ds import GrandShortCaptionDataset
15
+ from dataset.region_datasets.GranD_ReferringRegion_ds import GrandReferRegDataset
16
+ from dataset.segm_datasets.GranD_ReferringSegm_ds import GrandReferSegmDataset
17
+ from tools.utils import DEFAULT_IMAGE_TOKEN, IGNORE_INDEX, DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN
18
+
19
+
20
+ class HybridDatasetBase(torch.utils.data.Dataset):
21
+ PIXEL_MEAN = torch.tensor([123.675, 116.28, 103.53]).view(-1, 1, 1)
22
+ PIXEL_STD = torch.tensor([58.395, 57.12, 57.375]).view(-1, 1, 1)
23
+ IMG_SIZE = 1024
24
+ IGNORE_LABEL = 255
25
+
26
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config,
27
+ epoch_samples=500 * 8 * 2 * 10, batch_size=2, precision="fp32", image_size=224,
28
+ num_classes_per_sample=3, sample_rate=None):
29
+ self.dataset_dir = dataset_dir
30
+ self.tokenizer = tokenizer
31
+ self.global_image_encoder = global_image_encoder
32
+ self.dataset = dataset
33
+ self.datasets_config = datasets_config
34
+ self.epoch_samples = epoch_samples
35
+ self.batch_size = batch_size
36
+ self.precision = precision
37
+ self.image_size = image_size
38
+ self.num_classes_per_sample = num_classes_per_sample
39
+
40
+ self.dataset_list = dataset.split("||")
41
+ self.sample_rate = np.array(sample_rate or [1] * len(self.dataset_list), dtype=np.float64)  # float dtype so the in-place normalization below works for integer rate lists
42
+ self.sample_rate /= self.sample_rate.sum()
43
+ self.all_datasets = self.create_datasets()
44
+
45
+ def create_datasets(self):
46
+ datasets = []
47
+ for ds in self.dataset_list:
48
+ dataset_cls = self.datasets_config.get(ds)
49
+ if dataset_cls:
50
+ if ds == 'Semantic_Segm':
51
+ datasets.append(
52
+ dataset_cls(
53
+ self.dataset_dir, self.tokenizer, self.global_image_encoder, self.epoch_samples,
54
+ self.precision, self.image_size, self.num_classes_per_sample, self.semantic_segm_data, )
55
+ )
56
+ elif ds == 'Refer_Segm':
57
+ datasets.append(
58
+ dataset_cls(
59
+ self.dataset_dir, self.tokenizer, self.global_image_encoder, self.epoch_samples,
60
+ self.precision, self.image_size, self.num_classes_per_sample, self.refer_segm_data, )
61
+ )
62
+ else:
63
+ datasets.append(
64
+ dataset_cls(
65
+ self.dataset_dir, self.tokenizer, self.global_image_encoder, self.epoch_samples,
66
+ self.precision, self.image_size, self.num_classes_per_sample, )
67
+ )
68
+ return datasets
69
+
70
+ def __len__(self):
71
+ return self.epoch_samples
72
+
73
+ def __getitem__(self, idx):
74
+ dataset_idx = np.random.choice(len(self.dataset_list), p=self.sample_rate)
75
+ selected_dataset = self.all_datasets[dataset_idx]
76
+ data = selected_dataset[0]
77
+ return (*data,)
78
+
79
+
80
+ class HybridCapDataset(HybridDatasetBase):
81
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=500 * 8 * 2 * 10, batch_size=2,
82
+ precision="fp32", image_size=224, num_classes_per_sample=3,
83
+ dataset="CocoCap||LLaVaInstruct", sample_rate=[1, 1]):
84
+ datasets_config = {"CocoCap": CocoCapDataset,
85
+ "LLaVaInstruct": LLaVAInstructDataset,
86
+ "GrandCaptionDataset": GrandShortCaptionDataset,
87
+ # Add other dataset mappings here
88
+ }
89
+ super().__init__(
90
+ dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config, epoch_samples, batch_size,
91
+ precision, image_size, num_classes_per_sample, sample_rate
92
+ )
93
+
94
+
95
+ class HybridRegDataset(HybridDatasetBase):
96
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=500 * 8 * 2 * 10, batch_size=2,
97
+ precision="fp32", image_size=224, num_classes_per_sample=3,
98
+ dataset="RefCoco_Reg||RefCocoG_Reg||RefCocoP_Reg||VisGen_Reg||Flickr_Reg", sample_rate=[1, 1, 1, 1, 1]):
99
+ datasets_config = {"RefCoco_Reg": RefCocoRegDataset,
100
+ "RefCocoG_Reg": RefCocoGRegDataset,
101
+ "RefCocoP_Reg": RefCocoPRegDataset,
102
+ "VisGen_Reg": VisualGenomeRegDataset,
103
+ "Flickr_Reg": Flickr30kRegDataset,
104
+ "GrandRefer_Reg": GrandReferRegDataset,
105
+ # Add other dataset mappings here
106
+ }
107
+ super().__init__(
108
+ dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config, epoch_samples, batch_size,
109
+ precision, image_size, num_classes_per_sample, sample_rate
110
+ )
111
+
112
+
113
+ class HybridSegDataset(HybridDatasetBase):
114
+ def __init__(self, dataset_dir, tokenizer, global_image_encoder, epoch_samples=500 * 8 * 2 * 10, batch_size=2,
115
+ precision="fp32", image_size=224, num_classes_per_sample=3,
116
+ dataset="Semantic_Segm||Refer_Segm||PSG_GCG||RefCoco_GCG||GranDf_GCG||Flickr_GCG",
117
+ sample_rate=[5,4,1,1,1,1],
118
+ semantic_segm_data="ade20k||cocostuff||pascal_part||paco_lvis||mapillary",
119
+ refer_segm_data="refcoco||refcocog||refcoco+||refclef"):
120
+ self.semantic_segm_data = semantic_segm_data
121
+ self.refer_segm_data = refer_segm_data
122
+ datasets_config = {"Semantic_Segm": SemanticSegmDataset,
123
+ "Refer_Segm": ReferSegmDataset,
124
+ "PSG_GCG": OpenPsgGCGDataset,
125
+ "RefCoco_GCG": RefCOCOgGCGDataset,
126
+ "GranDf_GCG": GranDfDataset,
127
+ "Flickr_GCG": Flickr30kGCGDataset,
128
+ "GrandRefer_Segm": GrandReferSegmDataset,
129
+ # Add other dataset mappings here
130
+ }
131
+ super().__init__(
132
+ dataset_dir, tokenizer, global_image_encoder, dataset, datasets_config, epoch_samples, batch_size,
133
+ precision, image_size, num_classes_per_sample, sample_rate
134
+ )
135
+
136
+
137
+ def custom_collate_fn(batch, tokenizer=None, use_mm_start_end=True, inference=False, local_rank=-1):
138
+ # Initializing lists and counters
139
+ image_path_list, global_enc_image_list, grounding_enc_image_list = [], [], []
140
+ bboxes_list, conversation_list, masks_list = [], [], []
141
+ label_list, resize_list, questions_list = [], [], []
142
+ selected_labels_list, offset_list, inferences = [], [0], []
143
+ cnt = 0
144
+
145
+ # Iterating through the batch
146
+ for (image_path, global_enc_image, grounding_enc_image, bboxes, conversations, masks, label, resize, questions,
147
+ sampled_classes) in batch:
148
+ image_path_list.append(image_path)
149
+ global_enc_image_list.append(global_enc_image)
150
+ grounding_enc_image_list.append(grounding_enc_image)
151
+ bboxes_list.append(bboxes)
152
+ conversation_list.extend(conversations)
153
+ masks_list.append([] if masks is None else masks.float())
154
+ label_list.append(label)
155
+ resize_list.append(resize)
156
+ questions_list.append(questions)
157
+ selected_labels_list.append(sampled_classes)
158
+ offset_list.append(cnt := cnt + len(conversations))
159
+ inferences.append(inference)
160
+
161
+ # Handling the conversation list
162
+ if use_mm_start_end:
163
+ replace_token = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
164
+ conversation_list = [conv.replace(DEFAULT_IMAGE_TOKEN, replace_token) for conv in conversation_list]
165
+
166
+ # Tokenizing and padding input ids
167
+ input_ids = torch.nn.utils.rnn.pad_sequence(
168
+ [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversation_list],
169
+ batch_first=True, padding_value=tokenizer.pad_token_id
170
+ )
171
+ attention_masks = input_ids.ne(tokenizer.pad_token_id)
172
+
173
+ # Preparing targets and handling conversation types
174
+ conv = conversation_lib.default_conversation.copy()
175
+ targets = input_ids.clone()
176
+ # conv_type == "llava_v1"
177
+ sep = conv.sep + conv.roles[1] + ": "
178
+ sep2 = conv.sep2
179
+
180
+ for conversation, target in zip(conversation_list, targets):
181
+ _process_conversation(conversation, target, tokenizer, sep, sep2)
182
+
183
+ # Adjusting for inferences
184
+ if not inferences[0]:
185
+ truncate_len = tokenizer.model_max_length - 575
186
+ if input_ids.shape[1] > truncate_len:
187
+ input_ids, targets, attention_masks = map(
188
+ lambda x: x[:, :truncate_len], [input_ids, targets, attention_masks]
189
+ )
190
+
191
+ return {
192
+ "image_paths": image_path_list,
193
+ "global_enc_images": torch.stack(global_enc_image_list, dim=0),
194
+ "grounding_enc_images": None if grounding_enc_image_list[0] is None else torch.stack(grounding_enc_image_list, dim=0),
195
+ "bboxes": None if bboxes_list[0] is None else bboxes_list,
196
+ "input_ids": input_ids,
197
+ "labels": targets,
198
+ "attention_masks": attention_masks,
199
+ "masks_list": None if masks_list[0] is None else masks_list,
200
+ "label_list": None if label_list[0] is None else label_list,
201
+ "resize_list": None if resize_list[0] is None else resize_list,
202
+ "offset": torch.LongTensor(offset_list),
203
+ "questions_list": questions_list,
204
+ "sampled_classes_list": selected_labels_list,
205
+ "inference": inferences[0],
206
+ "conversation_list": conversation_list,
207
+ }
208
+
209
+
210
+ def _process_conversation(conversation, target, tokenizer, sep, sep2):
211
+ total_len = target.ne(tokenizer.pad_token_id).sum().item()
212
+ rounds = conversation.split(sep2)
213
+ cur_len = 1
214
+ target[:cur_len] = IGNORE_INDEX
215
+
216
+ for rou in rounds:
217
+ if not rou:
218
+ break
219
+
220
+ parts = rou.split(sep)
221
+ assert len(parts) == 2, (len(parts), rou)
222
+ parts[0] += sep
223
+
224
+ if DEFAULT_IMAGE_TOKEN in conversation:
225
+ round_len = len(tokenizer_image_token(rou, tokenizer))
226
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
227
+ else:
228
+ round_len = len(tokenizer(rou).input_ids)
229
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 2
230
+
231
+ target[cur_len: cur_len + instruction_len] = IGNORE_INDEX
232
+ cur_len += round_len
233
+
234
+ target[cur_len:] = IGNORE_INDEX
235
+ if cur_len < tokenizer.model_max_length:
236
+ assert cur_len == total_len
groundingLMM/docs/GranD.md ADDED
@@ -0,0 +1,53 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GranD - Grounding Anything Dataset 🚀
2
+ The [Grounding-anything](https://grounding-anything.com/) Dataset (GranD) dataset offers densely annotated data, acquired through an automated annotation pipeline that leverages state-of-the-art (SOTA) vision and V-L models. This documentation covers how to download the GranD dataset and a guide to the automated annotation pipeline used to create GranD.
3
+
4
+ ## Download GranD 📂
5
+ - Annotations: [MBZUAI/GranD](https://huggingface.co/datasets/MBZUAI/GranD)
6
+ - Images: [Download](https://ai.meta.com/datasets/segment-anything-downloads/)
7
+ GranD utilizes images from the SAM dataset.
8
+
9
+ Note: Annotations are being uploaded incrementally; more parts will be available soon.
10
+
11
+ ### Preparing the Pretraining Annotations from GranD 🛠️
12
+
13
+ After downloading the GranD annotations, utilize the scripts below to transform them into GLaMM pretraining data, or to prepare them for your specific tasks.
14
+
15
+ - For object-level tasks such as object detection and semantic segmentation: [prepare_object_lvl_data.py](../GranD/prepare_annotations/prepare_object_lvl_data.py)
16
+ - For image-level captioning and caption grounding: [prepare_grand_caption_grounding.py](../GranD/prepare_annotations/prepare_grand_caption_grounding.py)
17
+ - For referring expression generation and referring expression segmentation: [prepare_grand_referring_expression](../GranD/prepare_annotations/prepare_grand_referring_expression.py)
18
+
19
+ The above scripts generate annotations in JSON format. To convert these for use in pretraining datasets requiring LMDB format, please use the following scripts (a usage sketch follows the list):
20
+ - To convert to lmdb: [get_txt_for_lmdb.py](../GranD/prepare_annotations/get_txt_for_lmdb.py)
21
+ - To extract file names in txt format: [get_txt_for_lmdb.py](../GranD/prepare_annotations/get_txt_for_lmdb.py)
22
+
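+ A hedged sketch for getting started with these scripts: their exact argument names are not documented here, so inspecting each script's CLI first is the safest route (`--help` assumes the scripts use argparse):
+
+ ```bash
+ # Inspect the CLI of each preparation script before wiring it into your pipeline.
+ for script in prepare_object_lvl_data prepare_grand_caption_grounding \
+               prepare_grand_referring_expression get_txt_for_lmdb; do
+     python GranD/prepare_annotations/${script}.py --help
+ done
+ ```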
23
+ ### GranD Automated Annotation Pipeline
24
+
25
+ GranD is a comprehensive, multi-purpose image-text dataset offering a range of contextual information, from fine-grained to high-level details. The pipeline contains four distinct levels.
26
+ The code for all four levels is provided in [GranD](../GranD).
27
+
28
+ More detailed information:
29
+ - To run the entire pipeline: [run_pipeline.sh](../GranD/run_pipeline.sh)
30
+ - To set up the environments used in [run_pipeline.sh](../GranD/run_pipeline.sh), refer to: [environments](../GranD/environments)
31
+ - Level-1 : Object Localization and Attributes
32
+ - Landmark Categorization: [landmark](../GranD/level_1_inference/1_landmark_categorization/README.md)
33
+ - Depth Map Estimation: [Midas Depth Estimation](../GranD/level_1_inference/2_depth_maps/README.md)
34
+ - Image Tagging: [RAM Tag2Text Tagging](../GranD/level_1_inference/3_image_tagging/README.md)
35
+ - Standard Object Detection: [CO-DETR OD](../GranD/level_1_inference/4_co_detr/README.md), [EVA OD](../GranD/level_1_inference/4_co_detr/README.md)
36
+ - Open Vocabulary Object Detection: [OWL-ViT OVD](../GranD/level_1_inference/6_owl_vit), [POMP OVD](../GranD/level_1_inference/7_pomp)
37
+ - Attribute Detection and Grounding: [Attribute & Grounding GRiT](../GranD/level_1_inference/8_grit/README.md)
38
+ - Open Vocabulary Classification: [OV Classification OV-SAM](../GranD/level_1_inference/9_ov_sam/README.md)
39
+ - Combine the predictions: [Merging](../GranD/utils/merge_json_level_1_with_nms.py)
40
+ - Generate Level-1 Scene Graph: [Level-1 Scene Graph](../GranD/utils/prepare_level_1.py)
41
+ - Level-2: Relationships
42
+ - Captioning: [BLIP-2 Captioning](../GranD/level_2_inference/1_blip-2/README.md), [LLaVA Captioning](../GranD/level_2_inference/2_llava/README.md)
43
+ - Grounding Short Captions: [MDETR Grounding](../GranD/level_2_inference/3_mdetr/README.md)
44
+ - Combine the predictions: [Merging](../GranD/utils/merge_json_level_2.py)
45
+ - Generate Level-2 Scene Graph and Update Level-1: [Level-2 Scene Graph](../GranD/utils/prepare_level_2.py)
46
+ - Enrich Attributes: [GPT4-RoI Attributes](../GranD/level_2_inference/4_gpt4roi/README.md)
47
+ - Label Assignment: [EVA-CLIP Label Assignment](../GranD/level_2_inference/5_label_assignment/README.md)
48
+ - Level-3: Scene Graph and Dense Captioning
49
+ - Generate Dense Captions: [Scene graph dense captioning LLaVA](../GranD/level_3_dense_caption/README.md)
50
+ - Level-4: Extra Contextual Insight:
51
+ - Generate Level-4 Additional Context: [Extra Context](../GranD/level_4_extra_context/README.md)
52
+
53
+
groundingLMM/docs/datasets.md ADDED
@@ -0,0 +1,327 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Prepare Dataset 🚀
2
+ This guide outlines the datasets required for open-source fine-tuning of GLaMM, which encompasses tasks like Grounded Conversation Generation (GCG), image-level captioning, visual question answering, region-level captioning, and referring expression segmentation. These datasets are used for fine-tuning to obtain the model demonstrated in our demo. We also highlight the specific datasets needed for each task.
3
+
4
+ To achieve all the capabilities of GLaMM, the following dataset types are used:
5
+ 1. GranD-f Grounded Conversation Generation (GCG) Dataset
6
+ 2. Semantic Segmentation Datasets
7
+ 3. Referring Expression Datasets (Expression Comprehension)
8
+ 4. Region-level Captioning Datasets (Expression Generation)
9
+ 5. Image Captioning
10
+ 6. Visual Question Answering
11
+ 7. GranD pretraining Datasets
12
+
13
+ Overall, they must be arranged in the following format:
14
+ ```
15
+ ├── GranDf
16
+ │ ├── annotations
17
+ │ │ ├── train
18
+ │ │ │ ├── GranDf_HA_GCG_train.json
19
+ │ │ │ ├── OpenPsgGCG_train.json
20
+ │ │ │ ├── OpenPsgGCG_val.json
21
+ │ │ │ ├── RefCOCOg_GCG_train.json
22
+ │ │ │ ├── RefCOCOg_GCG_val.json
23
+ │ │ │ ├── flickr_mergedGT_GCG_train.json
24
+ │ │ │ ├── flickr_mergedGT_GCG_val.json
25
+ │ │ ├── val_test
26
+ │ │ │ ├── test_gcg_coco_caption_gt.json
27
+ │ │ │ ├── test_gcg_coco_mask_gt.json
28
+ │ │ │ ├── val_gcg_coco_caption_gt.json
29
+ │ │ │ ├── val_gcg_coco_mask_gt.json
30
+ ├── GranDf_HA_images
31
+ │ ├── train
32
+ │ │ ├── sa_10010541.jpg
33
+ │ │ ├── sa_10014079.jpg
34
+ │ ├── val_test
35
+ │ │ ├── sa_10010541.jpg
36
+ │ │ ├── sa_10014079.jpg
37
+
38
+ ├── Semantic_Segm
39
+ │ ├── ade20k
40
+ │ │ ├── annotations
41
+ │ │ │ ├── training
42
+ │ │ │ │ ├── ADE_train_00000001.png
43
+ │ │ │ │ ├── ADE_train_00000002.png
44
+ │ │ ├── images
45
+ │ │ │ ├── training
46
+ │ │ │ │ ├── ADE_train_00000001.jpg
47
+ │ │ │ │ ├── ADE_train_00000002.jpg
48
+ ├── coco_stuff
49
+ │ │ ├── train2017
50
+ │ │ │ ├── 000000000009.png
51
+ │ │ │ ├── 000000000025.png
52
+ ├── mapillary
53
+ │ │ ├── config_v2.0.json
54
+ │ │ ├── training
55
+ │ │ │ ├── v2.0
56
+ │ │ │ │ ├── labels
57
+ │ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.png
58
+ │ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.png
59
+ │ │ │ ├── images
60
+ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.jpg
61
+ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.jpg
62
+ ├── paco_lvis
63
+ │ │ ├── annotations
64
+ │ │ │ ├── paco_lvis_v1_train.json
65
+ ├── pascal_part
66
+ │ │ ├── train.json
67
+ │ │ ├── VOCdevkit
68
+ │ │ │ │ ├── VOC2010
69
+ │ │ │ │ │ ├── JPEGImages
70
+ │ │ │ │ │ │ ├── 2007_000027.jpg
71
+ │ │ │ │ │ │ ├── 2007_000032.jpg
72
+
73
+ ├── Refer_Segm
74
+ │ ├── refcoco
75
+ │ ├── refcoco+
76
+ │ ├── refcocog
77
+ │ ├── refclef
78
+ │ ├── images
79
+ │ │ ├── saiapr_tc-12
80
+ │ │ │ ├── 00
81
+ │ │ │ ├── 01
82
+
83
+ ├── RefCoco_Reg
84
+ │ ├── mdetr_annotations
85
+ │ │ ├── finetune_refcoco_train.json
86
+ │ │ ├── finetune_refcocog_train.json
87
+ │ │ ├── finetune_refcocog_val.json
88
+ │ │ ├── finetune_refcoco+_train.json
89
+ │ │ ├── final_flickr_mergedGT_train.json
90
+ ├── visual_genome
91
+ │ │ ├── test_caption.json
92
+ │ │ ├── train.json
93
+ │ │ ├── images
94
+ │ │ │ ├── 1000.jpg
95
+ │ │ │ ├── 1001.jpg
96
+
97
+ ├── llava_dataset
98
+ │ ├── llava_instruct_150k.json
99
+
100
+ ├── coco_2017
101
+ │ ├── train2017
102
+ │ │ ├── 000000000009.jpg
103
+ │ │ ├── 000000000025.jpg
104
+ │ ├── annotations
105
+ │ │ ├── captions_train2017.json
106
+ │ │ ├── captions_val2017.json
107
+
108
+ ├── coco_2014
109
+ │ ├── train2014
110
+ │ │ ├── COCO_train2014_000000000009.jpg
111
+ │ │ ├── COCO_train2014_000000000025.jpg
112
+
113
+ ├── flikcr_30k
114
+ │ ├── train
115
+ │ │ ├── 1000092795.jpg
116
+ │ │ ├── 10002456.jpg
117
+ ```
118
+
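+ A quick way to sanity-check this layout before training is to verify that the expected top-level folders exist. A minimal sketch follows; it assumes the data root is `./data/` (as used by the training commands), so adjust the path if yours differs:
+
+ ```bash
+ # Report any missing top-level dataset folders under the data root.
+ for d in GranDf GranDf_HA_images Semantic_Segm Refer_Segm RefCoco_Reg visual_genome \
+          llava_dataset coco_2017 coco_2014 flikcr_30k; do
+     [ -d "./data/$d" ] || echo "missing: ./data/$d"
+ done
+ ```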
119
+ ### 1) GranD-f Grounded Conversation Generation (GCG) Dataset
120
+ The [GranD-f](https://grounding-anything.com/GranD-f) collection comprises four datasets: one high-quality human-annotated set proposed in our GLaMM paper, and three other datasets repurposed for the GCG task.
121
+
122
+ Download links and structure:
123
+ - Annotations: [MBZUAI/GranD-f](https://huggingface.co/datasets/MBZUAI/GranD-f)
124
+ - Images: `GranDf_HA_images` [Download](https://drive.google.com/file/d/1abdxVhrbNQhjJQ8eAcuPrOUBzhGaFsF_/view?usp=drive_link)
125
+ - Other necessary datasets:
126
+ - Open-PSG GCG: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
127
+ - RefCOCO-g GCG: `coco_2014` - COCO-2014 ([train2014](http://images.cocodataset.org/zips/train2014.zip))
128
+ - Flickr-30k GCG: `flikcr_30k` (train) - Download the train images from the [Flickr30K webpage](https://shannon.cs.illinois.edu/DenotationGraph/) or use the following [link](https://drive.google.com/file/d/1iomUn-Ht0OBfieMuyoVqEFj5PEmXfQ0U/view?usp=drive_link).
129
+
130
+ ```
131
+ ├── GranDf
132
+ │ ├── annotations
133
+ │ │ ├── train
134
+ │ │ │ ├── GranDf_HA_GCG_train.json
135
+ │ │ │ ├── OpenPsgGCG_train.json
136
+ │ │ │ ├── OpenPsgGCG_val.json
137
+ │ │ │ ├── RefCOCOg_GCG_train.json
138
+ │ │ │ ├── RefCOCOg_GCG_val.json
139
+ │ │ │ ├── flickr_mergedGT_GCG_train.json
140
+ │ │ │ ├── flickr_mergedGT_GCG_val.json
141
+ │ │ ├── val_test
142
+ │ │ │ ├── test_gcg_coco_caption_gt.json
143
+ │ │ │ ├── test_gcg_coco_mask_gt.json
144
+ │ │ │ ├── val_gcg_coco_caption_gt.json
145
+ │ │ │ ├── val_gcg_coco_mask_gt.json
146
+ ├── GranDf_HA_images
147
+ │ ├── train
148
+ │ │ ├── sa_10010541.jpg
149
+ │ │ ├── sa_10014079.jpg
150
+ │ ├── val_test
151
+ │ │ ├── sa_10010541.jpg
152
+ │ │ ├── sa_10014079.jpg
153
+ ├── coco_2017
154
+ │ ├── train2017
155
+ │ │ ├── 000000000009.jpg
156
+ │ │ ├── 000000000025.jpg
157
+ ├── coco_2014
158
+ │ ├── train2014
159
+ │ │ ├── COCO_train2014_000000000009.jpg
160
+ │ │ ├── COCO_train2014_000000000025.jpg
161
+ ├── flikcr_30k
162
+ │ ├── train
163
+ │ │ ├── 1000092795.jpg
164
+ │ │ ├── 10002456.jpg
165
+ ```
166
+
167
+ ### 2) Semantic Segmentation Datasets
168
+ For semantic segmentation, we use five open-source datasets providing segmentation masks and semantic class labels: ADE20K, COCO-Stuff, PASCAL-Part, PACO-LVIS, and Mapillary.
169
+
170
+ Download links and structure:
171
+ - [ADE20K](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip)
172
+ - [COCO-Stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip)
173
+ - [PASCAL-Part](https://github.com/facebookresearch/VLPart/tree/main/datasets#pascal-part)
174
+ - [PACO-LVIS](https://github.com/facebookresearch/paco/tree/main#dataset-setup)
175
+ - [Mapillary](https://www.mapillary.com/dataset/vistas)
176
+ - COCO images: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
177
+
178
+ Download and arrange as shown in the directory structure below.
179
+
180
+ ```
181
+ ├── Semantic_Segm
182
+ │ ├── ade20k
183
+ │ │ ├── annotations
184
+ │ │ │ ├── training
185
+ │ │ │ │ ├── ADE_train_00000001.png
186
+ │ │ │ │ ├── ADE_train_00000002.png
187
+ │ │ ├── images
188
+ │ │ │ ├── training
189
+ │ │ │ │ ├── ADE_train_00000001.jpg
190
+ │ │ │ │ ├── ADE_train_00000002.jpg
191
+ ├── coco_stuff
192
+ │ │ ├── train2017
193
+ │ │ │ ├── 000000000009.png
194
+ │ │ │ ├── 000000000025.png
195
+ ├── mapillary
196
+ │ │ ├── config_v2.0.json
197
+ │ │ ├── training
198
+ │ │ │ ├── v2.0
199
+ │ │ │ │ ├── labels
200
+ │ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.png
201
+ │ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.png
202
+ │ │ │ ├── images
203
+ │ │ │ │ ├── 0035fkbjWljhaftpVM37-g.jpg
204
+ │ │ │ │ ├── 00qclUcInksIYnm19b1Xfw.jpg
205
+ ├── paco_lvis
206
+ │ │ ├── annotations
207
+ │ │ │ ├── paco_lvis_v1_train.json
208
+ ├── pascal_part
209
+ │ │ ├── train.json
210
+ │ │ ├── VOCdevkit
211
+ │ │ │ │ ├── VOC2010
212
+ │ │ │ │ │ ├── JPEGImages
213
+ │ │ │ │ │ │ ├── 2007_000027.jpg
214
+ │ │ │ │ │ │ ├── 2007_000032.jpg
215
+ ├── coco_2017
216
+ │ ├── train2017
217
+ │ │ ├── 000000000009.jpg
218
+ │ │ ├── 000000000025.jpg
219
+ ```
220
+
221
+ ### 3) Referring Expression Datasets
222
+ For referring expression segmentation, we use the referring expression comprehension datasets RefCOCO, RefCOCO+, RefCOCOg, and RefCLEF.
223
+
224
+ Download links and structure:
225
+ - [RefCOCO](https://web.archive.org/web/20220413011718/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco.zip)
226
+ - [RefCOCO+](https://web.archive.org/web/20220413011656/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcoco+.zip)
227
+ - [RefCOCOg](https://web.archive.org/web/20220413012904/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refcocog.zip)
228
+ - [RefCLEF](https://web.archive.org/web/20220413011817/https://bvisionweb1.cs.unc.edu/licheng/referit/data/refclef.zip)
229
+ - RefCOCO images: `coco_2014` - COCO-2014 ([train2014](http://images.cocodataset.org/zips/train2014.zip))
230
+ - For RefCLEF, you also need the [saiapr_tc-12](https://web.archive.org/web/20220515000000/http://bvisionweb1.cs.unc.edu/licheng/referit/data/images/saiapr_tc-12.zip) images
231
+
232
+ Download the data from the source links, and arrange as follows:
233
+
234
+ ```
235
+ ├── Refer_Segm
236
+ │ ├── refcoco
237
+ │ ├── refcoco+
238
+ │ ├── refcocog
239
+ │ ├── refclef
240
+ │ ├── images
241
+ │ │ ├── saiapr_tc-12
242
+ │ │ │ ├── 00
243
+ │ │ │ ├── 01
244
+ ├── coco_2014
245
+ │ ├── train2014
246
+ │ │ ├── COCO_train2014_000000000009.jpg
247
+ │ │ ├── COCO_train2014_000000000025.jpg
248
+ ```
249
+
250
+ ### 4) Region-level Captioning Datasets (Expression Generation)
251
+ For region-level captioning, we use five open-source datasets with region (bbox) grounding: RefCOCO, RefCOCOg, RefCOCO+, Visual Genome (v1.2), and Flickr30K.
252
+
253
+ Download links and structure:
254
+ - Annotations - mdetr_annotations: [Download](https://drive.google.com/file/d/1gvH5ToNtmIr3qz7C9lNi_fDmElwAANsI/view?usp=drive_link)
255
+ - Visual Genome: [train.json](https://datarelease.blob.core.windows.net/grit/VG_preprocessed_annotations/train.json), [test_caption.json](https://drive.google.com/file/d/1zF3UGHU1rvgTujinqJ-hZtrCBVsfsuel/view?usp=sharing), [images](https://nlp.stanford.edu/data/gqa/images.zip)
256
+ - Flickr30k: Download the train images from the [Flickr30K webpage](https://shannon.cs.illinois.edu/DenotationGraph/) or use the following [link](https://drive.google.com/file/d/1iomUn-Ht0OBfieMuyoVqEFj5PEmXfQ0U/view?usp=drive_link).
257
+ - RefCOCO images: `coco_2014` - COCO-2014 ([train2014](http://images.cocodataset.org/zips/train2014.zip))
258
+ Download the data from the source links, and arrange as follows:
259
+
260
+ ```
261
+ ├── RefCoco_Reg
262
+ │ ├── mdetr_annotations
263
+ │ │ ├── finetune_refcoco_train.json
264
+ │ │ ├── finetune_refcocog_train.json
265
+ │ │ ├── finetune_refcocog_val.json
266
+ │ │ ├── finetune_refcoco+_train.json
267
+ │ │ ├── final_flickr_mergedGT_train.json
268
+ ├── visual_genome
269
+ │ │ ├── test_caption.json
270
+ │ │ ├── train.json
271
+ │ │ ├── images
272
+ │ │ │ ├── 1000.jpg
273
+ │ │ │ ├── 1001.jpg
274
+ ├── flikcr_30k
275
+ │ ├── train
276
+ │ │ ├── 1000092795.jpg
277
+ │ │ ├── 10002456.jpg
278
+ ├── coco_2014
279
+ │ ├── train2014
280
+ │ │ ├── COCO_train2014_000000000009.jpg
281
+ │ │ ├── COCO_train2014_000000000025.jpg
282
+ ```
283
+
284
+ ### 5) Image Captioning
285
+ We use the COCO caption dataset.
286
+
287
+ Download links and structure:
288
+ - Annotations - [COCO - 2017 annotations](http://images.cocodataset.org/annotations/annotations_trainval2017.zip)
289
+ - Images: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
290
+
291
+ Structure as shown in the directory structure above.
292
+
293
+ ```
294
+ ├── coco_2017
295
+ │ ├── train2017
296
+ │ │ ├── 000000000009.jpg
297
+ │ │ ├── 000000000025.jpg
298
+ │ ├── annotations
299
+ │ │ ├── captions_train2017.json
300
+ │ │ ├── captions_val2017.json
301
+ ```
302
+
303
+ ### 6) Visual Question Answering
304
+ We use the LLaVA-instruct-150k set for visual question answering. Download and arrange as detailed below.
305
+
306
+ Download links and structure:
307
+ - Annotations - [LLaVA-instruct-150k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json)
308
+ - Images: `coco_2017` - COCO-2017 ([train2017](http://images.cocodataset.org/zips/train2017.zip))
309
+
310
+ ```
311
+ ├── llava_dataset
312
+ │ ├── llava_instruct_150k.json
313
+ ├── coco_2017
314
+ │ ├── train2017
315
+ ```
316
+
317
+ ### 7) GranD pretraining Datasets
318
+
319
+ Depending on the task, we convert the GranD dataset into multiple LMDB-format annotation sets for pretraining. For details on how to prepare the annotations, please refer to: [Pretraining Annotations from GranD](../docs/GranD.md#preparing-the-pretraining-annotations-from-grand-).
320
+
321
+ - For image-level captioning:
322
+ - Short Captioning: [GrandShortCaptionDataset](../dataset/caption_datasets/GranD_ShortCaption_ds.py)
323
+ - For referring expression generation and referring expression segmentation:
324
+ - Region-level captioning (referring expression generation): [GrandReferRegDataset](../dataset/region_datasets/GranD_ReferringRegion_ds.py)
325
+ - Referring expression segmentation: [GrandReferSegmDataset](../dataset/segm_datasets/GranD_ReferringSegm_ds.py)
326
+
327
+
groundingLMM/docs/evaluation.md ADDED
@@ -0,0 +1,75 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Evaluating GLaMM 🔍
2
+ This guide provides instructions on evaluating the pretrained GLaMM models on downstream tasks, including Grounded Conversation Generation (GCG), referring expression segmentation, and region-level captioning.
3
+
4
+
5
+ ### 1) Grounded Conversation Generation (GCG) 🗨️
6
+ Run the following command to evaluate the GLaMM model on the GCG task:
7
+
8
+ ```bash
9
+ bash eval/gcg/run_evaluation.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
10
+ ```
11
+
12
+ <p align="center">
13
+ <img src="../images/tables/GCG_Table.png" alt="GCG_Table">
14
+ </p>
15
+
16
+
17
+ To evaluate the provided finetuned GCG model, run:
18
+
19
+ ```bash
20
+ bash eval/gcg/run_evaluation.sh 'MBZUAI/GLaMM-GCG' './results_gcg_finetuned'
21
+ ```
22
+ This will automatically download the `MBZUAI/GLaMM-GCG` from HuggingFace.
23
+
24
+
25
+ ### 2) Referring Expression Segmentation 🎯
26
+ Run the following command to evaluate the GLaMM model on the referring expression segmentation task:
27
+
28
+ ```bash
29
+ bash eval/referring_seg/run_evaluation.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
30
+ ```
31
+
32
+ To evaluate the provided finetuned RefSeg model, run:
33
+
34
+ ```bash
35
+ bash eval/referring_seg/run_evaluation.sh 'MBZUAI/GLaMM-RefSeg' './results_refseg_finetuned'
36
+ ```
37
+ This will automatically download the `MBZUAI/GLaMM-RefSeg` from HuggingFace.
38
+
39
+
40
+ <p align="center">
41
+ <img src="../images/tables/ReferSeg_Table.png" alt="Table_RefSeg">
42
+ </p>
43
+
44
+
45
+ ### 3) Region-level Captioning 🖼️
46
+ Run the following command to evaluate the GLaMM model on the region-level captioning task:
47
+
48
+ #### RefCOCOg
49
+ ```bash
50
+ bash eval/region_captioning/run_evaluation_RefCOCOg.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
51
+ ```
52
+
53
+ To evaluate the provided finetuned RefCOCOg model, run:
54
+
55
+ ```bash
56
+ bash eval/region_captioning/run_evaluation_RefCOCOg.sh 'MBZUAI/GLaMM-RegCap-RefCOCOg' './results_regcap_refcocog_finetuned'
57
+ ```
58
+ This will automatically download the `MBZUAI/GLaMM-RegCap-RefCOCOg` from HuggingFace.
59
+
60
+
61
+ #### Visual Genome
62
+ ```bash
63
+ bash eval/region_captioning/run_evaluation_VG.sh 'path/to/the/HF/checkpoints/path' 'path/to/the/directory/to/save/the/evaluation/results'
64
+ ```
65
+
66
+ To evaluate the provided finetuned VG model, run:
67
+
68
+ ```bash
69
+ bash eval/region_captioning/run_evaluation_VG.sh 'MBZUAI/GLaMM-RegCap-VG' './results_regcap_vg_finetuned'
70
+ ```
71
+ This will automatically download the `MBZUAI/GLaMM-RegCap-VG` from HuggingFace.
72
+
73
+ <p align="center">
74
+ <img src="../images/tables/Region_Cap_Table.png" alt="Table_RegionCap">
75
+ </p>
groundingLMM/docs/install.md ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Installation 🛠️
2
+ We recommend setting up a conda environment for the project:
3
+
4
+ ```bash
5
+ conda create --name=glamm python=3.10
6
+ conda activate glamm
7
+
8
+ git clone https://github.com/mbzuai-oryx/groundingLMM.git
9
+ cd groundingLMM
10
+ pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
11
+ pip install -r requirements.txt
12
+
13
+ # Install mmcv
14
+ git clone https://github.com/open-mmlab/mmcv
15
+ cd mmcv
16
+ git checkout v1.4.7
17
+ MMCV_WITH_OPS=1 pip install -e .
18
+
19
+ export PYTHONPATH="./:$PYTHONPATH"
20
+ ```
21
+
22
+ Alternatively, we also provide the conda environment contents in a `.zip` file. Follow the steps below to set it up:
23
+
24
+ 1. Download `glamm_conda_env.zip` from the [google_drive link](https://drive.google.com/file/d/1BN10oChcoKDDd0zC8tU88JcrfmLpKpkB/view?usp=sharing).
25
+ 2. Extract the downloaded `zip` file:
26
+ ```bash
27
+ unzip glamm_conda_env.zip
28
+ ```
29
+ 3. Activate the environment:
30
+ ```bash
31
+ conda activate glamm
32
+ ```
33
+
34
+
groundingLMM/docs/model_zoo.md ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GLaMM Model Zoo 🚀
2
+
3
+ Welcome to the GLaMM Model Zoo! This repository contains a collection of state-of-the-art models from the GLaMM (Pixel Grounding Large Multimodal Model) family. Each model is designed for specific tasks in the realm of multimodal learning, combining visual and textual data processing.
4
+
5
+ ## Models Overview
6
+
7
+ The following table provides an overview of the available models in our zoo. For each model, you can find links to its Hugging Face page.
8
+
9
+ - To evaluate the pretrained models, please follow the instructions at [evaluation.md](evaluation.md).
10
+ - To run offline demo, please follow the instructions at [offline_demo.md](offline_demo.md).
11
+
12
+ | Model Name | Hugging Face Link | Summary |
13
+ |----------------------|-----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
14
+ | GLaMM-GranD-Pretrained | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-GranD-Pretrained) | Pretrained on GranD dataset. |
15
+ | GLaMM-FullScope | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-FullScope) | Model recommended for offline demo. |
16
+ | GLaMM-GCG | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-GCG) | Finetuned on GranD-f dataset for GCG task. |
17
+ | GLaMM-RefSeg | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-RefSeg) | Finetuned on RefCOCO, RefCOCO+ and RefCOCOg datasets for referring expression segmentation task. |
18
+ | GLaMM-RegCap-RefCOCOg | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-RegCap-RefCOCOg) | Finetuned on RefCOCOg for region captioning task. |
19
+ | GLaMM-RegCap-VG | [Hugging Face](https://huggingface.co/MBZUAI/GLaMM-RegCap-VG) | Finetuned on Visual Genome dataset for region captioning task. |
20
+
21
+ Note that all models are finetuned on `GLaMM-GranD-Pretrained`.
groundingLMM/docs/offline_demo.md ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GLaMM Demo Installation and Usage Guide 🚀
2
+
3
+ Welcome to the GLaMM Demo! This guide will walk you through the process of setting up and running the GLaMM Demo on your local GPU machine. Please ensure that your system meets the necessary requirements before proceeding.
4
+
5
+ ## System Requirements
6
+ - GPU with at least 24 GB of memory
7
+ - Git and Git LFS installed
8
+ - [GLaMM environment](../docs/install.md)
9
+ - Install [gradio-box](https://github.com/ShoufaChen/gradio-box?tab=readme-ov-file#3-install-gradio): Follow the instructions below to install Gradio-Box.
10
+ ```bash
11
+ git clone https://github.com/ShoufaChen/gradio-dev.git
12
+ cd gradio-dev
13
+ bash scripts/build_frontend.sh
14
+ pip install -e .
15
+ ```
16
+ - Version Requirements: Your installation should include the following specific versions:
17
+ - Gradio version 3.35.2
18
+ - Gradio-Client version 0.2.7
19
+ ## Installation Steps
20
+
21
+ ### 1. Clone the GLaMM Repository
22
+ First, you need to clone the GLaMM repository from GitHub. Open your terminal and run the following command:
23
+
24
+ ```bash
25
+ git clone https://github.com/mbzuai-oryx/groundingLMM.git
26
+ ```
27
+
28
+ ### 2. Download GLaMM Weights
29
+ To download the GLaMM model weights, you will need Git LFS. If you haven't installed Git LFS, you can do so by running:
30
+
31
+ ```bash
32
+ git lfs install
33
+ ```
34
+ Once Git LFS is installed, proceed to clone the GLaMM FullScope model:
35
+
36
+ ```bash
37
+ git clone https://huggingface.co/MBZUAI/GLaMM-FullScope
38
+ ```
39
+
40
+ For more information on the GLaMM FullScope model, visit [this link](https://huggingface.co/MBZUAI/GLaMM-FullScope).
41
+
42
+
43
+ ### 3. Run the Demo
44
+
45
+ Navigate to the directory where the repository was cloned and run the demo using Python. Replace `path/to/GLaMM_FullScope_model` with the actual path to the downloaded GLaMM FullScope model:
46
+ ```bash
47
+ python app.py --version "path/to/GLaMM_FullScope_model"
48
+
49
+ ```
50
+
51
+ Once the demo is running, follow the on-screen instructions to open the demo dashboard in your web browser. The dashboard provides a user-friendly interface for interacting with the GLaMM model.
groundingLMM/docs/training.md ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Training GLaMM 🚀
2
+ GLaMM is pre-trained on the GranD dataset and then fine-tuned on multiple downstream tasks including Grounded Conversation Generation (GCG), referring expression segmentation, region-level captioning, and image-level captioning using OpenSource datasets.
3
+
4
+ ## Downstream Task-Specific Training 🛠️
5
+
6
+ This section explains how to perform downstream fine-tuning using the pretrained GLaMM model checkpoints.
7
+
8
+ ### Preparing the OpenSource Datasets 📂
9
+
10
+ Refer to the [datasets readme](../docs/datasets.md) for details on organizing the data.
11
+
12
+ Generic settings:
13
+ - Path to the GLaMM GranD pretrained Hugging Face model: `PRETRAINED_HF_PATH=MBZUAI/GLaMM-GranD-Pretrained`
14
+ - Path to the Grounding Image Encoder Checkpoints (SAM pretrained weights): `GROUNDING_ENC_CKPT_PATH=./checkpoints/sam_vit_h_4b8939.pth`
15
+
16
+ ### 1) Grounded Conversation Generation (GCG) 🗨️
17
+
18
+ For GCG, the model is fine-tuned on two types of datasets: (i) GranD-f Dataset and (ii) Semantic Segmentation Datasets.
19
+ - [GranD-f datasets](../docs/datasets.md#1-grand-f-grounded-conversation-generation-gcg-dataset): RefCoco_GCG, PSG_GCG, Flickr_GCG, GranDf_GCG
20
+ - [Semantic Segmentation Datasets](../docs/datasets.md#2-semantic-segmentation-datasets): ade20k, cocostuff, pascal_part, paco_lvis, mapillary
21
+
22
+ ```bash
23
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_segm_data --seg_dataset "Semantic_Segm||RefCoco_GCG||PSG_GCG||Flickr_GCG||GranDf_GCG" --segm_sample_rates "1,3,3,3,1" --val_dataset "FlickrGCGVal|RefCocoGCGVal|PsgGCGVal" --epochs 10 --steps_per_epoch 500 --mask_validation
24
+ ```
25
+
26
+ ### 2) Region-level Captioning 🖼️
27
+
28
+ For region-level captioning, the model is fine-tuned on specific datasets:
29
+ - [Region-level Captioning Dataset](../docs/datasets.md#4-region-level-captioning-datasets-expression-generation): RefCocoG_Reg, VisGenomeRegVal
30
+
31
+ For RefCOCOg:
32
+ ```bash
33
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_reg_data --reg_dataset 'RefCocoG_Reg' --reg_sample_rates "1" --val_dataset 'RefCOCOgRegVal' --epochs 5 --steps_per_epoch 500
34
+ ```
35
+ For Visual Genome:
36
+ ```bash
37
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_reg_data --reg_dataset 'VisGen_Reg' --reg_sample_rates "1" --val_dataset 'VisGenomeRegVal' --epochs 5 --steps_per_epoch 500
38
+ ```
39
+
40
+ ### 3) Referring Expression Segmentation 🎯
41
+
42
+ For results on RefCOCO, RefCOCO+ and RefCOCOg datasets, the model is fine-tuned using the following datasets:
43
+ - [Referring Expression Dataset](../docs/datasets.md#3-referring-expression-datasets): refcoco, refcoco+, refcocog
44
+
45
+ ```bash
46
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_segm_data --seg_dataset "Refer_Segm" --segm_sample_rates "1" --refer_segm_data "refcoco||refcoco+||refcocog" --val_dataset "RefCOCOgSegVal" --epochs 5 --steps_per_epoch 350 --mask_validation
47
+ ```
48
+
49
+ ### 4) Finetuning on Combined Tasks 🌍
50
+ To enable combined capabilities in tasks like Grounded Conversation Generation (GCG), Image-level captioning, Visual-question answering, Region-level captioning, and Referring Expression Segmentation, finetune GLaMM using a mix of open-source datasets. This training replicates the model used in the demo.
51
+
52
+ Refer to [datasets readme](../docs/datasets.md) for data preparation details.
53
+
54
+ The `train.py` script is pre-configured with default argument values optimized for combined open-source training. However, for clarity and customization, we detail all essential arguments below:
55
+
56
+ ```bash
57
+ deepspeed --master_port $MASTER_PORT train.py --version $PRETRAINED_HF_PATH --dataset_dir ./data/ --vision_pretrained $GROUNDING_ENC_CKPT_PATH --exp_name $OUTPUT_DIR_PATH --lora_r 8 --lr 3e-4 --pretrained --use_cap_data --use_reg_data --use_segm_data --cap_dataset "CocoCap||LLaVaInstruct" --cap_sample_rate "1,2" --reg_dataset "RefCoco_Reg||RefCocoG_Reg||RefCocoP_Reg||VisGen_Reg||Flickr_Reg" --reg_sample_rates "1,1,1,1,1" --seg_dataset "Semantic_Segm||Refer_Segm||RefCoco_GCG||PSG_GCG||Flickr_GCG||GranDf_GCG" --segm_sample_rates "4,3,2,2,2,1" --val_dataset "FlickrGCGVal|RefCocoGCGVal|PsgGCGVal" --epochs 10 --steps_per_epoch 500
58
+ ```
59
+
60
+ ### Merge LORA Weights
61
+ We use LORA finetuning for downstream tasks. Please follow the instructions below to merge LORA weights after training.
62
+
63
+ After training, the saved checkpoints directory will look like:
64
+ ```
65
+ ├── global_step5000
66
+ │ ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
67
+ │ ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
68
+ │ ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
69
+ │ ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
70
+ │ ├── bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
71
+ │ ├── bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
72
+ │ ├── bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
73
+ │ ├── bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
74
+ ```
75
+ Run the following command to merge the LORA weights:
76
+
77
+ ```bash
78
+ python zero_to_fp32.py . ./pytorch_model.bin
79
+
80
+ # From the root directory
81
+ export PYTHONPATH="./:$PYTHONPATH"
82
+ python scripts/merge_lora_weights.py --version 'MBZUAI/GLaMM-GranD-Pretrained' --weight 'path/to/pytorch_model.bin' --save_path 'path/to/save/the/merged/model/in/HF/format'
83
+ ```
groundingLMM/eval/region_captioning/evaluate.py ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+ import argparse
4
+ from pycocotools.coco import COCO
5
+ from pycocoevalcap.eval import COCOEvalCap
6
+
7
+
8
+ def parse_args():
9
+ parser = argparse.ArgumentParser(description="GLaMM Inference - Region Captioning")
10
+
11
+ parser.add_argument("--annotation_file",
12
+ default="data/RefCoco_Reg/mdetr_annotations/finetune_refcocog_val_captions.json", type=str,
13
+ help="Replace with 'data/visual_genome/test_caption.json' for VG.")
14
+ parser.add_argument("--results_dir", default="results", type=str, help="The path to save the results.")
15
+
16
+ return parser.parse_args()
17
+
18
+
19
+ def main():
20
+ args = parse_args()
21
+
22
+ # Load the annotation file
23
+ coco = COCO(args.annotation_file)
24
+
25
+ # Merge and load the results files
26
+ all_results = []
27
+ for result_file in os.listdir(args.results_dir):
28
+ all_results += json.load(open(f"{args.results_dir}/{result_file}", "r"))
29
+ merged_file_path = f"{args.results_dir}/merged.json"
30
+ with open(merged_file_path, 'w') as f:
31
+ json.dump(all_results, f)
32
+ coco_result = coco.loadRes(merged_file_path)
33
+
34
+ # Create coco_eval object by taking coco and coco_result
35
+ coco_eval = COCOEvalCap(coco, coco_result)
36
+
37
+ # Evaluate results
38
+ coco_eval.params['image_id'] = coco_result.getImgIds()
39
+ coco_eval.evaluate()
40
+
41
+ # Print and save the output evaluation scores
42
+ output_file_path = f"{args.results_dir}/metrics.txt"
43
+ f = open(output_file_path, 'w')
44
+ for metric, score in coco_eval.eval.items():
45
+ print(f'{metric}: {score:.3f}')
46
+ f.write(f"{metric}: {score:.3f}\n")
47
+ f.close()
48
+
49
+
50
+ if __name__ == "__main__":
51
+ main()
groundingLMM/eval/region_captioning/infer.py ADDED
@@ -0,0 +1,188 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import re
2
+ import cv2
3
+ import json
4
+ import argparse
5
+ from tqdm import tqdm
6
+ from transformers import AutoTokenizer, CLIPImageProcessor
7
+ from torch.utils.data import DataLoader, DistributedSampler
8
+
9
+ from eval.utils import *
10
+ from eval.ddp import *
11
+ from model.GLaMM import GLaMMForCausalLM
12
+ from model.llava import conversation as conversation_lib
13
+ from model.llava.mm_utils import tokenizer_image_token
14
+ from model.SAM.utils.transforms import ResizeLongestSide
15
+ from tools.utils import DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
16
+
17
+
18
+ def parse_args():
19
+ parser = argparse.ArgumentParser(description="GLaMM Inference - Region Captioning")
20
+
21
+ parser.add_argument("--hf_model_path", required=True, help="The model path in huggingface format.")
22
+ parser.add_argument("--annotation_file",
23
+ default="data/RefCoco_Reg/mdetr_annotations/finetune_refcocog_val_captions.json", type=str,
24
+ help="Replace with 'data/visual_genome/test_caption.json' for VG.")
25
+ parser.add_argument("--image_dir", default="data/coco_2014/train2014", type=str,
26
+ help="Replace with 'data/visual_genome/images' for VG")
27
+ parser.add_argument("--dataset", default="refcocog", type=str, help="Options are 'refcocog', 'vg'")
28
+ parser.add_argument("--results_dir", default="results", type=str, help="The path to save the results.")
29
+
30
+
31
+ parser.add_argument("--image_size", default=1024, type=int, help="image size")
32
+ parser.add_argument("--model_max_length", default=512, type=int)
33
+ parser.add_argument("--use_mm_start_end", action="store_true", default=True)
34
+ parser.add_argument("--conv_type", default="llava_v1", type=str, choices=["llava_v1", "llava_llama_2"], )
35
+
36
+ # DDP Related parameters
37
+ parser.add_argument("--batch_size_per_gpu", required=False, default=1)
38
+ parser.add_argument('--world_size', default=1, type=int, help='number of distributed processes')
39
+ parser.add_argument('--local_rank', default=-1, type=int)
40
+ parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
41
+
42
+ return parser.parse_args()
43
+
44
+
45
+ def inference(instructions, inputs):
46
+ # Extract the inputs
47
+ bbox_img = inputs['boxes']
48
+ image_path = inputs['image']
49
+
50
+ instructions = instructions.replace('&lt;', '<').replace('&gt;', '>')
51
+
52
+ # Prepare prompt for model Inference
53
+ conv = conversation_lib.conv_templates[args.conv_type].copy()
54
+ conv.messages = []
55
+ begin_str = f"""The {DEFAULT_IMAGE_TOKEN} provides an overview of the picture.\n"""
56
+ prompt = begin_str + instructions
57
+ if args.use_mm_start_end:
58
+ replace_token = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN)
59
+ prompt = prompt.replace(DEFAULT_IMAGE_TOKEN, replace_token)
60
+ conv.append_message(conv.roles[0], prompt)
61
+ conv.append_message(conv.roles[1], "")
62
+ prompt = conv.get_prompt()
63
+
64
+ # Read and preprocess the image (Global image encoder - CLIP)
65
+ image_np = cv2.imread(image_path)
66
+ image_np = cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
67
+ original_size_list = [image_np.shape[:2]]
68
+ image_clip = (clip_image_processor.preprocess(image_np, return_tensors="pt")["pixel_values"][0].unsqueeze(0).cuda())
69
+ image_clip = image_clip.bfloat16() # Precision is bf16 by default
70
+
71
+ # Preprocess the image (Grounding image encoder)
72
+ image = transform.apply_image(image_np)
73
+ resize_list = [image.shape[:2]]
74
+ image = (
75
+ grounding_image_ecoder_preprocess(torch.from_numpy(image).permute(2, 0, 1).contiguous()).unsqueeze(0).cuda())
76
+ image = image.bfloat16() # Precision is bf16 by default
77
+
78
+ # Prepare inputs for inference
79
+ input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors="pt")
80
+ input_ids = input_ids.unsqueeze(0).cuda()
81
+ bboxes = None
82
+ if len(bbox_img) > 0:
83
+ height, width = original_size_list[0] # Original Image Dimensions
84
+
85
+ # Rescaling BBox to 336*336
86
+ x_scale, y_scale = 336 / width, 336 / height
87
+ bboxes_scaled = [[bbox[0] * x_scale, bbox[1] * y_scale,
88
+ bbox[2] * x_scale, bbox[3] * y_scale] for bbox in bbox_img]
89
+ ori_bboxes = np.array(bboxes_scaled, dtype=np.float64)
90
+ height_sc, width_sc = (336, 336) # To normalize the Image
91
+ norm_bboxes = ori_bboxes / np.array([width_sc, height_sc, width_sc, height_sc])
92
+ bboxes = [torch.tensor(norm_bboxes).cuda().half().to(torch.bfloat16)]
93
+
94
+ # Generate output
95
+ output_ids, pred_masks = model.evaluate(image_clip, image, input_ids, resize_list, original_size_list,
96
+ max_tokens_new=512, bboxes=bboxes)
97
+ output_ids = output_ids[0][output_ids[0] != IMAGE_TOKEN_INDEX]
98
+
99
+ # Post-processing
100
+ text_output = tokenizer.decode(output_ids, skip_special_tokens=False)
101
+ text_output = text_output.replace("\n", "").replace("  ", " ")  # drop newlines and collapse double spaces
102
+ text_output = text_output.split("ASSISTANT: ")[-1]
103
+
104
+ cleaned_str = re.sub(r'<.*?>', '', text_output)
105
+
106
+ # Remove the [SEG] token
107
+ cleaned_str = cleaned_str.replace('[SEG]', '')
108
+
109
+ # Strip unnecessary spaces
110
+ cleaned_str = ' '.join(cleaned_str.split()).strip("'")
111
+ cleaned_str = cleaned_str.strip()
112
+
113
+ return cleaned_str
114
+
115
+
116
+ def custom_collate_fn(batch):
117
+ image_id = [item[0] for item in batch]
118
+ filename = [item[1] for item in batch]
119
+ bbox = [item[2] for item in batch]
120
+ gt = [item[3] for item in batch]
121
+
122
+ return image_id, filename, bbox, gt
123
+
124
+
125
+ if __name__ == "__main__":
126
+ args = parse_args()
127
+ init_distributed_mode(args)
128
+
129
+ # Initialize tokenizer and model
130
+ tokenizer = AutoTokenizer.from_pretrained(args.hf_model_path, cache_dir=None,
131
+ model_max_length=args.model_max_length, padding_side="right",
132
+ use_fast=False)
133
+ tokenizer.pad_token = tokenizer.unk_token
134
+ seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
135
+ torch_dtype = torch.bfloat16 # By default, using bf16
136
+ kwargs = {"torch_dtype": torch_dtype}
137
+ model = GLaMMForCausalLM.from_pretrained(args.hf_model_path, low_cpu_mem_usage=True,
138
+ seg_token_idx=seg_token_idx, **kwargs)
139
+ # Update model config
140
+ model.config.eos_token_id = tokenizer.eos_token_id
141
+ model.config.bos_token_id = tokenizer.bos_token_id
142
+ model.config.pad_token_id = tokenizer.pad_token_id
143
+
144
+ # Initialize Global Image Encoder (CLIP)
145
+ model.get_model().initialize_vision_modules(model.get_model().config)
146
+ vision_tower = model.get_model().get_vision_tower()
147
+ vision_tower.to(dtype=torch_dtype)
148
+
149
+ # Transfer the model to GPU
150
+ model = model.bfloat16().cuda() # Replace with model = model.float().cuda() for 32 bit inference
151
+ vision_tower = model.get_model().get_vision_tower()
152
+ vision_tower.to(device="cuda")
153
+
154
+ # Initialize Image Processor for Global Image Encoder (CLIP)
155
+ clip_image_processor = CLIPImageProcessor.from_pretrained(model.config.vision_tower)
156
+ transform = ResizeLongestSide(args.image_size)
157
+
158
+ model.eval() # Model should be in evaluation mode for inference
159
+
160
+ # Prompt the model to perform the region captioning task
161
+ instruction = "Can you provide me with a detailed description of the region in the picture marked by <bbox>?"
162
+
163
+ # Intermediate results path is hard-coded (you may change it as per your needs)
164
+ os.makedirs(args.results_dir, exist_ok=True)
165
+ results_path = f"{args.results_dir}/{os.path.basename(args.hf_model_path)}_{args.dataset}_{args.rank}.json"
166
+
167
+ # Create DDP Dataset
168
+ dataset = RegionCapDDP(args.annotation_file)
169
+ distributed_sampler = DistributedSampler(dataset, rank=args.rank, shuffle=False)
170
+ dataloader = DataLoader(dataset, batch_size=args.batch_size_per_gpu, num_workers=2,
171
+ sampler=distributed_sampler, collate_fn=custom_collate_fn)
172
+
173
+ # Iterate over all the samples, perform inference and save results
174
+ results = []
175
+ for idx, (image_id, filename, bbox, gt) in enumerate(tqdm(dataloader)):
176
+ image_id, filename, bbox, gt = image_id[0], filename[0], bbox[0], gt[0]
177
+ image_path = os.path.join(args.image_dir, filename)
178
+ inputs = {'image': image_path, 'boxes': [bbox]}
179
+
180
+ result_caption = inference(instruction, inputs) # Perform inference
181
+
182
+ result_dict = {}
183
+ result_dict["image_id"] = image_id
184
+ result_dict["caption"] = result_caption
185
+ results.append(result_dict)
186
+
187
+ with open(results_path, 'w') as json_file:
188
+ json.dump(results, json_file, indent=2)
groundingLMM/eval/region_captioning/run_evaluation_VG.sh ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/sh
2
+
3
+ ## USAGE
4
+
5
+ ## bash eval/region_captioning/run_evaluation_VG.sh <path to the HF checkpoint> <path to the directory to save the evaluation results>
6
+
7
+ ## USAGE
8
+
9
+
10
+ export PYTHONPATH="./:$PYTHONPATH"
11
+ MASTER_PORT=24999
12
+ NUM_GPUS=1 # Adjust as per the number of available GPUs
13
+
14
+ # Positional arguments for the bash scripts
15
+ CKPT_PATH=$1
16
+ RESULT_PATH=$2
17
+
18
+ # Adjust if needed
19
+ ANNOTATION_FILE=./data/visual_genome/test_caption.json
20
+ IMAGE_DIR=./data/visual_genome/images
21
+ DATASET=vg
22
+
23
+ # Run Inference
24
+ torchrun --nnodes=1 --nproc_per_node="$NUM_GPUS" --master_port="$MASTER_PORT" eval/region_captioning/infer.py --hf_model_path "$CKPT_PATH" --annotation_file "$ANNOTATION_FILE" --image_dir "$IMAGE_DIR" --dataset "$DATASET" --results_dir "$RESULT_PATH"
25
+
26
+
27
+ # Evaluate
28
+ python eval/region_captioning/evaluate.py --annotation_file "$ANNOTATION_FILE" --results_dir "$RESULT_PATH"
groundingLMM/gradio-dev/.dockerignore ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python build
2
+ .eggs/
3
+ gradio.egg-info/*
4
+ !gradio.egg-info/requires.txt
5
+ !gradio.egg-info/PKG-INFO
6
+ dist/
7
+ *.pyc
8
+ __pycache__/
9
+ *.py[cod]
10
+ *$py.class
11
+ build/
12
+
13
+ # JS build
14
+ gradio/templates/frontend/static
15
+ gradio/templates/frontend/cdn
16
+
17
+ # Secrets
18
+ .env
19
+
20
+ # Gradio run artifacts
21
+ *.db
22
+ *.sqlite3
23
+ gradio/launches.json
24
+
25
+ # Tests
26
+ .coverage
27
+ coverage.xml
28
+ test.txt
29
+
30
+ # Demos
31
+ demo/tmp.zip
32
+ demo/flagged
33
+ demo/files/*.avi
34
+ demo/files/*.mp4
35
+
36
+ # Etc
37
+ .idea/*
38
+ .DS_Store
39
+ *.bak
40
+ workspace.code-workspace
groundingLMM/gradio-dev/.editorconfig ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+
2
+ root = true
3
+
4
+ [{js/**,client/js/**}]
5
+ end_of_line = lf
6
+ insert_final_newline = true
7
+ indent_style = tab
8
+ tab_width = 2
groundingLMM/gradio-dev/.gitignore ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python build
2
+ .eggs/
3
+ gradio.egg-info
4
+ dist/
5
+ *.pyc
6
+ __pycache__/
7
+ *.py[cod]
8
+ *$py.class
9
+ build/
10
+ __tmp/*
11
+
12
+ # JS build
13
+ gradio/templates/cdn
14
+ gradio/templates/frontend
15
+
16
+ # Secrets
17
+ .env
18
+
19
+ # Gradio run artifacts
20
+ *.db
21
+ *.sqlite3
22
+ gradio/launches.json
23
+ flagged/
24
+ gradio_cached_examples/
25
+ tmp.zip
26
+
27
+ # Tests
28
+ .coverage
29
+ coverage.xml
30
+ test.txt
31
+ **/snapshots/**/*.png
32
+
33
+ # Demos
34
+ demo/tmp.zip
35
+ demo/files/*.avi
36
+ demo/files/*.mp4
37
+ demo/all_demos/demos/*
38
+ demo/all_demos/requirements.txt
39
+ demo/*/config.json
40
+
41
+ # Etc
42
+ .idea/*
43
+ .vscode/*
44
+ .DS_Store
45
+ *.bak
46
+ workspace.code-workspace
47
+ *.h5
48
+
49
+ # dev containers
50
+ .pnpm-store/
51
+
52
+ # log files
53
+ .pnpm-debug.log
54
+
55
+ # Local virtualenv for devs
56
+ .venv*
57
+
58
+ # FRP
59
+ gradio/frpc_*
60
+
61
+ # js
62
+ node_modules
63
+ public/build/
64
+ test-results
65
+ client/js/test.js
groundingLMM/gradio-dev/CHANGELOG.md ADDED
The diff for this file is too large to render. See raw diff
 
groundingLMM/gradio-dev/CITATION.cff ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ cff-version: 1.2.0
2
+ message: Please cite this project using these metadata.
3
+ title: "Gradio: Hassle-free sharing and testing of ML models in the wild"
4
+ abstract: >-
5
+ Accessibility is a major challenge of machine learning (ML).
6
+ Typical ML models are built by specialists and require
7
+ specialized hardware/software as well as ML experience to
8
+ validate. This makes it challenging for non-technical
9
+ collaborators and endpoint users (e.g. physicians) to easily
10
+ provide feedback on model development and to gain trust in
11
+ ML. The accessibility challenge also makes collaboration
12
+ more difficult and limits the ML researcher's exposure to
13
+ realistic data and scenarios that occur in the wild. To
14
+ improve accessibility and facilitate collaboration, we
15
+ developed an open-source Python package, Gradio, which
16
+ allows researchers to rapidly generate a visual interface
17
+ for their ML models. Gradio makes accessing any ML model as
18
+ easy as sharing a URL. Our development of Gradio is informed
19
+ by interviews with a number of machine learning researchers
20
+ who participate in interdisciplinary collaborations. Their
21
+ feedback identified that Gradio should support a variety of
22
+ interfaces and frameworks, allow for easy sharing of the
23
+ interface, allow for input manipulation and interactive
24
+ inference by the domain expert, as well as allow embedding
25
+ the interface in iPython notebooks. We developed these
26
+ features and carried out a case study to understand Gradio's
27
+ usefulness and usability in the setting of a machine
28
+ learning collaboration between a researcher and a
29
+ cardiologist.
30
+ authors:
31
+ - family-names: Abid
32
+ given-names: Abubakar
33
+ - family-names: Abdalla
34
+ given-names: Ali
35
+ - family-names: Abid
36
+ given-names: Ali
37
+ - family-names: Khan
38
+ given-names: Dawood
39
+ - family-names: Alfozan
40
+ given-names: Abdulrahman
41
+ - family-names: Zou
42
+ given-names: James
43
+ doi: 10.48550/arXiv.1906.02569
44
+ date-released: 2019-06-06
45
+ url: https://arxiv.org/abs/1906.02569
groundingLMM/gradio-dev/CONTRIBUTING.md ADDED
@@ -0,0 +1,138 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Contributing to Gradio
2
+
3
+ Prerequisites:
4
+
5
+ - [Python 3.8+](https://www.python.org/downloads/)
6
+ - [Node.js v16.14+](https://nodejs.dev/en/download/package-manager/) (only needed if you are making changes to the frontend)
7
+ - [pnpm 8.1+](https://pnpm.io/8.x/installation) (only needed if you are making changes to the frontend)
8
+
9
+ More than 80 awesome developers have contributed to the `gradio` library, and we'd be thrilled if you would like to be the next `gradio` contributor! Start by cloning this repo and installing Gradio locally:
10
+
11
+ ### Install Gradio locally from the `main` branch
12
+
13
+ - Clone this repo
14
+ - Navigate to the repo folder and run
15
+
16
+ ```bash
17
+ bash scripts/install_gradio.sh
18
+ ```
19
+
20
+ - Build the front end
21
+
22
+ ```
23
+ bash scripts/build_frontend.sh
24
+ ```
25
+
26
+ ### Install development requirements
27
+
28
+ In order to be able to run the Python linter, formatter, and unit tests, do the following:
29
+
30
+ - Navigate to the repo folder and install test requirements (note that it is highly recommended to use a virtual environment running **Python 3.9** since the versions are pinned)
31
+
32
+ ```
33
+ bash scripts/install_test_requirements.sh
34
+ ```
35
+
36
+ - If you have a different Python version and conflicting packages during the installation, please first run:
37
+
38
+ ```
39
+ bash scripts/create_test_requirements.sh
40
+ ```
41
+
42
+ ### Using dev containers
43
+
44
+ Instead of the above steps, you can alternatively use dev containers. This is supported on all platforms (macOS/Windows/Linux).
45
+
46
+ Prerequisites:
47
+
48
+ - An editor which supports dev containers, like VS Code
49
+ - Docker support on the host computer:
50
+ - macOS: [Docker Desktop 2.0+](https://www.docker.com/products/docker-desktop/)
51
+ - Windows: [Docker Desktop 2.0+](https://www.docker.com/products/docker-desktop/)
52
+ - Linux: [Docker CE/EE 18.06+](https://docs.docker.com/get-docker/) and [Docker Compose 1.21+](https://docs.docker.com/compose/install/)
53
+ - If using VS Code, the [Dev Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) extension
54
+
55
+ Steps:
56
+
57
+ - Clone repository
58
+ - Open it in editor
59
+ - For VS Code, execute `Dev Containers: Reopen in container` command
60
+
61
+ For detailed instructions, please see the [Dev Containers tutorial](https://code.visualstudio.com/docs/devcontainers/tutorial).
62
+
63
+ ### Extra tidbits
64
+
65
+ - You can run gradio scripts in reload mode, which will watch for changes in the `gradio` folder and reload the app when changes are made.
66
+
67
+ ```
68
+ gradio app.py
69
+ ```
70
+
71
+ - To develop the frontend app, you should also follow [js/README.md](js/README.md).
72
+
73
+ - To run all of the tests, do:
74
+
75
+ ```
76
+ bash scripts/run_all_tests.sh
77
+ ```
78
+
79
+ ### Structure of the Repository
80
+
81
+ It's helpful to know the overall structure of the repository so that you can focus on the part of the source code you'd like to contribute to:
82
+
83
+ - `/gradio`: contains the Python source code for the library
84
+ - `/gradio/interface.py`: contains the Python source code for the core `Interface` class
85
+ - `/gradio/blocks.py`: contains the Python source code for the core `Blocks` class
86
+ - `/gradio/components.py`: contains the Python source code for the `components`; you can add your custom components here.
87
+ - `/js`: contains the HTML/JS/CSS source code for the library ([start here for frontend changes](/js/README.md))
88
+ - `/test`: contains Python unit tests for the library
89
+ - `/demo`: contains demos that are used in the documentation; you can find `Gradio` examples here.
90
+ - `/website`: contains the code for the Gradio website (www.gradio.app). See the README in the `/website` folder for more details
91
+
92
+ ### Continuous Integration and Testing
93
+
94
+ All PRs must pass the continuous integration tests before merging. To test locally, you can run `python -m unittest` from the repo directory.
95
+
96
+ ## Submitting PRs
97
+
98
+ All PRs should be against `main`. Direct commits to main are blocked, and PRs require an approving review to merge into main. By convention, the Gradio maintainers will review PRs when:
99
+
100
+ - An initial review has been requested, and
101
+ - A description of the change (with a link to the GitHub PR) has been added to CHANGELOG.md, and
102
+ - A maintainer (@abidlabs, @aliabid94, @aliabd, @AK391, @dawoodkhan82, @pngwn, @freddyaboulton) is tagged in the PR comments and asked to complete a review
103
+
104
+ We ask that you make sure initial CI checks are passing before requesting a review. One of the Gradio maintainers will merge the PR when all the checks are passing.
105
+
106
+ Do not forget to format the backend and frontend before pushing:
107
+
108
+ ```
109
+ bash scripts/format_backend.sh
110
+ ```
111
+
112
+ ```
113
+ bash scripts/format_frontend.sh
114
+ ```
115
+
116
+ ## CI checks
117
+
118
+ Currently the following checks are run in CI:
119
+
120
+ ### Gradio library (`gradio` package)
121
+
122
+ ```
123
+ bash scripts/lint_backend.sh
124
+ bash scripts/type_check_backend.sh
125
+ python -m pytest -m "not flaky" --ignore=client
126
+ python -m pytest -m "flaky" --ignore=client
127
+ ```
128
+
129
+ ### Gradio client (`gradio_client` package)
130
+
131
+ ```
132
+ cd client/python
133
+ bash scripts/lint.sh
134
+ python -m pytest -m "not flaky"
135
+ python -m pytest -m "flaky"
136
+ ```
137
+
138
+ _Could these guidelines be clearer? Feel free to open a PR to help us facilitate open-source contributions!_
groundingLMM/gradio-dev/LICENSE ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
groundingLMM/gradio-dev/README.md ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ <div align="center">
3
+
4
+ # Gradio Box
5
+
6
+ This is the advanced gradio used in our paper, [GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest](https://arxiv.org/abs/2307.03601).
7
+ This is an extension to the official [gradio](https://gradio.app/) that adds support for drawing boxes on top of an image.
8
+ This feature was requested in https://github.com/gradio-app/gradio/issues/2316.
9
+
10
+ ![teaser](box_demo.gif)
11
+ </div>
12
+
13
+
14
+
15
+ ## Usage
16
+
17
+ See mini-demo:
18
+ ```
19
+ python app_box.py
20
+ ```
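+ 
+ At its core, the mini-demo relies on the `boxes` tool of the `Image` input component: the component passes a dict whose `image` key holds the PIL image and whose `mask` key holds the drawn boxes. A condensed version of `app_box.py`:
+ 
+ ```python
+ import gradio as gr
+ 
+ def predict(inp):
+     # 'image' is the PIL image; 'mask' holds the boxes drawn on top of it
+     image, boxes = inp['image'], inp['mask']
+     return [image.crop(box) for box in boxes]  # crop out each drawn box
+ 
+ demo = gr.Interface(fn=predict,
+                     inputs=gr.Image(tool="boxes", type="pil"),
+                     outputs=gr.Gallery())
+ demo.launch()
+ ```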
21
+
22
+
23
+ ## Install
24
+
25
+ ### 1. Install Node.js
26
+
27
+ We install it on Ubuntu with:
28
+
29
+ ```
30
+ curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.3/install.sh | bash
31
+
32
+ source ~/.bashrc # or ~/.zshrc based on which one you use
33
+
34
+ nvm install v18.16.0
35
+ ```
36
+
37
+
38
+ ### 2. Install pnpm
39
+
40
+ ```
41
+ curl -fsSL https://get.pnpm.io/install.sh | sh -
42
+
43
+ source ~/.bashrc # or ~/.zshrc based on which one you use
44
+
45
+ pnpm --version # check if success
46
+ ```
47
+
48
+ ### 3. Install gradio
49
+
50
+ ```
51
+ git clone https://github.com/ShoufaChen/gradio-dev.git
52
+
53
+ cd gradio-dev
54
+
55
+ bash scripts/build_frontend.sh
56
+
57
+ pip install -e .
58
+ ```
59
+
60
+
61
+ ## Common Installation Issues
62
+
63
+
64
+ <details>
65
+ <summary>
66
+  ERR_PNPM_FETCH_404  GET https://packagecloud.io/github/git-lfs/npm/whatwg-url/-/whatwg-url-5.0.0.tgz: Not Found - 404
67
+ No authorization header was set for the request.
68
+ </summary>
69
+ <br/>
70
+ https://github.com/pnpm/pnpm/issues/2933#issuecomment-975886322
71
+
72
+ ```
73
+ # Add following in `~/.npmrc` file
74
+ @OWNER:registry=https://packagecloud.io/github/git-lfs/npm/
75
+ ```
76
+
77
+ </details>
78
+
79
+
80
+ <details>
81
+ <summary>
82
+ ERROR: File "setup.py" not found. Directory cannot be installed in editable mode:
83
+ </summary>
84
+ <br/>
85
+ Use pip version >= 23.0.1
86
+
87
+ </details>
88
+
89
+
90
+ ## Acknowledgement
91
+
92
+ Our implementation is mainly inspired by https://github.com/gradio-app/gradio/pull/3220, with several modifications for the latest gradio.
93
+ Many thanks to [CtrlAltDeplete](https://github.com/CtrlAltDeplete).
94
+
groundingLMM/gradio-dev/README_old.md ADDED
@@ -0,0 +1,290 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!-- DO NOT EDIT THIS FILE DIRECTLY. INSTEAD EDIT THE `readme_template.md` OR `guides/1)getting_started/1)quickstart.md` TEMPLATES AND THEN RUN `render_readme.py` SCRIPT. -->
2
+
3
+
4
+ <div align="center">
5
+
6
+ [<img src="readme_files/gradio.svg" alt="gradio" width=300>](https://gradio.app)<br>
7
+ <em>Build & share delightful machine learning apps easily</em>
8
+
9
+ [![gradio-backend](https://github.com/gradio-app/gradio/actions/workflows/backend.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/backend.yml)
10
+ [![gradio-ui](https://github.com/gradio-app/gradio/actions/workflows/ui.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/ui.yml)
11
+ [![PyPI](https://img.shields.io/pypi/v/gradio)](https://pypi.org/project/gradio/)
12
+ [![PyPI downloads](https://img.shields.io/pypi/dm/gradio)](https://pypi.org/project/gradio/)
13
+ ![Python version](https://img.shields.io/badge/python-3.8+-important)
14
+ [![Twitter follow](https://img.shields.io/twitter/follow/gradio?style=social&label=follow)](https://twitter.com/gradio)
15
+
16
+ [Website](https://gradio.app)
17
+ | [Documentation](https://gradio.app/docs/)
18
+ | [Guides](https://gradio.app/guides/)
19
+ | [Getting Started](https://gradio.app/getting_started/)
20
+ | [Examples](demo/)
21
+ | [中文](readme_files/zh-cn#readme)
22
+ </div>
23
+
24
+ # Gradio: Build Machine Learning Web Apps — in Python
25
+
26
+ Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
27
+
28
+ With Gradio, you can quickly create a beautiful user interface around your machine learning models or data science workflow and let people "try it out" by dragging-and-dropping in their own images,
29
+ pasting text, recording their own voice, and interacting with your demo, all through the browser.
30
+
31
+ ![Interface montage](readme_files/header-image.jpg)
32
+
33
+ Gradio is useful for:
34
+
35
+ - **Demoing** your machine learning models for clients/collaborators/users/students.
36
+
37
+ - **Deploying** your models quickly with automatic shareable links and getting feedback on model performance.
38
+
39
+ - **Debugging** your model interactively during development using built-in manipulation and interpretation tools.
40
+
41
+ ## Quickstart
42
+
43
+ **Prerequisite**: Gradio requires Python 3.8 or higher, that's all!
44
+
45
+ ### What Does Gradio Do?
46
+
47
+ One of the *best ways to share* your machine learning model, API, or data science workflow with others is to create an **interactive app** that allows your users or colleagues to try out the demo in their browsers.
48
+
49
+ Gradio allows you to **build demos and share them, all in Python.** And usually in just a few lines of code! So let's get started.
50
+
51
+ ### Hello, World
52
+
53
+ To get Gradio running with a simple "Hello, World" example, follow these three steps:
54
+
55
+ 1\. Install Gradio using pip:
56
+
57
+ ```bash
58
+ pip install gradio
59
+ ```
60
+
61
+ 2\. Run the code below as a Python script or in a Jupyter Notebook (or [Google Colab](https://colab.research.google.com/drive/18ODkJvyxHutTN0P5APWyGFO_xwNcgHDZ?usp=sharing)):
62
+
63
+ ```python
64
+ import gradio as gr
65
+
66
+ def greet(name):
67
+ return "Hello " + name + "!"
68
+
69
+ demo = gr.Interface(fn=greet, inputs="text", outputs="text")
70
+
71
+ demo.launch()
72
+ ```
73
+
74
+
75
+ 3\. The demo below will appear automatically within the Jupyter Notebook, or pop in a browser on [http://localhost:7860](http://localhost:7860) if running from a script:
76
+
77
+ ![`hello_world` demo](demo/hello_world/screenshot.gif)
78
+
79
+ When developing locally, if you want to run the code as a Python script, you can use the Gradio CLI to launch the application **in reload mode**, which will provide seamless and fast development. Learn more about reloading in the [Auto-Reloading Guide](https://gradio.app/developing-faster-with-reload-mode/).
80
+
81
+ ```bash
82
+ gradio app.py
83
+ ```
84
+
85
+ Note: you can also do `python app.py`, but it won't provide the automatic reload mechanism.
86
+
87
+ ### The `Interface` Class
88
+
89
+ You'll notice that in order to make the demo, we created a `gradio.Interface`. This `Interface` class can wrap any Python function with a user interface. In the example above, we saw a simple text-based function, but the function could be anything from music generator to a tax calculator to the prediction function of a pretrained machine learning model.
90
+
91
+ The core `Interface` class is initialized with three required parameters:
92
+
93
+ - `fn`: the function to wrap a UI around
94
+ - `inputs`: which component(s) to use for the input (e.g. `"text"`, `"image"` or `"audio"`)
95
+ - `outputs`: which component(s) to use for the output (e.g. `"text"`, `"image"` or `"label"`)
96
+
97
+ Let's take a closer look at these components used to provide input and output.
98
+
99
+ ### Components Attributes
100
+
101
+ We saw some simple `Textbox` components in the previous examples, but what if you want to change how the UI components look or behave?
102
+
103
+ Let's say you want to customize the input text field — for example, you wanted it to be larger and have a text placeholder. If we use the actual class for `Textbox` instead of using the string shortcut, you have access to much more customizability through component attributes.
104
+
105
+ ```python
106
+ import gradio as gr
107
+
108
+ def greet(name):
109
+ return "Hello " + name + "!"
110
+
111
+ demo = gr.Interface(
112
+ fn=greet,
113
+ inputs=gr.Textbox(lines=2, placeholder="Name Here..."),
114
+ outputs="text",
115
+ )
116
+ demo.launch()
117
+ ```
118
+
119
+ ![`hello_world_2` demo](demo/hello_world_2/screenshot.gif)
120
+
121
+ ### Multiple Input and Output Components
122
+
123
+ Suppose you had a more complex function, with multiple inputs and outputs. In the example below, we define a function that takes a string, boolean, and number, and returns a string and number. Take a look how you pass a list of input and output components.
124
+
125
+ ```python
126
+ import gradio as gr
127
+
128
+ def greet(name, is_morning, temperature):
129
+ salutation = "Good morning" if is_morning else "Good evening"
130
+ greeting = f"{salutation} {name}. It is {temperature} degrees today"
131
+ celsius = (temperature - 32) * 5 / 9
132
+ return greeting, round(celsius, 2)
133
+
134
+ demo = gr.Interface(
135
+ fn=greet,
136
+ inputs=["text", "checkbox", gr.Slider(0, 100)],
137
+ outputs=["text", "number"],
138
+ )
139
+ demo.launch()
140
+ ```
141
+
142
+ ![`hello_world_3` demo](demo/hello_world_3/screenshot.gif)
143
+
144
+ You simply wrap the components in a list. Each component in the `inputs` list corresponds to one of the parameters of the function, in order. Each component in the `outputs` list corresponds to one of the values returned by the function, again in order.
145
+
146
+ ### An Image Example
147
+
148
+ Gradio supports many types of components, such as `Image`, `DataFrame`, `Video`, or `Label`. Let's try an image-to-image function to get a feel for these!
149
+
150
+ ```python
151
+ import numpy as np
152
+ import gradio as gr
153
+
154
+ def sepia(input_img):
155
+ sepia_filter = np.array([
156
+ [0.393, 0.769, 0.189],
157
+ [0.349, 0.686, 0.168],
158
+ [0.272, 0.534, 0.131]
159
+ ])
160
+ sepia_img = input_img.dot(sepia_filter.T)
161
+ sepia_img /= sepia_img.max()
162
+ return sepia_img
163
+
164
+ demo = gr.Interface(sepia, gr.Image(shape=(200, 200)), "image")
165
+ demo.launch()
166
+ ```
167
+
168
+ ![`sepia_filter` demo](demo/sepia_filter/screenshot.gif)
169
+
170
+ When using the `Image` component as input, your function will receive a NumPy array with the shape `(height, width, 3)`, where the last dimension represents the RGB values. We'll return an image as well in the form of a NumPy array.
171
+
172
+ You can also set the datatype used by the component with the `type=` keyword argument. For example, if you wanted your function to take a file path to an image instead of a NumPy array, the input `Image` component could be written as:
173
+
174
+ ```python
175
+ gr.Image(type="filepath", shape=...)
176
+ ```
177
+
178
+ Also note that our input `Image` component comes with an edit button 🖉, which allows for cropping and zooming into images. Manipulating images in this way can help reveal biases or hidden flaws in a machine learning model!
179
+
180
+ You can read more about the many components and how to use them in the [Gradio docs](https://gradio.app/docs).
181
+
182
+ ### Blocks: More Flexibility and Control
183
+
184
+ Gradio offers two classes to build apps:
185
+
186
+ 1\. **Interface**, that provides a high-level abstraction for creating demos that we've been discussing so far.
187
+
188
+ 2\. **Blocks**, a low-level API for designing web apps with more flexible layouts and data flows. Blocks allows you to do things like feature multiple data flows and demos, control where components appear on the page, handle complex data flows (e.g. outputs can serve as inputs to other functions), and update properties/visibility of components based on user interaction — still all in Python. If this customizability is what you need, try `Blocks` instead!
189
+
190
+ ### Hello, Blocks
191
+
192
+ Let's take a look at a simple example. Note how the API here differs from `Interface`.
193
+
194
+ ```python
195
+ import gradio as gr
196
+
197
+ def greet(name):
198
+ return "Hello " + name + "!"
199
+
200
+ with gr.Blocks() as demo:
201
+ name = gr.Textbox(label="Name")
202
+ output = gr.Textbox(label="Output Box")
203
+ greet_btn = gr.Button("Greet")
204
+ greet_btn.click(fn=greet, inputs=name, outputs=output)
205
+
206
+ demo.launch()
207
+ ```
208
+
209
+ ![`hello_blocks` demo](demo/hello_blocks/screenshot.gif)
210
+
211
+ Things to note:
212
+
213
+ - `Blocks` are made with a `with` clause, and any component created inside this clause is automatically added to the app.
214
+ - Components appear vertically in the app in the order they are created. (Later we will cover customizing layouts!)
215
+ - A `Button` was created, and then a `click` event-listener was added to this button. The API for this should look familiar! Like an `Interface`, the `click` method takes a Python function, input components, and output components.
216
+
217
+ ### More Complexity
218
+
219
+ Here's an app to give you a taste of what's possible with `Blocks`:
220
+
221
+ ```python
222
+ import numpy as np
223
+ import gradio as gr
224
+
225
+
226
+ def flip_text(x):
227
+ return x[::-1]
228
+
229
+
230
+ def flip_image(x):
231
+ return np.fliplr(x)
232
+
233
+
234
+ with gr.Blocks() as demo:
235
+ gr.Markdown("Flip text or image files using this demo.")
236
+ with gr.Tab("Flip Text"):
237
+ text_input = gr.Textbox()
238
+ text_output = gr.Textbox()
239
+ text_button = gr.Button("Flip")
240
+ with gr.Tab("Flip Image"):
241
+ with gr.Row():
242
+ image_input = gr.Image()
243
+ image_output = gr.Image()
244
+ image_button = gr.Button("Flip")
245
+
246
+ with gr.Accordion("Open for More!"):
247
+ gr.Markdown("Look at me...")
248
+
249
+ text_button.click(flip_text, inputs=text_input, outputs=text_output)
250
+ image_button.click(flip_image, inputs=image_input, outputs=image_output)
251
+
252
+ demo.launch()
253
+ ```
254
+
255
+ ![`blocks_flipper` demo](demo/blocks_flipper/screenshot.gif)
256
+
257
+ A lot more going on here! We'll cover how to create complex `Blocks` apps like this in the [building with blocks](https://gradio.app/building_with_blocks) section for you.
258
+
259
+ Congrats, you're now familiar with the basics of Gradio! 🥳 Go to our [next guide](https://gradio.app/key_features) to learn more about the key features of Gradio.
260
+
261
+
262
+ ## Open Source Stack
263
+
264
+ Gradio is built with many wonderful open-source libraries, please support them as well!
265
+
266
+ [<img src="readme_files/huggingface_mini.svg" alt="huggingface" height=40>](https://huggingface.co)
267
+ [<img src="readme_files/python.svg" alt="python" height=40>](https://www.python.org)
268
+ [<img src="readme_files/fastapi.svg" alt="fastapi" height=40>](https://fastapi.tiangolo.com)
269
+ [<img src="readme_files/encode.svg" alt="encode" height=40>](https://www.encode.io)
270
+ [<img src="readme_files/svelte.svg" alt="svelte" height=40>](https://svelte.dev)
271
+ [<img src="readme_files/vite.svg" alt="vite" height=40>](https://vitejs.dev)
272
+ [<img src="readme_files/pnpm.svg" alt="pnpm" height=40>](https://pnpm.io)
273
+ [<img src="readme_files/tailwind.svg" alt="tailwind" height=40>](https://tailwindcss.com)
274
+
275
+ ## License
276
+
277
+ Gradio is licensed under the Apache License 2.0 found in the [LICENSE](LICENSE) file in the root directory of this repository.
278
+
279
+ ## Citation
280
+
281
+ Also check out the paper *[Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild](https://arxiv.org/abs/1906.02569), ICML HILL 2019*, and please cite it if you use Gradio in your work.
282
+
283
+ ```
284
+ @article{abid2019gradio,
285
+ title = {Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild},
286
+ author = {Abid, Abubakar and Abdalla, Ali and Abid, Ali and Khan, Dawood and Alfozan, Abdulrahman and Zou, James},
287
+ journal = {arXiv preprint arXiv:1906.02569},
288
+ year = {2019},
289
+ }
290
+ ```
groundingLMM/gradio-dev/SECURITY.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ # Security Policy
2
+
3
+ ## Reporting a Vulnerability
4
+
5
+ If you discover a security vulnerability, we would be very grateful if you could email us at [email protected]. This is the preferred approach instead of opening a public issue. We take all vulnerability reports seriously and will work to patch the vulnerability immediately. Whenever possible, we will credit the person or people who reported the vulnerability after it has been patched.
groundingLMM/gradio-dev/app_box.py ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import gradio as gr
2
+
3
+
4
+ def predict(inp):
5
+ image = inp['image']
6
+ boxes = inp['mask']
7
+
8
+ sub_images = []
9
+ for box in boxes:
10
+ sub_images.append(image.crop(box))
11
+ return sub_images
12
+
13
+
14
+ demo = gr.Interface(fn=predict,
15
+ inputs=gr.Image(tool="boxes", type="pil"),
16
+ outputs=gr.Gallery())
17
+
18
+ demo.launch()
groundingLMM/gradio-dev/globals.d.ts ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ declare global {
2
+ interface Window {
3
+ __gradio_mode__: "app" | "website";
4
+ launchGradio: Function;
5
+ launchGradioFromSpaces: Function;
6
+ gradio_config: Config;
7
+ scoped_css_attach: (link: HTMLLinkElement) => void;
8
+ __is_colab__: boolean;
9
+ }
10
+ }
11
+
12
+ export interface Config {
13
+ auth_required: boolean | undefined;
14
+ auth_message: string;
15
+ components: any[];
16
+ css: string | null;
17
+ dependencies: any[];
18
+ dev_mode: boolean;
19
+ enable_queue: boolean;
20
+ layout: any;
21
+ mode: "blocks" | "interface";
22
+ root: string;
23
+ theme: string;
24
+ title: string;
25
+ version: string;
26
+ is_space: boolean;
27
+ is_colab: boolean;
28
+ show_api: boolean;
29
+ stylesheets: string[];
30
+ path: string;
31
+ }
groundingLMM/gradio-dev/package.json ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "name": "gradio-ui",
3
+ "version": "0.0.1",
4
+ "description": "Gradio UI packages",
5
+ "scripts": {
6
+ "workbench": "pnpm --filter @gradio/workbench dev",
7
+ "dev": "pnpm css && pnpm --filter @gradio/client build && pnpm --filter @gradio/app dev",
8
+ "css": "pnpm --filter @gradio/theme generate",
9
+ "build": "pnpm css && pnpm --filter @gradio/client build && pnpm --filter @gradio/app build:local --emptyOutDir",
10
+ "build:cdn": "pnpm --filter @gradio/client build && pnpm --filter @gradio/app build:cdn --emptyOutDir",
11
+ "build:website": "pnpm --filter @gradio/app build:website --emptyOutDir",
12
+ "build:cdn-local": "TEST_CDN=TRUE pnpm build:cdn",
13
+ "preview:cdn-server": "sirv ./gradio/templates/cdn --single --port=4321 --cors",
14
+ "preview:cdn-app": "pnpm --filter @gradio/cdn-test dev",
15
+ "preview:cdn-local": "run-p preview:cdn-server preview:cdn-app",
16
+ "format:check": "prettier --ignore-path .config/.prettierignore --check --plugin-search-dir=. .",
17
+ "format:write": "prettier --ignore-path .config/.prettierignore --write --plugin-search-dir=. .",
18
+ "ts:check": "svelte-check --tsconfig tsconfig.json",
19
+ "test": "pnpm --filter @gradio/client build && vitest dev --config .config/vitest.config.ts",
20
+ "test:run": "pnpm --filter @gradio/client build && vitest run --config .config/vitest.config.ts",
21
+ "test:node": "TEST_MODE=node pnpm vitest run --config .config/vitest.config.ts",
22
+ "test:browser": "pnpm --filter @gradio/app test:browser:full",
23
+ "test:browser:full": "run-s build test:browser",
24
+ "test:browser:debug": "pnpm --filter @gradio/app test:browser:debug",
25
+ "ci:publish": "pnpm publish --no-git-checks --access public -r",
26
+ "ci:version": "changeset version && pnpm i --lockfile-only"
27
+ },
28
+ "type": "module",
29
+ "author": "",
30
+ "license": "ISC",
31
+ "private": true,
32
+ "dependencies": {
33
+ "@changesets/changelog-github": "^0.4.8",
34
+ "@changesets/cli": "^2.26.1",
35
+ "@gradio/tootils": "workspace:^0.0.1",
36
+ "@playwright/test": "^1.27.1",
37
+ "@sveltejs/vite-plugin-svelte": "^1.0.0-next.44",
38
+ "@tailwindcss/forms": "^0.5.0",
39
+ "@testing-library/dom": "^8.11.3",
40
+ "@testing-library/jest-dom": "^5.16.5",
41
+ "@testing-library/svelte": "^3.1.0",
42
+ "@testing-library/user-event": "^13.5.0",
43
+ "autoprefixer": "^10.4.4",
44
+ "babylonjs": "^5.17.1",
45
+ "babylonjs-loaders": "^5.17.1",
46
+ "happy-dom": "^9.20.3",
47
+ "msw": "^1.0.0",
48
+ "node-html-parser": "^5.3.3",
49
+ "npm-run-all": "^4.1.5",
50
+ "playwright": "^1.27.1",
51
+ "plotly.js-dist-min": "^2.10.1",
52
+ "polka": "^1.0.0-next.22",
53
+ "pollen-css": "^4.6.1",
54
+ "postcss": "^8.4.6",
55
+ "postcss-custom-media": "8",
56
+ "postcss-nested": "^5.0.6",
57
+ "postcss-prefix-selector": "^1.16.0",
58
+ "prettier": "^2.6.2",
59
+ "prettier-plugin-css-order": "^1.3.0",
60
+ "prettier-plugin-svelte": "^2.10.0",
61
+ "sirv": "^2.0.2",
62
+ "sirv-cli": "^2.0.2",
63
+ "svelte": "^3.59.1",
64
+ "svelte-check": "^3.1.4",
65
+ "svelte-i18n": "^3.6.0",
66
+ "svelte-preprocess": "^5.0.3",
67
+ "tailwindcss": "^3.1.6",
68
+ "tinyspy": "^0.3.0",
69
+ "typescript": "^4.7.4",
70
+ "vite": "^4.2.1",
71
+ "vitest": "^0.29.8"
72
+ },
73
+ "devDependencies": {
74
+ "@types/three": "^0.138.0"
75
+ },
76
+ "prettier": {
77
+ "useTabs": true,
78
+ "singleQuote": false,
79
+ "trailingComma": "none",
80
+ "printWidth": 80,
81
+ "pluginSearchDirs": [
82
+ ".."
83
+ ]
84
+ }
85
+ }
groundingLMM/gradio-dev/pnpm-lock.yaml ADDED
The diff for this file is too large to render. See raw diff
 
groundingLMM/gradio-dev/pnpm-workspace.yaml ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ packages:
2
+ - 'js/*'
3
+ - 'client/js'
groundingLMM/gradio-dev/pyproject.toml ADDED
@@ -0,0 +1,113 @@
+ [build-system]
+ requires = ["hatchling", "hatch-requirements-txt", "hatch-fancy-pypi-readme>=22.5.0"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "gradio"
+ dynamic = ["version", "dependencies", "readme"]
+ description = "Python library for easily interacting with trained machine learning models"
+ license = "Apache-2.0"
+ requires-python = ">=3.8"
+ authors = [
+ { name = "Abubakar Abid", email = "[email protected]" },
+ { name = "Ali Abid", email = "[email protected]" },
+ { name = "Ali Abdalla", email = "[email protected]" },
+ { name = "Dawood Khan", email = "[email protected]" },
+ { name = "Ahsen Khaliq", email = "[email protected]" },
+ { name = "Pete Allen", email = "[email protected]" },
+ { name = "Ömer Faruk Özdemir", email = "[email protected]" },
+ ]
+ keywords = ["machine learning", "reproducibility", "visualization"]
+
+ classifiers = [
+ 'Development Status :: 5 - Production/Stable',
+ 'License :: OSI Approved :: Apache Software License',
+ 'Operating System :: OS Independent',
+ 'Programming Language :: Python :: 3',
+ 'Programming Language :: Python :: 3 :: Only',
+ 'Programming Language :: Python :: 3.8',
+ 'Programming Language :: Python :: 3.9',
+ 'Programming Language :: Python :: 3.10',
+ 'Programming Language :: Python :: 3.11',
+ 'Topic :: Scientific/Engineering',
+ 'Topic :: Scientific/Engineering :: Artificial Intelligence',
+ 'Topic :: Scientific/Engineering :: Visualization',
+ ]
+
+ [project.scripts]
+ gradio = "gradio.cli:cli"
+ upload_theme = "gradio.themes.upload_theme:main"
+
+ [project.urls]
+ Homepage = "https://github.com/gradio-app/gradio"
+
+ [tool.hatch.version]
+ path = "gradio/version.txt"
+ pattern = "(?P<version>.+)"
+
+ [tool.hatch.metadata.hooks.requirements_txt]
+ filename = "requirements.txt"
+
+ [tool.hatch.metadata.hooks.fancy-pypi-readme]
+ content-type = "text/markdown"
+ fragments = [
+ { path = "README.md" },
+ ]
+
+ [[tool.hatch.metadata.hooks.fancy-pypi-readme.substitutions]]
+ pattern = "(website/homepage|readme_files)/"
+ replacement = 'https://raw.githubusercontent.com/gradio-app/gradio/main/\g<1>/'
+
+ [[tool.hatch.metadata.hooks.fancy-pypi-readme.substitutions]]
+ pattern = 'demo/([\S]*.gif)'
+ replacement = 'https://raw.githubusercontent.com/gradio-app/gradio/main/demo/\g<1>'
+
+ [tool.hatch.build]
+ artifacts = [
+ "/gradio/templates",
+ ]
+
+
+ [tool.hatch.build.targets.sdist]
+ include = [
+ "/gradio",
+ "/test",
+ "/README.md",
+ "/requirements.txt",
+ ]
+
+ [tool.ruff]
+ target-version = "py37"
+ extend-select = [
+ "B",
+ "C",
+ "I",
+ "N",
+ "SIM",
+ "UP",
+ ]
+ ignore = [
+ "C901", # function is too complex (TODO: un-ignore this)
+ "B023", # function definition in loop (TODO: un-ignore this)
+ "B008", # function call in argument defaults
+ "B017", # pytest.raises considered evil
+ "B028", # explicit stacklevel for warnings
+ "E501", # from scripts/lint_backend.sh
+ "SIM105", # contextlib.suppress (has a performance cost)
+ "SIM117", # multiple nested with blocks (doesn't look good with gr.Row etc)
+ "UP007", # use X | Y for type annotations (TODO: can be enabled once Pydantic plays nice with them)
+ ]
+
+ [tool.ruff.per-file-ignores]
+ "demo/*" = [
+ "E402", # Demos may have imports not at the top
+ "E741", # Demos may have ambiguous variable names
+ "F405", # Demos may use star imports
+ "I", # Don't care about import order
+ ]
+ "gradio/__init__.py" = [
+ "F401", # "Imported but unused" (TODO: it would be better to be explicit and use __all__)
+ ]
+ "gradio/routes.py" = [
+ "UP006", # Pydantic on Python 3.7 requires old-style type annotations (TODO: drop when Python 3.7 is dropped)
+ ]
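For reference, the two `fancy-pypi-readme` substitutions above rewrite relative asset paths in the packaged README into absolute raw.githubusercontent.com URLs before the long description is published. Below is a minimal Python sketch of the equivalent transformation, using the same patterns declared in the TOML; the sample `readme` string is invented for illustration and this is not the hatch plugin's actual code.

```python
import re

# Illustrative README fragment containing the two kinds of relative paths.
readme = (
    "![montage](readme_files/header-image.jpg)\n"
    "![flip demo](demo/flip_text/screenshot.gif)"
)

# Mirror of the first substitution: readme_files/ and website/homepage/ paths
# become absolute raw GitHub URLs.
readme = re.sub(
    r"(website/homepage|readme_files)/",
    r"https://raw.githubusercontent.com/gradio-app/gradio/main/\g<1>/",
    readme,
)

# Mirror of the second substitution: the same treatment for demo GIFs.
readme = re.sub(
    r"demo/([\S]*.gif)",
    r"https://raw.githubusercontent.com/gradio-app/gradio/main/demo/\g<1>",
    readme,
)

print(readme)
```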
groundingLMM/gradio-dev/readme_template.md ADDED
@@ -0,0 +1,68 @@
+ <div align="center">
+
+ [<img src="readme_files/gradio.svg" alt="gradio" width=300>](https://gradio.app)<br>
+ <em>Build & share delightful machine learning apps easily</em>
+
+ [![gradio-backend](https://github.com/gradio-app/gradio/actions/workflows/backend.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/backend.yml)
+ [![gradio-ui](https://github.com/gradio-app/gradio/actions/workflows/ui.yml/badge.svg)](https://github.com/gradio-app/gradio/actions/workflows/ui.yml)
+ [![PyPI](https://img.shields.io/pypi/v/gradio)](https://pypi.org/project/gradio/)
+ [![PyPI downloads](https://img.shields.io/pypi/dm/gradio)](https://pypi.org/project/gradio/)
+ ![Python version](https://img.shields.io/badge/python-3.8+-important)
+ [![Twitter follow](https://img.shields.io/twitter/follow/gradio?style=social&label=follow)](https://twitter.com/gradio)
+
+ [Website](https://gradio.app)
+ | [Documentation](https://gradio.app/docs/)
+ | [Guides](https://gradio.app/guides/)
+ | [Getting Started](https://gradio.app/getting_started/)
+ | [Examples](demo/)
+ | [中文](readme_files/zh-cn#readme)
+ </div>
+
+ # Gradio: Build Machine Learning Web Apps — in Python
+
+ Gradio is an open-source Python library that is used to build machine learning and data science demos and web applications.
+
+ With Gradio, you can quickly create a beautiful user interface around your machine learning models or data science workflow and let people "try it out" by dragging-and-dropping in their own images,
+ pasting text, recording their own voice, and interacting with your demo, all through the browser.
+
+ ![Interface montage](readme_files/header-image.jpg)
+
+ Gradio is useful for:
+
+ - **Demoing** your machine learning models for clients/collaborators/users/students.
+
+ - **Deploying** your models quickly with automatic shareable links and getting feedback on model performance.
+
+ - **Debugging** your model interactively during development using built-in manipulation and interpretation tools.
+
+ $getting_started
+
+ ## Open Source Stack
+
+ Gradio is built with many wonderful open-source libraries, please support them as well!
+
+ [<img src="readme_files/huggingface_mini.svg" alt="huggingface" height=40>](https://huggingface.co)
+ [<img src="readme_files/python.svg" alt="python" height=40>](https://www.python.org)
+ [<img src="readme_files/fastapi.svg" alt="fastapi" height=40>](https://fastapi.tiangolo.com)
+ [<img src="readme_files/encode.svg" alt="encode" height=40>](https://www.encode.io)
+ [<img src="readme_files/svelte.svg" alt="svelte" height=40>](https://svelte.dev)
+ [<img src="readme_files/vite.svg" alt="vite" height=40>](https://vitejs.dev)
+ [<img src="readme_files/pnpm.svg" alt="pnpm" height=40>](https://pnpm.io)
+ [<img src="readme_files/tailwind.svg" alt="tailwind" height=40>](https://tailwindcss.com)
+
+ ## License
+
+ Gradio is licensed under the Apache License 2.0 found in the [LICENSE](LICENSE) file in the root directory of this repository.
+
+ ## Citation
+
+ Also check out the paper *[Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild](https://arxiv.org/abs/1906.02569), ICML HILL 2019*, and please cite it if you use Gradio in your work.
+
+ ```
+ @article{abid2019gradio,
+ title = {Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild},
+ author = {Abid, Abubakar and Abdalla, Ali and Abid, Ali and Khan, Dawood and Alfozan, Abdulrahman and Zou, James},
+ journal = {arXiv preprint arXiv:1906.02569},
+ year = {2019},
+ }
+ ```
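As a concrete companion to the README template above, which describes wrapping a model or function in a browser UI, a minimal Gradio demo looks roughly like the sketch below. The `greet` function is a toy stand-in for a real model; everything else uses Gradio's standard `Interface` API.

```python
import gradio as gr


def greet(name: str) -> str:
    # Stand-in for a model prediction; any Python callable works here.
    return f"Hello, {name}!"


# Map the function's inputs/outputs to UI components and serve them in the browser.
demo = gr.Interface(fn=greet, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch()  # share=True would additionally create a public, shareable link
```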
groundingLMM/gradio-dev/render_readme.py ADDED
@@ -0,0 +1,39 @@
+ #!/usr/bin/env python
+
+ import re
+ from pathlib import Path
+
+ README_TEMPLATE_FILEPATH = "readme_template.md"
+ GETTING_STARTED_TEMPLATE_FILEPATH = "guides/01_getting-started/01_quickstart.md"
+
+ readme_template = Path(README_TEMPLATE_FILEPATH).read_text()
+ getting_started_template = Path(GETTING_STARTED_TEMPLATE_FILEPATH).read_text()
+
+ # Extract all the code and demo tags from the getting started template
+ code_tags = re.findall(r"\$code_([^\s]+)", getting_started_template)
+ demo_tags = re.findall(r"\$demo_([^\s]+)", getting_started_template)
+ codes = {}
+ demos = {}
+
+ for src in code_tags:
+     context = Path(f"demo/{src}/run.py").read_text()
+     # Replace the condition to run the demo directly with actual launch code
+     context = re.sub(r"if __name__(.*[\n$]*)*", "demo.launch()", context)
+     codes[src] = f"```python\n{context}\n```\n"  # Convert to Markdown code block
+
+ for src in demo_tags:
+     demos[src] = f"![`{src}` demo](demo/{src}/screenshot.gif)"
+
+ # Replace the headers in the getting started template with a smaller header (e.g. H3 instead of H2) to
+ # make the README more readable and less cluttered.
+ getting_started_template = re.sub(r"^(#+)", r"#\1", getting_started_template, flags=re.MULTILINE)
+ readme_template = readme_template.replace("$getting_started", getting_started_template)
+
+ # Now put the codes and the screenshots in the README template
+ readme_template = re.sub(r"\$code_([^\s]+)", lambda x: codes[x.group(1)], readme_template)
+ readme_template = re.sub(r"\$demo_([^\s]+)", lambda x: demos[x.group(1)], readme_template)
+
+ # Save the README template to the actual README.md file (with a note about the editing)
+ EDITING_NOTE = ("<!-- DO NOT EDIT THIS FILE DIRECTLY. INSTEAD EDIT THE `readme_template.md` OR "
+                 "`guides/1)getting_started/1)quickstart.md` TEMPLATES AND THEN RUN `render_readme.py` SCRIPT. -->")
+ Path("README.md").write_text(f"{EDITING_NOTE}\n\n{readme_template}")
groundingLMM/gradio-dev/requirements.txt ADDED
@@ -0,0 +1,26 @@
+ aiofiles
+ aiohttp
+ altair>=4.2.0
+ fastapi
+ ffmpy
+ gradio_client>=0.2.7
+ httpx
+ huggingface_hub>=0.14.0
+ Jinja2
+ markdown-it-py[linkify]>=2.0.0
+ pygments>=2.12.0
+ mdit-py-plugins<=0.3.3
+ markupsafe
+ matplotlib
+ numpy
+ orjson
+ pandas
+ pillow
+ pydantic
+ python-multipart
+ pydub
+ pyyaml
+ requests
+ semantic_version
+ uvicorn>=0.14.0
+ websockets>=10.0
groundingLMM/gradio-dev/style.md ADDED
@@ -0,0 +1,160 @@
+ # component-styles
+
+ ## Textbox
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of text input          |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of text input          |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Number
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of text input          |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of text input          |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Slider
+
+ | name        | type   | description                    |
+ | ----------- | ------ | ------------------------------ |
+ | `container` | `bool` | show or hide the container box |
+
+ ## Checkbox
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of checkbox            |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of checkbox            |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Checkbox Group
+
+ | name             | type                                 | description                               |
+ | ---------------- | ------------------------------------ | ----------------------------------------- |
+ | `rounded`        | `bool` or `(bool, bool, bool, bool)` | corners of checkboxes                     |
+ | `container`      | `bool`                               | show or hide the container box            |
+ | `item_container` | `bool`                               | show or hide the checkbox container boxes |
+
+ ## Radio
+
+ | name             | type   | description                            |
+ | ---------------- | ------ | -------------------------------------- |
+ | `container`      | `bool` | show or hide the container box         |
+ | `item_container` | `bool` | show or hide the radio container boxes |
+
+ ## Dropdown
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of input               |
+ | `border`    | `bool` or `(bool, bool, bool, bool)` | borders of input               |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## Image
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Video
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Audio
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## File
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Dataframe
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Timeseries
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Label
+
+ | name        | type   | description                    |
+ | ----------- | ------ | ------------------------------ |
+ | `container` | `bool` | show or hide the container box |
+
+ ## HighlightedText
+
+ | name        | type                                 | description                    |
+ | ----------- | ------------------------------------ | ------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of labels              |
+ | `color_map` | `Dict[str, str]`                     | color map of labels and colors |
+ | `container` | `bool`                               | show or hide the container box |
+
+ ## JSON
+
+ | name        | type   | description                    |
+ | ----------- | ------ | ------------------------------ |
+ | `container` | `bool` | show or hide the container box |
+
+ ## HTML
+
+ Nothing
+
+ ## Gallery
+
+ | name        | type                                      | description                         |
+ | ----------- | ----------------------------------------- | ----------------------------------- |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)`      | corners of images                   |
+ | `grid`      | `int` or `(int, int, int, int, int, int)` | grid for images                     |
+ | `height`    | `"auto"`                                  | height of gallery (auto or default) |
+ | `container` | `bool`                                    | show or hide the container box      |
+
+ ## Chatbot
+
+ | name        | type                                 | description                                      |
+ | ----------- | ------------------------------------ | ------------------------------------------------ |
+ | `rounded`   | `bool` or `(bool, bool, bool, bool)` | corners of chat bubbles                          |
+ | `color_map` | `Dict[str, str]`                     | color map of user and bot color for chat bubbles |
+
+ ## Model3D
+
+ | name      | type                                 | description         |
+ | --------- | ------------------------------------ | ------------------- |
+ | `rounded` | `bool` or `(bool, bool, bool, bool)` | corners of main box |
+
+ ## Plot
+
+ Nothing (yet)
+
+ ## Markdown
+
+ Nothing
+
+ ## Button
+
+ | name         | type                                 | description                              |
+ | ------------ | ------------------------------------ | ---------------------------------------- |
+ | `rounded`    | `bool` or `(bool, bool, bool, bool)` | corners of button                        |
+ | `border`     | `bool` or `(bool, bool, bool, bool)` | borders of button                        |
+ | `full_width` | `bool`                               | whether button expand to fill container  |
+
+ ## Dataset
+
+ Nothing
+
+ ## Variable
+
+ Nothing
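The tables above document per-component style keyword arguments. In the Gradio 3.x line that this vendored `gradio-dev` copy tracks, such options were typically passed through a component's `.style(...)` method inside a Blocks layout; the exact set of accepted keywords varies by component and minor version, so treat the following as a hedged sketch rather than a definitive reference.

```python
import gradio as gr

with gr.Blocks() as demo:
    # Illustrative use of style kwargs from the tables above; keyword support
    # depends on the component and the Gradio 3.x version (the .style() API
    # was deprecated in later releases).
    gallery = gr.Gallery(label="results").style(grid=4)
    btn = gr.Button("Run").style(full_width=True)

if __name__ == "__main__":
    demo.launch()
```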