tuandunghcmut committed
Commit fdde15c · verified · 1 parent: db08794

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes; see the raw diff for the full change set.
Files changed (50)
  1. clone-IDEA-Research/Grounded-SAM-2/.clang-format +85 -0
  2. clone-IDEA-Research/Grounded-SAM-2/.gitignore +147 -0
  3. clone-IDEA-Research/Grounded-SAM-2/.watchmanconfig +1 -0
  4. clone-IDEA-Research/Grounded-SAM-2/CODE_OF_CONDUCT.md +80 -0
  5. clone-IDEA-Research/Grounded-SAM-2/CONTRIBUTING.md +31 -0
  6. clone-IDEA-Research/Grounded-SAM-2/Dockerfile +37 -0
  7. clone-IDEA-Research/Grounded-SAM-2/INSTALL.md +189 -0
  8. clone-IDEA-Research/Grounded-SAM-2/LICENSE +201 -0
  9. clone-IDEA-Research/Grounded-SAM-2/LICENSE_cctorch +29 -0
  10. clone-IDEA-Research/Grounded-SAM-2/LICENSE_groundingdino +201 -0
  11. clone-IDEA-Research/Grounded-SAM-2/LICENSE_sam2 +201 -0
  12. clone-IDEA-Research/Grounded-SAM-2/MANIFEST.in +7 -0
  13. clone-IDEA-Research/Grounded-SAM-2/Makefile +37 -0
  14. clone-IDEA-Research/Grounded-SAM-2/README.md +484 -0
  15. clone-IDEA-Research/Grounded-SAM-2/SAM2_README.md +140 -0
  16. clone-IDEA-Research/Grounded-SAM-2/backend.Dockerfile +64 -0
  17. clone-IDEA-Research/Grounded-SAM-2/docker-compose.yaml +42 -0
  18. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_dinox_demo.py +245 -0
  19. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_florence2_autolabel_pipeline.py +198 -0
  20. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_florence2_image_demo.py +657 -0
  21. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_gd1.5_demo.py +249 -0
  22. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_hf_model_demo.py +187 -0
  23. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_local_demo.py +160 -0
  24. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo.py +198 -0
  25. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_dinox.py +237 -0
  26. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_gd1.0_hf_model.py +214 -0
  27. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_gd1.0_local_model.py +220 -0
  28. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_gd1.5.py +239 -0
  29. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_continuous_id.py +203 -0
  30. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py +224 -0
  31. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_continuous_id_plus.py +247 -0
  32. clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_gd1.5.py +221 -0
  33. clone-IDEA-Research/Grounded-SAM-2/pyproject.toml +6 -0
  34. clone-IDEA-Research/Grounded-SAM-2/sam2/__init__.py +11 -0
  35. clone-IDEA-Research/Grounded-SAM-2/sam2/automatic_mask_generator.py +454 -0
  36. clone-IDEA-Research/Grounded-SAM-2/sam2/build_sam.py +167 -0
  37. clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_b+.yaml +113 -0
  38. clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_l.yaml +117 -0
  39. clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_s.yaml +116 -0
  40. clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_t.yaml +118 -0
  41. clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_image_predictor.py +466 -0
  42. clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_video_predictor.py +1172 -0
  43. clone-IDEA-Research/Grounded-SAM-2/setup.py +174 -0
  44. clone-IDEA-Research/Grounded-Segment-Anything/.gitignore +135 -0
  45. clone-IDEA-Research/Grounded-Segment-Anything/.gitmodules +7 -0
  46. clone-IDEA-Research/Grounded-Segment-Anything/CITATION.cff +8 -0
  47. clone-IDEA-Research/Grounded-Segment-Anything/Dockerfile +30 -0
  48. clone-IDEA-Research/Grounded-Segment-Anything/LICENSE +201 -0
  49. clone-IDEA-Research/Grounded-Segment-Anything/Makefile +43 -0
  50. clone-IDEA-Research/Grounded-Segment-Anything/README.md +808 -0
clone-IDEA-Research/Grounded-SAM-2/.clang-format ADDED
@@ -0,0 +1,85 @@
+ AccessModifierOffset: -1
+ AlignAfterOpenBracket: AlwaysBreak
+ AlignConsecutiveAssignments: false
+ AlignConsecutiveDeclarations: false
+ AlignEscapedNewlinesLeft: true
+ AlignOperands: false
+ AlignTrailingComments: false
+ AllowAllParametersOfDeclarationOnNextLine: false
+ AllowShortBlocksOnASingleLine: false
+ AllowShortCaseLabelsOnASingleLine: false
+ AllowShortFunctionsOnASingleLine: Empty
+ AllowShortIfStatementsOnASingleLine: false
+ AllowShortLoopsOnASingleLine: false
+ AlwaysBreakAfterReturnType: None
+ AlwaysBreakBeforeMultilineStrings: true
+ AlwaysBreakTemplateDeclarations: true
+ BinPackArguments: false
+ BinPackParameters: false
+ BraceWrapping:
+   AfterClass: false
+   AfterControlStatement: false
+   AfterEnum: false
+   AfterFunction: false
+   AfterNamespace: false
+   AfterObjCDeclaration: false
+   AfterStruct: false
+   AfterUnion: false
+   BeforeCatch: false
+   BeforeElse: false
+   IndentBraces: false
+ BreakBeforeBinaryOperators: None
+ BreakBeforeBraces: Attach
+ BreakBeforeTernaryOperators: true
+ BreakConstructorInitializersBeforeComma: false
+ BreakAfterJavaFieldAnnotations: false
+ BreakStringLiterals: false
+ ColumnLimit: 80
+ CommentPragmas: '^ IWYU pragma:'
+ ConstructorInitializerAllOnOneLineOrOnePerLine: true
+ ConstructorInitializerIndentWidth: 4
+ ContinuationIndentWidth: 4
+ Cpp11BracedListStyle: true
+ DerivePointerAlignment: false
+ DisableFormat: false
+ ForEachMacros: [ FOR_EACH, FOR_EACH_R, FOR_EACH_RANGE, ]
+ IncludeCategories:
+   - Regex: '^<.*\.h(pp)?>'
+     Priority: 1
+   - Regex: '^<.*'
+     Priority: 2
+   - Regex: '.*'
+     Priority: 3
+ IndentCaseLabels: true
+ IndentWidth: 2
+ IndentWrappedFunctionNames: false
+ KeepEmptyLinesAtTheStartOfBlocks: false
+ MacroBlockBegin: ''
+ MacroBlockEnd: ''
+ MaxEmptyLinesToKeep: 1
+ NamespaceIndentation: None
+ ObjCBlockIndentWidth: 2
+ ObjCSpaceAfterProperty: false
+ ObjCSpaceBeforeProtocolList: false
+ PenaltyBreakBeforeFirstCallParameter: 1
+ PenaltyBreakComment: 300
+ PenaltyBreakFirstLessLess: 120
+ PenaltyBreakString: 1000
+ PenaltyExcessCharacter: 1000000
+ PenaltyReturnTypeOnItsOwnLine: 200
+ PointerAlignment: Left
+ ReflowComments: true
+ SortIncludes: true
+ SpaceAfterCStyleCast: false
+ SpaceBeforeAssignmentOperators: true
+ SpaceBeforeParens: ControlStatements
+ SpaceInEmptyParentheses: false
+ SpacesBeforeTrailingComments: 1
+ SpacesInAngles: false
+ SpacesInContainerLiterals: true
+ SpacesInCStyleCastParentheses: false
+ SpacesInParentheses: false
+ SpacesInSquareBrackets: false
+ Standard: Cpp11
+ TabWidth: 8
+ UseTab: Never
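Note: the `.clang-format` above defines the C++/CUDA code style for this repo. As a usage sketch (the target file path is taken from the `setup.py` snippet in INSTALL.md below; any other C++/CUDA source works the same way), clang-format picks this config up automatically from the repository root:

```bash
# Format a CUDA/C++ source in place; clang-format discovers the nearest
# .clang-format up the directory tree, so no extra flags are needed.
clang-format -i sam2/csrc/connected_components.cu

# Or preview the formatted output without modifying the file.
clang-format sam2/csrc/connected_components.cu | less
```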
clone-IDEA-Research/Grounded-SAM-2/.gitignore ADDED
@@ -0,0 +1,147 @@
+ # SAM 2
+ .vscode/
+ .DS_Store
+ __pycache__/
+ *-checkpoint.ipynb
+ .venv
+ *.egg*
+ build/*
+ _C.*
+ outputs/*
+ checkpoints/*.pt
+ *test*
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ pip-wheel-metadata/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # checkpoint
+ *.pth
+ outputs/
+
+ .idea/
clone-IDEA-Research/Grounded-SAM-2/.watchmanconfig ADDED
@@ -0,0 +1 @@
+ {}
clone-IDEA-Research/Grounded-SAM-2/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,80 @@
+ # Code of Conduct
+
+ ## Our Pledge
+
+ In the interest of fostering an open and welcoming environment, we as
+ contributors and maintainers pledge to make participation in our project and
+ our community a harassment-free experience for everyone, regardless of age, body
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
+ level of experience, education, socio-economic status, nationality, personal
+ appearance, race, religion, or sexual identity and orientation.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to creating a positive environment
+ include:
+
+ * Using welcoming and inclusive language
+ * Being respectful of differing viewpoints and experiences
+ * Gracefully accepting constructive criticism
+ * Focusing on what is best for the community
+ * Showing empathy towards other community members
+
+ Examples of unacceptable behavior by participants include:
+
+ * The use of sexualized language or imagery and unwelcome sexual attention or
+   advances
+ * Trolling, insulting/derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or electronic
+   address, without explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+   professional setting
+
+ ## Our Responsibilities
+
+ Project maintainers are responsible for clarifying the standards of acceptable
+ behavior and are expected to take appropriate and fair corrective action in
+ response to any instances of unacceptable behavior.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned to this Code of Conduct, or to ban temporarily or
+ permanently any contributor for other behaviors that they deem inappropriate,
+ threatening, offensive, or harmful.
+
+ ## Scope
+
+ This Code of Conduct applies within all project spaces, and it also applies when
+ an individual is representing the project or its community in public spaces.
+ Examples of representing a project or community include using an official
+ project e-mail address, posting via an official social media account, or acting
+ as an appointed representative at an online or offline event. Representation of
+ a project may be further defined and clarified by project maintainers.
+
+ This Code of Conduct also applies outside the project spaces when there is a
+ reasonable belief that an individual's behavior may have a negative impact on
+ the project or its community.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by contacting the project team at <[email protected]>. All
+ complaints will be reviewed and investigated and will result in a response that
+ is deemed necessary and appropriate to the circumstances. The project team is
+ obligated to maintain confidentiality with regard to the reporter of an incident.
+ Further details of specific enforcement policies may be posted separately.
+
+ Project maintainers who do not follow or enforce the Code of Conduct in good
+ faith may face temporary or permanent repercussions as determined by other
+ members of the project's leadership.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+ [homepage]: https://www.contributor-covenant.org
+
+ For answers to common questions about this code of conduct, see
+ https://www.contributor-covenant.org/faq
clone-IDEA-Research/Grounded-SAM-2/CONTRIBUTING.md ADDED
@@ -0,0 +1,31 @@
+ # Contributing to segment-anything
+ We want to make contributing to this project as easy and transparent as
+ possible.
+
+ ## Pull Requests
+ We actively welcome your pull requests.
+
+ 1. Fork the repo and create your branch from `main`.
+ 2. If you've added code that should be tested, add tests.
+ 3. If you've changed APIs, update the documentation.
+ 4. Ensure the test suite passes.
+ 5. Make sure your code lints, using the `ufmt format` command. Linting requires `black==24.2.0`, `usort==1.0.2`, and `ufmt==2.0.0b2`, which can be installed via `pip install -e ".[dev]"`.
+ 6. If you haven't already, complete the Contributor License Agreement ("CLA").
+
+ ## Contributor License Agreement ("CLA")
+ In order to accept your pull request, we need you to submit a CLA. You only need
+ to do this once to work on any of Facebook's open source projects.
+
+ Complete your CLA here: <https://code.facebook.com/cla>
+
+ ## Issues
+ We use GitHub issues to track public bugs. Please ensure your description is
+ clear and has sufficient instructions to be able to reproduce the issue.
+
+ Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
+ disclosure of security bugs. In those cases, please go through the process
+ outlined on that page and do not file a public issue.
+
+ ## License
+ By contributing to segment-anything, you agree that your contributions will be licensed
+ under the LICENSE file in the root directory of this source tree.
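Note: step 5 of the CONTRIBUTING.md above names the lint toolchain but not the exact invocation. A minimal sketch of that workflow, assuming the `dev` extra referenced in that step is defined by this repo's packaging:

```bash
# Install the pinned formatters (black, usort, ufmt) via the dev extra.
pip install -e ".[dev]"

# Reformat the working tree; ufmt runs usort and black under the hood.
ufmt format .
```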
clone-IDEA-Research/Grounded-SAM-2/Dockerfile ADDED
@@ -0,0 +1,37 @@
+ FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel
+
+ # Arguments to build Docker Image using CUDA
+ ARG USE_CUDA=0
+ ARG TORCH_ARCH="7.0;7.5;8.0;8.6"
+
+ ENV AM_I_DOCKER=True
+ ENV BUILD_WITH_CUDA="${USE_CUDA}"
+ ENV TORCH_CUDA_ARCH_LIST="${TORCH_ARCH}"
+ ENV CUDA_HOME=/usr/local/cuda-12.1/
+ # Ensure CUDA is correctly set up
+ ENV PATH=/usr/local/cuda-12.1/bin:${PATH}
+ ENV LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH}
+
+ # Install required packages and specific gcc/g++
+ RUN apt-get update && apt-get install --no-install-recommends wget ffmpeg=7:* \
+     libsm6=2:* libxext6=2:* git=1:* nano vim=2:* ninja-build gcc-10 g++-10 -y \
+     && apt-get clean && apt-get autoremove && rm -rf /var/lib/apt/lists/*
+
+ ENV CC=gcc-10
+ ENV CXX=g++-10
+
+ RUN mkdir -p /home/appuser/Grounded-SAM-2
+ COPY . /home/appuser/Grounded-SAM-2/
+
+ WORKDIR /home/appuser/Grounded-SAM-2
+
+ # Install essential Python packages
+ RUN python -m pip install --upgrade pip setuptools wheel numpy \
+     opencv-python transformers supervision pycocotools addict yapf timm
+
+ # Install segment_anything package in editable mode
+ RUN python -m pip install -e .
+
+ # Install grounding dino
+ RUN python -m pip install --no-build-isolation -e grounding_dino
+
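Note: the Dockerfile above exposes `USE_CUDA` and `TORCH_ARCH` as build arguments and copies the repo into `/home/appuser/Grounded-SAM-2`. A minimal build-and-run sketch; the image tag and the trimmed architecture list are illustrative assumptions, not part of this commit:

```bash
# Build with the CUDA toggles enabled, targeting only compute capability 8.6.
docker build -t grounded-sam-2 \
  --build-arg USE_CUDA=1 \
  --build-arg TORCH_ARCH="8.6" .

# Run with GPU access and an interactive shell in the copied repo.
docker run --gpus all -it --rm grounded-sam-2 /bin/bash
```

The `docker-compose.yaml` listed among the changed files offers an alternative entry point, but its contents are not shown in this view.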
clone-IDEA-Research/Grounded-SAM-2/INSTALL.md ADDED
@@ -0,0 +1,189 @@
+ ## Installation
+
+ ### Requirements
+
+ - Linux with Python ≥ 3.10, PyTorch ≥ 2.3.1 and [torchvision](https://github.com/pytorch/vision/) that matches the PyTorch installation. Install them together at https://pytorch.org to ensure this.
+   * Note: older versions of Python or PyTorch may also work. However, the versions above are strongly recommended to provide all features such as `torch.compile`.
+ - [CUDA toolkits](https://developer.nvidia.com/cuda-toolkit-archive) that match the CUDA version of your PyTorch installation. This should typically be CUDA 12.1 if you follow the default installation command.
+ - If you are installing on Windows, it's strongly recommended to use [Windows Subsystem for Linux (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install) with Ubuntu.
+
+ Then, install SAM 2 from the root of this repository via
+ ```bash
+ pip install -e ".[notebooks]"
+ ```
+
+ Note that you may skip building the SAM 2 CUDA extension during installation via the environment variable `SAM2_BUILD_CUDA=0`, as follows:
+ ```bash
+ # skip the SAM 2 CUDA extension
+ SAM2_BUILD_CUDA=0 pip install -e ".[notebooks]"
+ ```
+ This also skips the post-processing step at runtime (removing small holes and sprinkles in the output masks, which requires the CUDA extension), but shouldn't affect the results in most cases.
+
+ ### Building the SAM 2 CUDA extension
+
+ By default, we allow the installation to proceed even if the SAM 2 CUDA extension fails to build. (In this case, the build errors are hidden unless you pass `-v` to `pip install` for verbose output.)
+
+ If you see a message like `Skipping the post-processing step due to the error above` at runtime or `Failed to build the SAM 2 CUDA extension due to the error above` during installation, it indicates that the SAM 2 CUDA extension failed to build in your environment. In this case, **you can still use SAM 2 for both image and video applications**. The post-processing step (removing small holes and sprinkles in the output masks) will be skipped, but this shouldn't affect the results in most cases.
+
+ If you would like to enable this post-processing step, you can reinstall SAM 2 on a GPU machine with the environment variable `SAM2_BUILD_ALLOW_ERRORS=0` to force building the CUDA extension (and raise errors if it fails to build), as follows:
+ ```bash
+ pip uninstall -y SAM-2 && \
+ rm -f ./sam2/*.so && \
+ SAM2_BUILD_ALLOW_ERRORS=0 pip install -v -e ".[notebooks]"
+ ```
+
+ Note that PyTorch needs to be installed before building the SAM 2 CUDA extension. It's also necessary to install [CUDA toolkits](https://developer.nvidia.com/cuda-toolkit-archive) that match the CUDA version of your PyTorch installation. (This should typically be CUDA 12.1 if you follow the default installation command.) After installing the CUDA toolkits, you can check the toolkit version via `nvcc --version`.
+
+ Please check the section below on common installation issues if the CUDA extension fails to build during installation or fails to load at runtime.
+
+ ### Common Installation Issues
+
+ Click each issue for its solutions:
+
+ <details>
+ <summary>
+ I got `ImportError: cannot import name '_C' from 'sam2'`
+ </summary>
+ <br/>
+
+ This is usually because you haven't run the `pip install -e ".[notebooks]"` step above or the installation failed. Please install SAM 2 first, and see the other issues if your installation fails.
+
+ On some systems, you may need to run `python setup.py build_ext --inplace` in the SAM 2 repo root, as suggested in https://github.com/facebookresearch/sam2/issues/77.
+ </details>
+
+ <details>
+ <summary>
+ I got `MissingConfigException: Cannot find primary config 'configs/sam2.1/sam2.1_hiera_l.yaml'`
+ </summary>
+ <br/>
+
+ This is usually because you haven't run the `pip install -e .` step above, so `sam2` isn't in your Python's `sys.path`. Please run this installation step. If it still fails after the installation step, you may try manually adding the root of this repo to `PYTHONPATH` via
+ ```bash
+ export SAM2_REPO_ROOT=/path/to/sam2  # path to this repo
+ export PYTHONPATH="${SAM2_REPO_ROOT}:${PYTHONPATH}"
+ ```
+ to manually add `sam2_configs` into your Python's `sys.path`.
+
+ </details>
+
+ <details>
+ <summary>
+ I got `RuntimeError: Error(s) in loading state_dict for SAM2Base` when loading the new SAM 2.1 checkpoints
+ </summary>
+ <br/>
+
+ This is likely because you have installed a previous version of this repo, which doesn't yet have the new modules needed to support the SAM 2.1 checkpoints. Please try the following steps:
+
+ 1. pull the latest code from the `main` branch of this repo
+ 2. run `pip uninstall -y SAM-2` to uninstall any previous installations
+ 3. then install the latest repo again using `pip install -e ".[notebooks]"`
+
+ In case the steps above still don't resolve the error, please try running the following in your Python environment
+ ```python
+ from sam2.modeling import sam2_base
+
+ print(sam2_base.__file__)
+ ```
+ and check whether the content of the printed local path of `sam2/modeling/sam2_base.py` matches the latest one in https://github.com/facebookresearch/sam2/blob/main/sam2/modeling/sam2_base.py (e.g. whether your local file has `no_obj_embed_spatial`) to identify whether you're still using a previous installation.
+
+ </details>
+
+ <details>
+ <summary>
+ My installation failed with `CUDA_HOME environment variable is not set`
+ </summary>
+ <br/>
+
+ This usually happens because the installation step cannot find the CUDA toolkits (which contain the NVCC compiler) to build a custom CUDA kernel in SAM 2. Please install [CUDA toolkits](https://developer.nvidia.com/cuda-toolkit-archive) with a version that matches the CUDA version of your PyTorch installation. If the error persists after installing CUDA toolkits, you may explicitly specify `CUDA_HOME` via
+ ```
+ export CUDA_HOME=/usr/local/cuda  # change to your CUDA toolkit path
+ ```
+ and rerun the installation.
+
+ Also, you should make sure
+ ```
+ python -c 'import torch; from torch.utils.cpp_extension import CUDA_HOME; print(torch.cuda.is_available(), CUDA_HOME)'
+ ```
+ prints `(True, a directory with cuda)` to verify that the CUDA toolkits are correctly set up.
+
+ If you are still having problems after verifying that the CUDA toolkit is installed and the `CUDA_HOME` environment variable is set properly, you may have to add the `--no-build-isolation` flag to the pip command:
+ ```
+ pip install --no-build-isolation -e .
+ ```
+
+ </details>
+
+ <details>
+ <summary>
+ I got `undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPKvmm` (or similar errors)
+ </summary>
+ <br/>
+
+ This usually happens because you have multiple versions of dependencies (PyTorch or CUDA) in your environment. During installation, the SAM 2 library is compiled against one version, while at run time it links against another. This might be because you have different versions of PyTorch or CUDA installed separately via `pip` or `conda`. You may delete one of the duplicates so that only a single PyTorch and CUDA version remains.
+
+ In particular, if you have a PyTorch version lower than 2.3.1, it's recommended to upgrade to PyTorch 2.3.1 or higher first. Otherwise, the installation script will try to upgrade to the latest PyTorch using `pip`, which could sometimes lead to a duplicated PyTorch installation if you have previously installed another PyTorch version using `conda`.
+
+ We have been building SAM 2 against PyTorch 2.3.1 internally. However, a few user comments (e.g. https://github.com/facebookresearch/sam2/issues/22, https://github.com/facebookresearch/sam2/issues/14) suggested that downgrading to PyTorch 2.1.0 might resolve this problem. In case the error persists, you may try changing the restriction from `torch>=2.3.1` to `torch>=2.1.0` in both [`pyproject.toml`](pyproject.toml) and [`setup.py`](setup.py) to allow PyTorch 2.1.0.
+ </details>
+
+ <details>
+ <summary>
+ I got `CUDA error: no kernel image is available for execution on the device`
+ </summary>
+ <br/>
+
+ A possible cause could be that the CUDA kernel is somehow not compiled towards your GPU's CUDA [capability](https://developer.nvidia.com/cuda-gpus). This could happen if the installation is done in an environment different from the runtime (e.g. on a Slurm system).
+
+ You can try pulling the latest code from the SAM 2 repo and running the following
+ ```
+ export TORCH_CUDA_ARCH_LIST="9.0 8.0 8.6 8.9 7.0 7.2 7.5 6.0"
+ ```
+ to manually specify the CUDA capability in the compilation target that matches your GPU.
+ </details>
+
+ <details>
+ <summary>
+ I got `RuntimeError: No available kernel. Aborting execution.` (or similar errors)
+ </summary>
+ <br/>
+
+ This is probably because your machine doesn't have a GPU or a PyTorch version compatible with Flash Attention (see also https://discuss.pytorch.org/t/using-f-scaled-dot-product-attention-gives-the-error-runtimeerror-no-available-kernel-aborting-execution/180900 for a discussion in the PyTorch forum). You may be able to resolve this error by replacing the line
+ ```python
+ OLD_GPU, USE_FLASH_ATTN, MATH_KERNEL_ON = get_sdpa_settings()
+ ```
+ in [`sam2/modeling/sam/transformer.py`](sam2/modeling/sam/transformer.py) with
+ ```python
+ OLD_GPU, USE_FLASH_ATTN, MATH_KERNEL_ON = True, True, True
+ ```
+ to relax the attention kernel setting and use kernels other than Flash Attention.
+ </details>
+
+ <details>
+ <summary>
+ I got `Error compiling objects for extension`
+ </summary>
+ <br/>
+
+ You may see an error log like:
+ > unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
+
+ This is probably because your versions of CUDA and Visual Studio are incompatible (see also https://stackoverflow.com/questions/78515942/cuda-compatibility-with-visual-studio-2022-version-17-10 for a discussion on Stack Overflow).<br>
+ You may be able to fix this by adding the `-allow-unsupported-compiler` argument to `nvcc` after L48 in the [setup.py](https://github.com/facebookresearch/sam2/blob/main/setup.py). <br>
+ After adding the argument, `get_extensions()` will look like this:
+ ```python
+ def get_extensions():
+     srcs = ["sam2/csrc/connected_components.cu"]
+     compile_args = {
+         "cxx": [],
+         "nvcc": [
+             "-DCUDA_HAS_FP16=1",
+             "-D__CUDA_NO_HALF_OPERATORS__",
+             "-D__CUDA_NO_HALF_CONVERSIONS__",
+             "-D__CUDA_NO_HALF2_OPERATORS__",
+             "-allow-unsupported-compiler",  # Add this argument
+         ],
+     }
+     ext_modules = [CUDAExtension("sam2._C", srcs, extra_compile_args=compile_args)]
+     return ext_modules
+ ```
+ </details>
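Note: the troubleshooting entries in the INSTALL.md above come down to two checks: whether PyTorch sees a GPU with a matching CUDA toolkit, and whether the optional `sam2._C` extension was actually built. A small sketch bundling both checks; the first command is taken from the `CUDA_HOME` entry above, while the `sam2._C` probe is an assumption based on the `ImportError` entry:

```bash
# Should print "True <path to your CUDA toolkit>".
python -c 'import torch; from torch.utils.cpp_extension import CUDA_HOME; print(torch.cuda.is_available(), CUDA_HOME)'

# Probe for the optional CUDA extension; False only means the mask
# post-processing step will be skipped, per the notes above.
python -c "import importlib.util; print('sam2._C built:', importlib.util.find_spec('sam2._C') is not None)"
```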
clone-IDEA-Research/Grounded-SAM-2/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright 2023 - present, IDEA Research.
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
clone-IDEA-Research/Grounded-SAM-2/LICENSE_cctorch ADDED
@@ -0,0 +1,29 @@
+ BSD 3-Clause License
+
+ Copyright (c) 2020, the respective contributors, as shown by the AUTHORS file.
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice, this
+    list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright notice,
+    this list of conditions and the following disclaimer in the documentation
+    and/or other materials provided with the distribution.
+
+ 3. Neither the name of the copyright holder nor the names of its
+    contributors may be used to endorse or promote products derived from
+    this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
clone-IDEA-Research/Grounded-SAM-2/LICENSE_groundingdino ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright 2023 - present, IDEA Research.
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
clone-IDEA-Research/Grounded-SAM-2/LICENSE_sam2 ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
clone-IDEA-Research/Grounded-SAM-2/MANIFEST.in ADDED
@@ -0,0 +1,7 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ recursive-include sam2 *.yaml #include all config files
clone-IDEA-Research/Grounded-SAM-2/Makefile ADDED
@@ -0,0 +1,37 @@
1
+ # Get version of CUDA and enable it for compilation if CUDA > 11.0
2
+ # This solves https://github.com/IDEA-Research/Grounded-Segment-Anything/issues/53
3
+ # and https://github.com/IDEA-Research/Grounded-Segment-Anything/issues/84
4
+ # when running in Docker
5
+ # Check if nvcc is installed
6
+ NVCC := $(shell which nvcc)
7
+ ifeq ($(NVCC),)
8
+ # NVCC not found
9
+ USE_CUDA := 0
10
+ NVCC_VERSION := "not installed"
11
+ else
12
+ NVCC_VERSION := $(shell nvcc --version | grep -oP 'release \K[0-9.]+')
13
+ USE_CUDA := $(shell echo "$(NVCC_VERSION) > 11" | bc -l)
14
+ endif
15
+
16
+ # Add the list of supported ARCHs
17
+ ifeq ($(USE_CUDA), 1)
18
+ TORCH_CUDA_ARCH_LIST := "7.0;7.5;8.0;8.6+PTX"
19
+ BUILD_MESSAGE := "I will try to build the image with CUDA support"
20
+ else
21
+ TORCH_CUDA_ARCH_LIST :=
22
+ BUILD_MESSAGE := "CUDA $(NVCC_VERSION) is not supported"
23
+ endif
24
+
25
+
26
+ build-image:
27
+ @echo $(BUILD_MESSAGE)
28
+ docker build --build-arg USE_CUDA=$(USE_CUDA) \
29
+ --build-arg TORCH_ARCH=$(TORCH_CUDA_ARCH_LIST) \
30
+ -t grounded_sam2:1.0 .
31
+ run:
32
+ docker run --gpus all -it --rm --net=host --privileged \
33
+ -v /tmp/.X11-unix:/tmp/.X11-unix \
34
+ -v "${PWD}":/home/appuser/Grounded-SAM-2 \
35
+ -e DISPLAY=$$DISPLAY \
36
+ --name=gsa \
37
+ --ipc=host -it grounded_sam2:1.0
clone-IDEA-Research/Grounded-SAM-2/README.md ADDED
@@ -0,0 +1,484 @@
1
+ # Grounded SAM 2: Ground and Track Anything in Videos
2
+
3
+ **[IDEA-Research](https://github.com/idea-research)**
4
+
5
+ [Tianhe Ren](https://rentainhe.github.io/), [Shuo Shen](https://github.com/ShuoShenDe)
6
+
7
+ [[`SAM 2 Paper`](https://arxiv.org/abs/2408.00714)] [[`Grounding DINO Paper`](https://arxiv.org/abs/2303.05499)] [[`Grounding DINO 1.5 Paper`](https://arxiv.org/abs/2405.10300)] [[`DINO-X Paper`](https://arxiv.org/abs/2411.14347)] [[`BibTeX`](#citation)]
8
+
9
+ [![Video Name](./assets/grounded_sam_2_intro.jpg)](https://github.com/user-attachments/assets/f0fb0022-779a-49fb-8f46-3a18a8b4e893)
10
+
11
+ ## Highlights
12
+
13
+ Grounded SAM 2 is a foundation model pipeline for grounding and tracking anything in videos with [Grounding DINO](https://arxiv.org/abs/2303.05499), [Grounding DINO 1.5](https://arxiv.org/abs/2405.10300), [Florence-2](https://arxiv.org/abs/2311.06242), [DINO-X](https://arxiv.org/abs/2411.14347) and [SAM 2](https://arxiv.org/abs/2408.00714).
14
+
15
+ In this repo, we support the following demos with **simple implementations**:
16
+ - **Ground and Segment Anything** with Grounding DINO, Grounding DINO 1.5 & 1.6, DINO-X and SAM 2
17
+ - **Ground and Track Anything** with Grounding DINO, Grounding DINO 1.5 & 1.6, DINO-X and SAM 2
18
+ - **Detect, Segment and Track Visualization** based on the powerful [supervision](https://github.com/roboflow/supervision) library.
19
+
20
+ Grounded SAM 2 does not introduce significant methodological changes compared to [Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks](https://arxiv.org/abs/2401.14159). Both approaches leverage the capabilities of open-world models to address complex visual tasks. Consequently, we try to **simplify the code implementation** in this repository, aiming to enhance user convenience.
21
+
22
+ ## Latest updates
23
+
24
+ - **2024.12.02**: Support **DINO-X with SAM 2** demos (including object segmentation and tracking), please install the latest version of `dds-cloudapi-sdk==0.3.3` and refer to [Grounded SAM 2 (with DINO-X)](#grounded-sam-2-image-demo-with-dino-x) and [Grounded SAM 2 Video (with DINO-X)](#grounded-sam-2-video-object-tracking-demo-with-custom-video-input-with-dino-x) for more details.
25
+
26
+ - **2024.10.24**: Support [SAHI (Slicing Aided Hyper Inference)](https://docs.ultralytics.com/guides/sahi-tiled-inference/) on Grounded SAM 2 (with Grounding DINO 1.5), which may be helpful for inference on high-resolution images with dense small objects (e.g. **4K** images).
27
+
28
+ - **2024.10.10**: Support `SAM-2.1` models. If you want to use a `SAM 2.1` model, you need to update to the latest code and reinstall SAM 2 following [SAM 2.1 Installation](https://github.com/facebookresearch/sam2?tab=readme-ov-file#latest-updates).
29
+
30
+ - **2024.08.31**: Support `dump json results` in Grounded SAM 2 Image Demos (with Grounding DINO).
31
+
32
+ - **2024.08.20**: Support **Florence-2 SAM 2 Image Demo** which includes `dense region caption`, `object detection`, `phrase grounding`, and cascaded auto-label pipeline `caption + phrase grounding`.
33
+
34
+ - **2024.08.09**: Support **Ground and Track New Objects** throughout the whole video. This feature is still under development. Credits to [Shuo Shen](https://github.com/ShuoShenDe).
35
+
36
+ - **2024.08.07**: Support **Custom Video Inputs**: users only need to submit their video file (e.g. an `.mp4` file) with specific text prompts to get impressive demo videos.
37
+
38
+ ## Contents
39
+ - [Installation](#installation)
40
+ - [Grounded SAM 2 Demos](#grounded-sam-2-demos)
41
+ - [Grounded SAM 2 Image Demo](#grounded-sam-2-image-demo-with-grounding-dino)
42
+ - [Grounded SAM 2 Image Demo (with Grounding DINO 1.5 & 1.6)](#grounded-sam-2-image-demo-with-grounding-dino-15--16)
43
+ - [Grounded SAM 2 Image Demo (with DINO-X)](#grounded-sam-2-image-demo-with-dino-x)
44
+ - [Grounded SAM 2 with SAHI for High Resolution Image Inference](#sahi-slicing-aided-hyper-inference-with-grounding-dino-15-and-sam-2)
45
+ - [Automatically Saving Grounding and Segmentation Results](#automatically-saving-grounding-results-image-demo)
46
+ - [Grounded SAM 2 Video Object Tracking Demo](#grounded-sam-2-video-object-tracking-demo)
47
+ - [Grounded SAM 2 Video Object Tracking Demo (with Grounding DINO 1.5 & 1.6)](#grounded-sam-2-video-object-tracking-demo-with-grounding-dino-15--16)
48
+ - [Grounded SAM 2 Video Object Tracking with Custom Video Input (using Grounding DINO)](#grounded-sam-2-video-object-tracking-demo-with-custom-video-input-with-grounding-dino)
49
+ - [Grounded SAM 2 Video Object Tracking with Custom Video Input (using Grounding DINO 1.5 & 1.6)](#grounded-sam-2-video-object-tracking-demo-with-custom-video-input-with-grounding-dino-15--16)
50
+ - [Grounded SAM 2 Video Object Tracking Demo (with DINO-X)](#grounded-sam-2-video-object-tracking-demo-with-custom-video-input-with-dino-x)
51
+ - [Grounded SAM 2 Video Object Tracking with Continuous ID (using Grounding DINO)](#grounded-sam-2-video-object-tracking-with-continuous-id-with-grounding-dino)
52
+ - [Grounded SAM 2 Florence-2 Demos](#grounded-sam-2-florence-2-demos)
53
+ - [Grounded SAM 2 Florence-2 Image Demo](#grounded-sam-2-florence-2-image-demo)
54
+ - [Grounded SAM 2 Florence-2 Image Auto-Labeling Demo](#grounded-sam-2-florence-2-image-auto-labeling-demo)
55
+ - [Citation](#citation)
56
+
57
+
58
+
59
+ ## Installation
60
+
61
+ Download the pretrained `SAM 2` checkpoints:
62
+
63
+ ```bash
64
+ cd checkpoints
65
+ bash download_ckpts.sh
66
+ ```
67
+
68
+ Download the pretrained `Grounding DINO` checkpoints:
69
+
70
+ ```bash
71
+ cd gdino_checkpoints
72
+ bash download_ckpts.sh
73
+ ```
74
+
75
+ ### Installation without docker
76
+
77
+ Install the PyTorch environment first. We use `python=3.10`, `torch >= 2.3.1`, `torchvision>=0.18.1` and `cuda-12.1` in our environment to run this demo. Please follow the instructions [here](https://pytorch.org/get-started/locally/) to install both PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended. You can easily install the latest version of PyTorch as follows:
78
+
79
+ ```bash
80
+ pip3 install torch torchvision torchaudio
81
+ ```
82
+
83
+ Since the `Deformable Attention` operator used in Grounding DINO must be compiled with CUDA, check that the CUDA environment variables are set correctly (refer to [Grounding DINO Installation](https://github.com/IDEA-Research/GroundingDINO?tab=readme-ov-file#hammer_and_wrench-install) for more details). You can set the environment variable manually as follows if you want to build a local GPU environment for Grounding DINO to run Grounded SAM 2:
84
+
85
+ ```bash
86
+ export CUDA_HOME=/path/to/cuda-12.1/
87
+ ```
88
+
89
+ Install `Segment Anything 2`:
90
+
91
+ ```bash
92
+ pip install -e .
93
+ ```
94
+
95
+ Install `Grounding DINO`:
96
+
97
+ ```bash
98
+ pip install --no-build-isolation -e grounding_dino
99
+ ```
100
+
101
+ ### Installation with docker
102
+ Build the Docker image and run the Docker container:
103
+
104
+ ```bash
105
+ cd Grounded-SAM-2
106
+ make build-image
107
+ make run
108
+ ```
109
+ After executing these commands, you will be inside the Docker environment. The working directory within the container is set to: `/home/appuser/Grounded-SAM-2`
110
+
111
+ Once inside the Docker environment, you can start the demo by running:
112
+ ```bash
113
+ python grounded_sam2_tracking_demo.py
114
+ ```
115
+
116
+ ## Grounded SAM 2 Demos
117
+ ### Grounded SAM 2 Image Demo (with Grounding DINO)
118
+ Note that `Grounding DINO` is already supported in [Huggingface](https://huggingface.co/IDEA-Research/grounding-dino-tiny), so we provide two choices for running the `Grounded SAM 2` model:
119
+ - Use the Huggingface API to run inference with Grounding DINO (simple and clear)
120
+
121
+ ```bash
122
+ python grounded_sam2_hf_model_demo.py
123
+ ```
124
+
125
+ > [!NOTE]
126
+ > 🚨 If you encounter network issues while using the `HuggingFace` model, you can resolve them by setting the appropriate mirror source as `export HF_ENDPOINT=https://hf-mirror.com`
127
+
128
+ - Load a local pretrained Grounding DINO checkpoint and run inference with the original Grounding DINO API (make sure you've already downloaded the pretrained checkpoint)
129
+
130
+ ```bash
131
+ python grounded_sam2_local_demo.py
132
+ ```
133
+
134
+
135
+ ### Grounded SAM 2 Image Demo (with Grounding DINO 1.5 & 1.6)
136
+
137
+ We've already released our most capable open-set detection model [Grounding DINO 1.5 & 1.6](https://github.com/IDEA-Research/Grounding-DINO-1.5-API), which can be combined with SAM 2 for stronger open-set detection and segmentation capability. You can apply for an API token first and then run Grounded SAM 2 with Grounding DINO 1.5 as follows:
138
+
139
+ Install the latest DDS cloudapi:
140
+
141
+ ```bash
142
+ pip install dds-cloudapi-sdk --upgrade
143
+ ```
144
+
145
+ Apply your API token from our official website here: [request API token](https://deepdataspace.com/request_api).
146
+
147
+ ```bash
148
+ python grounded_sam2_gd1.5_demo.py
149
+ ```
150
+
151
+ ### SAHI (Slicing Aided Hyper Inference) with Grounding DINO 1.5 and SAM 2
152
+
153
+ If your images are high resolution with dense objects, directly using Grounding DINO 1.5 for inference on the original image may not be the best choice. We support [SAHI (Slicing Aided Hyper Inference)](https://docs.ultralytics.com/guides/sahi-tiled-inference/), which works by first dividing the original image into smaller overlapping patches. Inference is then performed separately on each patch, and the final detection results are merged. This method is highly effective and accurate for detecting dense, small objects in high-resolution images.
154
+
155
+ You can run SAHI inference by setting the following param in [grounded_sam2_gd1.5_demo.py](./grounded_sam2_gd1.5_demo.py):
156
+
157
+ ```python
158
+ WITH_SLICE_INFERENCE = True
159
+ ```
160
+
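+ Under the hood, the slice-inference code in this repo (see e.g. `grounded_sam2_dinox_demo.py`) is built on the `InferenceSlicer` utility from the [supervision](https://github.com/roboflow/supervision) library. The following is a minimal sketch of that pattern; the callback body, patch size, and image path are placeholders rather than the exact values used in the demo scripts:
+ 
+ ```python
+ import cv2
+ import numpy as np
+ import supervision as sv
+ 
+ def detect_on_slice(image_slice: np.ndarray) -> sv.Detections:
+     # Run your grounding model (e.g. the Grounding DINO 1.5 cloud API) on this patch and
+     # return its boxes/scores/class ids; here we return no detections as a placeholder.
+     return sv.Detections(
+         xyxy=np.zeros((0, 4)),
+         confidence=np.zeros(0),
+         class_id=np.zeros(0, dtype=int),
+     )
+ 
+ slicer = sv.InferenceSlicer(
+     callback=detect_on_slice,     # called once per overlapping image patch
+     slice_wh=(480, 480),          # patch size (illustrative)
+     overlap_ratio_wh=(0.2, 0.2),  # overlap between neighbouring patches
+     iou_threshold=0.5,            # NMS threshold used when merging patch detections
+ )
+ detections = slicer(cv2.imread("path/to/high_res_image.jpg"))  # merged detections for the full image
+ ```
+ 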
161
+ The visualization is shown as follows:
162
+
163
+ | Text Prompt | Input Image | Grounded SAM 2 | Grounded SAM 2 with SAHI |
164
+ |:----:|:----:|:----:|:----:|
165
+ | `Person` | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam_2/demo_images/dense%20people.png?raw=true) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam_2/grounding_dino_1.5_slice_inference/grounded_sam2_annotated_image_with_mask.jpg?raw=true) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam_2/grounding_dino_1.5_slice_inference/grounded_sam2_annotated_image_with_mask_with_slice_inference.jpg?raw=true) |
166
+
167
+ - **Note:** We only support SAHI with Grounding DINO 1.5 because SAHI works better with a stronger grounding model, which produces fewer hallucinated results.
168
+
169
+ ### Grounded SAM 2 Image Demo (with DINO-X)
170
+
171
+ We've implemented Grounded SAM 2 with the strongest open-world perception model [DINO-X](https://github.com/IDEA-Research/DINO-X-API) for better open-set detection and segmentation performance. You can apply for an API token first and then run Grounded SAM 2 with DINO-X as follows:
172
+
173
+ Install the latest DDS cloudapi:
174
+
175
+ ```bash
176
+ pip install dds-cloudapi-sdk --upgrade
177
+ ```
178
+
179
+ Apply your API token from our official website here: [request API token](https://deepdataspace.com/request_api).
180
+
181
+ ```bash
182
+ python grounded_sam2_dinox_demo.py
183
+ ```
184
+
185
+ ### Automatically Saving Grounding Results (Image Demo)
186
+
187
+ After setting `DUMP_JSON_RESULTS=True` in the following Grounded SAM 2 Image Demos:
188
+ - [grounded_sam2_local_demo.py](./grounded_sam2_local_demo.py)
189
+ - [grounded_sam2_hf_model_demo.py](./grounded_sam2_hf_model_demo.py)
190
+ - [grounded_sam2_gd1.5_demo.py](./grounded_sam2_gd1.5_demo.py)
191
+ - [grounded_sam2_dinox_demo.py](./grounded_sam2_dinox_demo.py)
192
+
193
+ The `grounding` and `segmentation` results will be automatically saved in the `outputs` dir with the following format:
194
+
195
+ ```python
196
+ {
197
+ "image_path": "path/to/image.jpg",
198
+ "annotations": [
199
+ {
200
+ "class_name": "class_name",
201
+ "bbox": [x1, y1, x2, y2],
202
+ "segmentation": {
203
+ "size": [h, w],
204
+ "counts": "rle_encoded_mask"
205
+ },
206
+ "score": confidence score
207
+ }
208
+ ],
209
+ "box_format": "xyxy",
210
+ "img_width": w,
211
+ "img_height": h
212
+ }
213
+ ```
214
+
215
+
216
+
217
+ ### Grounded SAM 2 Video Object Tracking Demo
218
+
219
+ Based on the strong tracking capability of SAM 2, we can combine it with Grounding DINO for open-set object segmentation and tracking. You can run the following script to get tracking results with Grounded SAM 2:
220
+
221
+ ```bash
222
+ python grounded_sam2_tracking_demo.py
223
+ ```
224
+
225
+ - The tracking results of each frame will be saved in `./tracking_results`
226
+ - The video will be saved as `children_tracking_demo_video.mp4`
227
+ - You can modify this file with different text prompts and video clips to get more tracking results.
228
+ - We only prompt the first video frame with Grounding DINO here for simplicity.
229
+
230
+ #### Support Various Prompt Type for Tracking
231
+
232
+ We support different types of prompts for the Grounded SAM 2 tracking demo:
233
+
234
+ - **Point Prompt**: In order to **get stable segmentation results**, we re-use the SAM 2 image predictor to get the prediction mask for each object based on the Grounding DINO box outputs, then we **uniformly sample points from the prediction mask** as point prompts for the SAM 2 video predictor (see the sketch below)
235
+ - **Box Prompt**: We directly use the box outputs from Grounding DINO as box prompts for the SAM 2 video predictor
236
+ - **Mask Prompt**: We use the SAM 2 mask prediction results based on the Grounding DINO box outputs as mask prompts for the SAM 2 video predictor.
237
+
238
+ ![Grounded SAM 2 Tracking Pipeline](./assets/g_sam2_tracking_pipeline_vis_new.png)
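+ 
+ A minimal sketch of the point-sampling step described above (the helper name and the number of points are illustrative, not the repo's exact implementation):
+ 
+ ```python
+ import numpy as np
+ 
+ def sample_points_from_mask(mask: np.ndarray, num_points: int = 10, seed: int = 0):
+     """Uniformly sample positive point prompts from a binary mask of shape (H, W)."""
+     ys, xs = np.nonzero(mask)
+     if len(xs) == 0:
+         return np.zeros((0, 2), dtype=np.float32), np.zeros(0, dtype=np.int32)
+     rng = np.random.default_rng(seed)
+     idx = rng.choice(len(xs), size=min(num_points, len(xs)), replace=False)
+     points = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)  # (N, 2) in (x, y) order
+     labels = np.ones(len(points), dtype=np.int32)                     # 1 = positive point
+     return points, labels
+ ```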
239
+
240
+
241
+ ### Grounded SAM 2 Video Object Tracking Demo (with Grounding DINO 1.5 & 1.6)
242
+
243
+ We also support a video object tracking demo based on our stronger `Grounding DINO 1.5` model and `SAM 2`. You can try the following demo after applying for the API key needed to run `Grounding DINO 1.5`:
244
+
245
+ ```bash
246
+ python grounded_sam2_tracking_demo_with_gd1.5.py
247
+ ```
248
+
249
+ ### Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with Grounding DINO)
250
+
251
+ Users can upload their own video file (e.g. `assets/hippopotamus.mp4`) and specify their custom text prompts for grounding and tracking with Grounding DINO and SAM 2 by using the following scripts:
252
+
253
+ ```bash
254
+ python grounded_sam2_tracking_demo_custom_video_input_gd1.0_hf_model.py
255
+ ```
256
+
257
+ If the Huggingface demo is not convenient for you, you can also run the tracking demo with a local Grounding DINO model using the following script:
258
+
259
+ ```bash
260
+ python grounded_sam2_tracking_demo_custom_video_input_gd1.0_local_model.py
261
+ ```
262
+
263
+ ### Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with Grounding DINO 1.5 & 1.6)
264
+
265
+ Users can upload their own video file (e.g. `assets/hippopotamus.mp4`) and specify their custom text prompts for grounding and tracking with Grounding DINO 1.5 and SAM 2 by using the following scripts:
266
+
267
+ ```bash
268
+ python grounded_sam2_tracking_demo_custom_video_input_gd1.5.py
269
+ ```
270
+
271
+ You can specify the params in this file:
272
+
273
+ ```python
274
+ VIDEO_PATH = "./assets/hippopotamus.mp4"
275
+ TEXT_PROMPT = "hippopotamus."
276
+ OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
277
+ API_TOKEN_FOR_GD1_5 = "Your API token" # api token for G-DINO 1.5
278
+ PROMPT_TYPE_FOR_VIDEO = "mask" # using SAM 2 mask prediction as prompt for video predictor
279
+ ```
280
+
281
+ After running our demo code, you can get the tracking results as follows:
282
+
283
+ [![Video Name](./assets/hippopotamus_seg.jpg)](https://github.com/user-attachments/assets/1fbdc6f4-3e50-4221-9600-98c397beecdf)
284
+
285
+ The tracking visualization results will be automatically saved to `OUTPUT_VIDEO_PATH`.
286
+
287
+ > [!WARNING]
288
+ > We initialize the box prompts on the first frame of the input video. If you want to start from a different frame, you can modify `ann_frame_idx` in our code.
289
+
290
+ ### Grounded SAM 2 Video Object Tracking Demo with Custom Video Input (with DINO-X)
291
+
292
+ Users can upload their own video file (e.g. `assets/hippopotamus.mp4`) and specify their custom text prompts for grounding and tracking with DINO-X and SAM 2 by using the following scripts:
293
+
294
+ ```bash
295
+ python grounded_sam2_tracking_demo_custom_video_input_dinox.py
296
+ ```
297
+
298
+ ### Grounded-SAM-2 Video Object Tracking with Continuous ID (with Grounding DINO)
299
+
300
+ In the above demos, we only prompt Grounded SAM 2 on a specific frame, which may not be suitable for finding new objects that appear later in the video. In this demo, we try to **find new objects** and assign them new IDs across the whole video. This function is **still under development** and is not very stable yet.
301
+
302
+ Users can upload their own video files and specify custom text prompts for grounding and tracking using the Grounding DINO and SAM 2 frameworks. To do this, execute the script:
303
+
304
+
305
+ ```bash
306
+ python grounded_sam2_tracking_demo_with_continuous_id.py
307
+ ```
308
+
309
+ You can customize various parameters, including the following (an illustrative settings sketch is shown after the list):
310
+
311
+ - `text`: The grounding text prompt.
312
+ - `video_dir`: Directory containing the video files.
313
+ - `output_dir`: Directory to save the processed output.
314
+ - `output_video_path`: Path for the output video.
315
+ - `step`: Frame stepping for processing.
316
+ - `box_threshold`: box threshold for the Grounding DINO model
317
+ - `text_threshold`: text threshold for the Grounding DINO model
318
+ Note: This method supports only the mask prompt type.
319
+
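+ An illustrative settings sketch for the parameters listed above (all values and paths below are placeholders, not the script's defaults):
+ 
+ ```python
+ text = "car."                       # grounding text prompt
+ video_dir = "path/to/video_files"   # directory containing the video files (placeholder)
+ output_dir = "./outputs"            # directory to save the processed output
+ output_video_path = "./output.mp4"  # path for the output video
+ step = 10                           # frame stepping for processing
+ box_threshold = 0.25                # box threshold for the Grounding DINO model
+ text_threshold = 0.25               # text threshold for the Grounding DINO model
+ ```
+ 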
320
+ After running our demo code, you can get the tracking results as follows:
321
+
322
+ [![Video Name](./assets/tracking_car_mask_1.jpg)](https://github.com/user-attachments/assets/d3f91ad0-3d32-43c4-a0dc-0bed661415f4)
323
+
324
+ If you want to try `Grounding DINO 1.5` model, you can run the following scripts after setting your API token:
325
+
326
+ ```bash
327
+ python grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py
328
+ ```
329
+
330
+ ### Grounded-SAM-2 Video Object Tracking with Continuous ID plus Reverse Tracking (with Grounding DINO)
331
+ This method can cover the whole lifetime of the object:
332
+ ```bash
333
+ python grounded_sam2_tracking_demo_with_continuous_id_plus.py
334
+
335
+ ```
336
+
337
+ ## Grounded SAM 2 Florence-2 Demos
338
+ ### Grounded SAM 2 Florence-2 Image Demo
339
+
340
+ In this section, we will explore how to integrate the feature-rich and robust open-source models [Florence-2](https://arxiv.org/abs/2311.06242) and SAM 2 to develop practical applications.
341
+
342
+ [Florence-2](https://arxiv.org/abs/2311.06242) is a powerful vision foundation model from Microsoft which supports a series of vision tasks by prompting with a special `task_prompt`, including but not limited to:
343
+
344
+ | Task | Task Prompt | Text Input | Task Introduction |
345
+ |:---:|:---:|:---:|:---:|
346
+ | Object Detection | `<OD>` | &#10008; | Detect main objects with single category name |
347
+ | Dense Region Caption | `<DENSE_REGION_CAPTION>` | &#10008; | Detect main objects with short description |
348
+ | Region Proposal | `<REGION_PROPOSAL>` | &#10008; | Generate proposals without category name |
349
+ | Phrase Grounding | `<CAPTION_TO_PHRASE_GROUNDING>` | &#10004; | Ground main objects in image mentioned in caption |
350
+ | Referring Expression Segmentation | `<REFERRING_EXPRESSION_SEGMENTATION>` | &#10004; | Ground the object which is most related to the text input |
351
+ | Open Vocabulary Detection and Segmentation | `<OPEN_VOCABULARY_DETECTION>` | &#10004; | Ground any object with text input |
352
+
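+ As a hedged sketch of how Florence-2 task prompting typically looks with the Hugging Face `transformers` API (the model id, generation settings, and post-processing call follow the public Florence-2 release and are assumptions here, not code copied from this repo's demo script):
+ 
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ 
+ # Model id and usage follow the public Florence-2 release (assumption, not this repo's script).
+ model_id = "microsoft/Florence-2-large"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, trust_remote_code=True, torch_dtype=torch.float16
+ ).eval().to("cuda")
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ 
+ def run_florence2(image: Image.Image, task_prompt: str, text_input: str = None):
+     prompt = task_prompt if text_input is None else task_prompt + text_input
+     inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
+     generated_ids = model.generate(
+         input_ids=inputs["input_ids"],
+         pixel_values=inputs["pixel_values"],
+         max_new_tokens=1024,
+         num_beams=3,
+     )
+     generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
+     # e.g. {"<OD>": {"bboxes": [...], "labels": [...]}} for detection-style tasks
+     return processor.post_process_generation(
+         generated_text, task=task_prompt, image_size=(image.width, image.height)
+     )
+ 
+ image = Image.open("./notebooks/images/cars.jpg").convert("RGB")
+ results = run_florence2(image, "<OD>")  # the returned boxes can be passed to SAM 2 as box prompts
+ ```
+ 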
353
+
354
+ By integrating `Florence-2` with `SAM 2`, we can build a strong vision pipeline to solve complex vision tasks. You can try the following scripts to run the demo:
355
+
356
+ > [!NOTE]
357
+ > 🚨 If you encounter network issues while using the `HuggingFace` model, you can resolve them by setting the appropriate mirror source as `export HF_ENDPOINT=https://hf-mirror.com`
358
+
359
+ **Object Detection and Segmentation**
360
+ ```bash
361
+ python grounded_sam2_florence2_image_demo.py \
362
+ --pipeline object_detection_segmentation \
363
+ --image_path ./notebooks/images/cars.jpg
364
+ ```
365
+
366
+ **Dense Region Caption and Segmentation**
367
+ ```bash
368
+ python grounded_sam2_florence2_image_demo.py \
369
+ --pipeline dense_region_caption_segmentation \
370
+ --image_path ./notebooks/images/cars.jpg
371
+ ```
372
+
373
+ **Region Proposal and Segmentation**
374
+ ```bash
375
+ python grounded_sam2_florence2_image_demo.py \
376
+ --pipeline region_proposal_segmentation \
377
+ --image_path ./notebooks/images/cars.jpg
378
+ ```
379
+
380
+ **Phrase Grounding and Segmentation**
381
+ ```bash
382
+ python grounded_sam2_florence2_image_demo.py \
383
+ --pipeline phrase_grounding_segmentation \
384
+ --image_path ./notebooks/images/cars.jpg \
385
+ --text_input "The image shows two vintage Chevrolet cars parked side by side, with one being a red convertible and the other a pink sedan, \
386
+ set against the backdrop of an urban area with a multi-story building and trees. \
387
+ The cars have Cuban license plates, indicating a location likely in Cuba."
388
+ ```
389
+
390
+ **Referring Expression Segmentation**
391
+ ```bash
392
+ python grounded_sam2_florence2_image_demo.py \
393
+ --pipeline referring_expression_segmentation \
394
+ --image_path ./notebooks/images/cars.jpg \
395
+ --text_input "The left red car."
396
+ ```
397
+
398
+ **Open-Vocabulary Detection and Segmentation**
399
+ ```bash
400
+ python grounded_sam2_florence2_image_demo.py \
401
+ --pipeline open_vocabulary_detection_segmentation \
402
+ --image_path ./notebooks/images/cars.jpg \
403
+ --text_input "car <and> building"
404
+ ```
405
+ - Note that if you want to **detect multiple classes**, you should separate them with `<and>` in your input text.
406
+
407
+
408
+ ### Grounded SAM 2 Florence-2 Image Auto-Labeling Demo
409
+ `Florence-2` can be used as an automatic image annotator by cascading its captioning capability with its grounding capability.
410
+
411
+ | Task | Task Prompt | Text Input |
412
+ |:---:|:---:|:---:|
413
+ | Caption + Phrase Grounding | `<CAPTION>` + `<CAPTION_TO_PHRASE_GROUNDING>` | &#10008; |
414
+ | Detailed Caption + Phrase Grounding | `<DETAILED_CAPTION>` + `<CAPTION_TO_PHRASE_GROUNDING>` | &#10008; |
415
+ | More Detailed Caption + Phrase Grounding | `<MORE_DETAILED_CAPTION>` + `<CAPTION_TO_PHRASE_GROUNDING>` | &#10008; |
416
+
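+ Conceptually, the cascade first asks Florence-2 for a caption and then grounds that caption back onto the image. Reusing the hypothetical `run_florence2` helper sketched in the image-demo section above (the return keys follow the public Florence-2 release and are assumptions here):
+ 
+ ```python
+ caption = run_florence2(image, "<CAPTION>")["<CAPTION>"]
+ grounding = run_florence2(image, "<CAPTION_TO_PHRASE_GROUNDING>", text_input=caption)
+ boxes = grounding["<CAPTION_TO_PHRASE_GROUNDING>"]["bboxes"]   # box prompts for SAM 2
+ labels = grounding["<CAPTION_TO_PHRASE_GROUNDING>"]["labels"]  # phrase for each box
+ ```
+ 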
417
+ You can try the following scripts to run these demos:
418
+
419
+ **Caption to Phrase Grounding**
420
+ ```bash
421
+ python grounded_sam2_florence2_autolabel_pipeline.py \
422
+ --image_path ./notebooks/images/groceries.jpg \
423
+ --pipeline caption_to_phrase_grounding \
424
+ --caption_type caption
425
+ ```
426
+
427
+ - You can specify `caption_type` to control the granularity of the caption; if you want a more detailed caption, try `--caption_type detailed_caption` or `--caption_type more_detailed_caption`.
428
+
429
+ ### Citation
430
+
431
+ If you find this project helpful for your research, please consider citing the following BibTeX entry.
432
+
433
+ ```BibTex
434
+ @misc{ravi2024sam2segmentimages,
435
+ title={SAM 2: Segment Anything in Images and Videos},
436
+ author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman Rädle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Dollár and Christoph Feichtenhofer},
437
+ year={2024},
438
+ eprint={2408.00714},
439
+ archivePrefix={arXiv},
440
+ primaryClass={cs.CV},
441
+ url={https://arxiv.org/abs/2408.00714},
442
+ }
443
+
444
+ @article{liu2023grounding,
445
+ title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
446
+ author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
447
+ journal={arXiv preprint arXiv:2303.05499},
448
+ year={2023}
449
+ }
450
+
451
+ @misc{ren2024grounding,
452
+ title={Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection},
453
+ author={Tianhe Ren and Qing Jiang and Shilong Liu and Zhaoyang Zeng and Wenlong Liu and Han Gao and Hongjie Huang and Zhengyu Ma and Xiaoke Jiang and Yihao Chen and Yuda Xiong and Hao Zhang and Feng Li and Peijun Tang and Kent Yu and Lei Zhang},
454
+ year={2024},
455
+ eprint={2405.10300},
456
+ archivePrefix={arXiv},
457
+ primaryClass={cs.CV}
458
+ }
459
+
460
+ @misc{ren2024grounded,
461
+ title={Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks},
462
+ author={Tianhe Ren and Shilong Liu and Ailing Zeng and Jing Lin and Kunchang Li and He Cao and Jiayu Chen and Xinyu Huang and Yukang Chen and Feng Yan and Zhaoyang Zeng and Hao Zhang and Feng Li and Jie Yang and Hongyang Li and Qing Jiang and Lei Zhang},
463
+ year={2024},
464
+ eprint={2401.14159},
465
+ archivePrefix={arXiv},
466
+ primaryClass={cs.CV}
467
+ }
468
+
469
+ @article{kirillov2023segany,
470
+ title={Segment Anything},
471
+ author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
472
+ journal={arXiv:2304.02643},
473
+ year={2023}
474
+ }
475
+
476
+ @misc{jiang2024trex2,
477
+ title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy},
478
+ author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang},
479
+ year={2024},
480
+ eprint={2403.14610},
481
+ archivePrefix={arXiv},
482
+ primaryClass={cs.CV}
483
+ }
484
+ ```
clone-IDEA-Research/Grounded-SAM-2/SAM2_README.md ADDED
@@ -0,0 +1,140 @@
1
+ # SAM 2: Segment Anything in Images and Videos
2
+
3
+ **[AI at Meta, FAIR](https://ai.meta.com/research/)**
4
+
5
+ [Nikhila Ravi](https://nikhilaravi.com/), [Valentin Gabeur](https://gabeur.github.io/), [Yuan-Ting Hu](https://scholar.google.com/citations?user=E8DVVYQAAAAJ&hl=en), [Ronghang Hu](https://ronghanghu.com/), [Chaitanya Ryali](https://scholar.google.com/citations?user=4LWx24UAAAAJ&hl=en), [Tengyu Ma](https://scholar.google.com/citations?user=VeTSl0wAAAAJ&hl=en), [Haitham Khedr](https://hkhedr.com/), [Roman Rädle](https://scholar.google.de/citations?user=Tpt57v0AAAAJ&hl=en), [Chloe Rolland](https://scholar.google.com/citations?hl=fr&user=n-SnMhoAAAAJ), [Laura Gustafson](https://scholar.google.com/citations?user=c8IpF9gAAAAJ&hl=en), [Eric Mintun](https://ericmintun.github.io/), [Junting Pan](https://junting.github.io/), [Kalyan Vasudev Alwala](https://scholar.google.co.in/citations?user=m34oaWEAAAAJ&hl=en), [Nicolas Carion](https://www.nicolascarion.com/), [Chao-Yuan Wu](https://chaoyuan.org/), [Ross Girshick](https://www.rossgirshick.info/), [Piotr Dollár](https://pdollar.github.io/), [Christoph Feichtenhofer](https://feichtenhofer.github.io/)
6
+
7
+ [[`Paper`](https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/)] [[`Project`](https://ai.meta.com/sam2)] [[`Demo`](https://sam2.metademolab.com/)] [[`Dataset`](https://ai.meta.com/datasets/segment-anything-video)] [[`Blog`](https://ai.meta.com/blog/segment-anything-2)] [[`BibTeX`](#citing-sam-2)]
8
+
9
+ ![SAM 2 architecture](assets/model_diagram.png?raw=true)
10
+
11
+ **Segment Anything Model 2 (SAM 2)** is a foundation model towards solving promptable visual segmentation in images and videos. We extend SAM to video by considering images as a video with a single frame. The model design is a simple transformer architecture with streaming memory for real-time video processing. We build a model-in-the-loop data engine, which improves model and data via user interaction, to collect [**our SA-V dataset**](https://ai.meta.com/datasets/segment-anything-video), the largest video segmentation dataset to date. SAM 2 trained on our data provides strong performance across a wide range of tasks and visual domains.
12
+
13
+ ![SA-V dataset](assets/sa_v_dataset.jpg?raw=true)
14
+
15
+ ## Installation
16
+
17
+ Please install SAM 2 on a GPU machine using:
18
+
19
+ ```bash
20
+ git clone https://github.com/facebookresearch/segment-anything-2.git
21
+
22
+ cd segment-anything-2; pip install -e .
23
+ ```
24
+
25
+ To use the SAM 2 predictor and run the example notebooks, `jupyter` and `matplotlib` are required and can be installed by:
26
+
27
+ ```bash
28
+ pip install -e ".[demo]"
29
+ ```
30
+
31
+ ## Getting Started
32
+
33
+ ### Download Checkpoints
34
+
35
+ First, we need to download a model checkpoint. All the model checkpoints can be downloaded by running:
36
+
37
+ ```bash
38
+ cd checkpoints
39
+ ./download_ckpts.sh
40
+ ```
41
+
42
+ or individually from:
43
+
44
+ - [sam2_hiera_tiny.pt](https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
45
+ - [sam2_hiera_small.pt](https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_small.pt)
46
+ - [sam2_hiera_base_plus.pt](https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_base_plus.pt)
47
+ - [sam2_hiera_large.pt](https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt)
48
+
49
+ Then SAM 2 can be used in a few lines as follows for image and video prediction.
50
+
51
+ ### Image prediction
52
+
53
+ SAM 2 has all the capabilities of [SAM](https://github.com/facebookresearch/segment-anything) on static images, and we provide image prediction APIs that closely resemble SAM for image use cases. The `SAM2ImagePredictor` class has an easy interface for image prompting.
54
+
55
+ ```python
56
+ import torch
57
+ from sam2.build_sam import build_sam2
58
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
59
+
60
+ checkpoint = "./checkpoints/sam2_hiera_large.pt"
61
+ model_cfg = "sam2_hiera_l.yaml"
62
+ predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
63
+
64
+ with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
65
+ predictor.set_image(<your_image>)
66
+ masks, _, _ = predictor.predict(<input_prompts>)
67
+ ```
68
+
69
+ Please refer to the examples in [image_predictor_example.ipynb](./notebooks/image_predictor_example.ipynb) for static image use cases.
70
+
71
+ SAM 2 also supports automatic mask generation on images just like SAM. Please see [automatic_mask_generator_example.ipynb](./notebooks/automatic_mask_generator_example.ipynb) for automatic mask generation in images.
72
+
73
+ ### Video prediction
74
+
75
+ For promptable segmentation and tracking in videos, we provide a video predictor with APIs for example to add prompts and propagate masklets throughout a video. SAM 2 supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.
76
+
77
+ ```python
78
+ import torch
79
+ from sam2.build_sam import build_sam2_video_predictor
80
+
81
+ checkpoint = "./checkpoints/sam2_hiera_large.pt"
82
+ model_cfg = "sam2_hiera_l.yaml"
83
+ predictor = build_sam2_video_predictor(model_cfg, checkpoint)
84
+
85
+ with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
86
+ state = predictor.init_state(<your_video>)
87
+
88
+ # add new prompts and instantly get the output on the same frame
89
+ frame_idx, object_ids, masks = predictor.add_new_points(state, <your prompts>)
90
+
91
+ # propagate the prompts to get masklets throughout the video
92
+ for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
93
+ ...
94
+ ```
95
+
96
+ Please refer to the examples in [video_predictor_example.ipynb](./notebooks/video_predictor_example.ipynb) for details on how to add prompts, make refinements, and track multiple objects in videos.
97
+
98
+ ## Model Description
99
+
100
+ | **Model** | **Size (M)** | **Speed (FPS)** | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
101
+ | :------------------: | :----------: | :--------------------: | :-----------------: | :----------------: | :---------------: |
102
+ | sam2_hiera_tiny | 38.9 | 47.2 | 75.0 | 70.9 | 75.3 |
103
+ | sam2_hiera_small | 46 | 43.3 (53.0 compiled\*) | 74.9 | 71.5 | 76.4 |
104
+ | sam2_hiera_base_plus | 80.8 | 34.8 (43.8 compiled\*) | 74.7 | 72.8 | 75.8 |
105
+ | sam2_hiera_large | 224.4 | 24.2 (30.2 compiled\*) | 76.0 | 74.6 | 79.8 |
106
+
107
+ \* Compile the model by setting `compile_image_encoder: True` in the config.
108
+
109
+ ## Segment Anything Video Dataset
110
+
111
+ See [sav_dataset/README.md](sav_dataset/README.md) for details.
112
+
113
+ ## License
114
+
115
+ The models are licensed under the [Apache 2.0 license](./LICENSE). Please refer to our research paper for more details on the models.
116
+
117
+ ## Contributing
118
+
119
+ See [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).
120
+
121
+ ## Contributors
122
+
123
+ The SAM 2 project was made possible with the help of many contributors (alphabetical):
124
+
125
+ Karen Bergan, Daniel Bolya, Alex Bosenberg, Kai Brown, Vispi Cassod, Christopher Chedeau, Ida Cheng, Luc Dahlin, Shoubhik Debnath, Rene Martinez Doehner, Grant Gardner, Sahir Gomez, Rishi Godugu, Baishan Guo, Caleb Ho, Andrew Huang, Somya Jain, Bob Kamma, Amanda Kallet, Jake Kinney, Alexander Kirillov, Shiva Koduvayur, Devansh Kukreja, Robert Kuo, Aohan Lin, Parth Malani, Jitendra Malik, Mallika Malhotra, Miguel Martin, Alexander Miller, Sasha Mitts, William Ngan, George Orlin, Joelle Pineau, Kate Saenko, Rodrick Shepard, Azita Shokrpour, David Soofian, Jonathan Torres, Jenny Truong, Sagar Vaze, Meng Wang, Claudette Ward, Pengchuan Zhang.
126
+
127
+ Third-party code: we use a GPU-based connected component algorithm adapted from [`cc_torch`](https://github.com/zsef123/Connected_components_PyTorch) (with its license in [`LICENSE_cctorch`](./LICENSE_cctorch)) as an optional post-processing step for the mask predictions.
128
+
129
+ ## Citing SAM 2
130
+
131
+ If you use SAM 2 or the SA-V dataset in your research, please use the following BibTeX entry.
132
+
133
+ ```bibtex
134
+ @article{ravi2024sam2,
135
+ title={SAM 2: Segment Anything in Images and Videos},
136
+ author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
137
+ journal={arXiv preprint},
138
+ year={2024}
139
+ }
140
+ ```
clone-IDEA-Research/Grounded-SAM-2/backend.Dockerfile ADDED
@@ -0,0 +1,64 @@
1
+ ARG BASE_IMAGE=pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
2
+ ARG MODEL_SIZE=base_plus
3
+
4
+ FROM ${BASE_IMAGE}
5
+
6
+ # Gunicorn environment variables
7
+ ENV GUNICORN_WORKERS=1
8
+ ENV GUNICORN_THREADS=2
9
+ ENV GUNICORN_PORT=5000
10
+
11
+ # SAM 2 environment variables
12
+ ENV APP_ROOT=/opt/sam2
13
+ ENV PYTHONUNBUFFERED=1
14
+ ENV SAM2_BUILD_CUDA=0
15
+ ENV MODEL_SIZE=${MODEL_SIZE}
16
+
17
+ # Install system requirements
18
+ RUN apt-get update && apt-get install -y --no-install-recommends \
19
+ ffmpeg \
20
+ libavutil-dev \
21
+ libavcodec-dev \
22
+ libavformat-dev \
23
+ libswscale-dev \
24
+ pkg-config \
25
+ build-essential \
26
+ libffi-dev
27
+
28
+ COPY setup.py .
29
+ COPY README.md .
30
+
31
+ RUN pip install --upgrade pip setuptools
32
+ RUN pip install -e ".[interactive-demo]"
33
+
34
+ # https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite/issues/69#issuecomment-1826764707
35
+ RUN rm /opt/conda/bin/ffmpeg && ln -s /bin/ffmpeg /opt/conda/bin/ffmpeg
36
+
37
+ # Make app directory. This directory will host all files required for the
38
+ # backend and SAM 2 inference files.
39
+ RUN mkdir ${APP_ROOT}
40
+
41
+ # Copy backend server files
42
+ COPY demo/backend/server ${APP_ROOT}/server
43
+
44
+ # Copy SAM 2 inference files
45
+ COPY sam2 ${APP_ROOT}/server/sam2
46
+
47
+ # Download SAM 2.1 checkpoints
48
+ ADD https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt ${APP_ROOT}/checkpoints/sam2.1_hiera_tiny.pt
49
+ ADD https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt ${APP_ROOT}/checkpoints/sam2.1_hiera_small.pt
50
+ ADD https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt ${APP_ROOT}/checkpoints/sam2.1_hiera_base_plus.pt
51
+ ADD https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt ${APP_ROOT}/checkpoints/sam2.1_hiera_large.pt
52
+
53
+ WORKDIR ${APP_ROOT}/server
54
+
55
+ # https://pythonspeed.com/articles/gunicorn-in-docker/
56
+ CMD gunicorn --worker-tmp-dir /dev/shm \
57
+ --worker-class gthread app:app \
58
+ --log-level info \
59
+ --access-logfile /dev/stdout \
60
+ --log-file /dev/stderr \
61
+ --workers ${GUNICORN_WORKERS} \
62
+ --threads ${GUNICORN_THREADS} \
63
+ --bind 0.0.0.0:${GUNICORN_PORT} \
64
+ --timeout 60
clone-IDEA-Research/Grounded-SAM-2/docker-compose.yaml ADDED
@@ -0,0 +1,42 @@
1
+ services:
2
+ frontend:
3
+ image: sam2/frontend
4
+ build:
5
+ context: ./demo/frontend
6
+ dockerfile: frontend.Dockerfile
7
+ ports:
8
+ - 7262:80
9
+
10
+ backend:
11
+ image: sam2/backend
12
+ build:
13
+ context: .
14
+ dockerfile: backend.Dockerfile
15
+ ports:
16
+ - 7263:5000
17
+ volumes:
18
+ - ./demo/data/:/data/:rw
19
+ environment:
20
+ - SERVER_ENVIRONMENT=DEV
21
+ - GUNICORN_WORKERS=1
22
+ # Inference API needs to have at least 2 threads to handle an incoming
23
+ # parallel cancel propagation request
24
+ - GUNICORN_THREADS=2
25
+ - GUNICORN_PORT=5000
26
+ - API_URL=http://localhost:7263
27
+ - DEFAULT_VIDEO_PATH=gallery/05_default_juggle.mp4
28
+ # # ffmpeg/video encode settings
29
+ - FFMPEG_NUM_THREADS=1
30
+ - VIDEO_ENCODE_CODEC=libx264
31
+ - VIDEO_ENCODE_CRF=23
32
+ - VIDEO_ENCODE_FPS=24
33
+ - VIDEO_ENCODE_MAX_WIDTH=1280
34
+ - VIDEO_ENCODE_MAX_HEIGHT=720
35
+ - VIDEO_ENCODE_VERBOSE=False
36
+ deploy:
37
+ resources:
38
+ reservations:
39
+ devices:
40
+ - driver: nvidia
41
+ count: 1
42
+ capabilities: [gpu]
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_dinox_demo.py ADDED
@@ -0,0 +1,245 @@
1
+ # dds cloudapi for DINO-X
2
+ from dds_cloudapi_sdk import Config
3
+ from dds_cloudapi_sdk import Client
4
+ from dds_cloudapi_sdk.tasks.dinox import DinoxTask
5
+ from dds_cloudapi_sdk.tasks.types import DetectionTarget
6
+ from dds_cloudapi_sdk import TextPrompt
7
+
8
+ import os
9
+ import cv2
10
+ import json
11
+ import torch
12
+ import tempfile
13
+ import numpy as np
14
+ import supervision as sv
15
+ import pycocotools.mask as mask_util
16
+ from pathlib import Path
17
+ from PIL import Image
18
+ from sam2.build_sam import build_sam2
19
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
20
+
21
+ """
22
+ Hyper parameters
23
+ """
24
+ API_TOKEN = "Your API token"
25
+ TEXT_PROMPT = "car . building ."
26
+ IMG_PATH = "notebooks/images/cars.jpg"
27
+ SAM2_CHECKPOINT = "./checkpoints/sam2.1_hiera_large.pt"
28
+ SAM2_MODEL_CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"
29
+ BOX_THRESHOLD = 0.2
30
+ WITH_SLICE_INFERENCE = False
31
+ SLICE_WH = (480, 480)
32
+ OVERLAP_RATIO = (0.2, 0.2)
33
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
34
+ OUTPUT_DIR = Path("outputs/grounded_sam2_dinox_demo")
35
+ DUMP_JSON_RESULTS = True
36
+
37
+ # create output directory
38
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
39
+
40
+ """
41
+ Prompt DINO-X with Text for Box Prompt Generation with Cloud API
42
+ """
43
+ # Step 1: initialize the config
44
+ token = API_TOKEN
45
+ config = Config(token)
46
+
47
+ # Step 2: initialize the client
48
+ client = Client(config)
49
+
50
+ # Step 3: run the task using the DinoxTask class
51
+ # image_url = "https://algosplt.oss-cn-shenzhen.aliyuncs.com/test_files/tasks/detection/iron_man.jpg"
52
+ # if you are processing local image file, upload them to DDS server to get the image url
53
+
54
+ classes = [x.strip().lower() for x in TEXT_PROMPT.split('.') if x]
55
+ class_name_to_id = {name: id for id, name in enumerate(classes)}
56
+ class_id_to_name = {id: name for name, id in class_name_to_id.items()}
57
+
58
+ if WITH_SLICE_INFERENCE:
59
+ def callback(image_slice: np.ndarray) -> sv.Detections:
60
+ print("Inference on image slice")
61
+ # save the image slice as a temp file for the DINO-X API
62
+ with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmpfile:
63
+ temp_filename = tmpfile.name
64
+ cv2.imwrite(temp_filename, image_slice)
65
+ image_url = client.upload_file(temp_filename)
66
+ task = DinoxTask(
67
+ image_url=image_url,
68
+ prompts=[TextPrompt(text=TEXT_PROMPT)],
69
+ bbox_threshold=0.25,
70
+ targets=[DetectionTarget.BBox],
71
+ )
72
+ client.run_task(task)
73
+ result = task.result
74
+ # delete the temp file
75
+ os.remove(temp_filename)
76
+
77
+ input_boxes = []
78
+ confidences = []
79
+ class_ids = []
80
+ objects = result.objects
81
+ for idx, obj in enumerate(objects):
82
+ input_boxes.append(obj.bbox)
83
+ confidences.append(obj.score)
84
+ cls_name = obj.category.lower().strip()
85
+ class_ids.append(class_name_to_id[cls_name])
86
+ # ensure input_boxes with shape (_, 4)
87
+ input_boxes = np.array(input_boxes).reshape(-1, 4)
88
+ class_ids = np.array(class_ids)
89
+ confidences = np.array(confidences)
90
+ return sv.Detections(xyxy=input_boxes, confidence=confidences, class_id=class_ids)
91
+
92
+ slicer = sv.InferenceSlicer(
93
+ callback=callback,
94
+ slice_wh=SLICE_WH,
95
+ overlap_ratio_wh=OVERLAP_RATIO,
96
+ iou_threshold=0.5,
97
+ overlap_filter_strategy=sv.OverlapFilter.NON_MAX_SUPPRESSION
98
+ )
99
+ detections = slicer(cv2.imread(IMG_PATH))
100
+ class_names = [class_id_to_name[id] for id in detections.class_id]
101
+ confidences = detections.confidence
102
+ class_ids = detections.class_id
103
+ input_boxes = detections.xyxy
104
+ else:
105
+ image_url = client.upload_file(IMG_PATH)
106
+
107
+ task = DinoxTask(
108
+ image_url=image_url,
109
+ prompts=[TextPrompt(text=TEXT_PROMPT)],
110
+ bbox_threshold=0.25,
111
+ targets=[DetectionTarget.BBox],
112
+ )
113
+
114
+ client.run_task(task)
115
+ result = task.result
116
+
117
+ objects = result.objects # the list of detected objects
118
+
119
+
120
+ input_boxes = []
121
+ confidences = []
122
+ class_names = []
123
+ class_ids = []
124
+
125
+ for idx, obj in enumerate(objects):
126
+ input_boxes.append(obj.bbox)
127
+ confidences.append(obj.score)
128
+ cls_name = obj.category.lower().strip()
129
+ class_names.append(cls_name)
130
+ class_ids.append(class_name_to_id[cls_name])
131
+
132
+ input_boxes = np.array(input_boxes)
133
+ class_ids = np.array(class_ids)
134
+
135
+ """
136
+ Init SAM 2 Model and Predict Mask with Box Prompt
137
+ """
138
+
139
+ # environment settings
140
+ # use bfloat16
141
+ torch.autocast(device_type=DEVICE, dtype=torch.bfloat16).__enter__()
142
+
143
+ if torch.cuda.get_device_properties(0).major >= 8:
144
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
145
+ torch.backends.cuda.matmul.allow_tf32 = True
146
+ torch.backends.cudnn.allow_tf32 = True
147
+
148
+ # build SAM2 image predictor
149
+ sam2_checkpoint = SAM2_CHECKPOINT
150
+ model_cfg = SAM2_MODEL_CONFIG
151
+ sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=DEVICE)
152
+ sam2_predictor = SAM2ImagePredictor(sam2_model)
153
+
154
+ image = Image.open(IMG_PATH)
155
+
156
+ sam2_predictor.set_image(np.array(image.convert("RGB")))
157
+
158
+ masks, scores, logits = sam2_predictor.predict(
159
+ point_coords=None,
160
+ point_labels=None,
161
+ box=input_boxes,
162
+ multimask_output=False,
163
+ )
164
+
165
+
166
+ """
167
+ Post-process the output of the model to get the masks, scores, and logits for visualization
168
+ """
169
+ # convert the shape to (n, H, W)
170
+ if masks.ndim == 4:
171
+ masks = masks.squeeze(1)
172
+
173
+
174
+ """
175
+ Visualize the Prediction Results
176
+ """
177
+
178
+ labels = [
179
+ f"{class_name} {confidence:.2f}"
180
+ for class_name, confidence
181
+ in zip(class_names, confidences)
182
+ ]
183
+
184
+ """
185
+ Visualize the image with the supervision API
186
+ """
187
+ img = cv2.imread(IMG_PATH)
188
+ detections = sv.Detections(
189
+ xyxy=input_boxes, # (n, 4)
190
+ mask=masks.astype(bool), # (n, h, w)
191
+ class_id=class_ids
192
+ )
193
+
194
+ box_annotator = sv.BoxAnnotator()
195
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
196
+
197
+ label_annotator = sv.LabelAnnotator()
198
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
199
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "dinox_annotated_image.jpg"), annotated_frame)
200
+
201
+ mask_annotator = sv.MaskAnnotator()
202
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
203
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "dinox_sam2_annotated_image_with_mask.jpg"), annotated_frame)
204
+
205
+ print(f'Annotated image has been saved to "{OUTPUT_DIR}"')
206
+
207
+ """
208
+ Dump the results in standard format and save as json files
209
+ """
210
+
211
+ def single_mask_to_rle(mask):
212
+ rle = mask_util.encode(np.array(mask[:, :, None], order="F", dtype="uint8"))[0]
213
+ rle["counts"] = rle["counts"].decode("utf-8")
214
+ return rle
215
+
216
+ if DUMP_JSON_RESULTS:
217
+ print("Start dumping the annotation...")
218
+ # convert mask into rle format
219
+ mask_rles = [single_mask_to_rle(mask) for mask in masks]
220
+
221
+ input_boxes = input_boxes.tolist()
222
+ scores = scores.tolist()
223
+ # FIXME: class_names should be a list of strings without spaces
224
+ class_names = [class_name.strip() for class_name in class_names]
225
+ # save the results in standard format
226
+ results = {
227
+ "image_path": IMG_PATH,
228
+ "annotations" : [
229
+ {
230
+ "class_name": class_name,
231
+ "bbox": box,
232
+ "segmentation": mask_rle,
233
+ "score": score,
234
+ }
235
+ for class_name, box, mask_rle, score in zip(class_names, input_boxes, mask_rles, scores)
236
+ ],
237
+ "box_format": "xyxy",
238
+ "img_width": image.width,
239
+ "img_height": image.height,
240
+ }
241
+
242
+ with open(os.path.join(OUTPUT_DIR, "grounded_sam2_dinox_image_demo_results.json"), "w") as f:
243
+ json.dump(results, f, indent=4)
244
+
245
+ print(f'Annotation has already been saved to "{OUTPUT_DIR}"')
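As a usage note for the JSON dump written by this demo: the segmentation field is stored as COCO RLE, so it can be loaded back into binary masks with pycocotools. A minimal sketch, assuming the default OUTPUT_DIR and filename used above; everything else is illustrative:

import json
import pycocotools.mask as mask_util

# path follows the dump produced by the demo above; adjust if OUTPUT_DIR was changed
with open("outputs/grounded_sam2_dinox_demo/grounded_sam2_dinox_image_demo_results.json") as f:
    results = json.load(f)

for ann in results["annotations"]:
    rle = dict(ann["segmentation"])
    # some pycocotools versions expect bytes counts, so re-encode the JSON string
    rle["counts"] = rle["counts"].encode("utf-8")
    mask = mask_util.decode(rle)               # (H, W) uint8 binary mask
    x_min, y_min, x_max, y_max = ann["bbox"]   # boxes are stored in xyxy format
    print(ann["class_name"], ann["score"], int(mask.sum()), "foreground pixels")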
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_florence2_autolabel_pipeline.py ADDED
@@ -0,0 +1,198 @@
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import argparse
5
+ import numpy as np
6
+ import supervision as sv
7
+ from PIL import Image
8
+ from sam2.build_sam import build_sam2
9
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
10
+ from transformers import AutoProcessor, AutoModelForCausalLM
11
+ from utils.supervision_utils import CUSTOM_COLOR_MAP
12
+
13
+ """
14
+ Define Some Hyperparam
15
+ """
16
+
17
+ TASK_PROMPT = {
18
+ "caption": "<CAPTION>",
19
+ "detailed_caption": "<DETAILED_CAPTION>",
20
+ "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
21
+ "object_detection": "<OD>",
22
+ "dense_region_caption": "<DENSE_REGION_CAPTION>",
23
+ "region_proposal": "<REGION_PROPOSAL>",
24
+ "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
25
+ "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
26
+ "region_to_segmentation": "<REGION_TO_SEGMENTATION>",
27
+ "open_vocabulary_detection": "<OPEN_VOCABULARY_DETECTION>",
28
+ "region_to_category": "<REGION_TO_CATEGORY>",
29
+ "region_to_description": "<REGION_TO_DESCRIPTION>",
30
+ "ocr": "<OCR>",
31
+ "ocr_with_region": "<OCR_WITH_REGION>",
32
+ }
33
+
34
+ OUTPUT_DIR = "./outputs"
35
+
36
+ if not os.path.exists(OUTPUT_DIR):
37
+ os.makedirs(OUTPUT_DIR, exist_ok=True)
38
+
39
+ """
40
+ Init Florence-2 and SAM 2 Model
41
+ """
42
+
43
+ FLORENCE2_MODEL_ID = "microsoft/Florence-2-large"
44
+ SAM2_CHECKPOINT = "./checkpoints/sam2_hiera_large.pt"
45
+ SAM2_CONFIG = "sam2_hiera_l.yaml"
46
+
47
+ # environment settings
48
+ # use bfloat16
49
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
50
+
51
+ if torch.cuda.get_device_properties(0).major >= 8:
52
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
53
+ torch.backends.cuda.matmul.allow_tf32 = True
54
+ torch.backends.cudnn.allow_tf32 = True
55
+
56
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
57
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
58
+
59
+ # build florence-2
60
+ florence2_model = AutoModelForCausalLM.from_pretrained(FLORENCE2_MODEL_ID, trust_remote_code=True, torch_dtype='auto').eval().to(device)
61
+ florence2_processor = AutoProcessor.from_pretrained(FLORENCE2_MODEL_ID, trust_remote_code=True)
62
+
63
+ # build sam 2
64
+ sam2_model = build_sam2(SAM2_CONFIG, SAM2_CHECKPOINT, device=device)
65
+ sam2_predictor = SAM2ImagePredictor(sam2_model)
66
+
67
+ def run_florence2(task_prompt, text_input, model, processor, image):
68
+ assert model is not None, "You should pass the init florence-2 model here"
69
+ assert processor is not None, "You should set florence-2 processor here"
70
+
71
+ device = model.device
72
+
73
+ if text_input is None:
74
+ prompt = task_prompt
75
+ else:
76
+ prompt = task_prompt + text_input
77
+
78
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch.float16)
79
+ generated_ids = model.generate(
80
+ input_ids=inputs["input_ids"].to(device),
81
+ pixel_values=inputs["pixel_values"].to(device),
82
+ max_new_tokens=1024,
83
+ early_stopping=False,
84
+ do_sample=False,
85
+ num_beams=3,
86
+ )
87
+ generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
88
+ parsed_answer = processor.post_process_generation(
89
+ generated_text,
90
+ task=task_prompt,
91
+ image_size=(image.width, image.height)
92
+ )
93
+ return parsed_answer
94
+
95
+
96
+ """
97
+ We try to support a series of cascaded auto-labelling pipelines with Florence-2 and SAM 2
98
+ """
99
+
100
+ """
101
+ Auto-Labelling Pipeline 1: Caption/Detailed Caption/More Detailed Caption + Phrase Grounding + Segmentation
102
+ """
103
+ def caption_phrase_grounding_and_segmentation(
104
+ florence2_model,
105
+ florence2_processor,
106
+ sam2_predictor,
107
+ image_path,
108
+ caption_task_prompt='<CAPTION>',
109
+ output_dir=OUTPUT_DIR
110
+ ):
111
+ assert caption_task_prompt in ["<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>"]
112
+ image = Image.open(image_path).convert("RGB")
113
+
114
+ # image caption
115
+ caption_results = run_florence2(caption_task_prompt, None, florence2_model, florence2_processor, image)
116
+ text_input = caption_results[caption_task_prompt]
117
+ print(f'Image caption for "{image_path}": ', text_input)
118
+
119
+ # phrase grounding
120
+ grounding_results = run_florence2('<CAPTION_TO_PHRASE_GROUNDING>', text_input, florence2_model, florence2_processor, image)
121
+ grounding_results = grounding_results['<CAPTION_TO_PHRASE_GROUNDING>']
122
+
123
+ # parse florence-2 detection results
124
+ input_boxes = np.array(grounding_results["bboxes"])
125
+ class_names = grounding_results["labels"]
126
+ class_ids = np.array(list(range(len(class_names))))
127
+
128
+ # predict mask with SAM 2
129
+ sam2_predictor.set_image(np.array(image))
130
+ masks, scores, logits = sam2_predictor.predict(
131
+ point_coords=None,
132
+ point_labels=None,
133
+ box=input_boxes,
134
+ multimask_output=False,
135
+ )
136
+
137
+ if masks.ndim == 4:
138
+ masks = masks.squeeze(1)
139
+
140
+ # specify labels
141
+ labels = [
142
+ f"{class_name}" for class_name in class_names
143
+ ]
144
+
145
+ # visualization results
146
+ img = cv2.imread(image_path)
147
+ detections = sv.Detections(
148
+ xyxy=input_boxes,
149
+ mask=masks.astype(bool),
150
+ class_id=class_ids
151
+ )
152
+
153
+ box_annotator = sv.BoxAnnotator()
154
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
155
+
156
+ label_annotator = sv.LabelAnnotator()
157
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
158
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_auto_labelling.jpg"), annotated_frame)
159
+
160
+ mask_annotator = sv.MaskAnnotator()
161
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
162
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_auto_labelling_with_mask.jpg"), annotated_frame)
163
+
164
+ print(f'Successfully saved annotated image to "{output_dir}"')
165
+
166
+
167
+ if __name__ == "__main__":
168
+
169
+ parser = argparse.ArgumentParser("Grounded SAM 2 Florence-2 Demos", add_help=True)
170
+ parser.add_argument("--image_path", type=str, default="./notebooks/images/cars.jpg", required=True, help="path to image file")
171
+ parser.add_argument("--pipeline", type=str, default="caption_to_phrase_grounding", required=True, help="pipeline to use")
172
+ parser.add_argument("--caption_type", type=str, default="caption", required=False, help="granularity of caption")
173
+ args = parser.parse_args()
174
+
175
+ CAPTION_TO_TASK_PROMPT = {
176
+ "caption": "<CAPTION>",
177
+ "detailed_caption": "<DETAILED_CAPTION>",
178
+ "more_detailed_caption": "<MORE_DETAILED_CAPTION>"
179
+ }
180
+
181
+ IMAGE_PATH = args.image_path
182
+ PIPELINE = args.pipeline
183
+ CAPTION_TYPE = args.caption_type
184
+ assert CAPTION_TYPE in ["caption", "detailed_caption", "more_detailed_caption"]
185
+
186
+ print(f"Running pipeline: {PIPELINE} now.")
187
+
188
+ if PIPELINE == "caption_to_phrase_grounding":
189
+ # pipeline-1: caption + phrase grounding + segmentation
190
+ caption_phrase_grounding_and_segmentation(
191
+ florence2_model=florence2_model,
192
+ florence2_processor=florence2_processor,
193
+ sam2_predictor=sam2_predictor,
194
+ caption_task_prompt=CAPTION_TO_TASK_PROMPT[CAPTION_TYPE],
195
+ image_path=IMAGE_PATH
196
+ )
197
+ else:
198
+ raise NotImplementedError(f"Pipeline: {args.pipeline} is not implemented at this time")
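As a quick illustration of the cascade above, the run_florence2 helper can also be called on its own once the models are built. A minimal sketch, assuming florence2_model and florence2_processor are loaded exactly as in this script; the image path is just an example:

from PIL import Image

image = Image.open("./notebooks/images/cars.jpg").convert("RGB")

# step 1: plain captioning, no extra text input
caption = run_florence2("<CAPTION>", None, florence2_model, florence2_processor, image)
print(caption["<CAPTION>"])

# step 2: ground the caption back to phrase-level boxes
grounding = run_florence2("<CAPTION_TO_PHRASE_GROUNDING>", caption["<CAPTION>"],
                          florence2_model, florence2_processor, image)
print(grounding["<CAPTION_TO_PHRASE_GROUNDING>"]["bboxes"])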
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_florence2_image_demo.py ADDED
@@ -0,0 +1,657 @@
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import argparse
5
+ import numpy as np
6
+ import supervision as sv
7
+ from PIL import Image
8
+ from sam2.build_sam import build_sam2
9
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
10
+ from transformers import AutoProcessor, AutoModelForCausalLM
11
+ from utils.supervision_utils import CUSTOM_COLOR_MAP
12
+
13
+ """
14
+ Define Some Hyperparam
15
+ """
16
+
17
+ TASK_PROMPT = {
18
+ "caption": "<CAPTION>",
19
+ "detailed_caption": "<DETAILED_CAPTION>",
20
+ "more_detailed_caption": "<MORE_DETAILED_CAPTION",
21
+ "object_detection": "<OD>",
22
+ "dense_region_caption": "<DENSE_REGION_CAPTION>",
23
+ "region_proposal": "<REGION_PROPOSAL>",
24
+ "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
25
+ "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
26
+ "region_to_segmentation": "<REGION_TO_SEGMENTATION>",
27
+ "open_vocabulary_detection": "<OPEN_VOCABULARY_DETECTION>",
28
+ "region_to_category": "<REGION_TO_CATEGORY>",
29
+ "region_to_description": "<REGION_TO_DESCRIPTION>",
30
+ "ocr": "<OCR>",
31
+ "ocr_with_region": "<OCR_WITH_REGION>",
32
+ }
33
+
34
+ OUTPUT_DIR = "./outputs"
35
+
36
+ if not os.path.exists(OUTPUT_DIR):
37
+ os.makedirs(OUTPUT_DIR, exist_ok=True)
38
+
39
+ """
40
+ Init Florence-2 and SAM 2 Model
41
+ """
42
+
43
+ FLORENCE2_MODEL_ID = "microsoft/Florence-2-large"
44
+ SAM2_CHECKPOINT = "./checkpoints/sam2.1_hiera_large.pt"
45
+ SAM2_CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"
46
+
47
+ # environment settings
48
+ # use bfloat16
49
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
50
+
51
+ if torch.cuda.get_device_properties(0).major >= 8:
52
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
53
+ torch.backends.cuda.matmul.allow_tf32 = True
54
+ torch.backends.cudnn.allow_tf32 = True
55
+
56
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
57
+ torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
58
+
59
+ # build florence-2
60
+ florence2_model = AutoModelForCausalLM.from_pretrained(FLORENCE2_MODEL_ID, trust_remote_code=True, torch_dtype='auto').eval().to(device)
61
+ florence2_processor = AutoProcessor.from_pretrained(FLORENCE2_MODEL_ID, trust_remote_code=True)
62
+
63
+ # build sam 2
64
+ sam2_model = build_sam2(SAM2_CONFIG, SAM2_CHECKPOINT, device=device)
65
+ sam2_predictor = SAM2ImagePredictor(sam2_model)
66
+
67
+ def run_florence2(task_prompt, text_input, model, processor, image):
68
+ assert model is not None, "You should pass the init florence-2 model here"
69
+ assert processor is not None, "You should set florence-2 processor here"
70
+
71
+ device = model.device
72
+
73
+ if text_input is None:
74
+ prompt = task_prompt
75
+ else:
76
+ prompt = task_prompt + text_input
77
+
78
+ inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch.float16)
79
+ generated_ids = model.generate(
80
+ input_ids=inputs["input_ids"].to(device),
81
+ pixel_values=inputs["pixel_values"].to(device),
82
+ max_new_tokens=1024,
83
+ early_stopping=False,
84
+ do_sample=False,
85
+ num_beams=3,
86
+ )
87
+ generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
88
+ parsed_answer = processor.post_process_generation(
89
+ generated_text,
90
+ task=task_prompt,
91
+ image_size=(image.width, image.height)
92
+ )
93
+ return parsed_answer
94
+
95
+
96
+ """
97
+ We support a set of pipelines built with Florence-2 + SAM 2
98
+ """
99
+
100
+ """
101
+ Pipeline-1: Object Detection + Segmentation
102
+ """
103
+ def object_detection_and_segmentation(
104
+ florence2_model,
105
+ florence2_processor,
106
+ sam2_predictor,
107
+ image_path,
108
+ task_prompt="<OD>",
109
+ text_input=None,
110
+ output_dir=OUTPUT_DIR
111
+ ):
112
+ assert text_input is None, "Text input should be None when calling object detection pipeline."
113
+ # run florence-2 object detection in demo
114
+ image = Image.open(image_path).convert("RGB")
115
+ results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)
116
+
117
+ """ Florence-2 Object Detection Output Format
118
+ {'<OD>':
119
+ {
120
+ 'bboxes':
121
+ [
122
+ [33.599998474121094, 159.59999084472656, 596.7999877929688, 371.7599792480469],
123
+ [454.0799865722656, 96.23999786376953, 580.7999877929688, 261.8399963378906],
124
+ [224.95999145507812, 86.15999603271484, 333.7599792480469, 164.39999389648438],
125
+ [449.5999755859375, 276.239990234375, 554.5599975585938, 370.3199768066406],
126
+ [91.19999694824219, 280.0799865722656, 198.0800018310547, 370.3199768066406]
127
+ ],
128
+ 'labels': ['car', 'door', 'door', 'wheel', 'wheel']
129
+ }
130
+ }
131
+ """
132
+ results = results[task_prompt]
133
+ # parse florence-2 detection results
134
+ input_boxes = np.array(results["bboxes"])
135
+ class_names = results["labels"]
136
+ class_ids = np.array(list(range(len(class_names))))
137
+
138
+ # predict mask with SAM 2
139
+ sam2_predictor.set_image(np.array(image))
140
+ masks, scores, logits = sam2_predictor.predict(
141
+ point_coords=None,
142
+ point_labels=None,
143
+ box=input_boxes,
144
+ multimask_output=False,
145
+ )
146
+
147
+ if masks.ndim == 4:
148
+ masks = masks.squeeze(1)
149
+
150
+ # specify labels
151
+ labels = [
152
+ f"{class_name}" for class_name in class_names
153
+ ]
154
+
155
+ # visualization results
156
+ img = cv2.imread(image_path)
157
+ detections = sv.Detections(
158
+ xyxy=input_boxes,
159
+ mask=masks.astype(bool),
160
+ class_id=class_ids
161
+ )
162
+
163
+ box_annotator = sv.BoxAnnotator()
164
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
165
+
166
+ label_annotator = sv.LabelAnnotator()
167
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
168
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_det_annotated_image.jpg"), annotated_frame)
169
+
170
+ mask_annotator = sv.MaskAnnotator()
171
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
172
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_det_image_with_mask.jpg"), annotated_frame)
173
+
174
+ print(f'Successfully saved annotated image to "{output_dir}"')
175
+
176
+ """
177
+ Pipeline 2: Dense Region Caption + Segmentation
178
+ """
179
+ def dense_region_caption_and_segmentation(
180
+ florence2_model,
181
+ florence2_processor,
182
+ sam2_predictor,
183
+ image_path,
184
+ task_prompt="<DENSE_REGION_CAPTION>",
185
+ text_input=None,
186
+ output_dir=OUTPUT_DIR
187
+ ):
188
+ assert text_input is None, "Text input should be None when calling dense region caption pipeline."
189
+ # run florence-2 object detection in demo
190
+ image = Image.open(image_path).convert("RGB")
191
+ results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)
192
+
193
+ """ Florence-2 Object Detection Output Format
194
+ {'<DENSE_REGION_CAPTION>':
195
+ {
196
+ 'bboxes':
197
+ [
198
+ [33.599998474121094, 159.59999084472656, 596.7999877929688, 371.7599792480469],
199
+ [454.0799865722656, 96.23999786376953, 580.7999877929688, 261.8399963378906],
200
+ [224.95999145507812, 86.15999603271484, 333.7599792480469, 164.39999389648438],
201
+ [449.5999755859375, 276.239990234375, 554.5599975585938, 370.3199768066406],
202
+ [91.19999694824219, 280.0799865722656, 198.0800018310547, 370.3199768066406]
203
+ ],
204
+ 'labels': ['turquoise Volkswagen Beetle', 'wooden double doors with metal handles', 'wheel', 'wheel', 'door']
205
+ }
206
+ }
207
+ """
208
+ results = results[task_prompt]
209
+ # parse florence-2 detection results
210
+ input_boxes = np.array(results["bboxes"])
211
+ class_names = results["labels"]
212
+ class_ids = np.array(list(range(len(class_names))))
213
+
214
+ # predict mask with SAM 2
215
+ sam2_predictor.set_image(np.array(image))
216
+ masks, scores, logits = sam2_predictor.predict(
217
+ point_coords=None,
218
+ point_labels=None,
219
+ box=input_boxes,
220
+ multimask_output=False,
221
+ )
222
+
223
+ if masks.ndim == 4:
224
+ masks = masks.squeeze(1)
225
+
226
+ # specify labels
227
+ labels = [
228
+ f"{class_name}" for class_name in class_names
229
+ ]
230
+
231
+ # visualization results
232
+ img = cv2.imread(image_path)
233
+ detections = sv.Detections(
234
+ xyxy=input_boxes,
235
+ mask=masks.astype(bool),
236
+ class_id=class_ids
237
+ )
238
+
239
+ box_annotator = sv.BoxAnnotator()
240
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
241
+
242
+ label_annotator = sv.LabelAnnotator()
243
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
244
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_dense_region_cap_annotated_image.jpg"), annotated_frame)
245
+
246
+ mask_annotator = sv.MaskAnnotator()
247
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
248
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_dense_region_cap_image_with_mask.jpg"), annotated_frame)
249
+
250
+ print(f'Successfully saved annotated image to "{output_dir}"')
251
+
252
+
253
+ """
254
+ Pipeline 3: Region Proposal + Segmentation
255
+ """
256
+ def region_proposal_and_segmentation(
257
+ florence2_model,
258
+ florence2_processor,
259
+ sam2_predictor,
260
+ image_path,
261
+ task_prompt="<REGION_PROPOSAL>",
262
+ text_input=None,
263
+ output_dir=OUTPUT_DIR
264
+ ):
265
+ assert text_input is None, "Text input should be None when calling region proposal pipeline."
266
+ # run florence-2 object detection in demo
267
+ image = Image.open(image_path).convert("RGB")
268
+ results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)
269
+
270
+ """ Florence-2 Object Detection Output Format
271
+ {'<REGION_PROPOSAL>':
272
+ {
273
+ 'bboxes':
274
+ [
275
+ [33.599998474121094, 159.59999084472656, 596.7999877929688, 371.7599792480469],
276
+ [454.0799865722656, 96.23999786376953, 580.7999877929688, 261.8399963378906],
277
+ [224.95999145507812, 86.15999603271484, 333.7599792480469, 164.39999389648438],
278
+ [449.5999755859375, 276.239990234375, 554.5599975585938, 370.3199768066406],
279
+ [91.19999694824219, 280.0799865722656, 198.0800018310547, 370.3199768066406]
280
+ ],
281
+ 'labels': ['', '', '', '', '', '', '']
282
+ }
283
+ }
284
+ """
285
+ results = results[task_prompt]
286
+ # parse florence-2 detection results
287
+ input_boxes = np.array(results["bboxes"])
288
+ class_names = results["labels"]
289
+ class_ids = np.array(list(range(len(class_names))))
290
+
291
+ # predict mask with SAM 2
292
+ sam2_predictor.set_image(np.array(image))
293
+ masks, scores, logits = sam2_predictor.predict(
294
+ point_coords=None,
295
+ point_labels=None,
296
+ box=input_boxes,
297
+ multimask_output=False,
298
+ )
299
+
300
+ if masks.ndim == 4:
301
+ masks = masks.squeeze(1)
302
+
303
+ # specify labels
304
+ labels = [
305
+ f"region_{idx}" for idx, class_name in enumerate(class_names)
306
+ ]
307
+
308
+ # visualization results
309
+ img = cv2.imread(image_path)
310
+ detections = sv.Detections(
311
+ xyxy=input_boxes,
312
+ mask=masks.astype(bool),
313
+ class_id=class_ids
314
+ )
315
+
316
+ box_annotator = sv.BoxAnnotator()
317
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
318
+
319
+ label_annotator = sv.LabelAnnotator()
320
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
321
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_region_proposal.jpg"), annotated_frame)
322
+
323
+ mask_annotator = sv.MaskAnnotator()
324
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
325
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_region_proposal_with_mask.jpg"), annotated_frame)
326
+
327
+ print(f'Successfully saved annotated image to "{output_dir}"')
328
+
329
+
330
+ """
331
+ Pipeline 4: Phrase Grounding + Segmentation
332
+ """
333
+ def phrase_grounding_and_segmentation(
334
+ florence2_model,
335
+ florence2_processor,
336
+ sam2_predictor,
337
+ image_path,
338
+ task_prompt="<CAPTION_TO_PHRASE_GROUNDING>",
339
+ text_input=None,
340
+ output_dir=OUTPUT_DIR
341
+ ):
342
+ # run florence-2 object detection in demo
343
+ image = Image.open(image_path).convert("RGB")
344
+ results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)
345
+
346
+ """ Florence-2 Object Detection Output Format
347
+ {'<CAPTION_TO_PHRASE_GROUNDING>':
348
+ {
349
+ 'bboxes':
350
+ [
351
+ [34.23999786376953, 159.1199951171875, 582.0800170898438, 374.6399841308594],
352
+ [1.5999999046325684, 4.079999923706055, 639.0399780273438, 305.03997802734375]
353
+ ],
354
+ 'labels': ['A green car', 'a yellow building']
355
+ }
356
+ }
357
+ """
358
+ assert text_input is not None, "Text input should not be None when calling phrase grounding pipeline."
359
+ results = results[task_prompt]
360
+ # parse florence-2 detection results
361
+ input_boxes = np.array(results["bboxes"])
362
+ class_names = results["labels"]
363
+ class_ids = np.array(list(range(len(class_names))))
364
+
365
+ # predict mask with SAM 2
366
+ sam2_predictor.set_image(np.array(image))
367
+ masks, scores, logits = sam2_predictor.predict(
368
+ point_coords=None,
369
+ point_labels=None,
370
+ box=input_boxes,
371
+ multimask_output=False,
372
+ )
373
+
374
+ if masks.ndim == 4:
375
+ masks = masks.squeeze(1)
376
+
377
+ # specify labels
378
+ labels = [
379
+ f"{class_name}" for class_name in class_names
380
+ ]
381
+
382
+ # visualization results
383
+ img = cv2.imread(image_path)
384
+ detections = sv.Detections(
385
+ xyxy=input_boxes,
386
+ mask=masks.astype(bool),
387
+ class_id=class_ids
388
+ )
389
+
390
+ box_annotator = sv.BoxAnnotator()
391
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
392
+
393
+ label_annotator = sv.LabelAnnotator()
394
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
395
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_phrase_grounding.jpg"), annotated_frame)
396
+
397
+ mask_annotator = sv.MaskAnnotator()
398
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
399
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_phrase_grounding_with_mask.jpg"), annotated_frame)
400
+
401
+ print(f'Successfully saved annotated image to "{output_dir}"')
402
+
403
+
404
+ """
405
+ Pipeline 5: Referring Expression Segmentation
406
+
407
+ Note that Florence-2 directly supports referring segmentation with a polygon output format, which may not be that accurate;
408
+ therefore we decode a box from the polygon and use SAM 2 for mask prediction
409
+ """
410
+ def referring_expression_segmentation(
411
+ florence2_model,
412
+ florence2_processor,
413
+ sam2_predictor,
414
+ image_path,
415
+ task_prompt="<REFERRING_EXPRESSION_SEGMENTATION>",
416
+ text_input=None,
417
+ output_dir=OUTPUT_DIR
418
+ ):
419
+ # run florence-2 object detection in demo
420
+ image = Image.open(image_path).convert("RGB")
421
+ results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)
422
+
423
+ """ Florence-2 Object Detection Output Format
424
+ {'<REFERRING_EXPRESSION_SEGMENTATION>':
425
+ {
426
+ 'polygons': [[[...]]]
427
+ 'labels': ['']
428
+ }
429
+ }
430
+ """
431
+ assert text_input is not None, "Text input should not be None when calling referring segmentation pipeline."
432
+ results = results[task_prompt]
433
+ # parse florence-2 detection results
434
+ polygon_points = np.array(results["polygons"][0], dtype=np.int32).reshape(-1, 2)
435
+ class_names = [text_input]
436
+ class_ids = np.array(list(range(len(class_names))))
437
+
438
+ # parse polygon format to mask
439
+ img_width, img_height = image.size[0], image.size[1]
440
+ florence2_mask = np.zeros((img_height, img_width), dtype=np.uint8)
441
+ if len(polygon_points) < 3:
442
+ print("Invalid polygon:", polygon_points)
443
+ exit()
444
+ cv2.fillPoly(florence2_mask, [polygon_points], 1)
445
+ if florence2_mask.ndim == 2:
446
+ florence2_mask = florence2_mask[None]
447
+
448
+ # compute bounding box based on polygon points
449
+ x_min = np.min(polygon_points[:, 0])
450
+ y_min = np.min(polygon_points[:, 1])
451
+ x_max = np.max(polygon_points[:, 0])
452
+ y_max = np.max(polygon_points[:, 1])
453
+
454
+ input_boxes = np.array([[x_min, y_min, x_max, y_max]])
455
+
456
+ # predict mask with SAM 2
457
+ sam2_predictor.set_image(np.array(image))
458
+ sam2_masks, scores, logits = sam2_predictor.predict(
459
+ point_coords=None,
460
+ point_labels=None,
461
+ box=input_boxes,
462
+ multimask_output=False,
463
+ )
464
+
465
+ if sam2_masks.ndim == 4:
466
+ sam2_masks = sam2_masks.squeeze(1)
467
+
468
+ # specify labels
469
+ labels = [
470
+ f"{class_name}" for class_name in class_names
471
+ ]
472
+
473
+ # visualization florence2 mask
474
+ img = cv2.imread(image_path)
475
+ detections = sv.Detections(
476
+ xyxy=input_boxes,
477
+ mask=florence2_mask.astype(bool),
478
+ class_id=class_ids
479
+ )
480
+
481
+ box_annotator = sv.BoxAnnotator()
482
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
483
+
484
+ label_annotator = sv.LabelAnnotator()
485
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
486
+ cv2.imwrite(os.path.join(output_dir, "florence2_referring_segmentation_box.jpg"), annotated_frame)
487
+
488
+ mask_annotator = sv.MaskAnnotator()
489
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
490
+ cv2.imwrite(os.path.join(output_dir, "florence2_referring_segmentation_box_with_mask.jpg"), annotated_frame)
491
+
492
+ print(f'Successfully saved florence-2 annotated image to "{output_dir}"')
493
+
494
+ # visualize sam2 mask
495
+ img = cv2.imread(image_path)
496
+ detections = sv.Detections(
497
+ xyxy=input_boxes,
498
+ mask=sam2_masks.astype(bool),
499
+ class_id=class_ids
500
+ )
501
+
502
+ box_annotator = sv.BoxAnnotator()
503
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
504
+
505
+ label_annotator = sv.LabelAnnotator()
506
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
507
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_referring_box.jpg"), annotated_frame)
508
+
509
+ mask_annotator = sv.MaskAnnotator()
510
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
511
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_referring_box_with_sam2_mask.jpg"), annotated_frame)
512
+
513
+ print(f'Successfully saved sam2 annotated image to "{output_dir}"')
514
+
515
+
516
+ """
517
+ Pipeline 6: Open-Vocabulary Detection + Segmentation
518
+ """
519
+ def open_vocabulary_detection_and_segmentation(
520
+ florence2_model,
521
+ florence2_processor,
522
+ sam2_predictor,
523
+ image_path,
524
+ task_prompt="<OPEN_VOCABULARY_DETECTION>",
525
+ text_input=None,
526
+ output_dir=OUTPUT_DIR
527
+ ):
528
+ # run florence-2 object detection in demo
529
+ image = Image.open(image_path).convert("RGB")
530
+ results = run_florence2(task_prompt, text_input, florence2_model, florence2_processor, image)
531
+
532
+ """ Florence-2 Open-Vocabulary Detection Output Format
533
+ {'<OPEN_VOCABULARY_DETECTION>':
534
+ {
535
+ 'bboxes':
536
+ [
537
+ [34.23999786376953, 159.1199951171875, 582.0800170898438, 374.6399841308594]
538
+ ],
539
+ 'bboxes_labels': ['A green car'],
540
+ 'polygons': [],
541
+ 'polygons_labels': []
542
+ }
543
+ }
544
+ """
545
+ assert text_input is not None, "Text input should not be None when calling open-vocabulary detection pipeline."
546
+ results = results[task_prompt]
547
+ # parse florence-2 detection results
548
+ input_boxes = np.array(results["bboxes"])
549
+ print(results)
550
+ class_names = results["bboxes_labels"]
551
+ class_ids = np.array(list(range(len(class_names))))
552
+
553
+ # predict mask with SAM 2
554
+ sam2_predictor.set_image(np.array(image))
555
+ masks, scores, logits = sam2_predictor.predict(
556
+ point_coords=None,
557
+ point_labels=None,
558
+ box=input_boxes,
559
+ multimask_output=False,
560
+ )
561
+
562
+ if masks.ndim == 4:
563
+ masks = masks.squeeze(1)
564
+
565
+ # specify labels
566
+ labels = [
567
+ f"{class_name}" for class_name in class_names
568
+ ]
569
+
570
+ # visualization results
571
+ img = cv2.imread(image_path)
572
+ detections = sv.Detections(
573
+ xyxy=input_boxes,
574
+ mask=masks.astype(bool),
575
+ class_id=class_ids
576
+ )
577
+
578
+ box_annotator = sv.BoxAnnotator()
579
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
580
+
581
+ label_annotator = sv.LabelAnnotator()
582
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
583
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_open_vocabulary_detection.jpg"), annotated_frame)
584
+
585
+ mask_annotator = sv.MaskAnnotator()
586
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
587
+ cv2.imwrite(os.path.join(output_dir, "grounded_sam2_florence2_open_vocabulary_detection_with_mask.jpg"), annotated_frame)
588
+
589
+ print(f'Successfully saved annotated image to "{output_dir}"')
590
+
591
+ if __name__ == "__main__":
592
+
593
+ parser = argparse.ArgumentParser("Grounded SAM 2 Florence-2 Demos", add_help=True)
594
+ parser.add_argument("--image_path", type=str, default="./notebooks/images/cars.jpg", required=True, help="path to image file")
595
+ parser.add_argument("--pipeline", type=str, default="object_detection_segmentation", required=True, help="path to image file")
596
+ parser.add_argument("--text_input", type=str, default=None, required=False, help="path to image file")
597
+ args = parser.parse_args()
598
+
599
+ IMAGE_PATH = args.image_path
600
+ PIPELINE = args.pipeline
601
+ INPUT_TEXT = args.text_input
602
+
603
+ print(f"Running pipeline: {PIPELINE} now.")
604
+
605
+ if PIPELINE == "object_detection_segmentation":
606
+ # pipeline-1: detection + segmentation
607
+ object_detection_and_segmentation(
608
+ florence2_model=florence2_model,
609
+ florence2_processor=florence2_processor,
610
+ sam2_predictor=sam2_predictor,
611
+ image_path=IMAGE_PATH
612
+ )
613
+ elif PIPELINE == "dense_region_caption_segmentation":
614
+ # pipeline-2: dense region caption + segmentation
615
+ dense_region_caption_and_segmentation(
616
+ florence2_model=florence2_model,
617
+ florence2_processor=florence2_processor,
618
+ sam2_predictor=sam2_predictor,
619
+ image_path=IMAGE_PATH
620
+ )
621
+ elif PIPELINE == "region_proposal_segmentation":
622
+ # pipeline-3: region proposal + segmentation
623
+ region_proposal_and_segmentation(
624
+ florence2_model=florence2_model,
625
+ florence2_processor=florence2_processor,
626
+ sam2_predictor=sam2_predictor,
627
+ image_path=IMAGE_PATH
628
+ )
629
+ elif PIPELINE == "phrase_grounding_segmentation":
630
+ # pipeline-4: phrase grounding + segmentation
631
+ phrase_grounding_and_segmentation(
632
+ florence2_model=florence2_model,
633
+ florence2_processor=florence2_processor,
634
+ sam2_predictor=sam2_predictor,
635
+ image_path=IMAGE_PATH,
636
+ text_input=INPUT_TEXT
637
+ )
638
+ elif PIPELINE == "referring_expression_segmentation":
639
+ # pipeline-5: referring expression segmentation
640
+ referring_expression_segmentation(
641
+ florence2_model=florence2_model,
642
+ florence2_processor=florence2_processor,
643
+ sam2_predictor=sam2_predictor,
644
+ image_path=IMAGE_PATH,
645
+ text_input=INPUT_TEXT
646
+ )
647
+ elif PIPELINE == "open_vocabulary_detection_segmentation":
648
+ # pipeline-6: open-vocabulary detection + segmentation
649
+ open_vocabulary_detection_and_segmentation(
650
+ florence2_model=florence2_model,
651
+ florence2_processor=florence2_processor,
652
+ sam2_predictor=sam2_predictor,
653
+ image_path=IMAGE_PATH,
654
+ text_input=INPUT_TEXT
655
+ )
656
+ else:
657
+ raise NotImplementedError(f"Pipeline: {args.pipeline} is not implemented at this time")
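One detail worth calling out from the referring-expression pipeline above: the Florence-2 polygon is reduced to an axis-aligned box before it is used as a SAM 2 prompt. A tiny self-contained sketch of that reduction, with made-up polygon values:

import numpy as np

# hypothetical polygon vertices (x, y) as Florence-2 might return them
polygon = np.array([[120, 80], [240, 75], [260, 190], [130, 200]], dtype=np.int32)

x_min, y_min = polygon[:, 0].min(), polygon[:, 1].min()
x_max, y_max = polygon[:, 0].max(), polygon[:, 1].max()
box_prompt = np.array([[x_min, y_min, x_max, y_max]])  # xyxy box prompt for SAM 2
print(box_prompt)  # [[120  75 260 200]]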
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_gd1.5_demo.py ADDED
@@ -0,0 +1,249 @@
1
+ # dds cloudapi for Grounding DINO 1.5
2
+ from dds_cloudapi_sdk import Config
3
+ from dds_cloudapi_sdk import Client
4
+ from dds_cloudapi_sdk import DetectionTask
5
+ from dds_cloudapi_sdk import TextPrompt
6
+ from dds_cloudapi_sdk import DetectionModel
7
+ from dds_cloudapi_sdk import DetectionTarget
8
+
9
+ import os
10
+ import cv2
11
+ import json
12
+ import torch
13
+ import tempfile
14
+ import numpy as np
15
+ import supervision as sv
16
+ import pycocotools.mask as mask_util
17
+ from pathlib import Path
18
+ from PIL import Image
19
+ from sam2.build_sam import build_sam2
20
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
21
+
22
+ """
23
+ Hyper parameters
24
+ """
25
+ API_TOKEN = "Your API token"
26
+ TEXT_PROMPT = "car . building ."
27
+ IMG_PATH = "notebooks/images/cars.jpg"
28
+ SAM2_CHECKPOINT = "./checkpoints/sam2.1_hiera_large.pt"
29
+ SAM2_MODEL_CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"
30
+ GROUNDING_MODEL = DetectionModel.GDino1_5_Pro # DetectionModel.GDino1_6_Pro
31
+ BOX_THRESHOLD = 0.2
32
+ WITH_SLICE_INFERENCE = False
33
+ SLICE_WH = (480, 480)
34
+ OVERLAP_RATIO = (0.2, 0.2)
35
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
36
+ OUTPUT_DIR = Path("outputs/grounded_sam2_gd1.5_demo")
37
+ DUMP_JSON_RESULTS = True
38
+
39
+ # create output directory
40
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
41
+
42
+ """
43
+ Prompt Grounding DINO 1.5 with Text for Box Prompt Generation with Cloud API
44
+ """
45
+ # Step 1: initialize the config
46
+ token = API_TOKEN
47
+ config = Config(token)
48
+
49
+ # Step 2: initialize the client
50
+ client = Client(config)
51
+
52
+ # Step 3: run the task by DetectionTask class
53
+ # image_url = "https://algosplt.oss-cn-shenzhen.aliyuncs.com/test_files/tasks/detection/iron_man.jpg"
54
+ # if you are processing local image file, upload them to DDS server to get the image url
55
+
56
+ classes = [x.strip().lower() for x in TEXT_PROMPT.split('.') if x]
57
+ class_name_to_id = {name: id for id, name in enumerate(classes)}
58
+ class_id_to_name = {id: name for name, id in class_name_to_id.items()}
59
+
60
+ if WITH_SLICE_INFERENCE:
61
+ def callback(image_slice: np.ndarray) -> sv.Detections:
62
+ print("Inference on image slice")
63
+ # save the image slice as a temporary file for the GD-1.5 API
64
+ with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmpfile:
65
+ temp_filename = tmpfile.name
66
+ cv2.imwrite(temp_filename, image_slice)
67
+ image_url = client.upload_file(temp_filename)
68
+ task = DetectionTask(
69
+ image_url=image_url,
70
+ prompts=[TextPrompt(text=TEXT_PROMPT)],
71
+ targets=[DetectionTarget.BBox], # detect bbox
72
+ model=GROUNDING_MODEL, # detect with GroundingDino-1.5-Pro model
73
+ bbox_threshold=BOX_THRESHOLD, # box confidence threshold
74
+ )
75
+ client.run_task(task)
76
+ result = task.result
77
+ # delete the temporary file
78
+ os.remove(temp_filename)
79
+
80
+ input_boxes = []
81
+ confidences = []
82
+ class_ids = []
83
+ objects = result.objects
84
+ for idx, obj in enumerate(objects):
85
+ input_boxes.append(obj.bbox)
86
+ confidences.append(obj.score)
87
+ cls_name = obj.category.lower().strip()
88
+ class_ids.append(class_name_to_id[cls_name])
89
+ # ensure input_boxes with shape (_, 4)
90
+ input_boxes = np.array(input_boxes).reshape(-1, 4)
91
+ class_ids = np.array(class_ids)
92
+ confidences = np.array(confidences)
93
+ return sv.Detections(xyxy=input_boxes, confidence=confidences, class_id=class_ids)
94
+
95
+ slicer = sv.InferenceSlicer(
96
+ callback=callback,
97
+ slice_wh=SLICE_WH,
98
+ overlap_ratio_wh=OVERLAP_RATIO,
99
+ iou_threshold=0.5,
100
+ overlap_filter_strategy=sv.OverlapFilter.NON_MAX_SUPPRESSION
101
+ )
102
+ detections = slicer(cv2.imread(IMG_PATH))
103
+ class_names = [class_id_to_name[id] for id in detections.class_id]
104
+ confidences = detections.confidence
105
+ class_ids = detections.class_id
106
+ input_boxes = detections.xyxy
107
+ else:
108
+ image_url = client.upload_file(IMG_PATH)
109
+
110
+ task = DetectionTask(
111
+ image_url=image_url,
112
+ prompts=[TextPrompt(text=TEXT_PROMPT)],
113
+ targets=[DetectionTarget.BBox], # detect bbox
114
+ model=GROUNDING_MODEL, # detect with GroundingDINO-1.5-Pro model
115
+ bbox_threshold=BOX_THRESHOLD, # box confidence threshold
116
+ )
117
+
118
+ client.run_task(task)
119
+ result = task.result
120
+
121
+ objects = result.objects # the list of detected objects
122
+
123
+
124
+ input_boxes = []
125
+ confidences = []
126
+ class_names = []
127
+ class_ids = []
128
+
129
+ for idx, obj in enumerate(objects):
130
+ input_boxes.append(obj.bbox)
131
+ confidences.append(obj.score)
132
+ cls_name = obj.category.lower().strip()
133
+ class_names.append(cls_name)
134
+ class_ids.append(class_name_to_id[cls_name])
135
+
136
+ input_boxes = np.array(input_boxes)
137
+ class_ids = np.array(class_ids)
138
+
139
+ """
140
+ Init SAM 2 Model and Predict Mask with Box Prompt
141
+ """
142
+
143
+ # environment settings
144
+ # use bfloat16
145
+ torch.autocast(device_type=DEVICE, dtype=torch.bfloat16).__enter__()
146
+
147
+ if torch.cuda.get_device_properties(0).major >= 8:
148
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
149
+ torch.backends.cuda.matmul.allow_tf32 = True
150
+ torch.backends.cudnn.allow_tf32 = True
151
+
152
+ # build SAM2 image predictor
153
+ sam2_checkpoint = SAM2_CHECKPOINT
154
+ model_cfg = SAM2_MODEL_CONFIG
155
+ sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=DEVICE)
156
+ sam2_predictor = SAM2ImagePredictor(sam2_model)
157
+
158
+ image = Image.open(IMG_PATH)
159
+
160
+ sam2_predictor.set_image(np.array(image.convert("RGB")))
161
+
162
+ masks, scores, logits = sam2_predictor.predict(
163
+ point_coords=None,
164
+ point_labels=None,
165
+ box=input_boxes,
166
+ multimask_output=False,
167
+ )
168
+
169
+
170
+ """
171
+ Post-process the output of the model to get the masks, scores, and logits for visualization
172
+ """
173
+ # convert the shape to (n, H, W)
174
+ if masks.ndim == 4:
175
+ masks = masks.squeeze(1)
176
+
177
+
178
+ """
179
+ Visualize the Prediction Results
180
+ """
181
+
182
+ labels = [
183
+ f"{class_name} {confidence:.2f}"
184
+ for class_name, confidence
185
+ in zip(class_names, confidences)
186
+ ]
187
+
188
+ """
189
+ Visualize the image with the supervision API
190
+ """
191
+ img = cv2.imread(IMG_PATH)
192
+ detections = sv.Detections(
193
+ xyxy=input_boxes, # (n, 4)
194
+ mask=masks.astype(bool), # (n, h, w)
195
+ class_id=class_ids
196
+ )
197
+
198
+ box_annotator = sv.BoxAnnotator()
199
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
200
+
201
+ label_annotator = sv.LabelAnnotator()
202
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
203
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "groundingdino_annotated_image.jpg"), annotated_frame)
204
+
205
+ mask_annotator = sv.MaskAnnotator()
206
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
207
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "grounded_sam2_annotated_image_with_mask.jpg"), annotated_frame)
208
+
209
+ print(f'Annotated image has been saved to "{OUTPUT_DIR}"')
210
+
211
+ """
212
+ Dump the results in standard format and save as json files
213
+ """
214
+
215
+ def single_mask_to_rle(mask):
216
+ rle = mask_util.encode(np.array(mask[:, :, None], order="F", dtype="uint8"))[0]
217
+ rle["counts"] = rle["counts"].decode("utf-8")
218
+ return rle
219
+
220
+ if DUMP_JSON_RESULTS:
221
+ print("Start dumping the annotation...")
222
+ # convert mask into rle format
223
+ mask_rles = [single_mask_to_rle(mask) for mask in masks]
224
+
225
+ input_boxes = input_boxes.tolist()
226
+ scores = scores.tolist()
227
+ # FIXME: class_names should be a list of strings without spaces
228
+ class_names = [class_name.strip() for class_name in class_names]
229
+ # save the results in standard format
230
+ results = {
231
+ "image_path": IMG_PATH,
232
+ "annotations" : [
233
+ {
234
+ "class_name": class_name,
235
+ "bbox": box,
236
+ "segmentation": mask_rle,
237
+ "score": score,
238
+ }
239
+ for class_name, box, mask_rle, score in zip(class_names, input_boxes, mask_rles, scores)
240
+ ],
241
+ "box_format": "xyxy",
242
+ "img_width": image.width,
243
+ "img_height": image.height,
244
+ }
245
+
246
+ with open(os.path.join(OUTPUT_DIR, "grounded_sam2_gd1.5_image_demo_results.json"), "w") as f:
247
+ json.dump(results, f, indent=4)
248
+
249
+ print(f'Annotation has already been saved to "{OUTPUT_DIR}"')
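Since BOX_THRESHOLD above is fairly low (0.2), it can be useful to filter the API detections again before prompting SAM 2. A hedged sketch that reuses the arrays built in this script; the 0.4 cutoff is arbitrary:

import numpy as np

# assumes input_boxes, confidences, class_ids, class_names as built above
confidences = np.asarray(confidences)
keep = confidences >= 0.4                     # tune per image / prompt
input_boxes = input_boxes[keep]
class_ids = class_ids[keep]
confidences = confidences[keep]
class_names = [name for name, k in zip(class_names, keep) if k]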
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_hf_model_demo.py ADDED
@@ -0,0 +1,187 @@
1
+ import argparse
2
+ import os
3
+ import cv2
4
+ import json
5
+ import torch
6
+ import numpy as np
7
+ import supervision as sv
8
+ import pycocotools.mask as mask_util
9
+ from pathlib import Path
10
+ from supervision.draw.color import ColorPalette
11
+ from utils.supervision_utils import CUSTOM_COLOR_MAP
12
+ from PIL import Image
13
+ from sam2.build_sam import build_sam2
14
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
15
+ from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
16
+
17
+ """
18
+ Hyper parameters
19
+ """
20
+ parser = argparse.ArgumentParser()
21
+ parser.add_argument('--grounding-model', default="IDEA-Research/grounding-dino-tiny")
22
+ parser.add_argument("--text-prompt", default="car. tire.")
23
+ parser.add_argument("--img-path", default="notebooks/images/truck.jpg")
24
+ parser.add_argument("--sam2-checkpoint", default="./checkpoints/sam2.1_hiera_large.pt")
25
+ parser.add_argument("--sam2-model-config", default="configs/sam2.1/sam2.1_hiera_l.yaml")
26
+ parser.add_argument("--output-dir", default="outputs/test_sam2.1")
27
+ parser.add_argument("--no-dump-json", action="store_true")
28
+ parser.add_argument("--force-cpu", action="store_true")
29
+ args = parser.parse_args()
30
+
31
+ GROUNDING_MODEL = args.grounding_model
32
+ TEXT_PROMPT = args.text_prompt
33
+ IMG_PATH = args.img_path
34
+ SAM2_CHECKPOINT = args.sam2_checkpoint
35
+ SAM2_MODEL_CONFIG = args.sam2_model_config
36
+ DEVICE = "cuda" if torch.cuda.is_available() and not args.force_cpu else "cpu"
37
+ OUTPUT_DIR = Path(args.output_dir)
38
+ DUMP_JSON_RESULTS = not args.no_dump_json
39
+
40
+ # create output directory
41
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
42
+
43
+ # environment settings
44
+ # use bfloat16
45
+ torch.autocast(device_type=DEVICE, dtype=torch.bfloat16).__enter__()
46
+
47
+ if torch.cuda.get_device_properties(0).major >= 8:
48
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
49
+ torch.backends.cuda.matmul.allow_tf32 = True
50
+ torch.backends.cudnn.allow_tf32 = True
51
+
52
+ # build SAM2 image predictor
53
+ sam2_checkpoint = SAM2_CHECKPOINT
54
+ model_cfg = SAM2_MODEL_CONFIG
55
+ sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=DEVICE)
56
+ sam2_predictor = SAM2ImagePredictor(sam2_model)
57
+
58
+ # build grounding dino from huggingface
59
+ model_id = GROUNDING_MODEL
60
+ processor = AutoProcessor.from_pretrained(model_id)
61
+ grounding_model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(DEVICE)
62
+
63
+
64
+ # setup the input image and text prompt for SAM 2 and Grounding DINO
65
+ # VERY important: text queries need to be lowercased + end with a dot
66
+ text = TEXT_PROMPT
67
+ img_path = IMG_PATH
68
+
69
+ image = Image.open(img_path)
70
+
71
+ sam2_predictor.set_image(np.array(image.convert("RGB")))
72
+
73
+ inputs = processor(images=image, text=text, return_tensors="pt").to(DEVICE)
74
+ with torch.no_grad():
75
+ outputs = grounding_model(**inputs)
76
+
77
+ results = processor.post_process_grounded_object_detection(
78
+ outputs,
79
+ inputs.input_ids,
80
+ box_threshold=0.4,
81
+ text_threshold=0.3,
82
+ target_sizes=[image.size[::-1]]
83
+ )
84
+
85
+ """
86
+ Results is a list of dict with the following structure:
87
+ [
88
+ {
89
+ 'scores': tensor([0.7969, 0.6469, 0.6002, 0.4220], device='cuda:0'),
90
+ 'labels': ['car', 'tire', 'tire', 'tire'],
91
+ 'boxes': tensor([[ 89.3244, 278.6940, 1710.3505, 851.5143],
92
+ [1392.4701, 554.4064, 1628.6133, 777.5872],
93
+ [ 436.1182, 621.8940, 676.5255, 851.6897],
94
+ [1236.0990, 688.3547, 1400.2427, 753.1256]], device='cuda:0')
95
+ }
96
+ ]
97
+ """
98
+
99
+ # get the box prompt for SAM 2
100
+ input_boxes = results[0]["boxes"].cpu().numpy()
101
+
102
+ masks, scores, logits = sam2_predictor.predict(
103
+ point_coords=None,
104
+ point_labels=None,
105
+ box=input_boxes,
106
+ multimask_output=False,
107
+ )
108
+
109
+
110
+ """
111
+ Post-process the output of the model to get the masks, scores, and logits for visualization
112
+ """
113
+ # convert the shape to (n, H, W)
114
+ if masks.ndim == 4:
115
+ masks = masks.squeeze(1)
116
+
117
+
118
+ confidences = results[0]["scores"].cpu().numpy().tolist()
119
+ class_names = results[0]["labels"]
120
+ class_ids = np.array(list(range(len(class_names))))
121
+
122
+ labels = [
123
+ f"{class_name} {confidence:.2f}"
124
+ for class_name, confidence
125
+ in zip(class_names, confidences)
126
+ ]
127
+
128
+ """
129
+ Visualize the image with the supervision API
130
+ """
131
+ img = cv2.imread(img_path)
132
+ detections = sv.Detections(
133
+ xyxy=input_boxes, # (n, 4)
134
+ mask=masks.astype(bool), # (n, h, w)
135
+ class_id=class_ids
136
+ )
137
+
138
+ """
139
+ Note that if you want to use default color map,
140
+ you can set color=ColorPalette.DEFAULT
141
+ """
142
+ box_annotator = sv.BoxAnnotator(color=ColorPalette.from_hex(CUSTOM_COLOR_MAP))
143
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
144
+
145
+ label_annotator = sv.LabelAnnotator(color=ColorPalette.from_hex(CUSTOM_COLOR_MAP))
146
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
147
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "groundingdino_annotated_image.jpg"), annotated_frame)
148
+
149
+ mask_annotator = sv.MaskAnnotator(color=ColorPalette.from_hex(CUSTOM_COLOR_MAP))
150
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
151
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "grounded_sam2_annotated_image_with_mask.jpg"), annotated_frame)
152
+
153
+
154
+ """
155
+ Dump the results in standard format and save as json files
156
+ """
157
+
158
+ def single_mask_to_rle(mask):
159
+ rle = mask_util.encode(np.array(mask[:, :, None], order="F", dtype="uint8"))[0]
160
+ rle["counts"] = rle["counts"].decode("utf-8")
161
+ return rle
162
+
163
+ if DUMP_JSON_RESULTS:
164
+ # convert mask into rle format
165
+ mask_rles = [single_mask_to_rle(mask) for mask in masks]
166
+
167
+ input_boxes = input_boxes.tolist()
168
+ scores = scores.tolist()
169
+ # save the results in standard format
170
+ results = {
171
+ "image_path": img_path,
172
+ "annotations" : [
173
+ {
174
+ "class_name": class_name,
175
+ "bbox": box,
176
+ "segmentation": mask_rle,
177
+ "score": score,
178
+ }
179
+ for class_name, box, mask_rle, score in zip(class_names, input_boxes, mask_rles, scores)
180
+ ],
181
+ "box_format": "xyxy",
182
+ "img_width": image.width,
183
+ "img_height": image.height,
184
+ }
185
+
186
+ with open(os.path.join(OUTPUT_DIR, "grounded_sam2_hf_model_demo_results.json"), "w") as f:
187
+ json.dump(results, f, indent=4)
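As the comment in this script notes, the HF Grounding DINO path expects lowercased text queries that end with a dot. A small helper along these lines (purely illustrative, not part of the repo) can build such a prompt from a class list:

def build_text_prompt(class_names):
    """Join class names into the 'car. tire.' style prompt Grounding DINO expects."""
    return " ".join(f"{name.strip().lower()}." for name in class_names)

print(build_text_prompt(["Car", "Tire"]))  # -> "car. tire."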
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_local_demo.py ADDED
@@ -0,0 +1,160 @@
1
+ import os
2
+ import cv2
3
+ import json
4
+ import torch
5
+ import numpy as np
6
+ import supervision as sv
7
+ import pycocotools.mask as mask_util
8
+ from pathlib import Path
9
+ from torchvision.ops import box_convert
10
+ from sam2.build_sam import build_sam2
11
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
12
+ from grounding_dino.groundingdino.util.inference import load_model, load_image, predict
13
+
14
+ """
15
+ Hyper parameters
16
+ """
17
+ TEXT_PROMPT = "car. tire."
18
+ IMG_PATH = "notebooks/images/truck.jpg"
19
+ SAM2_CHECKPOINT = "./checkpoints/sam2.1_hiera_large.pt"
20
+ SAM2_MODEL_CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"
21
+ GROUNDING_DINO_CONFIG = "grounding_dino/groundingdino/config/GroundingDINO_SwinT_OGC.py"
22
+ GROUNDING_DINO_CHECKPOINT = "gdino_checkpoints/groundingdino_swint_ogc.pth"
23
+ BOX_THRESHOLD = 0.35
24
+ TEXT_THRESHOLD = 0.25
25
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
26
+ OUTPUT_DIR = Path("outputs/grounded_sam2_local_demo")
27
+ DUMP_JSON_RESULTS = True
28
+
29
+ # create output directory
30
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
31
+
32
+ # environment settings
33
+ # use bfloat16
34
+
35
+ # build SAM2 image predictor
36
+ sam2_checkpoint = SAM2_CHECKPOINT
37
+ model_cfg = SAM2_MODEL_CONFIG
38
+ sam2_model = build_sam2(model_cfg, sam2_checkpoint, device=DEVICE)
39
+ sam2_predictor = SAM2ImagePredictor(sam2_model)
40
+
41
+ # build grounding dino model
42
+ grounding_model = load_model(
43
+ model_config_path=GROUNDING_DINO_CONFIG,
44
+ model_checkpoint_path=GROUNDING_DINO_CHECKPOINT,
45
+ device=DEVICE
46
+ )
47
+
48
+
49
+ # setup the input image and text prompt for SAM 2 and Grounding DINO
50
+ # VERY important: text queries need to be lowercased + end with a dot
51
+ text = TEXT_PROMPT
52
+ img_path = IMG_PATH
53
+
54
+ image_source, image = load_image(img_path)
55
+
56
+ sam2_predictor.set_image(image_source)
57
+
58
+ boxes, confidences, labels = predict(
59
+ model=grounding_model,
60
+ image=image,
61
+ caption=text,
62
+ box_threshold=BOX_THRESHOLD,
63
+ text_threshold=TEXT_THRESHOLD,
64
+ )
65
+
66
+ # process the box prompt for SAM 2
67
+ h, w, _ = image_source.shape
68
+ boxes = boxes * torch.Tensor([w, h, w, h])
69
+ input_boxes = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
70
+
71
+
72
+ # FIXME: figure how does this influence the G-DINO model
73
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
74
+
75
+ if torch.cuda.get_device_properties(0).major >= 8:
76
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
77
+ torch.backends.cuda.matmul.allow_tf32 = True
78
+ torch.backends.cudnn.allow_tf32 = True
79
+
80
+ masks, scores, logits = sam2_predictor.predict(
81
+ point_coords=None,
82
+ point_labels=None,
83
+ box=input_boxes,
84
+ multimask_output=False,
85
+ )
86
+
87
+ """
88
+ Post-process the output of the model to get the masks, scores, and logits for visualization
89
+ """
90
+ # convert the shape to (n, H, W)
91
+ if masks.ndim == 4:
92
+ masks = masks.squeeze(1)
93
+
94
+
95
+ confidences = confidences.numpy().tolist()
96
+ class_names = labels
97
+
98
+ class_ids = np.array(list(range(len(class_names))))
99
+
100
+ labels = [
101
+ f"{class_name} {confidence:.2f}"
102
+ for class_name, confidence
103
+ in zip(class_names, confidences)
104
+ ]
105
+
106
+ """
107
+ Visualize image with supervision useful API
108
+ """
109
+ img = cv2.imread(img_path)
110
+ detections = sv.Detections(
111
+ xyxy=input_boxes, # (n, 4)
112
+ mask=masks.astype(bool), # (n, h, w)
113
+ class_id=class_ids
114
+ )
115
+
116
+ box_annotator = sv.BoxAnnotator()
117
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
118
+
119
+ label_annotator = sv.LabelAnnotator()
120
+ annotated_frame = label_annotator.annotate(scene=annotated_frame, detections=detections, labels=labels)
121
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "groundingdino_annotated_image.jpg"), annotated_frame)
122
+
123
+ mask_annotator = sv.MaskAnnotator()
124
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
125
+ cv2.imwrite(os.path.join(OUTPUT_DIR, "grounded_sam2_annotated_image_with_mask.jpg"), annotated_frame)
126
+
127
+ """
128
+ Dump the results in standard format and save as json files
129
+ """
130
+
131
+ def single_mask_to_rle(mask):
132
+ rle = mask_util.encode(np.array(mask[:, :, None], order="F", dtype="uint8"))[0]
133
+ rle["counts"] = rle["counts"].decode("utf-8")
134
+ return rle
135
+
136
+ if DUMP_JSON_RESULTS:
137
+ # convert mask into rle format
138
+ mask_rles = [single_mask_to_rle(mask) for mask in masks]
139
+
140
+ input_boxes = input_boxes.tolist()
141
+ scores = scores.tolist()
142
+ # save the results in standard format
143
+ results = {
144
+ "image_path": img_path,
145
+ "annotations" : [
146
+ {
147
+ "class_name": class_name,
148
+ "bbox": box,
149
+ "segmentation": mask_rle,
150
+ "score": score,
151
+ }
152
+ for class_name, box, mask_rle, score in zip(class_names, input_boxes, mask_rles, scores)
153
+ ],
154
+ "box_format": "xyxy",
155
+ "img_width": w,
156
+ "img_height": h,
157
+ }
158
+
159
+ with open(os.path.join(OUTPUT_DIR, "grounded_sam2_local_image_demo_results.json"), "w") as f:
160
+ json.dump(results, f, indent=4)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo.py ADDED
@@ -0,0 +1,198 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import numpy as np
5
+ import supervision as sv
6
+ from PIL import Image
7
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
8
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
9
+ from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
10
+ from utils.track_utils import sample_points_from_masks
11
+ from utils.video_utils import create_video_from_images
12
+
13
+
14
+ """
15
+ Step 1: Environment settings and model initialization
16
+ """
17
+ # use bfloat16 for the entire notebook
18
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
19
+
20
+ if torch.cuda.get_device_properties(0).major >= 8:
21
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
22
+ torch.backends.cuda.matmul.allow_tf32 = True
23
+ torch.backends.cudnn.allow_tf32 = True
24
+
25
+ # init sam image predictor and video predictor model
26
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
27
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
28
+
29
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
30
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint)
31
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
32
+
33
+
34
+ # init grounding dino model from huggingface
35
+ model_id = "IDEA-Research/grounding-dino-tiny"
36
+ device = "cuda" if torch.cuda.is_available() else "cpu"
37
+ processor = AutoProcessor.from_pretrained(model_id)
38
+ grounding_model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
39
+
40
+
41
+ # setup the input image and text prompt for SAM 2 and Grounding DINO
42
+ # VERY important: text queries need to be lowercased + end with a dot
43
+ text = "car."
44
+
45
+ # `video_dir` a directory of JPEG frames with filenames like `<frame_index>.jpg`
46
+
47
+ video_dir = "notebooks/videos/car"
48
+
49
+ # scan all the JPEG frame names in this directory
50
+ frame_names = [
51
+ p for p in os.listdir(video_dir)
52
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
53
+ ]
54
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
55
+
56
+ # init video predictor state
57
+ inference_state = video_predictor.init_state(video_path=video_dir)
58
+
59
+ ann_frame_idx = 0 # the frame index we interact with
60
+ ann_obj_id = 1 # give a unique id to each object we interact with (it can be any integers)
61
+
62
+
63
+ """
64
+ Step 2: Prompt Grounding DINO and SAM image predictor to get the box and mask for specific frame
65
+ """
66
+
67
+ # prompt grounding dino to get the box coordinates on specific frame
68
+ img_path = os.path.join(video_dir, frame_names[ann_frame_idx])
69
+ image = Image.open(img_path)
70
+
71
+ # run Grounding DINO on the image
72
+ inputs = processor(images=image, text=text, return_tensors="pt").to(device)
73
+ with torch.no_grad():
74
+ outputs = grounding_model(**inputs)
75
+
76
+ results = processor.post_process_grounded_object_detection(
77
+ outputs,
78
+ inputs.input_ids,
79
+ box_threshold=0.25,
80
+ text_threshold=0.3,
81
+ target_sizes=[image.size[::-1]]
82
+ )
83
+
84
+ # prompt SAM image predictor to get the mask for the object
85
+ image_predictor.set_image(np.array(image.convert("RGB")))
86
+
87
+ # process the detection results
88
+ input_boxes = results[0]["boxes"].cpu().numpy()
89
+ OBJECTS = results[0]["labels"]
90
+
91
+ # prompt SAM 2 image predictor to get the mask for the object
92
+ masks, scores, logits = image_predictor.predict(
93
+ point_coords=None,
94
+ point_labels=None,
95
+ box=input_boxes,
96
+ multimask_output=False,
97
+ )
98
+
99
+ # convert the mask shape to (n, H, W)
100
+ if masks.ndim == 3:
101
+ masks = masks[None]
102
+ scores = scores[None]
103
+ logits = logits[None]
104
+ elif masks.ndim == 4:
105
+ masks = masks.squeeze(1)
106
+
107
+ """
108
+ Step 3: Register each object's positive points to video predictor with seperate add_new_points call
109
+ """
110
+
111
+ PROMPT_TYPE_FOR_VIDEO = "box" # or "point"
112
+
113
+ assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only support point/box/mask prompt"
114
+
115
+ # If you are using point prompts, we uniformly sample positive points based on the mask
116
+ if PROMPT_TYPE_FOR_VIDEO == "point":
117
+ # sample the positive points from mask for each objects
118
+ all_sample_points = sample_points_from_masks(masks=masks, num_points=10)
119
+
120
+ for object_id, (label, points) in enumerate(zip(OBJECTS, all_sample_points), start=1):
121
+ labels = np.ones((points.shape[0]), dtype=np.int32)
122
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
123
+ inference_state=inference_state,
124
+ frame_idx=ann_frame_idx,
125
+ obj_id=object_id,
126
+ points=points,
127
+ labels=labels,
128
+ )
129
+ # Using box prompt
130
+ elif PROMPT_TYPE_FOR_VIDEO == "box":
131
+ for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
132
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
133
+ inference_state=inference_state,
134
+ frame_idx=ann_frame_idx,
135
+ obj_id=object_id,
136
+ box=box,
137
+ )
138
+ # Using mask prompt is a more straightforward way
139
+ elif PROMPT_TYPE_FOR_VIDEO == "mask":
140
+ for object_id, (label, mask) in enumerate(zip(OBJECTS, masks), start=1):
141
+ labels = np.ones((1), dtype=np.int32)
142
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
143
+ inference_state=inference_state,
144
+ frame_idx=ann_frame_idx,
145
+ obj_id=object_id,
146
+ mask=mask
147
+ )
148
+ else:
149
+ raise NotImplementedError("SAM 2 video predictor only support point/box/mask prompts")
150
+
151
+
152
+ """
153
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
154
+ """
155
+ video_segments = {} # video_segments contains the per-frame segmentation results
156
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
157
+ video_segments[out_frame_idx] = {
158
+ out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
159
+ for i, out_obj_id in enumerate(out_obj_ids)
160
+ }
161
+
162
+ """
163
+ Step 5: Visualize the segment results across the video and save them
164
+ """
165
+
166
+ save_dir = "./tracking_results"
167
+
168
+ if not os.path.exists(save_dir):
169
+ os.makedirs(save_dir)
170
+
171
+ ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
172
+ for frame_idx, segments in video_segments.items():
173
+ img = cv2.imread(os.path.join(video_dir, frame_names[frame_idx]))
174
+
175
+ object_ids = list(segments.keys())
176
+ masks = list(segments.values())
177
+ masks = np.concatenate(masks, axis=0)
178
+
179
+ detections = sv.Detections(
180
+ xyxy=sv.mask_to_xyxy(masks), # (n, 4)
181
+ mask=masks, # (n, h, w)
182
+ class_id=np.array(object_ids, dtype=np.int32),
183
+ )
184
+ box_annotator = sv.BoxAnnotator()
185
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
186
+ label_annotator = sv.LabelAnnotator()
187
+ annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
188
+ mask_annotator = sv.MaskAnnotator()
189
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
190
+ cv2.imwrite(os.path.join(save_dir, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
191
+
192
+
193
+ """
194
+ Step 6: Convert the annotated frames to video
195
+ """
196
+
197
+ output_video_path = "./children_tracking_demo_video.mp4"
198
+ create_video_from_images(save_dir, output_video_path)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_dinox.py ADDED
@@ -0,0 +1,237 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # dds cloudapi for Grounding DINO 1.5
2
+ from dds_cloudapi_sdk import Config
3
+ from dds_cloudapi_sdk import Client
4
+ from dds_cloudapi_sdk.tasks.dinox import DinoxTask
5
+ from dds_cloudapi_sdk.tasks.types import DetectionTarget
6
+ from dds_cloudapi_sdk import TextPrompt
7
+
8
+ import os
9
+ import cv2
10
+ import torch
11
+ import numpy as np
12
+ import supervision as sv
13
+
14
+ from pathlib import Path
15
+ from tqdm import tqdm
16
+ from PIL import Image
17
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
18
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
19
+ from utils.track_utils import sample_points_from_masks
20
+ from utils.video_utils import create_video_from_images
21
+
22
+ """
23
+ Hyperparam for Ground and Tracking
24
+ """
25
+ VIDEO_PATH = "./assets/hippopotamus.mp4"
26
+ TEXT_PROMPT = "hippopotamus."
27
+ OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
28
+ SOURCE_VIDEO_FRAME_DIR = "./custom_video_frames"
29
+ SAVE_TRACKING_RESULTS_DIR = "./tracking_results"
30
+ API_TOKEN_FOR_DINOX = "Your API token"
31
+ PROMPT_TYPE_FOR_VIDEO = "box" # choose from ["point", "box", "mask"]
32
+ BOX_THRESHOLD = 0.2
33
+
34
+ """
35
+ Step 1: Environment settings and model initialization for SAM 2
36
+ """
37
+ # use bfloat16 for the entire notebook
38
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
39
+
40
+ if torch.cuda.get_device_properties(0).major >= 8:
41
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
42
+ torch.backends.cuda.matmul.allow_tf32 = True
43
+ torch.backends.cudnn.allow_tf32 = True
44
+
45
+ # init sam image predictor and video predictor model
46
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
47
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
48
+
49
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
50
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint)
51
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
52
+
53
+
54
+ # # `video_dir` a directory of JPEG frames with filenames like `<frame_index>.jpg`
55
+ # video_dir = "notebooks/videos/bedroom"
56
+
57
+ """
58
+ Custom video input directly using video files
59
+ """
60
+ video_info = sv.VideoInfo.from_video_path(VIDEO_PATH) # get video info
61
+ print(video_info)
62
+ frame_generator = sv.get_video_frames_generator(VIDEO_PATH, stride=1, start=0, end=None)
63
+
64
+ # saving video to frames
65
+ source_frames = Path(SOURCE_VIDEO_FRAME_DIR)
66
+ source_frames.mkdir(parents=True, exist_ok=True)
67
+
68
+ with sv.ImageSink(
69
+ target_dir_path=source_frames,
70
+ overwrite=True,
71
+ image_name_pattern="{:05d}.jpg"
72
+ ) as sink:
73
+ for frame in tqdm(frame_generator, desc="Saving Video Frames"):
74
+ sink.save_image(frame)
75
+
76
+ # scan all the JPEG frame names in this directory
77
+ frame_names = [
78
+ p for p in os.listdir(SOURCE_VIDEO_FRAME_DIR)
79
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
80
+ ]
81
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
82
+
83
+ # init video predictor state
84
+ inference_state = video_predictor.init_state(video_path=SOURCE_VIDEO_FRAME_DIR)
85
+
86
+ ann_frame_idx = 0 # the frame index we interact with
87
+ """
88
+ Step 2: Prompt DINO-X with Cloud API for box coordinates
89
+ """
90
+
91
+ # prompt grounding dino to get the box coordinates on specific frame
92
+ img_path = os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[ann_frame_idx])
93
+ image = Image.open(img_path)
94
+
95
+ # Step 1: initialize the config
96
+ config = Config(API_TOKEN_FOR_DINOX)
97
+
98
+ # Step 2: initialize the client
99
+ client = Client(config)
100
+
101
+ # Step 3: run the task by DetectionTask class
102
+ # image_url = "https://algosplt.oss-cn-shenzhen.aliyuncs.com/test_files/tasks/detection/iron_man.jpg"
103
+ # if you are processing local image file, upload them to DDS server to get the image url
104
+ image_url = client.upload_file(img_path)
105
+
106
+ task = DinoxTask(
107
+ image_url=image_url,
108
+ prompts=[TextPrompt(text=TEXT_PROMPT)],
109
+ bbox_threshold=0.25,
110
+ targets=[DetectionTarget.BBox],
111
+ )
112
+
113
+ client.run_task(task)
114
+ result = task.result
115
+
116
+ objects = result.objects # the list of detected objects
117
+
118
+
119
+ input_boxes = []
120
+ confidences = []
121
+ class_names = []
122
+
123
+ for idx, obj in enumerate(objects):
124
+ input_boxes.append(obj.bbox)
125
+ confidences.append(obj.score)
126
+ class_names.append(obj.category)
127
+
128
+ input_boxes = np.array(input_boxes)
129
+
130
+ print(input_boxes)
131
+
132
+ # prompt SAM image predictor to get the mask for the object
133
+ image_predictor.set_image(np.array(image.convert("RGB")))
134
+
135
+ # process the detection results
136
+ OBJECTS = class_names
137
+
138
+ print(OBJECTS)
139
+
140
+ # prompt SAM 2 image predictor to get the mask for the object
141
+ masks, scores, logits = image_predictor.predict(
142
+ point_coords=None,
143
+ point_labels=None,
144
+ box=input_boxes,
145
+ multimask_output=False,
146
+ )
147
+ # convert the mask shape to (n, H, W)
148
+ if masks.ndim == 4:
149
+ masks = masks.squeeze(1)
150
+
151
+ """
152
+ Step 3: Register each object's positive points to video predictor with seperate add_new_points call
153
+ """
154
+
155
+ assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only support point/box/mask prompt"
156
+
157
+ # If you are using point prompts, we uniformly sample positive points based on the mask
158
+ if PROMPT_TYPE_FOR_VIDEO == "point":
159
+ # sample the positive points from mask for each objects
160
+ all_sample_points = sample_points_from_masks(masks=masks, num_points=10)
161
+
162
+ for object_id, (label, points) in enumerate(zip(OBJECTS, all_sample_points), start=1):
163
+ labels = np.ones((points.shape[0]), dtype=np.int32)
164
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
165
+ inference_state=inference_state,
166
+ frame_idx=ann_frame_idx,
167
+ obj_id=object_id,
168
+ points=points,
169
+ labels=labels,
170
+ )
171
+ # Using box prompt
172
+ elif PROMPT_TYPE_FOR_VIDEO == "box":
173
+ for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
174
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
175
+ inference_state=inference_state,
176
+ frame_idx=ann_frame_idx,
177
+ obj_id=object_id,
178
+ box=box,
179
+ )
180
+ # Using mask prompt is a more straightforward way
181
+ elif PROMPT_TYPE_FOR_VIDEO == "mask":
182
+ for object_id, (label, mask) in enumerate(zip(OBJECTS, masks), start=1):
183
+ labels = np.ones((1), dtype=np.int32)
184
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
185
+ inference_state=inference_state,
186
+ frame_idx=ann_frame_idx,
187
+ obj_id=object_id,
188
+ mask=mask
189
+ )
190
+ else:
191
+ raise NotImplementedError("SAM 2 video predictor only support point/box/mask prompts")
192
+
193
+ """
194
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
195
+ """
196
+ video_segments = {} # video_segments contains the per-frame segmentation results
197
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
198
+ video_segments[out_frame_idx] = {
199
+ out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
200
+ for i, out_obj_id in enumerate(out_obj_ids)
201
+ }
202
+
203
+ """
204
+ Step 5: Visualize the segment results across the video and save them
205
+ """
206
+
207
+ if not os.path.exists(SAVE_TRACKING_RESULTS_DIR):
208
+ os.makedirs(SAVE_TRACKING_RESULTS_DIR)
209
+
210
+ ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
211
+
212
+ for frame_idx, segments in video_segments.items():
213
+ img = cv2.imread(os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[frame_idx]))
214
+
215
+ object_ids = list(segments.keys())
216
+ masks = list(segments.values())
217
+ masks = np.concatenate(masks, axis=0)
218
+
219
+ detections = sv.Detections(
220
+ xyxy=sv.mask_to_xyxy(masks), # (n, 4)
221
+ mask=masks, # (n, h, w)
222
+ class_id=np.array(object_ids, dtype=np.int32),
223
+ )
224
+ box_annotator = sv.BoxAnnotator()
225
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
226
+ label_annotator = sv.LabelAnnotator()
227
+ annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
228
+ mask_annotator = sv.MaskAnnotator()
229
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
230
+ cv2.imwrite(os.path.join(SAVE_TRACKING_RESULTS_DIR, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
231
+
232
+
233
+ """
234
+ Step 6: Convert the annotated frames to video
235
+ """
236
+
237
+ create_video_from_images(SAVE_TRACKING_RESULTS_DIR, OUTPUT_VIDEO_PATH)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_gd1.0_hf_model.py ADDED
@@ -0,0 +1,214 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import numpy as np
5
+ import supervision as sv
6
+
7
+ from pathlib import Path
8
+ from tqdm import tqdm
9
+ from PIL import Image
10
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
11
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
12
+ from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
13
+ from utils.track_utils import sample_points_from_masks
14
+ from utils.video_utils import create_video_from_images
15
+
16
+ """
17
+ Hyperparam for Ground and Tracking
18
+ """
19
+ MODEL_ID = "IDEA-Research/grounding-dino-tiny"
20
+ VIDEO_PATH = "./assets/hippopotamus.mp4"
21
+ TEXT_PROMPT = "hippopotamus."
22
+ OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
23
+ SOURCE_VIDEO_FRAME_DIR = "./custom_video_frames"
24
+ SAVE_TRACKING_RESULTS_DIR = "./tracking_results"
25
+ PROMPT_TYPE_FOR_VIDEO = "box" # choose from ["point", "box", "mask"]
26
+
27
+ """
28
+ Step 1: Environment settings and model initialization for SAM 2
29
+ """
30
+ # use bfloat16 for the entire notebook
31
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
32
+
33
+ if torch.cuda.get_device_properties(0).major >= 8:
34
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
35
+ torch.backends.cuda.matmul.allow_tf32 = True
36
+ torch.backends.cudnn.allow_tf32 = True
37
+
38
+ # init sam image predictor and video predictor model
39
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
40
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
41
+
42
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
43
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint)
44
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
45
+
46
+ # build grounding dino from huggingface
47
+ model_id = MODEL_ID
48
+ device = "cuda" if torch.cuda.is_available() else "cpu"
49
+ processor = AutoProcessor.from_pretrained(model_id)
50
+ grounding_model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
51
+
52
+
53
+ """
54
+ Custom video input directly using video files
55
+ """
56
+ video_info = sv.VideoInfo.from_video_path(VIDEO_PATH) # get video info
57
+ print(video_info)
58
+ frame_generator = sv.get_video_frames_generator(VIDEO_PATH, stride=1, start=0, end=None)
59
+
60
+ # saving video to frames
61
+ source_frames = Path(SOURCE_VIDEO_FRAME_DIR)
62
+ source_frames.mkdir(parents=True, exist_ok=True)
63
+
64
+ with sv.ImageSink(
65
+ target_dir_path=source_frames,
66
+ overwrite=True,
67
+ image_name_pattern="{:05d}.jpg"
68
+ ) as sink:
69
+ for frame in tqdm(frame_generator, desc="Saving Video Frames"):
70
+ sink.save_image(frame)
71
+
72
+ # scan all the JPEG frame names in this directory
73
+ frame_names = [
74
+ p for p in os.listdir(SOURCE_VIDEO_FRAME_DIR)
75
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
76
+ ]
77
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
78
+
79
+ # init video predictor state
80
+ inference_state = video_predictor.init_state(video_path=SOURCE_VIDEO_FRAME_DIR)
81
+
82
+ ann_frame_idx = 0 # the frame index we interact with
83
+ """
84
+ Step 2: Prompt Grounding DINO 1.5 with Cloud API for box coordinates
85
+ """
86
+
87
+ # prompt grounding dino to get the box coordinates on specific frame
88
+ img_path = os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[ann_frame_idx])
89
+ image = Image.open(img_path)
90
+ inputs = processor(images=image, text=TEXT_PROMPT, return_tensors="pt").to(device)
91
+ with torch.no_grad():
92
+ outputs = grounding_model(**inputs)
93
+
94
+ results = processor.post_process_grounded_object_detection(
95
+ outputs,
96
+ inputs.input_ids,
97
+ box_threshold=0.4,
98
+ text_threshold=0.3,
99
+ target_sizes=[image.size[::-1]]
100
+ )
101
+
102
+ input_boxes = results[0]["boxes"].cpu().numpy()
103
+ confidences = results[0]["scores"].cpu().numpy().tolist()
104
+ class_names = results[0]["labels"]
105
+
106
+ print(input_boxes)
107
+
108
+ # prompt SAM image predictor to get the mask for the object
109
+ image_predictor.set_image(np.array(image.convert("RGB")))
110
+
111
+ # process the detection results
112
+ OBJECTS = class_names
113
+
114
+ print(OBJECTS)
115
+
116
+ # prompt SAM 2 image predictor to get the mask for the object
117
+ masks, scores, logits = image_predictor.predict(
118
+ point_coords=None,
119
+ point_labels=None,
120
+ box=input_boxes,
121
+ multimask_output=False,
122
+ )
123
+ # convert the mask shape to (n, H, W)
124
+ if masks.ndim == 4:
125
+ masks = masks.squeeze(1)
126
+
127
+ """
128
+ Step 3: Register each object's positive points to video predictor with seperate add_new_points call
129
+ """
130
+
131
+ assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only support point/box/mask prompt"
132
+
133
+ # If you are using point prompts, we uniformly sample positive points based on the mask
134
+ if PROMPT_TYPE_FOR_VIDEO == "point":
135
+ # sample the positive points from mask for each objects
136
+ all_sample_points = sample_points_from_masks(masks=masks, num_points=10)
137
+
138
+ for object_id, (label, points) in enumerate(zip(OBJECTS, all_sample_points), start=1):
139
+ labels = np.ones((points.shape[0]), dtype=np.int32)
140
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
141
+ inference_state=inference_state,
142
+ frame_idx=ann_frame_idx,
143
+ obj_id=object_id,
144
+ points=points,
145
+ labels=labels,
146
+ )
147
+ # Using box prompt
148
+ elif PROMPT_TYPE_FOR_VIDEO == "box":
149
+ for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
150
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
151
+ inference_state=inference_state,
152
+ frame_idx=ann_frame_idx,
153
+ obj_id=object_id,
154
+ box=box,
155
+ )
156
+ # Using mask prompt is a more straightforward way
157
+ elif PROMPT_TYPE_FOR_VIDEO == "mask":
158
+ for object_id, (label, mask) in enumerate(zip(OBJECTS, masks), start=1):
159
+ labels = np.ones((1), dtype=np.int32)
160
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
161
+ inference_state=inference_state,
162
+ frame_idx=ann_frame_idx,
163
+ obj_id=object_id,
164
+ mask=mask
165
+ )
166
+ else:
167
+ raise NotImplementedError("SAM 2 video predictor only support point/box/mask prompts")
168
+
169
+
170
+ """
171
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
172
+ """
173
+ video_segments = {} # video_segments contains the per-frame segmentation results
174
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
175
+ video_segments[out_frame_idx] = {
176
+ out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
177
+ for i, out_obj_id in enumerate(out_obj_ids)
178
+ }
179
+
180
+ """
181
+ Step 5: Visualize the segment results across the video and save them
182
+ """
183
+
184
+ if not os.path.exists(SAVE_TRACKING_RESULTS_DIR):
185
+ os.makedirs(SAVE_TRACKING_RESULTS_DIR)
186
+
187
+ ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
188
+
189
+ for frame_idx, segments in video_segments.items():
190
+ img = cv2.imread(os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[frame_idx]))
191
+
192
+ object_ids = list(segments.keys())
193
+ masks = list(segments.values())
194
+ masks = np.concatenate(masks, axis=0)
195
+
196
+ detections = sv.Detections(
197
+ xyxy=sv.mask_to_xyxy(masks), # (n, 4)
198
+ mask=masks, # (n, h, w)
199
+ class_id=np.array(object_ids, dtype=np.int32),
200
+ )
201
+ box_annotator = sv.BoxAnnotator()
202
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
203
+ label_annotator = sv.LabelAnnotator()
204
+ annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
205
+ mask_annotator = sv.MaskAnnotator()
206
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
207
+ cv2.imwrite(os.path.join(SAVE_TRACKING_RESULTS_DIR, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
208
+
209
+
210
+ """
211
+ Step 6: Convert the annotated frames to video
212
+ """
213
+
214
+ create_video_from_images(SAVE_TRACKING_RESULTS_DIR, OUTPUT_VIDEO_PATH)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_gd1.0_local_model.py ADDED
@@ -0,0 +1,220 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import numpy as np
5
+ import supervision as sv
6
+ from torchvision.ops import box_convert
7
+ from pathlib import Path
8
+ from tqdm import tqdm
9
+ from PIL import Image
10
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
11
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
12
+ from grounding_dino.groundingdino.util.inference import load_model, load_image, predict
13
+ from utils.track_utils import sample_points_from_masks
14
+ from utils.video_utils import create_video_from_images
15
+
16
+ """
17
+ Hyperparam for Ground and Tracking
18
+ """
19
+ GROUNDING_DINO_CONFIG = "grounding_dino/groundingdino/config/GroundingDINO_SwinT_OGC.py"
20
+ GROUNDING_DINO_CHECKPOINT = "gdino_checkpoints/groundingdino_swint_ogc.pth"
21
+ BOX_THRESHOLD = 0.35
22
+ TEXT_THRESHOLD = 0.25
23
+ VIDEO_PATH = "./assets/hippopotamus.mp4"
24
+ TEXT_PROMPT = "hippopotamus."
25
+ OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
26
+ SOURCE_VIDEO_FRAME_DIR = "./custom_video_frames"
27
+ SAVE_TRACKING_RESULTS_DIR = "./tracking_results"
28
+ PROMPT_TYPE_FOR_VIDEO = "box" # choose from ["point", "box", "mask"]
29
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
30
+
31
+ """
32
+ Step 1: Environment settings and model initialization for Grounding DINO and SAM 2
33
+ """
34
+ # build grounding dino model from local path
35
+ grounding_model = load_model(
36
+ model_config_path=GROUNDING_DINO_CONFIG,
37
+ model_checkpoint_path=GROUNDING_DINO_CHECKPOINT,
38
+ device=DEVICE
39
+ )
40
+
41
+
42
+ # init sam image predictor and video predictor model
43
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
44
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
45
+
46
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
47
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint)
48
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
49
+
50
+
51
+ """
52
+ Custom video input directly using video files
53
+ """
54
+ video_info = sv.VideoInfo.from_video_path(VIDEO_PATH) # get video info
55
+ print(video_info)
56
+ frame_generator = sv.get_video_frames_generator(VIDEO_PATH, stride=1, start=0, end=None)
57
+
58
+ # saving video to frames
59
+ source_frames = Path(SOURCE_VIDEO_FRAME_DIR)
60
+ source_frames.mkdir(parents=True, exist_ok=True)
61
+
62
+ with sv.ImageSink(
63
+ target_dir_path=source_frames,
64
+ overwrite=True,
65
+ image_name_pattern="{:05d}.jpg"
66
+ ) as sink:
67
+ for frame in tqdm(frame_generator, desc="Saving Video Frames"):
68
+ sink.save_image(frame)
69
+
70
+ # scan all the JPEG frame names in this directory
71
+ frame_names = [
72
+ p for p in os.listdir(SOURCE_VIDEO_FRAME_DIR)
73
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
74
+ ]
75
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
76
+
77
+ # init video predictor state
78
+ inference_state = video_predictor.init_state(video_path=SOURCE_VIDEO_FRAME_DIR)
79
+
80
+ ann_frame_idx = 0 # the frame index we interact with
81
+ """
82
+ Step 2: Prompt Grounding DINO 1.5 with Cloud API for box coordinates
83
+ """
84
+
85
+ # prompt grounding dino to get the box coordinates on specific frame
86
+ img_path = os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[ann_frame_idx])
87
+ image_source, image = load_image(img_path)
88
+
89
+ boxes, confidences, labels = predict(
90
+ model=grounding_model,
91
+ image=image,
92
+ caption=TEXT_PROMPT,
93
+ box_threshold=BOX_THRESHOLD,
94
+ text_threshold=TEXT_THRESHOLD,
95
+ )
96
+
97
+ # process the box prompt for SAM 2
98
+ h, w, _ = image_source.shape
99
+ boxes = boxes * torch.Tensor([w, h, w, h])
100
+ input_boxes = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()
101
+ confidences = confidences.numpy().tolist()
102
+ class_names = labels
103
+
104
+ print(input_boxes)
105
+
106
+ # prompt SAM image predictor to get the mask for the object
107
+ image_predictor.set_image(image_source)
108
+
109
+ # process the detection results
110
+ OBJECTS = class_names
111
+
112
+ print(OBJECTS)
113
+
114
+ # FIXME: figure how does this influence the G-DINO model
115
+ torch.autocast(device_type=DEVICE, dtype=torch.bfloat16).__enter__()
116
+
117
+ if torch.cuda.get_device_properties(0).major >= 8:
118
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
119
+ torch.backends.cuda.matmul.allow_tf32 = True
120
+ torch.backends.cudnn.allow_tf32 = True
121
+
122
+ # prompt SAM 2 image predictor to get the mask for the object
123
+ masks, scores, logits = image_predictor.predict(
124
+ point_coords=None,
125
+ point_labels=None,
126
+ box=input_boxes,
127
+ multimask_output=False,
128
+ )
129
+ # convert the mask shape to (n, H, W)
130
+ if masks.ndim == 4:
131
+ masks = masks.squeeze(1)
132
+
133
+ """
134
+ Step 3: Register each object's positive points to video predictor with seperate add_new_points call
135
+ """
136
+
137
+ assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only support point/box/mask prompt"
138
+
139
+ # If you are using point prompts, we uniformly sample positive points based on the mask
140
+ if PROMPT_TYPE_FOR_VIDEO == "point":
141
+ # sample the positive points from mask for each objects
142
+ all_sample_points = sample_points_from_masks(masks=masks, num_points=10)
143
+
144
+ for object_id, (label, points) in enumerate(zip(OBJECTS, all_sample_points), start=1):
145
+ labels = np.ones((points.shape[0]), dtype=np.int32)
146
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
147
+ inference_state=inference_state,
148
+ frame_idx=ann_frame_idx,
149
+ obj_id=object_id,
150
+ points=points,
151
+ labels=labels,
152
+ )
153
+ # Using box prompt
154
+ elif PROMPT_TYPE_FOR_VIDEO == "box":
155
+ for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
156
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
157
+ inference_state=inference_state,
158
+ frame_idx=ann_frame_idx,
159
+ obj_id=object_id,
160
+ box=box,
161
+ )
162
+ # Using mask prompt is a more straightforward way
163
+ elif PROMPT_TYPE_FOR_VIDEO == "mask":
164
+ for object_id, (label, mask) in enumerate(zip(OBJECTS, masks), start=1):
165
+ labels = np.ones((1), dtype=np.int32)
166
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
167
+ inference_state=inference_state,
168
+ frame_idx=ann_frame_idx,
169
+ obj_id=object_id,
170
+ mask=mask
171
+ )
172
+ else:
173
+ raise NotImplementedError("SAM 2 video predictor only support point/box/mask prompts")
174
+
175
+
176
+ """
177
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
178
+ """
179
+ video_segments = {} # video_segments contains the per-frame segmentation results
180
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
181
+ video_segments[out_frame_idx] = {
182
+ out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
183
+ for i, out_obj_id in enumerate(out_obj_ids)
184
+ }
185
+
186
+ """
187
+ Step 5: Visualize the segment results across the video and save them
188
+ """
189
+
190
+ if not os.path.exists(SAVE_TRACKING_RESULTS_DIR):
191
+ os.makedirs(SAVE_TRACKING_RESULTS_DIR)
192
+
193
+ ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
194
+
195
+ for frame_idx, segments in video_segments.items():
196
+ img = cv2.imread(os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[frame_idx]))
197
+
198
+ object_ids = list(segments.keys())
199
+ masks = list(segments.values())
200
+ masks = np.concatenate(masks, axis=0)
201
+
202
+ detections = sv.Detections(
203
+ xyxy=sv.mask_to_xyxy(masks), # (n, 4)
204
+ mask=masks, # (n, h, w)
205
+ class_id=np.array(object_ids, dtype=np.int32),
206
+ )
207
+ box_annotator = sv.BoxAnnotator()
208
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
209
+ label_annotator = sv.LabelAnnotator()
210
+ annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
211
+ mask_annotator = sv.MaskAnnotator()
212
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
213
+ cv2.imwrite(os.path.join(SAVE_TRACKING_RESULTS_DIR, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
214
+
215
+
216
+ """
217
+ Step 6: Convert the annotated frames to video
218
+ """
219
+
220
+ create_video_from_images(SAVE_TRACKING_RESULTS_DIR, OUTPUT_VIDEO_PATH)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_custom_video_input_gd1.5.py ADDED
@@ -0,0 +1,239 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # dds cloudapi for Grounding DINO 1.5
2
+ from dds_cloudapi_sdk import Config
3
+ from dds_cloudapi_sdk import Client
4
+ from dds_cloudapi_sdk import DetectionTask
5
+ from dds_cloudapi_sdk import TextPrompt
6
+ from dds_cloudapi_sdk import DetectionModel
7
+ from dds_cloudapi_sdk import DetectionTarget
8
+
9
+ import os
10
+ import cv2
11
+ import torch
12
+ import numpy as np
13
+ import supervision as sv
14
+
15
+ from pathlib import Path
16
+ from tqdm import tqdm
17
+ from PIL import Image
18
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
19
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
20
+ from utils.track_utils import sample_points_from_masks
21
+ from utils.video_utils import create_video_from_images
22
+
23
+ """
24
+ Hyperparam for Ground and Tracking
25
+ """
26
+ VIDEO_PATH = "./assets/hippopotamus.mp4"
27
+ TEXT_PROMPT = "hippopotamus."
28
+ OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
29
+ SOURCE_VIDEO_FRAME_DIR = "./custom_video_frames"
30
+ SAVE_TRACKING_RESULTS_DIR = "./tracking_results"
31
+ API_TOKEN_FOR_GD1_5 = "Your API token"
32
+ PROMPT_TYPE_FOR_VIDEO = "box" # choose from ["point", "box", "mask"]
33
+ BOX_THRESHOLD = 0.2
34
+
35
+ """
36
+ Step 1: Environment settings and model initialization for SAM 2
37
+ """
38
+ # use bfloat16 for the entire notebook
39
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
40
+
41
+ if torch.cuda.get_device_properties(0).major >= 8:
42
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
43
+ torch.backends.cuda.matmul.allow_tf32 = True
44
+ torch.backends.cudnn.allow_tf32 = True
45
+
46
+ # init sam image predictor and video predictor model
47
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
48
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
49
+
50
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
51
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint)
52
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
53
+
54
+
55
+ # # `video_dir` a directory of JPEG frames with filenames like `<frame_index>.jpg`
56
+ # video_dir = "notebooks/videos/bedroom"
57
+
58
+ """
59
+ Custom video input directly using video files
60
+ """
61
+ video_info = sv.VideoInfo.from_video_path(VIDEO_PATH) # get video info
62
+ print(video_info)
63
+ frame_generator = sv.get_video_frames_generator(VIDEO_PATH, stride=1, start=0, end=None)
64
+
65
+ # saving video to frames
66
+ source_frames = Path(SOURCE_VIDEO_FRAME_DIR)
67
+ source_frames.mkdir(parents=True, exist_ok=True)
68
+
69
+ with sv.ImageSink(
70
+ target_dir_path=source_frames,
71
+ overwrite=True,
72
+ image_name_pattern="{:05d}.jpg"
73
+ ) as sink:
74
+ for frame in tqdm(frame_generator, desc="Saving Video Frames"):
75
+ sink.save_image(frame)
76
+
77
+ # scan all the JPEG frame names in this directory
78
+ frame_names = [
79
+ p for p in os.listdir(SOURCE_VIDEO_FRAME_DIR)
80
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
81
+ ]
82
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
83
+
84
+ # init video predictor state
85
+ inference_state = video_predictor.init_state(video_path=SOURCE_VIDEO_FRAME_DIR)
86
+
87
+ ann_frame_idx = 0 # the frame index we interact with
88
+ """
89
+ Step 2: Prompt Grounding DINO 1.5 with Cloud API for box coordinates
90
+ """
91
+
92
+ # prompt grounding dino to get the box coordinates on specific frame
93
+ img_path = os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[ann_frame_idx])
94
+ image = Image.open(img_path)
95
+
96
+ # Step 1: initialize the config
97
+ config = Config(API_TOKEN_FOR_GD1_5)
98
+
99
+ # Step 2: initialize the client
100
+ client = Client(config)
101
+
102
+ # Step 3: run the task by DetectionTask class
103
+ # image_url = "https://algosplt.oss-cn-shenzhen.aliyuncs.com/test_files/tasks/detection/iron_man.jpg"
104
+ # if you are processing local image file, upload them to DDS server to get the image url
105
+ image_url = client.upload_file(img_path)
106
+
107
+ task = DetectionTask(
108
+ image_url=image_url,
109
+ prompts=[TextPrompt(text=TEXT_PROMPT)],
110
+ targets=[DetectionTarget.BBox], # detect bbox
111
+ model=DetectionModel.GDino1_6_Pro, # detect with GroundingDino-1.5-Pro model
112
+ bbox_threshold=BOX_THRESHOLD,
113
+ )
114
+
115
+ client.run_task(task)
116
+ result = task.result
117
+
118
+ objects = result.objects # the list of detected objects
119
+
120
+
121
+ input_boxes = []
122
+ confidences = []
123
+ class_names = []
124
+
125
+ for idx, obj in enumerate(objects):
126
+ input_boxes.append(obj.bbox)
127
+ confidences.append(obj.score)
128
+ class_names.append(obj.category)
129
+
130
+ input_boxes = np.array(input_boxes)
131
+
132
+ print(input_boxes)
133
+
134
+ # prompt SAM image predictor to get the mask for the object
135
+ image_predictor.set_image(np.array(image.convert("RGB")))
136
+
137
+ # process the detection results
138
+ OBJECTS = class_names
139
+
140
+ print(OBJECTS)
141
+
142
+ # prompt SAM 2 image predictor to get the mask for the object
143
+ masks, scores, logits = image_predictor.predict(
144
+ point_coords=None,
145
+ point_labels=None,
146
+ box=input_boxes,
147
+ multimask_output=False,
148
+ )
149
+ # convert the mask shape to (n, H, W)
150
+ if masks.ndim == 4:
151
+ masks = masks.squeeze(1)
152
+
153
+ """
154
+ Step 3: Register each object's positive points to video predictor with seperate add_new_points call
155
+ """
156
+
157
+ assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only support point/box/mask prompt"
158
+
159
+ # If you are using point prompts, we uniformly sample positive points based on the mask
160
+ if PROMPT_TYPE_FOR_VIDEO == "point":
161
+ # sample the positive points from mask for each objects
162
+ all_sample_points = sample_points_from_masks(masks=masks, num_points=10)
163
+
164
+ for object_id, (label, points) in enumerate(zip(OBJECTS, all_sample_points), start=1):
165
+ labels = np.ones((points.shape[0]), dtype=np.int32)
166
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
167
+ inference_state=inference_state,
168
+ frame_idx=ann_frame_idx,
169
+ obj_id=object_id,
170
+ points=points,
171
+ labels=labels,
172
+ )
173
+ # Using box prompt
174
+ elif PROMPT_TYPE_FOR_VIDEO == "box":
175
+ for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
176
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
177
+ inference_state=inference_state,
178
+ frame_idx=ann_frame_idx,
179
+ obj_id=object_id,
180
+ box=box,
181
+ )
182
+ # Using mask prompt is a more straightforward way
183
+ elif PROMPT_TYPE_FOR_VIDEO == "mask":
184
+ for object_id, (label, mask) in enumerate(zip(OBJECTS, masks), start=1):
185
+ labels = np.ones((1), dtype=np.int32)
186
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
187
+ inference_state=inference_state,
188
+ frame_idx=ann_frame_idx,
189
+ obj_id=object_id,
190
+ mask=mask
191
+ )
192
+ else:
193
+ raise NotImplementedError("SAM 2 video predictor only support point/box/mask prompts")
194
+
195
+ """
196
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
197
+ """
198
+ video_segments = {} # video_segments contains the per-frame segmentation results
199
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
200
+ video_segments[out_frame_idx] = {
201
+ out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
202
+ for i, out_obj_id in enumerate(out_obj_ids)
203
+ }
204
+
205
+ """
206
+ Step 5: Visualize the segment results across the video and save them
207
+ """
208
+
209
+ if not os.path.exists(SAVE_TRACKING_RESULTS_DIR):
210
+ os.makedirs(SAVE_TRACKING_RESULTS_DIR)
211
+
212
+ ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
213
+
214
+ for frame_idx, segments in video_segments.items():
215
+ img = cv2.imread(os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[frame_idx]))
216
+
217
+ object_ids = list(segments.keys())
218
+ masks = list(segments.values())
219
+ masks = np.concatenate(masks, axis=0)
220
+
221
+ detections = sv.Detections(
222
+ xyxy=sv.mask_to_xyxy(masks), # (n, 4)
223
+ mask=masks, # (n, h, w)
224
+ class_id=np.array(object_ids, dtype=np.int32),
225
+ )
226
+ box_annotator = sv.BoxAnnotator()
227
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
228
+ label_annotator = sv.LabelAnnotator()
229
+ annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
230
+ mask_annotator = sv.MaskAnnotator()
231
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
232
+ cv2.imwrite(os.path.join(SAVE_TRACKING_RESULTS_DIR, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
233
+
234
+
235
+ """
236
+ Step 6: Convert the annotated frames to video
237
+ """
238
+
239
+ create_video_from_images(SAVE_TRACKING_RESULTS_DIR, OUTPUT_VIDEO_PATH)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_continuous_id.py ADDED
@@ -0,0 +1,203 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import numpy as np
5
+ import supervision as sv
6
+ from PIL import Image
7
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
8
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
9
+ from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
10
+ from utils.track_utils import sample_points_from_masks
11
+ from utils.video_utils import create_video_from_images
12
+ from utils.common_utils import CommonUtils
13
+ from utils.mask_dictionary_model import MaskDictionaryModel, ObjectInfo
14
+ import json
15
+ import copy
16
+
17
+ """
18
+ Step 1: Environment settings and model initialization
19
+ """
20
+ # use bfloat16 for the entire notebook
21
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
22
+
23
+ if torch.cuda.get_device_properties(0).major >= 8:
24
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
25
+ torch.backends.cuda.matmul.allow_tf32 = True
26
+ torch.backends.cudnn.allow_tf32 = True
27
+
28
+ # init sam image predictor and video predictor model
29
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
30
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
31
+ device = "cuda" if torch.cuda.is_available() else "cpu"
32
+ print("device", device)
33
+
34
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
35
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint, device=device)
36
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
37
+
38
+
39
+ # init grounding dino model from huggingface
40
+ model_id = "IDEA-Research/grounding-dino-tiny"
41
+ processor = AutoProcessor.from_pretrained(model_id)
42
+ grounding_model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
43
+
44
+
45
+ # setup the input image and text prompt for SAM 2 and Grounding DINO
46
+ # VERY important: text queries need to be lowercased + end with a dot
47
+ text = "car."
48
+
49
+ # `video_dir` a directory of JPEG frames with filenames like `<frame_index>.jpg`
50
+ video_dir = "notebooks/videos/car"
51
+ # 'output_dir' is the directory to save the annotated frames
52
+ output_dir = "./outputs"
53
+ # 'output_video_path' is the path to save the final video
54
+ output_video_path = "./outputs/output.mp4"
55
+ # create the output directory
56
+ CommonUtils.creat_dirs(output_dir)
57
+ mask_data_dir = os.path.join(output_dir, "mask_data")
58
+ json_data_dir = os.path.join(output_dir, "json_data")
59
+ result_dir = os.path.join(output_dir, "result")
60
+ CommonUtils.creat_dirs(mask_data_dir)
61
+ CommonUtils.creat_dirs(json_data_dir)
62
+ # scan all the JPEG frame names in this directory
63
+ frame_names = [
64
+ p for p in os.listdir(video_dir)
65
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG", ".png", ".PNG"]
66
+ ]
67
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
68
+
69
+ # init video predictor state
70
+ inference_state = video_predictor.init_state(video_path=video_dir, offload_video_to_cpu=True, async_loading_frames=True)
71
+ step = 20 # the step to sample frames for Grounding DINO predictor
72
+
73
+ sam2_masks = MaskDictionaryModel()
74
+ PROMPT_TYPE_FOR_VIDEO = "mask" # box, mask or point
75
+ objects_count = 0
76
+
77
+ """
78
+ Step 2: Prompt Grounding DINO and SAM image predictor to get the box and mask for all frames
79
+ """
80
+ print("Total frames:", len(frame_names))
81
+ for start_frame_idx in range(0, len(frame_names), step):
82
+ # prompt grounding dino to get the box coordinates on specific frame
83
+ print("start_frame_idx", start_frame_idx)
84
+ # continue
85
+ img_path = os.path.join(video_dir, frame_names[start_frame_idx])
86
+ image = Image.open(img_path)
87
+ image_base_name = frame_names[start_frame_idx].split(".")[0]
88
+ mask_dict = MaskDictionaryModel(promote_type = PROMPT_TYPE_FOR_VIDEO, mask_name = f"mask_{image_base_name}.npy")
89
+
90
+ # run Grounding DINO on the image
91
+ inputs = processor(images=image, text=text, return_tensors="pt").to(device)
92
+ with torch.no_grad():
93
+ outputs = grounding_model(**inputs)
94
+
95
+ results = processor.post_process_grounded_object_detection(
96
+ outputs,
97
+ inputs.input_ids,
98
+ box_threshold=0.25,
99
+ text_threshold=0.25,
100
+ target_sizes=[image.size[::-1]]
101
+ )
102
+
103
+ # prompt SAM image predictor to get the mask for the object
104
+ image_predictor.set_image(np.array(image.convert("RGB")))
105
+
106
+ # process the detection results
107
+ input_boxes = results[0]["boxes"] # .cpu().numpy()
108
+ # print("results[0]",results[0])
109
+ OBJECTS = results[0]["labels"]
110
+ if input_boxes.shape[0] != 0:
111
+ # prompt SAM 2 image predictor to get the mask for the object
112
+ masks, scores, logits = image_predictor.predict(
113
+ point_coords=None,
114
+ point_labels=None,
115
+ box=input_boxes,
116
+ multimask_output=False,
117
+ )
118
+ # convert the mask shape to (n, H, W)
119
+ if masks.ndim == 2:
120
+ masks = masks[None]
121
+ scores = scores[None]
122
+ logits = logits[None]
123
+ elif masks.ndim == 4:
124
+ masks = masks.squeeze(1)
125
+
126
+ """
127
+ Step 3: Register each object's positive points to video predictor
128
+ """
129
+
130
+ # If you are using point prompts, we uniformly sample positive points based on the mask
131
+ if mask_dict.promote_type == "mask":
132
+ mask_dict.add_new_frame_annotation(mask_list=torch.tensor(masks).to(device), box_list=torch.tensor(input_boxes), label_list=OBJECTS)
133
+ else:
134
+ raise NotImplementedError("SAM 2 video predictor only support mask prompts")
135
+
136
+
137
+ """
138
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
139
+ """
140
+ objects_count = mask_dict.update_masks(tracking_annotation_dict=sam2_masks, iou_threshold=0.8, objects_count=objects_count)
141
+ print("objects_count", objects_count)
142
+ else:
143
+ print("No object detected in the frame, skip merge the frame merge {}".format(frame_names[start_frame_idx]))
144
+ mask_dict = sam2_masks
145
+
146
+
147
+ if len(mask_dict.labels) == 0:
148
+ mask_dict.save_empty_mask_and_json(mask_data_dir, json_data_dir, image_name_list = frame_names[start_frame_idx:start_frame_idx+step])
149
+ print("No object detected in the frame, skip the frame {}".format(start_frame_idx))
150
+ continue
151
+ else:
152
+ video_predictor.reset_state(inference_state)
153
+
154
+ for object_id, object_info in mask_dict.labels.items():
155
+ frame_idx, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
156
+ inference_state,
157
+ start_frame_idx,
158
+ object_id,
159
+ object_info.mask,
160
+ )
161
+
162
+ video_segments = {} # output the following {step} frames tracking masks
163
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state, max_frame_num_to_track=step, start_frame_idx=start_frame_idx):
164
+ frame_masks = MaskDictionaryModel()
165
+
166
+ for i, out_obj_id in enumerate(out_obj_ids):
167
+ out_mask = (out_mask_logits[i] > 0.0) # .cpu().numpy()
168
+ object_info = ObjectInfo(instance_id = out_obj_id, mask = out_mask[0], class_name = mask_dict.get_target_class_name(out_obj_id))
169
+ object_info.update_box()
170
+ frame_masks.labels[out_obj_id] = object_info
171
+ image_base_name = frame_names[out_frame_idx].split(".")[0]
172
+ frame_masks.mask_name = f"mask_{image_base_name}.npy"
173
+ frame_masks.mask_height = out_mask.shape[-2]
174
+ frame_masks.mask_width = out_mask.shape[-1]
175
+
176
+ video_segments[out_frame_idx] = frame_masks
177
+ sam2_masks = copy.deepcopy(frame_masks)
178
+
179
+ print("video_segments:", len(video_segments))
180
+ """
181
+ Step 5: save the tracking masks and json files
182
+ """
183
+ for frame_idx, frame_masks_info in video_segments.items():
184
+ mask = frame_masks_info.labels
185
+ mask_img = torch.zeros(frame_masks_info.mask_height, frame_masks_info.mask_width)
186
+ for obj_id, obj_info in mask.items():
187
+ mask_img[obj_info.mask == True] = obj_id
188
+
189
+ mask_img = mask_img.numpy().astype(np.uint16)
190
+ np.save(os.path.join(mask_data_dir, frame_masks_info.mask_name), mask_img)
191
+
192
+ json_data = frame_masks_info.to_dict()
193
+ json_data_path = os.path.join(json_data_dir, frame_masks_info.mask_name.replace(".npy", ".json"))
194
+ with open(json_data_path, "w") as f:
195
+ json.dump(json_data, f)
196
+
197
+
198
+ """
199
+ Step 6: Draw the results and save the video
200
+ """
201
+ CommonUtils.draw_masks_and_box_with_supervision(video_dir, mask_data_dir, json_data_dir, result_dir)
202
+
203
+ create_video_from_images(result_dir, output_video_path, frame_rate=15)
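For reference, Step 5 above writes one uint16 mask array per frame (pixel value = object id, 0 = background) plus a JSON metadata file produced from MaskDictionaryModel.to_dict(). A minimal sketch of reading a single frame's outputs back, assuming the same output_dir and the mask_{image_base_name} naming used above (the frame name "00000" is a hypothetical placeholder):

import os, json
import numpy as np

output_dir = "./outputs"                        # same output_dir as in the script above
image_base_name = "00000"                       # hypothetical frame name
mask_path = os.path.join(output_dir, "mask_data", f"mask_{image_base_name}.npy")
json_path = os.path.join(output_dir, "json_data", f"mask_{image_base_name}.json")

mask_img = np.load(mask_path)                   # (H, W) uint16, pixel value = object id
with open(json_path) as f:
    frame_info = json.load(f)                   # dict saved from MaskDictionaryModel.to_dict()

object_ids = np.unique(mask_img)
object_ids = object_ids[object_ids != 0]        # drop background
print(object_ids, list(frame_info.keys()))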
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py ADDED
@@ -0,0 +1,224 @@
1
+ # dds cloudapi for Grounding DINO 1.5
2
+ from dds_cloudapi_sdk import Config
3
+ from dds_cloudapi_sdk import Client
4
+ from dds_cloudapi_sdk import DetectionTask
5
+ from dds_cloudapi_sdk import TextPrompt
6
+ from dds_cloudapi_sdk import DetectionModel
7
+ from dds_cloudapi_sdk import DetectionTarget
8
+
9
+
10
+ import os
11
+ import torch
12
+ import numpy as np
13
+ from PIL import Image
14
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
15
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
16
+ from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
17
+ from utils.video_utils import create_video_from_images
18
+ from utils.common_utils import CommonUtils
19
+ from utils.mask_dictionary_model import MaskDictionaryModel, ObjectInfo
20
+ import json
21
+ import copy
22
+
23
+ """
24
+ Step 1: Environment settings and model initialization
25
+ """
26
+ # use bfloat16 for the entire notebook
27
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
28
+
29
+ if torch.cuda.get_device_properties(0).major >= 8:
30
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
31
+ torch.backends.cuda.matmul.allow_tf32 = True
32
+ torch.backends.cudnn.allow_tf32 = True
33
+
34
+ # init sam image predictor and video predictor model
35
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
36
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
37
+ device = "cuda" if torch.cuda.is_available() else "cpu"
38
+ print("device", device)
39
+
40
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
41
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint, device=device)
42
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
43
+
44
+
45
+ # init grounding dino model from huggingface
46
+ model_id = "IDEA-Research/grounding-dino-tiny"
47
+ processor = AutoProcessor.from_pretrained(model_id)
48
+ grounding_model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
49
+
50
+
51
+ # setup the input image and text prompt for SAM 2 and Grounding DINO
52
+ # VERY important: text queries need to be lowercased + end with a dot
53
+ text = "car."
54
+
55
+ # `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`
56
+ video_dir = "notebooks/videos/car"
57
+ # 'output_dir' is the directory to save the annotated frames
58
+ output_dir = "./outputs"
59
+ # 'output_video_path' is the path to save the final video
60
+ output_video_path = "./outputs/output.mp4"
61
+ # create the output directory
62
+ CommonUtils.creat_dirs(output_dir)
63
+ mask_data_dir = os.path.join(output_dir, "mask_data")
64
+ json_data_dir = os.path.join(output_dir, "json_data")
65
+ result_dir = os.path.join(output_dir, "result")
66
+ CommonUtils.creat_dirs(mask_data_dir)
67
+ CommonUtils.creat_dirs(json_data_dir)
68
+ # scan all the JPEG frame names in this directory
69
+ frame_names = [
70
+ p for p in os.listdir(video_dir)
71
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG", ".png", ".PNG"]
72
+ ]
73
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
74
+
75
+ # init video predictor state
76
+ inference_state = video_predictor.init_state(video_path=video_dir)
77
+ step = 10 # the step to sample frames for Grounding DINO predictor
78
+
79
+ sam2_masks = MaskDictionaryModel()
80
+ PROMPT_TYPE_FOR_VIDEO = "mask" # box, mask or point
81
+ objects_count = 0
82
+
83
+ """
84
+ Step 2: Prompt Grounding DINO and SAM image predictor to get the box and mask for all frames
85
+ """
86
+ print("Total frames:", len(frame_names))
87
+ for start_frame_idx in range(0, len(frame_names), step):
88
+ # prompt grounding dino to get the box coordinates on specific frame
89
+ print("start_frame_idx", start_frame_idx)
90
+ # continue
91
+ img_path = os.path.join(video_dir, frame_names[start_frame_idx])
92
+ image = Image.open(img_path)
93
+ image_base_name = frame_names[start_frame_idx].split(".")[0]
94
+ mask_dict = MaskDictionaryModel(promote_type = PROMPT_TYPE_FOR_VIDEO, mask_name = f"mask_{image_base_name}.npy")
95
+
96
+ # run Grounding DINO 1.5 on the image
97
+
98
+ API_TOKEN_FOR_GD1_5 = "Your API token"
99
+
100
+ config = Config(API_TOKEN_FOR_GD1_5)
101
+ # Step 2: initialize the client
102
+ client = Client(config)
103
+
104
+ image_url = client.upload_file(img_path)
105
+ task = DetectionTask(
106
+ image_url=image_url,
107
+ prompts=[TextPrompt(text=text)],
108
+ targets=[DetectionTarget.BBox], # detect bbox
109
+ model=DetectionModel.GDino1_6_Pro, # detect with the GroundingDino-1.6-Pro model
110
+ )
111
+ client.run_task(task)
112
+ result = task.result
113
+
114
+ objects = result.objects # the list of detected objects
115
+ input_boxes = []
116
+ confidences = []
117
+ class_names = []
118
+
119
+ for idx, obj in enumerate(objects):
120
+ input_boxes.append(obj.bbox)
121
+ confidences.append(obj.score)
122
+ class_names.append(obj.category)
123
+
124
+ input_boxes = np.array(input_boxes)
125
+ OBJECTS = class_names
126
+ if input_boxes.shape[0] != 0:
127
+ # prompt SAM image predictor to get the mask for the object
128
+ image_predictor.set_image(np.array(image.convert("RGB")))
129
+
130
+ # prompt SAM 2 image predictor to get the mask for the object
131
+ masks, scores, logits = image_predictor.predict(
132
+ point_coords=None,
133
+ point_labels=None,
134
+ box=input_boxes,
135
+ multimask_output=False,
136
+ )
137
+ # convert the mask shape to (n, H, W)
138
+ if masks.ndim == 2:
139
+ masks = masks[None]
140
+ scores = scores[None]
141
+ logits = logits[None]
142
+ elif masks.ndim == 4:
143
+ masks = masks.squeeze(1)
144
+
145
+ """
146
+ Step 3: Register each object's positive points to video predictor
147
+ """
148
+
149
+ # If you are using point prompts, we uniformly sample positive points based on the mask
150
+ if mask_dict.promote_type == "mask":
151
+ mask_dict.add_new_frame_annotation(mask_list=torch.tensor(masks).to(device), box_list=torch.tensor(input_boxes), label_list=OBJECTS)
152
+ else:
153
+ raise NotImplementedError("SAM 2 video predictor only supports mask prompts")
154
+
155
+
156
+
157
+ objects_count = mask_dict.update_masks(tracking_annotation_dict=sam2_masks, iou_threshold=0.8, objects_count=objects_count)
158
+ print("objects_count", objects_count)
159
+
160
+ else:
161
+ print("No object detected in the frame, skipping merge for frame {}".format(frame_names[start_frame_idx]))
162
+ mask_dict = sam2_masks
163
+
164
+
165
+ """
166
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
167
+ """
168
+ if len(mask_dict.labels) == 0:
169
+ mask_dict.save_empty_mask_and_json(mask_data_dir, json_data_dir, image_name_list = frame_names[start_frame_idx:start_frame_idx+step])
170
+ print("No object detected in the frame, skip the frame {}".format(start_frame_idx))
171
+ continue
172
+ else:
173
+ video_predictor.reset_state(inference_state)
174
+
175
+ for object_id, object_info in mask_dict.labels.items():
176
+ frame_idx, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
177
+ inference_state,
178
+ start_frame_idx,
179
+ object_id,
180
+ object_info.mask,
181
+ )
182
+
183
+ video_segments = {} # output the following {step} frames tracking masks
184
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state, max_frame_num_to_track=step, start_frame_idx=start_frame_idx):
185
+ frame_masks = MaskDictionaryModel()
186
+
187
+ for i, out_obj_id in enumerate(out_obj_ids):
188
+ out_mask = (out_mask_logits[i] > 0.0) # .cpu().numpy()
189
+ object_info = ObjectInfo(instance_id = out_obj_id, mask = out_mask[0], class_name = mask_dict.get_target_class_name(out_obj_id))
190
+ object_info.update_box()
191
+ frame_masks.labels[out_obj_id] = object_info
192
+ image_base_name = frame_names[out_frame_idx].split(".")[0]
193
+ frame_masks.mask_name = f"mask_{image_base_name}.npy"
194
+ frame_masks.mask_height = out_mask.shape[-2]
195
+ frame_masks.mask_width = out_mask.shape[-1]
196
+
197
+ video_segments[out_frame_idx] = frame_masks
198
+ sam2_masks = copy.deepcopy(frame_masks)
199
+
200
+ print("video_segments:", len(video_segments))
201
+ """
202
+ Step 5: save the tracking masks and json files
203
+ """
204
+ for frame_idx, frame_masks_info in video_segments.items():
205
+ mask = frame_masks_info.labels
206
+ mask_img = torch.zeros(frame_masks_info.mask_height, frame_masks_info.mask_width)
207
+ for obj_id, obj_info in mask.items():
208
+ mask_img[obj_info.mask == True] = obj_id
209
+
210
+ mask_img = mask_img.numpy().astype(np.uint16)
211
+ np.save(os.path.join(mask_data_dir, frame_masks_info.mask_name), mask_img)
212
+
213
+ json_data = frame_masks_info.to_dict()
214
+ json_data_path = os.path.join(json_data_dir, frame_masks_info.mask_name.replace(".npy", ".json"))
215
+ with open(json_data_path, "w") as f:
216
+ json.dump(json_data, f)
217
+
218
+
219
+ """
220
+ Step 6: Draw the results and save the video
221
+ """
222
+ CommonUtils.draw_masks_and_box_with_supervision(video_dir, mask_data_dir, json_data_dir, result_dir)
223
+
224
+ create_video_from_images(result_dir, output_video_path, frame_rate=30)
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_continuous_id_plus.py ADDED
@@ -0,0 +1,247 @@
1
+ import os
2
+ import cv2
3
+ import torch
4
+ import numpy as np
5
+ import supervision as sv
6
+ from PIL import Image
7
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
8
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
9
+ from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
10
+ from utils.track_utils import sample_points_from_masks
11
+ from utils.video_utils import create_video_from_images
12
+ from utils.common_utils import CommonUtils
13
+ from utils.mask_dictionary_model import MaskDictionaryModel, ObjectInfo
14
+ import json
15
+ import copy
16
+
17
+ # This demo shows the continuous object tracking plus reverse tracking with Grounding DINO and SAM 2
18
+ """
19
+ Step 1: Environment settings and model initialization
20
+ """
21
+ # use bfloat16 for the entire notebook
22
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
23
+
24
+ if torch.cuda.get_device_properties(0).major >= 8:
25
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
26
+ torch.backends.cuda.matmul.allow_tf32 = True
27
+ torch.backends.cudnn.allow_tf32 = True
28
+
29
+ # init sam image predictor and video predictor model
30
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
31
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
32
+ device = "cuda" if torch.cuda.is_available() else "cpu"
33
+ print("device", device)
34
+
35
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
36
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint, device=device)
37
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
38
+
39
+
40
+ # init grounding dino model from huggingface
41
+ model_id = "IDEA-Research/grounding-dino-tiny"
42
+ processor = AutoProcessor.from_pretrained(model_id)
43
+ grounding_model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
44
+
45
+
46
+ # setup the input image and text prompt for SAM 2 and Grounding DINO
47
+ # VERY important: text queries need to be lowercased + end with a dot
48
+ text = "car."
49
+
50
+ # `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`
51
+ video_dir = "notebooks/videos/car"
52
+ # 'output_dir' is the directory to save the annotated frames
53
+ output_dir = "outputs"
54
+ # 'output_video_path' is the path to save the final video
55
+ output_video_path = "./outputs/output.mp4"
56
+ # create the output directory
57
+ mask_data_dir = os.path.join(output_dir, "mask_data")
58
+ json_data_dir = os.path.join(output_dir, "json_data")
59
+ result_dir = os.path.join(output_dir, "result")
60
+ CommonUtils.creat_dirs(mask_data_dir)
61
+ CommonUtils.creat_dirs(json_data_dir)
62
+ # scan all the JPEG frame names in this directory
63
+ frame_names = [
64
+ p for p in os.listdir(video_dir)
65
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG", ".png", ".PNG"]
66
+ ]
67
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
68
+
69
+ # init video predictor state
70
+ inference_state = video_predictor.init_state(video_path=video_dir)
71
+ step = 20 # the step to sample frames for Grounding DINO predictor
72
+
73
+ sam2_masks = MaskDictionaryModel()
74
+ PROMPT_TYPE_FOR_VIDEO = "mask" # box, mask or point
75
+ objects_count = 0
76
+ frame_object_count = {}
77
+ """
78
+ Step 2: Prompt Grounding DINO and SAM image predictor to get the box and mask for all frames
79
+ """
80
+ print("Total frames:", len(frame_names))
81
+ for start_frame_idx in range(0, len(frame_names), step):
82
+ # prompt grounding dino to get the box coordinates on specific frame
83
+ print("start_frame_idx", start_frame_idx)
84
+ # continue
85
+ img_path = os.path.join(video_dir, frame_names[start_frame_idx])
86
+ image = Image.open(img_path).convert("RGB")
87
+ image_base_name = frame_names[start_frame_idx].split(".")[0]
88
+ mask_dict = MaskDictionaryModel(promote_type = PROMPT_TYPE_FOR_VIDEO, mask_name = f"mask_{image_base_name}.npy")
89
+
90
+ # run Grounding DINO on the image
91
+ inputs = processor(images=image, text=text, return_tensors="pt").to(device)
92
+ with torch.no_grad():
93
+ outputs = grounding_model(**inputs)
94
+
95
+ results = processor.post_process_grounded_object_detection(
96
+ outputs,
97
+ inputs.input_ids,
98
+ box_threshold=0.25,
99
+ text_threshold=0.25,
100
+ target_sizes=[image.size[::-1]]
101
+ )
102
+
103
+ # prompt SAM image predictor to get the mask for the object
104
+ image_predictor.set_image(np.array(image.convert("RGB")))
105
+
106
+ # process the detection results
107
+ input_boxes = results[0]["boxes"] # .cpu().numpy()
108
+ # print("results[0]",results[0])
109
+ OBJECTS = results[0]["labels"]
110
+ if input_boxes.shape[0] != 0:
111
+
112
+ # prompt SAM 2 image predictor to get the mask for the object
113
+ masks, scores, logits = image_predictor.predict(
114
+ point_coords=None,
115
+ point_labels=None,
116
+ box=input_boxes,
117
+ multimask_output=False,
118
+ )
119
+ # convert the mask shape to (n, H, W)
120
+ if masks.ndim == 2:
121
+ masks = masks[None]
122
+ scores = scores[None]
123
+ logits = logits[None]
124
+ elif masks.ndim == 4:
125
+ masks = masks.squeeze(1)
126
+ """
127
+ Step 3: Register each object's positive points to video predictor
128
+ """
129
+
130
+ # If you are using point prompts, we uniformly sample positive points based on the mask
131
+ if mask_dict.promote_type == "mask":
132
+ mask_dict.add_new_frame_annotation(mask_list=torch.tensor(masks).to(device), box_list=torch.tensor(input_boxes), label_list=OBJECTS)
133
+ else:
134
+ raise NotImplementedError("SAM 2 video predictor only supports mask prompts")
135
+ else:
136
+ print("No object detected in the frame, skipping merge for frame {}".format(frame_names[start_frame_idx]))
137
+ mask_dict = sam2_masks
138
+
139
+ """
140
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
141
+ """
142
+ objects_count = mask_dict.update_masks(tracking_annotation_dict=sam2_masks, iou_threshold=0.8, objects_count=objects_count)
143
+ frame_object_count[start_frame_idx] = objects_count
144
+ print("objects_count", objects_count)
145
+
146
+ if len(mask_dict.labels) == 0:
147
+ mask_dict.save_empty_mask_and_json(mask_data_dir, json_data_dir, image_name_list = frame_names[start_frame_idx:start_frame_idx+step])
148
+ print("No object detected in the frame, skip the frame {}".format(start_frame_idx))
149
+ continue
150
+ else:
151
+ video_predictor.reset_state(inference_state)
152
+
153
+ for object_id, object_info in mask_dict.labels.items():
154
+ frame_idx, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
155
+ inference_state,
156
+ start_frame_idx,
157
+ object_id,
158
+ object_info.mask,
159
+ )
160
+
161
+ video_segments = {} # output the following {step} frames tracking masks
162
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state, max_frame_num_to_track=step, start_frame_idx=start_frame_idx):
163
+ frame_masks = MaskDictionaryModel()
164
+
165
+ for i, out_obj_id in enumerate(out_obj_ids):
166
+ out_mask = (out_mask_logits[i] > 0.0) # .cpu().numpy()
167
+ object_info = ObjectInfo(instance_id = out_obj_id, mask = out_mask[0], class_name = mask_dict.get_target_class_name(out_obj_id), logit=mask_dict.get_target_logit(out_obj_id))
168
+ object_info.update_box()
169
+ frame_masks.labels[out_obj_id] = object_info
170
+ image_base_name = frame_names[out_frame_idx].split(".")[0]
171
+ frame_masks.mask_name = f"mask_{image_base_name}.npy"
172
+ frame_masks.mask_height = out_mask.shape[-2]
173
+ frame_masks.mask_width = out_mask.shape[-1]
174
+
175
+ video_segments[out_frame_idx] = frame_masks
176
+ sam2_masks = copy.deepcopy(frame_masks)
177
+
178
+ print("video_segments:", len(video_segments))
179
+ """
180
+ Step 5: save the tracking masks and json files
181
+ """
182
+ for frame_idx, frame_masks_info in video_segments.items():
183
+ mask = frame_masks_info.labels
184
+ mask_img = torch.zeros(frame_masks_info.mask_height, frame_masks_info.mask_width)
185
+ for obj_id, obj_info in mask.items():
186
+ mask_img[obj_info.mask == True] = obj_id
187
+
188
+ mask_img = mask_img.numpy().astype(np.uint16)
189
+ np.save(os.path.join(mask_data_dir, frame_masks_info.mask_name), mask_img)
190
+
191
+ json_data_path = os.path.join(json_data_dir, frame_masks_info.mask_name.replace(".npy", ".json"))
192
+ frame_masks_info.to_json(json_data_path)
193
+
194
+
195
+ CommonUtils.draw_masks_and_box_with_supervision(video_dir, mask_data_dir, json_data_dir, result_dir)
196
+
197
+ print("try reverse tracking")
198
+ start_object_id = 0
199
+ object_info_dict = {}
200
+ for frame_idx, current_object_count in frame_object_count.items():
201
+ print("reverse tracking frame", frame_idx, frame_names[frame_idx])
202
+ if frame_idx != 0:
203
+ video_predictor.reset_state(inference_state)
204
+ image_base_name = frame_names[frame_idx].split(".")[0]
205
+ json_data_path = os.path.join(json_data_dir, f"mask_{image_base_name}.json")
206
+ json_data = MaskDictionaryModel().from_json(json_data_path)
207
+ mask_data_path = os.path.join(mask_data_dir, f"mask_{image_base_name}.npy")
208
+ mask_array = np.load(mask_data_path)
209
+ for object_id in range(start_object_id+1, current_object_count+1):
210
+ print("reverse tracking object", object_id)
211
+ object_info_dict[object_id] = json_data.labels[object_id]
212
+ video_predictor.add_new_mask(inference_state, frame_idx, object_id, mask_array == object_id)
213
+ start_object_id = current_object_count
214
+
215
+
216
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state, max_frame_num_to_track=step*2, start_frame_idx=frame_idx, reverse=True):
217
+ image_base_name = frame_names[out_frame_idx].split(".")[0]
218
+ json_data_path = os.path.join(json_data_dir, f"mask_{image_base_name}.json")
219
+ json_data = MaskDictionaryModel().from_json(json_data_path)
220
+ mask_data_path = os.path.join(mask_data_dir, f"mask_{image_base_name}.npy")
221
+ mask_array = np.load(mask_data_path)
222
+ # merge the reverse tracking masks with the original masks
223
+ for i, out_obj_id in enumerate(out_obj_ids):
224
+ out_mask = (out_mask_logits[i] > 0.0).cpu()
225
+ if out_mask.sum() == 0:
226
+ print("no mask for object", out_obj_id, "at frame", out_frame_idx)
227
+ continue
228
+ object_info = object_info_dict[out_obj_id]
229
+ object_info.mask = out_mask[0]
230
+ object_info.update_box()
231
+ json_data.labels[out_obj_id] = object_info
232
+ mask_array = np.where(mask_array != out_obj_id, mask_array, 0)
233
+ mask_array[object_info.mask] = out_obj_id
234
+
235
+ np.save(mask_data_path, mask_array)
236
+ json_data.to_json(json_data_path)
237
+
238
+
239
+
240
+
241
+
242
+ """
243
+ Step 6: Draw the results and save the video
244
+ """
245
+ CommonUtils.draw_masks_and_box_with_supervision(video_dir, mask_data_dir, json_data_dir, result_dir+"_reverse")
246
+
247
+ create_video_from_images(result_dir, output_video_path, frame_rate=15)
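The reverse-tracking merge above clears an object's old pixels from the per-frame id map and then writes its newly propagated footprint back in. A toy numpy illustration of those two update lines (the 4x4 arrays are hypothetical, not data from the repo):

import numpy as np

mask_array = np.array([[0, 0, 2, 2],
                       [1, 1, 2, 2],
                       [1, 1, 0, 0],
                       [0, 0, 0, 0]], dtype=np.uint16)   # pixel value = object id
out_obj_id = 2
new_mask = np.zeros_like(mask_array, dtype=bool)
new_mask[0:2, 1:3] = True                                 # where the reverse pass now places object 2

mask_array = np.where(mask_array != out_obj_id, mask_array, 0)  # clear object 2's old pixels
mask_array[new_mask] = out_obj_id                               # write its new footprint (overwrites overlaps)
print(mask_array)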
clone-IDEA-Research/Grounded-SAM-2/grounded_sam2_tracking_demo_with_gd1.5.py ADDED
@@ -0,0 +1,221 @@
1
+ # dds cloudapi for Grounding DINO 1.5
2
+ from dds_cloudapi_sdk import Config
3
+ from dds_cloudapi_sdk import Client
4
+ from dds_cloudapi_sdk import DetectionTask
5
+ from dds_cloudapi_sdk import TextPrompt
6
+ from dds_cloudapi_sdk import DetectionModel
7
+ from dds_cloudapi_sdk import DetectionTarget
8
+
9
+ import os
10
+ import cv2
11
+ import torch
12
+ import numpy as np
13
+ import supervision as sv
14
+ from PIL import Image
15
+ from sam2.build_sam import build_sam2_video_predictor, build_sam2
16
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
17
+ from utils.track_utils import sample_points_from_masks
18
+ from utils.video_utils import create_video_from_images
19
+
20
+
21
+ """
22
+ Step 1: Environment settings and model initialization for SAM 2
23
+ """
24
+ # use bfloat16 for the entire notebook
25
+ torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
26
+
27
+ if torch.cuda.get_device_properties(0).major >= 8:
28
+ # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
29
+ torch.backends.cuda.matmul.allow_tf32 = True
30
+ torch.backends.cudnn.allow_tf32 = True
31
+
32
+ # init sam image predictor and video predictor model
33
+ sam2_checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
34
+ model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
35
+
36
+ video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
37
+ sam2_image_model = build_sam2(model_cfg, sam2_checkpoint)
38
+ image_predictor = SAM2ImagePredictor(sam2_image_model)
39
+
40
+
41
+ # `video_dir` is a directory of JPEG frames with filenames like `<frame_index>.jpg`
42
+ video_dir = "notebooks/videos/bedroom"
43
+
44
+ # scan all the JPEG frame names in this directory
45
+ frame_names = [
46
+ p for p in os.listdir(video_dir)
47
+ if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG", ".png", ".PNG"]
48
+ ]
49
+ frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
50
+
51
+ # init video predictor state
52
+ inference_state = video_predictor.init_state(video_path=video_dir)
53
+
54
+ ann_frame_idx = 0 # the frame index we interact with
55
+ ann_obj_id = 1 # give a unique id to each object we interact with (it can be any integers)
56
+
57
+
58
+ """
59
+ Step 2: Prompt Grounding DINO 1.5 with Cloud API for box coordinates
60
+ """
61
+
62
+ # prompt grounding dino to get the box coordinates on specific frame
63
+ img_path = os.path.join(video_dir, frame_names[ann_frame_idx])
64
+ image = Image.open(img_path)
65
+
66
+ # Step 1: initialize the config
67
+ token = "Your API token"
68
+ config = Config(token)
69
+
70
+ # Step 2: initialize the client
71
+ client = Client(config)
72
+
73
+ # Step 3: run the task by DetectionTask class
74
+ # image_url = "https://algosplt.oss-cn-shenzhen.aliyuncs.com/test_files/tasks/detection/iron_man.jpg"
75
+ # if you are processing local image file, upload them to DDS server to get the image url
76
+ image_url = client.upload_file(img_path)
77
+
78
+ task = DetectionTask(
79
+ image_url=image_url,
80
+ prompts=[TextPrompt(text="children. pillow")],
81
+ targets=[DetectionTarget.BBox], # detect bbox
82
+ model=DetectionModel.GDino1_5_Pro, # detect with GroundingDino-1.5-Pro model
83
+ bbox_threshold=0.2,
84
+ )
85
+
86
+ client.run_task(task)
87
+ result = task.result
88
+
89
+ objects = result.objects # the list of detected objects
90
+
91
+
92
+ input_boxes = []
93
+ confidences = []
94
+ class_names = []
95
+
96
+ for idx, obj in enumerate(objects):
97
+ input_boxes.append(obj.bbox)
98
+ confidences.append(obj.score)
99
+ class_names.append(obj.category)
100
+
101
+ input_boxes = np.array(input_boxes)
102
+
103
+ print(input_boxes)
104
+
105
+ # prompt SAM image predictor to get the mask for the object
106
+ image_predictor.set_image(np.array(image.convert("RGB")))
107
+
108
+ # process the detection results
109
+ OBJECTS = class_names
110
+
111
+ print(OBJECTS)
112
+
113
+ # prompt SAM 2 image predictor to get the mask for the object
114
+ masks, scores, logits = image_predictor.predict(
115
+ point_coords=None,
116
+ point_labels=None,
117
+ box=input_boxes,
118
+ multimask_output=False,
119
+ )
120
+
121
+ # convert the mask shape to (n, H, W)
122
+ if masks.ndim == 2:
123
+ masks = masks[None]
124
+ scores = scores[None]
125
+ logits = logits[None]
126
+ elif masks.ndim == 4:
127
+ masks = masks.squeeze(1)
128
+
129
+ """
130
+ Step 3: Register each object's positive points to video predictor with separate add_new_points call
131
+ """
132
+
133
+ PROMPT_TYPE_FOR_VIDEO = "box" # or "point"
134
+
135
+ assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only supports point/box/mask prompts"
136
+
137
+ # If you are using point prompts, we uniformly sample positive points based on the mask
138
+ if PROMPT_TYPE_FOR_VIDEO == "point":
139
+ # sample the positive points from the mask for each object
140
+ all_sample_points = sample_points_from_masks(masks=masks, num_points=10)
141
+
142
+ for object_id, (label, points) in enumerate(zip(OBJECTS, all_sample_points), start=1):
143
+ labels = np.ones((points.shape[0]), dtype=np.int32)
144
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
145
+ inference_state=inference_state,
146
+ frame_idx=ann_frame_idx,
147
+ obj_id=object_id,
148
+ points=points,
149
+ labels=labels,
150
+ )
151
+ # Using box prompt
152
+ elif PROMPT_TYPE_FOR_VIDEO == "box":
153
+ for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
154
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
155
+ inference_state=inference_state,
156
+ frame_idx=ann_frame_idx,
157
+ obj_id=object_id,
158
+ box=box,
159
+ )
160
+ # Using mask prompt is a more straightforward way
161
+ elif PROMPT_TYPE_FOR_VIDEO == "mask":
162
+ for object_id, (label, mask) in enumerate(zip(OBJECTS, masks), start=1):
163
+ labels = np.ones((1), dtype=np.int32)
164
+ _, out_obj_ids, out_mask_logits = video_predictor.add_new_mask(
165
+ inference_state=inference_state,
166
+ frame_idx=ann_frame_idx,
167
+ obj_id=object_id,
168
+ mask=mask
169
+ )
170
+ else:
171
+ raise NotImplementedError("SAM 2 video predictor only supports point/box/mask prompts")
172
+
173
+
174
+
175
+ """
176
+ Step 4: Propagate the video predictor to get the segmentation results for each frame
177
+ """
178
+ video_segments = {} # video_segments contains the per-frame segmentation results
179
+ for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
180
+ video_segments[out_frame_idx] = {
181
+ out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
182
+ for i, out_obj_id in enumerate(out_obj_ids)
183
+ }
184
+
185
+ """
186
+ Step 5: Visualize the segment results across the video and save them
187
+ """
188
+
189
+ save_dir = "./tracking_results"
190
+
191
+ if not os.path.exists(save_dir):
192
+ os.makedirs(save_dir)
193
+
194
+ ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
195
+ for frame_idx, segments in video_segments.items():
196
+ img = cv2.imread(os.path.join(video_dir, frame_names[frame_idx]))
197
+
198
+ object_ids = list(segments.keys())
199
+ masks = list(segments.values())
200
+ masks = np.concatenate(masks, axis=0)
201
+
202
+ detections = sv.Detections(
203
+ xyxy=sv.mask_to_xyxy(masks), # (n, 4)
204
+ mask=masks, # (n, h, w)
205
+ class_id=np.array(object_ids, dtype=np.int32),
206
+ )
207
+ box_annotator = sv.BoxAnnotator()
208
+ annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
209
+ label_annotator = sv.LabelAnnotator()
210
+ annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
211
+ mask_annotator = sv.MaskAnnotator()
212
+ annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
213
+ cv2.imwrite(os.path.join(save_dir, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
214
+
215
+
216
+ """
217
+ Step 6: Convert the annotated frames to video
218
+ """
219
+
220
+ output_video_path = "./children_tracking_demo_video.mp4"
221
+ create_video_from_images(save_dir, output_video_path)
clone-IDEA-Research/Grounded-SAM-2/pyproject.toml ADDED
@@ -0,0 +1,6 @@
1
+ [build-system]
2
+ requires = [
3
+ "setuptools>=61.0",
4
+ "torch>=2.3.1",
5
+ ]
6
+ build-backend = "setuptools.build_meta"
clone-IDEA-Research/Grounded-SAM-2/sam2/__init__.py ADDED
@@ -0,0 +1,11 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ from hydra import initialize_config_module
8
+ from hydra.core.global_hydra import GlobalHydra
9
+
10
+ if not GlobalHydra.instance().is_initialized():
11
+ initialize_config_module("sam2", version_base="1.2")
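Because importing `sam2` initializes its Hydra config module (above), `build_sam2` can resolve relative config names. A minimal sketch using the same config/checkpoint pair that the demo scripts in this repo pass in:

import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
sam2_model = build_sam2(
    "configs/sam2.1/sam2.1_hiera_l.yaml",   # resolved through the Hydra config module
    "./checkpoints/sam2.1_hiera_large.pt",  # local checkpoint path used by the demos
    device=device,
)
image_predictor = SAM2ImagePredictor(sam2_model)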
clone-IDEA-Research/Grounded-SAM-2/sam2/automatic_mask_generator.py ADDED
@@ -0,0 +1,454 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ # Adapted from https://github.com/facebookresearch/segment-anything/blob/main/segment_anything/automatic_mask_generator.py
8
+ from typing import Any, Dict, List, Optional, Tuple
9
+
10
+ import numpy as np
11
+ import torch
12
+ from torchvision.ops.boxes import batched_nms, box_area # type: ignore
13
+
14
+ from sam2.modeling.sam2_base import SAM2Base
15
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
16
+ from sam2.utils.amg import (
17
+ area_from_rle,
18
+ batch_iterator,
19
+ batched_mask_to_box,
20
+ box_xyxy_to_xywh,
21
+ build_all_layer_point_grids,
22
+ calculate_stability_score,
23
+ coco_encode_rle,
24
+ generate_crop_boxes,
25
+ is_box_near_crop_edge,
26
+ mask_to_rle_pytorch,
27
+ MaskData,
28
+ remove_small_regions,
29
+ rle_to_mask,
30
+ uncrop_boxes_xyxy,
31
+ uncrop_masks,
32
+ uncrop_points,
33
+ )
34
+
35
+
36
+ class SAM2AutomaticMaskGenerator:
37
+ def __init__(
38
+ self,
39
+ model: SAM2Base,
40
+ points_per_side: Optional[int] = 32,
41
+ points_per_batch: int = 64,
42
+ pred_iou_thresh: float = 0.8,
43
+ stability_score_thresh: float = 0.95,
44
+ stability_score_offset: float = 1.0,
45
+ mask_threshold: float = 0.0,
46
+ box_nms_thresh: float = 0.7,
47
+ crop_n_layers: int = 0,
48
+ crop_nms_thresh: float = 0.7,
49
+ crop_overlap_ratio: float = 512 / 1500,
50
+ crop_n_points_downscale_factor: int = 1,
51
+ point_grids: Optional[List[np.ndarray]] = None,
52
+ min_mask_region_area: int = 0,
53
+ output_mode: str = "binary_mask",
54
+ use_m2m: bool = False,
55
+ multimask_output: bool = True,
56
+ **kwargs,
57
+ ) -> None:
58
+ """
59
+ Using a SAM 2 model, generates masks for the entire image.
60
+ Generates a grid of point prompts over the image, then filters
61
+ low quality and duplicate masks. The default settings are chosen
62
+ for SAM 2 with a HieraL backbone.
63
+
64
+ Arguments:
65
+ model (Sam): The SAM 2 model to use for mask prediction.
66
+ points_per_side (int or None): The number of points to be sampled
67
+ along one side of the image. The total number of points is
68
+ points_per_side**2. If None, 'point_grids' must provide explicit
69
+ point sampling.
70
+ points_per_batch (int): Sets the number of points run simultaneously
71
+ by the model. Higher numbers may be faster but use more GPU memory.
72
+ pred_iou_thresh (float): A filtering threshold in [0,1], using the
73
+ model's predicted mask quality.
74
+ stability_score_thresh (float): A filtering threshold in [0,1], using
75
+ the stability of the mask under changes to the cutoff used to binarize
76
+ the model's mask predictions.
77
+ stability_score_offset (float): The amount to shift the cutoff when
78
+ calculated the stability score.
79
+ mask_threshold (float): Threshold for binarizing the mask logits
80
+ box_nms_thresh (float): The box IoU cutoff used by non-maximal
81
+ suppression to filter duplicate masks.
82
+ crop_n_layers (int): If >0, mask prediction will be run again on
83
+ crops of the image. Sets the number of layers to run, where each
84
+ layer has 2**i_layer number of image crops.
85
+ crop_nms_thresh (float): The box IoU cutoff used by non-maximal
86
+ suppression to filter duplicate masks between different crops.
87
+ crop_overlap_ratio (float): Sets the degree to which crops overlap.
88
+ In the first crop layer, crops will overlap by this fraction of
89
+ the image length. Later layers with more crops scale down this overlap.
90
+ crop_n_points_downscale_factor (int): The number of points-per-side
91
+ sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
92
+ point_grids (list(np.ndarray) or None): A list over explicit grids
93
+ of points used for sampling, normalized to [0,1]. The nth grid in the
94
+ list is used in the nth crop layer. Exclusive with points_per_side.
95
+ min_mask_region_area (int): If >0, postprocessing will be applied
96
+ to remove disconnected regions and holes in masks with area smaller
97
+ than min_mask_region_area. Requires opencv.
98
+ output_mode (str): The form masks are returned in. Can be 'binary_mask',
99
+ 'uncompressed_rle', or 'coco_rle'. 'coco_rle' requires pycocotools.
100
+ For large resolutions, 'binary_mask' may consume large amounts of
101
+ memory.
102
+ use_m2m (bool): Whether to add a one step refinement using previous mask predictions.
103
+ multimask_output (bool): Whether to output multimask at each point of the grid.
104
+ """
105
+
106
+ assert (points_per_side is None) != (
107
+ point_grids is None
108
+ ), "Exactly one of points_per_side or point_grid must be provided."
109
+ if points_per_side is not None:
110
+ self.point_grids = build_all_layer_point_grids(
111
+ points_per_side,
112
+ crop_n_layers,
113
+ crop_n_points_downscale_factor,
114
+ )
115
+ elif point_grids is not None:
116
+ self.point_grids = point_grids
117
+ else:
118
+ raise ValueError("Can't have both points_per_side and point_grid be None.")
119
+
120
+ assert output_mode in [
121
+ "binary_mask",
122
+ "uncompressed_rle",
123
+ "coco_rle",
124
+ ], f"Unknown output_mode {output_mode}."
125
+ if output_mode == "coco_rle":
126
+ try:
127
+ from pycocotools import mask as mask_utils # type: ignore # noqa: F401
128
+ except ImportError as e:
129
+ print("Please install pycocotools")
130
+ raise e
131
+
132
+ self.predictor = SAM2ImagePredictor(
133
+ model,
134
+ max_hole_area=min_mask_region_area,
135
+ max_sprinkle_area=min_mask_region_area,
136
+ )
137
+ self.points_per_batch = points_per_batch
138
+ self.pred_iou_thresh = pred_iou_thresh
139
+ self.stability_score_thresh = stability_score_thresh
140
+ self.stability_score_offset = stability_score_offset
141
+ self.mask_threshold = mask_threshold
142
+ self.box_nms_thresh = box_nms_thresh
143
+ self.crop_n_layers = crop_n_layers
144
+ self.crop_nms_thresh = crop_nms_thresh
145
+ self.crop_overlap_ratio = crop_overlap_ratio
146
+ self.crop_n_points_downscale_factor = crop_n_points_downscale_factor
147
+ self.min_mask_region_area = min_mask_region_area
148
+ self.output_mode = output_mode
149
+ self.use_m2m = use_m2m
150
+ self.multimask_output = multimask_output
151
+
152
+ @classmethod
153
+ def from_pretrained(cls, model_id: str, **kwargs) -> "SAM2AutomaticMaskGenerator":
154
+ """
155
+ Load a pretrained model from the Hugging Face hub.
156
+
157
+ Arguments:
158
+ model_id (str): The Hugging Face repository ID.
159
+ **kwargs: Additional arguments to pass to the model constructor.
160
+
161
+ Returns:
162
+ (SAM2AutomaticMaskGenerator): The loaded model.
163
+ """
164
+ from sam2.build_sam import build_sam2_hf
165
+
166
+ sam_model = build_sam2_hf(model_id, **kwargs)
167
+ return cls(sam_model, **kwargs)
168
+
169
+ @torch.no_grad()
170
+ def generate(self, image: np.ndarray) -> List[Dict[str, Any]]:
171
+ """
172
+ Generates masks for the given image.
173
+
174
+ Arguments:
175
+ image (np.ndarray): The image to generate masks for, in HWC uint8 format.
176
+
177
+ Returns:
178
+ list(dict(str, any)): A list over records for masks. Each record is
179
+ a dict containing the following keys:
180
+ segmentation (dict(str, any) or np.ndarray): The mask. If
181
+ output_mode='binary_mask', is an array of shape HW. Otherwise,
182
+ is a dictionary containing the RLE.
183
+ bbox (list(float)): The box around the mask, in XYWH format.
184
+ area (int): The area in pixels of the mask.
185
+ predicted_iou (float): The model's own prediction of the mask's
186
+ quality. This is filtered by the pred_iou_thresh parameter.
187
+ point_coords (list(list(float))): The point coordinates input
188
+ to the model to generate this mask.
189
+ stability_score (float): A measure of the mask's quality. This
190
+ is filtered on using the stability_score_thresh parameter.
191
+ crop_box (list(float)): The crop of the image used to generate
192
+ the mask, given in XYWH format.
193
+ """
194
+
195
+ # Generate masks
196
+ mask_data = self._generate_masks(image)
197
+
198
+ # Encode masks
199
+ if self.output_mode == "coco_rle":
200
+ mask_data["segmentations"] = [
201
+ coco_encode_rle(rle) for rle in mask_data["rles"]
202
+ ]
203
+ elif self.output_mode == "binary_mask":
204
+ mask_data["segmentations"] = [rle_to_mask(rle) for rle in mask_data["rles"]]
205
+ else:
206
+ mask_data["segmentations"] = mask_data["rles"]
207
+
208
+ # Write mask records
209
+ curr_anns = []
210
+ for idx in range(len(mask_data["segmentations"])):
211
+ ann = {
212
+ "segmentation": mask_data["segmentations"][idx],
213
+ "area": area_from_rle(mask_data["rles"][idx]),
214
+ "bbox": box_xyxy_to_xywh(mask_data["boxes"][idx]).tolist(),
215
+ "predicted_iou": mask_data["iou_preds"][idx].item(),
216
+ "point_coords": [mask_data["points"][idx].tolist()],
217
+ "stability_score": mask_data["stability_score"][idx].item(),
218
+ "crop_box": box_xyxy_to_xywh(mask_data["crop_boxes"][idx]).tolist(),
219
+ }
220
+ curr_anns.append(ann)
221
+
222
+ return curr_anns
223
+
224
+ def _generate_masks(self, image: np.ndarray) -> MaskData:
225
+ orig_size = image.shape[:2]
226
+ crop_boxes, layer_idxs = generate_crop_boxes(
227
+ orig_size, self.crop_n_layers, self.crop_overlap_ratio
228
+ )
229
+
230
+ # Iterate over image crops
231
+ data = MaskData()
232
+ for crop_box, layer_idx in zip(crop_boxes, layer_idxs):
233
+ crop_data = self._process_crop(image, crop_box, layer_idx, orig_size)
234
+ data.cat(crop_data)
235
+
236
+ # Remove duplicate masks between crops
237
+ if len(crop_boxes) > 1:
238
+ # Prefer masks from smaller crops
239
+ scores = 1 / box_area(data["crop_boxes"])
240
+ scores = scores.to(data["boxes"].device)
241
+ keep_by_nms = batched_nms(
242
+ data["boxes"].float(),
243
+ scores,
244
+ torch.zeros_like(data["boxes"][:, 0]), # categories
245
+ iou_threshold=self.crop_nms_thresh,
246
+ )
247
+ data.filter(keep_by_nms)
248
+ data.to_numpy()
249
+ return data
250
+
251
+ def _process_crop(
252
+ self,
253
+ image: np.ndarray,
254
+ crop_box: List[int],
255
+ crop_layer_idx: int,
256
+ orig_size: Tuple[int, ...],
257
+ ) -> MaskData:
258
+ # Crop the image and calculate embeddings
259
+ x0, y0, x1, y1 = crop_box
260
+ cropped_im = image[y0:y1, x0:x1, :]
261
+ cropped_im_size = cropped_im.shape[:2]
262
+ self.predictor.set_image(cropped_im)
263
+
264
+ # Get points for this crop
265
+ points_scale = np.array(cropped_im_size)[None, ::-1]
266
+ points_for_image = self.point_grids[crop_layer_idx] * points_scale
267
+
268
+ # Generate masks for this crop in batches
269
+ data = MaskData()
270
+ for (points,) in batch_iterator(self.points_per_batch, points_for_image):
271
+ batch_data = self._process_batch(
272
+ points, cropped_im_size, crop_box, orig_size, normalize=True
273
+ )
274
+ data.cat(batch_data)
275
+ del batch_data
276
+ self.predictor.reset_predictor()
277
+
278
+ # Remove duplicates within this crop.
279
+ keep_by_nms = batched_nms(
280
+ data["boxes"].float(),
281
+ data["iou_preds"],
282
+ torch.zeros_like(data["boxes"][:, 0]), # categories
283
+ iou_threshold=self.box_nms_thresh,
284
+ )
285
+ data.filter(keep_by_nms)
286
+
287
+ # Return to the original image frame
288
+ data["boxes"] = uncrop_boxes_xyxy(data["boxes"], crop_box)
289
+ data["points"] = uncrop_points(data["points"], crop_box)
290
+ data["crop_boxes"] = torch.tensor([crop_box for _ in range(len(data["rles"]))])
291
+
292
+ return data
293
+
294
+ def _process_batch(
295
+ self,
296
+ points: np.ndarray,
297
+ im_size: Tuple[int, ...],
298
+ crop_box: List[int],
299
+ orig_size: Tuple[int, ...],
300
+ normalize=False,
301
+ ) -> MaskData:
302
+ orig_h, orig_w = orig_size
303
+
304
+ # Run model on this batch
305
+ points = torch.as_tensor(
306
+ points, dtype=torch.float32, device=self.predictor.device
307
+ )
308
+ in_points = self.predictor._transforms.transform_coords(
309
+ points, normalize=normalize, orig_hw=im_size
310
+ )
311
+ in_labels = torch.ones(
312
+ in_points.shape[0], dtype=torch.int, device=in_points.device
313
+ )
314
+ masks, iou_preds, low_res_masks = self.predictor._predict(
315
+ in_points[:, None, :],
316
+ in_labels[:, None],
317
+ multimask_output=self.multimask_output,
318
+ return_logits=True,
319
+ )
320
+
321
+ # Serialize predictions and store in MaskData
322
+ data = MaskData(
323
+ masks=masks.flatten(0, 1),
324
+ iou_preds=iou_preds.flatten(0, 1),
325
+ points=points.repeat_interleave(masks.shape[1], dim=0),
326
+ low_res_masks=low_res_masks.flatten(0, 1),
327
+ )
328
+ del masks
329
+
330
+ if not self.use_m2m:
331
+ # Filter by predicted IoU
332
+ if self.pred_iou_thresh > 0.0:
333
+ keep_mask = data["iou_preds"] > self.pred_iou_thresh
334
+ data.filter(keep_mask)
335
+
336
+ # Calculate and filter by stability score
337
+ data["stability_score"] = calculate_stability_score(
338
+ data["masks"], self.mask_threshold, self.stability_score_offset
339
+ )
340
+ if self.stability_score_thresh > 0.0:
341
+ keep_mask = data["stability_score"] >= self.stability_score_thresh
342
+ data.filter(keep_mask)
343
+ else:
344
+ # One step refinement using previous mask predictions
345
+ in_points = self.predictor._transforms.transform_coords(
346
+ data["points"], normalize=normalize, orig_hw=im_size
347
+ )
348
+ labels = torch.ones(
349
+ in_points.shape[0], dtype=torch.int, device=in_points.device
350
+ )
351
+ masks, ious = self.refine_with_m2m(
352
+ in_points, labels, data["low_res_masks"], self.points_per_batch
353
+ )
354
+ data["masks"] = masks.squeeze(1)
355
+ data["iou_preds"] = ious.squeeze(1)
356
+
357
+ if self.pred_iou_thresh > 0.0:
358
+ keep_mask = data["iou_preds"] > self.pred_iou_thresh
359
+ data.filter(keep_mask)
360
+
361
+ data["stability_score"] = calculate_stability_score(
362
+ data["masks"], self.mask_threshold, self.stability_score_offset
363
+ )
364
+ if self.stability_score_thresh > 0.0:
365
+ keep_mask = data["stability_score"] >= self.stability_score_thresh
366
+ data.filter(keep_mask)
367
+
368
+ # Threshold masks and calculate boxes
369
+ data["masks"] = data["masks"] > self.mask_threshold
370
+ data["boxes"] = batched_mask_to_box(data["masks"])
371
+
372
+ # Filter boxes that touch crop boundaries
373
+ keep_mask = ~is_box_near_crop_edge(
374
+ data["boxes"], crop_box, [0, 0, orig_w, orig_h]
375
+ )
376
+ if not torch.all(keep_mask):
377
+ data.filter(keep_mask)
378
+
379
+ # Compress to RLE
380
+ data["masks"] = uncrop_masks(data["masks"], crop_box, orig_h, orig_w)
381
+ data["rles"] = mask_to_rle_pytorch(data["masks"])
382
+ del data["masks"]
383
+
384
+ return data
385
+
386
+ @staticmethod
387
+ def postprocess_small_regions(
388
+ mask_data: MaskData, min_area: int, nms_thresh: float
389
+ ) -> MaskData:
390
+ """
391
+ Removes small disconnected regions and holes in masks, then reruns
392
+ box NMS to remove any new duplicates.
393
+
394
+ Edits mask_data in place.
395
+
396
+ Requires open-cv as a dependency.
397
+ """
398
+ if len(mask_data["rles"]) == 0:
399
+ return mask_data
400
+
401
+ # Filter small disconnected regions and holes
402
+ new_masks = []
403
+ scores = []
404
+ for rle in mask_data["rles"]:
405
+ mask = rle_to_mask(rle)
406
+
407
+ mask, changed = remove_small_regions(mask, min_area, mode="holes")
408
+ unchanged = not changed
409
+ mask, changed = remove_small_regions(mask, min_area, mode="islands")
410
+ unchanged = unchanged and not changed
411
+
412
+ new_masks.append(torch.as_tensor(mask).unsqueeze(0))
413
+ # Give score=0 to changed masks and score=1 to unchanged masks
414
+ # so NMS will prefer ones that didn't need postprocessing
415
+ scores.append(float(unchanged))
416
+
417
+ # Recalculate boxes and remove any new duplicates
418
+ masks = torch.cat(new_masks, dim=0)
419
+ boxes = batched_mask_to_box(masks)
420
+ keep_by_nms = batched_nms(
421
+ boxes.float(),
422
+ torch.as_tensor(scores),
423
+ torch.zeros_like(boxes[:, 0]), # categories
424
+ iou_threshold=nms_thresh,
425
+ )
426
+
427
+ # Only recalculate RLEs for masks that have changed
428
+ for i_mask in keep_by_nms:
429
+ if scores[i_mask] == 0.0:
430
+ mask_torch = masks[i_mask].unsqueeze(0)
431
+ mask_data["rles"][i_mask] = mask_to_rle_pytorch(mask_torch)[0]
432
+ mask_data["boxes"][i_mask] = boxes[i_mask] # update res directly
433
+ mask_data.filter(keep_by_nms)
434
+
435
+ return mask_data
436
+
437
+ def refine_with_m2m(self, points, point_labels, low_res_masks, points_per_batch):
438
+ new_masks = []
439
+ new_iou_preds = []
440
+
441
+ for cur_points, cur_point_labels, low_res_mask in batch_iterator(
442
+ points_per_batch, points, point_labels, low_res_masks
443
+ ):
444
+ best_masks, best_iou_preds, _ = self.predictor._predict(
445
+ cur_points[:, None, :],
446
+ cur_point_labels[:, None],
447
+ mask_input=low_res_mask[:, None, :],
448
+ multimask_output=False,
449
+ return_logits=True,
450
+ )
451
+ new_masks.append(best_masks)
452
+ new_iou_preds.append(best_iou_preds)
453
+ masks = torch.cat(new_masks, dim=0)
454
+ return masks, torch.cat(new_iou_preds, dim=0)
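A minimal usage sketch for the class above: `from_pretrained` resolves the config/checkpoint pair listed in `build_sam.py`'s `HF_MODEL_ID_TO_FILENAMES`, and the zero-filled image is only a placeholder for a real HWC uint8 image (assumes a CUDA device, as in the demos):

import numpy as np
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

mask_generator = SAM2AutomaticMaskGenerator.from_pretrained("facebook/sam2.1-hiera-large")

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder HWC uint8 image
masks = mask_generator.generate(image)
# each record contains "segmentation", "bbox" (XYWH), "area",
# "predicted_iou", "point_coords", "stability_score", "crop_box"
for record in masks[:3]:
    print(record["bbox"], record["predicted_iou"])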
clone-IDEA-Research/Grounded-SAM-2/sam2/build_sam.py ADDED
@@ -0,0 +1,167 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ import logging
8
+ import os
9
+
10
+ import torch
11
+ from hydra import compose
12
+ from hydra.utils import instantiate
13
+ from omegaconf import OmegaConf
14
+
15
+ import sam2
16
+
17
+ # Check if the user is running Python from the parent directory of the sam2 repo
18
+ # (i.e. the directory where this repo is cloned into) -- this is not supported since
19
+ # it could shadow the sam2 package and cause issues.
20
+ if os.path.isdir(os.path.join(sam2.__path__[0], "sam2")):
21
+ # If the user has "sam2/sam2" in their path, they are likely importing the repo itself
22
+ # as "sam2" rather than importing the "sam2" python package (i.e. "sam2/sam2" directory).
23
+ # This typically happens because the user is running Python from the parent directory
24
+ # that contains the sam2 repo they cloned.
25
+ raise RuntimeError(
26
+ "You're likely running Python from the parent directory of the sam2 repository "
27
+ "(i.e. the directory where https://github.com/facebookresearch/sam2 is cloned into). "
28
+ "This is not supported since the `sam2` Python package could be shadowed by the "
29
+ "repository name (the repository is also named `sam2` and contains the Python package "
30
+ "in `sam2/sam2`). Please run Python from another directory (e.g. from the repo dir "
31
+ "rather than its parent dir, or from your home directory) after installing SAM 2."
32
+ )
33
+
34
+
35
+ HF_MODEL_ID_TO_FILENAMES = {
36
+ "facebook/sam2-hiera-tiny": (
37
+ "configs/sam2/sam2_hiera_t.yaml",
38
+ "sam2_hiera_tiny.pt",
39
+ ),
40
+ "facebook/sam2-hiera-small": (
41
+ "configs/sam2/sam2_hiera_s.yaml",
42
+ "sam2_hiera_small.pt",
43
+ ),
44
+ "facebook/sam2-hiera-base-plus": (
45
+ "configs/sam2/sam2_hiera_b+.yaml",
46
+ "sam2_hiera_base_plus.pt",
47
+ ),
48
+ "facebook/sam2-hiera-large": (
49
+ "configs/sam2/sam2_hiera_l.yaml",
50
+ "sam2_hiera_large.pt",
51
+ ),
52
+ "facebook/sam2.1-hiera-tiny": (
53
+ "configs/sam2.1/sam2.1_hiera_t.yaml",
54
+ "sam2.1_hiera_tiny.pt",
55
+ ),
56
+ "facebook/sam2.1-hiera-small": (
57
+ "configs/sam2.1/sam2.1_hiera_s.yaml",
58
+ "sam2.1_hiera_small.pt",
59
+ ),
60
+ "facebook/sam2.1-hiera-base-plus": (
61
+ "configs/sam2.1/sam2.1_hiera_b+.yaml",
62
+ "sam2.1_hiera_base_plus.pt",
63
+ ),
64
+ "facebook/sam2.1-hiera-large": (
65
+ "configs/sam2.1/sam2.1_hiera_l.yaml",
66
+ "sam2.1_hiera_large.pt",
67
+ ),
68
+ }
69
+
70
+
71
+ def build_sam2(
72
+ config_file,
73
+ ckpt_path=None,
74
+ device="cuda",
75
+ mode="eval",
76
+ hydra_overrides_extra=[],
77
+ apply_postprocessing=True,
78
+ **kwargs,
79
+ ):
80
+
81
+ if apply_postprocessing:
82
+ hydra_overrides_extra = hydra_overrides_extra.copy()
83
+ hydra_overrides_extra += [
84
+ # dynamically fall back to multi-mask if the single mask is not stable
85
+ "++model.sam_mask_decoder_extra_args.dynamic_multimask_via_stability=true",
86
+ "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_delta=0.05",
87
+ "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_thresh=0.98",
88
+ ]
89
+ # Read config and init model
90
+ cfg = compose(config_name=config_file, overrides=hydra_overrides_extra)
91
+ OmegaConf.resolve(cfg)
92
+ model = instantiate(cfg.model, _recursive_=True)
93
+ _load_checkpoint(model, ckpt_path)
94
+ model = model.to(device)
95
+ if mode == "eval":
96
+ model.eval()
97
+ return model
98
+
99
+
100
+ def build_sam2_video_predictor(
101
+ config_file,
102
+ ckpt_path=None,
103
+ device="cuda",
104
+ mode="eval",
105
+ hydra_overrides_extra=[],
106
+ apply_postprocessing=True,
107
+ **kwargs,
108
+ ):
109
+ hydra_overrides = [
110
+ "++model._target_=sam2.sam2_video_predictor.SAM2VideoPredictor",
111
+ ]
112
+ if apply_postprocessing:
113
+ hydra_overrides_extra = hydra_overrides_extra.copy()
114
+ hydra_overrides_extra += [
115
+ # dynamically fall back to multi-mask if the single mask is not stable
116
+ "++model.sam_mask_decoder_extra_args.dynamic_multimask_via_stability=true",
117
+ "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_delta=0.05",
118
+ "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_thresh=0.98",
119
+ # binarize the sigmoid mask logits on interacted frames with clicks in the memory encoder so that the encoded masks are exactly what users see from clicking
120
+ "++model.binarize_mask_from_pts_for_mem_enc=true",
121
+ # fill small holes in the low-res masks up to `fill_hole_area` (before resizing them to the original video resolution)
122
+ "++model.fill_hole_area=8",
123
+ ]
124
+ hydra_overrides.extend(hydra_overrides_extra)
125
+
126
+ # Read config and init model
127
+ cfg = compose(config_name=config_file, overrides=hydra_overrides)
128
+ OmegaConf.resolve(cfg)
129
+ model = instantiate(cfg.model, _recursive_=True)
130
+ _load_checkpoint(model, ckpt_path)
131
+ model = model.to(device)
132
+ if mode == "eval":
133
+ model.eval()
134
+ return model
135
+
136
+
137
+ def _hf_download(model_id):
138
+ from huggingface_hub import hf_hub_download
139
+
140
+ config_name, checkpoint_name = HF_MODEL_ID_TO_FILENAMES[model_id]
141
+ ckpt_path = hf_hub_download(repo_id=model_id, filename=checkpoint_name)
142
+ return config_name, ckpt_path
143
+
144
+
145
+ def build_sam2_hf(model_id, **kwargs):
146
+ config_name, ckpt_path = _hf_download(model_id)
147
+ return build_sam2(config_file=config_name, ckpt_path=ckpt_path, **kwargs)
148
+
149
+
150
+ def build_sam2_video_predictor_hf(model_id, **kwargs):
151
+ config_name, ckpt_path = _hf_download(model_id)
152
+ return build_sam2_video_predictor(
153
+ config_file=config_name, ckpt_path=ckpt_path, **kwargs
154
+ )
155
+
156
+
157
+ def _load_checkpoint(model, ckpt_path):
158
+ if ckpt_path is not None:
159
+ sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)["model"]
160
+ missing_keys, unexpected_keys = model.load_state_dict(sd)
161
+ if missing_keys:
162
+ logging.error(missing_keys)
163
+ raise RuntimeError()
164
+ if unexpected_keys:
165
+ logging.error(unexpected_keys)
166
+ raise RuntimeError()
167
+ logging.info("Loaded checkpoint successfully")
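
For reference, a minimal usage sketch of the builders defined above. The config name and checkpoint path are illustrative assumptions (substitute whichever config/checkpoint this repo actually ships); everything else mirrors the functions in this file:

import torch
from sam2.build_sam import build_sam2, build_sam2_video_predictor

model_cfg = "sam2_hiera_l.yaml"                   # assumed config name
checkpoint = "./checkpoints/sam2_hiera_large.pt"  # assumed checkpoint path
device = "cuda" if torch.cuda.is_available() else "cpu"

sam2_model = build_sam2(model_cfg, checkpoint, device=device)                        # eval-mode SAM2Base
video_predictor = build_sam2_video_predictor(model_cfg, checkpoint, device=device)   # SAM2VideoPredictor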
clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_b+.yaml ADDED
@@ -0,0 +1,113 @@
1
+ # @package _global_
2
+
3
+ # Model
4
+ model:
5
+ _target_: sam2.modeling.sam2_base.SAM2Base
6
+ image_encoder:
7
+ _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
+ scalp: 1
9
+ trunk:
10
+ _target_: sam2.modeling.backbones.hieradet.Hiera
11
+ embed_dim: 112
12
+ num_heads: 2
13
+ neck:
14
+ _target_: sam2.modeling.backbones.image_encoder.FpnNeck
15
+ position_encoding:
16
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
17
+ num_pos_feats: 256
18
+ normalize: true
19
+ scale: null
20
+ temperature: 10000
21
+ d_model: 256
22
+ backbone_channel_list: [896, 448, 224, 112]
23
+ fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
24
+ fpn_interp_model: nearest
25
+
26
+ memory_attention:
27
+ _target_: sam2.modeling.memory_attention.MemoryAttention
28
+ d_model: 256
29
+ pos_enc_at_input: true
30
+ layer:
31
+ _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
32
+ activation: relu
33
+ dim_feedforward: 2048
34
+ dropout: 0.1
35
+ pos_enc_at_attn: false
36
+ self_attention:
37
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
38
+ rope_theta: 10000.0
39
+ feat_sizes: [32, 32]
40
+ embedding_dim: 256
41
+ num_heads: 1
42
+ downsample_rate: 1
43
+ dropout: 0.1
44
+ d_model: 256
45
+ pos_enc_at_cross_attn_keys: true
46
+ pos_enc_at_cross_attn_queries: false
47
+ cross_attention:
48
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
49
+ rope_theta: 10000.0
50
+ feat_sizes: [32, 32]
51
+ rope_k_repeat: True
52
+ embedding_dim: 256
53
+ num_heads: 1
54
+ downsample_rate: 1
55
+ dropout: 0.1
56
+ kv_in_dim: 64
57
+ num_layers: 4
58
+
59
+ memory_encoder:
60
+ _target_: sam2.modeling.memory_encoder.MemoryEncoder
61
+ out_dim: 64
62
+ position_encoding:
63
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
64
+ num_pos_feats: 64
65
+ normalize: true
66
+ scale: null
67
+ temperature: 10000
68
+ mask_downsampler:
69
+ _target_: sam2.modeling.memory_encoder.MaskDownSampler
70
+ kernel_size: 3
71
+ stride: 2
72
+ padding: 1
73
+ fuser:
74
+ _target_: sam2.modeling.memory_encoder.Fuser
75
+ layer:
76
+ _target_: sam2.modeling.memory_encoder.CXBlock
77
+ dim: 256
78
+ kernel_size: 7
79
+ padding: 3
80
+ layer_scale_init_value: 1e-6
81
+ use_dwconv: True # depth-wise convs
82
+ num_layers: 2
83
+
84
+ num_maskmem: 7
85
+ image_size: 1024
86
+ # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
87
+ sigmoid_scale_for_mem_enc: 20.0
88
+ sigmoid_bias_for_mem_enc: -10.0
89
+ use_mask_input_as_output_without_sam: true
90
+ # Memory
91
+ directly_add_no_mem_embed: true
92
+ # use high-resolution feature map in the SAM mask decoder
93
+ use_high_res_features_in_sam: true
94
+ # output 3 masks on the first click on initial conditioning frames
95
+ multimask_output_in_sam: true
96
+ # SAM heads
97
+ iou_prediction_use_sigmoid: True
98
+ # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
99
+ use_obj_ptrs_in_encoder: true
100
+ add_tpos_enc_to_obj_ptrs: false
101
+ only_obj_ptrs_in_the_past_for_eval: true
102
+ # object occlusion prediction
103
+ pred_obj_scores: true
104
+ pred_obj_scores_mlp: true
105
+ fixed_no_obj_ptr: true
106
+ # multimask tracking settings
107
+ multimask_output_for_tracking: true
108
+ use_multimask_token_for_obj_ptr: true
109
+ multimask_min_pt_num: 0
110
+ multimask_max_pt_num: 1
111
+ use_mlp_for_obj_ptr_proj: true
112
+ # Compilation flag
113
+ compile_image_encoder: False
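
A rough sketch of how build_sam.py (above) consumes a config like this one: Hydra composes the YAML, OmegaConf resolves it, and `cfg.model` is instantiated recursively. This assumes that importing the sam2 package registers its Hydra config module (as its __init__.py is expected to do); the config name is the file above.

import sam2  # importing the package is assumed to register its Hydra config search path
from hydra import compose
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = compose(config_name="sam2_hiera_b+.yaml")
OmegaConf.resolve(cfg)
model = instantiate(cfg.model, _recursive_=True)   # SAM2Base with the Hiera-B+ trunk configured above
print(sum(p.numel() for p in model.parameters()))  # rough parameter count of the assembled model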
clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_l.yaml ADDED
@@ -0,0 +1,117 @@
1
+ # @package _global_
2
+
3
+ # Model
4
+ model:
5
+ _target_: sam2.modeling.sam2_base.SAM2Base
6
+ image_encoder:
7
+ _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
+ scalp: 1
9
+ trunk:
10
+ _target_: sam2.modeling.backbones.hieradet.Hiera
11
+ embed_dim: 144
12
+ num_heads: 2
13
+ stages: [2, 6, 36, 4]
14
+ global_att_blocks: [23, 33, 43]
15
+ window_pos_embed_bkg_spatial_size: [7, 7]
16
+ window_spec: [8, 4, 16, 8]
17
+ neck:
18
+ _target_: sam2.modeling.backbones.image_encoder.FpnNeck
19
+ position_encoding:
20
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
21
+ num_pos_feats: 256
22
+ normalize: true
23
+ scale: null
24
+ temperature: 10000
25
+ d_model: 256
26
+ backbone_channel_list: [1152, 576, 288, 144]
27
+ fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
28
+ fpn_interp_model: nearest
29
+
30
+ memory_attention:
31
+ _target_: sam2.modeling.memory_attention.MemoryAttention
32
+ d_model: 256
33
+ pos_enc_at_input: true
34
+ layer:
35
+ _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
36
+ activation: relu
37
+ dim_feedforward: 2048
38
+ dropout: 0.1
39
+ pos_enc_at_attn: false
40
+ self_attention:
41
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
42
+ rope_theta: 10000.0
43
+ feat_sizes: [32, 32]
44
+ embedding_dim: 256
45
+ num_heads: 1
46
+ downsample_rate: 1
47
+ dropout: 0.1
48
+ d_model: 256
49
+ pos_enc_at_cross_attn_keys: true
50
+ pos_enc_at_cross_attn_queries: false
51
+ cross_attention:
52
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
53
+ rope_theta: 10000.0
54
+ feat_sizes: [32, 32]
55
+ rope_k_repeat: True
56
+ embedding_dim: 256
57
+ num_heads: 1
58
+ downsample_rate: 1
59
+ dropout: 0.1
60
+ kv_in_dim: 64
61
+ num_layers: 4
62
+
63
+ memory_encoder:
64
+ _target_: sam2.modeling.memory_encoder.MemoryEncoder
65
+ out_dim: 64
66
+ position_encoding:
67
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
68
+ num_pos_feats: 64
69
+ normalize: true
70
+ scale: null
71
+ temperature: 10000
72
+ mask_downsampler:
73
+ _target_: sam2.modeling.memory_encoder.MaskDownSampler
74
+ kernel_size: 3
75
+ stride: 2
76
+ padding: 1
77
+ fuser:
78
+ _target_: sam2.modeling.memory_encoder.Fuser
79
+ layer:
80
+ _target_: sam2.modeling.memory_encoder.CXBlock
81
+ dim: 256
82
+ kernel_size: 7
83
+ padding: 3
84
+ layer_scale_init_value: 1e-6
85
+ use_dwconv: True # depth-wise convs
86
+ num_layers: 2
87
+
88
+ num_maskmem: 7
89
+ image_size: 1024
90
+ # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
91
+ sigmoid_scale_for_mem_enc: 20.0
92
+ sigmoid_bias_for_mem_enc: -10.0
93
+ use_mask_input_as_output_without_sam: true
94
+ # Memory
95
+ directly_add_no_mem_embed: true
96
+ # use high-resolution feature map in the SAM mask decoder
97
+ use_high_res_features_in_sam: true
98
+ # output 3 masks on the first click on initial conditioning frames
99
+ multimask_output_in_sam: true
100
+ # SAM heads
101
+ iou_prediction_use_sigmoid: True
102
+ # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
103
+ use_obj_ptrs_in_encoder: true
104
+ add_tpos_enc_to_obj_ptrs: false
105
+ only_obj_ptrs_in_the_past_for_eval: true
106
+ # object occlusion prediction
107
+ pred_obj_scores: true
108
+ pred_obj_scores_mlp: true
109
+ fixed_no_obj_ptr: true
110
+ # multimask tracking settings
111
+ multimask_output_for_tracking: true
112
+ use_multimask_token_for_obj_ptr: true
113
+ multimask_min_pt_num: 0
114
+ multimask_max_pt_num: 1
115
+ use_mlp_for_obj_ptr_proj: true
116
+ # Compilation flag
117
+ compile_image_encoder: False
clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_s.yaml ADDED
@@ -0,0 +1,116 @@
1
+ # @package _global_
2
+
3
+ # Model
4
+ model:
5
+ _target_: sam2.modeling.sam2_base.SAM2Base
6
+ image_encoder:
7
+ _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
+ scalp: 1
9
+ trunk:
10
+ _target_: sam2.modeling.backbones.hieradet.Hiera
11
+ embed_dim: 96
12
+ num_heads: 1
13
+ stages: [1, 2, 11, 2]
14
+ global_att_blocks: [7, 10, 13]
15
+ window_pos_embed_bkg_spatial_size: [7, 7]
16
+ neck:
17
+ _target_: sam2.modeling.backbones.image_encoder.FpnNeck
18
+ position_encoding:
19
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
20
+ num_pos_feats: 256
21
+ normalize: true
22
+ scale: null
23
+ temperature: 10000
24
+ d_model: 256
25
+ backbone_channel_list: [768, 384, 192, 96]
26
+ fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
27
+ fpn_interp_model: nearest
28
+
29
+ memory_attention:
30
+ _target_: sam2.modeling.memory_attention.MemoryAttention
31
+ d_model: 256
32
+ pos_enc_at_input: true
33
+ layer:
34
+ _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
35
+ activation: relu
36
+ dim_feedforward: 2048
37
+ dropout: 0.1
38
+ pos_enc_at_attn: false
39
+ self_attention:
40
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
41
+ rope_theta: 10000.0
42
+ feat_sizes: [32, 32]
43
+ embedding_dim: 256
44
+ num_heads: 1
45
+ downsample_rate: 1
46
+ dropout: 0.1
47
+ d_model: 256
48
+ pos_enc_at_cross_attn_keys: true
49
+ pos_enc_at_cross_attn_queries: false
50
+ cross_attention:
51
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
52
+ rope_theta: 10000.0
53
+ feat_sizes: [32, 32]
54
+ rope_k_repeat: True
55
+ embedding_dim: 256
56
+ num_heads: 1
57
+ downsample_rate: 1
58
+ dropout: 0.1
59
+ kv_in_dim: 64
60
+ num_layers: 4
61
+
62
+ memory_encoder:
63
+ _target_: sam2.modeling.memory_encoder.MemoryEncoder
64
+ out_dim: 64
65
+ position_encoding:
66
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
67
+ num_pos_feats: 64
68
+ normalize: true
69
+ scale: null
70
+ temperature: 10000
71
+ mask_downsampler:
72
+ _target_: sam2.modeling.memory_encoder.MaskDownSampler
73
+ kernel_size: 3
74
+ stride: 2
75
+ padding: 1
76
+ fuser:
77
+ _target_: sam2.modeling.memory_encoder.Fuser
78
+ layer:
79
+ _target_: sam2.modeling.memory_encoder.CXBlock
80
+ dim: 256
81
+ kernel_size: 7
82
+ padding: 3
83
+ layer_scale_init_value: 1e-6
84
+ use_dwconv: True # depth-wise convs
85
+ num_layers: 2
86
+
87
+ num_maskmem: 7
88
+ image_size: 1024
89
+ # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
90
+ sigmoid_scale_for_mem_enc: 20.0
91
+ sigmoid_bias_for_mem_enc: -10.0
92
+ use_mask_input_as_output_without_sam: true
93
+ # Memory
94
+ directly_add_no_mem_embed: true
95
+ # use high-resolution feature map in the SAM mask decoder
96
+ use_high_res_features_in_sam: true
97
+ # output 3 masks on the first click on initial conditioning frames
98
+ multimask_output_in_sam: true
99
+ # SAM heads
100
+ iou_prediction_use_sigmoid: True
101
+ # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
102
+ use_obj_ptrs_in_encoder: true
103
+ add_tpos_enc_to_obj_ptrs: false
104
+ only_obj_ptrs_in_the_past_for_eval: true
105
+ # object occlusion prediction
106
+ pred_obj_scores: true
107
+ pred_obj_scores_mlp: true
108
+ fixed_no_obj_ptr: true
109
+ # multimask tracking settings
110
+ multimask_output_for_tracking: true
111
+ use_multimask_token_for_obj_ptr: true
112
+ multimask_min_pt_num: 0
113
+ multimask_max_pt_num: 1
114
+ use_mlp_for_obj_ptr_proj: true
115
+ # Compilation flag
116
+ compile_image_encoder: False
clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_hiera_t.yaml ADDED
@@ -0,0 +1,118 @@
1
+ # @package _global_
2
+
3
+ # Model
4
+ model:
5
+ _target_: sam2.modeling.sam2_base.SAM2Base
6
+ image_encoder:
7
+ _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
+ scalp: 1
9
+ trunk:
10
+ _target_: sam2.modeling.backbones.hieradet.Hiera
11
+ embed_dim: 96
12
+ num_heads: 1
13
+ stages: [1, 2, 7, 2]
14
+ global_att_blocks: [5, 7, 9]
15
+ window_pos_embed_bkg_spatial_size: [7, 7]
16
+ neck:
17
+ _target_: sam2.modeling.backbones.image_encoder.FpnNeck
18
+ position_encoding:
19
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
20
+ num_pos_feats: 256
21
+ normalize: true
22
+ scale: null
23
+ temperature: 10000
24
+ d_model: 256
25
+ backbone_channel_list: [768, 384, 192, 96]
26
+ fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
27
+ fpn_interp_model: nearest
28
+
29
+ memory_attention:
30
+ _target_: sam2.modeling.memory_attention.MemoryAttention
31
+ d_model: 256
32
+ pos_enc_at_input: true
33
+ layer:
34
+ _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
35
+ activation: relu
36
+ dim_feedforward: 2048
37
+ dropout: 0.1
38
+ pos_enc_at_attn: false
39
+ self_attention:
40
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
41
+ rope_theta: 10000.0
42
+ feat_sizes: [32, 32]
43
+ embedding_dim: 256
44
+ num_heads: 1
45
+ downsample_rate: 1
46
+ dropout: 0.1
47
+ d_model: 256
48
+ pos_enc_at_cross_attn_keys: true
49
+ pos_enc_at_cross_attn_queries: false
50
+ cross_attention:
51
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
52
+ rope_theta: 10000.0
53
+ feat_sizes: [32, 32]
54
+ rope_k_repeat: True
55
+ embedding_dim: 256
56
+ num_heads: 1
57
+ downsample_rate: 1
58
+ dropout: 0.1
59
+ kv_in_dim: 64
60
+ num_layers: 4
61
+
62
+ memory_encoder:
63
+ _target_: sam2.modeling.memory_encoder.MemoryEncoder
64
+ out_dim: 64
65
+ position_encoding:
66
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
67
+ num_pos_feats: 64
68
+ normalize: true
69
+ scale: null
70
+ temperature: 10000
71
+ mask_downsampler:
72
+ _target_: sam2.modeling.memory_encoder.MaskDownSampler
73
+ kernel_size: 3
74
+ stride: 2
75
+ padding: 1
76
+ fuser:
77
+ _target_: sam2.modeling.memory_encoder.Fuser
78
+ layer:
79
+ _target_: sam2.modeling.memory_encoder.CXBlock
80
+ dim: 256
81
+ kernel_size: 7
82
+ padding: 3
83
+ layer_scale_init_value: 1e-6
84
+ use_dwconv: True # depth-wise convs
85
+ num_layers: 2
86
+
87
+ num_maskmem: 7
88
+ image_size: 1024
89
+ # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
90
+ # SAM decoder
91
+ sigmoid_scale_for_mem_enc: 20.0
92
+ sigmoid_bias_for_mem_enc: -10.0
93
+ use_mask_input_as_output_without_sam: true
94
+ # Memory
95
+ directly_add_no_mem_embed: true
96
+ # use high-resolution feature map in the SAM mask decoder
97
+ use_high_res_features_in_sam: true
98
+ # output 3 masks on the first click on initial conditioning frames
99
+ multimask_output_in_sam: true
100
+ # SAM heads
101
+ iou_prediction_use_sigmoid: True
102
+ # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
103
+ use_obj_ptrs_in_encoder: true
104
+ add_tpos_enc_to_obj_ptrs: false
105
+ only_obj_ptrs_in_the_past_for_eval: true
106
+ # object occlusion prediction
107
+ pred_obj_scores: true
108
+ pred_obj_scores_mlp: true
109
+ fixed_no_obj_ptr: true
110
+ # multimask tracking settings
111
+ multimask_output_for_tracking: true
112
+ use_multimask_token_for_obj_ptr: true
113
+ multimask_min_pt_num: 0
114
+ multimask_max_pt_num: 1
115
+ use_mlp_for_obj_ptr_proj: true
116
+ # Compilation flag
117
+ # HieraT does not currently support compilation, should always be set to False
118
+ compile_image_encoder: False
clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_image_predictor.py ADDED
@@ -0,0 +1,466 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ import logging
8
+
9
+ from typing import List, Optional, Tuple, Union
10
+
11
+ import numpy as np
12
+ import torch
13
+ from PIL.Image import Image
14
+
15
+ from sam2.modeling.sam2_base import SAM2Base
16
+
17
+ from sam2.utils.transforms import SAM2Transforms
18
+
19
+
20
+ class SAM2ImagePredictor:
21
+ def __init__(
22
+ self,
23
+ sam_model: SAM2Base,
24
+ mask_threshold=0.0,
25
+ max_hole_area=0.0,
26
+ max_sprinkle_area=0.0,
27
+ **kwargs,
28
+ ) -> None:
29
+ """
30
+ Uses SAM-2 to calculate the image embedding for an image, and then
31
+ allows repeated, efficient mask prediction given prompts.
32
+
33
+ Arguments:
34
+ sam_model (Sam-2): The model to use for mask prediction.
35
+ mask_threshold (float): The threshold to use when converting mask logits
36
+ to binary masks. Masks are thresholded at 0 by default.
37
+ max_hole_area (int): If max_hole_area > 0, we fill small holes in up to
38
+ the maximum area of max_hole_area in low_res_masks.
39
+ max_sprinkle_area (int): If max_sprinkle_area > 0, we remove small sprinkles up to
40
+ the maximum area of max_sprinkle_area in low_res_masks.
41
+ """
42
+ super().__init__()
43
+ self.model = sam_model
44
+ self._transforms = SAM2Transforms(
45
+ resolution=self.model.image_size,
46
+ mask_threshold=mask_threshold,
47
+ max_hole_area=max_hole_area,
48
+ max_sprinkle_area=max_sprinkle_area,
49
+ )
50
+
51
+ # Predictor state
52
+ self._is_image_set = False
53
+ self._features = None
54
+ self._orig_hw = None
55
+ # Whether the predictor is set for single image or a batch of images
56
+ self._is_batch = False
57
+
58
+ # Predictor config
59
+ self.mask_threshold = mask_threshold
60
+
61
+ # Spatial dim for backbone feature maps
62
+ self._bb_feat_sizes = [
63
+ (256, 256),
64
+ (128, 128),
65
+ (64, 64),
66
+ ]
67
+
68
+ @classmethod
69
+ def from_pretrained(cls, model_id: str, **kwargs) -> "SAM2ImagePredictor":
70
+ """
71
+ Load a pretrained model from the Hugging Face hub.
72
+
73
+ Arguments:
74
+ model_id (str): The Hugging Face repository ID.
75
+ **kwargs: Additional arguments to pass to the model constructor.
76
+
77
+ Returns:
78
+ (SAM2ImagePredictor): The loaded model.
79
+ """
80
+ from sam2.build_sam import build_sam2_hf
81
+
82
+ sam_model = build_sam2_hf(model_id, **kwargs)
83
+ return cls(sam_model, **kwargs)
84
+
85
+ @torch.no_grad()
86
+ def set_image(
87
+ self,
88
+ image: Union[np.ndarray, Image],
89
+ ) -> None:
90
+ """
91
+ Calculates the image embeddings for the provided image, allowing
92
+ masks to be predicted with the 'predict' method.
93
+
94
+ Arguments:
95
+ image (np.ndarray or PIL Image): The input image to embed in RGB format. The image should be in HWC format if np.ndarray, or WHC format if PIL Image
96
+ with pixel values in [0, 255].
97
+ image_format (str): The color format of the image, in ['RGB', 'BGR'].
98
+ """
99
+ self.reset_predictor()
100
+ # Transform the image to the form expected by the model
101
+ if isinstance(image, np.ndarray):
102
+ logging.info("For numpy array image, we assume (HxWxC) format")
103
+ self._orig_hw = [image.shape[:2]]
104
+ elif isinstance(image, Image):
105
+ w, h = image.size
106
+ self._orig_hw = [(h, w)]
107
+ else:
108
+ raise NotImplementedError("Image format not supported")
109
+
110
+ input_image = self._transforms(image)
111
+ input_image = input_image[None, ...].to(self.device)
112
+
113
+ assert (
114
+ len(input_image.shape) == 4 and input_image.shape[1] == 3
115
+ ), f"input_image must be of size 1x3xHxW, got {input_image.shape}"
116
+ logging.info("Computing image embeddings for the provided image...")
117
+ backbone_out = self.model.forward_image(input_image)
118
+ _, vision_feats, _, _ = self.model._prepare_backbone_features(backbone_out)
119
+ # Add no_mem_embed, which is added to the lowest rest feat. map during training on videos
120
+ if self.model.directly_add_no_mem_embed:
121
+ vision_feats[-1] = vision_feats[-1] + self.model.no_mem_embed
122
+
123
+ feats = [
124
+ feat.permute(1, 2, 0).view(1, -1, *feat_size)
125
+ for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])
126
+ ][::-1]
127
+ self._features = {"image_embed": feats[-1], "high_res_feats": feats[:-1]}
128
+ self._is_image_set = True
129
+ logging.info("Image embeddings computed.")
130
+
131
+ @torch.no_grad()
132
+ def set_image_batch(
133
+ self,
134
+ image_list: List[np.ndarray],
135
+ ) -> None:
136
+ """
137
+ Calculates the image embeddings for the provided image batch, allowing
138
+ masks to be predicted with the 'predict_batch' method.
139
+
140
+ Arguments:
141
+ image_list (List[np.ndarray]): The input images to embed in RGB format. The image should be in HWC format if np.ndarray
142
+ with pixel values in [0, 255].
143
+ """
144
+ self.reset_predictor()
145
+ assert isinstance(image_list, list)
146
+ self._orig_hw = []
147
+ for image in image_list:
148
+ assert isinstance(
149
+ image, np.ndarray
150
+ ), "Images are expected to be an np.ndarray in RGB format, and of shape HWC"
151
+ self._orig_hw.append(image.shape[:2])
152
+ # Transform the image to the form expected by the model
153
+ img_batch = self._transforms.forward_batch(image_list)
154
+ img_batch = img_batch.to(self.device)
155
+ batch_size = img_batch.shape[0]
156
+ assert (
157
+ len(img_batch.shape) == 4 and img_batch.shape[1] == 3
158
+ ), f"img_batch must be of size Bx3xHxW, got {img_batch.shape}"
159
+ logging.info("Computing image embeddings for the provided images...")
160
+ backbone_out = self.model.forward_image(img_batch)
161
+ _, vision_feats, _, _ = self.model._prepare_backbone_features(backbone_out)
162
+ # Add no_mem_embed, which is added to the lowest rest feat. map during training on videos
163
+ if self.model.directly_add_no_mem_embed:
164
+ vision_feats[-1] = vision_feats[-1] + self.model.no_mem_embed
165
+
166
+ feats = [
167
+ feat.permute(1, 2, 0).view(batch_size, -1, *feat_size)
168
+ for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])
169
+ ][::-1]
170
+ self._features = {"image_embed": feats[-1], "high_res_feats": feats[:-1]}
171
+ self._is_image_set = True
172
+ self._is_batch = True
173
+ logging.info("Image embeddings computed.")
174
+
175
+ def predict_batch(
176
+ self,
177
+ point_coords_batch: List[np.ndarray] = None,
178
+ point_labels_batch: List[np.ndarray] = None,
179
+ box_batch: List[np.ndarray] = None,
180
+ mask_input_batch: List[np.ndarray] = None,
181
+ multimask_output: bool = True,
182
+ return_logits: bool = False,
183
+ normalize_coords=True,
184
+ ) -> Tuple[List[np.ndarray], List[np.ndarray], List[np.ndarray]]:
185
+ """This function is very similar to predict(...), however it is used for batched mode, when the model is expected to generate predictions on multiple images.
186
+ It returns a tuple of lists of masks, ious, and low_res_masks_logits.
187
+ """
188
+ assert self._is_batch, "This function should only be used when in batched mode"
189
+ if not self._is_image_set:
190
+ raise RuntimeError(
191
+ "An image must be set with .set_image_batch(...) before mask prediction."
192
+ )
193
+ num_images = len(self._features["image_embed"])
194
+ all_masks = []
195
+ all_ious = []
196
+ all_low_res_masks = []
197
+ for img_idx in range(num_images):
198
+ # Transform input prompts
199
+ point_coords = (
200
+ point_coords_batch[img_idx] if point_coords_batch is not None else None
201
+ )
202
+ point_labels = (
203
+ point_labels_batch[img_idx] if point_labels_batch is not None else None
204
+ )
205
+ box = box_batch[img_idx] if box_batch is not None else None
206
+ mask_input = (
207
+ mask_input_batch[img_idx] if mask_input_batch is not None else None
208
+ )
209
+ mask_input, unnorm_coords, labels, unnorm_box = self._prep_prompts(
210
+ point_coords,
211
+ point_labels,
212
+ box,
213
+ mask_input,
214
+ normalize_coords,
215
+ img_idx=img_idx,
216
+ )
217
+ masks, iou_predictions, low_res_masks = self._predict(
218
+ unnorm_coords,
219
+ labels,
220
+ unnorm_box,
221
+ mask_input,
222
+ multimask_output,
223
+ return_logits=return_logits,
224
+ img_idx=img_idx,
225
+ )
226
+ masks_np = masks.squeeze(0).float().detach().cpu().numpy()
227
+ iou_predictions_np = (
228
+ iou_predictions.squeeze(0).float().detach().cpu().numpy()
229
+ )
230
+ low_res_masks_np = low_res_masks.squeeze(0).float().detach().cpu().numpy()
231
+ all_masks.append(masks_np)
232
+ all_ious.append(iou_predictions_np)
233
+ all_low_res_masks.append(low_res_masks_np)
234
+
235
+ return all_masks, all_ious, all_low_res_masks
236
+
237
+ def predict(
238
+ self,
239
+ point_coords: Optional[np.ndarray] = None,
240
+ point_labels: Optional[np.ndarray] = None,
241
+ box: Optional[np.ndarray] = None,
242
+ mask_input: Optional[np.ndarray] = None,
243
+ multimask_output: bool = True,
244
+ return_logits: bool = False,
245
+ normalize_coords=True,
246
+ ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
247
+ """
248
+ Predict masks for the given input prompts, using the currently set image.
249
+
250
+ Arguments:
251
+ point_coords (np.ndarray or None): A Nx2 array of point prompts to the
252
+ model. Each point is in (X,Y) in pixels.
253
+ point_labels (np.ndarray or None): A length N array of labels for the
254
+ point prompts. 1 indicates a foreground point and 0 indicates a
255
+ background point.
256
+ box (np.ndarray or None): A length 4 array given a box prompt to the
257
+ model, in XYXY format.
258
+ mask_input (np.ndarray): A low resolution mask input to the model, typically
259
+ coming from a previous prediction iteration. Has form 1xHxW, where
260
+ for SAM, H=W=256.
261
+ multimask_output (bool): If true, the model will return three masks.
262
+ For ambiguous input prompts (such as a single click), this will often
263
+ produce better masks than a single prediction. If only a single
264
+ mask is needed, the model's predicted quality score can be used
265
+ to select the best mask. For non-ambiguous prompts, such as multiple
266
+ input prompts, multimask_output=False can give better results.
267
+ return_logits (bool): If true, returns un-thresholded masks logits
268
+ instead of a binary mask.
269
+ normalize_coords (bool): If true, the point coordinates will be normalized to the range [0,1] and point_coords is expected to be wrt. image dimensions.
270
+
271
+ Returns:
272
+ (np.ndarray): The output masks in CxHxW format, where C is the
273
+ number of masks, and (H, W) is the original image size.
274
+ (np.ndarray): An array of length C containing the model's
275
+ predictions for the quality of each mask.
276
+ (np.ndarray): An array of shape CxHxW, where C is the number
277
+ of masks and H=W=256. These low resolution logits can be passed to
278
+ a subsequent iteration as mask input.
279
+ """
280
+ if not self._is_image_set:
281
+ raise RuntimeError(
282
+ "An image must be set with .set_image(...) before mask prediction."
283
+ )
284
+
285
+ # Transform input prompts
286
+
287
+ mask_input, unnorm_coords, labels, unnorm_box = self._prep_prompts(
288
+ point_coords, point_labels, box, mask_input, normalize_coords
289
+ )
290
+
291
+ masks, iou_predictions, low_res_masks = self._predict(
292
+ unnorm_coords,
293
+ labels,
294
+ unnorm_box,
295
+ mask_input,
296
+ multimask_output,
297
+ return_logits=return_logits,
298
+ )
299
+
300
+ masks_np = masks.squeeze(0).float().detach().cpu().numpy()
301
+ iou_predictions_np = iou_predictions.squeeze(0).float().detach().cpu().numpy()
302
+ low_res_masks_np = low_res_masks.squeeze(0).float().detach().cpu().numpy()
303
+ return masks_np, iou_predictions_np, low_res_masks_np
304
+
305
+ def _prep_prompts(
306
+ self, point_coords, point_labels, box, mask_logits, normalize_coords, img_idx=-1
307
+ ):
308
+
309
+ unnorm_coords, labels, unnorm_box, mask_input = None, None, None, None
310
+ if point_coords is not None:
311
+ assert (
312
+ point_labels is not None
313
+ ), "point_labels must be supplied if point_coords is supplied."
314
+ point_coords = torch.as_tensor(
315
+ point_coords, dtype=torch.float, device=self.device
316
+ )
317
+ unnorm_coords = self._transforms.transform_coords(
318
+ point_coords, normalize=normalize_coords, orig_hw=self._orig_hw[img_idx]
319
+ )
320
+ labels = torch.as_tensor(point_labels, dtype=torch.int, device=self.device)
321
+ if len(unnorm_coords.shape) == 2:
322
+ unnorm_coords, labels = unnorm_coords[None, ...], labels[None, ...]
323
+ if box is not None:
324
+ box = torch.as_tensor(box, dtype=torch.float, device=self.device)
325
+ unnorm_box = self._transforms.transform_boxes(
326
+ box, normalize=normalize_coords, orig_hw=self._orig_hw[img_idx]
327
+ ) # Bx2x2
328
+ if mask_logits is not None:
329
+ mask_input = torch.as_tensor(
330
+ mask_logits, dtype=torch.float, device=self.device
331
+ )
332
+ if len(mask_input.shape) == 3:
333
+ mask_input = mask_input[None, :, :, :]
334
+ return mask_input, unnorm_coords, labels, unnorm_box
335
+
336
+ @torch.no_grad()
337
+ def _predict(
338
+ self,
339
+ point_coords: Optional[torch.Tensor],
340
+ point_labels: Optional[torch.Tensor],
341
+ boxes: Optional[torch.Tensor] = None,
342
+ mask_input: Optional[torch.Tensor] = None,
343
+ multimask_output: bool = True,
344
+ return_logits: bool = False,
345
+ img_idx: int = -1,
346
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
347
+ """
348
+ Predict masks for the given input prompts, using the currently set image.
349
+ Input prompts are batched torch tensors and are expected to already be
350
+ transformed to the input frame using SAM2Transforms.
351
+
352
+ Arguments:
353
+ point_coords (torch.Tensor or None): A BxNx2 array of point prompts to the
354
+ model. Each point is in (X,Y) in pixels.
355
+ point_labels (torch.Tensor or None): A BxN array of labels for the
356
+ point prompts. 1 indicates a foreground point and 0 indicates a
357
+ background point.
358
+ boxes (np.ndarray or None): A Bx4 array given a box prompt to the
359
+ model, in XYXY format.
360
+ mask_input (np.ndarray): A low resolution mask input to the model, typically
361
+ coming from a previous prediction iteration. Has form Bx1xHxW, where
362
+ for SAM, H=W=256. Masks returned by a previous iteration of the
363
+ predict method do not need further transformation.
364
+ multimask_output (bool): If true, the model will return three masks.
365
+ For ambiguous input prompts (such as a single click), this will often
366
+ produce better masks than a single prediction. If only a single
367
+ mask is needed, the model's predicted quality score can be used
368
+ to select the best mask. For non-ambiguous prompts, such as multiple
369
+ input prompts, multimask_output=False can give better results.
370
+ return_logits (bool): If true, returns un-thresholded masks logits
371
+ instead of a binary mask.
372
+
373
+ Returns:
374
+ (torch.Tensor): The output masks in BxCxHxW format, where C is the
375
+ number of masks, and (H, W) is the original image size.
376
+ (torch.Tensor): An array of shape BxC containing the model's
377
+ predictions for the quality of each mask.
378
+ (torch.Tensor): An array of shape BxCxHxW, where C is the number
379
+ of masks and H=W=256. These low res logits can be passed to
380
+ a subsequent iteration as mask input.
381
+ """
382
+ if not self._is_image_set:
383
+ raise RuntimeError(
384
+ "An image must be set with .set_image(...) before mask prediction."
385
+ )
386
+
387
+ if point_coords is not None:
388
+ concat_points = (point_coords, point_labels)
389
+ else:
390
+ concat_points = None
391
+
392
+ # Embed prompts
393
+ if boxes is not None:
394
+ box_coords = boxes.reshape(-1, 2, 2)
395
+ box_labels = torch.tensor([[2, 3]], dtype=torch.int, device=boxes.device)
396
+ box_labels = box_labels.repeat(boxes.size(0), 1)
397
+ # we merge "boxes" and "points" into a single "concat_points" input (where
398
+ # boxes are added at the beginning) to sam_prompt_encoder
399
+ if concat_points is not None:
400
+ concat_coords = torch.cat([box_coords, concat_points[0]], dim=1)
401
+ concat_labels = torch.cat([box_labels, concat_points[1]], dim=1)
402
+ concat_points = (concat_coords, concat_labels)
403
+ else:
404
+ concat_points = (box_coords, box_labels)
405
+
406
+ sparse_embeddings, dense_embeddings = self.model.sam_prompt_encoder(
407
+ points=concat_points,
408
+ boxes=None,
409
+ masks=mask_input,
410
+ )
411
+
412
+ # Predict masks
413
+ batched_mode = (
414
+ concat_points is not None and concat_points[0].shape[0] > 1
415
+ ) # multi object prediction
416
+ high_res_features = [
417
+ feat_level[img_idx].unsqueeze(0)
418
+ for feat_level in self._features["high_res_feats"]
419
+ ]
420
+ low_res_masks, iou_predictions, _, _ = self.model.sam_mask_decoder(
421
+ image_embeddings=self._features["image_embed"][img_idx].unsqueeze(0),
422
+ image_pe=self.model.sam_prompt_encoder.get_dense_pe(),
423
+ sparse_prompt_embeddings=sparse_embeddings,
424
+ dense_prompt_embeddings=dense_embeddings,
425
+ multimask_output=multimask_output,
426
+ repeat_image=batched_mode,
427
+ high_res_features=high_res_features,
428
+ )
429
+
430
+ # Upscale the masks to the original image resolution
431
+ masks = self._transforms.postprocess_masks(
432
+ low_res_masks, self._orig_hw[img_idx]
433
+ )
434
+ low_res_masks = torch.clamp(low_res_masks, -32.0, 32.0)
435
+ if not return_logits:
436
+ masks = masks > self.mask_threshold
437
+
438
+ return masks, iou_predictions, low_res_masks
439
+
440
+ def get_image_embedding(self) -> torch.Tensor:
441
+ """
442
+ Returns the image embeddings for the currently set image, with
443
+ shape 1xCxHxW, where C is the embedding dimension and (H,W) are
444
+ the embedding spatial dimension of SAM (typically C=256, H=W=64).
445
+ """
446
+ if not self._is_image_set:
447
+ raise RuntimeError(
448
+ "An image must be set with .set_image(...) to generate an embedding."
449
+ )
450
+ assert (
451
+ self._features is not None
452
+ ), "Features must exist if an image has been set."
453
+ return self._features["image_embed"]
454
+
455
+ @property
456
+ def device(self) -> torch.device:
457
+ return self.model.device
458
+
459
+ def reset_predictor(self) -> None:
460
+ """
461
+ Resets the image embeddings and other state variables.
462
+ """
463
+ self._is_image_set = False
464
+ self._features = None
465
+ self._orig_hw = None
466
+ self._is_batch = False
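
A minimal usage sketch of SAM2ImagePredictor as defined above, using a box prompt. The config and checkpoint names are assumptions, and the dummy array stands in for any RGB image in HWC format:

import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
sam2_model = build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt", device=device)
predictor = SAM2ImagePredictor(sam2_model)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder RGB image in HWC format
predictor.set_image(image)                        # embeddings are computed once and cached
box = np.array([100, 100, 300, 300])              # XYXY box prompt in pixel coordinates
masks, scores, low_res_logits = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores.shape)                  # e.g. (1, 480, 640) and (1,)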
clone-IDEA-Research/Grounded-SAM-2/sam2/sam2_video_predictor.py ADDED
@@ -0,0 +1,1172 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ import warnings
8
+ from collections import OrderedDict
9
+
10
+ import torch
11
+
12
+ from tqdm import tqdm
13
+
14
+ from sam2.modeling.sam2_base import NO_OBJ_SCORE, SAM2Base
15
+ from sam2.utils.misc import concat_points, fill_holes_in_mask_scores, load_video_frames
16
+
17
+
18
+ class SAM2VideoPredictor(SAM2Base):
19
+ """The predictor class to handle user interactions and manage inference states."""
20
+
21
+ def __init__(
22
+ self,
23
+ fill_hole_area=0,
24
+ # whether to apply non-overlapping constraints on the output object masks
25
+ non_overlap_masks=False,
26
+ # whether to clear non-conditioning memory of the surrounding frames (which may contain outdated information) after adding correction clicks;
27
+ # note that this would only apply to *single-object tracking* unless `clear_non_cond_mem_for_multi_obj` is also set to True
28
+ clear_non_cond_mem_around_input=False,
29
+ # whether to also clear non-conditioning memory of the surrounding frames (only effective when `clear_non_cond_mem_around_input` is True).
30
+ clear_non_cond_mem_for_multi_obj=False,
31
+ # if `add_all_frames_to_correct_as_cond` is True, we also append to the conditioning frame list any frame that receives a later correction click
32
+ # if `add_all_frames_to_correct_as_cond` is False, we restrict the conditioning frame list to only those initial conditioning frames
33
+ add_all_frames_to_correct_as_cond=False,
34
+ **kwargs,
35
+ ):
36
+ super().__init__(**kwargs)
37
+ self.fill_hole_area = fill_hole_area
38
+ self.non_overlap_masks = non_overlap_masks
39
+ self.clear_non_cond_mem_around_input = clear_non_cond_mem_around_input
40
+ self.clear_non_cond_mem_for_multi_obj = clear_non_cond_mem_for_multi_obj
41
+ self.add_all_frames_to_correct_as_cond = add_all_frames_to_correct_as_cond
42
+
43
+ @torch.inference_mode()
44
+ def init_state(
45
+ self,
46
+ video_path,
47
+ offload_video_to_cpu=False,
48
+ offload_state_to_cpu=False,
49
+ async_loading_frames=False,
50
+ ):
51
+ """Initialize an inference state."""
52
+ compute_device = self.device # device of the model
53
+ images, video_height, video_width = load_video_frames(
54
+ video_path=video_path,
55
+ image_size=self.image_size,
56
+ offload_video_to_cpu=offload_video_to_cpu,
57
+ async_loading_frames=async_loading_frames,
58
+ compute_device=compute_device,
59
+ )
60
+ inference_state = {}
61
+ inference_state["images"] = images
62
+ inference_state["num_frames"] = len(images)
63
+ # whether to offload the video frames to CPU memory
64
+ # turning on this option saves the GPU memory with only a very small overhead
65
+ inference_state["offload_video_to_cpu"] = offload_video_to_cpu
66
+ # whether to offload the inference state to CPU memory
67
+ # turning on this option saves the GPU memory at the cost of a lower tracking fps
68
+ # (e.g. in a test case of 768x768 model, fps dropped from 27 to 24 when tracking one object
69
+ # and from 24 to 21 when tracking two objects)
70
+ inference_state["offload_state_to_cpu"] = offload_state_to_cpu
71
+ # the original video height and width, used for resizing final output scores
72
+ inference_state["video_height"] = video_height
73
+ inference_state["video_width"] = video_width
74
+ inference_state["device"] = compute_device
75
+ if offload_state_to_cpu:
76
+ inference_state["storage_device"] = torch.device("cpu")
77
+ else:
78
+ inference_state["storage_device"] = compute_device
79
+ # inputs on each frame
80
+ inference_state["point_inputs_per_obj"] = {}
81
+ inference_state["mask_inputs_per_obj"] = {}
82
+ # visual features on a small number of recently visited frames for quick interactions
83
+ inference_state["cached_features"] = {}
84
+ # values that don't change across frames (so we only need to hold one copy of them)
85
+ inference_state["constants"] = {}
86
+ # mapping between client-side object id and model-side object index
87
+ inference_state["obj_id_to_idx"] = OrderedDict()
88
+ inference_state["obj_idx_to_id"] = OrderedDict()
89
+ inference_state["obj_ids"] = []
90
+ # A storage to hold the model's tracking results and states on each frame
91
+ inference_state["output_dict"] = {
92
+ "cond_frame_outputs": {}, # dict containing {frame_idx: <out>}
93
+ "non_cond_frame_outputs": {}, # dict containing {frame_idx: <out>}
94
+ }
95
+ # Slice (view) of each object's tracking results, sharing the same memory with "output_dict"
96
+ inference_state["output_dict_per_obj"] = {}
97
+ # A temporary storage to hold new outputs when the user interacts with a frame
98
+ # to add clicks or mask (it's merged into "output_dict" before propagation starts)
99
+ inference_state["temp_output_dict_per_obj"] = {}
100
+ # Frames that already hold consolidated outputs from click or mask inputs
101
+ # (we directly use their consolidated outputs during tracking)
102
+ inference_state["consolidated_frame_inds"] = {
103
+ "cond_frame_outputs": set(), # set containing frame indices
104
+ "non_cond_frame_outputs": set(), # set containing frame indices
105
+ }
106
+ # metadata for each tracking frame (e.g. which direction it's tracked)
107
+ inference_state["tracking_has_started"] = False
108
+ inference_state["frames_already_tracked"] = {}
109
+ # Warm up the visual backbone and cache the image feature on frame 0
110
+ self._get_image_feature(inference_state, frame_idx=0, batch_size=1)
111
+ return inference_state
112
+
113
+ @classmethod
114
+ def from_pretrained(cls, model_id: str, **kwargs) -> "SAM2VideoPredictor":
115
+ """
116
+ Load a pretrained model from the Hugging Face hub.
117
+
118
+ Arguments:
119
+ model_id (str): The Hugging Face repository ID.
120
+ **kwargs: Additional arguments to pass to the model constructor.
121
+
122
+ Returns:
123
+ (SAM2VideoPredictor): The loaded model.
124
+ """
125
+ from sam2.build_sam import build_sam2_video_predictor_hf
126
+
127
+ sam_model = build_sam2_video_predictor_hf(model_id, **kwargs)
128
+ return sam_model
129
+
130
+ def _obj_id_to_idx(self, inference_state, obj_id):
131
+ """Map client-side object id to model-side object index."""
132
+ obj_idx = inference_state["obj_id_to_idx"].get(obj_id, None)
133
+ if obj_idx is not None:
134
+ return obj_idx
135
+
136
+ # This is a new object id not sent to the server before. We only allow adding
137
+ # new objects *before* the tracking starts.
138
+ allow_new_object = not inference_state["tracking_has_started"]
139
+ if allow_new_object:
140
+ # get the next object slot
141
+ obj_idx = len(inference_state["obj_id_to_idx"])
142
+ inference_state["obj_id_to_idx"][obj_id] = obj_idx
143
+ inference_state["obj_idx_to_id"][obj_idx] = obj_id
144
+ inference_state["obj_ids"] = list(inference_state["obj_id_to_idx"])
145
+ # set up input and output structures for this object
146
+ inference_state["point_inputs_per_obj"][obj_idx] = {}
147
+ inference_state["mask_inputs_per_obj"][obj_idx] = {}
148
+ inference_state["output_dict_per_obj"][obj_idx] = {
149
+ "cond_frame_outputs": {}, # dict containing {frame_idx: <out>}
150
+ "non_cond_frame_outputs": {}, # dict containing {frame_idx: <out>}
151
+ }
152
+ inference_state["temp_output_dict_per_obj"][obj_idx] = {
153
+ "cond_frame_outputs": {}, # dict containing {frame_idx: <out>}
154
+ "non_cond_frame_outputs": {}, # dict containing {frame_idx: <out>}
155
+ }
156
+ return obj_idx
157
+ else:
158
+ raise RuntimeError(
159
+ f"Cannot add new object id {obj_id} after tracking starts. "
160
+ f"All existing object ids: {inference_state['obj_ids']}. "
161
+ f"Please call 'reset_state' to restart from scratch."
162
+ )
163
+
164
+ def _obj_idx_to_id(self, inference_state, obj_idx):
165
+ """Map model-side object index to client-side object id."""
166
+ return inference_state["obj_idx_to_id"][obj_idx]
167
+
168
+ def _get_obj_num(self, inference_state):
169
+ """Get the total number of unique object ids received so far in this session."""
170
+ return len(inference_state["obj_idx_to_id"])
171
+
172
+ @torch.inference_mode()
173
+ def add_new_points_or_box(
174
+ self,
175
+ inference_state,
176
+ frame_idx,
177
+ obj_id,
178
+ points=None,
179
+ labels=None,
180
+ clear_old_points=True,
181
+ normalize_coords=True,
182
+ box=None,
183
+ ):
184
+ """Add new points to a frame."""
185
+ obj_idx = self._obj_id_to_idx(inference_state, obj_id)
186
+ point_inputs_per_frame = inference_state["point_inputs_per_obj"][obj_idx]
187
+ mask_inputs_per_frame = inference_state["mask_inputs_per_obj"][obj_idx]
188
+
189
+ if (points is not None) != (labels is not None):
190
+ raise ValueError("points and labels must be provided together")
191
+ if points is None and box is None:
192
+ raise ValueError("at least one of points or box must be provided as input")
193
+
194
+ if points is None:
195
+ points = torch.zeros(0, 2, dtype=torch.float32)
196
+ elif not isinstance(points, torch.Tensor):
197
+ points = torch.tensor(points, dtype=torch.float32)
198
+ if labels is None:
199
+ labels = torch.zeros(0, dtype=torch.int32)
200
+ elif not isinstance(labels, torch.Tensor):
201
+ labels = torch.tensor(labels, dtype=torch.int32)
202
+ if points.dim() == 2:
203
+ points = points.unsqueeze(0) # add batch dimension
204
+ if labels.dim() == 1:
205
+ labels = labels.unsqueeze(0) # add batch dimension
206
+
207
+ # If `box` is provided, we add it as the first two points with labels 2 and 3
208
+ # along with the user-provided points (consistent with how SAM 2 is trained).
209
+ if box is not None:
210
+ if not clear_old_points:
211
+ raise ValueError(
212
+ "cannot add box without clearing old points, since "
213
+ "box prompt must be provided before any point prompt "
214
+ "(please use clear_old_points=True instead)"
215
+ )
216
+ if inference_state["tracking_has_started"]:
217
+ warnings.warn(
218
+ "You are adding a box after tracking starts. SAM 2 may not always be "
219
+ "able to incorporate a box prompt for *refinement*. If you intend to "
220
+ "use box prompt as an *initial* input before tracking, please call "
221
+ "'reset_state' on the inference state to restart from scratch.",
222
+ category=UserWarning,
223
+ stacklevel=2,
224
+ )
225
+ if not isinstance(box, torch.Tensor):
226
+ box = torch.tensor(box, dtype=torch.float32, device=points.device)
227
+ box_coords = box.reshape(1, 2, 2)
228
+ box_labels = torch.tensor([2, 3], dtype=torch.int32, device=labels.device)
229
+ box_labels = box_labels.reshape(1, 2)
230
+ points = torch.cat([box_coords, points], dim=1)
231
+ labels = torch.cat([box_labels, labels], dim=1)
232
+
233
+ if normalize_coords:
234
+ video_H = inference_state["video_height"]
235
+ video_W = inference_state["video_width"]
236
+ points = points / torch.tensor([video_W, video_H]).to(points.device)
237
+ # scale the (normalized) coordinates by the model's internal image size
238
+ points = points * self.image_size
239
+ points = points.to(inference_state["device"])
240
+ labels = labels.to(inference_state["device"])
241
+
242
+ if not clear_old_points:
243
+ point_inputs = point_inputs_per_frame.get(frame_idx, None)
244
+ else:
245
+ point_inputs = None
246
+ point_inputs = concat_points(point_inputs, points, labels)
247
+
248
+ point_inputs_per_frame[frame_idx] = point_inputs
249
+ mask_inputs_per_frame.pop(frame_idx, None)
250
+ # If this frame hasn't been tracked before, we treat it as an initial conditioning
251
+ # frame, meaning that the input points are used to generate segments on this frame without
252
+ # using any memory from other frames, like in SAM. Otherwise (if it has been tracked),
253
+ # the input points will be used to correct the already tracked masks.
254
+ is_init_cond_frame = frame_idx not in inference_state["frames_already_tracked"]
255
+ # whether to track in reverse time order
256
+ if is_init_cond_frame:
257
+ reverse = False
258
+ else:
259
+ reverse = inference_state["frames_already_tracked"][frame_idx]["reverse"]
260
+ obj_output_dict = inference_state["output_dict_per_obj"][obj_idx]
261
+ obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
262
+ # Add a frame to conditioning output if it's an initial conditioning frame or
263
+ # if the model sees all frames receiving clicks/mask as conditioning frames.
264
+ is_cond = is_init_cond_frame or self.add_all_frames_to_correct_as_cond
265
+ storage_key = "cond_frame_outputs" if is_cond else "non_cond_frame_outputs"
266
+
267
+ # Get any previously predicted mask logits on this object and feed it along with
268
+ # the new clicks into the SAM mask decoder.
269
+ prev_sam_mask_logits = None
270
+ # lookup temporary output dict first, which contains the most recent output
271
+ # (if not found, then lookup conditioning and non-conditioning frame output)
272
+ prev_out = obj_temp_output_dict[storage_key].get(frame_idx)
273
+ if prev_out is None:
274
+ prev_out = obj_output_dict["cond_frame_outputs"].get(frame_idx)
275
+ if prev_out is None:
276
+ prev_out = obj_output_dict["non_cond_frame_outputs"].get(frame_idx)
277
+
278
+ if prev_out is not None and prev_out["pred_masks"] is not None:
279
+ device = inference_state["device"]
280
+ prev_sam_mask_logits = prev_out["pred_masks"].to(device, non_blocking=True)
281
+ # Clamp the scale of prev_sam_mask_logits to avoid rare numerical issues.
282
+ prev_sam_mask_logits = torch.clamp(prev_sam_mask_logits, -32.0, 32.0)
283
+ current_out, _ = self._run_single_frame_inference(
284
+ inference_state=inference_state,
285
+ output_dict=obj_output_dict, # run on the slice of a single object
286
+ frame_idx=frame_idx,
287
+ batch_size=1, # run on the slice of a single object
288
+ is_init_cond_frame=is_init_cond_frame,
289
+ point_inputs=point_inputs,
290
+ mask_inputs=None,
291
+ reverse=reverse,
292
+ # Skip the memory encoder when adding clicks or mask. We execute the memory encoder
293
+ # at the beginning of `propagate_in_video` (after the user finalizes their clicks). This
294
+ # allows us to enforce non-overlapping constraints on all objects before encoding
295
+ # them into memory.
296
+ run_mem_encoder=False,
297
+ prev_sam_mask_logits=prev_sam_mask_logits,
298
+ )
299
+ # Add the output to the output dict (to be used as future memory)
300
+ obj_temp_output_dict[storage_key][frame_idx] = current_out
301
+
302
+ # Resize the output mask to the original video resolution
303
+ obj_ids = inference_state["obj_ids"]
304
+ consolidated_out = self._consolidate_temp_output_across_obj(
305
+ inference_state,
306
+ frame_idx,
307
+ is_cond=is_cond,
308
+ run_mem_encoder=False,
309
+ consolidate_at_video_res=True,
310
+ )
311
+ _, video_res_masks = self._get_orig_video_res_output(
312
+ inference_state, consolidated_out["pred_masks_video_res"]
313
+ )
314
+ return frame_idx, obj_ids, video_res_masks
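
For orientation, a hedged sketch of how the interaction methods above are typically driven. The config, checkpoint, and frame-directory paths are assumptions; `propagate_in_video` is defined later in this file:

from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")
state = predictor.init_state(video_path="./videos/example_frames")  # directory of JPEG frames (assumed path)
frame_idx, obj_ids, masks = predictor.add_new_points_or_box(
    state, frame_idx=0, obj_id=1,
    points=[[210.0, 350.0]], labels=[1],   # one positive click in original-video pixel coordinates
)
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    binary_masks = (mask_logits > 0.0)     # per-object masks at the original video resolution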
315
+
316
+ def add_new_points(self, *args, **kwargs):
317
+ """Deprecated method. Please use `add_new_points_or_box` instead."""
318
+ return self.add_new_points_or_box(*args, **kwargs)
319
+
320
+ @torch.inference_mode()
321
+ def add_new_mask(
322
+ self,
323
+ inference_state,
324
+ frame_idx,
325
+ obj_id,
326
+ mask,
327
+ ):
328
+ """Add new mask to a frame."""
329
+ obj_idx = self._obj_id_to_idx(inference_state, obj_id)
330
+ point_inputs_per_frame = inference_state["point_inputs_per_obj"][obj_idx]
331
+ mask_inputs_per_frame = inference_state["mask_inputs_per_obj"][obj_idx]
332
+
333
+ if not isinstance(mask, torch.Tensor):
334
+ mask = torch.tensor(mask, dtype=torch.bool)
335
+ assert mask.dim() == 2
336
+ mask_H, mask_W = mask.shape
337
+ mask_inputs_orig = mask[None, None] # add batch and channel dimension
338
+ mask_inputs_orig = mask_inputs_orig.float().to(inference_state["device"])
339
+
340
+ # resize the mask if it doesn't match the model's image size
341
+ if mask_H != self.image_size or mask_W != self.image_size:
342
+ mask_inputs = torch.nn.functional.interpolate(
343
+ mask_inputs_orig,
344
+ size=(self.image_size, self.image_size),
345
+ align_corners=False,
346
+ mode="bilinear",
347
+ antialias=True, # use antialias for downsampling
348
+ )
349
+ mask_inputs = (mask_inputs >= 0.5).float()
350
+ else:
351
+ mask_inputs = mask_inputs_orig
352
+
353
+ mask_inputs_per_frame[frame_idx] = mask_inputs
354
+ point_inputs_per_frame.pop(frame_idx, None)
355
+ # If this frame hasn't been tracked before, we treat it as an initial conditioning
356
+ # frame, meaning that the input points are used to generate segments on this frame without
357
+ # using any memory from other frames, like in SAM. Otherwise (if it has been tracked),
358
+ # the input points will be used to correct the already tracked masks.
359
+ is_init_cond_frame = frame_idx not in inference_state["frames_already_tracked"]
360
+ # whether to track in reverse time order
361
+ if is_init_cond_frame:
362
+ reverse = False
363
+ else:
364
+ reverse = inference_state["frames_already_tracked"][frame_idx]["reverse"]
365
+ obj_output_dict = inference_state["output_dict_per_obj"][obj_idx]
366
+ obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
367
+ # Add a frame to conditioning output if it's an initial conditioning frame or
368
+ # if the model sees all frames receiving clicks/mask as conditioning frames.
369
+ is_cond = is_init_cond_frame or self.add_all_frames_to_correct_as_cond
370
+ storage_key = "cond_frame_outputs" if is_cond else "non_cond_frame_outputs"
371
+
372
+ current_out, _ = self._run_single_frame_inference(
373
+ inference_state=inference_state,
374
+ output_dict=obj_output_dict, # run on the slice of a single object
375
+ frame_idx=frame_idx,
376
+ batch_size=1, # run on the slice of a single object
377
+ is_init_cond_frame=is_init_cond_frame,
378
+ point_inputs=None,
379
+ mask_inputs=mask_inputs,
380
+ reverse=reverse,
381
+ # Skip the memory encoder when adding clicks or mask. We execute the memory encoder
382
+ # at the beginning of `propagate_in_video` (after the user finalizes their clicks). This
383
+ # allows us to enforce non-overlapping constraints on all objects before encoding
384
+ # them into memory.
385
+ run_mem_encoder=False,
386
+ )
387
+ # Add the output to the output dict (to be used as future memory)
388
+ obj_temp_output_dict[storage_key][frame_idx] = current_out
389
+
390
+ # Resize the output mask to the original video resolution
391
+ obj_ids = inference_state["obj_ids"]
392
+ consolidated_out = self._consolidate_temp_output_across_obj(
393
+ inference_state,
394
+ frame_idx,
395
+ is_cond=is_cond,
396
+ run_mem_encoder=False,
397
+ consolidate_at_video_res=True,
398
+ )
399
+ _, video_res_masks = self._get_orig_video_res_output(
400
+ inference_state, consolidated_out["pred_masks_video_res"]
401
+ )
402
+ return frame_idx, obj_ids, video_res_masks
403
+
404
+ def _get_orig_video_res_output(self, inference_state, any_res_masks):
405
+ """
406
+ Resize the object scores to the original video resolution (video_res_masks)
407
+ and apply non-overlapping constraints for final output.
408
+ """
409
+ device = inference_state["device"]
410
+ video_H = inference_state["video_height"]
411
+ video_W = inference_state["video_width"]
412
+ any_res_masks = any_res_masks.to(device, non_blocking=True)
413
+ if any_res_masks.shape[-2:] == (video_H, video_W):
414
+ video_res_masks = any_res_masks
415
+ else:
416
+ video_res_masks = torch.nn.functional.interpolate(
417
+ any_res_masks,
418
+ size=(video_H, video_W),
419
+ mode="bilinear",
420
+ align_corners=False,
421
+ )
422
+ if self.non_overlap_masks:
423
+ video_res_masks = self._apply_non_overlapping_constraints(video_res_masks)
424
+ return any_res_masks, video_res_masks
425
+
426
+ def _consolidate_temp_output_across_obj(
427
+ self,
428
+ inference_state,
429
+ frame_idx,
430
+ is_cond,
431
+ run_mem_encoder,
432
+ consolidate_at_video_res=False,
433
+ ):
434
+ """
435
+ Consolidate the per-object temporary outputs in `temp_output_dict_per_obj` on
436
+ a frame into a single output for all objects, including
437
+ 1) fill any missing objects either from `output_dict_per_obj` (if they exist in
438
+ `output_dict_per_obj` for this frame) or leave them as placeholder values
439
+ (if they don't exist in `output_dict_per_obj` for this frame);
440
+ # 2) if specified, rerun the memory encoder after applying non-overlapping constraints
441
+ on the object scores.
442
+ """
443
+ batch_size = self._get_obj_num(inference_state)
444
+ storage_key = "cond_frame_outputs" if is_cond else "non_cond_frame_outputs"
445
+ # Optionally, we allow consolidating the temporary outputs at the original
446
+ # video resolution (to provide a better editing experience for mask prompts).
447
+ if consolidate_at_video_res:
448
+ assert not run_mem_encoder, "memory encoder cannot run at video resolution"
449
+ consolidated_H = inference_state["video_height"]
450
+ consolidated_W = inference_state["video_width"]
451
+ consolidated_mask_key = "pred_masks_video_res"
452
+ else:
453
+ consolidated_H = consolidated_W = self.image_size // 4
454
+ consolidated_mask_key = "pred_masks"
455
+
456
+ # Initialize `consolidated_out`. Its "maskmem_features" and "maskmem_pos_enc"
457
+ # will be added when rerunning the memory encoder after applying non-overlapping
458
+ # constraints to object scores. Its "pred_masks" are prefilled with a large
459
+ # negative value (NO_OBJ_SCORE) to represent missing objects.
460
+ consolidated_out = {
461
+ "maskmem_features": None,
462
+ "maskmem_pos_enc": None,
463
+ consolidated_mask_key: torch.full(
464
+ size=(batch_size, 1, consolidated_H, consolidated_W),
465
+ fill_value=NO_OBJ_SCORE,
466
+ dtype=torch.float32,
467
+ device=inference_state["storage_device"],
468
+ ),
469
+ "obj_ptr": torch.full(
470
+ size=(batch_size, self.hidden_dim),
471
+ fill_value=NO_OBJ_SCORE,
472
+ dtype=torch.float32,
473
+ device=inference_state["device"],
474
+ ),
475
+ "object_score_logits": torch.full(
476
+ size=(batch_size, 1),
477
+ # default to 10.0 for object_score_logits, i.e. assuming the object is
478
+ # present as sigmoid(10)=1, same as in `predict_masks` of `MaskDecoder`
479
+ fill_value=10.0,
480
+ dtype=torch.float32,
481
+ device=inference_state["device"],
482
+ ),
483
+ }
484
+ empty_mask_ptr = None
485
+ for obj_idx in range(batch_size):
486
+ obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
487
+ obj_output_dict = inference_state["output_dict_per_obj"][obj_idx]
488
+ out = obj_temp_output_dict[storage_key].get(frame_idx, None)
489
+ # If the object doesn't appear in "temp_output_dict_per_obj" on this frame,
490
+ # we fall back and look up its previous output in "output_dict_per_obj".
491
+ # We look up both "cond_frame_outputs" and "non_cond_frame_outputs" in
492
+ # "output_dict_per_obj" to find a previous output for this object.
493
+ if out is None:
494
+ out = obj_output_dict["cond_frame_outputs"].get(frame_idx, None)
495
+ if out is None:
496
+ out = obj_output_dict["non_cond_frame_outputs"].get(frame_idx, None)
497
+ # If the object doesn't appear in "output_dict_per_obj" either, we skip it
498
+ # and leave its mask scores to the default scores (i.e. the NO_OBJ_SCORE
499
+ # placeholder above) and set its object pointer to be a dummy pointer.
500
+ if out is None:
501
+ # Fill in dummy object pointers for those objects without any inputs or
502
+ # tracking outcomes on this frame (only do it under `run_mem_encoder=True`,
503
+ # i.e. when we need to build the memory for tracking).
504
+ if run_mem_encoder:
505
+ if empty_mask_ptr is None:
506
+ empty_mask_ptr = self._get_empty_mask_ptr(
507
+ inference_state, frame_idx
508
+ )
509
+ # fill object pointer with a dummy pointer (based on an empty mask)
510
+ consolidated_out["obj_ptr"][obj_idx : obj_idx + 1] = empty_mask_ptr
511
+ continue
512
+ # Add the temporary object output mask to consolidated output mask
513
+ obj_mask = out["pred_masks"]
514
+ consolidated_pred_masks = consolidated_out[consolidated_mask_key]
515
+ if obj_mask.shape[-2:] == consolidated_pred_masks.shape[-2:]:
516
+ consolidated_pred_masks[obj_idx : obj_idx + 1] = obj_mask
517
+ else:
518
+ # Resize first if temporary object mask has a different resolution
519
+ resized_obj_mask = torch.nn.functional.interpolate(
520
+ obj_mask,
521
+ size=consolidated_pred_masks.shape[-2:],
522
+ mode="bilinear",
523
+ align_corners=False,
524
+ )
525
+ consolidated_pred_masks[obj_idx : obj_idx + 1] = resized_obj_mask
526
+ consolidated_out["obj_ptr"][obj_idx : obj_idx + 1] = out["obj_ptr"]
527
+ consolidated_out["object_score_logits"][obj_idx : obj_idx + 1] = out[
528
+ "object_score_logits"
529
+ ]
530
+
531
+ # Optionally, apply non-overlapping constraints on the consolidated scores
532
+ # and rerun the memory encoder
533
+ if run_mem_encoder:
534
+ device = inference_state["device"]
535
+ high_res_masks = torch.nn.functional.interpolate(
536
+ consolidated_out["pred_masks"].to(device, non_blocking=True),
537
+ size=(self.image_size, self.image_size),
538
+ mode="bilinear",
539
+ align_corners=False,
540
+ )
541
+ if self.non_overlap_masks_for_mem_enc:
542
+ high_res_masks = self._apply_non_overlapping_constraints(high_res_masks)
543
+ maskmem_features, maskmem_pos_enc = self._run_memory_encoder(
544
+ inference_state=inference_state,
545
+ frame_idx=frame_idx,
546
+ batch_size=batch_size,
547
+ high_res_masks=high_res_masks,
548
+ object_score_logits=consolidated_out["object_score_logits"],
549
+ is_mask_from_pts=True, # these frames are what the user interacted with
550
+ )
551
+ consolidated_out["maskmem_features"] = maskmem_features
552
+ consolidated_out["maskmem_pos_enc"] = maskmem_pos_enc
553
+
554
+ return consolidated_out
555
+
556
+ def _get_empty_mask_ptr(self, inference_state, frame_idx):
557
+ """Get a dummy object pointer based on an empty mask on the current frame."""
558
+ # A dummy (empty) mask with a single object
559
+ batch_size = 1
560
+ mask_inputs = torch.zeros(
561
+ (batch_size, 1, self.image_size, self.image_size),
562
+ dtype=torch.float32,
563
+ device=inference_state["device"],
564
+ )
565
+
566
+ # Retrieve correct image features
567
+ (
568
+ _,
569
+ _,
570
+ current_vision_feats,
571
+ current_vision_pos_embeds,
572
+ feat_sizes,
573
+ ) = self._get_image_feature(inference_state, frame_idx, batch_size)
574
+
575
+ # Feed the empty mask and image feature above to get a dummy object pointer
576
+ current_out = self.track_step(
577
+ frame_idx=frame_idx,
578
+ is_init_cond_frame=True,
579
+ current_vision_feats=current_vision_feats,
580
+ current_vision_pos_embeds=current_vision_pos_embeds,
581
+ feat_sizes=feat_sizes,
582
+ point_inputs=None,
583
+ mask_inputs=mask_inputs,
584
+ output_dict={},
585
+ num_frames=inference_state["num_frames"],
586
+ track_in_reverse=False,
587
+ run_mem_encoder=False,
588
+ prev_sam_mask_logits=None,
589
+ )
590
+ return current_out["obj_ptr"]
591
+
592
+ @torch.inference_mode()
593
+ def propagate_in_video_preflight(self, inference_state):
594
+ """Prepare inference_state and consolidate temporary outputs before tracking."""
595
+ # Tracking has started and we don't allow adding new objects until session is reset.
596
+ inference_state["tracking_has_started"] = True
597
+ batch_size = self._get_obj_num(inference_state)
598
+
599
+ # Consolidate per-object temporary outputs in "temp_output_dict_per_obj" and
600
+ # add them into "output_dict".
601
+ temp_output_dict_per_obj = inference_state["temp_output_dict_per_obj"]
602
+ output_dict = inference_state["output_dict"]
603
+ # "consolidated_frame_inds" contains indices of those frames where consolidated
604
+ # temporary outputs have been added (either in this call or any previous calls
605
+ # to `propagate_in_video_preflight`).
606
+ consolidated_frame_inds = inference_state["consolidated_frame_inds"]
607
+ for is_cond in [False, True]:
608
+ # Separately consolidate conditioning and non-conditioning temp outputs
609
+ storage_key = "cond_frame_outputs" if is_cond else "non_cond_frame_outputs"
610
+ # Find all the frames that contain temporary outputs for any objects
611
+ # (these should be the frames that have just received clicks or mask inputs
612
+ # via `add_new_points_or_box` or `add_new_mask`)
613
+ temp_frame_inds = set()
614
+ for obj_temp_output_dict in temp_output_dict_per_obj.values():
615
+ temp_frame_inds.update(obj_temp_output_dict[storage_key].keys())
616
+ consolidated_frame_inds[storage_key].update(temp_frame_inds)
617
+ # consolidate the temporary output across all objects on this frame
618
+ for frame_idx in temp_frame_inds:
619
+ consolidated_out = self._consolidate_temp_output_across_obj(
620
+ inference_state, frame_idx, is_cond=is_cond, run_mem_encoder=True
621
+ )
622
+ # merge them into "output_dict" and also create per-object slices
623
+ output_dict[storage_key][frame_idx] = consolidated_out
624
+ self._add_output_per_object(
625
+ inference_state, frame_idx, consolidated_out, storage_key
626
+ )
627
+ clear_non_cond_mem = self.clear_non_cond_mem_around_input and (
628
+ self.clear_non_cond_mem_for_multi_obj or batch_size <= 1
629
+ )
630
+ if clear_non_cond_mem:
631
+ # clear non-conditioning memory of the surrounding frames
632
+ self._clear_non_cond_mem_around_input(inference_state, frame_idx)
633
+
634
+ # clear temporary outputs in `temp_output_dict_per_obj`
635
+ for obj_temp_output_dict in temp_output_dict_per_obj.values():
636
+ obj_temp_output_dict[storage_key].clear()
637
+
638
+ # edge case: if an output is added to "cond_frame_outputs", we remove any prior
639
+ # output on the same frame in "non_cond_frame_outputs"
640
+ for frame_idx in output_dict["cond_frame_outputs"]:
641
+ output_dict["non_cond_frame_outputs"].pop(frame_idx, None)
642
+ for obj_output_dict in inference_state["output_dict_per_obj"].values():
643
+ for frame_idx in obj_output_dict["cond_frame_outputs"]:
644
+ obj_output_dict["non_cond_frame_outputs"].pop(frame_idx, None)
645
+ for frame_idx in consolidated_frame_inds["cond_frame_outputs"]:
646
+ assert frame_idx in output_dict["cond_frame_outputs"]
647
+ consolidated_frame_inds["non_cond_frame_outputs"].discard(frame_idx)
648
+
649
+ # Make sure that the frame indices in "consolidated_frame_inds" are exactly those frames
650
+ # with either points or mask inputs (which should be true under a correct workflow).
651
+ all_consolidated_frame_inds = (
652
+ consolidated_frame_inds["cond_frame_outputs"]
653
+ | consolidated_frame_inds["non_cond_frame_outputs"]
654
+ )
655
+ input_frames_inds = set()
656
+ for point_inputs_per_frame in inference_state["point_inputs_per_obj"].values():
657
+ input_frames_inds.update(point_inputs_per_frame.keys())
658
+ for mask_inputs_per_frame in inference_state["mask_inputs_per_obj"].values():
659
+ input_frames_inds.update(mask_inputs_per_frame.keys())
660
+ assert all_consolidated_frame_inds == input_frames_inds
661
+
662
+ @torch.inference_mode()
663
+ def propagate_in_video(
664
+ self,
665
+ inference_state,
666
+ start_frame_idx=None,
667
+ max_frame_num_to_track=None,
668
+ reverse=False,
669
+ ):
670
+ """Propagate the input points across frames to track in the entire video."""
671
+ self.propagate_in_video_preflight(inference_state)
672
+
673
+ output_dict = inference_state["output_dict"]
674
+ consolidated_frame_inds = inference_state["consolidated_frame_inds"]
675
+ obj_ids = inference_state["obj_ids"]
676
+ num_frames = inference_state["num_frames"]
677
+ batch_size = self._get_obj_num(inference_state)
678
+ if len(output_dict["cond_frame_outputs"]) == 0:
679
+ raise RuntimeError("No points are provided; please add points first")
680
+ clear_non_cond_mem = self.clear_non_cond_mem_around_input and (
681
+ self.clear_non_cond_mem_for_multi_obj or batch_size <= 1
682
+ )
683
+
684
+ # set start index, end index, and processing order
685
+ if start_frame_idx is None:
686
+ # default: start from the earliest frame with input points
687
+ start_frame_idx = min(output_dict["cond_frame_outputs"])
688
+ if max_frame_num_to_track is None:
689
+ # default: track all the frames in the video
690
+ max_frame_num_to_track = num_frames
691
+ if reverse:
692
+ end_frame_idx = max(start_frame_idx - max_frame_num_to_track, 0)
693
+ if start_frame_idx > 0:
694
+ processing_order = range(start_frame_idx, end_frame_idx - 1, -1)
695
+ else:
696
+ processing_order = [] # skip reverse tracking if starting from frame 0
697
+ else:
698
+ end_frame_idx = min(
699
+ start_frame_idx + max_frame_num_to_track, num_frames - 1
700
+ )
701
+ processing_order = range(start_frame_idx, end_frame_idx + 1)
702
+
703
+ for frame_idx in tqdm(processing_order, desc="propagate in video"):
704
+ # We skip those frames already in consolidated outputs (these are frames
705
+ # that received input clicks or mask). Note that we cannot directly run
706
+ # batched forward on them via `_run_single_frame_inference` because the
707
+ # number of clicks on each object might be different.
708
+ if frame_idx in consolidated_frame_inds["cond_frame_outputs"]:
709
+ storage_key = "cond_frame_outputs"
710
+ current_out = output_dict[storage_key][frame_idx]
711
+ pred_masks = current_out["pred_masks"]
712
+ if clear_non_cond_mem:
713
+ # clear non-conditioning memory of the surrounding frames
714
+ self._clear_non_cond_mem_around_input(inference_state, frame_idx)
715
+ elif frame_idx in consolidated_frame_inds["non_cond_frame_outputs"]:
716
+ storage_key = "non_cond_frame_outputs"
717
+ current_out = output_dict[storage_key][frame_idx]
718
+ pred_masks = current_out["pred_masks"]
719
+ else:
720
+ storage_key = "non_cond_frame_outputs"
721
+ current_out, pred_masks = self._run_single_frame_inference(
722
+ inference_state=inference_state,
723
+ output_dict=output_dict,
724
+ frame_idx=frame_idx,
725
+ batch_size=batch_size,
726
+ is_init_cond_frame=False,
727
+ point_inputs=None,
728
+ mask_inputs=None,
729
+ reverse=reverse,
730
+ run_mem_encoder=True,
731
+ )
732
+ output_dict[storage_key][frame_idx] = current_out
733
+ # Create slices of per-object outputs for subsequent interaction with each
734
+ # individual object after tracking.
735
+ self._add_output_per_object(
736
+ inference_state, frame_idx, current_out, storage_key
737
+ )
738
+ inference_state["frames_already_tracked"][frame_idx] = {"reverse": reverse}
739
+
740
+ # Resize the output mask to the original video resolution (we directly use
741
+ # the mask scores on GPU for output to avoid any CPU conversion in between)
742
+ _, video_res_masks = self._get_orig_video_res_output(
743
+ inference_state, pred_masks
744
+ )
745
+ yield frame_idx, obj_ids, video_res_masks
746
+
747
+ def _add_output_per_object(
748
+ self, inference_state, frame_idx, current_out, storage_key
749
+ ):
750
+ """
751
+ Split a multi-object output into per-object output slices and add them into
752
+ `output_dict_per_obj`. The resulting slices share the same tensor storage.
753
+ """
754
+ maskmem_features = current_out["maskmem_features"]
755
+ assert maskmem_features is None or isinstance(maskmem_features, torch.Tensor)
756
+
757
+ maskmem_pos_enc = current_out["maskmem_pos_enc"]
758
+ assert maskmem_pos_enc is None or isinstance(maskmem_pos_enc, list)
759
+
760
+ output_dict_per_obj = inference_state["output_dict_per_obj"]
761
+ for obj_idx, obj_output_dict in output_dict_per_obj.items():
762
+ obj_slice = slice(obj_idx, obj_idx + 1)
763
+ obj_out = {
764
+ "maskmem_features": None,
765
+ "maskmem_pos_enc": None,
766
+ "pred_masks": current_out["pred_masks"][obj_slice],
767
+ "obj_ptr": current_out["obj_ptr"][obj_slice],
768
+ "object_score_logits": current_out["object_score_logits"][obj_slice],
769
+ }
770
+ if maskmem_features is not None:
771
+ obj_out["maskmem_features"] = maskmem_features[obj_slice]
772
+ if maskmem_pos_enc is not None:
773
+ obj_out["maskmem_pos_enc"] = [x[obj_slice] for x in maskmem_pos_enc]
774
+ obj_output_dict[storage_key][frame_idx] = obj_out
775
+
776
+ @torch.inference_mode()
777
+ def clear_all_prompts_in_frame(
778
+ self, inference_state, frame_idx, obj_id, need_output=True
779
+ ):
780
+ """Remove all input points or mask in a specific frame for a given object."""
781
+ obj_idx = self._obj_id_to_idx(inference_state, obj_id)
782
+
783
+ # Clear the conditioning information on the given frame
784
+ inference_state["point_inputs_per_obj"][obj_idx].pop(frame_idx, None)
785
+ inference_state["mask_inputs_per_obj"][obj_idx].pop(frame_idx, None)
786
+
787
+ temp_output_dict_per_obj = inference_state["temp_output_dict_per_obj"]
788
+ temp_output_dict_per_obj[obj_idx]["cond_frame_outputs"].pop(frame_idx, None)
789
+ temp_output_dict_per_obj[obj_idx]["non_cond_frame_outputs"].pop(frame_idx, None)
790
+
791
+ # Check and see if there are still any inputs left on this frame
792
+ batch_size = self._get_obj_num(inference_state)
793
+ frame_has_input = False
794
+ for obj_idx2 in range(batch_size):
795
+ if frame_idx in inference_state["point_inputs_per_obj"][obj_idx2]:
796
+ frame_has_input = True
797
+ break
798
+ if frame_idx in inference_state["mask_inputs_per_obj"][obj_idx2]:
799
+ frame_has_input = True
800
+ break
801
+
802
+ # If this frame has no remaining inputs for any objects, we further clear its
803
+ # conditioning frame status
804
+ if not frame_has_input:
805
+ output_dict = inference_state["output_dict"]
806
+ consolidated_frame_inds = inference_state["consolidated_frame_inds"]
807
+ consolidated_frame_inds["cond_frame_outputs"].discard(frame_idx)
808
+ consolidated_frame_inds["non_cond_frame_outputs"].discard(frame_idx)
809
+ # Remove the frame's conditioning output (possibly downgrading it to non-conditioning)
810
+ out = output_dict["cond_frame_outputs"].pop(frame_idx, None)
811
+ if out is not None:
812
+ # The frame is not a conditioning frame anymore since it's not receiving inputs,
813
+ # so we "downgrade" its output (if exists) to a non-conditioning frame output.
814
+ output_dict["non_cond_frame_outputs"][frame_idx] = out
815
+ inference_state["frames_already_tracked"].pop(frame_idx, None)
816
+ # Similarly, do it for the sliced output on each object.
817
+ for obj_idx2 in range(batch_size):
818
+ obj_output_dict = inference_state["output_dict_per_obj"][obj_idx2]
819
+ obj_out = obj_output_dict["cond_frame_outputs"].pop(frame_idx, None)
820
+ if obj_out is not None:
821
+ obj_output_dict["non_cond_frame_outputs"][frame_idx] = obj_out
822
+
823
+ # If all the conditioning frames have been removed, we also clear the tracking outputs
824
+ if len(output_dict["cond_frame_outputs"]) == 0:
825
+ self._reset_tracking_results(inference_state)
826
+
827
+ if not need_output:
828
+ return
829
+ # Finally, output updated masks per object (after removing the inputs above)
830
+ obj_ids = inference_state["obj_ids"]
831
+ is_cond = any(
832
+ frame_idx in obj_temp_output_dict["cond_frame_outputs"]
833
+ for obj_temp_output_dict in temp_output_dict_per_obj.values()
834
+ )
835
+ consolidated_out = self._consolidate_temp_output_across_obj(
836
+ inference_state,
837
+ frame_idx,
838
+ is_cond=is_cond,
839
+ run_mem_encoder=False,
840
+ consolidate_at_video_res=True,
841
+ )
842
+ _, video_res_masks = self._get_orig_video_res_output(
843
+ inference_state, consolidated_out["pred_masks_video_res"]
844
+ )
845
+ return frame_idx, obj_ids, video_res_masks
846
+
847
+ @torch.inference_mode()
848
+ def reset_state(self, inference_state):
849
+ """Remove all input points or mask in all frames throughout the video."""
850
+ self._reset_tracking_results(inference_state)
851
+ # Remove all object ids
852
+ inference_state["obj_id_to_idx"].clear()
853
+ inference_state["obj_idx_to_id"].clear()
854
+ inference_state["obj_ids"].clear()
855
+ inference_state["point_inputs_per_obj"].clear()
856
+ inference_state["mask_inputs_per_obj"].clear()
857
+ inference_state["output_dict_per_obj"].clear()
858
+ inference_state["temp_output_dict_per_obj"].clear()
859
+
860
+ def _reset_tracking_results(self, inference_state):
861
+ """Reset all tracking inputs and results across the videos."""
862
+ for v in inference_state["point_inputs_per_obj"].values():
863
+ v.clear()
864
+ for v in inference_state["mask_inputs_per_obj"].values():
865
+ v.clear()
866
+ for v in inference_state["output_dict_per_obj"].values():
867
+ v["cond_frame_outputs"].clear()
868
+ v["non_cond_frame_outputs"].clear()
869
+ for v in inference_state["temp_output_dict_per_obj"].values():
870
+ v["cond_frame_outputs"].clear()
871
+ v["non_cond_frame_outputs"].clear()
872
+ inference_state["output_dict"]["cond_frame_outputs"].clear()
873
+ inference_state["output_dict"]["non_cond_frame_outputs"].clear()
874
+ inference_state["consolidated_frame_inds"]["cond_frame_outputs"].clear()
875
+ inference_state["consolidated_frame_inds"]["non_cond_frame_outputs"].clear()
876
+ inference_state["tracking_has_started"] = False
877
+ inference_state["frames_already_tracked"].clear()
878
+
879
+ def _get_image_feature(self, inference_state, frame_idx, batch_size):
880
+ """Compute the image features on a given frame."""
881
+ # Look up in the cache first
882
+ image, backbone_out = inference_state["cached_features"].get(
883
+ frame_idx, (None, None)
884
+ )
885
+ if backbone_out is None:
886
+ # Cache miss -- we will run inference on a single image
887
+ device = inference_state["device"]
888
+ image = inference_state["images"][frame_idx].to(device).float().unsqueeze(0)
889
+ backbone_out = self.forward_image(image)
890
+ # Cache the most recent frame's feature (for repeated interactions with
891
+ # a frame; we can use an LRU cache for more frames in the future).
892
+ inference_state["cached_features"] = {frame_idx: (image, backbone_out)}
893
+
894
+ # expand the features to have the same dimension as the number of objects
895
+ expanded_image = image.expand(batch_size, -1, -1, -1)
896
+ expanded_backbone_out = {
897
+ "backbone_fpn": backbone_out["backbone_fpn"].copy(),
898
+ "vision_pos_enc": backbone_out["vision_pos_enc"].copy(),
899
+ }
900
+ for i, feat in enumerate(expanded_backbone_out["backbone_fpn"]):
901
+ expanded_backbone_out["backbone_fpn"][i] = feat.expand(
902
+ batch_size, -1, -1, -1
903
+ )
904
+ for i, pos in enumerate(expanded_backbone_out["vision_pos_enc"]):
905
+ pos = pos.expand(batch_size, -1, -1, -1)
906
+ expanded_backbone_out["vision_pos_enc"][i] = pos
907
+
908
+ features = self._prepare_backbone_features(expanded_backbone_out)
909
+ features = (expanded_image,) + features
910
+ return features
911
+
912
+ def _run_single_frame_inference(
913
+ self,
914
+ inference_state,
915
+ output_dict,
916
+ frame_idx,
917
+ batch_size,
918
+ is_init_cond_frame,
919
+ point_inputs,
920
+ mask_inputs,
921
+ reverse,
922
+ run_mem_encoder,
923
+ prev_sam_mask_logits=None,
924
+ ):
925
+ """Run tracking on a single frame based on current inputs and previous memory."""
926
+ # Retrieve correct image features
927
+ (
928
+ _,
929
+ _,
930
+ current_vision_feats,
931
+ current_vision_pos_embeds,
932
+ feat_sizes,
933
+ ) = self._get_image_feature(inference_state, frame_idx, batch_size)
934
+
935
+ # point and mask should not appear as input simultaneously on the same frame
936
+ assert point_inputs is None or mask_inputs is None
937
+ current_out = self.track_step(
938
+ frame_idx=frame_idx,
939
+ is_init_cond_frame=is_init_cond_frame,
940
+ current_vision_feats=current_vision_feats,
941
+ current_vision_pos_embeds=current_vision_pos_embeds,
942
+ feat_sizes=feat_sizes,
943
+ point_inputs=point_inputs,
944
+ mask_inputs=mask_inputs,
945
+ output_dict=output_dict,
946
+ num_frames=inference_state["num_frames"],
947
+ track_in_reverse=reverse,
948
+ run_mem_encoder=run_mem_encoder,
949
+ prev_sam_mask_logits=prev_sam_mask_logits,
950
+ )
951
+
952
+ # optionally offload the output to CPU memory to save GPU space
953
+ storage_device = inference_state["storage_device"]
954
+ maskmem_features = current_out["maskmem_features"]
955
+ if maskmem_features is not None:
956
+ maskmem_features = maskmem_features.to(torch.bfloat16)
957
+ maskmem_features = maskmem_features.to(storage_device, non_blocking=True)
958
+ pred_masks_gpu = current_out["pred_masks"]
959
+ # potentially fill holes in the predicted masks
960
+ if self.fill_hole_area > 0:
961
+ pred_masks_gpu = fill_holes_in_mask_scores(
962
+ pred_masks_gpu, self.fill_hole_area
963
+ )
964
+ pred_masks = pred_masks_gpu.to(storage_device, non_blocking=True)
965
+ # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
966
+ maskmem_pos_enc = self._get_maskmem_pos_enc(inference_state, current_out)
967
+ # object pointer is a small tensor, so we always keep it on GPU memory for fast access
968
+ obj_ptr = current_out["obj_ptr"]
969
+ object_score_logits = current_out["object_score_logits"]
970
+ # make a compact version of this frame's output to reduce the state size
971
+ compact_current_out = {
972
+ "maskmem_features": maskmem_features,
973
+ "maskmem_pos_enc": maskmem_pos_enc,
974
+ "pred_masks": pred_masks,
975
+ "obj_ptr": obj_ptr,
976
+ "object_score_logits": object_score_logits,
977
+ }
978
+ return compact_current_out, pred_masks_gpu
979
+
980
+ def _run_memory_encoder(
981
+ self,
982
+ inference_state,
983
+ frame_idx,
984
+ batch_size,
985
+ high_res_masks,
986
+ object_score_logits,
987
+ is_mask_from_pts,
988
+ ):
989
+ """
990
+ Run the memory encoder on `high_res_masks`. This is usually after applying
991
+ non-overlapping constraints to object scores. Since their scores have changed, their
992
+ memory also needs to be recomputed with the memory encoder.
993
+ """
994
+ # Retrieve correct image features
995
+ _, _, current_vision_feats, _, feat_sizes = self._get_image_feature(
996
+ inference_state, frame_idx, batch_size
997
+ )
998
+ maskmem_features, maskmem_pos_enc = self._encode_new_memory(
999
+ current_vision_feats=current_vision_feats,
1000
+ feat_sizes=feat_sizes,
1001
+ pred_masks_high_res=high_res_masks,
1002
+ object_score_logits=object_score_logits,
1003
+ is_mask_from_pts=is_mask_from_pts,
1004
+ )
1005
+
1006
+ # optionally offload the output to CPU memory to save GPU space
1007
+ storage_device = inference_state["storage_device"]
1008
+ maskmem_features = maskmem_features.to(torch.bfloat16)
1009
+ maskmem_features = maskmem_features.to(storage_device, non_blocking=True)
1010
+ # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
1011
+ maskmem_pos_enc = self._get_maskmem_pos_enc(
1012
+ inference_state, {"maskmem_pos_enc": maskmem_pos_enc}
1013
+ )
1014
+ return maskmem_features, maskmem_pos_enc
1015
+
1016
+ def _get_maskmem_pos_enc(self, inference_state, current_out):
1017
+ """
1018
+ `maskmem_pos_enc` is the same across frames and objects, so we cache it as
1019
+ a constant in the inference session to reduce session storage size.
1020
+ """
1021
+ model_constants = inference_state["constants"]
1022
+ # "out_maskmem_pos_enc" should be either a list of tensors or None
1023
+ out_maskmem_pos_enc = current_out["maskmem_pos_enc"]
1024
+ if out_maskmem_pos_enc is not None:
1025
+ if "maskmem_pos_enc" not in model_constants:
1026
+ assert isinstance(out_maskmem_pos_enc, list)
1027
+ # only take the slice for one object, since it's same across objects
1028
+ maskmem_pos_enc = [x[0:1].clone() for x in out_maskmem_pos_enc]
1029
+ model_constants["maskmem_pos_enc"] = maskmem_pos_enc
1030
+ else:
1031
+ maskmem_pos_enc = model_constants["maskmem_pos_enc"]
1032
+ # expand the cached maskmem_pos_enc to the actual batch size
1033
+ batch_size = out_maskmem_pos_enc[0].size(0)
1034
+ expanded_maskmem_pos_enc = [
1035
+ x.expand(batch_size, -1, -1, -1) for x in maskmem_pos_enc
1036
+ ]
1037
+ else:
1038
+ expanded_maskmem_pos_enc = None
1039
+ return expanded_maskmem_pos_enc
1040
+
1041
+ @torch.inference_mode()
1042
+ def remove_object(self, inference_state, obj_id, strict=False, need_output=True):
1043
+ """
1044
+ Remove an object id from the tracking state. If strict is True, we check whether
1045
+ the object id actually exists and raise an error if it doesn't exist.
1046
+ """
1047
+ old_obj_idx_to_rm = inference_state["obj_id_to_idx"].get(obj_id, None)
1048
+ updated_frames = []
1049
+ # Check whether this object_id to remove actually exists and possibly raise an error.
1050
+ if old_obj_idx_to_rm is None:
1051
+ if not strict:
1052
+ return inference_state["obj_ids"], updated_frames
1053
+ raise RuntimeError(
1054
+ f"Cannot remove object id {obj_id} as it doesn't exist. "
1055
+ f"All existing object ids: {inference_state['obj_ids']}."
1056
+ )
1057
+
1058
+ # If this is the only remaining object id, we simply reset the state.
1059
+ if len(inference_state["obj_id_to_idx"]) == 1:
1060
+ self.reset_state(inference_state)
1061
+ return inference_state["obj_ids"], updated_frames
1062
+
1063
+ # There are still remaining objects after removing this object id. In this case,
1064
+ # we need to delete the object storage from inference state tensors.
1065
+ # Step 0: clear the input on those frames where this object id has point or mask input
1066
+ # (note that this step is required as it might downgrade conditioning frames to
1067
+ # non-conditioning ones)
1068
+ obj_input_frames_inds = set()
1069
+ obj_input_frames_inds.update(
1070
+ inference_state["point_inputs_per_obj"][old_obj_idx_to_rm]
1071
+ )
1072
+ obj_input_frames_inds.update(
1073
+ inference_state["mask_inputs_per_obj"][old_obj_idx_to_rm]
1074
+ )
1075
+ for frame_idx in obj_input_frames_inds:
1076
+ self.clear_all_prompts_in_frame(
1077
+ inference_state, frame_idx, obj_id, need_output=False
1078
+ )
1079
+
1080
+ # Step 1: Update the object id mapping (note that it must be done after Step 0,
1081
+ # since Step 0 still requires the old object id mappings in inference_state)
1082
+ old_obj_ids = inference_state["obj_ids"]
1083
+ old_obj_inds = list(range(len(old_obj_ids)))
1084
+ remain_old_obj_inds = old_obj_inds.copy()
1085
+ remain_old_obj_inds.remove(old_obj_idx_to_rm)
1086
+ new_obj_ids = [old_obj_ids[old_idx] for old_idx in remain_old_obj_inds]
1087
+ new_obj_inds = list(range(len(new_obj_ids)))
1088
+ # build new mappings
1089
+ old_idx_to_new_idx = dict(zip(remain_old_obj_inds, new_obj_inds))
1090
+ inference_state["obj_id_to_idx"] = dict(zip(new_obj_ids, new_obj_inds))
1091
+ inference_state["obj_idx_to_id"] = dict(zip(new_obj_inds, new_obj_ids))
1092
+ inference_state["obj_ids"] = new_obj_ids
1093
+
1094
+ # Step 2: For per-object tensor storage, we shift their obj_idx in the dict keys.
1095
+ # (note that "consolidated_frame_inds" doesn't need to be updated in this step as
1096
+ # it's already handled in Step 0)
1097
+ def _map_keys(container):
1098
+ new_kvs = []
1099
+ for k in old_obj_inds:
1100
+ v = container.pop(k)
1101
+ if k in old_idx_to_new_idx:
1102
+ new_kvs.append((old_idx_to_new_idx[k], v))
1103
+ container.update(new_kvs)
1104
+
1105
+ _map_keys(inference_state["point_inputs_per_obj"])
1106
+ _map_keys(inference_state["mask_inputs_per_obj"])
1107
+ _map_keys(inference_state["output_dict_per_obj"])
1108
+ _map_keys(inference_state["temp_output_dict_per_obj"])
1109
+
1110
+ # Step 3: For packed tensor storage, we index the remaining ids and rebuild the per-object slices.
1111
+ def _slice_state(output_dict, storage_key):
1112
+ for frame_idx, out in output_dict[storage_key].items():
1113
+ out["maskmem_features"] = out["maskmem_features"][remain_old_obj_inds]
1114
+ out["maskmem_pos_enc"] = [
1115
+ x[remain_old_obj_inds] for x in out["maskmem_pos_enc"]
1116
+ ]
1117
+ # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
1118
+ out["maskmem_pos_enc"] = self._get_maskmem_pos_enc(inference_state, out)
1119
+ out["pred_masks"] = out["pred_masks"][remain_old_obj_inds]
1120
+ out["obj_ptr"] = out["obj_ptr"][remain_old_obj_inds]
1121
+ out["object_score_logits"] = out["object_score_logits"][
1122
+ remain_old_obj_inds
1123
+ ]
1124
+ # also update the per-object slices
1125
+ self._add_output_per_object(
1126
+ inference_state, frame_idx, out, storage_key
1127
+ )
1128
+
1129
+ _slice_state(inference_state["output_dict"], "cond_frame_outputs")
1130
+ _slice_state(inference_state["output_dict"], "non_cond_frame_outputs")
1131
+
1132
+ # Step 4: Further collect the outputs on those frames in `obj_input_frames_inds`, which
1133
+ # could show an updated mask for objects previously occluded by the object being removed
1134
+ if need_output:
1135
+ temp_output_dict_per_obj = inference_state["temp_output_dict_per_obj"]
1136
+ for frame_idx in obj_input_frames_inds:
1137
+ is_cond = any(
1138
+ frame_idx in obj_temp_output_dict["cond_frame_outputs"]
1139
+ for obj_temp_output_dict in temp_output_dict_per_obj.values()
1140
+ )
1141
+ consolidated_out = self._consolidate_temp_output_across_obj(
1142
+ inference_state,
1143
+ frame_idx,
1144
+ is_cond=is_cond,
1145
+ run_mem_encoder=False,
1146
+ consolidate_at_video_res=True,
1147
+ )
1148
+ _, video_res_masks = self._get_orig_video_res_output(
1149
+ inference_state, consolidated_out["pred_masks_video_res"]
1150
+ )
1151
+ updated_frames.append((frame_idx, video_res_masks))
1152
+
1153
+ return inference_state["obj_ids"], updated_frames
1154
+
1155
+ def _clear_non_cond_mem_around_input(self, inference_state, frame_idx):
1156
+ """
1157
+ Remove the non-conditioning memory around the input frame. When users provide
1158
+ correction clicks, the surrounding frames' non-conditioning memories can still
1159
+ contain outdated object appearance information and could confuse the model.
1160
+
1161
+ This method clears those non-conditioning memories surrounding the interacted
1162
+ frame to avoid giving the model both old and new information about the object.
1163
+ """
1164
+ r = self.memory_temporal_stride_for_eval
1165
+ frame_idx_begin = frame_idx - r * self.num_maskmem
1166
+ frame_idx_end = frame_idx + r * self.num_maskmem
1167
+ output_dict = inference_state["output_dict"]
1168
+ non_cond_frame_outputs = output_dict["non_cond_frame_outputs"]
1169
+ for t in range(frame_idx_begin, frame_idx_end + 1):
1170
+ non_cond_frame_outputs.pop(t, None)
1171
+ for obj_output_dict in inference_state["output_dict_per_obj"].values():
1172
+ obj_output_dict["non_cond_frame_outputs"].pop(t, None)
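The methods above make up the interactive video API of `SAM2VideoPredictor`: `add_new_mask` (and the earlier `add_new_points_or_box`) attach prompts to a frame, `propagate_in_video` consolidates the prompts and then tracks through the video, and `clear_all_prompts_in_frame` / `remove_object` / `reset_state` undo prompts, objects, or the whole session. As a rough orientation only, here is a minimal, hedged usage sketch; it assumes the predictor is built with `build_sam2_video_predictor` from `sam2/build_sam.py`, that `init_state` (defined earlier in this file) is given a directory of JPEG frames, and that the config name, checkpoint path, frame directory, and the prompt mask are placeholders rather than values taken from this repository's demos.

```python
# Minimal sketch (not one of this repo's demo scripts): prompt frame 0 of a video
# with a binary mask for one object, then propagate it through all frames.
import numpy as np

from sam2.build_sam import build_sam2_video_predictor

# Assumed config name and checkpoint path -- adjust to your local setup.
predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt"
)

# init_state loads a folder of JPEG frames and prepares the inference_state dict.
state = predictor.init_state(video_path="./video_frames_dir")

# Attach a user-provided boolean mask (H, W) for object id 1 on frame 0.
# The placeholder below is all-background; a real prompt would mark the object.
first_frame_mask = np.zeros((720, 1280), dtype=bool)
_, obj_ids, masks = predictor.add_new_mask(
    inference_state=state, frame_idx=0, obj_id=1, mask=first_frame_mask
)

# Propagate across the whole video; each iteration yields masks at video resolution.
video_segments = {}
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
    video_segments[frame_idx] = {
        obj_id: (masks[i] > 0.0).cpu().numpy() for i, obj_id in enumerate(obj_ids)
    }

predictor.reset_state(state)  # clear all prompts and results when done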
clone-IDEA-Research/Grounded-SAM-2/setup.py ADDED
@@ -0,0 +1,174 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+
4
+ # This source code is licensed under the license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+ import os
7
+
8
+ from setuptools import find_packages, setup
9
+
10
+ # Package metadata
11
+ NAME = "SAM-2"
12
+ VERSION = "1.0"
13
+ DESCRIPTION = "SAM 2: Segment Anything in Images and Videos"
14
+ URL = "https://github.com/facebookresearch/sam2"
15
+ AUTHOR = "Meta AI"
16
+ AUTHOR_EMAIL = "[email protected]"
17
+ LICENSE = "Apache 2.0"
18
+
19
+ # Read the contents of README file
20
+ with open("README.md", "r", encoding="utf-8") as f:
21
+ LONG_DESCRIPTION = f.read()
22
+
23
+ # Required dependencies
24
+ REQUIRED_PACKAGES = [
25
+ "torch>=2.3.1",
26
+ "torchvision>=0.18.1",
27
+ "numpy>=1.24.4",
28
+ "tqdm>=4.66.1",
29
+ "hydra-core>=1.3.2",
30
+ "iopath>=0.1.10",
31
+ "pillow>=9.4.0",
32
+ ]
33
+
34
+ EXTRA_PACKAGES = {
35
+ "notebooks": [
36
+ "matplotlib>=3.9.1",
37
+ "jupyter>=1.0.0",
38
+ "opencv-python>=4.7.0",
39
+ "eva-decord>=0.6.1",
40
+ ],
41
+ "interactive-demo": [
42
+ "Flask>=3.0.3",
43
+ "Flask-Cors>=5.0.0",
44
+ "av>=13.0.0",
45
+ "dataclasses-json>=0.6.7",
46
+ "eva-decord>=0.6.1",
47
+ "gunicorn>=23.0.0",
48
+ "imagesize>=1.4.1",
49
+ "pycocotools>=2.0.8",
50
+ "strawberry-graphql>=0.243.0",
51
+ ],
52
+ "dev": [
53
+ "black==24.2.0",
54
+ "usort==1.0.2",
55
+ "ufmt==2.0.0b2",
56
+ "fvcore>=0.1.5.post20221221",
57
+ "pandas>=2.2.2",
58
+ "scikit-image>=0.24.0",
59
+ "tensorboard>=2.17.0",
60
+ "pycocotools>=2.0.8",
61
+ "tensordict>=0.5.0",
62
+ "opencv-python>=4.7.0",
63
+ "submitit>=1.5.1",
64
+ ],
65
+ }
66
+
67
+ # By default, we also build the SAM 2 CUDA extension.
68
+ # You may turn off CUDA build with `export SAM2_BUILD_CUDA=0`.
69
+ BUILD_CUDA = os.getenv("SAM2_BUILD_CUDA", "1") == "1"
70
+ # By default, we allow SAM 2 installation to proceed even with build errors.
71
+ # You may force stopping on errors with `export SAM2_BUILD_ALLOW_ERRORS=0`.
72
+ BUILD_ALLOW_ERRORS = os.getenv("SAM2_BUILD_ALLOW_ERRORS", "1") == "1"
73
+
74
+ # Catch and skip errors during extension building and print a warning message
75
+ # (note that this message only shows up under verbose build mode
76
+ # "pip install -v -e ." or "python setup.py build_ext -v")
77
+ CUDA_ERROR_MSG = (
78
+ "{}\n\n"
79
+ "Failed to build the SAM 2 CUDA extension due to the error above. "
80
+ "You can still use SAM 2 and it's OK to ignore the error above, although some "
81
+ "post-processing functionality may be limited (which doesn't affect the results in most cases; "
82
+ "(see https://github.com/facebookresearch/sam2/blob/main/INSTALL.md).\n"
83
+ )
84
+
85
+
86
+ def get_extensions():
87
+ if not BUILD_CUDA:
88
+ return []
89
+
90
+ try:
91
+ from torch.utils.cpp_extension import CUDAExtension
92
+
93
+ srcs = ["sam2/csrc/connected_components.cu"]
94
+ compile_args = {
95
+ "cxx": [],
96
+ "nvcc": [
97
+ "-DCUDA_HAS_FP16=1",
98
+ "-D__CUDA_NO_HALF_OPERATORS__",
99
+ "-D__CUDA_NO_HALF_CONVERSIONS__",
100
+ "-D__CUDA_NO_HALF2_OPERATORS__",
101
+ ],
102
+ }
103
+ ext_modules = [CUDAExtension("sam2._C", srcs, extra_compile_args=compile_args)]
104
+ except Exception as e:
105
+ if BUILD_ALLOW_ERRORS:
106
+ print(CUDA_ERROR_MSG.format(e))
107
+ ext_modules = []
108
+ else:
109
+ raise e
110
+
111
+ return ext_modules
112
+
113
+
114
+ try:
115
+ from torch.utils.cpp_extension import BuildExtension
116
+
117
+ class BuildExtensionIgnoreErrors(BuildExtension):
118
+
119
+ def finalize_options(self):
120
+ try:
121
+ super().finalize_options()
122
+ except Exception as e:
123
+ print(CUDA_ERROR_MSG.format(e))
124
+ self.extensions = []
125
+
126
+ def build_extensions(self):
127
+ try:
128
+ super().build_extensions()
129
+ except Exception as e:
130
+ print(CUDA_ERROR_MSG.format(e))
131
+ self.extensions = []
132
+
133
+ def get_ext_filename(self, ext_name):
134
+ try:
135
+ return super().get_ext_filename(ext_name)
136
+ except Exception as e:
137
+ print(CUDA_ERROR_MSG.format(e))
138
+ self.extensions = []
139
+ return "_C.so"
140
+
141
+ cmdclass = {
142
+ "build_ext": (
143
+ BuildExtensionIgnoreErrors.with_options(no_python_abi_suffix=True)
144
+ if BUILD_ALLOW_ERRORS
145
+ else BuildExtension.with_options(no_python_abi_suffix=True)
146
+ )
147
+ }
148
+ except Exception as e:
149
+ cmdclass = {}
150
+ if BUILD_ALLOW_ERRORS:
151
+ print(CUDA_ERROR_MSG.format(e))
152
+ else:
153
+ raise e
154
+
155
+
156
+ # Setup configuration
157
+ setup(
158
+ name=NAME,
159
+ version=VERSION,
160
+ description=DESCRIPTION,
161
+ long_description=LONG_DESCRIPTION,
162
+ long_description_content_type="text/markdown",
163
+ url=URL,
164
+ author=AUTHOR,
165
+ author_email=AUTHOR_EMAIL,
166
+ license=LICENSE,
167
+ packages=find_packages(exclude="notebooks"),
168
+ include_package_data=True,
169
+ install_requires=REQUIRED_PACKAGES,
170
+ extras_require=EXTRA_PACKAGES,
171
+ python_requires=">=3.10.0",
172
+ ext_modules=get_extensions(),
173
+ cmdclass=cmdclass,
174
+ )
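Note that the build behaviour of this `setup.py` is driven by two environment variables read at install time: `SAM2_BUILD_CUDA` (default `"1"`) toggles building the optional `sam2._C` CUDA extension, and `SAM2_BUILD_ALLOW_ERRORS` (default `"1"`) decides whether a failed extension build aborts the install or is only warned about. Below is a hedged sketch of driving an editable install with these toggles from Python, equivalent to exporting the variables before `pip install -e .`; the checkout path is a placeholder.

```python
# Sketch: install the package in editable mode while skipping the optional
# CUDA post-processing extension (SAM2_BUILD_CUDA=0). Path is a placeholder.
import os
import subprocess
import sys

env = os.environ.copy()
env["SAM2_BUILD_CUDA"] = "0"          # skip building sam2._C entirely
env["SAM2_BUILD_ALLOW_ERRORS"] = "1"  # tolerate extension build errors (default)

subprocess.run(
    [sys.executable, "-m", "pip", "install", "-v", "-e", "."],
    cwd="/path/to/Grounded-SAM-2",  # placeholder checkout location
    env=env,
    check=True,
)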
clone-IDEA-Research/Grounded-Segment-Anything/.gitignore ADDED
@@ -0,0 +1,135 @@
1
+ # Byte-compiled / optimized / DLL files
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+
6
+ # C extensions
7
+ *.so
8
+
9
+ # Distribution / packaging
10
+ .Python
11
+ build/
12
+ develop-eggs/
13
+ dist/
14
+ downloads/
15
+ eggs/
16
+ .eggs/
17
+ lib/
18
+ lib64/
19
+ parts/
20
+ sdist/
21
+ var/
22
+ wheels/
23
+ pip-wheel-metadata/
24
+ share/python-wheels/
25
+ *.egg-info/
26
+ .installed.cfg
27
+ *.egg
28
+ MANIFEST
29
+
30
+ # PyInstaller
31
+ # Usually these files are written by a python script from a template
32
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
33
+ *.manifest
34
+ *.spec
35
+
36
+ # Installer logs
37
+ pip-log.txt
38
+ pip-delete-this-directory.txt
39
+
40
+ # Unit test / coverage reports
41
+ htmlcov/
42
+ .tox/
43
+ .nox/
44
+ .coverage
45
+ .coverage.*
46
+ .cache
47
+ nosetests.xml
48
+ coverage.xml
49
+ *.cover
50
+ *.py,cover
51
+ .hypothesis/
52
+ .pytest_cache/
53
+
54
+ # Translations
55
+ *.mo
56
+ *.pot
57
+
58
+ # Django stuff:
59
+ *.log
60
+ local_settings.py
61
+ db.sqlite3
62
+ db.sqlite3-journal
63
+
64
+ # Flask stuff:
65
+ instance/
66
+ .webassets-cache
67
+
68
+ # Scrapy stuff:
69
+ .scrapy
70
+
71
+ # Sphinx documentation
72
+ docs/_build/
73
+
74
+ # PyBuilder
75
+ target/
76
+
77
+ # Jupyter Notebook
78
+ .ipynb_checkpoints
79
+
80
+ # IPython
81
+ profile_default/
82
+ ipython_config.py
83
+
84
+ # pyenv
85
+ .python-version
86
+
87
+ # pipenv
88
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
90
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
91
+ # install all needed dependencies.
92
+ #Pipfile.lock
93
+
94
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95
+ __pypackages__/
96
+
97
+ # Celery stuff
98
+ celerybeat-schedule
99
+ celerybeat.pid
100
+
101
+ # SageMath parsed files
102
+ *.sage.py
103
+
104
+ # Environments
105
+ .env
106
+ .venv
107
+ env/
108
+ venv/
109
+ ENV/
110
+ env.bak/
111
+ venv.bak/
112
+
113
+ # Spyder project settings
114
+ .spyderproject
115
+ .spyproject
116
+
117
+ # Rope project settings
118
+ .ropeproject
119
+
120
+ # mkdocs documentation
121
+ /site
122
+
123
+ # mypy
124
+ .mypy_cache/
125
+ .dmypy.json
126
+ dmypy.json
127
+
128
+ # Pyre type checker
129
+ .pyre/
130
+
131
+ # checkpoint
132
+ *.pth
133
+ outputs/
134
+
135
+ .idea/
clone-IDEA-Research/Grounded-Segment-Anything/.gitmodules ADDED
@@ -0,0 +1,7 @@
+
+ [submodule "grounded-sam-osx"]
+ path = grounded-sam-osx
+ url = https://github.com/linjing7/grounded-sam-osx.git
+ [submodule "VISAM"]
+ path = VISAM
+ url = https://github.com/BingfengYan/VISAM
clone-IDEA-Research/Grounded-Segment-Anything/CITATION.cff ADDED
@@ -0,0 +1,8 @@
+ cff-version: 1.2.0
+ message: "If you use this software, please cite it as below."
+ authors:
+ - name: "Grounded-SAM Contributors"
+ title: "Grounded-Segment-Anything"
+ date-released: 2023-04-06
+ url: "https://github.com/IDEA-Research/Grounded-Segment-Anything"
+ license: Apache-2.0
clone-IDEA-Research/Grounded-Segment-Anything/Dockerfile ADDED
@@ -0,0 +1,30 @@
+ FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
+
+ # Arguments to build Docker Image using CUDA
+ ARG USE_CUDA=0
+ ARG TORCH_ARCH=
+
+ ENV AM_I_DOCKER True
+ ENV BUILD_WITH_CUDA "${USE_CUDA}"
+ ENV TORCH_CUDA_ARCH_LIST "${TORCH_ARCH}"
+ ENV CUDA_HOME /usr/local/cuda-11.6/
+
+ RUN mkdir -p /home/appuser/Grounded-Segment-Anything
+ COPY . /home/appuser/Grounded-Segment-Anything/
+
+ RUN apt-get update && apt-get install --no-install-recommends wget ffmpeg=7:* \
+ libsm6=2:* libxext6=2:* git=1:* nano=2.* \
+ vim=2:* -y \
+ && apt-get clean && apt-get autoremove && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /home/appuser/Grounded-Segment-Anything
+ RUN python -m pip install --no-cache-dir -e segment_anything
+
+ # When using build isolation, PyTorch with newer CUDA is installed and can't compile GroundingDINO
+ RUN python -m pip install --no-cache-dir wheel
+ RUN python -m pip install --no-cache-dir --no-build-isolation -e GroundingDINO
+
+ WORKDIR /home/appuser
+ RUN pip install --no-cache-dir diffusers[torch]==0.15.1 opencv-python==4.7.0.72 \
+ pycocotools==2.0.6 matplotlib==3.5.3 \
+ onnxruntime==1.14.1 onnx==1.13.1 ipykernel==6.16.2 scipy gradio openai
clone-IDEA-Research/Grounded-Segment-Anything/LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright 2020 - present, IDEA, Inc
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
clone-IDEA-Research/Grounded-Segment-Anything/Makefile ADDED
@@ -0,0 +1,43 @@
1
+ # Get version of CUDA and enable it for compilation if CUDA > 11.0
2
+ # This solves https://github.com/IDEA-Research/Grounded-Segment-Anything/issues/53
3
+ # and https://github.com/IDEA-Research/Grounded-Segment-Anything/issues/84
4
+ # when running in Docker
5
+ # Check if nvcc is installed
6
+ NVCC := $(shell which nvcc)
7
+ ifeq ($(NVCC),)
8
+ # NVCC not found
9
+ USE_CUDA := 0
10
+ NVCC_VERSION := "not installed"
11
+ else
12
+ NVCC_VERSION := $(shell nvcc --version | grep -oP 'release \K[0-9.]+')
13
+ USE_CUDA := $(shell echo "$(NVCC_VERSION) > 11" | bc -l)
14
+ endif
15
+
16
+ # Add the list of supported ARCHs
17
+ ifeq ($(USE_CUDA), 1)
18
+ TORCH_CUDA_ARCH_LIST := "3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
19
+ BUILD_MESSAGE := "I will try to build the image with CUDA support"
20
+ else
21
+ TORCH_CUDA_ARCH_LIST :=
22
+ BUILD_MESSAGE := "CUDA $(NVCC_VERSION) is not supported"
23
+ endif
24
+
25
+
26
+ build-image:
27
+ @echo $(BUILD_MESSAGE)
28
+ docker build --build-arg USE_CUDA=$(USE_CUDA) \
29
+ --build-arg TORCH_ARCH=$(TORCH_CUDA_ARCH_LIST) \
30
+ -t gsa:v0 .
31
+ run:
32
+ ifeq (,$(wildcard ./sam_vit_h_4b8939.pth))
33
+ wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
34
+ endif
35
+ ifeq (,$(wildcard ./groundingdino_swint_ogc.pth))
36
+ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
37
+ endif
38
+ docker run --gpus all -it --rm --net=host --privileged \
39
+ -v /tmp/.X11-unix:/tmp/.X11-unix \
40
+ -v "${PWD}":/home/appuser/Grounded-Segment-Anything \
41
+ -e DISPLAY=$$DISPLAY \
42
+ --name=gsa \
43
+ --ipc=host gsa:v0
clone-IDEA-Research/Grounded-Segment-Anything/README.md ADDED
@@ -0,0 +1,808 @@
1
+ ![](./assets/Grounded-SAM_logo.png)
2
+
3
+ # Grounded-Segment-Anything
4
+ [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://youtu.be/oEQYStnF2l8) [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/automated-dataset-annotation-and-evaluation-with-grounding-dino-and-sam.ipynb) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/camenduru/grounded-segment-anything-colab) [![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Space-cyan.svg)](https://huggingface.co/spaces/IDEA-Research/Grounded-SAM) [![Replicate](https://replicate.com/cjwbw/grounded-recognize-anything/badge)](https://replicate.com/cjwbw/grounded-recognize-anything) [![ModelScope Official Demo](https://img.shields.io/badge/ModelScope-Official%20Demo-important)](https://modelscope.cn/studios/tuofeilunhifi/Grounded-Segment-Anything/summary) [![Huggingface Demo by Community](https://img.shields.io/badge/Huggingface-Demo%20by%20Community-red)](https://huggingface.co/spaces/yizhangliu/Grounded-Segment-Anything) [![Stable-Diffusion WebUI](https://img.shields.io/badge/Stable--Diffusion-WebUI%20by%20Community-critical)](https://github.com/continue-revolution/sd-webui-segment-anything) [![Jupyter Notebook Demo](https://img.shields.io/badge/Demo-Jupyter%20Notebook-informational)](./grounded_sam.ipynb) [![Static Badge](https://img.shields.io/badge/GroundingDINO-arXiv-blue)](https://arxiv.org/abs/2303.05499) [![Static Badge](https://img.shields.io/badge/Segment_Anything-arXiv-blue)](https://arxiv.org/abs/2304.02643) [![Static Badge](https://img.shields.io/badge/Grounded_SAM-arXiv-blue)](https://arxiv.org/abs/2401.14159)
5
+
6
+
7
+ This project combines [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO) and [Segment Anything](https://github.com/facebookresearch/segment-anything) into a very interesting demo which aims to detect and segment anything with text inputs! We will continue to improve it and create more interesting demos based on this foundation. We have also released an overall technical report about our project on arXiv; please check [Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks](https://arxiv.org/abs/2401.14159) for more details.
8
+
9
+ - 🔥 **[Grounded SAM 2](https://github.com/IDEA-Research/Grounded-SAM-2)** is released now, which combines Grounding DINO with [SAM 2](https://github.com/facebookresearch/segment-anything-2) for any object tracking in open-world scenarios.
10
+ - 🔥 **[Grounding DINO 1.5](https://github.com/IDEA-Research/Grounding-DINO-1.5-API)** is released now, which is IDEA Research's **Most Capable** Open-World Object Detection Model!
11
+ - 🔥 **[Grounding DINO](https://arxiv.org/abs/2303.05499)** and **[Grounded SAM](https://arxiv.org/abs/2401.14159)** are now supported in Huggingface. For more convenient use, you can refer to [this documentation](https://huggingface.co/docs/transformers/model_doc/grounding-dino)
12
+
13
+ We are very willing to **help everyone share and promote new projects** based on Segment-Anything. Please check out [Highlight Extension Projects](#highlighted-projects) for more amazing demos and works from the community. You can submit a new issue (with the `project` tag) or a new pull request to add links to new projects.
14
+
15
+ ![](./assets/grounded_sam_new_demo_image.png)
16
+
17
+ ![](./assets/ram_grounded_sam_new.png)
18
+
19
+ **🍄 Why Build this Project?**
20
+
21
+ The **core idea** behind this project is to **combine the strengths of different models in order to build a very powerful pipeline for solving complex problems**. It is worth mentioning that this is a workflow for combining strong expert models, where **all parts can be used separately or in combination, and can be replaced with similar but different models (such as replacing Grounding DINO with GLIP or other detectors, replacing Stable-Diffusion with ControlNet or GLIGEN, or combining with ChatGPT)**.
22
+
23
+ **🍇 Updates**
24
+ - **`2024/01/26`** We have released a comprehensive technical report about our project on arXiv, please check [Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks](https://arxiv.org/abs/2401.14159) for more details. And we are profoundly grateful for the contributions of all the contributors in this project.
25
+ - **`2023/12/17`** Support [Grounded-RepViT-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-repvit-sam-demo) demo, thanks a lot for their great work!
26
+ - **`2023/12/16`** Support [Grounded-Edge-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-edge-sam-demo) demo, thanks a lot for their great work!
27
+ - **`2023/12/10`** Support [Grounded-Efficient-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-efficient-sam-demo) demo, thanks a lot for their great work!
28
+ - **`2023/11/24`** Release [RAM++](https://arxiv.org/abs/2310.15200), which is the next generation of RAM. RAM++ can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
29
+ - **`2023/11/23`** Release our newly proposed visual prompt counting model [T-Rex](https://github.com/IDEA-Research/T-Rex). The introduction [Video](https://www.youtube.com/watch?v=engIEhZogAQ) and [Demo](https://deepdataspace.com/playground/ivp) is available in [DDS](https://github.com/IDEA-Research/deepdataspace) now.
30
+ - **`2023/07/25`** Support [Light-HQ-SAM](https://github.com/SysCV/sam-hq) in [EfficientSAM](./EfficientSAM/), credits to [Mingqiao Ye](https://github.com/ymq2017) and [Lei Ke](https://github.com/lkeab), thanks a lot for their great work!
31
+ - **`2023/07/14`** Combining **Grounding-DINO-B** with [SAM-HQ](https://github.com/SysCV/sam-hq) achieves **49.6 mean AP** in [Segmentation in the Wild](https://eval.ai/web/challenges/challenge-page/1931/overview) competition zero-shot track, surpassing Grounded-SAM by **3.6 mean AP**, thanks for their great work!
32
+ - **`2023/06/28`** Combining Grounding-DINO with Efficient SAM variants including [FastSAM](https://github.com/CASIA-IVA-Lab/FastSAM) and [MobileSAM](https://github.com/ChaoningZhang/MobileSAM) in [EfficientSAM](./EfficientSAM/) for faster annotating, thanks a lot for their great work!
33
+ - **`2023/06/20`** By combining **Grounding-DINO-L** with **SAM-ViT-H**, Grounded-SAM achieves 46.0 mean AP in [Segmentation in the Wild](https://eval.ai/web/challenges/challenge-page/1931/overview) competition zero-shot track on [CVPR 2023 workshop](https://computer-vision-in-the-wild.github.io/cvpr-2023/), surpassing [UNINEXT (CVPR 2023)](https://github.com/MasterBin-IIAU/UNINEXT) by about **4 mean AP**.
34
+ - **`2023/06/16`** Release [RAM-Grounded-SAM Replicate Online Demo](https://replicate.com/cjwbw/ram-grounded-sam). Thanks a lot to [Chenxi](https://chenxwh.github.io/) for providing this nice demo 🌹.
35
+ - **`2023/06/14`** Support [RAM-Grounded-SAM & SAM-HQ](./automatic_label_ram_demo.py) and update [Simple Automatic Label Demo](./automatic_label_ram_demo.py) to support [RAM](https://github.com/OPPOMKLab/recognize-anything), setting up a strong automatic annotation pipeline.
36
+ - **`2023/06/13`** Checkout the [Autodistill: Train YOLOv8 with ZERO Annotations](https://youtu.be/gKTYMfwPo4M) tutorial to learn how to use Grounded-SAM + [Autodistill](https://github.com/autodistill/autodistill) for automated data labeling and real-time model training.
37
+ - **`2023/06/13`** Support [SAM-HQ](https://github.com/SysCV/sam-hq) in [Grounded-SAM Demo](#running_man-grounded-sam-detect-and-segment-everything-with-text-prompt) for higher quality prediction.
38
+ - **`2023/06/12`** Support [RAM-Grounded-SAM](#label-grounded-sam-with-ram-or-tag2text-for-automatic-labeling) for strong automatic labeling pipeline! Thanks for [Recognize-Anything](https://github.com/OPPOMKLab/recognize-anything).
39
+ - **`2023/06/01`** Our Grounded-SAM has been accepted to present a **demo** at [ICCV 2023](https://iccv2023.thecvf.com/)! See you in Paris!
40
+ - **`2023/05/23`**: Support `Image-Referring-Segment`, `Audio-Referring-Segment` and `Text-Referring-Segment` in [ImageBind-SAM](./playground/ImageBind_SAM/).
41
+ - **`2023/05/03`**: Check out the [Automated Dataset Annotation and Evaluation with GroundingDINO and SAM](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/automated-dataset-annotation-and-evaluation-with-grounding-dino-and-sam.ipynb), which is an amazing tutorial on automatic labeling! Thanks a lot to [Piotr Skalski](https://github.com/SkalskiP) and [Roboflow](https://github.com/roboflow/notebooks)!
42
+
43
+
44
+ ## Table of Contents
45
+ - [Grounded-Segment-Anything](#grounded-segment-anything)
46
+ - [Preliminary Works](#preliminary-works)
47
+ - [Highlighted Projects](#highlighted-projects)
48
+ - [Installation](#installation)
49
+ - [Install with Docker](#install-with-docker)
50
+ - [Install locally](#install-without-docker)
51
+ - [Grounded-SAM Playground](#grounded-sam-playground)
52
+ - [Step-by-Step Notebook Demo](#open_book-step-by-step-notebook-demo)
53
+ - [GroundingDINO: Detect Everything with Text Prompt](#running_man-groundingdino-detect-everything-with-text-prompt)
54
+ - [Grounded-SAM: Detect and Segment Everything with Text Prompt](#running_man-grounded-sam-detect-and-segment-everything-with-text-prompt)
55
+ - [Grounded-SAM with Inpainting: Detect, Segment and Generate Everything with Text Prompt](#skier-grounded-sam-with-inpainting-detect-segment-and-generate-everything-with-text-prompt)
56
+ - [Grounded-SAM and Inpaint Gradio APP](#golfing-grounded-sam-and-inpaint-gradio-app)
57
+ - [Grounded-SAM with RAM or Tag2Text for Automatic Labeling](#label-grounded-sam-with-ram-or-tag2text-for-automatic-labeling)
58
+ - [Grounded-SAM with BLIP & ChatGPT for Automatic Labeling](#robot-grounded-sam-with-blip-for-automatic-labeling)
59
+ - [Grounded-SAM with Whisper: Detect and Segment Anything with Audio](#open_mouth-grounded-sam-with-whisper-detect-and-segment-anything-with-audio)
60
+ - [Grounded-SAM ChatBot with Visual ChatGPT](#speech_balloon-grounded-sam-chatbot-demo)
61
+ - [Grounded-SAM with OSX for 3D Whole-Body Mesh Recovery](#man_dancing-run-grounded-segment-anything--osx-demo)
62
+ - [Grounded-SAM with VISAM for Tracking and Segment Anything](#man_dancing-run-grounded-segment-anything--visam-demo)
63
+ - [Interactive Fashion-Edit Playground: Click for Segmentation And Editing](#dancers-interactive-editing)
64
+ - [Interactive Human-face Editing Playground: Click And Editing Human Face](#dancers-interactive-editing)
65
+ - [3D Box Via Segment Anything](#camera-3d-box-via-segment-anything)
66
+ - [Playground: More Interesting and Imaginative Demos with Grounded-SAM](./playground/)
67
+ - [DeepFloyd: Image Generation with Text Prompt](./playground/DeepFloyd/)
68
+ - [PaintByExample: Exemplar-based Image Editing with Diffusion Models](./playground/PaintByExample/)
69
+ - [LaMa: Resolution-robust Large Mask Inpainting with Fourier Convolutions](./playground/LaMa/)
70
+ - [RePaint: Inpainting using Denoising Diffusion Probabilistic Models](./playground/RePaint/)
71
+ - [ImageBind with SAM: Segment with Different Modalities](./playground/ImageBind_SAM/)
72
+ - [Efficient SAM Series for Faster Annotation](./EfficientSAM/)
73
+ - [Grounded-FastSAM Demo](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-fastsam-demo)
74
+ - [Grounded-MobileSAM Demo](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-mobilesam-demo)
75
+ - [Grounded-Light-HQSAM Demo](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-light-hqsam-demo)
76
+ - [Grounded-Efficient-SAM Demo](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-efficient-sam-demo)
77
+ - [Grounded-Edge-SAM Demo](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-edge-sam-demo)
78
+ - [Grounded-RepViT-SAM Demo](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/main/EfficientSAM#run-grounded-repvit-sam-demo)
79
+ - [Citation](#citation)
80
+
81
+ ## Preliminary Works
82
+
83
+ Here we provide some background knowledge that you may need to know before trying the demos.
84
+
85
+ <div align="center">
86
+
87
+ | Title | Intro | Description | Links |
88
+ |:----:|:----:|:----:|:----:|
89
+ | [Segment-Anything](https://arxiv.org/abs/2304.02643) | ![](https://github.com/facebookresearch/segment-anything/blob/main/assets/model_diagram.png?raw=true) | A strong foundation model that aims to segment everything in an image and needs prompts (boxes/points/text) to generate masks | [[Github](https://github.com/facebookresearch/segment-anything)] <br> [[Page](https://segment-anything.com/)] <br> [[Demo](https://segment-anything.com/demo)] |
90
+ | [Grounding DINO](https://arxiv.org/abs/2303.05499) | ![](https://github.com/IDEA-Research/GroundingDINO/blob/main/.asset/hero_figure.png?raw=True) | A strong zero-shot detector capable of generating high-quality boxes and labels from free-form text. | [[Github](https://github.com/IDEA-Research/GroundingDINO)] <br> [[Demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)] |
91
+ | [OSX](http://arxiv.org/abs/2303.16160) | ![](https://github.com/IDEA-Research/OSX/blob/main/assets/demo_video.gif?raw=True) | A strong and efficient one-stage motion capture method that generates high-quality 3D human meshes from monocular images. OSX also releases a large-scale upper-body dataset, UBody, for more accurate reconstruction in upper-body scenes. | [[Github](https://github.com/IDEA-Research/OSX)] <br> [[Page](https://osx-ubody.github.io/)] <br> [[Video](https://osx-ubody.github.io/)] <br> [[Data](https://docs.google.com/forms/d/e/1FAIpQLSehgBP7wdn_XznGAM2AiJPiPLTqXXHw5uX9l7qeQ1Dh9HoO_A/viewform)] |
92
+ | [Stable-Diffusion](https://arxiv.org/abs/2112.10752) | ![](https://github.com/CompVis/stable-diffusion/blob/main/assets/stable-samples/txt2img/merged-0006.png?raw=True) | A super powerful open-source latent text-to-image diffusion model | [[Github](https://github.com/CompVis/stable-diffusion)] <br> [[Page](https://ommer-lab.com/research/latent-diffusion-models/)] |
93
+ | [RAM++](https://arxiv.org/abs/2310.15200) | ![](https://github.com/xinyu1205/recognize-anything/blob/main/images/ram_plus_compare.jpg) | RAM++ is the next generation of RAM, which can recognize any category with high accuracy. | [[Github](https://github.com/OPPOMKLab/recognize-anything)] |
94
+ | [RAM](https://recognize-anything.github.io/) | ![](https://github.com/xinyu1205/Tag2Text/raw/main/images/localization_and_recognition.jpg) | RAM is an image tagging model, which can recognize any common category with high accuracy. | [[Github](https://github.com/OPPOMKLab/recognize-anything)] <br> [[Demo](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text)] |
95
+ | [BLIP](https://arxiv.org/abs/2201.12086) | ![](https://github.com/salesforce/LAVIS/raw/main/docs/_static/logo_final.png) | A wonderful language-vision model for image understanding. | [[GitHub](https://github.com/salesforce/LAVIS)] |
96
+ | [Visual ChatGPT](https://arxiv.org/abs/2303.04671) | ![](https://github.com/microsoft/TaskMatrix/raw/main/assets/figure.jpg) | A wonderful tool that connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting. | [[Github](https://github.com/microsoft/TaskMatrix)] <br> [[Demo](https://huggingface.co/spaces/microsoft/visual_chatgpt)] |
97
+ | [Tag2Text](https://tag2text.github.io/) | ![](https://github.com/xinyu1205/Tag2Text/raw/main/images/tag2text_framework.png) | An efficient and controllable vision-language model which can simultaneously output superior image captioning and image tagging. | [[Github](https://github.com/OPPOMKLab/recognize-anything)] <br> [[Demo](https://huggingface.co/spaces/xinyu1205/Tag2Text)] |
98
+ | [VoxelNeXt](https://arxiv.org/abs/2303.11301) | ![](https://github.com/dvlab-research/VoxelNeXt/raw/master/docs/sequence-v2.gif) | A clean, simple, and fully-sparse 3D object detector, which predicts objects directly upon sparse voxel features. | [[Github](https://github.com/dvlab-research/VoxelNeXt)]
99
+
100
+ </div>
101
+
102
+ ## Highlighted Projects
103
+
104
+ Here we provide some impressive works you may find interesting:
105
+
106
+ <div align="center">
107
+
108
+ | Title | Description | Links |
109
+ |:---:|:---:|:---:|
110
+ | [Semantic-SAM](https://github.com/UX-Decoder/Semantic-SAM) | A universal image segmentation model to enable segment and recognize anything at any desired granularity | [[Github](https://github.com/UX-Decoder/Semantic-SAM)] <br> [[Demo](https://github.com/UX-Decoder/Semantic-SAM)] |
111
+ | [SEEM: Segment Everything Everywhere All at Once](https://arxiv.org/pdf/2304.06718.pdf) | A powerful promptable segmentation model supports segmenting with various types of prompts (text, point, scribble, referring image, etc.) and any combination of prompts. | [[Github](https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once)] <br> [[Demo](https://huggingface.co/spaces/xdecoder/SEEM)] |
112
+ | [OpenSeeD](https://arxiv.org/pdf/2303.08131.pdf) | A simple framework for open-vocabulary segmentation and detection which supports interactive segmentation with box input to generate mask | [[Github](https://github.com/IDEA-Research/OpenSeeD)] |
113
+ | [LLaVA](https://arxiv.org/abs/2304.08485) | Visual instruction tuning with GPT-4 | [[Github](https://github.com/haotian-liu/LLaVA)] <br> [[Page](https://llava-vl.github.io/)] <br> [[Demo](https://llava.hliu.cc/)] <br> [[Data](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)] <br> [[Model](https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0)] |
114
+ | [GenSAM](https://arxiv.org/abs/2312.07374) | Relaxing the instance-specific manual prompt requirement in SAM through training-free test-time adaptation | [[Github](https://github.com/jyLin8100/GenSAM)] <br> [[Page](https://lwpyh.github.io/GenSAM/)] |
115
+
116
+ </div>
117
+
118
+ We also list some awesome segment-anything extension projects here you may find interesting:
119
+ - [Computer Vision in the Wild (CVinW) Readings](https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings) for those who are interested in open-set tasks in computer vision.
120
+ - [Zero-Shot Anomaly Detection](https://github.com/caoyunkang/GroundedSAM-zero-shot-anomaly-detection) by Yunkang Cao
121
+ - [EditAnything: ControlNet + StableDiffusion based on the SAM segmentation mask](https://github.com/sail-sg/EditAnything) by Shanghua Gao and Pan Zhou
122
+ - [IEA: Image Editing Anything](https://github.com/feizc/IEA) by Zhengcong Fei
123
+ - [SAM-MMRotate: Combining Rotated Object Detector and SAM](https://github.com/Li-Qingyun/sam-mmrotate) by Qingyun Li and Xue Yang
124
+ - [Awesome-Anything](https://github.com/VainF/Awesome-Anything) by Gongfan Fang
125
+ - [Prompt-Segment-Anything](https://github.com/RockeyCoss/Prompt-Segment-Anything) by Rockey
126
+ - [WebUI for Segment-Anything and Grounded-SAM](https://github.com/continue-revolution/sd-webui-segment-anything) by Chengsong Zhang
127
+ - [Inpainting Anything: Inpaint Anything with SAM + Inpainting models](https://github.com/geekyutao/Inpaint-Anything) by Tao Yu
128
+ - [Grounded Segment Anything From Objects to Parts: Combining Segment-Anything with VLPart & GLIP & Visual ChatGPT](https://github.com/Cheems-Seminar/segment-anything-and-name-it) by Peize Sun and Shoufa Chen
129
+ - [Napari-SAM: Integration of Segment Anything into napari (a nice viewer for SAM)](https://github.com/MIC-DKFZ/napari-sam) by MIC-DKFZ
130
+ - [Grounded Segment Anything Colab](https://github.com/camenduru/grounded-segment-anything-colab) by camenduru
131
+ - [Optical Character Recognition with Segment Anything](https://github.com/yeungchenwa/OCR-SAM) by Zhenhua Yang
132
+ - [Transform Image into Unique Paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, ControlNet](https://github.com/showlab/Image2Paragraph) by showlab
133
+ - [Lang-Segment-Anything: Another awesome demo for combining GroundingDINO with Segment-Anything](https://github.com/luca-medeiros/lang-segment-anything) by Luca Medeiros
134
+ - [🥳 🚀 **Playground: Integrate SAM and OpenMMLab!**](https://github.com/open-mmlab/playground)
135
+ - [3D-object via Segment Anything](https://github.com/dvlab-research/3D-Box-Segment-Anything) by Yukang Chen
136
+ - [Image2Paragraph: Transform Image Into Unique Paragraph](https://github.com/showlab/Image2Paragraph) by Show Lab
137
+ - [Zero-shot Scene Graph Generate with Grounded-SAM](https://github.com/showlab/Image2Paragraph) by JackWhite-rwx
138
+ - [CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks](https://github.com/xmed-lab/CLIP_Surgery) by Eli-YiLi
139
+ - [Panoptic-Segment-Anything: Zero-shot panoptic segmentation using SAM](https://github.com/segments-ai/panoptic-segment-anything) by segments-ai
140
+ - [Caption-Anything: Generates Descriptive Captions for Any Object within an Image](https://github.com/ttengwang/Caption-Anything) by Teng Wang
141
+ - [Segment-Anything-3D: Transferring Segmentation Information of 2D Images to 3D Space](https://github.com/Pointcept/SegmentAnything3D) by Yunhan Yang
142
+ - [Expediting SAM without Fine-tuning](https://github.com/Expedit-LargeScale-Vision-Transformer/Expedit-SAM) by Weicong Liang and Yuhui Yuan
143
+ - [Semantic Segment Anything: Providing Rich Semantic Category Annotations for SAM](https://github.com/fudan-zvg/Semantic-Segment-Anything) by Jiaqi Chen and Zeyu Yang and Li Zhang
144
+ - [Enhance Everything: Combining SAM with Image Restoration and Enhancement Tasks](https://github.com/lixinustc/Enhance-Anything) by Xin Li
145
+ - [DragGAN](https://github.com/Zeqiang-Lai/DragGAN) by Shanghai AI Lab.
146
+ - [Tabletop HandyBot: Robotic arm assistant that performs tabletop tasks using Grounded-SAM](https://github.com/ycheng517/tabletop-handybot) by Yifei Cheng
147
+
148
+ ## Installation
149
+ The code requires `python>=3.8`, as well as `pytorch>=1.7` and `torchvision>=0.8`. Please follow the instructions [here](https://pytorch.org/get-started/locally/) to install both PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended.
150
+
151
+ ### Install with Docker
152
+
153
+ Open one terminal:
154
+
155
+ ```
156
+ make build-image
157
+ ```
158
+
159
+ ```
160
+ make run
161
+ ```
162
+
163
+ That's it.
164
+
165
+ If you would like to allow visualization across the Docker container, open another terminal and type:
166
+
167
+ ```
168
+ xhost +
169
+ ```
170
+
171
+
172
+ ### Install without Docker
173
+ You should set the environment variables manually as follows if you want to build a local GPU environment for Grounded-SAM:
174
+ ```bash
175
+ export AM_I_DOCKER=False
176
+ export BUILD_WITH_CUDA=True
177
+ export CUDA_HOME=/path/to/cuda-11.3/
178
+ ```
179
+
180
+ Install Segment Anything:
181
+
182
+ ```bash
183
+ python -m pip install -e segment_anything
184
+ ```
185
+
186
+ Install Grounding DINO:
187
+
188
+ ```bash
189
+ pip install --no-build-isolation -e GroundingDINO
190
+ ```
191
+
192
+
193
+ Install diffusers:
194
+
195
+ ```bash
196
+ pip install --upgrade diffusers[torch]
197
+ ```
198
+
199
+ Install osx:
200
+
201
+ ```bash
202
+ git submodule update --init --recursive
203
+ cd grounded-sam-osx && bash install.sh
204
+ ```
205
+
206
+ Install RAM & Tag2Text:
207
+
208
+ ```bash
209
+ git clone https://github.com/xinyu1205/recognize-anything.git
210
+ pip install -r ./recognize-anything/requirements.txt
211
+ pip install -e ./recognize-anything/
212
+ ```
213
+
214
+ The following optional dependencies are necessary for mask post-processing, saving masks in COCO format, the example notebooks, and exporting the model in ONNX format. `jupyter` is also required to run the example notebooks.
215
+
216
+ ```
217
+ pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel
218
+ ```
219
+
220
+ More details can be found in [install segment anything](https://github.com/facebookresearch/segment-anything#installation), [install GroundingDINO](https://github.com/IDEA-Research/GroundingDINO#install), and [install OSX](https://github.com/IDEA-Research/OSX).
221
+
222
+
223
+ ## Grounded-SAM Playground
224
+ Let's start exploring our Grounded-SAM Playground. We will release more interesting demos in the future, stay tuned!
225
+
226
+ ## :open_book: Step-by-Step Notebook Demo
227
+ Here we list some notebook demos provided in this project:
228
+ - [grounded_sam.ipynb](grounded_sam.ipynb)
229
+ - [grounded_sam_colab_demo.ipynb](grounded_sam_colab_demo.ipynb)
230
+ - [grounded_sam_3d_box.ipynb](grounded_sam_3d_box)
231
+
232
+
233
+ ### :running_man: GroundingDINO: Detect Everything with Text Prompt
234
+
235
+ :grapes: [[arXiv Paper](https://arxiv.org/abs/2303.05499)] &nbsp; :rose:[[Try the Colab Demo](https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/zero-shot-object-detection-with-grounding-dino.ipynb)] &nbsp; :sunflower: [[Try Huggingface Demo](https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo)] &nbsp; :mushroom: [[Automated Dataset Annotation and Evaluation](https://youtu.be/C4NqaRBz_Kw)]
236
+
237
+ Here's the step-by-step tutorial on running `GroundingDINO` demo:
238
+
239
+ **Step 1: Download the pretrained weights**
240
+
241
+ ```bash
242
+ cd Grounded-Segment-Anything
243
+
244
+ # download the pretrained groundingdino-swin-tiny model
245
+ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
246
+ ```
247
+
248
+ **Step 2: Running the demo**
249
+
250
+ ```bash
251
+ python grounding_dino_demo.py
252
+ ```
253
+
254
+ <details>
255
+ <summary> <b> Running with Python (same as demo but you can run it anywhere after installing GroundingDINO) </b> </summary>
256
+
257
+ ```python
258
+ from groundingdino.util.inference import load_model, load_image, predict, annotate
259
+ import cv2
260
+
261
+ model = load_model("GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "./groundingdino_swint_ogc.pth")
262
+ IMAGE_PATH = "assets/demo1.jpg"
263
+ TEXT_PROMPT = "bear."
264
+ BOX_THRESHOLD = 0.35
265
+ TEXT_THRESHOLD = 0.25
266
+
267
+ image_source, image = load_image(IMAGE_PATH)
268
+
269
+ boxes, logits, phrases = predict(
270
+ model=model,
271
+ image=image,
272
+ caption=TEXT_PROMPT,
273
+ box_threshold=BOX_THRESHOLD,
274
+ text_threshold=TEXT_THRESHOLD
275
+ )
276
+
277
+ annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
278
+ cv2.imwrite("annotated_image.jpg", annotated_frame)
279
+ ```
280
+
281
+ </details>
282
+ <br>
283
+
284
+ **Tips**
285
+ - If you want to detect multiple objects in one sentence with [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO), we suggest separating each name with `.` . An example: `cat . dog . chair .`
286
+
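+ For example, the Python snippet above can be reused with a multi-object prompt; the following is a minimal sketch (it assumes the same config and checkpoint paths as the snippet above, and the output file name is arbitrary):
+
+ ```python
+ from groundingdino.util.inference import load_model, load_image, predict, annotate
+ import cv2
+
+ model = load_model("GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "./groundingdino_swint_ogc.pth")
+ image_source, image = load_image("assets/demo7.jpg")
+
+ # one prompt, several categories separated by "."
+ boxes, logits, phrases = predict(
+     model=model,
+     image=image,
+     caption="horse . clouds . grasses . sky . hill .",
+     box_threshold=0.35,
+     text_threshold=0.25,
+ )
+
+ annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
+ cv2.imwrite("multi_object_annotated_image.jpg", annotated_frame)
+ ```
+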
287
+ **Step 3: Check the annotated image**
288
+
289
+ The annotated image will be saved as `./annotated_image.jpg`.
290
+
291
+ <div align="center">
292
+
293
+ | Text Prompt | Demo Image | Annotated Image |
294
+ |:----:|:----:|:----:|
295
+ | `Bear.` | ![](./assets/demo1.jpg) | ![](./assets/annotated_image.jpg) |
296
+ | `Horse. Clouds. Grasses. Sky. Hill` | ![](./assets/demo7.jpg) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounding_dino/groundingdino_demo7.jpg?raw=true) |
297
+
298
+ </div>
299
+
300
+
301
+ ### :running_man: Grounded-SAM: Detect and Segment Everything with Text Prompt
302
+
303
+ Here's the step-by-step tutorial on running `Grounded-SAM` demo:
304
+
305
+ **Step 1: Download the pretrained weights**
306
+
307
+ ```bash
308
+ cd Grounded-Segment-Anything
309
+
310
+ wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
311
+ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
312
+ ```
313
+
314
+ We provide two versions of Grounded-SAM demo here:
315
+ - [grounded_sam_demo.py](./grounded_sam_demo.py): our original implementation for Grounded-SAM.
316
+ - [grounded_sam_simple_demo.py](./grounded_sam_simple_demo.py): our updated, more elegant version of Grounded-SAM.
317
+
318
+ **Step 2: Running original grounded-sam demo**
319
+ ```bash
320
+ # depends on your device
321
+ export CUDA_VISIBLE_DEVICES=0
322
+ ```
323
+
324
+ ```bash
325
+
326
+ python grounded_sam_demo.py \
327
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
328
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
329
+ --sam_checkpoint sam_vit_h_4b8939.pth \
330
+ --input_image assets/demo1.jpg \
331
+ --output_dir "outputs" \
332
+ --box_threshold 0.3 \
333
+ --text_threshold 0.25 \
334
+ --text_prompt "bear" \
335
+ --device "cuda"
336
+ ```
337
+
338
+ The annotated results will be saved in `./outputs` as follows
339
+
340
+ <div align="center">
341
+
342
+ | Input Image | Annotated Image | Generated Mask |
343
+ |:----:|:----:|:----:|
344
+ | ![](./assets/demo1.jpg) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounded_sam/original_grounded_sam_demo1.jpg?raw=true) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounded_sam/mask.jpg?raw=true) |
345
+
346
+ </div>
347
+
348
+ **Step 3: Running grounded-sam demo with sam-hq**
349
+ - Download the demo image
350
+ ```bash
351
+ wget https://github.com/IDEA-Research/detrex-storage/releases/download/grounded-sam-storage/sam_hq_demo_image.png
352
+ ```
353
+
354
+ - Download SAM-HQ checkpoint [here](https://github.com/SysCV/sam-hq#model-checkpoints)
355
+
356
+ - Running grounded-sam-hq demo as follows:
357
+ ```bash
+ # --sam_hq_checkpoint: path to the SAM-HQ checkpoint, --use_sam_hq: switch to the SAM-HQ model
358
+ export CUDA_VISIBLE_DEVICES=0
359
+ python grounded_sam_demo.py \
360
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
361
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
362
+ --sam_hq_checkpoint ./sam_hq_vit_h.pth \
363
+ --use_sam_hq \
364
+ --input_image sam_hq_demo_image.png \
365
+ --output_dir "outputs" \
366
+ --box_threshold 0.3 \
367
+ --text_threshold 0.25 \
368
+ --text_prompt "chair." \
369
+ --device "cuda"
370
+ ```
371
+
372
+ The annotated results will be saved in `./outputs` as follows
373
+
374
+ <div align="center">
375
+
376
+ | Input Image | SAM Output | SAM-HQ Output |
377
+ |:----:|:----:|:----:|
378
+ | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/sam_hq/sam_hq_demo.png?raw=true) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/sam_hq/sam_output.jpg?raw=true) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/sam_hq/sam_hq_output.jpg?raw=true) |
379
+
380
+ </div>
381
+
382
+ **Step 4: Running the updated grounded-sam demo (optional)**
383
+
384
+ Note that this demo is almost the same as the original demo, but **with more elegant code**.
385
+
386
+ ```bash
387
+ python grounded_sam_simple_demo.py
388
+ ```
389
+
390
+ The annotated results will be saved as `./groundingdino_annotated_image.jpg` and `./grounded_sam_annotated_image.jpg`
391
+
392
+ <div align="center">
393
+
394
+ | Text Prompt | Input Image | GroundingDINO Annotated Image | Grounded-SAM Annotated Image |
395
+ |:----:|:----:|:----:|:----:|
396
+ | `The running dog` | ![](./assets/demo2.jpg) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounded_sam/groundingdino_annotated_image_demo2.jpg?raw=true) | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounded_sam/grounded_sam_annotated_image_demo2.jpg?raw=true) |
397
+ | `Horse. Clouds. Grasses. Sky. Hill` | ![](./assets/demo7.jpg) | ![](assets/groundingdino_annotated_image.jpg) | ![](assets/grounded_sam_annotated_image.jpg) |
398
+
399
+ </div>
400
+
401
+ **Step 5: Running the SAM model with multiple GPUs**
402
+ ```bash
403
+ export CUDA_VISIBLE_DEVICES=0,1
404
+ ```
405
+ ```bash
406
+
407
+ python grounded_sam_multi_gpu_demo.py \
408
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
409
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
410
+ --sam_checkpoint sam_vit_h_4b8939.pth \
411
+ --input_path assets/car \
412
+ --output_dir "outputs" \
413
+ --box_threshold 0.3 \
414
+ --text_threshold 0.25 \
415
+ --text_prompt "car" \
416
+ --device "cuda"
417
+ ```
418
+ You will see that the model is loaded once per GPU ![](assets/multi-gpu.png)
419
+
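+ One simple way to spread a folder of images across several GPUs is to launch one worker per device, load the models once in each worker, and split the file list; the sketch below illustrates that pattern with a hypothetical `run_grounded_sam(image_path, device)` helper (it is an illustration only, not the exact logic of `grounded_sam_multi_gpu_demo.py`):
+
+ ```python
+ import os
+ import torch
+ import torch.multiprocessing as mp
+
+ def run_grounded_sam(image_path: str, device: str) -> None:
+     # hypothetical placeholder: run the single-image Grounded-SAM pipeline on `device`
+     ...
+
+ def worker(rank: int, chunks: list) -> None:
+     device = f"cuda:{rank}"  # each process owns one GPU and loads the models once
+     for image_path in chunks[rank]:
+         run_grounded_sam(image_path, device)
+
+ if __name__ == "__main__":
+     images = sorted(os.path.join("assets/car", f) for f in os.listdir("assets/car"))
+     n_gpus = torch.cuda.device_count()
+     chunks = [images[i::n_gpus] for i in range(n_gpus)]  # round-robin split of the inputs
+     mp.spawn(worker, args=(chunks,), nprocs=n_gpus)
+ ```
+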
420
+ ### :skier: Grounded-SAM with Inpainting: Detect, Segment and Generate Everything with Text Prompt
421
+
422
+ **Step 1: Download the pretrained weights**
423
+
424
+ ```bash
425
+ cd Grounded-Segment-Anything
426
+
427
+ wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
428
+ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
429
+ ```
430
+
431
+ **Step 2: Running grounded-sam inpainting demo**
432
+
433
+ ```bash
434
+ export CUDA_VISIBLE_DEVICES=0
435
+ python grounded_sam_inpainting_demo.py \
436
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
437
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
438
+ --sam_checkpoint sam_vit_h_4b8939.pth \
439
+ --input_image assets/inpaint_demo.jpg \
440
+ --output_dir "outputs" \
441
+ --box_threshold 0.3 \
442
+ --text_threshold 0.25 \
443
+ --det_prompt "bench" \
444
+ --inpaint_prompt "A sofa, high quality, detailed" \
445
+ --device "cuda"
446
+ ```
447
+
448
+ The annotated and inpainted images will be saved in `./outputs`
449
+
450
+ **Step 3: Check the results**
451
+
452
+
453
+ <div align="center">
454
+
455
+ | Input Image | Det Prompt | Annotated Image | Inpaint Prompt | Inpaint Image |
456
+ |:---:|:---:|:---:|:---:|:---:|
457
+ |![](./assets/inpaint_demo.jpg) | `Bench` | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounded_sam_inpaint/grounded_sam_output.jpg?raw=true) | `A sofa, high quality, detailed` | ![](https://github.com/IDEA-Research/detrex-storage/blob/main/assets/grounded_sam/grounded_sam_inpaint/grounded_sam_inpainting_output.jpg?raw=true) |
458
+
459
+ </div>
460
+
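+ The inpainting stage itself boils down to handing the SAM mask and the inpaint prompt to a Stable Diffusion inpainting pipeline. A minimal sketch of that stage is shown below (it assumes you already have the input image and a binary mask saved to disk; the model id, file paths and 512x512 resizing are illustrative):
+
+ ```python
+ import torch
+ from PIL import Image
+ from diffusers import StableDiffusionInpaintPipeline
+
+ pipe = StableDiffusionInpaintPipeline.from_pretrained(
+     "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
+ ).to("cuda")
+
+ image = Image.open("assets/inpaint_demo.jpg").convert("RGB").resize((512, 512))
+ mask = Image.open("outputs/mask.jpg").convert("L").resize((512, 512))  # white = region to repaint
+
+ result = pipe(prompt="A sofa, high quality, detailed", image=image, mask_image=mask).images[0]
+ result.save("outputs/grounded_sam_inpainting_sketch.jpg")
+ ```
+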
461
+ ### :golfing: Grounded-SAM and Inpaint Gradio APP
462
+
463
+ We support 6 tasks in the local Gradio APP:
464
+
465
+ 1. **scribble**: Segmentation is achieved through Segment Anything and mouse click interaction (you need to click on the object with the mouse, no need to specify the prompt).
466
+ 2. **automask**: Segment the entire image at once through Segment Anything (no need to specify a prompt).
467
+ 3. **det**: Detection via Grounding DINO and text interaction (a text prompt needs to be specified).
468
+ 4. **seg**: Detection + segmentation by combining Grounding DINO and Segment Anything through text interaction (a text prompt needs to be specified).
469
+ 5. **inpainting**: Replace the target object by combining Grounding DINO + Segment Anything + Stable Diffusion (a text prompt and an inpaint prompt need to be specified).
470
+ 6. **automatic**: Non-interactive detection + segmentation by combining BLIP + Grounding DINO + Segment Anything (no need to specify a prompt).
471
+
472
+ ```bash
473
+ python gradio_app.py
474
+ ```
475
+
476
+ - The gradio_app visualization is as follows:
477
+
478
+ ![](./assets/gradio_demo.png)
479
+
480
+
481
+ ### :label: Grounded-SAM with RAM or Tag2Text for Automatic Labeling
482
+ [**The Recognize Anything Models**](https://github.com/OPPOMKLab/recognize-anything) are a series of strong, open-source foundational image recognition models, including [RAM++](https://arxiv.org/abs/2310.15200), [RAM](https://arxiv.org/abs/2306.03514) and [Tag2text](https://arxiv.org/abs/2303.05657).
483
+
484
+
485
+ They can be seamlessly linked with Grounded-SAM to generate pseudo labels automatically as follows (see the sketch after this list):
486
+ 1. Use RAM/Tag2Text to generate tags.
487
+ 2. Use Grounded-Segment-Anything to generate the boxes and masks.
488
+
489
+
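+ Conceptually, the glue between the two steps is just turning the predicted tags into a Grounding DINO text prompt. A minimal sketch (the `generate_tags` helper is a hypothetical placeholder for RAM / Tag2Text inference and returns hard-coded tags here):
+
+ ```python
+ from groundingdino.util.inference import load_model, load_image, predict
+
+ def generate_tags(image_path: str) -> list:
+     # hypothetical placeholder for RAM / Tag2Text inference
+     return ["dog", "grass", "frisbee"]
+
+ model = load_model("GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
+ image_source, image = load_image("assets/demo9.jpg")
+
+ caption = " . ".join(generate_tags("assets/demo9.jpg")) + " ."  # "dog . grass . frisbee ."
+ boxes, logits, phrases = predict(model=model, image=image, caption=caption,
+                                  box_threshold=0.25, text_threshold=0.2)
+ # the boxes can then be passed to SAM exactly as in the Grounded-SAM sketch above
+ ```
+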
490
+ **Step 1: Init submodule and download the pretrained checkpoint**
491
+
492
+ - Init submodule:
493
+
494
+ ```bash
495
+ cd Grounded-Segment-Anything
496
+ git submodule init
497
+ git submodule update
498
+ ```
499
+
500
+ - Download pretrained weights for `GroundingDINO`, `SAM` and `RAM/Tag2Text`:
501
+
502
+ ```bash
503
+ wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
504
+ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
505
+
506
+
507
+ wget https://huggingface.co/spaces/xinyu1205/Tag2Text/resolve/main/ram_swin_large_14m.pth
508
+ wget https://huggingface.co/spaces/xinyu1205/Tag2Text/resolve/main/tag2text_swin_14m.pth
509
+ ```
510
+
511
+ **Step 2: Running the demo with RAM**
512
+ ```bash
513
+ export CUDA_VISIBLE_DEVICES=0
514
+ python automatic_label_ram_demo.py \
515
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
516
+ --ram_checkpoint ram_swin_large_14m.pth \
517
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
518
+ --sam_checkpoint sam_vit_h_4b8939.pth \
519
+ --input_image assets/demo9.jpg \
520
+ --output_dir "outputs" \
521
+ --box_threshold 0.25 \
522
+ --text_threshold 0.2 \
523
+ --iou_threshold 0.5 \
524
+ --device "cuda"
525
+ ```
526
+
527
+
528
+ **Step 2 (alternative): Running the demo with Tag2Text**
529
+ ```bash
530
+ export CUDA_VISIBLE_DEVICES=0
531
+ python automatic_label_tag2text_demo.py \
532
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
533
+ --tag2text_checkpoint tag2text_swin_14m.pth \
534
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
535
+ --sam_checkpoint sam_vit_h_4b8939.pth \
536
+ --input_image assets/demo9.jpg \
537
+ --output_dir "outputs" \
538
+ --box_threshold 0.25 \
539
+ --text_threshold 0.2 \
540
+ --iou_threshold 0.5 \
541
+ --device "cuda"
542
+ ```
543
+
544
+ - RAM++ significantly improves the open-set capability of RAM; see [RAM++ inference on unseen categories](https://github.com/xinyu1205/recognize-anything#ram-inference-on-unseen-categories-open-set).
545
+ - Tag2Text also provides powerful captioning capabilities; for the caption-based process, refer to [BLIP](#robot-grounded-sam-with-blip-for-automatic-labeling).
546
+ - The pseudo labels and model prediction visualization will be saved in `output_dir` as follows (right figure):
547
+
548
+ ![](./assets/automatic_label_output/demo9_tag2text_ram.jpg)
549
+
550
+
551
+ ### :robot: Grounded-SAM with BLIP for Automatic Labeling
552
+ It is easy to generate pseudo labels automatically as follows:
553
+ 1. Use BLIP (or other caption models) to generate a caption.
554
+ 2. Extract tags from the caption. We use ChatGPT to handle the potential complicated sentences.
555
+ 3. Use Grounded-Segment-Anything to generate the boxes and masks.
556
+
557
+ - Run Demo
558
+ ```bash
559
+ export OPENAI_API_KEY=your_openai_key
560
+ export OPENAI_API_BASE=https://closeai.deno.dev/v1
561
+ export CUDA_VISIBLE_DEVICES=0
562
+ python automatic_label_demo.py \
563
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
564
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
565
+ --sam_checkpoint sam_vit_h_4b8939.pth \
566
+ --input_image assets/demo3.jpg \
567
+ --output_dir "outputs" \
568
+ --openai_key $OPENAI_API_KEY \
569
+ --box_threshold 0.25 \
570
+ --text_threshold 0.2 \
571
+ --iou_threshold 0.5 \
572
+ --device "cuda"
573
+ ```
574
+
575
+ - If you don't have a paid ChatGPT account, it is also possible to use NLTK instead. Just don't include the `openai_key` parameter when starting the demo (see the sketch at the end of this section).
576
+ - The script will automatically download the necessary NLTK data.
577
+ - The pseudo labels and model prediction visualization will be saved in `output_dir` as follows:
578
+
579
+ ![](./assets/automatic_label_output_demo3.jpg)
580
+
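+ When ChatGPT is not available, the tag-extraction step can be approximated with NLTK part-of-speech tagging, roughly as sketched below (a minimal illustration; the demo's actual filtering may differ):
+
+ ```python
+ import nltk
+
+ nltk.download("punkt", quiet=True)
+ nltk.download("averaged_perceptron_tagger", quiet=True)
+
+ caption = "a dog is running on the grass with a frisbee"
+ tokens = nltk.word_tokenize(caption)
+ nouns = [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]  # keep nouns only
+ print(" . ".join(dict.fromkeys(nouns)) + " .")  # "dog . grass . frisbee ."
+ ```
+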
581
+
582
+ ### :open_mouth: Grounded-SAM with Whisper: Detect and Segment Anything with Audio
583
+ Detect and segment anything with speech!
584
+
585
+ ![](assets/acoustics/gsam_whisper_inpainting_demo.png)
586
+
587
+ **Install Whisper**
588
+ ```bash
589
+ pip install -U openai-whisper
590
+ ```
591
+ See the [whisper official page](https://github.com/openai/whisper#setup) if you have other questions about the installation.
592
+
593
+ **Run Voice-to-Label Demo**
594
+
595
+ Optional: Download the demo audio file
596
+
597
+ ```bash
598
+ wget https://huggingface.co/ShilongLiu/GroundingDINO/resolve/main/demo_audio.mp3
599
+ ```
600
+
601
+
602
+ ```bash
603
+ export CUDA_VISIBLE_DEVICES=0
604
+ python grounded_sam_whisper_demo.py \
605
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
606
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
607
+ --sam_checkpoint sam_vit_h_4b8939.pth \
608
+ --input_image assets/demo4.jpg \
609
+ --output_dir "outputs" \
610
+ --box_threshold 0.3 \
611
+ --text_threshold 0.25 \
612
+ --speech_file "demo_audio.mp3" \
613
+ --device "cuda"
614
+ ```
615
+
616
+ ![](./assets/grounded_sam_whisper_output.jpg)
617
+
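+ The speech-to-prompt step is a straightforward use of Whisper: the audio is transcribed and the resulting text is used as the Grounding DINO prompt. A minimal sketch (not the exact demo code; the audio file is the optional download above):
+
+ ```python
+ import whisper
+
+ model = whisper.load_model("base")
+ result = model.transcribe("demo_audio.mp3")
+ text_prompt = result["text"].strip().lower()
+ print(text_prompt)  # this string is then passed to Grounding DINO as the text prompt
+ ```
+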
618
+ **Run Voice-to-inpaint Demo**
619
+
620
+ You can enable ChatGPT to help you automatically determine the object to detect and the inpainting instruction with `--enable_chatgpt`.
621
+
622
+ Or you can specify the object you want to inpaint [stored in `args.det_speech_file`] and the text you want to inpaint with [stored in `args.inpaint_speech_file`].
623
+
624
+ ```bash
625
+ export OPENAI_API_KEY=your_openai_key
626
+ export OPENAI_API_BASE=https://closeai.deno.dev/v1
627
+ # Example: enable chatgpt
628
+ export CUDA_VISIBLE_DEVICES=0
629
+ python grounded_sam_whisper_inpainting_demo.py \
630
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
631
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
632
+ --sam_checkpoint sam_vit_h_4b8939.pth \
633
+ --input_image assets/inpaint_demo.jpg \
634
+ --output_dir "outputs" \
635
+ --box_threshold 0.3 \
636
+ --text_threshold 0.25 \
637
+ --prompt_speech_file assets/acoustics/prompt_speech_file.mp3 \
638
+ --enable_chatgpt \
639
+ --openai_key $OPENAI_API_KEY\
640
+ --device "cuda"
641
+ ```
642
+
643
+ ```bash
644
+ # Example: without chatgpt
645
+ export CUDA_VISIBLE_DEVICES=0
646
+ python grounded_sam_whisper_inpainting_demo.py \
647
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
648
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
649
+ --sam_checkpoint sam_vit_h_4b8939.pth \
650
+ --input_image assets/inpaint_demo.jpg \
651
+ --output_dir "outputs" \
652
+ --box_threshold 0.3 \
653
+ --text_threshold 0.25 \
654
+ --det_speech_file "assets/acoustics/det_voice.mp3" \
655
+ --inpaint_speech_file "assets/acoustics/inpaint_voice.mp3" \
656
+ --device "cuda"
657
+ ```
658
+
659
+ ![](./assets/acoustics/gsam_whisper_inpainting_pipeline.png)
660
+
661
+ ### :speech_balloon: Grounded-SAM ChatBot Demo
662
+
663
+ https://user-images.githubusercontent.com/24236723/231955561-2ae4ec1a-c75f-4cc5-9b7b-517aa1432123.mp4
664
+
665
+ Following [Visual ChatGPT](https://github.com/microsoft/visual-chatgpt), we add a ChatBot for our project. Currently, it supports:
666
+ 1. "Describe the image."
667
+ 2. "Detect the dog (and the cat) in the image."
668
+ 3. "Segment anything in the image."
669
+ 4. "Segment the dog (and the cat) in the image."
670
+ 5. "Help me label the image."
671
+ 6. "Replace the dog with a cat in the image."
672
+
673
+ To use the ChatBot:
674
+ - Install whisper if you want to use audio as input.
675
+ - Set the default model setting in the tool `Grounded_dino_sam_inpainting`.
676
+ - Run Demo
677
+ ```bash
678
+ export OPENAI_API_KEY=your_openai_key
679
+ export OPENAI_API_BASE=https://closeai.deno.dev/v1
680
+ export CUDA_VISIBLE_DEVICES=0
681
+ python chatbot.py
682
+ ```
683
+
684
+ ### :man_dancing: Run Grounded-Segment-Anything + OSX Demo
685
+
686
+ <p align="middle">
687
+ <img src="assets/osx/grouned_sam_osx_demo.gif">
688
+ <br>
689
+ </p>
690
+
691
+
692
+ - Download the checkpoint `osx_l_wo_decoder.pth.tar` from [here](https://drive.google.com/drive/folders/1x7MZbB6eAlrq5PKC9MaeIm4GqkBpokow?usp=share_link) for OSX:
693
+ - Download the human model files and place them into `grounded-sam-osx/utils/human_model_files` following the instructions of [OSX](https://github.com/IDEA-Research/OSX).
694
+
695
+ - Run Demo
696
+
697
+ ```shell
698
+ export CUDA_VISIBLE_DEVICES=0
699
+ python grounded_sam_osx_demo.py \
700
+ --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
701
+ --grounded_checkpoint groundingdino_swint_ogc.pth \
702
+ --sam_checkpoint sam_vit_h_4b8939.pth \
703
+ --osx_checkpoint osx_l_wo_decoder.pth.tar \
704
+ --input_image assets/osx/grounded_sam_osx_demo.png \
705
+ --output_dir "outputs" \
706
+ --box_threshold 0.3 \
707
+ --text_threshold 0.25 \
708
+ --text_prompt "humans, chairs" \
709
+ --device "cuda"
710
+ ```
711
+
712
+ - The model prediction visualization will be saved in `output_dir` as follows:
713
+
714
+ <img src="assets/osx/grounded_sam_osx_output.jpg" style="zoom: 49%;" />
715
+
716
+ - We also support promptable 3D whole-body mesh recovery. For example, you can track someone with a text prompt and estimate their 3D pose and shape:
717
+
718
+ | ![space-1.jpg](assets/osx/grounded_sam_osx_output1.jpg) |
719
+ | :---------------------------------------------------: |
720
+ | *A person with pink clothes* |
721
+
722
+ | ![space-1.jpg](assets/osx/grounded_sam_osx_output2.jpg) |
723
+ | :---------------------------------------------------: |
724
+ | *A man with sunglasses* |
725
+
726
+
727
+ ## :man_dancing: Run Grounded-Segment-Anything + VISAM Demo
728
+
729
+ - Download the checkpoint `motrv2_dancetrack.pth` from [here](https://drive.google.com/file/d/1EA4lndu2yQcVgBKR09KfMe5efbf631Th/view?usp=share_link) for MOTRv2:
730
+ - Refer to the MOTRv2 and VISAM documentation if you have other questions about the installation.
731
+
732
+ - Run Demo
733
+
734
+ ```shell
735
+ export CUDA_VISIBLE_DEVICES=0
736
+ python grounded_sam_visam.py \
737
+ --meta_arch motr \
738
+ --dataset_file e2e_dance \
739
+ --with_box_refine \
740
+ --query_interaction_layer QIMv2 \
741
+ --num_queries 10 \
742
+ --det_db det_db_motrv2.json \
743
+ --use_checkpoint \
744
+ --mot_path your_data_path \
745
+ --resume motrv2_dancetrack.pth \
746
+ --sam_checkpoint sam_vit_h_4b8939.pth \
747
+ --video_path DanceTrack/test/dancetrack0003
748
+ ```
749
+ |![](https://raw.githubusercontent.com/BingfengYan/MOTSAM/main/visam.gif)|
750
+
751
+
752
+ ### :dancers: Interactive Editing
753
+ - Released the interactive fashion-edit playground [here](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/humanFace). Run it in the notebook and just click to annotate points for further segmentation. Enjoy it!
754
+
755
+
756
+ - Released the human-face-edit branch [here](https://github.com/IDEA-Research/Grounded-Segment-Anything/tree/humanFace). We'll keep updating this branch with more interesting features. Here are some examples:
757
+
758
+ ![](https://github.com/IDEA-Research/Grounded-Segment-Anything/blob/humanFace/assets/231-hair-edit.png)
759
+
760
+ ## :camera: 3D-Box via Segment Anything
761
+ We extend the scope to the 3D world by combining Segment Anything and [VoxelNeXt](https://github.com/dvlab-research/VoxelNeXt). When we provide a prompt (e.g., a point / box), the result is not only a 2D segmentation mask but also 3D boxes. Please check [voxelnext_3d_box](./voxelnext_3d_box/) for more details.
762
+ ![](https://github.com/IDEA-Research/Grounded-Segment-Anything/blob/main/voxelnext_3d_box/images/sam-voxelnext.png)
763
+ ![](https://github.com/IDEA-Research/Grounded-Segment-Anything/blob/main/voxelnext_3d_box/images/image_boxes2.png)
764
+
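+ The 2D half of this pipeline is a standard point/box-prompted SAM call; the 3D boxes are then produced by VoxelNeXt as described in [voxelnext_3d_box](./voxelnext_3d_box/). A minimal sketch of the 2D prompting is shown below (checkpoint and image paths follow the earlier demos; the click coordinates are illustrative):
+
+ ```python
+ import cv2
+ import numpy as np
+ from segment_anything import sam_model_registry, SamPredictor
+
+ sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
+ predictor = SamPredictor(sam)
+
+ image = cv2.cvtColor(cv2.imread("assets/demo1.jpg"), cv2.COLOR_BGR2RGB)
+ predictor.set_image(image)
+
+ # a single foreground click (label 1) as the prompt
+ masks, scores, _ = predictor.predict(
+     point_coords=np.array([[500, 375]]),
+     point_labels=np.array([1]),
+     multimask_output=True,
+ )
+ best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate mask
+ ```
+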
765
+
766
+
767
+
768
+ ## :cupid: Acknowledgements
769
+
770
+ - [Segment Anything](https://github.com/facebookresearch/segment-anything)
771
+ - [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO)
772
+
773
+
774
+ ## Contributors
775
+
776
+ Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.
777
+
778
+ <a href="https://github.com/IDEA-Research/Grounded-Segment-Anything/graphs/contributors">
779
+ <img src="https://contrib.rocks/image?repo=IDEA-Research/Grounded-Segment-Anything" />
780
+ </a>
781
+
782
+
783
+ ## Citation
784
+ If you find this project helpful for your research, please consider citing the following BibTeX entries.
785
+ ```BibTex
786
+ @article{kirillov2023segany,
787
+ title={Segment Anything},
788
+ author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
789
+ journal={arXiv:2304.02643},
790
+ year={2023}
791
+ }
792
+
793
+ @article{liu2023grounding,
794
+ title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
795
+ author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
796
+ journal={arXiv preprint arXiv:2303.05499},
797
+ year={2023}
798
+ }
799
+
800
+ @misc{ren2024grounded,
801
+ title={Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks},
802
+ author={Tianhe Ren and Shilong Liu and Ailing Zeng and Jing Lin and Kunchang Li and He Cao and Jiayu Chen and Xinyu Huang and Yukang Chen and Feng Yan and Zhaoyang Zeng and Hao Zhang and Feng Li and Jie Yang and Hongyang Li and Qing Jiang and Lei Zhang},
803
+ year={2024},
804
+ eprint={2401.14159},
805
+ archivePrefix={arXiv},
806
+ primaryClass={cs.CV}
807
+ }
808
+ ```