Upload 11 files

Files changed (12)

.gitattributes +1 -0
LICENSE +88 -0
README.md +82 -3
aimv2_overview_light.png +3 -0
config.json +56 -0
merges.txt +0 -0
model.safetensors +3 -0
preprocessor_config.json +33 -0
special_tokens_map.json +30 -0
tokenizer.json +0 -0
tokenizer_config.json +31 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+aimv2_overview_light.png filter=lfs diff=lfs merge=lfs -text

LICENSE ADDED

	@@ -0,0 +1,88 @@

+Disclaimer: IMPORTANT: This Apple Machine Learning Research Model is
+specifically developed and released by Apple Inc. ("Apple") for the sole purpose
+of scientific research of artificial intelligence and machine-learning
+technology. “Apple Machine Learning Research Model” means the model, including
+but not limited to algorithms, formulas, trained model weights, parameters,
+configurations, checkpoints, and any related materials (including
+documentation).
+This Apple Machine Learning Research Model is provided to You by
+Apple in consideration of your agreement to the following terms, and your use,
+modification, creation of Model Derivatives, and or redistribution of the Apple
+Machine Learning Research Model constitutes acceptance of this Agreement. If You
+do not agree with these terms, please do not use, modify, create Model
+Derivatives of, or distribute this Apple Machine Learning Research Model or
+Model Derivatives.
+* License Scope: In consideration of your agreement to abide by the following
+  terms, and subject to these terms, Apple hereby grants you a personal,
+  non-exclusive, worldwide, non-transferable, royalty-free, revocable, and
+  limited license, to use, copy, modify, distribute, and create Model
+  Derivatives (defined below) of the Apple Machine Learning Research Model
+  exclusively for Research Purposes. You agree that any Model Derivatives You
+  may create or that may be created for You will be limited to Research Purposes
+  as well. “Research Purposes” means non-commercial scientific research and
+  academic development activities, such as experimentation, analysis, testing
+  conducted by You with the sole intent to advance scientific knowledge and
+  research. “Research Purposes” does not include any commercial exploitation,
+  product development or use in any commercial product or service.
+* Distribution of Apple Machine Learning Research Model and Model Derivatives:
+  If you choose to redistribute Apple Machine Learning Research Model or its
+  Model Derivatives, you must provide a copy of this Agreement to such third
+  party, and ensure that the following attribution notice be provided: “Apple
+  Machine Learning Research Model is licensed under the Apple Machine Learning
+  Research Model License Agreement.” Additionally, all Model Derivatives must
+  clearly be identified as such, including disclosure of modifications and
+  changes made to the Apple Machine Learning Research Model. The name,
+  trademarks, service marks or logos of Apple may not be used to endorse or
+  promote Model Derivatives or the relationship between You and Apple. “Model
+  Derivatives” means any models or any other artifacts created by modifications,
+  improvements, adaptations, alterations to the architecture, algorithm or
+  training processes of the Apple Machine Learning Research Model, or by any
+  retraining, fine-tuning of the Apple Machine Learning Research Model.
+* No Other License: Except as expressly stated in this notice, no other rights
+  or licenses, express or implied, are granted by Apple herein, including but
+  not limited to any patent, trademark, and similar intellectual property rights
+  worldwide that may be infringed by the Apple Machine Learning Research Model,
+  the Model Derivatives or by other works in which the Apple Machine Learning
+  Research Model may be incorporated.
+* Compliance with Laws: Your use of Apple Machine Learning Research Model must
+  be in compliance with all applicable laws and regulations.
+* Term and Termination: The term of this Agreement will begin upon your
+  acceptance of this Agreement or use of the Apple Machine Learning Research
+  Model and will continue until terminated in accordance with the following
+  terms. Apple may terminate this Agreement at any time if You are in breach of
+  any term or condition of this Agreement. Upon termination of this Agreement,
+  You must cease to use all Apple Machine Learning Research Models and Model
+  Derivatives and permanently delete any copy thereof. Sections 3, 6 and 7 will
+  survive termination.
+* Disclaimer and Limitation of Liability: This Apple Machine Learning Research
+  Model and any outputs generated by the Apple Machine Learning Research Model
+  are provided on an “AS IS” basis. APPLE MAKES NO WARRANTIES, EXPRESS OR
+  IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF
+  NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE,
+  REGARDING THE APPLE MACHINE LEARNING RESEARCH MODEL OR OUTPUTS GENERATED BY
+  THE APPLE MACHINE LEARNING RESEARCH MODEL. You are solely responsible for
+  determining the appropriateness of using or redistributing the Apple Machine
+  Learning Research Model and any outputs of the Apple Machine Learning Research
+  Model and assume any risks associated with Your use of the Apple Machine
+  Learning Research Model and any output and results. IN NO EVENT SHALL APPLE BE
+  LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+  IN ANY WAY OUT OF THE USE, REPRODUCTION, MODIFICATION AND/OR DISTRIBUTION OF
+  THE APPLE MACHINE LEARNING RESEARCH MODEL AND ANY OUTPUTS OF THE APPLE MACHINE
+  LEARNING RESEARCH MODEL, HOWEVER CAUSED AND WHETHER UNDER THEORY OF CONTRACT,
+  TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS
+  BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+* Governing Law: This Agreement will be governed by and construed under the laws
+  of the State of California without regard to its choice of law principles. The
+  Convention on Contracts for the International Sale of Goods shall not apply to
+  the Agreement except that the arbitration clause and any arbitration hereunder
+  shall be governed by the Federal Arbitration Act, Chapters 1 and 2.
+Copyright (C) 2025 Apple Inc. All Rights Reserved.

README.md CHANGED

@@ -1,3 +1,82 @@
----
-license: apple-amlr
----

+---
+library_name: transformers
+license: apple-amlr
+metrics:
+- accuracy
+pipeline_tag: image-feature-extraction
+tags:
+- vision
+- image-feature-extraction
+- mlx
+- pytorch
+---
+# Introduction
+[[`AIMv2 Paper`](https://arxiv.org/abs/2411.14402)] [[`BibTeX`](#citation)]
+We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective.
+AIMv2 pre-training is simple, straightforward to train, and scales effectively. Some AIMv2 highlights include:
+1. Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
+2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
+3. Exhibits strong recognition performance with AIMv2-3B achieving *89.5% on ImageNet using a frozen trunk*.
+<img src="aimv2_overview_light.png" alt="AIMv2 Overview"/>
+## Usage
+### PyTorch
+```python
+import requests
+from PIL import Image
+from transformers import AutoImageProcessor, AutoModel
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+processor = AutoImageProcessor.from_pretrained(
+    "apple/aimv2-large-patch14-native",
+)
+model = AutoModel.from_pretrained(
+    "apple/aimv2-large-patch14-native",
+    trust_remote_code=True,
+)
+inputs = processor(images=image, return_tensors="pt")
+outputs = model(**inputs)
+```
+### JAX
+```python
+import requests
+from PIL import Image
+from transformers import AutoImageProcessor, FlaxAutoModel
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+processor = AutoImageProcessor.from_pretrained(
+    "apple/aimv2-large-patch14-native",
+)
+model = FlaxAutoModel.from_pretrained(
+    "apple/aimv2-large-patch14-native",
+    trust_remote_code=True,
+)
+inputs = processor(images=image, return_tensors="jax")
+outputs = model(**inputs)
+```
+## Citation
+If you find our work useful, please consider citing us as:
+```bibtex
+@misc{fini2024multimodalautoregressivepretraininglarge,
+  author      = {Fini, Enrico and Shukor, Mustafa and Li, Xiujun and Dufter, Philipp and Klein, Michal and Haldimann, David and Aitharaju, Sai and da Costa, Victor Guilherme Turrisi and Béthune, Louis and Gan, Zhe and Toshev, Alexander T and Eichner, Marcin and Nabi, Moin and Yang, Yinfei and Susskind, Joshua M. and El-Nouby, Alaaeldin},
+  url         = {https://arxiv.org/abs/2411.14402},
+  eprint      = {2411.14402},
+  eprintclass = {cs.CV},
+  eprinttype  = {arXiv},
+  title       = {Multimodal Autoregressive Pre-training of Large Vision Encoders},
+  year        = {2024},
+}
+```

aimv2_overview_light.png ADDED

Git LFS Details

SHA256: 524b6eb5049fb4bac6303ecee386d0e885fa69a96756557d843084ba4caae08f
Pointer size: 131 Bytes
Size of remote file: 336 kB

config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "architectures": [
+    "AIMv2Model"
+  ],
+  "auto_map": {
+    "AutoConfig": "apple/aimv2-large-patch14-224-lit--configuration_aimv2.AIMv2Config",
+    "AutoModel": "apple/aimv2-large-patch14-224-lit--modeling_aimv2.AIMv2Model"
+  },
+  "init_temperature": 0.07,
+  "logit_scale_init_value": 2.6592,
+  "max_logit_scale": 100.0,
+  "model_type": "aimv2",
+  "projection_dim": 768,
+  "text_config": {
+    "_attn_implementation_autoset": true,
+    "attention_dropout": 0.0,
+    "hidden_act": "silu",
+    "hidden_size": 768,
+    "initializer_range": 0.02,
+    "intermediate_size": 2048,
+    "is_causal": true,
+    "max_context_length": 77,
+    "max_position_embeddings": 77,
+    "model_type": "aimv2_text_model",
+    "num_attention_heads": 6,
+    "num_hidden_layers": 12,
+    "projection_dropout": 0.0,
+    "qkv_bias": false,
+    "rms_norm_eps": 1e-05,
+    "use_bias": false,
+    "vocab_size": 49408
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.51.0.dev0",
+  "vision_config": {
+    "_attn_implementation_autoset": true,
+    "attention_dropout": 0.0,
+    "hidden_act": "silu",
+    "hidden_size": 1024,
+    "image_size": 224,
+    "initializer_range": 0.02,
+    "intermediate_size": 2816,
+    "is_causal": false,
+    "model_type": "aimv2_vision_model",
+    "num_attention_heads": 8,
+    "num_channels": 3,
+    "num_hidden_layers": 24,
+    "num_queries": 1,
+    "patch_size": 14,
+    "projection_dropout": 0.0,
+    "qkv_bias": false,
+    "rms_norm_eps": 1e-05,
+    "use_bias": false,
+    "use_head": true
+  }
+}
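
A few of the values in this config can be cross-checked with simple arithmetic. The sketch below assumes the CLIP-style convention that the logit scale is stored as the log of the inverse temperature (the config does not state this), and that the patch grid is computed for a square input at the configured `image_size`. The dictionary names are illustrative, not part of the model code.

```python
import math

# Values copied from the config.json diff above.
init_temperature = 0.07
logit_scale_init_value = 2.6592
vision = {"hidden_size": 1024, "num_attention_heads": 8, "image_size": 224, "patch_size": 14}
text = {"hidden_size": 768, "num_attention_heads": 6}

# Assumed CLIP-style convention: logit scale = log(1 / temperature).
derived_logit_scale = math.log(1.0 / init_temperature)

# Patch grid for a square input at the configured resolution.
patches_per_side = vision["image_size"] // vision["patch_size"]
num_patches = patches_per_side ** 2

# Per-head width in each tower.
vision_head_dim = vision["hidden_size"] // vision["num_attention_heads"]
text_head_dim = text["hidden_size"] // text["num_attention_heads"]

print(derived_logit_scale, num_patches, vision_head_dim, text_head_dim)
```

Under these assumptions the derived logit scale agrees with `logit_scale_init_value` to within 1e-3, and both towers end up with the same per-head width.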

merges.txt ADDED

The diff for this file is too large to render.

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:055655dcf78e1ef05e5f065747691eabdd8c3de416c79f1592f0f3ff8ab7bc9f
+size 1746762564
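
The three lines above are a Git LFS pointer rather than the weights themselves. As a rough sanity check, the sketch below parses the pointer and estimates the parameter count. It assumes every tensor is stored as float32 (as `"torch_dtype": "float32"` in `config.json` suggests) and ignores the safetensors header, so the figure is a slight overestimate. `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
# Pointer text copied from the model.safetensors diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:055655dcf78e1ef05e5f065747691eabdd8c3de416c79f1592f0f3ff8ab7bc9f
size 1746762564
"""

def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = parse_lfs_pointer(POINTER)
size_bytes = int(pointer["size"])

# Assumption: 4 bytes per parameter (float32); header bytes are not subtracted.
approx_params = size_bytes // 4
print(f"{size_bytes / 1e9:.2f} GB, about {approx_params / 1e6:.0f}M parameters")
```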

preprocessor_config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "crop_size": {
+    "height": 224,
+    "width": 224
+  },
+  "data_format": "channels_first",
+  "default_to_square": false,
+  "device": null,
+  "do_center_crop": true,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_processor_type": "CLIPImageProcessorFast",
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "input_data_format": null,
+  "processor_class": "CLIPProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "return_tensors": null,
+  "size": {
+    "shortest_edge": 224
+  }
+}
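
To make the effect of these settings concrete, here is a minimal NumPy sketch of the center-crop, rescale, and normalize steps. It is not the `CLIPImageProcessorFast` implementation: the resize step (shortest edge 224, `resample: 3`, which is bicubic in PIL's numbering) is skipped, and the crop rounding may differ from the library's. The helper names and the dummy input are assumptions for illustration.

```python
import numpy as np

# Values copied from preprocessor_config.json above.
IMAGE_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
IMAGE_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)
RESCALE_FACTOR = 0.00392156862745098  # equals 1 / 255
CROP = 224

def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Crop the central size x size window from an H x W x C array."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]

def rescale_and_normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 HWC pixels to normalized float32 CHW values."""
    scaled = image_uint8.astype(np.float32) * RESCALE_FACTOR
    normalized = (scaled - IMAGE_MEAN) / IMAGE_STD
    return np.transpose(normalized, (2, 0, 1))  # "channels_first"

# Hypothetical input: an all-white image already resized so its shortest edge is 224.
dummy = np.full((224, 298, 3), 255, dtype=np.uint8)
pixel_values = rescale_and_normalize(center_crop(dummy, CROP))
print(pixel_values.shape, pixel_values.dtype)
```

The mean and standard deviation are the same per-channel statistics used by CLIP-style processors, which fits the `CLIPImageProcessorFast` type named in the config.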

special_tokens_map.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "bos_token": {
+    "content": "<start_of_text>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<end_of_text>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<end_of_text>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<end_of_text>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "added_tokens_decoder": {
+    "49406": {
+      "content": "<start_of_text>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "49407": {
+      "content": "<end_of_text>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<start_of_text>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<end_of_text>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 77,
+  "pad_token": "<end_of_text>",
+  "processor_class": "CLIPProcessor",
+  "tokenizer_class": "CLIPTokenizer",
+  "unk_token": "<end_of_text>",
+  "use_fast": true
+}
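
Because `pad_token` and `unk_token` both reuse `<end_of_text>`, a padded sequence cannot be told apart from its end-of-text marker by token id alone. The sketch below shows one way the framing and padding could look for the 77-token context. `frame_and_pad` is a hypothetical helper and the input ids are made up; the actual `CLIPTokenizer` handles this internally.

```python
# Values copied from tokenizer_config.json above.
BOS_ID = 49406  # <start_of_text>
EOS_ID = 49407  # <end_of_text>, also used as pad and unk
MODEL_MAX_LENGTH = 77

def frame_and_pad(token_ids, max_length=MODEL_MAX_LENGTH):
    """Hypothetical helper: wrap ids in BOS/EOS, truncate, then right-pad with EOS."""
    body = list(token_ids)[: max_length - 2]  # leave room for BOS and EOS
    ids = [BOS_ID, *body, EOS_ID]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [EOS_ID] * padding, mask + [0] * padding

# Made-up ids standing in for a tokenized caption.
ids, mask = frame_and_pad([320, 1125, 539])
print(len(ids), sum(mask))
```

The attention mask is what distinguishes the real end-of-text position from the padding that follows it.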

vocab.json ADDED

The diff for this file is too large to render.