Commit de01fb4 (verified) · 1 Parent(s): 2ab0976 · committed by maharshpatelx

Upload Lfm2VlForConditionalGeneration

Files changed (5):
  1. README.md +201 -0
  2. config.json +99 -0
  3. generation_config.json +7 -0
  4. model.safetensors +3 -0
  5. modeling_lfm2_vl.py +688 -0
README.md ADDED
@@ -0,0 +1,201 @@
---
library_name: transformers
tags:
- trl
- sft
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->



## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** [More Information Needed]
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

[More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
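
A minimal loading sketch (not author-provided): it assumes this repository's id is substituted for the placeholder below and that matching processor/tokenizer files are available in the repo; `trust_remote_code=True` is needed because the model classes ship in `modeling_lfm2_vl.py`.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.image_utils import load_image

repo_id = "<namespace>/<this-repo>"  # placeholder: replace with this repository's id

# The custom Lfm2Vl classes are resolved via the auto_map entries in config.json.
model = AutoModelForImageTextToText.from_pretrained(repo_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

image = load_image("https://www.ilankelman.org/stopsigns/australia.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```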

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[More Information Needed]

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]
config.json ADDED
@@ -0,0 +1,99 @@
{
  "architectures": [
    "Lfm2VlForConditionalGeneration"
  ],
  "auto_map": {
    "AutoConfig": "modeling_lfm2_vl.Lfm2VlConfig",
    "AutoModelForImageTextToText": "modeling_lfm2_vl.Lfm2VlForConditionalGeneration"
  },
  "do_image_splitting": true,
  "downsample_factor": 2,
  "encoder_patch_size": 16,
  "image_token_index": 396,
  "max_image_tokens": 256,
  "max_num_patches": 1024,
  "max_pixels_tolerance": 2.0,
  "max_tiles": 10,
  "min_image_tokens": 64,
  "min_tiles": 2,
  "model_type": "lfm2-vl",
  "projector_bias": true,
  "projector_hidden_act": "gelu",
  "projector_hidden_size": 2560,
  "text_config": {
    "_name_or_path": "LiquidAI/LFM2-350M",
    "architectures": [
      "Lfm2ForCausalLM"
    ],
    "block_auto_adjust_ff_dim": true,
    "block_dim": 1024,
    "block_ff_dim": 6656,
    "block_ffn_dim_multiplier": 1.0,
    "block_mlp_init_scale": 1.0,
    "block_multiple_of": 256,
    "block_norm_eps": 1e-05,
    "block_out_init_scale": 1.0,
    "block_use_swiglu": true,
    "block_use_xavier_init": true,
    "conv_L_cache": 3,
    "conv_bias": false,
    "conv_dim": 1024,
    "conv_dim_out": 1024,
    "conv_use_xavier_init": true,
    "eos_token_id": 7,
    "hidden_size": 1024,
    "initializer_range": 0.02,
    "intermediate_size": 6656,
    "layer_types": [
      "conv",
      "conv",
      "full_attention",
      "conv",
      "conv",
      "full_attention",
      "conv",
      "conv",
      "full_attention",
      "conv",
      "full_attention",
      "conv",
      "full_attention",
      "conv",
      "full_attention",
      "conv"
    ],
    "max_position_embeddings": 128000,
    "model_type": "lfm2",
    "norm_eps": 1e-05,
    "num_attention_heads": 16,
    "num_heads": 16,
    "num_hidden_layers": 16,
    "num_key_value_heads": 8,
    "rope_theta": 1000000.0,
    "torch_dtype": "bfloat16",
    "use_cache": true,
    "use_pos_enc": true,
    "vocab_size": 65536
  },
  "tile_size": 512,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.0",
  "use_image_special_tokens": true,
  "use_thumbnail": false,
  "vision_config": {
    "attention_dropout": 0.0,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 768,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-06,
    "model_type": "siglip2_vision_model",
    "num_attention_heads": 12,
    "num_channels": 3,
    "num_hidden_layers": 12,
    "num_patches": 256,
    "patch_size": 16,
    "torch_dtype": "bfloat16",
    "vision_use_head": false
  },
  "vision_feature_layer": -1
}
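
As a quick orientation aid (illustrative, not part of the uploaded files), the vision token budget above is internally consistent: a 512-pixel tile with 16-pixel patches gives 1024 encoder patches, and the 2x downsampling reduces that to 256 image tokens.

```python
# Illustrative consistency check of the values in config.json above.
tile_size = 512        # "tile_size"
patch = 16             # "encoder_patch_size"
downsample = 2         # "downsample_factor"

patches_per_side = tile_size // patch              # 32
encoder_patches = patches_per_side ** 2            # 1024 == "max_num_patches"
image_tokens = encoder_patches // downsample ** 2  # 256  == "max_image_tokens"
print(patches_per_side, encoder_patches, image_tokens)
```
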
generation_config.json ADDED
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 7,
  "pad_token_id": 0,
  "transformers_version": "4.55.0"
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a9f069717ad508e0f917d988892832fe17c90f706e7a61289cc22b512c9dbb23
size 901692416
modeling_lfm2_vl.py ADDED
@@ -0,0 +1,688 @@
"""PyTorch LFM2-VL model."""

from dataclasses import dataclass

import torch
from torch import nn
from transformers import AutoConfig, AutoModel
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache
from transformers.configuration_utils import PretrainedConfig
from transformers.generation import GenerationMixin
from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
from transformers.modeling_outputs import BaseModelOutputWithPast, ModelOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.models.lfm2.configuration_lfm2 import Lfm2Config
from transformers.models.siglip2.configuration_siglip2 import Siglip2VisionConfig
from transformers.models.siglip2.modeling_siglip2 import Siglip2VisionModel
from transformers.processing_utils import Unpack
from transformers.utils import can_return_tuple, logging

logger = logging.get_logger(__name__)


class Lfm2VlConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`Lfm2VlForConditionalGeneration`]. It is used to instantiate an
    Lfm2Vl model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the Lfm2-VL-1.6B.

    e.g. [LiquidAI/LFM2-VL-1.6B](https://huggingface.co/LiquidAI/LFM2-VL-1.6B)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`AutoConfig | dict`, *optional*, defaults to `Siglip2VisionConfig`):
            The config object or dictionary of the vision backbone.
        text_config (`AutoConfig | dict`, *optional*, defaults to `Lfm2Config`):
            The config object or dictionary of the text backbone.
        image_token_id (`int`, *optional*, defaults to 396):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        projector_hidden_size (`int`, *optional*, defaults to 2560):
            The hidden size of the multimodal projector.
        projector_bias (`bool`, *optional*, defaults to `True`):
            Whether to use bias in the multimodal projector.
        downsample_factor (`int`, *optional*, defaults to 2):
            The factor by which the vision features are downsampled.
        vision_feature_layer (`int`, *optional*, defaults to -1):
            The layer of the vision tower to use as features.
        min_image_tokens (`int`, *optional*, defaults to 64):
            The minimum number of image tokens for smart resize.
        max_image_tokens (`int`, *optional*, defaults to 256):
            The maximum number of image tokens for smart resize.
        encoder_patch_size (`int`, *optional*, defaults to 16):
            The patch size of the encoder.
        max_num_patches (`int`, *optional*, defaults to 1024):
            The maximum number of image tokens passed to the encoder per image or tile.
        use_image_special_tokens (`bool`, *optional*, defaults to `True`):
            Whether to use image special tokens.
        do_image_splitting (`bool`, *optional*, defaults to `True`):
            Whether to split large images into tiles.
        min_tiles (`int`, *optional*, defaults to 2):
            The minimum number of tiles to split the image into.
        max_tiles (`int`, *optional*, defaults to 10):
            The maximum number of tiles to split the image into.
        tile_size (`int`, *optional*, defaults to 512):
            The size of the tile to split the image into.
        max_pixels_tolerance (`float`, *optional*, defaults to 2.0):
            The maximum tolerance for the number of pixels in the image before splitting.
        use_thumbnail (`bool`, *optional*, defaults to `True`):
            Whether to append a thumbnail of the full image when splitting.
    """

    model_type = "lfm2-vl"
    attribute_map = {
        "image_token_id": "image_token_index",
    }
    sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig}

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        image_token_index=396,
        projector_hidden_act="gelu",
        projector_hidden_size=2560,
        projector_bias=True,
        downsample_factor=2,
        vision_feature_layer=-1,
        min_image_tokens=64,
        max_image_tokens=256,
        encoder_patch_size=16,
        max_num_patches=1024,
        use_image_special_tokens=True,
        do_image_splitting=True,
        min_tiles=2,
        max_tiles=10,
        tile_size=512,
        max_pixels_tolerance=2.0,
        use_thumbnail=True,
        torch_dtype=torch.bfloat16,
        **kwargs,
    ):
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.projector_hidden_size = projector_hidden_size
        self.projector_bias = projector_bias
        self.downsample_factor = downsample_factor
        self.vision_feature_layer = vision_feature_layer
        self.min_image_tokens = min_image_tokens
        self.max_image_tokens = max_image_tokens
        self.encoder_patch_size = encoder_patch_size
        self.max_num_patches = max_num_patches
        self.use_image_special_tokens = use_image_special_tokens
        self.do_image_splitting = do_image_splitting
        self.min_tiles = min_tiles
        self.max_tiles = max_tiles
        self.tile_size = tile_size
        self.max_pixels_tolerance = max_pixels_tolerance
        self.use_thumbnail = use_thumbnail
        self.torch_dtype = torch_dtype

        if isinstance(vision_config, dict):
            vision_config = Siglip2VisionConfig(**vision_config)
        elif vision_config is None:
            vision_config = Siglip2VisionConfig()
        self.vision_config = vision_config

        if isinstance(text_config, dict):
            text_config = Lfm2Config(**text_config)
        elif text_config is None:
            text_config = Lfm2Config()
        self.text_config = text_config

        super().__init__(**kwargs)

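# Usage sketch (illustrative): `Lfm2VlConfig()` builds default SigLIP2 vision and LFM2 text
# sub-configs, while `AutoConfig.from_pretrained(<repo_id>, trust_remote_code=True)` resolves
# this class through the "AutoConfig" entry of `auto_map` in the accompanying config.json.
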
@dataclass
class Lfm2VlModelOutputWithPast(BaseModelOutputWithPast):
    r"""
    past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

        Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
        `past_key_values` input) to speed up sequential decoding.
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
    """

    image_hidden_states: torch.FloatTensor | None = None


@dataclass
class Lfm2VlCausalLMOutputWithPast(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Language modeling loss (for next-token prediction).
    logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

        Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
        `past_key_values` input) to speed up sequential decoding.
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
    """

    loss: torch.FloatTensor | None = None
    logits: torch.FloatTensor | None = None
    past_key_values: list[torch.FloatTensor] | None = None
    hidden_states: tuple[torch.FloatTensor] | None = None
    attentions: tuple[torch.FloatTensor] | None = None
    image_hidden_states: torch.FloatTensor | None = None


class Lfm2VlMultiModalProjector(nn.Module):
    def __init__(self, config: Lfm2VlConfig):
        super().__init__()
        in_channels = config.vision_config.hidden_size * (config.downsample_factor**2)
        self.layer_norm = nn.LayerNorm(in_channels)
        self.linear_1 = nn.Linear(
            in_channels,
            config.projector_hidden_size,
            bias=config.projector_bias,
        )
        self.act = ACT2FN[config.projector_hidden_act]
        self.linear_2 = nn.Linear(
            config.projector_hidden_size,
            config.text_config.hidden_size,
            bias=config.projector_bias,
        )

    def forward(self, image_features):
        image_features = self.layer_norm(image_features)
        hidden_states = self.linear_1(image_features)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states


class PixelUnshuffleBlock(nn.Module):
    def __init__(self, factor: int):
        super().__init__()
        self.factor = factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, w, h, c = x.size()
        if w % self.factor != 0:
            x = torch.concat(
                [
                    x,
                    torch.zeros(
                        (n, self.factor - (w % self.factor), h, c), dtype=x.dtype
                    ).to(x.device),
                ],
                dim=1,
            ).contiguous()
        n, w, h, c = x.size()
        x = x.contiguous()
        if h % self.factor != 0:
            x = torch.concat(
                [
                    x,
                    torch.zeros(
                        (n, w, self.factor - (h % self.factor), c), dtype=x.dtype
                    ).to(x.device),
                ],
                dim=2,
            ).contiguous()
        n, w, h, c = x.size()
        x = x.view(n, w, int(h / self.factor), int(c * self.factor))
        x = x.permute(0, 2, 1, 3).contiguous()
        x = x.view(
            n, int(h / self.factor), int(w / self.factor), int(c * self.factor**2)
        )
        x = x.permute(0, 2, 1, 3).contiguous()
        return x

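# Shape sketch (illustrative): with downsample_factor=2 and a SigLIP2 hidden size of 768, a
# (1, 32, 32, 768) patch grid from a 512x512 tile becomes (1, 16, 16, 3072) after the unshuffle,
# i.e. 256 tokens per tile with channel width 768 * 2**2 = 3072, which is the `in_channels`
# expected by Lfm2VlMultiModalProjector.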

class Lfm2VlPreTrainedModel(PreTrainedModel):
    config: Lfm2VlConfig
    base_model_prefix = ""
    supports_gradient_checkpointing = True
    _skip_keys_device_placement = ["past_key_values"]

    _supports_flash_attn = True
    _supports_sdpa = True

    _can_compile_fullgraph = False
    _supports_flex_attn = True
    _supports_attention_backend = True


class Lfm2VlModel(Lfm2VlPreTrainedModel):
    _checkpoint_conversion_mapping = {"language_model.model": "language_model"}

    def __init__(self, config: Lfm2VlConfig):
        super().__init__(config)
        self.vision_tower = Siglip2VisionModel(config.vision_config)

        if config.vision_feature_layer != -1:
            self.vision_tower.vision_model.encoder.layers = (
                self.vision_tower.vision_model.encoder.layers[
                    : config.vision_feature_layer + 1
                ]
            )
        if config.downsample_factor > 1:
            self.pixel_unshuffle = PixelUnshuffleBlock(config.downsample_factor)
        else:
            self.pixel_unshuffle = nn.Identity()

        self.multi_modal_projector = Lfm2VlMultiModalProjector(config)
        self.language_model = AutoModel.from_config(config.text_config)
        self.post_init()

    def get_input_embeddings(self):
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.language_model.set_input_embeddings(value)

    def set_decoder(self, decoder):
        self.language_model = decoder

    def get_decoder(self):
        return self.language_model

    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        spatial_shapes: torch.Tensor,
        pixel_attention_mask: torch.Tensor,
        **kwargs,
    ) -> list[torch.Tensor]:
        """
        Obtains image last hidden states from the vision tower and applies the multimodal projection.

        Args:
            pixel_values (`torch.FloatTensor` of shape `(batch_size, channels, height, width)`):
                The tensors corresponding to the input images.
            spatial_shapes (`torch.Tensor` of shape `(batch_size, 2)`):
                The spatial shapes of the input images.
            pixel_attention_mask (`torch.Tensor` of shape `(batch_size, height, width)`):
                The pixel attention mask of the input images.
        Returns:
            image_features (`list[torch.Tensor]`): Image feature tensor of shape `(num_images, image_length, embed_dim)`.
        """
        image_outputs = self.vision_tower(
            pixel_values=pixel_values,
            spatial_shapes=spatial_shapes,
            pixel_attention_mask=pixel_attention_mask,
        ).last_hidden_state

        img_feature_lengths = pixel_attention_mask.sum(dim=1)
        image_features = []

        for img_idx in range(image_outputs.size(0)):
            feature = image_outputs[img_idx]
            # unpad the image representation
            feature = feature[: img_feature_lengths[img_idx], :].unsqueeze(0)

            feature_org_h, feature_org_w = spatial_shapes[img_idx]
            feature = feature.reshape(1, feature_org_h, feature_org_w, -1)
            feature = self.pixel_unshuffle(feature)

            # project the image representation
            img_embedding = self.multi_modal_projector(feature)

            # flatten here to handle variable length in naflex
            img_embedding = img_embedding.reshape(-1, img_embedding.size(-1))
            image_features.append(img_embedding)

        return image_features

    def get_placeholder_mask(
        self,
        input_ids: torch.LongTensor | None,
        inputs_embeds: torch.FloatTensor,
        image_features: torch.FloatTensor,
    ):
        """
        Obtains the multimodal placeholder mask from `input_ids` or `inputs_embeds`, and checks that the placeholder token count is
        equal to the length of multimodal features. If the lengths are different, an error is raised.
        """
        if input_ids is None:
            special_image_mask = inputs_embeds == self.get_input_embeddings()(
                torch.tensor(
                    self.config.image_token_id,
                    dtype=torch.long,
                    device=inputs_embeds.device,
                )
            )
            special_image_mask = special_image_mask.all(-1)
        else:
            special_image_mask = input_ids == self.config.image_token_id
        n_image_tokens = special_image_mask.sum()
        special_image_mask = (
            special_image_mask.unsqueeze(-1)
            .expand_as(inputs_embeds)
            .to(inputs_embeds.device)
        )
        n_image_features = image_features.shape[0]
        if inputs_embeds[special_image_mask].numel() != image_features.numel():
            raise ValueError(
                f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
            )
        return special_image_mask

    @can_return_tuple
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: torch.Tensor | None = None,
        position_ids: torch.LongTensor | None = None,
        pixel_values: torch.FloatTensor = None,
        spatial_shapes: torch.Tensor = None,
        pixel_attention_mask: torch.Tensor = None,
        past_key_values: Cache | None = None,
        inputs_embeds: torch.FloatTensor | None = None,
        use_cache: bool | None = None,
        output_attentions: bool | None = None,
        output_hidden_states: bool | None = None,
        return_dict: bool | None = None,
        cache_position: torch.LongTensor | None = None,
        image_sizes: torch.Tensor = None,
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> tuple | Lfm2VlModelOutputWithPast:
        """
        spatial_shapes (`torch.Tensor` of shape `(batch_size, 2)`, *optional*):
            The spatial shapes of the input images.
        pixel_attention_mask (`torch.Tensor` of shape `(batch_size, height, width)`, *optional*):
            The pixel attention mask of the input images.
        """
        output_attentions = (
            output_attentions
            if output_attentions is not None
            else self.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You must specify exactly one of input_ids or inputs_embeds"
            )

        if inputs_embeds is None:
            inputs_embeds = self.get_input_embeddings()(input_ids)

        if pixel_values is not None:
            image_features = self.get_image_features(
                pixel_values=pixel_values,
                spatial_shapes=spatial_shapes,
                pixel_attention_mask=pixel_attention_mask,
            )
            image_features = torch.cat(image_features, dim=0).to(
                inputs_embeds.device, inputs_embeds.dtype
            )
            special_image_mask = self.get_placeholder_mask(
                input_ids=input_ids,
                inputs_embeds=inputs_embeds,
                image_features=image_features,
            )
            inputs_embeds = inputs_embeds.masked_scatter(
                special_image_mask, image_features
            )

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=True,
            cache_position=cache_position,
            **kwargs,
        )

        return Lfm2VlModelOutputWithPast(
            last_hidden_state=outputs.last_hidden_state,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            image_hidden_states=image_features if pixel_values is not None else None,
        )

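# Note (illustrative): projected image features are scattered into the positions of
# `inputs_embeds` whose token id equals `image_token_index` (396 in the shipped config.json),
# so the number of image placeholder tokens emitted by the processor must equal the number of
# projected feature rows; otherwise `get_placeholder_mask` raises a ValueError.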

class Lfm2VlForConditionalGeneration(Lfm2VlPreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config: Lfm2VlConfig):
        super().__init__(config)
        self.model = Lfm2VlModel(config)
        self.lm_head = nn.Linear(
            config.text_config.hidden_size, config.text_config.vocab_size, bias=False
        )
        self.post_init()

    def _supports_default_dynamic_cache(self):
        return False

    def get_input_embeddings(self):
        return self.model.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.model.set_input_embeddings(value)

    def get_output_embeddings(self) -> nn.Module:
        return self.lm_head

    def set_decoder(self, decoder):
        self.model.set_decoder(decoder)

    def get_decoder(self):
        return self.model.get_decoder()

    def get_image_features(
        self,
        pixel_values: torch.FloatTensor,
        spatial_shapes: torch.Tensor,
        pixel_attention_mask: torch.Tensor,
        **kwargs,
    ):
        return self.model.get_image_features(
            pixel_values=pixel_values,
            spatial_shapes=spatial_shapes,
            pixel_attention_mask=pixel_attention_mask,
            **kwargs,
        )

    @property
    def language_model(self):
        return self.model.language_model

    @property
    def vision_tower(self):
        return self.model.vision_tower

    @property
    def multi_modal_projector(self):
        return self.model.multi_modal_projector

    @can_return_tuple
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        pixel_values: torch.FloatTensor = None,
        spatial_shapes: torch.Tensor = None,
        pixel_attention_mask: torch.Tensor = None,
        attention_mask: torch.Tensor | None = None,
        position_ids: torch.LongTensor | None = None,
        past_key_values: Cache | None = None,
        inputs_embeds: torch.FloatTensor | None = None,
        labels: torch.LongTensor | None = None,
        use_cache: bool | None = None,
        output_attentions: bool | None = None,
        output_hidden_states: bool | None = None,
        return_dict: bool | None = None,
        cache_position: torch.LongTensor | None = None,
        logits_to_keep: int | torch.Tensor = 0,
        image_sizes: torch.Tensor | None = None,
        **kwargs,
    ) -> tuple | Lfm2VlCausalLMOutputWithPast:
        r"""
        pixel_values (`torch.FloatTensor` of shape `(batch_size, channels, height, width)`, *optional*):
            The input image tensors.
        spatial_shapes (`torch.Tensor` of shape `(batch_size, 2)`, *optional*):
            The spatial shapes of the input images.
        pixel_attention_mask (`torch.Tensor` of shape `(batch_size, height, width)`, *optional*):
            The pixel attention mask of the input images.
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Example:

        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AutoModelForImageTextToText
        >>> from transformers.image_utils import load_image

        >>> model = AutoModelForImageTextToText.from_pretrained(
        ...     "LiquidAI/LFM2-VL-1.6B",
        ...     trust_remote_code=True
        ... )
        >>> processor = AutoProcessor.from_pretrained(
        ...     "LiquidAI/LFM2-VL-1.6B",
        ...     trust_remote_code=True
        ... )

        >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
        >>> image = load_image(url)

        >>> conversation = [
        ...     {
        ...         "role": "user",
        ...         "content": [
        ...             {"type": "image", "image": image},
        ...             {"type": "text", "text": "What is in this image?"},
        ...         ],
        ...     },
        ... ]

        >>> inputs = processor.apply_chat_template(
        ...     conversation,
        ...     add_generation_prompt=True,
        ...     tokenize=True,
        ...     return_dict=True,
        ...     return_tensors="pt"
        ... )

        >>> # Generate
        >>> outputs = model.generate(**inputs, max_new_tokens=45)
        >>> processor.batch_decode(outputs, skip_special_tokens=True)[0]
        'This image depicts a vibrant street scene in what appears to be a Chinatown or similar cultural area. The focal point is a large red stop sign with white lettering, mounted on a pole.'
        ```"""
        output_attentions = (
            output_attentions
            if output_attentions is not None
            else self.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        outputs = self.model(
            input_ids=input_ids,
            pixel_values=pixel_values,
            spatial_shapes=spatial_shapes,
            pixel_attention_mask=pixel_attention_mask,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=True,
            cache_position=cache_position,
            image_sizes=image_sizes,
            **kwargs,
        )

        hidden_states = outputs[0]
        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
        slice_indices = (
            slice(-logits_to_keep, None)
            if isinstance(logits_to_keep, int)
            else logits_to_keep
        )
        logits = self.lm_head(hidden_states[:, slice_indices, :])

        loss = None
        if labels is not None:
            loss = self.loss_function(
                logits=logits,
                labels=labels,
                vocab_size=self.config.text_config.vocab_size,
                **kwargs,
            )

        return Lfm2VlCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            image_hidden_states=outputs.image_hidden_states,
        )

    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        inputs_embeds=None,
        pixel_values=None,
        attention_mask=None,
        cache_position=None,
        logits_to_keep=None,
        **kwargs,
    ):
        # Overwritten -- in specific circumstances we don't want to forward image inputs to the model
        model_inputs = super().prepare_inputs_for_generation(
            input_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            cache_position=cache_position,
            logits_to_keep=logits_to_keep,
            **kwargs,
        )

        if cache_position[0] == 0:
            # If we're in cached decoding stage, pixel values should be None because input ids do not contain special image token anymore
            # Otherwise we need pixel values to be passed to model
            model_inputs["pixel_values"] = pixel_values

        return model_inputs


__all__ = ["Lfm2VlForConditionalGeneration", "Lfm2VlModel", "Lfm2VlPreTrainedModel"]