yongqiang committed on
Commit f4acc5b · 1 Parent(s): faed000

Initialize this repo

Files changed (46)
  1. README.md +109 -1
  2. assets/gen_out_img.jpg +0 -0
  3. embeds/codebook_entry_embedding.npy +3 -0
  4. embeds/codebook_entry_embedding.pt +3 -0
  5. embeds/gen_embed.npy +3 -0
  6. img_gen_onnx/gen_aligner.onnx +3 -0
  7. img_gen_onnx/gen_vision_model_decode_sim.onnx +3 -0
  8. img_gen_onnx/post_head.onnx +3 -0
  9. img_gen_onnx/post_norm.onnx +3 -0
  10. imgs/image.jpg +0 -0
  11. imgs/image.png +3 -0
  12. infer_axmodel_gen.py +276 -0
  13. infer_axmodel_und.py +228 -0
  14. janus_pro_1b_axmodel/llama_p640_l0_together.axmodel +3 -0
  15. janus_pro_1b_axmodel/llama_p640_l10_together.axmodel +3 -0
  16. janus_pro_1b_axmodel/llama_p640_l11_together.axmodel +3 -0
  17. janus_pro_1b_axmodel/llama_p640_l12_together.axmodel +3 -0
  18. janus_pro_1b_axmodel/llama_p640_l13_together.axmodel +3 -0
  19. janus_pro_1b_axmodel/llama_p640_l14_together.axmodel +3 -0
  20. janus_pro_1b_axmodel/llama_p640_l15_together.axmodel +3 -0
  21. janus_pro_1b_axmodel/llama_p640_l16_together.axmodel +3 -0
  22. janus_pro_1b_axmodel/llama_p640_l17_together.axmodel +3 -0
  23. janus_pro_1b_axmodel/llama_p640_l18_together.axmodel +3 -0
  24. janus_pro_1b_axmodel/llama_p640_l19_together.axmodel +3 -0
  25. janus_pro_1b_axmodel/llama_p640_l1_together.axmodel +3 -0
  26. janus_pro_1b_axmodel/llama_p640_l20_together.axmodel +3 -0
  27. janus_pro_1b_axmodel/llama_p640_l21_together.axmodel +3 -0
  28. janus_pro_1b_axmodel/llama_p640_l22_together.axmodel +3 -0
  29. janus_pro_1b_axmodel/llama_p640_l23_together.axmodel +3 -0
  30. janus_pro_1b_axmodel/llama_p640_l2_together.axmodel +3 -0
  31. janus_pro_1b_axmodel/llama_p640_l3_together.axmodel +3 -0
  32. janus_pro_1b_axmodel/llama_p640_l4_together.axmodel +3 -0
  33. janus_pro_1b_axmodel/llama_p640_l5_together.axmodel +3 -0
  34. janus_pro_1b_axmodel/llama_p640_l6_together.axmodel +3 -0
  35. janus_pro_1b_axmodel/llama_p640_l7_together.axmodel +3 -0
  36. janus_pro_1b_axmodel/llama_p640_l8_together.axmodel +3 -0
  37. janus_pro_1b_axmodel/llama_p640_l9_together.axmodel +3 -0
  38. janus_pro_1b_axmodel/llama_post.axmodel +3 -0
  39. janus_pro_1b_axmodel/model.embed_tokens.weight.npy +3 -0
  40. janus_pro_1b_tokenizer/config.json +66 -0
  41. janus_pro_1b_tokenizer/preprocessor_config.json +23 -0
  42. janus_pro_1b_tokenizer/processor_config.json +9 -0
  43. janus_pro_1b_tokenizer/special_tokens_map.json +16 -0
  44. janus_pro_1b_tokenizer/tokenizer.json +0 -0
  45. janus_pro_1b_tokenizer/tokenizer_config.json +10 -0
  46. vit_axmodel/janus_warp_vit.axmodel +3 -0
README.md CHANGED
@@ -9,4 +9,112 @@ pipeline_tag: visual-question-answering
  tags:
  - DeepSeek
  - Janus-Pro-1B
- ---
+ ---
+
+ # Janus-Pro-1B-Int8
+
+ This version of Janus-Pro-1B has been converted to run on the Axera NPU using **w8a16** quantization (8-bit weights, 16-bit activations).
+
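+ As a rough illustration of the scheme (a minimal numpy sketch, not the actual Pulsar2 implementation), w8a16 keeps each weight matrix as int8 with a per-channel scale and dequantizes it against higher-precision activations at matmul time:
+
+ ```python
+ import numpy as np
+
+ # minimal w8a16 sketch: int8 weights + per-output-channel scales,
+ # activations kept in 16-bit float (illustrative only)
+ def quantize_w8(w: np.ndarray):
+     scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row scale
+     q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
+     return q, scale.astype(np.float32)
+
+ def matmul_w8a16(x16: np.ndarray, q: np.ndarray, scale: np.ndarray):
+     # dequantize weights on the fly, accumulate in float32, return float16
+     w = q.astype(np.float32) * scale
+     return (x16.astype(np.float32) @ w.T).astype(np.float16)
+
+ w = np.random.randn(2048, 2048).astype(np.float32)
+ x = np.random.randn(1, 2048).astype(np.float16)
+ q, s = quantize_w8(w)
+ print(matmul_w8a16(x, q, s).shape)  # (1, 2048)
+ ```
+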
+ Compatible with Pulsar2 version: 3.3
+
+ ## Conversion tool links
+
+ If you are interested in model conversion, you can try exporting the axmodel from the original repo:
+ https://huggingface.co/deepseek-ai/Janus-Pro-1B
+
+ [Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
+
+ ## Support Platform
+ - AX650
+ - AX650N DEMO Board
+ - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
+
+ | Chips | Image encoder (384×384) | TTFT | w8a16 decode |
+ |--|--|--|--|
+ | AX650 | 142.682 ms | 4560.214 ms | 11.43 tokens/sec |
+
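+ As a back-of-envelope reading of the table (assuming TTFT already covers prompt processing and the decode rate stays constant):
+
+ ```python
+ # rough end-to-end latency estimate from the numbers above (illustrative)
+ ttft_ms = 4560.214      # time to first token, from the table
+ decode_tps = 11.43      # w8a16 decode throughput, tokens/sec
+ n_new = 128             # assumed answer length in tokens
+ total_s = ttft_ms / 1e3 + n_new / decode_tps
+ print(f"~{total_s:.1f} s for a {n_new}-token answer")  # ~15.8 s
+ ```
+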
+ ## How to use
+
+ Download all files from this repository to the device.
+
+ **If you are using an AX650 board**, the layout should look like this:
+ ```
+ root@ax650:/mnt/qtang/llm-test/temp/Janus-Pro-1B# tree -L 1
+ .
+ |-- README.md
+ |-- assets
+ |-- embeds
+ |-- img_gen_onnx
+ |-- imgs
+ |-- infer_axmodel_gen.py
+ |-- infer_axmodel_und.py
+ |-- janus_pro_1b_axmodel
+ |-- janus_pro_1b_tokenizer
+ `-- vit_axmodel
+ ```
+
+ #### Install Janus
+
+ ```bash
+ $ git clone https://github.com/deepseek-ai/Janus
+ $ cd Janus
+ $ pip3 install -e .
+ ```
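+
+ A quick sanity check (optional); these are the same imports the inference scripts below rely on:
+
+ ```python
+ # confirm the janus package is importable after `pip3 install -e .`
+ from janus.models import VLChatProcessor
+ from janus.utils.io import load_pil_images
+ print("janus import OK")
+ ```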
+
+ #### Inference on an AX650 host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board
+
+ **Multimodal Understanding**
+
+ input text:
+
+ ```
+ Describe the picture
+ ```
+
+ input image:
+
+ ![](imgs/image.png)
+
+ log information:
+
+ ```bash
+ root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # python3 infer_axmodel_und.py --tokenizer_dir janus_pro_1b_tokenizer --axmodel_path janus_pro_1b_axmodel --vit_axmodel_path vit_axmodel/janus_warp_vit.axmodel -i ./imgs/image.png
+ [INFO] Available providers: ['AxEngineExecutionProvider']
+ [INFO] Chip type: ChipType.MC50
+ [INFO] VNPU type: VNPUType.DISABLED
+ [INFO] Engine version: 2.11.0a
+ vit_output.shape is (1, 576, 2048), vit feature extract done!
+ Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 24/24 [00:04<00:00, 4.94it/s]
+ model load done!
+ prefill done!
+ Decoder: 62%|█████████████████████████████████████████▍ | 634/1024 [00:00<00:00, 2505.28it/s]Decoder: 72%|█████████████████████████████████████████████████▉ | 741/1024 [00:19<00:10, 27.69it/s]hit eos!
+ Decoder: 74%|███████████████████████████████████████████████████▎ | 762/1024 [00:23<00:08, 31.84it/s]
+ Janus Answers: The image depicts three astronauts standing in a lush, green forest. They are wearing traditional white space suits with various patches and equipment attached. The suits have a reflective visor on their helmets, and they appear to be in a relaxed pose, with one astronaut raising his arms and the others standing or crouching. The forest is dense with tall trees and dense foliage, creating a serene and somewhat mysterious atmosphere.
+ ```
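+
+ Under the hood, `infer_axmodel_und.py` runs the image through the ViT axmodel and splices the resulting embeddings into the text embedding sequence wherever the processor placed image-placeholder tokens (see `prepare_inputs_embeds` in that script). A toy illustration of the mask-based splice, with made-up shapes:
+
+ ```python
+ import numpy as np
+
+ # toy sizes: 6-token prompt, 3 image positions, embedding dim 4
+ text_embeds = np.zeros((1, 6, 4))
+ image_embeds = np.ones((3, 4))  # stand-in for ViT output
+ images_seq_mask = np.array([[False, True, True, True, False, False]])
+
+ # the same boolean-mask assignment prepare_inputs_embeds uses
+ text_embeds[images_seq_mask] = image_embeds
+ print(text_embeds[0, :, 0])  # [0. 1. 1. 1. 0. 0.]
+ ```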
+
+ **Text-to-Image Generation**
+
+ input text:
+
+ ```
+ "A close-up high-contrast photo of Sydney Opera House sitting next to Eiffel tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue."
+ ```
+
+ log information:
+
+ ```bash
+ root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # python3 infer_axmodel_gen.py --tokenizer_dir janus_pro_1b_tokenizer/ --axmodel_path janus_pro_1b_axmodel/
+ [INFO] Available providers: ['AxEngineExecutionProvider']
+ Init InferenceSession: 0%| | 0/24 [00:00<?, ?it/s][INFO] Chip type: ChipType.MC50
+ [INFO] VNPU type: VNPUType.DISABLED
+ [INFO] Engine version: 2.11.0a
+ Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 24/24 [00:14<00:00, 1.68it/s]
+ 2025-04-14 15:55:23.408 | INFO | __main__:<module>:269 - model load done!
+ 2025-04-14 15:55:33.104 | DEBUG | __main__:generate:158 - prefill completed!
+ ImageToken: 18%|████████████ | 104/575 [00:39<02:58, 2.64it/s]ImageToken: 45%|██████████████████████████████▍ | 261/575 [01:39<01:58, 2.65it/s]ImageToken: 73%|████████████████████████████████████████████████▊ | 419/575 [02:39<00:58, 2.66it/s]ImageToken: 100%|███████████████████████████████████████████████████████████████████| 575/575 [03:38<00:00, 2.63it/s]
+ ```
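+
+ Internally, `infer_axmodel_gen.py` runs a conditional and an unconditional copy of the prompt in one batch and mixes their logits with classifier-free guidance before sampling each of the 576 image tokens. The mixing step in isolation, with made-up logits:
+
+ ```python
+ import numpy as np
+
+ # classifier-free guidance as used in the script:
+ # logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
+ cfg_weight = 5.0
+ logit_cond = np.array([2.0, 0.5, -1.0])    # prompt-conditioned logits
+ logit_uncond = np.array([1.0, 0.8, -0.2])  # unconditional logits
+ logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
+ print(logits)  # [ 6.  -0.7 -4.2] -- conditional signal amplified 5x
+ ```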
+
+ output image:
+
+ ![](assets/gen_out_img.jpg)
assets/gen_out_img.jpg ADDED
embeds/codebook_entry_embedding.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:97fc92031b689c685f3b36d7542eba5002cd937c63dcf33731601ef999c68613
+ size 524416
embeds/codebook_entry_embedding.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a67cea6583ef3da486fdfcd6cff62a2771795c7ab46b8f1000852be4f1a137c5
+ size 263473
embeds/gen_embed.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c70d799c8ab4c507b2916f304ba0f792e2dbf0a26100cb1242babe1f2e57d455
+ size 524416
img_gen_onnx/gen_aligner.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0642c360b65e5f41b1caf7637650e057b38a9aad40552a7669a76b3395653c5d
+ size 16860554
img_gen_onnx/gen_vision_model_decode_sim.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e27a17bc19df77059481b30582ca58e3a28bd66783fb9ca8c3022bf33e77f8bf
+ size 169913021
img_gen_onnx/post_head.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92be45cb8d1c3ae5c19c906a71195f13a0755f16881a6769e8ae9b5ca85eaa8f
+ size 151070226
img_gen_onnx/post_norm.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10899a40c25d7d1a879c0b9a7fe06255b5148d56d5965b6c5a8b8bb7d72feecf
+ size 9423
imgs/image.jpg ADDED
imgs/image.png ADDED

Git LFS Details

  • SHA256: 622ae2d01ff4467fa69a7888728d776650117a0f4887e96ba0fb9a8a6d77b3c3
  • Pointer size: 131 Bytes
  • Size of remote file: 355 kB
infer_axmodel_gen.py ADDED
@@ -0,0 +1,276 @@
+ # REF: https://github.com/deepseek-ai/Janus
+ import numpy as np
+ import torch
+ from axengine import InferenceSession
+ from ml_dtypes import bfloat16
+ from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForCausalLM
+ from tqdm import tqdm
+ from einops import rearrange
+ from janus.models import MultiModalityCausalLM, VLChatProcessor
+ from janus.models.modeling_vlm import MultiModalityConfig
+ from janus.utils.io import load_pil_images
+ import os
+ import PIL.Image
+ from loguru import logger
+ import onnxruntime
+ import argparse
+
+
+ parser = argparse.ArgumentParser(description="Model configuration parameters")
+ parser.add_argument("--tokenizer_dir", type=str, default="Janus-Pro-1B",
+                     help="Path to HuggingFace model")
+ parser.add_argument("--axmodel_path", type=str, default="janus_pro_1B_axmodel",
+                     help="Path to save compiled axmodel of llama model")
+ args = parser.parse_args()
+
+
+ # base info
+ tokenizer_dir = args.tokenizer_dir
+ axmodel_path = args.axmodel_path
+
+ """ONNX MODELS"""
+ gen_vision_model_decode = onnxruntime.InferenceSession("./img_gen_onnx/gen_vision_model_decode_sim.onnx", providers=["CPUExecutionProvider"])
+ gen_aligner = onnxruntime.InferenceSession("./img_gen_onnx/gen_aligner.onnx", providers=["CPUExecutionProvider"])
+ gen_head = onnxruntime.InferenceSession("./img_gen_onnx/post_head.onnx", providers=["CPUExecutionProvider"])
+ post_norm = onnxruntime.InferenceSession("./img_gen_onnx/post_norm.onnx", providers=["CPUExecutionProvider"])
+ """ONNX MODELS"""
+
+ """EMBEDDINGS"""
+ embeds = np.load(f"{axmodel_path}/model.embed_tokens.weight.npy")
+ gen_embed = np.load("./embeds/gen_embed.npy")
+ codebook_entry_embedding = torch.load('./embeds/codebook_entry_embedding.pt', map_location=torch.device('cpu'))
+ """EMBEDDINGS"""
+
+
+ def prefill(
+     cfg,
+     prefill_decoder_sessions,
+     vl_chat_processor: VLChatProcessor,
+     prompt: str,
+     temperature: float = 1,
+     parallel_size: int = 1,
+     cfg_weight: float = 5,
+     image_token_num_per_image: int = 576,
+ ):
+     input_ids = vl_chat_processor.tokenizer.encode(prompt)
+     input_ids = torch.LongTensor(input_ids)
+
+     # even rows hold the conditional prompt, odd rows the unconditional (padded) one
+     tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int)
+     for i in range(parallel_size*2):
+         tokens[i, :] = input_ids
+         if i % 2 != 0:
+             tokens[i, 1: -1] = vl_chat_processor.pad_id
+
+     inputs_embeds = embeds[tokens.numpy()]
+     batch, token_len, seq_dim = inputs_embeds.shape
+     generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int)
+     prefill_len = 640
+     token_ids = tokens
+
+     ###################################################################
+     lastN = 1023
+     kv_dim = cfg.hidden_size // cfg.num_attention_heads * cfg.num_key_value_heads
+     batch_k_caches = {}
+     batch_v_caches = {}
+
+     for bid in range(batch):
+         batch_k_caches[bid] = [
+             np.zeros((1, lastN, kv_dim), dtype=bfloat16)
+             for _ in range(cfg.num_hidden_layers)
+         ]
+         batch_v_caches[bid] = [
+             np.zeros((1, lastN, kv_dim), dtype=bfloat16)
+             for _ in range(cfg.num_hidden_layers)
+         ]
+     ###################################################################
+     mask = np.zeros((1, prefill_len, prefill_len)) - 65536
+     for j in range(token_len):
+         mask[:, j, :j + 1] = 0
+     mask = mask.astype(bfloat16)
+
+     indices = np.array(list(range(prefill_len)), np.uint32).reshape(
+         (1, prefill_len)
+     )
+     indices[:, token_len:] = 0
+     hidden_states = np.zeros((batch, token_len, cfg.hidden_size)).astype(bfloat16)
+
+     for bid in range(batch):
+         data = np.zeros((1, prefill_len, cfg.hidden_size)).astype(bfloat16)
+         data[:, 0:token_len] = inputs_embeds[bid].astype(bfloat16)
+         k_caches = batch_k_caches[bid]
+         v_caches = batch_v_caches[bid]
+
+         for i in range(cfg.num_hidden_layers):
+             input_feed = {
+                 "K_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
+                 "V_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
+                 "indices": indices,
+                 "input": data,
+                 "mask": mask,
+             }
+             outputs = prefill_decoder_sessions[i].run(None, input_feed, shape_group=1)
+             k_caches[i][:, :token_len, :] = outputs[0][:, :token_len, :]
+             v_caches[i][:, :token_len, :] = outputs[1][:, :token_len, :]
+             data[:, :token_len] = outputs[2][:, :token_len, :]
+
+         ######## BATCH ###########
+         hidden_states[bid] = data[:, :token_len]
+         batch_k_caches[bid] = k_caches
+         batch_v_caches[bid] = v_caches
+
+     ################# NORM & GEN-HEAD ########################
+     hidden_states = post_norm.run(["output"], {"input": hidden_states[:, -1:, :].astype(np.float32)})[0]
+     logits = gen_head.run(["output"], {"input": hidden_states[:, -1, :]})[0]  # unlike the llama head, this head is dedicated to image generation
+     ############# POST & GET NEXT TOKEN #############
+     logits = torch.from_numpy(logits)
+     logit_cond = logits[0::2, :]
+     logit_uncond = logits[1::2, :]
+     logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)  # classifier-free guidance
+     probs = torch.softmax(logits / temperature, dim=-1)
+     next_token = torch.multinomial(probs, num_samples=1)
+     generated_tokens[:, 0] = next_token.squeeze(dim=-1)
+     next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
+     ################## PREPARE_GEN_IMG_EMBEDS ##################
+     gen_embed_res = np.take(gen_embed, next_token.numpy().tolist(), axis=0)
+     img_embeds = gen_aligner.run(["output"], {"input": gen_embed_res})[0]
+     inputs_embeds = np.expand_dims(img_embeds, axis=1)
+     return inputs_embeds, token_ids, generated_tokens, batch_k_caches, batch_v_caches
+
+
+ @torch.inference_mode()
+ def generate(
+     cfg,
+     prefill_decoder_sessions,
+     vl_chat_processor: VLChatProcessor,
+     prompt: str,
+     temperature: float = 1,
+     parallel_size: int = 1,  # currently only parallel_size=1 is supported
+     cfg_weight: float = 5,
+     image_token_num_per_image: int = 576,
+     img_size: int = 384,
+     patch_size: int = 16,
+ ):
+     inputs_embeds, token_ids, generated_tokens, batch_k_caches, batch_v_caches = prefill(
+         cfg, prefill_decoder_sessions, vl_chat_processor,
+         prompt, temperature, parallel_size, cfg_weight, image_token_num_per_image
+     )
+
+     logger.debug("prefill completed!")
+     token_len = token_ids.shape[1]
+
+     lastN = 1023
+
+     batch = parallel_size * 2
+
+     mask = np.zeros((1, 1, lastN + 1), dtype=np.float32).astype(bfloat16)
+     mask[:, :, :lastN] -= 65536
+     mask[:, :, :token_len] = 0
+
+     for image_token_i in tqdm(range(1, image_token_num_per_image), desc="ImageToken"):
+
+         # decode logic below
+         start_index = image_token_i + token_len - 1
+         indices = np.array([start_index], np.uint32).reshape((1, 1))
+         hidden_states = np.zeros((batch, 1, cfg.hidden_size)).astype(bfloat16)  # batch, 1, seq_dim
+         assert (inputs_embeds[0] == inputs_embeds[1]).all()
+
+         for bid in range(batch):
+             k_caches = batch_k_caches[bid]
+             v_caches = batch_v_caches[bid]
+             data = inputs_embeds[:1, ...].astype(bfloat16)
+
+             for i in range(cfg.num_hidden_layers):
+                 input_feed = {
+                     "K_cache": k_caches[i],
+                     "V_cache": v_caches[i],
+                     "indices": indices,
+                     "input": data,
+                     "mask": mask,
+                 }
+
+                 outputs = prefill_decoder_sessions[i].run(None, input_feed, shape_group=0)
+                 k_caches[i][:, start_index, :] = outputs[0][:, :, :]
+                 v_caches[i][:, start_index, :] = outputs[1][:, :, :]
+                 data = outputs[2]
+
+             hidden_states[bid] = data
+             batch_k_caches[bid] = k_caches
+             batch_v_caches[bid] = v_caches
+
+         mask[..., start_index] = 0
+
+         ############### NORM & GEN_HEAD #######################
+         hidden_states = post_norm.run(["output"], {"input": hidden_states.astype(np.float32)})[0]
+         logits = gen_head.run(["output"], {"input": hidden_states[:, -1, :]})[0]
+         ############# POST & GET NEXT TOKEN #############
+         logits = torch.from_numpy(logits)
+         logit_cond = logits[0::2, :]
+         logit_uncond = logits[1::2, :]
+         logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)  # classifier-free guidance
+         probs = torch.softmax(logits / temperature, dim=-1)
+         next_token = torch.multinomial(probs, num_samples=1)
+         generated_tokens[:, image_token_i] = next_token.squeeze(dim=-1)
+         next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
+         ################## PREPARE_GEN_IMG_EMBEDS ##################
+         gen_embed_res = np.take(gen_embed, next_token.numpy().tolist(), axis=0)
+         img_embeds = gen_aligner.run(["output"], {"input": gen_embed_res})[0]
+         inputs_embeds = np.expand_dims(img_embeds, axis=1)
+
+     # z_q is the quantized latent fed to the vision decoder
+     indices = generated_tokens.to(dtype=torch.int)
+     shape = [parallel_size, 8, img_size//patch_size, img_size//patch_size]
+     z_q = codebook_entry_embedding[indices]  # (b*h*w, c)
+     z_q = z_q.reshape(shape[0], shape[2], shape[3], shape[1])
+     # reshape back to match original input shape
+     z_q = z_q.permute(0, 3, 1, 2)
+     dec = gen_vision_model_decode.run(['image'], {'quant': z_q.to(dtype=torch.float32).numpy()})[0]
+     dec = dec.transpose(0, 2, 3, 1)
+     dec = np.clip((dec + 1) / 2 * 255, 0, 255)
+     visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
+     visual_img[:, :, :] = dec
+
+     os.makedirs('generated_samples', exist_ok=True)
+     for i in range(parallel_size):
+         save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
+         PIL.Image.fromarray(visual_img[i]).save(save_path)
+
+ ###################################################################
+ config: MultiModalityConfig = AutoConfig.from_pretrained(tokenizer_dir, trust_remote_code=True)
+ vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(tokenizer_dir)
+ tokenizer = vl_chat_processor.tokenizer
+
+ description = "A close-up high-contrast photo of Sydney Opera House sitting next to Eiffel tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue."
+
+ conversation = [
+     {
+         "role": "User",
+         "content": description,
+     },
+     {"role": "Assistant", "content": ""},
+ ]
+
+ sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
+     conversations=conversation,
+     sft_format=vl_chat_processor.sft_format,
+     system_prompt="",
+ )
+ prompt = sft_format + vl_chat_processor.image_start_tag
+ ###################################################################
+
+ cfg = config.language_config
+
+ prefill_decoder_sessions = []
+ for i in tqdm(range(cfg.num_hidden_layers), desc="Init InferenceSession"):
+     session = InferenceSession(
+         f"{axmodel_path}/llama_p640_l{i}_together.axmodel"
+     )
+     prefill_decoder_sessions.append(session)
+
+ logger.info("model load done!")
+
+ generate(
+     cfg,
+     prefill_decoder_sessions,
+     vl_chat_processor,
+     prompt
+ )
infer_axmodel_und.py ADDED
@@ -0,0 +1,228 @@
+ # REF: https://github.com/deepseek-ai/Janus
+ import numpy as np
+ import torch
+ from axengine import InferenceSession
+ from ml_dtypes import bfloat16
+ from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForCausalLM
+ from tqdm import tqdm
+ from einops import rearrange
+ from janus.models import MultiModalityCausalLM, VLChatProcessor
+ from janus.models.modeling_vlm import MultiModalityConfig
+ from janus.utils.io import load_pil_images
+ import argparse
+ import os
+
+
+ parser = argparse.ArgumentParser(description="Model configuration parameters")
+ parser.add_argument("--tokenizer_dir", type=str, default="Janus-Pro-1B",
+                     help="Path to HuggingFace model")
+ parser.add_argument("--axmodel_path", type=str, default="janus_pro_1B_axmodel",
+                     help="Path to save compiled axmodel of llama model")
+ parser.add_argument("-i", "--test_img_path", type=str, default="./imgs/image.png",
+                     help="Test image path (supports png/jpg formats)")
+ parser.add_argument("--vit_axmodel_path", type=str, default="vit_axmodel/janus_warp_vit.axmodel",
+                     help="Path to ViT model's axmodel")
+
+ args = parser.parse_args()
+
+ # base info
+ tokenizer_dir = args.tokenizer_dir
+ axmodel_path = args.axmodel_path
+ test_img_path = args.test_img_path
+ vit_axmodel_path = args.vit_axmodel_path
+ embeds = np.load(os.path.join(args.axmodel_path, "model.embed_tokens.weight.npy"))
+
+
+ def prepare_inputs_embeds(
+     input_ids: torch.LongTensor,
+     pixel_values: torch.FloatTensor,
+     images_seq_mask: torch.LongTensor,
+     images_emb_mask: torch.LongTensor,
+     **kwargs,
+ ):
+     """
+     Args:
+         input_ids (torch.LongTensor): [b, T]
+         pixel_values (torch.FloatTensor): [b, n_images, 3, h, w]
+         images_seq_mask (torch.BoolTensor): [b, T]
+         images_emb_mask (torch.BoolTensor): [b, n_images, n_image_tokens]
+
+         assert torch.sum(images_seq_mask) == torch.sum(images_emb_mask)
+
+     Returns:
+         input_embeds (torch.Tensor): [b, T, D]
+     """
+
+     bs, n = pixel_values.shape[0:2]
+     images = rearrange(pixel_values, "b n c h w -> (b n) c h w")
+     # [b x n, T2, D]
+     vit_session = InferenceSession(vit_axmodel_path)
+     images_embeds = vit_session.run(None, {"image": pixel_values[0].numpy()})[0]  # pixel_values: [1, 1, 3, 384, 384]
+     print(f"vit_output.shape is {images_embeds.shape}, vit feature extract done!")
+
+     # [b x n, T2, D] -> [b, n x T2, D]
+     images_embeds = rearrange(images_embeds, "(b n) t d -> b (n t) d", b=bs, n=n)
+     # [b, n, T2] -> [b, n x T2]
+     images_emb_mask = rearrange(images_emb_mask, "b n t -> b (n t)")
+
+     # [b, T, D]
+     input_ids[input_ids < 0] = 0  # ignore the image embeddings
+     inputs_embeds = np.take(embeds, input_ids[0].cpu().numpy().tolist(), axis=0)[None, ...]
+     inputs_embeds[images_seq_mask] = images_embeds[images_emb_mask]
+
+     return inputs_embeds
+
+ def post_process(data, topk=1, topp=0.9, temperature=0.6):
+     def top_p(l: np.ndarray, p: float) -> np.ndarray:
+         index = np.argsort(l)
+         res = l.copy()
+         sum_p = 0
+         for i in index[::-1]:
+             if sum_p >= p:
+                 res[i] = 0
+             sum_p += res[i]
+         return res / sum_p
+
+     def softmax(l: np.ndarray) -> np.ndarray:
+         l_max = l - l.max()
+         l_exp = np.exp(l_max)
+         res = l_exp / np.sum(l_exp)
+         return res.astype(np.float64)
+
+     r = data.astype(np.float32)
+     r = r.flatten()
+     candidate_index = np.argpartition(r, -topk)[-topk:]
+     candidate_value = r[candidate_index]
+     candidate_value /= temperature
+     candidate_soft = softmax(candidate_value)
+     candidate_soft = top_p(candidate_soft, topp)
+     candidate_soft = candidate_soft.astype(np.float64) / candidate_soft.sum()
+     pos = np.random.multinomial(1, candidate_soft).argmax()
+     next_token = candidate_index[pos]
+     return next_token, candidate_index, candidate_soft
+
+ config: MultiModalityConfig = AutoConfig.from_pretrained(tokenizer_dir, trust_remote_code=True)
+ vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(tokenizer_dir)
+ tokenizer = vl_chat_processor.tokenizer
+
+ # question = "请尝试理解这幅图中的内容."  # alternative prompt in Chinese
+ question = "Please describe the picture."
+ conversation = [
+     {
+         "role": "User",
+         "content": f"<image_placeholder>\n{question}",
+         "images": [test_img_path],
+     },
+     {"role": "Assistant", "content": ""},
+ ]
+
+ # load images and prepare for inputs
+ pil_images = load_pil_images(conversation)
+ prepare_inputs = vl_chat_processor(
+     conversations=conversation, images=pil_images, force_batchify=True
+ )
+
+ input_embedding = prepare_inputs_embeds(**prepare_inputs)
+ token_ids = prepare_inputs['input_ids'].squeeze().numpy().tolist()
+ prefill_data = input_embedding
+ prefill_data = prefill_data.astype(bfloat16)
+ token_len = len(token_ids)
+
+ lastN = 1023
+ cfg = config.language_config
+
+ kv_dim = cfg.hidden_size // cfg.num_attention_heads * cfg.num_key_value_heads
+ k_caches = [
+     np.zeros((1, lastN, kv_dim), dtype=bfloat16)
+     for _ in range(cfg.num_hidden_layers)
+ ]
+ v_caches = [
+     np.zeros((1, lastN, kv_dim), dtype=bfloat16)
+     for _ in range(cfg.num_hidden_layers)
+ ]
+
+ prefill_decoder_sessions = []
+ for i in tqdm(range(cfg.num_hidden_layers), desc="Init InferenceSession"):
+     session = InferenceSession(
+         f"{axmodel_path}/llama_p640_l{i}_together.axmodel"
+     )
+     prefill_decoder_sessions.append(session)
+ post_process_session = InferenceSession(
+     f"{axmodel_path}/llama_post.axmodel"
+ )
+ print("model load done!")
+
+ """
+ prefill
+ """
+ prefill_len = 640
+
+ if prefill_len > 0:
+     indices = np.array(list(range(prefill_len)), np.uint32).reshape(
+         (1, prefill_len)
+     )
+     indices[:, token_len:] = 0
+     mask = np.zeros((1, prefill_len, prefill_len)) - 65536
+     data = np.zeros((1, prefill_len, cfg.hidden_size)).astype(bfloat16)
+     data[:, 0:token_len] = prefill_data
+     for i, t in enumerate(token_ids):
+         mask[:, i, : i + 1] = 0
+     mask = mask.astype(bfloat16)
+     for i in range(cfg.num_hidden_layers):
+         input_feed = {
+             "K_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
+             "V_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
+             "indices": indices,
+             "input": data,
+             "mask": mask,
+         }
+         outputs = prefill_decoder_sessions[i].run(None, input_feed, shape_group=1)
+         k_caches[i][:, :token_len, :] = outputs[0][:, :token_len, :]
+         v_caches[i][:, :token_len, :] = outputs[1][:, :token_len, :]
+         data[:, :token_len] = outputs[2][:, :token_len, :]
+
+     post_out = post_process_session.run(None, {"input": data[:, token_len - 1, :][None, ...]})[0]
+     next_token, possible_tokens, possible_soft = post_process(post_out, topk=1)
+     possibles = [tokenizer.decode([t]) for t in possible_tokens]
+     possible_pairs = [str((t, s)) for t, s in zip(possibles, possible_soft)]
+     token_ids.append(next_token)
+     print("prefill done!")
+
+ """
+ decode
+ """
+ mask = np.zeros((1, 1, lastN + 1), dtype=np.float32).astype(bfloat16)
+ mask[:, :, :lastN] -= 65536
+ mask[:, :, :token_len] = 0
+ for start_index in tqdm(range(lastN + 1), desc="Decoder"):  # lastN + 1
+     if prefill_len > 0 and start_index < token_len:
+         continue  # these positions were already filled by prefill
+     next_token = token_ids[start_index]
+     indices = np.array([start_index], np.uint32).reshape((1, 1))
+     data = embeds[next_token, :].reshape((1, 1, cfg.hidden_size)).astype(bfloat16)
+
+     for i in range(cfg.num_hidden_layers):
+         input_feed = {
+             "K_cache": k_caches[i],
+             "V_cache": v_caches[i],
+             "indices": indices,
+             "input": data,
+             "mask": mask,
+         }
+         outputs = prefill_decoder_sessions[i].run(None, input_feed, shape_group=0)
+         k_caches[i][:, start_index, :] = outputs[0][:, :, :]
+         v_caches[i][:, start_index, :] = outputs[1][:, :, :]
+         data = outputs[2]
+
+     mask[..., start_index] = 0
+     if start_index < token_len - 1:
+         pass
+     else:
+         post_out = post_process_session.run(None, {"input": data})[0]
+         next_token, possible_tokens, possible_soft = post_process(post_out)
+         token_ids.append(next_token)
+         if next_token == tokenizer.eos_token_id:
+             print("hit eos!")
+             break
+ print("Janus Answers: ", tokenizer.decode(token_ids[token_len:], skip_special_tokens=True))
janus_pro_1b_axmodel/llama_p640_l0_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:36e476b67cc13f0fe6701b7d666e9e316ee03d38998ba633964f7f96e92b8db5
+ size 58843532
janus_pro_1b_axmodel/llama_p640_l10_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ddf79ff6a43ead47fda4308fd89e0468a3b19ad8e5f5a912247a9de160c34954
+ size 58844556
janus_pro_1b_axmodel/llama_p640_l11_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f108e870863238890fb579c7bb991abd0a8b4f695ff2b5d483c6e16a2b0433c
+ size 58844684
janus_pro_1b_axmodel/llama_p640_l12_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5da5feb965e8fbb678a144ca26e5ff9d520d80c18823563b1cb382980bcabe1b
+ size 58844236
janus_pro_1b_axmodel/llama_p640_l13_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7358c891f13c998f87a1f1d85f3357fffebfac4d5bb67e15868a0a93113108a9
+ size 58844620
janus_pro_1b_axmodel/llama_p640_l14_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:19e2d24aa96773a866043bfefc1b815f04964c9d27b18637401de306d8bb5595
+ size 58844140
janus_pro_1b_axmodel/llama_p640_l15_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:80bf1befea66e3f42d9cd77a92b35cae27683f50d34becb7095ba0f035c55cb9
+ size 58844268
janus_pro_1b_axmodel/llama_p640_l16_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2283ea05dabba501779dc79ffbf5ce6e0ab18ad157a3aa2a3e488d888082b342
+ size 58844396
janus_pro_1b_axmodel/llama_p640_l17_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:723136b342e5197d2e508510c7f247cdae853211e9d8710438cf2fe09712ec1a
+ size 58844076
janus_pro_1b_axmodel/llama_p640_l18_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:660e661f19ccf22c91034ed4d3a1869c5963f098e4e6509193a1aca6fcb24401
+ size 58844300
janus_pro_1b_axmodel/llama_p640_l19_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:deb31364fa508c5526c70915c38f8ccb052cd84d6c79893bf46590b37cce25a2
+ size 58844364
janus_pro_1b_axmodel/llama_p640_l1_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3828219df2633babf673a3fbb20a5d8d8dde602dae5a5ed35a76349c0b7a2dac
+ size 58844460
janus_pro_1b_axmodel/llama_p640_l20_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74dddcc432823b8257a52712f4e5cdb53391291b6b19e8f277c96550f8e118a7
+ size 58844236
janus_pro_1b_axmodel/llama_p640_l21_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:06961849a4c6a31fa8454abd61e80fed76e4c4a050cbd1b7d16c638c6599d529
+ size 58844620
janus_pro_1b_axmodel/llama_p640_l22_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b80d1205cd37f7ff88cf522385910b1332e4a6a9c4b1419e03099c12884e718c
+ size 58844108
janus_pro_1b_axmodel/llama_p640_l23_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b8f8387fcd1a8030275555828e8335fe7de694776f847d95bb048f889b880bb
+ size 58843980
janus_pro_1b_axmodel/llama_p640_l2_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ff8bd9786537b7cf155ebd64459de2fc643a101a5469a071e0758604bb14f66
+ size 58844492
janus_pro_1b_axmodel/llama_p640_l3_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:94cf4e816de0f8f78a6ec18917302da650673f1a9e6a907d1cab3875e2eb15ab
+ size 58844556
janus_pro_1b_axmodel/llama_p640_l4_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:97c497026a610c17ca80da4f828ce71053ab71bdaadd356cc7ddbfb2a4ef5c03
+ size 58844108
janus_pro_1b_axmodel/llama_p640_l5_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73e0ea653bf7410aab2a2e7e239cb57f71efdad47a78d1fab57e127f327de6fb
+ size 58844300
janus_pro_1b_axmodel/llama_p640_l6_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d6d0cc26433000a91adccd97869916bfcebff975c94a59865b8e0343b0cfee0
+ size 58844460
janus_pro_1b_axmodel/llama_p640_l7_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:01061af690dcf356ae74c2b2b927c1b06ccfc6e594a9c67d7cb3fdba0aca2508
+ size 58843980
janus_pro_1b_axmodel/llama_p640_l8_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9ce71289afc108c8e2f304d764f14f9efa13ea6342bc64e0484aba78db25e64f
+ size 58844364
janus_pro_1b_axmodel/llama_p640_l9_together.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c3f4fa46650e6e2c88bc8a7cb0dd39b7fbd08652e99dac3452e437517788e69b
+ size 58844364
janus_pro_1b_axmodel/llama_post.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8950aede1718e00a9f0489c90bf76a8639cd43781ae6c0b49978a3b7202513e
+ size 229046979
janus_pro_1b_axmodel/model.embed_tokens.weight.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:282e7088dbdb59b03e948edd97d3768f9d5daecbb7e7cb690147ffca25948ce1
+ size 838860928
janus_pro_1b_tokenizer/config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "aligner_config": {
+     "cls": "MlpProjector",
+     "model_type": "aligner",
+     "params": {
+       "depth": 2,
+       "input_dim": 1024,
+       "n_embed": 2048,
+       "projector_type": "mlp_gelu"
+     }
+   },
+   "architectures": [
+     "MultiModalityCausalLM"
+   ],
+   "gen_aligner_config": {
+     "cls": "MlpProjector",
+     "model_type": "gen_aligner",
+     "params": {
+       "depth": 2,
+       "input_dim": 8,
+       "n_embed": 2048,
+       "projector_type": "mlp_gelu"
+     }
+   },
+   "gen_head_config": {
+     "cls": "vision_head",
+     "model_type": "gen_head",
+     "params": {
+       "image_token_embed": 2048,
+       "image_token_size": 16384,
+       "n_embed": 2048
+     }
+   },
+   "gen_vision_config": {
+     "cls": "VQ-16",
+     "model_type": "gen_vision",
+     "params": {
+       "image_token_size": 16384,
+       "n_embed": 8
+     }
+   },
+   "language_config": {
+     "hidden_size": 2048,
+     "intermediate_size": 5632,
+     "max_position_embeddings": 16384,
+     "model_type": "llama",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 24,
+     "num_key_value_heads": 16,
+     "torch_dtype": "bfloat16",
+     "vocab_size": 102400
+   },
+   "model_type": "multi_modality",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.33.1",
+   "vision_config": {
+     "cls": "CLIPVisionTower",
+     "model_type": "vision",
+     "params": {
+       "image_size": 384,
+       "model_name": "siglip_large_patch16_384",
+       "select_feature": "same",
+       "select_layer": -1
+     }
+   }
+ }
janus_pro_1b_tokenizer/preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "background_color": [
+     127,
+     127,
+     127
+   ],
+   "do_normalize": true,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_processor_type": "VLMImageProcessor",
+   "image_size": 384,
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "min_size": 14,
+   "processor_class": "VLChatProcessor",
+   "rescale_factor": 0.00392156862745098
+ }
janus_pro_1b_tokenizer/processor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "add_special_token": false,
+   "ignore_id": -100,
+   "image_tag": "<image_placeholder>",
+   "mask_prompt": true,
+   "num_image_tokens": 576,
+   "processor_class": "VLChatProcessor",
+   "sft_format": "deepseek"
+ }
janus_pro_1b_tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "additional_special_tokens": [
+     "<image_placeholder>",
+     "<patch_placeholder>",
+     "<|ref|>",
+     "<|/ref|>",
+     "<|det|>",
+     "<|/det|>",
+     "<|grounding|>",
+     "<|User|>",
+     "<|Assistant|>"
+   ],
+   "bos_token": "<|begin▁of▁sentence|>",
+   "eos_token": "<|end▁of▁sentence|>",
+   "pad_token": "<|▁pad▁|>"
+ }
janus_pro_1b_tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
janus_pro_1b_tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "bos_token": "<|begin▁of▁sentence|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|end▁of▁sentence|>",
+   "model_max_length": 16384,
+   "pad_token": null,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": null,
+   "use_default_system_prompt": true
+ }
vit_axmodel/janus_warp_vit.axmodel ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:986d4444e88e3fcece749430abff504868eba25690e3a08dcb9568f7ad5ea0ab
+ size 348623368