yongqiang committed
Commit: f4acc5b
1 Parent(s): faed000

Initial this repo
- README.md +109 -1
- assets/gen_out_img.jpg +0 -0
- embeds/codebook_entry_embedding.npy +3 -0
- embeds/codebook_entry_embedding.pt +3 -0
- embeds/gen_embed.npy +3 -0
- img_gen_onnx/gen_aligner.onnx +3 -0
- img_gen_onnx/gen_vision_model_decode_sim.onnx +3 -0
- img_gen_onnx/post_head.onnx +3 -0
- img_gen_onnx/post_norm.onnx +3 -0
- imgs/image.jpg +0 -0
- imgs/image.png +3 -0
- infer_axmodel_gen.py +276 -0
- infer_axmodel_und.py +228 -0
- janus_pro_1b_axmodel/llama_p640_l0_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l10_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l11_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l12_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l13_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l14_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l15_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l16_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l17_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l18_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l19_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l1_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l20_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l21_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l22_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l23_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l2_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l3_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l4_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l5_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l6_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l7_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l8_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_p640_l9_together.axmodel +3 -0
- janus_pro_1b_axmodel/llama_post.axmodel +3 -0
- janus_pro_1b_axmodel/model.embed_tokens.weight.npy +3 -0
- janus_pro_1b_tokenizer/config.json +66 -0
- janus_pro_1b_tokenizer/preprocessor_config.json +23 -0
- janus_pro_1b_tokenizer/processor_config.json +9 -0
- janus_pro_1b_tokenizer/special_tokens_map.json +16 -0
- janus_pro_1b_tokenizer/tokenizer.json +0 -0
- janus_pro_1b_tokenizer/tokenizer_config.json +10 -0
- vit_axmodel/janus_warp_vit.axmodel +3 -0
README.md CHANGED
@@ -9,4 +9,112 @@ pipeline_tag: visual-question-answering
tags:
- DeepSeek
- Janus-Pro-1B
---

# Janus-Pro-1B-Int8

This version of Janus-Pro-1B has been converted to run on the Axera NPU using **w8a16** quantization.
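For context, w8a16 means the linear-layer weights are stored as 8-bit integers (with per-channel scales) while activations stay in 16-bit floating point. Below is a minimal NumPy sketch of the idea; the function names and scale layout are illustrative assumptions, not the Pulsar2 on-disk format.

```python
import numpy as np

def quantize_w8(w_fp32: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix (illustrative)."""
    scale = np.abs(w_fp32).max(axis=1, keepdims=True) / 127.0      # one scale per output channel
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def linear_w8a16(x_fp16: np.ndarray, w_int8: np.ndarray, scale_fp16: np.ndarray):
    """y = x @ W^T with int8 weights dequantized on the fly; activations stay fp16."""
    w_fp16 = w_int8.astype(np.float16) * scale_fp16
    return x_fp16 @ w_fp16.T

w = np.random.randn(2048, 2048).astype(np.float32)   # toy weight matrix
x = np.random.randn(1, 2048).astype(np.float16)      # fp16 activation
w_q, s = quantize_w8(w)
print(linear_w8a16(x, w_q, s).shape)                 # (1, 2048)
```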
Compatible with Pulsar2 version: 3.3

## Convert tools links

If you are interested in model conversion, you can try to export the axmodel from the original repo:
https://huggingface.co/deepseek-ai/Janus-Pro-1B

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

## Support Platform

- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)

| Chip | Image encoder (384×384) | TTFT | w8a16 decode |
|--|--|--|--|
| AX650 | 142.682 ms | 4560.214 ms | 11.43 tokens/sec |

## How to use

Download all files from this repository to the device, for example with `huggingface_hub` as sketched below.
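A minimal sketch using the `huggingface_hub` Python client; the `repo_id` below is a placeholder for this repository's actual id on the Hub.

```python
# Hypothetical snippet: replace repo_id with this repository's actual id.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<owner>/Janus-Pro-1B",   # placeholder id
    local_dir="Janus-Pro-1B",         # directory to copy onto the AX650 host
)
```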
**If you are using an AX650 board**

```
root@ax650:/mnt/qtang/llm-test/temp/Janus-Pro-1B# tree -L 1
.
|-- README.md
|-- assets
|-- embeds
|-- img_gen_onnx
|-- imgs
|-- infer_axmodel_gen.py
|-- infer_axmodel_und.py
|-- janus_pro_1b_axmodel
|-- janus_pro_1b_tokenizer
`-- vit_axmodel
```

#### Install janus

```bash
$ git clone https://github.com/deepseek-ai/Janus
$ cd Janus
$ pip3 install -e .
```
#### Inference on an AX650 host, such as the M4N-Dock(爱芯派Pro) or the AX650N DEMO Board

**Multimodal Understanding**

input text:

```
Describe the picture
```

input image:

![](./imgs/image.png)

log information:

```bash
root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # python3 infer_axmodel_und.py --tokenizer_dir janus_pro_1b_tokenizer --axmodel_path janus_pro_1b_axmodel --vit_axmodel_path vit_axmodel/janus_warp_vit.axmodel -i ./imgs/image.png
[INFO] Available providers:  ['AxEngineExecutionProvider']
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.11.0a
vit_output.shape is (1, 576, 2048), vit feature extract done!
Init InferenceSession: 100%|██████████| 24/24 [00:04<00:00,  4.94it/s]
model load done!
prefill done!
Decoder:  62%|██████▏   | 634/1024 [00:00<00:00, 2505.28it/s]
Decoder:  72%|███████▏  | 741/1024 [00:19<00:10, 27.69it/s]
hit eos!
Decoder:  74%|███████▍  | 762/1024 [00:23<00:08, 31.84it/s]
Janus Answers: The image depicts three astronauts standing in a lush, green forest. They are wearing traditional white space suits with various patches and equipment attached. The suits have a reflective visor on their helmets, and they appear to be in a relaxed pose, with one astronaut raising his arms and the others standing or crouching. The forest is dense with tall trees and dense foliage, creating a serene and somewhat mysterious atmosphere.
```
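Under the hood, `infer_axmodel_und.py` runs the SigLIP ViT axmodel once to get 576 image embeddings and splices them into the token-embedding sequence wherever the processor marked image positions. A self-contained toy sketch of that splice (shapes shrunk for illustration; the real run uses 576 tokens of dimension 2048):

```python
import numpy as np

# Toy setup: a 10-token prompt where 6 positions are image placeholders (id -1),
# and a "ViT" that produced 6 image embeddings of dimension 4.
hidden = 4
embeds = np.random.randn(100, hidden).astype(np.float32)        # token-embedding table
input_ids = np.array([[5, 8, -1, -1, -1, -1, -1, -1, 9, 3]])     # -1 marks image slots
images_seq_mask = input_ids < 0                                   # where to splice
images_embeds = np.random.randn(1, 6, hidden).astype(np.float32)  # "ViT" output

input_ids = input_ids.copy()
input_ids[input_ids < 0] = 0                                       # any valid row; it gets overwritten
inputs_embeds = np.take(embeds, input_ids[0], axis=0)[None, ...]   # look up text embeddings
inputs_embeds[images_seq_mask] = images_embeds.reshape(-1, hidden) # splice image features in
print(inputs_embeds.shape)                                         # (1, 10, 4)
```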

**Text-to-Image Generation**

input text:

```
"A close-up high-contrast photo of Sydney Opera House sitting next to Eiffel tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue."
```

log information:

```bash
root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # python3 infer_axmodel_gen.py --tokenizer_dir janus_pro_1b_tokenizer/ --axmodel_path janus_pro_1b_axmodel/
[INFO] Available providers:  ['AxEngineExecutionProvider']
Init InferenceSession:   0%|          | 0/24 [00:00<?, ?it/s]
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.11.0a
Init InferenceSession: 100%|██████████| 24/24 [00:14<00:00,  1.68it/s]
2025-04-14 15:55:23.408 | INFO     | __main__:<module>:269 - model load done!
2025-04-14 15:55:33.104 | DEBUG    | __main__:generate:158 - prefill completed!
ImageToken:  18%|█▊        | 104/575 [00:39<02:58,  2.64it/s]
ImageToken:  45%|████▌     | 261/575 [01:39<01:58,  2.65it/s]
ImageToken:  73%|███████▎  | 419/575 [02:39<00:58,  2.66it/s]
ImageToken: 100%|██████████| 575/575 [03:38<00:00,  2.63it/s]
```
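Each of the 575 remaining image tokens is sampled with classifier-free guidance: `infer_axmodel_gen.py` runs a conditional and an unconditional copy of the prompt through the decoder and blends the two logit sets with `cfg_weight = 5` before sampling from the 16384-entry image-token codebook. A self-contained sketch of that blend-and-sample step (random logits stand in for the decoder output):

```python
import torch

cfg_weight, temperature = 5.0, 1.0
vocab = 16384                                   # image-token codebook size (gen_head_config)
logits = torch.randn(2, vocab)                  # row 0: conditional, row 1: unconditional

logit_cond, logit_uncond = logits[0::2, :], logits[1::2, :]
guided = logit_uncond + cfg_weight * (logit_cond - logit_uncond)   # CFG blend
probs = torch.softmax(guided / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)               # one image token id
print(next_token.item())
```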

output image:

![generated image](assets/gen_out_img.jpg)
assets/gen_out_img.jpg
ADDED
embeds/codebook_entry_embedding.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:97fc92031b689c685f3b36d7542eba5002cd937c63dcf33731601ef999c68613
size 524416
embeds/codebook_entry_embedding.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a67cea6583ef3da486fdfcd6cff62a2771795c7ab46b8f1000852be4f1a137c5
size 263473
embeds/gen_embed.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c70d799c8ab4c507b2916f304ba0f792e2dbf0a26100cb1242babe1f2e57d455
size 524416
img_gen_onnx/gen_aligner.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0642c360b65e5f41b1caf7637650e057b38a9aad40552a7669a76b3395653c5d
size 16860554
img_gen_onnx/gen_vision_model_decode_sim.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e27a17bc19df77059481b30582ca58e3a28bd66783fb9ca8c3022bf33e77f8bf
size 169913021
img_gen_onnx/post_head.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:92be45cb8d1c3ae5c19c906a71195f13a0755f16881a6769e8ae9b5ca85eaa8f
size 151070226
img_gen_onnx/post_norm.onnx
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:10899a40c25d7d1a879c0b9a7fe06255b5148d56d5965b6c5a8b8bb7d72feecf
size 9423
imgs/image.jpg
ADDED
imgs/image.png
ADDED
infer_axmodel_gen.py
ADDED
@@ -0,0 +1,276 @@
# REF: https://github.com/deepseek-ai/Janus
import numpy as np
import torch
from axengine import InferenceSession
from ml_dtypes import bfloat16
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForCausalLM
from tqdm import tqdm
from einops import rearrange
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.models.modeling_vlm import MultiModalityConfig
from janus.utils.io import load_pil_images
import os
import PIL.Image
from loguru import logger
import onnxruntime
import argparse


parser = argparse.ArgumentParser(description="Model configuration parameters")
parser.add_argument("--tokenizer_dir", type=str, default="Janus-Pro-1B",
                    help="Path to HuggingFace model")
parser.add_argument("--axmodel_path", type=str, default="janus_pro_1B_axmodel",
                    help="Path to save compiled axmodel of llama model")
args = parser.parse_args()


# base info
tokenizer_dir = args.tokenizer_dir
axmodel_path = args.axmodel_path

"""ONNX MODEL"""
gen_vision_model_decode = onnxruntime.InferenceSession("./img_gen_onnx/gen_vision_model_decode_sim.onnx", providers=["CPUExecutionProvider"])
gen_aligner = onnxruntime.InferenceSession("./img_gen_onnx/gen_aligner.onnx", providers=["CPUExecutionProvider"])
gen_head = onnxruntime.InferenceSession("./img_gen_onnx/post_head.onnx", providers=["CPUExecutionProvider"])
post_norm = onnxruntime.InferenceSession("./img_gen_onnx/post_norm.onnx", providers=["CPUExecutionProvider"])
"""ONNX MODEL"""

"""EMBEDINGs"""
embeds = np.load(f"{axmodel_path}/model.embed_tokens.weight.npy")
gen_embed = np.load("./embeds/gen_embed.npy")
codebook_entry_embedding = torch.load('./embeds/codebook_entry_embedding.pt', map_location=torch.device('cpu'))
"""EMBEDINGs"""


def prefill(
    cfg,
    prefill_decoder_sessins,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 1,
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # even rows hold the conditional prompt, odd rows the CFG-unconditional (padded) prompt
    tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int)
    for i in range(parallel_size*2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1: -1] = vl_chat_processor.pad_id

    inputs_embeds = embeds[tokens.numpy()]
    batch, token_len, seq_dim = inputs_embeds.shape
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int)
    prefill_len = 640
    token_ids = tokens

    ###################################################################
    lastN = 1023
    kv_dim = cfg.hidden_size // cfg.num_attention_heads * cfg.num_key_value_heads
    batch_k_caches = {}
    batch_v_caches = {}

    for bid in range(batch):
        batch_k_caches[bid] = [
            np.zeros((1, lastN, kv_dim), dtype=bfloat16)
            for _ in range(cfg.num_hidden_layers)
        ]
        batch_v_caches[bid] = [
            np.zeros((1, lastN, kv_dim), dtype=bfloat16)
            for _ in range(cfg.num_hidden_layers)
        ]
    ###################################################################
    mask = np.zeros((1, prefill_len, prefill_len)) - 65536
    for j in range(token_len):
        mask[:, j, :j + 1] = 0
    mask = mask.astype(bfloat16)

    indices = np.array(list(range(prefill_len)), np.uint32).reshape(
        (1, prefill_len)
    )
    indices[:, token_len:] = 0
    hidden_states = np.zeros((batch, token_len, cfg.hidden_size)).astype(bfloat16)

    for bid in range(batch):
        data = np.zeros((1, prefill_len, cfg.hidden_size)).astype(bfloat16)
        data[:, 0:token_len] = inputs_embeds[bid].astype(bfloat16)
        k_caches = batch_k_caches[bid]
        v_caches = batch_v_caches[bid]

        for i in range(cfg.num_hidden_layers):
            input_feed = {
                "K_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
                "V_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
                "indices": indices,
                "input": data,
                "mask": mask,
            }
            outputs = prefill_decoder_sessins[i].run(None, input_feed, shape_group=1)
            k_caches[i][:, :token_len, :] = outputs[0][:, :token_len, :]
            v_caches[i][:, :token_len, :] = outputs[1][:, :token_len, :]
            data[:, :token_len] = outputs[2][:, :token_len, :]

        ######## BATCH ###########
        hidden_states[bid] = data[:, :token_len]
        batch_k_caches[bid] = k_caches
        batch_v_caches[bid] = v_caches

    ################# NORM & GEN-HEAD ########################
    hidden_states = post_norm.run(["output"], {"input": hidden_states[:, -1:, :].astype(np.float32)})[0]
    logits = gen_head.run(["output"], {"input": hidden_states[:, -1, :]})[0]  # unlike the llama text head, this head is dedicated to image generation
    ############# POST & GET NEXT TOKEN #############
    logits = torch.from_numpy(logits)
    logit_cond = logits[0::2, :]
    logit_uncond = logits[1::2, :]
    logits = logit_uncond + cfg_weight * (logit_cond-logit_uncond)
    probs = torch.softmax(logits / temperature, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    generated_tokens[:, 0] = next_token.squeeze(dim=-1)
    next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
    ################## PREPARE_GEN_IMG_EMBEDS ##################
    gen_embed_res = np.take(gen_embed, next_token.numpy().tolist(), axis=0)
    img_embeds = gen_aligner.run(["output"], {"input": gen_embed_res})[0]
    inputs_embeds = np.expand_dims(img_embeds, axis=1)
    return inputs_embeds, token_ids, generated_tokens, batch_k_caches, batch_v_caches


@torch.inference_mode()
def generate(
    cfg,
    prefill_decoder_sessins,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 1,  # currently only a fixed value of 1 is supported
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    inputs_embeds, token_ids, generated_tokens, batch_k_caches, batch_v_caches = prefill(
        cfg, prefill_decoder_sessins, vl_chat_processor,
        prompt, temperature, parallel_size, cfg_weight, image_token_num_per_image
    )

    logger.debug("prefill completed!")
    token_len = token_ids.shape[1]

    lastN = 1023

    batch = parallel_size * 2

    mask = np.zeros((1, 1, lastN + 1), dtype=np.float32).astype(bfloat16)
    mask[:, :, :lastN] -= 65536
    mask[:, :, :token_len] = 0

    for image_token_i in tqdm(range(1, image_token_num_per_image), desc="ImageToken"):

        # decode step for the next image token
        start_indice = image_token_i + token_len - 1
        indices = np.array([start_indice], np.uint32).reshape((1, 1))
        hidden_states = np.zeros((batch, 1, cfg.hidden_size)).astype(bfloat16)  # batch, 1, seq_dim
        assert (inputs_embeds[0] == inputs_embeds[1]).all()

        for bid in range(batch):
            k_caches = batch_k_caches[bid]
            v_caches = batch_v_caches[bid]
            data = inputs_embeds[:1, ...].astype(bfloat16)

            for i in range(cfg.num_hidden_layers):
                input_feed = {
                    "K_cache": k_caches[i],
                    "V_cache": v_caches[i],
                    "indices": indices,
                    "input": data,
                    "mask": mask,
                }

                outputs = prefill_decoder_sessins[i].run(None, input_feed, shape_group=0)
                k_caches[i][:, start_indice, :] = outputs[0][:, :, :]
                v_caches[i][:, start_indice, :] = outputs[1][:, :, :]
                data = outputs[2]

            hidden_states[bid] = data
            batch_k_caches[bid] = k_caches
            batch_v_caches[bid] = v_caches

        mask[..., start_indice] = 0

        ############### NORM & GEN_HEAD #######################
        hidden_states = post_norm.run(["output"], {"input": hidden_states.astype(np.float32)})[0]
        logits = gen_head.run(["output"], {"input": hidden_states[:, -1, :]})[0]
        ############# POST & GET NEXT TOKEN #############
        logits = torch.from_numpy(logits)
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond-logit_uncond)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, image_token_i] = next_token.squeeze(dim=-1)
        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        ################## PREPARE_GEN_IMG_EMBEDS ##################
        gen_embed_res = np.take(gen_embed, next_token.numpy().tolist(), axis=0)
        img_embeds = gen_aligner.run(["output"], {"input": gen_embed_res})[0]
        inputs_embeds = np.expand_dims(img_embeds, axis=1)

    # look up codebook entries for the generated tokens and decode them to an image
    indices = generated_tokens.to(dtype=torch.int)
    shape = [parallel_size, 8, img_size//patch_size, img_size//patch_size]
    z_q = codebook_entry_embedding[indices]  # (b*h*w, c)
    z_q = z_q.reshape(shape[0], shape[2], shape[3], shape[1])
    # reshape back to match original input shape
    z_q = z_q.permute(0, 3, 1, 2)
    dec = gen_vision_model_decode.run(['image'], {'quant': z_q.to(dtype=torch.float32).numpy()})[0]
    dec = dec.transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255)
    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)

###################################################################
config: MultiModalityConfig = AutoConfig.from_pretrained(tokenizer_dir, trust_remote_code=True)
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(tokenizer_dir)
tokenizer = vl_chat_processor.tokenizer

description = "A close-up high-contrast photo of Sydney Opera House sitting next to Eiffel tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue."

conversation = [
    {
        "role": "User",
        "content": description,
    },
    {"role": "Assistant", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag
###################################################################

cfg = config.language_config

prefill_decoder_sessins = []
for i in tqdm(range(cfg.num_hidden_layers), desc="Init InferenceSession"):
    session = InferenceSession(
        f"{axmodel_path}/llama_p640_l{i}_together.axmodel"
    )
    prefill_decoder_sessins.append(session)

logger.info("model load done!")

generate(
    cfg,
    prefill_decoder_sessins,
    vl_chat_processor,
    prompt
)
infer_axmodel_und.py
ADDED
@@ -0,0 +1,228 @@
# REF: https://github.com/deepseek-ai/Janus
import numpy as np
import torch
from axengine import InferenceSession
from ml_dtypes import bfloat16
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForCausalLM
from tqdm import tqdm
from einops import rearrange
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.models.modeling_vlm import MultiModalityConfig
from janus.utils.io import load_pil_images
import argparse
import os


parser = argparse.ArgumentParser(description="Model configuration parameters")
parser.add_argument("--tokenizer_dir", type=str, default="Janus-Pro-1B",
                    help="Path to HuggingFace model")
parser.add_argument("--axmodel_path", type=str, default="janus_pro_1B_axmodel",
                    help="Path to save compiled axmodel of llama model")
parser.add_argument("-i", "--test_img_path", type=str, default="./imgs/image.png",
                    help="Test image path (supports png/jpg formats)")
parser.add_argument("--vit_axmodel_path", type=str, default="vit_axmodel/janus_warp_vit.axmodel",
                    help="Path to ViT model's axmodel")

args = parser.parse_args()

# base info
tokenizer_dir = args.tokenizer_dir
axmodel_path = args.axmodel_path
test_img_path = args.test_img_path
vit_axmodel_path = args.vit_axmodel_path
embeds = np.load(os.path.join(args.axmodel_path, "model.embed_tokens.weight.npy"))


def prepare_inputs_embeds(
    input_ids: torch.LongTensor,
    pixel_values: torch.FloatTensor,
    images_seq_mask: torch.LongTensor,
    images_emb_mask: torch.LongTensor,
    **kwargs,
):
    """
    Args:
        input_ids (torch.LongTensor): [b, T]
        pixel_values (torch.FloatTensor): [b, n_images, 3, h, w]
        images_seq_mask (torch.BoolTensor): [b, T]
        images_emb_mask (torch.BoolTensor): [b, n_images, n_image_tokens]

        assert torch.sum(images_seq_mask) == torch.sum(images_emb_mask)

    Returns:
        input_embeds (torch.Tensor): [b, T, D]
    """
    bs, n = pixel_values.shape[0:2]
    images = rearrange(pixel_values, "b n c h w -> (b n) c h w")
    # [b x n, T2, D]
    vit_session = InferenceSession(vit_axmodel_path)
    images_embeds = vit_session.run(None, {"image": pixel_values[0].numpy()})[0]  # pixel_values: [1, 1, 3, 384, 384]
    print(f"vit_output.shape is {images_embeds.shape}, vit feature extract done!")

    # [b x n, T2, D] -> [b, n x T2, D]
    images_embeds = rearrange(images_embeds, "(b n) t d -> b (n t) d", b=bs, n=n)
    # [b, n, T2] -> [b, n x T2]
    images_emb_mask = rearrange(images_emb_mask, "b n t -> b (n t)")

    # [b, T, D]
    input_ids[input_ids < 0] = 0  # ignore the image embeddings
    inputs_embeds = np.take(embeds, input_ids[0].cpu().numpy().tolist(), axis=0)[None, ...]
    inputs_embeds[images_seq_mask] = images_embeds[images_emb_mask]

    return inputs_embeds

def post_process(data, topk=1, topp=0.9, temperature=0.6):
    def top_p(l: np.ndarray, p: float) -> np.ndarray:
        index = np.argsort(l)
        res = l.copy()
        sum_p = 0
        for i in index[::-1]:
            if sum_p >= p:
                res[i] = 0
            sum_p += res[i]
        return res / sum_p

    def softmax(l: np.ndarray) -> np.ndarray:
        l_max = l - l.max()
        l_exp = np.exp(l_max)
        res = l_exp / np.sum(l_exp)
        return res.astype(np.float64)

    r = data.astype(np.float32)
    r = r.flatten()
    candidate_index = np.argpartition(r, -topk)[-topk:]
    candidate_value = r[candidate_index]
    candidate_value /= temperature
    candidate_soft = softmax(candidate_value)
    candidate_soft = top_p(candidate_soft, topp)
    candidate_soft = candidate_soft.astype(np.float64) / candidate_soft.sum()
    pos = np.random.multinomial(1, candidate_soft).argmax()
    next_token = candidate_index[pos]
    return next_token, candidate_index, candidate_soft

config: MultiModalityConfig = AutoConfig.from_pretrained(tokenizer_dir, trust_remote_code=True)
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(tokenizer_dir)
tokenizer = vl_chat_processor.tokenizer

# question = "请尝试理解这幅图中的内容."  # alternative Chinese prompt: "Please try to understand the content of this picture."
question = "Please describe the picture."
conversation = [
    {
        "role": "User",
        "content": f"<image_placeholder>\n{question}",
        "images": [test_img_path],
    },
    {"role": "Assistant", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
)

input_embedding = prepare_inputs_embeds(**prepare_inputs)
token_ids = prepare_inputs['input_ids'].squeeze().numpy().tolist()
prefill_data = input_embedding
prefill_data = prefill_data.astype(bfloat16)
token_len = len(token_ids)

lastN = 1023
cfg = config.language_config

kv_dim = cfg.hidden_size // cfg.num_attention_heads * cfg.num_key_value_heads
k_caches = [
    np.zeros((1, lastN, kv_dim), dtype=bfloat16)
    for _ in range(cfg.num_hidden_layers)
]
v_caches = [
    np.zeros((1, lastN, kv_dim), dtype=bfloat16)
    for _ in range(cfg.num_hidden_layers)
]

prefill_decoder_sessins = []
for i in tqdm(range(cfg.num_hidden_layers), desc="Init InferenceSession"):
    session = InferenceSession(
        f"{axmodel_path}/llama_p640_l{i}_together.axmodel"
    )
    prefill_decoder_sessins.append(session)
post_process_session = InferenceSession(
    f"{axmodel_path}/llama_post.axmodel"
)
print("model load done!")

"""
prefill
"""
prefill_len = 640

if prefill_len > 0:
    indices = np.array(list(range(prefill_len)), np.uint32).reshape(
        (1, prefill_len)
    )
    indices[:, token_len:] = 0
    mask = np.zeros((1, prefill_len, prefill_len)) - 65536
    data = np.zeros((1, prefill_len, cfg.hidden_size)).astype(bfloat16)
    data[:, 0:token_len] = prefill_data
    for i, t in enumerate(token_ids):
        mask[:, i, : i + 1] = 0
    mask = mask.astype(bfloat16)
    for i in range(cfg.num_hidden_layers):
        input_feed = {
            "K_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
            "V_cache": np.zeros((1, 1, cfg.hidden_size), dtype=bfloat16),
            "indices": indices,
            "input": data,
            "mask": mask,
        }
        outputs = prefill_decoder_sessins[i].run(None, input_feed, shape_group=1)
        k_caches[i][:, :token_len, :] = outputs[0][:, :token_len, :]
        v_caches[i][:, :token_len, :] = outputs[1][:, :token_len, :]
        data[:, :token_len] = outputs[2][:, :token_len, :]

    post_out = post_process_session.run(None, {"input": data[:, token_len - 1, :][None, ...]})[0]
    next_token, posssible_tokens, possible_soft = post_process(post_out, topk=1)
    posibles = [tokenizer.decode([t]) for t in posssible_tokens]
    posible_soft = [str((t, s)) for t, s in zip(posibles, possible_soft)]
    token_ids.append(next_token)
    print("prefill done!")

"""
decode
"""
mask = np.zeros((1, 1, lastN + 1), dtype=np.float32).astype(bfloat16)
mask[:, :, :lastN] -= 65536
mask[:, :, :token_len] = 0
for start_indice in tqdm(range(lastN + 1), desc="Decoder"):  # lastN + 1
    if prefill_len > 0 and start_indice < token_len:
        continue
    next_token = token_ids[start_indice]
    indices = np.array([start_indice], np.uint32).reshape((1, 1))
    data = embeds[next_token, :].reshape((1, 1, cfg.hidden_size)).astype(bfloat16)

    for i in range(cfg.num_hidden_layers):
        input_feed = {
            "K_cache": k_caches[i],
            "V_cache": v_caches[i],
            "indices": indices,
            "input": data,
            "mask": mask,
        }
        outputs = prefill_decoder_sessins[i].run(None, input_feed, shape_group=0)
        k_caches[i][:, start_indice, :] = outputs[0][:, :, :]
        v_caches[i][:, start_indice, :] = outputs[1][:, :, :]
        data = outputs[2]

    mask[..., start_indice] = 0
    if start_indice < token_len - 1:
        pass
    else:
        post_out = post_process_session.run(None, {"input": data})[0]
        next_token, posssible_tokens, possible_soft = post_process(post_out)
        token_ids.append(next_token)
        if next_token == tokenizer.eos_token_id:
            print("hit eos!")
            break
print("Janus Answers: ", tokenizer.decode(token_ids[token_len:], skip_special_tokens=True))
janus_pro_1b_axmodel/llama_p640_l0_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:36e476b67cc13f0fe6701b7d666e9e316ee03d38998ba633964f7f96e92b8db5
size 58843532
janus_pro_1b_axmodel/llama_p640_l10_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ddf79ff6a43ead47fda4308fd89e0468a3b19ad8e5f5a912247a9de160c34954
size 58844556
janus_pro_1b_axmodel/llama_p640_l11_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5f108e870863238890fb579c7bb991abd0a8b4f695ff2b5d483c6e16a2b0433c
size 58844684
janus_pro_1b_axmodel/llama_p640_l12_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5da5feb965e8fbb678a144ca26e5ff9d520d80c18823563b1cb382980bcabe1b
size 58844236
janus_pro_1b_axmodel/llama_p640_l13_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7358c891f13c998f87a1f1d85f3357fffebfac4d5bb67e15868a0a93113108a9
size 58844620
janus_pro_1b_axmodel/llama_p640_l14_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19e2d24aa96773a866043bfefc1b815f04964c9d27b18637401de306d8bb5595
size 58844140
janus_pro_1b_axmodel/llama_p640_l15_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:80bf1befea66e3f42d9cd77a92b35cae27683f50d34becb7095ba0f035c55cb9
size 58844268
janus_pro_1b_axmodel/llama_p640_l16_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2283ea05dabba501779dc79ffbf5ce6e0ab18ad157a3aa2a3e488d888082b342
size 58844396
janus_pro_1b_axmodel/llama_p640_l17_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:723136b342e5197d2e508510c7f247cdae853211e9d8710438cf2fe09712ec1a
size 58844076
janus_pro_1b_axmodel/llama_p640_l18_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:660e661f19ccf22c91034ed4d3a1869c5963f098e4e6509193a1aca6fcb24401
size 58844300
janus_pro_1b_axmodel/llama_p640_l19_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:deb31364fa508c5526c70915c38f8ccb052cd84d6c79893bf46590b37cce25a2
size 58844364
janus_pro_1b_axmodel/llama_p640_l1_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3828219df2633babf673a3fbb20a5d8d8dde602dae5a5ed35a76349c0b7a2dac
size 58844460
janus_pro_1b_axmodel/llama_p640_l20_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:74dddcc432823b8257a52712f4e5cdb53391291b6b19e8f277c96550f8e118a7
size 58844236
janus_pro_1b_axmodel/llama_p640_l21_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:06961849a4c6a31fa8454abd61e80fed76e4c4a050cbd1b7d16c638c6599d529
size 58844620
janus_pro_1b_axmodel/llama_p640_l22_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b80d1205cd37f7ff88cf522385910b1332e4a6a9c4b1419e03099c12884e718c
size 58844108
janus_pro_1b_axmodel/llama_p640_l23_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b8f8387fcd1a8030275555828e8335fe7de694776f847d95bb048f889b880bb
size 58843980
janus_pro_1b_axmodel/llama_p640_l2_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5ff8bd9786537b7cf155ebd64459de2fc643a101a5469a071e0758604bb14f66
size 58844492
janus_pro_1b_axmodel/llama_p640_l3_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:94cf4e816de0f8f78a6ec18917302da650673f1a9e6a907d1cab3875e2eb15ab
size 58844556
janus_pro_1b_axmodel/llama_p640_l4_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:97c497026a610c17ca80da4f828ce71053ab71bdaadd356cc7ddbfb2a4ef5c03
size 58844108
janus_pro_1b_axmodel/llama_p640_l5_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:73e0ea653bf7410aab2a2e7e239cb57f71efdad47a78d1fab57e127f327de6fb
size 58844300
janus_pro_1b_axmodel/llama_p640_l6_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8d6d0cc26433000a91adccd97869916bfcebff975c94a59865b8e0343b0cfee0
size 58844460
janus_pro_1b_axmodel/llama_p640_l7_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:01061af690dcf356ae74c2b2b927c1b06ccfc6e594a9c67d7cb3fdba0aca2508
size 58843980
janus_pro_1b_axmodel/llama_p640_l8_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9ce71289afc108c8e2f304d764f14f9efa13ea6342bc64e0484aba78db25e64f
size 58844364
janus_pro_1b_axmodel/llama_p640_l9_together.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3f4fa46650e6e2c88bc8a7cb0dd39b7fbd08652e99dac3452e437517788e69b
size 58844364
janus_pro_1b_axmodel/llama_post.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f8950aede1718e00a9f0489c90bf76a8639cd43781ae6c0b49978a3b7202513e
size 229046979
janus_pro_1b_axmodel/model.embed_tokens.weight.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:282e7088dbdb59b03e948edd97d3768f9d5daecbb7e7cb690147ffca25948ce1
size 838860928
janus_pro_1b_tokenizer/config.json
ADDED
@@ -0,0 +1,66 @@
{
  "aligner_config": {
    "cls": "MlpProjector",
    "model_type": "aligner",
    "params": {
      "depth": 2,
      "input_dim": 1024,
      "n_embed": 2048,
      "projector_type": "mlp_gelu"
    }
  },
  "architectures": [
    "MultiModalityCausalLM"
  ],
  "gen_aligner_config": {
    "cls": "MlpProjector",
    "model_type": "gen_aligner",
    "params": {
      "depth": 2,
      "input_dim": 8,
      "n_embed": 2048,
      "projector_type": "mlp_gelu"
    }
  },
  "gen_head_config": {
    "cls": "vision_head",
    "model_type": "gen_head",
    "params": {
      "image_token_embed": 2048,
      "image_token_size": 16384,
      "n_embed": 2048
    }
  },
  "gen_vision_config": {
    "cls": "VQ-16",
    "model_type": "gen_vision",
    "params": {
      "image_token_size": 16384,
      "n_embed": 8
    }
  },
  "language_config": {
    "hidden_size": 2048,
    "intermediate_size": 5632,
    "max_position_embeddings": 16384,
    "model_type": "llama",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 16,
    "torch_dtype": "bfloat16",
    "vocab_size": 102400
  },
  "model_type": "multi_modality",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.33.1",
  "vision_config": {
    "cls": "CLIPVisionTower",
    "model_type": "vision",
    "params": {
      "image_size": 384,
      "model_name": "siglip_large_patch16_384",
      "select_feature": "same",
      "select_layer": -1
    }
  }
}
janus_pro_1b_tokenizer/preprocessor_config.json
ADDED
@@ -0,0 +1,23 @@
{
  "background_color": [
    127,
    127,
    127
  ],
  "do_normalize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "VLMImageProcessor",
  "image_size": 384,
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "min_size": 14,
  "processor_class": "VLChatProcessor",
  "rescale_factor": 0.00392156862745098
}
janus_pro_1b_tokenizer/processor_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "add_special_token": false,
  "ignore_id": -100,
  "image_tag": "<image_placeholder>",
  "mask_prompt": true,
  "num_image_tokens": 576,
  "processor_class": "VLChatProcessor",
  "sft_format": "deepseek"
}
janus_pro_1b_tokenizer/special_tokens_map.json
ADDED
@@ -0,0 +1,16 @@
{
  "additional_special_tokens": [
    "<image_placeholder>",
    "<patch_placeholder>",
    "<|ref|>",
    "<|/ref|>",
    "<|det|>",
    "<|/det|>",
    "<|grounding|>",
    "<|User|>",
    "<|Assistant|>"
  ],
  "bos_token": "<|begin▁of▁sentence|>",
  "eos_token": "<|end▁of▁sentence|>",
  "pad_token": "<|▁pad▁|>"
}
janus_pro_1b_tokenizer/tokenizer.json
ADDED
The diff for this file is too large to render.
janus_pro_1b_tokenizer/tokenizer_config.json
ADDED
@@ -0,0 +1,10 @@
{
  "bos_token": "<|begin▁of▁sentence|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|end▁of▁sentence|>",
  "model_max_length": 16384,
  "pad_token": null,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": null,
  "use_default_system_prompt": true
}
vit_axmodel/janus_warp_vit.axmodel
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:986d4444e88e3fcece749430abff504868eba25690e3a08dcb9568f7ad5ea0ab
size 348623368