littlebird13 committed
Commit d50085f · verified · 1 Parent(s): e55ad91

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.msc ADDED
Binary file (1.04 kB).
.mv ADDED
@@ -0,0 +1 @@
1
+ Revision:master,CreatedAt:1748489423
README.md CHANGED
@@ -1,3 +1,180 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # Qwen3-Embedding-4B
2
+
3
+ <p align="center">
4
+ <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png" width="400"/>
5
+ </p>
6
+
7
+ ## Highlights
8
+
9
+ The Qwen3 Embedding model series is the latest proprietary model family from Qwen, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in three sizes (0.6B, 4B, and 8B). The series inherits the multilingual capabilities, long-text understanding, and reasoning skills of its foundation models, and it delivers significant advances across multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.
10
+
11
+ **Exceptional Versatility**: The embedding model achieves state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks **No.1** on the MTEB multilingual leaderboard (as of May 26, 2025, score **70.58**), while the reranking model excels in various text retrieval scenarios.
12
+
13
+ **Comprehensive Flexibility**: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.
14
+
15
+ **Multilingual Capability**: The Qwen3 Embedding series supports over 100 languages, including various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
16
+
17
+ ## Model Overview
18
+
19
+ **Qwen3-Embedding-4B** has the following features:
20
+
21
+ - Model Type: Text Embedding
22
+ - Supported Languages: 100+ Languages
23
+ - Number of Parameters: 4B
24
+ - Context Length: 32k
25
+ - Embedding Dimension: Up to 2560, supports user-defined output dimensions ranging from 32 to 2560
26
+
27
+ For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3-Embedding/) and [GitHub](https://github.com/QwenLM/Qwen3-Embedding).
28
+
29
+ ## Qwen3 Embedding Series Model List
30
+
31
+ | Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruct Aware |
32
+ |------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
33
+ | Text Embedding | [Qwen3-Embedding-0.6B](https://modelscope.cn/models/tongyi/Qwen3-Embedding-0.6B) | 0.6B | 28 | 32K | 1024 | Yes | Yes |
34
+ | Text Embedding | [Qwen3-Embedding-4B](https://modelscope.cn/models/tongyi/Qwen3-Embedding-4B) | 4B | 36 | 32K | 2560 | Yes | Yes |
35
+ | Text Embedding | [Qwen3-Embedding-8B](https://modelscope.cn/models/tongyi/Qwen3-Embedding-8B) | 8B | 36 | 32K | 4096 | Yes | Yes |
36
+ | Text Reranking | [Qwen3-Reranker-0.6B](https://modelscope.cn/models/tongyi/Qwen3-Reranker-0.6B) | 0.6B | 28 | 32K | - | - | Yes |
37
+ | Text Reranking | [Qwen3-Reranker-4B](https://modelscope.cn/models/tongyi/Qwen3-Reranker-4B) | 4B | 36 | 32K | - | - | Yes |
38
+ | Text Reranking | [Qwen3-Reranker-8B](https://modelscope.cn/models/tongyi/Qwen3-Reranker-8B) | 8B | 36 | 32K | - | - | Yes |
39
+
40
+ > **Note**: `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding. `Instruct Aware` indicates whether the embedding or reranking model supports customizing the input instruction for different tasks.
41
+
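The note on `MRL Support` can be illustrated with a small sketch. It assumes the common Matryoshka practice of keeping the leading components of the full vector and re-normalizing; the README does not spell out the exact procedure, and `truncate_embedding` is a hypothetical helper name, not part of any library.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result.

    Assumption: custom output dimensions are obtained by prefix truncation,
    as is usual for Matryoshka-style (MRL) embeddings.
    """
    if not 32 <= dim <= len(vec):
        raise ValueError("dim must be between 32 and the full embedding size")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Stand-in for a 2560-dimensional embedding produced by the model.
full = [float(i % 7 + 1) for i in range(2560)]
small = truncate_embedding(full, 256)
print(len(small), round(sum(x * x for x in small), 6))
```

Re-normalizing after truncation keeps dot products interpretable as cosine similarities, which matches how the usage example in this README scores query-document pairs.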
42
+ ## Usage
43
+
44
+ With Transformers versions earlier than 4.51.0, you may encounter the following error:
45
+ ```
46
+ KeyError: 'qwen3'
47
+ ```
48
+
49
+ ### Transformers Usage
50
+
51
+ ```python
+ # Requires transformers>=4.51.0
+ import torch
+ import torch.nn.functional as F
+
+ from torch import Tensor
+ from modelscope import AutoTokenizer, AutoModel
+
+
+ def last_token_pool(last_hidden_states: Tensor,
+                     attention_mask: Tensor) -> Tensor:
+     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
+     if left_padding:
+         return last_hidden_states[:, -1]
+     else:
+         sequence_lengths = attention_mask.sum(dim=1) - 1
+         batch_size = last_hidden_states.shape[0]
+         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
+
+
+ def get_detailed_instruct(task_description: str, query: str) -> str:
+     return f'Instruct: {task_description}\nQuery:{query}'
+
+ def tokenize(tokenizer, input_texts, eod_id, max_length):
+     batch_dict = tokenizer(input_texts, padding=False, truncation=True, max_length=max_length-2)
+     for seq, att in zip(batch_dict["input_ids"], batch_dict["attention_mask"]):
+         seq.append(eod_id)
+         att.append(1)
+     batch_dict = tokenizer.pad(batch_dict, padding=True, return_tensors="pt")
+     return batch_dict
+
+ # Each query must come with a one-sentence instruction that describes the task
+ task = 'Given a web search query, retrieve relevant passages that answer the query'
+
+ queries = [
+     get_detailed_instruct(task, 'What is the capital of China?'),
+     get_detailed_instruct(task, 'Explain gravity')
+ ]
+ # No need to add instruction for retrieval documents
+ documents = [
+     "The capital of China is Beijing.",
+     "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
+ ]
+ input_texts = queries + documents
+
+ tokenizer = AutoTokenizer.from_pretrained('tongyi/Qwen3-Embedding-4B', padding_side='left')
+ model = AutoModel.from_pretrained('tongyi/Qwen3-Embedding-4B')
+
+ # We recommend enabling flash_attention_2 for better acceleration and memory saving.
+ # model = AutoModel.from_pretrained('tongyi/Qwen3-Embedding-4B', attn_implementation="flash_attention_2", torch_dtype=torch.float16).cuda()
+
+ eod_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
+ max_length = 8192
+
+ # Tokenize the input texts
+ batch_dict = tokenize(tokenizer, input_texts, eod_id, max_length)
+ batch_dict.to(model.device)
+ outputs = model(**batch_dict)
+ embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
+
+ # normalize embeddings
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ scores = (embeddings[:2] @ embeddings[2:].T)
+ print(scores.tolist())
+ ```
116
+ 📌 **Tip**: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages. Our tests have shown that in most retrieval scenarios, omitting the `instruct` on the query side can reduce retrieval performance by approximately 1% to 5%.
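As a minimal sketch of the tip above, the snippet below reuses the `get_detailed_instruct` format from the usage example with a different task description. The task wording, query, and document are hypothetical and only illustrate the shape of the inputs; they are not taken from the model's training or evaluation setup.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Same format string as in the README usage example.
    return f'Instruct: {task_description}\nQuery:{query}'


# Hypothetical task description for a code-search scenario.
code_task = 'Given a natural-language question, retrieve code snippets that answer it'

# Only the query side carries the instruction.
queries = [get_detailed_instruct(code_task, q)
           for q in ['How do I reverse a list in Python?']]

# Documents are embedded as-is, without an instruction prefix.
documents = ['my_list[::-1] returns a reversed copy of my_list.']

print(queries[0])
```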
117
+
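The pooling and scoring steps of the usage example can be traced on toy data. The sketch below is a pure-Python re-statement of the same logic (nested lists instead of tensors) so it runs without `torch`; the values are made up for illustration.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last real token of each sequence.

    hidden_states: [batch][seq][dim] nested lists
    attention_mask: [batch][seq] lists of 0/1
    """
    # If every row ends in a real token, the batch is left-padded.
    left_padded = all(mask[-1] == 1 for mask in attention_mask)
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        index = len(mask) - 1 if left_padded else sum(mask) - 1
        pooled.append(states[index])
    return pooled


def cosine(a, b):
    """Cosine similarity, i.e. the dot product of L2-normalized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


hidden = [
    [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]],
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],
]
right_padded_mask = [[1, 1, 1], [1, 1, 0]]
left_padded_mask = [[1, 1, 1], [0, 1, 1]]

right = last_token_pool(hidden, right_padded_mask)
left = last_token_pool(hidden, left_padded_mask)
print(right, left, cosine(right[0], right[1]))
```

With left padding the last position is always a real token, which is why the usage example loads the tokenizer with `padding_side='left'` and can take the fast path.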
118
+ ## Evaluation
119
+
120
+ ### MTEB (Multilingual)
121
+
122
+ | Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
123
+ |----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
124
+ | NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10|
125
+ | GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33|
126
+ | BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12|
127
+ | multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81|
128
+ | gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61|
129
+ | gte-Qwen2-7b-Instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98|
130
+ | text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68|
131
+ | Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80|
132
+ | gemini-embedding-exp-03-07 | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | **29.16** | 83.63 | 65.58 | 67.71 | 79.40|
133
+ | **Qwen3-Embedding-0.6B** | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17|
134
+ | **Qwen3-Embedding-4B** | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | **11.56** | 26.77 | 85.05 | 65.08 | 69.60 | 80.86|
135
+ | **Qwen3-Embedding-8B** | 8B | **70.58** | **61.69** | **80.89** | **74.00** | **57.65** | 10.06 | 28.66 | **86.40** | **65.63** | **70.88** | **81.08** |
136
+
137
+ > **Note**: For compared models, the scores are retrieved from MTEB online [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) on May 24th, 2025.
138
+
139
+ ### MTEB (Eng v2)
140
+
141
+ | MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
142
+ |--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
143
+ | multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
144
+ | NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
145
+ | GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
146
+ | gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
147
+ | stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
148
+ | gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
149
+ | gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | **59.39** | **87.7** | 48.59 | 64.35 | 85.29 | **38.28** |
150
+ | **Qwen3-Embedding-0.6B** | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
151
+ | **Qwen3-Embedding-4B** | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | **88.72** | 34.39 |
152
+ | **Qwen3-Embedding-8B** | 8B | **75.22** | **68.71** | **90.43** | 58.57 | 87.52 | **51.56** | **69.44** | 88.58 | 34.83 |
153
+
154
+ ### C-MTEB (MTEB Chinese)
155
+
156
+ | C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
157
+ |------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
158
+ | multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
159
+ | bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
160
+ | gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
161
+ | gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
162
+ | ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | **85.98** | **72.86** | 76.97 | **63.92** |
163
+ | **Qwen3-Embedding-0.6B** | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
164
+ | **Qwen3-Embedding-4B** | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
165
+ | **Qwen3-Embedding-8B** | 8B | **73.84** | **75.00** | **76.97** | **80.08** | 84.23 | 66.99 | **78.21** | 63.53 |
166
+
167
+
168
+ ## Citation
169
+
170
+ If you find our work helpful, please consider citing it.
171
+
172
+ ```
173
+ @misc{qwen3-embedding,
174
+ title = {Qwen3-Embedding},
175
+ url = {https://qwenlm.github.io/blog/qwen3/},
176
+ author = {Qwen Team},
177
+ month = {May},
178
+ year = {2025}
179
+ }
180
+ ```
added_tokens.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen3Model"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 151643,
8
+ "eos_token_id": 151645,
9
+ "head_dim": 128,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 2560,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 9728,
14
+ "max_position_embeddings": 40960,
15
+ "max_window_layers": 36,
16
+ "model_type": "qwen3",
17
+ "num_attention_heads": 32,
18
+ "num_hidden_layers": 36,
19
+ "num_key_value_heads": 8,
20
+ "rms_norm_eps": 1e-06,
21
+ "rope_scaling": null,
22
+ "rope_theta": 1000000,
23
+ "sliding_window": null,
24
+ "tie_word_embeddings": true,
25
+ "torch_dtype": "bfloat16",
26
+ "transformers_version": "4.51.2",
27
+ "use_cache": true,
28
+ "use_sliding_window": false,
29
+ "vocab_size": 151665
30
+ }
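The shapes in `config.json` can be cross-checked against the `total_size` reported in `model.safetensors.index.json` (8043548672 bytes). The sketch below is a back-of-the-envelope count under stated assumptions: no bias terms (`attention_bias` is false), no separate output head (`Qwen3Model` architecture), bfloat16 storage at 2 bytes per weight, and one final norm vector of size `hidden_size` that is not visible in the truncated weight map shown in this diff. `count_parameters` is a hypothetical helper.

```python
# Subset of config.json relevant to the weight shapes.
config = {
    "vocab_size": 151665,
    "hidden_size": 2560,
    "intermediate_size": 9728,
    "num_hidden_layers": 36,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
}


def count_parameters(cfg):
    h = cfg["hidden_size"]
    q_out = cfg["num_attention_heads"] * cfg["head_dim"]   # 32 * 128 = 4096
    kv_out = cfg["num_key_value_heads"] * cfg["head_dim"]  # 8 * 128 = 1024 (grouped-query attention)

    attention = h * q_out + 2 * h * kv_out + q_out * h     # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * h * cfg["intermediate_size"]                 # gate_proj, up_proj, down_proj
    norms = 2 * h + 2 * cfg["head_dim"]                    # two layer norms, q_norm, k_norm
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    final_norm = h  # assumption: a single final norm weight of size hidden_size
    return embeddings + cfg["num_hidden_layers"] * per_layer + final_norm


params = count_parameters(config)
total_bytes = params * 2  # bfloat16: 2 bytes per parameter
print(params, total_bytes)
```

Note that the query projection width (4096) differs from `hidden_size` (2560) because `head_dim` is set explicitly rather than derived as `hidden_size / num_attention_heads`.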
configuration.json ADDED
@@ -0,0 +1 @@
1
+ {"framework":"Pytorch","task":"sentence-embedding"}
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "bos_token_id": 151643,
3
+ "eos_token_id": 151643,
4
+ "max_new_tokens": 2048,
5
+ "transformers_version": "4.51.3"
6
+ }
merges.txt ADDED
The diff for this file is too large to render.
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e70bfe3c970523fb7ef4eddffed2254ce3f1e7150c3de2af4342de129dd756f8
3
+ size 4965826464
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ed1b87c8e9eb7e535a1a155e4fd00d9f4dba80e58a6db48a4c9f82cede7079c1
3
+ size 3077765624
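The two `.safetensors` entries above are Git LFS pointer files rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(' ', 1) for line in text.strip().splitlines())
    algo, digest = fields['oid'].split(':', 1)
    return {
        'version': fields['version'],
        'hash_algo': algo,
        'digest': digest,
        'size': int(fields['size']),
    }


# Pointer content copied from the second shard in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ed1b87c8e9eb7e535a1a155e4fd00d9f4dba80e58a6db48a4c9f82cede7079c1
size 3077765624
"""

info = parse_lfs_pointer(pointer)
print(info['hash_algo'], info['size'])
```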
model.safetensors.index.json ADDED
@@ -0,0 +1,405 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 8043548672
4
+ },
5
+ "weight_map": {
6
+ "embed_tokens.weight": "model-00001-of-00002.safetensors",
7
+ "layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
8
+ "layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
9
+ "layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
10
+ "layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
11
+ "layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
12
+ "layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
13
+ "layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
14
+ "layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
15
+ "layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
16
+ "layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
17
+ "layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
18
+ "layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
19
+ "layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
20
+ "layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
21
+ "layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
22
+ "layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
23
+ "layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
24
+ "layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
25
+ "layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
26
+ "layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
27
+ "layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
28
+ "layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
29
+ "layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
30
+ "layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
31
+ "layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
32
+ "layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
33
+ "layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
34
+ "layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
35
+ "layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
36
+ "layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
37
+ "layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
38
+ "layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
39
+ "layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
40
+ "layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
41
+ "layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
42
+ "layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
43
+ "layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
44
+ "layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
46
+ "layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
47
+ "layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
48
+ "layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
49
+ "layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
50
+ "layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
51
+ "layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
52
+ "layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
53
+ "layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
54
+ "layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
55
+ "layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
56
+ "layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
57
+ "layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
58
+ "layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
59
+ "layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
60
+ "layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
61
+ "layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
62
+ "layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
63
+ "layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
64
+ "layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
65
+ "layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
66
+ "layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
67
+ "layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
68
+ "layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
69
+ "layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
70
+ "layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
71
+ "layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
72
+ "layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
73
+ "layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
74
+ "layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
75
+ "layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
76
+ "layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
77
+ "layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
78
+ "layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
79
+ "layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
80
+ "layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
81
+ "layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
82
+ "layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
83
+ "layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
84
+ "layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
85
+ "layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
86
+ "layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
87
+ "layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
88
+ "layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
89
+ "layers.15.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
90
+ "layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
91
+ "layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
92
+ "layers.15.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
93
+ "layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
94
+ "layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
95
+ "layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
96
+ "layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
97
+ "layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
98
+ "layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
99
+ "layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
100
+ "layers.16.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
101
+ "layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
102
+ "layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
103
+ "layers.16.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
104
+ "layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
105
+ "layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
106
+ "layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
107
+ "layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
108
+ "layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
109
+ "layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
110
+ "layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
111
+ "layers.17.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
112
+ "layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
113
+ "layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
114
+ "layers.17.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
115
+ "layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
116
+ "layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
117
+ "layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
118
+ "layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
119
+ "layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
120
+ "layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
121
+ "layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
122
+ "layers.18.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
123
+ "layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
124
+ "layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
125
+ "layers.18.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
126
+ "layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
127
+ "layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
128
+ "layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
129
+ "layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
130
+ "layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
131
+ "layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
132
+ "layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
133
+ "layers.19.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
134
+ "layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
135
+ "layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
136
+ "layers.19.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
137
+ "layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
138
+ "layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
139
+ "layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
140
+ "layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
141
+ "layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
142
+ "layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
143
+ "layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
144
+ "layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
145
+ "layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
146
+ "layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
147
+ "layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
148
+ "layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
149
+ "layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
150
+ "layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
151
+ "layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
152
+ "layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
153
+ "layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
154
+ "layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
155
+ "layers.20.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
156
+ "layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
157
+ "layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
158
+ "layers.20.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
159
+ "layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
160
+ "layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
161
+ "layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
162
+ "layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
163
+ "layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
164
+ "layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
165
+ "layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
166
+ "layers.21.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
167
+ "layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
168
+ "layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
169
+ "layers.21.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
170
+ "layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
171
+ "layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
172
+ "layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
173
+ "layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
174
+ "layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
175
+ "layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
176
+ "layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
177
+ "layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
178
+ "layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
179
+ "layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
180
+ "layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
181
+ "layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
182
+ "layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
183
+ "layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
184
+ "layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
185
+ "layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
186
+ "layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
187
+ "layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
188
+ "layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
189
+ "layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
190
+ "layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
191
+ "layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
192
+ "layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
193
+ "layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
194
+ "layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
195
+ "layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
196
+ "layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
197
+ "layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
198
+ "layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
199
+ "layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
200
+ "layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
201
+ "layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
202
+ "layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
203
+ "layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
204
+ "layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
205
+ "layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
206
+ "layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
207
+ "layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
208
+ "layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
209
+ "layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
210
+ "layers.25.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
211
+ "layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
212
+ "layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
213
+ "layers.25.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
214
+ "layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
215
+ "layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
216
+ "layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
217
+ "layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
218
+ "layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
219
+ "layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
220
+ "layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
221
+ "layers.26.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
222
+ "layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
223
+ "layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
224
+ "layers.26.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
225
+ "layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
226
+ "layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
227
+ "layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
228
+ "layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
229
+ "layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
230
+ "layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
231
+ "layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
232
+ "layers.27.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
233
+ "layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
234
+ "layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
235
+ "layers.27.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
236
+ "layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
237
+ "layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
238
+ "layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
239
+ "layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
240
+ "layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
241
+ "layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
242
+ "layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
243
+ "layers.28.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
244
+ "layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
245
+ "layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
246
+ "layers.28.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
247
+ "layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
248
+ "layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
249
+ "layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
250
+ "layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
251
+ "layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
252
+ "layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
253
+ "layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
254
+ "layers.29.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
255
+ "layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
256
+ "layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
257
+ "layers.29.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
258
+ "layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
259
+ "layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
260
+ "layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
261
+ "layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
262
+ "layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
263
+ "layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
264
+ "layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
265
+ "layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
266
+ "layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
267
+ "layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
268
+ "layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
269
+ "layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
270
+ "layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
271
+ "layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
272
+ "layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
273
+ "layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
274
+ "layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
275
+ "layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
276
+ "layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
277
+ "layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
278
+ "layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
279
+ "layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
280
+ "layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
281
+ "layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
282
+ "layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
283
+ "layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
284
+ "layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
285
+ "layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
286
+ "layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
287
+ "layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
288
+ "layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
289
+ "layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
290
+ "layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
291
+ "layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
292
+ "layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
293
+ "layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
294
+ "layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
295
+ "layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
296
+ "layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
297
+ "layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
298
+ "layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
299
+ "layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
300
+ "layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
301
+ "layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
302
+ "layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
303
+ "layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
304
+ "layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
305
+ "layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
306
+ "layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
307
+ "layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
308
+ "layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
309
+ "layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
310
+ "layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
311
+ "layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
312
+ "layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
313
+ "layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
314
+ "layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
315
+ "layers.34.input_layernorm.weight": "model-00002-of-00002.safetensors",
316
+ "layers.34.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
317
+ "layers.34.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
318
+ "layers.34.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
319
+ "layers.34.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
320
+ "layers.34.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
321
+ "layers.34.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
322
+ "layers.34.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
323
+ "layers.34.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
324
+ "layers.34.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
325
+ "layers.34.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
326
+ "layers.35.input_layernorm.weight": "model-00002-of-00002.safetensors",
327
+ "layers.35.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
328
+ "layers.35.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
329
+ "layers.35.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
330
+ "layers.35.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
331
+ "layers.35.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
332
+ "layers.35.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
333
+ "layers.35.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
334
+ "layers.35.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
335
+ "layers.35.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
336
+ "layers.35.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
337
+ "layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
338
+ "layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
339
+ "layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
340
+ "layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
341
+ "layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
342
+ "layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
343
+ "layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
344
+ "layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
345
+ "layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
346
+ "layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
347
+ "layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
348
+ "layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
349
+ "layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
350
+ "layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
351
+ "layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
352
+ "layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
353
+ "layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
354
+ "layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
355
+ "layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
356
+ "layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
357
+ "layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
358
+ "layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
359
+ "layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
360
+ "layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
361
+ "layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
362
+ "layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
363
+ "layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
364
+ "layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
365
+ "layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
366
+ "layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
367
+ "layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
368
+ "layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
369
+ "layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
370
+ "layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
371
+ "layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
372
+ "layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
373
+ "layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
374
+ "layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
375
+ "layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
376
+ "layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
377
+ "layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
378
+ "layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
379
+ "layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
380
+ "layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
381
+ "layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
382
+ "layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
383
+ "layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
384
+ "layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
385
+ "layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
386
+ "layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
387
+ "layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
388
+ "layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
389
+ "layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
390
+ "layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
391
+ "layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
392
+ "layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
393
+ "layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
394
+ "layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
395
+ "layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
396
+ "layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
397
+ "layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
398
+ "layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
399
+ "layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
400
+ "layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
401
+ "layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
402
+ "layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
403
+ "norm.weight": "model-00002-of-00002.safetensors"
404
+ }
405
+ }
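The `weight_map` above assigns every tensor name to one of two shard files. A minimal sketch of how such an index might be consulted, assuming it has already been loaded into a Python dict with a `weight_map` key; the helper name `tensors_by_shard` is hypothetical, and only three entries are copied from the map above:

```python
from collections import defaultdict

# A tiny excerpt of the sharded-checkpoint index; the three entries below are
# copied from the weight map in this diff, everything else is omitted.
index = {
    "weight_map": {
        "layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
        "layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
        "norm.weight": "model-00002-of-00002.safetensors",
    }
}


def tensors_by_shard(index):
    """Group tensor names by the shard file that stores them (hypothetical helper)."""
    groups = defaultdict(list)
    for name, shard in index["weight_map"].items():
        groups[shard].append(name)
    # Sort the names so the result is deterministic.
    return {shard: sorted(names) for shard, names in groups.items()}


shards = tensors_by_shard(index)
```

Grouping by shard lets a loader open each file once and read all of its tensors together, rather than reopening a shard for every tensor.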
modeling.py ADDED
@@ -0,0 +1,1513 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """PyTorch Qwen2 model."""
21
+
22
+ import math
23
+ from typing import List, Optional, Tuple, Union
24
+
25
+ import torch
26
+ import torch.utils.checkpoint
27
+ from torch import nn
28
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
29
+
30
+ from transformers.activations import ACT2FN
31
+ from transformers.cache_utils import Cache, DynamicCache, SlidingWindowCache, StaticCache
32
+ from transformers.generation import GenerationMixin
33
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
34
+ from transformers.modeling_outputs import (
35
+ BaseModelOutputWithPast,
36
+ CausalLMOutputWithPast,
37
+ QuestionAnsweringModelOutput,
38
+ SequenceClassifierOutputWithPast,
39
+ TokenClassifierOutput,
40
+ )
41
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
42
+ from transformers.modeling_utils import PreTrainedModel
43
+ from transformers.utils import (
44
+ add_code_sample_docstrings,
45
+ add_start_docstrings,
46
+ add_start_docstrings_to_model_forward,
47
+ is_flash_attn_2_available,
48
+ is_flash_attn_greater_or_equal_2_10,
49
+ logging,
50
+ replace_return_docstrings,
51
+ )
52
+ from .configuration_qwen2 import Qwen2Config
53
+
54
+
55
+ if is_flash_attn_2_available():
56
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
57
+
58
+
59
+ logger = logging.get_logger(__name__)
60
+
61
+
62
+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen2-7B"
63
+ _CONFIG_FOR_DOC = "Qwen2Config"
64
+
65
+
66
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Qwen2
67
+ class Qwen2RMSNorm(nn.Module):
68
+ def __init__(self, hidden_size, eps=1e-6):
69
+ """
70
+ Qwen2RMSNorm is equivalent to T5LayerNorm
71
+ """
72
+ super().__init__()
73
+ self.weight = nn.Parameter(torch.ones(hidden_size))
74
+ self.variance_epsilon = eps
75
+
76
+ def forward(self, hidden_states):
77
+ input_dtype = hidden_states.dtype
78
+ hidden_states = hidden_states.to(torch.float32)
79
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
80
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
81
+ return self.weight * hidden_states.to(input_dtype)
82
+
83
+ def extra_repr(self):
84
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
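The normalization in `Qwen2RMSNorm.forward` divides by the root mean square of the last dimension without subtracting the mean. A plain-Python sketch of the same arithmetic on a single vector, for illustration only; the function name `rms_norm` and the sample values are assumptions, not part of the model code:

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root mean square is about 1, then apply a per-element weight."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares; no mean subtraction
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


x = [3.0, 4.0]
y = rms_norm(x, weight=[1.0, 1.0])

# After normalization with unit weights the mean of squares is close to 1.
mean_square = sum(v * v for v in y) / len(y)
```

The module above additionally casts to `float32` before computing the variance and casts back afterwards, which this scalar sketch does not need.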
85
+
86
+
87
+ # Copied from transformers.models.llama.modeling_llama.LlamaRotaryEmbedding with Llama->Qwen2
88
+ class Qwen2RotaryEmbedding(nn.Module):
89
+ def __init__(
90
+ self,
91
+ dim=None,
92
+ max_position_embeddings=2048,
93
+ base=10000,
94
+ device=None,
95
+ scaling_factor=1.0,
96
+ rope_type="default",
97
+ config: Optional[Qwen2Config] = None,
98
+ ):
99
+ super().__init__()
100
+ # TODO (joao): remove the `if` below, only used for BC
101
+ self.rope_kwargs = {}
102
+ if config is None:
103
+ logger.warning_once(
104
+ "`Qwen2RotaryEmbedding` can now be fully parameterized by passing the model config through the "
105
+ "`config` argument. All other arguments will be removed in v4.46"
106
+ )
107
+ self.rope_kwargs = {
108
+ "rope_type": rope_type,
109
+ "factor": scaling_factor,
110
+ "dim": dim,
111
+ "base": base,
112
+ "max_position_embeddings": max_position_embeddings,
113
+ }
114
+ self.rope_type = rope_type
115
+ self.max_seq_len_cached = max_position_embeddings
116
+ self.original_max_seq_len = max_position_embeddings
117
+ else:
118
+ # BC: "rope_type" was originally "type"
119
+ if config.rope_scaling is not None:
120
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
121
+ else:
122
+ self.rope_type = "default"
123
+ self.max_seq_len_cached = config.max_position_embeddings
124
+ self.original_max_seq_len = config.max_position_embeddings
125
+
126
+ self.config = config
127
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
128
+
129
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, **self.rope_kwargs)
130
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
131
+ self.original_inv_freq = self.inv_freq
132
+
133
+ def _dynamic_frequency_update(self, position_ids, device):
134
+ """
135
+ dynamic RoPE layers should recompute `inv_freq` in the following situations:
136
+ 1 - growing beyond the cached sequence length (allow scaling)
137
+ 2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
138
+ """
139
+ seq_len = torch.max(position_ids) + 1
140
+ if seq_len > self.max_seq_len_cached: # growth
141
+ inv_freq, self.attention_scaling = self.rope_init_fn(
142
+ self.config, device, seq_len=seq_len, **self.rope_kwargs
143
+ )
144
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: may break with compilation
145
+ self.max_seq_len_cached = seq_len
146
+
147
+ if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len: # reset
148
+ self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
149
+ self.max_seq_len_cached = self.original_max_seq_len
150
+
151
+ @torch.no_grad()
152
+ def forward(self, x, position_ids):
153
+ if "dynamic" in self.rope_type:
154
+ self._dynamic_frequency_update(position_ids, device=x.device)
155
+
156
+ # Core RoPE block
157
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
158
+ position_ids_expanded = position_ids[:, None, :].float()
159
+ # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
160
+ device_type = x.device.type
161
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
162
+ with torch.autocast(device_type=device_type, enabled=False):
163
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
164
+ emb = torch.cat((freqs, freqs), dim=-1)
165
+ cos = emb.cos()
166
+ sin = emb.sin()
167
+
168
+ # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
169
+ cos = cos * self.attention_scaling
170
+ sin = sin * self.attention_scaling
171
+
172
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
173
+
174
+
175
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
176
+ def rotate_half(x):
177
+ """Rotates half the hidden dims of the input."""
178
+ x1 = x[..., : x.shape[-1] // 2]
179
+ x2 = x[..., x.shape[-1] // 2 :]
180
+ return torch.cat((-x2, x1), dim=-1)
181
+
182
+
183
+ # Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
184
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
185
+ """Applies Rotary Position Embedding to the query and key tensors.
186
+
187
+ Args:
188
+ q (`torch.Tensor`): The query tensor.
189
+ k (`torch.Tensor`): The key tensor.
190
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
191
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
192
+ position_ids (`torch.Tensor`, *optional*):
193
+ Deprecated and unused.
194
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
195
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
196
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
197
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
198
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
199
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
200
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
201
+ Returns:
202
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
203
+ """
204
+ cos = cos.unsqueeze(unsqueeze_dim)
205
+ sin = sin.unsqueeze(unsqueeze_dim)
206
+ q_embed = (q * cos) + (rotate_half(q) * sin)
207
+ k_embed = (k * cos) + (rotate_half(k) * sin)
208
+ return q_embed, k_embed
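The combination `x * cos + rotate_half(x) * sin` rotates each pair of dimensions `(x[i], x[i + d/2])` by an angle proportional to the token position. A plain-Python sketch of that operation on single vectors, assuming the default RoPE frequencies `base ** (-2i / dim)`; the helper names `rope` and `dot` and the sample vectors are illustrative and not part of this file:

```python
import math


def rotate_half(x):
    """Return (-x2, x1), where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    """Apply rotary position embedding to one vector at one position."""
    dim = len(x)
    # One inverse frequency per pair of dimensions, duplicated to cover both halves.
    inv_freq = [base ** (-(2 * i) / dim) for i in range(dim // 2)]
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

# Both position pairs are 2 apart, so the attention scores should agree.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Because each pair is only rotated, vector norms are unchanged and the query-key score depends on the relative offset between positions rather than on their absolute values.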
209
+
210
+
211
+ # Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2
212
+ class Qwen2MLP(nn.Module):
213
+ def __init__(self, config):
214
+ super().__init__()
215
+ self.hidden_size = config.hidden_size
216
+ self.intermediate_size = config.intermediate_size
217
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
218
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
219
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
220
+ self.act_fn = ACT2FN[config.hidden_act]
221
+
222
+ def forward(self, hidden_state):
223
+ return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
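`Qwen2MLP.forward` is a gated feed-forward block: the activated `gate_proj` output is multiplied elementwise by the `up_proj` output before `down_proj` maps it back to the hidden size. A plain-Python sketch of that data flow, assuming SiLU as the activation (the diff only shows `ACT2FN[config.hidden_act]`, so the actual choice depends on the config); `silu`, `matvec` and `gated_mlp` are hypothetical helpers:

```python
import math


def silu(v):
    """SiLU activation, v * sigmoid(v). Assumed here; the real one comes from the config."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_mlp(x, w_gate, w_up, w_down):
    """Compute down(act(gate(x)) * up(x)) without bias terms."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# hidden_size = 1 and intermediate_size = 1 keep the arithmetic easy to follow.
out = gated_mlp([2.0], w_gate=[[1.0]], w_up=[[3.0]], w_down=[[0.5]])

# With an all-zero gate projection the block outputs zeros, since silu(0) = 0.
closed = gated_mlp([1.0, 2.0], w_gate=[[0.0, 0.0]], w_up=[[1.0, 1.0]], w_down=[[1.0], [1.0]])
```

The second call shows the gating effect: whatever `up_proj` produces, a gate of zero suppresses it entirely.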
224
+
225
+
226
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
227
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
228
+ """
229
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
230
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
231
+ """
232
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
233
+ if n_rep == 1:
234
+ return hidden_states
235
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
236
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
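The `expand` followed by `reshape` in `repeat_kv` places the copies of each key/value head next to each other, which is what lets consecutive groups of query heads share one key/value head. A plain-Python sketch of the resulting head order, using strings in place of tensors; `repeat_kv_heads` is a hypothetical stand-in for the tensor version:

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping copies of a head adjacent."""
    if n_rep == 1:
        return kv_heads
    return [head for head in kv_heads for _ in range(n_rep)]


# Two key/value heads shared by four query heads (n_rep = 4 // 2 = 2).
kv_heads = ["kv0", "kv1"]
expanded = repeat_kv_heads(kv_heads, n_rep=2)
```

Query heads 0 and 1 therefore attend with `kv0`, and query heads 2 and 3 with `kv1`; a tiled order such as `kv0, kv1, kv0, kv1` would pair them incorrectly.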
237
+
238
+
239
+ class Qwen2Attention(nn.Module):
240
+ """
241
+ Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
242
+ and "Generating Long Sequences with Sparse Transformers".
243
+ """
244
+
245
+ def __init__(self, config: Qwen2Config, layer_idx: Optional[int] = None):
246
+ super().__init__()
247
+ self.config = config
248
+ self.layer_idx = layer_idx
249
+ if layer_idx is None:
250
+ logger.warning_once(
251
+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
252
+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
253
+ "when creating this class."
254
+ )
255
+
256
+ self.hidden_size = config.hidden_size
257
+ self.num_heads = config.num_attention_heads
258
+ self.head_dim = config.head_dim or (self.hidden_size // self.num_heads)
259
+ self.num_key_value_heads = config.num_key_value_heads
260
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
261
+ self.max_position_embeddings = config.max_position_embeddings
262
+ self.rope_theta = config.rope_theta
263
+ self.is_causal = True
264
+ self.attention_dropout = config.attention_dropout
265
+
266
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.qkv_bias)
267
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.qkv_bias)
268
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.qkv_bias)
269
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
270
+ if self.config.use_qk_norm:
271
+ self.q_norm = Qwen2RMSNorm(self.head_dim, eps=config.rms_norm_eps)
272
+ self.k_norm = Qwen2RMSNorm(self.head_dim, eps=config.rms_norm_eps)
273
+
274
+ self.rotary_emb = Qwen2RotaryEmbedding(config=self.config)
275
+
276
+ def forward(
277
+ self,
278
+ hidden_states: torch.Tensor,
279
+ attention_mask: Optional[torch.Tensor] = None,
280
+ position_ids: Optional[torch.LongTensor] = None,
281
+ past_key_value: Optional[Cache] = None,
282
+ output_attentions: bool = False,
283
+ use_cache: bool = False,
284
+ cache_position: Optional[torch.LongTensor] = None,
285
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
286
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
287
+ bsz, q_len, _ = hidden_states.size()
288
+
289
+ query_states = self.q_proj(hidden_states)
290
+ key_states = self.k_proj(hidden_states)
291
+ value_states = self.v_proj(hidden_states)
292
+
293
+ if self.config.use_qk_norm:
294
+ query_states = self.q_norm(query_states.view(bsz, q_len, self.num_heads, self.head_dim)).transpose(1, 2)
295
+ key_states = self.k_norm(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim)).transpose(1, 2)
296
+
297
+ else:
298
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
299
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
300
+
301
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
302
+
303
+ if position_embeddings is None:
304
+ logger.warning_once(
305
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
306
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
307
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
308
+ "removed and `position_embeddings` will be mandatory."
309
+ )
310
+ cos, sin = self.rotary_emb(value_states, position_ids)
311
+ else:
312
+ cos, sin = position_embeddings
313
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
314
+
315
+ if past_key_value is not None:
316
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
317
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
318
+
319
+ # repeat k/v heads if n_kv_heads < n_heads
320
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
321
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
322
+
323
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
324
+ if attention_mask is not None: # no matter the length, we just slice it
325
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
326
+ attn_weights = attn_weights + causal_mask
327
+
328
+ # upcast attention to fp32
329
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
330
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
331
+ attn_output = torch.matmul(attn_weights, value_states)
332
+
333
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
334
+ raise ValueError(
335
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
336
+ f" {attn_output.size()}"
337
+ )
338
+
339
+ attn_output = attn_output.transpose(1, 2).contiguous()
340
+ attn_output = attn_output.reshape(bsz, q_len, -1)
341
+
342
+ attn_output = self.o_proj(attn_output)
343
+
344
+ if not output_attentions:
345
+ attn_weights = None
346
+
347
+ return attn_output, attn_weights, past_key_value
348
+
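The projection sizes and the key/value head repetition used by `Qwen2Attention` above can be checked with a small shape calculation. The following is a minimal sketch in plain Python, not part of the committed file: the helper names `gqa_shapes` and `repeat_heads` are hypothetical, the numeric values are illustrative rather than read from this model's config, and `repeat_heads` assumes that `repeat_kv` (defined elsewhere in the file) repeats each key/value head consecutively.

```python
from typing import Dict, List, Optional


def gqa_shapes(hidden_size: int, num_heads: int, num_kv_heads: int,
               head_dim: Optional[int] = None) -> Dict[str, int]:
    """Mirror the size bookkeeping of Qwen2Attention.__init__ (hypothetical helper)."""
    # Same fallback as `config.head_dim or (hidden_size // num_heads)`.
    head_dim = head_dim or (hidden_size // num_heads)
    # This check is an addition for the sketch; the original code divides without it.
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    return {
        "head_dim": head_dim,
        "q_out": num_heads * head_dim,       # output features of q_proj
        "kv_out": num_kv_heads * head_dim,   # output features of k_proj and v_proj
        "groups": num_heads // num_kv_heads, # num_key_value_groups
    }


def repeat_heads(kv_heads: List[str], n_rep: int) -> List[str]:
    """Repeat each key/value head n_rep times in place (assumed repeat_kv behavior)."""
    return [head for head in kv_heads for _ in range(n_rep)]


# Illustrative values only.
shapes = gqa_shapes(hidden_size=2560, num_heads=32, num_kv_heads=8, head_dim=128)
expanded = repeat_heads(["kv0", "kv1"], n_rep=2)
```

With these illustrative values `q_out` differs from `hidden_size`, which is why `o_proj` maps `num_heads * head_dim` back to `hidden_size` instead of assuming the two are equal.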
349
+
350
+ class Qwen2FlashAttention2(Qwen2Attention):
351
+ """
352
+ Qwen2 flash attention module. It inherits from `Qwen2Attention` because the weights of the module stay
+ untouched; only the forward pass changes, so that it correctly calls the public flash attention API and
+ handles padding tokens if the input contains any. When sliding window attention (SWA) is enabled, it is
+ applied only to layers whose index is at least `config.max_window_layers`.
357
+ """
358
+
359
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
360
+ def __init__(self, *args, **kwargs):
361
+ super().__init__(*args, **kwargs)
362
+
363
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
+ # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default in flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
366
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
367
+
368
+ def forward(
369
+ self,
370
+ hidden_states: torch.Tensor,
371
+ attention_mask: Optional[torch.Tensor] = None,
372
+ position_ids: Optional[torch.LongTensor] = None,
373
+ past_key_value: Optional[Cache] = None,
374
+ output_attentions: bool = False,
375
+ use_cache: bool = False,
376
+ cache_position: Optional[torch.LongTensor] = None,
377
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
378
+ ):
379
+ bsz, q_len, _ = hidden_states.size()
380
+
381
+ query_states = self.q_proj(hidden_states)
382
+ key_states = self.k_proj(hidden_states)
383
+ value_states = self.v_proj(hidden_states)
384
+
385
+ if self.config.use_qk_norm:
386
+ query_states = self.q_norm(query_states.view(bsz, q_len, self.num_heads, self.head_dim)).transpose(1, 2)
387
+ key_states = self.k_norm(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim)).transpose(1, 2)
388
+
389
+ else:
390
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
391
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
392
+
393
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
394
+
395
+ if position_embeddings is None:
396
+ logger.warning_once(
397
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
398
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
399
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
400
+ "removed and `position_embeddings` will be mandatory."
401
+ )
402
+ cos, sin = self.rotary_emb(value_states, position_ids)
403
+ else:
404
+ cos, sin = position_embeddings
405
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
406
+
407
+ if past_key_value is not None:
408
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
409
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
410
+
411
+ # repeat k/v heads if n_kv_heads < n_heads
412
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
413
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
414
+ dropout_rate = 0.0 if not self.training else self.attention_dropout
415
+
416
+ # In PEFT, layer norms are usually cast to float32 for training stability, so the input hidden states
+ # get silently upcast to float32 as well. Cast them back to the expected lower-precision dtype so that
+ # flash attention works as expected.
419
+ input_dtype = query_states.dtype
420
+ if input_dtype == torch.float32:
421
+ if torch.is_autocast_enabled():
422
+ target_dtype = torch.get_autocast_gpu_dtype()
423
+ # Handle the case where the model is quantized
424
+ elif hasattr(self.config, "_pre_quantization_dtype"):
425
+ target_dtype = self.config._pre_quantization_dtype
426
+ else:
427
+ target_dtype = self.q_proj.weight.dtype
428
+
429
+ logger.warning_once(
430
+ "The input hidden states seem to have been silently cast to float32; this might be because"
+ " embedding or layer norm layers were upcast to float32. The input will be cast back to"
+ f" {target_dtype}."
433
+ )
434
+
435
+ query_states = query_states.to(target_dtype)
436
+ key_states = key_states.to(target_dtype)
437
+ value_states = value_states.to(target_dtype)
438
+
439
+ # Reshape to the expected shape for Flash Attention
440
+ query_states = query_states.transpose(1, 2)
441
+ key_states = key_states.transpose(1, 2)
442
+ value_states = value_states.transpose(1, 2)
443
+
444
+ if (
445
+ self.config.use_sliding_window
446
+ and getattr(self.config, "sliding_window", None) is not None
447
+ and self.layer_idx >= self.config.max_window_layers
448
+ ):
449
+ sliding_window = self.config.sliding_window
450
+ else:
451
+ sliding_window = None
452
+
453
+ attn_output = _flash_attention_forward(
454
+ query_states,
455
+ key_states,
456
+ value_states,
457
+ attention_mask,
458
+ q_len,
459
+ position_ids=position_ids,
460
+ dropout=dropout_rate,
461
+ sliding_window=sliding_window,
462
+ is_causal=self.is_causal,
463
+ use_top_left_mask=self._flash_attn_uses_top_left_mask,
464
+ )
465
+
466
+ attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
467
+ attn_output = self.o_proj(attn_output)
468
+
469
+ # This forward pass never computes `attn_weights`, so it is always None (avoids a NameError when
+ # `output_attentions=True`).
+ attn_weights = None
471
+
472
+ return attn_output, attn_weights, past_key_value
473
+
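The per-layer sliding window decision made in `Qwen2FlashAttention2.forward` above can be isolated as a small pure function. This is a minimal sketch under stated assumptions: the function name `select_sliding_window` and the example values (window 4096, `max_window_layers=2`, four layers) are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def select_sliding_window(use_sliding_window: bool, sliding_window: Optional[int],
                          layer_idx: int, max_window_layers: int) -> Optional[int]:
    """Return the window size for one layer, or None for full attention."""
    # Same three conditions as the `if` block in the forward pass.
    if use_sliding_window and sliding_window is not None and layer_idx >= max_window_layers:
        return sliding_window
    return None


# Illustrative 4-layer model: layers 0 and 1 use full attention, layers 2 and 3 use the window.
per_layer: List[Optional[int]] = [
    select_sliding_window(True, 4096, layer_idx, max_window_layers=2) for layer_idx in range(4)
]
disabled = select_sliding_window(False, 4096, layer_idx=3, max_window_layers=2)
```

Under this condition the layers below `max_window_layers` keep full attention, and the window only takes effect when `use_sliding_window` is set.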
474
+
475
+ class Qwen2SdpaAttention(Qwen2Attention):
476
+ """
477
+ Qwen2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
478
+ `Qwen2Attention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to the
479
+ SDPA API.
480
+ """
481
+
482
+ # Adapted from Qwen2Attention.forward
483
+ def forward(
484
+ self,
485
+ hidden_states: torch.Tensor,
486
+ attention_mask: Optional[torch.Tensor] = None,
487
+ position_ids: Optional[torch.LongTensor] = None,
488
+ past_key_value: Optional[Cache] = None,
489
+ output_attentions: bool = False,
490
+ use_cache: bool = False,
491
+ cache_position: Optional[torch.LongTensor] = None,
492
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
493
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
494
+ if output_attentions:
495
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
496
+ logger.warning_once(
497
+ "Qwen2Model is using Qwen2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
498
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
499
+ )
500
+ return super().forward(
501
+ hidden_states=hidden_states,
502
+ attention_mask=attention_mask,
503
+ position_ids=position_ids,
504
+ past_key_value=past_key_value,
505
+ output_attentions=output_attentions,
506
+ use_cache=use_cache,
+ cache_position=cache_position,
+ position_embeddings=position_embeddings,
507
+ )
508
+
509
+ bsz, q_len, _ = hidden_states.size()
510
+
511
+ query_states = self.q_proj(hidden_states)
512
+ key_states = self.k_proj(hidden_states)
513
+ value_states = self.v_proj(hidden_states)
514
+
515
+ if self.config.use_qk_norm:
516
+ query_states = self.q_norm(query_states.view(bsz, q_len, self.num_heads, self.head_dim)).transpose(1, 2)
517
+ key_states = self.k_norm(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim)).transpose(1, 2)
518
+
519
+ else:
520
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
521
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
522
+
523
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
524
+
525
+ if position_embeddings is None:
526
+ logger.warning_once(
527
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
528
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
529
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
530
+ "removed and `position_embeddings` will be mandatory."
531
+ )
532
+ cos, sin = self.rotary_emb(value_states, position_ids)
533
+ else:
534
+ cos, sin = position_embeddings
535
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
536
+
537
+ if past_key_value is not None:
538
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
539
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
540
+
541
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
542
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
543
+
544
+ causal_mask = attention_mask
545
+ if attention_mask is not None: # no matter the length, we just slice it
546
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
547
+
548
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
549
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
550
+ if query_states.device.type == "cuda" and attention_mask is not None:
551
+ query_states = query_states.contiguous()
552
+ key_states = key_states.contiguous()
553
+ value_states = value_states.contiguous()
554
+
555
+ # We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of an inline conditional assignment
556
+ # in SDPA to support both torch.compile's dynamic shapes and full graph options. An inline conditional prevents dynamic shapes from compiling.
557
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
558
+ is_causal = causal_mask is None and q_len > 1
559
+
560
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
561
+ query_states,
562
+ key_states,
563
+ value_states,
564
+ attn_mask=causal_mask,
565
+ dropout_p=self.attention_dropout if self.training else 0.0,
566
+ is_causal=is_causal,
567
+ )
568
+
569
+ attn_output = attn_output.transpose(1, 2).contiguous()
570
+ attn_output = attn_output.view(bsz, q_len, -1)  # num_heads * head_dim may differ from hidden_size
571
+
572
+ attn_output = self.o_proj(attn_output)
573
+
574
+ return attn_output, None, past_key_value
575
+
576
+
577
+ QWEN2_ATTENTION_CLASSES = {
578
+ "eager": Qwen2Attention,
579
+ "flash_attention_2": Qwen2FlashAttention2,
580
+ "sdpa": Qwen2SdpaAttention,
581
+ }
582
+
583
+
584
+ class Qwen2DecoderLayer(nn.Module):
585
+ def __init__(self, config: Qwen2Config, layer_idx: int):
586
+ super().__init__()
587
+ self.hidden_size = config.hidden_size
588
+
589
+ if config.sliding_window and config._attn_implementation != "flash_attention_2":
590
+ logger.warning_once(
591
+ f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
592
+ "unexpected results may be encountered."
593
+ )
594
+ self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
595
+
596
+ self.mlp = Qwen2MLP(config)
597
+ self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
598
+ self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
599
+
600
+ def forward(
601
+ self,
602
+ hidden_states: torch.Tensor,
603
+ attention_mask: Optional[torch.Tensor] = None,
604
+ position_ids: Optional[torch.LongTensor] = None,
605
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
606
+ output_attentions: Optional[bool] = False,
607
+ use_cache: Optional[bool] = False,
608
+ cache_position: Optional[torch.LongTensor] = None,
609
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
610
+ **kwargs,
611
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
612
+ """
613
+ Args:
614
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
615
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
616
+ `(batch, sequence_length)` where padding elements are indicated by 0.
617
+ output_attentions (`bool`, *optional*):
618
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
619
+ returned tensors for more detail.
620
+ use_cache (`bool`, *optional*):
621
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
622
+ (see `past_key_values`).
623
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
624
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
625
+ Indices depicting the position of the input sequence tokens in the sequence.
626
+ position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
627
+ Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
628
+ with `head_dim` being the embedding dimension of each attention head.
629
+ kwargs (`dict`, *optional*):
630
+ Arbitrary kwargs to be ignored, used for FSDP and other methods that inject code
+ into the model
632
+ """
633
+
634
+ residual = hidden_states
635
+
636
+ hidden_states = self.input_layernorm(hidden_states)
637
+
638
+ # Self Attention
639
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
640
+ hidden_states=hidden_states,
641
+ attention_mask=attention_mask,
642
+ position_ids=position_ids,
643
+ past_key_value=past_key_value,
644
+ output_attentions=output_attentions,
645
+ use_cache=use_cache,
646
+ cache_position=cache_position,
647
+ position_embeddings=position_embeddings,
648
+ )
649
+ hidden_states = residual + hidden_states
650
+
651
+ # Fully Connected
652
+ residual = hidden_states
653
+ hidden_states = self.post_attention_layernorm(hidden_states)
654
+ hidden_states = self.mlp(hidden_states)
655
+ hidden_states = residual + hidden_states
656
+
657
+ outputs = (hidden_states,)
658
+
659
+ if output_attentions:
660
+ outputs += (self_attn_weights,)
661
+
662
+ if use_cache:
663
+ outputs += (present_key_value,)
664
+
665
+ return outputs
666
+
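The variable-length tuple returned by `Qwen2DecoderLayer.forward` explains why the model later reads the cache with `layer_outputs[2 if output_attentions else 1]`. The sketch below reproduces only that packing logic in plain Python; `pack_layer_outputs` and `cache_index` are hypothetical names, and the string placeholders stand in for tensors and cache objects.

```python
from typing import Any, Tuple


def pack_layer_outputs(hidden_states: Any, attn_weights: Any, present_key_value: Any,
                       output_attentions: bool, use_cache: bool) -> Tuple[Any, ...]:
    """Build the output tuple the same way the decoder layer does."""
    outputs: Tuple[Any, ...] = (hidden_states,)
    if output_attentions:
        outputs += (attn_weights,)
    if use_cache:
        outputs += (present_key_value,)
    return outputs


def cache_index(output_attentions: bool) -> int:
    """Position of the cache entry in the tuple when use_cache is True."""
    return 2 if output_attentions else 1


with_attn = pack_layer_outputs("h", "attn", "kv", output_attentions=True, use_cache=True)
without_attn = pack_layer_outputs("h", "attn", "kv", output_attentions=False, use_cache=True)
```

Because the attention weights are only present on request, the cache entry shifts between index 1 and index 2, and the caller has to account for that.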
667
+
668
+ QWEN2_START_DOCSTRING = r"""
669
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
670
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
671
+ etc.)
672
+
673
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
674
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
675
+ and behavior.
676
+
677
+ Parameters:
678
+ config ([`Qwen2Config`]):
679
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
680
+ load the weights associated with the model, only the configuration. Check out the
681
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
682
+ """
683
+
684
+
685
+ @add_start_docstrings(
686
+ "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
687
+ QWEN2_START_DOCSTRING,
688
+ )
689
+ class Qwen2PreTrainedModel(PreTrainedModel):
690
+ config_class = Qwen2Config
691
+ base_model_prefix = "model"
692
+ supports_gradient_checkpointing = True
693
+ _no_split_modules = ["Qwen2DecoderLayer"]
694
+ _skip_keys_device_placement = "past_key_values"
695
+ _supports_flash_attn_2 = True
696
+ _supports_sdpa = True
697
+ _supports_cache_class = True
698
+ _supports_quantized_cache = True
699
+ _supports_static_cache = True
700
+
701
+ def _init_weights(self, module):
702
+ std = self.config.initializer_range
703
+ if isinstance(module, nn.Linear):
704
+ module.weight.data.normal_(mean=0.0, std=std)
705
+ if module.bias is not None:
706
+ module.bias.data.zero_()
707
+ elif isinstance(module, nn.Embedding):
708
+ module.weight.data.normal_(mean=0.0, std=std)
709
+ if module.padding_idx is not None:
710
+ module.weight.data[module.padding_idx].zero_()
711
+
712
+
713
+ QWEN2_INPUTS_DOCSTRING = r"""
714
+ Args:
715
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
716
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
717
+ it.
718
+
719
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
720
+ [`PreTrainedTokenizer.__call__`] for details.
721
+
722
+ [What are input IDs?](../glossary#input-ids)
723
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
724
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
725
+
726
+ - 1 for tokens that are **not masked**,
727
+ - 0 for tokens that are **masked**.
728
+
729
+ [What are attention masks?](../glossary#attention-mask)
730
+
731
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+ `past_key_values`).
+
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+ information on the default strategy.
743
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
744
+ Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0,
+ config.max_position_embeddings - 1]`.
746
+
747
+ [What are position IDs?](../glossary#position-ids)
748
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
749
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
750
+ blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
751
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
752
+
753
+ Two formats are allowed:
754
+ - a [`~cache_utils.Cache`] instance, see our
755
+ [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
756
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
758
+ cache format.
759
+
760
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
761
+ legacy cache format will be returned.
762
+
763
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
764
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
765
+ of shape `(batch_size, sequence_length)`.
766
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
767
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
768
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
769
+ model's internal embedding lookup matrix.
770
+ use_cache (`bool`, *optional*):
771
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
772
+ `past_key_values`).
773
+ output_attentions (`bool`, *optional*):
774
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
775
+ tensors for more detail.
776
+ output_hidden_states (`bool`, *optional*):
777
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
778
+ more detail.
779
+ return_dict (`bool`, *optional*):
780
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
781
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
782
+ Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
783
+ this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
784
+ the complete sequence length.
785
+ """
786
+
787
+
788
+ @add_start_docstrings(
789
+ "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
790
+ QWEN2_START_DOCSTRING,
791
+ )
792
+ class Qwen2Model(Qwen2PreTrainedModel):
793
+ """
794
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]
795
+
796
+ Args:
797
+ config: Qwen2Config
798
+ """
799
+
800
+ def __init__(self, config: Qwen2Config):
801
+ super().__init__(config)
802
+ self.padding_idx = config.pad_token_id
803
+ self.vocab_size = config.vocab_size
804
+
805
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
806
+ self.layers = nn.ModuleList(
807
+ [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
808
+ )
809
+ self._attn_implementation = config._attn_implementation
810
+ self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
811
+ self.rotary_emb = Qwen2RotaryEmbedding(config=config)
812
+
813
+ self.gradient_checkpointing = False
814
+ # Initialize weights and apply final processing
815
+ self.post_init()
816
+
817
+ def get_input_embeddings(self):
818
+ return self.embed_tokens
819
+
820
+ def set_input_embeddings(self, value):
821
+ self.embed_tokens = value
822
+
823
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
824
+ def forward(
825
+ self,
826
+ input_ids: torch.LongTensor = None,
827
+ attention_mask: Optional[torch.Tensor] = None,
828
+ position_ids: Optional[torch.LongTensor] = None,
829
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
830
+ inputs_embeds: Optional[torch.FloatTensor] = None,
831
+ use_cache: Optional[bool] = None,
832
+ output_attentions: Optional[bool] = None,
833
+ output_hidden_states: Optional[bool] = None,
834
+ return_dict: Optional[bool] = None,
835
+ cache_position: Optional[torch.LongTensor] = None,
836
+ labels: Optional[torch.Tensor] = None,  # accepted but not used by this forward pass
837
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
838
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
839
+ output_hidden_states = (
840
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
841
+ )
842
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
843
+
844
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
845
+
846
+ if (input_ids is None) ^ (inputs_embeds is not None):
847
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
848
+
849
+ if self.gradient_checkpointing and self.training:
850
+ if use_cache:
851
+ logger.warning_once(
852
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
853
+ )
854
+ use_cache = False
855
+
856
+ # kept for BC (non `Cache` `past_key_values` inputs)
857
+ return_legacy_cache = False
858
+ if use_cache and not isinstance(past_key_values, Cache):
859
+ return_legacy_cache = True
860
+ if past_key_values is None:
861
+ past_key_values = DynamicCache()
862
+ else:
863
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
864
+ logger.warning_once(
865
+ "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
866
+ "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
867
+ "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
868
+ )
869
+
870
+ if inputs_embeds is None:
871
+ inputs_embeds = self.embed_tokens(input_ids)
872
+
873
+ if cache_position is None:
874
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
875
+ cache_position = torch.arange(
876
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
877
+ )
878
+ if position_ids is None:
879
+ position_ids = cache_position.unsqueeze(0)
880
+
881
+ causal_mask = self._update_causal_mask(
882
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
883
+ )
884
+
885
+ hidden_states = inputs_embeds
886
+
887
+ # create position embeddings to be shared across the decoder layers
888
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
889
+
890
+ # decoder layers
891
+ all_hidden_states = () if output_hidden_states else None
892
+ all_self_attns = () if output_attentions else None
893
+ next_decoder_cache = None
894
+
895
+ for decoder_layer in self.layers:
896
+ if output_hidden_states:
897
+ all_hidden_states += (hidden_states,)
898
+
899
+ if self.gradient_checkpointing and self.training:
900
+ layer_outputs = self._gradient_checkpointing_func(
901
+ decoder_layer.__call__,
902
+ hidden_states,
903
+ causal_mask,
904
+ position_ids,
905
+ past_key_values,
906
+ output_attentions,
907
+ use_cache,
908
+ cache_position,
909
+ position_embeddings,
910
+ )
911
+ else:
912
+ layer_outputs = decoder_layer(
913
+ hidden_states,
914
+ attention_mask=causal_mask,
915
+ position_ids=position_ids,
916
+ past_key_value=past_key_values,
917
+ output_attentions=output_attentions,
918
+ use_cache=use_cache,
919
+ cache_position=cache_position,
920
+ position_embeddings=position_embeddings,
921
+ )
922
+
923
+ hidden_states = layer_outputs[0]
924
+
925
+ if use_cache:
926
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
927
+
928
+ if output_attentions:
929
+ all_self_attns += (layer_outputs[1],)
930
+
931
+ hidden_states = self.norm(hidden_states)
932
+
933
+ # add hidden states from the last decoder layer
934
+ if output_hidden_states:
935
+ all_hidden_states += (hidden_states,)
936
+
937
+ next_cache = next_decoder_cache if use_cache else None
938
+ if return_legacy_cache:
939
+ next_cache = next_cache.to_legacy_cache()
940
+
941
+ if not return_dict:
942
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
943
+ return BaseModelOutputWithPast(
944
+ last_hidden_state=hidden_states,
945
+ past_key_values=next_cache,
946
+ hidden_states=all_hidden_states,
947
+ attentions=all_self_attns,
948
+ )
949
+
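The default `cache_position` and `position_ids` computed in `Qwen2Model.forward` above follow directly from the number of tokens already held in the cache. A minimal sketch in plain Python, using lists in place of tensors; the helper name `default_cache_position` and the token counts are hypothetical.

```python
from typing import List


def default_cache_position(past_seen_tokens: int, new_tokens: int) -> List[int]:
    """List equivalent of torch.arange(past_seen_tokens, past_seen_tokens + new_tokens)."""
    return list(range(past_seen_tokens, past_seen_tokens + new_tokens))


# Prefill: nothing cached yet, five prompt tokens.
prefill = default_cache_position(past_seen_tokens=0, new_tokens=5)

# One decoding step afterwards: five tokens cached, one new token.
decode_step = default_cache_position(past_seen_tokens=5, new_tokens=1)

# `cache_position.unsqueeze(0)` adds a leading batch axis of size one.
position_ids = [decode_step]
```

During decoding the single new token therefore receives the position that immediately follows the cached prefix.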
950
+ # Copied from transformers.models.phi3.modeling_phi3.Phi3Model._update_causal_mask
951
+ def _update_causal_mask(
952
+ self,
953
+ attention_mask: torch.Tensor,
954
+ input_tensor: torch.Tensor,
955
+ cache_position: torch.Tensor,
956
+ past_key_values: Cache,
957
+ output_attentions: bool,
958
+ ):
959
+ if self.config._attn_implementation == "flash_attention_2":
960
+ if attention_mask is not None and 0.0 in attention_mask:
961
+ return attention_mask
962
+ return None
963
+
964
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
965
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
966
+ # to infer the attention mask.
967
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
968
+ using_static_cache = isinstance(past_key_values, StaticCache)
969
+ using_sliding_window_cache = isinstance(past_key_values, SlidingWindowCache)
970
+
971
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
972
+ if (
973
+ self.config._attn_implementation == "sdpa"
974
+ and not (using_static_cache or using_sliding_window_cache)
975
+ and not output_attentions
976
+ ):
977
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
978
+ attention_mask,
979
+ inputs_embeds=input_tensor,
980
+ past_key_values_length=past_seen_tokens,
981
+ sliding_window=self.config.sliding_window,
982
+ is_training=self.training,
983
+ ):
984
+ return None
985
+
986
+ dtype, device = input_tensor.dtype, input_tensor.device
987
+ min_dtype = torch.finfo(dtype).min
988
+ sequence_length = input_tensor.shape[1]
989
+ # SlidingWindowCache or StaticCache
990
+ if using_sliding_window_cache or using_static_cache:
991
+ target_length = past_key_values.get_max_cache_shape()
992
+ # DynamicCache or no cache
993
+ else:
994
+ target_length = (
995
+ attention_mask.shape[-1]
996
+ if isinstance(attention_mask, torch.Tensor)
997
+ else past_seen_tokens + sequence_length + 1
998
+ )
999
+
1000
+ # In case the provided `attention_mask` is 2D, we generate a causal mask here (4D).
1001
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
1002
+ attention_mask,
1003
+ sequence_length=sequence_length,
1004
+ target_length=target_length,
1005
+ dtype=dtype,
1006
+ device=device,
1007
+ cache_position=cache_position,
1008
+ batch_size=input_tensor.shape[0],
1009
+ config=self.config,
1010
+ past_key_values=past_key_values,
1011
+ )
1012
+
1013
+ if (
1014
+ self.config._attn_implementation == "sdpa"
1015
+ and attention_mask is not None
1016
+ and attention_mask.device.type == "cuda"
1017
+ and not output_attentions
1018
+ ):
1019
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1020
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1021
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1022
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1023
+
1024
+ return causal_mask
1025
+
1026
+ @staticmethod
1027
+ # Copied from transformers.models.mistral.modeling_mistral.MistralModel._prepare_4d_causal_attention_mask_with_cache_position with Mistral->Qwen2
1028
+ def _prepare_4d_causal_attention_mask_with_cache_position(
1029
+ attention_mask: torch.Tensor,
1030
+ sequence_length: int,
1031
+ target_length: int,
1032
+ dtype: torch.dtype,
1033
+ device: torch.device,
1034
+ cache_position: torch.Tensor,
1035
+ batch_size: int,
1036
+ config: Qwen2Config,
1037
+ past_key_values: Cache,
1038
+ ):
1039
+ """
1040
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
1041
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
1042
+
1043
+ Args:
1044
+ attention_mask (`torch.Tensor`):
1045
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
1046
+ sequence_length (`int`):
1047
+ The sequence length being processed.
1048
+ target_length (`int`):
1049
+ The target length. When generating with a static cache, the mask must be as long as the static cache so that it also covers the zero-padded part of the cache that is not filled yet.
1050
+ dtype (`torch.dtype`):
1051
+ The dtype to use for the 4D attention mask.
1052
+ device (`torch.device`):
1053
+ The device to place the 4D attention mask on.
1054
+ cache_position (`torch.Tensor`):
1055
+ Indices depicting the position of the input sequence tokens in the sequence.
1056
+ batch_size (`int`):
1057
+ Batch size.
1058
+ config (`Qwen2Config`):
1059
+ The model's configuration class
1060
+ past_key_values (`Cache`):
1061
+ The cache instance currently being used for generation.
1062
+ """
1063
+ if attention_mask is not None and attention_mask.dim() == 4:
1064
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
1065
+ causal_mask = attention_mask
1066
+ else:
1067
+ min_dtype = torch.finfo(dtype).min
1068
+ causal_mask = torch.full(
1069
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
1070
+ )
1071
+ diagonal_attend_mask = torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
1072
+ if config.sliding_window is not None:
1073
+ # if we have sliding window, we should not attend to tokens beyond sliding window length, so we mask them out also
1074
+ # the check is needed to verify whether the current checkpoint was trained with a sliding window or not
1075
+ if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
1076
+ sliding_attend_mask = torch.arange(target_length, device=device) <= (
1077
+ cache_position.reshape(-1, 1) - config.sliding_window
1078
+ )
1079
+ diagonal_attend_mask |= sliding_attend_mask
1080
+ causal_mask *= diagonal_attend_mask
1081
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
1082
+ if attention_mask is not None:
1083
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
1084
+ if attention_mask.shape[-1] > target_length:
1085
+ attention_mask = attention_mask[:, :target_length]
1086
+ mask_length = attention_mask.shape[-1]
1087
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
1088
+ padding_mask = padding_mask == 0
1089
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
1090
+ padding_mask, min_dtype
1091
+ )
1092
+ return causal_mask
1093
+
1094
+
1095
+ class Qwen2ForCausalLM(Qwen2PreTrainedModel, GenerationMixin):
1096
+ _tied_weights_keys = ["lm_head.weight"]
1097
+
1098
+ def __init__(self, config):
1099
+ super().__init__(config)
1100
+ self.model = Qwen2Model(config)
1101
+ self.vocab_size = config.vocab_size
1102
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1103
+
1104
+ # Initialize weights and apply final processing
1105
+ self.post_init()
1106
+
1107
+ def get_input_embeddings(self):
1108
+ return self.model.embed_tokens
1109
+
1110
+ def set_input_embeddings(self, value):
1111
+ self.model.embed_tokens = value
1112
+
1113
+ def get_output_embeddings(self):
1114
+ return self.lm_head
1115
+
1116
+ def set_output_embeddings(self, new_embeddings):
1117
+ self.lm_head = new_embeddings
1118
+
1119
+ def set_decoder(self, decoder):
1120
+ self.model = decoder
1121
+
1122
+ def get_decoder(self):
1123
+ return self.model
1124
+
1125
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1126
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1127
+ def forward(
1128
+ self,
1129
+ input_ids: torch.LongTensor = None,
1130
+ attention_mask: Optional[torch.Tensor] = None,
1131
+ position_ids: Optional[torch.LongTensor] = None,
1132
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1133
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1134
+ labels: Optional[torch.LongTensor] = None,
1135
+ use_cache: Optional[bool] = None,
1136
+ output_attentions: Optional[bool] = None,
1137
+ output_hidden_states: Optional[bool] = None,
1138
+ return_dict: Optional[bool] = None,
1139
+ cache_position: Optional[torch.LongTensor] = None,
1140
+ num_logits_to_keep: int = 0,
1141
+ **loss_kwargs,
1142
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1143
+ r"""
1144
+ Args:
1145
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1146
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1147
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1148
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1149
+
1150
+ num_logits_to_keep (`int`, *optional*):
1151
+ Calculate logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
1152
+ `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
1153
+ token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
1154
+
1155
+ Returns:
1156
+
1157
+ Example:
1158
+
1159
+ ```python
1160
+ >>> from transformers import AutoTokenizer, Qwen2ForCausalLM
1161
+
1162
+ >>> model = Qwen2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1163
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1164
+
1165
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1166
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1167
+
1168
+ >>> # Generate
1169
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1170
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1171
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1172
+ ```"""
1173
+
1174
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1175
+ output_hidden_states = (
1176
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1177
+ )
1178
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1179
+
1180
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1181
+ outputs = self.model(
1182
+ input_ids=input_ids,
1183
+ attention_mask=attention_mask,
1184
+ position_ids=position_ids,
1185
+ past_key_values=past_key_values,
1186
+ inputs_embeds=inputs_embeds,
1187
+ use_cache=use_cache,
1188
+ output_attentions=output_attentions,
1189
+ output_hidden_states=output_hidden_states,
1190
+ return_dict=return_dict,
1191
+ cache_position=cache_position,
1192
+ )
1193
+
1194
+ hidden_states = outputs[0]
1195
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
1196
+ logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])
1197
+
1198
+ loss = None
1199
+ if labels is not None:
1200
+ loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
1201
+
1202
+ if not return_dict:
1203
+ output = (logits,) + outputs[1:]
1204
+ return (loss,) + output if loss is not None else output
1205
+
1206
+ return CausalLMOutputWithPast(
1207
+ loss=loss,
1208
+ logits=logits,
1209
+ past_key_values=outputs.past_key_values,
1210
+ hidden_states=outputs.hidden_states,
1211
+ attentions=outputs.attentions,
1212
+ )
1213
+
1214
+
1215
+ @add_start_docstrings(
1216
+ """
1217
+ The Qwen2 Model transformer with a sequence classification head on top (linear layer).
1218
+
1219
+ [`Qwen2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1220
+ (e.g. GPT-2) do.
1221
+
1222
+ Since it does classification on the last token, it needs to know the position of the last token. If a
1223
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1224
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1225
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1226
+ each row of the batch).
1227
+ """,
1228
+ QWEN2_START_DOCSTRING,
1229
+ )
1230
+ class Qwen2ForSequenceClassification(Qwen2PreTrainedModel):
1231
+ def __init__(self, config):
1232
+ super().__init__(config)
1233
+ self.num_labels = config.num_labels
1234
+ self.model = Qwen2Model(config)
1235
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1236
+
1237
+ # Initialize weights and apply final processing
1238
+ self.post_init()
1239
+
1240
+ def get_input_embeddings(self):
1241
+ return self.model.embed_tokens
1242
+
1243
+ def set_input_embeddings(self, value):
1244
+ self.model.embed_tokens = value
1245
+
1246
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1247
+ def forward(
1248
+ self,
1249
+ input_ids: torch.LongTensor = None,
1250
+ attention_mask: Optional[torch.Tensor] = None,
1251
+ position_ids: Optional[torch.LongTensor] = None,
1252
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1253
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1254
+ labels: Optional[torch.LongTensor] = None,
1255
+ use_cache: Optional[bool] = None,
1256
+ output_attentions: Optional[bool] = None,
1257
+ output_hidden_states: Optional[bool] = None,
1258
+ return_dict: Optional[bool] = None,
1259
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1260
+ r"""
1261
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1262
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1263
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
1264
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1265
+ """
1266
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1267
+
1268
+ transformer_outputs = self.model(
1269
+ input_ids,
1270
+ attention_mask=attention_mask,
1271
+ position_ids=position_ids,
1272
+ past_key_values=past_key_values,
1273
+ inputs_embeds=inputs_embeds,
1274
+ use_cache=use_cache,
1275
+ output_attentions=output_attentions,
1276
+ output_hidden_states=output_hidden_states,
1277
+ return_dict=return_dict,
1278
+ )
1279
+ hidden_states = transformer_outputs[0]
1280
+ logits = self.score(hidden_states)
1281
+
1282
+ if input_ids is not None:
1283
+ batch_size = input_ids.shape[0]
1284
+ else:
1285
+ batch_size = inputs_embeds.shape[0]
1286
+
1287
+ if self.config.pad_token_id is None and batch_size != 1:
1288
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1289
+ if self.config.pad_token_id is None:
1290
+ sequence_lengths = -1
1291
+ else:
1292
+ if input_ids is not None:
1293
+ # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
1294
+ sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
1295
+ sequence_lengths = sequence_lengths % input_ids.shape[-1]
1296
+ sequence_lengths = sequence_lengths.to(logits.device)
1297
+ else:
1298
+ sequence_lengths = -1
1299
+
1300
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1301
+
1302
+ loss = None
1303
+ if labels is not None:
1304
+ labels = labels.to(logits.device)
1305
+ if self.config.problem_type is None:
1306
+ if self.num_labels == 1:
1307
+ self.config.problem_type = "regression"
1308
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1309
+ self.config.problem_type = "single_label_classification"
1310
+ else:
1311
+ self.config.problem_type = "multi_label_classification"
1312
+
1313
+ if self.config.problem_type == "regression":
1314
+ loss_fct = MSELoss()
1315
+ if self.num_labels == 1:
1316
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1317
+ else:
1318
+ loss = loss_fct(pooled_logits, labels)
1319
+ elif self.config.problem_type == "single_label_classification":
1320
+ loss_fct = CrossEntropyLoss()
1321
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1322
+ elif self.config.problem_type == "multi_label_classification":
1323
+ loss_fct = BCEWithLogitsLoss()
1324
+ loss = loss_fct(pooled_logits, labels)
1325
+ if not return_dict:
1326
+ output = (pooled_logits,) + transformer_outputs[1:]
1327
+ return ((loss,) + output) if loss is not None else output
1328
+
1329
+ return SequenceClassifierOutputWithPast(
1330
+ loss=loss,
1331
+ logits=pooled_logits,
1332
+ past_key_values=transformer_outputs.past_key_values,
1333
+ hidden_states=transformer_outputs.hidden_states,
1334
+ attentions=transformer_outputs.attentions,
1335
+ )
1336
+
1337
+
1338
+ @add_start_docstrings(
1339
+ """
1340
+ The Qwen2 Model transformer with a token classification head on top (a linear layer on top of the hidden-states
1341
+ output) e.g. for Named-Entity-Recognition (NER) tasks.
1342
+ """,
1343
+ QWEN2_START_DOCSTRING,
1344
+ )
1345
+ # Copied from transformers.models.llama.modeling_llama.LlamaForTokenClassification with Llama->Qwen2, LLAMA->QWEN2
1346
+ class Qwen2ForTokenClassification(Qwen2PreTrainedModel):
1347
+ def __init__(self, config):
1348
+ super().__init__(config)
1349
+ self.num_labels = config.num_labels
1350
+ self.model = Qwen2Model(config)
1351
+ if getattr(config, "classifier_dropout", None) is not None:
1352
+ classifier_dropout = config.classifier_dropout
1353
+ elif getattr(config, "hidden_dropout", None) is not None:
1354
+ classifier_dropout = config.hidden_dropout
1355
+ else:
1356
+ classifier_dropout = 0.1
1357
+ self.dropout = nn.Dropout(classifier_dropout)
1358
+ self.score = nn.Linear(config.hidden_size, config.num_labels)
1359
+
1360
+ # Initialize weights and apply final processing
1361
+ self.post_init()
1362
+
1363
+ def get_input_embeddings(self):
1364
+ return self.model.embed_tokens
1365
+
1366
+ def set_input_embeddings(self, value):
1367
+ self.model.embed_tokens = value
1368
+
1369
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1370
+ @add_code_sample_docstrings(
1371
+ checkpoint=_CHECKPOINT_FOR_DOC,
1372
+ output_type=TokenClassifierOutput,
1373
+ config_class=_CONFIG_FOR_DOC,
1374
+ )
1375
+ def forward(
1376
+ self,
1377
+ input_ids: Optional[torch.LongTensor] = None,
1378
+ attention_mask: Optional[torch.Tensor] = None,
1379
+ position_ids: Optional[torch.LongTensor] = None,
1380
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1381
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1382
+ labels: Optional[torch.LongTensor] = None,
1383
+ use_cache: Optional[bool] = None,
1384
+ output_attentions: Optional[bool] = None,
1385
+ output_hidden_states: Optional[bool] = None,
1386
+ return_dict: Optional[bool] = None,
1387
+ ) -> Union[Tuple, TokenClassifierOutput]:
1388
+ r"""
1389
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1390
+ Labels for computing the token classification loss. Indices should be in `[0, ...,
1391
+ config.num_labels - 1]`. With `config.num_labels` classes, a per-token classification loss is
1392
+ computed (Cross-Entropy).
1393
+ """
1394
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1395
+
1396
+ outputs = self.model(
1397
+ input_ids,
1398
+ attention_mask=attention_mask,
1399
+ position_ids=position_ids,
1400
+ past_key_values=past_key_values,
1401
+ inputs_embeds=inputs_embeds,
1402
+ use_cache=use_cache,
1403
+ output_attentions=output_attentions,
1404
+ output_hidden_states=output_hidden_states,
1405
+ return_dict=return_dict,
1406
+ )
1407
+ sequence_output = outputs[0]
1408
+ sequence_output = self.dropout(sequence_output)
1409
+ logits = self.score(sequence_output)
1410
+
1411
+ loss = None
1412
+ if labels is not None:
1413
+ loss = self.loss_function(logits, labels, self.config)
1414
+
1415
+ if not return_dict:
1416
+ output = (logits,) + outputs[2:]
1417
+ return ((loss,) + output) if loss is not None else output
1418
+
1419
+ return TokenClassifierOutput(
1420
+ loss=loss,
1421
+ logits=logits,
1422
+ hidden_states=outputs.hidden_states,
1423
+ attentions=outputs.attentions,
1424
+ )
1425
+
1426
+
1427
+ @add_start_docstrings(
1428
+ """
1429
+ The Qwen2 Model transformer with a span classification head on top for extractive question-answering tasks like
1430
+ SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
1431
+ """,
1432
+ QWEN2_START_DOCSTRING,
1433
+ )
1434
+ # Copied from transformers.models.mistral.modeling_mistral.MistralForQuestionAnswering with Mistral->Qwen2, MISTRAL->QWEN2
1435
+ class Qwen2ForQuestionAnswering(Qwen2PreTrainedModel):
1436
+ base_model_prefix = "model"
1437
+
1438
+ # Copied from models.models.bloom.modeling_bloom.BloomForQuestionAnswering.__init__ with Bloom->Qwen2
1439
+ def __init__(self, config):
1440
+ super().__init__(config)
1441
+ self.model = Qwen2Model(config)
1442
+ self.qa_outputs = nn.Linear(config.hidden_size, 2)
1443
+
1444
+ # Initialize weights and apply final processing
1445
+ self.post_init()
1446
+
1447
+ def get_input_embeddings(self):
1448
+ return self.model.embed_tokens
1449
+
1450
+ def set_input_embeddings(self, value):
1451
+ self.model.embed_tokens = value
1452
+
1453
+ @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
1454
+ def forward(
1455
+ self,
1456
+ input_ids: Optional[torch.LongTensor] = None,
1457
+ attention_mask: Optional[torch.FloatTensor] = None,
1458
+ position_ids: Optional[torch.LongTensor] = None,
1459
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
1460
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1461
+ start_positions: Optional[torch.LongTensor] = None,
1462
+ end_positions: Optional[torch.LongTensor] = None,
1463
+ output_attentions: Optional[bool] = None,
1464
+ output_hidden_states: Optional[bool] = None,
1465
+ return_dict: Optional[bool] = None,
1466
+ **kwargs,
1467
+ ) -> Union[Tuple, QuestionAnsweringModelOutput]:
1468
+ r"""
1469
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1470
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1471
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1472
+ are not taken into account for computing the loss.
1473
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1474
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1475
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1476
+ are not taken into account for computing the loss.
1477
+ """
1478
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1479
+
1480
+ outputs = self.model(
1481
+ input_ids,
1482
+ attention_mask=attention_mask,
1483
+ position_ids=position_ids,
1484
+ past_key_values=past_key_values,
1485
+ inputs_embeds=inputs_embeds,
1486
+ output_attentions=output_attentions,
1487
+ output_hidden_states=output_hidden_states,
1488
+ return_dict=return_dict,
1489
+ )
1490
+
1491
+ sequence_output = outputs[0]
1492
+
1493
+ logits = self.qa_outputs(sequence_output)
1494
+ start_logits, end_logits = logits.split(1, dim=-1)
1495
+ start_logits = start_logits.squeeze(-1).contiguous()
1496
+ end_logits = end_logits.squeeze(-1).contiguous()
1497
+
1498
+ loss = None
1499
+ if start_positions is not None and end_positions is not None:
1500
+ loss = self.loss_function(start_logits, end_logits, start_positions, end_positions, **kwargs)
1501
+
1502
+ if not return_dict:
1503
+ output = (start_logits, end_logits) + outputs[2:]
1504
+ return ((loss,) + output) if loss is not None else output
1505
+
1506
+ return QuestionAnsweringModelOutput(
1507
+ loss=loss,
1508
+ start_logits=start_logits,
1509
+ end_logits=end_logits,
1510
+ hidden_states=outputs.hidden_states,
1511
+ attentions=outputs.attentions,
1512
+ )
1513
+
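The masking rule built in `_prepare_4d_causal_attention_mask_with_cache_position` above can be illustrated without tensors. The following is a minimal sketch, not the library code: `blocked_positions` is a hypothetical helper, it returns booleans instead of `min_dtype` fill values, and it assumes the sliding-window branch is always taken when a window is set (the original additionally checks the cache type).

```python
from typing import List, Optional


def blocked_positions(
    cache_position: List[int],
    target_length: int,
    sliding_window: Optional[int] = None,
) -> List[List[bool]]:
    """Return a (query, key) grid where True means "this key is masked out"."""
    grid = []
    for query_pos in cache_position:
        row = []
        for key_pos in range(target_length):
            # Causal rule: a query never attends to keys that come after it.
            blocked = key_pos > query_pos
            if sliding_window is not None:
                # Sliding-window rule: keys at or before (query_pos - window) are too old.
                blocked = blocked or key_pos <= query_pos - sliding_window
            row.append(blocked)
        grid.append(row)
    return grid


if __name__ == "__main__":
    # "x" marks a masked key, "." an attended key; one row per query position.
    for row in blocked_positions([0, 1, 2, 3], target_length=4, sliding_window=2):
        print("".join("x" if blocked else "." for blocked in row))
```

In the original, the same boolean pattern is multiplied into a tensor pre-filled with `torch.finfo(dtype).min`, so masked positions keep the large negative value and attended positions become zero.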
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
3
+ size 11421896
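The three lines added for `tokenizer.json` are a Git LFS pointer rather than the tokenizer itself, which matches the new `.gitattributes` rule. As a rough sketch of how such a pointer can be read, the snippet below splits each line into a key and a value; `parse_lfs_pointer` is a hypothetical helper written for illustration and is not part of any Git LFS tooling.

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each non-empty 'key value' line of a pointer file into a dict entry."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer content copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
size 11421896
"""

pointer = parse_lfs_pointer(POINTER)
algorithm, _, digest = pointer["oid"].partition(":")
size_bytes = int(pointer["size"])
print(algorithm, digest[:8], size_bytes)
```

The `size` field suggests the real file is roughly 11 MB, which would explain why it is stored through LFS instead of inline.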
tokenizer_config.json ADDED
@@ -0,0 +1,208 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content 
}}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
+ "clean_up_tokenization_spaces": false,
200
+ "eos_token": "<|im_end|>",
201
+ "errors": "replace",
202
+ "extra_special_tokens": {},
203
+ "model_max_length": 131072,
204
+ "pad_token": "<|endoftext|>",
205
+ "split_special_tokens": false,
206
+ "tokenizer_class": "Qwen2Tokenizer",
207
+ "unk_token": null
208
+ }
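For plain conversations, the `chat_template` above wraps each message in `<|im_start|>` and `<|im_end|>` and injects a default system prompt when none is given. The sketch below mirrors only that branch in plain Python, as an aid to reading the Jinja template; `render_chatml` is a hypothetical helper, and it deliberately rejects tool definitions, tool calls, and `tool` messages, which the real template handles separately.

```python
from typing import Dict, List

DEFAULT_SYSTEM = "You are a helpful assistant."


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render system/user/assistant messages the way the no-tools template branch does."""
    parts = []
    # The template always opens with a system turn, falling back to a default prompt.
    if messages[0]["role"] == "system":
        parts.append("<|im_start|>system\n" + messages[0]["content"] + "<|im_end|>\n")
    else:
        parts.append("<|im_start|>system\n" + DEFAULT_SYSTEM + "<|im_end|>\n")

    for index, message in enumerate(messages):
        role = message["role"]
        if role == "system" and index == 0:
            continue  # the leading system message was already emitted above
        if role not in ("user", "system", "assistant"):
            # Assumption of this sketch: tool messages are out of scope.
            raise ValueError(f"unsupported role in this sketch: {role}")
        parts.append("<|im_start|>" + role + "\n" + message["content"] + "<|im_end|>\n")

    if add_generation_prompt:
        # Leaves the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([{"role": "user", "content": "Hi"}], add_generation_prompt=True))
```

In practice the tokenizer's own `apply_chat_template` method should be preferred, since it evaluates the actual template, including the tool-calling branches omitted here.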
vocab.json ADDED
The diff for this file is too large to render. See raw diff