vxo Minsang committed
Commit 590ab12 · verified · 0 parent(s)

Duplicate from skt/A.X-Encoder-base

Co-authored-by: MinsangKim <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/A.X_from_scratch_logo_ko_4x3.png filter=lfs diff=lfs merge=lfs -text
+ assets/performance.png filter=lfs diff=lfs merge=lfs -text
+ assets/speed.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ license: apache-2.0
+ license_link: https://huggingface.co/skt/A.X-3.1/blob/main/LICENSE
+ language:
+ - en
+ - ko
+ pipeline_tag: text-classification
+ library_name: transformers
+ model_id: skt/A.X-Encoder-base
+ developers: SKT AI Model Lab
+ model-index:
+ - name: A.X-Encoder-base
+   results:
+   - task:
+       type: text-classification
+       name: kobest
+     metrics:
+     - type: KoBEST
+       value: 85.50
+   - task:
+       type: text-classification
+       name: klue
+     metrics:
+     - type: KLUE
+       value: 86.10
+ ---
+
+ # A.X Encoder
+
+ <div align="center">
+ <img src="./assets/A.X_from_scratch_logo_ko_4x3.png" alt="A.X Logo" width="300"/>
+ </div>
+
+ ## A.X Encoder Highlights
+
+ **A.X Encoder** (pronounced "A dot X") is SKT's document understanding model, optimized for Korean-language understanding and enterprise deployment.
+ This lightweight encoder was developed entirely in-house by SKT, encompassing model architecture, data curation, and training, all carried out on SKT's proprietary supercomputing infrastructure, TITAN.
+ The model uses the ModernBERT architecture, which supports Flash Attention and long-context processing.
+
+ - **Longer Context**: A.X Encoder supports long-context processing of up to **16,384** tokens.
+ - **Faster Inference**: A.X Encoder achieves up to 3x faster inference than earlier models.
+ - **Superior Korean Language Understanding**: A.X Encoder achieves superior performance on diverse Korean NLU tasks.
+
+
+ ## Core Technologies
+
+ A.X Encoder is **an efficient long-document understanding model** for processing large-scale corpora, developed end-to-end by SKT.
+
+ This model plays a key role in **data curation for A.X LLM** by serving as a versatile document classifier, identifying features such as educational value, domain category, and difficulty level. A minimal classification sketch follows.
+
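+ The curation classifiers themselves are not part of this release, so the snippet below is only a sketch of using the encoder as a document classifier; the three-way label set is hypothetical, and the classification head is freshly initialized, so it must be fine-tuned before its outputs mean anything:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_id = "skt/A.X-Encoder-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ # num_labels and the label meanings (e.g. "web", "news", "academic") are hypothetical;
+ # the classification head is newly initialized and requires fine-tuning.
+ model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)
+
+ inputs = tokenizer("교육적 가치가 높은 문서입니다.", return_tensors="pt")  # "A document with high educational value."
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ print(logits.argmax(dim=-1).item())  # predicted class index (meaningless until fine-tuned)
+ ```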
+ ## Benchmark Results
+
+ ### Model Inference Speed (measured on an A100 GPU)
+ <div align="center">
+ <img src="./assets/speed.png" alt="inference" width="500"/>
+ </div>
+
+ ### Model Performance
+ <div align="center">
+ <img src="./assets/performance.png" alt="performance" width="500"/>
+ </div>
+
+ | Method | BoolQ (f1) | COPA (f1) | SentiNeg (f1) | WiC (f1) | **Avg. (KoBEST)** |
+ | ----------------------------- | ---------- | --------- | ------------- | -------- | ----------------- |
+ | **klue/roberta-base** | 72.04 | 65.14 | 90.39 | 78.19 | 76.44 |
+ | **kakaobank/kf-deberta-base** | 81.30 | 76.50 | 94.70 | 80.50 | 83.25 |
+ | **skt/A.X-Encoder-base** | 84.50 | 78.70 | 96.00 | 80.80 | **85.50** |
+
+
+ | Method | NLI (acc) | STS (f1) | YNAT (acc) | **Avg. (KLUE)** |
+ | ----------------------------- | --------- | -------- | ---------- | --------------- |
+ | **klue/roberta-base** | 84.53 | 84.57 | 86.48 | 85.19 |
+ | **kakaobank/kf-deberta-base** | 86.10 | 84.30 | 87.00 | 85.80 |
+ | **skt/A.X-Encoder-base** | 87.00 | 84.80 | 86.50 | **86.10** |
+
+
+ ## 🚀 Quickstart
+
+ ### with HuggingFace Transformers
+
+ - `transformers>=4.51.0` is required to use `skt/A.X-Encoder-base`:
+ ```bash
+ pip install "transformers>=4.51.0"
+ ```
+
+ ⚠️ If your GPU supports it, we recommend using A.X Encoder with Flash Attention 2 for the best efficiency. Install Flash Attention as follows, then use the model as usual:
+
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
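+ If Flash Attention 2 is unavailable (for example, on CPU or an unsupported GPU), the model can still be loaded with a built-in attention backend. The sketch below assumes the `sdpa` implementation is supported by ModernBERT in recent `transformers` releases:
+
+ ```python
+ from transformers import AutoModelForMaskedLM
+
+ # Fallback without flash-attn: PyTorch scaled-dot-product attention
+ # (assumption: "sdpa" is supported by this model class in your transformers version).
+ model = AutoModelForMaskedLM.from_pretrained("skt/A.X-Encoder-base", attn_implementation="sdpa")
+ ```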
+ #### Example Usage
+
+ Using AutoModelForMaskedLM:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ model_id = "skt/A.X-Encoder-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ # Flash Attention 2 requires a CUDA device, so move the model (and inputs) to the GPU.
+ model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).to("cuda")
+
+ text = "한국의 수도는 <mask>."  # "The capital of Korea is <mask>."
+ inputs = tokenizer(text, return_tensors="pt").to("cuda")
+ outputs = model(**inputs)
+
+ # To get predictions for the mask:
+ masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
+ predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
+ predicted_token = tokenizer.decode(predicted_token_id)
+ print("Predicted token:", predicted_token)
+ # Predicted token: 서울 (Seoul)
+ ```
+
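+ To exercise the long-context window, pass `max_length` explicitly: the model supports 16,384 positions, but the shipped `tokenizer_config.json` sets `model_max_length` to 1024. A minimal sketch reusing `tokenizer` and `model` from above (`long_document` is a placeholder for your own text):
+
+ ```python
+ long_document = "..."  # placeholder: a long Korean document
+ # Be explicit about max_length; the tokenizer's default model_max_length is 1024.
+ inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384).to(model.device)
+ outputs = model(**inputs)
+ ```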
+ Using a pipeline:
+
+ ```python
+ import torch
+ from transformers import pipeline
+ from pprint import pprint
+
+ pipe = pipeline(
+     "fill-mask",
+     model="skt/A.X-Encoder-base",
+     torch_dtype=torch.bfloat16,
+ )
+
+ input_text = "한국의 수도는 <mask>."  # "The capital of Korea is <mask>."
+ results = pipe(input_text)
+ pprint(results)
+ # [{'score': 0.07568359375,
+ #   'sequence': '한국의 수도는 서울.',
+ #   'token': 31430,
+ #   'token_str': '서울'}, ...
+ ```
+
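+ The committed `config.json` and tokenizer files (shown later in this commit) can also be inspected programmatically; the printed values below simply mirror that JSON:
+
+ ```python
+ from transformers import AutoConfig, AutoTokenizer
+
+ config = AutoConfig.from_pretrained("skt/A.X-Encoder-base")
+ print(config.model_type)               # modernbert
+ print(config.max_position_embeddings)  # 16384 — the long-context window
+ print(config.hidden_size, config.num_hidden_layers)  # 768 22
+
+ tokenizer = AutoTokenizer.from_pretrained("skt/A.X-Encoder-base")
+ print(tokenizer.mask_token, tokenizer.mask_token_id)  # <mask> 4, per the tokenizer files below
+ ```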
+ ## License
+
+ The `A.X Encoder` model is licensed under the `Apache License 2.0`.
+
+ ## Citation
+ ```
+ @article{SKTAdotXEncoder-base,
+   title={A.X Encoder-base},
+   author={SKT AI Model Lab},
+   year={2025},
+   url={https://huggingface.co/skt/A.X-Encoder-base}
+ }
+ ```
+
+ ## Contact
+
+ - Business & Partnership Contact: [[email protected]]([email protected])
assets/A.X_from_scratch_logo_ko_4x3.png ADDED

Git LFS Details

  • SHA256: 5e41e606567955055a7085c604d922728c926412eafe970faf1d1ec6a62f78b4
  • Pointer size: 131 Bytes
  • Size of remote file: 164 kB
assets/performance.png ADDED

Git LFS Details

  • SHA256: b07261ba4e342ca54f5184f764c771a19544f855a1f2f1ae611038a8dad9b91d
  • Pointer size: 131 Bytes
  • Size of remote file: 100 kB
assets/speed.png ADDED

Git LFS Details

  • SHA256: 21ec0df2dd257b805f6beb236a0e873a754d9f23fb4ed834fcf470d8ed054a71
  • Pointer size: 131 Bytes
  • Size of remote file: 295 kB
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "architectures": [
+     "ModernBertForMaskedLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "classifier_activation": "gelu",
+   "classifier_bias": false,
+   "classifier_dropout": 0.0,
+   "classifier_pooling": "mean",
+   "cls_token_id": 0,
+   "decoder_bias": true,
+   "deterministic_flash_attn": false,
+   "embedding_dropout": 0.0,
+   "eos_token_id": 1,
+   "global_attn_every_n_layers": 3,
+   "global_rope_theta": 160000,
+   "gradient_checkpointing": false,
+   "hidden_activation": "gelu",
+   "hidden_size": 768,
+   "initializer_cutoff_factor": 2.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_norm_eps": 1e-05,
+   "local_attention": 128,
+   "local_rope_theta": 10000.0,
+   "max_position_embeddings": 16384,
+   "mlp_bias": false,
+   "mlp_dropout": 0.0,
+   "model_type": "modernbert",
+   "norm_bias": false,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 22,
+   "pad_token_id": 49999,
+   "position_embedding_type": "absolute",
+   "reference_compile": true,
+   "repad_logits_with_grad": false,
+   "sep_token_id": 1,
+   "sparse_pred_ignore_index": -100,
+   "sparse_prediction": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.50.0",
+   "vocab_size": 50000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c61b00328e5bb68bb18921baa0d4b0e83e76a57975e9cbdd9791850e25457089
+ size 298758696
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<cls>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<\\s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "<sep>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,322 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<\\s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<sep>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<mask>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<cls>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "<unused0>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "7": {
+       "content": "<unused1>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "8": {
+       "content": "<unused2>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "9": {
+       "content": "<unused3>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "10": {
+       "content": "<unused4>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "11": {
+       "content": "<unused5>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "12": {
+       "content": "<unused6>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "13": {
+       "content": "<unused7>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "14": {
+       "content": "<unused8>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "15": {
+       "content": "<unused9>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "16": {
+       "content": "<unused10>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "17": {
+       "content": "<unused11>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "18": {
+       "content": "<unused12>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "19": {
+       "content": "<unused13>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "20": {
+       "content": "<unused14>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "21": {
+       "content": "<unused15>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "22": {
+       "content": "<unused16>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "23": {
+       "content": "<unused17>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "24": {
+       "content": "<unused18>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "25": {
+       "content": "<unused19>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "26": {
+       "content": "<unused20>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "27": {
+       "content": "<unused21>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "28": {
+       "content": "<unused22>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "29": {
+       "content": "<unused23>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30": {
+       "content": "<unused24>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "31": {
+       "content": "<unused25>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32": {
+       "content": "<unused26>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "33": {
+       "content": "<unused27>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "34": {
+       "content": "<unused28>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "35": {
+       "content": "<unused29>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "36": {
+       "content": "<unused30>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49999": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<cls>",
+   "do_lower_case": false,
+   "eos_token": "<\\s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 1024,
+   "pad_token": "<pad>",
+   "sep_token": "<sep>",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "<unk>"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff