alea-institute committed on
Commit 3e70a27 · verified · 1 Parent(s): 2e77b0b

Update README and config files - README.md

Files changed (1):
  1. README.md +72 -5
README.md CHANGED
@@ -118,6 +118,58 @@ This model extends the context window from the standard 512 tokens to 4,096 tokens
3. Better handle documents with complex cross-references and definitions
4. Reduce the need for document chunking in downstream applications

+ ## Standard Test Examples
+
+ Using our standardized test examples for comparing embedding models:
+
+ ### Fill-Mask Results
+
+ While this model is primarily designed for feature extraction rather than masked language modeling, we tested it on standard examples for comparison purposes. Note that performance on the MLM task is extremely limited compared to models specifically optimized for it:
+
+ 1. **Contract Clause Heading**:
+    `"<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"`
+
+    Top 5 predictions:
+    1. Programs (0.0003)
+    2. for (0.0003)
+    3. to (0.0002)
+    4. , (0.0002)
+    5. the (0.0002)
+
+    Note: Unlike the other models in the family, this model shows virtually no confidence in its masked token predictions, with extremely low probability scores that suggest it should not be used for masked language modeling tasks.
+
+ 2. **Defined Term Example**:
+    `"<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"`
+
+    Top 5 predictions:
+    1. Applicants (0.0002)
+    2. volunteers (0.0001)
+    3. Individuals (0.0001)
+    4. Carriers (0.0001)
+    5. inventors (0.0001)
+
+ 3. **Regulation Example**:
+    `"<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"`
+
+    Top 5 predictions:
+    1. warrant (0.0002)
+    2. service (0.0001)
+    3. protective (0.0001)
+    4. permit (0.0001)
+    5. authorization (0.0001)
+
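These fill-mask numbers can be reproduced with a short `transformers` script. Below is a minimal sketch, assuming the repository's Hub ID (written here as a placeholder) and that `<|mask|>` is registered as the tokenizer's mask token:

```python
# Minimal fill-mask sketch. MODEL_ID is a placeholder; substitute the
# actual Hub ID for this repository.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "alea-institute/kl3m-embedding-model"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Explicit <|cls|>/<|sep|> tokens, and <|mask|> placed with no preceding space.
text = (
    "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents "
    "and warrants to the other party as of the date hereof as follows: <|sep|>"
)

inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and report the top-5 predictions with probabilities.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for token_id, p in zip(top.indices[0], top.values[0]):
    print(f"{tokenizer.decode(token_id)!r}: {p:.4f}")
```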
+ ### Document Similarity Results
+
+ Where this model truly shines is in document embedding and similarity, especially with its mean pooling strategy:
+
+ | Document Pair | Cosine Similarity (CLS token) | Cosine Similarity (Mean pooling) |
+ |---------------|-------------------------------|----------------------------------|
+ | Court Complaint vs. Consumer Terms | 0.595 | 0.659 |
+ | Court Complaint vs. Credit Agreement | 0.604 | 0.854 |
+ | Consumer Terms vs. Credit Agreement | 0.658 | 0.727 |
+
+ Both pooling strategies produce useful similarity scores, but they differ in magnitude: CLS token embeddings show moderately high similarity (0.658 between Consumer Terms and Credit Agreement), while mean pooling yields stronger document-level signals (0.854 between Court Complaint and Credit Agreement). Both strategies are therefore effective for document similarity with this model, though mean pooling may provide the more pronounced signal in some cases.
+
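The table can be reproduced with both pooling strategies side by side. A minimal sketch, under the same placeholder Hub ID assumption: mean pooling averages the token embeddings weighted by the attention mask, while the CLS strategy takes the first token's hidden state.

```python
# Compare CLS-token and mean-pooled document embeddings via cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "alea-institute/kl3m-embedding-model"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(text: str, strategy: str = "mean") -> torch.Tensor:
    # Wrap the document in the model's explicit CLS/SEP tokens.
    inputs = tokenizer(
        f"<|cls|> {text} <|sep|>",
        return_tensors="pt", truncation=True, max_length=4096,
        add_special_tokens=False,
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, dim]
    if strategy == "cls":
        return hidden[0, 0]                           # first (<|cls|>) position
    mask = inputs["attention_mask"][0].unsqueeze(-1)  # [seq_len, 1]
    return (hidden[0] * mask).sum(dim=0) / mask.sum()

complaint = "Plaintiff alleges that Defendant breached the agreement ..."
agreement = "This Credit Agreement is entered into as of the date hereof ..."
for strategy in ("cls", "mean"):
    sim = F.cosine_similarity(embed(complaint, strategy), embed(agreement, strategy), dim=0)
    print(strategy, round(sim.item(), 3))
```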
## Training

The model was trained on a diverse corpus of legal and financial documents, ensuring high-quality performance in these domains. This version has been specifically optimized for feature extraction with longer contexts, incorporating the following key aspects:
 
@@ -142,11 +194,11 @@ While the model architecture includes the capability for masked language modeling

## Special Tokens

- This model uses custom special tokens which must be used explicitly:
+ This model includes the following special tokens:

- - CLS token: `<|cls|>` (ID: 5) - Should be added at the beginning of input text
+ - CLS token: `<|cls|>` (ID: 5) - Used for the beginning of input text
- MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
- - SEP token: `<|sep|>` (ID: 4) - Should be added at the end of input text
+ - SEP token: `<|sep|>` (ID: 4) - Used for the end of input text
- PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
- BOS token: `<|start|>` (ID: 0) - Beginning of sequence
- EOS token: `<|end|>` (ID: 1) - End of sequence
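The listed token-to-ID mappings are easy to sanity-check directly from the tokenizer (placeholder Hub ID again):

```python
# Print the ID assigned to each special token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-embedding-model")  # placeholder

for token in ["<|start|>", "<|end|>", "<|pad|>", "<|sep|>", "<|cls|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```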
 
@@ -161,7 +213,11 @@ The model also includes additional special tokens for chat and instruction contexts:
- `<|instruction|>` (ID: 11)
- `</|instruction|>` (ID: 12)

- For best results, you should **explicitly add** the CLS and SEP tokens to your input text, as shown in the example above.
+ Important usage note:
+
+ When using the MASK token for predictions, be aware that this model uses a **space-prefixed BPE tokenizer**. The `<|mask|>` token should be placed immediately after the preceding token, with no space in between, because most tokens in this tokenizer carry a leading space in their encoding. For example: `"word<|mask|>"` rather than `"word <|mask|>"`.
+
+ This space-aware placement is crucial for getting accurate predictions.
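The effect of the space is visible directly in the tokenization. A small check, again with the placeholder Hub ID (the exact token strings depend on the KL3M vocabulary):

```python
# Tokenize both mask placements to see how the extra space changes the output.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-embedding-model")  # placeholder

good = '<|cls|> The "Effective<|mask|>" means the date set forth above. <|sep|>'   # no space before <|mask|>
bad  = '<|cls|> The "Effective <|mask|>" means the date set forth above. <|sep|>'  # extra space shifts tokens

print(tokenizer.tokenize(good))
print(tokenizer.tokenize(bad))
```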

## Limitations

 
@@ -176,7 +232,7 @@ While providing extended context capabilities, this model has some limitations:
## References

- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models]() (Forthcoming)
+ - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models](https://arxiv.org/abs/2504.07854)

## Citation

 
@@ -193,6 +249,17 @@ If you use this model in your research, please cite:
}
```

+ ```bibtex
+ @misc{bommarito2025kl3mdata,
+   title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
+   author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
+   year={2025},
+   eprint={2504.07854},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
## License

This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).