alea-institute committed on
Commit 3e70a27 · verified · 1 Parent(s): 2e77b0b

Update README and config files - README.md

Files changed (1):
  1. README.md +72 -5
README.md CHANGED
@@ -118,6 +118,58 @@ This model extends the context window from the standard 512 tokens to 4,096 tokens
3. Better handle documents with complex cross-references and definitions
4. Reduce the need for document chunking in downstream applications

+ ## Standard Test Examples
+
+ Using our standardized test examples for comparing embedding models:
+
+ ### Fill-Mask Results
+
+ While this model is primarily designed for feature extraction rather than masked language modeling, we tested it on standard examples for comparison purposes. Note that performance on the MLM task is extremely limited compared to models specifically optimized for it:
+
+ 1. **Contract Clause Heading**:
+    `"<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents and warrants to the other party as of the date hereof as follows: <|sep|>"`
+
+    Top 5 predictions:
+    1. Programs (0.0003)
+    2. for (0.0003)
+    3. to (0.0002)
+    4. , (0.0002)
+    5. the (0.0002)
+
+    Note: Unlike the other models in the family, this model shows virtually no confidence in its masked token predictions, with extremely low probability scores that suggest it should not be used for masked language modeling tasks.
+
+ 2. **Defined Term Example**:
+    `"<|cls|> \"Effective<|mask|>\" means the date on which all conditions precedent set forth in Article V are satisfied or waived by the Administrative Agent. <|sep|>"`
+
+    Top 5 predictions:
+    1. Applicants (0.0002)
+    2. volunteers (0.0001)
+    3. Individuals (0.0001)
+    4. Carriers (0.0001)
+    5. inventors (0.0001)
+
+ 3. **Regulation Example**:
+    `"<|cls|> All transactions shall comply with the requirements set forth in the Truth in<|mask|> Act and its implementing Regulation Z. <|sep|>"`
+
+    Top 5 predictions:
+    1. warrant (0.0002)
+    2. service (0.0001)
+    3. protective (0.0001)
+    4. permit (0.0001)
+    5. authorization (0.0001)
+
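These fill-mask numbers can be reproduced with a short `transformers` script. Below is a minimal sketch, assuming the repository's Hub ID (written here as a placeholder) and that `<|mask|>` is registered as the tokenizer's mask token:

```python
# Minimal fill-mask sketch. MODEL_ID is a placeholder; substitute the
# actual Hub ID for this repository.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "alea-institute/kl3m-embedding-model"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Explicit <|cls|>/<|sep|> tokens, and <|mask|> placed with no preceding space.
text = (
    "<|cls|> 8. REPRESENTATIONS AND<|mask|>. Each party hereby represents "
    "and warrants to the other party as of the date hereof as follows: <|sep|>"
)

inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and report the top-5 predictions with probabilities.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for token_id, p in zip(top.indices[0], top.values[0]):
    print(f"{tokenizer.decode(token_id)!r}: {p:.4f}")
```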
+ ### Document Similarity Results
+
+ Where this model truly shines is in document embedding and similarity, especially with its mean pooling strategy:
+
+ | Document Pair | Cosine Similarity (CLS token) | Cosine Similarity (Mean pooling) |
+ |---------------|-------------------------------|----------------------------------|
+ | Court Complaint vs. Consumer Terms | 0.595 | 0.659 |
+ | Court Complaint vs. Credit Agreement | 0.604 | 0.854 |
+ | Consumer Terms vs. Credit Agreement | 0.658 | 0.727 |
+
+ Both pooling strategies produce useful similarity scores, but they differ in magnitude: CLS token embeddings show moderately high similarity (0.658 between Consumer Terms and Credit Agreement), while mean pooling yields stronger document-level signals (0.854 between Court Complaint and Credit Agreement). Both strategies are therefore effective for document similarity with this model, though mean pooling may provide the more pronounced signal in some cases.
+
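The table can be reproduced with both pooling strategies side by side. A minimal sketch, under the same placeholder Hub ID assumption: mean pooling averages the token embeddings weighted by the attention mask, while the CLS strategy takes the first token's hidden state.

```python
# Compare CLS-token and mean-pooled document embeddings via cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "alea-institute/kl3m-embedding-model"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(text: str, strategy: str = "mean") -> torch.Tensor:
    # Wrap the document in the model's explicit CLS/SEP tokens.
    inputs = tokenizer(
        f"<|cls|> {text} <|sep|>",
        return_tensors="pt", truncation=True, max_length=4096,
        add_special_tokens=False,
    )
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, dim]
    if strategy == "cls":
        return hidden[0, 0]                           # first (<|cls|>) position
    mask = inputs["attention_mask"][0].unsqueeze(-1)  # [seq_len, 1]
    return (hidden[0] * mask).sum(dim=0) / mask.sum()

complaint = "Plaintiff alleges that Defendant breached the agreement ..."
agreement = "This Credit Agreement is entered into as of the date hereof ..."
for strategy in ("cls", "mean"):
    sim = F.cosine_similarity(embed(complaint, strategy), embed(agreement, strategy), dim=0)
    print(strategy, round(sim.item(), 3))
```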
## Training

The model was trained on a diverse corpus of legal and financial documents, ensuring high-quality performance in these domains. This version has been specifically optimized for feature extraction with longer contexts, incorporating the following key aspects:
 
@@ -142,11 +194,11 @@ While the model architecture includes the capability for masked language modeling

## Special Tokens

- This model uses custom special tokens which must be used explicitly:
+ This model includes the following special tokens:

- - CLS token: `<|cls|>` (ID: 5) - Should be added at the beginning of input text
+ - CLS token: `<|cls|>` (ID: 5) - Used for the beginning of input text
- MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
- - SEP token: `<|sep|>` (ID: 4) - Should be added at the end of input text
+ - SEP token: `<|sep|>` (ID: 4) - Used for the end of input text
- PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
- BOS token: `<|start|>` (ID: 0) - Beginning of sequence
- EOS token: `<|end|>` (ID: 1) - End of sequence
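The listed token-to-ID mappings are easy to sanity-check directly from the tokenizer (placeholder Hub ID again):

```python
# Print the ID assigned to each special token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-embedding-model")  # placeholder

for token in ["<|start|>", "<|end|>", "<|pad|>", "<|sep|>", "<|cls|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```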
 
@@ -161,7 +213,11 @@ The model also includes additional special tokens for chat and instruction contexts:
- `<|instruction|>` (ID: 11)
- `</|instruction|>` (ID: 12)

- For best results, you should **explicitly add** the CLS and SEP tokens to your input text, as shown in the example above.
+ Important usage note:
+
+ When using the MASK token for predictions, be aware that this model uses a **space-prefixed BPE tokenizer**. The `<|mask|>` token should be placed immediately after the preceding token, with no space in between, because most tokens in this tokenizer carry a leading space in their encoding. For example: `"word<|mask|>"` rather than `"word <|mask|>"`.
+
+ This space-aware placement is crucial for getting accurate predictions.
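The effect of the space is visible directly in the tokenization. A small check, again with the placeholder Hub ID (the exact token strings depend on the KL3M vocabulary):

```python
# Tokenize both mask placements to see how the extra space changes the output.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-embedding-model")  # placeholder

good = '<|cls|> The "Effective<|mask|>" means the date set forth above. <|sep|>'   # no space before <|mask|>
bad  = '<|cls|> The "Effective <|mask|>" means the date set forth above. <|sep|>'  # extra space shifts tokens

print(tokenizer.tokenize(good))
print(tokenizer.tokenize(bad))
```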

## Limitations

 
@@ -176,7 +232,7 @@ While providing extended context capabilities, this model has some limitations:
## References

- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models]() (Forthcoming)
+ - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models](https://arxiv.org/abs/2504.07854)

## Citation

 
@@ -193,6 +249,17 @@ If you use this model in your research, please cite:
}
```

+ ```bibtex
+ @misc{bommarito2025kl3mdata,
+   title={The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models},
+   author={Bommarito II, Michael J. and Bommarito, Jillian and Katz, Daniel Martin},
+   year={2025},
+   eprint={2504.07854},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
## License

This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).