Update README.md

README.md

---
language:
- en
library_name: transformers
license: cc-by-4.0
tags:
- kl3m
- legal
- financial
- mlm
- roberta
- long-context
pipeline_tag: feature-extraction
widget:
- text: "<|cls|> This Credit Agreement is made and entered into as of [DATE], by and between [BORROWER NAME], a Delaware corporation with its principal place of business at [ADDRESS] (the \"<|mask|>\"), and [LENDER NAME], a national banking association (the \"Lender\"). <|sep|>"
- text: "<|cls|> Form 10-<|mask|> is a report filed with the Securities and Exchange Commission used by public companies to disclose financial results on a quarterly basis. It is submitted within 45 days of the end of each of the first three fiscal quarters of the company's fiscal year. <|sep|>"
- temperature: 0.7
- do_sample: true
date: '2025-02-28T00:00:00.000Z'
---

# kl3m-doc-pico-long-001

`kl3m-doc-pico-long-001` is a domain-specific model based on the RoBERTa architecture, designed for feature extraction in legal and financial document analysis with support for longer context windows. It shares the same basic architecture as the kl3m-doc-pico-001 model but has been trained with an extended context window of up to 4,096 tokens. While the model architecture supports masked language modeling (MLM), it is primarily optimized for feature extraction tasks.

## Model Details

- **Architecture**: RoBERTa
- **Size**: 41M parameters
- **Hidden Size**: 256
- **Layers**: 8
- **Attention Heads**: 8
- **Max Sequence Length**: 4,096
- **Tokenizer**: [alea-institute/kl3m-004-128k-cased](https://huggingface.co/alea-institute/kl3m-004-128k-cased)
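
These values can be checked against the published configuration once the model is downloaded. A minimal sketch, assuming the standard Hugging Face RoBERTa config attribute names:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
print(config.model_type)               # expected: "roberta"
print(config.hidden_size)              # expected: 256
print(config.num_hidden_layers)        # expected: 8
print(config.num_attention_heads)      # expected: 8
print(config.max_position_embeddings)  # roughly 4,096 (RoBERTa configs often add a small offset for special positions)
```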

## Use Cases

This model is particularly useful for:

- Document classification in legal and financial domains with lengthy documents
- Analyzing relationships between distant parts of a document
- Processing lengthy agreements, contracts, and regulatory filings
- Feature extraction for downstream legal analysis tasks requiring longer context

The extended context length makes this model especially suited for working with longer portions of legal documents, contracts, and financial statements that often exceed the context limits of standard models.

## Usage

The primary use case for this model is feature extraction for document embedding and downstream classification tasks. Here's how to use it for feature extraction:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
model = AutoModel.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

# Example with legal document context
text = "<|cls|> This Credit Agreement is made and entered into as of [DATE], by and between [BORROWER NAME], a Delaware corporation with its principal place of business at [ADDRESS] (the \"Borrower\"), and [LENDER NAME], a national banking association (the \"Lender\"). <|sep|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get the embeddings
# The CLS token embedding is typically used for classification tasks
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"CLS embedding shape: {cls_embedding.shape}")  # Should be [1, 256]

# For document similarity, you can use the mean of all token embeddings
mean_embedding = outputs.last_hidden_state.mean(dim=1)
print(f"Mean embedding shape: {mean_embedding.shape}")  # Should be [1, 256]

# You can also process multiple documents in a batch
texts = [
    "<|cls|> This Credit Agreement is made and entered into as of [DATE]... <|sep|>",
    "<|cls|> Form 10-Q is a report filed with the Securities and Exchange Commission... <|sep|>"
]
batch_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
batch_outputs = model(**batch_inputs)

# Get batch embeddings
batch_embeddings = batch_outputs.last_hidden_state[:, 0, :]
print(f"Batch embeddings shape: {batch_embeddings.shape}")  # Should be [2, 256]
```

While the model architecture supports masked language modeling (MLM), it was not specifically trained for this task. If you still want to experiment with it for MLM, you can use the following code, but be aware that performance may be limited:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load model with MLM head (note: this part wasn't specifically trained)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
mlm_model = AutoModelForMaskedLM.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

# Example with masked token
text = "<|cls|> Form 10-<|mask|> is a report filed with the Securities and Exchange Commission... <|sep|>"
inputs = tokenizer(text, return_tensors="pt")
outputs = mlm_model(**inputs)

# Get predictions for masked token
masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
probs = outputs.logits[0, masked_index].softmax(dim=0)
top_5 = torch.topk(probs, 5)

print("Top 5 predictions for masked token:")
for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
    token = tokenizer.decode(idx).strip()
    print(f"{i+1}. {token} ({score.item():.3f})")
```

## Long Context Capabilities

This model extends the context window from the standard 512 tokens to 4,096 tokens, enabling it to process much longer documents. The extended context window allows the model to:

1. Process full legal agreements and contracts without truncation
2. Maintain awareness of context from the beginning of a document when analyzing later sections
3. Better handle documents with complex cross-references and definitions
4. Reduce the need for document chunking in downstream applications
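
The sketch below illustrates points 1 and 4: embedding an entire document in a single forward pass instead of chunking it. The file path is hypothetical, and the truncation settings are one reasonable choice rather than a prescribed recipe.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
model = AutoModel.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

# Hypothetical long document (e.g., a full credit agreement) read from disk
with open("credit_agreement.txt") as f:
    long_text = "<|cls|> " + f.read() + " <|sep|>"

# Check how many tokens the document occupies relative to the 4,096-token window
token_count = len(tokenizer(long_text)["input_ids"])
print(f"Document length: {token_count} tokens (window: 4096)")

# Encode the whole document in one pass, truncating only if it exceeds the window
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
doc_embedding = outputs.last_hidden_state[:, 0, :]  # CLS embedding for the full document
```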
+
## Training
|
| 122 |
|
| 123 |
+
The model was trained on a diverse corpus of legal and financial documents, ensuring high-quality performance in these domains. This version has been specifically optimized for feature extraction with longer contexts, incorporating the following key aspects:
|
| 124 |
|
| 125 |
+
1. Position embeddings extended to 4,096 tokens
|
| 126 |
+
2. Training focused on dense document representation for retrieval and classification tasks
|
| 127 |
+
3. Objectives optimized for feature extraction rather than token prediction
|
| 128 |
+
4. Additional training on full-length documents to ensure contextual understanding
|
|
|
|
| 129 |
|
| 130 |
+
It leverages the KL3M tokenizer which provides 9-17% more efficient tokenization for domain-specific content than general-purpose tokenizers.
|
| 131 |
|
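
The efficiency figure can be checked empirically. The snippet below is an illustrative comparison only: the baseline tokenizer (`roberta-base`) and the sample sentence are arbitrary choices, and the exact savings will vary by text.

```python
from transformers import AutoTokenizer

kl3m_tok = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")
baseline_tok = AutoTokenizer.from_pretrained("roberta-base")  # generic comparison point

sample = (
    "The Borrower shall deliver to the Administrative Agent a compliance certificate "
    "executed by a Responsible Officer within forty-five (45) days after each fiscal quarter."
)
kl3m_count = len(kl3m_tok(sample)["input_ids"])
baseline_count = len(baseline_tok(sample)["input_ids"])
print(f"KL3M tokens: {kl3m_count}, roberta-base tokens: {baseline_count}")
```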

### Intended Usage

This model was specifically designed and trained for:

1. **Document embedding**: Generating fixed-length vector representations of documents for similarity comparison
2. **Feature extraction**: Creating inputs for downstream classification and regression tasks
3. **Semantic search**: Finding similar documents across large collections
4. **Document clustering**: Discovering patterns across legal and financial document collections

While the model architecture includes the capability for masked language modeling, the model weights were not specifically optimized for this task. For masked language modeling applications, consider using a model explicitly trained for that purpose.
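
A minimal sketch of the document-embedding and semantic-search uses (items 1 and 3), assuming CLS embeddings compared with cosine similarity; other pooling and similarity choices are equally valid:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
model = AutoModel.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

def embed(texts: list[str]) -> torch.Tensor:
    """Return CLS embeddings for a batch of documents."""
    inputs = tokenizer(
        [f"<|cls|> {t} <|sep|>" for t in texts],
        padding=True, truncation=True, max_length=4096, return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]

query = embed(["Quarterly report filed with the Securities and Exchange Commission"])
corpus = embed([
    "This Credit Agreement is made and entered into as of [DATE]...",
    "Form 10-Q is a report filed with the Securities and Exchange Commission...",
])

# Cosine similarity between the query and each corpus document
scores = F.cosine_similarity(query, corpus)
print(scores)  # the 10-Q description should score higher than the credit agreement
```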

## Special Tokens

This model uses custom special tokens which must be used explicitly:

- CLS token: `<|cls|>` (ID: 5) - Should be added at the beginning of input text
- MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
- SEP token: `<|sep|>` (ID: 4) - Should be added at the end of input text
- PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
- BOS token: `<|start|>` (ID: 0) - Beginning of sequence
- EOS token: `<|end|>` (ID: 1) - End of sequence
- UNK token: `<|unk|>` (ID: 3) - Unknown token

The model also includes additional special tokens for chat and instruction contexts:

- `<|system|>` (ID: 7)
- `</|system|>` (ID: 8)
- `<|user|>` (ID: 9)
- `</|user|>` (ID: 10)
- `<|instruction|>` (ID: 11)
- `</|instruction|>` (ID: 12)

For best results, you should **explicitly add** the CLS and SEP tokens to your input text, as shown in the example above.
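
These IDs can be confirmed directly from the tokenizer. A quick check (not part of the original card), assuming the tokenizer listed under Model Details:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")

# Print the ID the tokenizer assigns to each special token; expected values are listed above
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|sep|>", "<|cls|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```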

## Limitations

While providing extended context capabilities, this model has some limitations:

- Smaller parameter count (41M) compared to larger language models
- Primarily focused on English legal and financial texts
- Best suited for domain-specific rather than general-purpose tasks
- Requires domain expertise to interpret results effectively
- May show decreased performance at the far edges of the context window

## References

- [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
- The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models (forthcoming)

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bommarito2025kl3m,
  title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
  author={Bommarito II, Michael J. and Katz, Daniel Martin and Bommarito, Jillian},
  year={2025},
  eprint={2503.17247},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## License

This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).

## Contact

The KL3M model family is maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

- Email: [email protected]
- Website: https://aleainstitute.ai
- GitHub: https://github.com/alea-institute/kl3m-model-research

![ALEA Logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)