alea-institute committed · verified
Commit 2e77b0b · 1 Parent(s): 586420c

Update README.md

Files changed (1)
  1. README.md +151 -142
README.md CHANGED
@@ -1,199 +1,208 @@
 ---
 library_name: transformers
- tags: []
 ---
 
- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
 
 ## Model Details
 
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
 
- [More Information Needed]
 
 ---
+ language:
+ - en
 library_name: transformers
+ license: cc-by-4.0
+ tags:
+ - kl3m
+ - legal
+ - financial
+ - mlm
+ - roberta
+ - long-context
+ pipeline_tag: feature-extraction
+ widget:
+ - text: "<|cls|> This Credit Agreement is made and entered into as of [DATE], by and between [BORROWER NAME], a Delaware corporation with its principal place of business at [ADDRESS] (the \"<|mask|>\"), and [LENDER NAME], a national banking association (the \"Lender\"). <|sep|>"
+ - text: "<|cls|> Form 10-<|mask|> is a report filed with the Securities and Exchange Commission used by public companies to disclose financial results on a quarterly basis. It is submitted within 45 days of the end of each of the first three fiscal quarters of the company's fiscal year. <|sep|>"
+   temperature: 0.7
+   do_sample: true
+ date: '2025-02-28T00:00:00.000Z'
 ---
 
+ # kl3m-doc-pico-long-001

+ `kl3m-doc-pico-long-001` is a domain-specific RoBERTa model designed for feature extraction over legal and financial documents, with support for longer context windows. It shares the same basic architecture as kl3m-doc-pico-001 but has been trained with an extended context window of up to 4,096 tokens. While the architecture supports masked language modeling (MLM), the model is primarily optimized for feature extraction.

 ## Model Details

+ - **Architecture**: RoBERTa
+ - **Size**: 41M parameters
+ - **Hidden Size**: 256
+ - **Layers**: 8
+ - **Attention Heads**: 8
+ - **Max Sequence Length**: 4,096
+ - **Tokenizer**: [alea-institute/kl3m-004-128k-cased](https://huggingface.co/alea-institute/kl3m-004-128k-cased)
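
+ The listed dimensions can be read back from the published configuration as a quick sanity check. A minimal sketch using the standard Hugging Face RoBERTa config fields (the attribute names below are generic config fields, not values quoted from this card):

+ ```python
+ from transformers import AutoConfig

+ # Load the configuration shipped with the checkpoint
+ config = AutoConfig.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

+ # Standard RoBERTa config attributes; expect 256 hidden size, 8 layers, 8 heads
+ print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)

+ # RoBERTa-style configs often report max_position_embeddings with a small
+ # offset for padding positions, so a value at or slightly above 4,096 is expected
+ print(config.max_position_embeddings)
+ ```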

+ ## Use Cases

+ This model is particularly useful for:

+ - Document classification in legal and financial domains with lengthy documents
+ - Analyzing relationships between distant parts of a document
+ - Processing lengthy agreements, contracts, and regulatory filings
+ - Feature extraction for downstream legal analysis tasks requiring longer context

+ The extended context length makes this model especially suited for working with longer portions of legal documents, contracts, and financial statements that often exceed the context limits of standard models.

+ ## Usage

+ The primary use case for this model is feature extraction for document embedding and downstream classification tasks. Here's how to use it for feature extraction:

+ ```python
+ from transformers import AutoModel, AutoTokenizer
+ import torch

+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
+ model = AutoModel.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

+ # Example with legal document context
+ text = "<|cls|> This Credit Agreement is made and entered into as of [DATE], by and between [BORROWER NAME], a Delaware corporation with its principal place of business at [ADDRESS] (the \"Borrower\"), and [LENDER NAME], a national banking association (the \"Lender\"). <|sep|>"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model(**inputs)

+ # Get the embeddings
+ # The CLS token embedding is typically used for classification tasks
+ cls_embedding = outputs.last_hidden_state[:, 0, :]
+ print(f"CLS embedding shape: {cls_embedding.shape}")  # Should be [1, 256]

+ # For document similarity, you can use the mean of all token embeddings
+ mean_embedding = outputs.last_hidden_state.mean(dim=1)
+ print(f"Mean embedding shape: {mean_embedding.shape}")  # Should be [1, 256]

+ # You can also process multiple documents in a batch
+ texts = [
+     "<|cls|> This Credit Agreement is made and entered into as of [DATE]... <|sep|>",
+     "<|cls|> Form 10-Q is a report filed with the Securities and Exchange Commission... <|sep|>"
+ ]
+ batch_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+ batch_outputs = model(**batch_inputs)

+ # Get batch embeddings (CLS token of each document)
+ batch_embeddings = batch_outputs.last_hidden_state[:, 0, :]
+ print(f"Batch embeddings shape: {batch_embeddings.shape}")  # Should be [2, 256]
+ ```
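
+ Note that with padded batches, a plain mean over `last_hidden_state` also averages embeddings at padding positions. A minimal sketch of attention-mask-aware mean pooling and cosine similarity, continuing from the batch example above (this pooling recipe is a common convention, not something prescribed by this card):

+ ```python
+ import torch.nn.functional as F

+ # Mask-aware mean pooling: average only over real (non-padding) tokens
+ mask = batch_inputs["attention_mask"].unsqueeze(-1).float()   # [batch, seq_len, 1]
+ summed = (batch_outputs.last_hidden_state * mask).sum(dim=1)  # [batch, hidden]
+ pooled = summed / mask.sum(dim=1).clamp(min=1e-9)             # [batch, hidden]

+ # Cosine similarity between the two example documents
+ similarity = F.cosine_similarity(pooled[0], pooled[1], dim=0)
+ print(f"Document similarity: {similarity.item():.3f}")
+ ```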

+ While the model architecture supports masked language modeling (MLM), it was not specifically trained for this task. If you still want to experiment with it for MLM, you can use the following code, but be aware that performance may be limited:

+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+ import torch

+ # Load model with MLM head (note: this head was not specifically trained)
+ tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-doc-pico-long-001")
+ mlm_model = AutoModelForMaskedLM.from_pretrained("alea-institute/kl3m-doc-pico-long-001")

+ # Example with masked token
+ text = "<|cls|> Form 10-<|mask|> is a report filed with the Securities and Exchange Commission... <|sep|>"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = mlm_model(**inputs)

+ # Get predictions for the masked token
+ masked_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0].item()
+ probs = outputs.logits[0, masked_index].softmax(dim=0)
+ top_5 = torch.topk(probs, 5)

+ print("Top 5 predictions for masked token:")
+ for i, (score, idx) in enumerate(zip(top_5.values, top_5.indices)):
+     token = tokenizer.decode(idx).strip()
+     print(f"{i+1}. {token} ({score.item():.3f})")
+ ```
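
+ The same experiment can be run with the `fill-mask` pipeline, which reads the custom `<|mask|>` token from the tokenizer configuration. A brief sketch (the same caveat about the untrained MLM head applies):

+ ```python
+ from transformers import pipeline

+ # The pipeline picks up the mask token ("<|mask|>") from the tokenizer
+ fill_mask = pipeline("fill-mask", model="alea-institute/kl3m-doc-pico-long-001")

+ text = "<|cls|> Form 10-<|mask|> is a report filed with the Securities and Exchange Commission... <|sep|>"
+ for prediction in fill_mask(text, top_k=5):
+     print(prediction["token_str"], round(prediction["score"], 3))
+ ```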

+ ## Long Context Capabilities

+ This model extends the context window from the standard 512 tokens to 4,096 tokens, enabling it to process much longer documents. The extended context window allows the model to:

+ 1. Process full legal agreements and contracts without truncation
+ 2. Maintain awareness of context from the beginning of a document when analyzing later sections
+ 3. Better handle documents with complex cross-references and definitions
+ 4. Reduce the need for document chunking in downstream applications (see the usage sketch after this list)
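
+ As a usage sketch, a long document can be encoded against the full window in one pass; this continues from the Usage example above. Here `max_length=4096` follows the stated context size, and `long_document` (and the file it is read from) is a placeholder for your own text:

+ ```python
+ # Placeholder: read your own long document here
+ long_document = "<|cls|> " + open("credit_agreement.txt").read() + " <|sep|>"

+ inputs = tokenizer(
+     long_document,
+     max_length=4096,   # extended context window of this model
+     truncation=True,   # truncate only if the document is longer still
+     return_tensors="pt",
+ )
+ print(f"Tokens used: {inputs['input_ids'].shape[1]}")

+ with torch.no_grad():
+     outputs = model(**inputs)

+ # One embedding for the entire document (CLS position)
+ doc_embedding = outputs.last_hidden_state[:, 0, :]
+ ```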

+ ## Training

+ The model was trained on a diverse corpus of legal and financial documents, ensuring high-quality performance in these domains. This version has been specifically optimized for feature extraction with longer contexts, incorporating the following key aspects:

+ 1. Position embeddings extended to 4,096 tokens
+ 2. Training focused on dense document representation for retrieval and classification tasks
+ 3. Objectives optimized for feature extraction rather than token prediction
+ 4. Additional training on full-length documents to ensure contextual understanding

+ It leverages the KL3M tokenizer, which provides 9-17% more efficient tokenization for domain-specific content than general-purpose tokenizers.
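
+ To illustrate the efficiency claim on your own text, token counts can be compared against a general-purpose tokenizer. A hedged sketch (the `gpt2` tokenizer is only a convenient baseline here, the sample sentence is illustrative, and the exact saving will vary by passage):

+ ```python
+ from transformers import AutoTokenizer

+ kl3m_tok = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")
+ baseline_tok = AutoTokenizer.from_pretrained("gpt2")  # generic baseline, for illustration only

+ sample = ("The Borrower shall deliver to the Administrative Agent audited consolidated "
+           "financial statements within ninety (90) days after the end of each fiscal year.")

+ kl3m_count = len(kl3m_tok.encode(sample))
+ baseline_count = len(baseline_tok.encode(sample))
+ print(f"KL3M tokens: {kl3m_count}, baseline tokens: {baseline_count}")
+ ```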

+ ### Intended Usage

+ This model was specifically designed and trained for:

+ 1. **Document embedding**: Generating fixed-length vector representations of documents for similarity comparison
+ 2. **Feature extraction**: Creating inputs for downstream classification and regression tasks (see the sketch after this list)
+ 3. **Semantic search**: Finding similar documents across large collections
+ 4. **Document clustering**: Discovering patterns across legal and financial document collections
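
+ For item 2, the document embeddings can be fed directly to a conventional classifier. A small sketch using scikit-learn (an assumption, not a stated dependency), with hypothetical labels standing in for a real annotated set; it continues from the batch example above:

+ ```python
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression

+ # CLS embeddings from the batch example above, used as fixed feature vectors
+ features = batch_embeddings.detach().numpy()  # shape [n_docs, 256]
+ labels = np.array([0, 1])                     # hypothetical labels; use a real labeled set in practice

+ clf = LogisticRegression(max_iter=1000).fit(features, labels)
+ print(clf.predict(features))
+ ```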

+ While the model architecture includes the capability for masked language modeling, the model weights were not specifically optimized for this task. For masked language modeling applications, consider using a model explicitly trained for that purpose.

+ ## Special Tokens

+ This model uses custom special tokens which must be added explicitly:

+ - CLS token: `<|cls|>` (ID: 5) - Should be added at the beginning of input text
+ - MASK token: `<|mask|>` (ID: 6) - Used to mark tokens for prediction
+ - SEP token: `<|sep|>` (ID: 4) - Should be added at the end of input text
+ - PAD token: `<|pad|>` (ID: 2) - Used for padding sequences to a uniform length
+ - BOS token: `<|start|>` (ID: 0) - Beginning of sequence
+ - EOS token: `<|end|>` (ID: 1) - End of sequence
+ - UNK token: `<|unk|>` (ID: 3) - Unknown token

+ The model also includes additional special tokens for chat and instruction contexts:

+ - `<|system|>` (ID: 7)
+ - `</|system|>` (ID: 8)
+ - `<|user|>` (ID: 9)
+ - `</|user|>` (ID: 10)
+ - `<|instruction|>` (ID: 11)
+ - `</|instruction|>` (ID: 12)

+ For best results, **explicitly add** the CLS and SEP tokens to your input text, as shown in the examples above.
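
+ A small helper that wraps raw text with these markers and confirms the IDs listed above directly from the tokenizer (the helper name is illustrative, not part of the library; it continues from the Usage example above):

+ ```python
+ def wrap_for_kl3m(text: str) -> str:
+     """Add the explicit CLS/SEP markers this model expects (illustrative helper)."""
+     return f"<|cls|> {text} <|sep|>"

+ # Confirm special token IDs directly from the tokenizer
+ for token in ["<|cls|>", "<|sep|>", "<|mask|>", "<|pad|>"]:
+     print(token, tokenizer.convert_tokens_to_ids(token))

+ inputs = tokenizer(wrap_for_kl3m("This Amendment is effective as of the Closing Date."),
+                    return_tensors="pt")
+ ```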

+ ## Limitations

+ While providing extended context capabilities, this model has some limitations:

+ - Smaller parameter count (41M) compared to larger language models
+ - Primarily focused on English legal and financial texts
+ - Best suited for domain-specific rather than general-purpose tasks
+ - Requires domain expertise to interpret results effectively
+ - May show decreased performance at the far edges of the context window

+ ## References

+ - [KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications](https://arxiv.org/abs/2503.17247)
+ - [The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models]() (Forthcoming)

+ ## Citation

+ If you use this model in your research, please cite:

+ ```bibtex
+ @misc{bommarito2025kl3m,
+   title={KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications},
+   author={Bommarito II, Michael J. and Katz, Daniel Martin and Bommarito, Jillian},
+   year={2025},
+   eprint={2503.17247},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```

+ ## License

+ This model is licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).

+ ## Contact

+ The KL3M model family is maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:

+ - Email: [email protected]
+ - Website: https://aleainstitute.ai
+ - GitHub: https://github.com/alea-institute/kl3m-model-research

+ ![ALEA Institute](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)