# Paraformer: Attentive Deep Neural Networks for Legal Document Retrieval

[![Hugging Face Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/nguyenthanhasia/paraformer)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/nguyenthanhasia/paraformer)
[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2212.13899)

This repository provides a **simplified Hugging Face implementation** of the Paraformer model for legal document retrieval, based on the paper "Attentive Deep Neural Networks for Legal Document Retrieval" by Nguyen et al.

## 🚨 Important Notes

### Usage Scope
- **This is a simplified, lightweight implementation** designed for easy integration with Hugging Face Transformers
- **For full functionality and customization**, please visit the original repository: [https://github.com/nguyenthanhasia/paraformer](https://github.com/nguyenthanhasia/paraformer)
- The original repository contains the complete training pipeline, evaluation scripts, and advanced features

### Licensing & Usage
- ✅ **Research purposes**: Free to use
- ⚠️ **Commercial purposes**: Use at your own risk
- Please refer to the original repository for detailed licensing information

## 🏗️ Model Architecture

Paraformer employs a hierarchical attention mechanism designed for legal document retrieval (a minimal sketch of the attention step follows the list):

- **Sentence-level encoding** using a pre-trained SentenceTransformer (paraphrase-mpnet-base-v2)
- **Query-aware attention** mechanism with optional sparsemax activation
- **Binary classification** for document relevance prediction
- **Interpretable attention weights** for understanding model decisions

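To make the query-aware attention step concrete, here is a minimal, illustrative sketch of "general" (bilinear) attention with sparsemax pooling over sentence embeddings. The shapes, the weight matrix `W`, and the standalone `sparsemax` helper are assumptions for illustration, not the exact internals of this repository; see the original repository for the actual implementation.

```python
import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax over the last dim (Martins & Astudillo, 2016):
    like softmax, but can assign exactly zero probability."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    support = 1 + k * z_sorted > cumsum           # entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True)       # support size k(z)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0)

# Illustrative shapes only: one query vector attending over 5 sentence vectors.
query_vec = torch.randn(1, 768)        # encoded query
sent_vecs = torch.randn(5, 768)        # encoded article sentences
W = torch.randn(768, 768)              # "general" attention weight matrix (hypothetical)
scores = query_vec @ W @ sent_vecs.T   # [1, 5] alignment scores
weights = sparsemax(scores)            # sparse attention distribution over sentences
doc_vec = weights @ sent_vecs          # [1, 768] attention-pooled document vector
```

Because sparsemax can assign exactly zero weight to irrelevant sentences, the resulting attention distribution is what makes the model's decisions directly interpretable.
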
## 🚀 Quick Start

### Installation

```bash
pip install transformers torch sentence-transformers
```

### Basic Usage

```python
from transformers import AutoModel

# Load the model (trust_remote_code is required for the custom architecture)
model = AutoModel.from_pretrained('nguyenthanhasia/paraformer', trust_remote_code=True)

# Example usage
query = "What are the legal requirements for contract formation?"
article = [
    "A contract is a legally binding agreement between two or more parties.",
    "For a contract to be valid, it must have offer, acceptance, and consideration.",
    "The parties must have legal capacity to enter into the contract."
]

# Get relevance score (0.0 to 1.0)
relevance_score = model.get_relevance_score(query, article)
print(f"Relevance Score: {relevance_score:.4f}")  # Example output: 0.5500

# Get binary prediction (0 = not relevant, 1 = relevant)
prediction = model.predict_relevance(query, article)
print(f"Prediction: {prediction}")  # Example output: 1
```

### Batch Processing

```python
import torch

queries = [
    "What constitutes a valid contract?",
    "How can employment be terminated?"
]

articles = [
    ["A contract requires offer, acceptance, and consideration.", "All parties must have legal capacity."],
    ["Employment can be terminated by mutual agreement.", "Notice period must be respected."]
]

# Forward pass for batch processing
outputs = model.forward(
    query_texts=queries,
    article_texts=articles,
    return_dict=True
)

# Convert logits to probabilities and hard predictions
probabilities = torch.softmax(outputs.logits, dim=-1)
predictions = torch.argmax(outputs.logits, dim=-1)

for i, (query, article) in enumerate(zip(queries, articles)):
    score = probabilities[i, 1].item()  # probability of the "relevant" class
    pred = predictions[i].item()
    print(f"Query: {query}")
    print(f"Score: {score:.4f}, Prediction: {pred}")
```

## 📊 Model Specifications

| Parameter | Value |
|-----------|-------|
| Model Size | ~445 MB |
| Hidden Size | 768 |
| Base Model | paraphrase-mpnet-base-v2 |
| Attention Type | General with Sparsemax |
| Output Classes | 2 (relevant / not relevant) |
| Input Format | Query string + article sentences (list of strings) |

## ⚠️ Important Considerations

### Input Format
- **Documents must be pre-segmented into sentences** and provided as a list of strings (a naive splitter sketch follows this list)
- The model encodes each sentence individually before applying query-aware attention
- Empty articles are handled gracefully

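If your documents arrive as raw text, segment them first. The snippet below is a deliberately naive, punctuation-based splitter for illustration; for real legal text a proper sentence tokenizer (for example, nltk's `sent_tokenize`) is usually more robust. The `model` is the one loaded in the Quick Start.

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    # Split on whitespace that follows sentence-final punctuation.
    # Legal abbreviations like "Art. 5" will be split incorrectly;
    # this is only a placeholder for a real sentence tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

raw = ("A contract is a legally binding agreement. "
       "It requires offer, acceptance, and consideration.")
article = naive_sentence_split(raw)
score = model.get_relevance_score("What makes a contract valid?", article)
```
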
### Model Behavior
- **Scores are not absolute relevance judgments**; they represent relative similarity in the learned feature space
- Treat the outputs as similarity scores for **ranking candidates against each other** (see the sketch below) rather than as definitive relevance conclusions
- The model was trained on legal documents and may perform differently on other domains

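Because the scores are comparative, a typical use is ranking a pool of candidate articles for one query. A small sketch using the `get_relevance_score` helper from the Quick Start (the candidate texts here are made up):

```python
query = "What constitutes a valid contract?"
candidates = {
    "art_1": ["A contract requires offer, acceptance, and consideration.",
              "All parties must have legal capacity."],
    "art_2": ["Employment can be terminated by mutual agreement."],
    "art_3": ["The parties must have legal capacity to enter into a contract."],
}

# Score each candidate once, then print them best-first.
scores = {article_id: model.get_relevance_score(query, sentences)
          for article_id, sentences in candidates.items()}
for article_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{article_id}: {score:.4f}")
```
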
### Performance Notes
- The model ships with pretrained weights converted from the original PyTorch Lightning checkpoint
- Some weights (particularly the SentenceTransformer components) may not be perfectly aligned due to architecture differences
- For optimal performance, consider fine-tuning on your specific dataset (a rough sketch follows this list)

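The following is only a rough sketch of what fine-tuning could look like, assuming the model is trainable end-to-end through the `forward(query_texts=..., article_texts=...)` interface shown above and that `outputs.logits` carry gradients; neither is guaranteed by this simplified implementation. The toy examples and hyperparameters are invented. Consult the original repository for the real training pipeline.

```python
import torch
import torch.nn.functional as F

# Toy labeled pairs (query, article sentences, label); invented for illustration.
train_data = [
    ("What makes a contract valid?",
     ["A contract requires offer, acceptance, and consideration."], 1),
    ("What makes a contract valid?",
     ["Employment can be terminated by mutual agreement."], 0),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

for query, article, label in train_data:
    outputs = model.forward(
        query_texts=[query],
        article_texts=[article],
        return_dict=True,
    )
    # Standard cross-entropy over the two relevance classes.
    loss = F.cross_entropy(outputs.logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()
```
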
## 🔧 Advanced Usage

### Custom Configuration

```python
from transformers import AutoConfig, AutoModel

# Load the configuration
config = AutoConfig.from_pretrained('nguyenthanhasia/paraformer', trust_remote_code=True)

# Modify the configuration if needed
config.dropout_prob = 0.2
config.use_sparsemax = False  # Use softmax instead of sparsemax

# Create the model with the custom config
model = AutoModel.from_pretrained(
    'nguyenthanhasia/paraformer',
    config=config,
    trust_remote_code=True
)
```

### Accessing Attention Weights

```python
# Get attention weights for interpretability
outputs = model.forward(
    query_texts=["Your query"],
    article_texts=[["Sentence 1", "Sentence 2", "Sentence 3"]],
    return_dict=True
)

# Access attention weights
attention_weights = outputs.attentions[0]  # Shape: [1, num_sentences]
print("Attention weights:", attention_weights)
```
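To read the weights against the text, pair each sentence with its weight. This assumes `outputs.attentions[0]` is a tensor whose entries align one-to-one with the input sentences, which matches the shape noted above.

```python
sentences = ["Sentence 1", "Sentence 2", "Sentence 3"]

# Sort sentences by attention weight, highest first, to see what the
# model focused on when scoring this query-article pair.
for sent, w in sorted(zip(sentences, attention_weights[0].tolist()),
                      key=lambda pair: pair[1], reverse=True):
    print(f"{w:.3f}  {sent}")
```
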
## 🔬 Research & Citation

This model is based on the research paper:

```bibtex
@article{nguyen2022attentive,
  title={Attentive Deep Neural Networks for Legal Document Retrieval},
  author={Nguyen, Ha-Thanh and Phi, Manh-Kien and Ngo, Xuan-Bach and Tran, Vu and Nguyen, Le-Minh and Tu, Minh-Phuong},
  journal={Artificial Intelligence and Law},
  pages={1--30},
  year={2022},
  publisher={Springer}
}
```

## 🔗 Related Resources

- **Original Repository**: [https://github.com/nguyenthanhasia/paraformer](https://github.com/nguyenthanhasia/paraformer) (full implementation with training scripts)
- **Research Paper**: [https://arxiv.org/abs/2212.13899](https://arxiv.org/abs/2212.13899)
- **COLIEE Competition**: the data and evaluation framework used in the original research

## 🤝 Contributing

For contributions, feature requests, or issues related to the core model:
- Visit the original repository: [https://github.com/nguyenthanhasia/paraformer](https://github.com/nguyenthanhasia/paraformer)

For issues specific to this Hugging Face implementation:
- Please open an issue in the Hugging Face model repository

## 📄 Disclaimer

This is a simplified implementation intended for easy integration. The original repository contains the complete research implementation with full training and evaluation capabilities; users seeking to reproduce research results or implement custom training should refer to it.

**Use responsibly**: this model is provided for research purposes. Commercial usage is at your own risk and discretion.