# Paraformer: Attentive Deep Neural Networks for Legal Document Retrieval

[![Hugging Face Model](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Model-blue)](https://huggingface.co/nguyenthanhasia/paraformer)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/nguyenthanhasia/paraformer)
[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2212.13899)

This repository provides a **simplified Hugging Face implementation** of the Paraformer model for legal document retrieval, based on the paper "Attentive Deep Neural Networks for Legal Document Retrieval" by Nguyen et al.

## 🚨 Important Notes

### Usage Scope
- **This is a simplified, lightweight implementation** designed for easy integration with Hugging Face Transformers
- **For full functionality and customization**, please visit the original repository: [https://github.com/nguyenthanhasia/paraformer](https://github.com/nguyenthanhasia/paraformer)
- The original repository contains the complete training pipeline, evaluation scripts, and advanced features

### Licensing & Usage
- βœ… **Research purposes**: Free to use
- ⚠️ **Commercial purposes**: Use at your own risk
- Please refer to the original repository for detailed licensing information

## πŸ—οΈ Model Architecture

Paraformer employs a hierarchical attention mechanism specifically designed for legal document retrieval:

- **Sentence-level encoding** using pre-trained SentenceTransformer (paraphrase-mpnet-base-v2)
- **Query-aware attention** mechanism with optional sparsemax activation (sketched below)
- **Binary classification** for document relevance prediction
- **Interpretable attention weights** for understanding model decisions
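
To make the attention step concrete, here is a minimal, hypothetical sketch of query-aware "general" (bilinear) attention combined with sparsemax. It illustrates the mechanism described above under assumed shapes and a placeholder weight matrix `W`; it is not the repository's actual code.

```python
import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax over the last dimension (Martins & Astudillo, 2016)."""
    z_sorted, _ = torch.sort(z, descending=True, dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(-1) - 1
    support = k * z_sorted > cumsum          # entries that stay nonzero
    k_z = support.sum(-1, keepdim=True)      # size of the support set
    tau = cumsum.gather(-1, k_z - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

# "General" attention between a query vector and sentence vectors.
# All tensors below are random placeholders, not trained parameters.
torch.manual_seed(0)
query_emb = torch.randn(768)              # query embedding, [hidden]
sentence_embs = torch.randn(3, 768)       # sentence embeddings, [num_sentences, hidden]
W = torch.randn(768, 768) * 0.01          # hypothetical learned weight matrix

scores = sentence_embs @ (W @ query_emb)  # one score per sentence
weights = sparsemax(scores)               # sparse weights (many exactly 0)
doc_vec = weights @ sentence_embs         # attention-pooled document vector
```

Unlike softmax, sparsemax can assign exactly zero weight to irrelevant sentences, which is what makes the attention weights directly interpretable.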

## πŸš€ Quick Start

### Installation

```bash
pip install transformers torch sentence-transformers
```

### Basic Usage

```python
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained('nguyenthanhasia/paraformer', trust_remote_code=True)

# Example usage
query = "What are the legal requirements for contract formation?"
article = [
    "A contract is a legally binding agreement between two or more parties.",
    "For a contract to be valid, it must have offer, acceptance, and consideration.",
    "The parties must have legal capacity to enter into the contract."
]

# Get relevance score (0.0 to 1.0)
relevance_score = model.get_relevance_score(query, article)
print(f"Relevance Score: {relevance_score:.4f}")  # Example output: 0.5500

# Get binary prediction (0 = not relevant, 1 = relevant)
prediction = model.predict_relevance(query, article)
print(f"Prediction: {prediction}")  # Example output: 1
```

### Batch Processing

```python
import torch

queries = [
    "What constitutes a valid contract?",
    "How can employment be terminated?"
]

articles = [
    ["A contract requires offer, acceptance, and consideration.", "All parties must have legal capacity."],
    ["Employment can be terminated by mutual agreement.", "Notice period must be respected."]
]

# Forward pass for batch processing
outputs = model(
    query_texts=queries,
    article_texts=articles,
    return_dict=True
)

# Get probabilities and predictions
probabilities = torch.softmax(outputs.logits, dim=-1)
predictions = torch.argmax(outputs.logits, dim=-1)

for i, (query, article) in enumerate(zip(queries, articles)):
    score = probabilities[i, 1].item()
    pred = predictions[i].item()
    print(f"Query: {query}")
    print(f"Score: {score:.4f}, Prediction: {pred}")
```

## πŸ“Š Model Specifications

| Parameter | Value |
|-----------|-------|
| Model Size | ~445 MB |
| Hidden Size | 768 |
| Base Model | paraphrase-mpnet-base-v2 |
| Attention Type | General with Sparsemax |
| Output Classes | 2 (relevant/not relevant) |
| Input Format | Query string + Article sentences (list) |

## ⚠️ Important Considerations

### Input Format
- **Documents must be pre-segmented into sentences** (provided as a list of strings; see the splitter sketch below)
- The model processes each sentence individually before applying attention
- Empty articles are handled gracefully
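
Since the model expects pre-segmented sentences, you need a splitter. Below is a minimal sketch using a naive regex; the regex is a simplification for illustration only, and real pipelines may prefer a proper sentence tokenizer such as nltk or spaCy.

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    # Good enough for a demo; legal text often needs a real tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

document = (
    "A contract is a legally binding agreement. "
    "It requires offer, acceptance, and consideration."
)
article = naive_sentence_split(document)
# -> ['A contract is a legally binding agreement.',
#     'It requires offer, acceptance, and consideration.']

# `model` loaded as in Quick Start above
score = model.get_relevance_score("What makes a contract valid?", article)
```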

### Model Behavior
- **Scores are not absolute relevance judgments** - they represent relative similarity in the learned feature space
- **Results should be interpreted as similarity scores** rather than definitive relevance conclusions (see the ranking sketch below)
- The model was trained on legal documents and may perform differently on other domains
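
Because scores are comparative rather than absolute, a common pattern is to rank candidate articles per query and take the top result, instead of applying a fixed score threshold. A hypothetical sketch, reusing `model` from Quick Start:

```python
query = "How can employment be terminated?"
candidates = {
    "art_1": ["Employment can be terminated by mutual agreement.",
              "Notice period must be respected."],
    "art_2": ["A contract requires offer, acceptance, and consideration."],
}

# Rank candidate articles by relevance score, best match first.
ranked = sorted(
    candidates.items(),
    key=lambda item: model.get_relevance_score(query, item[1]),
    reverse=True,
)
print([article_id for article_id, _ in ranked])
```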

### Performance Notes
- The model includes pretrained weights converted from the original PyTorch Lightning checkpoint
- Some weights (particularly SentenceTransformer components) may not be perfectly aligned due to architecture differences
- For optimal performance, consider fine-tuning on your specific dataset (a minimal sketch follows)
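
A minimal fine-tuning sketch, assuming the converted weights are trainable and that `forward` accepts the arguments shown in Batch Processing. The data below is a toy stand-in for a real labeled dataset, not a recommended training recipe.

```python
import torch
from torch.optim import AdamW

# Toy stand-in data: (query, sentence list, label) triples.
train_pairs = [
    ("What makes a contract valid?",
     ["Offer, acceptance, and consideration are required."], 1),
    ("What makes a contract valid?",
     ["Notice periods apply to employment termination."], 0),
]

optimizer = AdamW(model.parameters(), lr=2e-5)  # `model` from Quick Start
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for query, article, label in train_pairs:
        outputs = model(query_texts=[query], article_texts=[article],
                        return_dict=True)
        loss = loss_fn(outputs.logits, torch.tensor([label]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
model.eval()
```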

## πŸ”§ Advanced Usage

### Custom Configuration

```python
from transformers import AutoConfig

# Load configuration
config = AutoConfig.from_pretrained('nguyenthanhasia/paraformer', trust_remote_code=True)

# Modify configuration if needed
config.dropout_prob = 0.2
config.use_sparsemax = False  # Use softmax instead

# Create model with custom config
model = AutoModel.from_pretrained(
    'nguyenthanhasia/paraformer', 
    config=config,
    trust_remote_code=True
)
```

### Accessing Attention Weights

```python
# Get attention weights for interpretability
outputs = model(
    query_texts=["Your query"],
    article_texts=[["Sentence 1", "Sentence 2", "Sentence 3"]],
    return_dict=True
)

# Access attention weights
attention_weights = outputs.attentions[0]  # Shape: [1, num_sentences]
print("Attention weights:", attention_weights)
```
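
To see which sentences drove the prediction, pair each weight with its input sentence:

```python
sentences = ["Sentence 1", "Sentence 2", "Sentence 3"]

# Print sentences with their attention weights, highest first.
for sentence, weight in sorted(
    zip(sentences, attention_weights.squeeze(0).tolist()),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(f"{weight:.3f}  {sentence}")
```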

## πŸ”¬ Research & Citation

This model is based on the research paper:

```bibtex
@article{nguyen2022attentive,
  title={Attentive Deep Neural Networks for Legal Document Retrieval},
  author={Nguyen, Ha-Thanh and Phi, Manh-Kien and Ngo, Xuan-Bach and Tran, Vu and Nguyen, Le-Minh and Tu, Minh-Phuong},
  journal={Artificial Intelligence and Law},
  pages={1--30},
  year={2022},
  publisher={Springer}
}
```

## πŸ”— Related Resources

- **Original Repository**: [https://github.com/nguyenthanhasia/paraformer](https://github.com/nguyenthanhasia/paraformer) - Full implementation with training scripts
- **Research Paper**: [https://arxiv.org/abs/2212.13899](https://arxiv.org/abs/2212.13899)
- **COLIEE Competition**: Data and evaluation framework used in the original research

## 🀝 Contributing

For contributions, feature requests, or issues related to the core model:
- Visit the original repository: [https://github.com/nguyenthanhasia/paraformer](https://github.com/nguyenthanhasia/paraformer)

For issues specific to this Hugging Face implementation:
- Please open an issue in the Hugging Face model repository

## πŸ“„ Disclaimer

This is a simplified implementation for easy integration. The original repository contains the complete research implementation with full training and evaluation capabilities. Users seeking to reproduce research results or implement custom training should refer to the original repository.

**Use responsibly**: This model is provided for research purposes. Commercial usage is at your own risk and discretion.