SentenceTransformer
This is a sentence-transformers model fine‑tuned for semantic similarity tasks in the Russian legal domain. It maps sentences and paragraphs to a dense 1024‑dimensional vector space and is designed for:
- Semantic search
The model supports sequences of up to 8192 tokens, combining the accuracy gained from fine‑tuning on short spans with the base model's restored long‑context capacity.
Model Details
Model Description
- Model Type: Sentence Transformer (Bi‑Encoder)
- Base Model: BAAI/bge‑m3
- Fine‑tuned On: 8192‑token legal QA pairs
- Technique: LM‑Cocktail (weight interpolation with base)
- Output Dimensionality: 1024
- Similarity Function: Cosine similarity
- Domain: Russian legal texts
- License: MIT
Model Sources
- Documentation: sbert.net
- Repository: UKPLab/sentence‑transformers
- Model Hub: 🤗 sentence‑transformers models
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'include_prompt': True})
(2): Normalize()
)
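A quick sanity check after loading the model confirms these settings; this is a minimal sketch, and the comments simply restate the configuration above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JetTeam/legal-bge-m3-8192")

print(model.max_seq_length)                        # 8192-token input window
print(model.get_sentence_embedding_dimension())    # 1024-dimensional embeddings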
Usage
Installation
pip install -U sentence-transformers
Quick Start
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("JetTeam/legal-bge-m3-8192")

# Query: "When is simplified bankruptcy possible?"
query = "Когда возможно упрощённое банкротство?"
documents = [
    "Закон о банкротстве, ст. 226. Упрощённая процедура применяется к ...",  # Bankruptcy Law, Art. 226: the simplified procedure applies to ...
    "Гражданский кодекс, ст. 65. Предусматривает ...",  # Civil Code, Art. 65: provides for ...
]

# Encode the query and candidate documents into 1024-dimensional vectors
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Rank documents by cosine similarity to the query
scores = util.cos_sim(query_emb, doc_embs)
for doc, score in zip(documents, scores[0]):
    print(f"{score:.3f} — {doc[:60]}...")
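For retrieval over more than a handful of documents, the corpus embeddings can be computed once and queried with util.semantic_search. The snippet below is a sketch with placeholder texts, not part of the original example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("JetTeam/legal-bge-m3-8192")

# Placeholder corpus; individual fragments may be up to 8192 tokens long.
corpus = ["fragment 1 ...", "fragment 2 ...", "fragment 3 ..."]
corpus_embs = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("Когда возможно упрощённое банкротство?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=5)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']][:60]}...")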
Applications
- Legal search (laws, codes, court decisions)
Training Details
Dataset
- Corpus: 100 question–relevant-fragment pairs annotated by legal experts, designed for evaluating the quality of legal AI systems
- Pairs Generation: 31,673 filtered “question ↔ paragraph” training samples
Training Configuration
- Loss: MultipleNegativesRankingLoss
- Optimizer: AdamW (lr = 1e‑5)
- Batch Size: 64
- Epochs: 5
- Hardware: 2 × A100 (80 GB)
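The configuration above maps onto the sentence-transformers fit API roughly as follows; this is a minimal sketch under the stated hyperparameters, not the authors' published training script, and the example pair is a placeholder.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# Placeholder example; the real dataset contains 31,673 question-paragraph pairs.
train_examples = [
    InputExample(texts=["question text", "relevant paragraph"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# In-batch negatives: every other paragraph in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    optimizer_params={"lr": 1e-5},  # AdamW is the default optimizer
)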
LM‑Cocktail
To restore long‑context support:
θ_final = 0.7 × θ_finetuned + 0.3 × θ_base
This preserves semantic quality while re‑activating the 8192‑token capacity of the base model.
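A sketch of that interpolation in plain PyTorch (the fine-tuned checkpoint path is a placeholder; the LM-Cocktail project also provides its own merging utilities):
from sentence_transformers import SentenceTransformer

# Placeholder path for the fine-tuned checkpoint.
finetuned = SentenceTransformer("path/to/finetuned-bge-m3")
base = SentenceTransformer("BAAI/bge-m3")

alpha = 0.7  # weight of the fine-tuned model
base_state = base.state_dict()

merged = {}
for name, theta_ft in finetuned.state_dict().items():
    if theta_ft.is_floating_point():
        merged[name] = alpha * theta_ft + (1 - alpha) * base_state[name]
    else:
        # Integer buffers (e.g. position ids) are copied unchanged.
        merged[name] = theta_ft

finetuned.load_state_dict(merged)
finetuned.save("legal-bge-m3-8192")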
Evaluation
| Model | Recall@5 | MRR@10 |
|---|---|---|
| Legal BGE‑m3 (LM‑Cocktail) | 0.75 | 0.59 |
| BGE‑m3 (Base) | 0.58 | 0.48 |
| BM25 | 0.38 | 0.22 |
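These retrieval metrics can be computed with the built-in InformationRetrievalEvaluator; the dictionaries below are placeholders standing in for the 100 expert-annotated pairs.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("JetTeam/legal-bge-m3-8192")

# Placeholder data in the evaluator's expected format.
queries = {"q1": "question text"}
corpus = {"d1": "relevant paragraph", "d2": "irrelevant paragraph"}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries,
    corpus,
    relevant_docs,
    precision_recall_at_k=[5],
    mrr_at_k=[10],
    name="legal-eval",
)
print(evaluator(model))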
Performance
| Format | FPS (Batch=2) | Latency (ms) |
|---|---|---|
| PyTorch FP32 | 3.1 | 480 |
| OpenVINO FP32 | 8.9 | 180 |
| ONNX INT8 | 10.7 | 160 |
INT8 quantization may reduce Recall@5 by approximately 1.5 percentage points.
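Recent sentence-transformers releases (v3.2+) can load ONNX and OpenVINO backends directly; whether exported weights are bundled with this repository is not stated, so the following is a sketch (sentence-transformers will attempt to export the model if no backend-specific weights are present).
from sentence_transformers import SentenceTransformer

# Requires the optional extras: pip install "sentence-transformers[onnx]" or "[openvino]"
onnx_model = SentenceTransformer("JetTeam/legal-bge-m3-8192", backend="onnx")
ov_model = SentenceTransformer("JetTeam/legal-bge-m3-8192", backend="openvino")

emb = onnx_model.encode(["legal text fragment"])
print(emb.shape)  # (1, 1024)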
Training Environment
- Python: 3.10.12
- Sentence Transformers: 4.0.2
- Transformers: 4.48.3
- PyTorch: 2.1.0+cu118
- Accelerate: 1.6.0
- Datasets: 3.5.0
- Tokenizers: 0.21.1