SentenceTransformer

This is a sentence-transformers model fine‑tuned for semantic similarity tasks in the Russian legal domain. It maps sentences and paragraphs to a dense 1024‑dimensional vector space and is designed for:

  • Semantic search

The model supports inputs of up to 8192 tokens, combining the accuracy gained from fine‑tuning on short spans with the long‑context capacity restored from the base model.


Model Details

Model Description

  • Model Type: Sentence Transformer (Bi‑Encoder)
  • Base Model: BAAI/bge‑m3
  • Fine‑tuned On: 8192‑token legal QA pairs
  • Technique: LM‑Cocktail (weight interpolation with base)
  • Output Dimensionality: 1024
  • Similarity Function: Cosine similarity
  • Domain: Russian legal texts
  • License: MIT

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'include_prompt': True})
  (2): Normalize()
)

Usage

Installation

pip install -U sentence-transformers

Quick Start

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("JetTeam/legal-bge-m3-8192")

query = "Когда возможно упрощённое банкротство?"  # "When is simplified bankruptcy possible?"
documents = [
    "Закон о банкротстве, ст. 226. Упрощённая процедура применяется к ...",  # Bankruptcy Law, Art. 226
    "Гражданский кодекс, ст. 65. Предусматривает ...",  # Civil Code, Art. 65
]

# Encode the query and the candidate documents into 1024-dimensional vectors
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Cosine similarity between the query and each document
scores = util.cos_sim(query_emb, doc_embs)

for doc, score in zip(documents, scores[0]):
    print(f"{score:.3f}  {doc[:60]}...")

Applications

  • Legal search (laws, codes, court decisions)

Training Details

Dataset

  • Corpus: 100 question–relevant-fragment pairs annotated by legal experts, designed for evaluating the quality of legal AI systems.
  • Pair generation: 31,673 filtered “question ↔ paragraph” samples

Training Configuration

  • Loss: MultipleNegativesRankingLoss
  • Optimizer: AdamW (lr = 1e‑5)
  • Batch Size: 64
  • Epochs: 5
  • Hardware: 2 × A100 (80 GB)
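
A minimal sketch of this configuration using the Sentence Transformers trainer API. The in-memory dataset and output path are illustrative placeholders, not the actual training script or corpus:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder pairs; the real corpus of 31,673 filtered question–paragraph
# samples is not published with this card.
train_dataset = Dataset.from_dict({
    "anchor": ["Когда возможно упрощённое банкротство?"],
    "positive": ["Закон о банкротстве, ст. 226. Упрощённая процедура применяется к ..."],
})

model = SentenceTransformer("BAAI/bge-m3")
loss = MultipleNegativesRankingLoss(model)  # in-batch negatives ranking loss

args = SentenceTransformerTrainingArguments(
    output_dir="legal-bge-m3",
    num_train_epochs=5,
    per_device_train_batch_size=64,
    learning_rate=1e-5,  # AdamW is the default optimizer
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()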

LM‑Cocktail

To restore long‑context support:

θ_final = 0.7 × θ_finetuned + 0.3 × θ_base

This preserves semantic quality while re‑activating the 8192‑token capacity of the base model.
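
The same interpolation can be reproduced by mixing the two state dicts directly (the card does not publish the merge script, so the paths below are illustrative):

import torch
from transformers import AutoModel

finetuned = AutoModel.from_pretrained("path/to/finetuned-bge-m3")  # hypothetical local path
base = AutoModel.from_pretrained("BAAI/bge-m3")

base_state = base.state_dict()
merged_state = {}
for name, ft_param in finetuned.state_dict().items():
    # θ_final = 0.7 · θ_finetuned + 0.3 · θ_base
    merged_state[name] = 0.7 * ft_param + 0.3 * base_state[name]

finetuned.load_state_dict(merged_state)
finetuned.save_pretrained("legal-bge-m3-8192")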


Evaluation

Model                         Recall@5   MRR@10
Legal BGE-m3 (LM-Cocktail)    0.75       0.59
BGE-m3 (base)                 0.58       0.48
BM25                          0.38       0.22
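
Retrieval metrics of this kind can be computed with Sentence Transformers' InformationRetrievalEvaluator. A sketch with placeholder queries, corpus, and relevance judgments (the expert-annotated evaluation set is not bundled with this repository):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("JetTeam/legal-bge-m3-8192")

# Placeholder data; in practice these come from the 100 annotated pairs.
queries = {"q1": "Когда возможно упрощённое банкротство?"}
corpus = {
    "d1": "Закон о банкротстве, ст. 226. ...",
    "d2": "Гражданский кодекс, ст. 65. ...",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    precision_recall_at_k=[5],  # reports Recall@5
    mrr_at_k=[10],              # reports MRR@10
)
print(evaluator(model))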

Performance

Format          FPS (batch = 2)   Latency (ms)
PyTorch FP32    3.1               480
OpenVINO FP32   8.9               180
ONNX INT8       10.7              160

INT8 may reduce Recall@5 by ≈ 1.5 pp.
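
The ONNX and OpenVINO runs can be reproduced with the backend support built into Sentence Transformers (v3.2+); whether pre-exported or quantized weights are published in this repository is not stated, so the models may be exported on the fly at load time:

from sentence_transformers import SentenceTransformer

# ONNX backend; exports the model automatically if no ONNX file is present.
onnx_model = SentenceTransformer("JetTeam/legal-bge-m3-8192", backend="onnx")

# OpenVINO backend works the same way.
ov_model = SentenceTransformer("JetTeam/legal-bge-m3-8192", backend="openvino")

emb = onnx_model.encode("Когда возможно упрощённое банкротство?")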


Training Environment

  • Python: 3.10.12
  • Sentence Transformers: 4.0.2
  • Transformers: 4.48.3
  • PyTorch: 2.1.0+cu118
  • Accelerate: 1.6.0
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1
