Tooka-SBERT-V2-Large

This model is a Sentence Transformers model trained for semantic textual similarity and embedding tasks. It maps sentences and paragraphs to a dense vector space, where semantically similar texts are close together.

The model is trained in two sizes, Small and Large; this card describes the Large variant.

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install sentence-transformers==3.4.1

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("PartAI/Tooka-SBERT-V2-Large")
# Run inference
sentences = [
    # "The crane is a migratory bird with long legs and a long neck."
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
    # "With their tall stature and broad wings, cranes are among the most beautiful migratory birds."
    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
    # "Cranes are small birds with short legs that do not migrate."
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
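
As a quick sanity check on the example above, the similarity matrix can be read row by row: the two consistent crane descriptions should score noticeably higher with each other than either does with the contradictory third sentence. A minimal continuation of the snippet (the expected ordering is an interpretation of the example sentences, not a published figure):

# Similarities of sentence 0 with all sentences; the diagonal entry is its self-similarity.
print(similarities[0])
# Index of the sentence most similar to sentence 0, excluding itself.
best = similarities[0][1:].argmax().item() + 1
print(sentences[best])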

🛠️ Training Details

The training is performed in two stages:

  1. Pretraining on the Targoman News dataset
  2. Fine-tuning on multiple synthetic datasets

Stage 1: Pretraining

  • We use an asymmetric setup.
  • Input formatting:
    • Titles are prepended with "سوال: " ("Question: ")
    • Texts are prepended with "متن: " ("Text: ")
  • Loss function: CachedMultipleNegativesRankingLoss (a sketch of this setup follows below)
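
A minimal sketch of how such an asymmetric setup can be expressed with the Sentence Transformers trainer. The base checkpoint path, column names, and example rows are placeholders for illustration, not the actual Targoman News preprocessing:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Placeholder starting checkpoint; the card does not name the base model.
model = SentenceTransformer("path/to/base-checkpoint")

# Toy (title, text) pairs with the prefixes described above.
titles = ["..."]  # placeholder news titles (Targoman News in the real setup)
texts = ["..."]   # placeholder article bodies
train_dataset = Dataset.from_dict({
    "anchor": ["سوال: " + t for t in titles],
    "positive": ["متن: " + t for t in texts],
})

# In-batch negatives with gradient caching, allowing large effective batch sizes.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()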

Stage 2: Fine-tuning

  • Loss functions:
    • CachedMultipleNegativesRankingLoss
    • CoSENTLoss
  • Applied across multiple synthetic datasets (a sketch follows below)
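
Continuing the sketch above: CoSENTLoss operates on scored sentence pairs rather than (anchor, positive) pairs, and the trainer accepts a dictionary of datasets with a matching dictionary of losses. The dataset names, columns, and values below are placeholders for the synthetic fine-tuning data:

# Placeholder scored pairs; CoSENTLoss expects two sentences and a similarity score.
sts_dataset = Dataset.from_dict({
    "sentence1": ["..."],
    "sentence2": ["..."],
    "score": [0.8],
})
cosent_loss = losses.CoSENTLoss(model)

# One loss per named dataset when fine-tuning on multiple datasets at once.
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset={"pairs": train_dataset, "sts": sts_dataset},
    loss={"pairs": loss, "sts": cosent_loss},
)
trainer.train()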

📊 Evaluation

We evaluate our model on the PTEB Benchmark, where it outperforms multilingual-e5-base (mE5-Base) on average across PTEB tasks (see the table below).

For Retrieval and Reranking tasks, we follow the same asymmetric structure as in training, prepending (as in the sketch below):

  • "سوال: " ("Question: ") to queries
  • "متن: " ("Text: ") to documents
Model                  #Params  Pair-Classification-Avg  Classification-Avg  Retrieval-Avg  Reranking-Avg  CrossTasks-Avg
Tooka-SBERT-V2-Large   353M     80.24                    74.73               59.80          73.44          72.05
Tooka-SBERT-V2-Small   123M     75.69                    72.16               61.24          73.40          70.62
jina-embeddings-v3     572M     71.88                    79.27               65.18          64.62          70.24
multilingual-e5-base   278M     70.76                    69.71               63.90          76.01          70.09
Tooka-SBERT-V1-Large   353M     81.52                    71.54               45.61          60.44          64.78

Task-Specific Datasets in PTEB

  • Pair-Classification:
    • FarsTail
  • Classification:
    • MassiveIntentClassification
    • MassiveScenarioClassification
    • MultilingualSentimentClassification
    • PersianFoodSentimentClassification
  • Retrieval:
    • MIRACLRetrieval
    • NeuCLIR2023Retrieval
    • WikipediaRetrievalMultilingual
  • Reranking:
    • MIRACLReranking
    • WikipediaRerankingMultilingual

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, 
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}