Bilingual Azerbaijani-English Sentence Embedding Model (az-en-MiniLM-L6-v2)
This is a sentence-transformers model that maps sentences and paragraphs in Azerbaijani (az) and English (en) to a 384-dimensional dense vector space. It is designed for tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering in these two languages.
The model is based on sentence-transformers/all-MiniLM-L6-v2 and was fine-tuned via knowledge distillation from the high-performance English embedding model BAAI/bge-small-en-v1.5.
A custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer with a vocabulary of ~50k was trained from scratch and is used by this model.
Model Details
- Base Architecture: sentence-transformers/all-MiniLM-L6-v2 (6 layers, 384 hidden dimensions, 12 attention heads)
- Parameters: ~30.2 million (after vocabulary expansion)
- Tokenizer: Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at LocalDoc/az-en-unigram-tokenizer-50k (a standalone loading sketch follows this list).
- Output Dimension: 384
- Max Sequence Length: 512 tokens
- Training: Fine-tuned for 3 epochs on a parallel corpus of ~4.14 million Azerbaijani-English sentence pairs using MSELoss for knowledge distillation from BAAI/bge-small-en-v1.5.
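For reference, here is a minimal sketch of inspecting the custom tokenizer on its own. It assumes the tokenizer repo loads through the standard transformers AutoTokenizer interface (sentencepiece and protobuf may be required); the printed tokens are illustrative, not guaranteed output:

```python
from transformers import AutoTokenizer

# Assumption: LocalDoc/az-en-unigram-tokenizer-50k exposes a standard
# Hugging Face tokenizer; install sentencepiece/protobuf if loading fails.
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

print(tokenizer.vocab_size)  # expected to be roughly 50k
print(tokenizer.tokenize("Azərbaycanın paytaxtı Bakı şəhəridir."))
print(tokenizer.tokenize("The capital of Azerbaijan is Baku."))
```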
Performance on Azerbaijani STS Benchmarks
This model demonstrates strong performance on Azerbaijani Semantic Textual Similarity (STS) tasks (see LocalDoc-Azerbaijan/STS-Benchmark), achieving results competitive with, and in some cases surpassing, larger multilingual models.
The following results were obtained after 3 epochs of training:
| Dataset | Pearson Correlation |
|---|---|
| LocalDoc/Azerbaijani-STSBenchmark | 0.7595 |
| LocalDoc/Azerbaijani-biosses-sts | 0.7410 |
| LocalDoc/Azerbaijani-sickr-sts | 0.7432 |
| LocalDoc/Azerbaijani-sts12-sts | 0.7644 |
| LocalDoc/Azerbaijani-sts13-sts | 0.6336 |
| LocalDoc/Azerbaijani-sts15-sts | 0.7597 |
| LocalDoc/Azerbaijani-sts16-sts | 0.6848 |
| **Average Pearson** | **0.7266** |
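Pearson correlations like these are typically computed with the standard sentence-transformers STS evaluator. The sketch below is illustrative and assumes the benchmark datasets follow the usual STS schema (sentence1, sentence2, score on a 0-5 scale); adjust the split and column names to the actual dataset layout:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("LocalDoc/az-en-MiniLM-L6-v2")

# Assumed split and column names; check the dataset card for the real schema.
ds = load_dataset("LocalDoc/Azerbaijani-STSBenchmark", split="test")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=ds["sentence1"],
    sentences2=ds["sentence2"],
    scores=[s / 5.0 for s in ds["score"]],  # normalize 0-5 gold scores to [0, 1]
    name="az-stsb",
)
# The evaluator computes Pearson and Spearman correlations between the gold
# scores and the embedding similarities (cosine by default).
print(evaluator(model))
```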
Comparison with other models on the (assumed) same Azerbaijani STS benchmarks (Average Pearson):

| Model | Average Pearson |
|---|---|
| LocalDoc/TEmA-small | 0.7959 |
| Cohere/embed-multilingual-v3.0 | 0.7823 |
| BAAI/bge-m3 | 0.7577 |
| intfloat/multilingual-e5-large-instruct | 0.7377 |
| Cohere/embed-multilingual-v2.0 | 0.7318 |
| OpenAI/text-embedding-3-large | 0.7288 |
| intfloat/multilingual-e5-large | 0.7280 |
| **LocalDoc/az-en-MiniLM-L6-v2 (this model)** | **0.7266** |
| sentence-transformers/LaBSE | 0.7250 |
| intfloat/multilingual-e5-small | 0.7242 |
| Cohere/embed-multilingual-light-v3.0 | 0.7142 |
| intfloat/multilingual-e5-base | 0.6960 |
How to Use
First, install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model and encode sentences in either language:
```python
from sentence_transformers import SentenceTransformer

model_id = "LocalDoc/az-en-MiniLM-L6-v2"

try:
    model = SentenceTransformer(model_id)
    print(f"Model {model_id} loaded successfully!")
except Exception as e:
    # The model depends on the custom tokenizer
    # LocalDoc/az-en-unigram-tokenizer-50k; if loading fails, make sure that
    # repository is accessible and that protobuf/sentencepiece are installed.
    print(f"Failed to load model: {e}")
    raise

# Example Azerbaijani sentences
sentences_az = [
    "Azərbaycanın paytaxtı Bakı şəhəridir.",  # "The capital of Azerbaijan is the city of Baku."
    "Bu gün hava çox istidir.",  # "The weather is very hot today."
]

# Example English sentences
sentences_en = [
    "The capital of Azerbaijan is the city of Baku.",
    "The weather is very hot today.",
    "I enjoy reading books.",
]

print("\nEncoding Azerbaijani sentences...")
embeddings_az = model.encode(sentences_az)
for sent, emb in zip(sentences_az, embeddings_az):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")

print("Encoding English sentences...")
embeddings_en = model.encode(sentences_en)
for sent, emb in zip(sentences_en, embeddings_en):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")
```
Example of calculating cross-lingual similarity:

```python
from sentence_transformers.util import cos_sim

# Parallel sentences (same meaning across languages) should score high...
similarity = cos_sim(embeddings_az[0], embeddings_en[0])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[0]}': {similarity.item():.4f}")

# ...while unrelated sentences should score noticeably lower.
similarity_diff = cos_sim(embeddings_az[0], embeddings_en[2])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[2]}': {similarity_diff.item():.4f}")
```
Training
This model was fine-tuned from sentence-transformers/all-MiniLM-L6-v2 using a knowledge-distillation setup.
- Teacher Model: BAAI/bge-small-en-v1.5 (used to generate target embeddings for the English sentences).
- Student Model: Initialized from sentence-transformers/all-MiniLM-L6-v2.
- Tokenizer: The custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer (LocalDoc/az-en-unigram-tokenizer-50k). The student model's token embedding layer was resized to match the new vocabulary size (~50k).
- Training Data: A parallel corpus of approximately 4.14 million Azerbaijani-English sentence pairs.
- Loss Function: MSELoss. The student model was trained to produce embeddings, for both the Azerbaijani and English sentences, that match the teacher model's embeddings of the corresponding English sentences (see the sketch after this list).
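For illustration, a distillation setup along these lines can be expressed with the library's ParallelSentencesDataset and MSELoss via the legacy fit API. This is a hedged sketch under an assumed data file name and the hyperparameters listed below, not the exact training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim teacher
student = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim student
student.max_seq_length = 512
# (The tokenizer swap and token-embedding resize to the ~50k bilingual
#  vocabulary are omitted here for brevity.)

# Each line of the (hypothetical) TSV holds "english<TAB>azerbaijani"; the
# student learns to map BOTH sentences onto the teacher's embedding of the
# English sentence.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("az-en-parallel.tsv")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.15)  # warmup ratio 0.15
student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 3e-4},
)
```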
Training Hyperparameters
- Epochs: 3
- Batch Size: 64
- Max Sequence Length: 512
- Learning Rate: 3e-4
- Warmup Ratio: 0.15
CC BY 4.0 License — What It Allows
Under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, you are free to use, modify, and distribute the model, even for commercial purposes, as long as you give proper credit to the original creator.
For more information, please refer to the full CC BY 4.0 license text.
Contact
For more information, questions, or issues, please contact LocalDoc at [[email protected]].