Bilingual Azerbaijani-English Sentence Embedding Model (az-en-MiniLM-L6-v2)
This is a sentence-transformers model that maps sentences and paragraphs in Azerbaijani (az) and English (en) to a 384-dimensional dense vector space. It is designed for tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering in these two languages.
The model is based on sentence-transformers/all-MiniLM-L6-v2 and was fine-tuned via knowledge distillation from the high-performance English embedding model BAAI/bge-small-en-v1.5.
A custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer with a vocabulary of ~50k was trained from scratch and is used by this model.
Model Details
- Base Architecture: sentence-transformers/all-MiniLM-L6-v2 (6 layers, 384 hidden dimensions, 12 attention heads)
- Parameters: ~30.2 million (after vocabulary expansion)
- Tokenizer: Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at LocalDoc/az-en-unigram-tokenizer-50k (a standalone loading sketch follows this list).
- Output Dimension: 384
- Max Sequence Length: 512 tokens
- Training: Fine-tuned for 3 epochs on a parallel corpus of ~4.14 million Azerbaijani-English sentence pairs using MSELoss for knowledge distillation from BAAI/bge-small-en-v1.5.
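For reference, here is a minimal sketch of inspecting the custom tokenizer on its own. It assumes the tokenizer repo loads through the standard transformers AutoTokenizer interface (sentencepiece and protobuf may be required); the printed tokens are illustrative, not guaranteed output:

```python
from transformers import AutoTokenizer

# Assumption: LocalDoc/az-en-unigram-tokenizer-50k exposes a standard
# Hugging Face tokenizer; install sentencepiece/protobuf if loading fails.
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

print(tokenizer.vocab_size)  # expected to be roughly 50k
print(tokenizer.tokenize("Azərbaycanın paytaxtı Bakı şəhəridir."))
print(tokenizer.tokenize("The capital of Azerbaijan is Baku."))
```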
Performance on Azerbaijani STS Benchmarks
This model demonstrates strong performance on Azerbaijani Semantic Textual Similarity (STS) tasks (see LocalDoc-Azerbaijan/STS-Benchmark), achieving results competitive with, and in some cases surpassing, larger multilingual models.
The following results were obtained after 3 epochs of training:
| Dataset | Pearson Correlation |
|---|---|
| LocalDoc/Azerbaijani-STSBenchmark | 0.7595 |
| LocalDoc/Azerbaijani-biosses-sts | 0.7410 |
| LocalDoc/Azerbaijani-sickr-sts | 0.7432 |
| LocalDoc/Azerbaijani-sts12-sts | 0.7644 |
| LocalDoc/Azerbaijani-sts13-sts | 0.6336 |
| LocalDoc/Azerbaijani-sts15-sts | 0.7597 |
| LocalDoc/Azerbaijani-sts16-sts | 0.6848 |
| **Average Pearson** | **0.7266** |
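Pearson correlations like these are typically computed with the standard sentence-transformers STS evaluator. The sketch below is illustrative and assumes the benchmark datasets follow the usual STS schema (sentence1, sentence2, score on a 0-5 scale); adjust the split and column names to the actual dataset layout:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("LocalDoc/az-en-MiniLM-L6-v2")

# Assumed split and column names; check the dataset card for the real schema.
ds = load_dataset("LocalDoc/Azerbaijani-STSBenchmark", split="test")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=ds["sentence1"],
    sentences2=ds["sentence2"],
    scores=[s / 5.0 for s in ds["score"]],  # normalize 0-5 gold scores to [0, 1]
    name="az-stsb",
)
# The evaluator computes Pearson and Spearman correlations between the gold
# scores and the embedding similarities (cosine by default).
print(evaluator(model))
```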
Comparison with other models on the (assumed) same Azerbaijani STS benchmarks (Average Pearson):

| Model | Average Pearson |
|---|---|
| LocalDoc/TEmA-small | 0.7959 |
| Cohere/embed-multilingual-v3.0 | 0.7823 |
| BAAI/bge-m3 | 0.7577 |
| intfloat/multilingual-e5-large-instruct | 0.7377 |
| Cohere/embed-multilingual-v2.0 | 0.7318 |
| OpenAI/text-embedding-3-large | 0.7288 |
| intfloat/multilingual-e5-large | 0.7280 |
| **LocalDoc/az-en-MiniLM-L6-v2 (this model)** | **0.7266** |
| sentence-transformers/LaBSE | 0.7250 |
| intfloat/multilingual-e5-small | 0.7242 |
| Cohere/embed-multilingual-light-v3.0 | 0.7142 |
| intfloat/multilingual-e5-base | 0.6960 |
How to Use
First, install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```

Then load the model and encode sentences in either language:
```python
from sentence_transformers import SentenceTransformer

model_id = "LocalDoc/az-en-MiniLM-L6-v2"

try:
    model = SentenceTransformer(model_id)
    print(f"Model {model_id} loaded successfully!")
except Exception as e:
    # The model depends on the custom tokenizer
    # LocalDoc/az-en-unigram-tokenizer-50k; if loading fails, make sure that
    # repository is accessible and that protobuf/sentencepiece are installed.
    print(f"Failed to load model: {e}")
    raise

# Example Azerbaijani sentences
sentences_az = [
    "Azərbaycanın paytaxtı Bakı şəhəridir.",  # "The capital of Azerbaijan is the city of Baku."
    "Bu gün hava çox istidir.",  # "The weather is very hot today."
]

# Example English sentences
sentences_en = [
    "The capital of Azerbaijan is the city of Baku.",
    "The weather is very hot today.",
    "I enjoy reading books.",
]

print("\nEncoding Azerbaijani sentences...")
embeddings_az = model.encode(sentences_az)
for sent, emb in zip(sentences_az, embeddings_az):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")

print("Encoding English sentences...")
embeddings_en = model.encode(sentences_en)
for sent, emb in zip(sentences_en, embeddings_en):
    print(f"Sentence: {sent}")
    print(f"Embedding shape: {emb.shape}, first 3 dims: {emb[:3]}\n")
```
Example of calculating cross-lingual similarity:

```python
from sentence_transformers.util import cos_sim

# Parallel sentences (same meaning across languages) should score high...
similarity = cos_sim(embeddings_az[0], embeddings_en[0])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[0]}': {similarity.item():.4f}")

# ...while unrelated sentences should score noticeably lower.
similarity_diff = cos_sim(embeddings_az[0], embeddings_en[2])
print(f"Similarity between '{sentences_az[0]}' and '{sentences_en[2]}': {similarity_diff.item():.4f}")
```
Training
This model was fine-tuned from sentence-transformers/all-MiniLM-L6-v2 using a knowledge-distillation setup.
- Teacher Model: BAAI/bge-small-en-v1.5 (used to generate target embeddings for the English sentences).
- Student Model: Initialized from sentence-transformers/all-MiniLM-L6-v2.
- Tokenizer: The custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer (LocalDoc/az-en-unigram-tokenizer-50k). The student model's token embedding layer was resized to match the new vocabulary size (~50k).
- Training Data: A parallel corpus of approximately 4.14 million Azerbaijani-English sentence pairs.
- Loss Function: MSELoss. The student model was trained to produce embeddings, for both the Azerbaijani and English sentences, that match the teacher model's embeddings of the corresponding English sentences (see the sketch after this list).
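For illustration, a distillation setup along these lines can be expressed with the library's ParallelSentencesDataset and MSELoss via the legacy fit API. This is a hedged sketch under an assumed data file name and the hyperparameters listed below, not the exact training script:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

teacher = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim teacher
student = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim student
student.max_seq_length = 512
# (The tokenizer swap and token-embedding resize to the ~50k bilingual
#  vocabulary are omitted here for brevity.)

# Each line of the (hypothetical) TSV holds "english<TAB>azerbaijani"; the
# student learns to map BOTH sentences onto the teacher's embedding of the
# English sentence.
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("az-en-parallel.tsv")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.15)  # warmup ratio 0.15
student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 3e-4},
)
```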
Training Hyperparameters
- Epochs: 3
- Batch Size: 64
- Max Sequence Length: 512
- Learning Rate: 3e-4
- Warmup Ratio: 0.15
CC BY 4.0 License — What It Allows
Under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, you are free to use, modify, and distribute the model, even for commercial purposes, as long as you give proper credit to the original creator.
For more information, please refer to the full CC BY 4.0 license text.
Contact
For more information, questions, or issues, please contact LocalDoc at [[email protected]].