---
language:
  - ar
  - en
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:34436
  - loss:MatryoshkaLoss
  - loss:CoSENTLoss
base_model: AhmedZaky1/DIMI-embedding-v2
widget:
  - source_sentence: الرجل يركب حصاناً
    sentences:
      - رجل يُبث الجبن الممزق على البيتزا
      - ar-ar
      - رجل يركب حصاناً
  - source_sentence: المرأة تقلي لحم خنزير مشوي
    sentences:
      - ar-ar
      - امرأة تطبخ لحم خنزير مخبوز
      - طائرة طيران تقلع
  - source_sentence: امرأة تحمل في ذراعها طفل كنغر
    sentences:
      - امرأة تعزف على الغيتار
      - ar-ar
      - امرأة تحمل و تحمل طفل كنغر
  - source_sentence: رجل يعزف على الناي
    sentences:
      - طائرة ستقلع
      - ar-ar
      - رجل يعزف على فرقة الخيزران
  - source_sentence: ثلاثة رجال يلعبون الشطرنج.
    sentences:
      - رجلين يلعبان الشطرنج
      - بعض الرجال يقاتلون
      - ar-ar
datasets:
  - silma-ai/silma-arabic-english-sts-dataset-v1.0
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: SentenceTransformer based on AhmedZaky1/DIMI-embedding-v2
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: silma sts dev 768
          type: silma-sts-dev-768
        metrics:
          - type: pearson_cosine
            value: 0.8894298077237747
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8357984695231979
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: silma sts dev 512
          type: silma-sts-dev-512
        metrics:
          - type: pearson_cosine
            value: 0.8958835653694187
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8394578198917563
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: silma sts dev 256
          type: silma-sts-dev-256
        metrics:
          - type: pearson_cosine
            value: 0.9078743376141943
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8470163055535588
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: silma sts dev 128
          type: silma-sts-dev-128
        metrics:
          - type: pearson_cosine
            value: 0.9181556833949818
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.856188415278301
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: silma sts dev 64
          type: silma-sts-dev-64
        metrics:
          - type: pearson_cosine
            value: 0.9066219844975816
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8434430083292863
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts17 ar test 768
          type: sts17-ar-test-768
        metrics:
          - type: pearson_cosine
            value: 0.8205269118955641
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8258003312254673
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts17 ar test 512
          type: sts17-ar-test-512
        metrics:
          - type: pearson_cosine
            value: 0.8193403796404517
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8226611918447921
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts17 ar test 256
          type: sts17-ar-test-256
        metrics:
          - type: pearson_cosine
            value: 0.8190666923783347
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8245760514866052
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts17 ar test 128
          type: sts17-ar-test-128
        metrics:
          - type: pearson_cosine
            value: 0.8114629254813825
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8183273799928091
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: sts17 ar test 64
          type: sts17-ar-test-64
        metrics:
          - type: pearson_cosine
            value: 0.796172574267003
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.8077141358495715
            name: Spearman Cosine
---

SentenceTransformer based on AhmedZaky1/DIMI-embedding-v2

This is a sentence-transformers model finetuned from AhmedZaky1/DIMI-embedding-v2 on the silma-arabic-english-sts-dataset-v1.0 dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: AhmedZaky1/DIMI-embedding-v2
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: silma-ai/silma-arabic-english-sts-dataset-v1.0
  • Languages: Arabic, English

Model Sources

  • Documentation: https://www.sbert.net
  • Repository: https://github.com/UKPLab/sentence-transformers
  • Hugging Face: https://huggingface.co/models?library=sentence-transformers

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
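
Because the final Normalize() module L2-normalizes every embedding, cosine similarity between two outputs reduces to a plain dot product. A tiny numeric sketch of that identity (illustrative vectors, not real model outputs):

import numpy as np

# Stand-ins for two L2-normalized embeddings
a = np.array([3.0, 4.0]); a /= np.linalg.norm(a)
b = np.array([4.0, 3.0]); b /= np.linalg.norm(b)

# For unit vectors, cosine similarity equals the dot product
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, a @ b)
print(a @ b)  # 0.96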

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v2-silma-sts-matryoshka")
# Run inference
sentences = [
    'ثلاثة رجال يلعبون الشطرنج.',  # "Three men are playing chess."
    'رجلين يلعبان الشطرنج',  # "Two men are playing chess"
    'ar-ar',  # language-pair tag carried over from the dataset's langs column
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
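
Because the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64 (see Training Details), embeddings can be truncated to any of those sizes for cheaper storage and search, at the modest accuracy cost reported in the Evaluation section. A sketch using the library's truncate_dim option:

from sentence_transformers import SentenceTransformer

# Load with a reduced output dimensionality (any of 768/512/256/128/64)
model = SentenceTransformer(
    "AhmedZaky1/DIMI-embedding-v2-silma-sts-matryoshka",
    truncate_dim=256,
)
embeddings = model.encode([
    'طائرة ستقلع',  # "A plane is about to take off"
    'طائرة طيران تقلع',  # "An airplane is taking off"
])
print(embeddings.shape)
# (2, 256)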

Evaluation

Metrics

Semantic Similarity

  • Datasets: silma-sts-dev-768, silma-sts-dev-512, silma-sts-dev-256, silma-sts-dev-128, silma-sts-dev-64, sts17-ar-test-768, sts17-ar-test-512, sts17-ar-test-256, sts17-ar-test-128 and sts17-ar-test-64
  • Evaluated with EmbeddingSimilarityEvaluator (a runnable sketch follows the table below)

| Dataset           | pearson_cosine | spearman_cosine |
|-------------------|----------------|-----------------|
| silma-sts-dev-768 | 0.8894         | 0.8358          |
| silma-sts-dev-512 | 0.8959         | 0.8395          |
| silma-sts-dev-256 | 0.9079         | 0.8470          |
| silma-sts-dev-128 | 0.9182         | 0.8562          |
| silma-sts-dev-64  | 0.9066         | 0.8434          |
| sts17-ar-test-768 | 0.8205         | 0.8258          |
| sts17-ar-test-512 | 0.8193         | 0.8227          |
| sts17-ar-test-256 | 0.8191         | 0.8246          |
| sts17-ar-test-128 | 0.8115         | 0.8183          |
| sts17-ar-test-64  | 0.7962         | 0.8077          |
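
The same evaluator can be re-run locally. A minimal sketch with toy pairs drawn from the dataset samples shown under Training Details below (real runs use the full dev/test splits; scores are gold similarities in [0, 1]):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v2-silma-sts-matryoshka")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=['طائرة ستقلع', 'رجل يعزف على ناي كبير', 'رجل يعزف على البيانو'],
    sentences2=['طائرة طيران تقلع', 'رجل يعزف على الناي', 'امرأة تعزف على الكمان'],
    scores=[1.0, 0.76, 0.2],
    name="toy-sts",
)
results = evaluator(model)
print(results)  # includes toy-sts_pearson_cosine and toy-sts_spearman_cosine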

Training Details

Training Dataset

silma-arabic-english-sts-dataset-v1.0

  • Dataset: silma-arabic-english-sts-dataset-v1.0 at 1885690
  • Size: 34,436 training samples
  • Columns: sentence1, sentence2, score, and langs
  • Approximate statistics based on the first 1000 samples:

    |         | sentence1 | sentence2 | score | langs |
    |---------|-----------|-----------|-------|-------|
    | type    | string | string | float | string |
    | details | min: 4 tokens, mean: 9.68 tokens, max: 26 tokens | min: 4 tokens, mean: 9.68 tokens, max: 26 tokens | min: 0.0, mean: 0.47, max: 1.0 | min: 5 tokens, mean: 5.0 tokens, max: 5 tokens |
  • Samples:

    | sentence1 | sentence2 | score | langs |
    |-----------|-----------|-------|-------|
    | رجل يعزف على البيانو | امرأة تعزف على الكمان | 0.2 | ar-ar |
    | امرأة تعزف على الكمان | رجل يعزف على البيانو | 0.2 | ar-ar |
    | امرأة تعزف على الناي. | رجل يعزف على الغيتار | 0.2 | ar-ar |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CoSENTLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
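
For a quick look at the data itself, the dataset can be pulled straight from the Hub; a brief sketch (assuming the default train split):

from datasets import load_dataset

ds = load_dataset("silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train")
print(ds.column_names)  # ['sentence1', 'sentence2', 'score', 'langs']
print(ds[0])            # one {sentence1, sentence2, score, langs} record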
    

Evaluation Dataset

silma-arabic-english-sts-dataset-v1.0

  • Dataset: silma-arabic-english-sts-dataset-v1.0 at 1885690
  • Size: 100 evaluation samples
  • Columns: sentence1, sentence2, score, and langs
  • Approximate statistics based on the first 100 samples:

    |         | sentence1 | sentence2 | score | langs |
    |---------|-----------|-----------|-------|-------|
    | type    | string | string | float | string |
    | details | min: 5 tokens, mean: 9.49 tokens, max: 19 tokens | min: 5 tokens, mean: 9.49 tokens, max: 19 tokens | min: 0.1, mean: 0.74, max: 1.0 | min: 5 tokens, mean: 5.0 tokens, max: 5 tokens |
  • Samples:

    | sentence1 | sentence2 | score | langs |
    |-----------|-----------|-------|-------|
    | طائرة ستقلع | طائرة طيران تقلع | 1.0 | ar-ar |
    | طائرة طيران تقلع | طائرة ستقلع | 1.0 | ar-ar |
    | رجل يعزف على ناي كبير | رجل يعزف على الناي | 0.76 | ar-ar |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CoSENTLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • save_only_model: True
  • fp16: True
  • load_best_model_at_end: True
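
The original training script is not included in the card, but a condensed sketch consistent with the settings above would look roughly like this (the 100-sample dev split and its evaluators are omitted for brevity, and the split handling is an assumption):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CoSENTLoss, MatryoshkaLoss

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-v2")

# CoSENTLoss consumes (sentence1, sentence2, score); drop the extra langs column
train_dataset = load_dataset(
    "silma-ai/silma-arabic-english-sts-dataset-v1.0", split="train"
).remove_columns("langs")

# Apply CoSENTLoss at every Matryoshka dimension with equal weights
loss = MatryoshkaLoss(
    model,
    CoSENTLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="dimi-silma-sts-matryoshka",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
    save_only_model=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()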

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: True
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | silma-sts-dev-768_spearman_cosine | silma-sts-dev-512_spearman_cosine | silma-sts-dev-256_spearman_cosine | silma-sts-dev-128_spearman_cosine | silma-sts-dev-64_spearman_cosine | sts17-ar-test-768_spearman_cosine | sts17-ar-test-512_spearman_cosine | sts17-ar-test-256_spearman_cosine | sts17-ar-test-128_spearman_cosine | sts17-ar-test-64_spearman_cosine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0929 | 100 | 39.5796 | 45.0982 | 0.7199 | 0.7173 | 0.7292 | 0.7433 | 0.7196 | - | - | - | - | - |
| 0.1857 | 200 | 31.3305 | 29.9877 | 0.7233 | 0.7248 | 0.7344 | 0.7337 | 0.7192 | - | - | - | - | - |
| 0.2786 | 300 | 27.7756 | 31.4644 | 0.7288 | 0.7268 | 0.7331 | 0.7388 | 0.7169 | - | - | - | - | - |
| 0.3714 | 400 | 27.7405 | 33.3315 | 0.7172 | 0.7168 | 0.7341 | 0.7349 | 0.7219 | - | - | - | - | - |
| 0.4643 | 500 | 27.1884 | 30.4957 | 0.7469 | 0.7428 | 0.7475 | 0.7547 | 0.7426 | - | - | - | - | - |
| 0.5571 | 600 | 27.0428 | 29.5877 | 0.7133 | 0.7138 | 0.7380 | 0.7549 | 0.7533 | - | - | - | - | - |
| 0.6500 | 700 | 26.7957 | 30.3813 | 0.7520 | 0.7430 | 0.7570 | 0.7604 | 0.7647 | - | - | - | - | - |
| 0.7428 | 800 | 26.2667 | 30.6293 | 0.7323 | 0.7333 | 0.7558 | 0.7609 | 0.7479 | - | - | - | - | - |
| 0.8357 | 900 | 25.9412 | 29.8621 | 0.7730 | 0.7732 | 0.7913 | 0.8117 | 0.7797 | - | - | - | - | - |
| 0.9285 | 1000 | 25.7816 | 31.7315 | 0.7856 | 0.7918 | 0.7916 | 0.8025 | 0.8048 | - | - | - | - | - |
| 1.0214 | 1100 | 25.1666 | 31.6311 | 0.7651 | 0.7668 | 0.7673 | 0.7826 | 0.7846 | - | - | - | - | - |
| 1.1142 | 1200 | 24.7681 | 32.3005 | 0.7719 | 0.7892 | 0.7941 | 0.8022 | 0.7939 | - | - | - | - | - |
| 1.2071 | 1300 | 24.8771 | 32.1761 | 0.7660 | 0.7744 | 0.7807 | 0.7884 | 0.7841 | - | - | - | - | - |
| 1.2999 | 1400 | 24.9063 | 33.2694 | 0.7646 | 0.7644 | 0.7884 | 0.7906 | 0.7886 | - | - | - | - | - |
| 1.3928 | 1500 | 24.7283 | 32.4350 | 0.7935 | 0.7974 | 0.8071 | 0.8112 | 0.8062 | - | - | - | - | - |
| 1.4856 | 1600 | 24.4217 | 34.1219 | 0.7781 | 0.7754 | 0.7739 | 0.7916 | 0.7889 | - | - | - | - | - |
| 1.5785 | 1700 | 24.4923 | 33.1239 | 0.7636 | 0.7709 | 0.7882 | 0.7991 | 0.7913 | - | - | - | - | - |
| 1.6713 | 1800 | 24.0844 | 33.5233 | 0.7785 | 0.7832 | 0.7880 | 0.7977 | 0.8014 | - | - | - | - | - |
| 1.7642 | 1900 | 24.1453 | 35.4602 | 0.7795 | 0.7816 | 0.8053 | 0.8115 | 0.7944 | - | - | - | - | - |
| 1.8570 | 2000 | 24.2271 | 36.2812 | 0.8003 | 0.8009 | 0.8008 | 0.8102 | 0.8009 | - | - | - | - | - |
| 1.9499 | 2100 | 23.7371 | 37.0276 | 0.7769 | 0.7866 | 0.7918 | 0.7926 | 0.7832 | - | - | - | - | - |
| 2.0427 | 2200 | 23.3566 | 34.5721 | 0.7931 | 0.8017 | 0.8020 | 0.8159 | 0.8027 | - | - | - | - | - |
| 2.1356 | 2300 | 23.2523 | 35.5316 | 0.7931 | 0.7981 | 0.7896 | 0.8157 | 0.8142 | - | - | - | - | - |
| 2.2284 | 2400 | 23.0447 | 36.6811 | 0.7973 | 0.7962 | 0.7935 | 0.8030 | 0.8037 | - | - | - | - | - |
| 2.3213 | 2500 | 22.9782 | 37.5482 | 0.8121 | 0.8185 | 0.8200 | 0.8293 | 0.8244 | - | - | - | - | - |
| 2.4141 | 2600 | 22.9119 | 37.2809 | 0.8077 | 0.8116 | 0.8113 | 0.8333 | 0.8151 | - | - | - | - | - |
| 2.5070 | 2700 | 23.1302 | 37.7993 | 0.8255 | 0.8304 | 0.8310 | 0.8376 | 0.8303 | - | - | - | - | - |
| 2.5998 | 2800 | 22.9941 | 38.8005 | 0.8182 | 0.8214 | 0.8143 | 0.8193 | 0.8155 | - | - | - | - | - |
| 2.6927 | 2900 | 22.8876 | 36.2524 | 0.8201 | 0.8222 | 0.8194 | 0.8347 | 0.8260 | - | - | - | - | - |
| 2.7855 | 3000 | 22.5304 | 38.1523 | 0.8195 | 0.8280 | 0.8356 | 0.8545 | 0.8394 | - | - | - | - | - |
| 2.8784 | 3100 | 22.4460 | 39.4876 | 0.8242 | 0.8246 | 0.8319 | 0.8483 | 0.8397 | - | - | - | - | - |
| 2.9712 | 3200 | 22.5077 | 39.1910 | 0.8231 | 0.8249 | 0.8334 | 0.8475 | 0.8372 | - | - | - | - | - |
| 3.0641 | 3300 | 21.9675 | 36.4245 | 0.8408 | 0.8425 | 0.8456 | 0.8619 | 0.8577 | - | - | - | - | - |
| 3.1569 | 3400 | 21.9361 | 36.7119 | 0.8344 | 0.8405 | 0.8460 | 0.8656 | 0.8644 | - | - | - | - | - |
| 3.2498 | 3500 | 21.7747 | 37.7140 | 0.8279 | 0.8353 | 0.8414 | 0.8510 | 0.8446 | - | - | - | - | - |
| 3.3426 | 3600 | 21.8649 | 38.9102 | 0.8298 | 0.8331 | 0.8456 | 0.8494 | 0.8447 | - | - | - | - | - |
| 3.4355 | 3700 | 21.7940 | 37.4385 | 0.8278 | 0.8328 | 0.8377 | 0.8442 | 0.8373 | - | - | - | - | - |
| 3.5283 | 3800 | 21.7968 | 37.0225 | 0.8352 | 0.8501 | 0.8540 | 0.8722 | 0.8553 | - | - | - | - | - |
| 3.6212 | 3900 | 21.5941 | 37.5736 | 0.8344 | 0.8515 | 0.8511 | 0.8643 | 0.8587 | - | - | - | - | - |
| 3.7140 | 4000 | 21.8181 | 37.4984 | 0.8340 | 0.8440 | 0.8470 | 0.8607 | 0.8484 | - | - | - | - | - |
| 3.8069 | 4100 | 21.7035 | 37.9701 | 0.8346 | 0.8394 | 0.8436 | 0.8615 | 0.8479 | - | - | - | - | - |
| 3.8997 | 4200 | 21.3980 | 38.1567 | 0.8349 | 0.8365 | 0.8470 | 0.8572 | 0.8405 | - | - | - | - | - |
| **3.9926** | **4300** | **21.6518** | **38.3515** | **0.8358** | **0.8395** | **0.8470** | **0.8562** | **0.8434** | - | - | - | - | - |
| 4.0 | 4308 | - | - | - | - | - | - | - | 0.8258 | 0.8227 | 0.8246 | 0.8183 | 0.8077 |
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.7
  • Sentence Transformers: 3.3.1
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.1
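
To approximate this environment, the listed versions can be pinned directly (the PyTorch CUDA build varies by platform, so the +cu124 suffix is left off here):

pip install "sentence-transformers==3.3.1" "transformers==4.51.3" "torch==2.6.0" "accelerate==1.4.0" "datasets==3.3.2" "tokenizers==0.21.1"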

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}