BGE base ArgillaSDK Matryoshka

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5 on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
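
The pooled [CLS] embedding is passed through the final Normalize() module, so the vectors the model returns are unit-length and dot-product scores coincide with cosine similarity. A minimal sketch to verify this (the example sentences are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sud-962081/bge-base-argilla-sdk-matryoshka")

# Encode two illustrative sentences; the final Normalize() module makes them unit-length
embeddings = model.encode(
    ["How do I create a workspace in Argilla?", "Creating and managing workspaces"],
    convert_to_tensor=True,
)

# Because the embeddings are normalized, dot product and cosine similarity agree
print(util.dot_score(embeddings, embeddings))
print(util.cos_sim(embeddings, embeddings))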

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sud-962081/bge-base-argilla-sdk-matryoshka")
# Run inference
sentences = [
    'Make changes and push them\n\nMake the changes you want in your local repository, and test that everything works and you are following the guidelines. Check the documentation for more information about the development.\n\nOnce you have finished, you can check the status of your repository and synchronize with the upstreaming repo with the following command:\n\n```sh\n\nCheck the status of your repository\n\ngit status\n\nSynchronize with the upstreaming repo',
    'Are changes required to be made and then uploaded to the Argilla dataset repository?',
    'The beautiful scenery of the Italian town Argilla made me want to make changes to my travel plans.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
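
Since the model was trained with a Matryoshka objective over 768, 512, 256, 128, and 64 dimensions, you can also load it with a smaller output size and trade a little retrieval quality for cheaper storage and faster search. A minimal sketch using the truncate_dim argument available in recent Sentence Transformers releases (the sentences are illustrative):

from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns 256-dimensional embeddings;
# any of the trained Matryoshka dimensions (768, 512, 256, 128, 64) works
model = SentenceTransformer("sud-962081/bge-base-argilla-sdk-matryoshka", truncate_dim=256)

embeddings = model.encode([
    "How do I check if a dataset exists?",
    "You can check if a dataset exists by calling the exists method on the Dataset class.",
])
print(embeddings.shape)
# (2, 256)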

Evaluation

Metrics

Information Retrieval

| Metric              | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
|:--------------------|--------:|--------:|--------:|--------:|-------:|
| cosine_accuracy@1   | 0.0612  | 0.0714  | 0.0408  | 0.0306  | 0.0204 |
| cosine_accuracy@3   | 0.1837  | 0.1837  | 0.2041  | 0.1939  | 0.0816 |
| cosine_accuracy@5   | 0.2653  | 0.2551  | 0.2551  | 0.2449  | 0.2143 |
| cosine_accuracy@10  | 0.2959  | 0.3061  | 0.2959  | 0.3776  | 0.2755 |
| cosine_precision@1  | 0.0612  | 0.0714  | 0.0408  | 0.0306  | 0.0204 |
| cosine_precision@3  | 0.0612  | 0.0612  | 0.068   | 0.0646  | 0.0272 |
| cosine_precision@5  | 0.0531  | 0.051   | 0.051   | 0.049   | 0.0429 |
| cosine_precision@10 | 0.0296  | 0.0306  | 0.0296  | 0.0378  | 0.0276 |
| cosine_recall@1     | 0.0612  | 0.0714  | 0.0408  | 0.0306  | 0.0204 |
| cosine_recall@3     | 0.1837  | 0.1837  | 0.2041  | 0.1939  | 0.0816 |
| cosine_recall@5     | 0.2653  | 0.2551  | 0.2551  | 0.2449  | 0.2143 |
| cosine_recall@10    | 0.2959  | 0.3061  | 0.2959  | 0.3776  | 0.2755 |
| cosine_ndcg@10      | 0.1788  | 0.1789  | 0.1636  | 0.1845  | 0.132  |
| cosine_mrr@10       | 0.1409  | 0.1389  | 0.1211  | 0.1252  | 0.0874 |
| cosine_map@100      | 0.154   | 0.1499  | 0.1349  | 0.133   | 0.1001 |
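
These are information retrieval metrics computed at each Matryoshka dimension, which is what the library's InformationRetrievalEvaluator reports when given a truncate_dim. A hedged sketch of how such an evaluation can be run; the queries, corpus, and relevance judgments below are tiny illustrative placeholders, not the actual evaluation split:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("sud-962081/bge-base-argilla-sdk-matryoshka")

# Illustrative evaluation data: query id -> text, doc id -> text, query id -> relevant doc ids
queries = {"q1": "How do I check whether a dataset exists?"}
corpus = {
    "d1": "You can check if a dataset exists by calling the exists method on the Dataset class.",
    "d2": "To connect to an Argilla server, instantiate the Argilla class with api_url and api_key.",
    "d3": "Workspaces are used to organize datasets and manage user access.",
}
relevant_docs = {"q1": {"d1"}}

for dim in (768, 512, 256, 128, 64):
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,           # score embeddings truncated to this dimension
        accuracy_at_k=[1, 3],       # small cut-offs because the toy corpus has only 3 documents
        precision_recall_at_k=[1, 3],
        mrr_at_k=[3],
        ndcg_at_k=[3],
        map_at_k=[3],
    )
    print(evaluator(model))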

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 882 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 882 samples:
    • anchor: string; min: 6 tokens, mean: 91.86 tokens, max: 198 tokens
    • positive: string; min: 8 tokens, mean: 25.62 tokens, max: 91 tokens
    • negative: string; min: 10 tokens, mean: 22.11 tokens, max: 61 tokens
  • Samples:
    • Sample 1
      • anchor:

        workspace = client.workspaces("my_workspace")

        Retrieve the dataset from the first workspace

        retrieved_dataset = client.datasets(name="my_dataset")

        Retrieve the dataset from the specified workspace

        retrieved_dataset = client.datasets(name="my_dataset", workspace=workspace)

        Check dataset existence

        You can check if a dataset exists by calling the exists method on the Dataset class. This method returns a boolean value.

        python
        import argilla_sdk as rg

      • positive: Is there a way to download a dataset from a specific workspace using the Argilla client for my data annotation task?
      • negative: The new coffee shop in town offers a variety of workspace options for remote workers.
    • Sample 2
      • anchor:

        === "As Record objects"
        You can also add suggestions to a record in an initialized `Record` object.

        === "From a generic data structure"
        You can add suggestions as a dictionary, where the keys correspond to the names of the labels that were configured for your dataset. Remember that you can also use the mapping parameter to specify the data structure.

      • positive: Is it possible to associate multiple suggestions with a single record object in Argilla?
      • negative: I love adding suggestions to my garden to make it look more beautiful.
    • Sample 3
      • anchor:

        hide: footer

        rg.Argilla

        To interact with the Argilla server from python you can use the Argilla class. The Argilla client is used to create, get, update, and delete all Argilla resources, such as workspaces, users, datasets, and records.

        Usage Examples

        Connecting to an Argilla server

        To connect to an Argilla server, instantiate the Argilla class and pass the api_url of the server and the api_key to authenticate.

        python
        import argilla_sdk as rg

      • positive: Does the Argilla class provide a convenient way to handle dataset and record administration tasks on the Argilla server?
      • negative: The tourists got lost in the Argilla desert because they forgot to bring a map.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "TripletLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
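
In code, this configuration corresponds to wrapping a TripletLoss in a MatryoshkaLoss, so the same triplet objective is applied to the embeddings truncated to each listed dimension. A minimal sketch of how such a loss can be constructed:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, TripletLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Triplet objective over (anchor, positive, negative) columns, applied at every
# Matryoshka dimension with equal weight
inner_loss = TripletLoss(model)
loss = MatryoshkaLoss(
    model,
    loss=inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)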
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_eval_batch_size: 4
  • gradient_accumulation_steps: 4
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • load_best_model_at_end: True
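
A hedged sketch of how a comparable run could be set up with the Sentence Transformers trainer and the non-default hyperparameters above. The data file path and the train/eval split are placeholders: the card only records that training used a local json dataset of 882 (anchor, positive, negative) triplets.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, TripletLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Placeholder data loading: a local json file with anchor/positive/negative columns
dataset = load_dataset("json", data_files="train.json", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)  # illustrative eval split

loss = MatryoshkaLoss(model, TripletLoss(model), matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-argilla-sdk-matryoshka",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss=loss,
)
trainer.train()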

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 4
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 4
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch      | Step   | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
|:-----------|:-------|--------------:|-----------------------:|-----------------------:|-----------------------:|-----------------------:|----------------------:|
| 0          | 0      | -             | 0.3815                 | 0.3810                 | 0.3717                 | 0.3897                 | 0.3153                |
| 0.1802     | 5      | 23.2127       | -                      | -                      | -                      | -                      | -                     |
| 0.3604     | 10     | 22.567        | -                      | -                      | -                      | -                      | -                     |
| 0.5405     | 15     | 21.0403       | -                      | -                      | -                      | -                      | -                     |
| 0.7207     | 20     | 19.6983       | -                      | -                      | -                      | -                      | -                     |
| 0.9009     | 25     | 18.4465       | -                      | -                      | -                      | -                      | -                     |
| 0.973      | 27     | -             | 0.2707                 | 0.2832                 | 0.2721                 | 0.2576                 | 0.238                 |
| 1.1081     | 30     | 19.4241       | -                      | -                      | -                      | -                      | -                     |
| 1.2883     | 35     | 17.3167       | -                      | -                      | -                      | -                      | -                     |
| 1.4685     | 40     | 17.0334       | -                      | -                      | -                      | -                      | -                     |
| 1.6486     | 45     | 16.9455       | -                      | -                      | -                      | -                      | -                     |
| 1.8288     | 50     | 16.8353       | -                      | -                      | -                      | -                      | -                     |
| 1.9730     | 54     | -             | 0.1507                 | 0.1536                 | 0.1595                 | 0.1604                 | 0.1532                |
| 2.0360     | 55     | 18.4414       | -                      | -                      | -                      | -                      | -                     |
| 2.2162     | 60     | 16.7065       | -                      | -                      | -                      | -                      | -                     |
| 2.3964     | 65     | 16.6709       | -                      | -                      | -                      | -                      | -                     |
| 2.5766     | 70     | 16.6449      | -                      | -                      | -                      | -                      | -                     |
| 2.7568     | 75     | 16.6349       | -                      | -                      | -                      | -                      | -                     |
| 2.9369     | 80     | 16.633        | -                      | -                      | -                      | -                      | -                     |
| **2.9730** | **81** | -             | **0.1788**             | **0.1789**             | **0.1636**             | **0.1845**             | **0.1320**            |
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.3.1
  • Transformers: 4.47.1
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.2.1
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}