BGE base securiti dataset 1 v2

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
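
The Pooling and Normalize modules listed above can be illustrated with a small NumPy sketch. It assumes the Transformer module has already produced token embeddings of shape (batch, seq_len, 768); the array below is random placeholder data, and `cls_pool_and_normalize` is a hypothetical helper name, not part of the library.

```python
import numpy as np


def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Mimic Pooling(pooling_mode_cls_token=True) followed by Normalize().

    token_embeddings: array of shape (batch, seq_len, hidden_dim).
    Returns an array of shape (batch, hidden_dim) with unit L2 norm per row.
    """
    # CLS pooling: keep only the first token's vector for each sequence.
    cls_vectors = token_embeddings[:, 0, :]
    # L2 normalization, guarding against division by zero.
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)


# Placeholder token embeddings: 3 sequences, 10 tokens, 768 dimensions.
rng = np.random.default_rng(0)
fake_tokens = rng.normal(size=(3, 10, 768))
sentence_embeddings = cls_pool_and_normalize(fake_tokens)
print(sentence_embeddings.shape)  # (3, 768)
```

Because every output row has unit length, the cosine similarity between two embeddings equals their dot product.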

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v2")
# Run inference
sentences = [
    "Thailand's PDPA applies to any legal entity collecting, using, or disclosing a natural (and alive) person's personal data.",
    "Who does the Thailand's PDPA apply to?",
    "What penalties could an organization face for infringing Kenya's Data Protection Act?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
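
For semantic search, the similarity scores are typically used to rank passages against a query. The following is a minimal sketch of that ranking step, using toy 4-dimensional vectors in place of real `model.encode(...)` output; `rank_passages` is a hypothetical helper, not a library function.

```python
import numpy as np


def rank_passages(query_emb: np.ndarray, passage_embs: np.ndarray, top_k: int = 3):
    """Return (index, score) pairs for the top_k passages by cosine similarity."""
    # Normalize defensively so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    # Sort descending by score and keep the first top_k indices.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for model.encode(...) output.
query = np.array([1.0, 0.0, 0.0, 0.0])
passages = np.array([
    [0.0, 1.0, 0.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0, 0.0],  # close to the query
    [0.5, 0.5, 0.5, 0.5],  # partially aligned
])
ranking = rank_passages(query, passages, top_k=2)
print([idx for idx, _ in ranking])  # [1, 2]
```

With the real model, `query` would be the embedding of a question and `passages` the embeddings of candidate answer texts.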

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.5556
cosine_accuracy@3 0.8333
cosine_accuracy@5 0.8889
cosine_accuracy@10 1.0
cosine_precision@1 0.5556
cosine_precision@3 0.2778
cosine_precision@5 0.1778
cosine_precision@10 0.1
cosine_recall@1 0.5556
cosine_recall@3 0.8333
cosine_recall@5 0.8889
cosine_recall@10 1.0
cosine_ndcg@10 0.773
cosine_mrr@10 0.7011
cosine_map@100 0.7011
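
The relationship between these metrics can be sketched in a few lines. The example assumes each query has exactly one relevant passage, which would explain why accuracy@k equals recall@k in the table and why precision@k equals accuracy@k / k (for example, 0.8333 / 3 ≈ 0.2778). It is an illustration with made-up ranks, not the library's evaluator, and `retrieval_metrics` is a hypothetical name.

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    ranks: 1-based rank of the single relevant passage for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    accuracy = hits / n
    # With one relevant passage per query, a hit contributes 1/k to
    # precision and 1 to recall.
    precision = hits / (n * k)
    recall = hits / n
    mrr = sum(1.0 / r for r in ranks if r is not None and r <= k) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mrr": mrr}


# Made-up ranks for six queries (illustrative only, not this model's results).
example_ranks = [1, 1, 2, 3, 5, None]
m = retrieval_metrics(example_ranks, k=3)
print(m)
```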

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.5556
cosine_accuracy@3 0.8333
cosine_accuracy@5 0.8889
cosine_accuracy@10 1.0
cosine_precision@1 0.5556
cosine_precision@3 0.2778
cosine_precision@5 0.1778
cosine_precision@10 0.1
cosine_recall@1 0.5556
cosine_recall@3 0.8333
cosine_recall@5 0.8889
cosine_recall@10 1.0
cosine_ndcg@10 0.773
cosine_mrr@10 0.7011
cosine_map@100 0.7011

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.5556
cosine_accuracy@3 0.8889
cosine_accuracy@5 0.9444
cosine_accuracy@10 1.0
cosine_precision@1 0.5556
cosine_precision@3 0.2963
cosine_precision@5 0.1889
cosine_precision@10 0.1
cosine_recall@1 0.5556
cosine_recall@3 0.8889
cosine_recall@5 0.9444
cosine_recall@10 1.0
cosine_ndcg@10 0.7903
cosine_mrr@10 0.7218
cosine_map@100 0.7218

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.6111
cosine_accuracy@3 0.8333
cosine_accuracy@5 0.8889
cosine_accuracy@10 0.9444
cosine_precision@1 0.6111
cosine_precision@3 0.2778
cosine_precision@5 0.1778
cosine_precision@10 0.0944
cosine_recall@1 0.6111
cosine_recall@3 0.8333
cosine_recall@5 0.8889
cosine_recall@10 0.9444
cosine_ndcg@10 0.7855
cosine_mrr@10 0.7338
cosine_map@100 0.7369

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.4444
cosine_accuracy@3 0.7222
cosine_accuracy@5 0.8333
cosine_accuracy@10 1.0
cosine_precision@1 0.4444
cosine_precision@3 0.2407
cosine_precision@5 0.1667
cosine_precision@10 0.1
cosine_recall@1 0.4444
cosine_recall@3 0.7222
cosine_recall@5 0.8333
cosine_recall@10 1.0
cosine_ndcg@10 0.7062
cosine_mrr@10 0.6142
cosine_map@100 0.6142
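
The Training Logs report cosine_map@100 separately for dim_768 down to dim_64, which suggests the evaluation tables above cover the five Matryoshka dimensions. A lower-dimensional embedding is obtained by keeping the leading components of the full vector and re-normalizing. The sketch below shows this on random placeholder data; `truncate_embeddings` is a hypothetical helper.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and re-normalize."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)


# Random placeholder for model.encode(...) output of shape (3, 768).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))
small = truncate_embeddings(full, dim=256)
print(small.shape)  # (3, 256)
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`; check the documentation of the installed version for how it interacts with normalization.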

Training Details

Training Dataset

Unnamed Dataset

  • Size: 161 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    column    type    min       mean          max
    positive  string  5 tokens  40.09 tokens  481 tokens
    anchor    string  7 tokens  13.01 tokens  24 tokens
  • Samples:
    positive: The DPA may impose administrative fines of up to €10 million, or up to 2% of worldwide turnover. The DPA may also impose heavier fines up to €20 million, or up to 4% of worldwide turnover.
    anchor: What is the penalty for non-compliance with the GDPR in Italy?
    positive: As per the DPA, the data handler must seek consent in writing from the data subject to collect any sensitive personal data.
    anchor: What are the consent requirements under the DPA?
    positive: China's cybersecurity laws include the Cybersecurity Law, which governs various aspects of cybersecurity, data protection, and the obligations of organizations to ensure the security of networks and data within China's territory.
    anchor: What are the cybersecurity laws in China?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
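
To make the loss configuration above concrete, here is a rough NumPy sketch of the idea: a ranking loss over in-batch negatives is computed on embeddings truncated to each Matryoshka dimension, and the weighted results are summed. It is a simplified illustration, not the library's implementation; the function names are hypothetical, the `scale` of 20 is an assumed value, and the embeddings are random placeholders.

```python
import numpy as np


def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss: cross-entropy over scaled cosine similarities."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # (batch, batch); the diagonal holds the true pairs
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def matryoshka_loss(anchors, positives, dims=(768, 512, 256, 128, 64), weights=None):
    """Weighted sum of the base loss over embeddings truncated to each dimension."""
    weights = weights or [1.0] * len(dims)
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d]) for d, w in zip(dims, weights))


rng = np.random.default_rng(0)
anchor_emb = rng.normal(size=(8, 768))
# Positives are noisy copies of the anchors, so matching pairs are most similar.
positive_emb = anchor_emb + 0.1 * rng.normal(size=(8, 768))
total = matryoshka_loss(anchor_emb, positive_emb)
print(total)
```

Every other positive in the batch acts as a negative for a given anchor, which is why the `no_duplicates` batch sampler is a sensible companion to this loss.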
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
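
The combination of `learning_rate: 2e-05`, `lr_scheduler_type: cosine` and `warmup_ratio: 0.1` describes a learning rate that ramps up linearly and then decays along a cosine curve. The sketch below is an approximation of that shape under stated assumptions: `total_steps=100` is a hypothetical value chosen for illustration, and `lr_at_step` is not a library function.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5,
               warmup_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps=100 is a hypothetical value, not taken from the training run.
schedule = [lr_at_step(s, total_steps=100) for s in range(101)]
print(schedule[0], schedule[10], schedule[100])
```

The rate starts at zero, peaks at the configured value once warmup ends, and returns to roughly zero at the final step.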

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_128_cosine_map@100 dim_256_cosine_map@100 dim_512_cosine_map@100 dim_64_cosine_map@100 dim_768_cosine_map@100
1.0 1 - 0.6103 0.6310 0.6349 0.5377 0.6296
2.0 2 - 0.6556 0.6686 0.6395 0.5549 0.6469
3.0 4 - 0.6698 0.6808 0.6719 0.5812 0.6488
4.0 5 - 0.6701 0.6940 0.6701 0.6010 0.7043
5.0 6 - 0.6704 0.6940 0.6687 0.6116 0.7025
6.0 8 - 0.6807 0.6894 0.6715 0.6162 0.7039
7.0 9 - 0.6809 0.6940 0.6715 0.6154 0.7011
8.0 10 1.42 0.6808 0.6940 0.6965 0.6154 0.7011
1.0 1 - 0.6807 0.6894 0.6715 0.6162 0.7039
2.0 2 - 0.7088 0.7218 0.7039 0.6207 0.7011
3.0 4 - 0.7369 0.7218 0.7011 0.6142 0.7011
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2+cu121
  • Accelerate: 0.31.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 109M params (Safetensors, tensor type F32)