BGE base Financial Matryoshka

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5 on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0
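
These properties can be checked directly on the loaded model. A minimal sketch using standard sentence-transformers attributes (assuming the model id shown in the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cristiano-sartori/bge_ft2")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
print(model.similarity_fn_name)                  # "cosine"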

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
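
The three modules correspond to a BERT encoder, CLS-token pooling, and L2 normalization. As an illustration only (the SentenceTransformer class shown in the Usage section below remains the recommended path), the following sketch shows roughly what the pipeline computes when reimplemented with plain transformers; it assumes the repository exposes the underlying BertModel weights, which is standard for Sentence Transformers checkpoints.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "cristiano-sartori/bge_ft2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

texts = ["Which statements about coverage-guided fuzzing are correct?"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

cls_embeddings = token_embeddings[:, 0]               # pooling_mode_cls_token=True
embeddings = F.normalize(cls_embeddings, p=2, dim=1)  # the Normalize() module
print(embeddings.shape)  # torch.Size([1, 768])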

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/bge_ft2")
# Run inference
sentences = [
    "Which of the following statements about coverage-guided fuzzing is/are correct?\nA. [\nB. '\nC. R\nD. e\nE. d\nD. u\nF. n\nG. d\nH. a\nI. n",
    'To determine which statements about coverage-guided fuzzing are correct, let\'s analyze each option step by step.\n\n1. **Redundant seeds in the corpus will reduce fuzzing efficiency.**\n   - **Analysis:** This statement is generally true. In coverage-guided fuzzing, the goal is to explore as many different paths and code branches as possible. If the corpus contains many redundant seeds (i.e., inputs that lead to the same code paths), it can lead to wasted effort and reduced efficiency since the fuzzer may spend more time exploring the same paths rather than discovering new ones.\n\n2. **Counting the number of times the covered code has been executed provides a more fine-grained view of program behavior than only "covered/not covered" binary code coverage.**\n   - **Analysis:** This statement is correct. While binary code coverage only tells you whether a particular part of the code has been executed, counting the number of times each part of the code is executed (also known as edge or path coverage) provides deeper insights into the program\'s behavior. This finer granularity can help the fuzzer prioritize certain inputs that might lead to new or interesting behaviors.\n\n3. **Due to the coverage feedback, a small random perturbation of a seed can have a significant impact on further exploration.**\n   - **Analysis:** This statement is also correct. Coverage-guided fuzzers utilize feedback about which parts of the code are executed to guide their exploration. Even a small change in input can lead to different execution paths being taken, which may uncover new code that wasn\'t reached with the original seed. As such, small perturbations can indeed have a large impact on the exploration of the input space.\n\n4. **Fuzzers that have higher code coverage always find more bugs.**\n   - **Analysis:** This statement is misleading and generally false. While higher code coverage can increase the likelihood of finding bugs, it does not guarantee that more bugs will be found. Some parts of the code may be covered but not contain any bugs, while other areas might have bugs that are difficult to reach, regardless of coverage. Thus, while there is a correlation between coverage and bug discovery, it is not a strict rule that higher coverage will always lead to more bugs being found.\n\nBased on this analysis, the correct statements about coverage-guided fuzzing are:\n\n- **1. True**\n- **2. True**\n- **3. True**\n- **4. False**\n\nIn summary, statements 1, 2, and 3 are correct, while statement 4 is not.',
    "To decrypt the ciphertext c=14 c = 14  in RSA, we first need to find the private key d d  such that ecdotdequiv1modphi(n) e \\cdot d \\equiv 1 \\mod \\phi(n) , where n=pcdotq=77 n = p \\cdot q = 77  and phi(n)=(p1)(q1)=6cdot10=60 \\phi(n) = (p-1)(q-1) = 6 \\cdot 10 = 60 . \n\nGiven e=13 e = 13 , we need to find d d  such that:\n\n\\[\n13d \\equiv 1 \\mod 60\n\\]\n\nUsing the Extended Euclidean Algorithm, we find d d :\n\n1. 60=4cdot13+8 60 = 4 \\cdot 13 + 8 \n2. 13=1cdot8+5 13 = 1 \\cdot 8 + 5 \n3. 8=1cdot5+3 8 = 1 \\cdot 5 + 3 \n4. 5=1cdot3+2 5 = 1 \\cdot 3 + 2 \n5. 3=1cdot2+1 3 = 1 \\cdot 2 + 1 \n6. 2=2cdot1+0 2 = 2 \\cdot 1 + 0 \n\nBack substituting to find 1=31cdot2 1 = 3 - 1 \\cdot 2 :\n\n\\[\n1 = 3 - (5 - 1 \\cdot 3) = 2 \\cdot 3 - 5\n\\]\n\\[\n1 = 2 \\cdot (8 - 1 \\cdot 5) - 5 = 2 \\cdot 8 - 3 \\cdot 5\n\\]\n\\[\n= 2 \\cdot 8 - 3 \\cdot (13 - 1 \\cdot 8) = 5 \\cdot 8 - 3 \\cdot 13\n\\]\n\\[\n= 5 \\cdot (60 - 4 \\cdot 13) - 3 \\cdot 13 = 5 \\cdot 60 - 23 \\cdot 13\n\\]\n\nThus, dequiv23mod60 d \\equiv -23 \\mod 60 , or dequiv37mod60 d \\equiv 37 \\mod 60 .\n\nNow we can decrypt the ciphertext c=14 c = 14 :\n\n\\[\nm \\equiv c^d \\mod n\n\\]\n\\[\nm \\equiv 14^{37} \\mod 77\n\\]\n\nTo simplify this computation, we can use the Chinese Remainder Theorem by calculating mmod7 m \\mod 7  and mmod11 m \\mod 11 :\n\n1. Calculate 1437mod7 14^{37} \\mod 7 :\n   \\[\n   14 \\equiv 0 \\mod 7 \\implies 14^{37} \\equiv 0 \\mod 7\n   \\]\n\n2. Calculate 1437mod11 14^{37} \\mod 11 :\n   \\[\n   14 \\equiv 3 \\mod 11\n   \\]\n   Using Fermat's Little Theorem, 310equiv1mod11 3^{10} \\equiv 1 \\mod 11 . Thus:\n   \\[\n   37 \\mod 10 = 7 \\implies 3^{37} \\equiv 3^7 \\mod 11\n   \\]\n   We calculate 37 3^7 :\n   \\[\n   3^2 = 9, \\quad 3^4 = 81 \\equiv 4 \\mod 11\n   \\]\n   \\[\n   3^6 = 3^4 \\cdot 3^2 = 4 \\cdot 9 = 36 \\equiv 3 \\mod 11\n   \\]\n   \\[\n   3^7 = 3^6 \\cdot 3 = 3 \\cdot 3 = 9 \\mod 11\n   \\]\n\nNow we have:\n- mequiv0mod7 m \\equiv 0 \\mod 7 \n- mequiv9mod11 m \\equiv 9 \\mod 11 \n\nWe can solve these congruences using the method of successive substitutions or direct computation. \n\nLet m=7k m = 7k . Then:\n\n\\[\n7k \\equiv 9 \\mod 11 \\implies 7k = 9 + 11j\n\\]\nSolving for k k  modulo 11, we need the modular inverse of 7 mod 11, which is 8 (since 7cdot8equiv1mod11 7 \\cdot 8 \\equiv 1 \\mod 11 ). Thus:\n\n\\[\nk \\equiv 8 \\cdot 9 \\mod 11 \\equiv 72 \\mod 11 \\equiv 6 \\mod 11\n\\]\n\nSo k=11m+6 k = 11m + 6 . Substituting back, we have:\n\n\\[\nm = 7(11m + 6) = 77m + 42\n\\]\nThus, mequiv42mod77 m \\equiv 42 \\mod 77 .\n\nThe message sent was m=42 m = 42 .\n\nTherefore, the correct answer is:\n\n**$t = 42$**.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
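
Because the model was trained with MatryoshkaLoss (see Training Details), its embeddings can also be truncated to a smaller dimension (512, 256, 128, or 64) with a modest quality drop. A minimal sketch using the truncate_dim argument available in recent sentence-transformers versions:

from sentence_transformers import SentenceTransformer

# Load the same model, but truncate every embedding to its first 256 dimensions
model_256 = SentenceTransformer("cristiano-sartori/bge_ft2", truncate_dim=256)

queries = ["What does a coverage-guided fuzzer use as feedback?"]
embeddings_256 = model_256.encode(queries)
print(embeddings_256.shape)
# [1, 256]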

Evaluation

Metrics

The five Information Retrieval tables below report the same metrics evaluated at each Matryoshka dimension (768, 512, 256, 128, and 64); a sketch of how such metrics can be computed follows the last table.

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.748
cosine_accuracy@3 0.9134
cosine_accuracy@5 0.9291
cosine_accuracy@10 0.9528
cosine_precision@1 0.748
cosine_precision@3 0.3045
cosine_precision@5 0.1858
cosine_precision@10 0.0953
cosine_recall@1 0.748
cosine_recall@3 0.9134
cosine_recall@5 0.9291
cosine_recall@10 0.9528
cosine_ndcg@10 0.8627
cosine_mrr@10 0.8326
cosine_map@100 0.8333

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.7638
cosine_accuracy@3 0.9055
cosine_accuracy@5 0.9291
cosine_accuracy@10 0.9449
cosine_precision@1 0.7638
cosine_precision@3 0.3018
cosine_precision@5 0.1858
cosine_precision@10 0.0945
cosine_recall@1 0.7638
cosine_recall@3 0.9055
cosine_recall@5 0.9291
cosine_recall@10 0.9449
cosine_ndcg@10 0.8659
cosine_mrr@10 0.8394
cosine_map@100 0.8408

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.7323
cosine_accuracy@3 0.9055
cosine_accuracy@5 0.9134
cosine_accuracy@10 0.9449
cosine_precision@1 0.7323
cosine_precision@3 0.3018
cosine_precision@5 0.1827
cosine_precision@10 0.0945
cosine_recall@1 0.7323
cosine_recall@3 0.9055
cosine_recall@5 0.9134
cosine_recall@10 0.9449
cosine_ndcg@10 0.8492
cosine_mrr@10 0.8173
cosine_map@100 0.8184

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.7244
cosine_accuracy@3 0.8898
cosine_accuracy@5 0.9134
cosine_accuracy@10 0.937
cosine_precision@1 0.7244
cosine_precision@3 0.2966
cosine_precision@5 0.1827
cosine_precision@10 0.0937
cosine_recall@1 0.7244
cosine_recall@3 0.8898
cosine_recall@5 0.9134
cosine_recall@10 0.937
cosine_ndcg@10 0.8372
cosine_mrr@10 0.8045
cosine_map@100 0.806

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.6929
cosine_accuracy@3 0.8661
cosine_accuracy@5 0.9134
cosine_accuracy@10 0.9291
cosine_precision@1 0.6929
cosine_precision@3 0.2887
cosine_precision@5 0.1827
cosine_precision@10 0.0929
cosine_recall@1 0.6929
cosine_recall@3 0.8661
cosine_recall@5 0.9134
cosine_recall@10 0.9291
cosine_ndcg@10 0.8202
cosine_mrr@10 0.784
cosine_map@100 0.7859
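
The sketch below shows how retrieval metrics of this kind are typically computed with sentence-transformers' InformationRetrievalEvaluator. The queries, corpus, and relevant_docs dictionaries are placeholders: the evaluation split behind the tables above is not distributed with this card.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("cristiano-sartori/bge_ft2", truncate_dim=768)

queries = {"q1": "A simple substitution cipher can be broken ..."}       # query_id -> text (placeholder)
corpus = {
    "d1": "It can be broken by analysing letter frequencies.",           # doc_id -> text (placeholder)
    "d2": "The ENIGMA machine used a rotor-based polyalphabetic cipher.",
}
relevant_docs = {"q1": {"d1"}}                                           # query_id -> set of relevant doc_ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="dim_768",
)
results = evaluator(model)
print(results)  # contains keys such as "dim_768_cosine_ndcg@10"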

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,137 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 5 tokens, mean 107.02 tokens, max 512 tokens
    • positive: string; min 3 tokens, mean 353.32 tokens, max 512 tokens
  • Samples (each pair is shown as an anchor followed by its positive):

    Anchor: A simple substitution cipher can be broken \dots
    A. 1
    Positive: The correct answer is: A. by analysing the probability occurrence of the language.

    A simple substitution cipher replaces each letter in the plaintext with another letter, which means that the frequency of letters in the ciphertext will still reflect the frequency of letters in the original language. For instance, in English, the letter 'E' is the most commonly used letter, followed by 'T', 'A', 'O', etc. By analyzing the frequency of letters and patterns in the ciphertext, one can deduce which letters correspond to which, thereby breaking the cipher.

    Options B, C, and D are not relevant to breaking a simple substitution cipher:

    - B. only by using a quantum computer. Quantum computers are not necessary for breaking simple substitution ciphers, as they can be solved with classical techniques.

    - C. by using the ENIGMA machine. The ENIGMA machine was used for a more complex form of encryption during World War II and is not applicable to simple substitution ciphers.

    - **D...
    Anchor: Consider a Generative Adversarial Network (GAN) which successfully produces images of goats. Which of the following statements is false?

    A. T
    B. h
    C. e
    D.
    E. d
    D. i
    F. s
    G. c
    H. r
    I. i
    Positive: To determine which statement is false regarding the Generative Adversarial Network (GAN) that produces images of goats, it's essential to clarify the roles of the generator and the discriminator within the GAN framework.

    1. Generator: The generator's main function is to learn the distribution of the training data, which consists of images of goats, and to generate new images that resemble this distribution. The goal is to create synthetic images that are indistinguishable from real goat images.

    2. Discriminator: The discriminator's role is to differentiate between real images (from the training dataset) and fake images (produced by the generator). Its primary task is to classify images as real or fake, not to categorize them into specific classes like "goat" or "non-goat." The discriminator is trained to recognize whether an image comes from the real dataset or is a synthetic creation, regardless of the specific type of image.

    Now, let's analyze each option provided in the q...
    Anchor: Consider the following toy learning corpus of 59 tokens (using a tokenizer that splits on whitespaces and punctuation), out of a possible vocabulary of $N=100$ different tokens:
    Pulsed operation of lasers refers to any laser not classified as continuous wave, so that the optical power appears in pulses of some duration at some repetition rate. This\linebreak encompasses a wide range of technologies addressing a number of different motivations. Some lasers are pulsed simply because they cannot be run in continuous wave mode.
    Using a 2-gram language model, what are the values of the parameters corresponding to "continuous wave" and to "pulsed laser" using Maximum-Likelihood estimates?
    Positive: The probability of "continuous wave" is calculated as $P(\text{continuous wave})=\frac{2}{58}$ because the phrase appears twice in the bigram analysis of the 59-token corpus. In contrast, the phrase "pulsed laser" has a probability of $P(\text{pulsed laser})=0$, as it does not appear at all in the dataset, making it impossible to derive a maximum likelihood estimate for it.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
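
A minimal sketch of how an anchor/positive JSON dataset and the MatryoshkaLoss configuration above fit together (the file name train.jsonl is a placeholder; the actual training file is not distributed with this card):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Expects records of the form {"anchor": "...", "positive": "..."}
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)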
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
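
A hedged sketch mapping these non-default values onto SentenceTransformerTrainingArguments and SentenceTransformerTrainer. The output directory, data file, and train/eval split are placeholders, and save_strategy="epoch" is an assumption (it is required for load_best_model_at_end with epoch-level evaluation but is not listed above):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder file
split = dataset.train_test_split(test_size=0.1, seed=42)                 # placeholder split
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model),
                      matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="bge_ft2",                  # placeholder
    eval_strategy="epoch",
    save_strategy="epoch",                 # assumed, see note above
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=False,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    loss=loss,
)
trainer.train()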

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.2812 10 5.8639 - - - - -
0.5624 20 3.1297 - - - - -
0.8436 30 2.5823 - - - - -
1.0 36 - 0.8431 0.8461 0.8367 0.8263 0.8052
1.1125 40 0.8878 - - - - -
1.3937 50 1.1603 - - - - -
1.6749 60 0.6109 - - - - -
1.9561 70 1.7633 - - - - -
2.0 72 - 0.8590 0.8583 0.8336 0.8280 0.8039
2.2250 80 0.3261 - - - - -
2.5062 90 0.3084 - - - - -
2.7873 100 0.2973 - - - - -
3.0 108 - 0.8628 0.8713 0.8519 0.8421 0.8165
3.0562 110 0.2864 - - - - -
3.3374 120 0.1124 - - - - -
3.6186 130 0.8529 - - - - -
3.8998 140 0.3042 - - - - -
4.0 144 - 0.8612 0.8659 0.8502 0.8349 0.8171
4.1687 150 0.4779 - - - - -
4.4499 160 0.2737 - - - - -
4.7311 170 0.5733 - - - - -
5.0 180 0.0481 0.8627 0.8659 0.8492 0.8372 0.8202
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}