BGE base Financial Matryoshka

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5 on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0
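
These properties can be checked directly on the loaded model. A minimal sketch using standard sentence-transformers attributes (assuming the model id shown in the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cristiano-sartori/bge_ft2")
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
print(model.similarity_fn_name)                  # "cosine"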

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
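
The three modules correspond to a BERT encoder, CLS-token pooling, and L2 normalization. As an illustration only (the SentenceTransformer class shown in the Usage section below remains the recommended path), the following sketch shows roughly what the pipeline computes when reimplemented with plain transformers; it assumes the repository exposes the underlying BertModel weights, which is standard for Sentence Transformers checkpoints.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "cristiano-sartori/bge_ft2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

texts = ["Which statements about coverage-guided fuzzing are correct?"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

cls_embeddings = token_embeddings[:, 0]               # pooling_mode_cls_token=True
embeddings = F.normalize(cls_embeddings, p=2, dim=1)  # the Normalize() module
print(embeddings.shape)  # torch.Size([1, 768])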

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/bge_ft2")
# Run inference
sentences = [
    "Which of the following statements about coverage-guided fuzzing is/are correct?\nA. [\nB. '\nC. R\nD. e\nE. d\nD. u\nF. n\nG. d\nH. a\nI. n",
    'To determine which statements about coverage-guided fuzzing are correct, let\'s analyze each option step by step.\n\n1. **Redundant seeds in the corpus will reduce fuzzing efficiency.**\n   - **Analysis:** This statement is generally true. In coverage-guided fuzzing, the goal is to explore as many different paths and code branches as possible. If the corpus contains many redundant seeds (i.e., inputs that lead to the same code paths), it can lead to wasted effort and reduced efficiency since the fuzzer may spend more time exploring the same paths rather than discovering new ones.\n\n2. **Counting the number of times the covered code has been executed provides a more fine-grained view of program behavior than only "covered/not covered" binary code coverage.**\n   - **Analysis:** This statement is correct. While binary code coverage only tells you whether a particular part of the code has been executed, counting the number of times each part of the code is executed (also known as edge or path coverage) provides deeper insights into the program\'s behavior. This finer granularity can help the fuzzer prioritize certain inputs that might lead to new or interesting behaviors.\n\n3. **Due to the coverage feedback, a small random perturbation of a seed can have a significant impact on further exploration.**\n   - **Analysis:** This statement is also correct. Coverage-guided fuzzers utilize feedback about which parts of the code are executed to guide their exploration. Even a small change in input can lead to different execution paths being taken, which may uncover new code that wasn\'t reached with the original seed. As such, small perturbations can indeed have a large impact on the exploration of the input space.\n\n4. **Fuzzers that have higher code coverage always find more bugs.**\n   - **Analysis:** This statement is misleading and generally false. While higher code coverage can increase the likelihood of finding bugs, it does not guarantee that more bugs will be found. Some parts of the code may be covered but not contain any bugs, while other areas might have bugs that are difficult to reach, regardless of coverage. Thus, while there is a correlation between coverage and bug discovery, it is not a strict rule that higher coverage will always lead to more bugs being found.\n\nBased on this analysis, the correct statements about coverage-guided fuzzing are:\n\n- **1. True**\n- **2. True**\n- **3. True**\n- **4. False**\n\nIn summary, statements 1, 2, and 3 are correct, while statement 4 is not.',
    "To decrypt the ciphertext c=14 c = 14  in RSA, we first need to find the private key d d  such that ecdotdequiv1modphi(n) e \\cdot d \\equiv 1 \\mod \\phi(n) , where n=pcdotq=77 n = p \\cdot q = 77  and phi(n)=(p1)(q1)=6cdot10=60 \\phi(n) = (p-1)(q-1) = 6 \\cdot 10 = 60 . \n\nGiven e=13 e = 13 , we need to find d d  such that:\n\n\\[\n13d \\equiv 1 \\mod 60\n\\]\n\nUsing the Extended Euclidean Algorithm, we find d d :\n\n1. 60=4cdot13+8 60 = 4 \\cdot 13 + 8 \n2. 13=1cdot8+5 13 = 1 \\cdot 8 + 5 \n3. 8=1cdot5+3 8 = 1 \\cdot 5 + 3 \n4. 5=1cdot3+2 5 = 1 \\cdot 3 + 2 \n5. 3=1cdot2+1 3 = 1 \\cdot 2 + 1 \n6. 2=2cdot1+0 2 = 2 \\cdot 1 + 0 \n\nBack substituting to find 1=31cdot2 1 = 3 - 1 \\cdot 2 :\n\n\\[\n1 = 3 - (5 - 1 \\cdot 3) = 2 \\cdot 3 - 5\n\\]\n\\[\n1 = 2 \\cdot (8 - 1 \\cdot 5) - 5 = 2 \\cdot 8 - 3 \\cdot 5\n\\]\n\\[\n= 2 \\cdot 8 - 3 \\cdot (13 - 1 \\cdot 8) = 5 \\cdot 8 - 3 \\cdot 13\n\\]\n\\[\n= 5 \\cdot (60 - 4 \\cdot 13) - 3 \\cdot 13 = 5 \\cdot 60 - 23 \\cdot 13\n\\]\n\nThus, dequiv23mod60 d \\equiv -23 \\mod 60 , or dequiv37mod60 d \\equiv 37 \\mod 60 .\n\nNow we can decrypt the ciphertext c=14 c = 14 :\n\n\\[\nm \\equiv c^d \\mod n\n\\]\n\\[\nm \\equiv 14^{37} \\mod 77\n\\]\n\nTo simplify this computation, we can use the Chinese Remainder Theorem by calculating mmod7 m \\mod 7  and mmod11 m \\mod 11 :\n\n1. Calculate 1437mod7 14^{37} \\mod 7 :\n   \\[\n   14 \\equiv 0 \\mod 7 \\implies 14^{37} \\equiv 0 \\mod 7\n   \\]\n\n2. Calculate 1437mod11 14^{37} \\mod 11 :\n   \\[\n   14 \\equiv 3 \\mod 11\n   \\]\n   Using Fermat's Little Theorem, 310equiv1mod11 3^{10} \\equiv 1 \\mod 11 . Thus:\n   \\[\n   37 \\mod 10 = 7 \\implies 3^{37} \\equiv 3^7 \\mod 11\n   \\]\n   We calculate 37 3^7 :\n   \\[\n   3^2 = 9, \\quad 3^4 = 81 \\equiv 4 \\mod 11\n   \\]\n   \\[\n   3^6 = 3^4 \\cdot 3^2 = 4 \\cdot 9 = 36 \\equiv 3 \\mod 11\n   \\]\n   \\[\n   3^7 = 3^6 \\cdot 3 = 3 \\cdot 3 = 9 \\mod 11\n   \\]\n\nNow we have:\n- mequiv0mod7 m \\equiv 0 \\mod 7 \n- mequiv9mod11 m \\equiv 9 \\mod 11 \n\nWe can solve these congruences using the method of successive substitutions or direct computation. \n\nLet m=7k m = 7k . Then:\n\n\\[\n7k \\equiv 9 \\mod 11 \\implies 7k = 9 + 11j\n\\]\nSolving for k k  modulo 11, we need the modular inverse of 7 mod 11, which is 8 (since 7cdot8equiv1mod11 7 \\cdot 8 \\equiv 1 \\mod 11 ). Thus:\n\n\\[\nk \\equiv 8 \\cdot 9 \\mod 11 \\equiv 72 \\mod 11 \\equiv 6 \\mod 11\n\\]\n\nSo k=11m+6 k = 11m + 6 . Substituting back, we have:\n\n\\[\nm = 7(11m + 6) = 77m + 42\n\\]\nThus, mequiv42mod77 m \\equiv 42 \\mod 77 .\n\nThe message sent was m=42 m = 42 .\n\nTherefore, the correct answer is:\n\n**$t = 42$**.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
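
Because the model was trained with MatryoshkaLoss (see Training Details), its embeddings can also be truncated to a smaller dimension (512, 256, 128, or 64) with a modest quality drop. A minimal sketch using the truncate_dim argument available in recent sentence-transformers versions:

from sentence_transformers import SentenceTransformer

# Load the same model, but truncate every embedding to its first 256 dimensions
model_256 = SentenceTransformer("cristiano-sartori/bge_ft2", truncate_dim=256)

queries = ["What does a coverage-guided fuzzer use as feedback?"]
embeddings_256 = model_256.encode(queries)
print(embeddings_256.shape)
# [1, 256]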

Evaluation

Metrics

The five Information Retrieval tables below report the same metrics evaluated at each Matryoshka dimension (768, 512, 256, 128, and 64); a sketch of how such metrics can be computed follows the last table.

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.748
cosine_accuracy@3 0.9134
cosine_accuracy@5 0.9291
cosine_accuracy@10 0.9528
cosine_precision@1 0.748
cosine_precision@3 0.3045
cosine_precision@5 0.1858
cosine_precision@10 0.0953
cosine_recall@1 0.748
cosine_recall@3 0.9134
cosine_recall@5 0.9291
cosine_recall@10 0.9528
cosine_ndcg@10 0.8627
cosine_mrr@10 0.8326
cosine_map@100 0.8333

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.7638
cosine_accuracy@3 0.9055
cosine_accuracy@5 0.9291
cosine_accuracy@10 0.9449
cosine_precision@1 0.7638
cosine_precision@3 0.3018
cosine_precision@5 0.1858
cosine_precision@10 0.0945
cosine_recall@1 0.7638
cosine_recall@3 0.9055
cosine_recall@5 0.9291
cosine_recall@10 0.9449
cosine_ndcg@10 0.8659
cosine_mrr@10 0.8394
cosine_map@100 0.8408

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.7323
cosine_accuracy@3 0.9055
cosine_accuracy@5 0.9134
cosine_accuracy@10 0.9449
cosine_precision@1 0.7323
cosine_precision@3 0.3018
cosine_precision@5 0.1827
cosine_precision@10 0.0945
cosine_recall@1 0.7323
cosine_recall@3 0.9055
cosine_recall@5 0.9134
cosine_recall@10 0.9449
cosine_ndcg@10 0.8492
cosine_mrr@10 0.8173
cosine_map@100 0.8184

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.7244
cosine_accuracy@3 0.8898
cosine_accuracy@5 0.9134
cosine_accuracy@10 0.937
cosine_precision@1 0.7244
cosine_precision@3 0.2966
cosine_precision@5 0.1827
cosine_precision@10 0.0937
cosine_recall@1 0.7244
cosine_recall@3 0.8898
cosine_recall@5 0.9134
cosine_recall@10 0.937
cosine_ndcg@10 0.8372
cosine_mrr@10 0.8045
cosine_map@100 0.806

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.6929
cosine_accuracy@3 0.8661
cosine_accuracy@5 0.9134
cosine_accuracy@10 0.9291
cosine_precision@1 0.6929
cosine_precision@3 0.2887
cosine_precision@5 0.1827
cosine_precision@10 0.0929
cosine_recall@1 0.6929
cosine_recall@3 0.8661
cosine_recall@5 0.9134
cosine_recall@10 0.9291
cosine_ndcg@10 0.8202
cosine_mrr@10 0.784
cosine_map@100 0.7859
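
The sketch below shows how retrieval metrics of this kind are typically computed with sentence-transformers' InformationRetrievalEvaluator. The queries, corpus, and relevant_docs dictionaries are placeholders: the evaluation split behind the tables above is not distributed with this card.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("cristiano-sartori/bge_ft2", truncate_dim=768)

queries = {"q1": "A simple substitution cipher can be broken ..."}       # query_id -> text (placeholder)
corpus = {
    "d1": "It can be broken by analysing letter frequencies.",           # doc_id -> text (placeholder)
    "d2": "The ENIGMA machine used a rotor-based polyalphabetic cipher.",
}
relevant_docs = {"q1": {"d1"}}                                           # query_id -> set of relevant doc_ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="dim_768",
)
results = evaluator(model)
print(results)  # contains keys such as "dim_768_cosine_ndcg@10"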

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,137 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 5 tokens, mean 107.02 tokens, max 512 tokens
    • positive: string; min 3 tokens, mean 353.32 tokens, max 512 tokens
  • Samples (each pair is shown as an anchor followed by its positive):

    Anchor: A simple substitution cipher can be broken \dots
    A. 1
    Positive: The correct answer is: A. by analysing the probability occurrence of the language.

    A simple substitution cipher replaces each letter in the plaintext with another letter, which means that the frequency of letters in the ciphertext will still reflect the frequency of letters in the original language. For instance, in English, the letter 'E' is the most commonly used letter, followed by 'T', 'A', 'O', etc. By analyzing the frequency of letters and patterns in the ciphertext, one can deduce which letters correspond to which, thereby breaking the cipher.

    Options B, C, and D are not relevant to breaking a simple substitution cipher:

    - B. only by using a quantum computer. Quantum computers are not necessary for breaking simple substitution ciphers, as they can be solved with classical techniques.

    - C. by using the ENIGMA machine. The ENIGMA machine was used for a more complex form of encryption during World War II and is not applicable to simple substitution ciphers.

    - **D...
    Anchor: Consider a Generative Adversarial Network (GAN) which successfully produces images of goats. Which of the following statements is false?

    A. T
    B. h
    C. e
    D.
    E. d
    D. i
    F. s
    G. c
    H. r
    I. i
    Positive: To determine which statement is false regarding the Generative Adversarial Network (GAN) that produces images of goats, it's essential to clarify the roles of the generator and the discriminator within the GAN framework.

    1. Generator: The generator's main function is to learn the distribution of the training data, which consists of images of goats, and to generate new images that resemble this distribution. The goal is to create synthetic images that are indistinguishable from real goat images.

    2. Discriminator: The discriminator's role is to differentiate between real images (from the training dataset) and fake images (produced by the generator). Its primary task is to classify images as real or fake, not to categorize them into specific classes like "goat" or "non-goat." The discriminator is trained to recognize whether an image comes from the real dataset or is a synthetic creation, regardless of the specific type of image.

    Now, let's analyze each option provided in the q...
    Anchor: Consider the following toy learning corpus of 59 tokens (using a tokenizer that splits on whitespaces and punctuation), out of a possible vocabulary of $N=100$ different tokens:
    Pulsed operation of lasers refers to any laser not classified as continuous wave, so that the optical power appears in pulses of some duration at some repetition rate. This\linebreak encompasses a wide range of technologies addressing a number of different motivations. Some lasers are pulsed simply because they cannot be run in continuous wave mode.
    Using a 2-gram language model, what are the values of the parameters corresponding to "continuous wave" and to "pulsed laser" using Maximum-Likelihood estimates?
    Positive: The probability of "continuous wave" is calculated as $P(\text{continuous wave})=\frac{2}{58}$ because the phrase appears twice in the bigram analysis of the 59-token corpus. In contrast, the phrase "pulsed laser" has a probability of $P(\text{pulsed laser})=0$, as it does not appear at all in the dataset, making it impossible to derive a maximum likelihood estimate for it.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
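
A minimal sketch of how an anchor/positive JSON dataset and the MatryoshkaLoss configuration above fit together (the file name train.jsonl is a placeholder; the actual training file is not distributed with this card):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Expects records of the form {"anchor": "...", "positive": "..."}
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)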
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
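
A hedged sketch mapping these non-default values onto SentenceTransformerTrainingArguments and SentenceTransformerTrainer. The output directory, data file, and train/eval split are placeholders, and save_strategy="epoch" is an assumption (it is required for load_best_model_at_end with epoch-level evaluation but is not listed above):

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder file
split = dataset.train_test_split(test_size=0.1, seed=42)                 # placeholder split
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model),
                      matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="bge_ft2",                  # placeholder
    eval_strategy="epoch",
    save_strategy="epoch",                 # assumed, see note above
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=False,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    loss=loss,
)
trainer.train()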

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.2812 10 5.8639 - - - - -
0.5624 20 3.1297 - - - - -
0.8436 30 2.5823 - - - - -
1.0 36 - 0.8431 0.8461 0.8367 0.8263 0.8052
1.1125 40 0.8878 - - - - -
1.3937 50 1.1603 - - - - -
1.6749 60 0.6109 - - - - -
1.9561 70 1.7633 - - - - -
2.0 72 - 0.8590 0.8583 0.8336 0.8280 0.8039
2.2250 80 0.3261 - - - - -
2.5062 90 0.3084 - - - - -
2.7873 100 0.2973 - - - - -
3.0 108 - 0.8628 0.8713 0.8519 0.8421 0.8165
3.0562 110 0.2864 - - - - -
3.3374 120 0.1124 - - - - -
3.6186 130 0.8529 - - - - -
3.8998 140 0.3042 - - - - -
4.0 144 - 0.8612 0.8659 0.8502 0.8349 0.8171
4.1687 150 0.4779 - - - - -
4.4499 160 0.2737 - - - - -
4.7311 170 0.5733 - - - - -
5.0 180 0.0481 0.8627 0.8659 0.8492 0.8372 0.8202
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}