BGE base Financial Matryoshka

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5 on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
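
The Pooling and Normalize modules above are configured for CLS-token pooling followed by L2 normalization. As a rough illustration of those two steps only, the sketch below applies them to a toy list of token vectors in plain Python. The function name and the 4-dimensional toy input are hypothetical; the real model works on 768-dimensional hidden states produced by the BertModel layer.

```python
import math

def cls_pool_and_normalize(token_embeddings):
    """Take the first ([CLS]) token vector and scale it to unit L2 norm."""
    cls_vector = token_embeddings[0]
    norm = math.sqrt(sum(x * x for x in cls_vector))
    if norm == 0.0:
        # Nothing to scale; return the vector unchanged.
        return list(cls_vector)
    return [x / norm for x in cls_vector]

# Toy input: 3 tokens with 4-dimensional hidden states (illustrative values only).
toy_tokens = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]
sentence_embedding = cls_pool_and_normalize(toy_tokens)
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the output is unit-length, the cosine similarity between two such embeddings reduces to their dot product.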

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/bge_ft")
# Run inference
sentences = [
    'An expression is referentially transparent if it always returns the same value, no matter\nthe global state of the program. A referentially transparent expression can be replaced by its value without\nchanging the result of the program.\nSay we have a value representing a class of students and their GPAs. Given the following defintions:\n1 case class Student(gpa: Double)\n2\n3 def count(c: List[Student], student: Student): Double =\n4 c.filter(s => s == student).size\n5\n6 val students = List(\n7 Student(1.0), Student(2.0), Student(3.0),\n8 Student(4.0), Student(5.0), Student(6.0)\n9 )\nAnd the expression e:\n1 count(students, Student(6.0))',
    "Let's break this down simply. The function `count` takes a list of students and a specific student, then counts how many times that student appears in the list. In our example, we have a list of students with different GPAs.\n\nWhen we call `count(students, Student(6.0))`, we are asking how many times a student with a GPA of 6.0 is in our list. Since we have `Student(6.0)` in the list only once, the function will return 1.\n\nNow, to understand referential transparency: if we replace the call `count(students, Student(6.0))` with its value (which is 1), the overall result of the program would still remain the same. So, the expression is referentially transparent because it consistently gives us the same output (1) regardless of the program's state.",
    'To solve the problem of identifying a non-empty subset \\( S \\subsetneq V \\) in a \\( d \\)-regular graph \\( G \\) using the second eigenvector \\( v_2 \\) of the normalized adjacency matrix \\( M \\), we can follow these steps:\n\n### Step 1: Understanding Eigenvector \\( v_2 \\)\n\nThe second eigenvector \\( v_2 \\) is orthogonal to the all-ones vector \\( v_1 \\), indicating that it captures structural features of the graph related to its connected components. Its entries will have both positive and negative values, allowing us to partition the vertices.\n\n### Step 2: Properties of \\( v_2 \\)\n\n- The orthogonality to \\( v_1 \\) ensures that there are vertices with positive values (indicating one group) and negative values (indicating another group). Therefore, we can define two sets based on the sign of the entries in \\( v_2 \\).\n\n### Step 3: Designing the Procedure\n\n1. **Define the Sets:**\n   - Let \\( S = \\{ i \\in V : v_2(i) > 0 \\} \\).\n   - Let \\( T = \\{ i \\in V : v_2(i) < 0 \\} \\).\n\n2. **Check for Non-emptiness:**\n   - Since \\( v_2 \\) is orthogonal to \\( v_1 \\), at least one vertex must have a positive value and at least one must have a negative value. Hence, \\( S \\) cannot be empty, and \\( S \\neq V \\).\n\n### Step 4: Showing that \\( S \\) Cuts 0 Edges\n\nWe need to demonstrate that the cut defined by \\( S \\) does not cross any edges:\n\n- **Edge Contributions:**\n  - For any edge \\( (i, j) \\) in the graph, if one vertex belongs to \\( S \\) and the other to \\( T \\), the eigenvalue relationship \\( M \\cdot v_2 = v_2 \\) indicates that the edge would create a mismatched contribution, leading to a contradiction. This implies that no edges can exist between \\( S \\) and \\( T \\).\n\n### Final Procedure\n\nThe procedure can be summarized as follows:\n\n```plaintext\nProcedure FindDisconnectedSet(v_2):\n    S = { i ∈ V : v_2(i) > 0 }\n    T = { i ∈ V : v_2(i) < 0 }\n    \n    if S is empty:\n        return T\n    else:\n        return S\n```\n\n### Conclusion\n\nThis algorithm ensures that we find a non-empty subset \\( S \\subsetneq V \\) that defines a cut with no edges crossing between \\( S \\) and \\( V \\setminus S \\), under the condition that \\( \\lambda_2 = 1 \\).',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
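
Because the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128, and 64, its embeddings are meant to remain useful when only a leading prefix of each vector is kept. The sketch below illustrates that idea in plain Python: keep the first `dim` components, re-normalize, and compare with cosine similarity. The helper names and the synthetic vectors are assumptions for illustration; they stand in for real `model.encode` output.

```python
import math

MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]  # taken from the training configuration

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Synthetic stand-ins for two 768-dimensional embeddings (not real model output).
emb_a = [math.sin(0.01 * i + 0.1) for i in range(768)]
emb_b = [math.sin(0.01 * i + 0.4) for i in range(768)]

for dim in MATRYOSHKA_DIMS:
    a = truncate_and_normalize(emb_a, dim)
    b = truncate_and_normalize(emb_b, dim)
    print(dim, len(a), round(cosine_similarity(a, b), 4))
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument on the `SentenceTransformer` constructor, which should achieve the same truncation at encode time; check the documentation of the installed version before relying on it.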

Evaluation

Metrics

Information Retrieval (dim_768)

Metric Value
cosine_accuracy@1 0.2737
cosine_accuracy@3 0.786
cosine_accuracy@5 0.8491
cosine_accuracy@10 0.9439
cosine_precision@1 0.2737
cosine_precision@3 0.262
cosine_precision@5 0.1698
cosine_precision@10 0.0944
cosine_recall@1 0.2737
cosine_recall@3 0.786
cosine_recall@5 0.8491
cosine_recall@10 0.9439
cosine_ndcg@10 0.6171
cosine_mrr@10 0.5102
cosine_map@100 0.5136

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.2772
cosine_accuracy@3 0.7754
cosine_accuracy@5 0.8561
cosine_accuracy@10 0.9474
cosine_precision@1 0.2772
cosine_precision@3 0.2585
cosine_precision@5 0.1712
cosine_precision@10 0.0947
cosine_recall@1 0.2772
cosine_recall@3 0.7754
cosine_recall@5 0.8561
cosine_recall@10 0.9474
cosine_ndcg@10 0.6197
cosine_mrr@10 0.5127
cosine_map@100 0.5158

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.2632
cosine_accuracy@3 0.7649
cosine_accuracy@5 0.8526
cosine_accuracy@10 0.9368
cosine_precision@1 0.2632
cosine_precision@3 0.255
cosine_precision@5 0.1705
cosine_precision@10 0.0937
cosine_recall@1 0.2632
cosine_recall@3 0.7649
cosine_recall@5 0.8526
cosine_recall@10 0.9368
cosine_ndcg@10 0.6108
cosine_mrr@10 0.5039
cosine_map@100 0.508

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.2596
cosine_accuracy@3 0.7544
cosine_accuracy@5 0.8386
cosine_accuracy@10 0.9263
cosine_precision@1 0.2596
cosine_precision@3 0.2515
cosine_precision@5 0.1677
cosine_precision@10 0.0926
cosine_recall@1 0.2596
cosine_recall@3 0.7544
cosine_recall@5 0.8386
cosine_recall@10 0.9263
cosine_ndcg@10 0.6009
cosine_mrr@10 0.4944
cosine_map@100 0.4987

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.2596
cosine_accuracy@3 0.7439
cosine_accuracy@5 0.8351
cosine_accuracy@10 0.9298
cosine_precision@1 0.2596
cosine_precision@3 0.248
cosine_precision@5 0.167
cosine_precision@10 0.093
cosine_recall@1 0.2596
cosine_recall@3 0.7439
cosine_recall@5 0.8351
cosine_recall@10 0.9298
cosine_ndcg@10 0.597
cosine_mrr@10 0.4888
cosine_map@100 0.4924
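
In the tables above, each precision@k value equals accuracy@k divided by k (for example, 0.786 / 3 = 0.262), and recall@k equals accuracy@k. That pattern is what one would expect if every query has exactly one relevant document. Under that assumption, the following sketch shows how the per-query metrics could be computed and averaged. The function name, document ids, and rankings are hypothetical and are not taken from the evaluation set.

```python
import math

def ir_metrics(ranked_ids, relevant_id, k):
    """Metrics at cutoff k for a query with exactly one relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    rank = top_k.index(relevant_id) + 1 if hit else None
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        "mrr": (1.0 / rank) if hit else 0.0,
        # With a single relevant document the ideal DCG is 1,
        # so NDCG reduces to the discounted gain at the hit rank.
        "ndcg": (1.0 / math.log2(rank + 1)) if hit else 0.0,
    }

# Hypothetical rankings: (ranked document ids, id of the relevant document).
queries = [
    (["d3", "d1", "d7"], "d1"),  # relevant document at rank 2
    (["d2", "d5", "d9"], "d2"),  # relevant document at rank 1
    (["d4", "d8", "d6"], "d0"),  # relevant document not retrieved
]
k = 3
per_query = [ir_metrics(ranked, relevant, k) for ranked, relevant in queries]
averaged = {
    name: sum(m[name] for m in per_query) / len(per_query)
    for name in per_query[0]
}
print(averaged)
```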

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,140 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    anchor (type: string)
    • min: 5 tokens
    • mean: 168.34 tokens
    • max: 512 tokens
    positive (type: string)
    • min: 25 tokens
    • mean: 374.7 tokens
    • max: 512 tokens
  • Samples:
    Sample 1

    anchor: Consider the task of classifying reviews as positive or negative. To create a reference for this task, two human annotators were asked to rate 1000 movie reviews as positive or negative. The first annotator rated {a} reviews as positive and the rest as negative. The second annotator rated {b} reviews as positive and the rest as negative. 80 reviews were rated as positive by both annotators. What is the raw agreement between the two annotators? Give your answer as a numerical value to three decimal places.

    positive: To calculate the raw agreement between the two annotators, we can use the following formula:

    \[
    \text{Raw Agreement} = \frac{\text{Number of agreements}}{\text{Total number of reviews}}
    \]

    1. Identify the total number of reviews: In this case, it is given that there are 1000 movie reviews.

    2. Identify the number of agreements: The agreements consist of the reviews that both annotators rated as positive or both rated as negative. We know that:
    - Both annotators rated 80 reviews as positive.
    - To find the number of reviews both rated as negative, we need to calculate how many reviews each annotator rated as negative.

    Let’s denote:
    - \( a \): the number of positive reviews by Annotator 1
    - \( b \): the number of positive reviews by Annotator 2

    Thus, the number of negative reviews for each annotator would be:
    - Negative reviews by Annotator 1 = \( 1000 - a \)
    - Negative reviews by Annotator 2 = \( 1000 - b \)

    3. Calculate the total agreements:
    -...

    Sample 2

    anchor: Let $y_1, y_2, \ldots, y_n$ be uniform random bits. For each non-empty subset $S\subseteq \{1,2, \ldots, n\}$, define $X_S = \oplus_{i\in S}\:y_i$. Show that the bits $\{X_S: \emptyset \neq S\subseteq \{1,2, \ldots, n\} \}$ are pairwise independent. This shows how to stretch $n$ truly random bits to $2^n-1$ pairwise independent bits. \\ \emph{Hint: Observe that it is sufficient to prove $\mathbb{E}[X_S] = 1/2$ and $\mathbb{E}[X_S X_T] = 1/4$ to show that they are pairwise independent. Also use the identity $\oplus_{i\in A}\: y_i = \frac{1}{2}\left( 1 - \prod_{i\in A} (-1)^{y_i} \right)$.}

    positive: To demonstrate that the random variables \( \{X_S : S \subseteq \{1, 2, \ldots, n\}, S \neq \emptyset\} \) are pairwise independent, we need to show two things:

    1. The expected value \( \mathbb{E}[X_S] = \frac{1}{2} \) for any non-empty subset \( S \).
    2. The expected value of the product of any two variables \( X_S \) and \( X_T \) (where \( S \) and \( T \) are non-empty subsets of \(\{1,2,\ldots,n\}\)) satisfies \( \mathbb{E}[X_S X_T] = \frac{1}{4} \).

    ### Step 1: Calculate \( \mathbb{E}[X_S] \)

    The random variable \( X_S \) is defined as the XOR (exclusive OR) of bits indexed by elements of \( S \):

    \[
    X_S = \oplus_{i \in S} y_i
    \]

    For each \( y_i \), since it is a uniform random bit, we have:

    \[
    \mathbb{E}[y_i] = \frac{1}{2}
    \]

    The XOR operation \( X_S = y_{i_1} \oplus y_{i_2} \oplus \ldots \oplus y_{i_k} \) (where \( S = \{i_1, i_2, \ldots, i_k\} \)) can take the value 0 or 1. The expected value of \( X_S \) can be computed as follows:

    1. The outcome \( X_S = 0 \) occurs ...

    Sample 3

    anchor: We have a collection of rectangles in a plane, whose sides are aligned with the coordinate axes. Each rectangle is represented by its lower left corner $(x_1,y_1)$ and its upper right corner $(x_2,y_2)$. All coordinates are of type Long. We require $x_1 \le x_2$ and $y_1 \le y_2$. Define a case class Rectangle storing two corners.

    positive: ### Summary

    To represent rectangles in a plane with aligned sides, we can define a case class in Scala that captures the necessary properties while enforcing the required constraints on the coordinates. Each rectangle will be defined by its lower left corner \((x_1, y_1)\) and its upper right corner \((x_2, y_2)\). We will ensure that \(x_1 \le x_2\) and \(y_1 \le y_2\) through constructor validation.

    ### Implementation

    Here’s a concise implementation of the Rectangle case class with validation:

        case class Rectangle(x1: Long, y1: Long, x2: Long, y2: Long) {
          require(x1 <= x2, "x1 must be less than or equal to x2")
          require(y1 <= y2, "y1 must be less than or equal to y2")
        }

    ### Explanation

    1. Case Class Definition: The Rectangle case class is defined with four parameters: x1, y1, x2, and y2, all of type Long.

    2. Constraints Enforcement: The require statements in the constructor ensure that the specified conditions \(x_1 \le x_2\) and \(y_1 ...
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
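
To make the configuration above more concrete, here is a minimal pure-Python sketch of the general idea: a base in-batch ranking loss is evaluated on prefix-truncated embeddings for each Matryoshka dimension, and the results are combined as a weighted sum. This is not the library implementation. The function names, the toy 8-dimensional vectors, and the similarity scale of 20 are assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    positives[i] is the target for anchors[i]; every other positive
    in the batch acts as a negative. The scale value is an assumption.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(ai, pj)) for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over prefix-truncated embeddings."""
    return sum(
        w * in_batch_ranking_loss([v[:d] for v in anchors],
                                  [v[:d] for v in positives])
        for d, w in zip(dims, weights)
    )

# Toy batch of two (anchor, positive) pairs with 8-dimensional vectors.
anchors = [[1.0, 0.2, 0.0, 0.1, 0.3, 0.0, 0.2, 0.1],
           [0.0, 1.0, 0.3, 0.0, 0.1, 0.4, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.1, 0.0, 0.2, 0.1, 0.3, 0.0],
             [0.1, 0.8, 0.2, 0.1, 0.0, 0.5, 0.1, 0.1]]
loss = matryoshka_loss(anchors, positives, dims=[8, 4, 2], weights=[1, 1, 1])
print(loss)
```

With equal weights, as in the configuration above, each truncation level contributes equally to the training signal.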
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
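
The hyperparameters above imply an effective batch size and a step count that can be checked against the training logs, where step 36 corresponds to epoch 1.0 and step 180 to epoch 5.0. The sketch below works through that arithmetic and a cosine schedule with linear warmup. It assumes a single device and a schedule that decays to zero at the last step; `lr_at` is a hypothetical helper, not a library function.

```python
import math

num_samples = 1140        # training set size
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 16           # gradient_accumulation_steps
epochs = 5                # num_train_epochs
warmup_ratio = 0.1
peak_lr = 2e-5            # learning_rate

# Assumes a single device, so no multiplication by a device count.
effective_batch = per_device_batch * grad_accum
micro_batches_per_epoch = math.ceil(num_samples / per_device_batch)
steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)  # 32 36 180 18
```

The small per-device batch combined with 16 accumulation steps keeps memory use low while still updating on 32 examples at a time. The number of in-batch negatives seen by the loss is still bounded by the per-device batch, not by the accumulated one.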

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.2807 10 2.2566 - - - - -
0.5614 20 0.7721 - - - - -
0.8421 30 0.339 - - - - -
1.0 36 - 0.6171 0.6157 0.6205 0.6049 0.5968
1.1123 40 0.5523 - - - - -
1.3930 50 0.14 - - - - -
1.6737 60 0.0521 - - - - -
1.9544 70 0.0242 - - - - -
2.0 72 - 0.6153 0.6131 0.6077 0.6042 0.5929
2.2246 80 0.5093 - - - - -
2.5053 90 0.0524 - - - - -
2.7860 100 0.0772 - - - - -
3.0 108 - 0.6141 0.6182 0.6108 0.6042 0.5901
3.0561 110 0.0347 - - - - -
3.3368 120 0.1168 - - - - -
3.6175 130 0.8566 - - - - -
3.8982 140 0.0254 - - - - -
4.0 144 - 0.6160 0.6177 0.6091 0.6020 0.5927
4.1684 150 0.2141 - - - - -
4.4491 160 0.0344 - - - - -
4.7298 170 0.8643 - - - - -
5.0 180 0.019 0.6171 0.6197 0.6108 0.6009 0.5970
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}