SentenceTransformer based on NovaSearch/stella_en_400M_v5

This is a sentence-transformers model finetuned from NovaSearch/stella_en_400M_v5 on the json dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: NovaSearch/stella_en_400M_v5
  • Model Size: ~434M parameters (F32 safetensors)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
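
This pipeline is mean pooling over the base model's token embeddings followed by an identity-activated 1024-to-1024 linear projection. For illustration only, a roughly equivalent pipeline could be assembled by hand with the Sentence Transformers modules API; this is a sketch (loading the published checkpoint already bundles these modules, and the trust_remote_code flags are an assumption required by the custom NewModel architecture):

from torch import nn
from sentence_transformers import SentenceTransformer, models

# Sketch: manual assembly of the Transformer -> Pooling -> Dense pipeline
transformer = models.Transformer(
    "NovaSearch/stella_en_400M_v5",
    max_seq_length=512,
    model_args={"trust_remote_code": True},      # assumption: NewModel is custom code
    tokenizer_args={"trust_remote_code": True},
)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 1024
    pooling_mode="mean",
)
dense = models.Dense(
    in_features=1024, out_features=1024, bias=True,
    activation_function=nn.Identity(),
)
model = SentenceTransformer(modules=[transformer, pooling, dense])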

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/stella_finetuned1")
# Run inference
sentences = [
    'Describe the techniques that typical dynamically scheduled\n            processors use to achieve the same purpose of the following features\n            of Intel Itanium: (a) Predicated execution; (b) advanced\n            loads---that is, loads moved before a store and explicit check for\n            RAW hazards; (c) speculative loads---that is, loads moved before a\n            branch and explicit check for exceptions; (d) rotating register\n            file.',
    'Dynamically scheduled processors are designed to improve the efficiency of instruction execution by allowing the CPU to make decisions at runtime about the order of instruction execution. Let\'s break down each feature you mentioned from the Intel Itanium architecture and see how typical dynamically scheduled processors achieve similar goals.\n\n### (a) Predicated Execution\n\n**Intuition:**\nPredicated execution allows the processor to execute instructions based on certain conditions without using traditional branching (like `if` statements). This helps to avoid pipeline stalls that can occur when a branch is taken.\n\n**Example:**\nImagine you have the following pseudo-code:\n```c\nif (x > 0) {\n    y = z + 1;\n} else {\n    y = z - 1;\n}\n```\n\nIn a predicated execution model, instead of branching, the processor can execute both instructions but use a predicate (a boolean condition) to determine which result to keep:\n```assembly\np1 = (x > 0)\ny1 = z + 1; // Execute regardless\ny2 = z - 1; // Execute regardless\ny = p1 ? y1 : y2; // Keep the result based on p1\n```\n\n**Dynamically Scheduled Processors:**\nThese processors use techniques like "instruction scheduling" and "register renaming" to allow for instructions to be executed out of order while avoiding the pitfalls of branches. The hardware can evaluate conditions ahead of time and execute the necessary instructions while keeping track of which values are valid.\n\n### (b) Advanced Loads\n\n**Intuition:**\nAdvanced loads allow the processor to move load instructions (fetching data from memory) ahead of store instructions (writing data to memory), while also checking for Read After Write (RAW) hazards to ensure data correctness.\n\n**Example:**\nConsider the following pseudo-code:\n```c\na = b; // Store b into a\nc = a; // Load a into c\n```\n\nIf `a` is stored before it\'s loaded again, there can be a dependency. Advanced load techniques allow the processor to load `c = a` even before it knows if the value of `a` has been updated, as long as it checks that no store operation that modifies `a` has occurred.\n\n**Dynamically Scheduled Processors:**\nThese processors often use a technique called "out-of-order execution." They keep track of the status of loads and stores in a structure like a reorder buffer. When a load is scheduled, the processor checks if any previous instructions modify the value it needs (checking for RAW hazards) before executing the load.\n\n### (c) Speculative Loads\n\n**Intuition:**\nSpeculative loads allow the processor to execute load instructions before it knows the outcome of branches, which can keep the pipeline filled and reduce stalls.\n\n**Example:**\nConsider a branch that depends on a condition:\n```c\nif (x > 0) {\n    a = b + c;\n}\n```\n\nInstead of waiting for the branch to be resolved, a speculative load might execute `load a` regardless of the branch\'s result. If the branch is taken, the processor can discard this load if it\'s not necessary.\n\n**Dynamically Scheduled Processors:**\nThese processors utilize "speculative execution," where they predict the likely path of execution based on past behavior. They perform loads and even entire blocks of instructions speculatively and have mechanisms to roll back if they guessed wrong while checking for exceptions (like accessing invalid memory).\n\n### (d) Rotating Register File\n\n**Intuition:**\nA rotating register file allows the processor to efficiently manage registers, effectively giving it more registers to work with by cycling through them for different contexts or states.\n\n**Example:**\nIn a simple program, if you have a limited number of registers but multiple functions, rotating registers means that as soon as one function completes, its registers can be reused for the next function without needing to save them to memory.\n\n**Dynamically Scheduled Processors:**\nMany dynamically scheduled processors use a "register renaming" technique, which allows them to allocate physical registers dynamically. When an instruction is ready to execute, it checks which registers are free and assigns one, effectively "rotating" the use of registers without the programmer needing to manage this directly.\n\n### Summary\n\nIn summary, dynamically scheduled processors use advanced techniques like out-of-order execution, speculative execution, and register renaming to achieve efficiency and performance similar to the features found in Intel Itanium. These techniques help to minimize stalls and maximize instruction throughput by allowing more flexibility in how instructions are executed relative to their dependencies and branch outcomes.',
    "The question at hand explores whether it is possible to add new documents to a collection such that one document, d1d_1, is ranked higher than another document, d2d_2, based on a specific query, while also allowing for the possibility of ranking d2d_2 higher than d1d_1 simultaneously.\n\nTo analyze this problem, we begin by examining the two documents in question: \n\n- Document d1d_1 contains three occurrences of 'a', one occurrence of 'b', and none of 'c' (represented as d1=textaabcd_1 = \\text{aabc}).\n- Document d2d_2 has one occurrence each of 'a', 'b', and 'c' (represented as d2=textabcd_2 = \\text{abc}).\n\nGiven the query q=textabq = \\text{ab}, our focus lies on the occurrences of 'a' and 'b' in both documents.\n\nNext, we calculate the term frequencies for the relevant terms in each document:\n\n- For d1d_1, the term frequencies are:\n  - fd1(a)=3f_{d_1}(a) = 3\n  - fd1(b)=1f_{d_1}(b) = 1\n  - fd1(c)=0f_{d_1}(c) = 0\n\n- For d2d_2, the term frequencies are:\n  - fd2(a)=1f_{d_2}(a) = 1\n  - fd2(b)=1f_{d_2}(b) = 1\n  - fd2(c)=1f_{d_2}(c) = 1\n\nThe total number of terms in each document is calculated as follows:\n\n- Total terms in d1=4d_1 = 4 (3 'a's + 1 'b' + 0 'c's).\n- Total terms in d2=3d_2 = 3 (1 'a' + 1 'b' + 1 'c').\n\nWe will apply the smoothed probabilistic retrieval model using the formula:\n\\[\nP(w | d) = \\frac{f_{d}(w) + \\lambda \\cdot P(w | C)}{N + \\lambda \\cdot |V|}\n\\]\nwhere NN is the total number of terms in the document, V|V| is the size of the vocabulary (which is 3 in this case), and P(wC)P(w | C) is the probability of the word in the overall collection.\n\nAssuming a uniform distribution for the collection, we calculate:\n- P(aC)=frac410=0.4P(a | C) = \\frac{4}{10} = 0.4\n- P(bC)=frac210=0.2P(b | C) = \\frac{2}{10} = 0.2\n- P(cC)=frac210=0.2P(c | C) = \\frac{2}{10} = 0.2\n\nNow, we compute the probabilities for the query terms for each document.\n\nFor document d1d_1:\n- Probability of 'a':\n\\[\nP(a | d_1) = \\frac{3 + 0.5 \\cdot 0.4}{4 + 0.5 \\cdot 3} = \\frac{3 + 0.2}{4 + 1.5} = \\frac{3.2}{5.5} \\approx 0.5818\n\\]\n- Probability of 'b':\n\\[\nP(b | d_1) = \\frac{1 + 0.5 \\cdot 0.2}{4 + 0.5 \\cdot 3} = \\frac{1 + 0.1}{5.5} = \\frac{1.1}{5.5} \\approx 0.2\n\\]\n- Combined score for d1d_1 for the query q=abq = ab:\n\\[\nP(q | d_1) = P(a | d_1) \\cdot P(b | d_1) \\approx 0.5818 \\cdot 0.2 \\approx 0.1164\n\\]\n\nFor document d2d_2:\n- Probability of 'a':\n\\[\nP(a | d_2) = \\frac{1 + 0.5 \\cdot 0.4}{3 + 0.5 \\cdot 3} = \\frac{1 + 0.2}{4.5} = \\frac{1.2}{4.5} \\approx 0.2667\n\\]\n- Probability of 'b':\n\\[\nP(b | d_2) = \\frac{1 + 0.5 \\cdot 0.2}{3 + 0.5 \\cdot 3} = \\frac{1 + 0.1}{4.5} = \\frac{1.1}{4.5} \\approx 0.2444\n\\]\n- Combined score for d2d_2 for the query q=abq = ab:\n\\[\nP(q | d_2) = P(a | d_2) \\cdot P(b | d_2) \\approx 0.2667 \\cdot 0.2444 \\approx 0.0652\n\\]\n\nAt this stage, we find that P(qd1)approx0.1164P(q | d_1) \\approx 0.1164 and P(qd2)approx0.0652P(q | d_2) \\approx 0.0652, indicating that d1d_1 currently ranks higher than d2d_2.\n\nTo explore the possibility of achieving both d1>d2d_1 > d_2 and d2>d1d_2 > d_1, we consider the addition of new documents. While it is theoretically possible to manipulate rankings by introducing documents that alter the frequency of terms, the fundamental nature of probabilistic scoring means that achieving both conditions simultaneously is implausible. 
Specifically, any document that increases the score of d1d_1 will likely decrease the score of d2d_2 and vice versa due to the competitive nature of the scoring based on term frequencies.\n\nIn conclusion, while document addition can influence individual rankings, the inherent constraints of probabilistic retrieval prevent the simultaneous fulfillment of both ranking conditions. Therefore, the answer is **no, it is not possible** to enforce both rankings as required.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
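
Because the model was trained with MatryoshkaLoss (dimensions 768, 512, 256, 128, and 64; see Training Details), embeddings can also be truncated to a smaller dimensionality at load time with only a modest quality drop, as the per-dimension metrics below show. A minimal sketch using the truncate_dim argument of Sentence Transformers:

from sentence_transformers import SentenceTransformer

# Keep only the first 256 Matryoshka dimensions of each embedding
model = SentenceTransformer("cristiano-sartori/stella_finetuned1", truncate_dim=256)
embeddings = model.encode(["example query", "example passage"])
print(embeddings.shape)
# [2, 256]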

Evaluation

Metrics

Information Retrieval (dim_768: embeddings truncated to 768 dimensions)

Metric Value
cosine_accuracy@1 0.2947
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9333
cosine_accuracy@10 0.9789
cosine_precision@1 0.2947
cosine_precision@3 0.2901
cosine_precision@5 0.1867
cosine_precision@10 0.0979
cosine_recall@1 0.2947
cosine_recall@3 0.8702
cosine_recall@5 0.9333
cosine_recall@10 0.9789
cosine_ndcg@10 0.6610
cosine_mrr@10 0.5552
cosine_map@100 0.5566

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.3053
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9228
cosine_accuracy@10 0.9719
cosine_precision@1 0.3053
cosine_precision@3 0.2901
cosine_precision@5 0.1846
cosine_precision@10 0.0972
cosine_recall@1 0.3053
cosine_recall@3 0.8702
cosine_recall@5 0.9228
cosine_recall@10 0.9719
cosine_ndcg@10 0.6643
cosine_mrr@10 0.5616
cosine_map@100 0.5636

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.2912
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9263
cosine_accuracy@10 0.9684
cosine_precision@1 0.2912
cosine_precision@3 0.2901
cosine_precision@5 0.1853
cosine_precision@10 0.0968
cosine_recall@1 0.2912
cosine_recall@3 0.8702
cosine_recall@5 0.9263
cosine_recall@10 0.9684
cosine_ndcg@10 0.6575
cosine_mrr@10 0.5535
cosine_map@100 0.5558

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.2632
cosine_accuracy@3 0.8456
cosine_accuracy@5 0.9088
cosine_accuracy@10 0.9614
cosine_precision@1 0.2632
cosine_precision@3 0.2819
cosine_precision@5 0.1818
cosine_precision@10 0.0961
cosine_recall@1 0.2632
cosine_recall@3 0.8456
cosine_recall@5 0.9088
cosine_recall@10 0.9614
cosine_ndcg@10 0.6377
cosine_mrr@10 0.5299
cosine_map@100 0.5326

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.2596
cosine_accuracy@3 0.8386
cosine_accuracy@5 0.9088
cosine_accuracy@10 0.9474
cosine_precision@1 0.2596
cosine_precision@3 0.2795
cosine_precision@5 0.1818
cosine_precision@10 0.0947
cosine_recall@1 0.2596
cosine_recall@3 0.8386
cosine_recall@5 0.9088
cosine_recall@10 0.9474
cosine_ndcg@10 0.6306
cosine_mrr@10 0.5248
cosine_map@100 0.5286
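
The five tables above come from the same evaluator run at the five Matryoshka dimensions, in the order 768, 512, 256, 128, 64 (their ndcg@10 values match the dim_* columns of the final row in the Training Logs below). A minimal sketch of reproducing this kind of evaluation with InformationRetrievalEvaluator, using hypothetical query/corpus data:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("cristiano-sartori/stella_finetuned1", truncate_dim=768)

# Hypothetical evaluation data: id -> text, and query id -> relevant corpus ids
queries = {"q1": "Describe predicated execution."}
corpus = {
    "d1": "Predicated execution replaces branches with conditional instructions.",
    "d2": "A reorder buffer tracks in-flight loads and stores.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_768")
results = evaluator(model)
print(results["dim_768_cosine_ndcg@10"])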

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,140 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 5 tokens, mean 167.15 tokens, max 512 tokens
    • positive: string; min 3 tokens, mean 375.41 tokens, max 512 tokens
  • Samples:
    Sample 1 anchor:
    Devise an algorithm that, without consensus, implements a weaker specification of NBAC by replacing the termination property with very weak termination.

    Very weak termination: If no process crashes, then all processes decide. Is a failure detector needed to implement this algorithm?

    Sample 1 positive:

    To implement a weaker specification of Non-blocking Atomic Commit (NBAC) with a focus on very weak termination, we can devise a straightforward algorithm leveraging Best-Effort Broadcast. The key here is to ensure that if no processes crash, all processes should arrive at a decision, either COMMIT or ABORT.

    First, each process will broadcast its proposal to all other processes. Upon receiving proposals, each process will keep track of the received messages. If a process receives only COMMIT proposals from all other processes, it decides to COMMIT; otherwise, it decides to ABORT. This design assumes that no failures occur, which is a critical part of the specification.

    Watch out for the assumption that no processes crash. If even one process fails, the termination property is broken as some processes may not receive all necessary proposals, leading to a situation where decisions cannot be made consistently. This highlights that failure detection is not required in this scenar...
    The "Consensus-Based Total-Order Broadcast" algorithm transforms a consensus abstraction (together with a reliable broadcast abstraction) into a total-order broadcast abstraction. Describe a transformation between these two primitives in the other direction, that is, implement a (uniform) consensus abstraction from a (uniform) total-order broadcast abstraction. To implement a (uniform) consensus abstraction from a (uniform) total-order broadcast abstraction, we can follow these steps:

    1. Initialize a variable decided to false to track if a consensus value has been reached.
    2. When a process invokes propose(v), it uses the total-order broadcast (TO) to send the value v.
    3. Upon receiving a TO-delivered message with a value v, if decided is still false, the process sets decided to true and calls decide(v).

    This approach works because the total-order broadcast ensures that all processes deliver messages in the same order, allowing them to reach consensus on the first value that is delivered. Thus, the consensus is achieved by agreeing on the first proposed value that is TO-delivered.

    Sample 3 anchor:

    We learnt in the lecture that terms are typically stored in an inverted list. Now, in the inverted list, instead of only storing document identifiers of the documents in which the term appears, assume we also store an offset of the appearance of a term in a document. An $offset$ of a term $l_k$ given a document is defined as the number of words between the start of the document and $l_k$. Thus our inverted list is now: $l_k= \langle f_k: {d_{i_1} \rightarrow [o_1,\ldots,o_{n_{i_1}}]}, {d_{i_2} \rightarrow [o_1,\ldots,o_{n_{i_2}}]}, \ldots, {d_{i_k} \rightarrow [o_1,\ldots,o_{n_{i_k}}]} \rangle$ This means that in document $d_{i_1}$ term $l_k$ appears $n_{i_1}$ times and at offset $[o_1,\ldots,o_{n_{i_1}}]$, where $[o_1,\ldots,o_{n_{i_1}}]$ are sorted in ascending order, these type of indices are also known as term-offset indices. An example of a term-offset index is as follows: Obama = $⟨4 : {1 → [3]},{2 → [6]},{3 → [2,17]},{4 → [1]}⟩$ Governor = $⟨2 : {4 → [3]}, ...

    Sample 3 positive:

    ### Understanding the Problem

    We are tasked with analyzing a query involving the SLOP operator between two terms, "Obama" and "Election." The SLOP operator allows for flexibility in the proximity of terms within a specified number of words. Specifically, for a query of the form QueryTerm1 SLOP/x QueryTerm2, we need to find occurrences of QueryTerm1 within x words of QueryTerm2, regardless of word order.

    ### Term-Offset Indexes

    We have the following term-offset indexes for the relevant terms:

    - Obama = ( \langle 4 : {1 \rightarrow [3], 2 \rightarrow [6], 3 \rightarrow [2, 17], 4 \rightarrow [1]} \rangle )
    - Election = ( \langle 4 : {1 \rightarrow [1], 2 \rightarrow [1, 21], 3 \rightarrow [3], 5 \rightarrow [16, 22, 51]} \rangle )

    From these indexes, we can interpret:
    - "Obama" appears in documents 1, 2, 3, and 4 at the specified offsets.
    - "Election" appears in documents 1, 2, 3, and 5 at its respective offsets.

    ### Analyzing the SLOP Operator

    We n...
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
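
In code, this configuration corresponds to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss. A sketch (the trust_remote_code flag is an assumption required by the custom base architecture):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("NovaSearch/stella_en_400M_v5", trust_remote_code=True)
inner_loss = MultipleNegativesRankingLoss(model)  # in-batch negatives ranking
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)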
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
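
These settings map directly onto SentenceTransformerTrainingArguments. Below is a sketch of the implied training loop, reusing model and loss from the loss sketch above; output_dir, the toy anchor/positive rows, and the reuse of the training split for evaluation are hypothetical:

from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Hypothetical anchor/positive pairs standing in for the json dataset
train_dataset = Dataset.from_dict({
    "anchor": ["example question"],
    "positive": ["example answer"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="stella_finetuned1",        # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=False,
    eval_strategy="epoch",
    save_strategy="epoch",                 # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,            # toy reuse; a held-out split would be used in practice
    loss=loss,
)
trainer.train()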

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.2807 10 0.1056 - - - - -
0.5614 20 0.6075 - - - - -
0.8421 30 0.272 - - - - -
1.0 36 - 0.6633 0.6597 0.6581 0.6378 0.6330
1.1123 40 0.1235 - - - - -
1.3930 50 0.3118 - - - - -
1.6737 60 0.2751 - - - - -
1.9544 70 0.0067 - - - - -
2.0 72 - 0.6605 0.6679 0.6592 0.6441 0.6326
2.2246 80 0.0981 - - - - -
2.5053 90 0.0005 - - - - -
2.7860 100 0.5609 - - - - -
3.0 108 - 0.6610 0.6643 0.6575 0.6377 0.6306
  • The bolded row in the original card marks the saved checkpoint; its per-dimension ndcg@10 values match the epoch 3.0 row and the evaluation tables above.

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}