SentenceTransformer based on NovaSearch/stella_en_400M_v5

This is a sentence-transformers model finetuned from NovaSearch/stella_en_400M_v5 on the json dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: NovaSearch/stella_en_400M_v5
  • Model Size: ~434M parameters (F32 safetensors)
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
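
This pipeline is mean pooling over the base model's token embeddings followed by an identity-activated 1024-to-1024 linear projection. For illustration only, a roughly equivalent pipeline could be assembled by hand with the Sentence Transformers modules API; this is a sketch (loading the published checkpoint already bundles these modules, and the trust_remote_code flags are an assumption required by the custom NewModel architecture):

from torch import nn
from sentence_transformers import SentenceTransformer, models

# Sketch: manual assembly of the Transformer -> Pooling -> Dense pipeline
transformer = models.Transformer(
    "NovaSearch/stella_en_400M_v5",
    max_seq_length=512,
    model_args={"trust_remote_code": True},      # assumption: NewModel is custom code
    tokenizer_args={"trust_remote_code": True},
)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 1024
    pooling_mode="mean",
)
dense = models.Dense(
    in_features=1024, out_features=1024, bias=True,
    activation_function=nn.Identity(),
)
model = SentenceTransformer(modules=[transformer, pooling, dense])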

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/stella_finetuned1")
# Run inference
sentences = [
    'Describe the techniques that typical dynamically scheduled\n            processors use to achieve the same purpose of the following features\n            of Intel Itanium: (a) Predicated execution; (b) advanced\n            loads---that is, loads moved before a store and explicit check for\n            RAW hazards; (c) speculative loads---that is, loads moved before a\n            branch and explicit check for exceptions; (d) rotating register\n            file.',
    'Dynamically scheduled processors are designed to improve the efficiency of instruction execution by allowing the CPU to make decisions at runtime about the order of instruction execution. Let\'s break down each feature you mentioned from the Intel Itanium architecture and see how typical dynamically scheduled processors achieve similar goals.\n\n### (a) Predicated Execution\n\n**Intuition:**\nPredicated execution allows the processor to execute instructions based on certain conditions without using traditional branching (like `if` statements). This helps to avoid pipeline stalls that can occur when a branch is taken.\n\n**Example:**\nImagine you have the following pseudo-code:\n```c\nif (x > 0) {\n    y = z + 1;\n} else {\n    y = z - 1;\n}\n```\n\nIn a predicated execution model, instead of branching, the processor can execute both instructions but use a predicate (a boolean condition) to determine which result to keep:\n```assembly\np1 = (x > 0)\ny1 = z + 1; // Execute regardless\ny2 = z - 1; // Execute regardless\ny = p1 ? y1 : y2; // Keep the result based on p1\n```\n\n**Dynamically Scheduled Processors:**\nThese processors use techniques like "instruction scheduling" and "register renaming" to allow for instructions to be executed out of order while avoiding the pitfalls of branches. The hardware can evaluate conditions ahead of time and execute the necessary instructions while keeping track of which values are valid.\n\n### (b) Advanced Loads\n\n**Intuition:**\nAdvanced loads allow the processor to move load instructions (fetching data from memory) ahead of store instructions (writing data to memory), while also checking for Read After Write (RAW) hazards to ensure data correctness.\n\n**Example:**\nConsider the following pseudo-code:\n```c\na = b; // Store b into a\nc = a; // Load a into c\n```\n\nIf `a` is stored before it\'s loaded again, there can be a dependency. Advanced load techniques allow the processor to load `c = a` even before it knows if the value of `a` has been updated, as long as it checks that no store operation that modifies `a` has occurred.\n\n**Dynamically Scheduled Processors:**\nThese processors often use a technique called "out-of-order execution." They keep track of the status of loads and stores in a structure like a reorder buffer. When a load is scheduled, the processor checks if any previous instructions modify the value it needs (checking for RAW hazards) before executing the load.\n\n### (c) Speculative Loads\n\n**Intuition:**\nSpeculative loads allow the processor to execute load instructions before it knows the outcome of branches, which can keep the pipeline filled and reduce stalls.\n\n**Example:**\nConsider a branch that depends on a condition:\n```c\nif (x > 0) {\n    a = b + c;\n}\n```\n\nInstead of waiting for the branch to be resolved, a speculative load might execute `load a` regardless of the branch\'s result. If the branch is taken, the processor can discard this load if it\'s not necessary.\n\n**Dynamically Scheduled Processors:**\nThese processors utilize "speculative execution," where they predict the likely path of execution based on past behavior. They perform loads and even entire blocks of instructions speculatively and have mechanisms to roll back if they guessed wrong while checking for exceptions (like accessing invalid memory).\n\n### (d) Rotating Register File\n\n**Intuition:**\nA rotating register file allows the processor to efficiently manage registers, effectively giving it more registers to work with by cycling through them for different contexts or states.\n\n**Example:**\nIn a simple program, if you have a limited number of registers but multiple functions, rotating registers means that as soon as one function completes, its registers can be reused for the next function without needing to save them to memory.\n\n**Dynamically Scheduled Processors:**\nMany dynamically scheduled processors use a "register renaming" technique, which allows them to allocate physical registers dynamically. When an instruction is ready to execute, it checks which registers are free and assigns one, effectively "rotating" the use of registers without the programmer needing to manage this directly.\n\n### Summary\n\nIn summary, dynamically scheduled processors use advanced techniques like out-of-order execution, speculative execution, and register renaming to achieve efficiency and performance similar to the features found in Intel Itanium. These techniques help to minimize stalls and maximize instruction throughput by allowing more flexibility in how instructions are executed relative to their dependencies and branch outcomes.',
    "The question at hand explores whether it is possible to add new documents to a collection such that one document, d1d_1, is ranked higher than another document, d2d_2, based on a specific query, while also allowing for the possibility of ranking d2d_2 higher than d1d_1 simultaneously.\n\nTo analyze this problem, we begin by examining the two documents in question: \n\n- Document d1d_1 contains three occurrences of 'a', one occurrence of 'b', and none of 'c' (represented as d1=textaabcd_1 = \\text{aabc}).\n- Document d2d_2 has one occurrence each of 'a', 'b', and 'c' (represented as d2=textabcd_2 = \\text{abc}).\n\nGiven the query q=textabq = \\text{ab}, our focus lies on the occurrences of 'a' and 'b' in both documents.\n\nNext, we calculate the term frequencies for the relevant terms in each document:\n\n- For d1d_1, the term frequencies are:\n  - fd1(a)=3f_{d_1}(a) = 3\n  - fd1(b)=1f_{d_1}(b) = 1\n  - fd1(c)=0f_{d_1}(c) = 0\n\n- For d2d_2, the term frequencies are:\n  - fd2(a)=1f_{d_2}(a) = 1\n  - fd2(b)=1f_{d_2}(b) = 1\n  - fd2(c)=1f_{d_2}(c) = 1\n\nThe total number of terms in each document is calculated as follows:\n\n- Total terms in d1=4d_1 = 4 (3 'a's + 1 'b' + 0 'c's).\n- Total terms in d2=3d_2 = 3 (1 'a' + 1 'b' + 1 'c').\n\nWe will apply the smoothed probabilistic retrieval model using the formula:\n\\[\nP(w | d) = \\frac{f_{d}(w) + \\lambda \\cdot P(w | C)}{N + \\lambda \\cdot |V|}\n\\]\nwhere NN is the total number of terms in the document, V|V| is the size of the vocabulary (which is 3 in this case), and P(wC)P(w | C) is the probability of the word in the overall collection.\n\nAssuming a uniform distribution for the collection, we calculate:\n- P(aC)=frac410=0.4P(a | C) = \\frac{4}{10} = 0.4\n- P(bC)=frac210=0.2P(b | C) = \\frac{2}{10} = 0.2\n- P(cC)=frac210=0.2P(c | C) = \\frac{2}{10} = 0.2\n\nNow, we compute the probabilities for the query terms for each document.\n\nFor document d1d_1:\n- Probability of 'a':\n\\[\nP(a | d_1) = \\frac{3 + 0.5 \\cdot 0.4}{4 + 0.5 \\cdot 3} = \\frac{3 + 0.2}{4 + 1.5} = \\frac{3.2}{5.5} \\approx 0.5818\n\\]\n- Probability of 'b':\n\\[\nP(b | d_1) = \\frac{1 + 0.5 \\cdot 0.2}{4 + 0.5 \\cdot 3} = \\frac{1 + 0.1}{5.5} = \\frac{1.1}{5.5} \\approx 0.2\n\\]\n- Combined score for d1d_1 for the query q=abq = ab:\n\\[\nP(q | d_1) = P(a | d_1) \\cdot P(b | d_1) \\approx 0.5818 \\cdot 0.2 \\approx 0.1164\n\\]\n\nFor document d2d_2:\n- Probability of 'a':\n\\[\nP(a | d_2) = \\frac{1 + 0.5 \\cdot 0.4}{3 + 0.5 \\cdot 3} = \\frac{1 + 0.2}{4.5} = \\frac{1.2}{4.5} \\approx 0.2667\n\\]\n- Probability of 'b':\n\\[\nP(b | d_2) = \\frac{1 + 0.5 \\cdot 0.2}{3 + 0.5 \\cdot 3} = \\frac{1 + 0.1}{4.5} = \\frac{1.1}{4.5} \\approx 0.2444\n\\]\n- Combined score for d2d_2 for the query q=abq = ab:\n\\[\nP(q | d_2) = P(a | d_2) \\cdot P(b | d_2) \\approx 0.2667 \\cdot 0.2444 \\approx 0.0652\n\\]\n\nAt this stage, we find that P(qd1)approx0.1164P(q | d_1) \\approx 0.1164 and P(qd2)approx0.0652P(q | d_2) \\approx 0.0652, indicating that d1d_1 currently ranks higher than d2d_2.\n\nTo explore the possibility of achieving both d1>d2d_1 > d_2 and d2>d1d_2 > d_1, we consider the addition of new documents. While it is theoretically possible to manipulate rankings by introducing documents that alter the frequency of terms, the fundamental nature of probabilistic scoring means that achieving both conditions simultaneously is implausible. 
Specifically, any document that increases the score of d1d_1 will likely decrease the score of d2d_2 and vice versa due to the competitive nature of the scoring based on term frequencies.\n\nIn conclusion, while document addition can influence individual rankings, the inherent constraints of probabilistic retrieval prevent the simultaneous fulfillment of both ranking conditions. Therefore, the answer is **no, it is not possible** to enforce both rankings as required.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
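
Because the model was trained with MatryoshkaLoss (dimensions 768, 512, 256, 128, and 64; see Training Details), embeddings can also be truncated to a smaller dimensionality at load time with only a modest quality drop, as the per-dimension metrics below show. A minimal sketch using the truncate_dim argument of Sentence Transformers:

from sentence_transformers import SentenceTransformer

# Keep only the first 256 Matryoshka dimensions of each embedding
model = SentenceTransformer("cristiano-sartori/stella_finetuned1", truncate_dim=256)
embeddings = model.encode(["example query", "example passage"])
print(embeddings.shape)
# [2, 256]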

Evaluation

Metrics

Information Retrieval (dim_768: embeddings truncated to 768 dimensions)

Metric Value
cosine_accuracy@1 0.2947
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9333
cosine_accuracy@10 0.9789
cosine_precision@1 0.2947
cosine_precision@3 0.2901
cosine_precision@5 0.1867
cosine_precision@10 0.0979
cosine_recall@1 0.2947
cosine_recall@3 0.8702
cosine_recall@5 0.9333
cosine_recall@10 0.9789
cosine_ndcg@10 0.6610
cosine_mrr@10 0.5552
cosine_map@100 0.5566

Information Retrieval (dim_512)

Metric Value
cosine_accuracy@1 0.3053
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9228
cosine_accuracy@10 0.9719
cosine_precision@1 0.3053
cosine_precision@3 0.2901
cosine_precision@5 0.1846
cosine_precision@10 0.0972
cosine_recall@1 0.3053
cosine_recall@3 0.8702
cosine_recall@5 0.9228
cosine_recall@10 0.9719
cosine_ndcg@10 0.6643
cosine_mrr@10 0.5616
cosine_map@100 0.5636

Information Retrieval (dim_256)

Metric Value
cosine_accuracy@1 0.2912
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9263
cosine_accuracy@10 0.9684
cosine_precision@1 0.2912
cosine_precision@3 0.2901
cosine_precision@5 0.1853
cosine_precision@10 0.0968
cosine_recall@1 0.2912
cosine_recall@3 0.8702
cosine_recall@5 0.9263
cosine_recall@10 0.9684
cosine_ndcg@10 0.6575
cosine_mrr@10 0.5535
cosine_map@100 0.5558

Information Retrieval (dim_128)

Metric Value
cosine_accuracy@1 0.2632
cosine_accuracy@3 0.8456
cosine_accuracy@5 0.9088
cosine_accuracy@10 0.9614
cosine_precision@1 0.2632
cosine_precision@3 0.2819
cosine_precision@5 0.1818
cosine_precision@10 0.0961
cosine_recall@1 0.2632
cosine_recall@3 0.8456
cosine_recall@5 0.9088
cosine_recall@10 0.9614
cosine_ndcg@10 0.6377
cosine_mrr@10 0.5299
cosine_map@100 0.5326

Information Retrieval (dim_64)

Metric Value
cosine_accuracy@1 0.2596
cosine_accuracy@3 0.8386
cosine_accuracy@5 0.9088
cosine_accuracy@10 0.9474
cosine_precision@1 0.2596
cosine_precision@3 0.2795
cosine_precision@5 0.1818
cosine_precision@10 0.0947
cosine_recall@1 0.2596
cosine_recall@3 0.8386
cosine_recall@5 0.9088
cosine_recall@10 0.9474
cosine_ndcg@10 0.6306
cosine_mrr@10 0.5248
cosine_map@100 0.5286
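
The five tables above come from the same evaluator run at the five Matryoshka dimensions, in the order 768, 512, 256, 128, 64 (their ndcg@10 values match the dim_* columns of the final row in the Training Logs below). A minimal sketch of reproducing this kind of evaluation with InformationRetrievalEvaluator, using hypothetical query/corpus data:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("cristiano-sartori/stella_finetuned1", truncate_dim=768)

# Hypothetical evaluation data: id -> text, and query id -> relevant corpus ids
queries = {"q1": "Describe predicated execution."}
corpus = {
    "d1": "Predicated execution replaces branches with conditional instructions.",
    "d2": "A reorder buffer tracks in-flight loads and stores.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_768")
results = evaluator(model)
print(results["dim_768_cosine_ndcg@10"])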

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,140 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 5 tokens, mean 167.15 tokens, max 512 tokens
    • positive: string; min 3 tokens, mean 375.41 tokens, max 512 tokens
  • Samples:
    Sample 1 anchor:
    Devise an algorithm that, without consensus, implements a weaker specification of NBAC by replacing the termination property with very weak termination.

    Very weak termination: If no process crashes, then all processes decide. Is a failure detector needed to implement this algorithm?

    Sample 1 positive:

    To implement a weaker specification of Non-blocking Atomic Commit (NBAC) with a focus on very weak termination, we can devise a straightforward algorithm leveraging Best-Effort Broadcast. The key here is to ensure that if no processes crash, all processes should arrive at a decision, either COMMIT or ABORT.

    First, each process will broadcast its proposal to all other processes. Upon receiving proposals, each process will keep track of the received messages. If a process receives only COMMIT proposals from all other processes, it decides to COMMIT; otherwise, it decides to ABORT. This design assumes that no failures occur, which is a critical part of the specification.

    Watch out for the assumption that no processes crash. If even one process fails, the termination property is broken as some processes may not receive all necessary proposals, leading to a situation where decisions cannot be made consistently. This highlights that failure detection is not required in this scenar...
    The "Consensus-Based Total-Order Broadcast" algorithm transforms a consensus abstraction (together with a reliable broadcast abstraction) into a total-order broadcast abstraction. Describe a transformation between these two primitives in the other direction, that is, implement a (uniform) consensus abstraction from a (uniform) total-order broadcast abstraction. To implement a (uniform) consensus abstraction from a (uniform) total-order broadcast abstraction, we can follow these steps:

    1. Initialize a variable decided to false to track if a consensus value has been reached.
    2. When a process invokes propose(v), it uses the total-order broadcast (TO) to send the value v.
    3. Upon receiving a TO-delivered message with a value v, if decided is still false, the process sets decided to true and calls decide(v).

    This approach works because the total-order broadcast ensures that all processes deliver messages in the same order, allowing them to reach consensus on the first value that is delivered. Thus, the consensus is achieved by agreeing on the first proposed value that is TO-delivered.

    Sample 3 anchor:

    We learnt in the lecture that terms are typically stored in an inverted list. Now, in the inverted list, instead of only storing document identifiers of the documents in which the term appears, assume we also store an offset of the appearance of a term in a document. An $offset$ of a term $l_k$ given a document is defined as the number of words between the start of the document and $l_k$. Thus our inverted list is now: $l_k= \langle f_k: {d_{i_1} \rightarrow [o_1,\ldots,o_{n_{i_1}}]}, {d_{i_2} \rightarrow [o_1,\ldots,o_{n_{i_2}}]}, \ldots, {d_{i_k} \rightarrow [o_1,\ldots,o_{n_{i_k}}]} \rangle$ This means that in document $d_{i_1}$ term $l_k$ appears $n_{i_1}$ times and at offset $[o_1,\ldots,o_{n_{i_1}}]$, where $[o_1,\ldots,o_{n_{i_1}}]$ are sorted in ascending order, these type of indices are also known as term-offset indices. An example of a term-offset index is as follows: Obama = $⟨4 : {1 → [3]},{2 → [6]},{3 → [2,17]},{4 → [1]}⟩$ Governor = $⟨2 : {4 → [3]}, ...

    Sample 3 positive:

    ### Understanding the Problem

    We are tasked with analyzing a query involving the SLOP operator between two terms, "Obama" and "Election." The SLOP operator allows for flexibility in the proximity of terms within a specified number of words. Specifically, for a query of the form QueryTerm1 SLOP/x QueryTerm2, we need to find occurrences of QueryTerm1 within x words of QueryTerm2, regardless of word order.

    ### Term-Offset Indexes

    We have the following term-offset indexes for the relevant terms:

    - Obama = ( \langle 4 : {1 \rightarrow [3], 2 \rightarrow [6], 3 \rightarrow [2, 17], 4 \rightarrow [1]} \rangle )
    - Election = ( \langle 4 : {1 \rightarrow [1], 2 \rightarrow [1, 21], 3 \rightarrow [3], 5 \rightarrow [16, 22, 51]} \rangle )

    From these indexes, we can interpret:
    - "Obama" appears in documents 1, 2, 3, and 4 at the specified offsets.
    - "Election" appears in documents 1, 2, 3, and 5 at its respective offsets.

    ### Analyzing the SLOP Operator

    We n...
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
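
In code, this configuration corresponds to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss. A sketch (the trust_remote_code flag is an assumption required by the custom base architecture):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("NovaSearch/stella_en_400M_v5", trust_remote_code=True)
inner_loss = MultipleNegativesRankingLoss(model)  # in-batch negatives ranking
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)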
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
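
These settings map directly onto SentenceTransformerTrainingArguments. Below is a sketch of the implied training loop, reusing model and loss from the loss sketch above; output_dir, the toy anchor/positive rows, and the reuse of the training split for evaluation are hypothetical:

from datasets import Dataset
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Hypothetical anchor/positive pairs standing in for the json dataset
train_dataset = Dataset.from_dict({
    "anchor": ["example question"],
    "positive": ["example answer"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="stella_finetuned1",        # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=False,
    eval_strategy="epoch",
    save_strategy="epoch",                 # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,            # toy reuse; a held-out split would be used in practice
    loss=loss,
)
trainer.train()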

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.2807 10 0.1056 - - - - -
0.5614 20 0.6075 - - - - -
0.8421 30 0.272 - - - - -
1.0 36 - 0.6633 0.6597 0.6581 0.6378 0.6330
1.1123 40 0.1235 - - - - -
1.3930 50 0.3118 - - - - -
1.6737 60 0.2751 - - - - -
1.9544 70 0.0067 - - - - -
2.0 72 - 0.6605 0.6679 0.6592 0.6441 0.6326
2.2246 80 0.0981 - - - - -
2.5053 90 0.0005 - - - - -
2.7860 100 0.5609 - - - - -
3.0 108 - 0.6610 0.6643 0.6575 0.6377 0.6306
  • The bolded row in the original card marks the saved checkpoint; its per-dimension ndcg@10 values match the epoch 3.0 row and the evaluation tables above.

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}