BGE base Financial Matryoshka

This is a sentence-transformers model finetuned from BAAI/bge-base-en on a JSON dataset of financial-filing passages paired with questions (see Training Dataset below). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
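
The same three-module pipeline (BERT encoder, CLS-token pooling, L2 normalization) can also be assembled explicitly from sentence-transformers building blocks. The snippet below is only an illustrative sketch of the architecture shown above; in practice, load the published checkpoint directly as shown in the Usage section.

from sentence_transformers import SentenceTransformer, models

# Sketch: rebuild the pipeline above from its components (illustration only).
transformer = models.Transformer(
    "BAAI/bge-base-en",
    max_seq_length=512,
    do_lower_case=True,
)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 768
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)
normalize = models.Normalize()
model = SentenceTransformer(modules=[transformer, pooling, normalize])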

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("RK-1235/bge-base-FIR-matryoshka")
# Run inference
sentences = [
    'Item 8. Financial Statements and Supplementary Data. The Consolidated Financial Statements, together with the Notes thereto and the report thereon dated February 16, 2024, of PricewaterhouseCoopers LLP, the Firm’s independent registered public accounting firm (PCAOB ID 238).',
    'What type of data does Item 8 in a financial document contain?',
    "How did the assumptions and estimates used for assessing the fair value of reporting units potentially impact the company's financial statements?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
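
Because the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64, its embeddings can be truncated to a smaller size with a modest quality drop (see the Evaluation section). A minimal sketch, assuming a sentence-transformers version recent enough to support the truncate_dim argument:

from sentence_transformers import SentenceTransformer

# Keep only the first 256 embedding dimensions (Matryoshka truncation).
model = SentenceTransformer("RK-1235/bge-base-FIR-matryoshka", truncate_dim=256)

embeddings = model.encode([
    "What type of data does Item 8 in a financial document contain?",
])
print(embeddings.shape)
# (1, 256)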

Evaluation

Metrics

Information Retrieval (768 dimensions)

Metric Value
cosine_accuracy@1 0.2025
cosine_accuracy@3 0.3829
cosine_accuracy@5 0.4509
cosine_accuracy@10 0.5348
cosine_precision@1 0.2025
cosine_precision@3 0.1276
cosine_precision@5 0.0902
cosine_precision@10 0.0535
cosine_recall@1 0.2025
cosine_recall@3 0.3829
cosine_recall@5 0.4509
cosine_recall@10 0.5348
cosine_ndcg@10 0.3671
cosine_mrr@10 0.3138
cosine_map@100 0.3236

Information Retrieval (512 dimensions)

Metric Value
cosine_accuracy@1 0.1946
cosine_accuracy@3 0.3892
cosine_accuracy@5 0.443
cosine_accuracy@10 0.5095
cosine_precision@1 0.1946
cosine_precision@3 0.1297
cosine_precision@5 0.0886
cosine_precision@10 0.0509
cosine_recall@1 0.1946
cosine_recall@3 0.3892
cosine_recall@5 0.443
cosine_recall@10 0.5095
cosine_ndcg@10 0.3552
cosine_mrr@10 0.3055
cosine_map@100 0.3158

Information Retrieval (256 dimensions)

Metric Value
cosine_accuracy@1 0.1804
cosine_accuracy@3 0.3497
cosine_accuracy@5 0.4051
cosine_accuracy@10 0.4794
cosine_precision@1 0.1804
cosine_precision@3 0.1166
cosine_precision@5 0.081
cosine_precision@10 0.0479
cosine_recall@1 0.1804
cosine_recall@3 0.3497
cosine_recall@5 0.4051
cosine_recall@10 0.4794
cosine_ndcg@10 0.3274
cosine_mrr@10 0.2791
cosine_map@100 0.2892

Information Retrieval (128 dimensions)

Metric Value
cosine_accuracy@1 0.1345
cosine_accuracy@3 0.2769
cosine_accuracy@5 0.3354
cosine_accuracy@10 0.3987
cosine_precision@1 0.1345
cosine_precision@3 0.0923
cosine_precision@5 0.0671
cosine_precision@10 0.0399
cosine_recall@1 0.1345
cosine_recall@3 0.2769
cosine_recall@5 0.3354
cosine_recall@10 0.3987
cosine_ndcg@10 0.2632
cosine_mrr@10 0.2203
cosine_map@100 0.2312

Information Retrieval (64 dimensions)

Metric Value
cosine_accuracy@1 0.0965
cosine_accuracy@3 0.2025
cosine_accuracy@5 0.2468
cosine_accuracy@10 0.3244
cosine_precision@1 0.0965
cosine_precision@3 0.0675
cosine_precision@5 0.0494
cosine_precision@10 0.0324
cosine_recall@1 0.0965
cosine_recall@3 0.2025
cosine_recall@5 0.2468
cosine_recall@10 0.3244
cosine_ndcg@10 0.201
cosine_mrr@10 0.1627
cosine_map@100 0.1703
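
The five tables above report the same retrieval benchmark with the embeddings truncated to 768, 512, 256, 128 and 64 dimensions, respectively (they match the dim_*_cosine_ndcg@10 columns in the Training Logs). Below is a minimal sketch of how such scores can be computed with sentence-transformers' InformationRetrievalEvaluator; the queries, corpus and relevant_docs mappings are placeholders, not the held-out split used for this card, and the truncate_dim argument assumes a recent sentence-transformers release.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("RK-1235/bge-base-FIR-matryoshka")

# Placeholder data: query id -> text, doc id -> text, query id -> set of relevant doc ids.
queries = {"q1": "What is the principal business activity of NIKE, Inc.?"}
corpus = {"d1": "NIKE, Inc.'s principal business activity involves the design, development, "
                "and worldwide marketing and selling of athletic footwear, apparel, equipment, "
                "accessories, and services."}
relevant_docs = {"q1": {"d1"}}

for dim in (768, 512, 256, 128, 64):
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        truncate_dim=dim,  # evaluate on the first `dim` embedding dimensions
        name=f"dim_{dim}",
    )
    print(dim, evaluator(model))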

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 6,300 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    • positive: string; min 6 tokens, mean 46.06 tokens, max 371 tokens
    • anchor: string; min 8 tokens, mean 20.8 tokens, max 51 tokens
  • Samples:
    • positive: As of December 31, 2023, a 5 percent change in the contingent consideration liabilities would result in a change in income before income taxes of $5.2 million.
      anchor: How would a 5% change in the contingent consideration liabilities impact income before taxes as of December 31, 2023?
    • positive: NIKE, Inc.'s principal business activity involves the design, development, and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories, and services.
      anchor: What is the principal business activity of NIKE, Inc.?
    • positive: During 2023, changes in foreign currencies relative to the U.S. dollar negatively impacted net sales by approximately $3,484, 156 basis points, compared to 2022, attributable to our Canadian and Other International operations.
      anchor: What was the overall impact of foreign currencies on net sales in 2023?
  • Loss: MatryoshkaLoss (see the construction sketch after this list) with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
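
Concretely, the configuration above corresponds to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss. A minimal sketch:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en")

# MultipleNegativesRankingLoss uses in-batch negatives for (anchor, positive) pairs;
# MatryoshkaLoss applies it at every truncated embedding size with equal weight.
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)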
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 5
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
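
A minimal sketch of how a run with the non-default hyperparameters above could be wired up with SentenceTransformerTrainer. The local train.json path, the train/eval split, and the save_strategy setting are assumptions for illustration, not details taken from this card.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-base-en")

# Assumed local file with "positive" and "anchor" columns; the held-out split is illustrative.
dataset = load_dataset("json", data_files="train.json", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = dataset["train"], dataset["test"]

loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-FIR-matryoshka",
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed; required when load_best_model_at_end=True
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()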

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.8122 10 81.9079 - - - - -
1.0 13 - 0.4046 0.3819 0.3506 0.2938 0.1934
1.5685 20 29.1302 - - - - -
2.0 26 - 0.3723 0.3601 0.3222 0.2705 0.1996
2.3249 30 15.9756 - - - - -
3.0 39 - 0.3674 0.3605 0.3294 0.2669 0.2023
3.0812 40 10.8036 - - - - -
3.8934 50 9.3118 - - - - -
4.0 52 - 0.3683 0.3536 0.3265 0.2641 0.1996
4.6497 60 7.5509 - - - - -
5.0 65 - 0.3671 0.3552 0.3274 0.2632 0.2010
  • The saved checkpoint corresponds to the epoch 5.0 (step 65) row, whose metrics match the Evaluation section above.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.7.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}