bge-large-en-v1.5

This is a sentence-transformers model fine-tuned from BAAI/bge-large-en-v1.5 on the natural-questions dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-large-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: natural-questions
  • Language: en
  • License: mit

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
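
For reference, an equivalent module stack can be assembled by hand with the sentence_transformers.models building blocks. This is a minimal sketch using the base checkpoint BAAI/bge-large-en-v1.5 listed above; it reproduces the architecture, not the fine-tuned weights.

from sentence_transformers import SentenceTransformer, models

# BERT backbone with a 512-token limit and lowercasing, as in the printed config
transformer = models.Transformer("BAAI/bge-large-en-v1.5", max_seq_length=512, do_lower_case=True)
# CLS-token pooling over the 1024-dimensional token embeddings
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="cls")
# L2-normalization so that dot product and cosine similarity coincide
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
print(model)  # prints a stack equivalent to the one shown above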

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("DannyAI/embedding_fine_tuning_with_prompts_bge_large_en_v1.5")
# Run inference
queries = [
    "what was agenda 21 of earth summit of rio de janeiro",
]
documents = [
    'Agenda 21 Agenda 21 is a non-binding, action plan of the United Nations with regard to sustainable development.[1] It is a product of the Earth Summit (UN Conference on Environment and Development) held in Rio de Janeiro, Brazil, in 1992. It is an action agenda for the UN, other multilateral organizations, and individual governments around the world that can be executed at local, national, and global levels.',
    'Jab Harry Met Sejal Jab Harry Met Sejal (English: When Harry Met Sejal) is a 2017 Indian romantic comedy film written and directed by Imtiaz Ali. It features Shah Rukh Khan and Anushka Sharma in the lead roles,[1] their third collaboration after Rab Ne Bana Di Jodi (2008) and Jab Tak Hai Jaan (2012). Pre-production of the film begun in April 2015 and principal photography commenced in August 2016 in Prague, Amsterdam, Vienna, Lisbon and Budapest.',
    'Pencil Most manufacturers, and almost all in Europe, designate their pencils with the letters H (commonly interpreted as "hardness") to B (commonly "blackness"), as well as F (usually taken to mean "fineness", although F pencils are no more fine or more easily sharpened than any other grade. also known as "firm" in Japan[68]). The standard writing pencil is graded HB.[69] This designation might have been first used in the early 20th century by Brookman, an English pencil maker. It used B for black and H for hard; a pencil\'s grade was described by a sequence or successive Hs or Bs such as BB and BBB for successively softer leads, and HH and HHH for successively harder ones.[70] The Koh-i-Noor Hardtmuth pencil manufacturers claim to have first used the HB designations, with H standing for Hardtmuth, B for the company\'s location of Budějovice, and F for Franz Hardtmuth, who was responsible for technological improvements in pencil manufacture.[71][72]',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 1024] [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.9017, 0.2307, 0.2148]])
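
The similarity matrix can then be used to rank the documents for each query; a small follow-up sketch reusing the variables from the snippet above:

import torch

# Rank documents per query by cosine similarity, highest first
ranking = torch.argsort(similarities, dim=1, descending=True)
for q_idx, query in enumerate(queries):
    print(f"Query: {query}")
    for rank, d_idx in enumerate(ranking[q_idx].tolist(), start=1):
        score = similarities[q_idx, d_idx].item()
        print(f"  {rank}. score={score:.4f}  {documents[d_idx][:60]}...")
# Expected top hit for the query above: the Agenda 21 passage, score ~0.9017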

Evaluation

Metrics

Information Retrieval

  • Dataset: NanoQuoraRetrieval
  • Evaluated with InformationRetrievalEvaluator with these parameters:
    {
        "query_prompt": "query: ",
        "corpus_prompt": "document: "
    }
    
Metric Value
cosine_accuracy@1 0.88
cosine_accuracy@3 0.96
cosine_accuracy@5 0.98
cosine_accuracy@10 1.0
cosine_precision@1 0.88
cosine_precision@3 0.4
cosine_precision@5 0.26
cosine_precision@10 0.136
cosine_recall@1 0.7673
cosine_recall@3 0.922
cosine_recall@5 0.966
cosine_recall@10 0.9933
cosine_ndcg@10 0.9312
cosine_mrr@10 0.9229
cosine_map@100 0.9057
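
Metrics like these can be reproduced with Sentence Transformers' InformationRetrievalEvaluator using the prompts listed above. A minimal sketch follows; the queries, corpus, and relevant_docs placeholders are illustrative only, and you would substitute the actual NanoQuoraRetrieval data.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("DannyAI/embedding_fine_tuning_with_prompts_bge_large_en_v1.5")

# Illustrative placeholders: id -> text for queries and corpus, query id -> set of relevant doc ids
queries = {"q1": "what was agenda 21 of earth summit of rio de janeiro"}
corpus = {"d1": "Agenda 21 is a non-binding, action plan of the United Nations ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="NanoQuoraRetrieval",
    query_prompt="query: ",      # same prompts as in the evaluation config above
    corpus_prompt="document: ",
)
results = evaluator(model)
print(results)  # dict with cosine_accuracy@k, cosine_precision@k, cosine_ndcg@10, ...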

Training Details

Training Dataset

natural-questions

  • Dataset: natural-questions at f9e894e
  • Size: 64,147 training samples
  • Columns: query and answer
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 10 tokens, mean 11.81 tokens, max 26 tokens
    • answer (string): min 21 tokens, mean 137.28 tokens, max 512 tokens
  • Samples:
    • Query: the internal revenue code is part of federal statutory law. true false
      Answer: Internal Revenue Code The Internal Revenue Code (IRC), formally the Internal Revenue Code of 1986, is the domestic portion of federal statutory tax law in the United States, published in various volumes of the United States Statutes at Large, and separately as Title 26 of the United States Code (USC).[1] It is organized topically, into subtitles and sections, covering income tax (see Income tax in the United States), payroll taxes, estate taxes, gift taxes, and excise taxes; as well as procedure and administration. Its implementing agency is the Internal Revenue Service.
    • Query: where is the pyramid temple at borobudur located
      Answer: Borobudur Approximately 40 kilometres (25 mi) northwest of Yogyakarta and 86 kilometres (53 mi) west of Surakarta, Borobudur is located in an elevated area between two twin volcanoes, Sundoro-Sumbing and Merbabu-Merapi, and two rivers, the Progo and the Elo. According to local myth, the area known as Kedu Plain is a Javanese "sacred" place and has been dubbed "the garden of Java" due to its high agricultural fertility.[19] During the restoration in the early 20th century, it was discovered that three Buddhist temples in the region, Borobudur, Pawon and Mendut, are positioned along a straight line.[20] A ritual relationship between the three temples must have existed, although the exact ritual process is unknown.[14]
    • Query: what does uncle stand for in the show man from uncle
      Answer: The Man from U.N.C.L.E. Originally, co-creator Sam Rolfe wanted to leave the meaning of U.N.C.L.E. ambiguous so it could refer to either "Uncle Sam" or the United Nations.[2]:14 Concerns by Metro-Goldwyn-Mayer's (MGM) legal department about using "U.N." for commercial purposes resulted in the producers' clarification that U.N.C.L.E. was an acronym for the United Network Command for Law and Enforcement.[3] Each episode had an "acknowledgement" to the U.N.C.L.E. in the end titles.
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 16,
        "gather_across_devices": false
    }
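
A minimal sketch of instantiating this loss with the parameters above; the base checkpoint is used for illustration, and similarity_fct shown here matches the "cos_sim" setting in the config (gather_across_devices: false is the default).

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# In-batch-negatives ranking loss with gradient caching, matching the config above
loss = CachedMultipleNegativesRankingLoss(
    model,
    scale=20.0,              # temperature applied to the similarity scores
    similarity_fct=cos_sim,  # "cos_sim" in the config
    mini_batch_size=16,      # chunk size used when caching gradients
)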
    

Evaluation Dataset

natural-questions

  • Dataset: natural-questions at f9e894e
  • Size: 16,037 evaluation samples
  • Columns: query and answer
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 10 tokens, mean 11.67 tokens, max 22 tokens
    • answer (string): min 12 tokens, mean 134.64 tokens, max 512 tokens
  • Samples:
    • Query: when did last harry potter movie come out
      Answer: Harry Potter (film series) Harry Potter is a British-American film series based on the Harry Potter novels by author J. K. Rowling. The series is distributed by Warner Bros. and consists of eight fantasy films, beginning with Harry Potter and the Philosopher's Stone (2001) and culminating with Harry Potter and the Deathly Hallows – Part 2 (2011).[2][3] A spin-off prequel series will consist of five films, starting with Fantastic Beasts and Where to Find Them (2016). The Fantastic Beasts films mark the beginning of a shared media franchise known as J. K. Rowling's Wizarding World.[4]
    • Query: where did the saying debbie downer come from
      Answer: Debbie Downer The character's name, Debbie Downer, is a slang phrase which refers to someone who frequently adds bad news and negative feelings to a gathering, thus bringing down the mood of everyone around them. Dratch's character would usually appear at social gatherings and interrupt the conversation to voice negative opinions and pronouncements. She is especially concerned about the rate of feline AIDS, a subject that she would bring up on more than one occasion, saying it was the number one killer of domestic cats.
    • Query: the financial crisis of 2008 was caused by
      Answer: Financial crisis of 2007–2008 It began in 2007 with a crisis in the subprime mortgage market in the United States, and developed into a full-blown international banking crisis with the collapse of the investment bank Lehman Brothers on September 15, 2008.[5] Excessive risk-taking by banks such as Lehman Brothers helped to magnify the financial impact globally.[6] Massive bail-outs of financial institutions and other palliative monetary and fiscal policies were employed to prevent a possible collapse of the world financial system. The crisis was nonetheless followed by a global economic downturn, the Great Recession. The European debt crisis, a crisis in the banking system of the European countries using the euro, followed later.
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 16,
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 5
  • per_device_eval_batch_size: 5
  • learning_rate: 2e-05
  • max_steps: 100
  • warmup_ratio: 0.1
  • seed: 30
  • bf16: True
  • load_best_model_at_end: True
  • prompts: {'query': 'query: ', 'answer': 'document: '}
  • batch_sampler: no_duplicates
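
Taken together, these settings correspond to a fine-tuning run roughly like the sketch below. The dataset id sentence-transformers/natural-questions, the train/eval split, and the output directory are assumptions; the other values mirror the hyperparameters listed above.

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# natural-questions with "query" and "answer" columns (the card pins revision f9e894e)
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
dataset = dataset.train_test_split(test_size=0.2, seed=30)  # assumed split into train/eval

loss = CachedMultipleNegativesRankingLoss(model, scale=20.0, mini_batch_size=16)

args = SentenceTransformerTrainingArguments(
    output_dir="embedding_fine_tuning_with_prompts_bge_large_en_v1.5",  # assumed
    eval_strategy="steps",
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    learning_rate=2e-5,
    max_steps=100,
    warmup_ratio=0.1,
    seed=30,
    bf16=True,
    load_best_model_at_end=True,
    # Prepended to the corresponding dataset columns during training
    prompts={"query": "query: ", "answer": "document: "},
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss=loss,
)
trainer.train()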

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 5
  • per_device_eval_batch_size: 5
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 100
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 30
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: {'query': 'query: ', 'answer': 'document: '}
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch    Step   Training Loss   Validation Loss   NanoQuoraRetrieval_cosine_ndcg@10
-1       -1     -               -                 0.9583
0.0078   100    0.0063          0.0029            0.9312
-1       -1     -               -                 0.9312
  • The step-100 row corresponds to the saved checkpoint (load_best_model_at_end is enabled).

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.1.0
  • Transformers: 4.56.1
  • PyTorch: 2.8.0+cu126
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}