SentenceTransformer based on BAAI/bge-base-en

This is a sentence-transformers model finetuned from BAAI/bge-base-en. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-base-en
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

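Documentation: Sentence Transformers Documentation (https://sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)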

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
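
The pooling and normalization modules listed above can be illustrated with a minimal sketch. It uses plain Python and a made-up 4-dimensional token matrix (the real model produces 768-dimensional vectors); the helper names `cls_pool` and `l2_normalize` are hypothetical and are not part of the Sentence Transformers API.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy "transformer output": 3 tokens x 4 dimensions (illustrative values only).
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 2.0, 0.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```

Because the final vector has unit length, the dot product of two such embeddings equals their cosine similarity.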

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aaa961/finetuned-bge-base-en")
# Run inference
sentences = [
    'Occurence highlighting highlights wrong part of the code <!-- Please search existing issues to avoid creating duplicates. -->\r\n\r\n## Environment data\r\n\r\n-   VS Code version: 1.58.0-insider 062e6519f8973fede2ca736e80682bd19007460a \r\n-   Jupyter Extension version (available under the Extensions sidebar):  v2021.8.1000539794\r\n-   Python Extension version (available under the Extensions sidebar): v2021.6.944021595\r\n-   OS (Windows | Mac | Linux distro) and version: Ubuntu 18.04\r\n-   Python and/or Anaconda version: 3.9.2 Anaconda\r\n-   Type of virtual environment used (N/A | venv | virtualenv | conda | ...): conda\r\n-   Jupyter server running: Remote \r\n\r\nIt seems that issues https://github.com/microsoft/vscode/issues/120148 and https://github.com/microsoft/vscode-jupyter/issues/5451 have been closed but the problem still exists in the last versions. I have not seen any similar issues on the repo',
    'File explorer is expanding all root folders in a MR workspace Steps to Reproduce:\r\n\r\n1.  Create a MR workspace file with more than one folder\r\n2. Open the MR workspace\r\n\r\n🐛 All top level folders are expanded. This is very slow if there are lot of root folders and also if the MR workspace is in remote\r\n',
    'Quick input reset scroll position * use latest from master\r\n* f1 > insert snippet\r\n* scroll down to an extension snippet and hide it (press 👁️ icon)\r\n* :bug: the scroll position resets\r\n\r\nThis is happening when reassigning the items (since the press changed the label) here: https://github.com/microsoft/vscode/blob/92314d61a55f466c125fa9d1f9fe8da633a82423/src/vs/workbench/contrib/snippets/browser/insertSnippet.ts#L213',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5572, 0.5031],
#         [0.5572, 1.0000, 0.5477],
#         [0.5031, 0.5477, 1.0000]])
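
Similarity scores like the ones above are typically used to rank candidate texts against a query. The following is a rough sketch of that ranking step, written in plain Python so it runs without downloading the model. The 3-dimensional vectors are hypothetical stand-ins for `model.encode(...)` output, and `rank_by_similarity` is an illustrative helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; real ones would have 768 dimensions.
corpus_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
query_embedding = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query_embedding, corpus_embeddings)
for index, score in ranking:
    print(index, round(score, 4))
```

With these toy vectors the corpus items are ranked in the order 0, 2, 1.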

Evaluation

Metrics

Triplet

Dataset: bge-base-en-train
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.9479

Triplet

Dataset: bge-base-en-train
Evaluated with TripletEvaluator

Metric	Value
cosine_accuracy	0.9933
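
As a rough illustration of what a triplet cosine accuracy measures: a triplet (anchor, positive, negative) counts as correct when the anchor is more similar to the positive than to the negative, and the metric is the fraction of correct triplets. The sketch below uses made-up 2-dimensional vectors and a hypothetical helper `triplet_cosine_accuracy`; it is not the TripletEvaluator implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets where the anchor
    is closer (by cosine similarity) to the positive than to the negative."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)

# Toy triplets (illustrative values only); the second one is ranked incorrectly.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),
    ([1.0, 1.0], [1.0, 0.8], [-1.0, 0.0]),
    ([1.0, 0.0], [1.0, 0.5], [0.0, -1.0]),
]

accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)  # 0.75
```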

Training Details

Training Dataset

Unnamed Dataset

Size: 445 training samples
Columns: sentence and label
Approximate statistics based on the first 445 samples:
	sentence	label
type	string	float
details	min: 20 tokens mean: 338.03 tokens max: 512 tokens	min: 1.0 mean: 149.53 max: 299.0

Samples:

sentence	label
`Branch list is sometimes out of order`
Type: Bug

1. Open a workspace
2. Quickly open the branch picker and type `main`

Bug
The first time you do this, sometimes you end up with an unordered list:

The correct order shows up when you keep start typing or try doing this again:

VS Code version: Code - Insiders 1.91.0-insider (Universal) (0354163c1c66b950b0762364f5b4cd37937b624a, 2024-06-26T10:12:33.304Z)
OS version: Darwin arm64 23.5.0
Modes:


System Info

\|Item\|Value\|
\|---\|---\|
\|CPUs\|Apple M2 Max (12 x 2400)\|
\|GPU Status\|2d_canvas: unavailable_software canvas_oop_rasterization: disabled_off direct_rendering_display_compositor: disabled_off_ok gpu_compositing: disabled_software multiple_raster_threads: enabled_on ope...	`299.0`
`Git Branch Picker Race Condition If I paste the branch too quickly and then press enter, it does not switch to it, but creates a new branch.`
This breaks muscle memory, as it works when you do it slowly.

Once loading completes, it should select the branch again.	`299.0`
`Ctrl+I stopped working after first hold+talk+release Testing #213355`

Screencast shows that it seems to be in the wrong context and is trying to stop the session?

Repro was just asking "Testing testing" and then trying to ask something else	`298.0`

Loss: BatchSemiHardTripletLoss
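
To give an intuition for how a batch semi-hard triplet loss uses the `label` column, here is a simplified sketch in plain Python. For each anchor-positive pair (same label), it picks the closest negative that is still farther away than the positive, falls back to the farthest negative when none qualifies, and applies a hinge with a margin. The function name, the Euclidean distance, the 2-dimensional vectors, and `margin=1.0` are assumptions for illustration; the settings actually used for training are not stated in this card.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_semi_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Average hinge loss over all anchor-positive pairs in the batch,
    using a semi-hard negative for each pair (simplified sketch)."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        # Distances from the anchor to every sample with a different label.
        neg_dists = [
            euclidean(embeddings[a], embeddings[k])
            for k in range(n)
            if labels[k] != labels[a]
        ]
        if not neg_dists:
            continue
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = euclidean(embeddings[a], embeddings[p])
            # Semi-hard negatives lie farther from the anchor than the positive.
            semi_hard = [d for d in neg_dists if d > d_ap]
            d_an = min(semi_hard) if semi_hard else max(neg_dists)
            losses.append(max(d_ap - d_an + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch: two samples share label 299.0, one has label 298.0.
embeddings = [[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]]
labels = [299.0, 299.0, 298.0]

loss = batch_semi_hard_triplet_loss(embeddings, labels, margin=1.0)
print(loss)  # 1.0
```

In this toy batch, the pair (0, 1) finds a semi-hard negative at distance 1.5 and contributes 0.5, while the pair (1, 0) has no semi-hard negative, falls back to the negative at distance 0.5, and contributes 1.5.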

Evaluation Dataset

Unnamed Dataset

Size: 95 evaluation samples
Columns: sentence and label
Approximate statistics based on the first 95 samples:
	sentence	label
type	string	float
details	min: 38 tokens mean: 348.78 tokens max: 512 tokens	min: 3.0 mean: 149.72 max: 296.0

Samples:

sentence	label
VS Code does not delete old extension versions even after restart

Does this issue occur when all extensions are disabled?: Yes

Safetensors

Model size: 109M params
Tensor type: F32

