SentenceTransformer based on BAAI/bge-base-en
This is a sentence-transformers model finetuned from BAAI/bge-base-en. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-base-en
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
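As a rough illustration of the similarity function named above, cosine similarity can be sketched in plain Python. The toy 3-dimensional vectors and the helper name `cosine_similarity` are assumptions for illustration; the model itself produces 768-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, not model output).
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, different length
w = [2.0, -1.0, 0.0]  # orthogonal to u

sim_uv = cosine_similarity(u, v)
sim_uw = cosine_similarity(u, w)
print(sim_uv, sim_uw)  # 1.0 0.0
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector.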
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
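The last two modules in the architecture (CLS-token pooling followed by normalization) can be sketched conceptually as below. This is a minimal plain-Python illustration, not the library's implementation; the helper names `cls_pool` and `l2_normalize` and the 2-dimensional token vectors are hypothetical.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: take the first token's vector as the sentence vector."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Hypothetical per-token outputs for one sentence:
# row 0 stands in for [CLS], the remaining rows for other tokens.
token_embeddings = [
    [3.0, 4.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [5.0, 5.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
unit_length = math.sqrt(sum(x * x for x in sentence_embedding))
print(sentence_embedding, unit_length)
```

Since the output vectors have unit length, the cosine similarity of two embeddings reduces to their dot product.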
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("aaa961/finetuned-bge-base-en")
# Run inference
sentences = [
'Occurence highlighting highlights wrong part of the code <!-- Please search existing issues to avoid creating duplicates. -->\r\n\r\n## Environment data\r\n\r\n- VS Code version: 1.58.0-insider 062e6519f8973fede2ca736e80682bd19007460a \r\n- Jupyter Extension version (available under the Extensions sidebar): v2021.8.1000539794\r\n- Python Extension version (available under the Extensions sidebar): v2021.6.944021595\r\n- OS (Windows | Mac | Linux distro) and version: Ubuntu 18.04\r\n- Python and/or Anaconda version: 3.9.2 Anaconda\r\n- Type of virtual environment used (N/A | venv | virtualenv | conda | ...): conda\r\n- Jupyter server running: Remote \r\n\r\nIt seems that issues https://github.com/microsoft/vscode/issues/120148 and https://github.com/microsoft/vscode-jupyter/issues/5451 have been closed but the problem still exists in the last versions. I have not seen any similar issues on the repo',
'File explorer is expanding all root folders in a MR workspace Steps to Reproduce:\r\n\r\n1. Create a MR workspace file with more than one folder\r\n2. Open the MR workspace\r\n\r\n🐛 All top level folders are expanded. This is very slow if there are lot of root folders and also if the MR workspace is in remote\r\n',
'Quick input reset scroll position * use latest from master\r\n* f1 > insert snippet\r\n* scroll down to an extension snippet and hide it (press 👁️ icon)\r\n* :bug: the scroll position resets\r\n\r\nThis is happening when reassigning the items (since the press changed the label) here: https://github.com/microsoft/vscode/blob/92314d61a55f466c125fa9d1f9fe8da633a82423/src/vs/workbench/contrib/snippets/browser/insertSnippet.ts#L213',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5572, 0.5031],
# [0.5572, 1.0000, 0.5477],
# [0.5031, 0.5477, 1.0000]])
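One common next step with a similarity matrix like the one printed above is to find, for each text, its closest other text (for example, to surface possible duplicate issues). The sketch below uses a plain nested list holding the same values; the function name `nearest_neighbors` is hypothetical.

```python
# Similarity values copied from the example output above.
similarities = [
    [1.0000, 0.5572, 0.5031],
    [0.5572, 1.0000, 0.5477],
    [0.5031, 0.5477, 1.0000],
]

def nearest_neighbors(matrix):
    """For each row, return the index of the most similar *other* row."""
    result = []
    for i, row in enumerate(matrix):
        # Skip the diagonal: every text is trivially most similar to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        _, best_j = max(candidates)
        result.append(best_j)
    return result

neighbors = nearest_neighbors(similarities)
print(neighbors)  # [1, 0, 1]
```

Here the first and second texts are each other's nearest neighbor, and the third text is closest to the second.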
Evaluation
Metrics
Triplet
- Dataset: bge-base-en-train
- Evaluated with TripletEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy | 0.9479 |
Triplet
- Dataset: bge-base-en-train
- Evaluated with TripletEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy | 0.9933 |
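For readers unfamiliar with the metric, cosine accuracy on triplets is commonly understood as the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. The sketch below illustrates that definition in plain Python; it is not the TripletEvaluator code, and the toy 2-dimensional triplets are assumed values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)

# Toy triplets (assumed values): (anchor, positive, negative).
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer: correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: incorrect
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),  # positive is closer: correct
]

accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)  # 0.75
```

Under this reading, a reported value of 0.9933 would mean that roughly 99% of the evaluated triplets were ranked correctly.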
Training Details
Training Dataset
Unnamed Dataset
- Size: 445 training samples
- Columns: sentence and label
- Approximate statistics based on the first 445 samples:

| | sentence | label |
|---|---|---|
| type | string | float |
| details | min: 20 tokens, mean: 338.03 tokens, max: 512 tokens | min: 1.0, mean: 149.53, max: 299.0 |
- Samples:
- Loss: BatchSemiHardTripletLoss
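The idea behind semi-hard triplet mining can be sketched as follows. This is a simplified, unvectorized illustration under stated assumptions, not the library's implementation: it assumes Euclidean distance, treats the label column as a class identifier, and uses a margin of 1.0 chosen only for this toy example. For each anchor-positive pair it picks the closest negative that is still farther away than the positive, falling back to the farthest negative when no such negative exists.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_semi_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean hinge loss over all anchor-positive pairs in the batch."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        # Distances from the anchor to every example with a different label.
        negatives = [
            euclidean(embeddings[a], embeddings[k])
            for k in range(n)
            if labels[k] != labels[a]
        ]
        if not negatives:
            continue
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = euclidean(embeddings[a], embeddings[p])
            # Semi-hard candidates: negatives farther away than the positive.
            farther = [d for d in negatives if d > d_ap]
            d_an = min(farther) if farther else max(negatives)
            losses.append(max(d_ap - d_an + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch (assumed values): 1-dimensional embeddings, two classes.
embeddings = [[0.0], [1.0], [1.5], [4.0]]
labels = [1.0, 1.0, 2.0, 2.0]

loss = batch_semi_hard_triplet_loss(embeddings, labels, margin=1.0)
print(loss)  # 0.75
```

Because triplets are formed inside each batch from the labels alone, this kind of loss only needs (sentence, label) pairs rather than pre-built triplets, which matches the two-column dataset described above.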
Evaluation Dataset
Unnamed Dataset
- Size: 95 evaluation samples
- Columns: sentence and label
- Approximate statistics based on the first 95 samples:

| | sentence | label |
|---|---|---|
| type | string | float |
| details | min: 38 tokens, mean: 348.78 tokens, max: 512 tokens | min: 3.0, mean: 149.72, max: 296.0 |
- Samples:

| sentence | label |
|---|---|
| VS Code does not delete old extension versions even after restart Does this issue occur when all extensions are disabled?: Yes | |