---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- NeuML/bert-hash-nano
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
- stsbenchmark-sts
model-index:
- name: CrossEncoder based on NeuML/bert-hash-nano
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.7903643753981804
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.7743038638523062
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.8476854898328164
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8444293778764848
      name: Spearman Cosine
---
# BERT Hash Cross-Encoder: Semantic Similarity (STS)
Cross-encoders are high-performing encoder models that jointly compare two texts and output a similarity score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model to be very useful for building evaluators of LLM outputs.
They're simple to use, fast, and very accurate.
BERT Hash reduces the embedding parameters with a hash-bucketing technique plus a projection layer, keeping every model in the family under 1M parameters.
These models are very small and well suited to inference at the edge.
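The exact hashing scheme lives in the NeuML `bert-hash` implementation; the following is only a rough, hypothetical illustration of the idea (the bucket count and dimensions are made up, not taken from `bert-hash-nano`):

```python
import torch
import torch.nn as nn


class HashedEmbedding(nn.Module):
    """Illustrative hash-bucketed embedding: token ids are mapped into a
    small number of buckets, so the lookup table is far smaller than
    vocab_size, then projected up to the model's hidden size."""

    def __init__(self, num_buckets=5000, bucket_dim=64, hidden_dim=256):
        super().__init__()
        self.num_buckets = num_buckets
        self.buckets = nn.Embedding(num_buckets, bucket_dim)
        self.proj = nn.Linear(bucket_dim, hidden_dim)

    def forward(self, input_ids):
        # A real implementation would use a proper hash function;
        # a modulo stands in for it here.
        bucket_ids = input_ids % self.num_buckets
        return self.proj(self.buckets(bucket_ids))


emb = HashedEmbedding()
print(emb(torch.randint(0, 30000, (1, 8))).shape)  # torch.Size([1, 8, 256])
```

Because the table is indexed by bucket rather than by vocabulary id, the embedding matrix shrinks from `vocab_size × hidden` to `num_buckets × bucket_dim` plus a small projection.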
---
## Features
- **Performance:** Achieves **Pearson: 0.7904** and **Spearman: 0.7743** on the STS-Benchmark test set.
- **Efficient architecture:** Built on the BERT Hash architecture, which hashes token embeddings into a small bucket table to keep the whole model under 1M parameters.
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals (see the sketch after this list).
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.
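To take advantage of the long context when scoring lengthy LLM outputs, you can cap tokenization at the model's window when loading. A minimal sketch (note that `max_length` handling can vary across `sentence-transformers` versions):

```python
from sentence_transformers import CrossEncoder

# Cap tokenization at the model's 8192-token context window.
model = CrossEncoder(
    "dleemiller/sts-bert-hash-nano", trust_remote_code=True, max_length=8192
)

# Toy long inputs, repeated to simulate lengthy LLM outputs.
long_reference = " ".join(["The report covers quarterly revenue."] * 500)
long_candidate = " ".join(["Quarterly revenue is summarized in the report."] * 500)
print(model.predict([(long_reference, long_candidate)]))
```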
---
## Performance
| Model | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed |
|--------------------------------|--------------------|---------------------|----------------|------------|---------|
| `dleemiller/ModernCE-large-sts` | **0.9256** | **0.9215** | **8192** | 395M | **Medium** |
| `dleemiller/CrossGemma-sts-300m` | 0.9175 | 0.9135 | 2048 | 303M | **Medium** |
| `dleemiller/ModernCE-base-sts` | 0.9162 | 0.9122 | **8192** | 149M | **Fast** |
| `cross-encoder/stsb-roberta-large` | 0.9147 | - | 512 | 355M | Slow |
| `dleemiller/EttinX-sts-m` | 0.9143 | 0.9102 | **8192** | 149M | **Fast** |
| `dleemiller/NeoCE-sts` | 0.9124 | 0.9087 | 4096 | 250M | **Fast** |
| `dleemiller/EttinX-sts-s` | 0.9004 | 0.8926 | **8192** | 68M | **Very Fast** |
| `cross-encoder/stsb-distilroberta-base` | 0.8792 | - | 512 | 82M | Fast |
| `dleemiller/EttinX-sts-xs` | 0.8763 | 0.8689 | **8192** | 32M | **Very Fast** |
| `dleemiller/EttinX-sts-xxs` | 0.8414 | 0.8311 | **8192** | 17M | **Very Fast** |
| `dleemiller/sts-bert-hash-nano` | 0.7904 | 0.7743 | **8192** | 0.97M | **Very Fast** |
| `dleemiller/sts-bert-hash-pico` | 0.7595 | 0.7474 | **8192** | 0.45M | **Very Fast** |
---
## Usage
To use sts-bert-hash for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:
```python
from sentence_transformers import CrossEncoder

# Load the CrossEncoder model
model = CrossEncoder("dleemiller/sts-bert-hash-nano", trust_remote_code=True)

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```
### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
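Because the scores are calibrated to `[0, 1]`, a simple way to use the model as an LLM-output evaluator is to score each generation against a reference and apply a threshold. A minimal sketch (the 0.5 cutoff is an arbitrary example, not a tuned value):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/sts-bert-hash-nano", trust_remote_code=True)

reference = "The capital of France is Paris."
generations = [
    "Paris is France's capital city.",
    "The capital of France is Lyon.",
]

scores = model.predict([(reference, g) for g in generations])
for generation, score in zip(generations, scores):
    # 0.5 is an illustrative cutoff; tune it on your own data.
    verdict = "PASS" if score >= 0.5 else "FAIL"
    print(f"{verdict} ({score:.3f}): {generation}")
```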
---
## Training Details
### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset.
This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier dropout:** a relatively high classifier dropout of 0.15 to reduce overreliance on the teacher's scores.
- **Objective:** regression against STS-B-style similarity scores produced by the teacher model `dleemiller/ModernCE-large-sts` (a training sketch follows this list).
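A minimal sketch of this distillation-style pretraining, using the classic `CrossEncoder.fit` API (the column names for the `wiki-sim` subset are assumptions; check the dataset card, and note that newer `sentence-transformers` versions use `CrossEncoderTrainer` instead):

```python
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import CrossEncoder, InputExample

# Hypothetical column names; verify against the dleemiller/wiki-sim card.
ds = load_dataset("dleemiller/wiki-sim", "pair-score-sampled", split="train")
examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for row in ds
]

# num_labels=1 gives a single regression head over the [CLS] representation.
# The card notes a classifier dropout of 0.15 was used during pretraining.
model = CrossEncoder("NeuML/bert-hash-nano", num_labels=1, trust_remote_code=True)
model.fit(
    train_dataloader=DataLoader(examples, shuffle=True, batch_size=64),
    epochs=1,
    warmup_steps=1000,
)
```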
### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.
### Evaluation Results
The model achieved the following performance after fine-tuning:
- **STS-B test:** Pearson 0.7904, Spearman 0.7743
- **STS-B dev:** Pearson 0.8477, Spearman 0.8444
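These numbers can be reproduced with a short evaluation script (assuming the standard `sentence1`/`sentence2`/`score` columns of `sentence-transformers/stsb`):

```python
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/sts-bert-hash-nano", trust_remote_code=True)

# STS-B test split with scores normalized to [0, 1].
test = load_dataset("sentence-transformers/stsb", split="test")
pairs = list(zip(test["sentence1"], test["sentence2"]))
preds = model.predict(pairs)

print("Pearson: ", pearsonr(preds, test["score"])[0])
print("Spearman:", spearmanr(preds, test["score"])[0])
```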
---
## Model Card
- **Architecture:** bert-hash-nano
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
- **Fine-Tuning Data:** `sentence-transformers/stsb`
---
## Thank You
Thanks to the NeuML team for providing the BERT Hash models, and the Sentence Transformers team for their leadership in transformer encoder models.
---
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{stsnano2025,
  author    = {Miller, D. Lee},
  title     = {BERT Hash STS: An STS cross-encoder model},
  year      = {2025},
  publisher = {Hugging Face Hub},
  url       = {https://huggingface.co/dleemiller/sts-bert-hash-nano},
}
```
---
## License
This model is licensed under the [MIT License](LICENSE). |