---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- NeuML/bert-hash-nano
pipeline_tag: text-ranking
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
- stsbenchmark-sts
model-index:
- name: CrossEncoder based on NeuML/bert-hash-nano
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts test
      type: sts-test
    metrics:
    - type: pearson_cosine
      value: 0.7903643753981804
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.7743038638523062
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: sts dev
      type: sts-dev
    metrics:
    - type: pearson_cosine
      value: 0.8476854898328164
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.8444293778764848
      name: Spearman Cosine
---

# BERT Hash Cross-Encoder: Semantic Similarity (STS)

Cross encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
I've found the `cross-encoder/stsb-roberta-large` model to be very useful for building evaluators of LLM outputs.
They're simple to use, fast, and very accurate.

The BERT Hash models use a hash-bucketing technique with a projection layer to shrink the embedding parameters, keeping every model under 1M parameters.
These models are very small and well suited to inference at the edge.
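To make the idea concrete, here is a minimal NumPy sketch of hash-bucketed embeddings. All names and constants here are illustrative assumptions, not the actual NeuML implementation: instead of storing one embedding row per vocabulary token, each token id is hashed into a small number of shared buckets, the bucket vectors are summed, and a projection maps them to the model's hidden size.

```python
import numpy as np

# Illustrative sizes (assumptions, not the real bert-hash-nano config).
VOCAB_SIZE = 30_000   # full vocabulary; no table of this size is stored
NUM_BUCKETS = 512     # shared embedding rows actually stored
NUM_HASHES = 2        # hash functions per token
BUCKET_DIM = 64       # stored embedding width
HIDDEN_DIM = 128      # hidden size after projection

rng = np.random.default_rng(0)
bucket_table = rng.normal(size=(NUM_BUCKETS, BUCKET_DIM))
projection = rng.normal(size=(BUCKET_DIM, HIDDEN_DIM))

def embed(token_ids):
    """Map token ids to hidden-size vectors via hashed buckets."""
    out = np.zeros((len(token_ids), BUCKET_DIM))
    for h in range(NUM_HASHES):
        # Simple multiplicative hash; a real model would use fixed seeds.
        idx = [(t * 2654435761 + h * 40503) % NUM_BUCKETS for t in token_ids]
        out += bucket_table[idx]
    return out @ projection

vecs = embed([101, 2023, 2003, 102])
print(vecs.shape)  # (4, 128)
```

The stored parameters here are `512 * 64 + 64 * 128 = 40,960` floats, versus `30,000 * 128 = 3.84M` for a full embedding table, which is the source of the sub-1M parameter counts.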

---

## Features
- **Performance:** Achieves **Pearson: 0.7904** and **Spearman: 0.7743** on the STS-Benchmark test set.
- **Efficient architecture:** Built on the lightweight BERT Hash architecture (under 1M parameters).
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals.
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

---

## Performance

| Model                          | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed  |
|--------------------------------|--------------------|---------------------|----------------|------------|---------|
| `dleemiller/ModernCE-large-sts`           | **0.9256**         | **0.9215**          | **8192**       | 395M       | **Medium** |
| `dleemiller/CrossGemma-sts-300m`          | 0.9175         | 0.9135          | 2048       | 303M       | **Medium** |
| `dleemiller/ModernCE-base-sts`            | 0.9162         | 0.9122          | **8192**       | 149M       | **Fast** |
| `cross-encoder/stsb-roberta-large`        | 0.9147            | -              | 512            | 355M       | Slow    |
| `dleemiller/EttinX-sts-m`                 | 0.9143        | 0.9102          | **8192**       | 149M       | **Fast** |
| `dleemiller/NeoCE-sts`                    | 0.9124         | 0.9087          | 4096       | 250M       | **Fast** |
| `dleemiller/EttinX-sts-s`                 | 0.9004        | 0.8926          | **8192**       | 68M       | **Very Fast** |
| `cross-encoder/stsb-distilroberta-base`   | 0.8792            | -              | 512            | 82M        | Fast    |
| `dleemiller/EttinX-sts-xs`                | 0.8763        | 0.8689          | **8192**       | 32M       | **Very Fast** |
| `dleemiller/EttinX-sts-xxs`               | 0.8414        | 0.8311          | **8192**       | 17M       | **Very Fast** |
| `dleemiller/sts-bert-hash-nano`           | 0.7904        | 0.7743          | **8192**       | 0.97M       | **Very Fast** |
| `dleemiller/sts-bert-hash-pico`           | 0.7595        | 0.7474          | **8192**       | 0.45M       | **Very Fast** |

---

## Usage

To use `sts-bert-hash-nano` for semantic similarity tasks, load the model with the Hugging Face `sentence-transformers` library:

```python
from sentence_transformers import CrossEncoder

# Load CrossEncoder model
model = CrossEncoder("dleemiller/sts-bert-hash-nano", trust_remote_code=True)

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```

### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.

---

## Training Details

### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset.
This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier Dropout:** a relatively large classifier dropout of 0.15 is used to reduce over-reliance on teacher scores.
- **Objective:** regression on STS-B-style scores produced by `dleemiller/ModernCE-large-sts`.

### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.
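As a side note on label preprocessing: raw STS-Benchmark annotations are on a 0–5 scale, while the cross encoder outputs scores in `[0, 1]`, so labels are typically rescaled before training. This is a minimal sketch of that rescaling, not the exact training code (the `sentence-transformers/stsb` dataset may already ship pre-normalized scores, in which case this step is unnecessary):

```python
def normalize_sts(label, lo=0.0, hi=5.0):
    """Rescale a raw STS-Benchmark label (0-5) to the [0, 1] range."""
    return (label - lo) / (hi - lo)

print(normalize_sts(2.5))  # 0.5
```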

### Test Results
The model achieved the following STS-Benchmark test set performance after fine-tuning:
- **Pearson Correlation:** 0.7904
- **Spearman Correlation:** 0.7743

---

## Model Card

- **Architecture:** bert-hash-nano
- **Tokenizer:** Custom tokenizer trained with modern techniques for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
- **Fine-Tuning Data:** `sentence-transformers/stsb`

---

## Thank You

Thanks to the NeuML team for providing the BERT Hash models, and the Sentence Transformers team for their leadership in transformer encoder models.

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{stsnano2025,
  author = {Miller, D. Lee},
  title = {Bert Hash STS: An STS cross encoder model},
  year = {2025},
  publisher = {Hugging Face Hub},
  url = {https://huggingface.co/dleemiller/sts-bert-hash-nano},
}
```

---

## License

This model is licensed under the [MIT License](LICENSE).