---
language: 
  - en
license: mit
datasets:
- pile
metrics:
- nDCG@10
- MRR
---

# Carptriever-1


## Model description

Carptriever-1 is a `bert-large-uncased` retrieval model trained with contrastive learning via a momentum contrast (MoCo) mechanism, following the work of Izacard et al. in ["Unsupervised Dense Information Retrieval with Contrastive Learning"](https://arxiv.org/abs/2112.09118).


## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

# Remove pooling layer
model = AutoModel.from_pretrained("CarperAI/carptriever-1", add_pooling_layer=False)
tokenizer = AutoTokenizer.from_pretrained("CarperAI/carptriever-1")

sentences = [
    "Where was Marie Curie born?",  # Query
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Apply tokenizer
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Encode sentences (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs[0], inputs['attention_mask'])

# Compute dot-product scores between the query and sentence embeddings
query_embedding, sentence_embeddings = embeddings[0], embeddings[1:]
scores = (query_embedding @ sentence_embeddings.transpose(0, 1)).cpu().tolist()

# Sort by score (descending), not lexicographically by sentence text
sentence_score_pairs = sorted(zip(sentences[1:], scores), key=lambda pair: pair[1], reverse=True)
print(f"Query: {sentences[0]}")
for sentence, score in sentence_score_pairs:
    print(f"\nSentence: {sentence}\nScore: {score:.4f}")
```


## Training data

Carptriever-1 is pre-trained on a de-duplicated subset of [The Pile](https://pile.eleuther.ai/), a large and diverse dataset created by EleutherAI for language model training. The subset was de-duplicated with [MinHash LSH](http://ekzhu.com/datasketch/lsh.html) using a similarity threshold of `0.87`; a sketch of the approach follows.
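
Below is a minimal sketch of threshold-based de-duplication with the `datasketch` library linked above. The shingling scheme and corpus handling here are assumptions for illustration, not the exact pipeline used:

```python
# Illustrative MinHash LSH de-duplication; whitespace token shingles are an
# assumption -- the actual pipeline may shingle differently.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox jumps over the lazy dog again",  # near-duplicate
    "doc3": "an entirely unrelated sentence about dense retrieval",
}

# Documents whose estimated Jaccard similarity with an already-kept document
# exceeds the 0.87 threshold are treated as duplicates and skipped.
lsh = MinHashLSH(threshold=0.87, num_perm=128)
kept = []
for doc_id, text in documents.items():
    m = minhash(text)
    if not lsh.query(m):  # no near-duplicate already in the index
        lsh.insert(doc_id, m)
        kept.append(doc_id)
```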


## Training procedure

The model was trained on 32 A100 40GB GPUs for approximately 100 hours with the following configuration; a sketch of how the MoCo settings enter the training loss follows the list:

- Base model: 
    - `bert-large-uncased`
- Optimizer settings:
    - `optimizer  = AdamW`
    - `lr         = 1e-5`
    - `schedule   = linear`
    - `warmup     = 20,000 steps`
    - `batch size = 2048`
    - `training steps = 150,000`
- MoCo settings: 
    - `queue size  = 8192`
    - `momentum    = 0.999`
    - `temperature = 0.05` 
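
Below is a minimal sketch of how these MoCo hyperparameters interact. It is not the released training code; the encoders and the queue update are simplified:

```python
# Illustrative MoCo-style step: a momentum-updated key encoder, a FIFO queue
# of negative key embeddings, and a temperature-scaled InfoNCE loss.
import torch
import torch.nn.functional as F

MOMENTUM, TEMPERATURE = 0.999, 0.05

@torch.no_grad()
def momentum_update(query_encoder, key_encoder):
    # The key encoder trails the query encoder: k <- m * k + (1 - m) * q
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data.mul_(MOMENTUM).add_(q_p.data, alpha=1 - MOMENTUM)

def moco_loss(q, k, queue):
    # q, k: (batch, dim) embeddings of two views of the same documents;
    # queue: (queue_size, dim) past key embeddings serving as negatives.
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)  # (batch, 1) positive logits
    l_neg = q @ queue.t()                     # (batch, queue_size) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / TEMPERATURE
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```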


## Evaluation results

#### [BEIR: Benchmarking IR](https://github.com/beir-cellar/beir)

We report the following BEIR scores, measured in normalized discounted cumulative gain (nDCG@10); a short sketch of the metric follows the table:

|   Model       | Avg   | MSMARCO | TREC-Covid | NFCorpus | NaturalQuestions | HotpotQA | FiQA | ArguAna | Touché-2020 | Quora | CQAdupstack | DBPedia | Scidocs | Fever | Climate-fever | Scifact  |
|---------------|-------|---------|------------|----------|------------------|----------|------|---------|-------------|-------|-------------|---------|---------|-------|---------------|----------|
| Contriever*   | 35.97 |    20.6 |       27.4 |     31.7 |             25.4 |     48.1 | 24.5 |    37.9 |        19.3 |  83.5 |       28.40 |    29.2 |    14.9 | 68.20 |          15.5 |    64.9  |
| Carptriever-1 | 34.54 |   18.83 |   **52.2** |     28.5 |             21.1 |     39.4 | 23.2 |    31.7 |        15.2 |  81.3 |       26.88 |    25.4 |    14.2 | 57.36 |      **17.9** |    64.9  |

\* Results are taken from the Contriever [GitHub repository](https://github.com/facebookresearch/contriever).

The degradation in performance relative to Contriever was expected, given the much broader diversity of our training dataset. We plan to address this in future updates with architectural improvements, and we view Carptriever-1 as a first iteration in an exploratory phase toward better language-embedding models.
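
For reference, here is a minimal sketch of nDCG@10 for a single query. BEIR's official evaluation uses `pytrec_eval`; this function is illustrative only:

```python
# Illustrative nDCG@10 for one query; `ranked_relevances` holds the relevance
# labels of all candidate documents in the order the model ranked them.
import math

def ndcg_at_10(ranked_relevances):
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevances[:10]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:10]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_10([0, 1, 0, 1, 0, 0, 0, 0, 0, 0]))  # imperfect ranking, < 1.0
```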


#### [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://arxiv.org/pdf/1909.09436.pdf)

We provide results on the CodeSearchNet benchmark, measured in Mean Reciprocal Rank (MRR), following the code search procedure outlined in Section 3.3 of Neelakantan et al.'s ["Text and Code Embeddings by Contrastive Pre-Training"](https://arxiv.org/pdf/2201.10005.pdf).
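
As a rough sketch of that procedure, assuming dot-product similarity and that `code_embs[i]` is the true snippet for `query_embs[i]` (the actual evaluation script is not reproduced here):

```python
# Illustrative MRR: each docstring query is scored against a candidate pool
# containing its true code snippet plus distractors.
import torch

def mean_reciprocal_rank(query_embs, code_embs):
    # query_embs, code_embs: (n, dim); code_embs[i] matches query i
    scores = query_embs @ code_embs.t()            # (n, n) similarity matrix
    true_scores = scores.diag().unsqueeze(1)       # (n, 1) score of the true snippet
    ranks = (scores >= true_scores).sum(dim=1)     # rank of the true snippet (1-based)
    return (1.0 / ranks.float()).mean().item()
```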

`Candidate Size = 1,000`

| Model           | Avg   | Python | Go    | Ruby  | PHP   | Java  | JS    |
|-----------------|-------|--------|-------|-------|-------|-------|-------|
| Carptriever-1   | 60.24 |  65.85 | 63.29 | 62.10 | 59.10 | 55.52 | 55.55 |
| Contriever      | 49.39 |  54.81 |  58.9 | 55.19 | 38.46 | 44.89 | 44.09 |


`Candidate Size = 10,000`

| Model           | Avg   | Python | Go    | Ruby  | PHP   | Java  | JS    |
|-----------------|-------|--------|-------|-------|-------|-------|-------|
| Carptriever-1   | 48.59 |  55.98 | 43.18 | 56.06 | 45.62 | 46.04 | 44.66 |
| Contriever      | 37.00 |  45.43 | 36.08 | 48.07 | 25.59 | 32.89 | 31.44 |


## Acknowledgements

This work would not have been possible without the compute support of [Stability AI](https://stability.ai/).

Thank you to Louis Castricato for research guidance and Reshinth Adithyan for creating the CodeSearchNet evaluation script.


## Citations

```bibtex
@misc{izacard2021contriever,
    title={Unsupervised Dense Information Retrieval with Contrastive Learning}, 
    author={Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave},
    year={2021},
    url = {https://arxiv.org/abs/2112.09118},
    doi = {10.48550/ARXIV.2112.09118},
}
```

```bibtex
@article{pile,
    title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
    author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
    journal={arXiv preprint arXiv:2101.00027},
    year={2020}
}
```