---
language:
- en
license: mit
datasets:
- pile
metrics:
- nDCG@10
---

# Carptriever-1

# Model description

Carptriever-1 is a `bert-large-uncased` retrieval model trained with contrastive learning via a momentum contrastive (MoCo) mechanism, following the work of G. Izacard et al. in ["Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning"](https://arxiv.org/abs/2112.09118).

# How to use

```python
from transformers import AutoTokenizer, AutoModel

def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

# Remove the pooling layer; we pool the token embeddings ourselves
model = AutoModel.from_pretrained("carperai/carptriever-1", add_pooling_layer=False)
tokenizer = AutoTokenizer.from_pretrained("carperai/carptriever-1")

sentences = [
    "Where was Marie Curie born?",  # Query
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace.",
]

# Apply tokenizer
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Encode sentences
outputs = model(**inputs)
embeddings = mean_pooling(outputs[0], inputs['attention_mask'])

# Compute dot-product scores between the query and sentence embeddings
query_embedding, sentence_embeddings = embeddings[0], embeddings[1:]
scores = (query_embedding @ sentence_embeddings.transpose(0, 1)).cpu().tolist()

# Sort by score (descending), not alphabetically by sentence text
sentence_score_pairs = sorted(zip(sentences[1:], scores), key=lambda x: x[1], reverse=True)
print(f"Query: {sentences[0]}")
for sentence, score in sentence_score_pairs:
    print(f"\nSentence: {sentence}\nScore: {score:.4f}")
```
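
The `mean_pooling` helper averages only over real (non-padded) token positions; a minimal NumPy re-implementation (toy values, not model output) shows the effect of the attention mask:

```python
import numpy as np

def mean_pooling(token_embeddings, mask):
    # Zero out padded positions, then average over the real tokens only.
    masked = token_embeddings * mask[..., None]
    return masked.sum(axis=1) / mask.sum(axis=1)[..., None]

# Toy batch: 1 sequence, 4 token positions, embedding dim 2; last position is padding.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 1, 0]])

pooled = mean_pooling(emb, mask)
print(pooled)  # [[3. 4.]] — the padded position does not contaminate the average
```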

# Training data

Carptriever-1 is pre-trained on [The Pile](https://pile.eleuther.ai/), a large and diverse dataset created by EleutherAI for language model training.

# Training procedure

The model was trained on 32 A100 40GB GPUs for approximately 100 hours with the following configuration:

- Base model:
  - `bert-large-uncased`
- Optimizer settings:
  - `optimizer = AdamW`
  - `lr = 1e-5`
  - `schedule = linear`
  - `warmup = 20,000 steps`
  - `batch size = 2048`
  - `training steps = 150,000`
- MoCo settings:
  - `queue size = 8192`
  - `momentum = 0.999`
  - `temperature = 0.05`

# Evaluation results

We provide evaluation results on the [BEIR: Benchmarking IR](https://github.com/beir-cellar/beir) suite.

| nDCG@10 | Avg | MSMARCO | TREC-Covid | NFCorpus | NaturalQuestions | HotpotQA | FiQA | ArguAna | Touché-2020 | Quora | CQADupStack | DBPedia | SciDocs | FEVER | Climate-FEVER | SciFact |
|---------------|-------|---------|----------|------|------|-------|------|------|------|------|------|------|------|------|----------|----------|
| Contriever* | 35.97 | 20.6 | 27.4 | 31.7 | 25.4 | 48.1 | 24.5 | 37.9 | 19.3 | 83.5 | 28.4 | 29.2 | 14.9 | 68.2 | 15.5 | 64.9 |
| Carptriever-1 | 34.29 | 18.81 | **46.5** | 28.9 | 21.1 | 39.01 | 20.2 | 33.4 | 17.3 | 80.6 | 25.4 | 23.6 | 14.9 | 59.6 | **18.7** | **66.4** |

\* Results are taken from the Contriever [repository](https://github.com/facebookresearch/contriever).

Note that the degradation in performance relative to the Contriever model was expected, given the much broader diversity of our training dataset. We plan to address this in future updates with architectural improvements, and view Carptriever-1 as a first iteration in the exploratory phase towards better language-embedding models.
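
For reference, the nDCG@10 metric reported above can be computed per query with a short helper; the relevance labels below are illustrative toy values, not BEIR data:

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one ranked list of graded relevance labels."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = (rel * discounts).sum()
    # Ideal DCG: the same labels sorted best-first.
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts[:ideal.size]).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking: relevant documents retrieved at positions 1 and 3 of 5.
print(round(ndcg_at_k([1, 0, 1, 0, 0]), 4))  # 0.9197
```

The benchmark score is then the mean of this value over all queries in a dataset.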

# Appreciation

All compute was graciously provided by [Stability.ai](https://stability.ai/).

# Citations

```bibtex
@misc{izacard2021contriever,
  title={Unsupervised Dense Information Retrieval with Contrastive Learning},
  author={Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave},
  year={2021},
  url={https://arxiv.org/abs/2112.09118},
  doi={10.48550/ARXIV.2112.09118},
}
```

```bibtex
@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
```