Upload folder using huggingface_hub
- 1_Pooling/config.json +1 -1
- README.md +12 -12
- config.json +14 -13
- model.safetensors +2 -2
- onnx/model.onnx +2 -2
- onnx/model_O1.onnx +2 -2
- onnx/model_O2.onnx +2 -2
- onnx/model_O3.onnx +2 -2
- onnx/model_O4.onnx +2 -2
- onnx/model_qint8_arm64.onnx +2 -2
- onnx/model_qint8_avx512.onnx +2 -2
- onnx/model_qint8_avx512_vnni.onnx +2 -2
- onnx/model_quint8_avx2.onnx +2 -2
- openvino/openvino_model.bin +2 -2
- openvino/openvino_model.xml +0 -0
- openvino/openvino_model_qint8_quantized.bin +2 -2
- openvino/openvino_model_qint8_quantized.xml +0 -0
- pytorch_model.bin +2 -2
- rust_model.ot +3 -0
- sentence_bert_config.json +1 -1
- special_tokens_map.json +1 -1
- tf_model.h5 +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -1
- train_script.py +1 -1
- vocab.txt +0 -5
1_Pooling/config.json CHANGED
@@ -1,5 +1,5 @@
 {
-    "word_embedding_dimension":
+    "word_embedding_dimension": 384,
     "pooling_mode_cls_token": false,
     "pooling_mode_mean_tokens": true,
     "pooling_mode_max_tokens": false,
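The updated pooling config pins mean pooling over 384-dimensional token embeddings. As a rough illustration of what `"pooling_mode_mean_tokens": true` means downstream, here is a minimal sketch (not part of the commit) of attention-mask-weighted mean pooling:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, 384); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum over real tokens only, then divide by the token count per sentence.
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```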
README.md CHANGED
@@ -33,8 +33,8 @@ pipeline_tag: sentence-similarity
 ---
 
 
-# all-
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a
+# all-MiniLM-L6-v2
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
 ## Usage (Sentence-Transformers)
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
@@ -48,7 +48,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]
 
-model = SentenceTransformer('sentence-transformers/all-
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
@@ -72,8 +72,8 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['This is an example sentence', 'Each sentence is converted']
 
 # Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-
-model = AutoModel.from_pretrained('sentence-transformers/all-
+tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
 
 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -97,27 +97,27 @@ print(sentence_embeddings)
 ## Background
 
 The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
-contrastive learning objective. We used the pretrained [`
+contrastive learning objective. We used the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model and fine-tuned in on a
 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
 
-We
+We developed this model during the
 [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
-organized by Hugging Face. We
+organized by Hugging Face. We developed this model as part of the project:
 [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Googles Flax, JAX, and Cloud team member about efficient deep learning frameworks.
 
 ## Intended uses
 
-Our model is
+Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
 the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.
 
-By default, input text longer than
+By default, input text longer than 256 word pieces is truncated.
 
 
 ## Training procedure
 
 ### Pre-training
 
-We use the pretrained [`
+We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.
 
 ### Fine-tuning
 
@@ -126,7 +126,7 @@ We then apply the cross entropy loss by comparing with true pairs.
 
 #### Hyper parameters
 
-We trained
+We trained our model on a TPU v3-8. We train the model during 100k steps using a batch size of 1024 (128 per TPU core).
 We use a learning rate warm up of 500. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
 a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.
 
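The README's Background and Fine-tuning sections describe the training signal: score each sentence against every candidate in the batch and apply cross entropy against the true pair. A minimal sketch of that objective under the stated assumptions (in-batch negatives, cosine similarity; the `scale` value here is an assumption, the actual setting lives in `train_script.py`):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                              scale: float = 20.0) -> torch.Tensor:
    # emb_a, emb_b: (batch, dim) embeddings of paired sentences (a_i, b_i).
    # Cosine-similarity matrix of every a_i against every b_j; scale is assumed.
    scores = F.normalize(emb_a, dim=1) @ F.normalize(emb_b, dim=1).T * scale
    # The true pair sits on the diagonal; cross entropy pushes it to win.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```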
config.json CHANGED
@@ -1,23 +1,24 @@
 {
-  "_name_or_path": "
+  "_name_or_path": "nreimers/MiniLM-L6-H384-uncased",
   "architectures": [
-    "
+    "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
-  "
-  "eos_token_id": 2,
+  "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size":
+  "hidden_size": 384,
   "initializer_range": 0.02,
-  "intermediate_size":
+  "intermediate_size": 1536,
-  "layer_norm_eps": 1e-
+  "layer_norm_eps": 1e-12,
-  "max_position_embeddings":
+  "max_position_embeddings": 512,
-  "model_type": "
+  "model_type": "bert",
   "num_attention_heads": 12,
-  "num_hidden_layers":
+  "num_hidden_layers": 6,
-  "pad_token_id":
+  "pad_token_id": 0,
-  "
+  "position_embedding_type": "absolute",
   "transformers_version": "4.8.2",
-  "
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
 }
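The new values describe a 6-layer BERT encoder with a 384-dimensional hidden state, consistent with `"word_embedding_dimension": 384` in `1_Pooling/config.json`. A quick sanity sketch (not part of the commit) using `transformers`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
assert config.model_type == "bert"
assert config.num_hidden_layers == 6 and config.hidden_size == 384
assert config.vocab_size == 30522  # matches the reworked vocab.txt below
```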
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:53aa51172d142c89d9012cce15ae4d6cc0ca6895895114379cacb4fab128d9db
+size 90868376
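The weight files in this commit (model.safetensors, the ONNX and OpenVINO binaries, pytorch_model.bin, rust_model.ot, tf_model.h5) are Git LFS pointers: three lines recording the spec version, the payload's sha256, and its size in bytes. A minimal sketch (Python 3.9+, paths are illustrative) for checking a downloaded payload against its pointer:

```python
import hashlib
from pathlib import Path

def verify_lfs_payload(pointer_path: str, payload_path: str) -> bool:
    # Parse "key value" lines: version, oid (sha256:<hex>), size (bytes).
    fields = dict(line.split(" ", 1) for line in Path(pointer_path).read_text().splitlines())
    payload = Path(payload_path).read_bytes()
    return (hashlib.sha256(payload).hexdigest() == fields["oid"].removeprefix("sha256:")
            and len(payload) == int(fields["size"]))
```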
onnx/model.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:6fd5d72fe4589f189f8ebc006442dbb529bb7ce38f8082112682524616046452
+size 90405214
onnx/model_O1.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1391c6fc20b5530250bc15cbe1f47578ffeca55ab0551d335cc668b6299a88ec
+size 90360328
onnx/model_O2.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1de3905029190b398c7d300b530e320cf4b5e7d3dfb9af1429ebd73fd9a16faf
+size 90326566
onnx/model_O3.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:a44f671e364dddbac31f203f07b91be6b0a35e51936e5ebfab65b6d9538b83ff
+size 90326497
onnx/model_O4.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1667d7f3ba669048b13a96ee3a44456d5e42c8f44588ae8b603430e16160c485
+size 45212349
onnx/model_qint8_arm64.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_qint8_avx512.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_qint8_avx512_vnni.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_quint8_avx2.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:b941bf19f1f1283680f449fa6a7336bb5600bdcd5f84d10ddc5cd72218a0fd21
+size 23046789
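Worth noting: `model_qint8_arm64.onnx`, `model_qint8_avx512.onnx`, and `model_qint8_avx512_vnni.onnx` now point at the same oid and size, so the three variants are byte-identical uploads; only `model_quint8_avx2.onnx` differs. Loading any variant is the usual `onnxruntime` call; a sketch, with input names to be confirmed against the actual export:

```python
import onnxruntime as ort

# Path relative to a local checkout of the repository.
session = ort.InferenceSession("onnx/model_quint8_avx2.onnx")
print([inp.name for inp in session.get_inputs()])  # typically input_ids, attention_mask, ...
```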
openvino/openvino_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:8b86cab4722e2aefab310cf96d4d5a9eb3b187f7d9670a082afc55c7fa0d392a
+size 90265744
openvino/openvino_model.xml CHANGED
(The diff for this file is too large to render; see the raw diff.)
openvino/openvino_model_qint8_quantized.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c92ea4af3c6bc7b4a0f3b3d61b147c850f4dbdd7c9e7beee0c0c70dc12da289b
+size 22933664
openvino/openvino_model_qint8_quantized.xml CHANGED
(The diff for this file is too large to render; see the raw diff.)
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:c3a85f238711653950f6a79ece63eb0ea93d76f6a6284be04019c53733baf256
+size 90888945
rust_model.ot ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2d98d96d278348988f2744e6445b8bc16d921c3f6e17c667362f3cb353007aea
+size 90887379
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
 {
-    "max_seq_length":
+    "max_seq_length": 256,
     "do_lower_case": false
 }
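The new `max_seq_length` matches the README note that inputs longer than 256 word pieces are truncated. The limit is exposed (and overridable) on the loaded model; a short sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256, read from sentence_bert_config.json
model.max_seq_length = 128   # optional: trade context length for speed
```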
special_tokens_map.json CHANGED
@@ -1 +1 @@
-{"
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:24c06a7429b843d46e40c6b167122053921bf94dce2e5550ea5c07fabc597646
+size 91005696
tokenizer.json CHANGED
(The diff for this file is too large to render; see the raw diff.)
tokenizer_config.json CHANGED
@@ -1 +1 @@
-{"do_lower_case": true, "
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "name_or_path": "nreimers/MiniLM-L6-H384-uncased", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer", "model_max_length": 512}
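The tokenizer config now pins `BertTokenizer` with the standard BERT special tokens from `special_tokens_map.json` and a `model_max_length` of 512 (the position-embedding limit; the 256-token truncation above is applied by sentence-transformers). A quick sanity sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
assert tok.cls_token == "[CLS]" and tok.sep_token == "[SEP]" and tok.pad_token == "[PAD]"
assert tok.model_max_length == 512
```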
train_script.py CHANGED
@@ -341,4 +341,4 @@ if __name__ == "__main__":
 
 
 # Script was called via:
-#python train_many_data_files_v2.py --steps 1000000 --batch_size
+#python train_many_data_files_v2.py --steps 1000000 --batch_size 128 --model nreimers/MiniLM-L6-H384-uncased train_data_configs/all_datasets_v4.json output/all_datasets_v4_MiniLM-L6-H384-uncased-batch128
vocab.txt CHANGED
@@ -1,7 +1,3 @@
-<s>
-<pad>
-</s>
-<unk>
 [PAD]
 [unused0]
 [unused1]
@@ -30524,4 +30520,3 @@ necessitated
 ##:
 ##?
 ##~
-<mask>
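The removed lines are RoBERTa-style special tokens (`<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>`); what remains is the standard 30522-entry BERT WordPiece vocabulary, matching `"vocab_size": 30522` in `config.json`. A one-line check, run from a local checkout of the repository:

```python
with open("vocab.txt", encoding="utf-8") as f:
    assert sum(1 for _ in f) == 30522  # matches config.json vocab_size
```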