Mohamed-Sami-Ghrab committed on
Commit 2adae6c · verified · 1 Parent(s): 1783991

Upload folder using huggingface_hub

1_Pooling/config.json CHANGED
@@ -1,5 +1,5 @@
 {
-    "word_embedding_dimension": 768,
+    "word_embedding_dimension": 384,
     "pooling_mode_cls_token": false,
     "pooling_mode_mean_tokens": true,
     "pooling_mode_max_tokens": false,
README.md CHANGED
@@ -33,8 +33,8 @@ pipeline_tag: sentence-similarity
 ---


-# all-mpnet-base-v2
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+# all-MiniLM-L6-v2
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.

 ## Usage (Sentence-Transformers)
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
@@ -48,7 +48,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]

-model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
@@ -72,8 +72,8 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['This is an example sentence', 'Each sentence is converted']

 # Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
-model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
+tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -97,27 +97,27 @@ print(sentence_embeddings)
 ## Background

 The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
-contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned in on a
+contrastive learning objective. We used the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model and fine-tuned it on a
 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.

-We developped this model during the
+We developed this model during the
 [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
-organized by Hugging Face. We developped this model as part of the project:
+organized by Hugging Face. We developed this model as part of the project:
 [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

 ## Intended uses

-Our model is intented to be used as a sentence and short paragraph encoder. Given an input text, it ouptuts a vector which captures
+Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
 the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.

-By default, input text longer than 384 word pieces is truncated.
+By default, input text longer than 256 word pieces is truncated.


 ## Training procedure

 ### Pre-training

-We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.
+We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.

 ### Fine-tuning

@@ -126,7 +126,7 @@ We then apply the cross entropy loss by comparing with true pairs.

 #### Hyper parameters

-We trained ou model on a TPU v3-8. We train the model during 100k steps using a batch size of 1024 (128 per TPU core).
+We trained our model on a TPU v3-8. We trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
 We use a learning rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
 a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.

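The fine-tuning hunk's context line ("We then apply the cross entropy loss by comparing with true pairs") describes an in-batch-negatives objective. Below is a minimal sketch of that loss, assuming cosine similarity with a fixed scale of 20; the batch size is illustrative, and the authoritative implementation is `train_script.py` in this repository:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    # Cosine similarity between every anchor and every positive in the batch
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    scores = a @ p.T * scale                # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0))   # the true pair sits on the diagonal
    return F.cross_entropy(scores, labels)

# Hypothetical 1024-pair batch of 384-dim sentence embeddings
loss = in_batch_contrastive_loss(torch.randn(1024, 384), torch.randn(1024, 384))
print(loss.item())
```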
config.json CHANGED
@@ -1,23 +1,24 @@
 {
-  "_name_or_path": "microsoft/mpnet-base",
+  "_name_or_path": "nreimers/MiniLM-L6-H384-uncased",
   "architectures": [
-    "MPNetForMaskedLM"
+    "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
-  "bos_token_id": 0,
-  "eos_token_id": 2,
+  "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
+  "hidden_size": 384,
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
-  "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 514,
-  "model_type": "mpnet",
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
   "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "pad_token_id": 1,
-  "relative_attention_num_buckets": 32,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
   "transformers_version": "4.8.2",
-  "vocab_size": 30527
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
 }
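A quick way to confirm the backbone swap is to read the published config back. A sketch assuming `transformers` is installed and using the repo id from the README diff above:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# Expected per this config.json: bert, 6 layers, hidden size 384
print(config.model_type, config.num_hidden_layers, config.hidden_size)
```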
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:78c0197b6159d92658e319bc1d72e4c73a9a03dd03815e70e555c5ef05615658
-size 437971872
+oid sha256:53aa51172d142c89d9012cce15ae4d6cc0ca6895895114379cacb4fab128d9db
+size 90868376
onnx/model.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:74187b16d9c946fea252e120cfd7a12c5779d8b8b86838a2e4c56573c47941bd
-size 435826548
+oid sha256:6fd5d72fe4589f189f8ebc006442dbb529bb7ce38f8082112682524616046452
+size 90405214
onnx/model_O1.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5c0b47004076ab40bf15a2c52b98a53e985ebb84faaeeb6d2551768f96e384b0
-size 435730180
+oid sha256:1391c6fc20b5530250bc15cbe1f47578ffeca55ab0551d335cc668b6299a88ec
+size 90360328
onnx/model_O2.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:14d01256f5f3d2245b15b596173bca4367c9405fde5700dd7fb4e110708c1793
-size 435666661
+oid sha256:1de3905029190b398c7d300b530e320cf4b5e7d3dfb9af1429ebd73fd9a16faf
+size 90326566
onnx/model_O3.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dd55510706038d0817b7d41bf2078f01472e4865190584ad624e8ab79bbcb310
-size 435666516
+oid sha256:a44f671e364dddbac31f203f07b91be6b0a35e51936e5ebfab65b6d9538b83ff
+size 90326497
onnx/model_O4.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cab2a54139fc4fd5b8e2a23cb5729ee28dc44cfde685ad3356d533653e635310
-size 217894954
+oid sha256:1667d7f3ba669048b13a96ee3a44456d5e42c8f44588ae8b603430e16160c485
+size 45212349
onnx/model_qint8_arm64.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c392a9c545c7d4438a16fed8287a76a576b27eaf029c1c23bbf78a7a666d197f
-size 110124379
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_qint8_avx512.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c392a9c545c7d4438a16fed8287a76a576b27eaf029c1c23bbf78a7a666d197f
-size 110124379
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_qint8_avx512_vnni.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c392a9c545c7d4438a16fed8287a76a576b27eaf029c1c23bbf78a7a666d197f
-size 110124379
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_quint8_avx2.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:aa5c27172d77bbd1cbae3628cbac4b26d7c12adabff25d2d4285d0f29159b237
-size 110207323
+oid sha256:b941bf19f1f1283680f449fa6a7336bb5600bdcd5f84d10ddc5cd72218a0fd21
+size 23046789
openvino/openvino_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5c3279d833888eaab745e24b652126c5a71375af185ac21aa47e112e2468dec0
-size 435583684
+oid sha256:8b86cab4722e2aefab310cf96d4d5a9eb3b187f7d9670a082afc55c7fa0d392a
+size 90265744
openvino/openvino_model.xml CHANGED
The diff for this file is too large to render. See raw diff
 
openvino/openvino_model_qint8_quantized.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fde0c650018f5e244f793316b666aaf4758d4e19072f430e59eb2bcc414895ce
-size 109974792
+oid sha256:c92ea4af3c6bc7b4a0f3b3d61b147c850f4dbdd7c9e7beee0c0c70dc12da289b
+size 22933664
openvino/openvino_model_qint8_quantized.xml CHANGED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a8fd120b1a0032e70ff3d4b8ab8e46a6d01c2cb08ffe7c007a021c1788928146
-size 438011953
+oid sha256:c3a85f238711653950f6a79ece63eb0ea93d76f6a6284be04019c53733baf256
+size 90888945
rust_model.ot ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2d98d96d278348988f2744e6445b8bc16d921c3f6e17c667362f3cb353007aea
+size 90887379
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
 {
-    "max_seq_length": 384,
+    "max_seq_length": 256,
     "do_lower_case": false
 }
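`max_seq_length` drops from 384 to 256 word pieces, which is what the README's truncation note refers to. A sketch to verify, assuming `sentence-transformers` is installed:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print(model.max_seq_length)  # 256, per this sentence_bert_config.json

# Inputs longer than the limit are silently truncated before encoding
long_text = "word " * 1000
print(model.encode(long_text).shape)  # (384,)
```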
special_tokens_map.json CHANGED
@@ -1 +1 @@
-{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:24c06a7429b843d46e40c6b167122053921bf94dce2e5550ea5c07fabc597646
+size 91005696
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1 +1 @@
-{"do_lower_case": true, "bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "[UNK]", "pad_token": "<pad>", "mask_token": "<mask>", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "microsoft/mpnet-base", "tokenizer_class": "MPNetTokenizer"}
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "name_or_path": "nreimers/MiniLM-L6-H384-uncased", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer", "model_max_length": 512}
train_script.py CHANGED
@@ -341,4 +341,4 @@ if __name__ == "__main__":


 # Script was called via:
-#python train_many_data_files_v2.py --steps 1000000 --batch_size 64 --model microsoft/mpnet-base train_data_configs/all_datasets_v4.json output/all_datasets_v4_mpnet-base
+#python train_many_data_files_v2.py --steps 1000000 --batch_size 128 --model nreimers/MiniLM-L6-H384-uncased train_data_configs/all_datasets_v4.json output/all_datasets_v4_MiniLM-L6-H384-uncased-batch128
vocab.txt CHANGED
@@ -1,7 +1,3 @@
-<s>
-<pad>
-</s>
-<unk>
 [PAD]
 [unused0]
 [unused1]
@@ -30524,4 +30520,3 @@ necessitated
 ##:
 ##?
 ##~
-<mask>