Mohamed-Sami-Ghrab committed on
Commit 2adae6c · verified · 1 Parent(s): 1783991

Upload folder using huggingface_hub

1_Pooling/config.json CHANGED
@@ -1,5 +1,5 @@
 {
-    "word_embedding_dimension": 768,
+    "word_embedding_dimension": 384,
     "pooling_mode_cls_token": false,
     "pooling_mode_mean_tokens": true,
     "pooling_mode_max_tokens": false,
README.md CHANGED
@@ -33,8 +33,8 @@ pipeline_tag: sentence-similarity
 ---


-# all-mpnet-base-v2
-This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+# all-MiniLM-L6-v2
+This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.

 ## Usage (Sentence-Transformers)
 Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
@@ -48,7 +48,7 @@ Then you can use the model like this:
 from sentence_transformers import SentenceTransformer
 sentences = ["This is an example sentence", "Each sentence is converted"]

-model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
+model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
 embeddings = model.encode(sentences)
 print(embeddings)
 ```
@@ -72,8 +72,8 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['This is an example sentence', 'Each sentence is converted']

 # Load model from HuggingFace Hub
-tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
-model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')
+tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
+model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

 # Tokenize sentences
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -97,27 +97,27 @@ print(sentence_embeddings)
 ## Background

 The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
-contrastive learning objective. We used the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model and fine-tuned in on a
+contrastive learning objective. We used the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model and fine-tuned it on a
 1B sentence pairs dataset. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset.

-We developped this model during the
+We developed this model during the
 [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
-organized by Hugging Face. We developped this model as part of the project:
+organized by Hugging Face. We developed this model as part of the project:
 [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

 ## Intended uses

-Our model is intented to be used as a sentence and short paragraph encoder. Given an input text, it ouptuts a vector which captures
+Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures
 the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.

-By default, input text longer than 384 word pieces is truncated.
+By default, input text longer than 256 word pieces is truncated.


 ## Training procedure

 ### Pre-training

-We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model. Please refer to the model card for more detailed information about the pre-training procedure.
+We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.

 ### Fine-tuning

@@ -126,7 +126,7 @@ We then apply the cross entropy loss by comparing with true pairs.

 #### Hyper parameters

-We trained ou model on a TPU v3-8. We train the model during 100k steps using a batch size of 1024 (128 per TPU core).
+We trained our model on a TPU v3-8. We trained the model for 100k steps using a batch size of 1024 (128 per TPU core).
 We use a learning rate warm-up of 500 steps. The sequence length was limited to 128 tokens. We used the AdamW optimizer with
 a 2e-5 learning rate. The full training script is accessible in this current repository: `train_script.py`.

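The fine-tuning hunk's context line ("We then apply the cross entropy loss by comparing with true pairs") describes an in-batch-negatives objective. Below is a minimal sketch of that loss, assuming cosine similarity with a fixed scale of 20; the batch size is illustrative, and the authoritative implementation is `train_script.py` in this repository:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    # Cosine similarity between every anchor and every positive in the batch
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    scores = a @ p.T * scale                # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0))   # the true pair sits on the diagonal
    return F.cross_entropy(scores, labels)

# Hypothetical 1024-pair batch of 384-dim sentence embeddings
loss = in_batch_contrastive_loss(torch.randn(1024, 384), torch.randn(1024, 384))
print(loss.item())
```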
config.json CHANGED
@@ -1,23 +1,24 @@
 {
-  "_name_or_path": "microsoft/mpnet-base",
+  "_name_or_path": "nreimers/MiniLM-L6-H384-uncased",
   "architectures": [
-    "MPNetForMaskedLM"
+    "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
-  "bos_token_id": 0,
-  "eos_token_id": 2,
+  "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
+  "hidden_size": 384,
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
-  "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 514,
-  "model_type": "mpnet",
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
   "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "pad_token_id": 1,
-  "relative_attention_num_buckets": 32,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
   "transformers_version": "4.8.2",
-  "vocab_size": 30527
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
 }
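A quick way to confirm the backbone swap is to read the published config back. A sketch assuming `transformers` is installed and using the repo id from the README diff above:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# Expected per this config.json: bert, 6 layers, hidden size 384
print(config.model_type, config.num_hidden_layers, config.hidden_size)
```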
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:78c0197b6159d92658e319bc1d72e4c73a9a03dd03815e70e555c5ef05615658
-size 437971872
+oid sha256:53aa51172d142c89d9012cce15ae4d6cc0ca6895895114379cacb4fab128d9db
+size 90868376
onnx/model.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:74187b16d9c946fea252e120cfd7a12c5779d8b8b86838a2e4c56573c47941bd
-size 435826548
+oid sha256:6fd5d72fe4589f189f8ebc006442dbb529bb7ce38f8082112682524616046452
+size 90405214
onnx/model_O1.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5c0b47004076ab40bf15a2c52b98a53e985ebb84faaeeb6d2551768f96e384b0
-size 435730180
+oid sha256:1391c6fc20b5530250bc15cbe1f47578ffeca55ab0551d335cc668b6299a88ec
+size 90360328
onnx/model_O2.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:14d01256f5f3d2245b15b596173bca4367c9405fde5700dd7fb4e110708c1793
-size 435666661
+oid sha256:1de3905029190b398c7d300b530e320cf4b5e7d3dfb9af1429ebd73fd9a16faf
+size 90326566
onnx/model_O3.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dd55510706038d0817b7d41bf2078f01472e4865190584ad624e8ab79bbcb310
-size 435666516
+oid sha256:a44f671e364dddbac31f203f07b91be6b0a35e51936e5ebfab65b6d9538b83ff
+size 90326497
onnx/model_O4.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cab2a54139fc4fd5b8e2a23cb5729ee28dc44cfde685ad3356d533653e635310
-size 217894954
+oid sha256:1667d7f3ba669048b13a96ee3a44456d5e42c8f44588ae8b603430e16160c485
+size 45212349
onnx/model_qint8_arm64.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c392a9c545c7d4438a16fed8287a76a576b27eaf029c1c23bbf78a7a666d197f
-size 110124379
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_qint8_avx512.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c392a9c545c7d4438a16fed8287a76a576b27eaf029c1c23bbf78a7a666d197f
-size 110124379
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_qint8_avx512_vnni.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c392a9c545c7d4438a16fed8287a76a576b27eaf029c1c23bbf78a7a666d197f
-size 110124379
+oid sha256:4278337fd0ff3c68bfb6291042cad8ab363e1d9fbc43dcb499fe91c871902474
+size 23026053
onnx/model_quint8_avx2.onnx CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:aa5c27172d77bbd1cbae3628cbac4b26d7c12adabff25d2d4285d0f29159b237
-size 110207323
+oid sha256:b941bf19f1f1283680f449fa6a7336bb5600bdcd5f84d10ddc5cd72218a0fd21
+size 23046789
openvino/openvino_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5c3279d833888eaab745e24b652126c5a71375af185ac21aa47e112e2468dec0
-size 435583684
+oid sha256:8b86cab4722e2aefab310cf96d4d5a9eb3b187f7d9670a082afc55c7fa0d392a
+size 90265744
openvino/openvino_model.xml CHANGED
The diff for this file is too large to render. See raw diff
 
openvino/openvino_model_qint8_quantized.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fde0c650018f5e244f793316b666aaf4758d4e19072f430e59eb2bcc414895ce
-size 109974792
+oid sha256:c92ea4af3c6bc7b4a0f3b3d61b147c850f4dbdd7c9e7beee0c0c70dc12da289b
+size 22933664
openvino/openvino_model_qint8_quantized.xml CHANGED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a8fd120b1a0032e70ff3d4b8ab8e46a6d01c2cb08ffe7c007a021c1788928146
-size 438011953
+oid sha256:c3a85f238711653950f6a79ece63eb0ea93d76f6a6284be04019c53733baf256
+size 90888945
rust_model.ot ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2d98d96d278348988f2744e6445b8bc16d921c3f6e17c667362f3cb353007aea
+size 90887379
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
 {
-    "max_seq_length": 384,
+    "max_seq_length": 256,
     "do_lower_case": false
 }
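`max_seq_length` drops from 384 to 256 word pieces, which is what the README's truncation note refers to. A sketch to verify, assuming `sentence-transformers` is installed:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print(model.max_seq_length)  # 256, per this sentence_bert_config.json

# Inputs longer than the limit are silently truncated before encoding
long_text = "word " * 1000
print(model.encode(long_text).shape)  # (384,)
```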
special_tokens_map.json CHANGED
@@ -1 +1 @@
-{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:24c06a7429b843d46e40c6b167122053921bf94dce2e5550ea5c07fabc597646
+size 91005696
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1 +1 @@
-{"do_lower_case": true, "bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "[UNK]", "pad_token": "<pad>", "mask_token": "<mask>", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "microsoft/mpnet-base", "tokenizer_class": "MPNetTokenizer"}
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "name_or_path": "nreimers/MiniLM-L6-H384-uncased", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer", "model_max_length": 512}
train_script.py CHANGED
@@ -341,4 +341,4 @@ if __name__ == "__main__":


 # Script was called via:
-#python train_many_data_files_v2.py --steps 1000000 --batch_size 64 --model microsoft/mpnet-base train_data_configs/all_datasets_v4.json output/all_datasets_v4_mpnet-base
+#python train_many_data_files_v2.py --steps 1000000 --batch_size 128 --model nreimers/MiniLM-L6-H384-uncased train_data_configs/all_datasets_v4.json output/all_datasets_v4_MiniLM-L6-H384-uncased-batch128
vocab.txt CHANGED
@@ -1,7 +1,3 @@
-<s>
-<pad>
-</s>
-<unk>
 [PAD]
 [unused0]
 [unused1]
@@ -30524,4 +30520,3 @@ necessitated
 ##:
 ##?
 ##~
-<mask>