sentence_transformers_support (#1)

- Add support to Sentence Transformers (84e5191ddb886bb66563d9d0f267cbcc65c7c883)
- Update README.md (c2e42e091b774937c0040aaf53b072ed4c7f698d)
- Update README.md (64d9cfe5766d56a682b9f3a991b7eb1a051a94ca)
- Update README.md (337f5cd0639bfb4800ebdc348ba6b42214e27aa3)

Files changed (11)

README.md +49 -0
config_sentence_transformers.json +14 -0
document_1_SpladePooling/config.json +5 -0
modules.json +8 -0
query_0_IDF/config.json +3 -0
query_0_IDF/model.safetensors +3 -0
query_0_IDF/special_tokens_map.json +37 -0
query_0_IDF/tokenizer.json +0 -0
query_0_IDF/tokenizer_config.json +63 -0
query_0_IDF/vocab.txt +0 -0
router_config.json +20 -0

README.md CHANGED

@@ -24,6 +24,14 @@ tags:
 - passage-retrieval
 - document-expansion
 - bag-of-words
+- sentence-transformers
+- sparse-encoder
+- sparse
+- asymmetric
+- inference-free
+- splade
+pipeline_tag: feature-extraction
+library_name: sentence-transformers
 datasets:
 - miracl/miracl
 ---
@@ -48,6 +56,47 @@ This is a learned sparse retrieval model. It encodes the documents to 105879 dim
 OpenSearch neural sparse feature supports learned sparse retrieval with lucene inverted index. Link: https://opensearch.org/docs/latest/query-dsl/specialized/neural-sparse/. The indexing and search can be performed with OpenSearch high-level API.
+## Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers.sparse_encoder import SparseEncoder
+# Download from the 🤗 Hub
+model = SparseEncoder("opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1")
+query = "What's the weather in ny now?"
+document = "Currently New York is rainy."
+query_embed = model.encode_query(query)
+document_embed = model.encode_document(document)
+sim = model.similarity(query_embed, document_embed)
+print(f"Similarity: {sim}")
+# Similarity: tensor([[7.7400]])
+decoded_query = model.decode(query_embed)
+decoded_document = model.decode(document_embed)
+for i in range(len(decoded_query)):
+    query_token, query_score = decoded_query[i]
+    doc_score = next((score for token, score in decoded_document if token == query_token), 0)
+    if doc_score != 0:
+        print(f"Token: {query_token}, Query score: {query_score:.4f}, Document score: {doc_score:.4f}")
+# Token: weather, Query score: 3.0699, Document score: 1.2821
+# Token: now, Query score: 1.6406, Document score: 0.9018
+# Token: ?, Query score: 1.6108, Document score: 0.3141
+# Token: ny, Query score: 1.2721, Document score: 1.3446
+# Token: in, Query score: 0.6005, Document score: 0.1804
+```
 ## Usage (HuggingFace)
 This model is supposed to run inside OpenSearch cluster. But you can also use it outside the cluster, with HuggingFace models API.

config_sentence_transformers.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "model_type": "SparseEncoder",
+  "__version__": {
+    "sentence_transformers": "5.0.0",
+    "transformers": "4.50.3",
+    "pytorch": "2.6.0+cu124"
+  },
+  "prompts": {
+    "query": "",
+    "document": ""
+  },
+  "default_prompt_name": null,
+  "similarity_fn_name": "dot"
+}
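
The `"similarity_fn_name": "dot"` setting means a query and a document are scored by the dot product of their sparse vectors, so only tokens present in both contribute. A minimal sketch of that computation follows, using the token weights printed in the README example above. The `sparse_dot` helper and the dictionary representation are illustrative assumptions, not part of the library, and only the document tokens that overlap with the query are listed.

```python
# Sparse vectors as {token: weight}; values copied from the README example output.
query_weights = {"weather": 3.0699, "now": 1.6406, "?": 1.6108, "ny": 1.2721, "in": 0.6005}
# Only the document tokens that overlap with the query are listed here.
document_weights = {"weather": 1.2821, "now": 0.9018, "?": 0.3141, "ny": 1.3446, "in": 0.1804}


def sparse_dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors: only shared tokens contribute."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


score = sparse_dot(query_weights, document_weights)
print(f"{score:.2f}")  # 7.74
```

The result agrees, up to rounding of the printed weights, with the `tensor([[7.7400]])` similarity shown in the README example.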

document_1_SpladePooling/config.json ADDED

	@@ -0,0 +1,5 @@

+{
+    "pooling_strategy": "max",
+    "activation_function": "relu",
+    "word_embedding_dimension": null
+}
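
For readers unfamiliar with the two settings above, here is a rough pure-Python sketch of what `"pooling_strategy": "max"` combined with `"activation_function": "relu"` does in the usual SPLADE formulation: each token's vocabulary logits pass through `log(1 + relu(x))`, and the per-vocabulary maximum over tokens becomes the document vector. The `splade_pool` function and the toy logits are assumptions for illustration; the actual module operates on batched tensors.

```python
import math


def splade_pool(token_logits, attention_mask=None):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: list of rows, one row of vocabulary logits per input token.
    attention_mask: optional list of 0/1 flags; rows flagged 0 (padding) are skipped.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_logits)
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(token_logits, attention_mask):
        if not keep:
            continue
        for j, logit in enumerate(row):
            # "relu" activation followed by log-saturation: log(1 + max(0, x)).
            activated = math.log1p(max(0.0, logit))
            # "max" pooling: keep the largest activation seen for each vocabulary entry.
            pooled[j] = max(pooled[j], activated)
    return pooled


# Two tokens, toy vocabulary of size three.
logits = [[-1.0, 0.0, 2.0], [1.0, -3.0, 0.5]]
print(splade_pool(logits))
```

Vocabulary entries whose logits are never positive stay at exactly zero, which is what keeps the document vector sparse.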

modules.json ADDED

	@@ -0,0 +1,8 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Router"
+  }
+]

query_0_IDF/config.json ADDED

	@@ -0,0 +1,3 @@

+{
+    "frozen": true
+}
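
The query side of this model is tagged `inference-free`: the query route uses only this IDF module, so a query is presumably encoded by looking up a fixed weight for each of its tokens rather than by running the transformer, and `"frozen": true` presumably keeps those weights from being updated during training. A minimal sketch of such a lookup follows. The weights and the `encode_query` helper are hypothetical; the real weights are stored in `query_0_IDF/model.safetensors`.

```python
# Hypothetical token weights; the real values live in query_0_IDF/model.safetensors.
idf_weights = {"weather": 3.1, "ny": 1.3, "now": 1.6}


def encode_query(tokens, weights):
    """Map each distinct query token to its fixed weight; unknown tokens get no entry."""
    return {token: weights[token] for token in set(tokens) if token in weights}


# Repeated tokens count once, and tokens without a weight are dropped.
print(encode_query(["weather", "in", "ny", "ny"], idf_weights))
```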

query_0_IDF/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:336ae862c6ee224095e2155085bbe79f076af77caa4b403a11e8cd0ec9f0ceb5
+size 423596

query_0_IDF/special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

query_0_IDF/tokenizer.json ADDED

The diff for this file is too large to render.

query_0_IDF/tokenizer_config.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 200,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}

query_0_IDF/vocab.txt ADDED

The diff for this file is too large to render.

router_config.json ADDED

	@@ -0,0 +1,20 @@

+{
+    "types": {
+        "query_0_IDF": "sentence_transformers.sparse_encoder.models.IDF.IDF",
+        "": "sentence_transformers.sparse_encoder.models.MLMTransformer.MLMTransformer",
+        "document_1_SpladePooling": "sentence_transformers.sparse_encoder.models.SpladePooling.SpladePooling"
+    },
+    "structure": {
+        "query": [
+            "query_0_IDF"
+        ],
+        "document": [
+            "",
+            "document_1_SpladePooling"
+        ]
+    },
+    "parameters": {
+        "default_route": "document",
+        "allow_empty_key": true
+    }
+}
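
To make the asymmetric setup in this config concrete, the sketch below reads a trimmed copy of the `structure` and `parameters` sections and returns the module paths applied for a given task. It illustrates how the config can be read and is not the Router implementation; `modules_for` is a hypothetical helper. The empty-string key refers to the repository root, where the `MLMTransformer` weights live, which is presumably why `allow_empty_key` is enabled.

```python
import json

# Trimmed copy of router_config.json from this PR.
ROUTER_CONFIG = json.loads("""
{
    "structure": {
        "query": ["query_0_IDF"],
        "document": ["", "document_1_SpladePooling"]
    },
    "parameters": {"default_route": "document", "allow_empty_key": true}
}
""")


def modules_for(task=None):
    """Return the module paths applied for a task, falling back to default_route."""
    structure = ROUTER_CONFIG["structure"]
    default = ROUTER_CONFIG["parameters"]["default_route"]
    return structure.get(task, structure[default])


print(modules_for("query"))     # ['query_0_IDF']
print(modules_for("document"))  # ['', 'document_1_SpladePooling']
print(modules_for())            # ['', 'document_1_SpladePooling']
```

Queries therefore skip the transformer entirely, while documents (and any input without an explicit task) go through the MLM transformer followed by SPLADE pooling.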