---
tags:
- sentence-transformers
- sentence-similarity
- sparse-encoder
- sparse
- splade
- feature-extraction
- telepix
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: apache-2.0
---
<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/61d6f4a4d49065ee28a1ee7e/V8n2En7BlMNHoi1YXVv8Q.png" width="400"/>
</p>

# PIXIE-Splade-Preview
**PIXIE-Splade-Preview** is a Korean-only [SPLADE](https://arxiv.org/abs/2403.06789) (Sparse Lexical and Expansion) retriever developed by [TelePIX Co., Ltd](https://telepix.net/).
**PIXIE** stands for Tele**PIX** **I**ntelligent **E**mbedding, representing TelePIX's high-performance embedding technology.
The model is trained exclusively on Korean data and outputs sparse lexical vectors that are directly
compatible with inverted indexing (e.g., Lucene/Elasticsearch).
Because each non-zero weight corresponds to a Korean subword token,
interpretability is built in: you can inspect exactly which tokens drive retrieval.
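
For instance, you can encode a short query and list its highest-weighted dimensions. A minimal sketch, assuming the `decode` helper of `sentence_transformers.SparseEncoder` (v5+), which maps non-zero dimensions back to vocabulary tokens:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("telepix/PIXIE-Splade-Preview")

# Encode a query; the result is a sparse vector over the 50,000-token vocabulary.
emb = model.encode_query("위성 영상 분석")  # "satellite imagery analysis"

# decode() maps non-zero dimensions back to (token, weight) pairs,
# showing which terms the model expanded the query into.
for token, weight in model.decode(emb, top_k=10):
    print(f"{token}\t{weight:.4f}")
```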

## Why SPLADE for Search?
- **Inverted Index Ready**: Directly index weighted tokens in standard IR stacks (Lucene/Elasticsearch).
- **Interpretable by Design**: Top-k contributing tokens per query/document explain *why* a hit matched.
- **Production-Friendly**: Fast candidate generation at web scale; memory/latency tunable via sparsity thresholds.
- **Hybrid-Retrieval Friendly**: Combine with dense retrievers via score fusion, as sketched below.
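
As a concrete example of the last point, one common hybrid recipe (not specific to this model card) is to min-max normalize each retriever's scores and blend them with a weight; `alpha` here is an illustrative tuning knob, not a published setting:

```python
from typing import Dict

def fuse_scores(sparse: Dict[int, float], dense: Dict[int, float], alpha: float = 0.5) -> Dict[int, float]:
    """Weighted-sum fusion of min-max normalized sparse and dense scores, keyed by doc id."""
    def norm(scores: Dict[int, float]) -> Dict[int, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        return {d: (s - lo) / span for d, s in scores.items()}
    sp, de = norm(sparse), norm(dense)
    return {d: alpha * sp.get(d, 0.0) + (1 - alpha) * de.get(d, 0.0) for d in set(sp) | set(de)}

# Example: doc 0 is strong in both retrievers, so it wins after fusion.
print(fuse_scores({0: 12.3, 2: 8.1}, {0: 0.71, 1: 0.64}))
```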

## Model Description
- **Model Type:** SPLADE Sparse Encoder
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 50000 dimensions
- **Similarity Function:** Dot Product
- **Language:** Korean
- **License:** apache-2.0 

### Full Model Architecture

```
SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
)
```
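
Schematically, `SpladePooling` turns per-position MLM logits into one sparse vector per text. Below is a minimal sketch of the standard SPLADE formulation, `w_j = max_i log(1 + relu(logit_ij))`; the library's exact masking and activation details may differ:

```python
import torch

def splade_pool(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """
    logits: (batch, seq_len, vocab_size) MLM logits
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    returns: (batch, vocab_size) non-negative sparse term weights
    """
    # Saturated activation from the SPLADE paper: log(1 + relu(x)).
    weights = torch.log1p(torch.relu(logits))
    # Zero out padding positions, then max-pool over the sequence dimension.
    weights = weights * attention_mask.unsqueeze(-1)
    return weights.amax(dim=1)
```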

## Quality Benchmarks
**PIXIE-Splade-Preview** delivers consistently strong performance across a diverse set of domain-specific and open-domain benchmarks in Korean, demonstrating its effectiveness in real-world search applications. 
The table below presents the retrieval performance of several embedding models evaluated on a variety of Korean MTEB benchmarks. 
We report **Normalized Discounted Cumulative Gain (NDCG)** scores, which measure how well a ranked list of documents aligns with ground truth relevance. Higher values indicate better retrieval quality.
- **Avg. NDCG**: Average of NDCG@1, @3, @5, and @10 across all benchmark datasets.  
- **NDCG@k**: Relevance quality of the top-*k* retrieved results.
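
For reference, a minimal NDCG@k implementation with linear gain (the evaluation codebase linked below may use a different gain/discount variant):

```python
import math
from typing import List

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """NDCG@k for one query; `relevances` are graded labels in ranked order."""
    def dcg(rels: List[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranked list that misplaces relevant documents scores below 1.0:
print(ndcg_at_k([3, 0, 2, 1], k=3))  # ~0.84; the ideal order [3, 2, 1, 0] scores 1.0
```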

All evaluations were conducted using the open-source **[Korean-MTEB-Retrieval-Evaluators](https://github.com/BM-K/Korean-MTEB-Retrieval-Evaluators)** codebase to ensure consistent dataset handling, indexing, retrieval, and NDCG@k computation across models.

### 6 Datasets of MTEB (Korean)
Our model, **telepix/PIXIE-Splade-Preview**, achieves strong performance across most metrics and benchmarks,
demonstrating solid generalization across domains such as multi-hop QA, long-document retrieval, public health, and e-commerce.

Descriptions of the benchmark datasets used for evaluation are as follows:
- **Ko-StrategyQA**  
  A Korean multi-hop open-domain question answering dataset designed for complex reasoning over multiple documents.
- **AutoRAGRetrieval**  
  A domain-diverse retrieval dataset covering finance, government, healthcare, legal, and e-commerce sectors.
- **MIRACLRetrieval**  
  A document retrieval benchmark built on Korean Wikipedia articles.
- **PublicHealthQA**  
  A retrieval dataset focused on medical and public health topics.
- **BelebeleRetrieval**  
  A dataset for retrieving relevant content from web and news articles in Korean.
- **MultiLongDocRetrieval**  
  A long-document retrieval benchmark based on Korean Wikipedia and mC4 corpus.

> **Tip:**
> While many benchmark datasets are available, for this project we used only those that contain clean positive documents for each query. Keep in mind that a benchmark dataset is just that: a benchmark. For real-world applications, it is best to construct an evaluation dataset tailored to your specific domain and evaluate embedding models such as PIXIE in that environment to determine the most suitable one.

#### Sparse Embedding
| Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
|------|:---:|:---:|:---:|:---:|:---:|:---:|
| telepix/PIXIE-Splade-Preview | 0.1B | 0.7253 | 0.6799 | 0.7217 | 0.7416 | 0.7579 |
|  |  |  |  |  |  |  |
| [BM25](https://github.com/xhluca/bm25s) | N/A | 0.4714 | 0.4194 | 0.4708 | 0.4886 | 0.5071 |
| naver/splade-v3 | 0.1B | 0.0582 | 0.0462 | 0.0566 | 0.0612 | 0.0685 |

#### Dense Embedding
| Model Name | # params | Avg. NDCG | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 |
|------|:---:|:---:|:---:|:---:|:---:|:---:|
| telepix/PIXIE-Spell-Preview-1.7B | 1.7B | 0.7567 | 0.7149 | 0.7541 | 0.7696 | 0.7882 |
| telepix/PIXIE-Spell-Preview-0.6B | 0.6B | 0.7280 | 0.6804 | 0.7258 | 0.7448 | 0.7612 |
| telepix/PIXIE-Rune-Preview | 0.5B | 0.7383 | 0.6936 | 0.7356 | 0.7545 | 0.7698 |
|  |  |  |  |  |  |  |
| nlpai-lab/KURE-v1 | 0.5B | 0.7312 | 0.6826 | 0.7303 | 0.7478 | 0.7642 |
| BAAI/bge-m3 | 0.5B | 0.7126 | 0.6613 | 0.7107 | 0.7301 | 0.7483 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.5B | 0.7050 | 0.6570 | 0.7015 | 0.7226 | 0.7390 |
| Qwen/Qwen3-Embedding-0.6B | 0.6B | 0.6872 | 0.6423 | 0.6833 | 0.7017 | 0.7215 |
| jinaai/jina-embeddings-v3 | 0.5B | 0.6731 | 0.6224 | 0.6715 | 0.6899 | 0.7088 |
| SamilPwC-AXNode-GenAI/PwC-Embedding_expr | 0.5B | 0.6709 | 0.6221 | 0.6694 | 0.6852 | 0.7069 | 
| Alibaba-NLP/gte-multilingual-base | 0.3B | 0.6679 | 0.6068 | 0.6673 | 0.6892 | 0.7084 |
| openai/text-embedding-3-large | N/A | 0.6465 | 0.5895 | 0.6467 | 0.6646 | 0.6853 |

## Direct Use (Inverted-Index Retrieval)

First install the Sentence Transformers library (the `SparseEncoder` API requires v5.0 or newer):

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
import torch
import numpy as np
from collections import defaultdict
from typing import Dict, List, Tuple
from transformers import AutoTokenizer
from sentence_transformers import SparseEncoder

MODEL_NAME = "telepix/PIXIE-Splade-Preview"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def _to_dense_numpy(x) -> np.ndarray:
    """
    Safely converts a tensor returned by SparseEncoder to a dense numpy array.
    """
    if hasattr(x, "to_dense"):
        return x.to_dense().float().cpu().numpy()
    # If it's already a numpy array or a dense tensor
    if isinstance(x, torch.Tensor):
        return x.float().cpu().numpy()
    return np.asarray(x)

def _filter_special_ids(ids: List[int], tokenizer) -> List[int]:
    """
    Filters out special token IDs from a list of token IDs.
    """
    special = set(getattr(tokenizer, "all_special_ids", []) or [])
    return [i for i in ids if i not in special]

def build_inverted_index(
    model: SparseEncoder,
    tokenizer,
    documents: List[str],
    batch_size: int = 8,
    min_weight: float = 0.0,
) -> Dict[int, List[Tuple[int, float]]]:
    """
    Generates document embeddings and constructs an inverted index.
    The index maps token_id to a list of (doc_idx, weight) tuples.
    index[token_id] = [(doc_idx, weight), ...]
    """
    with torch.no_grad():
        doc_emb = model.encode_document(documents, batch_size=batch_size)
    doc_dense = _to_dense_numpy(doc_emb)

    index: Dict[int, List[Tuple[int, float]]] = defaultdict(list)

    for doc_idx, vec in enumerate(doc_dense):
        # Extract only active tokens (those with weight above the threshold)
        nz = np.flatnonzero(vec > min_weight)
        # Optionally, remove special tokens
        nz = _filter_special_ids(nz.tolist(), tokenizer)

        for token_id in nz:
            index[token_id].append((doc_idx, float(vec[token_id])))

    return index

# -------------------------
# Search + Token Overlap Explanation
# -------------------------
def splade_token_overlap_inverted(
    model: SparseEncoder,
    tokenizer,
    inverted_index: Dict[int, List[Tuple[int, float]]],
    documents: List[str],
    queries: List[str],
    top_k_docs: int = 3,
    top_k_tokens: int = 10,
    min_weight: float = 0.0,
):
    """
    Calculates SPLADE similarity using an inverted index and, for each top-ranked
    document, shows the contribution (qw * dw) of the top_k_tokens overlapping tokens.
    """
    for qi, qtext in enumerate(queries):
        with torch.no_grad():
            q_vec = model.encode_query(qtext)
        q_vec = _to_dense_numpy(q_vec).ravel()

        # Active query tokens
        q_nz = np.flatnonzero(q_vec > min_weight).tolist()
        q_nz = _filter_special_ids(q_nz, tokenizer)

        scores: Dict[int, float] = defaultdict(float)
        # Token contribution per document: token_id -> (qw, dw, qw*dw)
        per_doc_contrib: Dict[int, Dict[int, Tuple[float, float, float]]] = defaultdict(dict)

        for tid in q_nz:
            qw = float(q_vec[tid])
            postings = inverted_index.get(tid, [])
            for doc_idx, dw in postings:
                prod = qw * dw
                scores[doc_idx] += prod
                # Store per-token contribution (can be summed if needed)
                per_doc_contrib[doc_idx][tid] = (qw, dw, prod)

        ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k_docs]

        print("\n============================")
        print(f"[Query {qi}] {qtext}")
        print("============================")

        if not ranked:
            print("โ†’ ์ผ์น˜ ํ† ํฐ์ด ์—†์–ด ๋ฌธ์„œ ์Šค์ฝ”์–ด๊ฐ€ ์ƒ์„ฑ๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.")
            continue

        for rank, (doc_idx, score) in enumerate(ranked, start=1):
            doc = documents[doc_idx]
            print(f"\nโ†’ Rank {rank} | Document {doc_idx}: {doc}")
            print(f"  [Similarity Score ({score:.6f})]")

            contrib = per_doc_contrib[doc_idx]
            if not contrib:
                print("(๊ฒน์น˜๋Š” ํ† ํฐ์ด ์—†์Šต๋‹ˆ๋‹ค.)")
                continue

            # Extract top K contributing tokens
            top = sorted(contrib.items(), key=lambda kv: kv[1][2], reverse=True)[:top_k_tokens]
            token_ids = [tid for tid, _ in top]
            tokens = tokenizer.convert_ids_to_tokens(token_ids)

            print("  [Top Contributing Tokens]")
            for (tid, (qw, dw, prod)), tok in zip(top, tokens):
                print(f"    {tok:20} {prod:.6f}")

if __name__ == "__main__":
    # 1) Load model and tokenizer
    model = SparseEncoder(MODEL_NAME).to(DEVICE)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # 2) Example data
    queries = [
        "텔레픽스는 어떤 산업 분야에서 위성 데이터를 활용하나요?",  # In which industries does TelePIX use satellite data?
        "국방 분야에 어떤 위성 서비스가 제공되나요?",  # What satellite services are offered for the defense sector?
        "텔레픽스의 기술 수준은 어느 정도인가요?",  # How advanced is TelePIX's technology?
    ]
    documents = [
        "텔레픽스는 해양, 자원, 농업 등 다양한 분야에서 위성 데이터를 분석하여 서비스를 제공합니다.",  # TelePIX analyzes satellite data across maritime, resources, agriculture, and more.
        "정찰 및 감시 목적의 위성 영상을 통해 국방 관련 정밀 분석 서비스를 제공합니다.",  # Precision defense analysis from reconnaissance and surveillance imagery.
        "TelePIX의 광학 탑재체 및 AI 분석 기술은 Global standard를 상회하는 수준으로 평가받고 있습니다.",  # TelePIX's optical payloads and AI analytics are rated above the global standard.
        "텔레픽스는 우주에서 수집한 정보를 분석하여 '우주 경제(Space Economy)'라는 새로운 가치를 창출하고 있습니다.",  # TelePIX creates new value, the 'Space Economy', from data collected in space.
        "텔레픽스는 위성 영상 획득부터 분석, 서비스 제공까지 전 주기를 아우르는 솔루션을 제공합니다.",  # End-to-end solutions from imagery acquisition to analysis and service delivery.
    ]

    # 3) Build document index (inverted index)
    inverted_index = build_inverted_index(
        model=model,
        tokenizer=tokenizer,
        documents=documents,
        batch_size=8,
        min_weight=0.0,  # Adjust to 1e-6 ~ 1e-4 to filter out very small noise
    )

    # 4) Search and explain token overlap
    splade_token_overlap_inverted(
        model=model,
        tokenizer=tokenizer,
        inverted_index=inverted_index,
        documents=documents,
        queries=queries,
        top_k_docs=2,     # Print only the top 2 documents
        top_k_tokens=5,   # Top 5 contributing tokens for each document
        min_weight=0.0,
    )
```
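
Because every non-zero dimension maps to a vocabulary token, the same document vectors can be exported as `{token: weight}` maps for a production inverted index. A hedged sketch continuing the script above (Elasticsearch's `rank_features` field is mentioned as one illustrative target, not an official integration):

```python
def to_token_weight_map(vec: np.ndarray, tokenizer, min_weight: float = 1e-4) -> Dict[str, float]:
    """Convert one sparse document vector into a {token: weight} map for indexing."""
    out: Dict[str, float] = {}
    for tid in np.flatnonzero(vec > min_weight):
        out[tokenizer.convert_ids_to_tokens(int(tid))] = float(vec[tid])
    return out

# Store each map in a weighted-term field (e.g., Elasticsearch `rank_features`)
# and score queries token-by-token at search time.
doc_dense = _to_dense_numpy(model.encode_document(documents))
doc_maps = [to_token_weight_map(v, tokenizer) for v in doc_dense]
```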

## License
The PIXIE-Splade-Preview model is licensed under the Apache License 2.0.

## Citation
```
@software{TelePIX-PIXIE-Splade-Preview,
  title={PIXIE-Splade-Preview},
  author={TelePIX AI Research Team and Bongmin Kim},
  year={2025},
  url={https://huggingface.co/telepix/PIXIE-Splade-Preview}
}
```

## Contact

If you have any suggestions or questions about PIXIE, please reach out to the authors at [email protected].