---
language:
- ja
base_model:
- intfloat/multilingual-e5-base
---

# Japanese Medical Document Retrieval Model (jmed-me5-v0.1)

This model is built on top of the [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) checkpoint
and has been fine-tuned to specialize in Japanese medical document retrieval.
It leverages crawled Japanese medical web documents, LLM-based query generation, and distillation from a strong re-ranker to achieve domain specialization.

---

## Model Overview

This model is designed for Japanese medical document search. It was fine-tuned using 750,000 Japanese medical web documents.
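
As a quick usage example, the sketch below shows query/passage encoding for retrieval. It assumes the `query: ` / `passage: ` input prefixes of the mE5 base model carry over to this fine-tuned checkpoint, and the repo id is a placeholder; neither is confirmed by this card.

```python
# Minimal retrieval sketch. ASSUMPTIONS: the "query: "/"passage: " prefixes
# follow the multilingual-e5 convention of the base model, and the repo id
# below is a placeholder for the actual model id on the Hub.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("<org>/jmed-me5-v0.1")  # placeholder repo id

queries = ["query: 高血圧の治療法"]  # "treatments for hypertension"
passages = [
    "passage: 高血圧の治療には生活習慣の改善と降圧薬の投与が行われる。",
    "passage: インフルエンザの予防には手洗いとワクチン接種が有効である。",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)  # cosine similarities; higher = more relevant
```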

The overall algorithm is based on the work presented in the paper:

"Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation"
GitHub: [manveertamber/enhancing_domain_adaptation](https://github.com/manveertamber/enhancing_domain_adaptation)
(NOTE: The authors of this model are not the authors of that paper.)

The pipeline includes:

- **LLM-Based Query Generation** (see the first sketch after this list):
  A large language model is used to generate queries from a set of 50,000 source documents.
  - Similar documents in the source set are removed to ensure diversity.
  - Query generation is performed with [tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1), with three examples provided for few-shot learning.
  - Generated queries are then filtered by asking the LLM whether each query involves relevant medical or health-related knowledge; queries failing this check are removed.

- **Candidate Query Validation & Re-ranking:**
  - The generated queries are used to search the Japanese medical documents with [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base). Only queries for which the original source document appears within the top 100 results are retained.
  - A re-ranking step is performed with the [cl-nagoya/ruri-reranker-large](https://huggingface.co/cl-nagoya/ruri-reranker-large) model.
  - Only queries for which the original document is ranked first are kept.
  - The top result is treated as a positive example.
  - For candidates ranked 1-100, min-max scaling is applied to the re-ranker scores. Documents scoring above a threshold (defined as the top-1 score × α) are removed, as they might already be relevant.
  - The top 20 of the remaining documents are then used as negative examples (see the negative-mining sketch after this list).

- **Training Loss:**
  The model is trained using a combination of two objectives (see the loss sketch after this list):
  - **InfoNCE Loss (DPR-style):** encourages embeddings of queries and positive documents to be similar, and embeddings of queries and negative documents to be dissimilar.
  - **KL Divergence Loss:** minimizes the difference between the re-ranker's scores and the model's predicted scores.
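
A minimal sketch of the query-generation step (first bullet above). The prompt wording, the few-shot pairs, and the decoding settings used for this model are not published, so everything below is an illustrative assumption.

```python
# Sketch of LLM-based query generation with few-shot examples.
# ASSUMPTIONS: the prompt wording, the (document, query) few-shot pairs,
# and the decoding settings are placeholders, not the published recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

PROMPT = "Document: {doc}\nWrite one Japanese search query a user might issue to find this document."

def generate_query(document, few_shot_pairs):
    """few_shot_pairs: three (document, query) examples, per the card."""
    messages = []
    for doc, query in few_shot_pairs:
        messages.append({"role": "user", "content": PROMPT.format(doc=doc)})
        messages.append({"role": "assistant", "content": query})
    messages.append({"role": "user", "content": PROMPT.format(doc=document)})
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=64, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens (the candidate query).
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
```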
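The negative-mining rule (second bullet above) can be written compactly. The value of α is not given in this card, so `alpha=0.95` is a placeholder.

```python
# Sketch of the negative-mining rule: min-max scale the re-ranker scores of
# the top-100 candidates, drop candidates whose scaled score exceeds
# alpha * (top-1 scaled score), and keep the top 20 of the rest as negatives.
# ASSUMPTION: alpha=0.95 is a placeholder; the actual value is not published.

def mine_negatives(reranker_scores, alpha=0.95, num_negatives=20):
    """reranker_scores: scores of the top-100 candidates, sorted descending;
    index 0 is the original (positive) document. Returns negative indices."""
    lo, hi = min(reranker_scores), max(reranker_scores)
    scaled = [(s - lo) / (hi - lo) for s in reranker_scores]  # min-max to [0, 1]
    threshold = alpha * scaled[0]  # scaled[0] == 1.0, so the cutoff is alpha
    kept = [i for i, s in enumerate(scaled) if i > 0 and s <= threshold]
    return kept[:num_negatives]  # kept indices are already in rank order
```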
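Finally, a minimal sketch of the combined objective (third bullet above), assuming one positive at index 0 plus the mined negatives per query; the temperature and loss weighting are placeholders, not the published recipe.

```python
# Sketch of the combined training objective: InfoNCE over one positive and
# its mined negatives, plus a KL term distilling the re-ranker's listwise
# score distribution. ASSUMPTIONS: tau=0.02 and kl_weight=1.0 are placeholders.
import torch
import torch.nn.functional as F

def combined_loss(query_emb, doc_embs, reranker_scores, tau=0.02, kl_weight=1.0):
    """query_emb: (d,); doc_embs: (n, d) with the positive at row 0;
    reranker_scores: (n,) teacher scores for the same n documents."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    sims = (d @ q) / tau  # (n,) temperature-scaled cosine similarities

    # InfoNCE (DPR-style): the positive document sits at index 0.
    target = torch.zeros(1, dtype=torch.long, device=sims.device)
    infonce = F.cross_entropy(sims.unsqueeze(0), target)

    # KL divergence between the student's and the re-ranker's distributions.
    student_logp = F.log_softmax(sims, dim=-1).unsqueeze(0)
    teacher_p = F.softmax(reranker_scores, dim=-1).unsqueeze(0)
    kl = F.kl_div(student_logp, teacher_p, reduction="batchmean")

    return infonce + kl_weight * kl
```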


## Benchmark Results

**Japanese NF-Corpus (Japanese translation of NF-Corpus)**

| Model | nDCG@10 | Recall@100 |
| -------- | -------- | -------- |
| BM25 | 0.5721 | 0.1115 |
| ruri-base | 0.4435 | 0.0793 |
| ruri-base-v2 | 0.6548 | 0.1163 |
| ruri-large-v2 | 0.6648 | 0.1215 |
| mE5-base | 0.6760 | 0.1258 |
| jmed-me5-v0.1 (mE5-base + domain adaptation) | 0.7236 | 0.1292 |
| aken12/splade-japanese-v3 | 0.6193 | 0.1141 |
| hotchpotch/japanese-splade-v2 | 0.7021 | 0.1274 |

**Japanese TREC-COVID (Japanese translation of TREC-COVID)**

| Model | nDCG@10 | Recall@100 |
| -------- | -------- | -------- |
| BM25 | 0.3258 | 0.2443 |
| ruri-base | 0.2713 | 0.2544 |
| ruri-base-v2 | 0.2939 | 0.2651 |
| ruri-large-v2 | 0.3109 | 0.2797 |
| jmed-me5-v0.1 | 0.2865 | 0.2680 |
| aken12/splade-japanese-v3 | 0.3196 | 0.2775 |
| hotchpotch/japanese-splade-v2 | 0.3365 | 0.2860 |


## Contributors

- [Kenya Abe (aken12)](https://huggingface.co/aken12) (Main contributor)
- [Makoto P. Kato (mpkato)](https://huggingface.co/mpkato) (Dataset translation)