---
language:
- ja
base_model:
- intfloat/multilingual-e5-base
---

# Japanese Medical Document Retrieval Model (jmed-me5-v0.1)

This model is built on top of the [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) checkpoint and has been fine-tuned to specialize in Japanese medical document retrieval. It achieves domain specialization by combining crawled Japanese medical web documents with LLM-based query generation and distillation of a strong re-ranker.
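
The snippet below is a minimal retrieval sketch, not an official usage example. It assumes the model keeps the E5 conventions of its base model (the `"query: "` / `"passage: "` prefixes), and the repository id is a placeholder to be replaced with this model's actual path on the Hub.

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id -- replace with the actual Hub path of jmed-me5-v0.1.
model = SentenceTransformer("jmed-me5-v0.1")

# E5-family models expect "query: " / "passage: " prefixes.
queries = ["query: 高血圧の治療法は?"]  # "What are the treatments for hypertension?"
passages = [
    "passage: 高血圧の治療には生活習慣の改善と降圧薬が用いられる。",
    "passage: インフルエンザはウイルスによる感染症である。",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity.
print(q_emb @ p_emb.T)
```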

---

## Model Overview

This model is designed for Japanese medical document search. It was fine-tuned on 750,000 Japanese medical web documents.

The overall algorithm is based on the work presented in the paper:

> "Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation"
> GitHub: [manveertamber/enhancing_domain_adaptation](https://github.com/manveertamber/enhancing_domain_adaptation)

(NOTE: The authors of this model are different from those of the paper.)

The pipeline includes:

- **LLM-Based Query Generation** (see the prompt sketch after this list):
  A large language model is used to generate queries from a set of 50,000 source documents.
  - Similar documents in the source set are removed to ensure diversity.
  - Query generation is performed with [tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1), using three few-shot examples.
  - Generated queries are further filtered by asking the LLM whether each query involves relevant medical or health-related knowledge; queries failing this check are removed.

- **Candidate Query Validation & Re-ranking** (see the hard-negative mining sketch after this list):
  - The generated queries are used to search the Japanese medical documents with [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base). Only queries for which the original source document appears within the top 100 results are retained.
  - A re-ranking step is performed with the [cl-nagoya/ruri-reranker-large](https://huggingface.co/cl-nagoya/ruri-reranker-large) model, and only queries whose original document is ranked first are kept.
  - The top result is treated as the positive example.
  - The re-ranker scores of the candidates ranked between 1 and 100 are min-max scaled, and documents scoring above a threshold (defined as the top-1 score * α) are removed, as they may themselves be relevant.
  - The top 20 of the remaining documents are then used as negative examples.

- **Training Loss** (see the loss sketch after this list):
  The model is trained with a combination of:
  - **InfoNCE Loss (DPR-style):** encourages query embeddings to be similar to those of their positive documents and dissimilar to those of their negative documents.
  - **KL Divergence Loss:** minimizes the divergence between the re-ranker's score distribution and the model's predicted score distribution (listwise distillation).
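
As an illustration of the query-generation step, here is a hedged sketch. The prompt wording and the `generate_query` helper are hypothetical (the actual prompt used for this model is not published); only the model id and the three-shot setup come from the description above.

```python
from transformers import pipeline

# Hypothetical few-shot prompt; the real prompt is not published.
# It asks (in Japanese) for one search query answerable by the document.
FEW_SHOT_PROMPT = """以下の文書から、その文書で答えられる検索クエリを1つ生成してください。

文書: {doc1}
クエリ: {query1}

文書: {doc2}
クエリ: {query2}

文書: {doc3}
クエリ: {query3}

文書: {document}
クエリ:"""

generator = pipeline(
    "text-generation",
    model="tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1",
)

def generate_query(document: str, examples: list[tuple[str, str]]) -> str:
    """Generate one search query for `document` from three (doc, query) examples."""
    prompt = FEW_SHOT_PROMPT.format(
        doc1=examples[0][0], query1=examples[0][1],
        doc2=examples[1][0], query2=examples[1][1],
        doc3=examples[2][0], query3=examples[2][1],
        document=document,
    )
    out = generator(prompt, max_new_tokens=64, return_full_text=False)
    return out[0]["generated_text"].strip()
```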
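
The hard-negative mining rules can be sketched as follows. This is a minimal reimplementation of the stated procedure (min-max scaling, a top-1 × α cutoff, top-20 negatives), not the authors' code; the value of α is not given, so the default below is illustrative.

```python
def mine_negatives(scores: list[float], alpha: float = 0.95, k: int = 20) -> list[int]:
    """Pick hard-negative indices from re-ranked candidates.

    `scores` holds the re-ranker scores of the top-100 candidates in rank
    order, so index 0 is the original (positive) document.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: all candidates scored identically
        return []
    scaled = [(s - lo) / (hi - lo) for s in scores]  # min-max scaling to [0, 1]

    threshold = scaled[0] * alpha  # top-1 score * alpha
    remaining = [
        i for i, s in enumerate(scaled[1:], start=1)
        if s <= threshold  # drop candidates that may themselves be relevant
    ]
    return remaining[:k]  # the top 20 remaining candidates become negatives
```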
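
Finally, a sketch of the combined objective, assuming the standard formulations of the two losses; the temperature and the distillation weight are illustrative hyperparameters, not values reported for this model.

```python
import torch
import torch.nn.functional as F

def combined_loss(
    q: torch.Tensor,        # query embeddings, shape (B, d), L2-normalized
    docs: torch.Tensor,     # candidate doc embeddings, (B, 1 + N, d); index 0 is the positive
    teacher: torch.Tensor,  # re-ranker scores for the same candidates, shape (B, 1 + N)
    tau: float = 0.05,      # temperature (illustrative)
    lam: float = 1.0,       # weight of the distillation term (illustrative)
) -> torch.Tensor:
    # Query-candidate similarity scores.
    sims = torch.einsum("bd,bnd->bn", q, docs) / tau

    # InfoNCE: the positive (index 0) should outscore the mined negatives.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    infonce = F.cross_entropy(sims, labels)

    # KL divergence: match the model's score distribution to the re-ranker's.
    kl = F.kl_div(
        F.log_softmax(sims, dim=-1),
        F.softmax(teacher, dim=-1),
        reduction="batchmean",
    )
    return infonce + lam * kl
```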

## Benchmark Results

**Japanese NF-Corpus (Japanese translation of NF-Corpus)**

| Model | nDCG@10 | Recall@100 |
| -------- | -------- | -------- |
| BM25 | 0.5721 | 0.1115 |
| ruri-base | 0.4435 | 0.0793 |
| ruri-base-v2 | 0.6548 | 0.1163 |
| ruri-large-v2 | 0.6648 | 0.1215 |
| mE5-base | 0.6760 | 0.1258 |
| jmed-me5-v0.1 (mE5-base + domain adaptation) | 0.7236 | 0.1292 |
| aken12/splade-japanese-v3 | 0.6193 | 0.1141 |
| hotchpotch/japanese-splade-v2 | 0.7021 | 0.1274 |

**Japanese TREC-COVID (Japanese translation of TREC-COVID)**

| Model | nDCG@10 | Recall@100 |
| -------- | -------- | -------- |
| BM25 | 0.3258 | 0.2443 |
| ruri-base | 0.2713 | 0.2544 |
| ruri-base-v2 | 0.2939 | 0.2651 |
| ruri-large-v2 | 0.3109 | 0.2797 |
| jmed-me5-v0.1 | 0.2865 | 0.2680 |
| aken12/splade-japanese-v3 | 0.3196 | 0.2775 |
| hotchpotch/japanese-splade-v2 | 0.3365 | 0.2860 |

## Contributors

- [Kenya Abe (aken12)](https://huggingface.co/aken12) (Main contributor)
- [Makoto P. Kato (mpkato)](https://huggingface.co/mpkato) (Dataset translation)