---
language:
- ja
base_model:
- intfloat/multilingual-e5-base
license: 
- llama3.1
- gemma
---


# Japanese Medical Document Retrieval Model (jmed-me5-v0.1)

This model is built on top of the [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) checkpoint 
and has been fine-tuned to specialize in Japanese medical document retrieval. 
It leverages crawled Japanese medical web documents, LLM-based query generation, and distillation from a strong re-ranker to achieve domain specialization.

---

## Usage

See the Usage section of [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base).
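
For convenience, here is a minimal sketch adapted from the multilingual-e5-base usage example. The repository id below is a placeholder (replace it with this model's actual Hugging Face id); as with the base model, texts must carry `query: ` / `passage: ` prefixes.

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mask out padding tokens, then mean-pool over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# As with multilingual-e5-base, prefix queries with "query: " and passages with "passage: ".
input_texts = [
    "query: 糖尿病の初期症状は何ですか?",
    "passage: 糖尿病の初期には、喉の渇きや頻尿、倦怠感などの症状が現れることがあります。",
]

model_id = "jmed-me5-v0.1"  # placeholder: replace with this model's repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

batch = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity (scaled by 100) between the query and the passage.
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```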

## Model Overview

This model is designed for Japanese medical document search. It was fine-tuned using 750,000 Japanese medical web documents. 

The overall algorithm follows the approach presented in the following paper (note: the authors of this model are different from the paper's authors):
  
* Tamber et al. "Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation." arXiv preprint arXiv:2502.19712 (2025).
  * GitHub: [manveertamber/enhancing_domain_adaptation](https://github.com/manveertamber/enhancing_domain_adaptation)
 

The pipeline includes:

- **LLM-Based Query Generation:**  
  A large language model is used to generate queries from a set of 50,000 source documents.  
  - Similar documents in the source set are removed to ensure diversity.  
  - Query generation is performed using [tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1) with three examples provided for few-shot learning.  
  - Generated queries are further filtered by using the LLM to check for the inclusion of relevant medical or health-related knowledge; queries failing this check are removed.

- **Candidate Query Validation & Re-ranking:**  
  - The generated queries are used to search the Japanese medical documents with [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base).
    Only queries for which the original source document appears within the top 100 results are retained.
  - A re-ranking step is performed using the [cl-nagoya/ruri-reranker-large](https://huggingface.co/cl-nagoya/ruri-reranker-large) model.
  - Only queries for which the original document is ranked first are kept.  
  - The top-ranked document (i.e., the original source document) is treated as the positive example.
  - For candidates ranked 1 through 100, min-max scaling is applied to the re-ranker scores. Documents scoring above a threshold (defined as the top-1 score × α) are removed, since they may themselves be relevant.
  - The top 20 of the remaining documents are then used as negative examples (see the negative-mining sketch after this list).

- **Training Loss:**  
  The model is trained using a combination of:
  - **InfoNCE Loss (DPR-style):** encourages query embeddings to be similar to embeddings of positive documents and dissimilar to embeddings of negative documents.
  - **KL Divergence Loss:** minimizes the divergence between the re-ranker's scores and the model's predicted scores (see the loss sketch below).
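
A minimal sketch of the negative-mining step above, assuming the re-ranker's candidates are sorted by score in descending order with the original source document at index 0. The function name, the α default, and the edge-case handling are illustrative and not taken from the released training code:

```python
def mine_negatives(candidates, scores, alpha=0.95, num_negatives=20):
    # `candidates` and `scores` come from the re-ranker, sorted by score in
    # descending order; index 0 is the original source document (the positive).
    lo, hi = min(scores), max(scores)
    scaled = [(s - lo) / (hi - lo) for s in scores]  # min-max scale to [0, 1]
    threshold = scaled[0] * alpha  # top-1 scaled score is 1.0, so this equals alpha
    # Drop candidates above the threshold: they may themselves be relevant.
    pool = [doc for doc, s in zip(candidates[1:], scaled[1:]) if s <= threshold]
    return pool[:num_negatives]  # keep the top 20 remaining as hard negatives
```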
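
The two objectives could be combined roughly as follows. This is a sketch, not the released training code: the temperature, the KL weight, and the plain softmax over teacher scores are assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(q_emb, doc_embs, teacher_scores, temperature=0.05, kl_weight=1.0):
    # q_emb: (d,) L2-normalized query embedding
    # doc_embs: (k+1, d) L2-normalized documents; row 0 is the positive
    # teacher_scores: (k+1,) re-ranker scores for the same documents
    sims = doc_embs @ q_emb / temperature  # scaled cosine similarities, shape (k+1,)
    # InfoNCE / DPR-style contrastive loss with the positive at index 0.
    infonce = F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))
    # KL divergence between the student's and the re-ranker's listwise distributions.
    kl = F.kl_div(F.log_softmax(sims, dim=-1),
                  F.softmax(teacher_scores, dim=-1),
                  reduction="batchmean")
    return infonce + kl_weight * kl
```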

## Dependencies

- Base model:
  - [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- Query generation:
  - [tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1)
    - Built with Meta Llama 3
    - Built with Gemma
    - [META LLAMA 3.1 COMMUNITY LICENSE](https://www.llama.com/llama3_1/license/) and [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
- Reranking:
  - [cl-nagoya/ruri-reranker-large](https://huggingface.co/cl-nagoya/ruri-reranker-large)

## Benchmark Results

**Japanese TREC-COVID (Japanese translation of TREC-COVID)**

| Model | nDCG@10 | Recall@100 |
| -------- | -------- | -------- |
| BM25 | 0.5721 | 0.1115 |
| ruri-base | 0.4435 | 0.0793 |
| ruri-base-v2 | 0.6548 | 0.1163 |
| ruri-large-v2 | 0.6648 | 0.1215 |
| mE5-base | 0.6760 | 0.1258 |
| jmed-me5-v0.1 (mE5-base + domain adaptation) | 0.7236 | 0.1292 |
| aken12/splade-japanese-v3 | 0.6193 | 0.1141 |
| hotchpotch/japanese-splade-v2 | 0.7021 | 0.1274 |

**Japanese NF-Corpus (Japanese translation of NF-Corpus)**

| Model | nDCG@10 | Recall@100 |
| -------- | -------- | -------- |
| BM25 | 0.3258 | 0.2443 |
| ruri-base | 0.2713 | 0.2544 |
| ruri-base-v2 | 0.2939 | 0.2651 |
| ruri-large-v2 | 0.3109 | 0.2797 |
| jmed-me5-v0.1 | 0.2865 | 0.2680 |
| aken12/splade-japanese-v3 | 0.3196 | 0.2775 |
| hotchpotch/japanese-splade-v2 | 0.3365 | 0.2860 |


## Contributors

- [Kenya Abe (aken12)](https://huggingface.co/aken12) (Main contributor)
- [Makoto P. Kato (mpkato)](https://huggingface.co/mpkato) (Dataset translation)