# Japanese Medical Document Retrieval Model (jmed-me5-v0.1)
This model is built on top of the intfloat/multilingual-e5-base checkpoint and fine-tuned to specialize in Japanese medical document retrieval. It combines crawled Japanese medical web documents, LLM-based query generation, and distillation from a strong re-ranker to achieve domain specialization.
## Usage
See the Usage section of intfloat/multilingual-e5-base.
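A minimal usage sketch is shown below, mirroring the multilingual-e5-base example: texts are prefixed with `query: ` or `passage: `, embeddings are mean-pooled and L2-normalized. The Japanese query and passage are illustrative placeholders, not examples from the training data.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states, attention_mask):
    # Zero out padding tokens, then average the remaining token embeddings.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("kasys/jmed-me5-v0.1")
model = AutoModel.from_pretrained("kasys/jmed-me5-v0.1")

input_texts = [
    "query: 高血圧の治療法は何ですか",  # hypothetical query
    "passage: 高血圧の治療には生活習慣の改善と降圧薬の使用が含まれます。",  # hypothetical passage
]

batch = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(average_pool(outputs.last_hidden_state, batch["attention_mask"]), p=2, dim=1)

# Cosine similarity between the query and the passage.
print((embeddings[0] @ embeddings[1]).item())
```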
## Model Overview
This model is designed for Japanese medical document search. It was fine-tuned using 750,000 Japanese medical web documents.
The overall algorithm follows the approach presented in the following paper (note: the authors of this model are not the authors of the paper):
- Tamber et al. "Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation." arXiv preprint arXiv:2502.19712 (2025).
The pipeline includes:
**LLM-Based Query Generation:**
- A large language model generates queries from a set of 50,000 source documents; a sketch of this step follows the list below.
- Similar documents in the source set are removed beforehand to ensure diversity.
- Query generation is performed with tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1, using three few-shot examples.
- Generated queries are further filtered by asking the LLM whether they contain relevant medical or health-related knowledge; queries that fail this check are removed.
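A rough sketch of the few-shot generation step is given below. The system prompt, few-shot pairs, and generation settings are placeholders; the exact prompt used to build the training data is not part of this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Three (document, query) pairs used as few-shot examples -- placeholders, not the real ones.
few_shot = [
    ("糖尿病の食事療法について解説するページ……", "糖尿病 食事療法 具体例"),
    ("インフルエンザワクチンの副反応について……", "インフルエンザワクチン 副反応"),
    ("変形性膝関節症のリハビリ方法を紹介する記事……", "変形性膝関節症 リハビリ 方法"),
]

def build_messages(document: str) -> list[dict]:
    # Hypothetical instruction: generate one search query that would retrieve the document.
    messages = [{"role": "system", "content": "医療文書から、その文書を探すための検索クエリを1つ生成してください。"}]
    for doc, query in few_shot:
        messages.append({"role": "user", "content": doc})
        messages.append({"role": "assistant", "content": query})
    messages.append({"role": "user", "content": document})
    return messages

def generate_query(document: str) -> str:
    input_ids = tokenizer.apply_chat_template(
        build_messages(document), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
```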
**Candidate Query Validation & Re-ranking:**
- The generated queries are used to search the Japanese medical documents with intfloat/multilingual-e5-base; only queries whose original source document appears within the top 100 results are retained.
- A re-ranking step is then performed with the cl-nagoya/ruri-reranker-large model, and only queries whose original document is ranked first are kept.
- The top-ranked result is treated as the positive example.
- For candidates ranked 1-100, min-max scaling is applied to the re-ranker scores. Documents scoring above a threshold (defined as top-1 score * α) are removed, since they may themselves be relevant.
- The top 20 of the remaining documents are used as negative examples (see the schematic sketch after this list).
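The following is a schematic version of this validation and hard-negative mining stage. `retrieve_top100` and `rerank` are hypothetical helpers standing in for multilingual-e5-base retrieval and cl-nagoya/ruri-reranker-large scoring, and the value of `ALPHA` is an assumption; the card only states the threshold is top-1 score * α.

```python
ALPHA = 0.95          # assumed value
NUM_NEGATIVES = 20

def mine_training_example(query, source_doc_id, retrieve_top100, rerank):
    # 1) Keep the query only if the source document is retrieved in the top 100.
    candidates = retrieve_top100(query)                      # list of (doc_id, text)
    if source_doc_id not in {doc_id for doc_id, _ in candidates}:
        return None

    # 2) Re-rank the candidates; keep the query only if the source document is ranked first.
    scored = sorted(rerank(query, candidates), key=lambda x: x[1], reverse=True)  # (doc_id, score)
    if scored[0][0] != source_doc_id:
        return None
    positive_id = scored[0][0]

    # 3) Min-max scale the scores and drop candidates scoring above alpha * (scaled top-1 score),
    #    since they may themselves be relevant to the query.
    scores = [s for _, s in scored]
    lo, hi = min(scores), max(scores)
    scaled = [(doc_id, (s - lo) / (hi - lo + 1e-12)) for doc_id, s in scored[1:]]
    threshold = ALPHA * 1.0  # the top-1 document scales to 1.0 under min-max scaling
    negatives = [doc_id for doc_id, s in scaled if s <= threshold][:NUM_NEGATIVES]

    return {"query": query, "positive": positive_id, "negatives": negatives}
```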
**Training Loss:**
The model is trained using a combination of two losses (a sketch appears below):
- InfoNCE Loss (DPR-style): encourages the embeddings of queries and positive documents to be similar, and those of queries and negative documents to be dissimilar.
- KL Divergence Loss: minimizes the difference between the re-ranker's score distribution and the model's predicted score distribution.
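A minimal sketch of the combined objective is shown below, assuming per-query document lists (positive first) and teacher scores from the re-ranker. The temperature and loss weight are illustrative, not the values used in training.

```python
import torch
import torch.nn.functional as F

def combined_loss(q_emb, doc_emb, teacher_scores, temperature=0.05, kl_weight=1.0):
    """
    q_emb:          (B, d)      L2-normalized query embeddings
    doc_emb:        (B, N, d)   L2-normalized documents per query; index 0 is the positive
    teacher_scores: (B, N)      re-ranker scores for the same documents
    """
    # Student similarity scores between each query and its candidate documents.
    student_scores = torch.einsum("bd,bnd->bn", q_emb, doc_emb) / temperature

    # InfoNCE / DPR-style loss: the positive (index 0) should outscore the negatives.
    labels = torch.zeros(q_emb.size(0), dtype=torch.long, device=q_emb.device)
    infonce = F.cross_entropy(student_scores, labels)

    # KL divergence between the teacher's listwise distribution and the student's.
    kl = F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores, dim=-1),
        reduction="batchmean",
    )
    return infonce + kl_weight * kl
```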
## Dependencies
- Base model: intfloat/multilingual-e5-base
- Query generation: tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1
  - Built with Meta Llama 3
  - Built with Gemma
  - META LLAMA 3.1 COMMUNITY LICENSE and Gemma Terms of Use
- Reranking: cl-nagoya/ruri-reranker-large
## Benchmark Results
### Japanese TREC-COVID (Japanese translation of TREC-COVID)

| Model | nDCG@10 | Recall@100 |
|---|---|---|
| BM25 | 0.5721 | 0.1115 |
| ruri-base | 0.4435 | 0.0793 |
| ruri-base-v2 | 0.6548 | 0.1163 |
| ruri-large-v2 | 0.6648 | 0.1215 |
| mE5-base | 0.6760 | 0.1258 |
| jmed-me5-v0.1 (mE5-base + domain adaptation) | 0.7236 | 0.1292 |
| aken12/splade-japanese-v3 | 0.6193 | 0.1141 |
| hotchpotch/japanese-splade-v2 | 0.7021 | 0.1274 |
### Japanese NF-Corpus (Japanese translation of NF-Corpus)

| Model | nDCG@10 | Recall@100 |
|---|---|---|
| BM25 | 0.3258 | 0.2443 |
| ruri-base | 0.2713 | 0.2544 |
| ruri-base-v2 | 0.2939 | 0.2651 |
| ruri-large-v2 | 0.3109 | 0.2797 |
| jmed-me5-v0.1 | 0.2865 | 0.2680 |
| aken12/splade-japanese-v3 | 0.3196 | 0.2775 |
| hotchpotch/japanese-splade-v2 | 0.3365 | 0.2860 |
## Contributors
- Kenya Abe (aken12) (Main contributor)
- Makoto P. Kato (mpkato) (Dataset translation)