Japanese Medical Document Retrieval Model (jmed-me5-v0.1)

This model is built on the intfloat/multilingual-e5-base checkpoint and fine-tuned to specialize in Japanese medical document retrieval. It combines crawled Japanese medical web documents, LLM-based query generation, and distillation from a strong re-ranker to achieve domain specialization.


Usage

See the Usage section of intfloat/multilingual-e5-base.
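
For convenience, here is a minimal retrieval sketch following the multilingual-e5 conventions (the "query: " / "passage: " input prefixes and cosine similarity over normalized embeddings). This is illustrative only, not the authors' evaluation code, and assumes the checkpoint loads like multilingual-e5-base via sentence-transformers.

```python
# Minimal retrieval sketch; assumes mean pooling as in multilingual-e5-base.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kasys/jmed-me5-v0.1")

queries = ["query: 糖尿病の初期症状は何ですか"]  # "What are early symptoms of diabetes?"
passages = [
    "passage: 糖尿病の初期には口渇、多飲、多尿などの症状が現れることがあります。",
    "passage: インフルエンザは冬季に流行するウイルス性の感染症です。",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = q_emb @ p_emb.T  # cosine similarities; higher = more relevant
print(scores)
```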

Model Overview

This model is designed for Japanese medical document search. It was fine-tuned using 750,000 Japanese medical web documents.

The overall algorithm follows the approach presented in the paper below (note: the authors of this model are not the authors of that paper):

  • Tamber et al. "Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation." arXiv preprint arXiv:2502.19712 (2025).

The pipeline includes:

  • LLM-Based Query Generation (a prompt sketch follows this list):
    A large language model generates queries from a set of 50,000 source documents.

    • Similar documents in the source set are removed to ensure diversity.
    • Queries are generated with tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1, using three few-shot examples.
    • Generated queries are then filtered by asking the LLM whether each query involves relevant medical or health-related knowledge; queries that fail this check are removed.
  • Candidate Query Validation & Re-ranking (a mining sketch follows this list):

    • Each generated query is used to search the Japanese medical documents with intfloat/multilingual-e5-base; only queries for which the original source document appears within the top 100 results are retained.
    • The retrieved candidates are re-ranked with the cl-nagoya/ruri-reranker-large model.
    • Only queries whose original document is ranked first are kept.
    • That top result is treated as the positive example.
    • Min-max scaling is applied to the scores of candidates ranked 1 through 100; documents scoring above a threshold (defined as the top-1 score × α) are removed, as they may themselves be relevant.
    • The top 20 of the remaining documents are used as negative examples.
  • Training Loss (a loss sketch follows this list):
    The model is trained using a combination of:

    • InfoNCE loss (DPR-style): encourages query embeddings to be similar to those of positive documents and dissimilar to those of negative documents.
    • KL divergence loss: minimizes the difference between the re-ranker's score distribution and the model's predicted score distribution (listwise distillation).
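
A minimal sketch of the query-generation step. The prompt wording, the three few-shot pairs, and the decoding settings are not published in this card; everything below marked as a placeholder is an assumption.

```python
# Hypothetical sketch of few-shot query generation. The prompt, the few-shot
# examples, and the decoding settings are assumptions, not the authors'
# actual configuration.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1",
    device_map="auto",
)

# Three (document, query) few-shot pairs -- placeholders.
FEW_SHOT = [
    {"doc": "...", "query": "..."},
    {"doc": "...", "query": "..."},
    {"doc": "...", "query": "..."},
]

def build_prompt(document: str) -> str:
    parts = ["Write a search query that this document would answer.\n"]
    for ex in FEW_SHOT:
        parts.append(f"Document: {ex['doc']}\nQuery: {ex['query']}\n")
    parts.append(f"Document: {document}\nQuery:")
    return "\n".join(parts)

out = generator(build_prompt("..."), max_new_tokens=64, do_sample=False)
generated_query = out[0]["generated_text"].split("Query:")[-1].strip()
```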
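A sketch of the validation and negative-mining logic under stated assumptions: `retrieve` is a hypothetical helper over an mE5-base index, the re-ranker is used through a CrossEncoder-style interface, and the value of α is not given in this card.

```python
# Hypothetical sketch of candidate validation and hard-negative mining.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cl-nagoya/ruri-reranker-large")
ALPHA = 0.9  # assumed value; the card does not state alpha

def mine_example(query, source_doc_id, retrieve):
    # retrieve(query, k) is assumed to return top-k (doc_id, text) pairs
    # from an intfloat/multilingual-e5-base index.
    hits = retrieve(query, k=100)
    if source_doc_id not in [doc_id for doc_id, _ in hits]:
        return None  # source document not in the top 100: drop the query

    scores = reranker.predict([(query, text) for _, text in hits])
    ranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)

    (top_id, top_text), _ = ranked[0]
    if top_id != source_doc_id:
        return None  # the source document must be ranked first

    # Min-max scale the re-ranker scores, then drop candidates whose scaled
    # score exceeds top-1 score * alpha (they may themselves be relevant).
    lo = min(s for _, s in ranked)
    hi = max(s for _, s in ranked)
    scaled = [((d, t), (s - lo) / (hi - lo + 1e-9)) for (d, t), s in ranked]
    top1 = scaled[0][1]
    negatives = [t for (_, t), s in scaled[1:] if s <= top1 * ALPHA]

    return {"query": query, "positive": top_text, "negatives": negatives[:20]}
```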
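A sketch of the combined objective for a single training example, assuming L2-normalized embeddings, the positive document at index 0, and softmax-based listwise distillation; the temperature and loss weighting are assumptions.

```python
# Hypothetical sketch of the combined InfoNCE + KL objective; temperature,
# loss weight, and the softmax over teacher scores are assumptions.
import torch
import torch.nn.functional as F

def combined_loss(q_emb, doc_embs, teacher_scores, tau=0.02, kl_weight=1.0):
    """q_emb: (d,) L2-normalized query embedding.
    doc_embs: (n, d) L2-normalized document embeddings, positive at index 0.
    teacher_scores: (n,) re-ranker scores for the same n documents."""
    sims = doc_embs @ q_emb / tau  # (n,) temperature-scaled similarities

    # InfoNCE (DPR-style): classify the positive (index 0) among all candidates.
    target = sims.new_zeros(1, dtype=torch.long)
    info_nce = F.cross_entropy(sims.unsqueeze(0), target)

    # KL divergence between the teacher (re-ranker) and student distributions.
    student_log_p = F.log_softmax(sims, dim=-1)
    teacher_p = F.softmax(teacher_scores, dim=-1)
    kl = F.kl_div(student_log_p, teacher_p, reduction="sum")

    return info_nce + kl_weight * kl
```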

Dependencies

Benchmark Results

Japanese NF-Corpus (Japanese translation of NF-Corpus)

| Model | nDCG@10 | Recall@100 |
|---|---|---|
| BM25 | 0.5721 | 0.1115 |
| ruri-base | 0.4435 | 0.0793 |
| ruri-base-v2 | 0.6548 | 0.1163 |
| ruri-large-v2 | 0.6648 | 0.1215 |
| mE5-base | 0.6760 | 0.1258 |
| jmed-me5-v0.1 (mE5-base + domain adaptation) | 0.7236 | 0.1292 |
| aken12/splade-japanese-v3 | 0.6193 | 0.1141 |
| hotchpotch/japanese-splade-v2 | 0.7021 | 0.1274 |

Japanese TREC-COVID (Japanese translation of TREC-COVID)

| Model | nDCG@10 | Recall@100 |
|---|---|---|
| BM25 | 0.3258 | 0.2443 |
| ruri-base | 0.2713 | 0.2544 |
| ruri-base-v2 | 0.2939 | 0.2651 |
| ruri-large-v2 | 0.3109 | 0.2797 |
| jmed-me5-v0.1 | 0.2865 | 0.2680 |
| aken12/splade-japanese-v3 | 0.3196 | 0.2775 |
| hotchpotch/japanese-splade-v2 | 0.3365 | 0.2860 |

Contributors
