Japanese Medical Document Retrieval Model (jmed-me5-v0.1)

This model is built on the intfloat/multilingual-e5-base checkpoint and fine-tuned to specialize in Japanese medical document retrieval. It combines crawled Japanese medical web documents, LLM-based query generation, and distillation from a strong re-ranker to achieve domain specialization.


Usage

See the Usage section of intfloat/multilingual-e5-base.
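
For convenience, here is a minimal retrieval sketch following the multilingual-e5 conventions (the "query: " / "passage: " input prefixes and cosine similarity over normalized embeddings). This is illustrative only, not the authors' evaluation code, and assumes the checkpoint loads like multilingual-e5-base via sentence-transformers.

```python
# Minimal retrieval sketch; assumes mean pooling as in multilingual-e5-base.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("kasys/jmed-me5-v0.1")

queries = ["query: 糖尿病の初期症状は何ですか"]  # "What are early symptoms of diabetes?"
passages = [
    "passage: 糖尿病の初期には口渇、多飲、多尿などの症状が現れることがあります。",
    "passage: インフルエンザは冬季に流行するウイルス性の感染症です。",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = q_emb @ p_emb.T  # cosine similarities; higher = more relevant
print(scores)
```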

Model Overview

This model is designed for Japanese medical document search. It was fine-tuned using 750,000 Japanese medical web documents.

The overall algorithm follows the approach presented in the paper below (note: the authors of this model are not the authors of that paper):

  • Tamber et al. "Teaching Dense Retrieval Models to Specialize with Listwise Distillation and LLM Data Augmentation." arXiv preprint arXiv:2502.19712 (2025).

The pipeline includes:

  • LLM-Based Query Generation (a prompt sketch follows this list):
    A large language model generates queries from a set of 50,000 source documents.

    • Similar documents in the source set are removed to ensure diversity.
    • Queries are generated with tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1, using three few-shot examples.
    • Generated queries are then filtered by asking the LLM whether each query involves relevant medical or health-related knowledge; queries that fail this check are removed.
  • Candidate Query Validation & Re-ranking (a mining sketch follows this list):

    • Each generated query is used to search the Japanese medical documents with intfloat/multilingual-e5-base; only queries for which the original source document appears within the top 100 results are retained.
    • The retrieved candidates are re-ranked with the cl-nagoya/ruri-reranker-large model.
    • Only queries whose original document is ranked first are kept.
    • That top result is treated as the positive example.
    • Min-max scaling is applied to the scores of candidates ranked 1 through 100; documents scoring above a threshold (defined as the top-1 score × α) are removed, as they may themselves be relevant.
    • The top 20 of the remaining documents are used as negative examples.
  • Training Loss (a loss sketch follows this list):
    The model is trained using a combination of:

    • InfoNCE loss (DPR-style): encourages query embeddings to be similar to those of positive documents and dissimilar to those of negative documents.
    • KL divergence loss: minimizes the difference between the re-ranker's score distribution and the model's predicted score distribution (listwise distillation).
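
A minimal sketch of the query-generation step. The prompt wording, the three few-shot pairs, and the decoding settings are not published in this card; everything below marked as a placeholder is an assumption.

```python
# Hypothetical sketch of few-shot query generation. The prompt, the few-shot
# examples, and the decoding settings are assumptions, not the authors'
# actual configuration.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1",
    device_map="auto",
)

# Three (document, query) few-shot pairs -- placeholders.
FEW_SHOT = [
    {"doc": "...", "query": "..."},
    {"doc": "...", "query": "..."},
    {"doc": "...", "query": "..."},
]

def build_prompt(document: str) -> str:
    parts = ["Write a search query that this document would answer.\n"]
    for ex in FEW_SHOT:
        parts.append(f"Document: {ex['doc']}\nQuery: {ex['query']}\n")
    parts.append(f"Document: {document}\nQuery:")
    return "\n".join(parts)

out = generator(build_prompt("..."), max_new_tokens=64, do_sample=False)
generated_query = out[0]["generated_text"].split("Query:")[-1].strip()
```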
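A sketch of the validation and negative-mining logic under stated assumptions: `retrieve` is a hypothetical helper over an mE5-base index, the re-ranker is used through a CrossEncoder-style interface, and the value of α is not given in this card.

```python
# Hypothetical sketch of candidate validation and hard-negative mining.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cl-nagoya/ruri-reranker-large")
ALPHA = 0.9  # assumed value; the card does not state alpha

def mine_example(query, source_doc_id, retrieve):
    # retrieve(query, k) is assumed to return top-k (doc_id, text) pairs
    # from an intfloat/multilingual-e5-base index.
    hits = retrieve(query, k=100)
    if source_doc_id not in [doc_id for doc_id, _ in hits]:
        return None  # source document not in the top 100: drop the query

    scores = reranker.predict([(query, text) for _, text in hits])
    ranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)

    (top_id, top_text), _ = ranked[0]
    if top_id != source_doc_id:
        return None  # the source document must be ranked first

    # Min-max scale the re-ranker scores, then drop candidates whose scaled
    # score exceeds top-1 score * alpha (they may themselves be relevant).
    lo = min(s for _, s in ranked)
    hi = max(s for _, s in ranked)
    scaled = [((d, t), (s - lo) / (hi - lo + 1e-9)) for (d, t), s in ranked]
    top1 = scaled[0][1]
    negatives = [t for (_, t), s in scaled[1:] if s <= top1 * ALPHA]

    return {"query": query, "positive": top_text, "negatives": negatives[:20]}
```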
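A sketch of the combined objective for a single training example, assuming L2-normalized embeddings, the positive document at index 0, and softmax-based listwise distillation; the temperature and loss weighting are assumptions.

```python
# Hypothetical sketch of the combined InfoNCE + KL objective; temperature,
# loss weight, and the softmax over teacher scores are assumptions.
import torch
import torch.nn.functional as F

def combined_loss(q_emb, doc_embs, teacher_scores, tau=0.02, kl_weight=1.0):
    """q_emb: (d,) L2-normalized query embedding.
    doc_embs: (n, d) L2-normalized document embeddings, positive at index 0.
    teacher_scores: (n,) re-ranker scores for the same n documents."""
    sims = doc_embs @ q_emb / tau  # (n,) temperature-scaled similarities

    # InfoNCE (DPR-style): classify the positive (index 0) among all candidates.
    target = sims.new_zeros(1, dtype=torch.long)
    info_nce = F.cross_entropy(sims.unsqueeze(0), target)

    # KL divergence between the teacher (re-ranker) and student distributions.
    student_log_p = F.log_softmax(sims, dim=-1)
    teacher_p = F.softmax(teacher_scores, dim=-1)
    kl = F.kl_div(student_log_p, teacher_p, reduction="sum")

    return info_nce + kl_weight * kl
```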

Dependencies

Benchmark Results

Japanese NF-Corpus (Japanese translation of NF-Corpus)

| Model | nDCG@10 | Recall@100 |
|---|---|---|
| BM25 | 0.5721 | 0.1115 |
| ruri-base | 0.4435 | 0.0793 |
| ruri-base-v2 | 0.6548 | 0.1163 |
| ruri-large-v2 | 0.6648 | 0.1215 |
| mE5-base | 0.6760 | 0.1258 |
| jmed-me5-v0.1 (mE5-base + domain adaptation) | 0.7236 | 0.1292 |
| aken12/splade-japanese-v3 | 0.6193 | 0.1141 |
| hotchpotch/japanese-splade-v2 | 0.7021 | 0.1274 |

Japanese TREC-COVID (Japanese translation of TREC-COVID)

| Model | nDCG@10 | Recall@100 |
|---|---|---|
| BM25 | 0.3258 | 0.2443 |
| ruri-base | 0.2713 | 0.2544 |
| ruri-base-v2 | 0.2939 | 0.2651 |
| ruri-large-v2 | 0.3109 | 0.2797 |
| jmed-me5-v0.1 | 0.2865 | 0.2680 |
| aken12/splade-japanese-v3 | 0.3196 | 0.2775 |
| hotchpotch/japanese-splade-v2 | 0.3365 | 0.2860 |

Contributors
