MedEmbed: Specialized Embedding Model for Medical and Clinical Information Retrieval

Model Description

MedEmbed is a family of embedding models fine-tuned specifically for medical and clinical data, designed to enhance performance in healthcare-related natural language processing (NLP) tasks, particularly information retrieval.

GitHub Repo: https://github.com/abhinand5/MedEmbed

Technical Blog Post: https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir

Intended Use

This model is intended for use in medical and clinical contexts to improve information retrieval, question answering, and semantic search tasks. It can be integrated into healthcare systems, research tools, and medical literature databases to enhance search capabilities and information access.
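As a rough illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query embedding. The toy three-dimensional vectors stand in for real embeddings. The helper `embed_texts` is hypothetical and rests on the assumption that the checkpoint loads through the sentence-transformers library; it is defined but never called here, so the example runs without downloading anything.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed_texts(texts: Sequence[str],
                model_name: str = "abhinand/MedEmbed-small-v0.1") -> List[List[float]]:
    """Hypothetical helper. ASSUMPTION: the checkpoint is loadable with
    sentence-transformers. Not called in this sketch."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    return model.encode(list(texts), normalize_embeddings=True).tolist()


# Toy vectors standing in for real query and document embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs)
print([index for index, _ in ranking])
```

In a real pipeline the vectors would come from the embedding model rather than being written by hand, and a vector index would usually replace the linear scan once the corpus grows.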

Training Data

The model was trained using a simple yet effective synthetic data generation pipeline:

Source: Clinical notes from PubMed Central (PMC)
Processing: LLaMA 3.1 70B model used to generate query-response pairs
Augmentation: Negative sampling for challenging examples
Format: Triplets (query, positive response, negative response) for contrastive learning
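
To make the triplet format concrete, here is a minimal sketch of a triplet record and a margin-based loss over cosine distance. The source says only that triplets are used for contrastive learning, so the specific loss, the `margin` value, the `Triplet` class, and the example texts are all assumptions chosen for illustration, not the training code used for MedEmbed.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Triplet:
    """One training example: a query with a matching and a non-matching passage."""
    query: str
    positive: str
    negative: str


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(q, p, n, margin: float = 0.2) -> float:
    """Zero once the positive is closer to the query than the negative by `margin`.
    ASSUMPTION: margin-based triplet loss; the source does not name the loss."""
    return max(0.0, cosine_distance(q, p) - cosine_distance(q, n) + margin)


# Invented example texts, for illustration only.
example = Triplet(
    query="What are common symptoms of iron-deficiency anemia?",
    positive="Patients with iron-deficiency anemia often report fatigue and pallor.",
    negative="The study enrolled adults undergoing elective knee surgery.",
)

# Toy 2-dimensional embeddings: an easy negative and a hard (near-duplicate) one.
q_vec, p_vec = [1.0, 0.0], [1.0, 0.0]
easy_loss = triplet_margin_loss(q_vec, p_vec, [0.0, 1.0])
hard_loss = triplet_margin_loss(q_vec, p_vec, [1.0, 0.1])
print(easy_loss, round(hard_loss, 4))
```

The easy negative contributes no loss because it is already far from the query, while the hard negative still does. This is why sampling challenging negatives gives the model more to learn from.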

Performance

According to the author's evaluation, MedEmbed outperforms general-purpose embedding models on the following retrieval benchmarks:

ArguAna
MedicalQARetrieval
NFCorpus
PublicHealthQA
TRECCOVID

Specific performance metrics (nDCG, MAP, Recall, Precision, MRR) are available in the full documentation.
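
For readers unfamiliar with these metrics, the sketch below computes two of them, nDCG@k and MRR, from per-query relevance lists given in ranked order. It is a simplified illustration: it uses linear gain and derives the ideal ranking from the retrieved list alone, which assumes every relevant document was retrieved. The relevance values are invented and are not MedEmbed results.

```python
import math
from typing import List, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the best possible ordering of the same list.
    ASSUMPTION: all relevant documents appear somewhere in `relevances`."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[float]]) -> float:
    """Average reciprocal rank across queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


# Invented relevance judgments for two queries, listed in ranked order.
runs = [[0, 1, 1], [1, 0, 0]]
ndcg_first = ndcg_at_k(runs[0], k=3)
mrr = mean_reciprocal_rank(runs)
print(round(ndcg_first, 4), mrr)
```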

Limitations

While highly effective for medical and clinical data, this model may not generalize well to non-medical domains. It should be used with caution in general-purpose NLP tasks.

Ethical Considerations

Users should be aware of potential biases in medical data and the ethical implications of AI in healthcare. This model should be used as a tool to assist, not replace, human expertise in medical decision-making.

Citation

If you use this model in your research, please cite:

@software{balachandran2024medembed,
  author = {Balachandran, Abhinand},
  title = {MedEmbed: Medical-Focused Embedding Models},
  year = {2024},
  url = {https://github.com/abhinand5/MedEmbed}
}

For more detailed information, visit our GitHub repository.

Model: abhinand/MedEmbed-small-v0.1
Base model: BAAI/bge-small-en-v1.5
Model size: 33.4M params (Safetensors)
Tensor type: F32

Evaluation results

MTEB AmazonCounterfactualClassification (en-ext), test set, self-reported:
accuracy: 72.174
ap: 21.758
ap_weighted: 21.758
f1: 59.803
f1_weighted: 77.376
main_score: 72.174

MTEB AmazonCounterfactualClassification (en), test set, self-reported:
accuracy: 71.284
ap: 33.514
ap_weighted: 33.514
f1: 65.078
