Alexandre Sallinen committed: Update README.md
#### Evaluation

| Model Name                        | MedMCQA | MedQA | PubMedQA | Average |
|-----------------------------------|---------|-------|----------|---------|
| google/gemma-2-2b                 | 40.31   | 34.80 | 74.20    | 49.77   |
| gemMeditron-2-2b-it               | 42.51   | 38.81 | 75.40    | 52.24   |
| Difference (gemMeditron vs. base) | 2.20    | 4.01  | 1.20     | 2.47    |

We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.

While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice, and we need to go beyond accuracy to assess finer-grained qualities such as empathy, alignment with local guidelines, structure, completeness, and safety. To address this, we have developed a platform that collects feedback directly from experts, so the model can continuously adapt to the changing contexts of clinical practice.
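As a sanity check on the table above, the scores combine in a simple way: a minimal sketch, assuming the `Average` column is the unweighted mean of the three benchmark accuracies and each `Difference` entry is the per-task gain of gemMeditron-2-2b-it over the google/gemma-2-2b base (the model names and scores are taken from the table; the unweighted-mean scheme is our assumption, though it matches the reported numbers).

```python
# Accuracies (%) copied from the evaluation table above.
scores = {
    "google/gemma-2-2b":   {"MedMCQA": 40.31, "MedQA": 34.80, "PubMedQA": 74.20},
    "gemMeditron-2-2b-it": {"MedMCQA": 42.51, "MedQA": 38.81, "PubMedQA": 75.40},
}

def average(row: dict) -> float:
    # Unweighted mean across benchmarks, rounded to 2 decimals as in the table.
    return round(sum(row.values()) / len(row), 2)

base = scores["google/gemma-2-2b"]
tuned = scores["gemMeditron-2-2b-it"]

# Per-task improvement of the tuned model over the base model.
diff = {task: round(tuned[task] - base[task], 2) for task in base}

print(average(base))   # 49.77
print(average(tuned))  # 52.24
print(diff)            # {'MedMCQA': 2.2, 'MedQA': 4.01, 'PubMedQA': 1.2}
```

Both averages and all three per-task differences reproduce the table, confirming the `Average` column is a plain macro-average over the three benchmarks.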