asach committed
Commit
3350e42
1 Parent(s): a7d5a67

updated readme

Files changed (1)
  1. README.md +1 -137
README.md CHANGED
@@ -1,133 +1,9 @@
- ![medalpaca](https://user-images.githubusercontent.com/37253540/228315829-b22f793c-2dcd-4c03-a32d-43720085a7de.png)
-
- # medAlpaca: Finetuned Large Language Models for Medical Question Answering
-
- ## Project Overview
- MedAlpaca expands upon both [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and
- [AlpacaLoRA](https://github.com/tloen/alpaca-lora) to offer an advanced suite of large language
- models specifically fine-tuned for medical question-answering and dialogue applications.
- Our primary objective is to deliver an array of open-source language models, paving the way for
- seamless development of medical chatbot solutions.
-
- These models have been trained on a variety of medical texts, encompassing resources such as
- medical flashcards, wikis, and dialogue datasets. For more details on the data used, please consult the data section.
-
- ## Getting Started
- Create a new virtual environment, e.g. with conda:
-
- ```bash
- conda create -n medalpaca "python>=3.9"
- ```
-
- Install the required packages:
- ```bash
- pip install -r requirements.txt
- ```
-
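- You can then sanity-check a fine-tuned model with the standard `transformers` text-generation pipeline. The snippet below is a minimal sketch, assuming the `medalpaca/medalpaca-7b` Hub checkpoint and a simple question/answer prompt; swap in your own weights and prompt format as needed.
-
- ```python
- # Minimal inference sketch. The checkpoint id and prompt template are
- # assumptions, not the project's fixed interface.
- from transformers import pipeline
-
- qa = pipeline(
-     "text-generation",
-     model="medalpaca/medalpaca-7b",      # assumed Hub checkpoint
-     tokenizer="medalpaca/medalpaca-7b",
- )
- prompt = "Question: What are the symptoms of diabetes?\nAnswer:"
- print(qa(prompt, max_new_tokens=128)[0]["generated_text"])
- ```
-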
- ## Training of medAlpaca
- <img width="256" alt="training your alpaca" src="https://user-images.githubusercontent.com/37253540/229250535-98f28e1c-0a8e-46e7-9e61-aeb98ef115cc.png">
-
- ### Memory Requirements
- We benchmarked the GPU memory required, as well as the approximate duration per epoch,
- for finetuning LLaMA 7b on the Medical Meadow small dataset (~6000 Q/A pairs) on a single GPU:
-
- | Model | 8-bit training | LoRA | fp16 | bf16 | VRAM used | Gradient checkpointing | Duration/epoch (mm:ss) |
- |----------|----------------|-------|-------|-------|-----------|------------------------|------------------------|
- | LLaMA 7b | True | True | True | False | 8.9 GB | False | 77:30 |
- | LLaMA 7b | False | True | True | False | 18.8 GB | False | 14:30 |
- | LLaMA 7b | False | False | True | False | OOM | False | - |
- | LLaMA 7b | False | False | False | True | 79.5 GB | True | 35:30 |
- | LLaMA 7b | False | False | False | False | OOM | True | - |
-
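- For orientation, 8-bit training and LoRA typically combine as in the following sketch. This mirrors the [AlpacaLoRA](https://github.com/tloen/alpaca-lora) recipe this project builds on; it illustrates the technique and is not a copy of `train.py`'s internals.
-
- ```python
- # Sketch of the usual 8-bit + LoRA setup via peft/bitsandbytes (assumed recipe).
- from transformers import LlamaForCausalLM
- from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
-
- model = LlamaForCausalLM.from_pretrained(
-     "PATH_TO_LLAMA_WEIGHTS", load_in_8bit=True, device_map="auto"
- )
- model = prepare_model_for_int8_training(model)  # e.g. fp32 norms for stability
- lora_config = LoraConfig(
-     r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
-     lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
- )
- model = get_peft_model(model, lora_config)  # only the small adapters are trained
- ```
-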
- ### Train medAlpaca based on LLaMA
- If you have access to the [LLaMA](https://arxiv.org/abs/2302.13971) or [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html)
- weights, you can finetune the model with the following command.
- Just replace `<PATH_TO_LLAMA_WEIGHTS>` with the folder containing your LLaMA or Alpaca weights.
-
- ```bash
- python medalpaca/train.py \
-     --model <PATH_TO_LLAMA_WEIGHTS> \
-     --data_path medical_meadow_small.json \
-     --output_dir 'output' \
-     --train_in_8bit True \
-     --bf16 True \
-     --tf32 False \
-     --fp16 False \
-     --global_batch_size 128 \
-     --per_device_batch_size 8
- ```
- By default the script performs mixed-precision training.
- You can toggle 8-bit training with the `train_in_8bit` flag.
- 8-bit training currently only works with `use_lora True`; however, you can use
- LoRA without 8-bit training.
- The script can also train other models, such as `facebook/opt-6.7b`.
-
- ## Data
- <img width="256" alt="medical meadow" src="https://user-images.githubusercontent.com/37253540/229244284-72b00e82-0da1-4218-b08e-63864306631e.png">
-
- To ensure your cherished llamas and alpacas are well-fed and thriving,
- we have diligently gathered high-quality biomedical open-source datasets
- and transformed them into instruction-tuning formats.
- We have dubbed this endeavor **Medical Meadow**.
- Medical Meadow currently encompasses roughly 1.5 million data points across a diverse range of tasks,
- including openly curated medical data transformed into Q/A pairs with OpenAI's `gpt-3.5-turbo`
- and a collection of established NLP tasks in the medical domain.
- Please note that not all data is of the same quantity and quality, and you may need to subsample
- the data for training your own model.
- We will persistently update and refine the dataset, and we welcome everyone to contribute more 'grass' to Medical Meadow!
-
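- For reference, a record in the Alpaca-style instruction format is sketched below; the exact field names are an assumption based on that convention, so check the dataset cards for the authoritative schema.
-
- ```python
- # One hypothetical Medical Meadow record in the Alpaca-style format.
- import json
-
- record = {
-     "instruction": "Answer this question truthfully.",
-     "input": "What are the first-line treatments for type 2 diabetes?",
-     "output": "Lifestyle changes and metformin are the usual first-line treatment.",
- }
- print(json.dumps(record, indent=2))
- ```
-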
- ### Data Overview
-
- | Name | Source | n | n included in training |
- |----------------------|-------------------------------------------------------------------------|----------|-------------------------|
- | Medical Flashcards | [medalpaca/medical_meadow_medical_flashcards](https://huggingface.co/datasets/medalpaca/medical_meadow_medical_flashcards) | 33955 | 33955 |
- | Wikidoc | [medalpaca/medical_meadow_wikidoc](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc) | 67704 | 10000 |
- | Wikidoc Patient Information | [medalpaca/medical_meadow_wikidoc_patient_information](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information) | 5942 | 5942 |
- | Stackexchange academia | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 40865 | 40865 |
- | Stackexchange biology | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 27887 | 27887 |
- | Stackexchange fitness | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 9833 | 9833 |
- | Stackexchange health | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 7721 | 7721 |
- | Stackexchange bioinformatics | [medalpaca/medical_meadow_stack_exchange](https://huggingface.co/datasets/medalpaca/medical_meadow_stackexchange) | 5407 | 5407 |
- | USMLE Self Assessment Step 1 | [medalpaca/medical_meadow_usmle_self_assessment](https://huggingface.co/datasets/medalpaca/medical_meadow_usmle_self_assessment) | 119 | 92 (test only) |
- | USMLE Self Assessment Step 2 | [medalpaca/medical_meadow_usmle_self_assessment](https://huggingface.co/datasets/medalpaca/medical_meadow_usmle_self_assessment) | 120 | 110 (test only) |
- | USMLE Self Assessment Step 3 | [medalpaca/medical_meadow_usmle_self_assessment](https://huggingface.co/datasets/medalpaca/medical_meadow_usmle_self_assessment) | 135 | 122 (test only) |
- | MEDIQA | [original](https://osf.io/fyg46/), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_mediqa) | 2208 | 2208 |
- | CORD-19 | [original](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_cord19) | 1056660 | 50000 |
- | MMMLU | [original](https://github.com/hendrycks/test), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_mmmlu) | 3787 | 3787 |
- | Pubmed Health Advice | [original](https://aclanthology.org/D19-1473/), [preprocessed](https://huggingface.co/datasets/medalpaca/health_advice) | 10178 | 10178 |
- | Pubmed Causal | [original](https://aclanthology.org/2020.coling-main.427/), [preprocessed](https://huggingface.co/datasets/medalpaca/medical_meadow_pubmed_causal) | 2446 | 2446 |
- | ChatDoctor | [original](https://github.com/Kent0n-Li/ChatDoctor) | 215000 | 10000 |
- | OpenAssistant | [original](https://huggingface.co/OpenAssistant) | 9209 | 9209 |
 

  ### Data description
  Please refer to [DATA_DESCRIPTION.md](DATA_DESCRIPTION.md)

-
- ## Benchmarks
- <img width="256" alt="benchmarks" src="https://user-images.githubusercontent.com/37253540/229249302-20ff8a88-95b4-42a3-bdd8-96a9dce9a92b.png">
-
- We benchmark all models on the USMLE self-assessment, which is available at this [link](https://www.usmle.org/prepare-your-exam).
- Note that we removed all questions with images, as our models are not multimodal.
-
- | **Model** | **Step 1** | **Step 2** | **Step 3** |
- |--------------------------------------------------------------------------------------------|-------------------|------------------|------------------|
- | [LLaMA 7b](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) | 0.198 | 0.202 | 0.203 |
- | [Alpaca 7b naive](https://github.com/tatsu-lab/stanford_alpaca) ([weights](https://huggingface.co/chavinlo/alpaca-native)) | 0.275 | 0.266 | 0.293 |
- | [Alpaca 7b LoRA](https://github.com/tloen/alpaca-lora) | 0.220 | 0.138 | 0.252 |
- | [MedAlpaca 7b](https://huggingface.co/medalpaca/medalpaca-7b) | 0.297 | 0.312 | 0.398 |
- | [MedAlpaca 7b LoRA](https://huggingface.co/medalpaca/medalpaca-lora-7b-16bit) | 0.231 | 0.202 | 0.179 |
- | [MedAlpaca 7b LoRA 8bit](https://huggingface.co/medalpaca/medalpaca-lora-7b-8bit) | 0.231 | 0.241 | 0.211 |
- | [ChatDoctor](https://github.com/Kent0n-Li/ChatDoctor) (7b) | 0.187 | 0.185 | 0.148 |
- | [LLaMA 13b](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/) | 0.222 | 0.248 | 0.276 |
- | [Alpaca 13b naive](https://huggingface.co/chavinlo/alpaca-13b) | 0.319 | 0.312 | 0.301 |
- | [MedAlpaca 13b](https://huggingface.co/medalpaca/medalpaca-13b) | ***0.473*** | ***0.477*** | ***0.602*** |
- | [MedAlpaca 13b LoRA](https://huggingface.co/medalpaca/medalpaca-lora-13b-16bit) | 0.250 | 0.255 | 0.255 |
- | [MedAlpaca 13b LoRA 8bit](https://huggingface.co/medalpaca/medalpaca-lora-13b-8bit) | 0.189 | 0.303 | 0.289 |
- | [MedAlpaca 30b](https://huggingface.co/medalpaca/medalpaca-30b) (still training) | TBA | TBA | TBA |
- | [MedAlpaca 30b LoRA 8bit](https://huggingface.co/medalpaca/medalpaca-lora-30b-8bit) | 0.315 | 0.327 | 0.361 |
-
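- The scores above are the fraction of questions answered correctly. A minimal scoring sketch follows; the file layout and field names are assumptions, not the project's actual evaluation code.
-
- ```python
- # Hypothetical scorer: fraction of USMLE questions answered correctly.
- import json
-
- def accuracy(path: str) -> float:
-     # Each JSONL record is assumed to hold the model's letter choice
-     # ("prediction") and the gold answer ("answer").
-     with open(path) as f:
-         records = [json.loads(line) for line in f]
-     hits = sum(r["prediction"].strip().upper() == r["answer"].strip().upper()
-                for r in records)
-     return hits / len(records)
-
- print(f"Step 1: {accuracy('usmle_step1_predictions.jsonl'):.3f}")  # hypothetical file
- ```
-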
  We are continuously working on improving the training as well as our evaluation prompts.
  Expect this table to change quite a bit.

@@ -142,15 +18,3 @@ extensive testing or validation, and their reliability cannot be guaranteed.
  We kindly ask you to exercise caution when using these models,
  and we appreciate your understanding as we continue to explore and develop this innovative technology.

-
- ## Paper
- <img width="256" alt="chat-lama" src="https://user-images.githubusercontent.com/37253540/229261366-5cce9a60-176a-471b-80fd-ba390539da72.png">
-
- ```
- @article{han2023medalpaca,
-   title={MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data},
-   author={Han, Tianyu and Adams, Lisa C and Papaioannou, Jens-Michalis and Grundmann, Paul and Oberhauser, Tom and L{\"o}ser, Alexander and Truhn, Daniel and Bressem, Keno K},
-   journal={arXiv preprint arXiv:2304.08247},
-   year={2023}
- }
- ```

+ # Amigo: Finetuned Large Language Models for Medical Question Answering
20