instruction-pretrain committed · Commit eb2e79b · Parent(s): f5021bc
Update README.md

README.md CHANGED
@@ -22,8 +22,9 @@ We explore supervised multitask pre-training by proposing ***Instruction Pre-Tra
</p>

**************************** **Updates** ****************************
+* 2024/8/29: Updated [guidelines](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B) on evaluating any 🤗Huggingface models on domain-specific tasks
* 2024/7/31: Updated pre-training suggestions in the `Advanced Usage` section of [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
-* 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M
+* 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M. The performance trend on downstream tasks throughout the pre-training process is shown below:
<p align='left'>
  <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="500">
</p>
@@ -73,30 +74,43 @@ pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
print(pred)
```

-### 2.
-# number of GPUs, chosen from [1,2,4,8]
-N_GPU=1
-
-# Set as True
-add_bos_token=True
-
-bash scripts/inference.sh ${DOMAIN} 'instruction-pretrain/medicine-Llama3-8B' ${add_bos_token} ${MODEL_PARALLEL} ${N_GPU}
-```
+### 2. Evaluate any Huggingface LMs on domain-specific tasks (💡New!)
+You can use the following script to reproduce our results and evaluate any other Huggingface models on domain-specific tasks. Note that the script is NOT applicable to models that require specific prompt templates (e.g., Llama2-chat, Llama3-Instruct).
+
+1). Set Up Dependencies
+```bash
+git clone https://github.com/microsoft/LMOps
+cd LMOps/adaptllm
+pip install -r requirements.txt
+```
+
+2). Evaluate the Model
+```bash
+# Select the domain from ['biomedicine', 'finance', 'law']
+DOMAIN='biomedicine'
+
+# Specify any Huggingface LM name (not applicable to models requiring specific prompt templates)
+MODEL='instruction-pretrain/medicine-Llama3-8B'
+
+# Model parallelization:
+# - Set MODEL_PARALLEL=False if the model fits on a single GPU.
+#   We observe that LMs smaller than 10B always meet this requirement.
+# - Set MODEL_PARALLEL=True if the model is too large and encounters OOM on a single GPU.
+MODEL_PARALLEL=False
+
+# Choose the number of GPUs from [1, 2, 4, 8]
+N_GPU=1
+
+# Whether to add a BOS token at the beginning of the prompt input:
+# - Set to False for AdaptLLM.
+# - Set to True for instruction-pretrain models.
+# If unsure, we recommend setting it to False, as this is suitable for most LMs.
+add_bos_token=True
+
+# Run the evaluation script
+bash scripts/inference.sh ${DOMAIN} ${MODEL} ${add_bos_token} ${MODEL_PARALLEL} ${N_GPU}
+```


## Citation
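
If it is unclear whether a particular model should use `add_bos_token=True` or `False`, one way to inspect the tokenizer's default behavior is sketched below. This is only an illustrative check with the `transformers` library, not code from the LMOps repository; the model name is just an example, and how the flag is ultimately applied is determined by `scripts/inference.sh`.

```python
from transformers import AutoTokenizer

# Example model; substitute the Huggingface LM you plan to evaluate.
model_name = "instruction-pretrain/medicine-Llama3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a short probe and see whether a BOS id is already prepended by default.
ids = tokenizer("A short probe sentence.").input_ids
prepends_bos = tokenizer.bos_token_id is not None and len(ids) > 0 and ids[0] == tokenizer.bos_token_id
print(f"bos_token={tokenizer.bos_token!r}, prepended by default: {prepends_bos}")
```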
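
The `MODEL_PARALLEL` switch reflects the usual choice between fitting a model on a single GPU and sharding it across several. A minimal sketch of that distinction in plain `transformers` (again only an illustration, assuming `accelerate` is installed; this is not the loading code used by `scripts/inference.sh`):

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "instruction-pretrain/medicine-Llama3-8B"  # example model

# Roughly MODEL_PARALLEL=False: the whole model fits on a single GPU.
# model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda:0")

# Roughly MODEL_PARALLEL=True: shard layers across all visible GPUs (requires `accelerate`).
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
print(model.hf_device_map)  # shows which device each module was placed on
```

On a single sufficiently large GPU, the commented-out single-device variant is the simpler choice, mirroring the `MODEL_PARALLEL=False` default above.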