Enderfga committed
Commit b43db50 · 1 Parent(s): 543fc8a

Update README.md

Files changed (1): README.md (+177, -0)

---
license: mit
---
# 🔥 MoE-Mixtral-7B-8Expert
<p align="left">
LLaMA2-Accessory link: <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory" target="_blank">Github</a>
</p>

[mixtral-8x7b](https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen) is a Mixture-of-Experts (MoE) model. In this tutorial, we introduce how to run inference with and finetune the model.

## Features
With LLaMA2-Accessory, mixtral-8x7b enjoys the following features:
1. Distributed MoE (i.e., instantiating experts on multiple processes/GPUs)
2. Load balancing loss
3. Tensor parallelism and FSDP for efficient training
4. Distributed and/or quantized inference

## Install
Please follow the [instructions here](https://llama2-accessory.readthedocs.io/en/latest/install.html) to install
LLaMA2-Accessory, an easy-to-use and comprehensive toolkit for LLM development.

## Prepare Checkpoint
Given the official mixtral-8x7b checkpoints, a format conversion step is needed to make them usable by
LLaMA2-Accessory. We have released off-the-shelf converted checkpoints. Alternatively, you can convert them
yourself following the guides below.
### A. Download Converted Checkpoints
The converted checkpoints are released at [HuggingFace](https://huggingface.co/Alpha-VLLM/MoE-Mixtral-7B-8Expert/tree/main/converted).
Please download all files in the folder to your machine.
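For example, here is a minimal download sketch using `huggingface_hub`; the package and the `local_dir` name are assumptions, not requirements of this repo:
```python
# Sketch: fetch only the `converted/` folder of the repo with huggingface_hub.
# Assumes `pip install huggingface_hub`; the local_dir name is just an example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alpha-VLLM/MoE-Mixtral-7B-8Expert",
    allow_patterns=["converted/*"],      # skip everything outside the converted checkpoints
    local_dir="MoE-Mixtral-7B-8Expert",  # pass this path (plus /converted) to the model later
)
```
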
### B. Convert by Yourself

#### 1. Prepare the original checkpoints
The original checkpoints are available at https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen. Please first
download the 10 splits and then concatenate them into one following the official guide. After this step, you should have the
`consolidated.00.pth` file.
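As a rough illustration of the concatenation step, here is a sketch in Python; the `consolidated.00.pth-split*` filename pattern is an assumption, so follow the official guide in the source repo for the exact names (a plain shell `cat` of the splits in order achieves the same thing):
```python
# Sketch only: merge the downloaded splits back into consolidated.00.pth.
# The split filename pattern is an assumption -- adjust it to the actual files.
import glob
import shutil

splits = sorted(glob.glob("consolidated.00.pth-split*"))
assert splits, "no split files found -- check the filename pattern"

with open("consolidated.00.pth", "wb") as out:
    for part in splits:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out)  # stream-copy each split in order
```
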

#### 2. Convert

Download the [split.py](https://huggingface.co/Alpha-VLLM/MoE-Mixtral-7B-8Expert/blob/main/converted/split.py) script and *put it in the same directory as `consolidated.00.pth`*. Run the following
command to conduct the conversion:
```bash
python split.py
```
After running it, you should see a folder named `converted` created, with eight `consolidated.**-of-08.model.pth` files
therein.
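As an optional sanity check (not part of the official workflow), you can count the shards from Python:
```python
# Optional sanity check (a sketch): confirm that all eight shards were produced.
import glob

shards = sorted(glob.glob("converted/consolidated.*-of-08.model.pth"))
assert len(shards) == 8, f"expected 8 shards, found {len(shards)}"
print("\n".join(shards))
```
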

#### 3. Prepare other resources
Finally, please download the following three files from [our HuggingFace repo](https://huggingface.co/Alpha-VLLM/MoE-Mixtral-7B-8Expert/tree/main/converted):
```bash
config.json
meta.json
tokenizer.model
```
and put them under the `converted` directory, next to the weight files you obtained in the previous step.
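These three files can also be fetched programmatically; below is a small sketch assuming `huggingface_hub` is available (again an assumption, not required tooling):
```python
# Sketch: download config.json, meta.json and tokenizer.model next to the converted weights.
# With local_dir=".", each file keeps its in-repo path and lands under ./converted/.
from huggingface_hub import hf_hub_download

for name in ("config.json", "meta.json", "tokenizer.model"):
    hf_hub_download(
        repo_id="Alpha-VLLM/MoE-Mixtral-7B-8Expert",
        filename=f"converted/{name}",
        local_dir=".",
    )
```
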

## Inference
### Simple Inference
You can run inference on 8, 4, 2, or 1 GPUs. With tensor parallelism and distributed MoE, the more GPUs you use, the
lower the memory and computation load on each individual GPU. The following code exemplifies the inference process.
```python
from accessory.model.meta import MetaModel

import random
import numpy as np

import torch
import torch.distributed as dist
import multiprocessing as mp

def main(world_size, rank) -> None:
    # specify random seed to ensure consistent token sampling among model parallel ranks
    random.seed(0)
    torch.random.manual_seed(0)
    np.random.seed(0)

    dist.init_process_group(
        backend="nccl", rank=rank, world_size=world_size,
        init_method="tcp://127.0.0.1:23560",
    )
    torch.cuda.set_device(rank)

    # mp_group identifies which ranks will work collaboratively through model parallelism
    model = MetaModel.from_pretrained("/path/to/converted", max_seq_len=2048,
                                      mp_group=dist.new_group(ranks=list(range(dist.get_world_size()))))

    prompt = "The best programming language in the world is"

    response = model.generate([prompt], images=None, max_gen_len=512)[0]
    print(response)

    # or if you want to generate the response token by token
    response = None
    for response_in_progress in model.stream_generate(prompt, image=None, max_gen_len=512):
        response = response_in_progress['text']
    if rank == 0:  # without this filter, the response will be printed `world_size` times
        print(response)


if __name__ == "__main__":
    N_GPU = 8  # 1, 2, 4, or 8
    if N_GPU == 1:
        main(world_size=1, rank=0)
    elif N_GPU > 1:
        # You can use whatever method, e.g. torchrun, slurm, etc. for distributed launch.
        # Just be sure to initialize torch distributed (by invoking dist.init_process_group)
        # before creating the model if model parallel size > 1 is used.
        mp.set_start_method("spawn")
        for rank in range(N_GPU):
            process = mp.Process(target=main, args=(N_GPU, rank))
            process.start()
    else:
        raise ValueError
```

A thorough tutorial on inference with LLaMA2-Accessory can be found in the
[document](https://llama2-accessory-temp.readthedocs.io/en/latest/inference.html).

### Host Local Demo
LLaMA2-Accessory provides a series of Gradio demos for efficient interaction with your model. To host a local demo
for the pretrained mixtral-8x7b model, follow the steps below:
```bash
cd LLaMA2-Accessory/accessory
torchrun --nproc-per-node=$N_GPUS_TO_USE --master-port=$PORT demos/single_turn.py \
--pretrained_path $PATH_TO_CONVERTED
```
As mentioned in the [Simple Inference](#simple-inference) section, `$N_GPUS_TO_USE` can be 1, 2, 4, or 8.
`$PATH_TO_CONVERTED` should be the directory containing the converted checkpoints, and `$PORT` can be any free port.


## Finetuning
LLaMA2-Accessory supports both full-parameter and parameter-efficient finetuning of mixtral-8x7b. It also
supports the load balancing regularization loss. More advanced MoE support will come soon.
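For reference, load-balancing losses for MoE models typically take the auxiliary-loss form popularized by Switch Transformers; the formula below shows that common variant purely as an illustration (the exact formulation implemented in LLaMA2-Accessory may differ):

$$
\mathcal{L}_{\text{balance}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i
$$

where $N$ is the number of experts, $f_i$ is the fraction of tokens routed to expert $i$, $P_i$ is the mean router probability assigned to expert $i$, and $\alpha$ is a small weighting coefficient.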

### Data
We use the following datasets to exemplify finetuning:
+ [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1)
+ [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

The two datasets are referred to by the [dialog_ultrachat200kWizardcode.yaml](https://github.com/Alpha-VLLM/LLaMA2-Accessory/accessory/configs/data/finetune/sg/dialog_ultrachat200kWizardcode.yaml)
file, which is then used by the `*.sh` experiments shown below to define the data for finetuning. Note that the data need
to be processed into the format expected by LLaMA2-Accessory. For convenience, we provide the processed data files for
[💾evol-codealpaca-v1](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/data/evol-codealpaca-v1/wizardCode.json) and
[💾ultrachat_200k](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/data/ultrachat_200k_train_sft.json).
Please move them to the position specified by `dialog_ultrachat200kWizardcode.yaml`.
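If you prefer to script this step, the sketch below pulls the two processed files with `huggingface_hub`; the repo id and in-repo paths are inferred from the links above (treat them as assumptions and verify against the actual repository layout), and you still need to move the files to wherever `dialog_ultrachat200kWizardcode.yaml` expects them:
```python
# Rough sketch: fetch the processed finetuning data files.
# The repo id and file paths below are inferred from the links above (assumptions);
# verify them against the actual repository layout before use.
from huggingface_hub import hf_hub_download

for path in (
    "data/evol-codealpaca-v1/wizardCode.json",
    "data/ultrachat_200k_train_sft.json",
):
    hf_hub_download(repo_id="Alpha-VLLM/LLaMA2-Accessory", filename=path, local_dir=".")
```
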

### Full Finetune
```bash
cd LLaMA2-Accessory/accessory
srun -n32 --gres=gpu:8 --ntasks-per-node=8 bash \
exps/finetune/sg/dialog_ultrachat200kWizardcode_mistral.sh \
/path/to/converted/mixtral-8x7b-32kseqlen \
/path/to/converted/mixtral-8x7b-32kseqlen/config.json \
/path/to/converted/mixtral-8x7b-32kseqlen/tokenizer.model
```
### PEFT
```bash
cd LLaMA2-Accessory/accessory
srun -n16 --gres=gpu:8 --ntasks-per-node=8 bash \
exps/finetune/sg/dialog_ultrachat200kWizardcode_mistralPeft.sh \
/path/to/converted/mixtral-8x7b-32kseqlen \
/path/to/converted/mixtral-8x7b-32kseqlen/config.json \
/path/to/converted/mixtral-8x7b-32kseqlen/tokenizer.model
```

**Finetuned Model Release:**

+ [🤗checkpoint](https://huggingface.co/Alpha-VLLM/MoE-Mixtral-7B-8Expert/tree/main/finetuned/peft)

**Host Local Demo**
```bash
cd LLaMA2-Accessory/accessory
python demos/multi_turn.py --n_gpus $N_GPUS_TO_USE --pretrained_path $PATH_TO_FINETUNED
```

See the LLaMA2-Accessory [document](https://llama2-accessory.readthedocs.io/en/latest/) to learn more about
[finetuning](https://llama2-accessory.readthedocs.io/en/latest/finetune/index.html)
and [inference](https://llama2-accessory-temp.readthedocs.io/en/latest/inference.html).