---
language:
- zh
- en
tags:
- llama2
- llama2-base
- llama2-base-7B
---
# 7B Chinese Chatbot Trained on Llama2-base 7B

## Introduction

After training [Llama2-chat 7B Chinese](https://huggingface.co/RicardoLee/Llama2-chat-Chinese-50W) and [Llama2-chat 13B Chinese](https://huggingface.co/RicardoLee/Llama2-chat-13B-Chinese-50W), I was curious whether supervised fine-tuning (SFT) could be run directly on the Llama2-base series. That question is the motivation for this model repository.

**However**, in practice, when I reused the LoRA training framework from the earlier chat models, LoRA training on Llama2-base proved very hard to converge and was constantly on the verge of gradient explosion. DeepSpeed frequently triggered its reduce scale operation, and eventually the scale became too small and went out of bounds, crashing the training. I swept LR from 1e-5 to 2e-4, LoRA rank \[4, 8, 64\], LoRA alpha \[1, 4, 8, 16, 32\], LoRA dropout \[0.05, 0.1\], warmup ratio \[0.01, 0.03, 0.05\], and other hyperparameters, and none of these settings trained stably. This model therefore falls back to full-parameter SFT. Why LoRA training is so difficult here still needs analysis.

Since there are examples online of successful LoRA SFT on Llama2-base with English SFT datasets, I suspect the difficulty comes from the embeddings added for the expanded Chinese vocabulary, which may make training much harder.

To help others analyze the problem, this repository includes the full loss/LR training log in [Material](trainer_state.json).

The training data is 500,000 SFT samples drawn from the [BELLE](https://huggingface.co/BelleGroup) project.
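
The crash described above is consistent with how dynamic loss scaling behaves in mixed-precision training: each step that overflows shrinks the scale, and a long enough run of overflows pushes it to its floor. The following is a minimal toy sketch of that mechanism, not DeepSpeed's actual implementation. The function name `simulate_loss_scale` and every default value (initial scale, minimum scale, factor, growth interval) are assumptions chosen for illustration.

```python
def simulate_loss_scale(overflow_flags, init_scale=2.0**16, min_scale=1.0,
                        factor=2.0, growth_interval=1000):
    """Toy model of dynamic loss scaling (all defaults are illustrative).

    overflow_flags: iterable of booleans, True when a step's gradients overflow.
    Returns the loss scale after each step. Raises RuntimeError when an
    overflow occurs while the scale is already at its minimum.
    """
    scale = init_scale
    good_steps = 0
    history = []
    for step, overflowed in enumerate(overflow_flags):
        if overflowed:
            if scale <= min_scale:
                # Nothing left to reduce: training cannot continue.
                raise RuntimeError(
                    f"step {step}: loss scale already at minimum {min_scale}"
                )
            scale = max(scale / factor, min_scale)
            good_steps = 0
        else:
            good_steps += 1
            # Grow the scale again after a window of clean steps.
            if good_steps % growth_interval == 0:
                scale *= factor
        history.append(scale)
    return history
```

With these assumed defaults, sixteen consecutive overflowing steps bring the scale from 65536 down to 1, and a seventeenth raises the error. If overflows arrive more often than the growth window can compensate, the scale only ever moves downward, which matches the failure pattern reported for the LoRA runs.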

## Training Details

Some training details:

1. Training framework: this model uses full-parameter SFT rather than LoRA.
2. Tokenizer: this model uses the tokenizer.model from the Chinese-Alpaca-Plus model. The tokenizer.model shipped with Llama2 is identical to the one in Llama1, so in theory the tokenizer from the Chinese-LLaMa project can be reused as-is without any token misalignment.
3. Training parameters: because of limited resources, the model was trained for only 1 epoch, with a learning rate of 2e-4 and a warmup ratio of 0.01. This is a very aggressive training setup, which is why this repository is labeled a pre-release version. A 3-epoch version will be released later for comparison with the previous Llama2-chat Chinese models.
4. Training resources: 8\*V100, 21 hours.
5. Initial loss: see [Material](trainer_state.json).
6. Final training loss: see [Material](trainer_state.json).
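
To read the initial and final loss from the attached file, something like the sketch below may help. It assumes that `trainer_state.json` follows the usual Hugging Face `Trainer` layout, where a top-level `log_history` list holds one dictionary per logging step with `step`, `loss`, and `learning_rate` keys. That layout has not been checked against this repository's file, and the helper names `loss_curve` and `load_loss_curve` are hypothetical.

```python
import json


def loss_curve(state):
    """Return (step, loss, learning_rate) tuples from a parsed trainer state.

    Assumes the Hugging Face Trainer layout: state["log_history"] is a list
    of dicts. Entries without a "loss" key (e.g. eval or summary rows) are
    skipped.
    """
    return [
        (entry["step"], entry["loss"], entry.get("learning_rate"))
        for entry in state.get("log_history", [])
        if "loss" in entry
    ]


def load_loss_curve(path="trainer_state.json"):
    """Load the JSON file at `path` and extract its loss curve."""
    with open(path, encoding="utf-8") as f:
        return loss_curve(json.load(f))
```

Under that assumption, `load_loss_curve()[0]` and `load_loss_curve()[-1]` give the first and last logged training loss, and the full list can be passed to any plotting tool to inspect the loss and LR trajectory.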

## Inference

This model still uses the Stanford Alpaca template, so remember to add the prologue when testing. The prologue template is:

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n\n${Your Content}\n\n### Response:\n\n"

For dialogue with context, the prologue template is:

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n\nHuman:${Previous Human Content}\nAssistant:${Previous Machine Content}\nHuman:${Your Question}\n\n### Response:\n\n"
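
The two templates above can be produced by one small helper. This is a minimal sketch: the function name `build_prompt` and the representation of history as a list of `(human, assistant)` pairs are assumptions made for illustration, while the template text itself is copied from the strings above.

```python
PROLOGUE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)


def build_prompt(question, history=None):
    """Format a prompt in the Alpaca-style template used by this model.

    question: the current user input.
    history:  optional list of (human, assistant) pairs from earlier turns
              (a hypothetical representation chosen for this sketch).
    """
    if history:
        # Multi-turn: replay earlier turns, then append the new question.
        turns = "".join(f"Human:{h}\nAssistant:{a}\n" for h, a in history)
        content = f"{turns}Human:{question}"
    else:
        # Single-turn: the instruction body is just the user input.
        content = question
    return f"{PROLOGUE}### Instruction:\n\n{content}\n\n### Response:\n\n"
```

For a single-turn query, call `build_prompt("your question")`. For a follow-up, pass the earlier exchange, for example `build_prompt("And then?", history=[("first question", "first answer")])`. The returned string is what should be tokenized and fed to the model.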

## Licence

The models in this repository are open-sourced under the Apache-2.0 license, and use of the model weights must comply with the Llama2 [MODEL LICENCE](LICENSE).

## Future Work

I will gradually release the following models in the near future:

1. Models trained on larger-scale SFT data.
2. Models based on Llama2 and Llama2-chat at 13B and below (since I only have V100s), for comparison.