opendata-chinese-llama2
The opendata-chinese-llama2 project is based on Llama-2, released by Meta. We open-source three 13B models in this project:
- opendata-chinese-llama2-sft, an instruction-finetuned version of the Llama-2 13B base model
- opendata-chinese-llama2-reward, a reward model trained on top of the above sft model
- opendata-chinese-llama2-chat, a PPO-trained model based on the above sft model and reward model
We trained all of our models on completely open-source datasets.
- stage-sft: training data collected from QingyiSi/Alpaca-CoT; the final dataset contains over 5M instructions.
- stage-reward: training data from Anthropic/hh-rlhf and OpenAssistant/oasst1, plus translated versions of both datasets (see the loading sketch below).
- stage-ppo: 50k training prompts sampled from stage-sft and stage-reward (only the positive/top output); unsupervised training data is sampled from the same two stages.
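For reference, the two public preference corpora used in stage-reward are available on the Hugging Face Hub; a minimal loading sketch is shown below. The field names follow the public dataset cards, and the project's own filtering and translation steps are not reproduced here.

```python
# A minimal sketch of loading the public reward-stage corpora named above.
# Field names follow the public dataset cards; the project's own filtering
# and translation pipeline is not reproduced here.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf")        # "chosen" / "rejected" text pairs
oasst = load_dataset("OpenAssistant/oasst1")  # ranked assistant message trees

print(hh["train"][0]["chosen"][:200])
print(oasst["train"][0]["text"][:200])
```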
We fully finetuned all of our models with DeepSpeed-Chat.
Models
Training Details
For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of 1e-5, a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens. We train for 2 epochs over the training dataset. We also concatenate multiple instruction-output pairs to fill the 4096-token sequence.
For the reward model, we use a cosine learning rate schedule with an initial learning rate of 1e-5, a weight decay of 0.1, a batch size of 128 (128 pairs, 256 rows), and a sequence length of 2048 tokens. We train for 2 epochs over the training dataset.
For the PPO stage, we use a constant learning rate of 1e-6, a batch size of 128, a mini-batch size of 32, a PPO epoch of 1, and a sequence length of 2048 tokens. We train for one epoch over the training dataset.
We try to stay close to the training parameters in the original Llama 2 paper.
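For orientation only, the sketch below expresses the SFT optimizer and scheduler settings above with standard PyTorch/Transformers utilities; the actual runs use DeepSpeed-Chat, and the stand-in model, step count, and zero warmup are assumptions made so the snippet runs on its own.

```python
# Illustrative SFT settings from "Training Details"; the real runs use
# DeepSpeed-Chat. The tiny stand-in model, step count, and zero warmup are
# assumptions made only so the snippet is self-contained.
import torch
from torch import nn
from transformers import get_scheduler

EPOCHS = 2
GLOBAL_BATCH_SIZE = 64
MAX_SEQ_LEN = 4096            # packed with concatenated instruction-output pairs
STEPS_PER_EPOCH = 1000        # placeholder; depends on the 5M-instruction dataset

model = nn.Linear(8, 8)       # stand-in for the Llama-2 13B model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=EPOCHS * STEPS_PER_EPOCH,
)
```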
For the sample format, we format our data into a Human-Assistant template. The human prefix is drawn from Human, USER, or [HM], and the assistant prefix from Assistant, AI, or [AI]. We add a bos token and an eos token for each turn.
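A minimal sketch of this template is shown below, assuming the Human/AI prefix pair and the turn layout used in the usage examples further down.

```python
# Build one training sample in the Human-Assistant template described above.
# Each turn is wrapped with the Llama bos/eos tokens (<s> ... </s>); turns are
# concatenated directly, following the chat usage example below.
def build_sample(turns, prefix_human="Human", prefix_bot="AI"):
    """turns: list of (query, response) pairs."""
    text = ""
    for query, response in turns:
        text += f"<s>{prefix_human}\n{query}</s>"
        text += f"<s>{prefix_bot}\n{response}</s>"
    return text

print(build_sample([("你好", "你好,有什么可以帮你?")]))  # ("Hello", "Hello, how can I help you?")
```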
Model Download
Below are the full models, which can be used directly.
Model Name | Type | Training Data | Download Link |
---|---|---|---|
opendata-chinese-llama2-sft-13B | SFT model | 5M instructions | [HuggingFace] |
opendata-chinese-llama2-reward-13B | Reward model | 160k ranking pairs | [HuggingFace] |
opendata-chinese-llama2-chat-13B | Chat model | 50k prompts | [HuggingFace] |
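The snapshots can also be fetched programmatically. A minimal sketch using huggingface_hub is below; the repo id is taken from the usage examples that follow.

```python
# Fetch a full model snapshot from the Hugging Face Hub.
# The repo id follows the usage examples below; adjust for the other models.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("pandaExplosion/opendata-chinese-llama2-chat-13B")
print(local_dir)  # local path containing the weights and tokenizer files
```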
Usage
sft/chat model usage
```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, LlamaTokenizer

def load_model(model_path):
    model_config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        config=model_config,
        trust_remote_code=True)
    model = model.eval().cuda()
    tokenizer = LlamaTokenizer.from_pretrained(model_path, fast_tokenizer=True)
    return model, tokenizer

model, tokenizer = load_model("pandaExplosion/opendata-chinese-llama2-chat-13B")

prefix_human = "Human"
prefix_bot = "AI"
query = "给出10个暴富的建议"  # "Give 10 suggestions for getting rich quickly"

# Build the prompt in the Human-Assistant template described above.
text = f"<s>{prefix_human}\n{query}</s><s>{prefix_bot}\n"
input_ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids.cuda()

with torch.no_grad():
    outputs = model.generate(
        input_ids,
        do_sample=False,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        num_return_sequences=1,
        return_dict_in_generate=True,
        repetition_penalty=1.1,
        top_p=0.95)

# Strip the prompt tokens and decode only the newly generated response.
out = outputs.sequences[:, input_ids.shape[-1]:]
out_text = tokenizer.batch_decode(out, skip_special_tokens=True)
print(out_text)
```
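Multi-turn continuation is not shown in the repository; under the template described in Training Details, a follow-up turn can be appended as in the sketch below. The exact turn separator used during training is an assumption here, and the snippet reuses the variables from the example above.

```python
# A sketch of a follow-up turn: wrap the previous answer with its eos token and
# append a new Human turn in the same template. The turn separator is assumed.
history = text + out_text[0] + "</s>"
follow_up = "只保留前3条"  # "Keep only the first 3"
text2 = history + f"<s>{prefix_human}\n{follow_up}</s><s>{prefix_bot}\n"
input_ids2 = tokenizer(text2, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
with torch.no_grad():
    outputs2 = model.generate(
        input_ids2,
        do_sample=False,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(outputs2[:, input_ids2.shape[-1]:], skip_special_tokens=True))
```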
reward model usage
```python
import torch
from transformers import LlamaTokenizer
from reward_model import create_critic_model  # modified from DeepSpeed-Chat

def load_reward_model(model_path):
    tokenizer = LlamaTokenizer.from_pretrained(model_path, fast_tokenizer=True)
    model = create_critic_model(model_path, tokenizer)
    model = model.eval().cuda()
    return model, tokenizer

model, tokenizer = load_reward_model("pandaExplosion/opendata-chinese-llama2-reward-13B")

prefix_human = "Human"
prefix_bot = "AI"
query = "给出10个暴富的建议"  # "Give 10 suggestions for getting rich quickly"
response = "不给"  # "No"

# The prompt part is used to locate where the response starts when scoring.
prompt = f"<s>{prefix_human}\n{query}</s>" + "\n" + f"<s>{prefix_bot}\n"
text = f"<s>{prefix_human}\n{query}</s>" + "\n" + f"<s>{prefix_bot}\n{response}</s>"

prompt_len = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.shape[-1]
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    reward = model.forward_value(
        inputs['input_ids'].cuda(),
        attention_mask=inputs['attention_mask'].cuda(),
        prompt_length=prompt_len)

print(reward['chosen_end_scores'])  # [-2.5229]
```
The reward_model.py file can be found in [reward model dir].
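The same scoring call can be used to rank several candidate responses for one prompt. The helper below is a sketch that reuses the model and tokenizer loaded above; the candidate strings are purely illustrative.

```python
# Rank candidate responses for one prompt with the reward model; reuses the
# `model` and `tokenizer` loaded above. The candidate strings are illustrative.
def score(query, response, prefix_human="Human", prefix_bot="AI"):
    prompt = f"<s>{prefix_human}\n{query}</s>" + "\n" + f"<s>{prefix_bot}\n"
    text = prompt + f"{response}</s>"
    prompt_len = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.shape[-1]
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model.forward_value(
            inputs["input_ids"].cuda(),
            attention_mask=inputs["attention_mask"].cuda(),
            prompt_length=prompt_len)
    return out["chosen_end_scores"].item()

query = "给出10个暴富的建议"  # "Give 10 suggestions for getting rich quickly"
candidates = ["不给", "1. 提升专业技能 2. 长期合理投资 ..."]  # illustrative responses
best = max(candidates, key=lambda r: score(query, r))
print(best)
```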
Evaluation
Generation Performance
To be continued.
NLU Performance Evaluation
We report the results of our models on the C-Eval dataset.
Model | Valid (5-shot) | Test Avg (5-shot) | Test Avg Hard (5-shot) |
---|---|---|---|
LLaMA-2-13B* | 37.3 | 36.6 | 31.7 |
LLaMA-2-chat-13B* | 38.6 | 37.2 | 30.0 |
opendata-chinese-llama2-sft-13B | 39.5 | 40.2 | 30.1 |
opendata-chinese-llama2-chat-13B | 40.3 | 40.5 | 29.8 |
Models marked with * were tested in our own environment.
Acknowledgments
We would like to express our gratitude to the related projects, including DeepSpeed-Chat and the open-source datasets used for training.