---
license: apache-2.0
tags:
- text2text-generation
pipeline_tag: text2text-generation
language:
- zh
- en
widget:
- text: |-
    Human: 使用python写一个二分查找的代码
    Assistant:
  example_title: code zh
- text: >-
    Human: Classify the sentiment of the following sentence into Positive,
    Neutral, or Negative:

    Super excited about teaching Stanford’s first course on Large Language
    Models! Check the syllabus out here

    Assistant:
  example_title: sentiment en
- text: |-
    Human: 今天天气怎么样，把这句话翻译成英语
    Assistant:
  example_title: translation zh-en
- text: |-
    Human: 怎么让自己精力充沛，列5点建议
    Assistant:
  example_title: brainstorming zh
- text: |-
    Human: 请以『春天的北京』为题写一首诗歌
    Assistant:
  example_title: generation zh
- text: |-
    Human: 明天就假期结束了，有点抗拒上班，应该怎么办？
    Assistant:
  example_title: brainstorming zh
- text: |-
    Human: 父母都姓吴，取一些男宝宝和女宝宝的名字
    Assistant:
  example_title: brainstorming zh
- text: |-
    Human: 推荐几本金庸的武侠小说
    Assistant:
  example_title: brainstorming zh
---

# Model Card for BELLE-7B-2M

## Model description
BELLE is based on Bloomz-7b1-mt and fine-tuned on 2M Chinese samples combined with 50,000 English samples from the open-source Stanford-Alpaca dataset, which gives it good Chinese instruction understanding and response generation capabilities.

The code for generating the Chinese data and other details are available in our GitHub project repository: https://github.com/LianjiaTech/BELLE.

We trained models on instruction datasets of different sizes (200,000, 600,000, 1,000,000, and 2,000,000 samples), producing the model versions shown below:
| Data size | 200,000 | 600,000 | 1,000,000 | 2,000,000 |
| ----- | ----- | ----- | ----- | ----- |
| Fine-tuned model | [BELLE-7B-0.2M](https://huggingface.co/BelleGroup/BELLE-7B-0.2M) | [BELLE-7B-0.6M](https://huggingface.co/BelleGroup/BELLE-7B-0.6M) | [BELLE-7B-1M](https://huggingface.co/BelleGroup/BELLE-7B-1M) | [BELLE-7B-2M](https://huggingface.co/BelleGroup/BELLE-7B-2M) |

## Training hyper-parameters
| Parameter | Value |
| ------ | ------ |
| Batch size | 64 |
| Learning rate | 3e-6 |
| Epochs | 3 |
| Weight_decay | 0.001 |
| Warmup_rate | 0.1 |
| LR_scheduler | linear |
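
To make the schedule settings concrete, here is a minimal sketch of a linear scheduler with warmup. It rests on assumptions the table does not confirm: that `Warmup_rate` is the fraction of total optimizer steps spent warming up, that the batch size of 64 is the global batch size, and that the rate decays linearly to zero. The function name `linear_warmup_decay_lr` and the 2,000,000-sample dataset size are illustrative only.

```python
import math


def linear_warmup_decay_lr(step, total_steps, peak_lr=3e-6, warmup_rate=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = round(total_steps * warmup_rate)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup steps
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / decay_steps


# Illustrative sizes only: 2,000,000 samples, batch size 64, 3 epochs
steps_per_epoch = math.ceil(2_000_000 / 64)
total_steps = steps_per_epoch * 3
warmup_steps = round(total_steps * 0.1)
print(steps_per_epoch, total_steps, warmup_steps)  # 31250 93750 9375

# Sample the schedule at the start, the end of warmup, the midpoint, and the end
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_warmup_decay_lr(step, total_steps))
```

Under these assumptions the learning rate peaks at 3e-6 after 9,375 steps and falls back to zero at step 93,750.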

## Using the model
Note that the input must be formatted as follows for both training and inference:
``` python
Human: {input} \n\nAssistant:
```
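
As a small sketch, the template can be wrapped in a helper so every call site builds prompts the same way. The name `build_prompt` is hypothetical and not part of the BELLE repository. The helper joins the pieces the way this README's inference code does, with no space between the input and the two newlines.

```python
def build_prompt(user_input: str) -> str:
    """Wrap a single user message in the Human/Assistant template."""
    # Strip surrounding whitespace so the two newlines directly follow the text
    return "Human: " + user_input.strip() + "\n\nAssistant:"


prompt = build_prompt("  Recommend a few wuxia novels  ")
print(repr(prompt))
```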

BELLE can be loaded with AutoModelForCausalLM.
``` python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./"  # Directory holding the local model files; change as needed
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Human:")
line = input()
while line:  # An empty line ends the session
    # Wrap the user input in the prompt template used during training
    inputs = 'Human: ' + line.strip() + '\n\nAssistant:'
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2)
    # The decoded text contains the prompt followed by the reply, so remove the prompt
    rets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print("Assistant:\n" + rets[0].strip().replace(inputs, ""))
    print("\n------------------------------------------------\nHuman:")
    line = input()
```

The example above does not use context from previous turns of the conversation. For multi-turn dialogue, concatenate the previous turns in the format `Human:{input}\n\nAssistant:{output}` and pass the result to the `generate` function.
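
The concatenation could look like the sketch below. The helper name `build_multiturn_prompt` is hypothetical, and two details are assumptions the text does not specify: turns are separated by a blank line (`\n\n`), and a space follows `Human:` as in the single-turn example.

```python
from typing import List, Tuple


def build_multiturn_prompt(history: List[Tuple[str, str]], new_input: str) -> str:
    """Join earlier (input, output) turns and the new input into one prompt."""
    parts = []
    for user_msg, assistant_msg in history:
        # Each finished turn carries both the user input and the model output
        parts.append("Human: " + user_msg.strip() + "\n\nAssistant:" + assistant_msg.strip())
    # The final turn ends with an open "Assistant:" for the model to complete
    parts.append("Human: " + new_input.strip() + "\n\nAssistant:")
    return "\n\n".join(parts)  # assumed separator between turns


history = [("Name a novel by Jin Yong", "The Legend of the Condor Heroes")]
print(build_multiturn_prompt(history, "Who is its main character?"))
```

Because the prompt grows with every turn, a long conversation may need its oldest turns dropped to stay within the model's context length.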

## Limitations
The model, trained on the current base model and data, still has a few issues:

1. It may produce factual errors when following instructions that involve facts.

2. It occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.

3. Its reasoning and coding abilities need improvement.

Given these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated via this project for research purposes only. Commercial use and other potentially harmful uses are not allowed.

## Some use cases

| Human | Assistant |
| ----- | ------ |
| 今天天气怎么样，把这句话翻译成英语 | What is the weather like today? |
| Classify the sentiment of the following sentence into Positive, Neutral, or Negative: <br/>Super excited about teaching Stanford’s first course on Large Language Models! Check the syllabus out here | Positive |
| 怎么让自己精力充沛，列5点建议 | 1. 充足的睡眠：每晚保证7-8小时的充足睡眠可以让你保持精神饱满。<br/><br/>2. 健康饮食：均衡的营养摄入可以让你的身体得到充分的能量和养分。<br/><br/>3. 多喝水：在体内保持水分平衡非常重要，多饮水可以帮助你维持身体的水平衡。<br/><br/>4. 适量的运动：适当的锻炼可以使你的身体产生更多的能量和活力，同时也可以帮助你缓解压力。<br/><br/>5. 管理时间：合理规划时间可以使你有更多的时间去休息、放松和享受生活。 |
| 请以『春天的北京』为题写一首诗歌 | 春风拂面，花香四溢，<br/>万物复苏，生机勃勃。<br/>北京的春天，美不胜收，<br/>阳光明媚，鸟语悠扬。<br/><br/>公园里，绿树成荫，<br/>鲜花盛开，蝴蝶翩跹。<br/>人们漫步其中，享受着自然的美好，<br/>感受到了生命的力量与活力。<br/><br/>故宫内，古建筑保存完好，<br/>皇家气息扑鼻而来。<br/>游客们在这里，领略了历史的厚重，<br/>也感受到了文化的魅力与神奇。<br/><br/>夜晚的城市，灯火辉煌，<br/>车水马龙，人来人往。<br/>北京的夜景，令人心醉神迷，<br/>让人沉浸在繁华和宁静之中。 |
| 明天就假期结束了，有点抗拒上班，应该怎么办？ | 1. 提前规划好自己的工作日程和任务，避免拖延。<br/>2. 和同事或朋友交流一下，听听他们的建议和意见，让自己更加放松心情。<br/>3. 给自己设定一些小目标，完成它们后可以获得成就感和满足感。<br/>4. 在工作中寻找乐趣和挑战，让自己更有动力去完成任务。<br/>5. 如果还是无法克服情绪上的不适，可以考虑寻求心理咨询帮助。 |
| 父母都姓吴，取一些男宝宝和女宝宝的名字 | - 男宝宝：吴雨晨、吴子墨、吴泽宇、吴博文、吴瑞阳。<br/>-女宝宝：吴思婷、吴雅晴、吴美娜、吴欣怡、吴婧仪。 |
| 写一篇散文诗，主题是春雨，想象自己是春雨，和中国古代诗人交流 | 我是一滴春雨<br/>落在了你的窗前<br/>你静静地倾听着我的声音<br/>仿佛在思考着什么<br/><br/>我想起了唐代诗人白居易的《赋得古原草送别》<br/>他写道：“离离原上草，一岁一枯荣。<br/>野火烧不尽，春风吹又生。”<br/>是啊，春天来了，万物复苏<br/>而我也像这些草一样，<br/>经历了一年的成长和凋零<br/><br/>现在，我再次回到大地母亲的怀抱中<br/>感受到了她的温暖和慈爱<br/>她用柔和的风儿吹拂着我<br/>让我感受到生命的活力和希望<br/><br/>我想起了《红楼梦》中黛玉所说：“花谢花飞花满天，红消香断有谁怜？”<br/>是啊，人生就像这花朵一般<br/>总会有盛开和凋零的时候<br/>但只要我们心存善念、努力奋斗<br/>就一定能够迎接更美好的未来<br/><br/>于是我决定继续前进<br/>为自己的生命注入新的活力<br/>让自己在未来的日子里更加精彩<br/><br/>中国古代的诗歌与现代的散文诗<br/>有着不同的表达方式和文化内涵<br/>但我相信它们都蕴含着深刻的哲理和启示<br/>让我们在人生的道路上不断前行 |


	***

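## Model description
The BELLE model is based on Bloomz-7b1-mt and was trained on 2.0M Chinese samples together with the 50,000 English samples open-sourced by Stanford-Alpaca, achieving good Chinese instruction understanding and response generation.

The code for generating the Chinese data and other details are in our GitHub project repository: https://github.com/LianjiaTech/BELLE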

We trained models on instruction-learning datasets of different sizes (200,000, 600,000, 1,000,000, and 2,000,000 samples) and obtained the following model versions:
| Data size | 200,000 | 600,000 | 1,000,000 | 2,000,000 |
| ----- | ----- | ----- | ----- | ----- |
| Fine-tuned model | [BELLE-7B-0.2M](https://huggingface.co/BelleGroup/BELLE-7B-0.2M) | [BELLE-7B-0.6M](https://huggingface.co/BelleGroup/BELLE-7B-0.6M) | [BELLE-7B-1M](https://huggingface.co/BelleGroup/BELLE-7B-1M) | [BELLE-7B-2M](https://huggingface.co/BelleGroup/BELLE-7B-2M) |

## Training hyper-parameters
| Parameter | Value |
| ------ | ------ |
| Batch size | 64 |
| Learning rate | 3e-6 |
| Epochs | 3 |
| Weight_decay | 0.001 |
| Warmup_rate | 0.1 |
| LR_scheduler | linear |

## Using the model
Note that, for both training and inference, the model input should be formatted as follows:
``` python
Human: {input} \n\nAssistant:
```

The model can be loaded and used directly through AutoModelForCausalLM.

``` python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./"  # Directory holding the local model files; change as needed
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Human:")
line = input()
while line:  # An empty line ends the session
    # Wrap the user input in the prompt template used during training
    inputs = 'Human: ' + line.strip() + '\n\nAssistant:'
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2)
    # The decoded text contains the prompt followed by the reply, so remove the prompt
    rets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print("Assistant:\n" + rets[0].strip().replace(inputs, ""))
    print("\n------------------------------------------------\nHuman:")
    line = input()
```

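The example above does not use the preceding conversation. For multi-turn dialogue understanding, concatenate the preceding turns in the format `Human:{input}\n\nAssistant:{output}` and pass the result to the `generate` function.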

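## Limitations and usage restrictions
The SFT model trained on the current data and base model still has the following issues:

1. It may give factually incorrect answers to instructions that involve facts.

2. It cannot reliably identify harmful instructions and may therefore produce harmful content.

3. Its abilities in scenarios involving reasoning, code, and similar tasks still need improvement.

Given these limitations, we require that developers use our open-sourced code, data, and model, and any derivatives later generated with this project, for research purposes only. They must not be used commercially or for any other purpose that would harm society.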

## Some examples

| Human | Assistant |
| ----- | ------ |
| 今天天气怎么样，把这句话翻译成英语 | What is the weather like today? |
| Classify the sentiment of the following sentence into Positive, Neutral, or Negative: <br/>Super excited about teaching Stanford’s first course on Large Language Models! Check the syllabus out here | Positive |
| 怎么让自己精力充沛，列5点建议 | 1. 充足的睡眠：每晚保证7-8小时的充足睡眠可以让你保持精神饱满。<br/><br/>2. 健康饮食：均衡的营养摄入可以让你的身体得到充分的能量和养分。<br/><br/>3. 多喝水：在体内保持水分平衡非常重要，多饮水可以帮助你维持身体的水平衡。<br/><br/>4. 适量的运动：适当的锻炼可以使你的身体产生更多的能量和活力，同时也可以帮助你缓解压力。<br/><br/>5. 管理时间：合理规划时间可以使你有更多的时间去休息、放松和享受生活。 |
| 请以『春天的北京』为题写一首诗歌 | 春风拂面，花香四溢，<br/>万物复苏，生机勃勃。<br/>北京的春天，美不胜收，<br/>阳光明媚，鸟语悠扬。<br/><br/>公园里，绿树成荫，<br/>鲜花盛开，蝴蝶翩跹。<br/>人们漫步其中，享受着自然的美好，<br/>感受到了生命的力量与活力。<br/><br/>故宫内，古建筑保存完好，<br/>皇家气息扑鼻而来。<br/>游客们在这里，领略了历史的厚重，<br/>也感受到了文化的魅力与神奇。<br/><br/>夜晚的城市，灯火辉煌，<br/>车水马龙，人来人往。<br/>北京的夜景，令人心醉神迷，<br/>让人沉浸在繁华和宁静之中。 |
| 明天就假期结束了，有点抗拒上班，应该怎么办？ | 1. 提前规划好自己的工作日程和任务，避免拖延。<br/>2. 和同事或朋友交流一下，听听他们的建议和意见，让自己更加放松心情。<br/>3. 给自己设定一些小目标，完成它们后可以获得成就感和满足感。<br/>4. 在工作中寻找乐趣和挑战，让自己更有动力去完成任务。<br/>5. 如果还是无法克服情绪上的不适，可以考虑寻求心理咨询帮助。 |
| 父母都姓吴，取一些男宝宝和女宝宝的名字 | - 男宝宝：吴雨晨、吴子墨、吴泽宇、吴博文、吴瑞阳。<br/>-女宝宝：吴思婷、吴雅晴、吴美娜、吴欣怡、吴婧仪。 |
| 写一篇散文诗，主题是春雨，想象自己是春雨，和中国古代诗人交流 | 我是一滴春雨<br/>落在了你的窗前<br/>你静静地倾听着我的声音<br/>仿佛在思考着什么<br/><br/>我想起了唐代诗人白居易的《赋得古原草送别》<br/>他写道：“离离原上草，一岁一枯荣。<br/>野火烧不尽，春风吹又生。”<br/>是啊，春天来了，万物复苏<br/>而我也像这些草一样，<br/>经历了一年的成长和凋零<br/><br/>现在，我再次回到大地母亲的怀抱中<br/>感受到了她的温暖和慈爱<br/>她用柔和的风儿吹拂着我<br/>让我感受到生命的活力和希望<br/><br/>我想起了《红楼梦》中黛玉所说：“花谢花飞花满天，红消香断有谁怜？”<br/>是啊，人生就像这花朵一般<br/>总会有盛开和凋零的时候<br/>但只要我们心存善念、努力奋斗<br/>就一定能够迎接更美好的未来<br/><br/>于是我决定继续前进<br/>为自己的生命注入新的活力<br/>让自己在未来的日子里更加精彩<br/><br/>中国古代的诗歌与现代的散文诗<br/>有着不同的表达方式和文化内涵<br/>但我相信它们都蕴含着深刻的哲理和启示<br/>让我们在人生的道路上不断前行 |