---
license: other
license_name: license
license_link: >-
https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf
---
<!-- <div align="center">
<h1>
✨Skywork
</h1>
</div> -->
<div align="center"><img src="misc/skywork_logo.jpeg" width="550"/></div>
<p align="center">
👨💻 <a href="https://github.com/SkyworkAI/Skywork" target="_blank">Github</a> • 🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a>• 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
</p>
<div align="center">
[🎉 The Tiangong online chat platform is now open to the public](https://sso.tiangong.cn/?redirect=https://model-platform.tiangong.cn/overview&client_id=200005)
</div>
<div align="center">
[![GitHub Stars](https://img.shields.io/github/stars/SkyworkAI/Skywork)](https://github.com/SkyworkAI/Skywork/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/SkyworkAI/Skywork)](https://github.com/SkyworkAI/Skywork/fork)
</div>
# Introduction
**Skywork-13B-Base-XT** denotes the intermediate checkpoints of the Skywork-13B-Base model, taken from the first stage of pre-training. Each checkpoint is saved after the model has been trained on X trillion tokens, where X is one of 0.5, 1, 1.5, 2, 2.5, or 3.

For more details on training and evaluation, please refer to our [technical report](http://arxiv.org/abs/2310.19341), the [Skymath](https://arxiv.org/abs/2310.16713) paper, and the [SkyworkMM](https://github.com/will-singularity/Skywork-MM/blob/main/skywork_mm.pdf) paper.
## Training Data
We built a data cleaning pipeline to filter low-quality, harmful, and sensitive content out of the text data. The Skywork-13B-Base model is trained on 3.2 trillion tokens of cleaned, high-quality Chinese, English, and code data: English accounts for 52.2%, Chinese for 39.6%, and code for 8%. This mix is intended to balance Chinese and English performance while preserving coding ability.
| | Category | Percentage |
|-------------|------------------|------------|
| **English** | Webpages | 39.8% |
| | Books | 3.6% |
| | Academic Papers | 3.0% |
| | Encyclopedia | 0.5% |
| | Miscellany | 2.9% |
| **Chinese** | Webpages | 30.4% |
| | Social Media | 5.5% |
| | Encyclopedia | 0.8% |
| | Miscellany | 3.1% |
| **Other Lang.** | Encyclopedia | 2.4% |
| **Code** | Github | 8.0% |
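
As a quick sanity check, the rows of the table above can be summed by group. The following is a minimal sketch: the `MIXTURE` list and the `group_totals` function are illustrative names, and the values are copied from the table. Group totals computed this way may differ slightly from the rounded headline figures quoted in the text, which appear to count the "Other Lang." row together with English.

```python
from collections import defaultdict

# (group, category, percentage) rows copied from the table above.
MIXTURE = [
    ("English", "Webpages", 39.8),
    ("English", "Books", 3.6),
    ("English", "Academic Papers", 3.0),
    ("English", "Encyclopedia", 0.5),
    ("English", "Miscellany", 2.9),
    ("Chinese", "Webpages", 30.4),
    ("Chinese", "Social Media", 5.5),
    ("Chinese", "Encyclopedia", 0.8),
    ("Chinese", "Miscellany", 3.1),
    ("Other Lang.", "Encyclopedia", 2.4),
    ("Code", "Github", 8.0),
]


def group_totals(rows):
    """Sum the percentage column per top-level group, rounded to one decimal."""
    totals = defaultdict(float)
    for group, _category, pct in rows:
        totals[group] += pct
    return {group: round(total, 1) for group, total in totals.items()}


if __name__ == "__main__":
    totals = group_totals(MIXTURE)
    for group, pct in totals.items():
        print(f"{group}: {pct}%")
    print(f"Total: {round(sum(totals.values()), 1)}%")
```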
## Model Structure
Compared with Llama-2-13B, Skywork-13B uses a thinner and deeper network with 52 layers. The FFN Dim and Hidden Dim are reduced to 12288 and 4608, respectively, so that the parameter count stays comparable to that of the original Llama-2-13B. In our preliminary experiments, the thinner and deeper structure generalized better under large-batch-size training. The two models compare as follows:
| Model Structure | Llama-2-13B | Skywork-13B |
|----------------------|:----:|:-----------:|
| Vocab. Size | 32,000 | 65,536 |
| Hidden Dim. | 5,120 | 4,608 |
| FFN Dim. | 13,696 | 12,288 |
| Head Dim. | 128 | 128 |
| Num. Heads | 40 | 36 |
| Num. Layers | 40 | 52 |
| Seq. Len. | 4,096 | 4,096 |
| Positional Embedding | RoPE | RoPE |
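
To illustrate why the two configurations end up with a similar parameter budget, here is a rough back-of-the-envelope estimate using the values from the table above. It is a sketch under stated assumptions, not the authors' own accounting: it assumes a Llama-style decoder layer (four attention projections of size hidden × hidden, a gated FFN with three hidden × FFN matrices, two norm weight vectors, no biases) and separate input and output embedding matrices. `Config` and `estimate_params` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    vocab_size: int
    hidden_dim: int
    ffn_dim: int
    num_heads: int
    head_dim: int
    num_layers: int


# Values copied from the comparison table above.
LLAMA2_13B = Config(32_000, 5_120, 13_696, 40, 128, 40)
SKYWORK_13B = Config(65_536, 4_608, 12_288, 36, 128, 52)


def estimate_params(cfg: Config) -> int:
    """Approximate parameter count under the assumptions stated above."""
    attn = 4 * cfg.hidden_dim * cfg.hidden_dim   # Q, K, V, O projections
    ffn = 3 * cfg.hidden_dim * cfg.ffn_dim       # gate, up, down projections
    norms = 2 * cfg.hidden_dim                   # two norm weight vectors per layer
    per_layer = attn + ffn + norms
    # Input embedding plus output head (assumed not to share weights).
    embeddings = 2 * cfg.vocab_size * cfg.hidden_dim
    final_norm = cfg.hidden_dim
    return cfg.num_layers * per_layer + embeddings + final_norm


if __name__ == "__main__":
    for name, cfg in [("Llama-2-13B", LLAMA2_13B), ("Skywork-13B", SKYWORK_13B)]:
        # In both columns of the table, heads x head dim equals the hidden dim.
        assert cfg.num_heads * cfg.head_dim == cfg.hidden_dim
        print(f"{name}: ~{estimate_params(cfg) / 1e9:.2f}B parameters")
```

Under these assumptions both estimates fall between roughly 12.9 and 13.9 billion parameters: the larger vocabulary and extra layers of Skywork-13B are largely offset by its narrower hidden and FFN dimensions.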
## Tokenizer
We tokenize the data with Byte-Pair Encoding (BPE) using a vocabulary of 65536 entries: 32000 Latin-based words and subwords, 8000 Chinese characters and Unicode symbols, 25519 Chinese words, and 17 reserved symbols.
| Category | Size |
|---------------------------------|--------|
| Latin based words & subwords | 32000 |
| Chinese characters & Unicode symbols | 8000 |
| Chinese words | 25519 |
| Reserved symbols | 17 |
| **Total** | **65536** |
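
The breakdown above can be checked with a few lines of arithmetic. This sketch only restates the table; `VOCAB_BREAKDOWN` and `vocab_total` are illustrative names.

```python
# Vocabulary breakdown copied from the table above.
VOCAB_BREAKDOWN = {
    "Latin based words & subwords": 32_000,
    "Chinese characters & Unicode symbols": 8_000,
    "Chinese words": 25_519,
    "Reserved symbols": 17,
}


def vocab_total(breakdown):
    """Return the total vocabulary size implied by the per-category counts."""
    return sum(breakdown.values())


if __name__ == "__main__":
    total = vocab_total(VOCAB_BREAKDOWN)
    # Report the total and whether it is exactly 2**16.
    print(total, total == 2 ** 16)
```

The total equals 2^16, so every token ID fits in an unsigned 16-bit integer.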
# Quickstart
We have open-sourced the model parameters, configuration files, tokenizer, and more on Hugging Face and ModelScope.
## Requirements
- Python 3.8 or later
- PyTorch 2.0 or later
- CUDA 11.4 or later (recommended)

To install the Python dependencies for the Skywork-13B-Base, Skywork-13B-Chat, and Skywork-13B-Math models, run:
```shell
pip install -r requirements.txt
```
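
To confirm that an environment meets the minimum versions listed above, a check along these lines can be used. It is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers written for this example, and the PyTorch and CUDA checks are skipped when `torch` is not installed.

```python
import sys


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into a (major, minor) tuple."""
    core = text.split("+")[0]  # drop local build tags like '+cu118'
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def meets_minimum(found, required):
    """Compare two (major, minor) tuples numerically, not as strings."""
    return tuple(found) >= tuple(required)


if __name__ == "__main__":
    print("Python >= 3.8:", meets_minimum(sys.version_info[:2], (3, 8)))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch >= 2.0:", meets_minimum(parse_version(torch.__version__), (2, 0)))
        # torch.version.cuda is None on CPU-only builds.
        cuda = torch.version.cuda
        if cuda is not None:
            print("CUDA >= 11.4:", meets_minimum(parse_version(cuda), (11, 4)))
```

Comparing integer tuples avoids the pitfall of string comparison, where "3.10" would sort before "3.8".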
### Base Model Inference
The following example runs inference with Skywork-13B-Base-0.5T:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the checkpoint directory, as given in the original example.
model_path = "Skywork-13B-Base-Intermediate/model_hubs/Skywork-13B-Base-0.5T/"

# trust_remote_code=True lets transformers load the custom code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True).eval()

# Prompt: "The capital of Shaanxi is Xi'an"; the base model continues the text.
inputs = tokenizer('陕西的省会是西安', return_tensors='pt').to(model.device)
response = model.generate(inputs.input_ids, max_length=128)
print(tokenizer.decode(response.cpu()[0], skip_special_tokens=True))
```
# Declaration and License Agreement
## Declaration
We hereby declare that the Skywork model must not be used for any activity that threatens national or societal security or that is unlawful. We also ask users not to deploy the Skywork model in internet services that have not undergone appropriate security review and registration. We hope that all users will adhere to this principle so that technology develops in a regulated and lawful environment.

We have done our utmost to ensure the compliance of the data used to train the model. Despite these efforts, the complexity of the model and data means that unforeseen issues may remain. We therefore assume no responsibility for any problems arising from the use of the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly utilized.
## License Agreement
Community use of the Skywork model is governed by the [Skywork Community License](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf). The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, no separate application is required, but please read the [Skywork Community License](https://github.com/SkyworkAI/Skywork/blob/main/Skywork%20Community%20License.pdf) carefully and comply strictly with its terms and conditions.
# Contact Us and Citation
If you find our work helpful, please cite our papers:
```
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
```
@article{skyworkmath,
title={SkyMath: Technical Report},
author={Liu Yang and Haihua Yang and Wenjun Cheng and Lei Lin and Chenxia Li and Yifu Chen and Lunan Liu and Jianfei Pan and Tianwen Wei and Biye Li and Liang Zhao and Lijie Wang and Bo Zhu and Guoliang Li and Xuejie Wu and Xilin Luo and Rui Hu},
journal={arXiv preprint arXiv:2310.16713},
url={https://arxiv.org/abs/2310.16713},
year={2023}
}
```
```
@article{Skywork_Multi-Modal_Group_Empirical_Study_Towards_2023,
author = {Skywork Multi-Modal Group},
month = sep,
title = {{Empirical Study Towards Building An Effective Multi-Modal Large Language Model}},
year = {2023}
}
```