|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
tags: |
|
- nlp |
|
- llm |
|
--- |
|
# K2: a Fully Transparent OSS Language Model at Llama 2 Performance Using 35% Less Compute
|
|
|
LLM360 demystifies the training recipe used for Llama 2 70B with K2. K2 reaches performance comparable to Llama 2 70B with 65B parameters, trained on around 1.4T tokens, resulting in a recipe that uses approximately 35% less compute.
|
|
|
## Evaluations |
|
<center><img src="eval_table_temp.png" alt="eval table"/></center> |
|
|
|
## Datasets and Mix |
|
|
|
The following data mix was used to train K2 and achieve results in line with Llama 2 70B; a short script reproducing the table's arithmetic follows it. The full data sequence will be available soon.
|
|
|
| Dataset | Starting Tokens | Multiplier | Total Tokens | % of Total |
|
| ----------- | ----------- | ----------- | ----------- | ----------- | |
|
| dm-math | 4.33B | 3x | 13B | 1% | |
|
| pubmed-abstracts | 4.77B | 3x | 14.3B | 1.1% | |
|
| uspto | 4.77B | 3x | 14.3B | 1.1% | |
|
| pubmed-central | 26B | 1x | 26B | 2% | |
|
| [redpajama.arxiv](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 27.3B | 1x | 27.3B | 2.1% | |
|
| [starcoder.spm](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% | |
|
| [starcoder.fim](https://huggingface.co/datasets/bigcode/starcoderdata) | 67.6B | 0.5x | 33.8B | 2.6% | |
|
| [redpajama.stackexchange](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 61.1B | 1x | 61.1B | 4.7% | |
|
| [starcoder](https://huggingface.co/datasets/bigcode/starcoderdata) | 132.6B | 0.5x | 66.3B | 5.1% | |
|
| [pile-of-law](https://huggingface.co/datasets/pile-of-law/pile-of-law) | 76.7B | 1x | 76.7B | 5.9% | |
|
| [redpajama.book](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 80.6B | 1x | 80.6B | 6.2% | |
|
| s2orc | 107.9B | 1x | 107.9B | 8.3% | |
|
| [redpajama.wikipedia](https://huggingface.co/datasets/cerebras/SlimPajama-627B) | 22.1B | 6x | 132.6B | 10.2% | |
|
| [refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 612.3B | 1x | 612.3B | 47.1% | |
|
| Totals | - | - | 1.3T | 100% | |
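As a quick consistency check, the per-dataset totals and percentages above follow from multiplying each dataset's starting token count by its multiplier. A minimal, illustrative Python sketch (counts copied from the table; this script is not part of the K2 tooling):

```python
# Reproduce the K2 data-mix arithmetic: total = starting_tokens * multiplier,
# then each dataset's share of the overall mix.
mix = {  # dataset: (starting tokens in billions, multiplier)
    "dm-math": (4.33, 3),
    "pubmed-abstracts": (4.77, 3),
    "uspto": (4.77, 3),
    "pubmed-central": (26.0, 1),
    "redpajama.arxiv": (27.3, 1),
    "starcoder.spm": (67.6, 0.5),
    "starcoder.fim": (67.6, 0.5),
    "redpajama.stackexchange": (61.1, 1),
    "starcoder": (132.6, 0.5),
    "pile-of-law": (76.7, 1),
    "redpajama.book": (80.6, 1),
    "s2orc": (107.9, 1),
    "redpajama.wikipedia": (22.1, 6),
    "refinedweb": (612.3, 1),
}

totals = {name: start * mult for name, (start, mult) in mix.items()}
grand_total = sum(totals.values())  # 1300.0B, i.e. the 1.3T in the table

for name, total in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name:25s} {total:7.1f}B  {100 * total / grand_total:5.1f}%")
print(f"{'Total':25s} {grand_total:7.1f}B")
```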
|
|
|
## First 10 Checkpoints |
|
| Checkpoints | | |
|
| ----------- | ----------- | |
|
| [Checkpoint 360](https://huggingface.co/LLM360/K2/tree/ckpt_360) | [Checkpoint 355](https://huggingface.co/LLM360/K2/tree/ckpt_355) | |
|
| [Checkpoint 359](https://huggingface.co/LLM360/K2/tree/ckpt_359) | [Checkpoint 354](https://huggingface.co/LLM360/K2/tree/ckpt_354) | |
|
| [Checkpoint 358](https://huggingface.co/LLM360/K2/tree/ckpt_358) | [Checkpoint 353](https://huggingface.co/LLM360/K2/tree/ckpt_353) | |
|
| [Checkpoint 357](https://huggingface.co/LLM360/K2/tree/ckpt_357) | [Checkpoint 352](https://huggingface.co/LLM360/K2/tree/ckpt_352) | |
|
| [Checkpoint 356](https://huggingface.co/LLM360/K2/tree/ckpt_356) | [Checkpoint 351](https://huggingface.co/LLM360/K2/tree/ckpt_351) | |
|
|
|
To list all checkpoint branches after cloning the repository: `git branch -a`
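Each checkpoint lives on its own branch of this repository, so a specific training snapshot can be loaded by passing the branch name as the `revision` argument in `transformers`. A minimal sketch (branch `ckpt_360` taken from the table above; loading a 65B model requires substantial memory):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Point `revision` at a checkpoint branch to load that training snapshot.
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2", revision="ckpt_360")
model = AutoModelForCausalLM.from_pretrained("LLM360/K2", revision="ckpt_360")
```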
|
|
|
## Additional Artifacts |
|
We are preparing release-caliber artifacts for the dataset, code, and analysis, which will be released over the next few weeks.
|
|
|
|
|
## Model Description |
|
|
|
- **Model type:** 65-billion-parameter language model with the same architecture as LLaMA.
|
- **Language(s) (NLP):** English |
|
- **License:** Apache 2.0 |
|
- **Resources for more information:**
  - Training Code: TBD
  - Data Preparation: TBD
  - Metrics: TBD
  - Fully processed K2 pretraining dataset: TBD
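For reference, a minimal text-generation sketch with `transformers`. The `torch_dtype` and `device_map` settings are illustrative assumptions for fitting a 65B model across available GPUs, not an official recipe; `device_map="auto"` requires the `accelerate` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/K2")
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2",
    torch_dtype=torch.float16,  # assumption: half precision to reduce memory
    device_map="auto",          # shard across available GPUs (needs `accelerate`)
)

inputs = tokenizer("The highest mountain on Earth is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```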
|
|
|
|
|
## About LLM360 |
|
LLM360 is an initiative for comprehensive and fully open-sourced LLMs, |
|
where all training details, model checkpoints, intermediate results, and |
|
additional analyses are made available to the community. Our goal is to advance |
|
the field by inviting the community to deepen the understanding of LLMs |
|
together. As the first step of the LLM360 project, we release all intermediate
|
model checkpoints, our fully-prepared pre-training dataset, all source code and |
|
configurations, and training details. We are |
|
committed to continually pushing the boundaries of LLMs through this open-source |
|
effort. |
|
|
|
[Visit us](https://www.llm360.ai/) |
|
|