---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---

**Update Log**

- 2024.05.16: Released Solar-Ko-Recovery

# **Solar-Ko-Recovery-11B** 🌟❤️🩹

Solar-Ko-Recovery-11B aims to recover Solar's capability in Korean by rearranging and retraining the embeddings and the LM head, featuring an expanded vocabulary and the inclusion of a Korean+English corpus for enhanced representation.

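The snippet below is a minimal loading-and-generation sketch with 🤗 Transformers; the repository id and the generation settings are illustrative assumptions, not official release details.

```python
# Minimal sketch: load Solar-Ko-Recovery and generate a short completion.
# NOTE: the repository id below is an assumption for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "beomi/Solar-Ko-Recovery-11B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # ~11B parameters; bf16 keeps memory manageable
    device_map="auto",
)

prompt = "안녕하세요, 오늘은 날씨가 좋네요."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
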
## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Solar-Ko-Recovery is available in a single parameter size: 11B (10.99B🤣).

**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:**

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

| | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| Solar-Ko-Recovery | *A curated mix of Korean+English corpora* | 10.99B | 4k | O | >30B | 5e-5 |

> NOTE: Only the embedding layer and the LM head are trained; all other weights are kept frozen.

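As a rough illustration of this setup (an assumption about how it could be reproduced, not the exact training recipe), the sketch below freezes every parameter of a Solar-style causal LM except the input embeddings and the LM head:

```python
# Sketch (assumption): "embeddings + LM head only" training setup.
# The base model id and vocab size come from this card; the recipe itself is illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("upstage/SOLAR-10.7B-v1.0")
model.resize_token_embeddings(64000)  # expand to the solar-1-mini-tokenizer vocab size

# Freeze everything, then un-freeze only the input embeddings and the LM head.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e9:.2f}B")
```
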
**Vocab Expansion**

Vocab expansion was conducted on an edited version of [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Solar | 32000 | Sentencepiece BPE |
| **solar-1-mini-tokenizer** | 64000 | Sentencepiece BPE. Added Ko/JP vocabs |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |

**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**

- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |

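The token counts above can be reproduced with a short script like the one below; the repository ids are assumptions for illustration (the original Solar tokenizer is taken from upstage/SOLAR-10.7B-v1.0).

```python
# Sketch: compare token counts between the original Solar tokenizer and Solar-Ko-Recovery.
# Repository ids are assumptions for illustration.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
ko_tok = AutoTokenizer.from_pretrained("beomi/Solar-Ko-Recovery-11B")

sentences = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]

for text in sentences:
    base_tokens = base_tok.tokenize(text)
    ko_tokens = ko_tok.tokenize(text)
    print(f"{text}")
    print(f"  SOLAR-10.7B      : {len(base_tokens)} tokens")
    print(f"  Solar-Ko-Recovery: {len(ko_tokens)} tokens")
```
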
# LICENSE

Apache 2.0

# **Model Benchmark**

## LM Eval Harness - Korean

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores

TBD

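Until the scores are posted, the sketch below shows how a comparable 5-shot run could be launched with lm-evaluation-harness (v0.4+); the task selection (KoBEST) and the repository id are assumptions, not the exact configuration used for this card.

```python
# Sketch (assumption): 5-shot Korean evaluation with EleutherAI's lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=beomi/Solar-Ko-Recovery-11B,dtype=bfloat16",  # assumed repo id
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag", "kobest_sentineg"],  # assumed task set
    num_fewshot=5,
)
print(results["results"])
```
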
## Citation

TBD

## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.