---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---
**Update Log**
- 2024.05.16: Released Solar-Ko-Recovery
# **Solar-Ko-Recovery-11B** 🌟❤️🩹
Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by rearranging the embeddings and LM head. It features an expanded vocabulary and is trained on a Korean+English corpus for enhanced representation.
## Model Details
**Model Developers:** Junbum Lee (Beomi)
**Variations:** Solar-Ko-Recovery is available in one parameter size: 11B (10.99B🤣).
**Input:** The model accepts only text input.
**Output:** The model produces text output exclusively.
**Model Architecture:**
Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| |Training Data|Parameters|Context Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English Corpora*|10.8B|4k|O|>30B*|5e<sup>-5</sup>|
> NOTE: Only the embedding layer and the LM head layer are trained.
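
The note above can be illustrated with a minimal sketch of freezing everything except the embedding and LM head weights. This is not the training code used for this model. The name fragments `embed_tokens` and `lm_head` follow the Llama/Mistral-style parameter naming in `transformers` and are an assumption here, and `_ToyModel` is a hypothetical stand-in so the snippet runs without downloading any weights.

```python
from types import SimpleNamespace

# Assumed name fragments (Llama/Mistral-style naming); adjust for other architectures.
TRAINABLE_KEYS = ("embed_tokens", "lm_head")


def freeze_all_but_embeddings(model):
    """Leave requires_grad=True only on embedding and LM head parameters."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in TRAINABLE_KEYS)
        if param.requires_grad:
            trainable.append(name)
    return trainable


class _ToyModel:
    """Hypothetical stand-in exposing named_parameters() like a torch.nn.Module."""

    def __init__(self):
        names = [
            "model.embed_tokens.weight",
            "model.layers.0.self_attn.q_proj.weight",
            "model.norm.weight",
            "lm_head.weight",
        ]
        self._params = {n: SimpleNamespace(requires_grad=True) for n in names}

    def named_parameters(self):
        return self._params.items()


toy = _ToyModel()
trainable_names = freeze_all_but_embeddings(toy)
print(trainable_names)
```

With a real model, the same function could be applied to the object returned by `AutoModelForCausalLM.from_pretrained(...)` before building the optimizer, since it relies only on `named_parameters()` and `requires_grad`.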
**Vocab Expansion**
Vocab expansion is conducted on an edited version of [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.
| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Solar | 32000 | Sentencepiece BPE |
| **solar-1-mini-tokenizer** | 64000 | Sentencepiece BPE. Added Ko/JP vocabs |
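
As a rough back-of-the-envelope sketch, the vocabulary sizes in the table can be turned into an estimate of how many parameters the expansion adds. The hidden size of 4096 and the untied embedding/LM head matrices are assumptions not stated in this card, and the edited tokenizer may not have exactly 64000 entries, so treat the result as illustrative only.

```python
def added_embedding_params(old_vocab, new_vocab, hidden_size, tied=False):
    """Parameters added by growing the vocabulary from old_vocab to new_vocab."""
    new_rows = new_vocab - old_vocab
    # Untied models carry two vocab-sized matrices: input embeddings and LM head.
    matrices = 1 if tied else 2
    return new_rows * hidden_size * matrices


HIDDEN_SIZE = 4096  # assumption: not given in this model card

extra = added_embedding_params(32000, 64000, HIDDEN_SIZE)
print(f"{extra:,} additional parameters (~{extra / 1e9:.2f}B)")
```

Under these assumptions the estimate is in the same ballpark as the gap between the SOLAR-10.7B name and the 10.99B figure quoted in the model details.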
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |
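
The `<0x..>` pieces in the SOLAR-10.7B row are what byte fallback looks like: a syllable missing from the vocabulary is emitted as one token per UTF-8 byte, and Hangul syllables take three bytes each. The sketch below reproduces that mapping with the standard library only. The helper `byte_fallback_pieces` is a hypothetical name used for illustration, not a tokenizer API.

```python
def byte_fallback_pieces(text):
    """Render each UTF-8 byte of text as a SentencePiece-style byte token."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


# Syllables that fall back to bytes in the SOLAR-10.7B row of the table.
for syllable in ["녕", "늘", "씨"]:
    print(syllable, byte_fallback_pieces(syllable))

# Token counts taken from the list above.
solar_tokens, recovery_tokens = 26, 7
reduction = solar_tokens / recovery_tokens
print(f"Sequence is {reduction:.1f}x shorter with the expanded vocabulary")
```

Each of the three syllables expands to three byte tokens, which accounts for most of the difference between 26 and 7 tokens for this sentence.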
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
# LICENSE
Apache 2.0
# **Model Benchmark**
## LM Eval Harness - Korean
- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores
TBD
## Citation
TBD
## Acknowledgements
- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.