---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---
## Update Log

- 2024.05.16: Released Solar-Ko-Recovery
# Solar-Ko-Recovery ⭐🇰🇷🇺🇸

Solar-Ko-Recovery aims to recover Solar's Korean capability by re-arranging the embeddings and LM head, featuring an expanded vocabulary and a Korean+English corpus for enhanced representation.
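A minimal loading sketch with 🤗 Transformers is shown below; the repository id and the Korean prompt are illustrative assumptions, so check the Hub page for the exact id.

```python
# Minimal usage sketch (assumed repo id "beomi/Solar-Ko-Recovery-11B";
# verify the exact id on the Hugging Face Hub before running).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Solar-Ko-Recovery-11B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 10.8B weights
    device_map="auto",
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```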
## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Solar-Ko-Recovery is available in one parameter size: 10.8B.

**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:**

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| Model | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| Solar-Ko-Recovery | A curated mix of Korean+English corpora | 10.8B | 4k | O | >30B* | 5e-5 |
**NOTE:** Only the embedding layer and the LM head layer were trained.
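As a rough illustration of that setup (a sketch under assumptions, not the actual training code), freezing everything except the embeddings and the LM head in plain Transformers/PyTorch could look like:

```python
# Hypothetical sketch of the stated recipe: freeze all parameters except
# the input embedding layer and the LM head. The base checkpoint id is the
# public upstage/SOLAR-10.7B-v1.0; the real training code may differ.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("upstage/SOLAR-10.7B-v1.0")

for param in model.parameters():
    param.requires_grad = False  # freeze the transformer blocks

for module in (model.get_input_embeddings(), model.get_output_embeddings()):
    for param in module.parameters():
        param.requires_grad = True  # train only embeddings + LM head

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,}")
```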
## Vocab Expansion

Vocab expansion was conducted on an edited `upstage/solar-1-mini-tokenizer`, which is a superset of the Solar tokenizer.
| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | Sentencepiece BPE |
| solar-1-mini-tokenizer | 64000 | Sentencepiece BPE, added Korean/Japanese vocab |
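Mechanically, swapping in the 64k tokenizer means the base model's 32000-row embedding and LM head matrices have to grow before the recovery training; a minimal sketch of that step, assuming the public checkpoint ids and the standard `resize_token_embeddings` API, is:

```python
# Sketch: grow the embedding and LM head matrices from 32000 to 64000 rows
# to match the expanded tokenizer. New rows start randomly initialized and
# are what the recovery training has to learn. Ids are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("upstage/solar-1-mini-tokenizer")
model = AutoModelForCausalLM.from_pretrained("upstage/SOLAR-10.7B-v1.0")

model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().weight.shape)  # -> torch.Size([64000, 4096])
```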
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.'] |
| Solar-Ko-Recovery | ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.'] |
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**

- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
| Solar-Ko-Recovery | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
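The counts above can be reproduced with `AutoTokenizer`; the repo ids below are assumptions, so substitute the actual ones:

```python
# Reproduce the token-count comparison above (repo ids are assumptions).
from transformers import AutoTokenizer

sentences = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]

for model_id in ("upstage/SOLAR-10.7B-v1.0", "beomi/Solar-Ko-Recovery-11B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    for s in sentences:
        print(f"{model_id}: {len(tok.tokenize(s))} tokens")
```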
## LICENSE

Apache 2.0
## Model Benchmark

### LM Eval Harness - Korean

- Used EleutherAI's lm-evaluation-harness
- 5-shot scores

TBD
## Citation

TBD

## Acknowledgements

- Training support was provided by the TPU Research Cloud program.