---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---
## Update Log

- 2024.05.16: Released Solar-Ko-Recovery
# Solar-Ko-Recovery 🌟❤️🩹

Solar-Ko-Recovery aims to recover Solar's Korean capability by rearranging the embedding and LM head layers, featuring an expanded vocabulary and training on a Korean+English corpus for improved Korean representation.
## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Solar-Ko-Recovery is available in a single parameter size: 10.8B.

**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:** Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| Model | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| Solar-Ko-Recovery | A curated mix of Korean+English corpora | 10.8B | 4k | O | >30B* | 5e-5 |
**NOTE:** Only the embedding layer and the LM head layer are trained; all other weights are frozen.
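The note above can be sketched in PyTorch. This is a minimal illustration, not the actual training code: the tiny model and its sizes are made up, and only the parameter-freezing pattern mirrors the setup described here.

```python
import torch.nn as nn

# Toy stand-in for a causal LM: an embedding, a small "transformer body",
# and a tied-size LM head. Sizes are illustrative (the real vocab is 64000).
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(2)])
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

model = TinyLM()

# Freeze everything except the embedding and LM head parameters,
# matching "only the embedding layer and the LM head layer are trained".
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("embed_tokens", "lm_head"))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['embed_tokens.weight', 'lm_head.weight']
```

An optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` would then update only those two weight matrices.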
## Vocab Expansion

Vocab expansion was conducted on an edited upstage/solar-1-mini-tokenizer, which is a superset of the Solar tokenizer.
| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | Sentencepiece BPE |
| solar-1-mini-tokenizer | 64000 | Sentencepiece BPE; added Korean/Japanese vocab |
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |
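The pieces like `<0xEB>` in the original Solar tokenization are SentencePiece byte-fallback tokens: raw UTF-8 bytes emitted when a character is not in the vocabulary. A small self-contained sketch (not part of the released code) shows how the original string is recovered from such a sequence:

```python
# Token sequence copied from the table above (original SOLAR-10.7B tokenizer).
solar_tokens = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
                '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은',
                '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가',
                '▁', '좋', '네', '요', '.']

def detokenize(tokens):
    """Rebuild the text: byte-fallback pieces contribute raw bytes,
    regular pieces contribute their UTF-8 encoding, '▁' marks a space."""
    buf = bytearray()
    for tok in tokens:
        if tok.startswith('<0x') and tok.endswith('>'):
            buf.append(int(tok[3:-1], 16))                    # byte-fallback piece
        else:
            buf += tok.replace('▁', ' ').encode('utf-8')      # regular piece
    return buf.decode('utf-8').lstrip()

print(detokenize(solar_tokens))  # 안녕하세요, 오늘은 날씨가 좋네요.
```

Because most Hangul syllables are missing from the original 32000-entry vocab, each one costs up to three byte tokens, which is why the Korean sentence takes 26 tokens before the vocab expansion and 7 after.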
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
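Because the added vocabulary targets Korean, English tokenization is unchanged. The quoted token counts can be checked directly from the sequences listed above (lists copied from the tables):

```python
# English sequence, identical for both tokenizers per the table above.
solar_en = ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev',
            'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th',
            '▁UP', '▁Scal', 'ing', '!']
recovery_en = list(solar_en)  # Solar-Ko-Recovery yields the same pieces

# Korean sequence from Solar-Ko-Recovery for comparison.
recovery_ko = ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']

print(len(solar_en), len(recovery_ko))  # 22 7
```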
## LICENSE
Apache 2.0
## Model Benchmark

### LM Eval Harness - Korean
- Used EleutherAI's lm-evaluation-harness
- 5-shot scores
TBD
## Citation
TBD
## Acknowledgements
- Training support was provided by the TPU Research Cloud program.