---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/WuiaS45EAWDurGTOtjR_d.png" style="max-width:250px;margin:0 auto;" />
**Update Log**
- 2024.07.01: Released Solar-Ko-Recovery and uploaded benchmark scores
- 2024.05.16: Released a preview of Solar-Ko-Recovery
# **Solar-Ko-Recovery-11B** 🌟❤️‍🩹
Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by rearranging the embeddings and LM head. It features an expanded vocabulary and is trained on a Korean+English corpus for better representation.
## Model Details
**Model Developers:** Junbum Lee (Beomi)
**Variations:** Solar-Ko-Recovery is available in one parameter size: 11B (10.99B🤣).
**Input:** The model accepts only text input.
**Output:** The model produces text output exclusively.
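Since the model is text in, text out, a minimal local generation sketch with Transformers could look like the following. The repository id `beomi/Solar-Ko-Recovery-11B` is an assumption inferred from the model name and the developer's handle, and the `generate` helper and its defaults are hypothetical; adjust dtype and device placement to your hardware.

```python
def generate(prompt: str,
             repo_id: str = "beomi/Solar-Ko-Recovery-11B",  # assumed repo id, not stated in this card
             max_new_tokens: int = 64) -> str:
    """Hypothetical helper: load the model and continue `prompt` with greedy decoding."""
    # Heavy dependencies are imported lazily so that defining the helper stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
    # The decoded text contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("안녕하세요, 오늘은")` downloads the weights on first use, so expect a large download and a memory footprint of roughly 22 GB in bfloat16.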
**Model Architecture:**
Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| |Training Data|Parameters|Context Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English Corpora*|11B(10.99B)|4k|O|>100B*|5e-5|
> NOTE: Training was done in two steps:
>
> 1) Only the embedding layer and the LM head were trained.
> 2) All parameters were trained.
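The two steps above can be sketched as a parameter-freezing helper. This is an illustration, not the actual training code: it assumes Llama-style parameter names as used in Hugging Face Transformers (`model.embed_tokens.weight`, `lm_head.weight`), and `set_trainable_for_stage`, `ToyParam`, and `ToyModel` are hypothetical names. The helper is duck-typed, so it also works on any PyTorch module that exposes `named_parameters()`.

```python
# Assumed substrings identifying the embedding and LM head parameters.
STAGE1_TRAINABLE_KEYS = ("embed_tokens", "lm_head")


def set_trainable_for_stage(model, stage: int) -> int:
    """Set requires_grad per training stage; return the number of trainable tensors."""
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    trainable = 0
    for name, param in model.named_parameters():
        if stage == 1:
            # Stage 1: only the embeddings and the LM head receive gradients.
            param.requires_grad = any(key in name for key in STAGE1_TRAINABLE_KEYS)
        else:
            # Stage 2: every parameter is trained.
            param.requires_grad = True
        trainable += int(param.requires_grad)
    return trainable


class ToyParam:
    """Stand-in for a tensor parameter; only carries the requires_grad flag."""
    def __init__(self):
        self.requires_grad = True


class ToyModel:
    """Stand-in model with a few Llama-style parameter names."""
    def __init__(self):
        self.params = {
            "model.embed_tokens.weight": ToyParam(),
            "model.layers.0.self_attn.q_proj.weight": ToyParam(),
            "model.layers.0.mlp.up_proj.weight": ToyParam(),
            "lm_head.weight": ToyParam(),
        }

    def named_parameters(self):
        return self.params.items()


toy = ToyModel()
print(set_trainable_for_stage(toy, stage=1))  # 2 (embeddings + LM head)
print(set_trainable_for_stage(toy, stage=2))  # 4 (all parameters)
```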
**Vocab Expansion**
Vocab expansion was conducted on an edited version of [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.
| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Solar | 32000 | Sentencepiece BPE |
| **solar-1-mini-tokenizer** | 64000 | Sentencepiece BPE. Added Korean/Japanese vocabulary |
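As a rough worked example of what doubling the vocabulary costs in parameters, the sketch below counts the extra rows in the embedding matrix and the LM head. The hidden size of 4096 and the untied input/output embeddings are assumptions that this card does not state; the function name is hypothetical.

```python
OLD_VOCAB = 32_000   # original Solar vocabulary (from the table above)
NEW_VOCAB = 64_000   # expanded vocabulary (from the table above)
HIDDEN_SIZE = 4096   # assumption: not stated in this card


def added_vocab_params(old_vocab: int, new_vocab: int, hidden_size: int,
                       matrices: int = 2) -> int:
    """Extra parameters from new vocabulary rows.

    `matrices=2` assumes an untied input embedding and LM head,
    each of shape (vocab, hidden_size).
    """
    return (new_vocab - old_vocab) * hidden_size * matrices


extra = added_vocab_params(OLD_VOCAB, NEW_VOCAB, HIDDEN_SIZE)
print(f"{extra:,} extra parameters (~{extra / 1e9:.2f}B)")
# 262,144,000 extra parameters (~0.26B)
```

Under these assumptions the expansion adds roughly 0.26B parameters, which is in line with the growth from the 10.7B base model to the 10.99B stated above.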
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |
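The `<0x..>` entries in the SOLAR-10.7B row are byte-fallback tokens: a character missing from the vocabulary is emitted as its raw UTF-8 bytes, one token per byte. The sketch below reassembles both token lists from the table into text to show that they encode the same sentence at very different lengths. `detokenize` is a hypothetical helper written for this illustration, not the tokenizer's own decoder.

```python
import re

BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def detokenize(tokens):
    """Rebuild text from SentencePiece-style tokens, resolving byte-fallback tokens."""
    buf = bytearray()
    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            # Byte-fallback token: append the single raw byte it stands for.
            buf.append(int(match.group(1), 16))
        else:
            buf.extend(tok.encode("utf-8"))
    # '▁' marks a word boundary; drop the leading one that SentencePiece adds.
    text = buf.decode("utf-8").replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


solar = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오',
         '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>',
         '가', '▁', '좋', '네', '요', '.']
solar_ko = ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']

print(len(solar), detokenize(solar))        # 26 안녕하세요, 오늘은 날씨가 좋네요.
print(len(solar_ko), detokenize(solar_ko))  # 7 안녕하세요, 오늘은 날씨가 좋네요.
```

Here the three bytes `EB 85 95` decode to `녕`, so a single Korean syllable costs three tokens in the original vocabulary.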
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
# LICENSE
Apache 2.0
# **Model Benchmark**
## LM Eval Harness - Korean
- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores
| Tasks | Metric | Value | | Stderr |
|----------------------------------------------------------|-----------|--------:|---|--------:|
|haerae |acc_norm | 0.7874 |± | 0.0118 |
| - haerae_general_knowledge |acc | 0.5000 |± | 0.0378 |
| - haerae_history |acc | 0.8723 |± | 0.0244 |
| - haerae_loan_word |acc | 0.8402 |± | 0.0283 |
| - haerae_rare_word |acc | 0.8346 |± | 0.0185 |
| - haerae_standard_nomenclature |acc | 0.8301 |± | 0.0305 |
|kmmlu_direct |exact_match| 0.4205 |± | 0.0026 |
| - kmmlu_direct_accounting |exact_match| 0.3700 |± | 0.0485 |
| - kmmlu_direct_agricultural_sciences |exact_match| 0.3140 |± | 0.0147 |
| - kmmlu_direct_aviation_engineering_and_maintenance |exact_match| 0.3870 |± | 0.0154 |
| - kmmlu_direct_biology |exact_match| 0.3510 |± | 0.0151 |
| - kmmlu_direct_chemical_engineering |exact_match| 0.3910 |± | 0.0154 |
| - kmmlu_direct_chemistry |exact_match| 0.4000 |± | 0.0200 |
| - kmmlu_direct_civil_engineering |exact_match| 0.4010 |± | 0.0155 |
| - kmmlu_direct_computer_science |exact_match| 0.6520 |± | 0.0151 |
| - kmmlu_direct_construction |exact_match| 0.3080 |± | 0.0146 |
| - kmmlu_direct_criminal_law |exact_match| 0.3100 |± | 0.0328 |
| - kmmlu_direct_ecology |exact_match| 0.4660 |± | 0.0158 |
| - kmmlu_direct_economics |exact_match| 0.5385 |± | 0.0439 |
| - kmmlu_direct_education |exact_match| 0.6200 |± | 0.0488 |
| - kmmlu_direct_electrical_engineering |exact_match| 0.3000 |± | 0.0145 |
| - kmmlu_direct_electronics_engineering |exact_match| 0.4740 |± | 0.0158 |
| - kmmlu_direct_energy_management |exact_match| 0.3560 |± | 0.0151 |
| - kmmlu_direct_environmental_science |exact_match| 0.2980 |± | 0.0145 |
| - kmmlu_direct_fashion |exact_match| 0.4470 |± | 0.0157 |
| - kmmlu_direct_food_processing |exact_match| 0.3690 |± | 0.0153 |
| - kmmlu_direct_gas_technology_and_engineering |exact_match| 0.3000 |± | 0.0145 |
| - kmmlu_direct_geomatics |exact_match| 0.3820 |± | 0.0154 |
| - kmmlu_direct_health |exact_match| 0.5700 |± | 0.0498 |
| - kmmlu_direct_industrial_engineer |exact_match| 0.3830 |± | 0.0154 |
| - kmmlu_direct_information_technology |exact_match| 0.6090 |± | 0.0154 |
| - kmmlu_direct_interior_architecture_and_design |exact_match| 0.5440 |± | 0.0158 |
| - kmmlu_direct_korean_history |exact_match| 0.3800 |± | 0.0488 |
| - kmmlu_direct_law |exact_match| 0.4670 |± | 0.0158 |
| - kmmlu_direct_machine_design_and_manufacturing |exact_match| 0.3960 |± | 0.0155 |
| - kmmlu_direct_management |exact_match| 0.5030 |± | 0.0158 |
| - kmmlu_direct_maritime_engineering |exact_match| 0.4283 |± | 0.0202 |
| - kmmlu_direct_marketing |exact_match| 0.7460 |± | 0.0138 |
| - kmmlu_direct_materials_engineering |exact_match| 0.4020 |± | 0.0155 |
| - kmmlu_direct_math |exact_match| 0.2867 |± | 0.0262 |
| - kmmlu_direct_mechanical_engineering |exact_match| 0.3490 |± | 0.0151 |
| - kmmlu_direct_nondestructive_testing |exact_match| 0.3760 |± | 0.0153 |
| - kmmlu_direct_patent |exact_match| 0.3700 |± | 0.0485 |
| - kmmlu_direct_political_science_and_sociology |exact_match| 0.5300 |± | 0.0289 |
| - kmmlu_direct_psychology |exact_match| 0.4470 |± | 0.0157 |
| - kmmlu_direct_public_safety |exact_match| 0.3520 |± | 0.0151 |
| - kmmlu_direct_railway_and_automotive_engineering |exact_match| 0.3220 |± | 0.0148 |
| - kmmlu_direct_real_estate |exact_match| 0.4350 |± | 0.0351 |
| - kmmlu_direct_refrigerating_machinery |exact_match| 0.3240 |± | 0.0148 |
| - kmmlu_direct_social_welfare |exact_match| 0.4970 |± | 0.0158 |
| - kmmlu_direct_taxation |exact_match| 0.3800 |± | 0.0344 |
| - kmmlu_direct_telecommunications_and_wireless_technology|exact_match| 0.5480 |± | 0.0157 |
|kobest_boolq |acc | 0.9202 |± | 0.0072 |
| |f1 | 0.9202 |± |N/A |
|kobest_copa |acc | 0.8680 |± | 0.0107 |
| |f1 | 0.8678 |± |N/A |
|kobest_hellaswag |acc | 0.5560 |± | 0.0222 |
| |f1 | 0.5520 |± |N/A |
| |acc_norm | 0.6540 |± | 0.0213 |
|kobest_sentineg |acc | 0.9824 |± | 0.0066 |
| |f1 | 0.9824 |± |N/A |
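To read the `Value` and `Stderr` columns together, one option is an approximate 95% confidence interval of value ± 1.96 × stderr. This relies on a normal approximation, which is an assumption added here rather than something the harness reports, and `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96):
    """Approximate confidence interval under a normal approximation (z=1.96 for ~95%)."""
    half_width = z * stderr
    return value - half_width, value + half_width


# kobest_boolq accuracy and standard error, taken from the table above.
low, high = confidence_interval(0.9202, 0.0072)
print(f"kobest_boolq acc: {low:.4f} to {high:.4f}")
# kobest_boolq acc: 0.9061 to 0.9343
```

Subtasks with few examples have much wider intervals: `kmmlu_direct_accounting` at 0.3700 ± 0.0485 spans roughly 0.27 to 0.47, so small differences on such rows should be read with care.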
## Citation
TBD
## Acknowledgements
- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.