+---
+license: other
+license_name: yi-license
+license_link: LICENSE
+extra_gated_heading: Access beomi/Yi-Ko-34B on Hugging Face
+extra_gated_button_content: Submit
+extra_gated_fields:
+  I agree to share my name, email address and username: checkbox
+  I confirm that I understand this project is for research purposes only, and confirm that I agree to follow the LICENSE of this model: checkbox
+language:
+- en
+- ko
+pipeline_tag: text-generation
+inference: false
+tags:
+- pytorch
+- Yi-Ko
+- 01-ai
+- Yi
+library_name: transformers
+---
+# **beomi/Yi-Ko-34B**
+Yi-Ko series models are advanced iterations of the 01-ai/Yi models,
+benefiting from an expanded vocabulary and further pretraining on a Korean/English corpus.
+Like their predecessors, Yi-Ko series models are generative text models ranging from 6 billion to 34 billion parameters.
+This repository hosts the **34B** pretrained version,
+provided in the Hugging Face Transformers format.
+For access to the other models, feel free to consult the index provided below.
+## Model Details
+**Model Developers** Junbum Lee (Beomi)
+**Variations** The Yi-Ko series comes in two parameter sizes, 6B and 34B, both trained on Korean + English (Ko) data.
+**Input** The models take text input only.
+**Output** The models generate text only.
+**Model Architecture**
+Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.
+<small>*The Yi model architecture is based on Llama 2, so the models can be loaded via the `LlamaForCausalLM` class in Hugging Face Transformers.</small>
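Since the architecture is Llama-2-compatible, loading should follow the usual Transformers pattern. The block below is a minimal sketch rather than an official recipe. It assumes that `transformers`, `torch`, and `accelerate` are installed, that you have been granted access to this gated repository and are authenticated, and that enough accelerator memory is available for a 34B-parameter model. The helper names `load_yi_ko` and `complete`, the bf16 dtype, and `device_map="auto"` are illustrative choices, not something this model card prescribes.

```python
MODEL_ID = "beomi/Yi-Ko-34B"  # repository named in this model card


def load_yi_ko(model_id: str = MODEL_ID):
    """Load tokenizer and model (hypothetical helper, not a library function)."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, LlamaForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory use
        device_map="auto",           # assumption: needs the `accelerate` package
    )
    return tokenizer, model


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation for `prompt` (hypothetical helper)."""
    tokenizer, model = load_yi_ko()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the full checkpoint, so it is left commented out):
# print(complete("안녕하세요, 오늘은"))
```

This is a pretrained (not instruction-tuned) checkpoint, so the sketch treats the prompt as plain text to be continued.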
+|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|Learning Rate|Tokens per Batch|
+|---|---|---|---|---|---|---|---|
+|Yi-Ko-34B|*A mix of Korean + English online data*|34B|4k|Yes|40B+|5e-5|4M|
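The table's figures allow a rough back-of-the-envelope estimate of the training schedule. The sketch below is only an approximation under stated assumptions: "40B+" is treated as a lower bound of 40 billion tokens, "4M" as 4,000,000 tokens, "4k" as 4,096 tokens, and each batch is assumed to correspond to one optimizer step. None of these interpretations is confirmed by the model card.

```python
trained_tokens = 40e9    # "40B+" in the table, so treat this as a lower bound
tokens_per_batch = 4e6   # "4M"; assumed to mean 4,000,000 tokens
context_length = 4096    # "4k"; assumed to mean 4,096 tokens

# Assumption: one batch per optimizer step.
optimizer_steps = trained_tokens / tokens_per_batch
# Assumption: every sequence is packed to the full context length.
sequences_per_batch = tokens_per_batch / context_length

print(f"at least ~{optimizer_steps:,.0f} optimizer steps")
print(f"about {sequences_per_batch:,.0f} full-length sequences per batch")
```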
+**Vocab Expansion**
+| Model Name | Vocabulary Size | Description |
+| --- | --- | --- |
+| Original Yi-Series | 64000 | Sentencepiece BPE |
+| **Expanded Yi-Ko Series** | 78464 | Sentencepiece BPE. Added Korean vocab and merges |
+**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**
+| Model | # of tokens | Tokens |
+| --- | --- | --- |
+| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
+| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |
+|<small>*Same Korean vocab as the Llama-2-Ko series</small>| | |
+**Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**
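Tokens of the form `<0xNN>` are SentencePiece byte-fallback pieces: each one stands for a single UTF-8 byte of a character missing from the vocabulary, which is why one Hangul syllable costs three tokens in the original tokenizer. The sketch below reassembles both token lists from the table into the original sentence. The `detokenize` helper is hypothetical and only approximates what a real SentencePiece decoder does (it handles byte pieces and the `▁` word-boundary marker, nothing else).

```python
import re

BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

SENTENCE = "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"

# Token lists copied from the table above.
ORIGINAL_YI = [
    '<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하',
    '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁',
    '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁',
    '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁',
    '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>',
    '<0xEC>', '<0x9A>', '<0x94>', '.',
    '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>',
]
EXPANDED_YI_KO = ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']


def detokenize(tokens):
    """Rebuild text from pieces (hypothetical helper, simplified decoder)."""
    buf = bytearray()
    for tok in tokens:
        match = BYTE_PIECE.match(tok)
        if match:
            buf.append(int(match.group(1), 16))  # one raw UTF-8 byte
        else:
            buf.extend(tok.encode("utf-8"))      # ordinary vocabulary piece
    text = buf.decode("utf-8").replace("\u2581", " ")  # "▁" marks a space
    # Drop the single dummy space that a prepended "▁" produces.
    return text[1:] if text.startswith(" ") else text


original_text = detokenize(ORIGINAL_YI)
expanded_text = detokenize(EXPANDED_YI_KO)
reduction = len(ORIGINAL_YI) / len(EXPANDED_YI_KO)
print(original_text == SENTENCE, expanded_text == SENTENCE)
print(len(ORIGINAL_YI), len(EXPANDED_YI_KO), reduction)
```

Both lists decode to the same sentence, so the expanded vocabulary changes only how many tokens the sentence costs, not what it represents.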
+| Model | # of tokens | Tokens |
+| --- | --- | --- |
+| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
+| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
+|<small>*Same Korean vocab as the Llama-2-Ko series</small>| | <small>*Since the **Expanded Yi-Ko Series** prepends `▁` at the beginning of the text (to ensure the same tokenization for Korean sentences), it shows a negligible difference in the first token for English tokenization. </small>|
+# **Model Benchmark**
+## LM Eval Harness - Korean (polyglot branch)
+|     Tasks      |Version|Filter|n-shot| Metric |Value |   |Stderr|
+|----------------|------:|------|-----:|--------|-----:|---|------|
+|**kmmlu_direct**|N/A    |none  |     5|exact_match|**0.5027**|±  |0.1019|
+|kobest_boolq    |      1|none  |     5|acc     |0.9202|±  |0.0072|
+|                |       |none  |     5|f1      |0.9202|±  |N/A   |
+|kobest_copa     |      1|none  |     5|acc     |0.8480|±  |0.0114|
+|                |       |none  |     5|f1      |0.8479|±  |N/A   |
+|kobest_hellaswag|      1|none  |     5|acc     |0.5320|±  |0.0223|
+|                |       |none  |     5|f1      |0.5281|±  |N/A   |
+|                |       |none  |     5|acc_norm|0.6340|±  |0.0216|
+|kobest_sentineg |      1|none  |     5|acc     |0.9874|±  |0.0056|
+|                |       |none  |     5|f1      |0.9874|±  |N/A   |
+|haerae                         |N/A    |none  |     5|acc     |0.7965|±  |0.0116|
+|                               |       |none  |     5|acc_norm|0.7965|±  |0.0116|
+| - haerae_general_knowledge    |      1|none  |     5|acc     |0.5114|±  |0.0378|
+|                               |       |none  |     5|acc_norm|0.5114|±  |0.0378|
+| - haerae_history              |      1|none  |     5|acc     |0.8511|±  |0.0260|
+|                               |       |none  |     5|acc_norm|0.8511|±  |0.0260|
+| - haerae_loan_word            |      1|none  |     5|acc     |0.8402|±  |0.0283|
+|                               |       |none  |     5|acc_norm|0.8402|±  |0.0283|
+| - haerae_rare_word            |      1|none  |     5|acc     |0.8642|±  |0.0170|
+|                               |       |none  |     5|acc_norm|0.8642|±  |0.0170|
+| - haerae_standard_nomenclature|      1|none  |     5|acc     |0.8301|±  |0.0305|
+|                               |       |none  |     5|acc_norm|0.8301|±  |0.0305|
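To read the `Value ± Stderr` columns, it can help to turn them into approximate confidence intervals. The sketch below uses a normal approximation (value ± 1.96 × stderr), which is an assumption of this example and not something the evaluation harness reports. The rows are copied from the table above, and `confidence_interval` is a hypothetical helper.

```python
Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution

# (value, stderr) pairs copied from the benchmark table above.
results = {
    "kmmlu_direct (exact_match)": (0.5027, 0.1019),
    "kobest_boolq (acc)": (0.9202, 0.0072),
    "kobest_hellaswag (acc_norm)": (0.6340, 0.0216),
    "haerae (acc)": (0.7965, 0.0116),
}


def confidence_interval(value, stderr, z=Z_95):
    """Approximate interval, clipped to the valid score range [0, 1]."""
    half_width = z * stderr
    return max(0.0, value - half_width), min(1.0, value + half_width)


intervals = {name: confidence_interval(v, se) for name, (v, se) in results.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.4f}, {high:.4f}]")
```

Under this approximation the `kmmlu_direct` interval is far wider than the others, because its reported standard error is roughly ten times larger.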
+## LICENSE
+Apache 2.0
+## Citation
+## Acknowledgement
+Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.