metadata
language:
- ko
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
tags:
- pytorch
Model Card for RedWhale-tv-10.8B-v1.0
Model Description
RedWhale is built on a preprocessed Korean corpus, a Korean-specialized tokenizer, effective model initialization, and a continuous multi-stage pretraining strategy. This approach reduces computational costs while maintaining high accuracy and comprehension, making pretraining practical with limited resources. To use RedWhale, please request access to the repository.
About the Model
- Name: TwinDoc/RedWhale-tv-10.8B-v1.0
- Foundation Model: upstage/SOLAR-10.7B-v1.0
- Train Corpus: preprocessed AI-Hub datasets
- Developed by: AGILESODA
- Model type: llama
- Language(s) (NLP): Korean, English
- License: cc-by-nc-sa-4.0
- Paper: RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining
Load the Model
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository is gated: pass a Hugging Face read token that has been granted access.
YOUR_HF_TOKEN_READ = "hf_..."
model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
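Loading a ~10.8B-parameter model in the default float32 precision needs roughly twice the memory of bfloat16 weights. A back-of-the-envelope sketch (the parameter count is taken from the model name; per-parameter byte widths are the standard dtype sizes, and activation/KV-cache overhead is ignored):

```python
# Rough weight-memory estimate for a 10.8B-parameter model such as
# RedWhale-tv-10.8B-v1.0. Overheads (activations, KV cache) are not counted.
PARAMS = 10.8e9  # parameter count implied by the model name

def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return params * bytes_per_param / 1024**3

fp32 = weight_memory_gb(PARAMS, 4)  # default torch.float32
bf16 = weight_memory_gb(PARAMS, 2)  # e.g. torch_dtype=torch.bfloat16
print(f"fp32 weights: ~{fp32:.1f} GiB, bf16 weights: ~{bf16:.1f} GiB")
# fp32 weights: ~40.2 GiB, bf16 weights: ~20.1 GiB
```

If memory is tight, passing `torch_dtype=torch.bfloat16` to `from_pretrained` halves the weight footprint at essentially no quality cost for inference.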
Generate Text
# "대한민국의 수도는" means "The capital of South Korea is".
text = "대한민국의 수도는"
encodings = tokenizer(text, return_tensors='pt')

# Stop generation at either the EOS token or a newline token.
terminators = [tokenizer.eos_token_id] + tokenizer("\n", add_special_tokens=False)["input_ids"]

outputs = model.generate(**encodings, eos_token_id=terminators)
generated_text = tokenizer.batch_decode(outputs)[0]
# '<s> 대한민국의 수도는 서울이다.\n'  ("The capital of South Korea is Seoul.")
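The call above uses the model's default (greedy) decoding. Decoding behavior can also be tuned through a `generation_config.json` shipped with the model or passed at generation time; the values below are illustrative assumptions for a llama-type model, not the defaults this repository ships with:

```json
{
  "do_sample": true,
  "temperature": 0.7,
  "top_p": 0.9,
  "max_new_tokens": 128
}
```

The same keys can be passed directly as keyword arguments to `model.generate(...)`.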
License
The content of this project, created by AGILESODA, is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Citation
@misc{vo2024redwhaleadaptedkoreanllm,
title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
year={2024},
eprint={2408.11294},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.11294},
}