---
language:
  - ko
  - en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
tags:
  - pytorch
---

# Model Card for RedWhale-tv-10.8B-v1.0

## Model Description

RedWhale์€ ์ „์ฒ˜๋ฆฌํ•œ ํ•œ๊ตญ์–ด Corpus, ํŠนํ™”๋œ ํ•œ๊ตญ์–ด Tokenizer, ํšจ๊ณผ์ ์ธ Model initialization, Continuous Multi-Stage Pretraining strategy ๋“ฑ์„ ๊ฐ–์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋†’์€ ์ •ํ™•๋„์™€ ์ดํ•ด๋„๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ Computational costs๋ฅผ ์ค„์—ฌ ์ œํ•œ๋œ ๋ฆฌ์†Œ์Šค์—์„œ Pretraining์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค. RedWhale ์‚ฌ์šฉ์„ ์›ํ•˜์‹œ๋ฉด repo access ์š”์ฒญํ•ด์ฃผ์„ธ์š”.

## About the Model

### Load the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# A read-access token is required because the repo is gated.
YOUR_HF_TOKEN_READ = "hf_..."
model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```

### Generate Text

```python
text = "๋Œ€ํ•œ๋ฏผ๊ตญ์˜ ์ˆ˜๋„๋Š”"
encodings = tokenizer(text, return_tensors='pt')

# Stop generation at either EOS or a newline token.
terminators = [tokenizer.eos_token_id] + tokenizer("\n", add_special_tokens=False)["input_ids"]

outputs = model.generate(**encodings, eos_token_id=terminators)
generated_text = tokenizer.batch_decode(outputs)[0]
# '<s> ๋Œ€ํ•œ๋ฏผ๊ตญ์˜ ์ˆ˜๋„๋Š” ์„œ์šธ์ด๋‹ค.\n'
```

## License

The content of this project, created by AGILESODA, is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).

## Citation

```bibtex
@misc{vo2024redwhaleadaptedkoreanllm,
      title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
      author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
      year={2024},
      eprint={2408.11294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11294},
}
```

Built with AgileSoda TwinDoc.