---
language:
- ko
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
tags:
- pytorch
---

# Model Card for RedWhale-tv-10.8B-v1.0

## Model Description

**RedWhale** combines a preprocessed Korean corpus, a specialized Korean tokenizer, effective model initialization, and a continuous multi-stage pretraining strategy. This approach reduces computational costs while maintaining high accuracy and comprehension, which makes pretraining feasible with limited resources. To use **RedWhale**, please request access to the repository.

## About the Model

- **Name:** TwinDoc/RedWhale-tv-10.8B-v1.0
- **Foundation Model:** upstage/SOLAR-10.7B-v1.0
- **Train Corpus:** [preprocessed AI-Hub datasets](https://huggingface.co/datasets/TwinDoc/agilesoda-corpus-AIHUB_splited_shffled)
- **Developed by:** AGILESODA (애자일소다)
- **Model type:** llama
- **Language(s) (NLP):** Korean, English
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)

## Load the Model

```
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

# Read-access token for the gated repository (request repo access first).
YOUR_HF_TOKEN_READ = "hf_..."

model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```

## Generate Text

```
# Prompt: "The capital of South Korea is"
text = "대한민국의 수도는"
encodings = tokenizer(text, return_tensors='pt')

# Stop at the end-of-sequence token or at the first newline.
terminators = [tokenizer.eos_token_id] + tokenizer("\n", add_special_tokens=False)["input_ids"]

outputs = model.generate(**encodings, eos_token_id=terminators)
generated_text = tokenizer.batch_decode(outputs)[0]
# ' 대한민국의 수도는 서울이다.\n'
```

## License

The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/).
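## Note on Stop Tokens

The Generate Text example passes a list of ids as `eos_token_id`, so decoding ends as soon as either the end-of-sequence token or a newline token is produced. The sketch below illustrates that stopping rule in plain Python. It is a simplified model of the behavior, not the `transformers` implementation: `generate_until`, the scripted token source, and the ids `2` and `13` are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, List


def generate_until(
    next_token: Callable[[List[int]], int],
    prompt_ids: List[int],
    stop_ids: Iterable[int],
    max_new_tokens: int = 20,
) -> List[int]:
    """Append tokens until a stop id is produced or the token budget runs out."""
    stop = set(stop_ids)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)  # the stop token itself is kept in the output
        if token in stop:
            break
    return ids


# Hypothetical ids: 2 stands in for EOS, 13 for "\n".
EOS_ID, NEWLINE_ID = 2, 13

# A scripted "model" that would emit four tokens if never stopped.
scripted = iter([101, 102, NEWLINE_ID, 103])

result = generate_until(lambda ids: next(scripted), [1, 100], [EOS_ID, NEWLINE_ID])
print(result)  # [1, 100, 101, 102, 13]
```

Because the stop token is kept in the returned sequence, the decoded string in the example above ends with `\n`; strip it after decoding if a trailing newline is unwanted.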
## Citation

```
@misc{vo2024redwhaleadaptedkoreanllm,
      title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
      author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
      year={2024},
      eprint={2408.11294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11294},
}
```

**Built with:** AgileSoda TwinDoc Icon