---
language:
- ko
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
tags:
- pytorch
---
# Model Card for RedWhale-tv-10.8B-v1.0
<!-- Provide a quick summary of what the model is/does. -->
<!--
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon.png" width="648">
-->
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon_RWTV.png" width="648">
## Model Description
**RedWhale** is built on a preprocessed Korean corpus, a specialized Korean tokenizer, effective model initialization, and a continuous multi-stage pretraining strategy. This approach reduces computational costs while maintaining high accuracy and comprehension, making pretraining feasible with limited resources. To use **RedWhale**, please request access to the repo.
<!-- Provide a longer summary of what this model is. -->
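As a rough, unofficial way to see the effect of the specialized Korean tokenizer, you can compare how many tokens it needs for a Korean sentence against the base SOLAR-10.7B tokenizer. This is a minimal sketch, not an experiment from the paper; the exact counts depend on the tokenizer files shipped with each repo, and the gated RedWhale repo requires a read token.
```
from transformers import AutoTokenizer

# Illustrative comparison only; token counts will vary with the tokenizer versions.
redwhale_tok = AutoTokenizer.from_pretrained("TwinDoc/RedWhale-tv-10.8B-v1.0", token="hf_...")
solar_tok = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")

sentence = "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμ΄λ‹€."
print(len(redwhale_tok(sentence)["input_ids"]), len(solar_tok(sentence)["input_ids"]))
```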
## About the Model
- **Name:** TwinDoc/RedWhale-tv-10.8B-v1.0
- **Foundation Model:** upstage/SOLAR-10.7B-v1.0
- **Train Corpus:** [preprocessed AI-Hub datasets](https://huggingface.co/datasets/TwinDoc/agilesoda-corpus-AIHUB_splited_shffled)
- **Developed by:** AGILESODA (μ• μžμΌμ†Œλ‹€)
- **Model type:** llama
- **Language(s) (NLP):** Korean, English
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
## Load the Model
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
```
from transformers import AutoModelForCausalLM, AutoTokenizer

# A read-access token is needed because the repository is gated.
YOUR_HF_TOKEN_READ = "hf_..."
model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```
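If the full-precision weights do not fit in memory, the checkpoint can usually be loaded in half precision with automatic device placement. The arguments below are standard `transformers` options (with `accelerate` installed), shown as an optional sketch that reuses the variables defined above rather than a setting prescribed by this model card.
```
import torch
from transformers import AutoModelForCausalLM

# Optional: bfloat16 weights with automatic device placement (requires `accelerate`).
# dtype and device_map are illustrative choices, not values prescribed for this model.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    token=YOUR_HF_TOKEN_READ,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```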
## Generate Text
```
text = "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ”"  # "The capital of South Korea is"
encodings = tokenizer(text, return_tensors='pt')

# Stop at EOS or at a newline so the model completes a single sentence.
terminators = [tokenizer.eos_token_id] + tokenizer("\n", add_special_tokens=False)["input_ids"]

outputs = model.generate(**encodings, eos_token_id=terminators)
generated_text = tokenizer.batch_decode(outputs)[0]
# '<s> λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμ΄λ‹€.\n'
```
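For longer, open-ended completions you can drop the newline terminator and pass the usual sampling arguments to `generate`. The parameter values below are illustrative defaults, not settings recommended by the authors; the snippet reuses the model, tokenizer, and `encodings` from above.
```
# Sampled generation; parameter values are illustrative only.
outputs = model.generate(
    **encodings,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```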
## License
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/license__icon.png" width="324">
The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
## Citation
```
@misc{vo2024redwhaleadaptedkoreanllm,
  title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
  author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
  year={2024},
  eprint={2408.11294},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11294},
}
```
**Built with:**
<a href="http://www.agilesoda.com/sub/twin_doc.php">
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda_twindoc__icon.png" alt="AgileSoda TwinDoc Icon">
</a>