---
language:
- ko
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-generation
tags:
- pytorch
---
# Model Card for RedWhale-tv-10.8B-v1.0
<!-- Provide a quick summary of what the model is/does. -->
<!--
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon.png" width="648">
-->
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon_RWTV.png" width="648">
## Model Description
**RedWhale** is built on a preprocessed Korean corpus, a specialized Korean tokenizer, effective model initialization, and a continuous multi-stage pretraining strategy. This approach reduces computational costs while maintaining high accuracy and comprehension, making pretraining feasible with limited resources. To use **RedWhale**, please request access to the repo.
<!-- Provide a longer summary of what this model is. -->
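As a rough, unofficial way to see the effect of the specialized Korean tokenizer, you can compare how many tokens it needs for a Korean sentence against the base SOLAR-10.7B tokenizer. This is a minimal sketch, not an experiment from the paper; the exact counts depend on the tokenizer files shipped with each repo, and the gated RedWhale repo requires a read token.
```
from transformers import AutoTokenizer

# Illustrative comparison only; token counts will vary with the tokenizer versions.
redwhale_tok = AutoTokenizer.from_pretrained("TwinDoc/RedWhale-tv-10.8B-v1.0", token="hf_...")
solar_tok = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")

sentence = "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμ΄λ‹€."
print(len(redwhale_tok(sentence)["input_ids"]), len(solar_tok(sentence)["input_ids"]))
```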
## About the Model
- **Name:** TwinDoc/RedWhale-tv-10.8B-v1.0
- **Foundation Model:** upstage/SOLAR-10.7B-v1.0
- **Train Corpus:** [preprocessed AI-Hub datasets](https://huggingface.co/datasets/TwinDoc/agilesoda-corpus-AIHUB_splited_shffled)
- **Developed by:** AGILESODA (μ• μžμΌμ†Œλ‹€)
- **Model type:** llama
- **Language(s) (NLP):** Korean, English
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
## Load the Model
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
```
from transformers import AutoModelForCausalLM, AutoTokenizer

# A read-access token is needed because the repository is gated.
YOUR_HF_TOKEN_READ = "hf_..."
model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```
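If the full-precision weights do not fit in memory, the checkpoint can usually be loaded in half precision with automatic device placement. The arguments below are standard `transformers` options (with `accelerate` installed), shown as an optional sketch that reuses the variables defined above rather than a setting prescribed by this model card.
```
import torch
from transformers import AutoModelForCausalLM

# Optional: bfloat16 weights with automatic device placement (requires `accelerate`).
# dtype and device_map are illustrative choices, not values prescribed for this model.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    token=YOUR_HF_TOKEN_READ,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```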
## Generate Text
```
text = "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ”"  # "The capital of South Korea is"
encodings = tokenizer(text, return_tensors='pt')

# Stop at EOS or at a newline so the model completes a single sentence.
terminators = [tokenizer.eos_token_id] + tokenizer("\n", add_special_tokens=False)["input_ids"]

outputs = model.generate(**encodings, eos_token_id=terminators)
generated_text = tokenizer.batch_decode(outputs)[0]
# '<s> λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμ΄λ‹€.\n'
```
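For longer, open-ended completions you can drop the newline terminator and pass the usual sampling arguments to `generate`. The parameter values below are illustrative defaults, not settings recommended by the authors; the snippet reuses the model, tokenizer, and `encodings` from above.
```
# Sampled generation; parameter values are illustrative only.
outputs = model.generate(
    **encodings,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```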
## License
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/license__icon.png" width="324">
The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
## Citation
```
@misc{vo2024redwhaleadaptedkoreanllm,
  title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
  author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
  year={2024},
  eprint={2408.11294},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11294},
}
```
**Built with:**
<a href="http://www.agilesoda.com/sub/twin_doc.php">
<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda_twindoc__icon.png" alt="AgileSoda TwinDoc Icon">
</a>