---
base_model:
- openchat/openchat_3.5
language:
- ko
- en
library_name: adapter-transformers
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- finance
- biology
- legal
- art
- text-generation-inference
---
|
|
|
### ktdsbaseLM v0.11 was developed on openchat3.5 as its foundation model so that it can be applied to Korean and the diverse facets of Korean culture. It is a model that understands the values and culture of Korean society, trained on self-produced Korean data covering 53 domains.
|
|
|
|
|
|
|
# ❶ Model Description

- Model Name and Key Features:
KTDSbaseLM v0.11 is a Mistral 7B / OpenChat 3.5 based model, fine-tuned from the OpenChat 3.5 model using the SFT method.
It is designed to understand Korean and the diverse cultural contexts of Korea, and it reflects the values and culture of Korean society by drawing on self-produced Korean data covering 135 domains.
Its key features include text generation, conversational inference, document summarization, question answering, sentiment analysis, and a variety of other NLP tasks,
and it can be applied across many fields such as law, finance, science, education, business, and cultural research.

- Model Architecture: KTDSBaseLM v0.11 is a high-performance language model built on Mistral 7B, with 7 billion (7B) parameters.
It uses OpenChat 3.5 as its foundation model and was trained via SFT (supervised fine-tuning) to deliver performance specialized for the Korean language and Korean culture.
Mistral 7B's lightweight architecture ensures fast inference speed and memory efficiency, and it is optimized for a wide range of NLP tasks.
This architecture demonstrates the performance needed for diverse tasks such as text generation, question answering, document summarization, and sentiment analysis.
|
|
|
# ❷ Training Data

- ktdsbaseLM v0.11 was trained on a total of 3.6GB of self-developed data, comprising 2.33 million records covering Q&A, summarization, classification, and other tasks.
Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, society, finance, law, tax, mathematics, biology, physics, and chemistry, trained with the Chain of Thought method.
In addition, 1.3 million short-answer questions were trained across 38 domains such as Korean history, finance, law, tax, and mathematics.
The training data also includes examples that teach the model to understand the values of Korean society and human emotions, and to produce output that follows the given instructions.
|
- Training Instruction Dataset Format:

<pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>
|
|
|
# ❸ Use Cases

ktdsbaseLM v0.11 can be used in a wide range of application areas. For example:

- Education: question answering and explanation generation for study materials in history, mathematics, science, and other subjects.

- Business: answering legal, financial, and tax-related queries and summarizing documents.

- Research and culture: NLP tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.

- Customer service: generating conversations with users and providing personalized responses.

- This model offers high utility across a wide variety of NLP tasks.
|
|
|
# ❹ Limitations

- ktdsBaseLM v0.11 is specialized for the Korean language and Korean culture.
However, due to a lack of data in certain domains (e.g., the latest international materials, highly specialized fields), the accuracy of its responses about other languages or cultures may drop.
It may also show limited reasoning ability on problems that require complex logical thinking,
and if biased data was included in training, there is a possibility that biased responses will be generated.
|
|
|
# ❺ Usage Instructions

<pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = 'auto'  #@param {type: "string"}
model_id = ''    #@param {type: "string"}  # path or Hub id of this model

# Optional 4-bit quantized loading; omit quantization_config to load in full precision.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device)

tokenizer = AutoTokenizer.from_pretrained(model_id)

input_text = "안녕하세요."
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
</code></pre>
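For conversational use, wrapping the input in a chat template may give better results. This is a minimal sketch, assuming the tokenizer ships OpenChat 3.5's chat template (true for the upstream openchat/openchat_3.5 tokenizer, but not confirmed by this card for the fine-tuned model).

<pre><code>
# Build the conversation in OpenChat 3.5's expected format via the
# tokenizer's bundled chat template (assumed to be present).
messages = [{"role": "user", "content": "안녕하세요."}]
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(prompt_ids, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>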
|
|
|
## ktds plans to provide LLMs fine-tuned on various domains of Korean culture and knowledge, based not only on openchat but also on other leading LLMs such as LLaMA, Polyglot, and EEVE. These models will be tailored to better understand and generate content specific to Korean contexts.
|