---
base_model:
- openchat/openchat_3.5
language:
- ko
- en
library_name: adapter-transformers
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- finance
- biology
- legal
- art
- text-generation-inference
datasets:
- AIDX-ktds/ko_leaderboard
---
### ktdsbaseLM v0.11 was developed with OpenChat 3.5 as its foundation model so that it can be
### applied to the Korean language and to Korea's diverse culture. It was trained on in-house
### Korean data covering 53 domains to understand Korean social values and culture.
---
# ❶ Model Description
**Model Name and Key Features**:
KTDSbaseLM v0.11 is a Mistral 7B-based model fine-tuned from OpenChat 3.5 using SFT (supervised fine-tuning).
It is designed to understand Korean and various cultural contexts, utilizing data from 135 domains in Korean society.
The model supports tasks such as text generation, conversation inference, document summarization,
question answering, sentiment analysis, and other NLP tasks.
Its applications span fields like law, finance, science, education, business, and cultural research.
**Model Architecture**:
KTDSBaseLM v0.11 is a high-performance language model with 7 billion parameters based on the Mistral 7B model.
It uses OpenChat 3.5 as the foundation and is fine-tuned using SFT to excel in Korean language and culture.
The streamlined Mistral 7B architecture ensures fast inference and memory efficiency,
optimized for various NLP tasks like text generation, question answering, document summarization, and sentiment analysis.
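For reference, the original usage snippet in this card imports `peft`, which hints at adapter-based tuning. Below is a minimal, hypothetical sketch of how such an SFT setup with LoRA adapters could look; the rank, alpha, and target modules are illustrative assumptions, not the authors' actual recipe.
<pre><code>
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model_id = "openchat/openchat_3.5"  # the foundation model named in this card

# Load the base model in 4-bit to keep fine-tuning memory manageable (assumed setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters; these hyperparameters are illustrative, not the card's values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
</code></pre>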
---
# ❷ Training Data
KTDSbaseLM v0.11 was trained on 3.6GB of in-house data comprising 2.33 million instances of Q&A, summarization, and classification tasks.
Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, society, finance, law, tax,
mathematics, biology, physics, and chemistry, trained with the Chain-of-Thought method. An additional 1.3 million short-answer
questions cover 38 domains, including Korean history, finance, law, tax, and mathematics. The data also includes examples that
teach the model to understand Korean social values and human emotions and to respond appropriately to instructions.
**Training Instruction Dataset Format**:
`{"prompt": "prompt text", "completion": "ideal generated text"}`
---
# ❸ Use Cases
KTDSbaseLM v0.11 can be used across multiple fields, such as:
- **Education**: Answering questions and generating explanations for subjects like history, math, and science.
- **Business**: Providing responses and summaries for legal, financial, and tax-related queries.
- **Research and Culture**: Performing NLP tasks, sentiment analysis, document generation, and translation.
- **Customer Service**: Generating conversations and personalized responses for users.
This model is highly versatile in various NLP tasks.
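As an illustration of the business use case, a single domain-specific query can be sent through the `transformers` text-generation pipeline; the prompt and the empty `model_id` placeholder below are assumptions, not part of the card.
<pre><code>
from transformers import pipeline

model_id = ""  # placeholder: set to this model's Hugging Face Hub repository id
generator = pipeline("text-generation", model=model_id, device_map="auto")

# Hypothetical legal/financial prompt, for illustration only
prompt = "다음 계약 조항의 핵심 내용을 한 문단으로 요약해 주세요: ..."
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
</code></pre>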
---
# ❹ Limitations
KTDSBaseLM v0.11 is specialized for the Korean language and Korean culture.
Due to limited data in certain areas (e.g., up-to-date international material or highly
specialized fields), its responses on other languages or cultures may be less accurate.
It may also show limited reasoning on problems that require complex logical thinking,
and it may produce biased responses if biased data was included in training.
---
# ❺ Usage Instructions
<pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = ""  # set to this model's Hugging Face Hub repository id

# 4-bit quantized loading to reduce GPU memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

input_text = "안녕하세요."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
</code></pre>
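Because the foundation model is OpenChat 3.5, conversational prompts may work better through the tokenizer's chat template, assuming this fine-tune ships a chat template in its repository; the sketch below reuses the `model` and `tokenizer` loaded above.
<pre><code>
# Reuses the model and tokenizer from the snippet above
messages = [{"role": "user", "content": "한국의 전통 명절에 대해 설명해 주세요."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
</code></pre>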
---
KTDS plans to provide fine-tuned LLMs across various domains of Korean culture and knowledge,
including models based on not only OpenChat but also LLaMA, Polyglot, and EEVE.
These models will be tailored to better understand and generate content specific to Korean contexts.