|
--- |
|
license: mit |
|
language: |
|
- ko |
|
metrics: |
|
- accuracy |
|
--- |
|
# Model Card for KorSciDeBERTa |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
KorSciDeBERTa๋ Microsoft DeBERTa ๋ชจ๋ธ์ ์ํคํ
์ณ๋ฅผ ๊ธฐ๋ฐ์ผ๋ก, ๋
ผ๋ฌธ, NTIS ์ฐ๊ตฌ๊ณผ์ , ํนํ, ๋ด์ค, ํ๊ตญ์ด ์ํค ๋ง๋ญ์น ์ด 146GB๋ฅผ ์ฌ์ ํ์ตํ ๋ชจ๋ธ์
๋๋ค. |
|
|
|
๋ง์คํน๋ ์ธ์ด ๋ชจ๋ธ๋ง ๋๋ ๋ค์ ๋ฌธ์ฅ ์์ธก์ ์ฌ์ ํ์ต ๋ชจ๋ธ์ ์ฌ์ฉํ ์ ์๊ณ , ์ถ๊ฐ๋ก ๋ฌธ์ฅ ๋ถ๋ฅ, ๋จ์ด ํ ํฐ ๋ถ๋ฅ ๋๋ ์ง์์๋ต๊ณผ ๊ฐ์ ๋ค์ด์คํธ๋ฆผ ์์
์์ ๋ฏธ์ธ ์กฐ์ ์ ํตํด ์ฌ์ฉ๋ ์ ์์ต๋๋ค. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
|
|
|
|
- **Developed by:** KISTI |
|
- **Model type:** deberta-v2 |
|
- **Language(s) (NLP):** ํ๊ธ(ko) |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository 1:** https://huggingface.co/kisti/korscideberta |
|
- **Repository 2:** https://aida.kisti.re.kr/ |
|
|
|
## Uses |
|
|
|
### Downstream Use - Load model directly |
|
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
|
ํํ์ ๋ถ์๊ธฐ(Mecab) ๋ฑ ์ค์น ํ์ - README |
|
|
|
git clone https://huggingface.co/kisti/korscideberta; cd korscideberta |
|
|
|
- **korscideberta-abstractcls.ipynb** |
|
|
|
from tokenization_korscideberta import DebertaV2Tokenizer |
|
|
|
from transformers import AutoModelForSequenceClassification |
|
|
|
|
|
tokenizer = DebertaV2Tokenizer.from_pretrained("kisti/korscideberta") |
|
|
|
model = AutoModelForSequenceClassification.from_pretrained("kisti/korscideberta", num_labels=6, hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1) |
|
|
|
#model = AutoModelForMaskedLM.from_pretrained("kisti/korscideberta") |
|
|
|
'''''' |
|
|
|
train_metrics = trainer.train().metrics |
|
|
|
trainer.save_metrics("train", train_metrics) |
|
|
|
trainer.push_to_hub() |
|
|
|
### Out-of-Scope Use |
|
|
|
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
|
์ด ๋ชจ๋ธ์ ์๋์ ์ผ๋ก ์ฌ๋๋ค์๊ฒ ์ ๋์ ์ด๋ ์์ธ๋ ํ๊ฒฝ์ ์กฐ์ฑํ๋๋ฐ ์ฌ์ฉ๋์ด์๋ ์ ๋ฉ๋๋ค. |
|
|
|
์ด ๋ชจ๋ธ์ '๊ณ ์ํ ์ค์ '์์ ์ฌ์ฉ๋ ์ ์์ต๋๋ค. ์ด ๋ชจ๋ธ์ ์ฌ๋์ด๋ ์ฌ๋ฌผ์ ๋ํ ์ค์ํ ๊ฒฐ์ ์ ๋ด๋ฆด ์ ์๊ฒ ์ค๊ณ๋์ง ์์์ต๋๋ค. ๋ชจ๋ธ์ ์ถ๋ ฅ๋ฌผ์ ์ฌ์ค์ด ์๋ ์ ์์ต๋๋ค. |
|
|
|
'๊ณ ์ํ ์ค์ '์ ๋ค์๊ณผ ๊ฐ์ ์ฌํญ์ ํฌํจํฉ๋๋ค: |
|
|
|
์๋ฃ/์ ์น/๋ฒ๋ฅ /๊ธ์ต ๋ถ์ผ์์์ ์ฌ์ฉ, ๊ณ ์ฉ/๊ต์ก/์ ์ฉ ๋ถ์ผ์์์ ์ธ๋ฌผ ํ๊ฐ, ์๋์ผ๋ก ์ค์ํ ๊ฒ์ ๊ฒฐ์ ํ๊ธฐ, (๊ฐ์ง)์ฌ์ค์ ์์ฑํ๊ธฐ, ์ ๋ขฐ๋ ๋์ ์์ฝ๋ฌธ ์์ฑ, ํญ์ ์ณ์์ผ๋ง ํ๋ ์์ธก ์์ฑ ๋ฑ. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
์ฐ๊ตฌ๋ชฉ์ ์ผ๋ก ์ ์๊ถ ๋ฌธ์ ๊ฐ ์๋ ๋ง๋ญ์น ๋ฐ์ดํฐ๋ง์ ์ฌ์ฉํ์์ต๋๋ค. ์ด ๋ชจ๋ธ์ ์ฌ์ฉ์๋ ์๋์ ์ํ ์์ธ๋ค์ ์ธ์ํด์ผ ํฉ๋๋ค. |
|
|
|
์ฌ์ฉ๋ ๋ง๋ญ์น๋ ๋๋ถ๋ถ ์ค๋ฆฝ์ ์ธ ์ฑ๊ฒฉ์ ๊ฐ์ง๊ณ ์๋๋ฐ๋ ๋ถ๊ตฌํ๊ณ , ์ธ์ด ๋ชจ๋ธ์ ํน์ฑ์ ์๋์ ๊ฐ์ ์ค๋ฆฌ ๊ด๋ จ ์์๋ฅผ ์ผ๋ถ ํฌํจํ ์ ์์ต๋๋ค: |
|
|
|
ํน์ ๊ด์ ์ ๋ํ ๊ณผ๋/๊ณผ์ ํํ, ๊ณ ์ ๊ด๋
, ๊ฐ์ธ ์ ๋ณด, ์ฆ์ค/๋ชจ์ ๋๋ ํญ๋ ฅ์ ์ธ ์ธ์ด, ์ฐจ๋ณ์ ์ด๊ฑฐ๋ ํธ๊ฒฌ์ ์ธ ์ธ์ด, ๊ด๋ จ์ด ์๊ฑฐ๋ ๋ฐ๋ณต์ ์ธ ์ถ๋ ฅ ์์ฑ ๋ฑ. |
|
|
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
๋
ผ๋ฌธ, NTIS ์ฐ๊ตฌ๊ณผ์ , ํนํ, ๋ด์ค, ํ๊ตญ์ด ์ํค ๋ง๋ญ์น ์ด 146GB |
|
|
|
### Training Procedure |
|
|
|
KISTI HPC NVIDIA A100 80G GPU 24EA์์ 2.5๊ฐ์๋์ 1,600,000 ์คํ
ํ์ต |
|
|
|
#### Preprocessing |
|
|
|
- ๊ณผํ๊ธฐ์ ๋ถ์ผ ํ ํฌ๋์ด์ (KorSci Tokenizer) |
|
- ๋ณธ ์ฌ์ ํ์ต ๋ชจ๋ธ์์ ์ฌ์ฉ๋ ์ฝํผ์ค๋ฅผ ๊ธฐ๋ฐ์ผ๋ก ๋ช
์ฌ ๋ฐ ๋ณตํฉ๋ช
์ฌ ์ฝ 600๋ง๊ฐ์ ์ฌ์ฉ์์ฌ์ ์ด ์ถ๊ฐ๋ [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/)์ ๊ธฐ์กด SentencePiece-BPE๊ฐ ๋ณํฉ๋์ด์ง ํ ํฌ๋์ด์ ๋ฅผ ์ฌ์ฉํ์ฌ ๋ง๋ญ์น๋ฅผ ์ ์ฒ๋ฆฌํ์์ต๋๋ค. |
|
- Total 128,100 words |
|
- Included special tokens ( < unk >, < cls >, < s >, < mask > ) |
|
- File name : spm.model, vocab.txt |
|
|
|
#### Training Hyperparameters |
|
|
|
- **model_size:** base |
|
- **num_train_steps:** 1,600,000 |
|
- **train_batch_size:** 4,096 * 4 accumulative update = 16,384 |
|
- **learning_rate:** 1e-4 |
|
- **max_seq_length:** 512 |
|
- **vocab_size:** 128,100 |
|
- **Training regime:** fp16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision --> |
|
|
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Data Card if possible. --> |
|
|
|
๋ณธ ์ธ์ด๋ชจ๋ธ์ ์ฑ๋ฅํ๊ฐ๋ ์ฐ๊ตฌ๊ณผ์ ๋ณด๊ณ ์ ๊ณผํ๊ธฐ์ ํ์ค๋ถ๋ฅ ํ์คํฌ์ ํ์ธํ๋ํ์ฌ ํ๊ฐํ๋ ๋ฐฉ์์ ์ฌ์ฉํ์์ผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ๋ ์๋์ ๊ฐ์ต๋๋ค. |
|
- ์ฐ๊ตฌ๊ณผ์ ๋ณด๊ณ ์ ๊ณผํ๊ธฐ์ ํ์ค๋ถ๋ฅ ํ๊ฐ ๋ฐ์ดํฐ์
(doi.org/10.23057/50), 145 Classes, 209,454 Training Set, 89,767 Test Set |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
F1-micro/macro: ์ ๋ต Top3 ์ค ์ต์ 1๊ฐ ์์ธก์ ์ฑ๊ณต ๊ธฐ์ค |
|
|
|
F1-strict: ์ ๋ต Top3 ์ค ์์ธกํ ์ ๋งํผ ์ฑ๊ณต ๊ธฐ์ค |
|
|
|
### Results |
|
|
|
F1-micro: 0.85, F1-macro: 0.52, F1-strict: 0.71 |
|
|
|
|
|
|
|
## Technical Specifications |
|
|
|
### Model Objective |
|
|
|
MLM is a technique in which you take your tokenized sample and replace some of the tokens with the < mask > token and train your model with it. The model then tries to predict what should come in the place of that < mask > token and gradually starts learning about the data. MLM teaches the model about the relationship between words. |
|
|
|
Eg. Suppose you have a sentence - 'Deep Learning is so cool! I love neural networks.', now replace few words with the < mask > token. |
|
|
|
Masked Sentence - 'Deep Learning is so < mask >! I love < mask > networks.' |
|
|
|
### Compute Infrastructure |
|
|
|
KISTI ๊ตญ๊ฐ์ํผ์ปดํจํ
์ผํฐ NEURON ์์คํ
. HPE ClusterStor E1000, Lustre, Slurm |
|
|
|
#### Hardware |
|
|
|
NVIDIA A100 80G GPU 24EA |
|
|
|
#### Software |
|
|
|
Python 3.9, Cuda 11.8, PyTorch 1.10 |
|
|
|
## Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
ํ๊ตญ๊ณผํ๊ธฐ์ ์ ๋ณด์ฐ๊ตฌ์ (2023) : ํ๊ตญ์ด ๊ณผํ๊ธฐ์ ๋ถ์ผ DeBERTa ์ฌ์ ํ์ต ๋ชจ๋ธ (KorSciDeBERTa). Version 1.0. ํ๊ตญ๊ณผํ๊ธฐ์ ์ ๋ณด์ฐ๊ตฌ์. |
|
|
|
|
|
## Model Card Authors |
|
|
|
๊น๊ฒฝ๋ฏผ, ๊น์ํฌ, ๊น์ฑ์ฐฌ, ์ด์น์ฐ. ํ๊ตญ๊ณผํ๊ธฐ์ ์ ๋ณด์ฐ๊ตฌ์ ์ธ๊ณต์ง๋ฅ๋ฐ์ดํฐ์ฐ๊ตฌ๋จ |
|
|
|
## Model Card Contact |
|
|
|
๊น๊ฒฝ๋ฏผ, kkmkorea kisti.re.kr |
|
|
|
|
|
|