---
license: other
language:
- ko
- en
- ja
- zh
pipeline_tag: fill-mask
---
# Model Card for GBST-KEByT5-large (1.23B #params)
<!-- Provide a quick summary of what the model is/does. -->
KEByT5: the GBST variant of the Korean-Enhanced/Enriched Byte-level Text-to-Text Transfer Transformer (T5), based on CharFormer (Tay et al., 2021).
For Korean, token candidates are chunked into spans of (1, 2, 3, 6, 9) bytes to form the candidate set, and the soft embedding sequence produced by GBST is downsampled to 1/3 of its length to improve training and inference efficiency.
## Prerequisites and Model Loading HOW-TO
๋ณธ ๋ชจ๋ธ์˜ ๊ตฌ๋™์„ ์œ„ํ•ด์„œ๋Š” GBSWT5 ๋ชจ๋“ˆ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
https://github.com/etri-crossmodal/gbswt5
์•„๋ž˜์™€ ๊ฐ™์ด pip๋ฅผ ํ†ตํ•ด ๋ชจ๋“ˆ์„ ์„ค์น˜ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ ์‚ฌ์šฉ ๋ฐฉ๋ฒ•์€ github๋ฅผ ์ฐธ์กฐํ•ด์ฃผ์‹ญ์‹œ์˜ค.
```
pip install git+https://github.com/etri-crossmodal/gbswt5.git
```
Alternatively, with a recent version of Transformers, the model can be used without any additional code as follows:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("etri-lirs/gbst-kebyt5-large-preview")
# Passing trust_remote_code=True, as below, automatically downloads and uses the required remote code.
model = AutoModelForSeq2SeqLM.from_pretrained("etri-lirs/gbst-kebyt5-large-preview", trust_remote_code=True)
```
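Continuing from the snippet above, a quick and non-authoritative way to see the downsampling described earlier is to compare the byte-level input length with the encoder's hidden sequence length. This sketch assumes the standard Transformers seq2seq interface (`model.get_encoder()`) and that the GBST encoder accepts the tokenizer's outputs unchanged:
```python
import torch

text = "ํ† ํฐ-ํ”„๋ฆฌ ๋ชจ๋ธ์€ ๋ณ„๋„์˜ ํ† ํฌ๋‚˜์ด์ € ์—†์ด ๋ฐ”์ดํŠธ ๋‹จ์œ„๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # The GBST layer lives in the encoder embedding; its soft embeddings are downsampled,
    # so the encoder hidden sequence is expected to be roughly 1/3 of the input byte length.
    hidden = model.get_encoder()(input_ids=enc.input_ids,
                                 attention_mask=enc.attention_mask).last_hidden_state

print(f"input ids: {enc.input_ids.shape[1]}, encoder hidden states: {hidden.shape[1]}")
```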
For downstream task training, we also recommend freezing the GBST layer as in the Python code below.
```python
gbst_frozen_target = ['encoder.embed_tokens.embeds.weight',
                      'encoder.embed_tokens.positional_convol.2.convol.weight',
                      'encoder.embed_tokens.positional_convol.2.convol.bias',
                      'encoder.embed_tokens.positional_convol.2.proj.weight',
                      'encoder.embed_tokens.positional_convol.2.proj.bias',
                      'encoder.embed_tokens.cand_scoring.0.weight',
                      'encoder.embed_tokens.cand_scoring.0.bias',
                      # Leaving the embedding weight unfrozen generally yields better performance.
                      #'shared.weight',
                      ]

print("** GBST Model found, freeze GBSWT layer for training downstream.")
# Here `model` is the model loaded above; inside a trainer class this would be e.g. self.model.
for name, param in model.named_parameters():
    if name in gbst_frozen_target:
        print(f"** freeze {name} layer.")
        param.requires_grad = False
    else:
        param.requires_grad = True
```
For reference, the remote code bundled with the model includes the following open-source software:
* This software includes the lucidrains/charformer-pytorch GitHub project for the GBST implementation, which is distributed under the MIT License. Copyright (c) 2021 Phil Wang. All rights reserved. (Original Code URL: https://github.com/lucidrains/charformer-pytorch)
* This software includes HuggingFace transformers' T5 implementation for the GBST-enabled T5 model, which is distributed under the Apache 2.0 License. Copyright 2018- The HuggingFace team. All rights reserved.
## KEByT5: Korean-Enhanced/Enriched Byte-level Text-to-Text Transfer Transformer(T5)
A cross-modal, multilingual-friendly, token-free encoder-decoder pretrained language model centered on Korean
(Cross-modal, Multilingual Friendly, Token-free Encoder-Decoder Pretrained Language Model for Korean)
* This pretrained language model aims to be a token-free model that facilitates knowledge exchange across languages and across non-text modalities such as vision and audio.
* No separate tokenizer is required, but for convenience AutoTokenizer.from_pretrained() can be used so that the model is handled exactly like other tokenizer-based encoder-decoder models. If you want to skip the tokenizer, split the UTF-8 input into bytes and add +3 to each byte value to obtain the Token IDs (i.e., byte value 0 == Token ID 3, byte value 255 == Token ID 258); see the sketch after this list.
* The model is currently in a preview stage and requires fine-tuning for practical use.
* With gradient-based subword tokenization [(Gradient-based Subword Tokenization; CharFormer; Tay et al., 2021)](https://arxiv.org/abs/2106.12672), GBST improves training speed on KLUE-MRC by 2.7x and inference speed by more than 1.46x compared with the KEByT5-base model of the same scale. Some training/inference metrics may differ measurably; see the evaluation results below for details.
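As a minimal sketch of the byte-to-id mapping above (the `AutoTokenizer` route remains the convenient path), the Token IDs can also be built by hand from the UTF-8 bytes; the tokenizer output is assumed to match up to appended special tokens such as `</s>`:
```python
from transformers import AutoTokenizer

text = "ํ•œ๊ตญ์–ด"
# Each UTF-8 byte maps to (byte value + 3); ids 0-2 are reserved for special tokens.
manual_ids = [b + 3 for b in text.encode("utf-8")]
print(manual_ids)

tokenizer = AutoTokenizer.from_pretrained("etri-lirs/gbst-kebyt5-large-preview")
print(tokenizer(text).input_ids)  # expected: manual_ids plus trailing special token(s)
```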
## Acknowledgements
* This pretrained language model was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) in 2022 (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).
# Model Details
This family of pretrained language models comes in the following sizes:
* kebyt5-small : 330M [link](https://huggingface.co/etri-lirs/kebyt5-small-preview)
* kebyt5-base : 580M [link](https://huggingface.co/etri-lirs/kebyt5-base-preview)
* kebyt5-large : 1.23B [link](https://huggingface.co/etri-lirs/kebyt5-large-preview)
* GBST-kebyt5-base : 584M [link](https://huggingface.co/etri-lirs/gbst-kebyt5-base-preview)
* GBST-kebyt5-large : 1.23B (this model)
์ด๋“ค ๋ชจ๋ธ์€ [google/byt5-small](https://huggingface.co/google/byt5-small), [google/byt5-base](https://huggingface.co/google/byt5-base), [google/byt5-large](https://huggingface.co/google/byt5-large) ๋ชจ๋ธ๊ณผ ๋™์ผํ•œ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์™€ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ํ† ํฌ๋‚˜์ด์ €(ByT5Tokenizer)์™€ ๊ตฌํ˜„ ์ƒ ๋‘ ๋ชจ๋ธ์€ ๋ณ„๋„์˜ ์ˆ˜์ •์—†์ด ๋ฐ”๋กœ ๊ตํ™˜ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
huggingface transformers์—์„œ์˜ ์‚ฌ์šฉ๋ฒ• ์—ญ์‹œ, T5ForConditionalGeneration์„ ๋™์ผํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
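A minimal sketch of this interchangeability, assuming access to both repositories (note that the GBST variants additionally need the gbswt5 remote code described above); the example pairs the google/byt5-small tokenizer with a KEByT5 checkpoint:
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# The ByT5Tokenizer shipped with google/byt5-* can be reused unchanged for KEByT5.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("etri-lirs/kebyt5-small-preview")

inputs = tokenizer("์•ˆ๋…•ํ•˜์„ธ์š”.", return_tensors="pt")
# The pretrained checkpoint is denoising-only, so the output is not meaningful without fine-tuning.
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```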
## Model Description
<!-- Provide a longer summary of what this model is. -->
- **Developed by:** Language Intelligence Research Section, Electronics and Telecommunications Research Institute (ETRI)
- **Model type:** Encoder-Decoder Transformer, specifically, ByT5.
- **Language(s) (NLP):** Korean, English (partially, for translation tasks), Chinese (partially, for translation tasks), Japanese (partially, for translation tasks).
- **License:** Apache 2.0 License
- **Finetuned from model:** kebyt5-small/-base/-xl model weights were initialized from google/byt5-* weights for warm-start pretraining.
## Model Sources
- **Repository:** For downstream task training, https://github.com/etri-crossmodal/llm-downstream-s2s
- **Paper:** Shin et al., "Towards Korean-Centric Token-free Pretrained Language Model" (ํ•œ๊ตญ์–ด ์ค‘์‹ฌ์˜ ํ† ํฐ-ํ”„๋ฆฌ ์–ธ์–ด ์ดํ•ด-์ƒ์„ฑ ๋ชจ๋ธ ์‚ฌ์ „ํ•™์Šต ์—ฐ๊ตฌ), in Proc. of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 711-715, 2023.
# Uses
Use of this pretrained language model is restricted to research and educational purposes.
## Direct Use
The released checkpoints are trained only with the corrupted-span denoising objective used for T5 pretraining, so fine-tuning is required before applying them to real downstream tasks.
Masked token prediction can be performed using the sentinel tokens (token ids 258, 257, 256, ...), but the predicted content may be inappropriate; a minimal sketch follows below.
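A hedged sketch of such masked token prediction, using only the byte+3 mapping and the sentinel id 258 stated in this card, and reusing the `model` loaded earlier; the exact output format of the denoising decoder is not guaranteed, and the predicted span may be of low quality without fine-tuning:
```python
import torch

prefix = "๋Œ€ํ•œ๋ฏผ๊ตญ์˜ ์ˆ˜๋„๋Š” ".encode("utf-8")
suffix = "์ž…๋‹ˆ๋‹ค.".encode("utf-8")

# Build input ids by hand: byte value + 3, sentinel id 258 for the masked span,
# then </s> (eos id 1, following the usual T5 convention).
input_ids = torch.tensor([[b + 3 for b in prefix] + [258] + [b + 3 for b in suffix] + [1]])

outputs = model.generate(input_ids, max_new_tokens=32)

# Drop special ids (< 3) and the top byte ids reused as sentinels (here, >= 256),
# then shift the remaining ids back to raw UTF-8 bytes.
pred = bytes(t - 3 for t in outputs[0].tolist() if 3 <= t < 256)
print(pred.decode("utf-8", errors="replace"))
```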
## Downstream Use
As a token-free model, it is robust to complex or noisy inputs and is well suited to generating short sequences (e.g., language understanding, dialogue response generation).
Because pretraining used inputs of up to 1024 bytes, the model may not be suitable for problems involving longer sequences.
For longer sequences, we recommend using a [GBST-based token-free language model](https://huggingface.co/etri-lirs/gbst-kebyt5-base-preview).
# Bias, Risks, Limitations, and Recommendations
Information obtained through masked token prediction carries the same risks as other generative language models. The training data were not specially filtered for profanity, obscenity, political content, or other coarse language. The model may therefore generate socially unacceptable tokens or text, and it is difficult to predict what it will produce in response to offensive inputs, depending on the surrounding context.
The model was trained mainly on Korean text, and it is best suited to downstream tasks that can transfer those characteristics, in particular classification, summarization, and short-sentence generation. While out-of-vocabulary items cannot occur at the input/output level, text sequences not seen during pretraining require additional domain-adaptation training and downstream fine-tuning.
## How to Get Started with the Model
With Transformers 4.27.0 or later, the model and tokenizer can be used with the following Python code. As mentioned above, the gbswt5 module must be imported before loading the model through transformers:
```python
import gbswt5
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("etri-lirs/gbst-kebyt5-large-preview")
model = AutoModelForSeq2SeqLM.from_pretrained("etri-lirs/gbst-kebyt5-large-preview")
```
# Training Details
## Training Data
๋ณธ ์‚ฌ์ „ํ•™์Šต์—๋Š” ์•„๋ž˜์˜ ๊ณต๊ฐœ ๋ฐ์ดํ„ฐ๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:
* ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ์‹ ๋ฌธ v2.0
* ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ๊ตฌ์–ด ๋ง๋ญ‰์น˜ v1.2
* ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ๋ฌธ์–ด ๋ง๋ญ‰์น˜ v1.0
* ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ์‹ ๋ฌธ 2020 v1.0
* ๊ตญ๋ฆฝ๊ตญ์–ด์›, ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜. ์‹ ๋ฌธ 2021 v1.0
* ํ•œ๊ตญ์–ด ์œ„ํ‚คํ”ผ๋””์–ด ๋คํ”„, [v2020.09.20](https://github.com/lovit/kowikitext)
* [๋‚˜๋ฌด์œ„ํ‚ค ๋คํ”„](https://github.com/lovit/namuwikitext)
* ํ•œ๊ตญ์ •๋ณดํ™”์ง„ํฅ์›, AIHub. ์ „๋ฌธ๋ถ„์•ผ ๋ง๋ญ‰์น˜, ๋ฒ•๋ฅ /ํŠนํ—ˆ ์ง€์‹๋ฒ ์ด์Šค, ๋…ผ๋ฌธ/๋„์„œ/๋Œ€ํ™”/๋Œ€๋ณธ ์š”์•ฝ, ํ•œ์˜/ํ•œ์ผ/ํ•œ์ค‘ ๋ฒˆ์—ญ ๋ง๋ญ‰์น˜, ์ฝœ์„ผํ„ฐ/์ฃผ๋ฌธ/๋‰ด์Šค๊ธฐ์‚ฌ/์‹œ๊ฐ์ •๋ณด ์งˆ์˜์‘๋‹ต, ๋ฐฉ์†ก/ํšŒ์˜/์ƒ๋‹ด ์Œ์„ฑ์ธ์‹ ๋ฐ์ดํ„ฐ.
* ํ•œ๊ตญ์ •๋ณดํ™”์ง„ํฅ์›, AIHub. ๋Œ€๊ทœ๋ชจ ์›น๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜ ํ•œ๊ตญ์–ด ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ
* ํ•œ๊ตญ์ •๋ณดํ™”์ง„ํฅ์›, AIHub. ์˜จ๋ผ์ธ ๊ตฌ์–ด์ฒด ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ.
* [KcBERT ๋ง๋ญ‰์น˜, v2022.3Q](https://github.com/Beomi/KcBERT)
๋˜ํ•œ, ์†Œ๋Ÿ‰์˜ ์ž์ฒด ๊ตฌ์ถ•๋œ ๋ฐ์ดํ„ฐ ๋ฐ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ์ผ๋ถ€๋ฅผ ์‚ฌ์šฉ, ์ „์ฒด ์•ฝ ~220GB ๊ฐ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
# Evaluation
## Testing Data, Factors, Metrics & Results
Evaluation used the dev set of the [KLUE dataset, v1.1](https://klue-benchmark.com/), a benchmark for Korean language understanding tasks.
All outputs were produced by directly generating the target labels with seq2seq.
All models were trained with a fixed effective batch size of 16 for 4 epochs, a fixed learning rate chosen by parameter count, and a Cosine-Annealing LR scheduler (min lr=1e-7, restarts=4, gamma=0.7). The detailed test environment is as described in Shin et al., 2023.
For this model (GBST-KEByT5-Large), released after that paper, downstream tasks were trained with a per-task learning rate between 4.6e-5 and 6.2e-5; all other conditions were kept identical.
The trainer used for the fine-tuning experiments below is also publicly available and can be used to train other huggingface encoder-decoder models (e.g., BART) as well: https://github.com/etri-crossmodal/llm-downstream-s2s
| models | KLUE-TC(YNAT) (F1) | KLUE-NER (Entity, Char F1) | KLUE-DP (UAS, LAS) | KLUE-MRC (EM, ROUGE-W) |
|-------------|---------------|--------------|-------------------|------------------|
| google/byt5-large (1.23B) | 78.52 | 48.81, 63.95 | 44.26, 7.805 | _NOT TESTED_ |
| KEByT5-Base (580M) | 84.99 | 86.75, 91.05 | 88.70, 85.90 | 62.28, 68.38 |
| GBST-KEByT5-Base (584M) | 85.29 | 87.35, 92.09 | 88.33, 85.00 | 59.69, 66.44 |
| KEByT5-Large (1.23B) | 85.68 | 88.09, 92.40 | 87.18, 85.52 | 70.07, 75.81 |
| GBST-KEByT5-Large (1.23B) | 85.72(LR 4e-5) | 87.22, 91.54(LR 4.6e-5) | -, - | 68.6, 74.33 (LR 6.2e-5) |
๋Œ€ํ™” ์ƒํƒœ ์ถ”์ (DST; Dialogue State Tracking) ํƒœ์Šคํฌ์ธ KLUE-WOS-v1.1 ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ํ‰๊ฐ€๋Š” ๋ชจ๋‘ seq2seq์„ ์ด์šฉํ•œ ๋‹ค์ด์–ผ๋กœ๊ทธ ์ƒํƒœ ์ง์ ‘ ์ƒ์„ฑ์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค:
| models | WOS (JGA, %) | WOS (F1, %) |
| ------- | ---------- | ----------- |
| klue/klue-roberta-large | 50.22 | 92.23 |
| KEByT5-Base (580M) | 77.15 | 96.92 |
| GBST-KEByT5-Base (584M) | 75.94 | 96.73 |
| KEByT5-Large (1.23B) | 78.54 | 97.28 |
| GBST-KEByT5-Large (1.23B) | -(not tested yet) | - |
Results on KLUE-RE-v1.1, a relation extraction (RE) task, are as follows. Micro F1 is reported over the 29 relation classes, excluding no_relation:
| models | KLUE-RE (F1, %) |
| ------- | ---------- |
| klue/klue-roberta-base | 65.90 |
| KEByT5-Base (580M) | 65.48 |
| KEByT5-Large (1.23B) | 68.95 |
| GBST-KEByT5-Large (1.23B) | -(not tested yet) |
GBST ์ ์šฉ์„ ํ†ตํ•œ ํšจ์œจํ™” ๊ฐœ์„ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‰๊ฐ€๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ‰๊ฐ€ ํ™˜๊ฒฝ์€ A100 PCIE 80GB๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ์œผ๋ฉฐ, ์ •๋ฐ€๋„๋Š” bfloat16์—์„œ ์ธก์ •๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
ํ•™์Šต ๋ฐ ํ‰๊ฐ€์—๋Š” KLUE-MRC ๋ฐ์ดํ„ฐ์…‹์ด ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋“ค ๋ฐ์ดํ„ฐ์…‹์˜ ๊ธธ์ด๋Š” ์ตœ๋Œ€ 6800 bytes์˜ ๋ฌธ๋งฅ์ด ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค.
| model | training sample/sec. | inference sample/sec. |
| ----- | -------------------- | --------------------- |
| KEByT5-base (580M) | 1.30 | 3.95 |
| GBST-KEByT5-base (584M) | 3.56 | 5.77 |
| GBST-KEByT5-Large (1.23B) | 2.02 | not tested |
## Compute Infrastructure
* Trained on 8x NVIDIA A100 80GB GPUs
# Citations
* Shin et al., "Towards Korean-Centric Token-free Pretrained Language Model" (ํ•œ๊ตญ์–ด ์ค‘์‹ฌ์˜ ํ† ํฐ-ํ”„๋ฆฌ ์–ธ์–ด ์ดํ•ด-์ƒ์„ฑ ๋ชจ๋ธ ์‚ฌ์ „ํ•™์Šต ์—ฐ๊ตฌ), in Proc. of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 711-715, 2023.
* Heo et al., "Relation Extraction Using a Generative Language Model" (์ƒ์„ฑํ˜• ์–ธ์–ด๋ชจ๋ธ์„ ์ด์šฉํ•œ ๊ด€๊ณ„ ์ถ”์ถœ), in Proc. of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 708-710, 2023.
* Lee et al., "Korean Generation-based Dialogue State Tracking Using the Korean Token-free Pretrained Language Model KeByT5" (ํ•œ๊ตญ์–ด ํ† ํฐ-ํ”„๋ฆฌ ์‚ฌ์ „ํ•™์Šต ์–ธ์–ด๋ชจ๋ธ KeByT5๋ฅผ ์ด์šฉํ•œ ํ•œ๊ตญ์–ด ์ƒ์„ฑ ๊ธฐ๋ฐ˜ ๋Œ€ํ™” ์ƒํƒœ ์ถ”์ ), in Proc. of the 35th Annual Conference on Human and Cognitive Language Technology, pp. 644-647, 2023.
# Model Card Authors/Contacts
Jong-hun Shin(ETRI), e-mail=jhshin82 _AT_ etri _DOT_ re _DOT_ kr.