---
language:
- ko
metrics:
- bleu
pipeline_tag: text2text-generation
---

# Jeju-Standard Bidirectional Translation Model

## **1. Introduction**

### 🧑‍🤝‍🧑**Member**

- **Bitamin 12th cohort : 구준회, 이서현, 이예린**
- **Bitamin 13th cohort : 김윤영, 김재겸, 이형석**

### **Github Link**

- https://github.com/junhoeKu/Jeju_Translation.github.io

### **How to use this Model**

- You can use this model with the `transformers` library to perform inference.
- Below is an example of how to load the model and generate translations:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text
# Prefix the sentence with the [제주] or [표준] token that matches the translation direction
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```

```text
Model Output: 안녕하수꽈
```

---

```python
# Reuses `tokenizer`, `model`, and `device` from the previous example

# Set up the input text
# Prefix the sentence with the [제주] or [표준] token that matches the translation direction
input_text = "[제주] 안녕하수꽈"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```

```text
Model Output: 안녕하세요
```

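
The two examples above differ only in the token placed before the sentence, so a small wrapper can make the direction explicit. The following is a minimal sketch, not part of the model's API: `DIRECTION_TOKENS`, `add_direction_token`, `translate`, and the direction names are hypothetical helpers. It assumes, based on the examples above, that the token names the language of the input sentence (`[표준]` for Standard Korean input, `[제주]` for Jeju input).

```python
# Hypothetical helper names; only the two tokens come from the model card.
DIRECTION_TOKENS = {
    "standard_to_jeju": "[표준]",  # input is Standard Korean
    "jeju_to_standard": "[제주]",  # input is Jeju
}


def add_direction_token(sentence, direction):
    """Return the sentence prefixed with the token for the given direction."""
    if direction not in DIRECTION_TOKENS:
        raise ValueError(f"Unknown direction: {direction!r}")
    return f"{DIRECTION_TOKENS[direction]} {sentence.strip()}"


def translate(sentence, direction, tokenizer, model, device="cpu", max_length=64):
    """Translate one sentence with an already loaded tokenizer and model."""
    text = add_direction_token(sentence, direction)
    input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
    outputs = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


print(add_direction_token("안녕하세요", "standard_to_jeju"))
```

With the `tokenizer`, `model`, and `device` loaded as shown earlier, a call would look like `translate("안녕하세요", "standard_to_jeju", tokenizer, model, device)`.
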
### **Parent Model**

- gogamza/kobart-base-v2
- https://huggingface.co/gogamza/kobart-base-v2

## **2. Dataset - about 930,000 rows**

- AI-Hub (Jeju language utterance data + middle-aged speakers' dialect utterance data)
- Github (Kakao Brain JIT data)
- Others
  - Jeju language dictionary data (crawled from the Jeju Provincial Office website)
  - Song lyric translation data (collected by hand from the 뭐랭하맨 YouTube channel)
  - Book data (collected by hand from the books "제주방언 그 맛과 멋" and "부에나도 지꺼져도")
  - 2018 Jeju Language Oral Materials Collection (collected by hand; used as evaluation data)

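
This card does not describe how the parallel rows were formatted for training. One plausible way to serve both directions with a single model, sketched here purely as an assumption, is to expand each (Jeju, Standard) row into two examples, each prefixed with the `[제주]` or `[표준]` token used at inference time. The function name and the `source`/`target` keys are hypothetical.

```python
def build_bidirectional_examples(pairs):
    """Expand (jeju, standard) sentence pairs into two training examples each.

    Assumption: the direction token marks the language of the source
    sentence, matching the inference examples in this card.
    """
    examples = []
    for jeju, standard in pairs:
        # Jeju -> Standard
        examples.append({"source": f"[제주] {jeju}", "target": standard})
        # Standard -> Jeju
        examples.append({"source": f"[표준] {standard}", "target": jeju})
    return examples


# The only sentence pair taken from this card's usage examples.
rows = [("안녕하수꽈", "안녕하세요")]
for example in build_bidirectional_examples(rows):
    print(example)
```
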
## **3. Hyper Parameters**

- Epoch : 3 epochs
- Learning Rate : 2e-5
- Weight Decay : 0.01
- Batch Size : 32

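
For readers who want to reproduce a similar setup, the values above map onto `transformers` training arguments roughly as follows. This is a sketch under stated assumptions: the card does not say which training API was used, whether the batch size of 32 is per device or global, or how many of the roughly 930,000 rows went into training. `HYPERPARAMS`, `estimate_steps`, and `make_training_args` are hypothetical names.

```python
import math

# Values from this card; treating the batch size as per-device is an assumption.
HYPERPARAMS = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "per_device_train_batch_size": 32,
}


def estimate_steps(num_rows, batch_size, epochs):
    """Return (steps per epoch, total optimizer steps) on a single device."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


def make_training_args(output_dir):
    """Build Seq2SeqTrainingArguments from the values above."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **HYPERPARAMS)


# Assumes all ~930,000 rows are used for training, which overstates the count
# because part of the data was held out for evaluation.
steps_per_epoch, total_steps = estimate_steps(930_000, 32, 3)
print(steps_per_epoch, total_steps)  # 29063 87189
```
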
## **4. BLEU Score**

- On the 2018 Jeju Language Oral Materials Collection
  - Jeju -> Standard : 0.76
  - Standard -> Jeju : 0.5

- On the validation data of the AI-Hub Jeju language utterance data
  - Jeju -> Standard : 0.89
  - Standard -> Jeju : 0.77

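
The scores above are on a 0 to 1 scale. The card does not state which BLEU implementation or tokenization produced them, so the following is only an illustrative, dependency-free sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty) with whitespace tokenization and one reference per hypothesis. `corpus_bleu` and the placeholder sentences are hypothetical, and results from this sketch are not expected to match the reported numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1] with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each n-gram to its reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Placeholder tokens, not real model outputs.
print(round(corpus_bleu(["a b c d x"], ["a b c d e"]), 3))
```

Because the sketch applies no smoothing, any corpus with no matching 4-gram scores 0, which matters for short sentences.
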
## **5. CREDIT**

- 구준회 : [email protected]
- 김윤영 : [email protected]
- 김재겸 : [email protected]
- 이서현 : [email protected]
- 이예린 : [email protected]
- 이형석 : [email protected]