---
language:
  - ko
metrics:
  - bleu
pipeline_tag: translation
---

# **🌊 Jeju-Standard Bidirectional Translation Model (제주어, 표준어 양방향 번역 모델)**

## **1. Introduction**

### **🧑‍🤝‍🧑 Members**

- Bitamin 12th cohort : 구준회, 이서현, 이예린
- Bitamin 13th cohort : 김윤영, 김재겸, 이형석

### **GitHub Link**

### **How to use this Model**

- You can use this model with `transformers` to perform inference.
- The example below loads the model and generates a translation:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text
# Prefix the sentence with the [제주] or [표준] token that matches the translation direction
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
# Model Output: 안녕하수꽈
```

---

```python
# Reuses the tokenizer, model, and device loaded in the previous snippet

# Set up the input text
# Prefix the sentence with the [제주] or [표준] token that matches the translation direction
input_text = "[제주] 안녕하수꽈"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```
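
The two snippets above differ only in the direction token, so the tagging step can be wrapped in a small helper. This is a minimal sketch, not part of the released code: the names `DIRECTION_TAGS`, `add_direction_tag`, and `translate` are hypothetical. It also assumes the token names the variety of the *input* sentence, which is inferred from the example where `[표준] 안녕하세요` yields `안녕하수꽈`.

```python
# Hypothetical helper names; only the "[제주]" / "[표준]" tokens come from this card.
DIRECTION_TAGS = {"jeju": "[제주]", "standard": "[표준]"}


def add_direction_tag(text, source):
    """Prefix `text` with the token for its source variety ("jeju" or "standard").

    Assumption: the token marks the variety of the input sentence.
    """
    if source not in DIRECTION_TAGS:
        raise ValueError(f"source must be one of {sorted(DIRECTION_TAGS)}")
    return f"{DIRECTION_TAGS[source]} {text.strip()}"


def translate(text, source, tokenizer, model, device, max_length=64):
    """Tag the input, generate, and decode the first output sequence.

    `tokenizer`, `model`, and `device` are the objects created in the
    loading snippet; the calls mirror the ones used there.
    """
    tagged = add_direction_tag(text, source)
    input_ids = tokenizer(tagged, return_tensors="pt", truncation=True).input_ids.to(device)
    outputs = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the model loaded as shown earlier, `translate("안녕하세요", "standard", tokenizer, model, device)` should behave like the first example.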

### **Parent Model**
- gogamza/kobart-base-v2
- https://huggingface.co/gogamza/kobart-base-v2

## **2. Dataset - about 930,000 rows**
- AI-Hub (Jeju-language speech data + middle-aged speakers' dialect speech data)
- Github (Kakao Brain JIT dataset)
- Other
  - Jeju-language dictionary data (crawled from the Jeju Provincial Office website)
  - Song-lyric translation data (collected by hand from the 뭐랭하맨 YouTube channel)
  - Book data (collected by hand from the books 제주방언 그 맛과 멋 and 부에나도 지꺼져도)
  - 2018 Jeju-language oral records collection (2018년도 제주어 구술 자료집; collected by hand and used as evaluation data)

## **3. Hyperparameters**
- Epochs : 3
- Learning Rate : 2e-5
- Weight Decay : 0.01
- Batch Size : 32
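
For anyone trying to reproduce the fine-tuning, the values above can be written as a config fragment. This is only a sketch: the card does not say which training loop was used. The keyword names follow `transformers`' `TrainingArguments`, and mapping the batch size of 32 to a per-device value is an assumption.

```python
# Values are from this card; the keys are TrainingArguments keyword names.
hparams = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    # Assumption: the card does not say whether 32 is per-device or global.
    "per_device_train_batch_size": 32,
}
```

If the `Trainer` API is used, these could be passed as `Seq2SeqTrainingArguments(output_dir=..., **hparams)`, where `output_dir` is your own choice.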

## **4. BLEU Score**
- On the 2018 Jeju-language oral records collection (2018 제주어 구술 자료집)
  - Jeju -> Standard : 0.76
  - Standard -> Jeju : 0.5

- On the validation data of the AI-Hub Jeju-language speech data
  - Jeju -> Standard : 0.89
  - Standard -> Jeju : 0.77
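
The scores above are on a 0 to 1 scale. The card does not state which BLEU implementation or tokenization produced them, so the following is only an illustrative sketch of corpus-level BLEU. It assumes one reference per hypothesis, whitespace tokenization, n-grams up to 4, and no smoothing; the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1] with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than their references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


print(corpus_bleu(["a b c d e"], ["a b c d e"]))  # identical strings give 1.0
```

Scores from a different tokenizer or a smoothed variant will not be directly comparable with the numbers reported here.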

## **5. CREDIT**
- 구준회 : [email protected]
- 김윤영 : 202000872@hufs.ac.kr
- 김재겸 : [email protected]
- 이서현 : [email protected]
- 이예린 : [email protected]
- 이형석 : [email protected]