---
language:
  - ko
metrics:
  - bleu
pipeline_tag: translation
---

# **🌊 Jeju Dialect and Standard Korean Bidirectional Translation Model (제주어, 표준어 양방향 번역 모델)**

## **1. Introduction**

### **🧑‍🤝‍🧑 Members**

- Bitamin 12th cohort: 구준회, 이서현, 이예린
- Bitamin 13th cohort: 김윤영, 김재겸, 이형석

### **GitHub Link**

- https://github.com/junhoeKu/Jeju_Translation.github.io

### **How to Use This Model**

You can run inference with this model through the `transformers` library.
The example below loads the model and generates a translation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text.
# Prefix the sentence with the [제주] or [표준] token that matches the
# translation direction (here [표준] marks a Standard Korean input).
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
# Model Output: 안녕하수꽈
```

---

```python
# Reuses `tokenizer`, `model`, and `device` from the previous example.

# Set up the input text.
# Prefix the sentence with the [제주] or [표준] token that matches the
# translation direction (here [제주] marks a Jeju dialect input).
input_text = "[제주] 안녕하수꽈"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
```
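
The two examples differ only in the direction token, so the pattern can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the released code: `DIRECTION_TOKENS`, `build_input`, and `translate` are hypothetical names, and the mapping of `[표준]` to Standard Korean input and `[제주]` to Jeju input is inferred from the examples above.

```python
# Hypothetical helper names; only the [표준]/[제주] prefixes come from the model card.
DIRECTION_TOKENS = {
    "standard": "[표준]",  # input sentence is Standard Korean
    "jeju": "[제주]",      # input sentence is Jeju dialect
}


def build_input(sentence, source):
    """Prefix a sentence with the token for its source variety."""
    if source not in DIRECTION_TOKENS:
        raise ValueError(f"source must be one of {sorted(DIRECTION_TOKENS)}")
    return f"{DIRECTION_TOKENS[source]} {sentence.strip()}"


def translate(sentences, source, tokenizer, model, device, max_length=64):
    """Translate a batch of sentences with an already loaded tokenizer and model."""
    texts = [build_input(s, source) for s in sentences]
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    # Pass the attention mask explicitly so padded positions are ignored.
    outputs = model.generate(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With `tokenizer`, `model`, and `device` loaded as in the first example, a call would look like `translate(["안녕하세요"], "standard", tokenizer, model, device)`.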

### **Parent Model**
- gogamza/kobart-base-v2
- https://huggingface.co/gogamza/kobart-base-v2

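## **2. Dataset (approximately 930,000 rows)**
- AI-Hub (Jeju dialect speech data + middle-aged speaker dialect speech data)
- GitHub (Kakao Brain JIT data)
- Other sources
  - Jeju dialect dictionary data (crawled from the Jeju Provincial Office website)
  - Song lyric translation data (collected by hand from the 뭐랭하맨 YouTube channel)
  - Book data (collected by hand from the books 제주방언 그 맛과 멋 and 부에나도 지꺼져도)
  - 2018 Jeju Dialect Oral Materials Collection (2018년도 제주어 구술 자료집; collected by hand and used as evaluation data)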

## **3. Hyperparameters**
- Epochs : 3
- Learning Rate : 2e-5
- Weight Decay : 0.01
- Batch Size : 32
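
For a rough sense of scale, the sketch below maps the settings above onto keyword names used by `transformers.Seq2SeqTrainingArguments` and estimates the number of optimizer steps. The card does not say which trainer was used, so `Seq2SeqTrainingArguments`, a single device, no gradient accumulation, and training on all of the roughly 930,000 rows are assumptions; `TRAINING_KWARGS`, `total_optimizer_steps`, and `build_training_args` are hypothetical names.

```python
import math

# Values from the list above; the keyword names follow Seq2SeqTrainingArguments
# (assumed trainer, not confirmed by the model card).
TRAINING_KWARGS = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "per_device_train_batch_size": 32,
}


def total_optimizer_steps(num_rows, batch_size, epochs):
    """Steps for a full run, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * epochs


def build_training_args(output_dir):
    """Build training arguments; transformers is imported lazily so the
    step arithmetic above can run without it installed."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


if __name__ == "__main__":
    # Assumes every one of the ~930,000 rows is used for training.
    print(total_optimizer_steps(930_000, 32, 3))
```

Under these assumptions the run comes to about 29,000 steps per epoch; the real figure depends on how the data was split between training and evaluation.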

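## **4. BLEU Score**
- Evaluated on the 2018 Jeju Dialect Oral Materials Collection (2018 제주어 구술 자료집)
  - Jeju -> Standard : 0.76
  - Standard -> Jeju : 0.5

- Evaluated on the validation split of the AI-Hub Jeju dialect speech data
  - Jeju -> Standard : 0.89
  - Standard -> Jeju : 0.77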
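
To make the metric concrete, here is a minimal pure-Python sketch of corpus-level BLEU on a 0 to 1 scale. The card does not state which BLEU implementation or tokenization produced the scores above, so whitespace tokenization, 4-gram precision, one reference per sentence, and the absence of smoothing are all assumptions; `ngrams` and `corpus_bleu` are hypothetical names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1], one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # whitespace tokenization (assumption)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(model_outputs, gold_translations)` on decoded model outputs and their gold translations gives one score per direction. Scores from a different tokenizer or BLEU library are not directly comparable with the numbers reported here.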

## **5. CREDIT**
- 구준회 : [email protected]
- 김윤영 : 202000872@hufs.ac.kr
- 김재겸 : [email protected]
- 이서현 : [email protected]
- 이예린 : [email protected]
- 이형석 : [email protected]