---
language:
  - ko
metrics:
  - bleu
pipeline_tag: text2text-generation
---

🌊 Jeju-Standard Bidirectional Translation Model (제주어-표준어 양방향 번역 모델)

1. Introduction

🧑‍🤝‍🧑 Members

  • Bitamin 12th cohort: 구준회, 이서현, 이예린
  • Bitamin 13th cohort: 김윤영, 김재겸, 이형석

Github Link

How to use this Model

  • You can use this model for inference with the transformers library.
  • Below is an example of how to load the model and generate translations:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Set up the device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Junhoee/Kobart-Jeju-translation")
model = AutoModelForSeq2SeqLM.from_pretrained("Junhoee/Kobart-Jeju-translation").to(device)

# Set up the input text
# Prepend a [제주] (Jeju) or [표준] (Standard) token matching the dialect
# of the input sentence, then the sentence itself
input_text = "[표준] 안녕하세요"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
Model Output: 안녕하수꽈

# Set up the input text
# Prepend a [제주] (Jeju) or [표준] (Standard) token matching the dialect
# of the input sentence, then the sentence itself
input_text = "[제주] 안녕하수꽈"

# Tokenize the input text
input_ids = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True).input_ids.to(device)

# Generate the translation
outputs = model.generate(input_ids, max_length=64)

# Decode and print the output
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model Output:", decoded_output)
Model Output: 안녕하세요
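The two snippets above differ only in the direction token that gets prepended, so they can be folded into a small helper. This is an illustrative sketch, not part of the released card: `tag_source` and `translate` are hypothetical names, and the already-loaded `tokenizer`, `model`, and `device` are passed in explicitly.

```python
def tag_source(text: str, source: str) -> str:
    """Prepend the direction token; `source` is the dialect of `text`,
    either "제주" (Jeju) or "표준" (Standard)."""
    if source not in ("제주", "표준"):
        raise ValueError('source must be "제주" or "표준"')
    return f"[{source}] {text}"

def translate(text, source, tokenizer, model, device, max_length=64):
    """Translate `text` from the `source` dialect into the other one,
    using an already-loaded tokenizer/model as in the snippets above."""
    input_ids = tokenizer(tag_source(text, source),
                          return_tensors="pt").input_ids.to(device)
    outputs = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example, `translate("안녕하세요", "표준", tokenizer, model, device)` should reproduce the first snippet's output.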

Parent Model

2. Dataset - about 930,000 rows

  • AI-Hub (Jeju dialect utterance data + middle-aged speakers' dialect utterance data)
  • Github (Kakao Brain JIT data)
  • Others
    • Jeju dictionary data (crawled from the Jeju Provincial Government website)
    • Song-lyric translation data (collected by hand from the 뭐랭하맨 YouTube channel)
    • Book data (collected by hand from the books 제주방언 그 맛과 멋 and 부에나도 지꺼져도)
    • 2018 Jeju oral recordings collection (collected by hand; used as the evaluation data)
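The card does not show how training examples were built from these sources, but the inference examples suggest each aligned (Jeju, Standard) sentence pair can serve both directions once the source-dialect token is prefixed. A minimal sketch under that assumption; `make_bidirectional_pairs` is a hypothetical helper, not released code.

```python
def make_bidirectional_pairs(jeju: str, standard: str):
    """From one aligned sentence pair, emit (input, target) examples for
    both translation directions. The [제주]/[표준] prefix marks the
    dialect of the *input* sentence, as in the inference examples."""
    return [
        (f"[제주] {jeju}", standard),    # Jeju -> Standard
        (f"[표준] {standard}", jeju),    # Standard -> Jeju
    ]
```

Applied over the whole corpus, this doubles each aligned pair into one example per direction.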

3. Hyper Parameters

  • Epochs : 3
  • Learning Rate : 2e-5
  • Weight Decay : 0.01
  • Batch Size : 32
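These settings map directly onto `Seq2SeqTrainingArguments` from transformers. A configuration sketch, assuming that class was used; `output_dir` and anything not listed above are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above; output_dir is an assumption.
training_args = Seq2SeqTrainingArguments(
    output_dir="kobart-jeju-translation",
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=32,
)
```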

4. BLEU Score

  • On the 2018 Jeju oral recordings collection

    • Jeju -> Standard : 0.76
    • Standard -> Jeju : 0.5
  • On the validation split of the AI-Hub Jeju dialect utterance data

    • Jeju -> Standard : 0.89
    • Standard -> Jeju : 0.77
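The scores above are BLEU: a brevity penalty times the geometric mean of modified n-gram precisions. A minimal stdlib sketch of sentence-level BLEU for illustration only; the reported numbers were presumably computed with a standard implementation, whose tokenization and smoothing may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: brevity penalty times the geometric mean of
    modified n-gram precisions for n = 1..max_n (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any zero precision collapses the geometric mean
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate sharing no words with the reference scores 0.0.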

5. CREDIT