Korean Formal Convertor Using Deep Learning

Korean distinguishes formal speech (존댓말, jondaetmal) from informal speech (반말, banmal). This model is a convertor that rewrites informal speech as formal speech.
*The formal-speech datasets collected for this model contained two styles, "해요체" (haeyo-che) and "합쇼체" (hapsyo-che), but the model standardizes its output on "해요체".

| 합쇼체 (hapsyo-che) | *해요체 (haeyo-che) | Meaning |
| --- | --- | --- |
| 안녕하십니까. | 안녕하세요. | Hello. |
| 좋은 아침입니다. | 좋은 아침이에요. | Good morning. |
| 바쁘시지 않았으면 좋겠습니다. | 바쁘시지 않았으면 좋겠어요. | I hope you are not busy. |

Background

  • I previously trained a classifier that distinguishes formal from informal speech (https://github.com/jongmin-oh/korean-formal-classifier).
    The plan was to use the classifier to split data by speech style, but formal speech made up a relatively small share of the data, so I built this model to convert informal sentences into formal ones and increase the proportion of formal data.

Korean Formal Speech Convertor

  • The convertor is built on the T5 model architecture and performs a Text2Text generation task that converts informal speech into formal speech.
  • To use it right away, refer to the example code below and download the Hugging Face model ('j5ng/et5-formal-convertor').

Based on PLM (ET5)

Based on Datasets

  • AI Hub (https://www.aihub.or.kr/): Korean speech-style conversion corpus

    1. KETI everyday office conversations: 1,254 sentences
    2. Manually tagged parallel data
  • Smilegate speech-style dataset (Korean SmileStyle Dataset)

Preprocessing

  1. Separate informal and formal data (for formal speech, keep only "해요체")

    • From the Smilegate data, use only the ['formal','informal'] columns
    • From the manually tagged parallel data, use only the [".ban", ".yo"] txt files
    • From the KETI everyday office data, use only the ["반말","해요체"] columns
  2. Merge the datasets (all three datasets combined)

  3. Remove periods (.) and commas (,)

  4. Deduplicate on the informal column: 1,632 duplicate rows removed
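
Steps 2 to 4 can be sketched in plain Python as follows. This is a minimal illustration, not the author's actual script: the helper names `clean` and `preprocess` are hypothetical, the three input lists are toy stand-ins for the real sources, and keeping the first pair for each duplicated informal sentence is an assumption.

```python
import re


def clean(text: str) -> str:
    """Step 3: remove periods and commas, then trim surrounding whitespace."""
    return re.sub(r"[.,]", "", text).strip()


def preprocess(*sources):
    """Merge (informal, formal) pair lists, clean them, and deduplicate on informal."""
    seen = set()
    pairs = []
    for source in sources:  # step 2: merge all sources into one list
        for informal, formal in source:
            informal, formal = clean(informal), clean(formal)
            if informal in seen:  # step 4: assumption, the first occurrence wins
                continue
            seen.add(informal)
            pairs.append((informal, formal))
    return pairs


# Toy stand-ins for the three sources (the assignment of sentences to sources is made up)
smilegate = [("응 고마워.", "네 감사해요.")]
manual_tagged = [("괜찮겠어?", "괜찮으실까요?"), ("응 고마워", "네, 감사해요")]
keti = [("미세먼지가 많은 날이야.", "미세먼지가 많은 날이네요.")]

pairs = preprocess(smilegate, manual_tagged, keti)
for informal, formal in pairs:
    print(informal, "->", formal)
```

Cleaning before deduplicating follows the order of the steps above, so "응 고마워." and "응 고마워" count as the same informal sentence.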

Final training data examples

| informal | formal | Meaning |
| --- | --- | --- |
| 응 고마워 | 네 감사해요 | Yes, thank you |
| 나도 그 책 읽었어 굉장히 웃긴 책이였어 | 저도 그 책 읽었습니다 굉장히 웃긴 책이였어요 | I read that book too; it was a really funny book |
| 미세먼지가 많은 날이야 | 미세먼지가 많은 날이네요 | It's a day with a lot of fine dust |
| 괜찮겠어? | 괜찮으실까요? | Will you be okay? |
| 아니야 회의가 잠시 뒤에 있어 준비해줘 | 아니에요 회의가 잠시 뒤에 있어요 준비해주세요 | No, there's a meeting shortly; please get ready |

Total: 14,992 pairs
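
For a Text2Text model, each informal/formal pair becomes one source string and one target string. The sketch below shows one way to build such an example. It assumes the instruction prefix from the usage code ("존댓말로 바꿔주세요: ", roughly "Please change this to formal speech: ") was also used during training, which the model card does not state; `to_text2text` and the `source`/`target` keys are hypothetical names.

```python
# Prefix taken from the inference example; its use at training time is an assumption
PREFIX = "존댓말로 바꿔주세요: "


def to_text2text(informal: str, formal: str) -> dict:
    """Turn one (informal, formal) pair into a source/target example."""
    return {"source": PREFIX + informal, "target": formal}


example = to_text2text("미세먼지가 많은 날이야", "미세먼지가 많은 날이네요")
print(example["source"])
print(example["target"])
```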


How to use

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("j5ng/et5-formal-convertor")
tokenizer = T5Tokenizer.from_pretrained("j5ng/et5-formal-convertor")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# device = "mps" if torch.backends.mps.is_available() else "cpu"  # for Apple Silicon Macs (e.g. M1)

model = model.to(device) 

# Example input sentence
input_text = "나 진짜 화났어 지금"

# Encode the input sentence (the prefix means "Please change this to formal speech: ")
input_encoding = tokenizer("존댓말로 바꿔주세요: " + input_text, return_tensors="pt")

input_ids = input_encoding.input_ids.to(device)
attention_mask = input_encoding.attention_mask.to(device)

# Generate the output with the T5 model
output_encoding = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=128,
    num_beams=5,
    early_stopping=True,
)

# Decode the output sentence
output_text = tokenizer.decode(output_encoding[0], skip_special_tokens=True)

# Print the result
print(output_text) # 저 진짜 화났습니다 지금.

With Transformers Pipeline

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

model = T5ForConditionalGeneration.from_pretrained('j5ng/et5-formal-convertor')
tokenizer = T5Tokenizer.from_pretrained('j5ng/et5-formal-convertor')

typos_corrector = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    framework="pt",
)

input_text = "널 가질 수 있을거라 생각했어"
output_text = typos_corrector("존댓말로 바꿔주세요: " + input_text,
            max_length=128,
            num_beams=5,
            early_stopping=True)[0]['generated_text']

print(output_text) # 당신을 가질 수 있을거라 생각했습니다.
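
Because the convertor was built to enlarge the share of formal data, it is often applied to many sentences at once. The sketch below wraps the pipeline call in a hypothetical helper, `convert_batch`, which adds the prefix to each sentence and collects the generated text. `fake_pipe` is a stand-in so the sketch runs without downloading the model; its outputs are not model outputs.

```python
PREFIX = "존댓말로 바꿔주세요: "


def convert_batch(pipe, sentences, **generate_kwargs):
    """Prefix each informal sentence, run the pipeline, and return the generated strings."""
    prompts = [PREFIX + s for s in sentences]
    outputs = pipe(prompts, **generate_kwargs)
    results = []
    for out in outputs:
        # Depending on the call, a pipeline may return one dict per input
        # or a one-element list of dicts per input; accept both shapes.
        if isinstance(out, list):
            out = out[0]
        results.append(out["generated_text"])
    return results


def fake_pipe(prompts, **kwargs):
    """Stand-in for the real pipeline: strips the prefix and appends a polite ending."""
    return [{"generated_text": p[len(PREFIX):] + "요"} for p in prompts]


print(convert_batch(fake_pipe, ["고마워", "괜찮아"], max_length=128, num_beams=5))
```

With the real model, pass the `typos_corrector` pipeline created above as `pipe`, along with the same generation arguments (`max_length`, `num_beams`, `early_stopping`).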

Thanks to

The formal speech convertor was trained with GPU resources provided by the Artificial Intelligence Industry Cluster Agency (AICA, 인공지능산업융합사업단).
