Commit d6f3e4d by j5ng · 1 Parent(s): 84d2cd6

Update README.md

README.md CHANGED (+118 -0)
@@ -9,3 +9,121 @@ pipeline_tag: text2text-generation
  The distinction between 존댓말 (formal, honorific speech) and 반말 (informal, casual speech) is a core feature of Korean. This model is a converter (convertor) that rewrites informal speech as formal speech. <br>
  *The collected formal-speech data contained two styles, "해요체" (haeyo-che) and "합쇼체" (hapsyo-che); this model standardizes its output on "해요체".

+ |합쇼체 (hapsyo-che)|*해요체 (haeyo-che)|
+ |------|---|
+ |안녕하십니까.|안녕하세요.|
+ |좋은 아침입니다.|좋은 아침이에요.|
+ |바쁘시지 않았으면 좋겠습니다.|바쁘시지 않았으면 좋겠어요.|
+
+ ## Background
+ - I previously trained a classifier (https://github.com/jongmin-oh/korean-formal-classifier) that distinguishes formal from informal speech.<br>
+ The plan was to split data by speech style with that classifier, but formal sentences made up a relatively small share of the data, so this converter was built to turn informal sentences into formal ones and increase the proportion of formal data.
+
+ ## Korean Formal-Speech Converter
+ - The converter is built on the T5 model architecture and performs a Text2Text generation task that converts informal speech into formal speech.
+ - To use it right away, follow the example code below to download the Hugging Face model ('j5ng/et5-formal-convertor').
+
+ ## Based on PLM (ET5)
+ - ETRI(https://aiopen.etri.re.kr/et5Model)
+
+ ## Based on Datasets
+ - AI Hub (https://www.aihub.or.kr/): Korean speech-style conversion corpus
+   1. KETI everyday/office dialogue, 1,254 sentences
+   2. Manually tagged parallel data
+
+ - Smilegate speech-style dataset (korean SmileStyle Dataset)
+
+ ### Preprocessing
+ 1. Separate the informal and formal data (keeping only "해요체" as the formal side)
+   - Smilegate data: use only the ['formal','informal'] columns
+   - Manually tagged parallel data: use only the ["*.ban", "*.yo"] txt files
+   - KETI everyday/office data: use only the ["반말","해요체"] columns
+
+ 2. Merge the datasets (all three combined)
+ 3. Remove periods (.) and commas (,)
+ 4. Deduplicate on the informal column: 1,632 duplicate rows removed
+
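The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the author's actual script: the function names, the in-memory sample lists, and the choice to keep the first pair seen for each informal sentence are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (informal, formal)


def strip_punctuation(text: str) -> str:
    """Remove periods and commas, then trim surrounding whitespace."""
    return text.replace(".", "").replace(",", "").strip()


def merge_and_dedupe(*sources: Iterable[Pair]) -> List[Pair]:
    """Merge several (informal, formal) sources, clean both sides, and keep
    only the first pair seen for each informal sentence (an assumption)."""
    seen = set()
    merged: List[Pair] = []
    for source in sources:
        for informal, formal in source:
            informal = strip_punctuation(informal)
            formal = strip_punctuation(formal)
            if informal in seen:
                continue  # duplicate informal sentence: drop the row
            seen.add(informal)
            merged.append((informal, formal))
    return merged


# Hypothetical stand-ins for the three sources after column/file selection.
smilegate = [("응 고마워.", "네 감사해요.")]
manual_tagged = [("괜찮겠어?", "괜찮으실까요?")]
keti = [
    ("응 고마워", "네, 감사해요"),  # duplicate once punctuation is removed
    ("미세먼지가 많은 날이야.", "미세먼지가 많은 날이네요."),
]

pairs = merge_and_dedupe(smilegate, manual_tagged, keti)
print(len(pairs))  # 3
```

Cleaning before deduplication follows the order of the numbered steps, so rows that differ only in periods or commas collapse into one.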
+ ### Final training data examples
+ |informal|formal|
+ |------|---|
+ |응 고마워|네 감사해요|
+ |나도 그 책 읽었어 굉장히 웃긴 책이였어|저도 그 책 읽었습니다 굉장히 웃긴 책이였어요|
+ |미세먼지가 많은 날이야|미세먼지가 많은 날이네요|
+ |괜찮겠어?|괜찮으실까요?|
+ |아니야 회의가 잠시 뒤에 있어 준비해줘|아니에요 회의가 잠시 뒤에 있어요 준비해주세요|
+
+ #### total : 14,992 pairs
+
+ ***
+
+ ## How to use
+ ```python
+ import torch
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
+
+ # Load the T5 model and tokenizer
+ model = T5ForConditionalGeneration.from_pretrained("j5ng/et5-formal-convertor")
+ tokenizer = T5Tokenizer.from_pretrained("j5ng/et5-formal-convertor")
+
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
+ # device = "mps" if torch.backends.mps.is_available() else "cpu" # for Apple Silicon (e.g. M1)
+
+ model = model.to(device)
+
+ # Example input sentence (informal)
+ input_text = "나 진짜 화났어 지금"
+
+ # Encode the input; the prefix means "Please change this to formal speech: "
+ input_encoding = tokenizer("존댓말로 바꿔주세요: " + input_text, return_tensors="pt")
+
+ input_ids = input_encoding.input_ids.to(device)
+ attention_mask = input_encoding.attention_mask.to(device)
+
+ # Generate the output with the T5 model
+ output_encoding = model.generate(
+     input_ids=input_ids,
+     attention_mask=attention_mask,
+     max_length=128,
+     num_beams=5,
+     early_stopping=True,
+ )
+
+ # Decode the output sentence
+ output_text = tokenizer.decode(output_encoding[0], skip_special_tokens=True)
+
+ # Print the result
+ print(output_text) # 저 진짜 화났습니다 지금.
+ ```
+
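Because the converter was built to enlarge the share of formal data, it will often be applied to many sentences at once. The following is a minimal sketch of a batched helper, assuming the same prompt prefix and generation settings as the example above. `convert_batch`, `PREFIX`, and `batch_size` are hypothetical names introduced here for illustration; they are not part of the model or of the transformers API.

```python
from typing import List, Sequence

# Same task prefix as in the single-sentence example above.
PREFIX = "존댓말로 바꿔주세요: "


def convert_batch(model, tokenizer, sentences: Sequence[str],
                  device: str = "cpu", batch_size: int = 16) -> List[str]:
    """Convert informal sentences to formal ones in fixed-size batches.

    `model` and `tokenizer` are expected to be the objects loaded with
    from_pretrained("j5ng/et5-formal-convertor"), with the model already
    moved to `device`.
    """
    outputs: List[str] = []
    for start in range(0, len(sentences), batch_size):
        chunk = [PREFIX + s for s in sentences[start:start + batch_size]]
        # padding=True pads every prompt in the chunk to the same length
        encoding = tokenizer(chunk, return_tensors="pt", padding=True).to(device)
        generated = model.generate(
            input_ids=encoding.input_ids,
            attention_mask=encoding.attention_mask,
            max_length=128,
            num_beams=5,
            early_stopping=True,
        )
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs


# Usage, assuming `model`, `tokenizer`, and `device` from the example above:
# formal = convert_batch(model, tokenizer, ["응 고마워", "괜찮겠어?"], device=device)
```

Passing the model and tokenizer in as arguments keeps the helper free of global state, so the same function can be reused on CPU or GPU.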
+ ***
+
+ ## With Transformers Pipeline
+ ```python
+ import torch
+ from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline
+
+ model = T5ForConditionalGeneration.from_pretrained('j5ng/et5-formal-convertor')
+ tokenizer = T5Tokenizer.from_pretrained('j5ng/et5-formal-convertor')
+
+ # Wrap the model and tokenizer in a text2text-generation pipeline
+ formal_convertor = pipeline(
+     "text2text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     device=0 if torch.cuda.is_available() else -1,
+     framework="pt",
+ )
+
+ input_text = "널 가질 수 있을거라 생각했어"
+ output_text = formal_convertor("존댓말로 바꿔주세요: " + input_text,
+                                max_length=128,
+                                num_beams=5,
+                                early_stopping=True)[0]['generated_text']
+
+ print(output_text) # 당신을 가질 수 있을거라 생각했습니다.
+ ```
+
+ ## Thanks to
+ The formal-speech converter was trained with GPU resources provided by the Artificial Intelligence Industry Cluster Agency (AICA).