eenzeenee committed on
Commit 55bea00 · 1 Parent(s): 48bbb45

Create README.md

Files changed (1): README.md +57 -0
---
pipeline_tag: summarization
language:
- ko
tags:
- T5
---

# t5-small-korean-summarization

This is a [T5](https://huggingface.co/docs/transformers/model_doc/t5) model for Korean text summarization.
- Fine-tuned from the ['paust/pko-t5-small'](https://huggingface.co/paust/pko-t5-small) model.
- Fine-tuned on the following three datasets:

- [Korean Paper Summarization Dataset (논문자료 요약)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)
- [Korean Book Summarization Dataset (도서자료 요약)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=93)
- [Korean Summary Statement and Report Generation Dataset (요약문 및 레포트 생성 데이터)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=90)

# Usage (HuggingFace Transformers)

```python
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases also need this for sent_tokenize
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned summarization model and its tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained('eenzeenee/t5-small-korean-summarization')
tokenizer = AutoTokenizer.from_pretrained('eenzeenee/t5-small-korean-summarization')

# T5-style task prefix expected by this model.
prefix = "summarize: "
sample = """
    안녕하세요? 우리 (2학년)/(이 학년) 친구들 우리 친구들 학교에 가서 진짜 (2학년)/(이 학년) 이 되고 싶었는데 학교에 못 가고 있어서 답답하죠?
    그래도 우리 친구들의 안전과 건강이 최우선이니까요 오늘부터 선생님이랑 매일 매일 국어 여행을 떠나보도록 해요.
    어/ 시간이 벌써 이렇게 됐나요? 늦었어요. 늦었어요. 빨리 국어 여행을 떠나야 돼요.
    그런데 어/ 국어여행을 떠나기 전에 우리가 준비물을 챙겨야 되겠죠? 국어 여행을 떠날 준비물, 교안을 어떻게 받을 수 있는지 선생님이 설명을 해줄게요.
    (EBS)/(이비에스) 초등을 검색해서 들어가면요 첫화면이 이렇게 나와요.
    자/ 그러면요 여기 (X)/(엑스) 눌러주(고요)/(구요). 저기 (동그라미)/(똥그라미) (EBS)/(이비에스) (2주)/(이 주) 라이브특강이라고 되어있죠?
    거기를 바로 가기를 누릅니다. 자/ (누르면요)/(눌르면요). 어떻게 되냐? b/ 밑으로 내려요 내려요 내려요 쭉 내려요.
    우리 몇 학년이죠? 아/ (2학년)/(이 학년) 이죠 (2학년)/(이 학년)의 무슨 과목? 국어.
    이번주는 (1주)/(일 주) 차니까요 여기 교안. 다음주는 여기서 다운을 받으면 돼요.
    이 교안을 클릭을 하면, 짜잔/. 이렇게 교재가 나옵니다 .이 교안을 (다운)/(따운)받아서 우리 국어여행을 떠날 수가 있어요.
    그럼 우리 진짜로 국어 여행을 한번 떠나보도록 해요? 국어여행 출발. 자/ (1단원)/(일 단원) 제목이 뭔가요? 한번 찾아봐요.
    시를 즐겨요 에요. 그냥 시를 읽어요 가 아니에요. 시를 즐겨야 돼요 즐겨야 돼. 어떻게 즐길까? 일단은 내내 시를 즐기는 방법에 대해서 공부를 할 건데요.
    그럼 오늘은요 어떻게 즐길까요? 오늘 공부할 내용은요 시를 여러 가지 방법으로 읽기를 공부할겁니다.
    어떻게 여러가지 방법으로 읽을까 우리 공부해 보도록 해요. 오늘의 시 나와라 짜잔/! 시가 나왔습니다 시의 제목이 뭔가요? 다툰 날이에요 다툰 날.
    누구랑 다퉜나 동생이랑 다퉜나 언니랑 친구랑? 누구랑 다퉜는지 선생님이 시를 읽어 줄 테니까 한번 생각을 해보도록 해요."""

inputs = [prefix + sample]

# Tokenize, truncating to the model's 512-token input limit, then generate a summary.
inputs = tokenizer(inputs, max_length=512, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=3, do_sample=True, min_length=10, max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
# Keep only the first complete sentence of the decoded summary.
result = nltk.sent_tokenize(decoded_output.strip())[0]

print('RESULT >>', result)
```