BM-K committed on
Commit
bfac658
1 Parent(s): 257a755

Create README.md

Files changed (1)
  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
+ # 😎 KoChatBART
+ [**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text, and the model learns to restore the original. Korean chat BART (hereafter **KoChatBART**) is a Korean `encoder-decoder` language model trained on roughly **10GB** or more of Korean dialogue text, using the `Text Infilling` noise function from the paper. We release the resulting `KoChatBART-base`, which is robust at dialogue generation.
+
+ <img src=https://user-images.githubusercontent.com/55969260/205434343-b72641e9-d0f9-4b88-a334-9f904e0a35c5.png>
+
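To make the `Text Infilling` objective concrete, here is a minimal sketch of the noise function as described in the BART paper: spans of tokens are replaced by a single mask token, with span lengths drawn from a Poisson distribution. The mask string, the 30% masking ratio and λ = 3 follow the paper's defaults and are assumptions here; this README does not state the values used for KoChatBART. Zero-length spans (pure mask insertions) are skipped for simplicity.

```python
import math
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Replace random spans of `tokens` with a single MASK each.

    At most ceil(len(tokens) * mask_ratio) tokens are hidden in total.
    """
    rng = random.Random(seed)
    budget = math.ceil(len(tokens) * mask_ratio)  # tokens still to hide
    covered = [False] * len(tokens)               # True where a token is hidden
    spans = {}                                    # span start -> span length
    attempts = 0
    while budget > 0 and attempts < 100:
        attempts += 1
        length = min(sample_poisson(lam, rng), budget)
        if length == 0:
            continue  # simplification: ignore zero-length spans
        start = rng.randrange(0, len(tokens) - length + 1)
        if any(covered[start:start + length]):
            continue  # overlaps an earlier span, try again
        for i in range(start, start + length):
            covered[i] = True
        spans[start] = length
        budget -= length

    # Rebuild the sequence, emitting one MASK per hidden span.
    noisy, i = [], 0
    while i < len(tokens):
        if i in spans:
            noisy.append(MASK)
            i += spans[i]
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy


if __name__ == "__main__":
    print(text_infilling("오늘 점심 뭐 먹을지 아직 못 정했어 너는 뭐 먹어".split()))
```

During pre-training the encoder would see the noisy sequence and the decoder would be trained to reproduce the original one. Because each span collapses to one mask, the model also has to infer how many tokens are missing.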
+ ## Quick tour
+ ```python
+ from transformers import AutoTokenizer, BartForConditionalGeneration
+
+ tokenizer = AutoTokenizer.from_pretrained("BM-K/KoChatBART")
+ model = BartForConditionalGeneration.from_pretrained("BM-K/KoChatBART")
+
+ inputs = tokenizer("안녕 세상아!", return_tensors="pt")
+ outputs = model(**inputs)
+ ```
+
+ ## Pre-training data preprocessing
+ Datasets used:
+ - [Topic-based everyday text conversation data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=543)
+ - [Small-business customer order question-answer text](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=102)
+ - [Korean SNS](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=114)
+ - [AI language data for civil-complaint task automation](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=619)
+
+ To train KoChatBART, we preprocessed several Korean dialogue datasets and merged them into a large Korean dialogue corpus.
+ 1. To reduce redundancy, expressions repeated two or more times in a row, such as 'ㅋㅋㅋㅋㅋㅋ', were shortened to two repetitions, such as 'ㅋㅋ'.
+ 2. Because very short samples can hinder training, we kept only samples whose total token length exceeds 3 under the KoBART tokenizer.
+ 3. Pseudonymized data was removed.
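The three preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' script: the regular expression, the `#@...#` pseudonym placeholder format, and the whitespace tokenizer used as a stand-in for the KoBART tokenizer are all hypothetical choices made for the example.

```python
import re
from typing import Callable, Iterable, List

# Any non-digit character repeated three or more times in a row.
# Digits are excluded (an assumption) so numbers such as 1000 survive.
REPEAT = re.compile(r"(\D)\1{2,}")

# Hypothetical marker for pseudonymized fields; the real datasets may differ.
PSEUDONYM = re.compile(r"#@[^#]+#")


def collapse_repeats(text: str) -> str:
    """Shorten runs like 'ㅋㅋㅋㅋㅋㅋ' to exactly two characters ('ㅋㅋ')."""
    return REPEAT.sub(r"\1\1", text)


def preprocess(
    utterances: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer
    min_tokens: int = 3,
) -> List[str]:
    """Apply the three filters; keep texts with more than `min_tokens` tokens."""
    kept = []
    for text in utterances:
        if PSEUDONYM.search(text):
            continue  # step 3: drop pseudonymized samples
        text = collapse_repeats(text)  # step 1: normalize repeats
        if len(tokenize(text)) > min_tokens:  # step 2: drop very short samples
            kept.append(text)
    return kept


if __name__ == "__main__":
    samples = [
        "안녕 ㅋㅋㅋㅋ",
        "오늘 날씨 정말 좋다 ㅎㅎㅎ",
        "#@이름# 님 오늘 뭐 해요 ?",
    ]
    print(preprocess(samples))
```

To follow the README more closely, `tokenize` could be replaced by the `tokenize` method of a loaded KoBART tokenizer, so that the length threshold is measured in subword tokens rather than whitespace-separated words.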
+
+ ## Model
+
+ | Model | # of params | vocab size | Type | # of layers | # of heads | ffn_dim | hidden_dims |
+ | ------------- | :---------: | :-----: | :----------: | ---------: | ------: | ----------: | ----------: |
+ | `KoChatBART` | 139M | 50265 | Encoder | 6 | 16 | 3072 | 768 |
+ | | | | Decoder | 6 | 16 | 3072 | 768 |
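As a sanity check on the table, the parameter count can be estimated from the listed dimensions. The sketch below assumes the standard BART layout (one embedding matrix shared by encoder, decoder and output layer; learned positional embeddings; biases on every linear layer). The 1024 maximum positions and the positional offset of 2 are assumptions taken from the usual Hugging Face BART configuration and are not stated in this README. The number of heads does not affect the count.

```python
def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(d: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * d


def attention(d: int) -> int:
    """Query, key, value and output projections."""
    return 4 * linear(d, d)


def ffn(d: int, f: int) -> int:
    """Two-layer feed-forward block."""
    return linear(d, f) + linear(f, d)


def bart_param_count(
    vocab: int = 50265,
    d: int = 768,
    f: int = 3072,
    enc_layers: int = 6,
    dec_layers: int = 6,
    max_pos: int = 1024,   # assumed, not given in the table
    pos_offset: int = 2,   # assumed, not given in the table
) -> int:
    shared_embeddings = vocab * d
    positions = 2 * (max_pos + pos_offset) * d       # encoder + decoder
    embedding_norms = 2 * layer_norm(d)              # encoder + decoder
    encoder = enc_layers * (attention(d) + ffn(d, f) + 2 * layer_norm(d))
    # Decoder layers add cross-attention and a third LayerNorm.
    decoder = dec_layers * (2 * attention(d) + ffn(d, f) + 3 * layer_norm(d))
    return shared_embeddings + positions + embedding_norms + encoder + decoder


if __name__ == "__main__":
    total = bart_param_count()
    print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate lands at about 139M, in line with the `# of params` column. Roughly 39M of that is the 50265 × 768 embedding matrix.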
+
+ ## Dialogue generation performance
+ Each model was fine-tuned using the [Dialogue Generator](https://github.com/2unju/KoBART_Dialogue_Generator) code. To evaluate dialogue generation, the tokenized responses generated at inference time were first restored to plain text, and a BPE tokenizer was then used to measure overlap (BLEU) and distinct (Dist-n) between the reference and generated responses.
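For readers unfamiliar with the distinct metric, the sketch below shows the common corpus-level definition of Dist-n: the number of unique n-grams divided by the total number of n-grams over all generated responses. Reporting it as a percentage matches the scale of the numbers in the tables, but the exact tokenization and aggregation used for the reported scores are not specified here, so treat this as an assumption.

```python
from typing import Sequence


def distinct_n(responses: Sequence[Sequence[str]], n: int) -> float:
    """Corpus-level Dist-n in percent over already tokenized responses."""
    unique = set()
    total = 0
    for tokens in responses:
        # Slide a window of size n over each response.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return 100.0 * len(unique) / total if total else 0.0


if __name__ == "__main__":
    generated = [
        ["나", "는", "좋아"],
        ["나", "는", "싫어"],
    ]
    print("Dist-1:", distinct_n(generated, 1))
    print("Dist-2:", distinct_n(generated, 2))
```

A higher Dist-n means the model repeats itself less across responses. It says nothing about whether the responses are appropriate, which is why it is reported alongside an overlap metric.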
+ > **Warning** <br>
+ > Because the model was pre-trained mostly on short dialogue data, it is weak on tasks that require processing long text, such as summarization.
+
+ ### Experimental results
+ - [Emotional dialogue data](https://github.com/songys/Chatbot_data)
+
+ |Training|Validation|Test|
+ |:----:|:----:|:----:|
+ |9,458|1,182|1,183|
+
+ | Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
+ |------------------------|:----:|:----:|:----:|:----:|:----:|
+ | KoBART | 124M | 8.73 | 7.12 | 16.85 | 34.89 |
+ | KoChatBART | 139M | **12.97** | **11.23** | **19.64** | **44.53** |
+
+ - [Small-business dialogue data](https://github.com/2unju/AIHub_Chitchat_dataset_parser)
+
+ |Training|Validation|Test|
+ |:----:|:----:|:----:|
+ |29,093|1,616|1,616|
+
+ | Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
+ |------------------------|:----:|:----:|:----:|:----:|:----:|
+ | KoBART | 124M | 10.04 | 7.24 | 13.76 | 42.09 |
+ | KoChatBART | 139M | **10.11** | **7.26** | **15.12** | **46.08** |
+
+ ## Contributors
+ <a href="https://github.com/BM-K/KoChatBART/graphs/contributors">
+ <img src="https://contrib.rocks/image?repo=BM-K/KoChatBART" />
+ </a>
+
+ ## Reference
+ - [KoBART](https://github.com/SKT-AI/KoBART)