---
tags:
- finance
- accounting
- stock
- quant
- economics
language:
- ko
license: apache-2.0
datasets:
- aiqwe/krx-llm-competition
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: question-answering
library_name: transformers
---

# krx-llm-competition Model Card

+ github: [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition)
+ dataset: [https://huggingface.co/datasets/aiqwe/krx-llm-competition](https://huggingface.co/datasets/aiqwe/krx-llm-competition)

This model is shibainu24, which won the Excellence Award on the [KRX LLM Competition leaderboard](https://krxbench.koscom.co.kr/). It provides text generation for finance-related knowledge such as finance and accounting.  

+ Vanilla model : [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
  
The code for dataset collection and training is published in detail at [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition).

# Usage
The examples in [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition) show how to run inference quickly.
Most inference runs on a single GPU of RTX-3090 class or better.

```shell
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Korean finance prompts (the model is tuned on Korean data); English glosses in comments.
inputs = [
    # JPY/USD rates differ slightly between two FX markets; which trade earns a risk-free profit?
    "외환시장에서 일본 엔화와 미국 달러의 환율이 두 시장에서 약간의 차이를 보이고 있다. 이때 무위험 이익을 얻기 위한 적절한 거래 전략은 무엇인가?",
    # What happens if the holder of a bond with warrants (BW) does not exercise the warrants?
    "신주인수권부사채(BW)에서 채권자가 신주인수권을 행사하지 않을 경우 어떤 일이 발생하는가?",
    # Which statement about short selling is incorrect?
    "공매도(Short Selling)에 대한 설명으로 옳지 않은 것은 무엇입니까?",
]

# Load the model on a single GPU.
llm = LLM(model="aiqwe/krx-llm-competition", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(inputs, sampling_params)
for o in outputs:
    print(o.prompt)           # the input prompt
    print(o.outputs[0].text)  # the first generated completion
    print("*" * 100)
```

# Model Card
| Contents                       | Spec                                |
|--------------------------------|-------------------------------------|
| Base model                     | Qwen2.5-7B-Instruct                |
| Machine                        | A100 SXM 80GB × 2                  |
| dtype                          | bfloat16                           |
| PEFT                           | LoRA (r=8, alpha=64)               |
| Learning Rate                  | 1e-5 (varies by further training)  |
| LRScheduler                    | Cosine (warm-up: 0.05%)            |
| Optimizer                      | AdamW                              |
| Distributed / Efficient Tuning | DeepSpeed v3, Flash Attention      |
| Global Batch Size              | 128                                |
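
The settings in the table imply a few derived quantities. The sketch below is illustrative only: it computes the standard LoRA scaling factor `alpha / r`, the gradient accumulation steps needed to reach a global batch of 128 on 2 GPUs, and a cosine learning rate with linear warm-up. The per-device batch size, the step counts, and the function names are assumptions made for the example, not values or code from the actual training run.

```python
import math


def lora_scale(r: int, alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA (alpha / r)."""
    return alpha / r


def grad_accum_steps(global_batch: int, n_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps so that n_gpus * per_device_batch * steps == global_batch."""
    per_step = n_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError("global_batch must be divisible by n_gpus * per_device_batch")
    return global_batch // per_step


def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero at total_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# r=8 and alpha=64 come from the table; per_device_batch=8 is an assumed value.
print(lora_scale(r=8, alpha=64))                                        # 8.0
print(grad_accum_steps(global_batch=128, n_gpus=2, per_device_batch=8))  # 8
# total_steps and warmup_steps are placeholders chosen for illustration.
print(cosine_lr(step=500, total_steps=1000, peak_lr=1e-5, warmup_steps=50))
```

With `r=8` and `alpha=64`, the LoRA update is scaled by 8, which is a fairly aggressive setting relative to the common `alpha = 2 * r` convention.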

# Dataset Card
Because of copyright restrictions on some sources, the Reference datasets are provided as links.
The MCQA and QA datasets are published at [https://huggingface.co/datasets/aiqwe/krx-llm-competition](https://huggingface.co/datasets/aiqwe/krx-llm-competition).  
That Hugging Face dataset repository also provides additional MCQA and QA datasets that were not used for training.  
The repository at [https://github.com/aiqwe/krx-llm-competition](https://github.com/aiqwe/krx-llm-competition) also offers various utility functions and the data sourcing pipeline for reference.  

## References
| Dataset | URL |
|---------|-----|
| Bank of Korea 700 Economic and Financial Terms (한국은행 경제금융 용어 700선) | [Link](https://www.bok.or.kr/portal/bbs/B0000249/view.do?nttId=235017&menuNo=200765) |
| Financial accounting synthetic data (재무회계 합성 데이터) | Self-produced |
| Financial Supervisory Terminology Dictionary (금융감독용어사전) | [Link](https://terms.naver.com/list.naver?cid=42088&categoryId=42088) |
| web-text.synthetic.dataset-50k | [Link](https://huggingface.co/datasets/Cartinoe5930/web_text_synthetic_dataset_50k) |
| Knowledge Economy Terminology Dictionary (지식경제용어사전) | [Link](https://terms.naver.com/list.naver?cid=43668&categoryId=43668) |
| Korea Exchange occasional publications (한국거래소 비정기 간행물) | [Link](http://open.krx.co.kr/contents/OPN04/04020000/OPN04020000.jsp#b8943a5f87282cde0d653d1ae73431c9=1) |
| Korea Exchange regulations (한국거래소규정) | [Link](https://law.krx.co.kr/las/TopFrame.jsp&KRX) |
| Securities Catch-Up for Beginner Investors (초보투자자 증권따라잡기) | [Link](https://main.krxverse.co.kr/_contents/ACA/02010200/file/220104_beginner.pdf) |
| Securities Investment for Teenagers (청소년을 위한 증권투자) | [Link](https://main.krxverse.co.kr/_contents/ACA/02010200/file/220104_teen.pdf) |
| Corporate business report disclosures (기업사업보고서 공시자료) | [Link](https://opendart.fss.or.kr/) |
| Current Affairs Economic Terminology Dictionary (시사경제용어사전) | [Link](https://terms.naver.com/list.naver?cid=43668&categoryId=43668) |

## MCQA
The MCQA data is a dataset of multiple-choice questions generated from the References. In addition to the questions and answers, reasoning text was generated and added to the training data.  
Training used about 45,000 samples, totaling roughly 20 million tokens as counted with tiktoken's o200k_base encoding (the tokenizer for gpt-4o and gpt-4o-mini).

| Dataset                                                         | Samples    | Tokens         |
|-----------------------------------------------------------------|------------|----------------|
| Bank of Korea 700 Economic and Financial Terms                  | 1,203      | 277,114        |
| Synthetic data built from a financial accounting table of contents | 451     | 99,770         |
| Financial Supervisory Terminology Dictionary                    | 827        | 214,297        |
| hf_web_text_synthetic_dataset_50k                               | 25,461     | 7,563,529      |
| Knowledge Economy Terminology Dictionary                        | 2,314      | 589,763        |
| Korea Exchange occasional publications                          | 1,183      | 230,148        |
| Korea Exchange regulations                                      | 3,015      | 580,556        |
| Securities Catch-Up for Beginner Investors                      | 599        | 116,472        |
| Securities Investment for Teenagers                             | 408        | 77,037         |
| Corporate business report disclosures                           | 3,574      | 629,807        |
| Current Affairs Economic Terminology Dictionary                 | 7,410      | 1,545,842      |
| **Total**                                                       | **46,445** | **19,998,931** |
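
The token counts in this section are measured with tiktoken's `o200k_base` encoding. A minimal sketch of how such a count could be reproduced follows; the `count_tokens` helper, its `encode` parameter, and the sample strings are illustrative assumptions rather than code from the repository.

```python
from typing import Callable, Iterable, Optional, Sequence


def count_tokens(
    texts: Iterable[str],
    encode: Optional[Callable[[str], Sequence]] = None,
) -> int:
    """Return the total number of tokens across all texts.

    If no encoder is passed, tiktoken's o200k_base encoding is loaded lazily
    (requires `pip install tiktoken`; the vocabulary is fetched on first use).
    """
    if encode is None:
        import tiktoken

        encode = tiktoken.get_encoding("o200k_base").encode
    return sum(len(encode(text)) for text in texts)


# Hypothetical sample texts; a whitespace splitter stands in for the real
# encoder so the example runs without tiktoken installed.
samples = ["question text", "answer and reasoning text"]
print(count_tokens(samples, encode=lambda s: s.split()))  # 6
```

Calling `count_tokens(samples)` without the `encode` argument uses the real `o200k_base` encoding; counts from a different tokenizer, such as the Qwen2.5 tokenizer used at training time, will generally differ.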

## QA
The QA data consists of two kinds of answers: those generated with both the Reference and the question as input, and those generated from the question alone, without the Reference.  
When given the Reference, the model answers more accurately, but it draws less on its own knowledge, so answers tend to be shorter or less diverse.
Training used about 48,000 samples and 200 million tokens in total.

| Dataset                                         | Samples    | Tokens          |
|-------------------------------------------------|------------|-----------------|
| Bank of Korea 700 Economic and Financial Terms  | 1,023      | 846,970         |
| Financial Supervisory Terminology Dictionary    | 4,128      | 3,181,831       |
| Knowledge Economy Terminology Dictionary        | 6,526      | 5,311,890       |
| Korea Exchange occasional publications          | 1,510      | 1,089,342       |
| Korea Exchange regulations                      | 4,858      | 3,587,059       |
| Corporate business report disclosures           | 3,574      | 629,807         |
| Current Affairs Economic Terminology Dictionary | 29,920     | 5,981,839       |
| **Total**                                       | **47,965** | **199,998,931** |
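
The two QA variants differ only in whether the Reference text is part of the input. As a rough sketch, the two kinds of input might be assembled as follows; the `build_qa_prompt` helper and its template wording are hypothetical and are not taken from the repository's actual pipeline.

```python
from typing import Optional


def build_qa_prompt(question: str, reference: Optional[str] = None) -> str:
    """Build a QA input, optionally grounded in a reference passage."""
    question = question.strip()
    if reference and reference.strip():
        # Grounded variant: the reference passage precedes the question.
        return (
            "Answer the question using the reference below.\n\n"
            f"Reference:\n{reference.strip()}\n\n"
            f"Question: {question}"
        )
    # Ungrounded variant: the model answers from its own knowledge.
    return f"Question: {question}"


# Hypothetical question and reference text, for illustration only.
question = "What is a bond with warrants (BW)?"
reference = "A bond with warrants gives the holder the right to buy newly issued shares."
print(build_qa_prompt(question, reference))
print(build_qa_prompt(question))
```

Generating both variants for the same question is one way to balance grounded accuracy against the longer, more varied answers the model gives without a reference.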

# Citation
```bibtex
@misc{jaylee2024krxllmcompetition,
  author = {Jay Lee},
  title = {shibainu24: krx llm competition llm model},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/aiqwe/krx-llm-competition}
}
```