---
language:
- ko
---
# KR-FinBert & KR-FinBert-SC
Much progress has been made in the NLP (Natural Language Processing) field, with numerous studies showing that domain adaptation using a small-scale corpus and fine-tuning with labeled data are effective for overall performance improvement.
We propose KR-FinBert for the financial domain, built by further pre-training a Korean BERT model on a financial corpus, and KR-FinBert-SC, its variant fine-tuned for sentiment analysis. As many studies have shown, the performance improvement from domain adaptation was also clear in our downstream experiments.

## Data
The training data for this model expands on that of **[KR-BERT-MEDIUM](https://huggingface.co/snunlp/KR-Medium)**: texts from Korean Wikipedia, general news articles, legal texts crawled from the National Law Information Center, and the [Korean Comments dataset](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments). For the domain adaptation, **corporate-related economic news articles from 72 media sources**, such as the Financial Times and The Korean Economy Daily, and **analyst reports from 16 securities companies**, such as Kiwoom Securities and Samsung Securities, were added. The dataset includes 440,067 news titles with their content and 11,237 analyst reports. **The total data size is about 13.22 GB.** For MLM training, we split the data line by line, **for a total of 6,379,315 lines.**

KR-FinBert was trained for 5.5M steps with a maximum sequence length of 512, a training batch size of 32, and a learning rate of 5e-5, taking 67.48 hours on an NVIDIA TITAN Xp GPU.
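
For reference, the further pre-training stage can be sketched with the Hugging Face `Trainer` as below. This is an illustrative outline only, not the exact training script: `finance_corpus.txt` is a hypothetical path standing in for the line-split corpus, and the hyperparameters mirror the ones reported above.

```python
# Illustrative sketch of MLM further pre-training on a financial corpus.
# "finance_corpus.txt" is a hypothetical path to the line-split corpus.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-Medium")
model = AutoModelForMaskedLM.from_pretrained("snunlp/KR-Medium")

# One sentence per line, matching the line-by-line split described above.
dataset = load_dataset("text", data_files={"train": "finance_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masked language modeling: mask 15% of tokens at random.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="kr-finbert",
    per_device_train_batch_size=32,  # reported batch size
    learning_rate=5e-5,              # reported learning rate
    max_steps=5_500_000,             # reported 5.5M steps
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```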
## Downstream tasks
### Sentiment classification model
Downstream task performance with 50,000 labeled examples.
|Model|Accuracy|
|-|-|
|KR-FinBert|0.963|
|KR-BERT-MEDIUM|0.958|
|KcBert-large|0.955|
|KcBert-base|0.953|
|KoBert|0.817|
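
A minimal sketch of the fine-tuning stage is shown below. The 50,000-example labeled set is not distributed with this model, so `finance_sentiment.csv` (columns `text`, `label`) is a hypothetical stand-in, and only the key arguments are filled in.

```python
# Illustrative sketch of sentiment fine-tuning on top of KR-FinBert.
# "finance_sentiment.csv" is a hypothetical labeled dataset (text,label).
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("snunlp/KR-FinBert")
model = AutoModelForSequenceClassification.from_pretrained(
    "snunlp/KR-FinBert", num_labels=2  # positive / negative
)

dataset = load_dataset("csv", data_files={"train": "finance_sentiment.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True)

args = TrainingArguments(output_dir="kr-finbert-sc", per_device_train_batch_size=32)

# Passing the tokenizer lets Trainer pad each batch dynamically.
Trainer(model=model, args=args, train_dataset=tokenized, tokenizer=tokenizer).train()
```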
### Inference sample
|Positive|Negative|
|-|-|
|현대바이오, '폴리탁셀' 코로나19 치료 가능성에 19% 급등|영화관株 '코로나 비행기' 언제 끝나나…"CJ CGV 올 4000억 손실 날 수도"|
|이수화학, 3분기 영업익 176억…전년比 80%↑|C쇼크에 멈춘 하늘길…대한항공 1분기 영업적자 566억|
|"GKL, 7년 만에 두 자릿수 매출성장 예상"|'1000억대 횡령·배임' 최신원 회장 구속…SK네트웍스 "경영 공백 방지 최선"|
|위지윅스튜디오, 콘텐츠 활약에 사상 첫 매출 1000억원 돌파|부품 공급 차질에…기아차 광주공장 전면 가동 중단|
|삼성전자, 2년 만에 인도 스마트폰 시장 점유율 1위 '왕좌 탈환'|현대제철, 지난해 영업익 3,313억원···전년比 67.7% 감소|
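
The released KR-FinBert-SC checkpoint can be tried directly with the `transformers` pipeline. The exact label strings come from the model's config, so the output below is only indicative.

```python
from transformers import pipeline

# Load the fine-tuned sentiment classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="snunlp/KR-FinBert-SC")

print(classifier("현대바이오, '폴리탁셀' 코로나19 치료 가능성에 19% 급등"))
# e.g. [{'label': 'positive', 'score': 0.99}]  # label names follow the model config
```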
### Citation
```
@misc{kr-FinBert-SC,
  author       = {Kim, Eunhee and Shin, Hyopil},
  title        = {KR-FinBert: Fine-tuning KR-FinBert for Sentiment Analysis},
  year         = {2022},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://huggingface.co/snunlp/KR-FinBert-SC}}
}
```