pborchert committed on
Commit 6dfecca
1 Parent(s): 48a4cb4

Update README.md

Files changed (1)
  1. README.md +67 -0
README.md CHANGED
@@ -1,3 +1,70 @@
---
license: cc-by-4.0
language:
- en
tags:
- business
- finance
- industry-classification
---

# BusinessBERT

An industry-sensitive language model for business applications, pretrained on business communication corpora. The model incorporates industry classification (IC) as a pretraining objective in addition to masked language modeling (MLM).

It was introduced in [this paper]() and released in [this repository](https://github.com/pnborchert/BusinessBERT).

## Model description

We introduce BusinessBERT, an industry-sensitive language model for business applications. Its advantage is a training approach focused on incorporating industry information relevant to business-related natural language processing (NLP) tasks.
We compile three large-scale textual corpora consisting of annual disclosures, company website content, and scientific literature representing business communication. In total, the corpora comprise 2.23 billion tokens.
BusinessBERT builds upon the Bidirectional Encoder Representations from Transformers (BERT) architecture and embeds industry information during pretraining in two ways: (1) the business communication corpora contain a variety of industry-specific terminology; (2) we employ industry classification (IC) as an additional pretraining objective for text documents originating from companies.

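To illustrate the dual-objective setup, below is a minimal sketch of a joint pretraining step that sums an MLM loss and an IC cross-entropy loss. The linear IC head over the final [CLS] representation, the equal weighting of the two losses, and the `num_industries` value are illustrative assumptions, not details confirmed by this card (the card only states that IC is used for documents originating from companies).

```python
# Minimal sketch of a joint MLM + industry-classification (IC) pretraining step.
# Assumptions: linear IC head over the final [CLS] vector, cross-entropy IC loss,
# and a simple sum of the two losses; masking of IC labels for documents without
# an industry label is omitted for brevity.
import torch.nn as nn
from transformers import BertForMaskedLM

class BusinessBertPretraining(nn.Module):
    def __init__(self, base_model="bert-base-uncased", num_industries=10):
        super().__init__()
        self.mlm = BertForMaskedLM.from_pretrained(base_model)
        self.ic_head = nn.Linear(self.mlm.config.hidden_size, num_industries)
        self.ic_loss = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, mlm_labels, industry_labels):
        out = self.mlm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=mlm_labels,              # MLM loss computed by the base model
            output_hidden_states=True,
        )
        cls_vec = out.hidden_states[-1][:, 0]      # [CLS] vector of the last layer
        ic_logits = self.ic_head(cls_vec)          # industry logits
        return out.loss + self.ic_loss(ic_logits, industry_labels)
```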

## Intended uses & limitations

The model is intended to be fine-tuned on business-related NLP tasks such as sequence classification, named entity recognition, sentiment analysis, or question answering.

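For a quick start, the sketch below loads the model for fine-tuning on a sequence classification task with the Hugging Face `transformers` library. The Hub identifier `pborchert/BusinessBERT` and `num_labels=2` are illustrative assumptions rather than values stated on this card.

```python
# Hedged quick-start: load BusinessBERT for sequence classification fine-tuning.
# The checkpoint name and num_labels below are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "pborchert/BusinessBERT"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer(
    "The company reported a sharp increase in quarterly revenue.",
    return_tensors="pt",
    truncation=True,
)
outputs = model(**inputs)      # logits from a freshly initialized classification head
print(outputs.logits.shape)    # torch.Size([1, 2])
```

From here, training can proceed with the standard `Trainer` API or a custom PyTorch loop.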

## Training data

- [CompanyWeb](https://huggingface.co/datasets/pborchert/CompanyWeb): 0.77 billion tokens, 3.5 GB raw text file
- [MD&A Disclosures](https://data.caltech.edu/records/1249): 1.06 billion tokens, 5.1 GB raw text file
- [Semantic Scholar Open Research Corpus](https://api.semanticscholar.org/corpus): 0.40 billion tokens, 1.9 GB raw text file


## Evaluation results

Classification Tasks:

| Model | Financial Risk (F1/Acc) | News Headline Topic (F1/Acc) |
|:----:|:-----------------------:|:----------------------------:|
| BusinessBERT | 85.89/87.02 | 75.06/67.71 |

Named Entity Recognition:

| Model | SEC Filings (F1/Prec/Rec) |
|:----:|:-------------------------:|
| BusinessBERT | 79.82/77.45/83.38 |

Sentiment Analysis:

| Model | FiQA (MSE/MAE) | Financial Phrasebank (F1/Acc) | StockTweets (F1/Acc) |
|:----:|:--------------:|:-----------------------------:|:--------------------:|
| BusinessBERT | 0.0758/0.202 | 75.06/67.71 | 69.14/69.54 |

Question Answering:

| Model | FinQA (Exe Acc/Prog Acc) |
|:----:|:------------------------:|
| BusinessBERT | 60.07/57.19 |

### BibTeX entry and citation info

```bibtex
@misc{title_year,
  title={TITLE},
  author={AUTHORS},
  year={YEAR},
}
```