---
license: mit
language:
- sk
pipeline_tag: text-classification
library_name: transformers
metrics:
- f1
base_model: daviddrzik/SK_Morph_BLM
tags:
- sentiment
---

# Fine-Tuned Sentiment Classification Model - SK_Morph_BLM (Universal multi-domain sentiment classification)

## Model Overview

This model is a fine-tuned version of the [SK_Morph_BLM model](https://huggingface.co/daviddrzik/SK_Morph_BLM) for sentiment classification. It was trained on datasets from multiple domains: banking, social media, movie reviews, politics, and product reviews. Some of these datasets were originally in Czech and were machine-translated into Slovak using Google Cloud Translation.

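The translation step is not part of this repository; for context, here is a minimal sketch of how such a Czech-to-Slovak pass might look with the `google-cloud-translate` client. The function name and language codes are illustrative assumptions, not the authors' actual pipeline:

```python
# Hypothetical sketch of the Czech -> Slovak machine-translation step
# described above; NOT the authors' exact pipeline.
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def translate_cs_to_sk(text: str) -> str:
    # Translate a single Czech record into Slovak.
    result = client.translate(text, source_language="cs", target_language="sk")
    return result["translatedText"]
```
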
## Sentiment Labels

Each record in the dataset is labeled with one of the following sentiments (the mapping can also be read from the model config, as shown below):
- **Negative (0)**
- **Neutral (1)**
- **Positive (2)**

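A small sketch of reading that mapping programmatically, assuming the checkpoint's `id2label` field matches the list above:

```python
from transformers import AutoConfig

# Load only the configuration to inspect the label mapping.
config = AutoConfig.from_pretrained("daviddrzik/SK_Morph_BLM-sentiment-multidomain")
print(config.id2label)  # expected: {0: "Negative", 1: "Neutral", 2: "Positive"}
```
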
## Dataset Details

The dataset used for fine-tuning comprises text records from various domains. Below are the details for each domain:

### Banking Domain
- **Source**: [Banking Dataset](https://doi.org/10.1016/j.procs.2023.10.346)
- **Description**: Sentences from the annual reports of a commercial bank in Slovakia.
- **Records per Class**: 923
- **Unique Words**: 11,469
- **Average Words per Record**: 20.93
- **Average Characters per Record**: 142.41

### Social Media Domain
- **Source**: [Social Media Dataset](http://hdl.handle.net/11858/00-097C-0000-0022-FE82-7)
- **Description**: Posts from the Facebook social network.
- **Records per Class**: 1,991
- **Unique Words**: 114,549
- **Average Words per Record**: 9.24
- **Average Characters per Record**: 57.11

### Movies Domain
- **Source**: [Movies Dataset](https://doi.org/10.1016/j.ipm.2014.05.001)
- **Description**: Short movie reviews from ČSFD.
- **Records per Class**: 3,000
- **Unique Words**: 72,166
- **Average Words per Record**: 52.12
- **Average Characters per Record**: 330.92

### Politics Domain
- **Source**: [Politics Dataset](https://doi.org/10.48550/arXiv.2309.09783)
- **Description**: Sentences from Slovak parliamentary proceedings.
- **Records per Class**: 452
- **Unique Words**: 6,697
- **Average Words per Record**: 12.31
- **Average Characters per Record**: 85.22

### Reviews Domain
- **Source**: [Reviews Dataset](https://aclanthology.org/W13-1609)
- **Description**: Product reviews from Mall.cz.
- **Records per Class**: 3,000
- **Unique Words**: 35,941
- **Average Words per Record**: 21.05
- **Average Characters per Record**: 137.33

## Fine-Tuning Hyperparameters

The following hyperparameters were used during the fine-tuning process (see the sketch after this list):

- **Learning Rate:** 1e-05
- **Training Batch Size:** 64
- **Evaluation Batch Size:** 64
- **Seed:** 42
- **Optimizer:** Adam (default)
- **Number of Epochs:** 15 (with early stopping)

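A minimal sketch of how these settings might be expressed with the `transformers` `Trainer` API. The output path, evaluation/save strategy, early-stopping patience, and the `model`/dataset variables are assumptions for illustration; the authors' exact training script is not published here:

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Hyperparameters from the list above; strategy and patience values are assumptions.
training_args = TrainingArguments(
    output_dir="./skmblm-sentiment",    # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    num_train_epochs=15,
    evaluation_strategy="epoch",        # assumption: evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",  # assumption
)

trainer = Trainer(
    model=model,                    # assumed: a RobertaForSequenceClassification
    args=training_args,
    train_dataset=train_dataset,    # assumed: pre-tokenized datasets
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience is an assumption
)
trainer.train()
```
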
## Model Performance

The model was trained on data from all domains simultaneously and evaluated using stratified 10-fold cross-validation on each individual domain. The weighted F1-score, including the mean, minimum, maximum, and quartile values, is presented below for each domain:

| Domain       | Mean | Min  | 25%  | 50%  | 75%  | Max  |
|--------------|------|------|------|------|------|------|
| Banking      | 0.672| 0.640| 0.655| 0.660| 0.690| 0.721|
| Social media | 0.586| 0.567| 0.584| 0.587| 0.593| 0.603|
| Movies       | 0.577| 0.556| 0.574| 0.579| 0.580| 0.604|
| Politics     | 0.629| 0.566| 0.620| 0.634| 0.644| 0.673|
| Reviews      | 0.580| 0.558| 0.578| 0.580| 0.588| 0.597|

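To make the evaluation protocol concrete, here is a sketch of stratified 10-fold cross-validation with a weighted F1-score using scikit-learn. `load_domain_data` and `fine_tune_and_predict` are hypothetical placeholders for the authors' data loading, fine-tuning, and prediction steps:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

# load_domain_data is a hypothetical helper returning numpy arrays of
# texts and integer labels (0/1/2) for one domain.
texts, labels = load_domain_data()

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(texts, labels):
    # fine_tune_and_predict is a hypothetical stand-in for fine-tuning on
    # the training folds and predicting labels for the held-out fold.
    y_pred = fine_tune_and_predict(texts[train_idx], labels[train_idx], texts[test_idx])
    scores.append(f1_score(labels[test_idx], y_pred, average="weighted"))

print(f"weighted F1: mean={np.mean(scores):.3f}, min={np.min(scores):.3f}, max={np.max(scores):.3f}")
```
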
## Model Usage

This model is suited to sentiment classification within the domains it was trained on: banking, social media, movies, politics, and product reviews. It handles a wide range of text within these domains, although F1-scores vary across them (see the table above), and it may not generalize effectively to entirely different types of text outside these domains.

### Example Usage

Below is an example of how to use the fine-tuned `SK_Morph_BLM-sentiment-multidomain` model in a Python script:

```python
import sys
import torch
from transformers import RobertaForSequenceClassification
from huggingface_hub import snapshot_download

class SentimentClassifier:
    def __init__(self, tokenizer, model):
        # Load the fine-tuned classification head with 3 sentiment labels
        self.model = RobertaForSequenceClassification.from_pretrained(model, num_labels=3)
        self.model.eval()

        # Download the tokenizer repository and make it importable
        repo_path = snapshot_download(repo_id=tokenizer)
        sys.path.append(repo_path)

        # Import the custom morphological tokenizer from the downloaded repository
        from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer
        self.tokenizer = SKMorfoTokenizer()

    def tokenize_text(self, text):
        # Lowercase the text and encode it as PyTorch tensors (max 256 tokens)
        encoded_text = self.tokenizer.tokenize(text.lower(), max_length=256, return_tensors='pt', return_subword=False)
        return encoded_text

    def classify_text(self, encoded_text):
        with torch.no_grad():
            output = self.model(**encoded_text)
            logits = output.logits
            predicted_class = torch.argmax(logits, dim=1).item()
            probabilities = torch.softmax(logits, dim=1)
            class_probabilities = probabilities[0].tolist()
        predicted_class_text = self.model.config.id2label[predicted_class]
        return predicted_class, predicted_class_text, class_probabilities

# Instantiate the sentiment classifier with the specified tokenizer and model
classifier = SentimentClassifier(tokenizer="daviddrzik/SK_Morph_BLM", model="daviddrzik/SK_Morph_BLM-sentiment-multidomain")

# Example text to classify ("Despite the improvement in expectations, the outlook remains fragile.")
text_to_classify = "Napriek zlepšeniu očakávaní je výhľad stále krehký."
print("Text to classify: " + text_to_classify + "\n")

# Tokenize the input text
encoded_text = classifier.tokenize_text(text_to_classify)

# Classify the sentiment of the tokenized text
predicted_class, predicted_class_text, class_probabilities = classifier.classify_text(encoded_text)

# Print the predicted class label and index
print(f"Predicted class: {predicted_class_text} ({predicted_class})")
# Print the probabilities for each class
print(f"Class probabilities: {class_probabilities}")
```

Here is the output when running the above example:
```yaml
Text to classify: Napriek zlepšeniu očakávaní je výhľad stále krehký.

Predicted class: Positive (2)
Class probabilities: [0.04016311839222908, 0.4200247824192047, 0.5398120284080505]
```