File size: 4,175 Bytes
f2a2066
 
 
 
 
 
 
 
 
 
 
6cc4d2a
f2a2066
6cc4d2a
f2a2066
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b50123f
f2a2066
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b50123f
f2a2066
 
 
 
 
 
 
 
 
 
 
 
 
 
b50123f
f2a2066
 
 
 
 
b50123f
f2a2066
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
---
language: en
license: apache-2.0
datasets:
- custom
task_categories:
- text-classification
task_ids:
- sentiment-classification
---

# BERT ForSequenceClassification Fine-tuned for Sentiment Analysis

This model is a fine-tuned version of the `BERT ForSequenceClassification` model for sentiment analysis. 
It is trained on a dataset of texts with six different emotions: anger, fear, joy, love, sadness, and surprise.
The model was trained and tested on a labeled dataset from [Kaggle](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp).

Github link:
https://github.com/hennypurwadi/Bert_FineTune_Sentiment_Analysis

The labeled dataset I used to fine-tune and train the model can be found at: 
https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp?select=train.txt

## Model Training Details

- **Pretrained model**: `bert-base-uncased` ("uncased" means the model was trained on lowercased text)
- **Number of labels**: 6:
-   "Label_0": "anger",
-   "Label_1": "fear",
-   "Label_2": "joy"
-   "Label_3": "love",
-   "Label_4": "sadness",
-   "Label_5": "surprise"
-   
- **Learning rate**: 2e-5
- **Epsilon**: 1e-8
- **Epochs**: 10
- **Warmup steps**: 0
- **Optimizer**: AdamW with correct_bias=False

## Dataset

The model was trained and tested on a labeled dataset from [Kaggle](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp).

##To predict the sentiments on unlabeled datasets, use the predict_sentiments function provided in this repository.

## The unlabeled daataset to be predicted should have a single column named "text". 

Predict Unlabeled dataset collected from Twitter (dc_America.csv)

predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')

##To load and use the model and tokenizer, use the following code:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import pandas as pd

def predict_sentiments(model_name, tokenizer_name, input_file):

    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    
    df = pd.read_csv(input_file)

    # Tokenize input text
    test_inputs = tokenizer(list(df['text']), padding=True, truncation=True, max_length=128, return_tensors='pt')

    # Make predictions
    with torch.no_grad():
        model.eval()
        outputs = model(test_inputs['input_ids'], token_type_ids=None, attention_mask=test_inputs['attention_mask'])
        logits = outputs[0].detach().cpu().numpy()
        predictions = logits.argmax(axis=-1)

    # Map the predicted labels back to their original names
    int2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}
    predicted_labels = [int2label[p] for p in predictions]

    # Add the predicted labels to the test dataframe
    df['label'] = predicted_labels

    # Save the predictions to a file
    output_file = input_file.replace(".csv", "_predicted.csv")
    df.to_csv(output_file, index=False)

model_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"

#Predict Unlabeled data
predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')

# Load predicted data
df_Am = pd.read_csv('/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv')
df_Am.head()

from transformers import AutoTokenizer
import matplotlib.pyplot as plt

# Load tokenizer
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, do_lower_case=True)

# Load dataset
input_file = '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv'
df_Am = pd.read_csv(input_file)

# Examine distribution of data based on labels
sentences = df_Am.text.values
print("Distribution of data based on labels: ", df_Am.label.value_counts())

MAX_LEN = 512

# Plot label 
label_count = df_Am['label'].value_counts()
plot_users = label_count.plot.pie(autopct='%1.1f%%', figsize=(4, 4))
plt.rc('axes', unicode_minus=False)