---
language: en
license: apache-2.0
datasets:
- custom
task_categories:
- text-classification
task_ids:
- sentiment-classification
---
# BertForSequenceClassification Fine-tuned for Sentiment Analysis
This model is a version of `BertForSequenceClassification`, initialized from `bert-base-uncased` and fine-tuned for sentiment analysis.
It is trained on a dataset of texts with six different emotions: anger, fear, joy, love, sadness, and surprise.
The model was trained and tested on a labeled dataset from Kaggle. The file used for fine-tuning is [train.txt](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp?select=train.txt) from [Emotions dataset for NLP](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp).

GitHub link: https://github.com/hennypurwadi/Bert_FineTune_Sentiment_Analysis
## Model Training Details
- **Pretrained model**: `bert-base-uncased` ("uncased" means the model was trained on lowercased text)
- **Number of labels**: 6
  - `Label_0`: anger
  - `Label_1`: fear
  - `Label_2`: joy
  - `Label_3`: love
  - `Label_4`: sadness
  - `Label_5`: surprise
- **Learning rate**: 2e-5
- **Epsilon**: 1e-8
- **Epochs**: 10
- **Warmup steps**: 0
- **Optimizer**: AdamW with `correct_bias=False`
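
The hyperparameters above could be wired together roughly as follows. This is a minimal sketch, not the training script from the repository: the stand-in `torch.nn.Linear` model, `STEPS_PER_EPOCH`, and the linear decay schedule are assumptions made for illustration. It also uses `torch.optim.AdamW`, which has no `correct_bias` argument and always applies bias correction, so it does not reproduce the `correct_bias=False` setting exactly.

```python
import torch

# Values taken from the list above.
LEARNING_RATE = 2e-5
EPSILON = 1e-8
EPOCHS = 10
WARMUP_STEPS = 0

# Hypothetical: depends on the dataset size and batch size actually used.
STEPS_PER_EPOCH = 100

# Stand-in for the fine-tuned classifier (768 hidden units -> 6 labels),
# used only so the optimizer has parameters to manage.
model = torch.nn.Linear(768, 6)

# torch.optim.AdamW always applies bias correction (assumption: swapped in
# for the AdamW variant with correct_bias=False named in the card).
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)

total_steps = EPOCHS * STEPS_PER_EPOCH


def lr_lambda(step):
    """Linear warmup (none when WARMUP_STEPS is 0) followed by linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (total_steps - step) / max(1, total_steps - WARMUP_STEPS))


# Assumption: the card only states the warmup step count, not the schedule shape.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In a training loop, `optimizer.step()` followed by `scheduler.step()` would be called once per batch. With zero warmup steps the learning rate starts at its full value of 2e-5.
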
## Dataset
The model was trained and tested on a labeled dataset from [Kaggle](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp).
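
As a rough illustration of how the labeled file might be turned into training pairs, the sketch below assumes each line of `train.txt` has the form `text;emotion` (worth verifying against the downloaded file). The helper `parse_line` and the sample lines are hypothetical; the label ids follow the mapping listed under Model Training Details.

```python
# Label ids as listed in this model card.
label2int = {"anger": 0, "fear": 1, "joy": 2, "love": 3, "sadness": 4, "surprise": 5}


def parse_line(line, sep=";"):
    """Split one 'text;emotion' line into (text, label_id).

    Assumption: the emotion name is the last field, so the line is split on
    the final separator in case the text itself contains one.
    """
    text, label = line.rstrip("\n").rsplit(sep, 1)
    return text, label2int[label.strip()]


# Made-up lines in the assumed format, not taken from the dataset.
sample_lines = [
    "i feel so happy today;joy\n",
    "i am scared of the dark;fear\n",
]

examples = [parse_line(line) for line in sample_lines]
```

An unknown emotion name raises a `KeyError`, which makes unexpected labels in the file easy to spot.
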
## Predicting Sentiments on Unlabeled Datasets
To predict the sentiments on unlabeled datasets, use the `predict_sentiments` function provided in this repository.
The unlabeled dataset to be predicted should have a single column named "text".
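
To make the expected input format concrete, here is a small sketch that writes and re-reads a CSV file with a single "text" column, using only the standard library. The helper names, the file name `sample_unlabeled.csv`, and the sample sentences are hypothetical.

```python
import csv
import os
import tempfile


def write_text_csv(texts, path):
    """Write the given strings to a CSV file with a single "text" column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # header row
        for text in texts:
            writer.writerow([text])


def read_text_column(path):
    """Read the "text" column back, failing early if the header is not as expected."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["text"]:
            raise ValueError(f"expected a single 'text' column, got {reader.fieldnames}")
        return [row["text"] for row in reader]


# Hypothetical sample file in the system temp directory.
sample_path = os.path.join(tempfile.gettempdir(), "sample_unlabeled.csv")
write_text_csv(["I am so happy today", "This is terrifying, honestly"], sample_path)
texts = read_text_column(sample_path)
```

A file produced this way has the column layout that `predict_sentiments` reads via `df['text']`.
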
For example, to predict an unlabeled dataset collected from Twitter (`dc_America.csv`):

`predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')`

To load and use the model and tokenizer, use the following code:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import pandas as pd


def predict_sentiments(model_name, tokenizer_name, input_file):
    """Predict an emotion label for every row of a CSV file with a "text" column."""
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    df = pd.read_csv(input_file)

    # Tokenize input text
    test_inputs = tokenizer(list(df['text']), padding=True, truncation=True, max_length=128, return_tensors='pt')

    # Make predictions (evaluation mode, no gradient tracking)
    model.eval()
    with torch.no_grad():
        outputs = model(test_inputs['input_ids'], token_type_ids=None, attention_mask=test_inputs['attention_mask'])
    logits = outputs[0].detach().cpu().numpy()
    predictions = logits.argmax(axis=-1)

    # Map the predicted labels back to their original names
    int2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}
    predicted_labels = [int2label[p] for p in predictions]

    # Add the predicted labels to the test dataframe
    df['label'] = predicted_labels

    # Save the predictions next to the input file, with a "_predicted" suffix
    output_file = input_file.replace(".csv", "_predicted.csv")
    df.to_csv(output_file, index=False)


model_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"

# Predict unlabeled data
predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')

# Load predicted data
df_Am = pd.read_csv('/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv')
df_Am.head()

from transformers import AutoTokenizer
import matplotlib.pyplot as plt

# Load tokenizer
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, do_lower_case=True)

# Load dataset
input_file = '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv'
df_Am = pd.read_csv(input_file)

# Examine distribution of data based on labels
sentences = df_Am.text.values
print("Distribution of data based on labels: ", df_Am.label.value_counts())
MAX_LEN = 512

# Plot label
label_count = df_Am['label'].value_counts()
plot_users = label_count.plot.pie(autopct='%1.1f%%', figsize=(4, 4))
plt.rc('axes', unicode_minus=False)
```