# BertForSequenceClassification Fine-tuned for Sentiment Analysis
This model is a fine-tuned `BertForSequenceClassification` model for sentiment analysis. It was trained and tested on a labeled Kaggle dataset of texts covering six emotions: anger, fear, joy, love, sadness, and surprise.

GitHub link: https://github.com/hennypurwadi/Bert_FineTune_Sentiment_Analysis

The labeled dataset used to fine-tune and train the model can be found at: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp?select=train.txt
## Model Training Details
- Pretrained model: `bert-base-uncased` ("uncased" means the model was trained on lowercased text)
- Number of labels: 6
- "Label_0": "anger",
- "Label_1": "fear",
- "Label_2": "joy"
- "Label_3": "love",
- "Label_4": "sadness",
- "Label_5": "surprise"
- Learning rate: 2e-5
- Epsilon: 1e-8
- Epochs: 10
- Warmup steps: 0
- Optimizer: AdamW with correct_bias=False
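For reference, a minimal sketch of this configuration with the `transformers` library is shown below. It is an illustration of the listed hyperparameters, not the author's exact training script, and it assumes a `transformers` version that still exports `AdamW`:

```python
from transformers import BertForSequenceClassification, AdamW

# Sketch: load the pretrained model with 6 labels and the mapping listed above.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=6,
    id2label={0: "anger", 1: "fear", 2: "joy", 3: "love", 4: "sadness", 5: "surprise"},
)

# AdamW with the listed hyperparameters: lr=2e-5, eps=1e-8, correct_bias=False.
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8, correct_bias=False)
```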
## Dataset

The model was trained and tested on the labeled Kaggle emotions dataset linked above.
## Predict sentiments on unlabeled datasets

To predict the sentiments of an unlabeled dataset, use the `predict_sentiments` function provided in this repository. The unlabeled dataset to be predicted should be a CSV file with a single column named `text`, for example the unlabeled dataset collected from Twitter (`dc_America.csv`).
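For illustration, an input file in the expected format could be created like this (the filename and example texts here are assumptions, not part of the original repository):

```python
import pandas as pd

# Illustrative only: a CSV with the single required "text" column.
sample = pd.DataFrame({"text": ["I can't believe this happened!", "What a wonderful day."]})
sample.to_csv("sample_unlabeled.csv", index=False)
```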
## Load and use the model and tokenizer

To load and use the model and tokenizer, use the following code:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import pandas as pd

def predict_sentiments(model_name, tokenizer_name, input_file):
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    df = pd.read_csv(input_file)

    # Tokenize the input text
    test_inputs = tokenizer(list(df['text']), padding=True, truncation=True,
                            max_length=128, return_tensors='pt')

    # Make predictions
    model.eval()
    with torch.no_grad():
        outputs = model(test_inputs['input_ids'], token_type_ids=None,
                        attention_mask=test_inputs['attention_mask'])
    logits = outputs[0].detach().cpu().numpy()
    predictions = logits.argmax(axis=-1)

    # Map the predicted label indices back to their original names
    int2label = {0: 'anger', 1: 'fear', 2: 'joy', 3: 'love', 4: 'sadness', 5: 'surprise'}
    predicted_labels = [int2label[p] for p in predictions]

    # Add the predicted labels to the dataframe
    df['label'] = predicted_labels

    # Save the predictions to a file
    output_file = input_file.replace(".csv", "_predicted.csv")
    df.to_csv(output_file, index=False)

model_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"

# Predict the unlabeled dataset
predict_sentiments(model_name, tokenizer_name, '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America.csv')

# Load the predicted data
df_Am = pd.read_csv('/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv')
df_Am.head()
```
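As a quick sanity check on a single sentence, the model can also be loaded through the `transformers` pipeline API. This is a minimal sketch; the displayed label names depend on the model's `id2label` config, so they may appear as `LABEL_0`–`LABEL_5`, which map to the emotions listed above:

```python
from transformers import pipeline

# Minimal sketch: single-sentence inference with the fine-tuned model.
classifier = pipeline("text-classification", model="RinInori/bert-base-uncased_finetune_sentiments")
print(classifier("What a wonderful surprise, I am so happy!"))
# Output shape: [{'label': ..., 'score': ...}]
```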
To examine and plot the distribution of predicted labels:

```python
from transformers import AutoTokenizer
import matplotlib.pyplot as plt
import pandas as pd

# Load tokenizer
tokenizer_name = "RinInori/bert-base-uncased_finetune_sentiments"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, do_lower_case=True)

# Load the predicted dataset
input_file = '/content/drive/MyDrive/DLBBT01/data/c_unlabeled/dc_America_predicted.csv'
df_Am = pd.read_csv(input_file)

# Examine the distribution of data based on labels
sentences = df_Am.text.values
print("Distribution of data based on labels: ", df_Am.label.value_counts())
MAX_LEN = 512

# Plot the label distribution as a pie chart
label_count = df_Am['label'].value_counts()
plot_users = label_count.plot.pie(autopct='%1.1f%%', figsize=(4, 4))
plt.rc('axes', unicode_minus=False)
```
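To keep the chart, a small follow-up can title and save the figure (the title text and output filename here are assumptions):

```python
# Illustrative follow-up: title and save the pie chart produced above.
plt.title("Predicted label distribution (dc_America)")
plt.savefig("label_distribution.png", bbox_inches="tight", dpi=150)
plt.show()
```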