ashaduzzaman/distilbert-base-uncased-finetuned-imdb

Model Description

This model is a fine-tuned version of DistilBERT (distilbert-base-uncased) on the IMDb movie reviews dataset. Fine-tuning adapts the model to the vocabulary and phrasing typical of movie reviews. It is intended primarily for masked language modeling (MLM): given a sentence in which one token is replaced by [MASK], the model predicts the most likely tokens to fill the blank.

Intended Uses & Limitations

Intended Uses:

Text Completion: Predicting missing words in sentences from movie reviews or similar domains.
Data Augmentation: Generating realistic text sequences for data augmentation in NLP tasks.
Sentiment Analysis: Can serve as a domain-adapted starting point for further fine-tuning on sentiment classification.

Limitations:

Domain Specificity: The model is fine-tuned on IMDb reviews and may not generalize well to other domains or types of text.
Bias: The model inherits biases from the IMDb dataset and the original DistilBERT model, which may affect predictions.

How to Use

You can use this model with the Hugging Face transformers library:

from transformers import pipeline

# Load the fill-mask pipeline
mask_filler = pipeline("fill-mask", model="Ashaduzzaman/distilbert-base-uncased-finetuned-imdb")

# Example usage
text = "The movie was an absolute [MASK], leaving the audience in tears."
predictions = mask_filler(text)

# Print each candidate sentence with the mask filled in
for pred in predictions:
    print(pred["sequence"])
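
Each prediction returned by the fill-mask pipeline is a dictionary that carries a score and a token_str field in addition to sequence. As a minimal sketch, the helper below ranks candidates by score for display. The name format_predictions and the sample entries are hypothetical and made up for illustration; they are not output from this model.

```python
def format_predictions(predictions, top_n=3):
    """Return one 'token (score)' line per candidate, highest score first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str']} ({p['score']:.3f})" for p in ranked[:top_n]]


# Hypothetical placeholder entries shaped like fill-mask pipeline output.
# The tokens and scores are invented, not real predictions from this model.
sample = [
    {"token_str": "delight", "score": 0.10,
     "sequence": "the movie was an absolute delight, leaving the audience in tears."},
    {"token_str": "masterpiece", "score": 0.25,
     "sequence": "the movie was an absolute masterpiece, leaving the audience in tears."},
]

for line in format_predictions(sample):
    print(line)
```

In real use, pass the list returned by mask_filler(text) instead of sample.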

Example Texts for the Widget

---
pipeline_tag: fill-mask
widget:
- text: "The movie was an absolute [MASK], leaving the audience in tears."
- text: "The director's latest [MASK] was a surprise hit at the box office."
- text: "The acting was [MASK], truly a remarkable performance."
---

Limitations and Bias

Bias in Data: The IMDb dataset contains movie reviews that may reflect specific cultural or societal biases. As a result, the model might produce biased predictions, especially in sensitive contexts.
Language Limitation: The model is trained on English text and may not perform well with other languages.

Training Data

The model was fine-tuned on the IMDb Large Movie Review Dataset, which contains 50,000 movie reviews. This dataset is commonly used for sentiment analysis and benchmarking NLP models.

Training Procedure

The model was fine-tuned using the Hugging Face transformers library. Key training details:

Base Model: DistilBERT (distilbert-base-uncased)
Task: Masked Language Modeling
Optimizer: AdamW
Learning Rate: 5e-5 with a linear learning rate scheduler
Batch Size: 16
Epochs: 3
Evaluation Metric: The model was evaluated on masked word prediction accuracy.
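
A linear schedule lowers the learning rate from its initial value to zero over the course of training. The sketch below assumes no warmup steps (the card does not mention any) and takes the total of 939 optimizer steps from the training results table. linear_lr is a hypothetical helper name, not a library function, and base_lr is a parameter so a different starting value can be passed in.

```python
def linear_lr(step, base_lr=5e-5, total_steps=939, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp up linearly during warmup (unused when warmup_steps is 0)
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly to zero over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start and at the end of each of the three epochs
for step in (0, 313, 626, 939):
    print(step, f"{linear_lr(step):.2e}")
```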

Hyperparameters:

Learning Rate: 2e-05
Batch Size: 16
Number of Epochs: 3
Optimizer: AdamW
Seed: 42

Training Results

Training Loss	Epoch	Step	Validation Loss
2.6728	1.0	313	2.4563
2.5551	2.0	626	2.4489
2.5099	3.0	939	2.4455
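
Validation loss for masked language modeling is often reported as perplexity, the exponential of the mean cross-entropy loss. Assuming the losses above are mean cross-entropy values in nats (the usual Trainer behavior, though not stated on this card), the conversion can be sketched as:

```python
import math

# Validation loss per epoch, copied from the table above
validation_loss = {1: 2.4563, 2: 2.4489, 3: 2.4455}

# Perplexity is exp(mean cross-entropy); lower is better
perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}

for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

Under that assumption, the final epoch works out to a perplexity of roughly 11.5.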

Evaluation Results

The model's performance was evaluated on a validation set derived from the IMDb dataset. Accuracy, precision, recall, and F1-score were computed to assess how well the model predicts masked tokens.

Metric	Value
Accuracy	96.5%
Precision	92.3%
Recall	93.8%
F1-Score	93.0%
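
F1 is the harmonic mean of precision and recall, so the reported value can be checked against the other rows of the table. A short sketch of that arithmetic:

```python
# Precision and recall from the table above, as fractions
precision = 0.923
recall = 0.938

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")
```

The result rounds to 93.0%, matching the table.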

Framework Versions

Transformers: 4.42.4
PyTorch: 2.3.1+cu121
Datasets: 2.21.0
Tokenizers: 0.19.1
