---
language: en
tags:
- text-classification
- e-commerce
- product-classification
- distilbert
license: apache-2.0
datasets:
- lakritidis/product-classification-and-categorization
model-index:
- name: DistilBERT-ProductClassifier
  results:
  - task:
      type: text-classification
      name: Product Category Classification
    dataset:
      name: Product Classification and Categorization
      type: lakritidis/product-classification-and-categorization
    metrics:
    - type: accuracy
      value: 96.17
      name: Accuracy
---

# Model Card for DistilBERT Product Classifier
## Model Overview
- **Model Name**: DistilBERT Product Classifier
- **Model Type**: Text Classification
- **Base Model**: `distilbert-base-uncased`
- **Dataset**: [Kaggle Product Classification Dataset](https://www.kaggle.com/datasets/lakritidis/product-classification-and-categorization?select=pricerunner_aggregate.csv)
- **Purpose**: This model was fine-tuned for product classification based on raw product titles. It categorizes items into appropriate product categories, helping in scenarios such as data scraping from multiple e-commerce sites where category schemes vary widely.
### Why This Model?
The primary goal for this model is to streamline data classification in price comparison and e-commerce applications. By taking product titles from different websites, it allows for accurate category predictions, which can be challenging given the variations across sites.
This model is ideal for use in:
- **E-commerce sites**: Assisting in categorizing scraped products into uniform categories.
- **Data pipelines**: Enabling automated categorization in real-time or batch pipelines.
- **Product inventory**: Supporting the structuring of large product inventories by category.
---
## Model Architecture
The model is based on `distilbert-base-uncased`, a lightweight and efficient variant of BERT that reduces memory and computation while maintaining performance. DistilBERT is particularly suited for edge deployment, given its low resource requirement.
### Model Class
- **DistilBertForSequenceClassification**: This model class was chosen because it’s pre-configured for text classification tasks, making it easy to adapt for our fine-tuning process.
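As a minimal sketch (assuming the eight labels listed in the Training Data section below), instantiating this architecture with a classification head looks like:

```python
from transformers import DistilBertForSequenceClassification

# Load the pretrained encoder and attach a classification head with 8 outputs,
# one per product category (the head is randomly initialized before fine-tuning)
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=8
)
```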
---
## Training Data
The training dataset consists of various product categories from the [Kaggle Product Classification Dataset](https://www.kaggle.com/datasets/lakritidis/product-classification-and-categorization?select=pricerunner_aggregate.csv). Data was preprocessed to ensure balance across categories and consistency in row count.
**Categories**:

- CPUs = 0
- Digital Cameras = 1
- Dishwashers = 2
- Fridge Freezers = 3
- Microwaves = 4
- Mobile Phones = 5
- TVs = 6
- Washing Machines = 7
**Preprocessing Steps**:
1. Balanced each category by sampling equal rows.
2. Divided data into manageable chunks to avoid memory overload during training.
3. Standardized text casing to lowercase and tokenized product titles.
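A sketch of these steps with pandas; the column names `title` and `category` are assumptions for illustration, so check the actual CSV header before running:

```python
import pandas as pd

df = pd.read_csv("pricerunner_aggregate.csv")  # column names below are assumed

# Step 1: balance categories by sampling an equal number of rows from each
n = df["category"].value_counts().min()
df = df.groupby("category", group_keys=False).sample(n=n, random_state=42)

# Step 3 (applied early so the chunks below are ready to use): lowercase titles;
# subword tokenization happens with the DistilBERT tokenizer at training time
df["title"] = df["title"].str.lower()

# Step 2: split into fixed-size chunks to keep memory bounded during training
chunk_size = 10_000  # illustrative value
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
```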
---
## Training & Fine-Tuning Process
### Fine-Tuning Parameters
The model was fine-tuned with the following settings:
- **Epochs**: Multiple epochs, giving the model ample exposure to each product category.
- **Batch Size**: Chosen to balance memory usage against convergence rate.
- **Learning Rate**: Tuned for stable training across all categories.
- **Loss Function**: Cross-entropy, the standard objective for multi-class classification.
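The exact hyperparameter values are not recorded in this card. As an illustration only, a comparable fine-tuning run with the Hugging Face `Trainer` might look like the sketch below; the epoch count, batch size, learning rate, and inline sample data are placeholder assumptions, not the values used for this model. (Cross-entropy is the default loss `Trainer` applies when labels are provided, matching the setup above.)

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Tiny inline dataset for illustration; the real run used the Kaggle data above
train_data = Dataset.from_dict({
    "text": ["intel core i7 9700k", "samsung galaxy s21 ultra"],
    "label": [0, 5],  # CPUs = 0, Mobile Phones = 5
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=8  # eight product categories
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

# Placeholder hyperparameters; the actual values were not published in this card
args = TrainingArguments(
    output_dir="distilbert-product-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=train_data).train()
```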
### Code Sample
To load and use the model in your environment, refer to the following example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and fine-tuned model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Adnan-AI-Labs/DistilBERT-ProductClassifier")
model = AutoModelForSequenceClassification.from_pretrained("Adnan-AI-Labs/DistilBERT-ProductClassifier")
model.eval()

# Classify a sample product title
text = "Samsung Galaxy S21 Ultra"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_category = torch.argmax(outputs.logits, dim=-1).item()
print(f"Predicted Category ID: {predicted_category}")
```
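To map the predicted ID back to a readable name, check `model.config.id2label` first; if the config only contains generic `LABEL_n` entries, a manual mapping based on the category list above works (assuming the label order in the Training Data section matches the model's output indices):

```python
# Manual mapping, assuming the ID order from the Training Data section
id2label = {
    0: "CPUs", 1: "Digital Cameras", 2: "Dishwashers", 3: "Fridge Freezers",
    4: "Microwaves", 5: "Mobile Phones", 6: "TVs", 7: "Washing Machines",
}
print(f"Predicted Category: {id2label[predicted_category]}")
```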
---
## Key Metrics
- **Accuracy**: 96.16%
- **Macro Average F1 Score**: 0.961
- **Weighted Average F1 Score**: 0.962
This evaluation shows that the model performs reliably across a wide range of product categories with high precision and recall, demonstrating its suitability for production applications.
---
## Deployment & Intended Usage
This model is designed to integrate easily into applications and runs in low-memory environments, making it well suited to edge deployment on devices with limited resources. Potential use cases include:
- Real-time product categorization for price comparison websites.
- Data sorting in e-commerce applications to keep product inventories consistent.
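For quick integration, the `transformers` pipeline API wraps tokenization and inference in a single call. A minimal sketch (the label strings in the output depend on the model's stored config, so verify them against the category list above):

```python
from transformers import pipeline

# Build a text-classification pipeline around the fine-tuned model
classifier = pipeline(
    "text-classification",
    model="Adnan-AI-Labs/DistilBERT-ProductClassifier",
)

titles = [
    "bosch serie 4 freestanding dishwasher",
    "canon eos 250d dslr camera",
]
for result in classifier(titles):
    print(result)  # e.g. {'label': '...', 'score': 0.99}
```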
---
## Limitations
- **Category Overlap**: Some products may fall into multiple categories (e.g., a "smart microwave" might fit both microwaves and smart devices), while the model assigns only one of the eight labels.
- **Data Bias**: Results may vary for product names that are not well represented in the training data.
---
## Ethical Considerations
The model may inherit biases from its training data, so it is recommended to test it periodically on updated data and to consider fine-tuning on more diverse datasets. Users should monitor the model's performance and ensure it aligns with ethical practices, especially regarding fair representation.