---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentimental
---
# Multi-Label Emotion Classification with XLM-RoBERTa
This repository provides `songhieng/khmer-xlmr-base-sentimental-multi-label`, an XLM-RoBERTa base model fine-tuned for multi-label emotion classification in Khmer text, and describes how it was trained. The model is trained using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, along with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).
## Overview
The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer’s efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation
### Dataset
- **Total Data Size**: 24,969 samples
- **Train-Validation Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%
## Requirements
Ensure you have the required dependencies installed:
```bash
pip install torch transformers datasets evaluate scikit-learn
```
## Data Format
The expected input data is a CSV file with columns structured as follows:
| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ... | ... | ... | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
- **text_khm**: The Khmer text input
- **Emotion columns**: One column per emotion label (binary values)
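As a minimal sketch of checking a file against this layout, the snippet below reads a CSV with the standard library, verifies that `text_khm` and the eight emotion columns are present and binary, and makes a 90/10 split. The helper names (`load_rows`, `split_rows`), the seed, and the inline sample rows are illustrative assumptions, not part of the original training script.

```python
import csv
import io
import random

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]

def load_rows(csv_file):
    """Read rows and keep only `text_khm` plus the binary emotion columns."""
    reader = csv.DictReader(csv_file)
    required = ["text_khm", *EMOTIONS]
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row_no, row in enumerate(reader, start=1):
        labels = {name: int(row[name]) for name in EMOTIONS}
        if any(value not in (0, 1) for value in labels.values()):
            raise ValueError(f"non-binary label in data row {row_no}")
        rows.append({"text_khm": row["text_khm"], **labels})
    return rows

def split_rows(rows, val_fraction=0.1, seed=42):
    """Shuffle a copy of the rows and split off a validation set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Tiny in-memory stand-in for the real CSV (hypothetical placeholder rows).
header = "Text,emotion,emotion_score,text_khm," + ",".join(EMOTIONS)
lines = [f"t{i},joy,0.9,khmer text {i},0,0,0,0,1,0,0,0" for i in range(10)]
sample = io.StringIO("\n".join([header, *lines]))

train_rows, val_rows = split_rows(load_rows(sample))
print(len(train_rows), len(val_rows))  # 9 1
```

For a real file, pass an open file handle instead of the `StringIO` object, opened with `newline=""` and `encoding="utf-8"` so Khmer text is read correctly.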
## Training the Model
1. **Data Preparation:**
   - Load the dataset.
   - Select relevant columns (`"text_khm"` and emotion labels).
   - Split into training (90%) and validation (10%) sets.
   - Convert the dataset into a Hugging Face `Dataset` format.
2. **Tokenization:**
   - Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.
3. **Model Setup:**
   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
   - Convert labels to `float` for `BCEWithLogitsLoss`.
4. **Custom Data Collator:**
   - Use the built-in `DataCollatorWithPadding` for efficient batching.
5. **Training and Evaluation:**
   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
   - Implement a custom compute metrics function for multi-label subset accuracy.
   - Train the model using the `Trainer` class from Hugging Face.
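The steps above can be sketched roughly as follows. This is not the original training script: the function names, the output directory, the learning rate, the batch size, and the maximum sequence length are assumptions chosen for illustration, and the sketch starts from `FacebookAI/xlm-roberta-base` (the base model listed in the card metadata). It expects two Hugging Face `Dataset` objects with a `text_khm` column and one column per emotion.

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]

def preprocess(batch, tokenizer, max_length=128):
    """Tokenize a batch and pack the emotion columns into float label vectors."""
    encoded = tokenizer(batch["text_khm"], truncation=True, max_length=max_length)
    n_rows = len(batch["text_khm"])
    # BCEWithLogitsLoss needs float targets, one value per emotion.
    encoded["labels"] = [
        [float(batch[name][i]) for name in EMOTIONS] for i in range(n_rows)
    ]
    return encoded

def build_trainer(train_ds, eval_ds, compute_metrics=None,
                  base_model="FacebookAI/xlm-roberta-base"):
    """Assemble a Trainer for multi-label fine-tuning (hyperparameters are assumed)."""
    # Imported here so `preprocess` stays usable without the heavy libraries.
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=len(EMOTIONS),
        problem_type="multi_label_classification",  # selects BCEWithLogitsLoss
    )

    def tokenize(ds):
        # Drop the raw columns so only model inputs and labels remain.
        return ds.map(lambda b: preprocess(b, tokenizer),
                      batched=True, remove_columns=ds.column_names)

    args = TrainingArguments(
        output_dir="khmer-emotion-out",   # hypothetical path
        learning_rate=2e-5,               # assumed value
        per_device_train_batch_size=16,   # assumed value
        per_device_eval_batch_size=16,    # assumed value
        num_train_epochs=3,               # matches the 3 epochs reported above
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=tokenize(train_ds),
        eval_dataset=tokenize(eval_ds),
        data_collator=DataCollatorWithPadding(tokenizer),  # pads per batch
        compute_metrics=compute_metrics,
    )
```

Calling `trainer = build_trainer(train_ds, eval_ds)` followed by `trainer.train()` and `trainer.evaluate()` would run fine-tuning and validation. Pass your own metric function as `compute_metrics` to report subset accuracy.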
## Testing the Model
To test the trained model on new Khmer text, use the following script:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Sigmoid (not softmax): each emotion is scored independently
    probabilities = torch.sigmoid(logits).squeeze(0).numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = {label: float(prob) for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
```
### Example Output
```json
{
"Anger": 0.016747648,
"Anticipation": 0.051519673,
"Disgust": 0.01696622,
"Fear": 0.0047147004,
"Joy": 0.82434595,
"Optimism": 0.052789055,
"Sadness": 0.026356682,
"Surprise": 0.0024202482
}
```
This output indicates that the model assigns a high probability to 'Joy' (about 0.82). Every other emotion, including 'Optimism' (about 0.05), falls well below the 0.5 threshold.
## Customization
- **Thresholding:** You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present.
- **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
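As a small illustration of thresholding, the sketch below turns a probability dictionary like the example output into a list of detected emotions. The function name `active_emotions` is a hypothetical helper, not part of the original script.

```python
def active_emotions(emotion_probabilities, threshold=0.5):
    """Return the labels whose probability is at or above the threshold."""
    return [label for label, prob in emotion_probabilities.items()
            if prob >= threshold]

# Probabilities copied from the example output above.
scores = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}

print(active_emotions(scores))                  # ['Joy']
print(active_emotions(scores, threshold=0.05))  # ['Anticipation', 'Joy', 'Optimism']
```

Lowering the threshold trades precision for recall, so it is worth tuning on the validation set rather than fixing it by hand.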
## Troubleshooting
### KeyError: 'text'
This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.
### ValueError in Metric Computation
Multi-label targets are vectors rather than single class indices, so metrics written for single-label classification can fail on them. Use a custom `compute_metrics` function to calculate subset accuracy instead.
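One possible shape for such a function is sketched below, assuming NumPy is available and that predictions arrive as raw logits, as they do from `Trainer`. The 0.5 threshold and the `subset_accuracy` key are assumptions. A row counts as correct only when all eight predicted labels match the target.

```python
import numpy as np

def compute_metrics(eval_pred, threshold=0.5):
    """Subset accuracy: fraction of rows where every label is predicted correctly."""
    logits, labels = eval_pred
    logits = np.asarray(logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))         # element-wise sigmoid
    preds = (probs >= threshold).astype(int)
    labels = np.asarray(labels).astype(int)
    exact_match = np.all(preds == labels, axis=1)  # one boolean per row
    return {"subset_accuracy": float(exact_match.mean())}

# Toy check with two labels per row: rows 0 and 2 match exactly, row 1 does not.
toy_logits = [[2.0, -1.0], [0.5, 0.5], [-3.0, -3.0]]
toy_labels = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
print(compute_metrics((toy_logits, toy_labels)))
```

Subset accuracy is strict, since one wrong label makes the whole row count as a miss, which is why it tends to be lower than per-label accuracy.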
## License
This project is released under the Apache 2.0 license and is provided for educational and research purposes. Please refer to the license file for details.
---
This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.