---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentimental
---

# Multi-Label Emotion Classification with XLM-RoBERTa

This repository documents the fine-tuning of `songhieng/khmer-xlmr-base-sentimental-multi-label`, an XLM-RoBERTa model (base checkpoint: `FacebookAI/xlm-roberta-base`) for multi-label emotion classification of Khmer text. The model is trained with the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, together with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

## Overview

The task is to predict which of eight emotions (anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) are present in a piece of Khmer text; any number of them may apply at once. This implementation demonstrates:

- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer's efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation (a prediction counts as correct only if every label matches)

### Dataset

- **Total Data Size**: 24,969 samples
- **Train-Validation Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%

## Requirements

Install the required dependencies:

```bash
pip install torch transformers datasets evaluate scikit-learn
```

## Data Format

The expected input is a CSV file with the following columns:

| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |

- **text_khm**: The Khmer text input
- **Emotion columns**: One binary (0/1) column per emotion label
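
As a minimal sketch of how rows in this format could be prepared, the snippet below parses a small in-memory CSV with Python's standard `csv` module, builds a float multi-hot label vector per row, and makes a shuffled 90/10 split. The sample rows, the helper names (`make_sample_csv`, `load_rows`, `split_rows`), and the seed are hypothetical; the actual training script may do this differently, for example with `datasets`.

```python
import csv
import io
import random

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]


def make_sample_csv(n_rows=10):
    """Build a hypothetical CSV that follows the documented column layout."""
    header = ["Text", "emotion", "emotion_score", "text_khm"] + EMOTIONS
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for i in range(n_rows):
        active = i % len(EMOTIONS)  # exactly one emotion set per fake row
        flags = [1 if j == active else 0 for j in range(len(EMOTIONS))]
        writer.writerow([f"text {i}", EMOTIONS[active], 1.0, f"khmer text {i}"] + flags)
    return buf.getvalue()


def load_rows(csv_text):
    """Keep only text_khm and the emotion columns; labels become floats."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        labels = [float(record[name]) for name in EMOTIONS]
        rows.append({"text_khm": record["text_khm"], "labels": labels})
    return rows


def split_rows(rows, val_fraction=0.1, seed=42):
    """Shuffle a copy of the rows and return (train, validation)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


rows = load_rows(make_sample_csv(10))
train_rows, val_rows = split_rows(rows)
print(len(train_rows), len(val_rows))
print(rows[0])
```

Storing the labels as floats at this stage means no further conversion is needed before the loss is computed.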

## Training the Model

1. **Data Preparation:**
   - Load the dataset.
   - Select the relevant columns (`"text_khm"` and the emotion labels).
   - Split into training (90%) and validation (10%) sets.
   - Convert the data into Hugging Face `Dataset` objects.

2. **Tokenization:**
   - Use the fast XLM-RoBERTa tokenizer with padding and truncation.

3. **Model Setup:**
   - Load the pre-trained model (the model card metadata lists `FacebookAI/xlm-roberta-base` as the base model; load `songhieng/khmer-xlmr-base-sentimental-multi-label` to continue from the fine-tuned weights).
   - Convert labels to `float`, as required by `BCEWithLogitsLoss`.

4. **Custom Data Collator:**
   - Use the built-in `DataCollatorWithPadding` for efficient batching.

5. **Training and Evaluation:**
   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
   - Implement a custom `compute_metrics` function for multi-label subset accuracy.
   - Train the model using the `Trainer` class from Hugging Face.
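
Step 3 converts labels to `float` because `BCEWithLogitsLoss` treats each emotion as an independent binary decision and compares a raw logit with a 0.0/1.0 target. The NumPy sketch below illustrates the quantity that loss computes with mean reduction, written in its numerically stable form. It is an illustration with made-up logits, not the training code, and the function name `bce_with_logits` is hypothetical.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (sample, label) pairs.

    Uses the stable identity
        loss = max(x, 0) - x * y + log(1 + exp(-|x|))
    which equals -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    per_element = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(per_element.mean())


# One sample with eight made-up emotion logits; joy and optimism are the targets.
logits = np.array([[-3.0, -2.0, -4.0, -5.0, 2.5, 1.0, -3.5, -6.0]])
labels = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]])

print(bce_with_logits(logits, labels))        # small: logits agree with the labels
print(bce_with_logits(logits, 1.0 - labels))  # large: logits contradict the labels
```

Because every label contributes its own term, several emotions can be active for the same text, which a softmax with cross-entropy would not allow.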

## Testing the Model

To test the trained model on new Khmer text, use the following script:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "..."  # placeholder: insert the Khmer sentence to classify

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Sigmoid (not softmax) yields an independent probability per emotion
probabilities = torch.sigmoid(logits).squeeze(0).numpy()

# Define emotion labels (same order as the label columns in the training data)
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
```

### Example Output

```json
{
  "Anger": 0.016747648,
  "Anticipation": 0.051519673,
  "Disgust": 0.01696622,
  "Fear": 0.0047147004,
  "Joy": 0.82434595,
  "Optimism": 0.052789055,
  "Sadness": 0.026356682,
  "Surprise": 0.0024202482
}
```

This output indicates that the model assigns a high probability to 'Joy' (about 0.82) in the input text. Every other emotion scores below 0.06, with 'Optimism' and 'Anticipation' (about 0.05 each) the next highest.

## Customization

- **Thresholding:** Adjust the probability threshold (default: 0.5) that determines which emotions are considered present.
- **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** Swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
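
To illustrate the thresholding option, the sketch below applies a cutoff to the probabilities from the example output above. The helper name `emotions_above` is hypothetical; 0.5 is the default threshold mentioned in the list above.

```python
def emotions_above(probabilities, threshold=0.5):
    """Return the labels whose probability is at least `threshold`."""
    return [label for label, p in probabilities.items() if p >= threshold]


# Probabilities copied from the example output above.
example = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}

print(emotions_above(example))        # ['Joy']
print(emotions_above(example, 0.05))  # ['Anticipation', 'Joy', 'Optimism']
```

A lower threshold reports more emotions per text at the cost of more false positives, so it is worth tuning on the validation set.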

## Troubleshooting

### KeyError: 'text'

This error occurs when the text field is missing after tokenization. Ensure that tokenization is applied correctly, reading the column that actually holds the text (`text_khm` in the data format above), before passing the dataset to the trainer.

### ValueError in Metric Computation

Multi-label targets differ from single-label targets, so metrics written for single-label classification can fail on them. Use a custom `compute_metrics` function to calculate subset accuracy instead.
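
A minimal sketch of such a function, assuming the `Trainer` hands it a `(logits, labels)` pair of NumPy arrays and that a 0.5 threshold on the sigmoid probabilities is wanted. Subset accuracy counts a sample as correct only when every label matches. The metric key `subset_accuracy` and the toy arrays are illustrative choices, not necessarily what the original training script uses.

```python
import numpy as np


def compute_metrics(eval_pred, threshold=0.5):
    """Subset accuracy: fraction of samples whose full label vector is predicted exactly."""
    logits, labels = eval_pred
    logits = np.asarray(logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))           # element-wise sigmoid
    preds = (probs >= threshold).astype(int)
    labels = np.asarray(labels).astype(int)         # float targets back to 0/1
    exact_match = (preds == labels).all(axis=1)     # one boolean per sample
    return {"subset_accuracy": float(exact_match.mean())}


# Toy check with three samples and three labels (made-up values).
toy_logits = np.array([[2.0, -1.0, -3.0],
                       [0.5, 1.5, -2.0],    # first label is a false positive
                       [-1.0, -1.0, 4.0]])
toy_labels = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

print(compute_metrics((toy_logits, toy_labels)))  # two of three samples match exactly
```

It can be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`.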

## License

This project is provided for educational and research purposes under the Apache 2.0 license declared in the model card metadata. Please refer to the license file for details.

---

This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.