---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentiment
---
# Multi-Label Emotion Classification with XLM-RoBERTa
This repository provides an implementation of fine-tuning the `songhieng/khmer-xlmr-base-sentimental-multi-label` model for multi-label emotion classification in Khmer text. The model is trained using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, along with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).
## Overview
The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer's efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation
### Dataset
- **Total Data Size**: 24,969 samples
- **Train-Validation Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%
## Requirements
Ensure you have the required dependencies installed:
```bash
pip install torch transformers datasets evaluate scikit-learn
```
## Data Format
The expected input data is a CSV file with columns structured as follows:
| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ... | ... | ... | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
- **text_khm**: The Khmer text input
- **Emotion columns**: One column per emotion label (binary values)
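As a minimal sketch of reading this format, the snippet below uses only the Python standard library to keep `text_khm` and the eight emotion columns, converting each label to `float`. The helper name `load_examples`, the output keys `text` and `labels`, and the two sample rows are assumptions made for illustration; they are not part of the original training script.

```python
import csv
import io

# Emotion columns as listed in the data format table.
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]


def load_examples(csv_file):
    """Read rows and keep only text_khm plus a float label vector."""
    examples = []
    for row in csv.DictReader(csv_file):
        labels = [float(row[name]) for name in EMOTIONS]
        examples.append({"text": row["text_khm"], "labels": labels})
    return examples


# Tiny in-memory stand-in for the real CSV file (hypothetical rows).
sample_csv = io.StringIO(
    "Text,emotion,emotion_score,text_khm,anger,anticipation,disgust,"
    "fear,joy,optimism,sadness,surprise\n"
    "example one,joy,0.9,khmer text one,0,0,0,0,1,1,0,0\n"
    "example two,anger,0.8,khmer text two,1,0,1,0,0,0,0,0\n"
)
examples = load_examples(sample_csv)
print(examples[0])
```

For the real dataset, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`.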
## Training the Model
1. **Data Preparation:**
- Load the dataset.
- Select relevant columns (`"text_khm"` and emotion labels).
- Split into training (90%) and validation (10%) sets.
- Convert the dataset into a Hugging Face `Dataset` format.
2. **Tokenization:**
- Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.
3. **Model Setup:**
- Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
- Convert labels to `float` for BCEWithLogitsLoss.
4. **Custom Data Collator:**
- Use the built-in `DataCollatorWithPadding` for efficient batching.
5. **Training and Evaluation:**
- Define training arguments (learning rate, batch sizes, number of epochs, etc.).
- Implement a custom compute metrics function for multi-label subset accuracy.
- Train the model using the `Trainer` class from Hugging Face.
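Step 3 converts labels to `float` because binary cross-entropy with logits treats every emotion as an independent yes/no target and compares a real-valued logit against a 0.0/1.0 label. The dependency-free sketch below illustrates that loss for one example; the function name `bce_with_logits` and the sample numbers are illustrative assumptions, and real training would rely on PyTorch's `BCEWithLogitsLoss` instead.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# One example with three labels (hypothetical values).
logits = [-2.0, 0.0, 3.0]
targets = [0.0, 1.0, 1.0]
loss = bce_with_logits(logits, targets)
print(f"loss = {loss:.4f}")
```

Because each label contributes its own term, several emotions can be active for the same text, which is what distinguishes this setup from single-label softmax classification.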
## Testing the Model
To test the trained model on new Khmer text, use the following script:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Khmer text to classify (replace the placeholder with your own input)
text = "..."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid yields an independent probability for each emotion
logits = outputs.logits
probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
```
### Example Output
```json
{
"Anger": 0.016747648,
"Anticipation": 0.051519673,
"Disgust": 0.01696622,
"Fear": 0.0047147004,
"Joy": 0.82434595,
"Optimism": 0.052789055,
"Sadness": 0.026356682,
"Surprise": 0.0024202482
}
```
This output indicates that the model assigned a high probability to 'Joy' (about 0.82). Every other emotion, including 'Optimism' (about 0.05), scores well below the default 0.5 threshold.
## Customization
- **Thresholding:** You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present.
- **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
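As a minimal sketch of the thresholding option, the snippet below turns a dictionary of emotion probabilities into a list of predicted labels. The helper name `select_emotions` is an assumption for illustration, and the probabilities are copied from the example output above.

```python
def select_emotions(probabilities, threshold=0.5):
    """Return the emotions whose probability meets the threshold."""
    return [label for label, p in probabilities.items() if p >= threshold]


# Probabilities taken from the example output in this README.
emotion_probabilities = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}

print(select_emotions(emotion_probabilities))                   # default 0.5
print(select_emotions(emotion_probabilities, threshold=0.05))   # more permissive
```

Lowering the threshold trades precision for recall: at 0.5 only 'Joy' is selected, while at 0.05 'Anticipation' and 'Optimism' are included as well.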
## Troubleshooting
### KeyError: 'text'
This error occurs when the text field is missing from the dataset after tokenization. Ensure that the tokenization function reads the correct text column and is applied to the dataset before it is passed to the trainer.
### ValueError in Metric Computation
Since multi-label targets differ from single-label classification, use a custom `compute_metrics` function to calculate subset accuracy, which counts a sample as correct only when every emotion label matches.
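The following dependency-free sketch shows one way subset accuracy could be computed from raw logits. The function names `sigmoid` and `subset_accuracy`, the 0.5 threshold, and the sample values are assumptions for illustration; the original training script may implement the metric differently.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def subset_accuracy(logits, labels, threshold=0.5):
    """Fraction of samples whose full predicted label set matches exactly."""
    exact = 0
    for row_logits, row_labels in zip(logits, labels):
        predicted = [1 if sigmoid(x) >= threshold else 0 for x in row_logits]
        expected = [int(y) for y in row_labels]
        exact += predicted == expected
    return exact / len(labels)


# Three samples with three labels each (hypothetical values).
logits = [[2.0, -1.0, 0.5], [-0.3, 1.2, -2.0], [1.0, 1.0, -1.0]]
labels = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
accuracy = subset_accuracy(logits, labels)
print(f"subset accuracy = {accuracy:.3f}")
```

Inside a `compute_metrics` function passed to `Trainer`, the predictions and labels arrive as arrays, so they could be converted with `.tolist()` before calling a helper like this and the result returned in a dictionary such as `{"accuracy": value}`.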
## License
This project is provided for educational and research purposes. Please refer to the license file for details.
---
This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.