---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentimenal
---
# Multi-Label Emotion Classification with XLM-RoBERTa

This repository documents `songhieng/khmer-xlmr-base-sentimental-multi-label`, an XLM-RoBERTa model fine-tuned for multi-label emotion classification of Khmer text. The model is trained with the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, along with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

## Overview

The task is to predict which of eight emotions (anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) are present in a given piece of Khmer text; several emotions can apply to the same text. This implementation demonstrates:
- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer’s efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation

### Dataset
- **Total Data Size**: 24,969 samples
- **Train-Validation Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%
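
As a rough illustration of the split sizes, the sketch below shuffles sample indices with the standard library and cuts them 90/10. The helper name `split_indices`, the seed, and the round-down rule are assumptions made for this example; the actual training script may use a different splitter, which can round differently by one sample.

```python
import random

def split_indices(n_samples, train_fraction=0.9, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seed is an arbitrary, assumed value
    n_train = int(n_samples * train_fraction)  # rounds down
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(24969)
print(len(train_idx), len(val_idx))  # 22472 2497
```

With 24,969 samples this yields 22,472 training and 2,497 validation examples.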

## Requirements

Ensure you have the required dependencies installed:

```bash
pip install torch transformers datasets evaluate scikit-learn
```

## Data Format

The expected input data is a CSV file with columns structured as follows:

| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |

- **text_khm**: The Khmer text input
- **Emotion columns**: One column per emotion label (binary values)
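
A minimal sketch of reading this layout into `(text, labels)` pairs, using only the standard library. The sample rows are made up for illustration (they are not taken from the dataset), and the helper name `read_examples` is hypothetical.

```python
import csv
import io

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]

# Made-up sample rows in the column layout described above (not real data).
SAMPLE_CSV = """Text,emotion,emotion_score,text_khm,anger,anticipation,disgust,fear,joy,optimism,sadness,surprise
example one,joy,0.9,<khmer text 1>,0,0,0,0,1,1,0,0
example two,anger,0.8,<khmer text 2>,1,0,1,0,0,0,0,0
"""

def read_examples(handle):
    """Yield (text, labels) pairs; labels is a list of floats in EMOTIONS order."""
    for row in csv.DictReader(handle):
        labels = [float(row[name]) for name in EMOTIONS]
        yield row["text_khm"], labels

examples = list(read_examples(io.StringIO(SAMPLE_CSV)))
for text, labels in examples:
    print(text, labels)
```

For a file on disk, the same function can be given `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.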

## Training the Model

1. **Data Preparation:**
   - Load the dataset.
   - Select relevant columns (`"text_khm"` and emotion labels).
   - Split into training (90%) and validation (10%) sets.
   - Convert the dataset into a Hugging Face `Dataset` format.

2. **Tokenization:**
   - Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.

3. **Model Setup:**
   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
   - Convert labels to `float` for BCEWithLogitsLoss.

4. **Data Collator:**
   - Use the built-in `DataCollatorWithPadding`, which pads each batch to the length of its longest sequence.

5. **Training and Evaluation:**
   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
   - Implement a custom compute metrics function for multi-label subset accuracy.
   - Train the model using the `Trainer` class from Hugging Face.
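
Step 3 converts the labels to `float` because `BCEWithLogitsLoss` scores every emotion as an independent binary target and expects targets with the same floating-point type as the logits. The pure-Python sketch below shows the per-example computation in its numerically stable form. The function name `bce_with_logits` and the sample logits are illustrative assumptions; in training, the loss is computed by PyTorch, not by this code.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# One example with eight emotion logits (made-up values) and a multi-hot float target.
logits = [-2.0, -1.0, -3.0, -4.0, 3.0, 1.5, -2.5, -3.5]
targets = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Each label contributes its own term, so a text can be penalized for missing "joy" without that affecting the term for "optimism".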

## Testing the Model

To test the trained model on new Khmer text, use the following script:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference; sigmoid yields an independent probability per emotion
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = dict(zip(emotion_labels, probabilities))

# Print results
print(emotion_probabilities)
```

### Example Output

```json
{
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482
}
```

This output indicates that the model assigned a high probability only to 'Joy' (about 0.82). Every other emotion, including 'Optimism' at about 0.05, falls well below the 0.5 threshold.

## Customization

- **Thresholding:** You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present.
- **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
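
The thresholding option above can be sketched as a small helper that keeps only the emotions whose probability reaches the threshold. The function name `emotions_above_threshold` is hypothetical; the probabilities are copied from the example output earlier in this README.

```python
def emotions_above_threshold(probabilities, threshold=0.5):
    """Return the labels whose probability is at least `threshold`, highest first."""
    selected = [(label, p) for label, p in probabilities.items() if p >= threshold]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in selected]

# Probabilities copied from the example output above.
example = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}

print(emotions_above_threshold(example))        # ['Joy']
print(emotions_above_threshold(example, 0.05))  # ['Joy', 'Optimism', 'Anticipation']
```

Lowering the threshold trades precision for recall: more emotions are reported, including weakly supported ones.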

## Troubleshooting

### KeyError: 'text'
This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.

### ValueError in Metric Computation
Multi-label targets are multi-hot vectors rather than single class indices, so metrics written for single-label classification can fail on them. Use a custom `compute_metrics` function that calculates subset accuracy instead.
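
A minimal sketch of such a function, assuming the evaluation step supplies a `(logits, labels)` pair and that a prediction counts as correct only when all eight labels match. The names `subset_accuracy` and `compute_metrics`, the metric key, and the sample values are illustrative; the training script's own version may differ.

```python
import math

def subset_accuracy(logits, labels, threshold=0.5):
    """Fraction of examples whose thresholded predictions match every label."""
    # sigmoid(x) >= threshold is equivalent to x >= log(threshold / (1 - threshold)),
    # which avoids overflow for very negative logits (requires 0 < threshold < 1).
    cutoff = math.log(threshold / (1.0 - threshold))
    exact_matches = 0
    total = 0
    for row_logits, row_labels in zip(logits, labels):
        predicted = [1 if x >= cutoff else 0 for x in row_logits]
        expected = [int(round(y)) for y in row_labels]
        exact_matches += int(predicted == expected)
        total += 1
    return exact_matches / total if total else 0.0

def compute_metrics(eval_pred):
    """Hypothetical Trainer hook: expects a (logits, labels) pair."""
    logits, labels = eval_pred
    return {"subset_accuracy": subset_accuracy(logits, labels)}

# Made-up logits and labels for three examples with three labels each.
logits = [[2.0, -1.0, 0.5], [-3.0, 1.2, -0.7], [0.1, 0.2, -2.0]]
labels = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
result = compute_metrics((logits, labels))
print(result)
```

Here the first two examples match on every label and the third has one extra predicted label, so the subset accuracy is 2/3.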

## License
This project is provided for educational and research purposes. Please refer to the license file for details.

---

This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.