Update README.md
README.md
CHANGED
@@ -1,30 +1,37 @@
This model achieved around 64.728% accuracy.

---

# Multi-Label Emotion Classification with XLM-RoBERTa

This repository provides an

## Overview

The task involves predicting multiple
- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer’s efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation

- [PyTorch](https://pytorch.org/) (tested with 1.8+)
- [Transformers](https://github.com/huggingface/transformers)
- [Datasets](https://github.com/huggingface/datasets)
- [Evaluate](https://github.com/huggingface/evaluate)
- [scikit-learn](https://scikit-learn.org/)

```bash
pip install torch transformers datasets evaluate scikit-learn
```
@@ -32,74 +39,108 @@ pip install torch transformers datasets evaluate scikit-learn

## Data Format

The expected input data is a CSV file with columns

| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |

- **text_khm**: The Khmer text
- **Emotion columns**: One column per emotion label

## Running the Code

The main steps are as follows:

1. **Data Preparation:**
   - Load
   - Select
   - Split
   - Convert the

2. **Tokenization:**

3. **Model Setup:**
   - Load the pre-trained
   - Convert labels to

4. **Custom Data Collator:**
   - Use the built-in `DataCollatorWithPadding`
   - Manually add labels (as `torch.float`) back into the batch.

5. **Training and Evaluation:**
   - Define training arguments (

The default accuracy metric has been replaced with a custom function that computes multi-label subset accuracy. This metric considers a sample correct only if all the labels match.

A custom data collator leverages the fast tokenizer's `__call__` method for padding, ensuring efficient batch processing.

## Troubleshooting

## License

This project is provided for educational purposes. Please refer to the license file for details.

---

---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentimenal
---
# Multi-Label Emotion Classification with XLM-RoBERTa

This repository describes how `songhieng/khmer-xlmr-base-sentimental-multi-label` was fine-tuned from `FacebookAI/xlm-roberta-base` for multi-label emotion classification of Khmer text. The model is trained with the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, together with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

## Overview

The task is to predict one or more of eight emotions (anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer’s efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation

### Dataset
- **Total Data Size**: 24,969 samples
- **Train/Validation Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%

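As a rough illustration of the split sizes, the sketch below shuffles row indices with a fixed seed and holds out 10% for validation. The helper name `split_indices`, the seed, and the rounding rule are assumptions made for this example; the actual training script may split differently (for instance with scikit-learn's `train_test_split`).

```python
import random


def split_indices(n_samples, val_fraction=0.1, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into train/validation lists.

    Hypothetical helper: the name, seed, and rounding rule are illustrative only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(24_969)
print(len(train_idx), len(val_idx))
```

With 24,969 samples and this rounding rule, the sketch yields 22,472 training and 2,497 validation indices.
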
## Requirements

Ensure you have the required dependencies installed:

```bash
pip install torch transformers datasets evaluate scikit-learn
```

## Data Format

The expected input is a CSV file with the following columns:

| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |

- **text_khm**: The Khmer text input
- **Emotion columns**: One column per emotion label (binary values, `0` or `1`)

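The column layout above can be read with the standard library alone. The following is a minimal sketch, not the repository's loading code: `read_examples` and `EMOTION_COLUMNS` are hypothetical names, and the single CSV row is a made-up placeholder used so the example runs without the real data file.

```python
import csv
import io

# Label columns, in the order shown in the table above.
EMOTION_COLUMNS = [
    "anger", "anticipation", "disgust", "fear",
    "joy", "optimism", "sadness", "surprise",
]


def read_examples(csv_file):
    """Yield (text, labels) pairs, where labels is a list of eight floats."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing = [c for c in ["text_khm", *EMOTION_COLUMNS] if c not in fieldnames]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")
    for row in reader:
        # Only the Khmer text and the binary label columns are kept.
        yield row["text_khm"], [float(row[c]) for c in EMOTION_COLUMNS]


# Tiny in-memory stand-in for the real CSV file (placeholder row, not real data).
sample_csv = io.StringIO(
    "Text,emotion,emotion_score,text_khm,"
    "anger,anticipation,disgust,fear,joy,optimism,sadness,surprise\n"
    "placeholder,joy,0.9,placeholder Khmer text,0,0,0,0,1,1,0,0\n"
)
examples = list(read_examples(sample_csv))
print(examples)
```

Reading the labels as floats up front matches the later training step, where the loss expects floating-point targets.
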
## Training the Model

1. **Data Preparation:**
   - Load the dataset.
   - Select relevant columns (`"text_khm"` and emotion labels).
   - Split into training (90%) and validation (10%) sets.
   - Convert the dataset into a Hugging Face `Dataset` format.

2. **Tokenization:**
   - Use the fast XLM-RoBERTa tokenizer with padding and truncation.

3. **Model Setup:**
   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
   - Convert labels to `float` for `BCEWithLogitsLoss`.

4. **Custom Data Collator:**
   - Use the built-in `DataCollatorWithPadding` for efficient batching.

5. **Training and Evaluation:**
   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
   - Implement a custom `compute_metrics` function for multi-label subset accuracy.
   - Train the model using the `Trainer` class from Hugging Face.

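Step 3 converts the labels to `float` because `BCEWithLogitsLoss` treats each emotion as an independent yes/no decision and compares a raw logit against a 0.0/1.0 target. The framework-free sketch below shows that per-label computation; `bce_with_logits` is a hypothetical helper written for illustration and the logits are invented values. In an actual training run, `torch.nn.BCEWithLogitsLoss` would be used instead.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# One sample with eight emotion logits (illustrative values);
# joy and optimism are the true labels.
logits = [-3.0, -2.0, -3.5, -4.0, 2.5, 1.0, -2.5, -4.5]
targets = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
print(round(bce_with_logits(logits, targets), 4))
```

Because every label contributes its own term, a sample can be penalized for missing one emotion while still being rewarded for getting the others right, which is what distinguishes this setup from single-label softmax classification.
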
## Testing the Model

To test the trained model on new Khmer text, use the following script:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference (no gradients are needed)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Sigmoid yields an independent probability for each emotion
    probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Create a dictionary mapping emotions to probabilities
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
```

### Example Output

```json
{
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482
}
```

This output indicates that the model assigned a high probability only to 'Joy' (about 0.82); every other emotion, including 'Optimism' (about 0.05), received a low probability.

## Customization

- **Thresholding:** You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present.
- **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.

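To make the thresholding option concrete, here is a small sketch that turns a probability dictionary into a list of predicted emotions. The function name `emotions_above` and the choice of `>=` for the comparison are assumptions for illustration; the probabilities are the ones from the example output in this README.

```python
def emotions_above(probabilities, threshold=0.5):
    """Return the labels whose probability meets or exceeds the threshold."""
    return [label for label, p in probabilities.items() if p >= threshold]


# Probabilities copied from the example output in this README.
example = {
    "Anger": 0.016747648, "Anticipation": 0.051519673, "Disgust": 0.01696622,
    "Fear": 0.0047147004, "Joy": 0.82434595, "Optimism": 0.052789055,
    "Sadness": 0.026356682, "Surprise": 0.0024202482,
}
print(emotions_above(example))        # ['Joy']
print(emotions_above(example, 0.05))  # ['Anticipation', 'Joy', 'Optimism']
```

Lowering the threshold trades precision for recall: at 0.05 the sketch also reports 'Anticipation' and 'Optimism', which the default threshold treats as absent.
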
## Troubleshooting

### KeyError: 'text'
This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.

### ValueError in Metric Computation
Since multi-label targets differ from single-label classification, use a custom `compute_metrics` function to calculate subset accuracy.

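A custom metric function of the kind described above might look like the following sketch. It assumes the evaluation object can be unpacked into `(logits, labels)` arrays, applies a sigmoid and a 0.5 threshold, and counts a sample as correct only when all labels match. The key name `subset_accuracy` and the `threshold` parameter are illustrative choices, not taken from the original training script.

```python
import numpy as np


def compute_metrics(eval_pred, threshold=0.5):
    """Subset accuracy: a sample counts as correct only if every label matches."""
    logits, labels = eval_pred
    # Sigmoid converts each logit into an independent per-label probability.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    preds = (probs >= threshold).astype(int)
    exact_match = (preds == np.asarray(labels).astype(int)).all(axis=1)
    return {"subset_accuracy": float(exact_match.mean())}


# Two samples, three labels each: the first matches fully, the second misses one label.
logits = np.array([[2.0, -1.0, 0.5], [1.5, 0.3, -2.0]])
labels = np.array([[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
print(compute_metrics((logits, labels)))  # {'subset_accuracy': 0.5}
```

Subset accuracy is strict: the second sample gets two of three labels right but still counts as wrong, so this metric is typically lower than per-label accuracy.
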
## License
This project is provided for educational and research purposes. Please refer to the license file for details.

---

This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.