---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentimental
---

# Multi-Label Emotion Classification with XLM-RoBERTa

This repository provides an implementation of fine-tuning the `songhieng/khmer-xlmr-base-sentimental-multi-label` model for multi-label emotion classification in Khmer text. The model is trained with the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, together with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

## Overview

The task is to predict multiple emotions (anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:

- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer's efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation

### Dataset

- **Total Data Size**: 24,969 samples
- **Train-Test Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%

## Requirements

Ensure the required dependencies are installed:

```bash
pip install torch transformers datasets evaluate scikit-learn
```

## Data Format

The expected input data is a CSV file with the following columns:

| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |

- **text_khm**: The Khmer text input
- **Emotion columns**: One binary column (0/1) per emotion label

## Training the Model

1. **Data Preparation:**
   - Load the dataset.
   - Select the relevant columns (`"text_khm"` and the emotion labels).
   - Split into training (90%) and validation (10%) sets.
   - Convert the splits into the Hugging Face `Dataset` format.
2. **Tokenization:**
   - Use the fast XLM-RoBERTa tokenizer with padding and truncation.
3. **Model Setup:**
   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
   - Convert labels to `float` for `BCEWithLogitsLoss`.
4. **Custom Data Collator:**
   - Use the built-in `DataCollatorWithPadding` for efficient batching.
5. **Training and Evaluation:**
   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
   - Implement a custom `compute_metrics` function for multi-label subset accuracy.
   - Train the model using the `Trainer` class from Hugging Face (see the sketch below).
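The steps above can be wired together roughly as in the following sketch. This is a minimal illustration rather than the exact training script: the CSV filename (`data.csv`), the output directory, the hyperparameter values, and the 0.5 decision threshold inside `compute_metrics` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

LABELS = ["anger", "anticipation", "disgust", "fear",
          "joy", "optimism", "sadness", "surprise"]

# 1. Data preparation: keep the text column and the binary label columns,
#    then split 90/10 into training and validation sets.
#    "data.csv" is a placeholder for your dataset file.
df = pd.read_csv("data.csv")[["text_khm"] + LABELS]
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

# 2. Tokenization with the fast XLM-RoBERTa tokenizer.
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess(batch):
    enc = tokenizer(batch["text_khm"], truncation=True)
    # 3. Labels must be floats so BCEWithLogitsLoss accepts them.
    enc["labels"] = [
        [float(batch[label][i]) for label in LABELS]
        for i in range(len(batch["text_khm"]))
    ]
    return enc

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

# Model setup for multi-label classification over the 8 emotion labels.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",
)

# 4. Dynamic padding via the built-in data collator.
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 5. Subset accuracy: a sample counts as correct only if every label matches.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid over the logits
    preds = (probs >= 0.5).astype(int)      # 0.5 decision threshold (assumption)
    return {"accuracy": float((preds == labels).all(axis=1).mean())}

args = TrainingArguments(
    output_dir="khmer-emotion-multilabel",  # illustrative output directory
    learning_rate=2e-5,                     # illustrative hyperparameters
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
```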
## Testing the Model

To test the trained model on new Khmer text, use the following script:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Map each emotion to its predicted probability
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
```

### Example Output

```json
{
  "Anger": 0.016747648,
  "Anticipation": 0.051519673,
  "Disgust": 0.01696622,
  "Fear": 0.0047147004,
  "Joy": 0.82434595,
  "Optimism": 0.052789055,
  "Sadness": 0.026356682,
  "Surprise": 0.0024202482
}
```

The model assigns a high probability to 'Joy' (about 0.82) for this input, with 'Optimism' a distant second and all other emotions close to zero.

## Customization

- **Thresholding:** Adjust the probability threshold (default: 0.5) to determine which emotions are considered present (see the appendix at the end of this README).
- **Fine-tuning Parameters:** Modify hyperparameters such as the learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** Swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.

## Troubleshooting

### KeyError: 'text'

This error occurs when the text field is missing after tokenization. Ensure that tokenization is applied correctly before passing the dataset to the trainer.

### ValueError in Metric Computation

Multi-label targets differ in shape from single-label targets, so use a custom `compute_metrics` function to calculate subset accuracy (see the training sketch above).

## License

This project is released under the Apache 2.0 license and is provided for educational and research purposes.

---

This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.
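## Appendix: Applying a Threshold

As noted under Customization, discrete emotion predictions are obtained by thresholding the per-emotion probabilities. The snippet below is a minimal illustration of that step; the `predict_emotions` helper and its 0.5 default are assumptions for demonstration, not part of the released code.

```python
# Turn the probability dictionary from the inference script into a list of
# predicted emotions. The 0.5 default mirrors the threshold mentioned above.
def predict_emotions(emotion_probabilities, threshold=0.5):
    return [label for label, prob in emotion_probabilities.items() if prob >= threshold]

# With the example output above, only 'Joy' clears the default threshold:
print(predict_emotions({"Joy": 0.824, "Optimism": 0.053, "Anger": 0.017}))  # ['Joy']
```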