---
license: apache-2.0
language:
- km
metrics:
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: text-classification
library_name: transformers
tags:
- sentimental
---

# Multi-Label Emotion Classification with XLM-RoBERTa

This repository provides an implementation of fine-tuning the `songhieng/khmer-xlmr-base-sentimental-multi-label` model for multi-label emotion classification in Khmer text. The model is trained with the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, together with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

## Overview

The task is to predict multiple emotions (anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:

- Data preparation and splitting (90% training, 10% validation)
- Tokenization using the fast XLM-RoBERTa tokenizer
- A custom data collator that leverages the tokenizer's efficient padding method
- Fine-tuning an XLM-RoBERTa model for multi-label classification
- Computing a custom multi-label subset accuracy metric during evaluation

### Dataset

- **Total Data Size**: 24,969 samples
- **Train-Test Split**: 90% training, 10% validation
- **Model Accuracy (3 epochs)**: 72.12%

## Requirements

Ensure the required dependencies are installed:

```bash
pip install torch transformers datasets evaluate scikit-learn
```

## Data Format

The expected input data is a CSV file with the following columns:

| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
| ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |

- **text_khm**: The Khmer text input
- **Emotion columns**: One binary column (0/1) per emotion label

## Training the Model

1. **Data Preparation:**
   - Load the dataset.
   - Select the relevant columns (`"text_khm"` and the emotion labels).
   - Split into training (90%) and validation (10%) sets.
   - Convert the splits into the Hugging Face `Dataset` format.
2. **Tokenization:**
   - Use the fast XLM-RoBERTa tokenizer with padding and truncation.
3. **Model Setup:**
   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
   - Convert labels to `float` for `BCEWithLogitsLoss`.
4. **Custom Data Collator:**
   - Use the built-in `DataCollatorWithPadding` for efficient batching.
5. **Training and Evaluation:**
   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
   - Implement a custom `compute_metrics` function for multi-label subset accuracy.
   - Train the model using the `Trainer` class from Hugging Face (see the sketch below).
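The steps above can be wired together roughly as in the following sketch. This is a minimal illustration rather than the exact training script: the CSV filename (`data.csv`), the output directory, the hyperparameter values, and the 0.5 decision threshold inside `compute_metrics` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

LABELS = ["anger", "anticipation", "disgust", "fear",
          "joy", "optimism", "sadness", "surprise"]

# 1. Data preparation: keep the text column and the binary label columns,
#    then split 90/10 into training and validation sets.
#    "data.csv" is a placeholder for your dataset file.
df = pd.read_csv("data.csv")[["text_khm"] + LABELS]
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

# 2. Tokenization with the fast XLM-RoBERTa tokenizer.
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess(batch):
    enc = tokenizer(batch["text_khm"], truncation=True)
    # 3. Labels must be floats so BCEWithLogitsLoss accepts them.
    enc["labels"] = [
        [float(batch[label][i]) for label in LABELS]
        for i in range(len(batch["text_khm"]))
    ]
    return enc

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

# Model setup for multi-label classification over the 8 emotion labels.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",
)

# 4. Dynamic padding via the built-in data collator.
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# 5. Subset accuracy: a sample counts as correct only if every label matches.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid over the logits
    preds = (probs >= 0.5).astype(int)      # 0.5 decision threshold (assumption)
    return {"accuracy": float((preds == labels).all(axis=1).mean())}

args = TrainingArguments(
    output_dir="khmer-emotion-multilabel",  # illustrative output directory
    learning_rate=2e-5,                     # illustrative hyperparameters
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
```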
## Testing the Model

To test the trained model on new Khmer text, use the following script:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example Khmer text for emotion classification
text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits).squeeze().numpy()

# Define emotion labels
emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

# Map each emotion to its predicted probability
emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

# Print results
print(emotion_probabilities)
```

### Example Output

```json
{
  "Anger": 0.016747648,
  "Anticipation": 0.051519673,
  "Disgust": 0.01696622,
  "Fear": 0.0047147004,
  "Joy": 0.82434595,
  "Optimism": 0.052789055,
  "Sadness": 0.026356682,
  "Surprise": 0.0024202482
}
```

The model assigns a high probability to 'Joy' (about 0.82) for this input, with 'Optimism' a distant second and all other emotions close to zero.

## Customization

- **Thresholding:** Adjust the probability threshold (default: 0.5) to determine which emotions are considered present (see the appendix at the end of this README).
- **Fine-tuning Parameters:** Modify hyperparameters such as the learning rate, batch size, and number of epochs in the training script.
- **Alternative Models:** Swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.

## Troubleshooting

### KeyError: 'text'

This error occurs when the text field is missing after tokenization. Ensure that tokenization is applied correctly before passing the dataset to the trainer.

### ValueError in Metric Computation

Multi-label targets differ in shape from single-label targets, so use a custom `compute_metrics` function to calculate subset accuracy (see the training sketch above).

## License

This project is released under the Apache 2.0 license and is provided for educational and research purposes.

---

This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.
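## Appendix: Applying a Threshold

As noted under Customization, discrete emotion predictions are obtained by thresholding the per-emotion probabilities. The snippet below is a minimal illustration of that step; the `predict_emotions` helper and its 0.5 default are assumptions for demonstration, not part of the released code.

```python
# Turn the probability dictionary from the inference script into a list of
# predicted emotions. The 0.5 default mirrors the threshold mentioned above.
def predict_emotions(emotion_probabilities, threshold=0.5):
    return [label for label, prob in emotion_probabilities.items() if prob >= threshold]

# With the example output above, only 'Joy' clears the default threshold:
print(predict_emotions({"Joy": 0.824, "Optimism": 0.053, "Anger": 0.017}))  # ['Joy']
```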