songhieng/khmer-xlmr-base-sentimental-multi-label

@@ -1,30 +1,37 @@
-This model achieved around 64.728% accuracy.
 ---
 # Multi-Label Emotion Classification with XLM-RoBERTa
-This repository provides an example of fine-tuning an XLM-RoBERTa model for multi-label emotion classification on Khmer text. The model is trained using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library along with the [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate) libraries.
 ## Overview
-The task involves predicting multiple emotion labels (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
 - Data preparation and splitting (90% training, 10% validation)
 - Tokenization using the fast XLM-RoBERTa tokenizer
 - A custom data collator that leverages the tokenizer’s efficient padding method
 - Fine-tuning an XLM-RoBERTa model for multi-label classification
 - Computing a custom multi-label subset accuracy metric during evaluation
-## Requirements
-- Python 3.7+
-- [PyTorch](https://pytorch.org/) (tested with 1.8+)
-- [Transformers](https://github.com/huggingface/transformers)
-- [Datasets](https://github.com/huggingface/datasets)
-- [Evaluate](https://github.com/huggingface/evaluate)
-- [scikit-learn](https://scikit-learn.org/)
-You can install the required packages with:
 ```bash
 pip install torch transformers datasets evaluate scikit-learn
@@ -32,74 +39,108 @@ pip install torch transformers datasets evaluate scikit-learn
 ## Data Format
-The expected input data is a CSV file with columns similar to:
 | Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
 |------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
 | ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |
-- **text_khm**: The Khmer text to classify.
-- **Emotion columns**: One column per emotion label with binary values.
-In the provided example, the `"text_khm"` column is used as the input, and the emotion columns are combined into a list to form the multi-label target.
-## Running the Code
-The main steps are as follows:
 1. **Data Preparation:**
-   - Load your CSV data (or use the sample data provided).
-   - Select the necessary columns (`"text_khm"` and emotion label columns).
-   - Split the data into training (90%) and validation (10%) sets.
-   - Convert the splits into a Hugging Face `Dataset` and map each sample to tokenize the text.
 2. **Tokenization:**
-   - Use the fast XLM-RoBERTa tokenizer with the `__call__` method for efficient encoding, padding, and truncation.
 3. **Model Setup:**
-   - Load the pre-trained XLM-RoBERTa model for multi-label classification.
-   - Convert labels to floats to be compatible with BCEWithLogitsLoss.
 4. **Custom Data Collator:**
-   - Use the built-in `DataCollatorWithPadding` to handle padded tokenized features.
-   - Manually add labels (as `torch.float`) back into the batch.
 5. **Training and Evaluation:**
-   - Define training arguments (including learning rate, batch sizes, number of epochs, etc.).
-   - Define a custom compute metrics function that calculates multi-label subset accuracy.
-   - Initialize the Hugging Face `Trainer` and fine-tune the model.
-### Example Command
-Assuming your code is in a file called `train.py`, you can run the training process with:
-```bash
-python train.py
 ```
-## Customizations
-- **Metric:**
-  The default accuracy metric has been replaced with a custom function that computes multi-label subset accuracy. This metric considers a sample correct only if all the labels match.
-- **Data Collator:**
-  A custom data collator leverages the fast tokenizer's `__call__` method for padding, ensuring efficient batch processing.
-- **Model and Tokenizer:**
-  The repository uses `xlm-roberta-base`. You can change this to any other pre-trained model that supports multi-label classification.
 ## Troubleshooting
-- **KeyError: 'text'**
-  If you encounter a KeyError regarding the `"text"` key in the custom data collator, ensure that the tokenization process has been correctly applied to your dataset. The code expects a `"text"` field in the original DataFrame before tokenization.
-- **ValueError in Metric Computation:**
-  Since multi-label targets differ from single-label ones, a custom compute_metrics function is provided to compute subset accuracy. Adjust thresholds or metric calculations as needed for your specific use case.
 ## License
-This project is provided for educational purposes. Please refer to the license file for details.
 ---
-Feel free to modify the README to better suit your specific project details or additional instructions.

 ---
+license: apache-2.0
+language:
+- km
+metrics:
+- accuracy
+base_model:
+- FacebookAI/xlm-roberta-base
+pipeline_tag: text-classification
+library_name: transformers
+tags:
+- sentimental
+---
 # Multi-Label Emotion Classification with XLM-RoBERTa
+This repository hosts `songhieng/khmer-xlmr-base-sentimental-multi-label`, an XLM-RoBERTa model fine-tuned for multi-label emotion classification of Khmer text, and describes how it was fine-tuned. Training uses the [Hugging Face Transformers](https://github.com/huggingface/transformers) library along with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).
 ## Overview
+The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
 - Data preparation and splitting (90% training, 10% validation)
 - Tokenization using the fast XLM-RoBERTa tokenizer
 - A custom data collator that leverages the tokenizer’s efficient padding method
 - Fine-tuning an XLM-RoBERTa model for multi-label classification
 - Computing a custom multi-label subset accuracy metric during evaluation
+### Dataset
+- **Total Data Size**: 24,969 samples
+- **Train-Validation Split**: 90% training, 10% validation
+- **Model Accuracy (3 epochs)**: 72.12%
+## Requirements
+Ensure you have the required dependencies installed:
 ```bash
 pip install torch transformers datasets evaluate scikit-learn
 ```
 ## Data Format
+The expected input data is a CSV file with columns structured as follows:
 | Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
 |------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
 | ...  | ...     | ...           | ...      | 0/1   | 0/1          | 0/1     | 0/1  | 0/1 | 0/1      | 0/1     | 0/1      |
+- **text_khm**: The Khmer text input
+- **Emotion columns**: One column per emotion label (binary values)
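As a rough illustration of this layout, the sketch below reads a CSV in the documented format using only the Python standard library, keeps `text_khm`, and combines the eight emotion columns into a multi-hot float vector before making a 90/10 split. The sample rows, the helper names `load_examples` and `split_examples`, the `"text"` key, and the fixed seed are all assumptions made for illustration; the actual training script may well use pandas or `datasets` instead.

```python
import csv
import io
import random

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]

# Hypothetical two-row sample in the documented column layout
SAMPLE_CSV = """Text,emotion,emotion_score,text_khm,anger,anticipation,disgust,fear,joy,optimism,sadness,surprise
placeholder a,joy,0.9,sample-khmer-a,0,0,0,0,1,1,0,0
placeholder b,anger,0.8,sample-khmer-b,1,0,1,0,0,0,0,0
"""

def load_examples(handle):
    """Return dicts holding the Khmer text and a multi-hot float label vector."""
    examples = []
    for row in csv.DictReader(handle):
        labels = [float(row[name]) for name in EMOTIONS]
        examples.append({"text": row["text_khm"], "labels": labels})
    return examples

def split_examples(examples, val_fraction=0.1, seed=42):
    """Shuffle a copy of the data and split it into (train, validation)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

examples = load_examples(io.StringIO(SAMPLE_CSV))
train, val = split_examples(examples)
print(examples[0]["labels"])
```

Storing the labels as floats at this stage avoids a separate conversion later, since the loss used for multi-label training expects float targets.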
+## Training the Model
 1. **Data Preparation:**
+   - Load the dataset.
+   - Select relevant columns (`"text_khm"` and emotion labels).
+   - Split into training (90%) and validation (10%) sets.
+   - Convert the dataset into a Hugging Face `Dataset` format.
 2. **Tokenization:**
+   - Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.
 3. **Model Setup:**
+   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
+   - Convert labels to `float` for BCEWithLogitsLoss.
 4. **Custom Data Collator:**
+   - Use the built-in `DataCollatorWithPadding` for efficient batching.
 5. **Training and Evaluation:**
+   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
+   - Implement a custom compute metrics function for multi-label subset accuracy.
+   - Train the model using the `Trainer` class from Hugging Face.
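Step 3 converts labels to floats because `BCEWithLogitsLoss` compares each logit with a 0.0/1.0 target, treating every emotion as an independent binary decision. The NumPy sketch below writes out that loss in its standard numerically stable form, purely to show what the float labels feed into. `bce_with_logits` is an illustrative helper and the logits are made up; in practice the PyTorch loss would be used directly.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (sample, label) pair."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    per_element = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(per_element.mean())

# One sample with eight emotion logits; joy and optimism are the true labels
logits = [[-4.0, -3.0, -4.0, -5.0, 2.0, 1.0, -3.0, -6.0]]
labels = [[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]]

confident_loss = bce_with_logits(logits, labels)
uninformed_loss = bce_with_logits([[0.0] * 8], labels)  # all-zero logits give log(2)
print(confident_loss, uninformed_loss)
```

Logits that agree with the labels give a loss well below the log(2) baseline of an uninformed model, which is what training pushes toward.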
+## Testing the Model
+To test the trained model on new Khmer text, use the following script:
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Load the tokenizer and the fine-tuned model from the Hugging Face Hub
+model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+# Example Khmer text (roughly: "The company's revenue announcement shows a large increase")
+text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"
+
+# Tokenize the input
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+# Run inference; a per-label sigmoid (not a softmax) yields independent probabilities
+with torch.no_grad():
+    logits = model(**inputs).logits
+probabilities = torch.sigmoid(logits).squeeze(0).tolist()
+
+# Emotion labels, assumed to follow the order of the model's output logits
+emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']
+
+# Map each emotion to its probability and print the result
+emotion_probabilities = dict(zip(emotion_labels, probabilities))
+print(emotion_probabilities)
 ```
+### Example Output
+```json
+{
+    "Anger": 0.016747648,
+    "Anticipation": 0.051519673,
+    "Disgust": 0.01696622,
+    "Fear": 0.0047147004,
+    "Joy": 0.82434595,
+    "Optimism": 0.052789055,
+    "Sadness": 0.026356682,
+    "Surprise": 0.0024202482
+}
+```
+In this output, only 'Joy' (about 0.82) exceeds a 0.5 threshold. Every other emotion, including 'Optimism' (about 0.05), scores below 0.06.
+## Customization
+- **Thresholding:** The inference script returns raw probabilities. To decide which emotions count as present, apply a probability threshold (default: 0.5) and adjust it as needed.
+- **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
+- **Alternative Models:** You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
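To make the thresholding point concrete, the snippet below turns a probability dictionary into a list of predicted emotions. It reuses the values from the example output, rounded to four decimals. `select_emotions` is a hypothetical helper name, and the 0.5 cutoff is only a starting point.

```python
def select_emotions(probabilities, threshold=0.5):
    """Return the labels whose probability meets the threshold, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [label for label, prob in ranked if prob >= threshold]

# Probabilities copied (rounded) from the example output
example = {
    "Anger": 0.0167, "Anticipation": 0.0515, "Disgust": 0.0170, "Fear": 0.0047,
    "Joy": 0.8243, "Optimism": 0.0528, "Sadness": 0.0264, "Surprise": 0.0024,
}

print(select_emotions(example))                  # ['Joy']
print(select_emotions(example, threshold=0.05))  # ['Joy', 'Optimism', 'Anticipation']
```

Lowering the threshold trades precision for recall, so a value tuned on the validation split is usually preferable to a fixed guess.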
 ## Troubleshooting
+### KeyError: 'text'
+This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.
+### ValueError in Metric Computation
+Multi-label targets are multi-hot vectors rather than single class indices, so the default single-label accuracy metric does not apply to them. Use a custom `compute_metrics` function that calculates subset accuracy instead.
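A minimal sketch of such a function is given below, assuming the evaluation loop passes a `(logits, labels)` pair of NumPy arrays, as the Hugging Face `Trainer` does. The `threshold` parameter and the `"accuracy"` key are assumptions; the original training script may differ in both.

```python
import numpy as np

def compute_metrics(eval_pred, threshold=0.5):
    """Subset accuracy: a sample counts as correct only if every label matches."""
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    preds = (probs >= threshold).astype(int)
    labels = np.asarray(labels).astype(int)
    exact_match = (preds == labels).all(axis=1)
    return {"accuracy": float(exact_match.mean())}

# Two samples, three labels: the first matches exactly, the second has one wrong label
logits = np.array([[2.0, -1.0, -3.0], [0.5, 1.5, -2.0]])
labels = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(compute_metrics((logits, labels)))  # {'accuracy': 0.5}
```

Subset accuracy is strict: one wrong label out of eight makes the whole sample incorrect, so it generally reads lower than per-label accuracy on the same predictions.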
 ## License
+This project is provided for educational and research purposes. Please refer to the license file for details.
 ---
+This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.