	---
	license: apache-2.0
	language:
	- km
	metrics:
	- accuracy
	base_model:
	- FacebookAI/xlm-roberta-base
	pipeline_tag: text-classification
	library_name: transformers
	tags:
	- sentiment
	---
	# Multi-Label Emotion Classification with XLM-RoBERTa

	This repository documents how `songhieng/khmer-xlmr-base-sentimental-multi-label`, an XLM-RoBERTa model, was fine-tuned for multi-label emotion classification of Khmer text. Training uses the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, along with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

	## Overview

	The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
	- Data preparation and splitting (90% training, 10% validation)
	- Tokenization using the fast XLM-RoBERTa tokenizer
	- A custom data collator that leverages the tokenizer’s efficient padding method
	- Fine-tuning an XLM-RoBERTa model for multi-label classification
	- Computing a custom multi-label subset accuracy metric during evaluation

	### Dataset
	- Total Data Size: 24,969 samples
	- Train-Validation Split: 90% training, 10% validation
	- Model Accuracy (3 epochs): 72.12%

	## Requirements

	Ensure you have the required dependencies installed:

	```bash
	pip install torch transformers datasets evaluate scikit-learn
	```

	## Data Format

	The expected input data is a CSV file with columns structured as follows:

	| Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
	|------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
	| ... | ... | ... | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |

	- text_khm: The Khmer text input
	- Emotion columns: One column per emotion label (binary values)
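
As a rough illustration of this layout, the sketch below reads such a CSV with Python's standard `csv` module, checks that the expected columns are present, and turns each row into a Khmer text plus an eight-element 0/1 label vector. The helper name `read_examples` and the inline sample row are hypothetical and exist only for demonstration; the column names are taken from the table above.

```python
import csv
import io

# Emotion columns, in the order shown in the table above.
LABELS = ["anger", "anticipation", "disgust", "fear",
          "joy", "optimism", "sadness", "surprise"]


def read_examples(handle):
    """Return (texts, targets) from an open CSV file in the format above.

    Hypothetical helper: `texts` holds the `text_khm` strings and `targets`
    holds one list of eight floats (0.0 or 1.0) per row.
    """
    reader = csv.DictReader(handle)
    missing = {"text_khm", *LABELS} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")

    texts, targets = [], []
    for row in reader:
        flags = [int(row[name]) for name in LABELS]
        if any(flag not in (0, 1) for flag in flags):
            raise ValueError(f"Non-binary label in row: {row['text_khm']!r}")
        texts.append(row["text_khm"])
        targets.append([float(flag) for flag in flags])
    return texts, targets


# Made-up sample row, used only to exercise the helper.
sample = io.StringIO(
    "Text,emotion,emotion_score,text_khm," + ",".join(LABELS) + "\n"
    "I am happy,joy,0.9,ខ្ញុំសប្បាយចិត្ត,0,0,0,0,1,1,0,0\n"
)
texts, targets = read_examples(sample)
print(texts, targets)
```

For a real file, pass an open file object instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever your CSV lives.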

	## Training the Model

	1. Data Preparation:
	   - Load the dataset.
	   - Select relevant columns (`"text_khm"` and emotion labels).
	   - Split into training (90%) and validation (10%) sets.
	   - Convert the dataset into a Hugging Face `Dataset` format.

	2. Tokenization:
	   - Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.

	3. Model Setup:
	   - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
	   - Convert labels to `float` for BCEWithLogitsLoss.

	4. Custom Data Collator:
	   - Use the built-in `DataCollatorWithPadding` for efficient batching.

	5. Training and Evaluation:
	   - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
	   - Implement a custom compute metrics function for multi-label subset accuracy.
	   - Train the model using the `Trainer` class from Hugging Face.
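
The steps above could be wired together roughly as follows. This is a minimal sketch, not the original training script: it assumes training starts from `FacebookAI/xlm-roberta-base` (the `base_model` listed in the metadata), and the learning rate, batch sizes, `max_length`, seed, and output directory are placeholder values. Only the 90/10 split and the 3 epochs come from this README. The function names `float_labels` and `train` are hypothetical, `pandas` is assumed to be available (it is installed alongside `datasets`), and metric computation is left out to keep the sketch short.

```python
LABELS = ["anger", "anticipation", "disgust", "fear",
          "joy", "optimism", "sadness", "surprise"]


def float_labels(row):
    """Build the float multi-hot vector that BCEWithLogitsLoss expects."""
    return [float(row[name]) for name in LABELS]


def train(csv_path, output_dir="khmer-emotion-xlmr"):
    # Heavy dependencies are imported here so that `float_labels` stays
    # usable on its own.
    import pandas as pd
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, DataCollatorWithPadding,
                              Trainer, TrainingArguments)

    checkpoint = "FacebookAI/xlm-roberta-base"  # assumed starting checkpoint

    # 1. Data preparation: keep the Khmer text and label columns, split 90/10.
    frame = pd.read_csv(csv_path)[["text_khm", *LABELS]]
    dataset = Dataset.from_pandas(frame, preserve_index=False)
    splits = dataset.train_test_split(test_size=0.1, seed=42)

    # 2. Tokenization: truncate here, pad later per batch in the collator.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def preprocess(row):
        encoded = tokenizer(row["text_khm"], truncation=True, max_length=128)
        encoded["labels"] = float_labels(row)
        return encoded

    splits = splits.map(preprocess,
                        remove_columns=splits["train"].column_names)

    # 3. Model setup: this problem type selects BCEWithLogitsLoss.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        problem_type="multi_label_classification",
    )

    # 4-5. Dynamic padding, training arguments, and the Trainer itself.
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,               # placeholder value
        per_device_train_batch_size=16,   # placeholder value
        per_device_eval_batch_size=16,    # placeholder value
        num_train_epochs=3,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=splits["train"],
        eval_dataset=splits["test"],
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    return trainer
```

Calling `train("emotions.csv")`, with a hypothetical file name, would run the whole pipeline. Padding is deferred to the collator so that each batch is padded only to its own longest sequence.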

	## Testing the Model

	To test the trained model on new Khmer text, use the following script:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Load tokenizer and model
	model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Example Khmer text for emotion classification
	text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"

	# Tokenize input
	inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

	# Perform inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)
	logits = outputs.logits

	# Sigmoid gives an independent probability per emotion (multi-label)
	probabilities = torch.sigmoid(logits).squeeze(0).tolist()

	# Define emotion labels
	emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']

	# Create a dictionary mapping emotions to probabilities
	emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}

	# Print results
	print(emotion_probabilities)
	```

	### Example Output

	```json
	{
	  "Anger": 0.016747648,
	  "Anticipation": 0.051519673,
	  "Disgust": 0.01696622,
	  "Fear": 0.0047147004,
	  "Joy": 0.82434595,
	  "Optimism": 0.052789055,
	  "Sadness": 0.026356682,
	  "Surprise": 0.0024202482
	}
	```

	This output indicates that the model assigns a high probability to 'Joy' (about 0.82) for the input text. Every other emotion, including 'Optimism' and 'Anticipation', scores below 0.06.

	## Customization

	- Thresholding: You can adjust the probability threshold (default: 0.5) to determine which emotions are considered present.
	- Fine-tuning Parameters: Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
	- Alternative Models: You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
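
To make the thresholding option concrete, here is a small sketch that turns a probability dictionary like the one in the example output into a list of detected emotions. The function name `active_emotions` is hypothetical, and the probabilities are copied from the example output above.

```python
def active_emotions(probabilities, threshold=0.5):
    """Return the labels whose probability reaches the threshold."""
    return [label for label, prob in probabilities.items()
            if prob >= threshold]


# Probabilities copied from the example output above.
example_output = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}

print(active_emotions(example_output))                  # ['Joy']
print(active_emotions(example_output, threshold=0.05))  # ['Anticipation', 'Joy', 'Optimism']
```

Lowering the threshold trades precision for recall: more emotions are reported, at the cost of more false positives.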

	## Troubleshooting

	### KeyError: 'text'
	This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.

	### ValueError in Metric Computation
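	Metrics written for single-label classification can fail on multi-label targets, where each example carries a vector of labels rather than a single class. Use a custom `compute_metrics` function that calculates subset accuracy instead.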
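
One possible shape for such a function is sketched below. It assumes the `Trainer` hands over raw logits and 0/1 labels as arrays, applies a sigmoid, thresholds at 0.5, and counts an example as correct only when all eight labels match. The `threshold` parameter and the `subset_accuracy` key are choices made for this sketch, not necessarily those of the original training script.

```python
import numpy as np


def compute_metrics(eval_pred, threshold=0.5):
    """Subset accuracy: share of examples whose whole label vector is right."""
    logits, labels = eval_pred

    # Independent per-label probabilities, as in the inference example.
    logits = np.asarray(logits, dtype=np.float64)
    probabilities = 1.0 / (1.0 + np.exp(-logits))

    predictions = (probabilities >= threshold).astype(int)
    targets = np.asarray(labels).astype(int)

    # An example counts only if every label matches its target.
    exact_match = (predictions == targets).all(axis=1)
    return {"subset_accuracy": float(exact_match.mean())}


# Tiny made-up batch: the third example has one wrong label.
demo_logits = [[2.0, -2.0], [-1.0, 3.0], [0.5, 0.5]]
demo_labels = [[1, 0], [0, 1], [1, 0]]
print(compute_metrics((demo_logits, demo_labels)))
```

The function can be passed to the `Trainer` through its `compute_metrics` argument. scikit-learn's `accuracy_score` computes the same quantity when given binary indicator matrices, so it can serve as a cross-check.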

	## License
	This project is released under the Apache 2.0 license, as declared in the metadata above, and is provided for educational and research purposes. Please refer to the license file for details.

	---

	This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.