Commit 9dcba95 by songhieng · verified · 1 Parent(s): bb32f9f

Update README.md

  1. README.md +92 -51
README.md CHANGED
@@ -1,30 +1,37 @@
- This model achieved around 64.728% accuracy.
-
  ---
-
  # Multi-Label Emotion Classification with XLM-RoBERTa

- This repository provides an example of fine-tuning an XLM-RoBERTa model for multi-label emotion classification on Khmer text. The model is trained using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library along with the [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate) libraries.

  ## Overview

- The task involves predicting multiple emotion labels (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
  - Data preparation and splitting (90% training, 10% validation)
  - Tokenization using the fast XLM-RoBERTa tokenizer
  - A custom data collator that leverages the tokenizer’s efficient padding method
  - Fine-tuning an XLM-RoBERTa model for multi-label classification
  - Computing a custom multi-label subset accuracy metric during evaluation

- ## Requirements

- - Python 3.7+
- - [PyTorch](https://pytorch.org/) (tested with 1.8+)
- - [Transformers](https://github.com/huggingface/transformers)
- - [Datasets](https://github.com/huggingface/datasets)
- - [Evaluate](https://github.com/huggingface/evaluate)
- - [scikit-learn](https://scikit-learn.org/)

- You can install the required packages with:

  ```bash
  pip install torch transformers datasets evaluate scikit-learn
@@ -32,74 +39,108 @@ pip install torch transformers datasets evaluate scikit-learn
  ## Data Format

- The expected input data is a CSV file with columns similar to:

  | Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
  |------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
  | ... | ... | ... | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |

- - **text_khm**: The Khmer text to classify.
- - **Emotion columns**: One column per emotion label with binary values.

- In the provided example, the `"text_khm"` column is used as the input, and the emotion columns are combined into a list to form the multi-label target.
-
- ## Running the Code
-
- The main steps are as follows:

  1. **Data Preparation:**
- - Load your CSV data (or use the sample data provided).
- - Select the necessary columns (`"text_khm"` and emotion label columns).
- - Split the data into training (90%) and validation (10%) sets.
- - Convert the splits into a Hugging Face `Dataset` and map each sample to tokenize the text.

  2. **Tokenization:**
- - Use the fast XLM-RoBERTa tokenizer with the `__call__` method for efficient encoding, padding, and truncation.

  3. **Model Setup:**
- - Load the pre-trained XLM-RoBERTa model for multi-label classification.
- - Convert labels to floats to be compatible with BCEWithLogitsLoss.

  4. **Custom Data Collator:**
- - Use the built-in `DataCollatorWithPadding` to handle padded tokenized features.
- - Manually add labels (as `torch.float`) back into the batch.

  5. **Training and Evaluation:**
- - Define training arguments (including learning rate, batch sizes, number of epochs, etc.).
- - Define a custom compute metrics function that calculates multi-label subset accuracy.
- - Initialize the Hugging Face `Trainer` and fine-tune the model.

- ### Example Command

- Assuming your code is in a file called `train.py`, you can run the training process with:

- ```bash
- python train.py
  ```

- ## Customizations

- - **Metric:**
- The default accuracy metric has been replaced with a custom function that computes multi-label subset accuracy. This metric considers a sample correct only if all the labels match.

- - **Data Collator:**
- A custom data collator leverages the fast tokenizer's `__call__` method for padding, ensuring efficient batch processing.

- - **Model and Tokenizer:**
- The repository uses `xlm-roberta-base`. You can change this to any other pre-trained model that supports multi-label classification.

  ## Troubleshooting

- - **KeyError: 'text'**
- If you encounter a KeyError regarding the `"text"` key in the custom data collator, ensure that the tokenization process has been correctly applied to your dataset. The code expects a `"text"` field in the original DataFrame before tokenization.

- - **ValueError in Metric Computation:**
- Since multi-label targets differ from single-label ones, a custom compute_metrics function is provided to compute subset accuracy. Adjust thresholds or metric calculations as needed for your specific use case.

  ## License
-
- This project is provided for educational purposes. Please refer to the license file for details.

  ---

- Feel free to modify the README to better suit your specific project details or additional instructions.
@@ -1,30 +1,37 @@
  ---
+ license: apache-2.0
+ language:
+ - km
+ metrics:
+ - accuracy
+ base_model:
+ - FacebookAI/xlm-roberta-base
+ pipeline_tag: text-classification
+ library_name: transformers
+ tags:
+ - sentimenal
+ ---
  # Multi-Label Emotion Classification with XLM-RoBERTa

+ This repository provides `songhieng/khmer-xlmr-base-sentimental-multi-label`, an XLM-RoBERTa model fine-tuned for multi-label emotion classification of Khmer text. The model was trained using the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, along with [datasets](https://github.com/huggingface/datasets) and [evaluate](https://github.com/huggingface/evaluate).

  ## Overview

+ The task involves predicting multiple emotions (e.g., anger, anticipation, disgust, fear, joy, optimism, sadness, surprise) for a given piece of Khmer text. This implementation demonstrates:
  - Data preparation and splitting (90% training, 10% validation)
  - Tokenization using the fast XLM-RoBERTa tokenizer
  - A custom data collator that leverages the tokenizer’s efficient padding method
  - Fine-tuning an XLM-RoBERTa model for multi-label classification
  - Computing a custom multi-label subset accuracy metric during evaluation

+ ### Dataset
+ - **Total Data Size**: 24,969 samples
+ - **Train-Validation Split**: 90% training, 10% validation
+ - **Model Accuracy (3 epochs)**: 72.12%

+ ## Requirements

+ Ensure you have the required dependencies installed:

  ```bash
  pip install torch transformers datasets evaluate scikit-learn
 
@@ -32,74 +39,108 @@ pip install torch transformers datasets evaluate scikit-learn

  ## Data Format

+ The expected input data is a CSV file with columns structured as follows:

  | Text | emotion | emotion_score | text_khm | anger | anticipation | disgust | fear | joy | optimism | sadness | surprise |
  |------|---------|---------------|----------|-------|--------------|---------|------|-----|----------|---------|----------|
  | ... | ... | ... | ... | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |

+ - **text_khm**: The Khmer text input
+ - **Emotion columns**: One column per emotion label (binary values)
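
As a rough illustration of the layout above, the following standard-library sketch reads rows in this format and turns the eight emotion columns into a multi-hot label list of floats. The sample rows, the `SAMPLE_CSV` string, and the `load_examples` helper are made up for illustration and are not part of the repository's training script.

```python
import csv
import io

# Emotion columns named in the Data Format table, in table order.
EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "optimism", "sadness", "surprise"]

# Hypothetical two-row sample in the documented layout (not real data).
SAMPLE_CSV = """Text,emotion,emotion_score,text_khm,anger,anticipation,disgust,fear,joy,optimism,sadness,surprise
placeholder A,joy,0.9,khmer text A,0,0,0,0,1,1,0,0
placeholder B,anger,0.8,khmer text B,1,0,1,0,0,0,0,0
"""


def load_examples(handle):
    """Yield (text, labels) pairs; labels is a list of 0.0/1.0 floats."""
    for row in csv.DictReader(handle):
        # Floats are used because the README pairs the labels with BCEWithLogitsLoss.
        labels = [float(row[name]) for name in EMOTIONS]
        yield row["text_khm"], labels


examples = list(load_examples(io.StringIO(SAMPLE_CSV)))
print(examples[0])
```

For a real file, the same helper could be given an open file handle instead of the in-memory string.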

+ ## Training the Model

  1. **Data Preparation:**
+ - Load the dataset.
+ - Select relevant columns (`"text_khm"` and emotion labels).
+ - Split into training (90%) and validation (10%) sets.
+ - Convert the dataset into a Hugging Face `Dataset` format.

  2. **Tokenization:**
+ - Utilize the fast XLM-RoBERTa tokenizer with padding and truncation.

  3. **Model Setup:**
+ - Load the pre-trained `songhieng/khmer-xlmr-base-sentimental-multi-label` model.
+ - Convert labels to `float` for `BCEWithLogitsLoss`.

  4. **Custom Data Collator:**
+ - Use the built-in `DataCollatorWithPadding` for efficient batching.

  5. **Training and Evaluation:**
+ - Define training arguments (learning rate, batch sizes, number of epochs, etc.).
+ - Implement a custom compute metrics function for multi-label subset accuracy.
+ - Train the model using the `Trainer` class from Hugging Face.
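
Step 5 mentions a subset accuracy metric. One way to read that metric is sketched here in plain Python, under the assumption that each logit is passed through a sigmoid and cut at 0.5. The function names and the sample numbers are hypothetical, and the repository's actual `compute_metrics` may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def subset_accuracy(logits, labels, threshold=0.5):
    """Fraction of samples whose thresholded predictions match every label."""
    exact = 0
    for sample_logits, sample_labels in zip(logits, labels):
        predicted = [1 if sigmoid(z) >= threshold else 0 for z in sample_logits]
        # A sample counts only if all of its labels are predicted correctly.
        if predicted == [int(v) for v in sample_labels]:
            exact += 1
    return exact / len(labels)


# Hypothetical logits and float targets: three samples, three labels each.
logits = [[2.0, -1.5, 0.3], [-0.7, 1.2, -2.0], [0.1, 0.4, -0.9]]
labels = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(subset_accuracy(logits, labels))  # 2 of 3 samples match exactly
```

Because one wrong label makes the whole sample count as wrong, subset accuracy is a strict metric and usually reads lower than per-label accuracy on the same predictions.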

+ ## Testing the Model

+ To test the trained model on new Khmer text, use the following script:

+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load tokenizer and model
+ model_name = "songhieng/khmer-xlmr-base-sentimental-multi-label"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Example Khmer text for sentiment analysis
+ text = "ការប្រកាសចំណូលរបស់ក្រុមហ៊ុនមានការកើនឡើងយ៉ាងច្រើន"
+
+ # Tokenize input
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+
+ # Perform inference
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+     # Sigmoid (not softmax) gives an independent probability per emotion
+     probabilities = torch.sigmoid(logits).squeeze().numpy()
+
+ # Define emotion labels
+ emotion_labels = ['Anger', 'Anticipation', 'Disgust', 'Fear', 'Joy', 'Optimism', 'Sadness', 'Surprise']
+
+ # Create a dictionary mapping emotions to probabilities
+ emotion_probabilities = {label: prob for label, prob in zip(emotion_labels, probabilities)}
+
+ # Print results
+ print(emotion_probabilities)
  ```

+ ### Example Output
+
+ ```json
+ {
+     "Anger": 0.016747648,
+     "Anticipation": 0.051519673,
+     "Disgust": 0.01696622,
+     "Fear": 0.0047147004,
+     "Joy": 0.82434595,
+     "Optimism": 0.052789055,
+     "Sadness": 0.026356682,
+     "Surprise": 0.0024202482
+ }
+ ```

+ This output indicates that the model assigned a high probability to 'Joy' (about 0.82). 'Optimism' and 'Anticipation' are the next highest at roughly 0.05 each, well below a 0.5 threshold.

+ ## Customization

+ - **Thresholding:** Adjust the probability threshold (default: 0.5) to decide which emotions are considered present.
+ - **Fine-tuning Parameters:** Modify hyperparameters such as learning rate, batch size, and number of epochs in the training script.
+ - **Alternative Models:** You can swap `songhieng/khmer-xlmr-base-sentimental-multi-label` for another Khmer-language model.
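
A minimal sketch of the thresholding idea, using the probabilities from the Example Output. `select_emotions` is a hypothetical helper written for this illustration, not something shipped with the model.

```python
# Probabilities copied from the Example Output section.
emotion_probabilities = {
    "Anger": 0.016747648,
    "Anticipation": 0.051519673,
    "Disgust": 0.01696622,
    "Fear": 0.0047147004,
    "Joy": 0.82434595,
    "Optimism": 0.052789055,
    "Sadness": 0.026356682,
    "Surprise": 0.0024202482,
}


def select_emotions(probabilities, threshold=0.5):
    """Return labels whose probability meets the threshold, highest first."""
    chosen = [(label, p) for label, p in probabilities.items() if p >= threshold]
    chosen.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in chosen]


print(select_emotions(emotion_probabilities))        # ['Joy']
print(select_emotions(emotion_probabilities, 0.05))  # ['Joy', 'Optimism', 'Anticipation']
```

Lowering the threshold trades precision for recall: more emotions are reported, including weaker ones.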

  ## Troubleshooting

+ ### KeyError: 'text'
+ This error occurs when the text field is missing after tokenization. Ensure that tokenization is correctly applied before passing the dataset to the trainer.

+ ### ValueError in Metric Computation
+ Since multi-label targets differ from single-label classification, use a custom `compute_metrics` function to calculate subset accuracy.

  ## License
+ This project is provided for educational and research purposes. Please refer to the license file for details.

  ---

+ This README provides an overview of the dataset, model training, evaluation, and inference with the fine-tuned Khmer multi-label sentiment model.