Commit c1bd0f3 by syke9p3 • 1 Parent(s): 0744282

Update README.md

  1. README.md +138 -1
README.md CHANGED
@@ -12,4 +12,141 @@ pipeline_tag: text-classification
  tags:
  - nlp
  - language
- ---
+ ---
+
+
+ # Multilabel Classification of Tagalog Hate Speech using Bidirectional Encoder Representations from Transformers (BERT)
+
+ This repository contains the source files for the thesis **Multilabel Classification of Tagalog Hate Speech using Bidirectional Encoder Representations from Transformers (BERT)** at the Polytechnic University of the Philippines. The model classifies hate speech into one or more categories: Age, Gender, Physical, Race, Religion, and Others.
+
+ Hate speech encompasses expressions and behaviors that promote hatred, discrimination, prejudice, or violence against individuals or groups based on specific attributes. Its consequences range from physical harm to psychological distress, making it a critical issue in today's society.
+
+ Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained deep learning model that uses a transformer architecture to generate word embeddings capturing both left and right context, and it can be fine-tuned for various natural language processing tasks. For this project, we fine-tuned [Jiang et al.'s pre-trained BERT Tagalog Base Uncased model](https://huggingface.co/GKLMIP/bert-tagalog-base-uncased) for the task of multilabel hate speech classification.
+
+ ## 👥 Proponents
+ - Saya-ang, Kenth G. ([@syke9p3](https://github.com/syke9p3))
+ - Gozum, Denise Julianne S. ([@Xenoxianne](https://github.com/Xenoxianne))
+ - Hamor, Mary Grizelle D. ([@mnemoria](https://github.com/mnemoria))
+ - Mabansag, Ria Karen B. ([@riavx](https://github.com/riavx))
+
+ ## 📋 About the Thesis
+
+ ### 📄 Abstract
+ Hate speech promotes hatred, discrimination, prejudice, or violence against individuals or groups based on specific attributes, leading to physical and psychological harm. This study addresses the prevalence of hate speech on social media by proposing a Tagalog hate speech multilabel classification model. Using a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model, the study classifies hate speech into categories such as Age, Gender, Physical, Race, Religion, and Others. Trained and evaluated on 2,116 manually annotated social media posts from Facebook, Reddit, and Twitter, the model achieved varying precision, recall, and F-measure scores across categories, with an overall Hamming loss of 3.79%.
+
+ ### 🔠 Keywords
+ *Bidirectional Encoder Representations from Transformers; Hate Speech; Multilabel Classification; Social Media; Tagalog; Polytechnic University of the Philippines; Bachelor of Science in Computer Science*
+
+ ### 💻 Languages and Technologies
+
+ #### Model
+
+ [![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/)
+ [![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)](https://pytorch.org/)
+ [![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-F37626?style=for-the-badge&logo=jupyter&logoColor=white)](https://jupyter.org/)
+ [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/)
+ [![Pandas](https://img.shields.io/badge/Pandas-150458?style=for-the-badge&logo=pandas&logoColor=white)](https://pandas.pydata.org/)
+ [![NumPy](https://img.shields.io/badge/Numpy-013243?style=for-the-badge&logo=numpy&logoColor=white)](https://numpy.org/)
+ [![scikit-learn](https://img.shields.io/badge/ScikitLearn-F7931E?style=for-the-badge&logo=scikitlearn&logoColor=white)](https://scikit-learn.org/)
+
+
+ #### User Interface
+
+ [![HTML5](https://img.shields.io/badge/HTML5-E34F26?style=for-the-badge&logo=html5&logoColor=white)](https://en.wikipedia.org/wiki/HTML5)
+ [![CSS3](https://img.shields.io/badge/CSS3-1572B6?style=for-the-badge&logo=css3&logoColor=white)](https://en.wikipedia.org/wiki/CSS)
+ [![JavaScript](https://img.shields.io/badge/JavaScript-F7DF1E?style=for-the-badge&logo=javascript&logoColor=black)](https://en.wikipedia.org/wiki/JavaScript)
+ [![Flask](https://img.shields.io/badge/Flask-000000?style=for-the-badge&logo=flask&logoColor=white)](https://flask.palletsprojects.com/en/3.0.x/)
+
+ ### 🖼 Screenshots
+
+ <p align="center">
+ <img src="./Screenshot1.jpg"/>
+ <img src="./Screenshot2.jpg"/>
+ <img src="./Screenshot3.jpg"/>
+ </p>
+
+
+ ### 🎨 Labels
+
+ **Multilabel classification** is the task of assigning one or more relevant labels to each text, so a single text can belong to several categories at once: Age, Gender, Physical, Race, Religion, or Others.
+
+ | Label | Description |
+ |--------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
+ | ![Age](https://img.shields.io/badge/Age-FE5555) | Target of hate speech pertains to one's age bracket or demographic |
+ | ![Gender](https://img.shields.io/badge/Gender-F09F2D) | Target of hate speech pertains to gender identity, sex, or sexual orientation |
+ | ![Physical](https://img.shields.io/badge/Physical-FFCC00) | Target of hate speech pertains to physical attributes or disability |
+ | ![Race](https://img.shields.io/badge/Race-2BCE9A) | Target of hate speech pertains to racial background, ethnicity, or nationality |
+ | ![Religion](https://img.shields.io/badge/Religion-424BFC) | Target of hate speech pertains to affiliation with, belief in, or faith in any religious or non-religious group |
+ | ![Others](https://img.shields.io/badge/Others-65696C) | Target of hate speech pertains to any other topic not covered by Age, Gender, Physical, Race, or Religion |
+
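As a rough illustration of the multilabel setup described above, the sketch below encodes a set of labels as a multi-hot vector and decodes per-label scores back into label names. The label order, the `encode` and `decode` helpers, and the 0.5 threshold are assumptions made for this example and are not taken from the thesis code.

```python
# Label order is an assumption for illustration; the thesis code may differ.
LABELS = ["Age", "Gender", "Physical", "Race", "Religion", "Others"]


def encode(labels):
    """Turn a collection of label names into a multi-hot vector."""
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in LABELS]


def decode(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold.

    The 0.5 default is a hypothetical choice, not a value from the thesis.
    """
    return [name for name, score in zip(LABELS, scores) if score >= threshold]


# One post can carry several labels at once.
vector = encode({"Gender", "Race"})

# Hypothetical per-label scores, e.g. sigmoid outputs of a classifier head.
predicted = decode([0.02, 0.91, 0.10, 0.67, 0.05, 0.30])

print(vector)
print(predicted)
```

Because every label gets its own independent score, a post can be assigned several labels at once or only one.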
+ ### 📜 Dataset
+ The dataset consists of 2,116 social media posts scraped from Facebook, Reddit, and Twitter and manually annotated with labels. The data is split into three sets:
+
+ | Dataset | Number of Posts | Percentage |
+ |----------------|-----------------|------------|
+ | Training Set | 1,267 | 60% |
+ | Validation Set | 212 | 10% |
+ | Testing Set | 633 | 30% |
+
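A 60/10/30 split like the one in the table can be sketched as follows. The README does not say how the split was produced, so the random shuffle, the seed, and the `split_dataset` helper are assumptions for illustration only, and the posts are placeholder strings.

```python
import random


def split_dataset(posts, train=0.6, validation=0.1, seed=42):
    """Shuffle a copy of posts and cut it into train/validation/test parts.

    Whatever remains after the train and validation cuts becomes the test set.
    """
    shuffled = list(posts)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Placeholder data standing in for annotated posts.
posts = [f"post-{i}" for i in range(100)]
train_set, val_set, test_set = split_dataset(posts)
print(len(train_set), len(val_set), len(test_set))
```

Fixing the seed keeps the split reproducible between runs, which matters when results are reported on a held-out testing set.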
+ ### 🔒 Results
+
+ The testing set of 633 annotated hate speech posts was used to evaluate how well the model assigns each label, in terms of Precision, Recall, F-Measure, and overall Hamming loss.
+
+ | Label | Precision | Recall | F-Measure |
+ |----------|-----------|--------|-----------|
+ | Age | 97.12% | 90.18% | 93.52% |
+ | Gender | 93.23% | 94.66% | 93.94% |
+ | Physical | 92.23% | 71.43% | 80.51% |
+ | Race | 90.99% | 88.60% | 89.78% |
+ | Religion | 99.03% | 94.44% | 96.68% |
+ | Others | 83.74% | 85.12% | 84.43% |
+
+ **Overall Hamming Loss:** 3.79%
+
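The metrics in the table can be computed along the lines of the sketch below. The helper names and the toy multi-hot data are hypothetical. The last lines use the harmonic mean of precision and recall to reproduce the reported F-Measure for the Age row from its precision and recall, which suggests the F-Measure here is the F1 score.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_label_scores(y_true, y_pred, label_index):
    """Precision, recall, and F-measure for one label column."""
    pairs = [(t[label_index], p[label_index]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that were predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Toy multi-hot data with three labels per post (hypothetical values).
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 0, 1]]

print(per_label_scores(y_true, y_pred, 0))
print(hamming_loss(y_true, y_pred))

# Consistency check against the Age row: precision 97.12%, recall 90.18%.
age_f = round(f_measure(0.9712, 0.9018) * 100, 2)
print(age_f)
```

Hamming loss counts every wrong label slot, so a post with one wrong label out of six contributes one sixth of an error rather than a whole one.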
+ ## 🛠️ Installation
+
+ ### 📦 Clone with git-lfs
+ Since this repo contains large data files (>= 50MB), first download and install git-lfs, a Git extension for versioning large files, then run `git lfs install` in the console so that the repo can be fully cloned.
+
+ ### 🏃 How to run
+
+ #### Set up the model
+
+ - Clone the repository:
+ ```
+ git clone https://github.com/kenth9p3/mlthsc-thesis.git
+ ```
+ - Create a virtual environment:
+ ```
+ # Windows
+ python -m venv venv
+
+ # Linux
+ python3 -m venv venv
+ ```
+ - Activate the virtual environment:
+ ```
+ # Windows (Git Bash; in Command Prompt, run venv\Scripts\activate instead)
+ source venv/Scripts/activate
+
+ # Linux
+ source venv/bin/activate
+ ```
+ - Install dependencies:
+ ```
+ pip install -r requirements.txt
+ ```
+ - Run the app:
+ ```
+ python ./server.py
+ ```
+
+ #### Set up the user interface
+
+ - Open `index.html` in the browser
+
+ - Enter Tagalog hate speech in the text box or choose one of the examples
+
+ - Click Analyze
+
+ - Save the results
+