Offensive Language Detection For Turkish Language
Model Description
This model is a fine-tuned version of dbmdz/bert-base-turkish-128k-uncased, trained on the Turkish portion of the OffensEval 2020 dataset (offenseval-tr). The offenseval-tr training split contains 31,756 annotated tweets.
Dataset Distribution
Split | Non Offensive (0) | Offensive (1) |
---|---|---|
Train | 25625 | 6131 |
Test | 2812 | 716 |
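The class balance implied by the table can be checked with a few lines of Python. This is a minimal sketch that uses only the counts above; the `counts` dictionary and the `offensive_share` helper are names introduced here for illustration.

```python
# Tweet counts per split, copied from the table above: (non-offensive, offensive)
counts = {
    "train": (25625, 6131),
    "test": (2812, 716),
}

def offensive_share(non_offensive, offensive):
    """Return the fraction of tweets in a split that are labeled offensive."""
    return offensive / (non_offensive + offensive)

for split, (neg, pos) in counts.items():
    total = neg + pos
    print(f"{split}: {total} tweets, {offensive_share(neg, pos):.1%} offensive")
```

Roughly one tweet in five is offensive in both splits, which is worth keeping in mind when reading per-class metrics.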
Preprocessing Steps
Process | Description |
---|---|
Accented character transformation | Converting accented characters to their unaccented equivalents |
Lowercase transformation | Converting all text to lowercase |
Removing @user mentions | Removing @user formatted user mentions from text |
Removing hashtag expressions | Removing #hashtag formatted expressions from text |
Removing URLs | Removing URLs from text |
Removing punctuation and text emoticons | Removing punctuation marks and punctuation-based emoticons such as :) from text |
Removing emojis | Removing emojis from text |
Deasciification | Converting ASCII text into text containing Turkish characters |
The effect of each preprocessing step on model performance was analyzed. Removing digits and keeping hashtags had no effect, so the code below leaves digits in place and removes hashtags.
Usage
Install the necessary libraries:
pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing
The preprocessing functions are defined below:
from turkish.deasciifier import Deasciifier

def deasciifier(text):
    # Restore Turkish characters in ASCII-only text
    converter = Deasciifier(text)
    return converter.convert_to_turkish()

def remove_circumflex(text):
    # Map circumflexed vowels to their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }
    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Lowercase with Turkish rules: 'I' -> 'ı' and 'İ' -> 'i'
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
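The reason for a dedicated `turkish_lower` is that Python's built-in `str.lower()` follows the default Unicode case mapping, not the Turkish one. The short sketch below repeats the function so that it runs on its own and compares the two on a sample word chosen here for illustration.

```python
def turkish_lower(text):
    # Same mapping as the function above, repeated so this snippet runs on its own
    turkish_map = {'I': 'ı', 'İ': 'i', 'Ç': 'ç', 'Ş': 'ş', 'Ğ': 'ğ', 'Ü': 'ü', 'Ö': 'ö'}
    return ''.join(turkish_map.get(c, c).lower() for c in text)

word = "IŞIK"
print(word.lower())         # işik  (built-in lowercasing maps 'I' to dotted 'i')
print(turkish_lower(word))  # ışık  (Turkish-aware lowercasing maps 'I' to dotless 'ı')

# Built-in lowercasing turns 'İ' into 'i' plus a combining dot: two code points
print(len("İ".lower()), len(turkish_lower("İ")))  # 2 1
```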
Clean the text with the function below:
import re

def clean_text(text):
    # Replace circumflexed letters with their plain equivalents
    text = remove_circumflex(text)
    # Convert the text to lowercase using Turkish rules
    text = turkish_lower(text)
    # Restore Turkish characters in ASCII-only text
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation marks and text-based emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
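To see what the regex stages of `clean_text` do without installing the deasciifier package, the sketch below applies only those stages to a made-up tweet. `strip_tweet_noise` and `sample` are illustrative names, and the accent, lowercase, deasciification and emoji steps are deliberately left out, so this is not a replacement for `clean_text`.

```python
import re

def strip_tweet_noise(text):
    """Regex-only subset of clean_text, using the same patterns as above."""
    text = re.sub(r"@\S*", " ", text)                                # user mentions
    text = re.sub(r"#\S+", " ", text)                                # hashtags
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)             # URLs
    text = re.sub(r"[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))", " ", text)   # punctuation
    return re.sub(r"\s+", " ", text).strip()                         # extra spaces

# Made-up example tweet, not taken from the dataset
sample = "@USER bugün hava çok güzel! #istanbul https://example.com :)"
print(strip_tweet_noise(sample))  # bugün hava çok güzel
```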
Model Initialization
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
Check whether a sentence is offensive as follows:
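import numpy as np
import torch

# Run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def is_offensive(sentence):
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')
    test_sample = {k: v.to(device) for k, v in test_sample.items()}
    # Inference only, so gradients are not needed
    with torch.no_grad():
        output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)
    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]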
is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
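`is_offensive` returns only the predicted class. If a confidence score is also needed, the two logits can be turned into probabilities with a softmax. The sketch below uses plain Python and made-up logit values, not real model output; `softmax` and `example_logits` are illustrative names.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for [non-offensive, offensive]; not produced by the model
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)

# The predicted label is the index of the largest probability
label = max(range(len(probs)), key=probs.__getitem__)
print(label, [round(p, 3) for p in probs])
```

With the real model, the same probabilities should be obtainable from `torch.softmax(output.logits, dim=1)`.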
Evaluation
The evaluation results on the test set are shown in the table below. The model achieves 89% accuracy on the test set.
Model Performance Metrics
Class | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|
Class 0 | 0.92 | 0.94 | 0.93 | 0.89 |
Class 1 | 0.73 | 0.67 | 0.70 | |
Macro | 0.83 | 0.80 | 0.81 | |
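As a sanity check, the F1 scores and the accuracy can be approximately recomputed from the reported precision and recall values and the test-set class sizes. This is a rough sketch: the inputs are already rounded to two decimals, so the results only match the table to that precision. `f1`, `metrics` and `support` are names introduced here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class (precision, recall) from the table above
metrics = {0: (0.92, 0.94), 1: (0.73, 0.67)}
# Test-set class sizes from the dataset distribution table
support = {0: 2812, 1: 716}

for cls, (p, r) in metrics.items():
    print(f"class {cls}: F1 = {f1(p, r):.2f}")

# Accuracy equals the support-weighted average of the per-class recalls
correct = sum(metrics[c][1] * support[c] for c in support)
accuracy = correct / sum(support.values())
print(f"accuracy = {accuracy:.2f}")
```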