Sinhala-English Code-Mixed Hate Speech Detection Model
This model is fine-tuned to classify Sinhala-English code-mixed text as hate speech, offensive speech, or neither. It was trained on a custom dataset of social media comments and is intended for content moderation and analysis tasks.
Model URL
The model is hosted on Hugging Face under the repository ID GANgstersDev/singlish-hate-offensive-finetuned-model-v2.0.1.
Contact
For any questions or feedback, please reach out to R.M.D. Pabasara Rathnayake at [[email protected]].
Input format
The model expects a single string as input, such as a social media comment or other text written in Sinhala-English code-mixed language.
Output format
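The classify_text function shown in the usage code returns a dictionary with the input text, the predicted label, and a softmax score for each class. For example:
{
    "text": "mama campus loan ekak ganneh kohomada ballo",
    "predicted_label": "offensive",
    "scores": {
        "neither": 0.2815871834754944,
        "offensive": 0.645283043384552,
        "hate": 0.07312975078821182
    }
}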
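As a sketch of how a moderation pipeline might consume this output, the helper below flags a comment for review when the combined `offensive` and `hate` score reaches a threshold. The function name `needs_review` and the 0.5 default threshold are illustrative assumptions, not part of the model or its documentation.

```python
# Hypothetical helper: not part of the model or the transformers library.
def needs_review(result, threshold=0.5):
    """Return True when the combined offensive + hate score reaches the threshold."""
    scores = result["scores"]
    harmful = scores["offensive"] + scores["hate"]
    return harmful >= threshold


# Example output copied from the "Output format" section.
example = {
    "text": "mama campus loan ekak ganneh kohomada ballo",
    "predicted_label": "offensive",
    "scores": {
        "neither": 0.2815871834754944,
        "offensive": 0.645283043384552,
        "hate": 0.07312975078821182,
    },
}

print(needs_review(example))
```

Summing the two harmful classes is only one possible policy; a stricter setup could look at the `hate` score alone, and the threshold would need tuning on real data.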
Installation
To use this model, you need to install the transformers library from Hugging Face, along with PyTorch. You can install both using pip:
!pip install transformers
!pip install torch
# Import libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import huggingface_hub
import torch
import json

# Authenticate with your Hugging Face token
huggingface_hub.login(token='Your HF Access Token')

# Load the model and tokenizer
model_name = "GANgstersDev/singlish-hate-offensive-finetuned-model-v2.0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify_text(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Perform inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Get the predicted class
    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=1).item()
    # Map class ID to class label
    class_labels = ["neither", "offensive", "hate"]  # Adjust according to your labels
    predicted_class_label = class_labels[predicted_class_id]
    return predicted_class_label
# Define function to classify text and return JSON output
def classify_text(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Perform inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # Get the predicted class
    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=1).item()
    # Map class ID to class label
    class_labels = ["neither", "offensive", "hate"]  # Adjust according to your labels
    predicted_class_label = class_labels[predicted_class_id]
    # Get the category scores as probabilities
    category_scores = torch.softmax(logits, dim=1).numpy().flatten()
    # Create JSON-serializable output
    result = {
        "text": text,
        "predicted_label": predicted_class_label,
        "scores": {label: float(score) for label, score in zip(class_labels, category_scores)}
    }
    return result

# Example usage
text = "Singlish text goes here"
result = classify_text(text)

# Print the result as JSON
print(json.dumps(result, indent=4))
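The `scores` field comes from applying softmax to the model's logits, and the predicted label is the class with the highest score. The pure-Python sketch below reproduces that step without PyTorch so it can be inspected in isolation. The logit values are made up for illustration, and `scores_from_logits` is a hypothetical helper, not part of the model's code.

```python
import math

# Label order as used in classify_text in the usage code.
CLASS_LABELS = ["neither", "offensive", "hate"]


def scores_from_logits(logits):
    """Softmax over one row of logits, returned as a label -> probability dict."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(CLASS_LABELS, exps)}


# Made-up logits, for illustration only.
scores = scores_from_logits([0.1, 1.2, -0.9])

# The predicted label is the class with the largest probability,
# which matches taking the argmax of the raw logits.
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Because softmax preserves the ordering of the logits, the label chosen here always agrees with the `torch.argmax` call in `classify_text`.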