BERT-based classifier (fine-tuned from rubert-tiny2)

Merged datasets:

The merged datasets were split into train, validation, and test sets in an 80-10-10 proportion. The metrics obtained on the test set are as follows:

              precision    recall  f1-score   support
           0     0.9827    0.9827    0.9827     21216
           1     0.9272    0.9274    0.9273      5054
    accuracy                         0.9720     26270
   macro avg     0.9550    0.9550    0.9550     26270
weighted avg     0.9720    0.9720    0.9720     26270
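As a consistency check, the averaged rows of the report can be recomputed from the two per-class rows. The sketch below uses only the rounded values from the table, so its results may differ from the reported figures in the fourth decimal place; the variable and function names are illustrative.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
per_class = {
    0: (0.9827, 0.9827, 0.9827, 21216),
    1: (0.9272, 0.9274, 0.9273, 5054),
}

total_support = sum(row[3] for row in per_class.values())


def macro_avg(col):
    """Unweighted mean of one metric column over the classes."""
    return sum(row[col] for row in per_class.values()) / len(per_class)


def weighted_avg(col):
    """Mean of one metric column, weighted by each class's support."""
    return sum(row[col] * row[3] for row in per_class.values()) / total_support


macro = [macro_avg(col) for col in range(3)]        # precision, recall, f1
weighted = [weighted_avg(col) for col in range(3)]  # precision, recall, f1

# For single-label classification, accuracy equals the support-weighted recall.
accuracy = weighted[1]

print(total_support)
print([f"{value:.3f}" for value in macro])
print([f"{value:.3f}" for value in weighted], f"{accuracy:.3f}")
```

Because class 0 has roughly four times the support of class 1, the weighted averages sit close to the class 0 scores, while the macro averages give both classes equal weight.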

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

PATH = 'khvatov/ru_toxicity_detector'
tokenizer = AutoTokenizer.from_pretrained(PATH)
model = AutoModelForSequenceClassification.from_pretrained(PATH)

# To run on a GPU instead, uncomment the next two lines and remove the CPU line below.
# if torch.cuda.is_available():
#     model.cuda()

model.to(torch.device("cpu"))


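def get_toxicity_probs(text):
    """Return the class probabilities for a single text as a NumPy array."""
    with torch.no_grad():
        # Tokenize and move the input tensors to the same device as the model.
        inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(model.device)
        # Softmax over the class dimension turns logits into probabilities.
        proba = torch.nn.functional.softmax(model(**inputs).logits, dim=1).cpu().numpy()
    return proba[0]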


TEXT = "Марк был хороший"  # Russian for "Mark was good"
print(f'text = {TEXT}, probs={get_toxicity_probs(TEXT)}')
# text = Марк был хороший, probs=[0.9940585  0.00594147]
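The returned array holds one probability per class. A minimal sketch for turning it into a label follows, assuming index 1 is the toxic class; the example output suggests this, since a benign sentence scores about 0.994 at index 0, but the card does not state it explicitly. The `label_from_probs` helper and the 0.5 threshold are illustrative and not part of the model.

```python
def label_from_probs(probs, threshold=0.5):
    """Map [p(class 0), p(class 1)] to a label string.

    Assumption: index 1 is the toxic class. The threshold is a free choice
    and can be lowered to flag more text or raised to flag less.
    """
    p_toxic = float(probs[1])
    return "toxic" if p_toxic >= threshold else "non-toxic"


# Probabilities printed by the usage example.
example_probs = [0.9940585, 0.00594147]
print(label_from_probs(example_probs))
```

With the real model, the call would be `label_from_probs(get_toxicity_probs(TEXT))`.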

Train

The model was trained with the Adam optimizer at a learning rate of 2e-5 and a batch size of 32 for 3 epochs.
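For a rough sense of scale, the number of optimizer steps implied by these settings can be estimated from the test-set size in the report (26270 examples, 10% of the data). This is back-of-the-envelope arithmetic that assumes an exact 80-10-10 split and no gradient accumulation, so the real counts may differ slightly.

```python
import math

TEST_SIZE = 26270   # support reported for the test split
BATCH_SIZE = 32
EPOCHS = 3

# Assumption: the test split is exactly 10% and the train split exactly 80%.
total_examples = TEST_SIZE * 10
train_examples = total_examples * 8 // 10

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(train_examples, steps_per_epoch, total_steps)
```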

Model size: 29.2M params (Safetensors; tensor types I64 and F32)