---
language: ru
tags:
- spam-detection
- text-classification
- russian
license: cc-by-nc-4.0
datasets:
- RUSpam/spam_dataset_v6
metrics:
- F1
model-index:
- name: spamNS_v1
  results:
  - task:
      name: Text Classification
      type: text-classification
    metrics:
    - name: F1
      type: F1
      value: 0.98
---

# RUSpam/spamNS_v1

## Description

This is a spam detection model based on the cointegrated/rubert-tiny2 architecture and fine-tuned on Russian-language spam data. It classifies a text as spam or not spam.

## Usage

```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'RUSpam/spamNS_v1'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# The model has a single output logit; sigmoid turns it into a spam probability.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)


def clean_text(text):
    # Remove URLs.
    text = re.sub(r'http\S+', '', text)
    # Replace every run of characters outside А-Я, а-я, 0-9 and space with a single space.
    text = re.sub(r'[^А-Яа-я0-9 ]+', ' ', text)
    text = text.lower().strip()
    return text


def classify_message(message):
    message = clean_text(message)
    encoding = tokenizer(message, padding='max_length', truncation=True, max_length=128, return_tensors='pt')
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask).logits
        pred = torch.sigmoid(outputs).cpu().numpy()[0][0]

    # Return 1 for spam, 0 for not spam.
    is_spam = int(pred >= 0.5)
    return is_spam


if __name__ == '__main__':
    while True:
        message = input("Enter a message to classify (or 'exit' to quit): ")
        if message.lower() == 'exit':
            break
        is_spam = classify_message(message)
        print(f"The message {'is spam' if is_spam else 'is not spam'}")
```
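
To make the preprocessing step easier to reason about, the sketch below runs the same `clean_text` function on a few sample strings without loading the model. The sample messages and the `example.com` URL are made up for illustration. Note that the character class `А-Яа-я` does not include `ё`, and that Latin letters are dropped entirely, so a message written only in Latin script reaches the tokenizer as an empty string.

```python
import re


def clean_text(text):
    # Same preprocessing as in the usage snippet above.
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^А-Яа-я0-9 ]+', ' ', text)
    return text.lower().strip()


# Illustrative inputs (not taken from the training data).
samples = [
    "Привет! Заработай 1000$ тут: https://example.com/win",
    "Ещё",
    "Free iPhone",
]

for sample in samples:
    # repr() makes leftover double spaces and empty results visible.
    print(repr(sample), "->", repr(clean_text(sample)))
```

Punctuation is replaced by spaces rather than removed, so the cleaned text can contain runs of several spaces; the tokenizer treats these as ordinary whitespace.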

## Usage with our library

```python
from ruSpamLib import is_spam

message = input("Enter a message: ")

pred_average = is_spam(message, model_name="spamNS_v1")

print(f"Prediction: {'Spam' if pred_average else 'Not Spam'}")
```
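
When several messages need to be checked at once, a thin wrapper around a per-message classifier may be convenient. The following is a minimal sketch, not part of `ruSpamLib`: `filter_spam` and `keyword_stub` are hypothetical names, and the stub only stands in for the real model so that the example runs without downloading anything. It assumes the classifier returns a truthy value for spam, as `is_spam` appears to do in the snippet above.

```python
from typing import Callable, Iterable, List


def filter_spam(messages: Iterable[str], classifier: Callable[[str], bool]) -> List[str]:
    """Return only the messages that the classifier does not flag as spam."""
    return [m for m in messages if not classifier(m)]


def keyword_stub(message: str) -> bool:
    # Hypothetical stand-in for the real model: flags messages that mention winning.
    return "выигр" in message.lower()


inbox = [
    "Вы выиграли приз, перейдите по ссылке",
    "Встреча перенесена на завтра",
]
clean_inbox = filter_spam(inbox, keyword_stub)
print(clean_inbox)
```

To use the actual model, the stub could be replaced with `functools.partial(is_spam, model_name="spamNS_v1")`, assuming the library call behaves as shown in the usage example.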

# Citation

```
@MISC{RUSpam/spamNS_V1,
    author = {Kirill Fedko (Neurospacex)},
    title = {Russian Spam Classification Model},
    url = {https://huggingface.co/RUSpam/spamNS_V1/},
    year = 2024
}
```