---
license: mit
language:
- en
---

# Named Entity Recognition

## Model Description

This model is a fine-tuned token classification model that predicts entity labels for the tokens in a sentence. It was fine-tuned on a custom dataset that focuses on identifying certain types of entities, including biased words and phrases in text.

## Intended Use

The model is intended for entity recognition tasks, especially identifying biases in text passages. Given a sequence of text, the model tags the tokens, words, or **spans** it associates with a particular entity or bias.
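
As a rough illustration of how per-word tags become spans, the sketch below groups BIO labels into contiguous spans. It assumes the `B-BIAS`/`I-BIAS` label scheme that appears in the usage example of this card; the function name `bio_to_spans` and the sample words and labels are hypothetical and are not real model output.

```python
def bio_to_spans(words, labels, entity="BIAS"):
    """Group word-level BIO labels into (start, end, text) spans; end is exclusive."""
    spans = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        begins = label == f"B-{entity}"
        continues = label == f"I-{entity}"
        if begins or (continues and start is None):
            # A B- tag (or a stray I- tag) opens a new span, closing any open one
            if start is not None:
                spans.append((start, i, " ".join(words[start:i])))
            start = i
        elif not continues and start is not None:
            # Any other label closes the open span
            spans.append((start, i, " ".join(words[start:i])))
            start = None
    if start is not None:
        spans.append((start, len(words), " ".join(words[start:])))
    return spans


# Hypothetical labels, written by hand for illustration only
words = ["they", "are", "a", "complete", "disgrace", "to", "society"]
labels = ["O", "O", "O", "B-BIAS", "I-BIAS", "O", "O"]
print(bio_to_spans(words, labels))
```

A stray `I-BIAS` with no preceding `B-BIAS` is treated as the start of a span here; that is a design choice of this sketch, not documented behavior of the model.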

## How to Use

The model can be used for inference directly through the Hugging Face `transformers` library:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and model, and move the model to the same device as the inputs
model_name = "newsmediabias/UnBIAS-Named-Entity-Recognition"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name).to(device)


def predict_entities(sentence):
    # Encode once and recover the tokens from the same ids,
    # so that tokens and predictions stay aligned
    inputs = tokenizer.encode(sentence, return_tensors="pt").to(device)
    tokens = tokenizer.convert_ids_to_tokens(inputs[0])

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)

    id2label = model.config.id2label

    # Reconstruct words from subword tokens
    biased_words = []
    current_word = ""
    for token, prediction in zip(tokens, predictions[0]):
        label = id2label[prediction.item()]
        if label in ['B-BIAS', 'I-BIAS']:
            if token.startswith('##'):
                current_word += token[2:]
            else:
                if current_word:
                    biased_words.append(current_word)
                current_word = token
        elif current_word:
            # A token without a bias label ends the current word
            biased_words.append(current_word)
            current_word = ""
    if current_word:
        biased_words.append(current_word)

    # Filter out special tokens and subword tokens
    biased_words = [word for word in biased_words if not word.startswith('[') and not word.endswith(']') and not word.startswith('##')]

    return biased_words


sentence = "due to your evil and dishonest nature, i am kind of tired and want to get rid of such cheapters. all people like you are evil and a disgrace to society and I must say to get rid of immigrants as they are filthy to culture"

biased_words = predict_entities(sentence)
for word in biased_words:
    print(f"Biased Word: {word}")
```
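
As an alternative to merging subword tokens by hand, the `transformers` token-classification pipeline can aggregate them into spans. The sketch below is a minimal example under stated assumptions: the model's labels follow the `B-BIAS`/`I-BIAS` scheme used in the example above (so aggregated spans are reported with `entity_group` equal to `"BIAS"`), and the helper names `load_bias_pipeline` and `filter_bias_spans` and the `min_score` threshold of 0.5 are hypothetical choices, not part of the model or the library.

```python
def load_bias_pipeline(model_id="newsmediabias/UnBIAS-Named-Entity-Recognition"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so filter_bias_spans can be used without transformers installed
    from transformers import pipeline

    # aggregation_strategy="simple" merges subword tokens into entity spans
    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def filter_bias_spans(entities, min_score=0.5):
    """Keep BIAS spans scoring at least min_score (an illustrative threshold).

    `entities` is a list of dicts in the aggregated pipeline output format,
    with the keys `entity_group`, `score`, `word`, `start`, and `end`.
    """
    return [
        (entity["word"], entity["start"], entity["end"])
        for entity in entities
        if entity["entity_group"] == "BIAS" and entity["score"] >= min_score
    ]
```

With the model downloaded, `filter_bias_spans(load_bias_pipeline()(sentence))` would return the retained spans as `(text, start, end)` tuples, where `start` and `end` are character offsets into `sentence`.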

## Limitations and Biases

Every model has limitations, and it is crucial to understand them before deploying the model in real-world scenarios:

1. **Training Data**: The model was trained on a specific dataset, and its predictions are only as good as that data.
2. **Generalization**: The model may perform well on certain types of sentences or phrases, but it might not generalize well to all types of text or contexts.

It is also essential to be aware of potential biases in the training data, which might affect the model's predictions.
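
One way to check how well the model generalizes to a new domain is to compare its output against a small hand-labeled sample from that domain. The sketch below computes word-level precision and recall for a single sentence; the function name `precision_recall` and the example word lists are hypothetical and are not results from this model.

```python
def precision_recall(predicted, gold):
    """Word-level precision and recall for one sentence, comparing lowercased sets."""
    pred_set = {word.lower() for word in predicted}
    gold_set = {word.lower() for word in gold}
    true_positives = len(pred_set & gold_set)
    # Define both metrics as 0.0 when their denominator is empty
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical predictions and hand labels, for illustration only
predicted = ["evil", "dishonest", "tired"]
gold = ["evil", "dishonest", "filthy", "disgrace"]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Averaging these values over a few dozen labeled sentences gives a rough picture of whether the model transfers to the target text before it is deployed there.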

## Training Data

The model was fine-tuned on a custom dataset. To request the dataset, contact **Shaina Raza [email protected]**.