Model Card: BERT for Named Entity Recognition (NER)

Model Overview

This model, bert-conll-ner, is a version of bert-base-uncased fine-tuned for Named Entity Recognition (NER) on the CoNLL-2003 dataset. It identifies and classifies entities in text, namely person names (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).

Model Architecture

  • Base Model: BERT (Bidirectional Encoder Representations from Transformers) with the bert-base-uncased architecture (approximately 109M parameters).
  • Task: Token Classification (NER).

Training Dataset

  • Dataset: CoNLL-2003, a standard dataset for NER tasks containing sentences annotated with named entity spans.
  • Classes:
    • PER (Person)
    • ORG (Organization)
    • LOC (Location)
    • MISC (Miscellaneous)
    • O (Outside of any entity span)
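The model predicts IOB2 tags rather than the bare class names, so each class (except O) appears as a B- (beginning) and I- (inside) label. The sketch below shows the conventional CoNLL-2003 label set; the authoritative mapping for this model is stored in its configuration (model.config.id2label) and may use a different ordering.

# Conventional CoNLL-2003 IOB2 label set for token classification.
# Illustrative only; check model.config.id2label for the exact mapping.
label_list = [
    "O",                  # outside any entity
    "B-PER", "I-PER",     # person
    "B-ORG", "I-ORG",     # organization
    "B-LOC", "I-LOC",     # location
    "B-MISC", "I-MISC",   # miscellaneous
]
id2label = {i: label for i, label in enumerate(label_list)}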

Performance Metrics

The model achieves the following results on the CoNLL-2003 evaluation set:

  • Loss: 0.0649
  • Precision: 93.59%
  • Recall: 95.07%
  • F1 Score: 94.32%
  • Accuracy: 98.79%

These results indicate strong entity identification and classification on text similar to the newswire articles in the evaluation set.
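Precision, recall, and F1 for NER are typically computed at the entity (span) level, for example with the seqeval library, while accuracy is computed per token. The snippet below is a minimal sketch of that entity-level evaluation, assuming gold and predicted IOB2 tag sequences; it is not the exact script used to produce the numbers above.

# Minimal sketch of entity-level NER metrics with seqeval (assumed setup).
from seqeval.metrics import precision_score, recall_score, f1_score

# One list of IOB2 tags per sentence: gold labels and model predictions.
y_true = [["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))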

Training Details

  • Optimizer: AdamW (Adam with weight decay)
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Number of Epochs: 3
  • Scheduler: Linear scheduler with warm-up steps
  • Loss Function: Cross-entropy loss with ignored index (-100) for padding tokens
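The ignored index is what keeps positions without a real label out of the loss: padding, the special [CLS]/[SEP] tokens, and (in the usual setup) continuation subwords all receive -100. The sketch below illustrates this standard label-alignment step with the bert-base-uncased tokenizer; it is an assumed reconstruction of the technique, not the exact preprocessing script used for this model.

# Minimal sketch: align word-level NER labels with subword tokens so that
# special tokens, padding, and continuation subwords get label -100 and are
# therefore ignored by the cross-entropy loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["John", "lives", "in", "New", "York", "City", "."]
word_labels = [1, 0, 0, 5, 6, 6, 0]  # e.g. 1 = B-PER, 5 = B-LOC, 6 = I-LOC, 0 = O

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:                  # [CLS], [SEP], or padding
        aligned_labels.append(-100)
    elif word_id != previous_word_id:    # first subword keeps the word's label
        aligned_labels.append(word_labels[word_id])
    else:                                # later subwords are ignored by the loss
        aligned_labels.append(-100)
    previous_word_id = word_id

print(aligned_labels)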

Model Input/Output

  • Input Format: Tokenized text with special tokens [CLS] and [SEP].
  • Output Format: Token-level predictions with corresponding labels from the NER tag set (B-PER, I-PER, etc.).
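When calling the model directly instead of through the pipeline (shown below), the per-token logits have to be mapped back to these labels. A minimal sketch of that decoding step, assuming the model and tokenizer are loaded as in the "Loading the Model" section and that the label names live in model.config.id2label:

# Minimal sketch: turn raw logits into token-level NER labels.
# Assumes `model` and `tokenizer` are loaded as shown under "Loading the Model".
import torch

text = "John lives in New York City."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)

predicted_ids = logits.argmax(dim=-1)[0]     # best label id for each token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label_id in zip(tokens, predicted_ids):
    print(token, model.config.id2label[label_id.item()])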

How to Use the Model

Installation

pip install transformers torch

Loading the Model

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("sfarrukh/bert-conll-ner")
model = AutoModelForTokenClassification.from_pretrained("sfarrukh/bert-conll-ner")

Running Inference

from transformers import pipeline

nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "John lives in New York City."
result = nlp(text)
print(result)

Example output:

[{'entity_group': 'PER',
  'score': 0.99912304,
  'word': 'john',
  'start': 0,
  'end': 4},
 {'entity_group': 'LOC',
  'score': 0.9993351,
  'word': 'new york city',
  'start': 14,
  'end': 27}]

Limitations

  1. Domain-Specific Adaptability: Performance is likely to drop on domain-specific text (e.g., legal or medical documents), since CoNLL-2003 consists mainly of newswire articles.
  2. Ambiguity: Ambiguous entities or overlapping spans are not explicitly handled.

Recommendations

  • For domain-specific tasks, consider fine-tuning this model further on a relevant dataset.
  • Split long texts into smaller segments before inference, since the model accepts at most 512 tokens per input; one possible approach is sketched below.
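A minimal sketch of such a pre-processing step, chunking the input on whitespace near a fixed character budget and shifting the returned character offsets back into the original text (the chunking strategy and the max_chars value are illustrative assumptions, not part of this model):

# Minimal sketch: run the NER pipeline over long texts in chunks and re-offset
# the predicted character spans into the original document.
def chunk_text(text, max_chars=1000):
    """Yield (offset, chunk) pairs, splitting at whitespace near max_chars."""
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        yield start, text[start:end]
        start = end

def ner_long_text(text, nlp):
    entities = []
    for offset, chunk in chunk_text(text):
        for entity in nlp(chunk):
            entity["start"] += offset
            entity["end"] += offset
            entities.append(entity)
    return entities

# Usage with the pipeline created above:
# entities = ner_long_text(very_long_document, nlp)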

Acknowledgements

  • Transformers Library: Hugging Face
  • Dataset: CoNLL-2003
  • Base Model: bert-base-uncased by Google