---
language:
  - en
base_model:
  - google-bert/bert-large-uncased
pipeline_tag: token-classification
---

# Model Card for Mountain NER Model

## Model Summary

This model is a fine-tuned Named Entity Recognition (NER) model designed specifically to identify mountain names in text. It was trained on BIO-labeled data to detect and classify mountain entities, and can handle both single-word and multi-word mountain names (e.g., "Kilimanjaro" or "Rocky Mountains").

## Intended Use

- **Task:** Named Entity Recognition (NER) for mountain name identification.
- **Input:** A text string containing sentences or paragraphs.
- **Output:** A list of tokens annotated with one of three labels (see the example below):
  - `B-MOUNTAIN`: beginning of a mountain name.
  - `I-MOUNTAIN`: inside a mountain name.
  - `O`: outside of any mountain entity.
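For example, a sentence containing a multi-word mountain name would carry word-level tags like these (illustrative only):

```python
# Illustrative word-level BIO tags for one sentence.
words  = ["The", "Rocky", "Mountains", "stretch", "across", "North", "America", "."]
labels = ["O", "B-MOUNTAIN", "I-MOUNTAIN", "O", "O", "O", "O", "O"]
```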

## How to Use

You can load this model with the Hugging Face `transformers` library:

```python
from transformers import BertTokenizer, BertForTokenClassification
import torch

# Placeholder model ID; replace with the actual repository name.
tokenizer = BertTokenizer.from_pretrained("your_username/your_model")
model = BertForTokenClassification.from_pretrained("your_username/your_model")

text = "The Kilimanjaro is one of the most famous mountains."

# Tokenize the input and run a forward pass without tracking gradients.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring label for each token and map label IDs to names.
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
labels = [model.config.id2label[label_id] for label_id in predictions.squeeze().tolist()]

print(list(zip(tokens, labels)))
```
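The same checkpoint should also work with the `token-classification` pipeline, which can merge `B-`/`I-` sub-word pieces back into whole entity spans. A minimal sketch, assuming the placeholder model ID above:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your_username/your_model",  # placeholder model ID
    aggregation_strategy="simple",     # merge sub-word pieces into entity spans
)

print(ner("The Rocky Mountains stretch across North America."))
# Expected output shape (scores and offsets illustrative):
# [{'entity_group': 'MOUNTAIN', 'word': 'rocky mountains', 'score': ..., 'start': 4, 'end': 19}]
```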

## Dataset

The dataset consists of text examples annotated with mountain names in BIO format:

- **Training set:** 350 examples
- **Validation set:** 75 examples
- **Test set:** 75 examples

The dataset was created by combining known mountain names with sentences containing them (a sketch of this approach follows).
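The generation script itself is not part of this card, but a template-based approach along those lines might look like the following sketch; the name list, templates, and `bio_tags` helper are illustrative assumptions:

```python
import random

mountains = ["Kilimanjaro", "Mount Everest", "Rocky Mountains"]
templates = [
    "{} is a popular climbing destination.",
    "Many hikers dream of reaching {}.",
]

def bio_tags(sentence_words, name_words):
    """Tag the mountain span with B-/I-MOUNTAIN and everything else with O."""
    tags = ["O"] * len(sentence_words)
    for i in range(len(sentence_words) - len(name_words) + 1):
        window = [w.strip(".,") for w in sentence_words[i:i + len(name_words)]]
        if window == name_words:
            tags[i] = "B-MOUNTAIN"
            tags[i + 1:i + len(name_words)] = ["I-MOUNTAIN"] * (len(name_words) - 1)
            break
    return tags

name = random.choice(mountains)
words = random.choice(templates).format(name).split()
print(list(zip(words, bio_tags(words, name.split()))))
```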

## Limitations

- The model is designed specifically for mountain names and may not generalize to other named entities.
- Performance may degrade on noisy or informal text.
- Multi-word mountain names must be aligned correctly with BERT's sub-word tokens to be recognized as a single entity (see the alignment sketch below).
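For reference, a common way to keep multi-word names intact is to align word-level BIO labels with BERT's sub-word tokens via the fast tokenizer's `word_ids()` mapping. A minimal sketch, reusing the tag scheme above (special tokens are tagged `O` here; during training they are usually masked with `-100` instead):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer

words = ["The", "Rocky", "Mountains", "stretch", "across", "North", "America", "."]
labels = ["O", "B-MOUNTAIN", "I-MOUNTAIN", "O", "O", "O", "O", "O"]

encoding = tokenizer(words, is_split_into_words=True)
aligned = [
    "O" if idx is None else labels[idx]  # special tokens ([CLS]/[SEP]) get "O"
    for idx in encoding.word_ids()
]
print(list(zip(tokenizer.convert_ids_to_tokens(encoding["input_ids"]), aligned)))
```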

**Repository:** https://github.com/Yevheniia-Ilchenko/Bert_NER

## Training Details

The model was fine-tuned using the BERT Base Uncased architecture for token classification. Training details:

- **Model architecture:** BERT for Token Classification (`bert-base-uncased`).
- **Dataset:** Custom-labeled dataset in BIO format for mountain name recognition.
- **Hyperparameters:**
  - Learning rate: 2e-4
  - Batch size: 16
  - Maximum sequence length: 128
  - Number of epochs: 3
- **Optimizer:** AdamW
- **Warmup steps:** 500
- **Weight decay:** 0.01
- **Evaluation strategy:** step-based evaluation with automatic saving of the best model.
- **Training arguments** (see the configuration sketch after the metrics below):
  - `save_total_limit=3`: limits the number of saved checkpoints.
  - `load_best_model_at_end=True`: ensures the best checkpoint is loaded after training.
- **Training performance:**
  - Training runtime: 570.44 seconds
  - Training samples per second: 1.841
  - Training steps per second: 0.116
  - Final training loss: 0.4017
- **Evaluation metrics:**
  - Evaluation loss: 0.0839
  - Precision: 97.11%
  - Recall: 96.89%
  - F1 score: 96.91%
  - Evaluation runtime: 13.76 seconds
  - Samples per second: 5.449
  - Steps per second: 0.726
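Putting the settings above together, a `Trainer` configuration along these lines would reproduce the setup. This is a sketch only: the output directory, dataset objects, and `compute_metrics` function are placeholders, and AdamW is the `Trainer` default optimizer. The 128-token maximum length is applied during tokenization rather than in `TrainingArguments`.

```python
from transformers import (
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # O, B-MOUNTAIN, I-MOUNTAIN
)

args = TrainingArguments(
    output_dir="bert-mountain-ner",  # placeholder
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="steps",     # renamed to eval_strategy in newer releases
    save_total_limit=3,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # placeholder dataset objects
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,  # placeholder metrics function
)
trainer.train()
```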