# XLM-RoBERTa Token Classification for Named Entity Recognition (NER)

## Model Description

This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER). It was trained on the German subset (PAN-X.de) of the XTREME dataset and identifies the following entity types:

PER: Person names

ORG: Organization names

LOC: Location names


## Uses

This model is suitable for multilingual NER tasks, especially in scenarios where extracting and classifying person, organization, and location names in text across different languages is required.

Applications:

- Information extraction
- Multilingual NER tasks
- Automated text analysis for businesses

## Training Details

Base Model: xlm-roberta-base

Training Dataset: The model is trained on the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.

Training Framework: Hugging Face transformers library with PyTorch backend.

Data Preprocessing: Tokenization was performed with the XLM-RoBERTa tokenizer, taking care to align the word-level entity labels with the resulting subword tokens.

## Training Procedure

Here's a brief overview of the fine-tuning procedure for the XLM-RoBERTa NER model; illustrative code sketches follow the individual steps:

Setup Environment:

Clone the repository and set up dependencies.

Import necessary libraries and modules.

Load Data:

Load the PAN-X subset from the XTREME dataset.

Shuffle and sample data subsets for training and evaluation.
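
A minimal sketch of this step, assuming the Hugging Face `datasets` library and the `PAN-X.de` configuration of XTREME (the sample sizes below are illustrative, not the ones used for this checkpoint):

```python
from datasets import load_dataset

# German PAN-X split of XTREME; the config name "PAN-X.de" selects German.
panx_de = load_dataset("xtreme", name="PAN-X.de")

# Shuffle and take smaller subsets for quicker experiments
# (sizes are illustrative, not the original ones).
train_ds = panx_de["train"].shuffle(seed=42).select(range(10_000))
eval_ds = panx_de["validation"].shuffle(seed=42).select(range(2_000))

print(train_ds[0]["tokens"])    # word-level tokens
print(train_ds[0]["ner_tags"])  # integer NER labels
```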

Data Preparation:

Convert raw dataset into a format suitable for token classification.

Define a mapping for entity tags and apply tokenization.

Align NER tags with tokenized inputs.
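
One common approach to label alignment, and likely what is meant here, is to keep the tag of the first subword of each word and mask all remaining subwords with `-100` so the loss ignores them. A sketch continuing from the loading code above (function and variable names are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# PAN-X labels: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
tags = panx_de["train"].features["ner_tags"].feature
index2tag = dict(enumerate(tags.names))
tag2index = {name: i for i, name in enumerate(tags.names)}

def tokenize_and_align_labels(batch):
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, word_tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word:
                label_ids.append(-100)  # special tokens and trailing subwords are ignored
            else:
                label_ids.append(word_tags[word_id])
            previous_word = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

train_encoded = train_ds.map(tokenize_and_align_labels, batched=True,
                             remove_columns=train_ds.column_names)
eval_encoded = eval_ds.map(tokenize_and_align_labels, batched=True,
                           remove_columns=eval_ds.column_names)
```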

Define Model:

Initialize the XLM-RoBERTa model for token classification.

Configure the model with the number of labels based on the dataset.
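
A sketch of the model definition, reusing the label mappings built during data preparation (the exact configuration of the published checkpoint is not documented here):

```python
import torch
from transformers import AutoConfig, AutoModelForTokenClassification

config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(tags.names),  # 7 labels for PAN-X
    id2label=index2tag,
    label2id=tag2index,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", config=config
).to(device)
```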

Setup Training Arguments:

Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.

Configure logging and checkpointing.
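
The actual hyperparameters of this run are not listed in the card, so the values below are placeholders that show the shape of a typical `TrainingArguments` setup:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-ner-de",   # placeholder output path
    num_train_epochs=3,                # placeholder hyperparameters
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    learning_rate=5e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",       # named `eval_strategy` in recent transformers versions
    save_strategy="epoch",
    logging_steps=100,
)
```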

Initialize Trainer:

Create a Trainer instance with the model, training arguments, datasets, and data collator.

Specify evaluation metrics to monitor performance.
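
A sketch of the `Trainer` setup with a token-classification data collator and a `seqeval`-based F1 metric (the metric helper below is an assumption about how the score was computed):

```python
import numpy as np
from transformers import Trainer, DataCollatorForTokenClassification
from seqeval.metrics import f1_score

# Pads input and label sequences to the same length within a batch.
data_collator = DataCollatorForTokenClassification(tokenizer)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    # Keep only positions carrying a real label (drop the -100 special-token/subword slots).
    true_labels = [[index2tag[l] for l in label_seq if l != -100]
                   for label_seq in labels]
    true_preds = [[index2tag[p] for p, l in zip(pred_seq, label_seq) if l != -100]
                  for pred_seq, label_seq in zip(predictions, labels)]
    return {"f1": f1_score(true_labels, true_preds)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=eval_encoded,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```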

Train the Model:

Start the training process using the Trainer.

Monitor training progress and metrics.
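
With the Trainer in place, training is a single call; loss and the epoch-level F1 from `compute_metrics` are reported as it runs:

```python
train_result = trainer.train()
print(train_result.metrics)  # final training loss, runtime, etc.
```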

Evaluation and Results:

Evaluate the model on the validation set.

Compute metrics like F1 score for performance assessment.
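
Evaluation on the validation split reuses the same metric function:

```python
eval_metrics = trainer.evaluate()
print(f"Validation F1: {eval_metrics['eval_f1']:.4f}")
```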

Save and Push Model:

Save the fine-tuned model locally or push to a model hub for sharing and further use.
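
Sketch of saving locally or pushing to the Hub (the output directory and commit message are placeholders):

```python
# Save the fine-tuned model and tokenizer locally ...
trainer.save_model("xlm-roberta-ner-de")
tokenizer.save_pretrained("xlm-roberta-ner-de")

# ... or push to the Hugging Face Hub (requires a prior `huggingface-cli login`).
# trainer.push_to_hub(commit_message="Fine-tune XLM-R on PAN-X German NER")
```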

## Metrics

The model's performance is evaluated using the F1 score for NER. Predictions are aligned with the gold-standard labels, and sub-token predictions are ignored where appropriate.

## Evaluation

The snippet below loads the fine-tuned checkpoint from the Hub and tags an example German sentence with the `ner` pipeline:

```python
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_checkpoint = "MassMin/Multilingual-NER-tagging"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    device=0 if torch.cuda.is_available() else -1,
)

def tag_text_with_pipeline(text, ner_pipeline):
    # Use the NER pipeline to get predictions
    results = ner_pipeline(text)

    # Convert results to a DataFrame for easy viewing
    df = pd.DataFrame(results)
    df = df[['word', 'entity', 'score']]
    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
    return df

text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```

#### Testing Data

|        | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|--------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| Tags   | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |