---
language:
- bn
metrics:
- f1
pipeline_tag: token-classification
---

# Bangla-Person-Name-Extractor

This repository contains the implementation of a Bangla Person Name Extractor model, which extracts person-name entities from a given sentence. We approached it as a token classification task, i.e., tagging each token as either part of a person's name or not. We used the [BanglaBERT](https://github.com/csebuetnlp/banglabert) model, finetuning it for binary token classification on a custom-prepared dataset. We have deployed the model on Hugging Face for easier access and use.

# How to use it?

[This notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Inference_template.ipynb) contains an inference template for a single sentence.

<br></br>

You can also run inference directly with the following code snippet. Just change the sentence.

```
# Requirements:
#   pip install transformers==4.30.2 torch numpy==1.23.5
#   pip install git+https://github.com/csebuetnlp/normalizer
import numpy as np
import torch
from normalizer import normalize
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")
tokenizer = AutoTokenizer.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")


def inference_fn(sentence):
    # Normalize the text before tokenization.
    sentence = normalize(sentence)
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs).logits
    # Drop the first and last positions (special tokens) so the
    # predictions line up with `tokens`.
    predictions = torch.argmax(logits[0], dim=1)[1:-1].numpy()
    # Label 1 marks tokens that belong to a person's name.
    idxs = np.where(predictions == 1)
    return np.array(tokens)[idxs]


sentences = [
    "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।",
    "ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'",
    "দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।",
]

for sentence in sentences:
    pred = inference_fn(sentence)
    print(f"Input Sentence : {sentence}")
    print(f"Person Name Entities : {pred}")
```

**Output:**

```
Input Sentence : আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।
Person Name Entities : ['আব্দুর' 'রহিম']

Input Sentence : ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'
Person Name Entities : ['দেলোয়ার' 'হোসেন' 'মজুমদার']

Input Sentence : দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।
Person Name Entities : []
```
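
The output above lists each name token separately. If whole names are more convenient, consecutive tagged tokens can be merged. The following is a minimal sketch, not part of the original repository: `group_name_tokens` is a hypothetical helper, it expects the full token list together with the per-token predictions (the `tokens` and `predictions` values computed inside `inference_fn`), and it assumes WordPiece-style tokens in which a leading `##` marks a continuation piece.

```python
def group_name_tokens(tokens, predictions):
    """Merge consecutive tokens tagged 1 into whole names.

    Hypothetical helper. Assumes WordPiece-style tokens, where a
    leading "##" marks a continuation of the previous token.
    """
    names, current = [], []
    for token, label in zip(tokens, predictions):
        if label == 1:
            if token.startswith("##") and current:
                # Glue the continuation piece onto the previous piece.
                current[-1] += token[2:]
            else:
                current.append(token[2:] if token.startswith("##") else token)
        elif current:
            # A non-name token closes the name collected so far.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


tokens = ["আব্দুর", "রহিম", "নামের", "কাস্টমারকে"]
predictions = [1, 1, 0, 0]
print(group_name_tokens(tokens, predictions))  # ['আব্দুর রহিম']
```

Two different names that appear directly next to each other would be merged into one string by this sketch, since the binary labels carry no boundary information.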

# Datasets

We used two datasets to train and evaluate our pipeline.

1. [Bengali-NER/annotated data at master · Rifat1493/Bengali-NER](https://github.com/Rifat1493/Bengali-NER/tree/master/annotated%20data)
2. [banglakit/bengali-ner-data](https://raw.githubusercontent.com/banglakit/bengali-ner-data/master/main.jsonl)

The two datasets use quite different annotation formats, so we preprocessed each of them before merging. Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/prepare-dataset.ipynb) to prepare the dataset as required.
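
One step that such preprocessing typically needs is collapsing multi-class NER tags into the binary person/non-person labels this model predicts. The sketch below only illustrates that idea and is not taken from the linked notebook: the helper `to_binary_labels` and the tag names (`B-PER`, `U-PERSON`, and so on) are assumptions, and the actual tag sets should be checked against each dataset.

```python
def to_binary_labels(tags, person_entities=("PER", "PERSON")):
    """Map NER tags to 1 (person name) or 0 (anything else).

    Hypothetical helper; the tag names are assumed, not confirmed.
    """
    labels = []
    for tag in tags:
        # Drop a scheme prefix such as "B-", "I-", "L-" or "U-" if present.
        entity = tag.split("-", 1)[-1]
        labels.append(1 if entity in person_entities else 0)
    return labels


print(to_binary_labels(["B-PER", "I-PER", "O", "B-LOC"]))  # [1, 1, 0, 0]
```

Once both datasets are mapped to the same 0/1 scheme, their sentences can be concatenated into a single training set.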

# Training and Evaluation

We treated this problem as a token classification task, so finetuning the BanglaBERT model was a natural fit. [BanglaBERT](https://huggingface.co/csebuetnlp/banglabert) is an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models finetuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.

We mainly finetuned two BanglaBERT checkpoints:

1. [BanglaBERT](https://huggingface.co/csebuetnlp/banglabert)
2. [BanglaBERT small](https://huggingface.co/csebuetnlp/banglabert_small)

BanglaBERT performed better than BanglaBERT small (83% vs. 79% F1 score on the test set).

Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Training%20Notebook%20%3A%20Person%20Name%20Extractor%20using%20BanglaBERT.ipynb) to see the training process.
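
When finetuning for token classification, word-level labels usually have to be aligned with the subword tokens the tokenizer produces. The sketch below shows one common way to do this and is not necessarily what the training notebook does: `align_labels` is a hypothetical helper, `word_ids` is assumed to be the list returned by a fast tokenizer's `word_ids()` method (with `None` for special tokens), and `-100` is the index that PyTorch's cross-entropy loss ignores by default.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Spread word-level labels over subword tokens.

    Hypothetical helper. Special tokens (word id None) receive
    `ignore_index` so that the loss skips them.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)
        else:
            # Every subword piece inherits the label of its word.
            aligned.append(word_labels[word_id])
    return aligned


word_labels = [1, 1, 0]                 # one label per word
word_ids = [None, 0, 1, 1, 2, None]     # word 1 was split into two pieces
print(align_labels(word_labels, word_ids))  # [-100, 1, 1, 1, 0, -100]
```

An alternative design labels only the first piece of each word and masks the rest with `ignore_index`.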

**Quantitative results**

Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Inference%20and%20Evaluation%20Notebook.ipynb) to see the evaluation process.

<br></br>

![Results](https://github.com/MBMMurad/asl-2d-to-3d/blob/master/Screenshot%20from%202023-07-13%2023-11-59.png)
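
For reference, the F1 score of the positive (person name) class can be computed from binary labels as sketched below. This is an illustration only: `binary_f1` is a hypothetical helper, and whether the reported scores are computed per token or per entity is not stated here, so token-level scoring is an assumption.

```python
def binary_f1(y_true, y_pred):
    """F1 score for the positive class (label 1) over flat label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(round(binary_f1(y_true, y_pred), 3))  # 0.667
```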