---
language: bn
tags:
- Bert base Bangla
- Bengali Bert
- Bengali lm
- Bangla Base Bert
- Bangla Bert language model
- Bangla Bert
datasets:
- BanglaLM dataset
---
# Bangla BERT Base
Here we publish **bert-base-bangla**, a pretrained Bangla BERT language model, now available on the Hugging Face model hub. bert-base-bangla is pretrained with the masked language modeling objective described in the BERT paper and its GitHub repository.
## Corpus Details
We trained the Bangla BERT language model on the BanglaLM dataset from Kaggle (BanglaLM). The dataset has three versions, totaling almost 40 GB. After downloading the dataset, we proceeded with masked language model pretraining; a minimal sketch of such a setup is shown below.
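The following sketch shows how masked-LM pretraining can be set up with the transformers library. The corpus file path, the from-scratch bert-base configuration, and all hyperparameters here are illustrative assumptions, not the exact recipe used for this model.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# reuse the published tokenizer; a from-scratch run would train its own
tokenizer = BertTokenizerFast.from_pretrained("Kowsher/bert-base-test")
model = BertForMaskedLM(BertConfig())  # a bert-base-sized model, randomly initialized

# hypothetical path to a one-sentence-per-line dump of the BanglaLM corpus
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="bangla_lm.txt", block_size=128
)
# randomly masks 15% of tokens, as in the original BERT objective
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-base-bangla", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```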
## Bangla Base BERT Tokenizer
```python
from transformers import AutoTokenizer, AutoModel

bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bert-base-test")
text = "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি"
bnbert_tokenizer.tokenize(text)
# output: ['খাটি', 'সে', '##ানার', 'চাইতে', 'খাটি', 'আমার', 'দেশের', 'মাটি']
```
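The AutoModel import above can also be used to extract contextual embeddings from the same checkpoint. A minimal sketch; the mean pooling at the end is an illustrative choice, not something the model prescribes:

```python
import torch
from transformers import AutoModel, AutoTokenizer

bnbert_tokenizer = AutoTokenizer.from_pretrained("Kowsher/bert-base-test")
model = AutoModel.from_pretrained("Kowsher/bert-base-test")

inputs = bnbert_tokenizer(
    "খাঁটি সোনার চাইতে খাঁটি আমার দেশের মাটি", return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per token: (batch, seq_len, hidden_size)
token_embeddings = outputs.last_hidden_state
# mean-pool the token vectors into a single sentence vector (illustrative choice)
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # e.g. torch.Size([1, 768]) for a bert-base model
```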
## Mask Generation

Here, we can use the Bangla base BERT model for masked language modeling:
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("Kowsher/bert-base-test")
tokenizer = BertTokenizer.from_pretrained("Kowsher/bert-base-test")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)

for pred in nlp(f"আমি বাংলার গান {nlp.tokenizer.mask_token}"):
    print(pred)
# {'sequence': 'আমি বাংলার গান লিখি', 'score': 0.17955434322357178, 'token': 24749, 'token_str': 'লিখি'}

for pred in nlp(f"তুই রাজাকার তুই {nlp.tokenizer.mask_token}"):
    print(pred)
# {'sequence': 'তই রাজাকার তই রাজাকার', 'score': 0.9975168704986572, 'token': 13401, 'token_str': 'রাজাকার'}

for pred in nlp(f"বাংলা আমার {nlp.tokenizer.mask_token}"):
    print(pred)
# {'sequence': 'বাংলা আমার অহংকার', 'score': 0.5679506063461304, 'token': 19009, 'token_str': 'অহংকার'}
```
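By default the fill-mask pipeline returns the five highest-scoring candidates. The standard `top_k` call argument narrows this down (in recent transformers versions; older releases named it `topk`):

```python
# keep only the single best completion for the prompt
for pred in nlp(f"বাংলা আমার {nlp.tokenizer.mask_token}", top_k=1):
    print(pred['token_str'], pred['score'])
```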