BERT base model for Bangla

A pretrained BERT model for Bangla. BERT (Bidirectional Encoder Representations from Transformers) is a language model introduced by Google Research that has significantly advanced the state of the art on a wide range of NLP tasks. Unlike traditional left-to-right language models, BERT is bidirectional: it conditions on both the left and right context of each token during pre-training, which lets it capture the nuances of language more effectively.

Data Details

We used 36 GB of Bangla text to train the model. The corpus has the following cardinalities:

| Type | Count |
|---|---|
| Total words | 2,202,024,981 (about 2.2 billion) |
| Unique words | 22,944,811 (about 22.94 million) |
| Total sentences | 181,447,732 (about 181.45 million) |
| Total documents | 17,516,890 (about 17.52 million) |
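
For concreteness, here is a minimal sketch of how cardinalities like these can be counted from a raw text corpus. The file name (corpus.txt), the one-document-per-line layout, and the danda/whitespace-based splitting are illustrative assumptions, not the preprocessing actually used for this model.

import re

total_words = 0
unique_words = set()
total_sentences = 0
total_documents = 0

# Assumed layout: one document per line in corpus.txt
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        doc = line.strip()
        if not doc:
            continue
        total_documents += 1
        # Split sentences on the Bangla danda and common sentence-ending punctuation
        total_sentences += len([s for s in re.split(r"[।?!]", doc) if s.strip()])
        words = doc.split()
        total_words += len(words)
        unique_words.update(words)

print(total_documents, total_sentences, total_words, len(unique_words))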

Model Details

The core architecture of BERT is based on the Transformer model, which utilizes self-attention mechanisms to capture long-range dependencies in text efficiently. During pre-training, BERT learns contextualized word embeddings by predicting missing words within sentences, a process known as masked language modeling. This allows BERT to understand words in the context of their surrounding words, leading to more meaningful and context-aware embeddings.
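
To make masked language modeling concrete, the sketch below masks one token of a Bangla sentence and asks the model to predict it from its bidirectional context. It assumes the released checkpoint can be loaded with BertForMaskedLM (i.e. that the masked-LM head is included); if not, that head would be randomly initialized.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("banglagov/banBERT-Base")
model = BertForMaskedLM.from_pretrained("banglagov/banBERT-Base")

# Mask one token; the model predicts it from both the left and right context
text = f"আমি বাংলায় {tokenizer.mask_token} করি।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and look at the five most likely replacements
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))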

This model is based on the BERT-Base architecture, with 12 layers, a hidden size of 768, 12 attention heads, and 110 million parameters.
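
The corresponding transformers.BertConfig is sketched below. The vocabulary size is taken from the Training Details section, while the intermediate size and maximum sequence length are the standard BERT-Base values and are assumptions rather than facts stated in this card.

from transformers import BertConfig, BertModel

config = BertConfig(
    vocab_size=50_000,           # "50k tokens" per the Training Details section (assumed exact value)
    hidden_size=768,             # hidden size
    num_hidden_layers=12,        # 12 Transformer layers
    num_attention_heads=12,      # 12 attention heads
    intermediate_size=3072,      # standard BERT-Base feed-forward size (assumed)
    max_position_embeddings=512, # standard BERT-Base maximum sequence length (assumed)
)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")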

How to use

from transformers import BertModel, BertTokenizer

# Load the pretrained encoder and its tokenizer from the Hugging Face Hub
model = BertModel.from_pretrained("banglagov/banBERT-Base")
tokenizer = BertTokenizer.from_pretrained("banglagov/banBERT-Base")

text = "আমি বাংলায় পড়ি।"

# Tokenize the sentence and run it through the encoder
tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model(**tokenized_text)
print(outputs)
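
The outputs object above holds the contextual embeddings. A common follow-up, shown here as a sketch rather than as part of the original card, is to take the final hidden states or the [CLS] vector as a sentence representation (this reuses outputs from the snippet above):

token_embeddings = outputs.last_hidden_state    # shape: (1, sequence_length, 768)
sentence_embedding = token_embeddings[:, 0, :]  # vector at the [CLS] position
print(token_embeddings.shape, sentence_embedding.shape)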

Training Details

The model was trained on the 36 GB Bangla corpus described above with a vocabulary of 50k tokens, for 1 million steps with a batch size of 440 and a learning rate of 5e-5, on two NVIDIA A40 GPUs.
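
The sketch below shows how these reported hyperparameters map onto Hugging Face TrainingArguments. The per-device batch split, the checkpointing and logging intervals, and anything else not listed above are assumptions, not the authors' actual training configuration.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="banBERT-Base-pretraining",  # assumed output directory
    max_steps=1_000_000,                    # 1 million training steps
    per_device_train_batch_size=220,        # assumed split of the batch size of 440 across 2 GPUs
    learning_rate=5e-5,
    save_steps=50_000,                      # assumed checkpointing interval
    logging_steps=1_000,                    # assumed logging interval
)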

Results

| Metric | Train Loss | Eval Loss | Perplexity | NER | POS | Shallow Parsing | QA |
|---|---|---|---|---|---|---|---|
| Precision | - | - | - | 0.8475 | 0.8838 | 0.7396 | - |
| Recall | - | - | - | 0.7390 | 0.8543 | 0.6858 | - |
| Macro F1 | - | - | - | 0.7786 | 0.8611 | 0.7117 | 0.7396 |
| Exact Match | - | - | - | - | - | - | 0.6809 |
| Loss | 1.8633 | 1.4681 | 4.3826 | - | - | - | - |
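
The NER, POS, and shallow parsing scores above presumably come from fine-tuning the encoder for token classification; the sketch below shows the shape of such a setup. The label count is a placeholder, since the downstream datasets and tag sets are not specified in this card.

from transformers import BertForTokenClassification, BertTokenizerFast

num_labels = 7  # placeholder; depends on the tag set of the downstream task
tokenizer = BertTokenizerFast.from_pretrained("banglagov/banBERT-Base")
model = BertForTokenClassification.from_pretrained(
    "banglagov/banBERT-Base", num_labels=num_labels
)
# Fine-tune with transformers.Trainer on the labelled task data, then report
# precision, recall, and macro F1 as in the table above.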