---
language: bn
tags:
- bert
- bengali
- bengali-lm
- bangla
license: MIT
datasets:
- common_crawl
- wikipedia
- oscar
---
# Bangla BERT Base
It has been a long journey, but here is our **Bangla-Bert**! It is now available on the Hugging Face model hub.
[Bangla-Bert-Base](https://github.com/sagorbrur/bangla-bert) is a pretrained Bengali language model trained with masked language modeling, as described in [BERT](https://arxiv.org/abs/1810.04805) and its GitHub [repository](https://github.com/google-research/bert).
## Pretrain Corpus Details
The corpus was downloaded from two main sources:
* Bengali CommonCrawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)
After downloading these corpora, we preprocessed them into the BERT pretraining format: one sentence per line, with an extra newline between documents.
```
sentence 1
sentence 2
sentence 1
sentence 2
```
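A minimal sketch of this conversion step is shown below. It is not the exact preprocessing script; splitting sentences on the Bengali danda ("।") is a simplification, and the document list is a placeholder.

```py
# Sketch only: convert raw documents into the one-sentence-per-line
# BERT pretraining format, with a blank line between documents.
documents = [
    "আমি বাংলায় গান গাই। আমি বাংলার গান গাই।",   # placeholder document 1
    "বাংলা আমার মাতৃভাষা।",                        # placeholder document 2
]

with open("pretrain_corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        # one sentence per line (naive split on the Bengali danda)
        for sentence in doc.split("।"):
            sentence = sentence.strip()
            if sentence:
                f.write(sentence + "।\n")
        # an extra newline marks a document boundary
        f.write("\n")
```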
## Building Vocab
We used the [BNLP](https://github.com/sagorbrur/bnlp) package to train a Bengali SentencePiece model with a vocabulary size of 102,025. We then converted the output vocab file into the BERT format.
Our final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert) and also on the [Hugging Face](https://huggingface.co/sagorsarker/bangla-bert-base) model hub.
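For illustration, here is a rough sketch of training such a model with the `sentencepiece` library that BNLP builds on. The exact BNLP call and the conversion of the resulting `.vocab` file into BERT's `vocab.txt` format are not shown.

```py
# Sketch only: train a SentencePiece model on the preprocessed corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",  # corpus file from the previous step (assumed name)
    model_prefix="bn_spm",        # produces bn_spm.model and bn_spm.vocab
    vocab_size=102025,            # vocab size used for Bangla-Bert
)
```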
## Training Details
* Bangla-Bert was trained with the code provided in Google BERT's GitHub repository (https://github.com/google-research/bert)
* The currently released model follows the bert-base-uncased architecture (12 layers, 768 hidden units, 12 attention heads, 110M parameters)
* Total Training Steps: 1 Million
* The model was trained on a single Google Cloud TPU
## Evaluation Results
### LM Evaluation Results
After training for 1 million steps, here are the evaluation results.
```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) = 9.393331287442784
Loss for final step: 2.426227
```
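For reference, the perplexity figure in the log above is simply the exponential of the reported loss:

```py
# Perplexity is the exponential of the cross-entropy loss reported above.
import numpy

print(numpy.exp(2.2406516))  # ≈ 9.3933
```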
### Downstream Task Evaluation Results
Huge thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing the evaluation results for the classification tasks.
He used the [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for the classification tasks.
Compared to Nick's [Bengali Electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multilingual BERT, Bangla BERT Base achieves state-of-the-art results.
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).
| Model | Sentiment Analysis | Hate Speech Task | News Topic Task | Average |
| ----- | -------------------| ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
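If you want to fine-tune Bangla BERT Base on a classification task yourself, a minimal sketch with Hugging Face `transformers` is shown below. This is not the exact setup used for the benchmark above; the tiny in-memory dataset, labels, and hyperparameters are placeholders.

```py
# Sketch only: one illustrative fine-tuning step for sequence classification.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = BertForSequenceClassification.from_pretrained(
    "sagorsarker/bangla-bert-base", num_labels=2
)

# Placeholder examples; the real benchmark datasets are linked above.
texts = ["খুব ভালো একটি সিনেমা।", "একদম বাজে অভিজ্ঞতা।"]
labels = torch.tensor([1, 0])

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**enc, labels=labels)  # forward pass computes the loss
outputs.loss.backward()                # one illustrative training step
optimizer.step()
```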
You can also check the papers below, which used this model on their own datasets.
* [arXiv:2012.14353](https://arxiv.org/abs/2012.14353)
* [arxiv:2104.08613](https://arxiv.org/abs/2104.08613)
**NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.**
## How to Use
**Bangla BERT Tokenizer**
```py
from transformers import AutoTokenizer, AutoModel
bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
```
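The `AutoModel` imported above can also be used to extract contextual embeddings. A short sketch (not an official usage example) follows; the printed shape assumes a single input sentence.

```py
# Sketch only: extract contextual embeddings with AutoModel.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
model = AutoModel.from_pretrained("sagorsarker/bangla-bert-base")

inputs = tokenizer("আমি বাংলায় গান গাই।", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```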
**MASK Generation**
You can use this model directly with a pipeline for masked language modeling:
```py
from transformers import BertForMaskedLM, BertTokenizer, pipeline
model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)
# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```
## Author
[Sagor Sarker](https://github.com/sagorbrur)
## Acknowledgements
* Thanks to Google's [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc) for providing the free TPU credits.
* Thanks to everyone who has helped us build resources for Bengali.
## Reference
* https://github.com/google-research/bert
## Citation
If you find this model helpful, please cite it as follows.
```
@misc{Sagor_2020,
title = {BanglaBERT: Bengali Mask Language Model for Bengali Language Understanding},
author = {Sagor Sarker},
year = {2020},
url = {https://github.com/sagorbrur/bangla-bert}
}
```