sagorsarker committed
Commit: 81b199f
Parent(s): 42b19ef
added wikiann evaluation results

README.md CHANGED
It has been a long journey, but here is our **Bangla-Bert**! It is now available on Hugging Face.

## Pretrain Corpus Details
Corpus was downloaded from two main sources:

* Bengali commoncrawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)

After downloading these corpora, we preprocessed them into BERT pretraining format: one sentence per line, with a blank line between documents.

```
sentence 1
sentence 2

sentence 1 of new document
```
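
As an illustration, a minimal converter into that format (a sketch only; the danda-based sentence splitter and the sample documents are stand-ins, not the actual preprocessing script):

```python
# Sketch: convert raw documents into the pretraining format described above,
# one sentence per line with a blank line between documents. The naive
# danda-based splitter and the sample documents are illustrative only.
import re

def to_bert_format(documents):
    for doc in documents:
        # Split on the Bengali danda and common sentence-ending punctuation.
        for sentence in (s.strip() for s in re.split(r"[।?!]", doc)):
            if sentence:
                yield sentence
        yield ""  # a blank line marks a document boundary

docs = ["আমি বাংলায় গান গাই। আমি বাংলার গান গাই।", "বাংলা আমার মাতৃভাষা।"]
print("\n".join(to_bert_format(docs)))
```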

Our final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert).

## Evaluation Results

### LM Evaluation Results
After training for 1 million steps, here are the evaluation results:

```
global_step = 1000000
...
Loss for final step: 2.426227
```

### Downstream Task Evaluation Results

- Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing the evaluation results for the classification task.
He used the [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for the classification task.
Compared to Nick's [Bengali Electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multilingual BERT, Bangla BERT Base achieves state-of-the-art results.
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb); a condensed sketch of the same kind of fine-tune follows the results table below.

| Model | Sentiment Analysis Task | Hate Speech Task | News Topic Task | Average |
| ----- | ----------------------- | ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
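
For orientation, this is roughly the shape of such a fine-tune with `transformers`; a sketch under assumptions rather than the notebook itself: the CSV file names, `num_labels`, and hyperparameters below are placeholders.

```python
# Sketch: fine-tune Bangla BERT for text classification with the Hugging Face
# Trainer. File names, num_labels, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is a placeholder; set it to the benchmark task's class count.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder CSV files with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bangla-bert-clf", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```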

- Evaluation on [Wikiann](https://huggingface.co/datasets/wikiann) Datasets

We evaluated `Bangla-BERT-Base` on the [Wikiann](https://huggingface.co/datasets/wikiann) Bengali NER dataset along with three other benchmark models (mBERT, XLM-R, and Indic-BERT).
`Bangla-BERT-Base` placed third, with `mBERT` first and `XLM-R` second, after training each model for 5 epochs.

| Base Pre-trained Model | F1 Score | Accuracy |
| ---------------------- | -------- | -------- |
| [mBERT-uncased](https://huggingface.co/bert-base-multilingual-uncased) | 97.11 | 97.68 |
| [XLM-R](https://huggingface.co/xlm-roberta-base) | 96.22 | 97.03 |
| [Indic-BERT](https://huggingface.co/ai4bharat/indic-bert) | 92.66 | 94.74 |
| Bangla-BERT-Base | 95.57 | 97.49 |

All four models were trained with the [transformers token-classification](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb) notebook.
You can find all models' evaluation results [here](https://github.com/sagorbrur/bangla-bert/tree/master/evaluations/wikiann).
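
A condensed sketch of that recipe (the tag-to-subword alignment follows the notebook's approach; the hyperparameters here are illustrative, not the exact settings used):

```python
# Sketch: fine-tune and evaluate a checkpoint on Wikiann Bengali NER,
# following the Hugging Face token-classification recipe.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

model_name = "sagorsarker/bangla-bert-base"
dataset = load_dataset("wikiann", "bn")
labels = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

def tokenize_and_align(examples):
    # Tokenize pre-split words and align NER tags to sub-word pieces;
    # special tokens and continuation pieces get the ignore index -100.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        previous = None
        label_ids = []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

encoded = dataset.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("wikiann-bn", num_train_epochs=5, learning_rate=2e-5),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())
```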

Also, check the paper list below; these works used this model on their datasets.
* [arXiv:2012.14353](https://arxiv.org/abs/2012.14353)
* [arXiv:2104.08613](https://arxiv.org/abs/2104.08613)

**NB: If you use this model for any NLP task, please share the evaluation results with us. We will add them here.**


## How to Use
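
A minimal fill-mask example, assuming the `sagorsarker/bangla-bert-base` checkpoint id on the Hub:

```python
# Sketch: masked-token prediction with the fill-mask pipeline.
# The checkpoint id "sagorsarker/bangla-bert-base" is an assumption.
from transformers import pipeline

nlp = pipeline("fill-mask", model="sagorsarker/bangla-bert-base")

# "আমি বাংলায় [MASK] গাই" -- the model should rank "গান" (song) highly.
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই"):
    print(pred["token_str"], round(pred["score"], 4))
```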

* https://github.com/google-research/bert

## Citation
If you find this model helpful, please cite:

```
@misc{Sagor_2020,
|