sagorsarker committed on
Commit
81b199f
1 Parent(s): 42b19ef

added wikiann evaluation results

Files changed (1)
  1. README.md +26 -9
README.md CHANGED
@@ -23,10 +23,10 @@ A long way passed. Here is our **Bangla-Bert**! It is now available in huggingfa
23
  ## Pretrain Corpus Details
24
  Corpus was downloaded from two main sources:
25
 
26
- * Bengali commoncrawl copurs downloaded from [OSCAR](https://oscar-corpus.com/)
27
  * [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)
28
 
29
- After downloading these corpus, we preprocessed it as a Bert format. which is one sentence per line and an extra newline for new documents.
30
 
31
  ```
32
  sentence 1
@@ -50,7 +50,7 @@ Our final vocab file availabe at [https://github.com/sagorbrur/bangla-bert](http
50
  ## Evaluation Results
51
 
52
  ### LM Evaluation Results
53
- After training 1 millions steps here is the evaluation resutls.
54
 
55
  ```
56
  global_step = 1000000
@@ -65,9 +65,11 @@ Loss for final step: 2.426227
65
  ```
66
 
67
  ### Downstream Task Evaluation Results
68
- Huge Thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing evalution results of classification task.
69
- He used [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for classification task.
70
- Comparing to Nick's [Bengali electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multi-lingual BERT, Bangla BERT Base achieves state of the art result.
71
  Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).
72
 
73
 
@@ -77,11 +79,26 @@ Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/ma
77
  | Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
78
  | Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
79
 
80
- Also you can check these below paper list. They used this model on their datasets.
81
  * [arXiv:2012.14353](https://arxiv.org/abs/2012.14353)
82
  * [arxiv:2104.08613](https://arxiv.org/abs/2104.08613)
83
 
84
- **NB: If you use this model for any nlp task please share evaluation results with us. We will add it here.**
85
 
86
 
87
  ## How to Use
@@ -127,7 +144,7 @@ for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গা
127
  * https://github.com/google-research/bert
128
 
129
  ## Citation
130
- If you find this model helpful, please cite this.
131
 
132
  ```
133
  @misc{Sagor_2020,
 
23
  ## Pretrain Corpus Details
24
  Corpus was downloaded from two main sources:
25
 
26
+ * Bengali commoncrawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
27
  * [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)
28
 
29
+ After downloading these corpora, we preprocessed them into the BERT format: one sentence per line, with an extra blank line between documents.
30
 
31
  ```
32
  sentence 1
 
50
  ## Evaluation Results
51
 
52
  ### LM Evaluation Results
53
+ After training for 1 million steps, here are the evaluation results.
54
 
55
  ```
56
  global_step = 1000000
 
65
  ```
66
 
67
  ### Downstream Task Evaluation Results
68
+ - Evaluation on Bengali Classification Benchmark Datasets
69
+
70
+ Huge Thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing evaluation results of the classification task.
71
+ He used [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for the classification task.
72
+ Compared to Nick's [Bengali electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multi-lingual BERT, Bangla BERT Base achieves state-of-the-art results.
73
  Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb).
74
 
75
 
 
79
  | Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
80
  | Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
81
 
82
+ - Evaluation on [Wikiann](https://huggingface.co/datasets/wikiann) Datasets
83
+
84
+ We evaluated `Bangla-BERT-Base` on the [Wikiann](https://huggingface.co/datasets/wikiann) Bengali NER dataset along with three other benchmark models (mBERT, XLM-R, Indic-BERT). <br>
85
+ After training each model for 5 epochs, `Bangla-BERT-Base` placed third by F1 score, behind `mBERT` (first) and `XLM-R` (second).
86
+
87
+ | Base Pre-trained Model | F1 Score | Accuracy |
88
+ | ----- | -------------------| ---------------- |
89
+ | [mBERT-uncased](https://huggingface.co/bert-base-multilingual-uncased) | 97.11 | 97.68 |
90
+ | [XLM-R](https://huggingface.co/xlm-roberta-base) | 96.22 | 97.03 |
91
+ | [Indic-BERT](https://huggingface.co/ai4bharat/indic-bert)| 92.66 | 94.74 |
92
+ | Bangla-BERT-Base | 95.57 | 97.49 |
93
+
94
+ All four models were trained with the [transformers-token-classification](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb) notebook.
95
+ You can find the evaluation results for all models [here](https://github.com/sagorbrur/bangla-bert/tree/master/evaluations/wikiann).
96
+
97
+ Also, you can check the papers listed below, which used this model on their datasets.
98
  * [arXiv:2012.14353](https://arxiv.org/abs/2012.14353)
99
  * [arxiv:2104.08613](https://arxiv.org/abs/2104.08613)
100
 
101
+ **NB: If you use this model for any NLP task, please share your evaluation results with us. We will add them here.**
102
 
103
 
104
  ## How to Use
 
144
  * https://github.com/google-research/bert
145
 
146
  ## Citation
147
+ If you find this model helpful, please cite.
148
 
149
  ```
150
  @misc{Sagor_2020,