sagorsarker committed
Commit: 81b199f
Parent(s): 42b19ef
added wikiann evaluation results

README.md CHANGED
It has been a long journey, but here is our **Bangla-Bert**! It is now available on Hugging Face.

## Pretrain Corpus Details
Corpus was downloaded from two main sources:

* Bengali commoncrawl corpus downloaded from [OSCAR](https://oscar-corpus.com/)
* [Bengali Wikipedia Dump Dataset](https://dumps.wikimedia.org/bnwiki/latest/)

After downloading these corpora, we preprocessed them into BERT pretraining format: one sentence per line, with a blank line between documents.

```
sentence 1
sentence 2

sentence 1 of new document
```
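
As an illustration, a minimal converter into that format (a sketch only; the danda-based sentence splitter and the sample documents are stand-ins, not the actual preprocessing script):

```python
# Sketch: convert raw documents into the pretraining format described above,
# one sentence per line with a blank line between documents. The naive
# danda-based splitter and the sample documents are illustrative only.
import re

def to_bert_format(documents):
    for doc in documents:
        # Split on the Bengali danda and common sentence-ending punctuation.
        for sentence in (s.strip() for s in re.split(r"[।?!]", doc)):
            if sentence:
                yield sentence
        yield ""  # a blank line marks a document boundary

docs = ["আমি বাংলায় গান গাই। আমি বাংলার গান গাই।", "বাংলা আমার মাতৃভাষা।"]
print("\n".join(to_bert_format(docs)))
```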

Our final vocab file is available at [https://github.com/sagorbrur/bangla-bert](https://github.com/sagorbrur/bangla-bert).

## Evaluation Results

### LM Evaluation Results
After training for 1 million steps, here are the evaluation results:

```
global_step = 1000000
...
Loss for final step: 2.426227
```

### Downstream Task Evaluation Results

- Evaluation on Bengali Classification Benchmark Datasets

Huge thanks to [Nick Doiron](https://twitter.com/mapmeld) for providing the evaluation results for the classification task.
He used the [Bengali Classification Benchmark](https://github.com/rezacsedu/Classification_Benchmarks_Benglai_NLP) datasets for the classification task.
Compared to Nick's [Bengali Electra](https://huggingface.co/monsoon-nlp/bangla-electra) and multilingual BERT, Bangla BERT Base achieves state-of-the-art results.
Here is the [evaluation script](https://github.com/sagorbrur/bangla-bert/blob/master/notebook/bangla-bert-evaluation-classification-task.ipynb); a condensed sketch of the same kind of fine-tune follows the results table below.

| Model | Sentiment Analysis Task | Hate Speech Task | News Topic Task | Average |
| ----- | ----------------------- | ---------------- | --------------- | ------- |
| mBERT | 68.15 | 52.32 | 72.27 | 64.25 |
| Bengali Electra | 69.19 | 44.84 | 82.33 | 65.45 |
| Bangla BERT Base | 70.37 | 71.83 | 89.19 | 77.13 |
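
For orientation, this is roughly the shape of such a fine-tune with `transformers`; a sketch under assumptions rather than the notebook itself: the CSV file names, `num_labels`, and hyperparameters below are placeholders.

```python
# Sketch: fine-tune Bangla BERT for text classification with the Hugging Face
# Trainer. File names, num_labels, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "sagorsarker/bangla-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is a placeholder; set it to the benchmark task's class count.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder CSV files with "text" and "label" columns.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bangla-bert-clf", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```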

- Evaluation on [Wikiann](https://huggingface.co/datasets/wikiann) Datasets

We evaluated `Bangla-BERT-Base` on the [Wikiann](https://huggingface.co/datasets/wikiann) Bengali NER dataset along with three other benchmark models (mBERT, XLM-R, and Indic-BERT).
`Bangla-BERT-Base` placed third, with `mBERT` first and `XLM-R` second, after training each model for 5 epochs.

| Base Pre-trained Model | F1 Score | Accuracy |
| ---------------------- | -------- | -------- |
| [mBERT-uncased](https://huggingface.co/bert-base-multilingual-uncased) | 97.11 | 97.68 |
| [XLM-R](https://huggingface.co/xlm-roberta-base) | 96.22 | 97.03 |
| [Indic-BERT](https://huggingface.co/ai4bharat/indic-bert) | 92.66 | 94.74 |
| Bangla-BERT-Base | 95.57 | 97.49 |

All four models were trained with the [transformers token-classification](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb) notebook.
You can find all models' evaluation results [here](https://github.com/sagorbrur/bangla-bert/tree/master/evaluations/wikiann).
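
A condensed sketch of that recipe (the tag-to-subword alignment follows the notebook's approach; the hyperparameters here are illustrative, not the exact settings used):

```python
# Sketch: fine-tune and evaluate a checkpoint on Wikiann Bengali NER,
# following the Hugging Face token-classification recipe.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

model_name = "sagorsarker/bangla-bert-base"
dataset = load_dataset("wikiann", "bn")
labels = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

def tokenize_and_align(examples):
    # Tokenize pre-split words and align NER tags to sub-word pieces;
    # special tokens and continuation pieces get the ignore index -100.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        previous = None
        label_ids = []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

encoded = dataset.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("wikiann-bn", num_train_epochs=5, learning_rate=2e-5),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())
```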

Also, check the paper list below; these works used this model on their datasets.
* [arXiv:2012.14353](https://arxiv.org/abs/2012.14353)
* [arXiv:2104.08613](https://arxiv.org/abs/2104.08613)

**NB: If you use this model for any NLP task, please share the evaluation results with us. We will add them here.**


## How to Use
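
A minimal fill-mask example, assuming the `sagorsarker/bangla-bert-base` checkpoint id on the Hub:

```python
# Sketch: masked-token prediction with the fill-mask pipeline.
# The checkpoint id "sagorsarker/bangla-bert-base" is an assumption.
from transformers import pipeline

nlp = pipeline("fill-mask", model="sagorsarker/bangla-bert-base")

# "আমি বাংলায় [MASK] গাই" -- the model should rank "গান" (song) highly.
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই"):
    print(pred["token_str"], round(pred["score"], 4))
```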

* https://github.com/google-research/bert

## Citation
If you find this model helpful, please cite:

```
@misc{Sagor_2020,
|