Update README.md
README.md CHANGED
@@ -10,7 +10,7 @@ widget:
Pretrained model on Hindi language using a masked language modeling (MLM) objective. [A more interactive & comparison demo is available here](https://huggingface.co/spaces/flax-community/roberta-hindi).
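A minimal fill-mask sketch of the MLM objective described above; the repo id `flax-community/roberta-hindi` is an assumption inferred from the demo Space URL and may differ from the published checkpoint name.

```python
from transformers import pipeline

# Minimal MLM usage sketch. The repo id is assumed from the demo Space
# (https://huggingface.co/spaces/flax-community/roberta-hindi); adjust it if the
# checkpoint is published under a different name.
fill_mask = pipeline("fill-mask", model="flax-community/roberta-hindi")

# RoBERTa-style tokenizers use <mask> as the mask token.
for prediction in fill_mask("मुझे उनसे बात करना <mask> अच्छा लगा।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```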
> This is part of the
-[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [
+[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [Hugging Face](https://huggingface.co/) and TPU usage sponsored by Google.

## Model description
@@ -62,8 +62,8 @@ The RoBERTa Hindi model was pretrained on the reunion of the following datasets:
The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
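As a rough illustration of the tokenizer behaviour described above (byte-level BPE, 50265-entry vocabulary, `<s>`/`</s>` markers), assuming the same hypothetical repo id as in the fill-mask sketch:

```python
from transformers import AutoTokenizer

# Sketch only: the repo id is an assumption, see the fill-mask example above.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
print(tokenizer.vocab_size)  # expected to be 50265 per the description above

encoded = tokenizer("नमस्ते दुनिया", truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # ['<s>', ..., '</s>']
```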
-- We had to perform cleanup of **mC4** and **oscar** datasets by removing all non hindi(non
-- We tried to filter out evaluation set of WikiNER of [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manual
+- We had to clean up the **mC4** and **oscar** datasets by removing all non-Hindi (non-Devanagari) characters.
+- We tried to filter the WikiNER evaluation set of the [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manually labelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) examples whose labels were incorrect and modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py) accordingly.
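The cleanup bullet above only states that non-Devanagari characters were removed; the actual script is not part of this diff, so the following is a rough sketch of that kind of filter (which characters besides the Devanagari block are kept, e.g. digits and punctuation, is an assumption).

```python
import re

# Rough sketch of a Devanagari-only cleanup step, as described in the bullet above.
# Keeps the Devanagari block (U+0900-U+097F, which includes the danda "।"),
# ASCII digits, basic punctuation and whitespace; everything else is dropped.
# The real mC4/OSCAR cleanup may have used different rules.
NON_HINDI = re.compile(r"[^\u0900-\u097F0-9\s.,!?\-]")

def clean_text(text: str) -> str:
    return NON_HINDI.sub("", text)

print(clean_text("यह एक sample वाक्य है।"))  # -> "यह एक  वाक्य है।"
```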
The details of the masking procedure for each sentence are the following:
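The masking percentages themselves sit in an unchanged part of the README that this diff does not show; as a hedged sketch, a standard RoBERTa-style dynamic-masking setup would look roughly like this (the 15% value is the usual default, not a figure confirmed by this excerpt).

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hedged sketch: mlm_probability=0.15 is the common RoBERTa/BERT default,
# not a value taken from this README excerpt. The repo id is assumed as above.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```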
@@ -78,7 +78,7 @@ The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM
## Evaluation Results

-RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.
+RoBERTa Hindi is evaluated on various downstream tasks. The results are summarized below.

| Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
|-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
@@ -96,6 +96,6 @@ RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below
## Credits
-Huge thanks to
+Huge thanks to the Hugging Face 🤗 & Google JAX/Flax team for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.

<img src="https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium">