Update README.md
README.md CHANGED
@@ -10,7 +10,7 @@ widget:
Pretrained model on Hindi language using a masked language modeling (MLM) objective. [A more interactive & comparison demo is available here](https://huggingface.co/spaces/flax-community/roberta-hindi).
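A minimal fill-mask sketch of the MLM objective described above; the repo id `flax-community/roberta-hindi` is an assumption inferred from the demo Space URL and may differ from the published checkpoint name.

```python
from transformers import pipeline

# Minimal MLM usage sketch. The repo id is assumed from the demo Space
# (https://huggingface.co/spaces/flax-community/roberta-hindi); adjust it if the
# checkpoint is published under a different name.
fill_mask = pipeline("fill-mask", model="flax-community/roberta-hindi")

# RoBERTa-style tokenizers use <mask> as the mask token.
for prediction in fill_mask("मुझे उनसे बात करना <mask> अच्छा लगा।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```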
> This is part of the
-[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [
+[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [Hugging Face](https://huggingface.co/) and TPU usage sponsored by Google.

## Model description
@@ -62,8 +62,8 @@ The RoBERTa Hindi model was pretrained on the reunion of the following datasets:
The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked
with `<s>` and the end of one by `</s>`.
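As a rough illustration of the tokenizer behaviour described above (byte-level BPE, 50265-entry vocabulary, `<s>`/`</s>` markers), assuming the same hypothetical repo id as in the fill-mask sketch:

```python
from transformers import AutoTokenizer

# Sketch only: the repo id is an assumption, see the fill-mask example above.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
print(tokenizer.vocab_size)  # expected to be 50265 per the description above

encoded = tokenizer("नमस्ते दुनिया", truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # ['<s>', ..., '</s>']
```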
-- We had to perform cleanup of **mC4** and **oscar** datasets by removing all non hindi(non
-- We tried to filter out evaluation set of WikiNER of [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manual
+- We had to clean up the **mC4** and **oscar** datasets by removing all non-Hindi (non-Devanagari) characters.
+- We tried to filter the WikiNER evaluation set of the [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manually labelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) examples whose labels were incorrect and modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py) accordingly.
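The cleanup bullet above only states that non-Devanagari characters were removed; the actual script is not part of this diff, so the following is a rough sketch of that kind of filter (which characters besides the Devanagari block are kept, e.g. digits and punctuation, is an assumption).

```python
import re

# Rough sketch of a Devanagari-only cleanup step, as described in the bullet above.
# Keeps the Devanagari block (U+0900-U+097F, which includes the danda "।"),
# ASCII digits, basic punctuation and whitespace; everything else is dropped.
# The real mC4/OSCAR cleanup may have used different rules.
NON_HINDI = re.compile(r"[^\u0900-\u097F0-9\s.,!?\-]")

def clean_text(text: str) -> str:
    return NON_HINDI.sub("", text)

print(clean_text("यह एक sample वाक्य है।"))  # -> "यह एक  वाक्य है।"
```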
The details of the masking procedure for each sentence are the following:
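The masking percentages themselves sit in an unchanged part of the README that this diff does not show; as a hedged sketch, a standard RoBERTa-style dynamic-masking setup would look roughly like this (the 15% value is the usual default, not a figure confirmed by this excerpt).

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hedged sketch: mlm_probability=0.15 is the common RoBERTa/BERT default,
# not a value taken from this README excerpt. The repo id is assumed as above.
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```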
@@ -78,7 +78,7 @@ The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM
## Evaluation Results

-RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.
+RoBERTa Hindi is evaluated on various downstream tasks. The results are summarized below.

| Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
|-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
@@ -96,6 +96,6 @@ RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below
## Credits
-Huge thanks to
+Huge thanks to the Hugging Face 🤗 & Google JAX/Flax team for such a wonderful community week, and especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.

<img src="https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium">