hassiahk committed on
Commit 81f8b41 · 1 Parent(s): f064dc4

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -10,7 +10,7 @@ widget:
  Pretrained model on Hindi language using a masked language modeling (MLM) objective. [A more interactive & comparison demo is available here](https://huggingface.co/spaces/flax-community/roberta-hindi).
 
  > This is part of the
- [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
 
  ## Model description
 
@@ -62,8 +62,8 @@ The RoBERTa Hindi model was pretrained on the reunion of the following datasets:
  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
  the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked
  with `<s>` and the end of one by `</s>`.
- - We had to perform cleanup of **mC4** and **oscar** datasets by removing all non hindi(non Devanagiri) characters from the datasets.
- - We tried to filter out evaluation set of WikiNER of [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manual lablelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) where the actual labels were not correct and modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py).
 
 
  The details of the masking procedure for each sentence are the following:
@@ -78,7 +78,7 @@ The model was trained on Google Cloud Engine TPUv3-8 machine (with 335 GB of RAM
 
  ## Evaluation Results
 
- RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.
 
  | Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
  |-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
@@ -96,6 +96,6 @@ RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below
 
 
  ## Credits
- Huge thanks to Huggingface 🤗 & Google Jax/Flax team for such a wonderful community week. Especially for providing such massive computing resource. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.
 
  <img src=https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium>
 
  Pretrained model on Hindi language using a masked language modeling (MLM) objective. [A more interactive & comparison demo is available here](https://huggingface.co/spaces/flax-community/roberta-hindi).
 
  > This is part of the
+ [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [Hugging Face](https://huggingface.co/) and TPU usage sponsored by Google.
 
  ## Model description
 
 
  The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50265. The inputs of
  the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked
  with `<s>` and the end of one by `</s>`.
+ - We had to perform cleanup of **mC4** and **oscar** datasets by removing all non hindi (non Devanagari) characters from the datasets.
+ - We tried to filter out evaluation set of WikiNER of [IndicGlue](https://indicnlp.ai4bharat.org/indic-glue/) benchmark by [manual labelling](https://github.com/amankhandelia/roberta_hindi/blob/master/wikiner_incorrect_eval_set.csv) where the actual labels were not correct and modifying the [downstream evaluation dataset](https://github.com/amankhandelia/roberta_hindi/blob/master/utils.py).
 
 
  The details of the masking procedure for each sentence are the following:
 
 
  ## Evaluation Results
 
+ RoBERTa Hindi is evaluated on various downstream tasks. The results are summarized below.
 
  | Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
  |-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
 
 
 
  ## Credits
+ Huge thanks to Hugging Face 🤗 & Google Jax/Flax team for such a wonderful community week, especially for providing such massive computing resources. Big thanks to [Suraj Patil](https://huggingface.co/valhalla) & [Patrick von Platen](https://huggingface.co/patrickvonplaten) for mentoring during the whole week.
 
  <img src=https://pbs.twimg.com/media/E443fPjX0AY1BsR.jpg:medium>
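
Note on the cleanup bullet touched by this commit: the README says that all non-Hindi (non-Devanagari) characters were removed from **mC4** and **oscar**, but it does not show how. The following is only a minimal sketch of what such a filter could look like, not the authors' actual code. It assumes that "Devanagari" means the Unicode Devanagari block (U+0900 to U+097F) and that whitespace is kept; the function name `keep_devanagari` is hypothetical.

```python
import re

# Assumption: "Devanagari" is taken to be the Unicode block U+0900-U+097F.
# Whitespace is preserved so that word boundaries survive the filter.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]+")
EXTRA_SPACES = re.compile(r"[ \t]{2,}")


def keep_devanagari(text: str) -> str:
    """Drop characters outside the Devanagari block, then collapse spaces.

    Hypothetical helper for illustration; the project's real cleanup
    may treat digits, punctuation, or other scripts differently.
    """
    stripped = NON_DEVANAGARI.sub("", text)
    return EXTRA_SPACES.sub(" ", stripped).strip()


if __name__ == "__main__":
    print(keep_devanagari("नमस्ते world 123 दुनिया"))
```

Under these assumptions, Latin letters and ASCII digits are dropped, while the danda (`।`) and Devanagari digits are kept because they lie inside the same Unicode block.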