Update README.md
README.md
# Model Card for EntityCS-39-MLM-xlmr-base

- Paper: https://aclanthology.org/2022.findings-emnlp.499.pdf
- Repository: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
- Point of Contact: [Fenia Christopoulou](mailto:[email protected]), [Chenxi Whitehouse](mailto:[email protected])

This model has been trained on the EntityCS corpus, an English corpus built from Wikipedia in which entities are replaced with their counterparts in other languages.
The corpus is available at [https://huggingface.co/huawei-noah/entity_cs](https://huggingface.co/huawei-noah/entity_cs); see that page for more details.

To train models on the corpus, we first employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
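
As a concrete illustration, the following is a minimal sketch of this 80-10-10 scheme with the XLM-R tokenizer. It follows the standard BERT-style masking (essentially what `DataCollatorForLanguageModeling` in `transformers` does) and is not the repository's exact implementation.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def mask_tokens(input_ids: torch.Tensor, mlm_probability: float = 0.15):
    """80-10-10 masking: 15% of subwords become candidates, of which
    80% -> <mask>, 10% -> random subword, 10% unchanged ("Same")."""
    labels = input_ids.clone()

    # Sample candidate positions; special tokens are never masked.
    probabilities = torch.full(labels.shape, mlm_probability)
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids.tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    )
    probabilities.masked_fill_(special, value=0.0)
    candidates = torch.bernoulli(probabilities).bool()
    labels[~candidates] = -100  # loss is computed only on candidate positions

    # 80% of candidates: replace with the mask token.
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & candidates
    input_ids[masked] = tokenizer.mask_token_id

    # 10% of candidates (half of the remaining 20%): replace with a random subword.
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & candidates & ~masked
    input_ids[randomized] = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)[randomized]

    # The remaining 10% of candidates keep their original subword.
    return input_ids, labels

ids = torch.tensor(tokenizer("EntityCS code-switches entities into other languages.")["input_ids"])
corrupted, labels = mask_tokens(ids.clone())
print(tokenizer.decode(corrupted))
```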

To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging to entities.

This model was trained with the **MLM** objective on the EntityCS corpus with 39 languages.

## Training Details

We start from the [XLM-R-base](https://huggingface.co/xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
We set the batch size to 16 and the gradient accumulation steps to 2, resulting in an effective batch size of 256.
For speed-up, we use fp16 mixed precision.
We use the sampling strategy proposed by [Conneau and Lample (2019)](https://dl.acm.org/doi/pdf/10.5555/3454287.3454921), where high-resource languages are down-sampled and low-resource languages are sampled more frequently.
We train only the embedding layer and the last two layers of the model.
We randomly choose 100 sentences from each language to serve as a validation set, on which we measure perplexity every 10K training steps.
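
The full training scripts are in the repository linked above; the snippet below is only a rough sketch of the partial-freezing setup described here, assuming that "the embedding layer and the last two layers" corresponds to `roberta.embeddings` and the last two blocks of `roberta.encoder.layer` in the `transformers` implementation of XLM-R.

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Freeze everything, then unfreeze the embeddings and the last two encoder layers.
for param in model.parameters():
    param.requires_grad = False
for module in (model.roberta.embeddings, *model.roberta.encoder.layer[-2:]):
    for param in module.parameters():
        param.requires_grad = True

# The MLM head's decoder weight is tied to the (now trainable) input embeddings;
# its remaining parameters stay frozen in this sketch.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")

# A per-GPU batch size of 16 with gradient accumulation of 2 on 8 GPUs gives
# the effective batch size reported above: 16 * 2 * 8 = 256.
```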

The model can also be used directly (without fine-tuning) for probing tasks, i.e. predicting missing words, as in [X-FACTR](https://aclanthology.org/2020.emnlp-main.479/).

For results on each downstream task, please refer to the paper.

## How to Get Started with the Model

Use the code in the following repository to get started with the model: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
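
As a quick sanity check that the checkpoint loads, a fill-mask call along the lines below should work. Note that the hub id is inferred from this card's title and may need adjusting.

```python
from transformers import pipeline

# Assumed hub id, inferred from the model card title; adjust if the checkpoint
# is hosted under a different name.
fill_mask = pipeline("fill-mask", model="huawei-noah/EntityCS-39-MLM-xlmr-base")

# XLM-R uses <mask> as its mask token.
print(fill_mask("The capital of France is <mask>."))
```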

## Citation

**APA:**

```
Whitehouse, C., Christopoulou, F., & Iacobacci, I. (2022). EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 6698–6714). Association for Computational Linguistics.
```
|