Update README.md
README.md
CHANGED
@@ -32,19 +32,19 @@ widget:
 
 <h3>Introduction</h3>
 
-This model is a <b>lightweight</b> and uncased version of <b>BERT</b> <b>[1]</b> for the <b>
+This model is a <b>lightweight</b> and uncased version of <b>BERT</b> <b>[1]</b> for the <b>Italian</b> language. Its <b>55M parameters</b> and <b>220MB</b> size make it
 <b>50% lighter</b> than a typical mono-lingual BERT model. It is ideal when memory consumption and execution speed are critical while maintaining high-quality results.
 
 
 <h3>Model description</h3>
 
 The model builds on the multilingual <b>DistilBERT</b> <b>[2]</b> model (from the HuggingFace team: [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)) as a starting point,
-focusing it on the
+focusing it on the Italian language while at the same time turning it into an uncased model by modifying the embedding layer
 (as in <b>[3]</b>, but computing document-level frequencies over the <b>Wikipedia</b> dataset and setting a frequency threshold of 0.1%), which brings a considerable
 reduction in the number of parameters.
 
 To compensate for the deletion of cased tokens, which now forces the model to exploit lowercase representations of words previously capitalized,
-the model has been further pre-trained on the
+the model has been further pre-trained on the Italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [4]</b> technique to make it more robust
 to the new uncased representations.
 
 The resulting model has 55M parameters, a vocabulary of 13.832 tokens, and a size of 220MB, which makes it <b>50% lighter</b> than a typical mono-lingual BERT model and
@@ -53,7 +53,7 @@ The resulting model has 55M parameters, a vocabulary of 13.832 tokens, and a siz
 
 <h3>Training procedure</h3>
 
-The model has been trained for <b>masked language modeling</b> on the
+The model has been trained for <b>masked language modeling</b> on the Italian <b>Wikipedia</b> (~3GB) dataset for 10K steps, using the AdamW optimizer, with a batch size of 512
 (obtained through 128 gradient accumulation steps),
 a sequence length of 512, and a linearly decaying learning rate starting from 5e-5. The training has been performed using <b>dynamic masking</b> between epochs and
 exploiting the <b>whole word masking</b> technique.
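
For quick reference, the lightweight uncased checkpoint described in the Introduction can be exercised with a fill-mask pipeline. The sketch below assumes the `transformers` library; the model id is a placeholder, since the diff does not name the published repository.

```python
from transformers import pipeline

# Placeholder id: the diff does not name the published repository,
# so substitute the actual model id from the Hugging Face Hub.
MODEL_ID = "your-org/bert-italian-uncased-lightweight"

# The model is uncased, so lowercase input is the intended usage.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

for prediction in fill_mask("roma è la [MASK] d'italia."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```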
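
The vocabulary reduction described under "Model description" (keep only tokens whose document-level frequency over Wikipedia exceeds 0.1%, then shrink the embedding layer accordingly) can be pictured with the rough sketch below. It is not the authors' exact procedure from [3]: the two-sentence corpus merely stands in for the Wikipedia dump, and rebuilding a matching tokenizer and re-tying the MLM head are left out.

```python
from collections import Counter

from transformers import AutoModel, AutoTokenizer

DOC_FREQ_THRESHOLD = 0.001  # the 0.1% document-frequency threshold from the README

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased")

# Placeholder corpus: stands in for the Italian Wikipedia articles used by the authors.
documents = [
    "Roma è la capitale d'Italia.",
    "La Divina Commedia è un poema di Dante Alighieri.",
]

# Document-level frequency: in how many documents does each (lowercased) token occur?
doc_freq = Counter()
for doc in documents:
    doc_freq.update(set(tokenizer.tokenize(doc.lower())))  # lowercase for the uncased target

kept_tokens = set(tokenizer.all_special_tokens)
kept_tokens.update(
    tok for tok, n in doc_freq.items() if n / len(documents) >= DOC_FREQ_THRESHOLD
)
kept_ids = sorted(tokenizer.convert_tokens_to_ids(t) for t in kept_tokens)

# Slice the word-embedding matrix down to the reduced vocabulary; a matching
# tokenizer and a re-tied output projection are still required to use the model.
embeddings = model.get_input_embeddings().weight.detach()
reduced_embeddings = embeddings[kept_ids]
print(f"kept {reduced_embeddings.shape[0]} of {embeddings.shape[0]} token embeddings")
```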
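
The training procedure maps naturally onto the Hugging Face `Trainer`; the sketch below shows one way to wire up the stated hyperparameters (masked language modeling with whole word masking, 10K steps, AdamW, an effective batch size of 512 via 128 gradient accumulation steps, sequence length 512, and a linearly decaying 5e-5 learning rate). The authors' actual script is not part of this diff; the per-device batch size, the 15% masking rate, and the placeholder corpus and starting checkpoint are assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

# Placeholder corpus: stands in for the Italian Wikipedia dump (~3GB) named in the README.
texts = ["Roma è la capitale d'Italia.", "Il Colosseo è un anfiteatro romano."]

# For a self-contained sketch we start from multilingual DistilBERT; the authors
# start from their already-pruned uncased variant instead.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-multilingual-cased")

def tokenize(batch):
    # Lowercase to match the uncased setup; fixed sequence length of 512.
    return tokenizer(
        [t.lower() for t in batch["text"]],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

# Whole word masking applied on the fly, which also gives dynamic masking:
# every pass over the data draws fresh masks. 15% is the conventional BERT
# rate (not stated in the README).
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-italian-uncased-mlm",
    max_steps=10_000,                 # 10K training steps
    per_device_train_batch_size=4,    # 4 x 128 accumulation steps = batch size 512
    gradient_accumulation_steps=128,
    learning_rate=5e-5,
    lr_scheduler_type="linear",       # linearly decaying learning rate
    optim="adamw_torch",              # AdamW optimizer
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```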
|