---
license: apache-2.0
datasets:
- wikipedia
language:
- it
widget:
- text: "milano è una [MASK] dell'italia"
  example_title: "Example 1"
- text: "il sole è una [MASK] della via lattea"
  example_title: "Example 2"
- text: "l'italia è una [MASK] dell'unione europea"
  example_title: "Example 3"
---
--------------------------------------------------------------------------------------------------

<h1>Model: BLAZE 🔥</h1>
<p>Language: IT<br>Version: 𝖬𝖪-𝖨</p>

--------------------------------------------------------------------------------------------------

<h3>Introduction</h3>

This model is a <b>lightweight</b> and uncased version of <b>BERT</b> <b>[1]</b> for the <b>Italian</b> language. With its <b>55M parameters</b> and <b>220MB</b> size,
it is <b>50% lighter</b> than a typical monolingual BERT model, making it ideal
for situations where memory consumption and execution speed are critical, while still maintaining high-quality results.

<h3>Model description</h3>

The model has been obtained by taking the multilingual <b>DistilBERT</b> <b>[2]</b> model (from the Hugging Face team: [distilbert-base-multilingual-cased](https://huggingface.co/distilbert-base-multilingual-cased)) as a starting point,
and then focusing it on the Italian language while turning it into an uncased model by modifying the embedding layer
(as in <b>[3]</b>, but computing document-level frequencies over the <b>Wikipedia</b> dataset and setting a frequency threshold of 0.1%), which brings a considerable
reduction in the number of parameters.
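
As an illustration, a simplified sketch of this frequency-based vocabulary pruning could look like the following (the case-merging step of [3] and the rebuilding of the tokenizer vocabulary are omitted, and the sample size is arbitrary):

```python
from collections import Counter

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModel.from_pretrained("distilbert-base-multilingual-cased")

# Document-level frequencies: in how many articles does each token id appear?
wiki = load_dataset("wikipedia", "20220301.it", split="train")
sample = wiki.select(range(100_000))  # subsample for illustration
doc_freq = Counter()
for article in sample:
    doc_freq.update(set(tokenizer(article["text"], truncation=True)["input_ids"]))

# Keep tokens appearing in at least 0.1% of the sampled documents (plus special tokens)
threshold = 0.001 * len(sample)
keep_ids = sorted(
    {tid for tid, freq in doc_freq.items() if freq >= threshold}
    | set(tokenizer.all_special_ids)
)

# Slice the input embedding matrix down to the retained vocabulary
# (the tokenizer vocabulary and model.config.vocab_size must be updated to match)
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.size(1))
new_emb.weight.data.copy_(old_emb[keep_ids])
model.set_input_embeddings(new_emb)
```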

To compensate for the deletion of cased tokens, which forces the model to rely on lowercase representations of words that were previously capitalized,
the model has been further pre-trained on the Italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [4]</b> technique to make it more robust
with respect to the new uncased representations.

The resulting model has 55M parameters, a vocabulary of 13,832 tokens, and a size of 220MB, which makes it <b>50% lighter</b> than a typical monolingual BERT model and
20% lighter than a typical monolingual DistilBERT model.

<h3>Training procedure</h3>

The model has been trained for <b>masked language modeling</b> on the Italian <b>Wikipedia</b> (~3GB) dataset for 10K steps, using the AdamW optimizer, with a batch size of 512
(obtained through 128 gradient accumulation steps),
a sequence length of 512, and a linearly decaying learning rate starting from 5e-5. The training has been performed with <b>dynamic masking</b> between epochs and
the <b>whole word masking</b> technique.
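
For illustration only, since the original training script is not part of this repository, a roughly equivalent setup with the Hugging Face Trainer could look like this (the starting checkpoint, dataset preprocessing, masking probability, and per-device batch size are assumptions):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

# Stand-in for the vocabulary-reduced checkpoint described in the previous section
tokenizer = AutoTokenizer.from_pretrained("osiria/blaze-it")
model = AutoModelForMaskedLM.from_pretrained("osiria/blaze-it")

wiki = load_dataset("wikipedia", "20220301.it", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)

# Masks are sampled on the fly at the whole-word level,
# which also gives dynamic masking between epochs
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="blaze-it-mlm",
    max_steps=10_000,               # 10K training steps
    per_device_train_batch_size=4,  # 4 x 128 accumulation steps = effective batch of 512
    gradient_accumulation_steps=128,
    learning_rate=5e-5,
    lr_scheduler_type="linear",     # linearly decaying learning rate
)

# The Trainer uses AdamW by default
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```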

<h3>Performances</h3>

The following metrics have been computed on the Part of Speech Tagging and Named Entity Recognition tasks, using the <b>UD Italian ISDT</b> and <b>WikiNER</b> datasets, respectively.
The PoS tagging model has been trained for 5 epochs and the NER model for 3 epochs, both with a constant learning rate fixed at 1e-5. For Part of Speech Tagging, the metrics have been computed on the default test set
provided with the dataset, while for Named Entity Recognition the metrics have been computed with a 5-fold cross-validation.

| Task | Recall | Precision | F1 |
| ------ | ------ | ------ | ------ |
| Part of Speech Tagging | 97.48 | 97.29 | 97.37 |
| Named Entity Recognition | 89.29 | 89.84 | 89.53 |

The metrics have been computed at token level and macro-averaged over the classes.
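
As a reference for how these numbers are aggregated, a token-level macro-averaged evaluation can be sketched as follows (the labels and predictions are purely illustrative, and scikit-learn is assumed here rather than a specific evaluation script):

```python
from sklearn.metrics import precision_recall_fscore_support

# Flattened token-level gold and predicted tags for a hypothetical batch
y_true = ["NOUN", "VERB", "DET", "NOUN", "ADP", "PROPN"]
y_pred = ["NOUN", "VERB", "DET", "ADJ",  "ADP", "PROPN"]

# Macro-averaging gives every class the same weight, regardless of its frequency
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```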

<h3>Demo</h3>

You can try the model online (fine-tuned on named entity recognition) using this webapp:

<h3>Quick usage</h3>

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and the model together with its masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained("osiria/blaze-it")
model = AutoModelForMaskedLM.from_pretrained("osiria/blaze-it")

# The fill-mask pipeline requires a masked-LM head, hence AutoModelForMaskedLM above
pipeline_mlm = pipeline(task="fill-mask", model=model, tokenizer=tokenizer)
```
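
For example, the pipeline can then be run on one of the widget prompts above (the exact predictions and scores depend on the model):

```python
# Each prediction is a dict with the filled-in token, its score, and the full sequence
predictions = pipeline_mlm("milano è una [MASK] dell'italia")
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 4))
```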

<h3>Limitations</h3>

This lightweight model is mainly trained on Wikipedia, so it is particularly suitable as an agile analyzer for large volumes of natively digital text taken
from the web and written in a correct and fluent form (like wikis, web pages, news, etc.). It may show limitations when it comes to noisy text containing errors and slang expressions
(like social media posts), or to domain-specific text (like medical, financial or legal content).

<h3>References</h3>

[1] https://arxiv.org/abs/1810.04805

[2] https://arxiv.org/abs/1910.01108

[3] https://arxiv.org/abs/2010.05609

[4] https://arxiv.org/abs/1906.08101

<h3>License</h3>

The model is released under the <b>Apache-2.0</b> license.