You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>.
</p>

# Gervásio 7B PT-PT Instruct

**Gervásio PT-*** is an open, large language model for the **Portuguese language**.

It is a **decoder** of the GPT family, based on the Transformer neural architecture and
developed over the Pythia model, with competitive performance for this language.
It has different versions that were trained for different variants of Portuguese (PT),
namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
and it is distributed free of charge and under a most permissive license.

**Gervásio PT-PT 7B Instruct** is developed by the NLX-Natural Language and Speech Group, University of Lisbon, Faculty of Sciences, Department of Informatics, Portugal.

For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**.
It is known more shortly as **Gervásio PT-***, or even more briefly just as **Gervásio**, among its acquaintances.

For further details, check the respective [publication](https://arxiv.org/abs/?):

```latex
@misc{gervasio-pt,
  title={Advancing Generative AI for Portuguese with Open Decoder Gervásio~PT*},
  author={Rodrigo Santos and João Silva and Luís Gomes and João Rodrigues and António Branco},
  year={2024},
  eprint={?},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Please use the above canonical reference when using or citing this model.

<br>

# Model Description

**This model card is for Gervásio-7B-PTPT-Instruct-Decoder**, with 7 billion parameters, ? layers, and a hidden size of ?.

Gervásio-7B-PTPT-Instruct-Decoder is distributed under an [Apache 2.0 license](https://huggingface.co/PORTULAN/gervasio-ptpt-base/blob/main/LICENSE) (like Pythia).
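
Once settled, the layer count and hidden size can also be read directly off the checkpoint's configuration. A minimal sketch, assuming the checkpoint is the base repository linked above and exposes the standard Hugging Face config attributes `num_hidden_layers` and `hidden_size`:

```python
# Minimal sketch: inspect architecture details from the model configuration.
# The repository id is taken from the license link above and is an assumption
# for this example; point it at the checkpoint you actually use.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PORTULAN/gervasio-ptpt-base")
print(config.num_hidden_layers)  # number of Transformer layers
print(config.hidden_size)        # hidden dimension
```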

<br>

# Training Data

[**Gervásio PT-PT base**](https://huggingface.co/PORTULAN/gervasio-ptpt-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature.
It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicate the Internet country code top-level domain of Portugal.
We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
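
For illustration, the country-domain restriction amounts to a simple predicate over each document's source URL. A minimal sketch, assuming OSCAR's metadata layout (a `warc_headers` record carrying a `warc-target-uri` field); this is not the project's actual filtering code:

```python
# Illustrative sketch only: keep documents whose source URL sits under the
# .pt country-code top-level domain. The metadata layout is assumed from
# the OSCAR-2301 schema; adjust it to the fields actually present.
from urllib.parse import urlparse

def has_pt_domain(document: dict) -> bool:
    uri = document["meta"]["warc_headers"]["warc-target-uri"]
    host = urlparse(uri).hostname or ""
    return host.endswith(".pt")

docs = [
    {"meta": {"warc_headers": {"warc-target-uri": "https://jornal.example.pt/artigo"}}},
    {"meta": {"warc_headers": {"warc-target-uri": "https://blog.example.com.br/post"}}},
]
print([has_pt_domain(d) for d in docs])  # [True, False]
```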

## Preprocessing

We filtered the PT-PT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
We skipped the default stopword filtering, since it would disrupt the syntactic structure, and also skipped the language identification filtering, given that the corpus had been pre-selected as Portuguese.
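
Schematically, the resulting pipeline composes the retained cleaning steps and simply omits the two skipped filters. The helpers below are hypothetical stand-ins, not the BLOOM pipeline's actual API:

```python
# Schematic illustration only (hypothetical helpers, not the BLOOM
# pipeline's actual API): compose the retained cleaning steps while
# leaving out stopword filtering and language identification.
def deduplicate(docs: list[str]) -> list[str]:
    # Drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def strip_boilerplate(docs: list[str]) -> list[str]:
    # Toy stand-in for boilerplate removal.
    return [d.strip() for d in docs if d.strip()]

def preprocess(docs: list[str]) -> list[str]:
    # No stopword filter (it would disrupt syntactic structure) and no
    # language-id filter (the corpus is already pre-selected as Portuguese).
    return strip_boilerplate(deduplicate(docs))

print(preprocess(["Olá, mundo. ", "Olá, mundo. ", "   "]))  # ['Olá, mundo.']
```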

# Evaluation

The base model version was evaluated on downstream tasks, namely on PT-PT translations of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue).

## GLUE tasks translated

We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark.
We automatically translated four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.

| Model                    | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1)  | STS-B (Pearson) |
|--------------------------|----------------|-----------------|------------|-----------------|
| **Albertina-PT-PT**      | **0.8339**     | 0.4225          | **0.9171** | **0.8801**      |
| **Albertina-PT-PT base** | 0.6787         | **0.4507**      | 0.8829     | 0.8581          |
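
The translated tasks can be pulled directly from the Hugging Face Hub. A minimal sketch, assuming the dataset exposes one configuration per GLUE task (the `rte` name and the `validation` split are assumptions):

```python
# Minimal sketch: load one translated task from the Hub. The configuration
# name "rte" and the "validation" split are assumptions; list the available
# configurations first to see what the dataset actually provides.
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("PORTULAN/glue-ptpt"))  # available tasks
rte = load_dataset("PORTULAN/glue-ptpt", "rte")
print(rte["validation"][0])  # one translated example
```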

<br>

# How to use

You can use this model directly with a pipeline for causal language modeling (CLM):

```python
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-ptbr-base')
>>> generator("A música brasileira é", max_new_tokens=10)
[{'generated_text': 'A música brasileira é uma das mais ricas do mundo. Ao'}]
```
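
Alternatively, the tokenizer and model can be loaded explicitly rather than through the pipeline; a minimal sketch mirroring the example above:

```python
# Minimal sketch: load the tokenizer and model explicitly instead of using
# the pipeline, with the same repository id as the example above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PORTULAN/gervasio-ptbr-base")
model = AutoModelForCausalLM.from_pretrained("PORTULAN/gervasio-ptbr-base")

inputs = tokenizer("A música brasileira é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```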

<br>

# Acknowledgments

The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language,
funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the
grant PINFRA/22117/2016; research project GPT-PT - Transformer-based Decoder for the Portuguese Language, funded by FCT—Fundação para a Ciência e Tecnologia under the
grant CPCA-IAC/AV/478395/2022; innovation project
ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação
under the grant C625734525-00462629, of Plano de Recuperação e Resiliência,
call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.