You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>.
</p>

# Gervásio 7B PT-PT Instruct

**Gervásio PT-*** is a ?? large language model for the **Portuguese language**.

It is a **decoder** of the GPT family, based on the Transformer neural architecture and
developed over the Pythia model, with competitive performance for this language.
It has different versions that were trained for different variants of Portuguese (PT),
namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
and it is distributed free of charge and under a most permissive license.

**Gervásio PT-PT 7B Instruct** is developed by NLX-Natural Language and Speech Group, at the University of Lisbon, Faculty of Sciences, Department of Informatics, Portugal.

For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**,
and which is known more shortly as **Gervásio PT-***, or even more briefly just as **Gervásio**, among his acquaintances.

For further details, check the respective [publication](https://arxiv.org/abs/?):

```latex
@misc{gervasio-pt,
  title={Advancing Generative AI for Portuguese with Open Decoder Gervásio~PT*},
  author={Rodrigo Santos and João Silva and Luís Gomes and João Rodrigues and António Branco},
  year={2024},
  eprint={?},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Please use the above canonical reference when using or citing this model.

<br>

# Model Description

**This model card is for Gervásio-7B-PTPT-Instruct-Decoder**, with 7 billion parameters, ? layers and a hidden size of ?.

Gervásio-PT-BR base is distributed under an [Apache 2.0 license](https://huggingface.co/PORTULAN/gervasio-ptpt-base/blob/main/LICENSE) (like Pythia).

<br>

# Training Data

[**Gervásio-PT-BR base**](https://huggingface.co/PORTULAN/gervasio-ptpt-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature.
It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicates the Internet country code top-level domain of Brazil.
We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.

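The country-code filtering step can be illustrated with a short, hypothetical sketch. The record layout below (a `url` field per document) is an assumption for illustration only, not the actual OSCAR metadata schema:

```python3
from urllib.parse import urlparse

# Hypothetical document records; the real OSCAR metadata schema may differ.
docs = [
    {"text": "…", "url": "https://www.exemplo.com.br/noticia"},
    {"text": "…", "url": "https://www.example.com/article"},
]

def has_country_tld(doc, tld=".br"):
    # Keep a document only if the host of its source URL ends with the given
    # country-code top-level domain.
    host = urlparse(doc["url"]).hostname or ""
    return host.endswith(tld)

selected = [d for d in docs if has_country_tld(d, ".br")]
print(len(selected))  # 1
```
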
## Preprocessing

We filtered the PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese.

# Evaluation

The base model version was evaluated on downstream tasks, namely the translations into PT-PT of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue).

## GLUE tasks translated

We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark.
We automatically translated the same four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.

| Model                    | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1)  | STS-B (Pearson) |
|--------------------------|----------------|-----------------|------------|-----------------|
| **Albertina-PT-PT**      | **0.8339**     | 0.4225          | **0.9171** | **0.8801**      |
| **Albertina-PT-PT base** | 0.6787         | **0.4507**      | 0.8829     | 0.8581          |

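If you want to reproduce or extend this evaluation, GLUE-PT can be loaded with the `datasets` library. The following is a minimal sketch; the configuration name `rte` is assumed to mirror the original GLUE task names, so check the dataset card for the exact identifiers:

```python3
from datasets import load_dataset

# Load the PT-PT translation of the RTE task from GLUE-PT.
# The configuration name "rte" is an assumption mirroring the GLUE naming.
glue_pt_rte = load_dataset("PORTULAN/glue-ptpt", "rte")

print(glue_pt_rte)              # available splits and their sizes
print(glue_pt_rte["train"][0])  # one sentence pair with its label
```
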
<br>

# How to use

You can use this model directly with a pipeline for causal language modeling (CLM):

```python3
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-ptbr-base')
>>> generator("A música brasileira é", max_new_tokens=10)
[{'generated_text': 'A música brasileira é uma das mais ricas do mundo. Ao'}]
```

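For more control over generation (sampling, maximum length, device placement), the same checkpoint can also be loaded with the generic `transformers` auto classes. This is a minimal sketch: it simply reuses the repository name from the pipeline example above, and the generation parameters are only illustrative.

```python3
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reuses the checkpoint name from the pipeline example above; adjust as needed.
model_name = 'PORTULAN/gervasio-ptbr-base'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Encode a Portuguese prompt and generate a short continuation.
inputs = tokenizer("A música brasileira é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
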
<br>

# Acknowledgments

The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language,
funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the
grant PINFRA/22117/2016; research project GPT-PT - Transformer-based Decoder for the Portuguese Language, funded by FCT—Fundação para a Ciência e Tecnologia under the
grant CPCA-IAC/AV/478395/2022; innovation project
ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação
under the grant C625734525-00462629, of Plano de Recuperação e Resiliência,
call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.