You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>.
</p>

# Gervásio 7B PT-PT Instruct

**Gervásio PT-*** is an open, large language model for the **Portuguese language**.

It is a **decoder** of the GPT family, based on the Transformer neural architecture and
developed over the Pythia model, with competitive performance for this language.
It has different versions that were trained for different variants of Portuguese (PT),
namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**),
and it is distributed free of charge and under a most permissive license.

**Gervásio PT-PT 7B Instruct** is developed by the NLX-Natural Language and Speech Group, University of Lisbon, Faculty of Sciences, Department of Informatics, Portugal.

For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**.
It is known more shortly as **Gervásio PT-***, or even more briefly just as **Gervásio**, among its acquaintances.

For further details, check the respective [publication](https://arxiv.org/abs/?):

```latex
@misc{gervasio-pt,
  title={Advancing Generative AI for Portuguese with Open Decoder Gervásio~PT*},
  author={Rodrigo Santos and João Silva and Luís Gomes and João Rodrigues and António Branco},
  year={2024},
  eprint={?},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Please use the above canonical reference when using or citing this model.

<br>

# Model Description

**This model card is for Gervásio-7B-PTPT-Instruct-Decoder**, with 7 billion parameters, ? layers, and a hidden size of ?.

Gervásio-7B-PTPT-Instruct-Decoder is distributed under an [Apache 2.0 license](https://huggingface.co/PORTULAN/gervasio-ptpt-base/blob/main/LICENSE) (like Pythia).
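
Once settled, the layer count and hidden size can also be read directly off the checkpoint's configuration. A minimal sketch, assuming the checkpoint is the base repository linked above and exposes the standard Hugging Face config attributes `num_hidden_layers` and `hidden_size`:

```python
# Minimal sketch: inspect architecture details from the model configuration.
# The repository id is taken from the license link above and is an assumption
# for this example; point it at the checkpoint you actually use.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PORTULAN/gervasio-ptpt-base")
print(config.num_hidden_layers)  # number of Transformer layers
print(config.hidden_size)        # hidden dimension
```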

<br>

# Training Data

[**Gervásio PT-PT base**](https://huggingface.co/PORTULAN/gervasio-ptpt-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature.
It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose metadata indicate the Internet country code top-level domain of Portugal.
We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
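
For illustration, the country-domain restriction amounts to a simple predicate over each document's source URL. A minimal sketch, assuming OSCAR's metadata layout (a `warc_headers` record carrying a `warc-target-uri` field); this is not the project's actual filtering code:

```python
# Illustrative sketch only: keep documents whose source URL sits under the
# .pt country-code top-level domain. The metadata layout is assumed from
# the OSCAR-2301 schema; adjust it to the fields actually present.
from urllib.parse import urlparse

def has_pt_domain(document: dict) -> bool:
    uri = document["meta"]["warc_headers"]["warc-target-uri"]
    host = urlparse(uri).hostname or ""
    return host.endswith(".pt")

docs = [
    {"meta": {"warc_headers": {"warc-target-uri": "https://jornal.example.pt/artigo"}}},
    {"meta": {"warc_headers": {"warc-target-uri": "https://blog.example.com.br/post"}}},
]
print([has_pt_domain(d) for d in docs])  # [True, False]
```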

## Preprocessing

We filtered the PT-PT corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
We skipped the default stopword filtering, since it would disrupt the syntactic structure, and also skipped the language identification filtering, given that the corpus had been pre-selected as Portuguese.
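
Schematically, the resulting pipeline composes the retained cleaning steps and simply omits the two skipped filters. The helpers below are hypothetical stand-ins, not the BLOOM pipeline's actual API:

```python
# Schematic illustration only (hypothetical helpers, not the BLOOM
# pipeline's actual API): compose the retained cleaning steps while
# leaving out stopword filtering and language identification.
def deduplicate(docs: list[str]) -> list[str]:
    # Drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def strip_boilerplate(docs: list[str]) -> list[str]:
    # Toy stand-in for boilerplate removal.
    return [d.strip() for d in docs if d.strip()]

def preprocess(docs: list[str]) -> list[str]:
    # No stopword filter (it would disrupt syntactic structure) and no
    # language-id filter (the corpus is already pre-selected as Portuguese).
    return strip_boilerplate(deduplicate(docs))

print(preprocess(["Olá, mundo. ", "Olá, mundo. ", "   "]))  # ['Olá, mundo.']
```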

# Evaluation

The base model version was evaluated on downstream tasks, namely on PT-PT translations of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue).

## GLUE tasks translated

We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark.
We automatically translated four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.

| Model                    | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1)  | STS-B (Pearson) |
|--------------------------|----------------|-----------------|------------|-----------------|
| **Albertina-PT-PT**      | **0.8339**     | 0.4225          | **0.9171** | **0.8801**      |
| **Albertina-PT-PT base** | 0.6787         | **0.4507**      | 0.8829     | 0.8581          |
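
The translated tasks can be pulled directly from the Hugging Face Hub. A minimal sketch, assuming the dataset exposes one configuration per GLUE task (the `rte` name and the `validation` split are assumptions):

```python
# Minimal sketch: load one translated task from the Hub. The configuration
# name "rte" and the "validation" split are assumptions; list the available
# configurations first to see what the dataset actually provides.
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("PORTULAN/glue-ptpt"))  # available tasks
rte = load_dataset("PORTULAN/glue-ptpt", "rte")
print(rte["validation"][0])  # one translated example
```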

<br>

# How to use

You can use this model directly with a pipeline for causal language modeling (CLM):

```python
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-ptbr-base')
>>> generator("A música brasileira é", max_new_tokens=10)
[{'generated_text': 'A música brasileira é uma das mais ricas do mundo. Ao'}]
```
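
Alternatively, the tokenizer and model can be loaded explicitly rather than through the pipeline; a minimal sketch mirroring the example above:

```python
# Minimal sketch: load the tokenizer and model explicitly instead of using
# the pipeline, with the same repository id as the example above.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PORTULAN/gervasio-ptbr-base")
model = AutoModelForCausalLM.from_pretrained("PORTULAN/gervasio-ptbr-base")

inputs = tokenizer("A música brasileira é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```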

<br>

# Acknowledgments

The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language,
funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the
grant PINFRA/22117/2016; research project GPT-PT - Transformer-based Decoder for the Portuguese Language, funded by FCT—Fundação para a Ciência e Tecnologia under the
grant CPCA-IAC/AV/478395/2022; innovation project
ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação
under the grant C625734525-00462629, of Plano de Recuperação e Resiliência,
call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização.