---
license: mit
language:
- pt
tags:
- gervasio-pt*
- gervasio-ptpt
- gervasio-ptbr
- gervasio-ptpt-base
- gervasio-ptbr-base
- portulan
- albertina-pt*
- albertina-ptpt
- albertina-ptbr
- albertina-ptbr-nobrwac
- albertina-ptpt-base
- albertina-ptbr-base
- clm
- gpt
- portuguese
- decoder
- foundation model
- instruct
datasets:
- PORTULAN/glue-ptpt
---
|
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png"> |
|
<p style="text-align: center;">  This is the model card for Gervásio 7B PT-PT Instruct Decoder.
  You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Gervásio (decoders) families</a>.
</p>
|
|
|
# Gervásio 7B PT-PT Instruct |
|
|
|
**Gervásio PT-*** is a competitive **fully open** decoder for the **Portuguese language**.
|
|
|
|
|
It is a **decoder** of the GPT family, based on the Transformer neural architecture and developed over the LLaMA 2 7B model.
It was further improved through additional training on language resources that include new Portuguese instruction datasets prepared for this purpose.
|
|
|
It has different versions that were trained for different variants of Portuguese (PT), |
|
namely the European variant from Portugal (**PT-PT**) and the American variant from Brazil (**PT-BR**). |
|
|
|
All versions of Gervásio are **distributed for free and under a fully open license**, including for either research or commercial usage, and can |
|
be run on consumer-grade hardware, thus seeking to contribute to the advancement of research and innovation in language technology for Portuguese. |
|
|
|
**Gervásio PT-PT 7B Instruct** is developed by NLX-Natural Language and Speech Group, at the University of Lisbon, Faculty of Sciences, Department of Informatics, Portugal. |
|
|
|
For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**,
though it is known more briefly as **Gervásio PT-***, or even more briefly just as **Gervásio**, among his acquaintances.
|
|
|
For further details, check the respective [publication](https://arxiv.org/abs/?): |
|
|
|
``` latex
@misc{albertina-pt,
      title={Advancing Generative AI for Portuguese with Open Decoder Gervásio~PT*},
      author={Rodrigo Santos and João Silva and Luís Gomes and João Rodrigues and António Branco},
      year={2024},
      eprint={?},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
|
|
|
Please use the above canonical reference when using or citing this model.
|
|
|
|
|
<br> |
|
|
|
|
|
# Model Description |
|
|
|
**This model card is for Gervásio-7B-PTPT-Instruct-Decoder**, with 7 billion parameters, a hidden size of 4096 units, an intermediate size of 11,008 units, 32 attention heads, 32 hidden layers, and a tokenizer obtained using the Byte-Pair Encoding (BPE) algorithm implemented with SentencePiece, featuring a vocabulary size of 32,000. |
|
Gervásio-7B-PTPT-Instruct-Decoder is distributed under an [MIT license](https://huggingface.co/PORTULAN/albertina-ptpt/blob/main/LICENSE). |
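
As a rough sanity check, the architecture figures above can be turned into a parameter count. The sketch below assumes the usual LLaMA-style layout, which this card does not spell out: no bias terms, four square attention projections, a gated feed-forward block with three weight matrices, two normalization vectors per layer, and separate input-embedding and output-projection matrices. The function name `count_parameters` is illustrative only.

```python
# Architecture figures quoted in the model description.
HIDDEN_SIZE = 4096
INTERMEDIATE_SIZE = 11_008
NUM_LAYERS = 32
VOCAB_SIZE = 32_000


def count_parameters(hidden, intermediate, layers, vocab):
    """Estimate the parameter count of a LLaMA-style decoder.

    Assumed layout (not stated in this card): no bias terms, four square
    attention projections (Q, K, V, O), a gated MLP with three matrices,
    two norm weight vectors per layer, one final norm, and untied
    input-embedding and output-projection matrices.
    """
    embeddings = vocab * hidden
    attention = 4 * hidden * hidden
    mlp = 3 * hidden * intermediate
    norms = 2 * hidden
    per_layer = attention + mlp + norms
    final_norm = hidden
    lm_head = vocab * hidden
    return embeddings + layers * per_layer + final_norm + lm_head


total = count_parameters(HIDDEN_SIZE, INTERMEDIATE_SIZE, NUM_LAYERS, VOCAB_SIZE)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 6.74 billion, consistent with the "7 billion parameters" quoted above.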
|
|
|
|
|
<br> |
|
|
|
# Training Data |
|
|
|
**Gervásio-7B-PTPT-Instruct-Decoder** was trained with standard supervised fine-tuning. To keep some alignment with mainstream benchmarks for English, we resorted to tasks and respective datasets in the GLUE and SuperGLUE collections.
|
|
|
|
|
We selected those datasets where the outcome of their machine translation into Portuguese could preserve, in the target language, the linguistic properties at stake. |
|
|
|
From GLUE, we resorted to the following four tasks:
- MRPC (paraphrase detection).
- RTE (recognizing textual entailment).
- STS-B (semantic textual similarity).
- WNLI (coreference and natural language inference).
|
|
|
And from SuperGLUE, we included these other four tasks:
- BoolQ (yes/no question answering).
- CB (inference with 3 labels).
- COPA (reasoning).
- MultiRC (question answering).
|
|
|
|
|
Instruction templates have been manually crafted for each task.
These take the various fields in the dataset and arrange them into a prompt.
For instance, the template for the RTE dataset prepends "Frase 1:" (Eng. "Sentence 1:") to the first sentence of an example.
These templates are listed in full detail in TODO.
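
To make the idea concrete, here is a minimal sketch of what such a template could look like for RTE. Only the "Frase 1:" prefix comes from this card. The field names `sentence1` and `sentence2` are assumed to follow the GLUE layout, and the "Frase 2:", "Pergunta:" and "Resposta:" wording, the function name `build_rte_prompt`, and the example sentences are hypothetical and should not be read as the templates actually used in training.

```python
def build_rte_prompt(example):
    """Arrange the fields of one RTE example into a single prompt string.

    Only the "Frase 1:" prefix is taken from the model card; the remaining
    wording is a hypothetical illustration.
    """
    return (
        f"Frase 1: {example['sentence1']}\n"
        f"Frase 2: {example['sentence2']}\n"
        "Pergunta: A frase 1 implica a frase 2?\n"
        "Resposta:"
    )


# Made-up example using the assumed GLUE-style field names.
example = {
    "sentence1": "O gato dorme no sofá.",
    "sentence2": "Há um gato no sofá.",
}
prompt = build_rte_prompt(example)
print(prompt)
```

Keeping the template as a plain function of the dataset fields makes it easy to apply the same formatting at training time and at inference time.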
|
|
|
## Preprocessing |
|
|
|
We filtered the PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline. |
|
We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese. |
|
|
|
|
|
# Evaluation |
|
|
|
The base model version was evaluated on downstream tasks, namely the translations into PT-PT of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue). |
|
|
|
|
|
## GLUE tasks translated |
|
|
|
|
|
We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark. |
|
We automatically translated the same four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option. |
|
|
|
| Model                    | RTE (Accuracy) | WNLI (Accuracy) | MRPC (F1)  | STS-B (Pearson) |
|--------------------------|----------------|-----------------|------------|-----------------|
| **Albertina-PT-PT**      | **0.8339**     | 0.4225          | **0.9171** | **0.8801**      |
| **Albertina-PT-PT base** | 0.6787         | **0.4507**      | 0.8829     | 0.8581          |
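
For readers unfamiliar with the three metrics in the table header, the following is a small pure-Python sketch of how each is commonly computed. It is not the evaluation code behind the reported scores. It assumes that F1 is the binary F1 on the positive class and that the Pearson coefficient is taken between gold and predicted similarity scores. The function names are illustrative.

```python
import math


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_binary(gold, pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy labels and scores, made up for illustration.
gold_labels = [1, 0, 1, 1]
pred_labels = [1, 0, 0, 1]
print(accuracy(gold_labels, pred_labels))
print(f1_binary(gold_labels, pred_labels))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```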
|
|
|
<br> |
|
|
|
# How to use |
|
|
|
You can use this model directly with a pipeline for causal language modeling (CLM): |
|
|
|
```python3
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-ptbr-base')
>>> generator("A música brasileira é", max_new_tokens=10)
[{'generated_text': 'A música brasileira é uma das mais ricas do mundo. Ao'}]
```
|
<br> |
|
|
|
# Acknowledgments |
|
|
|
The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, |
|
funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the |
|
grant PINFRA/22117/2016; research project GPT-PT - Transformer-based Decoder for the Portuguese Language, funded by FCT—Fundação para a Ciência e Tecnologia under the |
|
grant CPCA-IAC/AV/478395/2022; innovation project |
|
ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação |
|
under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, |
|
call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização. |
|
|