PORTULAN/gervasio-7b-portuguese-ptpt-decoder

jarodrigues committed on Feb 28

Commit 0f48285 (1 parent: b1cecba)

Update README.md

Files changed (1)

README.md +20 -5

@@ -77,13 +77,28 @@ Gervásio-7B-PTPT-Instruct-Decoder is distributed under an [MIT license](https:/
 # Training Data
-[**Gervásio-PT-BR base**](https://huggingface.co/PORTULAN/gervasio-ptpt-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
-The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature.
-It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
-Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose meta-data indicate the Internet country code top-level domain of Brazil.
-We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
 ## Preprocessing

 # Training Data
+**Gervásio-7B-PTPT-Instruct-Decoder** was trained with standard supervised fine-tuning. To keep some alignment with mainstream benchmarks for English, we resorted to tasks and respective datasets in the GLUE and the SuperGLUE collections.
+We selected those datasets where the outcome of their machine translation into Portuguese could preserve, in the target language, the linguistic properties at stake.
+From GLUE, we resorted to the following four tasks:
+- MRPC (paraphrase detection).
+- RTE (recognizing textual entailment).
+- STS-B (semantic textual similarity).
+- WNLI (coreference and natural language inference).
+And from SuperGLUE, we included these other four tasks:
+- BoolQ (yes/no question answering).
+- CB (inference with 3 labels).
+- COPA (reasoning).
+- MultiRC (question answering).
+Instruction templates have been manually crafted for each task.
+These take the various fields in the dataset and arrange them into a prompt.
+For instance, a template prepends "Frase 1:" (Eng. "Sentence 1:") to the first sentence of an example in the RTE dataset.
+These templates are listed in full detail in TODO.
 ## Preprocessing
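
The added lines describe instruction templates that take the fields of a dataset example and arrange them into a prompt. A minimal Python sketch of that idea for RTE follows. Only the "Frase 1:" label comes from the README text; the "Frase 2:" label, the field names `sentence1` and `sentence2`, the sample sentences, and the helper `build_prompt` are assumptions made for illustration, not the authors' actual templates.

```python
# Hypothetical RTE template. Only "Frase 1:" is taken from the README;
# "Frase 2:" and the field names are assumptions for illustration.
RTE_TEMPLATE = "Frase 1: {sentence1}\nFrase 2: {sentence2}"


def build_prompt(example: dict, template: str = RTE_TEMPLATE) -> str:
    """Arrange the fields of one dataset example into a single prompt string."""
    # str.format ignores keyword arguments the template does not reference,
    # so extra fields such as "label" can stay in the example dict.
    return template.format(**example)


# Made-up example, not taken from the actual dataset.
example = {
    "sentence1": "O gato dorme no sofá.",
    "sentence2": "Há um animal no sofá.",
    "label": 0,
}
prompt = build_prompt(example)
print(prompt)
```

Keeping the template as a plain format string means each task can have its own template while sharing the same helper.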