jarodrigues committed on
Commit
0f48285
1 Parent(s): b1cecba

Update README.md

  1. README.md +20 -5
README.md CHANGED
@@ -77,13 +77,28 @@ Gervásio-7B-PTPT-Instruct-Decoder is distributed under an [MIT license](https:/
 
  # Training Data
 
- [**Gervásio-PT-BR base**](https://huggingface.co/PORTULAN/gervasio-ptpt-base) was trained over a 3.7 billion token curated selection of documents from the [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) data set.
- The OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature.
- It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters.
- Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose meta-data indicate the Internet country code top-level domain of Brazil.
- We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
 
 
  ## Preprocessing
 
 
  # Training Data
 
+ **Gervásio-7B-PTPT-Instruct-Decoder** was trained with standard supervised fine-tuning. To keep some alignment with mainstream benchmarks for English, we resorted to tasks and respective datasets in the GLUE and SuperGLUE collections.
+ We selected those datasets where the outcome of their machine translation into Portuguese could preserve, in the target language, the linguistic properties at stake.
+
+ From GLUE, we resorted to the following four tasks:
+ - MRPC (paraphrase detection).
+ - RTE (recognizing textual entailment).
+ - STS-B (semantic textual similarity).
+ - WNLI (coreference and natural language inference).
+
+ And from SuperGLUE, we included these other four tasks:
+ - BoolQ (yes/no question answering).
+ - CB (inference with 3 labels).
+ - COPA (reasoning).
+ - MultiRC (question answering).
+
+
+ Instruction templates have been manually crafted for each task.
+ These take the various fields in the dataset and arrange them into a prompt.
+ For instance, the label "Frase 1:" (Eng. "Sentence 1:") is prepended to the first sentence of each example in the RTE dataset.
+ These templates are listed in full detail in TODO.
 
  ## Preprocessing
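
The added lines describe instruction templates that take the fields of a dataset example and arrange them into a prompt. A minimal sketch of that idea for an RTE-style example follows. Only the "Frase 1:" label comes from the README text; the "Frase 2:" label, the field names `sentence1` and `sentence2`, and the helper `render_prompt` are hypothetical and are not the authors' actual templates.

```python
from typing import Mapping, Sequence, Tuple

# Each template entry pairs a label with the dataset field it introduces.
# Assumption: only "Frase 1:" is attested in the README; the second label
# and both field names are illustrative placeholders.
RTE_TEMPLATE: Sequence[Tuple[str, str]] = (
    ("Frase 1:", "sentence1"),
    ("Frase 2:", "sentence2"),
)


def render_prompt(
    example: Mapping[str, str],
    template: Sequence[Tuple[str, str]] = RTE_TEMPLATE,
) -> str:
    """Prefix each selected field with its label, one field per line."""
    return "\n".join(f"{label} {example[field]}" for label, field in template)


# Made-up example for illustration, not taken from the translated dataset.
example = {
    "sentence1": "O gato dorme no sofá.",
    "sentence2": "Há um animal no sofá.",
}
prompt = render_prompt(example)
print(prompt)
```

Keeping the template as data (label and field pairs) rather than a hard-coded string would let each task define its own arrangement while sharing one rendering function.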