jarodrigues
committed on
Update README.md
README.md CHANGED
@@ -135,10 +135,11 @@ This involves repurposing the tasks in various ways, such as generation of answe
<br>

For further testing our decoder, in addition to the testing data described above, we also reused some of the datasets that had been resorted to for American Portuguese to test the state-of-the-art Sabiá model and that were originally developed with materials from Portuguese: ASSIN2 RTE (entailment) and ASSIN2 STS (similarity), BLUEX (question answering), ENEM 2022 (question answering) and FaQuAD (extractive question answering).
+
The scores of Sabiá invite a contrast with Gervásio's, but such a comparison needs to be taken with some caution.
-First, these are a repetition of the scores presented in the respective paper, which only provides results for a single run of each task, while the scores of Gervásio are the average of three runs with different seeds.
-Second, the evaluation methods adopted by Sabiá are *sui generis*, and different from the ones adopted for Gervásio.
-Third, to evaluate Sabiá, the examples included in the few-shot prompt are hand-picked, and identical for every test instance in each task.
+- First, these are a repetition of the scores presented in the respective paper, which only provides results for a single run of each task, while the scores of Gervásio are the average of three runs with different seeds.
+- Second, the evaluation methods adopted by Sabiá are *sui generis*, and different from the ones adopted for Gervásio.
+- Third, to evaluate Sabiá, the examples included in the few-shot prompt are hand-picked, and identical for every test instance in each task.
To evaluate Gervásio, the examples were randomly selected to be included in the prompts.


@@ -147,7 +148,7 @@ To evaluate Gervásio, the examples were randomly selected to be included in the
| **Gervásio 7B PT-BR** | 0.1977 | 0.2640 | **0.7469** | **0.2136** |
| **LLaMA 2** | 0.2458 | 0.2903 | 0.0913 | 0.1034 |
| **LLaMA 2 Chat** | 0.2231 | 0.2959 | 0.5546 | 0.1750 |
-
+||||||
| **Sabiá-7B** | **0.6017** | **0.7743** | 0.6847 | 0.1363 |

<br>
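The methodological contrast stated in the updated paragraph, few-shot examples drawn at random for each prompt and scores averaged over three runs with different seeds, can be sketched as follows. This is a minimal illustration only: the example pool, the prompt template, and the `build_prompt`/`evaluate`/`model_fn` names are assumptions made for the sake of the sketch, not the evaluation harness actually used for Gervásio or Sabiá.

```python
import random
from statistics import mean

# Hypothetical pool of labeled RTE-style examples; in a real evaluation these
# would come from the task's training split (e.g. ASSIN2 RTE).
example_pool = [
    {"premise": "O filme estreou ontem.", "hypothesis": "O filme já estreou.", "label": "entailment"},
    {"premise": "Ela comprou um carro novo.", "hypothesis": "Ela vendeu o carro.", "label": "none"},
    {"premise": "O restaurante abre às 19h.", "hypothesis": "O restaurante abre à noite.", "label": "entailment"},
    {"premise": "Choveu durante toda a tarde.", "hypothesis": "A tarde foi ensolarada.", "label": "none"},
]

def build_prompt(test_item, shots, rng):
    """Sample few-shot examples at random for *this* test instance,
    unlike a fixed, hand-picked set reused for every instance."""
    selected = rng.sample(example_pool, k=shots)
    parts = [
        f"Premissa: {ex['premise']}\nHipótese: {ex['hypothesis']}\nResposta: {ex['label']}"
        for ex in selected
    ]
    parts.append(f"Premissa: {test_item['premise']}\nHipótese: {test_item['hypothesis']}\nResposta:")
    return "\n\n".join(parts)

def evaluate(test_set, model_fn, shots=2, seeds=(1, 2, 3)):
    """Run the whole test set once per seed and average the per-run accuracies."""
    run_scores = []
    for seed in seeds:
        rng = random.Random(seed)  # a different seed changes which examples are sampled
        correct = 0
        for item in test_set:
            prompt = build_prompt(item, shots, rng)
            prediction = model_fn(prompt)  # caller supplies the model's generation function
            correct += int(prediction.strip() == item["label"])
        run_scores.append(correct / len(test_set))
    return mean(run_scores)
```

Because a fresh set of examples is sampled for each test instance and each seed, the three runs genuinely differ, which is part of why such averaged scores are not directly comparable with single-run results obtained from one fixed, hand-picked prompt.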