jarodrigues committed on
Commit 97fadac · 1 Parent(s): 6cc0668

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -96,7 +96,7 @@ and you agreed not to use it for any commercial applications".
  - [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301): the OSCAR data set includes documents in more than one hundred languages, including Portuguese, and it is widely used in the literature. It is the result of a selection performed over the [Common Crawl](https://commoncrawl.org/) data set, crawled from the Web, that retains only pages whose metadata indicates permission to be crawled, that performs deduplication, and that removes some boilerplate, among other filters. Given that it does not discriminate between the Portuguese variants, we performed extra filtering by retaining only documents whose meta-data indicate the Internet country code top-level domain of Portugal. We used the January 2023 version of OSCAR, which is based on the November/December 2022 version of Common Crawl.
  - [DCEP](https://joint-research-centre.ec.europa.eu/language-technology-resources/dcep-digital-corpus-european-parliament_en): the Digital Corpus of the European Parliament is a multilingual corpus including documents in all official EU languages published on the European Parliament's official website. We retained its European Portuguese portion.
  - [Europarl](https://www.statmt.org/europarl/): the European Parliament Proceedings Parallel Corpus is extracted from the proceedings of the European Parliament from 1996 to 2011. We retained its European Portuguese portion.
- - [ParlamentoPT](https://www.parlamento.pt/): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.
+ - [ParlamentoPT](https://huggingface.co/datasets/PORTULAN/parlamento-pt): the ParlamentoPT is a data set we obtained by gathering the publicly available documents with the transcription of the debates in the Portuguese Parliament.
 
 
 
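
The OSCAR entry in this hunk describes an extra filtering step: keeping only documents whose metadata points to Portugal's country-code top-level domain. A minimal sketch of that kind of filter is given here, under stated assumptions. The README does not say which metadata field was used or how the filter was implemented, so the `url` key, the function names, and the sample records are hypothetical and only illustrate the idea of a `.pt` host check.

```python
from urllib.parse import urlsplit


def has_pt_tld(url: str) -> bool:
    """Return True when the URL's host name ends in the .pt country-code TLD."""
    # hostname is None for empty or scheme-less strings, so fall back to "".
    host = urlsplit(url).hostname or ""
    # Drop a trailing root dot ("exemplo.pt.") before checking the suffix.
    return host.lower().rstrip(".").endswith(".pt")


def keep_portugal_documents(documents):
    """Yield only documents whose (hypothetical) "url" field has a .pt host."""
    for doc in documents:
        if has_pt_tld(doc.get("url", "")):
            yield doc


# Made-up records, for illustration only.
sample = [
    {"url": "https://www.parlamento.pt/debates", "text": "..."},
    {"url": "https://example.com.br/noticia", "text": "..."},
    {"url": "https://example.org/pt/page", "text": "..."},
]
kept = list(keep_portugal_documents(sample))
```

The check looks at the host name only, so a `/pt/` path segment on another domain does not count, and a URL stored without a scheme is rejected because no host can be parsed from it.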