jsaizant committed on
Commit 8e4706f · verified · 1 Parent(s): 16b8dc7

Update README.md

  1. README.md +3 -3
README.md CHANGED
@@ -589,8 +589,8 @@ especially if the content originates from less-regulated sources or user-generat
  **How was the data collected?**
 
  This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
- - Web-sourced datasets with some preprocessing available under permissive license (p.e. Common Crawl).
- - Domain-specific or language-specific raw crawls, always respecting robots.txt (p.e. Spanish Crawling).
+ - Web-sourced datasets with some preprocessing available under permissive license.
+ - Domain-specific or language-specific raw crawls, always respecting robots.txt.
  - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
  (p.e. CATalog).
 
@@ -643,7 +643,7 @@ The original raw data was not kept.
 
  **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**
 
- Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for Spanish Crawling and CATalog,
+ Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated datasets,
  and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.
 
  #### Uses