Add data from "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s): LAMA (T-REx), LAMA (Google-RE), XSum, TIFU-short, TIFU-long, WikiBio, AMR-to-text, GLUE (BoolQ, CoLA, MNLI, MRPC, QNLI, RTE, SST-2, STS-B, WNLI)
Contaminated model(s): NA
Contaminated corpora: allenai/c4
Contaminated split(s): All test splits
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The approach is simple: exact-match search for evaluation examples in the corpus, with text normalized for capitalization and punctuation. For more details, see Section 4.2 of https://arxiv.org/abs/2104.08758.
For evidence of contamination, see the original paper.
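As a rough illustration of the kind of normalized exact-match check described above (this is a minimal sketch, not the paper's actual code; function names and the toy data are made up):

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def find_contamination(eval_examples, corpus_docs):
    """Return eval examples whose normalized text appears verbatim
    in any normalized corpus document."""
    normalized_docs = [normalize(doc) for doc in corpus_docs]
    hits = []
    for example in eval_examples:
        target = normalize(example)
        if any(target in doc for doc in normalized_docs):
            hits.append(example)
    return hits


# Toy demo with a one-document "corpus"
corpus = ["The quick brown fox jumps over the lazy dog."]
evals = ["quick brown FOX jumps", "no overlap here"]
print(find_contamination(evals, corpus))  # → ['quick brown FOX jumps']
```

At C4 scale the real check would use suffix arrays or hashed n-gram indices rather than a linear substring scan, but the normalization-then-exact-match logic is the same.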
Citation
Yes, here is the link:
URL: https://arxiv.org/pdf/2104.08758.pdf
Citation:
@article{dodge2021documenting,
title={Documenting large webtext corpora: A case study on the colossal clean crawled corpus},
author={Dodge, Jesse and Sap, Maarten and Marasovi{\'c}, Ana and Agnew, William and Ilharco, Gabriel and Groeneveld, Dirk and Mitchell, Margaret and Gardner, Matt},
journal={arXiv preprint arXiv:2104.08758},
year={2021}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Vishaal Udandarao
- Institution: University of Tuebingen, University of Cambridge
- Email: [email protected]
@vishaal27 Thank you! Merged :D