Add data from "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"

#6

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): LAMA (T-REx), LAMA (Google-RE), XSum, TIFU-short, TIFU-long, WikiBio, AMR-to-text, GLUE (BoolQ, CoLA, MNLI, MRPC, QNLI, RTE, SST-2, STS-B, WNLI)

Contaminated model(s): NA

Contaminated corpora: allenai/c4

Contaminated split(s): All test splits

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The approach is simple: the authors search for exact matches between evaluation examples and the corpus, after normalizing capitalization and punctuation. For more details, see Section 4.2 of https://arxiv.org/abs/2104.08758.
For evidence of contamination, see the original paper.
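The normalized exact-match check described above can be sketched roughly as follows. This is not the authors' implementation (their pipeline runs at C4 scale; see Section 4.2 of the paper); it is a minimal illustration, assuming "normalization" means lowercasing, stripping punctuation, and collapsing whitespace, and that a hit is a normalized evaluation example appearing verbatim inside a normalized corpus document:

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def contaminated_examples(eval_examples, corpus_docs):
    """Return evaluation examples whose normalized text occurs
    verbatim in any normalized corpus document."""
    norm_docs = [normalize(doc) for doc in corpus_docs]
    hits = []
    for example in eval_examples:
        norm_ex = normalize(example)
        if norm_ex and any(norm_ex in doc for doc in norm_docs):
            hits.append(example)
    return hits
```

For example, `contaminated_examples(["The cat sat, on the mat!"], ["the cat sat on the mat and more text"])` flags the example despite the differences in casing and punctuation. A production version would use hashing or suffix arrays rather than linear substring scans over the corpus.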

Citation

Yes, here is the link:
URL: https://arxiv.org/pdf/2104.08758.pdf
Citation:



@article{dodge2021documenting,
  title={Documenting large webtext corpora: A case study on the colossal clean crawled corpus},
  author={Dodge, Jesse and Sap, Maarten and Marasovi{\'c}, Ana and Agnew, William and Ilharco, Gabriel and Groeneveld, Dirk and Mitchell, Margaret and Gardner, Matt},
  journal={arXiv preprint arXiv:2104.08758},
  year={2021}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

  • Full name: Vishaal Udandarao
  • Institution: University of Tuebingen, University of Cambridge
  • Email: [email protected]
Workshop on Data Contamination org

@vishaal27 Thank you! Merged :D

Iker changed pull request status to merged
