📖 Benchmark Contamination Monitoring System
This system monitors potential contamination of benchmark datasets used to evaluate language models, checking them against several open-source corpora 🧐.
The system is released along with our paper *Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index*, which documents the methodology and findings in detail.
We welcome the community to submit new benchmarks for contamination analysis using the "Add New Benchmarks" tab.
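For intuition, here is a minimal, self-contained sketch of the kind of check the system performs: an entry is treated as potentially contaminated ("dirty") if a sufficiently long n-gram from it appears verbatim in a training corpus. This is only an illustration of the general idea, not the paper's implementation: the real system uses FM-Index-backed exact n-gram search over web-scale corpora, and the 13-token window, crude tokenizer, and toy corpus below are assumptions made for this sketch.

```python
import re

# Illustrative sketch only: a plain Python set stands in for the FM-Index,
# and the 13-token window is an assumed threshold, not necessarily the
# exact matching criterion used in the paper.

def tokenize(text):
    """Lowercased word tokens (a crude stand-in for the system's tokenization)."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus_docs, n=13):
    """Collect every n-gram that occurs in the (toy) training corpus."""
    index = set()
    for doc in corpus_docs:
        index.update(ngrams(tokenize(doc), n))
    return index

def is_dirty(entry_text, index, n=13):
    """Flag an entry as dirty if any of its n-grams occurs verbatim in the corpus."""
    return any(g in index for g in ngrams(tokenize(entry_text), n))

# Toy usage: a benchmark question that was copied verbatim into a corpus document.
corpus = ["A quiz site asked: what is the capital city of the country directly to the north of France? Answer below."]
index = build_index(corpus)
print(is_dirty("What is the capital city of the country directly to the north of France?", index))  # True
```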
Benchmark Contamination Bulletin
The Benchmark Contamination Bulletin presents contamination statistics for evaluation benchmarks across different data sources.
- Benchmarks analyzed in our paper are under the core source. Community-submitted benchmarks appear under the community source.
- The contamination rate is the percentage of benchmark entries flagged as dirty; a worked example of this calculation follows the table.
- The bulletin will be updated regularly to include contamination checks on newly released Common Crawl dumps.
Benchmark | Category | Pile-train Dirty (%) | DCLM-baseline Dirty (%) | CC-2025-05 Dirty (%) | CC-2025-08 Dirty (%)
---|---|---|---|---|---
MMLU | Knowledge and Reasoning | 13.2 | 28.4 | 13.5 | 9.0 |
MMLU-Pro | Knowledge and Reasoning | 5.5 | 16.2 | 7.1 | 5.4 |
BBH | Knowledge and Reasoning | 0.0 | 0.1 | 1.4 | 1.4 |
AGIEval | Knowledge and Reasoning | 0.8 | 3.1 | 2.7 | 3.6 |
GPQA | Knowledge and Reasoning | 0.0 | 0.0 | 0.9 | 2.0 |
HLE | Knowledge and Reasoning | 0.0 | 0.3 | 0.1 | 0.0 |
AIME_2024 | Math | 0.0 | 0.0 | 10.0 | 3.3 |
GSM8K | Math | 0.0 | 0.4 | 5.0 | 0.8 |
MATH-500 | Math | 0.6 | 3.2 | 0.6 | 7.8 |
MGSM | Math | 0.0 | 0.0 | 5.6 | 1.6 |
HumanEval | Code | 0.0 | 0.0 | 0.0 | 0.6 |
HumanEval+ | Code | 0.0 | 0.0 | 0.0 | 0.6 |
LiveCodeBench | Code | 0.0 | 0.0 | 0.0 | 0.0 |
SWE-bench | Code | 0.0 | 0.0 | 0.2 | 0.2 |
MBPP | Code | 0.0 | 0.4 | 1.0 | 1.4 |
ARC-Challenge | Commonsense Understanding | 1.8 | 34.1 | 11.9 | 4.0 |
ARC-Easy | Commonsense Understanding | 1.3 | 31.7 | 5.4 | 9.5 |
CSQA | Commonsense Understanding | 0.1 | 1.0 | 0.1 | 0.1 |
HellaSwag | Commonsense Understanding | 0.0 | 0.0 | 0.0 | 0.0 |
OpenbookQA | Commonsense Understanding | 10.8 | 15.6 | 14.6 | 30.2 |
Social IQa | Commonsense Understanding | 0.0 | 0.5 | 0.2 | 4.4 |
WinoGrande | Commonsense Understanding | 0.0 | 0.0 | 0.0 | 0.0 |
CoQA | Reading Comprehension | 8.0 | 18.4 | 7.4 | 8.8 |
SQuAD | Reading Comprehension | 2.8 | 40.1 | 2.7 | 33.0 |
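For concreteness, each reported rate is simply the share of entries flagged as dirty out of all entries in the benchmark. A minimal sketch with hypothetical counts (not drawn from the table above):

```python
# Hypothetical counts for illustration only; not taken from the table above.
dirty_entries, total_entries = 90, 1000
contamination_rate = 100.0 * dirty_entries / total_entries
print(f"{contamination_rate:.1f}% dirty")  # -> 9.0% dirty
```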
Add Your Own Benchmarks for Contamination Checking
You can use this form to submit a benchmark for contamination checking. Submissions may include either a direct upload or a reference to a publicly available dataset on Hugging Face.
Submission Guidelines:
- Benchmark Name: Provide a name for your benchmark.
- Contributor: Enter your name or affiliation.
- Data Source:
  - Upload a `.jsonl` file containing your benchmark entries, or
  - Specify a Hugging Face dataset path (`author/benchmark-name`) along with the appropriate split (e.g., `test`, `validation`); a minimal example of both options is sketched after this list.
- Field Name: Indicate the field to analyze for contamination:
  - For question-answering datasets: use the question field.
  - For language understanding tasks: use the context or passage field.
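As a rough sketch of what a submission's data might look like, the snippet below writes a minimal `.jsonl` benchmark file and shows how an equivalent Hugging Face dataset would be referenced. The `question`/`answer` schema, the `author/benchmark-name` path, and the `test` split are placeholders taken from the guidelines above, not real resources; replace them with your own benchmark's details.

```python
import json

from datasets import load_dataset  # pip install datasets

# Option 1: a minimal .jsonl upload, one benchmark entry per line.
# The "question"/"answer" schema is an assumption; the field you point the
# checker at (here "question") is what gets analyzed for contamination.
entries = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Which planet is known as the Red Planet?", "answer": "Mars"},
]
with open("my_benchmark.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Option 2: reference a public Hugging Face dataset instead of uploading.
# "author/benchmark-name" is a placeholder path; use your dataset's actual ID.
ds = load_dataset("author/benchmark-name", split="test")
print(ds[0]["question"])  # the field that would be analyzed
```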
What Happens Next:
Once submitted, your benchmark will be queued for analysis. Results will be published in the community section of the bulletin.
Processing time may vary depending on the dataset format and size. You can check the results by navigating to the Bulletin tab and selecting the community source, then clicking Refresh.