📖 Benchmark Contamination Monitoring System

This system monitors potential contamination of the benchmark datasets used to evaluate language models by checking them against various open-source corpora 🧐.

The system is released along with our paper "Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index", which documents the methodology and findings in detail.

We welcome the community to submit new benchmarks for contamination analysis using the "Add New Benchmarks" tab.

Benchmark Contamination Bulletin

The Benchmark Contamination Bulletin presents contamination statistics for evaluation benchmarks across different data sources.

  • Benchmarks analyzed in our paper are listed under the core source. Community-submitted benchmarks appear under the community source.
  • The contamination rate is the percentage of a benchmark's entries that are flagged as dirty with respect to a given corpus (see the paper for the exact criterion).
  • The bulletin will be updated regularly to include contamination checks on newly released Common Crawl dumps.
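As a rough illustration of the contamination rate defined above, the sketch below computes the percentage of dirty entries from a list of per-entry flags. The function name `contamination_rate` and the boolean input are hypothetical and chosen only for this example; how an entry is judged dirty is described in the paper and is not reproduced here. Rounding to one decimal is an assumption made to match the precision shown in the bulletin.

```python
def contamination_rate(is_dirty: list[bool]) -> float:
    """Return the percentage of entries flagged dirty, rounded to one decimal.

    `is_dirty` is a hypothetical input: one boolean per benchmark entry,
    True if that entry was flagged as dirty for the corpus in question.
    """
    if not is_dirty:
        raise ValueError("benchmark has no entries")
    # sum() counts the True flags; divide by the total number of entries.
    return round(100.0 * sum(is_dirty) / len(is_dirty), 1)


# Made-up flags for a 30-entry benchmark with 3 dirty entries.
flags = [True] * 3 + [False] * 27
print(contamination_rate(flags))  # 10.0
```

Because the rate is a simple ratio, small benchmarks move in coarse steps: with 30 entries, each additional dirty entry shifts the rate by about 3.3 percentage points.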
| Benchmark | Category | Pile-train Dirty (%) | DCLM-baseline Dirty (%) | CC-2025-05 Dirty (%) | CC-2025-08 Dirty (%) |
|---|---|---|---|---|---|
| MMLU | Knowledge and Reasoning | 13.2 | 28.4 | 13.5 | 9.0 |
| MMLU-Pro | Knowledge and Reasoning | 5.5 | 16.2 | 7.1 | 5.4 |
| BBH | Knowledge and Reasoning | 0.0 | 0.1 | 1.4 | 1.4 |
| AGIEval | Knowledge and Reasoning | 0.8 | 3.1 | 2.7 | 3.6 |
| GPQA | Knowledge and Reasoning | 0.0 | 0.0 | 0.9 | 2.0 |
| HLE | Knowledge and Reasoning | 0.0 | 0.3 | 0.1 | 0.0 |
| AIME_2024 | Math | 0.0 | 0.0 | 10.0 | 3.3 |
| GSM8K | Math | 0.0 | 0.4 | 5.0 | 0.8 |
| MATH-500 | Math | 0.6 | 3.2 | 0.6 | 7.8 |
| MGSM | Math | 0.0 | 0.0 | 5.6 | 1.6 |
| HumanEval | Code | 0.0 | 0.0 | 0.0 | 0.6 |
| HumanEval+ | Code | 0.0 | 0.0 | 0.0 | 0.6 |
| LiveCodeBench | Code | 0.0 | 0.0 | 0.0 | 0.0 |
| SWE-bench | Code | 0.0 | 0.0 | 0.2 | 0.2 |
| MBPP | Code | 0.0 | 0.4 | 1.0 | 1.4 |
| ARC-Challenge | Commonsense Understanding | 1.8 | 34.1 | 11.9 | 4.0 |
| ARC-Easy | Commonsense Understanding | 1.3 | 31.7 | 5.4 | 9.5 |
| CSQA | Commonsense Understanding | 0.1 | 1.0 | 0.1 | 0.1 |
| HellaSwag | Commonsense Understanding | 0.0 | 0.0 | 0.0 | 0.0 |
| OpenbookQA | Commonsense Understanding | 10.8 | 15.6 | 14.6 | 30.2 |
| Social IQa | Commonsense Understanding | 0.0 | 0.5 | 0.2 | 4.4 |
| WinoGrande | Commonsense Understanding | 0.0 | 0.0 | 0.0 | 0.0 |
| CoQA | Reading Comprehension | 8.0 | 18.4 | 7.4 | 8.8 |
| SQuAD | Reading Comprehension | 2.8 | 40.1 | 2.7 | 33.0 |