📖 Benchmark Contamination Monitoring System
This system monitors potential contamination of benchmark datasets used to evaluate language models, checking them against several open-source corpora 🧐.
The system is released along with our paper *Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index*, which documents the methodology and findings in detail.
We welcome the community to submit new benchmarks for contamination analysis using the "Add New Benchmarks" tab.
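For intuition, here is a minimal, self-contained sketch of the kind of check the system performs: an entry is treated as potentially contaminated ("dirty") if a sufficiently long n-gram from it appears verbatim in a training corpus. This is only an illustration of the general idea, not the paper's implementation: the real system uses FM-Index-backed exact n-gram search over web-scale corpora, and the 13-token window, crude tokenizer, and toy corpus below are assumptions made for this sketch.

```python
import re

# Illustrative sketch only: a plain Python set stands in for the FM-Index,
# and the 13-token window is an assumed threshold, not necessarily the
# exact matching criterion used in the paper.

def tokenize(text):
    """Lowercased word tokens (a crude stand-in for the system's tokenization)."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus_docs, n=13):
    """Collect every n-gram that occurs in the (toy) training corpus."""
    index = set()
    for doc in corpus_docs:
        index.update(ngrams(tokenize(doc), n))
    return index

def is_dirty(entry_text, index, n=13):
    """Flag an entry as dirty if any of its n-grams occurs verbatim in the corpus."""
    return any(g in index for g in ngrams(tokenize(entry_text), n))

# Toy usage: a benchmark question that was copied verbatim into a corpus document.
corpus = ["A quiz site asked: what is the capital city of the country directly to the north of France? Answer below."]
index = build_index(corpus)
print(is_dirty("What is the capital city of the country directly to the north of France?", index))  # True
```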
Benchmark Contamination Bulletin
The Benchmark Contamination Bulletin presents contamination statistics for evaluation benchmarks across different data sources.
- Benchmarks analyzed in our paper are under the core source. Community-submitted benchmarks appear under the community source.
- The contamination rate is the percentage of benchmark entries flagged as dirty; a worked example of this calculation follows the table.
- The bulletin will be updated regularly to include contamination checks on newly released Common Crawl dumps.
Benchmark | Category | Pile-train Dirty (%) | DCLM-baseline Dirty (%) | CC-2025-05 Dirty (%) | CC-2025-08 Dirty (%)
---|---|---|---|---|---
MMLU | Knowledge and Reasoning | 13.2 | 28.4 | 13.5 | 9.0 |
MMLU-Pro | Knowledge and Reasoning | 5.5 | 16.2 | 7.1 | 5.4 |
BBH | Knowledge and Reasoning | 0.0 | 0.1 | 1.4 | 1.4 |
AGIEval | Knowledge and Reasoning | 0.8 | 3.1 | 2.7 | 3.6 |
GPQA | Knowledge and Reasoning | 0.0 | 0.0 | 0.9 | 2.0 |
HLE | Knowledge and Reasoning | 0.0 | 0.3 | 0.1 | 0.0 |
AIME_2024 | Math | 0.0 | 0.0 | 10.0 | 3.3 |
GSM8K | Math | 0.0 | 0.4 | 5.0 | 0.8 |
MATH-500 | Math | 0.6 | 3.2 | 0.6 | 7.8 |
MGSM | Math | 0.0 | 0.0 | 5.6 | 1.6 |
HumanEval | Code | 0.0 | 0.0 | 0.0 | 0.6 |
HumanEval+ | Code | 0.0 | 0.0 | 0.0 | 0.6 |
LiveCodeBench | Code | 0.0 | 0.0 | 0.0 | 0.0 |
SWE-bench | Code | 0.0 | 0.0 | 0.2 | 0.2 |
MBPP | Code | 0.0 | 0.4 | 1.0 | 1.4 |
ARC-Challenge | Commonsense Understanding | 1.8 | 34.1 | 11.9 | 4.0 |
ARC-Easy | Commonsense Understanding | 1.3 | 31.7 | 5.4 | 9.5 |
CSQA | Commonsense Understanding | 0.1 | 1.0 | 0.1 | 0.1 |
HellaSwag | Commonsense Understanding | 0.0 | 0.0 | 0.0 | 0.0 |
OpenbookQA | Commonsense Understanding | 10.8 | 15.6 | 14.6 | 30.2 |
Social IQa | Commonsense Understanding | 0.0 | 0.5 | 0.2 | 4.4 |
WinoGrande | Commonsense Understanding | 0.0 | 0.0 | 0.0 | 0.0 |
CoQA | Reading Comprehension | 8.0 | 18.4 | 7.4 | 8.8 |
SQuAD | Reading Comprehension | 2.8 | 40.1 | 2.7 | 33.0 |
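For concreteness, each reported rate is simply the share of entries flagged as dirty out of all entries in the benchmark. A minimal sketch with hypothetical counts (not drawn from the table above):

```python
# Hypothetical counts for illustration only; not taken from the table above.
dirty_entries, total_entries = 90, 1000
contamination_rate = 100.0 * dirty_entries / total_entries
print(f"{contamination_rate:.1f}% dirty")  # -> 9.0% dirty
```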
Add Your Own Benchmarks for Contamination Checking
You can use this form to submit a benchmark for contamination checking. Submissions may include either a direct upload or a reference to a publicly available dataset on Hugging Face.
Submission Guidelines:
- Benchmark Name: Provide a name for your benchmark.
- Contributor: Enter your name or affiliation.
- Data Source:
  - Upload a `.jsonl` file containing your benchmark entries, or
  - Specify a Hugging Face dataset path (`author/benchmark-name`) along with the appropriate split (e.g., `test`, `validation`); a minimal example of both options is sketched after this list.
- Field Name: Indicate the field to analyze for contamination:
  - For question-answering datasets: use the question field.
  - For language understanding tasks: use the context or passage field.
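As a rough sketch of what a submission's data might look like, the snippet below writes a minimal `.jsonl` benchmark file and shows how an equivalent Hugging Face dataset would be referenced. The `question`/`answer` schema, the `author/benchmark-name` path, and the `test` split are placeholders taken from the guidelines above, not real resources; replace them with your own benchmark's details.

```python
import json

from datasets import load_dataset  # pip install datasets

# Option 1: a minimal .jsonl upload, one benchmark entry per line.
# The "question"/"answer" schema is an assumption; the field you point the
# checker at (here "question") is what gets analyzed for contamination.
entries = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Which planet is known as the Red Planet?", "answer": "Mars"},
]
with open("my_benchmark.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Option 2: reference a public Hugging Face dataset instead of uploading.
# "author/benchmark-name" is a placeholder path; use your dataset's actual ID.
ds = load_dataset("author/benchmark-name", split="test")
print(ds[0]["question"])  # the field that would be analyzed
```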
What Happens Next:
Once submitted, your benchmark will be queued for analysis. Results will be published in the community section of the bulletin.
Processing time may vary depending on the dataset format and size. You can check the results by navigating to the Bulletin tab and selecting the community source, then clicking Refresh.