494 expert questions. 5 reasoning categories. 7 frontier models tested. The recall-vs-reasoning gap reveals what your AI actually knows — and where it fails.
When you choose an LLM for your marketing team, you probably look at MMLU, GSM8K, or generic “AI rankings” articles. Those benchmarks measure general knowledge, math, and code — not whether a model understands attribution windows, brand-search incrementality, or how to design an A/B test that survives auction overlap.
A model can score 95% on MMLU and still recommend doubling Brand Search budget at 18× ROAS — a classic non-incremental trap that destroys capital. We've seen frontier models confidently propose “switch to Last-Click attribution for cleanest reporting” — exactly the wrong move for a brand investing in upper-funnel.
PM-AGI is the benchmark we built to measure what actually matters: the reasoning your team relies on every day.
PM-AGI v2 splits performance-marketing reasoning into five distinct skills. Each is tested separately, scored independently, and reported on the leaderboard.
- **Recall (platform mechanics, current best practices).** Tests whether the model knows how Meta and Google Ads actually work today: AEM event limits, tROAS Learning Phase mechanics, attribution defaults.
- **Adversarial.** Tests whether the model rejects obsolete advice that still pattern-matches in older training data: SKAGs, narrow lookalike stacking, "always optimize CTR".
- **Diagnostic.** Given anomalous campaign data (a CPA spike, a ROAS drop, tracking weirdness), reason through audience, creative, bidding, and tracking causes to identify the actual root cause.
- **Quantitative.** Multi-step quantitative reasoning over LTV, CAC payback, marginal ROAS, and channel-mix decisions. Strong models state their assumptions; weak models skip the arithmetic. A worked sketch follows this list.
- **Creative strategy.** Design rigorous experiments: pre-committed metrics, statistical power, randomization, decision rules. The hardest category for every frontier LLM.
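To make the quantitative category concrete, here is a minimal worked sketch of the arithmetic these questions demand: CAC payback under retention decay, and marginal versus average ROAS. All figures and variable names are invented for illustration and do not come from the benchmark itself.

```python
# Hypothetical worked example for the quantitative category.
# Every input below is invented for illustration; real benchmark
# questions supply their own figures and expect stated assumptions.

cac = 120.0            # blended customer acquisition cost ($)
monthly_margin = 18.0  # contribution margin per customer per month ($)
retention = 0.92       # flat monthly retention rate (a simplifying assumption)

# CAC payback: months until cumulative retained margin covers CAC.
months, cumulative, surviving = 0, 0.0, 1.0
while cumulative < cac:
    months += 1
    cumulative += monthly_margin * surviving
    surviving *= retention
print(f"CAC payback: {months} months (cumulative margin ${cumulative:.0f})")

# Marginal vs. average ROAS: the next dollar, not the blended dollar.
# Scaling spend $10k -> $12k moved revenue $60k -> $66k.
marginal_roas = (66_000 - 60_000) / (12_000 - 10_000)  # 3.0x
average_roas = 66_000 / 12_000                          # 5.5x
print(f"average ROAS {average_roas:.1f}x, marginal ROAS {marginal_roas:.1f}x")
```

The marginal/average split is exactly the trap in the 18× Brand Search example above: a high average ROAS can hide a marginal ROAS well below target.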
Tested on Azure OpenAI (GPT-5.x family), Google (Gemini 2.5 + 3 family), and Anthropic (self-eval, with explicit caveats). Confidence interval ±2.5pp — treat differences under 5pp as ties.
| # | Model | Provider | Overall | Recall | Adversarial | Diagnostic | Quantitative | Creative |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 98.5% | 97.0% | 100.0% | 99.7% | 100.0% | 99.3% |
| 2 | GPT-5.4 | Azure OpenAI | 97.4% | 99.3% | 100.0% | 97.2% | 95.3% | 89.0% |
| 3 | GPT-5.2 | Azure OpenAI | 97.4% | 98.9% | 100.0% | 98.4% | 94.6% | 90.0% |
| 4 | Gemini 2.5 Flash | Google | 92.2% | 97.7% | 100.0% | 89.0% | 83.6% | 74.3% |
| 5 | Gemini 3 Flash Preview | Google | 91.7% | 97.0% | 100.0% | 85.4% | 89.5% | 69.4% |
| 6 | Gemini 2.5 Pro | Google | 87.7% | 96.4% | 98.8% | 87.1% | 59.7% | 73.1% |
| 7 | GPT-5.5 | Azure OpenAI | 80.0% | 95.8% | 100.0% | 59.9% | 72.0% | 24.5% |
Note on Claude Opus 4.7: self-evaluated under heavy contamination, since the same model authored ~80% of v2 questions. Reported as an upper bound, not a fair comparison.
Aggregate benchmark scores hide where models actually break. The recall-vs-reasoning gap is the diagnostic signal that matters. We measure it for every model on the leaderboard.
Recall vs. Creative Strategy — same model, different skill
494 questions written by performance-marketing professionals. Each open-ended question has a 5–10 point rubric. Adversarial questions specifically target outdated playbooks. Fully versioned, MIT-licensed, on Hugging Face.
Open-ended responses are scored by Gemini 2.5 Flash as the universal judge: the same judge grades every candidate, which keeps cross-model comparison consistent (see the FAQ on judge bias). A multi-judge ensemble is planned for v3.
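For readers who want to see the shape of rubric-based judging, here is a minimal sketch of a universal-judge scoring loop. The prompt wording and the call_judge helper are hypothetical stand-ins; the actual judge prompts ship in the repo.

```python
# Sketch of universal-judge scoring. call_judge() is a hypothetical
# stand-in for one API call to the shared judge model (Gemini 2.5 Flash
# in v2); the real judge prompts and plumbing live in the repo.

def call_judge(prompt: str) -> str:
    """Hypothetical placeholder: wire this to your judge provider."""
    raise NotImplementedError

def score_response(question: str, rubric: list[str], answer: str) -> float:
    """Grade one answer against its rubric with the single shared judge."""
    prompt = (
        "You are grading a performance-marketing answer.\n"
        f"Question: {question}\n"
        "Rubric (1 point per item):\n"
        + "\n".join(f"- {item}" for item in rubric)
        + f"\nAnswer: {answer}\n"
        "Reply with only the number of rubric points earned."
    )
    return float(call_judge(prompt)) / len(rubric)  # normalize to 0..1
```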
All 494 questions, all 7 result JSONs, all judge prompts, all scoring rubrics are public. Run the eval on your own model in ~30 minutes via the open-source evaluate.py. Submit results via PR.
Three steps. Works with any provider that exposes an OpenAI-, Google-, or Anthropic-compatible API.
```bash
git clone https://github.com/Hawky-ai/pm-AGI
cd pm-AGI
pip install -r requirements.txt

python evaluate.py \
  --model YOUR_MODEL \
  --provider openai \
  --api-key $OPENAI_API_KEY
```

The result JSON saves to results/. Submit via PR or the Hugging Face Space form. Approved results appear on the leaderboard within 24 hours.
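If you want to sanity-check a run before opening the PR, a few lines of Python will print the per-category breakdown. The overall and categories keys below are assumptions made for this sketch, not a documented schema; check the actual keys in your results/ file.

```python
import json
from pathlib import Path

# Assumed result schema, for illustration only:
#   {"model": str, "overall": float, "categories": {name: float, ...}}
# Inspect your own results/ JSON for the actual field names.
result_file = next(Path("results").glob("*.json"))
result = json.loads(result_file.read_text())

print(f"{result['model']}: overall {result['overall']:.1%}")
for category, score in sorted(result["categories"].items(), key=lambda kv: kv[1]):
    print(f"  {category:<14} {score:.1%}")  # weakest categories first
```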
Claude was self-evaluated under heavy contamination: the same Claude Opus 4.7 model authored ~80% of v2 questions. We report it transparently as a self-eval (98.5%, with caveats) but treat it as an upper bound, not a fair comparison. We invite the community to run a fair Claude evaluation; we'll add it to the leaderboard.
Quarterly dataset refreshes, plus a held-out v3 question set deployed after each major model release. We also publish dataset diffs so the community can verify what's been added.
Possibly a small bias. We chose it for cost and consistency. v3 will use a multi-judge ensemble (Gemini Flash + GPT-4o + Claude Haiku) and report the median.
No. 95% confidence interval on overall scores is ±2.5 percentage points. Treat differences <5pp as ties.
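As a sanity check on that margin, a normal-approximation binomial interval over 494 questions lands right around ±2.5pp. The sketch below is our reconstruction, not the benchmark's published derivation.

```python
import math

# Normal-approximation 95% CI for an accuracy score over n = 494 questions.
# Our reconstruction of the stated margin, not the repo's own calculation.
n = 494
for p in (0.80, 0.90, 0.95):  # representative overall scores
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"score {p:.0%}: 95% CI half-width +/-{half_width:.1%}")
# Near a 90% score the half-width is ~2.6pp, matching the stated +/-2.5pp
# and the advice to read any gap under ~5pp as a tie.
```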
Open a PR against benchmark/dataset.json on the GitHub repo. Each question needs a hypothesis (what it tests), plus either a rubric (for open-ended questions) or high-quality distractor options (for MCQ). A hypothetical entry is sketched below.
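For orientation, here is a hypothetical entry showing the shape a contribution might take. The field names are our guess, not the documented schema; mirror the existing entries in benchmark/dataset.json, which are canonical.

```python
# Hypothetical contribution entry; field names are a guess at the schema.
# Copy the structure of existing entries in benchmark/dataset.json instead.
candidate_question = {
    "id": "quant-0xx",  # placeholder id
    "category": "quantitative",
    "hypothesis": (
        "Model distinguishes marginal ROAS from average ROAS "
        "when deciding whether to keep scaling a campaign."
    ),
    "question": (
        "Spend rose from $10k to $12k and revenue from $60k to $66k. "
        "Should you keep scaling against a 4.0x ROAS target?"
    ),
    "rubric": [
        "Computes marginal ROAS (3.0x), not just average ROAS (5.5x)",
        "Notes marginal ROAS is below the 4.0x target",
        "Recommends holding or cutting spend, with assumptions stated",
    ],
}
```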
Yes, in v3 (Q4 2026): TikTok Ads, LinkedIn Ads, retail media. Reach out if you want to co-author specific platform expansions.
Hawky.ai offers AI strategy consulting for performance marketing teams. The benchmark is open-source; consulting helps you operationalize what it reveals. Email pm-agi@hawky.ai.
Open source. MIT-licensed. 494 questions. 30 minutes to a result.
MIT License · Open Source · pm-agi@hawky.ai