An Open Benchmark · v2 · Live

The first reasoning benchmark for performance marketing AI.

494 expert questions. 5 reasoning categories. 7 frontier models tested. The recall-vs-reasoning gap reveals what your AI actually knows — and where it fails.

Score signatures · all 7 models · 494 questions · v2.1

Model                    Provider       Recall  Adversarial  Diagnostic  Quantitative  Creative  Overall
Claude Opus 4.7          Anthropic      97      100          100         100           99        98.5%
GPT-5.4                  Azure OpenAI   99      100          97          95            89        97.4%
GPT-5.2                  Azure OpenAI   99      100          98          95            90        97.4%
Gemini 2.5 Flash         Google         98      100          89          84            74        92.2%
Gemini 3 Flash Preview   Google         97      100          85          90            69        91.7%
Gemini 2.5 Pro           Google         96      99           87          60            73        87.7%
GPT-5.5                  Azure OpenAI   96      100          60          72            25        80.0%

Sorted by overall score · ±2.5pp 95% CI · see the live leaderboard for per-model breakdowns and evaluation caveats
Why this matters

“Smart AI” doesn't mean good marketing AI.

When you choose an LLM for your marketing team, you probably look at MMLU, GSM8K, or generic “AI rankings” articles. Those benchmarks measure general knowledge, math, and code — not whether a model understands attribution windows, brand-search incrementality, or how to design an A/B test that survives auction overlap.

A model can score 95% on MMLU and still recommend doubling Brand Search budget at 18× ROAS — a classic non-incremental trap that destroys capital. We've seen frontier models confidently propose “switch to Last-Click attribution for cleanest reporting” — exactly the wrong move for a brand investing in upper-funnel.

PM-AGI is the benchmark we built to measure what actually matters: the reasoning your team relies on every day.

What we measure

Five reasoning categories, scored independently.

PM-AGI v2 splits performance-marketing reasoning into five distinct skills. Each is tested separately, scored independently, and reported on the leaderboard.

01 / 05
Knowledge Recall

Platform mechanics, current best practices

Tests whether the model knows how Meta and Google Ads actually work today — AEM event limits, tROAS Learning Phase mechanics, attribution defaults.

Example question
Under Meta's AEM framework introduced after iOS 14.5, how many web conversion events can a single domain prioritize?
Score by model (n=6): GPT-5.4 99 · GPT-5.2 99 · Gemini 2.5 Flash 98 · Gemini 3 Flash Preview 97 · Gemini 2.5 Pro 96 · GPT-5.5 96
Range · 6 models · 95–99%
02 / 05
Adversarial / Trap

Resistance to outdated 2019-era playbooks

Tests whether the model rejects obsolete advice that still pattern-matches in older training data — SKAGs, narrow lookalike stacking, 'always optimize CTR'.

Example question
Should you stack 5–7 detailed interest audiences in one ad set to maximize specificity?
Score by model (n=6): GPT-5.4 100 · GPT-5.2 100 · Gemini 2.5 Flash 100 · Gemini 3 Flash Preview 100 · GPT-5.5 100 · Gemini 2.5 Pro 99
Range · 6 models · 98–100%
03 / 05
Diagnostic Reasoning

Multi-step root-cause analysis

Given anomalous campaign data (CPA spike, ROAS drop, tracking weirdness), reason through audience, creative, bidding, and tracking causes to identify the actual root cause.

Example question
ASC budget was raised 4× overnight. Within 36 hours: CPM up 60%, CPA up 110%, frequency 1.4 (low), CTR flat. Diagnose.
Score by model (n=6): GPT-5.2 98 · GPT-5.4 97 · Gemini 2.5 Flash 89 · Gemini 2.5 Pro 87 · Gemini 3 Flash Preview 85 · GPT-5.5 60
Range · 6 models · 60–99%
04 / 05
Quantitative Tradeoffs

Budget allocation, math, stated assumptions

Multi-step quantitative reasoning over LTV, CAC payback, marginal ROAS, channel-mix decisions. Strong models state their assumptions; weak models skip the arithmetic.

Example question
$500K Meta budget across two ASCs. CFO offers $100K. CAC payback target 60 days. Provide a quantitative recommendation with stated assumptions.
Score by model (n=6): GPT-5.4 95 · GPT-5.2 95 · Gemini 3 Flash Preview 90 · Gemini 2.5 Flash 84 · GPT-5.5 72 · Gemini 2.5 Pro 60
Range · 6 models · 60–100%
05 / 05
Creative & Experiment Design

A/B test methodology, experiment rigor

Design rigorous experiments — pre-committed metrics, statistical power, randomization, decision rules. The hardest category for every frontier LLM.

Example question
Design an experiment to test 9-second vs 30-second video on Reels. Brand: skincare DTC. Budget: $80K over 30 days.
Score by model (n=6): GPT-5.2 90 · GPT-5.4 89 · Gemini 2.5 Flash 74 · Gemini 2.5 Pro 73 · Gemini 3 Flash Preview 69 · GPT-5.5 25
Range · 6 models · 24–99%
Live leaderboard preview

Results from 7 frontier models.

Tested on Azure OpenAI (GPT-5.x family), Google (Gemini 2.5 + 3 family), and Anthropic (self-eval, with explicit caveats). Confidence interval ±2.5pp — treat differences under 5pp as ties.

huggingface.co/spaces/Hawky-ai/pm-agi-leaderboard · live

#  Model                    Provider       Overall  Recall  Adversarial  Diagnostic  Quantitative  Creative
1  Claude Opus 4.7          Anthropic      98.5%    97.0%   100.0%       99.7%       100.0%        99.3%
2  GPT-5.4                  Azure OpenAI   97.4%    99.3%   100.0%       97.2%       95.3%         89.0%
3  GPT-5.2                  Azure OpenAI   97.4%    98.9%   100.0%       98.4%       94.6%         90.0%
4  Gemini 2.5 Flash         Google         92.2%    97.7%   100.0%       89.0%       83.6%         74.3%
5  Gemini 3 Flash Preview   Google         91.7%    97.0%   100.0%       85.4%       89.5%         69.4%
6  Gemini 2.5 Pro           Google         87.7%    96.4%   98.8%        87.1%       59.7%         73.1%
7  GPT-5.5                  Azure OpenAI   80.0%    95.8%   100.0%       59.9%       72.0%         24.5%

Note on Claude Opus 4.7: Self-evaluated under heavy contamination — the same model authored ~80% of the v2 questions. Reported as an upper bound, not a fair comparison.

The load-bearing insight

The same frontier model can score 95% on knowledge and 24% on creative strategy.

Aggregate benchmark scores hide where models actually break. The recall-vs-reasoning gap is the diagnostic signal that matters. We measure it for every model on the leaderboard.

Recall vs. Creative Strategy — same model, different skill

GPT-5.4: Recall 99.3% · Creative 89.0%
Gemini 2.5 Flash: Recall 97.7% · Creative 74.3%
GPT-5.5: Recall 95.8% · Creative 24.5%
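
As a quick illustration, the gap is simply the recall score minus the creative score, computed here from the leaderboard values above:

# Recall-vs-creative gap, using the leaderboard values shown above.
scores = {
    "GPT-5.4":          {"recall": 99.3, "creative": 89.0},
    "Gemini 2.5 Flash": {"recall": 97.7, "creative": 74.3},
    "GPT-5.5":          {"recall": 95.8, "creative": 24.5},
}
for model, s in scores.items():
    print(f"{model}: gap = {s['recall'] - s['creative']:.1f}pp")
# GPT-5.4 10.3pp · Gemini 2.5 Flash 23.4pp · GPT-5.5 71.3pp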
Methodology

Built for rigor, open for replication.

Expert dataset

Authored by performance marketers

494 questions written by performance-marketing professionals. Each open-ended question has a 5–10 point rubric. Adversarial questions specifically target outdated playbooks. Fully versioned, MIT-licensed, on Hugging Face.

Total questions: 494
Meta Ads: 227
Google Ads: 227
Critical thinking: 20
Action-based: 20
License: MIT
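
A minimal sketch of pulling the dataset, assuming it is published under the Hawky-ai org as pm-agi-benchmark (see the reproducibility card below) and that each row carries a category label — the exact repo id, split, and column names may differ:

from collections import Counter
from datasets import load_dataset

# Assumed repo id and split; check the Hugging Face page for the exact names.
ds = load_dataset("Hawky-ai/pm-agi-benchmark", split="train")

print(len(ds))                    # expect 494 questions
print(Counter(ds["category"]))    # "category" is an assumed column name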
Single judge · the unlock

Gemini 2.5 Flash, every model

Open-ended responses are scored by Gemini 2.5 Flash as a single universal judge — the same judge for every candidate — which removes per-model self-judging and keeps cross-model comparison consistent. A multi-judge ensemble is planned for v3.

Judge model: Gemini 2.5 Flash
Scoring: 5–10pt rubric
MCQ accuracy: exact-match
Confidence interval: ±2.5pp
v3 plan: multi-judge ensemble
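
For illustration, a minimal sketch of what rubric judging can look like — it assumes a hypothetical call_judge() helper that wraps the Gemini 2.5 Flash API; the repo's actual judge prompts and parsing live in evaluate.py and may differ:

import json, re

def score_open_ended(question, rubric, candidate_answer, call_judge):
    # call_judge is a hypothetical helper: it sends the prompt to the
    # judge model (Gemini 2.5 Flash) and returns its text reply.
    prompt = (
        "Grade the answer against the rubric. Award one point per satisfied item.\n"
        f"Question: {question}\n"
        f"Rubric: {json.dumps(rubric)}\n"
        f"Answer: {candidate_answer}\n"
        'Reply with JSON only: {"points_awarded": <integer>}'
    )
    reply = call_judge(prompt)
    match = re.search(r"\{.*\}", reply, re.DOTALL)   # tolerate extra prose around the JSON
    points = json.loads(match.group(0))["points_awarded"]
    return points / len(rubric)                       # normalize to a 0–1 score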
Reproducible

Public dataset, prompts, scoring

All 494 questions, all 7 result JSONs, all judge prompts, all scoring rubrics are public. Run the eval on your own model in ~30 minutes via the open-source evaluate.py. Submit results via PR.

Dataset on HF: pm-agi-benchmark
Result JSONs: 7 public
Eval runtime: ~30 min
Submit via: GitHub PR
Refresh cadence: quarterly
Run it yourself

Test your model in 30 minutes.

Three steps. Works with any provider that exposes an OpenAI-, Google-, or Anthropic-compatible API.

1

Clone the repo

git clone https://github.com/Hawky-ai/pm-AGI
cd pm-AGI
pip install -r requirements.txt
2

Run the eval

python evaluate.py \
  --model YOUR_MODEL \
  --provider openai \
  --api-key $OPENAI_API_KEY
3

Submit your result

The result JSON is saved to results/. Submit it via PR or the Hugging Face Space form; approved results appear on the leaderboard within 24 hours.
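
For orientation, an illustrative result shape as a Python dict — the field names are assumptions, not the repo's actual schema (check an existing file in results/); the values are copied from the Gemini 2.5 Pro row above:

# Field names below are illustrative only; values match the published
# Gemini 2.5 Pro scores on the v2.1 leaderboard.
example_result = {
    "model": "Gemini 2.5 Pro",
    "provider": "google",
    "benchmark_version": "v2.1",
    "overall": 87.7,
    "categories": {
        "recall": 96.4,
        "adversarial": 98.8,
        "diagnostic": 87.1,
        "quantitative": 59.7,
        "creative": 73.1,
    },
}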

Who it's for

Built for three audiences.

ML researchers

Researchers & engineers

  • Domain-specific reasoning benchmark covering adversarial robustness, multi-step diagnostic, quantitative tradeoffs, and experiment-design rigor.
  • Citable, reproducible methodology with per-question result JSONs for re-judging.
  • Fills the gap left by general-purpose benchmarks like MMLU and HumanEval.
Marketing leaders

CMOs & marketing leaders

  • Vendor-neutral evidence on which AI models actually understand performance marketing.
  • A framework for evaluating LLM-powered marketing tools your team or vendors propose.
  • Quarterly refresh — always current with platform updates.
Tooling teams

Building AI marketing tools

  • A standardized way to demonstrate your model's capability to enterprise customers.
  • Run on your fine-tuned or in-house model and appear on the leaderboard.
  • Track progress over time as you improve your model.
FAQ

Frequently asked.

Why isn't Claude on the public leaderboard?

Claude was self-evaluated under heavy contamination — the same Claude Opus 4.7 model authored ~80% of the v2 questions. We report it transparently as a self-eval (98.5%, with caveats) but treat it as an upper bound, not a fair comparison. We invite the community to run a fair Claude evaluation, and we'll add it to the leaderboard.

How do you prevent test-set contamination on future model releases?

We refresh the dataset quarterly, and a held-out v3 question set will be deployed after each major model release. We also publish dataset diffs so the community can verify what has been added.

Why Gemini 2.5 Flash as judge — won't it favor Gemini candidates?

Possibly, though we expect any bias to be small. We chose it for cost and consistency. v3 will use a multi-judge ensemble (Gemini Flash + GPT-4o + Claude Haiku) and report the median score.

Can I trust a 1-point difference on the leaderboard?

No. The 95% confidence interval on overall scores is ±2.5 percentage points, so treat differences under 5pp as ties.
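
A back-of-the-envelope check, assuming the overall score behaves like a binomial proportion over the 494 questions (the benchmark's exact CI method may differ):

import math

n, p = 494, 0.90                                  # 494 questions, ~90% overall score
half_width = 1.96 * math.sqrt(p * (1 - p) / n)    # normal approximation to the binomial
print(f"95% CI half-width ≈ ±{100 * half_width:.1f}pp")   # ≈ ±2.6pp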

How do I contribute questions?

Open a PR against benchmark/dataset.json in the GitHub repo. Each question needs a hypothesis (what it tests), plus a scoring rubric for open-ended questions or high-quality distractor options for MCQs. An illustrative entry follows below.
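
The field names and schema here are assumptions, so match whatever the existing entries in benchmark/dataset.json use:

# Illustrative question entry; field names are assumptions, not the repo schema.
example_question = {
    "hypothesis": "Model rejects SKAG-era account structure as outdated",
    "type": "open_ended",
    "question": "Should every exact-match keyword get its own single-keyword ad group?",
    "rubric": [
        "Identifies SKAGs as an obsolete 2019-era playbook",
        "Recommends consolidated ad groups so smart bidding gets more data",
        "Notes close-variant matching makes SKAG isolation ineffective",
    ],
}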

Will pm-agi expand beyond Meta + Google?

Yes — in v3 (Q4 2026): TikTok Ads, LinkedIn Ads, and retail media. Reach out if you want to co-author a specific platform expansion.

Do you offer enterprise consulting?

Hawky.ai offers AI strategy consulting for performance marketing teams. The benchmark is open-source; consulting helps you operationalize what it reveals. Email pm-agi@hawky.ai.

Ready to test your AI?

Open source. MIT-licensed. 494 questions. 30 minutes to a result.

MIT License · Open Source · pm-agi@hawky.ai