494 expert questions. 5 reasoning categories. 7 frontier models tested. The recall-vs-reasoning gap reveals what your AI actually knows — and where it fails.
When you choose an LLM for your marketing team, you probably look at MMLU, GSM8K, or generic “AI rankings” articles. Those benchmarks measure general knowledge, math, and code — not whether a model understands attribution windows, brand-search incrementality, or how to design an A/B test that survives auction overlap.
A model can score 95% on MMLU and still recommend doubling Brand Search budget at 18× ROAS — a classic non-incremental trap that destroys capital. We've seen frontier models confidently propose “switch to Last-Click attribution for cleanest reporting” — exactly the wrong move for a brand investing in upper-funnel.
PM-AGI is the benchmark we built to measure what actually matters: the reasoning your team relies on every day.
PM-AGI v2 splits performance-marketing reasoning into five distinct skills. Each is tested separately, scored independently, and reported on the leaderboard.
- **Recall (platform mechanics, current best practices).** Tests whether the model knows how Meta and Google Ads actually work today: AEM event limits, tROAS Learning Phase mechanics, attribution defaults.
- **Adversarial.** Tests whether the model rejects obsolete advice that still pattern-matches in older training data: SKAGs, narrow lookalike stacking, "always optimize CTR".
- **Diagnostic.** Given anomalous campaign data (a CPA spike, a ROAS drop, tracking weirdness), reason through audience, creative, bidding, and tracking causes to identify the actual root cause.
- **Quantitative.** Multi-step quantitative reasoning over LTV, CAC payback, marginal ROAS, and channel-mix decisions. Strong models state their assumptions; weak models skip the arithmetic. A worked sketch follows this list.
- **Creative strategy.** Design rigorous experiments: pre-committed metrics, statistical power, randomization, decision rules. The hardest category for every frontier LLM.
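To make the quantitative category concrete, here is a minimal worked sketch of the arithmetic these questions demand: CAC payback under retention decay, and marginal versus average ROAS. All figures and variable names are invented for illustration and do not come from the benchmark itself.

```python
# Hypothetical worked example for the quantitative category.
# Every input below is invented for illustration; real benchmark
# questions supply their own figures and expect stated assumptions.

cac = 120.0            # blended customer acquisition cost ($)
monthly_margin = 18.0  # contribution margin per customer per month ($)
retention = 0.92       # flat monthly retention rate (a simplifying assumption)

# CAC payback: months until cumulative retained margin covers CAC.
months, cumulative, surviving = 0, 0.0, 1.0
while cumulative < cac:
    months += 1
    cumulative += monthly_margin * surviving
    surviving *= retention
print(f"CAC payback: {months} months (cumulative margin ${cumulative:.0f})")

# Marginal vs. average ROAS: the next dollar, not the blended dollar.
# Scaling spend $10k -> $12k moved revenue $60k -> $66k.
marginal_roas = (66_000 - 60_000) / (12_000 - 10_000)  # 3.0x
average_roas = 66_000 / 12_000                          # 5.5x
print(f"average ROAS {average_roas:.1f}x, marginal ROAS {marginal_roas:.1f}x")
```

The marginal/average split is exactly the trap in the 18× Brand Search example above: a high average ROAS can hide a marginal ROAS well below target.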
Tested on Azure OpenAI (GPT-5.x family), Google (Gemini 2.5 + 3 family), and Anthropic (self-eval, with explicit caveats). Confidence interval ±2.5pp — treat differences under 5pp as ties.
| # | Model | Provider | Overall | Recall | Adversarial | Diagnostic | Quantitative | Creative |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 98.5% | 97.0% | 100.0% | 99.7% | 100.0% | 99.3% |
| 2 | GPT-5.4 | Azure OpenAI | 97.4% | 99.3% | 100.0% | 97.2% | 95.3% | 89.0% |
| 3 | GPT-5.2 | Azure OpenAI | 97.4% | 98.9% | 100.0% | 98.4% | 94.6% | 90.0% |
| 4 | Gemini 2.5 Flash | Google | 92.2% | 97.7% | 100.0% | 89.0% | 83.6% | 74.3% |
| 5 | Gemini 3 Flash Preview | Google | 91.7% | 97.0% | 100.0% | 85.4% | 89.5% | 69.4% |
| 6 | Gemini 2.5 Pro | Google | 87.7% | 96.4% | 98.8% | 87.1% | 59.7% | 73.1% |
| 7 | GPT-5.5 | Azure OpenAI | 80.0% | 95.8% | 100.0% | 59.9% | 72.0% | 24.5% |
Note on Claude Opus 4.7: self-evaluated under heavy contamination, since the same model authored ~80% of v2 questions. Reported as an upper bound, not a fair comparison.
Aggregate benchmark scores hide where models actually break. The recall-vs-reasoning gap is the diagnostic signal that matters. We measure it for every model on the leaderboard.
Recall vs. Creative Strategy — same model, different skill
494 questions written by performance-marketing professionals. Each open-ended question has a 5–10 point rubric. Adversarial questions specifically target outdated playbooks. Fully versioned, MIT-licensed, on Hugging Face.
Open-ended responses are scored by Gemini 2.5 Flash as the universal judge: the same judge grades every candidate, which keeps cross-model comparison consistent (see the FAQ on judge bias). A multi-judge ensemble is planned for v3.
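For readers who want to see the shape of rubric-based judging, here is a minimal sketch of a universal-judge scoring loop. The prompt wording and the call_judge helper are hypothetical stand-ins; the actual judge prompts ship in the repo.

```python
# Sketch of universal-judge scoring. call_judge() is a hypothetical
# stand-in for one API call to the shared judge model (Gemini 2.5 Flash
# in v2); the real judge prompts and plumbing live in the repo.

def call_judge(prompt: str) -> str:
    """Hypothetical placeholder: wire this to your judge provider."""
    raise NotImplementedError

def score_response(question: str, rubric: list[str], answer: str) -> float:
    """Grade one answer against its rubric with the single shared judge."""
    prompt = (
        "You are grading a performance-marketing answer.\n"
        f"Question: {question}\n"
        "Rubric (1 point per item):\n"
        + "\n".join(f"- {item}" for item in rubric)
        + f"\nAnswer: {answer}\n"
        "Reply with only the number of rubric points earned."
    )
    return float(call_judge(prompt)) / len(rubric)  # normalize to 0..1
```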
All 494 questions, all 7 result JSONs, all judge prompts, all scoring rubrics are public. Run the eval on your own model in ~30 minutes via the open-source evaluate.py. Submit results via PR.
Three steps. Works with any provider that exposes an OpenAI-, Google-, or Anthropic-compatible API.
```bash
git clone https://github.com/Hawky-ai/pm-AGI
cd pm-AGI
pip install -r requirements.txt

python evaluate.py \
  --model YOUR_MODEL \
  --provider openai \
  --api-key $OPENAI_API_KEY
```

The result JSON saves to results/. Submit via PR or the Hugging Face Space form. Approved results appear on the leaderboard within 24 hours.
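If you want to sanity-check a run before opening the PR, a few lines of Python will print the per-category breakdown. The overall and categories keys below are assumptions made for this sketch, not a documented schema; check the actual keys in your results/ file.

```python
import json
from pathlib import Path

# Assumed result schema, for illustration only:
#   {"model": str, "overall": float, "categories": {name: float, ...}}
# Inspect your own results/ JSON for the actual field names.
result_file = next(Path("results").glob("*.json"))
result = json.loads(result_file.read_text())

print(f"{result['model']}: overall {result['overall']:.1%}")
for category, score in sorted(result["categories"].items(), key=lambda kv: kv[1]):
    print(f"  {category:<14} {score:.1%}")  # weakest categories first
```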
Claude was self-evaluated under heavy contamination: the same Claude Opus 4.7 model authored ~80% of v2 questions. We report it transparently as a self-eval (98.5%, with caveats) but treat it as an upper bound, not a fair comparison. We invite the community to run a fair Claude evaluation; we'll add it to the leaderboard.
Quarterly dataset refreshes, plus a held-out v3 question set deployed after each major model release. We also publish dataset diffs so the community can verify what's been added.
Possibly a small bias. We chose it for cost and consistency. v3 will use a multi-judge ensemble (Gemini Flash + GPT-4o + Claude Haiku) and report the median.
No. 95% confidence interval on overall scores is ±2.5 percentage points. Treat differences <5pp as ties.
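As a sanity check on that margin, a normal-approximation binomial interval over 494 questions lands right around ±2.5pp. The sketch below is our reconstruction, not the benchmark's published derivation.

```python
import math

# Normal-approximation 95% CI for an accuracy score over n = 494 questions.
# Our reconstruction of the stated margin, not the repo's own calculation.
n = 494
for p in (0.80, 0.90, 0.95):  # representative overall scores
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"score {p:.0%}: 95% CI half-width +/-{half_width:.1%}")
# Near a 90% score the half-width is ~2.6pp, matching the stated +/-2.5pp
# and the advice to read any gap under ~5pp as a tie.
```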
Open a PR against benchmark/dataset.json on the GitHub repo. Each question needs a hypothesis (what it tests), plus either a rubric (for open-ended questions) or high-quality distractor options (for MCQ). A hypothetical entry is sketched below.
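For orientation, here is a hypothetical entry showing the shape a contribution might take. The field names are our guess, not the documented schema; mirror the existing entries in benchmark/dataset.json, which are canonical.

```python
# Hypothetical contribution entry; field names are a guess at the schema.
# Copy the structure of existing entries in benchmark/dataset.json instead.
candidate_question = {
    "id": "quant-0xx",  # placeholder id
    "category": "quantitative",
    "hypothesis": (
        "Model distinguishes marginal ROAS from average ROAS "
        "when deciding whether to keep scaling a campaign."
    ),
    "question": (
        "Spend rose from $10k to $12k and revenue from $60k to $66k. "
        "Should you keep scaling against a 4.0x ROAS target?"
    ),
    "rubric": [
        "Computes marginal ROAS (3.0x), not just average ROAS (5.5x)",
        "Notes marginal ROAS is below the 4.0x target",
        "Recommends holding or cutting spend, with assumptions stated",
    ],
}
```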
Yes, in v3 (Q4 2026): TikTok Ads, LinkedIn Ads, retail media. Reach out if you want to co-author specific platform expansions.
Hawky.ai offers AI strategy consulting for performance marketing teams. The benchmark is open-source; consulting helps you operationalize what it reveals. Email pm-agi@hawky.ai.
Open source. MIT-licensed. 494 questions. 30 minutes to a result.
MIT License · Open Source · pm-agi@hawky.ai