v1.0.0-draft · Open Methodology · 2026-05-06

PPC-Audit-Bench v1.0

Open-methodology benchmark for AI-assisted Google Ads audit quality. It combines multi-judge consensus (Claude Opus 4.7, GPT-5.4-mini, Gemini 2.5 Flash), Gwet's AC2 inter-rater agreement, and a conversation-based evaluation with realistic user phrasing that includes D7 Workflow-Coherence. Every score is reproducible.


Leaderboard

#	Tool	Status	Golden L3	AC2	Chain Completion	Prod-Conv-Eval	D7	Snapshot
1	Helferlain MCP (F.2 Hybrid) `b33a4b36`	✓ Verified	85.4%	0.690	89.2%	79.8%	4.60/5	link
2	Helferlain MCP (E.2 Wizard) `e83ecf90`	✓ Verified	85.4%	0.690	89.2%	58.5%	3.33/5	link
3	Helferlain MCP (Q.2.1 baseline) `22ef49e4`	📦 Archived	83.5%	0.644	n/a	n/a	n/a	link
4	ChatGPT Standalone (GPT-5.4-mini, no MCP)	✓ Verified	n/a	n/a	n/a	70.9%	4.20/5	link
5	Adspirer (mock, public-spec-based)	⚠ Mock	n/a	n/a	n/a	19.0%	1.00/5	link

#1 F.1 intent-specific synthesizers plus F.2 LLM-fallback intent classifier. Exceeds the senior-grade bar on all 4 vagueness classes.

#2 Decision-tree instructions plus 5 slash-chip skills, from before the F-phase synthesis refactor.

#3 First multi-judge baseline, from before the workflow-first pivot.

#4 Vanilla GPT without tool use. Surprisingly competent on conceptual answers (D2 4.27, D6 5.00), but it cannot cite real account data (D1 1.20). Helferlain's advantage shows clearly in D1 (+2.67), D5 Cross-Source (+0.87), and D7 Workflow (+0.40).

#5 NOT VERIFIED. The mock is based on publicly known Adspirer capabilities (see adspirer-public-spec.md). Junior pattern: single-tool trigger, no senior PPC gates, mutation bias. D5 Cross-Source 0.00 (single-platform). The Adspirer team is invited to submit a real OAuth snapshot to replace it.

Comparative Analysis (D1-D7)

Tool	D1 Specificity	D2 Senior-Reasoning	D3 Actionability	D4 Anti-Pattern-Avoidance	D5 Cross-Source	D6 Honesty	D7 Workflow-Coherence
Helferlain F.2	3.87 ⭐	4.33	2.60	4.13	4.07 ⭐	4.33	4.60 ⭐
ChatGPT Standalone	1.20	4.27	2.47	4.47 ⭐	3.20	5.00 ⭐	4.20
Adspirer (mock)	1.73	0.80	1.00	1.20	0.00	0.93	1.00
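
The Prod-Conv-Eval percentages on the leaderboard appear to be the mean of the seven dimension scores, rescaled from the 0-5 range to 0-100%. The sketch below checks that reading against the table above. The function name `overallPercent` and the formula itself are inferred from the published numbers, not taken from the benchmark's source code.

```javascript
// Dimension scores D1..D7 copied from the Comparative Analysis table.
const scores = {
  "Helferlain F.2": [3.87, 4.33, 2.60, 4.13, 4.07, 4.33, 4.60],
  "ChatGPT Standalone": [1.20, 4.27, 2.47, 4.47, 3.20, 5.00, 4.20],
  "Adspirer (mock)": [1.73, 0.80, 1.00, 1.20, 0.00, 0.93, 1.00],
};

// Assumption: overall % = sum of dimension scores / (dimension count * 5) * 100.
function overallPercent(dims, maxPerDim = 5) {
  const total = dims.reduce((sum, d) => sum + d, 0);
  return (total / (dims.length * maxPerDim)) * 100;
}

for (const [tool, dims] of Object.entries(scores)) {
  console.log(tool, overallPercent(dims).toFixed(1) + "%");
}
```

Under this assumption the three rows come out at 79.8%, 70.9%, and 19.0%, which matches the Prod-Conv-Eval column.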

Helferlain wins: D1 / D5 / D7

D1 Specificity is +2.67 above ChatGPT because we cite real account data, campaign names, and EUR amounts. D5 Cross-Source is +0.87 because we join Google Ads, GA4, GSC, and CRM data. D7 Workflow-Coherence is +0.40 thanks to multi-tool chains instead of single-tool answers.
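
The deltas quoted here follow directly from subtracting the ChatGPT row from the Helferlain F.2 row. A minimal sketch of that arithmetic, with `dimensionDeltas` as a hypothetical helper name:

```javascript
// D1..D7 scores copied from the Comparative Analysis table.
const helferlain = [3.87, 4.33, 2.60, 4.13, 4.07, 4.33, 4.60];
const chatgpt = [1.20, 4.27, 2.47, 4.47, 3.20, 5.00, 4.20];

// Per-dimension difference, rounded to two decimals to hide float noise.
function dimensionDeltas(a, b) {
  return a.map((score, i) => Number((score - b[i]).toFixed(2)));
}

const deltas = dimensionDeltas(helferlain, chatgpt);
deltas.forEach((d, i) => {
  console.log(`D${i + 1}: ${d > 0 ? "+" : ""}${d.toFixed(2)}`);
});
```

D1, D5, and D7 come out at +2.67, +0.87, and +0.40. The same subtraction gives negative values for D4 and D6, where ChatGPT leads.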

ChatGPT surprisingly strong: D4 / D6

D6 Honesty 5.00: without tools, ChatGPT states plainly "I can't assess this without your data". D4 Anti-Pattern 4.47: it proposes no concrete mutations, so it makes no anti-pattern mistakes either. ChatGPT works well for conceptual questions that do not depend on account data.

Adspirer-Mock: junior pattern

Single-tool trigger, CRUD output, no senior PPC gates, mutation bias (blindly auto-applies Google recommendations), D5 Cross-Source 0.00 (single-platform). The mock stands in until a real Adspirer OAuth test has been run; we invite the Adspirer team to verify.

Methodology (compact)

Multi-Judge-Trio

Three models score every test case independently: Claude Opus 4.7, GPT-5.4-mini, and Gemini 2.5 Flash. The consensus score is the median per dimension. Drawing the judges from three different vendors reduces the risk of single-lab bias.
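
A minimal sketch of the per-dimension median consensus, assuming each judge returns an object keyed by dimension. The function names, the judge keys, and the example scores are illustrative and not taken from the benchmark's code.

```javascript
// Median of a list of numeric scores (the middle value for three judges).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// judgeScores: { judgeName: { D1: n, ... } }. Returns { D1: median, ... }.
function consensus(judgeScores) {
  const judges = Object.values(judgeScores);
  const dims = Object.keys(judges[0]);
  return Object.fromEntries(
    dims.map((dim) => [dim, median(judges.map((j) => j[dim]))])
  );
}

// Hypothetical scores for one test case (three dimensions shown for brevity).
const example = {
  claude: { D1: 4, D2: 5, D7: 4 },
  gpt: { D1: 3, D2: 4, D7: 5 },
  gemini: { D1: 5, D2: 4, D7: 5 },
};
console.log(consensus(example)); // { D1: 4, D2: 4, D7: 5 }
```

With three judges the median is the middle score, so a single outlier cannot move the consensus on its own.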

Gwet's AC2 (Inter-Rater)

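Quadratic-weighted ordinal agreement. It stays stable on skewed score distributions, where Krippendorff's α can drop misleadingly low despite high raw agreement. The threshold of 0.61 is the lower bound of "substantial" agreement on the Landis-Koch scale; any run below it is flagged as methodologically suspect.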
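
The following is a sketch of Gwet's AC2 with quadratic weights for several raters, written from the standard published formula rather than from the benchmark's own implementation. It assumes integer scores from 0 to 5 (six categories) and at least one rating per item; `gwetAC2` and `quadraticWeights` are illustrative names.

```javascript
// Quadratic agreement weights for q ordered categories (0..q-1).
function quadraticWeights(q) {
  return Array.from({ length: q }, (_, k) =>
    Array.from({ length: q }, (_, l) => 1 - (k - l) ** 2 / (q - 1) ** 2)
  );
}

// ratings: one array per item, holding each judge's integer score in 0..q-1.
function gwetAC2(ratings, q = 6) {
  const w = quadraticWeights(q);
  const n = ratings.length;
  const pi = new Array(q).fill(0); // mean share of ratings per category
  let paSum = 0;
  let nMulti = 0; // items rated by at least two judges

  for (const item of ratings) {
    const r = item.length;
    const counts = new Array(q).fill(0);
    for (const score of item) counts[score] += 1;
    counts.forEach((c, k) => { pi[k] += c / r / n; });
    if (r < 2) continue;
    nMulti += 1;
    let agree = 0;
    for (let k = 0; k < q; k++) {
      // Weighted count of ratings that fully or partially agree with k.
      const rStar = counts.reduce((s, c, l) => s + w[k][l] * c, 0);
      agree += counts[k] * (rStar - 1);
    }
    paSum += agree / (r * (r - 1));
  }

  const pa = paSum / nMulti; // weighted observed agreement
  const tw = w.flat().reduce((s, x) => s + x, 0); // sum of all weights
  const pe = (tw / (q * (q - 1))) * pi.reduce((s, p) => s + p * (1 - p), 0);
  return (pa - pe) / (1 - pe);
}

// Hypothetical ratings: three judges, three test cases, one dimension.
const ac2 = gwetAC2([[5, 5, 4], [2, 2, 2], [0, 1, 0]]);
console.log(ac2.toFixed(3), ac2 < 0.61 ? "flagged" : "ok");
```

Because the weights give partial credit to near-misses such as 5 versus 4, small disagreements on the 0-5 scale reduce the coefficient far less than large ones.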

7-Dimension Rubric (D1-D7)

Specificity / Senior-Reasoning / Actionability / Anti-Pattern-Avoidance / Cross-Source-Thinking / Honesty / Workflow-Coherence (new). Each judge scores each dimension from 0 to 5.

Production-Conversation-Eval

15 synthetic non-expert user conversations, drawn from a grid of 5 personas and 4 vagueness classes. These are a more honest test than golden questions because realistically vague phrasing exposes real garbage-in/garbage-out gaps.

Reproduce a Score

# Clone the bench
git clone https://github.com/Philippstf/helferlain.git
cd helferlain

# Set judge API keys
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_AI_API_KEY=...

# Score Helferlain (requires HELFERLAIN_ADMIN_DEV_SECRET)
node tools/production-conversation-eval.mjs

# Score ChatGPT-Standalone (vanilla, no tools)
node tools/baseline-chatgpt-standalone.mjs

# Score Adspirer-Mock
node tools/baseline-adspirer-mock.mjs

# Compare snapshots
node tools/quality-eval-ci.mjs --baseline X --current Y

Submit Your Tool

Building a competing PPC AI tool? Submit a snapshot. We re-run it with our judges and, if our scores match yours within ±2%, we list you. We don't gatekeep on score; we gatekeep on reproducibility.
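
The ±2% verification can be pictured as a simple tolerance check. The sketch below assumes that "±2%" means an absolute difference of at most 2 percentage points on the overall score; the page does not say whether the tolerance is absolute or relative. `withinTolerance` and the re-run values are hypothetical.

```javascript
// Assumption: "±2%" means at most 2 percentage points of absolute difference.
function withinTolerance(submittedPct, rerunPct, tolerancePts = 2) {
  return Math.abs(submittedPct - rerunPct) <= tolerancePts;
}

// Hypothetical re-run results compared with a submitted score of 79.8%.
console.log(withinTolerance(79.8, 78.5)); // true: about 1.3 points apart
console.log(withinTolerance(79.8, 76.9)); // false: about 2.9 points apart
```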

View Submission Template →