v1.0.0-draft · Open Methodology · 2026-05-06

PPC-Audit-Bench v1.0

Open-methodology benchmark for AI-assisted Google Ads audit quality. Multi-judge consensus (Claude Opus 4.7 + GPT-5.4-mini + Gemini 2.5 Flash), Gwet's AC2 inter-rater agreement, real-user-conversation evaluation with D7 Workflow-Coherence. Every score is reproducible.

Leaderboard

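| # | Tool | Status | Golden L3 | AC2 | Chain Compl. | Prod-Conv-Eval | D7 | Snapshot |
|---|------|--------|-----------|-----|--------------|----------------|----|----------|
| 1 | Helferlain MCP (F.2 Hybrid) b33a4b36 | ✓ Verified | 85.4% | 0.690 | 89.2% | 79.8% | 4.60/5 | link |
| 2 | Helferlain MCP (E.2 Wizard) e83ecf90 | ✓ Verified | 85.4% | 0.690 | 89.2% | 58.5% | 3.33/5 | link |
| 3 | Helferlain MCP (Q.2.1 baseline) 22ef49e4 | 📦 Archived | 83.5% | 0.644 | n/a | n/a | n/a | link |
| 4 | ChatGPT Standalone (GPT-5.4-mini, no MCP) | ✓ Verified | n/a | n/a | n/a | 70.9% | 4.20/5 | link |
| 5 | Adspirer (mock, public-spec-based) | ⚠ Mock | n/a | n/a | n/a | 19.0% | 1.00/5 | link |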

#1 F.1 intent-specific synthesisers + F.2 LLM-fallback intent classifier. Overshoots the senior-grade target on all 4 vagueness classes.

#2 Decision-tree instructions + 5 slash-chip skills, before the F-phase synthesis refactor.

#3 First multi-judge baseline. Predates the workflow-first pivot.

#4 Vanilla GPT without tool use. Surprisingly competent on conceptual answers (D2 4.27, D6 5.00), but it cannot cite real account data (D1 1.20). Helferlain's advantage is clear in D1 (+2.67), D5 Cross-Source (+0.87), and D7 Workflow (+0.40).

#5 NOT VERIFIED. The mock is based on publicly known Adspirer capabilities (see adspirer-public-spec.md). Junior pattern: single-tool trigger, no senior-PPC gates, mutation bias. D5 Cross-Source 0.00 (single-platform). The Adspirer team is invited to submit a real OAuth snapshot to replace it.

Comparative Analysis (D1-D7)

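| Tool | D1 Specificity | D2 Senior | D3 Action | D4 Anti-Pattern | D5 Cross-Src | D6 Honesty | D7 Workflow |
|------|----------------|-----------|-----------|-----------------|--------------|------------|-------------|
| Helferlain F.2 | 3.87 ⭐ | 4.33 | 2.60 | 4.13 | 4.07 ⭐ | 4.33 | 4.60 ⭐ |
| ChatGPT Standalone | 1.20 | 4.27 | 2.47 | 4.47 ⭐ | 3.20 | 5.00 ⭐ | 4.20 |
| Adspirer (mock) | 1.73 | 0.80 | 1.00 | 1.20 | 0.00 | 0.93 | 1.00 |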

Helferlain wins: D1 / D5 / D7

D1 Specificity +2.67 over ChatGPT: we cite real account data, campaign names, and EUR amounts. D5 Cross-Source +0.87: we join Google Ads + GA4 + GSC + CRM. D7 Workflow-Coherence +0.40: multi-tool chains instead of single-tool answers.
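The deltas quoted above follow directly from the comparative table. The sketch below recomputes them. The variable and function names are illustrative only and are not part of the bench tooling.

```javascript
// Per-dimension scores copied from the comparative table (D1-D7).
const helferlainF2 = { D1: 3.87, D2: 4.33, D3: 2.60, D4: 4.13, D5: 4.07, D6: 4.33, D7: 4.60 };
const chatgptStandalone = { D1: 1.20, D2: 4.27, D3: 2.47, D4: 4.47, D5: 3.20, D6: 5.00, D7: 4.20 };

// Difference a - b per dimension, rounded to two decimals to hide float noise.
// Hypothetical helper name, used here for illustration only.
function dimensionDeltas(a, b) {
  const out = {};
  for (const dim of Object.keys(a)) {
    out[dim] = Math.round((a[dim] - b[dim]) * 100) / 100;
  }
  return out;
}

const deltas = dimensionDeltas(helferlainF2, chatgptStandalone);
console.log(deltas.D1, deltas.D5, deltas.D7); // 2.67 0.87 0.4
```

Negative values in the result (D4, D6) mark the dimensions where ChatGPT Standalone scores higher.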

ChatGPT surprisingly strong: D4 / D6

D6 Honesty 5.00: without tools, ChatGPT says plainly "I can't assess that without your data". D4 Anti-Pattern 4.47: it makes no concrete mutation proposals, so it makes no anti-pattern mistakes either. ChatGPT is good for conceptual questions that do not depend on account data.

Adspirer mock: junior pattern

Single-tool trigger, CRUD output, no senior-PPC gates, mutation bias (blindly auto-applies Google recommendations), D5 Cross-Source 0.00 (single-platform). The mock stands in until a real Adspirer OAuth test is run; we invite the Adspirer team to verify.

Methodology (compact)

Multi-Judge-Trio

Three frontier models score every test case independently: Claude Opus 4.7, GPT-5.4-mini, Gemini 2.5 Flash. Consensus = median per dimension. Using judges from three vendors limits single-lab bias.
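The median-per-dimension consensus can be sketched as follows. This is a minimal illustration, not the bench's actual implementation: the data shape, the judge keys, and the function names are assumptions, and the scores are made up.

```javascript
// Hypothetical sketch; data shapes and names are assumptions, not the bench's code.
const DIMENSIONS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"];

// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// judgeScores: { judgeName: { D1: number, ..., D7: number } }
// Returns one consensus score per dimension.
function consensus(judgeScores) {
  const judges = Object.values(judgeScores);
  const result = {};
  for (const dim of DIMENSIONS) {
    result[dim] = median(judges.map((scores) => scores[dim]));
  }
  return result;
}

// Made-up scores for a single test case (illustrative only).
const exampleCase = {
  claude: { D1: 4, D2: 5, D3: 3, D4: 4, D5: 4, D6: 5, D7: 5 },
  gpt:    { D1: 3, D2: 4, D3: 2, D4: 4, D5: 5, D6: 4, D7: 4 },
  gemini: { D1: 4, D2: 4, D3: 3, D4: 5, D5: 3, D6: 4, D7: 5 },
};
console.log(consensus(exampleCase));
```

With three judges the median is simply the middle score, so a single outlier judge cannot move the consensus on its own.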

Gwet's AC2 (Inter-Rater)

Quadratic-weighted ordinal agreement. Stays stable on skewed score distributions, where Krippendorff's α can collapse despite high raw agreement. Threshold: 0.61 ("substantial"); anything below is flagged as methodologically suspect.
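For readers who want to see the statistic spelled out, here is a minimal sketch of the multi-rater AC2 coefficient with quadratic weights, following Gwet's published formulation. It assumes every judge scored every item, that scores are integers from 0 to q-1, and that there are at least two judges. Whether the bench computes AC2 in exactly this way is not stated in the source, and the function names are hypothetical.

```javascript
// Quadratic agreement weight between ordinal categories k and l (q categories).
function quadraticWeight(k, l, q) {
  return 1 - ((k - l) ** 2) / ((q - 1) ** 2);
}

// ratings: array of items; each item is an array of integer scores in [0, q-1],
// one score per judge. ASSUMES no missing ratings and at least 2 judges per item.
function gwetAC2(ratings, q) {
  const n = ratings.length;
  let agreementSum = 0;
  const prevalence = new Array(q).fill(0); // average share of ratings per category

  for (const item of ratings) {
    const r = item.length;
    const counts = new Array(q).fill(0);
    for (const score of item) counts[score] += 1;

    let itemAgreement = 0;
    for (let k = 0; k < q; k++) {
      if (counts[k] === 0) continue;
      // Weighted count of judges whose score is "close" to category k.
      let weighted = 0;
      for (let l = 0; l < q; l++) weighted += quadraticWeight(k, l, q) * counts[l];
      itemAgreement += counts[k] * (weighted - 1);
      prevalence[k] += counts[k] / r / n;
    }
    agreementSum += itemAgreement / (r * (r - 1));
  }
  const pa = agreementSum / n; // weighted observed agreement

  let weightTotal = 0;
  for (let k = 0; k < q; k++) {
    for (let l = 0; l < q; l++) weightTotal += quadraticWeight(k, l, q);
  }
  const spread = prevalence.reduce((s, p) => s + p * (1 - p), 0);
  const pe = (weightTotal / (q * (q - 1))) * spread; // chance agreement

  return (pa - pe) / (1 - pe);
}

// Tiny hand-checkable case: 2 items, 2 judges, 3 categories; result is about 0.2.
console.log(gwetAC2([[0, 0], [0, 2]], 3));
```

For the 0-5 rubric, q would be 6. Because chance agreement depends on the spread term (the sum of p × (1 - p) over categories), it shrinks when almost all scores fall into one category, which is why the coefficient does not collapse on skewed data.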

7-Dimension Rubric (D1-D7)

Specificity / Senior-Reasoning / Actionability / Anti-Pattern-Avoidance / Cross-Source-Thinking / Honesty / Workflow-Coherence (NEW). Each judge scores each dimension from 0 to 5.

Production-Conversation-Eval

15 synthetic non-expert user conversations (5 personas × 4 vagueness classes). More honest than golden questions, because realistically vague phrasings expose real garbage-in/garbage-out gaps.

Reproduce a Score

# Clone the bench
git clone https://github.com/Philippstf/helferlain.git
cd helferlain

# Set judge API keys
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_AI_API_KEY=...

# Score Helferlain (requires HELFERLAIN_ADMIN_DEV_SECRET)
node tools/production-conversation-eval.mjs

# Score ChatGPT-Standalone (vanilla, no tools)
node tools/baseline-chatgpt-standalone.mjs

# Score Adspirer-Mock
node tools/baseline-adspirer-mock.mjs

# Compare snapshots
node tools/quality-eval-ci.mjs --baseline X --current Y
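Before running the scripts above, it can help to confirm that the required environment variables are set. The following pre-flight check is a minimal sketch and is not part of the repository. The helper name `missingEnvVars` is hypothetical, and the variable names are the four mentioned in the commands above.

```javascript
// Hypothetical pre-flight check; NOT part of the helferlain repository.
// Variable names are taken from the commands above.
const REQUIRED_ENV_VARS = [
  "ANTHROPIC_API_KEY",
  "OPENAI_API_KEY",
  "GOOGLE_AI_API_KEY",
  "HELFERLAIN_ADMIN_DEV_SECRET", // only needed for the Helferlain run
];

// Returns the names of required variables that are unset or empty in `env`.
function missingEnvVars(env) {
  return REQUIRED_ENV_VARS.filter((name) => !env[name]);
}

const missingNow = missingEnvVars(process.env);
if (missingNow.length > 0) {
  console.error(`Missing environment variables: ${missingNow.join(", ")}`);
} else {
  console.log("All judge keys and the admin secret are set.");
}
```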

Submit Your Tool

Building a competing PPC AI tool? Submit a snapshot. We re-run it with our judges to verify the score within ±2%, then list you. We don't gatekeep on score; we gatekeep on reproducibility.
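The ±2% check could look like the sketch below. It assumes the tolerance is measured in absolute percentage points on a headline score such as Prod-Conv-Eval. The source does not say whether the tolerance is absolute or relative, the function name is hypothetical, and the re-run values are made up for illustration.

```javascript
// Hypothetical reproducibility check. The tolerance is ASSUMED to mean
// absolute percentage points; the methodology text does not specify this.
function isReproduced(submittedPct, rerunPct, tolerancePoints = 2.0) {
  return Math.abs(submittedPct - rerunPct) <= tolerancePoints;
}

// 79.8 is the top Prod-Conv-Eval score on the leaderboard; re-run values are invented.
console.log(isReproduced(79.8, 78.5)); // true: about 1.3 points apart
console.log(isReproduced(79.8, 76.0)); // false: about 3.8 points apart
```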

View Submission Template →