Open-methodology benchmark for AI-assisted Google Ads audit quality. Multi-judge consensus (Claude Opus 4.7 + GPT-5.4-mini + Gemini 2.5 Flash), Gwet's AC2 inter-rater agreement, and real-user conversation evaluation with D7 Workflow-Coherence. Every score is reproducible.
| # | Tool | Status | Golden L3 | AC2 | Chain Compl. | Prod-Conv-Eval | D7 | Snapshot |
|---|---|---|---|---|---|---|---|---|
| 1 | Helferlain MCP (F.2 Hybrid) `b33a4b36` | ✓ Verified | 85.4% | 0.690 | 89.2% | 79.8% | 4.60/5 | link |
| 2 | Helferlain MCP (E.2 Wizard) `e83ecf90` | ✓ Verified | 85.4% | 0.690 | 89.2% | 58.5% | 3.33/5 | link |
| 3 | Helferlain MCP (Q.2.1 baseline) `22ef49e4` | 📦 Archived | 83.5% | 0.644 | n/a | n/a | n/a | link |
| 4 | ChatGPT Standalone (GPT-5.4-mini, no MCP) | ✓ Verified | n/a | n/a | n/a | 70.9% | 4.20/5 | link |
| 5 | Adspirer (mock, public-spec-based) | ⚠ Mock | n/a | n/a | n/a | 19.0% | 1.00/5 | link |
#1 F.1 intent-specific synthesisers + F.2 LLM-fallback intent classifier. Senior-grade overshoot on all 4 vagueness classes.
#2 Decision-tree instructions + 5 slash-chip skills, before the F-phase synthesis refactor.
#3 First multi-judge baseline. Predates the workflow-first pivot.
#4 Vanilla GPT without tool use. Surprisingly competent on conceptual answers (D2 4.27, D6 5.00), but it cannot cite real account data (D1 1.20). The Helferlain USP shows clearly in D1 (+2.67), D5 Cross-Source (+0.87), and D7 Workflow (+0.40).
#5 NOT VERIFIED. The mock is based on publicly known Adspirer capabilities (see adspirer-public-spec.md). Junior pattern: single-tool trigger, no senior PPC gates, mutation bias. D5 Cross-Source 0.00 (single-platform). The Adspirer team is invited to submit a real OAuth snapshot to replace the mock.
| Tool | D1 Specificity | D2 Senior | D3 Action | D4 Anti-Pattern | D5 Cross-Src | D6 Honesty | D7 Workflow |
|---|---|---|---|---|---|---|---|
| Helferlain F.2 | 3.87 ⭐ | 4.33 | 2.60 | 4.13 | 4.07 ⭐ | 4.33 | 4.60 ⭐ |
| ChatGPT Standalone | 1.20 | 4.27 | 2.47 | 4.47 ⭐ | 3.20 | 5.00 ⭐ | 4.20 |
| Adspirer (mock) | 1.73 | 0.80 | 1.00 | 1.20 | 0.00 | 0.93 | 1.00 |
D1 Specificity +2.67 over ChatGPT: we cite real account data, campaign names, and EUR amounts. D5 Cross-Source +0.87: we join Google Ads + GA4 + GSC + CRM. D7 Workflow-Coherence +0.40: multi-tool chains instead of single-tool answers.
D6 Honesty 5.00: without tools, ChatGPT says plainly "I can't assess that without your data". D4 Anti-Pattern 4.47: it makes no concrete mutation proposals, so it makes no anti-pattern mistakes either. ChatGPT is good for conceptual questions that do not depend on account data.
Adspirer (mock): single-tool trigger, CRUD output, no senior PPC gates, mutation bias (blindly auto-applies Google recommendations), D5 Cross-Source 0.00 (single-platform). The mock stands in pending a real Adspirer OAuth test; we invite the Adspirer team to verify.
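The per-dimension deltas quoted above can be rechecked from the dimension table. The following is a minimal sketch: the `scores` object simply copies the two relevant table rows, and the `delta` helper is a hypothetical name, not part of the bench tooling.

```javascript
// Scores copied from the dimension table (0-5 scale, consensus values).
const scores = {
  "Helferlain F.2":     { D1: 3.87, D2: 4.33, D3: 2.60, D4: 4.13, D5: 4.07, D6: 4.33, D7: 4.60 },
  "ChatGPT Standalone": { D1: 1.20, D2: 4.27, D3: 2.47, D4: 4.47, D5: 3.20, D6: 5.00, D7: 4.20 },
};

// Hypothetical helper: signed difference a - b on one dimension, two decimals.
function delta(a, b, dim) {
  const d = scores[a][dim] - scores[b][dim];
  return (d >= 0 ? "+" : "") + d.toFixed(2);
}

for (const dim of ["D1", "D5", "D7"]) {
  console.log(dim, delta("Helferlain F.2", "ChatGPT Standalone", dim));
}
```

The same helper also shows where the standalone model leads: calling it with `"D4"` or `"D6"` returns a negative delta.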
Three frontier models score every test case independently: Claude Opus 4.7, GPT-5.4-mini, Gemini 2.5 Flash. Consensus = median per dimension. Drawing the judges from three vendors guards against single-lab bias.
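As an illustration of "consensus = median per dimension", here is a minimal sketch. The function names and the shape of the judge output (one object of dimension scores per judge) are assumptions for this example; the bench's own data format may differ.

```javascript
// Median of a non-empty numeric array (mean of the two middle values if the length is even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical input shape: one { dimension: score } object per judge.
function consensus(judgeScores) {
  const result = {};
  for (const dim of Object.keys(judgeScores[0])) {
    result[dim] = median(judgeScores.map((scoresOfJudge) => scoresOfJudge[dim]));
  }
  return result;
}

// Made-up scores for a single test case, one object per judge.
const exampleCase = [
  { D1: 3, D6: 5, D7: 4 },
  { D1: 4, D6: 4, D7: 4 },
  { D1: 1, D6: 5, D7: 5 },
];
console.log(consensus(exampleCase));
```

With three judges the median is just the middle score, so one outlier judge (the `D1: 1` above) cannot drag the consensus on its own.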
Gwet's AC2: quadratic-weighted ordinal agreement. Robust on skewed distributions (where Krippendorff's α fails). Threshold 0.61 = "substantial"; anything below is flagged as methodologically suspect.
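For readers who want to see how such a coefficient is computed, the sketch below follows Gwet's multi-rater formulation of AC2 with quadratic weights, assuming integer scores 0 to 5 (six categories) and one row of judge scores per test case. It is an illustrative implementation, not the bench's own code; the function names and the sample ratings are made up.

```javascript
// Quadratic agreement weights for q ordered categories:
// 1 on the diagonal, decreasing to 0 for the two most distant categories.
function quadraticWeights(q) {
  return Array.from({ length: q }, (_, k) =>
    Array.from({ length: q }, (_, l) => 1 - ((k - l) ** 2) / ((q - 1) ** 2))
  );
}

// ratings: one row per test case, one integer score in 0..q-1 per judge.
// Assumes every row contains at least one score.
function gwetAC2(ratings, q = 6) {
  const w = quadraticWeights(q);
  const n = ratings.length;
  const pi = new Array(q).fill(0); // average share of ratings per category
  let agreementSum = 0;
  let rowsWithTwoPlus = 0;

  for (const row of ratings) {
    const r = row.length;
    const counts = new Array(q).fill(0);
    for (const score of row) counts[score] += 1;
    for (let k = 0; k < q; k++) pi[k] += counts[k] / r / n;

    if (r < 2) continue; // agreement is undefined for a single rating
    rowsWithTwoPlus += 1;
    let agree = 0;
    for (let k = 0; k < q; k++) {
      let weightedCount = 0;
      for (let l = 0; l < q; l++) weightedCount += w[k][l] * counts[l];
      agree += counts[k] * (weightedCount - 1);
    }
    agreementSum += agree / (r * (r - 1));
  }

  const pa = agreementSum / rowsWithTwoPlus; // weighted observed agreement
  const weightTotal = w.flat().reduce((a, b) => a + b, 0);
  const pe = (weightTotal / (q * (q - 1))) *
    pi.reduce((acc, p) => acc + p * (1 - p), 0); // chance agreement
  return (pa - pe) / (1 - pe);
}

// Made-up ratings: 4 test cases x 3 judges on the 0-5 scale.
const sampleRatings = [[4, 4, 3], [2, 3, 2], [5, 5, 5], [0, 1, 0]];
const ac2 = gwetAC2(sampleRatings);
console.log(ac2 >= 0.61 ? "substantial or better" : "flagged as suspect");
```

Unlike kappa-style coefficients, the chance term here depends on `p * (1 - p)` per category, so it shrinks when most ratings pile into one category, which is why AC2 stays stable on skewed score distributions.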
Seven dimensions: Specificity / Senior-Reasoning / Actionability / Anti-Pattern-Avoidance / Cross-Source-Thinking / Honesty / Workflow-Coherence (NEW). Each is scored 0-5 by each judge.
15 synthetic non-expert user conversations (5 personas × 4 vagueness classes). More honest than golden questions, because realistically vague phrasings expose real garbage-in/garbage-out gaps.
```bash
# Clone the bench
git clone https://github.com/Philippstf/helferlain.git
cd helferlain

# Set judge API keys
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_AI_API_KEY=...

# Score Helferlain (requires HELFERLAIN_ADMIN_DEV_SECRET)
node tools/production-conversation-eval.mjs

# Score ChatGPT-Standalone (vanilla, no tools)
node tools/baseline-chatgpt-standalone.mjs

# Score Adspirer-Mock
node tools/baseline-adspirer-mock.mjs

# Compare snapshots
node tools/quality-eval-ci.mjs --baseline X --current Y
```
Building a competing PPC AI tool? Submit a snapshot. We re-run it with our judges to verify the scores within ±2%, then list you. We don't gatekeep on score; we gatekeep on reproducibility.
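A reproducibility check of this kind could look like the sketch below. It assumes that "±2%" means two percentage points on each percentage metric, and the metric names, the snapshot shape, and the `verifySnapshot` function are hypothetical; the actual submission template defines the real format.

```javascript
// Hypothetical check: every submitted metric must be reproduced by the re-run
// within `tolerance` percentage points (assumed reading of "±2%").
function verifySnapshot(submitted, rerun, tolerance = 2.0) {
  const failures = [];
  for (const [metric, claimed] of Object.entries(submitted)) {
    const observed = rerun[metric];
    if (observed === undefined || Math.abs(claimed - observed) > tolerance) {
      failures.push(metric);
    }
  }
  return { ok: failures.length === 0, failures };
}

// Made-up numbers for illustration only.
const submittedSnapshot = { prodConvEval: 79.8, goldenL3: 85.4 };
const rerunSnapshot = { prodConvEval: 78.5, goldenL3: 81.0 };
console.log(verifySnapshot(submittedSnapshot, rerunSnapshot));
```

A metric missing from the re-run counts as a failure here, which matches the idea that listing depends on reproducibility rather than on the score itself.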
View Submission Template →