Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead)

2026-07-28 21:21:37 +08:00 · 2026-06-10 16:59:43 -07:00
parent 9a25a75bac
commit 27788fdef9
1 changed files with 29 additions and 0 deletions
--- a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
+++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
@@ -65,6 +65,14 @@ fewer, better-sized tasks, SDD still runs one fresh subagent per task.
 ### L1 — Plan-side crispness (writing-plans changes; est. −$1.5-3/run, plus variance reduction)
 **Status 2026-06-11: validated in effect.** A hand-crisped fractals plan
 (10 → 7 tasks, `## Global Constraints` header, per-task `Interfaces:`
 lines — scenario `sdd-go-fractals-crisp`) ran 3/3 green at $9.51-12.65
 (mean $11.60 vs combo band $11.67-14.84), 20-24 dispatches vs 28, fix
 waves flat. What remains is elicitation: getting writing-plans guidance
 to *produce* such plans (micro-test per the doctrine, then the follow-up
 PR). See the experiments log, Batch A-E.
 The plan is upstream of every cost: task count sets dispatch count; plan
 ambiguity sets review-loop count; plan completeness sets implementer
 exploration. Current writing-plans optimizes for implementer success, not
@@ -102,6 +110,17 @@ target economics and ambiguity, not placeholder hygiene.
 ### L2 — Controller tier (est. −$4-5/run; the biggest single lever, gated hardest)
 **Status 2026-06-11: recon positive, gates still owed.** Sonnet-controller
 run 1 (claude-sonnet coding-agent): all gates green at **$6.68** / 31 min
 (combo band $11.67-14.84), 26/26 dispatches model-explicit, review loops
 and omnibus-fixer rules followed, and the controller caught a fixer
 side-effect (`go mod tidy` removed cobra) before re-review — real
 adjudication, not silent absorption. But the run surfaced zero
 BLOCKED/⚠️ events (the escalation points were never stressed) and the
 final review ran on sonnet rather than the most capable tier. The N=5
 quality gates + full judgment audit below remain mandatory before any
 skill change.
 The controller is half the dollars solely because it inherits the session
 model. Its turn floor is prompt-immune, so the lever is the rate per turn —
 but the controller is also where most judgment points live, so this rung is
@@ -133,6 +152,16 @@ multi-step *work*, not dispatch loops; whether a mid-tier controller holds
 ### L3 — Reviewer tier (est. −$0.7-1/run; most likely rung to die on the judgment guardrail)
 **Status 2026-06-11: DEAD, as pre-registered.** Planted-defect ×5 with
 forced-haiku task reviewers: 2 pass / 1 indeterminate / 2 fail (baseline
 5/5); per-task haiku cleanly flagged 0 of 10 planted defects at correct
 severity — 1 found-but-downgraded with the exact prohibited rationale,
 9 missed or rationalized (DRY praised as YAGNI; assert-nothing test
 called plan-compliant). Cheap reviewers fail by *advocating* for
 defects; passing runs survived only on controller redundancy or the
 final review. Recorded in the experiments log, Batch A-E. Do not
 re-propose without a structurally different design.
 The package reviewer is near-single-step mechanically (3 turns / 1 Read
 when calm), which invalidates the original turn-inflation rationale for the
 mid-tier floor — but reviewing is judgment through and through: severity