Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead)

This commit is contained in:
Jesse Vincent
2026-06-10 16:59:43 -07:00
parent 9a25a75bac
commit 27788fdef9

View File

@@ -65,6 +65,14 @@ fewer, better-sized tasks, SDD still runs one fresh subagent per task.
### L1 — Plan-side crispness (writing-plans changes; est. $1.5-3/run, plus variance reduction)
**Status 2026-06-11: validated in effect.** A hand-crisped fractals plan
(10 → 7 tasks, `## Global Constraints` header, per-task `Interfaces:`
lines — scenario `sdd-go-fractals-crisp`) ran 3/3 green at $9.51-12.65
(mean $11.60 vs combo band $11.67-14.84), 20-24 dispatches vs 28, fix
waves flat. What remains is elicitation: getting writing-plans guidance
to *produce* such plans (micro-test per the doctrine, then the follow-up
PR). See the experiments log, Batch A-E.
The plan is upstream of every cost: task count sets dispatch count; plan
ambiguity sets review-loop count; plan completeness sets implementer
exploration. Current writing-plans optimizes for implementer success, not
@@ -102,6 +110,17 @@ target economics and ambiguity, not placeholder hygiene.
### L2 — Controller tier (est. $4-5/run; the biggest single lever, gated hardest)
**Status 2026-06-11: recon positive, gates still owed.** Sonnet-controller
run 1 (claude-sonnet coding-agent): all gates green at **$6.68** / 31 min
(combo band $11.67-14.84), 26/26 dispatches model-explicit, review loops
and omnibus-fixer rules followed, and the controller caught a fixer
side-effect (`go mod tidy` removed cobra) before re-review — real
adjudication, not silent absorption. But the run surfaced zero
BLOCKED/⚠️ events (the escalation points were never stressed) and the
final review ran on sonnet rather than the most capable tier. The N=5
quality gates + full judgment audit below remain mandatory before any
skill change.
The controller is half the dollars solely because it inherits the session
model. Its turn floor is prompt-immune, so the lever is the rate per turn —
but the controller is also where most judgment points live, so this rung is
@@ -133,6 +152,16 @@ multi-step *work*, not dispatch loops; whether a mid-tier controller holds
### L3 — Reviewer tier (est. $0.7-1/run; most likely rung to die on the judgment guardrail)
**Status 2026-06-11: DEAD, as pre-registered.** Planted-defect ×5 with
forced-haiku task reviewers: 2 pass / 1 indeterminate / 2 fail (baseline
5/5); per-task haiku cleanly flagged 0 of 10 planted defects at correct
severity — 1 found-but-downgraded with the exact prohibited rationale,
9 missed or rationalized (DRY praised as YAGNI; assert-nothing test
called plan-compliant). Cheap reviewers fail by *advocating* for
defects; passing runs survived only on controller redundancy or the
final review. Recorded in the experiments log, Batch A-E. Do not
re-propose without a structurally different design.
The package reviewer is near-single-step mechanically (3 turns / 1 Read
when calm), which invalidates the original turn-inflation rationale for the
mid-tier floor — but reviewing is judgment through and through: severity