From 9a25a75bacc9b269e233c51f22c0a9e185cf8995 Mon Sep 17 00:00:00 2001 From: Jesse Vincent Date: Wed, 10 Jun 2026 14:35:00 -0700 Subject: [PATCH] =?UTF-8?q?Spec:=20strict-cost=20SDD=20experiment=20ladder?= =?UTF-8?q?=20=E2=80=94=20judgment=20as=20co-invariant,=20plan-side=20cris?= =?UTF-8?q?pness=20first?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../2026-06-10-strict-cost-sdd-design.md | 186 ++++++++++++++++++ 1 file changed, 186 insertions(+) create mode 100644 docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md diff --git a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md new file mode 100644 index 00000000..9f800f0a --- /dev/null +++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md @@ -0,0 +1,186 @@ +# Strict-Cost SDD — Design Spec + +**Status:** Proposed experiment ladder (not implementation). Each rung ships +only with its gate evidence; abort any rung whose gates fail. +**Objective:** minimize dollars per plan-execution. Wall-clock is +unconstrained; token count matters only as a cost driver. +**Hard invariant:** quality. Concretely: `sdd-quality-reviewer-catches- +planted-defect` pass rate over **N=5 runs** (not 1 — single-run gates were +this campaign's weakest methodology), `sdd-rejects-extra-features` pass, +all end-to-end scenarios pass, blind A/B deliverable parity with the +current config. Any quality regression kills the rung, full stop. + +## Where the dollars are (final 2026-06-10 config, go-fractals, ~$13/run) + +| Component | $ | Driver | +|---|---|---| +| Controller (session model, opus) | ~6-7 | ~150 turns × resident context; prompt-immune turn floor (46% thinking/narration) | +| Implementers (sonnet, 10-13 dispatches) | ~5-6 | the actual work; ~25 turns each; ~13 pre-edit exploration calls each | +| Task reviewers (sonnet, 10) | ~1-1.5 | 3-9 turns each with package | +| Final review + fixes | ~1 | 6 turns with branch package | + +Review-loop count (2-4 per run) is the biggest run-to-run cost variance; +loops are mostly caused by plan ambiguity the implementer resolved wrongly. + +## Judgment guardrail (co-invariant with quality) + +**Cheapen mechanics, never judgment.** Every rung must enumerate which +decisions it moves to a cheaper model and show each is *mechanical* — +deterministic, scriptable, or cheaply verifiable after the fact. Judgment +stays at the highest tier or with the human. The judgment points in SDD, +explicitly: + +- **BLOCKED / NEEDS_CONTEXT handling** — diagnosing why a subagent is stuck + and choosing the remedy +- **⚠️ "cannot verify from diff" resolution** — the controller adjudicating + with cross-task context +- **Dispatch curation** — ambiguity resolution and task-boundary drawing + (measured load-bearing: the Task 5 gradient-direction note prevented a + wrong implementation) +- **Review verdicts and severity calibration** — what is Important vs Minor +- **Review-loop adjudication** — deciding a finding is a false positive +- **Escalate-to-human recognition** — knowing the plan itself is wrong + +A rung that would move any of these to a cheaper model must either (a) +restructure so the decision is made once by the expensive model at plan +time, (b) add an explicit escalation rule routing it back up at execution +time, or (c) die. "The cheap model usually gets it right" is not +acceptance evidence — judgment failures are rare-event, high-blast-radius, +and largely invisible to pass/fail gates, which is why every tier change +below carries a judgment audit (session-resume interrogation of each +judgment point in the gate runs, compared against the expensive-controller +baseline) in addition to the N=5 scenario gates. + +## Thesis guardrail + +SDD's thesis: **a fresh subagent per task with precisely curated context, +gated per task.** Rungs below must preserve it. Dispatch-time task batching +(one implementer dispatch handling several plan tasks) is **counter-thesis** +— it pollutes the fresh-context property and coarsens the gates — and is +deliberately NOT on the ladder. The thesis-compatible route to the same +dispatch economics is plan-time task right-sizing (L1): if the plan defines +fewer, better-sized tasks, SDD still runs one fresh subagent per task. + +## The ladder (in expected $/leverage order) + +### L1 — Plan-side crispness (writing-plans changes; est. −$1.5-3/run, plus variance reduction) + +The plan is upstream of every cost: task count sets dispatch count; plan +ambiguity sets review-loop count; plan completeness sets implementer +exploration. Current writing-plans optimizes for implementer success, not +execution economics. Changes to test: + +1. **Task right-sizing guidance.** Today's plans produce tasks as small as + "create .gitignore" — each costing a full dispatch + review cycle + (~$0.60-1.00 fixed overhead). Add: "A task is the smallest unit that + carries its own test cycle and is worth a fresh reviewer's gate. Merge + setup/config steps into the task that needs them; split only at + boundaries where a reviewer could meaningfully reject." Fractals' plan + would drop from 10 tasks to ~7. Validate: dispatch count falls, gates + hold, review granularity still catches the planted defect. +2. **Structured `## Global Constraints` section** in the plan header + (version floors, naming/copy rules, platform requirements). Today these + live in design.md prose and reach reviewers only if the controller + remembers to paste them (a `go 1.26.1` floor violation shipped because + none did). A fixed heading makes them mechanically extractable — + `task-brief` can append them to every brief automatically (small script + change), removing a controller responsibility entirely. +3. **Per-task `Interfaces:` line** (consumes/produces, exact signatures). + The controller currently re-derives cross-task interfaces per dispatch + (its main legitimate "restating"), and implementers spend ~13 tool calls + re-discovering context. The planner already knows the interfaces; one + line per task moves the work to where it is done once. +4. **Per-task model-tier recommendation** from the planner ("mechanical / + standard / judgment"). The planner has the best information for the + Model Selection decision the controller currently re-makes per dispatch; + the controller keeps override authority. + +Validation: micro-test the planner output shape (recipe-style, per the +instruction-design doctrine), then full runs. Note the 2026-06-10 result: +plan *placeholders* cannot be elicited from current opus — these changes +target economics and ambiguity, not placeholder hygiene. + +### L2 — Controller tier (est. −$4-5/run; the biggest single lever, gated hardest) + +The controller is half the dollars solely because it inherits the session +model. Its turn floor is prompt-immune, so the lever is the rate per turn — +but the controller is also where most judgment points live, so this rung is +designed judgment-first: + +1. **Primary form — judgment moved up front, mechanics cheapened:** the + expensive model does the judgment-dense work at plan time (L1's + Interfaces lines, ambiguity resolutions, per-task constraints — i.e. + the dispatch curation is pre-written into the plan). The mid-tier + execution session then runs a loop that is genuinely mechanical: + extract brief, dispatch, run script, route verdicts. Explicit + escalation rules in the skill: on BLOCKED, on any ⚠️ item, on a + suspected false positive, or on anything the plan does not already + answer, the cheap controller STOPS and escalates (to the human, or to + a fresh expensive-model consultation dispatch) — it never resolves + judgment alone. +2. **Gates beyond the standard N=5:** a judgment audit — every + BLOCKED/⚠️/adjudication event in the gate runs interrogated via + session-resume and scored against how the opus-controller baseline + handled the same class of event; any silently-absorbed judgment call + (cheap controller resolving what it should have escalated) fails the + rung regardless of scenario verdicts. +3. **User authority preserved:** the skill recommends, never enforces, the + execution-session tier. + +Caveat from this campaign: cheap-model turn inflation was measured on +multi-step *work*, not dispatch loops; whether a mid-tier controller holds +~150 turns is part of what the experiment determines. + +### L3 — Reviewer tier (est. −$0.7-1/run; most likely rung to die on the judgment guardrail) + +The package reviewer is near-single-step mechanically (3 turns / 1 Read +when calm), which invalidates the original turn-inflation rationale for the +mid-tier floor — but reviewing is judgment through and through: severity +calibration, spec verdicts, knowing what not to flag. Mechanical cheapness +does not make the decisions mechanical. Test haiku-with-package only with +the full judgment battery: planted-defect ×5, a severity-calibration check +(seeded Minor-vs-Important pairs; miscalibration fails the rung), and the +escape-hatch variance re-measured at that tier. Prior expectation: this +rung dies, and that is a fine outcome — it converts "we suspect cheap +reviewers are bad" into recorded evidence. + +### L4 — Resident-context diet (est. −$0.5-1/run) + +- `task-brief --list` mode: controller reads task headings + Global + Constraints, never the full plan (the plan body is already delivered via + briefs). +- Reports trim 15 → 8 lines. +- SKILL.md minification pass (every section added this week re-justified + at composition-recipe density; Codex pays ~10k chars × ~500 re-reads per + long session). + +### L5 — Re-litigations (explicitly flagged, maintainer-vetoed or counter-thesis) + +Recorded for completeness; each requires Jesse's explicit reversal before +any experiment: +- **Scoped re-reviews** (verify fix + regression scan instead of full + re-review): vetoed 2026-06-09; worth ~$0.50/run at most. +- **Dispatch-time task batching**: counter-thesis (see guardrail). L1.1 + is the sanctioned form. + +## Budget and sequencing + +L1 and L2.1 are independent — run both first (~$80: micro-tests + 2×5-run +gates + A/B). L3 after L2 settles the controller (reviewer behavior depends +on dispatch quality; ~$25 — planted-defect runs are $2-3 each). L4 last +(cheap, but re-gate once after the stack; ~$30). Total ≲ $150 for the full +ladder with honest N=5 gates. Expected end state if every rung survives its gates: **$5-7/run on +fractals (from $12-15)**; if the judgment-sensitive rungs (L2 beyond its +primary form, L3) die as expected, **$8-10/run** — the honest target, since +the guardrail prices judgment above dollars by construction. + +## Relationship to existing work + +Builds on the 2026-06-09 task-scoped review dispatch design (PR #1717) and +the 2026-06-10 experiment campaign (evals +`docs/experiments/2026-06-10-sdd-cost-experiments.md` — consult the +negative-results section before adding rungs; turn-discipline and +parallel-call mechanisms are dead). Instruction wording for any new prose +follows the positive-instruction doctrine spec and gets micro-tested before +full runs. L1 is a writing-plans change → its own PR with eval evidence; +L2-L4 are SDD changes → separate PR(s).