Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first

2026-07-27 12:44:01 +08:00 · 2026-06-10 14:35:00 -07:00
parent 60fa4f6fc4
commit 9a25a75bac
1 changed files with 186 additions and 0 deletions
--- a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
+++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
@@ -0,0 +1,186 @@
+# Strict-Cost SDD — Design Spec
+
+**Status:** Proposed experiment ladder (not implementation). Each rung ships
+only with its gate evidence; abort any rung whose gates fail.
+**Objective:** minimize dollars per plan-execution. Wall-clock is
+unconstrained; token count matters only as a cost driver.
+**Hard invariant:** quality. Concretely: `sdd-quality-reviewer-catches-
+planted-defect` pass rate over **N=5 runs** (not 1 — single-run gates were
+this campaign's weakest methodology), `sdd-rejects-extra-features` pass,
+all end-to-end scenarios pass, blind A/B deliverable parity with the
+current config. Any quality regression kills the rung, full stop.
+
+## Where the dollars are (final 2026-06-10 config, go-fractals, ~$13/run)
+
+| Component | $ | Driver |
+|---|---|---|
+| Controller (session model, opus) | ~6-7 | ~150 turns × resident context; prompt-immune turn floor (46% thinking/narration) |
+| Implementers (sonnet, 10-13 dispatches) | ~5-6 | the actual work; ~25 turns each; ~13 pre-edit exploration calls each |
+| Task reviewers (sonnet, 10) | ~1-1.5 | 3-9 turns each with package |
+| Final review + fixes | ~1 | 6 turns with branch package |
+
+Review-loop count (2-4 per run) is the biggest run-to-run cost variance;
+loops are mostly caused by plan ambiguity the implementer resolved wrongly.
+
+## Judgment guardrail (co-invariant with quality)
+
+**Cheapen mechanics, never judgment.** Every rung must enumerate which
+decisions it moves to a cheaper model and show each is *mechanical* —
+deterministic, scriptable, or cheaply verifiable after the fact. Judgment
+stays at the highest tier or with the human. The judgment points in SDD,
+explicitly:
+
+- **BLOCKED / NEEDS_CONTEXT handling** — diagnosing why a subagent is stuck
+  and choosing the remedy
+- **⚠️ "cannot verify from diff" resolution** — the controller adjudicating
+  with cross-task context
+- **Dispatch curation** — ambiguity resolution and task-boundary drawing
+  (measured load-bearing: the Task 5 gradient-direction note prevented a
+  wrong implementation)
+- **Review verdicts and severity calibration** — what is Important vs Minor
+- **Review-loop adjudication** — deciding a finding is a false positive
+- **Escalate-to-human recognition** — knowing the plan itself is wrong
+
+A rung that would move any of these to a cheaper model must either (a)
+restructure so the decision is made once by the expensive model at plan
+time, (b) add an explicit escalation rule routing it back up at execution
+time, or (c) die. "The cheap model usually gets it right" is not
+acceptance evidence — judgment failures are rare-event, high-blast-radius,
+and largely invisible to pass/fail gates, which is why every tier change
+below carries a judgment audit (session-resume interrogation of each
+judgment point in the gate runs, compared against the expensive-controller
+baseline) in addition to the N=5 scenario gates.
+
+## Thesis guardrail
+
+SDD's thesis: **a fresh subagent per task with precisely curated context,
+gated per task.** Rungs below must preserve it. Dispatch-time task batching
+(one implementer dispatch handling several plan tasks) is **counter-thesis**
+— it pollutes the fresh-context property and coarsens the gates — and is
+deliberately NOT on the ladder. The thesis-compatible route to the same
+dispatch economics is plan-time task right-sizing (L1): if the plan defines
+fewer, better-sized tasks, SDD still runs one fresh subagent per task.
+
+## The ladder (in expected $/leverage order)
+
+### L1 — Plan-side crispness (writing-plans changes; est. −$1.5-3/run, plus variance reduction)
+
+The plan is upstream of every cost: task count sets dispatch count; plan
+ambiguity sets review-loop count; plan completeness sets implementer
+exploration. Current writing-plans optimizes for implementer success, not
+execution economics. Changes to test:
+
+1. **Task right-sizing guidance.** Today's plans produce tasks as small as
+   "create .gitignore" — each costing a full dispatch + review cycle
+   (~$0.60-1.00 fixed overhead). Add: "A task is the smallest unit that
+   carries its own test cycle and is worth a fresh reviewer's gate. Merge
+   setup/config steps into the task that needs them; split only at
+   boundaries where a reviewer could meaningfully reject." Fractals' plan
+   would drop from 10 tasks to ~7. Validate: dispatch count falls, gates
+   hold, review granularity still catches the planted defect.
+2. **Structured `## Global Constraints` section** in the plan header
+   (version floors, naming/copy rules, platform requirements). Today these
+   live in design.md prose and reach reviewers only if the controller
+   remembers to paste them (a `go 1.26.1` floor violation shipped because
+   none did). A fixed heading makes them mechanically extractable —
+   `task-brief` can append them to every brief automatically (small script
+   change), removing a controller responsibility entirely.
+3. **Per-task `Interfaces:` line** (consumes/produces, exact signatures).
+   The controller currently re-derives cross-task interfaces per dispatch
+   (its main legitimate "restating"), and implementers spend ~13 tool calls
+   re-discovering context. The planner already knows the interfaces; one
+   line per task moves the work to where it is done once.
+4. **Per-task model-tier recommendation** from the planner ("mechanical /
+   standard / judgment"). The planner has the best information for the
+   Model Selection decision the controller currently re-makes per dispatch;
+   the controller keeps override authority.
+
+Validation: micro-test the planner output shape (recipe-style, per the
+instruction-design doctrine), then full runs. Note the 2026-06-10 result:
+plan *placeholders* cannot be elicited from current opus — these changes
+target economics and ambiguity, not placeholder hygiene.
+
+### L2 — Controller tier (est. −$4-5/run; the biggest single lever, gated hardest)
+
+The controller is half the dollars solely because it inherits the session
+model. Its turn floor is prompt-immune, so the lever is the rate per turn —
+but the controller is also where most judgment points live, so this rung is
+designed judgment-first:
+
+1. **Primary form — judgment moved up front, mechanics cheapened:** the
+   expensive model does the judgment-dense work at plan time (L1's
+   Interfaces lines, ambiguity resolutions, per-task constraints — i.e.
+   the dispatch curation is pre-written into the plan). The mid-tier
+   execution session then runs a loop that is genuinely mechanical:
+   extract brief, dispatch, run script, route verdicts. Explicit
+   escalation rules in the skill: on BLOCKED, on any ⚠️ item, on a
+   suspected false positive, or on anything the plan does not already
+   answer, the cheap controller STOPS and escalates (to the human, or to
+   a fresh expensive-model consultation dispatch) — it never resolves
+   judgment alone.
+2. **Gates beyond the standard N=5:** a judgment audit — every
+   BLOCKED/⚠️/adjudication event in the gate runs interrogated via
+   session-resume and scored against how the opus-controller baseline
+   handled the same class of event; any silently-absorbed judgment call
+   (cheap controller resolving what it should have escalated) fails the
+   rung regardless of scenario verdicts.
+3. **User authority preserved:** the skill recommends, never enforces, the
+   execution-session tier.
+
+Caveat from this campaign: cheap-model turn inflation was measured on
+multi-step *work*, not dispatch loops; whether a mid-tier controller holds
+~150 turns is part of what the experiment determines.
+
+### L3 — Reviewer tier (est. −$0.7-1/run; most likely rung to die on the judgment guardrail)
+
+The package reviewer is near-single-step mechanically (3 turns / 1 Read
+when calm), which invalidates the original turn-inflation rationale for the
+mid-tier floor — but reviewing is judgment through and through: severity
+calibration, spec verdicts, knowing what not to flag. Mechanical cheapness
+does not make the decisions mechanical. Test haiku-with-package only with
+the full judgment battery: planted-defect ×5, a severity-calibration check
+(seeded Minor-vs-Important pairs; miscalibration fails the rung), and the
+escape-hatch variance re-measured at that tier. Prior expectation: this
+rung dies, and that is a fine outcome — it converts "we suspect cheap
+reviewers are bad" into recorded evidence.
+
+### L4 — Resident-context diet (est. −$0.5-1/run)
+
+- `task-brief --list` mode: controller reads task headings + Global
+  Constraints, never the full plan (the plan body is already delivered via
+  briefs).
+- Reports trim 15 → 8 lines.
+- SKILL.md minification pass (every section added this week re-justified
+  at composition-recipe density; Codex pays ~10k chars × ~500 re-reads per
+  long session).
+
+### L5 — Re-litigations (explicitly flagged, maintainer-vetoed or counter-thesis)
+
+Recorded for completeness; each requires Jesse's explicit reversal before
+any experiment:
+- **Scoped re-reviews** (verify fix + regression scan instead of full
+  re-review): vetoed 2026-06-09; worth ~$0.50/run at most.
+- **Dispatch-time task batching**: counter-thesis (see guardrail). L1.1
+  is the sanctioned form.
+
+## Budget and sequencing
+
+L1 and L2.1 are independent — run both first (~$80: micro-tests + 2×5-run
+gates + A/B). L3 after L2 settles the controller (reviewer behavior depends
+on dispatch quality; ~$25 — planted-defect runs are $2-3 each). L4 last
+(cheap, but re-gate once after the stack; ~$30). Total ≲ $150 for the full
+ladder with honest N=5 gates. Expected end state if every rung survives its gates: **$5-7/run on
+fractals (from $12-15)**; if the judgment-sensitive rungs (L2 beyond its
+primary form, L3) die as expected, **$8-10/run** — the honest target, since
+the guardrail prices judgment above dollars by construction.
+
+## Relationship to existing work
+
+Builds on the 2026-06-09 task-scoped review dispatch design (PR #1717) and
+the 2026-06-10 experiment campaign (evals
+`docs/experiments/2026-06-10-sdd-cost-experiments.md` — consult the
+negative-results section before adding rungs; turn-discipline and
+parallel-call mechanisms are dead). Instruction wording for any new prose
+follows the positive-instruction doctrine spec and gets micro-tested before
+full runs. L1 is a writing-plans change → its own PR with eval evidence;
+L2-L4 are SDD changes → separate PR(s).