mirror of
https://github.com/obra/superpowers.git
synced 2026-06-12 05:39:05 +08:00
Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first
This commit is contained in:
186
docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
Normal file
186
docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
Normal file
@@ -0,0 +1,186 @@
|
||||
# Strict-Cost SDD — Design Spec
|
||||
|
||||
**Status:** Proposed experiment ladder (not implementation). Each rung ships
|
||||
only with its gate evidence; abort any rung whose gates fail.
|
||||
**Objective:** minimize dollars per plan-execution. Wall-clock is
|
||||
unconstrained; token count matters only as a cost driver.
|
||||
**Hard invariant:** quality. Concretely: `sdd-quality-reviewer-catches-
|
||||
planted-defect` pass rate over **N=5 runs** (not 1 — single-run gates were
|
||||
this campaign's weakest methodology), `sdd-rejects-extra-features` pass,
|
||||
all end-to-end scenarios pass, blind A/B deliverable parity with the
|
||||
current config. Any quality regression kills the rung, full stop.
|
||||
|
||||
## Where the dollars are (final 2026-06-10 config, go-fractals, ~$13/run)
|
||||
|
||||
| Component | $ | Driver |
|
||||
|---|---|---|
|
||||
| Controller (session model, opus) | ~6-7 | ~150 turns × resident context; prompt-immune turn floor (46% thinking/narration) |
|
||||
| Implementers (sonnet, 10-13 dispatches) | ~5-6 | the actual work; ~25 turns each; ~13 pre-edit exploration calls each |
|
||||
| Task reviewers (sonnet, 10) | ~1-1.5 | 3-9 turns each with package |
|
||||
| Final review + fixes | ~1 | 6 turns with branch package |
|
||||
|
||||
Review-loop count (2-4 per run) is the biggest run-to-run cost variance;
|
||||
loops are mostly caused by plan ambiguity the implementer resolved wrongly.
|
||||
|
||||
## Judgment guardrail (co-invariant with quality)
|
||||
|
||||
**Cheapen mechanics, never judgment.** Every rung must enumerate which
|
||||
decisions it moves to a cheaper model and show each is *mechanical* —
|
||||
deterministic, scriptable, or cheaply verifiable after the fact. Judgment
|
||||
stays at the highest tier or with the human. The judgment points in SDD,
|
||||
explicitly:
|
||||
|
||||
- **BLOCKED / NEEDS_CONTEXT handling** — diagnosing why a subagent is stuck
|
||||
and choosing the remedy
|
||||
- **⚠️ "cannot verify from diff" resolution** — the controller adjudicating
|
||||
with cross-task context
|
||||
- **Dispatch curation** — ambiguity resolution and task-boundary drawing
|
||||
(measured load-bearing: the Task 5 gradient-direction note prevented a
|
||||
wrong implementation)
|
||||
- **Review verdicts and severity calibration** — what is Important vs Minor
|
||||
- **Review-loop adjudication** — deciding a finding is a false positive
|
||||
- **Escalate-to-human recognition** — knowing the plan itself is wrong
|
||||
|
||||
A rung that would move any of these to a cheaper model must either (a)
|
||||
restructure so the decision is made once by the expensive model at plan
|
||||
time, (b) add an explicit escalation rule routing it back up at execution
|
||||
time, or (c) die. "The cheap model usually gets it right" is not
|
||||
acceptance evidence — judgment failures are rare-event, high-blast-radius,
|
||||
and largely invisible to pass/fail gates, which is why every tier change
|
||||
below carries a judgment audit (session-resume interrogation of each
|
||||
judgment point in the gate runs, compared against the expensive-controller
|
||||
baseline) in addition to the N=5 scenario gates.
|
||||
|
||||
## Thesis guardrail
|
||||
|
||||
SDD's thesis: **a fresh subagent per task with precisely curated context,
|
||||
gated per task.** Rungs below must preserve it. Dispatch-time task batching
|
||||
(one implementer dispatch handling several plan tasks) is **counter-thesis**
|
||||
— it pollutes the fresh-context property and coarsens the gates — and is
|
||||
deliberately NOT on the ladder. The thesis-compatible route to the same
|
||||
dispatch economics is plan-time task right-sizing (L1): if the plan defines
|
||||
fewer, better-sized tasks, SDD still runs one fresh subagent per task.
|
||||
|
||||
## The ladder (in expected $/leverage order)
|
||||
|
||||
### L1 — Plan-side crispness (writing-plans changes; est. −$1.5-3/run, plus variance reduction)
|
||||
|
||||
The plan is upstream of every cost: task count sets dispatch count; plan
|
||||
ambiguity sets review-loop count; plan completeness sets implementer
|
||||
exploration. Current writing-plans optimizes for implementer success, not
|
||||
execution economics. Changes to test:
|
||||
|
||||
1. **Task right-sizing guidance.** Today's plans produce tasks as small as
|
||||
"create .gitignore" — each costing a full dispatch + review cycle
|
||||
(~$0.60-1.00 fixed overhead). Add: "A task is the smallest unit that
|
||||
carries its own test cycle and is worth a fresh reviewer's gate. Merge
|
||||
setup/config steps into the task that needs them; split only at
|
||||
boundaries where a reviewer could meaningfully reject." Fractals' plan
|
||||
would drop from 10 tasks to ~7. Validate: dispatch count falls, gates
|
||||
hold, review granularity still catches the planted defect.
|
||||
2. **Structured `## Global Constraints` section** in the plan header
|
||||
(version floors, naming/copy rules, platform requirements). Today these
|
||||
live in design.md prose and reach reviewers only if the controller
|
||||
remembers to paste them (a `go 1.26.1` floor violation shipped because
|
||||
none did). A fixed heading makes them mechanically extractable —
|
||||
`task-brief` can append them to every brief automatically (small script
|
||||
change), removing a controller responsibility entirely.
|
||||
3. **Per-task `Interfaces:` line** (consumes/produces, exact signatures).
|
||||
The controller currently re-derives cross-task interfaces per dispatch
|
||||
(its main legitimate "restating"), and implementers spend ~13 tool calls
|
||||
re-discovering context. The planner already knows the interfaces; one
|
||||
line per task moves the work to where it is done once.
|
||||
4. **Per-task model-tier recommendation** from the planner ("mechanical /
|
||||
standard / judgment"). The planner has the best information for the
|
||||
Model Selection decision the controller currently re-makes per dispatch;
|
||||
the controller keeps override authority.
|
||||
|
||||
Validation: micro-test the planner output shape (recipe-style, per the
|
||||
instruction-design doctrine), then full runs. Note the 2026-06-10 result:
|
||||
plan *placeholders* cannot be elicited from current opus — these changes
|
||||
target economics and ambiguity, not placeholder hygiene.
|
||||
|
||||
### L2 — Controller tier (est. −$4-5/run; the biggest single lever, gated hardest)
|
||||
|
||||
The controller is half the dollars solely because it inherits the session
|
||||
model. Its turn floor is prompt-immune, so the lever is the rate per turn —
|
||||
but the controller is also where most judgment points live, so this rung is
|
||||
designed judgment-first:
|
||||
|
||||
1. **Primary form — judgment moved up front, mechanics cheapened:** the
|
||||
expensive model does the judgment-dense work at plan time (L1's
|
||||
Interfaces lines, ambiguity resolutions, per-task constraints — i.e.
|
||||
the dispatch curation is pre-written into the plan). The mid-tier
|
||||
execution session then runs a loop that is genuinely mechanical:
|
||||
extract brief, dispatch, run script, route verdicts. Explicit
|
||||
escalation rules in the skill: on BLOCKED, on any ⚠️ item, on a
|
||||
suspected false positive, or on anything the plan does not already
|
||||
answer, the cheap controller STOPS and escalates (to the human, or to
|
||||
a fresh expensive-model consultation dispatch) — it never resolves
|
||||
judgment alone.
|
||||
2. **Gates beyond the standard N=5:** a judgment audit — every
|
||||
BLOCKED/⚠️/adjudication event in the gate runs interrogated via
|
||||
session-resume and scored against how the opus-controller baseline
|
||||
handled the same class of event; any silently-absorbed judgment call
|
||||
(cheap controller resolving what it should have escalated) fails the
|
||||
rung regardless of scenario verdicts.
|
||||
3. **User authority preserved:** the skill recommends, never enforces, the
|
||||
execution-session tier.
|
||||
|
||||
Caveat from this campaign: cheap-model turn inflation was measured on
|
||||
multi-step *work*, not dispatch loops; whether a mid-tier controller holds
|
||||
~150 turns is part of what the experiment determines.
|
||||
|
||||
### L3 — Reviewer tier (est. −$0.7-1/run; most likely rung to die on the judgment guardrail)
|
||||
|
||||
The package reviewer is near-single-step mechanically (3 turns / 1 Read
|
||||
when calm), which invalidates the original turn-inflation rationale for the
|
||||
mid-tier floor — but reviewing is judgment through and through: severity
|
||||
calibration, spec verdicts, knowing what not to flag. Mechanical cheapness
|
||||
does not make the decisions mechanical. Test haiku-with-package only with
|
||||
the full judgment battery: planted-defect ×5, a severity-calibration check
|
||||
(seeded Minor-vs-Important pairs; miscalibration fails the rung), and the
|
||||
escape-hatch variance re-measured at that tier. Prior expectation: this
|
||||
rung dies, and that is a fine outcome — it converts "we suspect cheap
|
||||
reviewers are bad" into recorded evidence.
|
||||
|
||||
### L4 — Resident-context diet (est. −$0.5-1/run)
|
||||
|
||||
- `task-brief --list` mode: controller reads task headings + Global
|
||||
Constraints, never the full plan (the plan body is already delivered via
|
||||
briefs).
|
||||
- Reports trim 15 → 8 lines.
|
||||
- SKILL.md minification pass (every section added this week re-justified
|
||||
at composition-recipe density; Codex pays ~10k chars × ~500 re-reads per
|
||||
long session).
|
||||
|
||||
### L5 — Re-litigations (explicitly flagged, maintainer-vetoed or counter-thesis)
|
||||
|
||||
Recorded for completeness; each requires Jesse's explicit reversal before
|
||||
any experiment:
|
||||
- **Scoped re-reviews** (verify fix + regression scan instead of full
|
||||
re-review): vetoed 2026-06-09; worth ~$0.50/run at most.
|
||||
- **Dispatch-time task batching**: counter-thesis (see guardrail). L1.1
|
||||
is the sanctioned form.
|
||||
|
||||
## Budget and sequencing
|
||||
|
||||
L1 and L2.1 are independent — run both first (~$80: micro-tests + 2×5-run
|
||||
gates + A/B). L3 after L2 settles the controller (reviewer behavior depends
|
||||
on dispatch quality; ~$25 — planted-defect runs are $2-3 each). L4 last
|
||||
(cheap, but re-gate once after the stack; ~$30). Total ≲ $150 for the full
|
||||
ladder with honest N=5 gates. Expected end state if every rung survives its gates: **$5-7/run on
|
||||
fractals (from $12-15)**; if the judgment-sensitive rungs (L2 beyond its
|
||||
primary form, L3) die as expected, **$8-10/run** — the honest target, since
|
||||
the guardrail prices judgment above dollars by construction.
|
||||
|
||||
## Relationship to existing work
|
||||
|
||||
Builds on the 2026-06-09 task-scoped review dispatch design (PR #1717) and
|
||||
the 2026-06-10 experiment campaign (evals
|
||||
`docs/experiments/2026-06-10-sdd-cost-experiments.md` — consult the
|
||||
negative-results section before adding rungs; turn-discipline and
|
||||
parallel-call mechanisms are dead). Instruction wording for any new prose
|
||||
follows the positive-instruction doctrine spec and gets micro-tested before
|
||||
full runs. L1 is a writing-plans change → its own PR with eval evidence;
|
||||
L2-L4 are SDD changes → separate PR(s).
|
||||
Reference in New Issue
Block a user