Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first

This commit is contained in:
Jesse Vincent
2026-06-10 14:35:00 -07:00
parent 60fa4f6fc4
commit 9a25a75bac

View File

@@ -0,0 +1,186 @@
# Strict-Cost SDD — Design Spec
**Status:** Proposed experiment ladder (not implementation). Each rung ships
only with its gate evidence; abort any rung whose gates fail.
**Objective:** minimize dollars per plan-execution. Wall-clock is
unconstrained; token count matters only as a cost driver.
**Hard invariant:** quality. Concretely: `sdd-quality-reviewer-catches-
planted-defect` pass rate over **N=5 runs** (not 1 — single-run gates were
this campaign's weakest methodology), `sdd-rejects-extra-features` pass,
all end-to-end scenarios pass, blind A/B deliverable parity with the
current config. Any quality regression kills the rung, full stop.
## Where the dollars are (final 2026-06-10 config, go-fractals, ~$13/run)
| Component | $ | Driver |
|---|---|---|
| Controller (session model, opus) | ~6-7 | ~150 turns × resident context; prompt-immune turn floor (46% thinking/narration) |
| Implementers (sonnet, 10-13 dispatches) | ~5-6 | the actual work; ~25 turns each; ~13 pre-edit exploration calls each |
| Task reviewers (sonnet, 10) | ~1-1.5 | 3-9 turns each with package |
| Final review + fixes | ~1 | 6 turns with branch package |
Review-loop count (2-4 per run) is the biggest run-to-run cost variance;
loops are mostly caused by plan ambiguity the implementer resolved wrongly.
## Judgment guardrail (co-invariant with quality)
**Cheapen mechanics, never judgment.** Every rung must enumerate which
decisions it moves to a cheaper model and show each is *mechanical*
deterministic, scriptable, or cheaply verifiable after the fact. Judgment
stays at the highest tier or with the human. The judgment points in SDD,
explicitly:
- **BLOCKED / NEEDS_CONTEXT handling** — diagnosing why a subagent is stuck
and choosing the remedy
- **⚠️ "cannot verify from diff" resolution** — the controller adjudicating
with cross-task context
- **Dispatch curation** — ambiguity resolution and task-boundary drawing
(measured load-bearing: the Task 5 gradient-direction note prevented a
wrong implementation)
- **Review verdicts and severity calibration** — what is Important vs Minor
- **Review-loop adjudication** — deciding a finding is a false positive
- **Escalate-to-human recognition** — knowing the plan itself is wrong
A rung that would move any of these to a cheaper model must either (a)
restructure so the decision is made once by the expensive model at plan
time, (b) add an explicit escalation rule routing it back up at execution
time, or (c) die. "The cheap model usually gets it right" is not
acceptance evidence — judgment failures are rare-event, high-blast-radius,
and largely invisible to pass/fail gates, which is why every tier change
below carries a judgment audit (session-resume interrogation of each
judgment point in the gate runs, compared against the expensive-controller
baseline) in addition to the N=5 scenario gates.
## Thesis guardrail
SDD's thesis: **a fresh subagent per task with precisely curated context,
gated per task.** Rungs below must preserve it. Dispatch-time task batching
(one implementer dispatch handling several plan tasks) is **counter-thesis**
— it pollutes the fresh-context property and coarsens the gates — and is
deliberately NOT on the ladder. The thesis-compatible route to the same
dispatch economics is plan-time task right-sizing (L1): if the plan defines
fewer, better-sized tasks, SDD still runs one fresh subagent per task.
## The ladder (in expected $/leverage order)
### L1 — Plan-side crispness (writing-plans changes; est. $1.5-3/run, plus variance reduction)
The plan is upstream of every cost: task count sets dispatch count; plan
ambiguity sets review-loop count; plan completeness sets implementer
exploration. Current writing-plans optimizes for implementer success, not
execution economics. Changes to test:
1. **Task right-sizing guidance.** Today's plans produce tasks as small as
"create .gitignore" — each costing a full dispatch + review cycle
(~$0.60-1.00 fixed overhead). Add: "A task is the smallest unit that
carries its own test cycle and is worth a fresh reviewer's gate. Merge
setup/config steps into the task that needs them; split only at
boundaries where a reviewer could meaningfully reject." Fractals' plan
would drop from 10 tasks to ~7. Validate: dispatch count falls, gates
hold, review granularity still catches the planted defect.
2. **Structured `## Global Constraints` section** in the plan header
(version floors, naming/copy rules, platform requirements). Today these
live in design.md prose and reach reviewers only if the controller
remembers to paste them (a `go 1.26.1` floor violation shipped because
none did). A fixed heading makes them mechanically extractable —
`task-brief` can append them to every brief automatically (small script
change), removing a controller responsibility entirely.
3. **Per-task `Interfaces:` line** (consumes/produces, exact signatures).
The controller currently re-derives cross-task interfaces per dispatch
(its main legitimate "restating"), and implementers spend ~13 tool calls
re-discovering context. The planner already knows the interfaces; one
line per task moves the work to where it is done once.
4. **Per-task model-tier recommendation** from the planner ("mechanical /
standard / judgment"). The planner has the best information for the
Model Selection decision the controller currently re-makes per dispatch;
the controller keeps override authority.
Validation: micro-test the planner output shape (recipe-style, per the
instruction-design doctrine), then full runs. Note the 2026-06-10 result:
plan *placeholders* cannot be elicited from current opus — these changes
target economics and ambiguity, not placeholder hygiene.
### L2 — Controller tier (est. $4-5/run; the biggest single lever, gated hardest)
The controller is half the dollars solely because it inherits the session
model. Its turn floor is prompt-immune, so the lever is the rate per turn —
but the controller is also where most judgment points live, so this rung is
designed judgment-first:
1. **Primary form — judgment moved up front, mechanics cheapened:** the
expensive model does the judgment-dense work at plan time (L1's
Interfaces lines, ambiguity resolutions, per-task constraints — i.e.
the dispatch curation is pre-written into the plan). The mid-tier
execution session then runs a loop that is genuinely mechanical:
extract brief, dispatch, run script, route verdicts. Explicit
escalation rules in the skill: on BLOCKED, on any ⚠️ item, on a
suspected false positive, or on anything the plan does not already
answer, the cheap controller STOPS and escalates (to the human, or to
a fresh expensive-model consultation dispatch) — it never resolves
judgment alone.
2. **Gates beyond the standard N=5:** a judgment audit — every
BLOCKED/⚠️/adjudication event in the gate runs interrogated via
session-resume and scored against how the opus-controller baseline
handled the same class of event; any silently-absorbed judgment call
(cheap controller resolving what it should have escalated) fails the
rung regardless of scenario verdicts.
3. **User authority preserved:** the skill recommends, never enforces, the
execution-session tier.
Caveat from this campaign: cheap-model turn inflation was measured on
multi-step *work*, not dispatch loops; whether a mid-tier controller holds
~150 turns is part of what the experiment determines.
### L3 — Reviewer tier (est. $0.7-1/run; most likely rung to die on the judgment guardrail)
The package reviewer is near-single-step mechanically (3 turns / 1 Read
when calm), which invalidates the original turn-inflation rationale for the
mid-tier floor — but reviewing is judgment through and through: severity
calibration, spec verdicts, knowing what not to flag. Mechanical cheapness
does not make the decisions mechanical. Test haiku-with-package only with
the full judgment battery: planted-defect ×5, a severity-calibration check
(seeded Minor-vs-Important pairs; miscalibration fails the rung), and the
escape-hatch variance re-measured at that tier. Prior expectation: this
rung dies, and that is a fine outcome — it converts "we suspect cheap
reviewers are bad" into recorded evidence.
### L4 — Resident-context diet (est. $0.5-1/run)
- `task-brief --list` mode: controller reads task headings + Global
Constraints, never the full plan (the plan body is already delivered via
briefs).
- Reports trim 15 → 8 lines.
- SKILL.md minification pass (every section added this week re-justified
at composition-recipe density; Codex pays ~10k chars × ~500 re-reads per
long session).
### L5 — Re-litigations (explicitly flagged, maintainer-vetoed or counter-thesis)
Recorded for completeness; each requires Jesse's explicit reversal before
any experiment:
- **Scoped re-reviews** (verify fix + regression scan instead of full
re-review): vetoed 2026-06-09; worth ~$0.50/run at most.
- **Dispatch-time task batching**: counter-thesis (see guardrail). L1.1
is the sanctioned form.
## Budget and sequencing
L1 and L2.1 are independent — run both first (~$80: micro-tests + 2×5-run
gates + A/B). L3 after L2 settles the controller (reviewer behavior depends
on dispatch quality; ~$25 — planted-defect runs are $2-3 each). L4 last
(cheap, but re-gate once after the stack; ~$30). Total ≲ $150 for the full
ladder with honest N=5 gates. Expected end state if every rung survives its gates: **$5-7/run on
fractals (from $12-15)**; if the judgment-sensitive rungs (L2 beyond its
primary form, L3) die as expected, **$8-10/run** — the honest target, since
the guardrail prices judgment above dollars by construction.
## Relationship to existing work
Builds on the 2026-06-09 task-scoped review dispatch design (PR #1717) and
the 2026-06-10 experiment campaign (evals
`docs/experiments/2026-06-10-sdd-cost-experiments.md` — consult the
negative-results section before adding rungs; turn-discipline and
parallel-call mechanisms are dead). Instruction wording for any new prose
follows the positive-instruction doctrine spec and gets micro-tested before
full runs. L1 is a writing-plans change → its own PR with eval evidence;
L2-L4 are SDD changes → separate PR(s).