Compare commits

..

8 Commits

Author SHA1 Message Date
Jesse Vincent
35464d67c0 E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract 2026-06-11 13:17:10 -07:00
Jesse Vincent
90b5433f59 E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-11 13:17:10 -07:00
Jesse Vincent
420c234a2c Bump evals submodule: E29-E34 quality investigation + L2 gate results (af05326) 2026-06-11 13:17:09 -07:00
Jesse Vincent
d1fcc9889a Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not 2026-06-11 13:11:32 -07:00
Jesse Vincent
ac11700642 Bump evals submodule: L1 elicitation + autoresearch scenarios and logs (649b1f8) 2026-06-11 11:37:41 -07:00
Jesse Vincent
710f031ad0 writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks
Claims are fidelity and variance, not dollars (full attribution in the
superpowers-evals experiment log, 2026-06-11 L1 entry):
- Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact
  values verbatim; makes constraints mechanically propagatable to briefs
  and reviewers (a version-floor violation class shipped because they
  weren't). The one fix wave in the elicited full runs was a version-floor
  catch this header enabled.
- Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures,
  within-plan consistent; removes the controller's per-dispatch interface
  re-derivation.
- Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills
  standalone Types/README micro-tasks); no effect at small scale.
- End-to-end (opus-written plan executed under SDD): guidance plan ran 1
  fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug);
  execution cost equal within noise.
2026-06-11 11:35:53 -07:00
Jesse Vincent
72cb21b82c Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules
E30 replay: the planted-DRY catch is causally determined by the
controller-composed constraints block (0/6 with process-shaped vs 5/6
with the spec's own wording). E31 micro: this recipe doubles the rate
at which composed blocks carry the spec's cross-component relationship
(6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).
2026-06-11 11:35:53 -07:00
Jesse Vincent
de1d35e5e7 Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance 2026-06-10 21:44:23 -07:00
5 changed files with 82 additions and 16 deletions

View File

@@ -65,13 +65,21 @@ fewer, better-sized tasks, SDD still runs one fresh subagent per task.
### L1 — Plan-side crispness (writing-plans changes; est. $1.5-3/run, plus variance reduction) ### L1 — Plan-side crispness (writing-plans changes; est. $1.5-3/run, plus variance reduction)
**Status 2026-06-11: validated in effect.** A hand-crisped fractals plan **Status 2026-06-11 (final): elicitation tested end-to-end; claims
(10 → 7 tasks, `## Global Constraints` header, per-task `Interfaces:` re-attributed.** Micro-tests: constraints header and Interfaces blocks
lines — scenario `sdd-go-fractals-crisp`) ran 3/3 green at $9.51-12.65 elicit deterministically (0→5/5, 0→100% of tasks, exact values);
(mean $11.60 vs combo band $11.67-14.84), 20-24 dispatches vs 28, fix right-sizing is modest and scale-dependent (9.4→8.4 tasks at svelte
waves flat. What remains is elicitation: getting writing-plans guidance scale, nothing to move at fractals scale). Full runs: an elicited plan
to *produce* such plans (micro-test per the doctrine, then the follow-up executed at $6.34/$8.49 — but the no-guidance control (opus plan,
PR). See the experiments log, Batch A-E. complete code) hit $7.59/$7.73, inside that range. **The cost win
belongs to opus-written complete-code plans; the hand-written prose
fixture plans all prior numbers used are unrepresentative and ~2×
costlier to execute.** The guidance owns fidelity and variance instead:
deterministic constraints propagation (the one elicited-run fix was a
version-floor catch), exact cross-task interfaces, fix waves 1 vs 2-4
(the control plan shipped a real Sierpinski bug both runs had to fix).
The writing-plans PR claims those grounds, not dollars. Draft at
/tmp/sdd-exp/writing-plans-l1 (branch writing-plans-crisp).
The plan is upstream of every cost: task count sets dispatch count; plan The plan is upstream of every cost: task count sets dispatch count; plan
ambiguity sets review-loop count; plan completeness sets implementer ambiguity sets review-loop count; plan completeness sets implementer
@@ -110,7 +118,25 @@ target economics and ambiguity, not placeholder hygiene.
### L2 — Controller tier (est. $4-5/run; the biggest single lever, gated hardest) ### L2 — Controller tier (est. $4-5/run; the biggest single lever, gated hardest)
**Status 2026-06-11: recon positive (n=2), gates still owed.** **Status 2026-06-11 (final): DIED AT THE GATES, as pre-registered — with
useful anatomy.** Recon was positive ($6.68/$8.05, n=2, mechanics clean).
The full battery split the judgment surface: the new
`sdd-escalates-broken-plan` scenario (explicit plan self-contradiction;
the human never volunteers it) passed **5/5 at sonnet** ($1.02-1.37/run;
opus baseline 2/2) — explicit conflicts get escalated. But the
planted-defect battery failed decisively: under a sonnet controller the
per-task quality gate collapsed into plan-compliance advocacy ("no
assertion, as required" listed under Strengths), the defect shipped in
4/5 runs (deterministic check), and only the tier-pinned opus final
reviewer ever caught it — while the same sonnet-tier reviewers under an
opus controller flagged it 5/5. Cheap controllers handle explicit
escalation; they absorb implicit authority-vs-quality adjudication.
A possible L2b (discrete rule: "a reviewer finding that conflicts with
the plan's text is the human's decision — escalate it") would route the
failing judgment through the escalation behavior that held; untested.
Original recon notes follow.
**Recon (superseded):**
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
**$6.68 and $8.05** / 31-41 min (combo band $11.67-14.84), tokens inside **$6.68 and $8.05** / 31-41 min (combo band $11.67-14.84), tokens inside
the combo band — no cheap-controller turn inflation. 26/26 and 31/31 the combo band — no cheap-controller turn inflation. 26/26 and 31/31

2
evals

Submodule evals updated: ac264b104c...af05326467

View File

@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review
**Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration **Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
**Narration:** between tool calls, narrate at most one short line — the
ledger and the tool results carry the record.
**Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it. **Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
## When to Use ## When to Use
@@ -88,6 +91,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
**Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model. **Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
**Architecture and design tasks**: use the most capable available model. **Architecture and design tasks**: use the most capable available model.
The final whole-branch review is one of these — dispatch it on the most
capable available model, not the session default.
**Review tasks**: choose the model with the same judgment, scaled to the **Review tasks**: choose the model with the same judgment, scaled to the
diff's size, complexity, and risk. A small mechanical diff does not need the diff's size, complexity, and risk. A small mechanical diff does not need the
@@ -100,8 +105,10 @@ most expensive — which silently defeats this section.
**Turn count beats token price.** Wall-clock and context cost scale with how **Turn count beats token price.** Wall-clock and context cost scale with how
many turns a subagent takes, and the cheapest models routinely take 2-3× the many turns a subagent takes, and the cheapest models routinely take 2-3× the
turns on multi-step work — costing more overall. Use a mid-tier model as the turns on multi-step work — costing more overall. Use a mid-tier model as the
floor for implementers and reviewers; reserve the cheapest tier for floor for reviewers and for implementers working from prose descriptions.
single-file mechanical fixes. When the task's plan text contains the complete code to write, the
implementation is transcription plus testing: use the cheapest tier for
that implementer. Single-file mechanical fixes also take the cheapest tier.
**Task complexity signals (implementation tasks):** **Task complexity signals (implementation tasks):**
- Touches 1-2 files with a complete spec → cheap model - Touches 1-2 files with a complete spec → cheap model
@@ -150,9 +157,13 @@ final whole-branch review. When you fill a reviewer template:
loop. If the prompt you are writing contains "do not flag," "don't treat X loop. If the prompt you are writing contains "do not flag," "don't treat X
as a defect," "at most Minor," or "the plan chose" — stop: you are as a defect," "at most Minor," or "the plan chose" — stop: you are
pre-judging, usually to spare yourself a review loop. pre-judging, usually to spare yourself a review loop.
- Include the spec/design's global constraints that bind the task (version - The global-constraints block you hand the reviewer is its attention
floors, naming and copy rules, platform requirements) in the requirements lens. Copy the binding requirements verbatim from the plan's Global
you paste — a reviewer can only enforce what you hand them. Constraints section or the spec: exact values, exact formats, and the
stated relationships between components ("same layout as X", "matches
Y"). The reviewer's template already carries the process rules (YAGNI,
test hygiene, review method) — the constraints block is for what THIS
project's spec demands.
- Hand the reviewer its diff as a file: run this skill's - Hand the reviewer its diff as a file: run this skill's
`scripts/review-package BASE HEAD` and pass the reviewer the file path `scripts/review-package BASE HEAD` and pass the reviewer the file path
it prints (or, without bash: `git log --oneline`, `git diff --stat`, it prints (or, without bash: `git log --oneline`, `git diff --stat`,

View File

@@ -115,6 +115,11 @@ Subagent (general-purpose):
"yes." A tight report that cites lines gives the controller everything "yes." A tight report that cites lines gives the controller everything
it needs. it needs.
Your final message is the report itself: begin directly with the
spec-compliance verdict. Every line is a verdict, a finding with
file:line, or a check you ran — no preamble, no process narration,
no closing summary.
## Calibration ## Calibration
Categorize issues by actual severity. Not everything is Critical. Categorize issues by actual severity. Not everything is Critical.
@@ -159,8 +164,10 @@ Subagent (general-purpose):
- `[MODEL]` — REQUIRED: reviewer model per SKILL.md Model Selection - `[MODEL]` — REQUIRED: reviewer model per SKILL.md Model Selection
- `[BRIEF_FILE]` — REQUIRED: the task brief file (`scripts/task-brief PLAN N` - `[BRIEF_FILE]` — REQUIRED: the task brief file (`scripts/task-brief PLAN N`
prints the path; same file the implementer worked from) prints the path; same file the implementer worked from)
- `[GLOBAL_CONSTRAINTS]` — the spec/design's global constraints that bind - `[GLOBAL_CONSTRAINTS]` — the binding requirements copied verbatim from
this task (version floors, naming and copy rules, platform requirements) the plan's Global Constraints section or the spec: exact values, formats,
and stated relationships between components (not process rules — those
are already in this template)
- `[REPORT_FILE]` — REQUIRED: the file the implementer wrote its detailed - `[REPORT_FILE]` — REQUIRED: the file the implementer wrote its detailed
report to report to
- `[BASE_SHA]` — commit before this task - `[BASE_SHA]` — commit before this task

View File

@@ -33,6 +33,15 @@ Before defining tasks, map out which files will be created or modified and what
This structure informs the task decomposition. Each task should produce self-contained changes that make sense independently. This structure informs the task decomposition. Each task should produce self-contained changes that make sense independently.
## Task Right-Sizing
A task is the smallest unit that carries its own test cycle and is worth a
fresh reviewer's gate. When drawing task boundaries: fold setup,
configuration, scaffolding, and documentation steps into the task whose
deliverable needs them; split only where a reviewer could meaningfully
reject one task while approving its neighbor. Each task ends with an
independently testable deliverable.
## Bite-Sized Task Granularity ## Bite-Sized Task Granularity
**Each step is one action (2-5 minutes):** **Each step is one action (2-5 minutes):**
@@ -57,6 +66,13 @@ This structure informs the task decomposition. Each task should produce self-con
**Tech Stack:** [Key technologies/libraries] **Tech Stack:** [Key technologies/libraries]
## Global Constraints
[The spec's project-wide requirements — version floors, dependency limits,
naming and copy rules, platform requirements — one line each, with exact
values copied verbatim from the spec. Every task's requirements implicitly
include this section.]
--- ---
``` ```
@@ -70,6 +86,12 @@ This structure informs the task decomposition. Each task should produce self-con
- Modify: `exact/path/to/existing.py:123-145` - Modify: `exact/path/to/existing.py:123-145`
- Test: `tests/exact/path/to/test.py` - Test: `tests/exact/path/to/test.py`
**Interfaces:**
- Consumes: [what this task uses from earlier tasks — exact signatures]
- Produces: [what later tasks rely on — exact function names, parameter
and return types. A task's implementer sees only their own task; this
block is how they learn the names and types neighboring tasks use.]
- [ ] **Step 1: Write the failing test** - [ ] **Step 1: Write the failing test**
```python ```python