Compare commits

..

45 Commits

Author SHA1 Message Date
Jesse Vincent
9b6be89aea E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1 2026-06-15 12:16:47 -07:00
Jesse Vincent
eafa95b437 Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27 2026-06-15 12:16:47 -07:00
Jesse Vincent
228f2cb8a9 L2b: plan-mandated defects are findings the human adjudicates
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding,
reported as Important and labeled plan-mandated — the plan's authorship
does not grade its own work.

Controller rule (review loop): a plan-mandated finding, or any finding
conflicting with the plan's text, escalates to the human like any plan
contradiction — never dismissed because the plan mandates it.

E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the
tripwire 0/6 reports give the controller anything to escalate on (all
Approved, defect endorsed as spec-required); with it 6/6 report the
defect as a labeled finding.
2026-06-15 12:16:47 -07:00
Jesse Vincent
e161df5a9b E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract 2026-06-15 12:16:47 -07:00
Jesse Vincent
5e204f1128 E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-15 12:16:47 -07:00
Jesse Vincent
b1eb92ea72 Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not 2026-06-15 12:15:06 -07:00
Jesse Vincent
6e9bbb7e3e writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks
Claims are fidelity and variance, not dollars (full attribution in the
superpowers-evals experiment log, 2026-06-11 L1 entry):
- Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact
  values verbatim; makes constraints mechanically propagatable to briefs
  and reviewers (a version-floor violation class shipped because they
  weren't). The one fix wave in the elicited full runs was a version-floor
  catch this header enabled.
- Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures,
  within-plan consistent; removes the controller's per-dispatch interface
  re-derivation.
- Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills
  standalone Types/README micro-tasks); no effect at small scale.
- End-to-end (opus-written plan executed under SDD): guidance plan ran 1
  fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug);
  execution cost equal within noise.
2026-06-15 12:15:06 -07:00
Jesse Vincent
fe938ac86c Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules
E30 replay: the planted-DRY catch is causally determined by the
controller-composed constraints block (0/6 with process-shaped vs 5/6
with the spec's own wording). E31 micro: this recipe doubles the rate
at which composed blocks carry the spec's cross-component relationship
(6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).
2026-06-15 12:15:06 -07:00
Jesse Vincent
07dec9331f Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance 2026-06-15 12:15:06 -07:00
Jesse Vincent
afcbf8bacb Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed) 2026-06-15 12:15:06 -07:00
Jesse Vincent
fa14c8d671 Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead) 2026-06-15 12:15:06 -07:00
Jesse Vincent
8b76932337 Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first 2026-06-15 12:15:06 -07:00
Jesse Vincent
0702ec2c6f Record writing-plans micro-test result: resolved, no change needed 2026-06-15 12:15:06 -07:00
Jesse Vincent
85a9324a53 Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges) 2026-06-15 12:15:06 -07:00
Jesse Vincent
610b09874e Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-15 12:15:06 -07:00
Jesse Vincent
6df501ea5d Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-15 12:15:06 -07:00
Jesse Vincent
1585f40c8e Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants 2026-06-15 12:15:06 -07:00
Jesse Vincent
60c0b744b4 Shared: unique review-package collateral names 2026-06-15 12:15:06 -07:00
Jesse Vincent
6b3e4ad407 Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-15 12:15:06 -07:00
Jesse Vincent
a84bb0f52b Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-15 12:15:06 -07:00
Jesse Vincent
69d396a676 Spec: record iterations 2-3 results and final frozen-config matrix 2026-06-15 12:15:06 -07:00
Jesse Vincent
e4457c970e Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-15 12:15:06 -07:00
Jesse Vincent
fac5888846 Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-15 12:15:06 -07:00
Jesse Vincent
8ac14c0450 Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-15 12:15:06 -07:00
Jesse Vincent
3ed554d557 Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-15 12:15:06 -07:00
Jesse Vincent
4e8edca36e Spec: document cost iterations and the per-task review consolidation 2026-06-15 12:15:06 -07:00
Jesse Vincent
d7726d99dc Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance //⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-15 12:15:06 -07:00
Jesse Vincent
4c1f1e5cc5 Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-15 12:15:06 -07:00
Jesse Vincent
7288393773 Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-15 12:15:06 -07:00
Jesse Vincent
254a8e2e32 Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-15 12:15:06 -07:00
Jesse Vincent
7c11cee649 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-15 12:15:06 -07:00
Jesse Vincent
b36cf86afd Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-15 12:15:06 -07:00
Jesse Vincent
06bec17a34 Forbid controllers pre-judging reviewer findings
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
2026-06-15 12:15:06 -07:00
Jesse Vincent
236524413b Sync plan: escaped pre() pattern in Task 5 checks block 2026-06-15 12:15:06 -07:00
Jesse Vincent
6e019e0316 Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block 2026-06-15 12:15:06 -07:00
Jesse Vincent
d4bb8d268f Sync plan's Task 5 blocks with review fixes 2026-06-15 12:15:06 -07:00
Jesse Vincent
d519ba65fd SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment 2026-06-15 12:15:06 -07:00
Jesse Vincent
d32a56dc32 Implementer prompt: re-run covering tests after fixing review findings 2026-06-15 12:15:06 -07:00
Jesse Vincent
994bc26d2a Scope spec reviewer's Your Job wording to the diff 2026-06-15 12:15:06 -07:00
Jesse Vincent
d5850df1bc Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel 2026-06-15 12:15:06 -07:00
Jesse Vincent
b5edd40d2c Use bare placeholder names in quality reviewer prompt body 2026-06-15 12:15:06 -07:00
Jesse Vincent
6a02446953 Make per-task quality reviewer prompt self-contained and task-scoped 2026-06-15 12:15:06 -07:00
Jesse Vincent
042d238b26 Add implementation plan for task-scoped review dispatch 2026-06-15 12:15:06 -07:00
Jesse Vincent
cf81ad2ac3 Harden review-dispatch spec per adversarial review findings 2026-06-15 12:15:06 -07:00
Jesse Vincent
cb0dbeb095 Add design spec: task-scoped review dispatch for SDD 2026-06-15 12:15:06 -07:00
3 changed files with 61 additions and 4 deletions

View File

@@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit
escalation; they absorb implicit authority-vs-quality adjudication. escalation; they absorb implicit authority-vs-quality adjudication.
A possible L2b (discrete rule: "a reviewer finding that conflicts with A possible L2b (discrete rule: "a reviewer finding that conflicts with
the plan's text is the human's decision — escalate it") would route the the plan's text is the human's decision — escalate it") would route the
failing judgment through the escalation behavior that held; untested. failing judgment through the escalation behavior that held.
Original recon notes follow.
**L2b tested 2026-06-11 (E35/E36, evals
`docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the
opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer
tripwire (a plan-mandated defect IS a finding — Important, labeled
plan-mandated; the human decides) and a controller escalation rule
(plan-mandated findings go to the human like any plan contradiction).
Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings.
Full battery: opus controllers 2/2 internalized the rule, caught their
reviewer's miss as self-described backstop, and escalated for a
sanctioned fix (the 4241 ad-hoc behavior made structural); escalation
sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase
drops the tripwire from dispatches (2/5 transmitted), transmission
alone doesn't fire it live (read-once dilution across the reviewer's
tool reads; placement within the dispatch refuted as the variable),
and no sonnet controller showed backstop behavior; 1/5 shipped the
defect. The L2b rules are a candidate commit for the opus stack.
A future L2c for the sonnet rung would pair the SKILL.md
constraints-recipe (the one channel sonnet transmits verbatim) with a
mandatory output-format slot for plan-mandated findings (the skeleton
survives every observed paraphrase and is consulted at composition
time); untested. Original recon notes follow.
**Recon (superseded):** **Recon (superseded):**
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at Sonnet-controller runs (claude-sonnet coding-agent): all gates green at

View File

@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review
**Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration **Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
**Narration:** between tool calls, narrate at most one short line — the
ledger and the tool results carry the record.
**Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it. **Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
## When to Use ## When to Use
@@ -79,6 +82,20 @@ digraph process {
} }
``` ```
## Pre-Flight Plan Review
Before dispatching Task 1, scan the plan once for conflicts:
- tasks that contradict each other or the plan's Global Constraints
- anything the plan explicitly mandates that the review rubric treats as a
defect (a test that asserts nothing, verbatim duplication of a logic block)
Present everything you find to your human partner as one batched question —
each finding beside the plan text that mandates it, asking which governs —
before execution begins, not one interrupt per discovery mid-plan. If the
scan is clean, proceed without comment. The review loop remains the net for
conflicts that only emerge from implementation.
## Model Selection ## Model Selection
Use the least powerful model that can handle each role to conserve cost and increase speed. Use the least powerful model that can handle each role to conserve cost and increase speed.
@@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
**Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model. **Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
**Architecture and design tasks**: use the most capable available model. **Architecture and design tasks**: use the most capable available model.
The final whole-branch review is one of these — dispatch it on the most
capable available model, not the session default.
**Review tasks**: choose the model with the same judgment, scaled to the **Review tasks**: choose the model with the same judgment, scaled to the
diff's size, complexity, and risk. A small mechanical diff does not need the diff's size, complexity, and risk. A small mechanical diff does not need the
@@ -100,8 +119,10 @@ most expensive — which silently defeats this section.
**Turn count beats token price.** Wall-clock and context cost scale with how **Turn count beats token price.** Wall-clock and context cost scale with how
many turns a subagent takes, and the cheapest models routinely take 2-3× the many turns a subagent takes, and the cheapest models routinely take 2-3× the
turns on multi-step work — costing more overall. Use a mid-tier model as the turns on multi-step work — costing more overall. Use a mid-tier model as the
floor for implementers and reviewers; reserve the cheapest tier for floor for reviewers and for implementers working from prose descriptions.
single-file mechanical fixes. When the task's plan text contains the complete code to write, the
implementation is transcription plus testing: use the cheapest tier for
that implementer. Single-file mechanical fixes also take the cheapest tier.
**Task complexity signals (implementation tasks):** **Task complexity signals (implementation tasks):**
- Touches 1-2 files with a complete spec → cheap model - Touches 1-2 files with a complete spec → cheap model
@@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template:
findings in the progress ledger as you go, and point the final findings in the progress ledger as you go, and point the final
whole-branch review at that list so it can triage which must be fixed whole-branch review at that list so it can triage which must be fixed
before merge. A roll-up nobody reads is a silent discard. before merge. A roll-up nobody reads is a silent discard.
- A finding labeled plan-mandated — or any finding that conflicts with
what the plan's text requires — is the human's decision, like any plan
contradiction: present the finding and the plan text, ask which governs.
Do not dismiss the finding because the plan mandates it, and do not
dispatch a fix that contradicts the plan without asking.
- The final whole-branch review gets a package too: run - The final whole-branch review gets a package too: run
`scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the `scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
branch started from, e.g. `git merge-base main HEAD`) and include the branch started from, e.g. `git merge-base main HEAD`) and include the

View File

@@ -115,6 +115,11 @@ Subagent (general-purpose):
"yes." A tight report that cites lines gives the controller everything "yes." A tight report that cites lines gives the controller everything
it needs. it needs.
Your final message is the report itself: begin directly with the
spec-compliance verdict. Every line is a verdict, a finding with
file:line, or a check you ran — no preamble, no process narration,
no closing summary.
## Calibration ## Calibration
Categorize issues by actual severity. Not everything is Critical. Categorize issues by actual severity. Not everything is Critical.
@@ -123,6 +128,11 @@ Subagent (general-purpose):
would block a merge over — verbatim duplication of a logic block, would block a merge over — verbatim duplication of a logic block,
swallowed errors, tests that assert nothing. "Coverage could be broader" swallowed errors, tests that assert nothing. "Coverage could be broader"
and polish suggestions are Minor. and polish suggestions are Minor.
If the plan or brief explicitly mandates something this rubric calls a
defect (a test that asserts nothing, verbatim duplication of a logic
block), that IS a finding — report it as Important, labeled
plan-mandated. The plan's authorship does not grade its own work; the
human decides.
Acknowledge what was done well before listing issues — accurate praise Acknowledge what was done well before listing issues — accurate praise
helps the implementer trust the rest of the feedback. helps the implementer trust the rest of the feedback.