E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1

Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27
L2b: plan-mandated defects are findings the human adjudicates
2026-06-16 07:39:04 +08:00 · 2026-06-15 12:16:47 -07:00 · 2026-06-15 12:16:47 -07:00 · 2026-06-15 12:16:47 -07:00 · 2026-06-15 12:16:47 -07:00 · 2026-06-15 12:16:47 -07:00
3 changed files with 61 additions and 4 deletions
--- a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
+++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
@@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit
 escalation; they absorb implicit authority-vs-quality adjudication.
 A possible L2b (discrete rule: "a reviewer finding that conflicts with
 the plan's text is the human's decision — escalate it") would route the
-failing judgment through the escalation behavior that held; untested.
+failing judgment through the escalation behavior that held.
-Original recon notes follow.
+
 **L2b tested 2026-06-11 (E35/E36, evals
 `docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the
 opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer
 tripwire (a plan-mandated defect IS a finding — Important, labeled
 plan-mandated; the human decides) and a controller escalation rule
 (plan-mandated findings go to the human like any plan contradiction).
 Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings.
 Full battery: opus controllers 2/2 internalized the rule, caught their
 reviewer's miss as self-described backstop, and escalated for a
 sanctioned fix (the 4241 ad-hoc behavior made structural); escalation
 sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase
 drops the tripwire from dispatches (2/5 transmitted), transmission
 alone doesn't fire it live (read-once dilution across the reviewer's
 tool reads; placement within the dispatch refuted as the variable),
 and no sonnet controller showed backstop behavior; 1/5 shipped the
 defect. The L2b rules are a candidate commit for the opus stack.
 A future L2c for the sonnet rung would pair the SKILL.md
 constraints-recipe (the one channel sonnet transmits verbatim) with a
 mandatory output-format slot for plan-mandated findings (the skeleton
 survives every observed paraphrase and is consulted at composition
 time); untested. Original recon notes follow.
 **Recon (superseded):**
 Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
--- a/skills/subagent-driven-development/SKILL.md
+++ b/skills/subagent-driven-development/SKILL.md
@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review
 **Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
 **Narration:** between tool calls, narrate at most one short line — the
 ledger and the tool results carry the record.
 **Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
 ## When to Use
@@ -79,6 +82,20 @@ digraph process {
 }
 ```
 ## Pre-Flight Plan Review
 Before dispatching Task 1, scan the plan once for conflicts:
 - tasks that contradict each other or the plan's Global Constraints
 - anything the plan explicitly mandates that the review rubric treats as a
  defect (a test that asserts nothing, verbatim duplication of a logic block)
 Present everything you find to your human partner as one batched question —
 each finding beside the plan text that mandates it, asking which governs —
 before execution begins, not one interrupt per discovery mid-plan. If the
 scan is clean, proceed without comment. The review loop remains the net for
 conflicts that only emerge from implementation.
 ## Model Selection
 Use the least powerful model that can handle each role to conserve cost and increase speed.
@@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
 **Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
 **Architecture and design tasks**: use the most capable available model.
 The final whole-branch review is one of these — dispatch it on the most
 capable available model, not the session default.
 **Review tasks**: choose the model with the same judgment, scaled to the
 diff's size, complexity, and risk. A small mechanical diff does not need the
@@ -100,8 +119,10 @@ most expensive — which silently defeats this section.
 **Turn count beats token price.** Wall-clock and context cost scale with how
 many turns a subagent takes, and the cheapest models routinely take 2-3× the
 turns on multi-step work — costing more overall. Use a mid-tier model as the
-floor for implementers and reviewers; reserve the cheapest tier for
+floor for reviewers and for implementers working from prose descriptions.
-single-file mechanical fixes.
+When the task's plan text contains the complete code to write, the
 implementation is transcription plus testing: use the cheapest tier for
 that implementer. Single-file mechanical fixes also take the cheapest tier.
 **Task complexity signals (implementation tasks):**
 - Touches 1-2 files with a complete spec → cheap model
@@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template:
  findings in the progress ledger as you go, and point the final
  whole-branch review at that list so it can triage which must be fixed
  before merge. A roll-up nobody reads is a silent discard.
 - A finding labeled plan-mandated — or any finding that conflicts with
  what the plan's text requires — is the human's decision, like any plan
  contradiction: present the finding and the plan text, ask which governs.
  Do not dismiss the finding because the plan mandates it, and do not
  dispatch a fix that contradicts the plan without asking.
 - The final whole-branch review gets a package too: run
  `scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
  branch started from, e.g. `git merge-base main HEAD`) and include the
--- a/skills/subagent-driven-development/task-reviewer-prompt.md
+++ b/skills/subagent-driven-development/task-reviewer-prompt.md
@@ -115,6 +115,11 @@ Subagent (general-purpose):
    "yes." A tight report that cites lines gives the controller everything
    it needs.
    Your final message is the report itself: begin directly with the
    spec-compliance verdict. Every line is a verdict, a finding with
    file:line, or a check you ran — no preamble, no process narration,
    no closing summary.
    ## Calibration
    Categorize issues by actual severity. Not everything is Critical.
@@ -123,6 +128,11 @@ Subagent (general-purpose):
    would block a merge over — verbatim duplication of a logic block,
    swallowed errors, tests that assert nothing. "Coverage could be broader"
    and polish suggestions are Minor.
    If the plan or brief explicitly mandates something this rubric calls a
    defect (a test that asserts nothing, verbatim duplication of a logic
    block), that IS a finding — report it as Important, labeled
    plan-mandated. The plan's authorship does not grade its own work; the
    human decides.
    Acknowledge what was done well before listing issues — accurate praise
    helps the implementer trust the rest of the feedback.
Author	SHA1	Message	Date
Jesse Vincent	9b6be89aea	E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1	2026-06-15 12:16:47 -07:00
Jesse Vincent	eafa95b437	Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27	2026-06-15 12:16:47 -07:00
Jesse Vincent	228f2cb8a9	L2b: plan-mandated defects are findings the human adjudicates Reviewer tripwire (Calibration): a plan-mandated defect IS a finding, reported as Important and labeled plan-mandated — the plan's authorship does not grade its own work. Controller rule (review loop): a plan-mandated finding, or any finding conflicting with the plan's text, escalates to the human like any plan contradiction — never dismissed because the plan mandates it. E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the tripwire 0/6 reports give the controller anything to escalate on (all Approved, defect endorsed as spec-required); with it 6/6 report the defect as a labeled finding.	2026-06-15 12:16:47 -07:00
Jesse Vincent	e161df5a9b	E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract	2026-06-15 12:16:47 -07:00
Jesse Vincent	5e204f1128	E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis)	2026-06-15 12:16:47 -07:00
Jesse Vincent	b1eb92ea72	Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not	2026-06-15 12:15:06 -07:00
Jesse Vincent	6e9bbb7e3e	writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks Claims are fidelity and variance, not dollars (full attribution in the superpowers-evals experiment log, 2026-06-11 L1 entry): - Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact values verbatim; makes constraints mechanically propagatable to briefs and reviewers (a version-floor violation class shipped because they weren't). The one fix wave in the elicited full runs was a version-floor catch this header enabled. - Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures, within-plan consistent; removes the controller's per-dispatch interface re-derivation. - Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills standalone Types/README micro-tasks); no effect at small scale. - End-to-end (opus-written plan executed under SDD): guidance plan ran 1 fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug); execution cost equal within noise.	2026-06-15 12:15:06 -07:00
Jesse Vincent	fe938ac86c	Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules E30 replay: the planted-DRY catch is causally determined by the controller-composed constraints block (0/6 with process-shaped vs 5/6 with the spec's own wording). E31 micro: this recipe doubles the rate at which composed blocks carry the spec's cross-component relationship (6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).	2026-06-15 12:15:06 -07:00
Jesse Vincent	07dec9331f	Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance	2026-06-15 12:15:06 -07:00
Jesse Vincent	afcbf8bacb	Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed)	2026-06-15 12:15:06 -07:00
Jesse Vincent	fa14c8d671	Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead)	2026-06-15 12:15:06 -07:00
Jesse Vincent	8b76932337	Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first	2026-06-15 12:15:06 -07:00
Jesse Vincent	0702ec2c6f	Record writing-plans micro-test result: resolved, no change needed	2026-06-15 12:15:06 -07:00
Jesse Vincent	85a9324a53	Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges)	2026-06-15 12:15:06 -07:00
Jesse Vincent	610b09874e	Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist	2026-06-15 12:15:06 -07:00
Jesse Vincent	6df501ea5d	Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31 (baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline 79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline 3/3 runs after moving model: into the templates as a REQUIRED line. Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md	2026-06-15 12:15:06 -07:00
Jesse Vincent	1585f40c8e	Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants	2026-06-15 12:15:06 -07:00
Jesse Vincent	60c0b744b4	Shared: unique review-package collateral names	2026-06-15 12:15:06 -07:00
Jesse Vincent	6b3e4ad407	Add review-package script; close fix-dispatch test gap scripts/review-package generates the reviewer's input deterministically: commit list, stat summary, and net diff with -U10 context, written to a file from an explicit BASE. Live runs showed controllers improvising 'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks, and svelte's five fix dispatches shipped without re-running any tests — fix dispatches now explicitly carry the implementer's re-run-and-report contract.	2026-06-15 12:15:06 -07:00
Jesse Vincent	a84bb0f52b	Describe the review design as current state, not as a delta The skill read as a changelog: 'combined task review,' 'one reviewer, one reading,' 'one dispatch,' and an example still showing diffs pasted into prompts. A reader who never saw the two-reviewer design has no referent for 'combined.' Prose now states the design directly, and the flowchart/example reflect the diff-file handoff.	2026-06-15 12:15:06 -07:00
Jesse Vincent	69d396a676	Spec: record iterations 2-3 results and final frozen-config matrix	2026-06-15 12:15:06 -07:00
Jesse Vincent	e4457c970e	Hand reviewers the diff as a file, not a paste Paste adoption stayed at 0/15 even as a Red Flag — and the controller's reluctance is locally rational: pasting loads the diff into the (most expensive) controller context permanently, while a reviewer self-fetch costs a few cheap turns. The diff-file handoff is cheap for both sides: the controller redirects git diff to /tmp without reading it, and the reviewer gets the whole change in one Read call.	2026-06-15 12:15:06 -07:00
Jesse Vincent	fac5888846	Reviewer skepticism covers the implementer's design rationales Fourth planted-defect failure mode: the implementer's self-report said 'noted mild structural duplication; left unabstracted per YAGNI' and the reviewer deferred to that framing, rating the duplication no finding at all. The pre-judging keeps relocating — controller prompt, then reviewer calibration, now the implementer's report. Rationales are claims; they never downgrade severity.	2026-06-15 12:15:06 -07:00
Jesse Vincent	8ac14c0450	Make diff-pasting non-optional for task reviewer dispatch Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased as guidance; reviewers without the diff re-derive it by hand, which is the single largest remaining reviewer cost. Now a Red Flags Never entry and a REQUIRED marker on the template placeholder.	2026-06-15 12:15:06 -07:00
Jesse Vincent	3ed554d557	Close the Minor-severity escape hatch With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.	2026-06-15 12:15:06 -07:00
Jesse Vincent	4e8edca36e	Spec: document cost iterations and the per-task review consolidation	2026-06-15 12:15:06 -07:00
Jesse Vincent	d7726d99dc	Merge per-task reviews into one task reviewer (iteration 2) Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.	2026-06-15 12:15:06 -07:00
Jesse Vincent	4c1f1e5cc5	Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats- token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust- until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.	2026-06-15 12:15:06 -07:00
Jesse Vincent	7288393773	Add phrase-level pre-judging triggers to reviewer prompt rule Resumed the offending eval controller session and asked it why it pre-judged despite the rule being in context. Its retrospective: the motive was avoiding a review loop, the abstract rule was read but not applied at the moment it governs, and a phrase-level trigger ('do not flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose') would have fired where the principle did not.	2026-06-15 12:15:06 -07:00
Jesse Vincent	254a8e2e32	Red Flags: never tell a reviewer what not to flag or pre-rate severity Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.	2026-06-15 12:15:06 -07:00
Jesse Vincent	7c11cee649	Close three review blind spots found by defect tracing Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed: - reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence) - no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements) - test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)	2026-06-15 12:15:06 -07:00
Jesse Vincent	b36cf86afd	Require explicit model on subagent dispatch In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).	2026-06-15 12:15:06 -07:00
Jesse Vincent	06bec17a34	Forbid controllers pre-judging reviewer findings A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.	2026-06-15 12:15:06 -07:00
Jesse Vincent	236524413b	Sync plan: escaped pre() pattern in Task 5 checks block	2026-06-15 12:15:06 -07:00
Jesse Vincent	6e019e0316	Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block	2026-06-15 12:15:06 -07:00
Jesse Vincent	d4bb8d268f	Sync plan's Task 5 blocks with review fixes	2026-06-15 12:15:06 -07:00
Jesse Vincent	d519ba65fd	SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment	2026-06-15 12:15:06 -07:00
Jesse Vincent	d32a56dc32	Implementer prompt: re-run covering tests after fixing review findings	2026-06-15 12:15:06 -07:00
Jesse Vincent	994bc26d2a	Scope spec reviewer's Your Job wording to the diff	2026-06-15 12:15:06 -07:00
Jesse Vincent	d5850df1bc	Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel	2026-06-15 12:15:06 -07:00
Jesse Vincent	b5edd40d2c	Use bare placeholder names in quality reviewer prompt body	2026-06-15 12:15:06 -07:00
Jesse Vincent	6a02446953	Make per-task quality reviewer prompt self-contained and task-scoped	2026-06-15 12:15:06 -07:00
Jesse Vincent	042d238b26	Add implementation plan for task-scoped review dispatch	2026-06-15 12:15:06 -07:00
Jesse Vincent	cf81ad2ac3	Harden review-dispatch spec per adversarial review findings	2026-06-15 12:15:06 -07:00
Jesse Vincent	cb0dbeb095	Add design spec: task-scoped review dispatch for SDD	2026-06-15 12:15:06 -07:00