Bump evals submodule: escalation-aware planted-defect scenario (7dc0be7)

E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1
Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27
2026-06-12 21:59:04 +08:00 · 2026-06-11 15:39:04 -07:00 · 2026-06-11 15:31:01 -07:00 · 2026-06-11 14:53:31 -07:00 · 2026-06-11 13:41:21 -07:00 · 2026-06-11 13:17:10 -07:00
4 changed files with 62 additions and 5 deletions
--- a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
+++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
@@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit
 escalation; they absorb implicit authority-vs-quality adjudication.
 A possible L2b (discrete rule: "a reviewer finding that conflicts with
 the plan's text is the human's decision — escalate it") would route the
-failing judgment through the escalation behavior that held; untested.
-Original recon notes follow.
+failing judgment through the escalation behavior that held.
+
+**L2b tested 2026-06-11 (E35/E36, evals
+`docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the
+opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer
+tripwire (a plan-mandated defect IS a finding — Important, labeled
+plan-mandated; the human decides) and a controller escalation rule
+(plan-mandated findings go to the human like any plan contradiction).
+Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings.
+Full battery: opus controllers 2/2 internalized the rule, caught their
+reviewer's miss as self-described backstop, and escalated for a
+sanctioned fix (the 4241 ad-hoc behavior made structural); escalation
+sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase
+drops the tripwire from dispatches (2/5 transmitted), transmission
+alone doesn't fire it live (read-once dilution across the reviewer's
+tool reads; placement within the dispatch refuted as the variable),
+and no sonnet controller showed backstop behavior; 1/5 shipped the
+defect. The L2b rules are a candidate commit for the opus stack.
+A future L2c for the sonnet rung would pair the SKILL.md
+constraints-recipe (the one channel sonnet transmits verbatim) with a
+mandatory output-format slot for plan-mandated findings (the skeleton
+survives every observed paraphrase and is consulted at composition
+time); untested. Original recon notes follow.

 **Recon (superseded):**
 Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
--- a/2
+++ b/2
--- a/skills/subagent-driven-development/SKILL.md
+++ b/skills/subagent-driven-development/SKILL.md
@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review

 **Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration

+**Narration:** between tool calls, narrate at most one short line — the
+ledger and the tool results carry the record.
+
 **Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.

 ## When to Use
@@ -79,6 +82,20 @@ digraph process {
 }
 ```

+## Pre-Flight Plan Review
+
+Before dispatching Task 1, scan the plan once for conflicts:
+
+- tasks that contradict each other or the plan's Global Constraints
+- anything the plan explicitly mandates that the review rubric treats as a
+  defect (a test that asserts nothing, verbatim duplication of a logic block)
+
+Present everything you find to your human partner as one batched question —
+each finding beside the plan text that mandates it, asking which governs —
+before execution begins, not one interrupt per discovery mid-plan. If the
+scan is clean, proceed without comment. The review loop remains the net for
+conflicts that only emerge from implementation.
+
 ## Model Selection

 Use the least powerful model that can handle each role to conserve cost and increase speed.
@@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
 **Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.

 **Architecture and design tasks**: use the most capable available model.
+The final whole-branch review is one of these — dispatch it on the most
+capable available model, not the session default.

 **Review tasks**: choose the model with the same judgment, scaled to the
 diff's size, complexity, and risk. A small mechanical diff does not need the
@@ -100,8 +119,10 @@ most expensive — which silently defeats this section.
 **Turn count beats token price.** Wall-clock and context cost scale with how
 many turns a subagent takes, and the cheapest models routinely take 2-3× the
 turns on multi-step work — costing more overall. Use a mid-tier model as the
-floor for implementers and reviewers; reserve the cheapest tier for
-single-file mechanical fixes.
+floor for reviewers and for implementers working from prose descriptions.
+When the task's plan text contains the complete code to write, the
+implementation is transcription plus testing: use the cheapest tier for
+that implementer. Single-file mechanical fixes also take the cheapest tier.

 **Task complexity signals (implementation tasks):**
 - Touches 1-2 files with a complete spec → cheap model
@@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template:
  findings in the progress ledger as you go, and point the final
  whole-branch review at that list so it can triage which must be fixed
  before merge. A roll-up nobody reads is a silent discard.
+- A finding labeled plan-mandated — or any finding that conflicts with
+  what the plan's text requires — is the human's decision, like any plan
+  contradiction: present the finding and the plan text, ask which governs.
+  Do not dismiss the finding because the plan mandates it, and do not
+  dispatch a fix that contradicts the plan without asking.
 - The final whole-branch review gets a package too: run
  `scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
  branch started from, e.g. `git merge-base main HEAD`) and include the
--- a/skills/subagent-driven-development/task-reviewer-prompt.md
+++ b/skills/subagent-driven-development/task-reviewer-prompt.md
@@ -115,6 +115,11 @@ Subagent (general-purpose):
    "yes." A tight report that cites lines gives the controller everything
    it needs.

+    Your final message is the report itself: begin directly with the
+    spec-compliance verdict. Every line is a verdict, a finding with
+    file:line, or a check you ran — no preamble, no process narration,
+    no closing summary.
+
    ## Calibration

    Categorize issues by actual severity. Not everything is Critical.
@@ -123,6 +128,11 @@ Subagent (general-purpose):
    would block a merge over — verbatim duplication of a logic block,
    swallowed errors, tests that assert nothing. "Coverage could be broader"
    and polish suggestions are Minor.
+    If the plan or brief explicitly mandates something this rubric calls a
+    defect (a test that asserts nothing, verbatim duplication of a logic
+    block), that IS a finding — report it as Important, labeled
+    plan-mandated. The plan's authorship does not grade its own work; the
+    human decides.
    Acknowledge what was done well before listing issues — accurate praise
    helps the implementer trust the rest of the feedback.
Author	SHA1	Message	Date
Jesse Vincent	08447d0d10	Bump evals submodule: escalation-aware planted-defect scenario (7dc0be7)	2026-06-11 15:39:04 -07:00
Jesse Vincent	c87f336cc0	E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1	2026-06-11 15:31:01 -07:00
Jesse Vincent	a2a4190809	Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27	2026-06-11 14:53:31 -07:00
Jesse Vincent	8d354bb36b	L2b: plan-mandated defects are findings the human adjudicates Reviewer tripwire (Calibration): a plan-mandated defect IS a finding, reported as Important and labeled plan-mandated — the plan's authorship does not grade its own work. Controller rule (review loop): a plan-mandated finding, or any finding conflicting with the plan's text, escalates to the human like any plan contradiction — never dismissed because the plan mandates it. E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the tripwire 0/6 reports give the controller anything to escalate on (all Approved, defect endorsed as spec-required); with it 6/6 report the defect as a labeled finding.	2026-06-11 13:41:21 -07:00
Jesse Vincent	35464d67c0	E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract	2026-06-11 13:17:10 -07:00
Jesse Vincent	90b5433f59	E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis)	2026-06-11 13:17:10 -07:00