Bump evals submodule: E29-E34 quality investigation + L2 gate results (af05326)

Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not
Bump evals submodule: L1 elicitation + autoresearch scenarios and logs (649b1f8)
2026-06-16 23:59:05 +08:00 · 2026-06-11 13:17:09 -07:00 · 2026-06-11 13:11:32 -07:00 · 2026-06-11 11:37:41 -07:00 · 2026-06-11 11:35:53 -07:00 · 2026-06-11 11:35:53 -07:00
5 changed files with 68 additions and 14 deletions
--- a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
+++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md
@@ -65,13 +65,21 @@ fewer, better-sized tasks, SDD still runs one fresh subagent per task.

 ### L1 — Plan-side crispness (writing-plans changes; est. −$1.5-3/run, plus variance reduction)

-**Status 2026-06-11: validated in effect.** A hand-crisped fractals plan
-(10 → 7 tasks, `## Global Constraints` header, per-task `Interfaces:`
-lines — scenario `sdd-go-fractals-crisp`) ran 3/3 green at $9.51-12.65
-(mean $11.60 vs combo band $11.67-14.84), 20-24 dispatches vs 28, fix
-waves flat. What remains is elicitation: getting writing-plans guidance
-to *produce* such plans (micro-test per the doctrine, then the follow-up
-PR). See the experiments log, Batch A-E.
+**Status 2026-06-11 (final): elicitation tested end-to-end; claims
+re-attributed.** Micro-tests: constraints header and Interfaces blocks
+elicit deterministically (0→5/5, 0→100% of tasks, exact values);
+right-sizing is modest and scale-dependent (9.4→8.4 tasks at svelte
+scale, nothing to move at fractals scale). Full runs: an elicited plan
+executed at $6.34/$8.49 — but the no-guidance control (opus plan,
+complete code) hit $7.59/$7.73, inside that range. **The cost win
+belongs to opus-written complete-code plans; the hand-written prose
+fixture plans all prior numbers used are unrepresentative and ~2×
+costlier to execute.** The guidance owns fidelity and variance instead:
+deterministic constraints propagation (the one elicited-run fix was a
+version-floor catch), exact cross-task interfaces, fix waves 1 vs 2-4
+(the control plan shipped a real Sierpinski bug both runs had to fix).
+The writing-plans PR claims those grounds, not dollars. Draft at
+/tmp/sdd-exp/writing-plans-l1 (branch writing-plans-crisp).

 The plan is upstream of every cost: task count sets dispatch count; plan
 ambiguity sets review-loop count; plan completeness sets implementer
@@ -110,7 +118,25 @@ target economics and ambiguity, not placeholder hygiene.

 ### L2 — Controller tier (est. −$4-5/run; the biggest single lever, gated hardest)

-**Status 2026-06-11: recon positive (n=2), gates still owed.**
+**Status 2026-06-11 (final): DIED AT THE GATES, as pre-registered — with
+useful anatomy.** Recon was positive ($6.68/$8.05, n=2, mechanics clean).
+The full battery split the judgment surface: the new
+`sdd-escalates-broken-plan` scenario (explicit plan self-contradiction;
+the human never volunteers it) passed **5/5 at sonnet** ($1.02-1.37/run;
+opus baseline 2/2) — explicit conflicts get escalated. But the
+planted-defect battery failed decisively: under a sonnet controller the
+per-task quality gate collapsed into plan-compliance advocacy ("no
+assertion, as required" listed under Strengths), the defect shipped in
+4/5 runs (deterministic check), and only the tier-pinned opus final
+reviewer ever caught it — while the same sonnet-tier reviewers under an
+opus controller flagged it 5/5. Cheap controllers handle explicit
+escalation; they absorb implicit authority-vs-quality adjudication.
+A possible L2b (discrete rule: "a reviewer finding that conflicts with
+the plan's text is the human's decision — escalate it") would route the
+failing judgment through the escalation behavior that held; untested.
+Original recon notes follow.
+
+**Recon (superseded):**
 Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
 **$6.68 and $8.05** / 31-41 min (combo band $11.67-14.84), tokens inside
 the combo band — no cheap-controller turn inflation. 26/26 and 31/31
--- a/2
+++ b/2
--- a/skills/subagent-driven-development/SKILL.md
+++ b/skills/subagent-driven-development/SKILL.md
@@ -150,9 +150,13 @@ final whole-branch review. When you fill a reviewer template:
  loop. If the prompt you are writing contains "do not flag," "don't treat X
  as a defect," "at most Minor," or "the plan chose" — stop: you are
  pre-judging, usually to spare yourself a review loop.
- Include the spec/design's global constraints that bind the task (version
-  floors, naming and copy rules, platform requirements) in the requirements
-  you paste — a reviewer can only enforce what you hand them.
+- The global-constraints block you hand the reviewer is its attention
+  lens. Copy the binding requirements verbatim from the plan's Global
+  Constraints section or the spec: exact values, exact formats, and the
+  stated relationships between components ("same layout as X", "matches
+  Y"). The reviewer's template already carries the process rules (YAGNI,
+  test hygiene, review method) — the constraints block is for what THIS
+  project's spec demands.
 - Hand the reviewer its diff as a file: run this skill's
  `scripts/review-package BASE HEAD` and pass the reviewer the file path
  it prints (or, without bash: `git log --oneline`, `git diff --stat`,
--- a/skills/subagent-driven-development/task-reviewer-prompt.md
+++ b/skills/subagent-driven-development/task-reviewer-prompt.md
@@ -159,8 +159,10 @@ Subagent (general-purpose):
 - `[MODEL]` — REQUIRED: reviewer model per SKILL.md Model Selection
 - `[BRIEF_FILE]` — REQUIRED: the task brief file (`scripts/task-brief PLAN N`
  prints the path; same file the implementer worked from)
- `[GLOBAL_CONSTRAINTS]` — the spec/design's global constraints that bind
-  this task (version floors, naming and copy rules, platform requirements)
+- `[GLOBAL_CONSTRAINTS]` — the binding requirements copied verbatim from
+  the plan's Global Constraints section or the spec: exact values, formats,
+  and stated relationships between components (not process rules — those
+  are already in this template)
 - `[REPORT_FILE]` — REQUIRED: the file the implementer wrote its detailed
  report to
 - `[BASE_SHA]` — commit before this task
--- a/skills/writing-plans/SKILL.md
+++ b/skills/writing-plans/SKILL.md
@@ -33,6 +33,15 @@ Before defining tasks, map out which files will be created or modified and what

 This structure informs the task decomposition. Each task should produce self-contained changes that make sense independently.

+## Task Right-Sizing
+
+A task is the smallest unit that carries its own test cycle and is worth a
+fresh reviewer's gate. When drawing task boundaries: fold setup,
+configuration, scaffolding, and documentation steps into the task whose
+deliverable needs them; split only where a reviewer could meaningfully
+reject one task while approving its neighbor. Each task ends with an
+independently testable deliverable.
+
 ## Bite-Sized Task Granularity

 **Each step is one action (2-5 minutes):**
@@ -57,6 +66,13 @@ This structure informs the task decomposition. Each task should produce self-con

 **Tech Stack:** [Key technologies/libraries]

+## Global Constraints
+
+[The spec's project-wide requirements — version floors, dependency limits,
+naming and copy rules, platform requirements — one line each, with exact
+values copied verbatim from the spec. Every task's requirements implicitly
+include this section.]
+
 ---
 ```

@@ -70,6 +86,12 @@ This structure informs the task decomposition. Each task should produce self-con
 - Modify: `exact/path/to/existing.py:123-145`
 - Test: `tests/exact/path/to/test.py`

+**Interfaces:**
+- Consumes: [what this task uses from earlier tasks — exact signatures]
+- Produces: [what later tasks rely on — exact function names, parameter
+  and return types. A task's implementer sees only their own task; this
+  block is how they learn the names and types neighboring tasks use.]
+
 - [ ] **Step 1: Write the failing test**

 ```python
Author	SHA1	Message	Date
Jesse Vincent	420c234a2c	Bump evals submodule: E29-E34 quality investigation + L2 gate results (af05326)	2026-06-11 13:17:09 -07:00
Jesse Vincent	d1fcc9889a	Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not	2026-06-11 13:11:32 -07:00
Jesse Vincent	ac11700642	Bump evals submodule: L1 elicitation + autoresearch scenarios and logs (649b1f8)	2026-06-11 11:37:41 -07:00
Jesse Vincent	710f031ad0	writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks Claims are fidelity and variance, not dollars (full attribution in the superpowers-evals experiment log, 2026-06-11 L1 entry): - Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact values verbatim; makes constraints mechanically propagatable to briefs and reviewers (a version-floor violation class shipped because they weren't). The one fix wave in the elicited full runs was a version-floor catch this header enabled. - Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures, within-plan consistent; removes the controller's per-dispatch interface re-derivation. - Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills standalone Types/README micro-tasks); no effect at small scale. - End-to-end (opus-written plan executed under SDD): guidance plan ran 1 fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug); execution cost equal within noise.	2026-06-11 11:35:53 -07:00
Jesse Vincent	72cb21b82c	Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules E30 replay: the planted-DRY catch is causally determined by the controller-composed constraints block (0/6 with process-shaped vs 5/6 with the spec's own wording). E31 micro: this recipe doubles the rate at which composed blocks carry the spec's cross-component relationship (6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).	2026-06-11 11:35:53 -07:00
Jesse Vincent	de1d35e5e7	Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance	2026-06-10 21:44:23 -07:00