Add implementation plan for task-scoped review dispatch

2026-07-27 12:44:01 +08:00 · 2026-06-09 16:42:50 -07:00
parent 4192572d19
commit f8dcd1ed3d
1 changed files with 772 additions and 0 deletions
--- a/docs/superpowers/plans/2026-06-09-sdd-task-scoped-review-dispatch.md
+++ b/docs/superpowers/plans/2026-06-09-sdd-task-scoped-review-dispatch.md
@@ -0,0 +1,772 @@
+# SDD Task-Scoped Review Dispatch Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Scope SDD's per-task reviews to the task (diff-first reading, justified broadening, no redundant test runs) while final branch review stays broad.
+
+**Architecture:** Four prose edits to the subagent-driven-development skill (the per-task quality prompt becomes self-contained instead of delegating to the merge-readiness template; the spec prompt gets a third verdict channel and grounded skepticism; the implementer prompt gains a re-run-after-fix rule; SKILL.md gets controller guidance) plus one new eval scenario in the `evals/` submodule. `skills/requesting-code-review/` is deliberately untouched.
+
+**Tech Stack:** Markdown skill files; Python setup helper + bash checks + story.md for the quorum eval.
+
+**Spec:** `docs/superpowers/specs/2026-06-09-sdd-task-scoped-review-dispatch-design.md` — read it before starting. Decisions already settled there: full re-reviews stay; the two review stages stay separate; coordinator keeps model judgment; `requesting-code-review/` stays broad.
+
+**These are behavior-shaping prose files, not code.** There are no unit tests for them. Each task's verification steps are exact `grep` checks that the edit landed; behavioral verification is Task 6 (static) and Task 7 (live evals, maintainer-gated).
+
+---
+
+### Task 1: Rewrite the per-task quality reviewer prompt as self-contained
+
+The current file delegates to `../requesting-code-review/code-reviewer.md`, which is a merge-readiness review (architecture, security, production readiness, "Ready to merge?"). Replace the entire file with a self-contained, task-scoped template.
+
+**Files:**
+- Rewrite: `skills/subagent-driven-development/code-quality-reviewer-prompt.md`
+
+- [ ] **Step 1: Replace the full file contents with:**
+
+````markdown
+# Code Quality Reviewer Prompt Template
+
+Use this template when dispatching a code quality reviewer subagent.
+
+**Purpose:** Verify one task's implementation is well-built (clean, tested, maintainable)
+
+**Only dispatch after spec compliance review passes.**
+
+```
+Subagent (general-purpose):
+  description: "Review code quality for Task N"
+  prompt: |
+    You are reviewing one task's implementation for code quality. This is a
+    task-scoped gate, not a merge review — a broad whole-branch review happens
+    separately after all tasks are complete.
+
+    ## What Was Implemented
+
+    [DESCRIPTION — task summary, from implementer's report]
+
+    ## Task Requirements (context only)
+
+    [Task N text or plan reference. Context for judging the code, not for
+    re-checking compliance.]
+
+    ## Git Range to Review
+
+    **Base:** [BASE_SHA — commit before this task]
+    **Head:** [HEAD_SHA — current commit]
+
+    ```bash
+    git diff --stat [BASE_SHA]..[HEAD_SHA]
+    git diff [BASE_SHA]..[HEAD_SHA]
+    ```
+
+    ## Read-Only Review
+
+    Your review is read-only on this checkout. Do not mutate the working tree,
+    the index, HEAD, or branch state in any way. Use tools like `git show`,
+    `git diff`, and `git log` to inspect history.
+
+    ## Scope
+
+    Spec compliance was already verified by a separate reviewer. Do not
+    re-check whether the code matches the requirements or the plan.
+
+    Start from the diff. Read the changed files first. Inspect code outside
+    the diff only to evaluate a concrete risk you can name — and name it in
+    your report. Cross-cutting changes are legitimate named risks: if the
+    diff changes lock ordering, a function or API contract, or shared mutable
+    state, checking the call sites is the right method. Do not crawl the
+    codebase by default.
+
+    ## Tests
+
+    The implementer already ran the tests and reported results with TDD
+    evidence for exactly this code. Do not re-run the suite to confirm their
+    report. Run a test only when reading the code raises a specific doubt
+    that no existing run answers — and then a focused test, never a
+    package-wide suite, race detector run, or repeated/high-count loop. If
+    heavy validation seems warranted, recommend it in your report instead of
+    running it. If you cannot run commands in this environment, name the
+    test you would run.
+
+    ## What to Check
+
+    **Code quality:**
+    - Clean separation of concerns?
+    - Proper error handling?
+    - DRY without premature abstraction?
+    - Edge cases handled?
+
+    **Tests:**
+    - Do the new and changed tests verify real behavior, not mocks?
+    - Are the task's edge cases covered?
+
+    **Structure:**
+    - Does each file have one clear responsibility with a well-defined interface?
+    - Are units decomposed so they can be understood and tested independently?
+    - Is the implementation following the file structure from the plan?
+    - Did this change create new files that are already large, or
+      significantly grow existing files? (Don't flag pre-existing file
+      sizes — focus on what this change contributed.)
+
+    ## Calibration
+
+    Categorize issues by actual severity. Not everything is Critical.
+    Acknowledge what was done well before listing issues — accurate praise
+    helps the implementer trust the rest of the feedback.
+
+    ## Output Format
+
+    ### Strengths
+    [What's well done? Be specific.]
+
+    ### Issues
+
+    #### Critical (Must Fix)
+    [Bugs, data loss risks, broken functionality]
+
+    #### Important (Should Fix)
+    [Poor error handling, test gaps, structural problems]
+
+    #### Minor (Nice to Have)
+    [Code style, optimization opportunities]
+
+    For each issue:
+    - File:line reference
+    - What's wrong
+    - Why it matters
+    - How to fix (if not obvious)
+
+    ### Assessment
+
+    **Task quality:** [Approved | Needs fixes]
+
+    **Reasoning:** [1-2 sentence technical assessment]
+```
+
+**Placeholders:**
+- `[DESCRIPTION]` — task summary, from implementer's report
+- `[Task N text or plan reference]` — the task's requirements, for context
+- `[BASE_SHA]` — commit before this task
+- `[HEAD_SHA]` — current commit
+
+**Reviewer returns:** Strengths, Issues (Critical/Important/Minor), Task quality verdict
+````
+
+- [ ] **Step 2: Verify the rewrite landed**
+
+Run: `grep -c "requesting-code-review" skills/subagent-driven-development/code-quality-reviewer-prompt.md || echo ABSENT`
+Expected: `ABSENT` (no more delegation)
+
+Run: `grep -n "Task quality:" skills/subagent-driven-development/code-quality-reviewer-prompt.md | head -2`
+Expected: two matches (output format + placeholder note)
+
+Run: `grep -n "worktree add\|Ready to merge" skills/subagent-driven-development/code-quality-reviewer-prompt.md || echo CLEAN`
+Expected: `CLEAN`
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add skills/subagent-driven-development/code-quality-reviewer-prompt.md
+git commit -m "Make per-task quality reviewer prompt self-contained and task-scoped"
+```
+
+---
+
+### Task 2: Spec reviewer prompt cleanups
+
+Four exact edits to `skills/subagent-driven-development/spec-reviewer-prompt.md`. Current line numbers refer to the file as of commit f55642e.
+
+**Files:**
+- Modify: `skills/subagent-driven-development/spec-reviewer-prompt.md`
+
+- [ ] **Step 1: Add the judge-from-the-diff clause.** After the line (currently line 31):
+
+```
+    Only read files in this diff. Do not crawl the broader codebase.
+```
+
+insert a blank line and:
+
+```
+    Spec compliance is judged by reading the diff against the requirements.
+    The implementer already ran the tests and reported TDD evidence — do not
+    re-run them. If a requirement cannot be verified from this diff alone
+    (it lives in unchanged code or spans tasks), report it as a ⚠️ item
+    instead of broadening your search.
+```
+
+- [ ] **Step 2: Trim the read-only section.** Replace (currently line 35):
+
+```
+    Your review is read-only on this checkout. Do not mutate the working tree, the index, HEAD, or branch state in any way. Use tools like `git show`, `git diff`, and `git log` to inspect history. If you need a working copy of a different revision, check it out into a separate temporary directory (e.g. `git worktree add /tmp/review-[SHA] [SHA]`) — never move HEAD on this checkout.
+```
+
+with:
+
+```
+    Your review is read-only on this checkout. Do not mutate the working tree, the index, HEAD, or branch state in any way. Use tools like `git show`, `git diff`, and `git log` to inspect history.
+```
+
+- [ ] **Step 3: Ground the skepticism.** Replace (currently lines 39-40):
+
+```
+    The implementer finished suspiciously quickly. Their report may be incomplete,
+    inaccurate, or optimistic. You MUST verify everything independently.
+```
+
+with:
+
+```
+    Treat the implementer's report as unverified claims about the code. It may
+    be incomplete, inaccurate, or optimistic. Verify the claims against the diff.
+```
+
+- [ ] **Step 4: Add the third verdict channel.** Replace (currently lines 74-76):
+
+```
+    Report:
+    - ✅ Spec compliant (if everything matches after code inspection)
+    - ❌ Issues found: [list specifically what's missing or extra, with file:line references]
+```
+
+with:
+
+```
+    Report:
+    - ✅ Spec compliant (if everything matches after code inspection)
+    - ❌ Issues found: [list specifically what's missing or extra, with file:line references]
+    - ⚠️ Cannot verify from diff: [requirements you could not verify from the
+      diff alone, and what the controller should check — report alongside the
+      ✅/❌ verdict for everything you could verify]
+```
+
+- [ ] **Step 5: Verify**
+
+Run: `grep -n "suspiciously\|worktree add" skills/subagent-driven-development/spec-reviewer-prompt.md || echo CLEAN`
+Expected: `CLEAN`
+
+Run: `grep -c "⚠️" skills/subagent-driven-development/spec-reviewer-prompt.md`
+Expected: `2` (judge-from-diff clause + verdict channel)
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add skills/subagent-driven-development/spec-reviewer-prompt.md
+git commit -m "Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel"
+```
+
+---
+
+### Task 3: Implementer prompt — re-run tests after fixing review findings
+
+The reviewers' "don't re-run the implementer's tests" rule assumes the implementer re-runs tests after every fix. Make that real.
+
+**Files:**
+- Modify: `skills/subagent-driven-development/implementer-prompt.md`
+
+- [ ] **Step 1: Insert a new section.** Immediately before the line (currently line 100):
+
+```
+    ## Report Format
+```
+
+insert:
+
+```
+    ## After Review Findings
+
+    If a reviewer finds issues and you fix them, re-run the tests that cover
+    the amended code and include the results in your fix report. Reviewers
+    will not re-run tests for you — your report is the test evidence.
+
+```
+
+- [ ] **Step 2: Verify**
+
+Run: `grep -n "After Review Findings" skills/subagent-driven-development/implementer-prompt.md`
+Expected: one match, on a line before `## Report Format`
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add skills/subagent-driven-development/implementer-prompt.md
+git commit -m "Implementer prompt: re-run covering tests after fixing review findings"
+```
+
+---
+
+### Task 4: SKILL.md controller changes
+
+Six exact edits to `skills/subagent-driven-development/SKILL.md`. Current line numbers refer to commit f55642e.
+
+**Files:**
+- Modify: `skills/subagent-driven-development/SKILL.md`
+
+- [ ] **Step 1: Point the final-review flowchart node at the broad template.** The node label `Dispatch final code reviewer subagent for entire implementation` appears 3 times (currently lines 65, 84, 85). In all 3 occurrences, replace the label string with:
+
+```
+Dispatch final code reviewer subagent (../requesting-code-review/code-reviewer.md)
+```
+
+(Graphviz nodes are matched by label text — all three must be byte-identical or the graph grows a phantom node.)
+
+- [ ] **Step 2: Model selection by judgment.** Replace (currently lines 97-99):
+
+```
+**Architecture, design, and review tasks**: use the most capable available model.
+
+**Task complexity signals:**
+```
+
+with:
+
+```
+**Architecture and design tasks**: use the most capable available model.
+
+**Review tasks**: choose the model with the same judgment, scaled to the
+diff's size, complexity, and risk. A small mechanical diff does not need the
+most capable model; a subtle concurrency change does.
+
+**Task complexity signals (implementation tasks):**
+```
+
+- [ ] **Step 3: Add controller guidance sections.** Immediately before the line (currently line 122):
+
+```
+## Prompt Templates
+```
+
+insert:
+
+```
+## Handling Spec Reviewer ⚠️ Items
+
+The spec reviewer may report "⚠️ Cannot verify from diff" items — requirements
+that live in unchanged code or span tasks. These do not block dispatching the
+code quality reviewer, but you must resolve each one yourself before marking
+the task complete: you hold the plan and cross-task context the reviewer
+lacks. If you confirm an item is a real gap, treat it as a failed spec
+review — send it back to the implementer and re-review.
+
+## Constructing Reviewer Prompts
+
+Per-task reviews are task-scoped gates. The broad review happens once, at the
+final whole-branch review. When you fill a reviewer template:
+
+- Do not add open-ended directives like "check all uses" or "run race tests
+  if useful" without a concrete, task-specific reason
+- Do not ask a reviewer to re-run tests the implementer already ran on the
+  same code — the implementer's report carries the test evidence
+
+```
+
+- [ ] **Step 4: Prompt Templates list — add the final-review pointer.** Replace (currently line 126):
+
+```
+- [code-quality-reviewer-prompt.md](code-quality-reviewer-prompt.md) - Dispatch code quality reviewer subagent
+```
+
+with:
+
+```
+- [code-quality-reviewer-prompt.md](code-quality-reviewer-prompt.md) - Dispatch code quality reviewer subagent
+- Final whole-branch review: use superpowers:requesting-code-review's [code-reviewer.md](../requesting-code-review/code-reviewer.md)
+```
+
+- [ ] **Step 5: Example workflow verdict vocabulary.** Two replacements:
+
+Replace (currently line 157):
+```
+Code reviewer: Strengths: Good test coverage, clean. Issues: None. Approved.
+```
+with:
+```
+Code reviewer: Strengths: Good test coverage, clean. Issues: None. Task quality: Approved.
+```
+
+Replace (currently line 191):
+```
+Code reviewer: ✅ Approved
+```
+with:
+```
+Code reviewer: ✅ Task quality: Approved
+```
+
+(The final reviewer's "ready to merge" line, currently line 199, stays.)
+
+- [ ] **Step 6: Integration section.** Replace (currently line 272):
+
+```
+- **superpowers:requesting-code-review** - Code review template for reviewer subagents
+```
+
+with:
+
+```
+- **superpowers:requesting-code-review** - Code review template for the final whole-branch review
+```
+
+- [ ] **Step 7: Verify**
+
+Run: `grep -c "Dispatch final code reviewer subagent (../requesting-code-review/code-reviewer.md)" skills/subagent-driven-development/SKILL.md`
+Expected: `3`
+
+Run: `grep -n "most capable available model" skills/subagent-driven-development/SKILL.md`
+Expected: exactly one match (architecture/design bullet)
+
+Run: `grep -n "Handling Spec Reviewer\|Constructing Reviewer Prompts" skills/subagent-driven-development/SKILL.md`
+Expected: two section headers, both before `## Prompt Templates`
+
+Run: `grep -c "Task quality: Approved" skills/subagent-driven-development/SKILL.md`
+Expected: `2`
+
+- [ ] **Step 8: Commit**
+
+```bash
+git add skills/subagent-driven-development/SKILL.md
+git commit -m "SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment"
+```
+
+---
+
+### Task 5: New eval scenario — per-task quality reviewer catches a planted defect
+
+Lives in the `evals/` **submodule** (separate repo, `superpowers-evals`). Work on a branch there; the parent submodule-pointer bump happens at finishing time per `evals/CLAUDE.md`.
+
+The fixture plan's Task 2 implementation snippet duplicates Task 1's formatting logic verbatim. The duplication is spec-compliant, so the spec reviewer should pass it — the per-task quality reviewer is the gate under test (DRY violation).
+
+**Files:**
+- Create: `evals/setup_helpers/sdd_quality_defect_plan.py`
+- Modify: `evals/setup_helpers/__init__.py`
+- Create: `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/story.md`
+- Create: `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/setup.sh`
+- Create: `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/checks.sh`
+
+- [ ] **Step 0: Branch in the submodule**
+
+```bash
+cd evals
+git checkout -b sdd-quality-defect-scenario
+```
+
+- [ ] **Step 1: Create `evals/setup_helpers/sdd_quality_defect_plan.py`:**
+
+````python
+"""Setup helper for the sdd-quality-reviewer-catches-planted-defect scenario.
+
+Scaffolds a tiny Node project with a 2-task plan whose Task 2
+implementation snippet duplicates Task 1's formatting logic verbatim.
+The duplication is spec-compliant — the requirements only describe
+behavior — so the spec compliance reviewer should pass it. The test
+measures whether the per-task code quality reviewer catches the DRY
+violation and forces a refactor in the review-fix loop.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from setup_helpers.base import _git
+
+PACKAGE_JSON = """\
+{
+  "name": "report-quality",
+  "version": "1.0.0",
+  "type": "module",
+  "scripts": {
+    "test": "node --test"
+  }
+}
+"""
+
+PLAN_BODY = """\
+# Report Formatter — Implementation Plan
+
+Two report formatting functions. Implement exactly what each task
+specifies.
+
+## Task 1: User Report
+
+**File:** `src/report.js`
+
+**Requirements:**
+- Function named `formatUserReport`
+- Takes one parameter `user`: an object with `name`, `email`, `visits`
+- Returns a multi-line string: a banner of 40 `=` characters, then
+  `Report for <name> <<email>>`, then the banner again, then
+  `Visits: <visits>`, then a closing banner
+- Export the function
+
+**Implementation:**
+```javascript
+export function formatUserReport(user) {
+  const banner = "=".repeat(40);
+  const lines = [];
+  lines.push(banner);
+  lines.push(`Report for ${user.name} <${user.email}>`);
+  lines.push(banner);
+  lines.push(`Visits: ${user.visits}`);
+  lines.push(banner);
+  return lines.join("\\n");
+}
+```
+
+**Tests:** Create `test/report.test.js` verifying:
+- the result contains `Report for Ada <ada@example.com>` for that user
+- the result contains `Visits: 3` when `visits` is `3`
+- the result starts and ends with the 40-char banner
+
+**Verification:** `npm test`
+
+## Task 2: Admin Report
+
+**File:** `src/report.js` (add to existing file)
+
+**Requirements:**
+- Function named `formatAdminReport`
+- Takes one parameter `admin`: an object with `name`, `email`, `lastLogin`
+- Same banner layout as the user report; the body line is
+  `Last login: <lastLogin>` instead of the visits line
+- Export the function; keep `formatUserReport` working
+
+**Implementation:**
+```javascript
+export function formatAdminReport(admin) {
+  const banner = "=".repeat(40);
+  const lines = [];
+  lines.push(banner);
+  lines.push(`Report for ${admin.name} <${admin.email}>`);
+  lines.push(banner);
+  lines.push(`Last login: ${admin.lastLogin}`);
+  lines.push(banner);
+  return lines.join("\\n");
+}
+```
+
+**Tests:** Add to `test/report.test.js`:
+- the result contains `Report for Grace <grace@example.com>` for that admin
+- the result contains `Last login: 2026-06-01`
+- the result starts and ends with the 40-char banner
+
+**Verification:** `npm test`
+"""
+
+
+def scaffold_sdd_quality_defect_plan(workdir: Path) -> None:
+    workdir = Path(workdir)
+    workdir.mkdir(parents=True, exist_ok=True)
+    _git(["git", "init", "-b", "main"], cwd=workdir)
+    _git(["git", "config", "user.email", "drill@test.local"], cwd=workdir)
+    _git(["git", "config", "user.name", "Drill Test"], cwd=workdir)
+
+    (workdir / "package.json").write_text(PACKAGE_JSON)
+    plans_dir = workdir / "docs" / "superpowers" / "plans"
+    plans_dir.mkdir(parents=True, exist_ok=True)
+    (plans_dir / "report-plan.md").write_text(PLAN_BODY)
+
+    _git(["git", "add", "-A"], cwd=workdir)
+    _git(["git", "commit", "-m", "initial: report formatter plan"], cwd=workdir)
+````
+
+(Note the `\\n` in the JS snippets inside PLAN_BODY: the Python source must
+produce a literal `\n` in the markdown so the JS reads `lines.join("\n")`.)
+
+- [ ] **Step 2: Register the helper.** In `evals/setup_helpers/__init__.py`:
+
+After the line:
+```python
+from setup_helpers.sdd_real_projects import scaffold_sdd_go_fractals, scaffold_sdd_svelte_todo
+```
+add:
+```python
+from setup_helpers.sdd_quality_defect_plan import scaffold_sdd_quality_defect_plan
+```
+
+After the registry entry:
+```python
+    "scaffold_sdd_yagni_plan": scaffold_sdd_yagni_plan,
+```
+add:
+```python
+    "scaffold_sdd_quality_defect_plan": scaffold_sdd_quality_defect_plan,
+```
+
+- [ ] **Step 3: Create `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/story.md`:**
+
+```markdown
+---
+id: sdd-quality-reviewer-catches-planted-defect
+title: SDD's per-task code quality review catches a planted DRY violation
+status: ready
+tags: subagent-driven-development
+quorum_max_time: 90m
+---
+
+You have a small plan at docs/superpowers/plans/report-plan.md — two report
+formatting functions. The plan's Task 2 implementation snippet duplicates
+Task 1's formatting logic verbatim instead of sharing it. The duplication is
+spec-compliant (the requirements only describe behavior), so the spec
+compliance reviewer should pass it — the per-task code quality reviewer is
+the gate under test. You are spec-aware — name the skill.
+
+When the agent is ready for input, tell it to execute the plan with SDD. Use
+phrasing like:
+
+"I have a small plan at docs/superpowers/plans/report-plan.md — two report
+formatting functions. Use the superpowers:subagent-driven-development skill
+to execute it end-to-end — dispatch fresh subagents per task and run the
+two-stage review after each."
+
+Let the agent proceed autonomously. If it asks clarifying questions, give
+brief answers. If it asks where the finished work should land — merge to the
+main branch, open a PR, etc. — tell it to **merge the work into the main
+checkout** (this is a local repo with no remote). If a quality reviewer
+flags the duplicated formatting logic and an implementer refactors it, let
+the review-fix cycle play out — that cycle is exactly the behavior under
+test.
+
+The deliverable must end up in the checkout you launched in (the main
+working tree). Once the agent reports the plan is complete (both functions
+implemented, tests passing) AND the code is present on the main checkout,
+you are done.
+
+## Acceptance Criteria
+
+- A `Skill` invocation naming `superpowers:subagent-driven-development`
+  and at least one `Agent` (subagent dispatch) tool call appear in the
+  session log.
+- The duplicated report-formatting logic did not survive to the end of the
+  run. Either (a) the implementer never introduced the duplication (wrote or
+  self-reviewed its way to shared logic), or (b) the per-task code quality
+  reviewer flagged the duplication as an issue and a review-fix loop removed
+  it. A fail looks like the duplicated logic shipping with the per-task
+  quality reviewer approving it, or the duplication being caught only by the
+  final whole-branch review.
+- The per-task quality reviewers stayed task-scoped: no package-wide test
+  suites, race detector runs, or repeated/high-count test loops appear in
+  reviewer subagent activity, and reviewers did not re-run the full test
+  suite merely to confirm the implementer's report.
+- `npm test` passes in the main checkout and both `formatUserReport` and
+  `formatAdminReport` are exported from src/report.js. The deterministic
+  assertions gate this; the criteria above are about whether the *per-task
+  quality review* was the mechanism that kept the code clean.
+```
+
+- [ ] **Step 4: Create `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/setup.sh`:**
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+uv run setup-helpers run scaffold_sdd_quality_defect_plan
+```
+
+Then: `chmod +x evals/scenarios/sdd-quality-reviewer-catches-planted-defect/setup.sh`
+
+- [ ] **Step 5: Create `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/checks.sh`** (no executable bit):
+
+```bash
+pre() {
+    git-repo
+    git-branch main
+    requires-tool npm
+    file-exists 'docs/superpowers/plans/report-plan.md'
+    file-contains 'docs/superpowers/plans/report-plan.md' 'formatAdminReport'
+}
+
+post() {
+    skill-called superpowers:subagent-driven-development
+    tool-called Agent
+    command-succeeds 'npm test'
+    file-contains 'src/report.js' 'export function formatUserReport'
+    file-contains 'src/report.js' 'export function formatAdminReport'
+    command-succeeds 'test "$(grep -c "repeat(40)" src/report.js)" -le 1'
+}
+```
+
+(The last check is the deterministic DRY gate: the banner construction
+`"=".repeat(40)` must appear at most once in the final file — shared, not
+duplicated per function.)
+
+- [ ] **Step 6: Validate and test in the evals repo**
+
+```bash
+cd evals
+uv run quorum check
+uv run ruff check
+uv run pytest -x -q
+```
+
+Expected: all pass; `quorum check` lists the new scenario without errors.
+
+- [ ] **Step 7: Commit (in the submodule)**
+
+```bash
+cd evals
+git add setup_helpers/sdd_quality_defect_plan.py setup_helpers/__init__.py scenarios/sdd-quality-reviewer-catches-planted-defect/
+git commit -m "Add sdd-quality-reviewer-catches-planted-defect scenario"
+```
+
+---
+
+### Task 6: Static verification sweep
+
+**Files:** none modified — verification only.
+
+- [ ] **Step 1: No dangling references in the parent repo**
+
+Run: `grep -rn "requesting-code-review" skills/subagent-driven-development/`
+Expected: matches only in SKILL.md (final-review flowchart node ×3, Prompt Templates pointer, Integration bullet). None in code-quality-reviewer-prompt.md.
+
+Run: `grep -rn "Ready to merge" skills/subagent-driven-development/ || echo CLEAN`
+Expected: `CLEAN`
+
+- [ ] **Step 2: Plugin infrastructure tests**
+
+Run: `bash tests/shell-lint/test-lint-shell.sh`
+Expected: all PASS (we added `setup.sh` only inside the evals submodule, which has its own checks).
+
+- [ ] **Step 3: Cross-platform tool tables still coherent**
+
+Run: `grep -n "code-quality-reviewer" skills/using-superpowers/references/antigravity-tools.md skills/using-superpowers/references/gemini-tools.md`
+Expected: both tables still list `code-quality-reviewer` as a reviewer template (the new prompt's "If you cannot run commands in this environment, name the test you would run" line keeps the read-only `research` mapping valid — no table edits needed).
+
+---
+
+### Task 7: Live before/after evals (maintainer-gated)
+
+Live quorum runs launch agent CLIs in permissive modes — **trusted-maintainer operation; Jesse launches these**, per `evals/CLAUDE.md`. Requires `ANTHROPIC_API_KEY`.
+
+- [ ] **Step 1: Baseline (skills as released on dev)** — from the main checkout (`/Users/jesse/git/superpowers/superpowers`, on dev), or any checkout without this branch's changes:
+
+```bash
+cd evals
+export SUPERPOWERS_ROOT=/Users/jesse/git/superpowers/superpowers
+uv run quorum run scenarios/sdd-rejects-extra-features --coding-agent claude
+uv run quorum run scenarios/sdd-go-fractals --coding-agent claude
+uv run quorum run scenarios/sdd-svelte-todo --coding-agent claude
+uv run quorum run scenarios/spec-reviewer-catches-planted-flaws --coding-agent claude
+```
+
+- [ ] **Step 2: After (this branch's skills)** — point `SUPERPOWERS_ROOT` at this worktree:
+
+```bash
+cd evals
+export SUPERPOWERS_ROOT=/Users/jesse/git/superpowers/superpowers/.claude/worktrees/sdd-review-dispatch
+uv run quorum run scenarios/sdd-rejects-extra-features --coding-agent claude
+uv run quorum run scenarios/sdd-go-fractals --coding-agent claude
+uv run quorum run scenarios/sdd-svelte-todo --coding-agent claude
+uv run quorum run scenarios/spec-reviewer-catches-planted-flaws --coding-agent claude
+uv run quorum run scenarios/sdd-quality-reviewer-catches-planted-defect --coding-agent claude
+uv run quorum show
+```
+
+- [ ] **Step 3: Compare**
+
+Pass bar: all four pre-existing scenarios still pass after the change (no regression in catch rate); the new planted-defect scenario passes. For exploration cost, compare reviewer-subagent tool-call counts between the before/after run transcripts (no automated check exists — the spec calls this out as a known gap).
+
+---
+
+## Finishing
+
+After all tasks pass: the evals submodule commit needs to land in `superpowers-evals` (PR to its `main`), then this branch bumps the `evals` submodule pointer — per `evals/CLAUDE.md`, the parent bump is part of propagation, not optional. Then use superpowers:finishing-a-development-branch. PRs against superpowers target `dev`.