diff --git a/docs/superpowers/plans/2026-06-09-sdd-task-scoped-review-dispatch.md b/docs/superpowers/plans/2026-06-09-sdd-task-scoped-review-dispatch.md new file mode 100644 index 00000000..432ae097 --- /dev/null +++ b/docs/superpowers/plans/2026-06-09-sdd-task-scoped-review-dispatch.md @@ -0,0 +1,772 @@ +# SDD Task-Scoped Review Dispatch Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Scope SDD's per-task reviews to the task (diff-first reading, justified broadening, no redundant test runs) while final branch review stays broad. + +**Architecture:** Four prose edits to the subagent-driven-development skill (the per-task quality prompt becomes self-contained instead of delegating to the merge-readiness template; the spec prompt gets a third verdict channel and grounded skepticism; the implementer prompt gains a re-run-after-fix rule; SKILL.md gets controller guidance) plus one new eval scenario in the `evals/` submodule. `skills/requesting-code-review/` is deliberately untouched. + +**Tech Stack:** Markdown skill files; Python setup helper + bash checks + story.md for the quorum eval. + +**Spec:** `docs/superpowers/specs/2026-06-09-sdd-task-scoped-review-dispatch-design.md` — read it before starting. Decisions already settled there: full re-reviews stay; the two review stages stay separate; coordinator keeps model judgment; `requesting-code-review/` stays broad. + +**These are behavior-shaping prose files, not code.** There are no unit tests for them. Each task's verification steps are exact `grep` checks that the edit landed; behavioral verification is Task 6 (static) and Task 7 (live evals, maintainer-gated). + +--- + +### Task 1: Rewrite the per-task quality reviewer prompt as self-contained + +The current file delegates to `../requesting-code-review/code-reviewer.md`, which is a merge-readiness review (architecture, security, production readiness, "Ready to merge?"). Replace the entire file with a self-contained, task-scoped template. + +**Files:** +- Rewrite: `skills/subagent-driven-development/code-quality-reviewer-prompt.md` + +- [ ] **Step 1: Replace the full file contents with:** + +````markdown +# Code Quality Reviewer Prompt Template + +Use this template when dispatching a code quality reviewer subagent. + +**Purpose:** Verify one task's implementation is well-built (clean, tested, maintainable) + +**Only dispatch after spec compliance review passes.** + +``` +Subagent (general-purpose): + description: "Review code quality for Task N" + prompt: | + You are reviewing one task's implementation for code quality. This is a + task-scoped gate, not a merge review — a broad whole-branch review happens + separately after all tasks are complete. + + ## What Was Implemented + + [DESCRIPTION — task summary, from implementer's report] + + ## Task Requirements (context only) + + [Task N text or plan reference. Context for judging the code, not for + re-checking compliance.] + + ## Git Range to Review + + **Base:** [BASE_SHA — commit before this task] + **Head:** [HEAD_SHA — current commit] + + ```bash + git diff --stat [BASE_SHA]..[HEAD_SHA] + git diff [BASE_SHA]..[HEAD_SHA] + ``` + + ## Read-Only Review + + Your review is read-only on this checkout. Do not mutate the working tree, + the index, HEAD, or branch state in any way. Use tools like `git show`, + `git diff`, and `git log` to inspect history. + + ## Scope + + Spec compliance was already verified by a separate reviewer. Do not + re-check whether the code matches the requirements or the plan. + + Start from the diff. Read the changed files first. Inspect code outside + the diff only to evaluate a concrete risk you can name — and name it in + your report. Cross-cutting changes are legitimate named risks: if the + diff changes lock ordering, a function or API contract, or shared mutable + state, checking the call sites is the right method. Do not crawl the + codebase by default. + + ## Tests + + The implementer already ran the tests and reported results with TDD + evidence for exactly this code. Do not re-run the suite to confirm their + report. Run a test only when reading the code raises a specific doubt + that no existing run answers — and then a focused test, never a + package-wide suite, race detector run, or repeated/high-count loop. If + heavy validation seems warranted, recommend it in your report instead of + running it. If you cannot run commands in this environment, name the + test you would run. + + ## What to Check + + **Code quality:** + - Clean separation of concerns? + - Proper error handling? + - DRY without premature abstraction? + - Edge cases handled? + + **Tests:** + - Do the new and changed tests verify real behavior, not mocks? + - Are the task's edge cases covered? + + **Structure:** + - Does each file have one clear responsibility with a well-defined interface? + - Are units decomposed so they can be understood and tested independently? + - Is the implementation following the file structure from the plan? + - Did this change create new files that are already large, or + significantly grow existing files? (Don't flag pre-existing file + sizes — focus on what this change contributed.) + + ## Calibration + + Categorize issues by actual severity. Not everything is Critical. + Acknowledge what was done well before listing issues — accurate praise + helps the implementer trust the rest of the feedback. + + ## Output Format + + ### Strengths + [What's well done? Be specific.] + + ### Issues + + #### Critical (Must Fix) + [Bugs, data loss risks, broken functionality] + + #### Important (Should Fix) + [Poor error handling, test gaps, structural problems] + + #### Minor (Nice to Have) + [Code style, optimization opportunities] + + For each issue: + - File:line reference + - What's wrong + - Why it matters + - How to fix (if not obvious) + + ### Assessment + + **Task quality:** [Approved | Needs fixes] + + **Reasoning:** [1-2 sentence technical assessment] +``` + +**Placeholders:** +- `[DESCRIPTION]` — task summary, from implementer's report +- `[Task N text or plan reference]` — the task's requirements, for context +- `[BASE_SHA]` — commit before this task +- `[HEAD_SHA]` — current commit + +**Reviewer returns:** Strengths, Issues (Critical/Important/Minor), Task quality verdict +```` + +- [ ] **Step 2: Verify the rewrite landed** + +Run: `grep -c "requesting-code-review" skills/subagent-driven-development/code-quality-reviewer-prompt.md || echo ABSENT` +Expected: `ABSENT` (no more delegation) + +Run: `grep -n "Task quality:" skills/subagent-driven-development/code-quality-reviewer-prompt.md | head -2` +Expected: two matches (output format + placeholder note) + +Run: `grep -n "worktree add\|Ready to merge" skills/subagent-driven-development/code-quality-reviewer-prompt.md || echo CLEAN` +Expected: `CLEAN` + +- [ ] **Step 3: Commit** + +```bash +git add skills/subagent-driven-development/code-quality-reviewer-prompt.md +git commit -m "Make per-task quality reviewer prompt self-contained and task-scoped" +``` + +--- + +### Task 2: Spec reviewer prompt cleanups + +Four exact edits to `skills/subagent-driven-development/spec-reviewer-prompt.md`. Current line numbers refer to the file as of commit f55642e. + +**Files:** +- Modify: `skills/subagent-driven-development/spec-reviewer-prompt.md` + +- [ ] **Step 1: Add the judge-from-the-diff clause.** After the line (currently line 31): + +``` + Only read files in this diff. Do not crawl the broader codebase. +``` + +insert a blank line and: + +``` + Spec compliance is judged by reading the diff against the requirements. + The implementer already ran the tests and reported TDD evidence — do not + re-run them. If a requirement cannot be verified from this diff alone + (it lives in unchanged code or spans tasks), report it as a ⚠️ item + instead of broadening your search. +``` + +- [ ] **Step 2: Trim the read-only section.** Replace (currently line 35): + +``` + Your review is read-only on this checkout. Do not mutate the working tree, the index, HEAD, or branch state in any way. Use tools like `git show`, `git diff`, and `git log` to inspect history. If you need a working copy of a different revision, check it out into a separate temporary directory (e.g. `git worktree add /tmp/review-[SHA] [SHA]`) — never move HEAD on this checkout. +``` + +with: + +``` + Your review is read-only on this checkout. Do not mutate the working tree, the index, HEAD, or branch state in any way. Use tools like `git show`, `git diff`, and `git log` to inspect history. +``` + +- [ ] **Step 3: Ground the skepticism.** Replace (currently lines 39-40): + +``` + The implementer finished suspiciously quickly. Their report may be incomplete, + inaccurate, or optimistic. You MUST verify everything independently. +``` + +with: + +``` + Treat the implementer's report as unverified claims about the code. It may + be incomplete, inaccurate, or optimistic. Verify the claims against the diff. +``` + +- [ ] **Step 4: Add the third verdict channel.** Replace (currently lines 74-76): + +``` + Report: + - ✅ Spec compliant (if everything matches after code inspection) + - ❌ Issues found: [list specifically what's missing or extra, with file:line references] +``` + +with: + +``` + Report: + - ✅ Spec compliant (if everything matches after code inspection) + - ❌ Issues found: [list specifically what's missing or extra, with file:line references] + - ⚠️ Cannot verify from diff: [requirements you could not verify from the + diff alone, and what the controller should check — report alongside the + ✅/❌ verdict for everything you could verify] +``` + +- [ ] **Step 5: Verify** + +Run: `grep -n "suspiciously\|worktree add" skills/subagent-driven-development/spec-reviewer-prompt.md || echo CLEAN` +Expected: `CLEAN` + +Run: `grep -c "⚠️" skills/subagent-driven-development/spec-reviewer-prompt.md` +Expected: `2` (judge-from-diff clause + verdict channel) + +- [ ] **Step 6: Commit** + +```bash +git add skills/subagent-driven-development/spec-reviewer-prompt.md +git commit -m "Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel" +``` + +--- + +### Task 3: Implementer prompt — re-run tests after fixing review findings + +The reviewers' "don't re-run the implementer's tests" rule assumes the implementer re-runs tests after every fix. Make that real. + +**Files:** +- Modify: `skills/subagent-driven-development/implementer-prompt.md` + +- [ ] **Step 1: Insert a new section.** Immediately before the line (currently line 100): + +``` + ## Report Format +``` + +insert: + +``` + ## After Review Findings + + If a reviewer finds issues and you fix them, re-run the tests that cover + the amended code and include the results in your fix report. Reviewers + will not re-run tests for you — your report is the test evidence. + +``` + +- [ ] **Step 2: Verify** + +Run: `grep -n "After Review Findings" skills/subagent-driven-development/implementer-prompt.md` +Expected: one match, on a line before `## Report Format` + +- [ ] **Step 3: Commit** + +```bash +git add skills/subagent-driven-development/implementer-prompt.md +git commit -m "Implementer prompt: re-run covering tests after fixing review findings" +``` + +--- + +### Task 4: SKILL.md controller changes + +Six exact edits to `skills/subagent-driven-development/SKILL.md`. Current line numbers refer to commit f55642e. + +**Files:** +- Modify: `skills/subagent-driven-development/SKILL.md` + +- [ ] **Step 1: Point the final-review flowchart node at the broad template.** The node label `Dispatch final code reviewer subagent for entire implementation` appears 3 times (currently lines 65, 84, 85). In all 3 occurrences, replace the label string with: + +``` +Dispatch final code reviewer subagent (../requesting-code-review/code-reviewer.md) +``` + +(Graphviz nodes are matched by label text — all three must be byte-identical or the graph grows a phantom node.) + +- [ ] **Step 2: Model selection by judgment.** Replace (currently lines 97-99): + +``` +**Architecture, design, and review tasks**: use the most capable available model. + +**Task complexity signals:** +``` + +with: + +``` +**Architecture and design tasks**: use the most capable available model. + +**Review tasks**: choose the model with the same judgment, scaled to the +diff's size, complexity, and risk. A small mechanical diff does not need the +most capable model; a subtle concurrency change does. + +**Task complexity signals (implementation tasks):** +``` + +- [ ] **Step 3: Add controller guidance sections.** Immediately before the line (currently line 122): + +``` +## Prompt Templates +``` + +insert: + +``` +## Handling Spec Reviewer ⚠️ Items + +The spec reviewer may report "⚠️ Cannot verify from diff" items — requirements +that live in unchanged code or span tasks. These do not block dispatching the +code quality reviewer, but you must resolve each one yourself before marking +the task complete: you hold the plan and cross-task context the reviewer +lacks. If you confirm an item is a real gap, treat it as a failed spec +review — send it back to the implementer and re-review. + +## Constructing Reviewer Prompts + +Per-task reviews are task-scoped gates. The broad review happens once, at the +final whole-branch review. When you fill a reviewer template: + +- Do not add open-ended directives like "check all uses" or "run race tests + if useful" without a concrete, task-specific reason +- Do not ask a reviewer to re-run tests the implementer already ran on the + same code — the implementer's report carries the test evidence + +``` + +- [ ] **Step 4: Prompt Templates list — add the final-review pointer.** Replace (currently line 126): + +``` +- [code-quality-reviewer-prompt.md](code-quality-reviewer-prompt.md) - Dispatch code quality reviewer subagent +``` + +with: + +``` +- [code-quality-reviewer-prompt.md](code-quality-reviewer-prompt.md) - Dispatch code quality reviewer subagent +- Final whole-branch review: use superpowers:requesting-code-review's [code-reviewer.md](../requesting-code-review/code-reviewer.md) +``` + +- [ ] **Step 5: Example workflow verdict vocabulary.** Two replacements: + +Replace (currently line 157): +``` +Code reviewer: Strengths: Good test coverage, clean. Issues: None. Approved. +``` +with: +``` +Code reviewer: Strengths: Good test coverage, clean. Issues: None. Task quality: Approved. +``` + +Replace (currently line 191): +``` +Code reviewer: ✅ Approved +``` +with: +``` +Code reviewer: ✅ Task quality: Approved +``` + +(The final reviewer's "ready to merge" line, currently line 199, stays.) + +- [ ] **Step 6: Integration section.** Replace (currently line 272): + +``` +- **superpowers:requesting-code-review** - Code review template for reviewer subagents +``` + +with: + +``` +- **superpowers:requesting-code-review** - Code review template for the final whole-branch review +``` + +- [ ] **Step 7: Verify** + +Run: `grep -c "Dispatch final code reviewer subagent (../requesting-code-review/code-reviewer.md)" skills/subagent-driven-development/SKILL.md` +Expected: `3` + +Run: `grep -n "most capable available model" skills/subagent-driven-development/SKILL.md` +Expected: exactly one match (architecture/design bullet) + +Run: `grep -n "Handling Spec Reviewer\|Constructing Reviewer Prompts" skills/subagent-driven-development/SKILL.md` +Expected: two section headers, both before `## Prompt Templates` + +Run: `grep -c "Task quality: Approved" skills/subagent-driven-development/SKILL.md` +Expected: `2` + +- [ ] **Step 8: Commit** + +```bash +git add skills/subagent-driven-development/SKILL.md +git commit -m "SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment" +``` + +--- + +### Task 5: New eval scenario — per-task quality reviewer catches a planted defect + +Lives in the `evals/` **submodule** (separate repo, `superpowers-evals`). Work on a branch there; the parent submodule-pointer bump happens at finishing time per `evals/CLAUDE.md`. + +The fixture plan's Task 2 implementation snippet duplicates Task 1's formatting logic verbatim. The duplication is spec-compliant, so the spec reviewer should pass it — the per-task quality reviewer is the gate under test (DRY violation). + +**Files:** +- Create: `evals/setup_helpers/sdd_quality_defect_plan.py` +- Modify: `evals/setup_helpers/__init__.py` +- Create: `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/story.md` +- Create: `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/setup.sh` +- Create: `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/checks.sh` + +- [ ] **Step 0: Branch in the submodule** + +```bash +cd evals +git checkout -b sdd-quality-defect-scenario +``` + +- [ ] **Step 1: Create `evals/setup_helpers/sdd_quality_defect_plan.py`:** + +````python +"""Setup helper for the sdd-quality-reviewer-catches-planted-defect scenario. + +Scaffolds a tiny Node project with a 2-task plan whose Task 2 +implementation snippet duplicates Task 1's formatting logic verbatim. +The duplication is spec-compliant — the requirements only describe +behavior — so the spec compliance reviewer should pass it. The test +measures whether the per-task code quality reviewer catches the DRY +violation and forces a refactor in the review-fix loop. +""" + +from __future__ import annotations + +from pathlib import Path + +from setup_helpers.base import _git + +PACKAGE_JSON = """\ +{ + "name": "report-quality", + "version": "1.0.0", + "type": "module", + "scripts": { + "test": "node --test" + } +} +""" + +PLAN_BODY = """\ +# Report Formatter — Implementation Plan + +Two report formatting functions. Implement exactly what each task +specifies. + +## Task 1: User Report + +**File:** `src/report.js` + +**Requirements:** +- Function named `formatUserReport` +- Takes one parameter `user`: an object with `name`, `email`, `visits` +- Returns a multi-line string: a banner of 40 `=` characters, then + `Report for <>`, then the banner again, then + `Visits: `, then a closing banner +- Export the function + +**Implementation:** +```javascript +export function formatUserReport(user) { + const banner = "=".repeat(40); + const lines = []; + lines.push(banner); + lines.push(`Report for ${user.name} <${user.email}>`); + lines.push(banner); + lines.push(`Visits: ${user.visits}`); + lines.push(banner); + return lines.join("\\n"); +} +``` + +**Tests:** Create `test/report.test.js` verifying: +- the result contains `Report for Ada ` for that user +- the result contains `Visits: 3` when `visits` is `3` +- the result starts and ends with the 40-char banner + +**Verification:** `npm test` + +## Task 2: Admin Report + +**File:** `src/report.js` (add to existing file) + +**Requirements:** +- Function named `formatAdminReport` +- Takes one parameter `admin`: an object with `name`, `email`, `lastLogin` +- Same banner layout as the user report; the body line is + `Last login: ` instead of the visits line +- Export the function; keep `formatUserReport` working + +**Implementation:** +```javascript +export function formatAdminReport(admin) { + const banner = "=".repeat(40); + const lines = []; + lines.push(banner); + lines.push(`Report for ${admin.name} <${admin.email}>`); + lines.push(banner); + lines.push(`Last login: ${admin.lastLogin}`); + lines.push(banner); + return lines.join("\\n"); +} +``` + +**Tests:** Add to `test/report.test.js`: +- the result contains `Report for Grace ` for that admin +- the result contains `Last login: 2026-06-01` +- the result starts and ends with the 40-char banner + +**Verification:** `npm test` +""" + + +def scaffold_sdd_quality_defect_plan(workdir: Path) -> None: + workdir = Path(workdir) + workdir.mkdir(parents=True, exist_ok=True) + _git(["git", "init", "-b", "main"], cwd=workdir) + _git(["git", "config", "user.email", "drill@test.local"], cwd=workdir) + _git(["git", "config", "user.name", "Drill Test"], cwd=workdir) + + (workdir / "package.json").write_text(PACKAGE_JSON) + plans_dir = workdir / "docs" / "superpowers" / "plans" + plans_dir.mkdir(parents=True, exist_ok=True) + (plans_dir / "report-plan.md").write_text(PLAN_BODY) + + _git(["git", "add", "-A"], cwd=workdir) + _git(["git", "commit", "-m", "initial: report formatter plan"], cwd=workdir) +```` + +(Note the `\\n` in the JS snippets inside PLAN_BODY: the Python source must +produce a literal `\n` in the markdown so the JS reads `lines.join("\n")`.) + +- [ ] **Step 2: Register the helper.** In `evals/setup_helpers/__init__.py`: + +After the line: +```python +from setup_helpers.sdd_real_projects import scaffold_sdd_go_fractals, scaffold_sdd_svelte_todo +``` +add: +```python +from setup_helpers.sdd_quality_defect_plan import scaffold_sdd_quality_defect_plan +``` + +After the registry entry: +```python + "scaffold_sdd_yagni_plan": scaffold_sdd_yagni_plan, +``` +add: +```python + "scaffold_sdd_quality_defect_plan": scaffold_sdd_quality_defect_plan, +``` + +- [ ] **Step 3: Create `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/story.md`:** + +```markdown +--- +id: sdd-quality-reviewer-catches-planted-defect +title: SDD's per-task code quality review catches a planted DRY violation +status: ready +tags: subagent-driven-development +quorum_max_time: 90m +--- + +You have a small plan at docs/superpowers/plans/report-plan.md — two report +formatting functions. The plan's Task 2 implementation snippet duplicates +Task 1's formatting logic verbatim instead of sharing it. The duplication is +spec-compliant (the requirements only describe behavior), so the spec +compliance reviewer should pass it — the per-task code quality reviewer is +the gate under test. You are spec-aware — name the skill. + +When the agent is ready for input, tell it to execute the plan with SDD. Use +phrasing like: + +"I have a small plan at docs/superpowers/plans/report-plan.md — two report +formatting functions. Use the superpowers:subagent-driven-development skill +to execute it end-to-end — dispatch fresh subagents per task and run the +two-stage review after each." + +Let the agent proceed autonomously. If it asks clarifying questions, give +brief answers. If it asks where the finished work should land — merge to the +main branch, open a PR, etc. — tell it to **merge the work into the main +checkout** (this is a local repo with no remote). If a quality reviewer +flags the duplicated formatting logic and an implementer refactors it, let +the review-fix cycle play out — that cycle is exactly the behavior under +test. + +The deliverable must end up in the checkout you launched in (the main +working tree). Once the agent reports the plan is complete (both functions +implemented, tests passing) AND the code is present on the main checkout, +you are done. + +## Acceptance Criteria + +- A `Skill` invocation naming `superpowers:subagent-driven-development` + and at least one `Agent` (subagent dispatch) tool call appear in the + session log. +- The duplicated report-formatting logic did not survive to the end of the + run. Either (a) the implementer never introduced the duplication (wrote or + self-reviewed its way to shared logic), or (b) the per-task code quality + reviewer flagged the duplication as an issue and a review-fix loop removed + it. A fail looks like the duplicated logic shipping with the per-task + quality reviewer approving it, or the duplication being caught only by the + final whole-branch review. +- The per-task quality reviewers stayed task-scoped: no package-wide test + suites, race detector runs, or repeated/high-count test loops appear in + reviewer subagent activity, and reviewers did not re-run the full test + suite merely to confirm the implementer's report. +- `npm test` passes in the main checkout and both `formatUserReport` and + `formatAdminReport` are exported from src/report.js. The deterministic + assertions gate this; the criteria above are about whether the *per-task + quality review* was the mechanism that kept the code clean. +``` + +- [ ] **Step 4: Create `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/setup.sh`:** + +```bash +#!/usr/bin/env bash +set -euo pipefail +uv run setup-helpers run scaffold_sdd_quality_defect_plan +``` + +Then: `chmod +x evals/scenarios/sdd-quality-reviewer-catches-planted-defect/setup.sh` + +- [ ] **Step 5: Create `evals/scenarios/sdd-quality-reviewer-catches-planted-defect/checks.sh`** (no executable bit): + +```bash +pre() { + git-repo + git-branch main + requires-tool npm + file-exists 'docs/superpowers/plans/report-plan.md' + file-contains 'docs/superpowers/plans/report-plan.md' 'formatAdminReport' +} + +post() { + skill-called superpowers:subagent-driven-development + tool-called Agent + command-succeeds 'npm test' + file-contains 'src/report.js' 'export function formatUserReport' + file-contains 'src/report.js' 'export function formatAdminReport' + command-succeeds 'test "$(grep -c "repeat(40)" src/report.js)" -le 1' +} +``` + +(The last check is the deterministic DRY gate: the banner construction +`"=".repeat(40)` must appear at most once in the final file — shared, not +duplicated per function.) + +- [ ] **Step 6: Validate and test in the evals repo** + +```bash +cd evals +uv run quorum check +uv run ruff check +uv run pytest -x -q +``` + +Expected: all pass; `quorum check` lists the new scenario without errors. + +- [ ] **Step 7: Commit (in the submodule)** + +```bash +cd evals +git add setup_helpers/sdd_quality_defect_plan.py setup_helpers/__init__.py scenarios/sdd-quality-reviewer-catches-planted-defect/ +git commit -m "Add sdd-quality-reviewer-catches-planted-defect scenario" +``` + +--- + +### Task 6: Static verification sweep + +**Files:** none modified — verification only. + +- [ ] **Step 1: No dangling references in the parent repo** + +Run: `grep -rn "requesting-code-review" skills/subagent-driven-development/` +Expected: matches only in SKILL.md (final-review flowchart node ×3, Prompt Templates pointer, Integration bullet). None in code-quality-reviewer-prompt.md. + +Run: `grep -rn "Ready to merge" skills/subagent-driven-development/ || echo CLEAN` +Expected: `CLEAN` + +- [ ] **Step 2: Plugin infrastructure tests** + +Run: `bash tests/shell-lint/test-lint-shell.sh` +Expected: all PASS (we added `setup.sh` only inside the evals submodule, which has its own checks). + +- [ ] **Step 3: Cross-platform tool tables still coherent** + +Run: `grep -n "code-quality-reviewer" skills/using-superpowers/references/antigravity-tools.md skills/using-superpowers/references/gemini-tools.md` +Expected: both tables still list `code-quality-reviewer` as a reviewer template (the new prompt's "If you cannot run commands in this environment, name the test you would run" line keeps the read-only `research` mapping valid — no table edits needed). + +--- + +### Task 7: Live before/after evals (maintainer-gated) + +Live quorum runs launch agent CLIs in permissive modes — **trusted-maintainer operation; Jesse launches these**, per `evals/CLAUDE.md`. Requires `ANTHROPIC_API_KEY`. + +- [ ] **Step 1: Baseline (skills as released on dev)** — from the main checkout (`/Users/jesse/git/superpowers/superpowers`, on dev), or any checkout without this branch's changes: + +```bash +cd evals +export SUPERPOWERS_ROOT=/Users/jesse/git/superpowers/superpowers +uv run quorum run scenarios/sdd-rejects-extra-features --coding-agent claude +uv run quorum run scenarios/sdd-go-fractals --coding-agent claude +uv run quorum run scenarios/sdd-svelte-todo --coding-agent claude +uv run quorum run scenarios/spec-reviewer-catches-planted-flaws --coding-agent claude +``` + +- [ ] **Step 2: After (this branch's skills)** — point `SUPERPOWERS_ROOT` at this worktree: + +```bash +cd evals +export SUPERPOWERS_ROOT=/Users/jesse/git/superpowers/superpowers/.claude/worktrees/sdd-review-dispatch +uv run quorum run scenarios/sdd-rejects-extra-features --coding-agent claude +uv run quorum run scenarios/sdd-go-fractals --coding-agent claude +uv run quorum run scenarios/sdd-svelte-todo --coding-agent claude +uv run quorum run scenarios/spec-reviewer-catches-planted-flaws --coding-agent claude +uv run quorum run scenarios/sdd-quality-reviewer-catches-planted-defect --coding-agent claude +uv run quorum show +``` + +- [ ] **Step 3: Compare** + +Pass bar: all four pre-existing scenarios still pass after the change (no regression in catch rate); the new planted-defect scenario passes. For exploration cost, compare reviewer-subagent tool-call counts between the before/after run transcripts (no automated check exists — the spec calls this out as a known gap). + +--- + +## Finishing + +After all tasks pass: the evals submodule commit needs to land in `superpowers-evals` (PR to its `main`), then this branch bumps the `evals` submodule pointer — per `evals/CLAUDE.md`, the parent bump is part of propagation, not optional. Then use superpowers:finishing-a-development-branch. PRs against superpowers target `dev`.