Release v5.1.0 (#1468)

* docs: add Codex App compatibility design spec (PRI-823)

Design for making using-git-worktrees, finishing-a-development-branch, and subagent-driven-development skills work in the Codex App's sandboxed worktree environment. Read-only environment detection via git-dir vs git-common-dir comparison, ~48 lines across 4 files, zero breaking changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: address spec review feedback for PRI-823

Fix three Important issues from spec review:
- Clarify Step 1.5 placement relative to existing Steps 2/3
- Re-derive environment state at cleanup time instead of relying on earlier skill output
- Acknowledge pre-existing Step 5 cleanup inconsistency

Also: precise step references, exact codex-tools.md content, clearer Integration section update instructions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: address team review feedback for PRI-823 spec

- Add commit SHA + data loss warning to handoff payload (HIGH)
- Add explicit commit step before handoff (HIGH)
- Remove misleading "mark as externally managed" from Path B
- Add executing-plans 1-line edit (was missing)
- Add branch name derivation rules
- Add conditional UI language for non-App environments
- Add sandbox fallback for permission errors
- Add STOP directive after Step 0 reporting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clarify executing-plans in What Does NOT Change section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add cleanup guard test (#5) and sandbox fallback test (#10) to spec

Both tests address real risk scenarios:
- #5: cleanup guard bug would delete Codex App's own worktree (data loss)
- #10: Local thread sandbox fallback needs manual Codex App validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add implementation plan for Codex App compatibility (PRI-823)

8 tasks covering: environment detection in using-git-worktrees, Step 1.5 + cleanup guard in finishing-a-development-branch, Integration line updates, codex-tools.md docs, automated tests, and final verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs(codex-tools): add named agent dispatch mapping for Codex (#647)

* fix(writing-skills): correct false 'only two fields' frontmatter claim (#882)

* Replace subagent review loops with lightweight inline self-review

The subagent review loop (dispatching a fresh agent to review plans/specs) doubled execution time (~25 min overhead) without measurably improving plan quality. Regression testing across 5 versions (v3.6.0 through v5.0.4) with 5 trials each showed identical plan sizes, task counts, and quality scores regardless of whether the review loop ran.

Changes:
- writing-plans: Replace subagent Plan Review Loop with inline Self-Review checklist (spec coverage, placeholder scan, type consistency)
- writing-plans: Add explicit "No Placeholders" section listing plan failures (TBD, vague descriptions, undefined references, "similar to Task N")
- brainstorming: Replace subagent Spec Review Loop with inline Spec Self-Review (placeholder scan, internal consistency, scope check, ambiguity check)
- Both skills now use "look at it with fresh eyes" framing

Testing: 5 trials with the new skill show self-review catches 3-5 real bugs per run (spawn positions, API mismatches, seed bugs, grid indexing) in ~30s instead of ~25 min. Remaining defects are comparable to the subagent approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "Replace subagent review loops with lightweight inline self-review"

This reverts commit bf8f7572eb.

* Reapply "Replace subagent review loops with lightweight inline self-review"

This reverts commit b045fa3950.
* Add v5.0.6 release notes

* Move brainstorm server metadata to .meta/ subdirectory

Metadata files (.server-info, .events, .server.pid, .server.log, .server-stopped) were stored in the same directory served over HTTP, making them accessible via the /files/ route. They now live in a .meta/ subdirectory that is not web-accessible.

Also fixes a stale test assertion ("Waiting for Claude" → "Waiting for the agent").

Reported-By: 吉田仁

* Revert "Move brainstorm server metadata to .meta/ subdirectory"

This reverts commit ab500dade6.

* Separate brainstorm server content and state into peer directories

The session directory now contains two peers: content/ (HTML served to the browser) and state/ (events, server-info, pid, log). Previously all files shared a single directory, making server state and user interaction data accessible over the /files/ HTTP route.

Also fixes stale test assertion ("Waiting for Claude" → "Waiting for the agent").

Reported-By: 吉田仁

* Fix owner-PID false positive when owner runs as different user

ownerAlive() treated EPERM (permission denied) the same as ESRCH (process not found), causing the server to self-terminate within 60s whenever the owner process ran as a different user. This affected WSL (owner is a Windows process), Tailscale SSH, and any cross-user scenario.

The fix: `return e.code === 'EPERM'` — if we get permission denied, the process is alive; we just can't signal it.

Tested on Linux via Tailscale SSH with a root-owned grandparent PID:
- Server survives past the 60s lifecycle check (EPERM = alive)
- Server still shuts down when owner genuinely dies (ESRCH = dead)

Fixes #879

* Fix owner-PID lifecycle monitoring for cross-platform reliability

Two bugs caused the brainstorm server to self-terminate within 60s:

1. ownerAlive() treated EPERM (permission denied) as "process dead". When the owner PID belongs to a different user (Tailscale SSH, system daemons), process.kill(pid, 0) throws EPERM — but the process IS alive. Fixed: return e.code === 'EPERM'.
2. On WSL, the grandparent PID resolves to a short-lived subprocess that exits before the first 60s lifecycle check. The PID is genuinely dead (ESRCH), so the EPERM fix alone doesn't help. Fixed: validate the owner PID at server startup — if it's already dead, it was a bad resolution, so disable monitoring and rely on the 30-minute idle timeout.

This also removes the Windows/MSYS2-specific OWNER_PID="" carve-out from start-server.sh, since the server now handles invalid PIDs generically at startup regardless of platform.

Tested on Linux (magic-kingdom) via Tailscale SSH:
- Root-owned owner PID (EPERM): server survives ✓
- Dead owner PID at startup (WSL sim): monitoring disabled, survives ✓
- Valid owner that dies: server shuts down within 60s ✓

Fixes #879

* Release v5.0.6: inline self-review, brainstorm server restructure, owner-PID fixes

* fix: add Copilot CLI platform detection for sessionStart context injection

Copilot CLI v1.0.11 reads `additionalContext` from sessionStart hook output, but the session-start script only emits the Claude Code-specific nested format. Add COPILOT_CLI env var detection so Copilot CLI gets the SDK-standard top-level `additionalContext` while Claude Code continues getting `hookSpecificOutput`.

Based on PR #910 by @culinablaz.

* feat: add Copilot CLI tool mapping, docs, and install instructions

- Add references/copilot-tools.md with full tool equivalence table
- Add Copilot CLI to using-superpowers skill platform instructions
- Add marketplace install instructions to README
- Add changelog entry crediting @culinablaz for the hook fix

* fix(opencode): align skills path across bootstrap, runtime, and tests

The bootstrap text advertised a configDir-based skills path that didn't match the runtime path (resolved relative to the plugin file). Tests used yet another hardcoded path and referenced a nonexistent lib/ dir.
- Remove misleading skills path from bootstrap text; the agent should use the native skill tool, not read files by path
- Fix test setup to create a consistent layout matching the plugin's ../../skills resolution
- Export SUPERPOWERS_SKILLS_DIR from setup.sh so tests use a single source of truth
- Add regression test that bootstrap doesn't advertise the old path
- Remove broken cp of nonexistent lib/ directory

Fixes #847

* docs: add OpenCode path fix to release notes

* fix(opencode): inject bootstrap as user message instead of system message

Move bootstrap injection from experimental.chat.system.transform to experimental.chat.messages.transform, prepending to the first user message instead of adding a system message. This avoids two issues:
- System messages repeated every turn inflate token usage (#750)
- Multiple system messages break Qwen and other models (#894)

Tested on OpenCode 1.3.2 with Claude Sonnet 4.5 — brainstorming skill fires correctly on "Let's make a React to do list" prompt.

* docs: update release notes with OpenCode bootstrap change

* docs: add worktree rototill design spec (PRI-974)

Design for detect-and-defer worktree support. Superpowers defers to native harness worktree systems when available, falls back to manual git worktree creation when not. Covers Phases 0-2: detection, consent, native tool preference, finishing state detection, and three bug fixes (#940, #999, #238).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: address SWE review feedback on worktree rototill spec

- Fix Bug #999 order: merge → verify → remove worktree → delete branch (avoids losing work if merge fails after worktree removal)
- Add submodule guard to Step 0 detection (GIT_DIR != GIT_COMMON is also true in submodules)
- Preserve global path (~/.config/superpowers/worktrees/) in detection for backward compatibility, just stop offering it to new users
- Add step numbering note and implementation notes section
- Expand provenance heuristic to cover global path and manual creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: honest spec revisions after issue/PR deep dive

- Step 1a is the load-bearing assumption, not just a risk — if it fails, the entire design needs rework. TDD validation must be first impl task.
- #1009 resolution depends on Step 1a working, stated explicitly
- #574 honestly deferred, not "partially addressed"
- Add hooks symlink to Step 1b (PR #965 idea, prevents silent hook loss)
- Add stale worktree pruning to Step 5 (PR #1072 idea, one-line self-heal)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add worktree rototill implementation plan (PRI-974)

5 tasks: TDD gate for Step 1a, using-git-worktrees rewrite, finishing-a-development-branch rewrite, integration updates, end-to-end validation. Task 1 is a hard gate — if native tool preference fails RED/GREEN, stop and redesign.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add RED/GREEN validation for native worktree preference (PRI-974)

Gate test for Step 1a — validates agents prefer EnterWorktree over git worktree add on Claude Code. Must pass before skill rewrite.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: rewrite using-git-worktrees with detect-and-defer (PRI-974)

Step 0: GIT_DIR != GIT_COMMON detection (skip if already isolated)
Step 0 consent: opt-in prompt before creating worktree (#991)
Step 1a: native tool preference (short, first, declarative)
Step 1b: git worktree fallback with hooks symlink and legacy path compat
Submodule guard prevents false detection
Platform-neutral instruction file references (#1049)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: rewrite finishing-a-development-branch with detect-and-defer (PRI-974)

Step 2: environment detection (GIT_DIR != GIT_COMMON) before presenting menu
Detached HEAD: reduced 3-option menu (no merge from detached HEAD)
Provenance-based cleanup: .worktrees/ = ours, anything else = hands off
Bug #940: Option 2 no longer cleans up worktree
Bug #999: merge -> verify -> remove worktree -> delete branch
Bug #238: cd to main repo root before git worktree remove
Stale worktree pruning after removal (git worktree prune)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address spec review findings in both skill rewrites (PRI-974)

using-git-worktrees: submodule guard now says "treat as normal repo" instead of "proceed to Step 1" (preserves consent flow)
using-git-worktrees: directory priority summaries include global legacy
finishing-a-development-branch: move git branch -d after Step 6 cleanup to make Bug #999 ordering unambiguous (merge -> worktree remove -> branch delete)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update worktree integration references across skills (PRI-974)

Remove REQUIRED language from executing-plans and subagent-driven-development. Consent and detection now live inside using-git-worktrees itself. Fix stale 'created by brainstorming' claim in writing-plans.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: include worktrees/ (non-hidden) in finishing provenance check (PRI-974)

The creation skill supports both .worktrees/ and worktrees/ directories, but the finishing skill's cleanup only checked .worktrees/. Worktrees under the non-hidden path would be orphaned on merge or discard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Step 1a validated through TDD — explicit naming + consent bridge (PRI-974)

Step 1a failed at 2/6 with the spec's original abstract text ("use your native tool"). Three REFACTOR iterations found what works (50/50 runs):

1. Explicit tool naming — "do you have EnterWorktree, WorktreeCreate..." transforms interpretation into factual toolkit check
2. Consent bridge — "user's consent is your authorization" directly addresses EnterWorktree's "ONLY when user explicitly asks" guardrail
3. Red Flag entry naming the specific anti-pattern

File split was tested but proven unnecessary — the fix is the Step 1a text quality, not physical separation of git commands. Control test with full 240-line skill (all git commands visible) passed 20/20.

Test script updated: supports batch runs (./test.sh green 20), "all" phase, and checks absence of git worktree add (reliable signal) rather than presence of EnterWorktree text (agent sometimes omits tool name).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update spec with TDD findings on Step 1a (PRI-974)

Step 1a's original "deliberately short, abstract" design was disproven by TDD (2/6 pass rate). Spec now documents the validated approach: explicit tool naming + consent bridge + red flag (50/50 pass rate).
- Design Principles: updated to reflect explicit naming over abstraction
- Step 1a: replaced abstract text with validated approach, added design note explaining the TDD revision and why file splitting was unnecessary
- Risks: Step 1a risk marked RESOLVED with cross-platform validation table and residual risk note about upstream tool description dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: honest cross-platform validation table in spec (PRI-974)

Research confirmed Claude Code is currently the only harness with an agent-callable mid-session worktree tool. All others either create worktrees before the agent starts (Codex App, Gemini, Cursor) or have no native support (Codex CLI, OpenCode).

Table now shows: what was actually tested (Claude Code 50/50, Codex CLI 6/6), what was simulated (Codex App 1/1), and what's untested (Gemini, Cursor, OpenCode). Step 1a is forward-compatible for when other harnesses add agent-callable tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: cross-platform validation on 5 harnesses (PRI-974)

Tested on Gemini CLI (gemini -p) and Cursor Agent (cursor-agent -p):
- Gemini: Step 0 detection 1/1, Step 1b fallback 1/1
- Cursor: Step 0 detection 1/1, Step 1b fallback 1/1

Both correctly identified no native agent-callable worktree tool, fell through to git worktree add, and performed safety verification. Both correctly detected existing worktrees and skipped creation.

5 of 6 harnesses now tested. Only OpenCode untested (no CLI access).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: remove incorrect hooks symlink step from worktree skill

Git worktrees inherit hooks from the main repo automatically via $GIT_COMMON_DIR — this has been the case since git 2.5 (2015). The symlink step was based on an incorrect premise from PR #965 and also fails in practice (.git is a file in worktrees, not a dir).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: address PR #1121 review — respect user preference, drop y/n

- Consent prompt: drop "(y/n)" and add escape valve for users who have already declared their worktree preference in global or project agent instruction files.
- Directory selection: reorder to put declared user preference ahead of observed filesystem state, and reframe the default as "if no other guidance available".
- Sandbox fallback: require explicitly informing the user that the sandbox blocked creation, not just "report accordingly".
- writing-plans: fully qualify the superpowers:using-git-worktrees reference.
- Plan doc: mirror the consent-prompt change.

Step 1a native-tool framing and the helper-scripts suggestion are still outstanding — the first needs a benchmark re-run before softer phrasing can be adopted without regressing compliance; the second is exploratory and will get a thread reply.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: soften Step 1a native-tool framing per PR #1121 review

Address obra's comment on explicit step numbers / prescriptive tone. Drops "STOP HERE if available", the "If YES:" gate, and the "even if / even if / NO EXCEPTIONS" reinforcement paragraph. Keeps the specific tool-name anchors (EnterWorktree, WorktreeCreate, /worktree, --worktree), which the original TDD data showed are load-bearing.

A/B verified against drill harness on the 3 creation/consent scenarios (consent-flow, creation-from-main, creation-from-main-spec-aware): baseline explicit wording scored 12/12 criteria, softened wording also scored 12/12. The "agent used the most appropriate tool" criterion passed in all 3 softened runs — agents still picked EnterWorktree via ToolSearch without the imperative framing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: drop instruction file enumeration per PR #1121 review

Jesse flagged that the verbose CLAUDE.md/AGENTS.md/GEMINI.md/.cursorrules enumeration (a) chews tokens, (b) confuses models that anchor on exact strings, and (c) is repeated DRY-violatingly across 3+ locations. Replace with abstract "your instructions" framing in four spots:
- skills/using-git-worktrees/SKILL.md Step 0 → Step 1 transition
- skills/using-git-worktrees/SKILL.md Step 1b Directory Selection
- docs/superpowers/plans/2026-04-06-worktree-rototill.md (both mirror locations)

Same intent, harness-agnostic phrasing, ~half the tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace hardcoded /Users/jesse with generic placeholders (#858)

* Remove the deprecated legacy slash commands (#1188)

* fix: prevent subagent-driven-development from pausing every 3 tasks

requesting-code-review had "review after each batch (3 tasks)" for executing-plans, which leaked into subagent-driven-development as a check-in cadence. Replaced with flexible "each task or at natural checkpoints" and added explicit continuous execution directive to subagent-driven-development.

* Remove Integration sections from skills

These sections don't help with steering and are a legacy of the time before agents had native skills systems.

* fix(opencode): cache bootstrap content at module level to eliminate per-step file I/O

getBootstrapContent() called fs.existsSync + fs.readFileSync + regex frontmatter parsing on every agent step with zero caching. The experimental.chat.messages.transform hook fires every step in opencode's agent loop (messages are reloaded from DB each step via filterCompactedEffect). A 10-step turn triggered 10 redundant file reads + 10 regex parses for content that never changes during a session.
Changes:
- Add module-level _bootstrapCache (undefined = not loaded, null = file missing) so the first call reads and parses SKILL.md, all subsequent calls return the cached string with zero filesystem access
- Cache the null sentinel when SKILL.md is missing, preventing repeated fs.existsSync probes
- Add _testing export (resetCache/getCache) for test infrastructure
- Clarify the injection guard comment explaining how it interacts with opencode's per-step message reloading
- Add 15 regression tests covering cache behavior, fs call counts, injection guard, missing file sentinel, cache reset, and source audit

Fixes #1202

* test(opencode): simplify bootstrap cache coverage

* docs: clarify opencode install caveats

* test(opencode): modernize integration tests

* docs: add Factory Droid installation instructions

* Preserve Codex marketplace metadata

* docs: add README quickstart install links (#1293)

* docs(codex-tools): fix subagent wait mapping to wait_agent

Update the Codex tool mapping so Claude Code 'Task returns result' maps to the current Codex spawned-agent result tool, wait_agent.

Also clarify that older Codex builds exposed spawned-agent waiting as wait, while current bare wait is the code-mode exec/wait surface for yielded exec cells.

Verified with Drill:
- codex-tool-mapping-comprehension fails against dev with task_returns_result=wait
- codex-tool-mapping-comprehension passes against this PR with task_returns_result=wait_agent and exec/wait scoped correctly
- codex-subagent-wait-mapping passes against this PR with spawn_agent -> wait_agent -> close_agent and PR963_OK returned

* fix(cursor): run SessionStart hook via run-hook.cmd on Windows

Route Cursor's Windows SessionStart hook through the existing run-hook.cmd dispatcher instead of invoking the extensionless session-start script directly. This avoids Windows opening the extensionless hook file and lets Git Bash run the script as intended.
Also removed an accidental UTF-8 BOM from hooks-cursor.json before merging.

Verified:
- hooks-cursor.json parses as JSON and has no BOM
- command is ./hooks/run-hook.cmd session-start
- CURSOR_PLUGIN_ROOT=/tmp/superpowers ./hooks/run-hook.cmd session-start emits valid Cursor JSON with additional_context

* fix(tests): make SDD integration test actually run its assertions

The SDD integration test silently bailed before printing any verification results. Three independent bugs caused this:

1. `WORKING_DIR_ESCAPED` was computed from `$SCRIPT_DIR/../..` without resolving `..` segments. The resulting "directory" name contained literal `..` so `find` was looking in a path that doesn't exist.
2. With `set -euo pipefail`, the `find ... | sort -r | head -1` pipeline could exit non-zero (SIGPIPE on the producer when head closes early), killing the script silently before assertions ran.
3. The `claude -p` invocation never passed `--plugin-dir`, so it loaded the installed plugin instead of the working tree. Local edits to skills under test were not actually being tested.

Other adjustments:
- Run claude from inside the unique TEST_PROJECT directory instead of from the plugin root, so its session JSONL lives in its own `~/.claude/projects/` folder and doesn't race other concurrent claude sessions for "most recent file".
- Use the same character-normalization claude does (every non-alphanumeric becomes `-`) when computing the session dir name; macOS-resolved `/private/var/...` paths and tmp dirs with `.`/`_` in their names need this to round-trip correctly.
- Accept either `"name":"Agent"` or `"name":"Task"` in the subagent count — the harness renamed the tool but the test wasn't updated.

Verified on this branch: all six verification tests now pass against a real end-to-end SDD run (skill invoked, 7 subagents dispatched, 6 TodoWrite calls, working code produced, tests pass, no extra features).
* feat: add Gemini CLI subagent support mapping

Map Gemini Task dispatch to @agent-name/@generalist and document parallel subagent dispatch for independent tasks.

* docs: update Codex plugin install guidance (#1288)

* Lift superpowers:code-reviewer agent into the requesting-code-review skill

The plugin had a single named agent (`agents/code-reviewer.md`) used by two skills, while every other reviewer/implementer subagent in the repo is dispatched as `general-purpose` with the prompt template living alongside its skill. That asymmetry had no upside and several costs:
- Two sources of truth for the code review checklist (the agent file and `requesting-code-review/code-reviewer.md`), both drifting independently.
- `Codex` users could not use the named agent directly; the codex-tools reference doc had a workaround section explaining how to flatten the named agent into a `worker` dispatch.
- No third-party reliance on `superpowers:code-reviewer` inside this repo.

Changes:
- Merge `agents/code-reviewer.md` (persona + checklist) and `skills/requesting-code-review/code-reviewer.md` (placeholder template) into a single self-contained Task-dispatch template, matching the shape of `implementer-prompt.md`, `spec-reviewer-prompt.md`, etc.
- Update `skills/requesting-code-review/SKILL.md` and `skills/subagent-driven-development/code-quality-reviewer-prompt.md` to dispatch `Task (general-purpose)` instead of the named agent.
- Drop the now-obsolete "Named agent dispatch" workaround sections from `codex-tools.md` and `copilot-tools.md` — superpowers no longer ships any named agents, so those instructions documented nothing.
- Delete `agents/code-reviewer.md` and the empty `agents/` directory.
Tier 3 coverage for the change: a new behavioral test `tests/claude-code/test-requesting-code-review.sh` plants real bugs (SQL injection, plaintext password handling, credential logging) into a tiny project, runs the actual `requesting-code-review` skill against the working tree, and asserts the dispatched reviewer flags every planted issue at Critical/Important severity and refuses to approve the diff.

Verified end-to-end on this branch:
- The new test passes (5/5 assertions; reviewer caught all planted bugs and several others).
- The existing SDD integration test still passes (7/7 subagents dispatched, all as `general-purpose`; spec compliance still rejects extra features; produced code is correct).
- Session JSONLs confirm zero remaining `superpowers:code-reviewer` dispatches anywhere in the SDD pipeline.

* Prepare v5.1.0: release notes and version bump

Add v5.1.0 release notes covering:
- Removals: legacy slash commands (/brainstorm, /execute-plan, /write-plan), skill Integration sections
- Worktree skills rewrite (PRI-974, PR #1121)
- Contributor guidelines for AI agents
- Codex plugin mirror tooling (PR #1165)
- OpenCode bootstrap caching (#1202)
- SDD pause-every-3-tasks fix; SDD integration test fixes
- Cursor Windows hook routing
- Gemini CLI subagent dispatch mapping
- Skill terminology cleanups
- Install docs (Factory Droid, Codex, quickstart links)

Bumps version 5.0.7 -> 5.1.0 across all declared files via scripts/bump-version.sh; not yet tagged or released.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Drew Ritter <drewritter@workerbee.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Drew Ritter <drew@primeradiant.com>
Co-authored-by: Blaž Čulina <culina.blaz@nsoft.com>
Co-authored-by: Jesse Vincent <jesse@primeradiant.com>
Co-authored-by: voidborne-d <voidborne-d@users.noreply.github.com>
Co-authored-by: Richard Luo <luo.richard@gmail.com>
Co-authored-by: Drew Ritter <drew@ritter.dev>
Co-authored-by: leonsong09 <59187950+leonsong09@users.noreply.github.com>
Co-authored-by: YuXiang Hong <41331696+starumiQAQ@users.noreply.github.com>
Co-authored-by: Sathvik Gilakamsetty <spacetime1007@gmail.com>
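
Editorial sketch (not part of the commit): the owner-PID fix described in the message above comes down to how the error from `process.kill(pid, 0)` is interpreted. The snippet below is a minimal illustration under stated assumptions, not the brainstorm server's actual source. `ownerAlive`, `process.kill(pid, 0)`, and the `EPERM`/`ESRCH` meanings come from the commit message; the injectable `kill` parameter and the `resolveMonitoredOwner` helper are hypothetical names added here so the behavior can be exercised without real processes.

```javascript
// Sketch only: assumes Node.js. `kill` is injectable (a hypothetical addition)
// so the error paths can be simulated in tests.
function ownerAlive(pid, kill = (p, s) => process.kill(p, s)) {
  try {
    // Signal 0 sends nothing; it only checks existence and permission.
    kill(pid, 0);
    return true;
  } catch (e) {
    // EPERM: the process exists but belongs to another user, so it is alive.
    // ESRCH (or any other error): treat the process as not alive.
    return e.code === 'EPERM';
  }
}

// Hypothetical helper mirroring the startup validation described above:
// if the owner PID is already dead at startup, return null to signal that
// monitoring should be disabled and the idle timeout relied on instead.
function resolveMonitoredOwner(pid, kill) {
  return ownerAlive(pid, kill) ? pid : null;
}

// Usage: the current process is always a live PID.
console.log(ownerAlive(process.pid));
```

Returning `e.code === 'EPERM'` rather than a bare `false` is the whole fix: a permission error is evidence that the process exists, which is the case the original check got wrong for cross-user owners.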
Require session transcript for new-harness PRs
2026-05-05 16:49:04 +08:00 · 2026-05-04 15:05:01 -07:00 · 2026-04-30 14:08:41 -07:00
23 changed files with 554 additions and 430 deletions
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -9,7 +9,7 @@
    {
      "name": "superpowers",
      "description": "Core skills library for Claude Code: TDD, debugging, collaboration patterns, and proven techniques",
-      "version": "5.0.7",
+      "version": "5.1.0",
      "source": "./",
      "author": {
        "name": "Jesse Vincent",
--- a/.claude-plugin/plugin.json
+++ b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
  "name": "superpowers",
  "description": "Core skills library for Claude Code: TDD, debugging, collaboration patterns, and proven techniques",
-  "version": "5.0.7",
+  "version": "5.1.0",
  "author": {
    "name": "Jesse Vincent",
    "email": "jesse@fsck.com"
--- a/.codex-plugin/plugin.json
+++ b/.codex-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
  "name": "superpowers",
-  "version": "5.0.7",
+  "version": "5.1.0",
  "description": "An agentic skills framework & software development methodology that works: planning, TDD, debugging, and collaboration workflows.",
  "author": {
    "name": "Jesse Vincent",
--- a/.codex/INSTALL.md
+++ b/.codex/INSTALL.md
@@ -1,67 +0,0 @@
-# Installing Superpowers for Codex
-
-Enable superpowers skills in Codex via native skill discovery. Just clone and symlink.
-
-## Prerequisites
-
-- Git
-
-## Installation
-
-1. **Clone the superpowers repository:**
-   ```bash
-   git clone https://github.com/obra/superpowers.git ~/.codex/superpowers
-   ```
-
-2. **Create the skills symlink:**
-   ```bash
-   mkdir -p ~/.agents/skills
-   ln -s ~/.codex/superpowers/skills ~/.agents/skills/superpowers
-   ```
-
-   **Windows (PowerShell):**
-   ```powershell
-   New-Item -ItemType Directory -Force -Path "$env:USERPROFILE\.agents\skills"
-   cmd /c mklink /J "$env:USERPROFILE\.agents\skills\superpowers" "$env:USERPROFILE\.codex\superpowers\skills"
-   ```
-
-3. **Restart Codex** (quit and relaunch the CLI) to discover the skills.
-
-## Migrating from old bootstrap
-
-If you installed superpowers before native skill discovery, you need to:
-
-1. **Update the repo:**
-   ```bash
-   cd ~/.codex/superpowers && git pull
-   ```
-
-2. **Create the skills symlink** (step 2 above) — this is the new discovery mechanism.
-
-3. **Remove the old bootstrap block** from `~/.codex/AGENTS.md` — any block referencing `superpowers-codex bootstrap` is no longer needed.
-
-4. **Restart Codex.**
-
-## Verify
-
-```bash
-ls -la ~/.agents/skills/superpowers
-```
-
-You should see a symlink (or junction on Windows) pointing to your superpowers skills directory.
-
-## Updating
-
-```bash
-cd ~/.codex/superpowers && git pull
-```
-
-Skills update instantly through the symlink.
-
-## Uninstalling
-
-```bash
-rm ~/.agents/skills/superpowers
-```
-
-Optionally delete the clone: `rm -rf ~/.codex/superpowers`.
--- a/.cursor-plugin/plugin.json
+++ b/.cursor-plugin/plugin.json
@@ -2,7 +2,7 @@
  "name": "superpowers",
  "displayName": "Superpowers",
  "description": "Core skills library: TDD, debugging, collaboration patterns, and proven techniques",
-  "version": "5.0.7",
+  "version": "5.1.0",
  "author": {
    "name": "Jesse Vincent",
    "email": "jesse@fsck.com"
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -50,6 +50,45 @@ of human involvement will be closed without review.
 |-------------------------------------|-----------------|-------|------------------|
 |                                     |                 |       |                  |

+## New harness support (required if this PR adds a new harness)
+
+<!-- If this PR adds support for a new harness (IDE, CLI tool, agent
+     runner), you MUST include a session transcript proving the
+     integration actually works.
+
+     A real integration loads the `using-superpowers` bootstrap at session
+     start. The bootstrap is what causes skills to auto-trigger. Without
+     it, the skills are dead weight — present on disk but never invoked
+     at the right moments.
+
+     ACCEPTANCE TEST: Open a clean session in the new harness and send
+     exactly this user message:
+
+         Let's make a react todo list
+
+     A working integration auto-triggers the `brainstorming` skill before
+     any code is written. Paste the complete transcript below.
+
+     These are NOT real integrations and PRs that ship them will be closed:
+
+     - Manually copying skill files into the harness
+     - Wrapping with `npx skills` or similar at-runtime shims
+     - Anything that requires the user to opt in to skills per-session
+     - Anything where brainstorming does not auto-trigger on the test above
+
+     If you are not sure whether your integration loads the bootstrap at
+     session start, it does not.
+-->
+
+<details>
+<summary>Clean-session transcript for "Let's make a react todo list"</summary>
+
+```
+paste the complete transcript here
+```
+
+</details>
+
 ## Evaluation
 - What was the initial prompt you (or your human partner) used to start
  the session that led to this change?
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -64,6 +64,27 @@ PRs containing invented claims, fabricated problem descriptions, or hallucinated

 PRs containing multiple unrelated changes will be closed. Split them into separate PRs.

+## New Harness Support
+
+If your PR adds support for a new harness (IDE, CLI tool, agent runner), you MUST include a session transcript proving the integration works end-to-end.
+
+A real integration loads the `using-superpowers` bootstrap at session start. The bootstrap is what causes skills to auto-trigger at the right moments. Without it, the skills are dead weight — present on disk but never invoked.
+
+**The acceptance test.** Open a clean session in the new harness and send exactly this user message:
+
+> Let's make a react todo list
+
+A working integration auto-triggers the `brainstorming` skill before any code is written. Paste the complete transcript in the PR.
+
+**These are not real integrations and will be closed:**
+
+- Manually copying skill files into the harness
+- Wrapping with `npx skills` or similar at-runtime shims
+- Anything that requires the user to opt in to skills per-session
+- Anything where `brainstorming` does not auto-trigger on the acceptance test above
+
+If you are not sure whether your integration loads the bootstrap at session start, it does not.
+
 ## Skill Changes Require Evaluation

 Skills are not prose — they are code that shapes agent behavior. If you modify skill content:
--- a/RELEASE-NOTES.md
+++ b/RELEASE-NOTES.md
@@ -1,5 +1,89 @@
 # Superpowers Release Notes

+## v5.1.0 (2026-04-30)
+
+### Removals
+
+- **Legacy slash commands removed** — `/brainstorm`, `/execute-plan`, and `/write-plan` are gone. They were deprecated stubs that did nothing but tell the user to invoke the corresponding skill. Invoke `superpowers:brainstorming`, `superpowers:executing-plans`, and `superpowers:writing-plans` directly instead. (#1188)
+- **`superpowers:code-reviewer` named agent removed** — the agent was the plugin's only named agent and was used by exactly two skills, while every other reviewer/implementer subagent in the repo dispatches `general-purpose` with a prompt template alongside its skill. The agent's persona and checklist have been merged into `skills/requesting-code-review/code-reviewer.md` as a self-contained Task-dispatch template. Anyone dispatching `Task (superpowers:code-reviewer)` should switch to `Task (general-purpose)` with the prompt template instead. (PR #1299)
+- **Integration sections removed from skills** — these were a legacy of the time before agents had native skills systems and didn't help with steering.
+
+### Worktree Skills Rewrite
+
+`using-git-worktrees` and `finishing-a-development-branch` now detect when the agent is already running inside an isolated worktree and prefer the harness's native worktree controls before falling back to `git worktree`. Behavior was TDD-validated and cross-platform-checked across five harnesses. (PRI-974, PR #1121)
+
+- **Environment detection** — both skills check `GIT_DIR != GIT_COMMON` before doing anything; if already in a linked worktree, creation is skipped entirely. A submodule guard prevents false detection.
+- **Consent before creating worktrees** — `using-git-worktrees` no longer creates worktrees implicitly; the skill asks the user first. Fixes #991 (subagent-driven-development was auto-creating worktrees without consent).
+- **Native tool preference (Step 1a)** — when the harness exposes its own worktree tool (e.g. Codex), the skill defers to it. The user's stated preference is respected when expressed.
+- **Provenance-based cleanup** — `finishing-a-development-branch` only cleans up worktrees inside `.worktrees/` (created by superpowers); anything outside is left alone. Fixes #940 (Option 2 was incorrectly cleaning up worktrees), #999 (merge-then-remove ordering), and #238 (`cd` to repo root before `git worktree remove`).
+- **Detached HEAD handling** — the finishing menu collapses to two options when there is no branch to merge from.
+- **Hardcoded `/Users/jesse` paths** in skill examples replaced with generic placeholders. (#858, PR #1122)
+
+### Contributor Guidelines for AI Agents
+
+Two new sections at the top of `CLAUDE.md` (symlinked to `AGENTS.md`) speak directly to AI agents. An audit of the last 100 closed PRs against this repo showed a 94% rejection rate driven by AI-generated slop: agents that didn't read the PR template, opened duplicates, fabricated problem descriptions, or pushed fork- or domain-specific changes upstream.
+
+- **Pre-submission checklist** — read the PR template, search for existing PRs, verify a real problem exists, confirm the change belongs in core, and show the human partner the complete diff before submitting.
+- **What we will not accept** — third-party dependencies, "compliance" rewrites of skill content, project-specific configuration, bulk PRs, speculative fixes, domain-specific skills, fork-specific changes, fabricated content, and bundled unrelated changes.
+- **New harness PRs require a session transcript** — most past new-harness integrations copied skill files or wrapped with `npx skills` instead of loading the `using-superpowers` bootstrap at session start. The acceptance test ("Let's make a react todo list" must auto-trigger `brainstorming` in a clean session) and a complete transcript are now required.
+
+### Codex Plugin Mirror Tooling
+
+New `sync-to-codex-plugin` script mirrors superpowers into the OpenAI Codex plugin marketplace as `prime-radiant-inc/openai-codex-plugins`. Path/user-agnostic so any team member can run it. (PR #1165)
+
+- Clones the fork fresh into a temp directory per run, regenerates overlays inline, and opens a PR; auto-detects upstream from the script's own location and preflights `rsync`/`git`/`gh auth`/`python3`.
+- `--bootstrap` flag for first-time setup; `EXCLUDES` patterns anchored to source root; `assets/` excluded.
+- Mirrors `CODE_OF_CONDUCT.md`; drops the `agents/openai.yaml` overlay.
+- Seeds `interface.defaultPrompt` in the mirrored `plugin.json`. (PR #1180 by @arittr)
+- Codex plugin files are committed to the source repo so the sync script uses canonical versions; Codex marketplace metadata is preserved.
+
+### OpenCode
+
+- **Bootstrap content cached at module level** — `getBootstrapContent()` was calling `fs.existsSync` + `fs.readFileSync` + frontmatter regex on every agent step (the `experimental.chat.messages.transform` hook fires on every step in OpenCode's agent loop). Now read once, cached for the session lifetime, with a null sentinel for the missing-file case. 15 regression tests cover cache behavior, fs call counts, the injection guard, the missing-file sentinel, and cache reset. (Fixes #1202)
+- **Integration tests modernized**.
+- **Install caveats clarified** in the README.
+
+### Code Review Consolidation
+
+`requesting-code-review` is now self-contained: the persona, checklist, and dispatch template live in `skills/requesting-code-review/code-reviewer.md` and the skill dispatches `Task (general-purpose)` directly. (PR #1299)
+
+- **Single source of truth** — the persona/checklist that previously lived in both `agents/code-reviewer.md` and the skill's placeholder template (and drifted independently) is now one file.
+- **`subagent-driven-development` follows suit** — its `code-quality-reviewer-prompt.md` now dispatches `Task (general-purpose)` instead of the named agent.
+- **Behavioral test added** — `tests/claude-code/test-requesting-code-review.sh` plants real bugs (SQL injection, plaintext password handling, credential logging) into a tiny project and asserts the dispatched reviewer flags every planted issue at Critical/Important severity and refuses to approve the diff.
+- **Codex and Copilot workaround docs trimmed** — the "Named agent dispatch" sections in `references/codex-tools.md` and `references/copilot-tools.md` documented how to flatten a named agent into a generic dispatch. With no named agents shipping, the workaround is unnecessary; both sections were dropped.
+
+### Subagent-Driven Development
+
+- **No more pause every 3 tasks** — the "review after each batch (3 tasks)" cadence in `requesting-code-review` (originally for `executing-plans`) was leaking into `subagent-driven-development`. Replaced with "each task or at natural checkpoints" plus an explicit continuous-execution directive.
+- **SDD integration test now runs its assertions** — three independent bugs caused the test to silently bail before printing any verification results: an unresolved `..` segment in the working-dir path, a `set -euo pipefail` interaction with `find | sort | head -1` (SIGPIPE on the producer killed the script), and a missing `--plugin-dir` on the `claude -p` invocation that caused the test to load the installed plugin instead of the working tree. All three fixed; six verification tests now actually run against a real end-to-end SDD run.
+
+### Cursor
+
+- **Windows SessionStart hook** routed through `run-hook.cmd` instead of invoking the extensionless `session-start` script directly. Fixes Windows opening the file in an editor instead of running it. Also removed an accidental UTF-8 BOM from `hooks-cursor.json`.
+
+### Gemini CLI
+
+- **Subagent dispatch mapping** — Gemini's `Task` dispatch now maps to `@agent-name` / `@generalist`, with parallel subagent dispatch documented for independent tasks.
+
+### Skills
+
+- **Terminology cleanups** across skill content.
+
+### Documentation & Install
+
+- **Factory Droid installation instructions** added to README.
+- **Quickstart install links** in README. (PR #1293 by @arittr)
+- **Codex plugin install guidance** updated. (PR #1288 by @arittr)
+- **Codex `wait` mapping corrected** to `wait_agent` in the tools reference.
+- **Install order reorganized**; Codex install instructions cleaned up.
+- **Removed vestigial `CHANGELOG.md`** in favor of `RELEASE-NOTES.md` as the single source. (PR #1163 by @shaanmajid)
+- **Discord invite link** fixed; release announcements link and a detailed Discord description added to the Community section.
+
+### Community
+
+- @shaanmajid — vestigial `CHANGELOG.md` removal (PR #1163)
+- @arittr — README quickstart install links (#1293), Codex plugin install guidance (#1288), `sync-to-codex-plugin` `interface.defaultPrompt` seed (#1180)
+
 ## v5.0.7 (2026-03-31)

 ### GitHub Copilot CLI Support
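Note on the hunk above: the environment detection described in the v5.1.0 notes (`GIT_DIR != GIT_COMMON`) can be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not the skills' actual implementation. The function names `is_linked_worktree` and `detect_worktree` are hypothetical, resolving both paths with `cd ... && pwd -P` is an assumed way to make the comparison robust, and the submodule guard mentioned in the notes is not covered.

```shell
# Hypothetical helper: succeeds (exit 0) when the two already-resolved paths
# differ, which is the GIT_DIR != GIT_COMMON condition from the release notes.
is_linked_worktree() {
  local git_dir="$1" git_common="$2"
  [ "$git_dir" != "$git_common" ]
}

# Hypothetical wrapper: resolves both directories for the current working
# directory and prints "linked" or "main". Returns 2 outside a git repository.
detect_worktree() {
  local raw_dir raw_common git_dir git_common
  raw_dir=$(git rev-parse --git-dir 2>/dev/null) || return 2
  raw_common=$(git rev-parse --git-common-dir 2>/dev/null) || return 2
  # git may print one path as relative and the other as absolute, so resolve
  # both to physical absolute paths before comparing them.
  git_dir=$(cd "$raw_dir" && pwd -P) || return 2
  git_common=$(cd "$raw_common" && pwd -P) || return 2
  if is_linked_worktree "$git_dir" "$git_common"; then
    echo linked
  else
    echo main
  fi
}
```

Run from a checkout created with `git worktree add`, `detect_worktree` would be expected to print `linked`; from the primary checkout it would print `main`. Keeping the comparison in its own function makes the condition checkable without a repository.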
--- a/agents/code-reviewer.md
+++ b/agents/code-reviewer.md
@@ -1,48 +0,0 @@
----
-name: code-reviewer
-description: |
-  Use this agent when a major project step has been completed and needs to be reviewed against the original plan and coding standards. Examples: <example>Context: The user is creating a code-review agent that should be called after a logical chunk of code is written. user: "I've finished implementing the user authentication system as outlined in step 3 of our plan" assistant: "Great work! Now let me use the code-reviewer agent to review the implementation against our plan and coding standards" <commentary>Since a major project step has been completed, use the code-reviewer agent to validate the work against the plan and identify any issues.</commentary></example> <example>Context: User has completed a significant feature implementation. user: "The API endpoints for the task management system are now complete - that covers step 2 from our architecture document" assistant: "Excellent! Let me have the code-reviewer agent examine this implementation to ensure it aligns with our plan and follows best practices" <commentary>A numbered step from the planning document has been completed, so the code-reviewer agent should review the work.</commentary></example>
-model: inherit
----
-
-You are a Senior Code Reviewer with expertise in software architecture, design patterns, and best practices. Your role is to review completed project steps against original plans and ensure code quality standards are met.
-
-When reviewing completed work, you will:
-
-1. **Plan Alignment Analysis**:
-   - Compare the implementation against the original planning document or step description
-   - Identify any deviations from the planned approach, architecture, or requirements
-   - Assess whether deviations are justified improvements or problematic departures
-   - Verify that all planned functionality has been implemented
-
-2. **Code Quality Assessment**:
-   - Review code for adherence to established patterns and conventions
-   - Check for proper error handling, type safety, and defensive programming
-   - Evaluate code organization, naming conventions, and maintainability
-   - Assess test coverage and quality of test implementations
-   - Look for potential security vulnerabilities or performance issues
-
-3. **Architecture and Design Review**:
-   - Ensure the implementation follows SOLID principles and established architectural patterns
-   - Check for proper separation of concerns and loose coupling
-   - Verify that the code integrates well with existing systems
-   - Assess scalability and extensibility considerations
-
-4. **Documentation and Standards**:
-   - Verify that code includes appropriate comments and documentation
-   - Check that file headers, function documentation, and inline comments are present and accurate
-   - Ensure adherence to project-specific coding standards and conventions
-
-5. **Issue Identification and Recommendations**:
-   - Clearly categorize issues as: Critical (must fix), Important (should fix), or Suggestions (nice to have)
-   - For each issue, provide specific examples and actionable recommendations
-   - When you identify plan deviations, explain whether they're problematic or beneficial
-   - Suggest specific improvements with code examples when helpful
-
-6. **Communication Protocol**:
-   - If you find significant deviations from the plan, ask the coding agent to review and confirm the changes
-   - If you identify issues with the original plan itself, recommend plan updates
-   - For implementation problems, provide clear guidance on fixes needed
-   - Always acknowledge what was done well before highlighting issues
-
-Your output should be structured, actionable, and focused on helping maintain high code quality while ensuring project goals are met. Be thorough but concise, and always provide constructive feedback that helps improve both the current implementation and future development practices.
--- a/docs/README.codex.md
+++ b/docs/README.codex.md
@@ -1,126 +0,0 @@
-# Superpowers for Codex
-
-Guide for using Superpowers with OpenAI Codex via native skill discovery.
-
-## Quick Install
-
-Tell Codex:
-
-```
-Fetch and follow instructions from https://raw.githubusercontent.com/obra/superpowers/refs/heads/main/.codex/INSTALL.md
-```
-
-## Manual Installation
-
-### Prerequisites
-
-- OpenAI Codex CLI
-- Git
-
-### Steps
-
-1. Clone the repo:
-   ```bash
-   git clone https://github.com/obra/superpowers.git ~/.codex/superpowers
-   ```
-
-2. Create the skills symlink:
-   ```bash
-   mkdir -p ~/.agents/skills
-   ln -s ~/.codex/superpowers/skills ~/.agents/skills/superpowers
-   ```
-
-3. Restart Codex.
-
-4. **For subagent skills** (optional): Skills like `dispatching-parallel-agents` and `subagent-driven-development` require Codex's multi-agent feature. Add to your Codex config:
-   ```toml
-   [features]
-   multi_agent = true
-   ```
-
-### Windows
-
-Use a junction instead of a symlink (works without Developer Mode):
-
-```powershell
-New-Item -ItemType Directory -Force -Path "$env:USERPROFILE\.agents\skills"
-cmd /c mklink /J "$env:USERPROFILE\.agents\skills\superpowers" "$env:USERPROFILE\.codex\superpowers\skills"
-```
-
-## How It Works
-
-Codex has native skill discovery — it scans `~/.agents/skills/` at startup, parses SKILL.md frontmatter, and loads skills on demand. Superpowers skills are made visible through a single symlink:
-
-```
-~/.agents/skills/superpowers/ → ~/.codex/superpowers/skills/
-```
-
-The `using-superpowers` skill is discovered automatically and enforces skill usage discipline — no additional configuration needed.
-
-## Usage
-
-Skills are discovered automatically. Codex activates them when:
-- You mention a skill by name (e.g., "use brainstorming")
-- The task matches a skill's description
-- The `using-superpowers` skill directs Codex to use one
-
-### Personal Skills
-
-Create your own skills in `~/.agents/skills/`:
-
-```bash
-mkdir -p ~/.agents/skills/my-skill
-```
-
-Create `~/.agents/skills/my-skill/SKILL.md`:
-
-```markdown
----
-name: my-skill
-description: Use when [condition] - [what it does]
----
-
-# My Skill
-
-[Your skill content here]
-```
-
-The `description` field is how Codex decides when to activate a skill automatically — write it as a clear trigger condition.
-
-## Updating
-
-```bash
-cd ~/.codex/superpowers && git pull
-```
-
-Skills update instantly through the symlink.
-
-## Uninstalling
-
-```bash
-rm ~/.agents/skills/superpowers
-```
-
-**Windows (PowerShell):**
-```powershell
-Remove-Item "$env:USERPROFILE\.agents\skills\superpowers"
-```
-
-Optionally delete the clone: `rm -rf ~/.codex/superpowers` (Windows: `Remove-Item -Recurse -Force "$env:USERPROFILE\.codex\superpowers"`).
-
-## Troubleshooting
-
-### Skills not showing up
-
-1. Verify the symlink: `ls -la ~/.agents/skills/superpowers`
-2. Check skills exist: `ls ~/.codex/superpowers/skills`
-3. Restart Codex — skills are discovered at startup
-
-### Windows junction issues
-
-Junctions normally work without special permissions. If creation fails, try running PowerShell as administrator.
-
-## Getting Help
-
-- Report issues: https://github.com/obra/superpowers/issues
-- Main documentation: https://github.com/obra/superpowers
--- a/gemini-extension.json
+++ b/gemini-extension.json
@@ -1,6 +1,6 @@
 {
  "name": "superpowers",
  "description": "Core skills library: TDD, debugging, collaboration patterns, and proven techniques",
-  "version": "5.0.7",
+  "version": "5.1.0",
  "contextFileName": "GEMINI.md"
 }
--- a/hooks/hooks-cursor.json
+++ b/hooks/hooks-cursor.json
@@ -3,7 +3,7 @@
  "hooks": {
    "sessionStart": [
      {
-        "command": "./hooks/session-start"
+        "command": "./hooks/run-hook.cmd session-start"
      }
    ]
  }
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
  "name": "superpowers",
-  "version": "5.0.7",
+  "version": "5.1.0",
  "type": "module",
  "main": ".opencode/plugins/superpowers.js"
 }
--- a/skills/requesting-code-review/SKILL.md
+++ b/skills/requesting-code-review/SKILL.md
@@ -5,7 +5,7 @@ description: Use when completing tasks, implementing major features, or before m

 # Requesting Code Review

-Dispatch superpowers:code-reviewer subagent to catch issues before they cascade. The reviewer gets precisely crafted context for evaluation — never your session's history. This keeps the reviewer focused on the work product, not your thought process, and preserves your own context for continued work.
+Dispatch a code reviewer subagent to catch issues before they cascade. The reviewer gets precisely crafted context for evaluation — never your session's history. This keeps the reviewer focused on the work product, not your thought process, and preserves your own context for continued work.

 **Core principle:** Review early, review often.

@@ -29,16 +29,15 @@ BASE_SHA=$(git rev-parse HEAD~1)  # or origin/main
 HEAD_SHA=$(git rev-parse HEAD)
 ```

-**2. Dispatch code-reviewer subagent:**
+**2. Dispatch code reviewer subagent:**

-Use Task tool with superpowers:code-reviewer type, fill template at `code-reviewer.md`
+Use Task tool with `general-purpose` type, fill template at `code-reviewer.md`

 **Placeholders:**
-- `{WHAT_WAS_IMPLEMENTED}` - What you just built
+- `{DESCRIPTION}` - Brief summary of what you built
 - `{PLAN_OR_REQUIREMENTS}` - What it should do
 - `{BASE_SHA}` - Starting commit
 - `{HEAD_SHA}` - Ending commit
-- `{DESCRIPTION}` - Brief summary

 **3. Act on feedback:**
 - Fix Critical issues immediately
@@ -56,12 +55,11 @@ You: Let me request code review before proceeding.
 BASE_SHA=$(git log --oneline | grep "Task 1" | head -1 | awk '{print $1}')
 HEAD_SHA=$(git rev-parse HEAD)

-[Dispatch superpowers:code-reviewer subagent]
-  WHAT_WAS_IMPLEMENTED: Verification and repair functions for conversation index
+[Dispatch code reviewer subagent]
+  DESCRIPTION: Added verifyIndex() and repairIndex() with 4 issue types
  PLAN_OR_REQUIREMENTS: Task 2 from docs/superpowers/plans/deployment-plan.md
  BASE_SHA: a7981ec
  HEAD_SHA: 3df7661
-  DESCRIPTION: Added verifyIndex() and repairIndex() with 4 issue types

 [Subagent returns]:
  Strengths: Clean architecture, real tests
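Note on the hunks above: the placeholder list (`{DESCRIPTION}`, `{PLAN_OR_REQUIREMENTS}`, `{BASE_SHA}`, `{HEAD_SHA}`) implies a fill step before the reviewer is dispatched. The sketch below shows one way that substitution might be done by hand in shell. The helpers `replace_all` and `fill_template` and the one-line sample template are hypothetical illustrations; in practice the dispatching agent fills `code-reviewer.md` itself rather than running a script.

```shell
# Hypothetical helper: replace every literal occurrence of $2 in $1 with $3.
# Only parameter expansion is used, so values containing "/" or "&" are safe.
replace_all() {
  local rest="$1" key="$2" value="$3" result=""
  # An empty key would match forever, so return the input unchanged.
  if [ -z "$key" ]; then
    printf '%s' "$rest"
    return 0
  fi
  while :; do
    case "$rest" in
      *"$key"*)
        # Keep the text before the first match, append the value,
        # then continue with the text after that match.
        result=$result${rest%%"$key"*}$value
        rest=${rest#*"$key"}
        ;;
      *) break ;;
    esac
  done
  printf '%s' "$result$rest"
}

# Hypothetical helper: fill the four placeholders named in the skill.
# Arguments: template, description, plan or requirements, base SHA, head SHA.
fill_template() {
  local out="$1"
  out=$(replace_all "$out" '{DESCRIPTION}' "$2")
  out=$(replace_all "$out" '{PLAN_OR_REQUIREMENTS}' "$3")
  out=$(replace_all "$out" '{BASE_SHA}' "$4")
  out=$(replace_all "$out" '{HEAD_SHA}' "$5")
  printf '%s\n' "$out"
}
```

For example, `fill_template 'Review {DESCRIPTION}: {BASE_SHA}..{HEAD_SHA} vs {PLAN_OR_REQUIREMENTS}' 'verifyIndex()' 'Task 2' a7981ec 3df7661` prints `Review verifyIndex(): a7981ec..3df7661 vs Task 2`. Note that command substitution strips trailing newlines from the template, and a value that itself contains a later placeholder would be substituted again.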
--- a/skills/requesting-code-review/code-reviewer.md
+++ b/skills/requesting-code-review/code-reviewer.md
@@ -1,111 +1,133 @@
-# Code Review Agent
+# Code Reviewer Prompt Template

-You are reviewing code changes for production readiness.
+Use this template when dispatching a code reviewer subagent.

-**Your task:**
-1. Review {WHAT_WAS_IMPLEMENTED}
-2. Compare against {PLAN_OR_REQUIREMENTS}
-3. Check code quality, architecture, testing
-4. Categorize issues by severity
-5. Assess production readiness
+**Purpose:** Review completed work against requirements and code quality standards before it cascades into more work.

-## What Was Implemented
+```
+Task tool (general-purpose):
+  description: "Review code changes"
+  prompt: |
+    You are a Senior Code Reviewer with expertise in software architecture,
+    design patterns, and best practices. Your job is to review completed work
+    against its plan or requirements and identify issues before they cascade.

-{DESCRIPTION}
+    ## What Was Implemented

-## Requirements/Plan
+    {DESCRIPTION}

-{PLAN_REFERENCE}
+    ## Requirements / Plan

-## Git Range to Review
+    {PLAN_OR_REQUIREMENTS}

-**Base:** {BASE_SHA}
-**Head:** {HEAD_SHA}
+    ## Git Range to Review

-```bash
-git diff --stat {BASE_SHA}..{HEAD_SHA}
-git diff {BASE_SHA}..{HEAD_SHA}
+    **Base:** {BASE_SHA}
+    **Head:** {HEAD_SHA}
+
+    ```bash
+    git diff --stat {BASE_SHA}..{HEAD_SHA}
+    git diff {BASE_SHA}..{HEAD_SHA}
+    ```
+
+    ## What to Check
+
+    **Plan alignment:**
+    - Does the implementation match the plan / requirements?
+    - Are deviations justified improvements, or problematic departures?
+    - Is all planned functionality present?
+
+    **Code quality:**
+    - Clean separation of concerns?
+    - Proper error handling?
+    - Type safety where applicable?
+    - DRY without premature abstraction?
+    - Edge cases handled?
+
+    **Architecture:**
+    - Sound design decisions?
+    - Reasonable scalability and performance?
+    - Security concerns?
+    - Integrates cleanly with surrounding code?
+
+    **Testing:**
+    - Tests verify real behavior, not mocks?
+    - Edge cases covered?
+    - Integration tests where they matter?
+    - All tests passing?
+
+    **Production readiness:**
+    - Migration strategy if schema changed?
+    - Backward compatibility considered?
+    - Documentation complete?
+    - No obvious bugs?
+
+    ## Calibration
+
+    Categorize issues by actual severity. Not everything is Critical.
+    Acknowledge what was done well before listing issues — accurate praise
+    helps the implementer trust the rest of the feedback.
+
+    If you find significant deviations from the plan, flag them specifically
+    so the implementer can confirm whether the deviation was intentional.
+    If you find issues with the plan itself rather than the implementation,
+    say so.
+
+    ## Output Format
+
+    ### Strengths
+    [What's well done? Be specific.]
+
+    ### Issues
+
+    #### Critical (Must Fix)
+    [Bugs, security issues, data loss risks, broken functionality]
+
+    #### Important (Should Fix)
+    [Architecture problems, missing features, poor error handling, test gaps]
+
+    #### Minor (Nice to Have)
+    [Code style, optimization opportunities, documentation polish]
+
+    For each issue:
+    - File:line reference
+    - What's wrong
+    - Why it matters
+    - How to fix (if not obvious)
+
+    ### Recommendations
+    [Improvements for code quality, architecture, or process]
+
+    ### Assessment
+
+    **Ready to merge?** [Yes | No | With fixes]
+
+    **Reasoning:** [1-2 sentence technical assessment]
+
+    ## Critical Rules
+
+    **DO:**
+    - Categorize by actual severity
+    - Be specific (file:line, not vague)
+    - Explain WHY each issue matters
+    - Acknowledge strengths
+    - Give a clear verdict
+
+    **DON'T:**
+    - Say "looks good" without checking
+    - Mark nitpicks as Critical
+    - Give feedback on code you didn't actually read
+    - Be vague ("improve error handling")
+    - Avoid giving a clear verdict
 ```

-## Review Checklist
+**Placeholders:**
+- `{DESCRIPTION}` — brief summary of what was built
+- `{PLAN_OR_REQUIREMENTS}` — what it should do (plan file path, task text, or requirements)
+- `{BASE_SHA}` — starting commit
+- `{HEAD_SHA}` — ending commit

-**Code Quality:**
-- Clean separation of concerns?
-- Proper error handling?
-- Type safety (if applicable)?
-- DRY principle followed?
-- Edge cases handled?
-
-**Architecture:**
-- Sound design decisions?
-- Scalability considerations?
-- Performance implications?
-- Security concerns?
-
-**Testing:**
-- Tests actually test logic (not mocks)?
-- Edge cases covered?
-- Integration tests where needed?
-- All tests passing?
-
-**Requirements:**
-- All plan requirements met?
-- Implementation matches spec?
-- No scope creep?
-- Breaking changes documented?
-
-**Production Readiness:**
-- Migration strategy (if schema changes)?
-- Backward compatibility considered?
-- Documentation complete?
-- No obvious bugs?
-
-## Output Format
-
-### Strengths
-[What's well done? Be specific.]
-
-### Issues
-
-#### Critical (Must Fix)
-[Bugs, security issues, data loss risks, broken functionality]
-
-#### Important (Should Fix)
-[Architecture problems, missing features, poor error handling, test gaps]
-
-#### Minor (Nice to Have)
-[Code style, optimization opportunities, documentation improvements]
-
-**For each issue:**
-- File:line reference
-- What's wrong
-- Why it matters
-- How to fix (if not obvious)
-
-### Recommendations
-[Improvements for code quality, architecture, or process]
-
-### Assessment
-
-**Ready to merge?** [Yes/No/With fixes]
-
-**Reasoning:** [Technical assessment in 1-2 sentences]
-
-## Critical Rules
-
-**DO:**
-- Categorize by actual severity (not everything is Critical)
-- Be specific (file:line, not vague)
-- Explain WHY issues matter
-- Acknowledge strengths
-- Give clear verdict
-
-**DON'T:**
-- Say "looks good" without checking
-- Mark nitpicks as Critical
-- Give feedback on code you didn't review
-- Be vague ("improve error handling")
-- Avoid giving a clear verdict
+**Reviewer returns:** Strengths, Issues (Critical / Important / Minor), Recommendations, Assessment

 ## Example Output

--- a/skills/subagent-driven-development/code-quality-reviewer-prompt.md
+++ b/skills/subagent-driven-development/code-quality-reviewer-prompt.md
@@ -7,14 +7,13 @@ Use this template when dispatching a code quality reviewer subagent.
 **Only dispatch after spec compliance review passes.**

 ```
-Task tool (superpowers:code-reviewer):
+Task tool (general-purpose):
  Use template at requesting-code-review/code-reviewer.md

-  WHAT_WAS_IMPLEMENTED: [from implementer's report]
+  DESCRIPTION: [task summary, from implementer's report]
  PLAN_OR_REQUIREMENTS: Task N from [plan-file]
  BASE_SHA: [commit before task]
  HEAD_SHA: [current commit]
-  DESCRIPTION: [task summary]
 ```

 **In addition to standard code quality concerns, the reviewer should check:**
--- a/skills/using-superpowers/references/codex-tools.md
+++ b/skills/using-superpowers/references/codex-tools.md
@@ -4,9 +4,9 @@ Skills use Claude Code tool names. When you encounter these in a skill, use your

 | Skill references | Codex equivalent |
 |-----------------|------------------|
-| `Task` tool (dispatch subagent) | `spawn_agent` (see [Named agent dispatch](#named-agent-dispatch)) |
+| `Task` tool (dispatch subagent) | `spawn_agent` (see [Subagent dispatch requires multi-agent support](#subagent-dispatch-requires-multi-agent-support)) |
 | Multiple `Task` calls (parallel) | Multiple `spawn_agent` calls |
-| Task returns result | `wait` |
+| Task returns result | `wait_agent` |
 | Task completes automatically | `close_agent` to free slot |
 | `TodoWrite` (task tracking) | `update_plan` |
 | `Skill` tool (invoke a skill) | Skills load natively — just follow the instructions |
@@ -22,53 +22,12 @@ Add to your Codex config (`~/.codex/config.toml`):
 multi_agent = true
 ```

-This enables `spawn_agent`, `wait`, and `close_agent` for skills like `dispatching-parallel-agents` and `subagent-driven-development`.
+This enables `spawn_agent`, `wait_agent`, and `close_agent` for skills like `dispatching-parallel-agents` and `subagent-driven-development`.

-## Named agent dispatch
-
-Claude Code skills reference named agent types like `superpowers:code-reviewer`.
-Codex does not have a named agent registry — `spawn_agent` creates generic agents
-from built-in roles (`default`, `explorer`, `worker`).
-
-When a skill says to dispatch a named agent type:
-
-1. Find the agent's prompt file (e.g., `agents/code-reviewer.md` or the skill's
-   local prompt template like `code-quality-reviewer-prompt.md`)
-2. Read the prompt content
-3. Fill any template placeholders (`{BASE_SHA}`, `{WHAT_WAS_IMPLEMENTED}`, etc.)
-4. Spawn a `worker` agent with the filled content as the `message`
-
-| Skill instruction | Codex equivalent |
-|-------------------|------------------|
-| `Task tool (superpowers:code-reviewer)` | `spawn_agent(agent_type="worker", message=...)` with `code-reviewer.md` content |
-| `Task tool (general-purpose)` with inline prompt | `spawn_agent(message=...)` with the same prompt |
-
-### Message framing
-
-The `message` parameter is user-level input, not a system prompt. Structure it
-for maximum instruction adherence:
-
-```
-Your task is to perform the following. Follow the instructions below exactly.
-
-<agent-instructions>
-[filled prompt content from the agent's .md file]
-</agent-instructions>
-
-Execute this now. Output ONLY the structured response following the format
-specified in the instructions above.
-```
-
-- Use task-delegation framing ("Your task is...") rather than persona framing ("You are...")
-- Wrap instructions in XML tags — the model treats tagged blocks as authoritative
-- End with an explicit execution directive to prevent summarization of the instructions
-
-### When this workaround can be removed
-
-This approach compensates for Codex's plugin system not yet supporting an `agents`
-field in `plugin.json`. When `RawPluginManifest` gains an `agents` field, the
-plugin can symlink to `agents/` (mirroring the existing `skills/` symlink) and
-skills can dispatch named agent types directly.
+Legacy note: Codex builds before `rust-v0.115.0` exposed spawned-agent
+waiting as `wait`. Current Codex uses `wait_agent` for spawned agents. The
+`wait` name now belongs to code-mode `exec/wait`, which resumes a yielded exec
+cell by `cell_id`; it is not the spawned-agent result tool.

 ## Environment Detection

--- a/skills/using-superpowers/references/copilot-tools.md
+++ b/skills/using-superpowers/references/copilot-tools.md
@@ -12,23 +12,13 @@ Skills use Claude Code tool names. When you encounter these in a skill, use your
 | `Glob` (search files by name) | `glob` |
 | `Skill` tool (invoke a skill) | `skill` |
 | `WebFetch` | `web_fetch` |
-| `Task` tool (dispatch subagent) | `task` (see [Agent types](#agent-types)) |
+| `Task` tool (dispatch subagent) | `task` with `agent_type: "general-purpose"` or `"explore"` |
 | Multiple `Task` calls (parallel) | Multiple `task` calls |
 | Task status/output | `read_agent`, `list_agents` |
 | `TodoWrite` (task tracking) | `sql` with built-in `todos` table |
 | `WebSearch` | No equivalent — use `web_fetch` with a search engine URL |
 | `EnterPlanMode` / `ExitPlanMode` | No equivalent — stay in the main session |

-## Agent types
-
-Copilot CLI's `task` tool accepts an `agent_type` parameter:
-
-| Claude Code agent | Copilot CLI equivalent |
-|-------------------|----------------------|
-| `general-purpose` | `"general-purpose"` |
-| `Explore` | `"explore"` |
-| Named plugin agents (e.g. `superpowers:code-reviewer`) | Discovered automatically from installed plugins |
-
 ## Async shell sessions

 Copilot CLI supports persistent async shell sessions, which have no direct Claude Code equivalent:
--- a/skills/using-superpowers/references/gemini-tools.md
+++ b/skills/using-superpowers/references/gemini-tools.md
@@ -14,11 +14,29 @@ Skills use Claude Code tool names. When you encounter these in a skill, use your
 | `Skill` tool (invoke a skill) | `activate_skill` |
 | `WebSearch` | `google_web_search` |
 | `WebFetch` | `web_fetch` |
-| `Task` tool (dispatch subagent) | No equivalent — Gemini CLI does not support subagents |
+| `Task` tool (dispatch subagent) | `@agent-name` (see [Subagent support](#subagent-support)) |

-## No subagent support
+## Subagent support

-Gemini CLI has no equivalent to Claude Code's `Task` tool. Skills that rely on subagent dispatch (`subagent-driven-development`, `dispatching-parallel-agents`) will fall back to single-session execution via `executing-plans`.
+Gemini CLI supports subagents natively via the `@` syntax. Use the built-in `@generalist` agent to dispatch any task — it has access to all tools and follows the prompt you provide.
+
+When a skill says to dispatch a named agent type, use `@generalist` with the full prompt from the skill's prompt template:
+
+| Skill instruction | Gemini CLI equivalent |
+|-------------------|----------------------|
+| `Task tool (superpowers:implementer)` | `@generalist` with the filled `implementer-prompt.md` template |
+| `Task tool (superpowers:spec-reviewer)` | `@generalist` with the filled `spec-reviewer-prompt.md` template |
+| `Task tool (superpowers:code-reviewer)` | `@code-reviewer` (bundled agent) or `@generalist` with the filled review prompt |
+| `Task tool (superpowers:code-quality-reviewer)` | `@generalist` with the filled `code-quality-reviewer-prompt.md` template |
+| `Task tool (general-purpose)` with inline prompt | `@generalist` with your inline prompt |
+
+### Prompt filling
+
+Skills provide prompt templates with placeholders like `{WHAT_WAS_IMPLEMENTED}` or `[FULL TEXT of task]`. Fill all placeholders and pass the complete prompt as the message to `@generalist`. The prompt template itself contains the agent's role, review criteria, and expected output format — `@generalist` will follow it.
+
+### Parallel dispatch
+
+Gemini CLI supports parallel subagent dispatch. When a skill asks you to dispatch multiple independent subagent tasks in parallel, request all of those `@generalist` or named subagent tasks together in the same prompt. Keep dependent tasks sequential, but do not serialize independent subagent tasks just to preserve a simpler history.

 ## Additional Gemini CLI tools

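Reviewer note on the "Prompt filling" step: a minimal bash sketch of substituting placeholders before handing the prompt to `@generalist`. The helper name `fill_placeholder`, the sample template, and the sample values are hypothetical and are not part of the skill or of Gemini CLI; only the `{NAME}` placeholder style comes from the text above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: replace every literal {NAME} in a template with a value.
fill_placeholder() {
    local template="$1" name="$2" value="$3"
    local pattern="{${name}}"
    # Quoting $pattern makes bash match it literally instead of as a glob.
    # Caveat: bash 5.2+ may expand "&" in the replacement text, so this
    # sketch assumes values that do not contain "&".
    local filled=${template//"$pattern"/$value}
    printf '%s' "$filled"
}

# Illustrative template; real templates live in the skill's prompt files.
template='Review {WHAT_WAS_IMPLEMENTED} between {BASE_SHA} and HEAD.'
prompt=$(fill_placeholder "$template" WHAT_WAS_IMPLEMENTED "the login refactor")
prompt=$(fill_placeholder "$prompt" BASE_SHA "abc1234")
echo "$prompt"
```

Filling one placeholder per call keeps the helper trivial. A leftover `{` in the result is a cheap signal that a placeholder was missed.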
--- a/tests/claude-code/README.md
+++ b/tests/claude-code/README.md
@@ -115,6 +115,18 @@ Full workflow execution test (~10-30 minutes):
 - Subagents follow the skill correctly
 - Final code is functional and tested

+#### test-requesting-code-review.sh
+Behavioral test for the code reviewer subagent (~5 minutes):
+- Builds a tiny project with a baseline commit
+- Adds a second commit that plants two real bugs (SQL injection, plaintext password handling)
+- Dispatches the code reviewer via the requesting-code-review skill
+- Verifies the reviewer flags the planted bugs at Critical/Important severity and refuses to approve
+
+**What it tests:**
+- The skill actually dispatches a working code reviewer subagent
+- The reviewer template produces reviewers that catch obvious security bugs
+- The reviewer is not sycophantic — it does not approve a diff with planted Critical issues
+
 ## Adding New Tests

 1. Create new test file: `test-<skill-name>.sh`
--- a/tests/claude-code/run-skill-tests.sh
+++ b/tests/claude-code/run-skill-tests.sh
@@ -79,6 +79,7 @@ tests=(
 # Integration tests (slow, full execution)
 integration_tests=(
    "test-subagent-driven-development-integration.sh"
+    "test-requesting-code-review.sh"
 )

 # Add integration tests if requested
--- a/tests/claude-code/test-requesting-code-review.sh
+++ b/tests/claude-code/test-requesting-code-review.sh
@@ -0,0 +1,214 @@
+#!/usr/bin/env bash
+# Integration Test: requesting-code-review skill
+# Verifies the code reviewer dispatched via the skill catches the planted bugs
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PLUGIN_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
+source "$SCRIPT_DIR/test-helpers.sh"
+
+echo "========================================"
+echo " Integration Test: requesting-code-review"
+echo "========================================"
+echo ""
+echo "This test verifies the code reviewer subagent by:"
+echo "  1. Setting up a tiny project with a baseline commit"
+echo "  2. Adding a second commit that plants obvious security bugs"
+echo "  3. Dispatching the code reviewer via the requesting-code-review skill"
+echo "  4. Verifying the reviewer flags the planted bugs as Critical/Important"
+echo ""
+
+TEST_PROJECT=$(create_test_project)
+echo "Test project: $TEST_PROJECT"
+trap 'cleanup_test_project "$TEST_PROJECT"' EXIT
+
+cd "$TEST_PROJECT"
+
+# Baseline: a small "safe" implementation
+mkdir -p src
+cat > src/db.js <<'EOF'
+import { Database } from "./database-driver.js";
+
+const db = new Database();
+
+export async function findUserByEmail(email) {
+  if (typeof email !== "string" || !email) {
+    throw new Error("email required");
+  }
+  return db.query(
+    "SELECT id, email, created_at FROM users WHERE email = ?",
+    [email],
+  );
+}
+EOF
+
+cat > package.json <<'EOF'
+{ "name": "test-codereview", "version": "1.0.0", "type": "module" }
+EOF
+
+git init --quiet
+git config user.email "test@test.com"
+git config user.name "Test User"
+git add .
+git commit -m "Initial: parameterized findUserByEmail" --quiet
+BASE_SHA=$(git rev-parse HEAD)
+
+# Second commit: plant two real bugs (plus a stub hash() that compares passwords in plaintext)
+# 1. SQL injection — switch from parameterized to string concatenation
+# 2. Logs the user's password hash on every successful login
+cat > src/db.js <<'EOF'
+import { Database } from "./database-driver.js";
+
+const db = new Database();
+
+export async function findUserByEmail(email) {
+  return db.query(
+    "SELECT id, email, password_hash, created_at FROM users WHERE email = '" + email + "'",
+  );
+}
+
+export async function login(email, password) {
+  const user = await findUserByEmail(email);
+  if (user && user.password_hash === hash(password)) {
+    console.log("login success", { email, password_hash: user.password_hash });
+    return user;
+  }
+  return null;
+}
+
+function hash(s) { return s; }
+EOF
+
+git add .
+git commit -m "Refactor user lookup, add login" --quiet
+HEAD_SHA=$(git rev-parse HEAD)
+
+echo ""
+echo "Planted bugs in $BASE_SHA..$HEAD_SHA:"
+echo "  - SQL injection (string concat instead of parameterized query)"
+echo "  - Password hash logged in plaintext on every successful login"
+echo "  - hash() is the identity function (passwords stored & compared in plaintext)"
+echo ""
+
+OUTPUT_FILE="$TEST_PROJECT/claude-output.txt"
+
+PROMPT="I just finished a refactor. The change is between commits $BASE_SHA and $HEAD_SHA on the current branch.
+
+Use the superpowers:requesting-code-review skill to review these changes before I merge. Follow the skill exactly: dispatch the code reviewer subagent with the template, give the subagent the SHA range, and report back what it found.
+
+Print the reviewer's full output."
+
+# Run claude from inside the test project so its session JSONL lands in a
+# project-specific directory under ~/.claude/projects/, isolated from any
+# other concurrent claude sessions.
+echo "Running Claude (plugin-dir: $PLUGIN_DIR, cwd: $TEST_PROJECT)..."
+echo "================================================================================"
+cd "$TEST_PROJECT" && timeout 600 claude -p "$PROMPT" \
+    --plugin-dir "$PLUGIN_DIR" \
+    --permission-mode bypassPermissions 2>&1 | tee "$OUTPUT_FILE" || {
+    status=$?  # capture the pipeline status before any echo resets $?
+    printf '\n%s\n' "================================================================================"
+    echo "EXECUTION FAILED (exit code: $status)"
+    exit 1
+}
+echo "================================================================================"
+
+echo ""
+echo "Analyzing reviewer output..."
+echo ""
+
+# Find the session transcript. Because we ran claude from $TEST_PROJECT (a
+# unique tmp dir), its sessions live in their own ~/.claude/projects/ folder.
+# Resolve the real path (macOS mktemp returns /var/... but claude normalizes
+# it to /private/var/...) and replicate claude's normalization (every
+# non-alphanumeric char becomes `-`).
+TEST_PROJECT_REAL=$(cd "$TEST_PROJECT" && pwd -P)
+SESSION_DIR="$HOME/.claude/projects/$(echo "$TEST_PROJECT_REAL" | sed 's|[^a-zA-Z0-9]|-|g')"
+# `|| true` prevents pipefail killing the script if ls gets SIGPIPE'd by head.
+SESSION_FILE=$(ls -t "$SESSION_DIR"/*.jsonl 2>/dev/null | head -1 || true)
+
+FAILED=0
+
+echo "=== Verification Tests ==="
+echo ""
+
+# Test 1: Skill was actually invoked, and a subagent was actually dispatched
+echo "Test 1: requesting-code-review skill invoked + reviewer subagent dispatched..."
+if [ -z "$SESSION_FILE" ] || [ ! -f "$SESSION_FILE" ]; then
+    echo "  [FAIL] Could not locate session transcript in $SESSION_DIR"
+    FAILED=$((FAILED + 1))
+elif ! grep -q '"skill":"superpowers:requesting-code-review"' "$SESSION_FILE"; then
+    echo "  [FAIL] requesting-code-review skill was not invoked"
+    echo "         Session: $SESSION_FILE"
+    FAILED=$((FAILED + 1))
+elif ! grep -q '"name":"Agent"' "$SESSION_FILE"; then
+    echo "  [FAIL] Skill ran but no subagent was dispatched"
+    FAILED=$((FAILED + 1))
+else
+    echo "  [PASS] Skill invoked and subagent dispatched"
+fi
+echo ""
+
+# Test 2: Reviewer caught the SQL injection
+echo "Test 2: SQL injection flagged..."
+if grep -qiE "sql injection|injection|string concat|parameterize|prepared statement|sanitiz" "$OUTPUT_FILE"; then
+    echo "  [PASS] Reviewer flagged the SQL injection vector"
+else
+    echo "  [FAIL] Reviewer missed the SQL injection — most obvious planted bug"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 3: Reviewer caught the credential / password issue (either logging or no real hashing)
+echo "Test 3: Credential handling issue flagged..."
+if grep -qiE "password|credential|secret|plaintext|log.*hash|hash.*log|sensitive" "$OUTPUT_FILE"; then
+    echo "  [PASS] Reviewer flagged a credential / password handling issue"
+else
+    echo "  [FAIL] Reviewer missed the password/credential issues"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 4: Reviewer marked at least one issue as Critical or Important (not just Minor)
+echo "Test 4: Severity classification..."
+if grep -qiE "critical|important|severe|high.*risk|security" "$OUTPUT_FILE"; then
+    echo "  [PASS] Reviewer classified findings at Critical/Important severity"
+else
+    echo "  [FAIL] Reviewer did not classify findings as Critical or Important"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 5: Reviewer did NOT approve the diff for merge
+echo "Test 5: Reviewer verdict..."
+# A correct reviewer says No or "With fixes". A broken/sycophantic reviewer says Yes/Ready.
+if grep -qiE "ready to merge.*yes|approved.*for merge|^\s*yes\s*$|safe to merge" "$OUTPUT_FILE" \
+   && ! grep -qiE "ready to merge.*no|with fixes|do not merge|not ready|block.*merge" "$OUTPUT_FILE"; then
+    echo "  [FAIL] Reviewer approved a diff with planted Critical bugs"
+    FAILED=$((FAILED + 1))
+else
+    echo "  [PASS] Reviewer did not approve the diff"
+fi
+echo ""
+
+echo "========================================"
+echo " Test Summary"
+echo "========================================"
+echo ""
+
+if [ $FAILED -eq 0 ]; then
+    echo "STATUS: PASSED"
+    echo "The code reviewer correctly:"
+    echo "  ✓ Was dispatched via the requesting-code-review skill"
+    echo "  ✓ Flagged the SQL injection"
+    echo "  ✓ Flagged the credential handling issues"
+    echo "  ✓ Classified findings at Critical/Important severity"
+    echo "  ✓ Did not approve the diff for merge"
+    exit 0
+else
+    echo "STATUS: FAILED"
+    echo "Failed $FAILED verification tests"
+    echo ""
+    echo "Output saved to: $OUTPUT_FILE"
+    exit 1
+fi
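Reviewer note on the session lookup: the test derives the transcript directory by replacing every non-alphanumeric character of the resolved working directory with `-`. The sketch below isolates that mapping so it can be checked by hand. The function name `normalize_project_path` and the sample path are hypothetical, and the normalization rule is the one the script's comments describe, not independently confirmed Claude Code behavior.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper wrapping the same sed expression the test uses.
normalize_project_path() {
    printf '%s\n' "$1" | sed 's|[^a-zA-Z0-9]|-|g'
}

# Illustrative path shaped like a resolved macOS temp directory.
normalized=$(normalize_project_path "/private/var/folders/ab/tmp.X1_y2")
echo "$normalized"
# prints: -private-var-folders-ab-tmp-X1-y2
```

Because `/`, `.`, and `_` all collapse to `-`, two different directories can map to the same name. Running each test in a fresh temp directory keeps that collision unlikely.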
--- a/tests/claude-code/test-subagent-driven-development-integration.sh
+++ b/tests/claude-code/test-subagent-driven-development-integration.sh
@@ -135,8 +135,7 @@ EOF

 # Note: We use a longer timeout since this is integration testing
 # Use --allowed-tools to enable tool usage in headless mode
-# IMPORTANT: Run from superpowers directory so local dev skills are available
-PROMPT="Change to directory $TEST_PROJECT and then execute the implementation plan at docs/superpowers/plans/implementation-plan.md using the subagent-driven-development skill.
+PROMPT="Execute the implementation plan at docs/superpowers/plans/implementation-plan.md using the subagent-driven-development skill.

 IMPORTANT: Follow the skill exactly. I will be verifying that you:
 1. Read the plan once at the beginning
@@ -147,9 +146,14 @@ IMPORTANT: Follow the skill exactly. I will be verifying that you:

 Begin now. Execute the plan."

-echo "Running Claude (output will be shown below and saved to $OUTPUT_FILE)..."
+PLUGIN_DIR=$(cd "$SCRIPT_DIR/../.." && pwd)
+
+# Run claude from inside the test project so its session JSONL lands in a
+# project-specific directory under ~/.claude/projects/, isolated from any
+# other concurrent claude sessions.
+echo "Running Claude (plugin-dir: $PLUGIN_DIR, cwd: $TEST_PROJECT)..."
 echo "================================================================================"
-cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" --allowed-tools=all --add-dir "$TEST_PROJECT" --permission-mode bypassPermissions 2>&1 | tee "$OUTPUT_FILE" || {
+cd "$TEST_PROJECT" && timeout 1800 claude -p "$PROMPT" --plugin-dir "$PLUGIN_DIR" --allowed-tools=all --permission-mode bypassPermissions 2>&1 | tee "$OUTPUT_FILE" || {
    echo ""
    echo "================================================================================"
    echo "EXECUTION FAILED (exit code: $?)"
@@ -161,13 +165,17 @@ echo ""
 echo "Execution complete. Analyzing results..."
 echo ""

-# Find the session transcript
-# Session files are in ~/.claude/projects/-<working-dir>/<session-id>.jsonl
-WORKING_DIR_ESCAPED=$(echo "$SCRIPT_DIR/../.." | sed 's/\//-/g' | sed 's/^-//')
-SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
-
-# Find the most recent session file (created during this test run)
-SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 2>/dev/null | sort -r | head -1)
+# Find the session transcript. Because we ran claude from $TEST_PROJECT (a
+# unique tmp dir), its sessions live in their own ~/.claude/projects/ folder
+# and we can pick the most-recent one without racing other concurrent sessions.
+# Resolve the real path because macOS mktemp returns /var/... but claude
+# normalizes it to /private/var/... when naming the project dir.
+TEST_PROJECT_REAL=$(cd "$TEST_PROJECT" && pwd -P)
+# Claude normalizes the cwd to a directory name by replacing every non-alphanumeric
+# character with `-` (so `_`, `.`, `/` all become `-`).
+SESSION_DIR="$HOME/.claude/projects/$(echo "$TEST_PROJECT_REAL" | sed 's|[^a-zA-Z0-9]|-|g')"
+# `|| true` prevents pipefail killing the script if ls gets SIGPIPE'd by head.
+SESSION_FILE=$(ls -t "$SESSION_DIR"/*.jsonl 2>/dev/null | head -1 || true)

 if [ -z "$SESSION_FILE" ]; then
    echo "ERROR: Could not find session transcript file"
@@ -194,9 +202,9 @@ else
 fi
 echo ""

-# Test 2: Subagents were used (Task tool)
+# Test 2: Subagents were used (Agent / Task tool — name varies by harness version)
 echo "Test 2: Subagents dispatched..."
-task_count=$(grep -c '"name":"Task"' "$SESSION_FILE" || echo "0")
+task_count=$(grep -cE '"name":"(Agent|Task)"' "$SESSION_FILE" || true)
 if [ "$task_count" -ge 2 ]; then
    echo "  [PASS] $task_count subagents dispatched"
 else