Mirror of https://github.com/obra/superpowers.git, synced 2026-05-01 06:29:05 +08:00
The plugin had a single named agent (`agents/code-reviewer.md`) used by two skills, while every other reviewer/implementer subagent in the repo is dispatched as `general-purpose` with the prompt template living alongside its skill. That asymmetry had no upside and several costs:

- Two sources of truth for the code review checklist (the agent file and `requesting-code-review/code-reviewer.md`), both drifting independently.
- Codex users could not use the named agent directly; the codex-tools reference doc had a workaround section explaining how to flatten the named agent into a `worker` dispatch.
- No third-party reliance on `superpowers:code-reviewer` inside this repo.

Changes:

- Merge `agents/code-reviewer.md` (persona + checklist) and `skills/requesting-code-review/code-reviewer.md` (placeholder template) into a single self-contained Task-dispatch template, matching the shape of `implementer-prompt.md`, `spec-reviewer-prompt.md`, etc.
- Update `skills/requesting-code-review/SKILL.md` and `skills/subagent-driven-development/code-quality-reviewer-prompt.md` to dispatch `Task (general-purpose)` instead of the named agent.
- Drop the now-obsolete "Named agent dispatch" workaround sections from `codex-tools.md` and `copilot-tools.md`: superpowers no longer ships any named agents, so those instructions documented nothing.
- Delete `agents/code-reviewer.md` and the empty `agents/` directory.

Tier 3 coverage for the change: a new behavioral test `tests/claude-code/test-requesting-code-review.sh` plants real bugs (SQL injection, plaintext password handling, credential logging) into a tiny project, runs the actual `requesting-code-review` skill against the working tree, and asserts the dispatched reviewer flags every planted issue at Critical/Important severity and refuses to approve the diff.

Verified end-to-end on this branch:

- The new test passes (5/5 assertions; the reviewer caught all planted bugs and several others).
- The existing SDD integration test still passes (7/7 subagents dispatched, all as `general-purpose`; spec compliance still rejects extra features; the produced code is correct).
- Session JSONLs confirm zero remaining `superpowers:code-reviewer` dispatches anywhere in the SDD pipeline.
Claude Code Skills Tests
Automated tests for superpowers skills using Claude Code CLI.
Overview
This test suite verifies that skills are loaded correctly and that Claude follows them as expected. Tests invoke Claude Code in headless mode (`claude -p`) and verify the behavior.
Requirements
- Claude Code CLI installed and in PATH (`claude --version` should work)
- Local superpowers plugin installed (see main README for installation)
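Both prerequisites can be checked from a shell before running the suite. The sketch below is illustrative only: `check_cli` is an invented helper, not part of the suite.

```shell
#!/usr/bin/env bash
# Hypothetical preflight sketch; `check_cli` is illustrative, not part of
# the suite's helpers.
check_cli() {
  local cli="$1"
  if command -v "$cli" >/dev/null 2>&1; then
    echo "found: $cli"
  else
    echo "missing: $cli (install it and re-run)"
  fi
}

check_cli bash     # sanity check: present on any machine running these tests
check_cli claude   # the Claude Code CLI this suite requires
```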
Running Tests
Run all fast tests (recommended):
./run-skill-tests.sh
Run integration tests (slow, 10-30 minutes):
./run-skill-tests.sh --integration
Run specific test:
./run-skill-tests.sh --test test-subagent-driven-development.sh
Run with verbose output:
./run-skill-tests.sh --verbose
Set custom timeout:
./run-skill-tests.sh --timeout 1800 # 30 minutes for integration tests
Test Structure
test-helpers.sh
Common functions for skills testing:
- `run_claude "prompt" [timeout]` - Run Claude with prompt
- `assert_contains output pattern name` - Verify pattern exists
- `assert_not_contains output pattern name` - Verify pattern absent
- `assert_count output pattern count name` - Verify exact count
- `assert_order output pattern_a pattern_b name` - Verify order
- `create_test_project` - Create temp test directory
- `create_test_plan project_dir` - Create sample plan file
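As a rough sketch of how two of these assertion helpers might be implemented (grep-based substring matching is an assumption; the real `test-helpers.sh` may differ):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of two test-helpers.sh assertions; the real helpers
# may differ (grep-based matching is an assumption).
set -euo pipefail

PASS=0
FAIL=0

# assert_contains output pattern name - pass if pattern appears in output
assert_contains() {
  local output="$1" pattern="$2" name="$3"
  if grep -q -- "$pattern" <<<"$output"; then
    echo "PASS: $name"; PASS=$((PASS + 1))
  else
    echo "FAIL: $name (pattern not found: $pattern)"; FAIL=$((FAIL + 1))
  fi
}

# assert_not_contains output pattern name - pass if pattern is absent
assert_not_contains() {
  local output="$1" pattern="$2" name="$3"
  if grep -q -- "$pattern" <<<"$output"; then
    echo "FAIL: $name (unexpected pattern: $pattern)"; FAIL=$((FAIL + 1))
  else
    echo "PASS: $name"; PASS=$((PASS + 1))
  fi
}

output="Reviewer flagged SQL injection at Critical severity"
assert_contains "$output" "SQL injection" "flags SQL injection"
assert_not_contains "$output" "Approved" "does not approve"
echo "passed=$PASS failed=$FAIL"
```

Counting passes and failures rather than exiting on the first miss lets a test report every broken assertion in one run.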
Test Files
Each test file:
- Sources `test-helpers.sh`
- Runs Claude Code with specific prompts
- Verifies expected behavior using assertions
- Returns 0 on success, non-zero on failure
Example Test
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"
echo "=== Test: My Skill ==="
# Ask Claude about the skill
output=$(run_claude "What does the my-skill skill do?" 30)
# Verify response
assert_contains "$output" "expected behavior" "Skill describes behavior"
echo "=== All tests passed ==="
Current Tests
Fast Tests (run by default)
test-subagent-driven-development.sh
Tests skill content and requirements (~2 minutes):
- Skill loading and accessibility
- Workflow ordering (spec compliance before code quality)
- Self-review requirements documented
- Plan reading efficiency documented
- Spec compliance reviewer skepticism documented
- Review loops documented
- Task context provision documented
Integration Tests (use --integration flag)
test-subagent-driven-development-integration.sh
Full workflow execution test (~10-30 minutes):
- Creates real test project with Node.js setup
- Creates implementation plan with 2 tasks
- Executes plan using subagent-driven-development
- Verifies actual behaviors:
- Plan read once at start (not per task)
- Full task text provided in subagent prompts
- Subagents perform self-review before reporting
- Spec compliance review happens before code quality
- Spec reviewer reads code independently
- Working implementation is produced
- Tests pass
- Proper git commits created
What it tests:
- The workflow actually works end-to-end
- Our improvements are actually applied
- Subagents follow the skill correctly
- Final code is functional and tested
test-requesting-code-review.sh
Behavioral test for the code reviewer subagent (~5 minutes):
- Builds a tiny project with a baseline commit
- Adds a second commit that plants two real bugs (SQL injection, plaintext password handling)
- Dispatches the code reviewer via the requesting-code-review skill
- Verifies the reviewer flags the planted bugs at Critical/Important severity and refuses to approve
What it tests:
- The skill actually dispatches a working code reviewer subagent
- The reviewer template produces reviewers that catch obvious security bugs
- The reviewer is not sycophantic: it does not approve a diff with planted Critical issues
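To make the "planted bug" idea concrete, here is a hedged sketch of what planting one might look like; the file name and contents are invented for this illustration, and the real test's project and commits differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of planting a bug for a reviewer to catch; the real
# test's project layout, file names, and commits differ.
set -euo pipefail

tmpdir="$(mktemp -d)"
cat > "$tmpdir/login.js" <<'EOF'
// Planted bug: SQL injection via string concatenation
function findUser(db, username) {
  return db.query("SELECT * FROM users WHERE name = '" + username + "'");
}
EOF

# A reviewer run against this change should flag the concatenated query.
grep -q "SELECT \* FROM users" "$tmpdir/login.js" && echo "bug planted"
```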
Adding New Tests
- Create new test file: `test-<skill-name>.sh`
- Source `test-helpers.sh`
- Write tests using `run_claude` and assertions
- Add to test list in `run-skill-tests.sh`
- Make executable: `chmod +x test-<skill-name>.sh`
Timeout Considerations
- Default timeout: 5 minutes per test
- Claude Code may take time to respond
- Adjust with `--timeout` if needed
- Tests should be focused to avoid long runs
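One plausible way a per-test timeout can be enforced is with coreutils `timeout`, which exits with status 124 when the deadline passes. This is only a sketch of what `run_claude`'s timeout handling might resemble; `run_with_timeout` is an invented name, and `timeout` may need installing separately on macOS.

```shell
#!/usr/bin/env bash
# Hypothetical sketch using coreutils `timeout` (exit status 124 on expiry);
# run_claude's actual timeout mechanism may differ.
run_with_timeout() {
  local secs="$1"; shift
  timeout "$secs" "$@"
}

run_with_timeout 5 sleep 1 && echo "finished within timeout"
run_with_timeout 1 sleep 3 || echo "timed out with exit $?"
```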
Debugging Failed Tests
With --verbose, you'll see full Claude output:
./run-skill-tests.sh --verbose --test test-subagent-driven-development.sh
Without verbose, only failures show output.
CI/CD Integration
To run in CI:
# Run with explicit timeout for CI environments
./run-skill-tests.sh --timeout 900
# Exit code 0 = success, non-zero = failure
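Since CI gates purely on that exit code, the gating logic can be sketched as follows; `run_suite` is a stand-in for `./run-skill-tests.sh` so the snippet is self-contained.

```shell
#!/usr/bin/env bash
# Minimal sketch of gating a CI step on the suite's exit code; `run_suite`
# is a stand-in for ./run-skill-tests.sh so the sketch runs anywhere.
run_suite() { return "${1:-0}"; }

if run_suite 0; then echo "CI: tests passed"; else echo "CI: tests failed"; fi
if run_suite 1; then echo "CI: tests passed"; else echo "CI: tests failed"; fi
```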
Notes
- Tests verify skill instructions, not full execution
- Full workflow tests would be very slow
- Focus on verifying key skill requirements
- Tests should be deterministic
- Avoid testing implementation details