The plugin had a single named agent (`agents/code-reviewer.md`) used by two skills, while every other reviewer/implementer subagent in the repo is dispatched as `general-purpose` with the prompt template living alongside its skill. That asymmetry had no upside and several costs:

- Two sources of truth for the code review checklist (the agent file and `requesting-code-review/code-reviewer.md`), both drifting independently.
- Codex users could not use the named agent directly; the codex-tools reference doc had a workaround section explaining how to flatten the named agent into a `worker` dispatch.
- No third-party reliance on `superpowers:code-reviewer` inside this repo.

Changes:

- Merge `agents/code-reviewer.md` (persona + checklist) and `skills/requesting-code-review/code-reviewer.md` (placeholder template) into a single self-contained Task-dispatch template, matching the shape of `implementer-prompt.md`, `spec-reviewer-prompt.md`, etc.
- Update `skills/requesting-code-review/SKILL.md` and `skills/subagent-driven-development/code-quality-reviewer-prompt.md` to dispatch `Task (general-purpose)` instead of the named agent.
- Drop the now-obsolete "Named agent dispatch" workaround sections from `codex-tools.md` and `copilot-tools.md`; superpowers no longer ships any named agents, so those instructions documented nothing.
- Delete `agents/code-reviewer.md` and the empty `agents/` directory.

Tier 3 coverage for the change: a new behavioral test `tests/claude-code/test-requesting-code-review.sh` plants real bugs (SQL injection, plaintext password handling, credential logging) in a tiny project, runs the actual `requesting-code-review` skill against the working tree, and asserts the dispatched reviewer flags every planted issue at Critical/Important severity and refuses to approve the diff.

Verified end-to-end on this branch:

- The new test passes (5/5 assertions; the reviewer caught all planted bugs and several others).
- The existing SDD integration test still passes (7/7 subagents dispatched, all as `general-purpose`; spec compliance still rejects extra features; the produced code is correct).
- Session JSONLs confirm zero remaining `superpowers:code-reviewer` dispatches anywhere in the SDD pipeline.
# Claude Code Skills Tests

Automated tests for superpowers skills using the Claude Code CLI.

## Overview

This test suite verifies that skills are loaded correctly and that Claude follows them as expected. Tests invoke Claude Code in headless mode (`claude -p`) and verify the resulting behavior.

## Requirements

- Claude Code CLI installed and on your PATH (`claude --version` should work)
- Local superpowers plugin installed (see the main README for installation)
## Running Tests

### Run all fast tests (recommended):

```bash
./run-skill-tests.sh
```

### Run integration tests (slow, 10-30 minutes):

```bash
./run-skill-tests.sh --integration
```

### Run specific test:

```bash
./run-skill-tests.sh --test test-subagent-driven-development.sh
```

### Run with verbose output:

```bash
./run-skill-tests.sh --verbose
```

### Set custom timeout:

```bash
./run-skill-tests.sh --timeout 1800  # 30 minutes for integration tests
```
## Test Structure

### test-helpers.sh

Common functions for skills testing:

- `run_claude "prompt" [timeout]` - Run Claude with the given prompt and capture its output
- `assert_contains output pattern name` - Verify the pattern exists in the output
- `assert_not_contains output pattern name` - Verify the pattern is absent from the output
- `assert_count output pattern count name` - Verify the pattern occurs exactly `count` times
- `assert_order output pattern_a pattern_b name` - Verify `pattern_a` appears before `pattern_b`
- `create_test_project` - Create a temporary test directory
- `create_test_plan project_dir` - Create a sample plan file in the project
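As a rough sketch, two of the assertion helpers could be implemented as follows. These are hypothetical minimal versions shown only to illustrate the interface; the real `test-helpers.sh` may differ.

```bash
# Hypothetical minimal implementations of two helpers; illustrative only.
assert_contains() {
  local output="$1" pattern="$2" name="$3"
  if grep -q -- "$pattern" <<<"$output"; then
    echo "PASS: $name"
  else
    echo "FAIL: $name (pattern '$pattern' not found)" >&2
    return 1
  fi
}

assert_count() {
  local output="$1" pattern="$2" expected="$3" name="$4"
  local actual
  actual=$(grep -c -- "$pattern" <<<"$output" || true)
  if [ "$actual" -eq "$expected" ]; then
    echo "PASS: $name"
  else
    echo "FAIL: $name (expected $expected matches, got $actual)" >&2
    return 1
  fi
}
```

Helpers that print `PASS`/`FAIL` and return non-zero on failure compose cleanly with `set -euo pipefail` in the test files.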
### Test Files

Each test file:

1. Sources `test-helpers.sh`
2. Runs Claude Code with specific prompts
3. Verifies expected behavior using assertions
4. Returns 0 on success, non-zero on failure
## Example Test

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"

echo "=== Test: My Skill ==="

# Ask Claude about the skill
output=$(run_claude "What does the my-skill skill do?" 30)

# Verify response
assert_contains "$output" "expected behavior" "Skill describes behavior"

echo "=== All tests passed ==="
```
## Current Tests

### Fast Tests (run by default)

#### test-subagent-driven-development.sh

Tests skill content and requirements (~2 minutes):

- Skill loading and accessibility
- Workflow ordering (spec compliance before code quality)
- Self-review requirements documented
- Plan reading efficiency documented
- Spec compliance reviewer skepticism documented
- Review loops documented
- Task context provision documented
### Integration Tests (use --integration flag)

#### test-subagent-driven-development-integration.sh

Full workflow execution test (~10-30 minutes):

- Creates a real test project with Node.js setup
- Creates an implementation plan with 2 tasks
- Executes the plan using subagent-driven-development
- Verifies actual behaviors:
  - Plan read once at start (not per task)
  - Full task text provided in subagent prompts
  - Subagents perform self-review before reporting
  - Spec compliance review happens before code quality
  - Spec reviewer reads code independently
  - Working implementation is produced
  - Tests pass
  - Proper git commits created

**What it tests:**

- The workflow actually works end-to-end
- Our improvements are actually applied
- Subagents follow the skill correctly
- Final code is functional and tested
#### test-requesting-code-review.sh

Behavioral test for the code reviewer subagent (~5 minutes):

- Builds a tiny project with a baseline commit
- Adds a second commit that plants two real bugs (SQL injection, plaintext password handling)
- Dispatches the code reviewer via the requesting-code-review skill
- Verifies the reviewer flags the planted bugs at Critical/Important severity and refuses to approve

**What it tests:**

- The skill actually dispatches a working code reviewer subagent
- The reviewer template produces reviewers that catch obvious security bugs
- The reviewer is not sycophantic: it does not approve a diff with planted Critical issues
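For a sense of what "planting a bug" means here, the fixture setup can be as simple as committing a deliberately injectable query. The file name and contents below are hypothetical, not the test's actual fixture:

```bash
# Sketch of planting an injectable query into a throwaway fixture project.
# File name and contents are illustrative, not the test's actual fixture.
fixture_dir=$(mktemp -d)
cat > "$fixture_dir/db.js" <<'EOF'
// Planted bug: string concatenation makes this query SQL-injectable
function findUser(db, name) {
  return db.query("SELECT * FROM users WHERE name = '" + name + "'");
}
module.exports = { findUser };
EOF
echo "planted fixture in $fixture_dir"
```

A reviewer dispatched against a diff containing a file like this should flag the concatenated query at Critical severity rather than approving.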
## Adding New Tests

1. Create a new test file: `test-<skill-name>.sh`
2. Source `test-helpers.sh`
3. Write tests using `run_claude` and assertions
4. Add the test to the list in `run-skill-tests.sh`
5. Make it executable: `chmod +x test-<skill-name>.sh`
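The registration step might look something like this inside `run-skill-tests.sh`; the variable name `FAST_TESTS` is an assumption, so check the script for its actual structure:

```bash
# Hypothetical test registry inside run-skill-tests.sh; the real script may
# use a different variable name or a discovery mechanism instead.
FAST_TESTS=(
  test-subagent-driven-development.sh
  test-my-skill.sh   # the newly added test
)
for t in "${FAST_TESTS[@]}"; do
  echo "registered: $t"
done
```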
## Timeout Considerations

- Default timeout: 5 minutes per test
- Claude Code may take time to respond
- Adjust with `--timeout` if needed
- Tests should be focused to avoid long runs
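One plausible way to enforce a per-test timeout is coreutils `timeout`; this sketch is an assumption about the mechanism, not a description of what `run-skill-tests.sh` actually does:

```bash
# Wrap a command with a hard timeout; exit status 124 means it timed out.
# The helper name is illustrative, not taken from the actual runner.
run_with_timeout() {
  local secs="$1"; shift
  timeout "$secs" "$@"
  local status=$?
  if [ "$status" -eq 124 ]; then
    echo "TIMEOUT after ${secs}s: $*" >&2
  fi
  return "$status"
}
```

Usage: `run_with_timeout 300 ./test-my-skill.sh` kills the test after 5 minutes and surfaces the timeout in the log.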
## Debugging Failed Tests

With `--verbose`, you'll see the full Claude output:

```bash
./run-skill-tests.sh --verbose --test test-subagent-driven-development.sh
```

Without `--verbose`, only failures show output.
## CI/CD Integration

To run in CI:

```bash
# Run with an explicit timeout for CI environments
./run-skill-tests.sh --timeout 900

# Exit code 0 = success, non-zero = failure
```
## Notes

- Fast tests verify skill *instructions*, not full execution; integration tests exercise the full workflow
- Full workflow tests are slow, which is why they are opt-in via `--integration`
- Focus on verifying key skill requirements
- Tests should be deterministic
- Avoid testing implementation details