mirror of https://github.com/obra/superpowers.git (synced 2026-05-10 02:59:04 +08:00)
Address adversarial review findings
- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses
  [project.optional-dependencies], so --dev is a no-op for pytest/ruff/ty;
  --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
  from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
  section with test-worktree-native-preference.sh (the worktree test is
  kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness list.
  evals/backends/ has claude*, codex, gemini configs but no copilot.yaml,
  so the claim was unsupported.

Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this
round.
committed by Drew Ritter
parent f7c5312265
commit 0bf37499b4
@@ -96,7 +96,7 @@ Skills are not prose — they are code that shapes agent behavior. If you modify
 
 ## Eval harness
 
-Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.
+Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.
 
 ## Understand the Project Before Contributing
 
@@ -3,7 +3,7 @@
 Superpowers has two distinct kinds of tests, each in its own directory:
 
 - **`tests/`** — does the plugin's non-LLM code work? Bash + node + python integration tests for brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
-- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI, with an LLM actor and verifier judging skill compliance.
+- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI, with an LLM actor and verifier judging skill compliance.
 
 ## Plugin tests
 
@@ -4,7 +4,7 @@ Superpowers skill compliance benchmark. Python 3.11+, managed with uv.
 
 ## Commands
 
-- **install**: `uv sync --dev`
+- **install**: `uv sync --extra dev`
 - **test**: `uv run pytest`
 - **test single**: `uv run pytest tests/test_engine.py -x -q`
 - **lint**: `uv run ruff check`
@@ -15,7 +15,7 @@ correctly.
 ## Setup
 
 ```bash
-uv sync --dev
+uv sync --extra dev
 ```
 
 Required environment:
@@ -115,17 +115,12 @@ Full workflow execution test (~10-30 minutes):
 - Subagents follow the skill correctly
 - Final code is functional and tested
 
-#### test-requesting-code-review.sh
-Behavioral test for the code reviewer subagent (~5 minutes):
-- Builds a tiny project with a baseline commit
-- Adds a second commit that plants two real bugs (SQL injection, plaintext password handling)
-- Dispatches the code reviewer via the requesting-code-review skill
-- Verifies the reviewer flags the planted bugs at Critical/Important severity and refuses to approve
+#### test-worktree-native-preference.sh
+RED-GREEN-REFACTOR validation for the using-git-worktrees skill (~5 minutes):
+- RED: skill without Step 1a — agent should use `git worktree add`
+- GREEN: skill with Step 1a — agent should use the native EnterWorktree tool
+- PRESSURE: same as GREEN under urgency framing with pre-existing `.worktrees/`
+- Drill scenario `worktree-creation-under-pressure.yaml` covers the PRESSURE phase only
 
-**What it tests:**
-- The skill actually dispatches a working code reviewer subagent
-- The reviewer template produces reviewers that catch obvious security bugs
-- The reviewer is not sycophantic — it does not approve a diff with planted Critical issues
-
 ## Adding New Tests
 
@@ -79,7 +79,6 @@ tests=(
 # Integration tests (slow, full execution)
 integration_tests=(
 "test-subagent-driven-development-integration.sh"
-"test-requesting-code-review.sh"
 )
 
 # Add integration tests if requested
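
The stale `integration_tests` entry removed in this hunk is the kind of drift a small existence check can catch. A minimal sketch, not part of this commit: the helper name `missing_tests` is hypothetical, and the demo uses a temporary directory standing in for `tests/claude-code/` rather than the real repository.

```python
import tempfile
from pathlib import Path


def missing_tests(test_dir: Path, names: list[str]) -> list[str]:
    """Return the listed test scripts that do not exist in test_dir."""
    return [name for name in names if not (test_dir / name).is_file()]


# Demo: recreate the state before this commit, where the array still
# listed a script that had already been deleted from the branch.
with tempfile.TemporaryDirectory() as tmp:
    test_dir = Path(tmp)
    (test_dir / "test-subagent-driven-development-integration.sh").touch()
    listed = [
        "test-subagent-driven-development-integration.sh",
        "test-requesting-code-review.sh",  # deleted earlier in the branch
    ]
    print(missing_tests(test_dir, listed))
    # prints: ['test-requesting-code-review.sh']
```

Run against the real directory and the names in the array, a check like this would fail as soon as a script is deleted without updating the runner.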