From e4191c36097750ac050b5b39cab4766765d4c8ac Mon Sep 17 00:00:00 2001
From: Jesse Vincent
Date: Wed, 6 May 2026 12:41:28 -0700
Subject: [PATCH] Address adversarial review findings

- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses
  [project.optional-dependencies], so --dev is a no-op for
  pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop
  test-requesting-code-review.sh from integration_tests array (file
  deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
  section with test-worktree-native-preference.sh (the worktree test is
  kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
  list. evals/backends/ has claude*, codex, gemini configs but no
  copilot.yaml, so the claim was unsupported.

Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this
round.
---
 CLAUDE.md                            |  2 +-
 docs/testing.md                      |  2 +-
 evals/CLAUDE.md                      |  2 +-
 evals/README.md                      |  2 +-
 tests/claude-code/README.md          | 17 ++++++-----------
 tests/claude-code/run-skill-tests.sh |  1 -
 6 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 9d1cc1fe..a8b88c89 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -96,7 +96,7 @@ Skills are not prose — they are code that shapes agent behavior. If you modify
 
 ## Eval harness
 
-Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.
+Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.
 
 ## Understand the Project Before Contributing
 
diff --git a/docs/testing.md b/docs/testing.md
index fa504dfb..ac5a0050 100644
--- a/docs/testing.md
+++ b/docs/testing.md
@@ -3,7 +3,7 @@
 Superpowers has two distinct kinds of tests, each in its own directory:
 
 - **`tests/`** — does the plugin's non-LLM code work? Bash + node + python integration tests for brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
-- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI, with an LLM actor and verifier judging skill compliance.
+- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI, with an LLM actor and verifier judging skill compliance.
 
 ## Plugin tests
 
diff --git a/evals/CLAUDE.md b/evals/CLAUDE.md
index 43cce6a0..e80b39ca 100644
--- a/evals/CLAUDE.md
+++ b/evals/CLAUDE.md
@@ -4,7 +4,7 @@ Superpowers skill compliance benchmark. Python 3.11+, managed with uv.
 
 ## Commands
 
-- **install**: `uv sync --dev`
+- **install**: `uv sync --extra dev`
 - **test**: `uv run pytest`
 - **test single**: `uv run pytest tests/test_engine.py -x -q`
 - **lint**: `uv run ruff check`
diff --git a/evals/README.md b/evals/README.md
index 1e9fab2f..1791dd4a 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -15,7 +15,7 @@ correctly.
 ## Setup
 
 ```bash
-uv sync --dev
+uv sync --extra dev
 ```
 
 Required environment:
diff --git a/tests/claude-code/README.md b/tests/claude-code/README.md
index 473f1f28..90f1fe1a 100644
--- a/tests/claude-code/README.md
+++ b/tests/claude-code/README.md
@@ -115,17 +115,12 @@ Full workflow execution test (~10-30 minutes):
 - Subagents follow the skill correctly
 - Final code is functional and tested
 
-#### test-requesting-code-review.sh
-Behavioral test for the code reviewer subagent (~5 minutes):
-- Builds a tiny project with a baseline commit
-- Adds a second commit that plants two real bugs (SQL injection, plaintext password handling)
-- Dispatches the code reviewer via the requesting-code-review skill
-- Verifies the reviewer flags the planted bugs at Critical/Important severity and refuses to approve
-
-**What it tests:**
-- The skill actually dispatches a working code reviewer subagent
-- The reviewer template produces reviewers that catch obvious security bugs
-- The reviewer is not sycophantic — it does not approve a diff with planted Critical issues
+#### test-worktree-native-preference.sh
+RED-GREEN-REFACTOR validation for the using-git-worktrees skill (~5 minutes):
+- RED: skill without Step 1a — agent should use `git worktree add`
+- GREEN: skill with Step 1a — agent should use the native EnterWorktree tool
+- PRESSURE: same as GREEN under urgency framing with pre-existing `.worktrees/`
+- Drill scenario `worktree-creation-under-pressure.yaml` covers the PRESSURE phase only
 
 ## Adding New Tests
 
diff --git a/tests/claude-code/run-skill-tests.sh b/tests/claude-code/run-skill-tests.sh
index 023e9794..3e339fd3 100755
--- a/tests/claude-code/run-skill-tests.sh
+++ b/tests/claude-code/run-skill-tests.sh
@@ -79,7 +79,6 @@ tests=(
 # Integration tests (slow, full execution)
 integration_tests=(
     "test-subagent-driven-development-integration.sh"
-    "test-requesting-code-review.sh"
 )
 
 # Add integration tests if requested
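
Review note: the stale `integration_tests` entry removed in this patch is the kind of drift a small existence check could catch before a run. The sketch below is not part of `run-skill-tests.sh`; the helper name `check_test_files` and its argument order are hypothetical, and it assumes the listed test files live in a single directory.

```shell
# Hypothetical helper (not in the repository): report listed test files
# that no longer exist on disk. Returns non-zero if any entry is missing.
# The body runs in a subshell, so its variables do not leak to the caller.
check_test_files() (
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing test file: $dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
)
```

In a bash runner this could be called as `check_test_files "$SCRIPT_DIR" "${integration_tests[@]}" || exit 1`, where `SCRIPT_DIR` is an assumed variable holding the directory that contains the test scripts.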