Mirror of https://github.com/obra/superpowers.git (synced 2026-05-09 18:49:04 +08:00)
# Drill
Superpowers skill compliance benchmark. Drives AI coding agents through tmux sessions and evaluates whether they follow superpowers workflows correctly.
## How it works
- **Setup** — a helper creates a git repo with specific conditions (worktree state, plan files, code fixtures)
- **Actor** — a Sonnet 4.6 LLM plays the user, following turn intents from the scenario YAML
- **Agent** — the backend under test (Claude Code, Codex, Gemini CLI) runs in a real tmux session
- **Verifier** — a Sonnet 4.6 LLM evaluates the session transcript + filesystem against criteria
- **Assertions** — deterministic checks (tool-called, tool-count, shell commands) run post-session
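To make the assertion step concrete, a deterministic check like tool-count amounts to a small transcript scan. The sketch below assumes a JSONL transcript where each tool invocation is an event with `type` and `tool` fields — that format and the function name are illustrative, not drill's actual implementation:

```python
import json
import pathlib

def count_tool_calls(transcript: pathlib.Path, tool: str) -> int:
    """Count calls to a named tool in a JSONL transcript (event format is assumed)."""
    count = 0
    for line in transcript.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines defensively
        event = json.loads(line)
        if event.get("type") == "tool_call" and event.get("tool") == tool:
            count += 1
    return count
```

Because checks like this are pure functions of the session artifacts, they can run after the tmux session ends without re-invoking any LLM.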
## Setup

```shell
uv sync --extra dev
```

Required environment:

```shell
export ANTHROPIC_API_KEY=sk-...
```

`SUPERPOWERS_ROOT` defaults to the parent of `evals/` (the superpowers repo root) and only needs to be set if you're running drill against a different superpowers checkout.
## Usage

```shell
# Run a single scenario on a single backend
uv run drill run worktree-creation-from-main -b claude

# Run with N repetitions
uv run drill run pattern-match-trap -b claude-opus-4-6 --n 5

# Sweep across multiple backends
uv run drill run pattern-match-trap --models claude-opus-4-6,claude-opus-4-7 --n 10

# Compare results
uv run drill compare pattern-match-trap

# List available scenarios
uv run drill list
```
## Scenarios
| Category | Scenarios | Tests |
|---|---|---|
| Worktree | 8 scenarios (creation, detection, consent, detached HEAD) | Skill compliance for using-git-worktrees |
| Wave decomposition | 5 scenarios (naive, spec-aware, false overlap, dependency chain, conflict surface) | Plan → waves decomposition quality |
| Wave execution | 3 scenarios (minimal, full, task failure) | End-to-end wave execution + failure escalation |
| Pattern-match trap | 1 scenario | Investigation depth gap between 4.6 and 4.7 (PRI-1270) |
## Backends

| Backend | CLI | Model |
|---|---|---|
| `claude` | Claude Code | opus-4-7 (default) |
| `claude-opus-4-6` | Claude Code | opus-4-6 |
| `claude-opus-4-7` | Claude Code | opus-4-7 |
| `claude-opus-4-6-1m` | Claude Code | opus-4-6 (1M context) |
| `claude-opus-4-7-1m` | Claude Code | opus-4-7 (1M context) |
| `codex` | Codex CLI | — |
| `gemini` | Gemini CLI | — |
## Project structure

```
drill/                # Core engine
  cli.py              # Click CLI (run, compare, list)
  engine.py           # Tmux session orchestration
  actor.py            # User-simulator LLM
  verifier.py         # Criteria evaluator LLM
  assertions.py       # Deterministic post-session assertions
  compare.py          # Result loading and cross-backend comparison
  sweep.py            # Multi-backend N-rep orchestrator
  stats.py            # Wilson score confidence intervals
scenarios/            # YAML scenario definitions
setup_helpers/        # Repo fixture creators
backends/             # Per-backend YAML configs
bin/                  # Assertion helper scripts (tool-called, tool-count, etc.)
prompts/              # Actor and verifier system prompts
fixtures/             # Static template repos
tests/                # pytest suite (122 tests)
docs/                 # Design spec and manual testing guide
```
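`stats.py` reports Wilson score confidence intervals for pass rates. For reference, the interval for *k* successes out of *n* runs can be sketched as follows (the function name is illustrative, not drill's API):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95% confidence)."""
    if n == 0:
        return (0.0, 1.0)  # no data: interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters when `--n` is only 5 or 10.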
## Tests

```shell
uv run pytest
uv run ruff check
uv run ty check
```
## Writing a new scenario

- Create a setup helper in `setup_helpers/` if you need a custom fixture
- Register it in `setup_helpers/__init__.py`
- Create `scenarios/your-scenario.yaml` with setup, turns, limits, and verify sections
- Run it: `uv run drill run your-scenario -b claude`
See `docs/design.md` for the full design spec.