Mirror of https://github.com/obra/superpowers.git (synced 2026-05-10 02:59:04 +08:00)
# Drill
Superpowers skill compliance benchmark. Drives AI coding agents through tmux sessions and evaluates whether they follow superpowers workflows correctly.
## How it works
- Setup — a helper creates a git repo with specific conditions (worktree state, plan files, code fixtures)
- Actor — a Sonnet 4.6 LLM plays the user, following turn intents from the scenario YAML
- Agent — the backend under test (Claude Code, Codex, Gemini CLI) runs in a real tmux session
- Verifier — a Sonnet 4.6 LLM evaluates the session transcript + filesystem against criteria
- Assertions — deterministic checks (tool-called, tool-count, shell commands) run post-session
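The five stages above run in sequence for every scenario. A toy sketch of that lifecycle (all names here are hypothetical stand-ins — the real orchestration lives in `drill/engine.py` and drives a tmux session rather than plain callables):

```python
# Hypothetical sketch of the run lifecycle: actor turns, agent replies,
# then an LLM verifier and deterministic assertions score the transcript.
from dataclasses import dataclass, field


@dataclass
class RunResult:
    transcript: list[str] = field(default_factory=list)
    verdicts: dict[str, bool] = field(default_factory=dict)


def run_scenario(turn_intents, agent, verifier, assertions):
    """Setup -> actor/agent turns -> verifier -> deterministic assertions."""
    result = RunResult()
    for intent in turn_intents:                    # actor plays the user
        user_msg = f"user: {intent}"               # real actor is an LLM
        result.transcript.append(user_msg)
        result.transcript.append(agent(user_msg))  # agent under test replies
    result.verdicts["verifier"] = verifier(result.transcript)
    for name, check in assertions.items():         # post-session checks
        result.verdicts[name] = check(result.transcript)
    return result


# Toy usage: an "agent" that always creates a worktree, so both checks pass.
res = run_scenario(
    ["start a new feature"],
    agent=lambda m: "agent: git worktree add ../feature",
    verifier=lambda t: any("worktree" in line for line in t),
    assertions={"tool-called": lambda t: len(t) == 2},
)
print(res.verdicts)  # {'verifier': True, 'tool-called': True}
```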
## Setup

```sh
uv sync --dev
```

Required environment:

```sh
export SUPERPOWERS_ROOT=/path/to/superpowers
export ANTHROPIC_API_KEY=sk-...
```
## Usage

```sh
# Run a single scenario on a single backend
uv run drill run worktree-creation-from-main -b claude

# Run with N repetitions
uv run drill run pattern-match-trap -b claude-opus-4-6 --n 5

# Sweep across multiple backends
uv run drill run pattern-match-trap --models claude-opus-4-6,claude-opus-4-7 --n 10

# Compare results
uv run drill compare pattern-match-trap

# List available scenarios
uv run drill list
```
## Scenarios

| Category | Scenarios | Tests |
|---|---|---|
| Worktree | 8 scenarios (creation, detection, consent, detached HEAD) | Skill compliance for using-git-worktrees |
| Wave decomposition | 5 scenarios (naive, spec-aware, false overlap, dependency chain, conflict surface) | Plan → waves decomposition quality |
| Wave execution | 3 scenarios (minimal, full, task failure) | End-to-end wave execution + failure escalation |
| Pattern-match trap | 1 scenario | Investigation depth gap between 4.6 and 4.7 (PRI-1270) |
## Backends

| Backend | CLI | Model |
|---|---|---|
| `claude` | Claude Code | opus-4-7 (default) |
| `claude-opus-4-6` | Claude Code | opus-4-6 |
| `claude-opus-4-7` | Claude Code | opus-4-7 |
| `claude-opus-4-6-1m` | Claude Code | opus-4-6 (1M context) |
| `claude-opus-4-7-1m` | Claude Code | opus-4-7 (1M context) |
| `codex` | Codex CLI | — |
| `gemini` | Gemini CLI | — |
## Project structure

```
drill/               # Core engine
  cli.py             # Click CLI (run, compare, list)
  engine.py          # Tmux session orchestration
  actor.py           # User-simulator LLM
  verifier.py        # Criteria evaluator LLM
  assertions.py      # Deterministic post-session assertions
  compare.py         # Result loading and cross-backend comparison
  sweep.py           # Multi-backend N-rep orchestrator
  stats.py           # Wilson score confidence intervals
scenarios/           # YAML scenario definitions
setup_helpers/       # Repo fixture creators
backends/            # Per-backend YAML configs
bin/                 # Assertion helper scripts (tool-called, tool-count, etc.)
prompts/             # Actor and verifier system prompts
fixtures/            # Static template repos
tests/               # pytest suite (122 tests)
docs/                # Design spec and manual testing guide
```
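`stats.py` reports Wilson score confidence intervals for pass rates, which stay sensible at the small run counts typical of `--n 5` sweeps. A minimal standalone version (illustrative only, not the module's actual API):

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (default 95%) for `passes` successes out of `n` runs."""
    if n == 0:
        return (0.0, 1.0)  # no data: maximally uncertain
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)


lo, hi = wilson_interval(7, 10)
print(f"7/10 passes -> [{lo:.2f}, {hi:.2f}]")  # prints: 7/10 passes -> [0.40, 0.89]
```

Unlike the naive ±z·√(p(1−p)/n) interval, Wilson's never escapes [0, 1] and doesn't collapse to zero width when all runs pass or all fail.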
## Tests

```sh
uv run pytest
uv run ruff check
uv run ty check
```
## Writing a new scenario

- Create a setup helper in `setup_helpers/` if you need a custom fixture
- Register it in `setup_helpers/__init__.py`
- Create `scenarios/your-scenario.yaml` with setup, turns, limits, and verify sections
- Run it: `uv run drill run your-scenario -b claude`
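A scenario file following the steps above might look roughly like this — the field names here are assumptions for illustration, not the actual schema; check the existing files under `scenarios/` for the real shape:

```yaml
# Illustrative shape only -- field names are guesses, not the real schema.
setup:
  helper: your_setup_helper        # registered in setup_helpers/__init__.py
turns:
  - intent: "Ask the agent to start work on a new feature"
limits:
  max_turns: 10
verify:
  criteria:
    - "Agent created a git worktree before editing files"
```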
See `docs/design.md` for the full design spec.