# Drill

Superpowers skill compliance benchmark. Drives AI coding agents through tmux sessions and evaluates whether they follow superpowers workflows correctly.

## How it works

1. **Setup** — a helper creates a git repo with specific conditions (worktree state, plan files, code fixtures)
2. **Actor** — a Sonnet 4.6 LLM plays the user, following turn intents from the scenario YAML
3. **Agent** — the backend under test (Claude Code, Codex, Gemini CLI) runs in a real tmux session (see the sketch below)
4. **Verifier** — a Sonnet 4.6 LLM evaluates the session transcript + filesystem against criteria
5. **Assertions** — deterministic checks (tool-called, tool-count, shell commands) run post-session
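
To make step 3 concrete: the core tmux mechanic reduces to sending keystrokes into a pane and scraping its scrollback. The sketch below is illustrative, not drill's `engine.py`; the session name, fixed sleep, and function names are assumptions, though the tmux subcommands themselves (`send-keys`, `capture-pane`) are real.

```python
# Minimal sketch of driving an agent in tmux; not drill's actual engine.
# Function names, session handling, and the fixed sleep are illustrative.
import subprocess
import time


def tmux(*args: str) -> str:
    """Run a tmux subcommand and return its stdout."""
    return subprocess.run(
        ["tmux", *args], check=True, capture_output=True, text=True
    ).stdout


def send_turn(session: str, message: str) -> str:
    """Type one user message into the agent's pane, then grab the scrollback."""
    tmux("send-keys", "-t", session, message, "Enter")
    time.sleep(5)  # a real harness polls for idleness instead of sleeping
    return tmux("capture-pane", "-t", session, "-p", "-S", "-200")


# Example, assuming a session started with:
#   tmux new-session -d -s drill-demo "claude"
# transcript = send_turn("drill-demo", "Please fix the failing test")
```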

## Setup

```bash
uv sync --extra dev
```

Required environment:

```bash
export ANTHROPIC_API_KEY=sk-...
```

`SUPERPOWERS_ROOT` defaults to the parent of `evals/` (the superpowers repo root) and only needs to be set if you're running drill against a different superpowers checkout.

## Usage

```bash
# Run a single scenario on a single backend
uv run drill run worktree-creation-from-main -b claude

# Run with N repetitions
uv run drill run pattern-match-trap -b claude-opus-4-6 --n 5

# Sweep across multiple backends
uv run drill run pattern-match-trap --models claude-opus-4-6,claude-opus-4-7 --n 10

# Compare results
uv run drill compare pattern-match-trap

# List available scenarios
uv run drill list
```

## Scenarios

| Category | Scenarios | Tests |
|---|---|---|
| Worktree | 8 scenarios (creation, detection, consent, detached HEAD) | Skill compliance for `using-git-worktrees` |
| Wave decomposition | 5 scenarios (naive, spec-aware, false overlap, dependency chain, conflict surface) | Plan → waves decomposition quality |
| Wave execution | 3 scenarios (minimal, full, task failure) | End-to-end wave execution + failure escalation |
| Pattern-match trap | 1 scenario | Investigation depth gap between 4.6 and 4.7 (PRI-1270) |

## Backends

| Backend | CLI | Model |
|---|---|---|
| `claude` | Claude Code | opus-4-7 (default) |
| `claude-opus-4-6` | Claude Code | opus-4-6 |
| `claude-opus-4-7` | Claude Code | opus-4-7 |
| `claude-opus-4-6-1m` | Claude Code | opus-4-6 (1M context) |
| `claude-opus-4-7-1m` | Claude Code | opus-4-7 (1M context) |
| `codex` | Codex CLI | |
| `gemini` | Gemini CLI | |

## Project structure

```
drill/              # Core engine
  cli.py            # Click CLI (run, compare, list)
  engine.py         # Tmux session orchestration
  actor.py          # User-simulator LLM
  verifier.py       # Criteria evaluator LLM
  assertions.py     # Deterministic post-session assertions
  compare.py        # Result loading and cross-backend comparison
  sweep.py          # Multi-backend N-rep orchestrator
  stats.py          # Wilson score confidence intervals (sketched below)
scenarios/          # YAML scenario definitions
setup_helpers/      # Repo fixture creators
backends/           # Per-backend YAML configs
bin/                # Assertion helper scripts (tool-called, tool-count, etc.)
prompts/            # Actor and verifier system prompts
fixtures/           # Static template repos
tests/              # pytest suite (122 tests)
docs/               # Design spec and manual testing guide
```
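
The Wilson score interval (what `stats.py` reports) gives better coverage than the naive normal approximation when `--n` is small and pass rates sit near 0 or 1. A standalone sketch of the statistic, assuming nothing about drill's actual API:

```python
# Wilson score confidence interval for a pass rate of k successes in n runs.
# Standalone illustration; drill's stats.py may expose a different interface.
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return a ~95% (z=1.96) confidence interval for the proportion k/n."""
    if n == 0:
        return (0.0, 1.0)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))


print(wilson_interval(7, 10))  # roughly (0.40, 0.89); wide, as honest CIs at n=10 are
```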

## Tests

```bash
uv run pytest
uv run ruff check
uv run ty check
```

## Writing a new scenario

1. Create a setup helper in `setup_helpers/` if you need a custom fixture (a minimal helper is sketched after this list)
2. Register it in `setup_helpers/__init__.py`
3. Create `scenarios/your-scenario.yaml` with `setup`, `turns`, `limits`, and `verify` sections
4. Run it: `uv run drill run your-scenario -b claude`
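
For step 1, a setup helper is just a function that builds a repo fixture on disk. The helper below is hypothetical: its name, signature, and registration contract are assumptions, so mirror an existing helper in `setup_helpers/` for the real interface; the git plumbing itself is standard.

```python
# setup_helpers/dirty_worktree.py (hypothetical example helper)
# Name and signature are illustrative; copy an existing helper for the
# interface drill actually expects.
import subprocess
from pathlib import Path


def create_dirty_worktree(repo_dir: Path) -> None:
    """Git repo with one commit plus an uncommitted edit left in the tree."""
    repo_dir.mkdir(parents=True, exist_ok=True)

    def run(*cmd: str) -> None:
        subprocess.run(cmd, cwd=repo_dir, check=True)

    run("git", "init")
    (repo_dir / "app.py").write_text("print('hello')\n")
    run("git", "add", ".")
    # Inline identity so the commit works in a clean CI environment.
    run("git", "-c", "user.email=drill@example.com", "-c", "user.name=drill",
        "commit", "-m", "initial commit")
    # Leave the working tree dirty so the scenario can probe worktree handling.
    (repo_dir / "app.py").write_text("print('hello, world')\n")
```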

See `docs/design.md` for the full design spec.