superpowers/evals/CLAUDE.md
Jesse Vincent 0bf37499b4 Address adversarial review findings
- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml
  uses [project.optional-dependencies], so --dev is a no-op for
  pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
  from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
  section with test-worktree-native-preference.sh (the worktree test
  is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
  list. evals/backends/ has claude*, codex, gemini configs but no
  copilot.yaml, so the claim was unsupported.

Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins
this round.
2026-05-06 15:47:39 -07:00


Drill

Superpowers skill compliance benchmark. Python 3.11+, managed with uv.

Commands

  • install: uv sync --extra dev
  • test: uv run pytest
  • test single: uv run pytest tests/test_engine.py -x -q
  • lint: uv run ruff check
  • format: uv run ruff format
  • typecheck: uv run ty check
  • run scenario: uv run drill run <scenario> -b <backend>
  • sweep: uv run drill run <scenario> --models claude-opus-4-6,claude-opus-4-7 --n 10
  • compare: uv run drill compare <scenario>
  • list: uv run drill list

Architecture

  • drill/engine.py — Tmux session orchestration. Creates workdir, runs setup helpers, drives actor/agent turns, collects results.
  • drill/actor.py — Sonnet 4.6 LLM simulating a user. Reads turn intents from scenario YAML and generates realistic prompts.
  • drill/verifier.py — Sonnet 4.6 LLM evaluating session transcript + filesystem against semantic criteria.
  • drill/assertions.py — Deterministic post-session checks. Runs shell commands from verify.assertions in the results dir.
  • drill/sweep.py — Multi-backend, N-repetition orchestrator. Wraps Engine with try/except per run, writes run-group.json manifest.
  • drill/compare.py — Loads results, computes pass rates and Wilson CIs, formats comparison tables.
  • drill/stats.py — Wilson score confidence interval for pass rate estimation at small N.
  • scenarios/*.yaml — Scenario definitions (setup, turns, limits, verify).
  • setup_helpers/*.py — Repo fixture creators. Each creates a git repo with specific conditions.
  • backends/*.yaml — Per-backend CLI config (args, env, idle patterns, shutdown commands).
  • bin/ — Assertion helper scripts: tool-called, tool-not-called, tool-count, tool-before, tool-arg-match. Run against tool_calls.jsonl in results dir.
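
The Wilson score interval that drill/stats.py is described as providing can be sketched as follows. This is a minimal illustration of the standard formula, not the code in drill/stats.py; the function name `wilson_interval` and the default `z = 1.96` (roughly a 95% interval) are assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) bounds of the Wilson score interval for a pass rate.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")

    p = passes / n                      # observed pass rate
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom

    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 8 passes out of 10 runs this gives roughly (0.49, 0.94). Unlike the normal-approximation interval, the bounds stay inside [0, 1] and do not collapse to zero width when every run passes or fails, which is why it suits small N.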

Conventions

  • Setup helpers take workdir: Path and mutate the filesystem. Register in setup_helpers/__init__.py.
  • Scenarios set user_posture to naive (the simulated user never names skills) or spec-aware (the simulated user may name skills).
  • Verify criteria are semantic (LLM-evaluated). Verify assertions are deterministic (exit code 0 = pass).
  • Assertions run in the results dir with $DRILL_WORKDIR pointing to the scenario workdir and bin/ on PATH.
  • Backend YAMLs are fully self-contained — no override/alias system.
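
The assertion convention above (run in the results dir, $DRILL_WORKDIR set, bin/ on PATH, exit code 0 = pass) can be sketched roughly as below. This is an illustrative sketch under those stated conventions, not the code in drill/assertions.py; `run_assertion` and its parameters are hypothetical names.

```python
import os
import subprocess
from pathlib import Path


def run_assertion(command: str, results_dir: Path, workdir: Path, bin_dir: Path) -> bool:
    """Run one shell assertion and report whether it passed.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    env = dict(os.environ)
    # Expose the scenario workdir to the assertion command.
    env["DRILL_WORKDIR"] = str(workdir)
    # Put the helper scripts first on PATH so commands like tool-called resolve.
    env["PATH"] = f"{bin_dir.resolve()}{os.pathsep}{env.get('PATH', '')}"

    completed = subprocess.run(
        command,
        shell=True,          # assertions are written as shell command strings
        cwd=results_dir,     # assertions run in the results dir
        env=env,
        capture_output=True,
        text=True,
    )
    # Exit code 0 means the assertion passed; anything else is a failure.
    return completed.returncode == 0
```

Because the only signal is the exit code, an assertion can be any shell one-liner, for example `test -f "$DRILL_WORKDIR/README.md"`.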

Required env

ANTHROPIC_API_KEY=sk-...

SUPERPOWERS_ROOT defaults to the parent of evals/ (the superpowers repo root). Override only if running drill against a different superpowers checkout.