mirror of
https://github.com/obra/superpowers.git
synced 2026-05-08 18:19:04 +08:00
- evals/README.md, evals/CLAUDE.md: fix uv install command from 'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses [project.optional-dependencies], so --dev is a no-op for pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh section with test-worktree-native-preference.sh (the worktree test is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness list. evals/backends/ has claude*, codex, gemini configs but no copilot.yaml, so the claim was unsupported.

Adversarial review credit: reviewer #2 found four legitimate issues (uv-sync, run-skill-tests stale ref, README stale ref via #1, and Copilot CLI fabrication); reviewer #1 found two distinct issues (run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this round.
Drill
Superpowers skill compliance benchmark. Python 3.11+, managed with uv.
Commands
- install: `uv sync --extra dev`
- test: `uv run pytest`
- test single: `uv run pytest tests/test_engine.py -x -q`
- lint: `uv run ruff check`
- format: `uv run ruff format`
- typecheck: `uv run ty check`
- run scenario: `uv run drill run <scenario> -b <backend>`
- sweep: `uv run drill run <scenario> --models claude-opus-4-6,claude-opus-4-7 --n 10`
- compare: `uv run drill compare <scenario>`
- list: `uv run drill list`
Architecture
- `drill/engine.py`: Tmux session orchestration. Creates workdir, runs setup helpers, drives actor/agent turns, collects results.
- `drill/actor.py`: Sonnet 4.6 LLM simulating a user. Reads turn intents from scenario YAML and generates realistic prompts.
- `drill/verifier.py`: Sonnet 4.6 LLM evaluating session transcript + filesystem against semantic criteria.
- `drill/assertions.py`: Deterministic post-session checks. Runs shell commands from `verify.assertions` in the results dir.
- `drill/sweep.py`: Multi-backend, N-repetition orchestrator. Wraps Engine with try/except per run, writes `run-group.json` manifest.
- `drill/compare.py`: Loads results, computes pass rates and Wilson CIs, formats comparison tables.
- `drill/stats.py`: Wilson score confidence interval for pass rate estimation at small N.
- `scenarios/*.yaml`: Scenario definitions (setup, turns, limits, verify).
- `setup_helpers/*.py`: Repo fixture creators. Each creates a git repo with specific conditions.
- `backends/*.yaml`: Per-backend CLI config (args, env, idle patterns, shutdown commands).
- `bin/`: Assertion helper scripts: `tool-called`, `tool-not-called`, `tool-count`, `tool-before`, `tool-arg-match`. Run against `tool_calls.jsonl` in the results dir.
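The Wilson score interval named for `drill/stats.py` can be sketched as follows. This is a minimal illustration of the standard formula, not Drill's actual code: the function name `wilson_interval`, its signature, and the default `z = 1.96` (roughly a 95% interval) are assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passes` out of `n` runs.

    Hypothetical helper for illustration; Drill's own function may differ.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")

    p_hat = passes / n
    z2 = z * z
    denom = 1 + z2 / n
    # The interval is centred on a value pulled toward 0.5, not on p_hat itself.
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(8, 10)
print(f"8/10 passes: [{low:.3f}, {high:.3f}]")
```

For 8 passes out of 10 this gives roughly 0.49 to 0.94. Unlike the plain normal approximation, the interval stays within [0, 1] and does not collapse to zero width at 0/N or N/N, which is why it suits small N.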
Conventions
- Setup helpers take `workdir: Path` and mutate the filesystem. Register in `setup_helpers/__init__.py`.
- Scenarios use `user_posture: naive` (no skill names) or `spec-aware` (can name skills).
- Verify criteria are semantic (LLM-evaluated). Verify assertions are deterministic (exit code 0 = pass).
- Assertions run in the results dir with `$DRILL_WORKDIR` pointing to the scenario workdir and `bin/` on PATH.
- Backend YAMLs are fully self-contained; there is no override/alias system.
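The assertion convention above (run in the results dir, `$DRILL_WORKDIR` set, `bin/` on PATH, exit code 0 = pass) could look roughly like this. It is a sketch under stated assumptions, not the code in `drill/assertions.py`: the function name `run_assertion` and its parameters are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def run_assertion(command: str, results_dir: Path, workdir: Path, bin_dir: Path) -> bool:
    """Run one shell assertion and report whether it passed (exit code 0).

    Hypothetical helper; names and signature are assumptions for illustration.
    """
    env = dict(os.environ)
    # Expose the scenario workdir to the assertion command.
    env["DRILL_WORKDIR"] = str(workdir)
    # Prepend bin/ so helper scripts resolve by bare name.
    env["PATH"] = f"{bin_dir}{os.pathsep}{env.get('PATH', '')}"
    completed = subprocess.run(
        command,
        shell=True,          # assertions are written as shell command strings
        cwd=results_dir,     # run inside the results dir
        env=env,
        capture_output=True,
        text=True,
    )
    return completed.returncode == 0
```

A call such as `run_assertion('test -d "$DRILL_WORKDIR/.git"', results_dir, workdir, bin_dir)` would pass only if the scenario workdir contains a git repository.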
Required env
ANTHROPIC_API_KEY=sk-...
SUPERPOWERS_ROOT defaults to the parent of evals/ (the superpowers repo root). Override only if running drill against a different superpowers checkout.