Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

Copies obra/drill@013fcb8b7d into superpowers/evals/ via rsync, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. This commit does not touch the drill repo; archiving it is a separate manual step after this PR merges. The source SHA is recorded in evals/.drill-source-sha for divergence detection.
2026-05-10 02:59:04 +08:00 · 2026-05-06 12:15:46 -07:00
parent 2e46e9590d
commit 3b412a3836
124 changed files with 13806 additions and 0 deletions
--- a/evals/README.md
+++ b/evals/README.md
@@ -0,0 +1,104 @@
+# Drill
+
+Drill is a skill compliance benchmark for Superpowers. It drives AI coding
+agents through tmux sessions and evaluates whether they follow Superpowers
+workflows correctly.
+
+## How it works
+
+1. **Setup** — a helper creates a git repo with specific conditions (worktree state, plan files, code fixtures)
+2. **Actor** — a Sonnet 4.6 LLM plays the user, following turn intents from the scenario YAML
+3. **Agent** — the backend under test (Claude Code, Codex, Gemini CLI) runs in a real tmux session
+4. **Verifier** — a Sonnet 4.6 LLM evaluates the session transcript + filesystem against criteria
+5. **Assertions** — deterministic checks (tool-called, tool-count, shell commands) run post-session
+
+## Setup
+
+```bash
+uv sync --dev
+```
+
+Required environment variables:
+```bash
+export SUPERPOWERS_ROOT=/path/to/superpowers
+export ANTHROPIC_API_KEY=sk-...
+```
+
+## Usage
+
+```bash
+# Run a single scenario on a single backend
+uv run drill run worktree-creation-from-main -b claude
+
+# Run with N repetitions
+uv run drill run pattern-match-trap -b claude-opus-4-6 --n 5
+
+# Sweep across multiple backends
+uv run drill run pattern-match-trap --models claude-opus-4-6,claude-opus-4-7 --n 10
+
+# Compare results
+uv run drill compare pattern-match-trap
+
+# List available scenarios
+uv run drill list
+```
+
+## Scenarios
+
+| Category | Scenarios | What it tests |
+|----------|-----------|-------|
+| Worktree | 8 scenarios (creation, detection, consent, detached HEAD) | Skill compliance for `using-git-worktrees` |
+| Wave decomposition | 5 scenarios (naive, spec-aware, false overlap, dependency chain, conflict surface) | Plan → waves decomposition quality |
+| Wave execution | 3 scenarios (minimal, full, task failure) | End-to-end wave execution + failure escalation |
+| Pattern-match trap | 1 scenario | Investigation depth gap between Opus 4.6 and Opus 4.7 (PRI-1270) |
+
+## Backends
+
+| Backend | CLI | Model |
+|---------|-----|-------|
+| `claude` | Claude Code | opus-4-7 (default) |
+| `claude-opus-4-6` | Claude Code | opus-4-6 |
+| `claude-opus-4-7` | Claude Code | opus-4-7 |
+| `claude-opus-4-6-1m` | Claude Code | opus-4-6 (1M context) |
+| `claude-opus-4-7-1m` | Claude Code | opus-4-7 (1M context) |
+| `codex` | Codex CLI | — |
+| `gemini` | Gemini CLI | — |
+
+## Project structure
+
+```
+drill/              # Core engine
+  cli.py            # Click CLI (run, compare, list)
+  engine.py         # Tmux session orchestration
+  actor.py          # User-simulator LLM
+  verifier.py       # Criteria evaluator LLM
+  assertions.py     # Deterministic post-session assertions
+  compare.py        # Result loading and cross-backend comparison
+  sweep.py          # Multi-backend N-rep orchestrator
+  stats.py          # Wilson score confidence intervals
+scenarios/          # YAML scenario definitions
+setup_helpers/      # Repo fixture creators
+backends/           # Per-backend YAML configs
+bin/                # Assertion helper scripts (tool-called, tool-count, etc.)
+prompts/            # Actor and verifier system prompts
+fixtures/           # Static template repos
+tests/              # pytest suite (122 tests)
+docs/               # Design spec and manual testing guide
+```
+
+## Tests
+
+```bash
+uv run pytest       # test suite
+uv run ruff check   # lint
+uv run ty check     # type check
+```
+
+## Writing a new scenario
+
+1. Create a setup helper in `setup_helpers/` if you need a custom fixture
+2. Register it in `setup_helpers/__init__.py`
+3. Create `scenarios/your-scenario.yaml` with setup, turns, limits, and verify sections
+4. Run it: `uv run drill run your-scenario -b claude`
+
+See [docs/design.md](docs/design.md) for the full design spec.
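
The project structure above lists `stats.py` as providing Wilson score confidence intervals, which suit pass rates from a small number of repetitions (`--n 5`, `--n 10`). The following is a minimal sketch of the standard Wilson formula, not the code in `stats.py`; the function name `wilson_interval`, its signature, and the default `z = 1.96` (roughly a 95% interval) are assumptions made for illustration.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval (low, high) for a binomial pass rate.

    Hypothetical helper for illustration; not taken from drill's stats.py.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials            # observed pass rate
    z2 = z * z
    denom = 1 + z2 / trials               # shared normalising term
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(8, 10)
    print(f"8/10 passes: [{low:.3f}, {high:.3f}]")
```

For example, `wilson_interval(8, 10)` describes a scenario that passed 8 of 10 repetitions. The interval is wide at that sample size, which is one reason to raise `--n` before reading much into a difference between two backends.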