Mirror of https://github.com/obra/superpowers.git (synced 2026-05-11 03:29:04 +08:00)
Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
45
evals/CLAUDE.md
Normal file
@@ -0,0 +1,45 @@
# Drill
Superpowers skill compliance benchmark. Python 3.11+, managed with uv.
## Commands
- **install**: `uv sync --dev`
- **test**: `uv run pytest`
- **test single**: `uv run pytest tests/test_engine.py -x -q`
- **lint**: `uv run ruff check`
- **format**: `uv run ruff format`
- **typecheck**: `uv run ty check`
- **run scenario**: `uv run drill run <scenario> -b <backend>`
- **sweep**: `uv run drill run <scenario> --models claude-opus-4-6,claude-opus-4-7 --n 10`
- **compare**: `uv run drill compare <scenario>`
- **list**: `uv run drill list`
## Architecture
- `drill/engine.py` — Tmux session orchestration. Creates workdir, runs setup helpers, drives actor/agent turns, collects results.
- `drill/actor.py` — Sonnet 4.6 LLM simulating a user. Reads turn intents from scenario YAML and generates realistic prompts.
- `drill/verifier.py` — Sonnet 4.6 LLM evaluating session transcript + filesystem against semantic criteria.
- `drill/assertions.py` — Deterministic post-session checks. Runs shell commands from `verify.assertions` in the results dir.
- `drill/sweep.py` — Multi-backend, N-repetition orchestrator. Wraps Engine with try/except per run, writes run-group.json manifest.
- `drill/compare.py` — Loads results, computes pass rates and Wilson CIs, formats comparison tables.
- `drill/stats.py` — Wilson score confidence interval for pass rate estimation at small N.
- `scenarios/*.yaml` — Scenario definitions (setup, turns, limits, verify).
- `setup_helpers/*.py` — Repo fixture creators. Each creates a git repo with specific conditions.
- `backends/*.yaml` — Per-backend CLI config (args, env, idle patterns, shutdown commands).
- `bin/` — Assertion helper scripts: `tool-called`, `tool-not-called`, `tool-count`, `tool-before`, `tool-arg-match`. Run against `tool_calls.jsonl` in results dir.
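
The Wilson score interval named for `drill/stats.py` can be sketched as follows. This is a minimal illustration of the standard formula, not Drill's actual code: the function name `wilson_interval`, its signature, and the default `z = 1.96` (roughly a 95% interval) are assumptions made for this example.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, not Drill's API)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    z2 = z * z
    denom = 1 + z2 / n
    # The centre is pulled toward 0.5, which keeps the interval sensible at small n.
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to guard against tiny floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(8, 10)
print(f"8/10 passes: [{low:.3f}, {high:.3f}]")
```

Unlike the naive normal approximation, this interval never leaves [0, 1] and does not collapse to zero width when every run passes or fails, which matters for sweeps with `--n 10`.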
## Conventions
- Setup helpers take `workdir: Path` and mutate the filesystem. Register in `setup_helpers/__init__.py`.
- Scenarios use `user_posture: naive` (no skill names) or `spec-aware` (can name skills).
- Verify criteria are semantic (LLM-evaluated). Verify assertions are deterministic (exit code 0 = pass).
- Assertions run in the results dir with `$DRILL_WORKDIR` pointing to the scenario workdir and `bin/` on PATH.
- Backend YAMLs are fully self-contained — no override/alias system.
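
The assertion conventions above (run in the results dir, `$DRILL_WORKDIR` set, `bin/` on PATH, exit code 0 = pass) could be sketched roughly like this. It assumes a POSIX shell is available; `run_assertion` and its parameters are hypothetical names for illustration, and the real logic in `drill/assertions.py` may differ.

```python
import os
import subprocess
from pathlib import Path


def run_assertion(command: str, results_dir: Path, workdir: Path, bin_dir: Path) -> bool:
    """Run one shell assertion and report pass/fail (hypothetical sketch)."""
    env = dict(os.environ)
    # Expose the scenario workdir to the assertion command.
    env["DRILL_WORKDIR"] = str(workdir)
    # Prepend the helper scripts directory so bare helper names resolve first.
    env["PATH"] = f"{bin_dir}{os.pathsep}{env.get('PATH', '')}"
    completed = subprocess.run(
        command,
        shell=True,
        cwd=results_dir,
        env=env,
        capture_output=True,
        text=True,
    )
    # Exit code 0 means the assertion passed; anything else is a failure.
    return completed.returncode == 0
```

A call such as `run_assertion('test -d "$DRILL_WORKDIR"', results_dir, workdir, bin_dir)` would then pass whenever the scenario workdir still exists.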
## Required env
```
SUPERPOWERS_ROOT=/path/to/superpowers
ANTHROPIC_API_KEY=sk-...
```
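
A preflight check for the two variables above might look like the sketch below. The names `REQUIRED_ENV` and `missing_env` are made up for this example; nothing here is taken from Drill itself, which may validate its environment differently or not at all.

```python
import os
from collections.abc import Mapping

# The two variables listed under "Required env".
REQUIRED_ENV = ("SUPERPOWERS_ROOT", "ANTHROPIC_API_KEY")


def missing_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit(f"Missing required env: {', '.join(missing)}")
```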