Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
Author: Jesse Vincent
Date: 2026-05-06 12:15:46 -07:00
Committed by: Drew Ritter
Parent: 2e46e9590d
Commit: 3b412a3836
124 changed files with 13806 additions and 0 deletions

evals/docs/design.md (new file)

@@ -0,0 +1,418 @@
# Drill: Superpowers Skill Compliance Benchmark
**Date:** 2026-04-07
**Ticket:** [PRI-1040](https://linear.app/prime-radiant/issue/PRI-1040)
**Status:** Design
## Thesis
The value of superpowers depends on whether skills are reliably followed by *any* coding agent — not just Claude Code. Drill tests whether agents actually fire skills, follow workflows, and use native tooling when available. It is a **compliance benchmark**, not a coding ability benchmark.
If a well-written skill produces consistent behavior across Claude Code and Codex, the agent-agnostic coordination layer is working. If agents diverge, Drill tells you exactly where and why.
## What Drill Tests
- Do agents invoke superpowers skills when they should?
- Do they follow multi-step workflows (detect → consent → create) in the right order?
- Do they use native tools (EnterWorktree, structured session logs) vs. raw shell commands?
- Where do agents diverge, and what does that tell us about skill format?
The first scenarios target **PRI-974 (worktree rototill)** — the area with the most cross-agent fragmentation today.
## Architecture
Three layers, each with a single responsibility:
```
┌─────────────────────────────────────────┐
│               CLI (click)               │
│          run / compare / list           │
├─────────────────────────────────────────┤
│                 Engine                  │
│  ┌───────────┐  ┌───────┐  ┌──────────┐ │
│  │  Session  │  │ Actor │  │ Verifier │ │
│  │  (tmux)   │  │ (LLM) │  │  (LLM)   │ │
│  └───────────┘  └───────┘  └──────────┘ │
├─────────────────────────────────────────┤
│                Backends                 │
│    claude / codex / (future: gemini)    │
├─────────────────────────────────────────┤
│                  Setup                  │
│  template repo + helpers + assertions   │
└─────────────────────────────────────────┘
```
- **CLI** — `drill run <scenario> --backend claude`, `drill compare <scenario>`, `drill list`
- **Engine** — Orchestrates the full run lifecycle (setup → session → actor loop → collect → verify → results)
- **Session** — tmux lifecycle: create session, send-keys, capture-pane, kill session
- **Actor** — Sonnet with rolling context. Gets all scenario intents as a goal stack + terminal screens. Outputs what to type next, or `<<DONE>>`/`<<STUCK>>`.
- **Verifier** — Sonnet (near-zero temperature) with full session log + filesystem state + tool call log + criteria list. Returns per-criterion pass/fail with cited evidence + freeform observations.
- **Backends** — Each backend knows: CLI command, auto-approve flags, plugin loading, idle detection, shutdown command, session log location.
- **Setup** — Clone template repo → run backend pre_run hooks → run scenario helpers → run setup assertions → fail fast if invariants violated.
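For reference, here is a minimal sketch of the Session layer as a tmux wrapper over `subprocess`. The class and method names are illustrative assumptions, not the real module API:
```python
import subprocess

class TmuxSession:
    """Illustrative Session layer: tmux lifecycle via subprocess.
    Names are assumptions about session.py, not the actual drill API."""

    def __init__(self, name: str, cols: int = 200, rows: int = 50):
        self.name = name
        # create a detached session with backend-specific dimensions
        subprocess.run(
            ["tmux", "new-session", "-d", "-s", name, "-x", str(cols), "-y", str(rows)],
            check=True,
        )

    def send_keys(self, text: str) -> None:
        subprocess.run(["tmux", "send-keys", "-t", self.name, text, "Enter"], check=True)

    def capture_pane(self) -> str:
        out = subprocess.run(
            ["tmux", "capture-pane", "-t", self.name, "-p"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout

    def kill(self) -> None:
        # best-effort cleanup if the agent process didn't exit on its own
        subprocess.run(["tmux", "kill-session", "-t", self.name], check=False)
```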
## Engine Flow
```
1. LOAD
   - Parse scenario YAML
   - Parse backend YAML
   - Validate required env vars (fail fast)
2. SETUP
   - Clone template repo to temp dir
   - Run backend pre_run hooks (codex symlink, etc.)
   - Run scenario setup helpers
   - Run setup assertions → abort if any fail
3. SESSION
   - Create tmux session (backend-specific terminal dimensions)
   - Launch agent CLI in tmux pane
   - Wait for startup ready pattern
4. ACTOR LOOP
   - For each turn (up to max_turns):
     a. Wait for idle (quiescence + ready pattern)
     b. Capture terminal pane → append to rolling context
     c. Send to Actor LLM: system prompt + rolling context + ALL intents + user_posture
     d. Actor responds with text to type, <<DONE>>, or <<STUCK>>
     e. If <<DONE>> or <<STUCK>> → break
     f. Send keystrokes via tmux send-keys
     g. Per-turn timeout → <<STUCK>> if exceeded
   - Special keys via <<KEY:name>> convention (e.g., <<KEY:ctrl-c>>)
5. COLLECT
   - Capture final terminal state
   - Send shutdown command (backend-specific: /exit, Ctrl-D, etc.)
   - Wait for process exit (with timeout)
   - Snapshot filesystem (file tree, git state, worktree list)
   - Collect backend session logs → tool_calls.jsonl
   - Kill tmux session (cleanup if process didn't exit cleanly)
6. VERIFY
   - Send to Verifier LLM: session.log + filesystem.json + tool_calls.jsonl + criteria
   - Verifier receives criteria but NOT actor intents (reduces confirmation bias)
   - Verifier returns per-criterion pass/fail with evidence + rationale + observations
   - Output as structured JSON (verdict.json)
7. RESULTS
   - Write to results/<scenario>/<backend>/<timestamp>/
   - Print summary to stdout
```
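A minimal sketch of step 4, assuming hypothetical `session` and `actor` objects shaped like the layers above:
```python
def actor_loop(session, actor, intents, user_posture, max_turns=20, turn_timeout=120):
    """Sketch of the actor loop. `session` and `actor` are hypothetical
    objects shaped like the Session and Actor layers described above."""
    context = []  # rolling terminal captures
    for _ in range(max_turns):
        session.wait_for_idle(timeout=turn_timeout)    # quiescence + ready pattern
        context.append(session.capture_pane())
        action = actor.next_action(context, intents, user_posture)
        if action["action"] in ("done", "stuck"):
            return action["action"]
        if action["action"] == "key":                  # <<KEY:name>> convention
            session.send_key(action["key"])
        else:
            session.send_keys(action["text"])
    return "stuck"  # max_turns exhausted counts as stuck
```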
## Backend Abstraction
Each backend is a YAML config. Backends own: CLI invocation, idle detection, shutdown, session log collection, and pre/post-run hooks.
```yaml
# backends/claude.yaml
name: claude
cli: claude
args:
  - "--dangerously-skip-permissions"
  - "--plugin-dir"
  - "${SUPERPOWERS_ROOT}"
required_env:
  - ANTHROPIC_API_KEY
  - SUPERPOWERS_ROOT
hooks:
  pre_run: []  # no repo setup needed; plugin loaded via --plugin-dir
  post_run: []
shutdown: "/exit"
idle:
  quiescence_seconds: 3
  ready_pattern: "^>|^\\$|Human:"
  startup_timeout: 30
terminal:
  cols: 200
  rows: 50
session_logs:
  pattern: "~/.claude/projects/**/session-*.jsonl"
  match_by: timestamp
```
```yaml
# backends/codex.yaml
name: codex
cli: codex
args:
  - "--dangerously-bypass-approvals-and-sandbox"
required_env:
  - OPENAI_API_KEY
  - SUPERPOWERS_ROOT
hooks:
  pre_run:
    - symlink_superpowers  # creates .agents/skills/superpowers symlink in test repo
  post_run: []
shutdown: "<<KEY:ctrl-d>>"
idle:
  quiescence_seconds: 5
  ready_pattern: "codex>|^>"
  startup_timeout: 30
terminal:
  cols: 200
  rows: 50
session_logs:
  pattern: "~/.codex/sessions/rollout-*.jsonl"
  match_by: timestamp
```
New backends = new YAML file. Backend variants (e.g., `codex-workspace-write.yaml`) are just copies with different args — no inheritance system needed. Scenarios reference backends by name.
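Loading a backend then reduces to parsing the YAML, validating `required_env`, and interpolating `${VAR}` references into args. A sketch of what `backend.py` might do (the function name and error handling are assumptions):
```python
import os
import string

import yaml

def load_backend(path: str) -> dict:
    """Sketch: parse a backend YAML, fail fast on missing env vars,
    and interpolate ${VAR} references (e.g. ${SUPERPOWERS_ROOT}) in args."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    missing = [v for v in cfg.get("required_env", []) if v not in os.environ]
    if missing:
        raise SystemExit(f"missing required env vars: {', '.join(missing)}")
    cfg["args"] = [string.Template(a).substitute(os.environ) for a in cfg.get("args", [])]
    return cfg
```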
## Scenario Format
Scenarios are YAML. They describe *what* to test, not *how* each backend works.
```yaml
scenario: worktree-creation-from-main
description: "Agent creates an isolated worktree from main branch"
user_posture: naive  # or spec-aware
setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"
    - "git worktree list | wc -l | grep 1"
turns:
  - intent: >
      Ask the agent to create an isolated workspace
      for building a login feature.
  - intent: "Confirm consent if the agent asks."
limits:
  max_turns: 20
  turn_timeout: 120  # seconds per turn
verify:
  criteria:
    - "Agent detected it was on main, not in an existing worktree"
    - "Agent asked for consent before creating the worktree"
    - "A worktree or isolated workspace now exists with a feature branch"
    - "Agent used the most appropriate tool available for its platform to create the worktree"
  observe: true  # verifier can add freeform observations
```
### User Posture
Each scenario has a `user_posture` field:
- **naive** — User describes what they want in plain language. Tests whether the agent's superpowers skills fire without hand-holding.
- **spec-aware** — User references specific skills or conventions by name. Tests whether the agent follows the spec when pointed at it.
The delta between naive and spec-aware results for the same scenario is the most interesting product signal. A small delta means strong conveyance. A large delta means the skill format needs work.
### Turn Intents
Intents are a **priority-ordered goal stack**, not a rigid script. The actor receives all intents and decides which one applies to the current terminal state. Some intents are conditional ("Confirm consent if the agent asks") and may never fire.
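Since pydantic is already a dependency for verdict validation, the scenario format could be validated the same way. A hedged sketch of the schema implied by the example above (defaults and types are assumptions):
```python
from pydantic import BaseModel

class Turn(BaseModel):
    intent: str

class Limits(BaseModel):
    max_turns: int = 20
    turn_timeout: int = 120  # seconds per turn

class Verify(BaseModel):
    criteria: list[str]
    observe: bool = False

class Scenario(BaseModel):
    scenario: str
    description: str
    user_posture: str  # "naive" or "spec-aware"
    setup: dict        # helpers + assertions, validated at setup time
    turns: list[Turn]  # priority-ordered goal stack, not a rigid script
    limits: Limits
    verify: Verify
```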
## Setup
### Template Repo
A real git repo checked into `fixtures/template-repo/`. Cloned to a temp directory per run. Covers the 80% common case.
Contents:
- `package.json` — minimal Node project metadata (name, version)
- `src/index.js` — simple entry point (~10 lines)
- `src/utils.js` — helper module (~10 lines)
- `README.md` — basic project description
- 3-4 commits on `main` with realistic messages (e.g., "initial commit", "add utils module", "update readme")
- No existing worktrees, branches, or tags beyond `main`
This is intentionally minimal — just enough for agents to recognize it as a real project. Scenario-specific state (extra branches, worktrees, detached HEAD) is added by setup helpers.
### Setup Helpers
Python functions in `setup_helpers/` that modify the cloned repo for specific scenarios:
- `create_base_repo(workdir)` — Clone template, verify structure
- `add_worktree(workdir, branch, path)` — Create an existing worktree (for "already inside" scenarios)
- `detach_head(workdir)` — Simulate Codex App detached HEAD state
- `symlink_superpowers(workdir)` — Create `.agents/skills/superpowers` symlink (codex pre_run hook)
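For illustration, `add_worktree` might be a thin `subprocess` wrapper around `git worktree add` (a sketch, not the actual helper):
```python
import subprocess

def add_worktree(workdir: str, branch: str, path: str) -> None:
    """Sketch: create a pre-existing worktree with a new branch,
    for 'already inside a worktree' scenarios."""
    subprocess.run(
        ["git", "worktree", "add", "-b", branch, path],
        cwd=workdir, check=True, capture_output=True,
    )
```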
### Setup Assertions
Run after all setup completes, before the agent launches. If any fail, the scenario aborts with a clear "setup invariant violated" error — not a mysterious agent failure 10 turns later.
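A sketch of how the assertion runner could work, treating each assertion as a shell pipeline whose non-zero exit aborts the run:
```python
import subprocess

def run_setup_assertions(workdir: str, assertions: list[str]) -> None:
    """Run each assertion as a shell pipeline; abort before the agent launches."""
    for cmd in assertions:
        result = subprocess.run(cmd, shell=True, cwd=workdir,
                                capture_output=True, text=True)
        if result.returncode != 0:
            raise SystemExit(f"setup invariant violated: {cmd!r}\n{result.stderr}")
```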
## Plugin Loading
Each backend loads superpowers differently. The harness manages this per-run with no global config mutation:
| Backend | Mechanism | Harness action |
|---------|-----------|----------------|
| Claude Code | `--plugin-dir` CLI flag | Pass flag pointing at superpowers checkout |
| Codex | `.agents/skills/` in repo | Backend pre_run hook creates symlink |
This means Drill can test draft skill changes by pointing at a branch checkout of superpowers.
## Post-Session Tool Call Collection
Both backends write structured session logs that record every tool invocation:
| Backend | Log location | Format |
|---------|-------------|--------|
| Claude Code | `~/.claude/projects/**/session-*.jsonl` | JSONL with tool names + args |
| Codex | `~/.codex/sessions/rollout-*.jsonl` | JSONL with `LocalShellCall`, `FunctionCall`, etc. |
The harness snapshots each backend's log directory before the session starts. After shutdown, it diffs the directory to find only files created during the run — no timestamp matching needed, no cross-contamination from concurrent sessions or prior runs.
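The snapshot diff is simple to express. A sketch using the glob patterns from the backend YAML:
```python
import glob
import os

def snapshot_logs(pattern: str) -> set[str]:
    """Record which session-log files exist before the run starts."""
    return set(glob.glob(os.path.expanduser(pattern), recursive=True))

def logs_created_during_run(pattern: str, before: set[str]) -> list[str]:
    """After shutdown: diff against the pre-run snapshot to find only
    the files created during this run."""
    return sorted(snapshot_logs(pattern) - before)
```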
Collected logs are normalized into a common `tool_calls.jsonl` format before the verifier sees them:
```json
{"tool": "EnterWorktree", "args": {"branch": "add-login"}, "source": "native"}
{"tool": "Bash", "args": {"command": "git worktree add ..."}, "source": "shell"}
```
Each backend defines a normalizer function that maps its native log format (Claude Code's tool call entries, Codex's `ResponseItem` records) into this common schema. The verifier never sees raw backend-specific logs.
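A hypothetical normalizer for one Claude Code log line; the field names (`type`, `name`, `input`) are assumptions about the log schema, not documented structure:
```python
import json

def normalize_claude_entry(raw_line: str) -> dict | None:
    """Hypothetical normalizer: map one Claude Code log entry to the common
    schema. Field names here are assumptions about the log format."""
    entry = json.loads(raw_line)
    if entry.get("type") != "tool_use":
        return None  # skip non-tool entries (messages, thinking, etc.)
    return {
        "tool": entry["name"],
        "args": entry.get("input", {}),
        "source": "shell" if entry["name"] == "Bash" else "native",
    }
```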
## Actor & Verifier LLM Design
### Actor
- **Model:** Sonnet
- **Temperature:** 0.7 (realistic user variation)
- **Context:** Rolling (full conversation history). Sessions are short enough (~5-20 turns) that token cost is not a concern.
- **Input:** System prompt + rolling terminal captures + all intents + user_posture
- **Output:** Structured JSON via Anthropic SDK tool_use: `{"action": "type", "text": "..."}`, `{"action": "done"}`, `{"action": "stuck"}`, or `{"action": "key", "key": "ctrl-c"}`. The harness parses this and sends keystrokes — no free-text sanitization needed.
- **Prompt:** Versioned template at `prompts/actor.md`
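With the Anthropic SDK, forcing structured output via tool_use looks roughly like this; the tool schema and model id are illustrative, not the actual prompt design:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

ACTION_TOOL = {
    "name": "emit_action",
    "description": "Choose the next action to take in the terminal.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["type", "key", "done", "stuck"]},
            "text": {"type": "string"},
            "key": {"type": "string"},
        },
        "required": ["action"],
    },
}

def next_action(system_prompt: str, rolling_context: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder Sonnet model id
        max_tokens=1024,
        temperature=0.7,
        system=system_prompt,
        tools=[ACTION_TOOL],
        tool_choice={"type": "tool", "name": "emit_action"},  # force structured output
        messages=[{"role": "user", "content": rolling_context}],
    )
    block = next(b for b in msg.content if b.type == "tool_use")
    return block.input  # e.g. {"action": "type", "text": "..."}
```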
### Verifier
- **Model:** Sonnet
- **Temperature:** Near-zero (deterministic judgment)
- **Input:** session.log + filesystem.json + tool_calls.jsonl + criteria list. Does NOT receive actor intents or scenario narrative (reduces confirmation bias).
- **Output:** Structured JSON with per-criterion verdict/evidence/rationale + observations
- **Prompt:** Versioned template at `prompts/verifier.md`
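A sketch of the verdict schema and the retry path implied by the pydantic dependency note below (the `reask` callback is hypothetical):
```python
from pydantic import BaseModel, ValidationError

class CriterionVerdict(BaseModel):
    criterion: str
    passed: bool
    evidence: str
    rationale: str = ""

class Verdict(BaseModel):
    criteria: list[CriterionVerdict]
    observations: list[str] = []

def parse_verdict(raw: str, reask) -> Verdict:
    """Validate verifier output. `reask` is a hypothetical callback that
    re-prompts the verifier with the validation error appended."""
    try:
        return Verdict.model_validate_json(raw)
    except ValidationError as err:
        return Verdict.model_validate_json(reask(str(err)))
```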
## Results & Compare
### Results Structure
```
results/
  <scenario>/
    <backend>/
      <timestamp>/
        session.log       # raw tmux capture
        filesystem.json   # post-run git/file state snapshot
        tool_calls.jsonl  # collected from backend session logs
        verdict.json      # verifier output
        meta.json         # run metadata (backend, duration, turns, model versions)
```
### Compare Command
`drill compare` reads existing results from prior `drill run` invocations. It does not run backends itself — run each backend separately first, then compare.
```
$ drill run worktree-creation-from-main --backend claude
$ drill run worktree-creation-from-main --backend codex
$ drill compare worktree-creation-from-main
Scenario: worktree-creation-from-main (naive posture)
Summary:
┌──────────┬────────┬───────┬───────┐
│ Backend  │ Result │ Score │ Turns │
├──────────┼────────┼───────┼───────┤
│ claude   │ PASS   │ 4/4   │ 6     │
│ codex    │ FAIL   │ 2/4   │ 12    │
└──────────┴────────┴───────┴───────┘
Detail:
┌────────────────────────────────┬────────┬────────┐
│ Criterion                      │ claude │ codex  │
├────────────────────────────────┼────────┼────────┤
│ Detected on main               │   ✓    │   ✓    │
│ Asked consent                  │   ✓    │   ✗    │
│ Worktree exists                │   ✓    │   ✓    │
│ Used native tools              │   ✓    │   ✗    │
└────────────────────────────────┴────────┴────────┘
Observations:
claude: "Agent cited the using-git-worktrees skill by name"
codex: "Agent created worktree but skipped consent step entirely"
```
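Under the hood, compare only needs to walk the results tree and pick the newest run per backend. A sketch, assuming timestamped run directories sort lexically (e.g. ISO timestamps):
```python
import json
from pathlib import Path

def latest_verdicts(scenario: str, results_dir: str = "results") -> dict:
    """Newest verdict.json per backend for one scenario. Assumes the
    results layout above and lexically sortable run-dir timestamps."""
    verdicts = {}
    for backend_dir in sorted((Path(results_dir) / scenario).iterdir()):
        runs = sorted(p for p in backend_dir.iterdir() if p.is_dir())
        if runs:
            verdicts[backend_dir.name] = json.loads(
                (runs[-1] / "verdict.json").read_text()
            )
    return verdicts
```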
## Project Structure
```
drill/
├── drill/
│   ├── __init__.py
│   ├── cli.py           # click CLI: run, compare, list
│   ├── engine.py        # orchestrates the full run lifecycle
│   ├── session.py       # tmux session management
│   ├── actor.py         # actor LLM calls
│   ├── verifier.py      # verifier LLM calls
│   ├── setup.py         # template repo cloning, helpers, assertions
│   └── backend.py       # loads backend YAML, builds commands
├── backends/
│   ├── claude.yaml
│   └── codex.yaml
├── prompts/
│   ├── actor.md
│   └── verifier.md
├── scenarios/
│   ├── worktree-creation-from-main.yaml
│   ├── worktree-already-inside.yaml
│   ├── worktree-codex-detached-head.yaml
│   └── worktree-consent-flow.yaml
├── fixtures/
│   └── template-repo/   # base git repo, cloned per run
├── setup_helpers/
│   ├── __init__.py
│   ├── base.py          # create_base_repo, common git ops
│   └── worktree.py      # add_worktree, detach_head, etc.
├── results/             # gitignored, populated by runs
├── pyproject.toml       # package metadata + [project.scripts] entry point
└── README.md
```
## Phase 1 Scope
- Claude Code + Codex backends
- 4 PRI-974 worktree scenarios (creation, already-inside, detached-head, consent)
- Both user postures (naive + spec-aware) per scenario
- Template repo + setup helpers + assertions
- Actor + verifier with prompts
- `drill run` and `drill compare` commands
- Results storage
## Phase 2 (Future)
- Gemini CLI backend
- Backend variants (e.g., `codex-workspace-write.yaml` for sandbox mode testing)
- Verifier flakiness mitigation (3x voting, agreement tracking)
- Cost tracking and token usage reporting
- Docker isolation for reproducibility
- CI integration
- Scenarios beyond worktrees (stacked PRs, git-spice, brainstorming)
## Installation
```bash
pip install -e . # installs 'drill' console script
```
Requires `tmux` installed as a system dependency.
## Dependencies
- Python 3.11+
- `click` — CLI framework
- `pyyaml` — scenario and backend config parsing
- `anthropic` — Anthropic Python SDK for actor/verifier LLM calls (structured tool_use output)
- `jinja2` — prompt template rendering
- `pydantic` — verdict schema validation (retry on malformed verifier output)
- `tmux` — session driving (system dependency)
## Non-Goals
- Not a coding ability benchmark (SWE-bench covers that)
- Not an LLM evaluation framework (promptfoo covers that)
- Not a generic terminal automation tool (Terminal-Bench covers that)
- No CI in phase 1
- No Docker in phase 1


@@ -0,0 +1,93 @@
# Manual Testing (Codex App)
Some scenarios cannot run automatically because drill has no harness adapter for the target — the Codex App desktop client has no CLI or tmux entry point the way `claude` and `codex` do. These scenarios are marked `manual: true` in their YAML and use a human-in-the-loop protocol.
## Protocol
Three phases. The agent never runs Codex App directly. The tester never writes a verdict by hand.
1. **Agent prepares the handoff** — reads the scenario file, renders setup + turn intents into something a human can act on, hands the package to the tester.
2. **Tester executes** — sets up the repo fixture, opens Codex App, pastes the prompt, handles any follow-ups, copies the transcript + final filesystem state back to the agent.
3. **Agent judges and records** — evaluates the transcript against `verify.criteria`, writes a verdict JSON, saves to `results/<scenario>/codex-app/YYYY-MM-DD-manual/verdict.json`.
## Phase 1: Agent prepares the handoff
Deliver as one self-contained message to the tester:
### Fixture state
Exact repo state Codex App should be launched against. Pull from `setup.notes` if present, otherwise translate `setup.helpers` + `setup.assertions` into prose. Include: which repo/directory, branch, whether to expect a worktree vs normal checkout, any required/forbidden files (e.g. `.gitignore` entries).
### Prompt to paste
Render turn 1's `intent` as a natural first-person message the tester can paste verbatim into Codex App. **Don't leak internal test language** like *"Do NOT say 'create a worktree'"* — that's instruction for the test author, not the end user. Convert it to what a real user would actually type.
Example:
> Intent: *"Ask the agent to use the worktree skill to get set up for a notifications feature. Do NOT say 'create a worktree' — just reference the skill by name."*
>
> Rendered prompt: *"hey, can you use the worktree skill to get me set up for a notifications feature?"*
### Follow-up guidance
For each additional turn, give the tester a short decision rule — not a verbatim script. E.g. *"If the agent asks a clarifying question like branch name, answer concisely. If it stops to ask whether you want a worktree at all, tell it you already asked for the skill and it should proceed."*
### What to capture
Ask the tester to paste back:
- Full agent transcript (messages, tool calls, tool outputs)
- Final filesystem state if criteria depend on it (`git worktree list`, directory tree, branch state)
- Any observations they want on the record
## Phase 2: Tester executes
1. Set up the repo fixture per the instructions
2. Open Codex App in that repo
3. Paste the prompt
4. Follow up per the guidance
5. Copy the transcript + filesystem state back to the agent
## Phase 3: Agent judges and records
For each criterion in `verify.criteria`, write one entry:
```json
{
  "criterion": "<verbatim from scenario>",
  "passed": true | false,
  "evidence": "<quoted snippet from transcript>",
  "rationale": "<only if passed is inconclusive or needs context>"
}
```
**Rules:**
- Quote the transcript directly in `evidence`. No paraphrasing.
- If a criterion is genuinely inconclusive from the transcript, mark `passed: false` with `rationale` explaining what was missing. Don't guess.
- Don't grade on intent you can't see. The agent's internal thoughts aren't visible — only messages, tool calls, and results.
### Verdict file
Save to `results/<scenario>/codex-app/YYYY-MM-DD-manual/verdict.json`:
```json
{
  "scenario": "<scenario-name>",
  "backend": "codex-app",
  "manual": true,
  "user_posture": "<spec-aware|naive|...>",
  "passed": <true iff every criterion.passed is true>,
  "criteria": [ ... ],
  "notes": "<optional: cross-criterion observations>"
}
```
Matches the format of the existing `results/worktree-codex-app-detached-head/codex-app/2026-04-09-manual/verdict.json`.
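If the agent scripts this step, assembling the file is mechanical. A sketch, deriving the overall `passed` from the per-criterion entries as specified above:
```python
import json
from pathlib import Path

def write_manual_verdict(scenario, user_posture, criteria, notes="", date="YYYY-MM-DD"):
    """Sketch: assemble the verdict file; overall pass requires every
    criterion to pass. `criteria` is the list of per-criterion entries."""
    verdict = {
        "scenario": scenario,
        "backend": "codex-app",
        "manual": True,
        "user_posture": user_posture,
        "passed": all(c["passed"] for c in criteria),
        "criteria": criteria,
        "notes": notes,
    }
    out = Path(f"results/{scenario}/codex-app/{date}-manual/verdict.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(verdict, indent=2))
```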
## When to invoke
- A scenario's YAML has `manual: true`
- The tester explicitly asks for a manual Codex App run of any scenario
- An automated test result is inconclusive and we want a human-verified cross-check
Do NOT use this procedure for scenarios drill can run itself (`claude`, `codex`, `gemini` backends) — use `drill run` instead.
## Pitfalls
- **Don't skip the fixture step.** Codex App's default environment (detached HEAD under `$CODEX_HOME/worktrees/`) is load-bearing for worktree scenarios. The same prompt gives different results in a normal checkout.
- **Don't render prompts literally.** Scenario intents are written for test authors; they often contain "Do NOT mention X" style instructions. Translate before handing to the tester.
- **Don't grade on missing evidence.** If the transcript doesn't show the agent doing something the criterion asks about, that's a fail, not a pass-by-default.

evals/docs/plan.md (new file, 2725 lines; diff suppressed because it is too large)

@@ -0,0 +1,89 @@
# Pressure / RED phase testing in drill
## What "RED phase" means
The bash test family in superpowers/tests/ used three implicit phases
when stress-testing skill content:
* **GREEN** — current skill text. Baseline behavior under normal user
prompts. This is what most drill scenarios exercise.
* **PRESSURE** — current skill text, but the user prompt creates
conditions that make the skill's recommended path inconvenient
(urgency, an "easier" alternative already on disk, etc.). Lifted
as `worktree-creation-under-pressure.yaml`.
* **RED** — *modified* skill text where the section under test has
been removed or weakened. Used to confirm a passing GREEN/PRESSURE
result actually depended on the skill text and isn't just baseline
model behavior.
GREEN and PRESSURE both run against the current `SUPERPOWERS_ROOT`.
RED needs a *different* superpowers checkout — one with the section
under test stripped out — and runs the same scenario against that.
## The drill primitive: vary `SUPERPOWERS_ROOT`
Every backend YAML interpolates `${SUPERPOWERS_ROOT}` into its
`--plugin-dir` arg (claude.yaml line 6, gemini.yaml line 5, etc.).
That env var is the only knob you need: point drill at a different
plugin checkout and the agent under test loads a different version
of the skill.
```bash
# GREEN: current skill text
drill run worktree-creation-from-main -b claude
# RED: same scenario, against a checkout where Step 1a is deleted
SUPERPOWERS_ROOT=/path/to/superpowers-without-step-1a \
drill run worktree-creation-from-main -b claude
```
Compare verdicts. If GREEN passes and RED fails, the skill text is
load-bearing. If both pass, the model produces the right behavior
without the skill — meaning either the skill is redundant or the
test isn't probing what it claims to probe.
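Expressed as code, the judgment is a pass-rate delta across N runs per variant. A sketch, where the 0.2 margin is an arbitrary illustrative threshold:
```python
def pass_rate(verdicts: list[dict]) -> float:
    """Fraction of runs where every criterion passed."""
    return sum(v["passed"] for v in verdicts) / len(verdicts)

def skill_is_load_bearing(green: list[dict], red: list[dict],
                          margin: float = 0.2) -> bool:
    """Sketch: the section under test is load-bearing if deleting it drops
    the pass rate by more than `margin` (an arbitrary threshold)."""
    return pass_rate(green) - pass_rate(red) > margin
```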
## Recommended workflow
1. Make a git worktree of superpowers at the commit/branch you want
to test. For RED variants, edit the skill in that worktree to
remove the section under test.
```bash
cd ~/Documents/GitHub/superpowers/superpowers
git worktree add ../superpowers-red-no-step-1a HEAD
# edit skills/using-git-worktrees/SKILL.md in the worktree
```
2. Run the same drill scenario against each variant. Use
`--n N` to get statistical signal — single runs are noisy,
especially under pressure conditions.
```bash
for variant in superpowers superpowers-red-no-step-1a; do
  SUPERPOWERS_ROOT=~/Documents/GitHub/superpowers/${variant} \
    drill run worktree-creation-from-main -b claude --n 10
done
```
3. Compare with `drill compare`. Look for the RED variant's pass
rate dropping (skill is load-bearing) or holding (skill is
redundant or scenario isn't probing what it claims).
## When to add a new pressure scenario vs. add a turn variation
* **New scenario** when the *filesystem* setup is different (e.g.,
pre-existing `.worktrees/` for the worktree-pressure case).
Setup helpers are scenario-scoped.
* **New `--n` sweep with different prompts** when only the
*user prompt* shape varies (e.g., urgency, framing).
Drill doesn't yet have a way to vary turn intents within a single
scenario YAML — multi-prompt sweeps require multiple scenario files
or running the same scenario with different intents externally.
## Open follow-ups
* `--plugins=A,B,C` sweep dimension (parallel to `--models`) so a
single drill invocation can run RED + GREEN + PRESSURE variants
in one batch and `drill compare` shows them side-by-side. Not yet
implemented; tracked as drill-internal future work.