mirror of https://github.com/obra/superpowers.git, synced 2026-05-10 11:09:05 +08:00
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.

# Drill Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a tmux-based harness that drives AI coding agents through worktree scenarios and evaluates whether they follow superpowers skills.

**Architecture:** CLI (`click`) orchestrates an engine that sets up a test repo, launches an agent in tmux, drives it via an LLM actor (Anthropic SDK, structured tool_use), collects session logs + filesystem state, then evaluates compliance via an LLM verifier. Backend configs (YAML) define how to launch each agent CLI. Scenarios (YAML) define what to test.
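
The run lifecycle reduces to a capture/decide/act loop between the tmux session and the actor. A minimal sketch of that loop with both sides stubbed out (`FakeSession`, `FakeActor`, and `drive` are illustrative names under stated assumptions, not part of the planned API):

```python
from dataclasses import dataclass, field


@dataclass
class FakeSession:
    """Stands in for the tmux session: records what gets typed."""
    typed: list = field(default_factory=list)

    def capture(self) -> str:
        # Pretend the agent shows a prompt once it has been spoken to.
        return "❯ " if self.typed else "Welcome to the agent"

    def send_keys(self, text: str) -> None:
        self.typed.append(text)


@dataclass
class FakeActor:
    """Stands in for the actor LLM: types one request, then signals done."""
    turns: int = 0

    def decide(self, screen: str) -> dict:
        self.turns += 1
        if self.turns == 1:
            return {"action": "type", "text": "create a worktree"}
        return {"action": "done"}


def drive(session, actor, max_turns: int = 10) -> str:
    """Capture the screen, ask the actor, apply its action; stop on done/stuck."""
    for _ in range(max_turns):
        action = actor.decide(session.capture())
        if action["action"] in ("done", "stuck"):
            return action["action"]
        if action["action"] == "type":
            session.send_keys(action["text"])
    return "timeout"


session, actor = FakeSession(), FakeActor()
print(drive(session, actor))  # done
```

The real engine adds idle detection between turns and a structured tool_use call in place of `FakeActor.decide`.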

**Tech Stack:** Python 3.11+, click, pyyaml, anthropic SDK, jinja2, pydantic, tmux

---

## File Structure

```
drill/
├── drill/
│   ├── __init__.py        # Package init, version
│   ├── cli.py             # click CLI: run, compare, list
│   ├── engine.py          # Orchestrates full run lifecycle (7 steps)
│   ├── session.py         # tmux session management (create, send-keys, capture, kill)
│   ├── actor.py           # Actor LLM: rolling context, structured tool_use output
│   ├── verifier.py        # Verifier LLM: per-criterion evaluation, pydantic schema
│   ├── setup.py           # Template repo cloning, helper dispatch, assertion runner
│   ├── backend.py         # Loads backend YAML, builds CLI commands, idle detection
│   └── normalizer.py      # Normalizes backend-specific session logs to common schema
├── backends/
│   ├── claude.yaml        # Claude Code backend config
│   └── codex.yaml         # Codex backend config
├── prompts/
│   ├── actor.md           # Actor system prompt (jinja2 template)
│   └── verifier.md        # Verifier system prompt (jinja2 template)
├── scenarios/
│   ├── worktree-creation-from-main.yaml
│   ├── worktree-already-inside.yaml
│   ├── worktree-codex-detached-head.yaml
│   └── worktree-consent-flow.yaml
├── fixtures/
│   └── template-repo/     # Minimal git repo cloned per run
│       ├── package.json
│       ├── src/
│       │   ├── index.js
│       │   └── utils.js
│       └── README.md
├── setup_helpers/
│   ├── __init__.py        # Exports helper registry
│   ├── base.py            # create_base_repo
│   └── worktree.py        # add_worktree, detach_head, symlink_superpowers
├── tests/
│   ├── test_backend.py
│   ├── test_setup.py
│   ├── test_session.py
│   ├── test_actor.py
│   ├── test_verifier.py
│   ├── test_normalizer.py
│   ├── test_engine.py
│   └── test_cli.py
├── pyproject.toml
├── .gitignore
└── README.md
```

---

### Task 1: Project Scaffold

**Files:**
- Create: `pyproject.toml`
- Create: `drill/__init__.py`
- Create: `.gitignore`
- Create: `README.md`

- [ ] **Step 1: Create pyproject.toml**

```toml
[build-system]
requires = ["setuptools>=68.0"]
build-backend = "setuptools.build_meta"

[project]
name = "drill"
version = "0.1.0"
description = "Superpowers skill compliance benchmark"
requires-python = ">=3.11"
dependencies = [
    "click>=8.1",
    "pyyaml>=6.0",
    "anthropic>=0.42",
    "jinja2>=3.1",
    "pydantic>=2.0",
]

[project.optional-dependencies]
dev = ["pytest>=8.0"]

[project.scripts]
drill = "drill.cli:main"

[tool.setuptools.packages.find]
include = ["drill*", "setup_helpers*"]
```
- [ ] **Step 2: Create drill/__init__.py**

```python
"""Drill: Superpowers skill compliance benchmark."""

__version__ = "0.1.0"
```

- [ ] **Step 3: Create .gitignore**

```
results/
__pycache__/
*.pyc
*.egg-info/
dist/
build/
.venv/
```

- [ ] **Step 4: Create README.md**

````markdown
# Drill

Superpowers skill compliance benchmark. Drives AI coding agents through
tmux sessions and evaluates whether they follow superpowers workflows.

See [docs/design.md](docs/design.md) for the full design spec.

## Setup

```bash
pip install -e ".[dev]"
```

## Usage

```bash
export SUPERPOWERS_ROOT=/path/to/superpowers
export ANTHROPIC_API_KEY=sk-...

drill run worktree-creation-from-main --backend claude
drill compare worktree-creation-from-main
drill list
```
````

- [ ] **Step 5: Install in dev mode and verify**

Run: `cd /Users/drewritter/prime-rad/drill && pip install -e ".[dev]"`
Expected: Installs successfully, `drill --help` shows usage

- [ ] **Step 6: Commit**

```bash
git add pyproject.toml drill/__init__.py .gitignore README.md
git commit -m "chore: project scaffold with pyproject.toml and drill entry point"
```

---

### Task 2: Backend Config Loader

**Files:**
- Create: `drill/backend.py`
- Create: `backends/claude.yaml`
- Create: `backends/codex.yaml`
- Create: `tests/test_backend.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_backend.py
import os
import pytest
from pathlib import Path

from drill.backend import Backend, load_backend


@pytest.fixture
def backends_dir():
    return Path(__file__).parent.parent / "backends"


class TestLoadBackend:
    def test_loads_claude_backend(self, backends_dir):
        backend = load_backend("claude", backends_dir)
        assert backend.name == "claude"
        assert backend.cli == "claude"
        assert "--dangerously-skip-permissions" in backend.args

    def test_loads_codex_backend(self, backends_dir):
        backend = load_backend("codex", backends_dir)
        assert backend.name == "codex"
        assert backend.cli == "codex"

    def test_unknown_backend_raises(self, backends_dir):
        with pytest.raises(FileNotFoundError):
            load_backend("nonexistent", backends_dir)


class TestBackendBuildCommand:
    def test_claude_build_command(self, backends_dir, monkeypatch):
        monkeypatch.setenv("SUPERPOWERS_ROOT", "/tmp/superpowers")
        backend = load_backend("claude", backends_dir)
        cmd = backend.build_command("/tmp/workdir")
        assert cmd[0] == "claude"
        assert "--plugin-dir" in cmd
        assert "/tmp/superpowers" in cmd

    def test_codex_build_command(self, backends_dir, monkeypatch):
        monkeypatch.setenv("SUPERPOWERS_ROOT", "/tmp/superpowers")
        backend = load_backend("codex", backends_dir)
        cmd = backend.build_command("/tmp/workdir")
        assert cmd[0] == "codex"


class TestBackendEnvValidation:
    def test_missing_env_raises(self, backends_dir, monkeypatch):
        monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
        monkeypatch.delenv("SUPERPOWERS_ROOT", raising=False)
        backend = load_backend("claude", backends_dir)
        with pytest.raises(EnvironmentError, match="ANTHROPIC_API_KEY"):
            backend.validate_env()


class TestBackendIdleDetection:
    def test_ready_pattern_matches(self, backends_dir):
        backend = load_backend("claude", backends_dir)
        assert backend.is_ready_line("❯ ")
        assert backend.is_ready_line("Human: ")
        assert not backend.is_ready_line("Running tool...")
```

- [ ] **Step 2: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_backend.py -v`
Expected: FAIL — `ModuleNotFoundError: No module named 'drill.backend'`

- [ ] **Step 3: Create backend YAML files**

Create `backends/claude.yaml`:

```yaml
name: claude
cli: claude
args:
  - "--dangerously-skip-permissions"
  - "--plugin-dir"
  - "${SUPERPOWERS_ROOT}"
required_env:
  - ANTHROPIC_API_KEY
  - SUPERPOWERS_ROOT
hooks:
  pre_run: []
  post_run: []
shutdown: "/exit"
idle:
  quiescence_seconds: 3
  ready_pattern: "^❯|^\\$|Human:"
startup_timeout: 30
terminal:
  cols: 200
  rows: 50
session_logs:
  pattern: "~/.claude/projects/**/session-*.jsonl"
```

Create `backends/codex.yaml`:

```yaml
name: codex
cli: codex
args:
  - "--dangerously-bypass-approvals-and-sandbox"
required_env:
  - OPENAI_API_KEY
  - SUPERPOWERS_ROOT
hooks:
  pre_run:
    - symlink_superpowers
  post_run: []
shutdown: "<<KEY:ctrl-d>>"
idle:
  quiescence_seconds: 5
  ready_pattern: "codex>|^>"
startup_timeout: 30
terminal:
  cols: 200
  rows: 50
session_logs:
  pattern: "~/.codex/sessions/rollout-*.jsonl"
```
- [ ] **Step 4: Write the implementation**

```python
# drill/backend.py
"""Backend config loader and command builder."""

from __future__ import annotations

import os
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any

import yaml


@dataclass
class Backend:
    name: str
    cli: str
    args: list[str]
    required_env: list[str]
    hooks: dict[str, list[str]]
    shutdown: str
    idle: dict[str, Any]
    startup_timeout: int
    terminal: dict[str, int]
    session_logs: dict[str, str]

    def build_command(self, workdir: str) -> list[str]:
        """Build the full CLI invocation with env var interpolation."""
        resolved = []
        for arg in self.args:
            resolved.append(_interpolate_env(arg))
        return [self.cli, *resolved]

    def validate_env(self) -> None:
        """Raise EnvironmentError if any required env vars are missing."""
        missing = [v for v in self.required_env if not os.environ.get(v)]
        if missing:
            raise EnvironmentError(
                f"Missing required environment variables for {self.name} backend: "
                + ", ".join(missing)
            )

    def is_ready_line(self, line: str) -> bool:
        """Check if a terminal line matches the idle ready pattern."""
        pattern = self.idle.get("ready_pattern", "")
        if not pattern:
            return False  # an empty pattern would match every line
        return bool(re.search(pattern, line))

    @property
    def quiescence_seconds(self) -> float:
        return self.idle.get("quiescence_seconds", 5)

    @property
    def cols(self) -> int:
        return self.terminal.get("cols", 200)

    @property
    def rows(self) -> int:
        return self.terminal.get("rows", 50)


def load_backend(name: str, backends_dir: Path) -> Backend:
    """Load a backend config from YAML."""
    path = backends_dir / f"{name}.yaml"
    if not path.exists():
        raise FileNotFoundError(f"Backend config not found: {path}")
    with open(path) as f:
        data = yaml.safe_load(f)
    return Backend(
        name=data["name"],
        cli=data["cli"],
        args=data.get("args", []),
        required_env=data.get("required_env", []),
        hooks=data.get("hooks", {"pre_run": [], "post_run": []}),
        shutdown=data.get("shutdown", "/exit"),
        idle=data.get("idle", {}),
        startup_timeout=data.get("startup_timeout", 30),
        terminal=data.get("terminal", {"cols": 200, "rows": 50}),
        session_logs=data.get("session_logs", {}),
    )


def _interpolate_env(value: str) -> str:
    """Replace ${VAR} with environment variable values."""
    def replacer(match):
        var = match.group(1)
        val = os.environ.get(var)
        if val is None:
            raise EnvironmentError(f"Environment variable {var} not set")
        return val
    return re.sub(r"\$\{(\w+)\}", replacer, value)
```
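As a sanity check on the `${VAR}` rule, here is a standalone copy of the interpolation logic, duplicated for illustration rather than imported from `drill.backend`:

```python
import os
import re


def interpolate_env(value: str) -> str:
    # Standalone copy of the ${VAR} substitution build_command relies on.
    def replacer(match: "re.Match") -> str:
        var = match.group(1)
        val = os.environ.get(var)
        if val is None:
            raise EnvironmentError(f"Environment variable {var} not set")
        return val
    return re.sub(r"\$\{(\w+)\}", replacer, value)


os.environ["SUPERPOWERS_ROOT"] = "/tmp/superpowers"
# Each arg is resolved independently, so a claude-style arg list becomes:
args = ["--plugin-dir", "${SUPERPOWERS_ROOT}"]
print([interpolate_env(a) for a in args])  # ['--plugin-dir', '/tmp/superpowers']
```

Note that an unset variable raises rather than silently passing `${VAR}` through to the agent CLI.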

- [ ] **Step 5: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_backend.py -v`
Expected: All tests PASS

- [ ] **Step 6: Commit**

```bash
git add drill/backend.py backends/ tests/test_backend.py
git commit -m "feat: backend config loader with YAML parsing and env validation"
```

---

### Task 3: tmux Session Manager

**Files:**
- Create: `drill/session.py`
- Create: `tests/test_session.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_session.py
import subprocess
import time
import pytest

from drill.session import TmuxSession


class TestTmuxSession:
    def test_create_and_kill(self):
        session = TmuxSession(name="drill-test-create", cols=80, rows=24)
        session.create()
        # Verify session exists
        result = subprocess.run(
            ["tmux", "has-session", "-t", "drill-test-create"],
            capture_output=True,
        )
        assert result.returncode == 0
        session.kill()
        # Verify session is gone
        result = subprocess.run(
            ["tmux", "has-session", "-t", "drill-test-create"],
            capture_output=True,
        )
        assert result.returncode != 0

    def test_send_keys_and_capture(self):
        session = TmuxSession(name="drill-test-keys", cols=80, rows=24)
        session.create()
        try:
            session.send_keys("echo hello-drill-test")
            time.sleep(0.5)
            output = session.capture()
            assert "hello-drill-test" in output
        finally:
            session.kill()

    def test_launch_command(self, tmp_path):
        session = TmuxSession(name="drill-test-launch", cols=80, rows=24)
        session.create()
        try:
            session.launch(["python3", "-c", "import time; time.sleep(30)"], cwd=str(tmp_path))
            time.sleep(0.5)
            # Process should be running, not showing shell prompt
            assert session.is_process_alive()
        finally:
            session.kill()

    def test_send_special_key(self):
        session = TmuxSession(name="drill-test-special", cols=80, rows=24)
        session.create()
        try:
            session.send_keys("cat")  # start cat, which reads stdin
            time.sleep(0.3)
            session.send_special_key("ctrl-c")
            time.sleep(0.3)
            # After ctrl-c, cat should have exited
            output = session.capture()
            assert "^C" in output or output.endswith("$")
        finally:
            session.kill()
```

- [ ] **Step 2: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_session.py -v`
Expected: FAIL — `ModuleNotFoundError: No module named 'drill.session'`

- [ ] **Step 3: Write the implementation**

```python
# drill/session.py
"""tmux session management for driving agent CLI sessions."""

from __future__ import annotations

import shlex
import subprocess


class TmuxSession:
    """Manages a tmux session for driving an agent CLI."""

    def __init__(self, name: str, cols: int = 200, rows: int = 50):
        self.name = name
        self.cols = cols
        self.rows = rows

    def create(self) -> None:
        """Create a new detached tmux session."""
        subprocess.run(
            [
                "tmux", "new-session",
                "-d",
                "-s", self.name,
                "-x", str(self.cols),
                "-y", str(self.rows),
            ],
            check=True,
        )

    def launch(self, command: list[str], cwd: str) -> None:
        """Launch a command inside the tmux session."""
        # shlex.join quotes args so spaces survive the shell (e.g. python3 -c "...").
        cmd_str = shlex.join(command)
        self.send_keys(f"cd {shlex.quote(cwd)} && {cmd_str}")

    def send_keys(self, text: str) -> None:
        """Send keystrokes to the tmux session, followed by Enter."""
        subprocess.run(
            ["tmux", "send-keys", "-t", self.name, text, "Enter"],
            check=True,
        )

    def send_special_key(self, key: str) -> None:
        """Send a special key like ctrl-c, ctrl-d."""
        key_map = {
            "ctrl-c": "C-c",
            "ctrl-d": "C-d",
            "ctrl-z": "C-z",
            "enter": "Enter",
            "escape": "Escape",
        }
        tmux_key = key_map.get(key, key)
        subprocess.run(
            ["tmux", "send-keys", "-t", self.name, tmux_key],
            check=True,
        )

    def capture(self) -> str:
        """Capture the current terminal pane content."""
        result = subprocess.run(
            ["tmux", "capture-pane", "-t", self.name, "-p"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout

    def is_process_alive(self) -> bool:
        """Check if the process in the pane is still running."""
        result = subprocess.run(
            [
                "tmux", "list-panes", "-t", self.name,
                "-F", "#{pane_dead}",
            ],
            capture_output=True,
            text=True,
        )
        return result.stdout.strip() == "0"

    def kill(self) -> None:
        """Kill the tmux session."""
        subprocess.run(
            ["tmux", "kill-session", "-t", self.name],
            capture_output=True,  # don't fail if already dead
        )
```
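The engine will need to decide when the agent is idle before handing control back to the actor. A minimal sketch of that wait loop, decoupled from tmux so it can be exercised with any capture function (`wait_until_idle` and its defaults are illustrative; the real values would come from the backend's `idle` config):

```python
import time


def wait_until_idle(capture, is_ready_line, quiescence_seconds=0.2,
                    timeout=5.0, poll_interval=0.05):
    """Poll capture() until the last non-empty line matches the ready pattern
    and the screen has stopped changing for quiescence_seconds."""
    deadline = time.monotonic() + timeout
    last_screen, stable_since = None, None
    while time.monotonic() < deadline:
        screen = capture()
        lines = [l for l in screen.splitlines() if l.strip()]
        ready = bool(lines) and is_ready_line(lines[-1])
        if screen != last_screen:
            last_screen, stable_since = screen, time.monotonic()
        if (ready and stable_since is not None
                and time.monotonic() - stable_since >= quiescence_seconds):
            return True
        time.sleep(poll_interval)
    return False


# Simulate a screen that finishes a tool call, then settles on a prompt.
frames = iter(["Running tool...", "Running tool...\ndone", "done\n❯ "])
current = {"screen": "Running tool..."}

def fake_capture():
    try:
        current["screen"] = next(frames)
    except StopIteration:
        pass  # screen stops changing once the frames run out
    return current["screen"]

print(wait_until_idle(fake_capture, lambda line: line.startswith("❯")))  # True
```

Requiring both conditions (prompt visible and screen quiescent) avoids treating a prompt that flashes mid-tool-call as idle.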

- [ ] **Step 4: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_session.py -v`
Expected: All tests PASS

- [ ] **Step 5: Commit**

```bash
git add drill/session.py tests/test_session.py
git commit -m "feat: tmux session manager with send-keys, capture, and special key support"
```

---

### Task 4: Setup Helpers and Template Repo

**Files:**
- Create: `setup_helpers/__init__.py`
- Create: `setup_helpers/base.py`
- Create: `setup_helpers/worktree.py`
- Create: `fixtures/template-repo/` (with contents)
- Create: `drill/setup.py`
- Create: `tests/test_setup.py`

- [ ] **Step 1: Create the template repo fixture**

```bash
cd /Users/drewritter/prime-rad/drill
mkdir -p fixtures/template-repo/src
```

Create `fixtures/template-repo/package.json`:
```json
{
  "name": "drill-test-project",
  "version": "1.0.0",
  "description": "Test project for Drill scenarios",
  "main": "src/index.js"
}
```

Create `fixtures/template-repo/src/index.js`:
```javascript
const { greet } = require('./utils');

function main() {
  console.log(greet('world'));
}

main();
```

Create `fixtures/template-repo/src/utils.js`:
```javascript
function greet(name) {
  return `Hello, ${name}!`;
}

module.exports = { greet };
```

Create `fixtures/template-repo/README.md`:
```markdown
# Test Project

A minimal project for Drill test scenarios.
```

Initialize git history:
```bash
cd fixtures/template-repo
git init
git add package.json README.md
git commit -m "initial commit"
git add src/utils.js
git commit -m "add utils module"
git add src/index.js
git commit -m "add entry point"
cd ../..
```

- [ ] **Step 2: Write the failing test**

```python
# tests/test_setup.py
import os
import subprocess
import pytest
from pathlib import Path

from drill.setup import clone_template, run_assertions
from setup_helpers.base import create_base_repo
from setup_helpers.worktree import add_worktree, detach_head, symlink_superpowers


@pytest.fixture
def fixtures_dir():
    return Path(__file__).parent.parent / "fixtures"


@pytest.fixture
def work_dir(tmp_path):
    return tmp_path / "test-repo"


class TestCloneTemplate:
    def test_clones_template_repo(self, fixtures_dir, work_dir):
        clone_template(fixtures_dir / "template-repo", work_dir)
        assert (work_dir / "package.json").exists()
        assert (work_dir / "src" / "index.js").exists()
        # Should have git history
        result = subprocess.run(
            ["git", "log", "--oneline"],
            cwd=work_dir,
            capture_output=True,
            text=True,
        )
        assert "initial commit" in result.stdout


class TestCreateBaseRepo:
    def test_creates_base_repo(self, fixtures_dir, work_dir):
        create_base_repo(work_dir, fixtures_dir / "template-repo")
        assert (work_dir / "package.json").exists()
        result = subprocess.run(
            ["git", "branch", "--show-current"],
            cwd=work_dir,
            capture_output=True,
            text=True,
        )
        assert result.stdout.strip() == "main"


class TestWorktreeHelpers:
    def test_add_worktree(self, fixtures_dir, work_dir):
        create_base_repo(work_dir, fixtures_dir / "template-repo")
        wt_path = work_dir.parent / "feature-wt"
        add_worktree(work_dir, "feature-branch", str(wt_path))
        assert wt_path.exists()
        result = subprocess.run(
            ["git", "worktree", "list"],
            cwd=work_dir,
            capture_output=True,
            text=True,
        )
        assert "feature-branch" in result.stdout

    def test_detach_head(self, fixtures_dir, work_dir):
        create_base_repo(work_dir, fixtures_dir / "template-repo")
        wt_path = work_dir.parent / "detached-wt"
        add_worktree(work_dir, "tmp-branch", str(wt_path))
        detach_head(str(wt_path))
        result = subprocess.run(
            ["git", "branch", "--show-current"],
            cwd=wt_path,
            capture_output=True,
            text=True,
        )
        assert result.stdout.strip() == ""  # detached = no branch

    def test_symlink_superpowers(self, fixtures_dir, work_dir, tmp_path):
        create_base_repo(work_dir, fixtures_dir / "template-repo")
        fake_superpowers = tmp_path / "superpowers" / "skills"
        fake_superpowers.mkdir(parents=True)
        symlink_superpowers(work_dir, str(tmp_path / "superpowers"))
        link = work_dir / ".agents" / "skills" / "superpowers"
        assert link.is_symlink()


class TestRunAssertions:
    def test_passing_assertions(self, fixtures_dir, work_dir):
        create_base_repo(work_dir, fixtures_dir / "template-repo")
        assertions = [
            "git rev-parse --is-inside-work-tree",
            "git branch --show-current | grep main",
        ]
        # Should not raise
        run_assertions(assertions, work_dir)

    def test_failing_assertion_raises(self, fixtures_dir, work_dir):
        create_base_repo(work_dir, fixtures_dir / "template-repo")
        assertions = ["git branch --show-current | grep nonexistent"]
        with pytest.raises(AssertionError, match="Setup assertion failed"):
            run_assertions(assertions, work_dir)
```

- [ ] **Step 3: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_setup.py -v`
Expected: FAIL — `ModuleNotFoundError`

- [ ] **Step 4: Write setup_helpers**

Create `setup_helpers/__init__.py`:
```python
"""Setup helpers for Drill scenarios."""

from setup_helpers.base import create_base_repo
from setup_helpers.worktree import add_worktree, detach_head, symlink_superpowers


HELPER_REGISTRY = {
    "create_base_repo": create_base_repo,
    "add_worktree": add_worktree,
    "detach_head": detach_head,
    "symlink_superpowers": symlink_superpowers,
}
```

Create `setup_helpers/base.py`:
```python
"""Base setup helpers."""

from __future__ import annotations

import subprocess
from pathlib import Path


def create_base_repo(workdir: Path, template_dir: Path) -> None:
    """Clone the template repo to workdir."""
    subprocess.run(
        ["git", "clone", str(template_dir), str(workdir)],
        check=True,
        capture_output=True,
    )
```

Create `setup_helpers/worktree.py`:
```python
"""Worktree-specific setup helpers."""

from __future__ import annotations

import subprocess
from pathlib import Path


def add_worktree(repo_dir: Path, branch: str, worktree_path: str) -> None:
    """Create a git worktree at the given path."""
    subprocess.run(
        ["git", "worktree", "add", "-b", branch, worktree_path],
        cwd=repo_dir,
        check=True,
        capture_output=True,
    )


def detach_head(worktree_path: str) -> None:
    """Detach HEAD in a worktree (simulates Codex App state)."""
    # Get current commit hash
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
        check=True,
    )
    commit = result.stdout.strip()
    # Get the branch name so we can delete it after detaching
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
        check=True,
    )
    branch = result.stdout.strip()
    # Detach HEAD
    subprocess.run(
        ["git", "checkout", "--detach", commit],
        cwd=worktree_path,
        check=True,
        capture_output=True,
    )
    # Delete the temporary branch
    if branch:
        subprocess.run(
            ["git", "branch", "-D", branch],
            cwd=worktree_path,
            capture_output=True,
        )


def symlink_superpowers(workdir: Path, superpowers_root: str) -> None:
    """Create .agents/skills/superpowers symlink for Codex discovery."""
    skills_dir = Path(workdir) / ".agents" / "skills"
    skills_dir.mkdir(parents=True, exist_ok=True)
    target = Path(superpowers_root) / "skills"
    link = skills_dir / "superpowers"
    link.symlink_to(target)
```
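These helpers lean on one git behavior worth confirming: `git branch --show-current` prints the branch name on a branch and prints nothing when HEAD is detached. A self-contained sanity check (assumes `git` ≥ 2.28 is on PATH for `init -b`; all paths are temporary and illustrative):

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run git in cwd and return trimmed stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    repo.mkdir()
    git("init", "-b", "main", cwd=repo)
    git("-c", "user.email=d@example.com", "-c", "user.name=drill",
        "commit", "--allow-empty", "-m", "initial", cwd=repo)
    # On a branch: --show-current prints the branch name.
    print(git("branch", "--show-current", cwd=repo))        # main
    # Detached at the current commit: --show-current prints nothing.
    git("checkout", "--detach", cwd=repo)
    print(repr(git("branch", "--show-current", cwd=repo)))  # ''
```

This empty-output behavior is exactly what `test_detach_head` above asserts against.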

- [ ] **Step 5: Write drill/setup.py**

```python
# drill/setup.py
"""Test repo setup: template cloning, helper dispatch, assertion runner."""

from __future__ import annotations

import os
import subprocess
from pathlib import Path

from setup_helpers import HELPER_REGISTRY


def clone_template(template_dir: Path, workdir: Path) -> None:
    """Clone the template repo to a working directory."""
    subprocess.run(
        ["git", "clone", str(template_dir), str(workdir)],
        check=True,
        capture_output=True,
    )


def run_helpers(
    helper_names: list[str],
    workdir: Path,
    fixtures_dir: Path,
) -> None:
    """Run named setup helpers against the working directory."""
    for name in helper_names:
        helper = HELPER_REGISTRY.get(name)
        if helper is None:
            raise ValueError(f"Unknown setup helper: {name}")
        if name == "create_base_repo":
            helper(workdir, fixtures_dir / "template-repo")
        elif name == "symlink_superpowers":
            helper(workdir, os.environ["SUPERPOWERS_ROOT"])
        else:
            # All other helpers take workdir as single arg
            helper(workdir)


def run_assertions(assertions: list[str], workdir: Path) -> None:
    """Run shell assertion commands. Raise if any fail."""
    for assertion in assertions:
        result = subprocess.run(
            assertion,
            shell=True,
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise AssertionError(
                f"Setup assertion failed: {assertion}\n"
                f"stdout: {result.stdout}\n"
                f"stderr: {result.stderr}"
            )
```
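The assertion runner's contract is simply "run each string through the shell; a non-zero exit fails setup." A standalone sketch of that contract (duplicated for illustration rather than imported from `drill.setup`):

```python
import subprocess


def run_assertions(assertions, workdir="."):
    # Each assertion is a shell command; a non-zero exit code fails the setup.
    for assertion in assertions:
        result = subprocess.run(assertion, shell=True, cwd=workdir,
                                capture_output=True, text=True)
        if result.returncode != 0:
            raise AssertionError(f"Setup assertion failed: {assertion}")


run_assertions(["true", "echo ok | grep ok"])  # passes silently
try:
    run_assertions(["false"])
except AssertionError as e:
    print(e)  # Setup assertion failed: false
```

Because the commands go through the shell, pipelines like `git branch --show-current | grep main` work, and the pipeline's exit status is what gets checked.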

- [ ] **Step 6: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_setup.py -v`
Expected: All tests PASS

- [ ] **Step 7: Commit**

```bash
git add fixtures/ setup_helpers/ drill/setup.py tests/test_setup.py
git commit -m "feat: template repo, setup helpers, and assertion runner"
```

---

### Task 5: Actor LLM

**Files:**
- Create: `drill/actor.py`
- Create: `prompts/actor.md`
- Create: `tests/test_actor.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_actor.py
import json
import pytest
from unittest.mock import MagicMock, patch

from drill.actor import Actor, ActorAction


class TestActorAction:
    def test_parse_type_action(self):
        action = ActorAction.from_tool_result({"action": "type", "text": "create a worktree"})
        assert action.action == "type"
        assert action.text == "create a worktree"

    def test_parse_done_action(self):
        action = ActorAction.from_tool_result({"action": "done"})
        assert action.action == "done"

    def test_parse_stuck_action(self):
        action = ActorAction.from_tool_result({"action": "stuck"})
        assert action.action == "stuck"

    def test_parse_key_action(self):
        action = ActorAction.from_tool_result({"action": "key", "key": "ctrl-c"})
        assert action.action == "key"
        assert action.key == "ctrl-c"


class TestActorPrompt:
    def test_builds_system_prompt_naive(self):
        actor = Actor(model="claude-sonnet-4-6", temperature=0.7)
        prompt = actor.build_system_prompt(
            posture="naive",
            intents=["Ask the agent to create a worktree"],
        )
        assert "naive" in prompt.lower() or "plain language" in prompt.lower()
        assert "create a worktree" in prompt

    def test_builds_system_prompt_spec_aware(self):
        actor = Actor(model="claude-sonnet-4-6", temperature=0.7)
        prompt = actor.build_system_prompt(
            posture="spec-aware",
            intents=["Use the worktree skill to create an isolated workspace"],
        )
        assert "spec" in prompt.lower() or "skill" in prompt.lower()


class TestActorContext:
    def test_appends_terminal_captures(self):
        actor = Actor(model="claude-sonnet-4-6", temperature=0.7)
        actor.append_capture("Screen 1: Welcome to Claude")
        actor.append_capture("Screen 2: ❯ ")
        messages = actor.build_messages()
        assert len(messages) == 2
        assert "Screen 1" in messages[0]["content"]
        assert "Screen 2" in messages[1]["content"]
```

- [ ] **Step 2: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_actor.py -v`
Expected: FAIL — `ModuleNotFoundError`

- [ ] **Step 3: Create the actor prompt template**

Create `prompts/actor.md`:

```markdown
You are simulating a user interacting with an AI coding agent in a terminal.

{% if posture == "naive" %}
You are a developer who wants to accomplish a task. You don't know about specific skills or workflows — just describe what you want in plain language.
{% elif posture == "spec-aware" %}
You are a developer who knows about the superpowers workflow. You may reference specific skills or conventions by name (e.g., "use the worktree skill", "follow the using-git-worktrees pattern").
{% endif %}

Goals (in rough priority order):
{% for intent in intents %}
- {{ intent }}
{% endfor %}

Rules:
- Decide what to do based on what's currently on screen.
- Goals are not a script — some are conditional. Act on them when relevant.
- Type natural, concise messages like a real developer would.
- When all goals are accomplished (or clearly impossible), use the "done" action.
- If you're stuck and cannot make progress, use the "stuck" action.
```
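To see what the template produces, here is a quick render of the naive posture (assumes `jinja2` is installed; the inline template is an abbreviated stand-in for `prompts/actor.md`, not the file itself):

```python
from jinja2 import Template

# Abbreviated stand-in for prompts/actor.md.
template = Template(
    "{% if posture == 'naive' %}"
    "Describe what you want in plain language.\n"
    "{% endif %}"
    "Goals:\n"
    "{% for intent in intents %}- {{ intent }}\n{% endfor %}"
)
prompt = template.render(posture="naive", intents=["Ask the agent to create a worktree"])
print(prompt)
```

The posture switch and the intent loop are the only dynamic parts, which is what `test_builds_system_prompt_naive` probes for.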

- [ ] **Step 4: Write the implementation**

```python
# drill/actor.py
"""Actor LLM: simulates a user driving an agent session."""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import anthropic
from jinja2 import Template


ACTOR_TOOL = {
    "name": "terminal_action",
    "description": "Send an action to the terminal session.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["type", "done", "stuck", "key"],
                "description": "The action to take.",
            },
            "text": {
                "type": "string",
                "description": "Text to type (only for 'type' action).",
            },
            "key": {
                "type": "string",
                "description": "Special key to send (only for 'key' action, e.g., 'ctrl-c').",
            },
        },
        "required": ["action"],
    },
}


@dataclass
class ActorAction:
    action: str  # "type", "done", "stuck", "key"
    text: str | None = None
    key: str | None = None

    @classmethod
    def from_tool_result(cls, data: dict) -> ActorAction:
        return cls(
            action=data["action"],
            text=data.get("text"),
            key=data.get("key"),
        )


class Actor:
    """Drives agent sessions by deciding what a simulated user would type."""

    def __init__(self, model: str = "claude-sonnet-4-6", temperature: float = 0.7):
        self.model = model
        self.temperature = temperature
        self.captures: list[str] = []
        self._system_prompt: str | None = None
        self._client = anthropic.Anthropic()

    def build_system_prompt(self, posture: str, intents: list[str]) -> str:
        """Render the actor system prompt from template."""
        template_path = Path(__file__).parent.parent / "prompts" / "actor.md"
        template = Template(template_path.read_text())
        self._system_prompt = template.render(posture=posture, intents=intents)
        return self._system_prompt

    def append_capture(self, terminal_output: str) -> None:
        """Append a terminal capture to the rolling context."""
        self.captures.append(terminal_output)

    def build_messages(self) -> list[dict]:
        """Build the message list from terminal captures."""
        messages = []
        for capture in self.captures:
            messages.append({"role": "user", "content": capture})
        return messages
|
||
|
||
def decide(self) -> ActorAction:
|
||
"""Call the LLM to decide the next action."""
|
||
response = self._client.messages.create(
|
||
model=self.model,
|
||
max_tokens=1024,
|
||
temperature=self.temperature,
|
||
system=self._system_prompt,
|
||
tools=[ACTOR_TOOL],
|
||
tool_choice={"type": "tool", "name": "terminal_action"},
|
||
messages=self.build_messages(),
|
||
)
|
||
# Extract the tool use block
|
||
for block in response.content:
|
||
if block.type == "tool_use":
|
||
return ActorAction.from_tool_result(block.input)
|
||
raise RuntimeError("Actor did not return a tool_use block")
|
||
```
|
||
|
||
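Independent of the Anthropic client, the template-rendering half of `build_system_prompt` can be sanity-checked with an inline template. This is a minimal sketch using a hypothetical miniature prompt, not the real `prompts/actor.md`:

```python
from jinja2 import Template

# Hypothetical miniature of the actor prompt, inlined purely for illustration.
MINI_TEMPLATE = (
    "{% if posture == 'naive' %}Plain-language user.{% endif %}\n"
    "Goals:\n"
    "{% for intent in intents %}- {{ intent }}\n{% endfor %}"
)

rendered = Template(MINI_TEMPLATE).render(
    posture="naive",
    intents=["Add a login page", "Run the test suite"],
)
print(rendered)
```

The same `posture`/`intents` variables drive the real template, so a quick render like this catches Jinja syntax errors before any API call is made.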
- [ ] **Step 5: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_actor.py -v`
Expected: All tests PASS (no live API calls — only testing parsing and prompt building)

- [ ] **Step 6: Commit**

```bash
git add drill/actor.py prompts/actor.md tests/test_actor.py
git commit -m "feat: actor LLM with structured tool_use output and prompt template"
```

---

### Task 6: Verifier LLM

**Files:**
- Create: `drill/verifier.py`
- Create: `prompts/verifier.md`
- Create: `tests/test_verifier.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_verifier.py
import json
import pytest
from unittest.mock import MagicMock, patch

from drill.verifier import Verifier, Verdict, CriterionResult


class TestVerdict:
    def test_parse_valid_verdict(self):
        data = {
            "criteria": [
                {
                    "criterion": "Agent detected on main",
                    "verdict": "pass",
                    "evidence": "Terminal showed 'main branch detected'",
                    "rationale": "Agent correctly identified the branch",
                }
            ],
            "observations": ["Agent was very fast"],
            "summary": "Passed all checks",
        }
        verdict = Verdict.model_validate(data)
        assert len(verdict.criteria) == 1
        assert verdict.criteria[0].verdict == "pass"
        assert verdict.score == "1/1"

    def test_score_calculation(self):
        data = {
            "criteria": [
                {"criterion": "A", "verdict": "pass", "evidence": "e", "rationale": "r"},
                {"criterion": "B", "verdict": "fail", "evidence": "e", "rationale": "r"},
                {"criterion": "C", "verdict": "pass", "evidence": "e", "rationale": "r"},
            ],
            "observations": [],
            "summary": "Mixed results",
        }
        verdict = Verdict.model_validate(data)
        assert verdict.score == "2/3"
        assert verdict.passed is False

    def test_all_pass(self):
        data = {
            "criteria": [
                {"criterion": "A", "verdict": "pass", "evidence": "e", "rationale": "r"},
            ],
            "observations": [],
            "summary": "Good",
        }
        verdict = Verdict.model_validate(data)
        assert verdict.passed is True


class TestVerifierPrompt:
    def test_builds_system_prompt(self):
        verifier = Verifier(model="claude-sonnet-4-6", temperature=0.0)
        prompt = verifier.build_system_prompt()
        assert "criterion" in prompt.lower()
        assert "evidence" in prompt.lower()
        assert "JSON" in prompt
```

- [ ] **Step 2: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_verifier.py -v`
Expected: FAIL — `ModuleNotFoundError`

- [ ] **Step 3: Create the verifier prompt template**

Create `prompts/verifier.md`:

```markdown
You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.

You will receive:
1. Terminal session log (what was displayed on screen)
2. Filesystem state after the session (file tree, git state, worktree list)
3. Tool call log (structured record of every tool the agent invoked)

Evaluate each criterion independently. For each, respond with:
- verdict: pass or fail
- evidence: specific quotes from the logs or filesystem state
- rationale: why this constitutes a pass or fail

After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.

Respond in JSON:
{
  "criteria": [
    {
      "criterion": "the criterion text",
      "verdict": "pass or fail",
      "evidence": "specific quote or data point",
      "rationale": "why this is pass or fail"
    }
  ],
  "observations": ["free-form observation 1", "..."],
  "summary": "one-line overall assessment"
}
```

- [ ] **Step 4: Write the implementation**

```python
# drill/verifier.py
"""Verifier LLM: evaluates agent session against criteria."""

from __future__ import annotations

import json
from pathlib import Path

import anthropic
from jinja2 import Template
from pydantic import BaseModel


class CriterionResult(BaseModel):
    criterion: str
    verdict: str  # "pass" or "fail"
    evidence: str
    rationale: str


class Verdict(BaseModel):
    criteria: list[CriterionResult]
    observations: list[str]
    summary: str

    @property
    def score(self) -> str:
        passed = sum(1 for c in self.criteria if c.verdict == "pass")
        return f"{passed}/{len(self.criteria)}"

    @property
    def passed(self) -> bool:
        return all(c.verdict == "pass" for c in self.criteria)


class Verifier:
    """Evaluates agent sessions against verification criteria."""

    MAX_RETRIES = 3

    def __init__(self, model: str = "claude-sonnet-4-6", temperature: float = 0.0):
        self.model = model
        self.temperature = temperature
        self._client = anthropic.Anthropic()

    def build_system_prompt(self) -> str:
        """Render the verifier system prompt from template."""
        template_path = Path(__file__).parent.parent / "prompts" / "verifier.md"
        return template_path.read_text()

    def verify(
        self,
        session_log: str,
        filesystem_json: str,
        tool_calls_jsonl: str,
        criteria: list[str],
    ) -> Verdict:
        """Run the verifier against a completed session."""
        system = self.build_system_prompt()
        user_content = (
            "## Terminal Session Log\n\n"
            f"```\n{session_log}\n```\n\n"
            "## Filesystem State\n\n"
            f"```json\n{filesystem_json}\n```\n\n"
            "## Tool Call Log\n\n"
            f"```jsonl\n{tool_calls_jsonl}\n```\n\n"
            "## Criteria to Evaluate\n\n"
            + "\n".join(f"- {c}" for c in criteria)
        )

        for attempt in range(self.MAX_RETRIES):
            response = self._client.messages.create(
                model=self.model,
                max_tokens=4096,
                temperature=self.temperature,
                system=system,
                messages=[{"role": "user", "content": user_content}],
            )
            text = response.content[0].text
            # Extract JSON from response (may be wrapped in markdown fences)
            json_str = _extract_json(text)
            try:
                return Verdict.model_validate_json(json_str)
            except Exception:
                if attempt == self.MAX_RETRIES - 1:
                    raise
                continue

        raise RuntimeError("Verifier failed to return valid JSON")


def _extract_json(text: str) -> str:
    """Extract JSON from text that may be wrapped in markdown code fences."""
    # Try to find JSON in code fences
    if "```json" in text:
        start = text.index("```json") + 7
        end = text.index("```", start)
        return text[start:end].strip()
    if "```" in text:
        start = text.index("```") + 3
        end = text.index("```", start)
        return text[start:end].strip()
    # Try raw JSON
    start = text.index("{")
    end = text.rindex("}") + 1
    return text[start:end]
```
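The fence-stripping logic is the one piece worth exercising without an API key. Here is a standalone copy of the same extraction (duplicated here only so it runs outside `drill.verifier`):

```python
# Standalone copy of the _extract_json fence-stripping logic, for illustration.
def extract_json(text: str) -> str:
    if "```json" in text:
        start = text.index("```json") + 7  # skip past the "```json" marker
        end = text.index("```", start)
        return text[start:end].strip()
    if "```" in text:
        start = text.index("```") + 3
        end = text.index("```", start)
        return text[start:end].strip()
    # Fall back to the outermost braces of raw JSON.
    start = text.index("{")
    end = text.rindex("}") + 1
    return text[start:end]

wrapped = 'Here is my verdict:\n```json\n{"summary": "ok"}\n```'
clean = extract_json(wrapped)
print(clean)  # → {"summary": "ok"}
```

Note this takes the first fence and the outermost braces; a model that emits multiple JSON objects would need something stricter, which is why the caller wraps it in a retry loop.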
- [ ] **Step 5: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_verifier.py -v`
Expected: All tests PASS

- [ ] **Step 6: Commit**

```bash
git add drill/verifier.py prompts/verifier.md tests/test_verifier.py
git commit -m "feat: verifier LLM with pydantic verdict schema and retry logic"
```

---

### Task 7: Log Normalizer

**Files:**
- Create: `drill/normalizer.py`
- Create: `tests/test_normalizer.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_normalizer.py
import json
import pytest
from pathlib import Path

from drill.normalizer import normalize_claude_logs, normalize_codex_logs, snapshot_log_dir, collect_new_logs


class TestSnapshotAndCollect:
    def test_snapshot_and_collect_new_files(self, tmp_path):
        log_dir = tmp_path / "logs"
        log_dir.mkdir()
        # Pre-existing file
        (log_dir / "old.jsonl").write_text('{"old": true}\n')
        snapshot = snapshot_log_dir(log_dir)
        # Simulate new file created during session
        (log_dir / "new.jsonl").write_text('{"new": true}\n')
        new_files = collect_new_logs(log_dir, snapshot)
        assert len(new_files) == 1
        assert new_files[0].name == "new.jsonl"

    def test_empty_dir_returns_empty(self, tmp_path):
        log_dir = tmp_path / "logs"
        log_dir.mkdir()
        snapshot = snapshot_log_dir(log_dir)
        new_files = collect_new_logs(log_dir, snapshot)
        assert new_files == []


class TestNormalizeClaudeLogs:
    def test_normalizes_tool_use(self):
        lines = [
            json.dumps({
                "type": "tool_use",
                "name": "EnterWorktree",
                "input": {"branch": "add-login"},
            }),
            json.dumps({
                "type": "tool_use",
                "name": "Bash",
                "input": {"command": "git status"},
            }),
            json.dumps({
                "type": "text",
                "text": "I'll create a worktree",
            }),
        ]
        raw = "\n".join(lines)
        normalized = normalize_claude_logs(raw)
        assert len(normalized) == 2
        assert normalized[0]["tool"] == "EnterWorktree"
        assert normalized[0]["source"] == "native"
        assert normalized[1]["tool"] == "Bash"
        assert normalized[1]["source"] == "shell"


class TestNormalizeCodexLogs:
    def test_normalizes_local_shell_call(self):
        lines = [
            json.dumps({
                "type": "response_item",
                "item": {
                    "type": "local_shell_call",
                    "action": {"command": ["git", "worktree", "add", "feature"]},
                    "status": "completed",
                }
            }),
            json.dumps({
                "type": "response_item",
                "item": {
                    "type": "message",
                    "content": [{"text": "Creating worktree"}],
                }
            }),
        ]
        raw = "\n".join(lines)
        normalized = normalize_codex_logs(raw)
        assert len(normalized) == 1
        assert normalized[0]["tool"] == "Bash"
        assert "git worktree add" in normalized[0]["args"]["command"]
        assert normalized[0]["source"] == "shell"
```
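The snapshot/diff pattern the first test exercises is just a set difference over directory listings; a minimal standalone sketch of the same idea:

```python
import tempfile
from pathlib import Path

# Standalone sketch of the snapshot/diff pattern used to find logs
# created during a session (set difference over file names).
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    (log_dir / "old.jsonl").write_text("{}\n")
    before = {f.name for f in log_dir.iterdir() if f.is_file()}

    # A file appears while the "session" runs...
    (log_dir / "new.jsonl").write_text("{}\n")
    after = {f.name for f in log_dir.iterdir() if f.is_file()}

    new_names = sorted(after - before)
    print(new_names)  # → ['new.jsonl']
```

Name-based diffing is deliberately simple; it misses files that existed before but were appended to during the session, which is an acceptable trade-off when each agent session writes a fresh log file.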
- [ ] **Step 2: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_normalizer.py -v`
Expected: FAIL — `ModuleNotFoundError`

- [ ] **Step 3: Write the implementation**

```python
# drill/normalizer.py
"""Normalizes backend-specific session logs to a common tool call schema."""

from __future__ import annotations

import json
from pathlib import Path

# Tools that are native (not shell commands)
NATIVE_TOOLS = {
    "EnterWorktree", "ExitWorktree", "EnterPlanMode", "ExitPlanMode",
    "TaskCreate", "TaskUpdate", "TaskList", "TaskGet",
    "Skill", "Agent", "Read", "Write", "Edit", "Glob", "Grep",
}


def snapshot_log_dir(log_dir: Path) -> set[str]:
    """Snapshot the current files in a log directory."""
    if not log_dir.exists():
        return set()
    return {f.name for f in log_dir.iterdir() if f.is_file()}


def collect_new_logs(log_dir: Path, snapshot: set[str]) -> list[Path]:
    """Find files created after the snapshot was taken."""
    if not log_dir.exists():
        return []
    current = {f.name for f in log_dir.iterdir() if f.is_file()}
    new_names = current - snapshot
    return [log_dir / name for name in sorted(new_names)]


def normalize_claude_logs(raw_content: str) -> list[dict]:
    """Normalize Claude Code session log to common schema."""
    results = []
    for line in raw_content.strip().split("\n"):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("type") == "tool_use":
            tool_name = entry.get("name", "")
            source = "native" if tool_name in NATIVE_TOOLS else "shell"
            results.append({
                "tool": tool_name,
                "args": entry.get("input", {}),
                "source": source,
            })
    return results


def normalize_codex_logs(raw_content: str) -> list[dict]:
    """Normalize Codex rollout log to common schema."""
    results = []
    for line in raw_content.strip().split("\n"):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("type") != "response_item":
            continue
        item = entry.get("item", {})
        item_type = item.get("type", "")
        if item_type == "local_shell_call":
            action = item.get("action", {})
            cmd = action.get("command", [])
            cmd_str = " ".join(cmd) if isinstance(cmd, list) else str(cmd)
            results.append({
                "tool": "Bash",
                "args": {"command": cmd_str},
                "source": "shell",
            })
        elif item_type == "function_call":
            name = item.get("name", "")
            source = "native" if name in NATIVE_TOOLS else "shell"
            results.append({
                "tool": name,
                "args": item.get("arguments", {}),
                "source": source,
            })
    return results


NORMALIZERS = {
    "claude": normalize_claude_logs,
    "codex": normalize_codex_logs,
}
```
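Whatever the backend, both normalizers emit the same three-field schema, which the engine later serializes one object per line into `tool_calls.jsonl`. A quick illustration of that serialization (the tool names here are sample values):

```python
import json

# The common schema every normalizer emits: tool, args, source.
calls = [
    {"tool": "EnterWorktree", "args": {"branch": "add-login"}, "source": "native"},
    {"tool": "Bash", "args": {"command": "git status"}, "source": "shell"},
]

# One JSON object per line — the tool_calls.jsonl format handed to the verifier.
tool_calls_jsonl = "\n".join(json.dumps(c) for c in calls)
print(tool_calls_jsonl)
```

Keeping the schema this small means the verifier prompt never needs to know which backend produced the log.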
- [ ] **Step 4: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_normalizer.py -v`
Expected: All tests PASS

- [ ] **Step 5: Commit**

```bash
git add drill/normalizer.py tests/test_normalizer.py
git commit -m "feat: log normalizer for Claude Code and Codex session logs"
```

---

### Task 8: Engine (Full Lifecycle Orchestrator)

**Files:**
- Create: `drill/engine.py`
- Create: `tests/test_engine.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_engine.py
import json
import pytest
from pathlib import Path
from unittest.mock import MagicMock, patch
from datetime import datetime

from drill.engine import Engine, RunResult, ScenarioConfig, snapshot_filesystem


class TestScenarioConfig:
    def test_loads_from_yaml(self, tmp_path):
        scenario_file = tmp_path / "test.yaml"
        scenario_file.write_text("""
scenario: test-scenario
description: "A test"
user_posture: naive
setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
turns:
  - intent: "Do the thing"
limits:
  max_turns: 10
  turn_timeout: 60
verify:
  criteria:
    - "Thing was done"
  observe: true
""")
        config = ScenarioConfig.from_yaml(scenario_file)
        assert config.scenario == "test-scenario"
        assert config.user_posture == "naive"
        assert config.limits["max_turns"] == 10
        assert len(config.turns) == 1
        assert len(config.verify["criteria"]) == 1


class TestSnapshotFilesystem:
    def test_captures_git_state(self, tmp_path):
        import subprocess
        subprocess.run(["git", "init"], cwd=tmp_path, capture_output=True)
        subprocess.run(["git", "commit", "--allow-empty", "-m", "init"],
                       cwd=tmp_path, capture_output=True)
        snapshot = snapshot_filesystem(tmp_path)
        data = json.loads(snapshot)
        assert "git_status" in data
        assert "branch" in data
        assert "worktree_list" in data
        assert "files" in data


class TestRunResult:
    def test_serializes_to_dir(self, tmp_path):
        result = RunResult(
            scenario="test",
            backend="claude",
            timestamp="2026-04-07T14-30-00",
            session_log="session output here",
            filesystem_json='{"files": []}',
            tool_calls_jsonl='{"tool": "Bash"}\n',
            verdict_json='{"criteria": [], "observations": [], "summary": "ok"}',
            meta={
                "backend": "claude",
                "duration_seconds": 42,
                "actor_turns": 5,
            },
        )
        result.save(tmp_path)
        assert (tmp_path / "session.log").read_text() == "session output here"
        assert (tmp_path / "filesystem.json").exists()
        assert (tmp_path / "tool_calls.jsonl").exists()
        assert (tmp_path / "verdict.json").exists()
        assert (tmp_path / "meta.json").exists()
```
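The scenario YAML above nests `helpers`/`assertions` under `setup` and `criteria` under `verify`; `yaml.safe_load` turns that straight into the dicts `ScenarioConfig` stores. A quick shape-check with an inline snippet (not a real scenario file):

```python
import yaml

# Inline fragment mirroring the scenario YAML shape used in the test above.
doc = """
scenario: demo-scenario
setup:
  helpers:
    - create_base_repo
limits:
  max_turns: 10
turns:
  - intent: "Do the thing"
"""

data = yaml.safe_load(doc)
print(data["limits"]["max_turns"])        # → 10
print(data["turns"][0]["intent"])         # → Do the thing
```

Because the config is plain dicts and lists, `ScenarioConfig.from_yaml` only needs `.get()` calls with defaults rather than a full schema.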
- [ ] **Step 2: Run test to verify it fails**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_engine.py -v`
Expected: FAIL — `ModuleNotFoundError`

- [ ] **Step 3: Write the implementation**

```python
# drill/engine.py
"""Engine: orchestrates the full Drill run lifecycle."""

from __future__ import annotations

import json
import os
import subprocess
import time
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path

import yaml

from drill.actor import Actor, ActorAction
from drill.backend import Backend, load_backend
from drill.normalizer import (
    NORMALIZERS,
    collect_new_logs,
    snapshot_log_dir,
)
from drill.session import TmuxSession
from drill.setup import clone_template, run_assertions, run_helpers
from drill.verifier import Verdict, Verifier


@dataclass
class ScenarioConfig:
    scenario: str
    description: str
    user_posture: str
    setup: dict
    turns: list[dict]
    limits: dict
    verify: dict

    @classmethod
    def from_yaml(cls, path: Path) -> ScenarioConfig:
        with open(path) as f:
            data = yaml.safe_load(f)
        return cls(
            scenario=data["scenario"],
            description=data.get("description", ""),
            user_posture=data.get("user_posture", "naive"),
            setup=data.get("setup", {}),
            turns=data.get("turns", []),
            limits=data.get("limits", {"max_turns": 20, "turn_timeout": 120}),
            verify=data.get("verify", {"criteria": [], "observe": False}),
        )


@dataclass
class RunResult:
    scenario: str
    backend: str
    timestamp: str
    session_log: str
    filesystem_json: str
    tool_calls_jsonl: str
    verdict_json: str
    meta: dict

    def save(self, output_dir: Path) -> None:
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "session.log").write_text(self.session_log)
        (output_dir / "filesystem.json").write_text(self.filesystem_json)
        (output_dir / "tool_calls.jsonl").write_text(self.tool_calls_jsonl)
        (output_dir / "verdict.json").write_text(self.verdict_json)
        (output_dir / "meta.json").write_text(json.dumps(self.meta, indent=2))


def snapshot_filesystem(workdir: Path) -> str:
    """Capture filesystem state as JSON."""
    files = []
    for f in sorted(workdir.rglob("*")):
        if ".git" in f.parts:
            continue
        if f.is_file():
            files.append(str(f.relative_to(workdir)))

    git_status = _git_cmd(workdir, ["git", "status", "--short"])
    branch = _git_cmd(workdir, ["git", "branch", "--show-current"])
    worktree_list = _git_cmd(workdir, ["git", "worktree", "list"])

    return json.dumps({
        "files": files,
        "git_status": git_status,
        "branch": branch,
        "worktree_list": worktree_list,
    }, indent=2)


class Engine:
    """Orchestrates the full Drill run lifecycle."""

    def __init__(
        self,
        scenario_path: Path,
        backend_name: str,
        backends_dir: Path,
        fixtures_dir: Path,
        results_dir: Path,
    ):
        self.scenario = ScenarioConfig.from_yaml(scenario_path)
        self.backend = load_backend(backend_name, backends_dir)
        self.fixtures_dir = fixtures_dir
        self.results_dir = results_dir

    def run(self) -> RunResult:
        """Execute the full 7-step lifecycle."""
        start_time = time.time()
        timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")

        # 1. LOAD — validate env
        self.backend.validate_env()

        # 2. SETUP
        workdir = Path(f"/tmp/drill-{self.scenario.scenario}-{timestamp}")
        self._setup(workdir)

        # 3-4. SESSION + ACTOR LOOP
        session_name = f"drill-{self.scenario.scenario}-{timestamp}"
        session = TmuxSession(
            name=session_name,
            cols=self.backend.cols,
            rows=self.backend.rows,
        )

        # Snapshot log dir before session
        log_dir = self._resolve_log_dir()
        log_snapshot = snapshot_log_dir(log_dir) if log_dir else set()

        session_log, actor_turns = self._run_session(session, workdir)

        # 5. COLLECT
        filesystem_json = snapshot_filesystem(workdir)
        tool_calls = self._collect_tool_calls(log_dir, log_snapshot)
        tool_calls_jsonl = "\n".join(json.dumps(tc) for tc in tool_calls)

        # 6. VERIFY
        verifier = Verifier()
        verdict = verifier.verify(
            session_log=session_log,
            filesystem_json=filesystem_json,
            tool_calls_jsonl=tool_calls_jsonl,
            criteria=self.scenario.verify["criteria"],
        )

        # 7. RESULTS
        duration = time.time() - start_time
        meta = {
            "scenario": self.scenario.scenario,
            "backend": self.backend.name,
            "user_posture": self.scenario.user_posture,
            "timestamp": timestamp,
            "duration_seconds": round(duration, 1),
            "actor_turns": actor_turns,
            "actor_model": "claude-sonnet-4-6",
            "verifier_model": "claude-sonnet-4-6",
        }

        result = RunResult(
            scenario=self.scenario.scenario,
            backend=self.backend.name,
            timestamp=timestamp,
            session_log=session_log,
            filesystem_json=filesystem_json,
            tool_calls_jsonl=tool_calls_jsonl,
            verdict_json=verdict.model_dump_json(indent=2),
            meta=meta,
        )

        output_dir = (
            self.results_dir
            / self.scenario.scenario
            / self.backend.name
            / timestamp
        )
        result.save(output_dir)
        return result

    def _setup(self, workdir: Path) -> None:
        """Step 2: Setup."""
        helpers = self.scenario.setup.get("helpers", [])

        # Run backend pre_run hooks
        from setup_helpers import HELPER_REGISTRY

        for hook_name in self.backend.hooks.get("pre_run", []):
            hook = HELPER_REGISTRY.get(hook_name)
            if hook and hook_name == "symlink_superpowers":
                hook(workdir, os.environ["SUPERPOWERS_ROOT"])
            elif hook:
                hook(workdir)

        # Run scenario helpers
        run_helpers(helpers, workdir, self.fixtures_dir)

        # Run assertions
        assertions = self.scenario.setup.get("assertions", [])
        if assertions:
            run_assertions(assertions, workdir)

    def _run_session(
        self, session: TmuxSession, workdir: Path
    ) -> tuple[str, int]:
        """Steps 3-4: Session + Actor loop. Returns (session_log, turn_count)."""
        session.create()
        try:
            cmd = self.backend.build_command(str(workdir))
            session.launch(cmd, str(workdir))

            # Wait for startup
            self._wait_for_ready(session, timeout=self.backend.startup_timeout)

            # Actor loop
            actor = Actor()
            intents = [t["intent"] for t in self.scenario.turns]
            actor.build_system_prompt(
                posture=self.scenario.user_posture,
                intents=intents,
            )

            max_turns = self.scenario.limits.get("max_turns", 20)
            turn_timeout = self.scenario.limits.get("turn_timeout", 120)
            all_captures = []
            turn_count = 0

            for turn in range(max_turns):
                # Wait for agent idle
                self._wait_for_ready(session, timeout=turn_timeout)

                # Capture and send to actor
                capture = session.capture()
                all_captures.append(f"=== Turn {turn + 1} ===\n{capture}")
                actor.append_capture(f"Terminal output:\n{capture}")

                action = actor.decide()
                turn_count += 1

                if action.action == "done":
                    break
                elif action.action == "stuck":
                    break
                elif action.action == "type":
                    session.send_keys(action.text)
                elif action.action == "key":
                    session.send_special_key(action.key)

            # Collect final state
            final_capture = session.capture()
            all_captures.append(f"=== Final ===\n{final_capture}")

            # Shutdown — e.g. "<<KEY:ctrl-c>>" means send the special key "ctrl-c"
            if self.backend.shutdown.startswith("<<KEY:"):
                key = self.backend.shutdown[6:-2]
                session.send_special_key(key)
            else:
                session.send_keys(self.backend.shutdown)

            # Wait for exit
            time.sleep(3)

            return "\n".join(all_captures), turn_count
        finally:
            session.kill()

    def _wait_for_ready(self, session: TmuxSession, timeout: float) -> None:
        """Wait for quiescence + ready pattern."""
        quiescence = self.backend.quiescence_seconds
        start = time.time()
        last_output = ""
        stable_since = None

        while time.time() - start < timeout:
            current = session.capture()
            if current != last_output:
                last_output = current
                stable_since = time.time()
            elif stable_since and (time.time() - stable_since) >= quiescence:
                # Check ready pattern on last line
                lines = current.strip().split("\n")
                if lines and self.backend.is_ready_line(lines[-1]):
                    return
            time.sleep(0.5)

        # Timeout — proceed anyway (actor can handle it)

    def _resolve_log_dir(self) -> Path | None:
        """Resolve the log directory from backend config."""
        pattern = self.backend.session_logs.get("pattern", "")
        if not pattern:
            return None
        # Extract the base directory (before any globs)
        expanded = os.path.expanduser(pattern)
        parts = expanded.split("*")[0].rstrip("/")
        path = Path(parts)
        return path if path.exists() else None

    def _collect_tool_calls(
        self, log_dir: Path | None, snapshot: set[str]
    ) -> list[dict]:
        """Collect and normalize tool calls from backend logs."""
        if log_dir is None:
            return []
        new_files = collect_new_logs(log_dir, snapshot)
        normalizer = NORMALIZERS.get(self.backend.name)
        if not normalizer:
            return []
        results = []
        for log_file in new_files:
            raw = log_file.read_text()
            results.extend(normalizer(raw))
        return results


def _git_cmd(workdir: Path, cmd: list[str]) -> str:
    """Run a git command and return stdout."""
    result = subprocess.run(
        cmd, cwd=workdir, capture_output=True, text=True
    )
    return result.stdout.strip()
```
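Results are filed under `results/<scenario>/<backend>/<timestamp>/`, with the timestamp format chosen so it sorts lexically and is filesystem-safe (colons replaced by hyphens). A small sketch of how that path is composed, using "demo-scenario" and "claude" as placeholder values:

```python
from datetime import datetime
from pathlib import Path

# Mirrors the engine's results/<scenario>/<backend>/<timestamp>/ layout;
# scenario and backend names here are placeholders.
ts = datetime(2026, 4, 7, 14, 30, 0).strftime("%Y-%m-%dT%H-%M-%S")
output_dir = Path("results") / "demo-scenario" / "claude" / ts
print(output_dir.as_posix())  # → results/demo-scenario/claude/2026-04-07T14-30-00
```

Because timestamps sort lexically, `compare` can pick the latest run per backend with a plain `sorted()` over directory names.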
- [ ] **Step 4: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_engine.py -v`
Expected: All tests PASS

- [ ] **Step 5: Commit**

```bash
git add drill/engine.py tests/test_engine.py
git commit -m "feat: engine orchestrator with full 7-step run lifecycle"
```

---

### Task 9: CLI

**Files:**
- Create: `drill/cli.py`
- Create: `tests/test_cli.py`

- [ ] **Step 1: Write the failing test**

```python
# tests/test_cli.py
import json
import pytest
from pathlib import Path
from click.testing import CliRunner

from drill.cli import main


@pytest.fixture
def scenarios_dir():
    return Path(__file__).parent.parent / "scenarios"


class TestListCommand:
    def test_lists_scenarios(self, scenarios_dir):
        # Create a test scenario
        scenarios_dir.mkdir(exist_ok=True)
        test_scenario = scenarios_dir / "_test-list.yaml"
        test_scenario.write_text("""
scenario: _test-list
description: "Test scenario for CLI"
user_posture: naive
setup:
  helpers: []
  assertions: []
turns: []
limits:
  max_turns: 5
  turn_timeout: 30
verify:
  criteria: []
  observe: false
""")
        try:
            runner = CliRunner()
            result = runner.invoke(main, ["list"])
            assert result.exit_code == 0
            assert "_test-list" in result.output
        finally:
            test_scenario.unlink()


class TestCompareCommand:
    def test_compare_with_results(self, tmp_path):
        # Set up fake results
        results_dir = tmp_path / "results"
        for backend in ["claude", "codex"]:
            d = results_dir / "test-scenario" / backend / "2026-04-07T14-00-00"
            d.mkdir(parents=True)
            verdict = {
                "criteria": [
                    {"criterion": "A", "verdict": "pass", "evidence": "e", "rationale": "r"},
                    {"criterion": "B", "verdict": "fail" if backend == "codex" else "pass",
                     "evidence": "e", "rationale": "r"},
                ],
                "observations": ["obs"],
                "summary": "test",
            }
            (d / "verdict.json").write_text(json.dumps(verdict))
            (d / "meta.json").write_text(json.dumps({
                "actor_turns": 5,
                "user_posture": "naive",
            }))

        runner = CliRunner()
        result = runner.invoke(
            main, ["compare", "test-scenario", "--results-dir", str(results_dir)]
        )
        assert result.exit_code == 0
        assert "claude" in result.output
        assert "codex" in result.output
```
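Both tests lean on click's `CliRunner`, which invokes a command in-process and captures exit code and output without spawning a subprocess. A minimal sketch of the pattern with a hypothetical demo group (not `drill.cli`):

```python
import click
from click.testing import CliRunner

# Hypothetical demo group, standing in for drill.cli's `main`.
@click.group()
def demo():
    """Demo CLI group."""

@demo.command("list")
def list_cmd():
    click.echo("scenario-a")

result = CliRunner().invoke(demo, ["list"])
print(result.exit_code)            # → 0
print(result.output.strip())       # → scenario-a
```

`invoke` also swallows exceptions into `result.exception`, so a failing command still yields a result object the test can assert against.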
- [ ] **Step 2: Run test to verify it fails**
|
||
|
||
Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_cli.py -v`
|
||
Expected: FAIL — `ModuleNotFoundError`
|
||
|
||
- [ ] **Step 3: Write the implementation**
|
||
|
||
```python
# drill/cli.py
"""Drill CLI: run, compare, list."""

from __future__ import annotations

import json
from pathlib import Path

import click

from drill.engine import Engine
from drill.verifier import Verdict


PROJECT_ROOT = Path(__file__).parent.parent


@click.group()
def main():
    """Drill: Superpowers skill compliance benchmark."""
    pass


@main.command()
@click.argument("scenario")
@click.option("--backend", "-b", required=True, help="Backend name (e.g., claude, codex)")
@click.option("--backends-dir", type=click.Path(exists=True, path_type=Path),
              default=PROJECT_ROOT / "backends")
@click.option("--scenarios-dir", type=click.Path(exists=True, path_type=Path),
              default=PROJECT_ROOT / "scenarios")
@click.option("--fixtures-dir", type=click.Path(exists=True, path_type=Path),
              default=PROJECT_ROOT / "fixtures")
@click.option("--results-dir", type=click.Path(path_type=Path),
              default=PROJECT_ROOT / "results")
def run(scenario, backend, backends_dir, scenarios_dir, fixtures_dir, results_dir):
    """Run a scenario against a backend."""
    scenario_path = scenarios_dir / f"{scenario}.yaml"
    if not scenario_path.exists():
        raise click.ClickException(f"Scenario not found: {scenario_path}")

    engine = Engine(
        scenario_path=scenario_path,
        backend_name=backend,
        backends_dir=backends_dir,
        fixtures_dir=fixtures_dir,
        results_dir=results_dir,
    )

    click.echo(f"Running {scenario} with {backend}...")
    result = engine.run()

    verdict = Verdict.model_validate_json(result.verdict_json)
    click.echo(f"\nResult: {'PASS' if verdict.passed else 'FAIL'} ({verdict.score})")
    for c in verdict.criteria:
        icon = "✓" if c.verdict == "pass" else "✗"
        click.echo(f" {icon} {c.criterion}")
    if verdict.observations:
        click.echo("\nObservations:")
        for obs in verdict.observations:
            click.echo(f" - {obs}")


@main.command("list")
@click.option("--scenarios-dir", type=click.Path(exists=True, path_type=Path),
              default=PROJECT_ROOT / "scenarios")
def list_scenarios(scenarios_dir):
    """List available scenarios."""
    import yaml
    for f in sorted(scenarios_dir.glob("*.yaml")):
        with open(f) as fh:
            data = yaml.safe_load(fh)
        name = data.get("scenario", f.stem)
        desc = data.get("description", "")
        click.echo(f" {name:40s} {desc}")


@main.command()
@click.argument("scenario")
@click.option("--results-dir", type=click.Path(exists=True, path_type=Path),
              default=PROJECT_ROOT / "results")
def compare(scenario, results_dir):
    """Compare results across backends for a scenario."""
    scenario_dir = results_dir / scenario
    if not scenario_dir.exists():
        raise click.ClickException(f"No results found for: {scenario}")

    # Collect latest result per backend
    backends = {}
    for backend_dir in sorted(scenario_dir.iterdir()):
        if not backend_dir.is_dir():
            continue
        # Get most recent run
        runs = sorted(backend_dir.iterdir())
        if not runs:
            continue
        latest = runs[-1]
        verdict_file = latest / "verdict.json"
        meta_file = latest / "meta.json"
        if not verdict_file.exists():
            continue
        verdict = Verdict.model_validate_json(verdict_file.read_text())
        meta = json.loads(meta_file.read_text()) if meta_file.exists() else {}
        backends[backend_dir.name] = {"verdict": verdict, "meta": meta}

    if not backends:
        raise click.ClickException(f"No results found for: {scenario}")

    # Get posture from first result's meta
    first_meta = next(iter(backends.values()))["meta"]
    posture = first_meta.get("user_posture", "unknown")

    # Summary table
    click.echo(f"\nScenario: {scenario} ({posture} posture)\n")
    click.echo(f"{'Backend':12s} {'Result':8s} {'Score':7s} {'Turns':5s}")
    click.echo("-" * 35)
    for name, data in backends.items():
        v = data["verdict"]
        turns = data["meta"].get("actor_turns", "?")
        result = "PASS" if v.passed else "FAIL"
        click.echo(f"{name:12s} {result:8s} {v.score:7s} {str(turns):5s}")

    # Detail table
    all_criteria = set()
    for data in backends.values():
        for c in data["verdict"].criteria:
            all_criteria.add(c.criterion)

    click.echo(f"\n{'Criterion':40s}", nl=False)
    for name in backends:
        click.echo(f" {name:8s}", nl=False)
    click.echo()
    click.echo("-" * (40 + 9 * len(backends)))

    for criterion in sorted(all_criteria):
        click.echo(f"{criterion[:40]:40s}", nl=False)
        for name, data in backends.items():
            match = next(
                (c for c in data["verdict"].criteria if c.criterion == criterion),
                None,
            )
            icon = "✓" if match and match.verdict == "pass" else "✗"
            click.echo(f" {icon:8s}", nl=False)
        click.echo()

    # Observations
    click.echo("\nObservations:")
    for name, data in backends.items():
        for obs in data["verdict"].observations:
            click.echo(f" {name}: {obs}")
```

- [ ] **Step 4: Run tests to verify they pass**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_cli.py -v`
Expected: All tests PASS

- [ ] **Step 5: Commit**

```bash
git add drill/cli.py tests/test_cli.py
git commit -m "feat: CLI with run, compare, and list commands"
```

---

### Task 10: Scenarios

**Files:**
- Create: `scenarios/worktree-creation-from-main.yaml`
- Create: `scenarios/worktree-already-inside.yaml`
- Create: `scenarios/worktree-codex-detached-head.yaml`
- Create: `scenarios/worktree-consent-flow.yaml`

- [ ] **Step 1: Create worktree-creation-from-main scenario**

```yaml
# scenarios/worktree-creation-from-main.yaml
scenario: worktree-creation-from-main
description: "Agent creates an isolated worktree from main branch"
user_posture: naive

setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"
    - "git worktree list | wc -l | tr -d ' ' | grep -x 1"

turns:
  - intent: >
      Ask the agent to create an isolated workspace
      for building a login feature.
  - intent: "Confirm consent if the agent asks."

limits:
  max_turns: 20
  turn_timeout: 120

verify:
  criteria:
    - "Agent detected it was on main, not in an existing worktree"
    - "Agent asked for consent before creating the worktree"
    - "A worktree or isolated workspace now exists with a feature branch"
    - "Agent used the most appropriate tool available for its platform to create the worktree"
  observe: true
```
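
Setup assertions like the ones above are plain shell pipelines; the natural way for `drill/setup.py` to run them is with `cwd` set to the workdir, failing the run on the first nonzero exit. A minimal sketch of such a runner (the function name is illustrative, not the actual `setup.py` API):

```python
import subprocess
from pathlib import Path


def run_assertions(workdir: Path, assertions: list[str]) -> None:
    """Run each shell assertion in workdir; raise on the first failure."""
    for cmd in assertions:
        result = subprocess.run(
            cmd, shell=True, cwd=workdir, capture_output=True, text=True
        )
        if result.returncode != 0:
            raise RuntimeError(f"Setup assertion failed: {cmd}\n{result.stderr}")
```

Failing fast here matters: a scenario whose preconditions are wrong would otherwise produce a misleading verdict.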

- [ ] **Step 2: Create worktree-already-inside scenario**

```yaml
# scenarios/worktree-already-inside.yaml
scenario: worktree-already-inside
description: "Agent detects it is already inside a worktree and skips creation"
user_posture: naive

setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "test $(git worktree list | wc -l) -ge 2"

turns:
  - intent: >
      Ask the agent to create an isolated workspace
      for building a signup feature.

limits:
  max_turns: 15
  turn_timeout: 120

verify:
  criteria:
    - "Agent detected it was already inside a worktree"
    - "Agent did NOT create a new worktree"
    - "Agent communicated that the current worktree is sufficient"
  observe: true
```

Note: this scenario needs an existing worktree in place before the setup assertions run, so the setup must also create one. One option is a `post_helpers` block:

```yaml
setup:
  helpers:
    - create_base_repo
  post_helpers:
    # These run after create_base_repo, modifying the repo
    - name: add_worktree
      args:
        branch: existing-feature
        worktree_path: "${WORKDIR}/../existing-worktree"
  start_in: "${WORKDIR}/../existing-worktree"
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "test $(git worktree list | wc -l) -ge 2"
```

That, however, adds complexity to the setup format. A simpler approach: make the worktree creation a helper that the scenario calls, and have the engine `cd` into the worktree before launching the agent. Revised:

```yaml
# scenarios/worktree-already-inside.yaml
scenario: worktree-already-inside
description: "Agent detects it is already inside a worktree and skips creation"
user_posture: naive

setup:
  helpers:
    - create_base_repo
    - add_existing_worktree
  workdir_override: "../existing-worktree"
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git worktree list | wc -l | tr -d ' ' | grep -x 2"

turns:
  - intent: >
      Ask the agent to create an isolated workspace
      for building a signup feature.

limits:
  max_turns: 15
  turn_timeout: 120

verify:
  criteria:
    - "Agent detected it was already inside a worktree"
    - "Agent did NOT create a new worktree"
    - "Agent communicated that the current worktree is sufficient"
  observe: true
```

- [ ] **Step 3: Create worktree-codex-detached-head scenario**

```yaml
# scenarios/worktree-codex-detached-head.yaml
scenario: worktree-codex-detached-head
description: "Agent detects externally managed worktree with detached HEAD"
user_posture: naive

setup:
  helpers:
    - create_base_repo
    - add_existing_worktree
    - detach_worktree_head
  workdir_override: "../existing-worktree"
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "test -z \"$(git branch --show-current)\""

turns:
  - intent: >
      Ask the agent to create an isolated workspace
      for building a dashboard feature.

limits:
  max_turns: 15
  turn_timeout: 120

verify:
  criteria:
    - "Agent detected it was in an externally managed worktree (detached HEAD)"
    - "Agent did NOT attempt to create a new worktree"
    - "Agent suggested using the current workspace or handing off to the harness"
  observe: true
```

- [ ] **Step 4: Create worktree-consent-flow scenario**

```yaml
# scenarios/worktree-consent-flow.yaml
scenario: worktree-consent-flow
description: "Agent asks for consent before creating a worktree"
user_posture: spec-aware

setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"

turns:
  - intent: >
      Ask the agent to use the worktree skill to create
      an isolated workspace for a notifications feature.
  - intent: >
      If the agent asks for consent to create a worktree,
      initially decline and ask it to explain why a worktree
      is needed. Then approve on the second ask.

limits:
  max_turns: 25
  turn_timeout: 120

verify:
  criteria:
    - "Agent explicitly asked for consent before creating any worktree"
    - "Agent explained the purpose of the worktree when asked"
    - "Agent waited for approval before proceeding with creation"
    - "A worktree was eventually created after consent was given"
  observe: true
```

- [ ] **Step 5: Update setup_helpers to support new helpers**

Add to `setup_helpers/worktree.py`:

```python
def add_existing_worktree(workdir: Path) -> None:
    """Create an existing worktree (for 'already inside' scenarios)."""
    wt_path = workdir.parent / "existing-worktree"
    add_worktree(workdir, "existing-feature", str(wt_path))


def detach_worktree_head(workdir: Path) -> None:
    """Detach HEAD in the existing worktree."""
    wt_path = workdir.parent / "existing-worktree"
    detach_head(str(wt_path))
```

Update `setup_helpers/__init__.py` to register new helpers:

```python
from setup_helpers.base import create_base_repo
from setup_helpers.worktree import (
    add_worktree, detach_head, symlink_superpowers,
    add_existing_worktree, detach_worktree_head,
)

HELPER_REGISTRY = {
    "create_base_repo": create_base_repo,
    "add_worktree": add_worktree,
    "detach_head": detach_head,
    "symlink_superpowers": symlink_superpowers,
    "add_existing_worktree": add_existing_worktree,
    "detach_worktree_head": detach_worktree_head,
}
```
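
With the registry in place, the engine's setup phase can stay a plain name lookup. A sketch of the dispatch loop, assuming each helper takes only the workdir as the signatures above do (the function name is illustrative):

```python
from pathlib import Path
from typing import Callable


def run_helpers(
    helper_names: list[str],
    workdir: Path,
    registry: dict[str, Callable[[Path], None]],
) -> None:
    """Dispatch each named setup helper against the workdir, failing loudly on typos."""
    for name in helper_names:
        if name not in registry:
            raise KeyError(f"Unknown setup helper: {name}")
        registry[name](workdir)
```

Raising on an unknown name, rather than skipping it, keeps a misspelled helper in a scenario YAML from silently producing a half-configured repo.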

- [ ] **Step 6: Update engine to handle workdir_override**

In `drill/engine.py`, update `_setup` and `run` to handle `workdir_override`:

```python
# In Engine.run(), after _setup(workdir):
actual_workdir = workdir
override = self.scenario.setup.get("workdir_override")
if override:
    actual_workdir = (workdir / override).resolve()
```

Then pass `actual_workdir` to `_run_session` instead of `workdir`.
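
The override resolves relative to the original workdir, so a value like `../existing-worktree` lands beside the cloned repo rather than inside it. A standalone sketch of the same resolution (the paths are illustrative):

```python
from pathlib import Path

# Original workdir and the scenario's workdir_override value (illustrative paths)
workdir = Path("/srv/drill-run/repo")
override = "../existing-worktree"

# Same resolution as the Engine.run() snippet: join, then normalize the ".." away
actual_workdir = (workdir / override).resolve()
```

If the resolved path must never escape the scenario sandbox, a containment check against the run root would be a sensible addition here.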

- [ ] **Step 7: Commit**

```bash
git add scenarios/ setup_helpers/
git commit -m "feat: four PRI-974 worktree scenarios with setup helpers"
```

---

### Task 11: End-to-End Smoke Test

**Files:**
- Create: `tests/test_e2e.py`

This test uses a mock backend that runs `bash` instead of a real agent, to verify the full pipeline works without needing API keys or agent CLIs installed.

- [ ] **Step 1: Write the smoke test**

```python
# tests/test_e2e.py
"""End-to-end smoke test using a mock 'bash' backend."""

import shutil
from pathlib import Path

import pytest

from drill.engine import Engine, ScenarioConfig


@pytest.fixture
def mock_scenario(tmp_path):
    scenario = tmp_path / "test-scenario.yaml"
    scenario.write_text("""
scenario: e2e-smoke-test
description: "Smoke test"
user_posture: naive
setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
turns:
  - intent: "List files in the current directory"
limits:
  max_turns: 3
  turn_timeout: 10
verify:
  criteria:
    - "Agent listed the files"
  observe: true
""")
    return scenario


@pytest.fixture
def mock_backend(tmp_path):
    backend_dir = tmp_path / "backends"
    backend_dir.mkdir()
    (backend_dir / "mock.yaml").write_text("""
name: mock
cli: bash
args: []
required_env: []
hooks:
  pre_run: []
  post_run: []
  shutdown: "exit"
idle:
  quiescence_seconds: 1
  ready_pattern: "\\\\$"
  startup_timeout: 5
terminal:
  cols: 80
  rows: 24
session_logs:
  pattern: ""
""")
    return backend_dir


class TestE2ESmoke:
    def test_scenario_config_loads(self, mock_scenario):
        config = ScenarioConfig.from_yaml(mock_scenario)
        assert config.scenario == "e2e-smoke-test"

    def test_engine_setup_works(self, mock_scenario, mock_backend):
        """Verify setup phase works without live LLM calls."""
        fixtures_dir = Path(__file__).parent.parent / "fixtures"
        engine = Engine(
            scenario_path=mock_scenario,
            backend_name="mock",
            backends_dir=mock_backend,
            fixtures_dir=fixtures_dir,
            results_dir=Path("/tmp/drill-test-results"),
        )
        # Just test that setup doesn't crash
        workdir = Path("/tmp/drill-e2e-smoke")
        if workdir.exists():
            shutil.rmtree(workdir)
        engine._setup(workdir)
        assert (workdir / "package.json").exists()
        # Cleanup
        shutil.rmtree(workdir, ignore_errors=True)
```

- [ ] **Step 2: Run the smoke test**

Run: `cd /Users/drewritter/prime-rad/drill && pytest tests/test_e2e.py -v`
Expected: All tests PASS

- [ ] **Step 3: Commit**

```bash
git add tests/test_e2e.py
git commit -m "test: end-to-end smoke test with mock backend"
```

---

### Task 12: Final Integration — First Real Run

This is a manual integration task, not TDD. It validates the full pipeline against a real agent.

- [ ] **Step 1: Set environment variables**

```bash
export SUPERPOWERS_ROOT=/Users/drewritter/prime-rad/superpowers
export ANTHROPIC_API_KEY=<your-key>
```

- [ ] **Step 2: Install drill**

```bash
cd /Users/drewritter/prime-rad/drill
pip install -e ".[dev]"
```

- [ ] **Step 3: Run the simplest scenario against Claude Code**

```bash
drill run worktree-creation-from-main --backend claude
```

Expected: The harness should:
1. Clone the template repo
2. Launch Claude Code in a tmux session
3. Actor types a message asking to create a worktree
4. Agent responds and (hopefully) creates a worktree
5. Session ends, logs collected
6. Verifier evaluates and prints results

- [ ] **Step 4: Inspect the results**

```bash
ls results/worktree-creation-from-main/claude/
cat results/worktree-creation-from-main/claude/*/verdict.json | python -m json.tool
cat results/worktree-creation-from-main/claude/*/session.log
```

- [ ] **Step 5: Tune idle detection if needed**

If the actor fires too early or too late, adjust `quiescence_seconds` and `ready_pattern` in `backends/claude.yaml`.

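
`quiescence_seconds` presumably drives a polling loop that captures the tmux pane until its contents stop changing for that long. A rough sketch of such a loop, to clarify what the knob trades off (function and parameter names are illustrative, not the actual `backend.py` API):

```python
import time
from typing import Callable


def wait_for_idle(
    capture: Callable[[], str],
    quiescence_seconds: float = 2.0,
    poll_interval: float = 0.25,
    timeout: float = 120.0,
) -> bool:
    """Return True once capture() output is unchanged for quiescence_seconds."""
    deadline = time.monotonic() + timeout
    last = capture()
    stable_since = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        current = capture()
        if current != last:
            # Pane changed: restart the quiescence clock
            last, stable_since = current, time.monotonic()
        elif time.monotonic() - stable_since >= quiescence_seconds:
            return True
    # Agent never settled within the turn timeout
    return False
```

Raising `quiescence_seconds` makes the actor wait out spinners and streamed output; lowering it speeds turns up but risks interrupting an agent mid-response.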
- [ ] **Step 6: Run against Codex**

```bash
export OPENAI_API_KEY=<your-key>
drill run worktree-creation-from-main --backend codex
```

- [ ] **Step 7: Compare**

```bash
drill compare worktree-creation-from-main
```

- [ ] **Step 8: Commit any tuning changes**

```bash
git add backends/
git commit -m "tune: idle detection patterns from first real runs"
```