mirror of https://github.com/obra/superpowers.git synced 2026-06-24 03:29:04 +08:00

Files

Jesse Vincent fd5b53cb85 evals: drop SUPERPOWERS_ROOT from codex/gemini required_env

These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's
os.environ access, which the new cli.py default helper supplies
automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env
because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.

2026-05-06 15:47:39 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| backends | evals: drop SUPERPOWERS_ROOT from codex/gemini required_env | 2026-05-06 15:47:39 -07:00 |
| bin | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| docs | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| drill | evals: default SUPERPOWERS_ROOT to parent of evals/ if unset | 2026-05-06 15:47:39 -07:00 |
| fixtures | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| prompts | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| scenarios | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| setup_helpers | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| tests | evals: default SUPERPOWERS_ROOT to parent of evals/ if unset | 2026-05-06 15:47:39 -07:00 |
| .drill-source-sha | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| .gitignore | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| CLAUDE.md | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| lefthook.yml | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| pyproject.toml | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| README.md | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |
| uv.lock | Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b | 2026-05-06 15:47:39 -07:00 |

README.md

Drill

Drill is a skill-compliance benchmark for superpowers. It drives AI coding agents through tmux sessions and evaluates whether they follow superpowers workflows correctly.

How it works

1. Setup: a helper creates a git repo with specific conditions (worktree state, plan files, code fixtures).
2. Actor: a Sonnet 4.6 LLM plays the user, following the turn intents from the scenario YAML.
3. Agent: the backend under test (Claude Code, Codex, or Gemini CLI) runs in a real tmux session.
4. Verifier: a Sonnet 4.6 LLM evaluates the session transcript and filesystem against the scenario's criteria.
5. Assertions: deterministic checks (tool-called, tool-count, shell commands) run after the session.
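
The deterministic assertions in the last step can be pictured as simple counts over recorded tool calls. The sketch below is only an illustration, not the code in assertions.py or bin/: the event format (a list of dicts with a "tool" key) and the function names tool_count and tool_called are assumptions made for the example.

```python
from collections import Counter

# Hypothetical transcript format: one dict per tool invocation.
Event = dict[str, str]


def tool_count(events: list[Event], tool: str) -> int:
    """Return how many times `tool` appears in the recorded events."""
    return Counter(event["tool"] for event in events)[tool]


def tool_called(events: list[Event], tool: str) -> bool:
    """Return True if `tool` was invoked at least once."""
    return tool_count(events, tool) > 0


# Illustrative events; a real session transcript may look quite different.
events = [
    {"tool": "Bash", "input": "git worktree list"},
    {"tool": "Read", "input": "plan.md"},
    {"tool": "Bash", "input": "git status"},
]
```

Because checks like these involve no LLM judgment, they give the same answer on every run of the same transcript, which is what distinguishes them from the verifier step.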

Setup

uv sync --dev

Environment:

export SUPERPOWERS_ROOT=/path/to/superpowers  # optional override; defaults to the parent of evals/ if unset
export ANTHROPIC_API_KEY=sk-...
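
The commit notes in the file listing say that cli.py now supplies a default SUPERPOWERS_ROOT (the parent of evals/) when the variable is unset. A minimal sketch of that kind of defaulting, assuming a hypothetical helper name default_superpowers_root and an explicit environment mapping (the real helper in cli.py may differ), might look like:

```python
from collections.abc import MutableMapping
from pathlib import Path


def default_superpowers_root(evals_dir: Path, env: MutableMapping[str, str]) -> str:
    """Return SUPERPOWERS_ROOT from `env`, setting it to the parent of
    `evals_dir` first if the variable is not present.

    Hypothetical helper for illustration; not the function in cli.py.
    """
    # setdefault only writes when the key is missing, so an explicit
    # export from the user always wins over the computed default.
    return env.setdefault("SUPERPOWERS_ROOT", str(evals_dir.parent))
```

In a real CLI entry point the mapping would presumably be os.environ, so that later reads of the variable see the defaulted value.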

Usage

# Run a single scenario on a single backend
uv run drill run worktree-creation-from-main -b claude

# Run with N repetitions
uv run drill run pattern-match-trap -b claude-opus-4-6 --n 5

# Sweep across multiple backends
uv run drill run pattern-match-trap --models claude-opus-4-6,claude-opus-4-7 --n 10

# Compare results
uv run drill compare pattern-match-trap

# List available scenarios
uv run drill list

Scenarios

| Category | Scenarios | Tests |
| --- | --- | --- |
| Worktree | 8 scenarios (creation, detection, consent, detached HEAD) | Skill compliance for `using-git-worktrees` |
| Wave decomposition | 5 scenarios (naive, spec-aware, false overlap, dependency chain, conflict surface) | Plan → waves decomposition quality |
| Wave execution | 3 scenarios (minimal, full, task failure) | End-to-end wave execution + failure escalation |
| Pattern-match trap | 1 scenario | Investigation depth gap between 4.6 and 4.7 (PRI-1270) |

Backends

| Backend | CLI | Model |
| --- | --- | --- |
| `claude` | Claude Code | opus-4-7 (default) |
| `claude-opus-4-6` | Claude Code | opus-4-6 |
| `claude-opus-4-7` | Claude Code | opus-4-7 |
| `claude-opus-4-6-1m` | Claude Code | opus-4-6 (1M context) |
| `claude-opus-4-7-1m` | Claude Code | opus-4-7 (1M context) |
| `codex` | Codex CLI | (not specified) |
| `gemini` | Gemini CLI | (not specified) |

Project structure

drill/              # Core engine
  cli.py            # Click CLI (run, compare, list)
  engine.py         # Tmux session orchestration
  actor.py          # User-simulator LLM
  verifier.py       # Criteria evaluator LLM
  assertions.py     # Deterministic post-session assertions
  compare.py        # Result loading and cross-backend comparison
  sweep.py          # Multi-backend N-rep orchestrator
  stats.py          # Wilson score confidence intervals
scenarios/          # YAML scenario definitions
setup_helpers/      # Repo fixture creators
backends/           # Per-backend YAML configs
bin/                # Assertion helper scripts (tool-called, tool-count, etc.)
prompts/            # Actor and verifier system prompts
fixtures/           # Static template repos
tests/              # pytest suite (122 tests)
docs/               # Design spec and manual testing guide
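
stats.py is described above as computing Wilson score confidence intervals. As a rough illustration of that technique (not the module's actual code; the function name wilson_interval and the default z = 1.96 for an approximate 95% interval are assumptions), a pass rate over N repetitions could be bounded like this:

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # Centre of the interval is pulled toward 0.5 for small samples.
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

For example, wilson_interval(7, 10) gives roughly 0.40 to 0.89, which shows how wide the uncertainty remains after only ten repetitions of a scenario.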

Tests

uv run pytest
uv run ruff check
uv run ty check

Writing a new scenario

1. Create a setup helper in setup_helpers/ if you need a custom fixture.
2. Register it in setup_helpers/__init__.py.
3. Create scenarios/your-scenario.yaml with setup, turns, limits, and verify sections.
4. Run it: uv run drill run your-scenario -b claude
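
Before running a new scenario, it can help to confirm that the file has the four sections named above. The sketch below works on an already-parsed scenario (a plain dict) and is only an illustration: the function name missing_sections and the example values are hypothetical, and treating all four sections as mandatory is an assumption rather than something Drill is known to enforce.

```python
# Section names taken from the steps above; everything else is illustrative.
EXPECTED_SECTIONS = ("setup", "turns", "limits", "verify")


def missing_sections(scenario: dict) -> list[str]:
    """Return the expected top-level sections that `scenario` lacks, in order."""
    return [name for name in EXPECTED_SECTIONS if name not in scenario]


# Hypothetical draft scenario as it might look after YAML parsing.
draft = {
    "setup": {"helper": "my_custom_fixture"},
    "turns": [{"intent": "ask the agent to start work on the plan"}],
}

for name in missing_sections(draft):
    print(f"scenario is missing section: {name}")
```

The actual schema for each section is defined by the existing files in scenarios/ and the design spec, so those remain the reference for what goes inside them.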

See docs/design.md for the full design spec.