superpowers/evals/prompts/verifier.md at dcffaa087a2df8f513b3b56f8304e8002ea8879d

mirror of https://github.com/obra/superpowers.git synced 2026-05-10 02:59:04 +08:00


Jesse Vincent 3c046f579e Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.

2026-05-06 12:15:46 -07:00

994 B


You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.

You will receive:

- Terminal session log (what was displayed on screen)
- Filesystem state after the session (file tree, git state, worktree list)
- Tool call log (structured record of every tool the agent invoked)

Evaluate each criterion independently. For each, respond with:

- verdict: pass or fail
- evidence: specific quotes from the logs or filesystem state
- rationale: why this constitutes a pass or fail

After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.

Respond in JSON:

    {
      "criteria": [
        {
          "criterion": "the criterion text",
          "verdict": "pass or fail",
          "evidence": "specific quote or data point",
          "rationale": "why this is pass or fail"
        }
      ],
      "observations": ["free-form observation 1", "..."],
      "summary": "one-line overall assessment"
    }
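A consumer of this prompt will likely want to check that a reply actually matches the JSON shape requested above before trusting its verdicts. The following is a minimal sketch of such a check, not the harness's actual implementation: the function name `validate_verifier_response` and the constants are hypothetical, and it assumes the reply arrives as a bare JSON string with no surrounding prose or code fence.

```python
import json

# Keys every entry in "criteria" is expected to carry, per the prompt above.
REQUIRED_CRITERION_KEYS = {"criterion", "verdict", "evidence", "rationale"}
ALLOWED_VERDICTS = {"pass", "fail"}


def validate_verifier_response(raw: str) -> dict:
    """Parse a verifier reply and check it against the requested shape.

    Hypothetical helper: raises ValueError on the first problem found,
    otherwise returns the parsed dictionary.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")

    criteria = data.get("criteria")
    if not isinstance(criteria, list):
        raise ValueError("'criteria' must be a list")

    for index, item in enumerate(criteria):
        if not isinstance(item, dict):
            raise ValueError(f"criteria[{index}] must be an object")
        missing = REQUIRED_CRITERION_KEYS - item.keys()
        if missing:
            raise ValueError(f"criteria[{index}] is missing {sorted(missing)}")
        if item["verdict"] not in ALLOWED_VERDICTS:
            raise ValueError(
                f"criteria[{index}] has verdict {item['verdict']!r}; "
                "expected 'pass' or 'fail'"
            )

    observations = data.get("observations")
    if not isinstance(observations, list) or not all(
        isinstance(entry, str) for entry in observations
    ):
        raise ValueError("'observations' must be a list of strings")

    if not isinstance(data.get("summary"), str):
        raise ValueError("'summary' must be a string")

    return data
```

Rejecting any verdict other than the literal strings `pass` and `fail` is a deliberate choice in this sketch: it keeps downstream scoring from having to interpret hedged answers such as "partial".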
