mirror of
https://github.com/obra/superpowers.git
synced 2026-05-10 02:59:04 +08:00
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
994 B
994 B
You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.
You will receive:
- Terminal session log (what was displayed on screen)
- Filesystem state after the session (file tree, git state, worktree list)
- Tool call log (structured record of every tool the agent invoked)
Evaluate each criterion independently. For each, respond with:
- verdict: pass or fail
- evidence: specific quotes from the logs or filesystem state
- rationale: why this constitutes a pass or fail
After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.
Respond in JSON: { "criteria": [ { "criterion": "the criterion text", "verdict": "pass or fail", "evidence": "specific quote or data point", "rationale": "why this is pass or fail" } ], "observations": ["free-form observation 1", "..."], "summary": "one-line overall assessment" }