Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
Author: Jesse Vincent
Date: 2026-05-06 12:15:46 -07:00
Committed by: Drew Ritter
Parent: 2e46e9590d
Commit: 3b412a3836
124 changed files with 13806 additions and 0 deletions

evals/prompts/verifier.md (new file, 27 lines)

@@ -0,0 +1,27 @@
You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.
You will receive:
1. Terminal session log (what was displayed on screen)
2. Filesystem state after the session (file tree, git state, worktree list)
3. Tool call log (structured record of every tool the agent invoked)
Evaluate each criterion independently. For each, respond with:
- verdict: pass or fail
- evidence: specific quotes from the logs or filesystem state
- rationale: why this constitutes a pass or fail
After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.
Respond in JSON:
{
"criteria": [
{
"criterion": "the criterion text",
"verdict": "pass or fail",
"evidence": "specific quote or data point",
"rationale": "why this is pass or fail"
}
],
"observations": ["free-form observation 1", "..."],
"summary": "one-line overall assessment"
}
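A consuming harness would need to parse and validate replies against this schema. As a minimal sketch (the `validate_verifier_response` helper is hypothetical, not part of this repo; field names are taken from the schema above):

```python
import json

# Keys every entry in "criteria" must carry, per the schema above.
REQUIRED_CRITERION_KEYS = {"criterion", "verdict", "evidence", "rationale"}

def validate_verifier_response(raw: str) -> dict:
    """Parse the verifier's JSON reply and check it matches the expected shape."""
    data = json.loads(raw)
    # Top-level structure: criteria list, observations list, one-line summary.
    for key in ("criteria", "observations", "summary"):
        if key not in data:
            raise ValueError(f"missing top-level key: {key}")
    for i, item in enumerate(data["criteria"]):
        missing = REQUIRED_CRITERION_KEYS - item.keys()
        if missing:
            raise ValueError(f"criteria[{i}] missing keys: {missing}")
        if item["verdict"] not in ("pass", "fail"):
            raise ValueError(f"criteria[{i}] has invalid verdict: {item['verdict']!r}")
    return data

# Example: tally passing criteria from a (toy) validated response.
resp = validate_verifier_response(
    '{"criteria": [{"criterion": "used a worktree", "verdict": "pass", '
    '"evidence": "git worktree add ...", "rationale": "worktree created"}], '
    '"observations": [], "summary": "all criteria passed"}'
)
passed = sum(1 for c in resp["criteria"] if c["verdict"] == "pass")
```

This keeps validation strict on verdicts (only "pass" or "fail") while leaving evidence and rationale free-form, matching the prompt's instructions.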