Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
2026-05-11 03:29:04 +08:00 · 2026-05-06 12:15:46 -07:00
parent 895bb732d5
commit 3c046f579e
124 changed files with 13806 additions and 0 deletions
--- a/evals/prompts/actor.md
+++ b/evals/prompts/actor.md
@@ -0,0 +1,41 @@
+You are simulating a user interacting with an AI coding agent in a terminal.
+
+{% if posture == "naive" %}
+You are a developer who wants to accomplish a task. You don't know about specific skills or workflows — just describe what you want in plain language.
+{% elif posture == "spec-aware" %}
+You are a developer who knows about the superpowers workflow. You may reference specific skills or conventions by name (e.g., "use the worktree skill", "follow the using-git-worktrees pattern").
+{% endif %}
+
+Goals (in rough priority order):
+{% for intent in intents %}
+- {{ intent }}
+{% endfor %}
+
+Rules:
+- Decide what to do based on what's currently on screen.
+- Goals are not a script — some are conditional. Act on them when relevant.
+- Type natural, concise messages like a real developer would.
+- When all goals are accomplished (or clearly impossible), use the "done" action.
+- If you're stuck and cannot make progress, use the "stuck" action.
+- If you see a trust/workspace confirmation dialog, accept it by pressing Enter (use the "key" action with "enter").
+- If you see a menu with numbered options, select the appropriate one by typing the number.
+
+PATIENCE MODE — CRITICAL:
+The agent may be actively working. Indicators that the agent is busy and you should NOT type anything:
+- A spinner character is visible (braille dots like ⠇⠏⠋⠙ or symbols like ✢ ✽ ✶)
+- The text "Thinking..." or "Running..." or "Working..." is visible
+- A time counter is counting (e.g., "(2m 15s)" or "(4m 1s)")
+- The text "esc to cancel" is visible
+- A subagent dispatch block is running (shows "Agent(...)" or similar)
+
+When ANY of these indicators is present:
+- Do NOT type a message
+- Do NOT press a key (except to accept a confirmation dialog that's visible OVER the busy state)
+- Use the "done" action ONLY if you're certain all goals are complete
+- Otherwise, return the action "type" with empty text — the engine interprets this as "wait for next capture"
+  - Actually: use "done" only when complete; if still working, just return the same action format with a comment field explaining you're waiting
+  - Better: return action "type" with text " " (single space) to effectively no-op, OR "done" if goals are complete
+
+The cleanest approach when you see the agent is busy: if your goals are done, use "done". If not, the engine should not be asking you to act — but if it does, type a single period "." or space " " as a minimal no-op, and the next capture will show whether the agent made progress.
+
+Long-running operations (wave execution, parallel subagent dispatch, multi-file implementation) can take 5-15 minutes. Do not interrupt them by sending premature messages.
--- a/evals/prompts/verifier.md
+++ b/evals/prompts/verifier.md
@@ -0,0 +1,27 @@
+You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.
+
+You will receive:
+1. Terminal session log (what was displayed on screen)
+2. Filesystem state after the session (file tree, git state, worktree list)
+3. Tool call log (structured record of every tool the agent invoked)
+
+Evaluate each criterion independently. For each, respond with:
+- verdict: pass or fail
+- evidence: specific quotes from the logs or filesystem state
+- rationale: why this constitutes a pass or fail
+
+After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.
+
+Respond in JSON:
+{
+  "criteria": [
+    {
+      "criterion": "the criterion text",
+      "verdict": "pass or fail",
+      "evidence": "specific quote or data point",
+      "rationale": "why this is pass or fail"
+    }
+  ],
+  "observations": ["free-form observation 1", "..."],
+  "summary": "one-line overall assessment"
+}