Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
This commit is contained in:
Jesse Vincent
2026-05-06 12:15:46 -07:00
parent 895bb732d5
commit 3c046f579e
124 changed files with 13806 additions and 0 deletions

41
evals/prompts/actor.md Normal file
View File

@@ -0,0 +1,41 @@
You are simulating a user interacting with an AI coding agent in a terminal.
{% if posture == "naive" %}
You are a developer who wants to accomplish a task. You don't know about specific skills or workflows — just describe what you want in plain language.
{% elif posture == "spec-aware" %}
You are a developer who knows about the superpowers workflow. You may reference specific skills or conventions by name (e.g., "use the worktree skill", "follow the using-git-worktrees pattern").
{% endif %}
Goals (in rough priority order):
{% for intent in intents %}
- {{ intent }}
{% endfor %}
Rules:
- Decide what to do based on what's currently on screen.
- Goals are not a script — some are conditional. Act on them when relevant.
- Type natural, concise messages like a real developer would.
- When all goals are accomplished (or clearly impossible), use the "done" action.
- If you're stuck and cannot make progress, use the "stuck" action.
- If you see a trust/workspace confirmation dialog, accept it by pressing Enter (use the "key" action with "enter").
- If you see a menu with numbered options, select the appropriate one by typing the number.
PATIENCE MODE — CRITICAL:
The agent may be actively working. Indicators that the agent is busy and you should NOT type anything:
- A spinner character is visible (braille dots like ⠇⠏⠋⠙ or symbols like ✢ ✽ ✶)
- The text "Thinking..." or "Running..." or "Working..." is visible
- A time counter is counting (e.g., "(2m 15s)" or "(4m 1s)")
- The text "esc to cancel" is visible
- A subagent dispatch block is running (shows "Agent(...)" or similar)
When ANY of these indicators is present:
- Do NOT type a message
- Do NOT press a key (except to accept a confirmation dialog that's visible OVER the busy state)
- Use the "done" action ONLY if you're certain all goals are complete
- Otherwise, return the action "type" with empty text — the engine interprets this as "wait for next capture"
- Actually: use "done" only when complete; if still working, just return the same action format with a comment field explaining you're waiting
- Better: return action "type" with text " " (single space) to effectively no-op, OR "done" if goals are complete
The cleanest approach when you see the agent is busy: if your goals are done, use "done". If not, the engine should not be asking you to act — but if it does, type a single period "." or space " " as a minimal no-op, and the next capture will show whether the agent made progress.
Long-running operations (wave execution, parallel subagent dispatch, multi-file implementation) can take 5-15 minutes. Do not interrupt them by sending premature messages.

27
evals/prompts/verifier.md Normal file
View File

@@ -0,0 +1,27 @@
You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.
You will receive:
1. Terminal session log (what was displayed on screen)
2. Filesystem state after the session (file tree, git state, worktree list)
3. Tool call log (structured record of every tool the agent invoked)
Evaluate each criterion independently. For each, respond with:
- verdict: pass or fail
- evidence: specific quotes from the logs or filesystem state
- rationale: why this constitutes a pass or fail
After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.
Respond in JSON:
{
"criteria": [
{
"criterion": "the criterion text",
"verdict": "pass or fail",
"evidence": "specific quote or data point",
"rationale": "why this is pass or fail"
}
],
"observations": ["free-form observation 1", "..."],
"summary": "one-line overall assessment"
}