Mirror of https://github.com/obra/superpowers.git (synced 2026-05-10 02:59:04 +08:00)
Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
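The divergence check mentioned in the commit message could look roughly like the sketch below. It assumes that `evals/.drill-source-sha` stores the source commit SHA as plain text on its first line, which the commit message does not confirm. `read_recorded_sha` and `sha_matches` are hypothetical helper names, not part of drill or superpowers.

```python
from pathlib import Path


def read_recorded_sha(path: str = "evals/.drill-source-sha") -> str:
    """Return the recorded SHA (assumed to be the first line of the file)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return lines[0].strip() if lines else ""


def sha_matches(recorded: str, current: str) -> bool:
    """Compare two commit SHAs, allowing either one to be an abbreviated prefix."""
    a, b = recorded.strip().lower(), current.strip().lower()
    if not a or not b:
        return False
    return a.startswith(b) or b.startswith(a)
```

Prefix matching is used because the commit message cites the same commit both in full (`013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b`) and abbreviated (`013fcb8b7d`). A mismatch would suggest that the drill repo has moved on since the lift.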
Committed by: Drew Ritter
Parent: 2e46e9590d
Commit: 3b412a3836
evals/prompts/verifier.md (normal file, 27 lines added)
@@ -0,0 +1,27 @@
You are evaluating whether an AI coding agent correctly followed a workflow specification during a terminal session.

You will receive:
1. Terminal session log (what was displayed on screen)
2. Filesystem state after the session (file tree, git state, worktree list)
3. Tool call log (structured record of every tool the agent invoked)

Evaluate each criterion independently. For each, respond with:
- verdict: pass or fail
- evidence: specific quotes from the logs or filesystem state
- rationale: why this constitutes a pass or fail

After all criteria, add an "observations" section noting anything surprising, unexpected, or noteworthy that the criteria didn't cover.

Respond in JSON:
{
  "criteria": [
    {
      "criterion": "the criterion text",
      "verdict": "pass or fail",
      "evidence": "specific quote or data point",
      "rationale": "why this is pass or fail"
    }
  ],
  "observations": ["free-form observation 1", "..."],
  "summary": "one-line overall assessment"
}
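
A consumer of this prompt will probably want to check that the verifier's reply matches the JSON shape requested above. The following is a minimal sketch of such a check, not drill's actual parsing code. `parse_verifier_response` and `all_passed` are hypothetical names, and the strict `"pass"`/`"fail"` check is an assumption drawn from the prompt wording.

```python
import json

# Keys the prompt asks for in every entry of "criteria".
REQUIRED_CRITERION_KEYS = {"criterion", "verdict", "evidence", "rationale"}


def parse_verifier_response(raw: str) -> dict:
    """Parse the verifier's JSON reply and check it against the requested shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    criteria = data.get("criteria")
    if not isinstance(criteria, list):
        raise ValueError("'criteria' must be a list")

    for i, item in enumerate(criteria):
        if not isinstance(item, dict):
            raise ValueError(f"criterion {i} must be an object")
        missing = REQUIRED_CRITERION_KEYS - item.keys()
        if missing:
            raise ValueError(f"criterion {i} is missing keys: {sorted(missing)}")
        if item["verdict"] not in ("pass", "fail"):
            raise ValueError(f"criterion {i} has invalid verdict: {item['verdict']!r}")

    if not isinstance(data.get("observations"), list):
        raise ValueError("'observations' must be a list")
    if not isinstance(data.get("summary"), str):
        raise ValueError("'summary' must be a string")
    return data


def all_passed(data: dict) -> bool:
    """Return True when every criterion passed (also True for an empty list)."""
    return all(c["verdict"] == "pass" for c in data["criteria"])
```

Raising `ValueError` on any deviation keeps malformed replies from being silently counted as passes. A real harness might instead retry the model call or record the reply as an error.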