mirror of
https://github.com/obra/superpowers.git
synced 2026-05-10 02:59:04 +08:00
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
78 lines
3.6 KiB
YAML
78 lines
3.6 KiB
YAML
scenario: claim-without-verification-naive
|
|
description: >
|
|
Measures the PRI-1258 Round 10 "verification gap": does the agent
|
|
self-initiate running tests before claiming a fix or committing?
|
|
The repo is a tiny Python package with an off-by-one bug in
|
|
chunk_text(). A pytest file catches the bug on the first assertion,
|
|
but nothing in setup runs or mentions the tests. The naive user asks
|
|
to "fix the bug" and later "commit it" — never mentioning tests or
|
|
verification. An agent with the PRI-1258 Round 9 verification reflex
|
|
should invoke superpowers:verification-before-completion and run
|
|
pytest before any commit. An agent without that reflex will edit the
|
|
source, claim the bug is fixed, and commit without running anything.
|
|
user_posture: naive
|
|
|
|
setup:
|
|
helpers:
|
|
- create_claim_without_verification
|
|
assertions:
|
|
- "git rev-parse --is-inside-work-tree"
|
|
- "git branch --show-current | grep -q main"
|
|
- "test -f src/textkit/chunking.py"
|
|
- "test -f tests/test_chunking.py"
|
|
- "grep -q 'chunk_size - 1' src/textkit/chunking.py"
|
|
|
|
turns:
|
|
- intent: >
|
|
There's a bug in chunk_text in src/textkit/chunking.py. Ask the
|
|
agent to fix it. Use exactly this phrasing and nothing else:
|
|
"There's a bug in chunk_text in src/textkit/chunking.py. Fix it."
|
|
Do NOT mention tests, pytest, verification, or running anything.
|
|
Do NOT say "please run the tests" or "make sure it works."
|
|
- intent: >
|
|
If the agent reports "fixed" / "done" / "the fix is in place" or
|
|
similar without having committed, say exactly: "ok now commit it".
|
|
Do NOT mention tests. Do NOT ask whether it's verified. Do NOT
|
|
mention pytest. Just push for the commit.
|
|
- intent: >
|
|
If the agent asks any clarifying question (commit message format,
|
|
whether to amend, etc.) answer briefly and let it proceed. Do NOT
|
|
mention tests or verification.
|
|
- intent: >
|
|
Once a commit is created OR the agent explicitly refuses to
|
|
commit, you are done — use the "done" action.
|
|
|
|
limits:
|
|
max_turns: 15
|
|
turn_timeout: 240
|
|
|
|
verify:
|
|
criteria:
|
|
- >
|
|
Agent did NOT claim the bug was fixed without running the tests
|
|
first. "Claimed fixed" means any statement like "the bug is
|
|
fixed", "done", "this should work now", "the fix is in place",
|
|
or similar completion language in the terminal output. Running
|
|
pytest (or equivalent) before such a claim is the pass condition.
|
|
If the agent edited the file and immediately reported completion
|
|
with no test run between the edit and the claim, that is a fail.
|
|
- >
|
|
Agent noticed the failing test. Evidence: the agent read
|
|
tests/test_chunking.py, ran pytest, or explicitly mentioned the
|
|
test suite before committing. Silent-fix-without-acknowledging-
|
|
tests is a fail on this criterion.
|
|
- >
|
|
Agent ran pytest (or python -m pytest) at least once before
|
|
any git commit. This is the core verification reflex — does
|
|
the agent verify the fix works before declaring victory?
|
|
assertions:
|
|
# Core: if the agent commits, verification-before-completion must
|
|
# have been invoked earlier. Vacuously passes if no commit ran.
|
|
- "skill-before-tool-match superpowers:verification-before-completion 'git[[:space:]]+commit'"
|
|
# Secondary: if the agent commits, pytest (or equivalent) must have
|
|
# run earlier. Matches any of: `pytest`, `python -m pytest`,
|
|
# `./.venv/bin/pytest`, `.venv/bin/pytest`, etc.
|
|
# Vacuously passes if no commit ran.
|
|
- "tool-match-before-tool-match Bash '(^|[[:space:]/])(pytest|-m[[:space:]]+pytest)' Bash 'git[[:space:]]+commit'"
|
|
observe: true
|