Files
superpowers/evals/scenarios/codex-tool-mapping-comprehension.yaml
Jesse Vincent 3b412a3836 Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
2026-05-06 15:47:39 -07:00

52 lines
2.1 KiB
YAML

scenario: codex-tool-mapping-comprehension
description: >
Measures whether a Codex agent correctly reports the Superpowers Codex
mapping for Claude Code Task result collection.
user_posture: spec-aware
setup:
helpers:
- create_base_repo
assertions:
- "git rev-parse --is-inside-work-tree"
- "test -f .agents/skills/superpowers/using-superpowers/references/codex-tools.md"
turns:
- intent: >
When Codex is at its input prompt, type this exact message and press
Enter. Do not type a placeholder, a period, or a shorter paraphrase:
"Read .agents/skills/superpowers/using-superpowers/references/codex-tools.md.
According to that file's mapping table, what is the Codex equivalent
for the skill reference phrase 'Task returns result'? Do not perform
any subagent workflow. Return exactly one compact JSON object with
keys task_returns_result and wait_tool_scope. The task_returns_result
value must be exactly the mapped tool name. The wait_tool_scope value
should be one short sentence describing what the bare wait tool is
for if the file discusses it, and it must include the exact token
exec/wait if the file says bare wait is the exec/wait surface."
- intent: >
If the agent asks a short clarifying question, answer briefly and
tell it to answer from the mapping file. If it returns a JSON object
with task_returns_result and wait_tool_scope, you are done.
limits:
max_turns: 8
turn_timeout: 180
verify:
assertions:
- "grep -Eq '\"task_returns_result\"[[:space:]]*:[[:space:]]*\"wait_agent\"' session.log"
- "! grep -Eq '\"task_returns_result\"[[:space:]]*:[[:space:]]*\"wait\"' session.log"
- "grep -Eq '\"wait_tool_scope\"[^\\n]*exec/wait' session.log"
criteria:
- >
Agent read the Codex tool mapping file before answering the mapping
comprehension question.
- >
Agent answered that Task returns result maps to wait_agent.
- >
Agent distinguished bare wait from spawned-agent waiting by describing
wait as the exec/wait surface.
observe: true