superpowers/evals/scenarios/triggering-dispatching-parallel-agents.yaml
Jesse Vincent 3c046f579e Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
2026-05-06 12:15:46 -07:00

scenario: triggering-dispatching-parallel-agents
description: >
  Lifted from superpowers/tests/skill-triggering/prompts/dispatching-parallel-agents.txt.
  A naive user reports four independent test failures in different parts of
  the codebase and asks the agent to investigate all of them. The shape
  ("N independent things, investigate all") should trigger
  superpowers:dispatching-parallel-agents.
user_posture: naive

setup:
  helpers:
    - create_base_repo
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep -q main"

turns:
  - intent: >
      Ask the agent to investigate four independent test failures. Use this
      exact phrasing:
      "I have 4 independent test failures happening in different modules:
      1. tests/auth/login.test.ts - 'should redirect after login' is failing
      2. tests/api/users.test.ts - 'should return user list' returns 500
      3. tests/components/Button.test.tsx - snapshot mismatch
      4. tests/utils/date.test.ts - timezone handling broken
      These are unrelated issues in different parts of the codebase. Can you
      investigate all of them?"
      Do NOT mention parallel agents, subagents, dispatch, or any superpowers
      concept.
  - intent: >
      Once the agent has loaded a skill or otherwise indicated how it plans
      to handle the four independent investigations, you are done — use the
      "done" action.

limits:
  max_turns: 4
  turn_timeout: 180

verify:
  assertions:
    - "skill-called superpowers:dispatching-parallel-agents"
  criteria:
    - >
      Agent loaded the superpowers:dispatching-parallel-agents skill in
      response to the four-independent-investigations request. Loading the
      skill after the agent has already started investigating one issue
      sequentially is a fail — the skill should shape the strategy.
observe: true
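
The ordering rule in the verify criterion (the skill must be loaded before the agent starts investigating any single failure) can be illustrated with a small sketch. This is not the drill harness's implementation. The `Event` record, its `kind` values, and the function name are hypothetical, and treating any tool call as "started investigating" is an assumption made only for illustration.

```python
from dataclasses import dataclass

# Skill name taken from the scenario's verify section.
SKILL = "superpowers:dispatching-parallel-agents"


@dataclass
class Event:
    """Hypothetical transcript event; not a real drill data structure."""
    kind: str  # assumed values: "skill" (skill load) or "tool" (any other tool call)
    name: str


def skill_shaped_strategy(events, skill=SKILL):
    """Return True only if `skill` is loaded before any tool call.

    Assumption: the first tool call marks the point where the agent has
    started investigating, so a skill load after it counts as a fail.
    """
    for event in events:
        if event.kind == "skill" and event.name == skill:
            return True
        if event.kind == "tool":
            return False
    # The skill was never loaded at all.
    return False


if __name__ == "__main__":
    early = [Event("skill", SKILL), Event("tool", "read_file")]
    late = [Event("tool", "read_file"), Event("skill", SKILL)]
    print(skill_shaped_strategy(early))
    print(skill_shaped_strategy(late))
```

Under these assumptions, a transcript that loads the skill first passes, while one that reads a file first and loads the skill afterwards fails, which mirrors the "skill should shape the strategy" wording of the criterion.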