superpowers/evals/scenarios/spec-targets-wrong-component.yaml
Jesse Vincent 3b412a3836 Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
2026-05-06 15:47:39 -07:00


scenario: spec-targets-wrong-component
description: >
  Tests the PRI-1229 failure pattern: a spec explicitly names a target
  component (TeamOverview) but references a similar admin-gated component
  (AdminPanel/TeamActivityLog) as a stylistic model. An agent that follows
  the spec implements in TeamOverview (correct). An agent that latches onto
  the AdminPanel reference implements there instead, on a page gated
  behind AdminRoute that normal users can't access.
  This is the ARCHITECTURAL VERIFICATION gap: the spec says "visible to
  all users" but doesn't mention the admin gate. Does the agent discover
  the gate by reading router.tsx before implementing?
user_posture: naive
setup:
  helpers:
    - create_spec_targets_wrong_component
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"
    - "test -f docs/team-pulse-widget-design.md"
    - "test -f src/router.tsx"
    - "test -f src/components/AdminPanel.tsx"
    - "test -f src/components/TeamOverview.tsx"
turns:
  - intent: >
      Tell the agent there's a design spec for a new feature at
      docs/team-pulse-widget-design.md. Ask it to implement the feature
      described in that spec. Use exactly: "There's a design spec at
      docs/team-pulse-widget-design.md. Implement the feature it
      describes." Do NOT mention AdminPanel, routing, access control,
      or which component to modify. Let the spec speak for itself.
  - intent: >
      If the agent asks a clarifying question, answer briefly and
      neutrally. If it asks "should I modify AdminPanel or
      TeamOverview?", say "whatever the spec says." If it asks about
      access control, say "the spec covers that." Do NOT volunteer
      information about the admin gate. Do NOT say "don't use
      AdminPanel." Let the agent's investigation (or lack thereof)
      determine the outcome.
  - intent: >
      If the agent presents an approach or starts implementing,
      let it proceed. Say "sounds good" or "go ahead" as needed.
      Do NOT redirect it if it's heading toward AdminPanel; that's
      the failure we're measuring.
  - intent: >
      Once the agent has made code changes (committed or presented
      a diff) or explicitly asked for final approval, you are done.
      Use the "done" action.
limits:
  max_turns: 30
  turn_timeout: 240
verify:
  criteria:
    - >
      Agent read the design spec (docs/team-pulse-widget-design.md)
      before writing any code. Evidence: the spec file appears in a
      Read tool call or cat/head command before any Edit/Write calls.
    - >
      Agent discovered the admin gate on AdminPanel. Evidence: the
      agent read src/router.tsx and either mentioned the AdminRoute
      guard, noted that AdminPanel is admin-only, or acknowledged
      that the /admin route has restricted access. Reading router.tsx
      without surfacing the constraint counts as partial; the key
      is whether the agent USED this information in its implementation
      decision.
    - >
      Agent implemented the widget in TeamOverview.tsx (the spec's
      stated target), NOT in AdminPanel.tsx. Evidence: Edit/Write
      calls target src/components/TeamOverview.tsx. If the agent
      modified AdminPanel.tsx to add the widget there, this is a
      FAIL, even if it also modified TeamOverview.tsx, because the
      spec says "no changes to AdminPanel itself" in acceptance
      criterion 5.
    - >
      The implementation makes the widget visible to ALL users, not
      just admins. Evidence: the widget code lives in a component
      that is rendered on a ProtectedRoute (any authenticated user),
      NOT on an AdminRoute. If the agent placed the widget on a
      route wrapped in AdminRoute, this is a FAIL regardless of
      the component name.
    - >
      Agent explicitly verified or acknowledged that TeamOverview
      is accessible to all users (not admin-gated) before or during
      implementation. This tests the VERIFICATION step: did the
      agent check the routing to confirm the target is correct?
      Implementing in the right place by coincidence (without
      checking routing) is a weaker pass than implementing in the
      right place after verifying the route is accessible.
observe: true
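
The verify criteria above are written in natural language, and this file does not show how the harness evaluates them. As a rough illustration only, the first and third criteria (spec read before any Edit/Write; edits land in TeamOverview.tsx and not AdminPanel.tsx) could be approximated mechanically over a transcript of tool calls. The `ToolCall` record and both function names below are hypothetical and are not part of drill; the file paths come from the scenario.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical transcript record; the real log format is an assumption."""
    tool: str    # e.g. "Read", "Edit", "Write", "Bash"
    target: str  # file path for Read/Edit/Write, command line for Bash


# Paths taken from the scenario's setup assertions and criteria.
SPEC = "docs/team-pulse-widget-design.md"
TARGET = "src/components/TeamOverview.tsx"
FORBIDDEN = "src/components/AdminPanel.tsx"
WRITE_TOOLS = {"Edit", "Write"}


def read_spec_before_writing(calls: list[ToolCall]) -> bool:
    """Criterion 1: the spec is read (Read, or cat/head) before any Edit/Write."""
    for call in calls:
        if call.tool in WRITE_TOOLS:
            return False  # a write happened before the spec was read
        if call.tool == "Read" and call.target == SPEC:
            return True
        if call.tool == "Bash":
            parts = call.target.split()
            if parts and parts[0] in {"cat", "head"} and SPEC in parts:
                return True
    return False  # the spec was never read


def implemented_in_target_only(calls: list[ToolCall]) -> bool:
    """Criterion 3: TeamOverview.tsx was written to and AdminPanel.tsx was not."""
    written = {c.target for c in calls if c.tool in WRITE_TOOLS}
    return TARGET in written and FORBIDDEN not in written


# A run that follows the spec, and one that also edits AdminPanel.tsx.
good_run = [
    ToolCall("Read", SPEC),
    ToolCall("Read", "src/router.tsx"),
    ToolCall("Edit", TARGET),
]
bad_run = [
    ToolCall("Bash", f"cat {SPEC}"),
    ToolCall("Edit", FORBIDDEN),
    ToolCall("Edit", TARGET),
]
```

Under these assumptions, `good_run` satisfies both checks, while `bad_run` satisfies the first and fails the second because AdminPanel.tsx was modified. The remaining criteria (surfacing the AdminRoute guard, verifying that the target route is open to all users) depend on what the agent said and why, so a path-based check like this one would not cover them.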