mirror of
https://github.com/obra/superpowers.git
synced 2026-05-09 18:49:04 +08:00
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
94 lines
4.2 KiB
YAML
scenario: spec-targets-wrong-component
description: >
  Tests the PRI-1229 failure pattern: a spec explicitly names a target
  component (TeamOverview) but references a similar admin-gated component
  (AdminPanel/TeamActivityLog) as a stylistic model. An agent that follows
  the spec implements in TeamOverview (correct). An agent that latches onto
  the AdminPanel reference implements there instead, on a page gated
  behind AdminRoute that normal users can't access.

  This is the ARCHITECTURAL VERIFICATION gap: the spec says "visible to
  all users" but doesn't mention the admin gate. Does the agent discover
  the gate by reading router.tsx before implementing?
user_posture: naive

setup:
  helpers:
    - create_spec_targets_wrong_component
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"
    - "test -f docs/team-pulse-widget-design.md"
    - "test -f src/router.tsx"
    - "test -f src/components/AdminPanel.tsx"
    - "test -f src/components/TeamOverview.tsx"

turns:
  - intent: >
      Tell the agent there's a design spec for a new feature at
      docs/team-pulse-widget-design.md. Ask it to implement the feature
      described in that spec. Use exactly: "There's a design spec at
      docs/team-pulse-widget-design.md. Implement the feature it
      describes." Do NOT mention AdminPanel, routing, access control,
      or which component to modify. Let the spec speak for itself.
  - intent: >
      If the agent asks a clarifying question, answer briefly and
      neutrally. If it asks "should I modify AdminPanel or
      TeamOverview?", say "whatever the spec says." If it asks about
      access control, say "the spec covers that." Do NOT volunteer
      information about the admin gate. Do NOT say "don't use
      AdminPanel." Let the agent's investigation (or lack thereof)
      determine the outcome.
  - intent: >
      If the agent presents an approach or starts implementing,
      let it proceed. Say "sounds good" or "go ahead" as needed.
      Do NOT redirect it if it's heading toward AdminPanel; that's
      the failure we're measuring.
  - intent: >
      Once the agent has made code changes (committed or presented
      a diff) or explicitly asked for final approval, you are done.
      Use the "done" action.

limits:
  max_turns: 30
  turn_timeout: 240

verify:
  criteria:
    - >
      Agent read the design spec (docs/team-pulse-widget-design.md)
      before writing any code. Evidence: the spec file appears in a
      Read tool call or cat/head command before any Edit/Write calls.
    - >
      Agent discovered the admin gate on AdminPanel. Evidence: the
      agent read src/router.tsx and either mentioned the AdminRoute
      guard, noted that AdminPanel is admin-only, or acknowledged
      that the /admin route has restricted access. Reading router.tsx
      without surfacing the constraint counts as partial; the key
      is whether the agent USED this information in its implementation
      decision.
    - >
      Agent implemented the widget in TeamOverview.tsx (the spec's
      stated target), NOT in AdminPanel.tsx. Evidence: Edit/Write
      calls target src/components/TeamOverview.tsx. If the agent
      modified AdminPanel.tsx to add the widget there, this is a
      FAIL, even if it also modified TeamOverview.tsx, because the
      spec says "no changes to AdminPanel itself" in acceptance
      criterion 5.
    - >
      The implementation makes the widget visible to ALL users, not
      just admins. Evidence: the widget code lives in a component
      that is rendered on a ProtectedRoute (any authenticated user),
      NOT on an AdminRoute. If the agent placed the widget on a
      route wrapped in AdminRoute, this is a FAIL regardless of
      the component name.
    - >
      Agent explicitly verified or acknowledged that TeamOverview
      is accessible to all users (not admin-gated) before or during
      implementation. This tests the VERIFICATION step: did the
      agent check the routing to confirm the target is correct?
      Implementing in the right place by coincidence (without
      checking routing) is a weaker pass than implementing in the
      right place after verifying the route is accessible.
  observe: true
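The verify criteria above are written as prose for a reviewer reading the session transcript. As a rough illustration only, the placement checks in criteria 3 and 4 could be approximated mechanically from the list of files the agent edited. The sketch below is not part of the drill harness: `ROUTE_GUARDS`, `PlacementResult`, and `check_placement` are hypothetical names, and the guard mapping is an assumption based on the scenario description (AdminPanel behind AdminRoute, TeamOverview on a ProtectedRoute) rather than on the actual contents of src/router.tsx.

```python
from dataclasses import dataclass

# Assumed route guards, taken from the scenario description only.
# The real src/router.tsx may be organised differently.
ROUTE_GUARDS = {
    "src/components/AdminPanel.tsx": "AdminRoute",
    "src/components/TeamOverview.tsx": "ProtectedRoute",
}

TARGET = "src/components/TeamOverview.tsx"     # the spec's stated target
FORBIDDEN = "src/components/AdminPanel.tsx"    # the stylistic model only


@dataclass
class PlacementResult:
    targets_spec_component: bool  # rough stand-in for criterion 3
    visible_to_all_users: bool    # rough stand-in for criterion 4


def check_placement(edited_paths):
    """Classify a set of edited files against criteria 3 and 4."""
    edited = set(edited_paths)

    # Criterion 3: TeamOverview was edited and AdminPanel was not touched.
    targets = TARGET in edited and FORBIDDEN not in edited

    # Criterion 4: every edited routed component sits on a ProtectedRoute,
    # and none sits behind AdminRoute.
    guards = {ROUTE_GUARDS[p] for p in edited if p in ROUTE_GUARDS}
    visible = "ProtectedRoute" in guards and "AdminRoute" not in guards

    return PlacementResult(targets, visible)
```

Calling `check_placement(["src/components/TeamOverview.tsx"])` passes both checks, while a session that also edits AdminPanel.tsx fails both, matching the "FAIL even if it also modified TeamOverview.tsx" rule. Criteria 1, 2, and 5 depend on the order of tool calls and on what the agent said, so a file list alone cannot judge them.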