Mirror of https://github.com/obra/superpowers.git (synced 2026-05-12 03:59:03 +08:00)
Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
93
evals/scenarios/spec-targets-wrong-component.yaml
Normal file
@@ -0,0 +1,93 @@
scenario: spec-targets-wrong-component
description: >
  Tests the PRI-1229 failure pattern: a spec explicitly names a target
  component (TeamOverview) but references a similar admin-gated component
  (AdminPanel/TeamActivityLog) as a stylistic model. An agent that follows
  the spec implements in TeamOverview (correct). An agent that latches onto
  the AdminPanel reference implements there instead — on a page gated
  behind AdminRoute that normal users can't access.

  This is the ARCHITECTURAL VERIFICATION gap: the spec says "visible to
  all users" but doesn't mention the admin gate. Does the agent discover
  the gate by reading router.tsx before implementing?
user_posture: naive

setup:
  helpers:
    - create_spec_targets_wrong_component
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"
    - "test -f docs/team-pulse-widget-design.md"
    - "test -f src/router.tsx"
    - "test -f src/components/AdminPanel.tsx"
    - "test -f src/components/TeamOverview.tsx"

turns:
  - intent: >
      Tell the agent there's a design spec for a new feature at
      docs/team-pulse-widget-design.md. Ask it to implement the feature
      described in that spec. Use exactly: "There's a design spec at
      docs/team-pulse-widget-design.md. Implement the feature it
      describes." Do NOT mention AdminPanel, routing, access control,
      or which component to modify. Let the spec speak for itself.
  - intent: >
      If the agent asks a clarifying question, answer briefly and
      neutrally. If it asks "should I modify AdminPanel or
      TeamOverview?", say "whatever the spec says." If it asks about
      access control, say "the spec covers that." Do NOT volunteer
      information about the admin gate. Do NOT say "don't use
      AdminPanel." Let the agent's investigation (or lack thereof)
      determine the outcome.
  - intent: >
      If the agent presents an approach or starts implementing,
      let it proceed. Say "sounds good" or "go ahead" as needed.
      Do NOT redirect it if it's heading toward AdminPanel — that's
      the failure we're measuring.
  - intent: >
      Once the agent has made code changes (committed or presented
      a diff) or explicitly asked for final approval, you are done.
      Use the "done" action.

limits:
  max_turns: 30
  turn_timeout: 240

verify:
  criteria:
    - >
      Agent read the design spec (docs/team-pulse-widget-design.md)
      before writing any code. Evidence: the spec file appears in a
      Read tool call or cat/head command before any Edit/Write calls.
    - >
      Agent discovered the admin gate on AdminPanel. Evidence: the
      agent read src/router.tsx and either mentioned the AdminRoute
      guard, noted that AdminPanel is admin-only, or acknowledged
      that the /admin route has restricted access. Reading router.tsx
      without surfacing the constraint counts as partial — the key
      is whether the agent USED this information in its implementation
      decision.
    - >
      Agent implemented the widget in TeamOverview.tsx (the spec's
      stated target), NOT in AdminPanel.tsx. Evidence: Edit/Write
      calls target src/components/TeamOverview.tsx. If the agent
      modified AdminPanel.tsx to add the widget there, this is a
      FAIL — even if it also modified TeamOverview.tsx — because the
      spec says "no changes to AdminPanel itself" in acceptance
      criterion 5.
    - >
      The implementation makes the widget visible to ALL users, not
      just admins. Evidence: the widget code lives in a component
      that is rendered on a ProtectedRoute (any authenticated user),
      NOT on an AdminRoute. If the agent placed the widget on a
      route wrapped in AdminRoute, this is a FAIL regardless of
      the component name.
    - >
      Agent explicitly verified or acknowledged that TeamOverview
      is accessible to all users (not admin-gated) before or during
      implementation. This tests the VERIFICATION step — did the
      agent check the routing to confirm the target is correct?
      Implementing in the right place by coincidence (without
      checking routing) is a weaker pass than implementing in the
      right place after verifying the route is accessible.
  observe: true
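The `setup.assertions` entries in the scenario above are shell one-liners that presumably must all exit with status 0 before the first turn. How the harness actually runs them is not shown in this commit. The following is only a minimal sketch, assuming each string is passed to the shell from the fixture repository's root; the function name `failed_assertions` and the `cwd` parameter are hypothetical.

```python
import subprocess


def failed_assertions(commands: list[str], cwd: str = ".") -> list[str]:
    """Run each command through the shell; return the ones that exit non-zero.

    Hypothetical helper: not part of the harness shown in this commit.
    """
    failed = []
    for cmd in commands:
        # shell=True so pipelines like "git branch ... | grep main" work as written.
        result = subprocess.run(
            cmd,
            shell=True,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append(cmd)
    return failed
```

An empty return value would mean the fixture is in the expected state, for example `failed_assertions(["test -f src/router.tsx"], cwd=repo_root)` where `repo_root` is wherever the setup helper created the fixture.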
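The verify criteria are written in natural language, and this commit does not show how they are scored. As a rough illustration of the evidence that criteria 1 and 3 describe, the sketch below assumes a transcript has already been reduced to an ordered list of tool calls. The `ToolCall` record and both function names are hypothetical, and the sketch only recognizes `Read` calls, not the `cat`/`head` commands that criterion 1 also accepts.

```python
from dataclasses import dataclass

SPEC = "docs/team-pulse-widget-design.md"
TARGET = "src/components/TeamOverview.tsx"
FORBIDDEN = "src/components/AdminPanel.tsx"
WRITE_TOOLS = {"Edit", "Write"}


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical, simplified record of one tool call in a transcript."""
    tool: str  # e.g. "Read", "Edit", "Write"
    path: str  # file the call touched


def read_spec_before_edits(calls: list[ToolCall]) -> bool:
    """Criterion 1: the spec is read before the first Edit/Write call."""
    for call in calls:
        if call.tool in WRITE_TOOLS:
            return False  # code was written before the spec was read
        if call.tool == "Read" and call.path == SPEC:
            return True
    return False  # the spec was never read


def edited_correct_component(calls: list[ToolCall]) -> bool:
    """Criterion 3: TeamOverview.tsx was edited and AdminPanel.tsx was not."""
    edited = {c.path for c in calls if c.tool in WRITE_TOOLS}
    return TARGET in edited and FORBIDDEN not in edited
```

Checks like these cover only the mechanical part of the criteria. Whether the agent actually used the routing information in its decision (criteria 2 and 5) depends on what it said, which a path-based check cannot capture.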