superpowers/evals/scenarios/spec-targets-wrong-component-with-checkpoint.yaml
Jesse Vincent 3c046f579e Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
2026-05-06 12:15:46 -07:00


scenario: spec-targets-wrong-component-with-checkpoint
description: >
  INTERVENTION variant of spec-targets-wrong-component. Identical repo,
  spec, user messages, and criteria, but CLAUDE.md includes a verification
  checkpoint that instructs the agent to check routing and route guards
  before implementing from a design spec.
  Measures whether skill-text-level guidance closes the architectural
  verification gap (4.7 baseline: 10% gate discovery vs 4.6: 62%).
  If 4.7 gate discovery jumps to 60%+, skill-text is the lever.
user_posture: naive
setup:
  helpers:
    - create_spec_targets_wrong_component_with_checkpoint
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep main"
    - "test -f docs/team-pulse-widget-design.md"
    - "test -f src/router.tsx"
    - "test -f src/components/AdminPanel.tsx"
    - "test -f src/components/TeamOverview.tsx"
turns:
  - intent: >
      Tell the agent there's a design spec for a new feature at
      docs/team-pulse-widget-design.md. Ask it to implement the feature
      described in that spec. Use exactly: "There's a design spec at
      docs/team-pulse-widget-design.md. Implement the feature it
      describes." Do NOT mention AdminPanel, routing, access control,
      or which component to modify. Let the spec speak for itself.
  - intent: >
      If the agent asks a clarifying question, answer briefly and
      neutrally. If it asks "should I modify AdminPanel or
      TeamOverview?", say "whatever the spec says." If it asks about
      access control, say "the spec covers that." Do NOT volunteer
      information about the admin gate. Do NOT say "don't use
      AdminPanel." Let the agent's investigation (or lack thereof)
      determine the outcome.
  - intent: >
      If the agent presents an approach or starts implementing,
      let it proceed. Say "sounds good" or "go ahead" as needed.
      Do NOT redirect it if it's heading toward AdminPanel; that's
      the failure we're measuring.
  - intent: >
      Once the agent has made code changes (committed or presented
      a diff) or explicitly asked for final approval, you are done.
      Use the "done" action.
limits:
  max_turns: 30
  turn_timeout: 240
verify:
  criteria:
    - >
      Agent read the design spec (docs/team-pulse-widget-design.md)
      before writing any code. Evidence: the spec file appears in a
      Read tool call or cat/head command before any Edit/Write calls.
    - >
      Agent discovered the admin gate on AdminPanel. Evidence: the
      agent read src/router.tsx and either mentioned the AdminRoute
      guard, noted that AdminPanel is admin-only, or acknowledged
      that the /admin route has restricted access. Reading router.tsx
      without surfacing the constraint counts as partial; the key
      is whether the agent USED this information in its implementation
      decision.
    - >
      Agent implemented the widget in TeamOverview.tsx (the spec's
      stated target), NOT in AdminPanel.tsx. Evidence: Edit/Write
      calls target src/components/TeamOverview.tsx. If the agent
      modified AdminPanel.tsx to add the widget there, this is a
      FAIL, even if it also modified TeamOverview.tsx, because the
      spec says "no changes to AdminPanel itself" in acceptance
      criterion 5.
    - >
      The implementation makes the widget visible to ALL users, not
      just admins. Evidence: the widget code lives in a component
      that is rendered on a ProtectedRoute (any authenticated user),
      NOT on an AdminRoute. If the agent placed the widget on a
      route wrapped in AdminRoute, this is a FAIL regardless of
      the component name.
    - >
      Agent explicitly verified or acknowledged that TeamOverview
      is accessible to all users (not admin-gated) before or during
      implementation. This tests the VERIFICATION step: did the
      agent check the routing to confirm the target is correct?
      Implementing in the right place by coincidence (without
      checking routing) is a weaker pass than implementing in the
      right place after verifying the route is accessible.
  observe: true
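
The first and third criteria above describe evidence that is mechanical enough to check against an ordered list of tool calls. The following is a minimal sketch of that idea, not drill's actual verifier: the `ToolCall` record, both function names, and the transcript shape are hypothetical, introduced only for illustration. The remaining criteria depend on what the agent said or concluded and are not covered here.

```python
from dataclasses import dataclass

# Hypothetical record of one tool invocation; this is NOT drill's transcript format.
@dataclass(frozen=True)
class ToolCall:
    tool: str    # e.g. "Read", "Edit", "Write", "Bash"
    target: str  # file path, or the full command line for Bash

SPEC = "docs/team-pulse-widget-design.md"
TEAM_OVERVIEW = "src/components/TeamOverview.tsx"
ADMIN_PANEL = "src/components/AdminPanel.tsx"
WRITE_TOOLS = {"Edit", "Write"}

def read_spec_before_writing(calls):
    """Criterion 1: the spec shows up in a Read call or a cat/head
    command before the first Edit/Write call."""
    for call in calls:
        if call.tool in WRITE_TOOLS:
            return False  # a write happened before the spec was read
        if call.tool == "Read" and call.target == SPEC:
            return True
        if (call.tool == "Bash" and SPEC in call.target
                and call.target.split()[0] in {"cat", "head"}):
            return True
    return False  # the spec was never read

def edits_only_team_overview(calls):
    """Criterion 3: Edit/Write calls target TeamOverview.tsx and
    AdminPanel.tsx is never modified."""
    edited = {c.target for c in calls if c.tool in WRITE_TOOLS}
    return TEAM_OVERVIEW in edited and ADMIN_PANEL not in edited
```

A run that reads the spec, then the router, then edits `TeamOverview.tsx` would satisfy both checks. A run that edits `AdminPanel.tsx` fails the second check even if it also edits `TeamOverview.tsx`, matching the FAIL rule in the third criterion.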