mirror of
https://github.com/obra/superpowers.git
synced 2026-05-09 02:29:05 +08:00
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
77 lines
3.3 KiB
YAML
77 lines
3.3 KiB
YAML
scenario: spec-reviewer-catches-planted-flaws
|
|
description: >
|
|
Lifted from superpowers/tests/claude-code/test-document-review-system.sh.
|
|
The setup plants a deliberately incomplete spec at
|
|
docs/superpowers/specs/test-feature-design.md with three classes of
|
|
flaws the brainstorming skill's spec-document-reviewer is meant to
|
|
catch: a literal TODO in Requirements, a "specified later" deferral
|
|
in Architecture, and a vague non-actionable Testing Strategy section.
|
|
|
|
Spec-aware user prompt: explicitly invoke the brainstorming skill's
|
|
spec-document-reviewer template (matching the bash test's explicitness).
|
|
The dispatched reviewer subagent must catch the flaws and refuse to
|
|
approve the spec.
|
|
user_posture: spec-aware
|
|
|
|
setup:
|
|
helpers:
|
|
- create_base_repo
|
|
- add_flawed_spec_for_review
|
|
assertions:
|
|
- "git rev-parse --is-inside-work-tree"
|
|
- "git branch --show-current | grep -q main"
|
|
- "test -f docs/superpowers/specs/test-feature-design.md"
|
|
- "grep -q 'TODO: Add more requirements here' docs/superpowers/specs/test-feature-design.md"
|
|
- "grep -q 'specified later' docs/superpowers/specs/test-feature-design.md"
|
|
|
|
turns:
|
|
- intent: >
|
|
You drafted a spec at docs/superpowers/specs/test-feature-design.md
|
|
and want it reviewed for completeness before you start planning the
|
|
implementation. You are spec-aware: ask the agent to use the
|
|
brainstorming skill's spec-document-reviewer template to evaluate
|
|
it. Use phrasing like:
|
|
|
|
"I drafted a spec at docs/superpowers/specs/test-feature-design.md.
|
|
Use the spec-document-reviewer template from the
|
|
superpowers:brainstorming skill (skills/brainstorming/spec-document-reviewer-prompt.md)
|
|
to dispatch a reviewer subagent and evaluate the spec for completeness,
|
|
consistency, clarity, scope, and YAGNI. Report back what it found."
|
|
|
|
Do NOT name the planted flaws (TODO, "specified later", vague
|
|
testing). Do NOT volunteer hints about completeness. The reviewer
|
|
subagent should discover them.
|
|
- intent: >
|
|
Once the agent has produced a review (status + issues + any
|
|
recommendations), you are done — use the "done" action. If the
|
|
agent says "Approved" without issues, that is also a complete
|
|
review (and a fail of the criteria below — but the run itself
|
|
is complete).
|
|
|
|
limits:
|
|
max_turns: 6
|
|
turn_timeout: 300
|
|
|
|
verify:
|
|
assertions:
|
|
- "tool-called Agent"
|
|
criteria:
|
|
- >
|
|
Reviewer flagged the literal TODO in the Requirements section
|
|
("TODO: Add more requirements here"). Evidence: the review
|
|
output names the TODO, the Requirements section, or both, and
|
|
lists it as an issue (not just an observation).
|
|
- >
|
|
Reviewer flagged the "specified later" deferral in the
|
|
Architecture section. Either explicit naming of "specified later"
|
|
/ "deferred" / "later" / "TBD" wording, or a finding that error
|
|
handling is incomplete / unspecified, counts as catching it.
|
|
- >
|
|
Reviewer's status is "Issues Found" (or equivalent — "needs
|
|
revision", "blocking", "do not proceed", "not ready for
|
|
planning"). A reviewer that returns Approved without issues
|
|
fails this criterion. A reviewer that returns Approved while
|
|
*also* listing issues elsewhere also fails — the verdict has
|
|
to match the findings.
|
|
observe: true
|