superpowers/evals/scenarios/code-review-catches-planted-bugs.yaml

scenario: code-review-catches-planted-bugs
description: >
  Lifted from superpowers/tests/claude-code/test-requesting-code-review.sh.
  The setup repo has two commits where the second plants three real
  security bugs (SQL injection, identity-function password hash, credential
  logging). The naive user asks for a review before merge. The
  superpowers:requesting-code-review skill should fire, dispatch a
  reviewer subagent, and the resulting review must flag the planted bugs
  at Critical/Important severity and refuse to approve the diff.

  This is the cross-harness, semantically-judged replacement for the bash
  test that landed in PR #1299. Drill's verifier reads the reviewer's
  actual report in context instead of regex-grepping for keywords.
user_posture: spec-aware

setup:
  helpers:
    - create_code_review_planted_bugs
  assertions:
    - "git rev-parse --is-inside-work-tree"
    - "git branch --show-current | grep -q main"
    - "git log --oneline | wc -l | tr -d ' ' | grep -q '^2$'"
    - "test -f src/db.js"
    # Bug fingerprints — the helper plants string-concatenation in the SQL
    # and an identity-function hash. If either is missing, the run is
    # measuring something other than what the scenario claims.
    - "grep -q '+ email +' src/db.js"
    - "grep -qE 'function hash\\(s\\) \\{[[:space:]]*return s' src/db.js"

turns:
  - intent: >
      You just committed a refactor and want a thorough pre-merge review.
      You are spec-aware: you know the superpowers workflow and want the
      agent to use the requesting-code-review skill (which dispatches a
      reviewer subagent with isolated context). Use phrasing like:

      "I just finished a refactor of our user-lookup code and added a
      login function. The change is the most recent commit on the only
      branch (main has just two commits). Use the
      superpowers:requesting-code-review skill to review the diff before
      I merge — dispatch the reviewer subagent with the template and
      report back what it found."

      Do NOT mention SQL injection, hashing, credentials, or any specific
      bug. Do NOT pre-emptively suggest severities. The skill is named
      because we want the subagent path; the bugs themselves are for the
      reviewer to discover.
  - intent: >
      If the agent asks a short clarifying question (e.g., "what's the
      base SHA?" or "anything specific to focus on?"), answer briefly.
      Sample answers:
        - "The diff is just `git diff HEAD~1..HEAD` — the most recent commit."
        - "Just give it a normal pre-merge review."
      Do NOT volunteer information about the planted bugs or what to
      look for.
  - intent: >
      Once the agent has produced a review (with findings, severity,
      and a verdict), you are done — use the "done" action. If the
      agent says "looks good, ready to merge", that is also a complete
      review (and a fail of the criteria below — but the run itself is
      complete).

limits:
  max_turns: 8
  turn_timeout: 300

verify:
  assertions:
    - "skill-called superpowers:requesting-code-review"
    - "tool-called Agent"
  criteria:
    - >
      Reviewer flagged the SQL injection in findUserByEmail. The query
      switched from a parameterized statement to string concatenation
      with the user-supplied email. The reviewer must call this out
      at Critical or Important severity (not Minor / Nice-to-have).
    - >
      Reviewer flagged the credential-handling problem. Either of these
      counts as catching it: (a) `hash()` is an identity function so
      passwords are stored / compared in plaintext; (b) `console.log`
      writes the password_hash to stdout on every successful login.
      The reviewer must call out at least one at Critical or Important
      severity.
    - >
      Reviewer did not approve the diff for merge. A correct reviewer
      either says "do not merge", "with fixes", "blocking", or
      explicitly refuses approval. A reviewer that says "looks good",
      "ready to merge", or "safe to ship" without qualification fails
      this criterion.
  observe: true