evals: use pre-commit hooks

evals: add Gemini 2.5 Flash backend
evals: drop drill source marker
2026-05-16 14:09:04 +08:00 · 2026-05-06 15:41:52 -07:00 · 2026-05-06 15:09:59 -07:00 · 2026-05-06 14:55:14 -07:00 · 2026-05-06 14:43:08 -07:00 · 2026-05-06 12:41:28 -07:00
11 changed files with 75 additions and 124 deletions
--- a/.cursor-plugin/plugin.json
+++ b/.cursor-plugin/plugin.json
@@ -19,5 +19,7 @@
    "workflows"
  ],
  "skills": "./skills/",
  "agents": "./agents/",
  "commands": "./commands/",
  "hooks": "./hooks/hooks-cursor.json"
 }
--- a/docs/superpowers/plans/2026-04-06-worktree-rototill.md
+++ b/docs/superpowers/plans/2026-04-06-worktree-rototill.md
@@ -275,16 +275,23 @@ If no native tool is available, create a worktree manually using git.
 Follow this priority order:
-1. **Check your instructions for a worktree directory preference.** If specified, use it without asking.
-2. **Check existing project-local directories:**
+1. **Check existing directories:**
   ```bash
   ls -d .worktrees 2>/dev/null     # Preferred (hidden)
   ls -d worktrees 2>/dev/null      # Alternative
   ```
   If found, use that directory. If both exist, `.worktrees` wins.
-3. **Default to `.worktrees/`.**
+2. **Check for existing global directory:**
+   ```bash
+   project=$(basename "$(git rev-parse --show-toplevel)")
+   ls -d ~/.config/superpowers/worktrees/$project 2>/dev/null
+   ```
+   If found, use it (backward compatibility with legacy global path).
+3. **Check your instructions for a worktree directory preference.** If specified, use it without asking.
+4. **Default to `.worktrees/`.**
 #### Safety Verification (project-local directories only)
@@ -298,11 +305,16 @@ git check-ignore -q .worktrees 2>/dev/null || git check-ignore -q worktrees 2>/d
 **Why critical:** Prevents accidentally committing worktree contents to repository.
+Global directories (`~/.config/superpowers/worktrees/`) need no verification.
 #### Create the Worktree
 ```bash
+project=$(basename "$(git rev-parse --show-toplevel)")
 # Determine path based on chosen location
-path="$LOCATION/$BRANCH_NAME"
+# For project-local: path="$LOCATION/$BRANCH_NAME"
+# For global: path="~/.config/superpowers/worktrees/$project/$BRANCH_NAME"
 git worktree add "$path" -b "$BRANCH_NAME"
 cd "$path"
@@ -375,6 +387,7 @@ Ready to implement <feature-name>
 | `worktrees/` exists | Use it (verify ignored) |
 | Both exist | Use `.worktrees/` |
 | Neither exists | Check instruction file, then default `.worktrees/` |
+| Global path exists | Use it (backward compat) |
 | Directory not ignored | Add to .gitignore + commit |
 | Permission error on create | Sandbox fallback, work in place |
 | Tests fail during baseline | Report failures + ask |
@@ -451,7 +464,7 @@ git commit -m "feat: rewrite using-git-worktrees with detect-and-defer (PRI-974)
 Step 0: GIT_DIR != GIT_COMMON detection (skip if already isolated)
 Step 0 consent: opt-in prompt before creating worktree (#991)
 Step 1a: native tool preference (short, first, declarative)
-Step 1b: git worktree fallback with project-local directory policy
+Step 1b: git worktree fallback with hooks symlink and legacy path compat
 Submodule guard prevents false detection
 Platform-neutral instruction file references (#1049)"
 ```
@@ -650,7 +663,7 @@ WORKTREE_PATH=$(git rev-parse --show-toplevel)
 **If `GIT_DIR == GIT_COMMON`:** Normal repo, no worktree to clean up. Done.
-**If worktree path is under `.worktrees/` or `worktrees/`:** Superpowers created this worktree — we own cleanup.
+**If worktree path is under `.worktrees/` or `~/.config/superpowers/worktrees/`:** Superpowers created this worktree — we own cleanup.
 ```bash
 MAIN_ROOT=$(git -C "$(git rev-parse --git-common-dir)/.." rev-parse --show-toplevel)
@@ -694,7 +707,7 @@ git worktree prune  # Self-healing: clean up any stale registrations
 **Cleaning up harness-owned worktrees**
 - **Problem:** Removing a worktree the harness created causes phantom state
- **Fix:** Only clean up worktrees under `.worktrees/` or `worktrees/`
+- **Fix:** Only clean up worktrees under `.worktrees/` or `~/.config/superpowers/worktrees/`
 **No confirmation for discard**
 - **Problem:** Accidentally delete work
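
A note on the `# For global:` comment in the hunks above: a `~` inside double quotes is not tilde-expanded by the shell, so `path="~/.config/..."` would name a literal `~` directory if copied as written. The sketch below shows one way to build both path forms, assuming `$HOME` stands in for `~`. The function `worktree_path` and its arguments are hypothetical names used for illustration; they are not part of the plan.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the plan): build the worktree path for a
# project-local or global location without relying on tilde expansion.
worktree_path() {
    local kind="$1"                    # "local" or "global"
    local project="$2"                 # e.g. basename of the repo root
    local branch="$3"                  # branch name for the new worktree
    local location="${4:-.worktrees}"  # project-local directory (assumed default)

    case "$kind" in
        global)
            # $HOME expands inside double quotes; "~" would not.
            printf '%s\n' "$HOME/.config/superpowers/worktrees/$project/$branch"
            ;;
        local)
            printf '%s\n' "$location/$branch"
            ;;
        *)
            echo "unknown kind: $kind" >&2
            return 1
            ;;
    esac
}
```

For example, `worktree_path global myproj feature-x` prints a path under `$HOME/.config/superpowers/worktrees/myproj/`, which can then be passed to `git worktree add`.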
--- a/docs/superpowers/specs/2026-04-06-worktree-rototill-design.md
+++ b/docs/superpowers/specs/2026-04-06-worktree-rototill-design.md
@@ -46,7 +46,7 @@ The skill describes the goal ("ensure work happens in an isolated workspace") an
 ### Provenance-based ownership
-Whoever creates the worktree owns its cleanup. If the harness created it, superpowers doesn't touch it. If superpowers created it (via git fallback), superpowers cleans it up. The heuristic: if the worktree lives under `.worktrees/` or `worktrees/`, superpowers owns it. Anything else (`.claude/worktrees/`, `~/.codex/worktrees/`, `.gemini/worktrees/`, or old user-global Superpowers paths) belongs to the harness or user and is left alone.
+Whoever creates the worktree owns its cleanup. If the harness created it, superpowers doesn't touch it. If superpowers created it (via git fallback), superpowers cleans it up. The heuristic: if the worktree lives under `.worktrees/` or `~/.config/superpowers/worktrees/`, superpowers owns it. Anything else (`.claude/worktrees/`, `~/.codex/worktrees/`, `.gemini/worktrees/`) belongs to the harness.
 ## Design
@@ -110,11 +110,12 @@ File splitting (Step 1b in a separate skill) was tested and proven unnecessary.
 When no native tool is available, create a worktree manually.
 **Directory selection** (priority order):
-1. Check the project's agent instruction file (CLAUDE.md, GEMINI.md, AGENTS.md, .cursorrules, or equivalent) for a worktree directory preference.
+1. Check for existing `.worktrees/` or `worktrees/` directory — if found, use it. If both exist, `.worktrees/` wins.
-2. Check for existing `.worktrees/` or `worktrees/` directory — if found, use it. If both exist, `.worktrees/` wins.
+2. Check for existing `~/.config/superpowers/worktrees/<project>/` directory — if found, use it (backward compatibility with legacy global path).
-3. Default to `.worktrees/`.
+3. Check the project's agent instruction file (CLAUDE.md, GEMINI.md, AGENTS.md, .cursorrules, or equivalent) for a worktree directory preference.
+4. Default to `.worktrees/`.
-No interactive directory selection prompt. Old user-global Superpowers worktree paths are not detected or offered; new manual worktrees are project-local unless the user explicitly specifies another location.
+No interactive directory selection prompt. The global path (`~/.config/superpowers/worktrees/`) is no longer offered as a choice to new users, but existing worktrees at that location are detected and used for backward compatibility.
 **Safety verification** (project-local directories only):
@@ -231,7 +232,7 @@ if GIT_DIR == GIT_COMMON:
    # Normal repo, no worktree to clean up
    done
-if worktree path is under .worktrees/ or worktrees/:
+if worktree path is under .worktrees/ or ~/.config/superpowers/worktrees/:
    # Superpowers created it — we own cleanup
    cd to main repo root       # Bug #238 fix
    git worktree remove <path>
@@ -317,7 +318,7 @@ As of 2026-04-06, Claude Code is the only harness with an agent-callable mid-ses
 ### Provenance heuristic
-The `.worktrees/` or `worktrees/` = ours, anything else = hands off` heuristic works for every current harness. If a future harness adopts one of those project-local directories as its convention, we'd have a false positive (superpowers tries to clean up a harness-owned worktree). Similarly, if a user manually runs `git worktree add .worktrees/experiment` without superpowers, we'd incorrectly claim ownership. Both are low risk — every harness uses branded paths, and manual `.worktrees/` creation is unlikely — but worth noting.
+The `.worktrees/` or `~/.config/superpowers/worktrees/` = ours, anything else = hands off` heuristic works for every current harness. If a future harness adopts `.worktrees/` as its convention, we'd have a false positive (superpowers tries to clean up a harness-owned worktree). Similarly, if a user manually runs `git worktree add .worktrees/experiment` without superpowers, we'd incorrectly claim ownership. Both are low risk — every harness uses branded paths, and manual `.worktrees/` creation is unlikely — but worth noting.
 ### Detached HEAD finishing
--- a/skills/finishing-a-development-branch/SKILL.md
+++ b/skills/finishing-a-development-branch/SKILL.md
@@ -180,7 +180,7 @@ WORKTREE_PATH=$(git rev-parse --show-toplevel)
 **If `GIT_DIR == GIT_COMMON`:** Normal repo, no worktree to clean up. Done.
-**If worktree path is under `.worktrees/` or `worktrees/`:** Superpowers created this worktree — we own cleanup.
+**If worktree path is under `.worktrees/`, `worktrees/`, or `~/.config/superpowers/worktrees/`:** Superpowers created this worktree — we own cleanup.
 ```bash
 MAIN_ROOT=$(git -C "$(git rev-parse --git-common-dir)/.." rev-parse --show-toplevel)
@@ -224,7 +224,7 @@ git worktree prune  # Self-healing: clean up any stale registrations
 **Cleaning up harness-owned worktrees**
 - **Problem:** Removing a worktree the harness created causes phantom state
- **Fix:** Only clean up worktrees under `.worktrees/` or `worktrees/`
+- **Fix:** Only clean up worktrees under `.worktrees/`, `worktrees/`, or `~/.config/superpowers/worktrees/`
 **No confirmation for discard**
 - **Problem:** Accidentally delete work
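
The ownership rule in the `+` lines above (clean up only worktrees under `.worktrees/`, `worktrees/`, or `~/.config/superpowers/worktrees/`) can be sketched as a path-prefix check. This is a minimal illustration, not the skill's implementation: `is_superpowers_owned` is a hypothetical name, and the sketch assumes the project-local directories sit directly under the main repository root.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the skill): succeed only when the worktree
# path lies under one of the locations the diff treats as Superpowers-owned.
is_superpowers_owned() {
    local worktree_path="$1"   # e.g. output of git rev-parse --show-toplevel
    local main_root="$2"       # root of the main checkout

    case "$worktree_path" in
        "$main_root"/.worktrees/*|"$main_root"/worktrees/*)
            return 0 ;;        # project-local directories
        "$HOME"/.config/superpowers/worktrees/*)
            return 0 ;;        # legacy global directory
        *)
            return 1 ;;        # harness- or user-owned: leave alone
    esac
}
```

A caller would run `git worktree remove` only when this check succeeds, and otherwise leave the worktree to its harness or user.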
--- a/skills/test-driven-development/SKILL.md
+++ b/skills/test-driven-development/SKILL.md
@@ -356,7 +356,7 @@ Never fix bugs without a test.
 ## Testing Anti-Patterns
-When adding mocks or test utilities, read [testing-anti-patterns.md](testing-anti-patterns.md) to avoid common pitfalls:
+When adding mocks or test utilities, read @testing-anti-patterns.md to avoid common pitfalls:
 - Testing mock behavior instead of real behavior
 - Adding test-only methods to production classes
 - Mocking without understanding dependencies
--- a/skills/using-git-worktrees/SKILL.md
+++ b/skills/using-git-worktrees/SKILL.md
@@ -30,7 +30,7 @@ BRANCH=$(git branch --show-current)
 git rev-parse --show-superproject-working-tree 2>/dev/null
 ```
-**If `GIT_DIR != GIT_COMMON` (and not a submodule):** You are already in a linked worktree. Skip to Step 2 (Project Setup). Do NOT create another worktree.
+**If `GIT_DIR != GIT_COMMON` (and not a submodule):** You are already in a linked worktree. Skip to Step 3 (Project Setup). Do NOT create another worktree.
 Report with branch state:
 - On a branch: "Already in isolated workspace at `<path>` on branch `<name>`."
@@ -42,7 +42,7 @@ Has the user already indicated their worktree preference in your instructions? I
 > "Would you like me to set up an isolated worktree? It protects your current branch from changes."
-Honor any existing declared preference without asking. If the user declines consent, work in place and skip to Step 2.
+Honor any existing declared preference without asking. If the user declines consent, work in place and skip to Step 3.
 ## Step 1: Create Isolated Workspace
@@ -50,7 +50,7 @@ Honor any existing declared preference without asking. If the user declines cons
 ### 1a. Native Worktree Tools (preferred)
-The user has asked for an isolated workspace (Step 0 consent). Do you already have a way to create a worktree? It might be a tool with a name like `EnterWorktree`, `WorktreeCreate`, a `/worktree` command, or a `--worktree` flag. If you do, use it and skip to Step 2.
+The user has asked for an isolated workspace (Step 0 consent). Do you already have a way to create a worktree? It might be a tool with a name like `EnterWorktree`, `WorktreeCreate`, a `/worktree` command, or a `--worktree` flag. If you do, use it and skip to Step 3.
 Native tools handle directory placement, branch creation, and cleanup automatically. Using `git worktree add` when you have a native tool creates phantom state your harness can't see or manage.
@@ -73,7 +73,14 @@ Follow this priority order. Explicit user preference always beats observed files
   ```
   If found, use it. If both exist, `.worktrees` wins.
-3. **If there is no other guidance available**, default to `.worktrees/` at the project root.
+3. **Check for an existing global directory:**
+   ```bash
+   project=$(basename "$(git rev-parse --show-toplevel)")
+   ls -d ~/.config/superpowers/worktrees/$project 2>/dev/null
+   ```
+   If found, use it (backward compatibility with legacy global path).
+4. **If there is no other guidance available**, default to `.worktrees/` at the project root.
 #### Safety Verification (project-local directories only)
@@ -87,11 +94,16 @@ git check-ignore -q .worktrees 2>/dev/null || git check-ignore -q worktrees 2>/d
 **Why critical:** Prevents accidentally committing worktree contents to repository.
+Global directories (`~/.config/superpowers/worktrees/`) need no verification.
 #### Create the Worktree
 ```bash
+project=$(basename "$(git rev-parse --show-toplevel)")
 # Determine path based on chosen location
-path="$LOCATION/$BRANCH_NAME"
+# For project-local: path="$LOCATION/$BRANCH_NAME"
+# For global: path="~/.config/superpowers/worktrees/$project/$BRANCH_NAME"
 git worktree add "$path" -b "$BRANCH_NAME"
 cd "$path"
@@ -99,7 +111,7 @@ cd "$path"
 **Sandbox fallback:** If `git worktree add` fails with a permission error (sandbox denial), tell the user the sandbox blocked worktree creation and you're working in the current directory instead. Then run setup and baseline tests in place.
-## Step 2: Project Setup
+## Step 3: Project Setup
 Auto-detect and run appropriate setup:
@@ -118,7 +130,7 @@ if [ -f pyproject.toml ]; then poetry install; fi
 if [ -f go.mod ]; then go mod download; fi
 ```
-## Step 3: Verify Clean Baseline
+## Step 4: Verify Clean Baseline
 Run tests to ensure workspace starts clean:
@@ -151,6 +163,7 @@ Ready to implement <feature-name>
 | `worktrees/` exists | Use it (verify ignored) |
 | Both exist | Use `.worktrees/` |
 | Neither exists | Check instruction file, then default `.worktrees/` |
+| Global path exists | Use it (backward compat) |
 | Directory not ignored | Add to .gitignore + commit |
 | Permission error on create | Sandbox fallback, work in place |
 | Tests fail during baseline | Report failures + ask |
@@ -176,7 +189,7 @@ Ready to implement <feature-name>
 ### Assuming directory location
 - **Problem:** Creates inconsistency, violates project conventions
- **Fix:** Follow priority: explicit instructions > existing project-local directory > default
+- **Fix:** Follow priority: existing > global legacy > instruction file > default
 ### Proceeding with failing tests
@@ -196,7 +209,7 @@ Ready to implement <feature-name>
 **Always:**
 - Run Step 0 detection first
 - Prefer native tools over git fallback
- Follow directory priority: explicit instructions > existing project-local directory > default
+- Follow directory priority: existing > global legacy > instruction file > default
 - Verify directory is ignored for project-local
 - Auto-detect and run project setup
 - Verify clean test baseline
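
The priority stated in the `+` lines above (existing > global legacy > instruction file > default) can be sketched as a single function. This is an illustration under assumptions: `choose_worktree_dir` and its arguments are hypothetical, and the instruction-file preference is passed in by the caller rather than parsed from any file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the skill): pick a worktree directory using
# the order "existing > global legacy > instruction file > default".
choose_worktree_dir() {
    local repo_root="$1"        # e.g. output of git rev-parse --show-toplevel
    local preference="${2:-}"   # directory named in the instruction file, if any
    local project
    project=$(basename "$repo_root")

    if [ -d "$repo_root/.worktrees" ]; then
        # 1. Existing project-local directory; .worktrees wins if both exist.
        printf '%s\n' "$repo_root/.worktrees"
    elif [ -d "$repo_root/worktrees" ]; then
        printf '%s\n' "$repo_root/worktrees"
    elif [ -d "$HOME/.config/superpowers/worktrees/$project" ]; then
        # 2. Existing legacy global directory.
        printf '%s\n' "$HOME/.config/superpowers/worktrees/$project"
    elif [ -n "$preference" ]; then
        # 3. Preference declared in the instruction file.
        printf '%s\n' "$preference"
    else
        # 4. Default.
        printf '%s\n' "$repo_root/.worktrees"
    fi
}
```

Only the project-local results would then go through the `git check-ignore` safety verification; the global path is outside the repository.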
--- a/skills/writing-skills/SKILL.md
+++ b/skills/writing-skills/SKILL.md
@@ -553,7 +553,7 @@ Run same scenarios WITH skill. Agent should now comply.
 Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
-**Testing methodology:** See [testing-skills-with-subagents.md](testing-skills-with-subagents.md) for the complete testing methodology:
+**Testing methodology:** See @testing-skills-with-subagents.md for the complete testing methodology:
 - How to write pressure scenarios
 - Pressure types (time, sunk cost, authority, exhaustion)
 - Plugging holes systematically
--- a/tests/claude-code/run-skill-tests.sh
+++ b/tests/claude-code/run-skill-tests.sh
@@ -25,7 +25,7 @@ fi
 # Parse command line arguments
 VERBOSE=false
 SPECIFIC_TEST=""
-TIMEOUT=600  # Default 10 minute timeout per test
+TIMEOUT=300  # Default 5 minute timeout per test
 RUN_INTEGRATION=false
 while [[ $# -gt 0 ]]; do
@@ -73,7 +73,6 @@ done
 # List of skill tests to run (fast unit tests)
 tests=(
-    "test-worktree-path-policy.sh"
    "test-subagent-driven-development.sh"
 )
--- a/tests/claude-code/test-helpers.sh
+++ b/tests/claude-code/test-helpers.sh
@@ -9,14 +9,14 @@ run_claude() {
    local allowed_tools="${3:-}"
    local output_file=$(mktemp)
-    # Build command as an argv array so timeout wraps claude directly.
+    # Build command
-    local cmd=(claude -p "$prompt")
+    local cmd="claude -p \"$prompt\""
    if [ -n "$allowed_tools" ]; then
-        cmd+=(--allowed-tools="$allowed_tools")
+        cmd="$cmd --allowed-tools=$allowed_tools"
    fi
    # Run Claude in headless mode with timeout
-    if timeout "$timeout" "${cmd[@]}" > "$output_file" 2>&1; then
+    if timeout "$timeout" bash -c "$cmd" > "$output_file" 2>&1; then
        cat "$output_file"
        rm -f "$output_file"
        return 0
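
The two command-building styles in this hunk treat the prompt differently. With the array form, each element reaches the command as exactly one argument and `timeout` wraps the command directly. With the string form, `timeout` wraps `bash`, and the prompt is parsed a second time by `bash -c`. The sketch below makes the argument boundaries visible by using `printf` as a stand-in for the `claude` CLI; `run_array` and `run_string` are hypothetical names.

```bash
#!/usr/bin/env bash
# Illustration only: printf stands in for `claude` so that each received
# argument is printed on its own line, wrapped in <...>.

# Array form: the prompt is passed through as a single, unmodified argument.
run_array() {
    local prompt="$1"
    local cmd=(printf '<%s>\n' -p "$prompt")
    "${cmd[@]}"
}

# String form: the command line is re-parsed by `bash -c`, so quotes inside
# the prompt are interpreted by the shell instead of being passed through.
run_string() {
    local prompt="$1"
    local cmd="printf '<%s>\\n' -p \"$prompt\""
    bash -c "$cmd"
}

run_array  'say "hi"'
run_string 'say "hi"'
```

With the prompt `say "hi"`, the array form delivers the prompt with its quotes intact, while the string form delivers `say hi`. Prompts containing `$`, backticks, or unbalanced quotes are affected in the same way.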
--- a/tests/claude-code/test-subagent-driven-development.sh
+++ b/tests/claude-code/test-subagent-driven-development.sh
@@ -12,15 +12,13 @@ set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
 source "$SCRIPT_DIR/test-helpers.sh"
-CLAUDE_PROMPT_TIMEOUT="${CLAUDE_PROMPT_TIMEOUT:-90}"
 echo "=== Test: subagent-driven-development skill ==="
 echo ""
 # Test 1: Verify skill can be loaded
 echo "Test 1: Skill loading..."
-output=$(run_claude "What is the subagent-driven-development skill? Describe its key steps briefly." "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "What is the subagent-driven-development skill? Describe its key steps briefly." 30)
 if assert_contains "$output" "subagent-driven-development\|Subagent-Driven Development\|Subagent Driven" "Skill is recognized"; then
    : # pass
@@ -39,11 +37,9 @@ echo ""
 # Test 2: Verify skill describes correct workflow order
 echo "Test 2: Workflow ordering..."
-output=$(run_claude "In the subagent-driven-development skill, what comes first: spec compliance review or code quality review? Answer using exactly this structure:
-First: <review type>
-Second: <review type>" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "In the subagent-driven-development skill, what comes first: spec compliance review or code quality review? Be specific about the order." 30)
-if assert_order "$output" "First:.*spec.*compliance" "Second:.*code.*quality" "Spec compliance before code quality"; then
+if assert_order "$output" "spec.*compliance" "code.*quality" "Spec compliance before code quality"; then
    : # pass
 else
    exit 1
@@ -54,17 +50,15 @@ echo ""
 # Test 3: Verify self-review is mentioned
 echo "Test 3: Self-review requirement..."
-output=$(run_claude "Does the subagent-driven-development skill require implementers to self-review before handoff, and can self-review replace the external reviews? Answer using exactly this structure:
-Self-review required: <yes or no>
-Self-review replaces external review: <yes or no>" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "Does the subagent-driven-development skill require implementers to do self-review? What should they check?" 30)
-if assert_contains "$output" "Self-review required:.*yes" "Mentions self-review"; then
+if assert_contains "$output" "self-review\|self review" "Mentions self-review"; then
    : # pass
 else
    exit 1
 fi
-if assert_contains "$output" "Self-review replaces external review:.*no" "Self-review does not replace external review"; then
+if assert_contains "$output" "completeness\|Completeness" "Checks completeness"; then
    : # pass
 else
    exit 1
@@ -75,7 +69,7 @@ echo ""
 # Test 4: Verify plan is read once
 echo "Test 4: Plan reading efficiency..."
-output=$(run_claude "In subagent-driven-development, how many times should the controller read the plan file? When does this happen?" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "In subagent-driven-development, how many times should the controller read the plan file? When does this happen?" 30)
 if assert_contains "$output" "once\|one time\|single" "Read plan once"; then
    : # pass
@@ -94,7 +88,7 @@ echo ""
 # Test 5: Verify spec compliance reviewer is skeptical
 echo "Test 5: Spec compliance reviewer mindset..."
-output=$(run_claude "What is the spec compliance reviewer's attitude toward the implementer's report in subagent-driven-development?" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "What is the spec compliance reviewer's attitude toward the implementer's report in subagent-driven-development?" 30)
 if assert_contains "$output" "not trust\|don't trust\|skeptical\|verify.*independently\|suspiciously" "Reviewer is skeptical"; then
    : # pass
@@ -113,7 +107,7 @@ echo ""
 # Test 6: Verify review loops
 echo "Test 6: Review loop requirements..."
-output=$(run_claude "In subagent-driven-development, what happens if a reviewer finds issues? Is it a one-time review or a loop?" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "In subagent-driven-development, what happens if a reviewer finds issues? Is it a one-time review or a loop?" 30)
 if assert_contains "$output" "loop\|again\|repeat\|until.*approved\|until.*compliant" "Review loops mentioned"; then
    : # pass
@@ -132,9 +126,7 @@ echo ""
 # Test 7: Verify full task text is provided
 echo "Test 7: Task context provision..."
-output=$(run_claude "In subagent-driven-development, how does the controller provide task information to the implementer subagent? Answer using exactly this structure:
-Controller provides: <directly or by file>
-Implementer must read plan file: <yes or no>" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "In subagent-driven-development, how does the controller provide task information to the implementer subagent? Does it make them read a file or provide it directly?" 30)
 if assert_contains "$output" "provide.*directly\|full.*text\|paste\|include.*prompt" "Provides text directly"; then
    : # pass
@@ -142,7 +134,7 @@ else
    exit 1
 fi
-if assert_contains "$output" "Implementer must read plan file:.*no" "Doesn't make subagent read file"; then
+if assert_not_contains "$output" "read.*file\|open.*file" "Doesn't make subagent read file"; then
    : # pass
 else
    exit 1
@@ -153,7 +145,7 @@ echo ""
 # Test 8: Verify worktree requirement
 echo "Test 8: Worktree requirement..."
-output=$(run_claude "What workflow skills are required before using subagent-driven-development? List any prerequisites or required skills." "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "What workflow skills are required before using subagent-driven-development? List any prerequisites or required skills." 30)
 if assert_contains "$output" "using-git-worktrees\|worktree" "Mentions worktree requirement"; then
    : # pass
@@ -166,7 +158,7 @@ echo ""
 # Test 9: Verify main branch warning
 echo "Test 9: Main branch red flag..."
-output=$(run_claude "In subagent-driven-development, is it okay to start implementation directly on the main branch?" "$CLAUDE_PROMPT_TIMEOUT")
+output=$(run_claude "In subagent-driven-development, is it okay to start implementation directly on the main branch?" 30)
 if assert_contains "$output" "worktree\|feature.*branch\|not.*main\|never.*main\|avoid.*main\|don't.*main\|consent\|permission" "Warns against main branch"; then
    : # pass
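
The hunks above swap structured-answer assertions for free-form keyword matches. One consequence worth seeing concretely: a negative match such as `read.*file` can reject a correct answer that merely mentions reading a file. The sketch below assumes a grep-based matcher; `matches` is a hypothetical stand-in for the helpers in `test-helpers.sh`, whose real implementation is not shown in this diff.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the grep-style helpers in test-helpers.sh.
matches() {
    printf '%s\n' "$1" | grep -q "$2"
}

# A correct answer that happens to mention reading a file.
response='The controller pastes the full task text into the prompt.
The implementer never has to read the plan file.
Implementer must read plan file: no'

# Free-form negative check: passes only if no line mentions reading a file.
free_form_check() {
    ! matches "$1" "read.*file\|open.*file"
}

# Structured check: passes only if the labeled field carries the answer "no".
structured_check() {
    matches "$1" "Implementer must read plan file:.*no"
}

free_form_check "$response"  && echo "free-form: PASS"  || echo "free-form: FAIL"
structured_check "$response" && echo "structured: PASS" || echo "structured: FAIL"
```

Here the free-form check fails on a correct response because the second line contains "read the plan file", while the structured check passes on the labeled field.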
--- a/tests/claude-code/test-worktree-path-policy.sh
+++ b/tests/claude-code/test-worktree-path-policy.sh
@@ -1,69 +0,0 @@
-#!/usr/bin/env bash
-# Regression check: Superpowers should not route new worktrees through the old
-# global worktree directory.
-set -euo pipefail
-SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
-USING_SKILL="$REPO_ROOT/skills/using-git-worktrees/SKILL.md"
-FINISHING_SKILL="$REPO_ROOT/skills/finishing-a-development-branch/SKILL.md"
-ROTOTILL_SPEC="$REPO_ROOT/docs/superpowers/specs/2026-04-06-worktree-rototill-design.md"
-ROTOTILL_PLAN="$REPO_ROOT/docs/superpowers/plans/2026-04-06-worktree-rototill.md"
-failures=0
-assert_contains() {
-    local file="$1"
-    local pattern="$2"
-    local label="$3"
-    if grep -Fq "$pattern" "$file"; then
-        echo "  [PASS] $label"
-    else
-        echo "  [FAIL] $label"
-        echo "    Expected to find: $pattern"
-        echo "    In file: $file"
-        failures=$((failures + 1))
-    fi
-}
-assert_not_contains() {
-    local file="$1"
-    local pattern="$2"
-    local label="$3"
-    if grep -Fq "$pattern" "$file"; then
-        echo "  [FAIL] $label"
-        echo "    Did not expect to find: $pattern"
-        echo "    In file: $file"
-        failures=$((failures + 1))
-    else
-        echo "  [PASS] $label"
-    fi
-}
-echo "=== Worktree Path Policy Test ==="
-echo ""
-assert_not_contains "$USING_SKILL" "~/.config/superpowers/worktrees" "using-git-worktrees does not mention old global path"
-assert_not_contains "$USING_SKILL" "global legacy" "using-git-worktrees does not use unclear global legacy shorthand"
-assert_not_contains "$USING_SKILL" "Global path" "using-git-worktrees has no global path quick-reference row"
-assert_contains "$USING_SKILL" 'default to `.worktrees/` at the project root' "using-git-worktrees defaults new manual worktrees to .worktrees/"
-assert_not_contains "$FINISHING_SKILL" "~/.config/superpowers/worktrees" "finishing-a-development-branch does not treat old global path as owned"
-assert_contains "$FINISHING_SKILL" '`.worktrees/` or `worktrees/`' "finishing-a-development-branch keeps project-local cleanup ownership"
-assert_not_contains "$ROTOTILL_SPEC" "~/.config/superpowers/worktrees" "rototill spec does not preserve old global path policy"
-assert_not_contains "$ROTOTILL_PLAN" "~/.config/superpowers/worktrees" "rototill plan does not preserve old global path policy"
-assert_not_contains "$ROTOTILL_PLAN" "legacy path compat" "rototill plan does not advertise legacy path compatibility"
-echo ""
-if [ "$failures" -gt 0 ]; then
-    echo "STATUS: FAILED ($failures failures)"
-    exit 1
-fi
-echo "STATUS: PASSED"
Author	SHA1	Message	Date
Drew Ritter	bad4708a7b	evals: use pre-commit hooks	2026-05-06 15:41:52 -07:00
Drew Ritter	ec9b96a7bf	evals: add Gemini 2.5 Flash backend	2026-05-06 15:09:59 -07:00
Drew Ritter	2d4cdea2bb	evals: drop drill source marker	2026-05-06 14:55:14 -07:00
Drew Ritter	af465f9687	evals: remove unreleased wave scenarios	2026-05-06 14:43:08 -07:00
Jesse Vincent	e4191c3609	Address adversarial review findings - evals/README.md, evals/CLAUDE.md: fix uv install command from 'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses [project.optional-dependencies], so --dev is a no-op for pytest/ruff/ty; --extra dev is the correct invocation. - tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh from integration_tests array (file deleted earlier in this branch). - tests/claude-code/README.md: replace test-requesting-code-review.sh section with test-worktree-native-preference.sh (the worktree test is kept; the code-review test was lifted into drill). - docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness list. evals/backends/ has claude*, codex, gemini configs but no copilot.yaml, so the claim was unsupported. Adversarial review credit: reviewer #2 found four legitimate issues (uv-sync, run-skill-tests stale ref, README stale ref via #1, and Copilot CLI fabrication); reviewer #1 found two distinct issues (run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this round.	2026-05-06 12:41:28 -07:00
Jesse Vincent	d545612825	docs: introduce evals/ as the canonical skill-behavior eval harness - docs/testing.md split into Plugin tests + Skill behavior evals. Plugin tests section enumerates the bash tests that survive (kept by drill-coverage analysis or as describe-skill tests). - CLAUDE.md adds Eval harness section pointing at evals/. - README.md Contributing section mentions evals/ alongside tests/. - .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders (evals/.gitignore covers these locally; root-level entries help tooling that does not recurse into nested ignore files).	2026-05-06 12:33:10 -07:00
Jesse Vincent	b43d14f87f	docs: annotate dated artifacts referencing lifted bash tests - RELEASE-NOTES.md: note that test-requesting-code-review.sh and test-document-review-system.sh were lifted into drill scenarios on 2026-05-06; references are preserved as dated artifacts. - docs/superpowers/plans/2026-03-23-codex-app-compatibility.md: note that tests/skill-triggering/ was lifted into drill scenarios on 2026-05-06; the run-all.sh reference is a dated artifact. Subagent second-pass scrub confirmed no other active references in the tree (excluding evals/ and the spec/plan for this work itself).	2026-05-06 12:32:00 -07:00
Jesse Vincent	11d5db1b22	tests: annotate three kept bash tests with drill coverage notes - test-worktree-native-preference.sh: drill covers PRESSURE phase only; RED + GREEN baselines have no drill counterpart and are kept so the RED-GREEN-REFACTOR validation remains rerunnable end-to-end. - test-subagent-driven-development-integration.sh: drill covers the YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds >=3 commits, >=2 subagent dispatches, TodoWrite usage, test file existence check, and token-budget telemetry. Kept until drill scenario covers those or they are retired. - test-subagent-driven-development.sh: tests agent's ability to describe SDD (string matches against expected keywords). Drill scenarios test behavior, not description-recall. Kept by design. Subagent verification recorded in commit messages of subsequent deletions; gap analyses driving these annotations are also in the verification subagent reports for the gating sweep.	2026-05-06 12:29:59 -07:00
Jesse Vincent	051bff661b	tests: remove test-requesting-code-review.sh (covered by drill code-review-catches-planted-bugs) Subagent verification: every bash assertion (skill invocation, subagent dispatch, SQL injection flagged, credential handling flagged, no merge approval) maps to drill verify checks. Drill is stricter: bundles severity (Critical/Important) into the same criteria as the finding itself (bash split severity into a separate test). Setup parity covered (src/db.js with string concat + identity hash, two commits). The drill scenario header explicitly says it is the "cross-harness, semantically-judged replacement for the bash test."	2026-05-06 12:28:40 -07:00
Jesse Vincent	dc6255291b	tests: remove test-document-review-system.sh (covered by drill spec-reviewer-catches-planted-flaws) Subagent verification: every bash assertion (TODO in Requirements section flagged, "specified later" deferral flagged, Issues section present, did-not-approve verdict) maps to drill verify.criteria entries. Setup parity covered by setup.assertions (test-feature-design.md exists with TODO + 'specified later' content). Drill is stricter: asserts tool-called Agent (subagent dispatch) which the bash test did not check.	2026-05-06 12:28:40 -07:00
Jesse Vincent	d337f4a18a	tests: remove subagent-driven-dev fixtures (covered by drill sdd-go-fractals + sdd-svelte-todo) The bash test had ZERO output assertions — it just ran claude -p and printed token usage. Drill's scenarios are strictly more rigorous: go-fractals: skill-called SDD + tool-called Agent + go test ./... passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria verifying real SDD workflow. svelte-todo: skill-called SDD + tool-called Agent + npm test passes + playwright e2e passes + package.json + svelte.config.js or vite.config.ts + >=4 commits + LLM criteria. design.md and plan.md are byte-identical between bash fixtures and drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/). Drill's setup helper (scaffold_sdd_*) forces git init -b main (stricter than bash's reliance on init.defaultBranch). The .claude/settings.local.json from bash scaffold.sh is unnecessary for drill since permissions are managed via backend YAML. Subagent verification: SAFE TO DELETE for both.	2026-05-06 12:27:31 -07:00
Jesse Vincent	6fe9cf7515	tests: remove run-claude-describes-sdd.sh (covered by drill mid-conversation-skill-invocation) Subagent verification: every bash assertion (Skill tool invoked + specific skill name 'subagent-driven-development' loaded after the agent describes it conversationally in turn 1) maps to the drill scenario's skill-called assertion + criteria paragraph requiring the skill to fire in direct response to the second user message. Drill additionally asserts tool-called Agent (subagent dispatch) which is stricter than the bash test. Other runners in tests/explicit-skill-requests/ (haiku, multiturn, extended-multiturn) and their prompt files are preserved — they have no drill coverage and exercise different behaviors.	2026-05-06 12:25:46 -07:00
Jesse Vincent	3177c87aa8	tests: remove skill-triggering bash prompts (covered by drill triggering-* scenarios) Subagent verification confirmed each prompt's intent matches its corresponding drill scenario's turns[].intent verbatim, and each scenario has both a deterministic skill-called assertion and a semantic LLM criterion confirming the matching skill was loaded (actually a stronger check than the bash test, which only confirms the skill fires anywhere in the stream). All 6 prompts deleted. The runner had no remaining prompts to drive, so run-test.sh and run-all.sh deleted as well.	2026-05-06 12:24:53 -07:00
Jesse Vincent	a94d2cc414	evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE The cli.py helper now defaults the env var. Mention as override only.	2026-05-06 12:21:35 -07:00
Jesse Vincent	dcffaa087a	evals: drop SUPERPOWERS_ROOT from codex/gemini required_env These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's os.environ access, which the new cli.py default helper supplies automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.	2026-05-06 12:20:47 -07:00
Jesse Vincent	b3817bba4f	evals: default SUPERPOWERS_ROOT to parent of evals/ if unset Adds _set_superpowers_root_default() to drill/cli.py, called at module import after load_dotenv(). PROJECT_ROOT resolves to evals/ post-lift; its parent is the superpowers repo root, which is the correct value for SUPERPOWERS_ROOT. Existing env values are respected as overrides via os.environ.setdefault. Tests: - helper sets default when var is unset - helper does not override when var is already set	2026-05-06 12:19:39 -07:00
Jesse Vincent	3c046f579e	Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.	2026-05-06 12:15:46 -07:00
Jesse Vincent	895bb732d5	Plan: lift drill into superpowers as evals/ 15-task implementation plan derived from the design spec at docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md. Each task is bite-sized (2-5 min steps) with exact commands, exact file paths, and exact code where required. Subagent verification gates per the spec are written out as concrete prompt templates. Self-review: - Spec coverage: every spec section maps to a task - Placeholder scan: no TBD/TODO/placeholder/fill-in-later language - Type consistency: helper named _set_superpowers_root_default consistently; drill SHA recorded in evals/.drill-source-sha consistently	2026-05-06 12:08:58 -07:00
Jesse Vincent	cf5914a31f	Spec: address adversarial review findings Two parallel reviewers raised legitimate issues against the lift-drill-into-evals spec. Updates: - Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh scripts + prompts, not "2 scenarios cover all". Several scripts (Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging) have no drill counterpart and stay. - tests/claude-code/test-subagent-driven-development.sh marked as meta/documentation test (asks agent to describe SDD); no drill scenario covers description tests; defaults to keep. - Path-defaults section now shows verified evidence: PROJECT_ROOT resolves to evals/ post-move; only claude*.yaml substitute ${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ in pre-run hooks); helper invocation order specified (after load_dotenv, before click definitions). - Step 2 copy uses explicit rsync excludes (.git, .venv, results, .env, __pycache__, *.egg-info, .private-journal); checksum-level verification rather than file-count. - Drill SHA recorded at copy time in commit message and evals/.drill-source-sha for divergence detection. - evals/tests/ pytest suite added to verification protocol. - Reference scrub list expanded: RELEASE-NOTES.md, docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/), lefthook.yml. Excluded dirs called out (node_modules/, .venv/, evals/). - Historical plan docs / RELEASE-NOTES handling: annotate, don't rewrite. - evals/lefthook.yml move documented (drill ships its own; contributors run cd evals && lefthook run pre-commit manually). - PR description checklist includes archival action item for obra/drill post-merge. False finding rejected: svelte-todo fixture is complete on disk (design.md + plan.md + scaffold.sh present); reviewer #1 #3 dropped.	2026-05-06 12:03:24 -07:00
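The "checksum-level verification rather than file-count" step could be sketched as follows. This is an illustrative assumption, not the procedure the spec actually uses: the function names are hypothetical, and SHA-256 is chosen here only as a convenient digest. The exclude list mirrors the one in the commit message.

```python
import hashlib
from pathlib import Path

# Directory names excluded from the copy, per the commit message.
EXCLUDED = {".git", ".venv", "results", ".env", "__pycache__", ".private-journal"}


def is_excluded(rel: Path) -> bool:
    """True if any path component is an excluded name or a *.egg-info entry."""
    return any(p in EXCLUDED or p.endswith(".egg-info") for p in rel.parts)


def tree_digests(root: Path) -> dict[str, str]:
    """Map each non-excluded file's relative path to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_file() and not is_excluded(rel):
            digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(src: Path, dst: Path) -> list[str]:
    """Relative paths that are missing on one side or differ in content."""
    a, b = tree_digests(src), tree_digests(dst)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

An empty result from `diff_trees` means every copied file matches byte for byte, which a file-count comparison cannot establish.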
Jesse Vincent	cf34cef01e	Spec: lift drill into superpowers as evals/ Records scope, branching, architecture, deletion gate, verification protocol, path/config edits, migration ordering, and post-implementation verification. Frames CI integration, scenario co-location, and Python package rename as deferred work. Per-file deletion of bash tests under superpowers/tests/ is gated by a subagent that compares each bash assertion to its drill scenario's verify block. Default keeps the bash test if any assertion is unmatched. Branching: independent off dev (f/evals-lift), not stacked on f/cross-platform.	2026-05-06 11:54:12 -07:00