Compare commits

...

5 Commits

Author SHA1 Message Date
Jesse Vincent
03360d1a21 Add end-to-end brainstorm handoff test
8-turn brainstorming flow that tests whether Claude invokes writing-plans
(correct) or EnterPlanMode (bug) after design approval. Includes
--without-fix mode for failure reproduction attempts.
2026-02-08 18:46:37 -08:00
Jesse Vincent
bdd852d945 Add brainstorm-to-plan handoff regression test
Tests that Claude invokes writing-plans skill (not EnterPlanMode) after
brainstorming. Includes --without-fix mode that strips the fix from a
plugin copy to attempt failure reproduction.
2026-02-08 18:06:03 -08:00
Jesse Vincent
b4cbf642c8 Fix brainstorming-to-planning handoff using EnterPlanMode instead of writing-plans
EnterPlanMode's system prompt guidance fires every turn while skill content
loaded early in conversation fades in context. Three fixes:
- Add EnterPlanMode to using-superpowers red flags table (refreshed every turn)
- Make brainstorming handoff imperative: invoke Skill tool, not EnterPlanMode
- Update writing-plans description to explicitly say "not EnterPlanMode"
- Reorder: plan first (while context is fresh), worktree second (for implementation)
2026-02-08 17:35:01 -08:00
Jesse Vincent
84cd6e7c20 fix: move scope assessment into understanding phase
Testing showed the model skipped scope assessment when it was a
separate step after "Understanding the idea." Inlining it as the
first thing in understanding ensures it fires before detailed questions.
2026-02-08 16:50:33 -08:00
Jesse Vincent
eac29a909f feat: add project-level scope assessment to brainstorming pipeline
Brainstorming now assesses whether a project is too large for a single
spec and helps decompose into sub-projects. Spec reviewer checks scope.
Writing-plans has a backstop if brainstorming missed it.
2026-02-08 16:50:33 -08:00
7 changed files with 663 additions and 8 deletions

View File

@@ -13,7 +13,9 @@ Start by understanding the current project context, then ask questions one at a
**Understanding the idea:**
- Check out the current project state first (files, docs, recent commits)
- Ask questions one at a time to refine the idea
- Before asking detailed questions, assess scope: if the request describes multiple independent subsystems (e.g., "build a platform with chat, file storage, billing, and analytics"), flag this immediately. Don't spend questions refining details of a project that needs to be decomposed first.
- If the project is too large for a single spec, help the user decompose into sub-projects: what are the independent pieces, how do they relate, what order should they be built? Then brainstorm the first sub-project through the normal design flow. Each sub-project gets its own spec → plan → implementation cycle.
- For appropriately-scoped projects, ask questions one at a time to refine the idea
- Prefer multiple choice questions when possible, but open-ended is fine too
- Only one question per message - if a topic needs more exploration, break it into multiple questions
- Focus on understanding: purpose, constraints, success criteria
@@ -56,11 +58,9 @@ After writing the spec document:
3. If loop exceeds 5 iterations, surface to human for guidance
**Implementation (if continuing):**
- Ask: "Ready to set up for implementation?"
- Use superpowers:using-git-worktrees to create isolated workspace
- **REQUIRED:** Use superpowers:writing-plans to create detailed implementation plan
- Do NOT use platform planning features (e.g., EnterPlanMode, plan mode)
- Do NOT start implementing directly - the writing-plans skill comes first
When the user approves the design and wants to build:
1. **Invoke `superpowers:writing-plans` using the Skill tool.** Not EnterPlanMode. Not plan mode. Not direct implementation. The Skill tool.
2. After the plan is written, use superpowers:using-git-worktrees to create an isolated workspace for implementation.
## Key Principles

View File

@@ -23,6 +23,7 @@ Task tool (general-purpose):
| Consistency | Internal contradictions, conflicting requirements |
| Clarity | Ambiguous requirements |
| YAGNI | Unrequested features, over-engineering |
| Scope | Focused enough for a single plan — not covering multiple independent subsystems |
| Architecture | Units with clear boundaries, well-defined interfaces, independently understandable and testable |
## CRITICAL

View File

@@ -73,6 +73,7 @@ These thoughts mean STOP—you're rationalizing:
| "I'll just do this one thing first" | Check BEFORE doing anything. |
| "This feels productive" | Undisciplined action wastes time. Skills prevent this. |
| "I know what that means" | Knowing the concept ≠ using the skill. Invoke it. |
| "I should use EnterPlanMode / plan mode" | If a loaded skill specifies the next step, follow the skill. EnterPlanMode is a platform default — skills override defaults. |
## Skill Priority

View File

@@ -1,6 +1,6 @@
---
name: writing-plans
description: Use when you have a spec or requirements for a multi-step task, before touching code
description: Use when you have a spec or requirements for a multi-step task, before touching code. After brainstorming, ALWAYS use this — not EnterPlanMode or plan mode.
---
# Writing Plans
@@ -13,11 +13,15 @@ Assume they are a skilled developer, but know almost nothing about our toolset o
**Announce at start:** "I'm using the writing-plans skill to create the implementation plan."
**Context:** This should be run in a dedicated worktree (created by brainstorming skill).
**Context:** This runs in the main workspace after brainstorming, while context is fresh. The worktree is created afterward for implementation.
**Save plans to:** `docs/superpowers/plans/YYYY-MM-DD-<feature-name>.md`
- (User preferences for plan location override this default)
## Scope Check
If the spec covers multiple independent subsystems, it should have been broken into sub-project specs during brainstorming. If it wasn't, suggest breaking this into separate plans — one per subsystem. Each plan should produce working, testable software on its own.
## File Structure
Before defining tasks, map out which files will be created or modified and what each one is responsible for. This is where decomposition decisions get locked in.

View File

@@ -58,6 +58,7 @@ while [[ $# -gt 0 ]]; do
echo ""
echo "Tests:"
echo " test-subagent-driven-development.sh Test skill loading and requirements"
echo " test-brainstorm-handoff.sh Test brainstorm→writing-plans handoff"
echo ""
echo "Integration Tests (use --integration):"
echo " test-subagent-driven-development-integration.sh Full workflow execution"
@@ -74,6 +75,7 @@ done
# List of skill tests to run (fast unit tests)
tests=(
"test-subagent-driven-development.sh"
"test-brainstorm-handoff.sh"
)
# Integration tests (slow, full execution)

View File

@@ -0,0 +1,330 @@
#!/usr/bin/env bash
# Test: Brainstorm-to-plan handoff (end-to-end)
#
# Full brainstorming flow that builds enough context distance to reproduce
# the EnterPlanMode failure. Simulates a real brainstorming session with
# multiple turns of Q&A before the "build it" moment.
#
# This test takes 5-10 minutes to run.
#
# PASS: Skill tool invoked with "writing-plans" AND EnterPlanMode NOT invoked
# FAIL: EnterPlanMode invoked OR writing-plans not invoked
#
# Usage:
# ./test-brainstorm-handoff-e2e.sh # With fix (expects PASS)
# ./test-brainstorm-handoff-e2e.sh --without-fix # Strip fix, reproduce failure
# ./test-brainstorm-handoff-e2e.sh --verbose # Show full output
#
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PLUGIN_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
# Parse flags
VERBOSE=false
WITHOUT_FIX=false
while [[ $# -gt 0 ]]; do
case $1 in
--verbose|-v) VERBOSE=true; shift ;;
--without-fix) WITHOUT_FIX=true; shift ;;
*) echo "Unknown flag: $1"; exit 1 ;;
esac
done
TIMESTAMP=$(date +%s)
OUTPUT_DIR="/tmp/superpowers-tests/${TIMESTAMP}/brainstorm-handoff-e2e"
mkdir -p "$OUTPUT_DIR"
echo "=== Brainstorm-to-Plan Handoff E2E Test ==="
echo "Mode: $([ "$WITHOUT_FIX" = true ] && echo "WITHOUT FIX (expect failure)" || echo "WITH FIX (expect pass)")"
echo "Output: $OUTPUT_DIR"
echo "This test takes 5-10 minutes."
echo ""
# --- Project Setup ---
PROJECT_DIR="$OUTPUT_DIR/project"
mkdir -p "$PROJECT_DIR/src" "$PROJECT_DIR/test"
cat > "$PROJECT_DIR/package.json" << 'PROJ_EOF'
{
"name": "my-express-app",
"version": "1.0.0",
"type": "module",
"scripts": {
"start": "node src/index.js",
"test": "vitest run"
},
"dependencies": {
"express": "^4.18.0",
"better-sqlite3": "^9.0.0"
},
"devDependencies": {
"vitest": "^1.0.0",
"supertest": "^6.0.0"
}
}
PROJ_EOF
cat > "$PROJECT_DIR/src/index.js" << 'PROJ_EOF'
import express from 'express';
const app = express();
app.use(express.json());
app.get('/health', (req, res) => res.json({ status: 'ok' }));
const PORT = process.env.PORT || 3000;
if (process.env.NODE_ENV !== 'test') {
app.listen(PORT, () => console.log(`Listening on ${PORT}`));
}
export default app;
PROJ_EOF
cd "$PROJECT_DIR"
git init -q
git add -A
git commit -q -m "Initial commit"
# --- Plugin Setup ---
EFFECTIVE_PLUGIN_DIR="$PLUGIN_DIR"
if [ "$WITHOUT_FIX" = true ]; then
echo "Creating plugin copy without the handoff fix..."
EFFECTIVE_PLUGIN_DIR="$OUTPUT_DIR/plugin-without-fix"
cp -R "$PLUGIN_DIR" "$EFFECTIVE_PLUGIN_DIR"
python3 << PYEOF
import pathlib
# Strip fix from brainstorming SKILL.md
p = pathlib.Path('$EFFECTIVE_PLUGIN_DIR/skills/brainstorming/SKILL.md')
content = p.read_text()
content = content.replace(
'**Implementation (if continuing):**\nWhen the user approves the design and wants to build:\n1. **Invoke \`superpowers:writing-plans\` using the Skill tool.** Not EnterPlanMode. Not plan mode. Not direct implementation. The Skill tool.\n2. After the plan is written, use superpowers:using-git-worktrees to create an isolated workspace for implementation.',
'**Implementation (if continuing):**\n- Ask: "Ready to set up for implementation?"\n- Use superpowers:using-git-worktrees to create isolated workspace\n- **REQUIRED:** Use superpowers:writing-plans to create detailed implementation plan'
)
p.write_text(content)
# Strip fix from using-superpowers
p = pathlib.Path('$EFFECTIVE_PLUGIN_DIR/skills/using-superpowers/SKILL.md')
lines = p.read_text().splitlines(keepends=True)
lines = [l for l in lines if 'I should use EnterPlanMode' not in l]
p.write_text(''.join(lines))
# Strip fix from writing-plans
p = pathlib.Path('$EFFECTIVE_PLUGIN_DIR/skills/writing-plans/SKILL.md')
content = p.read_text()
content = content.replace(
'description: Use when you have a spec or requirements for a multi-step task, before touching code. After brainstorming, ALWAYS use this — not EnterPlanMode or plan mode.',
'description: Use when you have a spec or requirements for a multi-step task, before touching code'
)
content = content.replace(
'**Context:** This runs in the main workspace after brainstorming, while context is fresh. The worktree is created afterward for implementation.',
'**Context:** This should be run in a dedicated worktree (created by brainstorming skill).'
)
p.write_text(content)
PYEOF
echo "Plugin copy created."
echo ""
fi
# --- Helper ---
run_turn() {
local turn_num="$1"
local prompt="$2"
local max_turns="$3"
local label="$4"
local continue_flag="${5:-}"
local log_file="$OUTPUT_DIR/turn${turn_num}.json"
echo ">>> Turn $turn_num: $label"
local cmd="timeout 300 claude -p \"$prompt\""
cmd="$cmd --plugin-dir \"$EFFECTIVE_PLUGIN_DIR\""
cmd="$cmd --dangerously-skip-permissions"
cmd="$cmd --max-turns $max_turns"
cmd="$cmd --output-format stream-json"
if [ -n "$continue_flag" ]; then
cmd="$cmd --continue"
fi
eval "$cmd" > "$log_file" 2>&1 || true
echo " Done."
if [ "$VERBOSE" = true ]; then
echo " ---"
grep '"type":"assistant"' "$log_file" 2>/dev/null | tail -1 | \
jq -r '.message.content[0].text // empty' 2>/dev/null | \
head -c 600 || true
echo ""
echo " ---"
fi
echo "$log_file"
}
# --- Run Full Brainstorming Flow ---
cd "$PROJECT_DIR"
# Turn 1: Start brainstorming - this loads the skill and begins Q&A
T1=$(run_turn 1 \
"I want to add URL shortening to this Express app. Help me think through the design." \
5 "Starting brainstorming")
# Turn 2: Answer first question (whatever it is) generically
T2=$(run_turn 2 \
"Good question. Here is what I want: POST /api/shorten that takes a URL and returns a short code. GET /:code that redirects. GET /api/stats/:code for click tracking. Random 6-char alphanumeric codes. SQLite storage using better-sqlite3 which is already in package.json. No auth needed." \
5 "Answering first question" --continue)
# Turn 3: Agree with recommendations
T3=$(run_turn 3 \
"Yes, that sounds right. Go with your recommendation." \
5 "Agreeing with recommendation" --continue)
# Turn 4: Continue agreeing
T4=$(run_turn 4 \
"Looks good. I agree with that approach." \
5 "Continuing to agree" --continue)
# Turn 5: Push toward completion
T5=$(run_turn 5 \
"Perfect. I am happy with all of that. Please wrap up the design and write the spec." \
8 "Requesting spec write-up" --continue)
# Turn 6: Approve the spec
T6=$(run_turn 6 \
"The spec looks great. I approve it." \
5 "Approving spec" --continue)
# Turn 7: THE CRITICAL MOMENT - "build it"
T7=$(run_turn 7 \
"Yes, build it." \
5 "Critical handoff: build it" --continue)
# Turn 8: Safety net in case turn 7 asked a follow-up
T8=$(run_turn 8 \
"Yes. Go ahead and build it now." \
5 "Safety net: build it" --continue)
echo ""
# --- Assertions ---
echo "=== Results ==="
echo ""
# Combine all logs
ALL_LOGS="$OUTPUT_DIR/all-turns.json"
cat "$OUTPUT_DIR"/turn*.json > "$ALL_LOGS" 2>/dev/null
# Check handoff turns (6-8, where approval + "build it" happens)
HANDOFF_LOGS="$OUTPUT_DIR/handoff-turns.json"
cat "$OUTPUT_DIR/turn6.json" "$OUTPUT_DIR/turn7.json" "$OUTPUT_DIR/turn8.json" > "$HANDOFF_LOGS" 2>/dev/null
# Detection: writing-plans skill invoked in handoff turns?
HAS_WRITING_PLANS=false
if grep -q '"name":"Skill"' "$HANDOFF_LOGS" 2>/dev/null && grep -q 'writing-plans' "$HANDOFF_LOGS" 2>/dev/null; then
HAS_WRITING_PLANS=true
fi
# Detection: EnterPlanMode invoked in handoff turns?
HAS_ENTER_PLAN_MODE=false
if grep -q '"name":"EnterPlanMode"' "$HANDOFF_LOGS" 2>/dev/null; then
HAS_ENTER_PLAN_MODE=true
fi
# Also check across ALL turns (might happen earlier)
HAS_ENTER_PLAN_MODE_ANYWHERE=false
if grep -q '"name":"EnterPlanMode"' "$ALL_LOGS" 2>/dev/null; then
HAS_ENTER_PLAN_MODE_ANYWHERE=true
fi
# Report
echo "Skills invoked (all turns):"
grep -o '"skill":"[^"]*"' "$ALL_LOGS" 2>/dev/null | sort -u || echo " (none)"
echo ""
echo "Skills invoked (handoff turns 6-8):"
grep -o '"skill":"[^"]*"' "$HANDOFF_LOGS" 2>/dev/null | sort -u || echo " (none)"
echo ""
echo "Tools invoked in handoff turns (6-8):"
grep -o '"name":"[A-Z][^"]*"' "$HANDOFF_LOGS" 2>/dev/null | sort | uniq -c | sort -rn | head -10 || echo " (none)"
echo ""
if [ "$HAS_ENTER_PLAN_MODE_ANYWHERE" = true ]; then
echo "WARNING: EnterPlanMode was invoked somewhere in the conversation."
echo "Turns containing EnterPlanMode:"
for f in "$OUTPUT_DIR"/turn*.json; do
if grep -q '"name":"EnterPlanMode"' "$f" 2>/dev/null; then
echo " $(basename "$f")"
fi
done
echo ""
fi
# Determine result
PASSED=false
if [ "$WITHOUT_FIX" = true ]; then
echo "--- Without-Fix Mode (reproducing failure) ---"
if [ "$HAS_ENTER_PLAN_MODE" = true ] || [ "$HAS_ENTER_PLAN_MODE_ANYWHERE" = true ]; then
echo "REPRODUCED: Claude used EnterPlanMode (the bug we're fixing)"
PASSED=true
elif [ "$HAS_WRITING_PLANS" = true ]; then
echo "NOT REPRODUCED: Claude used writing-plans even without the fix"
echo "(The old guidance was sufficient in this run)"
PASSED=false
else
echo "INCONCLUSIVE: Claude used neither writing-plans nor EnterPlanMode"
echo "The brainstorming flow may not have reached the handoff point."
PASSED=false
fi
else
echo "--- With-Fix Mode (verifying fix) ---"
if [ "$HAS_WRITING_PLANS" = true ] && [ "$HAS_ENTER_PLAN_MODE_ANYWHERE" = false ]; then
echo "PASS: Claude used writing-plans skill (correct handoff)"
PASSED=true
elif [ "$HAS_ENTER_PLAN_MODE_ANYWHERE" = true ]; then
echo "FAIL: Claude used EnterPlanMode instead of writing-plans"
PASSED=false
elif [ "$HAS_WRITING_PLANS" = true ] && [ "$HAS_ENTER_PLAN_MODE_ANYWHERE" = true ]; then
echo "FAIL: Claude used BOTH writing-plans AND EnterPlanMode"
PASSED=false
else
echo "INCONCLUSIVE: Claude used neither writing-plans nor EnterPlanMode"
echo "The brainstorming flow may not have reached the handoff."
echo "Check logs to see where the conversation stopped."
PASSED=false
fi
fi
echo ""
# Show what happened in each turn
echo "Turn-by-turn summary:"
for i in 1 2 3 4 5 6 7 8; do
local_log="$OUTPUT_DIR/turn${i}.json"
if [ -f "$local_log" ]; then
local_skills=$(grep -o '"skill":"[^"]*"' "$local_log" 2>/dev/null | tr '\n' ' ' || true)
local_tools=$(grep -o '"name":"EnterPlanMode\|"name":"Skill"' "$local_log" 2>/dev/null | tr '\n' ' ' || true)
local_size=$(wc -c < "$local_log" | tr -d ' ')
printf " Turn %d: %s bytes" "$i" "$local_size"
[ -n "$local_skills" ] && printf " | skills: %s" "$local_skills"
[ -n "$local_tools" ] && printf " | tools: %s" "$local_tools"
echo ""
fi
done
echo ""
echo "Logs: $OUTPUT_DIR"
echo ""
if [ "$PASSED" = true ]; then
exit 0
else
exit 1
fi

View File

@@ -0,0 +1,317 @@
#!/usr/bin/env bash
# Test: Brainstorm-to-plan handoff
#
# Verifies that after brainstorming, Claude invokes the writing-plans skill
# instead of using EnterPlanMode.
#
# The failure mode this catches:
# User says "build it" after brainstorming -> Claude calls EnterPlanMode
# (because the system prompt's planning guidance overpowers the brainstorming
# skill's instructions, which were loaded many turns ago)
#
# PASS: Skill tool invoked with "writing-plans" AND EnterPlanMode NOT invoked
# FAIL: EnterPlanMode invoked OR writing-plans not invoked
#
# Usage:
# ./test-brainstorm-handoff.sh # Normal test (expects PASS)
# ./test-brainstorm-handoff.sh --without-fix # Strip fix, reproduce failure
# ./test-brainstorm-handoff.sh --verbose # Show full output
#
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PLUGIN_DIR="$(cd "$SCRIPT_DIR/../.." && pwd)"
# Parse flags
VERBOSE=false
WITHOUT_FIX=false
while [[ $# -gt 0 ]]; do
case $1 in
--verbose|-v) VERBOSE=true; shift ;;
--without-fix) WITHOUT_FIX=true; shift ;;
*) echo "Unknown flag: $1"; exit 1 ;;
esac
done
TIMESTAMP=$(date +%s)
OUTPUT_DIR="/tmp/superpowers-tests/${TIMESTAMP}/brainstorm-handoff"
mkdir -p "$OUTPUT_DIR"
echo "=== Brainstorm-to-Plan Handoff Test ==="
echo "Mode: $([ "$WITHOUT_FIX" = true ] && echo "WITHOUT FIX (expect failure)" || echo "WITH FIX (expect pass)")"
echo "Output: $OUTPUT_DIR"
echo ""
# --- Project Setup ---
PROJECT_DIR="$OUTPUT_DIR/project"
mkdir -p "$PROJECT_DIR/src"
mkdir -p "$PROJECT_DIR/docs/superpowers/specs"
cat > "$PROJECT_DIR/package.json" << 'PROJ_EOF'
{
"name": "my-express-app",
"version": "1.0.0",
"type": "module",
"dependencies": {
"express": "^4.18.0",
"better-sqlite3": "^9.0.0"
}
}
PROJ_EOF
cat > "$PROJECT_DIR/src/index.js" << 'PROJ_EOF'
import express from 'express';
const app = express();
app.use(express.json());
app.get('/health', (req, res) => res.json({ status: 'ok' }));
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Listening on ${PORT}`));
PROJ_EOF
# Pre-create a spec document (simulating completed brainstorming)
cat > "$PROJECT_DIR/docs/superpowers/specs/2025-01-15-url-shortener-design.md" << 'SPEC_EOF'
# URL Shortener Design Spec
## Overview
Add URL shortening capability to the existing Express.js API.
## Features
- POST /api/shorten accepts { url } and returns { shortCode, shortUrl }
- GET /:code redirects to the original URL (302)
- GET /api/stats/:code returns { clicks, createdAt, originalUrl }
## Technical Design
### Database
Single SQLite table via better-sqlite3:
```sql
CREATE TABLE urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
short_code TEXT UNIQUE NOT NULL,
original_url TEXT NOT NULL,
clicks INTEGER DEFAULT 0,
created_at TEXT DEFAULT (datetime('now'))
);
CREATE INDEX idx_short_code ON urls(short_code);
```
### File Structure
- `src/index.js` — modified to mount new routes
- `src/db.js` — database initialization and query functions
- `src/shorten.js` — route handlers for all three endpoints
- `src/code-generator.js` — random 6-char alphanumeric code generation
### Code Generation
Random 6-character alphanumeric codes using crypto.randomBytes.
Check for collisions and retry (astronomically unlikely with 36^6 space).
### Validation
- URL must be present and start with http:// or https://
- Return 400 with { error: "..." } for invalid input
### Error Handling
- 404 with { error: "Not found" } for unknown short codes
- 500 with { error: "Internal server error" } for database failures
## Decisions
- 302 redirects (not 301) so browsers don't cache and we always track clicks
- Database path configurable via DATABASE_PATH env var, defaults to ./data/urls.db
- No auth, no custom codes, no expiry — keeping it simple
SPEC_EOF
# Initialize git so brainstorming can inspect project state
cd "$PROJECT_DIR"
git init -q
git add -A
git commit -q -m "Initial commit with URL shortener spec"
# --- Plugin Setup ---
EFFECTIVE_PLUGIN_DIR="$PLUGIN_DIR"
if [ "$WITHOUT_FIX" = true ]; then
echo "Creating plugin copy without the handoff fix..."
EFFECTIVE_PLUGIN_DIR="$OUTPUT_DIR/plugin-without-fix"
cp -R "$PLUGIN_DIR" "$EFFECTIVE_PLUGIN_DIR"
# Strip fix from brainstorming SKILL.md: revert to old implementation section
python3 << PYEOF
import pathlib
p = pathlib.Path('$EFFECTIVE_PLUGIN_DIR/skills/brainstorming/SKILL.md')
content = p.read_text()
content = content.replace(
'**Implementation (if continuing):**\nWhen the user approves the design and wants to build:\n1. **Invoke \`superpowers:writing-plans\` using the Skill tool.** Not EnterPlanMode. Not plan mode. Not direct implementation. The Skill tool.\n2. After the plan is written, use superpowers:using-git-worktrees to create an isolated workspace for implementation.',
'**Implementation (if continuing):**\n- Ask: "Ready to set up for implementation?"\n- Use superpowers:using-git-worktrees to create isolated workspace\n- **REQUIRED:** Use superpowers:writing-plans to create detailed implementation plan'
)
p.write_text(content)
PYEOF
# Strip fix from using-superpowers: remove EnterPlanMode red flag
python3 << PYEOF
import pathlib
p = pathlib.Path('$EFFECTIVE_PLUGIN_DIR/skills/using-superpowers/SKILL.md')
lines = p.read_text().splitlines(keepends=True)
lines = [l for l in lines if 'I should use EnterPlanMode' not in l]
p.write_text(''.join(lines))
PYEOF
# Strip fix from writing-plans: revert description and context
python3 << PYEOF
import pathlib
p = pathlib.Path('$EFFECTIVE_PLUGIN_DIR/skills/writing-plans/SKILL.md')
content = p.read_text()
content = content.replace(
'description: Use when you have a spec or requirements for a multi-step task, before touching code. After brainstorming, ALWAYS use this — not EnterPlanMode or plan mode.',
'description: Use when you have a spec or requirements for a multi-step task, before touching code'
)
content = content.replace(
'**Context:** This runs in the main workspace after brainstorming, while context is fresh. The worktree is created afterward for implementation.',
'**Context:** This should be run in a dedicated worktree (created by brainstorming skill).'
)
p.write_text(content)
PYEOF
echo "Plugin copy created at $EFFECTIVE_PLUGIN_DIR"
echo ""
fi
# --- Run Conversation ---
cd "$PROJECT_DIR"
# Turn 1: Load brainstorming and establish that we finished the design
# The key is that brainstorming gets loaded into context, and we're at the handoff point
echo ">>> Turn 1: Loading brainstorming skill and establishing context..."
TURN1_LOG="$OUTPUT_DIR/turn1.json"
TURN1_PROMPT='I want to add URL shortening to this Express app. I already have the full design worked out and written to docs/superpowers/specs/2025-01-15-url-shortener-design.md. Please read the spec.'
timeout 300 claude -p "$TURN1_PROMPT" \
--plugin-dir "$EFFECTIVE_PLUGIN_DIR" \
--dangerously-skip-permissions \
--max-turns 5 \
--output-format stream-json \
> "$TURN1_LOG" 2>&1 || true
echo "Turn 1 complete."
if [ "$VERBOSE" = true ]; then
echo "---"
grep '"type":"assistant"' "$TURN1_LOG" | tail -1 | jq -r '.message.content[0].text // empty' 2>/dev/null | head -c 800 || true
echo ""
echo "---"
fi
echo ""
# Turn 2: Approve and ask to build - this is the critical handoff moment
echo ">>> Turn 2: 'The spec is done. Build it.' (critical handoff)..."
TURN2_LOG="$OUTPUT_DIR/turn2.json"
TURN2_PROMPT='The spec is complete and I am happy with the design. Build it.'
timeout 300 claude -p "$TURN2_PROMPT" \
--continue \
--plugin-dir "$EFFECTIVE_PLUGIN_DIR" \
--dangerously-skip-permissions \
--max-turns 5 \
--output-format stream-json \
> "$TURN2_LOG" 2>&1 || true
echo "Turn 2 complete."
if [ "$VERBOSE" = true ]; then
echo "---"
grep '"type":"assistant"' "$TURN2_LOG" | tail -1 | jq -r '.message.content[0].text // empty' 2>/dev/null | head -c 800 || true
echo ""
echo "---"
fi
echo ""
# --- Assertions ---
echo "=== Results ==="
echo ""
# Combine all turn logs for analysis
ALL_LOGS="$OUTPUT_DIR/all-turns.json"
cat "$TURN1_LOG" "$TURN2_LOG" > "$ALL_LOGS"
# Detection: writing-plans skill invoked?
HAS_WRITING_PLANS=false
if grep -q '"name":"Skill"' "$ALL_LOGS" 2>/dev/null && grep -q 'writing-plans' "$ALL_LOGS" 2>/dev/null; then
HAS_WRITING_PLANS=true
fi
# Detection: EnterPlanMode invoked?
HAS_ENTER_PLAN_MODE=false
if grep -q '"name":"EnterPlanMode"' "$ALL_LOGS" 2>/dev/null; then
HAS_ENTER_PLAN_MODE=true
fi
# Report what skills were invoked
echo "Skills invoked:"
grep -o '"skill":"[^"]*"' "$ALL_LOGS" 2>/dev/null | sort -u || echo " (none)"
echo ""
echo "Notable tools invoked:"
grep -o '"name":"[A-Z][^"]*"' "$ALL_LOGS" 2>/dev/null | sort | uniq -c | sort -rn | head -10 || echo " (none)"
echo ""
# Determine result
PASSED=false
if [ "$WITHOUT_FIX" = true ]; then
# In without-fix mode, we EXPECT the failure (EnterPlanMode)
echo "--- Without-Fix Mode (reproducing failure) ---"
if [ "$HAS_ENTER_PLAN_MODE" = true ]; then
echo "REPRODUCED: Claude used EnterPlanMode (the bug we're fixing)"
PASSED=true
elif [ "$HAS_WRITING_PLANS" = true ]; then
echo "NOT REPRODUCED: Claude used writing-plans even without the fix"
echo "(The model may have followed the old guidance anyway)"
PASSED=false
else
echo "INCONCLUSIVE: Claude used neither writing-plans nor EnterPlanMode"
echo "The brainstorming flow may not have reached the handoff point."
PASSED=false
fi
else
# Normal mode: expect writing-plans, not EnterPlanMode
echo "--- With-Fix Mode (verifying fix) ---"
if [ "$HAS_WRITING_PLANS" = true ] && [ "$HAS_ENTER_PLAN_MODE" = false ]; then
echo "PASS: Claude used writing-plans skill (correct handoff)"
PASSED=true
elif [ "$HAS_ENTER_PLAN_MODE" = true ]; then
echo "FAIL: Claude used EnterPlanMode instead of writing-plans"
PASSED=false
elif [ "$HAS_WRITING_PLANS" = true ] && [ "$HAS_ENTER_PLAN_MODE" = true ]; then
echo "FAIL: Claude used BOTH writing-plans AND EnterPlanMode"
PASSED=false
else
echo "INCONCLUSIVE: Claude used neither writing-plans nor EnterPlanMode"
echo "The brainstorming flow may not have reached the handoff point."
echo "Check logs - brainstorming may still be asking questions."
PASSED=false
fi
fi
echo ""
# Show the critical turn 2 response
echo "Turn 2 response (first 500 chars):"
grep '"type":"assistant"' "$TURN2_LOG" 2>/dev/null | tail -1 | \
jq -r '.message.content[0].text // .message.content' 2>/dev/null | \
head -c 500 || echo " (could not extract)"
echo ""
echo ""
echo "Logs:"
echo " Turn 1: $TURN1_LOG"
echo " Turn 2: $TURN2_LOG"
echo " Combined: $ALL_LOGS"
echo ""
if [ "$PASSED" = true ]; then
exit 0
else
exit 1
fi