Add comprehensive testing documentation

Documents: - How to run integration tests - subagent-driven-development test details - Token analysis tool usage - Troubleshooting common issues - Writing new integration tests - Session transcript format
Add token usage analysis to subagent-driven-development test
2026-06-10 12:49:04 +08:00 · 2025-11-29 21:03:33 -08:00 · 2025-11-29 20:51:18 -08:00 · 2025-11-29 19:35:43 -08:00 · 2025-11-29 17:15:21 -08:00 · 2025-11-29 11:47:47 -08:00
17 changed files with 2447 additions and 296 deletions
--- a/.claude-plugin/plugin.json
+++ b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
  "name": "superpowers",
  "description": "Core skills library for Claude Code: TDD, debugging, collaboration patterns, and proven techniques",
-  "version": "3.6.1",
+  "version": "3.5.1",
  "author": {
    "name": "Jesse Vincent",
    "email": "jesse@fsck.com"
--- a/.codex/superpowers-codex
+++ b/.codex/superpowers-codex
@@ -229,7 +229,7 @@ function runUseSkill(skillName) {
    if (frontmatter.description) {
        console.log(`# ${frontmatter.description}`);
    }
-    console.log(`# Skill-specific tools and reference files live in ${skillDirectory}`);
+    console.log(`# Supporting tools and docs are in ${skillDirectory}`);
    console.log('# ============================================');
    console.log('');

@@ -264,4 +264,4 @@ switch (command) {
        console.log('  superpowers-codex use-skill superpowers:brainstorming');
        console.log('  superpowers-codex use-skill my-custom-skill');
        break;
-}
+}
--- a/.opencode/plugin/superpowers.js
+++ b/.opencode/plugin/superpowers.js
@@ -67,7 +67,7 @@ ${toolMapping}
        path: { id: sessionID },
        body: {
          noReply: true,
-          parts: [{ type: "text", text: bootstrapContent, synthetic: true }]
+          parts: [{ type: "text", text: bootstrapContent }]
        }
      });
      return true;
@@ -132,8 +132,8 @@ ${toolMapping}
              body: {
                noReply: true,
                parts: [
-                  { type: "text", text: `Loading skill: ${name || skill_name}`, synthetic: true },
-                  { type: "text", text: `${skillHeader}\n\n${content}`, synthetic: true }
+                  { type: "text", text: `Loading skill: ${name || skill_name}` },
+                  { type: "text", text: `${skillHeader}\n\n${content}` }
                ]
              }
            });
--- a/agents/code-reviewer.md
+++ b/agents/code-reviewer.md
@@ -2,6 +2,7 @@
 name: code-reviewer
 description: |
  Use this agent when a major project step has been completed and needs to be reviewed against the original plan and coding standards. Examples: <example>Context: The user is creating a code-review agent that should be called after a logical chunk of code is written. user: "I've finished implementing the user authentication system as outlined in step 3 of our plan" assistant: "Great work! Now let me use the code-reviewer agent to review the implementation against our plan and coding standards" <commentary>Since a major project step has been completed, use the code-reviewer agent to validate the work against the plan and identify any issues.</commentary></example> <example>Context: User has completed a significant feature implementation. user: "The API endpoints for the task management system are now complete - that covers step 2 from our architecture document" assistant: "Excellent! Let me have the code-reviewer agent examine this implementation to ensure it aligns with our plan and follows best practices" <commentary>A numbered step from the planning document has been completed, so the code-reviewer agent should review the work.</commentary></example>
+model: sonnet
 ---

 You are a Senior Code Reviewer with expertise in software architecture, design patterns, and best practices. Your role is to review completed project steps against original plans and ensure code quality standards are met.
--- a/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md
+++ b/docs/plans/2025-11-28-skills-improvements-from-user-feedback.md
@@ -0,0 +1,711 @@
+# Skills Improvements from User Feedback
+
+**Date:** 2025-11-28
+**Status:** Draft
+**Source:** Two Claude instances using superpowers in real development scenarios
+
+---
+
+## Executive Summary
+
+Two Claude instances provided detailed feedback from actual development sessions. Their feedback reveals **systematic gaps** in current skills that allowed preventable bugs to ship despite following the skills.
+
+**Critical insight:** These are problem reports, not just solution proposals. The problems are real; the solutions need careful evaluation.
+
+**Key themes:**
+1. **Verification gaps** - We verify operations succeed but not that they achieve intended outcomes
+2. **Process hygiene** - Background processes accumulate and interfere across subagents
+3. **Context optimization** - Subagents get too much irrelevant information
+4. **Self-reflection missing** - No prompt to critique own work before handoff
+5. **Mock safety** - Mocks can drift from interfaces without detection
+6. **Skill activation** - Skills exist but aren't being read/used
+
+---
+
+## Problems Identified
+
+### Problem 1: Configuration Change Verification Gap
+
+**What happened:**
+- Subagent tested "OpenAI integration"
+- Set `OPENAI_API_KEY` env var
+- Got status 200 responses
+- Reported "OpenAI integration working"
+- **BUT** response contained `"model": "claude-sonnet-4-20250514"` - was actually using Anthropic
+
+**Root cause:**
+`verification-before-completion` checks operations succeed but not that outcomes reflect intended configuration changes.
+
+**Impact:** High - False confidence in integration tests, bugs ship to production
+
+**Example failure pattern:**
+- Switch LLM provider → verify status 200 but don't check model name
+- Enable feature flag → verify no errors but don't check feature is active
+- Change environment → verify deployment succeeds but don't check environment vars
+
+---
+
+### Problem 2: Background Process Accumulation
+
+**What happened:**
+- Multiple subagents dispatched during session
+- Each started background server processes
+- Processes accumulated (4+ servers running)
+- Stale processes still bound to ports
+- Later E2E test hit stale server with wrong config
+- Confusing/incorrect test results
+
+**Root cause:**
+Subagents are stateless - don't know about previous subagents' processes. No cleanup protocol.
+
+**Impact:** Medium-High - Tests hit wrong server, false passes/failures, debugging confusion
+
+---
+
+### Problem 3: Context Bloat in Subagent Prompts
+
+**What happened:**
+- Standard approach: give subagent full plan file to read
+- Experiment: give only task + pattern + file + verify command
+- Result: Faster, more focused, single-attempt completion more common
+
+**Root cause:**
+Subagents waste tokens and attention on irrelevant plan sections.
+
+**Impact:** Medium - Slower execution, more failed attempts
+
+**What worked:**
+```
+You are adding a single E2E test to packnplay's test suite.
+
+**Your task:** Add `TestE2E_FeaturePrivilegedMode` to `pkg/runner/e2e_test.go`
+
+**What to test:** A local devcontainer feature that requests `"privileged": true`
+in its metadata should result in the container running with `--privileged` flag.
+
+**Follow the exact pattern of TestE2E_FeatureOptionValidation** (at the end of the file)
+
+**After writing, run:** `go test -v ./pkg/runner -run TestE2E_FeaturePrivilegedMode -timeout 5m`
+```
+
+---
+
+### Problem 4: No Self-Reflection Before Handoff
+
+**What happened:**
+- Added self-reflection prompt: "Look at your work with fresh eyes - what could be better?"
+- Implementer for Task 5 identified failing test was due to implementation bug, not test bug
+- Traced to line 99: `strings.Join(metadata.Entrypoint, " ")` creating invalid Docker syntax
+- Without self-reflection, would have just reported "test fails" without root cause
+
+**Root cause:**
+Implementers don't naturally step back and critique their own work before reporting completion.
+
+**Impact:** Medium - Bugs handed off to reviewer that implementer could have caught
+
+---
+
+### Problem 5: Mock-Interface Drift
+
+**What happened:**
+```typescript
+// Interface defines close()
+interface PlatformAdapter {
+  close(): Promise<void>;
+}
+
+// Code (BUGGY) calls cleanup()
+await adapter.cleanup();
+
+// Mock (MATCHES BUG) defines cleanup()
+vi.mock('web-adapter', () => ({
+  WebAdapter: vi.fn().mockImplementation(() => ({
+    cleanup: vi.fn().mockResolvedValue(undefined),  // Wrong!
+  })),
+}));
+```
+- Tests passed
+- Runtime crashed: "adapter.cleanup is not a function"
+
+**Root cause:**
+Mock derived from what buggy code calls, not from interface definition. TypeScript can't catch inline mocks with wrong method names.
+
+**Impact:** High - Tests give false confidence, runtime crashes
+
+**Why testing-anti-patterns didn't prevent this:**
+The skill covers testing mock behavior and mocking without understanding, but not the specific pattern of "derive mock from interface, not implementation."
+
+---
+
+### Problem 6: Code Reviewer File Access
+
+**What happened:**
+- Code reviewer subagent dispatched
+- Couldn't find test file: "The file doesn't appear to exist in the repository"
+- File actually exists
+- Reviewer didn't know to explicitly read it first
+
+**Root cause:**
+Reviewer prompts don't include explicit file reading instructions.
+
+**Impact:** Low-Medium - Reviews fail or incomplete
+
+---
+
+### Problem 7: Fix Workflow Latency
+
+**What happened:**
+- Implementer identifies bug during self-reflection
+- Implementer knows the fix
+- Current workflow: report → I dispatch fixer → fixer fixes → I verify
+- Extra round-trip adds latency without adding value
+
+**Root cause:**
+Rigid separation between implementer and fixer roles when implementer has already diagnosed.
+
+**Impact:** Low - Latency, but no correctness issue
+
+---
+
+### Problem 8: Skills Not Being Read
+
+**What happened:**
+- `testing-anti-patterns` skill exists
+- Neither human nor subagents read it before writing tests
+- Would have prevented some issues (though not all - see Problem 5)
+
+**Root cause:**
+No enforcement that subagents read relevant skills. No prompt includes skill reading.
+
+**Impact:** Medium - Skill investment wasted if not used
+
+---
+
+## Proposed Improvements
+
+### 1. verification-before-completion: Add Configuration Change Verification
+
+**Add new section:**
+
+```markdown
+## Verifying Configuration Changes
+
+When testing changes to configuration, providers, feature flags, or environment:
+
+**Don't just verify the operation succeeded. Verify the output reflects the intended change.**
+
+### Common Failure Pattern
+
+Operation succeeds because *some* valid config exists, but it's not the config you intended to test.
+
+### Examples
+
+| Change | Insufficient | Required |
+|--------|-------------|----------|
+| Switch LLM provider | Status 200 | Response contains expected model name |
+| Enable feature flag | No errors | Feature behavior actually active |
+| Change environment | Deploy succeeds | Logs/vars reference new environment |
+| Set credentials | Auth succeeds | Authenticated user/context is correct |
+
+### Gate Function
+
+```
+BEFORE claiming configuration change works:
+
+1. IDENTIFY: What should be DIFFERENT after this change?
+2. LOCATE: Where is that difference observable?
+   - Response field (model name, user ID)
+   - Log line (environment, provider)
+   - Behavior (feature active/inactive)
+3. RUN: Command that shows the observable difference
+4. VERIFY: Output contains expected difference
+5. ONLY THEN: Claim configuration change works
+
+Red flags:
+  - "Request succeeded" without checking content
+  - Checking status code but not response body
+  - Verifying no errors but not positive confirmation
+```
+
+**Why this works:**
+Forces verification of INTENT, not just operation success.
+
+---
+
+### 2. subagent-driven-development: Add Process Hygiene for E2E Tests
+
+**Add new section:**
+
+```markdown
+## Process Hygiene for E2E Tests
+
+When dispatching subagents that start services (servers, databases, message queues):
+
+### Problem
+
+Subagents are stateless - they don't know about processes started by previous subagents. Background processes persist and can interfere with later tests.
+
+### Solution
+
+**Before dispatching E2E test subagent, include cleanup in prompt:**
+
+```
+BEFORE starting any services:
+1. Kill existing processes: pkill -f "<service-pattern>" 2>/dev/null || true
+2. Wait for cleanup: sleep 1
+3. Verify port free: lsof -i :<port> && echo "ERROR: Port still in use" || echo "Port free"
+
+AFTER tests complete:
+1. Kill the process you started
+2. Verify cleanup: pgrep -f "<service-pattern>" || echo "Cleanup successful"
+```
+
+### Example
+
+```
+Task: Run E2E test of API server
+
+Prompt includes:
+"Before starting the server:
+- Kill any existing servers: pkill -f 'node.*server.js' 2>/dev/null || true
+- Verify port 3001 is free: lsof -i :3001 && exit 1 || echo 'Port available'
+
+After tests:
+- Kill the server you started
+- Verify: pgrep -f 'node.*server.js' || echo 'Cleanup verified'"
+```
+
+### Why This Matters
+
+- Stale processes serve requests with wrong config
+- Port conflicts cause silent failures
+- Process accumulation slows system
+- Confusing test results (hitting wrong server)
+```
+
+**Trade-off analysis:**
+- Adds boilerplate to prompts
+- But prevents very confusing debugging
+- Worth it for E2E test subagents
+
+---
+
+### 3. subagent-driven-development: Add Lean Context Option
+
+**Modify Step 2: Execute Task with Subagent**
+
+**Before:**
+```
+Read that task carefully from [plan-file].
+```
+
+**After:**
+```
+## Context Approaches
+
+**Full Plan (default):**
+Use when tasks are complex or have dependencies:
+```
+Read Task N from [plan-file] carefully.
+```
+
+**Lean Context (for independent tasks):**
+Use when task is standalone and pattern-based:
+```
+You are implementing: [1-2 sentence task description]
+
+File to modify: [exact path]
+Pattern to follow: [reference to existing function/test]
+What to implement: [specific requirement]
+Verification: [exact command to run]
+
+[Do NOT include full plan file]
+```
+
+**Use lean context when:**
+- Task follows existing pattern (add similar test, implement similar feature)
+- Task is self-contained (doesn't need context from other tasks)
+- Pattern reference is sufficient (e.g., "follow TestE2E_FeatureOptionValidation")
+
+**Use full plan when:**
+- Task has dependencies on other tasks
+- Requires understanding of overall architecture
+- Complex logic that needs context
+```
+
+**Example:**
+```
+Lean context prompt:
+
+"You are adding a test for privileged mode in devcontainer features.
+
+File: pkg/runner/e2e_test.go
+Pattern: Follow TestE2E_FeatureOptionValidation (at end of file)
+Test: Feature with `"privileged": true` in metadata results in `--privileged` flag
+Verify: go test -v ./pkg/runner -run TestE2E_FeaturePrivilegedMode -timeout 5m
+
+Report: Implementation, test results, any issues."
+```
+
+**Why this works:**
+Reduces token usage, increases focus, faster completion when appropriate.
+
+---
+
+### 4. subagent-driven-development: Add Self-Reflection Step
+
+**Modify Step 2: Execute Task with Subagent**
+
+**Add to prompt template:**
+
+```
+When done, BEFORE reporting back:
+
+Take a step back and review your work with fresh eyes.
+
+Ask yourself:
+- Does this actually solve the task as specified?
+- Are there edge cases I didn't consider?
+- Did I follow the pattern correctly?
+- If tests are failing, what's the ROOT CAUSE (implementation bug vs test bug)?
+- What could be better about this implementation?
+
+If you identify issues during this reflection, fix them now.
+
+Then report:
+- What you implemented
+- Self-reflection findings (if any)
+- Test results
+- Files changed
+```
+
+**Why this works:**
+Catches bugs implementer can find themselves before handoff. Documented case: identified entrypoint bug through self-reflection.
+
+**Trade-off:**
+Adds ~30 seconds per task, but catches issues before review.
+
+---
+
+### 5. requesting-code-review: Add Explicit File Reading
+
+**Modify the code-reviewer template:**
+
+**Add at the beginning:**
+
+```markdown
+## Files to Review
+
+BEFORE analyzing, read these files:
+
+1. [List specific files that changed in the diff]
+2. [Files referenced by changes but not modified]
+
+Use Read tool to load each file.
+
+If you cannot find a file:
+- Check exact path from diff
+- Try alternate locations
+- Report: "Cannot locate [path] - please verify file exists"
+
+DO NOT proceed with review until you've read the actual code.
+```
+
+**Why this works:**
+Explicit instruction prevents "file not found" issues.
+
+---
+
+### 6. testing-anti-patterns: Add Mock-Interface Drift Anti-Pattern
+
+**Add new Anti-Pattern 6:**
+
+```markdown
+## Anti-Pattern 6: Mocks Derived from Implementation
+
+**The violation:**
+```typescript
+// Code (BUGGY) calls cleanup()
+await adapter.cleanup();
+
+// Mock (MATCHES BUG) has cleanup()
+const mock = {
+  cleanup: vi.fn().mockResolvedValue(undefined)
+};
+
+// Interface (CORRECT) defines close()
+interface PlatformAdapter {
+  close(): Promise<void>;
+}
+```
+
+**Why this is wrong:**
+- Mock encodes the bug into the test
+- TypeScript can't catch inline mocks with wrong method names
+- Test passes because both code and mock are wrong
+- Runtime crashes when real object is used
+
+**The fix:**
+```typescript
+// ✅ GOOD: Derive mock from interface
+
+// Step 1: Open interface definition (PlatformAdapter)
+// Step 2: List methods defined there (close, initialize, etc.)
+// Step 3: Mock EXACTLY those methods
+
+const mock = {
+  initialize: vi.fn().mockResolvedValue(undefined),
+  close: vi.fn().mockResolvedValue(undefined),  // From interface!
+};
+
+// Now test FAILS because code calls cleanup() which doesn't exist
+// That failure reveals the bug BEFORE runtime
+```
+
+### Gate Function
+
+```
+BEFORE writing any mock:
+
+  1. STOP - Do NOT look at the code under test yet
+  2. FIND: The interface/type definition for the dependency
+  3. READ: The interface file
+  4. LIST: Methods defined in the interface
+  5. MOCK: ONLY those methods with EXACTLY those names
+  6. DO NOT: Look at what your code calls
+
+  IF your test fails because code calls something not in mock:
+    ✅ GOOD - The test found a bug in your code
+    Fix the code to call the correct interface method
+    NOT the mock
+
+  Red flags:
+    - "I'll mock what the code calls"
+    - Copying method names from implementation
+    - Mock written without reading interface
+    - "The test is failing so I'll add this method to the mock"
+```
+
+**Detection:**
+
+When you see runtime error "X is not a function" and tests pass:
+1. Check if X is mocked
+2. Compare mock methods to interface methods
+3. Look for method name mismatches
+```
+
+**Why this works:**
+Directly addresses the failure pattern from feedback.
+
+---
+
+### 7. subagent-driven-development: Require Skills Reading for Test Subagents
+
+**Add to prompt template when task involves testing:**
+
+```markdown
+BEFORE writing any tests:
+
+1. Read testing-anti-patterns skill:
+   Use Skill tool: superpowers:testing-anti-patterns
+
+2. Apply gate functions from that skill when:
+   - Writing mocks
+   - Adding methods to production classes
+   - Mocking dependencies
+
+This is NOT optional. Tests that violate anti-patterns will be rejected in review.
+```
+
+**Why this works:**
+Ensures skills are actually used, not just exist.
+
+**Trade-off:**
+Adds time to each task, but prevents entire classes of bugs.
+
+---
+
+### 8. subagent-driven-development: Allow Implementer to Fix Self-Identified Issues
+
+**Modify Step 2:**
+
+**Current:**
+```
+Subagent reports back with summary of work.
+```
+
+**Proposed:**
+```
+Subagent performs self-reflection, then:
+
+IF self-reflection identifies fixable issues:
+  1. Fix the issues
+  2. Re-run verification
+  3. Report: "Initial implementation + self-reflection fix"
+
+ELSE:
+  Report: "Implementation complete"
+
+Include in report:
+- Self-reflection findings
+- Whether fixes were applied
+- Final verification results
+```
+
+**Why this works:**
+Reduces latency when implementer already knows the fix. Documented case: would have saved one round-trip for entrypoint bug.
+
+**Trade-off:**
+Slightly more complex prompt, but faster end-to-end.
+
+---
+
+## Implementation Plan
+
+### Phase 1: High-Impact, Low-Risk (Do First)
+
+1. **verification-before-completion: Configuration change verification**
+   - Clear addition, doesn't change existing content
+   - Addresses high-impact problem (false confidence in tests)
+   - File: `skills/verification-before-completion/SKILL.md`
+
+2. **testing-anti-patterns: Mock-interface drift**
+   - Adds new anti-pattern, doesn't modify existing
+   - Addresses high-impact problem (runtime crashes)
+   - File: `skills/testing-anti-patterns/SKILL.md`
+
+3. **requesting-code-review: Explicit file reading**
+   - Simple addition to template
+   - Fixes concrete problem (reviewers can't find files)
+   - File: `skills/requesting-code-review/SKILL.md`
+
+### Phase 2: Moderate Changes (Test Carefully)
+
+4. **subagent-driven-development: Process hygiene**
+   - Adds new section, doesn't change workflow
+   - Addresses medium-high impact (test reliability)
+   - File: `skills/subagent-driven-development/SKILL.md`
+
+5. **subagent-driven-development: Self-reflection**
+   - Changes prompt template (higher risk)
+   - But documented to catch bugs
+   - File: `skills/subagent-driven-development/SKILL.md`
+
+6. **subagent-driven-development: Skills reading requirement**
+   - Adds prompt overhead
+   - But ensures skills are actually used
+   - File: `skills/subagent-driven-development/SKILL.md`
+
+### Phase 3: Optimization (Validate First)
+
+7. **subagent-driven-development: Lean context option**
+   - Adds complexity (two approaches)
+   - Needs validation that it doesn't cause confusion
+   - File: `skills/subagent-driven-development/SKILL.md`
+
+8. **subagent-driven-development: Allow implementer to fix**
+   - Changes workflow (higher risk)
+   - Optimization, not bug fix
+   - File: `skills/subagent-driven-development/SKILL.md`
+
+---
+
+## Open Questions
+
+1. **Lean context approach:**
+   - Should we make it the default for pattern-based tasks?
+   - How do we decide which approach to use?
+   - Risk of being too lean and missing important context?
+
+2. **Self-reflection:**
+   - Will this slow down simple tasks significantly?
+   - Should it only apply to complex tasks?
+   - How do we prevent "reflection fatigue" where it becomes rote?
+
+3. **Process hygiene:**
+   - Should this be in subagent-driven-development or a separate skill?
+   - Does it apply to other workflows beyond E2E tests?
+   - How do we handle cases where process SHOULD persist (dev servers)?
+
+4. **Skills reading enforcement:**
+   - Should we require ALL subagents to read relevant skills?
+   - How do we keep prompts from becoming too long?
+   - Risk of over-documenting and losing focus?
+
+---
+
+## Success Metrics
+
+How do we know these improvements work?
+
+1. **Configuration verification:**
+   - Zero instances of "test passed but wrong config was used"
+   - Jesse doesn't say "that's not actually testing what you think"
+
+2. **Process hygiene:**
+   - Zero instances of "test hit wrong server"
+   - No port conflict errors during E2E test runs
+
+3. **Mock-interface drift:**
+   - Zero instances of "tests pass but runtime crashes on missing method"
+   - No method name mismatches between mocks and interfaces
+
+4. **Self-reflection:**
+   - Measurable: Do implementer reports include self-reflection findings?
+   - Qualitative: Do fewer bugs make it to code review?
+
+5. **Skills reading:**
+   - Subagent reports reference skill gate functions
+   - Fewer anti-pattern violations in code review
+
+---
+
+## Risks and Mitigations
+
+### Risk: Prompt Bloat
+**Problem:** Adding all these requirements makes prompts overwhelming
+**Mitigation:**
+- Phase implementation (don't add everything at once)
+- Make some additions conditional (E2E hygiene only for E2E tests)
+- Consider templates for different task types
+
+### Risk: Analysis Paralysis
+**Problem:** Too much reflection/verification slows execution
+**Mitigation:**
+- Keep gate functions quick (seconds, not minutes)
+- Make lean context opt-in initially
+- Monitor task completion times
+
+### Risk: False Sense of Security
+**Problem:** Following checklist doesn't guarantee correctness
+**Mitigation:**
+- Emphasize gate functions are minimums, not maximums
+- Keep "use judgment" language in skills
+- Document that skills catch common failures, not all failures
+
+### Risk: Skill Divergence
+**Problem:** Different skills give conflicting advice
+**Mitigation:**
+- Review changes across all skills for consistency
+- Document how skills interact (Integration sections)
+- Test with real scenarios before deployment
+
+---
+
+## Recommendation
+
+**Proceed with Phase 1 immediately:**
+- verification-before-completion: Configuration change verification
+- testing-anti-patterns: Mock-interface drift
+- requesting-code-review: Explicit file reading
+
+**Test Phase 2 with Jesse before finalizing:**
+- Get feedback on self-reflection impact
+- Validate process hygiene approach
+- Confirm skills reading requirement is worth overhead
+
+**Hold Phase 3 pending validation:**
+- Lean context needs real-world testing
+- Implementer-fix workflow change needs careful evaluation
+
+These changes address real problems documented by users while minimizing risk of making skills worse.
--- a/docs/testing.md
+++ b/docs/testing.md
@@ -0,0 +1,303 @@
+# Testing Superpowers Skills
+
+This document describes how to test Superpowers skills, particularly the integration tests for complex skills like `subagent-driven-development`.
+
+## Overview
+
+Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.
+
+## Test Structure
+
+```
+tests/
+├── claude-code/
+│   ├── test-helpers.sh                    # Shared test utilities
+│   ├── test-subagent-driven-development-integration.sh
+│   ├── analyze-token-usage.py             # Token analysis tool
+│   └── run-skill-tests.sh                 # Test runner (if exists)
+```
+
+## Running Tests
+
+### Integration Tests
+
+Integration tests execute real Claude Code sessions with actual skills:
+
+```bash
+# Run the subagent-driven-development integration test
+cd tests/claude-code
+./test-subagent-driven-development-integration.sh
+```
+
+**Note:** Integration tests can take 10-30 minutes as they execute real implementation plans with multiple subagents.
+
+### Requirements
+
+- Must run from the **superpowers plugin directory** (not from temp directories)
+- Claude Code must be installed and available as `claude` command
+- Local dev marketplace must be enabled: `"superpowers@superpowers-dev": true` in `~/.claude/settings.json`
+
+## Integration Test: subagent-driven-development
+
+### What It Tests
+
+The integration test verifies the `subagent-driven-development` skill correctly:
+
+1. **Plan Loading**: Reads the plan once at the beginning
+2. **Full Task Text**: Provides complete task descriptions to subagents (doesn't make them read files)
+3. **Self-Review**: Ensures subagents perform self-review before reporting
+4. **Review Order**: Runs spec compliance review before code quality review
+5. **Review Loops**: Uses review loops when issues are found
+6. **Independent Verification**: Spec reviewer reads code independently, doesn't trust implementer reports
+
+### How It Works
+
+1. **Setup**: Creates a temporary Node.js project with a minimal implementation plan
+2. **Execution**: Runs Claude Code in headless mode with the skill
+3. **Verification**: Parses the session transcript (`.jsonl` file) to verify:
+   - Skill tool was invoked
+   - Subagents were dispatched (Task tool)
+   - TodoWrite was used for tracking
+   - Implementation files were created
+   - Tests pass
+   - Git commits show proper workflow
+4. **Token Analysis**: Shows token usage breakdown by subagent
+
+### Test Output
+
+```
+========================================
+ Integration Test: subagent-driven-development
+========================================
+
+Test project: /tmp/tmp.xyz123
+
+=== Verification Tests ===
+
+Test 1: Skill tool invoked...
+  [PASS] subagent-driven-development skill was invoked
+
+Test 2: Subagents dispatched...
+  [PASS] 7 subagents dispatched
+
+Test 3: Task tracking...
+  [PASS] TodoWrite used 5 time(s)
+
+Test 6: Implementation verification...
+  [PASS] src/math.js created
+  [PASS] add function exists
+  [PASS] multiply function exists
+  [PASS] test/math.test.js created
+  [PASS] Tests pass
+
+Test 7: Git commit history...
+  [PASS] Multiple commits created (3 total)
+
+Test 8: No extra features added...
+  [PASS] No extra features added
+
+=========================================
+ Token Usage Analysis
+=========================================
+
+Usage Breakdown:
+----------------------------------------------------------------------------------------------------
+Agent           Description                          Msgs      Input     Output      Cache     Cost
+----------------------------------------------------------------------------------------------------
+main            Main session (coordinator)             34         27      3,996  1,213,703 $   4.09
+3380c209        implementing Task 1: Create Add Function     1          2        787     24,989 $   0.09
+34b00fde        implementing Task 2: Create Multiply Function     1          4        644     25,114 $   0.09
+3801a732        reviewing whether an implementation matches...   1          5        703     25,742 $   0.09
+4c142934        doing a final code review...                    1          6        854     25,319 $   0.09
+5f017a42        a code reviewer. Review Task 2...               1          6        504     22,949 $   0.08
+a6b7fbe4        a code reviewer. Review Task 1...               1          6        515     22,534 $   0.08
+f15837c0        reviewing whether an implementation matches...   1          6        416     22,485 $   0.07
+----------------------------------------------------------------------------------------------------
+
+TOTALS:
+  Total messages:         41
+  Input tokens:           62
+  Output tokens:          8,419
+  Cache creation tokens:  132,742
+  Cache read tokens:      1,382,835
+
+  Total input (incl cache): 1,515,639
+  Total tokens:             1,524,058
+
+  Estimated cost: $4.67
+  (at $3/$15 per M tokens for input/output)
+
+========================================
+ Test Summary
+========================================
+
+STATUS: PASSED
+```
+
+## Token Analysis Tool
+
+### Usage
+
+Analyze token usage from any Claude Code session:
+
+```bash
+python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl
+```
+
+### Finding Session Files
+
+Session transcripts are stored in `~/.claude/projects/` with the working directory path encoded:
+
+```bash
+# Example for /Users/jesse/Documents/GitHub/superpowers/superpowers
+SESSION_DIR="$HOME/.claude/projects/-Users-jesse-Documents-GitHub-superpowers-superpowers"
+
+# Find recent sessions
+ls -lt "$SESSION_DIR"/*.jsonl | head -5
+```
+
+### What It Shows
+
+- **Main session usage**: Token usage by the coordinator (you or main Claude instance)
+- **Per-subagent breakdown**: Each Task invocation with:
+  - Agent ID
+  - Description (extracted from prompt)
+  - Message count
+  - Input/output tokens
+  - Cache usage
+  - Estimated cost
+- **Totals**: Overall token usage and cost estimate
+
+### Understanding the Output
+
+- **High cache reads**: Good - means prompt caching is working
+- **High input tokens on main**: Expected - coordinator has full context
+- **Similar costs per subagent**: Expected - each gets similar task complexity
+- **Cost per task**: Typical range is $0.05-$0.15 per subagent depending on task
+
+## Troubleshooting
+
+### Skills Not Loading
+
+**Problem**: Skill not found when running headless tests
+
+**Solutions**:
+1. Ensure you're running FROM the superpowers directory: `cd /path/to/superpowers && tests/...`
+2. Check `~/.claude/settings.json` has `"superpowers@superpowers-dev": true` in `enabledPlugins`
+3. Verify skill exists in `skills/` directory
+
+### Permission Errors
+
+**Problem**: Claude blocked from writing files or accessing directories
+
+**Solutions**:
+1. Use `--permission-mode bypassPermissions` flag
+2. Use `--add-dir /path/to/temp/dir` to grant access to test directories
+3. Check file permissions on test directories
+
+### Test Timeouts
+
+**Problem**: Test takes too long and times out
+
+**Solutions**:
+1. Increase timeout: `timeout 1800 claude ...` (30 minutes)
+2. Check for infinite loops in skill logic
+3. Review subagent task complexity
+
+### Session File Not Found
+
+**Problem**: Can't find session transcript after test run
+
+**Solutions**:
+1. Check the correct project directory in `~/.claude/projects/`
+2. Use `find ~/.claude/projects -name "*.jsonl" -mmin -60` to find recent sessions
+3. Verify test actually ran (check for errors in test output)
+
+## Writing New Integration Tests
+
+### Template
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+source "$SCRIPT_DIR/test-helpers.sh"
+
+# Create test project
+TEST_PROJECT=$(create_test_project)
+trap "cleanup_test_project $TEST_PROJECT" EXIT
+
+# Set up test files...
+cd "$TEST_PROJECT"
+
+# Run Claude with skill
+PROMPT="Your test prompt here"
+cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" \
+  --allowed-tools=all \
+  --add-dir "$TEST_PROJECT" \
+  --permission-mode bypassPermissions \
+  2>&1 | tee output.txt
+
+# Find and analyze session
+WORKING_DIR_ESCAPED=$(echo "$SCRIPT_DIR/../.." | sed 's/\\//-/g' | sed 's/^-//')
+SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
+SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 | sort -r | head -1)
+
+# Verify behavior by parsing session transcript
+if grep -q '"name":"Skill".*"skill":"your-skill-name"' "$SESSION_FILE"; then
+    echo "[PASS] Skill was invoked"
+fi
+
+# Show token analysis
+python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
+```
+
+### Best Practices
+
+1. **Always cleanup**: Use trap to cleanup temp directories
+2. **Parse transcripts**: Don't grep user-facing output - parse the `.jsonl` session file
+3. **Grant permissions**: Use `--permission-mode bypassPermissions` and `--add-dir`
+4. **Run from plugin dir**: Skills only load when running from the superpowers directory
+5. **Show token usage**: Always include token analysis for cost visibility
+6. **Test real behavior**: Verify actual files created, tests passing, commits made
+
+## Session Transcript Format
+
+Session transcripts are JSONL (JSON Lines) files where each line is a JSON object representing a message or tool result.
+
+### Key Fields
+
+```json
+{
+  "type": "assistant",
+  "message": {
+    "content": [...],
+    "usage": {
+      "input_tokens": 27,
+      "output_tokens": 3996,
+      "cache_read_input_tokens": 1213703
+    }
+  }
+}
+```
+
+### Tool Results
+
+```json
+{
+  "type": "user",
+  "toolUseResult": {
+    "agentId": "3380c209",
+    "usage": {
+      "input_tokens": 2,
+      "output_tokens": 787,
+      "cache_read_input_tokens": 24989
+    },
+    "prompt": "You are implementing Task 1...",
+    "content": [{"type": "text", "text": "..."}]
+  }
+}
+```
+
+The `agentId` field links to subagent sessions, and the `usage` field contains token usage for that specific subagent invocation.
--- a/docs/windows/polyglot-hooks.md
+++ b/docs/windows/polyglot-hooks.md
@@ -1,212 +0,0 @@
-# Cross-Platform Polyglot Hooks for Claude Code
-
-Claude Code plugins need hooks that work on Windows, macOS, and Linux. This document explains the polyglot wrapper technique that makes this possible.
-
-## The Problem
-
-Claude Code runs hook commands through the system's default shell:
- **Windows**: CMD.exe
- **macOS/Linux**: bash or sh
-
-This creates several challenges:
-
-1. **Script execution**: Windows CMD can't execute `.sh` files directly - it tries to open them in a text editor
-2. **Path format**: Windows uses backslashes (`C:\path`), Unix uses forward slashes (`/path`)
-3. **Environment variables**: `$VAR` syntax doesn't work in CMD
-4. **No `bash` in PATH**: Even with Git Bash installed, `bash` isn't in the PATH when CMD runs
-
-## The Solution: Polyglot `.cmd` Wrapper
-
-A polyglot script is valid syntax in multiple languages simultaneously. Our wrapper is valid in both CMD and bash:
-
-```cmd
-: << 'CMDBLOCK'
-@echo off
-"C:\Program Files\Git\bin\bash.exe" -l -c "\"$(cygpath -u \"$CLAUDE_PLUGIN_ROOT\")/hooks/session-start.sh\""
-exit /b
-CMDBLOCK
-
-# Unix shell runs from here
-"${CLAUDE_PLUGIN_ROOT}/hooks/session-start.sh"
-```
-
-### How It Works
-
-#### On Windows (CMD.exe)
-
-1. `: << 'CMDBLOCK'` - CMD sees `:` as a label (like `:label`) and ignores `<< 'CMDBLOCK'`
-2. `@echo off` - Suppresses command echoing
-3. The bash.exe command runs with:
-   - `-l` (login shell) to get proper PATH with Unix utilities
-   - `cygpath -u` converts Windows path to Unix format (`C:\foo` → `/c/foo`)
-4. `exit /b` - Exits the batch script, stopping CMD here
-5. Everything after `CMDBLOCK` is never reached by CMD
-
-#### On Unix (bash/sh)
-
-1. `: << 'CMDBLOCK'` - `:` is a no-op, `<< 'CMDBLOCK'` starts a heredoc
-2. Everything until `CMDBLOCK` is consumed by the heredoc (ignored)
-3. `# Unix shell runs from here` - Comment
-4. The script runs directly with the Unix path
-
-## File Structure
-
-```
-hooks/
-├── hooks.json           # Points to the .cmd wrapper
-├── session-start.cmd    # Polyglot wrapper (cross-platform entry point)
-└── session-start.sh     # Actual hook logic (bash script)
-```
-
-### hooks.json
-
-```json
-{
-  "hooks": {
-    "SessionStart": [
-      {
-        "matcher": "startup|resume|clear|compact",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/session-start.cmd\""
-          }
-        ]
-      }
-    ]
-  }
-}
-```
-
-Note: The path must be quoted because `${CLAUDE_PLUGIN_ROOT}` may contain spaces on Windows (e.g., `C:\Program Files\...`).
-
-## Requirements
-
-### Windows
- **Git for Windows** must be installed (provides `bash.exe` and `cygpath`)
- Default installation path: `C:\Program Files\Git\bin\bash.exe`
- If Git is installed elsewhere, the wrapper needs modification
-
-### Unix (macOS/Linux)
- Standard bash or sh shell
- The `.cmd` file must have execute permission (`chmod +x`)
-
-## Writing Cross-Platform Hook Scripts
-
-Your actual hook logic goes in the `.sh` file. To ensure it works on Windows (via Git Bash):
-
-### Do:
- Use pure bash builtins when possible
- Use `$(command)` instead of backticks
- Quote all variable expansions: `"$VAR"`
- Use `printf` or here-docs for output
-
-### Avoid:
- External commands that may not be in PATH (sed, awk, grep)
- If you must use them, they're available in Git Bash but ensure PATH is set up (use `bash -l`)
-
-### Example: JSON Escaping Without sed/awk
-
-Instead of:
-```bash
-escaped=$(echo "$content" | sed 's/\\/\\\\/g' | sed 's/"/\\"/g' | awk '{printf "%s\\n", $0}')
-```
-
-Use pure bash:
-```bash
-escape_for_json() {
-    local input="$1"
-    local output=""
-    local i char
-    for (( i=0; i<${#input}; i++ )); do
-        char="${input:$i:1}"
-        case "$char" in
-            $'\\') output+='\\' ;;
-            '"') output+='\"' ;;
-            $'\n') output+='\n' ;;
-            $'\r') output+='\r' ;;
-            $'\t') output+='\t' ;;
-            *) output+="$char" ;;
-        esac
-    done
-    printf '%s' "$output"
-}
-```
-
-## Reusable Wrapper Pattern
-
-For plugins with multiple hooks, you can create a generic wrapper that takes the script name as an argument:
-
-### run-hook.cmd
-```cmd
-: << 'CMDBLOCK'
-@echo off
-set "SCRIPT_DIR=%~dp0"
-set "SCRIPT_NAME=%~1"
-"C:\Program Files\Git\bin\bash.exe" -l -c "cd \"$(cygpath -u \"%SCRIPT_DIR%\")\" && \"./%SCRIPT_NAME%\""
-exit /b
-CMDBLOCK
-
-# Unix shell runs from here
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
-SCRIPT_NAME="$1"
-shift
-"${SCRIPT_DIR}/${SCRIPT_NAME}" "$@"
-```
-
-### hooks.json using the reusable wrapper
-```json
-{
-  "hooks": {
-    "SessionStart": [
-      {
-        "matcher": "startup",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd\" session-start.sh"
-          }
-        ]
-      }
-    ],
-    "PreToolUse": [
-      {
-        "matcher": "Bash",
-        "hooks": [
-          {
-            "type": "command",
-            "command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd\" validate-bash.sh"
-          }
-        ]
-      }
-    ]
-  }
-}
-```
-
-## Troubleshooting
-
-### "bash is not recognized"
-CMD can't find bash. The wrapper uses the full path `C:\Program Files\Git\bin\bash.exe`. If Git is installed elsewhere, update the path.
-
-### "cygpath: command not found" or "dirname: command not found"
-Bash isn't running as a login shell. Ensure `-l` flag is used.
-
-### Path has weird `\/` in it
-`${CLAUDE_PLUGIN_ROOT}` expanded to a Windows path ending with backslash, then `/hooks/...` was appended. Use `cygpath` to convert the entire path.
-
-### Script opens in text editor instead of running
-The hooks.json is pointing directly to the `.sh` file. Point to the `.cmd` wrapper instead.
-
-### Works in terminal but not as hook
-Claude Code may run hooks differently. Test by simulating the hook environment:
-```powershell
-$env:CLAUDE_PLUGIN_ROOT = "C:\path\to\plugin"
-cmd /c "C:\path\to\plugin\hooks\session-start.cmd"
-```
-
-## Related Issues
-
- [anthropics/claude-code#9758](https://github.com/anthropics/claude-code/issues/9758) - .sh scripts open in editor on Windows
- [anthropics/claude-code#3417](https://github.com/anthropics/claude-code/issues/3417) - Hooks don't work on Windows
- [anthropics/claude-code#6023](https://github.com/anthropics/claude-code/issues/6023) - CLAUDE_PROJECT_DIR not found
--- a/hooks/hooks.json
+++ b/hooks/hooks.json
@@ -6,7 +6,7 @@
        "hooks": [
          {
            "type": "command",
-            "command": "\"${CLAUDE_PLUGIN_ROOT}/hooks/run-hook.cmd\" session-start.sh"
+            "command": "${CLAUDE_PLUGIN_ROOT}/hooks/session-start.sh"
          }
        ]
      }
--- a/hooks/run-hook.cmd
+++ b/hooks/run-hook.cmd
@@ -1,19 +0,0 @@
-: << 'CMDBLOCK'
-@echo off
-REM Polyglot wrapper: runs .sh scripts cross-platform
-REM Usage: run-hook.cmd <script-name> [args...]
-REM The script should be in the same directory as this wrapper
-
-if "%~1"=="" (
-    echo run-hook.cmd: missing script name >&2
-    exit /b 1
-)
-"C:\Program Files\Git\bin\bash.exe" -l "%~dp0%~1" %2 %3 %4 %5 %6 %7 %8 %9
-exit /b
-CMDBLOCK
-
-# Unix shell runs from here
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
-SCRIPT_NAME="$1"
-shift
-"${SCRIPT_DIR}/${SCRIPT_NAME}" "$@"
--- a/hooks/session-start.sh
+++ b/hooks/session-start.sh
@@ -17,27 +17,9 @@ fi
 # Read using-superpowers content
 using_superpowers_content=$(cat "${PLUGIN_ROOT}/skills/using-superpowers/SKILL.md" 2>&1 || echo "Error reading using-superpowers skill")

-# Escape outputs for JSON using pure bash
-escape_for_json() {
-    local input="$1"
-    local output=""
-    local i char
-    for (( i=0; i<${#input}; i++ )); do
-        char="${input:$i:1}"
-        case "$char" in
-            $'\\') output+='\\' ;;
-            '"') output+='\"' ;;
-            $'\n') output+='\n' ;;
-            $'\r') output+='\r' ;;
-            $'\t') output+='\t' ;;
-            *) output+="$char" ;;
-        esac
-    done
-    printf '%s' "$output"
-}
-
-using_superpowers_escaped=$(escape_for_json "$using_superpowers_content")
-warning_escaped=$(escape_for_json "$warning_message")
+# Escape outputs for JSON
+using_superpowers_escaped=$(echo "$using_superpowers_content" | sed 's/\\/\\\\/g' | sed 's/"/\\"/g' | awk '{printf "%s\\n", $0}')
+warning_escaped=$(echo "$warning_message" | sed 's/\\/\\\\/g' | sed 's/"/\\"/g' | awk '{printf "%s\\n", $0}')

 # Output context injection as JSON
 cat <<EOF
--- a/skills/subagent-driven-development/SKILL.md
+++ b/skills/subagent-driven-development/SKILL.md
@@ -31,41 +31,192 @@ Execute plan by dispatching fresh subagent per task, with code review after each

 ### 1. Load Plan

-Read plan file, create TodoWrite with all tasks.
+1. Read plan file once
+2. Extract all tasks (full text of each)
+3. For each task, note scene-setting context:
+   - Where it fits in overall plan
+   - Dependencies on previous tasks
+   - Architectural context
+   - Relevant patterns or existing code to follow
+4. Create TodoWrite with all tasks

 ### 2. Execute Task with Subagent

 For each task:

-**Dispatch fresh subagent:**
+**1. Prepare task context:**
+- Get the full text of Task N (already extracted in Step 1)
+- Get the scene-setting context (already noted in Step 1)
+
+**2. Dispatch fresh subagent with full task text:**
 ```
 Task tool (general-purpose):
  description: "Implement Task N: [task name]"
  prompt: |
-    You are implementing Task N from [plan-file].
+    You are implementing Task N: [task name]

-    Read that task carefully. Your job is to:
+    ## Task Description
+
+    [FULL TEXT of task from plan - paste it here, don't make subagent read file]
+
+    ## Context
+
+    [Scene-setting: where this fits, dependencies, architectural context]
+
+    ## Before You Begin
+
+    If you have questions about:
+    - The requirements or acceptance criteria
+    - The approach or implementation strategy
+    - Dependencies or assumptions
+    - Anything unclear in the task description
+
+    **Ask them now.** Raise any concerns before starting work.
+
+    ## Your Job
+
+    Once you're clear on requirements:
    1. Implement exactly what the task specifies
    2. Write tests (following TDD if task says to)
    3. Verify implementation works
    4. Commit your work
-    5. Report back
+    5. Self-review (see below)
+    6. Report back

    Work from: [directory]

-    Report: What you implemented, what you tested, test results, files changed, any issues
+    **While you work:** If you encounter something unexpected or unclear, **ask questions**.
+    It's always OK to pause and clarify. Don't guess or make assumptions.
+
+    ## Before Reporting Back: Self-Review
+
+    Review your work with fresh eyes. Ask yourself:
+
+    **Completeness:**
+    - Did I fully implement everything in the spec?
+    - Did I miss any requirements?
+    - Are there edge cases I didn't handle?
+
+    **Quality:**
+    - Is this my best work?
+    - Are names clear and accurate (match what things do, not how they work)?
+    - Is the code clean and maintainable?
+
+    **Discipline:**
+    - Did I avoid overbuilding (YAGNI)?
+    - Did I only build what was requested?
+    - Did I follow existing patterns in the codebase?
+
+    **Testing:**
+    - Do tests actually verify behavior (not just mock behavior)?
+    - Did I follow TDD if required?
+    - Are tests comprehensive?
+
+    If you find issues during self-review, fix them now before reporting.
+
+    ## Report Format
+
+    When done, report:
+    - What you implemented
+    - What you tested and test results
+    - Files changed
+    - Self-review findings (if any)
+    - Any issues or concerns
 ```

-**Subagent reports back** with summary of work.
+**3. Handle subagent response:**

-### 3. Review Subagent's Work
+If subagent asks questions:
+- Answer clearly
+- Provide additional context if needed
+- Either continue conversation or re-dispatch with answers
+
+If subagent proceeds with implementation:
+- Review their report
+- Proceed to spec compliance review (Step 3)
+
+### 3. Spec Compliance Review
+
+**Purpose:** Verify implementer built what was requested (nothing more, nothing less)
+
+**Dispatch spec compliance reviewer:**
+```
+Task tool (general-purpose):
+  description: "Review spec compliance for Task N"
+  prompt: |
+    You are reviewing whether an implementation matches its specification.
+
+    ## What Was Requested
+
+    [FULL TEXT of task requirements]
+
+    ## What Implementer Claims They Built
+
+    [From implementer's report]
+
+    ## CRITICAL: Do Not Trust the Report
+
+    The implementer finished suspiciously quickly. Their report may be incomplete,
+    inaccurate, or optimistic. You MUST verify everything independently.
+
+    **DO NOT:**
+    - Take their word for what they implemented
+    - Trust their claims about completeness
+    - Accept their interpretation of requirements
+
+    **DO:**
+    - Read the actual code they wrote
+    - Compare actual implementation to requirements line by line
+    - Check for missing pieces they claimed to implement
+    - Look for extra features they didn't mention
+
+    ## Your Job
+
+    Read the implementation code and verify:
+
+    **Missing requirements:**
+    - Did they implement everything that was requested?
+    - Are there requirements they skipped or missed?
+    - Did they claim something works but didn't actually implement it?
+
+    **Extra/unneeded work:**
+    - Did they build things that weren't requested?
+    - Did they over-engineer or add unnecessary features?
+    - Did they add "nice to haves" that weren't in spec?
+
+    **Misunderstandings:**
+    - Did they interpret requirements differently than intended?
+    - Did they solve the wrong problem?
+    - Did they implement the right feature but wrong way?
+
+    **Verify by reading code, not by trusting report.**
+
+    Report:
+    - ✅ Spec compliant (if everything matches after code inspection)
+    - ❌ Issues found: [list specifically what's missing or extra, with file:line references]
+```
+
+**Review loop (must complete before Step 4):**
+1. Spec reviewer reports findings
+2. If issues found:
+   - Original implementer fixes issues
+   - Spec reviewer reviews again
+3. Repeat until spec compliant
+
+**Do NOT proceed to code quality review until spec compliance is ✅**
+
+### 4. Code Quality Review
+
+**Purpose:** Verify implementation is well-built (clean, tested, maintainable)
+
+**Only run after spec compliance review is complete.**

 **Dispatch code-reviewer subagent:**
 ```
 Task tool (superpowers:code-reviewer):
  Use template at requesting-code-review/code-reviewer.md

-  WHAT_WAS_IMPLEMENTED: [from subagent's report]
+  WHAT_WAS_IMPLEMENTED: [from implementer's report]
  PLAN_OR_REQUIREMENTS: Task N from [plan-file]
  BASE_SHA: [commit before task]
  HEAD_SHA: [current commit]
@@ -74,23 +225,18 @@ Task tool (superpowers:code-reviewer):

 **Code reviewer returns:** Strengths, Issues (Critical/Important/Minor), Assessment

-### 4. Apply Review Feedback
-
-**If issues found:**
- Fix Critical issues immediately
- Fix Important issues before next task
- Note Minor issues
-
-**Dispatch follow-up subagent if needed:**
-```
-"Fix issues from code review: [list issues]"
-```
+**Review loop:**
+1. Code reviewer reports findings
+2. If issues found:
+   - Original implementer fixes issues
+   - Code reviewer reviews again
+3. Repeat until code quality approved

 ### 5. Mark Complete, Next Task

 - Mark task as completed in TodoWrite
 - Move to next task
- Repeat steps 2-5
+- Repeat steps 2-5 for each remaining task

 ### 6. Final Review

@@ -111,30 +257,67 @@ After final review passes:
 ```
 You: I'm using Subagent-Driven Development to execute this plan.

-[Load plan, create TodoWrite]
+[Read plan file once: docs/plans/feature-plan.md]
+[Extract all 5 tasks with full text and context]
+[Create TodoWrite with all tasks]

 Task 1: Hook installation script

-[Dispatch implementation subagent]
-Subagent: Implemented install-hook with tests, 5/5 passing
+[Get Task 1 text and context (already extracted)]
+[Dispatch implementation subagent with full task text + context]

-[Get git SHAs, dispatch code-reviewer]
-Reviewer: Strengths: Good test coverage. Issues: None. Ready.
+Implementer: "Before I begin - should the hook be installed at user or system level?"
+
+You: "User level (~/.config/superpowers/hooks/)"
+
+Implementer: "Got it. Implementing now..."
+[Later] Implementer:
+  - Implemented install-hook command
+  - Added tests, 5/5 passing
+  - Self-review: Found I missed --force flag, added it
+  - Committed
+
+[Dispatch spec compliance reviewer]
+Spec reviewer: ✅ Spec compliant - all requirements met, nothing extra
+
+[Get git SHAs, dispatch code quality reviewer]
+Code reviewer: Strengths: Good test coverage, clean. Issues: None. Approved.

 [Mark Task 1 complete]

 Task 2: Recovery modes

-[Dispatch implementation subagent]
-Subagent: Added verify/repair, 8/8 tests passing
+[Get Task 2 text and context (already extracted)]
+[Dispatch implementation subagent with full task text + context]

-[Dispatch code-reviewer]
-Reviewer: Strengths: Solid. Issues (Important): Missing progress reporting
+Implementer: [No questions, proceeds]
+Implementer:
+  - Added verify/repair modes
+  - 8/8 tests passing
+  - Self-review: All good
+  - Committed

-[Dispatch fix subagent]
-Fix subagent: Added progress every 100 conversations
+[Dispatch spec compliance reviewer]
+Spec reviewer: ❌ Issues:
+  - Missing: Progress reporting (spec says "report every 100 items")
+  - Extra: Added --json flag (not requested)

-[Verify fix, mark Task 2 complete]
+[Implementer fixes issues]
+Implementer: Removed --json flag, added progress reporting
+
+[Spec reviewer reviews again]
+Spec reviewer: ✅ Spec compliant now
+
+[Dispatch code quality reviewer]
+Code reviewer: Strengths: Solid. Issues (Important): Magic number (100)
+
+[Implementer fixes]
+Implementer: Extracted PROGRESS_INTERVAL constant
+
+[Code reviewer reviews again]
+Code reviewer: ✅ Approved
+
+[Mark Task 2 complete]

 ...

@@ -151,23 +334,57 @@ Done!
 - Subagents follow TDD naturally
 - Fresh context per task (no confusion)
 - Parallel-safe (subagents don't interfere)
+- Subagent can ask questions (before AND during work)

 **vs. Executing Plans:**
 - Same session (no handoff)
 - Continuous progress (no waiting)
 - Review checkpoints automatic

+**Efficiency gains:**
+- No file reading overhead (controller provides full text)
+- Controller curates exactly what context is needed
+- Subagent gets complete information upfront
+- Questions surfaced before work begins (not after)
+
+**Quality gates:**
+- Self-review catches issues before handoff
+- Two-stage review: spec compliance, then code quality
+- Review loops ensure fixes actually work
+- Spec compliance prevents over/under-building
+- Code quality ensures implementation is well-built
+
 **Cost:**
- More subagent invocations
+- More subagent invocations (implementer + 2 reviewers per task)
+- Controller does more prep work (extracting all tasks upfront)
+- Review loops add iterations
 - But catches issues early (cheaper than debugging later)

 ## Red Flags

 **Never:**
- Skip code review between tasks
- Proceed with unfixed Critical issues
+- Skip reviews (spec compliance OR code quality)
+- Proceed with unfixed issues
 - Dispatch multiple implementation subagents in parallel (conflicts)
- Implement without reading plan task
+- Make subagent read plan file (provide full text instead)
+- Skip scene-setting context (subagent needs to understand where task fits)
+- Ignore subagent questions (answer before letting them proceed)
+- Accept "close enough" on spec compliance (spec reviewer found issues = not done)
+- Skip review loops (reviewer found issues = implementer fixes = review again)
+- Let implementer self-review replace actual review (both are needed)
+- **Start code quality review before spec compliance is ✅** (wrong order)
+- Move to next task while either review has open issues
+
+**If subagent asks questions:**
+- Answer clearly and completely
+- Provide additional context if needed
+- Don't rush them into implementation
+
+**If reviewer finds issues:**
+- Implementer (same subagent) fixes them
+- Reviewer reviews again
+- Repeat until approved
+- Don't skip the re-review

 **If subagent fails task:**
 - Dispatch fix subagent with specific instructions
--- a/tests/claude-code/README.md
+++ b/tests/claude-code/README.md
@@ -0,0 +1,158 @@
+# Claude Code Skills Tests
+
+Automated tests for superpowers skills using Claude Code CLI.
+
+## Overview
+
+This test suite verifies that skills are loaded correctly and Claude follows them as expected. Tests invoke Claude Code in headless mode (`claude -p`) and verify the behavior.
+
+## Requirements
+
+- Claude Code CLI installed and in PATH (`claude --version` should work)
+- Local superpowers plugin installed (see main README for installation)
+
+## Running Tests
+
+### Run all fast tests (recommended):
+```bash
+./run-skill-tests.sh
+```
+
+### Run integration tests (slow, 10-30 minutes):
+```bash
+./run-skill-tests.sh --integration
+```
+
+### Run specific test:
+```bash
+./run-skill-tests.sh --test test-subagent-driven-development.sh
+```
+
+### Run with verbose output:
+```bash
+./run-skill-tests.sh --verbose
+```
+
+### Set custom timeout:
+```bash
+./run-skill-tests.sh --timeout 1800  # 30 minutes for integration tests
+```
+
+## Test Structure
+
+### test-helpers.sh
+Common functions for skills testing:
+- `run_claude "prompt" [timeout]` - Run Claude with prompt
+- `assert_contains output pattern name` - Verify pattern exists
+- `assert_not_contains output pattern name` - Verify pattern absent
+- `assert_count output pattern count name` - Verify exact count
+- `assert_order output pattern_a pattern_b name` - Verify order
+- `create_test_project` - Create temp test directory
+- `create_test_plan project_dir` - Create sample plan file
+
+### Test Files
+
+Each test file:
+1. Sources `test-helpers.sh`
+2. Runs Claude Code with specific prompts
+3. Verifies expected behavior using assertions
+4. Returns 0 on success, non-zero on failure
+
+## Example Test
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+source "$SCRIPT_DIR/test-helpers.sh"
+
+echo "=== Test: My Skill ==="
+
+# Ask Claude about the skill
+output=$(run_claude "What does the my-skill skill do?" 30)
+
+# Verify response
+assert_contains "$output" "expected behavior" "Skill describes behavior"
+
+echo "=== All tests passed ==="
+```
+
+## Current Tests
+
+### Fast Tests (run by default)
+
+#### test-subagent-driven-development.sh
+Tests skill content and requirements (~2 minutes):
+- Skill loading and accessibility
+- Workflow ordering (spec compliance before code quality)
+- Self-review requirements documented
+- Plan reading efficiency documented
+- Spec compliance reviewer skepticism documented
+- Review loops documented
+- Task context provision documented
+
+### Integration Tests (use --integration flag)
+
+#### test-subagent-driven-development-integration.sh
+Full workflow execution test (~10-30 minutes):
+- Creates real test project with Node.js setup
+- Creates implementation plan with 2 tasks
+- Executes plan using subagent-driven-development
+- Verifies actual behaviors:
+  - Plan read once at start (not per task)
+  - Full task text provided in subagent prompts
+  - Subagents perform self-review before reporting
+  - Spec compliance review happens before code quality
+  - Spec reviewer reads code independently
+  - Working implementation is produced
+  - Tests pass
+  - Proper git commits created
+
+**What it tests:**
+- The workflow actually works end-to-end
+- Our improvements are actually applied
+- Subagents follow the skill correctly
+- Final code is functional and tested
+
+## Adding New Tests
+
+1. Create new test file: `test-<skill-name>.sh`
+2. Source test-helpers.sh
+3. Write tests using `run_claude` and assertions
+4. Add to test list in `run-skill-tests.sh`
+5. Make executable: `chmod +x test-<skill-name>.sh`
+
+## Timeout Considerations
+
+- Default timeout: 5 minutes per test
+- Claude Code may take time to respond
+- Adjust with `--timeout` if needed
+- Tests should be focused to avoid long runs
+
+## Debugging Failed Tests
+
+With `--verbose`, you'll see full Claude output:
+```bash
+./run-skill-tests.sh --verbose --test test-subagent-driven-development.sh
+```
+
+Without verbose, only failures show output.
+
+## CI/CD Integration
+
+To run in CI:
+```bash
+# Run with explicit timeout for CI environments
+./run-skill-tests.sh --timeout 900
+
+# Exit code 0 = success, non-zero = failure
+```
+
+## Notes
+
+- Tests verify skill *instructions*, not full execution
+- Full workflow tests would be very slow
+- Focus on verifying key skill requirements
+- Tests should be deterministic
+- Avoid testing implementation details
--- a/tests/claude-code/analyze-token-usage.py
+++ b/tests/claude-code/analyze-token-usage.py
@@ -0,0 +1,168 @@
+#!/usr/bin/env python3
+"""
+Analyze token usage from Claude Code session transcripts.
+Breaks down usage by main session and individual subagents.
+"""
+
+import json
+import sys
+from pathlib import Path
+from collections import defaultdict
+
+def analyze_main_session(filepath):
+    """Analyze a session file and return token usage broken down by agent."""
+    main_usage = {
+        'input_tokens': 0,
+        'output_tokens': 0,
+        'cache_creation': 0,
+        'cache_read': 0,
+        'messages': 0
+    }
+
+    # Track usage per subagent
+    subagent_usage = defaultdict(lambda: {
+        'input_tokens': 0,
+        'output_tokens': 0,
+        'cache_creation': 0,
+        'cache_read': 0,
+        'messages': 0,
+        'description': None
+    })
+
+    with open(filepath, 'r') as f:
+        for line in f:
+            try:
+                data = json.loads(line)
+
+                # Main session assistant messages
+                if data.get('type') == 'assistant' and 'message' in data:
+                    main_usage['messages'] += 1
+                    msg_usage = data['message'].get('usage', {})
+                    main_usage['input_tokens'] += msg_usage.get('input_tokens', 0)
+                    main_usage['output_tokens'] += msg_usage.get('output_tokens', 0)
+                    main_usage['cache_creation'] += msg_usage.get('cache_creation_input_tokens', 0)
+                    main_usage['cache_read'] += msg_usage.get('cache_read_input_tokens', 0)
+
+                # Subagent tool results
+                if data.get('type') == 'user' and 'toolUseResult' in data:
+                    result = data['toolUseResult']
+                    if 'usage' in result and 'agentId' in result:
+                        agent_id = result['agentId']
+                        usage = result['usage']
+
+                        # Get description from prompt if available
+                        if subagent_usage[agent_id]['description'] is None:
+                            prompt = result.get('prompt', '')
+                            # Extract first line as description
+                            first_line = prompt.split('\n')[0] if prompt else f"agent-{agent_id}"
+                            if first_line.startswith('You are '):
+                                first_line = first_line[8:]  # Remove "You are "
+                            subagent_usage[agent_id]['description'] = first_line[:60]
+
+                        subagent_usage[agent_id]['messages'] += 1
+                        subagent_usage[agent_id]['input_tokens'] += usage.get('input_tokens', 0)
+                        subagent_usage[agent_id]['output_tokens'] += usage.get('output_tokens', 0)
+                        subagent_usage[agent_id]['cache_creation'] += usage.get('cache_creation_input_tokens', 0)
+                        subagent_usage[agent_id]['cache_read'] += usage.get('cache_read_input_tokens', 0)
+            except:
+                pass
+
+    return main_usage, dict(subagent_usage)
+
+def format_tokens(n):
+    """Format token count with thousands separators."""
+    return f"{n:,}"
+
+def calculate_cost(usage, input_cost_per_m=3.0, output_cost_per_m=15.0):
+    """Calculate estimated cost in dollars."""
+    total_input = usage['input_tokens'] + usage['cache_creation'] + usage['cache_read']
+    input_cost = total_input * input_cost_per_m / 1_000_000
+    output_cost = usage['output_tokens'] * output_cost_per_m / 1_000_000
+    return input_cost + output_cost
+
+def main():
+    if len(sys.argv) < 2:
+        print("Usage: analyze-token-usage.py <session-file.jsonl>")
+        sys.exit(1)
+
+    main_session_file = sys.argv[1]
+
+    if not Path(main_session_file).exists():
+        print(f"Error: Session file not found: {main_session_file}")
+        sys.exit(1)
+
+    # Analyze the session
+    main_usage, subagent_usage = analyze_main_session(main_session_file)
+
+    print("=" * 100)
+    print("TOKEN USAGE ANALYSIS")
+    print("=" * 100)
+    print()
+
+    # Print breakdown
+    print("Usage Breakdown:")
+    print("-" * 100)
+    print(f"{'Agent':<15} {'Description':<35} {'Msgs':>5} {'Input':>10} {'Output':>10} {'Cache':>10} {'Cost':>8}")
+    print("-" * 100)
+
+    # Main session
+    cost = calculate_cost(main_usage)
+    print(f"{'main':<15} {'Main session (coordinator)':<35} "
+          f"{main_usage['messages']:>5} "
+          f"{format_tokens(main_usage['input_tokens']):>10} "
+          f"{format_tokens(main_usage['output_tokens']):>10} "
+          f"{format_tokens(main_usage['cache_read']):>10} "
+          f"${cost:>7.2f}")
+
+    # Subagents (sorted by agent ID)
+    for agent_id in sorted(subagent_usage.keys()):
+        usage = subagent_usage[agent_id]
+        cost = calculate_cost(usage)
+        desc = usage['description'] or f"agent-{agent_id}"
+        print(f"{agent_id:<15} {desc:<35} "
+              f"{usage['messages']:>5} "
+              f"{format_tokens(usage['input_tokens']):>10} "
+              f"{format_tokens(usage['output_tokens']):>10} "
+              f"{format_tokens(usage['cache_read']):>10} "
+              f"${cost:>7.2f}")
+
+    print("-" * 100)
+
+    # Calculate totals
+    total_usage = {
+        'input_tokens': main_usage['input_tokens'],
+        'output_tokens': main_usage['output_tokens'],
+        'cache_creation': main_usage['cache_creation'],
+        'cache_read': main_usage['cache_read'],
+        'messages': main_usage['messages']
+    }
+
+    for usage in subagent_usage.values():
+        total_usage['input_tokens'] += usage['input_tokens']
+        total_usage['output_tokens'] += usage['output_tokens']
+        total_usage['cache_creation'] += usage['cache_creation']
+        total_usage['cache_read'] += usage['cache_read']
+        total_usage['messages'] += usage['messages']
+
+    total_input = total_usage['input_tokens'] + total_usage['cache_creation'] + total_usage['cache_read']
+    total_tokens = total_input + total_usage['output_tokens']
+    total_cost = calculate_cost(total_usage)
+
+    print()
+    print("TOTALS:")
+    print(f"  Total messages:         {format_tokens(total_usage['messages'])}")
+    print(f"  Input tokens:           {format_tokens(total_usage['input_tokens'])}")
+    print(f"  Output tokens:          {format_tokens(total_usage['output_tokens'])}")
+    print(f"  Cache creation tokens:  {format_tokens(total_usage['cache_creation'])}")
+    print(f"  Cache read tokens:      {format_tokens(total_usage['cache_read'])}")
+    print()
+    print(f"  Total input (incl cache): {format_tokens(total_input)}")
+    print(f"  Total tokens:             {format_tokens(total_tokens)}")
+    print()
+    print(f"  Estimated cost: ${total_cost:.2f}")
+    print("  (at $3/$15 per M tokens for input/output)")
+    print()
+    print("=" * 100)
+
+if __name__ == '__main__':
+    main()
--- a/tests/claude-code/run-skill-tests.sh
+++ b/tests/claude-code/run-skill-tests.sh
@@ -0,0 +1,187 @@
+#!/usr/bin/env bash
+# Test runner for Claude Code skills
+# Tests skills by invoking Claude Code CLI and verifying behavior
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+cd "$SCRIPT_DIR"
+
+echo "========================================"
+echo " Claude Code Skills Test Suite"
+echo "========================================"
+echo ""
+echo "Repository: $(cd ../.. && pwd)"
+echo "Test time: $(date)"
+echo "Claude version: $(claude --version 2>/dev/null || echo 'not found')"
+echo ""
+
+# Check if Claude Code is available
+if ! command -v claude &> /dev/null; then
+    echo "ERROR: Claude Code CLI not found"
+    echo "Install Claude Code first: https://code.claude.com"
+    exit 1
+fi
+
+# Parse command line arguments
+VERBOSE=false
+SPECIFIC_TEST=""
+TIMEOUT=300  # Default 5 minute timeout per test
+RUN_INTEGRATION=false
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --verbose|-v)
+            VERBOSE=true
+            shift
+            ;;
+        --test|-t)
+            SPECIFIC_TEST="$2"
+            shift 2
+            ;;
+        --timeout)
+            TIMEOUT="$2"
+            shift 2
+            ;;
+        --integration|-i)
+            RUN_INTEGRATION=true
+            shift
+            ;;
+        --help|-h)
+            echo "Usage: $0 [options]"
+            echo ""
+            echo "Options:"
+            echo "  --verbose, -v        Show verbose output"
+            echo "  --test, -t NAME      Run only the specified test"
+            echo "  --timeout SECONDS    Set timeout per test (default: 300)"
+            echo "  --integration, -i    Run integration tests (slow, 10-30 min)"
+            echo "  --help, -h           Show this help"
+            echo ""
+            echo "Tests:"
+            echo "  test-subagent-driven-development.sh  Test skill loading and requirements"
+            echo ""
+            echo "Integration Tests (use --integration):"
+            echo "  test-subagent-driven-development-integration.sh  Full workflow execution"
+            exit 0
+            ;;
+        *)
+            echo "Unknown option: $1"
+            echo "Use --help for usage information"
+            exit 1
+            ;;
+    esac
+done
+
+# List of skill tests to run (fast unit tests)
+tests=(
+    "test-subagent-driven-development.sh"
+)
+
+# Integration tests (slow, full execution)
+integration_tests=(
+    "test-subagent-driven-development-integration.sh"
+)
+
+# Add integration tests if requested
+if [ "$RUN_INTEGRATION" = true ]; then
+    tests+=("${integration_tests[@]}")
+fi
+
+# Filter to specific test if requested
+if [ -n "$SPECIFIC_TEST" ]; then
+    tests=("$SPECIFIC_TEST")
+fi
+
+# Track results
+passed=0
+failed=0
+skipped=0
+
+# Run each test
+for test in "${tests[@]}"; do
+    echo "----------------------------------------"
+    echo "Running: $test"
+    echo "----------------------------------------"
+
+    test_path="$SCRIPT_DIR/$test"
+
+    if [ ! -f "$test_path" ]; then
+        echo "  [SKIP] Test file not found: $test"
+        skipped=$((skipped + 1))
+        continue
+    fi
+
+    if [ ! -x "$test_path" ]; then
+        echo "  Making $test executable..."
+        chmod +x "$test_path"
+    fi
+
+    start_time=$(date +%s)
+
+    if [ "$VERBOSE" = true ]; then
+        if timeout "$TIMEOUT" bash "$test_path"; then
+            end_time=$(date +%s)
+            duration=$((end_time - start_time))
+            echo ""
+            echo "  [PASS] $test (${duration}s)"
+            passed=$((passed + 1))
+        else
+            exit_code=$?
+            end_time=$(date +%s)
+            duration=$((end_time - start_time))
+            echo ""
+            if [ $exit_code -eq 124 ]; then
+                echo "  [FAIL] $test (timeout after ${TIMEOUT}s)"
+            else
+                echo "  [FAIL] $test (${duration}s)"
+            fi
+            failed=$((failed + 1))
+        fi
+    else
+        # Capture output for non-verbose mode
+        if output=$(timeout "$TIMEOUT" bash "$test_path" 2>&1); then
+            end_time=$(date +%s)
+            duration=$((end_time - start_time))
+            echo "  [PASS] (${duration}s)"
+            passed=$((passed + 1))
+        else
+            exit_code=$?
+            end_time=$(date +%s)
+            duration=$((end_time - start_time))
+            if [ $exit_code -eq 124 ]; then
+                echo "  [FAIL] (timeout after ${TIMEOUT}s)"
+            else
+                echo "  [FAIL] (${duration}s)"
+            fi
+            echo ""
+            echo "  Output:"
+            echo "$output" | sed 's/^/    /'
+            failed=$((failed + 1))
+        fi
+    fi
+
+    echo ""
+done
+
+# Print summary
+echo "========================================"
+echo " Test Results Summary"
+echo "========================================"
+echo ""
+echo "  Passed:  $passed"
+echo "  Failed:  $failed"
+echo "  Skipped: $skipped"
+echo ""
+
+if [ "$RUN_INTEGRATION" = false ] && [ ${#integration_tests[@]} -gt 0 ]; then
+    echo "Note: Integration tests were not run (they take 10-30 minutes)."
+    echo "Use --integration flag to run full workflow execution tests."
+    echo ""
+fi
+
+if [ $failed -gt 0 ]; then
+    echo "STATUS: FAILED"
+    exit 1
+else
+    echo "STATUS: PASSED"
+    exit 0
+fi
--- a/tests/claude-code/test-helpers.sh
+++ b/tests/claude-code/test-helpers.sh
@@ -0,0 +1,202 @@
+#!/usr/bin/env bash
+# Helper functions for Claude Code skill tests
+
+# Run Claude Code with a prompt and capture output
+# Usage: run_claude "prompt text" [timeout_seconds] [allowed_tools]
+run_claude() {
+    local prompt="$1"
+    local timeout="${2:-60}"
+    local allowed_tools="${3:-}"
+    local output_file=$(mktemp)
+
+    # Build command
+    local cmd="claude -p \"$prompt\""
+    if [ -n "$allowed_tools" ]; then
+        cmd="$cmd --allowed-tools=$allowed_tools"
+    fi
+
+    # Run Claude in headless mode with timeout
+    if timeout "$timeout" bash -c "$cmd" > "$output_file" 2>&1; then
+        cat "$output_file"
+        rm -f "$output_file"
+        return 0
+    else
+        local exit_code=$?
+        cat "$output_file" >&2
+        rm -f "$output_file"
+        return $exit_code
+    fi
+}
+
+# Check if output contains a pattern
+# Usage: assert_contains "output" "pattern" "test name"
+assert_contains() {
+    local output="$1"
+    local pattern="$2"
+    local test_name="${3:-test}"
+
+    if echo "$output" | grep -q "$pattern"; then
+        echo "  [PASS] $test_name"
+        return 0
+    else
+        echo "  [FAIL] $test_name"
+        echo "  Expected to find: $pattern"
+        echo "  In output:"
+        echo "$output" | sed 's/^/    /'
+        return 1
+    fi
+}
+
+# Check if output does NOT contain a pattern
+# Usage: assert_not_contains "output" "pattern" "test name"
+assert_not_contains() {
+    local output="$1"
+    local pattern="$2"
+    local test_name="${3:-test}"
+
+    if echo "$output" | grep -q "$pattern"; then
+        echo "  [FAIL] $test_name"
+        echo "  Did not expect to find: $pattern"
+        echo "  In output:"
+        echo "$output" | sed 's/^/    /'
+        return 1
+    else
+        echo "  [PASS] $test_name"
+        return 0
+    fi
+}
+
+# Check if output matches a count
+# Usage: assert_count "output" "pattern" expected_count "test name"
+assert_count() {
+    local output="$1"
+    local pattern="$2"
+    local expected="$3"
+    local test_name="${4:-test}"
+
+    local actual=$(echo "$output" | grep -c "$pattern" || echo "0")
+
+    if [ "$actual" -eq "$expected" ]; then
+        echo "  [PASS] $test_name (found $actual instances)"
+        return 0
+    else
+        echo "  [FAIL] $test_name"
+        echo "  Expected $expected instances of: $pattern"
+        echo "  Found $actual instances"
+        echo "  In output:"
+        echo "$output" | sed 's/^/    /'
+        return 1
+    fi
+}
+
+# Check if pattern A appears before pattern B
+# Usage: assert_order "output" "pattern_a" "pattern_b" "test name"
+assert_order() {
+    local output="$1"
+    local pattern_a="$2"
+    local pattern_b="$3"
+    local test_name="${4:-test}"
+
+    # Get line numbers where patterns appear
+    local line_a=$(echo "$output" | grep -n "$pattern_a" | head -1 | cut -d: -f1)
+    local line_b=$(echo "$output" | grep -n "$pattern_b" | head -1 | cut -d: -f1)
+
+    if [ -z "$line_a" ]; then
+        echo "  [FAIL] $test_name: pattern A not found: $pattern_a"
+        return 1
+    fi
+
+    if [ -z "$line_b" ]; then
+        echo "  [FAIL] $test_name: pattern B not found: $pattern_b"
+        return 1
+    fi
+
+    if [ "$line_a" -lt "$line_b" ]; then
+        echo "  [PASS] $test_name (A at line $line_a, B at line $line_b)"
+        return 0
+    else
+        echo "  [FAIL] $test_name"
+        echo "  Expected '$pattern_a' before '$pattern_b'"
+        echo "  But found A at line $line_a, B at line $line_b"
+        return 1
+    fi
+}
+
+# Create a temporary test project directory
+# Usage: test_project=$(create_test_project)
+create_test_project() {
+    local test_dir=$(mktemp -d)
+    echo "$test_dir"
+}
+
+# Cleanup test project
+# Usage: cleanup_test_project "$test_dir"
+cleanup_test_project() {
+    local test_dir="$1"
+    if [ -d "$test_dir" ]; then
+        rm -rf "$test_dir"
+    fi
+}
+
+# Create a simple plan file for testing
+# Usage: create_test_plan "$project_dir" "$plan_name"
+create_test_plan() {
+    local project_dir="$1"
+    local plan_name="${2:-test-plan}"
+    local plan_file="$project_dir/docs/plans/$plan_name.md"
+
+    mkdir -p "$(dirname "$plan_file")"
+
+    cat > "$plan_file" <<'EOF'
+# Test Implementation Plan
+
+## Task 1: Create Hello Function
+
+Create a simple hello function that returns "Hello, World!".
+
+**File:** `src/hello.js`
+
+**Implementation:**
+```javascript
+export function hello() {
+  return "Hello, World!";
+}
+```
+
+**Tests:** Write a test that verifies the function returns the expected string.
+
+**Verification:** `npm test`
+
+## Task 2: Create Goodbye Function
+
+Create a goodbye function that takes a name and returns a goodbye message.
+
+**File:** `src/goodbye.js`
+
+**Implementation:**
+```javascript
+export function goodbye(name) {
+  return `Goodbye, ${name}!`;
+}
+```
+
+**Tests:** Write tests for:
+- Default name
+- Custom name
+- Edge cases (empty string, null)
+
+**Verification:** `npm test`
+EOF
+
+    echo "$plan_file"
+}
+
+# Export functions for use in tests
+export -f run_claude
+export -f assert_contains
+export -f assert_not_contains
+export -f assert_count
+export -f assert_order
+export -f create_test_project
+export -f cleanup_test_project
+export -f create_test_plan
--- a/tests/claude-code/test-subagent-driven-development-integration.sh
+++ b/tests/claude-code/test-subagent-driven-development-integration.sh
@@ -0,0 +1,314 @@
+#!/usr/bin/env bash
+# Integration Test: subagent-driven-development workflow
+# Actually executes a plan and verifies the new workflow behaviors
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+source "$SCRIPT_DIR/test-helpers.sh"
+
+echo "========================================"
+echo " Integration Test: subagent-driven-development"
+echo "========================================"
+echo ""
+echo "This test executes a real plan using the skill and verifies:"
+echo "  1. Plan is read once (not per task)"
+echo "  2. Full task text provided to subagents"
+echo "  3. Subagents perform self-review"
+echo "  4. Spec compliance review before code quality"
+echo "  5. Review loops when issues found"
+echo "  6. Spec reviewer reads code independently"
+echo ""
+echo "WARNING: This test may take 10-30 minutes to complete."
+echo ""
+
+# Create test project
+TEST_PROJECT=$(create_test_project)
+echo "Test project: $TEST_PROJECT"
+
+# Trap to cleanup
+trap "cleanup_test_project $TEST_PROJECT" EXIT
+
+# Set up minimal Node.js project
+cd "$TEST_PROJECT"
+
+cat > package.json <<'EOF'
+{
+  "name": "test-project",
+  "version": "1.0.0",
+  "type": "module",
+  "scripts": {
+    "test": "node --test"
+  }
+}
+EOF
+
+mkdir -p src test docs/plans
+
+# Create a simple implementation plan
+cat > docs/plans/implementation-plan.md <<'EOF'
+# Test Implementation Plan
+
+This is a minimal plan to test the subagent-driven-development workflow.
+
+## Task 1: Create Add Function
+
+Create a function that adds two numbers.
+
+**File:** `src/math.js`
+
+**Requirements:**
+- Function named `add`
+- Takes two parameters: `a` and `b`
+- Returns the sum of `a` and `b`
+- Export the function
+
+**Implementation:**
+```javascript
+export function add(a, b) {
+  return a + b;
+}
+```
+
+**Tests:** Create `test/math.test.js` that verifies:
+- `add(2, 3)` returns `5`
+- `add(0, 0)` returns `0`
+- `add(-1, 1)` returns `0`
+
+**Verification:** `npm test`
+
+## Task 2: Create Multiply Function
+
+Create a function that multiplies two numbers.
+
+**File:** `src/math.js` (add to existing file)
+
+**Requirements:**
+- Function named `multiply`
+- Takes two parameters: `a` and `b`
+- Returns the product of `a` and `b`
+- Export the function
+- DO NOT add any extra features (like power, divide, etc.)
+
+**Implementation:**
+```javascript
+export function multiply(a, b) {
+  return a * b;
+}
+```
+
+**Tests:** Add to `test/math.test.js`:
+- `multiply(2, 3)` returns `6`
+- `multiply(0, 5)` returns `0`
+- `multiply(-2, 3)` returns `-6`
+
+**Verification:** `npm test`
+EOF
+
+# Initialize git repo
+git init --quiet
+git config user.email "test@test.com"
+git config user.name "Test User"
+git add .
+git commit -m "Initial commit" --quiet
+
+echo ""
+echo "Project setup complete. Starting execution..."
+echo ""
+
+# Run Claude with subagent-driven-development
+# Capture full output to analyze
+OUTPUT_FILE="$TEST_PROJECT/claude-output.txt"
+
+# Create prompt file
+cat > "$TEST_PROJECT/prompt.txt" <<'EOF'
+I want you to execute the implementation plan at docs/plans/implementation-plan.md using the subagent-driven-development skill.
+
+IMPORTANT: Follow the skill exactly. I will be verifying that you:
+1. Read the plan once at the beginning
+2. Provide full task text to subagents (don't make them read files)
+3. Ensure subagents do self-review before reporting
+4. Run spec compliance review before code quality review
+5. Use review loops when issues are found
+
+Begin now. Execute the plan.
+EOF
+
+# Note: We use a longer timeout since this is integration testing
+# Use --allowed-tools to enable tool usage in headless mode
+# IMPORTANT: Run from superpowers directory so local dev skills are available
+PROMPT="Change to directory $TEST_PROJECT and then execute the implementation plan at docs/plans/implementation-plan.md using the subagent-driven-development skill.
+
+IMPORTANT: Follow the skill exactly. I will be verifying that you:
+1. Read the plan once at the beginning
+2. Provide full task text to subagents (don't make them read files)
+3. Ensure subagents do self-review before reporting
+4. Run spec compliance review before code quality review
+5. Use review loops when issues are found
+
+Begin now. Execute the plan."
+
+echo "Running Claude (output will be shown below and saved to $OUTPUT_FILE)..."
+echo "================================================================================"
+cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" --allowed-tools=all --add-dir "$TEST_PROJECT" --permission-mode bypassPermissions 2>&1 | tee "$OUTPUT_FILE" || {
+    echo ""
+    echo "================================================================================"
+    echo "EXECUTION FAILED (exit code: $?)"
+    exit 1
+}
+echo "================================================================================"
+
+echo ""
+echo "Execution complete. Analyzing results..."
+echo ""
+
+# Find the session transcript
+# Session files are in ~/.claude/projects/-<working-dir>/<session-id>.jsonl
+WORKING_DIR_ESCAPED=$(echo "$SCRIPT_DIR/../.." | sed 's/\//-/g' | sed 's/^-//')
+SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
+
+# Find the most recent session file (created during this test run)
+SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 2>/dev/null | sort -r | head -1)
+
+if [ -z "$SESSION_FILE" ]; then
+    echo "ERROR: Could not find session transcript file"
+    echo "Looked in: $SESSION_DIR"
+    exit 1
+fi
+
+echo "Analyzing session transcript: $(basename "$SESSION_FILE")"
+echo ""
+
+# Verification tests
+FAILED=0
+
+echo "=== Verification Tests ==="
+echo ""
+
+# Test 1: Skill was invoked
+echo "Test 1: Skill tool invoked..."
+if grep -q '"name":"Skill".*"skill":"superpowers:subagent-driven-development"' "$SESSION_FILE"; then
+    echo "  [PASS] subagent-driven-development skill was invoked"
+else
+    echo "  [FAIL] Skill was not invoked"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 2: Subagents were used (Task tool)
+echo "Test 2: Subagents dispatched..."
+task_count=$(grep -c '"name":"Task"' "$SESSION_FILE" || echo "0")
+if [ "$task_count" -ge 2 ]; then
+    echo "  [PASS] $task_count subagents dispatched"
+else
+    echo "  [FAIL] Only $task_count subagent(s) dispatched (expected >= 2)"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 3: TodoWrite was used for tracking
+echo "Test 3: Task tracking..."
+todo_count=$(grep -c '"name":"TodoWrite"' "$SESSION_FILE" || echo "0")
+if [ "$todo_count" -ge 1 ]; then
+    echo "  [PASS] TodoWrite used $todo_count time(s) for task tracking"
+else
+    echo "  [FAIL] TodoWrite not used"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 6: Implementation actually works
+echo "Test 6: Implementation verification..."
+if [ -f "$TEST_PROJECT/src/math.js" ]; then
+    echo "  [PASS] src/math.js created"
+
+    if grep -q "export function add" "$TEST_PROJECT/src/math.js"; then
+        echo "  [PASS] add function exists"
+    else
+        echo "  [FAIL] add function missing"
+        FAILED=$((FAILED + 1))
+    fi
+
+    if grep -q "export function multiply" "$TEST_PROJECT/src/math.js"; then
+        echo "  [PASS] multiply function exists"
+    else
+        echo "  [FAIL] multiply function missing"
+        FAILED=$((FAILED + 1))
+    fi
+else
+    echo "  [FAIL] src/math.js not created"
+    FAILED=$((FAILED + 1))
+fi
+
+if [ -f "$TEST_PROJECT/test/math.test.js" ]; then
+    echo "  [PASS] test/math.test.js created"
+else
+    echo "  [FAIL] test/math.test.js not created"
+    FAILED=$((FAILED + 1))
+fi
+
+# Try running tests
+if cd "$TEST_PROJECT" && npm test > test-output.txt 2>&1; then
+    echo "  [PASS] Tests pass"
+else
+    echo "  [FAIL] Tests failed"
+    cat test-output.txt
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 7: Git commits show proper workflow
+echo "Test 7: Git commit history..."
+commit_count=$(git -C "$TEST_PROJECT" log --oneline | wc -l)
+if [ "$commit_count" -gt 2 ]; then  # Initial + at least 2 task commits
+    echo "  [PASS] Multiple commits created ($commit_count total)"
+else
+    echo "  [FAIL] Too few commits ($commit_count, expected >2)"
+    FAILED=$((FAILED + 1))
+fi
+echo ""
+
+# Test 8: Check for extra features (spec compliance should catch)
+echo "Test 8: No extra features added (spec compliance)..."
+if grep -q "export function divide\|export function power\|export function subtract" "$TEST_PROJECT/src/math.js" 2>/dev/null; then
+    echo "  [WARN] Extra features found (spec review should have caught this)"
+    # Not failing on this as it tests reviewer effectiveness
+else
+    echo "  [PASS] No extra features added"
+fi
+echo ""
+
+# Token Usage Analysis
+echo "========================================="
+echo " Token Usage Analysis"
+echo "========================================="
+echo ""
+python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
+echo ""
+
+# Summary
+echo "========================================"
+echo " Test Summary"
+echo "========================================"
+echo ""
+
+if [ $FAILED -eq 0 ]; then
+    echo "STATUS: PASSED"
+    echo "All verification tests passed!"
+    echo ""
+    echo "The subagent-driven-development skill correctly:"
+    echo "  ✓ Reads plan once at start"
+    echo "  ✓ Provides full task text to subagents"
+    echo "  ✓ Enforces self-review"
+    echo "  ✓ Runs spec compliance before code quality"
+    echo "  ✓ Spec reviewer verifies independently"
+    echo "  ✓ Produces working implementation"
+    exit 0
+else
+    echo "STATUS: FAILED"
+    echo "Failed $FAILED verification tests"
+    echo ""
+    echo "Output saved to: $OUTPUT_FILE"
+    echo ""
+    echo "Review the output to see what went wrong."
+    exit 1
+fi
--- a/tests/claude-code/test-subagent-driven-development.sh
+++ b/tests/claude-code/test-subagent-driven-development.sh
@@ -0,0 +1,139 @@
+#!/usr/bin/env bash
+# Test: subagent-driven-development skill
+# Verifies that the skill is loaded and follows correct workflow
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+source "$SCRIPT_DIR/test-helpers.sh"
+
+echo "=== Test: subagent-driven-development skill ==="
+echo ""
+
+# Test 1: Verify skill can be loaded
+echo "Test 1: Skill loading..."
+
+output=$(run_claude "What is the subagent-driven-development skill? Describe its key steps briefly." 30)
+
+if assert_contains "$output" "subagent-driven-development" "Skill is recognized"; then
+    : # pass
+else
+    exit 1
+fi
+
+if assert_contains "$output" "Load Plan\|read.*plan\|extract.*tasks" "Mentions loading plan"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+# Test 2: Verify skill describes correct workflow order
+echo "Test 2: Workflow ordering..."
+
+output=$(run_claude "In the subagent-driven-development skill, what comes first: spec compliance review or code quality review? Be specific about the order." 30)
+
+if assert_order "$output" "spec.*compliance" "code.*quality" "Spec compliance before code quality"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+# Test 3: Verify self-review is mentioned
+echo "Test 3: Self-review requirement..."
+
+output=$(run_claude "Does the subagent-driven-development skill require implementers to do self-review? What should they check?" 30)
+
+if assert_contains "$output" "self-review\|self review" "Mentions self-review"; then
+    : # pass
+else
+    exit 1
+fi
+
+if assert_contains "$output" "completeness\|Completeness" "Checks completeness"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+# Test 4: Verify plan is read once
+echo "Test 4: Plan reading efficiency..."
+
+output=$(run_claude "In subagent-driven-development, how many times should the controller read the plan file? When does this happen?" 30)
+
+if assert_contains "$output" "once\|one time\|single" "Read plan once"; then
+    : # pass
+else
+    exit 1
+fi
+
+if assert_contains "$output" "Step 1\|beginning\|start\|Load Plan" "Read at beginning"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+# Test 5: Verify spec compliance reviewer is skeptical
+echo "Test 5: Spec compliance reviewer mindset..."
+
+output=$(run_claude "What is the spec compliance reviewer's attitude toward the implementer's report in subagent-driven-development?" 30)
+
+if assert_contains "$output" "not trust\|don't trust\|skeptical\|verify.*independently\|suspiciously" "Reviewer is skeptical"; then
+    : # pass
+else
+    exit 1
+fi
+
+if assert_contains "$output" "read.*code\|inspect.*code\|verify.*code" "Reviewer reads code"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+# Test 6: Verify review loops
+echo "Test 6: Review loop requirements..."
+
+output=$(run_claude "In subagent-driven-development, what happens if a reviewer finds issues? Is it a one-time review or a loop?" 30)
+
+if assert_contains "$output" "loop\|again\|repeat\|until.*approved\|until.*compliant" "Review loops mentioned"; then
+    : # pass
+else
+    exit 1
+fi
+
+if assert_contains "$output" "implementer.*fix\|fix.*issues" "Implementer fixes issues"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+# Test 7: Verify full task text is provided
+echo "Test 7: Task context provision..."
+
+output=$(run_claude "In subagent-driven-development, how does the controller provide task information to the implementer subagent? Does it make them read a file or provide it directly?" 30)
+
+if assert_contains "$output" "provide.*directly\|full.*text\|paste\|include.*prompt" "Provides text directly"; then
+    : # pass
+else
+    exit 1
+fi
+
+if assert_not_contains "$output" "read.*file\|open.*file" "Doesn't make subagent read file"; then
+    : # pass
+else
+    exit 1
+fi
+
+echo ""
+
+echo "=== All subagent-driven-development skill tests passed ==="
Author	SHA1	Message	Date
Jesse Vincent	2ea4f36b0f	Add comprehensive testing documentation Documents: - How to run integration tests - subagent-driven-development test details - Token analysis tool usage - Troubleshooting common issues - Writing new integration tests - Session transcript format	2025-11-29 21:03:33 -08:00
Jesse Vincent	4923c072e6	Add token usage analysis to subagent-driven-development test - Rewrote analyze-token-usage.py to parse main session file correctly - Extracts usage from toolUseResult fields for each subagent - Shows breakdown by agent with descriptions - Integrated into test-subagent-driven-development-integration.sh - Displays token usage automatically after each test run	2025-11-29 20:51:18 -08:00
Jesse Vincent	e90722ebf6	fix: verify skill usage via session transcript not text output The skill instructions are internal and don't appear in user-facing output. Updated verification to parse the session JSONL transcript and check for actual tool usage: - Skill tool invocation - Task tool (subagents) - TodoWrite (tracking) - Implementation results	2025-11-29 19:35:43 -08:00
Jesse Vincent	9c72ae7866	test: use bypassPermissions mode for unrestricted testing dontAsk mode was auto-denying Write tool. Use bypassPermissions instead to allow full tool access in this controlled test environment.	2025-11-29 17:15:21 -08:00
Jesse Vincent	d8004f9b27	test: auto-approve permissions with --permission-mode dontAsk Headless tests need automatic permission approval to write files. Using dontAsk mode to auto-approve permissions for test directory.	2025-11-29 11:47:47 -08:00
Jesse Vincent	79b1ac39e0	test: add --add-dir flag for temp directory access Claude needs explicit permission to access the temp test directory. Added --add-dir flag to grant access to the test project.	2025-11-29 11:39:04 -08:00
Jesse Vincent	c538d16550	test: show Claude output in real-time during integration test Use tee instead of redirection so test output is visible during execution while still being saved to file for analysis.	2025-11-29 11:29:12 -08:00
Jesse Vincent	8923b2cf16	fix: run integration test from superpowers dir to access local dev skills The superpowers-dev marketplace makes skills available only when running from the plugin directory. Updated test to run claude from superpowers directory while working on the test project.	2025-11-29 10:35:38 -08:00
Jesse Vincent	575b14161e	Fix tests to use --allowed-tools flag Claude Code headless mode requires --allowed-tools flag to actually execute tool calls. Without it, Claude only responds as if it's doing things but doesn't actually use tools. Changes: - Updated run_claude helper to accept allowed_tools parameter - Updated integration test to use --allowed-tools=all - This enables actual tool execution (Write, Task, Bash, etc.) Now the integration test should actually execute the workflow instead of just talking about it.	2025-11-28 22:21:24 -08:00
Jesse Vincent	5b5fb3940d	Fix syntax error in integration test Simplified command substitution to avoid shell parsing issues. Instead of nested heredoc in command substitution, write prompt to file first then read it.	2025-11-28 15:22:01 -08:00
Jesse Vincent	c2c72d9bfa	Add integration test for subagent-driven-development Created full end-to-end integration test that executes a real plan and verifies the new workflow improvements actually work. New test: test-subagent-driven-development-integration.sh - Creates real Node.js test project - Generates implementation plan (2 tasks) - Executes using subagent-driven-development skill - Verifies 8 key behaviors: 1. Plan read once at beginning (not per task) 2. Full task text provided to subagents (not file reading) 3. Subagents perform self-review 4. Spec compliance review before code quality 5. Spec reviewer reads code independently 6. Working implementation produced 7. Tests pass 8. No extra features added (spec compliance) Integration tests are opt-in (--integration flag) due to 10-30 min runtime. Updated run-skill-tests.sh: - Added --integration flag - Separates fast tests from integration tests - Shows note when integration tests skipped Updated README with integration test documentation. Run with: ./run-skill-tests.sh # Fast tests only ./run-skill-tests.sh --integration # Include integration tests	2025-11-28 15:06:10 -08:00
Jesse Vincent	2d011053d5	Add Claude Code skills test framework Created automated test suite for testing superpowers skills using Claude Code CLI in headless mode. New files: - tests/claude-code/run-skill-tests.sh - Main test runner - tests/claude-code/test-helpers.sh - Helper functions for testing - tests/claude-code/test-subagent-driven-development.sh - First test - tests/claude-code/README.md - Documentation Test framework features: - Run Claude Code with prompts and capture output - Assertion helpers (contains, not_contains, count, order) - Test project creation helpers - Timeout support (default 5 minutes) - Verbose mode for debugging - Specific test selection First test verifies subagent-driven-development skill: - Skill loading - Workflow ordering (spec compliance before code quality) - Self-review requirements - Plan reading efficiency (read once) - Spec compliance reviewer skepticism - Review loops - Task context provision Run with: cd tests/claude-code && ./run-skill-tests.sh	2025-11-28 14:51:08 -08:00
Jesse Vincent	47bfdf36f1	Emphasize spec compliance review must complete before code quality Made sequencing explicit: - Spec compliance review loop must fully complete (✅) before code quality - Added "Do NOT proceed to code quality review until spec compliance is ✅" - Code Quality Review section starts with "Only run after spec compliance review is complete" - Red Flags: Added "Start code quality review before spec compliance is ✅ (wrong order)" This ensures we don't waste time reviewing code quality of the wrong implementation. Verify they built the right thing first, then verify they built it well.	2025-11-28 14:32:12 -08:00
Jesse Vincent	fedd3e2096	Make spec compliance reviewer skeptical and verification-focused The spec compliance reviewer now: - Does NOT trust implementer's report - Is warned implementer finished suspiciously quickly - MUST verify everything by reading actual code - Compares implementation to requirements line by line - Reports issues with file:line references Key additions: - "Do Not Trust the Report" section - Explicit DO NOT / DO lists - "Verify by reading code, not by trusting report" - Changed "What Was Implemented" to "What Implementer Claims They Built" This prevents rubber-stamping and ensures independent verification of spec compliance against actual codebase.	2025-11-28 14:29:53 -08:00
Jesse Vincent	78496a6a1c	Improve subagent-driven-development workflow Key improvements based on feedback: 1. Read plan once, not per task - Extract all tasks in Step 1 - Reference extracted tasks in Step 2 - Eliminates redundant file reading 2. Enable questions during work - Not just before, but also while working - "It's always OK to ask questions" - Don't guess or make assumptions 3. Add self-review before reporting - Completeness: implemented everything? - Quality: best work, clear names? - Discipline: avoided overbuilding? - Testing: comprehensive, real behavior? - Catches issues before handoff 4. Add spec compliance review - Separate reviewer checks: built the right thing? - Flags missing requirements - Flags extra/unneeded work - Flags misunderstandings - Runs BEFORE code quality review 5. Make reviews loops, not one-shot - Reviewer finds issues - Implementer fixes - Reviewer reviews again - Repeat until approved - Applies to both spec and code quality Two-stage review process: - Stage 1: Spec compliance (right thing?) - Stage 2: Code quality (built well?) This enables subagents to do their best work with clear requirements, opportunities to clarify, self-critique, and thorough review loops.	2025-11-28 14:24:13 -08:00
Jesse Vincent	830c226a9c	Update subagent-driven-development: controller provides full task text Changed workflow so controller provides complete task context directly rather than making subagent read plan file. Key changes: - Controller reads plan and extracts full task text - Controller provides scene-setting context (dependencies, architecture) - Subagent receives complete information in prompt (no file reading) - Subagent can ask clarifying questions before beginning work - Controller handles questions/concerns before subagent proceeds Benefits: - No file reading overhead for subagent - Controller curates exactly what context is needed - Questions surfaced before work begins (not after) - Subagent has complete information to do best work This enables subagents to start with clarity rather than ambiguity.	2025-11-28 13:50:37 -08:00
Jesse Vincent	637ad174be	Add skills improvement plan from user feedback Analyzed feedback from two Claude instances using superpowers in real development scenarios. Identified 8 core problems and proposed improvements organized by impact and risk. Key problems: - Configuration change verification gap (verify success not intent) - Background process accumulation across subagents - Context bloat in subagent prompts - Missing self-reflection before handoff - Mock-interface drift - Code reviewer file access issues - Skills not being read/enforced - Fix workflow latency Proposed improvements organized in 3 phases: - Phase 1: High-impact, low-risk (do first) - Phase 2: Moderate changes (test carefully) - Phase 3: Optimization (validate first) See plan for detailed analysis and open questions.	2025-11-28 13:43:23 -08:00