docs: introduce evals/ as the canonical skill-behavior eval harness

- docs/testing.md split into Plugin tests + Skill behavior evals.
  Plugin tests section enumerates the bash tests that survive
  (kept by drill-coverage analysis or as describe-skill tests).
- CLAUDE.md adds Eval harness section pointing at evals/.
- README.md Contributing section mentions evals/ alongside tests/.
- .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders
  (evals/.gitignore covers these locally; root-level entries help
  tooling that does not recurse into nested ignore files).
Jesse Vincent
2026-05-06 12:33:10 -07:00
parent b43d14f87f
commit d545612825
4 changed files with 34 additions and 291 deletions

.gitignore

@@ -5,3 +5,9 @@
node_modules/
inspo
triage/
# Eval harness — drill ships its own gitignore at evals/.gitignore;
# these are belt-and-suspenders entries for tools that don't recurse.
evals/results/
evals/.venv/
evals/.env
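For context, the nested ignore file mentioned above would carry the same entries relative to `evals/`. Hypothetical contents, since the actual `evals/.gitignore` is not shown in this diff:
```gitignore
# hypothetical mirror of the root-level entries, relative to evals/
results/
.venv/
.env
```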

CLAUDE.md

@@ -94,6 +94,10 @@ Skills are not prose — they are code that shapes agent behavior. If you modify
- Show before/after eval results in your PR
- Do not modify carefully-tuned content (Red Flags tables, rationalization lists, "human partner" language) without evidence the change is an improvement
## Eval harness
Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.
## Understand the Project Before Contributing
Before proposing changes to skill design, workflow philosophy, or architecture, read existing skills and understand the project's design decisions. Superpowers has its own tested philosophy about skill design, agent behavior shaping, and terminology (e.g., "your human partner" is deliberate, not interchangeable with "the user"). Changes that rewrite the project's voice or restructure its approach without understanding why it exists will be rejected.

README.md

@@ -214,6 +214,8 @@ The general contribution process for Superpowers is below. Keep in mind that we
4. Follow the `writing-skills` skill for creating and testing new and modified skills
5. Submit a PR, being sure to fill in the pull request template.
Skill-behavior tests use the eval harness at `evals/`. See `evals/README.md` for setup. Plugin-infrastructure tests live at `tests/` and run via the relevant `run-*.sh` or `npm test`.
See `skills/writing-skills/SKILL.md` for the complete guide.
## Updating

docs/testing.md

@@ -1,303 +1,34 @@
# Testing Superpowers
Superpowers has two distinct kinds of tests, each in its own directory:
- **`tests/`** — does the plugin's non-LLM code work? Bash + node + python integration tests for brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
- **`evals/`** — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI, with an LLM actor and verifier judging skill compliance.
## Plugin tests
Live in `tests/`. Currently:
- `tests/brainstorm-server/` — node test suite for the brainstorm server JS code.
- `tests/opencode/` — bash tests for OpenCode plugin loading, bootstrap caching, and tool registration.
- `tests/codex-plugin-sync/` — bash sync verification.
- `tests/claude-code/test-helpers.sh`, `analyze-token-usage.py` — utilities used by remaining bash tests.
- `tests/claude-code/test-subagent-driven-development.sh` — agent-can-describe-SDD test (no drill counterpart; tests description-recall, not behavior).
- `tests/claude-code/test-subagent-driven-development-integration.sh` — extended SDD integration with token analysis (drill covers the YAGNI subset; bash adds commit-count, TodoWrite, and token telemetry assertions).
- `tests/claude-code/test-worktree-native-preference.sh` — RED-GREEN-REFACTOR validation for the worktree skill (drill covers the PRESSURE phase; bash also covers RED/GREEN baselines).
- `tests/explicit-skill-requests/` — Haiku-specific, multi-turn, and skill-name-prompted tests not covered by drill.
Run plugin tests via the relevant directory's `run-*.sh` or `npm test`.
## Skill behavior evals
Live in `evals/`. Drill is the harness; scenarios live at `evals/scenarios/*.yaml`. See `evals/README.md` for setup. Quick start:
```bash
cd evals
uv sync --extra dev
export ANTHROPIC_API_KEY=sk-...
uv run drill run triggering-test-driven-development -b claude
```
Drill scenarios are slow (3-30+ minutes each) and run real LLM sessions. They are not part of CI today; the natural follow-up is a tiered model (fast subset on PR, full sweep nightly + on-demand).

The remainder of this diff is the old `docs/testing.md` that this commit removes:

# Testing Superpowers Skills
This document describes how to test Superpowers skills, particularly the integration tests for complex skills like `subagent-driven-development`.
## Overview
Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.
## Test Structure
```
tests/
├── claude-code/
│   ├── test-helpers.sh                                    # Shared test utilities
│   ├── test-subagent-driven-development-integration.sh
│   ├── analyze-token-usage.py                             # Token analysis tool
│   └── run-skill-tests.sh                                 # Test runner (if exists)
```
## Running Tests
### Integration Tests
Integration tests execute real Claude Code sessions with actual skills:
```bash
# Run the subagent-driven-development integration test
cd tests/claude-code
./test-subagent-driven-development-integration.sh
```
**Note:** Integration tests can take 10-30 minutes as they execute real implementation plans with multiple subagents.
### Requirements
- Must run from the **superpowers plugin directory** (not from temp directories)
- Claude Code must be installed and available as `claude` command
- Local dev marketplace must be enabled: `"superpowers@superpowers-dev": true` in `~/.claude/settings.json`
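For reference, a minimal `~/.claude/settings.json` fragment with just this setting (the `enabledPlugins` key is the one referenced under Troubleshooting below):
```json
{
  "enabledPlugins": {
    "superpowers@superpowers-dev": true
  }
}
```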
## Integration Test: subagent-driven-development
### What It Tests
The integration test verifies the `subagent-driven-development` skill correctly:
1. **Plan Loading**: Reads the plan once at the beginning
2. **Full Task Text**: Provides complete task descriptions to subagents (doesn't make them read files)
3. **Self-Review**: Ensures subagents perform self-review before reporting
4. **Review Order**: Runs spec compliance review before code quality review
5. **Review Loops**: Uses review loops when issues are found
6. **Independent Verification**: Spec reviewer reads code independently, doesn't trust implementer reports
### How It Works
1. **Setup**: Creates a temporary Node.js project with a minimal implementation plan
2. **Execution**: Runs Claude Code in headless mode with the skill
3. **Verification**: Parses the session transcript (`.jsonl` file) to verify:
- Skill tool was invoked
- Subagents were dispatched (Task tool)
- TodoWrite was used for tracking
- Implementation files were created
- Tests pass
- Git commits show proper workflow
4. **Token Analysis**: Shows token usage breakdown by subagent
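As an illustration, here is a hypothetical sketch of the kind of transcript checks the test performs. It is not the actual test code, and it assumes tool calls appear as `tool_use` blocks inside `message.content`, which goes beyond the fields documented in this file:
```python
#!/usr/bin/env python3
"""Hypothetical transcript checks; see Session Transcript Format below."""
import json
import sys

skill_invoked = False
task_dispatches = 0
todowrite_uses = 0

with open(sys.argv[1]) as f:
    for line in f:
        event = json.loads(line)
        content = event.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue
        for block in content:
            # Assumed shape: {"type": "tool_use", "name": "...", ...}
            if isinstance(block, dict) and block.get("type") == "tool_use":
                name = block.get("name")
                if name == "Skill":
                    skill_invoked = True
                elif name == "Task":
                    task_dispatches += 1
                elif name == "TodoWrite":
                    todowrite_uses += 1

print(f"Skill invoked: {skill_invoked}")
print(f"Subagents dispatched: {task_dispatches}")
print(f"TodoWrite used {todowrite_uses} time(s)")
```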
### Test Output
```
========================================
Integration Test: subagent-driven-development
========================================
Test project: /tmp/tmp.xyz123
=== Verification Tests ===
Test 1: Skill tool invoked...
[PASS] subagent-driven-development skill was invoked
Test 2: Subagents dispatched...
[PASS] 7 subagents dispatched
Test 3: Task tracking...
[PASS] TodoWrite used 5 time(s)
Test 6: Implementation verification...
[PASS] src/math.js created
[PASS] add function exists
[PASS] multiply function exists
[PASS] test/math.test.js created
[PASS] Tests pass
Test 7: Git commit history...
[PASS] Multiple commits created (3 total)
Test 8: No extra features added...
[PASS] No extra features added
=========================================
Token Usage Analysis
=========================================
Usage Breakdown:
----------------------------------------------------------------------------------------------------
Agent      Description                                      Msgs   Input   Output       Cache     Cost
----------------------------------------------------------------------------------------------------
main       Main session (coordinator)                         34      27    3,996   1,213,703   $ 4.09
3380c209   implementing Task 1: Create Add Function            1       2      787      24,989   $ 0.09
34b00fde   implementing Task 2: Create Multiply Function       1       4      644      25,114   $ 0.09
3801a732   reviewing whether an implementation matches...      1       5      703      25,742   $ 0.09
4c142934   doing a final code review...                        1       6      854      25,319   $ 0.09
5f017a42   a code reviewer. Review Task 2...                   1       6      504      22,949   $ 0.08
a6b7fbe4   a code reviewer. Review Task 1...                   1       6      515      22,534   $ 0.08
f15837c0   reviewing whether an implementation matches...      1       6      416      22,485   $ 0.07
----------------------------------------------------------------------------------------------------
TOTALS:
Total messages: 41
Input tokens: 62
Output tokens: 8,419
Cache creation tokens: 132,742
Cache read tokens: 1,382,835
Total input (incl cache): 1,515,639
Total tokens: 1,524,058
Estimated cost: $4.67
(at $3/$15 per M tokens for input/output)
========================================
Test Summary
========================================
STATUS: PASSED
```
## Token Analysis Tool
### Usage
Analyze token usage from any Claude Code session:
```bash
python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl
```
### Finding Session Files
Session transcripts are stored in `~/.claude/projects/` with the working directory path encoded:
```bash
# Example for /Users/yourname/Documents/GitHub/superpowers/superpowers
SESSION_DIR="$HOME/.claude/projects/-Users-yourname-Documents-GitHub-superpowers-superpowers"
# Find recent sessions
ls -lt "$SESSION_DIR"/*.jsonl | head -5
```
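Combining the two commands above, the newest transcript can be piped straight into the analyzer:
```bash
# Analyze the most recent session in the project directory
python3 tests/claude-code/analyze-token-usage.py \
  "$(ls -t "$SESSION_DIR"/*.jsonl | head -1)"
```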
### What It Shows
- **Main session usage**: Token usage by the coordinator (you or main Claude instance)
- **Per-subagent breakdown**: Each Task invocation with:
- Agent ID
- Description (extracted from prompt)
- Message count
- Input/output tokens
- Cache usage
- Estimated cost
- **Totals**: Overall token usage and cost estimate
### Understanding the Output
- **High cache reads**: Good - means prompt caching is working
- **High input tokens on main**: Expected - coordinator has full context
- **Similar costs per subagent**: Expected - each gets similar task complexity
- **Cost per task**: Typical range is $0.05-$0.15 per subagent depending on task
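As a sanity check, the sample run's $4.67 estimate is reproduced by billing all input tokens, cache included, at the input rate. That formula is inferred from the numbers above, not a statement of the analyzer's exact pricing model:
```python
# Totals from the sample output above
total_input_incl_cache = 1_515_639  # input + cache creation + cache read
output_tokens = 8_419

# $3/M input, $15/M output, per the note in the sample output
cost = total_input_incl_cache / 1e6 * 3.0 + output_tokens / 1e6 * 15.0
print(f"${cost:.2f}")  # -> $4.67
```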
## Troubleshooting
### Skills Not Loading
**Problem**: Skill not found when running headless tests
**Solutions**:
1. Ensure you're running FROM the superpowers directory: `cd /path/to/superpowers && tests/...`
2. Check `~/.claude/settings.json` has `"superpowers@superpowers-dev": true` in `enabledPlugins`
3. Verify skill exists in `skills/` directory
### Permission Errors
**Problem**: Claude blocked from writing files or accessing directories
**Solutions**:
1. Use `--permission-mode bypassPermissions` flag
2. Use `--add-dir /path/to/temp/dir` to grant access to test directories
3. Check file permissions on test directories
### Test Timeouts
**Problem**: Test takes too long and times out
**Solutions**:
1. Increase timeout: `timeout 1800 claude ...` (30 minutes)
2. Check for infinite loops in skill logic
3. Review subagent task complexity
### Session File Not Found
**Problem**: Can't find session transcript after test run
**Solutions**:
1. Check the correct project directory in `~/.claude/projects/`
2. Use `find ~/.claude/projects -name "*.jsonl" -mmin -60` to find recent sessions
3. Verify test actually ran (check for errors in test output)
## Writing New Integration Tests
### Template
```bash
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/test-helpers.sh"
# Create test project
TEST_PROJECT=$(create_test_project)
trap "cleanup_test_project $TEST_PROJECT" EXIT
# Set up test files...
cd "$TEST_PROJECT"
# Run Claude with skill
PROMPT="Your test prompt here"
cd "$SCRIPT_DIR/../.." && timeout 1800 claude -p "$PROMPT" \
--allowed-tools=all \
--add-dir "$TEST_PROJECT" \
--permission-mode bypassPermissions \
2>&1 | tee output.txt
# Find and analyze session
WORKING_DIR_ESCAPED=$(echo "$SCRIPT_DIR/../.." | sed 's/\\//-/g' | sed 's/^-//')
SESSION_DIR="$HOME/.claude/projects/$WORKING_DIR_ESCAPED"
SESSION_FILE=$(find "$SESSION_DIR" -name "*.jsonl" -type f -mmin -60 | sort -r | head -1)
# Verify behavior by parsing session transcript
if grep -q '"name":"Skill".*"skill":"your-skill-name"' "$SESSION_FILE"; then
echo "[PASS] Skill was invoked"
fi
# Show token analysis
python3 "$SCRIPT_DIR/analyze-token-usage.py" "$SESSION_FILE"
```
### Best Practices
1. **Always cleanup**: Use trap to cleanup temp directories
2. **Parse transcripts**: Don't grep user-facing output - parse the `.jsonl` session file
3. **Grant permissions**: Use `--permission-mode bypassPermissions` and `--add-dir`
4. **Run from plugin dir**: Skills only load when running from the superpowers directory
5. **Show token usage**: Always include token analysis for cost visibility
6. **Test real behavior**: Verify actual files created, tests passing, commits made
## Session Transcript Format
Session transcripts are JSONL (JSON Lines) files where each line is a JSON object representing a message or tool result.
### Key Fields
```json
{
  "type": "assistant",
  "message": {
    "content": [...],
    "usage": {
      "input_tokens": 27,
      "output_tokens": 3996,
      "cache_read_input_tokens": 1213703
    }
  }
}
```
### Tool Results
```json
{
  "type": "user",
  "toolUseResult": {
    "agentId": "3380c209",
    "usage": {
      "input_tokens": 2,
      "output_tokens": 787,
      "cache_read_input_tokens": 24989
    },
    "prompt": "You are implementing Task 1...",
    "content": [{"type": "text", "text": "..."}]
  }
}
```
The `agentId` field links to subagent sessions, and the `usage` field contains token usage for that specific subagent invocation.
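For illustration, a minimal sketch that aggregates usage per subagent from just these two fields. It is hypothetical; the real `analyze-token-usage.py` also extracts descriptions and estimates cost:
```python
import json
from collections import defaultdict

# Sum input/output tokens per subagent, keyed by agentId
usage_by_agent = defaultdict(lambda: {"input": 0, "output": 0})
with open("session.jsonl") as f:  # path to a transcript, found as above
    for line in f:
        result = json.loads(line).get("toolUseResult")
        if isinstance(result, dict) and "agentId" in result:
            usage = result.get("usage", {})
            totals = usage_by_agent[result["agentId"]]
            totals["input"] += usage.get("input_tokens", 0)
            totals["output"] += usage.get("output_tokens", 0)

for agent_id, totals in sorted(usage_by_agent.items()):
    print(f"{agent_id}: {totals['input']} in / {totals['output']} out")
```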