Mirror of https://github.com/obra/superpowers.git (synced 2026-05-09 18:49:04 +08:00)
# Testing Superpowers
Superpowers has two distinct kinds of tests, each in its own directory:
- `tests/` — does the plugin's non-LLM code work? Bash + node + python integration tests for brainstorm-server JS, OpenCode plugin loading, codex-plugin sync, and analysis utilities.
- `evals/` — do agents behave correctly on real LLM sessions? Python harness driving real tmux sessions of Claude Code / Codex / Gemini CLI / Copilot CLI, with an LLM actor and verifier judging skill compliance.
## Plugin tests
Live in `tests/`. Currently:

- `tests/brainstorm-server/` — node test suite for the brainstorm server JS code.
- `tests/opencode/` — bash tests for OpenCode plugin loading, bootstrap caching, and tool registration.
- `tests/codex-plugin-sync/` — bash sync verification.
- `tests/claude-code/test-helpers.sh`, `analyze-token-usage.py` — utilities used by the remaining bash tests.
- `tests/claude-code/test-subagent-driven-development.sh` — agent-can-describe-SDD test (no drill counterpart; tests description recall, not behavior).
- `tests/claude-code/test-subagent-driven-development-integration.sh` — extended SDD integration with token analysis (drill covers the YAGNI subset; bash adds commit-count, TodoWrite, and token-telemetry assertions).
- `tests/claude-code/test-worktree-native-preference.sh` — RED-GREEN-REFACTOR validation for the worktree skill (drill covers the PRESSURE phase; bash also covers the RED/GREEN baselines).
- `tests/explicit-skill-requests/` — Haiku-specific, multi-turn, and skill-name-prompted tests not covered by drill.
Run plugin tests via the relevant directory's `run-*.sh` or `npm test`.
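The bash suites can also be driven in one pass. A minimal sketch, assuming each bash suite exposes a `run-*.sh` entry point as described above (npm-based suites such as `tests/brainstorm-server/` still need `npm test` separately):

```shell
#!/usr/bin/env bash
# Run every bash plugin suite that provides a run-*.sh entry point.
# Sketch only: relies on the run-*.sh convention; npm suites are not covered.
set -e
for runner in tests/*/run-*.sh; do
  [ -e "$runner" ] || continue   # glob matched nothing; skip
  echo "== $runner"
  bash "$runner"
done
```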
## Skill behavior evals
Live in `evals/`. Drill is the harness; scenarios live at `evals/scenarios/*.yaml`. See `evals/README.md` for setup. Quick start:
```shell
cd evals
uv sync --extra dev
export ANTHROPIC_API_KEY=sk-...
uv run drill run triggering-test-driven-development -b claude
```
Drill scenarios are slow (3 to 30+ minutes each) and run real LLM sessions. They are not part of CI today; the natural follow-up is a tiered model: a fast subset on each PR, plus a full sweep nightly and on demand.
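The full-sweep tier could start as a simple command generator. A minimal sketch that prints one drill invocation per scenario file, so a fast PR tier can filter the list before piping it to `sh` (the invocation shape mirrors the quick start above):

```shell
#!/usr/bin/env bash
# Emit one `drill run` command per scenario file; pipe through a filter
# for a fast PR subset, or straight to sh for the nightly full sweep.
for s in evals/scenarios/*.yaml; do
  [ -e "$s" ] || continue            # no scenarios found; emit nothing
  echo "uv run drill run $(basename "$s" .yaml) -b claude"
done
```

For example, `bash sweep.sh | grep triggering- | sh` would run only the `triggering-*` scenarios (the `sweep.sh` name is hypothetical).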