Commit Graph

92 Commits

Author SHA1 Message Date
Jesse Vincent
caf14aac66 test(sdd): wire test-sdd-workspace.sh into the runner; note git clean -fdx
The per-worktree workspace test was added but never registered in
run-skill-tests.sh, so it only ran when invoked by hand. Add it to the
fast unit-test array alongside the other pure-shell test.

Also document, in the Durable Progress section, that the ledger now
lives in git-ignored working-tree scratch, so `git clean -fdx` deletes
it — recover from `git log` if that happens.
2026-06-18 15:44:22 -07:00
Jesse Vincent
667b2c4a2e test(sdd): lock in per-worktree workspace isolation (#1780) 2026-06-18 15:44:22 -07:00
Jesse Vincent
93b8444b51 fix(sdd): write artifacts to working-tree .superpowers/sdd, not .git/ (#1780) 2026-06-18 15:44:22 -07:00
Jesse Vincent
207a12b203 feat(sdd): add sdd-workspace helper for a self-ignoring artifact dir 2026-06-18 15:44:22 -07:00
Drew Ritter
29c0b1b7db fix: read Codex plugin version from manifest (PRI-2240) 2026-06-16 17:02:33 -07:00
Drew Ritter
cf32920d3a fix: exclude repo metadata from Codex sync (PRI-1168) 2026-06-16 17:02:33 -07:00
Drew Ritter
b3ee712d3a Add visual companion Prime Radiant branding 2026-06-16 10:09:47 -07:00
Drew Ritter
3a907d6a0a Fix companion stop metadata and token permissions 2026-06-16 10:09:46 -07:00
Drew Ritter
1c80914052 Harden Windows browser launcher 2026-06-16 10:09:46 -07:00
Drew Ritter
2a8479b21d Fix Windows lifecycle validation 2026-06-16 10:09:46 -07:00
Drew Ritter
69ed41af9e Fix companion test cleanup and argv assertions 2026-06-16 10:09:46 -07:00
Drew Ritter
51323e4c64 Harden companion platform tests 2026-06-16 10:09:46 -07:00
Drew Ritter
3402d4e7d7 Fix companion lifecycle test ownership metadata 2026-06-16 10:09:46 -07:00
Drew Ritter
6bc49f0183 Harden companion stop ownership proof 2026-06-16 10:09:46 -07:00
Drew Ritter
8f2525a803 Isolate companion fallback tokens 2026-06-16 10:09:45 -07:00
Drew Ritter
85914fbcf8 Fix server test fallback cleanup 2026-06-16 10:09:45 -07:00
Drew Ritter
0410679757 Harden root screen containment 2026-06-16 10:09:45 -07:00
Drew Ritter
69270c9007 Harden companion Windows lifecycle coverage 2026-06-16 10:09:45 -07:00
Drew Ritter
b17d54f839 Harden brainstorm companion auth regressions 2026-06-16 10:09:45 -07:00
Jesse Vincent
7fbae0252f fix(brainstorm-server): fix auth-integration bugs from full-branch review
A second adversarial review of the merged branch found that combining the
session-key auth with the feature work created real bugs the (vacuous) tests
missed:

- [Critical] GET /files/ (empty name) resolved to CONTENT_DIR and crashed the
  process with uncaught EISDIR — newly reachable because the query-stripping
  refactor turns /files/?key=... into /files/. Reject non-regular-file names.
- [High] --open opened a KEYLESS url, which the auth gate 403s — the headline
  feature landed on the error page. Open the keyed url.
- [High] Same-port restart regenerated the token (port persisted, token not), so
  the open tab's old cookie 403'd and never reconnected — contradicting the
  documented promise. Persist the token (BRAINSTORM_TOKEN_FILE / .last-token)
  alongside the port.
- [Medium] Token sat in world-readable server-info/server.log (0644 in /tmp).
  umask 077 in start-server.sh + mode 0600 on server-info/.last-token.
- [Medium] touchActivity() ran before the auth check, so unauthenticated requests
  defeated the idle timeout. Count activity only after authorization.
- [Low] COOKIE_NAME embedded the pre-fallback port; derive it from the actual
  bound port (also prevents a cross-server cookie-jar collision on fallback).

Tests added/strengthened (previously passed vacuously): /files/ no-crash; the
auto-open url carries the key and is reachable (200); restart reuses the same key
not just the port; unauthenticated requests don't reset the idle clock.
Full suite green (ws-protocol 32, helper 12, auth 13, server 29, lifecycle 8,
stop-server 4); restart smoke confirms same port+key and old URL -> 200.
2026-06-16 10:09:45 -07:00
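Note on 7fbae0252f: the EISDIR crash came from treating an empty /files/ name as a path to read. The sketch below shows one way to reject anything that is not a regular file before opening it. The function name `resolveContentFile` is hypothetical, and restricting lookups to direct children of the content directory is an assumption, not something the commit states.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: returns an absolute path only when `name` refers to a
// regular file directly inside `contentDir`; returns null otherwise.
function resolveContentFile(contentDir, name) {
  // Empty names (the "/files/" case) and names with path separators are rejected.
  if (!name || name !== path.basename(name)) return null;
  const full = path.join(contentDir, name);
  try {
    // A directory, socket, or other non-regular entry is never served.
    return fs.statSync(full).isFile() ? full : null;
  } catch {
    return null; // missing file or unreadable path
  }
}
```

A caller would answer 404 (or similar) on `null` instead of passing the path to a read call, so a directory can never reach the file reader.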
Jesse Vincent
01de36703d test(brainstorm-server): thread session key through tests after auth merge
Integrating the per-session-key auth onto the same branch as the dotfile and
lifecycle work: two tests added after the auth commit opened WebSockets without a
key (server.test.js dotfile-reload, lifecycle.test.js idle-shutdown), which the
auth gate now resets. Pass ?key=/BRAINSTORM_TOKEN in both. Full suite green:
ws-protocol 32, helper 12, auth 13, server 28, lifecycle 7, stop-server 4.
2026-06-16 10:09:45 -07:00
Jesse Vincent
cb5bb885fd feat(brainstorm-server): gate every endpoint behind a per-session key
The companion server is reachable by any local browser tab (default loopback
bind) and by any host that can route to it (remote --host bind). It served
screens, files, and accepted event-injecting WebSocket connections with no
authentication, so a malicious browser tab or a direct remote client could read
brainstorm content or inject events that the agent reads as the user's input
(prompt injection into a live session).

Generate a per-session secret token, carry it in the served URL as ?key=, and
mirror it into an HttpOnly SameSite=Strict per-port cookie on first load so
same-origin subresources and the WebSocket handshake authenticate automatically.
Every HTTP request and WebSocket upgrade now requires a valid key (query or
cookie, constant-time compared); unauthenticated requests get a friendly 403
explaining they need the full URL. A secret authenticates the client uniformly
across loopback, tunnel, and remote binds and defeats DNS rebinding, which a
Host/Origin allowlist cannot.

Also guard handleMessage against a null JSON payload that crashed the process.

Tests: new auth.test.js (13 cases) covering the key on /, /files/*, and WS plus
cookie bootstrap and the null-payload guard; server.test.js threads the key;
ws-protocol.test.js + auth.test.js wired into npm test.

Closes #1014
Refs #1110, #1553, #1504
2026-06-16 10:09:45 -07:00
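Note on cb5bb885fd: a minimal sketch of the per-session key check described above, assuming a hex token and a simple cookie parser. The names `generateSessionKey`, `keysMatch`, and `isAuthorized` are hypothetical, and the 32-byte token length is an assumption; the commit states only that the key is accepted from the query string or a cookie and compared in constant time.

```javascript
const crypto = require('node:crypto');

// Assumption: 32 random bytes, hex-encoded. The commit does not give a length.
function generateSessionKey() {
  return crypto.randomBytes(32).toString('hex');
}

// Constant-time comparison. crypto.timingSafeEqual throws when the buffers
// differ in length, so both sides are hashed to fixed-size digests first.
function keysMatch(presented, expected) {
  if (typeof presented !== 'string' || typeof expected !== 'string') return false;
  const a = crypto.createHash('sha256').update(presented).digest();
  const b = crypto.createHash('sha256').update(expected).digest();
  return crypto.timingSafeEqual(a, b);
}

// Accept the key from ?key= or from the named cookie.
// The cookie parsing here is simplified and assumes values without '='.
function isAuthorized(requestUrl, cookieHeader, cookieName, expectedKey) {
  const url = new URL(requestUrl, 'http://localhost');
  const fromQuery = url.searchParams.get('key');
  const fromCookie = (cookieHeader || '')
    .split(';')
    .map((part) => part.trim().split('='))
    .filter(([name]) => name === cookieName)
    .map(([, value]) => value)[0];
  return keysMatch(fromQuery, expectedKey) || keysMatch(fromCookie, expectedKey);
}
```

The same check would run for both HTTP requests and WebSocket upgrades, with a 403 returned when it fails.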
Jesse Vincent
7c805f34d2 fix(brainstorm-server): tie stop-server PID check to the session's port
The node+server.cjs command match (from the adversarial review) still matched any
unrelated node process running a file named server.cjs. When we recorded the
bound port (state/server-info) and lsof is available, additionally require the
PID to be the process actually LISTENING on this session's port — which rules out
a different project's server.cjs / editor task runner that recycled the stale
PID. Falls back to the command match when the port or lsof isn't available.

Test: a 'node server.cjs' process not listening on the recorded port is spared.

Refs #1703
2026-06-16 10:09:45 -07:00
Jesse Vincent
fb08947ded fix(brainstorm-server): address adversarial review findings
From a two-reviewer adversarial pass:

- [High] EADDRINUSE fallback clobbered the shared .last-port: onListen wrote the
  bound port unconditionally, so a fallback to a random port overwrote the
  preferred port another live session still owns — stranding that session's open
  tab forever. Now persist only when we bound the preferred port (not on
  fallback). The fallback test now asserts .last-port integrity (teeth-verified).

- [Medium] maybeOpenBrowser ran the URL through a shell (exec + JSON.stringify),
  which does NOT neutralize $(...) in a url-host. Platform launchers now use
  execFile with the URL as an argv element (no shell). The operator-set
  BRAINSTORM_OPEN_CMD path stays shell-based (trusted input).

- [Medium] --open was a silent no-op on native Windows (no win32 branch). Added.

- [Medium] helper.js reconnect/status/tombstone had only substring-grep tests.
  Added behavioral tests driving the state machine against a mocked browser:
  Reconnecting+backoff (500->1000->2000), tombstone after the grace period, and
  reload-on-recovery.

- [Low] status pill showed a false 'Connected' before the socket opened; now
  starts 'Connecting…' until onopen.

Not changed (flagged): stop-server.sh's PID-ownership check still matches any
'node ... server.cjs' (narrow residual — a recycled PID onto an unrelated node
server.cjs); robust fix needs fragile cross-platform process introspection.
2026-06-16 10:09:45 -07:00
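Note on fb08947ded: the shell-injection finding is about how the URL reaches the launcher. The sketch below illustrates the execFile approach, where the URL is a single argv element and no shell ever parses it. The helper names are hypothetical, and only the macOS and Linux launchers are shown; the exact WSL and native Windows commands are not given in the commit message, so they are left out rather than guessed.

```javascript
const { execFile } = require('node:child_process');

// Hypothetical helper. Returns the launcher command and argv for a platform,
// or null when this sketch does not cover the platform.
function launcherFor(platform, url) {
  if (platform === 'darwin') return { cmd: 'open', args: [url] };
  if (platform === 'linux') return { cmd: 'xdg-open', args: [url] };
  return null;
}

// Best-effort open: errors go to the callback and are never thrown.
function openInBrowser(url, done) {
  const launcher = launcherFor(process.platform, url);
  if (!launcher) return done(new Error('no launcher for ' + process.platform));
  // execFile hands args straight to the program, so "$(...)" in the URL
  // stays literal text instead of being expanded by a shell.
  execFile(launcher.cmd, launcher.args, (err) => done(err || null));
}
```

With `exec`, the same URL would be interpolated into a command string and interpreted by the shell, which is the behavior the review flagged.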
Jesse Vincent
463dfb7fd4 feat(brainstorm-server): opt-in auto-open of the browser on the first screen
When the user approves the visual companion, open their browser automatically the
first time a screen is actually ready to show — rather than at startup (just the
waiting page) or making them open the URL by hand.

Opt-in and gated on approval: off unless BRAINSTORM_OPEN is set (start-server.sh
--open, which the agent passes only after the user agrees to use the companion).
Even then it fires once, and is skipped if a browser is already connected, on a
non-loopback/remote bind, or when headless. Launcher is the platform default
(open / xdg-open / WSL cmd.exe) or BRAINSTORM_OPEN_CMD; best-effort, never fatal.

lifecycle.test.js: opens once on the first screen when approved; does NOT open
without approval.

Closes #755
Refs #759
2026-06-16 10:09:45 -07:00
Jesse Vincent
dd9fcc21ee feat(brainstorm-server): reuse the same port on session restart
When the companion idle-shuts-down and the agent restarts it, a fresh random
port meant the user's open browser tab pointed at a dead URL. Persist the bound
port per project and prefer it on the next start, so the restarted server comes
up on the same port and the open tab's reconnect just works.

- start-server.sh exports BRAINSTORM_PORT_FILE=<project>/.superpowers/brainstorm/
  .last-port for project sessions (not /tmp).
- server.cjs prefers an explicit BRAINSTORM_PORT, else the recorded port, else
  random; writes the actually-bound port back; and on EADDRINUSE (preferred port
  still in use) falls back to a random port once instead of crashing.

lifecycle.test.js: restart reuses the recorded port; a taken preferred port
falls back to a random one without crashing.

Refs #1237
2026-06-16 10:09:45 -07:00
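Note on dd9fcc21ee: the port precedence and the single EADDRINUSE fallback can be sketched as two small pure functions. The function names are hypothetical, and using port 0 to request a random free port from the OS is an assumption about how "random" is obtained.

```javascript
// Port 0 asks the OS to pick any free port.
const RANDOM_PORT = 0;

// Returns a valid TCP port number, or null when the input is missing or invalid.
function parsePort(raw) {
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 && n < 65536 ? n : null;
}

// Precedence from the commit: explicit BRAINSTORM_PORT, else the recorded
// port, else a random port.
function choosePort(explicitPort, recordedPort) {
  return parsePort(explicitPort) ?? parsePort(recordedPort) ?? RANDOM_PORT;
}

// On EADDRINUSE, fall back to a random port exactly once.
function nextPortAfterError(errorCode, alreadyFellBack) {
  if (errorCode === 'EADDRINUSE' && !alreadyFellBack) return RANDOM_PORT;
  return null; // caller should surface the error instead of retrying
}
```

For example, `choosePort(process.env.BRAINSTORM_PORT, recorded)` would yield the recorded port on a plain restart, which is what lets an open tab reconnect.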
Jesse Vincent
36ac3e1336 feat(brainstorm-companion): resilient reconnect, live status, paused overlay
The injected client reconnected on a fixed 1s timer with no feedback: if the
laptop slept or the server restarted, the page showed 'Connected' over a dead
socket and silently queued events. And when the server stopped, the user got a
bare connection-refused with no explanation.

helper.js now:
- reconnects with exponential backoff (500ms, doubling, capped at 30s; reset on
  open), with an onerror->close handler, nulls the socket on close, and clears a
  pending timer before scheduling another;
- drives the frame status pill Connected/Reconnecting/Disconnected via a
  --status-color custom property (frame-template.html);
- after ~15s disconnected, shows a self-styled 'Companion paused' overlay
  (tombstone) explaining the companion stopped and will reconnect automatically;
- on recovery from a tombstoned outage (e.g. server restarted on the same port)
  reloads to pick up the restarted server's current screen.

The reconnect-backoff is an exported pure function; helper.test.js unit-tests it
(doubling + cap progression) and asserts the status/tombstone/reconnect wiring.
DOM behaviour is verified live.

Refs #856, #1237
2026-06-16 10:09:45 -07:00
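Note on 36ac3e1336: the backoff policy (500ms, doubling, capped at 30s) is described as an exported pure function. A minimal sketch of such a function follows; the name `reconnectDelay` and the 0-based attempt counter are assumptions, not the names used in helper.js.

```javascript
// Policy from the commit message: start at 500 ms, double per failed attempt,
// cap at 30 s. Names below are hypothetical.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30000;

// Delay before reconnect attempt number `attempt` (0-based).
// Resetting on a successful open means the caller sets attempt back to 0.
function reconnectDelay(attempt) {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Delays for the first eight consecutive failures.
const delays = Array.from({ length: 8 }, (_, i) => reconnectDelay(i));
console.log(delays.join(' -> '));
// 500 -> 1000 -> 2000 -> 4000 -> 8000 -> 16000 -> 30000 -> 30000
```

Keeping the function free of timers and sockets is what makes the doubling and cap progression unit-testable without a browser.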
Jesse Vincent
56757f6877 feat(brainstorm-server): 4h configurable idle timeout; close WS on shutdown
The companion shut down after only 30 minutes idle — too short for real
brainstorming, where a single question can sit far longer. And shutdown() never
closed upgraded WebSocket sockets, so an open browser connection could keep the
Node process alive after it was supposed to exit.

- Default idle timeout raised to 4 hours, configurable via BRAINSTORM_IDLE_TIMEOUT_MS
  and start-server.sh --idle-timeout-minutes (validated positive integer).
- Reported as idle_timeout_ms in the server-started JSON / server-info.
- shutdown() now destroys all client sockets so the process exits even with an
  open WebSocket.
- Watchdog check interval is configurable (BRAINSTORM_LIFECYCLE_CHECK_MS, default
  60s) so the lifecycle can be tested without minute-long waits.

Adds lifecycle.test.js (configured timeout reported; idle shutdown exits despite
an open WS — teeth-verified; the start-server flag). Wires ws-protocol,
lifecycle, and stop-server suites into npm test.

Closes #1237
Refs #1689
2026-06-16 10:09:45 -07:00
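Note on 56757f6877: a rough sketch of reading the idle timeout with a 4-hour default and positive-integer validation. The function names are hypothetical, and throwing on invalid input is an assumption; the commit says only that the --idle-timeout-minutes flag is validated as a positive integer.

```javascript
const DEFAULT_IDLE_TIMEOUT_MS = 4 * 60 * 60 * 1000; // 4 hours

// Accepts only decimal positive integers such as "1" or "240".
function parsePositiveInt(raw, label) {
  if (!/^[1-9][0-9]*$/.test(String(raw))) {
    throw new Error(label + ' must be a positive integer, got: ' + raw);
  }
  return Number(raw);
}

// Value of BRAINSTORM_IDLE_TIMEOUT_MS, falling back to the default when unset.
function idleTimeoutMs(raw) {
  if (raw === undefined || raw === '') return DEFAULT_IDLE_TIMEOUT_MS;
  return parsePositiveInt(raw, 'BRAINSTORM_IDLE_TIMEOUT_MS');
}

// Converts a --idle-timeout-minutes value to milliseconds.
function minutesToMs(rawMinutes) {
  return parsePositiveInt(rawMinutes, '--idle-timeout-minutes') * 60 * 1000;
}
```

Usage would look like `idleTimeoutMs(process.env.BRAINSTORM_IDLE_TIMEOUT_MS)`, with the result reported as `idle_timeout_ms`.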
Jesse Vincent
5ddce063df fix(brainstorm-server): verify PID ownership before stopping
stop-server.sh read server.pid and SIGKILL'd that PID with no checks. After a
reboot or PID wraparound the pid file can point at an unrelated, live process —
which we would then kill.

Verify the PID is actually our server (a running 'node ... server.cjs') before
signalling it. If ownership can't be proven, fail closed: remove the stale pid
file and report {status: stale_pid} without killing anything. Real servers still
stop ({status: stopped}); a missing pid file still reports not_running.

Adds stop-server.test.sh covering: an unrelated reused PID is left alone, a real
server is stopped, and a missing pid file.

Refs #1703
2026-06-16 10:09:45 -07:00
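Note on 5ddce063df: stop-server.sh is a shell script, so the following is a JavaScript restatement of its fail-closed decision, not the script itself. The function name and the regular expression are hypothetical; the commit states only that the PID must belong to a running 'node ... server.cjs' and names the three statuses.

```javascript
// Decide what to do given the PID read from the pid file (null when the file
// is missing) and the command line of the process holding that PID (null when
// no such process exists).
function stopDecision(pidFromFile, commandLine) {
  if (pidFromFile == null) return { status: 'not_running', kill: false };
  // Ownership proof: the command must look like "node ... server.cjs".
  const ours =
    typeof commandLine === 'string' &&
    /\bnode\b.*\bserver\.cjs\b/.test(commandLine);
  // Fail closed: without proof, report stale_pid and signal nothing.
  return ours
    ? { status: 'stopped', kill: true }
    : { status: 'stale_pid', kill: false };
}
```

The later commit 7c805f34d2 tightens this proof further by also requiring the PID to be listening on the session's recorded port when lsof is available.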
Jesse Vincent
2b108b7dc2 fix(brainstorm-server): ignore macOS resource-fork dotfiles
On macOS (and ExFAT/SMB volumes) the OS writes ._<name>.html sidecar files
holding binary resource-fork metadata. These end with .html, so they passed the
content filter and could be picked as the newest screen — serving binary garbage
to the browser instead of the mockup — or fetched via /files/.

Skip dotfiles (leading '.') at all four sites that list or serve content:
getNewestScreen, the /files/ endpoint, the known-files seed, and the fs.watch
handler. Tests cover serving (/ and /files/) and the watch path (a ._ file must
not trigger a reload).

Refs #950
2026-06-16 10:09:45 -07:00
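Note on 2b108b7dc2: the dotfile rule amounts to one predicate applied wherever content is listed or served. A minimal sketch, with the hypothetical name `isScreenFile`:

```javascript
// Hypothetical predicate: a screen is an .html file whose name does not start
// with '.', which excludes macOS "._<name>.html" resource-fork sidecars.
function isScreenFile(name) {
  return !name.startsWith('.') && name.endsWith('.html');
}

const listing = ['mockup.html', '._mockup.html', '.hidden.html', 'notes.txt'];
console.log(listing.filter(isScreenFile)); // [ 'mockup.html' ]
```

Sharing one predicate across the four sites named in the commit would keep the listing, serving, seeding, and watch paths from drifting apart.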
Rahul
d9d3d99245 fix(brainstorming): cap websocket frame payloads 2026-06-16 10:09:45 -07:00
Drew Ritter
21b44e44d3 Add shell lint script 2026-06-16 10:09:45 -07:00
Drew Ritter
2c2e2bcbd4 Tighten Kimi plugin porting coverage 2026-06-16 10:09:44 -07:00
Drew Ritter
f61300eac8 fix: wire Kimi plugin into release metadata 2026-06-16 10:09:44 -07:00
Jesse Vincent
36ce0a21e4 feat: add Antigravity CLI (agy) support
Antigravity (Google's `agy` CLI) installs the existing Superpowers plugin
directly:

    agy plugin install https://github.com/obra/superpowers

agy imports the bundled skills and runs the plugin's SessionStart hook, so
using-superpowers bootstraps from the first message — verified on agy 1.0.3:
a fresh session given "Let's make a react todo list" auto-triggers the
brainstorming skill instead of writing code. agy discovers skills natively
and, having no Skill tool, loads them by reading SKILL.md with view_file.

No scaffold, installer, or generated context file is needed. This adds only:

- README.md: an Antigravity install section + Quickstart link
- skills/using-superpowers/SKILL.md: reference to the agy tool mapping
- skills/using-superpowers/references/antigravity-tools.md: action->tool
  mapping for agy (view_file, write_to_file, invoke_subagent, manage_task,
  and skill loading via view_file on SKILL.md)
- tests/antigravity/: structural test for the tool mapping, mirroring
  tests/pi/
2026-06-16 10:09:44 -07:00
Jesse Vincent
95aa3d5007 Align windows-lifecycle test with current brainstorm server layout
The test had drifted behind three server implementation changes and no
longer ran against the actual server:

- Server entrypoint renamed from server.js to server.cjs; the test still
  invoked node on server.js and failed with MODULE_NOT_FOUND.
- Server state moved to a state/ subdirectory (state/server-info,
  state/server.pid); the test still waited on .server-info and wrote
  .server.pid at the session root.
- Owner-PID startup validation now keeps the server running when the
  owner PID is dead at startup: it logs owner-pid-invalid, disables
  owner monitoring, and falls back to the idle timeout. The test still
  expected the server to self-terminate within 60s of a dead-at-startup
  owner.

Update file/path references to match the current server, and rewrite
the dead-at-startup test to assert the current behavior: server
survives, log contains owner-pid-invalid, log does not contain a
spurious "owner process exited" line.

Verified locally: 9 passed, 0 failed, 3 skipped (Windows-only).
2026-06-16 10:09:44 -07:00
Drew Ritter
1e7cd987d3 [codex] support native Codex plugin hooks (#1540)
* docs: specify Codex native hooks parity

* docs: refine Codex hooks spec after review

* docs: record Codex hook contract spike

* docs: plan Codex native hooks implementation

* feat: support Codex native plugin hooks

* test: add Codex native hook drill coverage

* Simplify Codex hook entrypoint
2026-06-16 10:09:44 -07:00
Jesse Vincent
3406f5d80f chore: keep pi extension under .pi 2026-06-16 10:09:44 -07:00
Jesse Vincent
71ac601627 feat: add pi superpowers package extension 2026-06-16 10:09:44 -07:00
Drew Ritter
f030d6ef88 Tighten cross-platform tool references 2026-06-16 10:09:43 -07:00
Jesse Vincent
6ec8686477 Phase D: cross-runtime tweaks (visual-companion, executing-plans, test)
Misc platform/runtime statements and adjacencies that don't fit the
prose, config-ref, README-ordering, or tool-vocabulary buckets:

- visual-companion frame template: rename CSS/HTML id #claude-content
  → #frame-content. The id is purely styling — nothing external
  references it. The brainstorm-server test that asserted the old
  string is updated in lockstep.

- visual-companion launch instructions: add a Copilot CLI section
  alongside Claude Code, Codex, and Gemini CLI; combine the Claude
  Code (macOS / Linux) and (Windows) sections so heading style
  matches the other (non-OS-qualified) platforms.

- visual-companion: "Use Write tool" → "Use your file-creation tool"
  for the cat/heredoc warning. The prohibition is what's load-
  bearing, not the tool name.

- executing-plans/SKILL.md: list all subagent-capable runtimes
  (Claude Code, Codex CLI, Codex App, Copilot CLI, Gemini CLI) and
  point at the per-platform tool refs as the source of truth.

- executing-plans/SKILL.md: relative path "using-superpowers/
  references/" → "../using-superpowers/references/" to resolve
  correctly from the executing-plans/ directory.

No bundled spec doc here — Phase D was scope-extension work that
took place across rounds, with no standalone spec authored.
2026-06-16 10:09:43 -07:00
Drew Ritter
741c232768 Move eval harness to submodule (#1541) 2026-06-16 10:09:43 -07:00
Drew Ritter
d00f4ad442 fix: remove global worktree path fallback (#1476) 2026-06-16 10:09:43 -07:00
Jesse Vincent
a325106502 Address adversarial review findings
- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml
  uses [project.optional-dependencies], so --dev is a no-op for
  pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
  from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
  section with test-worktree-native-preference.sh (the worktree test
  is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
  list. evals/backends/ has claude*, codex, gemini configs but no
  copilot.yaml, so the claim was unsupported.

Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins
this round.
2026-06-16 10:09:43 -07:00
Jesse Vincent
315ef09ebc tests: annotate three kept bash tests with drill coverage notes
- test-worktree-native-preference.sh: drill covers PRESSURE phase only;
  RED + GREEN baselines have no drill counterpart and are kept so
  the RED-GREEN-REFACTOR validation remains rerunnable end-to-end.
- test-subagent-driven-development-integration.sh: drill covers the
  YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds
  >=3 commits, >=2 subagent dispatches, TodoWrite usage, test file
  existence check, and token-budget telemetry. Kept until drill
  scenario covers those or they are retired.
- test-subagent-driven-development.sh: tests agent's ability to
  *describe* SDD (string matches against expected keywords). Drill
  scenarios test behavior, not description-recall. Kept by design.

Subagent verification recorded in commit messages of subsequent
deletions; gap analyses driving these annotations are also in the
verification subagent reports for the gating sweep.
2026-06-16 10:09:43 -07:00
Jesse Vincent
12ef68d55e tests: remove test-requesting-code-review.sh (covered by drill code-review-catches-planted-bugs)
Subagent verification: every bash assertion (skill invocation,
subagent dispatch, SQL injection flagged, credential handling
flagged, no merge approval) maps to drill verify checks. Drill is
stricter: bundles severity (Critical/Important) into the same
criteria as the finding itself (bash split severity into a separate
test). Setup parity covered (src/db.js with string concat + identity
hash, two commits).

The drill scenario header explicitly says it is the
"cross-harness, semantically-judged replacement for the bash test."
2026-06-16 10:09:43 -07:00
Jesse Vincent
ea8aad8764 tests: remove test-document-review-system.sh (covered by drill spec-reviewer-catches-planted-flaws)
Subagent verification: every bash assertion (TODO in Requirements
section flagged, "specified later" deferral flagged, Issues section
present, did-not-approve verdict) maps to drill verify.criteria
entries. Setup parity covered by setup.assertions (test-feature-design.md
exists with TODO + 'specified later' content). Drill is stricter:
asserts tool-called Agent (subagent dispatch) which the bash test
did not check.
2026-06-16 10:09:43 -07:00
Jesse Vincent
1f0ad3817d tests: remove subagent-driven-dev fixtures (covered by drill sdd-go-fractals + sdd-svelte-todo)
The bash test had ZERO output assertions — it just ran claude -p
and printed token usage. Drill's scenarios are strictly more
rigorous:

go-fractals: skill-called SDD + tool-called Agent + go test ./...
passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria
verifying real SDD workflow.

svelte-todo: skill-called SDD + tool-called Agent + npm test passes
+ playwright e2e passes + package.json + svelte.config.js or
vite.config.ts + >=4 commits + LLM criteria.

design.md and plan.md are byte-identical between bash fixtures and
drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/).
Drill's setup helper (scaffold_sdd_*) forces git init -b main
(stricter than bash's reliance on init.defaultBranch). The
.claude/settings.local.json from bash scaffold.sh is unnecessary
for drill since permissions are managed via backend YAML.

Subagent verification: SAFE TO DELETE for both.
2026-06-16 10:09:43 -07:00
Jesse Vincent
7fd1ac7bfc tests: remove run-claude-describes-sdd.sh (covered by drill mid-conversation-skill-invocation)
Subagent verification: every bash assertion (Skill tool invoked +
specific skill name 'subagent-driven-development' loaded after the
agent describes it conversationally in turn 1) maps to the drill
scenario's skill-called assertion + criteria paragraph requiring
the skill to fire in direct response to the second user message.
Drill additionally asserts tool-called Agent (subagent dispatch)
which is stricter than the bash test.

Other runners in tests/explicit-skill-requests/ (haiku, multiturn,
extended-multiturn) and their prompt files are preserved — they
have no drill coverage and exercise different behaviors.
2026-06-16 10:09:43 -07:00
Jesse Vincent
8611a4ea97 tests: remove skill-triggering bash prompts (covered by drill triggering-* scenarios)
Subagent verification confirmed each prompt's intent matches its
corresponding drill scenario's turns[].intent verbatim, and each
scenario has both a deterministic skill-called assertion and a
semantic LLM criterion confirming the matching skill was loaded
(actually a stronger check than the bash test, which only confirms
the skill fires anywhere in the stream).

All 6 prompts deleted. The runner had no remaining prompts to drive,
so run-test.sh and run-all.sh deleted as well.
2026-06-16 10:09:43 -07:00