superpowers

mirror of https://github.com/obra/superpowers.git synced 2026-08-01 07:01:36 +08:00

Author	SHA1	Message	Date
Jesse Vincent	3ed554d557	Close the Minor-severity escape hatch With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.	2026-06-15 12:15:06 -07:00
Jesse Vincent	d7726d99dc	Merge per-task reviews into one task reviewer (iteration 2) Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.	2026-06-15 12:15:06 -07:00
Jesse Vincent	4c1f1e5cc5	Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust-until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.	2026-06-15 12:15:06 -07:00
Jesse Vincent	7288393773	Add phrase-level pre-judging triggers to reviewer prompt rule Resumed the offending eval controller session and asked it why it pre-judged despite the rule being in context. Its retrospective: the motive was avoiding a review loop, the abstract rule was read but not applied at the moment it governs, and a phrase-level trigger ('do not flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose') would have fired where the principle did not.	2026-06-15 12:15:06 -07:00
Jesse Vincent	254a8e2e32	Red Flags: never tell a reviewer what not to flag or pre-rate severity Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.	2026-06-15 12:15:06 -07:00
Jesse Vincent	7c11cee649	Close three review blind spots found by defect tracing Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed: - reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence) - no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements) - test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)	2026-06-15 12:15:06 -07:00
Jesse Vincent	b36cf86afd	Require explicit model on subagent dispatch In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).	2026-06-15 12:15:06 -07:00
Jesse Vincent	06bec17a34	Forbid controllers pre-judging reviewer findings A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.	2026-06-15 12:15:06 -07:00
Jesse Vincent	d519ba65fd	SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment	2026-06-15 12:15:06 -07:00
Jesse Vincent	d32a56dc32	Implementer prompt: re-run covering tests after fixing review findings	2026-06-15 12:15:06 -07:00
Jesse Vincent	994bc26d2a	Scope spec reviewer's Your Job wording to the diff	2026-06-15 12:15:06 -07:00
Jesse Vincent	d5850df1bc	Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel	2026-06-15 12:15:06 -07:00
Jesse Vincent	b5edd40d2c	Use bare placeholder names in quality reviewer prompt body	2026-06-15 12:15:06 -07:00
Jesse Vincent	6a02446953	Make per-task quality reviewer prompt self-contained and task-scoped	2026-06-15 12:15:06 -07:00
Drew Ritter	93f2ce91b8	Fix companion stop metadata and token permissions	2026-06-11 13:53:06 -07:00
Drew Ritter	e9ee6c5b4d	Harden Windows browser launcher	2026-06-11 13:53:06 -07:00
Drew Ritter	5415cb8ccf	Fix Windows lifecycle validation	2026-06-11 13:53:06 -07:00
Drew Ritter	1c21a91e01	Align visual companion docs with shipped scope	2026-06-11 13:53:06 -07:00
Drew Ritter	a6a4cd85b9	Harden companion stop ownership proof	2026-06-11 13:53:06 -07:00
Drew Ritter	8034176801	Isolate companion fallback tokens	2026-06-11 13:53:06 -07:00
Drew Ritter	c4cde1eed9	Harden root screen containment	2026-06-11 13:53:06 -07:00
Drew Ritter	c7d7e3550f	Harden companion Windows lifecycle coverage	2026-06-11 13:53:06 -07:00
Drew Ritter	a2e67bbd9b	Harden brainstorm companion auth regressions	2026-06-11 13:53:06 -07:00
Jesse Vincent	f4d1788ffb	fix(brainstorm-server): fix auth-integration bugs from full-branch review A second adversarial review of the merged branch found that combining the session-key auth with the feature work created real bugs the (vacuous) tests missed: - [Critical] GET /files/ (empty name) resolved to CONTENT_DIR and crashed the process with uncaught EISDIR — newly reachable because the query-stripping refactor turns /files/?key=... into /files/. Reject non-regular-file names. - [High] --open opened a KEYLESS url, which the auth gate 403s — the headline feature landed on the error page. Open the keyed url. - [High] Same-port restart regenerated the token (port persisted, token not), so the open tab's old cookie 403'd and never reconnected — contradicting the documented promise. Persist the token (BRAINSTORM_TOKEN_FILE / .last-token) alongside the port. - [Medium] Token sat in world-readable server-info/server.log (0644 in /tmp). umask 077 in start-server.sh + mode 0600 on server-info/.last-token. - [Medium] touchActivity() ran before the auth check, so unauthenticated requests defeated the idle timeout. Count activity only after authorization. - [Low] COOKIE_NAME embedded the pre-fallback port; derive it from the actual bound port (also prevents a cross-server cookie-jar collision on fallback). Tests added/strengthened (previously passed vacuously): /files/ no-crash; the auto-open url carries the key and is reachable (200); restart reuses the same key not just the port; unauthenticated requests don't reset the idle clock. Full suite green (ws-protocol 32, helper 12, auth 13, server 29, lifecycle 8, stop-server 4); restart smoke confirms same port+key and old URL -> 200.	2026-06-11 13:53:06 -07:00
Jesse Vincent	c64c4ea6f4	feat(brainstorm-server): gate every endpoint behind a per-session key The companion server is reachable by any local browser tab (default loopback bind) and by any host that can route to it (remote --host bind). It served screens, files, and accepted event-injecting WebSocket connections with no authentication, so a malicious browser tab or a direct remote client could read brainstorm content or inject events that the agent reads as the user's input (prompt injection into a live session). Generate a per-session secret token, carry it in the served URL as ?key=, and mirror it into an HttpOnly SameSite=Strict per-port cookie on first load so same-origin subresources and the WebSocket handshake authenticate automatically. Every HTTP request and WebSocket upgrade now requires a valid key (query or cookie, constant-time compared); unauthenticated requests get a friendly 403 explaining they need the full URL. A secret authenticates the client uniformly across loopback, tunnel, and remote binds and defeats DNS rebinding, which a Host/Origin allowlist cannot. Also guard handleMessage against a null JSON payload that crashed the process. Tests: new auth.test.js (13 cases) covering the key on /, /files/*, and WS plus cookie bootstrap and the null-payload guard; server.test.js threads the key; ws-protocol.test.js + auth.test.js wired into npm test. Closes #1014 Refs #1110, #1553, #1504	2026-06-11 13:53:06 -07:00
Jesse Vincent	eee4f87471	fix(brainstorm-server): tie stop-server PID check to the session's port The node+server.cjs command match (from the adversarial review) still matched any unrelated node process running a file named server.cjs. When we recorded the bound port (state/server-info) and lsof is available, additionally require the PID to be the process actually LISTENING on this session's port — which rules out a different project's server.cjs / editor task runner that recycled the stale PID. Falls back to the command match when the port or lsof isn't available. Test: a 'node server.cjs' process not listening on the recorded port is spared. Refs #1703	2026-06-11 13:53:06 -07:00
Jesse Vincent	bac46a5dcb	fix(brainstorm-server): address adversarial review findings From a two-reviewer adversarial pass: - [High] EADDRINUSE fallback clobbered the shared .last-port: onListen wrote the bound port unconditionally, so a fallback to a random port overwrote the preferred port another live session still owns — stranding that session's open tab forever. Now persist only when we bound the preferred port (not on fallback). The fallback test now asserts .last-port integrity (teeth-verified). - [Medium] maybeOpenBrowser ran the URL through a shell (exec + JSON.stringify), which does NOT neutralize $(...) in a url-host. Platform launchers now use execFile with the URL as an argv element (no shell). The operator-set BRAINSTORM_OPEN_CMD path stays shell-based (trusted input). - [Medium] --open was a silent no-op on native Windows (no win32 branch). Added. - [Medium] helper.js reconnect/status/tombstone had only substring-grep tests. Added behavioral tests driving the state machine against a mocked browser: Reconnecting+backoff (500->1000->2000), tombstone after the grace period, and reload-on-recovery. - [Low] status pill showed a false 'Connected' before the socket opened; now starts 'Connecting…' until onopen. Not changed (flagged): stop-server.sh's PID-ownership check still matches any 'node ... server.cjs' (narrow residual — a recycled PID onto an unrelated node server.cjs); robust fix needs fragile cross-platform process introspection.	2026-06-11 13:53:06 -07:00
Jesse Vincent	daa41c0670	feat(brainstorming): offer the visual companion just-in-time; harden lifecycle guidance Move the companion consent from an upfront, anticipatory offer to the first moment a question would genuinely be clearer shown than told. If no visual question ever arises, it's never offered. On approval the agent starts the server with --open, so the user's browser opens to the first screen — the pop is tied to that approval, never unsolicited. Also hardens visual-companion.md: confirming the server is alive (server-info present, server-stopped absent) before referring to the URL is now a required step; restart with the same --project-dir reuses the port so the open tab reconnects on its own (paused overlay while down); idle default corrected to 4h. NOTE: SKILL.md is behavior-shaping content — this flow change should be eval-tested (writing-skills adversarial pressure test) before merge. Refs #1237, #1037	2026-06-11 13:53:06 -07:00
Jesse Vincent	0d37ff6505	feat(brainstorm-server): opt-in auto-open of the browser on the first screen When the user approves the visual companion, open their browser automatically the first time a screen is actually ready to show — rather than at startup (just the waiting page) or making them open the URL by hand. Opt-in and gated on approval: off unless BRAINSTORM_OPEN is set (start-server.sh --open, which the agent passes only after the user agrees to use the companion). Even then it fires once, and is skipped if a browser is already connected, on a non-loopback/remote bind, or when headless. Launcher is the platform default (open / xdg-open / WSL cmd.exe) or BRAINSTORM_OPEN_CMD; best-effort, never fatal. lifecycle.test.js: opens once on the first screen when approved; does NOT open without approval. Closes #755 Refs #759	2026-06-11 13:53:06 -07:00
Jesse Vincent	13da997ac7	feat(brainstorm-server): reuse the same port on session restart When the companion idle-shuts-down and the agent restarts it, a fresh random port meant the user's open browser tab pointed at a dead URL. Persist the bound port per project and prefer it on the next start, so the restarted server comes up on the same port and the open tab's reconnect just works. - start-server.sh exports BRAINSTORM_PORT_FILE=<project>/.superpowers/brainstorm/.last-port for project sessions (not /tmp). - server.cjs prefers an explicit BRAINSTORM_PORT, else the recorded port, else random; writes the actually-bound port back; and on EADDRINUSE (preferred port still in use) falls back to a random port once instead of crashing. lifecycle.test.js: restart reuses the recorded port; a taken preferred port falls back to a random one without crashing. Refs #1237	2026-06-11 13:53:06 -07:00
Jesse Vincent	31a0de857b	feat(brainstorm-companion): resilient reconnect, live status, paused overlay The injected client reconnected on a fixed 1s timer with no feedback: if the laptop slept or the server restarted, the page showed 'Connected' over a dead socket and silently queued events. And when the server stopped, the user got a bare connection-refused with no explanation. helper.js now: - reconnects with exponential backoff (500ms, doubling, capped at 30s; reset on open), with an onerror->close handler, nulls the socket on close, and clears a pending timer before scheduling another; - drives the frame status pill Connected/Reconnecting/Disconnected via a --status-color custom property (frame-template.html); - after ~15s disconnected, shows a self-styled 'Companion paused' overlay (tombstone) explaining the companion stopped and will reconnect automatically; - on recovery from a tombstoned outage (e.g. server restarted on the same port) reloads to pick up the restarted server's current screen. The reconnect-backoff is an exported pure function; helper.test.js unit-tests it (doubling + cap progression) and asserts the status/tombstone/reconnect wiring. DOM behaviour is verified live. Refs #856, #1237	2026-06-11 13:53:06 -07:00
Jesse Vincent	c292421627	feat(brainstorm-server): 4h configurable idle timeout; close WS on shutdown The companion shut down after only 30 minutes idle — too short for real brainstorming, where a single question can sit far longer. And shutdown() never closed upgraded WebSocket sockets, so an open browser connection could keep the Node process alive after it was supposed to exit. - Default idle timeout raised to 4 hours, configurable via BRAINSTORM_IDLE_TIMEOUT_MS and start-server.sh --idle-timeout-minutes (validated positive integer). - Reported as idle_timeout_ms in the server-started JSON / server-info. - shutdown() now destroys all client sockets so the process exits even with an open WebSocket. - Watchdog check interval is configurable (BRAINSTORM_LIFECYCLE_CHECK_MS, default 60s) so the lifecycle can be tested without minute-long waits. Adds lifecycle.test.js (configured timeout reported; idle shutdown exits despite an open WS — teeth-verified; the start-server flag). Wires ws-protocol, lifecycle, and stop-server suites into npm test. Closes #1237 Refs #1689	2026-06-11 13:53:06 -07:00
Jesse Vincent	9b00cc298d	fix(brainstorm-server): verify PID ownership before stopping stop-server.sh read server.pid and SIGKILL'd that PID with no checks. After a reboot or PID wraparound the pid file can point at an unrelated, live process — which we would then kill. Verify the PID is actually our server (a running 'node ... server.cjs') before signalling it. If ownership can't be proven, fail closed: remove the stale pid file and report {status: stale_pid} without killing anything. Real servers still stop ({status: stopped}); a missing pid file still reports not_running. Adds stop-server.test.sh covering: an unrelated reused PID is left alone, a real server is stopped, and a missing pid file. Refs #1703	2026-06-11 13:53:06 -07:00
Jesse Vincent	88fe1e7e15	fix(brainstorm-server): ignore macOS resource-fork dotfiles On macOS (and ExFAT/SMB volumes) the OS writes ._<name>.html sidecar files holding binary resource-fork metadata. These end with .html, so they passed the content filter and could be picked as the newest screen — serving binary garbage to the browser instead of the mockup — or fetched via /files/. Skip dotfiles (leading '.') at all four sites that list or serve content: getNewestScreen, the /files/ endpoint, the known-files seed, and the fs.watch handler. Tests cover serving (/ and /files/) and the watch path (a ._ file must not trigger a reload). Refs #950	2026-06-11 13:53:06 -07:00
Jesse Vincent	74f85a7709	fix(writing-skills): hang backfire mechanism on the separated prohibition-vs-recipe comparison (NEW-4); control comparison stated as trend	2026-06-11 12:11:37 -07:00
Jesse Vincent	b148b648eb	fix(writing-skills): scope empirical claims, honest noise reporting, conditionalize micro-test checklist line Adversarial review findings 1/3/9: the head-to-head result is now scoped to its context (dispatch-prompt guidance) with an explicit micro-test-your-own-case instruction; the nuance-clause result is reported as consistent->noisy rather than 'measurably dilutes'; the checklist line is scoped to behavior-shaping guidance and the micro method no longer assumes raw API access.	2026-06-11 12:11:37 -07:00
Jesse Vincent	3e565ca2ad	feat(writing-skills): form-selection table + micro-test wording method RED battery (35 opus authoring samples against the current skill) showed authors default to prohibition+rationalization-table for composition-shaping problems (T1: 5/5), where that form measurably backfires (prohibition 4.4 vs 3.6 no-guidance control vs 3.0 recipe restatement errors), and design only full-subagent verification with no wording micro-tests, no mandatory no-guidance control, no manual inspection of automated matches, no variance signal (T7: 5/5). Adds: Match the Form to the Failure (failure-type -> form table, nuance/exemption rules), scope note on Bulletproofing, Micro-Test Wording subsection, two checklist lines. Deliberately narrow: T3/T4/T5/T6 RED samples showed Iron Law / elicit-first behavior already strong.	2026-06-11 12:11:37 -07:00
Rahul	d7c260a978	fix(brainstorming): cap websocket frame payloads	2026-06-02 11:24:02 -07:00
Matt Van Horn	f776394360	feat(subagent-dev): add TDD RED evidence to implementer report format Add a conditional TDD Evidence field to the implementer report format so controllers can verify RED and GREEN output when TDD was required. The field asks for the command run, relevant RED/GREEN output, and the expected RED failure reason rather than raw full logs. Fixes #994.	2026-06-01 16:15:05 -07:00
nestorluiscamachopaz	81c3052416	fix: foreground mode saves node PID and clears OWNER_PID on Windows/MSYS2 Verified on real Windows Git Bash: lifecycle test passed 12/12, manual start/stop released the port, and no brainstorm node processes remained.	2026-06-01 14:26:22 -07:00
nawfal	c879454a0d	fix(finishing-a-development-branch): remove gh-specific PR creation instruction Per obra's guidance on #1609: remove the github-specific instruction rather than replacing it with a platform-detection table. Agents already know their forge tooling; the skill only needs to cover the push step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 13:58:22 -07:00
nawfal	ff213eb2cf	fix(finishing-a-development-branch): detect remote platform before creating PR/MR Replaces hardcoded `gh pr create` in Option 2 with a platform-neutral note: check `git remote get-url origin` first, then use gh (GitHub), glab (GitLab), or fall back to the compare URL for unknown platforms. Adds matching Red Flag entry so agents don't skip the detection step. Fixes #1609 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 13:58:22 -07:00
Jesse Vincent	da00e59958	feat: add Antigravity CLI (agy) support Antigravity (Google's `agy` CLI) installs the existing Superpowers plugin directly: agy plugin install https://github.com/obra/superpowers agy imports the bundled skills and runs the plugin's SessionStart hook, so using-superpowers bootstraps from the first message — verified on agy 1.0.3: a fresh session given "Let's make a react todo list" auto-triggers the brainstorming skill instead of writing code. agy discovers skills natively and, having no Skill tool, loads them by reading SKILL.md with view_file. No scaffold, installer, or generated context file is needed. This adds only: - README.md: an Antigravity install section + Quickstart link - skills/using-superpowers/SKILL.md: reference to the agy tool mapping - skills/using-superpowers/references/antigravity-tools.md: action->tool mapping for agy (view_file, write_to_file, invoke_subagent, manage_task, and skill loading via view_file on SKILL.md) - tests/antigravity/: structural test for the tool mapping, mirroring tests/pi/	2026-06-01 11:42:09 -07:00
Jesse Vincent	8811b0f2d7	Revert "Make visual-companion.md script paths skill-rooted, not plugin-rooted" This reverts commit `e9f5188289`.	2026-05-23 17:01:46 -07:00
Nick Galatis	21ad401e90	fix(systematic-debugging): defuse Claude Code ultrathink keyword scanner trigger (#1558) The "Signals You're Doing It Wrong" bullet in systematic-debugging/SKILL.md contains the literal token Claude Code's runtime scans for in tool result bodies. Every Skill-tool invocation of this skill caused the harness to inject a spurious system-reminder claiming the user requested deeper reasoning, silently bumping every session into extended thinking. Replace the bullet's spelling so the contiguous letter sequence the scanner matches is broken with a hyphen. The signal text remains recognizable to the agent and the documented action ("Question fundamentals, not just symptoms") is unchanged. Fixes obra/superpowers#1283	2026-05-23 16:51:00 -07:00
Jesse Vincent	e9f5188289	Make visual-companion.md script paths skill-rooted, not plugin-rooted Issue #1134: agents reading visual-companion.md see bare commands like `scripts/start-server.sh`, correctly identify the plugin install directory, then look for `<plugin>/scripts/start-server.sh` instead of `<plugin>/skills/brainstorming/scripts/start-server.sh`. The file doesn't exist at the plugin-rooted path, so the agent concludes the visual companion isn't available and falls back to text-only brainstorming. Multiple independent reproductions in the issue thread, plus one user's agent self-reported: "I assumed the scripts folder was in the root directory of the plugin, it didn't realize it could have been talking about the skill folder itself." Change all `scripts/<file>` references in visual-companion.md to `skills/brainstorming/scripts/<file>`. Agents that correctly identify the plugin root will now join to the right path. Closes #1134.	2026-05-23 16:42:13 -07:00
Jesse Vincent	e1d3f71e0d	Convert curly to square brackets in code-reviewer.md placeholders Matches the style used by the spec-reviewer-prompt.md and code-quality-reviewer-prompt.md call sites, which already use square brackets ([VAR] or [VAR — description]). No semantic change — these placeholders are filled in by the controller; nothing programmatic substitutes them.	2026-05-23 16:14:24 -07:00
Jesse Vincent	b2212dc913	Scope spec reviewer to task diff and make reviewers read-only Two problems with the SDD reviewer prompts on dev: - spec-reviewer-prompt.md never received a git range, so the general-purpose subagent had to crawl the entire codebase to find what changed. Reporter measured 20-33 minute spec reviews on simple tasks (#1538). - Neither reviewer prompt told the subagent that review is read-only. A spec reviewer running `git checkout <parent-sha>` for historical comparison silently detached HEAD on the controller's branch, then subsequent task commits accumulated on the detached HEAD and were effectively orphaned (#1543, reproduced independently in #1543's thread). Add a Git Range to Review section to spec-reviewer-prompt.md that mirrors the one code-reviewer.md already has, plus a Read-Only Review section in both reviewer prompt templates stating the principle: do not mutate the working tree, the index, HEAD, or branch state. Allow inspecting other revisions via a separate temporary worktree, so the read-only rule does not block legitimate historical comparison. Closes #1538. Closes #1543.	2026-05-23 16:14:05 -07:00
Jesse Vincent	180f009090	@mhat reported that his claude got confused about 'debugging' being named as a skill in the bootstrap	2026-05-21 17:23:25 -04:00
Drew Ritter	49bf5ad6dc	Align Pi mapping with action vocabulary	2026-05-13 17:58:46 -07:00
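
Note on commit 31a0de857b: the reconnect schedule it describes (500ms, doubling, capped at 30s) and that bac46a5dcb tests as 500->1000->2000 can be sketched as a pure function. This is an illustration only, not the code in helper.js; the names `BASE_DELAY_MS`, `MAX_DELAY_MS`, and `reconnectDelay` are hypothetical, and a zero-based attempt counter is an assumption.

```javascript
// Hypothetical names; a sketch of the schedule described in the commit
// message, not a copy of helper.js.
const BASE_DELAY_MS = 500;   // first retry delay, per the commit message
const MAX_DELAY_MS = 30000;  // cap, per the commit message

// Delay before reconnect attempt number `attempt` (assumed to start at 0).
// A caller would reset `attempt` to 0 once the socket opens.
function reconnectDelay(attempt) {
  const delay = BASE_DELAY_MS * 2 ** attempt;
  return Math.min(delay, MAX_DELAY_MS);
}

// Print the first eight delays in the schedule.
const schedule = Array.from({ length: 8 }, (_, i) => reconnectDelay(i));
console.log(schedule.join(' -> '));
```

Keeping the schedule free of timers and sockets is what lets it be unit-tested in isolation, which matches the commit's note that the backoff is an exported pure function.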


199 Commits