Bumps evals 7f8e80c -> db37d5f (superpowers-evals#16): the claude launcher now
sets CLAUDE_CODE_FORCE_SESSION_PERSISTENCE=1 so nested interactive claude
(>=2.1.176) persists its transcript — restoring claude capture (verdicts +
cost/token data) on the latest CLI (2.1.177) with no version pin. Also folds in
the audit_liveness ruff/ty cleanup and the B1 audit-doc correction.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A second adversarial review of the merged branch found that combining the
session-key auth with the feature work created real bugs the (vacuous) tests
missed:
- [Critical] GET /files/ (empty name) resolved to CONTENT_DIR and crashed the
process with uncaught EISDIR — newly reachable because the query-stripping
refactor turns /files/?key=... into /files/. Reject non-regular-file names.
- [High] --open opened a KEYLESS url, which the auth gate 403s — the headline
feature landed on the error page. Open the keyed url.
- [High] Same-port restart regenerated the token (port persisted, token not), so
the open tab's old cookie 403'd and never reconnected — contradicting the
documented promise. Persist the token (BRAINSTORM_TOKEN_FILE / .last-token)
alongside the port.
- [Medium] Token sat in world-readable server-info/server.log (0644 in /tmp).
umask 077 in start-server.sh + mode 0600 on server-info/.last-token.
- [Medium] touchActivity() ran before the auth check, so unauthenticated requests
defeated the idle timeout. Count activity only after authorization.
- [Low] COOKIE_NAME embedded the pre-fallback port; derive it from the actual
bound port (also prevents a cross-server cookie-jar collision on fallback).
Tests added/strengthened (previously passed vacuously): /files/ no-crash; the
auto-open url carries the key and is reachable (200); restart reuses the same key
not just the port; unauthenticated requests don't reset the idle clock.
Full suite green (ws-protocol 32, helper 12, auth 13, server 29, lifecycle 8,
stop-server 4); restart smoke confirms same port+key and old URL -> 200.
Integrating the per-session-key auth onto the same branch as the dotfile and
lifecycle work: two tests added after the auth commit opened WebSockets without a
key (server.test.js dotfile-reload, lifecycle.test.js idle-shutdown), which the
auth gate now resets. Pass ?key=/BRAINSTORM_TOKEN in both. Full suite green:
ws-protocol 32, helper 12, auth 13, server 28, lifecycle 7, stop-server 4.
The companion server is reachable by any local browser tab (default loopback
bind) and by any host that can route to it (remote --host bind). It served
screens, files, and accepted event-injecting WebSocket connections with no
authentication, so a malicious browser tab or a direct remote client could read
brainstorm content or inject events that the agent reads as the user's input
(prompt injection into a live session).
Generate a per-session secret token, carry it in the served URL as ?key=, and
mirror it into an HttpOnly SameSite=Strict per-port cookie on first load so
same-origin subresources and the WebSocket handshake authenticate automatically.
Every HTTP request and WebSocket upgrade now requires a valid key (query or
cookie, constant-time compared); unauthenticated requests get a friendly 403
explaining they need the full URL. A secret authenticates the client uniformly
across loopback, tunnel, and remote binds and defeats DNS rebinding, which a
Host/Origin allowlist cannot.
Also guard handleMessage against a null JSON payload that crashed the process.
Tests: new auth.test.js (13 cases) covering the key on /, /files/*, and WS plus
cookie bootstrap and the null-payload guard; server.test.js threads the key;
ws-protocol.test.js + auth.test.js wired into npm test.
Closes#1014
Refs #1110, #1553, #1504
Records the triage of open issues/PRs touching the brainstorm companion server
and the decision to protect it with a per-session secret key (supersedes the
Host/Origin allowlist approach) so remote-connected users are covered, not just
loopback.
The node+server.cjs command match (from the adversarial review) still matched any
unrelated node process running a file named server.cjs. When we recorded the
bound port (state/server-info) and lsof is available, additionally require the
PID to be the process actually LISTENING on this session's port — which rules out
a different project's server.cjs / editor task runner that recycled the stale
PID. Falls back to the command match when the port or lsof isn't available.
Test: a 'node server.cjs' process not listening on the recorded port is spared.
Refs #1703
From a two-reviewer adversarial pass:
- [High] EADDRINUSE fallback clobbered the shared .last-port: onListen wrote the
bound port unconditionally, so a fallback to a random port overwrote the
preferred port another live session still owns — stranding that session's open
tab forever. Now persist only when we bound the preferred port (not on
fallback). The fallback test now asserts .last-port integrity (teeth-verified).
- [Medium] maybeOpenBrowser ran the URL through a shell (exec + JSON.stringify),
which does NOT neutralize $(...) in a url-host. Platform launchers now use
execFile with the URL as an argv element (no shell). The operator-set
BRAINSTORM_OPEN_CMD path stays shell-based (trusted input).
- [Medium] --open was a silent no-op on native Windows (no win32 branch). Added.
- [Medium] helper.js reconnect/status/tombstone had only substring-grep tests.
Added behavioral tests driving the state machine against a mocked browser:
Reconnecting+backoff (500->1000->2000), tombstone after the grace period, and
reload-on-recovery.
- [Low] status pill showed a false 'Connected' before the socket opened; now
starts 'Connecting…' until onopen.
Not changed (flagged): stop-server.sh's PID-ownership check still matches any
'node ... server.cjs' (narrow residual — a recycled PID onto an unrelated node
server.cjs); robust fix needs fragile cross-platform process introspection.
Move the companion consent from an upfront, anticipatory offer to the first
moment a question would genuinely be clearer shown than told. If no visual
question ever arises, it's never offered. On approval the agent starts the
server with --open, so the user's browser opens to the first screen — the pop is
tied to that approval, never unsolicited.
Also hardens visual-companion.md: confirming the server is alive (server-info
present, server-stopped absent) before referring to the URL is now a required
step; restart with the same --project-dir reuses the port so the open tab
reconnects on its own (paused overlay while down); idle default corrected to 4h.
NOTE: SKILL.md is behavior-shaping content — this flow change should be
eval-tested (writing-skills adversarial pressure test) before merge.
Refs #1237, #1037
When the user approves the visual companion, open their browser automatically the
first time a screen is actually ready to show — rather than at startup (just the
waiting page) or making them open the URL by hand.
Opt-in and gated on approval: off unless BRAINSTORM_OPEN is set (start-server.sh
--open, which the agent passes only after the user agrees to use the companion).
Even then it fires once, and is skipped if a browser is already connected, on a
non-loopback/remote bind, or when headless. Launcher is the platform default
(open / xdg-open / WSL cmd.exe) or BRAINSTORM_OPEN_CMD; best-effort, never fatal.
lifecycle.test.js: opens once on the first screen when approved; does NOT open
without approval.
Closes#755
Refs #759
When the companion idle-shuts-down and the agent restarts it, a fresh random
port meant the user's open browser tab pointed at a dead URL. Persist the bound
port per project and prefer it on the next start, so the restarted server comes
up on the same port and the open tab's reconnect just works.
- start-server.sh exports BRAINSTORM_PORT_FILE=<project>/.superpowers/brainstorm/
.last-port for project sessions (not /tmp).
- server.cjs prefers an explicit BRAINSTORM_PORT, else the recorded port, else
random; writes the actually-bound port back; and on EADDRINUSE (preferred port
still in use) falls back to a random port once instead of crashing.
lifecycle.test.js: restart reuses the recorded port; a taken preferred port
falls back to a random one without crashing.
Refs #1237
The injected client reconnected on a fixed 1s timer with no feedback: if the
laptop slept or the server restarted, the page showed 'Connected' over a dead
socket and silently queued events. And when the server stopped, the user got a
bare connection-refused with no explanation.
helper.js now:
- reconnects with exponential backoff (500ms, doubling, capped at 30s; reset on
open), with an onerror->close handler, nulls the socket on close, and clears a
pending timer before scheduling another;
- drives the frame status pill Connected/Reconnecting/Disconnected via a
--status-color custom property (frame-template.html);
- after ~15s disconnected, shows a self-styled 'Companion paused' overlay
(tombstone) explaining the companion stopped and will reconnect automatically;
- on recovery from a tombstoned outage (e.g. server restarted on the same port)
reloads to pick up the restarted server's current screen.
The reconnect-backoff is an exported pure function; helper.test.js unit-tests it
(doubling + cap progression) and asserts the status/tombstone/reconnect wiring.
DOM behaviour is verified live.
Refs #856, #1237
The companion shut down after only 30 minutes idle — too short for real
brainstorming, where a single question can sit far longer. And shutdown() never
closed upgraded WebSocket sockets, so an open browser connection could keep the
Node process alive after it was supposed to exit.
- Default idle timeout raised to 4 hours, configurable via BRAINSTORM_IDLE_TIMEOUT_MS
and start-server.sh --idle-timeout-minutes (validated positive integer).
- Reported as idle_timeout_ms in the server-started JSON / server-info.
- shutdown() now destroys all client sockets so the process exits even with an
open WebSocket.
- Watchdog check interval is configurable (BRAINSTORM_LIFECYCLE_CHECK_MS, default
60s) so the lifecycle can be tested without minute-long waits.
Adds lifecycle.test.js (configured timeout reported; idle shutdown exits despite
an open WS — teeth-verified; the start-server flag). Wires ws-protocol,
lifecycle, and stop-server suites into npm test.
Closes#1237
Refs #1689
stop-server.sh read server.pid and SIGKILL'd that PID with no checks. After a
reboot or PID wraparound the pid file can point at an unrelated, live process —
which we would then kill.
Verify the PID is actually our server (a running 'node ... server.cjs') before
signalling it. If ownership can't be proven, fail closed: remove the stale pid
file and report {status: stale_pid} without killing anything. Real servers still
stop ({status: stopped}); a missing pid file still reports not_running.
Adds stop-server.test.sh covering: an unrelated reused PID is left alone, a real
server is stopped, and a missing pid file.
Refs #1703
On macOS (and ExFAT/SMB volumes) the OS writes ._<name>.html sidecar files
holding binary resource-fork metadata. These end with .html, so they passed the
content filter and could be picked as the newest screen — serving binary garbage
to the browser instead of the mockup — or fetched via /files/.
Skip dotfiles (leading '.') at all four sites that list or serve content:
getNewestScreen, the /files/ endpoint, the known-files seed, and the fs.watch
handler. Tests cover serving (/ and /files/) and the watch path (a ._ file must
not trigger a reload).
Refs #950
Adversarial review findings 1/3/9: the head-to-head result is now scoped
to its context (dispatch-prompt guidance) with an explicit micro-test-your-
own-case instruction; the nuance-clause result is reported as
consistent->noisy rather than 'measurably dilutes'; the checklist line is
scoped to behavior-shaping guidance and the micro method no longer assumes
raw API access.
RED battery (35 opus authoring samples against the current skill) showed
authors default to prohibition+rationalization-table for composition-
shaping problems (T1: 5/5), where that form measurably backfires
(prohibition 4.4 vs 3.6 no-guidance control vs 3.0 recipe restatement
errors), and design only full-subagent verification with no wording
micro-tests, no mandatory no-guidance control, no manual inspection of
automated matches, no variance signal (T7: 5/5).
Adds: Match the Form to the Failure (failure-type -> form table, nuance/
exemption rules), scope note on Bulletproofing, Micro-Test Wording
subsection, two checklist lines. Deliberately narrow: T3/T4/T5/T6 RED
samples showed Iron Law / elicit-first behavior already strong.
@@ -456,10 +456,29 @@ Different skill types need different test approaches:
**All of these mean: Test before deploying. No exceptions.**
## Match the Form to the Failure
Before writing guidance, classify the baseline failure. The form that bulletproofs one failure type measurably backfires on another.
| Baseline failure | Right form | Wrong form |
|---|---|---|
| Skips/violates a rule under pressure (knows better, does it anyway) | Prohibition + rationalization table + red flags (see Bulletproofing below) | Soft guidance ("prefer...", "consider...") |
| Complies, but output has the wrong shape (bloated prompt, buried verdict, restated spec) | Positive recipe or contract: state what the output IS — its parts, in order | Prohibition list ("don't restate", "never narrate") |
| Omits a required element from something they already produce | Structural: REQUIRED field or slot in the template they fill in | Prose reminders near the template |
| Behavior should depend on a condition | Conditional keyed to an observable predicate ("if the brief exists, reference it") | Unconditional rule + exemption clauses |
**Why prohibitions backfire on shaping problems:** under a competing incentive ("make the prompt self-contained"), agents negotiate with "don't X". In head-to-head wording tests on dispatch-prompt guidance, the prohibition arm produced clearly more of the unwanted content than the recipe arm (fully separated distributions), and trended worse than even the no-guidance control — micro-test your own case rather than assuming, but never reach for the prohibition by default. A recipe leaves nothing to negotiate: the output matches the stated shape or it doesn't.
**Rules for whichever form you pick:**
- **No nuance clauses.** "Don't X unless it matters" reopens the negotiation — appending a single nuance clause to a winning recipe degraded it from consistent to noisy in the same wording tests. Express a real exception as its own conditional on an observable predicate.
- **Exemption clauses don't scope.** "This limit doesn't apply to code blocks" still suppresses code blocks. If part of the output must be exempt, restructure so the rule can't reach it.
## Bulletproofing Skills Against Rationalization
Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure.
**Scope:** this toolkit is for discipline failures — an agent that knows the rule and skips it under pressure. For wrong-shaped output or omitted elements, prohibition-based bulletproofing backfires; use the forms in Match the Form to the Failure instead.
**Psychology note:** Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles.
### Close Every Loophole Explicitly
@@ -553,6 +572,18 @@ Run same scenarios WITH skill. Agent should now comply.
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
### Micro-Test Wording Before Full Scenarios
Full pressure-scenario runs are the final gate, but they are slow and expensive per iteration. Verify the wording itself first with micro-tests:
1.**One fresh-context sample per call** — a raw API call, or a single-shot subagent if you don't have API access. System prompt = the realistic context the guidance will live in (the full skill or prompt template, not the guidance in isolation); user message = a task that tempts the failure.
2.**Always include a no-guidance control.** If the control doesn't exhibit the failure, there is nothing to fix — stop, don't author the guidance.
3.**5+ reps per variant.** Single samples lie.
4.**Manually read every flagged match.** Score programmatically if you like, but template echoes and quoted counter-examples masquerade as hits; automated counts alone overstate both failure and success.
5.**Variance is a metric.** When guidance lands, reps converge on the same shape. Five different interpretations across five reps means the wording isn't binding — tighten the form before adding words.
Micro-tests verify wording; they do not replace pressure scenarios for discipline skills.
**Testing methodology:** See [testing-skills-with-subagents.md](testing-skills-with-subagents.md) for the complete testing methodology:
@@ -610,6 +641,8 @@ Deploying untested skills = deploying untested code. It's a violation of quality
- [ ] Keywords throughout for search (errors, symptoms, tools)
- [ ] Clear overview with core principle
- [ ] Address specific baseline failures identified in RED
- [ ] Guidance form matches the failure type (see Match the Form to the Failure)
- [ ] For behavior-shaping guidance: wording micro-tested against a no-guidance control (5+ reps, every flagged match read manually) — N/A for pure reference skills
- [ ] Code inline OR link to separate file
- [ ] One excellent example (not multi-language)
- [ ] Run scenarios WITH skill - verify agents now comply
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.