578 Commits

Author SHA1 Message Date
Jesse Vincent
9a221229a5 Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-15 12:10:33 -07:00
Jesse Vincent
7d8f0ce9e9 Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-15 12:10:33 -07:00
Jesse Vincent
f37c5e5115 Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants 2026-06-15 12:10:33 -07:00
Jesse Vincent
618698d9b3 Shared: unique review-package collateral names 2026-06-15 12:10:33 -07:00
Jesse Vincent
2d6e56ee90 Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-15 12:10:33 -07:00
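The package described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/review-package: the function names, the output layout, and the default `head` are assumptions. Only the three ingredients (commit list, stat summary, net diff with -U10 context, all taken from an explicit BASE) come from the commit message.

```python
import subprocess


def review_package_commands(base, head="HEAD"):
    """Return the git commands for a review package (hypothetical helper)."""
    rev_range = f"{base}..{head}"
    return [
        ["git", "log", "--oneline", rev_range],  # commit list
        ["git", "diff", "--stat", rev_range],    # stat summary
        ["git", "diff", "-U10", rev_range],      # net diff, 10 lines of context
    ]


def write_review_package(base, out_path, head="HEAD"):
    """Run each command and write its output to out_path, one section per command."""
    with open(out_path, "w", encoding="utf-8") as out:
        for cmd in review_package_commands(base, head):
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            out.write(f"$ {' '.join(cmd)}\n{result.stdout}\n")
```

Because the range always starts at the explicit BASE, a task that spans several commits is covered in full, whereas `HEAD~1..HEAD` only ever covers the last commit.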
Jesse Vincent
4a92407ae7 Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-15 12:10:33 -07:00
Jesse Vincent
cc81ffe7f3 Spec: record iterations 2-3 results and final frozen-config matrix 2026-06-15 12:10:33 -07:00
Jesse Vincent
a0dcb77596 Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-15 12:10:33 -07:00
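The "redirect without reading" idea in the commit above might look like the sketch below. It assumes a Python controller-side helper purely for illustration; the helper names are hypothetical and the real controller may simply use a shell redirect. The point shown is that the command's output goes straight to a file, so the caller only ever handles a path and a byte count.

```python
import subprocess
from pathlib import Path


def redirect_to_file(cmd, out_path):
    """Run cmd with stdout sent straight to out_path; return only the byte count."""
    with open(out_path, "wb") as out:
        # The child process writes to the file directly; the output is never
        # loaded into this process's memory.
        subprocess.run(cmd, stdout=out, check=True)
    return Path(out_path).stat().st_size


def diff_handoff(base, out_path):
    """Write `git diff base..HEAD` to out_path and return the path (hypothetical)."""
    redirect_to_file(["git", "diff", f"{base}..HEAD"], out_path)
    return out_path
```

The reviewer prompt would then carry only the returned path, and the reviewer loads the whole change with a single file read.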
Jesse Vincent
bc7d93de1a Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-15 12:10:33 -07:00
Jesse Vincent
63a155692b Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-15 12:10:33 -07:00
Jesse Vincent
4866fe8b2d Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-15 12:10:33 -07:00
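The roll-up rule described in the commit above can be illustrated with a small triage sketch. The `Finding` type, the function names, and the ledger are hypothetical. The two behaviours shown are taken from the commit messages in this log: fixes are dispatched only for Critical and Important findings, and accumulated Minor findings are carried into the final review dispatch.

```python
from dataclasses import dataclass

# Severities that trigger a fix dispatch, per the commit messages in this log.
FIX_SEVERITIES = {"Critical", "Important"}


@dataclass
class Finding:
    severity: str  # "Critical", "Important", or "Minor"
    summary: str


def triage(findings, minor_ledger):
    """Return findings that need a fix dispatch; record Minor ones in the ledger."""
    needs_fix = [f for f in findings if f.severity in FIX_SEVERITIES]
    minor_ledger.extend(f for f in findings if f.severity == "Minor")
    return needs_fix


def final_review_section(minor_ledger):
    """Render the accumulated Minor findings for the final review prompt."""
    lines = ["Accumulated Minor findings:"]
    lines += [f"- {f.summary}" for f in minor_ledger]
    return "\n".join(lines)
```

Under this sketch a finding rated Minor never disappears: it skips the per-task fix loop but still reaches the final whole-branch reviewer.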
Jesse Vincent
e45a8f2548 Spec: document cost iterations and the per-task review consolidation 2026-06-15 12:10:33 -07:00
Jesse Vincent
fc75b0b3b4 Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-15 12:10:33 -07:00
Jesse Vincent
da0a11f6d4 Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-15 12:10:33 -07:00
Jesse Vincent
b42846401f Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-15 12:10:33 -07:00
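The commit above adds the trigger phrases to the written rule. As a purely illustrative aside, the same phrases could also be matched mechanically. The checker below is not described in the source: the function name and the exact regular expressions are assumptions derived from the four quoted phrases.

```python
import re

# Patterns derived from the phrases quoted in the commit message; the exact
# regexes are an assumption for illustration.
PREJUDGING_PATTERNS = [
    r"do not flag",
    r"at most minor",
    r"(?:don't|do not) treat .+? as a defect",
    r"the plan chose",
]


def prejudging_hits(prompt):
    """Return the patterns that match a reviewer prompt (case-insensitive)."""
    text = prompt.lower()
    return [p for p in PREJUDGING_PATTERNS if re.search(p, text)]
```

A prompt containing "do not treat that duplication as a defect to fix; the plan chose it" would match the third and fourth patterns, while a neutral prompt matches none.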
Jesse Vincent
c087105ff3 Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-15 12:10:33 -07:00
Jesse Vincent
29e5842917 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-15 12:10:33 -07:00
Jesse Vincent
1d94bc939d Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-15 12:10:33 -07:00
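The failure mode in the commit above is an omitted parameter that silently falls back to the most expensive option. One way to make that impossible is to have the dispatch builder refuse a missing model. The sketch below is hypothetical: `build_dispatch` and the request shape are assumptions, not the real dispatch API.

```python
def build_dispatch(prompt, model):
    """Build a subagent dispatch request; refuse to omit the model (hypothetical)."""
    if not model or not model.strip():
        # An omitted model would inherit the session's top-tier model.
        raise ValueError("model is required on every subagent dispatch")
    return {"prompt": prompt, "model": model.strip()}
```

With this shape, forgetting the model is a loud error at dispatch time rather than a silent cost increase.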
Jesse Vincent
833ec4177e Forbid controllers pre-judging reviewer findings
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
2026-06-15 12:10:33 -07:00
Jesse Vincent
c4abda336c Sync plan: escaped pre() pattern in Task 5 checks block 2026-06-15 12:10:33 -07:00
Jesse Vincent
c874cf0cb3 Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block 2026-06-15 12:10:33 -07:00
Jesse Vincent
08a2e7eed3 Sync plan's Task 5 blocks with review fixes 2026-06-15 12:10:33 -07:00
Jesse Vincent
077dd192a7 SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment 2026-06-15 12:10:33 -07:00
Jesse Vincent
441d22a2c0 Implementer prompt: re-run covering tests after fixing review findings 2026-06-15 12:10:33 -07:00
Jesse Vincent
efcaa40f1f Scope spec reviewer's Your Job wording to the diff 2026-06-15 12:10:32 -07:00
Jesse Vincent
622a3887f3 Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel 2026-06-15 12:10:32 -07:00
Jesse Vincent
d3d6800b07 Use bare placeholder names in quality reviewer prompt body 2026-06-15 12:10:32 -07:00
Jesse Vincent
246b493db4 Make per-task quality reviewer prompt self-contained and task-scoped 2026-06-15 12:10:32 -07:00
Jesse Vincent
7dc323c28b Add implementation plan for task-scoped review dispatch 2026-06-15 12:10:32 -07:00
Jesse Vincent
55938589d3 Harden review-dispatch spec per adversarial review findings 2026-06-15 12:10:32 -07:00
Jesse Vincent
450b02a11b Add design spec: task-scoped review dispatch for SDD 2026-06-15 12:10:32 -07:00
Drew Ritter
9eb452afe7 chore: bump evals submodule to claude transcript-capture fix
Bumps evals 7f8e80c -> db37d5f (superpowers-evals#16): the claude launcher now
sets CLAUDE_CODE_FORCE_SESSION_PERSISTENCE=1 so nested interactive claude
(>=2.1.176) persists its transcript — restoring claude capture (verdicts +
cost/token data) on the latest CLI (2.1.177) with no version pin. Also folds in
the audit_liveness ruff/ty cleanup and the B1 audit-doc correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:17:11 -07:00
Drew Ritter
93f2ce91b8 Fix companion stop metadata and token permissions 2026-06-11 13:53:06 -07:00
Drew Ritter
e9ee6c5b4d Harden Windows browser launcher 2026-06-11 13:53:06 -07:00
Drew Ritter
5415cb8ccf Fix Windows lifecycle validation 2026-06-11 13:53:06 -07:00
Drew Ritter
1c21a91e01 Align visual companion docs with shipped scope 2026-06-11 13:53:06 -07:00
Drew Ritter
441335ee3e Fix companion test cleanup and argv assertions 2026-06-11 13:53:06 -07:00
Drew Ritter
377192f7a1 Harden companion platform tests 2026-06-11 13:53:06 -07:00
Drew Ritter
5eea0d09d7 Fix companion lifecycle test ownership metadata 2026-06-11 13:53:06 -07:00
Drew Ritter
a6a4cd85b9 Harden companion stop ownership proof 2026-06-11 13:53:06 -07:00
Drew Ritter
8034176801 Isolate companion fallback tokens 2026-06-11 13:53:06 -07:00
Drew Ritter
2bab677ba7 Fix server test fallback cleanup 2026-06-11 13:53:06 -07:00
Drew Ritter
c4cde1eed9 Harden root screen containment 2026-06-11 13:53:06 -07:00
Drew Ritter
5f3b317741 Plan visual companion final hardening fixup 2026-06-11 13:53:06 -07:00
Drew Ritter
7bb6af2f67 Tighten visual companion hardening spec 2026-06-11 13:53:06 -07:00
Drew Ritter
4f88b89c75 Document visual companion final hardening fixup 2026-06-11 13:53:06 -07:00
Drew Ritter
c7d7e3550f Harden companion Windows lifecycle coverage 2026-06-11 13:53:06 -07:00
Drew Ritter
a2e67bbd9b Harden brainstorm companion auth regressions 2026-06-11 13:53:06 -07:00
Drew Ritter
fe812c418f Document visual companion auth hardening plan 2026-06-11 13:53:06 -07:00
Jesse Vincent
f4d1788ffb fix(brainstorm-server): fix auth-integration bugs from full-branch review
A second adversarial review of the merged branch found that combining the
session-key auth with the feature work created real bugs the (vacuous) tests
missed:

- [Critical] GET /files/ (empty name) resolved to CONTENT_DIR and crashed the
  process with uncaught EISDIR — newly reachable because the query-stripping
  refactor turns /files/?key=... into /files/. Reject non-regular-file names.
- [High] --open opened a KEYLESS url, which the auth gate 403s — the headline
  feature landed on the error page. Open the keyed url.
- [High] Same-port restart regenerated the token (port persisted, token not), so
  the open tab's old cookie 403'd and never reconnected — contradicting the
  documented promise. Persist the token (BRAINSTORM_TOKEN_FILE / .last-token)
  alongside the port.
- [Medium] Token sat in world-readable server-info/server.log (0644 in /tmp).
  umask 077 in start-server.sh + mode 0600 on server-info/.last-token.
- [Medium] touchActivity() ran before the auth check, so unauthenticated requests
  defeated the idle timeout. Count activity only after authorization.
- [Low] COOKIE_NAME embedded the pre-fallback port; derive it from the actual
  bound port (also prevents a cross-server cookie-jar collision on fallback).

Tests added/strengthened (previously passed vacuously): /files/ no-crash; the
auto-open url carries the key and is reachable (200); restart reuses the same key
not just the port; unauthenticated requests don't reset the idle clock.
Full suite green (ws-protocol 32, helper 12, auth 13, server 29, lifecycle 8,
stop-server 4); restart smoke confirms same port+key and old URL -> 200.
2026-06-11 13:53:06 -07:00
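The Critical fix in the commit above ("Reject non-regular-file names") can be illustrated with a small path check. This is a sketch under stated assumptions: the server's own code is not shown in this log, Python is used only for illustration, and `resolve_served_file` is a hypothetical name.

```python
from pathlib import Path


def resolve_served_file(content_dir, name):
    """Return the file to serve, or None if name does not denote a regular file."""
    # An empty name would resolve to the content directory itself.
    if not name or "/" in name or "\\" in name or name in {".", ".."}:
        return None
    candidate = Path(content_dir) / name
    # is_file() is False for directories and missing paths, so a directory can
    # never reach the read path that would fail with EISDIR.
    return candidate if candidate.is_file() else None
```

A request handler would answer with a not-found response when this returns None, so a request for /files/ with an empty name no longer reaches the file read at all.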