superpowers

mirror of https://github.com/obra/superpowers.git synced 2026-08-01 07:01:36 +08:00

Author	SHA1	Message	Date
Jesse Vincent	9a221229a5	Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist	2026-06-15 12:10:33 -07:00
Jesse Vincent	7d8f0ce9e9	Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31 (baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline 79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline 3/3 runs after moving model: into the templates as a REQUIRED line. Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md	2026-06-15 12:10:33 -07:00
Jesse Vincent	f37c5e5115	Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants	2026-06-15 12:10:33 -07:00
Jesse Vincent	618698d9b3	Shared: unique review-package collateral names	2026-06-15 12:10:33 -07:00
Jesse Vincent	2d6e56ee90	Add review-package script; close fix-dispatch test gap scripts/review-package generates the reviewer's input deterministically: commit list, stat summary, and net diff with -U10 context, written to a file from an explicit BASE. Live runs showed controllers improvising 'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks, and svelte's five fix dispatches shipped without re-running any tests — fix dispatches now explicitly carry the implementer's re-run-and-report contract.	2026-06-15 12:10:33 -07:00
Jesse Vincent	4a92407ae7	Describe the review design as current state, not as a delta The skill read as a changelog: 'combined task review,' 'one reviewer, one reading,' 'one dispatch,' and an example still showing diffs pasted into prompts. A reader who never saw the two-reviewer design has no referent for 'combined.' Prose now states the design directly, and the flowchart/example reflect the diff-file handoff.	2026-06-15 12:10:33 -07:00
Jesse Vincent	cc81ffe7f3	Spec: record iterations 2-3 results and final frozen-config matrix	2026-06-15 12:10:33 -07:00
Jesse Vincent	a0dcb77596	Hand reviewers the diff as a file, not a paste Paste adoption stayed at 0/15 even as a Red Flag — and the controller's reluctance is locally rational: pasting loads the diff into the (most expensive) controller context permanently, while a reviewer self-fetch costs a few cheap turns. The diff-file handoff is cheap for both sides: the controller redirects git diff to /tmp without reading it, and the reviewer gets the whole change in one Read call.	2026-06-15 12:10:33 -07:00
Jesse Vincent	bc7d93de1a	Reviewer skepticism covers the implementer's design rationales Fourth planted-defect failure mode: the implementer's self-report said 'noted mild structural duplication; left unabstracted per YAGNI' and the reviewer deferred to that framing, rating the duplication no finding at all. The pre-judging keeps relocating — controller prompt, then reviewer calibration, now the implementer's report. Rationales are claims; they never downgrade severity.	2026-06-15 12:10:33 -07:00
Jesse Vincent	63a155692b	Make diff-pasting non-optional for task reviewer dispatch Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased as guidance; reviewers without the diff re-derive it by hand, which is the single largest remaining reviewer cost. Now a Red Flags Never entry and a REQUIRED marker on the template placeholder.	2026-06-15 12:10:33 -07:00
Jesse Vincent	4866fe8b2d	Close the Minor-severity escape hatch With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.	2026-06-15 12:10:33 -07:00
Jesse Vincent	e45a8f2548	Spec: document cost iterations and the per-task review consolidation	2026-06-15 12:10:33 -07:00
Jesse Vincent	fc75b0b3b4	Merge per-task reviews into one task reviewer (iteration 2) Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.	2026-06-15 12:10:33 -07:00
Jesse Vincent	da0a11f6d4	Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats- token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust- until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.	2026-06-15 12:10:33 -07:00
Jesse Vincent	b42846401f	Add phrase-level pre-judging triggers to reviewer prompt rule Resumed the offending eval controller session and asked it why it pre-judged despite the rule being in context. Its retrospective: the motive was avoiding a review loop, the abstract rule was read but not applied at the moment it governs, and a phrase-level trigger ('do not flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose') would have fired where the principle did not.	2026-06-15 12:10:33 -07:00
Jesse Vincent	c087105ff3	Red Flags: never tell a reviewer what not to flag or pre-rate severity Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.	2026-06-15 12:10:33 -07:00
Jesse Vincent	29e5842917	Close three review blind spots found by defect tracing Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed: - reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence) - no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements) - test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)	2026-06-15 12:10:33 -07:00
Jesse Vincent	1d94bc939d	Require explicit model on subagent dispatch In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).	2026-06-15 12:10:33 -07:00
Jesse Vincent	833ec4177e	Forbid controllers pre-judging reviewer findings A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.	2026-06-15 12:10:33 -07:00
Jesse Vincent	c4abda336c	Sync plan: escaped pre() pattern in Task 5 checks block	2026-06-15 12:10:33 -07:00
Jesse Vincent	c874cf0cb3	Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block	2026-06-15 12:10:33 -07:00
Jesse Vincent	08a2e7eed3	Sync plan's Task 5 blocks with review fixes	2026-06-15 12:10:33 -07:00
Jesse Vincent	077dd192a7	SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment	2026-06-15 12:10:33 -07:00
Jesse Vincent	441d22a2c0	Implementer prompt: re-run covering tests after fixing review findings	2026-06-15 12:10:33 -07:00
Jesse Vincent	efcaa40f1f	Scope spec reviewer's Your Job wording to the diff	2026-06-15 12:10:32 -07:00
Jesse Vincent	622a3887f3	Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel	2026-06-15 12:10:32 -07:00
Jesse Vincent	d3d6800b07	Use bare placeholder names in quality reviewer prompt body	2026-06-15 12:10:32 -07:00
Jesse Vincent	246b493db4	Make per-task quality reviewer prompt self-contained and task-scoped	2026-06-15 12:10:32 -07:00
Jesse Vincent	7dc323c28b	Add implementation plan for task-scoped review dispatch	2026-06-15 12:10:32 -07:00
Jesse Vincent	55938589d3	Harden review-dispatch spec per adversarial review findings	2026-06-15 12:10:32 -07:00
Jesse Vincent	450b02a11b	Add design spec: task-scoped review dispatch for SDD	2026-06-15 12:10:32 -07:00
Drew Ritter	9eb452afe7	chore: bump evals submodule to claude transcript-capture fix Bumps evals 7f8e80c -> db37d5f (superpowers-evals#16): the claude launcher now sets CLAUDE_CODE_FORCE_SESSION_PERSISTENCE=1 so nested interactive claude (>=2.1.176) persists its transcript — restoring claude capture (verdicts + cost/token data) on the latest CLI (2.1.177) with no version pin. Also folds in the audit_liveness ruff/ty cleanup and the B1 audit-doc correction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 15:17:11 -07:00
Drew Ritter	93f2ce91b8	Fix companion stop metadata and token permissions	2026-06-11 13:53:06 -07:00
Drew Ritter	e9ee6c5b4d	Harden Windows browser launcher	2026-06-11 13:53:06 -07:00
Drew Ritter	5415cb8ccf	Fix Windows lifecycle validation	2026-06-11 13:53:06 -07:00
Drew Ritter	1c21a91e01	Align visual companion docs with shipped scope	2026-06-11 13:53:06 -07:00
Drew Ritter	441335ee3e	Fix companion test cleanup and argv assertions	2026-06-11 13:53:06 -07:00
Drew Ritter	377192f7a1	Harden companion platform tests	2026-06-11 13:53:06 -07:00
Drew Ritter	5eea0d09d7	Fix companion lifecycle test ownership metadata	2026-06-11 13:53:06 -07:00
Drew Ritter	a6a4cd85b9	Harden companion stop ownership proof	2026-06-11 13:53:06 -07:00
Drew Ritter	8034176801	Isolate companion fallback tokens	2026-06-11 13:53:06 -07:00
Drew Ritter	2bab677ba7	Fix server test fallback cleanup	2026-06-11 13:53:06 -07:00
Drew Ritter	c4cde1eed9	Harden root screen containment	2026-06-11 13:53:06 -07:00
Drew Ritter	5f3b317741	Plan visual companion final hardening fixup	2026-06-11 13:53:06 -07:00
Drew Ritter	7bb6af2f67	Tighten visual companion hardening spec	2026-06-11 13:53:06 -07:00
Drew Ritter	4f88b89c75	Document visual companion final hardening fixup	2026-06-11 13:53:06 -07:00
Drew Ritter	c7d7e3550f	Harden companion Windows lifecycle coverage	2026-06-11 13:53:06 -07:00
Drew Ritter	a2e67bbd9b	Harden brainstorm companion auth regressions	2026-06-11 13:53:06 -07:00
Drew Ritter	fe812c418f	Document visual companion auth hardening plan	2026-06-11 13:53:06 -07:00
Jesse Vincent	f4d1788ffb	fix(brainstorm-server): fix auth-integration bugs from full-branch review A second adversarial review of the merged branch found that combining the session-key auth with the feature work created real bugs the (vacuous) tests missed: - [Critical] GET /files/ (empty name) resolved to CONTENT_DIR and crashed the process with uncaught EISDIR — newly reachable because the query-stripping refactor turns /files/?key=... into /files/. Reject non-regular-file names. - [High] --open opened a KEYLESS url, which the auth gate 403s — the headline feature landed on the error page. Open the keyed url. - [High] Same-port restart regenerated the token (port persisted, token not), so the open tab's old cookie 403'd and never reconnected — contradicting the documented promise. Persist the token (BRAINSTORM_TOKEN_FILE / .last-token) alongside the port. - [Medium] Token sat in world-readable server-info/server.log (0644 in /tmp). umask 077 in start-server.sh + mode 0600 on server-info/.last-token. - [Medium] touchActivity() ran before the auth check, so unauthenticated requests defeated the idle timeout. Count activity only after authorization. - [Low] COOKIE_NAME embedded the pre-fallback port; derive it from the actual bound port (also prevents a cross-server cookie-jar collision on fallback). Tests added/strengthened (previously passed vacuously): /files/ no-crash; the auto-open url carries the key and is reachable (200); restart reuses the same key not just the port; unauthenticated requests don't reset the idle clock. Full suite green (ws-protocol 32, helper 12, auth 13, server 29, lifecycle 8, stop-server 4); restart smoke confirms same port+key and old URL -> 200.	2026-06-11 13:53:06 -07:00

1 2 3 4 5 ...

578 Commits