superpowers

mirror of https://github.com/obra/superpowers.git synced 2026-08-01 07:01:36 +08:00

Author	SHA1	Message	Date
Jesse Vincent	5e204f1128	E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis)	2026-06-15 12:16:47 -07:00
Jesse Vincent	b1eb92ea72	Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not	2026-06-15 12:15:06 -07:00
Jesse Vincent	6e9bbb7e3e	writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks Claims are fidelity and variance, not dollars (full attribution in the superpowers-evals experiment log, 2026-06-11 L1 entry): - Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact values verbatim; makes constraints mechanically propagatable to briefs and reviewers (a version-floor violation class shipped because they weren't). The one fix wave in the elicited full runs was a version-floor catch this header enabled. - Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures, within-plan consistent; removes the controller's per-dispatch interface re-derivation. - Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills standalone Types/README micro-tasks); no effect at small scale. - End-to-end (opus-written plan executed under SDD): guidance plan ran 1 fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug); execution cost equal within noise.	2026-06-15 12:15:06 -07:00
Jesse Vincent	fe938ac86c	Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules E30 replay: the planted-DRY catch is causally determined by the controller-composed constraints block (0/6 with process-shaped vs 5/6 with the spec's own wording). E31 micro: this recipe doubles the rate at which composed blocks carry the spec's cross-component relationship (6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).	2026-06-15 12:15:06 -07:00
Jesse Vincent	07dec9331f	Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance	2026-06-15 12:15:06 -07:00
Jesse Vincent	afcbf8bacb	Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed)	2026-06-15 12:15:06 -07:00
Jesse Vincent	fa14c8d671	Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead)	2026-06-15 12:15:06 -07:00
Jesse Vincent	8b76932337	Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first	2026-06-15 12:15:06 -07:00
Jesse Vincent	0702ec2c6f	Record writing-plans micro-test result: resolved, no change needed	2026-06-15 12:15:06 -07:00
Jesse Vincent	85a9324a53	Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges)	2026-06-15 12:15:06 -07:00
Jesse Vincent	610b09874e	Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist	2026-06-15 12:15:06 -07:00
Jesse Vincent	6df501ea5d	Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31 (baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline 79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline 3/3 runs after moving model: into the templates as a REQUIRED line. Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md	2026-06-15 12:15:06 -07:00
Jesse Vincent	1585f40c8e	Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants	2026-06-15 12:15:06 -07:00
Jesse Vincent	60c0b744b4	Shared: unique review-package collateral names	2026-06-15 12:15:06 -07:00
Jesse Vincent	6b3e4ad407	Add review-package script; close fix-dispatch test gap scripts/review-package generates the reviewer's input deterministically: commit list, stat summary, and net diff with -U10 context, written to a file from an explicit BASE. Live runs showed controllers improvising 'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks, and svelte's five fix dispatches shipped without re-running any tests — fix dispatches now explicitly carry the implementer's re-run-and-report contract.	2026-06-15 12:15:06 -07:00
Jesse Vincent	a84bb0f52b	Describe the review design as current state, not as a delta The skill read as a changelog: 'combined task review,' 'one reviewer, one reading,' 'one dispatch,' and an example still showing diffs pasted into prompts. A reader who never saw the two-reviewer design has no referent for 'combined.' Prose now states the design directly, and the flowchart/example reflect the diff-file handoff.	2026-06-15 12:15:06 -07:00
Jesse Vincent	69d396a676	Spec: record iterations 2-3 results and final frozen-config matrix	2026-06-15 12:15:06 -07:00
Jesse Vincent	e4457c970e	Hand reviewers the diff as a file, not a paste Paste adoption stayed at 0/15 even as a Red Flag — and the controller's reluctance is locally rational: pasting loads the diff into the (most expensive) controller context permanently, while a reviewer self-fetch costs a few cheap turns. The diff-file handoff is cheap for both sides: the controller redirects git diff to /tmp without reading it, and the reviewer gets the whole change in one Read call.	2026-06-15 12:15:06 -07:00
Jesse Vincent	fac5888846	Reviewer skepticism covers the implementer's design rationales Fourth planted-defect failure mode: the implementer's self-report said 'noted mild structural duplication; left unabstracted per YAGNI' and the reviewer deferred to that framing, rating the duplication no finding at all. The pre-judging keeps relocating — controller prompt, then reviewer calibration, now the implementer's report. Rationales are claims; they never downgrade severity.	2026-06-15 12:15:06 -07:00
Jesse Vincent	8ac14c0450	Make diff-pasting non-optional for task reviewer dispatch Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased as guidance; reviewers without the diff re-derive it by hand, which is the single largest remaining reviewer cost. Now a Red Flags Never entry and a REQUIRED marker on the template placeholder.	2026-06-15 12:15:06 -07:00
Jesse Vincent	3ed554d557	Close the Minor-severity escape hatch With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.	2026-06-15 12:15:06 -07:00
Jesse Vincent	4e8edca36e	Spec: document cost iterations and the per-task review consolidation	2026-06-15 12:15:06 -07:00
Jesse Vincent	d7726d99dc	Merge per-task reviews into one task reviewer (iteration 2) Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.	2026-06-15 12:15:06 -07:00
Jesse Vincent	4c1f1e5cc5	Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats- token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust- until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.	2026-06-15 12:15:06 -07:00
Jesse Vincent	7288393773	Add phrase-level pre-judging triggers to reviewer prompt rule Resumed the offending eval controller session and asked it why it pre-judged despite the rule being in context. Its retrospective: the motive was avoiding a review loop, the abstract rule was read but not applied at the moment it governs, and a phrase-level trigger ('do not flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose') would have fired where the principle did not.	2026-06-15 12:15:06 -07:00
Jesse Vincent	254a8e2e32	Red Flags: never tell a reviewer what not to flag or pre-rate severity Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.	2026-06-15 12:15:06 -07:00
Jesse Vincent	7c11cee649	Close three review blind spots found by defect tracing Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed: - reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence) - no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements) - test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)	2026-06-15 12:15:06 -07:00
Jesse Vincent	b36cf86afd	Require explicit model on subagent dispatch In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).	2026-06-15 12:15:06 -07:00
Jesse Vincent	06bec17a34	Forbid controllers pre-judging reviewer findings A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.	2026-06-15 12:15:06 -07:00
Jesse Vincent	236524413b	Sync plan: escaped pre() pattern in Task 5 checks block	2026-06-15 12:15:06 -07:00
Jesse Vincent	6e019e0316	Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block	2026-06-15 12:15:06 -07:00
Jesse Vincent	d4bb8d268f	Sync plan's Task 5 blocks with review fixes	2026-06-15 12:15:06 -07:00
Jesse Vincent	d519ba65fd	SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment	2026-06-15 12:15:06 -07:00
Jesse Vincent	d32a56dc32	Implementer prompt: re-run covering tests after fixing review findings	2026-06-15 12:15:06 -07:00
Jesse Vincent	994bc26d2a	Scope spec reviewer's Your Job wording to the diff	2026-06-15 12:15:06 -07:00
Jesse Vincent	d5850df1bc	Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel	2026-06-15 12:15:06 -07:00
Jesse Vincent	b5edd40d2c	Use bare placeholder names in quality reviewer prompt body	2026-06-15 12:15:06 -07:00
Jesse Vincent	6a02446953	Make per-task quality reviewer prompt self-contained and task-scoped	2026-06-15 12:15:06 -07:00
Jesse Vincent	042d238b26	Add implementation plan for task-scoped review dispatch	2026-06-15 12:15:06 -07:00
Jesse Vincent	cf81ad2ac3	Harden review-dispatch spec per adversarial review findings	2026-06-15 12:15:06 -07:00
Jesse Vincent	cb0dbeb095	Add design spec: task-scoped review dispatch for SDD	2026-06-15 12:15:06 -07:00
Drew Ritter	9eb452afe7	chore: bump evals submodule to claude transcript-capture fix Bumps evals 7f8e80c -> db37d5f (superpowers-evals#16): the claude launcher now sets CLAUDE_CODE_FORCE_SESSION_PERSISTENCE=1 so nested interactive claude (>=2.1.176) persists its transcript — restoring claude capture (verdicts + cost/token data) on the latest CLI (2.1.177) with no version pin. Also folds in the audit_liveness ruff/ty cleanup and the B1 audit-doc correction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 15:17:11 -07:00
Drew Ritter	93f2ce91b8	Fix companion stop metadata and token permissions	2026-06-11 13:53:06 -07:00
Drew Ritter	e9ee6c5b4d	Harden Windows browser launcher	2026-06-11 13:53:06 -07:00
Drew Ritter	5415cb8ccf	Fix Windows lifecycle validation	2026-06-11 13:53:06 -07:00
Drew Ritter	1c21a91e01	Align visual companion docs with shipped scope	2026-06-11 13:53:06 -07:00
Drew Ritter	441335ee3e	Fix companion test cleanup and argv assertions	2026-06-11 13:53:06 -07:00
Drew Ritter	377192f7a1	Harden companion platform tests	2026-06-11 13:53:06 -07:00
Drew Ritter	5eea0d09d7	Fix companion lifecycle test ownership metadata	2026-06-11 13:53:06 -07:00
Drew Ritter	a6a4cd85b9	Harden companion stop ownership proof	2026-06-11 13:53:06 -07:00

1 2 3 4 5 ...

588 Commits