superpowers

mirror of https://github.com/obra/superpowers.git synced 2026-06-12 21:59:04 +08:00

Author	SHA1	Message	Date
Jesse Vincent	08447d0d10	Bump evals submodule: escalation-aware planted-defect scenario (7dc0be7)	2026-06-11 15:39:04 -07:00
Jesse Vincent	c87f336cc0	E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1	2026-06-11 15:31:01 -07:00
Jesse Vincent	a2a4190809	Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27	2026-06-11 14:53:31 -07:00
Jesse Vincent	8d354bb36b	L2b: plan-mandated defects are findings the human adjudicates Reviewer tripwire (Calibration): a plan-mandated defect IS a finding, reported as Important and labeled plan-mandated — the plan's authorship does not grade its own work. Controller rule (review loop): a plan-mandated finding, or any finding conflicting with the plan's text, escalates to the human like any plan contradiction — never dismissed because the plan mandates it. E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the tripwire 0/6 reports give the controller anything to escalate on (all Approved, defect endorsed as spec-required); with it 6/6 report the defect as a labeled finding.	2026-06-11 13:41:21 -07:00
Jesse Vincent	35464d67c0	E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract	2026-06-11 13:17:10 -07:00
Jesse Vincent	90b5433f59	E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis)	2026-06-11 13:17:10 -07:00
Jesse Vincent	420c234a2c	Bump evals submodule: E29-E34 quality investigation + L2 gate results (af05326)	2026-06-11 13:17:09 -07:00
Jesse Vincent	d1fcc9889a	Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not	2026-06-11 13:11:32 -07:00
Jesse Vincent	ac11700642	Bump evals submodule: L1 elicitation + autoresearch scenarios and logs (649b1f8)	2026-06-11 11:37:41 -07:00
Jesse Vincent	710f031ad0	writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks Claims are fidelity and variance, not dollars (full attribution in the superpowers-evals experiment log, 2026-06-11 L1 entry): - Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact values verbatim; makes constraints mechanically propagatable to briefs and reviewers (a version-floor violation class shipped because they weren't). The one fix wave in the elicited full runs was a version-floor catch this header enabled. - Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures, within-plan consistent; removes the controller's per-dispatch interface re-derivation. - Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills standalone Types/README micro-tasks); no effect at small scale. - End-to-end (opus-written plan executed under SDD): guidance plan ran 1 fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug); execution cost equal within noise.	2026-06-11 11:35:53 -07:00
Jesse Vincent	72cb21b82c	Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules E30 replay: the planted-DRY catch is causally determined by the controller-composed constraints block (0/6 with process-shaped vs 5/6 with the spec's own wording). E31 micro: this recipe doubles the rate at which composed blocks carry the spec's cross-component relationship (6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).	2026-06-11 11:35:53 -07:00
Jesse Vincent	de1d35e5e7	Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance	2026-06-10 21:44:23 -07:00
Jesse Vincent	ec014e7a7f	Bump evals submodule to merged superpowers-evals main (ac264b1) Carries the planted-defect + crisp scenarios, batch A-E experiment logs, claude-sonnet model-variant target, and method docs — rebased onto the obol migration and pushed to superpowers-evals main.	2026-06-10 19:39:02 -07:00
Jesse Vincent	eba16f6b91	Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed)	2026-06-10 17:11:26 -07:00
Jesse Vincent	27788fdef9	Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead)	2026-06-10 16:59:43 -07:00
Jesse Vincent	9a25a75bac	Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first	2026-06-10 14:35:00 -07:00
Jesse Vincent	60fa4f6fc4	Record writing-plans micro-test result: resolved, no change needed	2026-06-10 14:31:50 -07:00
Jesse Vincent	43a6ee23f7	Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges)	2026-06-10 13:08:40 -07:00
Jesse Vincent	fe90d6c469	Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist	2026-06-10 13:08:19 -07:00
Jesse Vincent	b81f35bb1e	Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31 (baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline 79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline 3/3 runs after moving model: into the templates as a REQUIRED line. Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md	2026-06-10 13:08:06 -07:00
Jesse Vincent	926096a1d7	Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants	2026-06-10 12:32:06 -07:00
Jesse Vincent	a995af2e24	Shared: unique review-package collateral names	2026-06-10 10:18:02 -07:00
Jesse Vincent	d4dbf44162	Add review-package script; close fix-dispatch test gap scripts/review-package generates the reviewer's input deterministically: commit list, stat summary, and net diff with -U10 context, written to a file from an explicit BASE. Live runs showed controllers improvising 'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks, and svelte's five fix dispatches shipped without re-running any tests — fix dispatches now explicitly carry the implementer's re-run-and-report contract.	2026-06-10 08:51:16 -07:00
Jesse Vincent	2434ef7f35	Describe the review design as current state, not as a delta The skill read as a changelog: 'combined task review,' 'one reviewer, one reading,' 'one dispatch,' and an example still showing diffs pasted into prompts. A reader who never saw the two-reviewer design has no referent for 'combined.' Prose now states the design directly, and the flowchart/example reflect the diff-file handoff.	2026-06-10 08:28:28 -07:00
Jesse Vincent	7cf78437e2	Spec: record iterations 2-3 results and final frozen-config matrix	2026-06-10 05:06:59 -07:00
Jesse Vincent	e355795625	Hand reviewers the diff as a file, not a paste Paste adoption stayed at 0/15 even as a Red Flag — and the controller's reluctance is locally rational: pasting loads the diff into the (most expensive) controller context permanently, while a reviewer self-fetch costs a few cheap turns. The diff-file handoff is cheap for both sides: the controller redirects git diff to /tmp without reading it, and the reviewer gets the whole change in one Read call.	2026-06-10 03:44:19 -07:00
Jesse Vincent	29ee4e8e44	Reviewer skepticism covers the implementer's design rationales Fourth planted-defect failure mode: the implementer's self-report said 'noted mild structural duplication; left unabstracted per YAGNI' and the reviewer deferred to that framing, rating the duplication no finding at all. The pre-judging keeps relocating — controller prompt, then reviewer calibration, now the implementer's report. Rationales are claims; they never downgrade severity.	2026-06-10 02:20:28 -07:00
Jesse Vincent	28498a5cde	Make diff-pasting non-optional for task reviewer dispatch Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased as guidance; reviewers without the diff re-derive it by hand, which is the single largest remaining reviewer cost. Now a Red Flags Never entry and a REQUIRED marker on the template placeholder.	2026-06-10 02:10:34 -07:00
Jesse Vincent	5e2907fc4f	Close the Minor-severity escape hatch With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.	2026-06-10 02:09:10 -07:00
Jesse Vincent	e532f24df7	Spec: document cost iterations and the per-task review consolidation	2026-06-09 23:59:22 -07:00
Jesse Vincent	e3c74fc1c9	Merge per-task reviews into one task reviewer (iteration 2) Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.	2026-06-09 23:58:28 -07:00
Jesse Vincent	3e3e1e701e	Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats- token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust- until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.	2026-06-09 22:42:54 -07:00
Jesse Vincent	853396e3ae	Add phrase-level pre-judging triggers to reviewer prompt rule Resumed the offending eval controller session and asked it why it pre-judged despite the rule being in context. Its retrospective: the motive was avoiding a review loop, the abstract rule was read but not applied at the moment it governs, and a phrase-level trigger ('do not flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose') would have fired where the principle did not.	2026-06-09 21:49:51 -07:00
Jesse Vincent	83d54f7ddd	Red Flags: never tell a reviewer what not to flag or pre-rate severity Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.	2026-06-09 21:47:41 -07:00
Jesse Vincent	c7900f1698	Close three review blind spots found by defect tracing Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed: - reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence) - no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements) - test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)	2026-06-09 21:19:08 -07:00
Jesse Vincent	5cfdb75b94	Require explicit model on subagent dispatch In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).	2026-06-09 21:11:45 -07:00
Jesse Vincent	87825ff193	Forbid controllers pre-judging reviewer findings A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.	2026-06-09 18:28:24 -07:00
Jesse Vincent	09cb4d7361	Sync plan: escaped pre() pattern in Task 5 checks block	2026-06-09 18:19:00 -07:00
Jesse Vincent	b3bb9a68d7	Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block	2026-06-09 17:21:06 -07:00
Jesse Vincent	71dc271a08	Sync plan's Task 5 blocks with review fixes	2026-06-09 17:13:03 -07:00
Jesse Vincent	5aea3dca31	SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment	2026-06-09 16:59:05 -07:00
Jesse Vincent	b3281c0227	Implementer prompt: re-run covering tests after fixing review findings	2026-06-09 16:56:28 -07:00
Jesse Vincent	c14c1de552	Scope spec reviewer's Your Job wording to the diff	2026-06-09 16:55:28 -07:00
Jesse Vincent	be8a6269c4	Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel	2026-06-09 16:53:30 -07:00
Jesse Vincent	da41209243	Use bare placeholder names in quality reviewer prompt body	2026-06-09 16:51:54 -07:00
Jesse Vincent	2cc449b6d4	Make per-task quality reviewer prompt self-contained and task-scoped	2026-06-09 16:47:27 -07:00
Jesse Vincent	f8dcd1ed3d	Add implementation plan for task-scoped review dispatch	2026-06-09 16:42:50 -07:00
Jesse Vincent	4192572d19	Harden review-dispatch spec per adversarial review findings	2026-06-09 16:33:44 -07:00
Jesse Vincent	5da15d7eba	Add design spec: task-scoped review dispatch for SDD	2026-06-09 16:26:00 -07:00
Jesse Vincent	f55642e0dd	Require contributors to disclose authoring environment and target dev Add a mandatory self-identification disclosure (model, harness, harness version, all installed plugins) to the PR template and all three issue templates, and document the requirement in the contributor guidelines. We weigh contributions differently depending on what produced them: content reasoned from documentation is held to a different bar than work grounded in a real session. Also state explicitly, in both CLAUDE.md and the PR template, that all PRs must target the dev branch rather than main.	2026-06-08 22:14:34 -07:00

1 2 3 4 5 ...

560 Commits