Commit Graph

588 Commits

Author SHA1 Message Date
Jesse Vincent
5e204f1128 E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-15 12:16:47 -07:00
Jesse Vincent
b1eb92ea72 Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not 2026-06-15 12:15:06 -07:00
Jesse Vincent
6e9bbb7e3e writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks
Claims are fidelity and variance, not dollars (full attribution in the
superpowers-evals experiment log, 2026-06-11 L1 entry):
- Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact
  values verbatim; makes constraints mechanically propagatable to briefs
  and reviewers (a version-floor violation class shipped because they
  weren't). The one fix wave in the elicited full runs was a version-floor
  catch this header enabled.
- Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures,
  within-plan consistent; removes the controller's per-dispatch interface
  re-derivation.
- Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills
  standalone Types/README micro-tasks); no effect at small scale.
- End-to-end (opus-written plan executed under SDD): guidance plan ran 1
  fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug);
  execution cost equal within noise.
2026-06-15 12:15:06 -07:00
Jesse Vincent
fe938ac86c Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules
E30 replay: the planted-DRY catch is causally determined by the
controller-composed constraints block (0/6 with process-shaped vs 5/6
with the spec's own wording). E31 micro: this recipe doubles the rate
at which composed blocks carry the spec's cross-component relationship
(6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).
2026-06-15 12:15:06 -07:00
Jesse Vincent
07dec9331f Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance 2026-06-15 12:15:06 -07:00
Jesse Vincent
afcbf8bacb Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed) 2026-06-15 12:15:06 -07:00
Jesse Vincent
fa14c8d671 Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead) 2026-06-15 12:15:06 -07:00
Jesse Vincent
8b76932337 Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first 2026-06-15 12:15:06 -07:00
Jesse Vincent
0702ec2c6f Record writing-plans micro-test result: resolved, no change needed 2026-06-15 12:15:06 -07:00
Jesse Vincent
85a9324a53 Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges) 2026-06-15 12:15:06 -07:00
Jesse Vincent
610b09874e Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-15 12:15:06 -07:00
Jesse Vincent
6df501ea5d Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-15 12:15:06 -07:00
Jesse Vincent
1585f40c8e Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants 2026-06-15 12:15:06 -07:00
Jesse Vincent
60c0b744b4 Shared: unique review-package collateral names 2026-06-15 12:15:06 -07:00
Jesse Vincent
6b3e4ad407 Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-15 12:15:06 -07:00
Jesse Vincent
a84bb0f52b Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-15 12:15:06 -07:00
Jesse Vincent
69d396a676 Spec: record iterations 2-3 results and final frozen-config matrix 2026-06-15 12:15:06 -07:00
Jesse Vincent
e4457c970e Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-15 12:15:06 -07:00
Jesse Vincent
fac5888846 Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-15 12:15:06 -07:00
Jesse Vincent
8ac14c0450 Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-15 12:15:06 -07:00
Jesse Vincent
3ed554d557 Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-15 12:15:06 -07:00
Jesse Vincent
4e8edca36e Spec: document cost iterations and the per-task review consolidation 2026-06-15 12:15:06 -07:00
Jesse Vincent
d7726d99dc Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance //⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-15 12:15:06 -07:00
Jesse Vincent
4c1f1e5cc5 Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-15 12:15:06 -07:00
Jesse Vincent
7288393773 Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-15 12:15:06 -07:00
Jesse Vincent
254a8e2e32 Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-15 12:15:06 -07:00
Jesse Vincent
7c11cee649 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-15 12:15:06 -07:00
Jesse Vincent
b36cf86afd Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-15 12:15:06 -07:00
Jesse Vincent
06bec17a34 Forbid controllers pre-judging reviewer findings
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
2026-06-15 12:15:06 -07:00
Jesse Vincent
236524413b Sync plan: escaped pre() pattern in Task 5 checks block 2026-06-15 12:15:06 -07:00
Jesse Vincent
6e019e0316 Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block 2026-06-15 12:15:06 -07:00
Jesse Vincent
d4bb8d268f Sync plan's Task 5 blocks with review fixes 2026-06-15 12:15:06 -07:00
Jesse Vincent
d519ba65fd SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment 2026-06-15 12:15:06 -07:00
Jesse Vincent
d32a56dc32 Implementer prompt: re-run covering tests after fixing review findings 2026-06-15 12:15:06 -07:00
Jesse Vincent
994bc26d2a Scope spec reviewer's Your Job wording to the diff 2026-06-15 12:15:06 -07:00
Jesse Vincent
d5850df1bc Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel 2026-06-15 12:15:06 -07:00
Jesse Vincent
b5edd40d2c Use bare placeholder names in quality reviewer prompt body 2026-06-15 12:15:06 -07:00
Jesse Vincent
6a02446953 Make per-task quality reviewer prompt self-contained and task-scoped 2026-06-15 12:15:06 -07:00
Jesse Vincent
042d238b26 Add implementation plan for task-scoped review dispatch 2026-06-15 12:15:06 -07:00
Jesse Vincent
cf81ad2ac3 Harden review-dispatch spec per adversarial review findings 2026-06-15 12:15:06 -07:00
Jesse Vincent
cb0dbeb095 Add design spec: task-scoped review dispatch for SDD 2026-06-15 12:15:06 -07:00
Drew Ritter
9eb452afe7 chore: bump evals submodule to claude transcript-capture fix
Bumps evals 7f8e80c -> db37d5f (superpowers-evals#16): the claude launcher now
sets CLAUDE_CODE_FORCE_SESSION_PERSISTENCE=1 so nested interactive claude
(>=2.1.176) persists its transcript — restoring claude capture (verdicts +
cost/token data) on the latest CLI (2.1.177) with no version pin. Also folds in
the audit_liveness ruff/ty cleanup and the B1 audit-doc correction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 15:17:11 -07:00
Drew Ritter
93f2ce91b8 Fix companion stop metadata and token permissions 2026-06-11 13:53:06 -07:00
Drew Ritter
e9ee6c5b4d Harden Windows browser launcher 2026-06-11 13:53:06 -07:00
Drew Ritter
5415cb8ccf Fix Windows lifecycle validation 2026-06-11 13:53:06 -07:00
Drew Ritter
1c21a91e01 Align visual companion docs with shipped scope 2026-06-11 13:53:06 -07:00
Drew Ritter
441335ee3e Fix companion test cleanup and argv assertions 2026-06-11 13:53:06 -07:00
Drew Ritter
377192f7a1 Harden companion platform tests 2026-06-11 13:53:06 -07:00
Drew Ritter
5eea0d09d7 Fix companion lifecycle test ownership metadata 2026-06-11 13:53:06 -07:00
Drew Ritter
a6a4cd85b9 Harden companion stop ownership proof 2026-06-11 13:53:06 -07:00