Compare commits

..

46 Commits

Author SHA1 Message Date
Jesse Vincent
3e20a04ae5 Job posting 2026-06-15 13:48:30 -07:00
Jesse Vincent
71489c8160 E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1 2026-06-15 12:17:46 -07:00
Jesse Vincent
97c9ea3f7d Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27 2026-06-15 12:17:46 -07:00
Jesse Vincent
afecfcd239 L2b: plan-mandated defects are findings the human adjudicates
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding,
reported as Important and labeled plan-mandated — the plan's authorship
does not grade its own work.

Controller rule (review loop): a plan-mandated finding, or any finding
conflicting with the plan's text, escalates to the human like any plan
contradiction — never dismissed because the plan mandates it.

E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the
tripwire 0/6 reports give the controller anything to escalate on (all
Approved, defect endorsed as spec-required); with it 6/6 report the
defect as a labeled finding.
2026-06-15 12:17:46 -07:00
Jesse Vincent
2989810931 E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract 2026-06-15 12:17:46 -07:00
Jesse Vincent
1588b949f2 E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-15 12:17:46 -07:00
Jesse Vincent
b1eb92ea72 Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not 2026-06-15 12:15:06 -07:00
Jesse Vincent
6e9bbb7e3e writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks
Claims are fidelity and variance, not dollars (full attribution in the
superpowers-evals experiment log, 2026-06-11 L1 entry):
- Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact
  values verbatim; makes constraints mechanically propagatable to briefs
  and reviewers (a version-floor violation class shipped because they
  weren't). The one fix wave in the elicited full runs was a version-floor
  catch this header enabled.
- Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures,
  within-plan consistent; removes the controller's per-dispatch interface
  re-derivation.
- Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills
  standalone Types/README micro-tasks); no effect at small scale.
- End-to-end (opus-written plan executed under SDD): guidance plan ran 1
  fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug);
  execution cost equal within noise.
2026-06-15 12:15:06 -07:00
Jesse Vincent
fe938ac86c Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules
E30 replay: the planted-DRY catch is causally determined by the
controller-composed constraints block (0/6 with process-shaped vs 5/6
with the spec's own wording). E31 micro: this recipe doubles the rate
at which composed blocks carry the spec's cross-component relationship
(6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).
2026-06-15 12:15:06 -07:00
Jesse Vincent
07dec9331f Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance 2026-06-15 12:15:06 -07:00
Jesse Vincent
afcbf8bacb Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed) 2026-06-15 12:15:06 -07:00
Jesse Vincent
fa14c8d671 Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead) 2026-06-15 12:15:06 -07:00
Jesse Vincent
8b76932337 Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first 2026-06-15 12:15:06 -07:00
Jesse Vincent
0702ec2c6f Record writing-plans micro-test result: resolved, no change needed 2026-06-15 12:15:06 -07:00
Jesse Vincent
85a9324a53 Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges) 2026-06-15 12:15:06 -07:00
Jesse Vincent
610b09874e Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-15 12:15:06 -07:00
Jesse Vincent
6df501ea5d Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-15 12:15:06 -07:00
Jesse Vincent
1585f40c8e Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants 2026-06-15 12:15:06 -07:00
Jesse Vincent
60c0b744b4 Shared: unique review-package collateral names 2026-06-15 12:15:06 -07:00
Jesse Vincent
6b3e4ad407 Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-15 12:15:06 -07:00
Jesse Vincent
a84bb0f52b Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-15 12:15:06 -07:00
Jesse Vincent
69d396a676 Spec: record iterations 2-3 results and final frozen-config matrix 2026-06-15 12:15:06 -07:00
Jesse Vincent
e4457c970e Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-15 12:15:06 -07:00
Jesse Vincent
fac5888846 Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-15 12:15:06 -07:00
Jesse Vincent
8ac14c0450 Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-15 12:15:06 -07:00
Jesse Vincent
3ed554d557 Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-15 12:15:06 -07:00
Jesse Vincent
4e8edca36e Spec: document cost iterations and the per-task review consolidation 2026-06-15 12:15:06 -07:00
Jesse Vincent
d7726d99dc Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance //⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-15 12:15:06 -07:00
Jesse Vincent
4c1f1e5cc5 Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-15 12:15:06 -07:00
Jesse Vincent
7288393773 Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-15 12:15:06 -07:00
Jesse Vincent
254a8e2e32 Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-15 12:15:06 -07:00
Jesse Vincent
7c11cee649 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-15 12:15:06 -07:00
Jesse Vincent
b36cf86afd Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-15 12:15:06 -07:00
Jesse Vincent
06bec17a34 Forbid controllers pre-judging reviewer findings
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
2026-06-15 12:15:06 -07:00
Jesse Vincent
236524413b Sync plan: escaped pre() pattern in Task 5 checks block 2026-06-15 12:15:06 -07:00
Jesse Vincent
6e019e0316 Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block 2026-06-15 12:15:06 -07:00
Jesse Vincent
d4bb8d268f Sync plan's Task 5 blocks with review fixes 2026-06-15 12:15:06 -07:00
Jesse Vincent
d519ba65fd SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment 2026-06-15 12:15:06 -07:00
Jesse Vincent
d32a56dc32 Implementer prompt: re-run covering tests after fixing review findings 2026-06-15 12:15:06 -07:00
Jesse Vincent
994bc26d2a Scope spec reviewer's Your Job wording to the diff 2026-06-15 12:15:06 -07:00
Jesse Vincent
d5850df1bc Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel 2026-06-15 12:15:06 -07:00
Jesse Vincent
b5edd40d2c Use bare placeholder names in quality reviewer prompt body 2026-06-15 12:15:06 -07:00
Jesse Vincent
6a02446953 Make per-task quality reviewer prompt self-contained and task-scoped 2026-06-15 12:15:06 -07:00
Jesse Vincent
042d238b26 Add implementation plan for task-scoped review dispatch 2026-06-15 12:15:06 -07:00
Jesse Vincent
cf81ad2ac3 Harden review-dispatch spec per adversarial review findings 2026-06-15 12:15:06 -07:00
Jesse Vincent
cb0dbeb095 Add design spec: task-scoped review dispatch for SDD 2026-06-15 12:15:06 -07:00
4 changed files with 68 additions and 4 deletions

View File

@@ -2,6 +2,13 @@
Superpowers is a complete software development methodology for your coding agents, built on top of a set of composable skills and some initial instructions that make sure your agent uses them.
## We're Hiring!
We're hiring someone to help out full time with Superpowers community and code work.
You can read about the job at https://primeradiant.com/jobs/superpowers-community-engineer/
If this sounds like someone you know, definitely send them our way.
## Quickstart
Give your agent Superpowers: [Claude Code](#claude-code), [Antigravity](#antigravity), [Codex App](#codex-app), [Codex CLI](#codex-cli), [Cursor](#cursor), [Factory Droid](#factory-droid), [Gemini CLI](#gemini-cli), [GitHub Copilot CLI](#github-copilot-cli), [Kimi Code](#kimi-code), [OpenCode](#opencode), [Pi](#pi).

View File

@@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit
escalation; they absorb implicit authority-vs-quality adjudication.
A possible L2b (discrete rule: "a reviewer finding that conflicts with
the plan's text is the human's decision — escalate it") would route the
failing judgment through the escalation behavior that held; untested.
Original recon notes follow.
failing judgment through the escalation behavior that held.
**L2b tested 2026-06-11 (E35/E36, evals
`docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the
opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer
tripwire (a plan-mandated defect IS a finding — Important, labeled
plan-mandated; the human decides) and a controller escalation rule
(plan-mandated findings go to the human like any plan contradiction).
Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings.
Full battery: opus controllers 2/2 internalized the rule, caught their
reviewer's miss as self-described backstop, and escalated for a
sanctioned fix (the 4241 ad-hoc behavior made structural); escalation
sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase
drops the tripwire from dispatches (2/5 transmitted), transmission
alone doesn't fire it live (read-once dilution across the reviewer's
tool reads; placement within the dispatch refuted as the variable),
and no sonnet controller showed backstop behavior; 1/5 shipped the
defect. The L2b rules are a candidate commit for the opus stack.
A future L2c for the sonnet rung would pair the SKILL.md
constraints-recipe (the one channel sonnet transmits verbatim) with a
mandatory output-format slot for plan-mandated findings (the skeleton
survives every observed paraphrase and is consulted at composition
time); untested. Original recon notes follow.
**Recon (superseded):**
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at

View File

@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review
**Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
**Narration:** between tool calls, narrate at most one short line — the
ledger and the tool results carry the record.
**Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
## When to Use
@@ -79,6 +82,20 @@ digraph process {
}
```
## Pre-Flight Plan Review
Before dispatching Task 1, scan the plan once for conflicts:
- tasks that contradict each other or the plan's Global Constraints
- anything the plan explicitly mandates that the review rubric treats as a
defect (a test that asserts nothing, verbatim duplication of a logic block)
Present everything you find to your human partner as one batched question —
each finding beside the plan text that mandates it, asking which governs —
before execution begins, not one interrupt per discovery mid-plan. If the
scan is clean, proceed without comment. The review loop remains the net for
conflicts that only emerge from implementation.
## Model Selection
Use the least powerful model that can handle each role to conserve cost and increase speed.
@@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
**Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
**Architecture and design tasks**: use the most capable available model.
The final whole-branch review is one of these — dispatch it on the most
capable available model, not the session default.
**Review tasks**: choose the model with the same judgment, scaled to the
diff's size, complexity, and risk. A small mechanical diff does not need the
@@ -100,8 +119,10 @@ most expensive — which silently defeats this section.
**Turn count beats token price.** Wall-clock and context cost scale with how
many turns a subagent takes, and the cheapest models routinely take 2-3× the
turns on multi-step work — costing more overall. Use a mid-tier model as the
floor for implementers and reviewers; reserve the cheapest tier for
single-file mechanical fixes.
floor for reviewers and for implementers working from prose descriptions.
When the task's plan text contains the complete code to write, the
implementation is transcription plus testing: use the cheapest tier for
that implementer. Single-file mechanical fixes also take the cheapest tier.
**Task complexity signals (implementation tasks):**
- Touches 1-2 files with a complete spec → cheap model
@@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template:
findings in the progress ledger as you go, and point the final
whole-branch review at that list so it can triage which must be fixed
before merge. A roll-up nobody reads is a silent discard.
- A finding labeled plan-mandated — or any finding that conflicts with
what the plan's text requires — is the human's decision, like any plan
contradiction: present the finding and the plan text, ask which governs.
Do not dismiss the finding because the plan mandates it, and do not
dispatch a fix that contradicts the plan without asking.
- The final whole-branch review gets a package too: run
`scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
branch started from, e.g. `git merge-base main HEAD`) and include the

View File

@@ -115,6 +115,11 @@ Subagent (general-purpose):
"yes." A tight report that cites lines gives the controller everything
it needs.
Your final message is the report itself: begin directly with the
spec-compliance verdict. Every line is a verdict, a finding with
file:line, or a check you ran — no preamble, no process narration,
no closing summary.
## Calibration
Categorize issues by actual severity. Not everything is Critical.
@@ -123,6 +128,11 @@ Subagent (general-purpose):
would block a merge over — verbatim duplication of a logic block,
swallowed errors, tests that assert nothing. "Coverage could be broader"
and polish suggestions are Minor.
If the plan or brief explicitly mandates something this rubric calls a
defect (a test that asserts nothing, verbatim duplication of a logic
block), that IS a finding — report it as Important, labeled
plan-mandated. The plan's authorship does not grade its own work; the
human decides.
Acknowledge what was done well before listing issues — accurate praise
helps the implementer trust the rest of the feedback.