Commit Graph

190 Commits

Author SHA1 Message Date
Jesse Vincent
c87f336cc0 E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1 2026-06-11 15:31:01 -07:00
Jesse Vincent
8d354bb36b L2b: plan-mandated defects are findings the human adjudicates
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding,
reported as Important and labeled plan-mandated — the plan's authorship
does not grade its own work.

Controller rule (review loop): a plan-mandated finding, or any finding
conflicting with the plan's text, escalates to the human like any plan
contradiction — never dismissed because the plan mandates it.

E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the
tripwire 0/6 reports give the controller anything to escalate on (all
Approved, defect endorsed as spec-required); with it 6/6 report the
defect as a labeled finding.
2026-06-11 13:41:21 -07:00
Jesse Vincent
35464d67c0 E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract 2026-06-11 13:17:10 -07:00
Jesse Vincent
90b5433f59 E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-11 13:17:10 -07:00
Jesse Vincent
710f031ad0 writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks
Claims are fidelity and variance, not dollars (full attribution in the
superpowers-evals experiment log, 2026-06-11 L1 entry):
- Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact
  values verbatim; makes constraints mechanically propagatable to briefs
  and reviewers (a version-floor violation class shipped because they
  weren't). The one fix wave in the elicited full runs was a version-floor
  catch this header enabled.
- Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures,
  within-plan consistent; removes the controller's per-dispatch interface
  re-derivation.
- Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills
  standalone Types/README micro-tasks); no effect at small scale.
- End-to-end (opus-written plan executed under SDD): guidance plan ran 1
  fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug);
  execution cost equal within noise.
2026-06-11 11:35:53 -07:00
Jesse Vincent
72cb21b82c Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules
E30 replay: the planted-DRY catch is causally determined by the
controller-composed constraints block (0/6 with process-shaped vs 5/6
with the spec's own wording). E31 micro: this recipe doubles the rate
at which composed blocks carry the spec's cross-component relationship
(6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).
2026-06-11 11:35:53 -07:00
Jesse Vincent
fe90d6c469 Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-10 13:08:19 -07:00
Jesse Vincent
b81f35bb1e Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-10 13:08:06 -07:00
Jesse Vincent
a995af2e24 Shared: unique review-package collateral names 2026-06-10 10:18:02 -07:00
Jesse Vincent
d4dbf44162 Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-10 08:51:16 -07:00
Jesse Vincent
2434ef7f35 Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-10 08:28:28 -07:00
Jesse Vincent
e355795625 Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-10 03:44:19 -07:00
Jesse Vincent
29ee4e8e44 Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-10 02:20:28 -07:00
Jesse Vincent
28498a5cde Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-10 02:10:34 -07:00
Jesse Vincent
5e2907fc4f Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-10 02:09:10 -07:00
Jesse Vincent
e3c74fc1c9 Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance //⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-09 23:58:28 -07:00
Jesse Vincent
3e3e1e701e Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-09 22:42:54 -07:00
Jesse Vincent
853396e3ae Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-09 21:49:51 -07:00
Jesse Vincent
83d54f7ddd Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-09 21:47:41 -07:00
Jesse Vincent
c7900f1698 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-09 21:19:08 -07:00
Jesse Vincent
5cfdb75b94 Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-09 21:11:45 -07:00
Jesse Vincent
87825ff193 Forbid controllers pre-judging reviewer findings
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
2026-06-09 18:28:24 -07:00
Jesse Vincent
5aea3dca31 SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment 2026-06-09 16:59:05 -07:00
Jesse Vincent
b3281c0227 Implementer prompt: re-run covering tests after fixing review findings 2026-06-09 16:56:28 -07:00
Jesse Vincent
c14c1de552 Scope spec reviewer's Your Job wording to the diff 2026-06-09 16:55:28 -07:00
Jesse Vincent
be8a6269c4 Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel 2026-06-09 16:53:30 -07:00
Jesse Vincent
da41209243 Use bare placeholder names in quality reviewer prompt body 2026-06-09 16:51:54 -07:00
Jesse Vincent
2cc449b6d4 Make per-task quality reviewer prompt self-contained and task-scoped 2026-06-09 16:47:27 -07:00
Rahul
d7c260a978 fix(brainstorming): cap websocket frame payloads 2026-06-02 11:24:02 -07:00
Matt Van Horn
f776394360 feat(subagent-dev): add TDD RED evidence to implementer report format
Add a conditional TDD Evidence field to the implementer report format so controllers can verify RED and GREEN output when TDD was required.

The field asks for the command run, relevant RED/GREEN output, and the expected RED failure reason rather than raw full logs.

Fixes #994.
2026-06-01 16:15:05 -07:00
nestorluiscamachopaz
81c3052416 fix: foreground mode saves node PID and clears OWNER_PID on Windows/MSYS2
Verified on real Windows Git Bash: lifecycle test passed 12/12, manual start/stop released the port, and no brainstorm node processes remained.
2026-06-01 14:26:22 -07:00
nawfal
c879454a0d fix(finishing-a-development-branch): remove gh-specific PR creation instruction
Per obra's guidance on #1609: remove the github-specific instruction rather
than replacing it with a platform-detection table. Agents already know their
forge tooling; the skill only needs to cover the push step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 13:58:22 -07:00
nawfal
ff213eb2cf fix(finishing-a-development-branch): detect remote platform before creating PR/MR
Replaces hardcoded `gh pr create` in Option 2 with a platform-neutral
note: check `git remote get-url origin` first, then use gh (GitHub),
glab (GitLab), or fall back to the compare URL for unknown platforms.

Adds matching Red Flag entry so agents don't skip the detection step.

Fixes #1609

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 13:58:22 -07:00
Jesse Vincent
da00e59958 feat: add Antigravity CLI (agy) support
Antigravity (Google's `agy` CLI) installs the existing Superpowers plugin
directly:

    agy plugin install https://github.com/obra/superpowers

agy imports the bundled skills and runs the plugin's SessionStart hook, so
using-superpowers bootstraps from the first message — verified on agy 1.0.3:
a fresh session given "Let's make a react todo list" auto-triggers the
brainstorming skill instead of writing code. agy discovers skills natively
and, having no Skill tool, loads them by reading SKILL.md with view_file.

No scaffold, installer, or generated context file is needed. This adds only:

- README.md: an Antigravity install section + Quickstart link
- skills/using-superpowers/SKILL.md: reference to the agy tool mapping
- skills/using-superpowers/references/antigravity-tools.md: action->tool
  mapping for agy (view_file, write_to_file, invoke_subagent, manage_task,
  and skill loading via view_file on SKILL.md)
- tests/antigravity/: structural test for the tool mapping, mirroring
  tests/pi/
2026-06-01 11:42:09 -07:00
Jesse Vincent
8811b0f2d7 Revert "Make visual-companion.md script paths skill-rooted, not plugin-rooted"
This reverts commit e9f5188289.
2026-05-23 17:01:46 -07:00
Nick Galatis
21ad401e90 fix(systematic-debugging): defuse Claude Code ultrathink keyword scanner trigger (#1558)
The "Signals You're Doing It Wrong" bullet in systematic-debugging/SKILL.md
contains the literal token Claude Code's runtime scans for in tool result
bodies. Every Skill-tool invocation of this skill caused the harness to
inject a spurious system-reminder claiming the user requested deeper
reasoning, silently bumping every session into extended thinking.

Replace the bullet's spelling so the contiguous letter sequence the scanner
matches is broken with a hyphen. The signal text remains recognizable to
the agent and the documented action ("Question fundamentals, not just
symptoms") is unchanged.

Fixes obra/superpowers#1283
2026-05-23 16:51:00 -07:00
Jesse Vincent
e9f5188289 Make visual-companion.md script paths skill-rooted, not plugin-rooted
Issue #1134: agents reading visual-companion.md see bare commands like
`scripts/start-server.sh`, correctly identify the plugin install
directory, then look for `<plugin>/scripts/start-server.sh` instead of
`<plugin>/skills/brainstorming/scripts/start-server.sh`. The file
doesn't exist at the plugin-rooted path, so the agent concludes the
visual companion isn't available and falls back to text-only
brainstorming.

Multiple independent reproductions in the issue thread, plus one user's
agent self-reported: "I assumed the scripts folder was in the root
directory of the plugin, it didn't realize it could have been talking
about the skill folder itself."

Change all `scripts/<file>` references in visual-companion.md to
`skills/brainstorming/scripts/<file>`. Agents that correctly identify
the plugin root will now join to the right path.

Closes #1134.
2026-05-23 16:42:13 -07:00
Jesse Vincent
e1d3f71e0d Convert curly to square brackets in code-reviewer.md placeholders
Matches the style used by the spec-reviewer-prompt.md and
code-quality-reviewer-prompt.md call sites, which already use square
brackets ([VAR] or [VAR — description]). No semantic change — these
placeholders are filled in by the controller; nothing programmatic
substitutes them.
2026-05-23 16:14:24 -07:00
Jesse Vincent
b2212dc913 Scope spec reviewer to task diff and make reviewers read-only
Two problems with the SDD reviewer prompts on dev:

- spec-reviewer-prompt.md never received a git range, so the
  general-purpose subagent had to crawl the entire codebase to find what
  changed. Reporter measured 20-33 minute spec reviews on simple tasks
  (#1538).
- Neither reviewer prompt told the subagent that review is read-only.
  A spec reviewer running `git checkout <parent-sha>` for historical
  comparison silently detached HEAD on the controller's branch, then
  subsequent task commits accumulated on the detached HEAD and were
  effectively orphaned (#1543, reproduced independently in #1543's
  thread).

Add a Git Range to Review section to spec-reviewer-prompt.md that
mirrors the one code-reviewer.md already has, plus a Read-Only Review
section in both reviewer prompt templates stating the principle: do
not mutate the working tree, the index, HEAD, or branch state. Allow
inspecting other revisions via a separate temporary worktree, so the
read-only rule does not block legitimate historical comparison.

Closes #1538.
Closes #1543.
2026-05-23 16:14:05 -07:00
Jesse Vincent
180f009090 @mhat reported that his claude got confused about 'debugging' being named as a skill in the bootstrap 2026-05-21 17:23:25 -04:00
Drew Ritter
49bf5ad6dc Align Pi mapping with action vocabulary 2026-05-13 17:58:46 -07:00
Jesse Vincent
cafbc5a4bd feat: add pi superpowers package extension 2026-05-13 17:58:46 -07:00
Drew Ritter
d4d99117f2 Tighten cross-platform tool references 2026-05-13 17:46:28 -07:00
Jesse Vincent
01034bcf8f Phase E: action-language tool vocabulary
Replace Claude-Code-specific tool names in skill prose, prompt
templates, and OpenCode-facing docs with action-language descriptions
that resolve to each runtime's native tool via the per-platform refs.

Changes by category:

- Prose mentions ("Use TodoWrite to track...", "Use Task tool with
  general-purpose type") → action language ("Track each item as a
  todo", "Dispatch a general-purpose subagent")

- Prompt template headers (6 files): "Task tool (general-purpose):"
  → "Subagent (general-purpose):" — preserves the type information
  without naming Claude Code's specific dispatch tool

- DOT flowchart node labels: "Invoke Skill tool" → "Invoke the
  skill"; "Create TodoWrite todo per item" → "Create a todo per
  item"

- OpenCode INSTALL.md and docs/README.opencode.md: replace the old
  "TodoWrite → todowrite, Task → @mention" mapping (which both
  taught a vocabulary skills no longer use AND was wrong about
  @mention being a real OpenCode syntax) with an action-language
  mapping verified against the installed OpenCode CLI's tool
  inventory.

The platform-tools refs landed in Phase B already document each
runtime's resolution; skills now speak in the actions those refs
map. Tool names that genuinely belong only in the per-platform
dispatch section ("In Claude Code: Use the `Skill` tool") and the
Claude-Code-specific Bash run_in_background flag note in
visual-companion remain — those are intentional carve-outs.
2026-05-13 17:46:28 -07:00
Jesse Vincent
b87a5e4721 Phase D: cross-runtime tweaks (visual-companion, executing-plans, test)
Misc platform/runtime statements and adjacencies that don't fit the
prose, config-ref, README-ordering, or tool-vocabulary buckets:

- visual-companion frame template: rename CSS/HTML id #claude-content
  → #frame-content. The id is purely styling — nothing external
  references it. The brainstorm-server test that asserted the old
  string is updated in lockstep.

- visual-companion launch instructions: add a Copilot CLI section
  alongside Claude Code, Codex, and Gemini CLI; combine the Claude
  Code (macOS / Linux) and (Windows) sections so heading style
  matches the other (non-OS-qualified) platforms.

- visual-companion: "Use Write tool" → "Use your file-creation tool"
  for the cat/heredoc warning. The prohibition is what's load-
  bearing, not the tool name.

- executing-plans/SKILL.md: list all subagent-capable runtimes
  (Claude Code, Codex CLI, Codex App, Copilot CLI, Gemini CLI) and
  point at the per-platform tool refs as the source of truth.

- executing-plans/SKILL.md: relative path "using-superpowers/
  references/" → "../using-superpowers/references/" to resolve
  correctly from the executing-plans/ directory.

No bundled spec doc here — Phase D was scope-extension work that
took place across rounds, with no standalone spec authored.
2026-05-13 17:46:28 -07:00
Jesse Vincent
5c0402736e Phase B: config-file refs + per-platform tool refs + spec
Two structural changes:

1. Generalize CLAUDE.md-specific guidance:
   - "Project-specific conventions (put in CLAUDE.md)" → "(put in
     your instructions file)" in writing-skills/SKILL.md
   - "(explicit CLAUDE.md violation)" → "(explicit instruction-file
     violation)" in receiving-code-review/SKILL.md
   - The instruction-priority list in using-superpowers/SKILL.md
     stays inclusive (CLAUDE.md, GEMINI.md, AGENTS.md) — that's
     load-bearing, not a substitution opportunity.

2. Per-platform tool reference files at skills/using-superpowers/
   references/{claude-code,codex,copilot,gemini}-tools.md. Each ref
   documents:
   - The runtime's preferred instructions file (CLAUDE.md, AGENTS.md,
     GEMINI.md, etc.) and how it loads
   - The runtime's personal-skills directory + cross-runtime
     ~/.agents/skills/ path where applicable
   - Action-language → tool-name mapping table

Tool names and table content reflect the source-verified state from
direct inspection of openai/codex, google-gemini/gemini-cli,
sst/opencode, and the installed @github/copilot package. Filenames
and behaviors are sourced from each runtime's official docs.

Files in this commit also pick up later-phase changes that
accumulated on the same files (using-superpowers/SKILL.md "How to
Access Skills" overhaul, action-language flowchart, refs' final
table content). The bundled spec records original scope.
2026-05-13 17:46:28 -07:00
Jesse Vincent
d0e413b591 Phase A: agent-neutral prose + CSO → SDO + spec
Replace generic third-person "Claude" with "agents" / "your agent"
forms across active skill prose, the README intro, and the vendored
anthropic-best-practices.md reference. Carve-outs preserved:
historical attribution paths, the "Variant C: Claude.AI Emphatic
Style" example label, model identifiers (Haiku/Sonnet/Opus), and the
"In Claude Code:" per-platform skill-dispatch list.

Coined-term rename: "Claude Search Optimization (CSO)" → "Skill
Discovery Optimization (SDO)" in writing-skills/SKILL.md.

Files in this commit also pick up later-phase changes that
accumulated on the same files (dispatching-parallel-agents code-
example transformation, writing-skills numbering and path fixes).
The bundled spec at docs/superpowers/specs/ records the original
scope and the carve-outs.

README.md gets only its prose change here; the alphabetization
lands in Phase C's commit.
2026-05-13 17:46:28 -07:00
Drew Ritter
3d6dc90c6d fix(tdd): link testing anti-patterns reference (#1532)
Fixes #1529.
2026-05-12 17:22:42 -07:00
Drew Ritter
a152bb3932 [codex] replace Circle K signal with generic review guidance (#1531)
* Remove Circle K signal from review skill

* Add generic review hesitation guidance

* Use Jesse wording for review hesitation guidance
2026-05-12 17:22:19 -07:00
Drew Ritter
3dfb376268 fix: remove global worktree path fallback (#1476) 2026-05-12 10:24:45 -07:00