Commit Graph

611 Commits

Author SHA1 Message Date
Jesse Vincent
bfa3e4137a Keep Codex hooks manifest in plugin metadata
Prompt: Jesse questioned whether the PR should remove the hooks config from the Codex plugin manifest.

Runtime investigation showed Codex accepts a committed plugin manifest with hooks and installs the plugin successfully. Removing the field changes behavior: Codex falls back to the default hooks/hooks.json, which uses the non-Codex session-start hook and CLAUDE_PLUGIN_ROOT path, instead of hooks/hooks-codex.json and the session-start-codex script.

Changes: restore .codex-plugin/plugin.json hooks to ./hooks/hooks-codex.json and update the Codex marketplace manifest test to require that Codex-specific hook pointer instead of rejecting hooks.

Validation: bash tests/codex/test-marketplace-manifest.sh; scripts/lint-shell.sh tests/codex/test-marketplace-manifest.sh; bash tests/codex-plugin-sync/test-sync-to-codex-plugin.sh; bash tests/kimi/test-plugin-manifest.sh; bash tests/shell-lint/test-lint-shell.sh.
2026-06-22 11:51:28 -07:00
Jesse Vincent
a17aaaef3a Add Codex marketplace manifest
Prompt: Jesse asked for a new worktree off the local superpowers dev branch to add the Codex manifest after diagnosing why github.com/obra/superpowers did not show installable Codex plugins.

Root cause: Codex marketplace sources expect a .agents/plugins/marketplace.json at the marketplace root. The superpowers repo only had the Claude marketplace file and the Codex plugin manifest, so Codex could configure the marketplace name but found no installable plugin entries.

Changes: add a repo-local Codex marketplace manifest for superpowers-dev that points at this same repository root via the same-root source pattern Codex already accepts; add a focused marketplace manifest test; remove the unsupported hooks field from .codex-plugin/plugin.json so the plugin validator accepts the manifest.

Validation: bash tests/codex/test-marketplace-manifest.sh; uv run --with PyYAML python /Users/jesse/.codex/skills/.system/plugin-creator/scripts/validate_plugin.py /Users/jesse/git/superpowers/superpowers/.worktrees/codex-marketplace-manifest; throwaway HOME codex plugin marketplace add/list/add; bash tests/codex-plugin-sync/test-sync-to-codex-plugin.sh; bash tests/kimi/test-plugin-manifest.sh; bash tests/shell-lint/test-lint-shell.sh; scripts/lint-shell.sh tests/codex/test-marketplace-manifest.sh.
2026-06-22 11:51:28 -07:00
Jesse Vincent
896224c4b1 Release v6.0.3: SDD artifacts move out of the .git/ protected path
Bump all plugin manifests to 6.0.3. This release moves subagent-driven-
development's scratch artifacts (task briefs, implementer reports, review
diffs, progress ledger) from .git/sdd/ — which Claude Code denies agent
writes to — into a self-ignoring working-tree .superpowers/sdd/ dir, and
bumps the brainstorm-server test harness's ws dependency to clear two
dependabot alerts. See RELEASE-NOTES.md.
v6.0.3
2026-06-18 15:44:22 -07:00
Jesse Vincent
549dee6f64 test(deps): bump ws to ^8.21.0 in brainstorm-server tests
Clears two dependabot alerts on the test harness's ws dependency:
GHSA-96hv-2xvq-fx4p (high, memory-exhaustion DoS, fixed 8.21.0) and
GHSA-58qx-3vcg-4xpx (medium, uninitialized memory disclosure, fixed
8.20.1). Test-only — the shipped brainstorm server hand-rolls its
WebSocket framing and does not depend on ws. Suite passes (57/57).
2026-06-18 15:44:22 -07:00
Jesse Vincent
4f9bd3131e docs: add v6.0.3 release notes for the SDD .git/ workspace fix 2026-06-18 15:44:22 -07:00
Jesse Vincent
caf14aac66 test(sdd): wire test-sdd-workspace.sh into the runner; note git clean -fdx
The per-worktree workspace test was added but never registered in
run-skill-tests.sh, so it only ran when invoked by hand. Add it to the
fast unit-test array alongside the other pure-shell test.

Also document, in the Durable Progress section, that the ledger now
lives in git-ignored working-tree scratch, so `git clean -fdx` deletes
it — recover from `git log` if that happens.
2026-06-18 15:44:22 -07:00
Jesse Vincent
667b2c4a2e test(sdd): lock in per-worktree workspace isolation (#1780) 2026-06-18 15:44:22 -07:00
Jesse Vincent
93b8444b51 fix(sdd): write artifacts to working-tree .superpowers/sdd, not .git/ (#1780) 2026-06-18 15:44:22 -07:00
Jesse Vincent
207a12b203 feat(sdd): add sdd-workspace helper for a self-ignoring artifact dir 2026-06-18 15:44:22 -07:00
Jesse Vincent
b62616fc12 Release v6.0.2: stop shipping the evals submodule
It broke plugin installs for some users (#1778, #1774). The eval harness
now lives in its own repo, separate from the published plugin.
v6.0.2
2026-06-16 22:42:19 -07:00
Jesse Vincent
a21956e48c Release v6.0.1: Codex fixes
- Brainstorm companion reads version from .codex-plugin/plugin.json when package.json is absent (PRI-2240)
- sync-to-codex script excludes .gitmodules and .pre-commit-config.yaml (PRI-1168)
2026-06-16 17:02:33 -07:00
Drew Ritter
29c0b1b7db fix: read Codex plugin version from manifest (PRI-2240) 2026-06-16 17:02:33 -07:00
Drew Ritter
cf32920d3a fix: exclude repo metadata from Codex sync (PRI-1168) 2026-06-16 17:02:33 -07:00
Jesse Vincent
284be5905e Set v6.0.0 release date to 2026-06-16 v6.0.0 2026-06-16 10:09:47 -07:00
Jesse Vincent
77879bbb91 Bump evals submodule: unify per-agent bootstrap scenarios
Points evals at superpowers-evals 70a245c, which replaces the seven
per-agent *-superpowers-bootstrap scenarios with one cross-agent
superpowers-bootstrap scenario (adds the QUORUM_CODING_AGENT env var and
the bootstrap-installed dispatcher check verb).
2026-06-16 10:09:47 -07:00
Jesse Vincent
c5a965101b Bump version to 6.0.0 2026-06-16 10:09:47 -07:00
Drew Ritter
b3ee712d3a Add visual companion Prime Radiant branding 2026-06-16 10:09:47 -07:00
Jesse Vincent
9c61797773 Draft Superpowers 6 release notes 2026-06-16 10:09:47 -07:00
Jesse Vincent
b61b55013a E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1 2026-06-16 10:09:47 -07:00
Jesse Vincent
be400204b3 Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27 2026-06-16 10:09:47 -07:00
Jesse Vincent
530476fd00 L2b: plan-mandated defects are findings the human adjudicates
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding,
reported as Important and labeled plan-mandated — the plan's authorship
does not grade its own work.

Controller rule (review loop): a plan-mandated finding, or any finding
conflicting with the plan's text, escalates to the human like any plan
contradiction — never dismissed because the plan mandates it.

E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the
tripwire 0/6 reports give the controller anything to escalate on (all
Approved, defect endorsed as spec-required); with it 6/6 report the
defect as a labeled finding.
2026-06-16 10:09:47 -07:00
Jesse Vincent
e97faafb5a E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract 2026-06-16 10:09:47 -07:00
Jesse Vincent
cfe48c28ac E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-16 10:09:47 -07:00
Jesse Vincent
8bcefb12cb Strict-cost spec: L2 final — died at gates; explicit escalation holds at sonnet, implicit adjudication does not 2026-06-16 10:09:47 -07:00
Jesse Vincent
8e1262a3ba writing-plans: task right-sizing, Global Constraints header, per-task Interfaces blocks
Claims are fidelity and variance, not dollars (full attribution in the
superpowers-evals experiment log, 2026-06-11 L1 entry):
- Global Constraints header: 0/5 -> 5/5 adoption in micro-tests, exact
  values verbatim; makes constraints mechanically propagatable to briefs
  and reviewers (a version-floor violation class shipped because they
  weren't). The one fix wave in the elicited full runs was a version-floor
  catch this header enabled.
- Per-task Interfaces blocks: 0 -> 100% of tasks, exact signatures,
  within-plan consistent; removes the controller's per-dispatch interface
  re-derivation.
- Task right-sizing: 9.4 -> 8.4 mean tasks at svelte scale (kills
  standalone Types/README micro-tasks); no effect at small scale.
- End-to-end (opus-written plan executed under SDD): guidance plan ran 1
  fix wave vs control's 2-4 (control plan shipped a real Sierpinski bug);
  execution cost equal within noise.
2026-06-16 10:09:47 -07:00
Jesse Vincent
de4672b171 Constraints block is the reviewer's attention lens: copy spec verbatim, never improvise process rules
E30 replay: the planted-DRY catch is causally determined by the
controller-composed constraints block (0/6 with process-shaped vs 5/6
with the spec's own wording). E31 micro: this recipe doubles the rate
at which composed blocks carry the spec's cross-component relationship
(6/6 vs 3/6). Affects dev and the redesign equally (E29: both 4/5).
2026-06-16 10:09:47 -07:00
Jesse Vincent
25192df30b Strict-cost spec: L1 final — cost win re-attributed to complete-code plans; guidance owns fidelity/variance 2026-06-16 10:09:47 -07:00
Jesse Vincent
f5e8df4252 Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed) 2026-06-16 10:09:47 -07:00
Jesse Vincent
b5b3b5d99c Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead) 2026-06-16 10:09:47 -07:00
Jesse Vincent
30bbeefe89 Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first 2026-06-16 10:09:47 -07:00
Jesse Vincent
d3dd1ecc7d Record writing-plans micro-test result: resolved, no change needed 2026-06-16 10:09:47 -07:00
Jesse Vincent
b2872a4a66 Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges) 2026-06-16 10:09:47 -07:00
Jesse Vincent
e9b88d05c8 Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-16 10:09:47 -07:00
Jesse Vincent
4298eac856 Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-16 10:09:47 -07:00
Jesse Vincent
69a00350ff Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants 2026-06-16 10:09:47 -07:00
Jesse Vincent
d7a8c07fe3 Shared: unique review-package collateral names 2026-06-16 10:09:47 -07:00
Jesse Vincent
c30d822efe Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-16 10:09:47 -07:00
Jesse Vincent
68c9ddb870 Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-16 10:09:47 -07:00
Jesse Vincent
aa80399355 Spec: record iterations 2-3 results and final frozen-config matrix 2026-06-16 10:09:47 -07:00
Jesse Vincent
ee656563c9 Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-16 10:09:47 -07:00
Jesse Vincent
3280a32259 Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-16 10:09:47 -07:00
Jesse Vincent
84d033e967 Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-16 10:09:47 -07:00
Jesse Vincent
c73e9a9a3f Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-16 10:09:46 -07:00
Jesse Vincent
097ed5920f Spec: document cost iterations and the per-task review consolidation 2026-06-16 10:09:46 -07:00
Jesse Vincent
e08ad0660a Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance //⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-16 10:09:46 -07:00
Jesse Vincent
5e03007c85 Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-16 10:09:46 -07:00
Jesse Vincent
d55cdce32c Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-16 10:09:46 -07:00
Jesse Vincent
0974229418 Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-16 10:09:46 -07:00
Jesse Vincent
62b1682399 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-16 10:09:46 -07:00
Jesse Vincent
b42a232192 Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-16 10:09:46 -07:00