Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.
Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
(evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
implementer self-review and quality review)
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
Add a mandatory self-identification disclosure (model, harness, harness
version, all installed plugins) to the PR template and all three issue
templates, and document the requirement in the contributor guidelines.
We weigh contributions differently depending on what produced them:
content reasoned from documentation is held to a different bar than work
grounded in a real session.
Also state explicitly, in both CLAUDE.md and the PR template, that all
PRs must target the dev branch rather than main.
Adds `run-all --scenarios` for resuming a scenario subset across the Code
Assist rate-limit windows. Follows the agy rate-limit fix (79f9963).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Serialize antigravity against the Gemini Code Assist rate limit
(max_concurrency=1), diagnose 429/RESOURCE_EXHAUSTED honestly instead of as
auth, fail-fast on a latched window, and tolerant preflight OK match.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a conditional TDD Evidence field to the implementer report format so controllers can verify RED and GREEN output when TDD was required.
The field asks for the command run, relevant RED/GREEN output, and the expected RED failure reason rather than raw full logs.
Fixes#994.
Rewrite the Windows polyglot hook documentation to match the current run-hook.cmd dispatcher and update the porting guide cross-reference.\n\nFixes #1653.
Per obra's guidance on #1609: remove the github-specific instruction rather
than replacing it with a platform-detection table. Agents already know their
forge tooling; the skill only needs to cover the push step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces hardcoded `gh pr create` in Option 2 with a platform-neutral
note: check `git remote get-url origin` first, then use gh (GitHub),
glab (GitLab), or fall back to the compare URL for unknown platforms.
Adds matching Red Flag entry so agents don't skip the detection step.
Fixes#1609
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Antigravity (Google's `agy` CLI) installs the existing Superpowers plugin
directly:
agy plugin install https://github.com/obra/superpowers
agy imports the bundled skills and runs the plugin's SessionStart hook, so
using-superpowers bootstraps from the first message — verified on agy 1.0.3:
a fresh session given "Let's make a react todo list" auto-triggers the
brainstorming skill instead of writing code. agy discovers skills natively
and, having no Skill tool, loads them by reading SKILL.md with view_file.
No scaffold, installer, or generated context file is needed. This adds only:
- README.md: an Antigravity install section + Quickstart link
- skills/using-superpowers/SKILL.md: reference to the agy tool mapping
- skills/using-superpowers/references/antigravity-tools.md: action->tool
mapping for agy (view_file, write_to_file, invoke_subagent, manage_task,
and skill loading via view_file on SKILL.md)
- tests/antigravity/: structural test for the tool mapping, mirroring
tests/pi/
An evergreen guide for adding support for a new harness (IDE, CLI, or agent
runner). Teaches the invariants — automatic session-start bootstrap, skill
discovery/invocation, tool mapping, the acceptance test — and points at the
closest reference integration shape (shell-hook, in-process plugin,
instructions-file / declared context file) to copy. Covers discovery, build,
local install, tmux-driven verification, distribution, and PR submission, with a
live reference-integration index and a gotchas appendix.
Two non-negotiable rules: (1) never edit skill bodies; (2) everything ships
through the harness's own install mechanism — never edit the user's config. When
a plugin installer strips undeclared files, declare the bootstrap as a recognized
component (a manifest contextFileName-style context file the installer preserves
and the harness loads every session), generated at install time from the live
SKILL.md + tool mapping. Surfaced-skill-description bootstrap is the softer
fallback.
Hardened against real end-to-end ports (Antigravity CLI): shapes can compose; a
fork doesn't inherit its parent's behavior; a hook system != a usable
session-start event; verify @-includes AND context-file preservation with a
marker; web-search the docs and study existing plugins; reverse-engineer
undocumented harnesses; print/headless modes may hang; workspace-trust gates
stall tmux; declared context files survive plugin install while undeclared files
are stripped; skills-path registration is per-harness.
The .pi/ directory holds the pi-harness extension (.pi/extensions/superpowers.ts),
which is tracked (not git-ignored), so the git-ignored-path exclusion helpers
never caught it. It was also missing from the static EXCLUDES list alongside the
other harness dotdirs (.opencode, .cursor-plugin, .claude-plugin), so a sync
would rsync pi's files into the Codex plugin distribution. Add /.pi/ to EXCLUDES.
Stock Windows 10/11 ships C:\Windows\System32\bash.exe (the WSL
launcher) as the first match for `where bash`. WSL's bash cannot
execute Windows-style script paths, so when Git Bash is installed
outside the two standard system locations -- specifically the
per-user "Only for me" Git for Windows installer
(%LOCALAPPDATA%\Programs\Git) or a Scoop install
(%USERPROFILE%\scoop\apps\git\current\usr\bin) -- run-hook.cmd
silently fails: WSL prints "Windows Subsystem for Linux must be
updated", the script returns 0, and Superpowers' SessionStart
bootstrap is never injected. From the user's perspective skills
auto-trigger inconsistently or not at all, with no surfaced error.
Add explicit probes for both locations between the existing system-
wide Git for Windows checks and the `where bash` fallback. Also add
a comment to the fallback documenting the WSL-launcher trap so future
maintainers understand why the explicit probes must come first.
Verified on a Windows 11 VM (dockur/windows 11, Git Bash 2.x, Node
22):
- System Git present: existing probe still matches (no regression)
- System Git absent, per-user Git present via junction: new probe
matches, hook produces valid 6422-byte JSON, exit 0
- All Git probes absent: confirmed WSL trap fires
("Windows Subsystem for Linux must be updated") and the hook exits 0
silently, demonstrating the original bug
Existing tests/hooks/test-session-start.sh still passes on macOS (7/7).
Reported by @ytchenak in #1607.
Co-authored-by: ytchenak <ytchenak@users.noreply.github.com>
Closes#1607.
On Windows + Git Bash, the SessionStart hook prints a confusing
diagnostic at every startup ("printf: write error: Permission denied")
when Claude Code closes the hook's stdout pipe before the printf has
finished writing. The hook still runs to completion and context still
gets injected, but the diagnostic surfaces every session because
Git Bash's printf reports EPIPE as "Permission denied" (not "Broken
pipe" like Linux) and our `set -euo pipefail` lets that error escape.
Piping each printf through `cat` makes the external cat process the
recipient of any SIGPIPE / EPIPE. cat's failure does not propagate to
the parent bash under pipefail because cat is the last command in the
pipeline and exits cleanly when the pipe stays open long enough to
hold the data. On macOS/Linux the cat passthrough is transparent (no
behavior change, no measurable cost).
Verified:
- Existing tests/hooks/test-session-start.sh: 7/7 pass on macOS
- Manual run on Windows 11 + Git Bash 5.2 + Node 22 produces valid JSON,
clean stderr, and exit 0
- JSON output is byte-identical to the unpatched hook
Reported by @silvertakana in #1612, attribution preserved in the
Co-authored-by trailer below — this is the same fix shape the original
PR proposed.
Co-authored-by: silvertakana <silvertakana@users.noreply.github.com>
Closes#1612.
The "Signals You're Doing It Wrong" bullet in systematic-debugging/SKILL.md
contains the literal token Claude Code's runtime scans for in tool result
bodies. Every Skill-tool invocation of this skill caused the harness to
inject a spurious system-reminder claiming the user requested deeper
reasoning, silently bumping every session into extended thinking.
Replace the bullet's spelling so the contiguous letter sequence the scanner
matches is broken with a hyphen. The signal text remains recognizable to
the agent and the documented action ("Question fundamentals, not just
symptoms") is unchanged.
Fixesobra/superpowers#1283