Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence

Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: model guidance
that turn count beats token price; controllers paste small diffs into
reviewer prompts (reviewers then need few or no tool calls); evidence
scoped to findings and would-be-bare-yes checks; Important defined as
cannot-trust-until-fixed, with coverage suggestions classed as Minor;
fixes dispatched only for Critical/Important.
Jesse Vincent
2026-06-09 22:42:54 -07:00
parent 853396e3ae
commit 3e3e1e701e
3 changed files with 34 additions and 2 deletions


@@ -104,6 +104,12 @@ most capable model; a subtle concurrency change does.
omitted model inherits your session's model — often the most capable and
most expensive — which silently defeats this section.
**Turn count beats token price.** Wall-clock and context cost scale with how
many turns a subagent takes, and the cheapest models routinely take 2-3× the
turns on multi-step work — costing more overall. Use a mid-tier model as the
floor for implementers and reviewers; reserve the cheapest tier for
single-file mechanical fixes.
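
As a rough illustration of why turn count can outweigh token price, the sketch below assumes each turn re-sends the whole conversation so far, so input tokens grow faster than linearly with turns. Every number here (prices, turn counts, context sizes) is hypothetical and chosen only to show the shape of the trade-off; none of it is measured from the eval.

```python
def dispatch_cost(turns, price_per_mtok, base_context=20_000, growth_per_turn=3_000):
    """Estimate the input-token cost of one subagent dispatch.

    Assumes every turn re-sends the full context, and that the context
    grows by a fixed amount per turn. All parameters are hypothetical.
    """
    total_tokens = sum(base_context + growth_per_turn * t for t in range(turns))
    cost = total_tokens / 1_000_000 * price_per_mtok
    return total_tokens, cost

# Hypothetical mid-tier model: 10 turns at 3.0 per million tokens.
mid_tokens, mid_cost = dispatch_cost(turns=10, price_per_mtok=3.0)

# Hypothetical cheapest model: one third the price, but 3x the turns.
cheap_tokens, cheap_cost = dispatch_cost(turns=30, price_per_mtok=1.0)

print(f"mid-tier: {mid_tokens:,} tokens, cost {mid_cost:.3f}")
print(f"cheapest: {cheap_tokens:,} tokens, cost {cheap_cost:.3f}")
```

Under these assumptions the cheaper model processes 1,905,000 input tokens against 335,000 for the mid-tier model, so it ends up costing more despite the lower per-token price. Real ratios depend on actual context growth and pricing.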
**Task complexity signals (implementation tasks):**
- Touches 1-2 files with a complete spec → cheap model
- Touches multiple files with integration concerns → standard model
@@ -154,6 +160,11 @@ final whole-branch review. When you fill a reviewer template:
- Include the spec/design's global constraints that bind the task (version
floors, naming and copy rules, platform requirements) in the requirements
you paste — a reviewer can only enforce what you hand them.
- Paste the task's diff (`git diff BASE..HEAD` output) into the reviewer
prompt when it fits comfortably (up to a few hundred lines). A reviewer
with the diff in hand needs few or no tool calls.
- Dispatch fix subagents for Critical and Important findings. Record Minor
findings and move on — they roll up to the final whole-branch review.
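
The dispatch rule in the last bullet can be sketched as a simple severity split. This is a minimal illustration, not the controller's actual logic: the `Finding` type, the `triage` helper, and the sample findings are all hypothetical.

```python
from dataclasses import dataclass

# Severities that trigger a fix subagent, per the rule above.
FIX_NOW = {"Critical", "Important"}

@dataclass
class Finding:
    severity: str  # "Critical", "Important", or "Minor"
    location: str  # file:line evidence cited by the reviewer
    summary: str

def triage(findings):
    """Split findings into ones to fix now and ones to defer."""
    to_fix = [f for f in findings if f.severity in FIX_NOW]
    deferred = [f for f in findings if f.severity not in FIX_NOW]
    return to_fix, deferred

# Hypothetical reviewer output for one task.
findings = [
    Finding("Important", "src/cache.py:88", "stale entry served after invalidation"),
    Finding("Minor", "tests/test_cache.py:12", "coverage could be broader"),
    Finding("Critical", "src/cache.py:41", "lock released before write completes"),
]
to_fix, deferred = triage(findings)

for f in to_fix:
    print(f"dispatch fix: [{f.severity}] {f.location} {f.summary}")
for f in deferred:
    print(f"record for final review: [{f.severity}] {f.location} {f.summary}")
```

Deferred findings are only recorded here; in the workflow described above they would be carried into the final whole-branch review.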
## Prompt Templates


@@ -32,6 +32,14 @@ Subagent (general-purpose):
git diff [BASE_SHA]..[HEAD_SHA]
```
## Diff
[DIFF]
If the diff is provided above, review from it directly — do not re-run
the git commands or re-read the files it already shows. Fetch anything
further only for a named concrete risk.
## Read-Only Review
Your review is read-only on this checkout. Do not mutate the working tree,
@@ -84,12 +92,15 @@ Subagent (general-purpose):
significantly grow existing files? (Don't flag pre-existing file
sizes — focus on what this change contributed.)
-Answer each item above with file:line evidence, not a bare yes or no.
-An unsupported "yes" is not a review.
+Cite file:line evidence for every finding and for any check you would
+otherwise answer with a bare "yes." Cite, don't narrate — a tight report
+that points at lines beats a long one that retells the diff.
## Calibration
Categorize issues by actual severity. Not everything is Critical.
Important means this task cannot be trusted until it is fixed;
"coverage could be broader" and polish suggestions are Minor.
Acknowledge what was done well before listing issues — accurate praise
helps the implementer trust the rest of the feedback.
@@ -127,5 +138,8 @@ Subagent (general-purpose):
- `[TASK_TEXT]` — the task's requirements text or plan reference, for context
- `[BASE_SHA]` — commit before this task
- `[HEAD_SHA]` — current commit
- `[DIFF]` — paste `git diff BASE..HEAD` output when it fits comfortably
(up to a few hundred lines); otherwise replace with "(not provided — run
the git commands above)"
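
Filling the `[DIFF]` placeholder might look like the sketch below. It is an assumption-laden illustration: the `fill_diff_placeholder` helper and the 300-line cutoff are hypothetical stand-ins for "a few hundred lines", and only the fallback string is taken from the placeholder description above.

```python
MAX_INLINE_LINES = 300  # hypothetical cutoff for "a few hundred lines"
FALLBACK = "(not provided — run the git commands above)"

def fill_diff_placeholder(template, diff_text, max_lines=MAX_INLINE_LINES):
    """Substitute [DIFF] with the diff when it is small, else the fallback."""
    line_count = len(diff_text.splitlines())
    fits = bool(diff_text.strip()) and line_count <= max_lines
    replacement = diff_text.rstrip("\n") if fits else FALLBACK
    return template.replace("[DIFF]", replacement)

template = "## Diff\n[DIFF]\n"

# A small diff is pasted inline.
small_diff = "--- a/app.py\n+++ b/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
small_prompt = fill_diff_placeholder(template, small_diff)

# An oversized diff falls back to the instruction to run git.
large_diff = "\n".join(f"+line {i}" for i in range(MAX_INLINE_LINES + 1))
large_prompt = fill_diff_placeholder(template, large_diff)

print(small_prompt)
print(large_prompt)
```

In practice `diff_text` would come from the `git diff` command the template already names; it is passed in as a string here so the sketch runs without a repository.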
**Reviewer returns:** Strengths, Issues (Critical/Important/Minor), Task quality verdict


@@ -30,6 +30,13 @@ Subagent (general-purpose):
Only read files in this diff. Do not crawl the broader codebase.
## Diff
[DIFF]
If the diff is provided above, review from it directly — do not re-run
the git commands or re-read the files it already shows.
Spec compliance is judged by reading the diff against the requirements.
The implementer already ran the tests and reported TDD evidence — do not
re-run them. If a requirement cannot be verified from this diff alone