Compare commits

..

6 Commits

Author SHA1 Message Date
Jesse Vincent
08447d0d10 Bump evals submodule: escalation-aware planted-defect scenario (7dc0be7) 2026-06-11 15:39:04 -07:00
Jesse Vincent
c87f336cc0 E37: pre-flight plan review — surface plan conflicts as one batched question before Task 1 2026-06-11 15:31:01 -07:00
Jesse Vincent
a2a4190809 Spec: L2b tested — opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27 2026-06-11 14:53:31 -07:00
Jesse Vincent
8d354bb36b L2b: plan-mandated defects are findings the human adjudicates
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding,
reported as Important and labeled plan-mandated — the plan's authorship
does not grade its own work.

Controller rule (review loop): a plan-mandated finding, or any finding
conflicting with the plan's text, escalates to the human like any plan
contradiction — never dismissed because the plan mandates it.

E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the
tripwire 0/6 reports give the controller anything to escalate on (all
Approved, defect endorsed as spec-required); with it 6/6 report the
defect as a labeled finding.
2026-06-11 13:41:21 -07:00
Jesse Vincent
35464d67c0 E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract 2026-06-11 13:17:10 -07:00
Jesse Vincent
90b5433f59 E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis) 2026-06-11 13:17:10 -07:00
4 changed files with 62 additions and 5 deletions

View File

@@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit
escalation; they absorb implicit authority-vs-quality adjudication. escalation; they absorb implicit authority-vs-quality adjudication.
A possible L2b (discrete rule: "a reviewer finding that conflicts with A possible L2b (discrete rule: "a reviewer finding that conflicts with
the plan's text is the human's decision — escalate it") would route the the plan's text is the human's decision — escalate it") would route the
failing judgment through the escalation behavior that held; untested. failing judgment through the escalation behavior that held.
Original recon notes follow.
**L2b tested 2026-06-11 (E35/E36, evals
`docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the
opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer
tripwire (a plan-mandated defect IS a finding — Important, labeled
plan-mandated; the human decides) and a controller escalation rule
(plan-mandated findings go to the human like any plan contradiction).
Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings.
Full battery: opus controllers 2/2 internalized the rule, caught their
reviewer's miss as self-described backstop, and escalated for a
sanctioned fix (the 4241 ad-hoc behavior made structural); escalation
sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase
drops the tripwire from dispatches (2/5 transmitted), transmission
alone doesn't fire it live (read-once dilution across the reviewer's
tool reads; placement within the dispatch refuted as the variable),
and no sonnet controller showed backstop behavior; 1/5 shipped the
defect. The L2b rules are a candidate commit for the opus stack.
A future L2c for the sonnet rung would pair the SKILL.md
constraints-recipe (the one channel sonnet transmits verbatim) with a
mandatory output-format slot for plan-mandated findings (the skeleton
survives every observed paraphrase and is consulted at composition
time); untested. Original recon notes follow.
**Recon (superseded):** **Recon (superseded):**
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at Sonnet-controller runs (claude-sonnet coding-agent): all gates green at

2
evals

Submodule evals updated: af05326467...7dc0be79bb

View File

@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review
**Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration **Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
**Narration:** between tool calls, narrate at most one short line — the
ledger and the tool results carry the record.
**Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it. **Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
## When to Use ## When to Use
@@ -79,6 +82,20 @@ digraph process {
} }
``` ```
## Pre-Flight Plan Review
Before dispatching Task 1, scan the plan once for conflicts:
- tasks that contradict each other or the plan's Global Constraints
- anything the plan explicitly mandates that the review rubric treats as a
defect (a test that asserts nothing, verbatim duplication of a logic block)
Present everything you find to your human partner as one batched question —
each finding beside the plan text that mandates it, asking which governs —
before execution begins, not one interrupt per discovery mid-plan. If the
scan is clean, proceed without comment. The review loop remains the net for
conflicts that only emerge from implementation.
## Model Selection ## Model Selection
Use the least powerful model that can handle each role to conserve cost and increase speed. Use the least powerful model that can handle each role to conserve cost and increase speed.
@@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
**Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model. **Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
**Architecture and design tasks**: use the most capable available model. **Architecture and design tasks**: use the most capable available model.
The final whole-branch review is one of these — dispatch it on the most
capable available model, not the session default.
**Review tasks**: choose the model with the same judgment, scaled to the **Review tasks**: choose the model with the same judgment, scaled to the
diff's size, complexity, and risk. A small mechanical diff does not need the diff's size, complexity, and risk. A small mechanical diff does not need the
@@ -100,8 +119,10 @@ most expensive — which silently defeats this section.
**Turn count beats token price.** Wall-clock and context cost scale with how **Turn count beats token price.** Wall-clock and context cost scale with how
many turns a subagent takes, and the cheapest models routinely take 2-3× the many turns a subagent takes, and the cheapest models routinely take 2-3× the
turns on multi-step work — costing more overall. Use a mid-tier model as the turns on multi-step work — costing more overall. Use a mid-tier model as the
floor for implementers and reviewers; reserve the cheapest tier for floor for reviewers and for implementers working from prose descriptions.
single-file mechanical fixes. When the task's plan text contains the complete code to write, the
implementation is transcription plus testing: use the cheapest tier for
that implementer. Single-file mechanical fixes also take the cheapest tier.
**Task complexity signals (implementation tasks):** **Task complexity signals (implementation tasks):**
- Touches 1-2 files with a complete spec → cheap model - Touches 1-2 files with a complete spec → cheap model
@@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template:
findings in the progress ledger as you go, and point the final findings in the progress ledger as you go, and point the final
whole-branch review at that list so it can triage which must be fixed whole-branch review at that list so it can triage which must be fixed
before merge. A roll-up nobody reads is a silent discard. before merge. A roll-up nobody reads is a silent discard.
- A finding labeled plan-mandated — or any finding that conflicts with
what the plan's text requires — is the human's decision, like any plan
contradiction: present the finding and the plan text, ask which governs.
Do not dismiss the finding because the plan mandates it, and do not
dispatch a fix that contradicts the plan without asking.
- The final whole-branch review gets a package too: run - The final whole-branch review gets a package too: run
`scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the `scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
branch started from, e.g. `git merge-base main HEAD`) and include the branch started from, e.g. `git merge-base main HEAD`) and include the

View File

@@ -115,6 +115,11 @@ Subagent (general-purpose):
"yes." A tight report that cites lines gives the controller everything "yes." A tight report that cites lines gives the controller everything
it needs. it needs.
Your final message is the report itself: begin directly with the
spec-compliance verdict. Every line is a verdict, a finding with
file:line, or a check you ran — no preamble, no process narration,
no closing summary.
## Calibration ## Calibration
Categorize issues by actual severity. Not everything is Critical. Categorize issues by actual severity. Not everything is Critical.
@@ -123,6 +128,11 @@ Subagent (general-purpose):
would block a merge over — verbatim duplication of a logic block, would block a merge over — verbatim duplication of a logic block,
swallowed errors, tests that assert nothing. "Coverage could be broader" swallowed errors, tests that assert nothing. "Coverage could be broader"
and polish suggestions are Minor. and polish suggestions are Minor.
If the plan or brief explicitly mandates something this rubric calls a
defect (a test that asserts nothing, verbatim duplication of a logic
block), that IS a finding — report it as Important, labeled
plan-mandated. The plan's authorship does not grade its own work; the
human decides.
Acknowledge what was done well before listing issues — accurate praise Acknowledge what was done well before listing issues — accurate praise
helps the implementer trust the rest of the feedback. helps the implementer trust the rest of the feedback.