mirror of
https://github.com/obra/superpowers.git
synced 2026-06-12 21:59:04 +08:00
Compare commits
6 Commits
sdd-review
...
sdd-l2b-pl
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
08447d0d10 | ||
|
|
c87f336cc0 | ||
|
|
a2a4190809 | ||
|
|
8d354bb36b | ||
|
|
35464d67c0 | ||
|
|
90b5433f59 |
@@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit
|
|||||||
escalation; they absorb implicit authority-vs-quality adjudication.
|
escalation; they absorb implicit authority-vs-quality adjudication.
|
||||||
A possible L2b (discrete rule: "a reviewer finding that conflicts with
|
A possible L2b (discrete rule: "a reviewer finding that conflicts with
|
||||||
the plan's text is the human's decision — escalate it") would route the
|
the plan's text is the human's decision — escalate it") would route the
|
||||||
failing judgment through the escalation behavior that held; untested.
|
failing judgment through the escalation behavior that held.
|
||||||
Original recon notes follow.
|
|
||||||
|
**L2b tested 2026-06-11 (E35/E36, evals
|
||||||
|
`docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the
|
||||||
|
opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer
|
||||||
|
tripwire (a plan-mandated defect IS a finding — Important, labeled
|
||||||
|
plan-mandated; the human decides) and a controller escalation rule
|
||||||
|
(plan-mandated findings go to the human like any plan contradiction).
|
||||||
|
Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings.
|
||||||
|
Full battery: opus controllers 2/2 internalized the rule, caught their
|
||||||
|
reviewer's miss as self-described backstop, and escalated for a
|
||||||
|
sanctioned fix (the 4241 ad-hoc behavior made structural); escalation
|
||||||
|
sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase
|
||||||
|
drops the tripwire from dispatches (2/5 transmitted), transmission
|
||||||
|
alone doesn't fire it live (read-once dilution across the reviewer's
|
||||||
|
tool reads; placement within the dispatch refuted as the variable),
|
||||||
|
and no sonnet controller showed backstop behavior; 1/5 shipped the
|
||||||
|
defect. The L2b rules are a candidate commit for the opus stack.
|
||||||
|
A future L2c for the sonnet rung would pair the SKILL.md
|
||||||
|
constraints-recipe (the one channel sonnet transmits verbatim) with a
|
||||||
|
mandatory output-format slot for plan-mandated findings (the skeleton
|
||||||
|
survives every observed paraphrase and is consulted at composition
|
||||||
|
time); untested. Original recon notes follow.
|
||||||
|
|
||||||
**Recon (superseded):**
|
**Recon (superseded):**
|
||||||
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
|
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
|
||||||
|
|||||||
2
evals
2
evals
Submodule evals updated: af05326467...7dc0be79bb
@@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review
|
|||||||
|
|
||||||
**Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
|
**Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration
|
||||||
|
|
||||||
|
**Narration:** between tool calls, narrate at most one short line — the
|
||||||
|
ledger and the tool results carry the record.
|
||||||
|
|
||||||
**Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
|
**Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it.
|
||||||
|
|
||||||
## When to Use
|
## When to Use
|
||||||
@@ -79,6 +82,20 @@ digraph process {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Pre-Flight Plan Review
|
||||||
|
|
||||||
|
Before dispatching Task 1, scan the plan once for conflicts:
|
||||||
|
|
||||||
|
- tasks that contradict each other or the plan's Global Constraints
|
||||||
|
- anything the plan explicitly mandates that the review rubric treats as a
|
||||||
|
defect (a test that asserts nothing, verbatim duplication of a logic block)
|
||||||
|
|
||||||
|
Present everything you find to your human partner as one batched question —
|
||||||
|
each finding beside the plan text that mandates it, asking which governs —
|
||||||
|
before execution begins, not one interrupt per discovery mid-plan. If the
|
||||||
|
scan is clean, proceed without comment. The review loop remains the net for
|
||||||
|
conflicts that only emerge from implementation.
|
||||||
|
|
||||||
## Model Selection
|
## Model Selection
|
||||||
|
|
||||||
Use the least powerful model that can handle each role to conserve cost and increase speed.
|
Use the least powerful model that can handle each role to conserve cost and increase speed.
|
||||||
@@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr
|
|||||||
**Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
|
**Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model.
|
||||||
|
|
||||||
**Architecture and design tasks**: use the most capable available model.
|
**Architecture and design tasks**: use the most capable available model.
|
||||||
|
The final whole-branch review is one of these — dispatch it on the most
|
||||||
|
capable available model, not the session default.
|
||||||
|
|
||||||
**Review tasks**: choose the model with the same judgment, scaled to the
|
**Review tasks**: choose the model with the same judgment, scaled to the
|
||||||
diff's size, complexity, and risk. A small mechanical diff does not need the
|
diff's size, complexity, and risk. A small mechanical diff does not need the
|
||||||
@@ -100,8 +119,10 @@ most expensive — which silently defeats this section.
|
|||||||
**Turn count beats token price.** Wall-clock and context cost scale with how
|
**Turn count beats token price.** Wall-clock and context cost scale with how
|
||||||
many turns a subagent takes, and the cheapest models routinely take 2-3× the
|
many turns a subagent takes, and the cheapest models routinely take 2-3× the
|
||||||
turns on multi-step work — costing more overall. Use a mid-tier model as the
|
turns on multi-step work — costing more overall. Use a mid-tier model as the
|
||||||
floor for implementers and reviewers; reserve the cheapest tier for
|
floor for reviewers and for implementers working from prose descriptions.
|
||||||
single-file mechanical fixes.
|
When the task's plan text contains the complete code to write, the
|
||||||
|
implementation is transcription plus testing: use the cheapest tier for
|
||||||
|
that implementer. Single-file mechanical fixes also take the cheapest tier.
|
||||||
|
|
||||||
**Task complexity signals (implementation tasks):**
|
**Task complexity signals (implementation tasks):**
|
||||||
- Touches 1-2 files with a complete spec → cheap model
|
- Touches 1-2 files with a complete spec → cheap model
|
||||||
@@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template:
|
|||||||
findings in the progress ledger as you go, and point the final
|
findings in the progress ledger as you go, and point the final
|
||||||
whole-branch review at that list so it can triage which must be fixed
|
whole-branch review at that list so it can triage which must be fixed
|
||||||
before merge. A roll-up nobody reads is a silent discard.
|
before merge. A roll-up nobody reads is a silent discard.
|
||||||
|
- A finding labeled plan-mandated — or any finding that conflicts with
|
||||||
|
what the plan's text requires — is the human's decision, like any plan
|
||||||
|
contradiction: present the finding and the plan text, ask which governs.
|
||||||
|
Do not dismiss the finding because the plan mandates it, and do not
|
||||||
|
dispatch a fix that contradicts the plan without asking.
|
||||||
- The final whole-branch review gets a package too: run
|
- The final whole-branch review gets a package too: run
|
||||||
`scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
|
`scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the
|
||||||
branch started from, e.g. `git merge-base main HEAD`) and include the
|
branch started from, e.g. `git merge-base main HEAD`) and include the
|
||||||
|
|||||||
@@ -115,6 +115,11 @@ Subagent (general-purpose):
|
|||||||
"yes." A tight report that cites lines gives the controller everything
|
"yes." A tight report that cites lines gives the controller everything
|
||||||
it needs.
|
it needs.
|
||||||
|
|
||||||
|
Your final message is the report itself: begin directly with the
|
||||||
|
spec-compliance verdict. Every line is a verdict, a finding with
|
||||||
|
file:line, or a check you ran — no preamble, no process narration,
|
||||||
|
no closing summary.
|
||||||
|
|
||||||
## Calibration
|
## Calibration
|
||||||
|
|
||||||
Categorize issues by actual severity. Not everything is Critical.
|
Categorize issues by actual severity. Not everything is Critical.
|
||||||
@@ -123,6 +128,11 @@ Subagent (general-purpose):
|
|||||||
would block a merge over — verbatim duplication of a logic block,
|
would block a merge over — verbatim duplication of a logic block,
|
||||||
swallowed errors, tests that assert nothing. "Coverage could be broader"
|
swallowed errors, tests that assert nothing. "Coverage could be broader"
|
||||||
and polish suggestions are Minor.
|
and polish suggestions are Minor.
|
||||||
|
If the plan or brief explicitly mandates something this rubric calls a
|
||||||
|
defect (a test that asserts nothing, verbatim duplication of a logic
|
||||||
|
block), that IS a finding — report it as Important, labeled
|
||||||
|
plan-mandated. The plan's authorship does not grade its own work; the
|
||||||
|
human decides.
|
||||||
Acknowledge what was done well before listing issues — accurate praise
|
Acknowledge what was done well before listing issues — accurate praise
|
||||||
helps the implementer trust the rest of the feedback.
|
helps the implementer trust the rest of the feedback.
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user