mirror of
https://github.com/obra/superpowers.git
synced 2026-06-16 23:59:05 +08:00
Compare commits
6 Commits
ec014e7a7f
...
420c234a2c
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
420c234a2c | ||
|
|
d1fcc9889a | ||
|
|
ac11700642 | ||
|
|
710f031ad0 | ||
|
|
72cb21b82c | ||
|
|
de1d35e5e7 |
@@ -65,13 +65,21 @@ fewer, better-sized tasks, SDD still runs one fresh subagent per task.
|
||||
|
||||
### L1 — Plan-side crispness (writing-plans changes; est. −$1.5-3/run, plus variance reduction)
|
||||
|
||||
**Status 2026-06-11: validated in effect.** A hand-crisped fractals plan
|
||||
(10 → 7 tasks, `## Global Constraints` header, per-task `Interfaces:`
|
||||
lines — scenario `sdd-go-fractals-crisp`) ran 3/3 green at $9.51-12.65
|
||||
(mean $11.60 vs combo band $11.67-14.84), 20-24 dispatches vs 28, fix
|
||||
waves flat. What remains is elicitation: getting writing-plans guidance
|
||||
to *produce* such plans (micro-test per the doctrine, then the follow-up
|
||||
PR). See the experiments log, Batch A-E.
|
||||
**Status 2026-06-11 (final): elicitation tested end-to-end; claims
|
||||
re-attributed.** Micro-tests: constraints header and Interfaces blocks
|
||||
elicit deterministically (0→5/5, 0→100% of tasks, exact values);
|
||||
right-sizing is modest and scale-dependent (9.4→8.4 tasks at svelte
|
||||
scale, nothing to move at fractals scale). Full runs: an elicited plan
|
||||
executed at $6.34/$8.49 — but the no-guidance control (opus plan,
|
||||
complete code) hit $7.59/$7.73, inside that range. **The cost win
|
||||
belongs to opus-written complete-code plans; the hand-written prose
|
||||
fixture plans all prior numbers used are unrepresentative and ~2×
|
||||
costlier to execute.** The guidance owns fidelity and variance instead:
|
||||
deterministic constraints propagation (the one elicited-run fix was a
|
||||
version-floor catch), exact cross-task interfaces, fix waves 1 vs 2-4
|
||||
(the control plan shipped a real Sierpinski bug both runs had to fix).
|
||||
The writing-plans PR claims those grounds, not dollars. Draft at
|
||||
/tmp/sdd-exp/writing-plans-l1 (branch writing-plans-crisp).
|
||||
|
||||
The plan is upstream of every cost: task count sets dispatch count; plan
|
||||
ambiguity sets review-loop count; plan completeness sets implementer
|
||||
@@ -110,7 +118,25 @@ target economics and ambiguity, not placeholder hygiene.
|
||||
|
||||
### L2 — Controller tier (est. −$4-5/run; the biggest single lever, gated hardest)
|
||||
|
||||
**Status 2026-06-11: recon positive (n=2), gates still owed.**
|
||||
**Status 2026-06-11 (final): DIED AT THE GATES, as pre-registered — with
|
||||
useful anatomy.** Recon was positive ($6.68/$8.05, n=2, mechanics clean).
|
||||
The full battery split the judgment surface: the new
|
||||
`sdd-escalates-broken-plan` scenario (explicit plan self-contradiction;
|
||||
the human never volunteers it) passed **5/5 at sonnet** ($1.02-1.37/run;
|
||||
opus baseline 2/2) — explicit conflicts get escalated. But the
|
||||
planted-defect battery failed decisively: under a sonnet controller the
|
||||
per-task quality gate collapsed into plan-compliance advocacy ("no
|
||||
assertion, as required" listed under Strengths), the defect shipped in
|
||||
4/5 runs (deterministic check), and only the tier-pinned opus final
|
||||
reviewer ever caught it — while the same sonnet-tier reviewers under an
|
||||
opus controller flagged it 5/5. Cheap controllers handle explicit
|
||||
escalation; they absorb implicit authority-vs-quality adjudication.
|
||||
A possible L2b (discrete rule: "a reviewer finding that conflicts with
|
||||
the plan's text is the human's decision — escalate it") would route the
|
||||
failing judgment through the escalation behavior that held; untested.
|
||||
Original recon notes follow.
|
||||
|
||||
**Recon (superseded):**
|
||||
Sonnet-controller runs (claude-sonnet coding-agent): all gates green at
|
||||
**$6.68 and $8.05** / 31-41 min (combo band $11.67-14.84), tokens inside
|
||||
the combo band — no cheap-controller turn inflation. 26/26 and 31/31
|
||||
|
||||
2
evals
2
evals
Submodule evals updated: ac264b104c...af05326467
@@ -150,9 +150,13 @@ final whole-branch review. When you fill a reviewer template:
|
||||
loop. If the prompt you are writing contains "do not flag," "don't treat X
|
||||
as a defect," "at most Minor," or "the plan chose" — stop: you are
|
||||
pre-judging, usually to spare yourself a review loop.
|
||||
- Include the spec/design's global constraints that bind the task (version
|
||||
floors, naming and copy rules, platform requirements) in the requirements
|
||||
you paste — a reviewer can only enforce what you hand them.
|
||||
- The global-constraints block you hand the reviewer is its attention
|
||||
lens. Copy the binding requirements verbatim from the plan's Global
|
||||
Constraints section or the spec: exact values, exact formats, and the
|
||||
stated relationships between components ("same layout as X", "matches
|
||||
Y"). The reviewer's template already carries the process rules (YAGNI,
|
||||
test hygiene, review method) — the constraints block is for what THIS
|
||||
project's spec demands.
|
||||
- Hand the reviewer its diff as a file: run this skill's
|
||||
`scripts/review-package BASE HEAD` and pass the reviewer the file path
|
||||
it prints (or, without bash: `git log --oneline`, `git diff --stat`,
|
||||
|
||||
@@ -159,8 +159,10 @@ Subagent (general-purpose):
|
||||
- `[MODEL]` — REQUIRED: reviewer model per SKILL.md Model Selection
|
||||
- `[BRIEF_FILE]` — REQUIRED: the task brief file (`scripts/task-brief PLAN N`
|
||||
prints the path; same file the implementer worked from)
|
||||
- `[GLOBAL_CONSTRAINTS]` — the spec/design's global constraints that bind
|
||||
this task (version floors, naming and copy rules, platform requirements)
|
||||
- `[GLOBAL_CONSTRAINTS]` — the binding requirements copied verbatim from
|
||||
the plan's Global Constraints section or the spec: exact values, formats,
|
||||
and stated relationships between components (not process rules — those
|
||||
are already in this template)
|
||||
- `[REPORT_FILE]` — REQUIRED: the file the implementer wrote its detailed
|
||||
report to
|
||||
- `[BASE_SHA]` — commit before this task
|
||||
|
||||
@@ -33,6 +33,15 @@ Before defining tasks, map out which files will be created or modified and what
|
||||
|
||||
This structure informs the task decomposition. Each task should produce self-contained changes that make sense independently.
|
||||
|
||||
## Task Right-Sizing
|
||||
|
||||
A task is the smallest unit that carries its own test cycle and is worth a
|
||||
fresh reviewer's gate. When drawing task boundaries: fold setup,
|
||||
configuration, scaffolding, and documentation steps into the task whose
|
||||
deliverable needs them; split only where a reviewer could meaningfully
|
||||
reject one task while approving its neighbor. Each task ends with an
|
||||
independently testable deliverable.
|
||||
|
||||
## Bite-Sized Task Granularity
|
||||
|
||||
**Each step is one action (2-5 minutes):**
|
||||
@@ -57,6 +66,13 @@ This structure informs the task decomposition. Each task should produce self-con
|
||||
|
||||
**Tech Stack:** [Key technologies/libraries]
|
||||
|
||||
## Global Constraints
|
||||
|
||||
[The spec's project-wide requirements — version floors, dependency limits,
|
||||
naming and copy rules, platform requirements — one line each, with exact
|
||||
values copied verbatim from the spec. Every task's requirements implicitly
|
||||
include this section.]
|
||||
|
||||
---
|
||||
```
|
||||
|
||||
@@ -70,6 +86,12 @@ This structure informs the task decomposition. Each task should produce self-con
|
||||
- Modify: `exact/path/to/existing.py:123-145`
|
||||
- Test: `tests/exact/path/to/test.py`
|
||||
|
||||
**Interfaces:**
|
||||
- Consumes: [what this task uses from earlier tasks — exact signatures]
|
||||
- Produces: [what later tasks rely on — exact function names, parameter
|
||||
and return types. A task's implementer sees only their own task; this
|
||||
block is how they learn the names and types neighboring tasks use.]
|
||||
|
||||
- [ ] **Step 1: Write the failing test**
|
||||
|
||||
```python
|
||||
|
||||
Reference in New Issue
Block a user