diff --git a/docs/superpowers/specs/2026-06-10-positive-instruction-redesign-design.md b/docs/superpowers/specs/2026-06-10-positive-instruction-redesign-design.md
new file mode 100644
index 00000000..a19c938c
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-10-positive-instruction-redesign-design.md
@@ -0,0 +1,164 @@
+# Positive-Instruction Redesign of Skill Guidance — Design Spec
+
+**Status:** Proposed (follow-up to the 2026-06-09 SDD review-dispatch work; separate PR per the one-problem-per-PR rule)
+**Driver:** Measured evidence (2026-06-10) that some negative instructions in skill prose backfire, while others work — and that the difference is predictable.
+
+## The measured finding this spec generalizes
+
+Micro-tests on 2026-06-10 (opus, 5 reps per phrasing, programmatic scoring;
+harness described below) measured how guidance phrasing changes what a
+controller composes:
+
+| Case | Phrasing | Result |
+|---|---|---|
+| Dispatch composition ("don't restate the brief") | prohibition | **4.4** spec values re-typed — *worse than no guidance* (3.6) |
+| Dispatch composition | positive recipe ("your dispatch should contain: (1)…(5)") | **3.0, zero variance** — adopted |
+| Dispatch composition | recipe + nuance clause ("quote only the fragment…") | 3.8, noisy — nuance dilutes recipes |
+| Test-rerun directive ("do not ask reviewer to re-run tests") | prohibition | **0/5 violations** — works fine (control: 3/5) |
+| Test-rerun directive | positive recipe | 0/5 — equal, but longer |
+
+**The doctrine** (use this to classify any negative instruction):
+
+1. **Tripwires work.** Phrase-level self-checks on concrete tokens ("if the
+   prompt you are writing contains 'do not flag' … stop") fire reliably.
+2. **Recognition tables work.** Red-Flags/rationalization tables read at
+   decision time, not composition time.
+3. **Discrete-directive prohibitions work.** "Do not ask X to do Y" holds
+   when the model has no competing incentive to do Y.
+4. **Composition prohibitions backfire** when the model has its own agenda
+   for the output (e.g., restating specs feels like helpful curation).
+   Only a positive composition recipe moves these — and adding nuance
+   clauses to a winning recipe makes it worse, not better.
+5. **Ties go to the shorter phrasing.** Codex re-reads SKILL.md ~500× per
+   long session (measured 2026-06-10); prose length is a real cost.
+
+## Audit results (2026-06-10, all ~30 skills + prompt templates)
+
+Counts: 3 tripwires (keep), 14 recognition tables (keep), ~20 policy gates
+(keep — "never push without permission" is policy, not composition
+shaping), 5 composition-prohibitions:
+
+| # | Location | Disposition |
+|---|---|---|
+| 1 | `subagent-driven-development/task-reviewer-prompt.md` — "Cite, don't narrate" | **Queued in PR #1717 batch**: lead with the positive half ("Your report should point at evidence: file:line for every finding…"), drop the prohibition half (dead weight — the positive half already exists and carries the load) |
+| 2 | `subagent-driven-development/SKILL.md` — "Do not add open-ended directives" | **Keep as-is**: micro-test could not elicit the failure in 15 samples; no evidence either way; shorter wins |
+| 3 | `subagent-driven-development/SKILL.md` — "Do not ask a reviewer to re-run tests" | **Keep as-is**: measured 0/5 violations; the prohibition also usefully propagates itself into dispatches |
+| 4 | `subagent-driven-development/SKILL.md` — "do not re-review on top of it" | **Queued in PR #1717 batch**: replace with the three-element checklist ("Before re-dispatching the reviewer, confirm the fix report contains: the covering tests, the command run, and the output") |
+| 5 | `writing-plans/SKILL.md` — the "No Placeholders" banned-patterns list | **This spec's main subject** — see below |
+
+Borderline, deferred with #5: `task-reviewer-prompt.md` "Don't flag
+pre-existing file sizes — focus on what this change contributed" (positive
+half present and load-bearing; low impact; test alongside #5 if convenient).
+
+## The writing-plans change (deferred item #5)
+
+### Current state
+
+`skills/writing-plans/SKILL.md`, "No Placeholders": one positive sentence
+("Every step must contain the actual content an engineer needs") followed
+by a six-bullet banned-patterns list ("never write them: 'TBD', 'TODO',
+'Add appropriate error handling', 'Write tests for the above', 'Similar to
+Task N', …").
+
+### Why it matters and why it is genuinely uncertain
+
+- Plans are the **largest generated artifact** in the workflow, and the
+  model has a real competing incentive to emit placeholders (they are the
+  path of least effort under length pressure) — the incentive structure of
+  the case where prohibition measurably backfired.
+- But the banned items are **discrete, recognizable tokens** — the shape
+  of the case where prohibition measurably held.
+- **The list is load-bearing elsewhere:** the skill's Self-Review section
+  references it ("Placeholder scan: search your plan for red flags — any
+  of the patterns from the 'No Placeholders' section above"). The tokens
+  double as the review-time scan inventory, and review-time recognition is
+  the category that works. A naive swap to a positive checklist breaks
+  that reference and discards good tripwire tokens.
+
+### Variants to test
+
+- **V0 (current):** positive sentence + banned list at composition time;
+  Self-Review references the list.
+- **V1 (auditor's checklist):** composition-time positive recipe only —
+  "Before finalizing a step, confirm it has: the literal code to write, a
+  runnable command with expected output, types and method names defined
+  within this plan, error handling shown explicitly. A step is complete
+  when an engineer could implement it without asking any follow-up
+  questions." Self-Review keeps a generic placeholder scan.
+- **V2 (restructure by mechanism — predicted winner):** composition time
+  gets only V1's positive recipe; the named patterns move wholesale into
+  the Self-Review placeholder-scan step, reframed as recognition ("when
+  you scan, look for: 'TBD', 'TODO', 'Similar to Task N', …"). Same
+  tokens, relocated from the category that primes to the category that
+  detects.
+- **V3 (control):** positive sentence only, no list anywhere.
+
+### Micro-test design
+
+- **Task:** opus writes a 2-3 task implementation plan from a deliberately
+  under-specified spec (under-specification is what tempts placeholders).
+  Use a fixture spec with: one well-specified task, one task whose error
+  handling the spec hand-waves, one task similar to the first (tempting
+  "Similar to Task 1").
+- **Sampling:** 5+ reps per variant, default temperature, model
+  `claude-opus-4-8` (the model that writes plans in practice).
+- **Programmatic scoring** (lower is better unless noted):
+  - banned-token count: `TBD|TODO|implement later|fill in details|appropriate error handling|handle edge cases|Similar to Task|Write tests for the above`
+  - steps lacking a fenced code block where the step changes code
+  - references to types/functions not defined anywhere in the plan output
+  - (higher is better) runnable commands with expected output per task
+- **Two-stage scoring for V2:** also test the Self-Review half — feed each
+  generated plan back with the variant's Self-Review section and measure
+  whether the scan actually catches seeded placeholders (insert 2 known
+  placeholders into a fixture plan; detection rate is the metric).
+- **Acceptance:** adopt a variant only if it beats V0 on banned-token count
+  without losing code-block coverage or self-review detection rate.
+  Expected cost: ~$6-10 total.
+
+### PR scoping
+
+Separate PR (writing-plans is a different skill; its "No Placeholders"
+list is tuned content where the contributor guidelines demand eval
+evidence). The PR must include: the micro-test harness + results table,
+before/after text, and the V2 relocation rationale.
+
+## The micro-test harness (method, so it isn't lost)
+
+`/tmp/sdd-exp/micro/run-micro.py` and `/tmp/sdd-exp/micro2/run-micro2.py`
+(2026-06-10; to be committed to superpowers-evals as
+`docs/superpowers/skills/micro-testing-prompt-guidance.md` + scripts):
+
+- One API call per sample: system prompt = the skill-guidance variant in
+  realistic surrounding context; user = a realistic mid-workflow scenario;
+  output = the composed artifact (dispatch prompt, plan, report).
+- Programmatic scoring with greps for unambiguous markers; **manually
+  inspect every match before trusting a verdict** — one of tonight's
+  "violations" was the controller correctly quoting the prohibition, and
+  automated negation detection mislabeled another.
+- ~$0.15-0.30/sample, seconds per iteration vs $12/50-min full eval runs.
+  Iterate phrasings here; confirm winners in full runs only when the
+  change is structural.
+- Always include a no-guidance control — tonight it revealed both a
+  backfire (restating: prohibition worse than nothing) and a working
+  prohibition (test-reruns: 3/5 control failures vs 0/5 with either
+  phrasing).
+
+## Also explicitly not-dropped (tested-and-declined, with data)
+
+Recorded so nobody re-proposes them without new evidence — full numbers in
+the 2026-06-09 SDD design spec's Cost-iterations section:
+
+- **Controller turn batching / parallel tool calls in one message:** the
+  controller emits exactly one tool call per message (0 multi-tool
+  messages across every measured run, with and without guidance). 46% of
+  controller turns are thinking/narration with no tool call — a
+  prompt-immune floor.
+- **Pipelined reviews via parallel calls:** dead for the same reason.
+- **Pipelined reviews via `run_in_background`:** mechanism adopted when
+  offered (7/28 dispatches) but benefit below the run-to-run noise floor
+  on 45-min scenarios (reviews are only ~30-60s each); adds dual
+  result-stream coordination. Worth revisiting only for plans whose
+  reviews are individually long.
+- **Nuance clauses appended to winning recipes:** measurably degrade them
+  (C2: 3.8 noisy vs C: 3.0 consistent). Iterate by re-deriving the recipe,
+  not by appending caveats.
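
As a rough illustration of the banned-token scoring listed under "Micro-test design", the sketch below counts matches of the spec's pattern in a generated plan. It is not the `run-micro.py` harness, whose contents are not shown in this spec. The function names `banned_token_hits` and `detection_rate` are hypothetical, and `detection_rate` assumes the Self-Review output is plain text that quotes each placeholder it catches.

```python
import re

# Pattern copied from the spec's "Programmatic scoring" bullet.
# Matching is case-sensitive, as the pattern is written there.
BANNED = re.compile(
    r"TBD|TODO|implement later|fill in details|appropriate error handling"
    r"|handle edge cases|Similar to Task|Write tests for the above"
)


def banned_token_hits(plan_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, matched token, stripped line) for every match.

    The full line is kept so each hit can be inspected by hand before a
    verdict is trusted.
    """
    hits = []
    for lineno, line in enumerate(plan_text.splitlines(), start=1):
        for match in BANNED.finditer(line):
            hits.append((lineno, match.group(0), line.strip()))
    return hits


def detection_rate(seeded: list[str], review_output: str) -> float:
    """Fraction of seeded placeholders quoted verbatim in the review output.

    Assumption: a placeholder counts as caught only if the review text
    contains it as an exact substring.
    """
    if not seeded:
        return 0.0
    caught = sum(1 for placeholder in seeded if placeholder in review_output)
    return caught / len(seeded)


if __name__ == "__main__":
    sample_plan = (
        "Step 1: TODO wire up the handler\n"
        "Step 2: run pytest\n"
        "Step 3: Similar to Task 1, add appropriate error handling\n"
    )
    for lineno, token, line in banned_token_hits(sample_plan):
        print(f"line {lineno}: {token!r} in {line!r}")
```

Returning the matched line alongside the token keeps the manual-inspection step cheap: a hit where the plan merely quotes a banned phrase can be discarded on sight.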