Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants

Jesse Vincent
2026-06-10 12:32:06 -07:00
parent a995af2e24
commit 926096a1d7


@@ -0,0 +1,164 @@
# Positive-Instruction Redesign of Skill Guidance — Design Spec
**Status:** Proposed (follow-up to the 2026-06-09 SDD review-dispatch work; separate PR per the one-problem-per-PR rule)
**Driver:** Measured evidence (2026-06-10) that some negative instructions in skill prose backfire, while others work — and that the difference is predictable.
## The measured finding this spec generalizes
Micro-tests on 2026-06-10 (opus, 5 reps per phrasing, programmatic scoring;
harness described below) measured how guidance phrasing changes what a
controller composes:
| Case | Phrasing | Result |
|---|---|---|
| Dispatch composition ("don't restate the brief") | prohibition | **4.4** spec values re-typed — *worse than no guidance* (3.6) |
| Dispatch composition | positive recipe ("your dispatch should contain: (1)…(5)") | **3.0, zero variance** — adopted |
| Dispatch composition | recipe + nuance clause ("quote only the fragment…") | 3.8, noisy — nuance dilutes recipes |
| Test-rerun directive ("do not ask reviewer to re-run tests") | prohibition | **0/5 violations** — works fine (control: 3/5) |
| Test-rerun directive | positive recipe | 0/5 — equal, but longer |
**The doctrine** (use this to classify any negative instruction):
1. **Tripwires work.** Phrase-level self-checks on concrete tokens ("if the
prompt you are writing contains 'do not flag' … stop") fire reliably.
2. **Recognition tables work.** Red-Flags/rationalization tables are read
at decision time, not composition time.
3. **Discrete-directive prohibitions work.** "Do not ask X to do Y" holds
when the model has no competing incentive to do Y.
4. **Composition prohibitions backfire** when the model has its own agenda
for the output (e.g., restating specs feels like helpful curation).
Only a positive composition recipe moves these — and adding nuance
clauses to a winning recipe makes it worse, not better.
5. **Ties go to the shorter phrasing.** Codex re-reads SKILL.md ~500× per
long session (measured 2026-06-10); prose length is a real cost.
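
The doctrine can be read as a small decision procedure. The sketch below is one possible encoding, not part of any skill: the names `Kind` and `disposition` are hypothetical, and the "micro-test" outcome for the two cases the doctrine leaves open is an assumption drawn from how this spec treats its own uncertain cases.

```python
from enum import Enum


class Kind(Enum):
    TRIPWIRE = "tripwire"                    # phrase-level self-check on concrete tokens
    RECOGNITION_TABLE = "recognition_table"  # read at decision time
    DISCRETE_DIRECTIVE = "discrete"          # "Do not ask X to do Y"
    COMPOSITION = "composition"              # shapes what the model writes


def disposition(kind: Kind, competing_incentive: bool) -> str:
    """Suggest how to handle a negative instruction under the doctrine."""
    if kind in (Kind.TRIPWIRE, Kind.RECOGNITION_TABLE):
        return "keep"
    if kind is Kind.DISCRETE_DIRECTIVE:
        # Holds when the model has no competing incentive to do Y.
        # Assumption: with a competing incentive, measure before deciding.
        return "micro-test" if competing_incentive else "keep"
    # Composition prohibitions backfire when the model has its own agenda.
    # Assumption: without one, measure, and let ties go to the shorter phrasing.
    return "replace with positive recipe" if competing_incentive else "micro-test"
```

For example, `disposition(Kind.COMPOSITION, True)` returns `"replace with positive recipe"`, which matches the dispatch-composition result in the table above.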
## Audit results (2026-06-10, all ~30 skills + prompt templates)
Counts: 3 tripwires (keep), 14 recognition tables (keep), ~20 policy gates
(keep — "never push without permission" is policy, not composition
shaping), 5 composition-prohibitions:
| # | Location | Disposition |
|---|---|---|
| 1 | `subagent-driven-development/task-reviewer-prompt.md` — "Cite, don't narrate" | **Queued in PR #1717 batch**: lead with the positive half ("Your report should point at evidence: file:line for every finding…"), drop the prohibition half (dead weight — the positive half already exists and carries the load) |
| 2 | `subagent-driven-development/SKILL.md` — "Do not add open-ended directives" | **Keep as-is**: micro-test could not elicit the failure in 15 samples; no evidence either way; shorter wins |
| 3 | `subagent-driven-development/SKILL.md` — "Do not ask a reviewer to re-run tests" | **Keep as-is**: measured 0/5 violations; the prohibition also usefully propagates itself into dispatches |
| 4 | `subagent-driven-development/SKILL.md` — "do not re-review on top of it" | **Queued in PR #1717 batch**: replace with the three-element checklist ("Before re-dispatching the reviewer, confirm the fix report contains: the covering tests, the command run, and the output") |
| 5 | `writing-plans/SKILL.md` — the "No Placeholders" banned-patterns list | **This spec's main subject** — see below |
Borderline, deferred with #5: `task-reviewer-prompt.md` "Don't flag
pre-existing file sizes — focus on what this change contributed" (positive
half present and load-bearing; low impact; test alongside #5 if convenient).
## The writing-plans change (deferred item #5)
### Current state
`skills/writing-plans/SKILL.md`, "No Placeholders": one positive sentence
("Every step must contain the actual content an engineer needs") followed
by a six-bullet banned-patterns list ("never write them: 'TBD', 'TODO',
'Add appropriate error handling', 'Write tests for the above', 'Similar to
Task N', …").
### Why it matters and why it is genuinely uncertain
- Plans are the **largest generated artifact** in the workflow, and the
model has a real competing incentive to emit placeholders (they are the
path of least effort under length pressure). That is the incentive
structure of the dispatch-composition case, where prohibition measurably
backfired.
- But the banned items are **discrete, recognizable tokens**, which is the
shape of the test-rerun case, where prohibition measurably held.
- **The list is load-bearing elsewhere:** the skill's Self-Review section
references it ("Placeholder scan: search your plan for red flags — any
of the patterns from the 'No Placeholders' section above"). The tokens
double as the review-time scan inventory, and review-time recognition is
the category that works. A naive swap to a positive checklist breaks
that reference and discards good tripwire tokens.
### Variants to test
- **V0 (current):** positive sentence + banned list at composition time;
Self-Review references the list.
- **V1 (auditor's checklist):** composition-time positive recipe only —
"Before finalizing a step, confirm it has: the literal code to write, a
runnable command with expected output, types and method names defined
within this plan, error handling shown explicitly. A step is complete
when an engineer could implement it without asking any follow-up
questions." Self-Review keeps a generic placeholder scan.
- **V2 (restructure by mechanism — predicted winner):** composition time
gets only V1's positive recipe; the named patterns move wholesale into
the Self-Review placeholder-scan step, reframed as recognition ("when
you scan, look for: 'TBD', 'TODO', 'Similar to Task N', …"). Same
tokens, relocated from the category that primes to the category that
detects.
- **V3 (control):** positive sentence only, no list anywhere.
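
The four variants differ only in where the named patterns appear. The sketch below shows that layout. It is illustrative only: `build_variant`, the shortened Self-Review wording, and the generic scan sentence are hypothetical, and `NAMED_PATTERNS` holds only the patterns this spec quotes, not the full list in the skill.

```python
POSITIVE_SENTENCE = "Every step must contain the actual content an engineer needs."
RECIPE = (
    "Before finalizing a step, confirm it has: the literal code to write, "
    "a runnable command with expected output, types and method names defined "
    "within this plan, error handling shown explicitly."
)
# Only the patterns quoted in this spec; the skill's real list is longer.
NAMED_PATTERNS = [
    "TBD",
    "TODO",
    "Add appropriate error handling",
    "Write tests for the above",
    "Similar to Task N",
]


def quoted(patterns: list[str]) -> str:
    return ", ".join(f"'{p}'" for p in patterns)


def build_variant(name: str) -> dict[str, str]:
    """Return composition-time and Self-Review text for one variant."""
    banned = f"Never write them: {quoted(NAMED_PATTERNS)}."
    by_reference = "Placeholder scan: any of the patterns from the 'No Placeholders' section above."
    generic = "Placeholder scan: search your plan for any placeholder text."  # hypothetical wording
    recognition = f"Placeholder scan: when you scan, look for: {quoted(NAMED_PATTERNS)}."
    variants = {
        "V0": {"composition": f"{POSITIVE_SENTENCE} {banned}", "self_review": by_reference},
        "V1": {"composition": RECIPE, "self_review": generic},
        "V2": {"composition": RECIPE, "self_review": recognition},
        # Assumption: V3 keeps a generic scan; the spec only says "no list anywhere".
        "V3": {"composition": POSITIVE_SENTENCE, "self_review": generic},
    }
    return variants[name]
```

In this layout V1 and V2 share the same composition-time text, so any difference between them in the generation stage would be noise; they diverge only in the Self-Review stage.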
### Micro-test design
- **Task:** opus writes a 2-3 task implementation plan from a deliberately
under-specified spec (under-specification is what tempts placeholders).
Use a fixture spec with: one well-specified task, one task whose error
handling the spec hand-waves, one task similar to the first (tempting
"Similar to Task 1").
- **Sampling:** 5+ reps per variant, default temperature, model
`claude-opus-4-8` (the model that writes plans in practice).
- **Programmatic scoring** (lower is better unless noted):
- banned-token count: `TBD|TODO|implement later|fill in details|appropriate error handling|handle edge cases|Similar to Task|Write tests for the above`
- steps lacking a fenced code block where the step changes code
- references to types/functions not defined anywhere in the plan output
- (higher is better) runnable commands with expected output per task
- **Two-stage scoring for V2:** also test the Self-Review half — feed each
generated plan back with the variant's Self-Review section and measure
whether the scan actually catches seeded placeholders (insert 2 known
placeholders into a fixture plan; detection rate is the metric).
- **Acceptance:** adopt a variant only if it beats V0 on banned-token count
without losing code-block coverage or self-review detection rate.
Expected cost: ~$6-10 total.
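
The scoring and acceptance rules above could be sketched roughly as follows. This is not the committed harness. It assumes each plan step starts at a `### ` heading (the plan format is not fixed by this spec), it counts every step without a fenced block rather than only steps that change code, and the function names are hypothetical.

```python
import re

# Regex copied from the scoring list above (case-sensitive, as written).
BANNED = re.compile(
    r"TBD|TODO|implement later|fill in details|appropriate error handling"
    r"|handle edge cases|Similar to Task|Write tests for the above"
)
FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def split_steps(plan: str) -> list[str]:
    """Assumption: every step starts at a '### ' heading line."""
    return re.split(r"(?m)^### ", plan)[1:]


def score_plan(plan: str) -> dict:
    """Count banned tokens and steps that contain no fenced code block."""
    steps = split_steps(plan)
    return {
        "banned": len(BANNED.findall(plan)),
        # Simplification: every fence-less step counts, not only code-changing ones.
        "steps_without_code": sum(1 for s in steps if FENCE not in s),
    }


def detection_rate(seeded: list[str], review_output: str) -> float:
    """Fraction of seeded placeholders that the self-review output mentions."""
    return sum(1 for s in seeded if s in review_output) / len(seeded)


def beats_baseline(candidate: dict, baseline: dict) -> bool:
    """Acceptance rule: fewer banned tokens, no loss on the other two metrics."""
    return (
        candidate["banned"] < baseline["banned"]
        and candidate["steps_without_code"] <= baseline["steps_without_code"]
        and candidate["detection_rate"] >= baseline["detection_rate"]
    )
```

Substring matching in `detection_rate` is deliberately crude. As with the grep-based scoring, each reported detection still needs a manual look before the number is trusted.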
### PR scoping
Separate PR (writing-plans is a different skill; its "No Placeholders"
list is tuned content where the contributor guidelines demand eval
evidence). The PR must include: the micro-test harness + results table,
before/after text, and the V2 relocation rationale.
## The micro-test harness (method, so it isn't lost)
`/tmp/sdd-exp/micro/run-micro.py` and `/tmp/sdd-exp/micro2/run-micro2.py`
(2026-06-10; to be committed to superpowers-evals as
`docs/superpowers/skills/micro-testing-prompt-guidance.md` + scripts):
- One API call per sample: system prompt = the skill-guidance variant in
realistic surrounding context; user = a realistic mid-workflow scenario;
output = the composed artifact (dispatch prompt, plan, report).
- Programmatic scoring with greps for unambiguous markers; **manually
inspect every match before trusting a verdict** — one of tonight's
"violations" was the controller correctly quoting the prohibition, and
automated negation detection mislabeled another.
- ~$0.15-0.30 per sample and seconds per iteration, versus about $12 and
50 minutes for a full eval run. Iterate phrasings here; confirm winners
in full runs only when the change is structural.
- Always include a no-guidance control — tonight it revealed both a
backfire (restating: prohibition worse than nothing) and a working
prohibition (test-reruns: 3/5 control failures vs 0/5 with either
phrasing).
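
A minimal sketch of the loop described above, assuming a caller-supplied `compose(system, user)` function that stands in for the single API call per sample. The function `run_micro` and its result keys are hypothetical and are not taken from `run-micro.py`. It enforces the no-guidance control and returns the text around every match so each one can be inspected by hand.

```python
import re
import statistics
from typing import Callable, Optional


def run_micro(
    variants: dict[str, Optional[str]],
    scenario: str,
    compose: Callable[[str, str], str],
    marker: re.Pattern,
    reps: int = 5,
) -> dict[str, dict]:
    """Sample each guidance variant `reps` times and count marker matches."""
    if None not in variants.values():
        raise ValueError("include a no-guidance control (a variant mapped to None)")
    results = {}
    for name, guidance in variants.items():
        counts, contexts = [], []
        for _ in range(reps):
            # One call per sample: guidance as system prompt, scenario as user turn.
            output = compose(guidance or "", scenario)
            hits = list(marker.finditer(output))
            counts.append(len(hits))
            # Keep surrounding text so a human can check each match.
            contexts += [output[max(0, m.start() - 40): m.end() + 40] for m in hits]
        results[name] = {
            "mean": statistics.mean(counts),
            "stdev": statistics.pstdev(counts),
            "to_inspect": contexts,
        }
    return results
```

Passing `compose` as a parameter keeps the sketch independent of any particular client library, and lets the scoring path be exercised with a deterministic stub before money is spent on real samples.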
## Also recorded, not silently dropped (tested and declined, with data)
Recorded so nobody re-proposes them without new evidence — full numbers in
the 2026-06-09 SDD design spec's Cost-iterations section:
- **Controller turn batching / parallel tool calls in one message:** the
controller emits exactly one tool call per message (0 multi-tool
messages across every measured run, with and without guidance). 46% of
controller turns are thinking/narration with no tool call — a
prompt-immune floor.
- **Pipelined reviews via parallel calls:** dead for the same reason.
- **Pipelined reviews via `run_in_background`:** the controller adopted the
mechanism when offered (7/28 dispatches), but the benefit was below the
run-to-run noise floor on 45-min scenarios (reviews are only ~30-60s
each), and it adds dual result-stream coordination. Worth revisiting only
for plans whose reviews are individually long.
- **Nuance clauses appended to winning recipes:** measurably degrade them
(C2, recipe + nuance clause: 3.8, noisy, vs. C, plain recipe: 3.0, zero
variance). Iterate by re-deriving the recipe, not by appending caveats.