Compare commits

1 Commits

Author SHA1 Message Date
Drew Ritter
5cd1a9d5f2 chore(evals): bump submodule for Claude Haiku target 2026-06-10 16:13:55 -07:00

@@ -456,29 +456,10 @@ Different skill types need different test approaches:
**All of these mean: Test before deploying. No exceptions.**
## Match the Form to the Failure
Before writing guidance, classify the baseline failure. The form that bulletproofs one failure type measurably backfires on another.
| Baseline failure | Right form | Wrong form |
|---|---|---|
| Skips/violates a rule under pressure (knows better, does it anyway) | Prohibition + rationalization table + red flags (see Bulletproofing below) | Soft guidance ("prefer...", "consider...") |
| Complies, but output has the wrong shape (bloated prompt, buried verdict, restated spec) | Positive recipe or contract: state what the output IS — its parts, in order | Prohibition list ("don't restate", "never narrate") |
| Omits a required element from something they already produce | Structural: REQUIRED field or slot in the template they fill in | Prose reminders near the template |
| Behavior should depend on a condition | Conditional keyed to an observable predicate ("if the brief exists, reference it") | Unconditional rule + exemption clauses |
**Why prohibitions backfire on shaping problems:** under a competing incentive ("make the prompt self-contained"), agents negotiate with "don't X". In head-to-head wording tests on dispatch-prompt guidance, the prohibition arm produced clearly more of the unwanted content than the recipe arm (fully separated distributions), and trended worse than even the no-guidance control — micro-test your own case rather than assuming, but never reach for the prohibition by default. A recipe leaves nothing to negotiate: the output matches the stated shape or it doesn't.
**Rules for whichever form you pick:**
- **No nuance clauses.** "Don't X unless it matters" reopens the negotiation — appending a single nuance clause to a winning recipe degraded it from consistent to noisy in the same wording tests. Express a real exception as its own conditional on an observable predicate.
- **Exemption clauses don't scope.** "This limit doesn't apply to code blocks" still suppresses code blocks. If part of the output must be exempt, restructure so the rule can't reach it.
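The "Structural" row of the table can be checked mechanically. What follows is a minimal sketch, assuming a hypothetical report template whose required slots are written as `Field: value` lines. The field names and the `missing_required` helper are illustrative assumptions, not part of the skill.

```python
import re

# Hypothetical required slots for an illustrative report template.
REQUIRED_FIELDS = ("Verdict", "Evidence", "Next step")


def missing_required(report: str, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required fields that are absent or left empty in `report`."""
    missing = []
    for field in required:
        # A slot counts as filled only if non-whitespace text follows
        # "Field:" on the same line.
        pattern = rf"^{re.escape(field)}:[ \t]*\S"
        if not re.search(pattern, report, flags=re.MULTILINE):
            missing.append(field)
    return missing
```

Because the slot lives in the template itself, an omission surfaces as an empty field rather than depending on the agent remembering a prose reminder.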
## Bulletproofing Skills Against Rationalization
Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure.
**Scope:** this toolkit is for discipline failures — an agent that knows the rule and skips it under pressure. For wrong-shaped output or omitted elements, prohibition-based bulletproofing backfires; use the forms in Match the Form to the Failure instead.
**Psychology note:** Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles.
### Close Every Loophole Explicitly
@@ -572,18 +553,6 @@ Run same scenarios WITH skill. Agent should now comply.
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
### Micro-Test Wording Before Full Scenarios
Full pressure-scenario runs are the final gate, but they are slow and expensive per iteration. Verify the wording itself first with micro-tests:
1. **One fresh-context sample per call** — a raw API call, or a single-shot subagent if you don't have API access. System prompt = the realistic context the guidance will live in (the full skill or prompt template, not the guidance in isolation); user message = a task that tempts the failure.
2. **Always include a no-guidance control.** If the control doesn't exhibit the failure, there is nothing to fix — stop, don't author the guidance.
3. **5+ reps per variant.** Single samples lie.
4. **Manually read every flagged match.** Score programmatically if you like, but template echoes and quoted counter-examples masquerade as hits; automated counts alone overstate both failure and success.
5. **Variance is a metric.** When guidance lands, reps converge on the same shape. Five different interpretations across five reps means the wording isn't binding — tighten the form before adding words.
Micro-tests verify wording; they do not replace pressure scenarios for discipline skills.
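The micro-test loop above can be sketched as a small harness. This is a rough illustration under stated assumptions, not a definitive implementation: `sample` is a hypothetical callable that makes one fresh-context call (raw API or single-shot subagent) and returns the output text, `is_failure` is a programmatic scorer you supply, and `shape` is a hypothetical function that maps an output to a coarse label so variance can be counted.

```python
from typing import Callable, Dict


def run_micro_test(
    sample: Callable[[str, str], str],   # hypothetical: (system_prompt, user_message) -> output
    variants: Dict[str, str],            # variant name -> full system prompt
    task: str,                           # user message that tempts the failure
    is_failure: Callable[[str], bool],   # programmatic scorer (assumed, supplied by you)
    shape: Callable[[str], str],         # coarse output-shape label (assumed)
    reps: int = 5,
) -> Dict[str, dict]:
    """Run `reps` independent samples per variant and summarize the results."""
    if "control" not in variants:
        raise ValueError("include a no-guidance 'control' variant")
    results = {}
    for name, system_prompt in variants.items():
        # Each call is expected to start from a fresh context.
        outputs = [sample(system_prompt, task) for _ in range(reps)]
        flagged = [out for out in outputs if is_failure(out)]
        results[name] = {
            "failure_rate": len(flagged) / reps,
            "flagged": flagged,  # read every one of these manually
            "distinct_shapes": len({shape(out) for out in outputs}),
        }
    return results
```

If `results["control"]["failure_rate"]` is zero, there is nothing to fix. A high `distinct_shapes` count for a variant suggests the wording is not binding. The flagged outputs are returned rather than only counted so that each one can be read by hand.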
**Testing methodology:** See [testing-skills-with-subagents.md](testing-skills-with-subagents.md) for the complete testing methodology:
- How to write pressure scenarios
- Pressure types (time, sunk cost, authority, exhaustion)
@@ -641,8 +610,6 @@ Deploying untested skills = deploying untested code. It's a violation of quality
- [ ] Keywords throughout for search (errors, symptoms, tools)
- [ ] Clear overview with core principle
- [ ] Address specific baseline failures identified in RED
- [ ] Guidance form matches the failure type (see Match the Form to the Failure)
- [ ] For behavior-shaping guidance: wording micro-tested against a no-guidance control (5+ reps, every flagged match read manually) — N/A for pure reference skills
- [ ] Code inline OR link to separate file
- [ ] One excellent example (not multi-language)
- [ ] Run scenarios WITH skill - verify agents now comply