Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b

rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
2026-07-04 08:39:04 +08:00 · 2026-05-06 12:15:46 -07:00
parent 2e46e9590d
commit 3b412a3836
124 changed files with 13806 additions and 0 deletions
--- a/evals/docs/pressure-and-red-testing.md
+++ b/evals/docs/pressure-and-red-testing.md
@@ -0,0 +1,89 @@
+# Pressure / RED phase testing in drill
+
+## What "RED phase" means
+
+The bash test family in superpowers/tests/ used three implicit phases
+when stress-testing skill content:
+
+* **GREEN** — current skill text. Baseline behavior under normal user
+  prompts. This is what most drill scenarios exercise.
+* **PRESSURE** — current skill text, but the user prompt creates
+  conditions that make the skill's recommended path inconvenient
+  (urgency, an "easier" alternative already on disk, etc.). Lifted
+  as `worktree-creation-under-pressure.yaml`.
+* **RED** — *modified* skill text where the section under test has
+  been removed or weakened. Used to confirm a passing GREEN/PRESSURE
+  result actually depended on the skill text and isn't just baseline
+  model behavior.
+
+GREEN and PRESSURE both run against the current `SUPERPOWERS_ROOT`.
+RED needs a *different* superpowers checkout — one with the section
+under test stripped out — and runs the same scenario against that.
+
+## The drill primitive: vary `SUPERPOWERS_ROOT`
+
+Every backend YAML interpolates `${SUPERPOWERS_ROOT}` into its
+`--plugin-dir` arg (claude.yaml line 6, gemini.yaml line 5, etc.).
+That env var is the only knob you need: point drill at a different
+plugin checkout and the agent under test loads a different version
+of the skill.
+
+```bash
+# GREEN: current skill text
+drill run worktree-creation-from-main -b claude
+
+# RED: same scenario, against a checkout where Step 1a is deleted
+SUPERPOWERS_ROOT=/path/to/superpowers-without-step-1a \
+  drill run worktree-creation-from-main -b claude
+```
+
+Compare verdicts. If GREEN passes and RED fails, the skill text is
+load-bearing. If both pass, the model produces the right behavior
+without the skill — meaning either the skill is redundant or the
+test isn't probing what it claims to probe.
+
+## Recommended workflow
+
+1. Make a git worktree of superpowers at the commit/branch you want
+   to test. For RED variants, edit the skill in that worktree to
+   remove the section under test.
+
+   ```bash
+   cd ~/Documents/GitHub/superpowers/superpowers
+   git worktree add ../superpowers-red-no-step-1a HEAD
+   # edit skills/using-git-worktrees/SKILL.md in the worktree
+   ```
+
+2. Run the same drill scenario against each variant. Use
+   `--n N` to get statistical signal — single runs are noisy,
+   especially under pressure conditions.
+
+   ```bash
+   for variant in main red-no-step-1a; do
+     SUPERPOWERS_ROOT=~/Documents/GitHub/superpowers/superpowers-${variant#main}superpowers \
+       drill run worktree-creation-from-main -b claude --n 10
+   done
+   ```
+
+3. Compare with `drill compare`. Look for the RED variant's pass
+   rate dropping (skill is load-bearing) or holding (skill is
+   redundant or scenario isn't probing what it claims).
+
+## When to add a new pressure scenario vs. add a turn variation
+
+* **New scenario** when the *filesystem* setup is different (e.g.,
+  pre-existing `.worktrees/` for the worktree-pressure case).
+  Setup helpers are scenario-scoped.
+* **New `--n` sweep with different prompts** when only the
+  *user prompt* shape varies (e.g., urgency, framing).
+
+Drill doesn't yet have a way to vary turn intents within a single
+scenario YAML — multi-prompt sweeps require multiple scenario files
+or running the same scenario with different intents externally.
+
+## Open follow-ups
+
+* `--plugins=A,B,C` sweep dimension (parallel to `--models`) so a
+  single drill invocation can run RED + GREEN + PRESSURE variants
+  in one batch and `drill compare` shows them side-by-side. Not yet
+  implemented; tracked as drill-internal future work.