Files
superpowers/evals/docs/pressure-and-red-testing.md
Jesse Vincent 3b412a3836 Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
2026-05-06 15:47:39 -07:00

3.5 KiB

Pressure / RED phase testing in drill

What "RED phase" means

The bash test family in superpowers/tests/ used three implicit phases when stress-testing skill content:

  • GREEN — current skill text. Baseline behavior under normal user prompts. This is what most drill scenarios exercise.
  • PRESSURE — current skill text, but the user prompt creates conditions that make the skill's recommended path inconvenient (urgency, an "easier" alternative already on disk, etc.). Lifted as worktree-creation-under-pressure.yaml.
  • REDmodified skill text where the section under test has been removed or weakened. Used to confirm a passing GREEN/PRESSURE result actually depended on the skill text and isn't just baseline model behavior.

GREEN and PRESSURE both run against the current SUPERPOWERS_ROOT. RED needs a different superpowers checkout — one with the section under test stripped out — and runs the same scenario against that.

The drill primitive: vary SUPERPOWERS_ROOT

Every backend YAML interpolates ${SUPERPOWERS_ROOT} into its --plugin-dir arg (claude.yaml line 6, gemini.yaml line 5, etc.). That env var is the only knob you need: point drill at a different plugin checkout and the agent under test loads a different version of the skill.

# GREEN: current skill text
drill run worktree-creation-from-main -b claude

# RED: same scenario, against a checkout where Step 1a is deleted
SUPERPOWERS_ROOT=/path/to/superpowers-without-step-1a \
  drill run worktree-creation-from-main -b claude

Compare verdicts. If GREEN passes and RED fails, the skill text is load-bearing. If both pass, the model produces the right behavior without the skill — meaning either the skill is redundant or the test isn't probing what it claims to probe.

  1. Make a git worktree of superpowers at the commit/branch you want to test. For RED variants, edit the skill in that worktree to remove the section under test.

    cd ~/Documents/GitHub/superpowers/superpowers
    git worktree add ../superpowers-red-no-step-1a HEAD
    # edit skills/using-git-worktrees/SKILL.md in the worktree
    
  2. Run the same drill scenario against each variant. Use --n N to get statistical signal — single runs are noisy, especially under pressure conditions.

    for variant in main red-no-step-1a; do
      SUPERPOWERS_ROOT=~/Documents/GitHub/superpowers/superpowers-${variant#main}superpowers \
        drill run worktree-creation-from-main -b claude --n 10
    done
    
  3. Compare with drill compare. Look for the RED variant's pass rate dropping (skill is load-bearing) or holding (skill is redundant or scenario isn't probing what it claims).

When to add a new pressure scenario vs. add a turn variation

  • New scenario when the filesystem setup is different (e.g., pre-existing .worktrees/ for the worktree-pressure case). Setup helpers are scenario-scoped.
  • New --n sweep with different prompts when only the user prompt shape varies (e.g., urgency, framing).

Drill doesn't yet have a way to vary turn intents within a single scenario YAML — multi-prompt sweeps require multiple scenario files or running the same scenario with different intents externally.

Open follow-ups

  • --plugins=A,B,C sweep dimension (parallel to --models) so a single drill invocation can run RED + GREEN + PRESSURE variants in one batch and drill compare shows them side-by-side. Not yet implemented; tracked as drill-internal future work.