- evals/README.md, evals/CLAUDE.md: fix uv install command from
'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml
uses [project.optional-dependencies], so --dev is a no-op for
pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
section with test-worktree-native-preference.sh (the worktree test
is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
list. evals/backends/ has claude*, codex, gemini configs but no
copilot.yaml, so the claim was unsupported.
Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins
this round.
- docs/testing.md split into Plugin tests + Skill behavior evals.
Plugin tests section enumerates the bash tests that survive
(kept by drill-coverage analysis or as describe-skill tests).
- CLAUDE.md adds Eval harness section pointing at evals/.
- README.md Contributing section mentions evals/ alongside tests/.
- .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders
(evals/.gitignore covers these locally; root-level entries help
tooling that does not recurse into nested ignore files).
- RELEASE-NOTES.md: note that test-requesting-code-review.sh and
test-document-review-system.sh were lifted into drill scenarios
on 2026-05-06; references are preserved as dated artifacts.
- docs/superpowers/plans/2026-03-23-codex-app-compatibility.md:
note that tests/skill-triggering/ was lifted into drill scenarios
on 2026-05-06; the run-all.sh reference is a dated artifact.
Subagent second-pass scrub confirmed no other active references in
the tree (excluding evals/ and the spec/plan for this work itself).
- test-worktree-native-preference.sh: drill covers PRESSURE phase only;
RED + GREEN baselines have no drill counterpart and are kept so
the RED-GREEN-REFACTOR validation remains rerunnable end-to-end.
- test-subagent-driven-development-integration.sh: drill covers the
YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds
>=3 commits, >=2 subagent dispatches, TodoWrite usage, test file
existence check, and token-budget telemetry. Kept until drill
scenario covers those or they are retired.
- test-subagent-driven-development.sh: tests agent's ability to
*describe* SDD (string matches against expected keywords). Drill
scenarios test behavior, not description-recall. Kept by design.
Subagent verification recorded in commit messages of subsequent
deletions; gap analyses driving these annotations are also in the
verification subagent reports for the gating sweep.
Subagent verification: every bash assertion (skill invocation,
subagent dispatch, SQL injection flagged, credential handling
flagged, no merge approval) maps to drill verify checks. Drill is
stricter: bundles severity (Critical/Important) into the same
criteria as the finding itself (bash split severity into a separate
test). Setup parity covered (src/db.js with string concat + identity
hash, two commits).
The drill scenario header explicitly says it is the
"cross-harness, semantically-judged replacement for the bash test."
Subagent verification: every bash assertion (TODO in Requirements
section flagged, "specified later" deferral flagged, Issues section
present, did-not-approve verdict) maps to drill verify.criteria
entries. Setup parity covered by setup.assertions (test-feature-design.md
exists with TODO + 'specified later' content). Drill is stricter:
asserts tool-called Agent (subagent dispatch) which the bash test
did not check.
The bash test had ZERO output assertions — it just ran claude -p
and printed token usage. Drill's scenarios are strictly more
rigorous:
go-fractals: skill-called SDD + tool-called Agent + go test ./...
passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria
verifying real SDD workflow.
svelte-todo: skill-called SDD + tool-called Agent + npm test passes
+ playwright e2e passes + package.json + svelte.config.js or
vite.config.ts + >=4 commits + LLM criteria.
design.md and plan.md are byte-identical between bash fixtures and
drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/).
Drill's setup helper (scaffold_sdd_*) forces git init -b main
(stricter than bash's reliance on init.defaultBranch). The
.claude/settings.local.json from bash scaffold.sh is unnecessary
for drill since permissions are managed via backend YAML.
Subagent verification: SAFE TO DELETE for both.
Subagent verification: every bash assertion (Skill tool invoked +
specific skill name 'subagent-driven-development' loaded after the
agent describes it conversationally in turn 1) maps to the drill
scenario's skill-called assertion + criteria paragraph requiring
the skill to fire in direct response to the second user message.
Drill additionally asserts tool-called Agent (subagent dispatch)
which is stricter than the bash test.
Other runners in tests/explicit-skill-requests/ (haiku, multiturn,
extended-multiturn) and their prompt files are preserved — they
have no drill coverage and exercise different behaviors.
Subagent verification confirmed each prompt's intent matches its
corresponding drill scenario's turns[].intent verbatim, and each
scenario has both a deterministic skill-called assertion and a
semantic LLM criterion confirming the matching skill was loaded
(actually a stronger check than the bash test, which only confirms
the skill fires anywhere in the stream).
All 6 prompts deleted. The runner had no remaining prompts to drive,
so run-test.sh and run-all.sh deleted as well.
These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's
os.environ access, which the new cli.py default helper supplies
automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env
because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.
Adds _set_superpowers_root_default() to drill/cli.py, called at
module import after load_dotenv(). PROJECT_ROOT resolves to evals/
post-lift; its parent is the superpowers repo root, which is the
correct value for SUPERPOWERS_ROOT.
Existing env values are respected as overrides via os.environ.setdefault.
Tests:
- helper sets default when var is unset
- helper does not override when var is already set
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.
The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.
Source SHA recorded at evals/.drill-source-sha for divergence
detection.
15-task implementation plan derived from the design spec at
docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md.
Each task is bite-sized (2-5 min steps) with exact commands, exact
file paths, and exact code where required. Subagent verification
gates per the spec are written out as concrete prompt templates.
Self-review:
- Spec coverage: every spec section maps to a task
- Placeholder scan: no TBD/TODO/placeholder/fill-in-later language
- Type consistency: helper named _set_superpowers_root_default
consistently; drill SHA recorded in evals/.drill-source-sha
consistently
Two parallel reviewers raised legitimate issues against the lift-drill-
into-evals spec. Updates:
- Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh
scripts + prompts, not "2 scenarios cover all". Several scripts
(Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging)
have no drill counterpart and stay.
- tests/claude-code/test-subagent-driven-development.sh marked as
meta/documentation test (asks agent to describe SDD); no drill
scenario covers description tests; defaults to keep.
- Path-defaults section now shows verified evidence: PROJECT_ROOT
resolves to evals/ post-move; only claude*.yaml substitute
${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ
in pre-run hooks); helper invocation order specified (after
load_dotenv, before click definitions).
- Step 2 copy uses explicit rsync excludes (.git, .venv, results,
.env, __pycache__, *.egg-info, .private-journal); checksum-level
verification rather than file-count.
- Drill SHA recorded at copy time in commit message and
evals/.drill-source-sha for divergence detection.
- evals/tests/ pytest suite added to verification protocol.
- Reference scrub list expanded: RELEASE-NOTES.md,
docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/),
lefthook.yml. Excluded dirs called out (node_modules/, .venv/,
evals/).
- Historical plan docs / RELEASE-NOTES handling: annotate, don't
rewrite.
- evals/lefthook.yml move documented (drill ships its own;
contributors run cd evals && lefthook run pre-commit manually).
- PR description checklist includes archival action item for
obra/drill post-merge.
False finding rejected: svelte-todo fixture is complete on disk
(design.md + plan.md + scaffold.sh present); reviewer #1#3 dropped.
Records scope, branching, architecture, deletion gate, verification
protocol, path/config edits, migration ordering, and post-implementation
verification. Frames CI integration, scenario co-location, and Python
package rename as deferred work.
Per-file deletion of bash tests under superpowers/tests/ is gated by a
subagent that compares each bash assertion to its drill scenario's
verify block. Default keeps the bash test if any assertion is unmatched.
Branching: independent off dev (f/evals-lift), not stacked on
f/cross-platform.