- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml
  declares its dev tools under [project.optional-dependencies], not a
  [dependency-groups] table, so --dev is a no-op for pytest/ruff/ty;
  --extra dev is the correct invocation.
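For reference, a minimal sketch of the pyproject.toml layout this fix
assumes (the exact dependency list is not confirmed here):

```toml
# Dev tooling declared as an optional extra, not a PEP 735 dependency
# group, so it is selected with `uv sync --extra dev`, not `uv sync --dev`:
[project.optional-dependencies]
dev = ["pytest", "ruff", "ty"]
```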
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
from integration_tests array (file deleted earlier in this branch).
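A minimal sketch of the pruned array (array name taken from the change
note above; any other surviving entries are omitted here):

```shell
# run-skill-tests.sh after the prune; test-requesting-code-review.sh is
# dropped because the script was deleted earlier in this branch, and the
# runner would otherwise fail on a missing file.
integration_tests=(
  test-worktree-native-preference.sh
)
echo "${#integration_tests[@]} integration test(s) registered"
```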
- tests/claude-code/README.md: replace test-requesting-code-review.sh
section with test-worktree-native-preference.sh (the worktree test
is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
list. evals/backends/ has claude*, codex, gemini configs but no
copilot.yaml, so the claim was unsupported.
Adversarial review credit: reviewer #2 found four legitimate issues
(the uv-sync command, the stale run-skill-tests reference, the stale
README reference reached via reviewer #1's finding, and the Copilot CLI
fabrication); reviewer #1 found two distinct issues (run-skill-tests
and tests/claude-code/README.md). Reviewer #2 wins this round.
- docs/testing.md split into Plugin tests + Skill behavior evals.
  The Plugin tests section enumerates the bash tests that survive
  (kept per the drill-coverage analysis, or retained as
  describe-skill tests).
- CLAUDE.md adds Eval harness section pointing at evals/.
- README.md Contributing section mentions evals/ alongside tests/.
- .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders
(evals/.gitignore covers these locally; root-level entries help
tooling that does not recurse into nested ignore files).
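Since gitignore syntax has no brace expansion, the root-level entries
would be written out individually, roughly:

```gitignore
# Root-level duplicates of evals/.gitignore, for tooling that does not
# recurse into nested ignore files:
evals/results/
evals/.venv/
evals/.env
```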
Most new-harness PRs ship integrations that copy skill files or wrap
with `npx skills` instead of loading the using-superpowers bootstrap at
session start. Those integrations look like they work but skills never
auto-trigger.
Add an acceptance test ("Let's make a react todo list" must auto-trigger
brainstorming in a clean session) and require the transcript in the PR.
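A hedged sketch of what such an acceptance check might look like (the
transcript format and the `[skill:brainstorming]` marker are
assumptions for illustration, not the project's actual eval output):

```shell
# Simulate a clean-session transcript for the canonical prompt, then
# assert that the brainstorming skill auto-triggered. A real run would
# capture the transcript from the harness rather than writing it by hand.
transcript="transcript.txt"
{
  echo "user: Let's make a react todo list"
  echo "assistant: [skill:brainstorming] Before writing code, let's explore..."
} > "$transcript"

if grep -q '\[skill:brainstorming\]' "$transcript"; then
  echo "PASS: brainstorming auto-triggered"
else
  echo "FAIL: brainstorming did not trigger" >&2
  exit 1
fi
```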
Speak directly to AI agents at the top of CLAUDE.md: reframe slop
PRs as harmful to their human partner, give a concrete pre-submission
checklist, and explicitly authorize pushing back on vague instructions.
CLAUDE.md (symlinked to AGENTS.md) covers every major rejection
pattern from auditing the last 100 closed PRs (94% rejection rate):
AI slop, ignored PR template, duplicates, speculative fixes, domain-
specific skills, fork confusion, fabricated content, bundled changes,
and misunderstanding project philosophy.