mirror of
https://github.com/obra/superpowers.git
synced 2026-05-09 18:49:04 +08:00
Spec: address adversarial review findings
Two parallel reviewers raised legitimate issues against the lift-drill-
into-evals spec. Updates:
- Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh
scripts + prompts, not "2 scenarios cover all". Several scripts
(Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging)
have no drill counterpart and stay.
- tests/claude-code/test-subagent-driven-development.sh marked as
meta/documentation test (asks agent to describe SDD); no drill
scenario covers description tests; defaults to keep.
- Path-defaults section now shows verified evidence: PROJECT_ROOT
resolves to evals/ post-move; only claude*.yaml substitute
${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ
in pre-run hooks); helper invocation order specified (after
load_dotenv, before click definitions).
- Step 2 copy uses explicit rsync excludes (.git, .venv, results,
.env, __pycache__, *.egg-info, .private-journal); checksum-level
verification rather than file-count.
- Drill SHA recorded at copy time in commit message and
evals/.drill-source-sha for divergence detection.
- evals/tests/ pytest suite added to verification protocol.
- Reference scrub list expanded: RELEASE-NOTES.md,
docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/),
lefthook.yml. Excluded dirs called out (node_modules/, .venv/,
evals/).
- Historical plan docs / RELEASE-NOTES handling: annotate, don't
rewrite.
- evals/lefthook.yml move documented (drill ships its own;
contributors run cd evals && lefthook run pre-commit manually).
- PR description checklist includes archival action item for
obra/drill post-merge.
False finding rejected: svelte-todo fixture is complete on disk
(design.md + plan.md + scaffold.sh present); reviewer #1 #3 dropped.
This commit is contained in:
committed by
Drew Ritter
parent
81472cc9e6
commit
58f821314d
@@ -83,15 +83,21 @@ A bash test is deleted *only if* a drill scenario verifiably covers every assert
|
||||
|
||||
| Bash test | Claimed drill replacement | Coverage status |
|
||||
|-----------|---------------------------|-----------------|
|
||||
| `tests/skill-triggering/*` (6 prompt files) | `triggering-*.yaml` (6 scenarios) | candidate — verify before deleting |
|
||||
| `tests/explicit-skill-requests/*` | `explicit-skill-request-sdd.yaml`, `mid-conversation-skill-invocation.yaml` | candidate — verify before deleting |
|
||||
| `tests/skill-triggering/prompts/*` (6 prompt files) | `triggering-*.yaml` (6 scenarios) | candidate — verify per-prompt before deleting |
|
||||
| `tests/skill-triggering/run-test.sh`, `run-all.sh` | n/a (runners, not tests) | **keep** — runner scripts |
|
||||
| `tests/explicit-skill-requests/prompts/please-use-brainstorming.txt` | needs verification — drill has no obvious counterpart yet | likely **keep** unless drill scenario added |
|
||||
| `tests/explicit-skill-requests/prompts/use-systematic-debugging.txt` | needs verification — drill has no obvious counterpart | likely **keep** unless drill scenario added |
|
||||
| `tests/explicit-skill-requests/run-claude-describes-sdd.sh` | partially → `mid-conversation-skill-invocation.yaml` | candidate — verify per-script |
|
||||
| `tests/explicit-skill-requests/run-haiku-test.sh` | no drill scenario covers Haiku-specific behavior | **keep** |
|
||||
| `tests/explicit-skill-requests/run-multiturn-test.sh`, `run-extended-multiturn-test.sh` | no drill scenario covers multi-turn build-up | **keep** unless drill scenarios added |
|
||||
| `tests/explicit-skill-requests/run-test.sh`, `run-all.sh` | n/a (runners) | **keep** |
|
||||
| `tests/subagent-driven-dev/go-fractals/`, `tests/subagent-driven-dev/svelte-todo/` | `sdd-go-fractals.yaml`, `sdd-svelte-todo.yaml` | candidate — verify before deleting (these include real assertions about test suites passing) |
|
||||
| `tests/claude-code/test-document-review-system.sh` | `spec-reviewer-catches-planted-flaws.yaml` | candidate — verify before deleting |
|
||||
| `tests/claude-code/test-requesting-code-review.sh` | `code-review-catches-planted-bugs.yaml` | candidate — verify before deleting |
|
||||
| `tests/claude-code/test-subagent-driven-development-integration.sh` | `sdd-rejects-extra-features.yaml` (YAGNI subset) | **partial** — bash test also runs token analysis; drill scenario does not. Likely keep + extend drill scenario. |
|
||||
| `tests/claude-code/test-subagent-driven-development.sh` | unclear from commit log | needs investigation in implementation phase |
|
||||
| `tests/claude-code/test-subagent-driven-development-integration.sh` | `sdd-rejects-extra-features.yaml` (YAGNI subset) | **partial** — bash test also asserts ≥3 commits / `npm test` passes / runs `analyze-token-usage.py`. Drill scenario asserts forbidden-exports + reviewer-as-gate. Mostly disjoint — almost certainly **keep + extend drill scenario**. |
|
||||
| `tests/claude-code/test-subagent-driven-development.sh` | meta/documentation test (asks agent to *describe* SDD); no drill scenario covers description tests | **keep** unless drill scenario added |
|
||||
| `tests/claude-code/test-worktree-native-preference.sh` | `worktree-creation-under-pressure.yaml` | candidate — verify before deleting |
|
||||
| `tests/claude-code/test-helpers.sh`, `run-skill-tests.sh`, `analyze-token-usage.py` | n/a (utilities, not tests) | **keep** — these are libraries/tools, not test cases |
|
||||
| `tests/claude-code/test-helpers.sh`, `run-skill-tests.sh`, `analyze-token-usage.py` | n/a (utilities, not tests) | **keep** — libraries/tools |
|
||||
|
||||
## Verification protocol (subagent-gated)
|
||||
|
||||
@@ -100,8 +106,9 @@ Every change in the implementation plan gets cross-checked by an independent sub
|
||||
| Change category | Subagent verification |
|
||||
|----------------|----------------------|
|
||||
| Each bash-test deletion | Dispatch a subagent with: (a) the bash test file content, (b) the candidate drill scenario YAML, (c) the prompt: *"List every assertion the bash test makes. List every verify entry in the drill scenario. For each bash assertion, find a matching drill check or report it as unmatched. Output a per-assertion table."* The subagent's output is the gate — only delete if every bash assertion has a match. |
|
||||
| Initial `evals/` copy | Subagent verifies: file count matches drill repo; all backend YAMLs reference paths that exist post-move; `.env` template carries over; `pyproject.toml` is intact; `uv.lock` is intact. |
|
||||
| Reference scrubbing after deletion | Subagent greps the entire superpowers tree for references to deleted bash test paths (in `docs/`, `CLAUDE.md`, `GEMINI.md`, `AGENTS.md`, `README.md`, `.github/`, `scripts/`, `opencode/INSTALL.md`, `.codex/INSTALL.md`, etc.). Any hit is either updated or surfaces a missed dependency. |
|
||||
| Initial `evals/` copy | Subagent verifies: (a) drill SHA being copied is recorded in commit message and `evals/.drill-source-sha` (a checked-in file) so divergence is detectable; (b) **per-file SHA-256 checksum** matches drill repo for every file (not just file count); (c) excluded paths (`.git/`, `.venv/`, `results/`, `.env`, `__pycache__/`, `*.egg-info/`, any `.private-journal/`) are absent from `evals/`; (d) all backend YAMLs reference paths that exist post-move; (e) `pyproject.toml`, `uv.lock`, `.gitignore` are intact. |
|
||||
| Drill's own pytest suite | Subagent runs `cd evals && uv run pytest` after the path-default change. Drill ships its own pytest suite at `evals/tests/` including `test_backend.py` which exercises `SUPERPOWERS_ROOT` env-var behavior — these tests must update to match the helper and continue to pass. |
|
||||
| Reference scrubbing after deletion | Subagent greps the entire superpowers tree (excluding `node_modules/`, `.venv/`, and `evals/`) for references to deleted bash test paths. Search targets: `docs/`, `docs/superpowers/plans/`, `RELEASE-NOTES.md`, `CLAUDE.md`, `GEMINI.md`, `AGENTS.md`, `README.md`, `.github/`, `scripts/`, `.opencode/INSTALL.md`, `.codex-plugin/INSTALL.md`, `lefthook.yml`. Any hit is either updated or surfaces a missed dependency. |
|
||||
| Path defaults change (`SUPERPOWERS_ROOT` default) | Subagent runs at least one cheap drill scenario after the path changes (e.g., `triggering-test-driven-development`) and confirms it still passes. Real validation, not just code review. |
|
||||
| Final pre-PR adversarial review | Two subagents in parallel, "5 points to whoever finds the most legitimate issues" framing — same protocol used on the cross-platform PR. Verify both source code and behavior. |
|
||||
|
||||
@@ -109,15 +116,21 @@ Each subagent task gets its own bullet in the implementation plan with explicit
|
||||
|
||||
## Concrete path/config edits
|
||||
|
||||
The drill copy is mostly path-agnostic; only a small set of files reference `SUPERPOWERS_ROOT`:
|
||||
**Verified prior to writing this spec.** `drill/cli.py` defines `PROJECT_ROOT = Path(__file__).parent.parent`. After the move, `cli.py` lives at `evals/drill/cli.py`, so `PROJECT_ROOT` resolves to `evals/` and `PROJECT_ROOT.parent` resolves to the superpowers repo root. That's the value `SUPERPOWERS_ROOT` should take by default.
|
||||
|
||||
**YAML substitution audit.** Only the four `claude*.yaml` backend configs interpolate `${SUPERPOWERS_ROOT}` into `args` (for the `--plugin-dir` flag); `codex.yaml` and `gemini.yaml` only list `SUPERPOWERS_ROOT` in `required_env` (consumed by `engine.py:233` / `setup.py:25`'s `os.environ["SUPERPOWERS_ROOT"]` lookups in pre/post-run hooks). The helper's `os.environ` mutation covers both code paths.
|
||||
|
||||
| File | Current | After |
|
||||
|------|---------|-------|
|
||||
| `drill/engine.py`, `drill/setup.py` | `os.environ["SUPERPOWERS_ROOT"]` (KeyError if unset) | New helper `superpowers_root()` that defaults to `PROJECT_ROOT.parent` (the dir containing `evals/`); env var still respected as override |
|
||||
| `backends/*.yaml` (7 files) | `${SUPERPOWERS_ROOT}` substitution in `args` and `required_env` | YAML unchanged. The new helper sets `os.environ["SUPERPOWERS_ROOT"]` at CLI startup *if not already set*, so the existing YAML substitution keeps working with no edits. `required_env` entries can be dropped since the variable is now self-supplied. |
|
||||
| `evals/README.md` | "export SUPERPOWERS_ROOT=/path/to/superpowers" | Drop the export line; note that env var is auto-detected; mention the only required setup is `ANTHROPIC_API_KEY` |
|
||||
| `drill/cli.py` | `load_dotenv(PROJECT_ROOT / ".env")` at module import; nothing about `SUPERPOWERS_ROOT` | After `load_dotenv`, call new helper `_set_superpowers_root_default()` that sets `os.environ["SUPERPOWERS_ROOT"]` to `str(PROJECT_ROOT.parent)` if and only if not already set. Order: `load_dotenv` → set default → click group definitions. |
|
||||
| `drill/engine.py:233`, `drill/setup.py:25` | Direct `os.environ["SUPERPOWERS_ROOT"]` access (KeyError if unset) | Unchanged. The CLI startup hook guarantees the env var is set by the time the engine/setup execute. |
|
||||
| `backends/claude*.yaml` (5 files) | `${SUPERPOWERS_ROOT}` substituted in `args` for `--plugin-dir` | Unchanged. YAML substitution reads `os.environ` at backend-load time, which is after CLI startup. |
|
||||
| `backends/codex.yaml`, `backends/gemini.yaml` | `SUPERPOWERS_ROOT` in `required_env` only | Drop from `required_env` (the helper supplies it). `claude*.yaml` keep `required_env` for backward compat (env var works as override). |
|
||||
| `evals/tests/test_backend.py` | Tests assert `SUPERPOWERS_ROOT` is in `required_env` lists, plus path-resolution tests | Update tests to match the new contract: helper-supplied default, env override still works, `required_env` no longer required for codex/gemini. |
|
||||
| `evals/README.md` | "export SUPERPOWERS_ROOT=/path/to/superpowers" | Drop the export line; note that the env var auto-defaults to the parent of `evals/`; mention the only required setup is `ANTHROPIC_API_KEY` (or `OPENAI_API_KEY` / Gemini auth). |
|
||||
| `evals/CLAUDE.md` | Same | Same |
|
||||
| `evals/.gitignore` | drill's existing patterns (`results/`, `.venv/`, `__pycache__/`, `.env`) | Identical |
|
||||
| `evals/.gitignore` | drill's existing patterns (`results/`, `.venv/`, `__pycache__/`, `.env`, `*.pyc`, `*.egg-info/`, `dist/`, `build/`, `.claude/`) | Copied verbatim. Patterns are relative to file location, so they apply correctly under `evals/`. |
|
||||
| `evals/lefthook.yml` | drill ships `lefthook.yml` defining `pre-commit: uv run ruff check && uv run ty check` | Move to `evals/lefthook.yml`. Either (a) install lefthook at the superpowers root and have it federate to `evals/lefthook.yml`, or (b) document that contributors run `cd evals && lefthook run pre-commit` manually. **Decision in implementation: option (b) for simplicity** — superpowers' top-level workflow doesn't change. |
|
||||
|
||||
`.env` placement: keep `evals/.env` (gitignored). Contributors source it from there or set `ANTHROPIC_API_KEY` in their shell environment.
|
||||
|
||||
@@ -136,19 +149,30 @@ Each step is a separate commit (or small group of commits). Step 2 is the bigges
|
||||
1. Branch off `dev` (f/evals-lift)
|
||||
|
||||
2. Copy drill repo into evals/ (single commit, easy to revert)
|
||||
├─ Subagent gate: file count matches drill repo; key files present
|
||||
│ (pyproject.toml, uv.lock, scenarios/*.yaml, backends/*.yaml, drill/*.py)
|
||||
└─ Smoke check: `cd evals && uv sync` succeeds
|
||||
├─ Record drill SHA at copy time → commit message + evals/.drill-source-sha
|
||||
├─ Use `rsync -a --exclude=.git --exclude=.venv --exclude=results
|
||||
│ --exclude=.env --exclude=__pycache__ --exclude='*.egg-info'
|
||||
│ --exclude=.private-journal /path/to/drill/ evals/`
|
||||
│ (rsync chosen over `cp -r` for explicit excludes; verify with
|
||||
│ `find evals -name '.git' -type d` returns nothing)
|
||||
├─ Subagent gate: per-file SHA-256 checksum matches drill repo for every
|
||||
│ non-excluded file; excluded paths absent from evals/
|
||||
└─ Smoke check: `cd evals && uv sync` succeeds (proves install only;
|
||||
not a behavioral test)
|
||||
|
||||
3. Update path defaults
|
||||
├─ Add superpowers_root() helper to drill module
|
||||
├─ Update engine.py + setup.py to use it
|
||||
├─ Add _set_superpowers_root_default() helper to drill/cli.py
|
||||
├─ Wire it after load_dotenv, before click group definition
|
||||
├─ Update evals/README.md and evals/CLAUDE.md (drop SUPERPOWERS_ROOT install step)
|
||||
└─ Drop SUPERPOWERS_ROOT from required_env in backend YAMLs
|
||||
├─ Drop SUPERPOWERS_ROOT from required_env in codex.yaml/gemini.yaml
|
||||
│ (keep in claude*.yaml as override)
|
||||
└─ Update evals/tests/test_backend.py to match new contract
|
||||
|
||||
4. Validate from new location
|
||||
└─ Run triggering-test-driven-development -b claude (cheap scenario, ~3-5 min).
|
||||
Must pass. Real validation, not code review.
|
||||
4. Validate from new location (TWO checks)
|
||||
├─ Run drill's own pytest: `cd evals && uv run pytest` — must pass
|
||||
└─ Run cheap drill scenario: `cd evals && uv run drill run
|
||||
triggering-test-driven-development -b claude` — must pass.
|
||||
Real behavioral validation, not just code review.
|
||||
|
||||
5. Bash test deletion phase — per-file with subagent gate
|
||||
For each file in the candidate-deletion list:
|
||||
@@ -160,37 +184,52 @@ Each step is a separate commit (or small group of commits). Step 2 is the bigges
|
||||
keep the bash test (no commit)
|
||||
|
||||
6. Stale-reference scrub
|
||||
└─ Subagent greps: deleted file paths in docs/, CLAUDE.md, GEMINI.md,
|
||||
AGENTS.md, README.md, .github/, scripts/, opencode/INSTALL.md,
|
||||
.codex/INSTALL.md. Update or surface each hit.
|
||||
├─ Subagent greps the superpowers tree (excluding node_modules/, .venv/,
|
||||
│ evals/) for deleted file paths
|
||||
├─ Search targets: docs/, docs/superpowers/plans/, RELEASE-NOTES.md,
|
||||
│ CLAUDE.md, GEMINI.md, AGENTS.md, README.md, .github/, scripts/,
|
||||
│ .opencode/INSTALL.md, .codex-plugin/INSTALL.md, lefthook.yml
|
||||
├─ Update active references (e.g., docs/testing.md, README.md install)
|
||||
└─ Historical references in docs/superpowers/plans/*.md and
|
||||
RELEASE-NOTES.md are PRESERVED with a brief annotation
|
||||
("(test removed; behavior covered by drill scenario X)") rather
|
||||
than rewritten — these are dated artifacts, not living docs.
|
||||
|
||||
7. Top-level docs
|
||||
├─ docs/testing.md split
|
||||
├─ CLAUDE.md pointer
|
||||
└─ README.md Contributing section
|
||||
|
||||
8. Re-run smoke scenario from new location (still passing? assertion
|
||||
that step 5 didn't silently break the harness)
|
||||
8. Re-run smoke checks (regression gate)
|
||||
├─ `cd evals && uv run pytest`
|
||||
└─ `cd evals && uv run drill run triggering-test-driven-development -b claude`
|
||||
|
||||
9. Final adversarial review
|
||||
└─ Two parallel subagents, full diff, "5 points to whoever finds the
|
||||
most legitimate issues" framing. Address findings before push.
|
||||
|
||||
10. Push branch + open PR against dev
|
||||
└─ PR description includes: drill SHA pinned at copy, archival action
|
||||
item ("after merge: archive obra/drill, add README pointer to
|
||||
obra/superpowers/evals/"), per-deleted-file coverage receipts.
|
||||
```
|
||||
|
||||
## Verification (post-implementation)
|
||||
|
||||
The implementation plan must show:
|
||||
|
||||
- All drill source files present at `evals/` after step 2 (subagent file-count diff vs `obra/drill`).
|
||||
- All non-excluded drill source files present at `evals/` after step 2 (subagent **per-file SHA-256 checksum diff** vs `obra/drill@<recorded-sha>`).
|
||||
- Excluded paths (`.git/`, `.venv/`, `results/`, `.env`, `__pycache__/`, `*.egg-info/`, `.private-journal/`) absent from `evals/`.
|
||||
- `evals/.drill-source-sha` matches the SHA referenced in the step-2 commit message.
|
||||
- `cd evals && uv sync` succeeds without `SUPERPOWERS_ROOT` set.
|
||||
- `cd evals && uv run drill list` returns the same scenario count as the standalone drill repo.
|
||||
- `cd evals && uv run drill run triggering-test-driven-development -b claude` passes (proves path defaults work).
|
||||
- For each deleted bash test: subagent verification table showing every assertion mapped to a drill check.
|
||||
- Grep for deleted file paths returns zero hits across the superpowers tree (post step 6).
|
||||
- `cd evals && uv run pytest` passes (drill's own pytest suite).
|
||||
- `cd evals && uv run drill list` returns the same scenario count as the standalone drill repo at the recorded SHA.
|
||||
- `cd evals && uv run drill run triggering-test-driven-development -b claude` passes (proves path defaults work end-to-end).
|
||||
- For each deleted bash test: subagent verification table in the commit message showing every assertion mapped to a drill check.
|
||||
- Grep for deleted file paths returns zero hits across living superpowers docs (post step 6); historical refs in `docs/superpowers/plans/*.md` and `RELEASE-NOTES.md` are annotated, not rewritten.
|
||||
- `docs/testing.md` has both "Plugin tests" and "Skill behavior evals" sections.
|
||||
- The drill repo's history is untouched; `obra/drill` is unaffected by this PR.
|
||||
- PR description names the action item to archive `obra/drill` after merge.
|
||||
|
||||
## Open questions
|
||||
|
||||
|
||||
Reference in New Issue
Block a user