Use Jesse wording for review hesitation guidance

Add generic review hesitation guidance
Remove Circle K signal from review skill
2026-05-13 20:49:06 +08:00 · 2026-05-12 15:14:40 -07:00 · 2026-05-12 14:56:57 -07:00 · 2026-05-12 14:11:56 -07:00 · 2026-05-11 17:50:01 -07:00 · 2026-05-11 17:04:35 -07:00
4 changed files with 7 additions and 9 deletions
--- a/.cursor-plugin/plugin.json
+++ b/.cursor-plugin/plugin.json
@@ -19,7 +19,5 @@
    "workflows"
  ],
  "skills": "./skills/",
-  "agents": "./agents/",
-  "commands": "./commands/",
  "hooks": "./hooks/hooks-cursor.json"
 }
--- a/skills/receiving-code-review/SKILL.md
+++ b/skills/receiving-code-review/SKILL.md
@@ -126,7 +126,7 @@ Push back when:
 - Reference working tests/code
 - Involve your human partner if architectural

-**Signal if uncomfortable pushing back out loud:** "Strange things are afoot at the Circle K"
+**If you're uncomfortable pushing back out loud:** Name that tension, then tell your partner about the issue you've seen. They'll appreciate your honesty.

 ## Acknowledging Correct Feedback

--- a/skills/using-git-worktrees/SKILL.md
+++ b/skills/using-git-worktrees/SKILL.md
@@ -30,7 +30,7 @@ BRANCH=$(git branch --show-current)
 git rev-parse --show-superproject-working-tree 2>/dev/null
 ```

-**If `GIT_DIR != GIT_COMMON` (and not a submodule):** You are already in a linked worktree. Skip to Step 3 (Project Setup). Do NOT create another worktree.
+**If `GIT_DIR != GIT_COMMON` (and not a submodule):** You are already in a linked worktree. Skip to Step 2 (Project Setup). Do NOT create another worktree.

 Report with branch state:
 - On a branch: "Already in isolated workspace at `<path>` on branch `<name>`."
@@ -42,7 +42,7 @@ Has the user already indicated their worktree preference in your instructions? I

 > "Would you like me to set up an isolated worktree? It protects your current branch from changes."

-Honor any existing declared preference without asking. If the user declines consent, work in place and skip to Step 3.
+Honor any existing declared preference without asking. If the user declines consent, work in place and skip to Step 2.

 ## Step 1: Create Isolated Workspace

@@ -50,7 +50,7 @@ Honor any existing declared preference without asking. If the user declines cons

 ### 1a. Native Worktree Tools (preferred)

-The user has asked for an isolated workspace (Step 0 consent). Do you already have a way to create a worktree? It might be a tool with a name like `EnterWorktree`, `WorktreeCreate`, a `/worktree` command, or a `--worktree` flag. If you do, use it and skip to Step 3.
+The user has asked for an isolated workspace (Step 0 consent). Do you already have a way to create a worktree? It might be a tool with a name like `EnterWorktree`, `WorktreeCreate`, a `/worktree` command, or a `--worktree` flag. If you do, use it and skip to Step 2.

 Native tools handle directory placement, branch creation, and cleanup automatically. Using `git worktree add` when you have a native tool creates phantom state your harness can't see or manage.

@@ -111,7 +111,7 @@ cd "$path"

 **Sandbox fallback:** If `git worktree add` fails with a permission error (sandbox denial), tell the user the sandbox blocked worktree creation and you're working in the current directory instead. Then run setup and baseline tests in place.

-## Step 3: Project Setup
+## Step 2: Project Setup

 Auto-detect and run appropriate setup:

@@ -130,7 +130,7 @@ if [ -f pyproject.toml ]; then poetry install; fi
 if [ -f go.mod ]; then go mod download; fi
 ```

-## Step 4: Verify Clean Baseline
+## Step 3: Verify Clean Baseline

 Run tests to ensure workspace starts clean:

--- a/skills/writing-skills/SKILL.md
+++ b/skills/writing-skills/SKILL.md
@@ -553,7 +553,7 @@ Run same scenarios WITH skill. Agent should now comply.

 Agent found new rationalization? Add explicit counter. Re-test until bulletproof.

-**Testing methodology:** See @testing-skills-with-subagents.md for the complete testing methodology:
+**Testing methodology:** See [testing-skills-with-subagents.md](testing-skills-with-subagents.md) for the complete testing methodology:
 - How to write pressure scenarios
 - Pressure types (time, sunk cost, authority, exhaustion)
 - Plugging holes systematically
Author	SHA1	Message	Date
Drew Ritter	16132fa61f	Use Jesse wording for review hesitation guidance	2026-05-12 15:14:40 -07:00
Drew Ritter	53194ecd7d	Add generic review hesitation guidance	2026-05-12 14:56:57 -07:00
Drew Ritter	30474eab53	Remove Circle K signal from review skill	2026-05-12 14:11:56 -07:00
Drew Ritter	491df7360c	fix(using-git-worktrees): repair skipped Step 2 numbering (#1522)	2026-05-11 17:50:01 -07:00
fuleinist	9088f563e7	fix: remove stale Cursor plugin refs	2026-05-11 17:04:35 -07:00
Stable Genius	d4cf61b4c8	fix(writing-skills): use markdown link for testing methodology reference	2026-05-11 16:51:00 -07:00
Drew Ritter	7f02ccd91b	evals: use pre-commit hooks	2026-05-06 15:47:39 -07:00
Drew Ritter	35e42a16ce	evals: add Gemini 2.5 Flash backend	2026-05-06 15:47:39 -07:00
Drew Ritter	58082d04f8	evals: drop drill source marker	2026-05-06 15:47:39 -07:00
Drew Ritter	3dc0ea6876	evals: remove unreleased wave scenarios	2026-05-06 15:47:39 -07:00
Jesse Vincent	0bf37499b4	Address adversarial review findings - evals/README.md, evals/CLAUDE.md: fix uv install command from 'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses [project.optional-dependencies], so --dev is a no-op for pytest/ruff/ty; --extra dev is the correct invocation. - tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh from integration_tests array (file deleted earlier in this branch). - tests/claude-code/README.md: replace test-requesting-code-review.sh section with test-worktree-native-preference.sh (the worktree test is kept; the code-review test was lifted into drill). - docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness list. evals/backends/ has claude*, codex, gemini configs but no copilot.yaml, so the claim was unsupported. Adversarial review credit: reviewer #2 found four legitimate issues (uv-sync, run-skill-tests stale ref, README stale ref via #1, and Copilot CLI fabrication); reviewer #1 found two distinct issues (run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this round.	2026-05-06 15:47:39 -07:00
Jesse Vincent	f7c5312265	docs: introduce evals/ as the canonical skill-behavior eval harness - docs/testing.md split into Plugin tests + Skill behavior evals. Plugin tests section enumerates the bash tests that survive (kept by drill-coverage analysis or as describe-skill tests). - CLAUDE.md adds Eval harness section pointing at evals/. - README.md Contributing section mentions evals/ alongside tests/. - .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders (evals/.gitignore covers these locally; root-level entries help tooling that does not recurse into nested ignore files).	2026-05-06 15:47:39 -07:00
Jesse Vincent	f5175fb31a	docs: annotate dated artifacts referencing lifted bash tests - RELEASE-NOTES.md: note that test-requesting-code-review.sh and test-document-review-system.sh were lifted into drill scenarios on 2026-05-06; references are preserved as dated artifacts. - docs/superpowers/plans/2026-03-23-codex-app-compatibility.md: note that tests/skill-triggering/ was lifted into drill scenarios on 2026-05-06; the run-all.sh reference is a dated artifact. Subagent second-pass scrub confirmed no other active references in the tree (excluding evals/ and the spec/plan for this work itself).	2026-05-06 15:47:39 -07:00
Jesse Vincent	45c7dc2cce	tests: annotate three kept bash tests with drill coverage notes - test-worktree-native-preference.sh: drill covers PRESSURE phase only; RED + GREEN baselines have no drill counterpart and are kept so the RED-GREEN-REFACTOR validation remains rerunnable end-to-end. - test-subagent-driven-development-integration.sh: drill covers the YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds >=3 commits, >=2 subagent dispatches, TodoWrite usage, test file existence check, and token-budget telemetry. Kept until drill scenario covers those or they are retired. - test-subagent-driven-development.sh: tests agent's ability to describe SDD (string matches against expected keywords). Drill scenarios test behavior, not description-recall. Kept by design. Subagent verification recorded in commit messages of subsequent deletions; gap analyses driving these annotations are also in the verification subagent reports for the gating sweep.	2026-05-06 15:47:39 -07:00
Jesse Vincent	39d29a6c28	tests: remove test-requesting-code-review.sh (covered by drill code-review-catches-planted-bugs) Subagent verification: every bash assertion (skill invocation, subagent dispatch, SQL injection flagged, credential handling flagged, no merge approval) maps to drill verify checks. Drill is stricter: bundles severity (Critical/Important) into the same criteria as the finding itself (bash split severity into a separate test). Setup parity covered (src/db.js with string concat + identity hash, two commits). The drill scenario header explicitly says it is the "cross-harness, semantically-judged replacement for the bash test."	2026-05-06 15:47:39 -07:00
Jesse Vincent	f1d2005de3	tests: remove test-document-review-system.sh (covered by drill spec-reviewer-catches-planted-flaws) Subagent verification: every bash assertion (TODO in Requirements section flagged, "specified later" deferral flagged, Issues section present, did-not-approve verdict) maps to drill verify.criteria entries. Setup parity covered by setup.assertions (test-feature-design.md exists with TODO + 'specified later' content). Drill is stricter: asserts tool-called Agent (subagent dispatch) which the bash test did not check.	2026-05-06 15:47:39 -07:00
Jesse Vincent	c0a65f1b4d	tests: remove subagent-driven-dev fixtures (covered by drill sdd-go-fractals + sdd-svelte-todo) The bash test had ZERO output assertions — it just ran claude -p and printed token usage. Drill's scenarios are strictly more rigorous: go-fractals: skill-called SDD + tool-called Agent + go test ./... passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria verifying real SDD workflow. svelte-todo: skill-called SDD + tool-called Agent + npm test passes + playwright e2e passes + package.json + svelte.config.js or vite.config.ts + >=4 commits + LLM criteria. design.md and plan.md are byte-identical between bash fixtures and drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/). Drill's setup helper (scaffold_sdd_*) forces git init -b main (stricter than bash's reliance on init.defaultBranch). The .claude/settings.local.json from bash scaffold.sh is unnecessary for drill since permissions are managed via backend YAML. Subagent verification: SAFE TO DELETE for both.	2026-05-06 15:47:39 -07:00
Jesse Vincent	f10cddac0d	tests: remove run-claude-describes-sdd.sh (covered by drill mid-conversation-skill-invocation) Subagent verification: every bash assertion (Skill tool invoked + specific skill name 'subagent-driven-development' loaded after the agent describes it conversationally in turn 1) maps to the drill scenario's skill-called assertion + criteria paragraph requiring the skill to fire in direct response to the second user message. Drill additionally asserts tool-called Agent (subagent dispatch) which is stricter than the bash test. Other runners in tests/explicit-skill-requests/ (haiku, multiturn, extended-multiturn) and their prompt files are preserved — they have no drill coverage and exercise different behaviors.	2026-05-06 15:47:39 -07:00
Jesse Vincent	371f41596b	tests: remove skill-triggering bash prompts (covered by drill triggering-* scenarios) Subagent verification confirmed each prompt's intent matches its corresponding drill scenario's turns[].intent verbatim, and each scenario has both a deterministic skill-called assertion and a semantic LLM criterion confirming the matching skill was loaded (actually a stronger check than the bash test, which only confirms the skill fires anywhere in the stream). All 6 prompts deleted. The runner had no remaining prompts to drive, so run-test.sh and run-all.sh deleted as well.	2026-05-06 15:47:39 -07:00
Jesse Vincent	6f0adebe96	evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE The cli.py helper now defaults the env var. Mention as override only.	2026-05-06 15:47:39 -07:00
Jesse Vincent	fd5b53cb85	evals: drop SUPERPOWERS_ROOT from codex/gemini required_env These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's os.environ access, which the new cli.py default helper supplies automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.	2026-05-06 15:47:39 -07:00
Jesse Vincent	be0357f98a	evals: default SUPERPOWERS_ROOT to parent of evals/ if unset Adds _set_superpowers_root_default() to drill/cli.py, called at module import after load_dotenv(). PROJECT_ROOT resolves to evals/ post-lift; its parent is the superpowers repo root, which is the correct value for SUPERPOWERS_ROOT. Existing env values are respected as overrides via os.environ.setdefault. Tests: - helper sets default when var is unset - helper does not override when var is already set	2026-05-06 15:47:39 -07:00
Jesse Vincent	3b412a3836	Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.	2026-05-06 15:47:39 -07:00
Jesse Vincent	2e46e9590d	Plan: lift drill into superpowers as evals/ 15-task implementation plan derived from the design spec at docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md. Each task is bite-sized (2-5 min steps) with exact commands, exact file paths, and exact code where required. Subagent verification gates per the spec are written out as concrete prompt templates. Self-review: - Spec coverage: every spec section maps to a task - Placeholder scan: no TBD/TODO/placeholder/fill-in-later language - Type consistency: helper named _set_superpowers_root_default consistently; drill SHA recorded in evals/.drill-source-sha consistently	2026-05-06 15:47:39 -07:00
Jesse Vincent	58f821314d	Spec: address adversarial review findings Two parallel reviewers raised legitimate issues against the lift-drill-into-evals spec. Updates: - Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh scripts + prompts, not "2 scenarios cover all". Several scripts (Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging) have no drill counterpart and stay. - tests/claude-code/test-subagent-driven-development.sh marked as meta/documentation test (asks agent to describe SDD); no drill scenario covers description tests; defaults to keep. - Path-defaults section now shows verified evidence: PROJECT_ROOT resolves to evals/ post-move; only claude*.yaml substitute ${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ in pre-run hooks); helper invocation order specified (after load_dotenv, before click definitions). - Step 2 copy uses explicit rsync excludes (.git, .venv, results, .env, __pycache__, *.egg-info, .private-journal); checksum-level verification rather than file-count. - Drill SHA recorded at copy time in commit message and evals/.drill-source-sha for divergence detection. - evals/tests/ pytest suite added to verification protocol. - Reference scrub list expanded: RELEASE-NOTES.md, docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/), lefthook.yml. Excluded dirs called out (node_modules/, .venv/, evals/). - Historical plan docs / RELEASE-NOTES handling: annotate, don't rewrite. - evals/lefthook.yml move documented (drill ships its own; contributors run cd evals && lefthook run pre-commit manually). - PR description checklist includes archival action item for obra/drill post-merge. False finding rejected: svelte-todo fixture is complete on disk (design.md + plan.md + scaffold.sh present); reviewer #1 #3 dropped.	2026-05-06 15:47:39 -07:00
Jesse Vincent	81472cc9e6	Spec: lift drill into superpowers as evals/ Records scope, branching, architecture, deletion gate, verification protocol, path/config edits, migration ordering, and post-implementation verification. Frames CI integration, scenario co-location, and Python package rename as deferred work. Per-file deletion of bash tests under superpowers/tests/ is gated by a subagent that compares each bash assertion to its drill scenario's verify block. Default keeps the bash test if any assertion is unmatched. Branching: independent off dev (f/evals-lift), not stacked on f/cross-platform.	2026-05-06 15:47:39 -07:00
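
Commit be0357f98a above describes defaulting `SUPERPOWERS_ROOT` to the parent of `evals/` via `os.environ.setdefault`, with existing values respected as overrides. A minimal sketch of that behavior follows. It is not the code in `drill/cli.py`: the function name `set_superpowers_root_default_sketch`, its parameters, and the example paths are hypothetical, chosen so the logic can be exercised against a plain dict instead of the real environment.

```python
import os
from pathlib import Path
from typing import MutableMapping


def set_superpowers_root_default_sketch(
    project_root: Path,
    env: MutableMapping[str, str] = os.environ,
) -> str:
    """Default SUPERPOWERS_ROOT to the parent of project_root unless already set.

    Hypothetical illustration: project_root stands in for the PROJECT_ROOT
    value the commit message says resolves to evals/ after the lift.
    """
    # setdefault only writes when the key is absent, so a value the user
    # exported beforehand (or loaded from a .env file) is left untouched.
    return env.setdefault("SUPERPOWERS_ROOT", str(project_root.parent))


if __name__ == "__main__":
    # Hypothetical checkout location, used only for demonstration.
    evals_dir = Path("/work/superpowers/evals")

    unset_env: dict[str, str] = {}
    set_superpowers_root_default_sketch(evals_dir, unset_env)
    print("default applied:", unset_env["SUPERPOWERS_ROOT"])

    preset_env = {"SUPERPOWERS_ROOT": "/custom/root"}
    set_superpowers_root_default_sketch(evals_dir, preset_env)
    print("override kept:", preset_env["SUPERPOWERS_ROOT"])
```

The two demo calls mirror the two tests the commit message lists: the default is set when the variable is unset, and an existing value is not overridden.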