Compare commits

..

20 Commits

Author SHA1 Message Date
Drew Ritter
bad4708a7b evals: use pre-commit hooks 2026-05-06 15:41:52 -07:00
Drew Ritter
ec9b96a7bf evals: add Gemini 2.5 Flash backend 2026-05-06 15:09:59 -07:00
Drew Ritter
2d4cdea2bb evals: drop drill source marker 2026-05-06 14:55:14 -07:00
Drew Ritter
af465f9687 evals: remove unreleased wave scenarios 2026-05-06 14:43:08 -07:00
Jesse Vincent
e4191c3609 Address adversarial review findings
- evals/README.md, evals/CLAUDE.md: fix uv install command from
  'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml
  uses [project.optional-dependencies], so --dev is a no-op for
  pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh
  from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh
  section with test-worktree-native-preference.sh (the worktree test
  is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness
  list. evals/backends/ has claude*, codex, gemini configs but no
  copilot.yaml, so the claim was unsupported.
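The `--dev` vs `--extra dev` distinction hinges on where the dev tools are declared. A minimal sketch of the assumed pyproject.toml shape (the actual file is not reproduced in this diff):

```toml
# Hypothetical fragment: uv's --dev flag targets [dependency-groups].dev,
# which this layout does not define; the dev tools live in an extra instead.
[project.optional-dependencies]
dev = ["pytest", "ruff", "ty"]
```

With this layout, `uv sync --extra dev` installs pytest/ruff/ty, while `uv sync --dev` installs nothing extra.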

Adversarial review credit: reviewer #2 found four legitimate issues
(uv-sync, run-skill-tests stale ref, README stale ref via #1, and
Copilot CLI fabrication); reviewer #1 found two distinct issues
(run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins
this round.
2026-05-06 12:41:28 -07:00
Jesse Vincent
d545612825 docs: introduce evals/ as the canonical skill-behavior eval harness
- docs/testing.md split into Plugin tests + Skill behavior evals.
  Plugin tests section enumerates the bash tests that survive
  (kept by drill-coverage analysis or as describe-skill tests).
- CLAUDE.md adds Eval harness section pointing at evals/.
- README.md Contributing section mentions evals/ alongside tests/.
- .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders
  (evals/.gitignore covers these locally; root-level entries help
  tooling that does not recurse into nested ignore files).
2026-05-06 12:33:10 -07:00
Jesse Vincent
b43d14f87f docs: annotate dated artifacts referencing lifted bash tests
- RELEASE-NOTES.md: note that test-requesting-code-review.sh and
  test-document-review-system.sh were lifted into drill scenarios
  on 2026-05-06; references are preserved as dated artifacts.
- docs/superpowers/plans/2026-03-23-codex-app-compatibility.md:
  note that tests/skill-triggering/ was lifted into drill scenarios
  on 2026-05-06; the run-all.sh reference is a dated artifact.

Subagent second-pass scrub confirmed no other active references in
the tree (excluding evals/ and the spec/plan for this work itself).
2026-05-06 12:32:00 -07:00
Jesse Vincent
11d5db1b22 tests: annotate three kept bash tests with drill coverage notes
- test-worktree-native-preference.sh: drill covers PRESSURE phase only;
  RED + GREEN baselines have no drill counterpart and are kept so
  the RED-GREEN-REFACTOR validation remains rerunnable end-to-end.
- test-subagent-driven-development-integration.sh: drill covers the
  YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds
  >=3 commits, >=2 subagent dispatches, TodoWrite usage, test file
  existence check, and token-budget telemetry. Kept until drill
  scenario covers those or they are retired.
- test-subagent-driven-development.sh: tests agent's ability to
  *describe* SDD (string matches against expected keywords). Drill
  scenarios test behavior, not description-recall. Kept by design.

Subagent verification recorded in commit messages of subsequent
deletions; gap analyses driving these annotations are also in the
verification subagent reports for the gating sweep.
2026-05-06 12:29:59 -07:00
Jesse Vincent
051bff661b tests: remove test-requesting-code-review.sh (covered by drill code-review-catches-planted-bugs)
Subagent verification: every bash assertion (skill invocation,
subagent dispatch, SQL injection flagged, credential handling
flagged, no merge approval) maps to drill verify checks. Drill is
stricter: it bundles severity (Critical/Important) into the same
criteria as the finding itself, where the bash test split severity
into a separate check. Setup parity is covered (src/db.js with
string concat + identity
hash, two commits).

The drill scenario header explicitly says it is the
"cross-harness, semantically-judged replacement for the bash test."
2026-05-06 12:28:40 -07:00
Jesse Vincent
dc6255291b tests: remove test-document-review-system.sh (covered by drill spec-reviewer-catches-planted-flaws)
Subagent verification: every bash assertion (TODO in Requirements
section flagged, "specified later" deferral flagged, Issues section
present, did-not-approve verdict) maps to drill verify.criteria
entries. Setup parity covered by setup.assertions (test-feature-design.md
exists with TODO + 'specified later' content). Drill is stricter:
it asserts tool-called Agent (subagent dispatch), which the bash
test did not check.
2026-05-06 12:28:40 -07:00
Jesse Vincent
d337f4a18a tests: remove subagent-driven-dev fixtures (covered by drill sdd-go-fractals + sdd-svelte-todo)
The bash test had ZERO output assertions — it just ran claude -p
and printed token usage. Drill's scenarios are strictly more
rigorous:

go-fractals: skill-called SDD + tool-called Agent + go test ./...
passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria
verifying real SDD workflow.

svelte-todo: skill-called SDD + tool-called Agent + npm test passes
+ playwright e2e passes + package.json + svelte.config.js or
vite.config.ts + >=4 commits + LLM criteria.

design.md and plan.md are byte-identical between bash fixtures and
drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/).
Drill's setup helper (scaffold_sdd_*) forces git init -b main
(stricter than bash's reliance on init.defaultBranch). The
.claude/settings.local.json from bash scaffold.sh is unnecessary
for drill since permissions are managed via backend YAML.

Subagent verification: SAFE TO DELETE for both.
2026-05-06 12:27:31 -07:00
Jesse Vincent
6fe9cf7515 tests: remove run-claude-describes-sdd.sh (covered by drill mid-conversation-skill-invocation)
Subagent verification: every bash assertion (Skill tool invoked +
specific skill name 'subagent-driven-development' loaded after the
agent describes it conversationally in turn 1) maps to the drill
scenario's skill-called assertion + criteria paragraph requiring
the skill to fire in direct response to the second user message.
Drill additionally asserts tool-called Agent (subagent dispatch)
which is stricter than the bash test.

Other runners in tests/explicit-skill-requests/ (haiku, multiturn,
extended-multiturn) and their prompt files are preserved — they
have no drill coverage and exercise different behaviors.
2026-05-06 12:25:46 -07:00
Jesse Vincent
3177c87aa8 tests: remove skill-triggering bash prompts (covered by drill triggering-* scenarios)
Subagent verification confirmed each prompt's intent matches its
corresponding drill scenario's turns[].intent verbatim, and each
scenario has both a deterministic skill-called assertion and a
semantic LLM criterion confirming the matching skill was loaded
(actually a stronger check than the bash test, which only confirms
the skill fires anywhere in the stream).

All 6 prompts deleted. The runner had no remaining prompts to drive,
so run-test.sh and run-all.sh deleted as well.
2026-05-06 12:24:53 -07:00
Jesse Vincent
a94d2cc414 evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE
The cli.py helper now defaults the env var. Mention as override only.
2026-05-06 12:21:35 -07:00
Jesse Vincent
dcffaa087a evals: drop SUPERPOWERS_ROOT from codex/gemini required_env
These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's
os.environ access, which the new cli.py default helper supplies
automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env
because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.
2026-05-06 12:20:47 -07:00
Jesse Vincent
b3817bba4f evals: default SUPERPOWERS_ROOT to parent of evals/ if unset
Adds _set_superpowers_root_default() to drill/cli.py, called at
module import after load_dotenv(). PROJECT_ROOT resolves to evals/
post-lift; its parent is the superpowers repo root, which is the
correct value for SUPERPOWERS_ROOT.

Existing env values are respected as overrides via os.environ.setdefault.

Tests:
- helper sets default when var is unset
- helper does not override when var is already set
2026-05-06 12:19:39 -07:00
Jesse Vincent
3c046f579e Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding
.git/, .venv/, results/, .env/, __pycache__/, *.egg-info/,
.private-journal/.

The drill repo is unaffected by this commit; archival is a separate
manual step after this PR merges.

Source SHA recorded at evals/.drill-source-sha for divergence
detection.
2026-05-06 12:15:46 -07:00
Jesse Vincent
895bb732d5 Plan: lift drill into superpowers as evals/
15-task implementation plan derived from the design spec at
docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md.

Each task is bite-sized (2-5 min steps) with exact commands, exact
file paths, and exact code where required. Subagent verification
gates per the spec are written out as concrete prompt templates.

Self-review:
- Spec coverage: every spec section maps to a task
- Placeholder scan: no TBD/TODO/placeholder/fill-in-later language
- Type consistency: helper named _set_superpowers_root_default
  consistently; drill SHA recorded in evals/.drill-source-sha
  consistently
2026-05-06 12:08:58 -07:00
Jesse Vincent
cf5914a31f Spec: address adversarial review findings
Two parallel reviewers raised legitimate issues against the lift-drill-
into-evals spec. Updates:

- Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh
  scripts + prompts, not "2 scenarios cover all". Several scripts
  (Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging)
  have no drill counterpart and stay.
- tests/claude-code/test-subagent-driven-development.sh marked as
  meta/documentation test (asks agent to describe SDD); no drill
  scenario covers description tests; defaults to keep.
- Path-defaults section now shows verified evidence: PROJECT_ROOT
  resolves to evals/ post-move; only claude*.yaml substitute
  ${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ
  in pre-run hooks); helper invocation order specified (after
  load_dotenv, before click definitions).
- Step 2 copy uses explicit rsync excludes (.git, .venv, results,
  .env, __pycache__, *.egg-info, .private-journal); checksum-level
  verification rather than file-count.
- Drill SHA recorded at copy time in commit message and
  evals/.drill-source-sha for divergence detection.
- evals/tests/ pytest suite added to verification protocol.
- Reference scrub list expanded: RELEASE-NOTES.md,
  docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/),
  lefthook.yml. Excluded dirs called out (node_modules/, .venv/,
  evals/).
- Historical plan docs / RELEASE-NOTES handling: annotate, don't
  rewrite.
- evals/lefthook.yml move documented (drill ships its own;
  contributors run cd evals && lefthook run pre-commit manually).
- PR description checklist includes archival action item for
  obra/drill post-merge.

False finding rejected: svelte-todo fixture is complete on disk
(design.md + plan.md + scaffold.sh present); reviewer #1's finding #3 dropped.
2026-05-06 12:03:24 -07:00
Jesse Vincent
cf34cef01e Spec: lift drill into superpowers as evals/
Records scope, branching, architecture, deletion gate, verification
protocol, path/config edits, migration ordering, and post-implementation
verification. Frames CI integration, scenario co-location, and Python
package rename as deferred work.

Per-file deletion of bash tests under superpowers/tests/ is gated by a
subagent that compares each bash assertion to its drill scenario's
verify block. Default keeps the bash test if any assertion is unmatched.

Branching: independent off dev (f/evals-lift), not stacked on
f/cross-platform.
2026-05-06 11:54:12 -07:00
14 changed files with 4 additions and 671 deletions

View File

@@ -1,121 +0,0 @@
import { readFileSync } from "node:fs";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import type { ExtensionAPI } from "@earendil-works/pi-coding-agent";
const EXTREMELY_IMPORTANT_MARKER = "<EXTREMELY_IMPORTANT>";
const BOOTSTRAP_MARKER = "superpowers:using-superpowers bootstrap for pi";
const extensionDir = dirname(fileURLToPath(import.meta.url));
const packageRoot = resolve(extensionDir, "../..");
const skillsDir = resolve(packageRoot, "skills");
const bootstrapSkillPath = resolve(skillsDir, "using-superpowers", "SKILL.md");
let cachedBootstrap: string | null | undefined;
export default function superpowersPiExtension(pi: ExtensionAPI) {
let injectBootstrap = true;
pi.on("resources_discover", async () => ({
skillPaths: [skillsDir],
}));
pi.on("session_start", async () => {
injectBootstrap = true;
});
pi.on("session_compact", async () => {
injectBootstrap = true;
});
pi.on("agent_end", async () => {
injectBootstrap = false;
});
pi.on("context", async (event) => {
if (!injectBootstrap) return;
if (event.messages.some(messageContainsBootstrap)) return;
const bootstrap = getBootstrapContent();
if (!bootstrap) return;
const bootstrapMessage = {
role: "user" as const,
content: [{ type: "text" as const, text: bootstrap }],
timestamp: Date.now(),
};
const insertAt = firstNonCompactionSummaryIndex(event.messages);
return {
messages: [
...event.messages.slice(0, insertAt),
bootstrapMessage,
...event.messages.slice(insertAt),
],
};
});
}
function getBootstrapContent(): string | null {
if (cachedBootstrap !== undefined) return cachedBootstrap;
try {
const skillContent = readFileSync(bootstrapSkillPath, "utf8");
const body = stripFrontmatter(skillContent);
cachedBootstrap = `${EXTREMELY_IMPORTANT_MARKER}
${BOOTSTRAP_MARKER}
You have superpowers.
The using-superpowers skill content is included below and is already loaded for this Pi session. Follow it now. Do not try to load using-superpowers again.
${body}
${piToolMapping()}
</EXTREMELY_IMPORTANT>`;
return cachedBootstrap;
} catch {
cachedBootstrap = null;
return null;
}
}
function stripFrontmatter(content: string): string {
const match = content.match(/^---\n[\s\S]*?\n---\n([\s\S]*)$/);
return (match ? match[1] : content).trim();
}
function piToolMapping(): string {
return `## Pi tool mapping
Pi has native skills but does not expose Claude Code's \`Skill\` tool. When a Superpowers instruction says to use the \`Skill\` tool, use Pi's native skill system instead: load the relevant \`SKILL.md\` with \`read\` when the skill applies, or let a human invoke \`/skill:name\` explicitly.
Pi's built-in coding tools are lowercase: \`read\`, \`write\`, \`edit\`, \`bash\`, plus optional \`grep\`, \`find\`, and \`ls\`. Map Claude-style tool names \`Read\`, \`Write\`, \`Edit\`, and \`Bash\` to those Pi tools.
Pi does not ship a standard \`Task\` subagent tool. If a subagent tool such as \`subagent\` from \`pi-subagents\` is available, use it for Superpowers subagent workflows. If no subagent tool is available, do the work in this session or explain the missing capability instead of inventing tool calls.
Pi does not ship a standard \`TodoWrite\` task-list tool. If an installed todo/task tool is available, use it. Otherwise track work in plan files or a repo-local \`TODO.md\` when task tracking is needed.`;
}
function messageContainsBootstrap(message: unknown): boolean {
const content = (message as { content?: unknown }).content;
if (typeof content === "string") return content.includes(BOOTSTRAP_MARKER);
if (!Array.isArray(content)) return false;
return content.some((part) => {
return (
part &&
typeof part === "object" &&
(part as { type?: unknown }).type === "text" &&
typeof (part as { text?: unknown }).text === "string" &&
(part as { text: string }).text.includes(BOOTSTRAP_MARKER)
);
});
}
function firstNonCompactionSummaryIndex(messages: unknown[]): number {
let index = 0;
while ((messages[index] as { role?: unknown } | undefined)?.role === "compactionSummary") {
index += 1;
}
return index;
}

View File

@@ -4,7 +4,7 @@ Superpowers is a complete software development methodology for your coding agent
## Quickstart
Give your agent Superpowers: [Claude Code](#claude-code), [Codex CLI](#codex-cli), [Codex App](#codex-app), [Factory Droid](#factory-droid), [Gemini CLI](#gemini-cli), [Pi](#pi), [OpenCode](#opencode), [Cursor](#cursor), [GitHub Copilot CLI](#github-copilot-cli).
Give your agent Superpowers: [Claude Code](#claude-code), [Codex CLI](#codex-cli), [Codex App](#codex-app), [Factory Droid](#factory-droid), [Gemini CLI](#gemini-cli), [OpenCode](#opencode), [Cursor](#cursor), [GitHub Copilot CLI](#github-copilot-cli).
## How it works
@@ -114,22 +114,6 @@ Superpowers is available via the [official Codex plugin marketplace](https://git
gemini extensions update superpowers
```
### Pi
Install Superpowers as a Pi package from this repository:
```bash
pi install git:github.com/obra/superpowers
```
For local development, run Pi with this checkout loaded as a temporary package:
```bash
pi -e /path/to/superpowers
```
The Pi package loads the Superpowers skills and a small extension that injects the `using-superpowers` bootstrap at session startup and again after compaction. Pi has native skills, so no compatibility `Skill` tool is required. Subagent and task-list tools remain optional Pi companion packages.
### OpenCode
OpenCode uses its own plugin install; install Superpowers separately even if you

View File

@@ -1,143 +0,0 @@
# Pi Extension and Evals Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add first-class Pi package support for Superpowers and add Pi as a Drill eval backend.
**Architecture:** The Pi package is declared in the root `package.json` and loads existing `skills/` plus a small Pi extension. The extension injects the `using-superpowers` bootstrap into provider context as a user-role message on session startup and after compaction, with Pi-specific tool mapping. Drill gains a `pi` backend, Pi session-log normalization, and tests.
**Tech Stack:** Pi TypeScript extension API, Node built-in test runner, Drill Python eval harness, pytest.
---
### Task 1: Pi package manifest and extension tests
**Files:**
- Modify: `package.json`
- Create: `tests/pi/test-pi-extension.mjs`
- [ ] **Step 1: Write failing package/extension tests**
Create `tests/pi/test-pi-extension.mjs` with tests that import `extensions/superpowers.ts`, register fake Pi handlers, and assert:
- root `package.json` has `keywords` containing `pi-package`
- root `package.json` has `pi.skills: ["./skills"]`
- root `package.json` has `pi.extensions: ["./extensions/superpowers.ts"]`
- the extension registers `resources_discover`, `session_start`, `session_compact`, `context`, and `agent_end`
- startup `context` injects exactly one user-role bootstrap message
- `agent_end` clears startup injection
- `session_compact` re-enables injection
- the extension does not register `session_before_compact`
- [ ] **Step 2: Run tests and verify RED**
Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
Expected: FAIL because `extensions/superpowers.ts` does not exist and `package.json` lacks the `pi` manifest.
- [ ] **Step 3: Implement manifest fields**
Update `package.json` with `description`, `keywords`, `pi.extensions`, and `pi.skills` while preserving existing `name`, `version`, `type`, and `main`.
- [ ] **Step 4: Implement `extensions/superpowers.ts`**
Create a zero-runtime-dependency extension that:
- locates the package root from `import.meta.url`
- reads `skills/using-superpowers/SKILL.md`
- strips YAML frontmatter
- appends Pi-specific tool mapping
- exposes `resources_discover` with the skills path
- marks bootstrap pending on `session_start` and `session_compact`
- injects a user-role bootstrap message in `context`
- inserts post-compact bootstrap after leading `compactionSummary` messages
- clears pending bootstrap on `agent_end`
- [ ] **Step 5: Run tests and verify GREEN**
Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
Expected: PASS.
### Task 2: Pi tool mapping reference
**Files:**
- Create: `skills/using-superpowers/references/pi-tools.md`
- Modify: `tests/pi/test-pi-extension.mjs`
- [ ] **Step 1: Write failing test for Pi reference doc**
Add assertions that `skills/using-superpowers/references/pi-tools.md` exists and documents mappings for `Skill`, `Task`, `TodoWrite`, and built-in tool names.
- [ ] **Step 2: Run tests and verify RED**
Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
Expected: FAIL because `pi-tools.md` does not exist.
- [ ] **Step 3: Add Pi reference doc**
Create `skills/using-superpowers/references/pi-tools.md` explaining Pi-native skills, optional `pi-subagents`, no canonical todo/tasklist plugin, and built-in lowercase tools.
- [ ] **Step 4: Run tests and verify GREEN**
Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
Expected: PASS.
### Task 3: Drill Pi backend and session log normalization
**Files:**
- Create: `evals/backends/pi.yaml`
- Modify: `evals/drill/backend.py`
- Modify: `evals/drill/engine.py`
- Modify: `evals/drill/normalizer.py`
- Modify: `evals/tests/test_backend.py`
- Modify: `evals/tests/test_normalizer.py`
- [ ] **Step 1: Write failing backend/normalizer tests**
Add pytest coverage for:
- `load_backend("pi")` returns `family == "pi"`
- Pi backend command starts with `pi` and includes `-e ${SUPERPOWERS_ROOT}`
- `_resolve_log_dir()` for Pi points under `~/.pi/agent/sessions`
- `filter_pi_logs_by_cwd()` keeps only session files whose header `cwd` matches the scenario workdir
- `normalize_pi_logs()` extracts `toolCall` blocks from Pi assistant session entries and maps built-in lowercase tools to canonical names
- [ ] **Step 2: Run tests and verify RED**
Run: `uv run pytest evals/tests/test_backend.py evals/tests/test_normalizer.py -q`
Expected: FAIL because the Pi backend and normalizer do not exist.
- [ ] **Step 3: Add `evals/backends/pi.yaml`**
Configure the backend to run `pi -e ${SUPERPOWERS_ROOT}`, use permissive TUI readiness, `/quit` shutdown, and Pi session log location.
- [ ] **Step 4: Implement Pi family support**
Update `Backend.family`, `Engine._resolve_log_dir`, `Engine._collect_tool_calls`, and `normalizer.py` with Pi log filtering and normalizing.
- [ ] **Step 5: Run tests and verify GREEN**
Run: `uv run pytest evals/tests/test_backend.py evals/tests/test_normalizer.py -q`
Expected: PASS.
### Task 4: Documentation and full verification
**Files:**
- Modify: `README.md`
- Modify: `evals/README.md`
- [ ] **Step 1: Document Pi install and eval backend**
Add Pi to README quickstart/install list and add backend entry/usage to `evals/README.md`.
- [ ] **Step 2: Run verification**
Run:
```bash
node --experimental-strip-types --test tests/pi/test-pi-extension.mjs
uv run pytest evals/tests/test_backend.py evals/tests/test_setup.py evals/tests/test_normalizer.py -q
```
Expected: all tests pass.

View File

@@ -43,9 +43,6 @@ uv run drill run spec-writing-blind-spot -b claude-opus-4-6 --n 5
# Sweep across multiple backends
uv run drill run spec-writing-blind-spot --models claude-opus-4-6,claude-opus-4-7 --n 10
# Run against Pi, loading the local Superpowers package via -e ${SUPERPOWERS_ROOT}
uv run drill run triggering-writing-plans -b pi
# Compare results
uv run drill compare spec-writing-blind-spot
@@ -75,7 +72,6 @@ uv run drill list
| `codex` | Codex CLI | — |
| `gemini` | Gemini CLI | auto-gemini-3 |
| `gemini-2-5-flash` | Gemini CLI | gemini-2.5-flash |
| `pi` | Pi coding agent | configured Pi default |
## Project structure

View File

@@ -1,23 +0,0 @@
name: pi
cli: pi
args:
- "-e"
- "${SUPERPOWERS_ROOT}"
required_env:
- SUPERPOWERS_ROOT
hooks:
pre_run: []
post_run: []
shutdown: "/quit"
idle:
quiescence_seconds: 5
ready_pattern: "."
busy_pattern: "esc to cancel|Thinking\\.\\.\\.|\\(esc to cancel[^)]*\\)|[⠇⠏⠋⠙⠹⠸⠼⠴⠦⠧⠶⠾⠽⠻⠿]"
max_busy_seconds: 1800
startup_timeout: 60
turn_timeout: 300
terminal:
cols: 200
rows: 50
session_logs:
pattern: "~/.pi/agent/sessions/**/*.jsonl"

View File

@@ -71,7 +71,7 @@ class Backend:
@property
def family(self) -> str:
"""Normalize backend name to a family for log-dir / normalizer dispatch."""
for fam in ("claude", "codex", "gemini", "pi"):
for fam in ("claude", "codex", "gemini"):
if self.name == fam or self.name.startswith(f"{fam}-"):
return fam
return "other"

View File

@@ -21,7 +21,6 @@ from drill.normalizer import (
NORMALIZERS,
collect_new_logs,
filter_codex_logs_by_cwd,
filter_pi_logs_by_cwd,
snapshot_log_dir,
)
from drill.session import TmuxSession
@@ -349,11 +348,6 @@ class Engine:
# Project name is the workdir basename, lowercased
project = workdir.resolve().name.lower()
return Path.home() / ".gemini" / "tmp" / project
elif self.backend.family == "pi":
# Pi stores sessions under ~/.pi/agent/sessions/<encoded-cwd>/.
# Return the root and filter by the session header cwd because
# multiple evals may run concurrently under the same tree.
return Path.home() / ".pi" / "agent" / "sessions"
pattern = self.backend.session_logs.get("pattern", "")
if not pattern:
return None
@@ -369,8 +363,6 @@ class Engine:
new_files = collect_new_logs(log_dir, snapshot)
if self.backend.family == "codex":
new_files = filter_codex_logs_by_cwd(new_files, str(workdir.resolve()))
elif self.backend.family == "pi":
new_files = filter_pi_logs_by_cwd(new_files, str(workdir.resolve()))
normalizer = NORMALIZERS.get(self.backend.family)
if not normalizer:
return []

View File

@@ -74,23 +74,6 @@ def filter_codex_logs_by_cwd(paths: list[Path], target_cwd: str) -> list[Path]:
return matched
def filter_pi_logs_by_cwd(paths: list[Path], target_cwd: str) -> list[Path]:
"""Drop Pi sessions whose header cwd doesn't match target_cwd."""
matched: list[Path] = []
for path in paths:
try:
with path.open() as f:
first_line = f.readline()
entry = json.loads(first_line)
except (OSError, json.JSONDecodeError):
continue
if entry.get("type") != "session":
continue
if entry.get("cwd") == target_cwd:
matched.append(path)
return matched
def normalize_claude_logs(raw_content: str) -> list[dict[str, Any]]:
"""Normalize Claude Code session logs.
@@ -172,52 +155,6 @@ def normalize_codex_logs(raw_content: str) -> list[dict[str, Any]]:
return results
# Reverse mapping: Pi tool names → Claude Code canonical names
PI_TOOL_MAP: dict[str, str] = {
"read": "Read",
"write": "Write",
"edit": "Edit",
"bash": "Bash",
"grep": "Grep",
"find": "Glob",
"ls": "Glob",
}
PI_NATIVE_TOOLS = (set(PI_TOOL_MAP.values()) - {"Bash"}) | {"subagent", "todo", "manage_todo_list"}
def normalize_pi_logs(raw_content: str) -> list[dict[str, Any]]:
"""Normalize Pi JSONL session logs.
Pi session files are JSONL entries. Assistant messages contain tool calls as
content blocks: {"type": "toolCall", "name": "read", "arguments": {...}}.
"""
results: list[dict[str, Any]] = []
for line in raw_content.strip().split("\n"):
if not line.strip():
continue
try:
entry = json.loads(line)
except json.JSONDecodeError:
continue
if entry.get("type") != "message":
continue
message = entry.get("message", {})
if message.get("role") != "assistant":
continue
for block in message.get("content", []):
if block.get("type") != "toolCall":
continue
name = block.get("name", "")
canonical = PI_TOOL_MAP.get(name, name)
source = "native" if canonical in PI_NATIVE_TOOLS else "shell"
results.append(
{"tool": canonical, "args": block.get("arguments", {}), "source": source}
)
return results
# Reverse mapping: Gemini tool names → Claude Code canonical names
GEMINI_TOOL_MAP: dict[str, str] = {
"run_shell_command": "Bash",
@@ -288,5 +225,4 @@ NORMALIZERS: dict[str, Callable[[str], list[dict[str, Any]]]] = {
"claude": normalize_claude_logs,
"codex": normalize_codex_logs,
"gemini": normalize_gemini_logs,
"pi": normalize_pi_logs,
}

View File

@@ -44,12 +44,6 @@ class TestLoadBackend:
assert flash_backend.family == "gemini"
assert flash_backend.model == "gemini-2.5-flash"
def test_loads_pi_backend(self, backends_dir):
backend = load_backend("pi", backends_dir)
assert backend.name == "pi"
assert backend.cli == "pi"
assert backend.family == "pi"
class TestBackendBuildCommand:
def test_claude_build_command(self, backends_dir, monkeypatch):
@@ -66,12 +60,6 @@ class TestBackendBuildCommand:
cmd = backend.build_command("/tmp/workdir")
assert cmd[0] == "codex"
def test_pi_build_command_loads_local_superpowers_package(self, backends_dir, monkeypatch):
monkeypatch.setenv("SUPERPOWERS_ROOT", "/tmp/superpowers")
backend = load_backend("pi", backends_dir)
cmd = backend.build_command("/tmp/workdir")
assert cmd == ["pi", "-e", "/tmp/superpowers"]
class TestBackendEnvValidation:
def test_missing_env_raises(self, backends_dir, monkeypatch):
@@ -137,21 +125,6 @@ class TestBackendFamily:
backend = load_backend("codex", backends_dir)
assert backend.family == "codex"
def test_pi_backend_family(self):
backend = Backend(
name="pi",
cli="pi",
args=[],
required_env=[],
hooks={"pre_run": [], "post_run": []},
shutdown="/quit",
idle={},
startup_timeout=30,
terminal={},
session_logs={},
)
assert backend.family == "pi"
def test_variant_name_preserves_family(self):
backend = Backend(
name="claude-opus-4-6",

View File

@@ -4,7 +4,7 @@ import json
import subprocess
from pathlib import Path
from drill.engine import Engine, RunResult, ScenarioConfig, VerifyConfig, snapshot_filesystem
from drill.engine import RunResult, ScenarioConfig, VerifyConfig, snapshot_filesystem
class TestVerifyConfig:
@@ -138,40 +138,6 @@ class TestEngineAssertionIntegration:
assert (tmp_path / "meta.json").exists()
class TestEnginePiBackend:
def test_resolves_pi_session_log_root(self, tmp_path: Path) -> None:
scenario = tmp_path / "scenario.yaml"
scenario.write_text("scenario: test-pi\n")
backends = tmp_path / "backends"
backends.mkdir()
(backends / "pi.yaml").write_text(
"""
name: pi
cli: pi
args: []
required_env: []
hooks:
pre_run: []
post_run: []
shutdown: /quit
idle: {}
startup_timeout: 1
terminal: {}
session_logs:
pattern: ~/.pi/agent/sessions/**/*.jsonl
"""
)
engine = Engine(
scenario_path=scenario,
backend_name="pi",
backends_dir=backends,
fixtures_dir=tmp_path,
results_dir=tmp_path,
)
assert engine._resolve_log_dir(tmp_path) == Path.home() / ".pi" / "agent" / "sessions"
class TestEngineRunParams:
def test_run_result_uses_custom_output_dir(self, tmp_path: Path) -> None:
custom_dir = tmp_path / "custom" / "run-00"

View File

@@ -3,11 +3,9 @@ import json
from drill.normalizer import (
collect_new_logs,
filter_codex_logs_by_cwd,
filter_pi_logs_by_cwd,
normalize_claude_logs,
normalize_codex_logs,
normalize_gemini_logs,
normalize_pi_logs,
snapshot_log_dir,
)
@@ -139,56 +137,6 @@ class TestNormalizeCodexLogs:
        assert normalized[1]["source"] == "native"


class TestNormalizePiLogs:
    def test_filter_by_cwd_keeps_matching_session_headers(self, tmp_path):
        target = "/tmp/drill-target"
        match = tmp_path / "match.jsonl"
        match.write_text(json.dumps({"type": "session", "cwd": target}) + "\n")
        other = tmp_path / "other.jsonl"
        other.write_text(json.dumps({"type": "session", "cwd": "/tmp/other"}) + "\n")
        malformed = tmp_path / "malformed.jsonl"
        malformed.write_text("not json\n")

        assert filter_pi_logs_by_cwd([match, other, malformed], target) == [match]

    def test_normalizes_assistant_tool_calls_from_session_entries(self):
        lines = [
            json.dumps({"type": "session", "cwd": "/tmp/project"}),
            json.dumps(
                {
                    "type": "message",
                    "message": {
                        "role": "assistant",
                        "content": [
                            {"type": "text", "text": "I will inspect this."},
                            {
                                "type": "toolCall",
                                "name": "read",
                                "arguments": {"path": "README.md"},
                            },
                            {
                                "type": "toolCall",
                                "name": "bash",
                                "arguments": {"command": "git status"},
                            },
                            {
                                "type": "toolCall",
                                "name": "subagent",
                                "arguments": {"agent": "reviewer"},
                            },
                        ],
                    },
                }
            ),
        ]

        assert normalize_pi_logs("\n".join(lines)) == [
            {"tool": "Read", "args": {"path": "README.md"}, "source": "native"},
            {"tool": "Bash", "args": {"command": "git status"}, "source": "shell"},
            {"tool": "subagent", "args": {"agent": "reviewer"}, "source": "native"},
        ]
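The two tests above pin down the removed Pi normalizer's contract: filtering keeps only logs whose session header records the target cwd (skipping malformed files), and normalization flattens assistant `toolCall` entries into drill's `{tool, args, source}` records. A minimal sketch under those assumptions — the mapping tables and exact helper bodies here are illustrative, not drill's actual `drill.normalizer` internals:

```python
import json
from pathlib import Path

# Assumed name mappings for illustration; the real tables may differ.
SHELL_TOOLS = {"bash": "Bash"}
NATIVE_TOOLS = {"read": "Read", "write": "Write", "edit": "Edit", "grep": "Grep"}


def filter_pi_logs_by_cwd(paths: list[Path], target: str) -> list[Path]:
    """Keep logs whose first line is a session header recording the target cwd."""
    kept = []
    for path in paths:
        head = path.read_text().splitlines()[:1]
        try:
            header = json.loads(head[0]) if head else {}
        except json.JSONDecodeError:
            continue  # malformed logs are silently skipped
        if header.get("type") == "session" and header.get("cwd") == target:
            kept.append(path)
    return kept


def normalize_pi_logs(raw: str) -> list[dict]:
    """Flatten assistant toolCall entries into normalized call records."""
    calls = []
    for line in raw.splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        message = entry.get("message", {})
        if entry.get("type") != "message" or message.get("role") != "assistant":
            continue
        for part in message.get("content", []):
            if part.get("type") != "toolCall":
                continue
            name = part.get("name", "")
            source = "shell" if name in SHELL_TOOLS else "native"
            tool = SHELL_TOOLS.get(name) or NATIVE_TOOLS.get(name, name)
            calls.append({"tool": tool, "args": part.get("arguments", {}), "source": source})
    return calls
```

Unknown tool names (like `subagent` above) fall through the mapping unchanged and stay `"native"`, which is what the deleted test asserted.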


class TestNormalizeGeminiLogs:
    def test_normalizes_jsonl_tool_calls(self):
        lines = [


@@ -1,23 +1,6 @@
{
  "name": "superpowers",
  "version": "5.1.0",
  "description": "Superpowers skills and runtime bootstrap for coding agents",
  "type": "module",
  "main": ".opencode/plugins/superpowers.js",
  "keywords": [
    "pi-package",
    "skills",
    "tdd",
    "debugging",
    "collaboration",
    "workflow"
  ],
  "pi": {
    "extensions": [
      "./.pi/extensions/superpowers.ts"
    ],
    "skills": [
      "./skills"
    ]
  }
  "main": ".opencode/plugins/superpowers.js"
}


@@ -1,30 +0,0 @@
# Pi Tool Mapping
Pi supports Superpowers skills natively through skill discovery and `/skill:name` commands. It does not expose Claude Code's `Skill` tool.
When a Superpowers skill mentions Claude Code tool names, use these Pi equivalents:
| Superpowers / Claude Code name | Pi equivalent |
| --- | --- |
| `Skill` | Pi native skills: load the relevant `SKILL.md` with `read`, or let the human use `/skill:name` |
| `Read` | `read` |
| `Write` | `write` |
| `Edit` | `edit` |
| `Bash` | `bash` |
| `Grep` | `grep` when active; otherwise `bash` with `rg`/`grep` |
| `Glob` | `find` or `bash` with shell globs |
| `LS` / `List` | `ls` when active; otherwise `bash` with `ls` |
| `Task` | Use an installed subagent tool such as `subagent` from `pi-subagents` if available |
| `TodoWrite` | Use an installed todo/task tool if available, otherwise track tasks in the plan or `TODO.md` |
## Skills
Pi discovers skills from configured skill directories and installed Pi packages. A Superpowers Pi package should expose `skills/` through its `pi.skills` manifest entry. The agent should still follow the Superpowers rule: when a skill applies, load and follow it before responding.
## Subagents
Pi core does not ship a standard subagent tool. The `pi-subagents` package is a strong optional companion and provides a `subagent` tool with single-agent, chain, parallel, async, forked-context, and resume/status workflows. If no subagent tool is available, do not fabricate `Task` calls; execute sequentially in the current session or explain that the optional subagent capability is not installed.
## Task lists
Pi core does not ship a standard task-list tool. If a todo/task extension is installed, use its documented tool. Otherwise use Superpowers plan files, checklists in Markdown, or a repo-local `TODO.md` for task tracking.


@@ -1,128 +0,0 @@
import assert from 'node:assert/strict';
import { readFile } from 'node:fs/promises';
import { existsSync } from 'node:fs';
import { dirname, resolve } from 'node:path';
import { fileURLToPath, pathToFileURL } from 'node:url';
import test from 'node:test';

const __dirname = dirname(fileURLToPath(import.meta.url));
const repoRoot = resolve(__dirname, '../..');
const packageJsonPath = resolve(repoRoot, 'package.json');
const extensionPath = resolve(repoRoot, '.pi/extensions/superpowers.ts');
const piToolsPath = resolve(repoRoot, 'skills/using-superpowers/references/pi-tools.md');

async function readPackageJson() {
  return JSON.parse(await readFile(packageJsonPath, 'utf8'));
}

async function loadExtension() {
  const handlers = new Map();
  const pi = {
    on(event, handler) {
      if (!handlers.has(event)) handlers.set(event, []);
      handlers.get(event).push(handler);
    },
  };
  const mod = await import(pathToFileURL(extensionPath).href + `?cachebust=${Date.now()}-${Math.random()}`);
  mod.default(pi);
  return { handlers };
}

function firstHandler(handlers, event) {
  const eventHandlers = handlers.get(event) ?? [];
  assert.equal(eventHandlers.length, 1, `expected one ${event} handler`);
  return eventHandlers[0];
}

function textOf(message) {
  if (typeof message.content === 'string') return message.content;
  return message.content
    .filter((part) => part.type === 'text')
    .map((part) => part.text)
    .join('\n');
}

test('package.json declares a pi package with skills and extension resources', async () => {
  const pkg = await readPackageJson();
  assert.equal(pkg.name, 'superpowers');
  assert.ok(pkg.keywords.includes('pi-package'));
  assert.deepEqual(pkg.pi.skills, ['./skills']);
  assert.deepEqual(pkg.pi.extensions, ['./.pi/extensions/superpowers.ts']);
});

test('extension registers lifecycle hooks without pre-compaction injection', async () => {
  const { handlers } = await loadExtension();
  for (const event of ['resources_discover', 'session_start', 'session_compact', 'context', 'agent_end']) {
    assert.equal((handlers.get(event) ?? []).length, 1, `missing ${event} handler`);
  }
  assert.equal((handlers.get('session_before_compact') ?? []).length, 0);
});

test('resources_discover contributes the bundled skills directory', async () => {
  const { handlers } = await loadExtension();
  const discover = firstHandler(handlers, 'resources_discover');
  const result = await discover({ type: 'resources_discover', cwd: repoRoot, reason: 'startup' }, {});
  assert.deepEqual(result.skillPaths, [resolve(repoRoot, 'skills')]);
});

test('startup context injects the bootstrap as one user message until agent_end', async () => {
  const { handlers } = await loadExtension();
  const sessionStart = firstHandler(handlers, 'session_start');
  const context = firstHandler(handlers, 'context');
  const agentEnd = firstHandler(handlers, 'agent_end');
  await sessionStart({ type: 'session_start', reason: 'startup' }, {});
  const originalMessages = [
    { role: 'user', content: [{ type: 'text', text: 'Let us make a react todo list' }], timestamp: 1 },
  ];
  const result = await context({ type: 'context', messages: originalMessages }, {});
  assert.equal(result.messages.length, 2);
  assert.equal(result.messages[0].role, 'user');
  assert.match(textOf(result.messages[0]), /You have superpowers/);
  assert.match(textOf(result.messages[0]), /Pi tool mapping/);
  assert.equal(result.messages[1], originalMessages[0]);
  const repeatedProviderRequest = await context({ type: 'context', messages: originalMessages }, {});
  assert.equal(repeatedProviderRequest.messages.length, 2);
  assert.match(textOf(repeatedProviderRequest.messages[0]), /You have superpowers/);
  const alreadyInjected = await context({ type: 'context', messages: result.messages }, {});
  assert.equal(alreadyInjected, undefined, 'bootstrap should not duplicate when already present');
  await agentEnd({ type: 'agent_end', messages: [] }, {});
  const afterEnd = await context({ type: 'context', messages: originalMessages }, {});
  assert.equal(afterEnd, undefined, 'startup bootstrap should clear after agent_end');
});

test('session_compact injects bootstrap after compaction summaries, not before compaction', async () => {
  const { handlers } = await loadExtension();
  const sessionCompact = firstHandler(handlers, 'session_compact');
  const context = firstHandler(handlers, 'context');
  await sessionCompact({ type: 'session_compact', compactionEntry: {}, fromExtension: false }, {});
  const summary = { role: 'compactionSummary', summary: 'Prior work summary', tokensBefore: 123, timestamp: 1 };
  const user = { role: 'user', content: [{ type: 'text', text: 'Continue' }], timestamp: 2 };
  const result = await context({ type: 'context', messages: [summary, user] }, {});
  assert.equal(result.messages.length, 3);
  assert.equal(result.messages[0], summary);
  assert.equal(result.messages[1].role, 'user');
  assert.match(textOf(result.messages[1]), /You have superpowers/);
  assert.equal(result.messages[2], user);
});

test('pi tools reference documents pi-specific mappings', async () => {
  assert.equal(existsSync(piToolsPath), true, 'pi-tools.md should exist');
  const text = await readFile(piToolsPath, 'utf8');
  for (const expected of ['Skill', 'Task', 'TodoWrite', 'read', 'write', 'edit', 'bash']) {
    assert.match(text, new RegExp(expected));
  }
});