evals: use pre-commit hooks

evals: add Gemini 2.5 Flash backend
evals: drop drill source marker
2026-05-10 19:19:03 +08:00 · 2026-05-06 15:41:52 -07:00 · 2026-05-06 15:09:59 -07:00 · 2026-05-06 14:55:14 -07:00 · 2026-05-06 14:43:08 -07:00 · 2026-05-06 12:41:28 -07:00
14 changed files with 4 additions and 671 deletions
--- a/.pi/extensions/superpowers.ts
+++ b/.pi/extensions/superpowers.ts
@@ -1,121 +0,0 @@
 import { readFileSync } from "node:fs";
 import { dirname, resolve } from "node:path";
 import { fileURLToPath } from "node:url";
 import type { ExtensionAPI } from "@earendil-works/pi-coding-agent";
 const EXTREMELY_IMPORTANT_MARKER = "<EXTREMELY_IMPORTANT>";
 const BOOTSTRAP_MARKER = "superpowers:using-superpowers bootstrap for pi";
 const extensionDir = dirname(fileURLToPath(import.meta.url));
 const packageRoot = resolve(extensionDir, "../..");
 const skillsDir = resolve(packageRoot, "skills");
 const bootstrapSkillPath = resolve(skillsDir, "using-superpowers", "SKILL.md");
 let cachedBootstrap: string | null | undefined;
 export default function superpowersPiExtension(pi: ExtensionAPI) {
 	let injectBootstrap = true;
 	pi.on("resources_discover", async () => ({
 		skillPaths: [skillsDir],
 	}));
 	pi.on("session_start", async () => {
 		injectBootstrap = true;
 	});
 	pi.on("session_compact", async () => {
 		injectBootstrap = true;
 	});
 	pi.on("agent_end", async () => {
 		injectBootstrap = false;
 	});
 	pi.on("context", async (event) => {
 		if (!injectBootstrap) return;
 		if (event.messages.some(messageContainsBootstrap)) return;
 		const bootstrap = getBootstrapContent();
 		if (!bootstrap) return;
 		const bootstrapMessage = {
 			role: "user" as const,
 			content: [{ type: "text" as const, text: bootstrap }],
 			timestamp: Date.now(),
 		};
 		const insertAt = firstNonCompactionSummaryIndex(event.messages);
 		return {
 			messages: [
 				...event.messages.slice(0, insertAt),
 				bootstrapMessage,
 				...event.messages.slice(insertAt),
 			],
 		};
 	});
 }
 function getBootstrapContent(): string | null {
 	if (cachedBootstrap !== undefined) return cachedBootstrap;
 	try {
 		const skillContent = readFileSync(bootstrapSkillPath, "utf8");
 		const body = stripFrontmatter(skillContent);
 		cachedBootstrap = `${EXTREMELY_IMPORTANT_MARKER}
 ${BOOTSTRAP_MARKER}
 You have superpowers.
 The using-superpowers skill content is included below and is already loaded for this Pi session. Follow it now. Do not try to load using-superpowers again.
 ${body}
 ${piToolMapping()}
 </EXTREMELY_IMPORTANT>`;
 		return cachedBootstrap;
 	} catch {
 		cachedBootstrap = null;
 		return null;
 	}
 }
 function stripFrontmatter(content: string): string {
 	const match = content.match(/^---\n[\s\S]*?\n---\n([\s\S]*)$/);
 	return (match ? match[1] : content).trim();
 }
 function piToolMapping(): string {
 	return `## Pi tool mapping
 Pi has native skills but does not expose Claude Code's \`Skill\` tool. When a Superpowers instruction says to use the \`Skill\` tool, use Pi's native skill system instead: load the relevant \`SKILL.md\` with \`read\` when the skill applies, or let a human invoke \`/skill:name\` explicitly.
 Pi's built-in coding tools are lowercase: \`read\`, \`write\`, \`edit\`, \`bash\`, plus optional \`grep\`, \`find\`, and \`ls\`. Map Claude-style tool names \`Read\`, \`Write\`, \`Edit\`, and \`Bash\` to those Pi tools.
 Pi does not ship a standard \`Task\` subagent tool. If a subagent tool such as \`subagent\` from \`pi-subagents\` is available, use it for Superpowers subagent workflows. If no subagent tool is available, do the work in this session or explain the missing capability instead of inventing tool calls.
 Pi does not ship a standard \`TodoWrite\` task-list tool. If an installed todo/task tool is available, use it. Otherwise track work in plan files or a repo-local \`TODO.md\` when task tracking is needed.`;
 }
 function messageContainsBootstrap(message: unknown): boolean {
 	const content = (message as { content?: unknown }).content;
 	if (typeof content === "string") return content.includes(BOOTSTRAP_MARKER);
 	if (!Array.isArray(content)) return false;
 	return content.some((part) => {
 		return (
 			part &&
 			typeof part === "object" &&
 			(part as { type?: unknown }).type === "text" &&
 			typeof (part as { text?: unknown }).text === "string" &&
 			(part as { text: string }).text.includes(BOOTSTRAP_MARKER)
 		);
 	});
 }
 function firstNonCompactionSummaryIndex(messages: unknown[]): number {
 	let index = 0;
 	while ((messages[index] as { role?: unknown } | undefined)?.role === "compactionSummary") {
 		index += 1;
 	}
 	return index;
 }
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Superpowers is a complete software development methodology for your coding agent
 ## Quickstart
-Give your agent Superpowers: [Claude Code](#claude-code), [Codex CLI](#codex-cli), [Codex App](#codex-app), [Factory Droid](#factory-droid), [Gemini CLI](#gemini-cli), [Pi](#pi), [OpenCode](#opencode), [Cursor](#cursor), [GitHub Copilot CLI](#github-copilot-cli).
+Give your agent Superpowers: [Claude Code](#claude-code), [Codex CLI](#codex-cli), [Codex App](#codex-app), [Factory Droid](#factory-droid), [Gemini CLI](#gemini-cli), [OpenCode](#opencode), [Cursor](#cursor), [GitHub Copilot CLI](#github-copilot-cli).
 ## How it works
@@ -114,22 +114,6 @@ Superpowers is available via the [official Codex plugin marketplace](https://git
  gemini extensions update superpowers
  ```
 ### Pi
 Install Superpowers as a Pi package from this repository:
 ```bash
 pi install git:github.com/obra/superpowers
 ```
 For local development, run Pi with this checkout loaded as a temporary package:
 ```bash
 pi -e /path/to/superpowers
 ```
 The Pi package loads the Superpowers skills and a small extension that injects the `using-superpowers` bootstrap at session startup and again after compaction. Pi has native skills, so no compatibility `Skill` tool is required. Subagent and task-list tools remain optional Pi companion packages.
 ### OpenCode
 OpenCode uses its own plugin install; install Superpowers separately even if you
--- a/docs/superpowers/plans/2026-05-07-pi-extension-and-evals.md
+++ b/docs/superpowers/plans/2026-05-07-pi-extension-and-evals.md
@@ -1,143 +0,0 @@
 # Pi Extension and Evals Implementation Plan
 > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
 **Goal:** Add first-class Pi package support for Superpowers and add Pi as a Drill eval backend.
 **Architecture:** The Pi package is declared in the root `package.json` and loads existing `skills/` plus a small Pi extension. The extension injects the `using-superpowers` bootstrap into provider context as a user-role message on session startup and after compaction, with Pi-specific tool mapping. Drill gains a `pi` backend, Pi session-log normalization, and tests.
 **Tech Stack:** Pi TypeScript extension API, Node built-in test runner, Drill Python eval harness, pytest.
 ---
 ### Task 1: Pi package manifest and extension tests
 **Files:**
 - Modify: `package.json`
 - Create: `tests/pi/test-pi-extension.mjs`
 - [ ] **Step 1: Write failing package/extension tests**
 Create `tests/pi/test-pi-extension.mjs` with tests that import `extensions/superpowers.ts`, register fake Pi handlers, and assert:
 - root `package.json` has `keywords` containing `pi-package`
 - root `package.json` has `pi.skills: ["./skills"]`
 - root `package.json` has `pi.extensions: ["./extensions/superpowers.ts"]`
 - the extension registers `resources_discover`, `session_start`, `session_compact`, `context`, and `agent_end`
 - startup `context` injects exactly one user-role bootstrap message
 - `agent_end` clears startup injection
 - `session_compact` re-enables injection
 - the extension does not register `session_before_compact`
 - [ ] **Step 2: Run tests and verify RED**
 Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
 Expected: FAIL because `extensions/superpowers.ts` does not exist and `package.json` lacks the `pi` manifest.
 - [ ] **Step 3: Implement manifest fields**
 Update `package.json` with `description`, `keywords`, `pi.extensions`, and `pi.skills` while preserving existing `name`, `version`, `type`, and `main`.
 - [ ] **Step 4: Implement `extensions/superpowers.ts`**
 Create a zero-runtime-dependency extension that:
 - locates the package root from `import.meta.url`
 - reads `skills/using-superpowers/SKILL.md`
 - strips YAML frontmatter
 - appends Pi-specific tool mapping
 - exposes `resources_discover` with the skills path
 - marks bootstrap pending on `session_start` and `session_compact`
 - injects a user-role bootstrap message in `context`
 - inserts post-compact bootstrap after leading `compactionSummary` messages
 - clears pending bootstrap on `agent_end`
 - [ ] **Step 5: Run tests and verify GREEN**
 Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
 Expected: PASS.
 ### Task 2: Pi tool mapping reference
 **Files:**
 - Create: `skills/using-superpowers/references/pi-tools.md`
 - Modify: `tests/pi/test-pi-extension.mjs`
 - [ ] **Step 1: Write failing test for Pi reference doc**
 Add assertions that `skills/using-superpowers/references/pi-tools.md` exists and documents mappings for `Skill`, `Task`, `TodoWrite`, and built-in tool names.
 - [ ] **Step 2: Run tests and verify RED**
 Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
 Expected: FAIL because `pi-tools.md` does not exist.
 - [ ] **Step 3: Add Pi reference doc**
 Create `skills/using-superpowers/references/pi-tools.md` explaining Pi-native skills, optional `pi-subagents`, no canonical todo/tasklist plugin, and built-in lowercase tools.
 - [ ] **Step 4: Run tests and verify GREEN**
 Run: `node --experimental-strip-types --test tests/pi/test-pi-extension.mjs`
 Expected: PASS.
 ### Task 3: Drill Pi backend and session log normalization
 **Files:**
 - Create: `evals/backends/pi.yaml`
 - Modify: `evals/drill/backend.py`
 - Modify: `evals/drill/engine.py`
 - Modify: `evals/drill/normalizer.py`
 - Modify: `evals/tests/test_backend.py`
 - Modify: `evals/tests/test_normalizer.py`
 - [ ] **Step 1: Write failing backend/normalizer tests**
 Add pytest coverage for:
 - `load_backend("pi")` returns `family == "pi"`
 - Pi backend command starts with `pi` and includes `-e ${SUPERPOWERS_ROOT}`
 - `_resolve_log_dir()` for Pi points under `~/.pi/agent/sessions`
 - `filter_pi_logs_by_cwd()` keeps only session files whose header `cwd` matches the scenario workdir
 - `normalize_pi_logs()` extracts `toolCall` blocks from Pi assistant session entries and maps built-in lowercase tools to canonical names
 - [ ] **Step 2: Run tests and verify RED**
 Run: `uv run pytest evals/tests/test_backend.py evals/tests/test_normalizer.py -q`
 Expected: FAIL because the Pi backend and normalizer do not exist.
 - [ ] **Step 3: Add `evals/backends/pi.yaml`**
 Configure the backend to run `pi -e ${SUPERPOWERS_ROOT}`, use permissive TUI readiness, `/quit` shutdown, and Pi session log location.
 - [ ] **Step 4: Implement Pi family support**
 Update `Backend.family`, `Engine._resolve_log_dir`, `Engine._collect_tool_calls`, and `normalizer.py` with Pi log filtering and normalizing.
 - [ ] **Step 5: Run tests and verify GREEN**
 Run: `uv run pytest evals/tests/test_backend.py evals/tests/test_normalizer.py -q`
 Expected: PASS.
 ### Task 4: Documentation and full verification
 **Files:**
 - Modify: `README.md`
 - Modify: `evals/README.md`
 - [ ] **Step 1: Document Pi install and eval backend**
 Add Pi to README quickstart/install list and add backend entry/usage to `evals/README.md`.
 - [ ] **Step 2: Run verification**
 Run:
 ```bash
 node --experimental-strip-types --test tests/pi/test-pi-extension.mjs
 uv run pytest evals/tests/test_backend.py evals/tests/test_setup.py evals/tests/test_normalizer.py -q
 ```
 Expected: all tests pass.
--- a/evals/README.md
+++ b/evals/README.md
@@ -43,9 +43,6 @@ uv run drill run spec-writing-blind-spot -b claude-opus-4-6 --n 5
 # Sweep across multiple backends
 uv run drill run spec-writing-blind-spot --models claude-opus-4-6,claude-opus-4-7 --n 10
-# Run against Pi, loading the local Superpowers package via -e ${SUPERPOWERS_ROOT}
-uv run drill run triggering-writing-plans -b pi
 # Compare results
 uv run drill compare spec-writing-blind-spot
@@ -75,7 +72,6 @@ uv run drill list
 | `codex` | Codex CLI | — |
 | `gemini` | Gemini CLI | auto-gemini-3 |
 | `gemini-2-5-flash` | Gemini CLI | gemini-2.5-flash |
-| `pi` | Pi coding agent | configured Pi default |
 ## Project structure
--- a/evals/backends/pi.yaml
+++ b/evals/backends/pi.yaml
@@ -1,23 +0,0 @@
 name: pi
 cli: pi
 args:
  - "-e"
  - "${SUPERPOWERS_ROOT}"
 required_env:
  - SUPERPOWERS_ROOT
 hooks:
  pre_run: []
  post_run: []
 shutdown: "/quit"
 idle:
  quiescence_seconds: 5
  ready_pattern: "."
 busy_pattern: "esc to cancel|Thinking\\.\\.\\.|\\(esc to cancel[^)]*\\)|[⠇⠏⠋⠙⠹⠸⠼⠴⠦⠧⠶⠾⠽⠻⠿]"
 max_busy_seconds: 1800
 startup_timeout: 60
 turn_timeout: 300
 terminal:
  cols: 200
  rows: 50
 session_logs:
  pattern: "~/.pi/agent/sessions/**/*.jsonl"
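A note on the `idle` patterns in the removed `pi.yaml`: `busy_pattern` is an alternation of literal status strings plus one character class of braille spinner glyphs, while `ready_pattern: "."` matches any screen that shows at least one character. The sketch below only illustrates what those two regular expressions match. It assumes Drill applies them as a search over the captured terminal text, which this diff does not show, and `is_busy` / `is_ready` are hypothetical helper names.

```python
import re

# Patterns copied from pi.yaml after YAML unescaping ("\\." becomes "\.").
BUSY_PATTERN = re.compile(
    r"esc to cancel|Thinking\.\.\.|\(esc to cancel[^)]*\)|[⠇⠏⠋⠙⠹⠸⠼⠴⠦⠧⠶⠾⠽⠻⠿]"
)
READY_PATTERN = re.compile(r".")


def is_busy(screen: str) -> bool:
    """Hypothetical helper: True if any busy marker appears in the text."""
    return BUSY_PATTERN.search(screen) is not None


def is_ready(screen: str) -> bool:
    """Hypothetical helper: True once the text contains any character."""
    return READY_PATTERN.search(screen) is not None
```

Because the bare `esc to cancel` alternative already matches inside the parenthesized form, the third alternative never changes the result; the spinner class and `Thinking...` are what add coverage.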
--- a/evals/drill/backend.py
+++ b/evals/drill/backend.py
@@ -71,7 +71,7 @@ class Backend:
    @property
    def family(self) -> str:
        """Normalize backend name to a family for log-dir / normalizer dispatch."""
-        for fam in ("claude", "codex", "gemini", "pi"):
+        for fam in ("claude", "codex", "gemini"):
            if self.name == fam or self.name.startswith(f"{fam}-"):
                return fam
        return "other"
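For context on the `family` change: the property matches a backend name against each family either exactly or as a `<family>-` prefix, so with `"pi"` removed from the tuple a backend named `pi` falls through to `"other"`. A minimal standalone sketch of that rule follows; the free function `family_of` and its `families` parameter are hypothetical, since in the diff this logic lives in the `Backend.family` property.

```python
def family_of(
    name: str,
    families: tuple[str, ...] = ("claude", "codex", "gemini"),
) -> str:
    """Return the first family that equals name or prefixes it as '<family>-'."""
    for fam in families:
        if name == fam or name.startswith(f"{fam}-"):
            return fam
    return "other"


# Variant names keep their family; "pi" is no longer recognized.
print(family_of("claude-opus-4-6"))
print(family_of("gemini-2-5-flash"))
print(family_of("pi"))
```

The hyphen in the prefix check matters: with `"pi"` in the tuple, a name such as `pipeline` would still map to `"other"` rather than `"pi"`.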
--- a/evals/drill/engine.py
+++ b/evals/drill/engine.py
@@ -21,7 +21,6 @@ from drill.normalizer import (
    NORMALIZERS,
    collect_new_logs,
    filter_codex_logs_by_cwd,
-    filter_pi_logs_by_cwd,
    snapshot_log_dir,
 )
 from drill.session import TmuxSession
@@ -349,11 +348,6 @@ class Engine:
            # Project name is the workdir basename, lowercased
            project = workdir.resolve().name.lower()
            return Path.home() / ".gemini" / "tmp" / project
-        elif self.backend.family == "pi":
-            # Pi stores sessions under ~/.pi/agent/sessions/<encoded-cwd>/.
-            # Return the root and filter by the session header cwd because
-            # multiple evals may run concurrently under the same tree.
-            return Path.home() / ".pi" / "agent" / "sessions"
        pattern = self.backend.session_logs.get("pattern", "")
        if not pattern:
            return None
@@ -369,8 +363,6 @@ class Engine:
        new_files = collect_new_logs(log_dir, snapshot)
        if self.backend.family == "codex":
            new_files = filter_codex_logs_by_cwd(new_files, str(workdir.resolve()))
-        elif self.backend.family == "pi":
-            new_files = filter_pi_logs_by_cwd(new_files, str(workdir.resolve()))
        normalizer = NORMALIZERS.get(self.backend.family)
        if not normalizer:
            return []
--- a/evals/drill/normalizer.py
+++ b/evals/drill/normalizer.py
@@ -74,23 +74,6 @@ def filter_codex_logs_by_cwd(paths: list[Path], target_cwd: str) -> list[Path]:
    return matched
-def filter_pi_logs_by_cwd(paths: list[Path], target_cwd: str) -> list[Path]:
-    """Drop Pi sessions whose header cwd doesn't match target_cwd."""
-    matched: list[Path] = []
-    for path in paths:
-        try:
-            with path.open() as f:
-                first_line = f.readline()
-            entry = json.loads(first_line)
-        except (OSError, json.JSONDecodeError):
-            continue
-        if entry.get("type") != "session":
-            continue
-        if entry.get("cwd") == target_cwd:
-            matched.append(path)
-    return matched
 def normalize_claude_logs(raw_content: str) -> list[dict[str, Any]]:
    """Normalize Claude Code session logs.
@@ -172,52 +155,6 @@ def normalize_codex_logs(raw_content: str) -> list[dict[str, Any]]:
    return results
-# Reverse mapping: Pi tool names → Claude Code canonical names
-PI_TOOL_MAP: dict[str, str] = {
-    "read": "Read",
-    "write": "Write",
-    "edit": "Edit",
-    "bash": "Bash",
-    "grep": "Grep",
-    "find": "Glob",
-    "ls": "Glob",
-}
-PI_NATIVE_TOOLS = (set(PI_TOOL_MAP.values()) - {"Bash"}) | {"subagent", "todo", "manage_todo_list"}
-def normalize_pi_logs(raw_content: str) -> list[dict[str, Any]]:
-    """Normalize Pi JSONL session logs.
-    Pi session files are JSONL entries. Assistant messages contain tool calls as
-    content blocks: {"type": "toolCall", "name": "read", "arguments": {...}}.
-    """
-    results: list[dict[str, Any]] = []
-    for line in raw_content.strip().split("\n"):
-        if not line.strip():
-            continue
-        try:
-            entry = json.loads(line)
-        except json.JSONDecodeError:
-            continue
-        if entry.get("type") != "message":
-            continue
-        message = entry.get("message", {})
-        if message.get("role") != "assistant":
-            continue
-        for block in message.get("content", []):
-            if block.get("type") != "toolCall":
-                continue
-            name = block.get("name", "")
-            canonical = PI_TOOL_MAP.get(name, name)
-            source = "native" if canonical in PI_NATIVE_TOOLS else "shell"
-            results.append(
-                {"tool": canonical, "args": block.get("arguments", {}), "source": source}
-            )
-    return results
 # Reverse mapping: Gemini tool names → Claude Code canonical names
 GEMINI_TOOL_MAP: dict[str, str] = {
    "run_shell_command": "Bash",
@@ -288,5 +225,4 @@ NORMALIZERS: dict[str, Callable[[str], list[dict[str, Any]]]] = {
    "claude": normalize_claude_logs,
    "codex": normalize_codex_logs,
    "gemini": normalize_gemini_logs,
-    "pi": normalize_pi_logs,
 }
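To make the removed `source` classification concrete: a Pi tool name is first mapped to its canonical Claude Code name, then labeled `native` if that canonical name is in `PI_NATIVE_TOOLS` and `shell` otherwise. The sketch below copies `PI_TOOL_MAP` and `PI_NATIVE_TOOLS` from the removed `normalizer.py` code; the `classify` helper is hypothetical and only isolates the two lines of `normalize_pi_logs` that compute `canonical` and `source`.

```python
# Copied from the removed normalizer.py code.
PI_TOOL_MAP: dict[str, str] = {
    "read": "Read",
    "write": "Write",
    "edit": "Edit",
    "bash": "Bash",
    "grep": "Grep",
    "find": "Glob",
    "ls": "Glob",
}
PI_NATIVE_TOOLS = (set(PI_TOOL_MAP.values()) - {"Bash"}) | {
    "subagent",
    "todo",
    "manage_todo_list",
}


def classify(name: str) -> tuple[str, str]:
    """Hypothetical helper: return (canonical tool name, source label)."""
    canonical = PI_TOOL_MAP.get(name, name)
    source = "native" if canonical in PI_NATIVE_TOOLS else "shell"
    return canonical, source


for tool in ("read", "bash", "ls", "subagent"):
    print(tool, classify(tool))
```

One consequence of this rule is that any tool name outside both the map and the native set is labeled `shell`, not only `bash`.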
--- a/evals/tests/test_backend.py
+++ b/evals/tests/test_backend.py
@@ -44,12 +44,6 @@ class TestLoadBackend:
        assert flash_backend.family == "gemini"
        assert flash_backend.model == "gemini-2.5-flash"
-    def test_loads_pi_backend(self, backends_dir):
-        backend = load_backend("pi", backends_dir)
-        assert backend.name == "pi"
-        assert backend.cli == "pi"
-        assert backend.family == "pi"
 class TestBackendBuildCommand:
    def test_claude_build_command(self, backends_dir, monkeypatch):
@@ -66,12 +60,6 @@ class TestBackendBuildCommand:
        cmd = backend.build_command("/tmp/workdir")
        assert cmd[0] == "codex"
-    def test_pi_build_command_loads_local_superpowers_package(self, backends_dir, monkeypatch):
-        monkeypatch.setenv("SUPERPOWERS_ROOT", "/tmp/superpowers")
-        backend = load_backend("pi", backends_dir)
-        cmd = backend.build_command("/tmp/workdir")
-        assert cmd == ["pi", "-e", "/tmp/superpowers"]
 class TestBackendEnvValidation:
    def test_missing_env_raises(self, backends_dir, monkeypatch):
@@ -137,21 +125,6 @@ class TestBackendFamily:
        backend = load_backend("codex", backends_dir)
        assert backend.family == "codex"
-    def test_pi_backend_family(self):
-        backend = Backend(
-            name="pi",
-            cli="pi",
-            args=[],
-            required_env=[],
-            hooks={"pre_run": [], "post_run": []},
-            shutdown="/quit",
-            idle={},
-            startup_timeout=30,
-            terminal={},
-            session_logs={},
-        )
-        assert backend.family == "pi"
    def test_variant_name_preserves_family(self):
        backend = Backend(
            name="claude-opus-4-6",
--- a/evals/tests/test_engine.py
+++ b/evals/tests/test_engine.py
@@ -4,7 +4,7 @@ import json
 import subprocess
 from pathlib import Path
-from drill.engine import Engine, RunResult, ScenarioConfig, VerifyConfig, snapshot_filesystem
+from drill.engine import RunResult, ScenarioConfig, VerifyConfig, snapshot_filesystem
 class TestVerifyConfig:
@@ -138,40 +138,6 @@ class TestEngineAssertionIntegration:
        assert (tmp_path / "meta.json").exists()
-class TestEnginePiBackend:
-    def test_resolves_pi_session_log_root(self, tmp_path: Path) -> None:
-        scenario = tmp_path / "scenario.yaml"
-        scenario.write_text("scenario: test-pi\n")
-        backends = tmp_path / "backends"
-        backends.mkdir()
-        (backends / "pi.yaml").write_text(
-            """
-name: pi
-cli: pi
-args: []
-required_env: []
-hooks:
-  pre_run: []
-  post_run: []
-shutdown: /quit
-idle: {}
-startup_timeout: 1
-terminal: {}
-session_logs:
-  pattern: ~/.pi/agent/sessions/**/*.jsonl
-"""
-        )
-        engine = Engine(
-            scenario_path=scenario,
-            backend_name="pi",
-            backends_dir=backends,
-            fixtures_dir=tmp_path,
-            results_dir=tmp_path,
-        )
-        assert engine._resolve_log_dir(tmp_path) == Path.home() / ".pi" / "agent" / "sessions"
 class TestEngineRunParams:
    def test_run_result_uses_custom_output_dir(self, tmp_path: Path) -> None:
        custom_dir = tmp_path / "custom" / "run-00"
--- a/evals/tests/test_normalizer.py
+++ b/evals/tests/test_normalizer.py
@@ -3,11 +3,9 @@ import json
 from drill.normalizer import (
    collect_new_logs,
    filter_codex_logs_by_cwd,
-    filter_pi_logs_by_cwd,
    normalize_claude_logs,
    normalize_codex_logs,
    normalize_gemini_logs,
-    normalize_pi_logs,
    snapshot_log_dir,
 )
@@ -139,56 +137,6 @@ class TestNormalizeCodexLogs:
        assert normalized[1]["source"] == "native"
-class TestNormalizePiLogs:
-    def test_filter_by_cwd_keeps_matching_session_headers(self, tmp_path):
-        target = "/tmp/drill-target"
-        match = tmp_path / "match.jsonl"
-        match.write_text(json.dumps({"type": "session", "cwd": target}) + "\n")
-        other = tmp_path / "other.jsonl"
-        other.write_text(json.dumps({"type": "session", "cwd": "/tmp/other"}) + "\n")
-        malformed = tmp_path / "malformed.jsonl"
-        malformed.write_text("not json\n")
-        assert filter_pi_logs_by_cwd([match, other, malformed], target) == [match]
-    def test_normalizes_assistant_tool_calls_from_session_entries(self):
-        lines = [
-            json.dumps({"type": "session", "cwd": "/tmp/project"}),
-            json.dumps(
-                {
-                    "type": "message",
-                    "message": {
-                        "role": "assistant",
-                        "content": [
-                            {"type": "text", "text": "I will inspect this."},
-                            {
-                                "type": "toolCall",
-                                "name": "read",
-                                "arguments": {"path": "README.md"},
-                            },
-                            {
-                                "type": "toolCall",
-                                "name": "bash",
-                                "arguments": {"command": "git status"},
-                            },
-                            {
-                                "type": "toolCall",
-                                "name": "subagent",
-                                "arguments": {"agent": "reviewer"},
-                            },
-                        ],
-                    },
-                }
-            ),
-        ]
-        assert normalize_pi_logs("\n".join(lines)) == [
-            {"tool": "Read", "args": {"path": "README.md"}, "source": "native"},
-            {"tool": "Bash", "args": {"command": "git status"}, "source": "shell"},
-            {"tool": "subagent", "args": {"agent": "reviewer"}, "source": "native"},
-        ]
 class TestNormalizeGeminiLogs:
    def test_normalizes_jsonl_tool_calls(self):
        lines = [
--- a/package.json
+++ b/package.json
@@ -1,23 +1,6 @@
 {
  "name": "superpowers",
  "version": "5.1.0",
-  "description": "Superpowers skills and runtime bootstrap for coding agents",
  "type": "module",
-  "main": ".opencode/plugins/superpowers.js",
+  "main": ".opencode/plugins/superpowers.js"
-  "keywords": [
-    "pi-package",
-    "skills",
-    "tdd",
-    "debugging",
-    "collaboration",
-    "workflow"
-  ],
-  "pi": {
-    "extensions": [
-      "./.pi/extensions/superpowers.ts"
-    ],
-    "skills": [
-      "./skills"
-    ]
-  }
 }
--- a/skills/using-superpowers/references/pi-tools.md
+++ b/skills/using-superpowers/references/pi-tools.md
@@ -1,30 +0,0 @@
 # Pi Tool Mapping
 Pi supports Superpowers skills natively through skill discovery and `/skill:name` commands. It does not expose Claude Code's `Skill` tool.
 When a Superpowers skill mentions Claude Code tool names, use these Pi equivalents:
 | Superpowers / Claude Code name | Pi equivalent |
 | --- | --- |
 | `Skill` | Pi native skills: load the relevant `SKILL.md` with `read`, or let the human use `/skill:name` |
 | `Read` | `read` |
 | `Write` | `write` |
 | `Edit` | `edit` |
 | `Bash` | `bash` |
 | `Grep` | `grep` when active; otherwise `bash` with `rg`/`grep` |
 | `Glob` | `find` or `bash` with shell globs |
 | `LS` / `List` | `ls` when active; otherwise `bash` with `ls` |
 | `Task` | Use an installed subagent tool such as `subagent` from `pi-subagents` if available |
 | `TodoWrite` | Use an installed todo/task tool if available, otherwise track tasks in the plan or `TODO.md` |
 ## Skills
 Pi discovers skills from configured skill directories and installed Pi packages. A Superpowers Pi package should expose `skills/` through its `pi.skills` manifest entry. The agent should still follow the Superpowers rule: when a skill applies, load and follow it before responding.
 ## Subagents
 Pi core does not ship a standard subagent tool. The `pi-subagents` package is a strong optional companion and provides a `subagent` tool with single-agent, chain, parallel, async, forked-context, and resume/status workflows. If no subagent tool is available, do not fabricate `Task` calls; execute sequentially in the current session or explain that the optional subagent capability is not installed.
 ## Task lists
 Pi core does not ship a standard task-list tool. If a todo/task extension is installed, use its documented tool. Otherwise use Superpowers plan files, checklists in Markdown, or a repo-local `TODO.md` for task tracking.
--- a/tests/pi/test-pi-extension.mjs
+++ b/tests/pi/test-pi-extension.mjs
@@ -1,128 +0,0 @@
 import assert from 'node:assert/strict';
 import { readFile } from 'node:fs/promises';
 import { existsSync } from 'node:fs';
 import { dirname, resolve } from 'node:path';
 import { fileURLToPath, pathToFileURL } from 'node:url';
 import test from 'node:test';
 const __dirname = dirname(fileURLToPath(import.meta.url));
 const repoRoot = resolve(__dirname, '../..');
 const packageJsonPath = resolve(repoRoot, 'package.json');
 const extensionPath = resolve(repoRoot, '.pi/extensions/superpowers.ts');
 const piToolsPath = resolve(repoRoot, 'skills/using-superpowers/references/pi-tools.md');
 async function readPackageJson() {
  return JSON.parse(await readFile(packageJsonPath, 'utf8'));
 }
 async function loadExtension() {
  const handlers = new Map();
  const pi = {
    on(event, handler) {
      if (!handlers.has(event)) handlers.set(event, []);
      handlers.get(event).push(handler);
    },
  };
  const mod = await import(pathToFileURL(extensionPath).href + `?cachebust=${Date.now()}-${Math.random()}`);
  mod.default(pi);
  return { handlers };
 }
 function firstHandler(handlers, event) {
  const eventHandlers = handlers.get(event) ?? [];
  assert.equal(eventHandlers.length, 1, `expected one ${event} handler`);
  return eventHandlers[0];
 }
 function textOf(message) {
  if (typeof message.content === 'string') return message.content;
  return message.content
    .filter((part) => part.type === 'text')
    .map((part) => part.text)
    .join('\n');
 }
 test('package.json declares a pi package with skills and extension resources', async () => {
  const pkg = await readPackageJson();
  assert.equal(pkg.name, 'superpowers');
  assert.ok(pkg.keywords.includes('pi-package'));
  assert.deepEqual(pkg.pi.skills, ['./skills']);
  assert.deepEqual(pkg.pi.extensions, ['./.pi/extensions/superpowers.ts']);
 });
 test('extension registers lifecycle hooks without pre-compaction injection', async () => {
  const { handlers } = await loadExtension();
  for (const event of ['resources_discover', 'session_start', 'session_compact', 'context', 'agent_end']) {
    assert.equal((handlers.get(event) ?? []).length, 1, `missing ${event} handler`);
  }
  assert.equal((handlers.get('session_before_compact') ?? []).length, 0);
 });
 test('resources_discover contributes the bundled skills directory', async () => {
  const { handlers } = await loadExtension();
  const discover = firstHandler(handlers, 'resources_discover');
  const result = await discover({ type: 'resources_discover', cwd: repoRoot, reason: 'startup' }, {});
  assert.deepEqual(result.skillPaths, [resolve(repoRoot, 'skills')]);
 });
 test('startup context injects the bootstrap as one user message until agent_end', async () => {
  const { handlers } = await loadExtension();
  const sessionStart = firstHandler(handlers, 'session_start');
  const context = firstHandler(handlers, 'context');
  const agentEnd = firstHandler(handlers, 'agent_end');
  await sessionStart({ type: 'session_start', reason: 'startup' }, {});
  const originalMessages = [
    { role: 'user', content: [{ type: 'text', text: 'Let us make a react todo list' }], timestamp: 1 },
  ];
  const result = await context({ type: 'context', messages: originalMessages }, {});
  assert.equal(result.messages.length, 2);
  assert.equal(result.messages[0].role, 'user');
  assert.match(textOf(result.messages[0]), /You have superpowers/);
  assert.match(textOf(result.messages[0]), /Pi tool mapping/);
  assert.equal(result.messages[1], originalMessages[0]);
  const repeatedProviderRequest = await context({ type: 'context', messages: originalMessages }, {});
  assert.equal(repeatedProviderRequest.messages.length, 2);
  assert.match(textOf(repeatedProviderRequest.messages[0]), /You have superpowers/);
  const alreadyInjected = await context({ type: 'context', messages: result.messages }, {});
  assert.equal(alreadyInjected, undefined, 'bootstrap should not duplicate when already present');
  await agentEnd({ type: 'agent_end', messages: [] }, {});
  const afterEnd = await context({ type: 'context', messages: originalMessages }, {});
  assert.equal(afterEnd, undefined, 'startup bootstrap should clear after agent_end');
 });
 test('session_compact injects bootstrap after compaction summaries, not before compaction', async () => {
  const { handlers } = await loadExtension();
  const sessionCompact = firstHandler(handlers, 'session_compact');
  const context = firstHandler(handlers, 'context');
  await sessionCompact({ type: 'session_compact', compactionEntry: {}, fromExtension: false }, {});
  const summary = { role: 'compactionSummary', summary: 'Prior work summary', tokensBefore: 123, timestamp: 1 };
  const user = { role: 'user', content: [{ type: 'text', text: 'Continue' }], timestamp: 2 };
  const result = await context({ type: 'context', messages: [summary, user] }, {});
  assert.equal(result.messages.length, 3);
  assert.equal(result.messages[0], summary);
  assert.equal(result.messages[1].role, 'user');
  assert.match(textOf(result.messages[1]), /You have superpowers/);
  assert.equal(result.messages[2], user);
 });
 test('pi tools reference documents pi-specific mappings', async () => {
  assert.equal(existsSync(piToolsPath), true, 'pi-tools.md should exist');
  const text = await readFile(piToolsPath, 'utf8');
  for (const expected of ['Skill', 'Task', 'TodoWrite', 'read', 'write', 'edit', 'bash']) {
    assert.match(text, new RegExp(expected));
  }
 });
Author	SHA1	Message	Date
Drew Ritter	bad4708a7b	evals: use pre-commit hooks	2026-05-06 15:41:52 -07:00
Drew Ritter	ec9b96a7bf	evals: add Gemini 2.5 Flash backend	2026-05-06 15:09:59 -07:00
Drew Ritter	2d4cdea2bb	evals: drop drill source marker	2026-05-06 14:55:14 -07:00
Drew Ritter	af465f9687	evals: remove unreleased wave scenarios	2026-05-06 14:43:08 -07:00
Jesse Vincent	e4191c3609	Address adversarial review findings - evals/README.md, evals/CLAUDE.md: fix uv install command from 'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses [project.optional-dependencies], so --dev is a no-op for pytest/ruff/ty; --extra dev is the correct invocation. - tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh from integration_tests array (file deleted earlier in this branch). - tests/claude-code/README.md: replace test-requesting-code-review.sh section with test-worktree-native-preference.sh (the worktree test is kept; the code-review test was lifted into drill). - docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness list. evals/backends/ has claude*, codex, gemini configs but no copilot.yaml, so the claim was unsupported. Adversarial review credit: reviewer #2 found four legitimate issues (uv-sync, run-skill-tests stale ref, README stale ref via #1, and Copilot CLI fabrication); reviewer #1 found two distinct issues (run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this round.	2026-05-06 12:41:28 -07:00
Jesse Vincent	d545612825	docs: introduce evals/ as the canonical skill-behavior eval harness - docs/testing.md split into Plugin tests + Skill behavior evals. Plugin tests section enumerates the bash tests that survive (kept by drill-coverage analysis or as describe-skill tests). - CLAUDE.md adds Eval harness section pointing at evals/. - README.md Contributing section mentions evals/ alongside tests/. - .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders (evals/.gitignore covers these locally; root-level entries help tooling that does not recurse into nested ignore files).	2026-05-06 12:33:10 -07:00
Jesse Vincent	b43d14f87f	docs: annotate dated artifacts referencing lifted bash tests - RELEASE-NOTES.md: note that test-requesting-code-review.sh and test-document-review-system.sh were lifted into drill scenarios on 2026-05-06; references are preserved as dated artifacts. - docs/superpowers/plans/2026-03-23-codex-app-compatibility.md: note that tests/skill-triggering/ was lifted into drill scenarios on 2026-05-06; the run-all.sh reference is a dated artifact. Subagent second-pass scrub confirmed no other active references in the tree (excluding evals/ and the spec/plan for this work itself).	2026-05-06 12:32:00 -07:00
Jesse Vincent	11d5db1b22	tests: annotate three kept bash tests with drill coverage notes - test-worktree-native-preference.sh: drill covers PRESSURE phase only; RED + GREEN baselines have no drill counterpart and are kept so the RED-GREEN-REFACTOR validation remains rerunnable end-to-end. - test-subagent-driven-development-integration.sh: drill covers the YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds >=3 commits, >=2 subagent dispatches, TodoWrite usage, test file existence check, and token-budget telemetry. Kept until drill scenario covers those or they are retired. - test-subagent-driven-development.sh: tests agent's ability to describe SDD (string matches against expected keywords). Drill scenarios test behavior, not description-recall. Kept by design. Subagent verification recorded in commit messages of subsequent deletions; gap analyses driving these annotations are also in the verification subagent reports for the gating sweep.	2026-05-06 12:29:59 -07:00
Jesse Vincent	051bff661b	tests: remove test-requesting-code-review.sh (covered by drill code-review-catches-planted-bugs) Subagent verification: every bash assertion (skill invocation, subagent dispatch, SQL injection flagged, credential handling flagged, no merge approval) maps to drill verify checks. Drill is stricter: bundles severity (Critical/Important) into the same criteria as the finding itself (bash split severity into a separate test). Setup parity covered (src/db.js with string concat + identity hash, two commits). The drill scenario header explicitly says it is the "cross-harness, semantically-judged replacement for the bash test."	2026-05-06 12:28:40 -07:00
Jesse Vincent	dc6255291b	tests: remove test-document-review-system.sh (covered by drill spec-reviewer-catches-planted-flaws) Subagent verification: every bash assertion (TODO in Requirements section flagged, "specified later" deferral flagged, Issues section present, did-not-approve verdict) maps to drill verify.criteria entries. Setup parity covered by setup.assertions (test-feature-design.md exists with TODO + 'specified later' content). Drill is stricter: asserts tool-called Agent (subagent dispatch) which the bash test did not check.	2026-05-06 12:28:40 -07:00
Jesse Vincent	d337f4a18a	tests: remove subagent-driven-dev fixtures (covered by drill sdd-go-fractals + sdd-svelte-todo) The bash test had ZERO output assertions — it just ran claude -p and printed token usage. Drill's scenarios are strictly more rigorous: go-fractals: skill-called SDD + tool-called Agent + go test ./... passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria verifying real SDD workflow. svelte-todo: skill-called SDD + tool-called Agent + npm test passes + playwright e2e passes + package.json + svelte.config.js or vite.config.ts + >=4 commits + LLM criteria. design.md and plan.md are byte-identical between bash fixtures and drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/). Drill's setup helper (scaffold_sdd_*) forces git init -b main (stricter than bash's reliance on init.defaultBranch). The .claude/settings.local.json from bash scaffold.sh is unnecessary for drill since permissions are managed via backend YAML. Subagent verification: SAFE TO DELETE for both.	2026-05-06 12:27:31 -07:00
Jesse Vincent	6fe9cf7515	tests: remove run-claude-describes-sdd.sh (covered by drill mid-conversation-skill-invocation) Subagent verification: every bash assertion (Skill tool invoked + specific skill name 'subagent-driven-development' loaded after the agent describes it conversationally in turn 1) maps to the drill scenario's skill-called assertion + criteria paragraph requiring the skill to fire in direct response to the second user message. Drill additionally asserts tool-called Agent (subagent dispatch) which is stricter than the bash test. Other runners in tests/explicit-skill-requests/ (haiku, multiturn, extended-multiturn) and their prompt files are preserved — they have no drill coverage and exercise different behaviors.	2026-05-06 12:25:46 -07:00
Jesse Vincent	3177c87aa8	tests: remove skill-triggering bash prompts (covered by drill triggering-* scenarios) Subagent verification confirmed each prompt's intent matches its corresponding drill scenario's turns[].intent verbatim, and each scenario has both a deterministic skill-called assertion and a semantic LLM criterion confirming the matching skill was loaded (actually a stronger check than the bash test, which only confirms the skill fires anywhere in the stream). All 6 prompts deleted. The runner had no remaining prompts to drive, so run-test.sh and run-all.sh deleted as well.	2026-05-06 12:24:53 -07:00
Jesse Vincent	a94d2cc414	evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE The cli.py helper now defaults the env var. Mention as override only.	2026-05-06 12:21:35 -07:00
Jesse Vincent	dcffaa087a	evals: drop SUPERPOWERS_ROOT from codex/gemini required_env These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's os.environ access, which the new cli.py default helper supplies automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.	2026-05-06 12:20:47 -07:00
Jesse Vincent	b3817bba4f	evals: default SUPERPOWERS_ROOT to parent of evals/ if unset Adds _set_superpowers_root_default() to drill/cli.py, called at module import after load_dotenv(). PROJECT_ROOT resolves to evals/ post-lift; its parent is the superpowers repo root, which is the correct value for SUPERPOWERS_ROOT. Existing env values are respected as overrides via os.environ.setdefault. Tests: - helper sets default when var is unset - helper does not override when var is already set	2026-05-06 12:19:39 -07:00
Jesse Vincent	3c046f579e	Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b rsync of obra/drill@013fcb8b7d into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.	2026-05-06 12:15:46 -07:00
Jesse Vincent	895bb732d5	Plan: lift drill into superpowers as evals/ 15-task implementation plan derived from the design spec at docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md. Each task is bite-sized (2-5 min steps) with exact commands, exact file paths, and exact code where required. Subagent verification gates per the spec are written out as concrete prompt templates. Self-review: - Spec coverage: every spec section maps to a task - Placeholder scan: no TBD/TODO/placeholder/fill-in-later language - Type consistency: helper named _set_superpowers_root_default consistently; drill SHA recorded in evals/.drill-source-sha consistently	2026-05-06 12:08:58 -07:00
Jesse Vincent	cf5914a31f	Spec: address adversarial review findings Two parallel reviewers raised legitimate issues against the lift-drill-into-evals spec. Updates: - Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh scripts + prompts, not "2 scenarios cover all". Several scripts (Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging) have no drill counterpart and stay. - tests/claude-code/test-subagent-driven-development.sh marked as meta/documentation test (asks agent to describe SDD); no drill scenario covers description tests; defaults to keep. - Path-defaults section now shows verified evidence: PROJECT_ROOT resolves to evals/ post-move; only claude*.yaml substitute ${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ in pre-run hooks); helper invocation order specified (after load_dotenv, before click definitions). - Step 2 copy uses explicit rsync excludes (.git, .venv, results, .env, __pycache__, *.egg-info, .private-journal); checksum-level verification rather than file-count. - Drill SHA recorded at copy time in commit message and evals/.drill-source-sha for divergence detection. - evals/tests/ pytest suite added to verification protocol. - Reference scrub list expanded: RELEASE-NOTES.md, docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/), lefthook.yml. Excluded dirs called out (node_modules/, .venv/, evals/). - Historical plan docs / RELEASE-NOTES handling: annotate, don't rewrite. - evals/lefthook.yml move documented (drill ships its own; contributors run cd evals && lefthook run pre-commit manually). - PR description checklist includes archival action item for obra/drill post-merge. False finding rejected: svelte-todo fixture is complete on disk (design.md + plan.md + scaffold.sh present); reviewer #1 #3 dropped.	2026-05-06 12:03:24 -07:00
Jesse Vincent	cf34cef01e	Spec: lift drill into superpowers as evals/ Records scope, branching, architecture, deletion gate, verification protocol, path/config edits, migration ordering, and post-implementation verification. Frames CI integration, scenario co-location, and Python package rename as deferred work. Per-file deletion of bash tests under superpowers/tests/ is gated by a subagent that compares each bash assertion to its drill scenario's verify block. Default keeps the bash test if any assertion is unmatched. Branching: independent off dev (f/evals-lift), not stacked on f/cross-platform.	2026-05-06 11:54:12 -07:00
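
Commit b3817bba4f describes a helper that defaults SUPERPOWERS_ROOT to the parent of evals/ while respecting an existing value through os.environ.setdefault. A minimal sketch of that behavior follows, not the code in drill/cli.py itself. The function name without the leading underscore and the `project_root` parameter are assumptions made for illustration; the commit says the real helper is called at module import, after load_dotenv().

```python
import os
from pathlib import Path


def set_superpowers_root_default(project_root: Path) -> str:
    """Default SUPERPOWERS_ROOT to the parent of project_root unless it is already set.

    Hypothetical stand-in for the helper named in the commit message;
    project_root plays the role of PROJECT_ROOT (the evals/ directory).
    """
    # setdefault writes only when the key is absent, so a value exported in
    # the shell (or loaded from a .env file beforehand) is kept as an override.
    return os.environ.setdefault("SUPERPOWERS_ROOT", str(project_root.parent))
```

Calling it with the evals/ directory sets the variable to the repository root when it is unset and returns the existing value otherwise, which matches the two test cases listed in the commit message.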