# Manual Testing (Codex App)

Some scenarios cannot run automatically because `drill` has no harness adapter for the target: the Codex App desktop client has no CLI or tmux entry point the way `claude` and `codex` do. These scenarios are marked `manual: true` in their YAML and use a human-in-the-loop protocol.
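
For orientation, a manual scenario file might look roughly like this. The field names (`manual`, `setup`, `turns`, `verify.criteria`) come from this doc; the concrete values and schema details are hypothetical, so consult a real scenario file for the authoritative shape:

```yaml
# Hypothetical sketch, not a real scenario. Field names are taken from this
# doc; values are invented for illustration.
name: worktree-codex-app-notifications
manual: true   # no harness adapter, so use the human-in-the-loop protocol
setup:
  notes: "Fresh checkout of the fixture repo on main; clean tree."
turns:
  - intent: >
      Ask the agent to use the worktree skill to get set up for a
      notifications feature. Do NOT say 'create a worktree'; just
      reference the skill by name.
verify:
  criteria:
    - "Agent creates a worktree rather than branching in place"
    - "Agent reports the worktree path to the user"
```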

## Protocol

Three phases. The agent never runs Codex App directly. The tester never writes a verdict by hand.

1. **Agent prepares the handoff.** Reads the scenario file, renders setup + turn intents into something a human can act on, and hands the package to the tester.
2. **Tester executes.** Sets up the repo fixture, opens Codex App, pastes the prompt, handles any follow-ups, and copies the transcript + final filesystem state back to the agent.
3. **Agent judges and records.** Evaluates the transcript against `verify.criteria`, writes a verdict JSON, and saves it to `results/<scenario>/codex-app/YYYY-MM-DD-manual/verdict.json`.

## Phase 1: Agent prepares the handoff

Deliver as one self-contained message to the tester:

### Fixture state

The exact repo state Codex App should be launched against. Pull from `setup.notes` if present; otherwise translate `setup.helpers` + `setup.assertions` into prose. Include: which repo/directory, which branch, whether to expect a worktree vs. a normal checkout, and any required or forbidden files (e.g. `.gitignore` entries).
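
For example (hypothetical fixture, for illustration): "Fresh clone of the fixture repo, on `main` with a clean tree, normal checkout rather than a worktree, and no `.worktrees/` entry in `.gitignore`."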

### Prompt to paste

Render turn 1's intent as a natural first-person message the tester can paste verbatim into Codex App. Don't leak internal test language like "Do NOT say 'create a worktree'" — that's instruction for the test author, not the end user. Convert it to what a real user would actually type.

Example:

Intent: "Ask the agent to use the worktree skill to get set up for a notifications feature. Do NOT say 'create a worktree' — just reference the skill by name."

Rendered prompt: "hey, can you use the worktree skill to get me set up for a notifications feature?"

### Follow-up guidance

For each additional turn, give the tester a short decision rule — not a verbatim script. E.g. "If the agent asks a clarifying question like branch name, answer concisely. If it stops to ask whether you want a worktree at all, tell it you already asked for the skill and it should proceed."

### What to capture

Ask the tester to paste back:

- Full agent transcript (messages, tool calls, tool outputs)
- Final filesystem state if criteria depend on it (`git worktree list`, directory tree, branch state)
- Any observations they want on the record
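
Assembled, a handoff message might read like this (a sketch; the scenario specifics are hypothetical):

> **Fixture state:** fresh clone of the fixture repo on `main`, clean tree, normal checkout.
>
> **Prompt to paste:** "hey, can you use the worktree skill to get me set up for a notifications feature?"
>
> **Follow-ups:** if it asks for a branch name, answer concisely; if it asks whether you really want a worktree, say you already asked for the skill and it should proceed.
>
> **Capture:** the full transcript, plus `git worktree list` output afterward.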

## Phase 2: Tester executes

  1. Set up the repo fixture per the instructions
  2. Open Codex App in that repo
  3. Paste the prompt
  4. Follow up per the guidance
  5. Copy the transcript + filesystem state back to the agent

## Phase 3: Agent judges and records

For each criterion in `verify.criteria`, write one entry:

```json
{
  "criterion": "<verbatim from scenario>",
  "passed": true | false,
  "evidence": "<quoted snippet from transcript>",
  "rationale": "<only if passed is inconclusive or needs context>"
}
```
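
A filled-in passing entry might look like this (the criterion and transcript snippet are hypothetical, matching the sketch scenario above):

```json
{
  "criterion": "Agent creates a worktree rather than branching in place",
  "passed": true,
  "evidence": "tool call: git worktree add ../notifications -b feature/notifications"
}
```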

Rules:

- Quote the transcript directly in `evidence`. No paraphrasing.
- If a criterion is genuinely inconclusive from the transcript, mark `passed: false` with a `rationale` explaining what was missing. Don't guess.
- Don't grade on intent you can't see. The agent's internal thoughts aren't visible; only messages, tool calls, and results are.
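
And an inconclusive entry, per the second rule (again hypothetical):

```json
{
  "criterion": "Agent reports the worktree path to the user",
  "passed": false,
  "evidence": "(none: transcript ends after the git worktree add tool call)",
  "rationale": "No user-facing message mentions the worktree path, so the criterion cannot be confirmed."
}
```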

### Verdict file

Save to `results/<scenario>/codex-app/YYYY-MM-DD-manual/verdict.json`:

```json
{
  "scenario": "<scenario-name>",
  "backend": "codex-app",
  "manual": true,
  "user_posture": "<spec-aware|naive|...>",
  "passed": <true iff every criterion.passed is true>,
  "criteria": [ ... ],
  "notes": "<optional: cross-criterion observations>"
}
```

Matches the format of the existing `results/worktree-codex-app-detached-head/codex-app/2026-04-09-manual/verdict.json`.
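
Filled in for the hypothetical sketch scenario above (all values invented; note `passed` is false because one criterion failed):

```json
{
  "scenario": "worktree-codex-app-notifications",
  "backend": "codex-app",
  "manual": true,
  "user_posture": "naive",
  "passed": false,
  "criteria": [
    {
      "criterion": "Agent creates a worktree rather than branching in place",
      "passed": true,
      "evidence": "tool call: git worktree add ../notifications -b feature/notifications"
    },
    {
      "criterion": "Agent reports the worktree path to the user",
      "passed": false,
      "evidence": "(none: transcript ends after the git worktree add tool call)",
      "rationale": "No user-facing message mentions the worktree path."
    }
  ],
  "notes": "Worktree was created correctly; the transcript cut off before any final report to the user."
}
```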

## When to invoke

- A scenario's YAML has `manual: true`
- The tester explicitly asks for a manual Codex App run of any scenario
- An automated test result is inconclusive and we want a human-verified cross-check

Do NOT use this procedure for scenarios `drill` can run itself (`claude`, `codex`, `gemini` backends); use `drill run` instead.

## Pitfalls

- **Don't skip the fixture step.** Codex App's default environment (detached HEAD under `$CODEX_HOME/worktrees/`) is load-bearing for worktree scenarios. The same prompt gives different results in a normal checkout.
- **Don't render prompts literally.** Scenario intents are written for test authors; they often contain "Do NOT mention X" style instructions. Translate before handing to the tester.
- **Don't grade on missing evidence.** If the transcript doesn't show the agent doing something the criterion asks about, that's a fail, not a pass-by-default.