Commit Graph

547 Commits

Author SHA1 Message Date
Jesse Vincent
eba16f6b91 Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgment clean, escalation points unstressed) 2026-06-10 17:11:26 -07:00
Jesse Vincent
27788fdef9 Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 recon positive, L3 dead) 2026-06-10 16:59:43 -07:00
Jesse Vincent
9a25a75bac Spec: strict-cost SDD experiment ladder — judgment as co-invariant, plan-side crispness first 2026-06-10 14:35:00 -07:00
Jesse Vincent
60fa4f6fc4 Record writing-plans micro-test result: resolved, no change needed 2026-06-10 14:31:50 -07:00
Jesse Vincent
43a6ee23f7 Spec: record iterations 4-5 (variance honesty, structural fixes, final validated ranges) 2026-06-10 13:08:40 -07:00
Jesse Vincent
fe90d6c469 Adopt audited positive phrasings: evidence rule leads positive; fix-report completeness as checklist 2026-06-10 13:08:19 -07:00
Jesse Vincent
b81f35bb1e Land eval-tuned combo: file handoffs, progress ledger, final-review package, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
2026-06-10 13:08:06 -07:00
Jesse Vincent
926096a1d7 Spec: positive-instruction redesign — audit results, micro-test method, writing-plans variants 2026-06-10 12:32:06 -07:00
Jesse Vincent
a995af2e24 Shared: unique review-package collateral names 2026-06-10 10:18:02 -07:00
Jesse Vincent
d4dbf44162 Add review-package script; close fix-dispatch test gap
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
2026-06-10 08:51:16 -07:00
Jesse Vincent
2434ef7f35 Describe the review design as current state, not as a delta
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
2026-06-10 08:28:28 -07:00
Jesse Vincent
7cf78437e2 Spec: record iterations 2-3 results and final frozen-config matrix 2026-06-10 05:06:59 -07:00
Jesse Vincent
e355795625 Hand reviewers the diff as a file, not a paste
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
2026-06-10 03:44:19 -07:00
Jesse Vincent
29ee4e8e44 Reviewer skepticism covers the implementer's design rationales
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
2026-06-10 02:20:28 -07:00
Jesse Vincent
28498a5cde Make diff-pasting non-optional for task reviewer dispatch
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
2026-06-10 02:10:34 -07:00
Jesse Vincent
5e2907fc4f Close the Minor-severity escape hatch
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
2026-06-10 02:09:10 -07:00
Jesse Vincent
e532f24df7 Spec: document cost iterations and the per-task review consolidation 2026-06-09 23:59:22 -07:00
Jesse Vincent
e3c74fc1c9 Merge per-task reviews into one task reviewer (iteration 2)
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance //⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
2026-06-09 23:58:28 -07:00
Jesse Vincent
3e3e1e701e Cut review-cost drivers: turn-aware models, inline diffs, scoped evidence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
2026-06-09 22:42:54 -07:00
Jesse Vincent
853396e3ae Add phrase-level pre-judging triggers to reviewer prompt rule
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
2026-06-09 21:49:51 -07:00
Jesse Vincent
83d54f7ddd Red Flags: never tell a reviewer what not to flag or pre-rate severity
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
2026-06-09 21:47:41 -07:00
Jesse Vincent
c7900f1698 Close three review blind spots found by defect tracing
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
2026-06-09 21:19:08 -07:00
Jesse Vincent
5cfdb75b94 Require explicit model on subagent dispatch
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
2026-06-09 21:11:45 -07:00
Jesse Vincent
87825ff193 Forbid controllers pre-judging reviewer findings
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
2026-06-09 18:28:24 -07:00
Jesse Vincent
09cb4d7361 Sync plan: escaped pre() pattern in Task 5 checks block 2026-06-09 18:19:00 -07:00
Jesse Vincent
b3bb9a68d7 Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block 2026-06-09 17:21:06 -07:00
Jesse Vincent
71dc271a08 Sync plan's Task 5 blocks with review fixes 2026-06-09 17:13:03 -07:00
Jesse Vincent
5aea3dca31 SDD controller: reviewer prompt budgets, ⚠️ handling, final-review pointer, model judgment 2026-06-09 16:59:05 -07:00
Jesse Vincent
b3281c0227 Implementer prompt: re-run covering tests after fixing review findings 2026-06-09 16:56:28 -07:00
Jesse Vincent
c14c1de552 Scope spec reviewer's Your Job wording to the diff 2026-06-09 16:55:28 -07:00
Jesse Vincent
be8a6269c4 Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict channel 2026-06-09 16:53:30 -07:00
Jesse Vincent
da41209243 Use bare placeholder names in quality reviewer prompt body 2026-06-09 16:51:54 -07:00
Jesse Vincent
2cc449b6d4 Make per-task quality reviewer prompt self-contained and task-scoped 2026-06-09 16:47:27 -07:00
Jesse Vincent
f8dcd1ed3d Add implementation plan for task-scoped review dispatch 2026-06-09 16:42:50 -07:00
Jesse Vincent
4192572d19 Harden review-dispatch spec per adversarial review findings 2026-06-09 16:33:44 -07:00
Jesse Vincent
5da15d7eba Add design spec: task-scoped review dispatch for SDD 2026-06-09 16:26:00 -07:00
Jesse Vincent
f55642e0dd Require contributors to disclose authoring environment and target dev
Add a mandatory self-identification disclosure (model, harness, harness
version, all installed plugins) to the PR template and all three issue
templates, and document the requirement in the contributor guidelines.
We weigh contributions differently depending on what produced them:
content reasoned from documentation is held to a different bar than work
grounded in a real session.

Also state explicitly, in both CLAUDE.md and the PR template, that all
PRs must target the dev branch rather than main.
2026-06-08 22:14:34 -07:00
Drew Ritter
ae1eefb7f9 chore(evals): bump submodule to --scenarios filter (ff3ee83)
Adds `run-all --scenarios` for resuming a scenario subset across the Code
Assist rate-limit windows. Follows the agy rate-limit fix (79f9963).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 22:46:00 -07:00
Drew Ritter
617168aff5 chore(evals): bump submodule to antigravity rate-limit fix (79f9963)
Serialize antigravity against the Gemini Code Assist rate limit
(max_concurrency=1), diagnose 429/RESOURCE_EXHAUSTED honestly instead of as
auth, fail-fast on a latched window, and tolerant preflight OK match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 16:27:35 -07:00
Rahul
d7c260a978 fix(brainstorming): cap websocket frame payloads 2026-06-02 11:24:02 -07:00
Drew Ritter
f3f0789c5c Add shell lint script 2026-06-01 19:48:28 -07:00
Drew Ritter
16a1719988 Tighten Kimi plugin porting coverage 2026-06-01 19:41:58 -07:00
Drew Ritter
c74c22daa7 docs: restore Kimi direct install command 2026-06-01 19:41:58 -07:00
Drew Ritter
773bbf61d6 docs: simplify Kimi README install steps 2026-06-01 19:41:58 -07:00
Drew Ritter
6b76158550 fix: wire Kimi plugin into release metadata 2026-06-01 19:41:58 -07:00
Drew Ritter
7fec40bb55 fix: align Kimi manifest with supported fields 2026-06-01 19:41:58 -07:00
qer
2a8e54735b feat: add Kimi Code plugin manifest 2026-06-01 19:41:58 -07:00
Matt Van Horn
f776394360 feat(subagent-dev): add TDD RED evidence to implementer report format
Add a conditional TDD Evidence field to the implementer report format so controllers can verify RED and GREEN output when TDD was required.

The field asks for the command run, relevant RED/GREEN output, and the expected RED failure reason rather than raw full logs.

Fixes #994.
2026-06-01 16:15:05 -07:00
Drew Ritter
7301c81b4d docs(windows): trim polyglot hook implementation copy 2026-06-01 16:07:01 -07:00
dev_Hakaze
9d3e68a5ad docs(windows): update polyglot hook docs
Rewrite the Windows polyglot hook documentation to match the current run-hook.cmd dispatcher and update the porting guide cross-reference.\n\nFixes #1653.
2026-06-01 15:57:30 -07:00