Coding, research, receipts, diagnostics

The part that turns GPT 5.5 into a reliable builder and research platform.

COTW Scout is not just a memory layer for conversation. Code mode uses the same continuity substrate to keep a builder agent oriented across specs, files, experiments, restarts, correction loops, and runtime diagnostics. It gives beginners a guided path, gives experienced builders a disciplined workbench, and turns live agent work into structured evidence for improvement.

The core difference: work is not trusted because the agent said it happened. Work is trusted when there is a receipt, a changed artifact, a test, a handoff, or a reviewable proposal attached to it.

1 / Shape: Idea to PRD. The PRD skill turns rough intent into objective, requirements, constraints, journeys, and verification criteria.
2 / Work: Threaded execution. Code mode keeps the project in an infinite thread with handoffs, tool outputs, and current-state context.
3 / Research: Evidence loops. Deep research and direct sprints force hypothesis, experiment, result, interpretation, and next move.
4 / Verify: Receipts and tests. Claims route through files, diffs, screenshots, logs, tests, or explicit missing gates.
5 / Diagnose: Research diagnostics. Repeated friction becomes failure signatures, process scores, relabel candidates, and reviewable proposal receipts.

Why it helps novices

The harness teaches project shape without requiring the user to already know how to run a software project.

Guided structure

The agent creates the missing rails.

A rough idea becomes a project folder, a PRD, a plan, acceptance criteria, and a visible next step. The user does not have to know what a good spec looks like before the work can begin.

Continuity

The work survives normal human interruption.

Infinite threads, handoff files, attachment receipts, and source handles mean a project can survive restarts, context compaction, and parallel sessions without restarting from vibes.

Why it helps experienced builders

The same rails scale down to simple projects and up to serious work.

Receipts

Every serious claim gets an artifact trail.

The harness favors diffs, tests, logs, screenshots, benchmark output, and sprint receipts over polished summaries. That makes review easier and narrows where a claim can be wrong.
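
The receipt rule can be pictured as a small check. This is only a sketch: the `Claim` type, the `EVIDENCE_KINDS` set, and the handle strings are hypothetical names chosen for illustration, not the harness's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical evidence kinds, mirroring the artifact types listed above.
EVIDENCE_KINDS = {"diff", "test", "log", "screenshot", "benchmark", "sprint_receipt"}


@dataclass
class Claim:
    text: str
    # Each evidence item is (kind, handle), e.g. ("test", "tests/test_login.py").
    evidence: list[tuple[str, str]] = field(default_factory=list)


def is_trusted(claim: Claim) -> bool:
    """A claim counts as trusted only if a recognized artifact backs it."""
    return any(kind in EVIDENCE_KINDS and bool(handle) for kind, handle in claim.evidence)


backed = Claim("Login bug fixed", [("test", "tests/test_login.py::test_redirect")])
bare = Claim("Login bug fixed")
print(is_trusted(backed), is_trusted(bare))  # True False
```

A polished summary with no artifact attached fails the same check as a bare assertion, which is the point of the rule.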

Research

Open-ended work becomes bounded sprints.

Direct research sprints require a sharp question, a hunch, a bounded experiment or inspection, a readout, an interpretation, and a written receipt before continuing.
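
One way to picture the sprint discipline is a receipt that cannot be closed while a step is blank. The `SprintReceipt` type and its helpers below are assumptions made for illustration; the real receipt format is not described here.

```python
from dataclasses import dataclass, fields


@dataclass
class SprintReceipt:
    # Hypothetical fields, one per step named in the text above.
    question: str
    hunch: str
    experiment: str
    readout: str
    interpretation: str


def missing_steps(receipt: SprintReceipt) -> list[str]:
    """Return the names of sprint steps that are still blank."""
    return [f.name for f in fields(receipt) if not getattr(receipt, f.name).strip()]


def may_continue(receipt: SprintReceipt) -> bool:
    """The next sprint may start only when every step is written down."""
    return not missing_steps(receipt)


draft = SprintReceipt(
    question="Does caching the parsed spec cut startup time?",
    hunch="Parsing dominates startup",
    experiment="Time startup with and without the cache",
    readout="",
    interpretation="",
)
print(missing_steps(draft))  # ['readout', 'interpretation']
print(may_continue(draft))   # False
```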

Research platform layer

COTW is careful about the word "evolution". The model weights do not change, and the agent does not get blanket authority over its own scaffold. The new layer is best understood as a diagnostics and research platform first: a way to make live agent behavior inspectable, comparable, and useful for later evaluation or post-training work.

What exists now: Code Evolution records Code mode sessions, tool calls, outcomes, scaffold version, and satisfaction/correction signals. Runtime diagnostics add exchange IDs, stream cleanup traces, plugin timing, memory pressure, retention classification, and session health receipts. Harness Refiner reads that evidence as a research layer.

What the platform accrues: trajectory windows, failure signatures, process scores, cognitive-surprise events, relabel candidates, teacher-relabel receipts, research digests, scenario replay results, training-health receipts, and redacted research bundles for future training runs.

Why that matters: the harness is not just logging chat history. It is collecting the kind of structured evidence a future fine-tuning, adapter, or evaluation pass would need: what failed, which exchange produced it, what context was present, how the process scored, what a teacher repair would look like, and which artifacts were excluded.

Why proposals appear at all: some diagnostics point at repeated workflow friction. Only bounded low-risk workflow and tool-hint candidates can route through the existing Evolve scaffold promotion gate, where promotion writes before/after hashes, a rollback path, and an outcome receipt.

What it does not do by itself: mutate protected identity, change runtime configuration, launch training, promote adapters, change model routing, grant tool authority, or inject new prompt rules without a separate operator-owned gate.
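
A minimal sketch of what a promotion receipt with before/after hashes and a rollback path could look like. The `PromotionReceipt` type, the `promote` and `rollback` functions, and the choice to keep the previous text in memory as the rollback path are all assumptions for illustration, not the Evolve gate's actual interface.

```python
import hashlib
from dataclasses import dataclass


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PromotionReceipt:
    before_hash: str
    after_hash: str
    rollback_text: str  # previous content, kept so the change can be undone
    outcome: str


def promote(current: str, proposed: str) -> tuple[str, PromotionReceipt]:
    """Apply a proposed scaffold note and record what is needed to undo it."""
    receipt = PromotionReceipt(
        before_hash=sha256_text(current),
        after_hash=sha256_text(proposed),
        rollback_text=current,
        outcome="promoted" if proposed != current else "no_change",
    )
    return proposed, receipt


def rollback(receipt: PromotionReceipt) -> str:
    """Restore the previous text and verify it against the recorded hash."""
    restored = receipt.rollback_text
    assert sha256_text(restored) == receipt.before_hash
    return restored


old = "Retry a failed tool call once."
new = "Retry a failed tool call once, then report the error."
live, receipt = promote(old, new)
print(receipt.outcome)           # promoted
print(rollback(receipt) == old)  # True
```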

Exchange spine

One turn can be reconstructed.

Gateway events, renderer stream events, tool calls, attachment receipts, continuity records, evolution receipts, and Refiner windows can carry a shared exchange identity. Subsystems keep their own IDs, but diagnostics get a first-class join key.
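
The join-key idea can be sketched in a few lines. The record shape here (`exchange_id`, `source`, `id`, `ts`) is a hypothetical one chosen for the example; each subsystem keeps its own `id`, and the shared `exchange_id` is what lets a turn be reassembled.

```python
from collections import defaultdict


def reconstruct_turns(records):
    """Group records from different subsystems by their shared exchange_id."""
    turns = defaultdict(list)
    for record in records:
        turns[record["exchange_id"]].append(record)
    # Order each turn's records by timestamp so the sequence of events is readable.
    return {ex: sorted(rs, key=lambda r: r["ts"]) for ex, rs in turns.items()}


records = [
    {"exchange_id": "ex-1", "source": "tool_call", "id": "tool-9", "ts": 2},
    {"exchange_id": "ex-1", "source": "gateway", "id": "gw-17", "ts": 1},
    {"exchange_id": "ex-2", "source": "gateway", "id": "gw-18", "ts": 3},
]
turn = reconstruct_turns(records)["ex-1"]
print([r["source"] for r in turn])  # ['gateway', 'tool_call']
```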

Retention

Logs stop being a mystery pile.

Runtime artifacts are classified as hot, warm, cold, research export, or excluded. The live app stays bounded while research-useful material can be preserved with source labels, hashes, redaction status, and explicit training approval state.

Cognitive layer

State is observed separately from memory.

The cognitive layer records per-turn latent state and prediction surprise. It is a diagnostic signal for loop detection, candidate flagging, and Refiner scoring; it is not trusted memory and does not enter the prompt as a fact claim.

Callable triage

The same machinery helps debugging.

A developer-facing diagnostic can inspect the last response, an exchange, a session, gateway health, plugin health, or a research bundle. That makes production errors and training-readiness questions use the same evidence path.

Harness Refiner as diagnostics

The new Refiner is easiest to understand as the bridge between live agent work and future training or scaffold improvements. It preserves evidence first, structures candidate training data second, then asks what kind of improvement is actually allowed.

Failure signatures

It names recurring friction.

Detectors look for repeated tool failures, tool loops, correction-not-integrated patterns, mode bleed, receipt mismatch, ungrounded recommendations, low-surprise drift, and cognitive-state anomalies.
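
As an illustration of one such detector, a tool loop could be flagged when the same call repeats within a window. The function name, the call shape, and the threshold of three are assumptions for this sketch; the actual detectors and their thresholds are not specified here.

```python
from collections import Counter


def detect_tool_loop(tool_calls, threshold=3):
    """Flag (tool, args) pairs repeated at least `threshold` times in a window."""
    counts = Counter((call["tool"], call["args"]) for call in tool_calls)
    return sorted(key for key, n in counts.items() if n >= threshold)


window = [
    {"tool": "read_file", "args": "config.yaml"},
    {"tool": "read_file", "args": "config.yaml"},
    {"tool": "run_tests", "args": "tests/"},
    {"tool": "read_file", "args": "config.yaml"},
]
print(detect_tool_loop(window))  # [('read_file', 'config.yaml')]
```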

Process scoring

It scores the work, not the vibe.

Process diagnostics track format compliance, action correctness, grounding, reasoning quality, task progress, correction uptake, no-confabulation, handoff quality, mode containment, and user-burden reduction.

Training runway

It accrues future-training data.

Low-scoring windows can become relabel candidates and teacher-relabel receipts. Each packet carries source handles, score axes, harness/model hashes, inclusion status, and explicit non-authorization for training launch.
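
A sketch of how a low-scoring window might turn into a relabel candidate. The unweighted mean, the 0.5 threshold, and the field names are illustrative assumptions; the packet below also carries only a subset of the fields listed above (source handles and harness/model hashes are omitted for brevity).

```python
from statistics import mean


def relabel_candidate(window_id, scores, threshold=0.5):
    """Return a candidate packet for a low-scoring window, else None."""
    overall = mean(scores.values())
    if overall >= threshold:
        return None
    return {
        "window_id": window_id,
        "score_axes": dict(scores),
        "overall": overall,
        "inclusion_status": "pending_review",
        # Explicit non-authorization: a candidate never launches training by itself.
        "training_launch_authorized": False,
    }


weak = relabel_candidate(
    "win-7", {"grounding": 0.2, "task_progress": 0.4, "correction_uptake": 0.3}
)
strong = relabel_candidate("win-8", {"grounding": 0.9, "task_progress": 0.8})
print(weak["training_launch_authorized"], strong)  # False None
```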

Exportable evidence

It can share the trail without leaking the whole camp.

Research bundle export writes redacted JSONL artifacts and a manifest hash so the system can be reviewed, replayed, evaluated, or prepared for later training experiments without dumping raw private state.
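
The export step can be sketched with the standard library. Everything named here is an assumption for illustration: the `export_bundle` function, the record shape, and a redaction pass that only masks email addresses (a real pass would cover far more).

```python
import hashlib
import json
import re

# Illustrative redaction rule only: mask anything that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def export_bundle(records):
    """Return (jsonl_text, manifest) for a list of dict records with a 'text' field."""
    lines = [
        json.dumps({**r, "text": redact(r["text"])}, sort_keys=True)
        for r in records
    ]
    jsonl = "\n".join(lines) + "\n"
    manifest = {
        "record_count": len(lines),
        # Hash of the exact bytes written, so a reviewer can verify the bundle.
        "sha256": hashlib.sha256(jsonl.encode("utf-8")).hexdigest(),
        "redacted": True,
    }
    return jsonl, manifest


jsonl, manifest = export_bundle(
    [
        {"exchange_id": "ex-1", "text": "Contact ada@example.com about the failing test"},
        {"exchange_id": "ex-2", "text": "Retry succeeded after the fix"},
    ]
)
print(manifest["record_count"])  # 2
print("@" in jsonl)              # False
```

Hashing the serialized bytes rather than the in-memory records means the manifest still verifies after the file has been copied elsewhere.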

What can refine itself now

The platform can support low-level recursive improvement, but the lane matters. The agent may help improve workflow ergonomics; it may not silently rewrite who it is or how the runtime is trusted.

Allowed lane

Low-risk workflow improvements.

Tool-use hints, workflow checklists, diagnostics prompts, retry discipline, handoff quality reminders, and narrow scaffold notes can be proposed and promoted through the existing Evolve gate with evidence and rollback.

Protected lane

Operator approval stays required.

Identity files, prompt authority, memory truth status, runtime configuration, tool permissions, model routing, adapter promotion, and training launches remain outside automatic mutation. They can be diagnosed and proposed, not self-applied.
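
The two lanes amount to a routing decision. In this sketch the category names are hypothetical labels derived from the lists above, and sending unknown categories to the operator (failing closed) is an assumption, not a documented behavior.

```python
# Hypothetical category labels derived from the two lanes described above.
ALLOWED_LANE = {
    "tool_hint", "workflow_checklist", "diagnostics_prompt",
    "retry_discipline", "handoff_reminder", "scaffold_note",
}
PROTECTED_LANE = {
    "identity_file", "prompt_authority", "memory_truth_status", "runtime_config",
    "tool_permission", "model_routing", "adapter_promotion", "training_launch",
}


def route_proposal(category: str) -> str:
    """Decide which gate a proposal must pass before it can be applied."""
    if category in ALLOWED_LANE:
        return "evolve_gate"
    # Protected and unknown categories both fail closed to the operator.
    return "operator_approval"


print(route_proposal("tool_hint"))      # evolve_gate
print(route_proposal("model_routing"))  # operator_approval
print(route_proposal("something_new"))  # operator_approval
```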

Future training runway

The research platform is useful before training exists because it makes training data boringly inspectable. When post-training resumes, it should start from selected, redacted, scored, and reproducible shards.

Rollout diagnostics: model/checkpoint/adapter hash, scaffold version, context hash, tool/action emitted, parse success, tool result, failure signature, and receipt handles.

Process scoring: separate axes for grounding, task progress, correction uptake, format, reasoning quality, handoff quality, no-confabulation, and user-burden reduction.

Teacher relabel receipts: original student turn, score breakdown, teacher repair, explanation/diff, inclusion decision, shard id, manifest hash, and training approval state.

Training health receipts: CUDA/runtime readiness, trainer heartbeat, GPU utilization, step counter, loss/grad norm/LR, checkpoint timestamps, adapter hashes, and fixed eval before/after.

Free and hardened

Much of what people pay hosted memory systems for is already present locally here, with a stronger provenance story.

Free substrate

Local memory is not a metered service.

SQLite, sqlite-vec, FTS5, Markdown identity files, handoffs, receipts, and evolution ledgers live on the machine. The expensive part is the reasoning model, not the continuity substrate.
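
To show how little machinery local full-text recall needs, here is a small FTS5 example using Python's built-in `sqlite3` module. The `notes` table and its contents are invented for illustration and are not the harness's schema; it also assumes the local SQLite build includes FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; requires an SQLite build with FTS5 compiled in.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(kind, body)")
conn.executemany(
    "INSERT INTO notes (kind, body) VALUES (?, ?)",
    [
        ("handoff", "Next step: wire the login form to the auth endpoint"),
        ("receipt", "Test run passed after the login fix"),
        ("handoff", "Benchmark results stored under reports"),
    ],
)
rows = conn.execute(
    "SELECT kind, body FROM notes WHERE notes MATCH ? ORDER BY rank",
    ("login",),
).fetchall()
kinds = sorted(kind for kind, _ in rows)
print(kinds)  # ['handoff', 'receipt']
```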

Hardened by boring things

Tests, gates, logs, and rollback paths.

The system is intentionally less magical than it sounds: source handles, audit receipts, mutation gates, runtime health checks, and verification criteria keep the harness inspectable.