Round-2 review fixes for PR #234 (#184 autonomous agent runs).
F6 (stability): finalizeRun no longer drops the in-memory entry before the
terminal write. It now UPDATEs first with a bounded retry; only on success does
it arm the idempotency once-gate (a new `settled` set keyed on "row already
terminal", not "entry deleted") and free the chat's active slot. If every
attempt fails the entry is RETAINED and the run left unsettled so a later
finalize / requestStop->onAbort / sweep can retry — a transient blip can no
longer strand a run 'running' and 409 every future turn in the chat. Idempotency
preserved (double-settle still collapses to a single write).
F7 (regression from F2): int-spec constructs AiChatRunService with the 2nd
EnvironmentService arg ({ isCloud: () => false }) so the file type-checks and all
integration tests compile+run again.
F8 (regression from F1): the windowed "stale but not fresh" case now calls
sweepRunning({ staleMs: SWEEP_RUN_STALE_MS }); added an int-level variant-C case
proving the no-arg boot sweep aborts even a FRESH running run.
F9 (coverage): run-race spec now captures streamText's options and invokes
onStepFinish/onFinish/onAbort/onError, asserting the #184 run hooks
(onStep / onSettled completed|aborted|error) fire with the right args.
F10 (docs): added an autonomousRuns single-instance-only note to .env.example so
the warnIfMultiInstance JSDoc reference is accurate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make an agent turn a first-class, server-side RUN that keeps executing and
persisting its steps after the browser window closes, and that a later client
can reconnect to — the core invariant of #184. Phase 1 only; the full proposal
(cross-process BullMQ runner, resumable live-tail transport, autonomy triggers,
budgets, history compaction) is explicitly deferred.
What lands:
- `ai_chat_runs` lifecycle table + repo: the run as a persistent object
(status pending->running->succeeded|failed|aborted, trigger, createdBy,
assistantMessageId projection link, error, step_count, timings). A partial
unique index enforces ONE ACTIVE run per chat; a startup sweep recovers
dangling runs (mirrors #183's sweepStreaming).
- AiChatRunService: owns the run lifecycle + an in-memory abort registry. The
abort is governed by the RUN (an explicit user stop), NOT the HTTP socket —
so a browser disconnect no longer ends the turn. Reuses #183's socket-
independent durable write path (consumeStream + flushAssistant) unchanged.
- Controller, behind `settings.ai.autonomousRuns`: /stream wraps the turn in a
run and does NOT abort on disconnect (logs only); a clean 409 rejects a
concurrent run on the same chat; new POST /ai-chat/stop (explicit stop) and
POST /ai-chat/run (reconnect -> latest persisted run + its projection). The
runId is surfaced on the streamed start metadata. Flag OFF = byte-for-byte
legacy behavior.
Tests: AiChatRunService unit spec (lifecycle, disconnect != stop, explicit
stop aborts the signal, best-effort sweeps); ai_chat_runs integration spec
(one-active-run index, detached persist+reconnect with no subscriber, explicit
stop, stale-run sweep). Server tsc + build clean; touched jest green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>