F12 [suggestion]: finalizeRun's "all retries exhausted" path only logged
per-attempt warns ("attempt 3/3") then silently restored the in-memory
entry, giving no clear signal that the run row was left non-terminal
('running') pending recovery. Emit ONE greppable ERROR with context
(runId, chatId, final error) on give-up, matching the import-attachment
retry-loop pattern, so an operator can tell a survived blip from a give-up.
F13 [suggestion]: the "ORDER MATTERS (F6)" doc overclaimed that a later
settle "can retry" the terminal write as an in-process retrier. Correct it:
in-process retry is only POSSIBLE (not guaranteed) and only once the entry
is restored AND a fresh settler arrives afterwards; a concurrent settler in
the retry window is consumed at the synchronous active.delete claim, and the
no-streamText path has no second settler at all. The UNCONDITIONAL backstop
in every case is the boot sweep on the next restart.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The F6 once-gate was non-atomic: `settled.has` was read BEFORE the awaited
terminal UPDATE and `settled.add` only after, so two concurrent finalizeRun
calls for the same run (the documented safety-net catch vs a streamText
terminal callback) both passed the check and both wrote the terminal row —
double-write + last-write-wins status clobber, a window the bounded retry only
widened.
Restore a SYNCHRONOUS atomic claim before any await: capture the entry, then
`active.delete` as a check-and-clear in one tick. The first caller claims and
proceeds; a concurrent second caller finds the entry gone and returns at the
claim, before any UPDATE. On a successful write we arm `settled` (post-write
idempotency gate) and do not restore; on total bounded-retry failure we restore
the claimed entry so a retrier can complete it — never both write and restore.
Also fix the F6(b) JSDoc/comment to not overclaim an in-process retrier on the
no-streamText path: there the only settler is the safety-net, so recovery on
total UPDATE failure is the unconditional boot sweep on the next restart.
Adds a concurrency test firing two simultaneous finalizeRun on one run (update
held on a pending promise) asserting update is called EXACTLY ONCE; existing F6
retry-rides-transient + retain-on-total-failure tests stay green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-2 review fixes for PR #234 (#184 autonomous agent runs).
F6 (stability): finalizeRun no longer drops the in-memory entry before the
terminal write. It now UPDATEs first with a bounded retry; only on success does
it arm the idempotency once-gate (a new `settled` set keyed on "row already
terminal", not "entry deleted") and free the chat's active slot. If every
attempt fails the entry is RETAINED and the run left unsettled so a later
finalize / requestStop->onAbort / sweep can retry — a transient blip can no
longer strand a run 'running' and 409 every future turn in the chat. Idempotency
preserved (double-settle still collapses to a single write).
F7 (regression from F2): int-spec constructs AiChatRunService with the 2nd
EnvironmentService arg ({ isCloud: () => false }) so the file type-checks and all
integration tests compile+run again.
F8 (regression from F1): the windowed "stale but not fresh" case now calls
sweepRunning({ staleMs: SWEEP_RUN_STALE_MS }); added an int-level variant-C case
proving the no-arg boot sweep aborts even a FRESH running run.
F9 (coverage): run-race spec now captures streamText's options and invokes
onStepFinish/onFinish/onAbort/onError, asserting the #184 run hooks
(onStep / onSettled completed|aborted|error) fire with the right args.
F10 (docs): added an autonomousRuns single-instance-only note to .env.example so
the warnIfMultiInstance JSDoc reference is accurate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F1 (DECISION C): make the crash-recovery boot sweep UNCONDITIONAL. A fast
restart (deploy/OOM within the old 10-min window of the last step) left a run
stuck `running` forever, and the one-active-run gate then 409'd every future
turn in that chat. On a fresh single-process boot any pending|running run is
definitionally hung, so onModuleInit now settles ALL of them to `aborted` with
no staleness window. AiChatRunRepo.sweepRunning takes an optional { staleMs }
window, kept ONLY for the future phase-2 multi-instance timer sweep (the boot
path passes no window). Repo + service tests assert a fresh `running` run
(updatedAt = now) is settled, not skipped.
F2 (DECISION A): treat phase-1 autonomousRuns as SINGLE-INSTANCE-ONLY. Stop and
its AbortController are process-local, so cross-instance Stop is unreliable
(phase 2). AiChatRunService now logs a startup WARNING when a horizontally-scaled
deployment is detected — via EnvironmentService.isCloud() (CLOUD=true), the only
horizontal-scaling signal this codebase has (the socket.io Redis adapter is
always wired since REDIS_URL is mandatory, so it is not a discriminator). The
constraint is documented in AGENTS.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CRITICAL: any failure between a successful beginRun and streamText's terminal
callbacks taking ownership (the bare awaits: user-message insert, history load,
convertToModelMessages, settings resolve; the buildSystemPrompt/forUser block;
and synchronous streamText wiring) left ai_chat_runs stuck 'running' forever
(sweepRunning only runs at startup), which then 409'd every future turn in the
chat and made the observer tab poll forever. Wrap the body of stream() after
beginRun in a safety-net try/catch that settles the run to 'error' (via
onSettled) before rethrowing, and make finalizeRun idempotent (active.delete is
the once-guard) so a settle here and a settle from a streamText callback collapse
to a single terminal write.
Also from review comment 2519:
- correct three client comments that falsely claimed /ai-chat/run is "flag-gated
server-side and would 403" — it is owner-gated only; with the feature off the
chat simply has no runs so the endpoint returns { run: null }
(ai-chat-window.tsx, ai-chat-service.ts, ai-chat-query.ts).
- remove the dead UpdatableAiChatRun type (zero usages; the repo update uses an
inline Partial<...>).
- add controller specs for POST /ai-chat/run and /ai-chat/stop (owner-gating,
run:null when no run, run+message, stop by runId and by chatId).
- add tests: an exception after beginRun settles the run to 'error' and drops the
in-memory entry (next turn is not 409'd); finalizeRun is idempotent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "one active run per chat" guard was bypassable under a race. Two
simultaneous POST /ai-chat/stream on the same chat both passed the
controller's pre-hijack 409 check (a check-then-act TOCTOU), then the
loser's INSERT into ai_chat_runs hit the partial unique index
(ai_chat_runs_one_active_per_chat, 23505). That error was SWALLOWED, so
the second turn streamed UNTRACKED: no runId, not targetable by /stop,
and (autonomousRuns on) onClose won't abort it -> an orphan unstoppable
run that also spends provider tokens.
Make the unique-index INSERT the authoritative gate:
- AiChatRunService.beginRun: when the run-row INSERT fails with a 23505 on
ONE_ACTIVE_RUN_PER_CHAT_INDEX (via isUniqueViolation/violatedConstraint),
no longer swallow it -> throw a distinct RunAlreadyActiveError. Any other
error (incl. a 23505 on a different constraint) propagates unchanged.
- AiChatService.stream: when begin throws RunAlreadyActiveError, reject the
turn with a 409 ConflictException (code A_RUN_ALREADY_ACTIVE) BEFORE any
AI/provider call -> no tokens spent, no untracked turn. Other begin
failures keep the legacy best-effort fallback (stream socket-bound).
- ai-chat.controller: post-hijack catch honors an HttpException's real
status/body (clean 409) instead of a blanket 500, since the race 409 is
raised before a byte is written. Pre-check 409 now carries the same code.
The controller's cheap pre-check stays as a fast-path for the common
sequential double-submit; the INSERT violation is the race-safe backstop.
Tests: ai-chat-run.service.spec proves beginRun throws RunAlreadyActiveError
on the active-index 23505 (and only that constraint), leaks no controller,
and an integration-style two-concurrent-begins test where exactly one wins;
new ai-chat.service.run-race.spec proves stream rejects with a 409
ConflictException BEFORE any streamText/generateText and never persists an
untracked turn. The latter fails without the fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make an agent turn a first-class, server-side RUN that keeps executing and
persisting its steps after the browser window closes, and that a later client
can reconnect to — the core invariant of #184. Phase 1 only; the full proposal
(cross-process BullMQ runner, resumable live-tail transport, autonomy triggers,
budgets, history compaction) is explicitly deferred.
What lands:
- `ai_chat_runs` lifecycle table + repo: the run as a persistent object
(status pending->running->succeeded|failed|aborted, trigger, createdBy,
assistantMessageId projection link, error, step_count, timings). A partial
unique index enforces ONE ACTIVE run per chat; a startup sweep recovers
dangling runs (mirrors #183's sweepStreaming).
- AiChatRunService: owns the run lifecycle + an in-memory abort registry. The
abort is governed by the RUN (an explicit user stop), NOT the HTTP socket —
so a browser disconnect no longer ends the turn. Reuses #183's socket-
independent durable write path (consumeStream + flushAssistant) unchanged.
- Controller, behind `settings.ai.autonomousRuns`: /stream wraps the turn in a
run and does NOT abort on disconnect (logs only); a clean 409 rejects a
concurrent run on the same chat; new POST /ai-chat/stop (explicit stop) and
POST /ai-chat/run (reconnect -> latest persisted run + its projection). The
runId is surfaced on the streamed start metadata. Flag OFF = byte-for-byte
legacy behavior.
Tests: AiChatRunService unit spec (lifecycle, disconnect != stop, explicit
stop aborts the signal, best-effort sweeps); ai_chat_runs integration spec
(one-active-run index, detached persist+reconnect with no subscriber, explicit
stop, stale-run sweep). Server tsc + build clean; touched jest green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>