gitmost

Author	SHA1	Message	Date
claude code agent 227	0ecddce748	fix(ai-chat): explicit give-up ERROR + accurate retry-window comment (#184 round-4) F12 [suggestion]: finalizeRun's "all retries exhausted" path only logged per-attempt warns ("attempt 3/3") then silently restored the in-memory entry, giving no clear signal that the run row was left non-terminal ('running') pending recovery. Emit ONE greppable ERROR with context (runId, chatId, final error) on give-up, matching the import-attachment retry-loop pattern, so an operator can tell a survived blip from a give-up. F13 [suggestion]: the "ORDER MATTERS (F6)" doc overclaimed that a later settle "can retry" the terminal write as an in-process retrier. Correct it: in-process retry is only POSSIBLE (not guaranteed) and only once the entry is restored AND a fresh settler arrives afterwards; a concurrent settler in the retry window is consumed at the synchronous active.delete claim, and the no-streamText path has no second settler at all. The UNCONDITIONAL backstop in every case is the boot sweep on the next restart. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 02:13:29 +03:00
claude code agent 227	9ad3931a1c	fix(ai-chat): make finalizeRun once-gate atomic against concurrent settle (#184 round-3) The F6 once-gate was non-atomic: `settled.has` was read BEFORE the awaited terminal UPDATE and `settled.add` only after, so two concurrent finalizeRun calls for the same run (the documented safety-net catch vs a streamText terminal callback) both passed the check and both wrote the terminal row — double-write + last-write-wins status clobber, a window the bounded retry only widened. Restore a SYNCHRONOUS atomic claim before any await: capture the entry, then `active.delete` as a check-and-clear in one tick. The first caller claims and proceeds; a concurrent second caller finds the entry gone and returns at the claim, before any UPDATE. On a successful write we arm `settled` (post-write idempotency gate) and do not restore; on total bounded-retry failure we restore the claimed entry so a retrier can complete it — never both write and restore. Also fix the F6(b) JSDoc/comment to not overclaim an in-process retrier on the no-streamText path: there the only settler is the safety-net, so recovery on total UPDATE failure is the unconditional boot sweep on the next restart. Adds a concurrency test firing two simultaneous finalizeRun on one run (update held on a pending promise) asserting update is called EXACTLY ONCE; existing F6 retry-rides-transient + retain-on-total-failure tests stay green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 01:34:43 +03:00
claude code agent 227	97250ac1d1	fix(ai-chat): harden run finalize + restore int-spec, cover terminal callbacks (#184 round-2) Round-2 review fixes for PR #234 (#184 autonomous agent runs). F6 (stability): finalizeRun no longer drops the in-memory entry before the terminal write. It now UPDATEs first with a bounded retry; only on success does it arm the idempotency once-gate (a new `settled` set keyed on "row already terminal", not "entry deleted") and free the chat's active slot. If every attempt fails the entry is RETAINED and the run left unsettled so a later finalize / requestStop->onAbort / sweep can retry — a transient blip can no longer strand a run 'running' and 409 every future turn in the chat. Idempotency preserved (double-settle still collapses to a single write). F7 (regression from F2): int-spec constructs AiChatRunService with the 2nd EnvironmentService arg ({ isCloud: () => false }) so the file type-checks and all integration tests compile+run again. F8 (regression from F1): the windowed "stale but not fresh" case now calls sweepRunning({ staleMs: SWEEP_RUN_STALE_MS }); added an int-level variant-C case proving the no-arg boot sweep aborts even a FRESH running run. F9 (coverage): run-race spec now captures streamText's options and invokes onStepFinish/onFinish/onAbort/onError, asserting the #184 run hooks (onStep / onSettled completed\|aborted\|error) fire with the right args. F10 (docs): added an autonomousRuns single-instance-only note to .env.example so the warnIfMultiInstance JSDoc reference is accurate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 01:23:46 +03:00
claude code agent 227	c0844d5431	fix(ai-chat): unconditional boot sweep + single-instance guard for autonomous runs (#184 ) F1 (DECISION C): make the crash-recovery boot sweep UNCONDITIONAL. A fast restart (deploy/OOM within the old 10-min window of the last step) left a run stuck `running` forever, and the one-active-run gate then 409'd every future turn in that chat. On a fresh single-process boot any pending\|running run is definitionally hung, so onModuleInit now settles ALL of them to `aborted` with no staleness window. AiChatRunRepo.sweepRunning takes an optional { staleMs } window, kept ONLY for the future phase-2 multi-instance timer sweep (the boot path passes no window). Repo + service tests assert a fresh `running` run (updatedAt = now) is settled, not skipped. F2 (DECISION A): treat phase-1 autonomousRuns as SINGLE-INSTANCE-ONLY. Stop and its AbortController are process-local, so cross-instance Stop is unreliable (phase 2). AiChatRunService now logs a startup WARNING when a horizontally-scaled deployment is detected — via EnvironmentService.isCloud() (CLOUD=true), the only horizontal-scaling signal this codebase has (the socket.io Redis adapter is always wired since REDIS_URL is mandatory, so it is not a discriminator). The constraint is documented in AGENTS.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 23:52:32 +03:00
claude code agent 227	4c0a4eb9cc	fix(ai-chat): settle detached runs on pre-stream failures + review fixes (#184 ) CRITICAL: any failure between a successful beginRun and streamText's terminal callbacks taking ownership (the bare awaits: user-message insert, history load, convertToModelMessages, settings resolve; the buildSystemPrompt/forUser block; and synchronous streamText wiring) left ai_chat_runs stuck 'running' forever (sweepRunning only runs at startup), which then 409'd every future turn in the chat and made the observer tab poll forever. Wrap the body of stream() after beginRun in a safety-net try/catch that settles the run to 'error' (via onSettled) before rethrowing, and make finalizeRun idempotent (active.delete is the once-guard) so a settle here and a settle from a streamText callback collapse to a single terminal write. Also from review comment 2519: - correct three client comments that falsely claimed /ai-chat/run is "flag-gated server-side and would 403" — it is owner-gated only; with the feature off the chat simply has no runs so the endpoint returns { run: null } (ai-chat-window.tsx, ai-chat-service.ts, ai-chat-query.ts). - remove the dead UpdatableAiChatRun type (zero usages; the repo update uses an inline Partial<...>). - add controller specs for POST /ai-chat/run and /ai-chat/stop (owner-gating, run:null when no run, run+message, stop by runId and by chatId). - add tests: an exception after beginRun settles the run to 'error' and drops the in-memory entry (next turn is not 409'd); finalizeRun is idempotent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 14:54:19 +03:00
a	6390c45658	fix(ai-chat): close the concurrent-run race in #184 (insert is the gate) The "one active run per chat" guard was bypassable under a race. Two simultaneous POST /ai-chat/stream on the same chat both passed the controller's pre-hijack 409 check (a check-then-act TOCTOU), then the loser's INSERT into ai_chat_runs hit the partial unique index (ai_chat_runs_one_active_per_chat, 23505). That error was SWALLOWED, so the second turn streamed UNTRACKED: no runId, not targetable by /stop, and (autonomousRuns on) onClose won't abort it -> an orphan unstoppable run that also spends provider tokens. Make the unique-index INSERT the authoritative gate: - AiChatRunService.beginRun: when the run-row INSERT fails with a 23505 on ONE_ACTIVE_RUN_PER_CHAT_INDEX (via isUniqueViolation/violatedConstraint), no longer swallow it -> throw a distinct RunAlreadyActiveError. Any other error (incl. a 23505 on a different constraint) propagates unchanged. - AiChatService.stream: when begin throws RunAlreadyActiveError, reject the turn with a 409 ConflictException (code A_RUN_ALREADY_ACTIVE) BEFORE any AI/provider call -> no tokens spent, no untracked turn. Other begin failures keep the legacy best-effort fallback (stream socket-bound). - ai-chat.controller: post-hijack catch honors an HttpException's real status/body (clean 409) instead of a blanket 500, since the race 409 is raised before a byte is written. Pre-check 409 now carries the same code. The controller's cheap pre-check stays as a fast-path for the common sequential double-submit; the INSERT violation is the race-safe backstop. Tests: ai-chat-run.service.spec proves beginRun throws RunAlreadyActiveError on the active-index 23505 (and only that constraint), leaks no controller, and an integration-style two-concurrent-begins test where exactly one wins; new ai-chat.service.run-race.spec proves stream rejects with a 409 ConflictException BEFORE any streamText/generateText and never persists an untracked turn. The latter fails without the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 14:37:07 +03:00
claude code agent 227	95781d80e1	feat(ai-chat): durable detached agent runs (#184 phase 1) Make an agent turn a first-class, server-side RUN that keeps executing and persisting its steps after the browser window closes, and that a later client can reconnect to — the core invariant of #184. Phase 1 only; the full proposal (cross-process BullMQ runner, resumable live-tail transport, autonomy triggers, budgets, history compaction) is explicitly deferred. What lands: - `ai_chat_runs` lifecycle table + repo: the run as a persistent object (status pending->running->succeeded\|failed\|aborted, trigger, createdBy, assistantMessageId projection link, error, step_count, timings). A partial unique index enforces ONE ACTIVE run per chat; a startup sweep recovers dangling runs (mirrors #183's sweepStreaming). - AiChatRunService: owns the run lifecycle + an in-memory abort registry. The abort is governed by the RUN (an explicit user stop), NOT the HTTP socket — so a browser disconnect no longer ends the turn. Reuses #183's socket- independent durable write path (consumeStream + flushAssistant) unchanged. - Controller, behind `settings.ai.autonomousRuns`: /stream wraps the turn in a run and does NOT abort on disconnect (logs only); a clean 409 rejects a concurrent run on the same chat; new POST /ai-chat/stop (explicit stop) and POST /ai-chat/run (reconnect -> latest persisted run + its projection). The runId is surfaced on the streamed start metadata. Flag OFF = byte-for-byte legacy behavior. Tests: AiChatRunService unit spec (lifecycle, disconnect != stop, explicit stop aborts the signal, best-effort sweeps); ai_chat_runs integration spec (one-active-run index, detached persist+reconnect with no subscriber, explicit stop, stale-run sweep). Server tsc + build clean; touched jest green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-28 14:37:07 +03:00

7 Commits