fix(ai-chat): explicit give-up ERROR + accurate retry-window comment (#184 round-4)

F12 [suggestion]: finalizeRun's "all retries exhausted" path only logged per-attempt warns ("attempt 3/3") then silently restored the in-memory entry, giving no clear signal that the run row was left non-terminal ('running') pending recovery. Emit ONE greppable ERROR with context (runId, chatId, final error) on give-up, matching the import-attachment retry-loop pattern, so an operator can tell a survived blip from a give-up. F13 [suggestion]: the "ORDER MATTERS (F6)" doc overclaimed that a later settle "can retry" the terminal write as an in-process retrier. Correct it: in-process retry is only POSSIBLE (not guaranteed) and only once the entry is restored AND a fresh settler arrives afterwards; a concurrent settler in the retry window is consumed at the synchronous active.delete claim, and the no-streamText path has no second settler at all. The UNCONDITIONAL backstop in every case is the boot sweep on the next restart. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 02:13:29 +03:00
parent 9ad3931a1c
commit 0ecddce748
2 changed files with 48 additions and 15 deletions
--- a/apps/server/src/core/ai-chat/ai-chat-run.service.spec.ts
+++ b/apps/server/src/core/ai-chat/ai-chat-run.service.spec.ts
@@ -435,6 +435,9 @@ describe('AiChatRunService run lifecycle', () => {
      }),
    });
    jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
    const errorSpy = jest
      .spyOn(Logger.prototype, 'error')
      .mockImplementation(() => undefined);
    const svc = new AiChatRunService(repo as never, makeEnv() as never);
    await svc.beginRun({
      chatId: 'chat-1',
@@ -445,6 +448,16 @@ describe('AiChatRunService run lifecycle', () => {
    // First settle: every bounded attempt fails -> entry retained, NOT settled.
    await svc.finalizeRun('run-1', 'ws-1', 'completed');
    expect(svc.isLocallyActive('run-1')).toBe(true);
    // F12: the give-up emits ONE explicit, greppable ERROR (run + chat context)
    // so an operator can tell "gave up, run held in memory" from a per-attempt
    // blip — distinct from the per-attempt warns.
    const gaveUp = errorSpy.mock.calls.some(
      (c) =>
        /NON-TERMINAL/.test(String(c[0])) &&
        /run-1/.test(String(c[0])) &&
        /chat-1/.test(String(c[0])),
    );
    expect(gaveUp).toBe(true);
    // The DB recovers; a later settle now succeeds and frees the slot.
    healthy = true;
--- a/apps/server/src/core/ai-chat/ai-chat-run.service.ts
+++ b/apps/server/src/core/ai-chat/ai-chat-run.service.ts
@@ -280,16 +280,20 @@ export class AiChatRunService implements OnModuleInit {
   *
   * ORDER MATTERS (F6): once we own the claim, the terminal UPDATE happens FIRST;
   * only once it SUCCEEDS do we record the run as settled. If the UPDATE fails on
-   * every bounded attempt we RESTORE the in-memory entry and do NOT mark it
+   * every bounded attempt we RESTORE the in-memory entry, leave the run UNsettled,
-   * settled — so a later settle can retry the terminal write and the run is never
+   * and emit an ERROR signal that the row is left non-terminal 'running' (which
-   * silently stranded 'running' (which would 409 every future turn in the chat
+   * would 409 every future turn in the chat until recovery). An in-process retry
-   * until a restart). Who that later retrier is depends on the path: when
+   * by a LATER settle is only POSSIBLE, never guaranteed: it needs (a) the entry
-   * streamText attached, its terminal callback (or a requestStop -> onAbort) is a
+   * to have been restored at the give-up path AND (b) a fresh settler to arrive
-   * second in-process settler that retries; on the NO-streamText path (the turn
+   * AFTER that restore. A concurrent settler that arrives DURING the retry window
-   * threw before streamText was wired, so ONLY the safety-net ever settles) there
+   * — while the entry is deleted for backoff and not yet restored — is consumed at
-   * is no in-process retrier — recovery there is the UNCONDITIONAL boot sweep on
+   * the synchronous `active.delete` claim (it finds nothing to delete and returns
-   * the next restart (phase 1 has no periodic in-process sweep). The retained
+   * a no-op), so it does NOT become an in-process retrier. The NO-streamText path
-   * entry is bounded (cleared on restart) and harmless meanwhile.
+   * (the turn threw before streamText was wired, so ONLY the safety-net ever
   * settles) likewise has no second in-process settler at all. The UNCONDITIONAL
   * backstop in every case is the boot sweep on the next restart (phase 1 has no
   * periodic in-process sweep); the retained entry is bounded (cleared on restart)
   * and harmless meanwhile.
   *
   * IDEMPOTENT on SUCCESS (#184 review): the terminal write happens AT MOST ONCE
   * per run. After a successful write the once-gate keys off {@link settled} (the
@@ -315,6 +319,7 @@ export class AiChatRunService implements OnModuleInit {
    // the same tick, before any await — so it can never reach the UPDATE.
    if (!this.active.delete(runId)) return;
    let lastError: unknown;
    for (
      let attempt = 1;
      attempt <= AiChatRunService.FINALIZE_MAX_ATTEMPTS;
@@ -331,6 +336,7 @@ export class AiChatRunService implements OnModuleInit {
        this.settled.add(runId);
        return;
      } catch (err) {
        lastError = err;
        this.logger.warn(
          `Failed to finalize run ${runId} (attempt ${attempt}/${
            AiChatRunService.FINALIZE_MAX_ATTEMPTS
@@ -341,11 +347,25 @@ export class AiChatRunService implements OnModuleInit {
        }
      }
    }
-    // Every attempt failed: RESTORE the claimed entry (and leave the run
+    // Every attempt failed: this is a give-up, materially worse than a per-attempt
-    // UNsettled) so a retrier can try again — the run is not stranded. Where
+    // blip — the row is left NON-TERMINAL ('running'), so emit ONE explicit,
-    // streamText attached, its terminal callback / requestStop -> onAbort is that
+    // greppable ERROR so an operator can tell "survived a blip" from "gave up, run
-    // retrier; on the no-streamText path the retrier is the boot sweep on the
+    // held in memory until recovery" (the last warn alone says only "attempt 3/3").
-    // next restart. The restored entry is bounded and cleared on restart.
+    this.logger.error(
      `Run ${runId} (chat ${entry?.chatId ?? 'unknown'}) left NON-TERMINAL ` +
        `('running'): terminal write failed after ${
          AiChatRunService.FINALIZE_MAX_ATTEMPTS
        } attempts; entry retained in memory, recovery deferred to next settle / ` +
        `boot sweep`,
      lastError,
    );
    // RESTORE the claimed entry (and leave the run UNsettled) so a LATER settle
    // that arrives AFTER this restore MAY retry the terminal write — but that
    // in-process retry is NOT guaranteed (a concurrent settler caught in the retry
    // window above is consumed at the `active.delete` claim, and the no-streamText
    // path has no second settler at all). The UNCONDITIONAL backstop in every case
    // is the boot sweep on the next restart; the restored entry is bounded and
    // cleared on restart.
    if (entry) this.active.set(runId, entry);
  }