fix(ai-chat): explicit give-up ERROR + accurate retry-window comment (#184 round-4)

F12 [suggestion]: finalizeRun's "all retries exhausted" path only logged
per-attempt warns ("attempt 3/3") then silently restored the in-memory
entry, giving no clear signal that the run row was left non-terminal
('running') pending recovery. Emit ONE greppable ERROR with context
(runId, chatId, final error) on give-up, matching the import-attachment
retry-loop pattern, so an operator can tell a survived blip from a give-up.

F13 [suggestion]: the "ORDER MATTERS (F6)" doc overclaimed that a later
settle "can retry" the terminal write as an in-process retrier. Correct it:
in-process retry is only POSSIBLE (not guaranteed) and only once the entry
is restored AND a fresh settler arrives afterwards; a concurrent settler in
the retry window is consumed at the synchronous active.delete claim, and the
no-streamText path has no second settler at all. The UNCONDITIONAL backstop
in every case is the boot sweep on the next restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
claude code agent 227
2026-06-29 02:13:29 +03:00
parent 9ad3931a1c
commit 0ecddce748
2 changed files with 48 additions and 15 deletions

View File

@@ -435,6 +435,9 @@ describe('AiChatRunService run lifecycle', () => {
}), }),
}); });
jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined); jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
const errorSpy = jest
.spyOn(Logger.prototype, 'error')
.mockImplementation(() => undefined);
const svc = new AiChatRunService(repo as never, makeEnv() as never); const svc = new AiChatRunService(repo as never, makeEnv() as never);
await svc.beginRun({ await svc.beginRun({
chatId: 'chat-1', chatId: 'chat-1',
@@ -445,6 +448,16 @@ describe('AiChatRunService run lifecycle', () => {
// First settle: every bounded attempt fails -> entry retained, NOT settled. // First settle: every bounded attempt fails -> entry retained, NOT settled.
await svc.finalizeRun('run-1', 'ws-1', 'completed'); await svc.finalizeRun('run-1', 'ws-1', 'completed');
expect(svc.isLocallyActive('run-1')).toBe(true); expect(svc.isLocallyActive('run-1')).toBe(true);
// F12: the give-up emits ONE explicit, greppable ERROR (run + chat context)
// so an operator can tell "gave up, run held in memory" from a per-attempt
// blip — distinct from the per-attempt warns.
const gaveUp = errorSpy.mock.calls.some(
(c) =>
/NON-TERMINAL/.test(String(c[0])) &&
/run-1/.test(String(c[0])) &&
/chat-1/.test(String(c[0])),
);
expect(gaveUp).toBe(true);
// The DB recovers; a later settle now succeeds and frees the slot. // The DB recovers; a later settle now succeeds and frees the slot.
healthy = true; healthy = true;

View File

@@ -280,16 +280,20 @@ export class AiChatRunService implements OnModuleInit {
* *
* ORDER MATTERS (F6): once we own the claim, the terminal UPDATE happens FIRST; * ORDER MATTERS (F6): once we own the claim, the terminal UPDATE happens FIRST;
* only once it SUCCEEDS do we record the run as settled. If the UPDATE fails on * only once it SUCCEEDS do we record the run as settled. If the UPDATE fails on
* every bounded attempt we RESTORE the in-memory entry and do NOT mark it * every bounded attempt we RESTORE the in-memory entry, leave the run UNsettled,
* settled — so a later settle can retry the terminal write and the run is never * and emit an ERROR signal that the row is left non-terminal 'running' (which
* silently stranded 'running' (which would 409 every future turn in the chat * would 409 every future turn in the chat until recovery). An in-process retry
* until a restart). Who that later retrier is depends on the path: when * by a LATER settle is only POSSIBLE, never guaranteed: it needs (a) the entry
* streamText attached, its terminal callback (or a requestStop -> onAbort) is a * to have been restored at the give-up path AND (b) a fresh settler to arrive
* second in-process settler that retries; on the NO-streamText path (the turn * AFTER that restore. A concurrent settler that arrives DURING the retry window
* threw before streamText was wired, so ONLY the safety-net ever settles) there * — while the entry is deleted for backoff and not yet restored — is consumed at
* is no in-process retrier — recovery there is the UNCONDITIONAL boot sweep on * the synchronous `active.delete` claim (it finds nothing to delete and returns
* the next restart (phase 1 has no periodic in-process sweep). The retained * a no-op), so it does NOT become an in-process retrier. The NO-streamText path
* entry is bounded (cleared on restart) and harmless meanwhile. * (the turn threw before streamText was wired, so ONLY the safety-net ever
* settles) likewise has no second in-process settler at all. The UNCONDITIONAL
* backstop in every case is the boot sweep on the next restart (phase 1 has no
* periodic in-process sweep); the retained entry is bounded (cleared on restart)
* and harmless meanwhile.
* *
* IDEMPOTENT on SUCCESS (#184 review): the terminal write happens AT MOST ONCE * IDEMPOTENT on SUCCESS (#184 review): the terminal write happens AT MOST ONCE
* per run. After a successful write the once-gate keys off {@link settled} (the * per run. After a successful write the once-gate keys off {@link settled} (the
@@ -315,6 +319,7 @@ export class AiChatRunService implements OnModuleInit {
// the same tick, before any await — so it can never reach the UPDATE. // the same tick, before any await — so it can never reach the UPDATE.
if (!this.active.delete(runId)) return; if (!this.active.delete(runId)) return;
let lastError: unknown;
for ( for (
let attempt = 1; let attempt = 1;
attempt <= AiChatRunService.FINALIZE_MAX_ATTEMPTS; attempt <= AiChatRunService.FINALIZE_MAX_ATTEMPTS;
@@ -331,6 +336,7 @@ export class AiChatRunService implements OnModuleInit {
this.settled.add(runId); this.settled.add(runId);
return; return;
} catch (err) { } catch (err) {
lastError = err;
this.logger.warn( this.logger.warn(
`Failed to finalize run ${runId} (attempt ${attempt}/${ `Failed to finalize run ${runId} (attempt ${attempt}/${
AiChatRunService.FINALIZE_MAX_ATTEMPTS AiChatRunService.FINALIZE_MAX_ATTEMPTS
@@ -341,11 +347,25 @@ export class AiChatRunService implements OnModuleInit {
} }
} }
} }
// Every attempt failed: RESTORE the claimed entry (and leave the run // Every attempt failed: this is a give-up, materially worse than a per-attempt
// UNsettled) so a retrier can try again — the run is not stranded. Where // blip — the row is left NON-TERMINAL ('running'), so emit ONE explicit,
// streamText attached, its terminal callback / requestStop -> onAbort is that // greppable ERROR so an operator can tell "survived a blip" from "gave up, run
// retrier; on the no-streamText path the retrier is the boot sweep on the // held in memory until recovery" (the last warn alone says only "attempt 3/3").
// next restart. The restored entry is bounded and cleared on restart. this.logger.error(
`Run ${runId} (chat ${entry?.chatId ?? 'unknown'}) left NON-TERMINAL ` +
`('running'): terminal write failed after ${
AiChatRunService.FINALIZE_MAX_ATTEMPTS
} attempts; entry retained in memory, recovery deferred to next settle / ` +
`boot sweep`,
lastError,
);
// RESTORE the claimed entry (and leave the run UNsettled) so a LATER settle
// that arrives AFTER this restore MAY retry the terminal write — but that
// in-process retry is NOT guaranteed (a concurrent settler caught in the retry
// window above is consumed at the `active.delete` claim, and the no-streamText
// path has no second settler at all). The UNCONDITIONAL backstop in every case
// is the boot sweep on the next restart; the restored entry is bounded and
// cleared on restart.
if (entry) this.active.set(runId, entry); if (entry) this.active.set(runId, entry);
} }