fix(ai-chat): explicit give-up ERROR + accurate retry-window comment (#184 round-4)
F12 [suggestion]: finalizeRun's "all retries exhausted" path only logged
per-attempt warns ("attempt 3/3") then silently restored the in-memory
entry, giving no clear signal that the run row was left non-terminal
('running') pending recovery. Emit ONE greppable ERROR with context
(runId, chatId, final error) on give-up, matching the import-attachment
retry-loop pattern, so an operator can tell a survived blip from a give-up.
F13 [suggestion]: the "ORDER MATTERS (F6)" doc overclaimed that a later
settle "can retry" the terminal write as an in-process retrier. Correct it:
in-process retry is only POSSIBLE (not guaranteed) and only once the entry
is restored AND a fresh settler arrives afterwards; a concurrent settler in
the retry window is consumed at the synchronous active.delete claim, and the
no-streamText path has no second settler at all. The UNCONDITIONAL backstop
in every case is the boot sweep on the next restart.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -435,6 +435,9 @@ describe('AiChatRunService run lifecycle', () => {
|
|||||||
}),
|
}),
|
||||||
});
|
});
|
||||||
jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
|
jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
|
||||||
|
const errorSpy = jest
|
||||||
|
.spyOn(Logger.prototype, 'error')
|
||||||
|
.mockImplementation(() => undefined);
|
||||||
const svc = new AiChatRunService(repo as never, makeEnv() as never);
|
const svc = new AiChatRunService(repo as never, makeEnv() as never);
|
||||||
await svc.beginRun({
|
await svc.beginRun({
|
||||||
chatId: 'chat-1',
|
chatId: 'chat-1',
|
||||||
@@ -445,6 +448,16 @@ describe('AiChatRunService run lifecycle', () => {
|
|||||||
// First settle: every bounded attempt fails -> entry retained, NOT settled.
|
// First settle: every bounded attempt fails -> entry retained, NOT settled.
|
||||||
await svc.finalizeRun('run-1', 'ws-1', 'completed');
|
await svc.finalizeRun('run-1', 'ws-1', 'completed');
|
||||||
expect(svc.isLocallyActive('run-1')).toBe(true);
|
expect(svc.isLocallyActive('run-1')).toBe(true);
|
||||||
|
// F12: the give-up emits ONE explicit, greppable ERROR (run + chat context)
|
||||||
|
// so an operator can tell "gave up, run held in memory" from a per-attempt
|
||||||
|
// blip — distinct from the per-attempt warns.
|
||||||
|
const gaveUp = errorSpy.mock.calls.some(
|
||||||
|
(c) =>
|
||||||
|
/NON-TERMINAL/.test(String(c[0])) &&
|
||||||
|
/run-1/.test(String(c[0])) &&
|
||||||
|
/chat-1/.test(String(c[0])),
|
||||||
|
);
|
||||||
|
expect(gaveUp).toBe(true);
|
||||||
|
|
||||||
// The DB recovers; a later settle now succeeds and frees the slot.
|
// The DB recovers; a later settle now succeeds and frees the slot.
|
||||||
healthy = true;
|
healthy = true;
|
||||||
|
|||||||
@@ -280,16 +280,20 @@ export class AiChatRunService implements OnModuleInit {
|
|||||||
*
|
*
|
||||||
* ORDER MATTERS (F6): once we own the claim, the terminal UPDATE happens FIRST;
|
* ORDER MATTERS (F6): once we own the claim, the terminal UPDATE happens FIRST;
|
||||||
* only once it SUCCEEDS do we record the run as settled. If the UPDATE fails on
|
* only once it SUCCEEDS do we record the run as settled. If the UPDATE fails on
|
||||||
* every bounded attempt we RESTORE the in-memory entry and do NOT mark it
|
* every bounded attempt we RESTORE the in-memory entry, leave the run UNsettled,
|
||||||
* settled — so a later settle can retry the terminal write and the run is never
|
* and emit an ERROR signal that the row is left non-terminal 'running' (which
|
||||||
* silently stranded 'running' (which would 409 every future turn in the chat
|
* would 409 every future turn in the chat until recovery). An in-process retry
|
||||||
* until a restart). Who that later retrier is depends on the path: when
|
* by a LATER settle is only POSSIBLE, never guaranteed: it needs (a) the entry
|
||||||
* streamText attached, its terminal callback (or a requestStop -> onAbort) is a
|
* to have been restored at the give-up path AND (b) a fresh settler to arrive
|
||||||
* second in-process settler that retries; on the NO-streamText path (the turn
|
* AFTER that restore. A concurrent settler that arrives DURING the retry window
|
||||||
* threw before streamText was wired, so ONLY the safety-net ever settles) there
|
* — while the entry is deleted for backoff and not yet restored — is consumed at
|
||||||
* is no in-process retrier — recovery there is the UNCONDITIONAL boot sweep on
|
* the synchronous `active.delete` claim (it finds nothing to delete and returns
|
||||||
* the next restart (phase 1 has no periodic in-process sweep). The retained
|
* a no-op), so it does NOT become an in-process retrier. The NO-streamText path
|
||||||
* entry is bounded (cleared on restart) and harmless meanwhile.
|
* (the turn threw before streamText was wired, so ONLY the safety-net ever
|
||||||
|
* settles) likewise has no second in-process settler at all. The UNCONDITIONAL
|
||||||
|
* backstop in every case is the boot sweep on the next restart (phase 1 has no
|
||||||
|
* periodic in-process sweep); the retained entry is bounded (cleared on restart)
|
||||||
|
* and harmless meanwhile.
|
||||||
*
|
*
|
||||||
* IDEMPOTENT on SUCCESS (#184 review): the terminal write happens AT MOST ONCE
|
* IDEMPOTENT on SUCCESS (#184 review): the terminal write happens AT MOST ONCE
|
||||||
* per run. After a successful write the once-gate keys off {@link settled} (the
|
* per run. After a successful write the once-gate keys off {@link settled} (the
|
||||||
@@ -315,6 +319,7 @@ export class AiChatRunService implements OnModuleInit {
|
|||||||
// the same tick, before any await — so it can never reach the UPDATE.
|
// the same tick, before any await — so it can never reach the UPDATE.
|
||||||
if (!this.active.delete(runId)) return;
|
if (!this.active.delete(runId)) return;
|
||||||
|
|
||||||
|
let lastError: unknown;
|
||||||
for (
|
for (
|
||||||
let attempt = 1;
|
let attempt = 1;
|
||||||
attempt <= AiChatRunService.FINALIZE_MAX_ATTEMPTS;
|
attempt <= AiChatRunService.FINALIZE_MAX_ATTEMPTS;
|
||||||
@@ -331,6 +336,7 @@ export class AiChatRunService implements OnModuleInit {
|
|||||||
this.settled.add(runId);
|
this.settled.add(runId);
|
||||||
return;
|
return;
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
|
lastError = err;
|
||||||
this.logger.warn(
|
this.logger.warn(
|
||||||
`Failed to finalize run ${runId} (attempt ${attempt}/${
|
`Failed to finalize run ${runId} (attempt ${attempt}/${
|
||||||
AiChatRunService.FINALIZE_MAX_ATTEMPTS
|
AiChatRunService.FINALIZE_MAX_ATTEMPTS
|
||||||
@@ -341,11 +347,25 @@ export class AiChatRunService implements OnModuleInit {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
// Every attempt failed: RESTORE the claimed entry (and leave the run
|
// Every attempt failed: this is a give-up, materially worse than a per-attempt
|
||||||
// UNsettled) so a retrier can try again — the run is not stranded. Where
|
// blip — the row is left NON-TERMINAL ('running'), so emit ONE explicit,
|
||||||
// streamText attached, its terminal callback / requestStop -> onAbort is that
|
// greppable ERROR so an operator can tell "survived a blip" from "gave up, run
|
||||||
// retrier; on the no-streamText path the retrier is the boot sweep on the
|
// held in memory until recovery" (the last warn alone says only "attempt 3/3").
|
||||||
// next restart. The restored entry is bounded and cleared on restart.
|
this.logger.error(
|
||||||
|
`Run ${runId} (chat ${entry?.chatId ?? 'unknown'}) left NON-TERMINAL ` +
|
||||||
|
`('running'): terminal write failed after ${
|
||||||
|
AiChatRunService.FINALIZE_MAX_ATTEMPTS
|
||||||
|
} attempts; entry retained in memory, recovery deferred to next settle / ` +
|
||||||
|
`boot sweep`,
|
||||||
|
lastError,
|
||||||
|
);
|
||||||
|
// RESTORE the claimed entry (and leave the run UNsettled) so a LATER settle
|
||||||
|
// that arrives AFTER this restore MAY retry the terminal write — but that
|
||||||
|
// in-process retry is NOT guaranteed (a concurrent settler caught in the retry
|
||||||
|
// window above is consumed at the `active.delete` claim, and the no-streamText
|
||||||
|
// path has no second settler at all). The UNCONDITIONAL backstop in every case
|
||||||
|
// is the boot sweep on the next restart; the restored entry is bounded and
|
||||||
|
// cleared on restart.
|
||||||
if (entry) this.active.set(runId, entry);
|
if (entry) this.active.set(runId, entry);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user