Long research turns failed mid-task with "Lost connection to the AI provider".
Node's global fetch (undici) defaults BOTH headersTimeout and bodyTimeout to
300_000ms, and the chat provider + the external-MCP dispatcher both ran on it
with no override, so:
- the z.ai chat stream dropped when a late step's huge accumulated context
pushed the model's time-to-first-token past 5 min (the model reasons
server-side with NO streamed reasoning, so the connection is silent until the
first answer token — reproduced: even a trivial glm-5.2 query has a ~4-8s
first-chunk gap; a long run reaches 400k+-token steps), or a reasoning model
paused >5 min between chunks (bodyTimeout);
- the crawl4ai SSE transport, held open across the whole turn, dropped when it
idled >5 min between tool calls.
Fix: a dedicated undici dispatcher whose stream timeouts are raised to a
generous-but-FINITE silence timeout (default 15 min, AI_STREAM_TIMEOUT_MS) on
each path. NOT disabled (0): that would let a genuinely hung provider — with the
client still connected — hang forever, since the turn's abortSignal only fires on
client disconnect. The timeout bounds SILENCE (time-to-first-byte and the gap
BETWEEN chunks), NOT total turn duration, so an arbitrarily long turn that keeps
streaming is never cut; only a stream quiet for >15 min is treated as a hang.
- ai-streaming-fetch.ts: createStreamingFetch() + streamTimeoutMs() /
streamingDispatcherOptions() (the shared, configurable timeout).
- ai.service: the chat provider fetch is createStreamingFetch(), wrapped by the
existing passive ECONNRESET telemetry (createDiagnosticFetch gained an
optional baseFetch) so the telemetry observes the SAME transport.
- mcp-clients: the SSRF-pinned Agent uses streamingDispatcherOptions().
Investigation: reproduced the transport mechanism against the real z.ai endpoint
(a 1ms headersTimeout throws UND_ERR_HEADERS_TIMEOUT — the exact drop) and ran
the actual research agent to a ~428k-token context. Verified the fixed path
streams cleanly live (glm-5.2 turns finish; telemetry confirms the streaming
fetch is in use).
Tests: ai-streaming-fetch.spec (default 15m + env override + invalid fallback +
both-timeouts + streams a delayed response); ai-http-diagnostics + ai/mcp specs
green. server tsc clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
82 lines
3.8 KiB
TypeScript
82 lines
3.8 KiB
TypeScript
import { Logger } from '@nestjs/common';
|
|
|
|
/**
|
|
* DIAGNOSTIC (provider ECONNRESET investigation) — temporary.
|
|
*
|
|
* A PASSIVE, behavior-neutral wrapper around the global `fetch`, injected into
|
|
* the OpenAI-compatible provider client (`createOpenAI({ fetch })`, the z.ai
|
|
* path). Per provider HTTP call it logs: time-to-response-headers + status +
|
|
* request-body size on success; and on a pre-response rejection the failure
|
|
* latency + error code/cause + request-body size + the idle gap since the
|
|
* previous provider call. It NEVER retries, times out, swaps the dispatcher, or
|
|
* reads/clones the response body — the Response is returned untouched (streaming
|
|
* unaffected) and any error is rethrown unchanged.
|
|
*
|
|
* How to read the result (a long agentic turn makes one provider call per step):
|
|
* - a failed turn whose last provider line is "PRE-RESPONSE FAILED ... ECONNRESET"
|
|
* => the reset is in the CONNECTION phase of a step's request (the provider
|
|
* never replied) — usually a poisoned keep-alive socket or the provider/middle
|
|
* box resetting that request (large body / idle gap are the suspects, hence
|
|
* reqBytes + idleSincePrevCall below).
|
|
* - the last line is "OK status=200" and the turn still errors with NO
|
|
* "PRE-RESPONSE FAILED" => the cut happened MID-STREAM (after headers), a
|
|
* different failure mode.
|
|
*
|
|
* The seq/last-call timestamps are module-level, so under concurrent turns the
|
|
* idle-gap figure is approximate (fine for single-user reproduction).
|
|
*/
|
|
export function createDiagnosticFetch(
|
|
context: string,
|
|
// The underlying fetch to instrument. Defaults to the global fetch; the chat
|
|
// provider passes a streaming fetch (disabled undici stream timeouts, #175) so
|
|
// the telemetry observes the SAME transport the long agent turn actually uses.
|
|
baseFetch: typeof fetch = fetch,
|
|
): typeof fetch {
|
|
const logger = new Logger(context);
|
|
let callSeq = 0;
|
|
let lastCallStartedAt: number | undefined;
|
|
|
|
return async (input: Parameters<typeof fetch>[0], init?: Parameters<typeof fetch>[1]): Promise<Response> => {
|
|
const callId = ++callSeq;
|
|
const startedAt = Date.now();
|
|
const idleSincePrev =
|
|
lastCallStartedAt === undefined ? undefined : startedAt - lastCallStartedAt;
|
|
lastCallStartedAt = startedAt;
|
|
// Request body size: the chat payload is a JSON string. Used to test whether
|
|
// failures correlate with the large accumulated context on later agent steps.
|
|
const body = init?.body as unknown;
|
|
const bodyBytes =
|
|
typeof body === 'string'
|
|
? body.length
|
|
: body instanceof Uint8Array
|
|
? body.byteLength
|
|
: undefined;
|
|
try {
|
|
// Delegate to the base fetch; return the Response UNTOUCHED (never read/
|
|
// clone the body) so the streamed SSE response is unaffected.
|
|
const res = await baseFetch(input, init);
|
|
logger.log(
|
|
`provider HTTP DIAGNOSTIC: call#${callId} OK ` +
|
|
`headersAfter=${Date.now() - startedAt}ms status=${res.status} ` +
|
|
`reqBytes=${bodyBytes ?? 'n/a'} idleSincePrevCall=${idleSincePrev ?? 'n/a'}ms`,
|
|
);
|
|
return res;
|
|
} catch (err) {
|
|
// fetch() rejected => PRE-RESPONSE failure (no headers/body received yet):
|
|
// the connection/request phase. Log it and rethrow the SAME error.
|
|
const e = err as {
|
|
name?: string;
|
|
message?: string;
|
|
cause?: { code?: string; message?: string };
|
|
};
|
|
logger.warn(
|
|
`provider HTTP DIAGNOSTIC: call#${callId} PRE-RESPONSE FAILED ` +
|
|
`after=${Date.now() - startedAt}ms code=${e?.cause?.code ?? 'none'} ` +
|
|
`name=${e?.name ?? 'Error'} cause=${e?.cause?.message ?? e?.message ?? 'unknown'} ` +
|
|
`reqBytes=${bodyBytes ?? 'n/a'} idleSincePrevCall=${idleSincePrev ?? 'n/a'}ms`,
|
|
);
|
|
throw err;
|
|
}
|
|
};
|
|
}
|