External MCP tools (web search, crawl) had no per-call timeout: a hung
tool call was only broken by the 15-min transport silence timeout shared
with the chat provider, and a server that kept the socket warm but never
returned could spin until the user cancelled.
Add two independent, composing bounds for external MCP traffic (the chat
provider path is unchanged):
- Silence 5 min: buildPinnedDispatcher now overrides headersTimeout/
bodyTimeout with mcpStreamTimeoutMs() (AI_MCP_STREAM_TIMEOUT_MS,
default 300000) on the external-MCP dispatcher only, so a byte-silent
upstream is severed in ~5 min instead of 15.
- Total per-call 15 min: wrapToolWithCallTimeout wraps each external
tool's execute with a fresh AbortController + timer composed with the
turn signal via AbortSignal.any (AI_MCP_CALL_TIMEOUT_MS, default
900000). It RACES the call against the abort signal because
@ai-sdk/mcp does not settle its in-flight promise on abort, so a
warm-but-stuck call would otherwise hang forever.
On timeout the call surfaces as a tool-error and the agent loop recovers.
Add tests (incl. a never-settling real-client-style stub) and document
both env vars in .env.example.
- Invert the transport layers so the pre-response retry is OUTERMOST and the
provider-HTTP instrumentation is INNER. Before, the retry lived inside
createStreamingFetch (under the instrumentation), so a reset the retry
recovered from logged only a clean "OK status=200" — the
"PRE-RESPONSE FAILED ... ECONNRESET ... idleSincePrevCall" signal went blind
exactly when the fix works, and AI_STREAM_KEEPALIVE_MS couldn't be tuned from
prod data. Now createStreamingFetch is the dispatcher-bound BASE (no retry) and
a new withPreResponseRetry() wraps it; ai.service composes
withPreResponseRetry(createInstrumentedFetch('AiService:provider-http',
createStreamingFetch())), so every attempt — including recovered resets — flows
through the instrumentation. (Also expresses the keepAlive-config vs retry-
behavior boundary structurally, per review #3.)
- Add the retry-exhaustion test: a server that resets EVERY connection, asserting
the call rejects with a retryable connection error AND exactly
PRE_RESPONSE_CONNECT_RETRIES + 1 (= 3) requests reached the server — pinning the
bound and that the final error propagates (guards an off-by-one / infinite loop
/ swallowed error). Existing happy-retry + abort tests moved onto
withPreResponseRetry.
Verified on the stand: a normal turn still streams (reasoning + finish) and the
provider-HTTP telemetry still logs. server tsc + ai/mcp specs green (30).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The real cause of the long-task "Lost connection to the AI provider" — the
earlier 300s-timeout fix (#176) was the wrong layer. The provider-HTTP telemetry
on the user's deploy shows the failures are PRE-RESPONSE `read ECONNRESET` ~500ms
in (not a 300s/15min timeout), correlated with idleSincePrevCall ~42s and large
bodies; and crucially a retry of the SAME request often succeeds. A direct probe
to the real z.ai endpoint does NOT reset (113KB bodies and a 45s-idle keep-alive
reuse both succeed), and another agent (opencode) runs fine from the same infra —
so the provider is healthy and the egress network is usable. The difference is
the transport: undici's keep-alive pool REUSES a socket that the deployment's
egress (NAT / firewall / conntrack) silently dropped during a long idle gap, so
the next request resets pre-response.
Fix (brings gitmost in line with clients that don't reuse stale sockets):
- Keep-alive recycling: the streaming dispatcher (chat fetch AND the external-MCP
dispatcher, via the shared streamingDispatcherOptions) now sets
keepAliveTimeout + keepAliveMaxTimeout to a 10s recycle window
(AI_STREAM_KEEPALIVE_MS), so a connection idle longer than that is closed
instead of reused — a long-gap step opens a fresh connection. keepAliveMaxTimeout
also caps a server-advertised keep-alive so the provider can't widen the window.
- Pre-response connection retry: createStreamingFetch retries a connection-level
reset (ECONNRESET / UND_ERR_SOCKET / ECONNREFUSED / EPIPE / *_TIMEOUT) on a
fresh connection up to 2 times. This is SAFE because fetch() only rejects before
the Response resolves — a started stream is never replayed; an abort (client
disconnect) is never retried.
Tests: ai-streaming-fetch.spec — keep-alive options, streamKeepAliveMs env,
isRetryableConnectError, and a server that resets the first connection so the
retry must land on a fresh one (+ aborted requests are not retried). Verified on
the stand that a normal turn still streams (reasoning + text + finish) through the
new transport. server tsc + ai/mcp specs green.
Note: root cause is the deployment's egress dropping idle connections (Traefik is
inbound-only); this makes the app resilient to it. AI_STREAM_KEEPALIVE_MS can be
lowered if the egress drops faster than ~10s.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Long research turns failed mid-task with "Lost connection to the AI provider".
Node's global fetch (undici) defaults BOTH headersTimeout and bodyTimeout to
300_000ms, and the chat provider + the external-MCP dispatcher both ran on it
with no override, so:
- the z.ai chat stream dropped when a late step's huge accumulated context
pushed the model's time-to-first-token past 5 min (the model reasons
server-side with NO streamed reasoning, so the connection is silent until the
first answer token — reproduced: even a trivial glm-5.2 query has a ~4-8s
first-chunk gap; a long run reaches 400k+-token steps), or a reasoning model
paused >5 min between chunks (bodyTimeout);
- the crawl4ai SSE transport, held open across the whole turn, dropped when it
idled >5 min between tool calls.
Fix: a dedicated undici dispatcher whose stream timeouts are raised to a
generous-but-FINITE silence timeout (default 15 min, AI_STREAM_TIMEOUT_MS) on
each path. NOT disabled (0): that would let a genuinely hung provider — with the
client still connected — hang forever, since the turn's abortSignal only fires on
client disconnect. The timeout bounds SILENCE (time-to-first-byte and the gap
BETWEEN chunks), NOT total turn duration, so an arbitrarily long turn that keeps
streaming is never cut; only a stream quiet for >15 min is treated as a hang.
- ai-streaming-fetch.ts: createStreamingFetch() + streamTimeoutMs() /
streamingDispatcherOptions() (the shared, configurable timeout).
- ai.service: the chat provider fetch is createStreamingFetch(), wrapped by the
existing passive ECONNRESET telemetry (createDiagnosticFetch gained an
optional baseFetch) so the telemetry observes the SAME transport.
- mcp-clients: the SSRF-pinned Agent uses streamingDispatcherOptions().
Investigation: reproduced the transport mechanism against the real z.ai endpoint
(a 1ms headersTimeout throws UND_ERR_HEADERS_TIMEOUT — the exact drop) and ran
the actual research agent to a ~428k-token context. Verified the fixed path
streams cleanly live (glm-5.2 turns finish; telemetry confirms the streaming
fetch is in use).
Tests: ai-streaming-fetch.spec (default 15m + env override + invalid fallback +
both-timeouts + streams a delayed response); ai-http-diagnostics + ai/mcp specs
green. server tsc clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>