fix(ai-chat): recycle keep-alive sockets + retry pre-response resets (#175)

The real cause of the long-task "Lost connection to the AI provider" — the
earlier 300s-timeout fix (#176) was the wrong layer. The provider-HTTP telemetry
on the user's deploy shows the failures are PRE-RESPONSE `read ECONNRESET` ~500ms
in (not a 300s/15min timeout), correlated with idleSincePrevCall ~42s and large
bodies; and crucially a retry of the SAME request often succeeds. A direct probe
to the real z.ai endpoint does NOT reset (113KB bodies and a 45s-idle keep-alive
reuse both succeed), and another agent (opencode) runs fine from the same infra —
so the provider is healthy and the egress network is usable. The difference is
the transport: undici's keep-alive pool REUSES a socket that the deployment's
egress (NAT / firewall / conntrack) silently dropped during a long idle gap, so
the next request resets pre-response.

Fix (brings gitmost in line with clients that don't reuse stale sockets):
- Keep-alive recycling: the streaming dispatcher (chat fetch AND the external-MCP
  dispatcher, via the shared streamingDispatcherOptions) now sets
  keepAliveTimeout + keepAliveMaxTimeout to a 10s recycle window
  (AI_STREAM_KEEPALIVE_MS), so a connection idle longer than that is closed
  instead of reused — a long-gap step opens a fresh connection. keepAliveMaxTimeout
  also caps a server-advertised keep-alive so the provider can't widen the window.
- Pre-response connection retry: createStreamingFetch retries a connection-level
  reset (ECONNRESET / UND_ERR_SOCKET / ECONNREFUSED / EPIPE / *_TIMEOUT) on a
  fresh connection up to 2 times. This is SAFE because fetch() only rejects before
  the Response resolves — a started stream is never replayed; an abort (client
  disconnect) is never retried.

Tests: ai-streaming-fetch.spec — keep-alive options, streamKeepAliveMs env,
isRetryableConnectError, and a server that resets the first connection so the
retry must land on a fresh one (+ aborted requests are not retried). Verified on
the stand that a normal turn still streams (reasoning + text + finish) through the
new transport. server tsc + ai/mcp specs green.

Note: root cause is the deployment's egress dropping idle connections (Traefik is
inbound-only); this makes the app resilient to it. AI_STREAM_KEEPALIVE_MS can be
lowered if the egress drops faster than ~10s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
claude code agent 227
2026-06-24 23:51:17 +03:00
parent d1fbcc1bfa
commit b0faa2fe32
3 changed files with 198 additions and 17 deletions

View File

@@ -142,6 +142,13 @@ MCP_DOCMOST_PASSWORD=
# provider is eventually broken instead of leaking forever. Default 900000 (15 min).
# AI_STREAM_TIMEOUT_MS=900000
# Keep-alive recycle window (ms) for streaming chat/agent AI + external-MCP calls.
# A pooled connection idle longer than this is closed instead of reused, so a
# NAT / egress firewall / reverse proxy that silently drops idle connections
# cannot poison a reused socket into a PRE-RESPONSE `read ECONNRESET`. Lower it if
# your egress drops idle connections faster than ~10s. Default 10000 (10 s).
# AI_STREAM_KEEPALIVE_MS=10000
# --- Anonymous public-share AI assistant ---
# Opt-in per workspace (AI settings -> "public share assistant"; off by default).
# When enabled, anonymous visitors of a published share can ask an AI about that