fix(ai-chat): don't sever long agent turns at undici's 300s stream timeout (#175) #176
Reference in New Issue
Block a user
Delete Branch "fix/ai-stream-undici-timeout"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #175 (ломается коннект на длинных задачах).
Первопричина
«Lost connection to the AI provider» на длинных turn'ах. Node global fetch (undici) по умолчанию ставит headersTimeout И bodyTimeout = 300_000ms. Chat-провайдер z.ai и dispatcher внешних MCP оба на нём без override → длинный turn (или простаивающий SSE crawl4ai) рвётся за 5 мин.
Фикс (после ревью — конечный таймаут, не 0)
Отдельный undici-dispatcher, который поднимает оба стрим-таймаута до щедрого, но КОНЕЧНОГО silence-таймаута (дефолт 15 мин,
AI_STREAM_TIMEOUT_MS) на каждом пути. НЕ 0: иначе зависший провайдер при живом клиенте висел бы вечно (abortSignal срабатывает только на дисконнект). Таймаут ограничивает ТИШИНУ (TTFB и паузу между чанками), не общую длину turn'а — длинная прогрессирующая задача не режется.ai-streaming-fetch.ts:createStreamingFetch()+streamTimeoutMs()/streamingDispatcherOptions()(общий, конфигурируемый таймаут).ai.service: chat-fetch = streaming fetch, обёрнутый инструментирующим провайдер-HTTP враппером (телеметрия наблюдает тот же транспорт).mcp-clients: SSRF-pinned Agent используетstreamingDispatcherOptions().По ревью (#176 (comment 1300)) — внесено
ai-http-diagnostics.ts→ai-provider-http.ts,createDiagnosticFetch→createInstrumentedFetch, поле →aiProviderFetch, снял метки «temporary». Теперь chat-транспорт = единая намеренная конструкция (streaming-fetch + инструментирование), а не костыль-под-уборку.AI_STREAM_TIMEOUT_MSдобавлен в.env.example.ai-provider-http.spec(делегация baseFetch, Response нетронут, rethrow, дефолт-global);ai-streaming-fetch.specстал нагрузочным — приAI_STREAM_TIMEOUT_MSниже задержки сервера вызов реально отваливается (потерянный dispatcher → global 300s не отвалился бы), что доказывает проводку.Исследование (на стенде, креды из issue)
headersTimeout: 1→UND_ERR_HEADERS_TIMEOUT).Тесты + server tsc зелёные. MCP-серверы/роль оставлены на стенде для live-проверки.
Комплементарно к #177 (reasoning виден) — обе трогают один
case 'openai', склею при мерже.🤖 Generated with Claude Code
Long research turns failed mid-task with "Lost connection to the AI provider". Node's global fetch (undici) defaults BOTH headersTimeout and bodyTimeout to 300_000ms, and the chat provider + the external-MCP dispatcher both ran on it with no override, so: - the z.ai chat stream dropped when a late step's huge accumulated context pushed the model's time-to-first-token past 5 min (reproduced: even a trivial glm-5.2 query has a ~4-8s first-chunk latency; the live telemetry shows it scaling with context — and a long run reaches 400k+-token steps), or a reasoning model paused >5 min between chunks (bodyTimeout); - the crawl4ai SSE transport, held open across the whole turn, dropped when it idled >5 min between tool calls — a tool failure that aborts the turn and surfaces the same banner. Fix: a dedicated undici dispatcher with both stream timeouts DISABLED (0) on each path. Cancellation is unchanged — the turn is bound to the request abortSignal (client disconnect) and capped by MAX_AGENT_STEPS, so it still terminates; it just no longer dies at an arbitrary 5-minute wall-clock. - ai-streaming-fetch.ts: createStreamingFetch() (+ exported option contract). - ai.service: the chat provider's fetch is now createStreamingFetch(), wrapped by the existing passive ECONNRESET telemetry (createDiagnosticFetch gained an optional baseFetch) so the telemetry observes the SAME transport the turn uses. - mcp-clients: headersTimeout/bodyTimeout: 0 on the SSRF-pinned Agent. Investigation: reproduced the transport mechanism against the real z.ai endpoint (a 1ms headersTimeout throws UND_ERR_HEADERS_TIMEOUT — the exact drop) and ran the actual research agent to a ~428k-token context. Verified the fixed path streams cleanly live (glm-5.2 turns finish; telemetry confirms the streaming fetch is in use). Tests: ai-streaming-fetch.spec (option contract + streams a delayed response); ai-http-diagnostics + ai/mcp specs green. server tsc clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>a5aa1185b1toa14560c7c9- Wording: every comment now says the stream timeouts are RAISED to a generous-but-finite ~15-min silence timeout, not "disabled (0)" (the stale comments contradicted the code, which uses AI_STREAM_TIMEOUT_MS, default 900000ms). - Architecture (the load-bearing-temporary trap): the streaming fetch reached the chat provider only by riding the "temporary DIAGNOSTIC" telemetry, so deleting the telemetry by its own label would silently revert the timeout fix. Legitimize it: rename ai-http-diagnostics.ts -> ai-provider-http.ts, createDiagnosticFetch -> createInstrumentedFetch, field aiDiagnosticFetch -> aiProviderFetch, drop the "temporary" labels, and document the chat transport (streaming fetch + instrumentation) as one intentional construct. - Docs: AI_STREAM_TIMEOUT_MS added to .env.example next to AI_EMBEDDING_TIMEOUT_MS. - Tests: - ai-provider-http.spec: createInstrumentedFetch delegates to the injected baseFetch with the same input/init, returns the Response untouched, rethrows the error, and defaults to global fetch — covering the baseFetch seam. - ai-streaming-fetch.spec: the delayed-server test is now LOAD-BEARING — with AI_STREAM_TIMEOUT_MS set below the 1.5s server delay the call actually rejects (a lost dispatcher -> global 300s default would NOT), proving the configured dispatcher is wired; plus the default-timeout happy path. server tsc clean; ai-streaming-fetch / ai-provider-http / ai.service / mcp-servers / ai-error specs green (41). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>Ghost referenced this pull request2026-06-24 22:58:54 +03:00
Ghost referenced this pull request2026-06-25 00:05:31 +03:00