z.ai's GLM Coding Plan throttles hard (429s) and stalls under bursts/overlap;
two mitigations in the shared AI transport:
- Per-host concurrency gate (default 1, env AI_HTTP_MAX_CONCURRENCY): outbound AI
requests to a given host are serialized, and the slot is held until the
(streamed) response body is fully consumed — so a chat stream blocks an
overlapping title-gen / second-tab / RAG-embedding request instead of tripping
z.ai's ~1-concurrent limit. A defensive max-hold (AI_HTTP_MAX_HOLD_MS, 10 min)
prevents a hung stream from deadlocking all AI traffic (headers/body timeouts
are disabled, see #144).
- 429 backoff (AI_HTTP_MAX_429_RETRIES, default 3): respect Retry-After (or
exponential backoff) and retry, so a rate-limited agent step waits the throttle
out instead of failing the whole turn.
NOTE: these address the burst/overlap dimension. The dominant symptom — z.ai's
erratic time-to-first-byte (measured 2s..56s, endpoint/UA/tool-count-independent)
— is mitigated by #144 (wait like curl); it is a z.ai-side capacity issue, not
something client code can speed up.
Tests: ai-http.spec.ts gains a concurrency-gate test (a 2nd request to the same
host does not hit the server until the 1st body is consumed) and a 429-backoff
test (Retry-After honored, eventual 200). 8/8 pass; typecheck clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>