fix(ai-chat): adopt the server-returned chat id (two-tab adoption race #137) #138

Merged
vvzvlad merged 7 commits from fix/ai-chat-chatid-adoption into develop 2026-06-23 03:35:04 +03:00

Fixes the AI-chat adoption race (#137), reproduced live with DB proof, plus a docs clarification.

The bug

A brand-new chat (activeChatId === null) had no way to learn the id of the row the server created — the SSE stream never returned it — so the client adopted the newest chat in the per-user list (chats.items[0]). With two tabs open, a second tab creating a chat at ~the same time made its row the newest, so the first tab adopted the wrong id: its later turns persisted into the other chat and the agent rebuilt history from it (commands "leaked" between chats), while the live UI still showed the original conversation.

The fix

  • Server (ai-chat.service.ts): attach the authoritative chatId to the streamed assistant message via the AI SDK messageMetadata on the start part, so it reaches the client on the very first chunk (before any racing tab).
  • Client (chat-thread.tsx, ai-chat-window.tsx): read message.metadata.chatId in useChat's onFinish and adopt that id in place (no remount → the live turn and the thread's chatIdRef follow the real id, so the next turn targets the right chat). Removed the chats.items[0] guess and the adoptNewChat ref.

Verified live (the two-tab race that confirmed the bug)

Tab A new chat "codeword FIG" → while A streams, Tab B new chat "codeword CIRCUS" → A asks again. Run 3×: A's both turns now stay in A's own chat row (the FIG one); A's 2nd turn no longer lands in B's chat. Single-tab new-chat (2 turns + reload → full history in the correct chat) and existing-chat flows unaffected. tsc clean (client + server).

Also: AGENTS.md migration note

Clarifies that DB migrations auto-apply only in production (built image / start:prod); the local dev stand (pnpm dev / nest start --watch) does not auto-run them — you must pnpm --filter server migration:latest after pulling/switching branches. (During this investigation, the unapplied 20260622…-ai-chat-page-origin migration blanket-500'd all of AI chat on the dev stand — an env gotcha, not a code bug.)

🤖 Generated with Claude Code

Fixes the AI-chat **adoption race** (#137), reproduced live with DB proof, plus a docs clarification. ### The bug A brand-new chat (`activeChatId === null`) had no way to learn the id of the row the server created — the SSE stream never returned it — so the client adopted the **newest** chat in the per-user list (`chats.items[0]`). With two tabs open, a second tab creating a chat at ~the same time made its row the newest, so the first tab adopted the **wrong** id: its later turns persisted into the other chat and the agent rebuilt history from it (commands "leaked" between chats), while the live UI still showed the original conversation. ### The fix - **Server** (`ai-chat.service.ts`): attach the authoritative `chatId` to the streamed assistant message via the AI SDK `messageMetadata` on the `start` part, so it reaches the client on the very first chunk (before any racing tab). - **Client** (`chat-thread.tsx`, `ai-chat-window.tsx`): read `message.metadata.chatId` in `useChat`'s `onFinish` and adopt that id **in place** (no remount → the live turn and the thread's `chatIdRef` follow the real id, so the next turn targets the right chat). Removed the `chats.items[0]` guess and the `adoptNewChat` ref. ### Verified live (the two-tab race that confirmed the bug) Tab A new chat "codeword FIG" → while A streams, Tab B new chat "codeword CIRCUS" → A asks again. Run 3×: A's **both** turns now stay in A's own chat row (the FIG one); A's 2nd turn no longer lands in B's chat. Single-tab new-chat (2 turns + reload → full history in the correct chat) and existing-chat flows unaffected. `tsc` clean (client + server). ### Also: AGENTS.md migration note Clarifies that DB migrations auto-apply **only in production** (built image / `start:prod`); the **local dev** stand (`pnpm dev` / `nest start --watch`) does **not** auto-run them — you must `pnpm --filter server migration:latest` after pulling/switching branches. (During this investigation, the unapplied `20260622…-ai-chat-page-origin` migration blanket-500'd all of AI chat on the dev stand — an env gotcha, not a code bug.) 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Ghost added 2 commits 2026-06-22 23:47:21 +03:00
A brand-new chat (activeChatId === null) had no way to learn the id of the row
the server created: the SSE stream never returned it, so the client adopted the
NEWEST chat in the per-user list (chats.items[0]). With two tabs open, a second
tab creating a chat at ~the same time made its row the newest, so the first tab
adopted the wrong id — its later turns persisted into the other chat and the
agent rebuilt history from it (commands leaked between chats), while the live UI
still showed the original conversation. (#137)

The server now attaches the authoritative chatId to the streamed assistant
message via the AI SDK messageMetadata on the 'start' part, so it reaches the
client on the first chunk. The client reads message.metadata.chatId in useChat's
onFinish and adopts that id in place (no remount, so the live turn and the
thread's chatIdRef follow the real id and the next turn targets the right chat).
The chats.items[0] guess and the adoptNewChat ref are removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Migrations auto-run on boot only in production (the built image / start:prod);
the local dev stand (pnpm dev / nest start --watch) does NOT auto-run them, so
after pulling or switching branches you must apply them with
'pnpm --filter server migration:latest' or endpoints touching new columns 500
(e.g. a freshly-added ai_chats.page_id blanket-500s all of AI chat).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Owner

Code review — adoption race fix (#137)

Verdict: Request changes. The core fix is correct and genuinely closes #137 — streaming the authoritative chatId in message metadata and adopting it in place (instead of guessing chats.items[0]) is the right approach. Security / conventions / documentation / regressions / simplification all came back clean. Two things to address before merge, plus one non-blocking architecture note.

⚠️ Warning — a new-chat first turn that fails before the start chunk can create a duplicate chat row

Adoption is now gated on a truthy serverChatId, which only exists once the AI SDK has processed the stream's start part. But the server creates the chat row and persists the user message before the stream is wired (apps/server/src/core/ai-chat/ai-chat.service.ts ~232–261). If anything fails in that window — a history DB read, convertToModelMessages, or any synchronous throw before the first chunk (all re-thrown by the try/catch at ~520 → the controller writes a bare 500 with no SSE) — the client never receives a chatId.

On that path both onFinish({ isError }) and onError call onTurnFinished(undefined), so adoption is skipped and chatIdRef.current stays null. The composer is still enabled, so the user retries → the next send carries chatId: null → the server inserts a second chat row, orphaning the first turn in the first row.

The old code adopted unconditionally (via the adoptNewChat ref + the chat-list refetch), so it didn't have this gap — the base onError comment even states the intent: "so a brand-new chat that fails its first turn is adopted." The PR silently drops that (and that comment is now stale).

Fix: for the activeChatId === null && !serverChatId branch, don't drop adoption silently — e.g. gate the next send so a failed first turn can't create a second row, or fall back to the list refetch only when there is exactly one unambiguous candidate. Also update the now-stale onError comment.

⚠️ Test coverage — the fix ships with no tests

This is a concurrency bug fix with no regression test; nothing locks #137 in.

  • Client adoption decision (onTurnFinished) has three branches — adopt / don't-adopt / skip — and none is tested, including the don't-adopt branch where the warning above lives. The project already extracts orchestration logic out of chat-thread into utils/queue-helpers.ts with a queue-helpers.test.ts; the same pattern fits a pure adoptedChatId(activeChatId, serverChatId) helper. A test that returns serverChatId (not "newest in list") is the regression guard for #137.
  • Server messageMetadata (attaching chatId only on the start part) has no test — ai-chat.service.spec.ts is untouched. If it ever regresses (wrong part / wrong field), #137 quietly returns. Extract the callback to an exported helper and assert start → { chatId }, finish → undefined.

💡 Architecture (non-blocking, forward-looking)

This is the 2nd id-handling fix in this exact spot. The new→persisted chat identity is spread across ~5 cooperating state slots (activeChatId atom, liveThreadChatId, threadKey, chatIdRef, chatStoreId) plus a render-phase guard and a load-bearing "both updates batch in one render" assumption. This PR improves the trigger (drops the heuristic useEffect + the adoptNewChat ref) but leaves the scattering in place. When this area is next touched, worth considering either letting ChatThread own its own id transition, or modeling the lifecycle as a small { new | adopting | persisted } state machine so the "stale liveThreadChatId" state becomes unrepresentable. Not a merge blocker.


Multi-aspect review: security · stability · conventions · documentation · regressions · test-coverage · simplification · architecture. The stability finding was verified against the controller/service error path.

## Code review — adoption race fix (#137) **Verdict: Request changes.** The core fix is correct and genuinely closes #137 — streaming the authoritative `chatId` in message metadata and adopting it in place (instead of guessing `chats.items[0]`) is the right approach. Security / conventions / documentation / regressions / simplification all came back clean. Two things to address before merge, plus one non-blocking architecture note. ### :warning: Warning — a new-chat first turn that fails before the `start` chunk can create a **duplicate chat row** Adoption is now gated on a truthy `serverChatId`, which only exists once the AI SDK has processed the stream's `start` part. But the server creates the chat row and persists the user message **before** the stream is wired (`apps/server/src/core/ai-chat/ai-chat.service.ts` ~232–261). If anything fails in that window — a history DB read, `convertToModelMessages`, or any synchronous throw before the first chunk (all re-thrown by the `try/catch` at ~520 → the controller writes a bare 500 with no SSE) — the client never receives a `chatId`. On that path both `onFinish({ isError })` and `onError` call `onTurnFinished(undefined)`, so adoption is skipped and `chatIdRef.current` stays `null`. The composer is still enabled, so the user retries → the next send carries `chatId: null` → the server inserts a **second** chat row, orphaning the first turn in the first row. The old code adopted unconditionally (via the `adoptNewChat` ref + the chat-list refetch), so it didn't have this gap — the base `onError` comment even states the intent: *"so a brand-new chat that fails its first turn is adopted."* The PR silently drops that (and that comment is now stale). **Fix:** for the `activeChatId === null && !serverChatId` branch, don't drop adoption silently — e.g. gate the next send so a failed first turn can't create a second row, or fall back to the list refetch only when there is exactly one unambiguous candidate. Also update the now-stale `onError` comment. ### :warning: Test coverage — the fix ships with no tests This is a concurrency bug fix with no regression test; nothing locks #137 in. - **Client adoption decision (`onTurnFinished`)** has three branches — adopt / don't-adopt / skip — and none is tested, including the don't-adopt branch where the warning above lives. The project already extracts orchestration logic out of `chat-thread` into `utils/queue-helpers.ts` with a `queue-helpers.test.ts`; the same pattern fits a pure `adoptedChatId(activeChatId, serverChatId)` helper. A test that returns `serverChatId` (not "newest in list") is the regression guard for #137. - **Server `messageMetadata`** (attaching `chatId` only on the `start` part) has no test — `ai-chat.service.spec.ts` is untouched. If it ever regresses (wrong part / wrong field), #137 quietly returns. Extract the callback to an exported helper and assert `start → { chatId }`, `finish → undefined`. ### :bulb: Architecture (non-blocking, forward-looking) This is the 2nd id-handling fix in this exact spot. The new→persisted chat identity is spread across ~5 cooperating state slots (`activeChatId` atom, `liveThreadChatId`, `threadKey`, `chatIdRef`, `chatStoreId`) plus a render-phase guard and a load-bearing "both updates batch in one render" assumption. This PR improves the *trigger* (drops the heuristic `useEffect` + the `adoptNewChat` ref) but leaves the scattering in place. When this area is next touched, worth considering either letting `ChatThread` own its own id transition, or modeling the lifecycle as a small `{ new | adopting | persisted }` state machine so the "stale `liveThreadChatId`" state becomes unrepresentable. Not a merge blocker. --- <sub>Multi-aspect review: security · stability · conventions · documentation · regressions · test-coverage · simplification · architecture. The stability finding was verified against the controller/service error path.</sub>
Ghost added 1 commit 2026-06-23 01:17:43 +03:00
Addresses the PR #138 review.

Blocker 1 — duplicate chat row: a brand-new chat whose first turn errors BEFORE
the SSE 'start' chunk never receives the authoritative chatId, so metadata
adoption can't run; a retry then sent chatId:null and the server inserted a
SECOND chat row, orphaning the first turn. Keep metadata adoption as the primary
path (resolveAdoptedChatId) and add a bounded, unambiguous fallback: on a
new-chat finish with no server id, snapshot the known chat ids and, once the
list refetch lands, adopt the SINGLE newly-appeared id (pickNewlyCreatedChatId).
Zero or >1 new ids (e.g. two tabs racing) → no adoption — no items[0] guessing,
so #137 stays fixed. The wait-for-refetch guard compares set membership (robust
to a concurrent delete), and the diff dedupes so a repeated id from a paginated
list never reads as ambiguous.

Blocker 2 — tests: new adopt-chat-id.test.ts covers both pure helpers (adopt
decision + newly-created-id diff incl. dedupe/reorder); the server
messageMetadata callback is extracted to chatStreamStartMetadata and unit-tested
(start -> {chatId}, otherwise undefined).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ghost added 1 commit 2026-06-23 01:46:27 +03:00
Addresses the PR #138 review's architecture note (the deferred 'non-blocking'
item). The brand-new -> persisted chat identity was spread across two separate
useState slots (threadKey + liveThreadChatId) plus a render-phase guard, so the
mount key and the live thread's chat id could in principle diverge.

Consolidate them into ONE atomic state object { key, chatId } with pure,
unit-tested transitions in thread-identity.ts:
- newThread(newKey)       — a brand-new id-less chat (fresh session key);
- switchThread(chatId)    — switch to an existing chat (key := chat id) -> remount;
- adoptThread(prev, id)   — a new chat learns its real id IN PLACE (key unchanged
                            -> no remount, live useChat store preserved).
The 'key vs chatId diverged' state is now unrepresentable. The render-phase
reconciliation stays (the atom is also set externally, e.g. page-history's
history-item opening a referenced chat), but adoption vs switch is now explicit.

Behavior is unchanged; verified live: new-chat adopt + 2nd turn in the same row,
no mid-conversation remount, the two-tab race stays leak-free, switch-to-existing
remounts + reseeds, and reload restores full history. Adds thread-identity.test.ts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ghost added 1 commit 2026-06-23 02:26:06 +03:00
Addresses the 2nd PR #138 review (test debt + the Variant-B architecture ask).

The new→persisted chat id lifecycle (mount key, both adoption paths, the
history-load latch, the render-phase reconciler, onTurnFinished) is moved out of
the 768-line window into a new useChatSession hook driven by a pure
threadSessionReducer (reconcile/adopt), so adopt-vs-switch is one explicit
dispatch point and the scattering the review flagged is gone (window: 768→~620).

Tests (the blockers):
- use-chat-session.test.tsx — hook-level locks incl. the #137 regression
  (adopts the authoritative streamed id 'A', NOT chats.items[0]='B' — fails on
  the old heuristic), the error-path fallback (arm/adopt/ambiguous/add+delete),
  the disarm-on-reconcile lock (a fallback armed then switched away must not be
  adopted by a late refetch), in-place-adopt-keeps-key vs external-switch-remount,
  and the waitingForHistory latch.
- extractServerChatId (reading message.metadata.chatId) and newlyAddedChatIds
  extracted as pure helpers with unit tests; threadSessionReducer tested.

Cleanups: single canonical #137 explanation in adopt-chat-id.ts (other sites
reference it); fallback effect computes the set diff once; invalidate callbacks
memoized; redundant invariant tests folded.

Behavior preserved — re-verified live (z.ai glm-5.2): new-chat adopt + 2nd turn
in the same row, no mid-conversation remount, two-tab race leak-free, switch to
an existing chat reseeds full history, reload restores history.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ghost added 1 commit 2026-06-23 03:04:53 +03:00
Closes the 3rd PR #138 review.

Warning fix: the render-phase reconciler only disarms the error-path adoption
fallback when activeChatId actually changes. Pressing 'New chat' while ALREADY
in a new chat keeps activeChatId === null (a no-op atom write), so the reconciler
never fired and a stale armed fallback could adopt the just-failed chat from a
late refetch, yanking the user out of their fresh chat. useChatSession now
returns cancelPendingAdoption(); the window calls it from startNewChat AND
selectChat. (The hook call moved above those callbacks so they can reference it.)
Added a hook test that fails without the explicit disarm, plus a test for the
existing-chat onTurnFinished branch (no adoption + per-chat invalidation).

Cleanups: removed the dead pickNewlyCreatedChatId (the fallback effect uses
newlyAddedChatIds directly with the 0/1/>1 decision inline) and its tests/doc
mention; inlined the two invalidation closures (onTurnFinished is read live by
useChat's onFinish, never in an effect dep array, so memoizing them was needless
ceremony).

Verified: tsc clean, 127 ai-chat tests green; live (z.ai glm-5.2) new chat + 2nd
turn recalled the number in the SAME row (1 chat / 4 messages), no page errors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Ghost added 1 commit 2026-06-23 03:33:28 +03:00
Closes the 4th PR #138 review (3 suggestions, no blockers).

- Double-call safety: a failed turn fires both useChat onFinish AND onError, so
  onTurnFinished can run twice in one turn (streamed id, then no id) before a
  re-render. onTurnFinished now reads the live id from a ref (set imperatively on
  primary adoption) instead of the stale closure, so the 2nd no-id call cannot
  re-arm the error-path fallback at the source; the render-phase reconciler is the
  second layer. Added a hook test for the sequence — verified it fails only if
  BOTH layers are removed (non-tautological).
- Conventions: extracted named UseChatSessionOptions / UseChatSessionResult
  interfaces (was an inline param literal + ChatSession); the test derives its
  driver props from them.
- Simplification: extracted the chatIdSnapshot(chats) projection used at both the
  fallback arm site and the resolver effect.

Architecture notes from the review (caller-driven disarm contract; cross-process
{chatId} type) intentionally left as Variant A per the reviewer's recommendation.

tsc clean; 128 ai-chat tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
vvzvlad merged commit 97002f318a into develop 2026-06-23 03:35:04 +03:00
Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: vvzvlad/gitmost#138