fix(#370): thread trx into addPageWatchers (F7 self-deadlock) + restore contributors on commit-failure (F8) + assert the lock (F9) (review round 2)

The round-1 F3 fix (wrapping the processor's find+save in a locked tx) itself introduced two regressions:

F7 [CRITICAL] addPageWatchers ran WITHOUT trx inside the tx holding FOR UPDATE on pages[pageId]. The watcher insert's FK check takes FOR KEY SHARE on the same row, but on a DIFFERENT pool connection: a true self-deadlock (our tx connection sits idle-in-transaction awaiting the JS await, the insert connection blocks on the lock). Now passes trx (addPageWatchers already accepts it and routes it through insertMany), so the FK lock is taken on the connection that already holds FOR UPDATE, with no self-conflict.

F8 [WARNING] popContributors is a destructive Redis SPOP; the inner catch only restores on a throw INSIDE the callback. A COMMIT failure throws OUTSIDE it, rolling the snapshot back while the pop is gone → a retry writes an unattributed version. Now tracks the popped set and restores it in an outer catch (idempotent SADD), leaving BullMQ to retry with attribution intact.

F9 [WARNING] The spec asserted saveHistory args with a loosened objectContaining that stopped verifying trx, and never pinned withLock/trx on findById or the trx on addPageWatchers, which is exactly why F7 slipped. Restored the exact saveHistory(trx) assertion and added findById({withLock,trx}) + addPageWatchers trx assertions (the latter would have caught F7), plus a commit-failure test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
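
A minimal sketch of the F8 pattern described above (restore a destructive pop when the failure happens at COMMIT, outside the transaction callback). This is not the processor's actual code: `ContributorStore`, `RunInTx`, `snapshotWithAttribution` and `failingCommit` are hypothetical names, and the in-memory store only stands in for the Redis SPOP/SADD pair.

```typescript
// Hypothetical in-memory stand-in for the Redis contributor set.
class ContributorStore {
  private readonly sets = new Map<string, Set<string>>();

  // Idempotent add, like SADD: re-adding an existing member is a no-op.
  add(pageId: string, userIds: string[]): void {
    const set = this.sets.get(pageId) ?? new Set<string>();
    userIds.forEach((id) => set.add(id));
    this.sets.set(pageId, set);
  }

  // Destructive read, like popping the whole set with SPOP.
  popAll(pageId: string): string[] {
    const popped = Array.from(this.sets.get(pageId) ?? []);
    this.sets.delete(pageId);
    return popped;
  }

  members(pageId: string): string[] {
    return Array.from(this.sets.get(pageId) ?? []).sort();
  }
}

type Tx = { id: number };
// Assumed shape of a transaction runner: runs the callback, then commits.
type RunInTx = <T>(work: (trx: Tx) => Promise<T>) => Promise<T>;

async function snapshotWithAttribution(
  store: ContributorStore,
  runInTx: RunInTx,
  pageId: string,
  saveHistory: (trx: Tx, contributors: string[]) => Promise<void>,
): Promise<void> {
  // Tracked OUTSIDE the callback so the outer catch can still see it.
  let popped: string[] = [];
  try {
    await runInTx(async (trx) => {
      popped = store.popAll(pageId);
      await saveHistory(trx, popped);
    });
  } catch (err) {
    // Covers a throw inside the callback AND a commit failure after it.
    if (popped.length > 0) store.add(pageId, popped);
    throw err; // rethrow so the queue can retry with attribution intact
  }
}

// Simulates a COMMIT failure: the callback succeeds, then the runner throws.
const failingCommit: RunInTx = async (work) => {
  await work({ id: 1 });
  throw new Error("commit failed");
};
```

Because the restore is an idempotent add, running it after a failure that already restored the set (for example in an inner catch) leaves the set unchanged.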
fix(#370): ES2021-safe spec, source-specific idle ceiling, processor lock, tested index mapping (review round 1)
2026-07-05 05:50:19 +03:00 · 2026-07-05 05:31:26 +03:00 · 2026-07-05 05:00:12 +03:00 · 2026-07-05 00:59:05 +03:00 · 2026-07-05 00:40:26 +03:00 · 2026-07-05 00:31:04 +03:00
107 changed files with 7426 additions and 1295 deletions
@@ -209,6 +209,20 @@ MCP_DOCMOST_PASSWORD=
 # active" behavior.
 # AI_CHAT_DEFERRED_TOOLS=true

+# --- Autonomous / detached agent runs (settings.ai.autonomousRuns) ---
+# Opt-in per workspace (AI settings; off by default). When on, a chat turn becomes
+# a server-side RUN that survives a browser disconnect — only an explicit Stop ends
+# it, and a client reconnects/live-follows the run.
+#
+# DEPLOY CONSTRAINT — SINGLE-INSTANCE ONLY in phase 1: Stop and the in-process
+# AbortController that backs it are process-local, so a Stop only aborts a run
+# executing on the SAME replica that owns it (cross-instance pub/sub stop is phase
+# 2 and not yet reliable). Do NOT enable autonomousRuns on a horizontally-scaled
+# deployment (multiple replicas behind a load balancer, or Docmost cloud
+# CLOUD=true) — run a single instance instead. The server logs a startup WARNING
+# when it detects a multi-instance deployment (CLOUD=true) so the constraint is
+# visible, and a startup sweep settles any run left dangling by a restart.
+
 # --- Anonymous public-share AI assistant ---
 # Opt-in per workspace (AI settings -> "public share assistant"; off by default).
 # When enabled, anonymous visitors of a published share can ask an AI about that
@@ -242,3 +256,27 @@ MCP_DOCMOST_PASSWORD=
 # FAILS CLOSED if Redis is unavailable (default: 1,000,000 tokens per workspace
 # per rolling day).
 # SHARE_AI_WORKSPACE_TOKEN_BUDGET_PER_DAY=1000000
+
+# --- Observability / perf metrics (#355) ---
+#
+# Two INDEPENDENT toggles, both OFF by default:
+#
+# 1) METRICS_PORT — the server-side Prometheus scrape endpoint.
+#    UNSET (default) => the whole prom subsystem is OFF: no registry, no
+#    collectors, and NOTHING is exposed on the main app port. There is NO
+#    default port — leaving it blank disables it. When set to a port (e.g.
+#    9464), a SEPARATE bare node:http listener serves GET /metrics on that port
+#    only (never on the main :3000 app listener), for a scraper such as
+#    VictoriaMetrics/Prometheus reaching it as <host>:<port>/metrics.
+# METRICS_PORT=9464
+#
+# 2) CLIENT_TELEMETRY_ENABLED — the public client perf-telemetry sink.
+#    OFF by default. When true, the unauthenticated POST /api/telemetry/vitals
+#    endpoint is registered and browsers collect + send web-vitals / editor
+#    metrics into the `client_metrics` table (read directly by Grafana, separate
+#    from METRICS_PORT). Leave OFF unless you actually consume this data: the
+#    endpoint is public and the table has NO app-side retention, so enabling it
+#    requires an EXTERNAL pruner to bound `client_metrics` growth (the deployed
+#    infra prunes rows >90d via a maintenance container). When off, the endpoint
+#    does not exist and the client installs no observers.
+# CLIENT_TELEMETRY_ENABLED=false
@@ -18,12 +18,48 @@ env:
  IMAGE: ghcr.io/vvzvlad/gitmost

 jobs:
-  # Run the reusable test suite first so a failing test blocks the image build.
+  # Run the reusable test suite. Together with the e2e jobs below it gates the
+  # publish job (the image push), not the build itself — build runs in parallel.
  test:
    uses: ./.github/workflows/test.yml

+  # Runs in parallel with the test/e2e jobs and only warms the buildx cache
+  # (GHA cache, scope develop-amd64). No push happens here — the publish job
+  # below is the only one that pushes the image.
  build:
-    needs: test
+    runs-on: ubuntu-latest
+    timeout-minutes: 30
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Resolve version
+        id: version
+        run: echo "value=$(git describe --tags --always)" >> "$GITHUB_OUTPUT"
+
+      - name: Build develop image (warm cache, no push)
+        uses: docker/build-push-action@v6
+        with:
+          context: .
+          platforms: linux/amd64
+          build-args: |
+            APP_VERSION=${{ steps.version.outputs.value }}
+            AI_AGENT_ROLES_CATALOG_URL=https://raw.githubusercontent.com/vvzvlad/gitmost/develop/agent-roles-catalog
+          push: false
+          cache-from: type=gha,scope=develop-amd64
+          cache-to: type=gha,scope=develop-amd64,mode=max,ignore-error=true
+
+  # The gate: rebuilds from the cache the build job just wrote (near-instant on
+  # a cache hit; worst case — cache eviction — a full rebuild, which matches the
+  # old sequential timing) and pushes :develop only when unit tests AND both
+  # e2e suites AND the build are green.
+  publish:
+    needs: [test, e2e-server, e2e-mcp, build]
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
@@ -57,13 +93,10 @@ jobs:
          push: true
          tags: ${{ env.IMAGE }}:develop
          cache-from: type=gha,scope=develop-amd64
-          cache-to: type=gha,scope=develop-amd64,mode=max,ignore-error=true

-  # e2e jobs run on every develop push but DO NOT gate the build/publish above:
-  # `build` stays `needs: test` only, so the :develop image still ships even if
-  # e2e fails. A failing e2e job turns the run red and triggers GitHub's email
-  # to the pusher — that red run + email is the intended notification, not a
-  # deploy block.
+  # e2e jobs gate the publish (image push), not the build: the :develop image
+  # is pushed only when unit tests AND both e2e suites pass (publish.needs
+  # lists them all).
  e2e-server:
    runs-on: ubuntu-latest
    # Hard cap: the full-AppModule e2e leaks open handles and hung jest to the 6h max.
@@ -124,9 +157,7 @@ jobs:
      - name: Run server e2e
        run: pnpm --filter ./apps/server test:e2e

-  # Same rationale as e2e-server: this job is intentionally NOT in
-  # `build.needs`. Deploy of the :develop image must not be blocked by e2e;
-  # a red run plus GitHub's email to the pusher is the notification mechanism.
+  # Gates the publish too — see the comment above e2e-server.
  e2e-mcp:
    runs-on: ubuntu-latest
    timeout-minutes: 20
@@ -279,6 +279,7 @@ The API server is a Fastify app with a global `/api` prefix (`main.ts` excludes
   - `core/ai-chat/tools/` — the agent's ~40 read+write tools. Every tool runs under the **calling user's** CASL permissions via a per-user loopback access token (`docmost-client.loader.ts`), so the agent can never exceed what the user could do. Only **reversible** operations are exposed (page history + trash; no permanent delete). Agent edits get an "AI agent" provenance badge in page history (`20260616T130000-agent-provenance` migration).
   - `core/ai-chat/embedding/` — RAG indexer + a BullMQ consumer on `AI_QUEUE` that embeds pages into `page_embeddings` (vector search), complementing Postgres full-text search. Pages are (re)indexed on edit; `AI_EMBEDDING_TIMEOUT_MS` bounds a hung embeddings endpoint.
   - `core/ai-chat/external-mcp/` — admins can attach external MCP servers (e.g. Tavily) to give the agent web access. **`ssrf-guard.ts` validates outbound MCP URLs against SSRF** — keep that guard in the path when touching external-MCP connection logic.
+   - `core/ai-chat/ai-chat-run.service.ts` + `ai_chat_runs` — **detached/autonomous agent runs** (`#184`), behind the per-workspace `settings.ai.autonomousRuns` flag (off by default). When on, a turn becomes a server-side RUN that survives a browser disconnect; only an explicit `POST /ai-chat/stop` ends it, and a client reconnects/live-follows via `POST /ai-chat/run`. **DEPLOY CONSTRAINT — single-instance only in phase 1:** Stop and the AbortController that backs it are process-local, so a Stop only aborts a run executing on the **same** replica that owns it (cross-instance pub/sub stop is phase 2). Do **not** enable `autonomousRuns` on a horizontally-scaled deployment (multiple replicas behind a load balancer, or Docmost cloud `CLOUD=true`) — run a single instance instead. The server logs a startup WARNING when it detects a multi-instance deployment (`CLOUD=true`) so the constraint is visible. The startup sweep settles any run left dangling by a restart.

 ### Client structure
 Vite SPA. Code is organized by feature under `apps/client/src/features/*` (mirrors the server domains: `page`, `space`, `comment`, `ai-chat`, `editor`, …). Conventions:
@@ -12,6 +12,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Added

+- **Save intentional page versions.** Press `Cmd/Ctrl+S` (or use the page menu)
+  to save a named version of a page. The history panel now distinguishes
+  intentional versions (a "Saved" / "Agent version" badge) from automatic
+  snapshots, dims autosaves, and offers an "Only versions" filter. Automatic
+  snapshots switched from a fixed interval to a trailing idle-flush with a
+  max-wait ceiling, and a boundary snapshot is pinned whenever the editing source
+  changes (e.g. a person's edits followed by the AI agent). (#370)
+
 - **Place several images side by side in a row.** A new "Inline (side by
  side)" alignment mode in the image bubble menu renders consecutive inline
  images as a row that wraps onto the next line on narrow screens. The row is
@@ -72,6 +80,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
  append/prepend fragments, nor to COMMENT bodies — a comment may legitimately
  contain a standalone footnote definition, which canonicalization would drop.
  (#228)
+- **Detached, autonomous agent runs that survive a browser disconnect.** When the
+  new `settings.ai.autonomousRuns` workspace flag is on (off by default), an
+  AI-chat turn becomes a first-class, server-side RUN tracked in a new
+  `ai_chat_runs` table instead of a socket-bound stream: closing the tab or
+  losing the connection no longer aborts the turn — it keeps executing and
+  persisting server-side, and only an explicit Stop ends it. A client can
+  reconnect and live-follow (or stop) an in-flight run via `POST /ai-chat/run`
+  (resolve the latest run + its assistant message for a chat) and
+  `POST /ai-chat/stop` (stop by `runId` or `chatId`). A partial unique index
+  enforces one active run per chat, and a startup sweep settles any run left
+  dangling by a restart. Phase 1 is single-instance-only (cross-instance Stop is
+  not yet reliable); the server warns at startup on a horizontally-scaled
+  deployment. (#184)
 - **Out-of-band page transfer via an in-RAM blob sandbox (`stash_page`).** A
  new MCP tool serializes a whole page (its full ProseMirror JSON, with every
  internal image/file mirrored) into an ephemeral in-RAM blob and returns only
@@ -5,6 +5,13 @@ RUN npm install -g pnpm@10.4.0

 FROM base AS builder

+# re2 (packages/mcp) always compiles from source under pnpm (the prebuilt-binary
+# download cannot identify the GitHub repo), so node-gyp needs python3/make/g++.
+# This stage is discarded, so the toolchain can stay installed.
+RUN apt-get update \
+  && apt-get install -y --no-install-recommends python3 make g++ \
+  && rm -rf /var/lib/apt/lists/*
+
 WORKDIR /app

 COPY . .
@@ -57,9 +64,16 @@ COPY --from=builder /app/patches /app/patches

 RUN chown -R node:node /app

-USER node
+# Toolchain is needed transiently to compile re2 during the prod install; install
+# and purge it in one layer to keep the final image slim. The install itself runs
+# as the node user via su to keep node_modules ownership without a costly chown layer.
+RUN apt-get update \
+  && apt-get install -y --no-install-recommends python3 make g++ \
+  && su node -c "pnpm install --frozen-lockfile --prod" \
+  && apt-get purge -y --auto-remove python3 make g++ \
+  && rm -rf /var/lib/apt/lists/*

-RUN pnpm install --frozen-lockfile --prod
+USER node

 RUN mkdir -p /app/data/storage

@@ -13,7 +13,6 @@
  },
  "dependencies": {
    "@ai-sdk/react": "^3.0.208",
-    "@braintree/sanitize-url": "7.1.2",
    "@atlaskit/pragmatic-drag-and-drop": "1.8.1",
    "@atlaskit/pragmatic-drag-and-drop-auto-scroll": "2.1.5",
    "@atlaskit/pragmatic-drag-and-drop-flourish": "2.0.15",
@@ -62,6 +61,7 @@
    "react-clear-modal": "^2.0.18",
    "react-dom": "^18.3.1",
    "react-drawio": "1.0.7",
+    "web-vitals": "^5.1.0",
    "react-error-boundary": "6.1.1",
    "react-helmet-async": "3.0.0",
    "react-i18next": "16.5.8",
@@ -1385,5 +1385,14 @@
  "The commented text changed since this suggestion was made; it was not applied.": "The commented text changed since this suggestion was made; it was not applied.",
  "Dismiss": "Dismiss",
  "Suggestion dismissed": "Suggestion dismissed",
-  "Failed to dismiss suggestion": "Failed to dismiss suggestion"
+  "Failed to dismiss suggestion": "Failed to dismiss suggestion",
+  "Save version": "Save version",
+  "Ctrl+S": "Ctrl+S",
+  "Version saved": "Version saved",
+  "Already saved as the latest version": "Already saved as the latest version",
+  "Agent version": "Agent version",
+  "Boundary": "Boundary",
+  "Autosave": "Autosave",
+  "Only versions": "Only versions",
+  "No saved versions yet.": "No saved versions yet."
 }
@@ -1248,5 +1248,14 @@
  "The commented text changed since this suggestion was made; it was not applied.": "Прокомментированный текст изменился после создания предложения; оно не было применено.",
  "Dismiss": "Не применять",
  "Suggestion dismissed": "Предложение отклонено",
-  "Failed to dismiss suggestion": "Не удалось отклонить предложение"
+  "Failed to dismiss suggestion": "Не удалось отклонить предложение",
+  "Save version": "Сохранить версию",
+  "Ctrl+S": "Ctrl+S",
+  "Version saved": "Версия сохранена",
+  "Already saved as the latest version": "Уже сохранено как последняя версия",
+  "Agent version": "Версия агента",
+  "Boundary": "Граница",
+  "Autosave": "Автосейв",
+  "Only versions": "Только версии",
+  "No saved versions yet.": "Пока нет сохранённых версий."
 }
@@ -1,72 +1,38 @@
-import { lazy, Suspense } from "react";
 import { Navigate, Route, Routes } from "react-router-dom";
-import { Center, Loader } from "@mantine/core";
-import { Error404 } from "@/components/ui/error-404.tsx";
-import Layout from "@/components/layouts/global/layout.tsx";
-import { useTrackOrigin } from "@/hooks/use-track-origin";
-
-// ShareLayout is route-split: its ShareShell chrome pulls in the table of
-// contents (and thus TipTap), so keeping it out of the eager graph removes the
-// editor engine from startup for authenticated users too.
-const ShareLayout = lazy(
-  () => import("@/features/share/components/share-layout.tsx"),
-);
-
-// Auth / entry pages stay eager: they are the first paint for an unauthenticated
-// visitor (e.g. /login) and are already small, so code-splitting them would only
-// add a cold-chunk round trip to the most common cold-start path.
 import SetupWorkspace from "@/pages/auth/setup-workspace.tsx";
 import LoginPage from "@/pages/auth/login";
+import Home from "@/pages/dashboard/home";
+import Page from "@/pages/page/page";
+import AccountSettings from "@/pages/settings/account/account-settings";
+import WorkspaceMembers from "@/pages/settings/workspace/workspace-members";
+import WorkspaceSettings from "@/pages/settings/workspace/workspace-settings";
+import AiSettings from "@/pages/settings/workspace/ai-settings";
+import Groups from "@/pages/settings/group/groups";
+import GroupInfo from "./pages/settings/group/group-info";
+import Spaces from "@/pages/settings/space/spaces.tsx";
+import { Error404 } from "@/components/ui/error-404.tsx";
+import AccountPreferences from "@/pages/settings/account/account-preferences.tsx";
+import SpaceHome from "@/pages/space/space-home.tsx";
+import PageRedirect from "@/pages/page/page-redirect.tsx";
+import Layout from "@/components/layouts/global/layout.tsx";
 import InviteSignup from "@/pages/auth/invite-signup.tsx";
 import ForgotPassword from "@/pages/auth/forgot-password.tsx";
 import PasswordReset from "./pages/auth/password-reset";
-import PageRedirect from "@/pages/page/page-redirect.tsx";
+import SharedPage from "@/pages/share/shared-page.tsx";
+import Shares from "@/pages/settings/shares/shares.tsx";
+import ShareLayout from "@/features/share/components/share-layout.tsx";
 import ShareRedirect from "@/pages/share/share-redirect.tsx";
-
-// Heavy / leaf pages are route-split with React.lazy so their code (most
-// importantly the whole TipTap editor + KaTeX + lowlight grammars + drawio that
-// the page editor and the readonly share editor pull in) is fetched only when
-// the matching route is actually visited. The <Suspense> boundaries live inside
-// each Layout (around its <Outlet/>), so the app shell stays mounted while a
-// route chunk loads.
-const Home = lazy(() => import("@/pages/dashboard/home"));
-const Page = lazy(() => import("@/pages/page/page"));
-const SpaceHome = lazy(() => import("@/pages/space/space-home.tsx"));
-const SpaceTrash = lazy(() => import("@/pages/space/space-trash.tsx"));
-const SpacesPage = lazy(() => import("@/pages/spaces/spaces.tsx"));
-const FavoritesPage = lazy(() => import("@/pages/favorites/favorites-page"));
-const LabelPage = lazy(() => import("@/pages/label/label-page"));
-const SharedPage = lazy(() => import("@/pages/share/shared-page.tsx"));
-
-const AccountSettings = lazy(
-  () => import("@/pages/settings/account/account-settings"),
-);
-const AccountPreferences = lazy(
-  () => import("@/pages/settings/account/account-preferences.tsx"),
-);
-const WorkspaceSettings = lazy(
-  () => import("@/pages/settings/workspace/workspace-settings"),
-);
-const AiSettings = lazy(() => import("@/pages/settings/workspace/ai-settings"));
-const WorkspaceMembers = lazy(
-  () => import("@/pages/settings/workspace/workspace-members"),
-);
-const Groups = lazy(() => import("@/pages/settings/group/groups"));
-const GroupInfo = lazy(() => import("./pages/settings/group/group-info"));
-const Spaces = lazy(() => import("@/pages/settings/space/spaces.tsx"));
-const Shares = lazy(() => import("@/pages/settings/shares/shares.tsx"));
+import { useTrackOrigin } from "@/hooks/use-track-origin";
+import SpacesPage from "@/pages/spaces/spaces.tsx";
+import SpaceTrash from "@/pages/space/space-trash.tsx";
+import FavoritesPage from "@/pages/favorites/favorites-page";
+import LabelPage from "@/pages/label/label-page";

 export default function App() {
  useTrackOrigin();

  return (
-    <Suspense
-      fallback={
-        <Center h="100vh">
-          <Loader size="sm" />
-        </Center>
-      }
-    >
+    <>
      <Routes>
        <Route index element={<Navigate to="/home" />} />
        <Route path={"/login"} element={<LoginPage />} />
@@ -117,6 +83,6 @@ export default function App() {

        <Route path="*" element={<Error404 />} />
      </Routes>
-    </Suspense>
+    </>
  );
 }
@@ -1,37 +0,0 @@
-import { describe, it, expect } from "vitest";
-import { isChunkLoadError } from "./chunk-load-error-boundary";
-
-// The detector decides whether a caught render error is a stale-deploy chunk-404
-// (→ auto-reload to fetch the new manifest) vs a genuine app error (→ generic
-// recovery UI, no reload). A false negative on a real chunk failure re-blanks the
-// app; a false positive would auto-reload on an ordinary error. Pin both sides.
-describe("isChunkLoadError", () => {
-  it("detects the ChunkLoadError name", () => {
-    expect(isChunkLoadError({ name: "ChunkLoadError", message: "x" })).toBe(true);
-  });
-
-  it.each([
-    "Failed to fetch dynamically imported module: https://x/assets/index-abc.js",
-    "error loading dynamically imported module",
-    "Importing a module script failed.",
-  ])("detects the dynamic-import failure message %#", (message) => {
-    expect(isChunkLoadError({ name: "TypeError", message })).toBe(true);
-  });
-
-  it("is case-insensitive on the message", () => {
-    expect(
-      isChunkLoadError({ message: "FAILED TO FETCH DYNAMICALLY IMPORTED MODULE" }),
-    ).toBe(true);
-  });
-
-  it.each([
-    null,
-    undefined,
-    {},
-    { name: "TypeError", message: "Cannot read properties of undefined" },
-    { message: "Network request failed" },
-    new Error("some ordinary render error"),
-  ])("returns false for a non-chunk error %#", (err) => {
-    expect(isChunkLoadError(err)).toBe(false);
-  });
-});
@@ -1,71 +0,0 @@
-import { ReactNode } from "react";
-import { ErrorBoundary } from "react-error-boundary";
-import { Button, Center, Stack, Text } from "@mantine/core";
-
-const RELOAD_FLAG = "chunk-reload-attempted";
-
-// Heuristic detection of a failed dynamic import. Since the code-splitting work,
-// every route (plus Aside / AiChatWindow) is React.lazy: when a new deploy
-// replaces the hashed chunks, a tab left open on the old index.html requests a
-// chunk URL that now 404s, and React.lazy rejects. Browsers / Vite surface these
-// with a ChunkLoadError name or one of these messages.
-export function isChunkLoadError(error: unknown): boolean {
-  if (!error) return false;
-  const name = (error as { name?: string }).name ?? "";
-  const message = (error as { message?: string }).message ?? "";
-  return (
-    name === "ChunkLoadError" ||
-    /Failed to fetch dynamically imported module/i.test(message) ||
-    /error loading dynamically imported module/i.test(message) ||
-    /Importing a module script failed/i.test(message)
-  );
-}
-
-function handleError(error: unknown) {
-  if (!isChunkLoadError(error)) return;
-  // A stale-chunk 404 is cured by a full reload that re-fetches index.html and
-  // the new chunk manifest. Auto-reload once, guarding against a reload loop
-  // (e.g. a genuinely missing chunk) with a one-shot sessionStorage flag. If the
-  // flag is already set we fall through to the manual recovery UI below.
-  try {
-    if (sessionStorage.getItem(RELOAD_FLAG)) return;
-    sessionStorage.setItem(RELOAD_FLAG, "1");
-  } catch {
-    // sessionStorage unavailable (private mode / disabled): skip the automatic
-    // reload rather than risk an unguarded loop; the fallback UI still recovers.
-    return;
-  }
-  window.location.reload();
-}
-
-// Root-level boundary that sits ABOVE every route-level Suspense boundary so a
-// lazy route/component chunk failure is caught here instead of unmounting the
-// whole tree into a blank white screen. Per-feature ErrorBoundaries (page.tsx,
-// transclusion, page-embed) remain in place underneath for their local errors.
-export function ChunkLoadErrorBoundary({ children }: { children: ReactNode }) {
-  return (
-    <ErrorBoundary
-      onError={handleError}
-      fallbackRender={({ error }) => {
-        const chunk = isChunkLoadError(error);
-        return (
-          <Center h="100vh" p="md">
-            <Stack align="center" gap="sm" maw={420}>
-              <Text fw={600}>
-                {chunk ? "A new version is available" : "Something went wrong"}
-              </Text>
-              <Text size="sm" c="dimmed" ta="center">
-                {chunk
-                  ? "Please reload the page to load the latest version."
-                  : "An unexpected error occurred. Reloading the page may help."}
-              </Text>
-              <Button onClick={() => window.location.reload()}>Reload</Button>
-            </Stack>
-          </Center>
-        );
-      }}
-    >
-      {children}
-    </ErrorBoundary>
-  );
-}
@@ -1,10 +1,9 @@
 import { AppShell, Container } from "@mantine/core";
-import React, { Suspense, useEffect, useRef, useState } from "react";
+import React, { useEffect, useRef, useState } from "react";
 import { useLocation } from "react-router-dom";
 import { useTranslation } from "react-i18next";
 import SettingsSidebar from "@/components/settings/settings-sidebar.tsx";
-import { useAtom, useAtomValue } from "jotai";
-import { aiChatWindowOpenAtom } from "@/features/ai-chat/atoms/ai-chat-atom.ts";
+import { useAtom } from "jotai";
 import {
  APP_NAVBAR_ID,
  NAVBAR_COLLAPSE_BREAKPOINT,
@@ -15,6 +14,8 @@ import {
 } from "@/components/layouts/global/hooks/atoms/sidebar-atom.ts";
 import { SpaceSidebar } from "@/features/space/components/sidebar/space-sidebar.tsx";
 import { AppHeader } from "@/components/layouts/global/app-header.tsx";
+import Aside from "@/components/layouts/global/aside.tsx";
+import AiChatWindow from "@/features/ai-chat/components/ai-chat-window.tsx";
 import GitmostGlobalBridge from "@/features/editor/gitmost/gitmost-global-bridge.tsx";
 import classes from "./app-shell.module.css";
 import { useToggleSidebar } from "@/components/layouts/global/hooks/hooks/use-toggle-sidebar.ts";
@@ -22,21 +23,6 @@ import GlobalSidebar from "@/components/layouts/global/global-sidebar.tsx";
 import { ASIDE_PANEL_ID } from "@/hooks/use-toggle-aside.tsx";
 import { MAIN_CONTENT_ID, SkipToMain } from "@/components/ui/skip-to-main.tsx";

-// Lazily load the AI chat window so the AI SDK runtime it pulls in is fetched
-// only after the user first opens the chat, instead of for every authenticated
-// user on load. The window itself renders null while closed, so there is no
-// behavior difference — it simply is not mounted until first opened.
-const AiChatWindow = React.lazy(
-  () => import("@/features/ai-chat/components/ai-chat-window.tsx"),
-);
-
-// The right aside hosts the comment panel and table of contents, both of which
-// pull in TipTap. It only ever renders on page routes, so lazy-loading it keeps
-// the whole editor engine out of the eager global-shell startup graph.
-const Aside = React.lazy(
-  () => import("@/components/layouts/global/aside.tsx"),
-);
-
 export default function GlobalAppShell({
  children,
 }: {
@@ -51,15 +37,6 @@ export default function GlobalAppShell({
  const [isResizing, setIsResizing] = useState(false);
  const sidebarRef = useRef(null);

-  // Latch: once the AI chat window has been opened, keep it mounted so an
-  // in-flight stream is never torn down. Before the first open the AI chat chunk
-  // is never fetched.
-  const aiChatOpen = useAtomValue(aiChatWindowOpenAtom);
-  const [aiChatEverOpened, setAiChatEverOpened] = useState(false);
-  useEffect(() => {
-    if (aiChatOpen) setAiChatEverOpened(true);
-  }, [aiChatOpen]);
-
  const startResizing = React.useCallback((mouseDownEvent) => {
    mouseDownEvent.preventDefault();
    setIsResizing(true);
@@ -183,21 +160,13 @@ export default function GlobalAppShell({
                  : undefined
          }
        >
-          <Suspense fallback={null}>
-            <Aside />
-          </Suspense>
+          <Aside />
        </AppShell.Aside>
      )}
    </AppShell>
-    {/* Floating AI chat window. Mounted once globally on first open; it is
-        position: fixed and self-hides when closed, so its place in the tree is
-        not critical. Kept mounted after the first open so a live stream is not
-        aborted. */}
-    {aiChatEverOpened && (
-      <Suspense fallback={null}>
-        <AiChatWindow />
-      </Suspense>
-    )}
+    {/* Floating AI chat window. Mounted once globally; it is position: fixed
+        and self-hides when closed, so its place in the tree is not critical. */}
+    <AiChatWindow />
      {/* Global gitmost native bridge: registers listSpaces / listPages /
          createPageWithRecording on window.gitmost so the native host can
          create a page with a recording even when no page editor is open. */}
@@ -1,7 +1,5 @@
-import { Suspense, useEffect } from "react";
 import { UserProvider } from "@/features/user/user-provider.tsx";
 import { Outlet, useParams } from "react-router-dom";
-import { Center, Loader } from "@mantine/core";
 import GlobalAppShell from "@/components/layouts/global/global-app-shell.tsx";
 import { SearchSpotlight } from "@/features/search/components/search-spotlight.tsx";
 import { useGetSpaceBySlugQuery } from "@/features/space/queries/space-query.ts";
@@ -10,39 +8,10 @@ export default function Layout() {
  const { spaceSlug } = useParams();
  const { data: space } = useGetSpaceBySlugQuery(spaceSlug);

-  // Warm the (now route-split) editor chunk during idle time on authenticated
-  // routes, so the first navigation to a page renders from cache instead of a
-  // cold chunk fetch. Best-effort: gated on requestIdleCallback and never blocks
-  // startup — the dynamic import mirrors the App.tsx route lazy loader so both
-  // resolve to the same chunk.
-  useEffect(() => {
-    const ric =
-      typeof window !== "undefined" && (window as any).requestIdleCallback;
-    const warm = () => {
-      // Best-effort prefetch: a failed warm-up (offline, stale 404) is harmless
-      // and must not surface as an unhandledrejection.
-      void import("@/pages/page/page").catch(() => {});
-    };
-    if (ric) {
-      const id = ric(warm);
-      return () => (window as any).cancelIdleCallback?.(id);
-    }
-    const timer = setTimeout(warm, 2000);
-    return () => clearTimeout(timer);
-  }, []);
-
  return (
    <UserProvider>
      <GlobalAppShell>
-        <Suspense
-          fallback={
-            <Center h="60vh">
-              <Loader size="sm" />
-            </Center>
-          }
-        >
-          <Outlet />
-        </Suspense>
+        <Outlet />
      </GlobalAppShell>
      <SearchSpotlight spaceId={space?.id} />
    </UserProvider>
@@ -19,7 +19,7 @@ import {
  IconPlus,
  IconX,
 } from "@tabler/icons-react";
-import { useAtom, useSetAtom } from "jotai";
+import { useAtom, useAtomValue, useSetAtom } from "jotai";
 import { useLocation, useMatch } from "react-router-dom";
 import { useTranslation } from "react-i18next";
 import { useQueryClient } from "@tanstack/react-query";
@@ -41,13 +41,24 @@ import { extractPageSlugId } from "@/lib";
 import {
  AI_CHATS_RQ_KEY,
  AI_CHAT_MESSAGES_RQ_KEY,
+  AI_CHAT_RUN_RQ_KEY,
  useAiChatMessagesQuery,
+  useAiChatRunQuery,
  useAiChatsQuery,
  useAiRolesQuery,
 } from "@/features/ai-chat/queries/ai-chat-query.ts";
+import {
+  shouldClearLatchOnQueryError,
+  shouldClearStoppingLatch,
+  shouldObserveRun,
+} from "@/features/ai-chat/utils/run-polling.ts";
+import { workspaceAtom } from "@/features/user/atoms/current-user-atom";
 import ConversationList from "@/features/ai-chat/components/conversation-list.tsx";
 import ChatThread from "@/features/ai-chat/components/chat-thread.tsx";
-import { exportAiChat } from "@/features/ai-chat/services/ai-chat-service.ts";
+import {
+  exportAiChat,
+  stopRun,
+} from "@/features/ai-chat/services/ai-chat-service.ts";
 import { useChatSession } from "@/features/ai-chat/hooks/use-chat-session.ts";
 import {
  shouldCollapseOnOutsidePointer,
@@ -234,6 +245,147 @@ export default function AiChatWindow() {
  const { data: messageRows, isLoading: messagesLoading } =
    useAiChatMessagesQuery(activeChatId ?? undefined);

+  // #184 reconnect-and-live-follow. Whether detached agent runs are enabled for
+  // this workspace. The reconnect endpoint itself is NOT flag-gated server-side
+  // (it is only owner-gated and returns `{ run: null }` when the chat has no
+  // run); but when the feature is off no runs are ever created, so polling it
+  // would always come back empty — we gate it off here to avoid pointless polls.
+  const workspace = useAtomValue(workspaceAtom);
+  const autonomousRunsEnabled =
+    workspace?.settings?.ai?.autonomousRuns === true;
+
+  // Whether THIS tab is the one actively streaming the open chat's run locally
+  // (it started the run here and holds the SSE). Reported up from ChatThread. We
+  // are the STREAMER while true and a passive OBSERVER while false — the basis of
+  // the observer-vs-streamer detection. Reset to false by the fresh ChatThread's
+  // mount effect on every chat switch.
+  const [localStreaming, setLocalStreaming] = useState(false);
+  const onStreamingChange = useCallback((streaming: boolean) => {
+    setLocalStreaming(streaming);
+  }, []);
+
+  // #184 Stop wiring. While a detached run is being stopped we SUPPRESS the
+  // observer merge so the stopping run's still-persisting output does not
+  // re-stream back into view between the moment the user pressed Stop and the run
+  // actually settling as 'aborted' server-side. Polling itself keeps running (so
+  // the terminal transition is still detected) — only the visual merge is gated.
+  // Cleared when the run is observed terminal (below) or the chat is switched.
+  const [stoppingRun, setStoppingRun] = useState(false);
+  // Reset the stopping latch whenever the open chat changes: it is scoped to the
+  // run of the previously-open chat.
+  useEffect(() => {
+    setStoppingRun(false);
+  }, [activeChatId]);
+
+  // Authoritative stop of the open chat's detached run (the Stop button in
+  // autonomous mode). Latch "stopping" first (suppresses the re-stream flash),
+  // then request the server stop — the ONLY thing that ends a detached run; a mere
+  // local SSE abort is a client disconnect the server ignores. On failure we
+  // release the latch so the observer resumes (better to show the live run than to
+  // freeze the view) and surface the error.
+  const handleServerStop = useCallback(
+    (chatId: string): void => {
+      setStoppingRun(true);
+      // #234 F4: drop the PREVIOUS turn's run from the cache so `run` becomes null
+      // until the CURRENT turn's run is fetched fresh. Without this, once the local
+      // stream aborts (localStreaming -> false) the run query re-enables and
+      // react-query SYNCHRONOUSLY returns the still-cached prior terminal run; the
+      // terminal effect would then clear the stopping latch against that STALE run
+      // before the current turn's (still-running, detached, growing) run is ever
+      // observed — re-opening the observer merge and flashing the growing output
+      // over the frozen row. With the cache cleared the terminal effect's
+      // `if (!run) return` holds the latch until the current run itself is observed
+      // terminal (see shouldClearStoppingLatch).
+      queryClient.removeQueries({ queryKey: AI_CHAT_RUN_RQ_KEY(chatId) });
+      void stopRun(chatId).catch(() => {
+        setStoppingRun(false);
+        notifications.show({
+          message: t("Failed to stop the run"),
+          color: "red",
+        });
+      });
+    },
+    [t, queryClient],
+  );
+
+  // Poll the latest run of the open chat ONLY when we are a passive observer:
+  // feature on, a chat is open, and we are NOT the local streamer (the streamer
+  // already has the live SSE — polling/merging too would double-render). The
+  // query's own status-keyed refetchInterval stops once the run is terminal.
+  const { data: runData, isError: runQueryFailed } = useAiChatRunQuery(
+    activeChatId ?? undefined,
+    autonomousRunsEnabled && !localStreaming,
+  );
+  const run = runData?.run ?? null;
+
+  // Safety net (#234 F4 review): after handleServerStop clears the run cache,
+  // `run` is null until the current turn's run is fetched fresh, and the terminal
+  // effect below holds the latch via `if (!run) return`. If that refetch instead
+  // ERRORS PERMANENTLY (the run fetch keeps failing) while we are no longer the
+  // streamer, the run stays null, its status-keyed refetchInterval is off, and
+  // nothing would ever observe a terminal run — freezing the view with the
+  // observer merge suppressed. Release the latch on that error so the live view
+  // resumes rather than stays stuck (the local stopRun may already have succeeded
+  // independently).
+  //
+  // #234 F7: this must NOT fire on a TRANSIENT error while `run` is still an
+  // ACTIVE held run. In TanStack Query v5 (retry:false) the query's `data` is
+  // RETAINED on error, so `runQueryFailed` can be true while `run` is still
+  // pending/running — releasing then would re-open the observer merge and flash
+  // the growing detached run over the frozen row (the very flash F4 prevents). The
+  // decision is the pure, unit-tested `shouldClearLatchOnQueryError`, which gates
+  // on the run NOT being active: it cures only the genuine permanent-null-freeze
+  // (`run === null`) and never releases against an active run.
+  useEffect(() => {
+    if (
+      shouldClearLatchOnQueryError({
+        stoppingRun,
+        isLocalStreaming: localStreaming,
+        runQueryFailed,
+        run,
+      })
+    )
+      setStoppingRun(false);
+  }, [stoppingRun, localStreaming, runQueryFailed, run]);
+  // The run's incrementally-persisted assistant message to merge into the thread,
+  // but only while we are an observer (never when we are the streamer — guards
+  // against a stale poll fighting the live stream). Includes a terminal run so the
+  // final persisted output is shown on reopen.
+  const observedRow =
+    shouldObserveRun(run, localStreaming) && !stoppingRun
+      ? (runData?.message ?? null)
+      : null;
+
+  // When the observed run reaches a terminal status, do a final messages refetch
+  // so the persisted final state (token/context badge, export source) is shown;
+  // by then the query's refetchInterval has already stopped polling. Deduped per run
+  // id so it fires exactly once per run, not on every subsequent poll-less render.
+  const finalizedRunIdRef = useRef<string | null>(null);
+  useEffect(() => {
+    if (!run || !activeChatId) return;
+    if (run.status === "pending" || run.status === "running") {
+      // Active again (a new run) — re-arm so its terminal transition fires once.
+      finalizedRunIdRef.current = null;
+      return;
+    }
+    // Terminal: a stop we requested has landed (or the run finished on its own),
+    // so release the stopping latch — the observer merge can now show the final
+    // persisted (aborted/finished) output without any live re-stream. The decision
+    // is the pure, unit-tested `shouldClearStoppingLatch` (run-polling.ts): release
+    // ONLY when we requested a stop, this tab is no longer the streamer, AND the
+    // CURRENT run is terminal. The #234 F4 cache removal in handleServerStop makes
+    // `run` null (so this effect's `if (!run) return` above holds the latch) until
+    // the current turn's run is fetched fresh, so the latch can never clear against
+    // a stale cached run.
+    if (shouldClearStoppingLatch({ stoppingRun, run, isLocalStreaming: localStreaming }))
+      setStoppingRun(false);
+    if (finalizedRunIdRef.current === run.id) return;
+    finalizedRunIdRef.current = run.id;
+    queryClient.invalidateQueries({
+      queryKey: AI_CHAT_MESSAGES_RQ_KEY(activeChatId),
+    });
+  }, [run, activeChatId, queryClient, stoppingRun, localStreaming]);
+
  // The page the user is currently viewing. AiChatWindow lives in a pathless
  // parent layout route, so useParams() can't see :pageSlug. Match the full
  // pathname against the authenticated page route instead so "the current page"
@@ -882,6 +1034,18 @@ export default function AiChatWindow() {
              assistantName={currentRole?.name}
              onTurnFinished={onTurnFinished}
              onServerChatId={onServerChatId}
+              // #184: live-follow a still-running run when we reopened the chat as
+              // a passive observer; null when there is nothing to observe or this
+              // tab is the streamer. onStreamingChange lets the window stop polling
+              // while we are the streamer.
+              observedRow={observedRow}
+              onStreamingChange={onStreamingChange}
+              // #184: in autonomous mode the Stop button must hit the authoritative
+              // server stop (a local SSE abort is a client disconnect the server
+              // ignores). onServerStop also arms the "stopping" latch above so the
+              // stopped run's output does not re-stream via the observer merge.
+              autonomousRunsEnabled={autonomousRunsEnabled}
+              onServerStop={handleServerStop}
            />
          )}
        </div>
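The latch and observer decisions described in the comments above live in `run-polling.ts`, which this hunk does not show. The following is a minimal sketch of what those pure predicates might look like, assuming only the rules stated in the comments: observe when a run exists and this tab is not the streamer; clear the stop latch only against a terminal current run; on a query error, never release against an active run. The `RunLike` type and the function bodies are assumptions, not the repository's code.

```typescript
// Hypothetical reduced run shape; the real type is IAiChatRun.
type RunLike = { id: string; status: string } | null;

// 'pending' and 'running' are active; terminal, unknown, and null are not.
const isActive = (run: RunLike): boolean =>
  run?.status === "pending" || run?.status === "running";

// Observe any existing run (active or terminal) unless this tab is the streamer.
function shouldObserveRun(run: RunLike, isLocalStreaming: boolean): boolean {
  return run !== null && !isLocalStreaming;
}

// Release the latch only when a stop was requested, this tab no longer streams,
// and the CURRENT run exists and is terminal.
function shouldClearStoppingLatch(a: {
  stoppingRun: boolean;
  run: RunLike;
  isLocalStreaming: boolean;
}): boolean {
  return a.stoppingRun && !a.isLocalStreaming && a.run !== null && !isActive(a.run);
}

// On a failed run query, release only if no active run is being held
// (the permanent-null freeze), never against a retained active run.
function shouldClearLatchOnQueryError(a: {
  stoppingRun: boolean;
  isLocalStreaming: boolean;
  runQueryFailed: boolean;
  run: RunLike;
}): boolean {
  return a.stoppingRun && !a.isLocalStreaming && a.runQueryFailed && !isActive(a.run);
}
```

Keeping these as pure functions of plain values is what lets the effects in `AiChatWindow` stay thin and the decisions be unit-tested without rendering.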
@@ -11,6 +11,7 @@ const h = vi.hoisted(() => ({
    onFinish: null as null | ((arg: Record<string, unknown>) => void),
    sendMessage: vi.fn(),
    stop: vi.fn(),
+    setMessages: vi.fn(),
    transport: null as null | {
      prepareSendMessagesRequest: (arg: {
        messages: unknown[];
@@ -30,6 +31,8 @@ vi.mock("@ai-sdk/react", () => ({
      status: h.state.status,
      stop: h.state.stop,
      error: null,
+      // #184: ChatThread reads setMessages to merge a polled observer run.
+      setMessages: h.state.setMessages,
    };
  },
 }));
@@ -228,3 +231,56 @@ describe("ChatThread — turn-end decision (onFinish)", () => {
    }
  });
 });
+
+// #184 passive-observer merge: when reconnecting to a still-running run, the
+// parent feeds the polled run message via `observedRow`; ChatThread merges it via
+// setMessages — but ONLY when this tab is NOT itself streaming (the streamer's
+// SSE owns the view, so a stale observedRow must never overwrite it).
+describe("ChatThread — observer run merge (#184)", () => {
+  beforeEach(() => {
+    h.state.onFinish = null;
+    h.state.setMessages.mockReset();
+  });
+
+  const observedRow = {
+    id: "a-run",
+    role: "assistant",
+    content: "step 1\nstep 2",
+    metadata: {
+      parts: [{ type: "text", text: "step 1\nstep 2" }],
+    },
+    createdAt: "2026-01-01T00:00:00Z",
+  } as const;
+
+  function renderObserver(status: string) {
+    h.state.status = status;
+    render(
+      <MantineProvider>
+        <ChatThread
+          chatId="c1"
+          initialRows={[]}
+          onTurnFinished={vi.fn()}
+          observedRow={observedRow as never}
+        />
+      </MantineProvider>,
+    );
+  }
+
+  it("merges the polled run message when this tab is a passive observer", () => {
+    renderObserver("ready");
+    expect(h.state.setMessages).toHaveBeenCalledTimes(1);
+    // The updater replaces or appends the observed assistant row by id.
+    const updater = h.state.setMessages.mock.calls[0][0] as (
+      prev: { id: string; parts: { text: string }[] }[],
+    ) => { id: string; parts: { text: string }[] }[];
+    const merged = updater([{ id: "u1", parts: [{ text: "hi" }] }]);
+    expect(merged).toHaveLength(2);
+    expect(merged[1].id).toBe("a-run");
+    expect(merged[1].parts[0].text).toBe("step 1\nstep 2");
+  });
+
+  it("does NOT merge while THIS tab is the streamer (no double-render)", () => {
+    renderObserver("streaming");
+    expect(h.state.setMessages).not.toHaveBeenCalled();
+  });
+});
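The test above pins the observable contract of the merge updater: an observed row whose id is absent is appended, and the row is matched by id. A minimal sketch of an updater with that behavior follows. The body of the real `mergeObservedMessage` is not part of this diff, so `MessageLike` and the implementation here are assumptions for illustration only.

```typescript
// Hypothetical reduced message shape; the real code uses UIMessage from @ai-sdk/react.
interface MessageLike {
  id: string;
  role: string;
  parts: { type: string; text: string }[];
}

// Replace the message with the same id in place, or append when it is new.
// Returns a new array so React state updates see a changed reference.
function mergeObservedMessage(
  prev: MessageLike[],
  observed: MessageLike,
): MessageLike[] {
  const index = prev.findIndex((m) => m.id === observed.id);
  if (index === -1) return [...prev, observed];
  const next = prev.slice();
  next[index] = observed;
  return next;
}
```

Replacing in place keeps the row's position stable across polls, so a growing assistant message does not jump to the end of the thread on every update.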
@@ -24,6 +24,7 @@ import {
 } from "@/features/ai-chat/utils/role-launch.ts";
 import { describeChatError } from "@/features/ai-chat/utils/error-message.ts";
 import { extractServerChatId } from "@/features/ai-chat/utils/adopt-chat-id.ts";
+import { mergeObservedMessage } from "@/features/ai-chat/utils/run-polling.ts";
 import {
  dequeue,
  enqueueMessage,
@@ -86,6 +87,29 @@ interface ChatThreadProps {
   *  Copy/export button available mid-stream). Distinct from onTurnFinished,
   *  which fires only at the terminal outcome. */
  onServerChatId?: (serverChatId?: string) => void;
+  /** #184 reconnect-and-live-follow. When THIS tab reopened a chat whose agent
+   *  run is still going (it is a PASSIVE OBSERVER — it did not start the run here),
+   *  the parent polls the reconnect endpoint and feeds the run's incrementally-
+   *  persisted assistant message here; we merge it into the live list so new
+   *  steps/tool-calls appear as they are persisted. Null when there is nothing to
+   *  observe (no run, feature off, or this tab IS the streamer). The merge is
+   *  ADDITIONALLY guarded by our own `isStreaming`, so a stale value can never
+   *  fight the local stream when we are the streamer. */
+  observedRow?: IAiChatMessageRow | null;
+  /** Report this tab's live streaming status up to the parent, so it can stop
+   *  polling the run while WE are the active streamer (the SSE owns the view) and
+   *  resume once we go idle. Called from an effect on every transition. */
+  onStreamingChange?: (streaming: boolean) => void;
+  /** #184: whether detached/autonomous agent runs are enabled for this workspace.
+   *  When true the Stop button must additionally hit the AUTHORITATIVE server stop
+   *  (via onServerStop) — aborting only the local SSE is just a client disconnect,
+   *  which the server deliberately ignores, so the detached run would keep going. */
+  autonomousRunsEnabled?: boolean;
+  /** #184: request the server-side stop of this chat's active run (the parent owns
+   *  the endpoint call + the "stopping" latch that keeps observer-polling from
+   *  immediately re-streaming the stopping run's output). Called with the resolved
+   *  chat id when the user presses Stop in autonomous mode. */
+  onServerStop?: (chatId: string) => void;
 }

 /**
@@ -131,6 +155,10 @@ export default function ChatThread({
  assistantName,
  onTurnFinished,
  onServerChatId,
+  observedRow,
+  onStreamingChange,
+  autonomousRunsEnabled,
+  onServerStop,
 }: ChatThreadProps) {
  const { t } = useTranslation();

@@ -216,6 +244,16 @@ export default function ChatThread({
  const flushOnAbortRef = useRef(false);
  const interruptNextSendRef = useRef(false);

+  // #234 F5: the user pressed Stop while streaming a BRAND-NEW chat whose server
+  // chat id has not been adopted yet (the `start` chunk carrying it hadn't landed
+  // when Stop was pressed). A local SSE abort alone does NOT stop the DETACHED
+  // autonomous run — it keeps burning tokens and WRITING TO PAGES — so we cannot
+  // just no-op. We latch the stop as PENDING and fire the authoritative server
+  // stop the moment onServerChatId adopts the id (below). Read-and-cleared there;
+  // also defused on every new turn start so it can never fire against a later,
+  // unrelated turn's run.
+  const stopPendingRef = useRef(false);
+
  // FIFO dequeue + send the next queued message (no-op when the queue is empty).
  // Returns whether a message was actually sent, so callers can tell an empty
  // dequeue (nothing to flush) from a real send.
@@ -274,7 +312,7 @@ export default function ChatThread({
    [],
  );

-  const { messages, sendMessage, status, stop, error } = useChat({
+  const { messages, sendMessage, status, stop, error, setMessages } = useChat({
    // Stable per-mount key. Existing chats use their real id; new chats use a
    // generated client id (never `undefined`) so the store is NOT re-created on
    // every render mid-stream (see `chatStoreId` above).
@@ -365,7 +403,14 @@ export default function ChatThread({
      return;
    lastForwardedChatIdRef.current = serverChatId;
    onServerChatId(serverChatId);
-  }, [messages, onServerChatId]);
+    // #234 F5: if Stop was pressed before the id was known, the authoritative
+    // server stop was deferred to this adoption point — fire it now with the
+    // just-adopted id. One-shot (read-and-clear) so it can't fire twice.
+    if (stopPendingRef.current) {
+      stopPendingRef.current = false;
+      onServerStop?.(serverChatId);
+    }
+  }, [messages, onServerChatId, onServerStop]);

  // Live "turn was interrupted" marker for the CURRENT session. The red error
  // banner (driven by `error`) covers the error case; this covers an aborted
@@ -378,6 +423,27 @@ export default function ChatThread({

  const isStreaming = status === "submitted" || status === "streaming";

+  // #184: report our live streaming status up so the parent stops polling the run
+  // while WE are the streamer (the SSE owns the view) and resumes once we go idle.
+  // Effect (not render) so it never updates parent state during our own render;
+  // fires on mount with `false`, which also re-syncs the parent after a chat
+  // switch remounts this thread (a fresh mount is idle until the user sends).
+  useEffect(() => {
+    onStreamingChange?.(isStreaming);
+  }, [isStreaming, onStreamingChange]);
+
+  // #184 passive-observer merge: when the parent feeds a polled run message (we
+  // reopened a chat whose run is still going and did NOT start it here), merge it
+  // into the live list so new steps/tool-calls appear as they are persisted. Hard-
+  // gated by `!isStreaming`: if THIS tab is actually the streamer, the local SSE
+  // owns the view and a stale observedRow must never overwrite it. `observedRow`
+  // is a stable per-poll object, so this runs once per poll, not per render.
+  useEffect(() => {
+    if (isStreaming || !observedRow) return;
+    const observed = rowToUiMessage(observedRow);
+    setMessages((prev) => mergeObservedMessage(prev, observed));
+  }, [observedRow, isStreaming, setMessages]);
+
  // "Send now" on a queued message: interrupt the current turn and immediately
  // send THIS message, keeping the agent's partial output. Other queued messages
  // stay queued and flush normally after the new turn. Reuses the existing
@@ -409,6 +475,40 @@ export default function ChatThread({
    [setQueue, stop],
  );

+  // Stop the current turn. ALWAYS abort the local SSE (`stop()`) so the composer
+  // returns to idle immediately. In AUTONOMOUS mode the turn is a DETACHED run:
+  // aborting the local SSE is only a client disconnect, which the server ignores,
+  // so the run would keep executing — we ADDITIONALLY request the authoritative
+  // server-side stop (the parent owns that call + the "stopping" latch that keeps
+  // observer-polling from re-streaming the stopping run's output). The chat id is
+  // read live from chatIdRef (adopted early at the stream's `start` chunk); if it
+  // is not known yet (a brand-new chat in the first moment of its first turn), the
+  // server stop is latched as pending and fires once the id is adopted (#234 F5).
+  const handleStop = useCallback(() => {
+    stop();
+    if (!autonomousRunsEnabled) return;
+    if (chatIdRef.current) {
+      onServerStop?.(chatIdRef.current);
+    } else {
+      // #234 F5: no chat id yet (brand-new chat in the first moment of its first
+      // turn, before the `start` chunk adopted the id). Latch the stop as pending;
+      // the onServerChatId adoption effect fires the deferred server stop as soon
+      // as the id appears, so the detached run is still authoritatively stopped
+      // instead of left running by a silent local-only abort.
+      //
+      // KNOWN LIMITATION (#234 F5 review): `stop()` above has already aborted the
+      // local SSE reader. In the rare sub-window where Stop is pressed while still
+      // `submitted` (request sent, not one chunk read yet), that abort can cancel
+      // the reader BEFORE the `start` chunk is applied to `messages`, so the
+      // adoption effect never runs and this pending stop never fires. The detached
+      // run then keeps going for that turn. This is not a regression (the pre-fix
+      // behavior sent no server stop at all); closing it fully would require
+      // deferring the local abort until adoption, which is riskier and out of scope
+      // for this fix. Documented so a future change can address the abort-ordering.
+      stopPendingRef.current = true;
+    }
+  }, [stop, autonomousRunsEnabled, onServerStop]);
+
  // Clear the stopped marker as soon as a new turn begins streaming, and drop any
  // stale "Send now" interrupt flags. On the legit interrupt path both refs are
  // already consumed synchronously (onFinish + prepareSendMessagesRequest) before
@@ -420,6 +520,11 @@ export default function ChatThread({
      setStopNotice(null);
      flushOnAbortRef.current = false;
      interruptNextSendRef.current = false;
+      // #234 F5: a new turn is starting — drop any pending deferred-stop from a
+      // previous turn that never adopted an id, so it can never fire against this
+      // (or a later) unrelated turn's run. A deferred stop for the CURRENT turn is
+      // set AFTER this effect (on the Stop click), so this does not clobber it.
+      stopPendingRef.current = false;
    }
  }, [isStreaming]);

@@ -539,7 +644,7 @@ export default function ChatThread({
        <ChatInput
          onSend={(text) => sendMessage({ text })}
          onQueue={enqueue}
-          onStop={stop}
+          onStop={handleStop}
          isStreaming={isStreaming}
        />
      </Stack>
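The deferred stop in this file (#234 F5) is a one-shot latch with three transitions: Stop pressed, chat id adopted, and new turn started. The sketch below restates that state machine outside React so the read-and-clear behavior is easy to check. `DeferredServerStop` and its method names are hypothetical; the component itself uses `stopPendingRef` and effects.

```typescript
// Hypothetical framework-free model of the stopPendingRef latch.
class DeferredServerStop {
  private pending = false;
  private readonly requestStop: (chatId: string) => void;

  constructor(requestStop: (chatId: string) => void) {
    this.requestStop = requestStop;
  }

  // Stop pressed: fire immediately when the id is known, otherwise latch.
  stop(chatId: string | null): void {
    if (chatId) this.requestStop(chatId);
    else this.pending = true;
  }

  // Server chat id adopted: fire the deferred stop once, then clear the latch.
  adopt(chatId: string): void {
    if (!this.pending) return;
    this.pending = false;
    this.requestStop(chatId);
  }

  // New turn started: defuse a stale latch so it cannot hit an unrelated run.
  reset(): void {
    this.pending = false;
  }
}
```

The known limitation documented in `handleStop` still applies to this model: if the id is never adopted, `adopt` is never called and the deferred stop never fires.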
@@ -12,6 +12,7 @@ import {
  deleteAiChat,
  deleteAiRole,
  getAiChatMessages,
+  getAiChatRun,
  getAiChats,
  getAiRoleCatalog,
  getAiRoleCatalogBundle,
@@ -24,6 +25,7 @@ import {
 import {
  IAiChat,
  IAiChatMessageRow,
+  IAiChatRunResponse,
  IAiRole,
  IAiRoleCatalog,
  IAiRoleCatalogBundle,
@@ -34,6 +36,7 @@ import {
  IAiRoleUpdateFromCatalogResult,
 } from "@/features/ai-chat/types/ai-chat.types.ts";
 import { IPagination } from "@/lib/types.ts";
+import { runPollInterval } from "@/features/ai-chat/utils/run-polling.ts";

 export const AI_CHATS_RQ_KEY = ["ai-chats"];
 export const AI_ROLES_RQ_KEY = ["ai-roles"];
@@ -51,16 +54,18 @@ export const AI_CHAT_MESSAGES_RQ_KEY = (chatId: string) => [
  "ai-chat-messages",
  chatId,
 ];
+export const AI_CHAT_RUN_RQ_KEY = (chatId: string) => ["ai-chat-run", chatId];

 /** Paginated list of the current user's chats (auto-loads further pages). */
 export function useAiChatsQuery() {
  const query = useInfiniteQuery({
    queryKey: AI_CHATS_RQ_KEY,
-    queryFn: ({ pageParam }) =>
-      getAiChats({ cursor: pageParam, limit: 50 }),
+    queryFn: ({ pageParam }) => getAiChats({ cursor: pageParam, limit: 50 }),
    initialPageParam: undefined as string | undefined,
    getNextPageParam: (lastPage) =>
-      lastPage.meta.hasNextPage ? (lastPage.meta.nextCursor ?? undefined) : undefined,
+      lastPage.meta.hasNextPage
+        ? (lastPage.meta.nextCursor ?? undefined)
+        : undefined,
  });

  const data = useMemo<IPagination<IAiChat> | undefined>(() => {
@@ -90,7 +95,9 @@ export function useAiChatMessagesQuery(chatId: string | undefined) {
      getAiChatMessages({ chatId: chatId as string, cursor: pageParam }),
    initialPageParam: undefined as string | undefined,
    getNextPageParam: (lastPage) =>
-      lastPage.meta.hasNextPage ? (lastPage.meta.nextCursor ?? undefined) : undefined,
+      lastPage.meta.hasNextPage
+        ? (lastPage.meta.nextCursor ?? undefined)
+        : undefined,
    enabled: !!chatId,
  });

@@ -131,6 +138,34 @@ export function useAiChatMessagesQuery(chatId: string | undefined) {
  };
 }

+/**
+ * Reconnect to a chat's latest agent run and LIVE-FOLLOW it (#184). While the run
+ * is active the query re-polls every {@link runPollInterval} ms (driven off the
+ * fetched `run.status`, the same status-keyed refetchInterval pattern as the
+ * embeddings reindex polling); once the run reaches a terminal status — or there
+ * is no run — the interval returns `false` and polling stops on its own. Polling
+ * is thus naturally bounded by the run terminating; no separate timeout cap.
+ *
+ * `enabled` gates the whole thing: callers pass `false` when the autonomous-runs
+ * feature is off (the endpoint is NOT flag-gated server-side, but with the feature
+ * off the chat has no runs, so polling would only ever return `{ run: null }`) OR
+ * when THIS tab is the one actively streaming the run (the live SSE owns the view,
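+ * so we must not also poll/merge). With the global `retry: false`, a failed FIRST
+ * fetch leaves `data` undefined, so refetchInterval(undefined run) returns false;
+ * a failed refetch retains the previous `data` (TanStack Query v5), so polling
+ * just continues at the normal interval. Either way a failed fetch can never
+ * spin a tight loop.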
+ */
+export function useAiChatRunQuery(
+  chatId: string | undefined,
+  enabled: boolean,
+) {
+  return useQuery<IAiChatRunResponse, Error>({
+    queryKey: AI_CHAT_RUN_RQ_KEY(chatId ?? ""),
+    queryFn: () => getAiChatRun(chatId as string),
+    enabled: !!chatId && enabled,
+    refetchInterval: (query) => runPollInterval(query.state.data?.run),
+  });
+}
+
 export function useRenameAiChatMutation() {
  const queryClient = useQueryClient();
  const { t } = useTranslation();
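The status-keyed polling that `useAiChatRunQuery` relies on can be sketched as two small pure helpers. This is an illustration under stated assumptions, not the contents of `run-polling.ts`: the interval value below is a placeholder (the real `RUN_POLL_INTERVAL_MS` is not visible in this diff), and `RunLike` is a reduced stand-in for `IAiChatRun`.

```typescript
// Hypothetical reduced run shape; only `status` matters for the cadence.
type RunLike = { status: string } | null | undefined;

// ASSUMED value for illustration; the real constant is defined in run-polling.ts.
const RUN_POLL_INTERVAL_MS = 2000;

// 'pending' and 'running' are active; terminal, unknown, and missing runs are not.
function isRunActive(run: RunLike): boolean {
  return run?.status === "pending" || run?.status === "running";
}

// Matches the return contract of TanStack Query's refetchInterval callback:
// a number of milliseconds to keep polling, or false to stop.
function runPollInterval(run: RunLike): number | false {
  return isRunActive(run) ? RUN_POLL_INTERVAL_MS : false;
}
```

Because the interval is derived from the last fetched run, polling ends on its own at the first terminal status and needs no separate timeout.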
@@ -280,11 +315,14 @@ export function useImportAiRolesFromCatalogMutation() {
    mutationFn: (payload) => importAiRolesFromCatalog(payload),
    onSuccess: (result) => {
      notifications.show({
-        message: t("Imported {{created}}, renamed {{renamed}}, skipped {{skipped}}", {
-          created: result.created,
-          renamed: result.renamed,
-          skipped: result.skipped,
-        }),
+        message: t(
+          "Imported {{created}}, renamed {{renamed}}, skipped {{skipped}}",
+          {
+            created: result.created,
+            renamed: result.renamed,
+            skipped: result.skipped,
+          },
+        ),
      });
      // Surface partial failures (e.g. unique-name races) as a red warning.
      if (result.errors.length > 0) {
@@ -0,0 +1,92 @@
+import { describe, it, expect, vi, beforeEach } from "vitest";
+import React from "react";
+import { renderHook, waitFor } from "@testing-library/react";
+import { QueryClient, QueryClientProvider } from "@tanstack/react-query";
+import type { IAiChatRunResponse } from "@/features/ai-chat/types/ai-chat.types.ts";
+
+// react-i18next is pulled in transitively by ai-chat-query.ts (the mutation hooks
+// use it); stub it so the module imports cleanly in this hook test.
+vi.mock("react-i18next", () => ({
+  useTranslation: () => ({ t: (key: string) => key }),
+}));
+
+vi.mock("@mantine/notifications", () => ({
+  notifications: { show: vi.fn() },
+}));
+
+// Mock the whole service module; only getAiChatRun is exercised here, but the
+// other named exports must exist so ai-chat-query.ts imports resolve.
+vi.mock("@/features/ai-chat/services/ai-chat-service.ts", () => ({
+  getAiChatRun: vi.fn(),
+  getAiChatMessages: vi.fn(),
+  getAiChats: vi.fn(),
+  getAiRoleCatalog: vi.fn(),
+  getAiRoleCatalogBundle: vi.fn(),
+  getAiRoles: vi.fn(),
+  importAiRolesFromCatalog: vi.fn(),
+  createAiRole: vi.fn(),
+  deleteAiChat: vi.fn(),
+  deleteAiRole: vi.fn(),
+  renameAiChat: vi.fn(),
+  updateAiRole: vi.fn(),
+  updateAiRoleFromCatalog: vi.fn(),
+}));
+
+import { getAiChatRun } from "@/features/ai-chat/services/ai-chat-service.ts";
+import { useAiChatRunQuery } from "@/features/ai-chat/queries/ai-chat-query.ts";
+
+function createWrapper() {
+  const queryClient = new QueryClient({
+    defaultOptions: { queries: { retry: false } },
+  });
+  return function Wrapper({ children }: { children: React.ReactNode }) {
+    return (
+      <QueryClientProvider client={queryClient}>{children}</QueryClientProvider>
+    );
+  };
+}
+
+const runningResponse: IAiChatRunResponse = {
+  run: { id: "run-1", chatId: "c1", status: "running" },
+  message: {
+    id: "a1",
+    role: "assistant",
+    content: "working...",
+    createdAt: "2026-01-01T00:00:00Z",
+  },
+};
+
+describe("useAiChatRunQuery — enable gating", () => {
+  beforeEach(() => {
+    vi.clearAllMocks();
+  });
+
+  it("fetches the run when enabled (passive observer, feature on)", async () => {
+    vi.mocked(getAiChatRun).mockResolvedValue(runningResponse);
+    const { result } = renderHook(() => useAiChatRunQuery("c1", true), {
+      wrapper: createWrapper(),
+    });
+    await waitFor(() => expect(result.current.isSuccess).toBe(true));
+    expect(getAiChatRun).toHaveBeenCalledWith("c1");
+    expect(result.current.data?.run?.status).toBe("running");
+  });
+
+  it("does NOT fetch when disabled (this tab is the streamer / feature off)", async () => {
+    vi.mocked(getAiChatRun).mockResolvedValue(runningResponse);
+    renderHook(() => useAiChatRunQuery("c1", false), {
+      wrapper: createWrapper(),
+    });
+    // Give any errant fetch a chance to fire, then assert none did.
+    await new Promise((r) => setTimeout(r, 20));
+    expect(getAiChatRun).not.toHaveBeenCalled();
+  });
+
+  it("does NOT fetch when there is no chat id", async () => {
+    vi.mocked(getAiChatRun).mockResolvedValue(runningResponse);
+    renderHook(() => useAiChatRunQuery(undefined, true), {
+      wrapper: createWrapper(),
+    });
+    await new Promise((r) => setTimeout(r, 20));
+    expect(getAiChatRun).not.toHaveBeenCalled();
+  });
+});
@@ -5,6 +5,7 @@ import {
  IAiChatListParams,
  IAiChatMessageRow,
  IAiChatMessagesParams,
+  IAiChatRunResponse,
  IAiRole,
  IAiRoleCatalog,
  IAiRoleCatalogBundle,
@@ -42,6 +43,38 @@ export async function getAiChatMessages(
  return req.data;
 }

+/**
+ * Reconnect to the latest agent run of a chat (#184). Returns the run's
+ * persisted lifecycle state and the assistant message it materializes (the
+ * partial output while the run is in-flight, the final output once it finished).
+ * The DB is the source of truth, so this works for an in-flight run (the browser
+ * dropped, the run kept going) and a finished one alike; `{ run: null }` when the
+ * chat has never had a run. Owner-gated server-side (the requesting user must own
+ * the chat); it is NOT flag-gated — when the feature is off the chat simply has no
+ * runs, so the endpoint returns `{ run: null }`.
+ */
+export async function getAiChatRun(
+  chatId: string,
+): Promise<IAiChatRunResponse> {
+  const req = await api.post<IAiChatRunResponse>("/ai-chat/run", { chatId });
+  return req.data;
+}
+
+/**
+ * Explicitly STOP the active agent run of a chat (#184). This is the ONLY thing
+ * that ends a DETACHED run — a mere browser disconnect (aborting the local SSE)
+ * is deliberately ignored server-side, so the client must call this to actually
+ * stop an autonomous run. Targeted by `chatId` (the server resolves whatever run
+ * is active on it); owner-gated server-side. Returns `{ stopped }` — false when
+ * there was nothing active to stop.
+ */
+export async function stopRun(
+  chatId: string,
+): Promise<{ stopped: boolean }> {
+  const req = await api.post<{ stopped: boolean }>("/ai-chat/stop", { chatId });
+  return req.data;
+}
+
 /**
 * Resolve the chat bound to a document (the current user's most-recent chat
 * created on that page), or null when there is none. Drives auto-open-on-page.
@@ -200,6 +200,38 @@ export interface IAiChatMessageRow {
  createdAt: string;
 }

+/**
+ * A persisted agent-run row (#184), mirroring the `ai_chat_runs` fields the
+ * client reads from `POST /ai-chat/run`. For the reconnect-and-live-follow UX
+ * only `status` (drives the poll cadence) and `id` (dedupes the terminal
+ * messages refetch) are load-bearing; the rest are carried
+ * for display/diagnostics. The DB is the source of truth, so this resolves for an
+ * in-flight run (the browser dropped, the run kept going) and a finished one.
+ */
+export interface IAiChatRun {
+  id: string;
+  chatId: string;
+  // 'pending' | 'running' | 'succeeded' | 'failed' | 'aborted'. The first two are
+  // ACTIVE (keep polling); the rest are TERMINAL (stop polling).
+  status: "pending" | "running" | "succeeded" | "failed" | "aborted" | string;
+  error?: string | null;
+  stepCount?: number;
+  assistantMessageId?: string | null;
+  startedAt?: string | null;
+  finishedAt?: string | null;
+  createdAt?: string;
+  updatedAt?: string;
+}
+
+/**
+ * Response of `POST /ai-chat/run` (#184): the latest run of a chat and the
+ * assistant message it materializes (the partial/final output, projected from the
+ * persisted rows). Both are `null` when the chat has never had a run.
+ */
+export interface IAiChatRunResponse {
+  run: IAiChatRun | null;
+  message: IAiChatMessageRow | null;
+}
+
 export interface IAiChatListParams extends QueryParams {}

 export interface IAiChatMessagesParams {
@@ -0,0 +1,303 @@
+import { describe, it, expect } from "vitest";
+import type { UIMessage } from "@ai-sdk/react";
+import type { IAiChatRun } from "@/features/ai-chat/types/ai-chat.types.ts";
+import {
+  RUN_POLL_INTERVAL_MS,
+  isRunActive,
+  runPollInterval,
+  shouldObserveRun,
+  shouldClearStoppingLatch,
+  shouldClearLatchOnQueryError,
+  mergeObservedMessage,
+} from "./run-polling.ts";
+
+function makeRun(status: string): IAiChatRun {
+  return { id: "run-1", chatId: "c1", status };
+}
+
+function makeMsg(id: string, text: string): UIMessage {
+  return {
+    id,
+    role: "assistant",
+    parts: [{ type: "text", text }],
+  } as UIMessage;
+}
+
+describe("isRunActive", () => {
+  it("treats pending and running as active", () => {
+    expect(isRunActive(makeRun("pending"))).toBe(true);
+    expect(isRunActive(makeRun("running"))).toBe(true);
+  });
+
+  it("treats terminal / unknown / nullish as not active", () => {
+    expect(isRunActive(makeRun("succeeded"))).toBe(false);
+    expect(isRunActive(makeRun("failed"))).toBe(false);
+    expect(isRunActive(makeRun("aborted"))).toBe(false);
+    expect(isRunActive(makeRun("weird-future-status"))).toBe(false);
+    expect(isRunActive(null)).toBe(false);
+    expect(isRunActive(undefined)).toBe(false);
+  });
+});
+
+describe("runPollInterval (the refetchInterval helper)", () => {
+  it("returns 2000ms while the run is pending/running", () => {
+    expect(runPollInterval(makeRun("pending"))).toBe(RUN_POLL_INTERVAL_MS);
+    expect(runPollInterval(makeRun("running"))).toBe(RUN_POLL_INTERVAL_MS);
+    expect(RUN_POLL_INTERVAL_MS).toBe(2000);
+  });
+
+  it("returns false (stop polling) once the run is terminal", () => {
+    expect(runPollInterval(makeRun("succeeded"))).toBe(false);
+    expect(runPollInterval(makeRun("failed"))).toBe(false);
+    expect(runPollInterval(makeRun("aborted"))).toBe(false);
+  });
+
+  it("returns false (no polling) when there is no run", () => {
+    expect(runPollInterval(null)).toBe(false);
+    expect(runPollInterval(undefined)).toBe(false);
+  });
+});
+
+describe("shouldObserveRun (observer-vs-streamer decision)", () => {
+  it("observes an active run when this tab is NOT the local streamer", () => {
+    expect(shouldObserveRun(makeRun("running"), false)).toBe(true);
+    expect(shouldObserveRun(makeRun("pending"), false)).toBe(true);
+  });
+
+  it("observes a terminal run too (so the final output shows on reopen)", () => {
+    expect(shouldObserveRun(makeRun("succeeded"), false)).toBe(true);
+  });
+
+  it("does NOT observe when this tab IS the streamer (no double-render)", () => {
+    expect(shouldObserveRun(makeRun("running"), true)).toBe(false);
+    expect(shouldObserveRun(makeRun("succeeded"), true)).toBe(false);
+  });
+
+  it("does NOT observe when there is no run", () => {
+    expect(shouldObserveRun(null, false)).toBe(false);
+    expect(shouldObserveRun(undefined, false)).toBe(false);
+  });
+});
+
+describe("shouldClearStoppingLatch (#234 latch-release decision)", () => {
+  // The one case the latch SHOULD clear: we requested a stop, we are the passive
+  // observer (not streaming), and the CURRENT run is terminal.
+  it("clears only when stopping, observing, and the run is terminal", () => {
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("aborted"),
+        isLocalStreaming: false,
+      }),
+    ).toBe(true);
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("succeeded"),
+        isLocalStreaming: false,
+      }),
+    ).toBe(true);
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("failed"),
+        isLocalStreaming: false,
+      }),
+    ).toBe(true);
+  });
+
+  // Round-3 regression: clearing while THIS tab is still the local streamer would
+  // re-open the flash for the current turn the moment we switch to observer role.
+  // A predicate lacking the streaming gate would (wrongly) return true here.
+  it("does NOT clear while this tab is the local streamer", () => {
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("aborted"),
+        isLocalStreaming: true,
+      }),
+    ).toBe(false);
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("succeeded"),
+        isLocalStreaming: true,
+      }),
+    ).toBe(false);
+  });
+
+  // The detached run keeps growing after a local abort — while it is still
+  // active the latch MUST hold so the observer merge stays suppressed.
+  it("does NOT clear while the run is still active", () => {
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("running"),
+        isLocalStreaming: false,
+      }),
+    ).toBe(false);
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: makeRun("pending"),
+        isLocalStreaming: false,
+      }),
+    ).toBe(false);
+  });
+
+  // #234 F4: on Stop the stale PREVIOUS-turn run is removed from the cache, so the
+  // observed `run` is null until the current turn's run is fetched fresh. A null
+  // run HOLDS the latch — it can never clear against the just-removed stale run,
+  // only against the current turn's own terminal run once observed.
+  it("does NOT clear against a removed/absent run (F4 stale-run guard)", () => {
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: null,
+        isLocalStreaming: false,
+      }),
+    ).toBe(false);
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: true,
+        run: undefined,
+        isLocalStreaming: false,
+      }),
+    ).toBe(false);
+  });
+
+  it("does NOT clear when no stop was requested", () => {
+    expect(
+      shouldClearStoppingLatch({
+        stoppingRun: false,
+        run: makeRun("aborted"),
+        isLocalStreaming: false,
+      }),
+    ).toBe(false);
+  });
+});
+
+describe("shouldClearLatchOnQueryError (#234 F7 error-safety-net decision)", () => {
+  // This guards the REAL anti-flash decision the component's run-query-error
+  // safety-net effect uses (ai-chat-window.tsx wires the effect to THIS helper,
+  // not a copy — so the test is non-vacuous vs the live code).
+
+  // (b) The F7 hole: a TRANSIENT run-query error while `run` is STILL ACTIVE must
+  // NOT clear the latch. TanStack Query v5 retains `data` on error, so
+  // runQueryFailed can be true while the held run is still pending/running.
+  // Against the PRE-F7 condition (without `!isRunActive(run)`) this would return
+  // true — so this assertion fails on the buggy code (non-vacuous).
+  it("does NOT clear on a transient error while the run is still ACTIVE (F7)", () => {
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: false,
+        runQueryFailed: true,
+        run: makeRun("running"),
+      }),
+    ).toBe(false);
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: false,
+        runQueryFailed: true,
+        run: makeRun("pending"),
+      }),
+    ).toBe(false);
+  });
+
+  // (a) The genuine permanent-null-freeze: run cache cleared by removeQueries +
+  // the refetch keeps ERRORING, so `run === null`. This is the ONLY case the
+  // safety-net exists to cure — it MUST clear so the frozen view resumes.
+  it("clears on a permanent error when the run is null (permanent-null-freeze)", () => {
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: false,
+        runQueryFailed: true,
+        run: null,
+      }),
+    ).toBe(true);
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: false,
+        runQueryFailed: true,
+        run: undefined,
+      }),
+    ).toBe(true);
+  });
+
+  // (c) A TERMINAL run also satisfies `!isRunActive`; clearing then is harmless: the
+  // terminal effect (shouldClearStoppingLatch) already clears for a terminal run,
+  // so this only ever agrees with it. Asserted so the (c) reasoning is pinned.
+  it("clears on an error when the run is terminal (harmless, agrees with terminal effect)", () => {
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: false,
+        runQueryFailed: true,
+        run: makeRun("aborted"),
+      }),
+    ).toBe(true);
+  });
+
+  it("does NOT clear without an actual query error", () => {
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: false,
+        runQueryFailed: false,
+        run: null,
+      }),
+    ).toBe(false);
+  });
+
+  it("does NOT clear while this tab is the local streamer", () => {
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: true,
+        isLocalStreaming: true,
+        runQueryFailed: true,
+        run: null,
+      }),
+    ).toBe(false);
+  });
+
+  it("does NOT clear when no stop was requested", () => {
+    expect(
+      shouldClearLatchOnQueryError({
+        stoppingRun: false,
+        isLocalStreaming: false,
+        runQueryFailed: true,
+        run: null,
+      }),
+    ).toBe(false);
+  });
+});
+
+describe("mergeObservedMessage", () => {
+  it("replaces the message with the same id in place (per-step growth)", () => {
+    const prev = [makeMsg("u1", "hi"), makeMsg("a1", "step 1")];
+    const observed = makeMsg("a1", "step 1\nstep 2");
+    const next = mergeObservedMessage(prev, observed);
+    expect(next).toHaveLength(2);
+    expect(next[1]).toBe(observed);
+    expect(next[0]).toBe(prev[0]); // untouched
+    expect(next).not.toBe(prev); // new array (never mutates input)
+  });
+
+  it("appends when the observed message is not yet present", () => {
+    const prev = [makeMsg("u1", "hi")];
+    const observed = makeMsg("a1", "first token");
+    const next = mergeObservedMessage(prev, observed);
+    expect(next).toHaveLength(2);
+    expect(next[1]).toBe(observed);
+  });
+
+  it("returns the original list unchanged when there is nothing to merge", () => {
+    const prev = [makeMsg("u1", "hi")];
+    expect(mergeObservedMessage(prev, null)).toBe(prev);
+    expect(mergeObservedMessage(prev, undefined)).toBe(prev);
+  });
+});
@@ -0,0 +1,151 @@
+import type { UIMessage } from "@ai-sdk/react";
+import type { IAiChatRun } from "@/features/ai-chat/types/ai-chat.types.ts";
+
+/**
+ * Reconnect-and-live-follow helpers (#184). When a chat is reopened while its
+ * agent run is STILL going, this tab is a PASSIVE OBSERVER: it did not start the
+ * run here (no local SSE stream), so it catches up by POLLING the reconnect
+ * endpoint (`POST /ai-chat/run`) and merging the run's incrementally-persisted
+ * assistant message into the rendered thread. These are the small pure decisions
+ * that machinery hangs off, extracted so they can be unit-tested in isolation
+ * (mirrors how reindex polling / editor-sync-state are tested).
+ */
+
+/** How often to re-poll the reconnect endpoint while a run is ACTIVE. */
+export const RUN_POLL_INTERVAL_MS = 2000;
+
+// 'pending' and 'running' are the two ACTIVE statuses; 'succeeded' | 'failed' |
+// 'aborted' are TERMINAL (and any unknown future status is treated as terminal,
+// so a stale/odd value never polls forever).
+const ACTIVE_STATUSES = new Set(["pending", "running"]);
+
+/** Whether a run is still going (worth polling / merging live updates from). */
+export function isRunActive(run: IAiChatRun | null | undefined): boolean {
+  return !!run && ACTIVE_STATUSES.has(run.status);
+}
+
+/**
+ * The TanStack Query `refetchInterval` value for the run query: poll every
+ * {@link RUN_POLL_INTERVAL_MS} while the run is active, and `false` (stop) once
+ * it is terminal or there is no run. Polling is thus naturally bounded by the run
+ * reaching a terminal status — no separate timeout cap is needed.
+ */
+export function runPollInterval(
+  run: IAiChatRun | null | undefined,
+): number | false {
+  return isRunActive(run) ? RUN_POLL_INTERVAL_MS : false;
+}
+
+/**
+ * Observer-vs-streamer decision. We render the polled run message (catch up +
+ * keep advancing) ONLY when this tab is a passive observer: there IS a run AND
+ * this tab is NOT the one locally streaming it (we reconnected, we didn't start
+ * it here). When this tab is the streamer, the live SSE stream owns the view, so
+ * we neither poll nor merge — avoiding a double-render fight. Terminal runs still
+ * merge (so the final persisted output is shown on reopen); the poll itself is
+ * stopped separately by {@link runPollInterval}.
+ */
+export function shouldObserveRun(
+  run: IAiChatRun | null | undefined,
+  localStreaming: boolean,
+): boolean {
+  return !!run && !localStreaming;
+}
+
+/**
+ * Should the "stopping" latch — which suppresses the observer re-stream flash
+ * after the user pressed Stop — be RELEASED now? All three must hold:
+ *  - `stoppingRun`: we actually requested a stop (otherwise nothing to release);
+ *  - `!isLocalStreaming`: this tab is NOT the local streamer. While we are the
+ *    streamer the run query is disabled, so the observed `run` is not the run we
+ *    are following — releasing the latch then would re-open the flash for the
+ *    current turn the instant we switch to observer role;
+ *  - the observed `run` EXISTS and has reached a TERMINAL status.
+ *
+ * The null / still-active `run` case is the #234 F4 invariant. On Stop the stale
+ * PREVIOUS-turn run is removed from the query cache (`removeQueries`), so `run`
+ * is null until the CURRENT turn's run is re-fetched fresh; a null or active run
+ * therefore HOLDS the latch, so it can only ever clear against the current turn's
+ * OWN terminal run — never a stale cached one. (The cache removal itself is
+ * integration-level in AiChatWindow; this predicate encodes the decision given
+ * whatever run is currently observed, and a stale terminal run is
+ * indistinguishable from a current terminal run at the predicate level — hence
+ * the cache removal is what guarantees only the current run is ever passed here.)
+ */
+export function shouldClearStoppingLatch(args: {
+  stoppingRun: boolean;
+  run: IAiChatRun | null | undefined;
+  isLocalStreaming: boolean;
+}): boolean {
+  const { stoppingRun, run, isLocalStreaming } = args;
+  if (!stoppingRun || isLocalStreaming) return false;
+  return !!run && !isRunActive(run);
+}
+
+/**
+ * Should the "stopping" latch be RELEASED by the run-query ERROR safety-net?
+ * (#234 F7 — a NEW path of the same re-stream flash the F4 latch exists to
+ * prevent.) After Stop, `handleServerStop` clears the run cache; the terminal
+ * effect then holds the latch via `if (!run) return` until the CURRENT turn's run
+ * is fetched fresh. If that refetch instead ERRORS permanently, `run` stays null,
+ * its status-keyed refetchInterval is off, and nothing would ever observe a
+ * terminal run — freezing the view with the observer merge suppressed. This
+ * safety-net cures ONLY that genuine permanent-null-freeze.
+ *
+ * All four must hold:
+ *  - `stoppingRun`: we actually requested a stop (otherwise nothing to release);
+ *  - `!isLocalStreaming`: this tab is NOT the local streamer (same reason as
+ *    {@link shouldClearStoppingLatch});
+ *  - `runQueryFailed`: the run query is in its error state (`isError` in TanStack
+ *    Query v5, with `retry: false` set on the query);
+ *  - `!isRunActive(run)`: the observed `run` is NOT an active (pending/running)
+ *    held run. This is the F7 gate. In TanStack Query v5 the query's `data` is
+ *    RETAINED on error, so `runQueryFailed` can be true while `run` is STILL an
+ *    ACTIVE run (a single transient run-query failure in the window between Stop and
+ *    settle). Without this gate a transient error would release the latch early —
+ *    re-opening the observer merge and flashing the growing detached run over the
+ *    frozen row (exactly the F4 flash). Gating on the run NOT being active means we
+ *    only ever cure the permanent-null-freeze (`run === null`, so
+ *    `isRunActive(null)` is false), never release against an active run.
+ *
+ * (A terminal `run` also satisfies `!isRunActive(run)`; clearing then is harmless
+ * — the terminal effect's {@link shouldClearStoppingLatch} already clears the
+ * latch for a terminal run, so this only ever agrees with it, never conflicts.)
+ *
+ * INVARIANT (do not break): clearing the latch on the `run === null` branch is safe
+ * ONLY because the run query's `refetchInterval` (see {@link runPollInterval}) stops
+ * polling when the data is empty — so after we clear on null+error there is no
+ * subsequent auto-poll that could return a still-active detached run and re-open the
+ * merge. If `refetchInterval` is ever changed to keep polling on `run === null`/on
+ * error, this null-branch clear would re-open the F7 flash through the null path.
+ * Do not change the run query's refetchInterval without re-checking this path.
+ */
+export function shouldClearLatchOnQueryError(args: {
+  stoppingRun: boolean;
+  isLocalStreaming: boolean;
+  runQueryFailed: boolean;
+  run: IAiChatRun | null | undefined;
+}): boolean {
+  const { stoppingRun, isLocalStreaming, runQueryFailed, run } = args;
+  return (
+    stoppingRun && !isLocalStreaming && runQueryFailed && !isRunActive(run)
+  );
+}
+
+/**
+ * Merge an observed assistant message into the rendered list: replace the message
+ * with the same id in place (the in-progress assistant row is already seeded from
+ * history, so per-step growth replaces it), or append it when absent. Returns a
+ * new array; the input is never mutated.
+ */
+export function mergeObservedMessage(
+  messages: UIMessage[],
+  observed: UIMessage | null | undefined,
+): UIMessage[] {
+  if (!observed) return messages;
+  const idx = messages.findIndex((m) => m.id === observed.id);
+  if (idx === -1) return [...messages, observed];
+  const next = messages.slice();
+  next[idx] = observed;
+  return next;
+}
@@ -1,13 +1,19 @@
 import { atom } from "jotai";
-// Type-only: these atoms only hold an Editor reference for typing. A value
-// import would drag the whole @tiptap/core engine into the eager graph of every
-// shell component that reads one of these atoms.
-import type { Editor } from "@tiptap/core";
+import { Editor } from "@tiptap/core";
+import type { HocuspocusProvider } from "@hocuspocus/provider";
 import { PageEditMode } from "@/features/user/types/user.types.ts";
 import type { DictationUnavailableReason } from "@/features/dictation/dictation-status";

 export const pageEditorAtom = atom<Editor | null>(null);

+// #370 — the active page's collab provider, published by the page editor so the
+// header menu can emit the "save-version" stateless signal (Cmd+S / button).
+// Null when the page is read-only / collab isn't connected. A typed initial
+// value (rather than an explicit generic) keeps jotai's overload resolution on
+// the writable PrimitiveAtom branch.
+const initialCollabProvider: HocuspocusProvider | null = null;
+export const collabProviderAtom = atom(initialCollabProvider);
+
 export const titleEditorAtom = atom<Editor | null>(null);

 export const readOnlyEditorAtom = atom<Editor | null>(null);
@@ -1,16 +0,0 @@
-import { lazy, Suspense } from "react";
-import { EditorMenuProps } from "@/features/editor/components/table/types/types.ts";
-
-// Lazily load the drawio bubble menu so it is split out of the editor chunk and
-// fetched only when an editable editor is mounted (mirrors excalidraw-menu-lazy).
-const DrawioMenu = lazy(
-  () => import("@/features/editor/components/drawio/drawio-menu.tsx"),
-);
-
-export default function DrawioMenuLazy(props: EditorMenuProps) {
-  return (
-    <Suspense fallback={null}>
-      <DrawioMenu {...props} />
-    </Suspense>
-  );
-}
@@ -1,17 +0,0 @@
-import { lazy, Suspense } from "react";
-import { NodeViewProps } from "@tiptap/react";
-
-// Lazily load the drawio node view so the heavy react-drawio embed runtime is
-// split into its own chunk and fetched only when a drawio diagram is actually
-// rendered (mirrors excalidraw-view-lazy).
-const DrawioView = lazy(
-  () => import("@/features/editor/components/drawio/drawio-view.tsx"),
-);
-
-export default function DrawioViewLazy(props: NodeViewProps) {
-  return (
-    <Suspense fallback={null}>
-      <DrawioView {...props} />
-    </Suspense>
-  );
-}
@@ -1,19 +0,0 @@
-import { lazy, Suspense } from "react";
-import { NodeViewProps } from "@tiptap/react";
-
-// Lazily load the KaTeX-backed block math view so the katex chunk is fetched
-// only when a document actually contains a math node (mirrors the mermaid/
-// excalidraw lazy pattern). The local Suspense keeps a slow katex chunk from
-// crashing or blocking the whole editor: while it loads we render the raw
-// LaTeX source as a node-sized placeholder.
-const MathBlockView = lazy(
-  () => import("@/features/editor/components/math/math-block.tsx"),
-);
-
-export default function MathBlockViewLazy(props: NodeViewProps) {
-  return (
-    <Suspense fallback={<div data-katex="true">{props.node.attrs.text}</div>}>
-      <MathBlockView {...props} />
-    </Suspense>
-  );
-}
@@ -1,19 +0,0 @@
-import { lazy, Suspense } from "react";
-import { NodeViewProps } from "@tiptap/react";
-
-// Lazily load the KaTeX-backed inline math view so the katex chunk is fetched
-// only when a document actually contains a math node (mirrors the mermaid/
-// excalidraw lazy pattern). The local Suspense keeps a slow katex chunk from
-// crashing or blocking the whole editor: while it loads we render the raw
-// LaTeX source as a node-sized placeholder.
-const MathInlineView = lazy(
-  () => import("@/features/editor/components/math/math-inline.tsx"),
-);
-
-export default function MathInlineViewLazy(props: NodeViewProps) {
-  return (
-    <Suspense fallback={<span data-katex="true">{props.node.attrs.text}</span>}>
-      <MathInlineView {...props} />
-    </Suspense>
-  );
-}
@@ -81,8 +81,8 @@ import {
  createResizeHandle,
  buildResizeClasses,
 } from "@/features/editor/components/common/node-resize-handles.ts";
-import MathInlineView from "@/features/editor/components/math/math-inline-lazy.tsx";
-import MathBlockView from "@/features/editor/components/math/math-block-lazy.tsx";
+import MathInlineView from "@/features/editor/components/math/math-inline.tsx";
+import MathBlockView from "@/features/editor/components/math/math-block.tsx";
 import ImageView from "@/features/editor/components/image/image-view.tsx";
 import CalloutView from "@/features/editor/components/callout/callout-view.tsx";
 import StatusView from "@/features/editor/components/status/status-view.tsx";
@@ -90,7 +90,7 @@ import VideoView from "@/features/editor/components/video/video-view.tsx";
 import AudioView from "@/features/editor/components/audio/audio-view.tsx";
 import AttachmentView from "@/features/editor/components/attachment/attachment-view.tsx";
 import CodeBlockView from "@/features/editor/components/code-block/code-block-view.tsx";
-import DrawioView from "../components/drawio/drawio-view-lazy.tsx";
+import DrawioView from "../components/drawio/drawio-view";
 import ExcalidrawView from "@/features/editor/components/excalidraw/excalidraw-view-lazy.tsx";
 import EmbedView from "@/features/editor/components/embed/embed-view.tsx";
 import HtmlEmbedView from "@/features/editor/components/html-embed/html-embed-view.tsx";
@@ -1,17 +1,8 @@
 import { useEffect, useRef } from "react";
 import { useNavigate } from "react-router-dom";
 import { getDefaultStore } from "jotai";
-
-// Literal value of WebSocketStatus.Connected from @hocuspocus/provider. Inlined
-// so this always-mounted global bridge does not statically import
-// @hocuspocus/provider — that import pulls Yjs (and, through a shared chunk, the
-// whole TipTap engine) into the eager startup graph. yjsConnectionStatusAtom
-// already stores these raw status strings.
-const YJS_STATUS_CONNECTED = "connected";
-// Type-only: importing Editor as a type keeps @tiptap/core (the whole editor
-// engine) out of the eager global-shell graph — the bridge only uses it for
-// annotations/casts, never as a runtime value.
-import type { Editor } from "@tiptap/core";
+import { WebSocketStatus } from "@hocuspocus/provider";
+import { Editor } from "@tiptap/core";
 import {
  pageEditorAtom,
  yjsConnectionStatusAtom,
@@ -25,19 +16,15 @@ import {
  getSidebarPages,
 } from "@/features/page/services/page-service.ts";
 import { buildPageUrl } from "@/features/page/page.utils.ts";
-// Types are erased at build time, so importing them does not pull the module's
-// runtime (which drags in @tiptap + the editor-ext barrel). The actual recording
-// helpers are dynamically imported at call time inside createPageWithRecording,
-// keeping the editor engine out of the eager global-shell startup graph — the
-// bridge is mounted for every authenticated user but recording is a rare,
-// native-host-driven action.
-import type {
+import {
  GitmostBridge,
  GitmostCreatePagePayload,
  GitmostCreatePageResult,
  GitmostListPagesPayload,
  GitmostListPagesResult,
  GitmostListSpacesResult,
+  gitmostDecodePayloadToFile,
+  gitmostUploadFileToEditor,
 } from "@/features/editor/gitmost/gitmost-recording.ts";

 // How long to wait for a freshly-navigated page's editor to mount, become
@@ -70,7 +57,7 @@ function gitmostWaitForEditor(
        !editor.isDestroyed &&
        editor.isEditable &&
        editorPageId === pageId &&
-        yjsStatus === YJS_STATUS_CONNECTED;
+        yjsStatus === WebSocketStatus.Connected;
      if (ready) {
        resolve(editor);
        return;
@@ -184,12 +171,6 @@ export default function GitmostGlobalBridge() {
          };
        }

-        // Load the recording helpers on demand (see the import note above). This
-        // is the only place they are needed, so the @tiptap/editor-ext code they
-        // pull in stays out of the eager startup graph.
-        const { gitmostDecodePayloadToFile, gitmostUploadFileToEditor } =
-          await import("@/features/editor/gitmost/gitmost-recording.ts");
-
        // Validate/decode the recording BEFORE creating the page so a bad
        // payload never leaves an empty junk page behind. Per the createPage
        // error contract, any decode failure collapses to "insert-failed" (the
@@ -31,11 +31,18 @@ import { useAtom, useAtomValue, useSetAtom } from "jotai";
 import useCollaborationUrl from "@/features/editor/hooks/use-collaboration-url";
 import { currentUserAtom } from "@/features/user/atoms/current-user-atom";
 import {
+  collabProviderAtom,
  currentPageEditModeAtom,
  dictationAvailabilityAtom,
  pageEditorAtom,
  yjsConnectionStatusAtom,
 } from "@/features/editor/atoms/editor-atoms";
+import { notifications } from "@mantine/notifications";
+import {
+  VERSION_SAVED_MESSAGE_TYPE,
+  type VersionSavedMessage,
+  saveVersionPending,
+} from "@/features/page-history/version-messages";
 import { asideStateAtom } from "@/components/layouts/global/hooks/atoms/sidebar-atom";
 import {
  activeCommentIdAtom,
@@ -59,7 +66,7 @@ import {
  handlePaste,
 } from "@/features/editor/components/common/editor-paste-handler.tsx";
 import ExcalidrawMenu from "./components/excalidraw/excalidraw-menu-lazy";
-import DrawioMenu from "./components/drawio/drawio-menu-lazy";
+import DrawioMenu from "./components/drawio/drawio-menu";
 import { useCollabToken } from "@/features/auth/queries/auth-query.tsx";
 import SearchAndReplaceDialog from "@/features/editor/components/search-and-replace/search-and-replace-dialog.tsx";
 import { useDebouncedCallback, useDocumentVisibility } from "@mantine/hooks";
@@ -93,6 +100,11 @@ import {
  isBodyEditable,
  isCollabSynced,
 } from "@/features/editor/editor-sync-state";
+import {
+  isVitalsActive,
+  measurePageOpen,
+  reportEditorTx,
+} from "@/lib/telemetry/vitals";

 interface PageEditorProps {
  pageId: string;
@@ -118,6 +130,7 @@ export default function PageEditor({

  const [currentUser] = useAtom(currentUserAtom);
  const [, setEditor] = useAtom(pageEditorAtom);
+  const setCollabProvider = useSetAtom(collabProviderAtom);
  const [, setAsideState] = useAtom(asideStateAtom);
  const [, setActiveCommentId] = useAtom(activeCommentIdAtom);
  const [showCommentPopup, setShowCommentPopup] = useAtom(showCommentPopupAtom);
@@ -175,6 +188,24 @@ export default function PageEditor({
      const onStatelessHandler = ({ payload }: onStatelessParameters) => {
        try {
          const message = JSON.parse(payload);
+          // #370 — a version was saved somewhere; live-refresh the history panel
+          // on every client. Only the client that pressed Save (tracked by the
+          // module-level flag) shows the confirmation toast.
+          if (message?.type === VERSION_SAVED_MESSAGE_TYPE) {
+            const versionMsg = message as VersionSavedMessage;
+            queryClient.invalidateQueries({
+              queryKey: ["page-history-list"],
+            });
+            if (saveVersionPending.current) {
+              saveVersionPending.current = false;
+              notifications.show({
+                message: versionMsg.alreadySaved
+                  ? t("Already saved as the latest version")
+                  : t("Version saved"),
+              });
+            }
+            return;
+          }
          if (message?.type !== "page.updated" || !message.updatedAt) return;
          const pageData = queryClient.getQueryData<IPage>(["pages", slugId]);
          if (pageData) {
@@ -232,12 +263,16 @@ export default function PageEditor({

      local.on("synced", onLocalSyncedHandler);
      providersRef.current = { socket, local, remote };
+      // #370 — publish the provider so the header menu can emit save-version.
+      setCollabProvider(remote);
      setProvidersReady(true);
    } else {
+      setCollabProvider(providersRef.current.remote);
      setProvidersReady(true);
    }
    // Only destroy on final unmount
    return () => {
+      setCollabProvider(null);
      providersRef.current?.socket.destroy();
      providersRef.current?.remote.destroy();
      providersRef.current?.local.destroy();
@@ -351,6 +386,40 @@ export default function PageEditor({
          editor.storage.pageId = pageId;
          handleScrollTo(editor);
          editorRef.current = editor;
+
+          // #355 — perf instrumentation. Skip ALL of it when telemetry is
+          // disabled (F1 flag off) or this session isn't sampled: no page-open
+          // measure, and crucially NO dispatch wrapping, so a non-collecting
+          // session pays zero per-transaction cost.
+          if (isVitalsActive()) {
+            // page_open_ms: this is the first editor-content render, so measure
+            // against any page-open mark set on the tree-row/link click.
+            measurePageOpen();
+
+            // editor_tx_ms: time the SYNCHRONOUS part of applying each
+            // transaction (state.apply + updateState) by wrapping the view's
+            // dispatch. Only slow syncs (>8ms) are reported (see reportEditorTx),
+            // so the common path adds just one performance.now() pair. Passive:
+            // the original dispatch still runs unchanged.
+            try {
+              const view = editor.view as unknown as {
+                dispatch: (tr: unknown) => void;
+              };
+              const originalDispatch = view.dispatch.bind(view);
+              view.dispatch = (tr: unknown) => {
+                const started = performance.now();
+                originalDispatch(tr);
+                const elapsed = performance.now() - started;
+                try {
+                  reportEditorTx(elapsed, editor.state.doc.content.size);
+                } catch {
+                  // never let telemetry break editing
+                }
+              };
+            } catch {
+              // if the view shape changes, skip editor_tx instrumentation
+            }
+          }
        }
      },
      onUpdate({ editor }) {
@@ -1,4 +1,11 @@
-import { Text, Group, UnstyledButton, Avatar, Tooltip } from "@mantine/core";
+import {
+  Text,
+  Group,
+  UnstyledButton,
+  Avatar,
+  Tooltip,
+  Badge,
+} from "@mantine/core";
 import { CustomAvatar } from "@/components/ui/custom-avatar.tsx";
 import { AgentAvatarStack } from "@/components/ui/agent-avatar-stack.tsx";
 import { formattedDate } from "@/lib/time";
@@ -7,36 +14,59 @@ import clsx from "clsx";
 import { IPageHistory } from "@/features/page-history/types/page.types";
 import { memo, useCallback } from "react";
 import { useSetAtom } from "jotai";
+import { useTranslation } from "react-i18next";
 import { historyAtoms } from "@/features/page-history/atoms/history-atoms.ts";

 const MAX_VISIBLE_AVATARS = 5;

+/**
+ * #370 — map a snapshot's intentionality tier to its badge. `version: true`
+ * marks the intentional points (manual / agent); autosaves (boundary / idle /
+ * legacy null) are non-versions and get dimmed in the list.
+ */
+type HistoryKindMeta = { labelKey: string; color: string; version: boolean };
+export function historyKindMeta(kind?: string | null): HistoryKindMeta {
+  switch (kind) {
+    case "manual":
+      return { labelKey: "Saved", color: "blue", version: true };
+    case "agent":
+      return { labelKey: "Agent version", color: "violet", version: true };
+    case "boundary":
+      return { labelKey: "Boundary", color: "gray", version: false };
+    default: // "idle" | null | undefined (legacy autosave)
+      return { labelKey: "Autosave", color: "gray", version: false };
+  }
+}
+
 interface HistoryItemProps {
  historyItem: IPageHistory;
-  index: number;
-  onSelect: (id: string, index: number) => void;
-  onHover?: (id: string, index: number) => void;
+  // The previous snapshot for diff/restore is resolved by id from the FULL list
+  // in the parent (resolvePrevSnapshotId), so the item only needs to report its
+  // own id — never a list index (which would be the filtered-view index).
+  onSelect: (id: string) => void;
+  onHover?: (id: string) => void;
  onHoverEnd?: () => void;
  isActive: boolean;
 }

 const HistoryItem = memo(function HistoryItem({
  historyItem,
-  index,
  onSelect,
  onHover,
  onHoverEnd,
  isActive,
 }: HistoryItemProps) {
  const setHistoryModalOpen = useSetAtom(historyAtoms);
+  const { t } = useTranslation();
+  const kindMeta = historyKindMeta(historyItem.kind);

  const handleClick = useCallback(() => {
-    onSelect(historyItem.id, index);
-  }, [onSelect, historyItem.id, index]);
+    onSelect(historyItem.id);
+  }, [onSelect, historyItem.id]);

  const handleMouseEnter = useCallback(() => {
-    onHover?.(historyItem.id, index);
-  }, [onHover, historyItem.id, index]);
+    onHover?.(historyItem.id);
+  }, [onHover, historyItem.id]);

  const contributors = historyItem.contributors;
  const hasContributors = contributors && contributors.length > 0;
@@ -49,8 +79,20 @@ const HistoryItem = memo(function HistoryItem({
      onMouseEnter={handleMouseEnter}
      onMouseLeave={onHoverEnd}
      className={clsx(classes.history, { [classes.active]: isActive })}
+      // #370 — dim autosnapshots so intentional versions stand out.
+      style={{ opacity: kindMeta.version ? 1 : 0.55 }}
    >
-      <Text size="sm">{formattedDate(new Date(historyItem.createdAt))}</Text>
+      <Group gap={6} wrap="nowrap" justify="space-between">
+        <Text size="sm">{formattedDate(new Date(historyItem.createdAt))}</Text>
+        <Badge
+          size="xs"
+          radius="sm"
+          variant={kindMeta.version ? "filled" : "light"}
+          color={kindMeta.color}
+        >
+          {t(kindMeta.labelKey)}
+        </Badge>
+      </Group>

      <Group gap={6} wrap="nowrap" mt={4}>
        {hasContributors ? (
@@ -9,7 +9,7 @@ import {
  historyAtoms,
 } from "@/features/page-history/atoms/history-atoms";
 import { useAtom, useSetAtom } from "jotai";
-import { useCallback, useEffect, useMemo, useRef } from "react";
+import { useCallback, useEffect, useMemo, useRef, useState } from "react";
 import {
  Button,
  ScrollArea,
@@ -17,9 +17,12 @@ import {
  Divider,
  Loader,
  Center,
+  Switch,
+  Text,
 } from "@mantine/core";
 import { useTranslation } from "react-i18next";
 import { useHistoryRestore } from "@/features/page-history/hooks";
+import { resolvePrevSnapshotId } from "@/features/page-history/utils/resolve-prev-snapshot";

 const PREFETCH_DELAY_MS = 150;

@@ -47,6 +50,23 @@ function HistoryList({ pageId }: Props) {
    [pageHistoryData],
  );

+  // #370 — "only versions" filter: hide autosnapshots (idle/boundary/legacy
+  // null), keep only intentional points (manual/agent). Filtering is over the
+  // already-loaded pages; the diff/restore still targets the true previous
+  // snapshot, resolved by id from the FULL list (resolvePrevSnapshotId).
+  const [onlyVersions, setOnlyVersions] = useState(false);
+  const isVersion = useCallback(
+    (kind?: string | null) => kind === "manual" || kind === "agent",
+    [],
+  );
+  const visibleItems = useMemo(
+    () =>
+      onlyVersions
+        ? historyItems.filter((item) => isVersion(item.kind))
+        : historyItems,
+    [historyItems, onlyVersions, isVersion],
+  );
+
  const loadMoreRef = useRef<HTMLDivElement>(null);
  const prefetchTimeoutRef = useRef<ReturnType<typeof setTimeout> | null>(null);

@@ -60,11 +80,13 @@ function HistoryList({ pageId }: Props) {
  }, []);

  const handleHover = useCallback(
-    (historyId: string, index: number) => {
+    (historyId: string) => {
      clearPrefetchTimeout();
      prefetchTimeoutRef.current = setTimeout(() => {
        prefetchPageHistory(historyId);
-        const prevId = historyItems[index + 1]?.id;
+        // The true previous snapshot in the FULL list (not the previous visible
+        // one under the "only versions" filter).
+        const prevId = resolvePrevSnapshotId(historyItems, historyId);
        if (prevId) {
          prefetchPageHistory(prevId);
        }
@@ -78,9 +100,11 @@ function HistoryList({ pageId }: Props) {
  }, [clearPrefetchTimeout]);

  const handleSelect = useCallback(
-    (id: string, index: number) => {
+    (id: string) => {
      setActiveHistoryId(id);
-      setActiveHistoryPrevId(historyItems[index + 1]?.id ?? "");
+      // Baseline = true previous snapshot in the FULL list, so the "only
+      // versions" filter never diffs/restores against the wrong item.
+      setActiveHistoryPrevId(resolvePrevSnapshotId(historyItems, id));
    },
    [historyItems, setActiveHistoryId, setActiveHistoryPrevId],
  );
@@ -128,12 +152,27 @@ function HistoryList({ pageId }: Props) {

  return (
    <div>
+      <Group px="xs" py={6} justify="flex-end">
+        <Switch
+          size="xs"
+          checked={onlyVersions}
+          onChange={(e) => setOnlyVersions(e.currentTarget.checked)}
+          label={t("Only versions")}
+        />
+      </Group>
+
      <ScrollArea h={620} w="100%" type="scroll" scrollbarSize={5}>
-        {historyItems.map((historyItem, index) => (
+        {onlyVersions && visibleItems.length === 0 && (
+          <Center py="md">
+            <Text size="sm" c="dimmed">
+              {t("No saved versions yet.")}
+            </Text>
+          </Center>
+        )}
+        {visibleItems.map((historyItem) => (
          <HistoryItem
            key={historyItem.id}
            historyItem={historyItem}
-            index={index}
            onSelect={handleSelect}
            onHover={handleHover}
            onHoverEnd={clearPrefetchTimeout}
@@ -24,6 +24,10 @@ export interface IPageHistory {
  updatedAt: string;
  lastUpdatedBy: IPageHistoryUser;
  contributors?: IPageHistoryUser[];
+  // #370 — intentionality tier: 'manual'/'agent' are versions (intentional
+  // points), 'idle'/'boundary' are autosnapshots; null/undefined = legacy
+  // autosave. Derived server-side, drives the history badge + "versions" filter.
+  kind?: "manual" | "agent" | "idle" | "boundary" | null;
  // Provenance markers copied off the page row when the snapshot was saved.
  // `'agent'` marks a version written by the AI agent; `lastUpdatedAiChatId`
  // (when present) deep-links to the chat that produced the edit.
@@ -0,0 +1,42 @@
+import { describe, it, expect } from "vitest";
+import { resolvePrevSnapshotId } from "./resolve-prev-snapshot";
+
+// #370 F4 — the risky client path: with the "only versions" filter active, diff
+// and restore must still baseline against the TRUE previous snapshot in the FULL
+// list, never the previous VISIBLE version (which would skip the autosnapshots
+// between two versions). These pin that the resolution is by FULL-list order.
+describe("resolvePrevSnapshotId", () => {
+  // Newest-first, as the history list stores it: a version, then two autosaves,
+  // then an older version.
+  const full = [
+    { id: "v2", kind: "manual" },
+    { id: "a2", kind: "idle" },
+    { id: "a1", kind: "boundary" },
+    { id: "v1", kind: "manual" },
+    { id: "a0", kind: null },
+  ];
+
+  it("returns the immediate FULL-list successor, not the previous visible version", () => {
+    // Selecting v2 while filtered to versions-only must baseline against a2 (the
+    // real chronological predecessor), NOT v1 (the previous visible version).
+    expect(resolvePrevSnapshotId(full, "v2")).toBe("a2");
+  });
+
+  it("resolves an autosnapshot's predecessor by full-list order", () => {
+    expect(resolvePrevSnapshotId(full, "a1")).toBe("v1");
+  });
+
+  it("returns '' for the oldest item (no predecessor)", () => {
+    expect(resolvePrevSnapshotId(full, "a0")).toBe("");
+  });
+
+  it("returns '' for an id not in the list", () => {
+    expect(resolvePrevSnapshotId(full, "missing")).toBe("");
+  });
+
+  it("does not depend on a filtered subset — same result whatever is visible", () => {
+    // The helper only ever sees the full list; a filtered view cannot change the
+    // baseline it computes.
+    expect(resolvePrevSnapshotId(full, "v1")).toBe("a0");
+  });
+});
@@ -0,0 +1,22 @@
+/**
+ * #370 — resolve the TRUE previous snapshot for a history item.
+ *
+ * The history panel can be filtered to "only versions" (manual/agent), but diff
+ * and restore must always compare against the immediately-preceding snapshot in
+ * the FULL, unfiltered list — NOT the previous VISIBLE item. Comparing against
+ * the previous visible version would silently skip the autosnapshots between two
+ * versions and diff/restore the wrong baseline.
+ *
+ * Given the full (newest-first) list and an item id, this returns the id of the
+ * item right after it in the full list (its chronological predecessor), or "" if
+ * it is the oldest / not found. Pure and list-order-preserving so it can be unit
+ * tested without mounting the component.
+ */
+export function resolvePrevSnapshotId(
+  fullItems: ReadonlyArray<{ id: string }>,
+  id: string,
+): string {
+  const index = fullItems.findIndex((item) => item.id === id);
+  if (index === -1) return "";
+  return fullItems[index + 1]?.id ?? "";
+}
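Review note: a minimal standalone sketch of why the baseline is resolved by id over the full list rather than by index over the filtered view. The helper is copied locally from the hunk above so the snippet runs on its own; the sample data and the `naivePrev` lookup are hypothetical and only illustrate the failure mode the doc comment describes.

```typescript
// Local copy of the helper from this diff, so the sketch runs standalone.
function resolvePrevSnapshotId(
  fullItems: ReadonlyArray<{ id: string }>,
  id: string,
): string {
  const index = fullItems.findIndex((item) => item.id === id);
  if (index === -1) return "";
  return fullItems[index + 1]?.id ?? "";
}

type Item = { id: string; kind: string | null };

// Hypothetical sample history, newest first: two versions with one autosave
// between them.
const full: Item[] = [
  { id: "v2", kind: "manual" },
  { id: "a2", kind: "idle" },
  { id: "v1", kind: "agent" },
];

// What an "only versions" filter would leave visible.
const visible = full.filter(
  (item) => item.kind === "manual" || item.kind === "agent",
);

// Index-based lookup over the FILTERED view: lands on the previous visible
// version and skips the autosave in between.
const naiveIndex = visible.findIndex((item) => item.id === "v2");
const naivePrev = visible[naiveIndex + 1]?.id ?? "";

// Id-based lookup over the FULL list: lands on the true predecessor.
const truePrev = resolvePrevSnapshotId(full, "v2");

console.log({ naivePrev, truePrev });
```

With this data the filtered-index lookup picks `v1` while the id-based lookup picks `a2`, which is the difference the F4 tests pin down.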
@@ -0,0 +1,28 @@
+/**
+ * #370 — page-version stateless wire formats. Kept in one place so the client
+ * emitter (Save hotkey / button) and the client listener (page-editor) agree
+ * with the server (PersistenceExtension) on the message shapes.
+ */
+
+/** Client → server: "save a version now". The server derives the tier
+ * (manual/agent) from the signed connection actor, never from this payload. */
+export const SAVE_VERSION_MESSAGE_TYPE = "save-version";
+
+/** Server → all clients: a version was saved (or promoted / already existed). */
+export const VERSION_SAVED_MESSAGE_TYPE = "version.saved";
+
+export interface VersionSavedMessage {
+  type: typeof VERSION_SAVED_MESSAGE_TYPE;
+  historyId: string;
+  kind: "manual" | "agent";
+  /** True when the latest snapshot was already a manual version (a no-op save). */
+  alreadySaved: boolean;
+}
+
+/**
+ * Cross-component coordination flag so only the client that pressed Save shows
+ * the confirmation toast, while every other client silently refreshes its
+ * history panel on the broadcast. A module-level ref avoids stale-closure
+ * pitfalls in the editor's long-lived stateless handler.
+ */
+export const saveVersionPending = { current: false };
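Review note: a hedged sketch of how a listener could validate an incoming stateless payload against the `VersionSavedMessage` shape before acting on it. `parseVersionSaved` is a hypothetical helper and is not part of this diff; the constant and interface are copied locally so the snippet is self-contained.

```typescript
// Local copies of the wire-format declarations from this diff.
const VERSION_SAVED_MESSAGE_TYPE = "version.saved";

interface VersionSavedMessage {
  type: typeof VERSION_SAVED_MESSAGE_TYPE;
  historyId: string;
  kind: "manual" | "agent";
  alreadySaved: boolean;
}

// Hypothetical helper (assumption, not in the diff): returns a typed message
// when the payload matches the declared shape, otherwise null.
function parseVersionSaved(payload: string): VersionSavedMessage | null {
  let raw: unknown;
  try {
    raw = JSON.parse(payload);
  } catch {
    return null; // not JSON at all
  }
  if (typeof raw !== "object" || raw === null) return null;

  const { type, historyId, kind, alreadySaved } = raw as Record<
    string,
    unknown
  >;
  if (type !== VERSION_SAVED_MESSAGE_TYPE) return null;
  if (typeof historyId !== "string") return null;
  if (kind !== "manual" && kind !== "agent") return null;
  if (typeof alreadySaved !== "boolean") return null;

  return { type: VERSION_SAVED_MESSAGE_TYPE, historyId, kind, alreadySaved };
}

const sample = parseVersionSaved(
  JSON.stringify({
    type: "version.saved",
    historyId: "h1",
    kind: "manual",
    alreadySaved: false,
  }),
);
console.log(sample);
```

Whether the real listener needs this much validation depends on how far the server-side broadcast is trusted; the sketch only shows that the shared interface is narrow enough to check field by field.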
@@ -3,6 +3,7 @@ import {
  IconArrowRight,
  IconArrowsHorizontal,
  IconClockHour4,
+  IconDeviceFloppy,
  IconDots,
  IconEye,
  IconEyeOff,
@@ -17,7 +18,7 @@ import {
  IconTrash,
  IconWifiOff,
 } from "@tabler/icons-react";
-import React, { useEffect, useRef, useState } from "react";
+import React, { useCallback, useEffect, useRef, useState } from "react";
 import { useAsideTriggerProps } from "@/hooks/use-toggle-aside.tsx";
 import { useAtom, useAtomValue } from "jotai";
 import { historyAtoms } from "@/features/page-history/atoms/history-atoms.ts";
@@ -39,9 +40,14 @@ import { Trans, useTranslation } from "react-i18next";
 import ExportModal from "@/components/common/export-modal";
 import { htmlToMarkdown } from "@docmost/editor-ext";
 import {
+  collabProviderAtom,
  pageEditorAtom,
  yjsConnectionStatusAtom,
 } from "@/features/editor/atoms/editor-atoms.ts";
+import {
+  SAVE_VERSION_MESSAGE_TYPE,
+  saveVersionPending,
+} from "@/features/page-history/version-messages.ts";
 import { formattedDate } from "@/lib/time.ts";
 import { PageEditModeToggle } from "@/features/user/components/page-state-pref.tsx";
 import MovePageModal from "@/features/page/components/move-page-modal.tsx";
@@ -72,9 +78,34 @@ export default function PageHeaderMenu({ readOnly }: PageHeaderMenuProps) {
  });
  const isDeleted = !!page?.deletedAt;
  const [workspace] = useAtom(workspaceAtom);
+  const collabProvider = useAtomValue(collabProviderAtom);
  // Community public-sharing entry point (replaces the removed EE PageShareModal)
  const workspaceSharingDisabled = workspace?.settings?.sharing?.disabled === true;

+  // #370 — explicit "save a version" (Cmd+S / Save button). One path for the
+  // human; the server derives the tier from the signed actor. Readers can't save
+  // (the button is hidden and the collab connection is read-only server-side).
+  const handleSaveVersion = useCallback(() => {
+    if (readOnly || !collabProvider) return;
+    // Flag this client as the initiator so only it shows the confirmation toast;
+    // a safety timeout clears it if no broadcast comes back (e.g. offline).
+    saveVersionPending.current = true;
+    window.setTimeout(() => {
+      saveVersionPending.current = false;
+    }, 5000);
+    collabProvider.sendStateless(
+      JSON.stringify({ type: SAVE_VERSION_MESSAGE_TYPE }),
+    );
+  }, [readOnly, collabProvider]);
+
+  // mod+S must also block the browser's "Save page" dialog; empty ignore-list +
+  // `triggerOnContentEditable` so it fires while typing in the editor/title.
+  useHotkeys(
+    [["mod+S", handleSaveVersion, { preventDefault: true }]],
+    [],
+    true,
+  );
+
  useHotkeys(
    [
      [
@@ -133,15 +164,16 @@ export default function PageHeaderMenu({ readOnly }: PageHeaderMenuProps) {
        </ActionIcon>
      </Tooltip>

-      <PageActionMenu readOnly={readOnly} />
+      <PageActionMenu readOnly={readOnly} onSaveVersion={handleSaveVersion} />
    </>
  );
 }

 interface PageActionMenuProps {
  readOnly?: boolean;
+  onSaveVersion?: () => void;
 }
-function PageActionMenu({ readOnly }: PageActionMenuProps) {
+function PageActionMenu({ readOnly, onSaveVersion }: PageActionMenuProps) {
  const { t } = useTranslation();
  const [, setHistoryModalOpen] = useAtom(historyAtoms);
  const clipboard = useClipboard({ timeout: 500 });
@@ -302,6 +334,20 @@ function PageActionMenu({ readOnly }: PageActionMenuProps) {
            </Group>
          </Menu.Item>

+          {!readOnly && (
+            <Menu.Item
+              leftSection={<IconDeviceFloppy size={16} />}
+              onClick={onSaveVersion}
+              rightSection={
+                <Text size="xs" c="dimmed">
+                  {t("Ctrl+S")}
+                </Text>
+              }
+            >
+              {t("Save version")}
+            </Menu.Item>
+          )}
+
          <Menu.Item
            leftSection={<IconHistory size={16} />}
            onClick={openHistoryModal}
@@ -1,20 +1,10 @@
-import { Suspense } from "react";
 import { Outlet } from "react-router-dom";
-import { Center, Loader } from "@mantine/core";
 import ShareShell from "@/features/share/components/share-shell.tsx";

 export default function ShareLayout() {
  return (
    <ShareShell>
-      <Suspense
-        fallback={
-          <Center h="60vh">
-            <Loader size="sm" />
-          </Center>
-        }
-      >
-        <Outlet />
-      </Suspense>
+      <Outlet />
    </ShareShell>
  );
 }
@@ -394,6 +394,10 @@ export default function AiProviderSettings() {
    useState<boolean>(
      workspace?.settings?.ai?.publicShareAssistant ?? false,
    );
+  // #184: detached/autonomous agent runs (settings.ai.autonomousRuns).
+  const [autonomousRunsEnabled, setAutonomousRunsEnabled] = useState<boolean>(
+    workspace?.settings?.ai?.autonomousRuns ?? false,
+  );
  const [chatToggleLoading, setChatToggleLoading] = useState(false);
  const [searchToggleLoading, setSearchToggleLoading] = useState(false);
  const [dictationToggleLoading, setDictationToggleLoading] = useState(false);
@@ -403,6 +407,8 @@ export default function AiProviderSettings() {
    publicShareAssistantToggleLoading,
    setPublicShareAssistantToggleLoading,
  ] = useState(false);
+  const [autonomousRunsToggleLoading, setAutonomousRunsToggleLoading] =
+    useState(false);

  // Whether a key is currently stored server-side (drives the placeholder).
  const [hasApiKey, setHasApiKey] = useState(false);
@@ -730,6 +736,37 @@ export default function AiProviderSettings() {
    }
  }

+  // Optimistic toggle for detached/autonomous agent runs
+  // (settings.ai.autonomousRuns). When on, a chat turn becomes a server-side run
+  // that survives a browser disconnect and can be reconnected to / live-followed;
+  // only an explicit Stop ends it. Off by default; single-instance-only in phase 1.
+  async function handleToggleAutonomousRuns(value: boolean) {
+    setAutonomousRunsToggleLoading(true);
+    const previous = autonomousRunsEnabled;
+    setAutonomousRunsEnabled(value);
+    try {
+      const updated = await updateWorkspace({ autonomousRuns: value });
+      setWorkspace({
+        ...updated,
+        settings: {
+          ...updated.settings,
+          ai: { ...updated.settings?.ai, autonomousRuns: value },
+        },
+      });
+      notifications.show({ message: t("Updated successfully") });
+    } catch (err) {
+      setAutonomousRunsEnabled(previous);
+      const message = (err as { response?: { data?: { message?: string } } })
+        ?.response?.data?.message;
+      notifications.show({
+        message: message ?? t("Failed to update data"),
+        color: "red",
+      });
+    } finally {
+      setAutonomousRunsToggleLoading(false);
+    }
+  }
+
  // Admins only — match the previous behavior.
  if (!isAdmin) {
    return (
@@ -960,6 +997,31 @@ export default function AiProviderSettings() {
          {...form.getInputProps("publicShareAssistantRoleId")}
        />

+        {/* Detached/autonomous agent runs: a chat turn becomes a server-side run
+            that survives a browser disconnect; only an explicit Stop ends it.
+            Single-instance-only in phase 1. */}
+        <Group justify="space-between" align="center" wrap="nowrap" mt="md">
+          <Stack gap={0}>
+            <Text fw={600} size="sm">
+              {t("Autonomous agent runs")}
+            </Text>
+            <Text size="xs" c="dimmed">
+              {t(
+                "Keep an agent turn running server-side even if the browser disconnects; reconnect and follow it on reopen. Single-instance deployments only.",
+              )}
+            </Text>
+          </Stack>
+          <Switch
+            label={t("Enabled")}
+            labelPosition="left"
+            checked={autonomousRunsEnabled}
+            disabled={autonomousRunsToggleLoading}
+            onChange={(e) =>
+              handleToggleAutonomousRuns(e.currentTarget.checked)
+            }
+          />
+        </Group>
+
        <Group mt="md" align="center">
          <Button
            variant="default"
@@ -26,6 +26,9 @@ export interface IWorkspace {
  aiDictation?: boolean;
  aiDictationStreaming?: boolean;
  aiPublicShareAssistant?: boolean;
+  // Write-only field for updateWorkspace({ autonomousRuns }). Read state lives at
+  // settings.ai.autonomousRuns.
+  autonomousRuns?: boolean;
  trashRetentionDays?: number;
  // Default lifetime (HOURS) for new temporary notes; frozen per-note at creation.
  temporaryNoteHours?: number;
@@ -65,6 +68,9 @@ export interface IWorkspaceAiSettings {
  dictation?: boolean;
  dictationStreaming?: boolean;
  publicShareAssistant?: boolean;
+  // #184: detached agent runs (a run survives a browser disconnect and can be
+  // reconnected to / live-followed on reopen). Gates the run-reconnect polling.
+  autonomousRuns?: boolean;
 }

 export interface IWorkspaceSharingSettings {
@@ -1,7 +1,7 @@
 // Source: https://github.com/mantinedev/mantine/blob/master/packages/@mantine/hooks/src/use-clipboard/use-clipboard.ts
 // polyfilled to support execCommand fallback
 import { useState } from "react";
-import { execCommandCopy } from "@/lib/copy-to-clipboard.ts";
+import { execCommandCopy } from "@docmost/editor-ext";

 export type UseClipboardOptions = {
  timeout?: number;
@@ -1,7 +1,7 @@
 import bytes from "bytes";
 import { castToBoolean } from "@/lib/utils.tsx";
 import { AvatarIconType } from "@/features/attachments/types/attachment.types.ts";
-import { sanitizeUrl } from "@/lib/sanitize-url.ts";
+import { sanitizeUrl } from "@docmost/editor-ext";

 declare global {
  interface Window {
@@ -47,6 +47,13 @@ export function isCompactPageTreeEnabled(): boolean {
  return castToBoolean(getConfigValue("COMPACT_PAGE_TREE", "true"));
 }

+// #355 — operator toggle for client perf-telemetry. DEFAULT OFF: the server
+// mirrors CLIENT_TELEMETRY_ENABLED into window.CONFIG; when off the client
+// installs no observers and sends nothing (the sink endpoint doesn't exist).
+export function isClientTelemetryEnabled(): boolean {
+  return castToBoolean(getConfigValue("CLIENT_TELEMETRY_ENABLED", "false"));
+}
+
 export function getAvatarUrl(
  avatarUrl: string,
  type: AvatarIconType = AvatarIconType.AVATAR,
@@ -1,16 +0,0 @@
-// Client-local execCommand copy fallback (previously imported from
-// @docmost/editor-ext). It lives here so the ubiquitous useClipboard / CopyButton
-// path does not pull in the editor-ext barrel — and with it the whole TipTap
-// engine — through the eager startup graph. Behavior is identical to the
-// editor-ext helper it replaces.
-export function execCommandCopy(text: string): void {
-  const textarea = document.createElement("textarea");
-  textarea.value = text;
-  textarea.style.position = "fixed";
-  textarea.style.left = "-9999px";
-  textarea.style.top = "-9999px";
-  document.body.appendChild(textarea);
-  textarea.select();
-  document.execCommand("copy");
-  document.body.removeChild(textarea);
-}
@@ -1,31 +0,0 @@
-import { describe, it, expect } from "vitest";
-import { sanitizeUrl } from "./sanitize-url";
-
-// `sanitizeUrl` is a byte-identical client-local copy of editor-ext's wrapper
-// around @braintree/sanitize-url: it maps the sanitizer's "about:blank" XSS
-// sentinel to "". These assertions mirror editor-ext's own security-contract
-// test so the extracted copy keeps the same guarantees.
-describe("sanitizeUrl", () => {
-  it("blocks dangerous schemes (returns empty string)", () => {
-    expect(sanitizeUrl("javascript:alert(1)")).toBe("");
-    expect(sanitizeUrl("data:text/html,<script>alert(1)</script>")).toBe("");
-    expect(sanitizeUrl("vbscript:msgbox(1)")).toBe("");
-    // Case / whitespace obfuscation must not slip past the sanitizer.
-    expect(sanitizeUrl("  JaVaScRiPt:alert(1)")).toBe("");
-  });
-
-  it("returns empty string for empty / undefined input", () => {
-    expect(sanitizeUrl(undefined)).toBe("");
-    expect(sanitizeUrl("")).toBe("");
-  });
-
-  it("allows safe https, relative file and mailto URLs", () => {
-    expect(sanitizeUrl("https://example.com/page")).toMatch(
-      /^https:\/\/example\.com\/page/,
-    );
-    expect(sanitizeUrl("/api/files/abc-123")).toBe("/api/files/abc-123");
-    expect(sanitizeUrl("mailto:user@example.com")).toBe(
-      "mailto:user@example.com",
-    );
-  });
-});
@@ -1,15 +0,0 @@
-import { sanitizeUrl as braintreeSanitizeUrl } from "@braintree/sanitize-url";
-
-// Client-local copy of editor-ext's sanitizeUrl wrapper. Importing it from the
-// editor-ext barrel dragged the whole TipTap engine into the eager startup graph
-// via the app-wide config module (getFileUrl). This keeps the exact same
-// behavior (braintree sanitize + normalize "about:blank" -> "") without that
-// dependency.
-export function sanitizeUrl(url: string | undefined): string {
-  if (!url) return "";
-
-  const sanitized = braintreeSanitizeUrl(url);
-
-  // Return an empty string instead of "about:blank".
-  return sanitized === "about:blank" ? "" : sanitized;
-}
@@ -0,0 +1,35 @@
+import { describe, it, expect } from "vitest";
+import { templateRoute } from "./route-template";
+
+describe("templateRoute", () => {
+  it("templates a space page path (never leaks slugs)", () => {
+    const t = templateRoute("/s/engineering/p/design-doc-abc123");
+    expect(t).toBe("/s/:space/p/:slug");
+    expect(t).not.toContain("engineering");
+    expect(t).not.toContain("design-doc");
+  });
+
+  it("templates share, redirect and space paths", () => {
+    expect(templateRoute("/share/abc/p/xyz")).toBe("/share/:shareId/p/:slug");
+    expect(templateRoute("/share/p/xyz")).toBe("/share/p/:slug");
+    expect(templateRoute("/p/some-slug")).toBe("/p/:slug");
+    expect(templateRoute("/s/team")).toBe("/s/:space");
+    expect(templateRoute("/s/team/trash")).toBe("/s/:space/trash");
+    expect(templateRoute("/labels/urgent")).toBe("/labels/:label");
+  });
+
+  it("keeps known static routes verbatim", () => {
+    expect(templateRoute("/home")).toBe("/home");
+    expect(templateRoute("/settings/members")).toBe("/settings/members");
+    expect(templateRoute("/")).toBe("/");
+  });
+
+  it("normalises a trailing slash", () => {
+    expect(templateRoute("/s/team/p/slug/")).toBe("/s/:space/p/:slug");
+  });
+
+  it("collapses unknown paths to 'other' (bounded cardinality)", () => {
+    expect(templateRoute("/weird/unknown/thing")).toBe("other");
+    expect(templateRoute("/s/team/p/slug/extra/segments")).toBe("other");
+  });
+});
@@ -0,0 +1,70 @@
+/**
+ * Map a raw pathname to a BOUNDED route TEMPLATE (#355).
+ *
+ * Perf metrics must be labelled by route template only — never a raw path with
+ * slugs/ids — so the server-side `route` column and any downstream aggregation
+ * stay low-cardinality and carry NO page slugs/titles (privacy). Anything that
+ * does not match a known pattern collapses to `other`.
+ *
+ * The template vocabulary mirrors the issue's example (`/s/:space/p/:slug`).
+ */
+const ROUTE_PATTERNS: { re: RegExp; template: string }[] = [
+  // Share pages (public).
+  { re: /^\/share\/[^/]+\/p\/[^/]+$/, template: '/share/:shareId/p/:slug' },
+  { re: /^\/share\/p\/[^/]+$/, template: '/share/p/:slug' },
+  { re: /^\/share\/[^/]+$/, template: '/share/:shareId' },
+  // Page redirect.
+  { re: /^\/p\/[^/]+$/, template: '/p/:slug' },
+  // Space + page.
+  { re: /^\/s\/[^/]+\/p\/[^/]+$/, template: '/s/:space/p/:slug' },
+  { re: /^\/s\/[^/]+\/trash$/, template: '/s/:space/trash' },
+  { re: /^\/s\/[^/]+$/, template: '/s/:space' },
+  // Misc dynamic.
+  { re: /^\/labels\/[^/]+$/, template: '/labels/:label' },
+  { re: /^\/invites\/[^/]+$/, template: '/invites/:invitationId' },
+  { re: /^\/settings\/groups\/[^/]+$/, template: '/settings/groups/:groupId' },
+];
+
+// Static routes we accept verbatim (finite set).
+const STATIC_ROUTES = new Set<string>([
+  '/home',
+  '/spaces',
+  '/favorites',
+  '/login',
+  '/forgot-password',
+  '/password-reset',
+  '/setup/register',
+  '/settings/account/profile',
+  '/settings/account/preferences',
+  '/settings/workspace',
+  '/settings/ai',
+  '/settings/members',
+  '/settings/groups',
+  '/settings/spaces',
+  '/settings/sharing',
+]);
+
+export function templateRoute(pathname: string): string {
+  // Normalise a trailing slash (except root).
+  const path =
+    pathname.length > 1 && pathname.endsWith('/')
+      ? pathname.slice(0, -1)
+      : pathname;
+
+  if (path === '' || path === '/') return '/';
+  if (STATIC_ROUTES.has(path)) return path;
+
+  for (const { re, template } of ROUTE_PATTERNS) {
+    if (re.test(path)) return template;
+  }
+  return 'other';
+}
+
+/** Template for the current window location. */
+export function currentRouteTemplate(): string {
+  try {
+    return templateRoute(window.location.pathname);
+  } catch {
+    return 'other';
+  }
+}
@@ -0,0 +1,290 @@
+import {
+  onCLS,
+  onINP,
+  onLCP,
+  onTTFB,
+  type CLSMetricWithAttribution,
+  type INPMetricWithAttribution,
+  type LCPMetricWithAttribution,
+  type TTFBMetricWithAttribution,
+} from "web-vitals/attribution";
+import { isClientTelemetryEnabled } from "@/lib/config";
+import { currentRouteTemplate } from "./route-template";
+
+/**
+ * Client perf-telemetry (#355): web-vitals + custom metrics buffered and posted
+ * to POST /api/telemetry/vitals via sendBeacon.
+ *
+ * Design constraints from the issue:
+ *  - Sampling is decided ONCE per session (25%), cached in sessionStorage,
+ *    BEFORE any observer is subscribed. Non-sampled sessions send nothing.
+ *  - Route labels are TEMPLATES only; attr is truncated to 120 chars; no page
+ *    titles/slugs/text ever leave the browser.
+ *  - Observers are passive and reporting is best-effort — telemetry must not
+ *    degrade the perf it measures.
+ */
+
+const ENDPOINT = "/api/telemetry/vitals";
+const SAMPLE_RATE = 0.25;
+const SAMPLE_KEY = "gm_vitals_sampled";
+const FLUSH_INTERVAL_MS = 15_000;
+const MAX_BUFFER = 40; // flush early if the buffer fills between timers
+const MAX_ATTR_LENGTH = 120;
+const EDITOR_TX_MIN_MS = 8; // only report editor transactions slower than this
+
+const ALLOWED_NAMES = new Set([
+  "INP",
+  "LCP",
+  "CLS",
+  "TTFB",
+  "editor_tx_ms",
+  "page_open_ms",
+  "longtask_ms",
+]);
+
+interface VitalEvent {
+  name: string;
+  value: number;
+  rating?: string;
+  route?: string;
+  attr?: string;
+  docSize?: number;
+}
+
+let sampledCache: boolean | null = null;
+let initialised = false;
+let buffer: VitalEvent[] = [];
+let longtaskSum = 0; // accumulated longtask duration (ms) for the current window
+
+/**
+ * Decide once per session whether this session is sampled. Cached in
+ * sessionStorage so the choice is stable across reloads within the session and
+ * identical for every observer/custom-metric caller.
+ */
+export function isVitalsSampled(): boolean {
+  if (sampledCache !== null) return sampledCache;
+  try {
+    const stored = sessionStorage.getItem(SAMPLE_KEY);
+    if (stored === "1") return (sampledCache = true);
+    if (stored === "0") return (sampledCache = false);
+    const sampled = Math.random() < SAMPLE_RATE;
+    sessionStorage.setItem(SAMPLE_KEY, sampled ? "1" : "0");
+    return (sampledCache = sampled);
+  } catch {
+    // sessionStorage unavailable (private mode / SSR): default to not sampled.
+    return (sampledCache = false);
+  }
+}
+
+/**
+ * True only when telemetry is BOTH enabled by the operator (F1 flag) AND this
+ * session is sampled. Callers outside initVitals (e.g. the editor dispatch
+ * wrapper) use this to skip ALL instrumentation cost on disabled/non-sampled
+ * sessions — no observers, no per-transaction timing.
+ */
+export function isVitalsActive(): boolean {
+  return isClientTelemetryEnabled() && isVitalsSampled();
+}
+
+function truncateAttr(value: unknown): string | undefined {
+  if (typeof value !== "string" || value.length === 0) return undefined;
+  return value.slice(0, MAX_ATTR_LENGTH);
+}
+
+function enqueue(event: VitalEvent): void {
+  if (!ALLOWED_NAMES.has(event.name)) return;
+  if (!Number.isFinite(event.value)) return;
+  buffer.push(event);
+  if (buffer.length >= MAX_BUFFER) flush();
+}
+
+function flush(): void {
+  // Fold any pending longtask total into the batch first.
+  if (longtaskSum > 0) {
+    buffer.push({
+      name: "longtask_ms",
+      value: Math.round(longtaskSum),
+      route: currentRouteTemplate(),
+    });
+    longtaskSum = 0;
+  }
+  if (buffer.length === 0) return;
+
+  const payload = JSON.stringify({ events: buffer });
+  buffer = [];
+
+  try {
+    const blob = new Blob([payload], { type: "application/json" });
+    if (navigator.sendBeacon && navigator.sendBeacon(ENDPOINT, blob)) return;
+    // Fallback for browsers without sendBeacon: keepalive fetch.
+    void fetch(ENDPOINT, {
+      method: "POST",
+      body: payload,
+      headers: { "Content-Type": "application/json" },
+      keepalive: true,
+    }).catch(() => undefined);
+  } catch {
+    // Best-effort: never throw out of telemetry.
+  }
+}
+
+/**
+ * Report a custom client metric (editor_tx_ms, page_open_ms). No-op unless
+ * isVitalsActive() (enabled AND sampled). Route is always the current TEMPLATE.
+ */
+export function reportClientMetric(
+  name: "editor_tx_ms" | "page_open_ms",
+  value: number,
+  extra?: { docSize?: number },
+): void {
+  if (!isVitalsActive()) return;
+  if (!Number.isFinite(value)) return;
+  enqueue({
+    name,
+    value,
+    route: currentRouteTemplate(),
+    docSize: extra?.docSize,
+  });
+}
+
+/** Threshold-gated editor transaction reporter (slow transactions only). */
+export function reportEditorTx(ms: number, docSize: number): void {
+  if (ms <= EDITOR_TX_MIN_MS) return;
+  reportClientMetric("editor_tx_ms", ms, { docSize });
+}
+
+const PAGE_OPEN_MARK = "gm_page_open_start";
+
+/** Mark the start of a page-open interaction (tree-row / link click). */
+export function markPageOpenStart(): void {
+  try {
+    performance.clearMarks(PAGE_OPEN_MARK);
+    performance.mark(PAGE_OPEN_MARK);
+  } catch {
+    // ignore
+  }
+}
+
+/**
+ * Measure page_open_ms at first editor-content render, if a start mark exists.
+ * Consumes the mark so a later render doesn't double-count.
+ */
+export function measurePageOpen(): void {
+  try {
+    const marks = performance.getEntriesByName(PAGE_OPEN_MARK, "mark");
+    if (marks.length === 0) return;
+    const started = marks[0].startTime;
+    const elapsed = performance.now() - started;
+    performance.clearMarks(PAGE_OPEN_MARK);
+    if (elapsed > 0 && Number.isFinite(elapsed)) {
+      reportClientMetric("page_open_ms", elapsed);
+    }
+  } catch {
+    // ignore
+  }
+}
+
+function attrTarget(
+  metric:
+    | INPMetricWithAttribution
+    | LCPMetricWithAttribution
+    | CLSMetricWithAttribution,
+): string | undefined {
+  const a = metric.attribution as Record<string, unknown> | undefined;
+  if (!a) return undefined;
+  // Different vitals expose their culprit element under different keys; only a
+  // CSS-selector-ish target string is taken (no text content / titles).
+  return (
+    truncateAttr(a.interactionTarget) ??
+    truncateAttr(a.element) ??
+    truncateAttr(a.largestShiftTarget) ??
+    undefined
+  );
+}
+
+/**
+ * Initialise client telemetry. Safe to call multiple times (idempotent). Returns
+ * immediately without subscribing when telemetry is disabled or the session is
+ * not sampled, so such a session subscribes to NO observers and sends nothing.
+ */
+export function initVitals(): void {
+  if (initialised) return;
+  initialised = true;
+
+  // Operator flag gate (F1, default OFF): when telemetry is disabled the sink
+  // endpoint does not even exist server-side, so install ZERO observers.
+  if (!isClientTelemetryEnabled()) return;
+
+  // Sampling gate is evaluated BEFORE any observer subscription.
+  if (!isVitalsSampled()) return;
+
+  const report = (
+    metric:
+      | INPMetricWithAttribution
+      | LCPMetricWithAttribution
+      | CLSMetricWithAttribution
+      | TTFBMetricWithAttribution,
+  ) => {
+    enqueue({
+      name: metric.name,
+      value: metric.value,
+      rating: metric.rating,
+      route: currentRouteTemplate(),
+      attr:
+        metric.name === "TTFB"
+          ? undefined
+          : attrTarget(
+              metric as
+                | INPMetricWithAttribution
+                | LCPMetricWithAttribution
+                | CLSMetricWithAttribution,
+            ),
+    });
+  };
+
+  onINP(report);
+  onLCP(report);
+  onCLS(report);
+  onTTFB(report);
+
+  // Long tasks: sum the full long-task durations per flush window (a passive
+  // observer; individual entries are summed, never stored/sent individually).
+  try {
+    if (typeof PerformanceObserver !== "undefined") {
+      const observer = new PerformanceObserver((list) => {
+        for (const entry of list.getEntries()) {
+          longtaskSum += entry.duration;
+        }
+      });
+      observer.observe({ type: "longtask", buffered: true });
+    }
+  } catch {
+    // longtask entry type unsupported: skip silently.
+  }
+
+  // page_open_ms start: mark when the user clicks a page link/tree-row (any
+  // anchor navigating to a page URL). Passive capture listener; the matching
+  // measure fires at first editor-content render (measurePageOpen). No page
+  // titles/slugs are read — only the click timing is marked.
+  document.addEventListener(
+    "click",
+    (event) => {
+      const target = event.target as Element | null;
+      const anchor = target?.closest?.("a[href]") as HTMLAnchorElement | null;
+      if (!anchor) return;
+      const href = anchor.getAttribute("href") ?? "";
+      // A page link is `/s/:space/p/:slug`, `/p/:slug` or a share page path.
+      if (/\/p\//.test(href)) markPageOpenStart();
+    },
+    { capture: true, passive: true },
+  );
+
+  // Flush on tab hide (most reliable delivery point) and periodically.
+  const onHidden = () => {
+    if (document.visibilityState === "hidden") flush();
+  };
+  document.addEventListener("visibilitychange", onHidden);
+  window.addEventListener("pagehide", flush);
+
+  setInterval(flush, FLUSH_INTERVAL_MS);
+}
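Reviewer note: the doc comment on isVitalsActive mentions an editor dispatch wrapper that skips all timing cost on disabled or non-sampled sessions, but that wrapper is not part of this hunk. A minimal sketch of how such a caller might look follows. `timedDispatch`, the injected `now` clock and the local stand-ins for `isVitalsActive` / `reportEditorTx` are hypothetical, written only so the example runs on its own; the real wrapper may differ.

```typescript
// Hypothetical stand-ins for the vitals.ts exports, so the sketch is runnable alone.
const EDITOR_TX_MIN_MS = 8;
const reported: Array<{ ms: number; docSize: number }> = [];
let active = true;

const isVitalsActive = (): boolean => active;

function reportEditorTx(ms: number, docSize: number): void {
  // Same threshold gate as in the diff: only slow transactions are recorded.
  if (ms <= EDITOR_TX_MIN_MS) return;
  reported.push({ ms, docSize });
}

// Hypothetical wrapper: measures a dispatch only when telemetry is active.
// The clock is injected for testability; real code would likely use
// performance.now() (assumption).
function timedDispatch(
  dispatch: () => void,
  docSize: number,
  now: () => number = () => Date.now(),
): void {
  if (!isVitalsActive()) {
    // Inactive session: no clock reads, no reporting.
    dispatch();
    return;
  }
  const start = now();
  dispatch();
  reportEditorTx(now() - start, docSize);
}
```

The early return keeps the inactive path down to one boolean check, which matches the stated constraint that telemetry must not degrade the performance it measures.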
@@ -13,14 +13,16 @@ import { ModalsProvider } from "@mantine/modals";
 import { Notifications } from "@mantine/notifications";
 import { QueryClient, QueryClientProvider } from "@tanstack/react-query";
 import { HelmetProvider } from "react-helmet-async";
-import { ChunkLoadErrorBoundary } from "@/components/chunk-load-error-boundary.tsx";
 import "./i18n";
+import { PostHogProvider } from "posthog-js/react";
 import {
  getPostHogHost,
  getPostHogKey,
  isCloud,
  isPostHogEnabled,
 } from "@/lib/config.ts";
+import posthog from "posthog-js";
+import { initVitals } from "@/lib/telemetry/vitals";

 export const queryClient = new QueryClient({
  defaultOptions: {
@@ -33,65 +35,35 @@ export const queryClient = new QueryClient({
  },
 });

+if (isCloud() && isPostHogEnabled) {
+  posthog.init(getPostHogKey(), {
+    api_host: getPostHogHost(),
+    defaults: "2025-05-24",
+    disable_session_recording: true,
+    capture_pageleave: false,
+  });
+}
+
+// #355 — client perf-telemetry. Decides sampling ONCE (25%/session) before
+// subscribing to any observer; non-sampled sessions send nothing.
+initVitals();
+
 const container = document.getElementById("root") as HTMLElement;
 const root = (container as any).__reactRoot ??= ReactDOM.createRoot(container);

-function renderApp() {
-  root.render(
-    <BrowserRouter>
-      <MantineProvider theme={theme} cssVariablesResolver={mantineCssResolver}>
-        <ModalsProvider>
-          <QueryClientProvider client={queryClient}>
-            <Notifications position="bottom-center" limit={3} zIndex={10000} />
-            <HelmetProvider>
-              {/* Root boundary above every lazy route's Suspense: a stale-chunk
-                  404 after a deploy is caught and recovered here instead of
-                  blanking the whole app. */}
-              <ChunkLoadErrorBoundary>
-                <App />
-              </ChunkLoadErrorBoundary>
-            </HelmetProvider>
-          </QueryClientProvider>
-        </ModalsProvider>
-      </MantineProvider>
-    </BrowserRouter>,
-  );
-}
-
-async function initAnalytics() {
-  // posthog-js is only pulled in for cloud deployments with analytics enabled, so
-  // self-hosted builds never download it. The gate is kept identical to the
-  // previous eager code so cloud analytics behavior is unchanged; the import is
-  // simply deferred behind it.
-  //
-  // Crucially this runs AFTER the immediate first render below, so first paint is
-  // never gated on the analytics chunk. Any failure (network, stale 404, or an
-  // ad-blocker blocking a chunk named "posthog") is swallowed so the user keeps a
-  // working app without analytics instead of a permanently blank page.
-  //
-  // NOTE: we init the posthog SINGLETON only and do NOT wrap the tree in
-  // <PostHogProvider>. The app has zero consumers of the PostHog React context
-  // (no usePostHog / useFeatureFlag* / PostHogFeature), and PostHogProvider given
-  // an already-initialized `client` is a no-op — all capture goes through the
-  // singleton. Re-rendering to attach the provider would only REMOUNT the whole
-  // App (running every mount effect twice and dropping local state / focus /
-  // in-progress input on cloud cold-load) for no functional gain.
-  if (!(isCloud() && isPostHogEnabled)) return;
-  try {
-    const { default: posthog } = await import("posthog-js");
-    posthog.init(getPostHogKey(), {
-      api_host: getPostHogHost(),
-      defaults: "2025-05-24",
-      disable_session_recording: true,
-      capture_pageleave: false,
-    });
-  } catch {
-    // Analytics failed to load — degrade gracefully; the app already rendered.
-  }
-}
-
-// Paint immediately for everyone (self-hosted stays exactly as instant as before,
-// cloud no longer blocks on the analytics import). The posthog singleton is
-// initialized after, without re-rendering the tree.
-renderApp();
-void initAnalytics();
+root.render(
+  <BrowserRouter>
+    <MantineProvider theme={theme} cssVariablesResolver={mantineCssResolver}>
+      <ModalsProvider>
+        <QueryClientProvider client={queryClient}>
+          <Notifications position="bottom-center" limit={3} zIndex={10000} />
+          <HelmetProvider>
+            <PostHogProvider client={posthog}>
+              <App />
+            </PostHogProvider>
+          </HelmetProvider>
+        </QueryClientProvider>
+      </ModalsProvider>
+    </MantineProvider>
+  </BrowserRouter>,
+);
@@ -63,20 +63,6 @@ export default defineConfig(({ mode }) => {
                name: "vendor-mantine",
                test: /[\\/]node_modules[\\/]@mantine[\\/]/,
              },
-              // NOTE: TipTap/ProseMirror/Yjs are intentionally NOT force-grouped
-              // into a single vendor chunk. Doing so backfires: rolldown co-locates
-              // a small module shared with the (eager) react-i18next runtime into
-              // that group chunk, which then drags the whole ~590KB editor engine
-              // into the eager modulepreload graph. Left to the default splitting,
-              // the editor engine stays in lazily-loaded chunks pulled only by the
-              // route-split editor/share pages. KaTeX is safe to group (nothing
-              // eager references it).
-              // KaTeX in its own stable chunk; loaded on demand by the lazy math
-              // node views (never in the startup path).
-              {
-                name: "vendor-katex",
-                test: /[\\/]node_modules[\\/]katex[\\/]/,
-              },
            ],
          },
        },
@@ -111,6 +111,7 @@
    "pino-pretty": "^13.1.3",
    "postgres": "^3.4.8",
    "postmark": "^4.0.7",
+    "prom-client": "^15.1.3",
    "react": "^18.3.1",
    "react-email": "6.0.8",
    "reflect-metadata": "^0.2.2",
@@ -31,6 +31,8 @@ import { McpModule } from './integrations/mcp/mcp.module';
 import { SandboxModule } from './integrations/sandbox/sandbox.module';
 import { AiModule } from './integrations/ai/ai.module';
 import { AiChatModule } from './core/ai-chat/ai-chat.module';
+import { MetricsModule } from './integrations/metrics/metrics.module';
+import { ClientTelemetryModule } from './core/telemetry/client-telemetry.module';

 const enterpriseModules = [];
 try {
@@ -93,6 +95,10 @@ try {
    SandboxModule,
    AiModule,
    AiChatModule,
+    MetricsModule,
+    // Gated OFF by default: only registers the public vitals sink controller
+    // when CLIENT_TELEMETRY_ENABLED=true (maintainer decision E1=B).
+    ClientTelemetryModule.register(),
    ...enterpriseModules,
  ],
  controllers: [AppController],
@@ -1,3 +1,29 @@
-export const HISTORY_INTERVAL = 5 * 60 * 1000;
-export const HISTORY_FAST_INTERVAL = 60 * 1000;
-export const HISTORY_FAST_THRESHOLD = 5 * 60 * 1000;
+/**
+ * #370 — page-history intentionality tiers. Domain of `page_history.kind`.
+ *   - 'manual' / 'agent'  → Tier 1 versions (intentional points)
+ *   - 'idle' / 'boundary' → Tier 0 autosnapshots (safety net)
+ * A legacy `null` kind is treated as an autosave.
+ */
+export type PageHistoryKind = 'manual' | 'agent' | 'idle' | 'boundary';
+
+/**
+ * #370 — trailing idle-flush windows. A page's pending idle snapshot is
+ * re-armed on every store and fires this long after edits go quiet, so a burst
+ * of edits collapses into a single autosnapshot instead of one-per-store. Human
+ * sessions are noisier and less risky, so they flush less often than the agent.
+ */
+export const IDLE_INTERVAL_USER = 60 * 60 * 1000; // 60m
+export const IDLE_INTERVAL_AGENT = 15 * 60 * 1000; // 15m
+
+/**
+ * #370 — max-wait ceiling for the idle flush. Pure trailing debounce starves the
+ * safety net: hocuspocus stores at least every ~45s, so a CONTINUOUS editing
+ * session would re-arm the trailing timer forever and never take an idle
+ * snapshot until edits finally go quiet (up to IDLE_INTERVAL_USER = 60m). This
+ * ceiling bounds the actual wait from the FIRST edit of a burst, so an idle
+ * snapshot fires at least this often during a long unbroken session — restoring
+ * a recovery point cadence closer to the old heuristic without one-per-store
+ * noise. Mirrors hocuspocus's own maxDebounce idea.
+ */
+export const IDLE_MAX_WAIT_USER = 10 * 60 * 1000; // 10m
+export const IDLE_MAX_WAIT_AGENT = 5 * 60 * 1000; // 5m
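Reviewer note: the interaction between the trailing window and the max-wait ceiling can be sketched as a clamp. This is an illustration consistent with the constants above and with the expectations in the computeHistoryJob spec, not the actual implementation. `sketchIdleDelay` is a hypothetical name, and the parameter order (source, burstStart, now) is an assumption modelled on the spec calls.

```typescript
// Values copied from the constants in this diff.
const IDLE_INTERVAL_USER = 60 * 60 * 1000; // 60m
const IDLE_INTERVAL_AGENT = 15 * 60 * 1000; // 15m
const IDLE_MAX_WAIT_USER = 10 * 60 * 1000; // 10m
const IDLE_MAX_WAIT_AGENT = 5 * 60 * 1000; // 5m

// Hypothetical helper, NOT the real computeHistoryJob.
function sketchIdleDelay(
  source: 'user' | 'agent',
  burstStart?: number,
  now: number = Date.now(),
): number {
  const interval =
    source === 'agent' ? IDLE_INTERVAL_AGENT : IDLE_INTERVAL_USER;
  // No burst start recorded: plain trailing debounce, no ceiling.
  if (burstStart === undefined) return interval;

  const maxWait =
    source === 'agent' ? IDLE_MAX_WAIT_AGENT : IDLE_MAX_WAIT_USER;
  // Budget left before the ceiling, measured from the first edit of the burst.
  const remaining = Math.max(0, maxWait - (now - burstStart));
  // Never wait past the ceiling, and never return a negative delay.
  return Math.min(interval, remaining);
}
```

With these values the ceiling is always shorter than the interval, so once a burst start is known the remaining budget determines the delay.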
@@ -1,84 +1,93 @@
+import { computeHistoryJob, resolveSource } from './persistence.extension';
 import {
-  computeHistoryJob,
-  resolveSource,
-} from './persistence.extension';
-import {
-  HISTORY_FAST_INTERVAL,
-  HISTORY_FAST_THRESHOLD,
-  HISTORY_INTERVAL,
+  IDLE_INTERVAL_AGENT,
+  IDLE_INTERVAL_USER,
+  IDLE_MAX_WAIT_AGENT,
+  IDLE_MAX_WAIT_USER,
 } from '../constants';

-// A fixed clock + fixed createdAt make pageAge deterministic.
-const NOW = 1_700_000_000_000;
 const PAGE_ID = '550e8400-e29b-41d4-a716-446655440000';

-// Build a minimal page whose age (NOW - createdAt) is exactly `ageMs`.
-const pageAged = (ageMs: number) => ({
-  id: PAGE_ID,
-  createdAt: new Date(NOW - ageMs),
-});
+const page = { id: PAGE_ID };

-describe('computeHistoryJob', () => {
-  it('agent edit → delay MUST be 0 and job id is source-keyed', () => {
-    // INVARIANT (§15 H2 / persistence.extension): the agent delay MUST stay 0.
-    // The worker re-reads the page row at run time, so any non-zero delay risks
-    // snapshotting content a later human edit has already overwritten. This is
-    // the load-bearing assertion of this spec — do not relax it.
-    const { jobId, delay } = computeHistoryJob(pageAged(0), 'agent', NOW);
-    expect(delay).toBe(0);
-    expect(jobId).toBe(`${PAGE_ID}-agent`);
-  });
-
-  it('agent edit on an OLD page is still delay 0 (age never applies to agents)', () => {
-    // Even when the page is far older than the fast threshold, the agent path
-    // must short-circuit to 0 — age-based debounce is a human-only concern.
-    const { jobId, delay } = computeHistoryJob(
-      pageAged(HISTORY_FAST_THRESHOLD + 60_000),
-      'agent',
-      NOW,
-    );
-    expect(delay).toBe(0);
-    expect(jobId).toBe(`${PAGE_ID}-agent`);
-  });
-
-  it('human edit on a YOUNG page (age < threshold) → fast interval, bare job id', () => {
-    const { jobId, delay } = computeHistoryJob(
-      pageAged(HISTORY_FAST_THRESHOLD - 1),
-      'user',
-      NOW,
-    );
-    expect(delay).toBe(HISTORY_FAST_INTERVAL);
+describe('computeHistoryJob (#370 — shared trailing idle pipeline)', () => {
+  it('human edit → user idle window, bare page.id job', () => {
+    // Humans and the agent now share ONE idle job per page (jobId = page.id).
+    // The agent's old delay=0 fast path is GONE — intentional agent points now
+    // arrive via the explicit save-version signal, not a zero-delay snapshot.
+    const { jobId, delay } = computeHistoryJob(page, 'user');
+    expect(delay).toBe(IDLE_INTERVAL_USER);
    expect(jobId).toBe(PAGE_ID);
  });

-  it('human edit on an OLD page (age > threshold) → standard interval', () => {
-    const { jobId, delay } = computeHistoryJob(
-      pageAged(HISTORY_FAST_THRESHOLD + 1),
-      'user',
-      NOW,
-    );
-    expect(delay).toBe(HISTORY_INTERVAL);
+  it('agent edit → agent idle window (shorter), still the bare page.id job', () => {
+    const { jobId, delay } = computeHistoryJob(page, 'agent');
+    expect(delay).toBe(IDLE_INTERVAL_AGENT);
+    // No `-agent` suffix anymore: the agent joins the common idle pipeline.
    expect(jobId).toBe(PAGE_ID);
  });

-  it('boundary: pageAge EXACTLY === threshold takes the slow branch (the `<` is strict)', () => {
-    // Off-by-one guard: the condition is `pageAge < HISTORY_FAST_THRESHOLD`, so
-    // an age of exactly the threshold is NOT "fast" — it must use HISTORY_INTERVAL.
-    const { delay } = computeHistoryJob(
-      pageAged(HISTORY_FAST_THRESHOLD),
-      'user',
-      NOW,
-    );
-    expect(delay).toBe(HISTORY_INTERVAL);
+  it('agent flushes sooner than a human', () => {
+    expect(IDLE_INTERVAL_AGENT).toBeLessThan(IDLE_INTERVAL_USER);
  });

-  it('treats any non-"agent" source string as human', () => {
-    // resolveSource only ever yields 'agent' | 'user', but guard the contract:
-    // the agent branch keys strictly on === 'agent'.
-    const { jobId, delay } = computeHistoryJob(pageAged(0), 'user', NOW);
-    expect(delay).toBe(HISTORY_FAST_INTERVAL);
+  it('treats any non-"agent" source string as human (keys strictly on === agent)', () => {
+    const { jobId, delay } = computeHistoryJob(page, 'user');
+    expect(delay).toBe(IDLE_INTERVAL_USER);
    expect(jobId).toBe(PAGE_ID);
  });
+
+  // #370 review round-1 WARNING: the max-wait ceiling prevents autosnapshot
+  // starvation during a continuous editing session (the trailing timer would
+  // otherwise re-arm forever and never fire).
+  describe('max-wait ceiling', () => {
+    const T0 = 1_000_000; // arbitrary fixed epoch for deterministic tests
+
+    it('once a burst is armed, delay clamps to the remaining max-wait budget', () => {
+      // 1 minute into the burst the USER interval (60m) far exceeds the remaining
+      // max-wait budget (10m - 1m = 9m), so the delay is clamped DOWN to that
+      // remaining budget — the full interval is NOT used once a ceiling applies.
+      const { delay } = computeHistoryJob(page, 'user', T0, T0 + 60_000);
+      expect(delay).toBe(IDLE_MAX_WAIT_USER - 60_000);
+    });
+
+    it('never waits longer than the max-wait budget from the burst start', () => {
+      // A store arriving right at the ceiling → delay 0 (fire promptly).
+      const { delay } = computeHistoryJob(
+        page,
+        'user',
+        T0,
+        T0 + IDLE_MAX_WAIT_USER,
+      );
+      expect(delay).toBe(0);
+    });
+
+    it('past the ceiling never returns a negative delay', () => {
+      const { delay } = computeHistoryJob(
+        page,
+        'user',
+        T0,
+        T0 + IDLE_MAX_WAIT_USER + 5 * 60_000,
+      );
+      expect(delay).toBe(0);
+    });
+
+    it('the agent ceiling is shorter than the user ceiling', () => {
+      expect(IDLE_MAX_WAIT_AGENT).toBeLessThan(IDLE_MAX_WAIT_USER);
+      const { delay } = computeHistoryJob(
+        page,
+        'agent',
+        T0,
+        T0 + IDLE_MAX_WAIT_AGENT,
+      );
+      expect(delay).toBe(0);
+    });
+
+    it('without a burstStart there is no ceiling (backward-compatible)', () => {
+      expect(computeHistoryJob(page, 'user').delay).toBe(IDLE_INTERVAL_USER);
+      expect(computeHistoryJob(page, 'agent').delay).toBe(IDLE_INTERVAL_AGENT);
+    });
+  });
 });

 describe('resolveSource (truth table)', () => {
@@ -40,11 +40,12 @@ describe('PersistenceExtension.onStoreDocument — Approach-A boundary snapshot'
  let pageHistoryRepo: {
    saveHistory: jest.Mock;
    findPageLastHistory: jest.Mock;
+    updateHistoryKind: jest.Mock;
  };
  let aiQueue: { add: jest.Mock };
-  let historyQueue: { add: jest.Mock };
+  let historyQueue: { add: jest.Mock; remove: jest.Mock };
  let notificationQueue: { add: jest.Mock };
-  let collabHistory: { addContributors: jest.Mock };
+  let collabHistory: { addContributors: jest.Mock; popContributors: jest.Mock };
  let transclusionService: {
    syncPageTransclusions: jest.Mock;
    syncPageReferences: jest.Mock;
@@ -93,13 +94,22 @@ describe('PersistenceExtension.onStoreDocument — Approach-A boundary snapshot'
    pageHistoryRepo = {
      saveHistory: jest.fn().mockImplementation(async () => {
        callOrder.push('saveHistory');
+        return { id: 'history-1' };
      }),
      findPageLastHistory: jest.fn().mockResolvedValue(null),
+      updateHistoryKind: jest.fn().mockResolvedValue(undefined),
    };
    aiQueue = { add: jest.fn().mockResolvedValue(undefined) };
-    historyQueue = { add: jest.fn().mockResolvedValue(undefined) };
+    historyQueue = {
+      add: jest.fn().mockResolvedValue(undefined),
+      // #370 — enqueuePageHistory now removes any pending idle job before re-adding.
+      remove: jest.fn().mockResolvedValue(undefined),
+    };
    notificationQueue = { add: jest.fn().mockResolvedValue(undefined) };
-    collabHistory = { addContributors: jest.fn().mockResolvedValue(undefined) };
+    collabHistory = {
+      addContributors: jest.fn().mockResolvedValue(undefined),
+      popContributors: jest.fn().mockResolvedValue([]),
+    };
    transclusionService = {
      syncPageTransclusions: jest.fn().mockResolvedValue(undefined),
      syncPageReferences: jest.fn().mockResolvedValue(undefined),
@@ -165,6 +175,50 @@ describe('PersistenceExtension.onStoreDocument — Approach-A boundary snapshot'
    expect(pageRepo.updatePage.mock.calls[0][0].lastUpdatedSource).toBe('user');
  });

+  // #370 review round-1 SUGGESTION: the boundary was GENERALIZED from a
+  // user→agent special-case to ANY lastUpdatedSource transition. These pin the
+  // generalized behaviour it was rebuilt for.
+  describe('generalized boundary — any source transition', () => {
+    // Same persisted page but with an explicit prior source.
+    const pageWithPriorSource = (prior: string | null) => ({
+      ...persistedHumanPage('NEW CONTENT'),
+      lastUpdatedSource: prior,
+    });
+
+    it('agent→user transition fires the boundary (pins the prior agent revision)', async () => {
+      const document = ydocFor(doc('NEW CONTENT'));
+      pageRepo.findById.mockResolvedValue(pageWithPriorSource('agent'));
+      pageHistoryRepo.findPageLastHistory.mockResolvedValue(null);
+
+      await ext.onStoreDocument(buildData(document, 'user') as any);
+
+      expect(pageHistoryRepo.saveHistory).toHaveBeenCalledTimes(1);
+      expect(callOrder).toEqual(['saveHistory', 'updatePage']);
+      expect(pageRepo.updatePage.mock.calls[0][0].lastUpdatedSource).toBe('user');
+    });
+
+    it('git→user transition fires the boundary (git-sync overwrite is a source change)', async () => {
+      const document = ydocFor(doc('NEW CONTENT'));
+      pageRepo.findById.mockResolvedValue(pageWithPriorSource('git'));
+      pageHistoryRepo.findPageLastHistory.mockResolvedValue(null);
+
+      await ext.onStoreDocument(buildData(document, 'user') as any);
+
+      expect(pageHistoryRepo.saveHistory).toHaveBeenCalledTimes(1);
+      expect(callOrder).toEqual(['saveHistory', 'updatePage']);
+    });
+
+    it('a null prior source (first-ever edit) does NOT fire the boundary', async () => {
+      const document = ydocFor(doc('NEW CONTENT'));
+      pageRepo.findById.mockResolvedValue(pageWithPriorSource(null));
+
+      await ext.onStoreDocument(buildData(document, 'agent') as any);
+
+      expect(pageHistoryRepo.saveHistory).not.toHaveBeenCalled();
+      expect(pageRepo.updatePage).toHaveBeenCalledTimes(1);
+    });
+  });
+
  it('idempotency: unchanged content → no updatePage, no history, no queues', async () => {
    // The Y.Doc content equals the persisted content deeply → early skip.
    // A Y.Doc round-trip normalizes attrs (e.g. paragraph indent), so derive
@@ -469,4 +523,125 @@ describe('PersistenceExtension.onStoreDocument — Approach-A boundary snapshot'
    // Contributors keyed by the UUID so they match the PAGE_HISTORY job (page.id).
    expect(collabHistory.addContributors.mock.calls[0][0]).toBe(PAGE_ID);
  });
+
+  // #370 — explicit save-version (Cmd+S / agent save tool) over the stateless
+  // seam. The tier is derived from the SIGNED connection actor, the store path
+  // is reused, and promote-not-dup avoids duplicating heavy content rows.
+  describe('save-version (#370)', () => {
+    const emitSave = (document: any, actor: 'user' | 'agent') =>
+      ext.onStateless({
+        connection: {
+          readOnly: false,
+          context: { user: { id: USER_ID, name: 'Alice' }, actor },
+        } as any,
+        documentName: `page.${PAGE_ID}`,
+        document: document as any,
+        payload: JSON.stringify({ type: 'save-version' }),
+      } as any);
+
+    // findById returns a page whose content already equals the live doc, so the
+    // store path is a no-op and we isolate the versioning decision.
+    const pageMatchingDoc = (document: any) => ({
+      ...persistedHumanPage('IGNORED'),
+      content: TiptapTransformer.fromYdoc(document, 'default'),
+    });
+
+    it('human save with no prior snapshot → writes a manual version + broadcasts', async () => {
+      const document = ydocFor(doc('VERSION ME'));
+      pageRepo.findById.mockResolvedValue(pageMatchingDoc(document));
+      pageHistoryRepo.findPageLastHistory.mockResolvedValue(null);
+
+      await emitSave(document, 'user');
+
+      expect(pageHistoryRepo.saveHistory).toHaveBeenCalledTimes(1);
+      expect(pageHistoryRepo.saveHistory.mock.calls[0][1]).toEqual(
+        expect.objectContaining({ kind: 'manual' }),
+      );
+      // The pending idle autosnapshot is cancelled by the explicit version.
+      expect(historyQueue.remove).toHaveBeenCalledWith(PAGE_ID);
+      const msg = JSON.parse(
+        (document as any).broadcastStateless.mock.calls[(document as any).broadcastStateless.mock.calls.length - 1][0],
+      );
+      expect(msg).toMatchObject({
+        type: 'version.saved',
+        kind: 'manual',
+        alreadySaved: false,
+      });
+    });
+
+    it('agent save derives kind=agent from the signed actor', async () => {
+      const document = ydocFor(doc('AGENT VERSION'));
+      pageRepo.findById.mockResolvedValue(pageMatchingDoc(document));
+      pageHistoryRepo.findPageLastHistory.mockResolvedValue(null);
+
+      await emitSave(document, 'agent');
+
+      expect(pageHistoryRepo.saveHistory.mock.calls[pageHistoryRepo.saveHistory.mock.calls.length - 1][1]).toEqual(
+        expect.objectContaining({ kind: 'agent' }),
+      );
+    });
+
+    it('promote-not-dup: latest snapshot is an autosave with identical content → upgrades in place', async () => {
+      const document = ydocFor(doc('SAME'));
+      const page = pageMatchingDoc(document);
+      pageRepo.findById.mockResolvedValue(page);
+      pageHistoryRepo.findPageLastHistory.mockResolvedValue({
+        id: 'auto-1',
+        content: page.content,
+        kind: 'idle',
+      });
+
+      await emitSave(document, 'user');
+
+      // No heavy new content row — the existing autosave is promoted to manual.
+      expect(pageHistoryRepo.updateHistoryKind).toHaveBeenCalledWith(
+        'auto-1',
+        'manual',
+        expect.anything(),
+      );
+      expect(pageHistoryRepo.saveHistory).not.toHaveBeenCalled();
+      const msg = JSON.parse(
+        (document as any).broadcastStateless.mock.calls[(document as any).broadcastStateless.mock.calls.length - 1][0],
+      );
+      expect(msg).toMatchObject({ historyId: 'auto-1', alreadySaved: false });
+    });
+
+    it('no-op when the latest snapshot is already a manual version of this content', async () => {
+      const document = ydocFor(doc('ALREADY SAVED'));
+      const page = pageMatchingDoc(document);
+      pageRepo.findById.mockResolvedValue(page);
+      pageHistoryRepo.findPageLastHistory.mockResolvedValue({
+        id: 'ver-1',
+        content: page.content,
+        kind: 'manual',
+      });
+
+      await emitSave(document, 'user');
+
+      expect(pageHistoryRepo.updateHistoryKind).not.toHaveBeenCalled();
+      expect(pageHistoryRepo.saveHistory).not.toHaveBeenCalled();
+      const msg = JSON.parse(
+        (document as any).broadcastStateless.mock.calls[(document as any).broadcastStateless.mock.calls.length - 1][0],
+      );
+      expect(msg).toMatchObject({ alreadySaved: true, kind: 'manual' });
+    });
+
+    it('a read-only connection cannot save a version', async () => {
+      const document = ydocFor(doc('READER'));
+      pageRepo.findById.mockResolvedValue(pageMatchingDoc(document));
+
+      await ext.onStateless({
+        connection: {
+          readOnly: true,
+          context: { user: { id: USER_ID }, actor: 'user' },
+        } as any,
+        documentName: `page.${PAGE_ID}`,
+        document: document as any,
+        payload: JSON.stringify({ type: 'save-version' }),
+      } as any);
+
+      expect(pageHistoryRepo.saveHistory).not.toHaveBeenCalled();
+      expect(pageHistoryRepo.updateHistoryKind).not.toHaveBeenCalled();
+    });
+  });
 });
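Reviewer note: the save-version cases pinned by the spec above (insert, promote-not-dup, no-op) can be read as a small decision function. The sketch below is an assumption-laden summary, not the extension's code. `decideSaveVersion`, `Snapshot` and `SaveDecision` are hypothetical names, JSON string equality stands in for whatever deep comparison the real code uses, and treating an existing 'agent' version of identical content as already saved is an assumption the spec does not cover.

```typescript
type PageHistoryKind = 'manual' | 'agent' | 'idle' | 'boundary';

interface Snapshot {
  id: string;
  content: unknown;
  kind: PageHistoryKind | null;
}

type SaveDecision =
  | { action: 'insert'; kind: 'manual' | 'agent' }
  | { action: 'promote'; historyId: string; kind: 'manual' | 'agent' }
  | { action: 'noop'; historyId: string; kind: PageHistoryKind };

// Tier 0 autosnapshots; a legacy null kind is treated as an autosave.
const isAutosave = (kind: PageHistoryKind | null): boolean =>
  kind === null || kind === 'idle' || kind === 'boundary';

// Hypothetical decision helper for the three cases the spec pins.
function decideSaveVersion(
  latest: Snapshot | null,
  liveContent: unknown,
  actor: 'user' | 'agent',
): SaveDecision {
  // The tier comes from the signed actor, never from the payload.
  const kind = actor === 'agent' ? 'agent' : 'manual';
  // Assumption: JSON equality approximates the real content comparison.
  const same =
    latest !== null &&
    JSON.stringify(latest.content) === JSON.stringify(liveContent);

  if (!latest || !same) return { action: 'insert', kind };
  if (isAutosave(latest.kind)) {
    // Promote the existing row instead of writing a second heavy content row.
    return { action: 'promote', historyId: latest.id, kind };
  }
  // Assumption: any existing Tier 1 version of this content counts as saved.
  return {
    action: 'noop',
    historyId: latest.id,
    kind: latest.kind as PageHistoryKind,
  };
}
```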
@@ -36,11 +36,14 @@ import {
 import { Page } from '@docmost/db/types/entity.types';
 import { CollabHistoryService } from '../services/collab-history.service';
 import {
-  HISTORY_FAST_INTERVAL,
-  HISTORY_FAST_THRESHOLD,
-  HISTORY_INTERVAL,
+  IDLE_INTERVAL_AGENT,
+  IDLE_INTERVAL_USER,
+  IDLE_MAX_WAIT_AGENT,
+  IDLE_MAX_WAIT_USER,
+  PageHistoryKind,
 } from '../constants';
 import { TransclusionService } from '../../core/page/transclusion/transclusion.service';
+import { observeCollabStore } from '../../integrations/metrics/metrics.registry';

 /**
 * #251 — wire format of the client→server stateless message that signals a
@@ -50,6 +53,16 @@ import { TransclusionService } from '../../core/page/transclusion/transclusion.s
 */
 export const INTENTIONAL_CLEAR_MESSAGE_TYPE = 'intentional-clear';

+/**
+ * #370 — wire format of the client→server "save a version" signal. Sent by the
+ * human (Cmd+S / Save button) and by the agent's explicit save tool over the
+ * SAME stateless channel. The intentionality tier ('manual' vs 'agent') is
+ * derived SERVER-SIDE from the signed connection actor, never from this
+ * payload, so a version's type is unforgeable. The document is taken from the
+ * connection (not the payload), so the signal cannot be aimed at another page.
+ */
+export const SAVE_VERSION_MESSAGE_TYPE = 'save-version';
+
 /**
 * #251 — how long an intentional-clear signal stays "pending" before it is
 * ignored. The signal is set on the clearing keystroke but consumed by the
@@ -86,35 +99,39 @@ export function resolveSource(
 }

 /**
- * Compute the BullMQ job id + delay for a page-history snapshot job. Pure so
- * the data-loss-sensitive timing arithmetic is unit-testable; `now` is injected
- * (caller passes `Date.now()`) for determinism.
+ * #370 — compute the BullMQ job id + delay for a page's trailing idle-flush
+ * autosnapshot. Pure so the timing is unit-testable.
 *
- * - Agent edits: delay 0 and a source-keyed job id `${page.id}-agent`. The
- *   delay MUST stay 0 — the worker re-reads the page row at run time, so any
- *   delay risks reading content a later human edit has already overwritten
- *   (mis-tagged snapshot). 0 minimizes that window. The `-agent` suffix keeps
- *   the job from coalescing with the bare-page.id human job.
- * - Human edits: age-based debounce so rapid human edits coalesce into one
- *   snapshot; job id is the bare `page.id`.
- *
- * BullMQ forbids ':' in custom job ids (Redis key separator), so '-' is used;
- * page.id is a UUID, so `${page.id}-agent` cannot collide with a human job.
+ * Both humans and the agent now share ONE idle pipeline (the agent's old
+ * `delay=0` fast path is gone — intentional agent points arrive via the
+ * explicit save-version signal instead). The job id is the bare `page.id`, so a
+ * page has at most one pending idle job; the caller removes-and-re-adds it on
+ * every store to keep it debounced to the trailing edge of an edit burst. The
+ * window differs by source only: the agent flushes sooner than a human.
 */
 export function computeHistoryJob(
-  page: Pick<Page, 'id' | 'createdAt'>,
+  page: Pick<Page, 'id'>,
  source: string,
-  now: number,
+  // Epoch ms of the FIRST edit in the current burst (when the pending idle job
+  // was first armed). Used to enforce the max-wait ceiling so a continuous
+  // editing session cannot re-arm the trailing timer forever. When omitted, no
+  // ceiling applies. `now` is injectable for tests and defaults to the live clock.
+  burstStart?: number,
+  now: number = Date.now(),
 ): { jobId: string; delay: number } {
  const isAgent = source === 'agent';
-  const pageAge = now - new Date(page.createdAt).getTime();
-  const delay = isAgent
-    ? 0
-    : pageAge < HISTORY_FAST_THRESHOLD
-      ? HISTORY_FAST_INTERVAL
-      : HISTORY_INTERVAL;
-  const jobId = isAgent ? `${page.id}-agent` : page.id;
-  return { jobId, delay };
+  const interval = isAgent ? IDLE_INTERVAL_AGENT : IDLE_INTERVAL_USER;
+  const maxWait = isAgent ? IDLE_MAX_WAIT_AGENT : IDLE_MAX_WAIT_USER;
+
+  let delay = interval;
+  if (burstStart !== undefined) {
+    // Time already elapsed since the burst's first edit; the snapshot must fire
+    // no later than `maxWait` after that, so shrink the trailing delay to the
+    // remaining budget (never negative, so BullMQ fires it promptly).
+    const remaining = burstStart + maxWait - now;
+    delay = Math.max(0, Math.min(interval, remaining));
+  }
+  return { jobId: page.id, delay };
 }
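The delay arithmetic in `computeHistoryJob` can be checked in isolation. The sketch below restates it as a standalone function. The two interval values are hypothetical placeholders, since the real `IDLE_INTERVAL_*` constants are not shown in this diff; the 10m/5m ceilings follow the figures quoted in the idle-enqueue comment of this commit. `idleDelay` is an illustrative name, not part of the change.

```typescript
// Hypothetical trailing-debounce windows (assumed for illustration only).
const IDLE_INTERVAL_USER = 60 * 1000;
const IDLE_INTERVAL_AGENT = 30 * 1000;
// Max-wait ceilings as quoted in this commit's comments (user 10m / agent 5m).
const IDLE_MAX_WAIT_USER = 10 * 60 * 1000;
const IDLE_MAX_WAIT_AGENT = 5 * 60 * 1000;

// Same arithmetic as computeHistoryJob, reduced to the delay it returns.
function idleDelay(
  source: string,
  burstStart: number | undefined,
  now: number,
): number {
  const isAgent = source === 'agent';
  const interval = isAgent ? IDLE_INTERVAL_AGENT : IDLE_INTERVAL_USER;
  const maxWait = isAgent ? IDLE_MAX_WAIT_AGENT : IDLE_MAX_WAIT_USER;
  if (burstStart === undefined) return interval; // no ceiling requested
  // Budget left before the snapshot must fire, measured from the first edit.
  const remaining = burstStart + maxWait - now;
  return Math.max(0, Math.min(interval, remaining));
}

const SECOND = 1000;
// Early in a user burst: the full trailing interval applies.
const early = idleDelay('user', 0, 0);
// 9m30s into a user burst: only 30s of the 10m budget remains.
const nearCeiling = idleDelay('user', 0, 570 * SECOND);
// Past the ceiling: the delay clamps to 0 so the job fires promptly.
const pastCeiling = idleDelay('user', 0, 660 * SECOND);
```

Under these assumed intervals, the delay stays at the full interval until the remaining budget drops below it, then shrinks with the budget and bottoms out at zero.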

@Injectable()
@@ -126,6 +143,11 @@ export class PersistenceExtension implements Extension {
  // coalescing window" per document and OR it across all edits in the window,
  // so the snapshot is marked 'agent' regardless of who wrote last.
  private agentTouched: Map<string, boolean> = new Map();
+  // #370 — epoch ms of the FIRST edit in the current idle-flush burst, per page.
+  // Set when the pending idle job is first armed (no entry yet, or a stale one),
+  // read to enforce the max-wait ceiling in computeHistoryJob, and cleared when
+  // the idle job is consumed/cancelled so the next burst starts a fresh window.
+  private idleBurstStart: Map<string, number> = new Map();
  // #251 — per-document "intentional clear pending" flags. Keyed by
  // documentName, value = expiry timestamp (ms). Set by onStateless when the
  // client reports a deliberate clear; consumed once by the next
@@ -192,6 +214,17 @@ export class PersistenceExtension implements Extension {
  }

  async onStoreDocument(data: onStoreDocumentPayload) {
+    // #355 — time the full store (persist + post-store side effects) into
+    // collab_store_duration_seconds. No-op when METRICS_PORT is unset.
+    const startedAt = performance.now();
+    try {
+      await this.storeDocument(data);
+    } finally {
+      observeCollabStore((performance.now() - startedAt) / 1000);
+    }
+  }
+
+  private async storeDocument(data: onStoreDocumentPayload) {
    const { documentName, document, context } = data;

    const pageId = getPageId(documentName);
@@ -314,20 +347,19 @@ export class PersistenceExtension implements Extension {
            //this.logger.debug('Contributors error:' + err?.['message']);
          }

-          // Approach A — boundary snapshot before the agent's first edit.
-          // When this store is the agent's and the page's currently persisted
-          // state was authored by a human, pin that human state as its own
-          // history version BEFORE the agent overwrites it. `page` still holds
-          // the OLD content/provenance here, so saveHistory(page) captures the
-          // pre-agent state tagged 'user'. The agent's new content is
-          // snapshotted later by the debounced PAGE_HISTORY job ('agent'). Skip
-          // if the prior state is already agent-authored (boundary already
-          // pinned on the user->agent transition), if the page is effectively
-          // empty, or if the latest existing snapshot already equals this human
-          // state (avoid duplicates).
+          // #370 — boundary snapshot on ANY source transition. When the store
+          // flips the page's provenance (user↔agent↔git), pin the OUTGOING
+          // state as its own history version BEFORE the incoming source
+          // overwrites it. `page` still holds the OLD content/provenance here,
+          // so saveHistory(page) captures the pre-transition state tagged with
+          // its own source, kind='boundary'. The incoming content is snapshotted
+          // later by the debounced idle job. Skip if the page is effectively
+          // empty or if the latest existing snapshot already equals this state
+          // (the shared isDeepStrictEqual gate — avoids duplicates). Generalizing
+          // beyond the old user→agent special-case also covers git-sync for free.
          if (
-            lastUpdatedSource === 'agent' &&
-            page.lastUpdatedSource !== 'agent'
+            page.lastUpdatedSource &&
+            page.lastUpdatedSource !== lastUpdatedSource
          ) {
            // pageHistory.pageId is uuid-typed; use page.id (never the doc-name
            // slugId) so a `page.<slugId>` doc cannot throw 22P02 here (#260).
@@ -335,15 +367,13 @@ export class PersistenceExtension implements Extension {
              page.id,
              { includeContent: true, trx },
            );
-            const humanBaselineMissing =
+            const baselineMissing =
              !lastHistory ||
              !isDeepStrictEqual(lastHistory.content, page.content);
-            if (
-              !isEmptyParagraphDoc(page.content as any) &&
-              humanBaselineMissing
-            ) {
+            if (!isEmptyParagraphDoc(page.content as any) && baselineMissing) {
              await this.pageHistoryRepo.saveHistory(page, {
                contributorIds: page.contributorIds ?? undefined,
+                kind: 'boundary',
                trx,
              });
            }
@@ -468,6 +498,14 @@ export class PersistenceExtension implements Extension {
      return; // unrelated / malformed stateless message
    }

+    // #370 — explicit "save a version" (human Cmd+S / agent save tool). Edit
+    // rights are already enforced by the readOnly reject above (a reader can't
+    // create a version), exactly as intentional-clear requires.
+    if (message?.type === SAVE_VERSION_MESSAGE_TYPE) {
+      await this.handleSaveVersion(data);
+      return;
+    }
+
    if (message?.type !== INTENTIONAL_CLEAR_MESSAGE_TYPE) return;

    this.intentionalClear.set(
@@ -476,6 +514,117 @@ export class PersistenceExtension implements Extension {
    );
  }

+  /**
+   * #370 — persist an intentional version from the live in-memory ydoc.
+   *
+   * One stateless path serves BOTH the human and the agent; the tier is derived
+   * SERVER-SIDE from the signed connection actor ('agent' → 'agent', anything
+   * else → 'manual'), so the version type cannot be spoofed by the client. We
+   * take the fresh ydoc from the collab process memory and run it through the
+   * EXISTING store path first (so pages.content/ydoc reflect the exact content
+   * being versioned — a REST endpoint would race the up-to-10s-stale page row),
+   * then snapshot it into page_history with the intentional kind.
+   *
+   * Promote-not-dup: if the latest history row already holds this exact content
+   * and it is an autosave (idle/boundary/legacy-null), upgrade its kind in place
+   * instead of duplicating a heavy content row; if it is already 'manual', it is
+   * a no-op (the client shows an "already saved" toast). Otherwise a fresh
+   * version row is written, popping the aggregated contributors from Redis.
+   */
+  private async handleSaveVersion(data: onStatelessPayload): Promise<void> {
+    const { connection, document, documentName } = data;
+    const context = connection?.context;
+    const pageId = getPageId(documentName);
+    // Unforgeable: 'agent' only for a signed agent connection, else 'manual'.
+    const kind: PageHistoryKind =
+      context?.actor === 'agent' ? 'agent' : 'manual';
+
+    // Flush the live ydoc through the normal store path so the page row + ydoc
+    // hold exactly what we are about to version (also fires the idle enqueue we
+    // supersede below, plus any source-transition boundary). onStoreDocument
+    // only needs document/documentName/context.
+    await this.onStoreDocument({
+      document,
+      documentName,
+      context,
+    } as onStoreDocumentPayload);
+
+    let result:
+      | { historyId: string; kind: PageHistoryKind; alreadySaved: boolean }
+      | undefined;
+
+    await executeTx(this.db, async (trx) => {
+      const page = await this.pageRepo.findById(pageId, {
+        withLock: true,
+        includeContent: true,
+        trx,
+      });
+      if (!page) return;
+      // Never version an effectively-empty page (mirrors the processor's
+      // first-history guard); there is nothing intentional to pin.
+      if (isEmptyParagraphDoc(page.content as any)) return;
+
+      const lastHistory = await this.pageHistoryRepo.findPageLastHistory(
+        page.id,
+        { includeContent: true, trx },
+      );
+
+      if (
+        lastHistory &&
+        isDeepStrictEqual(lastHistory.content, page.content)
+      ) {
+        // Content is already snapshotted. Promote-not-dup.
+        if (lastHistory.kind === 'manual') {
+          result = {
+            historyId: lastHistory.id,
+            kind: 'manual',
+            alreadySaved: true,
+          };
+          return;
+        }
+        await this.pageHistoryRepo.updateHistoryKind(
+          lastHistory.id,
+          kind,
+          trx,
+        );
+        result = { historyId: lastHistory.id, kind, alreadySaved: false };
+        return;
+      }
+
+      // Fresh version row. Pop the contributors aggregated since the last
+      // snapshot (SPOP); restore them if the write fails so they aren't lost.
+      const contributorIds = await this.collabHistory.popContributors(page.id);
+      try {
+        const saved = await this.pageHistoryRepo.saveHistory(page, {
+          contributorIds,
+          kind,
+          trx,
+        });
+        result = { historyId: saved.id, kind, alreadySaved: false };
+      } catch (err) {
+        await this.collabHistory.addContributors(page.id, contributorIds);
+        throw err;
+      }
+    });
+
+    // Housekeeping: this explicit version supersedes the page's pending idle
+    // autosnapshot, so cancel it (delayed job → remove() just deletes it) and
+    // end the current idle burst so the next edit starts a fresh max-wait window.
+    await this.historyQueue.remove(pageId).catch(() => undefined);
+    this.idleBurstStart.delete(pageId);
+
+    if (result) {
+      document.broadcastStateless(
+        JSON.stringify({
+          type: 'version.saved',
+          historyId: result.historyId,
+          kind: result.kind,
+          alreadySaved: result.alreadySaved,
+        }),
+      );
+    }
+  }
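The promote-not-dup branch in `handleSaveVersion` reduces to a three-way decision, sketched here as a pure function for review. Everything in it is illustrative: `decideSaveVersion`, `SaveDecision` and `LastHistory` are hypothetical names, the `PageHistoryKind` union lists only the kinds that appear in this diff, and `sameContent` is a `JSON.stringify` stand-in for the `isDeepStrictEqual` gate the real code uses.

```typescript
// Only the kinds visible in this diff; the real union may differ.
type PageHistoryKind = 'manual' | 'agent' | 'idle' | 'boundary';

interface LastHistory {
  id: string;
  kind: PageHistoryKind | null; // null models a legacy row without a kind
  content: unknown;
}

type SaveDecision =
  | { action: 'noop'; historyId: string }
  | { action: 'promote'; historyId: string; kind: PageHistoryKind }
  | { action: 'insert'; kind: PageHistoryKind };

// Stand-in for node:util isDeepStrictEqual; adequate for the plain JSON
// documents in this sketch, but sensitive to key order.
const sameContent = (a: unknown, b: unknown): boolean =>
  JSON.stringify(a) === JSON.stringify(b);

function decideSaveVersion(
  actor: string | undefined,
  pageContent: unknown,
  last?: LastHistory,
): SaveDecision {
  // The tier comes from the connection actor, never from the client payload.
  const kind: PageHistoryKind = actor === 'agent' ? 'agent' : 'manual';
  if (last && sameContent(last.content, pageContent)) {
    // Already a manual version of this content: nothing to write.
    if (last.kind === 'manual') return { action: 'noop', historyId: last.id };
    // Autosave of this content: upgrade its kind in place.
    return { action: 'promote', historyId: last.id, kind };
  }
  // New content: write a fresh version row.
  return { action: 'insert', kind };
}

const content = { type: 'doc', content: [{ type: 'paragraph' }] };
const promoted = decideSaveVersion('user', content, {
  id: 'h1',
  kind: 'idle',
  content,
});
```

Keeping the decision separate from the transaction makes the no-op, promote and insert cases testable without mocking the repositories, at the cost of one extra indirection.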
+
  async onChange(data: onChangePayload) {
    const documentName = data.documentName;
    const userId = data.context?.user?.id;
@@ -533,17 +682,45 @@ export class PersistenceExtension implements Extension {
    page: Page,
    lastUpdatedSource: string,
  ): Promise<void> {
-    // Job id + delay arithmetic lives in the pure `computeHistoryJob` (see its
-    // doc comment for the agent-delay-0 / age-based-debounce invariants).
+    // #370 — trailing idle debounce with a max-wait ceiling. One pending idle
+    // job per page (jobId = page.id); on every store we remove the pending
+    // delayed job and re-add it, so the snapshot lands `delay` after edits go
+    // quiet rather than once per store (precedent: workspace.service.ts).
+    // remove() on a delayed job simply deletes it (0 if absent, no throw); if the
+    // job is already ACTIVE and the remove is a no-op, the add still de-dups and
+    // the processor's isDeepStrictEqual gate collapses the duplicate content.
+    //
+    // The FIRST arm of a burst records `burstStart`; computeHistoryJob shrinks
+    // the delay to the remaining max-wait budget from that point, so a continuous
+    // session cannot re-arm the trailing timer forever and starve the snapshot.
+    // A burst marker older than THIS TIER's max-wait means the previous idle job
+    // has already fired — start a fresh window instead of firing immediately on
+    // the next edit. Must use the SAME source-specific max-wait computeHistoryJob
+    // uses (agent 5m / user 10m): a hardcoded USER ceiling would leave an agent
+    // burst's marker stale for 5..10m, forcing delay=0 on every store in that
+    // window and writing one idle row per store — exactly the per-store bloat the
+    // debounce exists to prevent, on the continuous-agent path.
+    const maxWait =
+      lastUpdatedSource === 'agent' ? IDLE_MAX_WAIT_AGENT : IDLE_MAX_WAIT_USER;
+    const now = Date.now();
+    let burstStart = this.idleBurstStart.get(page.id);
+    if (burstStart === undefined || now - burstStart >= maxWait) {
+      burstStart = now;
+      this.idleBurstStart.set(page.id, burstStart);
+    }
+
    const { jobId, delay } = computeHistoryJob(
      page,
      lastUpdatedSource,
-      Date.now(),
+      burstStart,
+      now,
    );

+    await this.historyQueue.remove(jobId).catch(() => undefined);
+
    await this.historyQueue.add(
      QueueJob.PAGE_HISTORY,
-      { pageId: page.id } as IPageHistoryJob,
+      { pageId: page.id, kind: 'idle' } as IPageHistoryJob,
      { jobId, delay },
    );
  }
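The burst-marker reset described in the comment above is easy to get wrong, so here is a minimal sketch of it on its own. `resolveBurstStart` is a hypothetical helper name; the ceilings are the agent 5m / user 10m figures quoted in that comment. It shows why the staleness check has to use the same source-specific ceiling as the delay computation.

```typescript
const IDLE_MAX_WAIT_AGENT = 5 * 60 * 1000;
const IDLE_MAX_WAIT_USER = 10 * 60 * 1000;

// Returns the burst start to use for this store. Opens a fresh window when
// there is no marker or the marker is older than this source's own ceiling.
function resolveBurstStart(
  markers: Map<string, number>,
  pageId: string,
  source: string,
  now: number,
): number {
  const maxWait = source === 'agent' ? IDLE_MAX_WAIT_AGENT : IDLE_MAX_WAIT_USER;
  const existing = markers.get(pageId);
  if (existing === undefined || now - existing >= maxWait) {
    markers.set(pageId, now);
    return now;
  }
  return existing;
}

const MINUTE = 60 * 1000;
const markers = new Map<string, number>();
// First agent store arms the window at t=0.
const armed = resolveBurstStart(markers, 'page-1', 'agent', 0);
// Four minutes in, the same burst is still running.
const sameBurst = resolveBurstStart(markers, 'page-1', 'agent', 4 * MINUTE);
// Six minutes in, the 5m agent ceiling has passed, so a new window opens.
const freshBurst = resolveBurstStart(markers, 'page-1', 'agent', 6 * MINUTE);
```

With a hardcoded user ceiling, the six-minute call would have kept the stale marker, which is the per-store `delay=0` case the comment warns about.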
@@ -66,6 +66,15 @@ describe('HistoryProcessor.process', () => {
    notificationQueue = { add: jest.fn().mockResolvedValue(undefined) };
    generalQueue = { add: jest.fn().mockResolvedValue(undefined) };

+    // #370 F3 — the processor now serializes its find+save under a page-row lock
+    // via executeTx. A db whose transaction().execute(fn) runs fn with a trx stub
+    // drives the real executeTx() helper without a database.
+    const db = {
+      transaction: () => ({
+        execute: (fn: (trx: any) => Promise<any>) => fn({ __trx: true }),
+      }),
+    };
+
    // WorkerHost's constructor reads `this.worker`; passing repos positionally
    // matches the constructor and avoids the Nest DI container.
    proc = new HistoryProcessor(
@@ -73,6 +82,7 @@ describe('HistoryProcessor.process', () => {
      pageRepo as any,
      collabHistory as any,
      watcherService as any,
+      db as any,
      notificationQueue as any,
      generalQueue as any,
    );
@@ -126,15 +136,26 @@ describe('HistoryProcessor.process', () => {
    await proc.process(buildJob());

    expect(collabHistory.popContributors).toHaveBeenCalledWith(PAGE_ID);
+    // #370 F3/F9 — the snapshot decision runs under a page-row lock. Pin the lock
+    // structurally so a refactor that drops withLock/trx (silently reintroducing
+    // the TOCTOU double-insert) turns this red. The tx stub is { __trx: true }.
+    expect(pageRepo.findById).toHaveBeenCalledWith(
+      PAGE_ID,
+      expect.objectContaining({ withLock: true, trx: { __trx: true } }),
+    );
+    // #370 F7 — addPageWatchers MUST receive the trx, or its FK-check runs on a
+    // separate connection and self-deadlocks against our FOR UPDATE. Asserting
+    // the trx arg here is exactly what would have caught that regression.
    expect(watcherService.addPageWatchers).toHaveBeenCalledWith(
      ['u1', 'u2'],
      PAGE_ID,
      SPACE_ID,
      WORKSPACE_ID,
+      { __trx: true },
    );
    expect(pageHistoryRepo.saveHistory).toHaveBeenCalledWith(
      expect.objectContaining({ id: PAGE_ID }),
-      { contributorIds: ['u1', 'u2'] },
+      { contributorIds: ['u1', 'u2'], kind: 'idle', trx: { __trx: true } },
    );
    expect(generalQueue.add).toHaveBeenCalledWith(
      QueueJob.PAGE_BACKLINKS,
@@ -186,6 +207,48 @@ describe('HistoryProcessor.process', () => {
    ]);
  });

+  it('COMMIT failure (throw outside the tx callback) → contributors RESTORED', async () => {
+    // #370 F8 — a commit-time failure throws OUTSIDE the callback, so the inner
+    // try/catch does not run; the outer catch must restore the popped set (else a
+    // BullMQ retry writes an unattributed version). Use a db whose execute() runs
+    // the callback THEN throws, simulating a commit abort.
+    pageHistoryRepo.findPageLastHistory.mockResolvedValue({
+      content: { type: 'doc', content: [] },
+    });
+    const commitFail = {
+      transaction: () => ({
+        execute: async (fn: (trx: any) => Promise<any>) => {
+          await fn({ __trx: true }); // callback succeeds (saveHistory ok)
+          throw new Error('commit aborted'); // ...but the COMMIT fails
+        },
+      }),
+    };
+    const procCommitFail = new HistoryProcessor(
+      pageHistoryRepo as any,
+      pageRepo as any,
+      collabHistory as any,
+      watcherService as any,
+      commitFail as any,
+      notificationQueue as any,
+      generalQueue as any,
+    );
+    jest
+      .spyOn(procCommitFail['logger'], 'error')
+      .mockImplementation(() => undefined);
+
+    await expect(procCommitFail.process(buildJob())).rejects.toThrow(
+      'commit aborted',
+    );
+    // The inner catch did NOT run (save succeeded), so only the outer catch can
+    // restore — assert it did.
+    expect(collabHistory.addContributors).toHaveBeenCalledWith(PAGE_ID, [
+      'u1',
+      'u2',
+    ]);
+    // And the post-snapshot queue work must NOT have run (we rethrew).
+    expect(generalQueue.add).not.toHaveBeenCalled();
+  });
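The commit-failure scenario exercised above can also be reproduced without the processor. The sketch below models the F8 pattern with in-memory stand-ins: `ContributorStore` plays the Redis set, `commitFails` plays a transaction runner whose callback succeeds but whose commit throws, and `snapshotWithRestore` is a hypothetical reduction of the inner/outer catch structure. None of these names exist in the codebase.

```typescript
type TxFn<T> = (trx: object) => Promise<T>;

// In-memory stand-in for the Redis contributor set (pop is destructive).
class ContributorStore {
  private ids = new Set<string>();
  add(ids: string[]): void {
    ids.forEach((id) => this.ids.add(id)); // idempotent, like SADD
  }
  pop(): string[] {
    const out = Array.from(this.ids);
    this.ids.clear();
    return out;
  }
  snapshot(): string[] {
    return Array.from(this.ids).sort();
  }
}

async function snapshotWithRestore(
  store: ContributorStore,
  runTx: <T>(fn: TxFn<T>) => Promise<T>,
  save: (ids: string[], trx: object) => Promise<void>,
): Promise<void> {
  let poppedForRestore: string[] = [];
  try {
    await runTx(async (trx) => {
      const ids = store.pop();
      poppedForRestore = ids;
      try {
        await save(ids, trx);
      } catch (err) {
        // Failure inside the callback: restore here and disarm the outer catch.
        store.add(ids);
        poppedForRestore = [];
        throw err;
      }
    });
  } catch (err) {
    // Failure at commit time: the inner catch never ran, so restore now.
    if (poppedForRestore.length) store.add(poppedForRestore);
    throw err;
  }
}

// Transaction runner whose callback succeeds but whose commit fails.
async function commitFails<T>(fn: TxFn<T>): Promise<T> {
  await fn({});
  throw new Error('commit aborted');
}

const demoStore = new ContributorStore();
demoStore.add(['u1', 'u2']);
// After the rethrow, read back what is left in the store.
const demo = snapshotWithRestore(demoStore, commitFails, async () => undefined)
  .catch(() => demoStore.snapshot());
```

The double restore that can occur if both catches run is harmless only because the add is idempotent; a list-based store would need the disarm step to be strictly correct.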
+
  it('backlinks + notification queue failures are swallowed (history still committed)', async () => {
    pageHistoryRepo.findPageLastHistory.mockResolvedValue({
      content: { type: 'doc', content: [] },
@@ -19,6 +19,9 @@ import { isDeepStrictEqual } from 'node:util';
 import { CollabHistoryService } from '../services/collab-history.service';
 import { WatcherService } from '../../core/watcher/watcher.service';
 import { isEmptyParagraphDoc } from '../collaboration.util';
+import { InjectKysely } from 'nestjs-kysely';
+import { KyselyDB } from '@docmost/db/types/kysely.types';
+import { executeTx } from '@docmost/db/utils';

@Processor(QueueName.HISTORY_QUEUE)
 export class HistoryProcessor extends WorkerHost implements OnModuleDestroy {
@@ -29,6 +32,7 @@ export class HistoryProcessor extends WorkerHost implements OnModuleDestroy {
    private readonly pageRepo: PageRepo,
    private readonly collabHistory: CollabHistoryService,
    private readonly watcherService: WatcherService,
+    @InjectKysely() private readonly db: KyselyDB,
    @InjectQueue(QueueName.NOTIFICATION_QUEUE) private notificationQueue: Queue,
    @InjectQueue(QueueName.GENERAL_QUEUE) private generalQueue: Queue,
  ) {
@@ -41,6 +45,9 @@ export class HistoryProcessor extends WorkerHost implements OnModuleDestroy {
    try {
      const { pageId } = job.data;

+      // Read the page WITHOUT a lock first, only to bail early on the two cheap
+      // no-write cases (page gone / empty first snapshot) without opening a
+      // transaction. The authoritative check-then-write happens locked below.
      const page = await this.pageRepo.findById(pageId, {
        includeContent: true,
      });
@@ -51,40 +58,109 @@ export class HistoryProcessor extends WorkerHost implements OnModuleDestroy {
        return;
      }

-      const lastHistory = await this.pageHistoryRepo.findPageLastHistory(
-        pageId,
-        { includeContent: true },
-      );
+      // #370 F3 — the snapshot decision (findPageLastHistory → saveHistory) must
+      // be serialized against manual-save/boundary writers, which run under a
+      // page-row lock in onStoreDocument. Without it, this processor and a
+      // concurrent manual-save each read the same lastHistory (MVCC), both see
+      // content != lastHistory, and both insert — producing two page_history rows
+      // with IDENTICAL content (one 'idle', one 'manual'), defeating
+      // promote-not-dup and the version-vs-autosave split. Taking the same
+      // page-row lock makes the second writer observe the first's committed row so
+      // the isDeepStrictEqual gate collapses the duplicate. Only the read+write
+      // is transacted; the post-snapshot queue work stays outside.
+      let contributorIds: string[] = [];
+      let snapshotWritten = false;
+      let lastHistoryContent: unknown;
+      // #370 F8 — the contributor set popped from Redis (destructive SPOP) must be
+      // restored if the snapshot does not durably land. The inner try/catch only
+      // covers a throw INSIDE the callback; a COMMIT failure (connection drop,
+      // serialization/deadlock abort on commit — the transient class the epic
+      // already retries) throws OUTSIDE it, rolling the snapshot back while the
+      // pop is already gone. We track the popped set here and restore it in the
+      // outer catch so a BullMQ retry re-attributes the version. addContributors
+      // is an idempotent Redis SADD, so a double-restore is harmless.
+      let poppedForRestore: string[] = [];

-      if (!lastHistory && isEmptyParagraphDoc(page.content as any)) {
-        this.logger.debug(
-          `Skipping first history for page ${pageId}: empty content`,
-        );
-        await this.collabHistory.clearContributors(pageId);
+      try {
+        await executeTx(this.db, async (trx) => {
+          const lockedPage = await this.pageRepo.findById(pageId, {
+            includeContent: true,
+            withLock: true,
+            trx,
+          });
+          if (!lockedPage) return;
+
+          const lastHistory = await this.pageHistoryRepo.findPageLastHistory(
+            pageId,
+            { includeContent: true, trx },
+          );
+          lastHistoryContent = lastHistory?.content;
+
+          if (!lastHistory && isEmptyParagraphDoc(lockedPage.content as any)) {
+            this.logger.debug(
+              `Skipping first history for page ${pageId}: empty content`,
+            );
+            return;
+          }
+
+          if (
+            lastHistory &&
+            isDeepStrictEqual(lastHistory.content, lockedPage.content)
+          ) {
+            return; // already snapshotted at this content — nothing to write
+          }
+
+          contributorIds = await this.collabHistory.popContributors(pageId);
+          poppedForRestore = contributorIds;
+          try {
+            // Pass `trx` so the watcher insert's FK check (FOR KEY SHARE on
+            // pages[pageId]) runs on the SAME connection that already holds the
+            // FOR UPDATE lock from findById — otherwise it takes the FK lock on a
+            // separate pool connection and self-deadlocks against our own tx.
+            await this.watcherService.addPageWatchers(
+              contributorIds,
+              pageId,
+              lockedPage.spaceId,
+              lockedPage.workspaceId,
+              trx,
+            );
+
+            // #370 — every job on this queue is a trailing idle-flush autosnapshot.
+            await this.pageHistoryRepo.saveHistory(lockedPage, {
+              contributorIds,
+              kind: job.data.kind ?? 'idle',
+              trx,
+            });
+            snapshotWritten = true;
+            this.logger.debug(`History created for page: ${pageId}`);
+          } catch (err) {
+            await this.collabHistory.addContributors(pageId, contributorIds);
+            poppedForRestore = [];
+            throw err;
+          }
+        });
+      } catch (err) {
+        // A throw here means the tx did NOT commit (callback threw, or the commit
+        // itself failed and rolled back). If we popped contributors and the inner
+        // catch did not already restore them, restore now so the retry keeps
+        // attribution. snapshotWritten is irrelevant: it is set before commit, so
+        // it can be true even when the commit rolled the snapshot back.
+        if (poppedForRestore.length) {
+          await this.collabHistory.addContributors(pageId, poppedForRestore);
+        }
+        throw err;
+      }
+
+      // No snapshot written (page vanished / empty-first / unchanged content) →
+      // stop. Only the empty-first case clears the contributor set.
+      if (!snapshotWritten) {
+        if (!lastHistoryContent && isEmptyParagraphDoc(page.content as any)) {
+          await this.collabHistory.clearContributors(pageId);
+        }
        return;
      }

-      if (
-        !lastHistory ||
-        !isDeepStrictEqual(lastHistory.content, page.content)
-      ) {
-        const contributorIds = await this.collabHistory.popContributors(pageId);
-
-        try {
-          await this.watcherService.addPageWatchers(
-            contributorIds,
-            pageId,
-            page.spaceId,
-            page.workspaceId,
-          );
-
-          await this.pageHistoryRepo.saveHistory(page, { contributorIds });
-          this.logger.debug(`History created for page: ${pageId}`);
-        } catch (err) {
-          await this.collabHistory.addContributors(pageId, contributorIds);
-          throw err;
-        }
-
+      {
        const mentions = extractMentions(page.content);
        const pageMentions = extractPageMentions(mentions);
        const internalLinkSlugIds = extractInternalLinkSlugIds(page.content);
@@ -102,7 +178,7 @@ export class HistoryProcessor extends WorkerHost implements OnModuleDestroy {
            );
          });

-        if (contributorIds.length > 0 && lastHistory?.content) {
+        if (contributorIds.length > 0 && lastHistoryContent) {
          await this.notificationQueue
            .add(QueueJob.PAGE_UPDATED, {
              pageId,
@@ -0,0 +1,527 @@
+import { Logger } from '@nestjs/common';
+import {
+  AiChatRunService,
+  RunAlreadyActiveError,
+  ONE_ACTIVE_RUN_PER_CHAT_INDEX,
+  mapTurnStatusToRun,
+} from './ai-chat-run.service';
+
+/** Shape a Postgres unique-violation the way the postgres.js driver surfaces it:
+ *  SQLSTATE 23505 + the offending index in `constraint_name`. */
+function uniqueViolation(constraintName: string): Error & {
+  code: string;
+  constraint_name: string;
+} {
+  return Object.assign(
+    new Error('duplicate key value violates unique constraint'),
+    {
+      code: '23505',
+      constraint_name: constraintName,
+    },
+  );
+}
+
+/**
+ * Unit coverage for the #184 phase-1 run lifecycle (AiChatRunService) with a
+ * hand-rolled mock repo — no Nest graph, no DB. The invariant under test is the
+ * one that makes a run "autonomous": a run keeps going when its SUBSCRIBER (the
+ * browser) detaches, and ONLY an explicit stop aborts it. We assert that at the
+ * abort-signal level (the signal the agent loop actually consumes).
+ */
+
+/** Minimal EnvironmentService stub. Single-instance (CLOUD unset) by default. */
+function makeEnv(isCloud = false) {
+  return { isCloud: () => isCloud };
+}
+
+function makeRepo(overrides: Record<string, jest.Mock> = {}) {
+  return {
+    insert: jest.fn(async (v: any) => ({
+      id: 'run-1',
+      status: v.status ?? 'running',
+      chatId: v.chatId,
+      workspaceId: v.workspaceId,
+    })),
+    update: jest.fn(async () => ({ id: 'run-1' })),
+    markStopRequested: jest.fn(async () => ({ id: 'run-1' })),
+    findActiveByChat: jest.fn(async () => undefined),
+    findLatestByChat: jest.fn(async () => undefined),
+    findById: jest.fn(async () => undefined),
+    sweepRunning: jest.fn(async () => 0),
+    ...overrides,
+  };
+}
+
+describe('mapTurnStatusToRun', () => {
+  it('maps the turn terminal status to the run terminal status', () => {
+    expect(mapTurnStatusToRun('completed')).toBe('succeeded');
+    expect(mapTurnStatusToRun('error')).toBe('failed');
+    expect(mapTurnStatusToRun('aborted')).toBe('aborted');
+  });
+});
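For readers without the service file at hand, one mapping consistent with the three assertions above is a plain lookup table. This is a sketch, not the implementation in `ai-chat-run.service.ts`; the `TurnStatus` and `RunStatus` type names and the `mapTurnStatusToRunSketch` function are assumptions made for illustration.

```typescript
// Hypothetical type names; the real service may model these differently.
type TurnStatus = 'completed' | 'error' | 'aborted';
type RunStatus = 'succeeded' | 'failed' | 'aborted';

// A Record keyed by every TurnStatus makes a missing case a compile error.
const TURN_TO_RUN: Record<TurnStatus, RunStatus> = {
  completed: 'succeeded',
  error: 'failed',
  aborted: 'aborted',
};

function mapTurnStatusToRunSketch(status: TurnStatus): RunStatus {
  return TURN_TO_RUN[status];
}

const mapped = (['completed', 'error', 'aborted'] as TurnStatus[]).map(
  mapTurnStatusToRunSketch,
);
```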
+
+describe('AiChatRunService.onModuleInit (startup sweep)', () => {
+  afterEach(() => jest.restoreAllMocks());
+
+  it('calls sweepRunning and resolves; logs when > 0', async () => {
+    const repo = makeRepo({ sweepRunning: jest.fn(async () => 2) });
+    const logSpy = jest
+      .spyOn(Logger.prototype, 'log')
+      .mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await expect(svc.onModuleInit()).resolves.toBeUndefined();
+    expect(repo.sweepRunning).toHaveBeenCalledTimes(1);
+    expect(logSpy).toHaveBeenCalledTimes(1);
+    expect(String(logSpy.mock.calls[0][0])).toContain('2');
+  });
+
+  it('a sweep failure is swallowed (never blocks startup)', async () => {
+    const repo = makeRepo({
+      sweepRunning: jest.fn(async () => {
+        throw new Error('db down');
+      }),
+    });
+    const warnSpy = jest
+      .spyOn(Logger.prototype, 'warn')
+      .mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await expect(svc.onModuleInit()).resolves.toBeUndefined();
+    // The first warn is the sweep failure (the multi-instance warn never fires
+    // on a single-instance deployment), so the message is the db error.
+    expect(String(warnSpy.mock.calls[0][0])).toContain('db down');
+  });
+
+  it('F1 (DECISION C): the boot sweep is UNCONDITIONAL — sweepRunning is called with NO staleness window, so a fresh running run (updatedAt = now) is settled, not skipped', async () => {
+    // The bug: a fast restart (deploy/OOM within minutes of the last step) left a
+    // run stuck 'running' under the old 10-min window, 409ing every later turn in
+    // the chat. The fix settles ALL pending|running on boot. We assert the service
+    // invokes sweepRunning with no `staleMs` (the unconditional path); the repo's
+    // own spec proves no-window => no updatedAt filter.
+    const repo = makeRepo({ sweepRunning: jest.fn(async () => 1) });
+    jest.spyOn(Logger.prototype, 'log').mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await svc.onModuleInit();
+    expect(repo.sweepRunning).toHaveBeenCalledTimes(1);
+    const callArgs = repo.sweepRunning.mock.calls[0] as unknown[];
+    const firstArg = callArgs[0] as { staleMs?: number } | undefined;
+    // Either no opts at all, or opts without a staleMs window => unconditional.
+    expect(firstArg?.staleMs).toBeUndefined();
+  });
+
+  it('F2 (DECISION A): warns at startup that autonomousRuns is single-instance-only when a horizontally-scaled deployment (CLOUD) is detected', async () => {
+    const repo = makeRepo();
+    const warnSpy = jest
+      .spyOn(Logger.prototype, 'warn')
+      .mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv(true) as never);
+    await svc.onModuleInit();
+    const warned = warnSpy.mock.calls.some((c) =>
+      /single-instance-only/i.test(String(c[0])),
+    );
+    expect(warned).toBe(true);
+  });
+
+  it('F2: does NOT warn about multi-instance on a single-instance (CLOUD unset) deployment', async () => {
+    const repo = makeRepo();
+    const warnSpy = jest
+      .spyOn(Logger.prototype, 'warn')
+      .mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv(false) as never);
+    await svc.onModuleInit();
+    const warned = warnSpy.mock.calls.some((c) =>
+      /single-instance-only/i.test(String(c[0])),
+    );
+    expect(warned).toBe(false);
+  });
+});
+
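+describe('AiChatRunService run lifecycle', () => {
+  // Restore the Logger.prototype spies installed by individual tests so their
+  // recorded calls cannot leak into later tests in this block.
+  afterEach(() => jest.restoreAllMocks());
+
+  it('beginRun inserts a running row and registers a live abort controller', async () => {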
+    const repo = makeRepo();
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const handle = await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+    expect(repo.insert).toHaveBeenCalledWith(
+      expect.objectContaining({
+        chatId: 'chat-1',
+        workspaceId: 'ws-1',
+        createdBy: 'user-1',
+        status: 'running',
+        trigger: 'user',
+      }),
+    );
+    expect(handle.runId).toBe('run-1');
+    expect(handle.signal.aborted).toBe(false);
+    expect(svc.isLocallyActive('run-1')).toBe(true);
+  });
+
+  it('beginRun REJECTS the racer: a 23505 on the one-active-per-chat index throws RunAlreadyActiveError (not swallowed) and registers no controller', async () => {
+    // The race: the controller's cheap pre-check passed for BOTH concurrent
+    // turns, so the loser's INSERT hits the partial unique index. That rejection
+    // is the authoritative gate — it must surface, not be swallowed into an
+    // untracked turn.
+    const repo = makeRepo({
+      insert: jest.fn(async () => {
+        throw uniqueViolation(ONE_ACTIVE_RUN_PER_CHAT_INDEX);
+      }),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await expect(
+      svc.beginRun({ chatId: 'chat-1', workspaceId: 'ws-1', userId: 'user-1' }),
+    ).rejects.toBeInstanceOf(RunAlreadyActiveError);
+    // No controller leaked for a rejected start.
+    expect(svc.isLocallyActive('run-1')).toBe(false);
+  });
+
+  it('beginRun does NOT mask an unrelated unique violation as already-active', async () => {
+    // A 23505 on some OTHER constraint is a real bug, not the race — it must
+    // propagate unchanged so it is never silently treated as "already active".
+    const other = uniqueViolation('ai_chat_runs_pkey');
+    const repo = makeRepo({
+      insert: jest.fn(async () => {
+        throw other;
+      }),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await expect(
+      svc.beginRun({ chatId: 'chat-1', workspaceId: 'ws-1', userId: 'user-1' }),
+    ).rejects.toBe(other);
+  });
+
+  it('beginRun propagates a non-unique insert failure unchanged', async () => {
+    const boom = new Error('connection reset');
+    const repo = makeRepo({
+      insert: jest.fn(async () => {
+        throw boom;
+      }),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await expect(
+      svc.beginRun({ chatId: 'chat-1', workspaceId: 'ws-1', userId: 'user-1' }),
+    ).rejects.toBe(boom);
+  });
+
+  it('two concurrent begins on one chat: exactly one wins, the other is rejected as already-active', async () => {
+    // Integration-style: model the DB partial unique index with a one-shot slot.
+    // The first insert claims it; the second hits a 23505 on the active index.
+    let slotTaken = false;
+    const repo = makeRepo({
+      insert: jest.fn(async (v: any) => {
+        if (slotTaken) throw uniqueViolation(ONE_ACTIVE_RUN_PER_CHAT_INDEX);
+        slotTaken = true;
+        return { id: 'run-win', status: v.status, chatId: v.chatId };
+      }),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const results = await Promise.allSettled([
+      svc.beginRun({ chatId: 'chat-1', workspaceId: 'ws-1', userId: 'user-1' }),
+      svc.beginRun({ chatId: 'chat-1', workspaceId: 'ws-1', userId: 'user-1' }),
+    ]);
+    const fulfilled = results.filter((r) => r.status === 'fulfilled');
+    const rejected = results.filter((r) => r.status === 'rejected');
+    expect(fulfilled).toHaveLength(1);
+    expect(rejected).toHaveLength(1);
+    expect((rejected[0] as PromiseRejectedResult).reason).toBeInstanceOf(
+      RunAlreadyActiveError,
+    );
+    // Exactly the winner is locally active.
+    expect(svc.isLocallyActive('run-win')).toBe(true);
+  });
+
+  it('a SUBSCRIBER detaching does NOT abort the run (only an explicit stop does)', async () => {
+    const repo = makeRepo();
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const handle = await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+    // Model a browser disconnect: nothing in the run service is told to stop.
+    // The signal the agent loop consumes must stay un-aborted and the run stays
+    // locally active — i.e. it keeps running server-side.
+    expect(handle.signal.aborted).toBe(false);
+    expect(svc.isLocallyActive('run-1')).toBe(true);
+    // markStopRequested was never called by a mere detach.
+    expect(repo.markStopRequested).not.toHaveBeenCalled();
+  });
+
+  it('requestStop aborts the live controller, marks the row, and reports true', async () => {
+    const repo = makeRepo();
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const handle = await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+    const aborted = jest.fn();
+    handle.signal.addEventListener('abort', aborted);
+
+    const result = await svc.requestStop('run-1', 'ws-1');
+
+    expect(result).toBe(true);
+    expect(handle.signal.aborted).toBe(true);
+    expect(aborted).toHaveBeenCalledTimes(1);
+    expect(repo.markStopRequested).toHaveBeenCalledWith('run-1', 'ws-1');
+  });
+
+  it('requestStop on a run this replica does NOT hold still marks the row (true)', async () => {
+    // e.g. after a restart, or a sibling replica owns the controller. The row is
+    // marked so the owning replica/sweep settles it; we report a stop took effect.
+    const repo = makeRepo({
+      markStopRequested: jest.fn(async () => ({ id: 'run-9' })),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const result = await svc.requestStop('run-9', 'ws-1');
+    expect(result).toBe(true);
+    expect(svc.isLocallyActive('run-9')).toBe(false);
+  });
+
+  it('requestStop still aborts the live controller when markStopRequested rejects (transient DB error)', async () => {
+    // F15: the in-memory abort is the ONLY thing that stops a run and must not be
+    // hostage to the audit write of stop_requested_at. A transient failure on
+    // markStopRequested must NOT prevent abort() nor make requestStop throw.
+    const warnSpy = jest
+      .spyOn(Logger.prototype, 'warn')
+      .mockImplementation(() => undefined);
+    const repo = makeRepo({
+      markStopRequested: jest.fn(async () => {
+        throw new Error('pool exhausted');
+      }),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const handle = await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+    const aborted = jest.fn();
+    handle.signal.addEventListener('abort', aborted);
+
+    // Does NOT throw despite the DB write rejecting.
+    const result = await svc.requestStop('run-1', 'ws-1');
+
+    // The live turn was aborted even though the audit write failed...
+    expect(handle.signal.aborted).toBe(true);
+    expect(aborted).toHaveBeenCalledTimes(1);
+    expect(repo.markStopRequested).toHaveBeenCalledWith('run-1', 'ws-1');
+    // ...the catch branch logged the swallowed failure...
+    expect(warnSpy).toHaveBeenCalledTimes(1);
+    // ...and a stop is reported as having taken effect (the entry existed).
+    expect(result).toBe(true);
+    warnSpy.mockRestore();
+  });
+
+  it('requestStop on an already-settled run (nothing active) reports false', async () => {
+    const repo = makeRepo({
+      markStopRequested: jest.fn(async () => undefined),
+    });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    const result = await svc.requestStop('run-done', 'ws-1');
+    expect(result).toBe(false);
+  });
+
+  it('finalizeRun settles the row to the mapped status with finishedAt and drops the in-memory entry', async () => {
+    const repo = makeRepo();
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+    expect(svc.isLocallyActive('run-1')).toBe(true);
+
+    await svc.finalizeRun('run-1', 'ws-1', 'error', 'provider blew up');
+
+    expect(svc.isLocallyActive('run-1')).toBe(false);
+    expect(repo.update).toHaveBeenCalledWith(
+      'run-1',
+      'ws-1',
+      expect.objectContaining({
+        status: 'failed',
+        error: 'provider blew up',
+        finishedAt: expect.any(Date),
+      }),
+    );
+  });
+
+  it('finalizeRun is IDEMPOTENT: a second settle no-ops (single terminal write)', async () => {
+    // The #184 review fix: AiChatService.stream wraps the turn in a safety-net
+    // catch that settles a failed turn AND streamText's terminal callback may
+    // also settle — both routes call finalizeRun. Only the FIRST may write the
+    // terminal row; the second must no-op so a late settle can never clobber the
+    // real terminal status or double-write the row.
+    const repo = makeRepo();
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+
+    await svc.finalizeRun('run-1', 'ws-1', 'error', 'first');
+    expect(svc.isLocallyActive('run-1')).toBe(false);
+    // A second settle (e.g. a streamText callback firing after the catch) no-ops.
+    await svc.finalizeRun('run-1', 'ws-1', 'completed', undefined);
+
+    expect(repo.update).toHaveBeenCalledTimes(1);
+    expect(repo.update).toHaveBeenCalledWith(
+      'run-1',
+      'ws-1',
+      expect.objectContaining({ status: 'failed', error: 'first' }),
+    );
+  });
+
+  it('CONCURRENCY: two simultaneous finalizeRun on the same run write the terminal row EXACTLY ONCE (the 2nd caller exits synchronously at the atomic claim)', async () => {
+    // The CRITICAL race: AiChatService.stream's safety-net catch settles the turn
+    // to 'error' while a streamText terminal callback also settles it — both call
+    // finalizeRun for the SAME runId. The once-gate must close ATOMICALLY: a
+    // `settled.has` check alone is read BEFORE the awaited UPDATE, so both callers
+    // would pass it and BOTH write the row (last-write-wins clobber + double
+    // write). The fix claims the run with a SYNCHRONOUS `active.delete` before any
+    // await, so the second caller returns in the same tick, before the UPDATE.
+    //
+    // We force the two calls to overlap by making `update` return a promise we
+    // resolve only AFTER both finalizeRun calls have run their synchronous bodies.
+    let resolveUpdate!: (v: unknown) => void;
+    const updateGate = new Promise((res) => {
+      resolveUpdate = res;
+    });
+    const update = jest.fn(() => updateGate);
+    const repo = makeRepo({ update });
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+
+    // Fire both before the (pending) update resolves. The first synchronously
+    // claims the entry (active.delete) and awaits update; the second, started in
+    // the same macrotask, finds the entry already gone and returns at the claim
+    // WITHOUT ever calling update.
+    const p1 = svc.finalizeRun('run-1', 'ws-1', 'completed');
+    const p2 = svc.finalizeRun('run-1', 'ws-1', 'error', 'safety-net');
+
+    // The decisive assertion: exactly one caller reached the terminal UPDATE.
+    expect(update).toHaveBeenCalledTimes(1);
+
+    // Let the single in-flight update land; both calls resolve cleanly.
+    resolveUpdate({ id: 'run-1' });
+    await Promise.all([p1, p2]);
+
+    expect(update).toHaveBeenCalledTimes(1);
+    // The winner is the FIRST caller ('completed' -> 'succeeded'); the late
+    // 'error' settle never wrote, so it could not clobber the real status.
+    expect(update).toHaveBeenCalledWith(
+      'run-1',
+      'ws-1',
+      expect.objectContaining({ status: 'succeeded' }),
+    );
+    expect(svc.isLocallyActive('run-1')).toBe(false);
+  });
+
+  it('F6: a TRANSIENT terminal-write failure is ridden out by the bounded retry — the run is settled, not stranded', async () => {
+    // The bug: finalizeRun used to DROP the in-memory entry BEFORE the terminal
+    // UPDATE, then only warn-log a failure. A single transient blip (pool
+    // exhaustion / deadlock / connection hiccup) on that PK UPDATE left the row
+    // 'running' with nothing left to recover it -> every later turn in that chat
+    // 409s until a restart. The fix retries the UPDATE (bounded) and records the
+    // run as settled only once the terminal write has landed.
+    let calls = 0;
+    const repo = makeRepo({
+      update: jest.fn(async () => {
+        calls += 1;
+        if (calls === 1) throw new Error('deadlock detected');
+        return { id: 'run-1' };
+      }),
+    });
+    jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+
+    await svc.finalizeRun('run-1', 'ws-1', 'completed');
+
+    // The retry landed the terminal write: the entry is dropped (slot freed) and
+    // the row carries the real terminal status — NOT stranded at 'running'.
+    expect(svc.isLocallyActive('run-1')).toBe(false);
+    expect(repo.update).toHaveBeenCalledTimes(2);
+    expect(repo.update).toHaveBeenLastCalledWith(
+      'run-1',
+      'ws-1',
+      expect.objectContaining({ status: 'succeeded' }),
+    );
+  });
+
+  it('F6: if the terminal write keeps failing, the entry is RETAINED and a LATER settle completes it (chat not permanently 409d)', async () => {
+    // Worst case: the DB is down for the whole first finalize (all attempts fail).
+    // The run must NOT be silently lost — the entry stays so a subsequent settle
+    // (a streamText callback, requestStop -> onAbort, or a future sweep) can retry.
+    let healthy = false;
+    const repo = makeRepo({
+      update: jest.fn(async () => {
+        if (!healthy) throw new Error('pool exhausted');
+        return { id: 'run-1' };
+      }),
+    });
+    jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
+    const errorSpy = jest
+      .spyOn(Logger.prototype, 'error')
+      .mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await svc.beginRun({
+      chatId: 'chat-1',
+      workspaceId: 'ws-1',
+      userId: 'user-1',
+    });
+
+    // First settle: every bounded attempt fails -> entry retained, NOT settled.
+    await svc.finalizeRun('run-1', 'ws-1', 'completed');
+    expect(svc.isLocallyActive('run-1')).toBe(true);
+    // F12: the give-up emits ONE explicit, greppable ERROR (run + chat context)
+    // so an operator can tell "gave up, run held in memory" from a per-attempt
+    // blip — distinct from the per-attempt warns.
+    const gaveUp = errorSpy.mock.calls.some(
+      (c) =>
+        /NON-TERMINAL/.test(String(c[0])) &&
+        /run-1/.test(String(c[0])) &&
+        /chat-1/.test(String(c[0])),
+    );
+    expect(gaveUp).toBe(true);
+
+    // The DB recovers; a later settle now succeeds and frees the slot.
+    healthy = true;
+    await svc.finalizeRun('run-1', 'ws-1', 'completed');
+    expect(svc.isLocallyActive('run-1')).toBe(false);
+    expect(repo.update).toHaveBeenLastCalledWith(
+      'run-1',
+      'ws-1',
+      expect.objectContaining({ status: 'succeeded' }),
+    );
+
+    // And it is now idempotent: a further settle no-ops (terminal row already
+    // written), so a double-settle can never clobber the real status.
+    const callsBefore = repo.update.mock.calls.length;
+    await svc.finalizeRun('run-1', 'ws-1', 'error', 'late');
+    expect(repo.update).toHaveBeenCalledTimes(callsBefore);
+  });
+
+  it('recordStep / linkAssistantMessage are best-effort: a repo failure is swallowed', async () => {
+    const repo = makeRepo({
+      update: jest.fn(async () => {
+        throw new Error('transient');
+      }),
+    });
+    jest.spyOn(Logger.prototype, 'warn').mockImplementation(() => undefined);
+    const svc = new AiChatRunService(repo as never, makeEnv() as never);
+    await expect(svc.recordStep('run-1', 'ws-1', 3)).resolves.toBeUndefined();
+    await expect(
+      svc.linkAssistantMessage('run-1', 'ws-1', 'msg-1'),
+    ).resolves.toBeUndefined();
+  });
+});
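Review note on the CONCURRENCY and F6 tests above: a minimal, self-contained sketch of the synchronous once-claim they exercise. This is not the service's code. `OnceGate`, `WriteFn`, and the `begin`/`finalize` names are hypothetical, and the bounded retry is left out; the sketch only assumes that the claim is a `Map.delete` with no `await` before it and that the entry is restored when the write fails.

```typescript
// Hypothetical illustration only: OnceGate / WriteFn are not names from the PR.
type WriteFn = (id: string, status: string) => Promise<void>;

class OnceGate {
  private readonly active = new Map<string, { chatId: string }>();
  private readonly settled = new Set<string>();
  private readonly write: WriteFn;

  constructor(write: WriteFn) {
    this.write = write;
  }

  begin(id: string, chatId: string): void {
    this.active.set(id, { chatId });
  }

  // Resolves true only for the caller whose terminal write landed.
  async finalize(id: string, status: string): Promise<boolean> {
    if (this.settled.has(id)) return false; // terminal row already written
    const entry = this.active.get(id);
    // Synchronous claim: Map.delete returns true only for the caller that
    // actually removed the entry, and nothing is awaited before this line.
    if (entry === undefined || !this.active.delete(id)) return false;
    try {
      await this.write(id, status);
      this.settled.add(id); // settled only AFTER the write succeeded
      return true;
    } catch {
      this.active.set(id, entry); // restore so a later settle can retry
      return false;
    }
  }
}

// Hold the write open so both finalize calls overlap.
const writes: string[] = [];
let release: () => void = () => undefined;
const pending = new Promise<void>((resolve) => {
  release = resolve;
});
const gate = new OnceGate((id, status) => {
  writes.push(`${id}:${status}`);
  return pending;
});

gate.begin('run-1', 'chat-1');
const first = gate.finalize('run-1', 'succeeded');
const second = gate.finalize('run-1', 'failed');

// Both synchronous bodies have already run; only the first reached the write.
console.log(writes); // [ 'run-1:succeeded' ]

release();
void Promise.all([first, second]).then((results) => console.log(results));
```

The second caller is turned away before any write is attempted, which is why the spec can assert the call count before resolving the pending update.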
@@ -0,0 +1,452 @@
+import { Injectable, Logger, OnModuleInit } from '@nestjs/common';
+import { AiChatRunRepo } from '@docmost/db/repos/ai-chat/ai-chat-run.repo';
+import { AiChatRun } from '@docmost/db/types/entity.types';
+import { isUniqueViolation, violatedConstraint } from '@docmost/db/utils';
+import { EnvironmentService } from '../../integrations/environment/environment.service';
+
+/** Name of the partial unique index enforcing "one active run per chat" (see the
+ *  ai_chat_runs migration). A 23505 on THIS constraint is the race-safe signal
+ *  that a concurrent turn already owns the chat — distinct from any other unique
+ *  collision, which must NOT be silently treated as "already active". */
+export const ONE_ACTIVE_RUN_PER_CHAT_INDEX = 'ai_chat_runs_one_active_per_chat';
+
+/**
+ * Thrown by {@link AiChatRunService.beginRun} when the run-row INSERT loses the
+ * race for a chat's single active slot (the partial unique index rejects it with
+ * a 23505). This is the AUTHORITATIVE concurrency gate: the controller's cheap
+ * pre-check is only a fast-path, and a request that slips past it must NOT run
+ * untracked. The caller (AiChatService.stream) translates this into a 409 and
+ * aborts the turn BEFORE any AI/provider call.
+ */
+export class RunAlreadyActiveError extends Error {
+  constructor(public readonly chatId: string) {
+    super(`An agent run is already in progress for chat ${chatId}`);
+    this.name = 'RunAlreadyActiveError';
+  }
+}
+
+/**
+ * The terminal status of a TURN (the #183 assistant-row lifecycle) maps onto the
+ * terminal status of a RUN (#184). A turn that completed -> the run succeeded; a
+ * turn that errored -> the run failed; a turn aborted (explicit user stop) -> the
+ * run aborted. Pure + unit-testable.
+ */
+export type TurnTerminalStatus = 'completed' | 'error' | 'aborted';
+export type RunTerminalStatus = 'succeeded' | 'failed' | 'aborted';
+
+export function mapTurnStatusToRun(
+  status: TurnTerminalStatus,
+): RunTerminalStatus {
+  switch (status) {
+    case 'completed':
+      return 'succeeded';
+    case 'error':
+      return 'failed';
+    case 'aborted':
+      return 'aborted';
+  }
+}
+
+/** An in-flight run held in process memory: its AbortController is the ONLY thing
+ *  that can stop the turn (an explicit user stop), independent of the browser
+ *  socket. A mere disconnect never touches it, so the run keeps going. */
+interface ActiveRun {
+  controller: AbortController;
+  chatId: string;
+  workspaceId: string;
+}
+
+/** The live handle the streaming path drives a run through (returned by
+ *  {@link AiChatRunService.beginRun}). The `signal` governs the agent loop's
+ *  abort — wired to the run, NOT to the HTTP socket. */
+export interface RunHandle {
+  runId: string;
+  signal: AbortSignal;
+}
+
+/**
+ * AiChatRunService (#184 phase 1) — owns the agent RUN as a first-class,
+ * server-side lifecycle object detached from the HTTP request / browser window.
+ *
+ * Responsibilities:
+ *  - create a run row when a turn starts (inserted directly as 'running'; the
+ *    'pending' status is only the column default + a reserved value, never
+ *    written by code in phase 1) and register an in-memory AbortController for it
+ *    (the explicit-stop lever);
+ *  - finalize the run row (succeeded / failed / aborted) and unregister it;
+ *  - service an EXPLICIT user stop (`requestStop`) — the ONLY thing that aborts a
+ *    run; a browser disconnect deliberately does NOT;
+ *  - crash-recovery sweep of dangling runs on startup.
+ *
+ * The agent loop itself still runs in AiChatService.stream, reusing #183's
+ * step-granular durable write path (`consumeStream` already drains it
+ * independently of the socket); this service only wraps it in a durable
+ * lifecycle and an abort handle that outlives the subscriber.
+ */
+@Injectable()
+export class AiChatRunService implements OnModuleInit {
+  private readonly logger = new Logger(AiChatRunService.name);
+
+  // runId -> ActiveRun. Process-local on purpose (phase 1 is single-process /
+  // in-memory transport; a cross-process BullMQ runner + Redis stop-signal is
+  // deferred to phase 2). A stop for a runId not in this map (e.g. after a
+  // restart) still records `stop_requested_at` on the row.
+  private readonly active = new Map<string, ActiveRun>();
+
+  // runIds whose TERMINAL row write has SUCCEEDED — the idempotency once-gate
+  // (F6). A finalize must short-circuit only AFTER the terminal write has landed,
+  // NOT merely after the in-memory entry was dropped: a transient UPDATE failure
+  // has to stay retryable, so "already settled" means "row already terminal", not
+  // "entry already gone". Grows by one short UUID per finished run over process
+  // uptime — negligible in phase 1's single process.
+  private readonly settled = new Set<string>();
+
+  // Bounded retry for the terminal write (F6): a single PK UPDATE can fail
+  // transiently under many fire-and-forget writes (pool exhaustion, deadlock, a
+  // brief connection blip). Riding out that blip in-place matters because the
+  // dominant success path (streamText onFinish) settles exactly ONCE — if that
+  // write is dropped and never retried, the row is stranded 'running' and the
+  // one-active-run gate 409s every future turn in the chat until a restart (no
+  // periodic sweep in phase 1).
+  private static readonly FINALIZE_MAX_ATTEMPTS = 3;
+  private static readonly FINALIZE_RETRY_BASE_MS = 50;
+
+  constructor(
+    private readonly runRepo: AiChatRunRepo,
+    private readonly environment: EnvironmentService,
+  ) {}
+
+  /**
+   * Crash-recovery sweep on server start: settle EVERY run still left
+   * pending/running to 'aborted' (F1 / DECISION C). The boot sweep is
+   * UNCONDITIONAL — no staleness window — because phase 1 is single-process: on a
+   * fresh boot any pending|running run is definitionally hung (no live runner owns
+   * it), so even a fast restart (deploy/OOM within minutes of the last step) can
+   * no longer leave a run stuck 'running' forever (which would make the
+   * one-active-run gate 409 every future turn in that chat). The staleness window
+   * is reintroduced only for the phase-2 multi-instance timer sweep, where a
+   * booting replica must not abort a run another replica is actively executing.
+   * Best-effort — a sweep failure is logged but MUST NOT block startup (mirrors
+   * AiChatService.onModuleInit for #183).
+   */
+  async onModuleInit(): Promise<void> {
+    this.warnIfMultiInstance();
+    try {
+      // No `staleMs`: unconditional boot sweep (F1). See AiChatRunRepo.sweepRunning.
+      const swept = await this.runRepo.sweepRunning();
+      if (swept > 0) {
+        this.logger.log(
+          `Startup sweep: marked ${swept} dangling agent run(s) as 'aborted'.`,
+        );
+      }
+    } catch (err) {
+      this.logger.warn(
+        `Startup sweep of dangling runs failed: ${
+          err instanceof Error ? err.message : 'unknown error'
+        }`,
+      );
+    }
+  }
+
+  /**
+   * F2 (DECISION A): autonomous runs are SINGLE-INSTANCE-ONLY in phase 1. An
+   * explicit Stop, and the in-memory AbortController that backs it, are
+   * process-local: a Stop only aborts the live turn if it lands on the SAME
+   * replica that owns the run (it still stamps `stop_requested_at` cross-instance,
+   * but nothing reads that flag during an active run yet). Cross-instance pub/sub
+   * stop is phase 2. So if the deployment is horizontally scaled, warn loudly at
+   * startup that a Stop may not reach a run executing on another replica.
+   *
+   * DETECTION: this codebase always wires the socket.io Redis adapter (REDIS_URL
+   * is mandatory), so the adapter alone is NOT a horizontal-scaling signal. The
+   * authoritative signal the codebase has is `CLOUD=true`
+   * (`EnvironmentService.isCloud()`), the Docmost-cloud multi-replica deployment.
+   * We warn whenever that is set, because any workspace could enable
+   * settings.ai.autonomousRuns. A self-hosted operator running multiple replicas
+   * behind a load balancer is also multi-instance but is NOT detected here; the
+   * deploy docs (.env.example / AGENTS.md) spell out the
+   * single-instance constraint for that case.
+   */
+  private warnIfMultiInstance(): void {
+    if (this.environment.isCloud()) {
+      this.logger.warn(
+        'Autonomous agent runs (settings.ai.autonomousRuns) are SINGLE-INSTANCE-ONLY ' +
+          'in phase 1: a horizontally-scaled deployment was detected (CLOUD=true). ' +
+          'An explicit Stop only aborts a run executing on the same replica that owns ' +
+          'it (cross-instance Stop is not yet reliable — phase 2). Run a single ' +
+          'instance if you enable autonomousRuns, or keep the flag off.',
+      );
+    }
+  }
+
+  /**
+   * Start a run for a turn: insert the run row (status 'running', startedAt now),
+   * register a fresh AbortController for it, and return a {@link RunHandle} whose
+   * `signal` the agent loop uses. The DB partial unique index guarantees at most
+   * one active run per chat — a second concurrent start on the same chat REJECTS
+   * at the insert (a 23505 on {@link ONE_ACTIVE_RUN_PER_CHAT_INDEX}). That
+   * rejection is the AUTHORITATIVE race gate: it is surfaced as a distinct
+   * {@link RunAlreadyActiveError} (NOT swallowed), so the caller turns it into a
+   * 409 and never streams an untracked turn. The controller is registered AFTER a
+   * successful insert so a rejected start leaks nothing.
+   */
+  async beginRun(args: {
+    chatId: string;
+    workspaceId: string;
+    userId: string;
+    trigger?: string;
+  }): Promise<RunHandle> {
+    let run: AiChatRun;
+    try {
+      run = await this.runRepo.insert({
+        chatId: args.chatId,
+        workspaceId: args.workspaceId,
+        createdBy: args.userId,
+        trigger: args.trigger ?? 'user',
+        status: 'running',
+        startedAt: new Date(),
+      });
+    } catch (err) {
+      // The race backstop: a concurrent turn already holds this chat's single
+      // active slot, so the partial unique index rejected our insert. Surface a
+      // distinct signal — the caller MUST reject this turn (409), not run it
+      // untracked. Any OTHER error propagates unchanged.
+      if (
+        isUniqueViolation(err) &&
+        violatedConstraint(err) === ONE_ACTIVE_RUN_PER_CHAT_INDEX
+      ) {
+        throw new RunAlreadyActiveError(args.chatId);
+      }
+      throw err;
+    }
+    const controller = new AbortController();
+    this.active.set(run.id, {
+      controller,
+      chatId: args.chatId,
+      workspaceId: args.workspaceId,
+    });
+    return { runId: run.id, signal: controller.signal };
+  }
+
+  /** Link the assistant message (the #183 projection) to its run. Best-effort. */
+  async linkAssistantMessage(
+    runId: string,
+    workspaceId: string,
+    assistantMessageId: string,
+  ): Promise<void> {
+    try {
+      await this.runRepo.update(runId, workspaceId, { assistantMessageId });
+    } catch (err) {
+      this.logger.warn(
+        `Failed to link assistant message to run ${runId}: ${
+          err instanceof Error ? err.message : 'unknown error'
+        }`,
+      );
+    }
+  }
+
+  /** Persist progress: bump the run's finished-step count. Best-effort (never
+   *  blocks or breaks the stream). */
+  async recordStep(
+    runId: string,
+    workspaceId: string,
+    stepCount: number,
+  ): Promise<void> {
+    try {
+      await this.runRepo.update(runId, workspaceId, { stepCount });
+    } catch (err) {
+      this.logger.warn(
+        `Failed to record step for run ${runId}: ${
+          err instanceof Error ? err.message : 'unknown error'
+        }`,
+      );
+    }
+  }
+
+  /**
+   * Finalize a run to its terminal status (succeeded / failed / aborted),
+   * stamping finishedAt + any error. Best-effort, but ROBUST against a transient
+   * terminal-write failure (F6) AND atomically safe against a concurrent settle.
+   *
+   * ATOMIC ONCE-CLAIM (the gate must close in ONE synchronous tick): two
+   * finalizeRun calls for the SAME run can race — the documented real path is
+   * AiChatService.stream's safety-net catch settling the turn to 'error' while a
+   * streamText terminal callback (onFinish/onAbort/onError) ALSO settles it. The
+   * `settled.has` check alone is NOT a gate: it is read BEFORE the awaited UPDATE,
+   * so two callers can both see `false` and both write the row (last-write-wins
+   * clobbers the real terminal status, and the bounded retry only widens that
+   * window). The claim therefore happens via `active.delete`, a SYNCHRONOUS
+   * check-and-clear with NO await between the gate and the entry removal: the
+   * second concurrent caller finds the entry already gone and returns in the same
+   * tick, before any UPDATE. The transition "nobody is finalizing" -> "I am
+   * finalizing" is thus a single atomic step.
+   *
+   * ORDER MATTERS (F6): once we own the claim, the terminal UPDATE happens FIRST;
+   * only once it SUCCEEDS do we record the run as settled. If the UPDATE fails on
+   * every bounded attempt we RESTORE the in-memory entry, leave the run UNsettled,
+   * and emit an ERROR signal that the row is left non-terminal 'running' (which
+   * would 409 every future turn in the chat until recovery). An in-process retry
+   * by a LATER settle is only POSSIBLE, never guaranteed: it needs (a) the entry
+   * to have been restored at the give-up path AND (b) a fresh settler to arrive
+   * AFTER that restore. A concurrent settler that arrives DURING the retry window
+   * — while the entry is deleted for backoff and not yet restored — is consumed at
+   * the synchronous `active.delete` claim (it finds nothing to delete and returns
+   * a no-op), so it does NOT become an in-process retrier. The NO-streamText path
+   * (the turn threw before streamText was wired, so ONLY the safety-net ever
+   * settles) likewise has no second in-process settler at all. The UNCONDITIONAL
+   * backstop in every case is the boot sweep on the next restart (phase 1 has no
+   * periodic in-process sweep); the retained entry is bounded (cleared on restart)
+   * and harmless meanwhile.
+   *
+   * IDEMPOTENT on SUCCESS (#184 review): the terminal write happens AT MOST ONCE
+   * per run. After a successful write the once-gate keys off {@link settled} (the
+   * terminal row already written) so a settle arriving AFTER the entry was already
+   * dropped-and-settled returns early; a settle racing the in-flight write is
+   * stopped earlier still, by the `active.delete` claim. Either way a genuine
+   * double-settle collapses to a single write and a late settle can never clobber
+   * the real terminal status or double-write the row.
+   */
+  async finalizeRun(
+    runId: string,
+    workspaceId: string,
+    turnStatus: TurnTerminalStatus,
+    error?: string,
+  ): Promise<void> {
+    // ---- Atomic once-claim (synchronous; NO await before the gate closes) ----
+    // Already terminally written -> idempotent no-op.
+    if (this.settled.has(runId)) return;
+    // Capture the entry BEFORE the delete so a total-failure path can restore it.
+    const entry = this.active.get(runId);
+    // SYNCHRONOUS check-and-clear: the FIRST caller deletes (claims) the entry;
+    // any concurrent SECOND caller finds nothing to delete and returns HERE, in
+    // the same tick, before any await — so it can never reach the UPDATE.
+    if (!this.active.delete(runId)) return;
+
+    let lastError: unknown;
+    for (
+      let attempt = 1;
+      attempt <= AiChatRunService.FINALIZE_MAX_ATTEMPTS;
+      attempt++
+    ) {
+      try {
+        await this.runRepo.update(runId, workspaceId, {
+          status: mapTurnStatusToRun(turnStatus),
+          finishedAt: new Date(),
+          error: error ?? null,
+        });
+        // Terminal write landed: arm the once-gate. The entry is already gone
+        // (claimed above); we do NOT restore it. The slot is now free.
+        this.settled.add(runId);
+        return;
+      } catch (err) {
+        lastError = err;
+        this.logger.warn(
+          `Failed to finalize run ${runId} (attempt ${attempt}/${
+            AiChatRunService.FINALIZE_MAX_ATTEMPTS
+          }): ${err instanceof Error ? err.message : 'unknown error'}`,
+        );
+        if (attempt < AiChatRunService.FINALIZE_MAX_ATTEMPTS) {
+          await this.delay(AiChatRunService.FINALIZE_RETRY_BASE_MS * attempt);
+        }
+      }
+    }
+    // Every attempt failed: this is a give-up, materially worse than a per-attempt
+    // blip — the row is left NON-TERMINAL ('running'), so emit ONE explicit,
+    // greppable ERROR so an operator can tell "survived a blip" from "gave up, run
+    // held in memory until recovery" (the last warn alone says only "attempt 3/3").
+    this.logger.error(
+      `Run ${runId} (chat ${entry?.chatId ?? 'unknown'}) left NON-TERMINAL ` +
+        `('running'): terminal write failed after ${
+          AiChatRunService.FINALIZE_MAX_ATTEMPTS
+        } attempts; entry retained in memory, recovery deferred to next settle / ` +
+        `boot sweep`,
+      lastError,
+    );
+    // RESTORE the claimed entry (and leave the run UNsettled) so a LATER settle
+    // that arrives AFTER this restore MAY retry the terminal write — but that
+    // in-process retry is NOT guaranteed (a concurrent settler caught in the retry
+    // window above is consumed at the `active.delete` claim, and the no-streamText
+    // path has no second settler at all). The UNCONDITIONAL backstop in every case
+    // is the boot sweep on the next restart; the restored entry is bounded and
+    // cleared on restart.
+    if (entry) this.active.set(runId, entry);
+  }
+
+  /** Small async backoff between terminal-write retries (F6). Isolated so it is
+   *  trivial to stub/fake-time in tests. */
+  private delay(ms: number): Promise<void> {
+    return new Promise((resolve) => setTimeout(resolve, ms));
+  }
+
+  /**
+   * Request an EXPLICIT stop of a run (the user pressed Stop). This is the ONLY
+   * thing that aborts a run — distinct from a browser disconnect, which leaves
+   * the run going. Aborts the in-process controller FIRST (the only thing that
+   * actually stops the run, if this replica owns it), then makes a best-effort
+   * attempt to stamp `stop_requested_at` — that audit write stamps only while the
+   * row is active and may be skipped on a DB error or lost to the finalize race,
+   * which is acceptable since the row still settles as 'aborted'. Returns true
+   * when a stop took effect (row marked and/or controller aborted), false when
+   * there was nothing active to stop.
+   */
+  async requestStop(runId: string, workspaceId: string): Promise<boolean> {
+    const entry = this.active.get(runId);
+    if (entry) {
+      // Abort the live turn FIRST -> streamText onAbort fires -> the partial is
+      // persisted (#183) and finalizeRun settles the row as 'aborted'. This is
+      // the ONLY thing that aborts a run, so it MUST NOT be hostage to the audit
+      // write below: a transient failure on `markStopRequested` (pool exhaustion,
+      // deadlock, dropped connection) must never leave the run executing despite
+      // an explicit Stop. At worst only the `stop_requested_at` timestamp is lost.
+      entry.controller.abort();
+    }
+    // Record `stop_requested_at` (best-effort). A transient DB failure here is
+    // logged and treated as `marked = false`; the abort above already took
+    // effect, so the error is swallowed rather than rethrown. Note: because
+    // markStopRequested only stamps while the row is active, aborting first means
+    // even a healthy write can lose the race against the resulting finalize and
+    // skip the stamp — acceptable, as the row still settles as 'aborted' and only
+    // this audit timestamp may be lost.
+    let marked: unknown;
+    try {
+      marked = await this.runRepo.markStopRequested(runId, workspaceId);
+    } catch (err) {
+      marked = undefined;
+      this.logger.warn(
+        `requestStop: markStopRequested failed for run ${runId} ` +
+          `(stop_requested_at not recorded); abort already issued: ` +
+          `${err instanceof Error ? err.message : String(err)}`,
+      );
+    }
+    return Boolean(marked) || Boolean(entry);
+  }
+
+  /** Latest persisted run for a chat — the reconnect target (an in-flight or
+   *  finished run). Pure read-through to the repo. */
+  getLatestForChat(
+    chatId: string,
+    workspaceId: string,
+  ): Promise<AiChatRun | undefined> {
+    return this.runRepo.findLatestByChat(chatId, workspaceId);
+  }
+
+  /** Fetch a run by id (workspace-scoped). Used to resolve + ownership-check an
+   *  explicit stop targeting a runId. */
+  getRun(runId: string, workspaceId: string): Promise<AiChatRun | undefined> {
+    return this.runRepo.findById(runId, workspaceId);
+  }
+
+  /** The active run on a chat, if any (used to reject a concurrent start with a
+   *  clean 409 before committing to the stream). */
+  getActiveForChat(
+    chatId: string,
+    workspaceId: string,
+  ): Promise<AiChatRun | undefined> {
+    return this.runRepo.findActiveByChat(chatId, workspaceId);
+  }
+
+  /** Test/diagnostic seam: whether this replica is holding a live controller for
+   *  the run. */
+  isLocallyActive(runId: string): boolean {
+    return this.active.has(runId);
+  }
+}
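Review note: the once-claim in finalizeRun relies on `Map.prototype.delete` returning `true` only for the caller that actually removed the key, with no `await` between the check and the removal. The sketch below isolates that pattern under stated assumptions. `OnceClaim`, `begin`, `finalize`, `writes`, and `noopWrite` are hypothetical names for illustration only, the retry loop and logging are left out, and this is not the service's actual code.

```typescript
// Hypothetical, simplified stand-in for the service's in-memory bookkeeping.
type Entry = { chatId: string };

class OnceClaim {
  private readonly active = new Map<string, Entry>();
  private readonly settled = new Set<string>();
  // Counts how many callers got past the gate and attempted the terminal write.
  writes = 0;

  begin(runId: string, entry: Entry): void {
    this.active.set(runId, entry);
  }

  async finalize(runId: string, write: () => Promise<void>): Promise<void> {
    // Already terminally written: idempotent no-op.
    if (this.settled.has(runId)) return;
    // Capture the entry before deleting so a failed write can restore it.
    const entry = this.active.get(runId);
    // Synchronous check-and-clear: only the first caller sees `true` here.
    if (!this.active.delete(runId)) return;
    try {
      this.writes++;
      await write();
      // The write landed, so arm the once-gate.
      this.settled.add(runId);
    } catch {
      // The write failed: restore the entry so a later settle may retry.
      if (entry) this.active.set(runId, entry);
    }
  }
}

const claim = new OnceClaim();
claim.begin('run-1', { chatId: 'c1' });
const noopWrite = (): Promise<void> => Promise.resolve();

// Two settles issued in the same tick, neither awaited yet.
void claim.finalize('run-1', noopWrite);
void claim.finalize('run-1', noopWrite);
// claim.writes is 1 here: the second call returned at the delete gate.
```

The second call returns before any write because the first call has already removed the entry by the time it suspends at its first `await`. A plain `settled.has` check could not do this, since `settled` is only updated after the awaited write resolves.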
@@ -25,6 +25,7 @@ describe('AiChatController.boundChat', () => {
    };
    const controller = new AiChatController(
      {} as never,
+      {} as never, // aiChatRunService
      aiChatRepo as never,
      {} as never,
      {} as never,
@@ -53,6 +53,7 @@ describe('AiChatController.export', () => {
    };
    const controller = new AiChatController(
      {} as never,
+      {} as never, // aiChatRunService
      aiChatRepo as never,
      aiChatMessageRepo as never,
      {} as never,
@@ -0,0 +1,164 @@
+import { BadRequestException, ForbiddenException } from '@nestjs/common';
+import { AiChatController } from './ai-chat.controller';
+import type { User, Workspace } from '@docmost/db/types/entity.types';
+
+/**
+ * Wiring spec for the #184 run-reconnect / run-stop endpoints
+ * (`POST /ai-chat/run` and `POST /ai-chat/stop`). Both are OWNER-gated via
+ * assertOwnedChat (the requesting user must own the chat) and NOT flag-gated.
+ * Exercised with hand-rolled mocks — no Nest graph, no DB. The controller's
+ * constructor order is (aiChatService, aiChatRunService, aiChatRepo,
+ * aiChatMessageRepo, aiTranscription, pageRepo).
+ */
+describe('AiChatController run endpoints (#184)', () => {
+  const user = { id: 'u1' } as User;
+  const workspace = { id: 'ws1' } as Workspace;
+
+  function makeController(opts: {
+    chat?: unknown; // what aiChatRepo.findById returns (owner-gate)
+    run?: unknown; // getLatestForChat / getRun result
+    activeRun?: unknown; // getActiveForChat result
+    message?: unknown; // aiChatMessageRepo.findById result
+    stopped?: boolean; // requestStop result
+  }) {
+    const aiChatRunService = {
+      getLatestForChat: jest.fn().mockResolvedValue(opts.run),
+      getRun: jest.fn().mockResolvedValue(opts.run),
+      getActiveForChat: jest.fn().mockResolvedValue(opts.activeRun),
+      requestStop: jest.fn().mockResolvedValue(opts.stopped ?? false),
+    };
+    const aiChatRepo = {
+      findById: jest.fn().mockResolvedValue(opts.chat),
+    };
+    const aiChatMessageRepo = {
+      findById: jest.fn().mockResolvedValue(opts.message),
+    };
+    const controller = new AiChatController(
+      {} as never, // aiChatService
+      aiChatRunService as never,
+      aiChatRepo as never,
+      aiChatMessageRepo as never,
+      {} as never, // aiTranscription
+      {} as never, // pageRepo
+    );
+    return { controller, aiChatRunService, aiChatRepo, aiChatMessageRepo };
+  }
+
+  describe('POST /ai-chat/run (getRun)', () => {
+    it('owner-gates: a chat the user does not own throws ForbiddenException', async () => {
+      const { controller, aiChatRunService } = makeController({
+        chat: { id: 'c1', creatorId: 'someone-else' },
+      });
+      await expect(
+        controller.getRun({ chatId: 'c1' }, user, workspace),
+      ).rejects.toBeInstanceOf(ForbiddenException);
+      // It must NOT reach the run lookup once the owner-gate fails.
+      expect(aiChatRunService.getLatestForChat).not.toHaveBeenCalled();
+    });
+
+    it('returns { run: null, message: null } when the chat has never had a run', async () => {
+      const { controller, aiChatRunService } = makeController({
+        chat: { id: 'c1', creatorId: 'u1' },
+        run: undefined,
+      });
+      const res = await controller.getRun({ chatId: 'c1' }, user, workspace);
+      expect(res).toEqual({ run: null, message: null });
+      expect(aiChatRunService.getLatestForChat).toHaveBeenCalledWith(
+        'c1',
+        'ws1',
+      );
+    });
+
+    it('returns the run and its projected assistant message', async () => {
+      const run = { id: 'run-1', chatId: 'c1', assistantMessageId: 'm1' };
+      const message = { id: 'm1', role: 'assistant' };
+      const { controller, aiChatMessageRepo } = makeController({
+        chat: { id: 'c1', creatorId: 'u1' },
+        run,
+        message,
+      });
+      const res = await controller.getRun({ chatId: 'c1' }, user, workspace);
+      expect(res).toEqual({ run, message });
+      expect(aiChatMessageRepo.findById).toHaveBeenCalledWith('m1', 'ws1');
+    });
+
+    it('returns message: null when the run has no linked assistant message', async () => {
+      const run = { id: 'run-1', chatId: 'c1', assistantMessageId: null };
+      const { controller, aiChatMessageRepo } = makeController({
+        chat: { id: 'c1', creatorId: 'u1' },
+        run,
+      });
+      const res = await controller.getRun({ chatId: 'c1' }, user, workspace);
+      expect(res).toEqual({ run, message: null });
+      expect(aiChatMessageRepo.findById).not.toHaveBeenCalled();
+    });
+  });
+
+  describe('POST /ai-chat/stop (stopRun)', () => {
+    it('throws BadRequestException when neither runId nor chatId is given', async () => {
+      const { controller } = makeController({});
+      await expect(
+        controller.stopRun({}, user, workspace),
+      ).rejects.toBeInstanceOf(BadRequestException);
+    });
+
+    it('stops by runId: owner-gates via the run’s chat, then requests the stop', async () => {
+      const { controller, aiChatRunService, aiChatRepo } = makeController({
+        run: { id: 'run-1', chatId: 'c1' },
+        chat: { id: 'c1', creatorId: 'u1' },
+        stopped: true,
+      });
+      const res = await controller.stopRun({ runId: 'run-1' }, user, workspace);
+      expect(res).toEqual({ stopped: true });
+      expect(aiChatRunService.getRun).toHaveBeenCalledWith('run-1', 'ws1');
+      expect(aiChatRepo.findById).toHaveBeenCalledWith('c1', 'ws1');
+      expect(aiChatRunService.requestStop).toHaveBeenCalledWith('run-1', 'ws1');
+    });
+
+    it('stops by runId: a foreign run’s chat throws ForbiddenException (no stop)', async () => {
+      const { controller, aiChatRunService } = makeController({
+        run: { id: 'run-1', chatId: 'c1' },
+        chat: { id: 'c1', creatorId: 'someone-else' },
+      });
+      await expect(
+        controller.stopRun({ runId: 'run-1' }, user, workspace),
+      ).rejects.toBeInstanceOf(ForbiddenException);
+      expect(aiChatRunService.requestStop).not.toHaveBeenCalled();
+    });
+
+    it('stops by runId: an unknown run reports { stopped: false }', async () => {
+      const { controller, aiChatRunService } = makeController({
+        run: undefined,
+      });
+      const res = await controller.stopRun({ runId: 'gone' }, user, workspace);
+      expect(res).toEqual({ stopped: false });
+      expect(aiChatRunService.requestStop).not.toHaveBeenCalled();
+    });
+
+    it('stops by chatId: owner-gates, resolves the active run, requests the stop', async () => {
+      const { controller, aiChatRunService, aiChatRepo } = makeController({
+        chat: { id: 'c1', creatorId: 'u1' },
+        activeRun: { id: 'run-9' },
+        stopped: true,
+      });
+      const res = await controller.stopRun({ chatId: 'c1' }, user, workspace);
+      expect(res).toEqual({ stopped: true });
+      expect(aiChatRepo.findById).toHaveBeenCalledWith('c1', 'ws1');
+      expect(aiChatRunService.getActiveForChat).toHaveBeenCalledWith(
+        'c1',
+        'ws1',
+      );
+      expect(aiChatRunService.requestStop).toHaveBeenCalledWith('run-9', 'ws1');
+    });
+
+    it('stops by chatId: reports { stopped: false } when no run is active', async () => {
+      const { controller, aiChatRunService } = makeController({
+        chat: { id: 'c1', creatorId: 'u1' },
+        activeRun: undefined,
+      });
+      const res = await controller.stopRun({ chatId: 'c1' }, user, workspace);
+      expect(res).toEqual({ stopped: false });
+      expect(aiChatRunService.requestStop).not.toHaveBeenCalled();
+    });
+  });
+});
@@ -1,6 +1,7 @@
 import {
  BadRequestException,
  Body,
+  ConflictException,
  Controller,
  ForbiddenException,
  HttpCode,
@@ -20,7 +21,13 @@ import { JwtAuthGuard } from '../../common/guards/jwt-auth.guard';
 import { AuthUser } from '../../common/decorators/auth-user.decorator';
 import { AuthWorkspace } from '../../common/decorators/auth-workspace.decorator';
 import { SkipTransform } from '../../common/decorators/skip-transform.decorator';
-import { AiChat, User, Workspace } from '@docmost/db/types/entity.types';
+import {
+  AiChat,
+  AiChatMessage,
+  AiChatRun,
+  User,
+  Workspace,
+} from '@docmost/db/types/entity.types';
 import { PaginationOptions } from '@docmost/db/pagination/pagination-options';
 import { AiChatRepo } from '@docmost/db/repos/ai-chat/ai-chat.repo';
 import { AiChatMessageRepo } from '@docmost/db/repos/ai-chat/ai-chat-message.repo';
@@ -28,7 +35,12 @@ import { PageRepo } from '@docmost/db/repos/page/page.repo';
 import { UserThrottlerGuard } from '../../integrations/throttle/user-throttler.guard';
 import { AI_CHAT_THROTTLER } from '../../integrations/throttle/throttler-names';
 import { FileInterceptor } from '../../common/interceptors/file.interceptor';
-import { AiChatService, AiChatStreamBody } from './ai-chat.service';
+import {
+  AiChatRunHooks,
+  AiChatService,
+  AiChatStreamBody,
+} from './ai-chat.service';
+import { AiChatRunService } from './ai-chat-run.service';
 import { AiTranscriptionService } from './ai-transcription.service';
 import {
  BoundChatDto,
@@ -36,7 +48,9 @@ import {
  ExportChatDto,
  GeneratePageTitleDto,
  GetChatMessagesDto,
+  GetRunDto,
  RenameChatDto,
+  StopRunDto,
 } from './dto/ai-chat.dto';
 import { describeProviderError } from '../../integrations/ai/ai-error.util';
 import { buildChatMarkdown } from './chat-markdown.util';
@@ -53,6 +67,7 @@ export class AiChatController {

  constructor(
    private readonly aiChatService: AiChatService,
+    private readonly aiChatRunService: AiChatRunService,
    private readonly aiChatRepo: AiChatRepo,
    private readonly aiChatMessageRepo: AiChatMessageRepo,
    private readonly aiTranscription: AiTranscriptionService,
@@ -149,6 +164,75 @@ export class AiChatController {
    return { markdown };
  }

+  /**
+   * Reconnect to the latest run of a chat (#184 phase 1). Returns the run's
+   * persisted lifecycle state ({ status, error, stepCount, timings, ... }) plus
+   * the assistant message it projects (the partial/final output) — the DB is the
+   * source of truth, so this works for an in-flight run (the browser dropped, the
+   * run kept going) and a finished one alike. Owner-gated via assertOwnedChat.
+   * `{ run: null, message: null }` when the chat has never had a run.
+   */
+  @HttpCode(HttpStatus.OK)
+  @Post('run')
+  async getRun(
+    @Body() dto: GetRunDto,
+    @AuthUser() user: User,
+    @AuthWorkspace() workspace: Workspace,
+  ): Promise<{ run: AiChatRun | null; message: AiChatMessage | null }> {
+    await this.assertOwnedChat(dto.chatId, user, workspace);
+    const run = await this.aiChatRunService.getLatestForChat(
+      dto.chatId,
+      workspace.id,
+    );
+    if (!run) return { run: null, message: null };
+    const message = run.assistantMessageId
+      ? await this.aiChatMessageRepo.findById(
+          run.assistantMessageId,
+          workspace.id,
+        )
+      : undefined;
+    return { run, message: message ?? null };
+  }
+
+  /**
+   * Explicitly STOP an agent run (#184 phase 1) — the user pressed Stop. This is
+   * the ONLY thing that ends a detached run; a browser disconnect deliberately
+   * does not. Target by `runId` (from the streamed start metadata) or by `chatId`
+   * (stop whatever run is active on it). Owner-gated. Returns
+   * `{ stopped }` — false when there was nothing active to stop.
+   */
+  @HttpCode(HttpStatus.OK)
+  @Post('stop')
+  async stopRun(
+    @Body() dto: StopRunDto,
+    @AuthUser() user: User,
+    @AuthWorkspace() workspace: Workspace,
+  ): Promise<{ stopped: boolean }> {
+    let runId = dto.runId;
+    if (!runId && !dto.chatId) {
+      throw new BadRequestException('runId or chatId is required');
+    }
+    if (runId) {
+      // Resolve the run to its chat and owner-gate via that chat.
+      const run = await this.aiChatRunService.getRun(runId, workspace.id);
+      if (!run) return { stopped: false };
+      await this.assertOwnedChat(run.chatId, user, workspace);
+    } else {
+      await this.assertOwnedChat(dto.chatId!, user, workspace);
+      const active = await this.aiChatRunService.getActiveForChat(
+        dto.chatId!,
+        workspace.id,
+      );
+      if (!active) return { stopped: false };
+      runId = active.id;
+    }
+    const stopped = await this.aiChatRunService.requestStop(
+      runId,
+      workspace.id,
+    );
+    return { stopped };
+  }
+
  /** Rename a chat. */
  @HttpCode(HttpStatus.OK)
  @Post('rename')
@@ -200,11 +284,20 @@ export class AiChatController {
    @AuthWorkspace() workspace: Workspace,
  ): Promise<void> {
    // A7 gate: the workspace must have AI chat explicitly enabled.
-    const settings = (workspace.settings ?? {}) as { ai?: { chat?: boolean } };
+    const settings = (workspace.settings ?? {}) as {
+      ai?: { chat?: boolean; autonomousRuns?: boolean };
+    };
    if (settings.ai?.chat !== true) {
      throw new ForbiddenException('AI chat is disabled');
    }

+    // #184 phase 1 flag: when ON, the turn becomes a detached, durable RUN — its
+    // lifecycle is tracked in ai_chat_runs, a browser disconnect no longer aborts
+    // it, and only an explicit /ai-chat/stop ends it. When OFF (the default) the
+    // turn is socket-bound exactly as before, so existing deployments are
+    // unaffected.
+    const autonomousRuns = settings.ai?.autonomousRuns === true;
+
    const sessionId = (req.raw as { sessionId?: string }).sessionId;
    if (!sessionId) {
      // The chat requires an interactive session to mint loopback tokens
@@ -228,6 +321,58 @@ export class AiChatController {
    // HttpException) instead of breaking mid-stream.
    const model = await this.aiChatService.getChatModel(workspace.id, role);

+    // #184: one active run per chat. For an EXISTING chat reject a concurrent
+    // start with a clean 409 BEFORE hijack (the common double-submit / second-tab
+    // case), so the user gets JSON, not a mid-stream error. A brand-new chat
+    // (no chatId) cannot have a prior run, and the DB partial unique index is the
+    // backstop against any race that slips past this check.
+    if (autonomousRuns && body.chatId) {
+      const active = await this.aiChatRunService.getActiveForChat(
+        body.chatId,
+        workspace.id,
+      );
+      if (active) {
+        throw new ConflictException({
+          message: 'An agent run is already in progress for this chat',
+          code: 'A_RUN_ALREADY_ACTIVE',
+        });
+      }
+    }
+
+    // Run-lifecycle hooks (#184), only when the flag is on. They wrap the turn in
+    // a durable run whose abort is governed by the run (explicit stop), persist
+    // its progress, and settle its terminal status — see AiChatRunService.
+    const runHooks: AiChatRunHooks | undefined = autonomousRuns
+      ? {
+          begin: (chatId) =>
+            this.aiChatRunService.beginRun({
+              chatId,
+              workspaceId: workspace.id,
+              userId: user.id,
+              trigger: 'user',
+            }),
+          onAssistantSeeded: (runId, messageId) =>
+            this.aiChatRunService.linkAssistantMessage(
+              runId,
+              workspace.id,
+              messageId,
+            ),
+          onStep: (runId, stepCount) =>
+            void this.aiChatRunService.recordStep(
+              runId,
+              workspace.id,
+              stepCount,
+            ),
+          onSettled: (runId, status, error) =>
+            this.aiChatRunService.finalizeRun(
+              runId,
+              workspace.id,
+              status,
+              error,
+            ),
+        }
+      : undefined;
+
    // Abort the agent loop when the client disconnects. `close` also fires on
    // normal completion, so only abort when the response has not finished
    // writing (a genuine disconnect). `once` fires at most once and self-removes;
@@ -242,18 +387,44 @@ export class AiChatController {
      // A genuine disconnect leaves the response unfinished (unlike a normal
      // completion, which also fires `close`). Such a drop — e.g. a reverse
      // proxy cutting the SSE mid-answer — is otherwise invisible server-side,
-      // so log it here before aborting the agent loop.
+      // so log it here.
      if (!res.raw.writableEnded) {
-        this.logger.warn(
-          `AI chat stream: client disconnected before completion; aborting turn ` +
-            `(elapsed=${Date.now() - reqStartedAt}ms since request received)`,
-        );
-        controller.abort();
+        if (autonomousRuns) {
+          // #184: the turn is a DETACHED run. A disconnect must NOT abort it —
+          // the run keeps executing and persisting server-side; the client
+          // reconnects via /ai-chat/run (or stops it via /ai-chat/stop). Log only.
+          this.logger.log(
+            `AI chat stream: client disconnected; run continues server-side ` +
+              `(elapsed=${Date.now() - reqStartedAt}ms since request received)`,
+          );
+        } else {
+          this.logger.warn(
+            `AI chat stream: client disconnected before completion; aborting turn ` +
+              `(elapsed=${Date.now() - reqStartedAt}ms since request received)`,
+          );
+          controller.abort();
+        }
      }
    };
    req.raw.once('close', onClose);
    res.raw.once('finish', () => req.raw.off('close', onClose));

+    // #184: in detached mode the turn is NOT aborted on disconnect, so the SDK's
+    // pipe keeps writing to a socket the client may have dropped — for the rest of
+    // the (continuing) run. A write to the dead socket can emit an 'error' on the
+    // raw response; without a listener that surfaces as an unhandled error event.
+    // Swallow it (the run continues server-side regardless). Legacy mode aborts on
+    // disconnect, so it does not need this and keeps its exact prior behavior.
+    if (autonomousRuns) {
+      res.raw.on('error', (err) => {
+        this.logger.debug(
+          `AI chat detached stream: post-disconnect socket error swallowed: ${
+            err instanceof Error ? err.message : String(err)
+          }`,
+        );
+      });
+    }
+
    // Commit to streaming: hijack so Fastify stops managing the response and
    // the AI SDK can write the UI-message stream directly to the Node socket.
    res.hijack();
@@ -268,15 +439,32 @@ export class AiChatController {
        signal: controller.signal,
        model,
        role,
+        // #184: present only when the flag is on; wraps the turn in a durable run.
+        runHooks,
      });
    } catch (err) {
-      // Any failure AFTER hijack can no longer send a clean JSON error, so emit
-      // a minimal error on the raw socket if nothing has been written yet.
-      this.logger.error('AI chat stream failed', err as Error);
+      // Any failure AFTER hijack can no longer go through Nest's exception
+      // filter, so emit the error on the raw socket if nothing has been written
+      // yet. The lost-the-race 409 (RunAlreadyActiveError -> ConflictException)
+      // is raised by stream() BEFORE it writes a byte, so headers are still
+      // unsent here: honor the HttpException's real status + body (a clean 409),
+      // not a blanket 500. Everything else stays a 500.
+      const isHttp = err instanceof HttpException;
+      if (!isHttp) {
+        this.logger.error('AI chat stream failed', err as Error);
+      }
      if (!res.raw.headersSent) {
-        res.raw.statusCode = 500;
+        const status = isHttp ? err.getStatus() : 500;
+        const payload = isHttp
+          ? err.getResponse()
+          : { error: 'Internal server error' };
+        res.raw.statusCode = status;
        res.raw.setHeader('Content-Type', 'application/json');
-        res.raw.end(JSON.stringify({ error: 'Internal server error' }));
+        res.raw.end(
+          JSON.stringify(
+            typeof payload === 'string' ? { message: payload } : payload,
+          ),
+        );
      } else if (!res.raw.writableEnded) {
        res.raw.end();
      }
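Review note: the post-hijack catch above maps an error to a status and JSON body before writing to the raw socket. The sketch below expresses that mapping as a pure function so it can be checked in isolation. `HttpLikeError` is a hypothetical stand-in that mirrors only the `getStatus()` / `getResponse()` shape of Nest's `HttpException`, and `toRawErrorResponse` is an illustrative name that does not appear in this diff.

```typescript
// Hypothetical stand-in exposing the same two accessors the catch block uses.
class HttpLikeError {
  private readonly response: string | object;
  private readonly status: number;

  constructor(response: string | object, status: number) {
    this.response = response;
    this.status = status;
  }

  getStatus(): number {
    return this.status;
  }

  getResponse(): string | object {
    return this.response;
  }
}

// Maps any thrown value to the status and JSON body to write on the raw socket.
function toRawErrorResponse(err: unknown): { status: number; body: string } {
  if (err instanceof HttpLikeError) {
    const payload = err.getResponse();
    return {
      status: err.getStatus(),
      // A bare string payload is wrapped so the body is always a JSON object.
      body: JSON.stringify(
        typeof payload === 'string' ? { message: payload } : payload,
      ),
    };
  }
  // Anything that is not an HTTP-style error stays a blanket 500.
  return {
    status: 500,
    body: JSON.stringify({ error: 'Internal server error' }),
  };
}

const conflict = toRawErrorResponse(
  new HttpLikeError(
    {
      message: 'An agent run is already in progress for this chat',
      code: 'A_RUN_ALREADY_ACTIVE',
    },
    409,
  ),
);
const generic = toRawErrorResponse(new Error('boom'));
```

Keeping the mapping pure would let a spec assert the lost-the-race 409 body without a Fastify response object. Whether that refactor is worthwhile here is a judgment call for the author.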
@@ -57,6 +57,7 @@ describe('AiChatController.generatePageTitle', () => {
    const aiChatService = { generatePageTitle: generate };
    const controller = new AiChatController(
      aiChatService as never,
+      {} as never, // aiChatRunService
      {} as never,
      {} as never,
      {} as never,
@@ -3,6 +3,7 @@ import { AiModule } from '../../integrations/ai/ai.module';
 import { TokenModule } from '../auth/token.module';
 import { AiChatController } from './ai-chat.controller';
 import { AiChatService } from './ai-chat.service';
+import { AiChatRunService } from './ai-chat-run.service';
 import { AiTranscriptionService } from './ai-transcription.service';
 import { AiChatToolsService } from './tools/ai-chat-tools.service';
 import { EmbeddingModule } from './embedding/embedding.module';
@@ -42,6 +43,7 @@ import { PublicShareChatToolsService } from './tools/public-share-chat-tools.ser
  controllers: [AiChatController, PublicShareChatController],
  providers: [
    AiChatService,
+    AiChatRunService,
    AiTranscriptionService,
    AiChatToolsService,
    PublicShareChatService,
@@ -1,5 +1,7 @@
 import { Logger } from '@nestjs/common';
-import { AiChatService } from './ai-chat.service';
+import { AiChatService, AiChatRunHooks } from './ai-chat.service';
+import { AiChatRunService } from './ai-chat-run.service';
+import type { User, Workspace } from '@docmost/db/types/entity.types';

 /**
 * Lifecycle unit tests for AiChatService.onModuleInit (#183 crash-recovery
@@ -61,3 +63,99 @@ describe('AiChatService.onModuleInit (startup sweep)', () => {
    expect(String(warnSpy.mock.calls[0][0])).toContain('db unavailable');
  });
 });
+
+/**
+ * #184 CRITICAL run-lifecycle safety net (review fix). A transient failure
+ * AFTER a successful beginRun but BEFORE streamText's terminal callbacks own the
+ * lifecycle must STILL settle the run — otherwise the run row is stuck 'running'
+ * forever (sweepRunning only runs at startup) and the partial unique index + the
+ * controller pre-check 409 every future turn in that chat until a restart. Here
+ * we model the very first bare await after beginRun (the user-message insert)
+ * throwing, wiring the run hooks to a REAL AiChatRunService (mock repo) exactly
+ * as the controller does, and assert the run is settled to 'error' and its
+ * in-memory entry dropped (so a follow-up turn would NOT be 409'd).
+ */
+describe('AiChatService.stream run-lifecycle safety net (#184)', () => {
+  const user = { id: 'u1' } as User;
+  const workspace = { id: 'ws1' } as Workspace;
+
+  afterEach(() => jest.restoreAllMocks());
+
+  it('an exception after beginRun settles the run to error and drops the in-memory entry', async () => {
+    jest.spyOn(Logger.prototype, 'error').mockImplementation(() => undefined);
+
+    // Real run service over a mock repo, so finalizeRun's in-memory bookkeeping
+    // (active.delete) is exercised for real.
+    const runRepo = {
+      insert: jest.fn().mockResolvedValue({ id: 'run-1', status: 'running' }),
+      update: jest.fn().mockResolvedValue({ id: 'run-1' }),
+    };
+    const runService = new AiChatRunService(runRepo as never, { isCloud: () => false } as never);
+
+    // The user-message insert (the first bare await after beginRun) throws.
+    const aiChatMessageRepo = {
+      insert: jest.fn().mockRejectedValue(new Error('insert boom')),
+    };
+    const aiChatRepo = {
+      // Existing chat -> chatId stays, no new-chat insert path.
+      findById: jest.fn().mockResolvedValue({ id: 'chat-1', creatorId: 'u1' }),
+    };
+
+    const service = new AiChatService(
+      {} as never, // ai
+      aiChatRepo as never,
+      aiChatMessageRepo as never,
+      {} as never, // aiChatPageSnapshotRepo
+      {} as never, // aiSettings
+      {} as never, // tools
+      {} as never, // mcpClients
+      {} as never, // aiAgentRoleRepo
+      {} as never, // pageRepo
+      {} as never, // pageAccess
+      {} as never, // environment
+    );
+
+    const runHooks: AiChatRunHooks = {
+      begin: (chatId) =>
+        runService.beginRun({
+          chatId,
+          workspaceId: workspace.id,
+          userId: user.id,
+          trigger: 'user',
+        }),
+      onSettled: (runId, status, error) =>
+        runService.finalizeRun(runId, workspace.id, status, error),
+    };
+
+    await expect(
+      service.stream({
+        user,
+        workspace,
+        sessionId: 'sess',
+        body: {
+          chatId: 'chat-1',
+          messages: [
+            { id: 'm', role: 'user', parts: [{ type: 'text', text: 'hi' }] },
+          ],
+        },
+        res: {} as never,
+        signal: new AbortController().signal,
+        model: {} as never,
+        role: null,
+        runHooks,
+      }),
+    ).rejects.toThrow('insert boom');
+
+    // The run was begun...
+    expect(runRepo.insert).toHaveBeenCalledTimes(1);
+    // ...then settled to a terminal FAILED status by the safety net...
+    expect(runRepo.update).toHaveBeenCalledTimes(1);
+    expect(runRepo.update).toHaveBeenCalledWith(
+      'run-1',
+      'ws1',
+      expect.objectContaining({ status: 'failed' }),
+    );
+    // ...and the in-memory entry is gone, so a follow-up turn is NOT 409'd.
+    expect(runService.isLocallyActive('run-1')).toBe(false);
+  });
+});
@@ -0,0 +1,489 @@
+import { ConflictException, Logger } from '@nestjs/common';
+
+// Mock the AI SDK so we can PROVE no provider call is made for the turn we are
+// about to reject. The race rejection happens at runHooks.begin(), long before
+// any streamText/generateText, so these never resolve a real model.
+jest.mock('ai', () => ({
+  streamText: jest.fn(),
+  generateText: jest.fn(),
+  convertToModelMessages: jest.fn(() => []),
+  stepCountIs: jest.fn(() => () => false),
+}));
+
+import { streamText, generateText } from 'ai';
+import { AiChatService } from './ai-chat.service';
+import { RunAlreadyActiveError } from './ai-chat-run.service';
+
+/**
+ * Race-closure coverage for the "one active run per chat" guard (#184).
+ *
+ * THE BUG: two simultaneous POST /ai-chat/stream on the same chat both pass the
+ * controller's cheap pre-check (TOCTOU), so the loser's run-row INSERT hits the
+ * partial unique index. Previously that 23505 was SWALLOWED and the second turn
+ * streamed UNTRACKED (no runId, not stoppable). THE FIX: beginRun surfaces a
+ * RunAlreadyActiveError and stream() turns it into a 409 BEFORE any AI call —
+ * the second turn never runs.
+ */
+describe('AiChatService.stream — concurrent-run race rejection (#184)', () => {
+  const streamTextMock = streamText as unknown as jest.Mock;
+  const generateTextMock = generateText as unknown as jest.Mock;
+
+  beforeEach(() => {
+    streamTextMock.mockReset();
+    generateTextMock.mockReset();
+  });
+
+  // Minimal service: the only dep reached before begin() is aiChatRepo (to
+  // resolve the existing chat); everything past begin must remain untouched.
+  function makeService(beginImpl: () => Promise<unknown>) {
+    const aiChatMessageRepo = { insert: jest.fn() };
+    const aiChatRepo = {
+      // An existing chat: stream keeps the supplied chatId and skips creation.
+      findById: jest.fn(async () => ({ id: 'chat-1', workspaceId: 'ws-1' })),
+      insert: jest.fn(),
+    };
+    const svc = new AiChatService(
+      {} as never, // ai
+      aiChatRepo as never,
+      aiChatMessageRepo as never,
+      {} as never, // aiChatPageSnapshotRepo
+      {} as never, // aiSettings
+      {} as never, // tools
+      {} as never, // mcpClients
+      {} as never, // aiAgentRoleRepo
+      {} as never, // pageRepo
+      {} as never, // pageAccess
+      { isAiChatDeferredToolsEnabled: () => false } as never, // environment
+    );
+    const begin = jest.fn(beginImpl);
+    return { svc, begin, aiChatRepo, aiChatMessageRepo };
+  }
+
+  const baseArgs = (begin: jest.Mock) => ({
+    user: { id: 'user-1' } as never,
+    workspace: { id: 'ws-1' } as never,
+    sessionId: 'sess-1',
+    body: { chatId: 'chat-1', messages: [] } as never,
+    res: { raw: {} } as never,
+    signal: new AbortController().signal,
+    model: {} as never,
+    role: null,
+    runHooks: {
+      begin,
+      onAssistantSeeded: jest.fn(),
+      onStep: jest.fn(),
+      onSettled: jest.fn(),
+    } as never,
+  });
+
+  it('rejects the racer with a 409 ConflictException BEFORE any AI call, and never persists an untracked turn', async () => {
+    // begin loses the unique-index race -> RunAlreadyActiveError.
+    const { svc, begin, aiChatMessageRepo } = makeService(() => {
+      throw new RunAlreadyActiveError('chat-1');
+    });
+
+    const promise = svc.stream(baseArgs(begin));
+
+    await expect(promise).rejects.toBeInstanceOf(ConflictException);
+    await promise.catch((err: ConflictException) => {
+      expect(err.getStatus()).toBe(409);
+      expect((err.getResponse() as { code?: string }).code).toBe(
+        'A_RUN_ALREADY_ACTIVE',
+      );
+    });
+
+    // The decisive assertions: the rejected racer spent NO tokens and left NO
+    // untracked turn behind.
+    expect(begin).toHaveBeenCalledTimes(1);
+    expect(streamTextMock).not.toHaveBeenCalled();
+    expect(generateTextMock).not.toHaveBeenCalled();
+    expect(aiChatMessageRepo.insert).not.toHaveBeenCalled();
+  });
+});
+
+/**
+ * F3 — the LOAD-BEARING run-detach wiring: `effectiveSignal = handle.signal`
+ * after runHooks.begin, then `abortSignal: effectiveSignal` passed to streamText.
+ * That wiring is what makes a run survive a browser disconnect (the agent
+ * loop's abort is governed by the RUN's signal, not the socket): a regression to
+ * the socket-bound signal would still pass every other test green while silently
+ * breaking Stop + durability. These two tests pin the exact signal streamText
+ * consumes on both paths.
+ */
+describe('AiChatService.stream — abortSignal wiring (#184 F3)', () => {
+  const streamTextMock = streamText as unknown as jest.Mock;
+
+  // A streamText result stub: the post-call drain + pipe are no-ops here; we only
+  // care WHICH abortSignal streamText was handed.
+  function makeStreamResult() {
+    return {
+      consumeStream: jest.fn(),
+      pipeUIMessageStreamToResponse: jest.fn(),
+    };
+  }
+
+  // A raw-response stub sufficient for the post-streamText wiring
+  // (stripStreamingHopByHopHeaders binds writeHead; startSseHeartbeat registers
+  // close/finish listeners; flushHeaders is belt-and-braces).
+  function makeRes() {
+    return {
+      raw: {
+        writeHead: jest.fn(),
+        write: jest.fn(),
+        once: jest.fn(),
+        on: jest.fn(),
+        flushHeaders: jest.fn(),
+        writableEnded: false,
+        destroyed: false,
+      },
+    };
+  }
+
+  // Wire only the deps reached on the way to streamText: resolve the existing
+  // chat, persist the user + seed the assistant row, load (empty) history, the
+  // admin settings, an empty external toolset + Docmost toolset.
+  function makeService() {
+    const aiChatRepo = {
+      findById: jest.fn(async () => ({ id: 'chat-1', workspaceId: 'ws-1' })),
+      insert: jest.fn(),
+    };
+    const aiChatMessageRepo = {
+      insert: jest.fn(async () => ({ id: 'msg-1' })),
+      findAllByChat: jest.fn(async () => []),
+      update: jest.fn(async () => ({ id: 'msg-1' })),
+    };
+    const aiSettings = { resolve: jest.fn(async () => ({})) };
+    const tools = { forUser: jest.fn(async () => ({})) };
+    const mcpClients = {
+      toolsFor: jest.fn(async () => ({
+        tools: {},
+        clients: [],
+        outcomes: [],
+        instructions: [],
+      })),
+    };
+    const svc = new AiChatService(
+      {} as never, // ai
+      aiChatRepo as never,
+      aiChatMessageRepo as never,
+      {} as never, // aiChatPageSnapshotRepo
+      aiSettings as never,
+      tools as never,
+      mcpClients as never,
+      {} as never, // aiAgentRoleRepo
+      {} as never, // pageRepo (openPage undefined -> never touched)
+      {} as never, // pageAccess
+      { isAiChatDeferredToolsEnabled: () => false } as never, // environment
+    );
+    return { svc };
+  }
+
+  const body = {
+    chatId: 'chat-1',
+    messages: [
+      { id: 'm1', role: 'user', parts: [{ type: 'text', text: 'hi' }] },
+    ],
+  };
+
+  beforeEach(() => {
+    streamTextMock.mockReset();
+    streamTextMock.mockImplementation(() => makeStreamResult());
+    jest
+      .spyOn(Logger.prototype, 'log')
+      .mockImplementation(() => undefined as never);
+  });
+
+  afterEach(() => jest.restoreAllMocks());
+
+  it('happy path (run-wrapped): streamText is driven with abortSignal === handle.signal (the RUN signal, NOT the socket)', async () => {
+    const { svc } = makeService();
+    const runController = new AbortController();
+    const runSignal = runController.signal;
+    const socketSignal = new AbortController().signal;
+
+    const begin = jest.fn(async () => ({ runId: 'run-1', signal: runSignal }));
+    await svc.stream({
+      user: { id: 'user-1' } as never,
+      workspace: { id: 'ws-1' } as never,
+      sessionId: 'sess-1',
+      body: body as never,
+      res: makeRes() as never,
+      signal: socketSignal,
+      model: {} as never,
+      role: null,
+      runHooks: {
+        begin,
+        onAssistantSeeded: jest.fn(),
+        onStep: jest.fn(),
+        onSettled: jest.fn(),
+      } as never,
+    });
+
+    expect(begin).toHaveBeenCalledTimes(1);
+    expect(streamTextMock).toHaveBeenCalledTimes(1);
+    // THE assertion: the agent loop's abort is wired to the RUN, so a browser
+    // disconnect (which aborts only `socketSignal`) cannot end the turn.
+    expect(streamTextMock.mock.calls[0][0].abortSignal).toBe(runSignal);
+    expect(streamTextMock.mock.calls[0][0].abortSignal).not.toBe(socketSignal);
+  });
+
+  it('legacy path (no runHooks): streamText is driven with the SOCKET signal', async () => {
+    const { svc } = makeService();
+    const socketSignal = new AbortController().signal;
+
+    await svc.stream({
+      user: { id: 'user-1' } as never,
+      workspace: { id: 'ws-1' } as never,
+      sessionId: 'sess-1',
+      body: body as never,
+      res: makeRes() as never,
+      signal: socketSignal,
+      model: {} as never,
+      role: null,
+      // No runHooks -> the turn stays socket-bound (flag off / default).
+    });
+
+    expect(streamTextMock).toHaveBeenCalledTimes(1);
+    expect(streamTextMock.mock.calls[0][0].abortSignal).toBe(socketSignal);
+  });
+
+  /**
+   * F9 — streamText's TERMINAL callbacks carry the #184 run lifecycle:
+   *   onStepFinish -> runHooks.onStep(runId, stepCount)
+   *   onFinish     -> runHooks.onSettled(runId, 'completed')   (dominant path)
+   *   onAbort      -> runHooks.onSettled(runId, 'aborted')
+   *   onError      -> runHooks.onSettled(runId, 'error', cause)
+   * makeStreamResult() ignores the streamText options, so these callbacks never
+   * fire on their own — a regression in this wiring (esp. the success path) would
+   * strand the run with NO test catching it. Here we CAPTURE the options streamText
+   * was handed and invoke each callback with the real wiring, asserting the run
+   * hooks fire with the right args.
+   */
+  // Drive stream() to the point streamText is called, capturing the options object
+  // (which carries onStepFinish/onFinish/onError/onAbort) and the run hooks.
+  async function captureStreamCallbacks() {
+    const { svc } = makeService();
+    let capturedOpts: any;
+    streamTextMock.mockImplementation((opts: any) => {
+      capturedOpts = opts;
+      return makeStreamResult();
+    });
+    const runHooks = {
+      begin: jest.fn(async () => ({
+        runId: 'run-1',
+        signal: new AbortController().signal,
+      })),
+      onAssistantSeeded: jest.fn(),
+      onStep: jest.fn(),
+      onSettled: jest.fn(),
+    };
+    await svc.stream({
+      user: { id: 'user-1' } as never,
+      workspace: { id: 'ws-1' } as never,
+      sessionId: 'sess-1',
+      body: body as never,
+      res: makeRes() as never,
+      signal: new AbortController().signal,
+      model: {} as never,
+      role: null,
+      runHooks: runHooks as never,
+    });
+    expect(capturedOpts).toBeDefined();
+    return { capturedOpts, runHooks };
+  }
+
+  it('F9: onStepFinish bumps the run step count, onFinish settles the run "completed" (the dominant autonomous-run path)', async () => {
+    const { capturedOpts, runHooks } = await captureStreamCallbacks();
+
+    // A finished step -> onStep(runId, finishedStepCount).
+    capturedOpts.onStepFinish({ text: 'step one', toolCalls: [], content: [] });
+    expect(runHooks.onStep).toHaveBeenCalledWith('run-1', 1);
+    capturedOpts.onStepFinish({ text: 'step two', toolCalls: [], content: [] });
+    expect(runHooks.onStep).toHaveBeenLastCalledWith('run-1', 2);
+
+    // The success terminal callback settles the run.
+    await capturedOpts.onFinish({
+      text: 'done',
+      finishReason: 'stop',
+      totalUsage: {},
+      usage: {},
+      steps: [],
+    });
+    expect(runHooks.onSettled).toHaveBeenCalledWith('run-1', 'completed');
+  });
+
+  it('F9: onAbort settles the run "aborted"', async () => {
+    jest
+      .spyOn(Logger.prototype, 'warn')
+      .mockImplementation(() => undefined as never);
+    const { capturedOpts, runHooks } = await captureStreamCallbacks();
+
+    await capturedOpts.onAbort({ steps: [] });
+    expect(runHooks.onSettled).toHaveBeenCalledWith('run-1', 'aborted');
+  });
+
+  it('F9: onError settles the run "error" carrying the provider cause', async () => {
+    jest
+      .spyOn(Logger.prototype, 'error')
+      .mockImplementation(() => undefined as never);
+    jest
+      .spyOn(Logger.prototype, 'warn')
+      .mockImplementation(() => undefined as never);
+    const { capturedOpts, runHooks } = await captureStreamCallbacks();
+
+    await capturedOpts.onError({ error: new Error('provider exploded') });
+    expect(runHooks.onSettled).toHaveBeenCalledWith(
+      'run-1',
+      'error',
+      expect.stringContaining('provider exploded'),
+    );
+  });
+});
+
+/**
+ * F14 — the begin-failure RESILIENCE branch (the `else` of the run-race guard).
+ *
+ * stream() wraps runHooks.begin in try/catch with TWO branches:
+ *   - RunAlreadyActiveError  -> 409 ConflictException (pinned above).
+ *   - ANY OTHER begin failure -> SWALLOW + continue UNTRACKED on the socket signal
+ *     (legacy fallback): it logs "...streaming without run tracking", leaves
+ *     `effectiveSignal = signal` (runId undefined) and serves the turn anyway.
+ *
+ * The contract: a transient beginRun failure (e.g. a non-unique DB error inserting
+ * the run row) must STILL serve the user's turn — it must NOT re-throw and must NOT
+ * be misclassified as a 409. A regression that re-threw here would break EVERY turn
+ * on a begin failure with nothing to catch it. This branch is otherwise undriven by
+ * any spec, so it is pinned here SEPARATELY from the 409 path: a plain begin error
+ * proceeds to streamText with the SOCKET signal and still persists the user turn.
+ */
+describe('AiChatService.stream — begin-failure resilience / legacy fallback (#184 F14)', () => {
+  const streamTextMock = streamText as unknown as jest.Mock;
+
+  function makeStreamResult() {
+    return {
+      consumeStream: jest.fn(),
+      pipeUIMessageStreamToResponse: jest.fn(),
+    };
+  }
+
+  function makeRes() {
+    return {
+      raw: {
+        writeHead: jest.fn(),
+        write: jest.fn(),
+        once: jest.fn(),
+        on: jest.fn(),
+        flushHeaders: jest.fn(),
+        writableEnded: false,
+        destroyed: false,
+      },
+    };
+  }
+
+  // Same harness as the F3 abortSignal block, but it also exposes
+  // aiChatMessageRepo so we can assert the user turn IS persisted (the turn really
+  // streamed) despite begin() blowing up.
+  function makeService() {
+    const aiChatRepo = {
+      findById: jest.fn(async () => ({ id: 'chat-1', workspaceId: 'ws-1' })),
+      insert: jest.fn(),
+    };
+    const aiChatMessageRepo = {
+      insert: jest.fn(async () => ({ id: 'msg-1' })),
+      findAllByChat: jest.fn(async () => []),
+      update: jest.fn(async () => ({ id: 'msg-1' })),
+    };
+    const aiSettings = { resolve: jest.fn(async () => ({})) };
+    const tools = { forUser: jest.fn(async () => ({})) };
+    const mcpClients = {
+      toolsFor: jest.fn(async () => ({
+        tools: {},
+        clients: [],
+        outcomes: [],
+        instructions: [],
+      })),
+    };
+    const svc = new AiChatService(
+      {} as never, // ai
+      aiChatRepo as never,
+      aiChatMessageRepo as never,
+      {} as never, // aiChatPageSnapshotRepo
+      aiSettings as never,
+      tools as never,
+      mcpClients as never,
+      {} as never, // aiAgentRoleRepo
+      {} as never, // pageRepo
+      {} as never, // pageAccess
+      { isAiChatDeferredToolsEnabled: () => false } as never, // environment
+    );
+    return { svc, aiChatMessageRepo };
+  }
+
+  const body = {
+    chatId: 'chat-1',
+    messages: [
+      { id: 'm1', role: 'user', parts: [{ type: 'text', text: 'hi' }] },
+    ],
+  };
+
+  beforeEach(() => {
+    streamTextMock.mockReset();
+    streamTextMock.mockImplementation(() => makeStreamResult());
+    jest
+      .spyOn(Logger.prototype, 'log')
+      .mockImplementation(() => undefined as never);
+  });
+
+  afterEach(() => jest.restoreAllMocks());
+
+  it('a PLAIN begin() failure (NOT RunAlreadyActiveError) does NOT 409 — it swallows, logs, and streams the turn UNTRACKED on the socket signal', async () => {
+    const errorSpy = jest
+      .spyOn(Logger.prototype, 'error')
+      .mockImplementation(() => undefined as never);
+
+    const { svc, aiChatMessageRepo } = makeService();
+    const socketSignal = new AbortController().signal;
+
+    // A transient, NON-race begin failure (e.g. a non-unique DB error inserting
+    // the run row). This is the `else` branch of the begin try/catch.
+    const begin = jest.fn(async () => {
+      throw new Error('insert failed');
+    });
+
+    const promise = svc.stream({
+      user: { id: 'user-1' } as never,
+      workspace: { id: 'ws-1' } as never,
+      sessionId: 'sess-1',
+      body: body as never,
+      res: makeRes() as never,
+      signal: socketSignal,
+      model: {} as never,
+      role: null,
+      runHooks: {
+        begin,
+        onAssistantSeeded: jest.fn(),
+        onStep: jest.fn(),
+        onSettled: jest.fn(),
+      } as never,
+    });
+
+    // The turn proceeds: NO throw at all (in particular NOT a 409).
+    await expect(promise).resolves.toBeUndefined();
+
+    expect(begin).toHaveBeenCalledTimes(1);
+
+    // The resilience branch logged the legacy-fallback message via Logger.error.
+    expect(errorSpy).toHaveBeenCalledWith(
+      expect.stringContaining('streaming without run tracking'),
+      expect.anything(),
+    );
+
+    // The turn really streamed: the user message was persisted and streamText ran.
+    expect(aiChatMessageRepo.insert).toHaveBeenCalled();
+    expect(streamTextMock).toHaveBeenCalledTimes(1);
+
+    // The decisive wiring: with no run handle, the fallback uses the SOCKET signal
+    // (effectiveSignal = signal, runId undefined) — not a run-bound signal.
+    expect(streamTextMock.mock.calls[0][0].abortSignal).toBe(socketSignal);
+  });
+});
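Taken together, the 409, F3 and F14 specs above pin a single decision: which abort signal reaches streamText for each outcome of runHooks.begin. A minimal sketch of that decision table follows, assuming a hypothetical pure helper `planTurn` and a hypothetical `BeginOutcome` union. Neither exists in the service, which wraps the real hook in try/catch instead.

```typescript
// Hypothetical model of what runHooks.begin can produce for one turn.
type BeginOutcome<S> =
  | { kind: 'no-hooks' } // legacy path: runHooks not supplied
  | { kind: 'started'; runId: string; signal: S } // run row inserted
  | { kind: 'race-lost' } // stands in for RunAlreadyActiveError
  | { kind: 'failed'; cause: string }; // any other begin failure

type TurnPlan<S> =
  | { action: 'reject'; status: 409 }
  | { action: 'stream'; runId: string | undefined; abortSignal: S };

// Illustrative only: maps a begin outcome to the signal streamText should get.
function planTurn<S>(outcome: BeginOutcome<S>, socketSignal: S): TurnPlan<S> {
  if (outcome.kind === 'race-lost') {
    // Reject before any provider call or message insert.
    return { action: 'reject', status: 409 };
  }
  if (outcome.kind === 'started') {
    // Run-wrapped: the RUN's signal governs the agent loop, not the socket.
    return {
      action: 'stream',
      runId: outcome.runId,
      abortSignal: outcome.signal,
    };
  }
  // 'failed' or 'no-hooks': serve the turn untracked on the socket signal.
  return { action: 'stream', runId: undefined, abortSignal: socketSignal };
}
```

Only the `started` outcome yields a run-bound signal; every other streaming outcome falls back to the socket, which is what the legacy and F14 specs assert.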
@@ -453,6 +453,12 @@ describe('chatStreamMetadata', () => {
    });
  });

+  it('attaches the runId on the start part when a run wraps the turn (#184)', () => {
+    expect(
+      chatStreamMetadata({ type: 'start' }, 'chat-1', undefined, 'run-1'),
+    ).toEqual({ chatId: 'chat-1', runId: 'run-1' });
+  });
+
  it('returns the CUMULATIVE step usage passed in for the finish-step part', () => {
    // finish-step usage is per-step in v6; the caller accumulates and passes the
    // running sum, which this just wraps.
@@ -43,6 +43,30 @@ export class BoundChatDto {
  pageId: string;
 }

+/**
+ * Reconnect to the latest run of a chat (#184): fetch its persisted lifecycle
+ * state (and the assistant message it projects) for an in-flight or finished run.
+ */
+export class GetRunDto {
+  @IsString()
+  chatId: string;
+}
+
+/**
+ * Explicitly STOP an agent run (#184): the user pressed Stop — distinct from a
+ * browser disconnect, which never stops a run. Either the run id (preferred, from
+ * the streamed start metadata) or the chat id (stop whatever run is active on it).
+ */
+export class StopRunDto {
+  @IsOptional()
+  @IsString()
+  runId?: string;
+
+  @IsOptional()
+  @IsString()
+  chatId?: string;
+}
+
 /** Export a chat to Markdown (#183). `lang` localizes the few fixed
 *  role/tool-action labels; defaults to English server-side. */
 export class ExportChatDto {
@@ -16,6 +16,7 @@ import {
  AUTH_THROTTLER,
  PAGE_TEMPLATE_THROTTLER,
  PUBLIC_SHARE_AI_THROTTLER,
+  VITALS_THROTTLER,
 } from '../../integrations/throttle/throttler-names';
 import { LoginDto } from './dto/login.dto';
 import { AuthService } from './services/auth.service';
@@ -184,16 +185,21 @@ export class AuthController {
  }

  // The global ThrottlerGuard applies ALL named throttlers to every route by
-  // default, so each non-AUTH bucket (AI chat, page template, public-share AI)
-  // is explicitly skipped here. collab-token is auth-guarded (JwtAuthGuard),
-  // per-user and client-cached, so those feature buckets are irrelevant to it;
-  // skipping them avoids spurious 429s when a user opens many pages in a short
-  // window. The AUTH bucket is skipped too for the same per-user, cached reason.
+  // default, so each non-AUTH bucket (AI chat, page template, public-share AI,
+  // client vitals) is explicitly skipped here. collab-token is auth-guarded
+  // (JwtAuthGuard), per-user and client-cached, so those feature buckets are
+  // irrelevant to it; skipping them avoids spurious 429s when a user opens many
+  // pages in a short window. The VITALS bucket must be skipped too: it is a
+  // process-wide named throttler, so without this skip its per-IP limit would
+  // silently cap collab-token (the one route that opts out of every other
+  // bucket) and break editing behind shared/NAT IPs. The AUTH bucket is skipped
+  // for the same per-user, cached reason.
  @SkipThrottle({
    [AUTH_THROTTLER]: true,
    [AI_CHAT_THROTTLER]: true,
    [PAGE_TEMPLATE_THROTTLER]: true,
    [PUBLIC_SHARE_AI_THROTTLER]: true,
+    [VITALS_THROTTLER]: true,
  })
  @UseGuards(JwtAuthGuard)
  @HttpCode(HttpStatus.OK)
@@ -0,0 +1,105 @@
+/**
+ * Server-side whitelist + limits for POST /api/telemetry/vitals (#355).
+ *
+ * The endpoint is PUBLIC (browsers post it, no auth) so it is a privacy and
+ * abuse surface: everything not on these lists is silently DROPPED and the
+ * request still returns 200 (never 400 — a 400 would make browsers retry).
+ */
+
+// The only metric names accepted. Anything else is dropped.
+export const ALLOWED_METRIC_NAMES = new Set<string>([
+  'INP',
+  'LCP',
+  'CLS',
+  'TTFB',
+  'editor_tx_ms',
+  'page_open_ms',
+  'longtask_ms',
+]);
+
+// The only rating values accepted (web-vitals). Anything else -> null.
+export const ALLOWED_RATINGS = new Set<string>([
+  'good',
+  'needs-improvement',
+  'poor',
+]);
+
+// Max events accepted per batch; the rest are ignored.
+export const MAX_EVENTS_PER_BATCH = 50;
+
+// Defence-in-depth body cap (~16KB). Fastify's global bodyLimit is far larger,
+// so we re-check the parsed payload size here and drop oversized batches.
+export const MAX_BODY_BYTES = 16 * 1024;
+
+// attr is truncated to this many characters (attribution target only, no PII).
+export const MAX_ATTR_LENGTH = 120;
+
+// route label sanity cap (client sends a template like /s/:space/p/:slug).
+export const MAX_ROUTE_LENGTH = 200;
+
+// `client_metrics.doc_size` is a Postgres `int` (int4). A garbage/huge docSize
+// on a single event would overflow int4 and make Postgres reject the WHOLE
+// batch INSERT, losing every event in it. Values outside this range are DROPPED
+// to null (the event is still kept) so one bad field never loses the batch.
+export const DOC_SIZE_MAX = 2147483647; // 2^31 - 1 (int4 max)
+
+export interface ClientMetricRow {
+  name: string;
+  value: number;
+  rating: string | null;
+  route: string | null;
+  attr: string | null;
+  docSize: number | null;
+  workspaceId: string | null;
+}
+
+/**
+ * Validate + normalise a single incoming event into a DB row, or return null to
+ * DROP it. Pure, so it is directly unit-testable. Enforces the name whitelist,
+ * a finite numeric value, the rating whitelist, route/attr truncation, and
+ * docSize integer coercion with the int4 range guard.
+ */
+export function sanitizeVitalEvent(
+  raw: unknown,
+  workspaceId: string | null,
+): ClientMetricRow | null {
+  if (!raw || typeof raw !== 'object') return null;
+  const e = raw as Record<string, unknown>;
+
+  const name = e.name;
+  if (typeof name !== 'string' || !ALLOWED_METRIC_NAMES.has(name)) return null;
+
+  const value =
+    typeof e.value === 'number' && Number.isFinite(e.value) ? e.value : null;
+  if (value === null) return null;
+
+  const rating =
+    typeof e.rating === 'string' && ALLOWED_RATINGS.has(e.rating)
+      ? e.rating
+      : null;
+
+  let route: string | null = null;
+  if (typeof e.route === 'string' && e.route.length > 0) {
+    route = e.route.slice(0, MAX_ROUTE_LENGTH);
+  }
+
+  let attr: string | null = null;
+  if (typeof e.attr === 'string' && e.attr.length > 0) {
+    attr = e.attr.slice(0, MAX_ATTR_LENGTH);
+  }
+
+  let docSize: number | null = null;
+  if (typeof e.docSize === 'number' && Number.isFinite(e.docSize)) {
+    docSize = Math.trunc(e.docSize);
+  } else if (typeof e.doc_size === 'number' && Number.isFinite(e.doc_size)) {
+    // Accept snake_case too, in case a client sends the raw column name.
+    docSize = Math.trunc(e.doc_size);
+  }
+  // Guard the int4 column: an out-of-range docSize would overflow int4 and make
+  // Postgres reject the whole batch INSERT. Drop the field (keep the event)
+  // rather than lose every other event in the batch.
+  if (docSize !== null && (docSize < 0 || docSize > DOC_SIZE_MAX)) {
+    docSize = null;
+  }
+
+  return { name, value, rating, route, attr, docSize, workspaceId };
+}
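The int4 guard at the end of sanitizeVitalEvent can be exercised in isolation. Below is a minimal sketch, assuming a hypothetical helper `coerceDocSize`. The original inlines this logic and also accepts the snake_case `doc_size` key, which is omitted here.

```typescript
// Mirrors DOC_SIZE_MAX above: 2^31 - 1, the largest Postgres int4 value.
const INT4_MAX = 2147483647;

// Hypothetical helper: returns a storable integer, or null to drop the field
// while the surrounding event is kept.
function coerceDocSize(raw: unknown): number | null {
  // Non-numbers, NaN and +/-Infinity are never stored.
  if (typeof raw !== 'number' || !Number.isFinite(raw)) return null;
  // Fractions are truncated toward zero before the range check.
  const n = Math.trunc(raw);
  // Out-of-range values would make Postgres reject the whole batch INSERT.
  if (n < 0 || n > INT4_MAX) return null;
  return n;
}
```

Returning null for the field, rather than rejecting the event, keeps one bad value from costing the other events in the batch.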
@@ -0,0 +1,47 @@
+import { ClientTelemetryModule } from './client-telemetry.module';
+import { VitalsController } from './vitals.controller';
+import { VitalsService } from './vitals.service';
+
+// The register() gate is the CORE of the maintainer's E1=B decision: the public,
+// unauthenticated /api/telemetry/vitals endpoint must be OFF by default, so a
+// self-host deploy has no anonymous disk-fill surface into `client_metrics`. A
+// regression that inverts the flag (or a truthiness bug where "" / "false"
+// registers the route) would silently reopen that surface — pin it here.
+describe('ClientTelemetryModule.register (E1=B gate)', () => {
+  const original = process.env.CLIENT_TELEMETRY_ENABLED;
+  afterEach(() => {
+    if (original === undefined) delete process.env.CLIENT_TELEMETRY_ENABLED;
+    else process.env.CLIENT_TELEMETRY_ENABLED = original;
+  });
+
+  it('OFF by default (flag unset) — no controller, no provider (endpoint absent)', () => {
+    delete process.env.CLIENT_TELEMETRY_ENABLED;
+    const mod = ClientTelemetryModule.register();
+    expect(mod.controllers).toEqual([]);
+    expect(mod.providers).toEqual([]);
+  });
+
+  it.each(['false', 'False', '0', '', 'yes', '1'])(
+    'stays OFF for non-"true" value %p (no route)',
+    (val) => {
+      process.env.CLIENT_TELEMETRY_ENABLED = val;
+      const mod = ClientTelemetryModule.register();
+      expect(mod.controllers).toEqual([]);
+      expect(mod.providers).toEqual([]);
+    },
+  );
+
+  it('ON only for "true" — registers VitalsController + VitalsService', () => {
+    process.env.CLIENT_TELEMETRY_ENABLED = 'true';
+    const mod = ClientTelemetryModule.register();
+    expect(mod.controllers).toContain(VitalsController);
+    expect(mod.providers).toContain(VitalsService);
+  });
+
+  it('ON is case-insensitive ("TRUE")', () => {
+    process.env.CLIENT_TELEMETRY_ENABLED = 'TRUE';
+    const mod = ClientTelemetryModule.register();
+    expect(mod.controllers).toContain(VitalsController);
+    expect(mod.providers).toContain(VitalsService);
+  });
+});
@@ -0,0 +1,32 @@
+import { DynamicModule, Module } from '@nestjs/common';
+import { VitalsController } from './vitals.controller';
+import { VitalsService } from './vitals.service';
+
+/**
+ * Client perf-telemetry (#355): the public /api/telemetry/vitals sink that
+ * persists web-vitals + custom client metrics into `client_metrics`.
+ * Named ClientTelemetryModule to avoid confusion with the unrelated
+ * integrations/telemetry (product usage ping) module.
+ *
+ * GATED OFF BY DEFAULT (maintainer decision E1=B). The public, unauthenticated
+ * endpoint is only registered when CLIENT_TELEMETRY_ENABLED=true — otherwise the
+ * route does NOT exist at all (no anonymous disk-fill surface, and no unbounded
+ * `client_metrics` growth on a self-host deploy without an external pruner). The
+ * client is told the same flag via window.CONFIG and skips sending when off.
+ */
+@Module({})
+export class ClientTelemetryModule {
+  static register(): DynamicModule {
+    // Read process.env directly (not EnvironmentService) so the toggle is
+    // resolved at module-registration time, identical to how the metrics
+    // subsystem reads METRICS_PORT. Absent/anything-but-"true" => OFF.
+    const enabled =
+      (process.env.CLIENT_TELEMETRY_ENABLED ?? '').toLowerCase() === 'true';
+
+    return {
+      module: ClientTelemetryModule,
+      controllers: enabled ? [VitalsController] : [],
+      providers: enabled ? [VitalsService] : [],
+    };
+  }
+}
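The gate reduces to one strict string comparison. Below is a minimal sketch, assuming a hypothetical helper `isClientTelemetryEnabled` that takes the raw env value as an argument (the module itself reads process.env directly inside register()).

```typescript
// Hypothetical helper: true only for "true" in any letter case.
// An unset, empty, or otherwise truthy-looking value ("1", "yes") stays OFF.
function isClientTelemetryEnabled(raw: string | undefined): boolean {
  return (raw ?? '').toLowerCase() === 'true';
}

// Values taken from the spec for this module.
const offValues: Array<string | undefined> = [
  undefined,
  'false',
  'False',
  '0',
  '',
  'yes',
  '1',
];
const onValues: string[] = ['true', 'TRUE'];

const allOff = offValues.every((v) => !isClientTelemetryEnabled(v));
const allOn = onValues.every((v) => isClientTelemetryEnabled(v));
```

Comparing against the literal string, instead of relying on truthiness, is what keeps values such as "false" or "0" from registering the public route.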
@@ -0,0 +1,64 @@
+import {
+  Body,
+  Controller,
+  HttpCode,
+  Post,
+  Req,
+  UseGuards,
+} from '@nestjs/common';
+import { SkipThrottle, Throttle, ThrottlerGuard } from '@nestjs/throttler';
+import { FastifyRequest } from 'fastify';
+import { Public } from '../../common/decorators/public.decorator';
+import {
+  AI_CHAT_THROTTLER,
+  AUTH_THROTTLER,
+  PAGE_TEMPLATE_THROTTLER,
+  PUBLIC_SHARE_AI_THROTTLER,
+  VITALS_THROTTLER,
+} from '../../integrations/throttle/throttler-names';
+import { VitalsService } from './vitals.service';
+
+/**
+ * POST /api/telemetry/vitals (#355) — public client perf-metrics sink.
+ *
+ * PUBLIC (browsers post via sendBeacon, no session) but IP-throttled. Always
+ * returns 200 with no body of interest: invalid/foreign/oversized payloads are
+ * silently dropped by the service rather than 400'd, so browsers never retry.
+ */
+@Controller('telemetry')
+export class VitalsController {
+  constructor(private readonly vitalsService: VitalsService) {}
+
+  @Public()
+  @UseGuards(ThrottlerGuard)
+  // The global ThrottlerGuard applies ALL named throttlers to every route, so
+  // every OTHER bucket must be skipped here — otherwise the strictest of them
+  // (public-share AI at 5/min) would override the intended vitals limit and cap
+  // this route at 5/min instead of 120/min. Skip them all so ONLY the VITALS
+  // bucket below applies.
+  @SkipThrottle({
+    [AUTH_THROTTLER]: true,
+    [AI_CHAT_THROTTLER]: true,
+    [PAGE_TEMPLATE_THROTTLER]: true,
+    [PUBLIC_SHARE_AI_THROTTLER]: true,
+  })
+  @Throttle({ [VITALS_THROTTLER]: { limit: 120, ttl: 60_000 } })
+  @Post('vitals')
+  @HttpCode(200)
+  async vitals(
+    @Body() body: unknown,
+    @Req() req: FastifyRequest,
+  ): Promise<{ ok: true }> {
+    // workspaceId is resolved by the workspace-host middleware onto req.raw when
+    // the browser posts from a workspace host; null otherwise. No other PII.
+    const workspaceId =
+      (req.raw as unknown as { workspaceId?: string })?.workspaceId ||
+      null;
+    try {
+      await this.vitalsService.ingest(body, workspaceId);
+    } catch {
+      // Never surface storage errors to the browser; telemetry is best-effort.
+    }
+    return { ok: true };
+  }
+}
@@ -0,0 +1,149 @@
+import { VitalsService } from './vitals.service';
+import { MAX_ATTR_LENGTH } from './client-metrics.constants';
+
+// buildRows is pure (no DB access), so a null db is fine here.
+const svc = new VitalsService(null as any);
+
+describe('VitalsService.buildRows', () => {
+  const WS = 'ws-uuid';
+
+  it('accepts a valid batch and maps whitelisted fields to rows', () => {
+    const body = {
+      events: [
+        { name: 'INP', value: 123.4, rating: 'good', route: '/s/:space/p/:slug' },
+        { name: 'editor_tx_ms', value: 12, route: '/s/:space/p/:slug', docSize: 4096 },
+      ],
+    };
+    const rows = svc.buildRows(body, WS);
+    expect(rows).toHaveLength(2);
+    expect(rows[0]).toEqual({
+      name: 'INP',
+      value: 123.4,
+      rating: 'good',
+      route: '/s/:space/p/:slug',
+      attr: null,
+      docSize: null,
+      workspaceId: WS,
+    });
+    expect(rows[1].name).toBe('editor_tx_ms');
+    expect(rows[1].docSize).toBe(4096);
+    expect(rows[1].workspaceId).toBe(WS);
+  });
+
+  it('accepts a bare array body', () => {
+    const rows = svc.buildRows([{ name: 'LCP', value: 1 }], WS);
+    expect(rows).toHaveLength(1);
+    expect(rows[0].name).toBe('LCP');
+  });
+
+  it('drops events with foreign metric names', () => {
+    const rows = svc.buildRows(
+      { events: [{ name: 'evil_metric', value: 1 }, { name: 'LCP', value: 2 }] },
+      WS,
+    );
+    expect(rows).toHaveLength(1);
+    expect(rows[0].name).toBe('LCP');
+  });
+
+  it('drops events with a non-numeric or missing value', () => {
+    const rows = svc.buildRows(
+      {
+        events: [
+          { name: 'CLS', value: 'nan' },
+          { name: 'CLS' },
+          { name: 'CLS', value: 0.1 },
+        ],
+      },
+      WS,
+    );
+    expect(rows).toHaveLength(1);
+    expect(rows[0].value).toBe(0.1);
+  });
+
+  it('strips foreign fields and only keeps whitelisted columns', () => {
+    const rows = svc.buildRows(
+      { events: [{ name: 'TTFB', value: 5, secret: 'drop-me', title: 'my page' }] },
+      WS,
+    );
+    expect(rows).toHaveLength(1);
+    expect(Object.keys(rows[0]).sort()).toEqual(
+      ['attr', 'docSize', 'name', 'rating', 'route', 'value', 'workspaceId'].sort(),
+    );
+    expect((rows[0] as any).secret).toBeUndefined();
+    expect((rows[0] as any).title).toBeUndefined();
+  });
+
+  it('rejects a rating outside the allowed set (-> null)', () => {
+    const rows = svc.buildRows(
+      { events: [{ name: 'INP', value: 1, rating: 'terrible' }] },
+      WS,
+    );
+    expect(rows[0].rating).toBeNull();
+  });
+
+  it('truncates attr to 120 chars', () => {
+    const longAttr = 'a'.repeat(500);
+    const rows = svc.buildRows(
+      { events: [{ name: 'INP', value: 1, attr: longAttr }] },
+      WS,
+    );
+    expect(rows[0].attr).toHaveLength(MAX_ATTR_LENGTH);
+  });
+
+  it('caps the batch at 50 events', () => {
+    const events = Array.from({ length: 200 }, () => ({ name: 'CLS', value: 1 }));
+    const rows = svc.buildRows({ events }, WS);
+    expect(rows).toHaveLength(50);
+  });
+
+  it('drops an oversized (>16KB) payload wholesale', () => {
+    const events = Array.from({ length: 50 }, () => ({
+      name: 'INP',
+      value: 1,
+      attr: 'x'.repeat(400),
+      route: '/s/:space/p/:slug',
+    }));
+    // Serialised body far exceeds 16KB.
+    const rows = svc.buildRows({ events }, WS);
+    expect(rows).toHaveLength(0);
+  });
+
+  it('returns [] for malformed bodies', () => {
+    expect(svc.buildRows(null, WS)).toEqual([]);
+    expect(svc.buildRows('nope', WS)).toEqual([]);
+    expect(svc.buildRows({ notEvents: 1 }, WS)).toEqual([]);
+    expect(svc.buildRows(42, WS)).toEqual([]);
+  });
+
+  it('carries a null workspaceId through', () => {
+    const rows = svc.buildRows({ events: [{ name: 'LCP', value: 1 }] }, null);
+    expect(rows[0].workspaceId).toBeNull();
+  });
+
+  it('drops an out-of-int4-range docSize to null without losing the batch', () => {
+    const rows = svc.buildRows(
+      {
+        events: [
+          // Garbage docSize overflowing int4 must NOT reject the whole batch:
+          // the field is dropped to null and the event is kept.
+          { name: 'editor_tx_ms', value: 10, docSize: 9_999_999_999 },
+          { name: 'editor_tx_ms', value: 20, docSize: -5 },
+          { name: 'editor_tx_ms', value: 30, docSize: 4096 },
+        ],
+      },
+      WS,
+    );
+    expect(rows).toHaveLength(3);
+    expect(rows[0].docSize).toBeNull();
+    expect(rows[1].docSize).toBeNull();
+    expect(rows[2].docSize).toBe(4096);
+  });
+
+  it('keeps a docSize exactly at the int4 max', () => {
+    const rows = svc.buildRows(
+      { events: [{ name: 'editor_tx_ms', value: 1, docSize: 2147483647 }] },
+      WS,
+    );
+    expect(rows[0].docSize).toBe(2147483647);
+  });
+});
@@ -0,0 +1,70 @@
+import { Injectable } from '@nestjs/common';
+import { InjectKysely } from 'nestjs-kysely';
+import { KyselyDB } from '@docmost/db/types/kysely.types';
+import {
+  ClientMetricRow,
+  MAX_BODY_BYTES,
+  MAX_EVENTS_PER_BATCH,
+  sanitizeVitalEvent,
+} from './client-metrics.constants';
+
+@Injectable()
+export class VitalsService {
+  constructor(@InjectKysely() private readonly db: KyselyDB) {}
+
+  /**
+   * Turn a raw request body into the (bounded, whitelisted) rows to persist.
+   * Pure/synchronous so it is unit-testable without a DB. Returns [] for any
+   * malformed / oversized / foreign input — the caller still responds 200.
+   */
+  buildRows(body: unknown, workspaceId: string | null): ClientMetricRow[] {
+    if (!body || typeof body !== 'object') return [];
+
+    // Defence-in-depth body cap (~16KB): drop oversized batches wholesale.
+    try {
+      if (Buffer.byteLength(JSON.stringify(body)) > MAX_BODY_BYTES) return [];
+    } catch {
+      return [];
+    }
+
+    // Accept either a bare array or `{ events: [...] }`.
+    const events = Array.isArray(body)
+      ? body
+      : Array.isArray((body as { events?: unknown }).events)
+        ? ((body as { events: unknown[] }).events as unknown[])
+        : null;
+    if (!events) return [];
+
+    const rows: ClientMetricRow[] = [];
+    for (const event of events) {
+      if (rows.length >= MAX_EVENTS_PER_BATCH) break;
+      const row = sanitizeVitalEvent(event, workspaceId);
+      if (row) rows.push(row);
+    }
+    return rows;
+  }
+
+  /** Batch-insert the sanitised rows in a single statement. No-op on []. */
+  async insertRows(rows: ClientMetricRow[]): Promise<void> {
+    if (rows.length === 0) return;
+    await this.db
+      .insertInto('clientMetrics')
+      .values(
+        rows.map((r) => ({
+          name: r.name,
+          value: r.value,
+          rating: r.rating,
+          route: r.route,
+          attr: r.attr,
+          docSize: r.docSize,
+          workspaceId: r.workspaceId,
+        })),
+      )
+      .execute();
+  }
+
+  async ingest(body: unknown, workspaceId: string | null): Promise<void> {
+    const rows = this.buildRows(body, workspaceId);
+    await this.insertRows(rows);
+  }
+}
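Review note: `sanitizeVitalEvent` is imported from `client-metrics.constants` but its body does not appear in this part of the diff. The following is a minimal sketch of the contract the spec above pins down, under stated assumptions: the metric-name and rating allow-lists are taken from the `client_metrics` column comments, the 120-char cap and int4 bound from the spec, and the `route` handling (passed through unchanged) is a guess. Every identifier below is hypothetical and may not match the real module.

```typescript
// Hypothetical sketch: allow-lists and limits are inferred from the spec and the
// client_metrics column comments, not copied from client-metrics.constants.
interface ClientMetricRow {
  name: string;
  value: number;
  rating: string | null;
  route: string | null;
  attr: string | null;
  docSize: number | null;
  workspaceId: string | null;
}

const ALLOWED_NAMES = new Set<string>([
  'INP', 'LCP', 'CLS', 'TTFB', 'editor_tx_ms', 'page_open_ms', 'longtask_ms',
]);
const ALLOWED_RATINGS = new Set<string>(['good', 'needs-improvement', 'poor']);
const MAX_ATTR_LENGTH = 120;
const INT4_MAX = 2_147_483_647;

function sanitizeVitalEvent(
  event: unknown,
  workspaceId: string | null,
): ClientMetricRow | null {
  if (!event || typeof event !== 'object') return null;
  const { name, value, rating, route, attr, docSize } = event as Record<
    string,
    unknown
  >;

  // Foreign metric names and non-numeric values drop the whole event.
  if (typeof name !== 'string' || !ALLOWED_NAMES.has(name)) return null;
  if (typeof value !== 'number' || !Number.isFinite(value)) return null;

  // Optional fields degrade to null instead of rejecting the event.
  const inInt4Range =
    typeof docSize === 'number' &&
    Number.isInteger(docSize) &&
    docSize >= 0 &&
    docSize <= INT4_MAX;

  // Only whitelisted columns are copied; any other input key is discarded.
  return {
    name,
    value,
    rating:
      typeof rating === 'string' && ALLOWED_RATINGS.has(rating) ? rating : null,
    route: typeof route === 'string' ? route : null, // assumption: no cap here
    attr: typeof attr === 'string' ? attr.slice(0, MAX_ATTR_LENGTH) : null,
    docSize: inInt4Range ? (docSize as number) : null,
    workspaceId,
  };
}
```

Building the row from an explicit field list, rather than spreading the input, is what makes the "strips foreign fields" test pass by construction.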
@@ -55,6 +55,14 @@ export class UpdateWorkspaceDto extends PartialType(CreateWorkspaceDto) {
  @IsBoolean()
  aiDictationStreaming: boolean;

+  // #184: detached/autonomous agent runs (settings.ai.autonomousRuns). When on, a
+  // chat turn becomes a server-side RUN that survives a browser disconnect; only
+  // an explicit /ai-chat/stop ends it. Off by default; single-instance-only in
+  // phase 1 (see AiChatRunService.warnIfMultiInstance / AGENTS.md).
+  @IsOptional()
+  @IsBoolean()
+  autonomousRuns: boolean;
+
  // Workspace master toggle that enables/disables the HTML embed block type.
  // Persisted at settings.htmlEmbed. ABSENT/false => OFF (default). The block
  // itself renders in a sandboxed iframe, so this is a feature switch, not a
@@ -526,6 +526,20 @@ export class WorkspaceService {
        );
      }

+      if (typeof updateWorkspaceDto.autonomousRuns !== 'undefined') {
+        const prev = settingsBefore?.ai?.autonomousRuns ?? false;
+        if (prev !== updateWorkspaceDto.autonomousRuns) {
+          before.autonomousRuns = prev;
+          after.autonomousRuns = updateWorkspaceDto.autonomousRuns;
+        }
+        await this.workspaceRepo.updateAiSettings(
+          workspaceId,
+          'autonomousRuns',
+          updateWorkspaceDto.autonomousRuns,
+          trx,
+        );
+      }
+
      if (typeof updateWorkspaceDto.htmlEmbed !== 'undefined') {
        const prev = settingsBefore?.htmlEmbed ?? false;
        if (prev !== updateWorkspaceDto.htmlEmbed) {
@@ -579,6 +593,7 @@ export class WorkspaceService {
      delete updateWorkspaceDto.aiChat;
      delete updateWorkspaceDto.aiDictation;
      delete updateWorkspaceDto.aiDictationStreaming;
+      delete updateWorkspaceDto.autonomousRuns;
      delete updateWorkspaceDto.htmlEmbed;
      delete updateWorkspaceDto.trackerHead;
      delete updateWorkspaceDto.aiPublicShareAssistant;
@@ -31,6 +31,7 @@ import { FavoriteRepo } from '@docmost/db/repos/favorite/favorite.repo';
 import { TemplateRepo } from '@docmost/db/repos/template/template.repo';
 import { AiChatRepo } from '@docmost/db/repos/ai-chat/ai-chat.repo';
 import { AiChatMessageRepo } from '@docmost/db/repos/ai-chat/ai-chat-message.repo';
+import { AiChatRunRepo } from '@docmost/db/repos/ai-chat/ai-chat-run.repo';
 import { AiChatPageSnapshotRepo } from '@docmost/db/repos/ai-chat/ai-chat-page-snapshot.repo';
 import { AiProviderCredentialsRepo } from '@docmost/db/repos/ai-chat/ai-provider-credentials.repo';
 import { AiMcpServerRepo } from '@docmost/db/repos/ai-chat/ai-mcp-server.repo';
@@ -40,6 +41,11 @@ import { PageListener } from '@docmost/db/listeners/page.listener';
 import { PostgresJSDialect } from 'kysely-postgres-js';
 import * as postgres from 'postgres';
 import { normalizePostgresUrl } from '../common/helpers';
+import {
+  observeDbQuery,
+  isMetricsEnabled,
+} from '../integrations/metrics/metrics.registry';
+import { firstSqlToken } from '../integrations/metrics/metrics.constants';

@Global()
@Module({
@@ -67,6 +73,18 @@ import { normalizePostgresUrl } from '../common/helpers';
        }),
        plugins: [new CamelCasePlugin()],
        log: (event: LogEvent) => {
+          // #355 — db_query_duration_seconds, labelled by the leading SQL token
+          // (bounded cardinality). Gated on isMetricsEnabled() so the token work
+          // (regex + Set lookup) is skipped entirely when metrics are OFF — not
+          // just observeDbQuery no-op'd — so a non-metrics deployment pays nothing
+          // per query. Runs independent of the dev-only debug logging below.
+          if (isMetricsEnabled()) {
+            observeDbQuery(
+              firstSqlToken(event.query.sql),
+              event.queryDurationMillis / 1000,
+            );
+          }
+
          if (environmentService.getNodeEnv() !== 'development') return;
          const logger = new Logger(DatabaseModule.name);
          if (process.env.DEBUG_DB?.toLowerCase() === 'true') {
@@ -105,6 +123,7 @@ import { normalizePostgresUrl } from '../common/helpers';
    TemplateRepo,
    AiChatRepo,
    AiChatMessageRepo,
+    AiChatRunRepo,
    AiChatPageSnapshotRepo,
    AiProviderCredentialsRepo,
    AiMcpServerRepo,
@@ -139,6 +158,7 @@ import { normalizePostgresUrl } from '../common/helpers';
    TemplateRepo,
    AiChatRepo,
    AiChatMessageRepo,
+    AiChatRunRepo,
    AiChatPageSnapshotRepo,
    AiProviderCredentialsRepo,
    AiMcpServerRepo,
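Review note: the `log` hook comment above describes `firstSqlToken` as a regex plus a Set lookup that keeps the metric label bounded, but the helper itself lives in `metrics.constants` and is not shown here. A minimal sketch of that idea, assuming an allow-list of leading keywords and an `other` fallback bucket (both are assumptions, as is the name `firstSqlTokenSketch`):

```typescript
// Hypothetical sketch of a bounded-cardinality SQL label; the real
// firstSqlToken in metrics.constants may use a different allow-list/fallback.
const KNOWN_SQL_TOKENS = new Set<string>([
  'select', 'insert', 'update', 'delete', 'with', 'begin', 'commit', 'rollback',
]);

function firstSqlTokenSketch(sqlText: string): string {
  // Grab the first run of letters, ignoring leading whitespace.
  const match = /^\s*([a-zA-Z]+)/.exec(sqlText);
  const token = match ? match[1].toLowerCase() : '';
  // Anything outside the allow-list collapses into one bucket, so arbitrary
  // SQL text can never mint a new label value.
  return KNOWN_SQL_TOKENS.has(token) ? token : 'other';
}
```

Collapsing unknown tokens into a single bucket is what bounds the number of time series the histogram can create.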
@@ -0,0 +1,52 @@
+import { type Kysely, sql } from 'kysely';
+
+/**
+ * #355 — `client_metrics`: raw sink for client-side perf telemetry (web-vitals
+ * + custom editor/page metrics) posted to /api/telemetry/vitals.
+ *
+ * The table/columns/indexes here are a FIXED contract shared with the deployed
+ * Grafana infra (the `grafana_ro` role reads this table; a separate maintenance
+ * container prunes rows >90d and re-GRANTs daily). No app-side retention is
+ * added on purpose. Written as raw SQL to match that contract 1:1 (identity PK,
+ * conditional GRANT).
+ */
+export async function up(db: Kysely<any>): Promise<void> {
+  await sql`
+    CREATE TABLE client_metrics (
+      id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
+      created_at timestamptz NOT NULL DEFAULT now(),
+      name text NOT NULL,          -- INP|LCP|CLS|TTFB|editor_tx_ms|page_open_ms|longtask_ms
+      value double precision NOT NULL,
+      rating text,                 -- good|needs-improvement|poor (web-vitals only)
+      route text,                  -- templated: /s/:space/p/:slug — never raw slugs
+      attr text,                   -- attribution target, truncated to 120 chars
+      doc_size int,                -- editor_tx_ms only
+      workspace_id uuid
+    )
+  `.execute(db);
+
+  await sql`
+    CREATE INDEX idx_client_metrics_name_created
+      ON client_metrics (name, created_at)
+  `.execute(db);
+
+  await sql`
+    CREATE INDEX idx_client_metrics_created
+      ON client_metrics (created_at)
+  `.execute(db);
+
+  // The read-only Grafana role only exists in the deployed environment; guard so
+  // the migration still applies cleanly in dev/CI where the role is absent.
+  await sql`
+    DO $$
+    BEGIN
+      IF EXISTS (SELECT FROM pg_roles WHERE rolname = 'grafana_ro') THEN
+        GRANT SELECT ON client_metrics TO grafana_ro;
+      END IF;
+    END $$;
+  `.execute(db);
+}
+
+export async function down(db: Kysely<any>): Promise<void> {
+  await sql`DROP TABLE IF EXISTS client_metrics`.execute(db);
+}
@@ -0,0 +1,106 @@
+import { type Kysely, sql } from 'kysely';
+
+/**
+ * `ai_chat_runs` — the agent RUN as a first-class, server-side lifecycle object
+ * (#184 phase 1: autonomous agent runs detached from the browser window).
+ *
+ * Until now an agent turn lived ONLY as long as the HTTP request was open
+ * (`res.hijack()` in ai-chat.controller.ts); a browser disconnect aborted it.
+ * This table makes a turn a persistent object the server owns: it is created
+ * when a run starts (inserted directly as 'running' in phase 1; 'pending' is
+ * only this column's default and a reserved value, not yet written by code),
+ * advances to succeeded|failed|aborted, and still settles after the subscriber
+ * (the browser) has gone away. The DB is the source of truth: a later client
+ * reconnects and sees the result by reading this row plus the assistant message
+ * it projects (`assistant_message_id`).
+ *
+ * The assistant message row (#183 step-granular durability) is the PROJECTION of
+ * a run's output; this row is the run's LIFECYCLE. They are linked by
+ * `assistant_message_id` (SET NULL if the message is later pruned).
+ *
+ * `status`  : 'pending' | 'running' | 'succeeded' | 'failed' | 'aborted'.
+ * `trigger` : 'user' | 'autostart' | 'schedule' | 'api' | 'continue' — only
+ *             'user' is produced in phase 1; the others are reserved for the
+ *             autonomy triggers deferred to phase 2 so they need no later
+ *             migration.
+ *
+ * ONE ACTIVE RUN PER CHAT is enforced by a partial unique index on `chat_id`
+ * WHERE status IN ('pending','running'): an autonomous run and a user run can
+ * never trample each other on the same chat. Settled runs (succeeded/failed/
+ * aborted) are excluded from the index so a chat can accumulate any number of
+ * historical runs.
+ */
+export async function up(db: Kysely<any>): Promise<void> {
+  await db.schema
+    .createTable('ai_chat_runs')
+    .ifNotExists()
+    .addColumn('id', 'uuid', (col) =>
+      col.primaryKey().defaultTo(sql`gen_uuid_v7()`),
+    )
+    .addColumn('chat_id', 'uuid', (col) =>
+      col.references('ai_chats.id').onDelete('cascade').notNull(),
+    )
+    .addColumn('workspace_id', 'uuid', (col) =>
+      col.references('workspaces.id').onDelete('cascade').notNull(),
+    )
+    // The human who triggered the run (audit). SET NULL on user deletion so the
+    // run history outlives its author; NULL is also the natural value for a
+    // future system/cron/api trigger with no human actor.
+    .addColumn('created_by', 'uuid', (col) =>
+      col.references('users.id').onDelete('set null'),
+    )
+    // The assistant message this run materializes (the #183 projection). SET NULL
+    // if that message row is later deleted; nullable because the run row is
+    // created a moment BEFORE the assistant row is seeded.
+    .addColumn('assistant_message_id', 'uuid', (col) =>
+      col.references('ai_chat_messages.id').onDelete('set null'),
+    )
+    .addColumn('trigger', 'varchar(20)', (col) =>
+      col.notNull().defaultTo('user'),
+    )
+    .addColumn('status', 'varchar(20)', (col) =>
+      col.notNull().defaultTo('pending'),
+    )
+    // Terminal error message for a failed run (provider/transport cause),
+    // mirroring the assistant message's metadata.error.
+    .addColumn('error', 'text', (col) => col)
+    // Number of agent steps finished so far (kept monotonic with the projection).
+    .addColumn('step_count', 'integer', (col) => col.notNull().defaultTo(0))
+    // Set when an EXPLICIT user stop is requested (distinct from a mere browser
+    // disconnect, which never stops a run). The runner aborts the turn and the
+    // run settles as 'aborted'.
+    .addColumn('stop_requested_at', 'timestamptz', (col) => col)
+    .addColumn('started_at', 'timestamptz', (col) => col)
+    .addColumn('finished_at', 'timestamptz', (col) => col)
+    .addColumn('created_at', 'timestamptz', (col) =>
+      col.notNull().defaultTo(sql`now()`),
+    )
+    .addColumn('updated_at', 'timestamptz', (col) =>
+      col.notNull().defaultTo(sql`now()`),
+    )
+    .execute();
+
+  // Reconnect / "latest run for this chat" reads hit chat_id first.
+  await db.schema
+    .createIndex('ai_chat_runs_chat_id_idx')
+    .ifNotExists()
+    .on('ai_chat_runs')
+    .column('chat_id')
+    .execute();
+
+  // One ACTIVE run per chat (enforced at the DB level): a second pending/running
+  // run on the same chat is rejected, so a user turn and an autonomous turn can
+  // never race on the same chat. Partial so settled runs do not collide.
+  await db.schema
+    .createIndex('ai_chat_runs_one_active_per_chat')
+    .ifNotExists()
+    .on('ai_chat_runs')
+    .column('chat_id')
+    .unique()
+    .where(sql.ref('status'), 'in', sql`('pending','running')`)
+    .execute();
+}
+
+export async function down(db: Kysely<any>): Promise<void> {
+  await db.schema.dropTable('ai_chat_runs').ifExists().execute();
+}
@@ -0,0 +1,27 @@
+import { type Kysely } from 'kysely';
+
+/**
+ * #370 — page-versioning intentionality tier on a history snapshot.
+ *
+ * Adds `page_history.kind`, the three-tier "how intentional was this snapshot"
+ * marker that lets versions (intentional points) be told apart from autosaves:
+ *   - 'manual'   — a human explicitly saved a version (Cmd+S / Save button)
+ *   - 'agent'    — the AI agent explicitly saved a version
+ *   - 'idle'     — trailing idle-flush autosnapshot (safety net)
+ *   - 'boundary' — autosnapshot pinned on a source transition (user↔agent↔git)
+ *
+ * Nullable with NO default (mirrors last_updated_source in the agent-provenance
+ * migration): legacy rows predate the marker and read back as `null`, which the
+ * client renders as a plain autosave. Stored as a short varchar to stay
+ * forward-compatible without an enum migration.
+ */
+export async function up(db: Kysely<any>): Promise<void> {
+  await db.schema
+    .alterTable('page_history')
+    .addColumn('kind', 'varchar(20)', (col) => col)
+    .execute();
+}
+
+export async function down(db: Kysely<any>): Promise<void> {
+  await db.schema.alterTable('page_history').dropColumn('kind').execute();
+}
@@ -121,6 +121,23 @@ export class AiChatMessageRepo {
    return rows.reverse();
  }

+  /** Fetch a single message by id + workspace (e.g. a run's projection row for
+   *  the #184 reconnect read). Returns undefined when nothing matches. */
+  async findById(
+    id: string,
+    workspaceId: string,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatMessage | undefined> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .selectFrom('aiChatMessages')
+      .select(this.baseFields)
+      .where('id', '=', id)
+      .where('workspaceId', '=', workspaceId)
+      .where('deletedAt', 'is', null)
+      .executeTakeFirst();
+  }
+
  async insert(
    insertable: InsertableAiChatMessage,
    trx?: KyselyTransaction,
@@ -0,0 +1,82 @@
+import { AiChatRunRepo, SWEEP_RUN_STALE_MS } from './ai-chat-run.repo';
+import type { KyselyDB } from '../../types/kysely.types';
+
+/**
+ * Unit coverage for AiChatRunRepo.sweepRunning over a chainable builder mock (no
+ * live DB). The F1 invariant under test (DECISION C): the BOOT sweep is
+ * UNCONDITIONAL — it adds NO `updatedAt <` predicate, so a fresh 'running' run
+ * (updatedAt = now) IS settled rather than skipped by a staleness window. The
+ * window is added ONLY when an explicit `staleMs` is supplied (the future phase-2
+ * multi-instance timer sweep). We assert the EXACT predicates the spec mandates.
+ */
+describe('AiChatRunRepo.sweepRunning', () => {
+  type Recorded = {
+    table?: string;
+    set?: Record<string, unknown>;
+    wheres: Array<[string, string, unknown]>;
+    returning?: string;
+  };
+
+  function makeDb(swept: Array<{ id: string }>): {
+    db: KyselyDB;
+    rec: Recorded;
+  } {
+    const rec: Recorded = { wheres: [] };
+    const builder: Record<string, unknown> = {};
+    builder.set = (v: Record<string, unknown>) => {
+      rec.set = v;
+      return builder;
+    };
+    builder.where = (col: string, op: string, val: unknown) => {
+      rec.wheres.push([col, op, val]);
+      return builder;
+    };
+    builder.returning = (col: string) => {
+      rec.returning = col;
+      return builder;
+    };
+    builder.execute = () => Promise.resolve(swept);
+    const db = {
+      updateTable: (table: string) => {
+        rec.table = table;
+        return builder;
+      },
+    } as unknown as KyselyDB;
+    return { db, rec };
+  }
+
+  it('F1: the boot sweep (no staleMs) is UNCONDITIONAL — only a status filter, NO updatedAt window', async () => {
+    const { db, rec } = makeDb([{ id: 'r1' }, { id: 'r2' }]);
+    const repo = new AiChatRunRepo(db);
+
+    const swept = await repo.sweepRunning();
+
+    expect(swept).toBe(2);
+    expect(rec.table).toBe('aiChatRuns');
+    // The status filter is always present...
+    expect(rec.wheres).toContainEqual([
+      'status',
+      'in',
+      expect.arrayContaining(['pending', 'running']),
+    ]);
+    // ...but a fresh 'running' run (updatedAt = now) must NOT be skipped: no
+    // updatedAt predicate at all on the boot path.
+    expect(rec.wheres.some(([col]) => col === 'updatedAt')).toBe(false);
+    // It flips to 'aborted' and stamps finishedAt.
+    expect(rec.set).toEqual(
+      expect.objectContaining({ status: 'aborted', finishedAt: expect.any(Date) }),
+    );
+  });
+
+  it('phase-2 path: an explicit staleMs reintroduces the updatedAt window', async () => {
+    const { db, rec } = makeDb([]);
+    const repo = new AiChatRunRepo(db);
+
+    await repo.sweepRunning({ staleMs: SWEEP_RUN_STALE_MS });
+
+    const updatedAtWhere = rec.wheres.find(([col]) => col === 'updatedAt');
+    expect(updatedAtWhere).toBeDefined();
+    expect(updatedAtWhere![1]).toBe('<');
+    expect(updatedAtWhere![2]).toBeInstanceOf(Date);
+  });
+});
@@ -0,0 +1,212 @@
+import { Injectable, Logger } from '@nestjs/common';
+import { InjectKysely } from 'nestjs-kysely';
+import { sql } from 'kysely';
+import { KyselyDB, KyselyTransaction } from '../../types/kysely.types';
+import { dbOrTx } from '../../utils';
+import {
+  AiChatRun,
+  InsertableAiChatRun,
+} from '@docmost/db/types/entity.types';
+
+// Statuses that count as "the run is still live" (an autonomous and a user run
+// must never both be live on one chat — enforced by the partial unique index and
+// checked here for friendly 409s before the insert races the constraint).
+export const ACTIVE_RUN_STATUSES = ['pending', 'running'] as const;
+
+// Crash-recovery sweep recency threshold (mirrors AiChatMessageRepo.sweepStreaming,
+// #183): when a staleness window is supplied, a 'running'/'pending' run is only
+// swept to 'aborted' once it has been UNTOUCHED for this long, so a sibling
+// replica's boot-sweep can never abort a run another replica is actively
+// executing. The runner bumps `updatedAt` on every step, so a live run never
+// matches. PHASE 1 is single-process and the boot sweep passes NO window (every
+// dangling run is settled unconditionally — see sweepRunning / F1). This constant
+// is the window to reintroduce for the phase-2 multi-instance timer sweep.
+export const SWEEP_RUN_STALE_MS = 10 * 60 * 1000; // 10 minutes
+
+/**
+ * Repository for `ai_chat_runs` (#184 phase 1): the agent run as a first-class,
+ * server-side lifecycle object detached from the HTTP request. The run row is the
+ * point a client subscribes/reconnects to (by `id` or by chat); the assistant
+ * message it links to (`assistantMessageId`) is the #183 projection of its output.
+ */
+@Injectable()
+export class AiChatRunRepo {
+  private readonly logger = new Logger(AiChatRunRepo.name);
+
+  private baseFields: Array<keyof AiChatRun> = [
+    'id',
+    'chatId',
+    'workspaceId',
+    'createdBy',
+    'assistantMessageId',
+    'trigger',
+    'status',
+    'error',
+    'stepCount',
+    'stopRequestedAt',
+    'startedAt',
+    'finishedAt',
+    'createdAt',
+    'updatedAt',
+  ];
+
+  constructor(@InjectKysely() private readonly db: KyselyDB) {}
+
+  async insert(
+    insertable: InsertableAiChatRun,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatRun> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .insertInto('aiChatRuns')
+      .values(insertable)
+      .returning(this.baseFields)
+      .executeTakeFirstOrThrow();
+  }
+
+  async findById(
+    id: string,
+    workspaceId: string,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatRun | undefined> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .selectFrom('aiChatRuns')
+      .select(this.baseFields)
+      .where('id', '=', id)
+      .where('workspaceId', '=', workspaceId)
+      .executeTakeFirst();
+  }
+
+  /** The currently-active (pending|running) run for a chat, if any. At most one
+   *  exists thanks to the partial unique index. */
+  async findActiveByChat(
+    chatId: string,
+    workspaceId: string,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatRun | undefined> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .selectFrom('aiChatRuns')
+      .select(this.baseFields)
+      .where('chatId', '=', chatId)
+      .where('workspaceId', '=', workspaceId)
+      .where('status', 'in', ACTIVE_RUN_STATUSES as unknown as string[])
+      .executeTakeFirst();
+  }
+
+  /** The most-recent run for a chat (active or settled) — the reconnect target. */
+  async findLatestByChat(
+    chatId: string,
+    workspaceId: string,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatRun | undefined> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .selectFrom('aiChatRuns')
+      .select(this.baseFields)
+      .where('chatId', '=', chatId)
+      .where('workspaceId', '=', workspaceId)
+      .orderBy('createdAt', 'desc')
+      .orderBy('id', 'desc')
+      .limit(1)
+      .executeTakeFirst();
+  }
+
+  /**
+   * Patch a run by id + workspace; always bumps `updatedAt`. Used for every
+   * lifecycle transition (mark running, link the assistant message, bump
+   * step_count, finalize succeeded/failed/aborted). Returns the updated row or
+   * undefined when nothing matched (e.g. a foreign workspace).
+   */
+  async update(
+    id: string,
+    workspaceId: string,
+    patch: Partial<{
+      status: string;
+      error: string | null;
+      stepCount: number;
+      assistantMessageId: string | null;
+      stopRequestedAt: Date | null;
+      startedAt: Date | null;
+      finishedAt: Date | null;
+    }>,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatRun | undefined> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .updateTable('aiChatRuns')
+      .set({ ...(patch as Record<string, unknown>), updatedAt: new Date() })
+      .where('id', '=', id)
+      .where('workspaceId', '=', workspaceId)
+      .returning(this.baseFields)
+      .executeTakeFirst();
+  }
+
+  /**
+   * Mark an EXPLICIT stop request on an active run (distinct from a browser
+   * disconnect, which never stops a run). Stamps `stop_requested_at` ONLY while
+   * the run is still active, so a late stop on an already-settled run is a no-op.
+   * Returns the row when a stop was recorded, else undefined (nothing active).
+   */
+  async markStopRequested(
+    id: string,
+    workspaceId: string,
+    trx?: KyselyTransaction,
+  ): Promise<AiChatRun | undefined> {
+    const db = dbOrTx(this.db, trx);
+    return db
+      .updateTable('aiChatRuns')
+      .set({ stopRequestedAt: new Date(), updatedAt: new Date() })
+      .where('id', '=', id)
+      .where('workspaceId', '=', workspaceId)
+      .where('status', 'in', ACTIVE_RUN_STATUSES as unknown as string[])
+      .returning(this.baseFields)
+      .executeTakeFirst();
+  }
+
+  /**
+   * Crash-recovery sweep (mirrors AiChatMessageRepo.sweepStreaming): flip every
+   * run still left pending/running — a run whose process died before reaching a
+   * terminal status — to 'aborted', stamping `finished_at`. Returns the number
+   * swept. Workspace-wide on purpose (a crash can dangle runs in any workspace).
+   *
+   * F1 (DECISION C): the BOOT sweep is UNCONDITIONAL — it passes no `staleMs`, so
+   * EVERY dangling run is settled regardless of how recently it was touched. On a
+   * fresh single-process boot any pending|running run is definitionally hung (no
+   * runner is alive to own it), so a fast restart (deploy/OOM within minutes of
+   * the last step) no longer leaves a run stuck 'running' forever — which would
+   * make the one-active-run gate 409 every future turn in that chat.
+   *
+   * The optional `staleMs` window is reintroduced ONLY for the future phase-2
+   * multi-instance timer sweep (see {@link SWEEP_RUN_STALE_MS}): there a booting
+   * replica must NOT abort a run another replica is actively executing, so it
+   * sweeps only runs UNTOUCHED past the window. Phase 1 is single-process, so the
+   * boot path supplies no window.
+   */
+  async sweepRunning(
+    opts: { staleMs?: number } = {},
+    trx?: KyselyTransaction,
+  ): Promise<number> {
+    const db = dbOrTx(this.db, trx);
+    const now = new Date();
+    let query = db
+      .updateTable('aiChatRuns')
+      .set({
+        status: 'aborted',
+        finishedAt: now,
+        updatedAt: now,
+        error: sql`coalesce(error, ${'Run interrupted by a server restart.'})`,
+      })
+      .where('status', 'in', ACTIVE_RUN_STATUSES as unknown as string[]);
+    // Multi-instance (phase 2) only: skip runs touched within the window so a
+    // sibling replica's live run is never aborted. Omitted on the phase-1 boot
+    // sweep -> unconditional.
+    if (typeof opts.staleMs === 'number') {
+      const staleBefore = new Date(now.getTime() - opts.staleMs);
+      query = query.where('updatedAt', '<', staleBefore);
+    }
+    const rows = await query.returning('id').execute();
+    return rows.length;
+  }
+}
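Review note: the two sweep modes in the `sweepRunning` doc comment can be modelled in memory. This is a minimal sketch, assuming a simplified `RunRow` shape and a hypothetical `sweepRuns` helper (neither exists in the repo); it mirrors only the active-status filter and the optional `updatedAt < now - staleMs` cutoff, not the Kysely query.

```typescript
type RunStatus = 'pending' | 'running' | 'succeeded' | 'failed' | 'aborted';

// Hypothetical, simplified stand-in for an ai_chat_runs row.
interface RunRow {
  id: string;
  status: RunStatus;
  updatedAt: Date;
  finishedAt: Date | null;
}

const ACTIVE: readonly RunStatus[] = ['pending', 'running'];

// Abort every active run. When staleMs is given, only runs untouched for
// longer than the window are aborted (the phase-2 multi-instance case).
function sweepRuns(rows: RunRow[], now: Date, staleMs?: number): number {
  const staleBefore =
    typeof staleMs === 'number' ? now.getTime() - staleMs : null;
  let swept = 0;
  for (const row of rows) {
    if (!ACTIVE.includes(row.status)) continue;
    // Touched inside the window: a sibling replica may still own this run.
    if (staleBefore !== null && row.updatedAt.getTime() >= staleBefore) continue;
    row.status = 'aborted';
    row.finishedAt = now;
    row.updatedAt = now;
    swept++;
  }
  return swept;
}
```

With no `staleMs` (the boot path) a run touched one second ago is still settled; with a window it is left alone.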
@@ -13,6 +13,7 @@ import { jsonArrayFrom, jsonObjectFrom } from 'kysely/helpers/postgres';
 import { ExpressionBuilder, sql } from 'kysely';
 import { DB } from '@docmost/db/types/db';
 import { resolveAgentProvenance } from '../agent-provenance';
+import { PageHistoryKind } from '../../../collaboration/constants';

 /**
 * Role-resolution subquery for a page-history row's bound AI chat (#300). Joins
@@ -46,6 +47,9 @@ export class PageHistoryRepo {
    'lastUpdatedById',
    'lastUpdatedSource',
    'lastUpdatedAiChatId',
+    // #370 — intentionality tier ('manual' | 'agent' | 'idle' | 'boundary');
+    // null on legacy rows (= autosave). Selected so callers can read/promote it.
+    'kind',
    'contributorIds',
    'spaceId',
    'workspaceId',
@@ -85,9 +89,15 @@ export class PageHistoryRepo {

  async saveHistory(
    page: Page,
-    opts?: { contributorIds?: string[]; trx?: KyselyTransaction },
-  ): Promise<void> {
-    await this.insertPageHistory(
+    opts?: {
+      contributorIds?: string[];
+      // #370 — intentionality tier for this snapshot. Omitted → null (legacy
+      // autosave semantics). Callers derive it server-side, never from a client.
+      kind?: PageHistoryKind;
+      trx?: KyselyTransaction;
+    },
+  ): Promise<PageHistory> {
+    return await this.insertPageHistory(
      {
        pageId: page.id,
        slugId: page.slugId,
@@ -99,6 +109,7 @@ export class PageHistoryRepo {
        // Copy the provenance marker off the page row, as for lastUpdatedById.
        lastUpdatedSource: page.lastUpdatedSource,
        lastUpdatedAiChatId: page.lastUpdatedAiChatId,
+        kind: opts?.kind ?? null,
        contributorIds: opts?.contributorIds,
        spaceId: page.spaceId,
        workspaceId: page.workspaceId,
@@ -107,6 +118,25 @@ export class PageHistoryRepo {
    );
  }

+  /**
+   * #370 — promote an existing snapshot's intentionality tier in place. Used by
+   * the manual-save "promote-not-dup" path: when the latest history row already
+   * holds the exact content being versioned, we upgrade its `kind` instead of
+   * duplicating a heavy content row.
+   */
+  async updateHistoryKind(
+    pageHistoryId: string,
+    kind: PageHistoryKind,
+    trx?: KyselyTransaction,
+  ): Promise<void> {
+    const db = dbOrTx(this.db, trx);
+    await db
+      .updateTable('pageHistory')
+      .set({ kind })
+      .where('id', '=', pageHistoryId)
+      .execute();
+  }
+
  async findPageHistoryByPageId(pageId: string, pagination: PaginationOptions) {
    const query = this.db
      .selectFrom('pageHistory')
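Review note: the promote-not-dup decision that `updateHistoryKind` serves can be sketched as a pure function. The `HistoryRow` shape, the `Decision` union and `decideManualSave` are hypothetical names for illustration; the comparator is injected, and the commit message indicates the real code uses an `isDeepStrictEqual` gate. Only the `'manual'` tier is modelled here.

```typescript
type Kind = 'manual' | 'agent' | 'idle' | 'boundary';

// Hypothetical, trimmed-down history row: only what the decision needs.
interface HistoryRow {
  id: string;
  kind: Kind | null; // null = legacy autosave
  content: unknown;
}

type Decision =
  | { action: 'insert'; kind: Kind }
  | { action: 'promote'; pageHistoryId: string; kind: Kind }
  | { action: 'noop' };

function decideManualSave(
  latest: HistoryRow | undefined,
  content: unknown,
  equal: (a: unknown, b: unknown) => boolean,
): Decision {
  // No previous snapshot, or the content changed: write a new row.
  if (!latest || !equal(latest.content, content)) {
    return { action: 'insert', kind: 'manual' };
  }
  // Same content and already a manual version: nothing to do.
  if (latest.kind === 'manual') return { action: 'noop' };
  // Same content held by an autosave: upgrade its kind in place.
  return { action: 'promote', pageHistoryId: latest.id, kind: 'manual' };
}
```

A `promote` result would map to `updateHistoryKind(pageHistoryId, kind, trx)` and an `insert` result to `saveHistory(page, { kind, trx })`.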
@@ -156,6 +156,18 @@ export interface Billing {
  workspaceId: string;
 }

+export interface ClientMetrics {
+  id: Generated<Int8>;
+  createdAt: Generated<Timestamp>;
+  name: string;
+  value: number;
+  rating: string | null;
+  route: string | null;
+  attr: string | null;
+  docSize: number | null;
+  workspaceId: string | null;
+}
+
 export interface Comments {
  aiChatId: string | null;
  content: Json | null;
@@ -268,6 +280,7 @@ export interface PageHistory {
  createdAt: Generated<Timestamp>;
  icon: string | null;
  id: Generated<string>;
+  kind: string | null;
  lastUpdatedAiChatId: string | null;
  lastUpdatedById: string | null;
  lastUpdatedSource: string | null;
@@ -647,6 +660,35 @@ export interface AiChatMessages {
  deletedAt: Timestamp | null;
 }

+// The agent RUN as a first-class server-side lifecycle object (#184 phase 1).
+// Mirrors migration 20260704T130000-ai-chat-runs.ts. A run is created when an
+// agent turn starts and survives the browser disconnecting; the DB is the source
+// of truth a later client reconnects to. `assistantMessageId` links to the #183
+// projection row (the assistant message this run materializes).
+export interface AiChatRuns {
+  id: Generated<string>;
+  chatId: string;
+  workspaceId: string;
+  // SET NULL on user deletion (the run history outlives its author); also NULL
+  // for a future non-human trigger (cron/api).
+  createdBy: string | null;
+  // The assistant message this run materializes; SET NULL if it is pruned.
+  assistantMessageId: string | null;
+  // 'user' | 'autostart' | 'schedule' | 'api' | 'continue' (only 'user' is
+  // produced in phase 1; the rest are reserved for the deferred autonomy triggers).
+  trigger: Generated<string>;
+  // 'pending' | 'running' | 'succeeded' | 'failed' | 'aborted'.
+  status: Generated<string>;
+  error: string | null;
+  stepCount: Generated<number>;
+  // Set when an EXPLICIT user stop is requested (distinct from a disconnect).
+  stopRequestedAt: Timestamp | null;
+  startedAt: Timestamp | null;
+  finishedAt: Timestamp | null;
+  createdAt: Generated<Timestamp>;
+  updatedAt: Generated<Timestamp>;
+}
+
 // Per-(chat,page) snapshot of the open page's Markdown at the END of the agent's
 // previous turn (#274). Mirrors migration 20260702T120000-ai-chat-page-snapshot.ts.
 // The next turn diffs the CURRENT Markdown against `contentMd` to surface edits a
@@ -683,6 +725,7 @@ export interface DB {
  aiAgentRoles: AiAgentRoles;
  aiChats: AiChats;
  aiChatMessages: AiChatMessages;
+  aiChatRuns: AiChatRuns;
  aiChatPageSnapshots: AiChatPageSnapshots;
  apiKeys: ApiKeys;
  attachments: Attachments;
@@ -691,6 +734,7 @@ export interface DB {
  authProviders: AuthProviders;
  backlinks: Backlinks;
  billing: Billing;
+  clientMetrics: ClientMetrics;
  comments: Comments;
  favorites: Favorites;
  fileTasks: FileTasks;
@@ -3,6 +3,7 @@ import {
  AiAgentRoles,
  AiChats,
  AiChatMessages,
+  AiChatRuns,
  AiChatPageSnapshots,
  Attachments,
  Comments,
@@ -56,10 +57,12 @@ export type UpdatableAiChat = Updateable<Omit<AiChats, 'id'>>;
 // full-text search. It is omitted from the public type so it never leaks
 // into HTTP responses or the chat history fed to the language model.
 export type AiChatMessage = Omit<Selectable<AiChatMessages>, 'tsv'>;
-export type InsertableAiChatMessage = Omit<
-  Insertable<AiChatMessages>,
-  'tsv'
->;
+export type InsertableAiChatMessage = Omit<Insertable<AiChatMessages>, 'tsv'>;
+
+// AI Chat Run (#184 phase 1): the agent run as a first-class lifecycle object,
+// detached from the HTTP request / browser window.
+export type AiChatRun = Selectable<AiChatRuns>;
+export type InsertableAiChatRun = Insertable<AiChatRuns>;

 // AI Chat Page Snapshot (#274): per-(chat,page) Markdown snapshot taken at the
 // end of the agent's previous turn, diffed against the current page next turn to
@@ -214,11 +217,14 @@ export type UpdatableFavorite = Updateable<Omit<Favorites, 'id'>>;
 // Page Transclusion
 export type PageTransclusion = Selectable<PageTransclusions>;
 export type InsertablePageTransclusion = Insertable<PageTransclusions>;
-export type UpdatablePageTransclusion = Updateable<Omit<PageTransclusions, 'id'>>;
+export type UpdatablePageTransclusion = Updateable<
+  Omit<PageTransclusions, 'id'>
+>;

 // Page Transclusion Reference
 export type PageTransclusionReference = Selectable<PageTransclusionReferences>;
-export type InsertablePageTransclusionReference = Insertable<PageTransclusionReferences>;
+export type InsertablePageTransclusionReference =
+  Insertable<PageTransclusionReferences>;
 export type UpdatablePageTransclusionReference = Updateable<
  Omit<PageTransclusionReferences, 'id'>
 >;
@@ -288,7 +294,9 @@ export type UpdatablePagePermission = Updateable<Omit<_PagePermissions, 'id'>>;
 // Page Verification
 export type PageVerification = Selectable<_PageVerifications>;
 export type InsertablePageVerification = Insertable<_PageVerifications>;
-export type UpdatablePageVerification = Updateable<Omit<_PageVerifications, 'id'>>;
+export type UpdatablePageVerification = Updateable<
+  Omit<_PageVerifications, 'id'>
+>;

 // Page Verifier
 export type PageVerifier = Selectable<_PageVerifiers>;
@@ -227,6 +227,22 @@ export class EnvironmentService {
    return compactTree === 'true';
  }

+  /**
+   * Operator toggle for the public client-telemetry sink (#355). DEFAULT OFF:
+   * the unauthenticated POST /api/telemetry/vitals endpoint + client vitals
+   * collection are only wired when this is explicitly true. Kept SEPARATE from
+   * METRICS_PORT (the server Prometheus half) because Grafana reads the
+   * `client_metrics` table directly, independent of the scrape port — and
+   * because `client_metrics` has no app-side retention, so an operator must opt
+   * in and run an external pruner.
+   */
+  isClientTelemetryEnabled(): boolean {
+    const enabled = this.configService
+      .get<string>('CLIENT_TELEMETRY_ENABLED', 'false')
+      .toLowerCase();
+    return enabled === 'true';
+  }
+
  getStripePublishableKey(): string {
    return this.configService.get<string>('STRIPE_PUBLISHABLE_KEY');
  }
@@ -0,0 +1,46 @@
+import { FastifyReply, FastifyRequest } from 'fastify';
+import { isStreamingResponse } from './metrics.constants';
+import { observeHttp } from './metrics.registry';
+
+/**
+ * Resolve the BOUNDED route label for an HTTP response.
+ *
+ * HARD REQUIREMENT (#355): use the ROUTE TEMPLATE (`/pages/:id`), NEVER the raw
+ * URL (`/pages/abc-123`), so label cardinality stays finite. Fastify exposes the
+ * matched template on `req.routeOptions.url`. On 404s (no route matched) that is
+ * missing → collapse to the literal `unknown`.
+ */
+export function resolveRouteLabel(req: FastifyRequest): string {
+  const url = req.routeOptions?.url;
+  return typeof url === 'string' && url.length > 0 ? url : 'unknown';
+}
+
+/**
+ * Fastify onResponse handler that records http_request_duration_seconds.
+ * No-op when metrics are disabled (the hook is only registered when enabled,
+ * but the observe helpers are also guarded). Never throws into the response
+ * pipeline — telemetry must not break request handling.
+ */
+export function recordHttpResponse(
+  req: FastifyRequest,
+  reply: FastifyReply,
+): void {
+  try {
+    const route = resolveRouteLabel(req);
+
+    // Exclude SSE/streaming responses: onResponse fires at connection close for
+    // those, so it would record the stream lifetime and poison p95/p99.
+    const contentType = reply.getHeader('content-type');
+    if (isStreamingResponse(contentType, route)) return;
+
+    observeHttp(
+      req.method,
+      route,
+      reply.statusCode,
+      // Fastify measures elapsed time in ms; the metric is in seconds.
+      reply.elapsedTime / 1000,
+    );
+  } catch {
+    // Swallow: a telemetry failure must never affect the served response.
+  }
+}
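Review note: a small illustration of why the route template matters for label cardinality. `RequestLike` is a structural stand-in for the two `FastifyRequest` fields used by `resolveRouteLabel`, and `labelCardinality` is a hypothetical helper; nothing here depends on Fastify itself.

```typescript
// Structural stand-in for the two FastifyRequest fields used here.
interface RequestLike {
  url: string;
  routeOptions?: { url?: string };
}

// Same rule as resolveRouteLabel: matched template, else the literal 'unknown'.
function routeLabel(req: RequestLike): string {
  const url = req.routeOptions?.url;
  return typeof url === 'string' && url.length > 0 ? url : 'unknown';
}

// Count the distinct label values each strategy would produce.
function labelCardinality(reqs: RequestLike[]): { raw: number; template: number } {
  return {
    raw: new Set(reqs.map((r) => r.url)).size,
    template: new Set(reqs.map(routeLabel)).size,
  };
}

// 1000 page reads with distinct ids, plus one unmatched path (a 404).
const requests: RequestLike[] = Array.from({ length: 1000 }, (_, i) => ({
  url: `/api/pages/${i}`,
  routeOptions: { url: '/api/pages/:id' },
}));
requests.push({ url: '/no/such/route', routeOptions: {} });
```

Labelling by raw URL would create one series per page id, while the template collapses the same traffic into two label values.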
@@ -0,0 +1,146 @@
+import {
+  Injectable,
+  Logger,
+  OnModuleDestroy,
+  OnModuleInit,
+} from '@nestjs/common';
+import { InjectQueue } from '@nestjs/bullmq';
+import { Queue, QueueEvents } from 'bullmq';
+import { QueueName } from '../queue/constants';
+import { EnvironmentService } from '../environment/environment.service';
+import { parseRedisUrl } from '../../common/helpers';
+import {
+  isMetricsEnabled,
+  observeJobDuration,
+  setQueueDepth,
+} from './metrics.registry';
+
+const POLL_INTERVAL_MS = 15_000;
+// Cap the in-flight start-time map so a job that never emits completed/failed
+// (worker crash) cannot leak memory unbounded. Well above realistic concurrency.
+const MAX_INFLIGHT = 10_000;
+
+/**
+ * BullMQ instrumentation for #355:
+ *  - `bullmq_queue_depth{queue}`: polled from getJobCounts() every 15s.
+ *  - `bullmq_job_duration_seconds{queue}`: wall-clock time between a job going
+ *    `active` and `completed`/`failed`, observed via per-queue QueueEvents.
+ *
+ * Queue names are a FINITE list (the QueueName enum), so labels are bounded — no
+ * job ids ever enter a label. Everything is gated on METRICS_PORT: when metrics
+ * are off, onModuleInit does nothing (no interval, no QueueEvents connections).
+ */
+@Injectable()
+export class MetricsBullService implements OnModuleInit, OnModuleDestroy {
+  private readonly logger = new Logger(MetricsBullService.name);
+  private readonly queues: { label: string; queue: Queue }[];
+  private timer: NodeJS.Timeout | null = null;
+  private queueEvents: QueueEvents[] = [];
+  // jobId -> start timestamp (ms). Bounded by MAX_INFLIGHT.
+  private readonly inflight = new Map<string, number>();
+
+  constructor(
+    private readonly environmentService: EnvironmentService,
+    @InjectQueue(QueueName.EMAIL_QUEUE) emailQueue: Queue,
+    @InjectQueue(QueueName.ATTACHMENT_QUEUE) attachmentQueue: Queue,
+    @InjectQueue(QueueName.GENERAL_QUEUE) generalQueue: Queue,
+    @InjectQueue(QueueName.BILLING_QUEUE) billingQueue: Queue,
+    @InjectQueue(QueueName.FILE_TASK_QUEUE) fileTaskQueue: Queue,
+    @InjectQueue(QueueName.SEARCH_QUEUE) searchQueue: Queue,
+    @InjectQueue(QueueName.AI_QUEUE) aiQueue: Queue,
+    @InjectQueue(QueueName.HISTORY_QUEUE) historyQueue: Queue,
+    @InjectQueue(QueueName.NOTIFICATION_QUEUE) notificationQueue: Queue,
+    @InjectQueue(QueueName.AUDIT_QUEUE) auditQueue: Queue,
+  ) {
+    this.queues = [
+      { label: 'email', queue: emailQueue },
+      { label: 'attachment', queue: attachmentQueue },
+      { label: 'general', queue: generalQueue },
+      { label: 'billing', queue: billingQueue },
+      { label: 'file-task', queue: fileTaskQueue },
+      { label: 'search', queue: searchQueue },
+      { label: 'ai', queue: aiQueue },
+      { label: 'history', queue: historyQueue },
+      { label: 'notification', queue: notificationQueue },
+      { label: 'audit', queue: auditQueue },
+    ];
+  }
+
+  onModuleInit(): void {
+    if (!isMetricsEnabled()) return;
+
+    // Poll queue depth.
+    this.timer = setInterval(() => {
+      void this.pollDepths();
+    }, POLL_INTERVAL_MS);
+    // Do not keep the event loop alive solely for polling.
+    this.timer.unref?.();
+    void this.pollDepths();
+
+    // Wire per-queue job-duration events.
+    const redisConfig = parseRedisUrl(this.environmentService.getRedisUrl());
+    const connection = {
+      host: redisConfig.host,
+      port: redisConfig.port,
+      password: redisConfig.password,
+      db: redisConfig.db,
+      family: redisConfig.family,
+    };
+
+    for (const { label, queue } of this.queues) {
+      const events = new QueueEvents(queue.name, { connection });
+      events.on('active', ({ jobId }) => {
+        if (this.inflight.size >= MAX_INFLIGHT) {
+          // Drop the oldest tracked start to keep the map bounded.
+          const oldest = this.inflight.keys().next().value;
+          if (oldest !== undefined) this.inflight.delete(oldest);
+        }
+        this.inflight.set(`${label}:${jobId}`, Date.now());
+      });
+      const finalize = ({ jobId }: { jobId: string }) => {
+        const start = this.inflight.get(`${label}:${jobId}`);
+        if (start === undefined) return;
+        this.inflight.delete(`${label}:${jobId}`);
+        observeJobDuration(label, (Date.now() - start) / 1000);
+      };
+      events.on('completed', finalize);
+      events.on('failed', finalize);
+      events.on('error', (err) => {
+        this.logger.debug(`QueueEvents error (${label}): ${err?.message}`);
+      });
+      this.queueEvents.push(events);
+    }
+  }
+
+  private async pollDepths(): Promise<void> {
+    for (const { label, queue } of this.queues) {
+      try {
+        const counts = await queue.getJobCounts();
+        // "Depth" = jobs not yet finished (backlog + in-flight).
+        const depth =
+          (counts.waiting ?? 0) +
+          (counts.active ?? 0) +
+          (counts.delayed ?? 0) +
+          (counts.prioritized ?? 0) +
+          (counts.paused ?? 0);
+        setQueueDepth(label, depth);
+      } catch (err) {
+        this.logger.debug(
+          `Failed to read job counts for ${label}: ${(err as Error)?.message}`,
+        );
+      }
+    }
+  }
+
+  async onModuleDestroy(): Promise<void> {
+    if (this.timer) {
+      clearInterval(this.timer);
+      this.timer = null;
+    }
+    await Promise.all(
+      this.queueEvents.map((e) => e.close().catch(() => undefined)),
+    );
+    this.queueEvents = [];
+    this.inflight.clear();
+  }
+}
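Review note: the bounded start-time tracking can be isolated into a small structure and exercised without Redis. This is a sketch under assumptions: `BoundedStartTimes` is a hypothetical class, timestamps are passed in rather than read from `Date.now()`, and keys are namespaced by queue because default BullMQ job ids are only unique within a queue.

```typescript
class BoundedStartTimes {
  private readonly starts = new Map<string, number>();
  private readonly max: number;

  constructor(max: number) {
    this.max = max;
  }

  // Record the moment a job became active, evicting the oldest entry if full.
  start(queue: string, jobId: string, nowMs: number): void {
    if (this.starts.size >= this.max) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.starts.keys().next().value;
      if (oldest !== undefined) this.starts.delete(oldest);
    }
    this.starts.set(`${queue}:${jobId}`, nowMs);
  }

  // Return the elapsed seconds for a tracked job, or undefined if unknown.
  finish(queue: string, jobId: string, nowMs: number): number | undefined {
    const key = `${queue}:${jobId}`;
    const start = this.starts.get(key);
    if (start === undefined) return undefined;
    this.starts.delete(key);
    return (nowMs - start) / 1000;
  }

  get size(): number {
    return this.starts.size;
  }
}
```

An evicted or never-seen job simply yields no observation, which matches the early return in `finalize`.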
@@ -0,0 +1,16 @@
+import { Injectable, OnModuleDestroy } from '@nestjs/common';
+import { closeMetricsServer } from './metrics.server';
+
+/**
+ * Ties the bare node:http metrics scrape server (started in main.ts after the
+ * Fastify app is up, outside the DI container) into Nest's shutdown lifecycle.
+ * With `app.enableShutdownHooks()`, onModuleDestroy fires on SIGTERM/SIGINT and
+ * closes the listener so it is not left dangling (otherwise jest/e2e never
+ * exits and a prod restart leaks the port). No-op when metrics are disabled.
+ */
+@Injectable()
+export class MetricsServerLifecycle implements OnModuleDestroy {
+  async onModuleDestroy(): Promise<void> {
+    await closeMetricsServer();
+  }
+}
@@ -0,0 +1,84 @@
+/**
+ * Perf-metrics contract (#355). These names/labels are FIXED by the already
+ * deployed scrape+dashboard infra (VictoriaMetrics scraping docmost:9464,
+ * Grafana dashboards, alerts). Do NOT rename them.
+ */
+export const METRIC_HTTP_REQUEST_DURATION = 'http_request_duration_seconds';
+export const METRIC_DB_QUERY_DURATION = 'db_query_duration_seconds';
+export const METRIC_BULLMQ_QUEUE_DEPTH = 'bullmq_queue_depth';
+export const METRIC_BULLMQ_JOB_DURATION = 'bullmq_job_duration_seconds';
+export const METRIC_COLLAB_STORE_DURATION = 'collab_store_duration_seconds';
+
+// Histogram buckets (seconds). Chosen to give useful p50/p95/p99 resolution
+// for typical web/DB latencies without exploding series cardinality.
+export const HTTP_BUCKETS = [
+  0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10,
+];
+export const DB_BUCKETS = [
+  0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5,
+];
+export const COLLAB_BUCKETS = [
+  0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5,
+];
+export const JOB_BUCKETS = [
+  0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120,
+];
+
+// The bounded set of SQL leading keywords used as db_query_duration_seconds
+// labels. Module-const so it is built ONCE, not per query (this runs on every DB
+// query when metrics are enabled).
+const KNOWN_SQL_TOKENS = new Set([
+  'select',
+  'insert',
+  'update',
+  'delete',
+  'with',
+  'begin',
+  'commit',
+  'rollback',
+  'alter',
+  'create',
+  'drop',
+  'truncate',
+  'explain',
+]);
+
+/**
+ * Extract the first SQL token (select/insert/update/delete/...) from a query,
+ * lower-cased, to use as a BOUNDED label for db_query_duration_seconds. Using
+ * the full query text would blow up label cardinality; the leading keyword is a
+ * finite set. Unknown/empty queries collapse to `other`.
+ */
+export function firstSqlToken(sql: string | undefined): string {
+  if (!sql) return 'other';
+  // Skip leading whitespace and opening parens, then grab the first word.
+  const match = /^[\s(]*([a-zA-Z]+)/.exec(sql);
+  if (!match) return 'other';
+  const token = match[1].toLowerCase();
+  return KNOWN_SQL_TOKENS.has(token) ? token : 'other';
+}
+
+/**
+ * Whether an HTTP response must be EXCLUDED from http_request_duration_seconds.
+ *
+ * SSE/streaming responses (the AI-chat `text/event-stream`) keep the connection
+ * open for the whole conversation, so Fastify's onResponse fires only when the
+ * client disconnects — recording the connection lifetime, not a response time,
+ * which would poison p95/p99. We skip by content-type (authoritative) with a
+ * route-suffix fallback for the two known stream endpoints.
+ */
+export function isStreamingResponse(
+  contentType: unknown,
+  route: string | undefined,
+): boolean {
+  if (
+    typeof contentType === 'string' &&
+    contentType.toLowerCase().includes('text/event-stream')
+  ) {
+    return true;
+  }
+  // Fallback: the AI-chat stream routes (/api/ai-chat/stream,
+  // /api/shares/ai/stream) both end in `/stream`.
+  if (route && route.endsWith('/stream')) return true;
+  return false;
+}
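Review note: the regex in `firstSqlToken` skips only leading whitespace and `(`, so a query that begins with a SQL comment would be labelled `other`. If comment-prefixed queries ever matter, a pre-pass could strip them first. The sketch below is an assumption, not part of the PR: `stripLeadingSqlComments` and `leadingWord` are hypothetical helpers, and nested block comments (which PostgreSQL allows) are not handled.

```typescript
// Same pattern as firstSqlToken: optional whitespace/parens, then a word.
const LEADING_WORD = /^[\s(]*([a-zA-Z]+)/;

// Remove any run of leading `-- line` and `/* block */` comments.
function stripLeadingSqlComments(sql: string): string {
  let rest = sql;
  for (;;) {
    const trimmed = rest.replace(/^\s+/, '');
    if (trimmed.startsWith('--')) {
      const newline = trimmed.indexOf('\n');
      if (newline === -1) return ''; // the whole input was a comment
      rest = trimmed.slice(newline + 1);
    } else if (trimmed.startsWith('/*')) {
      const end = trimmed.indexOf('*/');
      if (end === -1) return ''; // unterminated block comment
      rest = trimmed.slice(end + 2);
    } else {
      return trimmed;
    }
  }
}

// Lower-cased first word of the query, or null when there is none.
function leadingWord(sql: string): string | null {
  const match = LEADING_WORD.exec(sql);
  return match?.[1]?.toLowerCase() ?? null;
}
```

Whether the extra pass is worth it depends on the query sources: this runs on every DB query when metrics are enabled, so the cheaper regex-only path may be the right trade-off.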
@@ -0,0 +1,15 @@
+import { Module } from '@nestjs/common';
+import { MetricsBullService } from './metrics-bull.service';
+import { MetricsServerLifecycle } from './metrics-server.lifecycle';
+
+/**
+ * Wires the BullMQ collectors (#355). The queues are provided by the @Global
+ * QueueModule (which exports BullModule), so no re-registration is needed here.
+ * The HTTP histogram, DB-query and collab-store collectors live in module-level
+ * singletons (metrics.registry) and are wired directly at their call sites.
+ * MetricsServerLifecycle closes the scrape server on shutdown.
+ */
+@Module({
+  providers: [MetricsBullService, MetricsServerLifecycle],
+})
+export class MetricsModule {}
@@ -0,0 +1,126 @@
+import {
+  collectDefaultMetrics,
+  Histogram,
+  Gauge,
+  Registry,
+} from 'prom-client';
+import {
+  COLLAB_BUCKETS,
+  DB_BUCKETS,
+  HTTP_BUCKETS,
+  JOB_BUCKETS,
+  METRIC_BULLMQ_JOB_DURATION,
+  METRIC_BULLMQ_QUEUE_DEPTH,
+  METRIC_COLLAB_STORE_DURATION,
+  METRIC_DB_QUERY_DURATION,
+  METRIC_HTTP_REQUEST_DURATION,
+} from './metrics.constants';
+
+/**
+ * Process-wide perf-metrics registry (#355).
+ *
+ * This is a plain module singleton (NOT a Nest provider) because the collectors
+ * are cross-cutting: the Kysely `log` callback (built in a DI factory), the
+ * Fastify onResponse hook (main.ts, before the Nest container hands out
+ * providers) and the collab persistence extension all need the SAME instruments
+ * without threading DI through them.
+ *
+ * HARD CONTRACT: when `METRICS_PORT` is unset the whole subsystem is OFF — the
+ * registry is never created, `collectDefaultMetrics` never runs, and every
+ * observe/set helper is a cheap no-op. Nothing is exposed on :3000.
+ */
+
+// Decided once at process start. Deliberately read here (not via
+// EnvironmentService) so the toggle is identical for the DI and non-DI callers.
+const enabled = Boolean(process.env.METRICS_PORT);
+
+let registry: Registry | null = null;
+let httpHist: Histogram<'method' | 'route' | 'status'> | null = null;
+let dbHist: Histogram<'op'> | null = null;
+let queueDepthGauge: Gauge<'queue'> | null = null;
+let jobHist: Histogram<'queue'> | null = null;
+let collabHist: Histogram | null = null;
+
+function init(): void {
+  if (registry || !enabled) return;
+
+  registry = new Registry();
+
+  // Node/runtime metrics: gives nodejs_eventloop_lag_p99_seconds, GC, heap, etc.
+  collectDefaultMetrics({ register: registry });
+
+  httpHist = new Histogram({
+    name: METRIC_HTTP_REQUEST_DURATION,
+    help: 'HTTP request duration in seconds, by method, route template and status',
+    labelNames: ['method', 'route', 'status'],
+    buckets: HTTP_BUCKETS,
+    registers: [registry],
+  });
+
+  dbHist = new Histogram({
+    name: METRIC_DB_QUERY_DURATION,
+    help: 'Database query duration in seconds, by leading SQL keyword',
+    labelNames: ['op'],
+    buckets: DB_BUCKETS,
+    registers: [registry],
+  });
+
+  queueDepthGauge = new Gauge({
+    name: METRIC_BULLMQ_QUEUE_DEPTH,
+    help: 'Number of not-yet-finished BullMQ jobs per queue',
+    labelNames: ['queue'],
+    registers: [registry],
+  });
+
+  jobHist = new Histogram({
+    name: METRIC_BULLMQ_JOB_DURATION,
+    help: 'BullMQ job processing duration in seconds, per queue',
+    labelNames: ['queue'],
+    buckets: JOB_BUCKETS,
+    registers: [registry],
+  });
+
+  collabHist = new Histogram({
+    name: METRIC_COLLAB_STORE_DURATION,
+    help: 'Collaboration onStoreDocument duration in seconds',
+    buckets: COLLAB_BUCKETS,
+    registers: [registry],
+  });
+}
+
+// Runs once when this module is first imported. Safe to call again (idempotent).
+init();
+
+export function isMetricsEnabled(): boolean {
+  return enabled;
+}
+
+/** The prom-client registry, or null when metrics are disabled. */
+export function getMetricsRegistry(): Registry | null {
+  return registry;
+}
+
+export function observeHttp(
+  method: string,
+  route: string,
+  status: number,
+  seconds: number,
+): void {
+  httpHist?.observe({ method, route, status }, seconds);
+}
+
+export function observeDbQuery(op: string, seconds: number): void {
+  dbHist?.observe({ op }, seconds);
+}
+
+export function setQueueDepth(queue: string, depth: number): void {
+  queueDepthGauge?.set({ queue }, depth);
+}
+
+export function observeJobDuration(queue: string, seconds: number): void {
+  jobHist?.observe({ queue }, seconds);
+}
+
+export function observeCollabStore(seconds: number): void {
+  collabHist?.observe(seconds);
+}
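Review note: the "null instrument" pattern used by the observe/set helpers can be shown without prom-client. This is a sketch with hypothetical names (`HistogramLike`, `RecordingHistogram`, `createObserver`); it only demonstrates that optional chaining on a null instrument turns each helper into a no-op when the subsystem is off.

```typescript
interface HistogramLike {
  observe(labels: Record<string, string>, value: number): void;
}

// Test double that records what a real histogram would have observed.
class RecordingHistogram implements HistogramLike {
  readonly samples: Array<{ labels: Record<string, string>; value: number }> = [];

  observe(labels: Record<string, string>, value: number): void {
    this.samples.push({ labels, value });
  }
}

// Build the instrument once, or leave it null when metrics are disabled.
function createObserver(enabled: boolean): {
  hist: RecordingHistogram | null;
  observeDbQuery(op: string, seconds: number): void;
} {
  const hist = enabled ? new RecordingHistogram() : null;
  return {
    hist,
    observeDbQuery(op: string, seconds: number): void {
      // No instrument means no work: callers never need their own guard.
      hist?.observe({ op }, seconds);
    },
  };
}
```

Call sites such as the Kysely `log` callback can then call the helper unconditionally, and the disabled path costs one null check.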
@@ -0,0 +1,86 @@
+import { createServer, Server } from 'node:http';
+import { Logger } from '@nestjs/common';
+import { getMetricsRegistry, isMetricsEnabled } from './metrics.registry';
+
+let metricsServer: Server | null = null;
+
+/**
+ * Start the Prometheus scrape endpoint on a SEPARATE port, taken from
+ * `METRICS_PORT`. There is NO default port: when `METRICS_PORT` is unset the
+ * whole metrics subsystem is OFF and this returns null. This is a bare node:http
+ * server, NOT part of the Fastify app, so `/metrics` never exists on the public
+ * :3000 listener.
+ *
+ * Returns the http.Server (so callers can close it on shutdown) or null when
+ * metrics are disabled. The reference is also kept module-side so the Nest
+ * lifecycle (see MetricsModule) can close it on application shutdown without
+ * threading the handle back through the non-DI bootstrap.
+ */
+export function startMetricsServer(): Server | null {
+  if (!isMetricsEnabled()) return null;
+
+  const logger = new Logger('MetricsServer');
+  const register = getMetricsRegistry();
+  if (!register) return null;
+
+  const port = Number(process.env.METRICS_PORT);
+  if (!Number.isInteger(port) || port <= 0) {
+    logger.warn(
+      `Invalid METRICS_PORT="${process.env.METRICS_PORT}", metrics endpoint not started`,
+    );
+    return null;
+  }
+
+  const server = createServer(async (req, res) => {
+    if (req.method === 'GET' && req.url === '/metrics') {
+      try {
+        const body = await register.metrics();
+        res.setHeader('Content-Type', register.contentType);
+        res.statusCode = 200;
+        res.end(body);
+      } catch (err) {
+        res.statusCode = 500;
+        res.end(String((err as Error)?.message ?? 'error'));
+      }
+      return;
+    }
+    res.statusCode = 404;
+    res.end();
+  });
+
+  // Bind on all interfaces: the scraper (VictoriaMetrics) reaches this from
+  // another container as docmost:9464. The port is not published to the host.
+  server.listen(port, '0.0.0.0', () => {
+    logger.log(`Metrics endpoint listening on :${port}/metrics`);
+  });
+
+  server.on('error', (err) => {
+    logger.error(`Metrics server error: ${err?.message}`);
+  });
+
+  metricsServer = server;
+  return server;
+}
+
+/**
+ * Close the metrics scrape server if one is running. Idempotent and safe to call
+ * when metrics are disabled (no server was ever started). Wired into Nest's
+ * shutdown lifecycle so the listener is not left dangling on shutdown.
+ */
+export function closeMetricsServer(): Promise<void> {
+  const server = metricsServer;
+  metricsServer = null;
+  if (!server) return Promise.resolve();
+  return new Promise((resolve) => {
+    server.close(() => resolve());
+    // server.close() stops accepting NEW connections but its callback does not
+    // fire until existing keep-alive sockets drain. The scraper (VictoriaMetrics/
+    // vmagent) holds an idle HTTP keep-alive socket, so without this the callback
+    // — and thus shutdown — would hang until the scraper disconnects or the
+    // orchestrator escalates to SIGKILL on the kill-grace window. Force-close idle
+    // keep-alive sockets so close() completes immediately, and unref so this
+    // server never keeps the event loop alive on its own.
+    server.closeIdleConnections();
+    server.unref();
+  });
+}
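Review note: the `METRICS_PORT` validation in `startMetricsServer` could be pulled into a pure helper to make it unit-testable. The sketch below is hypothetical (`parseMetricsPort` is not in the PR) and deviates from the diff in two labelled ways: it adds an upper bound of 65535, which the original check does not have, and it returns null for both "unset" and "invalid" where the original logs a warning only for the invalid case.

```typescript
// Parse a METRICS_PORT value. Returns the port, or null when the value is
// unset, empty, non-numeric, fractional, or outside 1..65535.
function parseMetricsPort(raw: string | undefined): number | null {
  if (!raw) return null; // unset or empty: metrics subsystem stays off
  const port = Number(raw);
  // Assumption of this sketch: reject ports above the TCP maximum as well.
  if (!Number.isInteger(port) || port <= 0 || port > 65535) return null;
  return port;
}
```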
@@ -0,0 +1,70 @@
+import { FastifyRequest } from 'fastify';
+import { resolveRouteLabel } from './http-metrics.hook';
+import { firstSqlToken, isStreamingResponse } from './metrics.constants';
+
+describe('resolveRouteLabel (histogram route label)', () => {
+  it('uses the ROUTE TEMPLATE, never the raw URL', () => {
+    // routeOptions.url is the matched template; url is the raw path with the id.
+    const req = {
+      url: '/api/pages/abc-123-def',
+      routeOptions: { url: '/api/pages/:id' },
+    } as unknown as FastifyRequest;
+    expect(resolveRouteLabel(req)).toBe('/api/pages/:id');
+    expect(resolveRouteLabel(req)).not.toContain('abc-123-def');
+  });
+
+  it('falls back to "unknown" on a 404 (no matched route template)', () => {
+    const req = {
+      url: '/totally/unmatched/path',
+      routeOptions: {},
+    } as unknown as FastifyRequest;
+    expect(resolveRouteLabel(req)).toBe('unknown');
+  });
+
+  it('falls back to "unknown" when routeOptions is missing', () => {
+    const req = { url: '/x' } as unknown as FastifyRequest;
+    expect(resolveRouteLabel(req)).toBe('unknown');
+  });
+});
+
+describe('isStreamingResponse (SSE exclusion)', () => {
+  it('excludes text/event-stream responses by content-type', () => {
+    expect(isStreamingResponse('text/event-stream', '/api/ai-chat/stream')).toBe(
+      true,
+    );
+    expect(isStreamingResponse('text/event-stream; charset=utf-8', '/x')).toBe(
+      true,
+    );
+  });
+
+  it('excludes known /stream routes by suffix as a fallback', () => {
+    expect(isStreamingResponse('application/json', '/api/ai-chat/stream')).toBe(
+      true,
+    );
+    expect(isStreamingResponse(undefined, '/api/shares/ai/stream')).toBe(true);
+  });
+
+  it('does not exclude ordinary JSON responses', () => {
+    expect(isStreamingResponse('application/json', '/api/pages/:id')).toBe(
+      false,
+    );
+    expect(isStreamingResponse(undefined, '/api/pages/:id')).toBe(false);
+  });
+});
+
+describe('firstSqlToken (bounded db label)', () => {
+  it('returns the lower-cased leading keyword', () => {
+    expect(firstSqlToken('SELECT * FROM pages')).toBe('select');
+    expect(firstSqlToken('  insert into x values (1)')).toBe('insert');
+    expect(firstSqlToken('UPDATE pages SET a=1')).toBe('update');
+    expect(firstSqlToken('delete from pages')).toBe('delete');
+    expect(firstSqlToken('(SELECT 1)')).toBe('select');
+  });
+
+  it('collapses unknown/empty queries to "other"', () => {
+    expect(firstSqlToken('')).toBe('other');
+    expect(firstSqlToken(undefined)).toBe('other');
+    expect(firstSqlToken('123 not sql')).toBe('other');
+    expect(firstSqlToken('vacuum analyze')).toBe('other');
+  });
+});
@@ -20,6 +20,10 @@ export interface IStripeSeatsSyncJob {

 export interface IPageHistoryJob {
  pageId: string;
+  // #370 — intentionality tier the worker stamps on the snapshot. All jobs on
+  // this queue are trailing idle-flush autosnapshots, so this is 'idle' (absent
+  // → treated as 'idle' by the processor).
+  kind?: 'idle';
 }

 /**
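Review note: jobs on this queue are trailing idle-flush snapshots, debounced per page with a max-wait ceiling. The real `computeHistoryJob` is not shown in this part of the diff, so the following is only an assumed shape: `computeIdleDelay` and its parameters are hypothetical, and the rule "wait one idle interval, but never past `burstStart + maxWait`" is inferred from the commit messages.

```typescript
// Delay (ms) before the idle snapshot job should fire.
//   nowMs        current time
//   burstStartMs when the current editing burst began
//   intervalMs   idle interval to wait after the latest store
//   maxWaitMs    ceiling measured from the start of the burst
function computeIdleDelay(
  nowMs: number,
  burstStartMs: number,
  intervalMs: number,
  maxWaitMs: number,
): number {
  const ceilingAt = burstStartMs + maxWaitMs;
  // Time left until the ceiling; zero once the burst has exceeded it.
  const remaining = Math.max(0, ceilingAt - nowMs);
  // Trailing debounce, clamped so continuous editing cannot starve the job.
  return Math.min(intervalMs, remaining);
}
```

Under this model a stale burst marker yields a delay of 0 on every store, which is the per-store bloat that the round-1 F2 fix describes; resetting the marker with the source-specific max-wait avoids it.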
Author	SHA1	Message	Date
agent_coder	c1b2210a4e	fix(#370 ): thread trx into addPageWatchers (F7 self-deadlock) + restore contributors on commit-failure (F8) + assert the lock (F9) (review round 2) The round-1 F3 fix (wrapping the processor's find+save in a locked tx) itself introduced two regressions: F7 [CRITICAL] addPageWatchers ran WITHOUT trx inside the tx holding FOR UPDATE on pages[pageId]. The watcher insert's FK check takes FOR KEY SHARE on the same row, but on a DIFFERENT pool connection — a true self-deadlock (our tx connection sits idle-in-transaction awaiting the JS await, the insert connection blocks on the lock). Now passes trx (addPageWatchers already accepts it and routes it through insertMany), so the FK lock is taken on the connection that already holds FOR UPDATE — no self-conflict. F8 [WARNING] popContributors is a destructive Redis SPOP; the inner catch only restores on a throw INSIDE the callback. A COMMIT failure throws OUTSIDE it, rolling the snapshot back while the pop is gone → a retry writes an unattributed version. Now tracks the popped set and restores it in an outer catch (idempotent SADD), leaving BullMQ to retry with attribution intact. F9 [WARNING] The spec asserted saveHistory args with a loosened objectContaining that stopped verifying trx, and never pinned withLock/trx on findById or the trx on addPageWatchers — which is exactly why F7 slipped. Restored the exact saveHistory(trx) assertion and added findById({withLock,trx}) + addPageWatchers trx assertions (the latter would have caught F7), plus a commit-failure test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 05:50:19 +03:00
agent_coder	b1e5193b37	fix(#370 ): ES2021-safe spec, source-specific idle ceiling, processor lock, tested index mapping (review round 1) F1 [BLOCKER] persistence-store.spec used Array.prototype.at(-1) (ES2022) but the server targets ES2021, so server tsc failed (TS2550) and ts-jest could not compile the suite — 22 core manual-save/idle/boundary tests silently did not run in CI. Replaced with [length - 1] index access. F2 [WARNING] The idle burst-reset used a hardcoded IDLE_MAX_WAIT_USER for both tiers, but computeHistoryJob's ceiling is source-specific. On a continuously agent-edited page the burst marker stayed stale for 5..10m, forcing delay=0 on every store and writing one idle row per store — the exact per-store bloat the debounce prevents. The reset now uses the same source-specific max-wait. F3 [WARNING] The processor did an unlocked findPageLastHistory -> saveHistory, which TOCTOU-races a concurrent manual-save (that runs under a page-row lock), producing two page_history rows with identical content (one idle, one manual) and defeating promote-not-dup. The snapshot decision is now wrapped in executeTx with the same page-row lock, so the second writer observes the first's committed row and the isDeepStrictEqual gate collapses the duplicate. F4 [WARNING] The risky client filtered-index -> full-list mapping had no tests. Extracted it to a pure resolvePrevSnapshotId(fullItems, id) helper (diff/restore baseline against the true previous snapshot in the FULL list, never the previous visible version) and unit-tested it; removed the now-vestigial index threading. F5/F6 [low] Renamed the misleading ceiling test + fixed its comment; added a CHANGELOG entry for the user-facing versioning feature. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 05:31:26 +03:00
agent_coder	1542c99979	feat(#370): page-history intentionality tiers: kind column + intentional/idle/boundary triggers (PR-1 core) PR-1 'core' of #370: introduces page_history.kind ('manual'|'agent'|'idle'|'boundary'; legacy null = autosave) and rebuilds the snapshot triggers around a three-tier intentionality model. Draft durability (pages/ydoc hocuspocus autosave) is unchanged; only the frequency and labelling of history points change. - Migration 20260705T120000: page_history.kind nullable varchar(20), no default. - Manual Save: one stateless 'save-version' path for human AND agent; kind is derived SERVER-SIDE from the signed context.actor (never the payload), readOnly connections rejected, the fresh ydoc runs through the existing store path (no REST race), then broadcasts version.saved. - Idle-flush: trailing debounce (one BullMQ job per page, remove-then-readd) with IDLE_INTERVAL_USER=60m / AGENT=15m AND a max-wait ceiling (IDLE_MAX_WAIT_USER=10m / AGENT=5m) so a continuous editing session can't starve the autosnapshot (review round-1 WARNING). - Boundary: generalized from the user→agent special-case to ANY lastUpdatedSource transition (user↔agent↔git), same isDeepStrictEqual gate, so git-sync is covered for free. - Removed the agent delay=0 fast path and the old HISTORY_FAST_* constants; the agent joins the common idle pipeline. - Promote-not-dup: a manual save on unchanged content promotes the latest autosave's kind in place (or no-ops if already manual) instead of duplicating a heavy content row. - Client: mod+S hotkey + menu button (hidden when readOnly), history-panel kind badges, dimmed autosaves, a 'versions only' filter (indices map to the full list so diff/restore still target the true previous snapshot), live refresh on version.saved. 
Internal review: APPROVE-with-suggestions; the round-1 WARNING (idle starvation) is fixed here via the max-wait ceiling, and the generalized-boundary + ceiling behaviours are pinned with new tests (115 collab/repo specs green, server tsc 0). Deferred to later PRs: shares.published_mode (PR-2), the save_page_version MCP tool + role prompts (PR-3), actor='git' wiring into #359 (PR-4). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 05:00:12 +03:00
agent_vscode	e89ac627dd	fix(migrations): rename ai-chat-runs migration to post-merge timestamp 20260627T130000-ai-chat-runs sorted before the already-executed 20260702T120000-ai-chat-page-snapshot, so Kysely's strict ordering check ("corrupted migrations") crash-looped the server on startup. - rename 20260627T130000-ai-chat-runs.ts -> 20260704T130000-ai-chat-runs.ts - update the mirror comment in database/types/db.d.ts	2026-07-05 00:59:05 +03:00
vvzvlad	f665f6fdd2	Merge pull request 'feat(ai-chat): autonomous agent runs — phase 1: durable detached runs (#184 )' (#234 ) from feat/184-autonomous-agent-runs into develop Reviewed-on: #234	2026-07-05 00:40:26 +03:00
vvzvlad	7af85b476e	Merge pull request 'feat(observability): dev-side perf metrics: /metrics :9464 + client vitals (#355)' (#358) from feat/355-perf-metrics into develop Reviewed-on: #358	2026-07-05 00:31:04 +03:00
agent_coder	5d8364bb5f	fix(#355 review round-2 F9-F11): register-gate test + shutdown idle-close + DB-path metrics gate - F10 [stability]: closeMetricsServer() now calls server.closeIdleConnections() + server.unref() after server.close(). server.close()'s callback doesn't fire until keep-alive sockets drain, and the scraper (VictoriaMetrics/vmagent) holds an idle keep-alive socket — so onModuleDestroy's awaited close would hang until the scraper disconnects or the orchestrator SIGKILLs on the kill-grace window. closeIdleConnections() drops idle keep-alive sockets so shutdown completes immediately (Node 22, per the Dockerfile base). - F9 [test]: client-telemetry.module.spec.ts pins the E1=B register() gate — the core of the "public endpoint OFF by default" decision: flag unset / any non- "true" value ("false"/""/"0"/…) → empty controllers+providers (route absent); "true"/"TRUE" → registers VitalsController + VitalsService. A flag-inversion or truthiness regression that reopened the anonymous disk-fill surface now fails. - F11 [regression/perf]: the db_query_duration_seconds token work (firstSqlToken regex + Set lookup) is now gated on isMetricsEnabled() in database.module.ts, so a non-metrics deployment pays NOTHING per query (previously observeDbQuery no-op'd but the token was still computed on every query). Also hoisted the 13-element known-token Set to a module const (KNOWN_SQL_TOKENS) so it's built once, not per query. Gate: server tsc 0; metrics + vitals + client-telemetry suites pass (incl. the new register-gate test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 00:20:26 +03:00
agent_coder	d3209b5aab	fix(#355 review E1=B + F1-F8): gate client telemetry OFF by default + throttler/lifecycle/overflow fixes Maintainer resolved E1 as variant B: the public vitals sink + client collection must be OFF by default (else client_metrics grows unbounded on a self-host deploy with no external pruner, via an unauthenticated public endpoint). - F1: new operator flag CLIENT_TELEMETRY_ENABLED (default OFF), SEPARATE from METRICS_PORT (Grafana reads the table directly, independent of the scrape port). ClientTelemetryModule.register() provides VitalsController ONLY when the flag is true (route absent otherwise); the flag reaches the client via window.CONFIG (config.ts isClientTelemetryEnabled), and initVitals() early-returns when off. - F2/F3 [throttler]: this repo's ThrottlerGuard applies EVERY named throttler to every guarded route unless skipped. The new VITALS bucket therefore (a) newly bound collab-token → 429 behind shared/NAT IPs, and (b) the vitals route didn't skip the stricter public-share-ai (5/min) bucket → effective 5/min not 120. Fix (additive, global config unchanged): vitals.controller @SkipThrottle the other buckets + @Throttle VITALS 120/min; collab-token adds VITALS_THROTTLER to its existing @SkipThrottle (restoring its prior effectively-unthrottled state). - F4: metrics node:http server is closed on shutdown (MetricsServerLifecycle OnModuleDestroy → closeMetricsServer(), fired by enableShutdownHooks). - F5: docSize outside [0, int4-max] drops to null (keeping the event) instead of overflowing int4 and failing the WHOLE batch insert (+ 2 tests). - F6: .env.example documents METRICS_PORT (no default — unset = subsystem OFF) + CLIENT_TELEMETRY_ENABLED; fixed the inaccurate "default 9464" wording. - F7: disabled/non-sampled sessions install ZERO observers — isVitalsActive() (enabled && sampled) gates reportClientMetric AND the page-editor measurePageOpen + dispatchTransaction wrapping. 
- F8: kept db.d.ts hand-added (wontfix) — this repo HAND-CURATES db.d.ts (verified across recent fork migrations a32fba63/8c5b57eb/fdeede00); codegen would be the deviation. The ClientMetrics interface maps the migration 1:1. Gate: server tsc 0, client tsc 0, server metrics/vitals/telemetry/throttle 21 tests, client route-template 5. No new deps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 00:00:03 +03:00
agent_coder	68899a2c2e	feat(ai-chat): durable detached agent runs — phase 1 (#184/#234) Squashed for a clean rebase onto develop (was 19 commits; the reviewer approved the net diff at `fb246080`). Detaches an agent run from the HTTP request/browser window: a run is a first-class lifecycle object (ai_chat_runs), a browser disconnect no longer kills it, a concurrent-run insert-gate prevents double runs, and a reopened chat live-follows a still-running run via a polled observer merge. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-04 23:35:26 +03:00
agent_coder	b9f3de80f5	feat(observability): dev-side perf metrics — /metrics + client vitals (#355 ) The metrics INFRA is already deployed (VictoriaMetrics scraping docmost:9464, Grafana dashboards, alerts) with a target `gitmost-app` that is red because the app half didn't exist. This is that half. The contract (metric names, port, table, endpoint) is FIXED by the deployed infra and matched exactly. Server (prom-client): - A bare node:http `/metrics` server on METRICS_PORT (default 9464), SEPARATE from the Fastify :3000 listener so /metrics never exists publicly; the whole subsystem is OFF when METRICS_PORT is unset. - collectDefaultMetrics() + http_request_duration_seconds{method,route,status} via a Fastify onResponse hook using the ROUTE TEMPLATE (req.routeOptions.url, never the raw URL — bounded cardinality; 404 -> "unknown"), EXCLUDING SSE/ streaming responses (would record the connection lifetime and poison p95). - db_query_duration_seconds (Kysely log callback, labelled by the leading SQL token), bullmq_queue_depth{queue} (getJobCounts every 15s) + bullmq_job_duration_seconds{queue} (worker completed/failed), collab_store_duration_seconds (around onStoreDocument). - POST /api/telemetry/vitals — PUBLIC (sendBeacon) but IP-throttled; ~16KB body cap, <=50 events/batch, metric-name + rating whitelist, attr truncated to 120 chars, batch insert; malformed/foreign/oversized silently dropped and 200'd (no browser retry). New migration `client_metrics` (schema byte-identical to the contract, both indexes, conditional grafana_ro GRANT; no app-side retention — the maintenance container prunes >90d). Client (web-vitals): - initVitals() decides sampling ONCE per session (25%, sessionStorage) BEFORE subscribing; onINP/onLCP/onCLS/onTTFB (attribution) buffered + flushed via navigator.sendBeacon on visibilitychange:hidden and a timer (not fetch-per- metric). Custom: editor_tx_ms (dispatchTransaction sync-part timer, >8ms, with doc_size), page_open_ms, longtask_ms. 
Route labels are templates only; no titles/slugs/text. Gate: server + client tsc 0, frozen install 0 (added prom-client + web-vitals + regenerated the lock), server metrics/vitals tests 11, client route-template 5, and the migration verified valid against real Postgres. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-04 23:10:29 +03:00
agent_vscode	5336f06d10	Merge pull request 'fix(e2e)+ci: canonical '> [!info]' callout in e2e-mcp + parallel build with a gate on publish' (#356) from fix/e2e-callout-and-gate-build into develop	2026-07-04 22:42:11 +03:00
agent_vscode	4bd579f7f6	ci(develop): build image in parallel with tests, gate only the publish Two-phase scheme instead of the sequential gate: the build job runs in parallel with test/e2e jobs and only warms the buildx GHA cache (push:false, cache-to mode=max); a new publish job (needs: test, e2e-server, e2e-mcp, build) rebuilds from the warm cache (near-instant on hit, full rebuild on eviction — same as the old sequential timing) and pushes :develop. GHCR login moved to publish; build-args blocks are kept textually identical between the two jobs so the cache hits. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:41:25 +03:00
agent_vscode	7bf1c91a95	ci(develop): gate the :develop image build on e2e suites Reverse the previous policy where e2e jobs only turned the run red without blocking the image publish: build.needs now lists test, e2e-server and e2e-mcp, so a failing test of any kind stops the :develop image from being built and pushed. Stale policy comments updated accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:33:06 +03:00
agent_vscode	6c82c54470	test(mcp): expect Obsidian '> [!info]' callout export in e2e (#333 canon) PR #333 deliberately changed the canonical markdown export of callout nodes to the Obsidian-native format ('> [!type]' + blockquote body, pinned by packages/prosemirror-markdown unit tests); the importer still parses both ':::type' fences and '> [!type]'. The get_page e2e assertion was missed in that switch and still expected ':::info', failing the e2e-mcp job on develop since `4369bbc5`. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:33:06 +03:00
agent_vscode	382e5196da	Merge pull request 'fix(docker): python3/make/g++ toolchain for the re2 native build' (#353) from fix/docker-re2-toolchain into develop	2026-07-04 22:11:49 +03:00
agent_vscode	76e0c08cec	fix(docker): install python3/make/g++ toolchain for re2 native build The develop image build broke at `pnpm install --frozen-lockfile`: the new native dependency re2@1.25.0 (packages/mcp, search_in_page #330) always compiles from source under pnpm — its prebuilt-binary downloader (install-artifact-from-github) cannot identify the GitHub repo because pnpm does not populate npm_package_repository_*/npm_package_json env vars ("No github repository was identified. Building locally ..."), and node:22-slim ships no python3/make/g++ for the node-gyp fallback. - builder stage: add a cache-friendly apt layer with python3 make g++ before COPY; the stage is discarded so the toolchain may stay. - installer stage: install the toolchain, run the prod install as the node user via `su node -c`, and purge the toolchain — all in one RUN layer so the final image stays slim and node_modules ownership needs no extra chown layer; USER node is restored right after. Fixes the failed run 28715009124 (develop docker build); release.yml uses the same Dockerfile and is covered too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:09:40 +03:00