fix(mcp): write page body before title to avoid split-brain on failure (#159 )

updatePage (markdown) and updatePageJson wrote the title via REST FIRST, then the body via collab. If the body write failed (e.g. a collab persist timeout), the page was left with the NEW title over its OLD body — a split-brain the tool reported as an error but never repaired (red-team finding #10). Reorder both: write the body first, and only set the title after the body has persisted. Now a body-write failure leaves the title untouched (no split-brain). A title write failing after a successful body is rarer (REST is fast) and leaves correct content under a stale title — the strictly lesser inconsistency — which is the same trade-off the issue's "atomic, or roll back the title" intends, without the fragility of a rollback write that could itself fail. No unit test: both paths require a live collab provider and the suite has no provider mock; the change is a pure reordering. All 306 mcp tests still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(share): SEO route must not leak a restricted page's title (#159 )
2026-06-25 05:19:32 +03:00 · 2026-06-25 05:19:32 +03:00 · 2026-06-25 05:19:32 +03:00 · 2026-06-25 05:19:32 +03:00 · 2026-06-25 05:17:56 +03:00 · 2026-06-25 05:17:56 +03:00
7 changed files with 381 additions and 14 deletions
--- a/.env.example
+++ b/.env.example
@@ -149,6 +149,19 @@ MCP_DOCMOST_PASSWORD=
 # your egress drops idle connections faster than ~10s. Default 10000 (10 s).
 # AI_STREAM_KEEPALIVE_MS=10000

+# Silence timeout (ms) for EXTERNAL-MCP transport ONLY (not the chat provider).
+# Tighter than AI_STREAM_TIMEOUT_MS so a byte-silent/hung MCP server is broken in
+# ~5 min instead of 15. Note it also cuts a legitimately long but byte-silent
+# single tool call (a slow crawl that emits nothing until done) and an SSE
+# transport idling >5 min BETWEEN tool calls. Default 300000 (5 min).
+# AI_MCP_STREAM_TIMEOUT_MS=300000
+
+# Total wall-clock cap (ms) for ONE external MCP tool call (app-level, not
+# transport). Aborts a tool that keeps the socket warm (SSE heartbeats / trickle)
+# but never returns a result — which the silence timeout above never breaks.
+# Default 900000 (15 min).
+# AI_MCP_CALL_TIMEOUT_MS=900000
+
 # --- Anonymous public-share AI assistant ---
 # Opt-in per workspace (AI settings -> "public share assistant"; off by default).
 # When enabled, anonymous visitors of a published share can ask an AI about that
--- a/apps/server/package.json
+++ b/apps/server/package.json
@@ -11,7 +11,7 @@
    "start": "cross-env NODE_ENV=development nest start",
    "start:dev": "cross-env NODE_ENV=development nest start --watch",
    "start:debug": "cross-env NODE_ENV=development nest start --debug --watch",
-    "start:prod": "cross-env NODE_ENV=production node dist/main",
+    "start:prod": "cross-env NODE_ENV=production node --heapsnapshot-near-heap-limit=2 dist/main",
    "collab:prod": "cross-env NODE_ENV=production node dist/collaboration/server/collab-main",
    "collab:dev": "cross-env NODE_ENV=development node dist/collaboration/server/collab-main",
    "email:dev": "email dev -p 5019 -d ./src/integrations/transactional/emails",
--- a/apps/server/src/core/ai-chat/ai-chat.service.ts
+++ b/apps/server/src/core/ai-chat/ai-chat.service.ts
@@ -575,6 +575,19 @@ export class AiChatService {
        },
      });

+      // Drain the stream independently of the client socket so the turn always
+      // runs to completion (or to its abort) and the terminal callbacks
+      // (onFinish/onError/onAbort) fire — releasing the per-turn object graph
+      // (history, the per-request toolset closures, captured steps, SDK buffers)
+      // and closing leased MCP clients. WITHOUT this, a client disconnect leaves
+      // the pipe's dead socket as the only reader; backpressure stalls the stream,
+      // the callbacks never run, and every dropped turn stays rooted in memory —
+      // the heap-OOM leak. consumeStream removes that backpressure (AI SDK v6
+      // "Handling client disconnects"). NOT awaited (fire-and-forget); the stream
+      // errors are already logged by the streamText `onError` callback above, so
+      // swallow here to avoid an unhandledRejection.
+      void result.consumeStream({ onError: () => undefined });
+
      // Stream the UI-message protocol straight to the hijacked Node response.
      // Without onError the AI SDK masks the cause ('An error occurred.') and the
      // UI shows a generic failure. Surface the real provider message instead.
--- a/apps/server/src/core/ai-chat/external-mcp/mcp-call-timeout.spec.ts
+++ b/apps/server/src/core/ai-chat/external-mcp/mcp-call-timeout.spec.ts
@@ -0,0 +1,205 @@
+import { type Tool, type ToolCallOptions } from 'ai';
+import {
+  wrapToolWithCallTimeout,
+  wrapToolsWithCallTimeout,
+} from './mcp-clients.service';
+import {
+  mcpStreamTimeoutMs,
+  mcpCallTimeoutMs,
+} from '../../../integrations/ai/ai-streaming-fetch';
+
+/**
+ * Per-call total-timeout guard for external MCP tools (mcp-clients.service).
+ *
+ * `@ai-sdk/mcp`'s tool execute has NO built-in per-call timeout — a tool that
+ * keeps the connection warm but never returns is otherwise unbounded. The
+ * wrapper attaches a fresh AbortController + timer per CALL and composes it with
+ * the turn's abortSignal via AbortSignal.any, so EITHER the per-call timeout OR a
+ * client disconnect aborts the in-flight call.
+ *
+ * Fake timers prove the timeout fires WITHOUT real waiting; no leaked timer keeps
+ * the process alive after a fast resolve.
+ */
+const CALL_TIMEOUT_MS = 900_000;
+
+/** Build a Tool around an `execute` impl, mirroring the SDK's minimal shape. */
+function toolWith(
+  execute: (args: unknown, options: ToolCallOptions) => unknown,
+): Tool {
+  return { description: 'x', inputSchema: undefined, execute } as unknown as Tool;
+}
+
+/** Invoke a (possibly wrapped) tool's execute with an optional turn signal. */
+function callExecute(
+  tool: Tool,
+  args: unknown,
+  abortSignal?: AbortSignal,
+): unknown {
+  const execute = tool.execute as (
+    args: unknown,
+    options: ToolCallOptions,
+  ) => unknown;
+  return execute(args, { abortSignal } as ToolCallOptions);
+}
+
+describe('wrapToolWithCallTimeout', () => {
+  beforeEach(() => jest.useFakeTimers());
+  afterEach(() => {
+    jest.clearAllTimers();
+    jest.useRealTimers();
+  });
+
+  it('aborts a tool that only rejects when its abortSignal fires, after ms elapses', async () => {
+    // The tool resolves NEVER on its own — it only settles when the abortSignal
+    // it is handed aborts. So a resolution proves the per-call timer fired and
+    // aborted the call (not the tool finishing by itself).
+    let received: AbortSignal | undefined;
+    const tool = toolWith((_args, options) => {
+      received = options.abortSignal;
+      return new Promise((_resolve, reject) => {
+        options.abortSignal?.addEventListener('abort', () => {
+          reject(options.abortSignal?.reason ?? new Error('aborted'));
+        });
+      });
+    });
+
+    const wrapped = wrapToolWithCallTimeout(tool, CALL_TIMEOUT_MS);
+    const promise = callExecute(wrapped, { q: 'x' }) as Promise<unknown>;
+    // Attach the rejection handler synchronously so advancing timers cannot mark
+    // it an unhandled rejection.
+    const settled = promise.then(
+      () => ({ ok: true as const }),
+      (err: unknown) => ({ ok: false as const, err }),
+    );
+
+    // Nothing fired yet.
+    jest.advanceTimersByTime(CALL_TIMEOUT_MS - 1);
+    // Past the cap -> the per-call timer aborts the composed signal.
+    jest.advanceTimersByTime(2);
+
+    const result = await settled;
+    expect(result.ok).toBe(false);
+    expect(received).toBeInstanceOf(AbortSignal);
+    // The abort reason / rejection mentions the timeout.
+    const message =
+      (result as { err: unknown }).err instanceof Error
+        ? ((result as { err: Error }).err.message)
+        : String((result as { err: unknown }).err);
+    expect(message).toMatch(/timed out after 900000ms/);
+  });
+
+  it('aborts a REAL-client-style tool that never settles and ignores abort (race fix)', async () => {
+    // Models the ACTUAL @ai-sdk/mcp semantics: its in-flight promise does NOT
+    // reject on abort (it only checks the signal when a response arrives), so a
+    // warm-but-stuck call NEVER settles on its own and does NOT listen to the
+    // abort signal. The wrapper must still reject after `ms` via the race — an
+    // implementation that merely `await original(...)` would hang here forever.
+    // This test FAILS against the old await-only code and PASSES with the race.
+    const tool = toolWith(() => new Promise(() => {})); // never settles, no abort
+    const wrapped = wrapToolWithCallTimeout(tool, CALL_TIMEOUT_MS);
+    const promise = callExecute(wrapped, { q: 'x' }) as Promise<unknown>;
+    // Assert the rejection without hanging: drive fake time async so the timer's
+    // abort -> race rejection microtasks flush, then await the rejection.
+    const expectation = expect(promise).rejects.toThrow(/timed out after 900000ms/);
+    await jest.advanceTimersByTimeAsync(CALL_TIMEOUT_MS + 1);
+    await expectation;
+  });
+
+  it('passes a fast tool through and leaks no timer (advancing later does not throw)', async () => {
+    const tool = toolWith(() => Promise.resolve('fast-result'));
+    const wrapped = wrapToolWithCallTimeout(tool, CALL_TIMEOUT_MS);
+
+    const value = await (callExecute(wrapped, {}) as Promise<unknown>);
+    expect(value).toBe('fast-result');
+
+    // The timer was cleared in the finally — advancing past the cap aborts
+    // nothing and throws nothing.
+    expect(() => jest.advanceTimersByTime(CALL_TIMEOUT_MS * 2)).not.toThrow();
+  });
+
+  it('aborts when the caller turn signal aborts before the timeout (disconnect path)', async () => {
+    // Real-client semantics: the tool never settles and does NOT listen to abort,
+    // so the wrapper must reject via the race when the caller's turn signal (a
+    // client disconnect) aborts BEFORE the per-call cap. The race propagates the
+    // caller's abort reason.
+    const tool = toolWith(() => new Promise(() => {})); // never settles, no abort
+    const wrapped = wrapToolWithCallTimeout(tool, CALL_TIMEOUT_MS);
+    const turn = new AbortController();
+    const promise = callExecute(wrapped, {}, turn.signal) as Promise<unknown>;
+    const settled = promise.then(
+      () => ({ ok: true as const }),
+      (err: unknown) => ({ ok: false as const, err }),
+    );
+
+    // Disconnect well before the cap; the per-call timer never fires here.
+    turn.abort(new Error('client disconnected'));
+    const result = await settled;
+    expect(result.ok).toBe(false);
+    const message =
+      (result as { err: unknown }).err instanceof Error
+        ? (result as { err: Error }).err.message
+        : String((result as { err: unknown }).err);
+    // The caller's abort reason propagates through the race.
+    expect(message).toMatch(/client disconnected/);
+  });
+
+  it('passes a tool with no execute through unchanged', () => {
+    const noExecute = { description: 'x', inputSchema: undefined } as unknown as Tool;
+    const wrapped = wrapToolWithCallTimeout(noExecute, CALL_TIMEOUT_MS);
+    // Same object back, execute still absent.
+    expect(wrapped).toBe(noExecute);
+    expect((wrapped as { execute?: unknown }).execute).toBeUndefined();
+  });
+});
+
+describe('wrapToolsWithCallTimeout', () => {
+  beforeEach(() => jest.useFakeTimers());
+  afterEach(() => {
+    jest.clearAllTimers();
+    jest.useRealTimers();
+  });
+
+  it('wraps every tool in the map (each call gets its own guard)', async () => {
+    const tools: Record<string, Tool> = {
+      a: toolWith(() => Promise.resolve('A')),
+      b: toolWith(() => Promise.resolve('B')),
+    };
+    const out = wrapToolsWithCallTimeout(tools, CALL_TIMEOUT_MS);
+    expect(Object.keys(out)).toEqual(['a', 'b']);
+    expect(await (callExecute(out.a, {}) as Promise<unknown>)).toBe('A');
+    expect(await (callExecute(out.b, {}) as Promise<unknown>)).toBe('B');
+  });
+});
+
+describe('mcp timeout env helpers', () => {
+  const ORIG_SILENCE = process.env.AI_MCP_STREAM_TIMEOUT_MS;
+  const ORIG_CALL = process.env.AI_MCP_CALL_TIMEOUT_MS;
+  afterEach(() => {
+    if (ORIG_SILENCE === undefined) delete process.env.AI_MCP_STREAM_TIMEOUT_MS;
+    else process.env.AI_MCP_STREAM_TIMEOUT_MS = ORIG_SILENCE;
+    if (ORIG_CALL === undefined) delete process.env.AI_MCP_CALL_TIMEOUT_MS;
+    else process.env.AI_MCP_CALL_TIMEOUT_MS = ORIG_CALL;
+  });
+
+  it('mcpStreamTimeoutMs defaults to 5 min and honors a positive override', () => {
+    delete process.env.AI_MCP_STREAM_TIMEOUT_MS;
+    expect(mcpStreamTimeoutMs()).toBe(300_000);
+    process.env.AI_MCP_STREAM_TIMEOUT_MS = '60000';
+    expect(mcpStreamTimeoutMs()).toBe(60_000);
+    for (const bad of ['0', '-1', 'x', '']) {
+      process.env.AI_MCP_STREAM_TIMEOUT_MS = bad;
+      expect(mcpStreamTimeoutMs()).toBe(300_000);
+    }
+  });
+
+  it('mcpCallTimeoutMs defaults to 15 min and honors a positive override', () => {
+    delete process.env.AI_MCP_CALL_TIMEOUT_MS;
+    expect(mcpCallTimeoutMs()).toBe(900_000);
+    process.env.AI_MCP_CALL_TIMEOUT_MS = '120000';
+    expect(mcpCallTimeoutMs()).toBe(120_000);
+    for (const bad of ['0', '-1', 'x', '']) {
+      process.env.AI_MCP_CALL_TIMEOUT_MS = bad;
+      expect(mcpCallTimeoutMs()).toBe(900_000);
+    }
+  });
+});
--- a/apps/server/src/core/ai-chat/external-mcp/mcp-clients.service.ts
+++ b/apps/server/src/core/ai-chat/external-mcp/mcp-clients.service.ts
@@ -1,12 +1,16 @@
 import { isIP } from 'node:net';
 import { lookup as dnsLookup, type LookupAddress } from 'node:dns';
 import { Injectable, Logger } from '@nestjs/common';
-import { type Tool } from 'ai';
+import { type Tool, type ToolCallOptions } from 'ai';
 import { createMCPClient } from '@ai-sdk/mcp';
 import { Agent, type Dispatcher } from 'undici';
 import { AiMcpServerRepo } from '@docmost/db/repos/ai-chat/ai-mcp-server.repo';
 import { AiMcpServer } from '@docmost/db/types/entity.types';
-import { streamingDispatcherOptions } from '../../../integrations/ai/ai-streaming-fetch';
+import {
+  streamingDispatcherOptions,
+  mcpStreamTimeoutMs,
+  mcpCallTimeoutMs,
+} from '../../../integrations/ai/ai-streaming-fetch';
 import { SecretBoxService } from '../../../integrations/crypto/secret-box';
 import { isUrlAllowed, isIpAllowed } from './ssrf-guard';

@@ -247,6 +251,8 @@ export class McpClientsService {
    const tools: Record<string, Tool> = {};
    const clients: McpClient[] = [];
    const outcomes: ServerOutcome[] = [];
+    // Per-call total wall-clock cap, read once for this build (env-overridable).
+    const callTimeoutMs = mcpCallTimeoutMs();
    const instructions: McpServerInstruction[] = [];

    for (const server of servers) {
@@ -257,12 +263,16 @@ export class McpClientsService {
        const allow = server.toolAllowlist;
        const picked =
          Array.isArray(allow) && allow.length > 0 ? pick(raw, allow) : raw;
+        // Bound each tool's execute with a per-call total-timeout guard before
+        // merging, so a single chatty-but-stuck call is aborted after the cap.
+        const guarded = wrapToolsWithCallTimeout(picked, callTimeoutMs);
        // Namespace each tool with the sanitized server name AND disambiguate
        // against names already merged from earlier servers, so no external
-        // tool is silently overwritten on collision.
+        // tool is silently overwritten on collision. The returned count drives
+        // whether this server's prompt guidance is included (≥1 tool merged).
        const merged = this.mergeNamespaced(
          tools,
-          picked,
+          guarded,
          server.name,
          server.id,
        );
@@ -449,17 +459,21 @@ export function validateResolvedAddresses(addrs: readonly LookupAddress[]): {
 * to an IP literal).
 */
 function buildPinnedDispatcher(): Agent {
+  // External-MCP traffic uses a DEDICATED, shorter silence timeout
+  // (`AI_MCP_STREAM_TIMEOUT_MS`, default 5 min) — deliberately tighter than the
+  // chat provider's 15-min `streamTimeoutMs()` — so a byte-silent/hung MCP
+  // upstream is broken in ~5 min instead of 15. We keep the keep-alive options
+  // from `streamingDispatcherOptions()` but OVERRIDE headers/body timeouts.
+  // Accepted trade-off: a legitimately long but byte-silent single tool call,
+  // and an SSE transport idling >5 min BETWEEN tool calls, are also cut here; the
+  // per-call total cap (wrapToolsWithCallTimeout, `AI_MCP_CALL_TIMEOUT_MS`) is the
+  // complementary guard for chatty-but-stuck calls that keep the socket warm yet
+  // never return.
+  const mcpSilenceMs = mcpStreamTimeoutMs();
  return new Agent({
-    // Raise undici's default 300s headers/body timeouts on external MCP traffic
-    // to the same generous-but-finite silence timeout the chat fetch uses (#175).
-    // A long agent turn keeps an SSE transport (e.g. crawl4ai's /mcp/sse) open
-    // across the whole turn; that connection can idle BETWEEN tool calls longer
-    // than 5 min, and undici's bodyTimeout would otherwise sever it mid-task — a
-    // tool-call failure that aborts the streamed turn and shows the user "Lost
-    // connection to the AI provider". A slow single tool call (a crawl) can
-    // likewise exceed headersTimeout. The timeout stays FINITE so a genuinely
-    // hung server is still broken eventually.
    ...streamingDispatcherOptions(),
+    headersTimeout: mcpSilenceMs,
+    bodyTimeout: mcpSilenceMs,
    connect: {
      lookup: (hostname, _options, callback) => {
        // Always resolve ALL addresses ourselves; do not trust the caller's
@@ -630,6 +644,78 @@ function disambiguate(
  return capName(`${name.slice(0, MAX_TOOL_NAME_LENGTH - 14)}_${Date.now()}`);
 }

+/**
+ * Wrap every tool's execute with a per-call total-timeout guard so a single
+ * external MCP tool call that keeps the connection warm but never returns is
+ * aborted after `ms` wall-clock (complements the transport silence timeout).
+ */
+export function wrapToolsWithCallTimeout(
+  tools: Record<string, Tool>,
+  ms: number,
+): Record<string, Tool> {
+  const out: Record<string, Tool> = {};
+  for (const [name, t] of Object.entries(tools)) {
+    out[name] = wrapToolWithCallTimeout(t, ms);
+  }
+  return out;
+}
+
+/**
+ * Per-call total-timeout wrapper for one MCP tool. A fresh AbortController +
+ * timer bounds the call; it is composed with the turn's abortSignal via
+ * AbortSignal.any so EITHER the per-call timeout OR a client disconnect aborts
+ * the call. We RACE the call against the composed abort signal rather than just
+ * awaiting it, because @ai-sdk/mcp does NOT settle its in-flight promise on abort
+ * (verified in @ai-sdk/mcp@1.0.52: request() only does throwIfAborted() once
+ * before send and only re-checks the signal inside the response-message handler,
+ * which runs ONLY when a response arrives). So for a warm-but-stuck call awaiting
+ * `original` alone would hang forever even after the timer aborts.
+ */
+export function wrapToolWithCallTimeout(tool: Tool, ms: number): Tool {
+  const original = tool.execute;
+  if (typeof original !== 'function') return tool;
+  const execute = async (args: unknown, options: ToolCallOptions) => {
+    const controller = new AbortController();
+    const timer = setTimeout(() => {
+      controller.abort(new Error(`MCP tool call timed out after ${ms}ms`));
+    }, ms);
+    timer.unref?.();
+    const abortSignal = options?.abortSignal
+      ? AbortSignal.any([options.abortSignal, controller.signal])
+      : controller.signal;
+    // Reject as soon as the composed signal fires, independent of whether
+    // `original` ever settles. The losing `original` promise is left pending; it
+    // is cleaned up when the client is closed at turn end, and Promise.race
+    // attaches a rejection handler to BOTH inputs so a late rejection of either
+    // is never an unhandled rejection (do NOT add an extra .catch — it could
+    // swallow the real result and would break the race semantics).
+    const aborted = new Promise<never>((_, reject) => {
+      const fail = () => reject(abortReason(abortSignal));
+      if (abortSignal.aborted) fail();
+      else abortSignal.addEventListener('abort', fail, { once: true });
+    });
+    try {
+      return await Promise.race([
+        original(args, { ...options, abortSignal }),
+        aborted,
+      ]);
+    } finally {
+      clearTimeout(timer);
+    }
+  };
+  // `Tool` is a union whose `execute` overloads conflict; cast narrowly so the
+  // wrapped tool keeps every other field while swapping only `execute`.
+  return { ...tool, execute } as unknown as Tool;
+}
+
+/** The signal's reason as an Error (informative thrown value on abort/timeout). */
+function abortReason(signal: AbortSignal): Error {
+  const r = signal.reason;
+  return r instanceof Error
+    ? r
+    : new Error(typeof r === 'string' ? r : 'MCP tool call aborted');
+}
+
 /** Reject a promise after `ms`, so a hung connect/tools() never stalls a turn. */
 function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
--- a/apps/server/src/core/ai-chat/public-share-chat.service.ts
+++ b/apps/server/src/core/ai-chat/public-share-chat.service.ts
@@ -244,6 +244,15 @@ export class PublicShareChatService {
        },
      });

+      // Drain the stream independently of the client socket so the turn always
+      // runs to completion (or to its abort) even when the anonymous client
+      // disconnects — otherwise the dead socket is the only reader, backpressure
+      // stalls the stream, and the per-turn object graph stays rooted (heap-OOM
+      // leak). consumeStream removes that backpressure (AI SDK v6 "Handling
+      // client disconnects"). Fire-and-forget; stream errors are already logged
+      // by the streamText `onError` callback above.
+      void result.consumeStream({ onError: () => undefined });
+
      // Stream the UI-message protocol straight to the hijacked Node response.
      // Surface the real provider message (AI SDK error bodies never carry the
      // API key, so this is safe; we never dump the resolved config).
--- a/apps/server/src/integrations/ai/ai-streaming-fetch.ts
+++ b/apps/server/src/integrations/ai/ai-streaming-fetch.ts
@@ -70,6 +70,47 @@ export function streamKeepAliveMs(): number {
  return positiveEnv('AI_STREAM_KEEPALIVE_MS', DEFAULT_STREAM_KEEPALIVE_MS);
 }

+/** Default SILENCE timeout for EXTERNAL-MCP transport (5 min). */
+const DEFAULT_MCP_STREAM_TIMEOUT_MS = 300_000;
+
+/** Default total wall-clock cap for ONE external MCP tool call (15 min). */
+const DEFAULT_MCP_CALL_TIMEOUT_MS = 900_000;
+
+/**
+ * SILENCE timeout (ms) for EXTERNAL-MCP transport ONLY. Override with
+ * `AI_MCP_STREAM_TIMEOUT_MS`; a missing/invalid/non-positive value falls back to
+ * {@link DEFAULT_MCP_STREAM_TIMEOUT_MS} (5 min).
+ *
+ * Deliberately tighter than the chat provider's {@link streamTimeoutMs} (15 min)
+ * so a byte-silent/hung MCP upstream is broken in ~5 min instead of 15. This is
+ * the undici `headersTimeout`/`bodyTimeout` for the external-MCP dispatcher only
+ * — it must NOT change the chat provider, which legitimately needs 15 min between
+ * reasoning chunks (#175).
+ *
+ * Trade-off: a legitimately long but byte-silent single tool call (a slow crawl
+ * that emits nothing until done) and an SSE transport that idles >5 min BETWEEN
+ * tool calls are also cut here. The per-call total cap ({@link mcpCallTimeoutMs},
+ * applied in mcp-clients.service) is the complementary guard for chatty-but-stuck
+ * calls that keep the socket warm yet never return.
+ */
+export function mcpStreamTimeoutMs(): number {
+  return positiveEnv('AI_MCP_STREAM_TIMEOUT_MS', DEFAULT_MCP_STREAM_TIMEOUT_MS);
+}
+
+/**
+ * Total wall-clock cap (ms) for ONE external MCP tool call — APP-LEVEL, not
+ * transport. Override with `AI_MCP_CALL_TIMEOUT_MS`; a missing/invalid/
+ * non-positive value falls back to {@link DEFAULT_MCP_CALL_TIMEOUT_MS} (15 min).
+ *
+ * Catches a tool that keeps the connection warm (SSE heartbeats / trickle) but
+ * never returns a result — which the transport silence timeout
+ * ({@link mcpStreamTimeoutMs}) would never break because the socket never goes
+ * byte-silent.
+ */
+export function mcpCallTimeoutMs(): number {
+  return positiveEnv('AI_MCP_CALL_TIMEOUT_MS', DEFAULT_MCP_CALL_TIMEOUT_MS);
+}
+
 /**
 * undici `Agent` options for streaming AI traffic — the (generous, finite)
 * silence timeouts plus the keep-alive recycle window. Shared by the chat
Author	SHA1	Message	Date
claude code agent 227	dd64c2ea05	fix(mcp): write page body before title to avoid split-brain on failure (#159 ) updatePage (markdown) and updatePageJson wrote the title via REST FIRST, then the body via collab. If the body write failed (e.g. a collab persist timeout), the page was left with the NEW title over its OLD body — a split-brain the tool reported as an error but never repaired (red-team finding #10). Reorder both: write the body first, and only set the title after the body has persisted. Now a body-write failure leaves the title untouched (no split-brain). A title write failing after a successful body is rarer (REST is fast) and leaves correct content under a stale title — the strictly lesser inconsistency — which is the same trade-off the issue's "atomic, or roll back the title" intends, without the fragility of a rollback write that could itself fail. No unit test: both paths require a live collab provider and the suite has no provider mock; the change is a pure reordering. All 306 mcp tests still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:19:32 +03:00
claude code agent 227	96fb737c9d	fix(share): SEO route must not leak a restricted page's title (#159 ) `ShareSeoController.getShare` resolved the inherited share with the RAW `getShareForPage`, which does NOT run the restricted-ancestor gate. So for a page shared with includeSubPages whose descendant is permission-restricted, the SEO route served that descendant's real title in <title>/og:title/twitter:title to anonymous visitors and crawlers — even though the content API returns 404 for it (red-team finding #3). Funnel the SEO path through the canonical `resolveReadableSharePage` boundary (the single place that checks `hasRestrictedAncestor`): a non-readable page now serves the plain SPA index with no meta. Also honour `isSharingAllowed` — a share whose workspace/space sharing toggle was flipped off after creation no longer leaks its title via SEO. Title comes from the server-resolved page; `buildShareMetaHtml` already emits robots=noindex when the share opted out of indexing. Tests (controller routing, fs spied at call time so bcrypt's native loader is untouched): non-readable page => plain index, no title; sharing-disabled => plain index; readable+indexing => title + og:title, no noindex; readable+no- indexing => noindex. Asserts getShareForPage is never called by the SEO path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:19:32 +03:00
claude code agent 227	66f9079996	fix(ai-chat): validate the open page server-side so the agent edits the right one (#159 ) The client sends the "current page" as { id, title } in the request body and the server echoed BOTH verbatim into the system prompt context and the getCurrentPage tool. id and title are independently attacker/desync-controllable (two tabs, stale navigation), so openPage.id could point at page B while openPage.title said "Page A" — the model then reported "updated Page A" while it actually edited page B (CASL still allowed it; the user has access). Red-team finding #4. Resolve the open page ONCE against the DB via a new `resolveOpenPageContext`: workspace-scoped lookup + access check, returning the AUTHORITATIVE { id, title } (title from the DB row, never the client) or null (fail-closed) for a missing / foreign / inaccessible page. That validated value now feeds the system prompt, the getCurrentPage tool, AND the new-chat history origin (which previously did this validation inline, for the id only — now shared, and the title is fixed too). Tests: resolveOpenPageContext covers no-id, not-found, foreign-workspace, Forbidden, non-Forbidden-fault (fail-closed), the DB-title-wins-over-client case, and null-title coercion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:19:32 +03:00
claude code agent 227	3960845eab	feat(ai-chat): per-MCP-server instructions in the agent system prompt (#180 ) Admins can now give each EXTERNAL MCP server a free-text instruction ("how/ when to use this server's tools") that the agent receives in its SYSTEM PROMPT next to the tool descriptions — porting the built-in SERVER_INSTRUCTIONS idea to admin-configured servers. Trusted, admin-authored text (like a system prompt); NON-secret, so unlike headersEnc it IS returned in views/forms. - Migration: nullable `instructions text` on ai_mcp_servers (old rows = null = no guidance). Table type + repo insert/update (blank/whitespace -> null via blankToNull). DTO `@MaxLength(4000)`. Service threads it through McpServerView/toView. - mcp-clients: `McpServerInstruction { serverName, toolPrefix, instructions }` threaded through the toolset/cache/lease. Guidance is built ONLY for a server that actually connected AND contributed >=1 callable tool (the allowlist may filter all of them out) AND has non-blank text — so a guide never appears for tools the agent cannot call. Cached with the toolset, so an edit is picked up next turn via the existing CRUD cache invalidation. - System prompt: `buildMcpToolingBlock` renders an <mcp_tooling> block INSIDE the safety sandwich (after context, before the trailing SAFETY_FRAMEWORK) so it informs tool choice but cannot override the rules; each section is headed by the server's `prefix_*` namespace. Empty/blank -> block omitted. The caller (ai-chat.service) now builds the external toolset BEFORE the prompt and passes external.instructions; client-handle lifecycle (close-once) unchanged. - Client: instructions field in types + a Textarea (autosize, maxLength 4000) in the MCP-server form with a namespace-prefix hint; i18n (en/ru). Tests across every layer (prompt block placement + both SAFETY copies; view blank->null; buildEntry includes guidance only for connected+>=1-tool+non-blank; DTO MaxLength; repo + integration round-trip; service wiring). Delegated impl reviewed (APPROVE); applied the import-type follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:19:32 +03:00
claude code agent 227	b349673a6b	ci(test): run the server integration suite against real Postgres/Redis (#159 ) The only test command in CI was `pnpm -r test` (unit `.spec.ts` on mocks). `test:int` (`.int-spec.ts`, real Postgres/Redis) ran nowhere in CI — there were no DB `services:` — so the cost-cap, FK-cascade, jsonb round-trip and real AI-apply integration tests never gated a PR, and regressions in those high-severity paths stayed green (red-team finding #7). Add `services: postgres (pgvector) + redis` and a `pnpm --filter server test:int` step. The pgvector image is required because migrations create vector columns and global-setup runs `CREATE EXTENSION vector`. Service credentials/db match the defaults in apps/server/test/integration (docmost / docmost_dev_pw, maintenance db `docmost`, redis 6379), so no TEST_*_URL overrides are needed; global-setup drops/recreates the isolated docmost_test DB and migrates it. NOTE: the workflow change itself can only be validated by an actual CI run (YAML parses locally); the int-spec suite is verified passing locally on this branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude code agent 227	0441a2ee75	fix(mcp): refuse ambiguous patch_node/delete_node on duplicated ids (#159 ) Docmost duplicates block ids on copy/paste, and copyPageContent writes the source document verbatim with the same ids. `patchNode`/`deleteNode` address a block by `attrs.id` via replaceNodeById/deleteNodeById, which act on EVERY node sharing the id — so a single patch_node/delete_node could silently replace/remove multiple unrelated blocks with no signal to the model (red-team finding #6). Guard both write paths: when more than one node matches the id, skip the write entirely (the transform returns null -> no mutation) and throw a clear "ambiguous id — N nodes share it" error so the model re-targets with a more specific anchor. Only an unambiguous single match is written; the 0-match and 1-match behavior is unchanged. The duplicate-count basis is covered by node-ops.test.mjs (replaceNodeById / deleteNodeById report count===2 for a 2-duplicate doc). The end-to-end guard is not unit-tested because patchNode/deleteNode require a live collab provider and the test suite has no provider mock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude code agent 227	3ebe24bee2	feat(footnotes): multi-backlinks — definition returns to ALL its references (#168 ) After #166 a repeated `[^a]` is one footnote (reuse): one number, one definition, N forward links. But the definition's ↩ only returned to the FIRST reference. Now a definition with N references shows ↩ a b c …, each backlink scrolling to its own occurrence (Pandoc/Wikipedia convention); a single-reference footnote keeps the plain ↩ unchanged. - editor-ext: `computeFootnoteRefCounts(doc)` (id -> occurrence count) cached alongside the number map in the numbering plugin state; `getFootnoteRefCount` getter (O(1), no per-render doc walk). `scrollToReference(id, index?)` picks the index-th `sup[data-footnote-ref][data-id]` occurrence (document order), falling back to the first. - client: FootnoteDefinitionView renders one lettered link (a, b, c, … aa …) per occurrence when refCount > 1; the chrome stays after the contentDOM so the #146 caret invariant holds. i18n keys (ru) added. Tests: computeFootnoteRefCounts + getFootnoteRefCount (reuse counts, unknown id => 0); structure test gains 3 cases (N lettered links render, click jumps to the n-th occorrence, single ref => one ↩). NOTE: the visual layout of the backlink row needs a real browser to verify (jsdom can't); the structural and behavioral contract is covered headless. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude code agent 227	60b7be3534	fix(db): jsonb double-encoding follow-ups from PR #172 review (#173 ) PR #172 fixed the jsonb double-encoding for `tool_allowlist` but the same class of bug, and the same re-derived workaround, remained elsewhere. 1. model_config (agent roles): jsonbObject still used the buggy `::jsonb` bind, so `ai_agent_roles.model_config` round-tripped as a jsonb STRING SCALAR. The read-path `typeof === 'object'` check then failed and the model override was SILENTLY dropped (role fell back to the default model). Fixed to `::text::jsonb` and added `parseModelConfig` + `normalizeRow` so every read self-heals already-corrupted rows (no migration). 2. Centralized the write workaround as `jsonbBind()` in database/utils.ts — one implementation with one explanation of the quirk — replacing the per-repo `jsonbArray` (mcp) and `jsonbObject` (roles). 3. Integration coverage (the fix is a DB round-trip a unit test cannot see; the read-side parser MASKS a write regression): new ai-mcp-server-repo.int-spec asserts `jsonb_typeof(tool_allowlist)='array'` after insert + heals a seeded string-scalar row; ai-agent-roles-repo int-spec gains the same for `model_config` (`'object'` + heal). 4. Updated the stale `ai-mcp-servers.types.ts` comment (the driver returns a JSON string for legacy rows; the repo normalizes every read). 5. Fail-open logging: a corrupt tool_allowlist degrades to "no restriction" (agent gets ALL tools) — normalizeRow now warns (server id only, never contents) so the silent widening leaves a trace. 6. Simplified parseToolAllowlist (normalize the string once, then a single array-of-strings check) — identical behaviour, all 12 cases still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude code agent 227	4d03321094	fix(mcp): replaceImage no longer yanks the cursor (#164 ) `mutateLiveContentUnlocked` — the write path used by `replaceImage` — still did the pre-#152 destructive write (delete the whole fragment + applyUpdate a fresh Y.Doc), discarding every Yjs node id. y-prosemirror anchors the editor selection to those ids, so an open editor's cursor snapped to the document end on every image swap, exactly the #152 jump that the main write path no longer causes. Switch it to the same `applyDocToFragment(ydoc, newDoc)` structural diff (updateYFragment) as the main path, so unchanged nodes keep their ids and the live cursor stays put. It runs its own atomic transact, so the old explicit transact/delete is gone; the now-unused docmostExtensions import is dropped. Regression tests (cursor-stability suite): a sibling paragraph's RelativePosition survives a top-level image src/attachmentId swap, and an image nested in a callout, matching the shapes replaceImage produces. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude code agent 227	d14ab4a012	feat(ai-chat): compact reasoning rendering — collapse blank lines (#181 ) The "Thinking" (reasoning) block rendered with large vertical gaps: models emit reasoning with a blank line (\n\n) between every list item and paragraph, which `marked` turns into loose lists (each <li> wrapped in a <p>) and separate <p> paragraphs, each carrying a margin. - Add `collapseBlankLines(text)`: collapse 2+ newlines to one, EXCEPT inside fenced code blocks (``` / ~~~) where blank lines are significant. Applied in reasoning-block.tsx before renderChatMarkdown, so loose lists become tight (no <li><p>) and paragraphs join; `breaks: true` keeps single \n as <br>, preserving line breaks. Reasoning-only — the normal answer is untouched. - Drop `white-space: pre-wrap` from `.reasoningText`: on the rendered markdown <div> it turned the newlines between block tags into visible blank lines on top of the margins. The plain-text fallback <Text> that needs pre-wrap already sets it inline. Tests: collapseBlankLines unit (collapse, fence preservation incl. tilde and unclosed fences) + rendered-HTML assertions that a blank-line-separated list becomes a tight list and still parses as a list after a paragraph. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude code agent 227	f60fa25696	fix(ai-chat): tick the live token counter between agent steps (#163 ) The header token badge (and the "Thinking… · N tokens" line) froze between agent steps and jumped in chunks instead of ticking smoothly. liveTurnTokens returned the authoritative server `usage` VERBATIM as soon as it appeared, but the server only attaches usage at a step boundary and it is cumulative over COMPLETED steps — so during the next (in-flight) step the figure stayed frozen at the previous boundary and the running text estimate was ignored. Combine both sources per component via max: always compute the running estimate (chars/≈4 over the message's reasoning/text parts, which includes the in-flight step) and take max(authoritativeBase, estimate). Between boundaries the estimate ticks the number up; at a boundary the authoritative figure snaps it exact; and because the server usage is cumulative and we only ever take the max, the counter is monotonic (never drops). Reasoning/output stay split; the #151 reasoning-only authoritative count is preserved. Backward compatible: in every existing test the estimate is <= the authoritative figure, so max returns the same value. +4 tests for the in-flight-step-exceeds- base case (output + reasoning), the authoritative-wins case, and monotonicity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 05:17:56 +03:00
claude_code	27c91e4a69	feat(ai-chat): bound external MCP tool calls with per-call timeouts External MCP tools (web search, crawl) had no per-call timeout: a hung tool call was only broken by the 15-min transport silence timeout shared with the chat provider, and a server that kept the socket warm but never returned could spin until the user cancelled. Add two independent, composing bounds for external MCP traffic (the chat provider path is unchanged): - Silence 5 min: buildPinnedDispatcher now overrides headersTimeout/ bodyTimeout with mcpStreamTimeoutMs() (AI_MCP_STREAM_TIMEOUT_MS, default 300000) on the external-MCP dispatcher only, so a byte-silent upstream is severed in ~5 min instead of 15. - Total per-call 15 min: wrapToolWithCallTimeout wraps each external tool's execute with a fresh AbortController + timer composed with the turn signal via AbortSignal.any (AI_MCP_CALL_TIMEOUT_MS, default 900000). It RACES the call against the abort signal because @ai-sdk/mcp does not settle its in-flight promise on abort, so a warm-but-stuck call would otherwise hang forever. On timeout the call surfaces as a tool-error and the agent loop recovers. Add tests (incl. a never-settling real-client-style stub) and document both env vars in .env.example.	2026-06-25 04:43:49 +03:00
claude_code	c3596dce68	Merge branch 'develop' of https://gitea.vvzvlad.xyz/vvzvlad/gitmost into develop	2026-06-25 03:59:41 +03:00
claude_code	b6787cc542	fix(ai-chat): drain stream on client disconnect to stop heap-OOM leak The /api/ai-chat/stream and public-share streaming paths piped streamText output to the client socket via pipeUIMessageStreamToResponse, whose only reader is that socket. On a client disconnect (pervasive Safari/proxy ECONNRESET), backpressure stalled the stream: the controller aborted the turn but nothing drained it, so streamText's onFinish/onError/onAbort never fired. Cleanup (close leased MCP clients, persist partial) never ran and the whole per-turn object graph (history, per-request toolset closures, captured steps, SDK buffers) stayed rooted — accumulating across turns until the default ~2GB heap saturated and the process crashed with "Ineffective mark-compacts near heap limit - JavaScript heap out of memory". Add the AI SDK v6 documented remedy: fire-and-forget `result.consumeStream({ onError })` right after streamText(), which removes backpressure and drains the stream independently of the client socket so the terminal callbacks always fire and the turn's memory is released even when the client has gone away. Applied to both the authenticated and public-share stream services. Also add `--heapsnapshot-near-heap-limit=2` to the prod start script so any residual leak dumps a heap snapshot near OOM for diagnosis (no effect on normal operation). Heap size stays ops-tunable via NODE_OPTIONS. - apps/server/src/core/ai-chat/ai-chat.service.ts - apps/server/src/core/ai-chat/public-share-chat.service.ts - apps/server/package.json Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-25 03:59:32 +03:00