Addresses the documentation/convention warnings from the #119 review:
- .env.example: add the GIT-SYNC block (9 GIT_SYNC_* vars with defaults), noting
GIT_SYNC_SERVICE_USER_ID is required when sync is enabled.
- yjs-body-merge.ts: translate the Russian review note in the docstring to
English (comments-only-in-English rule).
- persistence.extension.ts: correct the stale "git-sync writes are full-body
replaces" rationale — a git-sync write is now a block-level merge into the live
doc, which is why it is debounced like a human edit rather than snapshotted.
- history-item.tsx: the GitSyncBadge version is created on the PUSH path (writing
the git body back into the doc), not by the pull — fix the comment.
- edit-space-form.tsx: log the raw error in the git-sync toggle catch instead of
swallowing it (AGENTS.md).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the new per-workspace rolling-day token-budget env var in
.env.example alongside the existing share-assistant cost knobs, and add
[Unreleased] Fixed entries for #161/#190/#207/#206 plus a Security entry
for the #159 token budget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
External MCP tools (web search, crawl) had no per-call timeout: a hung
tool call was only broken by the 15-min transport silence timeout shared
with the chat provider, and a server that kept the socket warm but never
returned could spin until the user cancelled.
Add two independent, composing bounds for external MCP traffic (the chat
provider path is unchanged):
- Silence 5 min: buildPinnedDispatcher now overrides headersTimeout/
bodyTimeout with mcpStreamTimeoutMs() (AI_MCP_STREAM_TIMEOUT_MS,
default 300000) on the external-MCP dispatcher only, so a byte-silent
upstream is severed in ~5 min instead of 15.
- Total per-call 15 min: wrapToolWithCallTimeout wraps each external
tool's execute with a fresh AbortController + timer composed with the
turn signal via AbortSignal.any (AI_MCP_CALL_TIMEOUT_MS, default
900000). It RACES the call against the abort signal because
@ai-sdk/mcp does not settle its in-flight promise on abort, so a
warm-but-stuck call would otherwise hang forever.
On timeout the call surfaces as a tool-error and the agent loop recovers.
Add tests (incl. a never-settling real-client-style stub) and document
both env vars in .env.example.
The real cause of the long-task "Lost connection to the AI provider" — the
earlier 300s-timeout fix (#176) was the wrong layer. The provider-HTTP telemetry
on the user's deploy shows the failures are PRE-RESPONSE `read ECONNRESET` ~500ms
in (not a 300s/15min timeout), correlated with idleSincePrevCall ~42s and large
bodies; and crucially a retry of the SAME request often succeeds. A direct probe
to the real z.ai endpoint does NOT reset (113KB bodies and a 45s-idle keep-alive
reuse both succeed), and another agent (opencode) runs fine from the same infra —
so the provider is healthy and the egress network is usable. The difference is
the transport: undici's keep-alive pool REUSES a socket that the deployment's
egress (NAT / firewall / conntrack) silently dropped during a long idle gap, so
the next request resets pre-response.
Fix (brings gitmost in line with clients that don't reuse stale sockets):
- Keep-alive recycling: the streaming dispatcher (chat fetch AND the external-MCP
dispatcher, via the shared streamingDispatcherOptions) now sets
keepAliveTimeout + keepAliveMaxTimeout to a 10s recycle window
(AI_STREAM_KEEPALIVE_MS), so a connection idle longer than that is closed
instead of reused — a long-gap step opens a fresh connection. keepAliveMaxTimeout
also caps a server-advertised keep-alive so the provider can't widen the window.
- Pre-response connection retry: createStreamingFetch retries a connection-level
reset (ECONNRESET / UND_ERR_SOCKET / ECONNREFUSED / EPIPE / *_TIMEOUT) on a
fresh connection up to 2 times. This is SAFE because fetch() only rejects before
the Response resolves — a started stream is never replayed; an abort (client
disconnect) is never retried.
Tests: ai-streaming-fetch.spec — keep-alive options, streamKeepAliveMs env,
isRetryableConnectError, and a server that resets the first connection so the
retry must land on a fresh one (+ aborted requests are not retried). Verified on
the stand that a normal turn still streams (reasoning + text + finish) through the
new transport. server tsc + ai/mcp specs green.
Note: root cause is the deployment's egress dropping idle connections (Traefik is
inbound-only); this makes the app resilient to it. AI_STREAM_KEEPALIVE_MS can be
lowered if the egress drops faster than ~10s.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Wording: every comment now says the stream timeouts are RAISED to a
generous-but-finite ~15-min silence timeout, not "disabled (0)" (the stale
comments contradicted the code, which uses AI_STREAM_TIMEOUT_MS, default
900000ms).
- Architecture (the load-bearing-temporary trap): the streaming fetch reached
the chat provider only by riding the "temporary DIAGNOSTIC" telemetry, so
deleting the telemetry by its own label would silently revert the timeout fix.
Legitimize it: rename ai-http-diagnostics.ts -> ai-provider-http.ts,
createDiagnosticFetch -> createInstrumentedFetch, field aiDiagnosticFetch ->
aiProviderFetch, drop the "temporary" labels, and document the chat transport
(streaming fetch + instrumentation) as one intentional construct.
- Docs: AI_STREAM_TIMEOUT_MS added to .env.example next to AI_EMBEDDING_TIMEOUT_MS.
- Tests:
- ai-provider-http.spec: createInstrumentedFetch delegates to the injected
baseFetch with the same input/init, returns the Response untouched, rethrows
the error, and defaults to global fetch — covering the baseFetch seam.
- ai-streaming-fetch.spec: the delayed-server test is now LOAD-BEARING — with
AI_STREAM_TIMEOUT_MS set below the 1.5s server delay the call actually rejects
(a lost dispatcher -> global 300s default would NOT), proving the configured
dispatcher is wired; plus the default-timeout happy path.
server tsc clean; ai-streaming-fetch / ai-provider-http / ai.service / mcp-servers
/ ai-error specs green (41).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- [warn 1] Document the is_agent operator setup so it survives plan deletion:
added an AI-agent block to .env.example (use a DEDICATED account, set is_agent
via SQL, never flag a human/shared account) + a CHANGELOG "Added" entry.
- [warn 2] Test the badge deep-link side effects: ai-agent-badge.test.tsx now
renders inside an explicit jotai store, clicks the badge, and asserts the
active chat id, window-open, cleared draft, closed history modal, AND that
stopPropagation keeps a parent onClick from firing.
- [suggestion 3] Hoist the window.matchMedia stub into vitest.setup.ts and drop
the duplicated beforeAll block from the three test files (ai-agent-badge,
comment-list-item, role-cards).
- [suggestion 4] Merge the two near-duplicate "non-clickable" cases via it.each.
- [follow-up 6] Introduce a single ProvenanceSource = 'user' | 'agent' type in
jwt-payload.ts and reference it from AuthProvenanceData, JwtPayload/
JwtCollabPayload, and resolveSource() — so a typo can't slip through as a bare
string. (Server auth chain; client IComment mirroring left as a follow-up.)
Follow-up 5 (shared agentSourceFields write-stamp helper) is deferred as the
review marked it — the 6 REST sites use varied shapes (create-spread vs
resolve-conditional-null vs page move), so it's a separate focused refactor.
Tests: client badge/comment/role-cards suites 11/11 pass; server auth+comment
suites 62 pass; typecheck clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The custom undici RetryAgent + aiFetch transport added for issue #140
did not actually heal mid-stream provider drops: undici's retry path is
a Range-based download-resume that SSE/chat-completions endpoints cannot
satisfy, so a reset after the first byte only swapped ECONNRESET for a
"server does not support the range header" error. Its only real effect
was reconnecting a poisoned keep-alive socket before the first byte, and
PR #141 on top of it turned the 60s headers timeout into deterministic
~61s failures (plus CONTENT_LENGTH_MISMATCH from retrying a POST body
after a timeout abort). The root cause is the z.ai coding endpoint, not
our transport.
Remove the whole layer and return all AI provider calls to Node's
default global fetch.
- delete integrations/ai/ai-http.ts and its spec
- ai.service.ts: drop the aiFetch import, the AI_BYPASS_RESILIENT_FETCH
diagnostic toggle, and fetch:aiFetch from every chat/embedding/STT
factory; raw STT call back to global fetch
- ai-chat.controller.ts: drop the stream-timing START log + startedAt
- ai-chat.service.ts: drop the first-chunk/FINISHED/ERROR timing logs
- .env.example: drop AI_BYPASS_RESILIENT_FETCH
Reverts: 1af5d34a, 7c308728, b7abb7ea, 35fc58ea, d6cd2754, 6efb8656.
Preserved (not part of the rollback): client-disconnect abort, title
generation in onFinish, partial-answer persistence, Safari SSE heartbeat.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The streaming chat turn hangs in all browsers while the non-streaming test
endpoint works — both use the same model/transport (createOpenAI + aiFetch),
so the suspect is the streaming path / custom undici RetryAgent transport.
- ai-http.ts: wrap aiFetch with per-request timing logs (start, ms-to-headers
on success, elapsed ms + cause on failure). Chat at info, embeddings at
debug. Only host+path logged.
- ai-chat.controller.ts / ai-chat.service.ts: log turn START, first-chunk
latency, FINISHED duration, and elapsed ms on disconnect/error/abort.
- ai.service.ts: AI_BYPASS_RESILIENT_FETCH=true makes the CHAT model omit
fetch:aiFetch and use the default global fetch — isolates transport vs
request-shape. Chat-only; embeddings/STT untouched; reversible via env.
- .env.example: document the flag.
No timeout/retry change. tsc clean; ai-chat + ai suites pass (292).
The fail-closed limiter behavior (#62 primary item) already shipped; this
finishes the issue by lowering the default hourly per-workspace cap from 300
to 100 to better fit real anonymous-assistant load. Still overridable via
SHARE_AI_WORKSPACE_MAX_PER_HOUR.
- public-share-workspace-limiter.ts: SHARE_AI_WORKSPACE_MAX_PER_WINDOW 300 -> 100.
- .env.example: documented default + example value 300 -> 100.
- public-share-chat.spec.ts: update the default-cap assertion to 100.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
trustProxy was unconditionally true, so req.ip came from a client-forgeable
X-Forwarded-For and the per-IP throttles (share-AI, /mcp brute-force) were
spoofable. Make it env-configurable (TRUST_PROXY) with a safe default that
trusts XFF only from loopback/private proxies, documented in .env.example.
NOTE: this changes the default from trust-all; deployments whose proxy is on a
public IP must set TRUST_PROXY (caveat documented).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Harden the anonymous public-share AI assistant against token-cost abuse
before exposing it to the internet:
- Add an env-tunable per-request output ceiling (maxOutputTokens) to the
public-share streamText call so one anonymous request cannot run up the
provider bill even if the per-IP throttle is evaded. New
resolveShareAiMaxOutputTokens() / SHARE_AI_MAX_OUTPUT_TOKENS_DEFAULT
(env SHARE_AI_MAX_OUTPUT_TOKENS, default 512), mirroring
resolveShareAiWorkspaceMax().
- Flip the per-workspace cost limiter to FAIL CLOSED on Redis failure
(was fail-open): if Redis is unavailable we cannot prove the workspace is
under its cap, so deny rather than admit an unmetered, billable call.
- Update the limiter spec (fail-open -> fail-closed) and add resolver tests;
document both knobs in .env.example.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
APP_SECRET does double duty: it signs JWTs and derives the AES-256-GCM key
that encrypts stored AI-provider credentials. Rotating it makes every saved
AI API key undecryptable and invalidates existing sessions. Document this
footgun where operators set the value (RT-30 from the red-team report).
- .env.example: dual-role warning block above APP_SECRET
- README.md / README.ru.md: warning callout in the upgrade section
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Post-merge hardening from the #13 security review:
- isInitializeRequestBody now delegates to the SDK isInitializeRequest (same
predicate as packages/mcp/http.ts), so a bare {method:'initialize'} with no
id/params no longer triggers the side-effecting login() (audit-spam /
user_sessions growth) before http.ts 400s it.
- Bind the Bearer path to the instance workspace: verifyBearerAccess rejects a
token whose payload.workspaceId != the instance workspace (resolved via
workspaceRepo.findFirst, consistent with the Basic path); optional param so
it's a no-op when unset.
- Close the user-enumeration timing oracle in verifyUserCredentials: the
missing/disabled branch now runs a bcrypt compare against a module-level dummy
hash whose cost (12) matches production saltRounds, so both paths take one
equal-cost bcrypt compare; the exact CREDENTIALS_MISMATCH_MESSAGE is preserved.
- Document the trusted-proxy requirement for the spoofable per-IP brute-force
limiter in .env.example (trustProxy is on; deploy behind a trusted proxy).
- Add real-execution coverage for enforceBasicLoginGate (SSO enforced / EE-MFA
bundled vs not / user-MFA / workspace-enforced-MFA) instead of stubbing the gate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The anonymous public-share AI assistant's per-IP rate limit is only
effective behind a trusted reverse proxy that overwrites X-Forwarded-For
with the real client IP (the app runs with trustProxy). Document this
deployment requirement and the per-workspace cost backstop env var
(SHARE_AI_WORKSPACE_MAX_PER_HOUR, default 300) in .env.example.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The embedded MCP server acted as a single service account; now each /mcp
session authenticates as the current user, so tools run under that user's
CASL and edits attribute to them.
- HTTP Basic (chosen path): Authorization: Basic email:password, validated
server-side via AuthService; the session carries the issued user JWT (not
the raw password). Password may contain ':' (split on first only).
- Bearer fallback: Authorization: Bearer <access JWT>, verified as ACCESS and
additionally checked for an active session + non-disabled user (matching
JwtStrategy), so revoked/disabled users are rejected.
- Service account stays as an optional fallback (no creds + env configured).
- packages/mcp createMcpHttpHandler accepts a per-request config resolver
(back-compat: static config / stdio unchanged); identity is bound to the
mcp-session-id at init and re-validated from the caller's own credentials on
every request (anti session-fixation: a guessed session id can't be reused
without matching creds).
- A full login (session + audit) happens only once at session init; later
requests re-verify credentials via a new non-side-effecting
AuthService.verifyUserCredentials (no session/audit spam).
- Failed-login limiter (5/60s, keyed per-IP, per-IP+email, and per-email so IP
rotation can't brute one account) since direct login bypasses the controller
throttler. Only real credential failures count.
- MCP_TOKEN shared guard moved off Authorization to an X-MCP-Token header
(timing-safe compare); credsConfigured 503 gate replaced by a clear 401.
- No secrets logged; all auth resolved before res.hijack() so failures return
clean 401 JSON. .env.example marks the service account optional.
Implements docs/backlog/mcp-per-user-auth.md (variant L).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The bulk embedding reindex could hang on a single page forever
("Indexed 27 of 34 pages") with zero log output:
- all progress logs were debug-level, suppressed in production (pino info);
- embedMany() had no timeout, so a slow/hung embeddings endpoint blocked
the sequential per-page loop indefinitely.
Changes:
- ai.service.embedTexts: bound embedMany with AbortSignal.timeout
(configurable via AI_EMBEDDING_TIMEOUT_MS, default 120000ms); on timeout
throw a clear, greppable message, classified by both signal.aborted and
the error name (TimeoutError/AbortError/ResponseAborted) so a real
provider error racing the timer keeps its diagnostics.
- embedding-indexer.reindexWorkspace: promote lifecycle/progress logs to
info; log "[i/N] indexing page <id>" BEFORE the await so a hang names the
stuck page; warn on slow pages (>30s); add timing + final summary.
- .env.example: document AI_EMBEDDING_TIMEOUT_MS.
Replace the removed enterprise EE MCP (private apps/server/src/ee submodule,
license-gated /mcp route) with our docmost-mcp, vendored as an isolated ESM
workspace package and served by the server over HTTP — no enterprise license.
Backend:
- Add packages/mcp (@docmost/mcp): vendored docmost-mcp refactored into a
side-effect-free createDocmostMcpServer() factory (38 tools preserved),
stdio entry kept in stdio.ts, Streamable-HTTP session manager in http.ts.
- Add apps/server McpModule: @Post/@Get/@Delete('mcp') (served at /mcp via the
existing global-prefix exclude), @SkipTransform + reply.hijack to bridge raw
Fastify req/res into the SDK transport. The module dynamically imports the
ESM-only package from CommonJS via a Function-indirected import resolved with
require.resolve + file:// URL. Gated by the workspace ai.mcp toggle, a
service-account (MCP_DOCMOST_EMAIL/PASSWORD/API_URL) and optional MCP_TOKEN;
per-session idle eviction (MCP_SESSION_IDLE_MS).
- Drop the enterprise license check on mcpEnabled in workspace.service.
- Dockerfile: copy packages/mcp into the production image.
- .env.example: document MCP_DOCMOST_*, MCP_TOKEN, MCP_SESSION_IDLE_MS.
Frontend:
- Recreate the community "AI & MCP" workspace-settings panel (mcp-settings.tsx):
admin-only toggle on settings.ai.mcp with optimistic update, copyable
${APP_URL}/mcp URL; wired into workspace-settings page. Reuses existing i18n.
Fixes:
- Pin packages/mcp tiptap deps to 3.20.4 (matching the client) and inline
getStyleProperty, preventing a duplicate @tiptap/core@3.26.1 from leaking into
the client editor via pnpm shamefully-hoist (was breaking apps/client tsc).
* integrate websocket redis adapter
* use APP_SECRET for jwt signing
* auto migrate database on startup in production
* add updatedAt to update db operations
* create enterprise ee package directory
* fix comment editor focus
* other fixes