Upgrades the 2-way body merge to a real diff3 three-way merge (review #5), so a
block ONLY the human changed is KEPT when git changed a DIFFERENT block — the
2-way merge would revert it to git's stale version.
Engine: the push update loop reads the last-synced pre-image
(`git.showFileAtRef(refs/docmost/last-pushed, path)`) and passes it as the
optional `baseMarkdown` to `client.importPageMarkdown` (the common ancestor).
Server: gitmost-datasource converts base+incoming, and writeBody runs a block-
level diff3 (new three-way-merge.ts `diff3Plan`): live-only change -> keep live,
git-only change -> take git, both-changed -> git wins (conflict policy), inserts/
deletes from either side preserved. Without a base (createPage) it falls back to
the 2-way merge. Crash-safety unchanged (docs built before the connection opens).
Tests: three-way-merge.spec.ts (14 — every diff3 case incl. the cross-block
preservation and conflict policy), yjs-body-merge 3-way (real Y.Docs: human's
block instance preserved while git's block is applied), plus an engine test that
the base is forwarded from showFileAtRef. Existing push assertions updated for the
new base arg. git-sync 589 pass; server merge/datasource/gate 62 pass; typecheck
clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Supersedes the active-session "defer" guard with a real merge (review #5 —
"запись делать через мерж", not skip-while-editing).
writeBody no longer does delete-all + re-insert (which discarded a concurrent
editor's in-flight changes on every sync). It now diffs the live body against the
incoming git body at TOP-LEVEL BLOCK granularity (LCS over a canonical structural
serialization) and applies only the minimal inserts/deletes:
- a block a human is editing is left UNTOUCHED when git changed a DIFFERENT block;
- an unchanged resync is a complete 0-op write;
- Yjs CRDT-merges the minimal ops with concurrent edits.
New yjs-body-merge.ts (mergeXmlFragments + cloneXmlNode + diffBlocks) is pure-Yjs
and unit-tested with real Y.Docs (8 tests): identical->0 ops, edit-one-block keeps
the other block instances, append/delete keep neighbours, marks survive the
cross-doc clone. Crash-safety kept: the incoming doc is built before the
connection opens, so a transform failure can't empty the body.
Removed: the ActiveEditSessionError defer path and the now-unused
CollaborationGateway.getActiveEditorCount.
Honest limitation: this is a 2-way merge — for a block BOTH sides changed since the
last sync, git wins (no common ancestor to decide). A full 3-way merge would need
the last-synced base plumbed from the engine; the dominant cases (unchanged
resync, edits to different blocks) are now lossless.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review finding #2: packages/git-sync/build/ (the COMPILED engine) and the
package's node_modules/ were committed. Prod executed the committed build/ while
CI/tests ran src/ and never rebuilt it — so a fix in src/ could pass tests while
stale compiled code shipped (a silent src/prod skew). The committed node_modules
were pnpm symlinks with a baked machine-local store path (/home/claude/...),
useless and misleading for everyone else.
- git rm --cached packages/git-sync/{build,node_modules} (42 + 31 files).
- .gitignore: ignore packages/*/node_modules/ and packages/git-sync/build/.
- Build the package where it is actually consumed: apps/server `pretest` now
builds @docmost/git-sync (its suite imports the built build/index.js), and the
CI Test workflow gains an explicit "Build git-sync" step. The Dockerfile builder
already runs `pnpm build` (nx builds the package) and now COPYs the fresh build/.
Verified: wiped build/, rebuilt via `pnpm --filter @docmost/git-sync build`, then
the server converter gate (26/26, imports the rebuilt package) and the git-sync
suite (588 passed) both pass against the freshly-built, non-committed output.
NOTE: packages/mcp/ has the same committed-build/node_modules pattern (pre-existing,
out of this PR's scope) and should get the same treatment in a follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review finding #5: the git -> page body write (writeBody) did a full-body replace
(delete-all + re-insert) on the shared Yjs doc. Applied while a human is editing
the page, it discarded their in-flight changes; and TiptapTransformer.toYdoc ran
AFTER the fragment was cleared, so a conversion failure could leave the page with
an empty body.
Fixes:
- Active-session guard: CollaborationGateway.getActiveEditorCount(documentName)
reports live human (websocket) editor sessions for a doc, excluding server-side
direct connections. writeBody now throws ActiveEditSessionError when an editor
is connected. The engine's push loop already isolates each importPageMarkdown in
try/catch and does not advance the loop-guard on failure, so the write is simply
retried on the next poll once the editor disconnects — never a clobber.
- Crash-safe conversion: build the replacement Yjs update BEFORE opening the
connection / clearing the fragment, so a transform failure can never leave the
body empty.
Also updates the server-side converter gate spec to the corrected round-trip
shape: the block-image hoist no longer leaves a leading empty paragraph (the
git-sync converter fix in 7d39c16b, now reaching the built package).
A true merge of git content into a live Yjs session is out of scope (it needs a
real 3-way text merge with no shared update lineage); deferring the write while a
page is being edited is the safe, owner-approved minimum.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Expose each git-sync-enabled space as a clonable/pushable git repo over HTTP,
so `git clone https://<user>:<pass>@<host>/git/<spaceId>.git` works and external
pushes flow back into Docmost pages — gitmost itself acts as the git host (no
external GitHub/Gitea, no SSH).
Transport: shell out to `git http-backend` (CGI; git is already in the runtime
image) which implements the full smart-HTTP protocol (info/refs, upload-pack,
receive-pack, protocol v2). A raw Fastify route `/git/*` (mounted at the root,
outside the `/api` prefix) bridges the request/response to the CGI; passthrough
content-type parsers for the git media types stream the raw body to stdin.
Reuse the existing engine: clients push the vault's `main` branch, whose commits
beyond `refs/docmost/last-pushed` the engine already reconciles into Docmost.
- http/git-http.service.ts — auth (HTTP Basic -> AuthService.verifyUserCredentials),
self-resolved workspace (DomainMiddleware does not run for this raw route),
per-space gating (global + per-space gitSync flags, 404 hides existence),
CASL authz (Read=fetch, Manage=push), dispatch.
- http/git-http-backend.service.ts — spawn `git http-backend`, binary-safe CGI
response parsing (Status/headers/body), stream to the socket.
- http/git-http.helpers.ts — pure path parse, service->kind mapping, gate decision
(unit-tested); rejects literal and percent-encoded path traversal.
- orchestrator: extract reusable withSpaceLock (CAS-guarded lock heartbeat so a
long push cannot let the lock expire mid-cycle) and add ingestExternalPush
(receive-pack + Docmost cycle under one lock; 503 on contention).
- vault-registry: ensureServable() — ensureRepo + idempotent receive.denyCurrentBranch
=updateInstead / denyNonFastForwards / http.receivepack / http.uploadpack.
- env: GIT_SYNC_HTTP_ENABLED (defaults to GIT_SYNC_ENABLED) + validation.
- main.ts: register the /git/* route and the git content-type parsers.
Tests: pure helpers, CGI parsing, and the GitHttpService handler (auth/gate/authz
+ workspace resolution). Server tsc + git-sync/env suites green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comprehensive-review follow-ups (APPROVE WITH SUGGESTIONS; no critical issues):
- poll interval is now actually configurable: replaced the hardcoded
@Interval('git-sync-poll', 15000) with a dynamic SchedulerRegistry interval
registered in onModuleInit from getGitSyncPollIntervalMs() (cleared in
onModuleDestroy); /status and the real cadence now share one config source.
Boots logging 'poll interval registered (Nms)'.
- loop-guard now ALWAYS applies: the lastUpdatedSource==='git-sync' skip was
nested inside the !spaceId/!workspaceId branch, so structural self-writes
(CREATE/MOVE/RESTORE/SOFT_DELETE, which carry spaceId+workspaceId) bypassed it
and re-triggered cycles. Fetch the page row once, guard unconditionally, then
resolve space/workspace.
- remove the dead PAGE_CONTENT_UPDATED subscription (it's a BullMQ job, never an
EventEmitter event; body edits arrive via PAGE_UPDATED).
- fix the stale datasource comment (PageService DOES stamp 'git-sync' now).
- env getters: parseInt radix 10 + NaN/<=0 fallback for poll/debounce (+ max
deletes), with 6 new environment.service.spec tests.
tsc clean; jest 723 pass; live cycle re-verified post-refactor (ran, push
applied, unflagged 92-page space untouched).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
UI opt-in for git-sync, mirroring the existing sharing/comments settings pattern
(no new endpoint, no new mechanism; orchestrator read query untouched):
- UpdateSpaceDto.gitSyncEnabled?: boolean.
- SpaceRepo.updateGitSyncSettings: jsonb-merge into settings.gitSync.<key>
(COALESCE || jsonb_build_object — never clobbers sibling sharing/comments);
stored as a real jsonb boolean so the orchestrator's
settings->'gitSync'->>'enabled' = 'true' matches.
- SpaceService.updateSpace handles the flag (audit diff) via the existing
CASL-guarded space update path (Manage/Settings).
- client: Switch in edit-space-form (optimistic mutate + revert-on-error,
readOnly-aware) + space types + 2 i18n keys.
- space.service.spec extended (calls updateGitSyncSettings; no-op when undefined).
tsc clean (server+client); jest src/core/space 4 pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fixes found by the live pull/push e2e:
- CRITICAL: driveCycle never checked out the 'docmost' branch before
applyPullActions, so Docmost content was written straight onto 'main',
clobbering local file edits before push could diff them. Now checkout
'docmost' before pull (applyPullActions commits there then checks out main +
merges) — mirrors the engine's pull main(). Round-trip now works both ways.
- add an unresolved-merge guard (SPEC §9): skip the cycle if the vault is
mid-merge instead of failing on checkout.
- SAFETY: enabledSpaces() is now STRICT opt-in — only spaces with
settings.gitSync.enabled===true; removed the all-spaces fallback that synced
every space (incl. a 92-page one) the moment GIT_SYNC_ENABLED flipped.
- SAFETY: per-cycle delete cap (GIT_SYNC_MAX_DELETES_PER_CYCLE, default 5):
dry-run the push, and if planned deletes exceed the cap, run the apply with
deletePage neutralized — phantom absence-deletions from a non-convergent vault
can't soft-delete real pages. Fails safe if the dry-run throws.
- fix manual trigger: TriggerGitSyncDto.spaceId needs @IsUUID or the global
whitelist ValidationPipe strips it (arrived undefined -> vault 'undefined').
Live-verified on an isolated flagged space: push (vault file edit -> Docmost
content, stamped lastUpdatedSource='git-sync') and pull (Docmost rename -> vault
file + meta) both work; an unrelated 92-page space stayed untouched throughout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Native data plane for git-sync (plan §3, §8.1):
- provenance: widen actor to 'user'|'agent'|'git-sync' (jwt-payload,
auth-provenance decorator); PersistenceExtension resolves lastUpdatedSource
with precedence agent > git-sync > user, debounced history (like a human edit,
not the agent's immediate snapshot).
- GitmostDataSourceService implements @docmost/git-sync's GitSyncClient natively:
reads via PageRepo/SpaceRepo (listSpaceTree complete:true, getPageJson), writes
via PageService (create/removePage soft-delete/movePage with computed fractional
position/update-rename/restore) + the writeBody linchpin through collab
openDirectConnection('page.'+id, {actor:'git-sync'}) mirroring
collaboration.handler withYdocConnection 'replace'. bind({workspaceId,userId})
returns the context-bound client for the orchestrator.
- 10 unit/contract tests (mapping + soft-delete + move-position), tsc clean.
Known gap (closed in A.4b): PageService.create/update/movePage only branch on
actor==='agent'; git-sync provenance is already passed through so the row source
marker propagates once PageService honors 'git-sync'. Module/orchestrator/config
come next.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make @docmost/git-sync natively consumable by the CommonJS server (and jest):
build to CommonJS (tsconfig module CommonJS, drop type:module, strip .js from
relative imports), and lazy-load the only ESM-only dep (marked) via the dynamic
Function('import()') trick (mirrors docmost-client.loader.ts) with a require()
fallback so vitest's evaluator works too. git-sync tests stay green (314 pass,
3 expected fail).
Add the §13.1 idempotency gate (apps/server .../git-sync-converter-gate.spec.ts):
13 editor-ext docs (paragraphs/headings, marks, links, bullet/ordered/task lists,
blockquote, callouts, code block, hr, table, nested mix) round-trip
content(editor-ext) -> convertProseMirrorToMarkdown -> markdownToProseMirror ->
TiptapTransformer.toYdoc/fromYdoc(tiptapExtensions) -> canonicalize and assert
docsCanonicallyEqual. All green => the vendored converter's docmost-schema is
schema-compatible with editor-ext (no node/mark/attr loss), which the plan §13.1
requires before Phase B. The one intrinsic markdown-image lossiness (width/height
/align can't ride plain ) is isolated in a KNOWN DIVERGENCE block, not
hidden. Server tsc clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The #205 share-aliases feature placed share-aliases.migration.spec.ts
inside src/database/migrations/. Kysely's FileMigrationProvider loads
EVERY file in that folder as a migration, so `migration:latest` imported
the test file and crashed with "ReferenceError: describe is not defined"
(no Jest globals under tsx). That broke the migration step shared by the
e2e-server, e2e-mcp and integration-test (test/test) jobs.
Move the spec one level up to src/database/ (matching the existing
src/database/jsonb-bind.spec.ts convention) so the migration runner no
longer sees it, and fix its relative imports
(./migrations/... and ./types/...). Jest still picks it up via the
src/**/*.spec.ts test glob. Verified locally: 3 passed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The e2e transform matches .js (required so ESM-only node_modules like
nanoid/@sindresorhus get transpiled), which also sweeps in editor-ext's
prebuilt CommonJS dist/*.js. ts-jest then warns "Got a .js file to
compile while allowJs is not set to true" for each footnote file. The
.js match cannot be dropped without reintroducing the ESM load errors, so
enable allowJs for ts-jest via an inline tsconfig override (merged with
apps/server/tsconfig.json — decorators/paths/module stay intact).
Verified locally: 0 allowJs warnings, app still compiles and boots to the
Redis connection (no DI/metadata regressions).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous jest-config fix let the module graph load further and exposed
two more reasons the server e2e never passed since it was added:
1. ESM transform chain: AppModule pulls in editor-ext -> @tiptap ->
@sindresorhus/slugify -> @sindresorhus/transliterate / escape-string-regexp,
plus p-limit -> yocto-queue — all ESM-only. Extend the e2e
transformIgnorePatterns whitelist to transform them (scoped packages need
both the pnpm `@scope+name` and nested `@scope/name` path forms, hence
`@sindresorhus[+/][a-z0-9-]+`). Verified locally: the graph now fully
transforms and resolves.
2. Wrong HTTP adapter: Docmost runs on Fastify (main.ts uses FastifyAdapter)
and does not depend on @nestjs/platform-express, but the scaffold test used
the default createNestApplication() (Express) and died with
"@nestjs/platform-express package is missing". Switch the test to
FastifyAdapter + getInstance().ready(), close in afterEach. Verified locally:
createNestApplication + app.init() now proceed to the live Redis/Postgres
connection (the infra CI provides via services + migrations).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the first e2e fix: with nanoid/editRes.edits resolved, the
suites failed one layer deeper. Both layers were never green since the
e2e jobs were added (non-blocking in CI), so the failures had stacked up.
server e2e (jest-e2e.json) — align module resolution/transform with the
working unit/integration jest configs so AppModule's full import graph
loads:
- moduleFileExtensions: add "tsx" (React-Email .tsx templates are pulled
in via the auth controller chain).
- transform: ^.+\.(t|j)s$ -> ^.+\.(t|j)sx?$ so .tsx is transformed.
- moduleNameMapper: add ^src/(.*)$ -> <rootDir>/../src/$1 (code imports
via the absolute 'src/...' alias). Verified locally: the module graph
now fully resolves (only env vars, supplied by CI, remain).
mcp e2e (test-e2e.mjs) — insert_image/replace_image accept only http(s)
URLs the server fetches; the test passed local file paths and died with
"Invalid image URL". Serve the PNG bytes over a throwaway 127.0.0.1 HTTP
server (the Docmost server runs on the same CI host) and pass URLs. The
featPng negative test is untouched: replaceImage checks the attachmentId
and throws before fetching, so its local path is never validated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two unrelated CI failures on the 0.94.0 release PR:
- server e2e: jest-e2e.json lacked transformIgnorePatterns, so the
ESM-only nanoid@5 package was loaded as CommonJS and crashed with
"Cannot use import statement outside a module". Add the same
node_modules whitelist already present in the unit and integration
jest configs (nanoid|uuid|image-dimensions|marked|happy-dom|lib0).
- mcp e2e: test-e2e.mjs read editRes.edits, but editPageText() returns
the per-edit results under `applied` (not `edits`), so editRes.edits
was undefined and .every() threw. Read editRes.applied instead.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The controller's buildPageSlug truncates the page title via
`title?.substring(0, 70)` before slugifying, but no test drove that
branch (the only titled case was 16 chars). Add a resolvable-alias
case with a 119-char title whose 70-char boundary falls mid-word and
assert the 302 target's slug reflects only the first 70 characters.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code-review follow-ups (Approve-with-comments) for batch #197
(context badge #189 / e2e in CI #187 / inline MCP test #170):
- server: extract the duplicated chatContextWindow ::text->positive-int
coercion (resolve() + getMasked()) into an exported parsePositiveInt
helper and unit-test its branches (200000/1.9/0/-5/""/abc/undefined),
closing the untested read-path gap.
- client: merge the two backward scans over messageRows into one pure,
exported selectContextBadge helper (numerator and denominator still
taken from the most recent row carrying EACH value) and unit-test the
different-rows and fresh-zero-doesn't-shadow cases.
- client: extract the MCP "Test" button tristate presentation into a pure
mcpTestButtonView helper (collapses the two parallel if/else chains) and
unit-test idle/ok-with-tools/ok-no-tools/failed label+tooltip branches.
- ci: redirect the backgrounded prod server's stdout/stderr to a log file
in e2e-mcp and cat it on failure, so a start-up crash is diagnosable
instead of surfacing only as the generic health timeout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the two blocking test-coverage specs requested in the PR #214 review and
clear the cheap non-blocking items.
Must-fix:
- share-alias-redirect.controller.spec.ts: routing/leak guard for the public
GET /l/:alias resolver (modeled on share-seo.controller.routing.spec). Pins
302-to-canonical on a hit; SPA index without a 302 for unknown/dangling/
unreadable aliases and a null workspace (no name-existence leak); defensive
percent-decoding treated as unknown; self-hosted findFirst vs subdomain
findByHostname workspace resolution; 404 when no built client index exists.
- share-alias.controller.spec.ts: authz gates with mocked PageRepo/ShareService/
ShareAliasService/PageAccessService. Covers cross-workspace/nonexistent page
-> NotFoundException, validateCanEdit, resolveReadableSharePage null ->
BadRequestException, isSharingAllowed false -> ForbiddenException, set happy
path delegation, remove() of a dangling alias (pageId null) skipping
validateCanEdit but still deleting, and for-page validateCanView.
Cheap review items:
- Remove dead Logger import/field from ShareAliasRedirectController.
- Remove dead PagePermissionRepo import/dependency from ShareAliasController.
- Register the new share-alias UI strings in en-US and ru-RU catalogs.
- Add an [Unreleased]/Added CHANGELOG entry for /l/:alias (#205).
- Drop the tautological boilerplate assertions from the migration spec
(exports up/down; runtime checks of typed entity literals).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
onFinish always receives a totalUsage object, so the `?? {}` guard and
optional chaining were dead. Extract the field-level extraction into a
recordTurnUsage method (totalTokens, else input+output) and unit-test that
recordShareTokens receives the summed value when totalTokens is absent and the
authoritative total when present.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a retargetable, human-readable vanity link namespace /l/<alias> that
sits alongside the untouched /share/... routes.
- New share_aliases table (workspace-scoped, UNIQUE(workspace_id, alias),
page_id nullable ON DELETE SET NULL so the address outlives its target).
- ShareAliasRepo + ShareAliasService (create / no-op / 409 reassign guard /
availability / request-time readable-target resolution through the single
existing share boundary).
- Public ShareAliasRedirectController (GET /l/:alias) issues a 302 (never 301,
the target is mutable) to the canonical /share/:key/p/:slug page; unknown /
dangling / no-longer-readable aliases serve the SPA index with no leak.
'l/:alias' excluded from the global /api prefix.
- Authenticated ShareAliasController (set/remove/availability/for-page).
- Shared ASCII-only normalize/validate util (server + client copies).
- Client: Custom address block in the share modal (live normalize + debounced
availability + copy + reassign confirmation dialog).
- Unit tests: util, repo SQL-shape, service semantics, migration/entity sanity
(server jest) + client alias util (vitest).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The anonymous public-share assistant only capped the COUNT of requests
(100/hour/workspace), not their cost. One accepted turn runs the agent loop
up to stepCountIs(5), re-sending the whole client-held transcript as input on
every step, while maxOutputTokens caps only the output; the request window is
hourly with no daily ceiling, so a steady stream at the cap sustains ~24x its
count per day. Counting requests therefore does not bound the owner's LLM bill
(red-team finding #5).
Add a second cost contour: a cluster-wide, sliding-window per-workspace TOKEN
budget over a rolling day. It is checked read-only BEFORE a turn streams (429,
no request slot consumed, nothing spent) and the turn's real usage
(totalUsage: input re-sent per step + output, summed across all steps) is
recorded once it finishes via streamText onFinish. Fails closed on the check
(deny when Redis can't prove we're under budget); best-effort on the record.
Env-overridable via SHARE_AI_WORKSPACE_TOKEN_BUDGET_PER_DAY (default 1M/day).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
persist-1: onStoreDocument wrapped the page write in a try/catch that only
logged and swallowed the error, then resolved "successfully". hocuspocus
destroys/unloads the in-memory Y.Doc right after the hook resolves (the only
copy of the latest edit), so a transient DB error (deadlock, serialization
failure, dropped connection) silently lost the edit. Worse, the post-store
branch ran on the partially-assigned `page`, broadcasting a phantom
"page.updated" and enqueueing a history snapshot for content never written.
Wrap the write in a small bounded retry (3 attempts) so the save is
re-attempted while we still hold the doc, and clear `page` on failure so the
success-only side effects never report a save that didn't happen.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
attach-1: when the same attachmentId was referenced by more than one page
in a duplicated subtree, the per-attachmentId map held only a single copy
entry, so the last page processed clobbered the others. The downstream
ownership guard (`attachment.pageId !== oldPageId`) then matched at most one
page and skipped the lone DB row entirely: no blob copied, no new row, every
copy's image 404'd. Key the map to a list of entries and copy one blob/row
per referencing page; drop the now-incorrect ownership guard.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
getPageBreadCrumbs (ancestor CTE) and forceDelete (descendant CTE) used
withRecursive + unionAll with no CYCLE clause or depth cap. If a parent/child
cycle already exists in the data (e.g. one slipped in via the #7 TOCTOU race),
both queries loop forever — hang / statement timeout. Worse, the move guard
itself runs the ancestor CTE, so a cycle would disable the very guard meant to
prevent it (#207#8).
Add a depth counter bounded by MAX_PAGE_TREE_DEPTH to both recursive CTEs; the
walk stops at the cap, so a cycle yields a bounded result instead of hanging.
Real page trees are only a few levels deep, so the cap never truncates a
legitimate result. getPageBreadCrumbs selects an explicit column list (not
selectAll) so the internal depth counter never leaks into the breadcrumb shape.
Adds an integration test that seeds an A<->B cycle directly and asserts both
getPageBreadCrumbs and forceDelete return bounded / complete under a short
connection-level statement_timeout instead of hanging.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The server-side move cycle guard (getPageBreadCrumbs) and the move UPDATE ran
as two separate, unlocked statements. Two concurrent moves ("A under B" and
"B under A") could each read the same pre-write acyclic snapshot, both pass the
guard, then persist A.parentPageId=B AND B.parentPageId=A — a parent/child
cycle (TOCTOU, #207#7).
Run the cycle check and the UPDATE inside one transaction (executeTx) guarded
by a per-space advisory lock (pg_advisory_xact_lock, held until COMMIT) so all
moves within a space serialize: the second mover blocks until the first commits
and then sees the freshly written parent, so its guard rejects the cycle.
getPageBreadCrumbs gains an optional trx so the check runs on the locked snapshot.
Adds an integration test driving two opposing concurrent movePage calls and
asserting no cycle ever persists and exactly one move is rejected. Updates the
movePage unit-test stubs for the new transactional path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In-app AI-chat tools used bare zod schemas, so when the model dropped a
required arg (typically pageId) in a parallel/batch tool call, the AI SDK
forwarded zod's raw "expected string, received undefined" text to the model
— not actionable. Add a centralized modelFriendlyInput(shape) wrapper that
keeps the exact JSON Schema contract (required/description/constraints via
z.toJSONSchema draft-7) but replaces the raw zod text with a message naming
each missing/invalid parameter and reminding the model not to drop ids like
pageId in parallel batches. No value is guessed/backfilled (cf. #159).
Applied to every in-app tool: the sharedTool() builder and all inline
inputSchema in ai-chat-tools.service.ts, plus public-share-chat-tools.service.ts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The floating chat window's header badge flipped meaning — a live per-turn token
counter while streaming, the persisted context size at rest — so it "reset to 1"
on each prompt and conflated two different numbers. Replace it with a stable
"current / max" context badge (e.g. `572 / 200k`). The live "Thinking · N tokens"
inside the chat body stays; only the duplicate live counter is removed from the
header.
Max comes from a new admin setting "Context window (tokens)". The server resolves
it and attaches `maxContextTokens` to the completed assistant turn's metadata
(next to contextTokens), so the badge needs no client-side model resolution and
this survives public shares / per-role models.
Server:
- ai.types: chatContextWindow on AiProviderSettings + PROVIDER_SETTINGS_KEYS +
ResolvedAiConfig + MaskedAiSettings.
- workspace.repo: chatContextWindow in AI_PROVIDER_SETTINGS_ALLOWED (parity).
- update-ai-settings.dto: @IsInt @Min(0) chatContextWindow.
- ai-settings.service: coerce the ::text-stored value to a positive int in
resolve()/getMasked().
- ai-chat.service: flushAssistant writes metadata.maxContextTokens (>0); the
completed turn passes resolved.chatContextWindow.
Client:
- ai-chat.types: maxContextTokens on the message-row metadata.
- ai-chat-window: read maxContextTokens; render "current [/ max]"; drop the
liveTurnTokens state/branch and the onLiveTurnTokens prop; new tooltip.
- chat-thread: remove the live-turn-token throttle effect and plumbing.
- count-stream-tokens: drop the now-dead liveTurnTokens()/types; keep
estimateTokens.
- settings: chatContextWindow on IAiSettings(+Update) + a NumberInput in the AI
provider settings form.
i18n: add the badge/settings keys (en, ru); remove the two now-unused keys.
Tests: flushAssistant maxContextTokens, DTO validation, trim token tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Approve-with-comments re-review; no blockers. All 7 actionable points (8 is a
forward-looking architecture note — recommendation A, keep as-is):
1. chat-markdown.util spec: restore parity coverage of the removed client spec —
tool error state (+ errorText), unknown-tool fallback (`Ran tool <name>` en /
`Выполнил инструмент <name>` ru), and the circular-output stringify catch.
2. findAllByChat row cap is now testable (injectable limit) + an int-spec proves
truncation on a modest volume.
3. Stability: the per-step durability updates are SERIALIZED via a promise chain
(stepUpdateChain) so they commit in step order — onlyIfStreaming already
closed the finalize race, this closes inter-step ordering.
4. findAllByChat keeps the NEWEST messages on truncation (order DESC + reverse,
like findRecent) and logs a warning with chatId, instead of silently dropping
the newest tail.
5. The LABELS parity comment already references the real path (tool-parts.tsx /
toolLabelKey) — confirmed accurate.
6. Removed the redundant 'off-by-one boundary' test (strict subset of the two
adjacent prepareAgentStep cases).
7. Extracted the terminal-finalize dispatch into a shared `applyFinalize`, used
by BOTH the service's finalizeAssistant and its test — the test now exercises
the real path, not a copy, so a production drift fails it.
Verified: server build + 325 ai-chat unit + 6 integration; prettier clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The re-review's blocking/structural points (lease leak, dup-id guard test,
body-before-title test, CHANGELOG, pg18, shared jsonb decoder) were already
addressed in commit 24264ef; this adds the 4 genuinely-new coverage requests:
- pt 6: `scrollToReference(id, index?)` exercised against a live editor DOM —
selects the index-th `sup[data-footnote-ref][data-id]` occurrence, falls back
to the first for out-of-range, returns false for an empty id (scrollIntoView
stubbed). (#168)
- pt 7: export `backlinkLabel` and pin the base-26 carry boundary
(25->z, 26->aa, 27->ab, 51->az, 52->ba). (#168)
- pt 8: integration fail-open — a PRESENT-but-corrupt tool_allowlist (jsonb
string scalar holding non-array JSON) reads back as null ("no restriction"),
covering normalizeRow's degrade branch. (#159 #172/#173)
- pt 9: getFootnoteRefCount cache invalidation — adding a `[^a]` reference bumps
the cached count 2 -> 3. (#168)
Verified: editor-ext footnote 23; client structure 7 + tsc; server int 8.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
15-point review of the persistent-history PR. Architecture decisions: crash
recovery = recency threshold; tool-label duplication = leave as-is.
Must-fix:
1. Boot-sweep bounded by recency. sweepStreaming now also requires
`updatedAt < now() - SWEEP_STREAMING_STALE_MS` (10 min), so a fresh replica's
startup sweep can't abort a turn another replica is actively streaming
(multi-instance deploy). Int-spec: a FRESH 'streaming' row is NOT swept, a
STALE one IS.
2. Restore export during the FIRST streaming turn of a new chat (#174). The
server chatId is now adopted EARLY (in-place, on the start-chunk metadata) via
a new `onServerChatId` callback wired through use-chat-session → chat-thread,
so `activeChatId` is set at turn start and the Copy button is live mid-first-
turn (canExport = !!activeChatId). Hook tests for early/in-place/no-op adopt.
3. Cover finalizeAssistant's fallback-insert branch: extracted pure
`planFinalizeAssistant(assistantId)` (update when id present, insert when the
upfront insert failed) + a dispatch harness test for both arms.
Tests: onModuleInit lifecycle spec (sweep called; throw → resolves + warns);
int-spec updatedAt assertion → toBeGreaterThan.
Cleanups: cap findAllByChat at 5000 rows; upfront-insert-failure log carries
chatId+workspaceId; removed the now-dead buildPartialAssistantRecord (only the
spec consumed it; shapes still pinned by the flushAssistant suite); controller
passes `lang: dto.lang` (normalizeLang handles undefined); dropped a no-op
`?? undefined` in errorOf; documented the content-column semantics change
(concatenated step text, UI renders from metadata.parts); CHANGELOG [Unreleased]
entry (#183, #174); reworded the stale LABELS parity comment.
Verified: server build + 323 ai-chat unit + 5 integration; client tsc + 160
ai-chat unit; prettier clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8-point multi-aspect review of the batch PR; security/regressions were clean.
1. Lease leak: the #180 reorder moved `toolsFor` (which leases external MCP
clients, refCount+1) ahead of buildSystemPrompt + forUser, but the only
release (closeExternalClients) was bound to the streamText callbacks. A throw
in between leaked the lease (refCount stuck, undici sockets held until
restart). Define closeExternalClients right after the lease and wrap
buildSystemPrompt+forUser in try/catch that closes-then-rethrows.
2. Cover the patch_node/delete_node dup-id refusal (#159#6): extract the guard
into a pure `assertUnambiguousMatch` (node-ops) and unit-test 0/1/>1.
3. Regress the body-before-title order (#159#10): mock-HTTP test (collab fails
fast against a server with no WS upgrade) asserts /pages/update (title) is
NEVER posted when the body write fails — for updatePage AND updatePageJson.
4. CHANGELOG [Unreleased]: #180, #168 (Added); #163 (Fixed).
5. Add the missing en-US i18n keys (Back to references / {{label}}).
6. Drop the duplicate content/empty/blank cases in ai-chat.prompt.spec.ts (they
repeat the buildMcpToolingBlock unit tests); keep only sandwich placement +
both-safety-copies.
7. CI Postgres pg16 -> pg18 (match docker-compose).
8. jsonb decode seam: shared `parseJsonbValue(value, guard)` in database/utils.ts
holds the legacy double-encoding self-heal in one place; parseToolAllowlist /
parseModelConfig keep only a type-guard.
Verified: server build + 124 unit + 15 integration; mcp 311; prettier clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`ShareSeoController.getShare` resolved the inherited share with the RAW
`getShareForPage`, which does NOT run the restricted-ancestor gate. So for a
page shared with includeSubPages whose descendant is permission-restricted, the
SEO route served that descendant's real title in <title>/og:title/twitter:title
to anonymous visitors and crawlers — even though the content API returns 404 for
it (red-team finding #3).
Funnel the SEO path through the canonical `resolveReadableSharePage` boundary
(the single place that checks `hasRestrictedAncestor`): a non-readable page now
serves the plain SPA index with no meta. Also honour `isSharingAllowed` — a
share whose workspace/space sharing toggle was flipped off after creation no
longer leaks its title via SEO. Title comes from the server-resolved page;
`buildShareMetaHtml` already emits robots=noindex when the share opted out of
indexing.
Tests (controller routing, fs spied at call time so bcrypt's native loader is
untouched): non-readable page => plain index, no title; sharing-disabled =>
plain index; readable+indexing => title + og:title, no noindex; readable+no-
indexing => noindex. Asserts getShareForPage is never called by the SEO path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The client sends the "current page" as { id, title } in the request body and the
server echoed BOTH verbatim into the system prompt context and the
getCurrentPage tool. id and title are independently attacker/desync-controllable
(two tabs, stale navigation), so openPage.id could point at page B while
openPage.title said "Page A" — the model then reported "updated Page A" while it
actually edited page B (CASL still allowed it; the user has access). Red-team
finding #4.
Resolve the open page ONCE against the DB via a new `resolveOpenPageContext`:
workspace-scoped lookup + access check, returning the AUTHORITATIVE { id, title }
(title from the DB row, never the client) or null (fail-closed) for a missing /
foreign / inaccessible page. That validated value now feeds the system prompt,
the getCurrentPage tool, AND the new-chat history origin (which previously did
this validation inline, for the id only — now shared, and the title is fixed
too).
Tests: resolveOpenPageContext covers no-id, not-found, foreign-workspace,
Forbidden, non-Forbidden-fault (fail-closed), the DB-title-wins-over-client case,
and null-title coercion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Admins can now give each EXTERNAL MCP server a free-text instruction ("how/
when to use this server's tools") that the agent receives in its SYSTEM
PROMPT next to the tool descriptions — porting the built-in SERVER_INSTRUCTIONS
idea to admin-configured servers. Trusted, admin-authored text (like a system
prompt); NON-secret, so unlike headersEnc it IS returned in views/forms.
- Migration: nullable `instructions text` on ai_mcp_servers (old rows = null =
no guidance). Table type + repo insert/update (blank/whitespace -> null via
blankToNull). DTO `@MaxLength(4000)`. Service threads it through
McpServerView/toView.
- mcp-clients: `McpServerInstruction { serverName, toolPrefix, instructions }`
threaded through the toolset/cache/lease. Guidance is built ONLY for a server
that actually connected AND contributed >=1 callable tool (the allowlist may
filter all of them out) AND has non-blank text — so a guide never appears for
tools the agent cannot call. Cached with the toolset, so an edit is picked up
next turn via the existing CRUD cache invalidation.
- System prompt: `buildMcpToolingBlock` renders an <mcp_tooling> block INSIDE
the safety sandwich (after context, before the trailing SAFETY_FRAMEWORK) so
it informs tool choice but cannot override the rules; each section is headed
by the server's `prefix_*` namespace. Empty/blank -> block omitted. The
caller (ai-chat.service) now builds the external toolset BEFORE the prompt and
passes external.instructions; client-handle lifecycle (close-once) unchanged.
- Client: instructions field in types + a Textarea (autosize, maxLength 4000)
in the MCP-server form with a namespace-prefix hint; i18n (en/ru).
Tests across every layer (prompt block placement + both SAFETY copies; view
blank->null; buildEntry includes guidance only for connected+>=1-tool+non-blank;
DTO MaxLength; repo + integration round-trip; service wiring). Delegated impl
reviewed (APPROVE); applied the import-type follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR #172 fixed the jsonb double-encoding for `tool_allowlist` but the same
class of bug, and the same re-derived workaround, remained elsewhere.
1. model_config (agent roles): jsonbObject still used the buggy `::jsonb`
bind, so `ai_agent_roles.model_config` round-tripped as a jsonb STRING
SCALAR. The read-path `typeof === 'object'` check then failed and the
model override was SILENTLY dropped (role fell back to the default model).
Fixed to `::text::jsonb` and added `parseModelConfig` + `normalizeRow` so
every read self-heals already-corrupted rows (no migration).
2. Centralized the write workaround as `jsonbBind()` in database/utils.ts —
one implementation with one explanation of the quirk — replacing the
per-repo `jsonbArray` (mcp) and `jsonbObject` (roles).
3. Integration coverage (the fix is a DB round-trip a unit test cannot see;
the read-side parser MASKS a write regression): new
ai-mcp-server-repo.int-spec asserts `jsonb_typeof(tool_allowlist)='array'`
after insert + heals a seeded string-scalar row; ai-agent-roles-repo
int-spec gains the same for `model_config` (`'object'` + heal).
4. Updated the stale `ai-mcp-servers.types.ts` comment (the driver returns a
JSON string for legacy rows; the repo normalizes every read).
5. Fail-open logging: a corrupt tool_allowlist degrades to "no restriction"
(agent gets ALL tools) — normalizeRow now warns (server id only, never
contents) so the silent widening leaves a trace.
6. Simplified parseToolAllowlist (normalize the string once, then a single
array-of-strings check) — identical behaviour, all 12 cases still pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review caught a real race: onStepFinish fires `updateStreaming()` fire-and-
forget (not awaited), so the FINAL step's streaming UPDATE and the terminal
`finalizeAssistant` UPDATE run as two concurrent statements on different pool
connections — commit order is not guaranteed. If the late streaming update
lands AFTER finalize, the completed row is clobbered back to status='streaming'
with no usage/finishReason, and the next startup sweep then mis-marks the
finished turn 'aborted'. Green unit/integration tests don't reproduce a
cross-connection race.
Fix: scope the per-step update with `onlyIfStreaming` → SQL `WHERE
status='streaming'`. Once finalize has set a terminal status the late update
matches zero rows and no-ops, regardless of commit order; finalize runs
unguarded so it always wins. A cheap `if (finalized) return` short-circuit
avoids most wasted queries, but the SQL guard is the authoritative fix (the
flag can be set after a query is already in flight).
Integration test: finalize to 'completed', then a late onlyIfStreaming update
is a no-op — status/content/usage preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chat lived in inconsistent paradigms (in-memory stream + client export vs.
DB-as-context), which made export flaky and lost the assistant answer if the
process died mid-turn. Make the DB the single source of truth.
A. STEP-GRANULAR DURABILITY (server)
- ai_chat_messages gains a nullable `status` column (migration; NULL = legacy =
completed). The assistant row is now INSERTED UPFRONT as `status:'streaming'`
and UPDATEd on every onStepFinish with all finished steps (text + tool calls +
tool RESULTS), then finalized once to completed/error/aborted on the terminal
callback. So a process death mid-turn keeps every finished step; a startup
sweep (OnModuleInit → sweepStreaming) flips any dangling 'streaming' row to
'aborted'. The write path no longer depends on a live socket.
- Pure exported `flushAssistant(steps, inProgressText, status, extra?)` builds
the persist payload (metadata.parts byte-identical to the old builder), so a
future background worker can call the same path. AiChatMessageRepo gains
`update`, `sweepStreaming`, and `findAllByChat`.
- consumeStream drain, external-MCP client close-once, SSE heartbeat preserved.
B. SERVER-SIDE EXPORT
- New pure `chat-markdown.util.ts` renders Markdown from DB rows ONLY (server
port of the client builder). Because A persists the in-progress row, the
export now includes an interrupted turn up to its last finished step (flagged
"still generating"). `POST /ai-chat/export` (owner-gated via assertOwnedChat,
workspace-scoped) returns it; `lang` accepts a full client locale tag
('en-US'/'ru-RU') and is normalized server-side (normalizeLang) — a strict
@IsIn(['en','ru']) DTO rejected the real client's i18n.language with a 400,
caught in real-browser testing.
- Client: handleCopy calls the endpoint; `canExport = !!activeChatId`. The whole
liveThreadRef/liveStateRef/onLiveContentChange/hasLiveContent hybrid (and the
client chat-markdown util + test) is removed — the server is now authoritative.
Tests: flushAssistant unit (status shapes + parts parity), chat-markdown.util
unit (incl. legacy NULL-status + interrupted note + ru + normalizeLang locale
tags), controller export wiring + owner-gate, integration update/sweepStreaming.
Verified: server build + 318 ai-chat unit + 3 integration; client tsc + 157
ai-chat unit; and END-TO-END in a real browser — a chat turn persists mid-stream
and the Copy button exports the DB-sourced markdown (showing the in-progress
row), HTTP 200 after the locale fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
External MCP tools (web search, crawl) had no per-call timeout: a hung
tool call was only broken by the 15-min transport silence timeout shared
with the chat provider, and a server that kept the socket warm but never
returned could spin until the user cancelled.
Add two independent, composing bounds for external MCP traffic (the chat
provider path is unchanged):
- Silence 5 min: buildPinnedDispatcher now overrides headersTimeout/
bodyTimeout with mcpStreamTimeoutMs() (AI_MCP_STREAM_TIMEOUT_MS,
default 300000) on the external-MCP dispatcher only, so a byte-silent
upstream is severed in ~5 min instead of 15.
- Total per-call 15 min: wrapToolWithCallTimeout wraps each external
tool's execute with a fresh AbortController + timer composed with the
turn signal via AbortSignal.any (AI_MCP_CALL_TIMEOUT_MS, default
900000). It RACES the call against the abort signal because
@ai-sdk/mcp does not settle its in-flight promise on abort, so a
warm-but-stuck call would otherwise hang forever.
On timeout the call surfaces as a tool-error and the agent loop recovers.
Add tests (incl. a never-settling real-client-style stub) and document
both env vars in .env.example.
The /api/ai-chat/stream and public-share streaming paths piped streamText
output to the client socket via pipeUIMessageStreamToResponse, whose only
reader is that socket. On a client disconnect (pervasive Safari/proxy
ECONNRESET), backpressure stalled the stream: the controller aborted the
turn but nothing drained it, so streamText's onFinish/onError/onAbort never
fired. Cleanup (close leased MCP clients, persist partial) never ran and the
whole per-turn object graph (history, per-request toolset closures, captured
steps, SDK buffers) stayed rooted — accumulating across turns until the
default ~2GB heap saturated and the process crashed with
"Ineffective mark-compacts near heap limit - JavaScript heap out of memory".
Add the AI SDK v6 documented remedy: fire-and-forget
`result.consumeStream({ onError })` right after streamText(), which removes
backpressure and drains the stream independently of the client socket so the
terminal callbacks always fire and the turn's memory is released even when the
client has gone away. Applied to both the authenticated and public-share
stream services.
Also add `--heapsnapshot-near-heap-limit=2` to the prod start script so any
residual leak dumps a heap snapshot near OOM for diagnosis (no effect on
normal operation). Heap size stays ops-tunable via NODE_OPTIONS.
- apps/server/src/core/ai-chat/ai-chat.service.ts
- apps/server/src/core/ai-chat/public-share-chat.service.ts
- apps/server/package.json
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Invert the transport layers so the pre-response retry is OUTERMOST and the
provider-HTTP instrumentation is INNER. Before, the retry lived inside
createStreamingFetch (under the instrumentation), so a reset the retry
recovered from logged only a clean "OK status=200" — the
"PRE-RESPONSE FAILED ... ECONNRESET ... idleSincePrevCall" signal went blind
exactly when the fix works, and AI_STREAM_KEEPALIVE_MS couldn't be tuned from
prod data. Now createStreamingFetch is the dispatcher-bound BASE (no retry) and
a new withPreResponseRetry() wraps it; ai.service composes
withPreResponseRetry(createInstrumentedFetch('AiService:provider-http',
createStreamingFetch())), so every attempt — including recovered resets — flows
through the instrumentation. (Also expresses the keepAlive-config vs retry-
behavior boundary structurally, per review #3.)
- Add the retry-exhaustion test: a server that resets EVERY connection, asserting
the call rejects with a retryable connection error AND exactly
PRE_RESPONSE_CONNECT_RETRIES + 1 (= 3) requests reached the server — pinning the
bound and that the final error propagates (guards an off-by-one / infinite loop
/ swallowed error). Existing happy-retry + abort tests moved onto
withPreResponseRetry.
Verified on the stand: a normal turn still streams (reasoning + finish) and the
provider-HTTP telemetry still logs. server tsc + ai/mcp specs green (30).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The real cause of the long-task "Lost connection to the AI provider" — the
earlier 300s-timeout fix (#176) was the wrong layer. The provider-HTTP telemetry
on the user's deploy shows the failures are PRE-RESPONSE `read ECONNRESET` ~500ms
in (not a 300s/15min timeout), correlated with idleSincePrevCall ~42s and large
bodies; and crucially a retry of the SAME request often succeeds. A direct probe
to the real z.ai endpoint does NOT reset (113KB bodies and a 45s-idle keep-alive
reuse both succeed), and another agent (opencode) runs fine from the same infra —
so the provider is healthy and the egress network is usable. The difference is
the transport: undici's keep-alive pool REUSES a socket that the deployment's
egress (NAT / firewall / conntrack) silently dropped during a long idle gap, so
the next request resets pre-response.
Fix (brings gitmost in line with clients that don't reuse stale sockets):
- Keep-alive recycling: the streaming dispatcher (chat fetch AND the external-MCP
dispatcher, via the shared streamingDispatcherOptions) now sets
keepAliveTimeout + keepAliveMaxTimeout to a 10s recycle window
(AI_STREAM_KEEPALIVE_MS), so a connection idle longer than that is closed
instead of reused — a long-gap step opens a fresh connection. keepAliveMaxTimeout
also caps a server-advertised keep-alive so the provider can't widen the window.
- Pre-response connection retry: createStreamingFetch retries a connection-level
reset (ECONNRESET / UND_ERR_SOCKET / ECONNREFUSED / EPIPE / *_TIMEOUT) on a
fresh connection up to 2 times. This is SAFE because fetch() only rejects before
the Response resolves — a started stream is never replayed; an abort (client
disconnect) is never retried.
Tests: ai-streaming-fetch.spec — keep-alive options, streamKeepAliveMs env,
isRetryableConnectError, and a server that resets the first connection so the
retry must land on a fresh one (+ aborted requests are not retried). Verified on
the stand that a normal turn still streams (reasoning + text + finish) through the
new transport. server tsc + ai/mcp specs green.
Note: root cause is the deployment's egress dropping idle connections (Traefik is
inbound-only); this makes the app resilient to it. AI_STREAM_KEEPALIVE_MS can be
lowered if the egress drops faster than ~10s.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses the second #177 review:
- Architecture (the silent allowlist drift): the writable provider-setting keys
were maintained by hand in two TS-uncheckable places — the key-loop in
ai-settings.service and the SQL ALLOWED list in the generic workspace repo (a
miss there silently dropped a field on persist, exactly what bit chatApiStyle).
Introduce one typed source of truth PROVIDER_SETTINGS_KEYS in ai.types
(`satisfies readonly (keyof AiProviderSettings)[]`), have the service consume
it, and keep the repo's own copy (it can't import AI types) guarded by a parity
test so any future drift fails in CI.
- Tests:
- ai.service.include-usage.spec: mocks @ai-sdk/openai-compatible and asserts the
factory is called with { includeUsage: true, baseURL, apiKey, fetch, name } —
`.provider` alone could not catch a dropped includeUsage (the token-usage
zeroing regression); also asserts the 'openai' style does NOT use it.
- ai-provider-settings-keys.spec: the allowlist parity check + DTO validation
for chatApiStyle (@IsIn accepts both values, rejects garbage, optional).
- CHANGELOG: [Unreleased] entries for the new "Protocol" / chatApiStyle setting
and the default provider change (openai -> openai-compatible). (#175, #177)
server + client tsc clean; 42 ai/settings specs green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rebuilt on develop (after #176) and reworked per review: instead of inferring the
provider from baseUrl (`if (baseUrl)`), the admin picks the chat provider
EXPLICITLY via a new `chatApiStyle` ('openai-compatible' | 'openai'), mirroring
the existing sttApiStyle. A custom baseURL can front real OpenAI too, so the
heuristic was fragile.
Why reasoning was missing: glm-5.2 (and DeepSeek etc.) stream their thinking as
`reasoning_content`, but the official @ai-sdk/openai provider does not map that
field. 'openai-compatible' uses @ai-sdk/openai-compatible, which does — so
reasoning parts now stream (verified live: reasoning-start/delta/end appear, and
disappear when set to 'openai').
- Default (unset) = 'openai-compatible', so existing openai+baseUrl workspaces
surface reasoning with no admin action. No DB migration (field lives in the
settings.ai.provider JSON blob).
- includeUsage: true on the openai-compatible model — without it the provider
omits streamed usage, zeroing the live token counter / reasoning-token
metadata. The official provider always sent it; this keeps parity. (Confirmed
live: usage.totalTokens present.)
- openai-compatible has no default endpoint, so with no baseURL (real OpenAI, or
a role's cross-driver override that cleared it) it falls back to the official
provider.
Plumbing: ai.types (ChatApiStyle / CHAT_API_STYLES + AiProviderSettings /
MaskedAiSettings), update DTO (@IsIn), ai-settings.service (resolve / getMasked /
update allowlist), workspace.repo updateAiProviderSettings ALLOWED (the second,
SQL-level allowlist the review missed — without it the field never persisted),
ai.service selector. Client: ai-settings-service types + a Protocol <Select> in
the chat section + i18n (en/ru). Scope is chat-only (embeddings don't stream
reasoning; STT already has sttApiStyle).
Tests: ai.service.spec — 4 cases (openai-compatible+baseURL, openai+baseURL,
default-unset, openai-compatible-without-baseURL fallback). Verified on the stand:
default streams reasoning + usage; 'openai' drops reasoning; the setting
round-trips. server + client tsc clean; 36 ai/settings specs green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>