Compare commits

..

5 Commits

Author SHA1 Message Date
claude_code
204cf9dfe7 test(sandbox): address PR #250 round-4 review — SSRF accept-path tests, MCP structuredContent (#243)
Mandatory (test-coverage):
- internal-file-urls.test: pin the SSRF/traversal ACCEPT path of
  resolveInternalFilePath (the sole guard for content-controlled `src`): an
  absolute/protocol-relative URL has its foreign host dropped and only an
  /api/files/ pathname survives (http://evil.com/api/files/x/y.png -> /files/x/y.png),
  while a host-dropped path that escapes /api/files/ (https://evil.com/api/auth/whoami)
  or a backslash-traversal (/api/files\..\auth\whoami) is rejected. Locks the
  behavior so a future prefix-only refactor cannot silently open a bypass.

Suggestions:
- index.ts: the stash_page MCP tool now returns structuredContent
  { uri, sha256, size, images } alongside the resource_link, so the MCP output
  matches the documented shape (clients get the blob's sha256/ETag and the
  mirror counts, not just the link). No outputSchema registered. Rebuilt build/.
- new stash-page-mcp-result.test: server round-trip via InMemoryTransport asserts
  both the resource_link and the structuredContent mirror.
- internal-file-urls.test: cover the new URL parse-failure catch branch
  (http://[ -> "Invalid internal file src").
- environment.service.spec: assert getPositiveIntEnv warns once per key and
  independently across keys (the invalidPositiveIntWarned dedup).

Tests: packages/mcp 383 pass; apps/server sandbox/environment/mcp 235 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 20:58:36 +03:00
claude_code
aff58646d1 refactor(sandbox): address PR #250 round-3 review — dead import, env validation, uuid validator, docs (#243)
Must-fix:
- mcp.module: drop the now-dead EnvironmentModule import (and its stale
  comment). McpService no longer injects EnvironmentService; EnvironmentModule
  is @Global and imported at the app root, so DI still resolves.

Stability:
- environment.service: route getSandboxTtlMs + the three SANDBOX_MAX_*_BYTES
  caps through a shared getPositiveIntEnv() helper that warns once per key and
  falls back to the default on a non-integer or <= 0 value (previously the byte
  caps did a bare parseInt, so SANDBOX_MAX_TOTAL_BYTES=0 made every stash_page
  fail against a 0-byte cap). TTL behavior is unchanged.

Simplification:
- sandbox.controller: replace the homemade UUID_RE with the project's shared
  `uuid` validator (import { validate as isValidUUID } from 'uuid'), matching
  the attachment routes; update the spec fixtures to valid v4 UUIDs.
- mcp.service: inline the single-caller one-liner buildSandboxConfig() to
  this.sandboxStore.asSink() at the wiring site.

Docs:
- CHANGELOG: add an [Unreleased] > Added entry for #243 (stash_page tool,
  anonymous GET /api/sb/:id, five SANDBOX_* env vars).
- AGENTS.md: note that GET /api/sb/:id is in the workspace-gate preHandler's
  excludedPaths and is fully tokenless, unlike /api/files/public/... which
  still resolves a workspace and needs an attachment JWT.

Tests: cap-getter validation (0/-5/abc -> default, valid -> parsed), updated
UUID fixtures. apps/server jest sandbox/environment/mcp: 233 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 20:21:31 +03:00
claude_code
8842bc8bf3 fix(sandbox): address PR #250 follow-up review — XSS hardening, eviction reconcile, doc sync (#243)
Security (must-fix):
- sandbox.controller: the anonymous GET /api/sb/:id response now sets
  X-Content-Type-Options: nosniff, a restrictive CSP, and Content-Disposition=
  attachment for any mime outside a raster-image allowlist (png/jpeg/gif/webp/
  avif). entry.mime is attacker-controlled, so an evil.svg/evil.html could
  otherwise execute script inline on the Docmost origin (stored XSS). Mirrors
  the public attachment route's hardening.

Stability:
- client.stashPage: reconcile mirrors AFTER the final document put, not only
  before it. The doc blob is the newest entry and FIFO eviction drops the
  oldest = this stash's own images, so the stored doc could reference an
  evicted blob (consumer 404) and over-report images.mirrored. A bounded loop
  now reverts doc-put-evicted mirrors, drops the stale doc blob, and re-puts
  until stable. Regenerated packages/mcp/build/.
- sandbox.controller: emit Cache-Control on the 304 branch too (ttlSeconds is
  computed before the conditional check).

Docs:
- Bump the MCP tool count 39 -> 40 across all READMEs and AGENTS.md (the
  registry now exposes exactly 40 tools).

Refactor:
- SandboxStore.asSink() centralizes the {put,has,evict} sink + uri<->id
  mapping; the embedded-MCP and in-app agent-tools wiring sites share it.

Tests:
- security headers (inline vs attachment, nosniff, CSP), 304 Cache-Control,
  putAndLink URL form, has()/remove(), asSink() round-trip, getSandboxPublicUrl
  (trailing-slash trim + APP_URL fallback), and a stash test where the doc put
  itself evicts a mirrored image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 19:08:06 +03:00
claude_code
6eb335d5e3 fix(sandbox): address PR #250 review — SSRF guard, eviction safety, cleanup (#243)
Security:
- stash_page: reject path-traversal / percent-encoded srcs before the authed
  loopback fetch (resolveInternalFilePath), closing an SSRF/exfiltration hole
  where a crafted node.attrs.src could read an arbitrary internal GET endpoint
  into the anonymous sandbox.

Stability:
- stash_page: revert + recount mirrors FIFO-evicted by a later put in the same
  stash (no dangling sandbox refs, honest images.mirrored/failed); free image
  blobs if the final document put throws.
- Reject/clamp non-positive SANDBOX_TTL_MS to the 1h default (warn once).
- Log mirror failures unconditionally (console.warn, no blob bodies).

Cleanup / architecture:
- Remove dead expiresAt from SandboxPutResult.
- Centralize the /api/sb route in SANDBOX_ROUTE_SEGMENT/SANDBOX_API_PATH and
  move URL composition into SandboxStore.putAndLink; drop the duplicated sink
  closures and the now-unused EnvironmentService injection from McpService and
  AiChatToolsService.
- Un-export isInternalFileUrl; document the process-local (instance-bound)
  sandbox limitation in the tool description and .env.example.

Docs/tests:
- README/README.ru: 38 -> 39 tools + stash_page entry.
- Add traversal/normalize/recursion unit tests, stash self-eviction +
  doc-put-throw + empty/octet-stream mock tests, controller If-None-Match
  (wildcard/weak/list) + Cache-Control tests, and SANDBOX_TTL_MS validation
  tests. Regenerate packages/mcp/build.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 18:02:46 +03:00
claude code agent 227
2fe4ca8537 feat(sandbox): in-RAM blob sandbox for out-of-band page transfer (#243)
Add an ephemeral, process-local blob store so the in-app agent (and the
embedded MCP) can hand a large page document and its images to an external
consumer WITHOUT routing the bytes through the model context or Docmost auth.

- SandboxStore (@Injectable singleton): Map<uuid,{buf,mime,sha256,expiresAt}>
  in RAM only. put() picks a per-blob cap by mime (image vs doc), enforces a
  total-bytes RAM guard with oldest-first eviction, and stamps a TTL; get()
  lazily expires. sha256 computed at put() doubles as the strong ETag. An
  unref'd sweep interval clears expired entries and is cleared on destroy.
- GET /api/sb/:uuid anonymous controller: serves raw bytes with Content-Type,
  Content-Length and ETag=sha256; 404 on missing/expired/non-UUID (anti-
  traversal), 304 on a matching If-None-Match. No tokens, no 401 — the
  capability is the unguessable UUID + short TTL + TLS. Auth-exempt the same
  way as /api/files/public (no JwtAuthGuard) plus an /api/sb entry in main.ts's
  workspace-resolution preHandler so a remote consumer with no workspace host
  is not rejected.
- stash_page tool in both layers (MCP resource_link + in-app {uri,size,sha256,
  images}). client.stashPage serializes the get_page_json shape, mirrors every
  INTERNAL file/image src (type-agnostic, covers drawio/excalidraw/video/file)
  into the sandbox under Docmost auth and rewrites src to the sandbox URL;
  external http(s) srcs are left untouched; dedup by src; a failed image fetch
  is counted, never aborts the doc.
- SANDBOX_PUBLIC_URL / SANDBOX_TTL_MS / SANDBOX_MAX_BYTES /
  SANDBOX_MAX_IMAGE_BYTES / SANDBOX_MAX_TOTAL_BYTES wired through the
  environment service + validation + .env.example.
- SandboxModule (@Global) provides the shared store to the controller,
  McpService and AiChatToolsService (same instance for put and get).

Tests: SandboxStore (round-trip, sha256, TTL lazy + sweep, caps, eviction),
SandboxController (200+ETag+CT+CL, 404 missing/expired/non-UUID, 304), and a
mock-HTTP stashPage test (mirror+rewrite internal, keep external, dedup, failed
image counted, returns only a link). Interoperates with the vvzvlad/habr-mcp
consumer's anonymous-GET + sha256-ETag + resource_link contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 15:13:11 +03:00
51 changed files with 2666 additions and 1503 deletions

View File

@@ -124,6 +124,26 @@ MCP_DOCMOST_PASSWORD=
# MCP_TOKEN=
# MCP_SESSION_IDLE_MS=1800000
#
# BLOB SANDBOX (stash_page). An in-RAM, process-local store that hands large page
# content + images to an external consumer WITHOUT bloating the model context or
# requiring Docmost auth. The stash_page tool serializes a page, mirrors its
# internal images into the store, and returns ONLY a short anonymous URL; the
# consumer fetches blobs via `GET /api/sb/<uuid>` (no token — the capability is
# the unguessable UUID + short TTL + TLS). Blobs are RAM-only and cleared on
# restart. ETag = the blob's sha256 (integrity check).
# SANDBOX_PUBLIC_URL is the base used to build those URLs; it MUST be reachable
# by the consumer (do NOT use a loopback address if the consumer is remote).
# Defaults to APP_URL when unset.
# NOTE: the store is process-local — blobs live only on the instance that
# created them. Behind a multi-replica load balancer WITHOUT sticky sessions a
# consumer may hit a different instance and get a 404 (indistinguishable from an
# expired blob). Single-host deployments are unaffected.
# SANDBOX_PUBLIC_URL=https://docs.example.com
# SANDBOX_TTL_MS=3600000
# SANDBOX_MAX_BYTES=8388608
# SANDBOX_MAX_IMAGE_BYTES=20971520
# SANDBOX_MAX_TOTAL_BYTES=134217728
#
# AI-AGENT ATTRIBUTION (comments/pages written via MCP are badged as "AI"):
# attribution is driven by a per-user `is_agent` flag on the users row. There is
# NO admin UI/API for it — set it out-of-band with SQL. Use a DEDICATED service

View File

@@ -241,7 +241,7 @@ Migration files live in `apps/server/src/database/migrations/` and are named `YY
- **API server** — `dist/main` (`apps/server/src/main.ts`), the Fastify HTTP app (`AppModule`).
- **Collaboration server** — `dist/collaboration/server/collab-main` (`pnpm collab`), a Hocuspocus/Yjs WebSocket server (`apps/server/src/collaboration/`) handling real-time document editing, persistence, and page-history snapshots. It listens on `COLLAB_PORT` (default `3001`), separate from the API server's `PORT` (default `3000`), and shares state with the API server through Redis.
The API server is a Fastify app with a global `/api` prefix (`main.ts` excludes `robots.txt`, public share pages, and `mcp` from the prefix). A `preHandler` hook enforces that a resolved `workspaceId` exists for most `/api` routes (multi-tenant by hostname/subdomain via `DomainMiddleware`). Auth is JWT (cookie + bearer); authorization is **CASL** (`core/casl`) — every data access is scoped to the user's abilities.
The API server is a Fastify app with a global `/api` prefix (`main.ts` excludes `robots.txt`, public share pages, and `mcp` from the prefix). A `preHandler` hook enforces that a resolved `workspaceId` exists for most `/api` routes (multi-tenant by hostname/subdomain via `DomainMiddleware`). `GET /api/sb/:id` (the anonymous blob-sandbox read route) is listed in that preHandler's `excludedPaths`, so it is exempt from workspace resolution and carries no session auth at all (its capability is the unguessable UUID + TTL + TLS) — unlike `/api/files/public/...`, which still resolves a workspace and requires a workspace-bound attachment JWT. Auth is JWT (cookie + bearer); authorization is **CASL** (`core/casl`) — every data access is scoped to the user's abilities.
### Module structure (server)
`AppModule` wires integration modules (`integrations/*`: storage [local/S3/Azure], mail, queue [BullMQ on Redis], security, telemetry, throttle, `mcp`, `ai`) plus `CoreModule`, `DatabaseModule`, and `CollaborationModule`. `CoreModule` (`core/*`) holds the domain modules: `page`, `space`, `comment`, `workspace`, `user`, `auth`, `group`, `attachment`, `search`, `share`, `ai-chat`, etc. Each domain module follows NestJS controller → service → repo layering; DB repos live under `database/repos` and are injected app-wide from the global `DatabaseModule`.
@@ -254,7 +254,7 @@ The API server is a Fastify app with a global `/api` prefix (`main.ts` excludes
- **Redis** backs caching, the BullMQ queues, the WebSocket Socket.IO adapter, and collaboration sync.
### The two AI subsystems (the main fork additions)
1. **Embedded MCP server** (`integrations/mcp/` + `packages/mcp`). The standalone `@docmost/mcp` server (39 agent-native tools: per-block patch/insert/delete by id, scripted `(doc)=>doc` transforms with dry-run diff, table editing, version diff/restore, comments, images, shares) is bundled and served over HTTP at `/mcp`. It writes through Docmost's real-time-collaboration layer so concurrent human edits aren't clobbered. Each request authenticates **per-user** via the `Authorization` header — either HTTP Basic (`base64(email:password)`, the user's own Docmost login, validated through `AuthService`) or a Bearer access JWT (the user's `authToken`) — and the session acts under that user's permissions. `MCP_DOCMOST_EMAIL` / `MCP_DOCMOST_PASSWORD` are an **optional service-account fallback**, used only when a request carries neither Basic nor Bearer credentials (back-compat for CI/scripts). An admin enables MCP with a workspace toggle (Workspace settings → AI). Optionally protected by a shared `MCP_TOKEN`: when set, every `/mcp` request must carry a matching `X-MCP-Token` header (its own header, separate from `Authorization`, which now carries the per-user Basic/Bearer credentials). Note: this changed from the older `Authorization: Bearer <MCP_TOKEN>` scheme — see `.env.example` and the CHANGELOG Breaking Changes entry.
1. **Embedded MCP server** (`integrations/mcp/` + `packages/mcp`). The standalone `@docmost/mcp` server (40 agent-native tools: per-block patch/insert/delete by id, scripted `(doc)=>doc` transforms with dry-run diff, table editing, version diff/restore, comments, images, shares) is bundled and served over HTTP at `/mcp`. It writes through Docmost's real-time-collaboration layer so concurrent human edits aren't clobbered. Each request authenticates **per-user** via the `Authorization` header — either HTTP Basic (`base64(email:password)`, the user's own Docmost login, validated through `AuthService`) or a Bearer access JWT (the user's `authToken`) — and the session acts under that user's permissions. `MCP_DOCMOST_EMAIL` / `MCP_DOCMOST_PASSWORD` are an **optional service-account fallback**, used only when a request carries neither Basic nor Bearer credentials (back-compat for CI/scripts). An admin enables MCP with a workspace toggle (Workspace settings → AI). Optionally protected by a shared `MCP_TOKEN`: when set, every `/mcp` request must carry a matching `X-MCP-Token` header (its own header, separate from `Authorization`, which now carries the per-user Basic/Bearer credentials). Note: this changed from the older `Authorization: Bearer <MCP_TOKEN>` scheme — see `.env.example` and the CHANGELOG Breaking Changes entry.
2. **AI agent chat** (`core/ai-chat/` server + `apps/client/src/features/ai-chat/` client). A built-in agent over the wiki using the Vercel **AI SDK** (`ai`, `@ai-sdk/*`) against any OpenAI-compatible provider configured per workspace (`integrations/ai/` — credentials encrypted at rest via `integrations/crypto`, stored in `ai_provider_credentials`). Key pieces:
- `core/ai-chat/tools/` — the agent's ~40 read+write tools. Every tool runs under the **calling user's** CASL permissions via a per-user loopback access token (`docmost-client.loader.ts`), so the agent can never exceed what the user could do. Only **reversible** operations are exposed (page history + trash; no permanent delete). Agent edits get an "AI agent" provenance badge in page history (`20260616T130000-agent-provenance` migration).
- `core/ai-chat/embedding/` — RAG indexer + a BullMQ consumer on `AI_QUEUE` that embeds pages into `page_embeddings` (vector search), complementing Postgres full-text search. Pages are (re)indexed on edit; `AI_EMBEDDING_TIMEOUT_MS` bounds a hung embeddings endpoint.

View File

@@ -58,6 +58,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
append/prepend fragments, nor to COMMENT bodies — a comment may legitimately
contain a standalone footnote definition, which canonicalization would drop.
(#228)
- **Out-of-band page transfer via an in-RAM blob sandbox (`stash_page`).** A
new MCP tool serializes a whole page (its full ProseMirror JSON, with every
internal image/file mirrored) into an ephemeral in-RAM blob and returns only
a short anonymous URL, so a large page can be handed to an external consumer
without flooding the model context. Blobs are served by unguessable UUID over
a new anonymous `GET /api/sb/:id` route (strong sha256 ETag, short TTL,
`nosniff` + restrictive CSP + attachment disposition for non-image mimes) and
are RAM-only, bound to the instance that created them. Tunable via five
`SANDBOX_*` env vars (see `.env.example`). (#243)
### Changed

View File

@@ -34,7 +34,7 @@ The goal of the fork is a **100% open, AGPL-only build with no Enterprise-Editio
| --- | --- |
| **EE code removed** | Stripped all client and server Enterprise-Edition code; ships as a clean community/AGPL build with no license checks. |
| **Comment resolution** | Re-implemented from scratch as a community feature (resolve / re-open with Open/Resolved tabs). No EE code reused, available to anyone who can comment. |
| **Embedded MCP server** | A community MCP server (`@docmost/mcp`, 39 tools) is served over HTTP at `/mcp` — no enterprise license required. Replaces the removed license-gated EE MCP. |
| **Embedded MCP server** | A community MCP server (`@docmost/mcp`, 40 tools) is served over HTTP at `/mcp` — no enterprise license required. Replaces the removed license-gated EE MCP. |
| **AI agent chat** | Built-in AI agent chat over your wiki, written from scratch as a community feature — no enterprise license. The agent reads and edits pages on your behalf (scoped to your permissions), with full-text + vector (RAG) search and optional web access via external MCP servers. |
| **Rebranding** | App logo / name changed from *Docmost* to *Gitmost*. |
| **Compact page tree** | Default page-tree indentation reduced from 16px to 8px per nesting level. |
@@ -44,7 +44,7 @@ The goal of the fork is a **100% open, AGPL-only build with no Enterprise-Editio
### Embedded MCP server
Gitmost has **our own MCP server** — [docmost-mcp](https://github.com/vvzvlad/docmost-mcp),
which we wrote — **built directly into the app** and served at `/mcp`. It exposes **39
which we wrote — **built directly into the app** and served at `/mcp`. It exposes **40
agent-native tools**: surgical per-block edits (patch / insert / delete by id),
structure-preserving find/replace, scripted `(doc) => doc` transforms with a dry-run diff,
structured table editing, version history with diff / restore, comments, images and share
@@ -60,7 +60,7 @@ every little fix. And it needs no enterprise license.
| | **Gitmost `/mcp` (our docmost-mcp)** | Docmost's built-in MCP |
| --- | :---: | :---: |
| **Enterprise license** | Not required | Required |
| **Tools** | 39, agent-native | Coarse (read Markdown, page CRUD, replace whole page) |
| **Tools** | 40, agent-native | Coarse (read Markdown, page CRUD, replace whole page) |
| **Per-block edits / find-replace / scripted transforms** | ✅ | — |
| **Structured table editing, version diff / restore** | ✅ | — |
| **Comments, images, share links** | ✅ | — |

View File

@@ -33,7 +33,7 @@
| --- | --- |
| **Удалён EE-код** | Вырезан весь код Enterprise-редакции на клиенте и сервере; это чистая community/AGPL-сборка без лицензионных проверок. |
| **Резолв комментариев** | Переписан с нуля как community-функция (резолв / переоткрытие с вкладками «Открытые» / «Решённые»). EE-код не используется, доступно любому, кто может комментировать. |
| **Встроенный MCP-сервер** | Community MCP-сервер (`@docmost/mcp`, 39 инструментов) отдаётся по HTTP на `/mcp` — без enterprise-лицензии. Заменяет удалённый лицензируемый EE MCP. |
| **Встроенный MCP-сервер** | Community MCP-сервер (`@docmost/mcp`, 40 инструментов) отдаётся по HTTP на `/mcp` — без enterprise-лицензии. Заменяет удалённый лицензируемый EE MCP. |
| **Чат с AI-агентом** | Встроенный чат с AI-агентом по содержимому вики, написанный с нуля как community-функция — без enterprise-лицензии. Агент читает и редактирует страницы от вашего имени (в рамках ваших прав), с полнотекстовым + векторным (RAG) поиском и опциональным доступом в интернет через внешние MCP-серверы. |
| **Ребрендинг** | Логотип / название приложения изменены с *Docmost* на *Gitmost*. |
| **Компактное дерево страниц** | Отступ дерева страниц по умолчанию уменьшен с 16px до 8px на уровень вложенности. |
@@ -44,7 +44,7 @@
В Gitmost есть **наш собственный MCP-сервер** — [docmost-mcp](https://github.com/vvzvlad/docmost-mcp),
который мы написали сами, — **встроенный прямо в приложение** и доступный на `/mcp`. Он даёт
**39 agent-native инструментов**: точечное редактирование по блокам (patch / insert / delete
**40 agent-native инструментов**: точечное редактирование по блокам (patch / insert / delete
по id), find/replace с сохранением структуры, скриптовые трансформации `(doc) => doc` с
предпросмотром диффа, структурное редактирование таблиц, история версий с диффом /
восстановлением, комментарии, изображения и ссылки на шаринг — всё применяется через слой
@@ -60,7 +60,7 @@ real-time-коллаборации Docmost, поэтому запись нико
| | **`/mcp` в Gitmost (наш docmost-mcp)** | Родной MCP у Docmost |
| --- | :---: | :---: |
| **Enterprise-лицензия** | Не нужна | Нужна |
| **Инструменты** | 39, agent-native | Примитивные (Markdown, CRUD страниц, замена целиком) |
| **Инструменты** | 40, agent-native | Примитивные (Markdown, CRUD страниц, замена целиком) |
| **Правки по блокам / find-replace / скриптовые трансформации** | ✅ | — |
| **Структурное редактирование таблиц, дифф / восстановление версий** | ✅ | — |
| **Комментарии, изображения, ссылки на шаринг** | ✅ | — |

View File

@@ -3,9 +3,6 @@ import {
resolveCardStatus,
isEndpointConfigured,
resolveKeyField,
nextReindexPollInterval,
isReindexComplete,
isReindexButtonLoading,
} from './ai-provider-settings';
describe('resolveCardStatus', () => {
@@ -74,152 +71,3 @@ describe('resolveKeyField (write-only key payload)', () => {
expect(resolveKeyField('', false)).toEqual({ set: false });
});
});
describe('nextReindexPollInterval', () => {
const INTERVAL = 5000;
const base = { now: 1_000, intervalMs: INTERVAL };
it('does not poll when no reindex deadline is set', () => {
expect(
nextReindexPollInterval({
...base,
deadline: null,
status: { reindexing: true, indexedPages: 0, totalPages: 478 },
}),
).toBe(false);
});
it('keeps polling while the server reports an active run', () => {
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: true, indexedPages: 120, totalPages: 478 },
}),
).toBe(INTERVAL);
});
it('keeps polling during an active run even if counts momentarily look full', () => {
// The run clears its progress record only at the very end, so a transient
// indexed==total while reindexing is still true must NOT stop polling.
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: true, indexedPages: 478, totalPages: 478 },
}),
).toBe(INTERVAL);
});
it('stops once the run is finished AND fully indexed', () => {
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: false, indexedPages: 478, totalPages: 478 },
}),
).toBe(false);
});
it('keeps polling within the deadline when not yet done and no active flag', () => {
// First poll right after enqueue, before the worker publishes progress.
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: false, indexedPages: 0, totalPages: 478 },
}),
).toBe(INTERVAL);
});
it('cap always wins: stops once past the deadline even if still reindexing', () => {
expect(
nextReindexPollInterval({
deadline: 1_000,
now: 2_000, // past the deadline
intervalMs: INTERVAL,
status: { reindexing: true, indexedPages: 200, totalPages: 478 },
}),
).toBe(false);
});
it('stops on an empty workspace (0 of 0) once the run is finished', () => {
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: false, indexedPages: 0, totalPages: 0 },
}),
).toBe(false);
});
});
describe('isReindexComplete', () => {
it('false when no status yet', () => {
expect(isReindexComplete(undefined)).toBe(false);
});
it('false while a run is still active (even at indexed==total)', () => {
expect(
isReindexComplete({ reindexing: true, indexedPages: 478, totalPages: 478 }),
).toBe(false);
});
it('false when finished but not yet fully indexed', () => {
expect(
isReindexComplete({ reindexing: false, indexedPages: 120, totalPages: 478 }),
).toBe(false);
});
it('true once finished and fully indexed', () => {
expect(
isReindexComplete({ reindexing: false, indexedPages: 478, totalPages: 478 }),
).toBe(true);
});
});
describe('isReindexButtonLoading', () => {
it('loads while the POST mutation is pending', () => {
expect(
isReindexButtonLoading({
mutationPending: true,
deadline: null,
status: false,
}),
).toBe(true);
});
it('does NOT load post-cap: deadline nulled but reindexing left stale-true', () => {
// The key case: after the poll cap fires `reindexDeadline` is null while
// `settings.reindexing` can be a stale `true` from the last poll. Gating on
// the deadline keeps the spinner from sticking forever so the admin can
// restart.
expect(
isReindexButtonLoading({
mutationPending: false,
deadline: null,
status: true,
}),
).toBe(false);
});
it('loads during an active run within the poll window', () => {
expect(
isReindexButtonLoading({
mutationPending: false,
deadline: 10_000,
status: true,
}),
).toBe(true);
});
it('does not load once the run finished while still polling', () => {
expect(
isReindexButtonLoading({
mutationPending: false,
deadline: 10_000,
status: false,
}),
).toBe(false);
});
});

View File

@@ -37,7 +37,6 @@ import {
} from "@/features/workspace/queries/ai-settings-query.ts";
import {
AiTestCapability,
IAiSettings,
IAiSettingsUpdate,
SttApiStyle,
ChatApiStyle,
@@ -170,73 +169,6 @@ export function resolveKeyField(
return { set: false };
}
// Subset of the status payload that drives the reindex poll decisions.
type ReindexStatus = Pick<
IAiSettings,
"reindexing" | "indexedPages" | "totalPages"
>;
/**
* Decide the TanStack Query `refetchInterval` while a reindex may be running.
* Returns the poll interval (ms) to keep polling, or `false` to stop.
*
* Polls while the server reports an ACTIVE run (`reindexing === true`) OR we are
* still within the deadline window and not yet fully indexed. Stops once the run
* has finished AND everything is indexed (server cleared its progress record and
* fell back to the DB coverage count), or the deadline cap is hit — the cap
* always wins so a stuck/never-clearing progress record can't poll forever.
*/
export function nextReindexPollInterval(args: {
deadline: number | null;
now: number;
intervalMs: number;
status?: ReindexStatus;
}): number | false {
const { deadline, now, intervalMs, status } = args;
if (deadline === null) return false;
// Cap always wins.
if (now > deadline) return false;
// Active run → keep polling even if the momentary counts already look full.
if (status?.reindexing) return intervalMs;
// Finished and fully indexed (incl. an empty workspace, 0 >= 0) → stop. Reuse
// isReindexComplete so the completeness check lives in exactly one place.
if (isReindexComplete(status)) return false;
// Within the deadline and not yet done → keep polling.
return intervalMs;
}
/**
* Whether the reindex poll deadline should be cleared: the server reports no
* active run AND the count is complete. The single source of truth for the
* "reindex finished" check — `nextReindexPollInterval` reuses it for its stop
* condition (sans the cap, which the effect handles via time).
*/
export function isReindexComplete(status?: ReindexStatus): boolean {
return (
!!status && !status.reindexing && status.indexedPages >= status.totalPages
);
}
/**
* Whether the reindex button should show its spinner (and stay disabled).
*
* Spins while the POST is in flight, and for the WHOLE background run while the
* server reports `reindexing === true`. The `deadline !== null` gate is the
* load-bearing part: once the 120s poll cap fires it nulls `reindexDeadline`
* and stops refetching, so `status` (settings?.reindexing) can be a stale
* `true` from the last poll. Without the gate the spinner would stick forever
* for a run that outlives the cap and block a restart; gating on the active
* poll window clears it so the admin can re-trigger.
*/
export function isReindexButtonLoading(args: {
mutationPending: boolean;
deadline: number | null;
status?: boolean;
}): boolean {
const { mutationPending, deadline, status } = args;
return mutationPending || (deadline !== null && status === true);
}
// Translate the dot's tooltip label. Kept in one place so all three endpoint
// cards share identical wording.
function cardStatusLabel(status: CardStatus, t: (k: string) => string): string {
@@ -283,34 +215,31 @@ export default function AiProviderSettings() {
// PRE-job counts immediately, so the only way the "Indexed X of Y" counter
// visibly climbs is to keep polling the settings query while the job runs.
// `reindexDeadline` is the timestamp until which we poll (set on reindex
// success). Polling tracks the server's `reindexing` flag: it keeps going for
// the whole active run and stops promptly once the server reports the run is
// finished. Bounded by the cap so a stuck/never-clearing progress record can
// never poll forever.
const REINDEX_POLL_INTERVAL = 5000; // ms between refetches while indexing
// success); polling stops early once indexed === total. Bounded so a stuck
// job can never poll forever.
const REINDEX_POLL_INTERVAL = 3000; // ms between refetches while indexing
const REINDEX_POLL_CAP_MS = 120000; // ~2 min hard cap
const [reindexDeadline, setReindexDeadline] = useState<number | null>(null);
// Only admins may read the (masked) AI settings; the server enforces this too.
const { data: settings, isLoading } = useAiSettingsQuery(isAdmin, (query) =>
nextReindexPollInterval({
deadline: reindexDeadline,
now: Date.now(),
intervalMs: REINDEX_POLL_INTERVAL,
status: query.state.data,
}),
);
const { data: settings, isLoading } = useAiSettingsQuery(isAdmin, (query) => {
if (reindexDeadline === null) return false;
// Past the cap → stop polling (cleared via the effect below too).
if (Date.now() > reindexDeadline) return false;
const data = query.state.data;
// Stop once everything is indexed; otherwise keep polling.
if (data && data.indexedPages >= data.totalPages) return false;
return REINDEX_POLL_INTERVAL;
});
// Stop polling once the run is finished or the cap is reached. Also clears on
// Stop polling once the work is done or the cap is reached. Also clears on
// unmount because the deadline state goes away with the component.
useEffect(() => {
if (reindexDeadline === null) return;
// "Done" matches the refetchInterval stop condition: the server reports no
// active run AND the count is complete (indexed >= total, incl. an empty
// workspace 0 >= 0), so the deadline clears promptly instead of waiting out
// the cap. While `reindexing` is still true we keep the deadline so polling
// continues for the whole run.
if (isReindexComplete(settings)) {
// "Done" matches the refetchInterval stop condition (indexed >= total),
// including an empty workspace (0 >= 0), so the deadline clears promptly
// instead of waiting out the cap.
if (settings && settings.indexedPages >= settings.totalPages) {
setReindexDeadline(null);
return;
}
@@ -1102,17 +1031,7 @@ export default function AiProviderSettings() {
<Button
variant="subtle"
size="compact-sm"
// Spin for the WHOLE run: the POST resolves immediately, but the
// background job keeps running, so also stay loading while the
// server reports `reindexing` (this also blocks a redundant
// re-trigger mid-run; the server de-dupes regardless). The
// deadline gate (and why it matters post-cap) lives in
// `isReindexButtonLoading`, which is unit-tested.
loading={isReindexButtonLoading({
mutationPending: reindexMutation.isPending,
deadline: reindexDeadline,
status: settings?.reindexing,
})}
loading={reindexMutation.isPending}
onClick={() =>
reindexMutation.mutate(undefined, {
// Begin bounded polling so the counter climbs as the async

View File

@@ -23,12 +23,8 @@ export function useAiSettingsQuery(
enabled: boolean = true,
// While reindexing runs as an async background job, the counter only climbs
// if the client keeps refetching. The component passes a refetchInterval
// function (`nextReindexPollInterval`) that keeps polling while the server
// reports an active run (reindexing === true) OR we are still within the
// bounded deadline and not yet fully indexed; it returns false to stop only
// once the run has finished AND indexed >= total, or the deadline cap is hit
// (the cap always wins). Note: a transient indexed === total during an active
// run does NOT stop polling. See AiProviderSettings.
// function that polls until indexed === total or a bounded deadline, then
// returns false to stop. See AiProviderSettings.
refetchInterval?:
| number
| false

View File

@@ -48,9 +48,6 @@ export interface IAiSettings {
// RAG indexing coverage (pages indexed for semantic search).
indexedPages: number;
totalPages: number;
// True while a full workspace reindex is actively running; the counts above
// then reflect the live run progress (done climbs 0 -> total).
reindexing?: boolean;
}
// Update payload. Key semantics (same for `apiKey` and `embeddingApiKey`):

View File

@@ -28,6 +28,7 @@ import { ClsModule } from 'nestjs-cls';
import { NoopAuditModule } from './integrations/audit/audit.module';
import { ThrottleModule } from './integrations/throttle/throttle.module';
import { McpModule } from './integrations/mcp/mcp.module';
import { SandboxModule } from './integrations/sandbox/sandbox.module';
import { AiModule } from './integrations/ai/ai.module';
import { AiChatModule } from './core/ai-chat/ai-chat.module';
@@ -89,6 +90,7 @@ try {
TelemetryModule,
ThrottleModule,
McpModule,
SandboxModule,
AiModule,
AiChatModule,
...enterpriseModules,

View File

@@ -3,8 +3,6 @@ import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { PageEmbeddingRepo } from '@docmost/db/repos/ai-chat/page-embedding.repo';
import { KyselyDB } from '@docmost/db/types/kysely.types';
import { AiService } from '../../../integrations/ai/ai.service';
import { EmbeddingReindexProgressService } from '../../../integrations/ai/embedding-reindex-progress.service';
import { AiEmbeddingNotConfiguredException } from '../../../integrations/ai/ai-embedding-not-configured.exception';
/**
* Unit tests for EmbeddingIndexerService.reindexWorkspace's batch control flow.
@@ -14,8 +12,7 @@ import { AiEmbeddingNotConfiguredException } from '../../../integrations/ai/ai-e
* reindexWorkspace actually touches:
* - aiService.getEmbeddingModel -> a model string so the up-front configured
* check passes,
* - pageRepo.getEmbeddablePageIds -> three page ids (the embeddable set the
* reindex iterates),
* - pageRepo.getIdsByWorkspace -> three page ids,
* - service.reindexPage -> spied per test to drive the per-page outcome.
*
* The point under test is the catch block: a FATAL provider error (auth/billing)
@@ -27,30 +24,21 @@ describe('EmbeddingIndexerService.reindexWorkspace fail-fast', () => {
function makeService() {
const pageRepo = {
getEmbeddablePageIds: jest.fn().mockResolvedValue(['p1', 'p2', 'p3']),
getIdsByWorkspace: jest.fn().mockResolvedValue(['p1', 'p2', 'p3']),
};
const pageEmbeddingRepo = {};
const aiService = {
getEmbeddingModel: jest.fn().mockResolvedValue('some-model'),
};
// Progress is a best-effort cosmetic store; mock its async methods so the
// batch control flow can be tested without Redis.
const reindexProgress = {
start: jest.fn().mockResolvedValue(undefined),
increment: jest.fn().mockResolvedValue(undefined),
clear: jest.fn().mockResolvedValue(undefined),
get: jest.fn().mockResolvedValue(null),
};
const db = {};
const service = new EmbeddingIndexerService(
pageRepo as unknown as PageRepo,
pageEmbeddingRepo as unknown as PageEmbeddingRepo,
aiService as unknown as AiService,
reindexProgress as unknown as EmbeddingReindexProgressService,
db as unknown as KyselyDB,
);
return { service, pageRepo, aiService, reindexProgress };
return { service, pageRepo, aiService };
}
it('aborts after the first page on a FATAL (401) provider error', async () => {
@@ -90,100 +78,3 @@ describe('EmbeddingIndexerService.reindexWorkspace fail-fast', () => {
expect(reindexPage).toHaveBeenCalledTimes(3);
});
});
/**
* Live reindex-progress reporting: reindexWorkspace must publish a per-workspace
* progress record (total at start, done incremented per processed page) and ALWAYS
* clear it in a finally — including on a fatal abort and an unconfigured early
* return — so the settings status can show the counter climb without ever getting
* stuck in a "reindexing" state.
*/
describe('EmbeddingIndexerService.reindexWorkspace progress', () => {
const WORKSPACE_ID = 'ws-1';
function makeService(pageIds: string[] = ['p1', 'p2', 'p3']) {
const pageRepo = {
getEmbeddablePageIds: jest.fn().mockResolvedValue(pageIds),
};
const pageEmbeddingRepo = {};
const aiService = {
getEmbeddingModel: jest.fn().mockResolvedValue('some-model'),
};
const reindexProgress = {
start: jest.fn().mockResolvedValue(undefined),
increment: jest.fn().mockResolvedValue(undefined),
clear: jest.fn().mockResolvedValue(undefined),
get: jest.fn().mockResolvedValue(null),
};
const db = {};
const service = new EmbeddingIndexerService(
pageRepo as unknown as PageRepo,
pageEmbeddingRepo as unknown as PageEmbeddingRepo,
aiService as unknown as AiService,
reindexProgress as unknown as EmbeddingReindexProgressService,
db as unknown as KyselyDB,
);
return { service, pageRepo, aiService, reindexProgress };
}
it('sets total at start, increments done per page, and clears in finally', async () => {
const { service, reindexProgress } = makeService(['p1', 'p2', 'p3']);
jest.spyOn(service, 'reindexPage').mockResolvedValue(undefined);
await service.reindexWorkspace(WORKSPACE_ID);
expect(reindexProgress.start).toHaveBeenCalledWith(WORKSPACE_ID, 3);
// One increment per processed page.
expect(reindexProgress.increment).toHaveBeenCalledTimes(3);
expect(reindexProgress.increment).toHaveBeenCalledWith(WORKSPACE_ID);
// Cleared exactly once on completion.
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
expect(reindexProgress.clear).toHaveBeenCalledWith(WORKSPACE_ID);
});
it('counts a handled (non-fatal) per-page failure as processed', async () => {
const { service, reindexProgress } = makeService(['p1', 'p2', 'p3']);
// No statusCode -> non-fatal -> isolate and continue; each counts as done.
jest.spyOn(service, 'reindexPage').mockRejectedValue(new Error('boom'));
await service.reindexWorkspace(WORKSPACE_ID);
expect(reindexProgress.increment).toHaveBeenCalledTimes(3);
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
});
it('clears progress in finally even when a FATAL provider error aborts the batch', async () => {
const { service, reindexProgress } = makeService(['p1', 'p2', 'p3']);
// A 401 aborts on the first page (re-thrown) — the finally must still clear.
jest
.spyOn(service, 'reindexPage')
.mockRejectedValue({ statusCode: 401, message: 'User not found' });
await expect(service.reindexWorkspace(WORKSPACE_ID)).rejects.toMatchObject({
statusCode: 401,
});
expect(reindexProgress.start).toHaveBeenCalledWith(WORKSPACE_ID, 3);
// Aborted page is NOT counted as processed.
expect(reindexProgress.increment).not.toHaveBeenCalled();
// But progress is still cleared so the run never gets stuck.
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
});
it('clears the enqueue-seeded progress on an unconfigured early return', async () => {
const { service, aiService, reindexProgress } = makeService();
// Embeddings not configured: reindexWorkspace returns early WITHOUT starting
// a fresh record, but the finally must still clear the enqueue-time seed.
aiService.getEmbeddingModel = jest
.fn()
.mockRejectedValue(new AiEmbeddingNotConfiguredException());
await expect(
service.reindexWorkspace(WORKSPACE_ID),
).resolves.toBeUndefined();
expect(reindexProgress.start).not.toHaveBeenCalled();
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
expect(reindexProgress.clear).toHaveBeenCalledWith(WORKSPACE_ID);
});
});

View File

@@ -9,7 +9,6 @@ import { KyselyDB } from '@docmost/db/types/kysely.types';
import { InjectKysely } from 'nestjs-kysely';
import { executeTx } from '@docmost/db/utils';
import { AiService } from '../../../integrations/ai/ai.service';
import { EmbeddingReindexProgressService } from '../../../integrations/ai/embedding-reindex-progress.service';
import { AiEmbeddingNotConfiguredException } from '../../../integrations/ai/ai-embedding-not-configured.exception';
import {
describeProviderError,
@@ -49,7 +48,6 @@ export class EmbeddingIndexerService {
private readonly pageRepo: PageRepo,
private readonly pageEmbeddingRepo: PageEmbeddingRepo,
private readonly aiService: AiService,
private readonly reindexProgress: EmbeddingReindexProgressService,
@InjectKysely() private readonly db: KyselyDB,
) {}
@@ -185,17 +183,9 @@ export class EmbeddingIndexerService {
}
/**
* (Re)build embeddings for the EMBEDDABLE page set of a workspace — the same
* set countEmbeddablePages counts (via getEmbeddablePageIds): non-deleted pages
* that have non-empty textContent OR already have a stored embedding row, NOT
* every non-deleted page. Iterating this set keeps the live `total` equal to
* the steady-state denominator, so the progress counter climbs 0 -> total and
* matches the before/after DB coverage exactly. Text-less pages are correctly
* skipped (reindexPage no-ops on them); a page that lost its text but still has
* stale embeddings stays in the set (the EXISTS clause) so it is visited and
* its stale rows are cleared. Used by the bulk reindex
* (WORKSPACE_CREATE_EMBEDDINGS, fired when AI Search is enabled and by the
* manual "Reindex now" action).
* (Re)build embeddings for EVERY non-deleted page in a workspace. Used by the
* bulk reindex (WORKSPACE_CREATE_EMBEDDINGS, fired when AI Search is enabled
* and by the manual "Reindex now" action).
*
* Resolves the embeddings model once up front: if the workspace has no
* embeddings provider configured, the whole batch is skipped (otherwise each
@@ -204,96 +194,69 @@ export class EmbeddingIndexerService {
* the batch.
*/
async reindexWorkspace(workspaceId: string): Promise<void> {
// The whole run is wrapped so the per-workspace progress record is ALWAYS
// cleared in the finally — on success, on a fatal-provider abort, on an
// unconfigured early-return, or on any unexpected throw — so a failed run
// never leaves a stuck "reindexing" state (the status then falls back to the
// steady-state DB coverage count). A placeholder record may already exist
// (seeded at enqueue time); the finally cleans that too.
try {
try {
await this.aiService.getEmbeddingModel(workspaceId);
} catch (err) {
if (err instanceof AiEmbeddingNotConfiguredException) {
this.logger.log(
`reindexWorkspace: embeddings not configured for workspace ${workspaceId}, skipping`,
);
return;
}
throw err;
}
// Iterate the EMBEDDABLE set (same predicate as countEmbeddablePages), NOT
// every non-deleted page: this makes `total` here equal the steady-state
// denominator, so the live counter climbs 0 -> total and matches the
// before/after DB count exactly (no 478 -> 500 -> 478 denominator jump).
// Text-less pages are correctly skipped — reindexPage no-ops on them, and
// a page that lost its text but still has stale embeddings IS in this set
// (the EXISTS clause) so it is still visited and its stale rows cleared.
const pageIds = await this.pageRepo.getEmbeddablePageIds(workspaceId);
const total = pageIds.length;
const startedAt = Date.now();
// Publish the live run progress over this same set (done reset to 0). The
// counter increments once per iterated page and reaches exactly `total`,
// which equals countEmbeddablePages — the steady-state denominator.
await this.reindexProgress.start(workspaceId, total);
this.logger.log(
`reindexWorkspace: starting reindex of ${total} page(s) for workspace ${workspaceId}`,
);
let failed = 0;
for (let i = 0; i < total; i++) {
const pageId = pageIds[i];
const position = i + 1;
// Log BEFORE the await: if the embedding call hangs, this is the last line
// in the log and it names the exact page that is stuck.
await this.aiService.getEmbeddingModel(workspaceId);
} catch (err) {
if (err instanceof AiEmbeddingNotConfiguredException) {
this.logger.log(
`reindexWorkspace: [${position}/${total}] indexing page ${pageId} (workspace ${workspaceId})`,
`reindexWorkspace: embeddings not configured for workspace ${workspaceId}, skipping`,
);
const pageStartedAt = Date.now();
try {
await this.reindexPage(pageId);
// Count this page as processed (matches the [position/total] log).
await this.reindexProgress.increment(workspaceId);
const elapsed = Date.now() - pageStartedAt;
if (elapsed >= SLOW_PAGE_MS) {
this.logger.warn(
`reindexWorkspace: [${position}/${total}] page ${pageId} took ${elapsed}ms`,
);
}
} catch (err) {
// A fatal provider error (invalid/missing key, no credits) recurs
// identically on EVERY remaining page. Abort the whole batch instead of
// issuing hundreds of doomed requests against the provider. Do NOT count
// it as processed — the run aborts here (the finally clears progress).
if (isFatalProviderError(err)) {
this.logger.error(
`reindexWorkspace: aborting at [${position}/${total}] for workspace ` +
`${workspaceId} — fatal provider error, remaining pages would fail ` +
`identically: ${describeProviderError(err)}`,
);
throw err;
}
// Per-page isolation: one non-fatal failure (incl. an embedding timeout)
// must not abort the whole batch. A handled failure still advances the
// counter (matches the [position/total] log, so done reaches total).
failed++;
await this.reindexProgress.increment(workspaceId);
this.logger.error(
`reindexWorkspace: [${position}/${total}] failed to reindex page ${pageId} ` +
`after ${Date.now() - pageStartedAt}ms: ${describeProviderError(err)}`,
return;
}
throw err;
}
const pageIds = await this.pageRepo.getIdsByWorkspace(workspaceId);
const total = pageIds.length;
const startedAt = Date.now();
this.logger.log(
`reindexWorkspace: starting reindex of ${total} page(s) for workspace ${workspaceId}`,
);
let failed = 0;
for (let i = 0; i < total; i++) {
const pageId = pageIds[i];
const position = i + 1;
// Log BEFORE the await: if the embedding call hangs, this is the last line
// in the log and it names the exact page that is stuck.
this.logger.log(
`reindexWorkspace: [${position}/${total}] indexing page ${pageId} (workspace ${workspaceId})`,
);
const pageStartedAt = Date.now();
try {
await this.reindexPage(pageId);
const elapsed = Date.now() - pageStartedAt;
if (elapsed >= SLOW_PAGE_MS) {
this.logger.warn(
`reindexWorkspace: [${position}/${total}] page ${pageId} took ${elapsed}ms`,
);
}
} catch (err) {
// A fatal provider error (invalid/missing key, no credits) recurs
// identically on EVERY remaining page. Abort the whole batch instead of
// issuing hundreds of doomed requests against the provider.
if (isFatalProviderError(err)) {
this.logger.error(
`reindexWorkspace: aborting at [${position}/${total}] for workspace ` +
`${workspaceId} — fatal provider error, remaining pages would fail ` +
`identically: ${describeProviderError(err)}`,
);
throw err;
}
// Per-page isolation: one non-fatal failure (incl. an embedding timeout)
// must not abort the whole batch.
failed++;
this.logger.error(
`reindexWorkspace: [${position}/${total}] failed to reindex page ${pageId} ` +
`after ${Date.now() - pageStartedAt}ms: ${describeProviderError(err)}`,
);
}
this.logger.log(
`reindexWorkspace: done for workspace ${workspaceId}: ` +
`${total - failed}/${total} indexed, ${failed} failed in ${Date.now() - startedAt}ms`,
);
} finally {
// Always remove the progress record so the status reverts to the DB count.
await this.reindexProgress.clear(workspaceId);
}
this.logger.log(
`reindexWorkspace: done for workspace ${workspaceId}: ` +
`${total - failed}/${total} indexed, ${failed} failed in ${Date.now() - startedAt}ms`,
);
}
/** Purge ALL embeddings for a workspace (WORKSPACE_DELETE_EMBEDDINGS). */

View File

@@ -63,6 +63,9 @@ describe('AiChatToolsService deletePage guardrail (H4)', () => {
{} as never,
{} as never,
{} as never,
// sandboxStore (only used by the stash tool closure, which these tests do
// not execute).
{} as never,
);
});
@@ -175,6 +178,9 @@ describe('AiChatToolsService expanded toolset guardrails', () => {
{} as never,
{} as never,
{} as never,
// sandboxStore (only used by the stash tool closure, which these tests do
// not execute).
{} as never,
);
});
@@ -290,6 +296,9 @@ describe('AiChatToolsService node-arg JSON-string coercion', () => {
{} as never,
{} as never,
{} as never,
// sandboxStore (only used by the stash tool closure, which these tests do
// not execute).
{} as never,
);
});
@@ -440,6 +449,9 @@ describe('AiChatToolsService model-friendly input validation (#190)', () => {
{} as never,
{} as never,
{} as never,
// sandboxStore (only used by the stash tool closure, which these tests do
// not execute).
{} as never,
);
});

View File

@@ -16,6 +16,7 @@ import {
import { resolveCurrentPageResult } from './current-page.util';
import { parseNodeArg } from './parse-node-arg';
import { modelFriendlyInput } from './model-friendly-input';
import { SandboxStore } from '../../../integrations/sandbox/sandbox.store';
/**
* Per-user, per-request adapter that exposes Docmost READ operations to the
@@ -41,6 +42,8 @@ export class AiChatToolsService {
private readonly pageEmbeddingRepo: PageEmbeddingRepo,
private readonly spaceMemberRepo: SpaceMemberRepo,
private readonly pagePermissionRepo: PagePermissionRepo,
// Shared singleton in-RAM blob store backing the stash tool.
private readonly sandboxStore: SandboxStore,
) {}
async forUser(
@@ -86,11 +89,17 @@ export class AiChatToolsService {
aiChatId,
});
// Bind the stash tool to the shared in-RAM SandboxStore. The store owns the
// anonymous-URL composition (putAndLink) and the live/evict probes the MCP
// package needs to keep its mirror counts honest under FIFO eviction (the
// package never touches env or the store). asSink() centralizes the uri↔id
// mapping next to putAndLink, shared with the embedded-MCP wiring site.
const { DocmostClient, sharedToolSpecs } = await loadDocmostMcp();
const client: DocmostClientLike = new DocmostClient({
apiUrl,
getToken,
getCollabToken,
sandbox: this.sandboxStore.asSink(),
});
// Build an ai-SDK tool from a shared, zod-agnostic spec. The spec owns the
@@ -625,6 +634,14 @@ export class AiChatToolsService {
async ({ pageId, edits }) => await client.editPageText(pageId, edits),
),
// Returns ONLY the short link object — never the document body — so a
// large page can be handed to an external consumer without bloating
// context.
stashPage: sharedTool(
sharedToolSpecs.stashPage,
async ({ pageId }) => await client.stashPage(pageId),
),
patchNode: tool({
description:
'Replace a single content block (by id) with a new ProseMirror ' +

View File

@@ -154,6 +154,14 @@ export interface DocmostClientLike {
commentId: string,
resolved: boolean,
): Promise<Record<string, unknown>>;
// Serialize a page + mirror its internal images into the blob sandbox; returns
// ONLY a short anonymous URL (the body never enters the model context).
stashPage(pageId: string): Promise<{
uri: string;
sha256: string;
size: number;
images: { mirrored: number; failed: number };
}>;
}
export type DocmostClientConfig = {
@@ -161,6 +169,18 @@ export type DocmostClientConfig = {
getToken: () => Promise<string>;
// Provenance collab-token provider for content mutations (signed agent claim).
getCollabToken?: () => Promise<string>;
// Optional blob-sandbox sink for the stash tool. `put` stores a blob in the
// host's in-RAM SandboxStore and returns the anonymous read URL + integrity.
// The optional `has`/`evict` probes let stashPage keep its mirror counts
// honest under the store's FIFO eviction (mirror of the package's sink type).
sandbox?: {
put: (
buf: Buffer,
mime: string,
) => { uri: string; sha256: string; size: number };
has?: (uri: string) => boolean;
evict?: (uri: string) => void;
};
};
export interface DocmostClientCtor {

View File

@@ -1,167 +0,0 @@
import { PageRepo } from './page.repo';
import {
DummyDriver,
Kysely,
PostgresAdapter,
PostgresIntrospector,
PostgresQueryCompiler,
} from 'kysely';
/**
* F6 regression guard for the embeddable-page predicate.
*
* The predicate is shared by `countEmbeddablePages` (the "Indexed N of M" coverage
* denominator) and `getEmbeddablePageIds` (the exact set a full reindex iterates).
* It MUST select pages whose `text_content` was never backfilled (null/empty) but
* whose ProseMirror `content` JSON still carries body text — `reindexPage` builds
* its chunks straight from `content`, so without a content clause such a page is
* silently SKIPPED by a mass reindex even though it is fully embeddable.
*
* The content clause keys on the structural text-node marker `"type":"text"`, NOT
* a bare `"text":` key. The bare key also appears as the `attrs.text` of atom
* nodes that carry NO extractable text — notably math (`mathBlock`/`mathInline`),
* whose LaTeX lives in `attrs.text` and has no `generateText` serializer. A
* math-ONLY page therefore yields empty `text_content` and zero embeddings; if the
* predicate matched its `attrs.text` it would land in the denominator but
* `reindexPage` would no-op on it, pinning "Indexed N of M" below 100% forever —
* the exact bug this feature fixes. The `"type":"text"` marker matches only real
* text nodes (what `jsonToText` extracts), keeping the predicate consistent with
* what gets indexed.
*
* There is no real Postgres here: a recording Kysely (DummyDriver wired to the
* Postgres query compiler) compiles the queries to SQL so we can assert the WHERE
* predicate ORs in the narrowed content clause alongside the existing text_content
* and stored-embeddings clauses — and that BOTH callers compile the identical
* clause (denominator and reindex set can never diverge).
*/
function makeRecordingDb() {
const sqls: string[] = [];
const db = new Kysely<any>({
dialect: {
createAdapter: () => new PostgresAdapter(),
createDriver: () =>
new (class extends DummyDriver {
async acquireConnection() {
return {
executeQuery: async (compiled: { sql: string }) => {
sqls.push(compiled.sql);
return { rows: [] };
},
// eslint-disable-next-line @typescript-eslint/no-empty-function
streamQuery: async function* () {},
} as any;
}
})(),
createIntrospector: (d: Kysely<any>) => new PostgresIntrospector(d),
createQueryCompiler: () => new PostgresQueryCompiler(),
},
});
return { db, sqls };
}
// The narrowed content clause, as it appears in the compiled SQL. Keying on the
// structural `"type":"text"` marker (not a bare `"text":` key) is what excludes
// math-only pages whose only `"text"` key is the atom node's `attrs.text`.
const NARROWED_CLAUSE = `"type"[[:space:]]*:[[:space:]]*"text"`;
const BARE_TEXT_KEY = `"text"[[:space:]]*:`;
describe('PageRepo embeddable predicate — content-bearing pages (F6)', () => {
it('selects content-bearing pages via the narrowed "type":"text" node marker', async () => {
const { db, sqls } = makeRecordingDb();
const repo = new PageRepo(db as any, {} as any, { emit: jest.fn() } as any);
await repo.getEmbeddablePageIds('ws-1');
expect(sqls).toHaveLength(1);
const sql = sqls[0];
// Clause 1 (existing): pages with extractable text_content.
expect(sql).toContain('text_content');
// Clause 3 (the F6 fix, now narrowed): a page whose content JSON carries a
// real text node is selected even when text_content is null/empty, so a full
// reindex visits it instead of silently skipping it.
expect(sql).toContain('content::text');
expect(sql).toContain(NARROWED_CLAUSE);
// It must NOT use the old bare `"text":` key, which also matches the
// `attrs.text` of math-only atom pages (false-positive denominator inflation).
expect(sql).not.toContain(BARE_TEXT_KEY);
// Clause 2 (existing): pages that already have stored embeddings stay in the
// set so a reindex can clear their stale rows.
expect(sql.toLowerCase()).toContain('embeddings');
});
it('countEmbeddablePages compiles the SAME narrowed clause as getEmbeddablePageIds', async () => {
// Consistency is the core requirement: the denominator (countEmbeddablePages)
// and the reindex set (getEmbeddablePageIds) MUST share the identical
// predicate, else the live "done" counter and the steady-state total diverge.
const { db, sqls } = makeRecordingDb();
const repo = new PageRepo(db as any, {} as any, { emit: jest.fn() } as any);
await repo.countEmbeddablePages('ws-1');
await repo.getEmbeddablePageIds('ws-1');
expect(sqls).toHaveLength(2);
const [countSql, idsSql] = sqls;
// Both carry the narrowed content clause...
expect(countSql).toContain(NARROWED_CLAUSE);
expect(idsSql).toContain(NARROWED_CLAUSE);
// ...neither carries the bare key...
expect(countSql).not.toContain(BARE_TEXT_KEY);
expect(idsSql).not.toContain(BARE_TEXT_KEY);
// ...and the full OR predicate (text_content + content node + embeddings
// EXISTS) is byte-identical between the two queries, so they can't drift.
const where = (s: string) => s.slice(s.indexOf('where'));
expect(where(countSql)).toEqual(where(idsSql));
});
it('the content regex matches a text-bearing doc but NOT a math-only doc', () => {
// Semantic check of the predicate against sample `content::text` payloads.
// Note: `jsonb::text` is NOT identical to JSON.stringify — Postgres renders a
// space after each colon (`"type": "text"`), which is exactly why the POSIX
// clause uses `[[:space:]]*`. The clause `"type"[[:space:]]*:[[:space:]]*"text"`
// maps to the JS regex below (`[[:space:]]` -> `\s`, tolerating both forms);
// we evaluate it the way Postgres would.
const re = /"type"\s*:\s*"text"/;
// A real paragraph with a text node -> embeddable.
const textDoc = JSON.stringify({
type: 'doc',
content: [
{
type: 'paragraph',
content: [{ type: 'text', text: 'hello world' }],
},
],
});
// A doc whose ONLY node is a math atom. Its LaTeX is in `attrs.text`, there is
// no text node, and `jsonToText`/`generateText` has no serializer for it -> it
// yields empty text_content and zero embeddings, so it must NOT qualify.
const mathOnlyDoc = JSON.stringify({
type: 'doc',
content: [
{ type: 'mathBlock', attrs: { text: 'E = mc^2' } },
{ type: 'mathInline', attrs: { text: '\\alpha' } },
],
});
// An empty doc has no text node either.
const emptyDoc = JSON.stringify({ type: 'doc', content: [] });
expect(re.test(textDoc)).toBe(true);
expect(re.test(mathOnlyDoc)).toBe(false);
expect(re.test(emptyDoc)).toBe(false);
// Sanity: the OLD bare-key regex WOULD have wrongly matched the math-only doc,
// which is precisely the false positive the narrowing removes.
expect(/"text"\s*:/.test(mathOnlyDoc)).toBe(true);
// A user literally TYPING `"type":"text"` in prose can't false-positive on an
// otherwise text-less page: in `content::text` the typed value's quotes are
// escaped (`\"type\":\"text\"`), so the literal-quote regex does not match the
// escaped form. (And such a page is a genuine text node anyway.)
const escapedLiteral = JSON.stringify({
type: 'doc',
content: [{ type: 'someAtom', attrs: { note: '"type":"text"' } }],
});
expect(re.test(escapedLiteral)).toBe(false);
});
});

View File

@@ -12,7 +12,6 @@ import { executeWithCursorPagination } from '@docmost/db/pagination/cursor-pagin
import { validate as isValidUUID } from 'uuid';
import { ExpressionBuilder, sql } from 'kysely';
import { DB } from '@docmost/db/types/db';
import { DbInterface } from '@docmost/db/types/db.interface';
import { jsonArrayFrom, jsonObjectFrom } from 'kysely/helpers/postgres';
import { SpaceMemberRepo } from '@docmost/db/repos/space/space-member.repo';
import { EventEmitter2 } from '@nestjs/event-emitter';
@@ -234,9 +233,9 @@ export class PageRepo {
* text-less pages (which legitimately store zero embeddings) don't keep the
* bar below 100% forever.
*
* A page qualifies if it has non-empty textContent, OR its content JSON has at
* least one text node (`"type":"text"`) when textContent was never backfilled,
* OR it already has stored embeddings. The last clause guarantees this total is
* A page qualifies if it has non-empty textContent OR already has stored
* embeddings. The second clause covers pages whose text the indexer extracted
* from the content JSON when textContent was null, and guarantees this total is
* always >= countIndexedPages (the indexed count can never exceed it).
*/
async countEmbeddablePages(workspaceId: string): Promise<number> {
@@ -244,91 +243,37 @@ export class PageRepo {
.selectFrom('pages as p')
.where('p.workspaceId', '=', workspaceId)
.where('p.deletedAt', 'is', null)
.where((eb) => this.embeddablePredicate(eb))
.where((eb) =>
eb.or([
// Has extractable body text. The regex matches any non-whitespace
// character, mirroring the indexer's `text.trim().length === 0` check
// (raw SQL -> use the snake_case column name).
sql<boolean>`p.text_content ~ '[^[:space:]]'`,
// OR already has at least one (non-deleted) embedding row.
eb.exists(
eb
.selectFrom('pageEmbeddings as pe')
.select(sql`1`.as('one'))
.whereRef('pe.pageId', '=', 'p.id')
.where('pe.deletedAt', 'is', null),
),
]),
)
.select((eb) => eb.fn.countAll().as('count'))
.executeTakeFirst();
return Number(row?.count ?? 0);
}
/**
* The "embeddable content" qualifying predicate, shared verbatim by
* countEmbeddablePages (the steady-state denominator) and getEmbeddablePageIds
* (the set the bulk reindex iterates). Both MUST use the exact same condition
* or the live total and steady-state total diverge — extracting it here is what
* guarantees that, replacing the previous hand-duplicated copy. Callers supply
* the trivial workspaceId/deletedAt filters inline; this returns only the
* non-trivial OR clause, evaluated against the `p` alias of `pages`.
*
* A page qualifies if it has non-empty textContent, OR its ProseMirror
* `content` JSON has at least one text node (`"type":"text"`) even though
* textContent was never backfilled, OR it already has a stored (non-deleted)
* embedding row.
* IDs of all non-deleted pages in a workspace. Used by the RAG bulk reindex to
* (re)build embeddings for every existing page.
*/
private embeddablePredicate(
eb: ExpressionBuilder<DbInterface & { p: DbInterface['pages'] }, 'p'>,
) {
return eb.or([
// Has extractable body text. The regex matches any non-whitespace
// character, mirroring the indexer's `text.trim().length === 0` check
// (raw SQL -> use the snake_case column name).
sql<boolean>`p.text_content ~ '[^[:space:]]'`,
// OR the ProseMirror `content` JSON has at least one text node (`"type":
// "text"`) the indexer can extract, even when `text_content` is null/empty
// (never backfilled): `reindexPage` runs `jsonToText` (generateText) over
// `content`, which only emits the text of ProseMirror text nodes, so such a
// page IS embeddable and a full reindex MUST visit it (otherwise it is
// silently skipped). A text node always serialises as
// `{"type":"text","text":"..."}`, so we key on the structural `"type":
// "text"` marker — NOT a bare `"text":` key, which also appears as the
// `attrs.text` of atom nodes that carry NO extractable text (e.g. math
// `mathBlock`/`mathInline`, whose LaTeX lives in `attrs.text` and has no
// text serializer). A math-only page thus produces empty `text_content` and
// zero embeddings; matching its `attrs.text` here would wrongly inflate the
// denominator and keep "Indexed N of M" below 100% forever. An empty doc
// (no text nodes) has no `"type":"text"` and is correctly excluded. A user
// who literally types `"type":"text"` in their prose can't false-positive:
// in `content::text` that text value's quotes are escaped (`\"type\"...`),
// so the literal-quote regex won't match the escaped form (and such a page
// is a real text node anyway).
sql<boolean>`p.content::text ~ '"type"[[:space:]]*:[[:space:]]*"text"'`,
// OR already has at least one (non-deleted) embedding row.
eb.exists(
eb
.selectFrom('pageEmbeddings as pe')
.select(sql`1`.as('one'))
.whereRef('pe.pageId', '=', 'p.id')
.where('pe.deletedAt', 'is', null),
),
]);
}
/**
* IDs of the EMBEDDABLE page set for a workspace — the exact same set that
* `countEmbeddablePages` counts (a page qualifies if it has non-empty
* textContent, OR content JSON with at least one text node (`"type":"text"`)
* and an empty/null textContent, OR already has a stored embedding row). The
* bulk reindex
* iterates THIS set so the live "done" counter reaches exactly
* `countEmbeddablePages` (the steady-state denominator), instead of iterating
* every non-deleted page (which would push the denominator above the
* steady-state value mid-run).
*
* IMPORTANT: the qualifying WHERE is shared with `countEmbeddablePages` via the
* private `embeddablePredicate` helper, so the two can no longer drift — if the
* embeddable definition changes, change it once there and both stay in lockstep
* (else the live total and steady-state total diverge again). Dropping
* text-less pages is correct: `reindexPage` no-ops on
* a page with no extractable content anyway, and a page that lost its text but
* still has stale embeddings IS in this set (the EXISTS clause), so it is still
* visited and its stale rows are cleared.
*/
async getEmbeddablePageIds(workspaceId: string): Promise<string[]> {
async getIdsByWorkspace(workspaceId: string): Promise<string[]> {
const rows = await this.db
.selectFrom('pages as p')
.select('p.id')
.where('p.workspaceId', '=', workspaceId)
.where('p.deletedAt', 'is', null)
.where((eb) => this.embeddablePredicate(eb))
.selectFrom('pages')
.select('id')
.where('workspaceId', '=', workspaceId)
.where('deletedAt', 'is', null)
.execute();
return rows.map((r) => r.id);
}

View File

@@ -1,12 +1,4 @@
import { AiSettingsService, parsePositiveInt } from './ai-settings.service';
import { WorkspaceRepo } from '@docmost/db/repos/workspace/workspace.repo';
import { AiAgentRoleRepo } from '@docmost/db/repos/ai-agent-roles/ai-agent-roles.repo';
import { AiProviderCredentialsRepo } from '@docmost/db/repos/ai-chat/ai-provider-credentials.repo';
import { PageEmbeddingRepo } from '@docmost/db/repos/ai-chat/page-embedding.repo';
import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { SecretBoxService } from '../crypto/secret-box';
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
import type { Queue } from 'bullmq';
import { parsePositiveInt } from './ai-settings.service';
/**
* Round-trip coercion for numeric `::text` provider settings (e.g.
@@ -49,194 +41,3 @@ describe('parsePositiveInt', () => {
expect(parsePositiveInt(42)).toBe(42);
});
});
/**
* getMasked must surface the LIVE reindex run progress while a reindex is active
* (so the "Indexed X of Y" counter can climb 0 -> total), and fall back to the
* steady-state DB coverage count (countIndexedPages / countEmbeddablePages) when
* no reindex is running. This is the server side of the fix for the counter that
* otherwise stays stuck at "478 of 478" the whole reindex.
*/
describe('AiSettingsService.getMasked reindex progress', () => {
const WORKSPACE_ID = 'ws-1';
function makeService() {
// No driver configured -> the credentials lookup is skipped, keeping the
// setup minimal; we only care about the indexed/total numbers here.
const workspaceRepo = {
findById: jest.fn().mockResolvedValue({ settings: {} }),
};
const aiAgentRoleRepo = {};
const aiProviderCredentialsRepo = { find: jest.fn() };
const pageEmbeddingRepo = {
countIndexedPages: jest.fn().mockResolvedValue(478),
};
const pageRepo = {
countEmbeddablePages: jest.fn().mockResolvedValue(478),
};
const secretBox = {};
const reindexProgress = {
get: jest.fn().mockResolvedValue(null),
};
const aiQueue = {};
const service = new AiSettingsService(
workspaceRepo as unknown as WorkspaceRepo,
aiAgentRoleRepo as unknown as AiAgentRoleRepo,
aiProviderCredentialsRepo as unknown as AiProviderCredentialsRepo,
pageEmbeddingRepo as unknown as PageEmbeddingRepo,
pageRepo as unknown as PageRepo,
secretBox as unknown as SecretBoxService,
reindexProgress as unknown as EmbeddingReindexProgressService,
aiQueue as unknown as Queue,
);
return { service, reindexProgress, pageEmbeddingRepo };
}
it('reports the live run numbers when a reindex progress record is active', async () => {
const { service, reindexProgress } = makeService();
// Use a progress.total (500) DISTINCT from the DB count (478) so the test
// actually pins the progress.total branch rather than coincidentally
// matching the DB fallback. With fix #1 the two sources agree in practice,
// but getMasked must still return progress.total when a record is active.
reindexProgress.get.mockResolvedValue({
total: 500,
done: 120,
startedAt: Date.now(),
});
const masked = await service.getMasked(WORKSPACE_ID);
expect(masked.indexedPages).toBe(120); // progress.done, not DB 478
expect(masked.totalPages).toBe(500); // progress.total, not DB 478
expect(masked.reindexing).toBe(true);
});
it('falls back to countIndexedPages when no reindex is active', async () => {
const { service, reindexProgress } = makeService();
reindexProgress.get.mockResolvedValue(null);
const masked = await service.getMasked(WORKSPACE_ID);
expect(masked.indexedPages).toBe(478);
expect(masked.totalPages).toBe(478);
expect(masked.reindexing).toBe(false);
});
});
/**
* reindex() must seed a live progress record (done=0) BEFORE enqueueing so the
* first status poll shows 0 — but ONLY when no run is already active, since
* aiQueue.add() de-duplicates a running reindex and a re-seed would reset the
* visible counter to 0 while the live worker keeps incrementing from its real
* position.
*/
describe('AiSettingsService.reindex progress seed', () => {
const WORKSPACE_ID = 'ws-1';
function makeService() {
const order: string[] = [];
const aiQueue = {
remove: jest.fn().mockResolvedValue(undefined),
add: jest.fn().mockImplementation(async () => {
order.push('add');
}),
};
const pageRepo = {
countEmbeddablePages: jest.fn().mockResolvedValue(478),
};
const reindexProgress = {
// Default: no active run -> seed should happen.
get: jest.fn().mockResolvedValue(null),
start: jest.fn().mockImplementation(async () => {
order.push('start');
}),
clear: jest.fn().mockResolvedValue(undefined),
};
const service = new AiSettingsService(
{} as unknown as WorkspaceRepo,
{} as unknown as AiAgentRoleRepo,
{} as unknown as AiProviderCredentialsRepo,
{} as unknown as PageEmbeddingRepo,
pageRepo as unknown as PageRepo,
{} as unknown as SecretBoxService,
reindexProgress as unknown as EmbeddingReindexProgressService,
aiQueue as unknown as Queue,
);
return { service, aiQueue, pageRepo, reindexProgress, order };
}
it('seeds progress (workspace, count) BEFORE enqueue when no run is active', async () => {
const { service, aiQueue, reindexProgress, order } = makeService();
await service.reindex(WORKSPACE_ID);
// The pre-seed carries the real page count AND a SHORT ttl (3rd arg) so a
// de-duplicated enqueue against a just-finishing job can't leave a phantom
// "reindexing: 0 of N" stuck for the full record TTL (F10).
expect(reindexProgress.start).toHaveBeenCalledWith(
WORKSPACE_ID,
478,
expect.any(Number),
);
const ttl = reindexProgress.start.mock.calls[0][2];
expect(ttl).toBeGreaterThan(0);
expect(ttl).toBeLessThanOrEqual(60); // short, not the full 1h record TTL
expect(aiQueue.add).toHaveBeenCalledTimes(1);
// Seed must precede the enqueue so the first poll already reports done=0.
expect(order).toEqual(['start', 'add']);
});
it('does NOT re-seed when a run is already active (mid-run re-trigger)', async () => {
const { service, aiQueue, reindexProgress } = makeService();
// An active record exists -> a second click must not reset the counter.
reindexProgress.get.mockResolvedValue({
total: 478,
done: 120,
startedAt: Date.now(),
});
await service.reindex(WORKSPACE_ID);
expect(reindexProgress.start).not.toHaveBeenCalled();
// The enqueue still runs (and de-duplicates against the active job).
expect(aiQueue.add).toHaveBeenCalledTimes(1);
});
it('clears the seed it just wrote and re-throws when enqueue fails', async () => {
const { service, aiQueue, reindexProgress } = makeService();
// This call seeds (get() is null) but the enqueue then blows up
// (Redis hiccup/shutdown) -> the worker never runs and never clear()s, so
// reindex() must roll back its own seed to avoid a 1h stuck "reindexing".
const boom = new Error('redis down');
aiQueue.add.mockRejectedValue(boom);
await expect(service.reindex(WORKSPACE_ID)).rejects.toBe(boom);
expect(reindexProgress.start).toHaveBeenCalledWith(
WORKSPACE_ID,
478,
expect.any(Number),
);
expect(reindexProgress.clear).toHaveBeenCalledWith(WORKSPACE_ID);
});
it('does NOT clear a concurrent active run when enqueue fails (no seed)', async () => {
const { service, aiQueue, reindexProgress } = makeService();
// A run is already active, so THIS call does not seed; if the enqueue then
// fails it must NOT wipe the live worker's record.
reindexProgress.get.mockResolvedValue({
total: 478,
done: 120,
startedAt: Date.now(),
});
const boom = new Error('redis down');
aiQueue.add.mockRejectedValue(boom);
await expect(service.reindex(WORKSPACE_ID)).rejects.toBe(boom);
expect(reindexProgress.start).not.toHaveBeenCalled();
expect(reindexProgress.clear).not.toHaveBeenCalled();
});
});

View File

@@ -8,7 +8,6 @@ import { AiProviderCredentialsRepo } from '@docmost/db/repos/ai-chat/ai-provider
import { PageEmbeddingRepo } from '@docmost/db/repos/ai-chat/page-embedding.repo';
import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { SecretBoxService } from '../crypto/secret-box';
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
import {
AiDriver,
AiProviderSettings,
@@ -31,17 +30,6 @@ export function parsePositiveInt(raw: unknown): number | undefined {
return Number.isFinite(n) && n > 0 ? Math.floor(n) : undefined;
}
/**
* TTL (seconds) for the enqueue-time progress PRE-SEED written by `reindex()`
* before the worker starts. Deliberately SHORT: if `aiQueue.add()` de-duplicates
* against a job that is just finishing (the worker's finally already ran
* `clear()` but removeOnComplete hasn't yet removed the job), no new worker runs
* to overwrite/clear this seed — so a short TTL lets the phantom "reindexing:
* 0 of N" expire in seconds instead of sticking for the full 1h record TTL. A
* worker that DOES start re-seeds with the full TTL, so a real run is unaffected.
*/
const PRE_SEED_TTL_SECONDS = 45;
/**
* Shape of the partial update accepted by `update`. Mirrors the validated
* controller DTO. `apiKey` / `embeddingApiKey` are write-only: undefined =
@@ -86,7 +74,6 @@ export class AiSettingsService {
private readonly pageEmbeddingRepo: PageEmbeddingRepo,
private readonly pageRepo: PageRepo,
private readonly secretBox: SecretBoxService,
private readonly reindexProgress: EmbeddingReindexProgressService,
@InjectQueue(QueueName.AI_QUEUE) private readonly aiQueue: Queue,
) {}
@@ -113,60 +100,21 @@ export class AiSettingsService {
.remove(`ai-search-disabled-${workspaceId}`)
.catch(() => undefined);
// Seed a live progress record BEFORE enqueueing so the very first status
// poll already reports done=0 (the reindex POST returns the PRE-job counts,
// so without this seed the first poll would still show "total of total").
// `totalPages` uses countEmbeddablePages — the SAME set the worker iterates
// and the SAME denominator the status endpoint reports, so the live and
// steady-state totals match.
//
// ONLY seed when no run is active: aiQueue.add() de-duplicates an already-
// running reindex, so a mid-run re-trigger (second click / second admin /
// second tab) must NOT reset the visible counter to 0 — that would
// understate the live worker's real position for the rest of the run. The
// worker's own start() at run begin is the single authoritative reset.
let seeded = false;
if ((await this.reindexProgress.get(workspaceId)) === null) {
const totalPages = await this.pageRepo.countEmbeddablePages(workspaceId);
// Short TTL: if add() below de-duplicates against a just-finishing job
// whose worker already clear()ed but isn't removed yet, no worker runs to
// clear this seed — the short TTL expires the phantom record in seconds
// rather than leaving a stuck "reindexing: 0 of N" for the full record TTL.
await this.reindexProgress.start(
workspaceId,
totalPages,
PRE_SEED_TTL_SECONDS,
);
seeded = true;
}
const jobId = `ai-reindex-${workspaceId}`;
// Clear a prior non-active entry so a stale job can't block this reindex.
// A locked/active job is left in place (remove() no-ops) and the add() below
// de-duplicates against it, keeping the in-progress pass.
await this.aiQueue.remove(jobId).catch(() => undefined);
try {
await this.aiQueue.add(
QueueJob.WORKSPACE_CREATE_EMBEDDINGS,
{ workspaceId },
{
jobId,
removeOnComplete: true,
removeOnFail: true,
},
);
} catch (err) {
// If the enqueue fails (Redis hiccup/shutdown) the worker never runs, so
// its finally->clear() never fires. Roll back the seed WE just wrote so
// the status endpoint doesn't report a stuck "reindexing: 0 of N" for the
// full TTL. Only clear when this call did the seed — never wipe a
// concurrent active run's record (get() was non-null, seeded=false).
if (seeded) {
await this.reindexProgress.clear(workspaceId);
}
throw err;
}
await this.aiQueue.add(
QueueJob.WORKSPACE_CREATE_EMBEDDINGS,
{ workspaceId },
{
jobId,
removeOnComplete: true,
removeOnFail: true,
},
);
}
/**
@@ -305,33 +253,13 @@ export class AiSettingsService {
hasSttApiKey = !!creds?.sttApiKeyEnc;
}
// While a reindex run is active, report its LIVE progress (done climbs 0 ->
// total) so the settings UI can watch it advance. Read progress FIRST and
// short-circuit: this endpoint is polled every ~5s for the whole run, so when
// a record is active we skip the two coverage COUNTs entirely (their results
// would be discarded anyway). Without the live progress the counter never
// drops: the per-page reindex hard-replaces rows in its own small
// transaction, so countIndexedPages stays ~= total for the whole run. With no
// active record we fall back to the steady-state DB coverage count, which
// preserves the existing display and the client's "done == total -> stop
// polling" condition (the run ends -> record cleared -> DB count == total).
//
// The fallback `totalPages` counts only pages with embeddable content
// (non-empty text, content-borne text, or already-stored embeddings), so
// empty/text-less pages don't keep the "Indexed N of M pages" bar below 100%
// forever.
const progress = await this.reindexProgress.get(workspaceId);
let indexedPages: number;
let totalPages: number;
if (progress) {
indexedPages = progress.done;
totalPages = progress.total;
} else {
[indexedPages, totalPages] = await Promise.all([
this.pageEmbeddingRepo.countIndexedPages(workspaceId),
this.pageRepo.countEmbeddablePages(workspaceId),
]);
}
// totalPages now counts only pages with embeddable content (non-empty text
// or already-stored embeddings), so empty/text-less pages don't keep the
// "Indexed N of M pages" bar below 100% forever.
const [indexedPages, totalPages] = await Promise.all([
this.pageEmbeddingRepo.countIndexedPages(workspaceId),
this.pageRepo.countEmbeddablePages(workspaceId),
]);
return {
driver: provider.driver,
@@ -353,8 +281,6 @@ export class AiSettingsService {
hasSttApiKey,
indexedPages,
totalPages,
// Optional hint for the client: a reindex run is currently in progress.
reindexing: progress != null,
};
}

View File

@@ -5,7 +5,6 @@ import { QueueName } from '../queue/constants';
import { AiService } from './ai.service';
import { AiSettingsService } from './ai-settings.service';
import { AiSettingsController } from './ai-settings.controller';
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
/**
* LLM driver + provider-settings unit (§6.2/§6.4).
@@ -20,7 +19,7 @@ import { EmbeddingReindexProgressService } from './embedding-reindex-progress.se
BullModule.registerQueue({ name: QueueName.AI_QUEUE }),
],
controllers: [AiSettingsController],
providers: [AiService, AiSettingsService, EmbeddingReindexProgressService],
exports: [AiService, AiSettingsService, EmbeddingReindexProgressService],
providers: [AiService, AiSettingsService],
exports: [AiService, AiSettingsService],
})
export class AiModule {}

View File

@@ -146,7 +146,4 @@ export interface MaskedAiSettings {
// RAG indexing coverage for the settings UI.
indexedPages: number;
totalPages: number;
// True while a full workspace reindex is actively running (the counts above
// then reflect the live run progress rather than the steady-state DB count).
reindexing?: boolean;
}

View File

@@ -1,179 +0,0 @@
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
import type { RedisService } from '@nestjs-labs/nestjs-ioredis';
import type { Redis } from 'ioredis';
/**
* Unit tests for the Redis-backed reindex-progress store.
*
* The store is a thin, BEST-EFFORT wrapper: writes (start/increment) issue an
* hset/hincrby + expire pipeline and must SWALLOW Redis errors (progress is
* cosmetic — it must never break a reindex); reads (get) must map a valid hash
* to a ReindexProgress and degrade to null on a malformed/missing record or a
* Redis failure. We drive it with a hand-rolled fake ioredis (the project mocks
* Redis with plain fakes, see public-share limiter specs).
*/
describe('EmbeddingReindexProgressService', () => {
const WORKSPACE_ID = 'ws-1';
const KEY = 'ai:reindex:progress:ws-1';
/**
* Build a fake ioredis whose `multi()` returns a chainable recorder and whose
* `hgetall`/`del` are configurable jest mocks. `execImpl` lets a test make the
* pipeline reject (to assert error-swallowing).
*/
function makeRedis(opts: { execImpl?: () => Promise<unknown> } = {}) {
const exec = jest
.fn()
.mockImplementation(opts.execImpl ?? (() => Promise.resolve([])));
// mockReturnThis() returns the call's `this` (the multi object), so the
// chain hset().expire().exec() resolves correctly.
const multiObj = {
hset: jest.fn().mockReturnThis(),
hincrby: jest.fn().mockReturnThis(),
expire: jest.fn().mockReturnThis(),
exec,
};
const multi = jest.fn(() => multiObj);
const hgetall = jest.fn().mockResolvedValue({});
const del = jest.fn().mockResolvedValue(1);
const redis = { multi, hgetall, del } as unknown as Redis;
return { redis, multiObj, multi, hgetall, del, exec };
}
function makeService(redis: Redis) {
const redisService = {
getOrThrow: () => redis,
} as unknown as RedisService;
return new EmbeddingReindexProgressService(redisService);
}
describe('get', () => {
it('maps a valid hash to a ReindexProgress object', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: '478', done: '120', startedAt: '1000' });
const service = makeService(redis);
await expect(service.get(WORKSPACE_ID)).resolves.toEqual({
total: 478,
done: 120,
startedAt: 1000,
});
expect(hgetall).toHaveBeenCalledWith(KEY);
});
it('returns null for an empty hash (no record)', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({});
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('returns null when `total` is missing (partial record)', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ done: '5' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('returns null for a non-numeric total', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: 'abc', done: '1', startedAt: '1' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('returns null for a non-numeric done', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: '10', done: 'xyz', startedAt: '1' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('coerces a non-finite startedAt to 0', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: '10', done: '2', startedAt: 'nope' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toEqual({
total: 10,
done: 2,
startedAt: 0,
});
});
it('degrades to null when hgetall throws (degradation contract)', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockRejectedValue(new Error('redis down'));
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
});
describe('start', () => {
it('issues hset + expire on the workspace key', async () => {
const { redis, multiObj } = makeRedis();
await makeService(redis).start(WORKSPACE_ID, 478);
expect(multiObj.hset).toHaveBeenCalledWith(
KEY,
expect.objectContaining({ total: '478', done: '0' }),
);
expect(multiObj.expire).toHaveBeenCalledWith(KEY, expect.any(Number));
expect(multiObj.exec).toHaveBeenCalledTimes(1);
});
it('defaults the expire TTL to the full 1h record TTL', async () => {
const { redis, multiObj } = makeRedis();
await makeService(redis).start(WORKSPACE_ID, 478);
// Default ttl = full record TTL (60 * 60) so a real run never expires
// mid-flight before the worker refreshes it on each increment.
expect(multiObj.expire).toHaveBeenCalledWith(KEY, 60 * 60);
});
it('honours an explicit short ttlSeconds for the enqueue-time pre-seed (F10)', async () => {
const { redis, multiObj } = makeRedis();
// The reindex() pre-seed passes a short ttl so a phantom record left by a
// de-duplicated enqueue expires in seconds, not after the full 1h TTL.
await makeService(redis).start(WORKSPACE_ID, 478, 45);
expect(multiObj.expire).toHaveBeenCalledWith(KEY, 45);
});
it('swallows a thrown Redis error (best-effort)', async () => {
const { redis } = makeRedis({
execImpl: () => Promise.reject(new Error('redis down')),
});
await expect(
makeService(redis).start(WORKSPACE_ID, 1),
).resolves.toBeUndefined();
});
});
describe('increment', () => {
it('issues hincrby + expire on the workspace key', async () => {
const { redis, multiObj } = makeRedis();
await makeService(redis).increment(WORKSPACE_ID);
expect(multiObj.hincrby).toHaveBeenCalledWith(KEY, 'done', 1);
expect(multiObj.expire).toHaveBeenCalledWith(KEY, expect.any(Number));
expect(multiObj.exec).toHaveBeenCalledTimes(1);
});
it('swallows a thrown Redis error (best-effort)', async () => {
const { redis } = makeRedis({
execImpl: () => Promise.reject(new Error('redis down')),
});
await expect(
makeService(redis).increment(WORKSPACE_ID),
).resolves.toBeUndefined();
});
});
describe('clear', () => {
it('deletes the workspace key', async () => {
const { redis, del } = makeRedis();
await makeService(redis).clear(WORKSPACE_ID);
expect(del).toHaveBeenCalledWith(KEY);
});
it('swallows a thrown Redis error (best-effort)', async () => {
const { redis, del } = makeRedis();
del.mockRejectedValue(new Error('redis down'));
await expect(
makeService(redis).clear(WORKSPACE_ID),
).resolves.toBeUndefined();
});
});
});

View File

@@ -1,162 +0,0 @@
import { Injectable, Logger } from '@nestjs/common';
import { RedisService } from '@nestjs-labs/nestjs-ioredis';
import type { Redis } from 'ioredis';
/**
* Live progress of an in-flight workspace embeddings reindex run.
* `total` is the number of pages the run will process, `done` how many it has
* already processed (success OR handled failure), `startedAt` the epoch-ms the
* record was created.
*/
export interface ReindexProgress {
total: number;
done: number;
startedAt: number;
}
/** Redis key namespace for the per-workspace reindex-progress record. */
const KEY_PREFIX = 'ai:reindex:progress:';
/**
* TTL (seconds) on the progress record so a crashed/aborted worker that never
* reaches its `clear()` finally can still self-clean instead of leaving a stuck
* "reindexing" state. Refreshed on every increment so a long run never expires
* mid-flight; on a crash it disappears within TTL of the last processed page.
*
* INTENTIONALLY tied to WRITE progress (start/increment) only — never refreshed
* on get(). Refreshing on read would keep a dead worker's record alive forever
* as long as a client keeps polling (a permanently stuck reindexing:true). The
* clear() in the worker's finally handles normal completion; a dead worker's
* record expires after TTL, and the client's own poll cap stops polling anyway.
*/
const TTL_SECONDS = 60 * 60; // 1h
/**
* Cluster-wide store for the live progress of a workspace embeddings reindex.
*
* The reindex runs in a BullMQ worker (AI_QUEUE) that may be a DIFFERENT process
* than the API handling the settings-status GET, so the progress must live in
* the shared Redis — we reuse the same global ioredis client (RedisService from
* @nestjs-labs/nestjs-ioredis) that backs BullMQ and the other anti-abuse
* limiters, adding NO new Redis config.
*
* Everything here is best-effort and COSMETIC: progress only drives the "Indexed
* X of Y" counter while a reindex is running. Any Redis failure degrades to the
* existing steady-state behaviour (the status falls back to the DB coverage
* count), so reads fail to `null` and writes are swallowed — a reindex must
* never break because progress reporting did.
*
* Stored as a Redis HASH so `done` can be bumped with an atomic HINCRBY (the
* worker is the only writer of `done`, but HINCRBY also keeps us off a
* read-modify-write race and preserves the other fields).
*/
@Injectable()
export class EmbeddingReindexProgressService {
private readonly logger = new Logger(EmbeddingReindexProgressService.name);
private readonly redis: Redis;
constructor(redisService: RedisService) {
this.redis = redisService.getOrThrow();
}
private key(workspaceId: string): string {
return KEY_PREFIX + workspaceId;
}
/**
* Begin (or reset) the progress record for a workspace: `total` pages, `done`
* back to 0, `startedAt` now. Called twice for a run, BOTH with the real page
* count (countEmbeddablePages) so the two totals coincide: once at reindex
* enqueue time (so the very first status poll already reports done=0) and again
* at the worker start (which re-asserts the same total and resets `done`).
* Resets `done` to 0 so a re-trigger never inherits a stale count.
*
* `ttlSeconds` lets the caller pick the record's lifetime. The enqueue-time
* pre-seed passes a SHORT ttl: if `aiQueue.add()` de-duplicates against a job
* that is just finishing (its worker hasn't yet removed the job but already
* ran its `clear()`), no new worker starts to clear this phantom seed, so a
* short ttl lets it expire in seconds instead of sticking for the full TTL.
* The worker's own `start()` at the begin of a real run overwrites this entry
* and raises the ttl back to the default full TTL.
*/
async start(
workspaceId: string,
total: number,
ttlSeconds: number = TTL_SECONDS,
): Promise<void> {
const key = this.key(workspaceId);
try {
await this.redis
.multi()
.hset(key, {
total: String(total),
done: '0',
startedAt: String(Date.now()),
})
.expire(key, ttlSeconds)
.exec();
} catch (err) {
this.logger.warn(
`reindex-progress start failed for workspace ${workspaceId}; ` +
`progress reporting disabled for this run: ${(err as Error).message}`,
);
}
}
/**
* Bump the processed-page counter by one and refresh the TTL. Atomic and
* best-effort: a missing key (cleared/expired) would be recreated with only
* `done`, but `get()` treats a record without a numeric `total` as inactive,
* so that partial state safely reads as "no active reindex".
*/
async increment(workspaceId: string): Promise<void> {
const key = this.key(workspaceId);
try {
await this.redis.multi().hincrby(key, 'done', 1).expire(key, TTL_SECONDS).exec();
} catch (err) {
this.logger.warn(
`reindex-progress increment failed for workspace ${workspaceId}: ` +
`${(err as Error).message}`,
);
}
}
/**
* Remove the progress record. Called in the worker's `finally` so a completed,
* aborted, or unconfigured-early-return run never leaves a stuck record; the
* status then falls back to the DB coverage count.
*/
async clear(workspaceId: string): Promise<void> {
try {
await this.redis.del(this.key(workspaceId));
} catch (err) {
this.logger.warn(
`reindex-progress clear failed for workspace ${workspaceId} ` +
`(self-cleans via TTL): ${(err as Error).message}`,
);
}
}
/**
* Read the live progress, or `null` when no reindex is active (no record, an
* expired record, or a partial record without a numeric `total`). On a Redis
* error returns `null` so the status endpoint degrades to its DB count.
*/
async get(workspaceId: string): Promise<ReindexProgress | null> {
try {
const data = await this.redis.hgetall(this.key(workspaceId));
if (!data || data.total === undefined) return null;
const total = Number(data.total);
const done = Number(data.done);
const startedAt = Number(data.startedAt);
if (!Number.isFinite(total) || !Number.isFinite(done)) return null;
return { total, done, startedAt: Number.isFinite(startedAt) ? startedAt : 0 };
} catch (err) {
this.logger.warn(
`reindex-progress read failed for workspace ${workspaceId}; ` +
`falling back to DB count: ${(err as Error).message}`,
);
return null;
}
}
}

View File

@@ -14,4 +14,148 @@ describe('EnvironmentService', () => {
it('should be defined', () => {
expect(service).toBeDefined();
});
describe('getSandboxTtlMs', () => {
// ConfigService stub: get(key, def) returns the configured value for the key
// (falling back to def), matching the @nestjs/config contract the service
// calls with (key, default).
const build = (sandboxTtl?: string) =>
new EnvironmentService({
get: (key: string, def?: string) =>
key === 'SANDBOX_TTL_MS' ? (sandboxTtl ?? def) : def,
} as any);
it.each(['0', '-5', 'abc'])(
'falls back to the 3600000 default for invalid value %s',
(value) => {
expect(build(value).getSandboxTtlMs()).toBe(3_600_000);
},
);
it('returns the parsed value for a valid positive integer', () => {
expect(build('120000').getSandboxTtlMs()).toBe(120_000);
});
it('uses the 3600000 default when SANDBOX_TTL_MS is unset', () => {
expect(build(undefined).getSandboxTtlMs()).toBe(3_600_000);
});
});
// The three byte caps share the same getPositiveIntEnv() helper as the TTL,
// so a non-integer / non-positive value ('0'/'-5'/'abc') falls back to the
// documented default and a valid positive integer is returned parsed. Note
// parseInt truncates '1.5' -> 1 (a valid positive integer), so that value is
// accepted, not rejected — same as the pre-existing TTL getter.
describe.each([
{
name: 'getSandboxMaxBytes',
key: 'SANDBOX_MAX_BYTES',
def: 8_388_608,
getter: (s: EnvironmentService) => s.getSandboxMaxBytes(),
},
{
name: 'getSandboxMaxImageBytes',
key: 'SANDBOX_MAX_IMAGE_BYTES',
def: 20_971_520,
getter: (s: EnvironmentService) => s.getSandboxMaxImageBytes(),
},
{
name: 'getSandboxMaxTotalBytes',
key: 'SANDBOX_MAX_TOTAL_BYTES',
def: 134_217_728,
getter: (s: EnvironmentService) => s.getSandboxMaxTotalBytes(),
},
])('$name', ({ key, def, getter }) => {
// ConfigService stub: get(k, d) returns the configured value for THIS cap's
// key (falling back to d), and the default for every other key.
const build = (value?: string) =>
new EnvironmentService({
get: (k: string, d?: string) =>
k === key ? (value ?? d) : d,
} as any);
it.each(['0', '-5', 'abc'])(
`falls back to the ${def} default for invalid value %s`,
(value) => {
expect(getter(build(value))).toBe(def);
},
);
it('returns the parsed value for a valid positive integer', () => {
expect(getter(build('4096'))).toBe(4096);
});
it('truncates a non-integer like "1.5" to 1 via parseInt (not rejected)', () => {
expect(getter(build('1.5'))).toBe(1);
});
it(`uses the ${def} default when the env is unset`, () => {
expect(getter(build(undefined))).toBe(def);
});
});
// getPositiveIntEnv keeps a one-shot `invalidPositiveIntWarned` set so a bad
// value is logged ONCE per key (not on every getter call, which the sandbox
// hits per-put). These tests pin that dedup so a regression to per-call logging
// would fail loudly.
describe('invalid-value warn dedup', () => {
it('warns only once per key across repeated getter calls', () => {
const service = new EnvironmentService({
get: (k: string, d?: string) =>
k === 'SANDBOX_MAX_TOTAL_BYTES' ? '-5' : d,
} as any);
const warnSpy = jest
.spyOn((service as any).logger, 'warn')
.mockImplementation(() => undefined);
service.getSandboxMaxTotalBytes();
service.getSandboxMaxTotalBytes();
expect(warnSpy).toHaveBeenCalledTimes(1);
});
it('warns independently per key (dedup is per-key, not global)', () => {
// Two DIFFERENT SANDBOX_* keys are both invalid -> each warns once, so two
// warns total. This proves the dedup set is keyed, not a single global flag.
const service = new EnvironmentService({
get: (k: string, d?: string) =>
k === 'SANDBOX_MAX_BYTES' || k === 'SANDBOX_MAX_TOTAL_BYTES'
? '-5'
: d,
} as any);
const warnSpy = jest
.spyOn((service as any).logger, 'warn')
.mockImplementation(() => undefined);
service.getSandboxMaxBytes();
service.getSandboxMaxTotalBytes();
expect(warnSpy).toHaveBeenCalledTimes(2);
});
});
describe('getSandboxPublicUrl', () => {
// Stub that resolves BOTH keys the public-url logic consults.
const build = (vals: { sandboxUrl?: string; appUrl?: string }) =>
new EnvironmentService({
get: (key: string, def?: string) =>
key === 'SANDBOX_PUBLIC_URL'
? (vals.sandboxUrl ?? def)
: key === 'APP_URL'
? (vals.appUrl ?? def)
: def,
} as any);
it('uses SANDBOX_PUBLIC_URL and trims a trailing slash', () => {
expect(
build({ sandboxUrl: 'https://docs.example.com/' }).getSandboxPublicUrl(),
).toBe('https://docs.example.com');
});
it('falls back to APP_URL (origin) when SANDBOX_PUBLIC_URL is unset', () => {
expect(
build({ appUrl: 'https://app.example.com' }).getSandboxPublicUrl(),
).toBe('https://app.example.com');
});
});
});

View File

@@ -1,9 +1,15 @@
import { Injectable } from '@nestjs/common';
import { Injectable, Logger } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';
import ms, { StringValue } from 'ms';
@Injectable()
export class EnvironmentService {
private readonly logger = new Logger(EnvironmentService.name);
// Env keys already warned about for an invalid value (one-shot per key, so a
// bad SANDBOX_* value is not logged on every blob put). Mirrors the original
// sandboxTtlWarned guard, generalized across the TTL + the three byte caps.
private readonly invalidPositiveIntWarned = new Set<string>();
constructor(private configService: ConfigService) {}
getNodeEnv(): string {
@@ -332,4 +338,63 @@ export class EnvironmentService {
.map((o) => o.trim())
.filter(Boolean);
}
// --- Blob sandbox (in-RAM ephemeral blob transfer; see SandboxModule) ---
// Base URL the sandbox `uri` is built from. It MUST be reachable over the
// network by the external consumer that fetches the blobs (not a loopback
// address if that consumer is remote). Falls back to APP_URL when unset so a
// single-host deployment works out of the box; set it explicitly when the
// consumer lives on another host.
getSandboxPublicUrl(): string {
const raw =
this.configService.get<string>('SANDBOX_PUBLIC_URL') || this.getAppUrl();
// Drop any trailing slash so `${base}/api/sb/${id}` never doubles up.
return raw.replace(/\/+$/, '');
}
// Parse a REQUIRED positive-integer env (TTL in ms or a byte cap). A
// non-integer or <= 0 value would break the sandbox silently (instant expiry,
// or every put failing against a 0-byte cap), so warn once and fall back to
// the default instead. Blob bodies are never logged.
private getPositiveIntEnv(key: string, def: number): number {
const parsed = parseInt(
this.configService.get<string>(key, String(def)),
10,
);
if (!Number.isInteger(parsed) || parsed <= 0) {
if (!this.invalidPositiveIntWarned.has(key)) {
this.invalidPositiveIntWarned.add(key);
this.logger.warn(
`Invalid ${key} (must be a positive integer); falling back to the ${def} default`,
);
}
return def;
}
return parsed;
}
// Blob time-to-live. Default 1h. The unguessable UUID + this short TTL + TLS
// are the whole capability model (no tokens). A non-positive or non-integer
// value would make every blob expire instantly (silent 404s), so reject it and
// fall back to the 1h default (warned about once to avoid per-put log spam).
getSandboxTtlMs(): number {
return this.getPositiveIntEnv('SANDBOX_TTL_MS', 3_600_000);
}
// Per-blob cap for non-image blobs (the serialized document). Default 8 MiB.
getSandboxMaxBytes(): number {
return this.getPositiveIntEnv('SANDBOX_MAX_BYTES', 8_388_608);
}
// Per-blob cap for mirrored image blobs. Default 20 MiB.
getSandboxMaxImageBytes(): number {
return this.getPositiveIntEnv('SANDBOX_MAX_IMAGE_BYTES', 20_971_520);
}
// RAM guard: total bytes the whole store may hold. Default 128 MiB. On
// overflow the store evicts oldest entries to make room.
getSandboxMaxTotalBytes(): number {
return this.getPositiveIntEnv('SANDBOX_MAX_TOTAL_BYTES', 134_217_728);
}
}

View File

@@ -2,6 +2,7 @@ import {
IsIn,
IsNotEmpty,
IsNotIn,
IsNumberString,
IsOptional,
IsString,
IsUrl,
@@ -170,6 +171,35 @@ export class EnvironmentVariables {
},
)
CLICKHOUSE_URL: string;
// --- Blob sandbox (in-RAM ephemeral blob transfer; see SandboxModule) ---
@IsOptional()
@ValidateIf((obj) => obj.SANDBOX_PUBLIC_URL != '' && obj.SANDBOX_PUBLIC_URL != null)
@IsUrl(
{ protocols: ['http', 'https'], require_tld: false },
{
message:
'SANDBOX_PUBLIC_URL must be a valid http(s) URL reachable by the external blob consumer',
},
)
SANDBOX_PUBLIC_URL: string;
@IsOptional()
@IsNumberString({}, { message: 'SANDBOX_TTL_MS must be an integer (milliseconds)' })
SANDBOX_TTL_MS: string;
@IsOptional()
@IsNumberString({}, { message: 'SANDBOX_MAX_BYTES must be an integer (bytes)' })
SANDBOX_MAX_BYTES: string;
@IsOptional()
@IsNumberString({}, { message: 'SANDBOX_MAX_IMAGE_BYTES must be an integer (bytes)' })
SANDBOX_MAX_IMAGE_BYTES: string;
@IsOptional()
@IsNumberString({}, { message: 'SANDBOX_MAX_TOTAL_BYTES must be an integer (bytes)' })
SANDBOX_MAX_TOTAL_BYTES: string;
}
export function validate(config: Record<string, any>) {

View File

@@ -131,10 +131,25 @@ export class FailedLoginLimiter {
}
// The per-session DocmostMcpConfig shape understood by @docmost/mcp: either the
// service-account credentials variant OR the per-user getToken variant.
export type DocmostMcpConfig =
// service-account credentials variant OR the per-user getToken variant. The
// optional `sandbox` sink (blob store for the stash tool) is common to both and
// injected by McpService after the auth decision.
export type DocmostMcpConfig = (
| { apiUrl: string; email: string; password: string }
| { apiUrl: string; getToken: () => Promise<string> };
| { apiUrl: string; getToken: () => Promise<string> }
) & {
sandbox?: {
put: (
buf: Buffer,
mime: string,
) => { uri: string; sha256: string; size: number };
// Optional live/evict probes the package uses to keep stash_page's mirror
// counts honest under the store's FIFO eviction (mirror of the package's
// sink type); older bindings omit them.
has?: (uri: string) => boolean;
evict?: (uri: string) => void;
};
};
export interface ResolvedMcpAuth {
config: DocmostMcpConfig;

View File

@@ -109,13 +109,13 @@ function makeService(opts: {
};
const service = new McpService(
undefined as never, // environmentService
undefined as never, // workspaceRepo
undefined as never, // authService
undefined as never, // tokenService
undefined as never, // userRepo
undefined as never, // userSessionRepo
moduleRef as never, // moduleRef (read by the MFA branch)
undefined as never, // sandboxStore (unused by the login-gate path)
);
// Stop the constructor's unref'd sweep timer leaking across tests.
service.onModuleDestroy();

View File

@@ -2,17 +2,15 @@ import { Module } from '@nestjs/common';
import { McpController } from './mcp.controller';
import { McpService } from './mcp.service';
import { DatabaseModule } from '@docmost/db/database.module';
import { EnvironmentModule } from '../environment/environment.module';
import { AuthModule } from '../../core/auth/auth.module';
import { TokenModule } from '../../core/auth/token.module';
// Community MCP feature: the server itself serves the Model Context Protocol
// over HTTP at /mcp. DatabaseModule (global) provides WorkspaceRepo and
// EnvironmentModule (global) provides EnvironmentService. AuthModule supplies
// AuthService (per-user HTTP-Basic login validation) and TokenModule supplies
// TokenService (Bearer access-JWT verification for the token fallback).
// over HTTP at /mcp. DatabaseModule (global) provides WorkspaceRepo. AuthModule
// supplies AuthService (per-user HTTP-Basic login validation) and TokenModule
// supplies TokenService (Bearer access-JWT verification for the token fallback).
@Module({
imports: [DatabaseModule, EnvironmentModule, AuthModule, TokenModule],
imports: [DatabaseModule, AuthModule, TokenModule],
controllers: [McpController],
providers: [McpService],
})

View File

@@ -8,7 +8,6 @@ import { ModuleRef } from '@nestjs/core';
import { pathToFileURL } from 'node:url';
import { IncomingMessage } from 'node:http';
import { FastifyReply, FastifyRequest } from 'fastify';
import { EnvironmentService } from '../environment/environment.service';
import { WorkspaceRepo } from '@docmost/db/repos/workspace/workspace.repo';
import { UserRepo } from '@docmost/db/repos/user/user.repo';
import { UserSessionRepo } from '@docmost/db/repos/session/user-session.repo';
@@ -30,6 +29,7 @@ import {
DocmostMcpConfig,
ResolvedMcpAuth,
} from './mcp-auth.helpers';
import { SandboxStore } from '../sandbox/sandbox.store';
// Minimal shape of the embedded MCP HTTP handler exported by @docmost/mcp/http.
interface McpHttpHandler {
@@ -92,13 +92,14 @@ export class McpService implements OnModuleDestroy {
private readonly sweepTimer: NodeJS.Timeout;
constructor(
private readonly environmentService: EnvironmentService,
private readonly workspaceRepo: WorkspaceRepo,
private readonly authService: AuthService,
private readonly tokenService: TokenService,
private readonly userRepo: UserRepo,
private readonly userSessionRepo: UserSessionRepo,
private readonly moduleRef: ModuleRef,
// Shared singleton in-RAM blob store backing the stash tool.
private readonly sandboxStore: SandboxStore,
) {
this.sweepTimer = setInterval(() => {
try {
@@ -326,7 +327,11 @@ export class McpService implements OnModuleDestroy {
// Should never happen: handle() always stashes before delegating.
throw new UnauthorizedException('MCP authentication missing.');
}
return resolved.config;
// Inject the blob-sandbox sink after the auth decision so stash_page
// can store blobs in the shared in-RAM store regardless of which
// credential variant resolved. The sink (put/has/evict + uri↔id
// mapping) is owned by SandboxStore.asSink().
return { ...resolved.config, sandbox: this.sandboxStore.asSink() };
},
{
identify: (req: IncomingMessage) => {

View File

@@ -0,0 +1,6 @@
// Single source of truth for the anonymous blob-sandbox route. The controller
// is mounted under the global `/api` prefix, so its decorator uses the bare
// segment while the public URL and the workspace-gate exclusion need the full
// path — derive the latter from the former so the two never drift.
export const SANDBOX_ROUTE_SEGMENT = 'sb';
export const SANDBOX_API_PATH = `/api/${SANDBOX_ROUTE_SEGMENT}`;

View File

@@ -0,0 +1,265 @@
import { SandboxController } from './sandbox.controller';
import { SandboxEntry } from './sandbox.store';
// Capturing fake of the FastifyReply surface the controller uses:
// status()/header()/headers()/send(), all chainable.
function makeRes() {
const sent: { status: number; headers: Record<string, any>; body: any } = {
status: 200,
headers: {},
body: undefined,
};
const res: any = {
status(code: number) {
sent.status = code;
return res;
},
header(key: string, value: any) {
sent.headers[key.toLowerCase()] = value;
return res;
},
headers(obj: Record<string, any>) {
for (const k of Object.keys(obj)) sent.headers[k.toLowerCase()] = obj[k];
return res;
},
send(body?: any) {
sent.body = body;
return res;
},
_sent: sent,
};
return res;
}
function makeReq(headers: Record<string, any> = {}) {
return { headers } as any;
}
// A syntactically valid v4 UUID (version nibble 4, variant nibble 8). The
// shared `uuid` validator is stricter than a bare hex-shape regex, so the id
// must carry a real version/variant.
const VALID_ID = 'aaaaaaaa-bbbb-4ccc-8ddd-eeeeeeeeeeee';
function entry(buf: Buffer, mime: string, sha256: string): SandboxEntry {
return { buf, mime, sha256, expiresAt: Date.now() + 60_000 };
}
describe('SandboxController', () => {
it('serves 200 with body, Content-Type, Content-Length and sha256 ETag', async () => {
const buf = Buffer.from('{"ok":true}', 'utf8');
const sha = 'a'.repeat(64);
const store = { get: jest.fn().mockReturnValue(entry(buf, 'application/json', sha)) };
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq(), res);
expect(store.get).toHaveBeenCalledWith(VALID_ID);
expect(res._sent.status).toBe(200);
expect(res._sent.headers['content-type']).toBe('application/json');
expect(res._sent.headers['content-length']).toBe(buf.length);
expect(res._sent.headers['etag']).toBe(`"${sha}"`);
expect(res._sent.body).toBe(buf);
});
it('returns 404 for a missing/expired blob', async () => {
const store = { get: jest.fn().mockReturnValue(undefined) };
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq(), res);
expect(res._sent.status).toBe(404);
expect(res._sent.body).toBeUndefined();
});
it('returns 404 for a non-UUID id WITHOUT touching the store (anti-traversal)', async () => {
const store = { get: jest.fn() };
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get('../../etc/passwd', makeReq(), res);
expect(store.get).not.toHaveBeenCalled();
expect(res._sent.status).toBe(404);
});
it('returns 304 (no body) when If-None-Match matches the ETag', async () => {
const sha = 'b'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq({ 'if-none-match': `"${sha}"` }), res);
expect(res._sent.status).toBe(304);
expect(res._sent.body).toBeUndefined();
expect(res._sent.headers['etag']).toBe(`"${sha}"`);
});
it('accepts a bare (unquoted) sha256 in If-None-Match too', async () => {
const sha = 'c'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq({ 'if-none-match': sha }), res);
expect(res._sent.status).toBe(304);
});
it('serves 200 when If-None-Match does NOT match', async () => {
const sha = 'd'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq({ 'if-none-match': '"stale"' }), res);
expect(res._sent.status).toBe(200);
});
it('returns 304 for a wildcard "*" If-None-Match', async () => {
const sha = 'e'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq({ 'if-none-match': '*' }), res);
expect(res._sent.status).toBe(304);
});
it('returns 304 for a weak validator W/"<sha>"', async () => {
const sha = 'f'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq({ 'if-none-match': `W/"${sha}"` }), res);
expect(res._sent.status).toBe(304);
});
it('returns 304 when a comma-separated If-None-Match list contains the sha', async () => {
const sha = '1'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(
VALID_ID,
makeReq({ 'if-none-match': `"other", "${sha}"` }),
res,
);
expect(res._sent.status).toBe(304);
});
it('sets a private, immutable Cache-Control with a max-age within the TTL on 200', async () => {
const sha = '2'.repeat(64);
// Known TTL: ~30s out, so the floored max-age must land within [0, 60].
const e: SandboxEntry = {
buf: Buffer.from('x'),
mime: 'application/json',
sha256: sha,
expiresAt: Date.now() + 30_000,
};
const store = { get: jest.fn().mockReturnValue(e) };
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq(), res);
expect(res._sent.status).toBe(200);
const cc = res._sent.headers['cache-control'] as string;
expect(cc).toMatch(/^private, max-age=\d+, immutable$/);
const maxAge = Number(cc.match(/max-age=(\d+)/)![1]);
expect(maxAge).toBeGreaterThanOrEqual(0);
expect(maxAge).toBeLessThanOrEqual(60);
});
it('emits Cache-Control alongside ETag on the 304 branch', async () => {
const sha = '3'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'application/json', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq({ 'if-none-match': `"${sha}"` }), res);
expect(res._sent.status).toBe(304);
expect(res._sent.headers['cache-control']).toMatch(
/^private, max-age=\d+, immutable$/,
);
});
it('sets nosniff + restrictive CSP and serves an allowlisted image inline', async () => {
const sha = '4'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('x'), 'image/png', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq(), res);
expect(res._sent.status).toBe(200);
expect(res._sent.headers['x-content-type-options']).toBe('nosniff');
expect(res._sent.headers['content-security-policy']).toBe(
"base-uri 'none'; object-src 'self'; default-src 'self';",
);
expect(res._sent.headers['content-disposition']).toBe('inline');
});
it('forces an SVG to download (attachment) while keeping nosniff + CSP', async () => {
const sha = '5'.repeat(64);
const store = {
get: jest.fn().mockReturnValue(entry(Buffer.from('<svg/>'), 'image/svg+xml', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq(), res);
expect(res._sent.status).toBe(200);
expect(res._sent.headers['content-disposition']).toBe('attachment');
expect(res._sent.headers['x-content-type-options']).toBe('nosniff');
expect(res._sent.headers['content-security-policy']).toBe(
"base-uri 'none'; object-src 'self'; default-src 'self';",
);
});
it('forces text/html to download (attachment) while keeping nosniff + CSP', async () => {
const sha = '6'.repeat(64);
const store = {
get: jest
.fn()
.mockReturnValue(entry(Buffer.from('<h1>x</h1>'), 'text/html', sha)),
};
const controller = new SandboxController(store as any);
const res = makeRes();
await controller.get(VALID_ID, makeReq(), res);
expect(res._sent.status).toBe(200);
expect(res._sent.headers['content-disposition']).toBe('attachment');
expect(res._sent.headers['x-content-type-options']).toBe('nosniff');
expect(res._sent.headers['content-security-policy']).toBe(
"base-uri 'none'; object-src 'self'; default-src 'self';",
);
});
});

View File

@@ -0,0 +1,130 @@
import { Controller, Get, Param, Req, Res } from '@nestjs/common';
import { FastifyReply, FastifyRequest } from 'fastify';
import { validate as isValidUUID } from 'uuid';
import { SandboxStore } from './sandbox.store';
import { SANDBOX_ROUTE_SEGMENT } from './sandbox.constants';
// MIME types safe to render inline in a browser. SVG is deliberately EXCLUDED
// (it can carry script), as are text/html and the JSON document blob — anything
// not on this list is served as an attachment so an attacker-controlled mime can
// never execute script on this origin (the route is anonymous + same-origin).
const INLINE_SAFE_MIME = new Set([
'image/png',
'image/jpeg',
'image/gif',
'image/webp',
'image/avif',
]);
/**
* Anonymous read endpoint for the in-RAM blob sandbox.
*
* Mounted under the global `/api` prefix as `GET /api/sb/:id`. It carries NO
* `@UseGuards(JwtAuthGuard)`, so — exactly like the public attachment route
* `GET /api/files/public/...` — it is exempt from Docmost session auth. The
* route is ALSO listed in the workspace-resolution preHandler's excludedPaths
* in main.ts so a request from a remote consumer (which carries no workspace
* host) is not rejected with "Workspace not found".
*
* It only ever serves blobs looked up from the SandboxStore by a validated
* UUID; `:id` is never used as a filesystem path, so there is no traversal
* surface. Never returns tokens, never 401s.
*
* Anti-XSS hardening mirrors the public attachment route: every response sets
* `X-Content-Type-Options: nosniff` and a restrictive CSP, and serves any mime
* NOT on the inline-safe allowlist (svg/html/the JSON document blob) as an
* attachment, so an attacker-controlled `entry.mime` can never execute script
* on this same-origin anonymous route.
*/
@Controller(SANDBOX_ROUTE_SEGMENT)
export class SandboxController {
constructor(private readonly store: SandboxStore) {}
@Get(':id')
async get(
@Param('id') id: string,
@Req() req: FastifyRequest,
@Res() res: FastifyReply,
): Promise<void> {
// Validate `:id` as a real UUID via the shared `uuid` validator (same as the
// attachment routes). This is anti-traversal / input hygiene (so `:id` can
// never be a path like `../...`), NOT authorization — the capability is the
// unguessable id itself plus the short TTL plus TLS. A non-UUID id (including
// any traversal attempt) → 404 before touching the store; no stack trace
// leaks out.
if (!isValidUUID(id)) {
res.status(404).send();
return;
}
const entry = this.store.get(id);
if (!entry) {
// Missing or expired — indistinguishable to the caller, by design.
res.status(404).send();
return;
}
// Strong validator: quoted sha256, no W/ weak prefix. Same value computed
// at put() time, so an external consumer can detect a truncated/corrupted
// body — the original bug this whole channel exists to fix.
const etag = `"${entry.sha256}"`;
// Compute freshness BEFORE the conditional check: a 304 conditional
// revalidation must not lose the Cache-Control freshness directives, or a
// revalidating client would forget how long the blob stays fresh.
const ttlSeconds = Math.max(
0,
Math.floor((entry.expiresAt - Date.now()) / 1000),
);
// Capability URL — keep it out of shared caches; immutable for its TTL.
const cacheControl = `private, max-age=${ttlSeconds}, immutable`;
// Conditional request: an exact ETag match → 304 with no body. The blob is
// immutable, so the validator is stable for the blob's whole lifetime.
if (this.ifNoneMatchMatches(req.headers['if-none-match'], entry.sha256)) {
res
.status(304)
.header('ETag', etag)
.header('Cache-Control', cacheControl)
.send();
return;
}
// Non-allowlisted mimes (svg/html/the JSON blob) are forced to download so
// an attacker-controlled mime can never run script inline on this origin.
const disposition = INLINE_SAFE_MIME.has(entry.mime)
? 'inline'
: 'attachment';
// Use @Res() + res.send(Buffer) with an explicit Content-Type so the binary
// body bypasses the global JSON response transform/serializer.
res
.status(200)
.headers({
'Content-Type': entry.mime,
'Content-Length': entry.buf.length,
ETag: etag,
'Cache-Control': cacheControl,
'X-Content-Type-Options': 'nosniff',
'Content-Security-Policy':
"base-uri 'none'; object-src 'self'; default-src 'self';",
'Content-Disposition': disposition,
})
.send(entry.buf);
}
// Accept the consumer's If-None-Match whether it sends the quoted ETag, a bare
// sha256, a weak "W/"-prefixed validator, or a comma-separated list.
private ifNoneMatchMatches(
header: string | string[] | undefined,
sha256: string,
): boolean {
if (!header) return false;
const raw = Array.isArray(header) ? header.join(',') : header;
if (raw.trim() === '*') return true;
return raw
.split(',')
.map((t) => t.trim().replace(/^W\//, '').replace(/^"|"$/g, ''))
.some((t) => t === sha256);
}
}

View File

@@ -0,0 +1,19 @@
import { Global, Module } from '@nestjs/common';
import { SandboxController } from './sandbox.controller';
import { SandboxStore } from './sandbox.store';
/**
* In-RAM blob sandbox: a SINGLE shared SandboxStore (the @Injectable singleton)
* is written to by the stash tool (via McpService / AiChatToolsService) and read
* back by the anonymous SandboxController. Marked @Global so the same store
* instance is injectable everywhere without import churn — put() and get() MUST
* hit the same Map. EnvironmentService (caps/TTL/public URL) is provided by the
* global EnvironmentModule.
*/
@Global()
@Module({
controllers: [SandboxController],
providers: [SandboxStore],
exports: [SandboxStore],
})
export class SandboxModule {}

View File

@@ -0,0 +1,163 @@
import { createHash } from 'node:crypto';
import { validate as isValidUUID } from 'uuid';
import { SandboxStore } from './sandbox.store';
// Build a minimal EnvironmentService stub with overridable caps/TTL.
function makeEnv(
overrides: Partial<{
ttlMs: number;
maxBytes: number;
maxImageBytes: number;
maxTotalBytes: number;
}> = {},
) {
const cfg = {
ttlMs: 3_600_000,
maxBytes: 8_388_608,
maxImageBytes: 20_971_520,
maxTotalBytes: 134_217_728,
...overrides,
};
return {
getSandboxTtlMs: () => cfg.ttlMs,
getSandboxMaxBytes: () => cfg.maxBytes,
getSandboxMaxImageBytes: () => cfg.maxImageBytes,
getSandboxMaxTotalBytes: () => cfg.maxTotalBytes,
getSandboxPublicUrl: () => 'https://example.test',
} as any;
}
describe('SandboxStore', () => {
let store: SandboxStore;
afterEach(() => {
// Clear the unref'd sweep interval so it never leaks across tests.
store?.onModuleDestroy();
jest.useRealTimers();
});
it('put/get round-trips the exact bytes + mime and returns a UUID id', () => {
store = new SandboxStore(makeEnv());
const buf = Buffer.from('{"type":"doc","content":[]}', 'utf8');
const res = store.put(buf, 'application/json');
expect(isValidUUID(res.id)).toBe(true);
expect(res.size).toBe(buf.length);
const entry = store.get(res.id);
expect(entry).toBeDefined();
expect(entry!.buf.equals(buf)).toBe(true);
expect(entry!.mime).toBe('application/json');
});
it('computes sha256 over the body (matches a manual digest)', () => {
store = new SandboxStore(makeEnv());
const buf = Buffer.from('hello sandbox', 'utf8');
const expected = createHash('sha256').update(buf).digest('hex');
const res = store.put(buf, 'text/plain');
expect(res.sha256).toBe(expected);
expect(store.get(res.id)!.sha256).toBe(expected);
});
it('returns undefined for a missing id', () => {
store = new SandboxStore(makeEnv());
expect(store.get('11111111-1111-1111-1111-111111111111')).toBeUndefined();
});
it('lazily expires entries past the TTL (get returns undefined)', () => {
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-01-01T00:00:00Z'));
store = new SandboxStore(makeEnv({ ttlMs: 1000 }));
const res = store.put(Buffer.from('x'), 'text/plain');
expect(store.get(res.id)).toBeDefined();
jest.setSystemTime(new Date('2026-01-01T00:00:02Z')); // +2s > 1s TTL
expect(store.get(res.id)).toBeUndefined();
// Eviction also frees the byte accounting.
expect(store.bytes).toBe(0);
});
it('background sweep drops expired entries without a get()', () => {
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-01-01T00:00:00Z'));
store = new SandboxStore(makeEnv({ ttlMs: 1000 }));
store.put(Buffer.from('x'), 'text/plain');
expect(store.size).toBe(1);
jest.setSystemTime(new Date('2026-01-01T00:01:30Z')); // past TTL
jest.advanceTimersByTime(60_000); // fire the sweep interval
expect(store.size).toBe(0);
});
it('rejects a non-image blob over SANDBOX_MAX_BYTES', () => {
store = new SandboxStore(makeEnv({ maxBytes: 16 }));
expect(() => store.put(Buffer.alloc(17), 'application/json')).toThrow(
/per-blob cap/,
);
});
it('uses the larger image cap for image/* blobs', () => {
// 100 bytes exceeds the doc cap (16) but fits the image cap (1024).
store = new SandboxStore(makeEnv({ maxBytes: 16, maxImageBytes: 1024 }));
expect(() => store.put(Buffer.alloc(100), 'image/png')).not.toThrow();
// SVG counts as an image too.
expect(() => store.put(Buffer.alloc(100), 'image/svg+xml')).not.toThrow();
});
it('evicts oldest entries when the total cap would be exceeded', () => {
// Total cap 250 bytes; each blob 100 bytes -> only 2 fit at a time.
store = new SandboxStore(
makeEnv({ maxTotalBytes: 250, maxBytes: 1024 }),
);
const a = store.put(Buffer.alloc(100), 'application/json');
const b = store.put(Buffer.alloc(100), 'application/json');
const c = store.put(Buffer.alloc(100), 'application/json'); // evicts a
expect(store.get(a.id)).toBeUndefined(); // oldest evicted
expect(store.get(b.id)).toBeDefined();
expect(store.get(c.id)).toBeDefined();
expect(store.bytes).toBeLessThanOrEqual(250);
});
it('rejects a single blob larger than the whole total cap', () => {
store = new SandboxStore(
makeEnv({ maxTotalBytes: 50, maxBytes: 1024 }),
);
expect(() => store.put(Buffer.alloc(100), 'application/json')).toThrow(
/total store cap/,
);
});
it('putAndLink composes the anonymous /api/sb/<id> url with matching integrity', () => {
store = new SandboxStore(makeEnv());
const buf = Buffer.from('hello link', 'utf8');
const expected = createHash('sha256').update(buf).digest('hex');
const res = store.putAndLink(buf, 'image/png');
expect(res.uri).toMatch(/^https:\/\/example\.test\/api\/sb\/[0-9a-f-]{36}$/);
expect(res.sha256).toBe(expected);
expect(res.size).toBe(buf.length);
});
it('has()/remove() report and free a blob by id', () => {
store = new SandboxStore(makeEnv());
const { id } = store.put(Buffer.from('x'), 'text/plain');
expect(store.has(id)).toBe(true);
store.remove(id);
expect(store.has(id)).toBe(false);
expect(store.bytes).toBe(0);
});
it('asSink() round-trips put/has/evict through the anonymous uri', () => {
store = new SandboxStore(makeEnv());
const sink = store.asSink();
const buf = Buffer.from('sink bytes', 'utf8');
const r = sink.put(buf, 'image/png');
expect(sink.has(r.uri)).toBe(true);
sink.evict(r.uri);
expect(sink.has(r.uri)).toBe(false);
});
});

View File

@@ -0,0 +1,178 @@
import { Injectable, Logger, OnModuleDestroy } from '@nestjs/common';
import { createHash, randomUUID } from 'node:crypto';
import { EnvironmentService } from '../environment/environment.service';
import { SANDBOX_API_PATH } from './sandbox.constants';
// In-RAM, process-local blob store. No disk, no DB. Ephemeral by design: a
// restart empties it. A blob is addressed by an unguessable randomUUID() which
// IS the read capability — there are NO tokens. Each blob is immutable (its id
// never maps to changing content), so its sha256 is a perfect strong ETag.
export interface SandboxEntry {
buf: Buffer;
mime: string;
sha256: string;
expiresAt: number;
}
export interface SandboxPutResult {
id: string;
sha256: string;
size: number;
}
@Injectable()
export class SandboxStore implements OnModuleDestroy {
private readonly logger = new Logger(SandboxStore.name);
// Map preserves insertion order, so the first key is the oldest entry — used
// for FIFO eviction when the total-bytes RAM guard is exceeded.
private readonly map = new Map<string, SandboxEntry>();
private totalBytes = 0;
// Background sweep clears expired entries so never-fetched blobs do not linger
// until the next get(). unref()'d so it never holds the event loop open;
// cleared on module destroy. Mirrors the sweepTimer pattern in
// integrations/mcp/mcp.service.ts and packages/mcp/src/http.ts.
private readonly sweepIntervalMs = 60_000;
private readonly sweepTimer: NodeJS.Timeout;
constructor(private readonly environmentService: EnvironmentService) {
this.sweepTimer = setInterval(() => {
try {
this.sweep();
} catch (err) {
this.logger.error('Sandbox sweep failed', err as Error);
}
}, this.sweepIntervalMs);
this.sweepTimer.unref?.();
}
onModuleDestroy(): void {
clearInterval(this.sweepTimer);
}
/**
* Store a blob and return its read capability id + integrity metadata. The
* per-blob cap is chosen by mime (images get the larger image cap), and the
* total-store RAM guard evicts oldest entries to make room. Throws a clear
* error when a single blob cannot fit even after eviction. Blob bodies are
* never logged.
*/
put(buf: Buffer, mime: string): SandboxPutResult {
const perBlobCap = mime.startsWith('image/')
? this.environmentService.getSandboxMaxImageBytes()
: this.environmentService.getSandboxMaxBytes();
if (buf.length > perBlobCap) {
throw new Error(
`Sandbox blob of ${buf.length} bytes exceeds the ${perBlobCap}-byte per-blob cap`,
);
}
const maxTotal = this.environmentService.getSandboxMaxTotalBytes();
if (buf.length > maxTotal) {
throw new Error(
`Sandbox blob of ${buf.length} bytes exceeds the total store cap of ${maxTotal} bytes`,
);
}
// Drop expired entries first, then evict oldest until the new blob fits.
this.sweep();
while (this.totalBytes + buf.length > maxTotal && this.map.size > 0) {
const oldest = this.map.keys().next().value as string;
this.evict(oldest);
}
const id = randomUUID();
const sha256 = createHash('sha256').update(buf).digest('hex');
const expiresAt = Date.now() + this.environmentService.getSandboxTtlMs();
this.map.set(id, { buf, mime, sha256, expiresAt });
this.totalBytes += buf.length;
return { id, sha256, size: buf.length };
}
/**
* Store a blob and return its anonymous read URL plus integrity metadata.
* Owns the single sandbox-URL composition (`${publicBase}${SANDBOX_API_PATH}/
* <id>`) so callers never hand-build the route; the raw put() stays public for
* tests/low-level callers. sha256 is also the blob's strong ETag.
*/
putAndLink(
buf: Buffer,
mime: string,
): { uri: string; sha256: string; size: number } {
const stored = this.put(buf, mime);
const base = this.environmentService.getSandboxPublicUrl();
return {
uri: `${base}${SANDBOX_API_PATH}/${stored.id}`,
sha256: stored.sha256,
size: stored.size,
};
}
/**
* Adapter to the package's blob-sandbox sink contract `{ put, has, evict }`.
* The sink speaks anonymous `uri`s while the store is keyed by `id`, so this is
* the ONE place that maps a sandbox uri back to its id (the last path segment).
* Both wiring sites (embedded MCP + in-app agent tools) use this so the uri↔id
* mapping and URL composition live next to putAndLink, not copy-pasted.
*/
asSink(): {
put: (buf: Buffer, mime: string) => { uri: string; sha256: string; size: number };
has: (uri: string) => boolean;
evict: (uri: string) => void;
} {
const idOf = (uri: string) => uri.substring(uri.lastIndexOf('/') + 1);
return {
put: (buf, mime) => this.putAndLink(buf, mime),
has: (uri) => this.has(idOf(uri)),
evict: (uri) => this.remove(idOf(uri)),
};
}
/** True if the blob is still live (not evicted/expired). */
has(id: string): boolean {
return this.get(id) !== undefined;
}
/** Drop a blob by id (public wrapper over the private FIFO evict). */
remove(id: string): void {
this.evict(id);
}
/** Returns the entry, or undefined if missing OR expired (lazy expiry). */
get(id: string): SandboxEntry | undefined {
const entry = this.map.get(id);
if (!entry) return undefined;
if (entry.expiresAt <= Date.now()) {
this.evict(id);
return undefined;
}
return entry;
}
/** Current number of live entries (test/diagnostic helper). */
get size(): number {
return this.map.size;
}
/** Current total bytes held (test/diagnostic helper). */
get bytes(): number {
return this.totalBytes;
}
private evict(id: string): void {
const entry = this.map.get(id);
if (entry) {
this.totalBytes -= entry.buf.length;
this.map.delete(id);
}
}
private sweep(): void {
const now = Date.now();
for (const [id, entry] of this.map) {
if (entry.expiresAt <= now) {
this.evict(id);
}
}
}
}

View File

@@ -13,6 +13,7 @@ import fastifyCookie from '@fastify/cookie';
import fastifyIp from 'fastify-ip';
import { InternalLogFilter } from './common/logger/internal-log-filter';
import { EnvironmentService } from './integrations/environment/environment.service';
import { SANDBOX_API_PATH } from './integrations/sandbox/sandbox.constants';
import { resolveFrameHeader } from './common/helpers';
import { resolveTrustProxy } from './integrations/environment/trust-proxy.util';
@@ -126,6 +127,10 @@ async function bootstrap() {
'/api/workspace/create',
'/api/workspace/joined',
'/api/workspace/find-by-email',
// Anonymous in-RAM blob sandbox: a remote consumer fetches blobs by an
// unguessable UUID without any workspace host context, so the
// workspace-resolution gate must not apply.
SANDBOX_API_PATH,
];
if (

View File

@@ -1,124 +0,0 @@
import { Kysely } from 'kysely';
import { randomUUID } from 'node:crypto';
import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { SpaceMemberRepo } from '@docmost/db/repos/space/space-member.repo';
import { EventEmitter2 } from '@nestjs/event-emitter';
import { getTestDb, destroyTestDb, createWorkspace, createSpace } from './db';
/**
* `PageRepo.getEmbeddablePageIds` MUST stay in lockstep with
* `PageRepo.countEmbeddablePages` (page.repo.ts) — the bulk reindex iterates the
* ID set while the status endpoint reports the count as the live denominator, so
* if the two predicates ever diverge the "done X of Y" counter ends on the wrong
* total. Both share the SAME WHERE: a page qualifies iff it is non-deleted AND
* (text_content has a non-whitespace char OR it has a non-deleted embedding row).
*
* This is a DB-level invariant: the predicate lives in raw SQL (`text_content ~
* '[^[:space:]]'`) and an EXISTS subquery, so a unit test with mocked Kysely
* cannot observe it. We seed every boundary case against real Postgres and
* assert the returned ID set EQUALS the count (and is exactly the expected set).
* A future edit that touches one predicate but not the other turns this red.
*/
describe('PageRepo embeddable-page set: getEmbeddablePageIds <-> countEmbeddablePages [integration]', () => {
let db: Kysely<any>;
let repo: PageRepo;
let workspaceId: string;
let spaceId: string;
beforeAll(async () => {
db = getTestDb();
// Only the Kysely-backed query methods under test are exercised, so the
// SpaceMemberRepo / EventEmitter2 deps are never touched — stub them.
repo = new PageRepo(
db as any,
{} as unknown as SpaceMemberRepo,
{} as unknown as EventEmitter2,
);
workspaceId = (await createWorkspace(db)).id;
spaceId = (await createSpace(db, workspaceId)).id;
});
afterAll(async () => {
await destroyTestDb();
});
// Insert a page with explicit text_content / deleted_at (createPage in db.ts
// sets neither), returning its id so the test can assert membership.
async function insertPage(args: {
textContent: string | null;
deletedAt?: Date | null;
}): Promise<string> {
const id = randomUUID();
await db
.insertInto('pages')
.values({
id,
slugId: `slug-${id.slice(0, 8)}`,
title: `page-${id.slice(0, 8)}`,
spaceId,
workspaceId,
textContent: args.textContent,
deletedAt: args.deletedAt ?? null,
})
.execute();
return id;
}
// Insert one embedding chunk row for a page (NOT NULL columns + deleted_at).
async function insertEmbedding(
pageId: string,
opts: { deletedAt?: Date | null } = {},
): Promise<void> {
await db
.insertInto('pageEmbeddings')
.values({
id: randomUUID(),
workspaceId,
pageId,
spaceId,
chunkIndex: 0,
chunkStart: 0,
chunkLength: 1,
content: 'x',
modelName: 'test-model',
modelDimensions: 1,
deletedAt: opts.deletedAt ?? null,
})
.execute();
}
it('returns exactly the embeddable set and its size equals countEmbeddablePages', async () => {
// IN the set --------------------------------------------------------------
// (a) non-deleted page with real body text.
const withText = await insertPage({ textContent: 'hello world' });
// (b) non-deleted page with NO text but a live embedding row (EXISTS clause:
// a page that lost its text yet still has stale vectors must be visited
// so the reindex can clear them).
const noTextLiveEmbedding = await insertPage({ textContent: null });
await insertEmbedding(noTextLiveEmbedding);
// OUT of the set ----------------------------------------------------------
// (c) non-deleted, text_content NULL, no embeddings.
await insertPage({ textContent: null });
// (d) non-deleted, whitespace-only text (regex requires a non-space char).
await insertPage({ textContent: ' \n\t ' });
// (e) deleted page WITH body text — excluded by the non-deleted predicate.
await insertPage({
textContent: 'deleted but had text',
deletedAt: new Date(),
});
// (f) non-deleted, no text, with ONLY a DELETED embedding row — the EXISTS
// subquery filters pe.deleted_at IS NULL, so this stays out.
const onlyDeletedEmbedding = await insertPage({ textContent: null });
await insertEmbedding(onlyDeletedEmbedding, { deletedAt: new Date() });
const ids = await repo.getEmbeddablePageIds(workspaceId);
const count = await repo.countEmbeddablePages(workspaceId);
// The two queries agree on the size (the load-bearing lockstep invariant)...
expect(ids.length).toBe(count);
// ...and the set is exactly the two qualifying pages, nothing else.
expect(new Set(ids)).toEqual(new Set([withText, noTextLiveEmbedding]));
expect(count).toBe(2);
});
});

View File

@@ -16,7 +16,7 @@ license.
> that interface. Other Docmost MCPs are human-shaped — they expose "open the page" and
> "replace the page"; this one exposes the editing primitives a model is good at.
It exposes **38 tools** built around three ideas that the other Docmost MCPs do not
It exposes **40 tools** built around three ideas that the other Docmost MCPs do not
combine:
1. **Surgical, token-cheap edits.** Address a single block by id and patch it, or run
@@ -106,7 +106,7 @@ There are several Docmost MCPs. Here is a capability-by-capability comparison.
## Tools
All 38 tools, grouped by what you'd reach for them.
All 40 tools, grouped by what you'd reach for them.
### Exploration & retrieval
@@ -203,6 +203,14 @@ All 38 tools, grouped by what you'd reach for them.
node referencing the old attachment (recursively, including callouts/tables) via the
live document, preserving comments, alignment and alt text. (In-place overwrite is
deliberately avoided — some Docmost versions corrupt the attachment on overwrite.)
- **`stash_page`** — Serialize a whole page (its full ProseMirror JSON) into an ephemeral
in-RAM blob and return ONLY a short anonymous URL — the body never enters the model
context, so it is the way to hand a large page (and its images) to an external consumer
without truncation. Every internal file/image attachment is mirrored into the same
sandbox and its `src` rewritten to a sandbox URL; external http(s) images are left
untouched. Returns `{ uri, size, sha256, images:{ mirrored, failed } }` (`sha256` is also
the blob's ETag). Blobs are RAM-only, expire after a short TTL (~1h) and are bound to the
server instance that created them.
### Comments

View File

@@ -17,7 +17,7 @@
> «открыть страницу» и «заменить страницу»; этот даёт примитивы редактирования, в которых
> модель сильна.
Сервер предоставляет **38 инструментов**, построенных вокруг трёх идей, которые другие
Сервер предоставляет **40 инструментов**, построенных вокруг трёх идей, которые другие
Docmost-MCP не сочетают:
1. **Точечные, экономичные по токенам правки.** Адресуйте отдельный блок по id и патчите
@@ -109,7 +109,7 @@ Docmost-MCP не сочетают:
## Инструменты
Все 38 инструментов, сгруппированы по задачам, для которых вы их возьмёте.
Все 40 инструментов, сгруппированы по задачам, для которых вы их возьмёте.
### Чтение и поиск
@@ -209,6 +209,15 @@ Docmost-MCP не сочетают:
коллауты/таблицы), через живой документ, сохраняя комментарии, выравнивание и alt-текст.
(Перезапись «по месту» намеренно не используется — некоторые версии Docmost портят
вложение при перезаписи.)
- **`stash_page`** — Сериализовать страницу целиком (её полный ProseMirror JSON) в
эфемерный blob в оперативной памяти и вернуть ТОЛЬКО короткий анонимный URL — тело
никогда не попадает в контекст модели, поэтому это способ передать большую страницу
(вместе с её изображениями) внешнему потребителю без усечения. Каждое внутреннее
файловое/графическое вложение зеркалируется в тот же sandbox, а его `src` переписывается
на URL sandbox; внешние http(s)-изображения остаются нетронутыми. Возвращает
`{ uri, size, sha256, images:{ mirrored, failed } }` (`sha256` — это также ETag blob'а).
Blob'ы хранятся только в оперативной памяти, истекают через короткий TTL (~1 ч) и
привязаны к тому экземпляру сервера, который их создал.
### Комментарии

View File

@@ -7,6 +7,7 @@ import { TiptapTransformer } from "@hocuspocus/transformer";
import * as Y from "yjs";
import WebSocket from "ws";
import { convertProseMirrorToMarkdown } from "./lib/markdown-converter.js";
import { collectInternalFileNodes, normalizeFileUrl, resolveInternalFilePath, } from "./lib/internal-file-urls.js";
import { updatePageContentRealtime, replacePageContent, markdownToProseMirror, markdownToProseMirrorCanonical, mutatePageContent, buildCollabWsUrl, assertYjsEncodable, applyDocToFragment, } from "./lib/collaboration.js";
import { footnoteWarningsField } from "./lib/footnote-analyze.js";
import { buildPageTree } from "./lib/tree.js";
@@ -51,6 +52,13 @@ export class DocmostClient {
// its token instead of calling POST /auth/collab-token; on a 401/403 it is
// re-invoked once. Used by the internal agent to carry signed provenance.
getCollabTokenFn = null;
// Optional blob-sandbox sink for the stash tool. Null when not configured.
sandboxPut = null;
// Optional probes paired with the sink. `has` lets stashPage detect a blob
// FIFO-evicted by a LATER put in the same stash; `evict` lets it free this
// op's image blobs if the final doc put throws. Null when the sink omits them.
sandboxHas = null;
sandboxEvict = null;
// In-flight login dedup: when the token expires, the 401 interceptor,
// ensureAuthenticated, getCollabTokenWithReauth and the two multipart retries
// can all call login() at once. Memoizing a single promise collapses that
@@ -77,6 +85,11 @@ export class DocmostClient {
if (config.getCollabToken) {
this.getCollabTokenFn = config.getCollabToken;
}
if (config.sandbox) {
this.sandboxPut = config.sandbox.put;
this.sandboxHas = config.sandbox.has ?? null;
this.sandboxEvict = config.sandbox.evict ?? null;
}
this.client = axios.create({
baseURL: this.apiUrl,
// Default request timeout so a hung connection cannot wedge a per-page
@@ -605,6 +618,181 @@ export class DocmostClient {
content: data.content || { type: "doc", content: [] },
};
}
/**
* Fetch an INTERNAL Docmost file (authed loopback) for sandbox mirroring.
* `src` is normalized to `/api/files/<id>/<file>`; `this.client.baseURL`
* already ends in `/api`, so we strip the leading `/api` and request the
* relative path with the client's Authorization header. Returns the raw bytes
* and the response Content-Type (mime), defaulting to octet-stream.
*
* The fetch is size-bounded (hard 64 MiB ceiling) purely to protect memory;
* the authoritative per-blob cap is enforced by the sandbox `put`. The path is
* resolved via resolveInternalFilePath, which REJECTS (throws) any traversal
* or percent-encoded src that would let an attacker-controlled `attrs.src`
* escape `/api/files/` and reach another internal endpoint (SSRF). That throw
* happens before this.client.get, so a malicious src is counted as a failed
* mirror — it never reaches the network.
*/
async fetchInternalFile(src) {
const HARD_CEILING = 64 * 1024 * 1024; // 64 MiB memory guard
const relPath = resolveInternalFilePath(src);
const response = await this.client.get(relPath, {
responseType: "arraybuffer",
timeout: 30000,
maxContentLength: HARD_CEILING,
maxBodyLength: HARD_CEILING,
});
const buffer = Buffer.from(response.data);
if (buffer.length === 0) {
throw new Error(`Empty file response from "${src}"`);
}
const rawCt = response.headers?.["content-type"];
const mime = typeof rawCt === "string" && rawCt.length > 0
? rawCt.split(";")[0].trim().toLowerCase()
: "application/octet-stream";
return { buffer, mime };
}
/**
* Stash a page's full content into the in-RAM blob sandbox and return ONLY a
* short anonymous URL — the body never enters the model context (this is the
* whole point: ~30KB+ ProseMirror docs blow the model context if passed as a
* tool argument). Every INTERNAL file/image src (the type-agnostic criterion,
* so drawio/excalidraw/video/file nodes are covered too) is mirrored into the
* sandbox and its `src` rewritten to the sandbox URL, so an external consumer
* can fetch the images anonymously. External http(s) srcs are left untouched.
*
* Blobs live in RAM with a short TTL and are cleared on restart — consume the
* URLs within the TTL and one uptime. A failed image fetch never aborts the
* doc: the original src is kept and the failure counted.
*
* Returns { uri, sha256, size, images:{mirrored, failed} }. `uri` and `sha256`
* are for the document blob; `sha256` is also the blob's ETag (integrity).
*/
async stashPage(pageId) {
if (!this.sandboxPut) {
throw new Error("stash_page is unavailable: the blob sandbox is not configured on this server");
}
await this.ensureAuthenticated();
// Stash the SAME shape get_page_json returns (id/title/.../content), with a
// deep clone so the rewrite never mutates anything shared.
const pageJson = await this.getPageJson(pageId);
const cloned = structuredClone(pageJson);
// Group internal-file nodes by normalized src so each unique resource is
// fetched + stored ONCE (dedup), and every node sharing that src points at
// the one sandbox blob. Capture each node's ORIGINAL raw src per-node:
// dedup groups nodes whose normalized src is equal even when their raw srcs
// differ (e.g. `/api/files/...` vs the bare `/files/...`), so on a revert we
// must restore each node's own original value, not the group key.
const bySrc = new Map();
for (const node of collectInternalFileNodes(cloned.content)) {
const origSrc = String(node.attrs.src);
const src = normalizeFileUrl(origSrc);
const entry = { node, origSrc };
const group = bySrc.get(src);
if (group)
group.push(entry);
else
bySrc.set(src, [entry]);
}
let mirrored = 0;
let failed = 0;
// Record every successful mirror so it can be (a) reverted if its blob gets
// FIFO-evicted by a LATER put in this same stash, and (b) freed if the final
// doc put throws.
const mirrors = [];
const MAX_CONCURRENCY = 5;
const groups = [...bySrc.entries()];
for (let i = 0; i < groups.length; i += MAX_CONCURRENCY) {
const batch = groups.slice(i, i + MAX_CONCURRENCY);
await Promise.all(batch.map(async ([src, entries]) => {
try {
const { buffer, mime } = await this.fetchInternalFile(src);
// put may throw if the blob exceeds the per-blob/total caps.
const stored = this.sandboxPut(buffer, mime);
for (const entry of entries)
entry.node.attrs.src = stored.uri;
mirrors.push({ uri: stored.uri, entries });
mirrored++;
}
catch (err) {
// One bad/oversized image (or a rejected traversal src) must not
// abort the document. Logged unconditionally (never the blob body),
// matching the package's ungated console.warn convention.
failed++;
console.warn(`stash_page: failed to mirror "${src}": ${err instanceof Error ? err.message : String(err)}`);
}
}));
}
// Revert one mirror's nodes to their original internal srcs and re-count it
// as failed (its blob was FIFO-evicted before the doc could reference it
// safely).
const revertMirror = (mirror) => {
for (const entry of mirror.entries)
entry.node.attrs.src = entry.origSrc;
mirrored--;
failed++;
console.warn(`stash_page: mirrored blob ${mirror.uri} was evicted before the doc ` +
`could safely reference it; reverted its src and counted it as failed`);
};
// Pre-put reconciliation: an image put earlier in THIS stash can FIFO-evict
// an even-earlier image of the same stash. Drop those from the live set
// first so the first serialized doc is already mostly correct.
let liveMirrors = mirrors;
if (this.sandboxHas) {
liveMirrors = [];
for (const mirror of mirrors) {
if (this.sandboxHas(mirror.uri))
liveMirrors.push(mirror);
else
revertMirror(mirror);
}
}
// Put the document, then reconcile against eviction caused by the doc put
// ITSELF (the doc is newest, FIFO drops oldest = this stash's images). Each
// iteration reverts >=1 mirror, so the loop terminates (worst case: all
// images reverted and the doc references no sandbox image URLs).
let stored;
for (;;) {
const docBuf = Buffer.from(JSON.stringify(cloned), "utf8");
let docStored;
try {
docStored = this.sandboxPut(docBuf, "application/json");
}
catch (err) {
// The doc put failed (e.g. doc exceeds the cap). Free this op's image
// blobs instead of leaking them in RAM for the whole TTL, then
// re-throw.
if (this.sandboxEvict) {
for (const mirror of liveMirrors)
this.sandboxEvict(mirror.uri);
}
throw err;
}
if (!this.sandboxHas) {
stored = docStored;
break;
}
const evictedNow = liveMirrors.filter((m) => !this.sandboxHas(m.uri));
if (evictedNow.length === 0) {
stored = docStored;
break;
}
// The doc we just stored references now-dead blobs. Revert those nodes,
// drop the stale doc blob, and loop to re-serialize + re-put the
// corrected doc.
for (const mirror of evictedNow)
revertMirror(mirror);
liveMirrors = liveMirrors.filter((m) => this.sandboxHas(m.uri));
if (this.sandboxEvict)
this.sandboxEvict(docStored.uri);
}
return {
uri: stored.uri,
sha256: stored.sha256,
size: stored.size,
images: { mirrored, failed },
};
}
/**
* Compact outline of a page's top-level blocks (no full document body).
* Cheap way to locate sections/tables and grab block ids before drilling in

View File

@@ -285,6 +285,38 @@ export function createDocmostMcpServer(config) {
const result = await docmostClient.editPageText(pageId, edits);
return jsonContent(result);
});
// Tool: stash_page — returns a resource_link (NOT embedded text) so the doc
// body never enters the model context. Registered directly (not via
// registerShared) because that helper only emits text content. Also returns
// `structuredContent` carrying the full documented `{uri, sha256, size, images}`
// shape alongside the resource_link, so MCP clients receive the blob's sha256
// (its ETag, for integrity) and mirror counts, not just the link.
server.registerTool(SHARED_TOOL_SPECS.stashPage.mcpName, {
description: SHARED_TOOL_SPECS.stashPage.description,
inputSchema: SHARED_TOOL_SPECS.stashPage.buildShape(z),
}, async ({ pageId }) => {
const result = await docmostClient.stashPage(pageId);
return {
content: [
{
type: "resource_link",
uri: result.uri,
name: "page.json",
mimeType: "application/json",
size: result.size,
},
],
// Mirror the full documented result shape ({ uri, size, sha256, images })
// as structuredContent so MCP clients get the blob's sha256 (its ETag, for
// integrity) and the mirror counts, not just the resource_link.
structuredContent: {
uri: result.uri,
sha256: result.sha256,
size: result.size,
images: result.images,
},
};
});
// Tool: patch_node
server.registerTool("patch_node", {
description: "Replaces a single block identified by its attrs.id WITHOUT resending the " +

View File

@@ -0,0 +1,110 @@
// Detection + collection of INTERNAL Docmost file URLs inside a ProseMirror doc.
//
// An internal file URL is a relative path served by Docmost's authenticated
// attachment route (`GET /api/files/:fileId/:fileName`). It is useless to an
// external consumer (relative + needs a Docmost session), so the stash tool
// mirrors every such resource into the blob sandbox and rewrites its `src`.
//
// The criterion is "internal file URL", NOT the node TYPE: image, drawio,
// excalidraw, video and file nodes all carry such a `src`, so a type-agnostic
// walker covers them all. External http(s) srcs (CDNs) are left untouched.
//
// Mirrors editor-ext's isInternalFileUrl / normalizeFileUrl (kept as a local
// dup so the ESM mcp package does not depend on the editor-ext build).
function isInternalFileUrl(url) {
if (typeof url !== "string")
return false;
const normalized = url.trim();
return (normalized.startsWith("/api/files/") || normalized.startsWith("/files/"));
}
/** Normalize a bare `/files/...` src to the canonical `/api/files/...` form. */
export function normalizeFileUrl(src) {
const trimmed = src.trim();
if (trimmed.startsWith("/files/"))
return "/api" + trimmed;
return trimmed;
}
/**
* Resolve a page-content `src` into the safe, `/api`-relative path the stash
* tool may fetch over the authenticated loopback client — or THROW.
*
* SECURITY (SSRF / path-traversal): `src` comes from page content and is fully
* attacker-controllable. The mirroring fetch runs through the AUTHENTICATED
* loopback axios client whose baseURL ends in `/api`, so a naive
* `src.replace(/^\/api/, "")` lets a crafted value like
* `/api/files/../auth/whoami` collapse (via axios/WHATWG URL `..` resolution)
* into an ARBITRARY internal GET endpoint, whose authed response would then be
* stored in the anonymous sandbox (SSRF + data exfiltration). A prefix-only
* `startsWith("/api/files/")` check does NOT defend against this because the
* `..` segments are still present in the raw string and resolved later.
*
* This function defeats that by resolving the canonical pathname FIRST and only
* then asserting it still lives under `/api/files/`:
* - it rejects any percent-encoded dot/slash (`%2e` / `%2f`): the WHATWG URL
* parser collapses LITERAL `../` but does NOT decode `%2f` separators, so a
* content-controlled src must never be allowed to smuggle those past the
* canonicalization;
* - it resolves `new URL(trimmed, "http://internal.invalid").pathname`, which
* normalizes `..`/`.` segments (e.g. `/api/files/../auth/whoami` →
* `/api/auth/whoami`);
* - it then requires the canonical pathname to start with `/api/files/`, so a
* traversal that escaped that subtree is rejected.
*
* Returns the path RELATIVE to the `/api` base (e.g. `/files/<id>/<name>`),
* ready to hand to the loopback client. The throw happens BEFORE any network
* call, so a rejected src is counted as a failed mirror and its original src is
* kept (the per-image try/catch in stashPage never aborts the whole document).
*/
export function resolveInternalFilePath(src) {
const trimmed = src.trim();
// Percent-encoded dot/slash must never reach the URL canonicalizer: the
// WHATWG parser does NOT decode `%2f` into a path separator, so an encoded
// `..%2fauth` would survive canonicalization and still escape /api/files/.
if (/%2e|%2f/i.test(trimmed)) {
throw new Error(`Refusing internal file src with percent-encoded path segment: "${src}"`);
}
let pathname;
try {
// The base host is irrelevant (never contacted); it only lets the parser
// resolve a relative `src` and normalize `..`/`.` segments.
pathname = new URL(trimmed, "http://internal.invalid").pathname;
}
catch {
throw new Error(`Invalid internal file src: "${src}"`);
}
if (!pathname.startsWith("/api/files/")) {
throw new Error(`Refusing internal file src that escapes /api/files/: "${src}"`);
}
// Strip the `/api` base prefix; the loopback client's baseURL already ends
// in `/api`, so it expects the path relative to that (e.g. /files/<id>/<f>).
return pathname.replace(/^\/api/, "");
}
/**
* Recursively collect every node whose `attrs.src` is an internal file URL.
* Returns references to the live nodes (so the caller can rewrite `attrs.src`
* in place on its clone). Descends `content` arrays, covering callouts, tables,
* details and any other nested container.
*/
export function collectInternalFileNodes(doc) {
const out = [];
const visit = (node) => {
if (!node)
return;
if (Array.isArray(node)) {
for (const child of node)
visit(child);
return;
}
if (typeof node !== "object")
return;
if (node.attrs && isInternalFileUrl(node.attrs.src)) {
out.push(node);
}
if (Array.isArray(node.content)) {
for (const child of node.content)
visit(child);
}
};
visit(doc);
return out;
}

View File

@@ -209,4 +209,27 @@ export const SHARED_TOOL_SPECS = {
.describe('List of find/replace operations, applied in order'),
}),
},
// --- hand a large page to an external consumer without bloating context ---
stashPage: {
mcpName: 'stash_page',
inAppKey: 'stashPage',
description: 'Serialize a whole page (the full ProseMirror JSON, as get_page_json ' +
'returns) into an ephemeral in-memory blob and return ONLY a short ' +
'anonymous URL to it — the body NEVER enters the model context, so this ' +
'is the way to hand a large page (or its images) to an external consumer ' +
'without truncation. Every internal file/image attachment is mirrored ' +
'into the same sandbox and its src rewritten to a sandbox URL, so the ' +
'consumer can fetch the images anonymously too; external http(s) images ' +
'are left untouched. Returns { uri, size, sha256, images:{mirrored, ' +
'failed} }. Integrity: the blob is served with ETag = its sha256, so a ' +
'truncated/corrupted fetch is detectable. Blobs are RAM-only: they expire ' +
'after a short TTL (~1h) and are cleared on restart — consume the URL ' +
'within the TTL and one uptime, or re-stash. A blob is bound to the ' +
'server instance that created it: in a multi-replica deployment without ' +
'sticky sessions a blob stored on one instance is not retrievable via the ' +
'sandbox URL on another (it 404s like an expired one).',
buildShape: (z) => ({
pageId: z.string().min(1),
}),
},
};

View File

@@ -13,6 +13,11 @@ import { TiptapTransformer } from "@hocuspocus/transformer";
import * as Y from "yjs";
import WebSocket from "ws";
import { convertProseMirrorToMarkdown } from "./lib/markdown-converter.js";
import {
collectInternalFileNodes,
normalizeFileUrl,
resolveInternalFilePath,
} from "./lib/internal-file-urls.js";
import {
updatePageContentRealtime,
replacePageContent,
@@ -102,6 +107,14 @@ const MIME_TO_EXT: Record<string, string> = {
* Housed here (not in index.ts) so client.ts has no type dependency on index.ts;
* index.ts re-exports it for the package's public surface.
*/
// Sink the stash tool writes blobs into. The host app binds this to its in-RAM
// SandboxStore and composes the public `uri` (the package never sees the store
// or any env). `put` returns the anonymous read URL plus integrity metadata.
export type SandboxPut = (
buf: Buffer,
mime: string,
) => { uri: string; sha256: string; size: number };
export type DocmostMcpConfig = { apiUrl: string } & (
| { email: string; password: string }
| { getToken: () => Promise<string> } // returns a BARE JWT; the client adds "Bearer "
@@ -109,6 +122,15 @@ export type DocmostMcpConfig = { apiUrl: string } & (
// Optional collab-token provider (returns a ready collab JWT). Common to
// both branches; see the type doc above.
getCollabToken?: () => Promise<string>;
// Optional blob sandbox sink. Present only where the stash tool is wired;
// when absent, stash_page throws a clear "not configured" error. The
// optional `has`/`evict` probes let stashPage keep its mirror counts honest
// under the store's FIFO eviction (see stashPage); older sinks omit them.
sandbox?: {
put: SandboxPut;
has?: (uri: string) => boolean;
evict?: (uri: string) => void;
};
};
export class DocmostClient {
@@ -126,6 +148,13 @@ export class DocmostClient {
// its token instead of calling POST /auth/collab-token; on a 401/403 it is
// re-invoked once. Used by the internal agent to carry signed provenance.
private getCollabTokenFn: (() => Promise<string>) | null = null;
// Optional blob-sandbox sink for the stash tool. Null when not configured.
private sandboxPut: SandboxPut | null = null;
// Optional probes paired with the sink. `has` lets stashPage detect a blob
// FIFO-evicted by a LATER put in the same stash; `evict` lets it free this
// op's image blobs if the final doc put throws. Null when the sink omits them.
private sandboxHas: ((uri: string) => boolean) | null = null;
private sandboxEvict: ((uri: string) => void) | null = null;
// In-flight login dedup: when the token expires, the 401 interceptor,
// ensureAuthenticated, getCollabTokenWithReauth and the two multipart retries
// can all call login() at once. Memoizing a single promise collapses that
@@ -165,6 +194,11 @@ export class DocmostClient {
if (config.getCollabToken) {
this.getCollabTokenFn = config.getCollabToken;
}
if (config.sandbox) {
this.sandboxPut = config.sandbox.put;
this.sandboxHas = config.sandbox.has ?? null;
this.sandboxEvict = config.sandbox.evict ?? null;
}
this.client = axios.create({
baseURL: this.apiUrl,
// Default request timeout so a hung connection cannot wedge a per-page
@@ -767,6 +801,203 @@ export class DocmostClient {
};
}
/**
* Fetch an INTERNAL Docmost file (authed loopback) for sandbox mirroring.
* `src` is normalized to `/api/files/<id>/<file>`; `this.client.baseURL`
* already ends in `/api`, so we strip the leading `/api` and request the
* relative path with the client's Authorization header. Returns the raw bytes
* and the response Content-Type (mime), defaulting to octet-stream.
*
* The fetch is size-bounded (hard 64 MiB ceiling) purely to protect memory;
* the authoritative per-blob cap is enforced by the sandbox `put`. The path is
* resolved via resolveInternalFilePath, which REJECTS (throws) any traversal
* or percent-encoded src that would let an attacker-controlled `attrs.src`
* escape `/api/files/` and reach another internal endpoint (SSRF). That throw
* happens before this.client.get, so a malicious src is counted as a failed
* mirror — it never reaches the network.
*/
private async fetchInternalFile(
src: string,
): Promise<{ buffer: Buffer; mime: string }> {
const HARD_CEILING = 64 * 1024 * 1024; // 64 MiB memory guard
const relPath = resolveInternalFilePath(src);
const response = await this.client.get(relPath, {
responseType: "arraybuffer",
timeout: 30000,
maxContentLength: HARD_CEILING,
maxBodyLength: HARD_CEILING,
});
const buffer = Buffer.from(response.data);
if (buffer.length === 0) {
throw new Error(`Empty file response from "${src}"`);
}
const rawCt = response.headers?.["content-type"];
const mime =
typeof rawCt === "string" && rawCt.length > 0
? rawCt.split(";")[0].trim().toLowerCase()
: "application/octet-stream";
return { buffer, mime };
}
/**
* Stash a page's full content into the in-RAM blob sandbox and return ONLY a
* short anonymous URL — the body never enters the model context (this is the
* whole point: ~30KB+ ProseMirror docs blow the model context if passed as a
* tool argument). Every INTERNAL file/image src (the type-agnostic criterion,
* so drawio/excalidraw/video/file nodes are covered too) is mirrored into the
* sandbox and its `src` rewritten to the sandbox URL, so an external consumer
* can fetch the images anonymously. External http(s) srcs are left untouched.
*
* Blobs live in RAM with a short TTL and are cleared on restart — consume the
* URLs within the TTL and one uptime. A failed image fetch never aborts the
* doc: the original src is kept and the failure counted.
*
* Returns { uri, sha256, size, images:{mirrored, failed} }. `uri` and `sha256`
* are for the document blob; `sha256` is also the blob's ETag (integrity).
*/
async stashPage(pageId: string): Promise<{
uri: string;
sha256: string;
size: number;
images: { mirrored: number; failed: number };
}> {
if (!this.sandboxPut) {
throw new Error(
"stash_page is unavailable: the blob sandbox is not configured on this server",
);
}
await this.ensureAuthenticated();
// Stash the SAME shape get_page_json returns (id/title/.../content), with a
// deep clone so the rewrite never mutates anything shared.
const pageJson = await this.getPageJson(pageId);
const cloned: any = structuredClone(pageJson);
// Group internal-file nodes by normalized src so each unique resource is
// fetched + stored ONCE (dedup), and every node sharing that src points at
// the one sandbox blob. Capture each node's ORIGINAL raw src per-node:
// dedup groups nodes whose normalized src is equal even when their raw srcs
// differ (e.g. `/api/files/...` vs the bare `/files/...`), so on a revert we
// must restore each node's own original value, not the group key.
const bySrc = new Map<string, Array<{ node: any; origSrc: string }>>();
for (const node of collectInternalFileNodes(cloned.content)) {
const origSrc = String(node.attrs.src);
const src = normalizeFileUrl(origSrc);
const entry = { node, origSrc };
const group = bySrc.get(src);
if (group) group.push(entry);
else bySrc.set(src, [entry]);
}
let mirrored = 0;
let failed = 0;
// Record every successful mirror so it can be (a) reverted if its blob gets
// FIFO-evicted by a LATER put in this same stash, and (b) freed if the final
// doc put throws.
const mirrors: Array<{
uri: string;
entries: Array<{ node: any; origSrc: string }>;
}> = [];
const MAX_CONCURRENCY = 5;
const groups = [...bySrc.entries()];
for (let i = 0; i < groups.length; i += MAX_CONCURRENCY) {
const batch = groups.slice(i, i + MAX_CONCURRENCY);
await Promise.all(
batch.map(async ([src, entries]) => {
try {
const { buffer, mime } = await this.fetchInternalFile(src);
// put may throw if the blob exceeds the per-blob/total caps.
const stored = this.sandboxPut!(buffer, mime);
for (const entry of entries) entry.node.attrs.src = stored.uri;
mirrors.push({ uri: stored.uri, entries });
mirrored++;
} catch (err) {
// One bad/oversized image (or a rejected traversal src) must not
// abort the document. Logged unconditionally (never the blob body),
// matching the package's ungated console.warn convention.
failed++;
console.warn(
`stash_page: failed to mirror "${src}": ${
err instanceof Error ? err.message : String(err)
}`,
);
}
}),
);
}
// Revert one mirror's nodes to their original internal srcs and re-count it
// as failed (its blob was FIFO-evicted before the doc could reference it
// safely).
const revertMirror = (mirror: {
uri: string;
entries: Array<{ node: any; origSrc: string }>;
}) => {
for (const entry of mirror.entries) entry.node.attrs.src = entry.origSrc;
mirrored--;
failed++;
console.warn(
`stash_page: mirrored blob ${mirror.uri} was evicted before the doc ` +
`could safely reference it; reverted its src and counted it as failed`,
);
};
// Pre-put reconciliation: an image put earlier in THIS stash can FIFO-evict
// an even-earlier image of the same stash. Drop those from the live set
// first so the first serialized doc is already mostly correct.
let liveMirrors = mirrors;
if (this.sandboxHas) {
liveMirrors = [];
for (const mirror of mirrors) {
if (this.sandboxHas(mirror.uri)) liveMirrors.push(mirror);
else revertMirror(mirror);
}
}
// Put the document, then reconcile against eviction caused by the doc put
// ITSELF (the doc is newest, FIFO drops oldest = this stash's images). Each
// iteration reverts >=1 mirror, so the loop terminates (worst case: all
// images reverted and the doc references no sandbox image URLs).
let stored: { uri: string; sha256: string; size: number };
for (;;) {
const docBuf = Buffer.from(JSON.stringify(cloned), "utf8");
let docStored: { uri: string; sha256: string; size: number };
try {
docStored = this.sandboxPut(docBuf, "application/json");
} catch (err) {
// The doc put failed (e.g. doc exceeds the cap). Free this op's image
// blobs instead of leaking them in RAM for the whole TTL, then
// re-throw.
if (this.sandboxEvict) {
for (const mirror of liveMirrors) this.sandboxEvict(mirror.uri);
}
throw err;
}
if (!this.sandboxHas) {
stored = docStored;
break;
}
const evictedNow = liveMirrors.filter((m) => !this.sandboxHas!(m.uri));
if (evictedNow.length === 0) {
stored = docStored;
break;
}
// The doc we just stored references now-dead blobs. Revert those nodes,
// drop the stale doc blob, and loop to re-serialize + re-put the
// corrected doc.
for (const mirror of evictedNow) revertMirror(mirror);
liveMirrors = liveMirrors.filter((m) => this.sandboxHas!(m.uri));
if (this.sandboxEvict) this.sandboxEvict(docStored.uri);
}
return {
uri: stored.uri,
sha256: stored.sha256,
size: stored.size,
images: { mirrored, failed },
};
}
/**
* Compact outline of a page's top-level blocks (no full document body).
* Cheap way to locate sections/tables and grab block ids before drilling in

View File

@@ -408,6 +408,43 @@ registerShared(SHARED_TOOL_SPECS.editPageText, async ({ pageId, edits }) => {
return jsonContent(result);
});
// Tool: stash_page — returns a resource_link (NOT embedded text) so the doc
// body never enters the model context. Registered directly (not via
// registerShared) because that helper only emits text content. Also returns
// `structuredContent` carrying the full documented `{uri, sha256, size, images}`
// shape alongside the resource_link, so MCP clients receive the blob's sha256
// (its ETag, for integrity) and mirror counts, not just the link.
server.registerTool(
SHARED_TOOL_SPECS.stashPage.mcpName,
{
description: SHARED_TOOL_SPECS.stashPage.description,
inputSchema: SHARED_TOOL_SPECS.stashPage.buildShape!(z),
},
async ({ pageId }: { pageId: string }) => {
const result = await docmostClient.stashPage(pageId);
return {
content: [
{
type: "resource_link" as const,
uri: result.uri,
name: "page.json",
mimeType: "application/json",
size: result.size,
},
],
// Mirror the full documented result shape ({ uri, size, sha256, images })
// as structuredContent so MCP clients get the blob's sha256 (its ETag, for
// integrity) and the mirror counts, not just the resource_link.
structuredContent: {
uri: result.uri,
sha256: result.sha256,
size: result.size,
images: result.images,
},
};
},
);
// Tool: patch_node
server.registerTool(
"patch_node",

View File

@@ -0,0 +1,113 @@
// Detection + collection of INTERNAL Docmost file URLs inside a ProseMirror doc.
//
// An internal file URL is a relative path served by Docmost's authenticated
// attachment route (`GET /api/files/:fileId/:fileName`). It is useless to an
// external consumer (relative + needs a Docmost session), so the stash tool
// mirrors every such resource into the blob sandbox and rewrites its `src`.
//
// The criterion is "internal file URL", NOT the node TYPE: image, drawio,
// excalidraw, video and file nodes all carry such a `src`, so a type-agnostic
// walker covers them all. External http(s) srcs (CDNs) are left untouched.
//
// Mirrors editor-ext's isInternalFileUrl / normalizeFileUrl (kept as a local
// dup so the ESM mcp package does not depend on the editor-ext build).
function isInternalFileUrl(url: unknown): boolean {
if (typeof url !== "string") return false;
const normalized = url.trim();
return (
normalized.startsWith("/api/files/") || normalized.startsWith("/files/")
);
}
/** Normalize a bare `/files/...` src to the canonical `/api/files/...` form. */
export function normalizeFileUrl(src: string): string {
const trimmed = src.trim();
if (trimmed.startsWith("/files/")) return "/api" + trimmed;
return trimmed;
}
/**
* Resolve a page-content `src` into the safe, `/api`-relative path the stash
* tool may fetch over the authenticated loopback client — or THROW.
*
* SECURITY (SSRF / path-traversal): `src` comes from page content and is fully
* attacker-controllable. The mirroring fetch runs through the AUTHENTICATED
* loopback axios client whose baseURL ends in `/api`, so a naive
* `src.replace(/^\/api/, "")` lets a crafted value like
* `/api/files/../auth/whoami` collapse (via axios/WHATWG URL `..` resolution)
* into an ARBITRARY internal GET endpoint, whose authed response would then be
* stored in the anonymous sandbox (SSRF + data exfiltration). A prefix-only
* `startsWith("/api/files/")` check does NOT defend against this because the
* `..` segments are still present in the raw string and resolved later.
*
* This function defeats that by resolving the canonical pathname FIRST and only
* then asserting it still lives under `/api/files/`:
* - it rejects any percent-encoded dot/slash (`%2e` / `%2f`): the WHATWG URL
* parser collapses LITERAL `../` but does NOT decode `%2f` separators, so a
* content-controlled src must never be allowed to smuggle those past the
* canonicalization;
* - it resolves `new URL(trimmed, "http://internal.invalid").pathname`, which
* normalizes `..`/`.` segments (e.g. `/api/files/../auth/whoami` →
* `/api/auth/whoami`);
* - it then requires the canonical pathname to start with `/api/files/`, so a
* traversal that escaped that subtree is rejected.
*
* Returns the path RELATIVE to the `/api` base (e.g. `/files/<id>/<name>`),
* ready to hand to the loopback client. The throw happens BEFORE any network
* call, so a rejected src is counted as a failed mirror and its original src is
* kept (the per-image try/catch in stashPage never aborts the whole document).
*/
export function resolveInternalFilePath(src: string): string {
const trimmed = src.trim();
// Percent-encoded dot/slash must never reach the URL canonicalizer: the
// WHATWG parser does NOT decode `%2f` into a path separator, so an encoded
// `..%2fauth` would survive canonicalization and still escape /api/files/.
if (/%2e|%2f/i.test(trimmed)) {
throw new Error(
`Refusing internal file src with percent-encoded path segment: "${src}"`,
);
}
let pathname: string;
try {
// The base host is irrelevant (never contacted); it only lets the parser
// resolve a relative `src` and normalize `..`/`.` segments.
pathname = new URL(trimmed, "http://internal.invalid").pathname;
} catch {
throw new Error(`Invalid internal file src: "${src}"`);
}
if (!pathname.startsWith("/api/files/")) {
throw new Error(
`Refusing internal file src that escapes /api/files/: "${src}"`,
);
}
// Strip the `/api` base prefix; the loopback client's baseURL already ends
// in `/api`, so it expects the path relative to that (e.g. /files/<id>/<f>).
return pathname.replace(/^\/api/, "");
}
/**
* Recursively collect every node whose `attrs.src` is an internal file URL.
* Returns references to the live nodes (so the caller can rewrite `attrs.src`
* in place on its clone). Descends `content` arrays, covering callouts, tables,
* details and any other nested container.
*/
export function collectInternalFileNodes(doc: unknown): any[] {
const out: any[] = [];
const visit = (node: any): void => {
if (!node) return;
if (Array.isArray(node)) {
for (const child of node) visit(child);
return;
}
if (typeof node !== "object") return;
if (node.attrs && isInternalFileUrl(node.attrs.src)) {
out.push(node);
}
if (Array.isArray(node.content)) {
for (const child of node.content) visit(child);
}
};
visit(doc);
return out;
}

View File

@@ -266,4 +266,29 @@ export const SHARED_TOOL_SPECS = {
.describe('List of find/replace operations, applied in order'),
}),
},
// --- hand a large page to an external consumer without bloating context ---
stashPage: {
mcpName: 'stash_page',
inAppKey: 'stashPage',
description:
'Serialize a whole page (the full ProseMirror JSON, as get_page_json ' +
'returns) into an ephemeral in-memory blob and return ONLY a short ' +
'anonymous URL to it — the body NEVER enters the model context, so this ' +
'is the way to hand a large page (or its images) to an external consumer ' +
'without truncation. Every internal file/image attachment is mirrored ' +
'into the same sandbox and its src rewritten to a sandbox URL, so the ' +
'consumer can fetch the images anonymously too; external http(s) images ' +
'are left untouched. Returns { uri, size, sha256, images:{mirrored, ' +
'failed} }. Integrity: the blob is served with ETag = its sha256, so a ' +
'truncated/corrupted fetch is detectable. Blobs are RAM-only: they expire ' +
'after a short TTL (~1h) and are cleared on restart — consume the URL ' +
'within the TTL and one uptime, or re-stash. A blob is bound to the ' +
'server instance that created it: in a multi-replica deployment without ' +
'sticky sessions a blob stored on one instance is not retrievable via the ' +
'sandbox URL on another (it 404s like an expired one).',
buildShape: (z) => ({
pageId: z.string().min(1),
}),
},
} satisfies Record<string, SharedToolSpec>;

View File

@@ -0,0 +1,155 @@
// Server round-trip test for the stash_page MCP tool result shape. The in-app
// path returns the full documented `{ uri, size, sha256, images }` object, but
// the MCP transport must deliver the SAME shape: a resource_link (primary
// payload) PLUS a `structuredContent` mirror carrying sha256 + image counts.
// This connects a real MCP Client to the server over a linked in-memory
// transport pair and asserts both halves of the result, end to end.
import { test, after } from "node:test";
import assert from "node:assert/strict";
import http from "node:http";
import { createHash } from "node:crypto";
import { createDocmostMcpServer } from "../../build/index.js";
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { InMemoryTransport } from "@modelcontextprotocol/sdk/inMemory.js";
function readBody(req) {
return new Promise((resolve) => {
let raw = "";
req.on("data", (c) => (raw += c));
req.on("end", () => resolve(raw));
});
}
function startServer(handler) {
return new Promise((resolve) => {
const server = http.createServer(handler);
server.listen(0, "127.0.0.1", () => {
const { port } = server.address();
resolve({ server, baseURL: `http://127.0.0.1:${port}/api` });
});
});
}
const openServers = [];
async function spawn(handler) {
const { server, baseURL } = await startServer(handler);
openServers.push(server);
return baseURL;
}
after(async () => {
await Promise.all(openServers.map((s) => new Promise((r) => s.close(r))));
});
// Minimal in-memory sandbox sink: store the blob and return a uri + sha256 +
// size, with has/evict probes the client's reconciliation may call.
function makeSandbox() {
const live = new Map();
const idOf = (uri) => uri.substring(uri.lastIndexOf("/") + 1);
let n = 0;
return {
put(buf) {
const sha256 = createHash("sha256").update(buf).digest("hex");
const id = `id-${n++}`;
live.set(id, buf.length);
return { uri: `https://sb.test/api/sb/${id}`, sha256, size: buf.length };
},
has(uri) {
return live.has(idOf(uri));
},
evict(uri) {
live.delete(idOf(uri));
},
};
}
const IMAGE_BYTES = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a]);
// One internal image (so images.mirrored === 1) inside a normal page doc.
function pageDoc() {
return {
type: "doc",
content: [
{
type: "image",
attrs: { src: "/api/files/att-1/pic.png", attachmentId: "att-1" },
},
],
};
}
// Mock Docmost: login, page info, internal file bytes — same pattern as
// stash-page.test.mjs.
async function buildBaseURL() {
return spawn(async (req, res) => {
await readBody(req);
if (req.url === "/api/auth/login") {
res.writeHead(200, {
"Content-Type": "application/json",
"Set-Cookie": "authToken=tok; HttpOnly",
});
res.end(JSON.stringify({ token: "tok" }));
return;
}
if (req.url === "/api/pages/info") {
res.writeHead(200, { "Content-Type": "application/json" });
res.end(
JSON.stringify({ data: { id: "page-1", title: "T", content: pageDoc() } }),
);
return;
}
if (req.url.startsWith("/api/files/")) {
res.writeHead(200, { "Content-Type": "image/png" });
res.end(IMAGE_BYTES);
return;
}
res.writeHead(404);
res.end();
});
}
test("stash_page MCP tool returns a resource_link AND a structuredContent mirror", async () => {
const baseURL = await buildBaseURL();
const sandbox = makeSandbox();
const server = createDocmostMcpServer({
apiUrl: baseURL,
email: "u@example.com",
password: "pw",
sandbox,
});
const client = new Client({ name: "test-client", version: "0.0.0" });
const [a, b] = InMemoryTransport.createLinkedPair();
await server.connect(b);
await client.connect(a);
try {
const res = await client.callTool({
name: "stash_page",
arguments: { pageId: "page-1" },
});
// Primary payload: a resource_link pointing at the sandbox doc blob.
const link = res.content[0];
assert.equal(link.type, "resource_link");
assert.match(link.uri, /^https:\/\/sb\.test\/api\/sb\//);
// structuredContent mirrors the full documented shape.
const sc = res.structuredContent;
assert.equal(typeof sc, "object");
assert.equal(sc.uri, link.uri); // same blob as the link
assert.match(sc.sha256, /^[0-9a-f]{64}$/); // 64-hex ETag
assert.equal(typeof sc.size, "number");
assert.deepEqual(sc.images, { mirrored: 1, failed: 0 });
// Deep-equal the whole structured payload against what the mock implies.
assert.deepEqual(sc, {
uri: link.uri,
sha256: sc.sha256,
size: sc.size,
images: { mirrored: 1, failed: 0 },
});
} finally {
await client.close();
await server.close();
}
});

View File

@@ -0,0 +1,378 @@
// Mock-HTTP test for DocmostClient.stashPage: a local http server stands in for
// Docmost so the whole flow stays deterministic and offline. Asserts the tool
// (1) serializes the page into the sandbox and returns ONLY a link (uri + sha256
// + size), never the body; (2) mirrors INTERNAL image srcs into the sandbox and
// rewrites them to the sandbox uri; (3) leaves EXTERNAL http(s) srcs untouched;
// (4) de-duplicates a repeated internal src to a single blob; (5) counts a
// failed image fetch without aborting the document.
import { test, after } from "node:test";
import assert from "node:assert/strict";
import http from "node:http";
import { createHash } from "node:crypto";
import { DocmostClient } from "../../build/client.js";
function readBody(req) {
return new Promise((resolve) => {
let raw = "";
req.on("data", (c) => (raw += c));
req.on("end", () => resolve(raw));
});
}
function startServer(handler) {
return new Promise((resolve) => {
const server = http.createServer(handler);
server.listen(0, "127.0.0.1", () => {
const { port } = server.address();
resolve({ server, baseURL: `http://127.0.0.1:${port}/api` });
});
});
}
const openServers = [];
async function spawn(handler) {
const { server, baseURL } = await startServer(handler);
openServers.push(server);
return baseURL;
}
after(async () => {
await Promise.all(openServers.map((s) => new Promise((r) => s.close(r))));
});
// In-memory sandbox sink mirroring the host binding: store the blob, return a
// uri + sha256 + size. Records every put so the test can inspect what was
// stashed (and verify the doc body never leaves via the return value). Models
// the real store's FIFO eviction + cap + the has/evict probes so B1 (self-
// eviction reconciliation and doc-put-throw cleanup) is testable. Default
// maxTotal is effectively unlimited so the happy-path tests behave as before.
//
// `throwOnJson` forces the final document put to throw, standing in for "doc
// exceeds the cap".
function makeSandbox({ maxTotal = Infinity, throwOnJson = false } = {}) {
const puts = [];
const evicted = [];
// id -> size, in insertion order (Map preserves it) so the oldest is first.
const live = new Map();
let total = 0;
const idOf = (uri) => uri.substring(uri.lastIndexOf("/") + 1);
return {
puts,
evicted,
put(buf, mime) {
if (throwOnJson && mime === "application/json") {
throw new Error("doc blob exceeds the sandbox cap");
}
const sha256 = createHash("sha256").update(buf).digest("hex");
const id = `id-${puts.length}`;
puts.push({ buf, mime, sha256, id });
live.set(id, buf.length);
total += buf.length;
// FIFO-evict the oldest live blobs until this put fits under the cap.
while (total > maxTotal && live.size > 0) {
const oldest = live.keys().next().value;
if (oldest === id) break; // never evict the blob we just stored
total -= live.get(oldest);
live.delete(oldest);
evicted.push(oldest);
}
return { uri: `https://sb.test/api/sb/${id}`, sha256, size: buf.length };
},
has(uri) {
return live.has(idOf(uri));
},
evict(uri) {
const id = idOf(uri);
if (live.has(id)) {
total -= live.get(id);
live.delete(id);
}
evicted.push(id);
},
};
}
const IMAGE_BYTES = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a]); // "PNG" header-ish
function pageDoc() {
return {
type: "doc",
content: [
{
type: "image",
attrs: { src: "/api/files/att-1/pic.png", attachmentId: "att-1", width: 100 },
},
// Same internal src again -> must dedup to ONE blob, both rewritten.
{
type: "image",
attrs: { src: "/api/files/att-1/pic.png", attachmentId: "att-1", width: 50 },
},
// External CDN image -> must be left untouched.
{
type: "image",
attrs: { src: "https://cdn.example.com/remote.png" },
},
],
};
}
// Build a client wired to a server that logs in, serves the page, and serves the
// internal file bytes. `fileStatus` lets a test force the file fetch to fail;
// `doc` overrides the served page; `fileBytes`/`fileHeaders` shape the file
// response (used by the empty-body / missing-Content-Type branch tests).
async function buildClient(
sandbox,
{
fileStatus = 200,
doc = pageDoc(),
fileBytes = IMAGE_BYTES,
fileHeaders = { "Content-Type": "image/png" },
} = {},
) {
const baseURL = await spawn(async (req, res) => {
await readBody(req);
if (req.url === "/api/auth/login") {
res.writeHead(200, {
"Content-Type": "application/json",
"Set-Cookie": "authToken=tok; HttpOnly",
});
res.end(JSON.stringify({ token: "tok" }));
return;
}
if (req.url === "/api/pages/info") {
res.writeHead(200, { "Content-Type": "application/json" });
res.end(JSON.stringify({ data: { id: "page-1", title: "T", content: doc } }));
return;
}
if (req.url.startsWith("/api/files/")) {
if (fileStatus !== 200) {
res.writeHead(fileStatus);
res.end();
return;
}
res.writeHead(200, fileHeaders);
res.end(fileBytes);
return;
}
res.writeHead(404);
res.end();
});
return new DocmostClient({
apiUrl: baseURL,
email: "u@example.com",
password: "pw",
sandbox: {
put: (buf, mime) => sandbox.put(buf, mime),
has: (uri) => sandbox.has(uri),
evict: (uri) => sandbox.evict(uri),
},
});
}
// A page with several DISTINCT internal images (each a unique attachment id) so
// each is its own sandbox blob — needed to exercise FIFO self-eviction.
function multiImageDoc(n) {
return {
type: "doc",
content: Array.from({ length: n }, (_, i) => ({
type: "image",
attrs: { src: `/api/files/att-${i}/pic.png`, attachmentId: `att-${i}` },
})),
};
}
test("stashPage stores the doc + mirrors/rewrites internal images, returns only a link", async () => {
const sandbox = makeSandbox();
const client = await buildClient(sandbox);
const result = await client.stashPage("page-1");
// Returns ONLY a link shape — never the document body.
assert.equal(typeof result.uri, "string");
assert.match(result.uri, /^https:\/\/sb\.test\/api\/sb\//);
assert.equal(typeof result.sha256, "string");
assert.equal(typeof result.size, "number");
assert.ok(!("doc" in result) && !("content" in result) && !("body" in result));
assert.deepEqual(result.images, { mirrored: 1, failed: 0 });
// One image blob (dedup) + one doc blob = 2 puts.
assert.equal(sandbox.puts.length, 2);
const imagePut = sandbox.puts[0];
const docPut = sandbox.puts[1];
assert.equal(imagePut.mime, "image/png");
assert.ok(imagePut.buf.equals(IMAGE_BYTES));
assert.equal(docPut.mime, "application/json");
// The returned uri/sha256 are the DOCUMENT blob's.
assert.equal(result.sha256, docPut.sha256);
// Inspect the stashed document: internal srcs rewritten, external untouched.
const stashed = JSON.parse(docPut.buf.toString("utf8"));
const imgs = stashed.content.content.filter((n) => n.type === "image");
assert.equal(imgs[0].attrs.src, "https://sb.test/api/sb/id-0");
assert.equal(imgs[1].attrs.src, "https://sb.test/api/sb/id-0"); // same blob (dedup)
assert.equal(imgs[2].attrs.src, "https://cdn.example.com/remote.png"); // external kept
});
test("stashPage counts a failed image fetch without aborting the document", async () => {
const sandbox = makeSandbox();
const client = await buildClient(sandbox, { fileStatus: 500 });
const result = await client.stashPage("page-1");
assert.deepEqual(result.images, { mirrored: 0, failed: 1 });
// Only the doc blob was stored (image fetch failed).
assert.equal(sandbox.puts.length, 1);
assert.equal(sandbox.puts[0].mime, "application/json");
// The failed internal src is LEFT as-is so nothing is silently dropped.
const stashed = JSON.parse(sandbox.puts[0].buf.toString("utf8"));
const imgs = stashed.content.content.filter((n) => n.type === "image");
assert.equal(imgs[0].attrs.src, "/api/files/att-1/pic.png");
});
test("stashPage throws a clear error when no sandbox is configured", async () => {
const baseURL = await spawn(async (req, res) => {
await readBody(req);
res.writeHead(200, { "Content-Type": "application/json" });
res.end(JSON.stringify({}));
});
const client = new DocmostClient({
apiUrl: baseURL,
email: "u@example.com",
password: "pw",
});
await assert.rejects(() => client.stashPage("page-1"), /not configured/);
});
test("stashPage reverts a FIFO-evicted image and counts it as failed (B1)", async () => {
// 3 distinct images of S=4000 bytes each; doc JSON is far smaller than one
// image. With a cap of 4500: storing img1 evicts img0, storing img2 evicts
// img1 — so only img2 survives the loop (img0 + img1 reverted). The doc
// (4000 + a few hundred bytes <= 4500) then fits alongside the survivor, so it
// does NOT trigger further eviction. The stored doc must therefore reference
// exactly one live blob and revert the other two to their internal srcs.
const BIG = Buffer.alloc(4000, 0x41);
const sandbox = makeSandbox({ maxTotal: 4500 });
const client = await buildClient(sandbox, {
doc: multiImageDoc(3),
fileBytes: BIG,
});
const result = await client.stashPage("page-1");
// Two images were evicted before the doc was stored -> counted as failed.
assert.deepEqual(result.images, { mirrored: 1, failed: 2 });
// Inspect the stashed doc: no node may point at an evicted (now-dead) blob,
// and every reverted node carries its ORIGINAL internal src again.
const docPut = sandbox.puts.find((p) => p.mime === "application/json");
const stashed = JSON.parse(docPut.buf.toString("utf8"));
const imgs = stashed.content.content.filter((n) => n.type === "image");
let live = 0;
let reverted = 0;
for (const img of imgs) {
const src = img.attrs.src;
if (src.startsWith("https://sb.test/api/sb/")) {
assert.ok(sandbox.has(src), `doc references evicted blob ${src}`);
live++;
} else {
// Reverted to the original internal src.
assert.match(src, /^\/api\/files\/att-\d+\/pic\.png$/);
reverted++;
}
}
assert.equal(live, 1);
assert.equal(reverted, 2);
});
test("stashPage reverts an image evicted by the DOC put itself (after-put reconcile, B1)", async () => {
// Both images (1000 bytes each) survive the image phase: total 2000 <= cap
// 2500. The doc, however, serializes large (a node with a ~700-byte string
// attr), so putting it (newest) tips total over the cap and FIFO-evicts the
// OLDEST image (img0) — an eviction caused by the doc put itself, which only
// the after-put reconciliation can catch. The loop then reverts img0, drops
// the stale doc blob, and re-puts the corrected doc (now total = img1 +
// docSize <= cap, so img1 survives).
const BIG = Buffer.alloc(1000, 0x41);
const sandbox = makeSandbox({ maxTotal: 2500 });
const doc = {
type: "doc",
content: [
{ type: "image", attrs: { src: "/api/files/att-0/pic.png", attachmentId: "att-0" } },
{ type: "image", attrs: { src: "/api/files/att-1/pic.png", attachmentId: "att-1" } },
// Bulk the doc JSON up so the doc put crosses the cap on its own. Stays in
// the doc across reverts, so each re-serialization is similarly large.
{ type: "paragraph", attrs: { filler: "x".repeat(700) }, content: [] },
],
};
const client = await buildClient(sandbox, { doc, fileBytes: BIG });
const result = await client.stashPage("page-1");
// The doc put evicted exactly one image -> reverted + counted as failed.
assert.deepEqual(result.images, { mirrored: 1, failed: 1 });
// Use the LAST json put: the first (stale) doc referenced the now-dead blob
// and was itself evicted; the corrected re-put is the one that stands.
const docPut = sandbox.puts.filter((p) => p.mime === "application/json").at(-1);
const stashed = JSON.parse(docPut.buf.toString("utf8"));
const imgs = stashed.content.content.filter((n) => n.type === "image");
let live = 0;
let reverted = 0;
for (const img of imgs) {
const src = img.attrs.src;
if (src.startsWith("https://sb.test/api/sb/")) {
assert.ok(sandbox.has(src), `final doc references evicted blob ${src}`);
live++;
} else {
assert.match(src, /^\/api\/files\/att-\d+\/pic\.png$/);
reverted++;
}
}
assert.equal(live, 1);
assert.equal(reverted, 1);
});
test("stashPage frees image blobs when the doc put throws (B1)", async () => {
// Two distinct images mirror fine; the final JSON doc put throws (doc exceeds
// cap). stashPage must reject AND evict every image blob it stored this op.
const sandbox = makeSandbox({ throwOnJson: true });
const client = await buildClient(sandbox, { doc: multiImageDoc(2) });
await assert.rejects(() => client.stashPage("page-1"));
// Both image blobs were stored, then evicted on the doc-put failure.
const imagePuts = sandbox.puts.filter((p) => p.mime === "image/png");
assert.equal(imagePuts.length, 2);
for (const p of imagePuts) {
assert.ok(sandbox.evicted.includes(p.id), `image ${p.id} was not freed`);
}
});
test("stashPage counts an empty file response as failed (B1/fetchInternalFile)", async () => {
const sandbox = makeSandbox();
const client = await buildClient(sandbox, {
fileBytes: Buffer.alloc(0),
fileHeaders: { "Content-Type": "image/png", "Content-Length": "0" },
});
const result = await client.stashPage("page-1");
// The single internal image (deduped) yielded an empty body -> failed.
assert.deepEqual(result.images, { mirrored: 0, failed: 1 });
// Only the doc blob was stored.
assert.equal(sandbox.puts.filter((p) => p.mime === "image/png").length, 0);
});
test("stashPage mirrors a file with no Content-Type as octet-stream (fetchInternalFile)", async () => {
const sandbox = makeSandbox();
// No Content-Type header at all -> fetchInternalFile defaults to octet-stream.
const client = await buildClient(sandbox, { fileHeaders: {} });
const result = await client.stashPage("page-1");
assert.equal(result.images.mirrored, 1);
const imagePut = sandbox.puts.find((p) => p.mime !== "application/json");
assert.ok(imagePut, "expected an image put");
assert.equal(imagePut.mime, "application/octet-stream");
});

View File

@@ -0,0 +1,101 @@
// Unit tests for the internal-file URL helpers the stash tool relies on. The
// critical case is resolveInternalFilePath, whose whole job is to REJECT a
// content-controlled `src` that tries to escape /api/files/ (SSRF / traversal)
// before it ever reaches the authenticated loopback client.
import { test } from "node:test";
import assert from "node:assert/strict";
import {
resolveInternalFilePath,
normalizeFileUrl,
collectInternalFileNodes,
} from "../../build/lib/internal-file-urls.js";
test("resolveInternalFilePath accepts a normal internal src", () => {
assert.equal(
resolveInternalFilePath("/api/files/att-1/pic.png"),
"/files/att-1/pic.png",
);
});
test("resolveInternalFilePath rejects traversal / encoded variants (SSRF guard)", () => {
// `..` collapses to /api/auth/whoami -> outside /api/files/ -> rejected.
assert.throws(() => resolveInternalFilePath("/api/files/../auth/whoami"));
// Escapes the /api base entirely.
assert.throws(() => resolveInternalFilePath("/api/files/../../internal"));
// Percent-encoded dot -> rejected before canonicalization.
assert.throws(() => resolveInternalFilePath("/api/files/%2e%2e/x"));
// Percent-encoded slash separator -> rejected before canonicalization.
assert.throws(() => resolveInternalFilePath("/api/files/..%2fauth"));
});
test("resolveInternalFilePath drops a foreign host and keeps only the /api/files/ pathname (SSRF accept-path)", () => {
// ACCEPT path: an absolute URL has its host dropped; only the canonical
// pathname survives, and it must still start with /api/files/. This is SAFE
// because the loopback axios client ignores any host in `src` and uses its own
// /api baseURL — so a foreign host like evil.com is never contacted. This is
// the SOLE SSRF/traversal guard for content-controlled `src`, so it must be
// pinned: a future refactor to a prefix-only check would silently open a
// bypass with no failing test.
assert.equal(
resolveInternalFilePath("http://evil.com/api/files/x/y.png"),
"/files/x/y.png",
);
// Protocol-relative URL: host likewise dropped, pathname kept.
assert.equal(
resolveInternalFilePath("//evil.com/api/files/x/y.png"),
"/files/x/y.png",
);
});
test("resolveInternalFilePath rejects a foreign-host src whose pathname escapes /api/files/", () => {
// Even though the host is dropped, the canonical pathname /api/auth/whoami
// does NOT start with /api/files/, so it is rejected.
assert.throws(() =>
resolveInternalFilePath("https://evil.com/api/auth/whoami"),
);
// The WHATWG URL parser converts backslashes to `/` for http(s), so this
// collapses to /api/auth/whoami and escapes the /api/files/ subtree.
assert.throws(() => resolveInternalFilePath("/api/files\\..\\auth\\whoami"));
});
test("resolveInternalFilePath wraps a new URL parse failure in a clear error", () => {
// `http://[` has no %2e/%2f so it passes the first guard, then fails the
// `new URL(...)` parse — exercising the catch branch that re-throws with a
// clear message.
assert.throws(
() => resolveInternalFilePath("http://["),
/Invalid internal file src/,
);
});
test("normalizeFileUrl rewrites the bare /files/ branch and leaves /api/files/ alone", () => {
assert.equal(
normalizeFileUrl("/files/att-1/pic.png"),
"/api/files/att-1/pic.png",
);
assert.equal(
normalizeFileUrl("/api/files/att-1/pic.png"),
"/api/files/att-1/pic.png",
);
});
test("collectInternalFileNodes recurses into nested content containers", () => {
// The internal image is buried inside a callout's content array, so a
// regression on the recursion (e.g. a shallow .filter()) would miss it.
const nested = {
type: "image",
attrs: { src: "/api/files/att-9/deep.png", attachmentId: "att-9" },
};
const doc = {
type: "doc",
content: [
{
type: "callout",
content: [{ type: "paragraph", content: [nested] }],
},
],
};
const found = collectInternalFileNodes(doc);
assert.equal(found.length, 1);
assert.equal(found[0], nested);
});