Compare commits

18 Commits

Author SHA1 Message Date
claude code agent 227
91f24fc062 fix(ai): include content-bearing pages in reindex coverage; correct progress race & hot path (F6-F10)
F6: extend embeddablePredicate to pages with body content but null text_content,
    keyed on the text-node marker "type":"text" (not a bare "text": key, which
    also matched math nodes' attrs.text and would leave math-only pages stuck
    below 100%). Numerator and denominator share the predicate; tests assert the
    compiled WHERE is byte-identical and a math-only doc is excluded.
F7: correct the start() JSDoc (both totals are the real page count).
F8: nextReindexPollInterval reuses isReindexComplete.
F9: getMasked reads progress first and skips the two COUNTs while a reindex is active.
F10: pre-seed the progress entry with a short 45s TTL so a deduped enqueue's
     phantom "0 of N" expires quickly instead of sticking for the 1h TTL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 14:37:26 +03:00
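The marker choice in F6 of 91f24fc062 can be illustrated outside SQL. This is a minimal sketch, assuming the page body is compact JSON as produced by `JSON.stringify`. The node shapes and both helper names are hypothetical, and the repository's `embeddablePredicate` is a compiled WHERE clause, not this code.

```typescript
// Hypothetical node shapes, for illustration only.
const mathOnlyDoc = {
  type: "doc",
  content: [{ type: "math", attrs: { text: "x^2" } }],
};
const proseDoc = {
  type: "doc",
  content: [{ type: "paragraph", content: [{ type: "text", text: "hello" }] }],
};

// Naive marker: matches any "text" key, including a math node's attrs.text.
function hasBareTextKey(serialized: string): boolean {
  return serialized.includes('"text":');
}

// Marker described in F6: matches only nodes whose type is "text".
function hasTextNodeMarker(serialized: string): boolean {
  return serialized.includes('"type":"text"');
}

const mathJson = JSON.stringify(mathOnlyDoc);
const proseJson = JSON.stringify(proseDoc);
console.log(hasBareTextKey(mathJson), hasTextNodeMarker(mathJson), hasTextNodeMarker(proseJson));
```

With the bare key, a math-only page would count toward the denominator although nothing embeddable can be produced from it, which is the "stuck below 100%" case the commit describes.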
claude code agent 227
bdc033e689 fix(ai): extract reindex-button loading predicate + correct poll comment (PR #242)
F4: extract the reindex button `loading` predicate into a pure, unit-tested
`isReindexButtonLoading({ mutationPending, deadline, status })` next to the
other reindex helpers, replacing the inline JSX expression. Covers the
load-bearing post-cap case (deadline nulled, reindexing stale-true -> not
loading) plus mutationPending, active-run, and finished cases.

F5: rewrite the `useAiSettingsQuery` poll comment to match the actual
`nextReindexPollInterval` stop condition (continues while reindexing===true OR
within deadline and not fully indexed; stops only when reindexing===false &&
indexed>=total, or the deadline cap) instead of the stale "until indexed===total".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 01:49:55 +03:00
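A minimal sketch of the predicate that F4 in bdc033e689 extracts. The function name and argument names come from the commit message; the body and the `ReindexStatus` shape are assumptions inferred from the cases the commit lists, not the repository's implementation.

```typescript
// Assumed shape of the polled settings payload; only the flag used here.
interface ReindexStatus {
  reindexing?: boolean;
}

interface LoadingInput {
  mutationPending: boolean;
  deadline: number | null; // null once the poll window has closed
  status: ReindexStatus | undefined;
}

function isReindexButtonLoading({ mutationPending, deadline, status }: LoadingInput): boolean {
  // The request to start a reindex is still in flight.
  if (mutationPending) return true;
  // A server-reported run only counts while the poll window is open, so a
  // stale `reindexing: true` after the cap does not keep the button disabled.
  return deadline !== null && status?.reindexing === true;
}
```

The post-cap case is the load-bearing one: with `deadline` nulled, a stale `reindexing: true` yields `false`, so an admin can restart the run.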
claude code agent 227
85b38d6946 fix(ai): address reindex-progress review round 1 (PR #242)
F1: clear the "Reindex now" spinner once the poll cap fires. Gate the
reindexing part of the button's loading state on the active poll window
(reindexDeadline !== null) so a run that outlives the 120s cap no longer
leaves the button stuck-disabled with a stale `reindexing: true`; the
admin can restart.

F2: rewrite reindexWorkspace JSDoc to describe the EMBEDDABLE page set
(text OR existing embeddings), matching getEmbeddablePageIds /
countEmbeddablePages instead of the old "every non-deleted page".

F3: extract the shared embeddable-content predicate into a private
PageRepo.embeddablePredicate helper, called by both countEmbeddablePages
and getEmbeddablePageIds, removing the verbatim duplication. Behavior is
identical (lockstep int-spec stays green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 23:39:20 +03:00
claude_code
bf09eec4e1 fix(ai): address reindex-progress review (PR #242)
- Delete the now-orphaned PageRepo.getIdsByWorkspace (its only caller,
  reindexWorkspace, switched to getEmbeddablePageIds). Its docstring still
  claimed "Used by the RAG bulk reindex"; re-grep confirmed zero callers.
- ai-settings.service.reindex(): if aiQueue.add() throws (Redis hiccup/
  shutdown) the worker never runs so its finally->clear() never fires,
  leaving the seeded progress record stuck for the full 1h TTL (button
  stuck "reindexing: 0 of N"). Roll back the seed THIS call wrote
  (seeded flag, only when get() was null) before re-throwing, so a
  concurrent active run's record is never wiped. Add tests for both the
  clear-on-throw and the don't-clear-a-concurrent-run paths.
- Add an integration spec (real Postgres) proving getEmbeddablePageIds'
  WHERE stays in lockstep with countEmbeddablePages: seeds every boundary
  case and asserts the returned id set equals the count.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 04:39:18 +03:00
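The seed-and-rollback rule from bf09eec4e1 can be sketched with an in-memory stand-in for the Redis-backed progress service. `ProgressStore`, `reindexSketch`, and the `enqueue` callback are hypothetical names for illustration; only the `seeded` flag and the "seed only when `get()` was null" rule come from the commit message.

```typescript
interface Progress {
  total: number;
  done: number;
  startedAt: number;
}

// In-memory stand-in for the progress service (hypothetical).
class ProgressStore {
  private records = new Map<string, Progress>();
  get(workspaceId: string): Progress | null {
    return this.records.get(workspaceId) ?? null;
  }
  start(workspaceId: string, total: number): void {
    this.records.set(workspaceId, { total, done: 0, startedAt: Date.now() });
  }
  clear(workspaceId: string): void {
    this.records.delete(workspaceId);
  }
}

async function reindexSketch(
  store: ProgressStore,
  enqueue: (workspaceId: string) => Promise<void>,
  workspaceId: string,
  total: number,
): Promise<void> {
  // Seed only when no record exists, so an active run's counter is not reset.
  const seeded = store.get(workspaceId) === null;
  if (seeded) store.start(workspaceId, total);
  try {
    await enqueue(workspaceId);
  } catch (err) {
    // Roll back only the seed this call wrote; a concurrent run's record stays.
    if (seeded) store.clear(workspaceId);
    throw err;
  }
}
```

Tracking `seeded` per call is what keeps the rollback from wiping a record that belongs to a run already in progress.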
a
95d07d8d6f fix(ai): align reindex live denominator with the steady-state count
Review fixes for the reindex-progress counter (#242):

1. Denominator jump (478 -> 500 -> 478): reindexWorkspace iterated
   getIdsByWorkspace() (ALL non-deleted pages) but the seed/status use
   countEmbeddablePages (text OR existing-embedding), so the live total exceeded
   the steady-state total whenever empty/text-less pages existed. Add
   PageRepo.getEmbeddablePageIds() that selects the IDs of the EXACT same set
   countEmbeddablePages counts (deletedAt IS NULL AND (text_content matches a
   non-whitespace char OR an EXISTS non-deleted pageEmbeddings row)), and have
   reindexWorkspace iterate THAT set with total = its length. Iteration set and
   count source change together, so done reaches exactly total == the
   steady-state denominator. Dropping text-less pages is correct (reindexPage
   no-ops on them; a page that lost its text but still has stale embeddings is in
   the set via the EXISTS clause and still gets its stale rows cleared). Removed
   the contradictory "worker overwrites with the real page count" / "denominator
   matches" comment.

2. Mid-run re-trigger reset: reindex() unconditionally re-seeded done=0 before an
   enqueue that de-dupes a running job, so a second click/admin/tab reset the
   visible counter while the worker kept incrementing. Now seed only when
   get(workspaceId) === null; the worker's own start() remains the single
   authoritative reset.

3. TTL: documented that it is intentionally tied to write progress
   (start/increment) and never refreshed on get(), so a dead worker's record
   can't be kept alive forever by client polling.

Tests: new embedding-reindex-progress.service.spec.ts (fake ioredis: hash ->
ReindexProgress, malformed/missing/non-numeric -> null, non-finite startedAt ->
0, hgetall throws -> null, start/increment issue hset/hincrby+expire and swallow
Redis errors); reindex() seed order + no-reseed-when-active guard; getMasked
live test now uses progress.total=500 vs DB 478 to pin the progress branch;
indexer specs updated to mock getEmbeddablePageIds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 04:32:36 +03:00
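The parsing rules listed under "Tests" in 95d07d8d6f (malformed, missing, or non-numeric fields give null; a non-finite `startedAt` becomes 0) can be sketched as a pure function. The field names follow the seeded record `{total, done, startedAt}`; `parseProgressHash` and `toFinite` are hypothetical helper names, and the real service reads the hash through ioredis rather than taking a plain object.

```typescript
interface ReindexProgress {
  total: number;
  done: number;
  startedAt: number;
}

// Returns a finite number, or null for missing, blank, or non-numeric input.
function toFinite(raw: string | undefined): number | null {
  if (raw === undefined || raw.trim() === "") return null;
  const n = Number(raw);
  return Number.isFinite(n) ? n : null;
}

// `hash` models the result of an HGETALL: a missing key yields an empty object.
function parseProgressHash(hash: Record<string, string | undefined>): ReindexProgress | null {
  const total = toFinite(hash.total);
  const done = toFinite(hash.done);
  // Without both counters the record is unusable; callers fall back to DB counts.
  if (total === null || done === null) return null;
  // startedAt is informational, so a bad value degrades to 0 instead of null.
  return { total, done, startedAt: toFinite(hash.startedAt) ?? 0 };
}
```

Returning null rather than throwing matches the commit's framing of the record as best-effort: a bad hash simply degrades to the steady-state counts.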
a
630939e8f3 feat(ai): tighten reindex-progress polling on the reindexing flag
Make the "Indexed N of N" counter update near-realtime during a reindex by
tracking the server's active-run state instead of a pure time window:

- Set REINDEX_POLL_INTERVAL to 5000ms (kept bounded by the cap).
- Extract two pure, exported, unit-tested helpers:
  - nextReindexPollInterval: keep polling while the server reports an ACTIVE run
    (reindexing===true) OR within the deadline and not yet done; stop once the
    run is finished AND fully indexed (reindexing===false && indexed>=total) or
    the deadline cap is hit (the cap always wins, so a stuck/never-clearing
    progress record can't poll forever).
  - isReindexComplete: deadline-clear predicate mirroring that stop condition.
- Wire the refetchInterval and the deadline-clearing effect to those helpers.
- Keep the Reindex button spinner active for the whole run (loading also while
  settings.reindexing), reusing the existing loading prop; also blocks a
  redundant mid-run re-trigger (server de-dupes regardless).

No SSE/websockets: polling keyed on the reindexing flag is the intended scope.
The counter now tracks the actual active-reindex state and stops promptly when
the server reports the run is done.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 04:32:36 +03:00
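The stop condition described in 630939e8f3 can be written out as two small pure functions. The helper names and the 5000 ms interval come from the commit message; the `ReindexStatus` field names, the explicit `now` parameter, and treating a missing `reindexing` flag as "not running" are assumptions made for this sketch.

```typescript
const REINDEX_POLL_INTERVAL = 5000; // ms, as set in the commit

// Assumed shape of the polled status.
interface ReindexStatus {
  indexed: number;
  total: number;
  reindexing?: boolean;
}

// Finished AND fully indexed; a missing flag is treated as "not running".
function isReindexComplete(status: ReindexStatus): boolean {
  return status.reindexing !== true && status.indexed >= status.total;
}

// Returns the next poll delay in ms, or false to stop polling.
function nextReindexPollInterval(
  status: ReindexStatus | undefined,
  deadline: number | null,
  now: number,
): number | false {
  // The cap always wins, so a record that never clears cannot poll forever.
  if (deadline === null || now >= deadline) return false;
  if (status !== undefined && isReindexComplete(status)) return false;
  return REINDEX_POLL_INTERVAL;
}
```

Note that a run reporting `indexed >= total` while `reindexing` is still true keeps polling in this sketch, which is what lets the UI observe the final transition to the steady-state counts.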
a
72bb03918d fix(ai): show live reindex progress in semantic-search settings
The "Indexed X of Y pages" counter stayed stuck at "478 of 478" during a
manual "Reindex now" run instead of resetting to 0 and climbing. The status
reports indexedPages = countIndexedPages (DISTINCT pages with >=1 embedding
row), but reindex hard-replaces each page in its OWN small transaction, so
nearly all pages always have rows -> the count never drops.

Add a per-workspace live reindex-progress record in Redis (reusing the
existing global ioredis client via RedisService, no new Redis config):
- EmbeddingReindexProgressService: start/increment/clear/get over a Redis hash
  with a 1h TTL self-clean; all best-effort/cosmetic so a Redis failure degrades
  to the existing DB-count behavior.
- AiSettingsService.reindex seeds {total, done:0, startedAt} at enqueue time so
  the very first poll already reports done=0.
- EmbeddingIndexerService.reindexWorkspace overwrites total with the real page
  count at start, increments done per processed page (success or handled
  failure), and clears the record in a finally (covers success, fatal abort,
  and the unconfigured early-return) so a failed run never sticks.
- AiSettingsService.getMasked returns the live run numbers when a progress
  record is active (plus an optional reindexing flag), else falls back to
  countIndexedPages/countEmbeddablePages.

Per-page edits (reindexPage) never touch the workspace progress record, and no
mass up-front delete is introduced (search availability preserved).

Tests: indexer sets/increments/clears progress (incl. fatal abort and
unconfigured early-return); status reports run progress when active and falls
back when not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 04:32:36 +03:00
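The `getMasked` behaviour in 72bb03918d (live run numbers when a progress record is active, DB counts otherwise) reduces to a small selection step. A minimal sketch under stated assumptions: `resolveIndexStatus`, `totalPages`, and the synchronous `loadDbCounts` callback are hypothetical, and the real service queries Postgres asynchronously. Passing the counts as a callback also reflects the later F9 change, which skips the two COUNT queries while a reindex is active.

```typescript
interface ReindexProgress {
  total: number;
  done: number;
}

interface DbCounts {
  indexed: number; // pages with at least one embedding row
  embeddable: number; // pages in the embeddable set
}

interface IndexStatus {
  indexedPages: number;
  totalPages: number;
  reindexing?: boolean;
}

function resolveIndexStatus(
  progress: ReindexProgress | null,
  loadDbCounts: () => DbCounts,
): IndexStatus {
  if (progress !== null) {
    // Active run: report the live counters and never touch the database.
    return { indexedPages: progress.done, totalPages: progress.total, reindexing: true };
  }
  // No active run: fall back to the steady-state counts.
  const counts = loadDbCounts();
  return { indexedPages: counts.indexed, totalPages: counts.embeddable };
}
```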
claude_code
106df7c907 Merge branch 'develop' of https://gitea.vvzvlad.xyz/vvzvlad/gitmost into develop
2026-06-28 02:28:02 +03:00
claude_code
89edddc5a1 feat(agent-roles): fact-checker flags errors instead of confirming facts
Rework the fact-checker editorial role prompt so it stops commenting on
correct facts and only flags problems (errors, doubtful, unverifiable).

- Add the directive "don't write/comment that a fact is right or confirmed:
  your job is to find errors, not confirm facts" to both RU and EN bundles.
- Remove the [Подтверждено]/[Verified] verdict; reframe the verdict list as
  "for problem claims only".
- Reword the role description (no longer "confirms") and the
  comment-on-every-claim rule to "problem claims only".
- Bump fact-checker role version 2 -> 3 and refresh the content-hash lock.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 02:27:53 +03:00
c5109aa2a3 Merge pull request 'feat(footnotes): author-inline footnotes + deterministic server canonicalization (#228)' (#232) from feat/228-inline-footnotes into develop
Reviewed-on: #232
2026-06-28 02:23:27 +03:00
a
c4ed4a4855 fix(footnotes): strip bare definitions on rebuild; MCP full-doc + zip-import canonicalize tests (#228)
Review #6 (approve-with-comments) follow-ups:
1. canonicalize step 7 now strips bare footnoteDefinitions at ANY depth
   (stripFootnoteDefinitionsDeep), not just footnotesList, in BOTH copies. A
   definition hand-authored outside a list (e.g. nested in a callout via a
   raw-JSON write path) was left in place while a copy was also added to the
   rebuilt list -> duplicate, idempotent, self-perpetuating. Runs only in the
   rebuild path (after the lists are stripped); the fast-path / placement-keep
   branch is untouched. Added a shared-corpus case (bare def nested in a callout)
   to pin it in both mirrors.
2. markdown-clipboard: removed the dead top-level footnoteReference check in
   canonicalizePastedFootnotes (an inline atom is never a top-level slice child;
   only the descendants scan can find it).

Test coverage:
4. New MCP binding tests (full-doc-write-canonicalize.test.mjs): update_page_json
   and copy_page_content canonicalize the persisted full doc, asserted via a new
   `replacePage` seam (symmetric to the existing `mutatePage` seam) so no live
   collab socket is needed. Routed both writers through the seam.
5. New server spec (file-import-task.service.footnote-canonicalize.spec.ts): the
   zip-import path (processGenericImport) canonicalizes footnotes — real
   markdown->HTML->JSON via a real ImportService over a temp-dir .md file, DB trx
   stubbed to capture the persisted page content. FileImportTaskService had no
   spec before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 01:39:25 +03:00
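Review item 1 in c4ed4a4855 (strip bare definitions at any depth) amounts to a recursive filter. This is a minimal sketch, assuming a simplified node shape; the `Sketch` suffix marks the function as hypothetical and not the repository's `stripFootnoteDefinitionsDeep`.

```typescript
// Simplified node shape, assumed for illustration.
interface PMNode {
  type: string;
  attrs?: Record<string, unknown>;
  content?: PMNode[];
  text?: string;
}

// Returns a copy of `node` with every footnoteDefinition removed, at any depth.
function stripFootnoteDefinitionsDeepSketch(node: PMNode): PMNode {
  if (!node.content) return node;
  const kept = node.content
    .filter((child) => child.type !== "footnoteDefinition")
    .map((child) => stripFootnoteDefinitionsDeepSketch(child));
  return { ...node, content: kept };
}
```

Because the function builds new objects instead of editing in place, the input document is left untouched, in keeping with the commit series' emphasis on a pure canonicalizer.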
a
9c1f952b2f fix(footnotes): guard insert against nested/bare definitions, skip definitions-only paste, doc + reorder fixes (#228)
Must-fix:
- insertInlineFootnote could glue a footnoteReference inside an EXISTING
  definition (nested footnotesList, or a bare footnoteDefinition with no list
  wrapper), which canonicalize then dropped as an orphan — silently losing the
  definition's prose. Now: (a) the body/notes boundary is computed from the first
  top-level block that IS or CONTAINS (recursively) a footnotesList/
  footnoteDefinition, not just a top-level list; and (b) the insertNodesAfterAnchor
  core skips footnotesList/footnoteDefinition subtrees entirely (skipSubtreeTypes),
  so an anchor whose only match is inside a definition -> inserted:false (clean
  abort, no write). Added tests: nested-definition, bare-definition, and
  body-before-nested-list-still-inserts.
- editor-ext footnote-canonicalize header listed `markdownToProseMirror` among the
  canonicalizing MCP paths; it is the NON-canonicalizing primitive. Replaced with
  `markdownToProseMirrorCanonical` (+ note that the plain primitive is for comment
  bodies) and added copy_page_content.
- Client paste: canonicalizePastedFootnotes now skips a definitions-ONLY paste
  (no footnoteReference anywhere) — canonicalizing it would strip the
  reference-less list and yield an EMPTY paste. Added a test.

Suggestions:
- docmost_transform now runs validateDocStructure/validateDocUrls on the RAW
  transform output BEFORE canonicalizeFootnotes (mirrors updatePageJson), so a
  too-deep doc gives the intended max-depth error instead of a stack overflow.
- docmost_transform tool description now states the RESULT is footnote-canonical
  (dryRun diff may show tidy-ups; idempotent after first run).
- insertFootnote: dropped the dead `result ? … : undefined` ternaries and the
  `as any` casts (result is always set by the time we return; the not-found path
  throws and aborts mutatePage). `const r = result!;`.

Tests / architecture:
- Added a LIVE-plugin golden case: the real footnoteSyncPlugin leaves a list with
  non-empty content after it in place, and canonicalize agrees (placement parity
  is now a driven property, not a hand-set expected).
- Added generateFootnoteId uuidv7 shape + uniqueness test.
- Item 9: added the ENFORCEMENT-RULE comments at the server parseProsemirrorContent
  and the MCP canonicalizer header (any NEW full-doc persist path MUST canonicalize;
  fragments/append/prepend and comment bodies MUST NOT). Kept per-call-site over a
  brittle grep CI test (the replace-vs-fragment + comment-vs-page nuance makes a
  single wrapper unsafe).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 23:40:28 +03:00
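The body/notes boundary rule from the first must-fix in 9c1f952b2f (the first top-level block that IS or CONTAINS a `footnotesList` or `footnoteDefinition`) can be sketched as follows. The node shape and both function names are hypothetical; only the rule itself comes from the commit message.

```typescript
// Simplified node shape, assumed for illustration.
interface PMNode {
  type: string;
  content?: PMNode[];
}

// True when the node is, or recursively contains, footnote-notes material.
function containsNotes(node: PMNode): boolean {
  if (node.type === "footnotesList" || node.type === "footnoteDefinition") return true;
  return (node.content ?? []).some((child) => containsNotes(child));
}

// Index of the first top-level block belonging to the notes area; blocks
// before it form the body, where a new reference may be inserted.
function bodyEndIndex(doc: PMNode): number {
  const blocks = doc.content ?? [];
  const firstNotes = blocks.findIndex((block) => containsNotes(block));
  return firstNotes === -1 ? blocks.length : firstNotes;
}
```

A callout that wraps a bare definition therefore ends the body at the callout itself, so an anchor match inside it is never used as an insertion point.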
c6ffdb6536 Merge pull request 'fix(ui)+test: QA UI bugs (#216 #218) + test coverage (#206 #204 #192)' (#230) from fix/qa-ui-bugs-216-218 into develop
Reviewed-on: #230
2026-06-27 22:50:19 +03:00
a
3fd66b4245 fix(footnotes): don't canonicalize comment bodies (data loss); canonicalize only page write paths (#228)
Must-fix (REAL DATA LOSS):
- markdownToProseMirror is reused for COMMENT bodies (createComment/updateComment).
  It unconditionally canonicalized, so a comment carrying a standalone footnote
  definition ([^1]: text with no matching reference) had its whole footnotesList
  stripped (referenceIds.length===0 -> stripFootnotesListsDeep) — the text
  vanished. Fix: markdownToProseMirror no longer canonicalizes (content-preserving
  primitive); a new markdownToProseMirrorCanonical wraps it for the PAGE write
  paths (markdown import via importPageMarkdown, update_page markdown via
  updatePageContentRealtime). Comment callers keep the non-canonicalizing
  primitive. Updated the now-false header comment and added create/update-comment
  inline notes. Added collaboration tests: comment path PRESERVES a reference-less
  definition; page path still drops it AND still reorders real footnotes. Updated
  the page-import canonicalization test to use the canonical variant.

Suggestions / architecture:
- #2: collapsed transforms.footnoteDefinition onto the shared
  makeFootnoteDefinition factory (adds only the inner paragraph block id); kept
  the dependency direction transforms -> footnote-authoring (no circular import,
  mirror stays pure).
- #3: confirmed docmost_transform auto-canonicalization is documented (inline
  comment, tool description, CHANGELOG) — no code change.
- #4: copyPageContent is a FULL-document write (replacePageContent of a
  type:"doc"); added a defensive canonicalizeFootnotes pass (no-op on
  already-canonical source).
- CHANGELOG entry refined to list the FULL-document write paths (incl.
  copy_page_content) and to state canonicalization is NOT applied to comment
  bodies.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 22:17:15 +03:00
a
a77a0bc92b fix(footnotes): re-review #232 — refuse footnoteRef into codeBlock/definition, deep-strip nested lists, docs + cross-copy guard (#228)
Must-fix:
- REAL BUG: insertInlineFootnote could splice a footnoteReference (inline atom)
  into a codeBlock or an existing footnoteDefinition, persisting a schema-invalid
  doc (insert_footnote skips validateDocStructure). Now the search is bounded to
  the BODY (before the first footnotesList) and the insertNodesAfterAnchor core
  refuses textblocks that can't hold the atom (codeBlock); when the only match is
  in such a place the insert returns inserted:false and the write aborts cleanly.
  Reachable via docmost_transform too. Added codeBlock / definition / fall-through
  tests.
- Fixed the deepEqualJson doc comment in both copies: arrays are order-SENSITIVE
  (correctness depends on it), only object keys are order-insensitive.
- README.ru.md MCP tool count 38 -> 39 (lines 36/47/63), matching README.md/AGENTS.
- CHANGELOG [Unreleased] Added entry for insert_footnote + server-side footnote
  canonicalization on non-editor write paths (#228).

Suggestions:
- canonicalize step 5/7 now strips footnotesList at ANY depth (both copies), so a
  schema-valid list nested in a callout/blockquote can't leave duplicate defs.
- Exclude the test-only footnote-corpus.ts fixture from the editor-ext build
  (tsconfig), so it no longer ships in dist/.
- Removed the duplicate manual canonicalize cases from the MCP unit test (the
  shared corpus covers them via full deepEqual); kept idempotence + immutability.
- insertInlineFootnote dedup key now keys off the inline array directly
  (footnoteContentKey({ content: inline })) instead of a throwaway node.

Tests / architecture:
- New client-wrapper test (#9): overrides a small mutatePage seam to assert the
  not-found path throws and persists NOTHING, and the success path shapes
  footnoteId/reused/message/verify and writes the right content. Fixed the
  misleading comment in footnote-write.test.mjs.
- B: cross-copy corpus parity guard test (loads both corpora, asserts deep-equal)
  so a typo in one copy can't pass both suites green.
- A: declined — the full-vs-fragment decision lives at the call site, so a
  prepareDocForPersist wrapper would be a bare alias for canonicalizeFootnotes;
  kept the existing per-call-site comments instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 21:41:10 +03:00
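The corrected `deepEqualJson` contract in a77a0bc92b (arrays are order-sensitive, object keys are not) can be pinned down with a short sketch. This is an illustrative implementation written for this note, assuming plain JSON values; it is not the repository's function.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function deepEqualJsonSketch(a: Json, b: Json): boolean {
  if (a === b) return true;
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") return false;
  if (Array.isArray(a) || Array.isArray(b)) {
    if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) return false;
    // Arrays compare position by position, so order matters.
    return a.every((item, i) => deepEqualJsonSketch(item, b[i]));
  }
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  // Objects compare by key lookup, so key order does not matter.
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && deepEqualJsonSketch(a[key], b[key]),
  );
}
```

Order sensitivity for arrays is what makes the comparison meaningful here: two footnote lists holding the same definitions in a different order are different documents.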
a
07ebd8c63e fix(footnotes): address PR #232 review — fragment-safe canonicalization, plugin placement parity, dead-code removal (#228)
Must-fix:
- Move canonicalizeFootnotes OUT of parseProsemirrorContent. It now runs only
  on FULL writes (createPage, updatePageContent operation==='replace'), never on
  an append/prepend fragment (a fragment would lose definition-only footnotes or
  synthesize a bogus empty list). Add a server binding spec.
- Match the live plugin's list PLACEMENT: a single already-canonical
  footnotesList is left exactly where it sits (the plugin never repositions a
  sole correct list), so the first write no longer reorders content that follows
  the list. Applied to BOTH the editor-ext copy and the MCP mirror; pinned by a
  shared golden corpus case with content after the list.
- Fix MCP tool count 38 -> 39 (README x3, AGENTS.md) and the transformJs param
  help (add canonicalizeFootnotes/insertInlineFootnote).

Simplifications:
- Remove the dead duplicate re-id mechanism (deriveFootnoteId/suffix/occurrence)
  from the PURE canonicalizer in both copies — references are never renamed, so
  the derived ids were never requested; first-wins-drop is the real behaviour.
  This also makes the editor-ext footnote-util note about "no cross-package copy"
  true again.
- Remove the sentinel round-trip in insertInlineFootnote: a generalized
  insertNodesAfterAnchor core inserts the footnoteReference node directly.
- Drop the redundant per-definition deep clone in step 4 (shallow id-normalizing
  copy; out is already deep-cloned).

Docs / architecture:
- Correct the editor-ext copy's "It exists because…" header to its real
  consumers (server import, page.service create/update, client paste).
- Note markdownToProseMirror reuse for create/update comment in collaboration.ts.
- A: shared golden JSON corpus exercised by BOTH the editor-ext copy and the MCP
  mirror (footnote-corpus.ts / .mjs) so "the two copies behave identically" is
  checkable.
- C: split the MCP canonicalizer into a pure mirror + footnote-authoring.ts.
- B: import services persist via a different path, so left one-line consolidation
  comments at the call sites rather than folding (does not fall out cleanly).

Tests: insertFootnote wrapper guards + docmost_transform dryRun auto-canonicalize
(MCP mock), page.service create/update + append/prepend binding (server jest),
shared corpus incl. nested-container reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 20:23:16 +03:00
a
fa929c9e86 fix(footnotes): canonicalize footnotes on server import + markdown paste (#228)
The footnote canonicalizer was wired into the MCP and editor-ext write paths
but NOT into the server's user-facing markdown/HTML import paths, so importing
or pasting markdown with out-of-order, reused, or orphan footnotes did not
canonicalize, so the exact trigger that bug #228 fixes still reproduced on
import. markdownToHtml -> htmlToJson builds ProseMirror JSON directly and never
runs the editor's footnoteSyncPlugin, and that plugin does not reorder an
existing list, so the stored footnotes kept the source's physical definition
order, retained orphans, and did not collapse reused references.

Wire canonicalizeFootnotes (already exported from @docmost/editor-ext) into
every server markdown/HTML -> page-JSON seam, before persisting:
  - ImportService.importPage (REST single-file .md/.html import)
  - FileImportTaskService (zip import worker)
  - PageService.parseProsemirrorContent (API createPage / updatePageContent)

Also hook the client markdown paste: handlePaste applies a manual transaction
(returns true), bypassing transformPasted/footnoteSyncPlugin, so a pasted
out-of-order markdown footnote block would persist out of order.
canonicalizePastedFootnotes reorders a self-contained pasted block (one that
carries its own footnotesList) to reference order, deduped and orphan-free; it
is deliberately scoped to whole-block pastes so a reference-only paste that
reuses a footnote already defined in the target doc is left untouched.

canonicalizeFootnotes is pure, idempotent and shape-safe (a doc with no
footnotes is unchanged), so it is safe on every write path.

Residual: when a pasted block merges into a doc that already has footnotes,
ordering relative to the pre-existing footnotes is still governed by the live
sync plugin (which does not reorder across the boundary).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 17:10:41 +03:00
claude code agent 227
30cb9d293c feat(footnotes): inline authoring + deterministic server-side canonicalization
Make footnotes author-inline: the agent/tool inserts a footnote at its point
of use (anchor + text) and the numbering plus the bottom list are DERIVED
deterministically server-side. The agent has no access to footnotesList and
cannot desync — out-of-order lists, orphan definitions, and raw trailing
[^id] blocks become structurally impossible.

editor-ext:
- canonicalizeFootnotes(docJSON) -> docJSON: a pure, EditorView-free port of
  footnoteSyncPlugin's end-state. Distinct reference ids in document order are
  the source of truth; exactly one trailing footnotesList holds one definition
  per referenced id in reference order (reusing the existing node or
  synthesizing an empty one); orphans dropped; duplicate definitions resolved
  deterministically (first wins, never lost); idempotent.
- Unit tests + a golden parity suite: on every editor-reachable steady state
  the live footnoteSyncPlugin's JSON is a canonicalize no-op (byte-for-byte
  parity), and the canonicalizer additionally repairs the out-of-order list a
  non-editor write produces.

mcp:
- footnote-canonicalize.ts: behavioural mirror of the editor-ext canonicalizer
  (the MCP package is intentionally decoupled from the editor barrel, like
  footnote-lex/docmost-schema), plus footnoteContentKey for content dedup.
- Auto-canonicalize on EVERY write path: markdownToProseMirror (fixes import
  ordering), update_page_json, and after every docmost_transform. Idempotent,
  so it is a no-op when footnotes are already canonical.
- insert_footnote tool + insertInlineFootnote: anchor + markdown text -> a
  mark-safe footnoteReference and a content-dedup'd definition; the list and
  numbering are derived. Same-content footnotes reuse one number/definition.
- canonicalizeFootnotes + insertInlineFootnote exposed as docmost_transform
  sandbox helpers.

Tests: editor-ext 157 green; MCP 325 green; server + client tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-27 06:35:25 +03:00
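The end-state described for `canonicalizeFootnotes` in 30cb9d293c (reference order is the source of truth, one trailing list, orphans dropped, first duplicate wins, idempotent) can be sketched in simplified form. Several things here are assumptions: the node shape, the `attrs.id` field, the empty-paragraph placeholder for a synthesized definition, and the restriction to top-level lists. Refinements from later commits in this range (leaving a sole canonical list in place, stripping nested lists and bare definitions) are deliberately left out.

```typescript
// Simplified node shape, assumed for illustration.
interface PMNode {
  type: string;
  attrs?: { id?: string };
  content?: PMNode[];
  text?: string;
}

function canonicalizeFootnotesSketch(doc: PMNode): PMNode {
  const blocks = doc.content ?? [];
  const lists = blocks.filter((block) => block.type === "footnotesList");
  const body = blocks.filter((block) => block.type !== "footnotesList");

  // Distinct reference ids in document order.
  const ids: string[] = [];
  const visit = (node: PMNode): void => {
    const id = node.attrs?.id;
    if (node.type === "footnoteReference" && id !== undefined && !ids.includes(id)) ids.push(id);
    (node.content ?? []).forEach((child) => visit(child));
  };
  body.forEach((block) => visit(block));

  // A doc with no footnotes at all is returned unchanged.
  if (ids.length === 0 && lists.length === 0) return doc;
  // No references left: every definition is an orphan, so drop the lists.
  if (ids.length === 0) return { ...doc, content: body };

  // First definition per id wins; later duplicates are ignored.
  const firstDefinition = new Map<string, PMNode>();
  for (const list of lists) {
    for (const def of list.content ?? []) {
      const id = def.attrs?.id;
      if (def.type === "footnoteDefinition" && id !== undefined && !firstDefinition.has(id)) {
        firstDefinition.set(id, def);
      }
    }
  }

  // One definition per referenced id, in reference order; synthesize if missing.
  const definitions = ids.map(
    (id): PMNode =>
      firstDefinition.get(id) ?? {
        type: "footnoteDefinition",
        attrs: { id },
        content: [{ type: "paragraph" }],
      },
  );
  return { ...doc, content: [...body, { type: "footnotesList", content: definitions }] };
}
```

Running the sketch twice gives the same result as running it once, which is the property the commit relies on to call the pass safe on every write path.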
56 changed files with 7561 additions and 209 deletions

@@ -254,7 +254,7 @@ The API server is a Fastify app with a global `/api` prefix (`main.ts` excludes
- **Redis** backs caching, the BullMQ queues, the WebSocket Socket.IO adapter, and collaboration sync.
### The two AI subsystems (the main fork additions)
-1. **Embedded MCP server** (`integrations/mcp/` + `packages/mcp`). The standalone `@docmost/mcp` server (38 agent-native tools: per-block patch/insert/delete by id, scripted `(doc)=>doc` transforms with dry-run diff, table editing, version diff/restore, comments, images, shares) is bundled and served over HTTP at `/mcp`. It writes through Docmost's real-time-collaboration layer so concurrent human edits aren't clobbered. Each request authenticates **per-user** via the `Authorization` header — either HTTP Basic (`base64(email:password)`, the user's own Docmost login, validated through `AuthService`) or a Bearer access JWT (the user's `authToken`) — and the session acts under that user's permissions. `MCP_DOCMOST_EMAIL` / `MCP_DOCMOST_PASSWORD` are an **optional service-account fallback**, used only when a request carries neither Basic nor Bearer credentials (back-compat for CI/scripts). An admin enables MCP with a workspace toggle (Workspace settings → AI). Optionally protected by a shared `MCP_TOKEN`: when set, every `/mcp` request must carry a matching `X-MCP-Token` header (its own header, separate from `Authorization`, which now carries the per-user Basic/Bearer credentials). Note: this changed from the older `Authorization: Bearer <MCP_TOKEN>` scheme — see `.env.example` and the CHANGELOG Breaking Changes entry.
+1. **Embedded MCP server** (`integrations/mcp/` + `packages/mcp`). The standalone `@docmost/mcp` server (39 agent-native tools: per-block patch/insert/delete by id, scripted `(doc)=>doc` transforms with dry-run diff, table editing, version diff/restore, comments, images, shares) is bundled and served over HTTP at `/mcp`. It writes through Docmost's real-time-collaboration layer so concurrent human edits aren't clobbered. Each request authenticates **per-user** via the `Authorization` header — either HTTP Basic (`base64(email:password)`, the user's own Docmost login, validated through `AuthService`) or a Bearer access JWT (the user's `authToken`) — and the session acts under that user's permissions. `MCP_DOCMOST_EMAIL` / `MCP_DOCMOST_PASSWORD` are an **optional service-account fallback**, used only when a request carries neither Basic nor Bearer credentials (back-compat for CI/scripts). An admin enables MCP with a workspace toggle (Workspace settings → AI). Optionally protected by a shared `MCP_TOKEN`: when set, every `/mcp` request must carry a matching `X-MCP-Token` header (its own header, separate from `Authorization`, which now carries the per-user Basic/Bearer credentials). Note: this changed from the older `Authorization: Bearer <MCP_TOKEN>` scheme — see `.env.example` and the CHANGELOG Breaking Changes entry.
2. **AI agent chat** (`core/ai-chat/` server + `apps/client/src/features/ai-chat/` client). A built-in agent over the wiki using the Vercel **AI SDK** (`ai`, `@ai-sdk/*`) against any OpenAI-compatible provider configured per workspace (`integrations/ai/` — credentials encrypted at rest via `integrations/crypto`, stored in `ai_provider_credentials`). Key pieces:
- `core/ai-chat/tools/` — the agent's ~40 read+write tools. Every tool runs under the **calling user's** CASL permissions via a per-user loopback access token (`docmost-client.loader.ts`), so the agent can never exceed what the user could do. Only **reversible** operations are exposed (page history + trash; no permanent delete). Agent edits get an "AI agent" provenance badge in page history (`20260616T130000-agent-provenance` migration).
- `core/ai-chat/embedding/` — RAG indexer + a BullMQ consumer on `AI_QUEUE` that embeds pages into `page_embeddings` (vector search), complementing Postgres full-text search. Pages are (re)indexed on edit; `AI_EMBEDDING_TIMEOUT_MS` bounds a hung embeddings endpoint.

@@ -41,6 +41,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
`AI_AGENT_ROLES_CATALOG_URL` env var — an `http(s)://` base URL to the
catalog's raw files; the image ships a per-branch default baked in CI, and it
can be overridden at runtime via the env var (see `.env.example`). (#222)
+- **Author footnotes inline from an agent, and deterministic server-side footnote
+canonicalization on every non-editor write path.** A new MCP `insert_footnote`
+tool places a footnote at a body anchor by content only — the agent supplies
+WHERE (anchor text) and WHAT (markdown); the number and the bottom
+`footnotesList` are derived server-side, so an agent can never assign a number,
+edit the list, or desync, and a same-content note reuses one definition. Under
+the hood, the editor's footnote-integrity invariant (one trailing list,
+numbering by first reference, no orphans/duplicates, no raw `[^id]`) is now
+enforced as a pure `canonicalizeFootnotes(doc)` on the FULL-document write paths
+that bypass the editor's plugins: server markdown/HTML import, `PageService`
+create and full-document (`replace`) updates, the client markdown paste, and the
+MCP markdown page-import / `update_page` (markdown) / `update_page_json` /
+`docmost_transform` / `insert_footnote` / `copy_page_content` paths. It is
+idempotent (a no-op once canonical) and is deliberately NOT applied to
+append/prepend fragments, nor to COMMENT bodies — a comment may legitimately
+contain a standalone footnote definition, which canonicalization would drop.
+(#228)
### Changed

@@ -34,7 +34,7 @@ The goal of the fork is a **100% open, AGPL-only build with no Enterprise-Editio
| --- | --- |
| **EE code removed** | Stripped all client and server Enterprise-Edition code; ships as a clean community/AGPL build with no license checks. |
| **Comment resolution** | Re-implemented from scratch as a community feature (resolve / re-open with Open/Resolved tabs). No EE code reused, available to anyone who can comment. |
| **Embedded MCP server** | A community MCP server (`@docmost/mcp`, 38 tools) is served over HTTP at `/mcp` — no enterprise license required. Replaces the removed license-gated EE MCP. |
| **Embedded MCP server** | A community MCP server (`@docmost/mcp`, 39 tools) is served over HTTP at `/mcp` — no enterprise license required. Replaces the removed license-gated EE MCP. |
| **AI agent chat** | Built-in AI agent chat over your wiki, written from scratch as a community feature — no enterprise license. The agent reads and edits pages on your behalf (scoped to your permissions), with full-text + vector (RAG) search and optional web access via external MCP servers. |
| **Rebranding** | App logo / name changed from *Docmost* to *Gitmost*. |
| **Compact page tree** | Default page-tree indentation reduced from 16px to 8px per nesting level. |
@@ -44,7 +44,7 @@ The goal of the fork is a **100% open, AGPL-only build with no Enterprise-Editio
### Embedded MCP server
Gitmost has **our own MCP server** — [docmost-mcp](https://github.com/vvzvlad/docmost-mcp),
which we wrote — **built directly into the app** and served at `/mcp`. It exposes **38
which we wrote — **built directly into the app** and served at `/mcp`. It exposes **39
agent-native tools**: surgical per-block edits (patch / insert / delete by id),
structure-preserving find/replace, scripted `(doc) => doc` transforms with a dry-run diff,
structured table editing, version history with diff / restore, comments, images and share
@@ -60,7 +60,7 @@ every little fix. And it needs no enterprise license.
| | **Gitmost `/mcp` (our docmost-mcp)** | Docmost's built-in MCP |
| --- | :---: | :---: |
| **Enterprise license** | Not required | Required |
| **Tools** | 38, agent-native | Coarse (read Markdown, page CRUD, replace whole page) |
| **Tools** | 39, agent-native | Coarse (read Markdown, page CRUD, replace whole page) |
| **Per-block edits / find-replace / scripted transforms** | ✅ | — |
| **Structured table editing, version diff / restore** | ✅ | — |
| **Comments, images, share links** | ✅ | — |

@@ -33,7 +33,7 @@
| --- | --- |
| **EE code removed** | All Enterprise-Edition code has been cut from the client and server; this is a clean community/AGPL build with no license checks. |
| **Comment resolution** | Rewritten from scratch as a community feature (resolve / reopen with "Open" / "Resolved" tabs). No EE code is used, and it is available to anyone who can comment. |
| **Embedded MCP server** | A community MCP server (`@docmost/mcp`, 38 tools) is served over HTTP at `/mcp`, with no enterprise license. Replaces the removed license-gated EE MCP. |
| **Embedded MCP server** | A community MCP server (`@docmost/mcp`, 39 tools) is served over HTTP at `/mcp`, with no enterprise license. Replaces the removed license-gated EE MCP. |
| **AI agent chat** | Built-in chat with an AI agent over the wiki's content, written from scratch as a community feature, with no enterprise license. The agent reads and edits pages on your behalf (within your permissions), with full-text + vector (RAG) search and optional internet access via external MCP servers. |
| **Rebranding** | App logo / name changed from *Docmost* to *Gitmost*. |
| **Compact page tree** | Default page-tree indentation reduced from 16px to 8px per nesting level. |
@@ -44,7 +44,7 @@
Gitmost has **our own MCP server**, [docmost-mcp](https://github.com/vvzvlad/docmost-mcp),
which we wrote ourselves, **built directly into the app** and available at `/mcp`. It provides
**38 agent-native tools**: surgical per-block editing (patch / insert / delete
**39 agent-native tools**: surgical per-block editing (patch / insert / delete
by id), structure-preserving find/replace, scripted `(doc) => doc` transforms with a
diff preview, structured table editing, version history with diff /
restore, comments, images and share links, all applied through the
@@ -60,7 +60,7 @@ real-time-коллаборации Docmost, поэтому запись нико
| | **`/mcp` in Gitmost (our docmost-mcp)** | Docmost's native MCP |
| --- | :---: | :---: |
| **Enterprise license** | Not required | Required |
| **Tools** | 38, agent-native | Primitive (Markdown, page CRUD, whole-page replace) |
| **Tools** | 39, agent-native | Primitive (Markdown, page CRUD, whole-page replace) |
| **Per-block edits / find-replace / scripted transforms** | ✅ | — |
| **Structured table editing, version diff / restore** | ✅ | — |
| **Comments, images, share links** | ✅ | — |

@@ -24,8 +24,8 @@
"slug": "fact-checker",
"emoji": "🔍",
"name": "Fact-checker",
"description": "Verifies facts, figures, dates, names, and quotes with web search. Confirms, corrects, or flags the unverifiable — with a verdict and a source.",
"instructions": "You are a fact-checker at Gitmost, verifying the factual accuracy of non-fiction texts (articles, opinion pieces, technical material, blogs, documentation). You have access to web search — use it to verify. Communicate with the user in English.\n\nWHAT YOU DO\nVerify every checkable claim: names, titles, positions; dates, chronology, sequence; numbers, statistics, proportions, units; quotations and their attribution; technical facts, terms, versions, specifications; causal and logical claims, and internal consistency.\n\nRemember the weakness of machine text: an LLM does not fact-check and will confidently state falsehoods, invent non-existent terms, conflate near-neighbor entities (e.g. claim \"handwriting understanding\" where it was template-based recognition), and insert pseudo-precise numbers. Be especially wary of smoothly written but unverifiable claims.\n\nA VERDICT FOR EACH CLAIM\n- [Verified] — the fact is correct; cite the source.\n- [Incorrect] — the fact is wrong; give the correction and the source.\n- [Unverified] — probably correct but not confirmed; say what's needed to verify.\n- [Unverifiable] — the claim can't be checked in principle (no source, too vague).\n- [Opinion] — not a factual claim, not subject to checking.\n\nSource rule: rely on primary sources (original data, documentation, official site), not retellings. One primary source or two independent secondary sources is a reasonable minimum. Cite the source in the comment.\n\nWHAT YOU DON'T DO\n- Don't fix style, grammar, punctuation, structure, or typography — those are other roles.\n- Don't rewrite the text. You confirm, correct, or flag — the decision is the author's.\n- Don't judge opinions or subjective phrasing as facts.\n- Don't fabricate confirmations. If you can't verify, honestly mark [Unverified] or [Unverifiable]. Never confirm a fact you don't know.\n\nHOW TO LEAVE COMMENTS\nYou don't edit the text directly. 
For each checked claim, select the span via the MCP tool and leave a comment. Open the comment with the label `[Facts]`, then the verdict, the correction (if any), and the source. Tag severity:\n- [Critical] — a factual error, especially in numbers, names, or quotes, or a claim that risks misinformation.\n- [Major] — a doubtful or unconfirmed claim that needs a source.\n- [Minor] — a small correction, or false precision worth rounding or confirming.\n\nTONE\nNeutral and precise. Don't argue with the author's stance — check facts, not views.\n\nWHEN UNSURE\nBetter to honestly flag \"can't confirm\" than to give a false confirmation.",
"description": "Verifies facts, figures, dates, names, and quotes with web search. Finds errors and flags the doubtful or unverifiable — with a verdict and a source.",
"instructions": "You are a fact-checker at Gitmost, verifying the factual accuracy of non-fiction texts (articles, opinion pieces, technical material, blogs, documentation). You have access to web search — use it to verify. Communicate with the user in English.\n\nWHAT YOU DO\nVerify every checkable claim: names, titles, positions; dates, chronology, sequence; numbers, statistics, proportions, units; quotations and their attribution; technical facts, terms, versions, specifications; causal and logical claims, and internal consistency. Your job is to find errors and doubtful spots, not to confirm what is already correct.\n\nRemember the weakness of machine text: an LLM does not fact-check and will confidently state falsehoods, invent non-existent terms, conflate near-neighbor entities (e.g. claim \"handwriting understanding\" where it was template-based recognition), and insert pseudo-precise numbers. Be especially wary of smoothly written but unverifiable claims.\n\nVERDICTS (for problem claims only)\nDon't comment on correct facts — don't write or mark that a fact is right or confirmed. Leave a verdict only where there is a problem:\n- [Incorrect] — the fact is wrong; give the correction and the source.\n- [Unverified] — probably correct but not confirmed; say what's needed to verify.\n- [Unverifiable] — the claim can't be checked in principle (no source, too vague).\n- [Opinion] — not a factual claim, not subject to checking.\n\nSource rule: rely on primary sources (original data, documentation, official site), not retellings. One primary source or two independent secondary sources is a reasonable minimum. Cite the source in the comment.\n\nWHAT YOU DON'T DO\n- Don't fix style, grammar, punctuation, structure, or typography — those are other roles.\n- Don't rewrite the text. 
You refute or flag a problem — the decision is the author's.\n- Don't judge opinions or subjective phrasing as facts.\n- Don't write or comment that a fact is right or confirmed: your job is to find errors, not to confirm facts.\n- Don't fabricate confirmations. If you can't verify, honestly mark [Unverified] or [Unverifiable].\n\nHOW TO LEAVE COMMENTS\nYou don't edit the text directly. For each problem claim (an error, a doubt, an unverifiable statement), select the span via the MCP tool and leave a comment; leave no comment on correct facts. Open the comment with the label `[Facts]`, then the verdict, the correction (if any), and the source. Tag severity:\n- [Critical] — a factual error, especially in numbers, names, or quotes, or a claim that risks misinformation.\n- [Major] — a doubtful or unconfirmed claim that needs a source.\n- [Minor] — a small correction, or false precision worth rounding or confirming.\n\nTONE\nNeutral and precise. Don't argue with the author's stance — check facts, not views.\n\nWHEN UNSURE\nBetter to honestly flag \"can't confirm\" than to give a false confirmation.",
"autoStart": true,
"launchMessage": "Take the current page into work. If there is none, ask the user which page to work on."
},

File diff suppressed because one or more lines are too long

@@ -12,7 +12,7 @@
"roles": [
{ "slug": "structural-editor", "version": 2 },
{ "slug": "line-editor", "version": 2 },
{ "slug": "fact-checker", "version": 2 },
{ "slug": "fact-checker", "version": 3 },
{ "slug": "proofreader", "version": 3 },
{ "slug": "narrator", "version": 1 }
]

@@ -1,7 +1,7 @@
{
"fact-checker": {
"version": 2,
"hash": "d7ad1dae07d6f4321e7d40c5b36259dbf930264d748834809c4fb77294bf72e3"
"version": 3,
"hash": "a94931fbd20272570a588c72159ac9e48a89c99bd8f718449cda5e7ca4280fdf"
},
"line-editor": {
"version": 2,

@@ -0,0 +1,168 @@
import { describe, it, expect } from "vitest";
import { Editor } from "@tiptap/core";
import { Document } from "@tiptap/extension-document";
import { Paragraph } from "@tiptap/extension-paragraph";
import { Text } from "@tiptap/extension-text";
import { Node as PMNode, Fragment, Slice } from "@tiptap/pm/model";
import {
FootnoteReference,
FootnotesList,
FootnoteDefinition,
FOOTNOTE_REFERENCE_NAME,
FOOTNOTE_DEFINITION_NAME,
FOOTNOTES_LIST_NAME,
} from "@docmost/editor-ext";
import { canonicalizePastedFootnotes } from "./markdown-clipboard";
/**
* A markdown paste builds its ProseMirror fragment via DOM -> parseSlice and is
* applied with a manual transaction (handlePaste returns true), so it bypasses
* the editor's footnoteSyncPlugin — which never reorders an existing list. These
* tests pin canonicalizePastedFootnotes, the focused hook that makes a pasted
* out-of-order markdown footnote block come out canonical (issue #228).
*/
const extensions = [
Document,
Paragraph,
Text,
FootnoteReference,
FootnotesList,
FootnoteDefinition,
];
function makeSchema() {
const editor = new Editor({ extensions, content: { type: "doc", content: [] } });
const { schema } = editor;
return { editor, schema };
}
/** List footnote def ids of the (single) footnotesList in a slice, in order. */
function listIds(slice: Slice): string[] {
const out: string[] = [];
slice.content.forEach((node: PMNode) => {
if (node.type.name === FOOTNOTES_LIST_NAME) {
node.content.forEach((def: PMNode) => {
if (def.type.name === FOOTNOTE_DEFINITION_NAME) out.push(def.attrs.id);
});
}
});
return out;
}
function hasList(slice: Slice): boolean {
let found = false;
slice.content.forEach((n: PMNode) => {
if (n.type.name === FOOTNOTES_LIST_NAME) found = true;
});
return found;
}
describe("canonicalizePastedFootnotes", () => {
it("reorders a pasted block to reference order, dedups reuse, drops orphans", () => {
const { editor, schema } = makeSchema();
// Body references c, a, b (and again a => reuse); definitions a, b, c, z
// (z is an orphan) — the exact shape a markdown paste produces.
const slice = new Slice(
Fragment.fromArray([
schema.nodes.paragraph.create(null, [
schema.text("body "),
schema.nodes[FOOTNOTE_REFERENCE_NAME].create({ id: "c" }),
schema.nodes[FOOTNOTE_REFERENCE_NAME].create({ id: "a" }),
schema.nodes[FOOTNOTE_REFERENCE_NAME].create({ id: "b" }),
schema.nodes[FOOTNOTE_REFERENCE_NAME].create({ id: "a" }),
]),
schema.nodes[FOOTNOTES_LIST_NAME].create(null, [
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "a" }, [
schema.nodes.paragraph.create(null, [schema.text("note A")]),
]),
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "b" }, [
schema.nodes.paragraph.create(null, [schema.text("note B")]),
]),
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "c" }, [
schema.nodes.paragraph.create(null, [schema.text("note C")]),
]),
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "z" }, [
schema.nodes.paragraph.create(null, [schema.text("orphan")]),
]),
]),
]),
0,
0,
);
const out = canonicalizePastedFootnotes(slice, schema);
// Reference order, orphan z dropped, reused a appears once.
expect(listIds(out)).toEqual(["c", "a", "b"]);
editor.destroy();
});
it("leaves a reference-ONLY paste untouched (no synthesized definitions)", () => {
// A paste that reuses an id defined in the TARGET doc must NOT gain a
// synthesized empty definition here — it carries no footnotesList of its own.
const { editor, schema } = makeSchema();
const slice = new Slice(
Fragment.from(
schema.nodes.paragraph.create(null, [
schema.text("see "),
schema.nodes[FOOTNOTE_REFERENCE_NAME].create({ id: "a" }),
]),
),
0,
0,
);
const out = canonicalizePastedFootnotes(slice, schema);
expect(hasList(out)).toBe(false);
expect(out).toBe(slice); // returned unchanged (same reference)
editor.destroy();
});
it("leaves a definitions-ONLY paste untouched (no references -> no empty paste)", () => {
// A whole-block paste of ONLY definitions (a footnotesList with no matching
// footnoteReference anywhere in the selection). Canonicalizing it would strip
// the reference-less list -> an EMPTY paste, losing the pasted text. The hook
// must leave such a block untouched.
const { editor, schema } = makeSchema();
const slice = new Slice(
Fragment.fromArray([
schema.nodes[FOOTNOTES_LIST_NAME].create(null, [
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "a" }, [
schema.nodes.paragraph.create(null, [schema.text("note A")]),
]),
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "b" }, [
schema.nodes.paragraph.create(null, [schema.text("note B")]),
]),
]),
]),
0,
0,
);
const out = canonicalizePastedFootnotes(slice, schema);
expect(out).toBe(slice); // returned unchanged (same reference, content kept)
expect(listIds(out)).toEqual(["a", "b"]);
editor.destroy();
});
it("leaves an open (partial) slice untouched even if it carries a list", () => {
// An open slice (openStart/openEnd > 0) is a partial selection, not a
// standalone block, so it is returned as-is BEFORE any footnote handling.
const { editor, schema } = makeSchema();
const slice = new Slice(
Fragment.fromArray([
schema.nodes.paragraph.create(null, [
schema.nodes[FOOTNOTE_REFERENCE_NAME].create({ id: "a" }),
]),
schema.nodes[FOOTNOTES_LIST_NAME].create(null, [
schema.nodes[FOOTNOTE_DEFINITION_NAME].create({ id: "a" }, [
schema.nodes.paragraph.create(null, [schema.text("A")]),
]),
]),
]),
1,
1,
);
const out = canonicalizePastedFootnotes(slice, schema);
expect(out).toBe(slice);
editor.destroy();
});
});

@@ -3,7 +3,14 @@ import { Extension } from "@tiptap/core";
import { Plugin, PluginKey, TextSelection } from "@tiptap/pm/state";
import { DOMParser, DOMSerializer, Fragment, Slice } from "@tiptap/pm/model";
import { find } from "linkifyjs";
import { markdownToHtml, htmlToMarkdown } from "@docmost/editor-ext";
import {
markdownToHtml,
htmlToMarkdown,
canonicalizeFootnotes,
FOOTNOTES_LIST_NAME,
FOOTNOTE_REFERENCE_NAME,
} from "@docmost/editor-ext";
import type { Schema } from "@tiptap/pm/model";
export const MarkdownClipboard = Extension.create({
name: "markdownClipboard",
@@ -83,12 +90,25 @@ export const MarkdownClipboard = Extension.create({
const body = elementFromString(parsed);
normalizeTableColumnWidths(body);
const contentNodes = DOMParser.fromSchema(
const parsedSlice = DOMParser.fromSchema(
this.editor.schema,
).parseSlice(body, {
preserveWhitespace: true,
});
// A markdown paste builds its ProseMirror fragment directly (DOM ->
// parseSlice), bypassing the editor's footnoteSyncPlugin, which never
// reorders an existing list. So a pasted markdown block whose footnote
// definitions are out of order (or which contains orphan defs) would be
// stored out of order. Canonicalize the self-contained pasted block so
// its footnotes come out reference-ordered, deduped and orphan-free
// (issue #228). See canonicalizePastedFootnotes for why this is scoped
// to whole-block pastes that carry their own footnotesList.
const contentNodes = canonicalizePastedFootnotes(
parsedSlice,
this.editor.schema,
);
tr.replaceRange(from, to, contentNodes);
const insertEnd = tr.mapping.map(from, 1);
tr.setSelection(TextSelection.near(tr.doc.resolve(Math.max(from, insertEnd - 2)), -1));
@@ -133,6 +153,54 @@ export const MarkdownClipboard = Extension.create({
},
});
/**
* Reorder/dedup the footnotes of a SELF-CONTAINED pasted markdown block to the
* canonical invariant (the live footnoteSyncPlugin never reorders an existing
* list, so an out-of-order pasted block would otherwise persist out of order).
*
* Scoped deliberately to whole-block pastes (openStart/openEnd === 0) that carry
* their OWN footnotesList: canonicalizeFootnotes would synthesize empty
* definitions for any reference lacking a definition, which is correct for a
* standalone block but would be wrong for a reference-only paste that REUSES a
* footnote already defined in the target document — so those are left untouched
* for the paste/sync plugins to merge. Residual: when the pasted block is merged
* into a doc that already has footnotes, ordering RELATIVE to the pre-existing
* footnotes is still governed by the sync plugin (which does not reorder).
*
* Also requires at least one footnoteReference in the selection: a definitions-ONLY
* paste (`[^a]: …` with no `[^a]` reference in the same block) has no references,
* so canonicalizeFootnotes would drop the whole list and the paste would come out
* EMPTY — losing the pasted text. Such a block is left as-is for the sync plugin.
*/
export function canonicalizePastedFootnotes(slice: Slice, schema: Schema): Slice {
if (slice.openStart !== 0 || slice.openEnd !== 0) return slice;
let hasFootnotesList = false;
let hasReference = false;
slice.content.forEach((node) => {
if (node.type.name === FOOTNOTES_LIST_NAME) hasFootnotesList = true;
// footnoteReference is an inline atom, never a top-level slice child here
// (this function early-returns for open slices, so children are whole
// blocks), so it is only reachable by descending.
node.descendants((child) => {
if (child.type.name === FOOTNOTE_REFERENCE_NAME) hasReference = true;
});
});
if (!hasFootnotesList) return slice;
// No reference anywhere -> a definitions-only paste; canonicalizing would strip
// the reference-less list (empty paste). Leave it untouched.
if (!hasReference) return slice;
const content = slice.content.toJSON();
if (!Array.isArray(content)) return slice;
const canonical = canonicalizeFootnotes({ type: "doc", content }) as {
content?: unknown[];
};
const fragment = Fragment.fromJSON(schema, canonical.content ?? []);
return new Slice(fragment, 0, 0);
}
function elementFromString(value) {
// add a wrapper to preserve leading and trailing whitespace
const wrappedValue = `<body>${value}</body>`;

@@ -3,6 +3,9 @@ import {
resolveCardStatus,
isEndpointConfigured,
resolveKeyField,
nextReindexPollInterval,
isReindexComplete,
isReindexButtonLoading,
} from './ai-provider-settings';
describe('resolveCardStatus', () => {
@@ -71,3 +74,152 @@ describe('resolveKeyField (write-only key payload)', () => {
expect(resolveKeyField('', false)).toEqual({ set: false });
});
});
describe('nextReindexPollInterval', () => {
const INTERVAL = 5000;
const base = { now: 1_000, intervalMs: INTERVAL };
it('does not poll when no reindex deadline is set', () => {
expect(
nextReindexPollInterval({
...base,
deadline: null,
status: { reindexing: true, indexedPages: 0, totalPages: 478 },
}),
).toBe(false);
});
it('keeps polling while the server reports an active run', () => {
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: true, indexedPages: 120, totalPages: 478 },
}),
).toBe(INTERVAL);
});
it('keeps polling during an active run even if counts momentarily look full', () => {
// The run clears its progress record only at the very end, so a transient
// indexed==total while reindexing is still true must NOT stop polling.
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: true, indexedPages: 478, totalPages: 478 },
}),
).toBe(INTERVAL);
});
it('stops once the run is finished AND fully indexed', () => {
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: false, indexedPages: 478, totalPages: 478 },
}),
).toBe(false);
});
it('keeps polling within the deadline when not yet done and no active flag', () => {
// First poll right after enqueue, before the worker publishes progress.
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: false, indexedPages: 0, totalPages: 478 },
}),
).toBe(INTERVAL);
});
it('cap always wins: stops once past the deadline even if still reindexing', () => {
expect(
nextReindexPollInterval({
deadline: 1_000,
now: 2_000, // past the deadline
intervalMs: INTERVAL,
status: { reindexing: true, indexedPages: 200, totalPages: 478 },
}),
).toBe(false);
});
it('stops on an empty workspace (0 of 0) once the run is finished', () => {
expect(
nextReindexPollInterval({
...base,
deadline: 10_000,
status: { reindexing: false, indexedPages: 0, totalPages: 0 },
}),
).toBe(false);
});
});
describe('isReindexComplete', () => {
it('false when no status yet', () => {
expect(isReindexComplete(undefined)).toBe(false);
});
it('false while a run is still active (even at indexed==total)', () => {
expect(
isReindexComplete({ reindexing: true, indexedPages: 478, totalPages: 478 }),
).toBe(false);
});
it('false when finished but not yet fully indexed', () => {
expect(
isReindexComplete({ reindexing: false, indexedPages: 120, totalPages: 478 }),
).toBe(false);
});
it('true once finished and fully indexed', () => {
expect(
isReindexComplete({ reindexing: false, indexedPages: 478, totalPages: 478 }),
).toBe(true);
});
});
describe('isReindexButtonLoading', () => {
it('loads while the POST mutation is pending', () => {
expect(
isReindexButtonLoading({
mutationPending: true,
deadline: null,
status: false,
}),
).toBe(true);
});
it('does NOT load post-cap: deadline nulled but reindexing left stale-true', () => {
// The key case: after the poll cap fires `reindexDeadline` is null while
// `settings.reindexing` can be a stale `true` from the last poll. Gating on
// the deadline keeps the spinner from sticking forever so the admin can
// restart.
expect(
isReindexButtonLoading({
mutationPending: false,
deadline: null,
status: true,
}),
).toBe(false);
});
it('loads during an active run within the poll window', () => {
expect(
isReindexButtonLoading({
mutationPending: false,
deadline: 10_000,
status: true,
}),
).toBe(true);
});
it('does not load once the run finished while still polling', () => {
expect(
isReindexButtonLoading({
mutationPending: false,
deadline: 10_000,
status: false,
}),
).toBe(false);
});
});

@@ -37,6 +37,7 @@ import {
} from "@/features/workspace/queries/ai-settings-query.ts";
import {
AiTestCapability,
IAiSettings,
IAiSettingsUpdate,
SttApiStyle,
ChatApiStyle,
@@ -169,6 +170,73 @@ export function resolveKeyField(
return { set: false };
}
// Subset of the status payload that drives the reindex poll decisions.
type ReindexStatus = Pick<
IAiSettings,
"reindexing" | "indexedPages" | "totalPages"
>;
/**
* Decide the TanStack Query `refetchInterval` while a reindex may be running.
* Returns the poll interval (ms) to keep polling, or `false` to stop.
*
* Polls while the server reports an ACTIVE run (`reindexing === true`) OR we are
* still within the deadline window and not yet fully indexed. Stops once the run
* has finished AND everything is indexed (server cleared its progress record and
* fell back to the DB coverage count), or the deadline cap is hit — the cap
* always wins so a stuck/never-clearing progress record can't poll forever.
*/
export function nextReindexPollInterval(args: {
deadline: number | null;
now: number;
intervalMs: number;
status?: ReindexStatus;
}): number | false {
const { deadline, now, intervalMs, status } = args;
if (deadline === null) return false;
// Cap always wins.
if (now > deadline) return false;
// Active run → keep polling even if the momentary counts already look full.
if (status?.reindexing) return intervalMs;
// Finished and fully indexed (incl. an empty workspace, 0 >= 0) → stop. Reuse
// isReindexComplete so the completeness check lives in exactly one place.
if (isReindexComplete(status)) return false;
// Within the deadline and not yet done → keep polling.
return intervalMs;
}
/**
* Whether the reindex poll deadline should be cleared: the server reports no
* active run AND the count is complete. The single source of truth for the
* "reindex finished" check — `nextReindexPollInterval` reuses it for its stop
* condition (sans the cap, which the effect handles via time).
*/
export function isReindexComplete(status?: ReindexStatus): boolean {
return (
!!status && !status.reindexing && status.indexedPages >= status.totalPages
);
}
/**
* Whether the reindex button should show its spinner (and stay disabled).
*
* Spins while the POST is in flight, and for the WHOLE background run while the
* server reports `reindexing === true`. The `deadline !== null` gate is the
* load-bearing part: once the 120s poll cap fires it nulls `reindexDeadline`
* and stops refetching, so `status` (settings?.reindexing) can be a stale
* `true` from the last poll. Without the gate the spinner would stick forever
* for a run that outlives the cap and block a restart; gating on the active
* poll window clears it so the admin can re-trigger.
*/
export function isReindexButtonLoading(args: {
mutationPending: boolean;
deadline: number | null;
status?: boolean;
}): boolean {
const { mutationPending, deadline, status } = args;
return mutationPending || (deadline !== null && status === true);
}
// Translate the dot's tooltip label. Kept in one place so all three endpoint
// cards share identical wording.
function cardStatusLabel(status: CardStatus, t: (k: string) => string): string {
@@ -215,31 +283,34 @@ export default function AiProviderSettings() {
// PRE-job counts immediately, so the only way the "Indexed X of Y" counter
// visibly climbs is to keep polling the settings query while the job runs.
// `reindexDeadline` is the timestamp until which we poll (set on reindex
// success); polling stops early once indexed === total. Bounded so a stuck
// job can never poll forever.
const REINDEX_POLL_INTERVAL = 3000; // ms between refetches while indexing
// success). Polling tracks the server's `reindexing` flag: it keeps going for
// the whole active run and stops promptly once the server reports the run is
// finished. Bounded by the cap so a stuck/never-clearing progress record can
// never poll forever.
const REINDEX_POLL_INTERVAL = 5000; // ms between refetches while indexing
const REINDEX_POLL_CAP_MS = 120000; // ~2 min hard cap
const [reindexDeadline, setReindexDeadline] = useState<number | null>(null);
// Only admins may read the (masked) AI settings; the server enforces this too.
const { data: settings, isLoading } = useAiSettingsQuery(isAdmin, (query) => {
if (reindexDeadline === null) return false;
// Past the cap → stop polling (cleared via the effect below too).
if (Date.now() > reindexDeadline) return false;
const data = query.state.data;
// Stop once everything is indexed; otherwise keep polling.
if (data && data.indexedPages >= data.totalPages) return false;
return REINDEX_POLL_INTERVAL;
});
const { data: settings, isLoading } = useAiSettingsQuery(isAdmin, (query) =>
nextReindexPollInterval({
deadline: reindexDeadline,
now: Date.now(),
intervalMs: REINDEX_POLL_INTERVAL,
status: query.state.data,
}),
);
// Stop polling once the work is done or the cap is reached. Also clears on
// Stop polling once the run is finished or the cap is reached. Also clears on
// unmount because the deadline state goes away with the component.
useEffect(() => {
if (reindexDeadline === null) return;
// "Done" matches the refetchInterval stop condition (indexed >= total),
// including an empty workspace (0 >= 0), so the deadline clears promptly
// instead of waiting out the cap.
if (settings && settings.indexedPages >= settings.totalPages) {
// "Done" matches the refetchInterval stop condition: the server reports no
// active run AND the count is complete (indexed >= total, incl. an empty
// workspace 0 >= 0), so the deadline clears promptly instead of waiting out
// the cap. While `reindexing` is still true we keep the deadline so polling
// continues for the whole run.
if (isReindexComplete(settings)) {
setReindexDeadline(null);
return;
}
@@ -1031,7 +1102,17 @@ export default function AiProviderSettings() {
<Button
variant="subtle"
size="compact-sm"
loading={reindexMutation.isPending}
// Spin for the WHOLE run: the POST resolves immediately, but the
// background job keeps running, so also stay loading while the
// server reports `reindexing` (this also blocks a redundant
// re-trigger mid-run; the server de-dupes regardless). The
// deadline gate (and why it matters post-cap) lives in
// `isReindexButtonLoading`, which is unit-tested.
loading={isReindexButtonLoading({
mutationPending: reindexMutation.isPending,
deadline: reindexDeadline,
status: settings?.reindexing,
})}
onClick={() =>
reindexMutation.mutate(undefined, {
// Begin bounded polling so the counter climbs as the async

@@ -23,8 +23,12 @@ export function useAiSettingsQuery(
enabled: boolean = true,
// While reindexing runs as an async background job, the counter only climbs
// if the client keeps refetching. The component passes a refetchInterval
// function that polls until indexed === total or a bounded deadline, then
// returns false to stop. See AiProviderSettings.
// function (`nextReindexPollInterval`) that keeps polling while the server
// reports an active run (reindexing === true) OR we are still within the
// bounded deadline and not yet fully indexed; it returns false to stop only
// once the run has finished AND indexed >= total, or the deadline cap is hit
// (the cap always wins). Note: a transient indexed === total during an active
// run does NOT stop polling. See AiProviderSettings.
refetchInterval?:
| number
| false
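
Review note: the rewritten comment describes the stop condition of `nextReindexPollInterval` in prose. The sketch below restates that condition as code, as a reading aid only. The signature, the `POLL_MS` value, and the choice to keep polling while no status has loaded yet are assumptions; only the function name and the stop rules come from the comment.

```typescript
// Hypothetical sketch of the stop condition described in the comment above.
// Signature and interval are assumed, not taken from the PR.
interface ReindexStatus {
  indexedPages: number;
  totalPages: number;
  reindexing?: boolean;
}

const POLL_MS = 2000; // assumed polling interval

function nextReindexPollInterval(
  status: ReindexStatus | undefined,
  deadline: number | null,
  now: number,
): number | false {
  // The cap always wins: no deadline, or past it, means stop.
  if (deadline === null || now >= deadline) return false;
  // Assumption: with no data yet, keep polling inside the deadline.
  if (!status) return POLL_MS;
  // An active run keeps polling, even on a transient indexed === total.
  if (status.reindexing === true) return POLL_MS;
  // Run finished AND count complete: stop. Otherwise poll until the cap.
  return status.indexedPages >= status.totalPages ? false : POLL_MS;
}
```

Checking `reindexing` before the count comparison is what makes a transient `indexed === total` during an active run harmless.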

@@ -48,6 +48,9 @@ export interface IAiSettings {
// RAG indexing coverage (pages indexed for semantic search).
indexedPages: number;
totalPages: number;
// True while a full workspace reindex is actively running; the counts above
// then reflect the live run progress (done climbs 0 -> total).
reindexing?: boolean;
}
// Update payload. Key semantics (same for `apiKey` and `embeddingApiKey`):

@@ -3,6 +3,8 @@ import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { PageEmbeddingRepo } from '@docmost/db/repos/ai-chat/page-embedding.repo';
import { KyselyDB } from '@docmost/db/types/kysely.types';
import { AiService } from '../../../integrations/ai/ai.service';
import { EmbeddingReindexProgressService } from '../../../integrations/ai/embedding-reindex-progress.service';
import { AiEmbeddingNotConfiguredException } from '../../../integrations/ai/ai-embedding-not-configured.exception';
/**
* Unit tests for EmbeddingIndexerService.reindexWorkspace's batch control flow.
@@ -12,7 +14,8 @@ import { AiService } from '../../../integrations/ai/ai.service';
* reindexWorkspace actually touches:
* - aiService.getEmbeddingModel -> a model string so the up-front configured
* check passes,
* - pageRepo.getIdsByWorkspace -> three page ids,
* - pageRepo.getEmbeddablePageIds -> three page ids (the embeddable set the
* reindex iterates),
* - service.reindexPage -> spied per test to drive the per-page outcome.
*
* The point under test is the catch block: a FATAL provider error (auth/billing)
@@ -24,21 +27,30 @@ describe('EmbeddingIndexerService.reindexWorkspace fail-fast', () => {
function makeService() {
const pageRepo = {
getIdsByWorkspace: jest.fn().mockResolvedValue(['p1', 'p2', 'p3']),
getEmbeddablePageIds: jest.fn().mockResolvedValue(['p1', 'p2', 'p3']),
};
const pageEmbeddingRepo = {};
const aiService = {
getEmbeddingModel: jest.fn().mockResolvedValue('some-model'),
};
// Progress is a best-effort cosmetic store; mock its async methods so the
// batch control flow can be tested without Redis.
const reindexProgress = {
start: jest.fn().mockResolvedValue(undefined),
increment: jest.fn().mockResolvedValue(undefined),
clear: jest.fn().mockResolvedValue(undefined),
get: jest.fn().mockResolvedValue(null),
};
const db = {};
const service = new EmbeddingIndexerService(
pageRepo as unknown as PageRepo,
pageEmbeddingRepo as unknown as PageEmbeddingRepo,
aiService as unknown as AiService,
reindexProgress as unknown as EmbeddingReindexProgressService,
db as unknown as KyselyDB,
);
return { service, pageRepo, aiService };
return { service, pageRepo, aiService, reindexProgress };
}
it('aborts after the first page on a FATAL (401) provider error', async () => {
@@ -78,3 +90,100 @@ describe('EmbeddingIndexerService.reindexWorkspace fail-fast', () => {
expect(reindexPage).toHaveBeenCalledTimes(3);
});
});
/**
* Live reindex-progress reporting: reindexWorkspace must publish a per-workspace
* progress record (total at start, done incremented per processed page) and ALWAYS
* clear it in a finally — including on a fatal abort and an unconfigured early
* return — so the settings status can show the counter climb without ever getting
* stuck in a "reindexing" state.
*/
describe('EmbeddingIndexerService.reindexWorkspace progress', () => {
const WORKSPACE_ID = 'ws-1';
function makeService(pageIds: string[] = ['p1', 'p2', 'p3']) {
const pageRepo = {
getEmbeddablePageIds: jest.fn().mockResolvedValue(pageIds),
};
const pageEmbeddingRepo = {};
const aiService = {
getEmbeddingModel: jest.fn().mockResolvedValue('some-model'),
};
const reindexProgress = {
start: jest.fn().mockResolvedValue(undefined),
increment: jest.fn().mockResolvedValue(undefined),
clear: jest.fn().mockResolvedValue(undefined),
get: jest.fn().mockResolvedValue(null),
};
const db = {};
const service = new EmbeddingIndexerService(
pageRepo as unknown as PageRepo,
pageEmbeddingRepo as unknown as PageEmbeddingRepo,
aiService as unknown as AiService,
reindexProgress as unknown as EmbeddingReindexProgressService,
db as unknown as KyselyDB,
);
return { service, pageRepo, aiService, reindexProgress };
}
it('sets total at start, increments done per page, and clears in finally', async () => {
const { service, reindexProgress } = makeService(['p1', 'p2', 'p3']);
jest.spyOn(service, 'reindexPage').mockResolvedValue(undefined);
await service.reindexWorkspace(WORKSPACE_ID);
expect(reindexProgress.start).toHaveBeenCalledWith(WORKSPACE_ID, 3);
// One increment per processed page.
expect(reindexProgress.increment).toHaveBeenCalledTimes(3);
expect(reindexProgress.increment).toHaveBeenCalledWith(WORKSPACE_ID);
// Cleared exactly once on completion.
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
expect(reindexProgress.clear).toHaveBeenCalledWith(WORKSPACE_ID);
});
it('counts a handled (non-fatal) per-page failure as processed', async () => {
const { service, reindexProgress } = makeService(['p1', 'p2', 'p3']);
// No statusCode -> non-fatal -> isolate and continue; each counts as done.
jest.spyOn(service, 'reindexPage').mockRejectedValue(new Error('boom'));
await service.reindexWorkspace(WORKSPACE_ID);
expect(reindexProgress.increment).toHaveBeenCalledTimes(3);
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
});
it('clears progress in finally even when a FATAL provider error aborts the batch', async () => {
const { service, reindexProgress } = makeService(['p1', 'p2', 'p3']);
// A 401 aborts on the first page (re-thrown) — the finally must still clear.
jest
.spyOn(service, 'reindexPage')
.mockRejectedValue({ statusCode: 401, message: 'User not found' });
await expect(service.reindexWorkspace(WORKSPACE_ID)).rejects.toMatchObject({
statusCode: 401,
});
expect(reindexProgress.start).toHaveBeenCalledWith(WORKSPACE_ID, 3);
// Aborted page is NOT counted as processed.
expect(reindexProgress.increment).not.toHaveBeenCalled();
// But progress is still cleared so the run never gets stuck.
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
});
it('clears the enqueue-seeded progress on an unconfigured early return', async () => {
const { service, aiService, reindexProgress } = makeService();
// Embeddings not configured: reindexWorkspace returns early WITHOUT starting
// a fresh record, but the finally must still clear the enqueue-time seed.
aiService.getEmbeddingModel = jest
.fn()
.mockRejectedValue(new AiEmbeddingNotConfiguredException());
await expect(
service.reindexWorkspace(WORKSPACE_ID),
).resolves.toBeUndefined();
expect(reindexProgress.start).not.toHaveBeenCalled();
expect(reindexProgress.clear).toHaveBeenCalledTimes(1);
expect(reindexProgress.clear).toHaveBeenCalledWith(WORKSPACE_ID);
});
});

@@ -9,6 +9,7 @@ import { KyselyDB } from '@docmost/db/types/kysely.types';
import { InjectKysely } from 'nestjs-kysely';
import { executeTx } from '@docmost/db/utils';
import { AiService } from '../../../integrations/ai/ai.service';
import { EmbeddingReindexProgressService } from '../../../integrations/ai/embedding-reindex-progress.service';
import { AiEmbeddingNotConfiguredException } from '../../../integrations/ai/ai-embedding-not-configured.exception';
import {
describeProviderError,
@@ -48,6 +49,7 @@ export class EmbeddingIndexerService {
private readonly pageRepo: PageRepo,
private readonly pageEmbeddingRepo: PageEmbeddingRepo,
private readonly aiService: AiService,
private readonly reindexProgress: EmbeddingReindexProgressService,
@InjectKysely() private readonly db: KyselyDB,
) {}
@@ -183,9 +185,17 @@ export class EmbeddingIndexerService {
}
/**
* (Re)build embeddings for EVERY non-deleted page in a workspace. Used by the
* bulk reindex (WORKSPACE_CREATE_EMBEDDINGS, fired when AI Search is enabled
* and by the manual "Reindex now" action).
* (Re)build embeddings for the EMBEDDABLE page set of a workspace — the same
* set countEmbeddablePages counts (via getEmbeddablePageIds): non-deleted pages
* that have non-empty textContent, OR whose content JSON has at least one text
* node, OR that already have a stored embedding row, NOT every non-deleted page.
* Iterating this set keeps the live `total` equal to
* the steady-state denominator, so the progress counter climbs 0 -> total and
* matches the before/after DB coverage exactly. Text-less pages are correctly
* skipped (reindexPage no-ops on them); a page that lost its text but still has
* stale embeddings stays in the set (the EXISTS clause) so it is visited and
* its stale rows are cleared. Used by the bulk reindex
* (WORKSPACE_CREATE_EMBEDDINGS, fired when AI Search is enabled and by the
* manual "Reindex now" action).
*
* Resolves the embeddings model once up front: if the workspace has no
* embeddings provider configured, the whole batch is skipped (otherwise each
@@ -194,69 +204,96 @@ export class EmbeddingIndexerService {
* the batch.
*/
async reindexWorkspace(workspaceId: string): Promise<void> {
// The whole run is wrapped so the per-workspace progress record is ALWAYS
// cleared in the finally — on success, on a fatal-provider abort, on an
// unconfigured early-return, or on any unexpected throw — so a failed run
// never leaves a stuck "reindexing" state (the status then falls back to the
// steady-state DB coverage count). A placeholder record may already exist
// (seeded at enqueue time); the finally cleans that too.
try {
await this.aiService.getEmbeddingModel(workspaceId);
} catch (err) {
if (err instanceof AiEmbeddingNotConfiguredException) {
this.logger.log(
`reindexWorkspace: embeddings not configured for workspace ${workspaceId}, skipping`,
);
return;
}
throw err;
}
const pageIds = await this.pageRepo.getIdsByWorkspace(workspaceId);
const total = pageIds.length;
const startedAt = Date.now();
this.logger.log(
`reindexWorkspace: starting reindex of ${total} page(s) for workspace ${workspaceId}`,
);
let failed = 0;
for (let i = 0; i < total; i++) {
const pageId = pageIds[i];
const position = i + 1;
// Log BEFORE the await: if the embedding call hangs, this is the last line
// in the log and it names the exact page that is stuck.
this.logger.log(
`reindexWorkspace: [${position}/${total}] indexing page ${pageId} (workspace ${workspaceId})`,
);
const pageStartedAt = Date.now();
try {
await this.reindexPage(pageId);
const elapsed = Date.now() - pageStartedAt;
if (elapsed >= SLOW_PAGE_MS) {
this.logger.warn(
`reindexWorkspace: [${position}/${total}] page ${pageId} took ${elapsed}ms`,
);
}
await this.aiService.getEmbeddingModel(workspaceId);
} catch (err) {
// A fatal provider error (invalid/missing key, no credits) recurs
// identically on EVERY remaining page. Abort the whole batch instead of
// issuing hundreds of doomed requests against the provider.
if (isFatalProviderError(err)) {
this.logger.error(
`reindexWorkspace: aborting at [${position}/${total}] for workspace ` +
`${workspaceId} — fatal provider error, remaining pages would fail ` +
`identically: ${describeProviderError(err)}`,
if (err instanceof AiEmbeddingNotConfiguredException) {
this.logger.log(
`reindexWorkspace: embeddings not configured for workspace ${workspaceId}, skipping`,
);
throw err;
return;
}
// Per-page isolation: one non-fatal failure (incl. an embedding timeout)
// must not abort the whole batch.
failed++;
this.logger.error(
`reindexWorkspace: [${position}/${total}] failed to reindex page ${pageId} ` +
`after ${Date.now() - pageStartedAt}ms: ${describeProviderError(err)}`,
);
throw err;
}
}
this.logger.log(
`reindexWorkspace: done for workspace ${workspaceId}: ` +
`${total - failed}/${total} indexed, ${failed} failed in ${Date.now() - startedAt}ms`,
);
// Iterate the EMBEDDABLE set (same predicate as countEmbeddablePages), NOT
// every non-deleted page: this makes `total` here equal the steady-state
// denominator, so the live counter climbs 0 -> total and matches the
// before/after DB count exactly (no 478 -> 500 -> 478 denominator jump).
// Text-less pages are correctly skipped — reindexPage no-ops on them, and
// a page that lost its text but still has stale embeddings IS in this set
// (the EXISTS clause) so it is still visited and its stale rows cleared.
const pageIds = await this.pageRepo.getEmbeddablePageIds(workspaceId);
const total = pageIds.length;
const startedAt = Date.now();
// Publish the live run progress over this same set (done reset to 0). The
// counter increments once per iterated page and reaches exactly `total`,
// which equals countEmbeddablePages — the steady-state denominator.
await this.reindexProgress.start(workspaceId, total);
this.logger.log(
`reindexWorkspace: starting reindex of ${total} page(s) for workspace ${workspaceId}`,
);
let failed = 0;
for (let i = 0; i < total; i++) {
const pageId = pageIds[i];
const position = i + 1;
// Log BEFORE the await: if the embedding call hangs, this is the last line
// in the log and it names the exact page that is stuck.
this.logger.log(
`reindexWorkspace: [${position}/${total}] indexing page ${pageId} (workspace ${workspaceId})`,
);
const pageStartedAt = Date.now();
try {
await this.reindexPage(pageId);
// Count this page as processed (matches the [position/total] log).
await this.reindexProgress.increment(workspaceId);
const elapsed = Date.now() - pageStartedAt;
if (elapsed >= SLOW_PAGE_MS) {
this.logger.warn(
`reindexWorkspace: [${position}/${total}] page ${pageId} took ${elapsed}ms`,
);
}
} catch (err) {
// A fatal provider error (invalid/missing key, no credits) recurs
// identically on EVERY remaining page. Abort the whole batch instead of
// issuing hundreds of doomed requests against the provider. Do NOT count
// it as processed — the run aborts here (the finally clears progress).
if (isFatalProviderError(err)) {
this.logger.error(
`reindexWorkspace: aborting at [${position}/${total}] for workspace ` +
`${workspaceId} — fatal provider error, remaining pages would fail ` +
`identically: ${describeProviderError(err)}`,
);
throw err;
}
// Per-page isolation: one non-fatal failure (incl. an embedding timeout)
// must not abort the whole batch. A handled failure still advances the
// counter (matches the [position/total] log, so done reaches total).
failed++;
await this.reindexProgress.increment(workspaceId);
this.logger.error(
`reindexWorkspace: [${position}/${total}] failed to reindex page ${pageId} ` +
`after ${Date.now() - pageStartedAt}ms: ${describeProviderError(err)}`,
);
}
}
this.logger.log(
`reindexWorkspace: done for workspace ${workspaceId}: ` +
`${total - failed}/${total} indexed, ${failed} failed in ${Date.now() - startedAt}ms`,
);
} finally {
// Always remove the progress record so the status reverts to the DB count.
await this.reindexProgress.clear(workspaceId);
}
}
/** Purge ALL embeddings for a workspace (WORKSPACE_DELETE_EMBEDDINGS). */
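
Review note: the progress bookkeeping in `reindexWorkspace` (start, one increment per processed page, clear in `finally`) can be modelled without Redis. The sketch below is a hypothetical in-memory stand-in, not the real `EmbeddingReindexProgressService`. The method names `start`, `increment`, `clear` and `get` come from the test mocks; the storage, the record shape, and the `runWithProgress` helper are assumptions. It is synchronous for brevity, whereas the real methods are async (the tests mock them with `mockResolvedValue`).

```typescript
// Hypothetical in-memory model of the progress contract; not the real service.
interface ReindexProgress {
  total: number;
  done: number;
}

class InMemoryReindexProgress {
  private records: { [workspaceId: string]: ReindexProgress } = {};

  start(workspaceId: string, total: number): void {
    this.records[workspaceId] = { total, done: 0 }; // done resets to 0
  }
  increment(workspaceId: string): void {
    const rec = this.records[workspaceId];
    if (rec) rec.done += 1; // no-op if the record was already cleared
  }
  clear(workspaceId: string): void {
    delete this.records[workspaceId];
  }
  get(workspaceId: string): ReindexProgress | null {
    return this.records[workspaceId] ?? null;
  }
}

// The same control flow as reindexWorkspace, reduced to the bookkeeping.
function runWithProgress(
  progress: InMemoryReindexProgress,
  workspaceId: string,
  pageIds: string[],
  work: (pageId: string) => void,
  isFatal: (err: unknown) => boolean,
): void {
  try {
    progress.start(workspaceId, pageIds.length);
    for (const pageId of pageIds) {
      try {
        work(pageId);
      } catch (err) {
        // Fatal: abort the batch without counting this page.
        if (isFatal(err)) throw err;
        // Non-fatal: handled, and still counted as processed below.
      }
      progress.increment(workspaceId);
    }
  } finally {
    // Always clear, so a failed run never leaves a stuck record.
    progress.clear(workspaceId);
  }
}
```

Because the increment sits after the inner `try`/`catch`, a handled failure advances the counter exactly like a success, while a fatal error skips it and falls straight through to the `finally`.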

@@ -0,0 +1,153 @@
// Binding test for issue #228 must-fix #1 / test-coverage #12: footnote
// canonicalization moved OUT of parseProsemirrorContent and is now applied only
// on FULL-document writes (createPage, and updatePageContent with operation
// 'replace'), NEVER on an append/prepend FRAGMENT.
//
// The Yjs encode / plain-text extract are stubbed (partial module mock keeps the
// REAL canonicalizeFootnotes) and parseProsemirrorContent is spied to return the
// raw fixture, so the test isolates the canonicalize BINDING from schema/Yjs.
jest.mock('@docmost/editor-ext', () => {
const actual = jest.requireActual('@docmost/editor-ext');
return {
...actual,
createYdocFromJson: jest.fn(() => Buffer.from([])),
jsonToText: jest.fn(() => ''),
};
});
import { PageService } from './page.service';
const refNode = (id: string) => ({ type: 'footnoteReference', attrs: { id } });
const defNode = (id: string, text: string) => ({
type: 'footnoteDefinition',
attrs: { id },
content: [{ type: 'paragraph', content: [{ type: 'text', text }] }],
});
const doc = (...content: any[]) => ({ type: 'doc', content });
/** A full doc whose footnote definitions are OUT of reference order (b,a refs;
* a,b defs) — canonicalization must reorder the definitions to [b, a]. */
const outOfOrderFull = () =>
doc(
{ type: 'paragraph', content: [{ type: 'text', text: 'x' }, refNode('b'), refNode('a')] },
{ type: 'footnotesList', content: [defNode('a', 'A'), defNode('b', 'B')] },
);
/** A definition-ONLY fragment (no references): canonicalizing it would drop the
* whole footnotesList (referenceIds is empty) — i.e. LOSE the footnote. */
const defOnlyFragment = () =>
doc({ type: 'footnotesList', content: [defNode('a', 'appended note')] });
/** A reference-only fragment that REUSES an id defined elsewhere in the live
* doc: canonicalizing it would synthesize a bogus empty footnotesList/def. */
const refReuseFragment = () =>
doc({ type: 'paragraph', content: [{ type: 'text', text: 'more' }, refNode('a')] });
function listDefIds(content: any): string[] {
const list = (content.content ?? []).find((n: any) => n.type === 'footnotesList');
return (list?.content ?? [])
.filter((n: any) => n.type === 'footnoteDefinition')
.map((n: any) => n.attrs?.id);
}
function hasFootnotesList(content: any): boolean {
return (content.content ?? []).some((n: any) => n.type === 'footnotesList');
}
describe('PageService footnote canonicalization binding (#228)', () => {
function makeService() {
let insertedContent: any = null;
let yjsPayload: any = null;
const pageRepo = {
insertPage: jest.fn(async (values: any) => {
insertedContent = values.content;
return { id: 'page-id', slugId: 'slug-id' };
}),
};
const generalQueue = { add: jest.fn().mockReturnValue({ catch: jest.fn() }) };
const collaborationGateway = {
handleYjsEvent: jest.fn(async (_evt: string, _name: string, payload: any) => {
yjsPayload = payload;
}),
};
const service = new PageService(
pageRepo as any,
{} as any, // pagePermissionRepo
{} as any, // attachmentRepo
{} as any, // db
{} as any, // storageService
{} as any, // attachmentQueue
{} as any, // aiQueue
generalQueue as any,
{} as any, // eventEmitter
collaborationGateway as any,
{} as any, // watcherService
{} as any, // transclusionService
);
// Isolate the canonicalize BINDING: return the raw fixture (a deep clone so
// canonicalize never mutates the caller's object) instead of running the
// real markdown/HTML/JSON parse + schema validation.
jest
.spyOn(service as any, 'parseProsemirrorContent')
.mockImplementation(async (content: any) => structuredClone(content));
jest.spyOn(service as any, 'nextPagePosition').mockResolvedValue('a0');
return { service, getInsertedContent: () => insertedContent, getYjsPayload: () => yjsPayload };
}
it('createPage (full write) canonicalizes footnotes into reference order', async () => {
const { service, getInsertedContent } = makeService();
await service.create('user-id', 'workspace-id', {
spaceId: 'space-id',
content: outOfOrderFull(),
format: 'json',
} as any);
// Definitions reordered to reference order [b, a].
expect(listDefIds(getInsertedContent())).toEqual(['b', 'a']);
});
it("updatePageContent operation 'replace' canonicalizes footnotes", async () => {
const { service, getYjsPayload } = makeService();
await service.updatePageContent(
'page-id',
outOfOrderFull(),
'replace' as any,
'json' as any,
{ id: 'user-id' } as any,
);
expect(getYjsPayload().operation).toBe('replace');
expect(listDefIds(getYjsPayload().prosemirrorJson)).toEqual(['b', 'a']);
});
it("append of a definition-only fragment is NOT canonicalized (footnote preserved, not dropped)", async () => {
const { service, getYjsPayload } = makeService();
await service.updatePageContent(
'page-id',
defOnlyFragment(),
'append' as any,
'json' as any,
{ id: 'user-id' } as any,
);
// Canonicalizing a reference-less fragment would DROP the whole list; the
// fragment must pass through untouched so the merge keeps the definition.
expect(getYjsPayload().operation).toBe('append');
expect(hasFootnotesList(getYjsPayload().prosemirrorJson)).toBe(true);
expect(listDefIds(getYjsPayload().prosemirrorJson)).toEqual(['a']);
});
it('prepend of a reference-reuse fragment is NOT canonicalized (no synthesized garbage list)', async () => {
const { service, getYjsPayload } = makeService();
await service.updatePageContent(
'page-id',
refReuseFragment(),
'prepend' as any,
'json' as any,
{ id: 'user-id' } as any,
);
// Canonicalizing would synthesize a bogus empty footnotesList for the reused
// reference; the fragment must pass through with no list at all.
expect(getYjsPayload().operation).toBe('prepend');
expect(hasFootnotesList(getYjsPayload().prosemirrorJson)).toBe(false);
});
});
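
Review note: the four tests above rely on three behaviours of `canonicalizeFootnotes` that are only described in comments: definitions are reordered to reference order, a document with no references loses its `footnotesList`, and a reference with no matching definition gets an empty one synthesized. The sketch below is a simplified, hypothetical model of those three behaviours, written to show why fragments must bypass it. It is not the `@docmost/editor-ext` implementation, and the names `PmNode`, `collectReferenceIds` and `canonicalizeFootnotesSketch` are invented for illustration.

```typescript
// Hypothetical, simplified model; the real canonicalizer may differ.
type PmNode = {
  type: string;
  attrs?: { id?: string };
  text?: string;
  content?: PmNode[];
};

// Collect footnoteReference ids in document order, without duplicates.
function collectReferenceIds(node: PmNode, out: string[]): string[] {
  const id = node.attrs?.id;
  if (node.type === 'footnoteReference' && id && out.indexOf(id) === -1) {
    out.push(id);
  }
  for (const child of node.content ?? []) collectReferenceIds(child, out);
  return out;
}

function canonicalizeFootnotesSketch(doc: PmNode): PmNode {
  const body: PmNode[] = [];
  const defs: PmNode[] = [];
  for (const node of doc.content ?? []) {
    if (node.type !== 'footnotesList') {
      body.push(node);
      continue;
    }
    for (const def of node.content ?? []) {
      if (def.type === 'footnoteDefinition') defs.push(def);
    }
  }
  const refIds = collectReferenceIds({ type: 'doc', content: body }, []);
  // No references: the list is dropped (this is what loses a
  // definition-only fragment's footnote).
  if (refIds.length === 0) return { ...doc, content: body };
  // One definition per reference, in reference order; a missing definition
  // is synthesized empty (this is the bogus list for a reference-only fragment).
  const ordered = refIds.map((id): PmNode => {
    for (const def of defs) if (def.attrs?.id === id) return def;
    return { type: 'footnoteDefinition', attrs: { id }, content: [{ type: 'paragraph' }] };
  });
  return { ...doc, content: [...body, { type: 'footnotesList', content: ordered }] };
}
```

Under this model both failure modes come from treating a fragment as if it were the whole document, which is why the binding is per call site rather than inside `parseProsemirrorContent`.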

@@ -52,7 +52,7 @@ import {
INTERNAL_LINK_REGEX,
extractPageSlugId,
} from '../../../integrations/export/utils';
import { markdownToHtml } from '@docmost/editor-ext';
import { markdownToHtml, canonicalizeFootnotes } from '@docmost/editor-ext';
import { WatcherService } from '../../watcher/watcher.service';
import { sql } from 'kysely';
import { TransclusionService } from '../transclusion/transclusion.service';
@@ -160,9 +160,14 @@ export class PageService {
let ydoc = undefined;
if (createPageDto?.content && createPageDto?.format) {
const prosemirrorJson = await this.parseProsemirrorContent(
createPageDto.content,
createPageDto.format,
// createPage always writes a FULL document, so canonicalize footnotes to
// the editor's invariant before persisting (issue #228). Pure + idempotent
// + shape-safe: a doc with no footnotes is returned unchanged.
const prosemirrorJson = canonicalizeFootnotes(
await this.parseProsemirrorContent(
createPageDto.content,
createPageDto.format,
),
);
content = prosemirrorJson;
@@ -343,7 +348,17 @@ export class PageService {
format: ContentFormat,
user: User,
): Promise<void> {
const prosemirrorJson = await this.parseProsemirrorContent(content, format);
let prosemirrorJson = await this.parseProsemirrorContent(content, format);
// Canonicalize footnotes ONLY for a full-document write ('replace'). For an
// append/prepend FRAGMENT, canonicalizing is semantically wrong (it would
// drop a definition-only fragment's list, or synthesize a duplicate empty
// definition for a fragment reusing an existing id) — the fragment merges
// into the live doc where the editor's footnoteSyncPlugin keeps the invariant
// (issue #228, must-fix #1).
if (operation === 'replace') {
prosemirrorJson = canonicalizeFootnotes(prosemirrorJson);
}
const documentName = `page.${pageId}`;
await this.collaborationGateway.handleYjsEvent(
@@ -1301,6 +1316,24 @@ export class PageService {
}
}
// NOTE: footnote canonicalization is intentionally NOT done here. This
// method serves BOTH full writes (createPage / updatePageContent with
// operation 'replace') AND fragment writes (append / prepend). Canonicalizing
// a FRAGMENT is semantically wrong — e.g. a definition-only fragment has no
// references, so the canonicalizer would drop its whole footnotesList (lost
// footnotes), and a fragment reusing an existing id would synthesize an empty
// duplicate definition. The canonicalizer therefore runs only at the
// FULL-DOCUMENT callers (createPage, and updatePageContent for 'replace'),
// never on a fragment (issue #228, must-fix #1).
// (Future consolidation, architecture B: the import services persist via a
// different path; folding all of these into one "prepare JSON for persist"
// helper would centralize the canonicalize call — left as follow-up.)
//
// ENFORCEMENT RULE (#228): any NEW FULL-document persist path MUST call
// `canonicalizeFootnotes(json)` before writing (see createPage and
// updatePageContent 'replace'); append/prepend FRAGMENT writes MUST NOT (it
// would drop or duplicate footnotes — that is exactly why this is per-call-site
// rather than a single wrapper here).
try {
jsonToNode(prosemirrorJson);
} catch (err) {

@@ -0,0 +1,167 @@
import { PageRepo } from './page.repo';
import {
DummyDriver,
Kysely,
PostgresAdapter,
PostgresIntrospector,
PostgresQueryCompiler,
} from 'kysely';
/**
* F6 regression guard for the embeddable-page predicate.
*
* The predicate is shared by `countEmbeddablePages` (the "Indexed N of M" coverage
* denominator) and `getEmbeddablePageIds` (the exact set a full reindex iterates).
* It MUST select pages whose `text_content` was never backfilled (null/empty) but
* whose ProseMirror `content` JSON still carries body text — `reindexPage` builds
* its chunks straight from `content`, so without a content clause such a page is
* silently SKIPPED by a mass reindex even though it is fully embeddable.
*
* The content clause keys on the structural text-node marker `"type":"text"`, NOT
* a bare `"text":` key. The bare key also appears as the `attrs.text` of atom
* nodes that carry NO extractable text — notably math (`mathBlock`/`mathInline`),
* whose LaTeX lives in `attrs.text` and has no `generateText` serializer. A
* math-ONLY page therefore yields empty `text_content` and zero embeddings; if the
* predicate matched its `attrs.text` it would land in the denominator but
* `reindexPage` would no-op on it, pinning "Indexed N of M" below 100% forever —
* the exact bug this feature fixes. The `"type":"text"` marker matches only real
* text nodes (what `jsonToText` extracts), keeping the predicate consistent with
* what gets indexed.
*
* There is no real Postgres here: a recording Kysely (DummyDriver wired to the
* Postgres query compiler) compiles the queries to SQL so we can assert the WHERE
* predicate ORs in the narrowed content clause alongside the existing text_content
* and stored-embeddings clauses — and that BOTH callers compile the identical
* clause (denominator and reindex set can never diverge).
*/
function makeRecordingDb() {
const sqls: string[] = [];
const db = new Kysely<any>({
dialect: {
createAdapter: () => new PostgresAdapter(),
createDriver: () =>
new (class extends DummyDriver {
async acquireConnection() {
return {
executeQuery: async (compiled: { sql: string }) => {
sqls.push(compiled.sql);
return { rows: [] };
},
// eslint-disable-next-line @typescript-eslint/no-empty-function
streamQuery: async function* () {},
} as any;
}
})(),
createIntrospector: (d: Kysely<any>) => new PostgresIntrospector(d),
createQueryCompiler: () => new PostgresQueryCompiler(),
},
});
return { db, sqls };
}
// The narrowed content clause, as it appears in the compiled SQL. Keying on the
// structural `"type":"text"` marker (not a bare `"text":` key) is what excludes
// math-only pages whose only `"text"` key is the atom node's `attrs.text`.
const NARROWED_CLAUSE = `"type"[[:space:]]*:[[:space:]]*"text"`;
const BARE_TEXT_KEY = `"text"[[:space:]]*:`;
describe('PageRepo embeddable predicate — content-bearing pages (F6)', () => {
it('selects content-bearing pages via the narrowed "type":"text" node marker', async () => {
const { db, sqls } = makeRecordingDb();
const repo = new PageRepo(db as any, {} as any, { emit: jest.fn() } as any);
await repo.getEmbeddablePageIds('ws-1');
expect(sqls).toHaveLength(1);
const sql = sqls[0];
// Clause 1 (existing): pages with extractable text_content.
expect(sql).toContain('text_content');
// Clause 3 (the F6 fix, now narrowed): a page whose content JSON carries a
// real text node is selected even when text_content is null/empty, so a full
// reindex visits it instead of silently skipping it.
expect(sql).toContain('content::text');
expect(sql).toContain(NARROWED_CLAUSE);
// It must NOT use the old bare `"text":` key, which also matches the
// `attrs.text` of math-only atom pages (false-positive denominator inflation).
expect(sql).not.toContain(BARE_TEXT_KEY);
// Clause 2 (existing): pages that already have stored embeddings stay in the
// set so a reindex can clear their stale rows.
expect(sql.toLowerCase()).toContain('embeddings');
});
it('countEmbeddablePages compiles the SAME narrowed clause as getEmbeddablePageIds', async () => {
// Consistency is the core requirement: the denominator (countEmbeddablePages)
// and the reindex set (getEmbeddablePageIds) MUST share the identical
// predicate, else the live "done" counter and the steady-state total diverge.
const { db, sqls } = makeRecordingDb();
const repo = new PageRepo(db as any, {} as any, { emit: jest.fn() } as any);
await repo.countEmbeddablePages('ws-1');
await repo.getEmbeddablePageIds('ws-1');
expect(sqls).toHaveLength(2);
const [countSql, idsSql] = sqls;
// Both carry the narrowed content clause...
expect(countSql).toContain(NARROWED_CLAUSE);
expect(idsSql).toContain(NARROWED_CLAUSE);
// ...neither carries the bare key...
expect(countSql).not.toContain(BARE_TEXT_KEY);
expect(idsSql).not.toContain(BARE_TEXT_KEY);
// ...and the full OR predicate (text_content + content node + embeddings
// EXISTS) is byte-identical between the two queries, so they can't drift.
const where = (s: string) => s.slice(s.indexOf('where'));
expect(where(countSql)).toEqual(where(idsSql));
});
it('the content regex matches a text-bearing doc but NOT a math-only doc', () => {
// Semantic check of the predicate against sample `content::text` payloads.
// Note: `jsonb::text` is NOT identical to JSON.stringify — Postgres renders a
// space after each colon (`"type": "text"`), which is exactly why the POSIX
// clause uses `[[:space:]]*`. The clause `"type"[[:space:]]*:[[:space:]]*"text"`
// maps to the JS regex below (`[[:space:]]` -> `\s`, tolerating both forms);
// we evaluate it the way Postgres would.
const re = /"type"\s*:\s*"text"/;
// A real paragraph with a text node -> embeddable.
const textDoc = JSON.stringify({
type: 'doc',
content: [
{
type: 'paragraph',
content: [{ type: 'text', text: 'hello world' }],
},
],
});
// A doc whose ONLY node is a math atom. Its LaTeX is in `attrs.text`, there is
// no text node, and `jsonToText`/`generateText` has no serializer for it -> it
// yields empty text_content and zero embeddings, so it must NOT qualify.
const mathOnlyDoc = JSON.stringify({
type: 'doc',
content: [
{ type: 'mathBlock', attrs: { text: 'E = mc^2' } },
{ type: 'mathInline', attrs: { text: '\\alpha' } },
],
});
// An empty doc has no text node either.
const emptyDoc = JSON.stringify({ type: 'doc', content: [] });
expect(re.test(textDoc)).toBe(true);
expect(re.test(mathOnlyDoc)).toBe(false);
expect(re.test(emptyDoc)).toBe(false);
// Sanity: the OLD bare-key regex WOULD have wrongly matched the math-only doc,
// which is precisely the false positive the narrowing removes.
expect(/"text"\s*:/.test(mathOnlyDoc)).toBe(true);
// A user literally TYPING `"type":"text"` in prose can't false-positive on an
// otherwise text-less page: in `content::text` the typed value's quotes are
// escaped (`\"type\":\"text\"`), so the literal-quote regex does not match the
// escaped form. (And a page with typed prose contains a genuine text node anyway.)
const escapedLiteral = JSON.stringify({
type: 'doc',
content: [{ type: 'someAtom', attrs: { note: '"type":"text"' } }],
});
expect(re.test(escapedLiteral)).toBe(false);
});
});
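
Review note: the spec above checks the compiled SQL and the regex separately. As a reading aid, the three-clause predicate can also be modelled over plain objects. This is a hypothetical in-memory equivalent, not code from the PR: the `PageRow` shape and the name `isEmbeddable` are invented, and JavaScript `\s` stands in for POSIX `[[:space:]]` as in the test above (the two classes are close but not guaranteed identical for every Unicode whitespace character).

```typescript
// Hypothetical in-memory model of the shared predicate. The real one is
// compiled to SQL by Kysely; field names here are assumptions.
interface PageRow {
  textContent: string | null;
  contentText: string | null; // what `content::text` would render
  hasLiveEmbedding: boolean; // a non-deleted embedding row exists
}

const HAS_NON_SPACE = /\S/; // models p.text_content ~ '[^[:space:]]'
const HAS_TEXT_NODE = /"type"\s*:\s*"text"/; // models the narrowed content clause

function isEmbeddable(page: PageRow): boolean {
  return (
    // Clause 1: extractable body text.
    (page.textContent !== null && HAS_NON_SPACE.test(page.textContent)) ||
    // Clause 3 (F6): content JSON carries a real text node.
    (page.contentText !== null && HAS_TEXT_NODE.test(page.contentText)) ||
    // Clause 2: already has stored embeddings (stale rows still get visited).
    page.hasLiveEmbedding
  );
}
```

A math-only page fails all three clauses, so it stays out of both the denominator and the reindex set, which is the property the F6 fix depends on.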

@@ -12,6 +12,7 @@ import { executeWithCursorPagination } from '@docmost/db/pagination/cursor-pagin
import { validate as isValidUUID } from 'uuid';
import { ExpressionBuilder, sql } from 'kysely';
import { DB } from '@docmost/db/types/db';
import { DbInterface } from '@docmost/db/types/db.interface';
import { jsonArrayFrom, jsonObjectFrom } from 'kysely/helpers/postgres';
import { SpaceMemberRepo } from '@docmost/db/repos/space/space-member.repo';
import { EventEmitter2 } from '@nestjs/event-emitter';
@@ -233,9 +234,9 @@ export class PageRepo {
* text-less pages (which legitimately store zero embeddings) don't keep the
* bar below 100% forever.
*
* A page qualifies if it has non-empty textContent OR already has stored
* embeddings. The second clause covers pages whose text the indexer extracted
* from the content JSON when textContent was null, and guarantees this total is
* A page qualifies if it has non-empty textContent, OR its content JSON has at
* least one text node (`"type":"text"`) when textContent was never backfilled,
* OR it already has stored embeddings. The last clause guarantees this total is
* always >= countIndexedPages (the indexed count can never exceed it).
*/
async countEmbeddablePages(workspaceId: string): Promise<number> {
@@ -243,37 +244,91 @@ export class PageRepo {
.selectFrom('pages as p')
.where('p.workspaceId', '=', workspaceId)
.where('p.deletedAt', 'is', null)
.where((eb) =>
eb.or([
// Has extractable body text. The regex matches any non-whitespace
// character, mirroring the indexer's `text.trim().length === 0` check
// (raw SQL -> use the snake_case column name).
sql<boolean>`p.text_content ~ '[^[:space:]]'`,
// OR already has at least one (non-deleted) embedding row.
eb.exists(
eb
.selectFrom('pageEmbeddings as pe')
.select(sql`1`.as('one'))
.whereRef('pe.pageId', '=', 'p.id')
.where('pe.deletedAt', 'is', null),
),
]),
)
.where((eb) => this.embeddablePredicate(eb))
.select((eb) => eb.fn.countAll().as('count'))
.executeTakeFirst();
return Number(row?.count ?? 0);
}
/**
* IDs of all non-deleted pages in a workspace. Used by the RAG bulk reindex to
* (re)build embeddings for every existing page.
* The "embeddable content" qualifying predicate, shared verbatim by
* countEmbeddablePages (the steady-state denominator) and getEmbeddablePageIds
* (the set the bulk reindex iterates). Both MUST use the exact same condition
* or the live total and steady-state total diverge — extracting it here is what
* guarantees that, replacing the previous hand-duplicated copy. Callers supply
* the trivial workspaceId/deletedAt filters inline; this returns only the
* non-trivial OR clause, evaluated against the `p` alias of `pages`.
*
* A page qualifies if it has non-empty textContent, OR its ProseMirror
* `content` JSON has at least one text node (`"type":"text"`) even though
* textContent was never backfilled, OR it already has a stored (non-deleted)
* embedding row.
*/
async getIdsByWorkspace(workspaceId: string): Promise<string[]> {
private embeddablePredicate(
eb: ExpressionBuilder<DbInterface & { p: DbInterface['pages'] }, 'p'>,
) {
return eb.or([
// Has extractable body text. The regex matches any non-whitespace
// character, mirroring the indexer's `text.trim().length === 0` check
// (raw SQL -> use the snake_case column name).
sql<boolean>`p.text_content ~ '[^[:space:]]'`,
// OR the ProseMirror `content` JSON has at least one text node
// (`"type":"text"`) the indexer can extract, even when `text_content` is
// null/empty (never backfilled): `reindexPage` runs `jsonToText`
// (generateText) over `content`, which only emits the text of ProseMirror
// text nodes, so such a page IS embeddable and a full reindex MUST visit it
// (otherwise it is silently skipped). A text node always serialises as
// `{"type":"text","text":"..."}`, so we key on the structural
// `"type":"text"` marker, NOT a bare `"text":` key, which also appears as
// the `attrs.text` of atom nodes that carry NO extractable text (e.g. math
// `mathBlock`/`mathInline`, whose LaTeX lives in `attrs.text` and has no
// text serializer). A math-only page thus produces empty `text_content` and
// zero embeddings; matching its `attrs.text` here would wrongly inflate the
// denominator and keep "Indexed N of M" below 100% forever. An empty doc
// (no text nodes) has no `"type":"text"` and is correctly excluded. A user
// who literally types `"type":"text"` in their prose can't false-positive:
// in `content::text` that text value's quotes are escaped (`\"type\"...`),
// so the literal-quote regex won't match the escaped form (and such a page
// is a real text node anyway).
sql<boolean>`p.content::text ~ '"type"[[:space:]]*:[[:space:]]*"text"'`,
// OR already has at least one (non-deleted) embedding row.
eb.exists(
eb
.selectFrom('pageEmbeddings as pe')
.select(sql`1`.as('one'))
.whereRef('pe.pageId', '=', 'p.id')
.where('pe.deletedAt', 'is', null),
),
]);
}
/**
* IDs of the EMBEDDABLE page set for a workspace: the exact same set that
* `countEmbeddablePages` counts (a page qualifies if it has non-empty
* textContent, OR its content JSON has at least one text node (`"type":"text"`)
* even when textContent is empty/null, OR it already has a stored embedding
* row). The bulk reindex iterates THIS set so the live "done" counter reaches
* exactly `countEmbeddablePages` (the steady-state denominator), instead of
* iterating every non-deleted page (which would push the denominator above the
* steady-state value mid-run).
*
* IMPORTANT: the qualifying WHERE is shared with `countEmbeddablePages` via the
* private `embeddablePredicate` helper, so the two can no longer drift. If the
* embeddable definition changes, change it once there and both stay in lockstep
* (else the live total and steady-state total diverge again). Dropping
* text-less pages is correct: `reindexPage` no-ops on a page with no
* extractable content anyway, and a page that lost its text but still has stale
* embeddings IS in this set (the EXISTS clause), so it is still visited and its
* stale rows are cleared.
*/
async getEmbeddablePageIds(workspaceId: string): Promise<string[]> {
const rows = await this.db
.selectFrom('pages')
.select('id')
.where('workspaceId', '=', workspaceId)
.where('deletedAt', 'is', null)
.selectFrom('pages as p')
.select('p.id')
.where('p.workspaceId', '=', workspaceId)
.where('p.deletedAt', 'is', null)
.where((eb) => this.embeddablePredicate(eb))
.execute();
return rows.map((r) => r.id);
}

@@ -1,4 +1,12 @@
import { parsePositiveInt } from './ai-settings.service';
import { AiSettingsService, parsePositiveInt } from './ai-settings.service';
import { WorkspaceRepo } from '@docmost/db/repos/workspace/workspace.repo';
import { AiAgentRoleRepo } from '@docmost/db/repos/ai-agent-roles/ai-agent-roles.repo';
import { AiProviderCredentialsRepo } from '@docmost/db/repos/ai-chat/ai-provider-credentials.repo';
import { PageEmbeddingRepo } from '@docmost/db/repos/ai-chat/page-embedding.repo';
import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { SecretBoxService } from '../crypto/secret-box';
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
import type { Queue } from 'bullmq';
/**
* Round-trip coercion for numeric `::text` provider settings (e.g.
@@ -41,3 +49,194 @@ describe('parsePositiveInt', () => {
expect(parsePositiveInt(42)).toBe(42);
});
});
/**
* getMasked must surface the LIVE reindex run progress while a reindex is active
* (so the "Indexed X of Y" counter can climb 0 -> total), and fall back to the
* steady-state DB coverage count (countIndexedPages / countEmbeddablePages) when
* no reindex is running. This is the server side of the fix for the counter that
* otherwise stays stuck at "478 of 478" the whole reindex.
*/
describe('AiSettingsService.getMasked reindex progress', () => {
const WORKSPACE_ID = 'ws-1';
function makeService() {
// No driver configured -> the credentials lookup is skipped, keeping the
// setup minimal; we only care about the indexed/total numbers here.
const workspaceRepo = {
findById: jest.fn().mockResolvedValue({ settings: {} }),
};
const aiAgentRoleRepo = {};
const aiProviderCredentialsRepo = { find: jest.fn() };
const pageEmbeddingRepo = {
countIndexedPages: jest.fn().mockResolvedValue(478),
};
const pageRepo = {
countEmbeddablePages: jest.fn().mockResolvedValue(478),
};
const secretBox = {};
const reindexProgress = {
get: jest.fn().mockResolvedValue(null),
};
const aiQueue = {};
const service = new AiSettingsService(
workspaceRepo as unknown as WorkspaceRepo,
aiAgentRoleRepo as unknown as AiAgentRoleRepo,
aiProviderCredentialsRepo as unknown as AiProviderCredentialsRepo,
pageEmbeddingRepo as unknown as PageEmbeddingRepo,
pageRepo as unknown as PageRepo,
secretBox as unknown as SecretBoxService,
reindexProgress as unknown as EmbeddingReindexProgressService,
aiQueue as unknown as Queue,
);
return { service, reindexProgress, pageEmbeddingRepo };
}
it('reports the live run numbers when a reindex progress record is active', async () => {
const { service, reindexProgress } = makeService();
// Use a progress.total (500) DISTINCT from the DB count (478) so the test
// actually pins the progress.total branch rather than coincidentally
// matching the DB fallback. With fix #1 the two sources agree in practice,
// but getMasked must still return progress.total when a record is active.
reindexProgress.get.mockResolvedValue({
total: 500,
done: 120,
startedAt: Date.now(),
});
const masked = await service.getMasked(WORKSPACE_ID);
expect(masked.indexedPages).toBe(120); // progress.done, not DB 478
expect(masked.totalPages).toBe(500); // progress.total, not DB 478
expect(masked.reindexing).toBe(true);
});
it('falls back to countIndexedPages when no reindex is active', async () => {
const { service, reindexProgress } = makeService();
reindexProgress.get.mockResolvedValue(null);
const masked = await service.getMasked(WORKSPACE_ID);
expect(masked.indexedPages).toBe(478);
expect(masked.totalPages).toBe(478);
expect(masked.reindexing).toBe(false);
});
});
/**
* reindex() must seed a live progress record (done=0) BEFORE enqueueing so the
* first status poll shows 0 — but ONLY when no run is already active, since
* aiQueue.add() de-duplicates a running reindex and a re-seed would reset the
* visible counter to 0 while the live worker keeps incrementing from its real
* position.
*/
describe('AiSettingsService.reindex progress seed', () => {
const WORKSPACE_ID = 'ws-1';
function makeService() {
const order: string[] = [];
const aiQueue = {
remove: jest.fn().mockResolvedValue(undefined),
add: jest.fn().mockImplementation(async () => {
order.push('add');
}),
};
const pageRepo = {
countEmbeddablePages: jest.fn().mockResolvedValue(478),
};
const reindexProgress = {
// Default: no active run -> seed should happen.
get: jest.fn().mockResolvedValue(null),
start: jest.fn().mockImplementation(async () => {
order.push('start');
}),
clear: jest.fn().mockResolvedValue(undefined),
};
const service = new AiSettingsService(
{} as unknown as WorkspaceRepo,
{} as unknown as AiAgentRoleRepo,
{} as unknown as AiProviderCredentialsRepo,
{} as unknown as PageEmbeddingRepo,
pageRepo as unknown as PageRepo,
{} as unknown as SecretBoxService,
reindexProgress as unknown as EmbeddingReindexProgressService,
aiQueue as unknown as Queue,
);
return { service, aiQueue, pageRepo, reindexProgress, order };
}
it('seeds progress (workspace, count) BEFORE enqueue when no run is active', async () => {
const { service, aiQueue, reindexProgress, order } = makeService();
await service.reindex(WORKSPACE_ID);
// The pre-seed carries the real page count AND a SHORT ttl (3rd arg) so a
// de-duplicated enqueue against a just-finishing job can't leave a phantom
// "reindexing: 0 of N" stuck for the full record TTL (F10).
expect(reindexProgress.start).toHaveBeenCalledWith(
WORKSPACE_ID,
478,
expect.any(Number),
);
const ttl = reindexProgress.start.mock.calls[0][2];
expect(ttl).toBeGreaterThan(0);
expect(ttl).toBeLessThanOrEqual(60); // short, not the full 1h record TTL
expect(aiQueue.add).toHaveBeenCalledTimes(1);
// Seed must precede the enqueue so the first poll already reports done=0.
expect(order).toEqual(['start', 'add']);
});
it('does NOT re-seed when a run is already active (mid-run re-trigger)', async () => {
const { service, aiQueue, reindexProgress } = makeService();
// An active record exists -> a second click must not reset the counter.
reindexProgress.get.mockResolvedValue({
total: 478,
done: 120,
startedAt: Date.now(),
});
await service.reindex(WORKSPACE_ID);
expect(reindexProgress.start).not.toHaveBeenCalled();
// The enqueue still runs (and de-duplicates against the active job).
expect(aiQueue.add).toHaveBeenCalledTimes(1);
});
it('clears the seed it just wrote and re-throws when enqueue fails', async () => {
const { service, aiQueue, reindexProgress } = makeService();
// This call seeds (get() is null) but the enqueue then blows up
// (Redis hiccup/shutdown) -> the worker never runs and never clear()s, so
// reindex() must roll back its own seed to avoid a 1h stuck "reindexing".
const boom = new Error('redis down');
aiQueue.add.mockRejectedValue(boom);
await expect(service.reindex(WORKSPACE_ID)).rejects.toBe(boom);
expect(reindexProgress.start).toHaveBeenCalledWith(
WORKSPACE_ID,
478,
expect.any(Number),
);
expect(reindexProgress.clear).toHaveBeenCalledWith(WORKSPACE_ID);
});
it('does NOT clear a concurrent active run when enqueue fails (no seed)', async () => {
const { service, aiQueue, reindexProgress } = makeService();
// A run is already active, so THIS call does not seed; if the enqueue then
// fails it must NOT wipe the live worker's record.
reindexProgress.get.mockResolvedValue({
total: 478,
done: 120,
startedAt: Date.now(),
});
const boom = new Error('redis down');
aiQueue.add.mockRejectedValue(boom);
await expect(service.reindex(WORKSPACE_ID)).rejects.toBe(boom);
expect(reindexProgress.start).not.toHaveBeenCalled();
expect(reindexProgress.clear).not.toHaveBeenCalled();
});
});

@@ -8,6 +8,7 @@ import { AiProviderCredentialsRepo } from '@docmost/db/repos/ai-chat/ai-provider
import { PageEmbeddingRepo } from '@docmost/db/repos/ai-chat/page-embedding.repo';
import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { SecretBoxService } from '../crypto/secret-box';
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
import {
AiDriver,
AiProviderSettings,
@@ -30,6 +31,17 @@ export function parsePositiveInt(raw: unknown): number | undefined {
return Number.isFinite(n) && n > 0 ? Math.floor(n) : undefined;
}
/**
* TTL (seconds) for the enqueue-time progress PRE-SEED written by `reindex()`
* before the worker starts. Deliberately SHORT: if `aiQueue.add()` de-duplicates
* against a job that is just finishing (the worker's finally already ran
* `clear()` but removeOnComplete hasn't yet removed the job), no new worker runs
* to overwrite/clear this seed — so a short TTL lets the phantom "reindexing:
* 0 of N" expire in seconds instead of sticking for the full 1h record TTL. A
* worker that DOES start re-seeds with the full TTL, so a real run is unaffected.
*/
const PRE_SEED_TTL_SECONDS = 45;
/**
* Shape of the partial update accepted by `update`. Mirrors the validated
* controller DTO. `apiKey` / `embeddingApiKey` are write-only: undefined =
@@ -74,6 +86,7 @@ export class AiSettingsService {
private readonly pageEmbeddingRepo: PageEmbeddingRepo,
private readonly pageRepo: PageRepo,
private readonly secretBox: SecretBoxService,
private readonly reindexProgress: EmbeddingReindexProgressService,
@InjectQueue(QueueName.AI_QUEUE) private readonly aiQueue: Queue,
) {}
@@ -100,21 +113,60 @@ export class AiSettingsService {
.remove(`ai-search-disabled-${workspaceId}`)
.catch(() => undefined);
// Seed a live progress record BEFORE enqueueing so the very first status
// poll already reports done=0 (the reindex POST returns the PRE-job counts,
// so without this seed the first poll would still show "total of total").
// `totalPages` uses countEmbeddablePages — the SAME set the worker iterates
// and the SAME denominator the status endpoint reports, so the live and
// steady-state totals match.
//
// ONLY seed when no run is active: aiQueue.add() de-duplicates an already-
// running reindex, so a mid-run re-trigger (second click / second admin /
// second tab) must NOT reset the visible counter to 0 — that would
// understate the live worker's real position for the rest of the run. The
// worker's own start() at run begin is the single authoritative reset.
let seeded = false;
if ((await this.reindexProgress.get(workspaceId)) === null) {
const totalPages = await this.pageRepo.countEmbeddablePages(workspaceId);
// Short TTL: if add() below de-duplicates against a just-finishing job
// whose worker already clear()ed but isn't removed yet, no worker runs to
// clear this seed — the short TTL expires the phantom record in seconds
// rather than leaving a stuck "reindexing: 0 of N" for the full record TTL.
await this.reindexProgress.start(
workspaceId,
totalPages,
PRE_SEED_TTL_SECONDS,
);
seeded = true;
}
const jobId = `ai-reindex-${workspaceId}`;
// Clear a prior non-active entry so a stale job can't block this reindex.
// A locked/active job is left in place (remove() no-ops) and the add() below
// de-duplicates against it, keeping the in-progress pass.
await this.aiQueue.remove(jobId).catch(() => undefined);
await this.aiQueue.add(
QueueJob.WORKSPACE_CREATE_EMBEDDINGS,
{ workspaceId },
{
jobId,
removeOnComplete: true,
removeOnFail: true,
},
);
try {
await this.aiQueue.add(
QueueJob.WORKSPACE_CREATE_EMBEDDINGS,
{ workspaceId },
{
jobId,
removeOnComplete: true,
removeOnFail: true,
},
);
} catch (err) {
// If the enqueue fails (Redis hiccup/shutdown) the worker never runs, so
// its finally->clear() never fires. Roll back the seed WE just wrote so
// the status endpoint doesn't report a stuck "reindexing: 0 of N" for the
// full TTL. Only clear when this call did the seed — never wipe a
// concurrent active run's record (get() was non-null, seeded=false).
if (seeded) {
await this.reindexProgress.clear(workspaceId);
}
throw err;
}
}
/**
@@ -253,13 +305,33 @@ export class AiSettingsService {
hasSttApiKey = !!creds?.sttApiKeyEnc;
}
// totalPages now counts only pages with embeddable content (non-empty text
// or already-stored embeddings), so empty/text-less pages don't keep the
// "Indexed N of M pages" bar below 100% forever.
const [indexedPages, totalPages] = await Promise.all([
this.pageEmbeddingRepo.countIndexedPages(workspaceId),
this.pageRepo.countEmbeddablePages(workspaceId),
]);
// While a reindex run is active, report its LIVE progress (done climbs 0 ->
// total) so the settings UI can watch it advance. Read progress FIRST and
// short-circuit: this endpoint is polled every ~5s for the whole run, so when
// a record is active we skip the two coverage COUNTs entirely (their results
// would be discarded anyway). Without the live progress the counter never
// drops: the per-page reindex hard-replaces rows in its own small
// transaction, so countIndexedPages stays ~= total for the whole run. With no
// active record we fall back to the steady-state DB coverage count, which
// preserves the existing display and the client's "done == total -> stop
// polling" condition (the run ends -> record cleared -> DB count == total).
//
// The fallback `totalPages` counts only pages with embeddable content
// (non-empty text, content-borne text, or already-stored embeddings), so
// empty/text-less pages don't keep the "Indexed N of M pages" bar below 100%
// forever.
const progress = await this.reindexProgress.get(workspaceId);
let indexedPages: number;
let totalPages: number;
if (progress) {
indexedPages = progress.done;
totalPages = progress.total;
} else {
[indexedPages, totalPages] = await Promise.all([
this.pageEmbeddingRepo.countIndexedPages(workspaceId),
this.pageRepo.countEmbeddablePages(workspaceId),
]);
}
return {
driver: provider.driver,
@@ -281,6 +353,8 @@ export class AiSettingsService {
hasSttApiKey,
indexedPages,
totalPages,
// Optional hint for the client: a reindex run is currently in progress.
reindexing: progress != null,
};
}
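
The seed, enqueue, and rollback ordering in `reindex()` can be sketched in isolation as below. This is an illustration under stated assumptions, not the service's code: `ProgressStore`, `MemoryStore`, and `seedThenEnqueue` are hypothetical names, the store is an in-memory map rather than Redis, the TTL argument is accepted but not enforced, and `enqueue` is any async callback standing in for the queue's `add()`.

```typescript
interface Progress {
  total: number;
  done: number;
}

// Hypothetical minimal port; the real store is Redis-backed.
interface ProgressStore {
  get(ws: string): Promise<Progress | null>;
  start(ws: string, total: number, ttlSeconds: number): Promise<void>;
  clear(ws: string): Promise<void>;
}

// In-memory stand-in that ignores the TTL (assumption for the sketch only).
class MemoryStore implements ProgressStore {
  private readonly records = new Map<string, Progress>();

  async get(ws: string): Promise<Progress | null> {
    return this.records.get(ws) ?? null;
  }

  async start(ws: string, total: number, _ttlSeconds: number): Promise<void> {
    this.records.set(ws, { total, done: 0 });
  }

  async clear(ws: string): Promise<void> {
    this.records.delete(ws);
  }
}

async function seedThenEnqueue(
  ws: string,
  store: ProgressStore,
  countPages: () => Promise<number>,
  enqueue: () => Promise<void>,
  preSeedTtlSeconds = 45,
): Promise<void> {
  // Seed only when no run is active, so a re-trigger never resets the counter.
  let seeded = false;
  if ((await store.get(ws)) === null) {
    await store.start(ws, await countPages(), preSeedTtlSeconds);
    seeded = true;
  }
  try {
    await enqueue();
  } catch (err) {
    // Roll back only the seed written by this call, never another run's record.
    if (seeded) await store.clear(ws);
    throw err;
  }
}
```

Tracking `seeded` locally is what lets the error path distinguish its own seed from a concurrent run's record without a second read.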

@@ -5,6 +5,7 @@ import { QueueName } from '../queue/constants';
import { AiService } from './ai.service';
import { AiSettingsService } from './ai-settings.service';
import { AiSettingsController } from './ai-settings.controller';
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
/**
* LLM driver + provider-settings unit (§6.2/§6.4).
@@ -19,7 +20,7 @@ import { AiSettingsController } from './ai-settings.controller';
BullModule.registerQueue({ name: QueueName.AI_QUEUE }),
],
controllers: [AiSettingsController],
providers: [AiService, AiSettingsService],
exports: [AiService, AiSettingsService],
providers: [AiService, AiSettingsService, EmbeddingReindexProgressService],
exports: [AiService, AiSettingsService, EmbeddingReindexProgressService],
})
export class AiModule {}

@@ -146,4 +146,7 @@ export interface MaskedAiSettings {
// RAG indexing coverage for the settings UI.
indexedPages: number;
totalPages: number;
// True while a full workspace reindex is actively running (the counts above
// then reflect the live run progress rather than the steady-state DB count).
reindexing?: boolean;
}
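
How the three fields relate can be sketched as a pure selection function. This is an assumed simplification of the status path: `resolveCoverage`, `LiveProgress`, and `Coverage` are hypothetical names, and `loadDbCounts` stands in for the two coverage COUNT queries.

```typescript
interface LiveProgress {
  total: number;
  done: number;
}

interface Coverage {
  indexedPages: number;
  totalPages: number;
  reindexing: boolean;
}

// Prefer the live run record; only hit the database when no run is active.
async function resolveCoverage(
  progress: LiveProgress | null,
  loadDbCounts: () => Promise<[number, number]>,
): Promise<Coverage> {
  if (progress) {
    return {
      indexedPages: progress.done,
      totalPages: progress.total,
      reindexing: true,
    };
  }
  const [indexedPages, totalPages] = await loadDbCounts();
  return { indexedPages, totalPages, reindexing: false };
}
```

Passing the counts as a callback keeps the short-circuit visible: while a record is active the callback is never invoked.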

@@ -0,0 +1,179 @@
import { EmbeddingReindexProgressService } from './embedding-reindex-progress.service';
import type { RedisService } from '@nestjs-labs/nestjs-ioredis';
import type { Redis } from 'ioredis';
/**
* Unit tests for the Redis-backed reindex-progress store.
*
* The store is a thin, BEST-EFFORT wrapper: writes (start/increment) issue an
* hset/hincrby + expire pipeline and must SWALLOW Redis errors (progress is
* cosmetic — it must never break a reindex); reads (get) must map a valid hash
* to a ReindexProgress and degrade to null on a malformed/missing record or a
* Redis failure. We drive it with a hand-rolled fake ioredis (the project mocks
* Redis with plain fakes, see public-share limiter specs).
*/
describe('EmbeddingReindexProgressService', () => {
const WORKSPACE_ID = 'ws-1';
const KEY = 'ai:reindex:progress:ws-1';
/**
* Build a fake ioredis whose `multi()` returns a chainable recorder and whose
* `hgetall`/`del` are configurable jest mocks. `execImpl` lets a test make the
* pipeline reject (to assert error-swallowing).
*/
function makeRedis(opts: { execImpl?: () => Promise<unknown> } = {}) {
const exec = jest
.fn()
.mockImplementation(opts.execImpl ?? (() => Promise.resolve([])));
// mockReturnThis() returns the call's `this` (the multi object), so the
// chain hset().expire().exec() resolves correctly.
const multiObj = {
hset: jest.fn().mockReturnThis(),
hincrby: jest.fn().mockReturnThis(),
expire: jest.fn().mockReturnThis(),
exec,
};
const multi = jest.fn(() => multiObj);
const hgetall = jest.fn().mockResolvedValue({});
const del = jest.fn().mockResolvedValue(1);
const redis = { multi, hgetall, del } as unknown as Redis;
return { redis, multiObj, multi, hgetall, del, exec };
}
function makeService(redis: Redis) {
const redisService = {
getOrThrow: () => redis,
} as unknown as RedisService;
return new EmbeddingReindexProgressService(redisService);
}
describe('get', () => {
it('maps a valid hash to a ReindexProgress object', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: '478', done: '120', startedAt: '1000' });
const service = makeService(redis);
await expect(service.get(WORKSPACE_ID)).resolves.toEqual({
total: 478,
done: 120,
startedAt: 1000,
});
expect(hgetall).toHaveBeenCalledWith(KEY);
});
it('returns null for an empty hash (no record)', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({});
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('returns null when `total` is missing (partial record)', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ done: '5' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('returns null for a non-numeric total', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: 'abc', done: '1', startedAt: '1' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('returns null for a non-numeric done', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: '10', done: 'xyz', startedAt: '1' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
it('coerces a non-finite startedAt to 0', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockResolvedValue({ total: '10', done: '2', startedAt: 'nope' });
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toEqual({
total: 10,
done: 2,
startedAt: 0,
});
});
it('degrades to null when hgetall throws (degradation contract)', async () => {
const { redis, hgetall } = makeRedis();
hgetall.mockRejectedValue(new Error('redis down'));
await expect(makeService(redis).get(WORKSPACE_ID)).resolves.toBeNull();
});
});
describe('start', () => {
it('issues hset + expire on the workspace key', async () => {
const { redis, multiObj } = makeRedis();
await makeService(redis).start(WORKSPACE_ID, 478);
expect(multiObj.hset).toHaveBeenCalledWith(
KEY,
expect.objectContaining({ total: '478', done: '0' }),
);
expect(multiObj.expire).toHaveBeenCalledWith(KEY, expect.any(Number));
expect(multiObj.exec).toHaveBeenCalledTimes(1);
});
it('defaults the expire TTL to the full 1h record TTL', async () => {
const { redis, multiObj } = makeRedis();
await makeService(redis).start(WORKSPACE_ID, 478);
// Default ttl = full record TTL (60 * 60) so a real run never expires
// mid-flight before the worker refreshes it on each increment.
expect(multiObj.expire).toHaveBeenCalledWith(KEY, 60 * 60);
});
it('honours an explicit short ttlSeconds for the enqueue-time pre-seed (F10)', async () => {
const { redis, multiObj } = makeRedis();
// The reindex() pre-seed passes a short ttl so a phantom record left by a
// de-duplicated enqueue expires in seconds, not after the full 1h TTL.
await makeService(redis).start(WORKSPACE_ID, 478, 45);
expect(multiObj.expire).toHaveBeenCalledWith(KEY, 45);
});
it('swallows a thrown Redis error (best-effort)', async () => {
const { redis } = makeRedis({
execImpl: () => Promise.reject(new Error('redis down')),
});
await expect(
makeService(redis).start(WORKSPACE_ID, 1),
).resolves.toBeUndefined();
});
});
describe('increment', () => {
it('issues hincrby + expire on the workspace key', async () => {
const { redis, multiObj } = makeRedis();
await makeService(redis).increment(WORKSPACE_ID);
expect(multiObj.hincrby).toHaveBeenCalledWith(KEY, 'done', 1);
expect(multiObj.expire).toHaveBeenCalledWith(KEY, expect.any(Number));
expect(multiObj.exec).toHaveBeenCalledTimes(1);
});
it('swallows a thrown Redis error (best-effort)', async () => {
const { redis } = makeRedis({
execImpl: () => Promise.reject(new Error('redis down')),
});
await expect(
makeService(redis).increment(WORKSPACE_ID),
).resolves.toBeUndefined();
});
});
describe('clear', () => {
it('deletes the workspace key', async () => {
const { redis, del } = makeRedis();
await makeService(redis).clear(WORKSPACE_ID);
expect(del).toHaveBeenCalledWith(KEY);
});
it('swallows a thrown Redis error (best-effort)', async () => {
const { redis, del } = makeRedis();
del.mockRejectedValue(new Error('redis down'));
await expect(
makeService(redis).clear(WORKSPACE_ID),
).resolves.toBeUndefined();
});
});
});

@@ -0,0 +1,162 @@
import { Injectable, Logger } from '@nestjs/common';
import { RedisService } from '@nestjs-labs/nestjs-ioredis';
import type { Redis } from 'ioredis';
/**
* Live progress of an in-flight workspace embeddings reindex run.
* `total` is the number of pages the run will process, `done` how many it has
* already processed (success OR handled failure), `startedAt` the epoch-ms the
* record was created.
*/
export interface ReindexProgress {
total: number;
done: number;
startedAt: number;
}
/** Redis key namespace for the per-workspace reindex-progress record. */
const KEY_PREFIX = 'ai:reindex:progress:';
/**
* TTL (seconds) on the progress record so a crashed/aborted worker that never
* reaches its `clear()` finally can still self-clean instead of leaving a stuck
* "reindexing" state. Refreshed on every increment so a long run never expires
* mid-flight; on a crash it disappears within TTL of the last processed page.
*
* INTENTIONALLY tied to WRITE progress (start/increment) only — never refreshed
* on get(). Refreshing on read would keep a dead worker's record alive forever
* as long as a client keeps polling (a permanently stuck reindexing:true). The
* clear() in the worker's finally handles normal completion; a dead worker's
* record expires after TTL, and the client's own poll cap stops polling anyway.
*/
const TTL_SECONDS = 60 * 60; // 1h
/**
* Cluster-wide store for the live progress of a workspace embeddings reindex.
*
* The reindex runs in a BullMQ worker (AI_QUEUE) that may be a DIFFERENT process
* than the API handling the settings-status GET, so the progress must live in
* the shared Redis — we reuse the same global ioredis client (RedisService from
* @nestjs-labs/nestjs-ioredis) that backs BullMQ and the other anti-abuse
* limiters, adding NO new Redis config.
*
* Everything here is best-effort and COSMETIC: progress only drives the "Indexed
* X of Y" counter while a reindex is running. Any Redis failure degrades to the
* existing steady-state behaviour (the status falls back to the DB coverage
* count), so reads fail to `null` and writes are swallowed — a reindex must
* never break because progress reporting did.
*
* Stored as a Redis HASH so `done` can be bumped with an atomic HINCRBY (the
* worker is the only writer of `done`, but HINCRBY also keeps us off a
* read-modify-write race and preserves the other fields).
*/
@Injectable()
export class EmbeddingReindexProgressService {
private readonly logger = new Logger(EmbeddingReindexProgressService.name);
private readonly redis: Redis;
constructor(redisService: RedisService) {
this.redis = redisService.getOrThrow();
}
private key(workspaceId: string): string {
return KEY_PREFIX + workspaceId;
}
/**
* Begin (or reset) the progress record for a workspace: `total` pages, `done`
* back to 0, `startedAt` now. Called twice for a run, BOTH with the real page
* count (countEmbeddablePages) so the two totals coincide: once at reindex
* enqueue time (so the very first status poll already reports done=0) and again
* at the worker start (which re-asserts the same total and resets `done`).
* Resets `done` to 0 so a re-trigger never inherits a stale count.
*
* `ttlSeconds` lets the caller pick the record's lifetime. The enqueue-time
* pre-seed passes a SHORT ttl: if `aiQueue.add()` de-duplicates against a job
* that is just finishing (its worker hasn't yet removed the job but already
* ran its `clear()`), no new worker starts to clear this phantom seed, so a
* short ttl lets it expire in seconds instead of sticking for the full TTL.
* The worker's own `start()` at the begin of a real run overwrites this entry
* and raises the ttl back to the default full TTL.
*/
async start(
workspaceId: string,
total: number,
ttlSeconds: number = TTL_SECONDS,
): Promise<void> {
const key = this.key(workspaceId);
try {
await this.redis
.multi()
.hset(key, {
total: String(total),
done: '0',
startedAt: String(Date.now()),
})
.expire(key, ttlSeconds)
.exec();
} catch (err) {
this.logger.warn(
`reindex-progress start failed for workspace ${workspaceId}; ` +
`progress reporting disabled for this run: ${(err as Error).message}`,
);
}
}
/**
* Bump the processed-page counter by one and refresh the TTL. Atomic and
* best-effort: a missing key (cleared/expired) would be recreated with only
* `done`, but `get()` treats a record without a numeric `total` as inactive,
* so that partial state safely reads as "no active reindex".
*/
async increment(workspaceId: string): Promise<void> {
const key = this.key(workspaceId);
try {
await this.redis.multi().hincrby(key, 'done', 1).expire(key, TTL_SECONDS).exec();
} catch (err) {
this.logger.warn(
`reindex-progress increment failed for workspace ${workspaceId}: ` +
`${(err as Error).message}`,
);
}
}
/**
* Remove the progress record. Called in the worker's `finally` so a completed,
* aborted, or unconfigured-early-return run never leaves a stuck record; the
* status then falls back to the DB coverage count.
*/
async clear(workspaceId: string): Promise<void> {
try {
await this.redis.del(this.key(workspaceId));
} catch (err) {
this.logger.warn(
`reindex-progress clear failed for workspace ${workspaceId} ` +
`(self-cleans via TTL): ${(err as Error).message}`,
);
}
}
/**
* Read the live progress, or `null` when no reindex is active (no record, an
* expired record, or a partial record without a numeric `total`). On a Redis
* error returns `null` so the status endpoint degrades to its DB count.
*/
async get(workspaceId: string): Promise<ReindexProgress | null> {
try {
const data = await this.redis.hgetall(this.key(workspaceId));
if (!data || data.total === undefined) return null;
const total = Number(data.total);
const done = Number(data.done);
const startedAt = Number(data.startedAt);
if (!Number.isFinite(total) || !Number.isFinite(done)) return null;
return { total, done, startedAt: Number.isFinite(startedAt) ? startedAt : 0 };
} catch (err) {
this.logger.warn(
`reindex-progress read failed for workspace ${workspaceId}; ` +
`falling back to DB count: ${(err as Error).message}`,
);
return null;
}
}
}
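
The read-side rule in `get()` (a record without a numeric `total` counts as inactive) can be isolated as a pure function. The sketch below is a hypothetical extraction, not code from this PR: `parseProgress`, `RawProgressHash`, and the local `ReindexProgress` shape are assumptions made for illustration, and the input only mimics the string map that a Redis `HGETALL` returns.

```typescript
// Assumed shape, mirroring the three fields that start() writes.
interface ReindexProgress {
  total: number;
  done: number;
  startedAt: number;
}

// Redis hashes store strings, and any field may be absent.
type RawProgressHash = Partial<Record<'total' | 'done' | 'startedAt', string>>;

// Hypothetical helper: converts a raw hash into a progress record, or
// returns null when the record is missing or only partially written.
function parseProgress(data: RawProgressHash): ReindexProgress | null {
  // An increment on a cleared key recreates the hash with only `done`,
  // so a missing `total` has to read as "no active reindex".
  if (data.total === undefined) return null;

  const total = Number(data.total);
  const done = Number(data.done);
  const startedAt = Number(data.startedAt);

  // Both counters must be usable numbers; otherwise treat the record as inactive.
  if (!Number.isFinite(total) || !Number.isFinite(done)) return null;

  // A bad timestamp is not fatal: fall back to 0 and keep the counters.
  return { total, done, startedAt: Number.isFinite(startedAt) ? startedAt : 0 };
}
```

Keeping the parse separate from the Redis call would let the three inactive cases (no record, partial record, non-numeric counters) be unit-tested without a Redis mock; whether that split is worth it here is a judgment call for the authors.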

View File

@@ -0,0 +1,150 @@
// Importing FileImportTaskService transitively loads import-formatter.ts, which
// imports the ESM-only @sindresorhus/slugify package (not in jest's transform
// allowlist). slugify is irrelevant to the path under test, so it is mocked out
// to keep the module graph loadable under ts-jest (mirrors the import.service spec).
jest.mock('@sindresorhus/slugify', () => ({
__esModule: true,
default: (input: string) => String(input),
}));
// import-attachment.service.ts (loaded transitively for DI typing) imports the
// ESM-only `p-limit` / `image-dimensions`; neither is exercised on the path under
// test, so stub them so the module graph loads under ts-jest.
jest.mock('p-limit', () => ({
__esModule: true,
default: () => (fn: any) => fn(),
}));
jest.mock('image-dimensions', () => ({
__esModule: true,
imageDimensionsFromData: () => undefined,
}));
import { promises as fs } from 'fs';
import * as os from 'os';
import * as path from 'path';
import { FileImportTaskService } from './file-import-task.service';
import { ImportService } from './import.service';
/**
* Binding test for issue #228 / review #5: FileImportTaskService.processGenericImport
* is a NON-editor write path (markdownToHtml -> processHTML -> JSON, never runs
* footnoteSyncPlugin), so it canonicalizes footnotes before persisting. This pins
* that binding — the same one import.service has a spec for — which previously had
* NO spec at all.
*
* The markdown -> HTML -> ProseMirror conversion is REAL (a real ImportService,
* its createYdoc stubbed); the filesystem is a real temp dir with one .md file;
* the DB transaction is stubbed to capture the persisted page content.
*/
// Out-of-order references (c, a, b), a REUSED reference ([^a] twice), and an
// ORPHAN definition ([^z], never referenced).
const MARKDOWN = [
'# Title',
'',
'Body refs [^c] and [^a] and [^b] and again [^a].',
'',
'[^a]: note A',
'[^b]: note B',
'[^c]: note C',
'[^z]: orphan note',
].join('\n');
function footnoteListIds(content: any): string[] {
const list = (content?.content ?? []).find(
(n: any) => n.type === 'footnotesList',
);
return (list?.content ?? [])
.filter((n: any) => n.type === 'footnoteDefinition')
.map((n: any) => n.attrs?.id);
}
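// A permissive chainable stub for the spaces lookup (selectFrom(...).select(...)
// .where(...).executeTakeFirst()). Every other property, including `then`,
// returns a function yielding the proxy, so a chain must end in
// executeTakeFirst() or execute(); awaiting the proxy itself would never settle.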
function chainable(result: any): any {
const proxy: any = new Proxy(function () {}, {
get: (_t, prop) => {
if (prop === 'executeTakeFirst') return async () => result;
if (prop === 'execute') return async () => [];
return () => proxy;
},
});
return proxy;
}
describe('FileImportTaskService.processGenericImport — footnote canonicalization (#228)', () => {
it('orders footnotes by first reference, dedupes reuse, and drops orphans on zip import', async () => {
const extractDir = await fs.mkdtemp(path.join(os.tmpdir(), 'fit-canon-'));
await fs.writeFile(path.join(extractDir, 'note.md'), MARKDOWN, 'utf-8');
// Real ImportService for the html -> JSON conversion; stub the yjs encode.
const importService = new ImportService(
{} as any,
{} as any,
{} as any,
{} as any,
);
jest
.spyOn(importService as any, 'createYdoc')
.mockResolvedValue(Buffer.from([]) as any);
let captured: any = null;
const trx = {
insertInto: (table: string) => ({
values: (v: any) => {
if (table === 'pages') captured = v;
return { execute: async () => {} };
},
}),
};
const db: any = {
selectFrom: () => chainable({ slug: 'space-slug' }),
transaction: () => ({ execute: (fn: any) => fn(trx) }),
};
const importAttachmentService = {
processAttachments: async ({ html }: any) => html,
};
const backlinkRepo = { insertBacklink: jest.fn() };
const eventEmitter = { emit: jest.fn() };
const auditService = { logBatchWithContext: jest.fn() };
const pageService = { nextPagePosition: async () => 'a0' };
const service = new FileImportTaskService(
{} as any, // storageService
importService as any,
pageService as any,
backlinkRepo as any,
db,
importAttachmentService as any,
eventEmitter as any,
auditService as any,
);
const fileTask: any = {
id: 'task-1',
source: 'generic',
spaceId: 'space-1',
workspaceId: 'ws-1',
creatorId: 'user-1',
};
try {
await service.processGenericImport({ extractDir, fileTask });
expect(captured).toBeTruthy();
const content = captured.content;
// Reference order is c, a, b (NOT the markdown definition order a, b, c).
expect(footnoteListIds(content)).toEqual(['c', 'a', 'b']);
// Orphan [^z] dropped; reused [^a] collapses to one definition; one list.
expect(footnoteListIds(content)).not.toContain('z');
const lists = (content.content ?? []).filter(
(n: any) => n.type === 'footnotesList',
);
expect(lists).toHaveLength(1);
expect(footnoteListIds(content).filter((id) => id === 'a')).toHaveLength(1);
} finally {
await fs.rm(extractDir, { recursive: true, force: true });
}
});
});
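
The chainable Proxy stub used in this spec can be tried on its own. This is a minimal sketch that repeats the same trap logic; `fakeDb` and the table and column names are hypothetical stand-ins, and nothing here touches a real Kysely instance.

```typescript
// Every property access yields a function that returns the same proxy, so a
// builder-style chain of any length keeps going until a terminal call.
function chainable(result: unknown): any {
  const proxy: any = new Proxy(function () {}, {
    get: (_target, prop) => {
      // Terminal calls resolve to the canned result (or an empty row list).
      if (prop === 'executeTakeFirst') return async () => result;
      if (prop === 'execute') return async () => [];
      // Anything else (select, where, limit, ...) just continues the chain.
      return () => proxy;
    },
  });
  return proxy;
}

// Hypothetical stand-in for the injected database handle.
const fakeDb = {
  selectFrom: (_table: string) => chainable({ slug: 'space-slug' }),
};

// The intermediate builder calls are accepted without being implemented.
const query = fakeDb.selectFrom('spaces').select(['slug']).where('id', '=', 'space-1');
```

The trade-off is that the stub accepts any method name, so a typo in the code under test would not be caught by this kind of fake; it only suits lookups whose shape is incidental to the assertion.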

View File

@@ -18,7 +18,7 @@ import { generateSlugId } from '../../../common/helpers';
import { v7 } from 'uuid';
import { generateJitteredKeyBetween } from 'fractional-indexing-jittered';
import { FileTask, InsertablePage } from '@docmost/db/types/entity.types';
import { markdownToHtml } from '@docmost/editor-ext';
import { markdownToHtml, canonicalizeFootnotes } from '@docmost/editor-ext';
import { getProsemirrorContent } from '../../../common/helpers/prosemirror/utils';
import { formatImportHtml } from '../utils/import-formatter';
import {
@@ -496,9 +496,19 @@ export class FileImportTaskService {
await this.importService.processHTML(html),
);
const { title, prosemirrorJson } =
const { title, prosemirrorJson: extractedJson } =
this.importService.extractTitleAndRemoveHeading(pmState);
// Canonicalize footnote topology on this non-editor write path
// (markdownToHtml/processHTML never runs footnoteSyncPlugin), so a
// zip-imported page's footnotes are reference-ordered, deduped, and
// orphan-free like the editor's invariant (issue #228). Pure +
// idempotent + shape-safe; a footnote-free doc is unchanged.
// (Future consolidation, architecture B: like import.service, this
// path persists directly rather than via PageService — a shared
// "prepare JSON for persist" helper would centralize this call.)
const prosemirrorJson = canonicalizeFootnotes(extractedJson);
const insertablePage: InsertablePage = {
id: page.id,
slugId: page.slugId,

View File

@@ -0,0 +1,139 @@
// Importing ImportService transitively loads import-formatter.ts, which imports
// the ESM-only @sindresorhus/slugify package (not in jest's transform
// allowlist). slugify is irrelevant to the path under test, so it is mocked out
// to keep the module graph loadable under ts-jest.
jest.mock('@sindresorhus/slugify', () => ({
__esModule: true,
default: (input: string) => String(input),
}));
import { ImportService } from './import.service';
import { canonicalizeFootnotes } from '@docmost/editor-ext';
/**
* Integration-ish test for the USER-FACING markdown import path
* (`ImportService.importPage`). It exercises the REAL markdown -> HTML -> JSON
* conversion and asserts that the stored page content has its footnotes
* canonicalized — the gap that issue #228 fixes: the import path builds
* ProseMirror JSON directly (never running the editor's footnoteSyncPlugin), so
* before this wiring the stored footnotes kept the markdown's physical
* definition order (out of order vs. references), retained orphan definitions,
* and did not collapse reused references.
*
* The DB/ydoc side-effects are stubbed: `getNewPagePosition` (DB query) and
* `createYdoc` (Yjs encode) are spied, and `pageRepo.insertPage` captures the
* persisted `content`. Everything between markdown and persistence is REAL.
*/
// Out-of-order references (c, a, b), a REUSED reference ([^a] twice -> one
// footnote), and an ORPHAN definition ([^z], never referenced).
const MARKDOWN = [
'# Title',
'',
'Body refs [^c] and [^a] and [^b] and again [^a].',
'',
'[^a]: note A',
'[^b]: note B',
'[^c]: note C',
'[^z]: orphan note',
].join('\n');
function makeFile(filename: string, contents: string) {
return {
filename,
toBuffer: async () => Buffer.from(contents),
} as any;
}
function makeService() {
let captured: any = null;
const pageRepo = {
insertPage: jest.fn(async (values: any) => {
captured = values;
return { id: 'page-id', slugId: 'slug-id' };
}),
};
const service = new ImportService(
pageRepo as any,
{} as any,
{} as any,
{} as any,
);
jest.spyOn(service as any, 'getNewPagePosition').mockResolvedValue('a0');
jest
.spyOn(service as any, 'createYdoc')
.mockResolvedValue(Buffer.from([]) as any);
return { service, pageRepo, getCaptured: () => captured };
}
/** List the footnote-definition ids of the (single) footnotesList, in order. */
function footnoteListIds(content: any): string[] {
const list = (content.content ?? []).find(
(n: any) => n.type === 'footnotesList',
);
if (!list) return [];
return (list.content ?? [])
.filter((n: any) => n.type === 'footnoteDefinition')
.map((n: any) => n.attrs?.id);
}
function definitionText(content: any, id: string): string | undefined {
const list = (content.content ?? []).find(
(n: any) => n.type === 'footnotesList',
);
const def = (list?.content ?? []).find(
(n: any) => n.type === 'footnoteDefinition' && n.attrs?.id === id,
);
return def?.content?.[0]?.content?.[0]?.text;
}
describe('ImportService.importPage — footnote canonicalization (#228)', () => {
it('orders footnotes by first reference, dedupes reuse, and drops orphans', async () => {
const { service, getCaptured } = makeService();
await service.importPage(
Promise.resolve(makeFile('note.md', MARKDOWN)),
'user-id',
'space-id',
'workspace-id',
);
const content = getCaptured().content;
expect(content).toBeTruthy();
// Reference order is c, a, b (NOT the markdown definition order a, b, c).
expect(footnoteListIds(content)).toEqual(['c', 'a', 'b']);
// Definitions preserved and attached to the right ids.
expect(definitionText(content, 'c')).toBe('note C');
expect(definitionText(content, 'a')).toBe('note A');
expect(definitionText(content, 'b')).toBe('note B');
// Orphan definition [^z] is dropped.
expect(footnoteListIds(content)).not.toContain('z');
// Reused [^a] yields exactly ONE definition, and exactly one list.
const lists = (content.content ?? []).filter(
(n: any) => n.type === 'footnotesList',
);
expect(lists).toHaveLength(1);
expect(footnoteListIds(content).filter((id) => id === 'a')).toHaveLength(1);
});
it('is idempotent: canonicalizing the stored output again is a no-op', async () => {
const { service, getCaptured } = makeService();
await service.importPage(
Promise.resolve(makeFile('note.md', MARKDOWN)),
'user-id',
'space-id',
'workspace-id',
);
const stored = getCaptured().content;
// The stored content is already canonical; running the canonicalizer a second
// time must not change it (safe to wire into every write path).
const second = canonicalizeFootnotes(stored);
expect(second).toEqual(stored);
expect(footnoteListIds(second)).toEqual(['c', 'a', 'b']);
});
});

View File

@@ -17,7 +17,7 @@ import {
import { generateJitteredKeyBetween } from 'fractional-indexing-jittered';
import { TiptapTransformer } from '@hocuspocus/transformer';
import * as Y from 'yjs';
import { markdownToHtml } from '@docmost/editor-ext';
import { markdownToHtml, canonicalizeFootnotes } from '@docmost/editor-ext';
import {
FileTaskStatus,
FileTaskType,
@@ -85,7 +85,17 @@ export class ImportService {
const extracted = this.extractTitleAndRemoveHeading(prosemirrorState);
const title = extracted.title;
const prosemirrorJson = extracted.prosemirrorJson;
// Imported markdown/HTML is built via markdownToHtml -> htmlToJson, which
// never runs the editor's footnoteSyncPlugin, so the footnote topology keeps
// the source's PHYSICAL definition order (out of order vs. references),
// retains orphan definitions, and is not deduped. Canonicalize before
// persisting so the stored page matches the editor's invariant (issue #228).
// Pure + idempotent + shape-safe: a doc with no footnotes is unchanged.
// (Future consolidation, architecture B: this import path persists directly
// via pageRepo.insertPage rather than through PageService.createPage, so the
// canonicalize call lives here; folding both into one "prepare JSON for
// persist" helper is a sensible follow-up.)
const prosemirrorJson = canonicalizeFootnotes(extracted.prosemirrorJson);
const pageTitle = title || fileName;

View File

@@ -0,0 +1,124 @@
import { Kysely } from 'kysely';
import { randomUUID } from 'node:crypto';
import { PageRepo } from '@docmost/db/repos/page/page.repo';
import { SpaceMemberRepo } from '@docmost/db/repos/space/space-member.repo';
import { EventEmitter2 } from '@nestjs/event-emitter';
import { getTestDb, destroyTestDb, createWorkspace, createSpace } from './db';
/**
* `PageRepo.getEmbeddablePageIds` MUST stay in lockstep with
* `PageRepo.countEmbeddablePages` (page.repo.ts) — the bulk reindex iterates the
* ID set while the status endpoint reports the count as the live denominator, so
* if the two predicates ever diverge the "done X of Y" counter ends on the wrong
* total. Both share the SAME WHERE: a page qualifies iff it is non-deleted AND
* (text_content has a non-whitespace char OR it has a non-deleted embedding row).
*
* This is a DB-level invariant: the predicate lives in raw SQL (`text_content ~
* '[^[:space:]]'`) and an EXISTS subquery, so a unit test with mocked Kysely
* cannot observe it. We seed every boundary case against real Postgres and
* assert the size of the returned ID set EQUALS the count (and that the set is
* exactly the expected one).
* A future edit that touches one predicate but not the other turns this red.
*/
describe('PageRepo embeddable-page set: getEmbeddablePageIds <-> countEmbeddablePages [integration]', () => {
let db: Kysely<any>;
let repo: PageRepo;
let workspaceId: string;
let spaceId: string;
beforeAll(async () => {
db = getTestDb();
// Only the Kysely-backed query methods under test are exercised, so the
// SpaceMemberRepo / EventEmitter2 deps are never touched — stub them.
repo = new PageRepo(
db as any,
{} as unknown as SpaceMemberRepo,
{} as unknown as EventEmitter2,
);
workspaceId = (await createWorkspace(db)).id;
spaceId = (await createSpace(db, workspaceId)).id;
});
afterAll(async () => {
await destroyTestDb();
});
// Insert a page with explicit text_content / deleted_at (createPage in db.ts
// sets neither), returning its id so the test can assert membership.
async function insertPage(args: {
textContent: string | null;
deletedAt?: Date | null;
}): Promise<string> {
const id = randomUUID();
await db
.insertInto('pages')
.values({
id,
slugId: `slug-${id.slice(0, 8)}`,
title: `page-${id.slice(0, 8)}`,
spaceId,
workspaceId,
textContent: args.textContent,
deletedAt: args.deletedAt ?? null,
})
.execute();
return id;
}
// Insert one embedding chunk row for a page (NOT NULL columns + deleted_at).
async function insertEmbedding(
pageId: string,
opts: { deletedAt?: Date | null } = {},
): Promise<void> {
await db
.insertInto('pageEmbeddings')
.values({
id: randomUUID(),
workspaceId,
pageId,
spaceId,
chunkIndex: 0,
chunkStart: 0,
chunkLength: 1,
content: 'x',
modelName: 'test-model',
modelDimensions: 1,
deletedAt: opts.deletedAt ?? null,
})
.execute();
}
it('returns exactly the embeddable set and its size equals countEmbeddablePages', async () => {
// IN the set --------------------------------------------------------------
// (a) non-deleted page with real body text.
const withText = await insertPage({ textContent: 'hello world' });
// (b) non-deleted page with NO text but a live embedding row (EXISTS clause:
// a page that lost its text yet still has stale vectors must be visited
// so the reindex can clear them).
const noTextLiveEmbedding = await insertPage({ textContent: null });
await insertEmbedding(noTextLiveEmbedding);
// OUT of the set ----------------------------------------------------------
// (c) non-deleted, text_content NULL, no embeddings.
await insertPage({ textContent: null });
// (d) non-deleted, whitespace-only text (regex requires a non-space char).
await insertPage({ textContent: ' \n\t ' });
// (e) deleted page WITH body text — excluded by the non-deleted predicate.
await insertPage({
textContent: 'deleted but had text',
deletedAt: new Date(),
});
// (f) non-deleted, no text, with ONLY a DELETED embedding row — the EXISTS
// subquery filters pe.deleted_at IS NULL, so this stays out.
const onlyDeletedEmbedding = await insertPage({ textContent: null });
await insertEmbedding(onlyDeletedEmbedding, { deletedAt: new Date() });
const ids = await repo.getEmbeddablePageIds(workspaceId);
const count = await repo.countEmbeddablePages(workspaceId);
// The two queries agree on the size (the load-bearing lockstep invariant)...
expect(ids.length).toBe(count);
// ...and the set is exactly the two qualifying pages, nothing else.
expect(new Set(ids)).toEqual(new Set([withText, noTextLiveEmbedding]));
expect(count).toBe(2);
});
});
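
The lockstep invariant this spec guards can be modelled in memory. The sketch below is an illustration under stated assumptions, not the repository's SQL: `PageRow`, `EmbeddingRow`, `isEmbeddable`, `embeddableIds`, and `countEmbeddable` are hypothetical names, and the JavaScript `/\S/` test only approximates the Postgres `[^[:space:]]` regex used by the real predicate.

```typescript
interface PageRow {
  id: string;
  textContent: string | null;
  deletedAt: Date | null;
}

interface EmbeddingRow {
  pageId: string;
  deletedAt: Date | null;
}

// Single predicate: non-deleted AND (has a non-whitespace char OR has a live embedding).
function isEmbeddable(page: PageRow, embeddings: EmbeddingRow[]): boolean {
  if (page.deletedAt !== null) return false;
  const hasText = page.textContent !== null && /\S/.test(page.textContent);
  const hasLiveEmbedding = embeddings.some(
    (e) => e.pageId === page.id && e.deletedAt === null,
  );
  return hasText || hasLiveEmbedding;
}

// Both consumers go through the same predicate, so they cannot drift apart.
function embeddableIds(pages: PageRow[], embeddings: EmbeddingRow[]): string[] {
  return pages.filter((p) => isEmbeddable(p, embeddings)).map((p) => p.id);
}

function countEmbeddable(pages: PageRow[], embeddings: EmbeddingRow[]): number {
  return pages.filter((p) => isEmbeddable(p, embeddings)).length;
}

// The six boundary cases seeded by the integration test, keyed (a) to (f).
const samplePages: PageRow[] = [
  { id: 'a', textContent: 'hello world', deletedAt: null },
  { id: 'b', textContent: null, deletedAt: null },
  { id: 'c', textContent: null, deletedAt: null },
  { id: 'd', textContent: ' \n\t ', deletedAt: null },
  { id: 'e', textContent: 'deleted but had text', deletedAt: new Date(0) },
  { id: 'f', textContent: null, deletedAt: null },
];

const sampleEmbeddings: EmbeddingRow[] = [
  { pageId: 'b', deletedAt: null }, // live row keeps (b) in the set
  { pageId: 'f', deletedAt: new Date(0) }, // deleted row leaves (f) out
];
```

In the real repo the two queries are separate SQL statements, which is why the integration test is needed; the model only shows what "same WHERE" means for the six seeded cases.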

View File

@@ -0,0 +1,371 @@
import { describe, it, expect } from 'vitest';
import { Editor, getSchema } from '@tiptap/core';
import { Document } from '@tiptap/extension-document';
import { Paragraph } from '@tiptap/extension-paragraph';
import { Text } from '@tiptap/extension-text';
import { FootnoteReference } from './footnote-reference';
import { FootnotesList } from './footnotes-list';
import { FootnoteDefinition } from './footnote-definition';
import { canonicalizeFootnotes } from './footnote-canonicalize';
import { FOOTNOTE_CORPUS } from './footnote-corpus';
import {
collectReferenceIds,
computeFootnoteNumbers,
FOOTNOTE_REFERENCE_NAME,
FOOTNOTES_LIST_NAME,
FOOTNOTE_DEFINITION_NAME,
} from './footnote-util';
import { Node as PMNode } from '@tiptap/pm/model';
const extensions = [
Document,
Paragraph,
Text,
FootnoteReference,
FootnotesList,
FootnoteDefinition,
];
const ref = (id: string) => ({ type: FOOTNOTE_REFERENCE_NAME, attrs: { id } });
const def = (id: string, text?: string) => ({
type: FOOTNOTE_DEFINITION_NAME,
attrs: { id },
content: [
text
? { type: 'paragraph', content: [{ type: 'text', text }] }
: { type: 'paragraph' },
],
});
const list = (...defs: any[]) => ({ type: FOOTNOTES_LIST_NAME, content: defs });
const para = (...inline: any[]) => ({ type: 'paragraph', content: inline });
/** Find every node of `type`, document order. */
function findAll(node: any, type: string, acc: any[] = []): any[] {
if (!node || typeof node !== 'object') return acc;
if (node.type === type) acc.push(node);
if (Array.isArray(node.content)) {
for (const c of node.content) findAll(c, type, acc);
}
return acc;
}
/** Physical id order of the definitions in the (single) footnotesList. */
function defOrder(doc: any): string[] {
return findAll(doc, FOOTNOTE_DEFINITION_NAME).map((d) => d.attrs.id);
}
const schema = getSchema(extensions);
/** Reference order (distinct, document order) computed via the shared util. */
function refOrder(doc: any): string[] {
return collectReferenceIds(PMNode.fromJSON(schema, doc));
}
describe('canonicalizeFootnotes (pure JSON)', () => {
it('orders definitions by FIRST reference (out-of-order list -> 1..N)', () => {
// References appear b, a, d, c; the bottom list is in a different (import)
// order. The canonical list must follow reference order so reading it top to
// bottom yields numbers 1..N.
const doc = {
type: 'doc',
content: [
para(
{ type: 'text', text: 'x' },
ref('b'),
ref('a'),
ref('d'),
ref('c'),
),
list(def('a', 'A'), def('c', 'C'), def('b', 'B'), def('d', 'D')),
],
};
const out = canonicalizeFootnotes(doc);
expect(defOrder(out)).toEqual(['b', 'a', 'd', 'c']);
// The physical definition order now matches reference order, so the derived
// numbers (1..N) run sequentially down the list.
expect(refOrder(out)).toEqual(['b', 'a', 'd', 'c']);
const numbers = computeFootnoteNumbers(PMNode.fromJSON(schema, out));
expect(numbers.get('b')).toBe(1);
expect(numbers.get('a')).toBe(2);
expect(numbers.get('d')).toBe(3);
expect(numbers.get('c')).toBe(4);
});
it('numbers run 1..N down the canonical list', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'x' }, ref('b'), ref('a'), ref('c')),
list(def('a', 'A'), def('c', 'C'), def('b', 'B')),
],
};
const out = canonicalizeFootnotes(doc);
// Definition order == reference order == 1,2,3 reading down.
expect(defOrder(out)).toEqual(['b', 'a', 'c']);
});
it('drops an orphan definition (no matching reference)', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'x' }, ref('a')),
list(def('a', 'A'), def('orphan', 'O')),
],
};
const out = canonicalizeFootnotes(doc);
expect(defOrder(out)).toEqual(['a']);
expect(findAll(out, FOOTNOTE_DEFINITION_NAME)).toHaveLength(1);
});
it('with NO references, removes the footnotesList entirely', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'plain' }),
list(def('orphan', 'O')),
],
};
const out = canonicalizeFootnotes(doc);
expect(findAll(out, FOOTNOTES_LIST_NAME)).toHaveLength(0);
expect(findAll(out, FOOTNOTE_DEFINITION_NAME)).toHaveLength(0);
});
it('reuse: repeated references collapse to ONE definition/number', () => {
const doc = {
type: 'doc',
content: [
para(ref('d'), { type: 'text', text: ' a ' }, ref('d'), ref('d')),
list(def('d', 'shared')),
],
};
const out = canonicalizeFootnotes(doc);
// One definition; the three references keep id "d".
expect(defOrder(out)).toEqual(['d']);
expect(
findAll(out, FOOTNOTE_REFERENCE_NAME).map((r) => r.attrs.id),
).toEqual(['d', 'd', 'd']);
});
it('duplicate definitions: first wins, the rest are dropped (never resurface as orphans)', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'x' }, ref('d')),
list(def('d', 'first'), def('d', 'second'), def('d', 'third')),
],
};
const out = canonicalizeFootnotes(doc);
const defs = findAll(out, FOOTNOTE_DEFINITION_NAME);
expect(defs.map((d) => d.attrs.id)).toEqual(['d']);
expect(defs[0].content[0].content[0].text).toBe('first');
});
it('synthesizes an empty definition for a reference that has none', () => {
const doc = {
type: 'doc',
content: [para({ type: 'text', text: 'x' }, ref('missing'))],
};
const out = canonicalizeFootnotes(doc);
expect(defOrder(out)).toEqual(['missing']);
const list0 = findAll(out, FOOTNOTES_LIST_NAME);
expect(list0).toHaveLength(1);
});
it('merges multiple footnotesList nodes into one', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'a' }, ref('x'), ref('y')),
list(def('x', 'X')),
para({ type: 'text', text: 'tail' }),
list(def('y', 'Y')),
],
};
const out = canonicalizeFootnotes(doc);
expect(findAll(out, FOOTNOTES_LIST_NAME)).toHaveLength(1);
expect(defOrder(out)).toEqual(['x', 'y']);
});
it('places the single list before trailing empty paragraphs', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'x' }, ref('a')),
list(def('a', 'A')),
{ type: 'paragraph' },
],
};
const out = canonicalizeFootnotes(doc);
const last = out.content[out.content.length - 1];
expect(last.type).toBe('paragraph');
expect(out.content[out.content.length - 2].type).toBe(FOOTNOTES_LIST_NAME);
});
it('is idempotent: canonicalize(canonicalize(x)) === canonicalize(x)', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'x' }, ref('b'), ref('a')),
list(def('a', 'A'), def('b', 'B'), def('orphan', 'O')),
],
};
const once = canonicalizeFootnotes(doc);
const twice = canonicalizeFootnotes(once);
expect(twice).toEqual(once);
});
it('does not mutate its input', () => {
const doc = {
type: 'doc',
content: [
para({ type: 'text', text: 'x' }, ref('a')),
list(def('orphan', 'O')),
],
};
const snapshot = JSON.parse(JSON.stringify(doc));
canonicalizeFootnotes(doc);
expect(doc).toEqual(snapshot);
});
});
/**
* GOLDEN PARITY against the live `footnoteSyncPlugin`. The server canonicalizer
* must produce EXACTLY what the editor keeps. For every editor-reachable steady
* state (the list is already reference-ordered there), driving a real editor to
* convergence and then running `canonicalizeFootnotes` on its JSON must be a
* byte-for-byte no-op — proving the server output is identical to the editor's.
*/
describe('canonicalizeFootnotes golden parity with footnoteSyncPlugin', () => {
function makeEditor(content: any) {
return new Editor({ extensions, content });
}
/** Load `content`, fire one local edit so the sync plugin converges, return JSON. */
function pluginSteadyState(content: any): any {
const editor = makeEditor(content);
// A local doc change triggers footnoteSyncPlugin.appendTransaction.
editor.commands.insertContentAt(1, ' ');
const json = editor.state.doc.toJSON();
editor.destroy();
return json;
}
const corpus: Array<{ name: string; content: any }> = [
{
name: 'plain ref + def',
content: {
type: 'doc',
content: [para({ type: 'text', text: 'a' }, ref('x')), list(def('x', 'X'))],
},
},
{
name: 'two refs, two defs in reference order',
content: {
type: 'doc',
content: [
para({ type: 'text', text: 'a' }, ref('x'), { type: 'text', text: 'b' }, ref('y')),
list(def('x', 'X'), def('y', 'Y')),
],
},
},
{
name: 'orphan definition gets removed',
content: {
type: 'doc',
content: [para({ type: 'text', text: 'a' }, ref('x')), list(def('x', 'X'), def('orphan', 'O'))],
},
},
{
name: 'reference missing its definition (synth empty)',
content: {
type: 'doc',
content: [para({ type: 'text', text: 'a' }, ref('x'))],
},
},
{
name: 'reuse: repeated references, one definition',
content: {
type: 'doc',
content: [
para(ref('d'), { type: 'text', text: ' a ' }, ref('d'), ref('d')),
list(def('d', 'shared')),
],
},
},
{
name: 'no footnotes at all',
content: {
type: 'doc',
content: [para({ type: 'text', text: 'just text' })],
},
},
];
for (const { name, content } of corpus) {
it(`steady state is a canonicalize no-op: ${name}`, () => {
const steady = pluginSteadyState(content);
expect(canonicalizeFootnotes(steady)).toEqual(steady);
});
}
it('placement parity: the LIVE plugin leaves a list with NON-EMPTY content after it in place, and canonicalize agrees', () => {
// Drives the real footnoteSyncPlugin (not a hand-authored expected): a single
// canonical list with body content AFTER it must NOT be repositioned by the
// plugin, and the server canonicalizer must agree (step-6 placement parity).
const content = {
type: 'doc',
content: [
para({ type: 'text', text: 'a' }, ref('x')),
list(def('x', 'X')),
para({ type: 'text', text: 'epilogue' }),
],
};
const steady = pluginSteadyState(content);
// The plugin did NOT move the list to the end: a non-empty paragraph follows it.
const types = steady.content.map((n: any) => n.type);
const listPos = types.indexOf(FOOTNOTES_LIST_NAME);
expect(listPos).toBeGreaterThanOrEqual(0);
expect(listPos).toBeLessThan(types.length - 1);
const after = steady.content[listPos + 1];
expect(after.type).toBe('paragraph');
expect(JSON.stringify(after)).toContain('epilogue');
// The canonicalizer is a byte-for-byte no-op on that steady state (parity).
expect(canonicalizeFootnotes(steady)).toEqual(steady);
});
it('the canonicalizer and the editor agree on reference order and definition set', () => {
const content = {
type: 'doc',
content: [
para({ type: 'text', text: 'a' }, ref('x'), { type: 'text', text: 'b' }, ref('y')),
list(def('y', 'Y'), def('x', 'X')), // physically reversed
],
};
const steady = pluginSteadyState(content);
const canon = canonicalizeFootnotes(content);
// Same reference order and same DEFINITION SET (ids) in both, even though the
// physical list order may differ (the plugin preserves node identity, the
// canonicalizer reorders). Numbering — derived from reference order — matches.
expect(refOrder(steady)).toEqual(['x', 'y']);
expect(defOrder(canon)).toEqual(['x', 'y']);
expect(new Set(defOrder(steady))).toEqual(new Set(defOrder(canon)));
});
});
/**
* SHARED golden corpus: this editor-ext copy of `canonicalizeFootnotes` and the
* MCP mirror (`packages/mcp/src/lib/footnote-canonicalize.ts`) are BOTH run
* against the identical { input -> expected } corpus. Pinning the same expected
* outputs in both suites makes "the two pure copies behave identically" a
* checkable property without coupling the packages (architecture item A). The
* MCP mirror of these assertions lives in `test/unit/footnote-corpus.test.mjs`.
*/
describe('canonicalizeFootnotes shared golden corpus (editor-ext copy)', () => {
for (const { name, input, expected } of FOOTNOTE_CORPUS) {
it(`matches the corpus expected output: ${name}`, () => {
expect(canonicalizeFootnotes(input)).toEqual(expected);
// Idempotent on the corpus too.
expect(canonicalizeFootnotes(expected)).toEqual(expected);
});
}
});
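
The behaviours pinned by the pure-JSON tests above (reference ordering, first-wins duplicates, orphan removal, synthesized definitions, one merged list before trailing empty paragraphs) can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation under test: `sketchCanonicalize` and `PMJson` are hypothetical names, the `footnoteReference` type string is an assumption (the real code imports its names from `footnote-util`), only top-level lists are handled, and the list is always rebuilt rather than left in place when already canonical.

```typescript
interface PMJson {
  type: string;
  attrs?: Record<string, unknown>;
  content?: PMJson[];
  text?: string;
}

// Assumed node type names for this sketch.
const REF = 'footnoteReference';
const LIST = 'footnotesList';
const DEF = 'footnoteDefinition';

// Depth-first, document-order traversal.
function walk(node: PMJson, visit: (n: PMJson) => void): void {
  visit(node);
  for (const child of node.content ?? []) walk(child, visit);
}

const isEmptyParagraph = (n: PMJson | undefined): boolean =>
  n !== undefined && n.type === 'paragraph' && (n.content ?? []).length === 0;

function sketchCanonicalize(doc: PMJson): PMJson {
  // Deep-clone so the caller's object is never mutated.
  const out: PMJson = JSON.parse(JSON.stringify(doc));

  const refIds: string[] = []; // distinct ids, in first-reference order
  const defs = new Map<string, PMJson>(); // first definition per id wins
  walk(out, (n) => {
    const id = n.attrs?.id;
    if (typeof id !== 'string') return;
    if (n.type === REF && !refIds.includes(id)) refIds.push(id);
    if (n.type === DEF && !defs.has(id)) defs.set(id, n);
  });

  // Drop every existing top-level list; at most one is rebuilt below.
  const body = (out.content ?? []).filter((n) => n.type !== LIST);

  if (refIds.length > 0) {
    const list: PMJson = {
      type: LIST,
      // Reuse the existing definition, or synthesize an empty one.
      content: refIds.map(
        (id) => defs.get(id) ?? { type: DEF, attrs: { id }, content: [{ type: 'paragraph' }] },
      ),
    };
    // Place the list after the last meaningful block.
    let at = body.length;
    while (at > 0 && isEmptyParagraph(body[at - 1])) at--;
    body.splice(at, 0, list);
  }

  out.content = body;
  return out;
}
```

Orphans fall out naturally because the list is built from `refIds` alone, and running the function on its own output produces the same document, which is the idempotence property the suite checks. The placement-parity case (a canonical list followed by non-empty content stays put) is not reproduced by this sketch.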

View File

@@ -0,0 +1,272 @@
import {
FOOTNOTE_REFERENCE_NAME,
FOOTNOTES_LIST_NAME,
FOOTNOTE_DEFINITION_NAME,
} from './footnote-util';
/**
* Server-side, EditorView-free port of the footnote integrity invariant that
* `footnoteSyncPlugin` maintains in the live editor. Where the plugin is an
* `appendTransaction` that only runs inside a ProseMirror `EditorView`, this is
* a PURE function over ProseMirror JSON: `canonicalizeFootnotes(doc) -> doc`.
*
* It exists because the NON-editor write paths served by THIS copy build
* ProseMirror JSON directly (never running the editor's plugins), so the
* canonical footnote topology was never enforced on those writes. The consumers
* of this editor-ext copy are: the server markdown/HTML import
* (`markdownToHtml -> htmlToJson` in import.service / file-import-task.service),
* `PageService` create/update (`parseProsemirrorContent` for the JSON/markdown/
* HTML REST write paths), and the client markdown PASTE path
* (`markdown-clipboard.ts`). (The MCP package mirrors this canonicalizer in
* `packages/mcp/src/lib/footnote-canonicalize.ts` for its own FULL-document write
* paths — `markdownToProseMirrorCanonical` (the page markdown-import path; the
* plain `markdownToProseMirror` primitive used for COMMENT bodies does NOT
* canonicalize), `update_page_json`, `docmost_transform`, `insert_footnote`,
* `copy_page_content`; see that file's header.) These uncanonicalized write
* paths are the root cause of the symptoms in the issue: footnotes rendered out
* of order (`1, 4, 2, 3, …`), a raw trailing `[^id]: …` block, and orphan
* definitions. Each is simply content written PAST the canonicalizer.
*
* The desired end-state (identical to the plugin's) is:
*
* 1. Reference ids in DOCUMENT ORDER are the single source of truth for which
* definitions exist and in what order (numbering is derived from this, see
* `computeFootnoteNumbers`). Repeated references that share an id are REUSE
* (one footnote, one number, one definition) — never re-id'd.
* 2. Exactly ONE `footnotesList`, holding one definition per referenced id in
* REFERENCE order, reusing the existing definition node (content preserved)
* or synthesizing an empty one when missing. The list sits after the last
* meaningful block (only trailing empty paragraphs may follow it).
* 3. Orphan definitions (no matching reference) are dropped.
* 4. Duplicate DEFINITIONS (two nodes sharing an id) are resolved first-wins:
* the first definition for an id is kept; later duplicates carry the SAME
* id, so they can never be referenced separately and are simply dropped.
* This matches the importer's first-wins rule ("one definition per id").
* (The LIVE editor instead re-id's a duplicate definition so a paste/collab
* merge cannot silently lose live user data; the artifacts this copy
* sanitizes are agent/import-authored, so first-wins is the right policy —
* see footnote-sync.ts `resolveCollisions`.)
* 5. Idempotent: a document that already satisfies the invariant is returned
* structurally unchanged (the existing definition/list nodes are reused
* verbatim), so re-running the canonicalizer — or running it on a write that
* the editor already canonicalized — is a no-op. This is what makes it safe
* to wire into EVERY write path without spurious mutations / git-sync churn.
*
* Divergence from the live plugin (intentional): the plugin preserves the
* PHYSICAL order of existing definition nodes to keep their Yjs/CRDT subtree
* identity stable across collaborators (numbering is decoration-derived, so the
* displayed numbers are correct regardless of physical order). This function has
* no live CRDT to protect, so when a REPAIR is needed it physically REORDERS the
* list into reference order — which is exactly the fix the out-of-order import
* needs.
*
* Placement PARITY with the plugin: when the document is already in the canonical
* single-list state, this function leaves that list EXACTLY where it sits (it
* does not move it to the end). The plugin behaves the same — it treats one
* footnotesList holding the canonical definition set as canonical regardless of
* whether content follows it (footnote-sync.ts: `primaryList` falls back to the
* last list and `noChangeNeeded` stays true). So on every editor-reachable steady
* state the two agree byte-for-byte, including when non-empty content follows the
* list; see the golden parity test and the shared corpus.
*
* Pure: deep-clones its input, never mutates the caller's object, and is
* deterministic (no `Math.random`/`Date.now`).
*/
export function canonicalizeFootnotes<T = any>(doc: T): T {
if (
doc == null ||
typeof doc !== 'object' ||
!Array.isArray((doc as any).content)
) {
return doc;
}
const out = cloneJson(doc) as any;
// 1) Distinct reference ids in document order (deep — references can live in
// callouts, tables, list items, ...). This is the ordering/numbering truth.
const referenceIds: string[] = [];
const seenRefIds = new Set<string>();
collectReferenceIds(out, referenceIds, seenRefIds);
// 2) Every definition node in document order (deep — defs normally live inside
// one or more `footnotesList` blocks, but we tolerate stray placements).
const defNodes: any[] = [];
collectDefinitions(out, defNodes);
// 3) First definition per id wins. Later duplicates carry the SAME id, so they
// can never be referenced separately and would be orphans — they are simply
// dropped (first-wins; see the file header, item 4).
const defById = new Map<string, any>();
for (const d of defNodes) {
const id = d?.attrs?.id;
if (id && !defById.has(id)) defById.set(id, d);
}
// 4) Build the ordered definition list: one per referenced id, in REFERENCE
// order, reusing the existing node (content preserved, id normalized) or
// synthesizing an empty definition. Definitions whose id is NOT referenced
// are orphans and are simply never added. The reused node is SHALLOW-copied
// (id normalized): `out` is already a deep clone and the old lists are cut,
// so a second per-definition deep clone is needless.
const orderedDefs: any[] = [];
for (const id of referenceIds) {
const existing = defById.get(id);
if (existing) {
orderedDefs.push({
...existing,
attrs: { ...(existing.attrs ?? {}), id },
});
} else {
orderedDefs.push(emptyDefinition(id));
}
}
// 5) No references -> there must be NO list at all (at any depth).
if (referenceIds.length === 0) {
stripFootnotesListsDeep(out);
return out;
}
// 6) Placement parity with the live plugin: when the document is ALREADY in the
// canonical single-list state, leave that list exactly where it sits instead
// of cutting and re-inserting it at the end. The plugin never repositions a
// sole correct list (footnote-sync.ts), so moving it here would silently
// reorder any user content that follows the list on the first write. The doc
// is in that state when there is exactly one top-level footnotesList, every
// definition in the doc is referenced (no orphans / duplicates: the def count
// equals the canonical count), and the list already holds exactly the
// canonical definitions in reference order.
const topLevelLists = out.content.filter(
(n: any) => n && n.type === FOOTNOTES_LIST_NAME,
);
if (
topLevelLists.length === 1 &&
defNodes.length === orderedDefs.length &&
deepEqualJson(topLevelLists[0].content, orderedDefs)
) {
return out;
}
// 7) Otherwise rebuild: strip every footnotesList AND every bare
// footnoteDefinition at ANY depth (collectDefinitions gathers defs
// recursively, so a list nested in a callout/blockquote — or a bare
// definition outside any list — would otherwise have its defs copied into the
// rebuilt list while the original survives in place → duplicates) and
// re-insert exactly one list after the last meaningful (non-empty paragraph)
// top-level block, so it coexists with a trailing-node empty paragraph. This
// both repairs a non-canonical doc and (in the import case) physically
// reorders the list into reference order.
stripFootnotesListsDeep(out);
stripFootnoteDefinitionsDeep(out);
const top: any[] = out.content;
let insertAt = top.length;
while (insertAt > 0 && isEmptyParagraph(top[insertAt - 1])) insertAt--;
top.splice(insertAt, 0, { type: FOOTNOTES_LIST_NAME, content: orderedDefs });
out.content = top;
return out;
}
/** Remove every `footnotesList` node at ANY depth (mutates the given clone). */
function stripFootnotesListsDeep(node: any): void {
if (!node || typeof node !== 'object' || !Array.isArray(node.content)) return;
node.content = node.content.filter(
(c: any) => !(c && c.type === FOOTNOTES_LIST_NAME),
);
for (const child of node.content) stripFootnotesListsDeep(child);
}
/**
* Remove every BARE `footnoteDefinition` node at ANY depth (mutates the given
* clone). Runs only in the rebuild path AFTER the lists are stripped, so it
* targets definitions that were sitting outside a list (e.g. hand-authored via a
* raw-JSON write path and nested in a callout); their content was already copied
* into the rebuilt list, so leaving the originals would duplicate them.
*/
function stripFootnoteDefinitionsDeep(node: any): void {
if (!node || typeof node !== 'object' || !Array.isArray(node.content)) return;
node.content = node.content.filter(
(c: any) => !(c && c.type === FOOTNOTE_DEFINITION_NAME),
);
for (const child of node.content) stripFootnoteDefinitionsDeep(child);
}
/**
* Deep equality over plain JSON: arrays are compared POSITIONALLY
* (order-SENSITIVE), object keys order-insensitively. The array order-sensitivity
* is required for correctness here — a reordered `footnotesList.content` must
* compare UNEQUAL so the canonical rebuild fires instead of leaving it in place.
*/
function deepEqualJson(a: any, b: any): boolean {
if (a === b) return true;
if (a == null || b == null || typeof a !== typeof b) return false;
if (Array.isArray(a) || Array.isArray(b)) {
if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) {
return false;
}
for (let i = 0; i < a.length; i++) {
if (!deepEqualJson(a[i], b[i])) return false;
}
return true;
}
if (typeof a === 'object') {
const ka = Object.keys(a);
const kb = Object.keys(b);
if (ka.length !== kb.length) return false;
for (const k of ka) {
if (!Object.prototype.hasOwnProperty.call(b, k)) return false;
if (!deepEqualJson(a[k], b[k])) return false;
}
return true;
}
return false;
}
/** A fresh empty definition node for a referenced id with no definition. */
function emptyDefinition(id: string): any {
return {
type: FOOTNOTE_DEFINITION_NAME,
attrs: { id },
content: [{ type: 'paragraph' }],
};
}
function isEmptyParagraph(node: any): boolean {
return (
!!node &&
node.type === 'paragraph' &&
(!Array.isArray(node.content) || node.content.length === 0)
);
}
/** Collect DISTINCT footnoteReference ids in document order (first appearance). */
function collectReferenceIds(
node: any,
out: string[],
seen: Set<string>,
): void {
if (!node || typeof node !== 'object') return;
if (node.type === FOOTNOTE_REFERENCE_NAME) {
const id = node?.attrs?.id;
if (id && !seen.has(id)) {
seen.add(id);
out.push(id);
}
}
if (Array.isArray(node.content)) {
for (const child of node.content) collectReferenceIds(child, out, seen);
}
}
/** Collect every footnoteDefinition node in document order. */
function collectDefinitions(node: any, out: any[]): void {
if (!node || typeof node !== 'object') return;
if (node.type === FOOTNOTE_DEFINITION_NAME) out.push(node);
if (Array.isArray(node.content)) {
for (const child of node.content) collectDefinitions(child, out);
}
}
function cloneJson<T>(v: T): T {
if (typeof structuredClone === 'function') return structuredClone(v);
return JSON.parse(JSON.stringify(v)) as T;
}
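
The core of steps 1 to 4 above (reference order decides which definitions exist and in what order) can be sketched in isolation. This is a simplified illustration, not the function above: `RefNode`, `referenceOrder`, and `orderDefinitions` are hypothetical names, and list placement, the no-change fast path, and cloning are all left out.

```typescript
// Minimal stand-in node type; real ProseMirror JSON nodes carry more fields.
type RefNode = { type: string; attrs?: { id?: string }; content?: RefNode[] };

// Collect distinct footnoteReference ids in document order (first appearance).
function referenceOrder(
  node: RefNode,
  out: string[] = [],
  seen: Set<string> = new Set<string>(),
): string[] {
  if (node.type === "footnoteReference") {
    const id = node.attrs?.id;
    if (id && !seen.has(id)) {
      seen.add(id);
      out.push(id);
    }
  }
  for (const child of node.content ?? []) referenceOrder(child, out, seen);
  return out;
}

// Order definitions by first reference: the first definition per id wins,
// orphans are never added, and a missing definition is synthesized as empty.
function orderDefinitions(doc: RefNode, defs: RefNode[]): RefNode[] {
  const byId = new Map<string, RefNode>();
  for (const d of defs) {
    const id = d.attrs?.id;
    if (id && !byId.has(id)) byId.set(id, d);
  }
  return referenceOrder(doc).map(
    (id): RefNode =>
      byId.get(id) ?? {
        type: "footnoteDefinition",
        attrs: { id },
        content: [{ type: "paragraph" }],
      },
  );
}

const ref = (id: string): RefNode => ({ type: "footnoteReference", attrs: { id } });
const def = (id: string): RefNode => ({
  type: "footnoteDefinition",
  attrs: { id },
  content: [{ type: "paragraph" }],
});

const body: RefNode = {
  type: "doc",
  content: [{ type: "paragraph", content: [ref("b"), ref("a"), ref("b"), ref("c")] }],
};
// Definitions arrive in definition order, with a duplicate ("a") and an orphan ("zzz").
const ordered = orderDefinitions(body, [def("a"), def("b"), def("a"), def("zzz")]);
const orderedIds = ordered.map((d) => d.attrs?.id);
```

Here the repeated reference to `b` is reuse (one definition), the orphan `zzz` is dropped, and `c` gets an empty definition because none was supplied.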

File diff suppressed because it is too large

View File

@@ -4,3 +4,4 @@ export * from "./footnotes-list";
export * from "./footnote-definition";
export * from "./footnote-numbering";
export * from "./footnote-sync";
export * from "./footnote-canonicalize";

View File

@@ -22,5 +22,11 @@
"noFallthroughCasesInSwitch": false
},
"include": ["src/**/*"],
-"exclude": ["node_modules", "dist", "src/**/*.spec.ts", "src/**/*.test.ts"]
+"exclude": [
+"node_modules",
+"dist",
+"src/**/*.spec.ts",
+"src/**/*.test.ts",
+"src/lib/footnote/footnote-corpus.ts"
+]
}

View File

@@ -7,7 +7,7 @@ import { TiptapTransformer } from "@hocuspocus/transformer";
import * as Y from "yjs";
import WebSocket from "ws";
import { convertProseMirrorToMarkdown } from "./lib/markdown-converter.js";
-import { updatePageContentRealtime, replacePageContent, markdownToProseMirror, mutatePageContent, buildCollabWsUrl, assertYjsEncodable, applyDocToFragment, } from "./lib/collaboration.js";
+import { updatePageContentRealtime, replacePageContent, markdownToProseMirror, markdownToProseMirrorCanonical, mutatePageContent, buildCollabWsUrl, assertYjsEncodable, applyDocToFragment, } from "./lib/collaboration.js";
import { footnoteWarningsField } from "./lib/footnote-analyze.js";
import { buildPageTree } from "./lib/tree.js";
import { serializeDocmostMarkdown, parseDocmostMarkdown, } from "./lib/markdown-document.js";
@@ -17,7 +17,7 @@ import { applyTextEdits, } from "./lib/json-edit.js";
import { getCollabToken, performLogin } from "./lib/auth-utils.js";
import { diffDocs, summarizeChange } from "./lib/diff.js";
import { applyAnchorInDoc, canAnchorInDoc } from "./lib/comment-anchor.js";
-import { blockText, walk, getList, insertMarkerAfter, setCalloutRange, noteItem, mdToInlineNodes, commentsToFootnotes, } from "./lib/transforms.js";
+import { blockText, walk, getList, insertMarkerAfter, setCalloutRange, noteItem, mdToInlineNodes, commentsToFootnotes, canonicalizeFootnotes, insertInlineFootnote, } from "./lib/transforms.js";
import vm from "node:vm";
// Supported image types, kept as two lookup tables so both a local file
// extension and a remote Content-Type can be mapped to the same canonical set.
@@ -1063,10 +1063,15 @@ export class DocmostClient {
// the markdown link path (which TipTap sanitizes), raw JSON could otherwise
// inject javascript:/data: link hrefs or media srcs straight into the doc.
this.validateDocUrls(doc);
// Canonicalize footnotes (idempotent): an agent-authored JSON doc cannot
// leave footnotes out of order, orphaned, or in multiple lists — the bottom
// list + numbering are always derived from reference order. No-op when the
// footnotes are already canonical.
doc = canonicalizeFootnotes(doc);
// Write the BODY first, then the title (#159 split-brain): a failed body
// write (e.g. persist timeout) must not leave a new title over the old body.
const collabToken = await this.getCollabTokenWithReauth();
-const mutation = await replacePageContent(pageId, doc, collabToken, this.apiUrl);
+const mutation = await this.replacePage(pageId, doc, collabToken, this.apiUrl);
// Body persisted successfully — now it is safe to set the title.
if (title) {
await this.client.post("/pages/update", { pageId, title });
@@ -1079,6 +1084,73 @@ export class DocmostClient {
verify: mutation.verify,
};
}
/**
* AUTHOR-INLINE footnote insertion. The agent supplies only WHERE
* (`anchorText`, a snippet of body text to attach the marker after) and WHAT
* (`text`, the footnote content as markdown). Numbering and the bottom
* `footnotesList` are derived deterministically server-side
* (`insertInlineFootnote` -> `canonicalizeFootnotes`): the agent never sees,
* assigns, or edits a footnote number or the list, so it CANNOT desync.
*
* Content DEDUP: when an existing definition has the same content, its id is
* reused (one number, one definition, several references). The write is atomic
* via `mutatePageContent` (single-writer, page-locked); if the anchor text is
* not found the transform aborts with a clear error and no write happens.
*/
async insertFootnote(pageId, anchorText, text) {
await this.ensureAuthenticated();
if (!anchorText || !anchorText.trim()) {
throw new Error("insert_footnote: anchorText is required");
}
if (text == null || `${text}`.trim() === "") {
throw new Error("insert_footnote: text is required");
}
const collabToken = await this.getCollabTokenWithReauth();
let result = null;
const mutation = await this.mutatePage(pageId, collabToken, this.apiUrl, (liveDoc) => {
const r = insertInlineFootnote(liveDoc, { anchorText, text });
if (!r.inserted) {
// Abort the page-locked write by throwing: mutatePageContent does not
// persist when the transform throws, so a missing anchor leaves the
// page untouched (no partial write).
throw new Error(`insert_footnote: anchor text not found: ${JSON.stringify(anchorText.slice(0, 80))}`);
}
result = { footnoteId: r.footnoteId, reused: r.reused };
return r.doc;
});
// The not-found path throws inside the transform (aborting mutatePage), so by
// here `result` is always set.
const r = result;
return {
success: true,
modified: true,
pageId,
footnoteId: r.footnoteId,
reused: r.reused,
message: r.reused
? "Footnote inserted (reused an existing same-content definition)."
: "Footnote inserted.",
verify: mutation.verify,
};
}
/**
* Page-locked write seam over collaboration.mutatePageContent. Production just
* delegates; it exists as an overridable method so the insert_footnote wrapper
* (transform abort-on-not-found + response shaping) can be unit-tested without
* standing up a live Hocuspocus collab socket.
*/
mutatePage(pageId, collabToken, apiUrl, transform) {
return mutatePageContent(pageId, collabToken, apiUrl, transform);
}
/**
* Full-document write seam over collaboration.replacePageContent. Production
* just delegates; it exists as an overridable method so the full-doc write
* tools (update_page_json, copy_page_content) can have their footnote-
* canonicalization binding unit-tested without a live Hocuspocus collab socket.
*/
replacePage(pageId, doc, collabToken, apiUrl) {
return replacePageContent(pageId, doc, collabToken, apiUrl);
}
/**
* Export a page to a single self-contained Docmost-flavoured markdown file:
* meta block + body (with inline comment anchors + diagrams) + comment
@@ -1120,7 +1192,8 @@ export class DocmostClient {
async importPageMarkdown(pageId, fullMarkdown) {
await this.ensureAuthenticated();
const { meta, body, comments } = parseDocmostMarkdown(fullMarkdown);
-const doc = await markdownToProseMirror(body);
+// PAGE import: canonicalize footnotes (see markdownToProseMirrorCanonical).
+const doc = await markdownToProseMirrorCanonical(body);
const collabToken = await this.getCollabTokenWithReauth();
const mutation = await replacePageContent(pageId, doc, collabToken, this.apiUrl);
// Collect distinct comment ids that actually became comment marks in the doc.
@@ -1200,13 +1273,18 @@ export class DocmostClient {
// uses, so copying never lands a javascript:/data: href/src on the target
// (parity with updatePageJson; harmless for already-stored source content).
this.validateDocUrls(content);
// Defense-in-depth (#228): this is a FULL-document write, so canonicalize
// footnotes before copying — a no-op on already-canonical source content, but
// it guarantees a copy can never propagate a non-canonical footnote topology
// to the target (parity with the other full-doc write paths).
const canonical = canonicalizeFootnotes(content);
const collabToken = await this.getCollabTokenWithReauth();
-const mutation = await replacePageContent(targetPageId, content, collabToken, this.apiUrl);
+const mutation = await this.replacePage(targetPageId, canonical, collabToken, this.apiUrl);
return {
success: true,
sourcePageId,
targetPageId,
-copiedNodes: content.content.length,
+copiedNodes: canonical.content.length,
verify: mutation.verify,
};
}
@@ -1613,7 +1691,10 @@ export class DocmostClient {
}
}
}
-// Convert through the full Docmost schema (consistent with page paths)
+// Convert through the full Docmost schema. Deliberately the NON-canonicalizing
+// variant: a comment body may carry a footnote definition with no matching
+// reference, and canonicalization would drop it (data loss). See
+// markdownToProseMirror vs markdownToProseMirrorCanonical.
const jsonContent = await markdownToProseMirror(content);
const payload = {
pageId,
@@ -1701,6 +1782,7 @@ export class DocmostClient {
}
async updateComment(commentId, content) {
await this.ensureAuthenticated();
// NON-canonicalizing on purpose (comment body — see createComment).
const jsonContent = await markdownToProseMirror(content);
await this.client.post("/comments/update", {
commentId,
@@ -2422,6 +2504,8 @@ export class DocmostClient {
noteItem,
mdToInlineNodes,
commentsToFootnotes,
canonicalizeFootnotes,
insertInlineFootnote,
},
};
// Captured oldDoc / newDoc for the diff (set inside runTransform).
@@ -2455,16 +2539,25 @@ export class DocmostClient {
if (typeof fn !== "function") {
throw new Error("transform must evaluate to a function (doc, ctx) => doc");
}
-const result = vm.runInNewContext("f(d, c)", { f: fn, d: sandbox.doc, c: ctx }, { timeout: 5000 });
-if (!result ||
-typeof result !== "object" ||
-result.type !== "doc" ||
-!Array.isArray(result.content)) {
+const raw = vm.runInNewContext("f(d, c)", { f: fn, d: sandbox.doc, c: ctx }, { timeout: 5000 });
+if (!raw ||
+typeof raw !== "object" ||
+raw.type !== "doc" ||
+!Array.isArray(raw.content)) {
throw new Error('transform must return a ProseMirror doc node ({ type:"doc", content:[...] })');
}
-// Validate the returned doc before it can be written.
-this.validateDocStructure(result);
-this.validateDocUrls(result);
+// Validate the RAW transform output FIRST (structure — including the
+// MAX_DEPTH guard — and URLs), mirroring updatePageJson. The canonicalizer
+// recurses without a depth limiter, so validating after it would turn a
+// too-deep doc into an opaque "Maximum call stack size exceeded" instead of
+// the intended "nesting exceeds the maximum depth" error.
+this.validateDocStructure(raw);
+this.validateDocUrls(raw);
+// Auto-canonicalize footnotes after the transform (idempotent): no write
+// path can leave footnotes out of order / orphaned / in a raw `[^id]`
+// block. In a dryRun preview this may surface footnote edits the script
+// author did not write (the canonicalizer tidied them) — that is expected.
+const result = canonicalizeFootnotes(raw);
newDoc = result;
return result;
};
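
The ordering argument in the comment above (validate the raw output first, then run a recursive step) can be illustrated with a small sketch. Everything here is assumed for illustration: `SKETCH_MAX_DEPTH`, `assertDepthWithin`, and `countNodes` are hypothetical, and the real `validateDocStructure` and its `MAX_DEPTH` value are not shown in this diff.

```typescript
type DeepNode = { type: string; content?: DeepNode[] };

// Hypothetical limit for this sketch only.
const SKETCH_MAX_DEPTH = 50;

// Iterative depth check: it uses an explicit stack, so a pathologically deep
// document fails with a descriptive error instead of exhausting the call stack.
function assertDepthWithin(doc: DeepNode, maxDepth: number): void {
  const stack: Array<{ node: DeepNode; depth: number }> = [{ node: doc, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop()!;
    if (depth > maxDepth) {
      throw new Error(`nesting exceeds the maximum depth (${maxDepth})`);
    }
    for (const child of node.content ?? []) stack.push({ node: child, depth: depth + 1 });
  }
}

// Stand-in for any recursive post-processing step (such as a canonicalizer).
function countNodes(node: DeepNode): number {
  let n = 1;
  for (const child of node.content ?? []) n += countNodes(child);
  return n;
}

// Validate the raw document first, and only then run the recursive step.
function validateThenProcess(doc: DeepNode): number {
  assertDepthWithin(doc, SKETCH_MAX_DEPTH);
  return countNodes(doc);
}

// Build a single chain of nodes nested `levels` deep.
function nested(levels: number): DeepNode {
  let node: DeepNode = { type: "paragraph" };
  for (let i = 1; i < levels; i++) node = { type: "blockquote", content: [node] };
  return node;
}

const shallowCount = validateThenProcess(nested(10));
let deepError = "";
try {
  validateThenProcess(nested(SKETCH_MAX_DEPTH + 1));
} catch (e) {
  deepError = (e as Error).message;
}
```

With the guard first, the too-deep document is rejected with the intended message before the recursive step ever runs.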

View File

@@ -637,8 +637,15 @@ export function createDocmostMcpServer(config) {
"mark-safe), setCalloutRange(doc, n) (sync a [1]…[K] callout range to " +
"[1]…[n]), noteItem(inlineNodes) (wrap inline nodes in a listItem with a " +
"fresh id), mdToInlineNodes(markdown) (comment markdown -> inline nodes), " +
-"and commentsToFootnotes(doc, comments, {notesHeading}) (turn inline " +
-"comments into numbered footnotes). Footnote convention: markers are " +
+"commentsToFootnotes(doc, comments, {notesHeading}) (turn inline " +
+"comments into numbered footnotes), canonicalizeFootnotes(doc) (derive " +
+"footnote numbering + the single bottom list from reference order, drop " +
+"orphans/duplicates — runs AUTOMATICALLY on the transform RESULT, so the " +
+"applied (and dryRun-previewed) doc is always footnote-canonical; a dryRun " +
+"diff may therefore show footnote tidy-ups your script did not make, and " +
+"it is idempotent after the first run), and " +
+"insertInlineFootnote(doc, {anchorText, text}) (author-inline footnote: " +
+"marker + dedup'd definition, list derived). Footnote convention: markers are " +
"plain '[N]' text in the body; the notes are an orderedList under a " +
"heading whose text is 'Примечания переводчика'. The transform runs " +
"sandboxed (no require/process/fs/network, 5s timeout) and must return a " +
@@ -652,7 +659,8 @@ export function createDocmostMcpServer(config) {
"parenthesized function). It receives a clone of the live doc and " +
"ctx (comments, log, consume(id), helpers: blockText/walk/getList/" +
"insertMarkerAfter/setCalloutRange/noteItem/mdToInlineNodes/" +
-"commentsToFootnotes) and must return a {type:'doc'} node."),
+"commentsToFootnotes/canonicalizeFootnotes/insertInlineFootnote) " +
+"and must return a {type:'doc'} node."),
dryRun: z
.boolean()
.optional()
@@ -672,6 +680,33 @@ export function createDocmostMcpServer(config) {
});
return jsonContent(result);
});
// Tool: insert_footnote
server.registerTool("insert_footnote", {
description: "Insert an AUTHOR-INLINE footnote: you specify only WHERE (anchorText) " +
"and WHAT (text). The footnote marker is placed right after anchorText in " +
"the body, and the bottom footnotes list + the numbering are derived " +
"deterministically server-side. You do NOT assign a number, and you " +
"never see or edit the footnotes list — so footnotes cannot end up out " +
"of order, orphaned, or as a raw '[^id]' block. If a footnote with the " +
"SAME text already exists, its number is REUSED (one definition, several " +
"references). The write is atomic and won't clobber concurrent edits; if " +
"anchorText is not found, nothing is written and an error is returned.",
inputSchema: {
pageId: z.string().min(1),
anchorText: z
.string()
.min(1)
.describe("A snippet of existing body text; the footnote marker is inserted " +
"immediately after its first occurrence (mark-safe)."),
text: z
.string()
.min(1)
.describe("The footnote content as markdown (becomes the definition)."),
},
}, async ({ pageId, anchorText, text }) => {
const result = await docmostClient.insertFootnote(pageId, anchorText, text);
return jsonContent(result);
});
// Tool: diff_page_versions
registerShared(SHARED_TOOL_SPECS.diffPageVersions, async ({ pageId, from, to }) => {
const result = await docmostClient.diffPageVersions(pageId, from, to);
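
The `insert_footnote` guarantee that nothing is written when `anchorText` is not found relies on aborting the write by throwing inside the transform. A minimal sketch of that contract, assuming a hypothetical in-memory `MemoryPageStore` in place of the real page-locked `mutatePageContent`:

```typescript
type SketchDoc = { type: "doc"; content: unknown[] };
type SketchTransform = (live: SketchDoc) => SketchDoc;

// Hypothetical stand-in for a page-locked writer: it persists only when the
// transform returns, and leaves the stored doc untouched when it throws.
class MemoryPageStore {
  private doc: SketchDoc = { type: "doc", content: [] };
  writes = 0;

  mutate(transform: SketchTransform): SketchDoc {
    // The transform works on a clone, so a throw cannot leave partial edits.
    const next = transform(JSON.parse(JSON.stringify(this.doc)) as SketchDoc);
    this.doc = next; // reached only if the transform did not throw
    this.writes++;
    return next;
  }

  read(): SketchDoc {
    return this.doc;
  }
}

const store = new MemoryPageStore();
store.mutate((live) => ({ ...live, content: [...live.content, "first"] }));

let abortMessage = "";
try {
  store.mutate(() => {
    throw new Error("anchor text not found");
  });
} catch (e) {
  abortMessage = (e as Error).message;
}
```

After the second call the store still holds only the first write, which is the behaviour the tool description promises for a missing anchor.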

View File

@@ -11,6 +11,7 @@ import { docmostExtensions, docmostSchema } from "./docmost-schema.js";
import { withPageLock } from "./page-lock.js";
import { sanitizeForYjs, findUnstorableAttr } from "./node-ops.js";
import { lexFootnoteLines } from "./footnote-lex.js";
import { canonicalizeFootnotes } from "./footnote-canonicalize.js";
import { summarizeChange } from "./diff.js";
/**
* Build the descriptive error for an opaque Yjs encode failure ("Unexpected
@@ -343,7 +344,20 @@ function extractFootnotes(markdown) {
section: `<section data-footnotes>${inner}</section>`,
};
}
-/** Convert markdown to a ProseMirror doc using the full Docmost schema. */
+/**
+* Convert markdown to a ProseMirror doc using the full Docmost schema.
+*
+* This conversion does NOT canonicalize footnotes — it is the shared, content-
+* preserving primitive used by BOTH page write paths and COMMENT bodies
+* (createComment / updateComment). Canonicalization MUST NOT run on a comment
+* body: a comment may legitimately contain a footnote-definition line
+* (`[^1]: text`) with no matching reference, and the canonicalizer drops a
+* reference-less footnotesList — which would silently delete the comment's text.
+*
+* Page write paths that DO need the canonical footnote topology call
+* `markdownToProseMirrorCanonical` instead (markdown import, update_page markdown
+* path). Keep this function loss-free: never drop a reference-less definition.
+*/
export async function markdownToProseMirror(markdownContent) {
const withCallouts = await preprocessCallouts(markdownContent);
const { body, section } = extractFootnotes(withCallouts);
@@ -351,6 +365,20 @@ export async function markdownToProseMirror(markdownContent) {
const bridged = bridgeTaskLists(html);
return generateJSON(bridged, docmostExtensions);
}
/**
* Page-write variant of `markdownToProseMirror`: converts markdown then enforces
* the canonical footnote topology. The footnote `section` markdown is emitted in
* DEFINITION order, but numbering derives from REFERENCE order, so without this
* the bottom list renders out of order (`1, 4, 2, 3, …`); orphan definitions and
* duplicate lists are also normalized. Idempotent — a no-op once canonical, and a
* no-op for footnote-free content.
*
* Use this ONLY for full-document PAGE writes (never for comment bodies, where it
* would drop a reference-less footnote definition — see `markdownToProseMirror`).
*/
export async function markdownToProseMirrorCanonical(markdownContent) {
return canonicalizeFootnotes(await markdownToProseMirror(markdownContent));
}
/**
* Build the collaboration WebSocket URL from an API base URL:
* switch http(s)->ws(s), strip a trailing /api, mount on /collab.
@@ -708,6 +736,8 @@ export async function replacePageContent(pageId, prosemirrorDoc, collabToken, ba
* Tables and :::callout::: blocks survive thanks to the full schema.
*/
export async function updatePageContentRealtime(pageId, markdownContent, collabToken, baseUrl) {
-const tiptapJson = await markdownToProseMirror(markdownContent);
+// PAGE write: canonicalize footnotes (markdown import builds the bottom list in
+// definition order; numbering is reference-ordered).
+const tiptapJson = await markdownToProseMirrorCanonical(markdownContent);
return await mutatePageContent(pageId, collabToken, baseUrl, () => tiptapJson);
}
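
Why comment bodies must stay on the non-canonicalizing path can be shown with a reduced model of the "no references, no list" rule. This sketch implements only that one rule; `NoteNode`, `hasReference`, and `dropListsWhenUnreferenced` are hypothetical names, not the real canonicalizer.

```typescript
type NoteNode = {
  type: string;
  text?: string;
  attrs?: { id?: string };
  content?: NoteNode[];
};

// True if any footnoteReference exists anywhere in the tree.
function hasReference(node: NoteNode): boolean {
  if (node.type === "footnoteReference") return true;
  return (node.content ?? []).some((child) => hasReference(child));
}

// Reduced stand-in for the "no references -> no footnotesList" rule only.
function dropListsWhenUnreferenced(doc: NoteNode): NoteNode {
  if (hasReference(doc)) return doc;
  const strip = (n: NoteNode): NoteNode =>
    n.content
      ? { ...n, content: n.content.filter((c) => c.type !== "footnotesList").map((c) => strip(c)) }
      : n;
  return strip(doc);
}

// A comment body holding a definition line with no matching reference.
const commentBody: NoteNode = {
  type: "doc",
  content: [
    { type: "paragraph", content: [{ type: "text", text: "see note" }] },
    {
      type: "footnotesList",
      content: [
        {
          type: "footnoteDefinition",
          attrs: { id: "1" },
          content: [{ type: "paragraph", content: [{ type: "text", text: "important detail" }] }],
        },
      ],
    },
  ],
};

const beforeJson = JSON.stringify(commentBody);
const afterJson = JSON.stringify(dropListsWhenUnreferenced(commentBody));
```

The definition text survives in `beforeJson` but is gone from `afterJson`, which is the data loss the comment path avoids by calling the plain conversion.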

View File

@@ -0,0 +1,88 @@
/**
* Inline-authoring helpers for footnotes (MCP).
*
* These build/identify footnote DEFINITION nodes for the author-inline tool
* (`insertInlineFootnote` in transforms.ts): a content key to de-duplicate notes
* by text, a definition-node factory, and a fresh uuidv7-style id generator.
*
* Split out of `footnote-canonicalize.ts` so that module stays a pure MIRROR of
* the editor-ext canonicalizer (compositionally symmetric to the editor-ext
* copy, which keeps its authoring helpers in `footnote-util.ts`). The pure
* canonicalizer has no dependency on these.
*/
const FOOTNOTE_DEFINITION_NAME = "footnoteDefinition";
function cloneJson(v) {
if (typeof structuredClone === "function")
return structuredClone(v);
return JSON.parse(JSON.stringify(v));
}
/**
* Normalized content key for de-duplicating footnote DEFINITIONS by their text.
*
* Two definitions with the same key are the SAME footnote — so the inline
* authoring tool reuses one id (one number, one definition, several references)
* instead of minting a second definition. Key = each text run followed by its
* mark types (sorted, comma-joined), concatenated in document order and then
* whitespace-collapsed and trimmed, so two notes that read the same but differ
* in formatting (one bold, one plain) are NOT merged. Only an exact match merges.
*/
export function footnoteContentKey(defNode) {
const parts = [];
const visit = (n) => {
if (!n || typeof n !== "object")
return;
if (n.type === "text" && typeof n.text === "string") {
const marks = Array.isArray(n.marks)
? n.marks.map((m) => m?.type).filter(Boolean).sort().join(",")
: "";
parts.push(`${n.text}${marks}`);
}
if (Array.isArray(n.content))
for (const c of n.content)
visit(c);
};
visit(defNode);
// Collapse the assembled text's whitespace and trim, keeping the mark
// signature attached so formatting differences still distinguish notes.
return parts
.join("")
.replace(/[ \t\r\n]+/g, " ")
.trim();
}
/**
* Build a footnoteDefinition node from inline ProseMirror nodes, keyed by id.
*/
export function makeFootnoteDefinition(id, inlineNodes) {
const content = Array.isArray(inlineNodes) ? cloneJson(inlineNodes) : [];
return {
type: FOOTNOTE_DEFINITION_NAME,
attrs: { id },
content: [{ type: "paragraph", content }],
};
}
/**
* Generate a uuidv7-style id (time-ordered), matching editor-ext's
* `generateFootnoteId`. Used for a genuinely-new inline footnote id.
*/
export function generateFootnoteId() {
const now = Date.now();
const timeHex = now.toString(16).padStart(12, "0");
const rand = (length) => {
let s = "";
for (let i = 0; i < length; i++)
s += Math.floor(Math.random() * 16).toString(16);
return s;
};
const versioned = "7" + rand(3);
const variantNibble = (8 + Math.floor(Math.random() * 4)).toString(16);
const variant = variantNibble + rand(3);
return (timeHex.slice(0, 8) +
"-" +
timeHex.slice(8, 12) +
"-" +
versioned +
"-" +
variant +
"-" +
rand(12));
}
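
The id layout produced above (8-4-4-4-12 hex groups, a 48-bit millisecond timestamp, version nibble `7`, variant nibble in `8..b`) is easier to check with a deterministic variant. In this sketch the clock and the random source are injected; that injection and the name `uuidV7Like` are assumptions for illustration, since the function above reads `Date.now()` and `Math.random()` directly.

```typescript
// Deterministic sketch of the same layout with injected time and randomness.
function uuidV7Like(nowMs: number, randomNibble: () => number): string {
  // 48-bit millisecond timestamp as 12 hex digits.
  const timeHex = nowMs.toString(16).padStart(12, "0");
  const rand = (length: number): string => {
    let s = "";
    for (let i = 0; i < length; i++) s += (randomNibble() & 0xf).toString(16);
    return s;
  };
  // Version group: nibble 7 followed by 12 random bits.
  const versioned = "7" + rand(3);
  // Variant group: first nibble is one of 8, 9, a, b (binary 10xx).
  const variant = (8 + (randomNibble() & 0x3)).toString(16) + rand(3);
  return [timeHex.slice(0, 8), timeHex.slice(8, 12), versioned, variant, rand(12)].join("-");
}

// Fixed inputs make the layout visible: the timestamp fills the first two groups.
const sampleId = uuidV7Like(0x0123456789ab, () => 0);

// With real inputs the shape is the same, only the digits vary.
const liveId = uuidV7Like(Date.now(), () => Math.floor(Math.random() * 16));
const uuidV7Shape = /^[0-9a-f]{8}-[0-9a-f]{4}-7[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;
```

Injecting the two sources keeps the sketch testable; it does not change the resulting format.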

View File

@@ -0,0 +1,215 @@
/**
* Server-side footnote canonicalizer (MCP mirror — PURE).
*
* `canonicalizeFootnotes(doc)` is a pure ProseMirror-JSON port of the editor's
* `footnoteSyncPlugin` end-state, identical in behaviour to
* `@docmost/editor-ext`'s `canonicalizeFootnotes`. It is mirrored here — rather
* than imported from editor-ext — for the SAME reason `footnote-lex.ts` and the
* `docmost-schema.ts` nodes are mirrored: the MCP package is deliberately
* decoupled from the browser/React-heavy editor barrel and operates on plain
* JSON. The editor-ext copy owns the golden test against the live plugin; this
* copy must stay behaviourally identical (a SHARED golden corpus, exercised by
* both test suites, pins that — see `test/unit/footnote-corpus.mjs`).
*
* This module is the pure MIRROR only. The inline-authoring helpers
* (`footnoteContentKey`, `makeFootnoteDefinition`, `generateFootnoteId`) used by
* `insertInlineFootnote` live in the sibling `footnote-authoring.ts`, so this
* file is compositionally symmetric to the editor-ext copy.
*
* Why it exists: every NON-editor write path (markdown import, update_page_json,
* docmost_transform, insert_footnote) builds ProseMirror JSON directly, so the
* editor's footnote plugins never run and the canonical topology (sequential
* numbering by first reference, one trailing list, no orphans, no raw `[^id]`)
* was never enforced. Running this at the end of every write path closes that
* gap; because it is idempotent, it is a no-op when the footnotes are already
* canonical (no spurious mutations / git-sync churn).
*
* ENFORCEMENT RULE (#228): any NEW FULL-document persist path MUST call
* `canonicalizeFootnotes(doc)` before writing — the current callers are
* `markdownToProseMirrorCanonical` (page markdown import/update; the plain
* `markdownToProseMirror` used for COMMENT bodies must NOT, or it would drop a
* reference-less definition), `update_page_json`, `docmost_transform`,
* `insert_footnote`, and `copy_page_content`. Append/prepend FRAGMENT writes MUST
* NOT canonicalize. This is deliberately per-call-site (the replace-vs-fragment
* and comment-vs-page nuances make a single naive wrapper unsafe).
*/
const FOOTNOTE_REFERENCE_NAME = "footnoteReference";
const FOOTNOTES_LIST_NAME = "footnotesList";
const FOOTNOTE_DEFINITION_NAME = "footnoteDefinition";
function cloneJson(v) {
if (typeof structuredClone === "function")
return structuredClone(v);
return JSON.parse(JSON.stringify(v));
}
function isEmptyParagraph(node) {
return (!!node &&
node.type === "paragraph" &&
(!Array.isArray(node.content) || node.content.length === 0));
}
function collectReferenceIds(node, out, seen) {
if (!node || typeof node !== "object")
return;
if (node.type === FOOTNOTE_REFERENCE_NAME) {
const id = node?.attrs?.id;
if (id && !seen.has(id)) {
seen.add(id);
out.push(id);
}
}
if (Array.isArray(node.content)) {
for (const child of node.content)
collectReferenceIds(child, out, seen);
}
}
function collectDefinitions(node, out) {
if (!node || typeof node !== "object")
return;
if (node.type === FOOTNOTE_DEFINITION_NAME)
out.push(node);
if (Array.isArray(node.content)) {
for (const child of node.content)
collectDefinitions(child, out);
}
}
function emptyDefinition(id) {
return {
type: FOOTNOTE_DEFINITION_NAME,
attrs: { id },
content: [{ type: "paragraph" }],
};
}
/**
* Deep equality over plain JSON: arrays are compared POSITIONALLY
* (order-SENSITIVE), object keys order-insensitively. The array order-sensitivity
* is required for correctness here — a reordered `footnotesList.content` must
* compare UNEQUAL so the canonical rebuild fires instead of leaving it in place.
*/
function deepEqualJson(a, b) {
if (a === b)
return true;
if (a == null || b == null || typeof a !== typeof b)
return false;
if (Array.isArray(a) || Array.isArray(b)) {
if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) {
return false;
}
for (let i = 0; i < a.length; i++) {
if (!deepEqualJson(a[i], b[i]))
return false;
}
return true;
}
if (typeof a === "object") {
const ka = Object.keys(a);
const kb = Object.keys(b);
if (ka.length !== kb.length)
return false;
for (const k of ka) {
if (!Object.prototype.hasOwnProperty.call(b, k))
return false;
if (!deepEqualJson(a[k], b[k]))
return false;
}
return true;
}
return false;
}
/**
* Canonicalize footnotes in a ProseMirror-JSON document. See the file header and
* the editor-ext twin for the full contract. Pure (deep-clones input,
* deterministic, idempotent); a value without a `content` array is returned
* as-is, uncloned.
*/
export function canonicalizeFootnotes(doc) {
if (doc == null ||
typeof doc !== "object" ||
!Array.isArray(doc.content)) {
return doc;
}
const out = cloneJson(doc);
// 1) Distinct reference ids in document order (deep — refs can live in
// callouts, tables, list items, ...). The ordering/numbering truth.
const referenceIds = [];
collectReferenceIds(out, referenceIds, new Set());
// 2) Every definition node in document order (deep).
const defNodes = [];
collectDefinitions(out, defNodes);
// 3) First definition per id wins; later duplicates carry the SAME id, so they
// cannot be referenced separately and would be orphans — they are dropped.
const defById = new Map();
for (const d of defNodes) {
const id = d?.attrs?.id;
if (id && !defById.has(id))
defById.set(id, d);
}
// 4) Build the ordered definition list: one per referenced id, in REFERENCE
// order, reusing the existing node (shallow-copied, id normalized — `out` is
// already deep-cloned and the old lists are cut) or synthesizing an empty
// one. Definitions whose id is not referenced are orphans and never added.
const orderedDefs = [];
for (const id of referenceIds) {
const existing = defById.get(id);
if (existing) {
orderedDefs.push({
...existing,
attrs: { ...(existing.attrs ?? {}), id },
});
}
else {
orderedDefs.push(emptyDefinition(id));
}
}
// 5) No references -> there must be NO list at all (at any depth).
if (referenceIds.length === 0) {
stripFootnotesListsDeep(out);
return out;
}
// 6) Placement parity with the live plugin: when the document is ALREADY in the
// canonical single-list state, leave that list exactly where it sits rather
// than cutting and re-inserting it at the end (the plugin never repositions a
// sole correct list, so moving it would silently reorder any content that
// follows the list on the first write).
const topLevelLists = out.content.filter((n) => n && n.type === FOOTNOTES_LIST_NAME);
if (topLevelLists.length === 1 &&
defNodes.length === orderedDefs.length &&
deepEqualJson(topLevelLists[0].content, orderedDefs)) {
return out;
}
// 7) Otherwise rebuild: strip every footnotesList AND every bare
// footnoteDefinition at ANY depth (collectDefinitions gathers defs
// recursively, so a list nested in a callout/blockquote — or a bare
// definition outside any list — would otherwise have its defs copied into the
// rebuilt list while the original survives in place → duplicates) and
// re-insert exactly one list after the last meaningful (non-empty paragraph)
// top-level block.
stripFootnotesListsDeep(out);
stripFootnoteDefinitionsDeep(out);
const top = out.content;
let insertAt = top.length;
while (insertAt > 0 && isEmptyParagraph(top[insertAt - 1]))
insertAt--;
top.splice(insertAt, 0, { type: FOOTNOTES_LIST_NAME, content: orderedDefs });
out.content = top;
return out;
}
/** Remove every `footnotesList` node at ANY depth (mutates the given clone). */
function stripFootnotesListsDeep(node) {
if (!node || typeof node !== "object" || !Array.isArray(node.content))
return;
node.content = node.content.filter((c) => !(c && c.type === FOOTNOTES_LIST_NAME));
for (const child of node.content)
stripFootnotesListsDeep(child);
}
/**
* Remove every BARE `footnoteDefinition` node at ANY depth (mutates the given
* clone). Runs only in the rebuild path AFTER the lists are stripped, so it
* targets definitions that were sitting outside a list (e.g. hand-authored via a
* raw-JSON write path and nested in a callout); their content was already copied
* into the rebuilt list, so leaving the originals would duplicate them.
*/
function stripFootnoteDefinitionsDeep(node) {
if (!node || typeof node !== "object" || !Array.isArray(node.content))
return;
node.content = node.content.filter((c) => !(c && c.type === FOOTNOTE_DEFINITION_NAME));
for (const child of node.content)
stripFootnoteDefinitionsDeep(child);
}

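The reference-order rule described in steps 1 to 4 of `canonicalizeFootnotes` can be illustrated with a much smaller sketch. This is not the module's implementation: `PMNode`, `referenceOrder`, `firstDefinitions`, and `orderedDefinitions` are hypothetical names introduced here, and the placement, stripping, and parity logic of steps 5 to 7 is deliberately left out.

```typescript
// Simplified sketch of the reference-order rule; NOT the module's code.
// All names below are hypothetical and exist only for this illustration.
type PMNode = { type: string; attrs?: { id?: string }; content?: PMNode[] };

// Distinct footnoteReference ids in document order (depth-first).
function referenceOrder(node: PMNode, out: string[] = []): string[] {
  const id = node.attrs?.id;
  if (node.type === "footnoteReference" && id && !out.includes(id)) out.push(id);
  for (const child of node.content ?? []) referenceOrder(child, out);
  return out;
}

// Every footnoteDefinition at any depth; the first definition per id wins.
function firstDefinitions(
  node: PMNode,
  out: Map<string, PMNode> = new Map<string, PMNode>(),
): Map<string, PMNode> {
  const id = node.attrs?.id;
  if (node.type === "footnoteDefinition" && id && !out.has(id)) out.set(id, node);
  for (const child of node.content ?? []) firstDefinitions(child, out);
  return out;
}

// One definition per referenced id, in reference order. Orphans are never
// emitted; a referenced id without a definition gets an empty one.
function orderedDefinitions(doc: PMNode): PMNode[] {
  const defs = firstDefinitions(doc);
  return referenceOrder(doc).map((id): PMNode => {
    const existing = defs.get(id);
    if (existing) return existing;
    return { type: "footnoteDefinition", attrs: { id }, content: [{ type: "paragraph" }] };
  });
}

const fnRef = (id: string): PMNode => ({ type: "footnoteReference", attrs: { id } });
const fnDef = (id: string): PMNode => ({
  type: "footnoteDefinition",
  attrs: { id },
  content: [{ type: "paragraph" }],
});

// The body references b, a, b again, then c (which has no definition).
// The stored list is in definition order and carries an orphan "x".
const sampleDoc: PMNode = {
  type: "doc",
  content: [
    { type: "paragraph", content: [fnRef("b"), fnRef("a"), fnRef("b"), fnRef("c")] },
    { type: "footnotesList", content: [fnDef("a"), fnDef("x"), fnDef("b")] },
  ],
};

const orderedIds = orderedDefinitions(sampleDoc).map((d) => d.attrs?.id);
console.log(orderedIds);
```

The same sketch shows why comment bodies must skip canonicalization: a document that holds only a definition and no reference yields an empty ordered list, so the definition's text would be discarded.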

@@ -14,6 +14,9 @@
* - `marks` arrays are preserved verbatim when fragments are split/reordered.
*/
import { blockPlainText } from "./node-ops.js";
import { canonicalizeFootnotes } from "./footnote-canonicalize.js";
import { footnoteContentKey, makeFootnoteDefinition, generateFootnoteId, } from "./footnote-authoring.js";
export { canonicalizeFootnotes } from "./footnote-canonicalize.js";
/** Deep-clone a JSON-serializable value without mutating the original. */
function clone(value) {
if (typeof structuredClone === "function") {
@@ -64,6 +67,36 @@ export function getList(doc, predicate) {
});
return found;
}
/**
* Textblocks that hold raw text but do NOT accept inline atom nodes. A
* `footnoteReference` is `group:"inline", atom:true`; `codeBlock` is
* `content:"text*"` (text only), so splicing a footnoteReference into it yields
* an invalid document. (paragraph/heading/detailsSummary are `inline*` and DO
* accept it; footnote definitions live inside a footnotesList which the
* footnote inserter excludes via `beforeBlock`.)
*/
const INLINE_ATOM_FORBIDDEN_BLOCKS = new Set(["codeBlock"]);
/**
* Footnote-notes subtrees the inline footnote inserter must never split into (at
* any depth): a `footnotesList` and the `footnoteDefinition`s it holds. A
* reference anchored inside one of these would later be dropped as an orphan by
* the canonicalizer, taking the existing definition's text with it.
*/
const FOOTNOTE_NOTES_SUBTREES = new Set([
"footnotesList",
"footnoteDefinition",
]);
/** True if `node` IS, or contains at any depth, a footnotesList/footnoteDefinition. */
function containsFootnoteNotes(node) {
if (!isObject(node))
return false;
if (FOOTNOTE_NOTES_SUBTREES.has(node.type))
return true;
if (Array.isArray(node.content)) {
return node.content.some((c) => containsFootnoteNotes(c));
}
return false;
}
/**
* Insert `marker` as a PLAIN (unmarked) text run right after the first
* occurrence of `anchor`.
@@ -83,6 +116,19 @@ export function getList(doc, predicate) {
* false when the anchor text was not found in any in-scope block.
*/
export function insertMarkerAfter(doc, anchor, marker, opts = {}) {
// A plain marker is a leading-space-padded unmarked text run.
return insertNodesAfterAnchor(doc, anchor, () => [{ type: "text", text: " " + marker }], opts);
}
/**
* Mark-safe insertion CORE: split the inline text run that holds the END of
* `anchor` (preserving the surrounding marks) and splice the nodes produced by
* `makeMiddle()` in at the split point. `insertMarkerAfter` (plain text marker)
* and `insertInlineFootnote` (a `footnoteReference` node) are both thin callers —
* the only difference is WHAT is inserted (a space-padded text run vs. a node
* that should hug the preceding word), which is exactly what `makeMiddle`
* decides. Operates on a clone; returns `{ doc, inserted }`.
*/
function insertNodesAfterAnchor(doc, anchor, makeMiddle, opts = {}) {
const out = clone(doc);
if (!isObject(out) || !Array.isArray(out.content) || !anchor) {
return { doc: out, inserted: false };
@@ -111,10 +157,25 @@ export function insertMarkerAfter(doc, anchor, marker, opts = {}) {
if (inserted || !isObject(container) || !Array.isArray(container.content)) {
return;
}
// Skip a forbidden subtree entirely (e.g. footnotesList/footnoteDefinition):
// never split into it, but keep `offset` aligned for any sibling text after
// it within this block.
if (opts.skipSubtreeTypes && opts.skipSubtreeTypes.has(container.type)) {
offset += blockPlainText(container).length;
return;
}
const inline = container.content;
// Detect whether this array is an inline array (contains text nodes).
const hasText = inline.some((n) => isObject(n) && n.type === "text");
if (hasText) {
// Refuse a textblock whose content spec cannot hold the inserted nodes
// (e.g. a codeBlock for an inline atom). Keep `offset` aligned for any
// sibling textblocks in this same block, then bail so the search falls
// through to the next candidate block.
if (opts.forbidBlockTypes && opts.forbidBlockTypes.has(container.type)) {
offset += blockPlainText(container).length;
return;
}
for (let i = 0; i < inline.length; i++) {
const n = inline[i];
const len = isObject(n) ? blockPlainText(n).length : 0;
@@ -136,8 +197,9 @@ export function insertMarkerAfter(doc, anchor, marker, opts = {}) {
if (before.length > 0) {
parts.push({ ...n, text: before, marks: [...marks] });
}
// Marker is a PLAIN run: no marks copied. Leading space separates it.
parts.push({ type: "text", text: " " + marker });
// The inserted nodes are caller-decided (a space-padded marker run,
// or a node that hugs the word). They carry no copied marks.
parts.push(...makeMiddle());
if (after.length > 0) {
parts.push({ ...n, text: after, marks: [...marks] });
}
@@ -227,14 +289,16 @@ export function noteItem(inlineNodes) {
* Wrap inline ProseMirror nodes in a real footnoteDefinition node keyed by id:
* { type:"footnoteDefinition", attrs:{id}, content:[{ type:"paragraph", content }] }
* (mirrors the editor-ext / docmost-schema FootnoteDefinition node).
*
* Built on the shared `makeFootnoteDefinition` factory (footnote-authoring.ts);
* the only extra is a fresh block id on the inner paragraph (Docmost stamps one,
* and the canonicalizer preserves attrs as-is). Single factory, one place to
* change the definition shape.
*/
export function footnoteDefinition(id, inlineNodes) {
const content = Array.isArray(inlineNodes) ? clone(inlineNodes) : [];
return {
type: "footnoteDefinition",
attrs: { id },
content: [{ type: "paragraph", attrs: { id: freshId() }, content }],
};
const node = makeFootnoteDefinition(id, inlineNodes);
node.content[0].attrs = { id: freshId() };
return node;
}
/**
* Replace every `[N]` body marker and `\u0000FN<i>\u0000` comment placeholder in
@@ -471,3 +535,97 @@ export function commentsToFootnotes(doc, comments, opts = {}) {
const synced = setCalloutRange(working, definitions.length);
return { doc: synced.doc, consumed };
}
/**
* AUTHOR-INLINE footnote insertion. The caller supplies WHERE (anchorText) and
* WHAT (markdown text); numbering and the bottom list are derived server-side by
* `canonicalizeFootnotes`. The caller never sees or edits `footnotesList`, never
* assigns a number, and cannot desync — orphans / out-of-order lists / raw
* `[^id]` markdown are structurally impossible.
*
* Content DEDUP (#3 in the issue): if an existing definition has the SAME
* normalized content key, its id is REUSED (the new reference points at it: one
* number, one definition, several references). Otherwise a fresh uuid id is
* minted and a new definition added. Conservative — only an exact content match
* merges.
*
* Mechanics: the `footnoteReference` node is inserted DIRECTLY at the anchor via
* the same mark-safe split as `insertMarkerAfter` (the shared
* `insertNodesAfterAnchor` core), so it hugs the preceding word with no text
* sentinel round-trip. The whole document is then canonicalized.
*
* Operates on a clone of `doc`. When the anchor is not found, returns an
* unmodified clone of the input with `inserted:false` (`footnoteId` and `reused`
* are still populated).
*/
export function insertInlineFootnote(doc, opts) {
const inline = mdToInlineNodes(opts.text ?? "");
// footnoteContentKey only reads `.content`, so key off the inline array
// directly instead of building a throwaway definition node.
const key = footnoteContentKey({ content: inline });
// Content dedup: reuse an existing definition's id when its key matches.
let footnoteId = null;
let reused = false;
if (key !== "") {
walk(doc, (n) => {
if (footnoteId == null &&
isObject(n) &&
n.type === "footnoteDefinition" &&
n.attrs &&
typeof n.attrs.id === "string" &&
n.attrs.id !== "" &&
footnoteContentKey(n) === key) {
footnoteId = n.attrs.id;
reused = true;
}
});
}
if (footnoteId == null)
footnoteId = generateFootnoteId();
// Insert the footnoteReference node directly after the anchor (mark-safe
// split); it hugs the preceding word with no leading space. Three guards keep
// the inline atom out of the notes section and out of blocks that cannot hold it:
// - beforeBlock bounds the search to the BODY, before the first top-level block
// that IS or CONTAINS (at any depth) a footnotesList/footnoteDefinition, so
// a NESTED list or a bare definition also bounds the search, not just a
// top-level list;
// - skipSubtreeTypes refuses to descend into any footnotesList/footnoteDefinition
// subtree, so a reference is never glued inside an existing definition (which
// the canonicalizer would then drop as an orphan, losing that definition's
// prose);
// - forbidBlockTypes refuses codeBlocks (an inline atom there is a
// schema-invalid doc, and insert_footnote skips validateDocStructure).
// When the only anchor match is in such a place, the insert is refused and the
// write aborts cleanly (inserted:false) instead of destroying content.
const boundaryIdx = Array.isArray(doc?.content)
? doc.content.findIndex((n) => containsFootnoteNotes(n))
: -1;
const r = insertNodesAfterAnchor(doc, (opts.anchorText ?? "").trimEnd(), () => [{ type: "footnoteReference", attrs: { id: footnoteId } }], {
...(boundaryIdx >= 0 ? { beforeBlock: boundaryIdx } : {}),
forbidBlockTypes: INLINE_ATOM_FORBIDDEN_BLOCKS,
skipSubtreeTypes: FOOTNOTE_NOTES_SUBTREES,
});
if (!r.inserted) {
return { doc: clone(doc), inserted: false, footnoteId, reused };
}
let working = r.doc;
// Add a NEW definition (canonicalize will order/place it); a reused id needs
// no new definition (the existing one is shared).
if (!reused) {
appendDefinition(working, makeFootnoteDefinition(footnoteId, inline));
}
// Derive numbering + the single bottom list deterministically.
working = canonicalizeFootnotes(working);
return { doc: working, inserted: true, footnoteId, reused };
}
/**
* Append a definition node so the canonicalizer can order/place it: into the
* first existing footnotesList, or a new trailing list when none exists.
*/
function appendDefinition(doc, defNode) {
const existingList = getList(doc, (n) => isObject(n) && n.type === "footnotesList");
if (existingList && Array.isArray(existingList.content)) {
existingList.content.push(defNode);
return;
}
if (Array.isArray(doc.content)) {
doc.content.push({ type: "footnotesList", content: [defNode] });
}
}

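The mark-safe split at the heart of `insertNodesAfterAnchor` can be shown in isolation. The sketch below assumes the offset inside the text run has already been located; `InlineNode` and `splitRunAt` are hypothetical names used only for this illustration, and the block and subtree guards are omitted.

```typescript
// Minimal sketch of a mark-safe split; names are hypothetical, not from the source.
type Mark = { type: string };
type InlineNode = {
  type: string;
  text?: string;
  marks?: Mark[];
  attrs?: Record<string, unknown>;
};

// Split `run` at `offset` and place `middle` between the two halves.
// Both halves keep a copy of the run's marks; the inserted nodes carry none.
function splitRunAt(run: InlineNode, offset: number, middle: InlineNode[]): InlineNode[] {
  const text = run.text ?? "";
  const before = text.slice(0, offset);
  const after = text.slice(offset);
  const marks = run.marks ?? [];
  const parts: InlineNode[] = [];
  if (before.length > 0) parts.push({ ...run, text: before, marks: [...marks] });
  parts.push(...middle);
  if (after.length > 0) parts.push({ ...run, text: after, marks: [...marks] });
  return parts;
}

// A bold run; the anchor "Knuth" ends at offset 9 ("see Knuth".length).
const boldRun: InlineNode = {
  type: "text",
  text: "see Knuth for details",
  marks: [{ type: "bold" }],
};
const splitParts = splitRunAt(boldRun, "see Knuth".length, [
  { type: "footnoteReference", attrs: { id: "example-id" } },
]);
console.log(splitParts.map((p) => p.text ?? `<${p.type}>`));
```

Copying the marks onto both halves is what keeps the surrounding formatting intact, while the reference node itself stays unmarked and sits directly against the preceding word.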

@@ -17,6 +17,7 @@ import {
updatePageContentRealtime,
replacePageContent,
markdownToProseMirror,
markdownToProseMirrorCanonical,
mutatePageContent,
buildCollabWsUrl,
assertYjsEncodable,
@@ -60,6 +61,8 @@ import {
noteItem,
mdToInlineNodes,
commentsToFootnotes,
canonicalizeFootnotes,
insertInlineFootnote,
} from "./lib/transforms.js";
import vm from "node:vm";
@@ -1344,10 +1347,16 @@ export class DocmostClient {
// inject javascript:/data: link hrefs or media srcs straight into the doc.
this.validateDocUrls(doc);
// Canonicalize footnotes (idempotent): an agent-authored JSON doc cannot
// leave footnotes out of order, orphaned, or in multiple lists — the bottom
// list + numbering are always derived from reference order. No-op when the
// footnotes are already canonical.
doc = canonicalizeFootnotes(doc);
// Write the BODY first, then the title (#159 split-brain): a failed body
// write (e.g. persist timeout) must not leave a new title over the old body.
const collabToken = await this.getCollabTokenWithReauth();
const mutation = await replacePageContent(
const mutation = await this.replacePage(
pageId,
doc,
collabToken,
@@ -1368,6 +1377,95 @@ export class DocmostClient {
};
}
/**
* AUTHOR-INLINE footnote insertion. The agent supplies only WHERE
* (`anchorText`, a snippet of body text to attach the marker after) and WHAT
* (`text`, the footnote content as markdown). Numbering and the bottom
* `footnotesList` are derived deterministically server-side
* (`insertInlineFootnote` -> `canonicalizeFootnotes`): the agent never sees,
* assigns, or edits a footnote number or the list, so it CANNOT desync.
*
* Content DEDUP: when an existing definition has the same content, its id is
* reused (one number, one definition, several references). The write is atomic
* via `mutatePageContent` (single-writer, page-locked); if the anchor text is
* not found the transform aborts with a clear error and no write happens.
*/
async insertFootnote(pageId: string, anchorText: string, text: string) {
await this.ensureAuthenticated();
if (!anchorText || !anchorText.trim()) {
throw new Error("insert_footnote: anchorText is required");
}
if (text == null || `${text}`.trim() === "") {
throw new Error("insert_footnote: text is required");
}
const collabToken = await this.getCollabTokenWithReauth();
let result: { footnoteId: string; reused: boolean } | null = null;
const mutation = await this.mutatePage(
pageId,
collabToken,
this.apiUrl,
(liveDoc: any) => {
const r = insertInlineFootnote(liveDoc, { anchorText, text });
if (!r.inserted) {
// Abort the page-locked write by throwing: mutatePageContent does not
// persist when the transform throws, so a missing anchor leaves the
// page untouched (no partial write).
throw new Error(
`insert_footnote: anchor text not found: ${JSON.stringify(
anchorText.slice(0, 80),
)}`,
);
}
result = { footnoteId: r.footnoteId, reused: r.reused };
return r.doc;
},
);
// The not-found path throws inside the transform (aborting mutatePage), so by
// here `result` is always set.
const r = result!;
return {
success: true,
modified: true,
pageId,
footnoteId: r.footnoteId,
reused: r.reused,
message: r.reused
? "Footnote inserted (reused an existing same-content definition)."
: "Footnote inserted.",
verify: mutation.verify,
};
}
/**
* Page-locked write seam over collaboration.mutatePageContent. Production just
* delegates; it exists as an overridable method so the insert_footnote wrapper
* (transform abort-on-not-found + response shaping) can be unit-tested without
* standing up a live Hocuspocus collab socket.
*/
protected mutatePage(
pageId: string,
collabToken: string,
apiUrl: string,
transform: (doc: any) => any,
): Promise<{ doc?: any; verify?: any }> {
return mutatePageContent(pageId, collabToken, apiUrl, transform);
}
/**
* Full-document write seam over collaboration.replacePageContent. Production
* just delegates; it exists as an overridable method so the full-doc write
* tools (update_page_json, copy_page_content) can have their footnote-
* canonicalization binding unit-tested without a live Hocuspocus collab socket.
*/
protected replacePage(
pageId: string,
doc: any,
collabToken: string,
apiUrl: string,
): Promise<{ doc?: any; verify?: any }> {
return replacePageContent(pageId, doc, collabToken, apiUrl);
}
/**
* Export a page to a single self-contained Docmost-flavoured markdown file:
* meta block + body (with inline comment anchors + diagrams) + comment
@@ -1408,7 +1506,8 @@ export class DocmostClient {
async importPageMarkdown(pageId: string, fullMarkdown: string): Promise<any> {
await this.ensureAuthenticated();
const { meta, body, comments } = parseDocmostMarkdown(fullMarkdown);
const doc = await markdownToProseMirror(body);
// PAGE import: canonicalize footnotes (see markdownToProseMirrorCanonical).
const doc = await markdownToProseMirrorCanonical(body);
const collabToken = await this.getCollabTokenWithReauth();
const mutation = await replacePageContent(
pageId,
@@ -1503,10 +1602,16 @@ export class DocmostClient {
// (parity with updatePageJson; harmless for already-stored source content).
this.validateDocUrls(content);
// Defense-in-depth (#228): this is a FULL-document write, so canonicalize
// footnotes before copying — a no-op on already-canonical source content, but
// it guarantees a copy can never propagate a non-canonical footnote topology
// to the target (parity with the other full-doc write paths).
const canonical = canonicalizeFootnotes(content);
const collabToken = await this.getCollabTokenWithReauth();
const mutation = await replacePageContent(
const mutation = await this.replacePage(
targetPageId,
content,
canonical,
collabToken,
this.apiUrl,
);
@@ -1515,7 +1620,7 @@ export class DocmostClient {
success: true,
sourcePageId,
targetPageId,
copiedNodes: content.content.length,
copiedNodes: canonical.content.length,
verify: mutation.verify,
};
}
@@ -2033,7 +2138,10 @@ export class DocmostClient {
}
}
// Convert through the full Docmost schema (consistent with page paths)
// Convert through the full Docmost schema. Deliberately the NON-canonicalizing
// variant: a comment body may carry a footnote definition with no matching
// reference, and canonicalization would drop it (data loss). See
// markdownToProseMirror vs markdownToProseMirrorCanonical.
const jsonContent = await markdownToProseMirror(content);
const payload: Record<string, any> = {
pageId,
@@ -2136,6 +2244,7 @@ export class DocmostClient {
async updateComment(commentId: string, content: string) {
await this.ensureAuthenticated();
// NON-canonicalizing on purpose (comment body — see createComment).
const jsonContent = await markdownToProseMirror(content);
await this.client.post("/comments/update", {
commentId,
@@ -2986,6 +3095,8 @@ export class DocmostClient {
noteItem,
mdToInlineNodes,
commentsToFootnotes,
canonicalizeFootnotes,
insertInlineFootnote,
},
};
@@ -3022,24 +3133,33 @@ export class DocmostClient {
"transform must evaluate to a function (doc, ctx) => doc",
);
}
const result = vm.runInNewContext(
const raw = vm.runInNewContext(
"f(d, c)",
{ f: fn, d: sandbox.doc, c: ctx },
{ timeout: 5000 },
);
if (
!result ||
typeof result !== "object" ||
result.type !== "doc" ||
!Array.isArray(result.content)
!raw ||
typeof raw !== "object" ||
raw.type !== "doc" ||
!Array.isArray(raw.content)
) {
throw new Error(
'transform must return a ProseMirror doc node ({ type:"doc", content:[...] })',
);
}
// Validate the returned doc before it can be written.
this.validateDocStructure(result);
this.validateDocUrls(result);
// Validate the RAW transform output FIRST (structure — including the
// MAX_DEPTH guard — and URLs), mirroring updatePageJson. The canonicalizer
// recurses without a depth limiter, so validating after it would turn a
// too-deep doc into an opaque "Maximum call stack size exceeded" instead of
// the intended "nesting exceeds the maximum depth" error.
this.validateDocStructure(raw);
this.validateDocUrls(raw);
// Auto-canonicalize footnotes after the transform (idempotent): no write
// path can leave footnotes out of order / orphaned / in a raw `[^id]`
// block. In a dryRun preview this may surface footnote edits the script
// author did not write (the canonicalizer tidied them) — that is expected.
const result = canonicalizeFootnotes(raw);
newDoc = result;
return result;
};


@@ -892,8 +892,15 @@ server.registerTool(
"mark-safe), setCalloutRange(doc, n) (sync a [1]…[K] callout range to " +
"[1]…[n]), noteItem(inlineNodes) (wrap inline nodes in a listItem with a " +
"fresh id), mdToInlineNodes(markdown) (comment markdown -> inline nodes), " +
"and commentsToFootnotes(doc, comments, {notesHeading}) (turn inline " +
"comments into numbered footnotes). Footnote convention: markers are " +
"commentsToFootnotes(doc, comments, {notesHeading}) (turn inline " +
"comments into numbered footnotes), canonicalizeFootnotes(doc) (derive " +
"footnote numbering + the single bottom list from reference order, drop " +
"orphans/duplicates — runs AUTOMATICALLY on the transform RESULT, so the " +
"applied (and dryRun-previewed) doc is always footnote-canonical; a dryRun " +
"diff may therefore show footnote tidy-ups your script did not make, and " +
"it is idempotent after the first run), and " +
"insertInlineFootnote(doc, {anchorText, text}) (author-inline footnote: " +
"marker + dedup'd definition, list derived). Footnote convention: markers are " +
"plain '[N]' text in the body; the notes are an orderedList under a " +
"heading whose text is 'Примечания переводчика'. The transform runs " +
"sandboxed (no require/process/fs/network, 5s timeout) and must return a " +
@@ -908,7 +915,8 @@ server.registerTool(
"parenthesized function). It receives a clone of the live doc and " +
"ctx (comments, log, consume(id), helpers: blockText/walk/getList/" +
"insertMarkerAfter/setCalloutRange/noteItem/mdToInlineNodes/" +
"commentsToFootnotes) and must return a {type:'doc'} node.",
"commentsToFootnotes/canonicalizeFootnotes/insertInlineFootnote) " +
"and must return a {type:'doc'} node.",
),
dryRun: z
.boolean()
@@ -934,6 +942,41 @@ server.registerTool(
},
);
// Tool: insert_footnote
server.registerTool(
"insert_footnote",
{
description:
"Insert an AUTHOR-INLINE footnote: you specify only WHERE (anchorText) " +
"and WHAT (text). The footnote marker is placed right after anchorText in " +
"the body, and the bottom footnotes list + the numbering are derived " +
"deterministically server-side. You do NOT assign a number, and you " +
"never see or edit the footnotes list — so footnotes cannot end up out " +
"of order, orphaned, or as a raw '[^id]' block. If a footnote with the " +
"SAME text already exists, its number is REUSED (one definition, several " +
"references). The write is atomic and won't clobber concurrent edits; if " +
"anchorText is not found, nothing is written and an error is returned.",
inputSchema: {
pageId: z.string().min(1),
anchorText: z
.string()
.min(1)
.describe(
"A snippet of existing body text; the footnote marker is inserted " +
"immediately after its first occurrence (mark-safe).",
),
text: z
.string()
.min(1)
.describe("The footnote content as markdown (becomes the definition)."),
},
},
async ({ pageId, anchorText, text }) => {
const result = await docmostClient.insertFootnote(pageId, anchorText, text);
return jsonContent(result);
},
);
// Tool: diff_page_versions
registerShared(
SHARED_TOOL_SPECS.diffPageVersions,


@@ -11,6 +11,7 @@ import { docmostExtensions, docmostSchema } from "./docmost-schema.js";
import { withPageLock } from "./page-lock.js";
import { sanitizeForYjs, findUnstorableAttr } from "./node-ops.js";
import { lexFootnoteLines } from "./footnote-lex.js";
import { canonicalizeFootnotes } from "./footnote-canonicalize.js";
import { summarizeChange, VerifyReport } from "./diff.js";
/**
@@ -392,7 +393,20 @@ function extractFootnotes(markdown: string): {
};
}
/** Convert markdown to a ProseMirror doc using the full Docmost schema. */
/**
* Convert markdown to a ProseMirror doc using the full Docmost schema.
*
* This conversion does NOT canonicalize footnotes — it is the shared, content-
* preserving primitive used by BOTH page write paths and COMMENT bodies
* (createComment / updateComment). Canonicalization MUST NOT run on a comment
* body: a comment may legitimately contain a footnote-definition line
* (`[^1]: text`) with no matching reference, and the canonicalizer drops a
* reference-less footnotesList — which would silently delete the comment's text.
*
* Page write paths that DO need the canonical footnote topology call
* `markdownToProseMirrorCanonical` instead (markdown import, update_page markdown
* path). Keep this function free of canonicalization so that no definition is
* ever dropped.
*/
export async function markdownToProseMirror(
markdownContent: string,
): Promise<any> {
@@ -403,6 +417,23 @@ export async function markdownToProseMirror(
return generateJSON(bridged, docmostExtensions);
}
/**
* Page-write variant of `markdownToProseMirror`: converts markdown then enforces
* the canonical footnote topology. The footnote `section` markdown is emitted in
* DEFINITION order, but numbering derives from REFERENCE order, so without this
* the bottom list renders out of order (`1, 4, 2, 3, …`); orphan definitions and
* duplicate lists are also normalized. Idempotent — a no-op once canonical, and a
* no-op for footnote-free content.
*
* Use this ONLY for full-document PAGE writes (never for comment bodies, where it
* would drop a reference-less footnote definition — see `markdownToProseMirror`).
*/
export async function markdownToProseMirrorCanonical(
markdownContent: string,
): Promise<any> {
return canonicalizeFootnotes(await markdownToProseMirror(markdownContent));
}
/**
* Build the collaboration WebSocket URL from an API base URL:
* switch http(s)->ws(s), strip a trailing /api, mount on /collab.
@@ -801,7 +832,9 @@ export async function updatePageContentRealtime(
collabToken: string,
baseUrl: string,
): Promise<MutationResult> {
const tiptapJson = await markdownToProseMirror(markdownContent);
// PAGE write: canonicalize footnotes (markdown import builds the bottom list in
// definition order; numbering is reference-ordered).
const tiptapJson = await markdownToProseMirrorCanonical(markdownContent);
return await mutatePageContent(
pageId,
collabToken,


@@ -0,0 +1,91 @@
/**
* Inline-authoring helpers for footnotes (MCP).
*
* These build/identify footnote DEFINITION nodes for the author-inline tool
* (`insertInlineFootnote` in transforms.ts): a content key to de-duplicate notes
* by text, a definition-node factory, and a fresh uuidv7-style id generator.
*
* Split out of `footnote-canonicalize.ts` so that module stays a pure MIRROR of
* the editor-ext canonicalizer (compositionally symmetric to the editor-ext
* copy, which keeps its authoring helpers in `footnote-util.ts`). The pure
* canonicalizer has no dependency on these.
*/
const FOOTNOTE_DEFINITION_NAME = "footnoteDefinition";
function cloneJson<T>(v: T): T {
if (typeof structuredClone === "function") return structuredClone(v);
return JSON.parse(JSON.stringify(v)) as T;
}
/**
* Normalized content key for de-duplicating footnote DEFINITIONS by their text.
*
* Two definitions with the same key are the SAME footnote — so the inline
* authoring tool reuses one id (one number, one definition, several references)
* instead of minting a second definition. Key = plaintext (whitespace-collapsed,
* trimmed) PLUS, after each text run, that run's mark types (sorted), so two
* notes that read the same but differ in formatting (one bold, one plain) are
* NOT merged.
* Conservative: only an exact match merges.
*/
export function footnoteContentKey(defNode: any): string {
const parts: string[] = [];
const visit = (n: any): void => {
if (!n || typeof n !== "object") return;
if (n.type === "text" && typeof n.text === "string") {
const marks = Array.isArray(n.marks)
? n.marks.map((m: any) => m?.type).filter(Boolean).sort().join(",")
: "";
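// Delimit the mark signature so a marked run ("foo" + "bold") cannot
// collide with the plain text "foobold"; unmarked runs add their text only.
parts.push(marks ? `${n.text}\u0001${marks}\u0001` : n.text);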
}
if (Array.isArray(n.content)) for (const c of n.content) visit(c);
};
visit(defNode);
// Collapse the assembled text's whitespace and trim, keeping the mark
// signature attached so formatting differences still distinguish notes.
return parts
.join("")
.replace(/[ \t\r\n]+/g, " ")
.trim();
}
/**
* Build a footnoteDefinition node from inline ProseMirror nodes, keyed by id.
*/
export function makeFootnoteDefinition(id: string, inlineNodes: any[]): any {
const content = Array.isArray(inlineNodes) ? cloneJson(inlineNodes) : [];
return {
type: FOOTNOTE_DEFINITION_NAME,
attrs: { id },
content: [{ type: "paragraph", content }],
};
}
/**
* Generate a uuidv7-style id (time-ordered), matching editor-ext's
* `generateFootnoteId`. Used for a genuinely-new inline footnote id.
*/
export function generateFootnoteId(): string {
const now = Date.now();
const timeHex = now.toString(16).padStart(12, "0");
const rand = (length: number) => {
let s = "";
for (let i = 0; i < length; i++)
s += Math.floor(Math.random() * 16).toString(16);
return s;
};
const versioned = "7" + rand(3);
const variantNibble = (8 + Math.floor(Math.random() * 4)).toString(16);
const variant = variantNibble + rand(3);
return (
timeHex.slice(0, 8) +
"-" +
timeHex.slice(8, 12) +
"-" +
versioned +
"-" +
variant +
"-" +
rand(12)
);
}

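As a quick illustration of the key semantics described in the `footnoteContentKey` doc comment, here is a minimal sketch. `contentKey` is a typed restatement of the function as it appears in this listing; the `note` helper and the sample strings are hypothetical and exist only for this example.

```typescript
type PMMark = { type?: string };
type PMNode = { type?: string; text?: string; marks?: PMMark[]; content?: PMNode[] };

// Typed restatement of footnoteContentKey from the listing above.
function contentKey(defNode: PMNode): string {
  const parts: string[] = [];
  const visit = (n: PMNode | undefined): void => {
    if (!n || typeof n !== "object") return;
    if (n.type === "text" && typeof n.text === "string") {
      // Mark types are sorted per run, so mark order inside a run is ignored.
      const marks = Array.isArray(n.marks)
        ? n.marks.map((m) => m?.type).filter(Boolean).sort().join(",")
        : "";
      parts.push(`${n.text}${marks}`);
    }
    if (Array.isArray(n.content)) for (const c of n.content) visit(c);
  };
  visit(defNode);
  return parts.join("").replace(/[ \t\r\n]+/g, " ").trim();
}

// Hypothetical helper: wrap inline nodes in a paragraph, as a definition would.
const note = (...inline: PMNode[]): PMNode => ({
  content: [{ type: "paragraph", content: inline }],
});

const plain = contentKey(note({ type: "text", text: "See  RFC\n9562" }));
const spaced = contentKey(note({ type: "text", text: "  See RFC 9562 " }));
const bold = contentKey(
  note({ type: "text", text: "See RFC 9562", marks: [{ type: "bold" }] }),
);

console.log(plain === spaced); // whitespace differences collapse to one key
console.log(plain === bold); // a formatting difference keeps the keys apart
```

Under these assumptions, the first two notes would share one footnote id on insert, while the bold note would get its own definition.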
@@ -0,0 +1,225 @@
/**
* Server-side footnote canonicalizer (MCP mirror — PURE).
*
* `canonicalizeFootnotes(doc)` is a pure ProseMirror-JSON port of the editor's
* `footnoteSyncPlugin` end-state, identical in behaviour to
* `@docmost/editor-ext`'s `canonicalizeFootnotes`. It is mirrored here — rather
* than imported from editor-ext — for the SAME reason `footnote-lex.ts` and the
* `docmost-schema.ts` nodes are mirrored: the MCP package is deliberately
* decoupled from the browser/React-heavy editor barrel and operates on plain
* JSON. The editor-ext copy owns the golden test against the live plugin; this
* copy must stay behaviourally identical (a SHARED golden corpus, exercised by
* both test suites, pins that — see `test/unit/footnote-corpus.mjs`).
*
* This module is the pure MIRROR only. The inline-authoring helpers
* (`footnoteContentKey`, `makeFootnoteDefinition`, `generateFootnoteId`) used by
* `insertInlineFootnote` live in the sibling `footnote-authoring.ts`, so this
* file is compositionally symmetric to the editor-ext copy.
*
* Why it exists: every NON-editor write path (markdown import, update_page_json,
* docmost_transform, insert_footnote, copy_page_content) builds ProseMirror JSON
* directly, so the editor's footnote plugins never run and the canonical
* topology (sequential numbering by first reference, one trailing list, no
* orphans, no raw `[^id]`) is not enforced. Running this at the end of every
* full-document write path closes that gap; because it is idempotent, it is a
* no-op when the footnotes are already canonical (no spurious mutations /
* git-sync churn).
*
* ENFORCEMENT RULE (#228): any NEW FULL-document persist path MUST call
* `canonicalizeFootnotes(doc)` before writing — the current callers are
* `markdownToProseMirrorCanonical` (page markdown import/update; the plain
* `markdownToProseMirror` used for COMMENT bodies must NOT, or it would drop a
* reference-less definition), `update_page_json`, `docmost_transform`,
* `insert_footnote`, and `copy_page_content`. Append/prepend FRAGMENT writes MUST
* NOT canonicalize. This is deliberately per-call-site (the replace-vs-fragment
* and comment-vs-page nuances make a single naive wrapper unsafe).
*/
const FOOTNOTE_REFERENCE_NAME = "footnoteReference";
const FOOTNOTES_LIST_NAME = "footnotesList";
const FOOTNOTE_DEFINITION_NAME = "footnoteDefinition";
function cloneJson<T>(v: T): T {
if (typeof structuredClone === "function") return structuredClone(v);
return JSON.parse(JSON.stringify(v)) as T;
}
function isEmptyParagraph(node: any): boolean {
return (
!!node &&
node.type === "paragraph" &&
(!Array.isArray(node.content) || node.content.length === 0)
);
}
function collectReferenceIds(node: any, out: string[], seen: Set<string>): void {
if (!node || typeof node !== "object") return;
if (node.type === FOOTNOTE_REFERENCE_NAME) {
const id = node?.attrs?.id;
if (id && !seen.has(id)) {
seen.add(id);
out.push(id);
}
}
if (Array.isArray(node.content)) {
for (const child of node.content) collectReferenceIds(child, out, seen);
}
}
function collectDefinitions(node: any, out: any[]): void {
if (!node || typeof node !== "object") return;
if (node.type === FOOTNOTE_DEFINITION_NAME) out.push(node);
if (Array.isArray(node.content)) {
for (const child of node.content) collectDefinitions(child, out);
}
}
function emptyDefinition(id: string): any {
return {
type: FOOTNOTE_DEFINITION_NAME,
attrs: { id },
content: [{ type: "paragraph" }],
};
}
/**
* Deep equality over plain JSON: arrays are compared POSITIONALLY
* (order-SENSITIVE), object keys order-insensitively. The array order-sensitivity
* is required for correctness here — a reordered `footnotesList.content` must
* compare UNEQUAL so the canonical rebuild fires instead of leaving it in place.
*/
function deepEqualJson(a: any, b: any): boolean {
if (a === b) return true;
if (a == null || b == null || typeof a !== typeof b) return false;
if (Array.isArray(a) || Array.isArray(b)) {
if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) {
return false;
}
for (let i = 0; i < a.length; i++) {
if (!deepEqualJson(a[i], b[i])) return false;
}
return true;
}
if (typeof a === "object") {
const ka = Object.keys(a);
const kb = Object.keys(b);
if (ka.length !== kb.length) return false;
for (const k of ka) {
if (!Object.prototype.hasOwnProperty.call(b, k)) return false;
if (!deepEqualJson(a[k], b[k])) return false;
}
return true;
}
return false;
}
/**
* Canonicalize footnotes in a ProseMirror-JSON document. See the file header and
* the editor-ext twin for the full contract. Pure (deep-clones input,
* deterministic, idempotent).
*/
export function canonicalizeFootnotes<T = any>(doc: T): T {
if (
doc == null ||
typeof doc !== "object" ||
!Array.isArray((doc as any).content)
) {
return doc;
}
const out = cloneJson(doc) as any;
// 1) Distinct reference ids in document order (deep — refs can live in
// callouts, tables, list items, ...). The ordering/numbering truth.
const referenceIds: string[] = [];
collectReferenceIds(out, referenceIds, new Set<string>());
// 2) Every definition node in document order (deep).
const defNodes: any[] = [];
collectDefinitions(out, defNodes);
// 3) First definition per id wins; later duplicates carry the SAME id, so they
// cannot be referenced separately and would be orphans — they are dropped.
const defById = new Map<string, any>();
for (const d of defNodes) {
const id = d?.attrs?.id;
if (id && !defById.has(id)) defById.set(id, d);
}
// 4) Build the ordered definition list: one per referenced id, in REFERENCE
// order, reusing the existing node (shallow-copied, id normalized — `out` is
// already deep-cloned and the old lists are cut) or synthesizing an empty
// one. Definitions whose id is not referenced are orphans and never added.
const orderedDefs: any[] = [];
for (const id of referenceIds) {
const existing = defById.get(id);
if (existing) {
orderedDefs.push({
...existing,
attrs: { ...(existing.attrs ?? {}), id },
});
} else {
orderedDefs.push(emptyDefinition(id));
}
}
// 5) No references -> there must be NO list at all (at any depth).
if (referenceIds.length === 0) {
stripFootnotesListsDeep(out);
return out;
}
// 6) Placement parity with the live plugin: when the document is ALREADY in the
// canonical single-list state, leave that list exactly where it sits rather
// than cutting and re-inserting it at the end (the plugin never repositions a
// sole correct list, so moving it would silently reorder any content that
// follows the list on the first write).
const topLevelLists = out.content.filter(
(n: any) => n && n.type === FOOTNOTES_LIST_NAME,
);
if (
topLevelLists.length === 1 &&
defNodes.length === orderedDefs.length &&
deepEqualJson(topLevelLists[0].content, orderedDefs)
) {
return out;
}
// 7) Otherwise rebuild: strip every footnotesList AND every bare
// footnoteDefinition at ANY depth (collectDefinitions gathers defs
// recursively, so a list nested in a callout/blockquote — or a bare
// definition outside any list — would otherwise have its defs copied into the
// rebuilt list while the original survives in place → duplicates) and
// re-insert exactly one list after the last meaningful (non-empty paragraph)
// top-level block.
stripFootnotesListsDeep(out);
stripFootnoteDefinitionsDeep(out);
const top: any[] = out.content;
let insertAt = top.length;
while (insertAt > 0 && isEmptyParagraph(top[insertAt - 1])) insertAt--;
top.splice(insertAt, 0, { type: FOOTNOTES_LIST_NAME, content: orderedDefs });
out.content = top;
return out;
}
/** Remove every `footnotesList` node at ANY depth (mutates the given clone). */
function stripFootnotesListsDeep(node: any): void {
if (!node || typeof node !== "object" || !Array.isArray(node.content)) return;
node.content = node.content.filter(
(c: any) => !(c && c.type === FOOTNOTES_LIST_NAME),
);
for (const child of node.content) stripFootnotesListsDeep(child);
}
/**
* Remove every BARE `footnoteDefinition` node at ANY depth (mutates the given
* clone). Runs only in the rebuild path AFTER the lists are stripped, so it
* targets definitions that were sitting outside a list (e.g. hand-authored via a
* raw-JSON write path and nested in a callout); their content was already copied
* into the rebuilt list, so leaving the originals would duplicate them.
*/
function stripFootnoteDefinitionsDeep(node: any): void {
if (!node || typeof node !== "object" || !Array.isArray(node.content)) return;
node.content = node.content.filter(
(c: any) => !(c && c.type === FOOTNOTE_DEFINITION_NAME),
);
for (const child of node.content) stripFootnoteDefinitionsDeep(child);
}

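The ordering rule in steps 1 to 4 of `canonicalizeFootnotes` (distinct reference ids in document order, one definition per referenced id, orphans never added) can be sketched in isolation. This is a condensed illustration, not the module's code: `planFootnoteList`, `visitAll`, and `PlannedDefinition` are hypothetical names, and the sketch only reports which definitions the rebuilt list would hold, without rebuilding the document.

```typescript
type PMNode = { type: string; attrs?: { id?: string }; content?: PMNode[] };

interface PlannedDefinition {
  id: string;
  /** True when no definition exists, so an empty one would be synthesized. */
  synthesized: boolean;
}

// Visit every node depth-first, i.e. in document order.
function visitAll(node: PMNode, fn: (n: PMNode) => void): void {
  fn(node);
  for (const child of node.content ?? []) visitAll(child, fn);
}

// Hypothetical helper: which definitions the canonical list would contain,
// in which order. Ids that are defined but never referenced do not appear.
function planFootnoteList(doc: PMNode): PlannedDefinition[] {
  const referenceIds: string[] = [];
  const seenRefs = new Set<string>();
  const definedIds = new Set<string>();
  visitAll(doc, (n) => {
    const id = n.attrs?.id;
    if (!id) return;
    if (n.type === "footnoteReference" && !seenRefs.has(id)) {
      seenRefs.add(id);
      referenceIds.push(id);
    }
    if (n.type === "footnoteDefinition") definedIds.add(id);
  });
  return referenceIds.map((id) => ({ id, synthesized: !definedIds.has(id) }));
}

const ref = (id: string): PMNode => ({ type: "footnoteReference", attrs: { id } });
const def = (id: string): PMNode => ({
  type: "footnoteDefinition",
  attrs: { id },
  content: [{ type: "paragraph" }],
});

const sampleDoc: PMNode = {
  type: "doc",
  content: [
    { type: "paragraph", content: [ref("b"), ref("a"), ref("b"), ref("c")] },
    { type: "footnotesList", content: [def("a"), def("orphan"), def("b")] },
  ],
};

const plan = planFootnoteList(sampleDoc);
console.log(plan.map((p) => p.id).join(",")); // b,a,c
```

Here `orphan` is left out because nothing references it, and `c` is flagged as synthesized because it is referenced but has no definition.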
@@ -15,6 +15,14 @@
*/
import { blockPlainText } from "./node-ops.js";
import { canonicalizeFootnotes } from "./footnote-canonicalize.js";
import {
footnoteContentKey,
makeFootnoteDefinition,
generateFootnoteId,
} from "./footnote-authoring.js";
export { canonicalizeFootnotes } from "./footnote-canonicalize.js";
/** Deep-clone a JSON-serializable value without mutating the original. */
function clone<T>(value: T): T {
@@ -73,13 +81,61 @@ export function getList(
return found;
}
/** Options for insertMarkerAfter / insertNodesAfterAnchor. */
export interface InsertMarkerOptions {
/**
* Limit the search to TOP-LEVEL blocks with index < beforeBlock. Used to keep
* footnote markers in the body and out of the notes section.
*/
beforeBlock?: number;
/**
* Textblock node types that MUST NOT receive the inserted nodes. When the
* split point lands inside such a block it is refused (skipped), so an inline
* ATOM (e.g. footnoteReference) is never spliced into a block whose content
* spec forbids it — which would persist a schema-invalid doc. Plain-text
* markers leave this unset (text is valid inside a codeBlock).
*/
forbidBlockTypes?: ReadonlySet<string>;
/**
* Node types whose ENTIRE subtree is skipped during the walk (never split into,
* at any depth). Used to keep the footnote inserter out of the notes section:
* splitting text inside an existing `footnoteDefinition` would glue a reference
* into a definition, which the canonicalizer then drops as an orphan together
* with the definition's prose — silent loss of an existing footnote. Skipped
* subtrees still advance the running offset so sibling text stays aligned.
*/
skipSubtreeTypes?: ReadonlySet<string>;
}
/**
* Textblocks that hold raw text but do NOT accept inline atom nodes. A
* `footnoteReference` is `group:"inline", atom:true`; `codeBlock` is
* `content:"text*"` (text only), so splicing a footnoteReference into it yields
* an invalid document. (paragraph/heading/detailsSummary are `inline*` and DO
* accept it; footnote definitions live inside a footnotesList, which the
* footnote inserter excludes via `beforeBlock` and `skipSubtreeTypes`.)
*/
const INLINE_ATOM_FORBIDDEN_BLOCKS: ReadonlySet<string> = new Set(["codeBlock"]);
/**
* Footnote-notes subtrees the inline footnote inserter must never split into (at
* any depth): a `footnotesList` and the `footnoteDefinition`s it holds. A
* reference anchored inside one of these would later be dropped as an orphan by
* the canonicalizer, taking the existing definition's text with it.
*/
const FOOTNOTE_NOTES_SUBTREES: ReadonlySet<string> = new Set([
"footnotesList",
"footnoteDefinition",
]);
/** True if `node` IS, or contains at any depth, a footnotesList/footnoteDefinition. */
function containsFootnoteNotes(node: any): boolean {
if (!isObject(node)) return false;
if (FOOTNOTE_NOTES_SUBTREES.has(node.type)) return true;
if (Array.isArray(node.content)) {
return node.content.some((c: any) => containsFootnoteNotes(c));
}
return false;
}
/**
@@ -105,6 +161,30 @@ export function insertMarkerAfter(
anchor: string,
marker: string,
opts: InsertMarkerOptions = {},
): { doc: any; inserted: boolean } {
// A plain marker is a leading-space-padded unmarked text run.
return insertNodesAfterAnchor(
doc,
anchor,
() => [{ type: "text", text: " " + marker }],
opts,
);
}
/**
* Mark-safe insertion CORE: split the inline text run that holds the END of
* `anchor` (preserving the surrounding marks) and splice the nodes produced by
* `makeMiddle()` in at the split point. `insertMarkerAfter` (plain text marker)
* and `insertInlineFootnote` (a `footnoteReference` node) are both thin callers —
* the only difference is WHAT is inserted (a space-padded text run vs. a node
* that should hug the preceding word), which is exactly what `makeMiddle`
* decides. Operates on a clone; returns `{ doc, inserted }`.
*/
function insertNodesAfterAnchor(
doc: any,
anchor: string,
makeMiddle: () => any[],
opts: InsertMarkerOptions = {},
): { doc: any; inserted: boolean } {
const out = clone(doc);
if (!isObject(out) || !Array.isArray(out.content) || !anchor) {
@@ -137,12 +217,27 @@ export function insertMarkerAfter(
if (inserted || !isObject(container) || !Array.isArray(container.content)) {
return;
}
// Skip a forbidden subtree entirely (e.g. footnotesList/footnoteDefinition):
// never split into it, but keep `offset` aligned for any sibling text after
// it within this block.
if (opts.skipSubtreeTypes && opts.skipSubtreeTypes.has(container.type)) {
offset += blockPlainText(container).length;
return;
}
const inline = container.content;
// Detect whether this array is an inline array (contains text nodes).
const hasText = inline.some(
(n: any) => isObject(n) && n.type === "text",
);
if (hasText) {
// Refuse a textblock whose content spec cannot hold the inserted nodes
// (e.g. a codeBlock for an inline atom). Keep `offset` aligned for any
// sibling textblocks in this same block, then bail so the search falls
// through to the next candidate block.
if (opts.forbidBlockTypes && opts.forbidBlockTypes.has(container.type)) {
offset += blockPlainText(container).length;
return;
}
for (let i = 0; i < inline.length; i++) {
const n = inline[i];
const len = isObject(n) ? blockPlainText(n).length : 0;
@@ -166,8 +261,9 @@ export function insertMarkerAfter(
if (before.length > 0) {
parts.push({ ...n, text: before, marks: [...marks] });
}
// The inserted nodes are caller-decided (a space-padded marker run,
// or a node that hugs the word). They carry no copied marks.
parts.push(...makeMiddle());
if (after.length > 0) {
parts.push({ ...n, text: after, marks: [...marks] });
}
@@ -268,14 +364,16 @@ export function noteItem(inlineNodes: any[]): any {
* Wrap inline ProseMirror nodes in a real footnoteDefinition node keyed by id:
* { type:"footnoteDefinition", attrs:{id}, content:[{ type:"paragraph", content }] }
* (mirrors the editor-ext / docmost-schema FootnoteDefinition node).
*
* Built on the shared `makeFootnoteDefinition` factory (footnote-authoring.ts);
* the only extra is a fresh block id on the inner paragraph (Docmost stamps one,
* and the canonicalizer preserves attrs as-is). Single factory, one place to
* change the definition shape.
*/
export function footnoteDefinition(id: string, inlineNodes: any[]): any {
const node = makeFootnoteDefinition(id, inlineNodes);
node.content[0].attrs = { id: freshId() };
return node;
}
/**
@@ -559,3 +657,131 @@ export function commentsToFootnotes(
return { doc: synced.doc, consumed };
}
/** Options for insertInlineFootnote. */
export interface InsertInlineFootnoteOptions {
/** Body text after which the footnote marker is placed (mark-safe). */
anchorText: string;
/** Footnote content as markdown (converted to inline nodes). */
text: string;
}
/** Result of insertInlineFootnote. */
export interface InsertInlineFootnoteResult {
doc: any;
/** False when the anchor text was not found (no write). */
inserted: boolean;
/** The footnote id used (new or reused). */
footnoteId: string;
/** True when an existing same-content definition was reused (content dedup). */
reused: boolean;
}
/**
* AUTHOR-INLINE footnote insertion. The caller supplies WHERE (anchorText) and
* WHAT (markdown text); numbering and the bottom list are derived server-side by
* `canonicalizeFootnotes`. The caller never sees or edits `footnotesList`, never
* assigns a number, and cannot desync — orphans / out-of-order lists / raw
* `[^id]` markdown are structurally impossible.
*
* Content DEDUP (#3 in the issue): if an existing definition has the SAME
* normalized content key, its id is REUSED (the new reference points at it: one
* number, one definition, several references). Otherwise a fresh uuid id is
* minted and a new definition added. Conservative — only an exact content match
* merges.
*
* Mechanics: the `footnoteReference` node is inserted DIRECTLY at the anchor via
* the same mark-safe split as `insertMarkerAfter` (the shared
* `insertNodesAfterAnchor` core), so it hugs the preceding word with no text
* sentinel round-trip. The whole document is then canonicalized.
*
* Operates on a clone of `doc`. When the anchor is not found, returns the input
* unchanged with `inserted:false`.
*/
export function insertInlineFootnote(
doc: any,
opts: InsertInlineFootnoteOptions,
): InsertInlineFootnoteResult {
const inline = mdToInlineNodes(opts.text ?? "");
// footnoteContentKey only reads `.content`, so key off the inline array
// directly instead of building a throwaway definition node.
const key = footnoteContentKey({ content: inline });
// Content dedup: reuse an existing definition's id when its key matches.
let footnoteId: string | null = null;
let reused = false;
if (key !== "") {
walk(doc, (n) => {
if (
footnoteId == null &&
isObject(n) &&
n.type === "footnoteDefinition" &&
n.attrs &&
typeof n.attrs.id === "string" &&
n.attrs.id !== "" &&
footnoteContentKey(n) === key
) {
footnoteId = n.attrs.id;
reused = true;
}
});
}
if (footnoteId == null) footnoteId = generateFootnoteId();
// Insert the footnoteReference node directly after the anchor (mark-safe
// split); it hugs the preceding word with no leading space. Three guards keep
// the inline atom out of the notes section and out of blocks that cannot hold it:
// - beforeBlock bounds the search to the BODY, before the first top-level block
// that IS or CONTAINS (at any depth) a footnotesList/footnoteDefinition, so
// a NESTED list or a bare definition also bounds the search, not just a
// top-level list;
// - skipSubtreeTypes refuses to descend into any footnotesList/footnoteDefinition
// subtree, so a reference is never glued inside an existing definition (which
// the canonicalizer would then drop as an orphan, losing that definition's
// prose);
// - forbidBlockTypes refuses codeBlocks (an inline atom there is a
// schema-invalid doc; insert_footnote skips validateDocStructure).
// When the only anchor match is in such a place, the insert is refused and the
// write aborts cleanly (inserted:false) instead of destroying content.
const boundaryIdx = Array.isArray(doc?.content)
? doc.content.findIndex((n: any) => containsFootnoteNotes(n))
: -1;
const r = insertNodesAfterAnchor(
doc,
(opts.anchorText ?? "").trimEnd(),
() => [{ type: "footnoteReference", attrs: { id: footnoteId } }],
{
...(boundaryIdx >= 0 ? { beforeBlock: boundaryIdx } : {}),
forbidBlockTypes: INLINE_ATOM_FORBIDDEN_BLOCKS,
skipSubtreeTypes: FOOTNOTE_NOTES_SUBTREES,
},
);
if (!r.inserted) {
return { doc: clone(doc), inserted: false, footnoteId, reused };
}
let working = r.doc;
// Add a NEW definition (canonicalize will order/place it); a reused id needs
// no new definition (the existing one is shared).
if (!reused) {
appendDefinition(working, makeFootnoteDefinition(footnoteId, inline));
}
// Derive numbering + the single bottom list deterministically.
working = canonicalizeFootnotes(working);
return { doc: working, inserted: true, footnoteId, reused };
}
/**
* Append a definition node so the canonicalizer can order/place it: into the
* first existing footnotesList, or a new trailing list when none exists.
*/
function appendDefinition(doc: any, defNode: any): void {
const existingList = getList(doc, (n) => isObject(n) && n.type === "footnotesList");
if (existingList && Array.isArray(existingList.content)) {
existingList.content.push(defNode);
return;
}
if (Array.isArray(doc.content)) {
doc.content.push({ type: "footnotesList", content: [defNode] });
}
}

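The mark-safe split that `insertNodesAfterAnchor` performs can be sketched on a single inline array. This is a simplified illustration under stated assumptions, not the module's implementation: `insertAfterAnchor` and the `Inline` type are hypothetical, non-text nodes are treated as contributing no text, and the block walk, `beforeBlock`, `forbidBlockTypes`, and `skipSubtreeTypes` handling are all left out.

```typescript
type Mark = { type: string };
type Inline = {
  type: string;
  text?: string;
  marks?: Mark[];
  attrs?: Record<string, unknown>;
};

// Hypothetical sketch: split the text run that holds the END of `anchor` and
// splice `middle` in at the split point. Both halves of the split run keep the
// original run's marks; the inserted nodes get none. Returns null when the
// anchor is not found.
function insertAfterAnchor(
  inline: Inline[],
  anchor: string,
  middle: Inline[],
): Inline[] | null {
  const flat = inline.map((n) => (n.type === "text" ? n.text ?? "" : "")).join("");
  const at = anchor ? flat.indexOf(anchor) : -1;
  if (at < 0) return null;
  const end = at + anchor.length; // offset just past the anchor's last character

  const out: Inline[] = [];
  let offset = 0;
  let done = false;
  for (const n of inline) {
    const text = n.type === "text" ? n.text ?? "" : "";
    if (!done && text.length > 0 && end > offset && end <= offset + text.length) {
      const cut = end - offset;
      const after = text.slice(cut);
      out.push({ ...n, text: text.slice(0, cut) }); // spread keeps the marks
      out.push(...middle);
      if (after.length > 0) out.push({ ...n, text: after });
      done = true;
    } else {
      out.push(n);
    }
    offset += text.length;
  }
  return done ? out : null;
}

const inline: Inline[] = [
  { type: "text", text: "The sky is " },
  { type: "text", text: "blue today", marks: [{ type: "bold" }] },
  { type: "text", text: "." },
];

const result = insertAfterAnchor(inline, "blue", [
  { type: "footnoteReference", attrs: { id: "fn-1" } },
]);
console.log(result?.map((n) => n.text ?? `<${n.type}>`));
```

In this example the bold run is split into `blue` and ` today`, both still bold, with the reference node sitting between them so that it hugs the anchored word.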
@@ -0,0 +1,153 @@
// Mock-HTTP orchestration tests for the footnote WRITE wrappers on DocmostClient
// (issue #228):
// - insertFootnote (#11): the required-argument guards reject BEFORE any write,
// and never touch the collab/mutate path.
// - transformPage / docmost_transform (#13): the auto-canonicalize step
// (`result = canonicalizeFootnotes(raw)`) runs after every transform, so a
// transform that introduces an orphan footnote definition is silently tidied
// away — observable as an EMPTY diff in a dryRun preview.
//
// These stand a local http.createServer in for Docmost and only exercise plain
// HTTP routes (login / comments / pages.info), deliberately avoiding the live
// Hocuspocus collab WebSocket: the insertFootnote guards short-circuit before it,
// and docmost_transform's dryRun preview never opens it. The collab mutate path
// itself — abort-via-throw on a missing anchor with NO persisted write, and the
// reused-vs-new response shaping — is covered in
// test/mock/insert-footnote-wrapper.test.mjs (which overrides the mutatePage
// seam to drive the transform), not here.
import { test, after } from "node:test";
import assert from "node:assert/strict";
import http from "node:http";
import { DocmostClient } from "../../build/client.js";
function readBody(req) {
return new Promise((resolve) => {
let raw = "";
req.on("data", (c) => (raw += c));
req.on("end", () => resolve(raw));
});
}
function startServer(handler) {
return new Promise((resolve) => {
const server = http.createServer(handler);
server.listen(0, "127.0.0.1", () => {
const { port } = server.address();
resolve({ server, baseURL: `http://127.0.0.1:${port}/api` });
});
});
}
function sendJson(res, status, obj, extraHeaders = {}) {
res.writeHead(status, { "Content-Type": "application/json", ...extraHeaders });
res.end(JSON.stringify(obj));
}
const openServers = [];
async function spawn(handler) {
const { server, baseURL } = await startServer(handler);
openServers.push(server);
return { baseURL };
}
after(async () => {
await Promise.all(openServers.map((s) => new Promise((r) => s.close(r))));
});
const ref = (id) => ({ type: "footnoteReference", attrs: { id } });
const def = (id, text) => ({
type: "footnoteDefinition",
attrs: { id },
content: [{ type: "paragraph", content: [{ type: "text", text }] }],
});
// ---------------------------------------------------------------------------
// #11 insertFootnote guards: missing anchorText / text reject and never write.
// ---------------------------------------------------------------------------
test("insertFootnote rejects a missing anchorText before any write", async () => {
const otherRoutes = [];
const { baseURL } = await spawn(async (req, res) => {
await readBody(req);
if (req.url === "/api/auth/login") {
return sendJson(res, 200, { success: true }, {
"Set-Cookie": "authToken=t; Path=/; HttpOnly",
});
}
otherRoutes.push(req.url);
sendJson(res, 404, { message: "not found" });
});
const client = new DocmostClient(baseURL, "user@example.com", "pw");
await assert.rejects(
() => client.insertFootnote("page-1", " ", "a note"),
/anchorText is required/i,
);
assert.deepEqual(otherRoutes, [], "must not hit any write route");
});
test("insertFootnote rejects an empty text before any write", async () => {
const otherRoutes = [];
const { baseURL } = await spawn(async (req, res) => {
await readBody(req);
if (req.url === "/api/auth/login") {
return sendJson(res, 200, { success: true }, {
"Set-Cookie": "authToken=t; Path=/; HttpOnly",
});
}
otherRoutes.push(req.url);
sendJson(res, 404, { message: "not found" });
});
const client = new DocmostClient(baseURL, "user@example.com", "pw");
await assert.rejects(
() => client.insertFootnote("page-1", "anchor", " "),
/text is required/i,
);
assert.deepEqual(otherRoutes, [], "must not hit any write route");
});
// ---------------------------------------------------------------------------
// #13 docmost_transform auto-canonicalization: a transform that adds an orphan
// footnote definition produces NO net change (the canonicalizer drops it), so a
// dryRun preview reports an empty diff. Without the auto-canonicalize step the
// orphan would survive and the diff would be non-empty.
// ---------------------------------------------------------------------------
test("transformPage dryRun auto-canonicalizes footnotes (orphan def is dropped)", async () => {
// A page already in canonical footnote state (refs b,a; defs b,a).
const pageContent = {
type: "doc",
content: [
{ type: "paragraph", content: [{ type: "text", text: "x" }, ref("b"), ref("a")] },
{ type: "footnotesList", content: [def("b", "B"), def("a", "A")] },
],
};
const { baseURL } = await spawn(async (req, res) => {
await readBody(req);
if (req.url === "/api/auth/login") {
return sendJson(res, 200, { success: true }, {
"Set-Cookie": "authToken=t; Path=/; HttpOnly",
});
}
if (req.url === "/api/comments") {
return sendJson(res, 200, { data: { items: [], meta: { nextCursor: null } } });
}
if (req.url === "/api/pages/info") {
return sendJson(res, 200, {
data: { id: "page-1", slugId: "s", title: "P", spaceId: "sp", content: pageContent },
});
}
sendJson(res, 404, { message: "not found" });
});
const client = new DocmostClient(baseURL, "user@example.com", "pw");
// The transform appends an ORPHAN definition (id "z", no matching reference).
const transformJs = `(doc) => {
const list = doc.content.find((n) => n.type === "footnotesList");
list.content.push({
type: "footnoteDefinition",
attrs: { id: "z" },
content: [{ type: "paragraph", content: [{ type: "text", text: "orphan" }] }],
});
return doc;
}`;
const result = await client.transformPage("page-1", transformJs, { dryRun: true });
assert.equal(result.pushed, false);
// Auto-canonicalize dropped the orphan, so the doc is unchanged => empty diff.
assert.equal(result.diff.summary.inserted, 0, "orphan def must be canonicalized away");
assert.equal(result.diff.summary.deleted, 0);
});

@@ -0,0 +1,78 @@
// Footnote-canonicalization binding tests for the MCP FULL-document write tools
// (issue #228, review #4): update_page_json and copy_page_content must persist a
// footnote-canonical doc. These override the `replacePage` seam (symmetric to the
// `mutatePage` seam used by the insert-footnote-wrapper test) to capture the
// persisted doc WITHOUT a live Hocuspocus collab socket. Symmetric to the
// server-side focus specs for createPage / updatePageContent('replace').
import { test } from "node:test";
import assert from "node:assert/strict";
import { DocmostClient } from "../../build/client.js";
const para = (...c) => ({ type: "paragraph", content: c });
const ref = (id) => ({ type: "footnoteReference", attrs: { id } });
const def = (id, text) => ({
type: "footnoteDefinition",
attrs: { id },
content: [{ type: "paragraph", content: [{ type: "text", text }] }],
});
const list = (...d) => ({ type: "footnotesList", content: d });
function findAll(node, type, acc = []) {
if (!node || typeof node !== "object") return acc;
if (node.type === type) acc.push(node);
if (Array.isArray(node.content)) for (const c of node.content) findAll(c, type, acc);
return acc;
}
const defIds = (doc) => findAll(doc, "footnoteDefinition").map((d) => d.attrs.id);
function makeClient(sourceDoc) {
const calls = { replaced: [] };
class TestClient extends DocmostClient {
async ensureAuthenticated() {}
async getCollabTokenWithReauth() {
return "collab-token";
}
async getPageRaw(pageId) {
return { id: pageId, slugId: "s", title: "P", spaceId: "sp", content: sourceDoc };
}
async replacePage(pageId, doc, token, apiUrl) {
calls.replaced.push({ pageId, doc });
return { doc, verify: { ok: true } };
}
}
const client = new TestClient("http://127.0.0.1:1/api", "e@x.com", "pw");
return { client, calls };
}
test("update_page_json canonicalizes the persisted full doc (out-of-order -> reference order)", async () => {
const { client, calls } = makeClient();
const outOfOrder = {
type: "doc",
content: [
para({ type: "text", text: "x" }, ref("b"), ref("a")),
list(def("a", "A"), def("b", "B")),
],
};
await client.updatePageJson("p1", outOfOrder);
assert.equal(calls.replaced.length, 1);
// Definitions reordered to reference order [b, a] before persisting.
assert.deepEqual(defIds(calls.replaced[0].doc), ["b", "a"]);
assert.equal(findAll(calls.replaced[0].doc, "footnotesList").length, 1);
});
test("copy_page_content canonicalizes the persisted copy (orphan definition dropped)", async () => {
const sourceDoc = {
type: "doc",
content: [
para({ type: "text", text: "x" }, ref("a")),
list(def("a", "A"), def("orphan", "O")),
],
};
const { client, calls } = makeClient(sourceDoc);
const res = await client.copyPageContent("src", "dst");
assert.equal(calls.replaced.length, 1);
assert.equal(calls.replaced[0].pageId, "dst");
// The orphan definition is dropped by canonicalization before the copy lands.
assert.deepEqual(defIds(calls.replaced[0].doc), ["a"]);
assert.equal(res.success, true);
});

@@ -0,0 +1,100 @@
// Wrapper tests for DocmostClient.insertFootnote (issue #228, review #11/#9):
// the page-locked write seam (mutatePage) is overridden so the wrapper's
// transform + response shaping can be exercised WITHOUT a live Hocuspocus collab
// socket. We assert the two guarantees that the pure insertInlineFootnote test
// can NOT prove on its own:
// - a missing anchor makes the transform throw "anchor text not found" and NO
// document is persisted (the no-partial-write guarantee), and
// - a success shapes footnoteId / reused / message / verify and writes a doc
// carrying the new reference + the derived single list.
import { test } from "node:test";
import assert from "node:assert/strict";
import { DocmostClient } from "../../build/client.js";
const para = (...c) => ({ type: "paragraph", content: c });
const ref = (id) => ({ type: "footnoteReference", attrs: { id } });
const def = (id, text) => ({
type: "footnoteDefinition",
attrs: { id },
content: [{ type: "paragraph", content: [{ type: "text", text }] }],
});
const list = (...d) => ({ type: "footnotesList", content: d });
function findAll(node, type, acc = []) {
if (!node || typeof node !== "object") return acc;
if (node.type === type) acc.push(node);
if (Array.isArray(node.content)) for (const c of node.content) findAll(c, type, acc);
return acc;
}
// A DocmostClient whose auth + page-locked write are stubbed; `mutatePage`
// mirrors collaboration.mutatePageContent (run the transform against a clone of
// the live doc; if it throws, persist NOTHING and rethrow).
function makeClient(liveDoc) {
const calls = { writes: [] };
class TestClient extends DocmostClient {
async ensureAuthenticated() {}
async getCollabTokenWithReauth() {
return "collab-token";
}
async mutatePage(pageId, token, apiUrl, transform) {
calls.pageId = pageId;
calls.token = token;
const newDoc = transform(structuredClone(liveDoc));
calls.writes.push(newDoc);
return { doc: newDoc, verify: { ok: true, marker: "v" } };
}
}
const client = new TestClient("http://127.0.0.1:1/api", "e@x.com", "pw");
return { client, calls };
}
test("insertFootnote: anchor not found -> throws and persists nothing", async () => {
const { client, calls } = makeClient({
type: "doc",
content: [para({ type: "text", text: "nothing to anchor on" })],
});
await assert.rejects(
() => client.insertFootnote("p1", "ZZZ", "a note"),
/anchor text not found/i,
);
assert.equal(calls.writes.length, 0, "no document may be persisted on a missing anchor");
});
test("insertFootnote: success (new) writes a reference + derived list and shapes the response", async () => {
const { client, calls } = makeClient({
type: "doc",
content: [para({ type: "text", text: "The sky is blue today." })],
});
const res = await client.insertFootnote("p1", "blue", "Rayleigh scattering.");
assert.equal(res.success, true);
assert.equal(res.modified, true);
assert.equal(res.pageId, "p1");
assert.equal(res.reused, false);
assert.equal(typeof res.footnoteId, "string");
assert.ok(res.footnoteId.length > 0);
assert.equal(res.message, "Footnote inserted.");
assert.deepEqual(res.verify, { ok: true, marker: "v" });
assert.equal(calls.writes.length, 1, "exactly one write persisted");
assert.equal(findAll(calls.writes[0], "footnoteReference").length, 1);
assert.equal(findAll(calls.writes[0], "footnotesList").length, 1);
assert.equal(calls.pageId, "p1");
});
test("insertFootnote: success (reused) reuses the existing definition and reports it", async () => {
const liveDoc = {
type: "doc",
content: [
para({ type: "text", text: "Alpha and beta." }, ref("a")),
list(def("a", "shared note")),
],
};
const { client, calls } = makeClient(liveDoc);
const res = await client.insertFootnote("p1", "beta", "shared note");
assert.equal(res.reused, true);
assert.equal(res.footnoteId, "a");
assert.match(res.message, /reused an existing same-content definition/i);
// Still exactly one definition (the reused one), two references to it.
assert.equal(findAll(calls.writes[0], "footnoteDefinition").length, 1);
assert.equal(findAll(calls.writes[0], "footnoteReference").length, 2);
});


@@ -4,6 +4,7 @@ import assert from "node:assert/strict";
import {
buildCollabWsUrl,
markdownToProseMirror,
markdownToProseMirrorCanonical,
} from "../../build/lib/collaboration.js";
/** Recursively find the first descendant node (or self) of the given type. */
@@ -124,3 +125,38 @@ test("markdownToProseMirror: an aligned GFM table maps header alignment", async
["left", "center", "right"],
);
});
// Comment-body data-loss guard (#228 review #4): markdownToProseMirror is reused
// for COMMENT bodies (createComment/updateComment), so it must NOT canonicalize —
// a comment may legitimately carry a standalone footnote definition with no
// matching reference, and canonicalization would drop the whole list (the text
// would vanish). The page-write variant DOES canonicalize.
test("markdownToProseMirror (comment path) PRESERVES a reference-less footnote definition", async () => {
const md = "A comment.\n\n[^1]: a standalone footnote definition";
const doc = await markdownToProseMirror(md);
const defs = findAll(doc, "footnoteDefinition");
assert.equal(defs.length, 1, "the footnote definition must be preserved");
assert.match(
JSON.stringify(doc),
/a standalone footnote definition/,
"the definition text must survive the comment write path",
);
});
test("markdownToProseMirrorCanonical (page path) DROPS a reference-less footnote definition", async () => {
// Same input through the PAGE variant: with no reference, the canonical doc has
// no footnotesList (this is the page-side behavior the comment path must avoid).
const md = "A page.\n\n[^1]: a standalone footnote definition";
const doc = await markdownToProseMirrorCanonical(md);
assert.equal(findAll(doc, "footnotesList").length, 0);
assert.equal(findAll(doc, "footnoteDefinition").length, 0);
});
test("markdownToProseMirrorCanonical still canonicalizes a real page footnote (order)", async () => {
// Page path must STILL canonicalize: refs b,a -> definitions reorder to b,a.
const md = "See[^b] then[^a].\n\n[^a]: alpha\n[^b]: bravo";
const doc = await markdownToProseMirrorCanonical(md);
const defs = findAll(doc, "footnoteDefinition").map((d) => d.attrs.id);
assert.deepEqual(defs, ["b", "a"]);
assert.equal(findAll(doc, "footnotesList").length, 1);
});


@@ -0,0 +1,286 @@
import { test } from "node:test";
import assert from "node:assert/strict";
import { canonicalizeFootnotes } from "../../build/lib/footnote-canonicalize.js";
import {
footnoteContentKey,
generateFootnoteId,
} from "../../build/lib/footnote-authoring.js";
import { insertInlineFootnote } from "../../build/lib/transforms.js";
import { markdownToProseMirrorCanonical } from "../../build/lib/collaboration.js";
function findAll(node, type, acc = []) {
if (!node || typeof node !== "object") return acc;
if (node.type === type) acc.push(node);
if (Array.isArray(node.content)) {
for (const c of node.content) findAll(c, type, acc);
}
return acc;
}
const defIds = (doc) =>
findAll(doc, "footnoteDefinition").map((d) => d.attrs.id);
const refIds = (doc) =>
findAll(doc, "footnoteReference").map((r) => r.attrs.id);
const ref = (id) => ({ type: "footnoteReference", attrs: { id } });
const def = (id, text) => ({
type: "footnoteDefinition",
attrs: { id },
content: [{ type: "paragraph", content: [{ type: "text", text }] }],
});
const para = (...inline) => ({ type: "paragraph", content: inline });
const list = (...defs) => ({ type: "footnotesList", content: defs });
// The ordering / orphan-drop / no-refs / duplicate-first-wins cases are covered
// (with full deepEqual on input -> expected) by the shared golden corpus in
// footnote-corpus.test.mjs; only the input-immutability and idempotence
// properties — which the corpus does not assert — are kept here.
test("canonicalize is idempotent", () => {
const doc = {
type: "doc",
content: [
para({ type: "text", text: "x" }, ref("b"), ref("a")),
list(def("a", "A"), def("b", "B"), def("orphan", "O")),
],
};
const once = canonicalizeFootnotes(doc);
const twice = canonicalizeFootnotes(once);
assert.deepEqual(twice, once);
});
test("canonicalize does not mutate its input", () => {
const doc = {
type: "doc",
content: [para({ type: "text", text: "x" }, ref("a")), list(def("o", "O"))],
};
const snap = JSON.parse(JSON.stringify(doc));
canonicalizeFootnotes(doc);
assert.deepEqual(doc, snap);
});
test("footnoteContentKey: same text -> same key; formatting differs -> different key", () => {
const plain = def("x", "hello world");
const sameText = def("y", "hello world"); // same text under a different id
const bold = {
type: "footnoteDefinition",
attrs: { id: "z" },
content: [
{
type: "paragraph",
content: [
{ type: "text", text: "hello world", marks: [{ type: "bold" }] },
],
},
],
};
assert.equal(footnoteContentKey(plain), footnoteContentKey(sameText));
assert.notEqual(footnoteContentKey(plain), footnoteContentKey(bold));
});
test("insertInlineFootnote: places a reference at the anchor and derives the list", () => {
const doc = {
type: "doc",
content: [para({ type: "text", text: "The sky is blue today." })],
};
const r = insertInlineFootnote(doc, {
anchorText: "blue",
text: "Rayleigh scattering.",
});
assert.equal(r.inserted, true);
assert.equal(r.reused, false);
assert.equal(refIds(r.doc).length, 1);
assert.deepEqual(defIds(r.doc), [r.footnoteId]);
// Exactly one footnotes list is derived for the new definition.
assert.equal(findAll(r.doc, "footnotesList").length, 1);
});
test("insertInlineFootnote: content dedup -> same text reuses one definition, two refs", () => {
const doc = {
type: "doc",
content: [para({ type: "text", text: "Alpha and beta and gamma." })],
};
const r1 = insertInlineFootnote(doc, {
anchorText: "Alpha",
text: "shared note",
});
const r2 = insertInlineFootnote(r1.doc, {
anchorText: "beta",
text: "shared note",
});
assert.equal(r2.reused, true);
assert.equal(r2.footnoteId, r1.footnoteId);
// One definition, two references both pointing at it.
assert.deepEqual(defIds(r2.doc), [r1.footnoteId]);
assert.deepEqual(refIds(r2.doc), [r1.footnoteId, r1.footnoteId]);
});
test("insertInlineFootnote: distinct text -> two definitions numbered by reference order", () => {
const doc = {
type: "doc",
content: [para({ type: "text", text: "First point, second point." })],
};
const r1 = insertInlineFootnote(doc, { anchorText: "First", text: "note one" });
const r2 = insertInlineFootnote(r1.doc, {
anchorText: "second",
text: "note two",
});
assert.equal(r2.reused, false);
// Reference order in the body is [First-ref, second-ref]; the derived list
// matches that order.
assert.deepEqual(defIds(r2.doc), refIds(r2.doc));
assert.equal(defIds(r2.doc).length, 2);
});
test("insertInlineFootnote: anchor not found -> inserted:false, no write", () => {
const doc = {
type: "doc",
content: [para({ type: "text", text: "nothing to anchor on" })],
};
const r = insertInlineFootnote(doc, { anchorText: "ZZZ", text: "x" });
assert.equal(r.inserted, false);
assert.equal(findAll(r.doc, "footnoteReference").length, 0);
});
test("insertInlineFootnote: anchor ONLY inside a codeBlock -> refused (no invalid doc)", () => {
// A footnoteReference is an inline atom; codeBlock content is text-only, so
// splicing one in would persist a schema-invalid doc. The insert must refuse.
const doc = {
type: "doc",
content: [{ type: "codeBlock", content: [{ type: "text", text: "const blue = 1;" }] }],
};
const r = insertInlineFootnote(doc, { anchorText: "blue", text: "Rayleigh." });
assert.equal(r.inserted, false);
assert.equal(findAll(r.doc, "footnoteReference").length, 0);
assert.equal(findAll(r.doc, "footnotesList").length, 0);
// The codeBlock text is untouched.
assert.deepEqual(r.doc, doc);
});
test("insertInlineFootnote: anchor ONLY inside an existing footnote definition -> refused", () => {
// The anchor text lives in a definition (inside the footnotesList). The search
// is bounded to the BODY (before the first list), so it is not matched there
// and the insert is refused rather than nesting a reference in a definition.
const doc = {
type: "doc",
content: [
para({ type: "text", text: "Hello world." }, ref("a")),
list(def("a", "the sky is blue")),
],
};
const r = insertInlineFootnote(doc, { anchorText: "sky", text: "note" });
assert.equal(r.inserted, false);
// No EXTRA reference and still exactly one (the pre-existing) list/definition.
assert.equal(findAll(r.doc, "footnoteReference").length, 1);
assert.deepEqual(defIds(r.doc), ["a"]);
});
test("insertInlineFootnote: codeBlock match is skipped, a later body paragraph still anchors", () => {
// The anchor first appears in a codeBlock (refused) but also in a normal
// paragraph after it; the insert falls through to the valid block.
const doc = {
type: "doc",
content: [
{ type: "codeBlock", content: [{ type: "text", text: "let token = 1;" }] },
para({ type: "text", text: "The token is rotated daily." }),
],
};
const r = insertInlineFootnote(doc, { anchorText: "token", text: "secret" });
assert.equal(r.inserted, true);
// The reference landed in the paragraph, NOT the codeBlock.
const code = findAll(r.doc, "codeBlock")[0];
assert.equal(findAll(code, "footnoteReference").length, 0);
assert.equal(findAll(r.doc, "footnoteReference").length, 1);
});
test("insertInlineFootnote: anchor only inside a NESTED definition -> refused, definition preserved", () => {
// The footnotesList is nested in a callout (not top level) and the anchor text
// appears ONLY inside that definition. The search must be bounded past the
// notes subtree (recursive boundary) AND refuse to descend into the definition,
// so it aborts cleanly instead of gluing a reference into the definition (which
// canonicalize would then drop as an orphan, losing the definition's prose).
const doc = {
type: "doc",
content: [
para({ type: "text", text: "Body text here." }, ref("a")),
{
type: "callout",
content: [list(def("a", "the unique anchor lives here"))],
},
],
};
const r = insertInlineFootnote(doc, {
anchorText: "unique anchor",
text: "new note",
});
assert.equal(r.inserted, false);
// The existing definition (and its text) is preserved untouched.
assert.equal(findAll(r.doc, "footnoteDefinition").length, 1);
assert.match(JSON.stringify(r.doc), /the unique anchor lives here/);
assert.equal(findAll(r.doc, "footnoteReference").length, 1); // only the original
});
test("insertInlineFootnote: anchor only inside a BARE definition (no list wrapper) -> refused", () => {
const doc = {
type: "doc",
content: [
para({ type: "text", text: "Some body." }),
{
type: "footnoteDefinition",
attrs: { id: "a" },
content: [{ type: "paragraph", content: [{ type: "text", text: "orphan anchor text" }] }],
},
],
};
const r = insertInlineFootnote(doc, { anchorText: "orphan anchor", text: "x" });
assert.equal(r.inserted, false);
assert.equal(findAll(r.doc, "footnoteDefinition").length, 1);
assert.match(JSON.stringify(r.doc), /orphan anchor text/);
});
test("insertInlineFootnote: anchor in body BEFORE a nested list still inserts", () => {
const doc = {
type: "doc",
content: [
para({ type: "text", text: "The sky is blue." }, ref("a")),
{ type: "callout", content: [list(def("a", "note a"))] },
],
};
const r = insertInlineFootnote(doc, { anchorText: "blue", text: "Rayleigh." });
assert.equal(r.inserted, true);
// The new reference plus the original = two references; a single canonical list.
assert.equal(findAll(r.doc, "footnoteReference").length, 2);
assert.equal(findAll(r.doc, "footnotesList").length, 1);
});
test("markdown import (page path): out-of-order definitions render as a reference-ordered list", async () => {
// References appear b, a, c in the body; definitions are written in a, b, c
// order (the import order). The PAGE import path (markdownToProseMirrorCanonical)
// canonicalizes so the bottom list follows REFERENCE order — numbers read 1, 2,
// 3 down the list. (The non-canonicalizing markdownToProseMirror, used for
// comment bodies, would keep the import order; see collaboration.test.mjs.)
const md = [
"See[^b] then[^a] then[^c].",
"",
"[^a]: alpha",
"[^b]: bravo",
"[^c]: charlie",
].join("\n");
const json = await markdownToProseMirrorCanonical(md);
assert.deepEqual(defIds(json), ["b", "a", "c"]);
assert.equal(findAll(json, "footnotesList").length, 1);
});
test("generateFootnoteId: valid uuidv7 shape (version 7, variant 8..b) and unique", () => {
// version nibble = 7; variant nibble in [8,9,a,b]; otherwise lowercase hex.
const re =
/^[0-9a-f]{8}-[0-9a-f]{4}-7[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;
const ids = new Set();
for (let i = 0; i < 50; i++) {
const id = generateFootnoteId();
assert.match(id, re, `not a uuidv7: ${id}`);
ids.add(id);
}
// Distinct across calls (random component makes collisions astronomically rare).
assert.equal(ids.size, 50, "generated ids must be unique");
});


@@ -0,0 +1,49 @@
// CI guard for architecture item B: the shared golden corpus is duplicated (the
// canonical TS copy in editor-ext + the MCP .mjs mirror), so a typo in one copy
// would otherwise pass BOTH per-package suites green while silently breaking the
// cross-copy invariant. This test loads BOTH copies and asserts they are
// deep-equal, turning "the two corpora stay identical" into a checked property.
//
// The editor-ext copy is a .ts module (not importable from node:test), so it is
// read as text and its array literal — which is pure JSON produced by
// JSON.stringify — is parsed out directly.
import { test } from "node:test";
import assert from "node:assert/strict";
import { readFileSync } from "node:fs";
import { fileURLToPath } from "node:url";
import { dirname, resolve } from "node:path";
import { FOOTNOTE_CORPUS as MCP_CORPUS } from "./footnote-corpus.mjs";
function loadEditorExtCorpus() {
  const here = dirname(fileURLToPath(import.meta.url));
  const tsPath = resolve(
    here,
    "../../../editor-ext/src/lib/footnote/footnote-corpus.ts",
  );
  const src = readFileSync(tsPath, "utf8");
  // The value is `export const FOOTNOTE_CORPUS: FootnoteCorpusCase[] = [ ... ];`
  // where `[ ... ]` is strict JSON (JSON.stringify output). Slice from the
  // assignment's opening bracket to the final closing bracket and parse.
  const assignAt = src.indexOf("] = ");
  assert.ok(assignAt >= 0, "could not locate the editor-ext corpus assignment");
  const jsonStart = src.indexOf("[", assignAt + 3);
  const jsonEnd = src.lastIndexOf("]");
  assert.ok(jsonStart >= 0 && jsonEnd > jsonStart, "could not bound the corpus array");
  return JSON.parse(src.slice(jsonStart, jsonEnd + 1));
}
test("the editor-ext and MCP golden corpora are deep-equal", () => {
const editorExt = loadEditorExtCorpus();
assert.ok(Array.isArray(editorExt) && editorExt.length > 0, "editor-ext corpus is non-empty");
assert.equal(
MCP_CORPUS.length,
editorExt.length,
"the two corpora must have the same number of cases",
);
assert.deepEqual(
MCP_CORPUS,
editorExt,
"the MCP corpus mirror has drifted from the editor-ext canonical copy — re-sync them",
);
});

File diff suppressed because it is too large.


@@ -0,0 +1,19 @@
// Runs the MCP mirror of `canonicalizeFootnotes` against the SHARED golden
// corpus (the same { input -> expected } cases the editor-ext copy is tested
// against in footnote-canonicalize.test.ts). Pinning identical expected outputs
// in both suites makes "the editor-ext copy and the MCP mirror behave
// identically" a checkable property without coupling the two packages
// (architecture item A). The corpus data is mirrored in footnote-corpus.mjs.
import { test } from "node:test";
import assert from "node:assert/strict";
import { canonicalizeFootnotes } from "../../build/lib/footnote-canonicalize.js";
import { FOOTNOTE_CORPUS } from "./footnote-corpus.mjs";
for (const { name, input, expected } of FOOTNOTE_CORPUS) {
  test(`shared corpus (MCP mirror): ${name}`, () => {
    assert.deepEqual(canonicalizeFootnotes(input), expected);
    // Idempotent on the corpus too.
    assert.deepEqual(canonicalizeFootnotes(expected), expected);
  });
}
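
Review note: the corpus itself is suppressed in this view, so the properties these suites pin (reference-ordered list, orphan drop, no list when there are no references, first duplicate wins, no input mutation, idempotence) are easy to lose track of. The sketch below shows one way they could fit together. It is an illustration under assumptions, not the code in footnote-canonicalize.js: `canonicalizeSketch`, `walk`, and `stripNotes` are hypothetical names, and details such as placing the single list at the end of the top-level content or ignoring references without a definition are guesses.

```javascript
// Hypothetical sketch of reference-ordered canonicalization.
// The real canonicalizeFootnotes may differ in edge cases.
const NOTE_TYPES = new Set(["footnotesList", "footnoteDefinition"]);

// Depth-first visit of a ProseMirror-style JSON node.
function walk(node, visit) {
  if (!node || typeof node !== "object") return;
  visit(node);
  if (Array.isArray(node.content)) for (const c of node.content) walk(c, visit);
}

// Copy the tree, dropping every footnotesList / footnoteDefinition node.
function stripNotes(node) {
  if (!Array.isArray(node.content)) return { ...node };
  return {
    ...node,
    content: node.content.filter((c) => !NOTE_TYPES.has(c.type)).map(stripNotes),
  };
}

function canonicalizeSketch(doc) {
  // First definition seen for each id wins.
  const defs = new Map();
  walk(doc, (n) => {
    if (n.type === "footnoteDefinition" && !defs.has(n.attrs.id)) defs.set(n.attrs.id, n);
  });

  // Body without any notes; the input object is never modified.
  const body = stripNotes(doc);

  // Definitions in order of first reference; unreferenced ones are dropped.
  const ordered = [];
  const seen = new Set();
  walk(body, (n) => {
    if (n.type !== "footnoteReference" || seen.has(n.attrs.id)) return;
    seen.add(n.attrs.id);
    if (defs.has(n.attrs.id)) ordered.push(structuredClone(defs.get(n.attrs.id)));
  });

  // Assumption: the single derived list goes at the end of the top level.
  if (ordered.length > 0) {
    body.content = [...(body.content ?? []), { type: "footnotesList", content: ordered }];
  }
  return body;
}
```

Because the list is rebuilt from the references on every call, running the sketch on its own output yields the same document, which is the idempotence property the corpus loop checks for the real implementation.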