feat(mcp): search_in_page tool — in-page substring/regex search for the agent (#330)

Editorial roles (Corrector/Factchecker) were brute-forcing `get_node` block by block
to find occurrences (unquoted «ё», straight quotes, «т.е.»), burning tokens. The new
`search_in_page(pageId, query, {regex?, caseSensitive?, limit?})` reads the page's
ProseMirror JSON via the existing getPageRaw and searches it IN MEMORY: no server
endpoint, no DB/schema change, and no change to the packages/mcp/src/lib schema mirror.
New pure `searchInDoc(doc, query, opts)` (packages/mcp/src/lib/page-search.ts)
descends recursively to each TEXT CONTAINER (paragraph/heading/table-cell paragraph)
and glues its inline text via `blockPlainText`, so a match survives inline-mark
boundaries (e.g. «т.е.» split across bold/italic). It searches by literal (indexOf)
or regex and returns `{ total, truncated, matches:[{ nodeId, blockIndex, type,
before, match, after }] }`. `nodeId` is the container's attrs.id, or the
`#<topLevelIndex>` of the enclosing top-level block: the SAME ref format
get_node/patch_node/comment-anchoring accept (verified identical to getNodeByRef),
so the agent goes straight from a hit to a targeted comment. `before`/`after` are
~40-char context windows that help make the selection unique. `total`/`truncated`
are always reported (never silent truncation). The tool lives in the
SHARED_TOOL_SPECS registry, so it is exposed in BOTH transports (external /mcp +
in-app AI-chat), with a SERVER_INSTRUCTIONS line and a DocmostClientLike signature
+ contract-test entry. Corrector/Factchecker prompts get a one-line "use
search_in_page first" hint (versions bumped, catalog hash lock refreshed).
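
The literal-search path described above can be sketched roughly as follows. This is a minimal illustration, not the code in page-search.ts: the `PMNode` shape, `gluedText`, `isTextContainer`, and `searchLiteral` are hypothetical names, and it assumes only paragraphs and headings are text containers and that a container without `attrs.id` falls back to `#<topLevelIndex>`.

```typescript
// Hypothetical minimal node shape; real ProseMirror JSON carries more fields.
interface PMNode {
  type: string;
  text?: string;
  attrs?: { id?: string };
  marks?: { type: string }[];
  content?: PMNode[];
}

interface Hit {
  nodeId: string;
  blockIndex: number;
  type: string;
  before: string;
  match: string;
  after: string;
}

const CONTEXT_CHARS = 40; // the "~40-char" window from the description

// Glue the text of every inline child together, ignoring marks entirely.
function gluedText(node: PMNode): string {
  if (node.type === "text") return node.text ?? "";
  return (node.content ?? []).map(gluedText).join("");
}

// Assumption: only paragraphs and headings count as text containers here.
function isTextContainer(node: PMNode): boolean {
  return node.type === "paragraph" || node.type === "heading";
}

function searchLiteral(doc: PMNode, query: string): Hit[] {
  if (query.trim() === "") throw new Error("query must not be empty");
  const hits: Hit[] = [];

  const visit = (node: PMNode, blockIndex: number): void => {
    if (!isTextContainer(node)) {
      // Keep descending, but remember which top-level block we are inside.
      (node.content ?? []).forEach((child) => visit(child, blockIndex));
      return;
    }
    const text = gluedText(node);
    // Prefer the container's own id; otherwise address the top-level block.
    const nodeId = node.attrs?.id ?? `#${blockIndex}`;
    let at = text.indexOf(query);
    while (at !== -1) {
      const end = at + query.length;
      hits.push({
        nodeId,
        blockIndex,
        type: node.type,
        before: text.slice(Math.max(0, at - CONTEXT_CHARS), at),
        match: text.slice(at, end),
        after: text.slice(end, end + CONTEXT_CHARS),
      });
      at = text.indexOf(query, end);
    }
  };

  (doc.content ?? []).forEach((block, index) => visit(block, index));
  return hits;
}
```

Because the text is glued per container before searching, a query such as «т.е.» is found even when its characters sit in two differently marked text nodes.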
Guards: empty/whitespace query → clear error; invalid regex → clear error (not a
generic 500); zero-length regex matches (`\b`, `a*`) skipped with lastIndex
advanced (no loop/flood); MAX_PATTERN_LENGTH=1000, MAX_CONTAINER_TEXT=100k bound
each exec; limit clamped [1,200] (default 50).
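
A rough sketch of how guards like these can be written. The helper names (`clampLimit`, `compilePattern`, `regexMatches`) and the error messages are illustrative assumptions, not the actual implementation; only the constants and the [1,200]/default-50 clamp come from the description above.

```typescript
const MAX_PATTERN_LENGTH = 1000;
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200;

// Clamp the caller's limit into [1, 200], defaulting to 50 when absent or not finite.
function clampLimit(limit?: number): number {
  const n = limit === undefined || !Number.isFinite(limit) ? DEFAULT_LIMIT : Math.trunc(limit);
  return Math.min(MAX_LIMIT, Math.max(1, n));
}

// Validate and compile the pattern, turning a SyntaxError into a readable message.
function compilePattern(pattern: string, caseSensitive: boolean): RegExp {
  if (pattern.trim() === "") throw new Error("query must not be empty");
  if (pattern.length > MAX_PATTERN_LENGTH) throw new Error("query is too long");
  try {
    return new RegExp(pattern, caseSensitive ? "g" : "gi");
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`invalid regex: ${reason}`);
  }
}

// Collect non-empty matches. A zero-length match leaves lastIndex unchanged,
// so it is advanced by hand; otherwise exec() would return the same spot forever.
function regexMatches(text: string, re: RegExp): string[] {
  const found: string[] = [];
  re.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const hit = m[0] ?? "";
    if (hit.length === 0) {
      re.lastIndex += 1;
      continue;
    }
    found.push(hit);
  }
  return found;
}
```

Once `lastIndex` moves past the end of the text, `exec` returns `null` and the loop ends, so patterns such as `\b` or `a*` terminate without producing empty hits.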
Tests: new page-search.test.mjs (17) — literal+regex, case-sensitivity,
mark-boundary glue, nodeId for paragraph/heading (attrs.id) and table-cell
(#<index> fallback), context bounds, limit/total/truncated + clamp, invalid
regex/empty/over-long errors, zero-length skip, empty-doc null-safety.
mcp: tsc clean; node --test 467 passed (+17). apps/server: tsc --noEmit clean
(DocmostClientLike + wiring). catalog check.mjs OK.
Known limitations (from internal review, non-blocking):
- Residual ReDoS: a crafted catastrophic-backtracking pattern (e.g. `(a+)+$`)
against a large single container can hang the event loop — JS regex is not
interruptible, so the length caps bound the base but not the backtracking.
Realistic exposure is low (containers are small; the pattern is supplied by the
authenticated model). Candidate for a follow-up hardening (safe-regex validation
or a worker+timeout) if it matters.
- Case-insensitive LITERAL search folds via toLowerCase; a char whose lowercase
differs in length (e.g. Turkish İ) BEFORE a match could shift the context
window — negligible for the RU/EN editorial scenario.
- On a `#<index>` table-cell fallback, `type` is the inline container ("paragraph")
while nodeId addresses the top-level block — addressing is correct; the field is
documented as the container's type.
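
The case-folding limitation can be reproduced in a few lines. This is an illustrative sketch (the `foldedIndexOf` helper is hypothetical, standing in for any search that lowercases both sides first): U+0130 lowercases to two UTF-16 code units, so every offset after it in the folded copy is one position later than in the original text.

```typescript
// Offset of `needle` as a fold-then-search approach would compute it.
function foldedIndexOf(haystack: string, needle: string): number {
  return haystack.toLowerCase().indexOf(needle.toLowerCase());
}

const original = "İstanbul т.е.";
const trueIndex = original.indexOf("т.е.");          // offset in the original string
const foldedIndex = foldedIndexOf(original, "Т.Е."); // offset in the lowercased copy
const drift = foldedIndex - trueIndex;               // how far a context window would shift

console.log({ trueIndex, foldedIndex, drift });
```

A context window sliced from the original string at `foldedIndex` would therefore start one character late for this input.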
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -128,7 +128,7 @@ roles:
 - Don't fabricate confirmations. If you can't verify, honestly mark [Unverified] or [Unverifiable].

 HOW TO LEAVE COMMENTS
-You don't edit the text directly. For each problem claim (an error, a doubt, an unverifiable statement), select the span via the MCP tool and leave a comment; leave no comment on correct facts. Give the verdict, the correction (if any), and the source. For an [Incorrect] verdict, ALWAYS attach the ready correction as a suggested replacement (the `suggestedText` parameter): since you found the correct value in the sources, propose the ready fix right away instead of merely describing the error. The replacement is the exact new text for the selected fragment, plain text with no markup; the author applies it with one click instead of retyping the fragment. The selected fragment must occur exactly once in the text; if it isn't unique, extend the selection with surrounding context. Do not attach a replacement to [Unverified], [Unverifiable], or [Opinion] verdicts. Tag severity:
+You don't edit the text directly. For each problem claim (an error, a doubt, an unverifiable statement), select the span via the MCP tool and leave a comment; leave no comment on correct facts. Give the verdict, the correction (if any), and the source. For an [Incorrect] verdict, ALWAYS attach the ready correction as a suggested replacement (the `suggestedText` parameter): since you found the correct value in the sources, propose the ready fix right away instead of merely describing the error. The replacement is the exact new text for the selected fragment, plain text with no markup; the author applies it with one click instead of retyping the fragment. The selected fragment must occur exactly once in the text; if it isn't unique, extend the selection with surrounding context. When a figure, name, term, or version to check recurs across the page, use search_in_page to find every occurrence in one call first, then place a targeted comment per hit instead of reading block by block. Do not attach a replacement to [Unverified], [Unverifiable], or [Opinion] verdicts. Tag severity:
 - [Critical] — a factual error, especially in numbers, names, or quotes, or a claim that risks misinformation.
 - [Major] — a doubtful or unconfirmed claim that needs a source.
 - [Minor] — a small correction, or false precision worth rounding or confirming.
@@ -169,7 +169,7 @@ roles:
 - Don't make substantive changes. Edits are minimal and mechanical.

 HOW TO WORK
-Go through the whole text from start to finish in a single pass. Flag EVERY violation, including all repeat occurrences of the same error and minor items tagged [Minor] — don't stop at the first few or the most conspicuous. Don't summarize instead of marking up: until you've reached the end of the document, the job isn't done. One run covers the whole text, not just "the most important".
+Go through the whole text from start to finish in a single pass. Flag EVERY violation, including all repeat occurrences of the same error and minor items tagged [Minor] — don't stop at the first few or the most conspicuous. Don't summarize instead of marking up: until you've reached the end of the document, the job isn't done. One run covers the whole text, not just "the most important". For a systematic issue that recurs — straight quotes, a hyphen used as a dash, an inconsistent unit or spelling — use search_in_page to list every occurrence in one call first, then leave a targeted comment (with its replacement) on each hit, instead of scanning block by block.

 HOW TO LEAVE COMMENTS
 You don't edit the text directly. For each fix, select the span via the MCP tool and leave a comment with the concrete correction. Attach a suggested replacement to every fix (the `suggestedText` parameter): the exact corrected text for the selected fragment, plain text with no markup — the author applies it with one click. The selected fragment must occur exactly once in the text; if it isn't unique, extend the selection with surrounding context. Do NOT leave summary notes like "throughout, replace X with Y" or "make the units/quotes/spelling consistent": such a comment can't be applied with a button. If the same error occurs in several places, walk EVERY occurrence and leave a separate targeted comment with its own replacement on each — ten targeted fixes instead of one blanket note. The only exception is a note that genuinely cannot be expressed as a replacement of a concrete fragment; leave those rare cases as an ordinary comment without a replacement. Tag severity:
@@ -128,7 +128,7 @@ roles:
 - Не выдумываешь подтверждения. Если не можешь проверить — честно ставь [Не проверено] или [Непроверяемо].

 КАК ОСТАВЛЯТЬ ЗАМЕЧАНИЯ
-Ты не редактируешь текст напрямую. Для каждого проблемного утверждения (ошибка, сомнение, непроверяемость) через MCP-инструмент выдели фрагмент и оставь комментарий; на верные факты комментарии не оставляй. В комментарии дай вердикт, исправление (если нужно) и источник. К вердикту [Неверно] всегда прикладывай готовое исправление как предложение-замену (параметр `suggestedText`): раз ты нашёл по источникам верное значение — сразу предлагай готовую правку, а не только описывай ошибку. Замена — это точный новый текст взамен выделенного фрагмента, обычным текстом без разметки; автор применит её одной кнопкой, не переписывая фрагмент вручную. Выделенный фрагмент должен встречаться в тексте ровно один раз; если он не уникален, расширь выделение контекстом. К вердиктам [Не проверено], [Непроверяемо] и [Это мнение] замену не прикладывай. Помечай важность:
+Ты не редактируешь текст напрямую. Для каждого проблемного утверждения (ошибка, сомнение, непроверяемость) через MCP-инструмент выдели фрагмент и оставь комментарий; на верные факты комментарии не оставляй. В комментарии дай вердикт, исправление (если нужно) и источник. К вердикту [Неверно] всегда прикладывай готовое исправление как предложение-замену (параметр `suggestedText`): раз ты нашёл по источникам верное значение — сразу предлагай готовую правку, а не только описывай ошибку. Замена — это точный новый текст взамен выделенного фрагмента, обычным текстом без разметки; автор применит её одной кнопкой, не переписывая фрагмент вручную. Выделенный фрагмент должен встречаться в тексте ровно один раз; если он не уникален, расширь выделение контекстом. Когда проверяемая цифра, имя, термин или версия встречается по тексту несколько раз, сначала одним вызовом search_in_page найди все вхождения, а затем ставь целевой комментарий на каждое — не читая страницу поблочно. К вердиктам [Не проверено], [Непроверяемо] и [Это мнение] замену не прикладывай. Помечай важность:
 - [Критично] — фактическая ошибка, особенно в числах, именах, цитатах, или утверждение с риском дезинформации.
 - [Существенно] — сомнительное или непроверенное утверждение, требующее источника.
 - [Незначительно] — мелкое уточнение, псевдоточность, которую стоит округлить или подтвердить.
@@ -170,7 +170,7 @@ roles:
 - Не вносишь содержательных изменений. Правки — минимальные и механические.

 КАК РАБОТАТЬ
-Пройди весь текст от начала до конца за один проход. Помечай КАЖДОЕ нарушение, включая все повторные вхождения одной и той же ошибки и мелочи с меткой [Незначительно], — не ограничивайся первыми несколькими или самыми заметными. Не подводи итог вместо разбора: пока не дошёл до конца документа, работа не закончена. Один прогон покрывает весь текст, а не «самое важное».
+Пройди весь текст от начала до конца за один проход. Помечай КАЖДОЕ нарушение, включая все повторные вхождения одной и той же ошибки и мелочи с меткой [Незначительно], — не ограничивайся первыми несколькими или самыми заметными. Не подводи итог вместо разбора: пока не дошёл до конца документа, работа не закончена. Один прогон покрывает весь текст, а не «самое важное». Для систематической ошибки, которая повторяется — прямые кавычки, «е» вместо «ё», дефис вместо тире, неединообразная единица или написание, — сначала одним вызовом search_in_page получи все вхождения, а затем оставь на каждом целевой комментарий с заменой, вместо поблочного просмотра.

 КАК ОСТАВЛЯТЬ ЗАМЕЧАНИЯ
 Ты не редактируешь текст напрямую. Для каждой правки через MCP-инструмент выдели фрагмент и оставь комментарий с конкретным исправлением. К каждой правке прикладывай предложение-замену (параметр `suggestedText`): точный исправленный текст взамен выделенного фрагмента, обычным текстом без разметки — автор применит его одной кнопкой. Выделенный фрагмент должен встречаться в тексте ровно один раз; если он не уникален, расширь выделение контекстом. НЕ оставляй сводных замечаний вида «во всём тексте заменить X на Y» или «привести единицы/кавычки/написание к единообразию»: такой комментарий нельзя применить кнопкой. Если одна и та же ошибка встречается в нескольких местах, обойди КАЖДОЕ вхождение и оставь на нём отдельный целевой комментарий со своей заменой — десять точечных правок вместо одной общей. Единственное исключение — замечание, которое в принципе невозможно выразить заменой конкретного фрагмента; такие редкие случаи оставляй обычным комментарием без замены. Помечай важность:
@@ -16,9 +16,9 @@ bundles:
 - slug: line-editor
 version: 4
 - slug: fact-checker
-version: 5
+version: 6
 - slug: proofreader
-version: 7
+version: 8
 - slug: narrator
 version: 2
 - id: research
@@ -1,7 +1,7 @@
 {
   "fact-checker": {
-    "version": 5,
-    "hash": "d7769872968109a1ccfb58d71bc3f3564a750b91766156f59031762848de4f24"
+    "version": 6,
+    "hash": "6bb22a9e5a5079b5cb287b5b26addbd36b9afeb7c9508287dcad9343fc53d685"
   },
   "line-editor": {
     "version": 4,
@@ -12,8 +12,8 @@
     "hash": "66fe653003b4f63ef3c3a5c5c48552fe47daeefffc16907c37c35f0e8da98851"
   },
   "proofreader": {
-    "version": 7,
-    "hash": "fdf8e0a443fa3c4102095e024146401363629a3f9015fb938c7bac2642825e56"
+    "version": 8,
+    "hash": "cef39fed321779631ddd1077fcba53399adf0e48b301df281c71eb042610900d"
   },
   "researcher": {
     "version": 1,
@@ -630,6 +630,16 @@ export class AiChatToolsService {
         async ({ pageId, nodeId }) => await client.getNode(pageId, nodeId),
       ),

+      searchInPage: sharedTool(
+        sharedToolSpecs.searchInPage,
+        async ({ pageId, query, regex, caseSensitive, limit }) =>
+          await client.searchInPage(pageId, query, {
+            regex,
+            caseSensitive,
+            limit,
+          }),
+      ),
+
       getTable: tool({
         description:
           'Read a table as a matrix of cell texts (plus a parallel cellIds ' +
@@ -55,6 +55,11 @@ export interface DocmostClientLike {
   getOutline(pageId: string): Promise<Record<string, unknown>>;
   getPageJson(pageId: string): Promise<Record<string, unknown>>;
   getNode(pageId: string, nodeId: string): Promise<Record<string, unknown>>;
+  searchInPage(
+    pageId: string,
+    query: string,
+    opts?: { regex?: boolean; caseSensitive?: boolean; limit?: number },
+  ): Promise<Record<string, unknown>>;
   getTable(pageId: string, tableRef: string): Promise<Record<string, unknown>>;
   listComments(pageId: string): Promise<unknown[]>;
   getComment(
@@ -0,0 +1,6 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.copy = copy;
+function copy(value) {
+    return JSON.parse(JSON.stringify(value));
+}
@@ -0,0 +1,18 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.getFromPath = getFromPath;
+/**
+ * get target value from json-pointer (e.g. /content/0/content)
+ * @param {AnyObject} obj object to resolve path into
+ * @param {string} path json-pointer
+ * @return {any} target value
+ */
+function getFromPath(obj, path) {
+    const pathParts = path.split("/");
+    pathParts.shift(); // remove root-entry
+    while (pathParts.length) {
+        const property = pathParts.shift();
+        obj = obj[property];
+    }
+    return obj;
+}
@@ -0,0 +1,27 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.getReplaceStep = getReplaceStep;
+const transform_1 = require("@tiptap/pm/transform");
+function getReplaceStep(fromDoc, toDoc) {
+    let start = toDoc.content.findDiffStart(fromDoc.content);
+    if (start === null) {
+        return false;
+    }
+    // @ts-ignore property access to content
+    let { a: endA, b: endB } = toDoc.content.findDiffEnd(fromDoc.content);
+    const overlap = start - Math.min(endA, endB);
+    if (overlap > 0) {
+        // If there is an overlap, there is some freedom of choice in how to calculate the
+        // start/end boundary. for an inserted/removed slice. We choose the extreme with
+        // the lowest depth value.
+        if (fromDoc.resolve(start - overlap).depth <
+            toDoc.resolve(endA + overlap).depth) {
+            start -= overlap;
+        }
+        else {
+            endA += overlap;
+            endB += overlap;
+        }
+    }
+    return new transform_1.ReplaceStep(start, endB, toDoc.slice(start, endA));
+}
@@ -0,0 +1,8 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.RecreateTransform = exports.recreateTransform = void 0;
+// https://gitlab.com/mpapp-public/prosemirror-recreate-steps - MIT
+// https://github.com/sueddeutsche/prosemirror-recreate-transform - MIT
+var recreateTransform_1 = require("./recreateTransform");
+Object.defineProperty(exports, "recreateTransform", { enumerable: true, get: function () { return recreateTransform_1.recreateTransform; } });
+Object.defineProperty(exports, "RecreateTransform", { enumerable: true, get: function () { return recreateTransform_1.RecreateTransform; } });
@@ -0,0 +1 @@
+{"type":"commonjs"}
@@ -0,0 +1,242 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.RecreateTransform = void 0;
+exports.recreateTransform = recreateTransform;
+const transform_1 = require("@tiptap/pm/transform");
+const rfc6902_1 = require("rfc6902");
+const diff_1 = require("diff");
+const getReplaceStep_1 = require("./getReplaceStep");
+const simplifyTransform_1 = require("./simplifyTransform");
+const removeMarks_1 = require("./removeMarks");
+const getFromPath_1 = require("./getFromPath");
+const copy_1 = require("./copy");
+class RecreateTransform {
+    constructor(fromDoc, toDoc, options = {}) {
+        const o = {
+            complexSteps: true,
+            wordDiffs: false,
+            simplifyDiff: true,
+            ...options,
+        };
+        this.fromDoc = fromDoc;
+        this.toDoc = toDoc;
+        this.complexSteps = o.complexSteps; // Whether to return steps other than ReplaceSteps
+        this.wordDiffs = o.wordDiffs; // Whether to make text diffs cover entire words
+        this.simplifyDiff = o.simplifyDiff;
+        this.schema = fromDoc.type.schema;
+        this.tr = new transform_1.Transform(fromDoc);
+    }
+    init() {
+        if (this.complexSteps) {
+            // For First steps: we create versions of the documents without marks as
+            // these will only confuse the diffing mechanism and marks won't cause
+            // any mapping changes anyway.
+            this.currentJSON = (0, removeMarks_1.removeMarks)(this.fromDoc).toJSON();
+            this.finalJSON = (0, removeMarks_1.removeMarks)(this.toDoc).toJSON();
+            this.ops = (0, rfc6902_1.createPatch)(this.currentJSON, this.finalJSON);
+            this.recreateChangeContentSteps();
+            this.recreateChangeMarkSteps();
+        }
+        else {
+            // We don't differentiate between mark changes and other changes.
+            this.currentJSON = this.fromDoc.toJSON();
+            this.finalJSON = this.toDoc.toJSON();
+            this.ops = (0, rfc6902_1.createPatch)(this.currentJSON, this.finalJSON);
+            this.recreateChangeContentSteps();
+        }
+        if (this.simplifyDiff) {
+            this.tr = (0, simplifyTransform_1.simplifyTransform)(this.tr) || this.tr;
+        }
+        return this.tr;
+    }
+    /** convert json-diff to prosemirror steps */
+    recreateChangeContentSteps() {
+        // First step: find content changing steps.
+        let ops = [];
+        while (this.ops.length) {
+            // get next
+            let op = this.ops.shift();
+            ops.push(op);
+            let toDoc;
+            const afterStepJSON = (0, copy_1.copy)(this.currentJSON); // working document receiving patches
+            const pathParts = op.path.split("/");
+            // collect operations until we receive a valid document:
+            // apply ops-patches until a valid prosemirror document is retrieved,
+            // then try to create a transformation step or retry with next operation
+            while (toDoc == null) {
+                (0, rfc6902_1.applyPatch)(afterStepJSON, [op]);
+                try {
+                    toDoc = this.schema.nodeFromJSON(afterStepJSON);
+                    toDoc.check();
+                }
+                catch (error) {
+                    toDoc = null;
+                    if (this.ops.length > 0) {
+                        op = this.ops.shift();
+                        ops.push(op);
+                    }
+                    else {
+                        throw new Error(`No valid diff possible applying ${op.path}`);
+                    }
+                }
+            }
+            // apply operation (ignoring afterStepJSON)
+            if (this.complexSteps &&
+                ops.length === 1 &&
+                (pathParts.includes("attrs") || pathParts.includes("type"))) {
+                // Node markup is changing
+                this.addSetNodeMarkup(); // a lost update is ignored
+                ops = [];
+                // console.log("%cop", logStyle, "- update node", ops);
+            }
+            else if (ops.length === 1 &&
+                op.op === "replace" &&
+                pathParts[pathParts.length - 1] === "text") {
+                // Text is being replaced, we apply text diffing to find the smallest possible diffs.
+                this.addReplaceTextSteps(op, afterStepJSON);
+                ops = [];
+                // console.log("%cop", logStyle, "- replace", ops);
+            }
+            else if (this.addReplaceStep(toDoc, afterStepJSON)) {
+                // operations have been applied
+                ops = [];
+                // console.log("%cop", logStyle, "- other", ops);
+            }
+        }
+    }
+    /** update node with attrs and marks, may also change type */
+    addSetNodeMarkup() {
+        // first diff in document is supposed to be a node-change (in type and/or attributes)
+        // thus simply find the first change and apply a node change step, then recalculate the diff
+        // after updating the document
+        const fromDoc = this.schema.nodeFromJSON(this.currentJSON);
+        const toDoc = this.schema.nodeFromJSON(this.finalJSON);
+        const start = toDoc.content.findDiffStart(fromDoc.content);
+        // @note start is the same (first) position for current and target document
+        const fromNode = fromDoc.nodeAt(start);
+        const toNode = toDoc.nodeAt(start);
+        if (start != null) {
+            // @note this completly updates all attributes in one step, by completely replacing node
+            const nodeType = fromNode.type === toNode.type ? null : toNode.type;
+            try {
+                this.tr.setNodeMarkup(start, nodeType, toNode.attrs, toNode.marks);
+            }
+            catch (e) {
+                // if nodetypes differ, the updated node-type and contents might not be compatible
+                // with schema and requires a replace
+                if (nodeType && e.message.includes("Invalid content")) {
+                    // @todo add test-case for this scenario
+                    this.tr.replaceWith(start, start + fromNode.nodeSize, toNode);
+                }
+                else {
+                    throw e;
+                }
+            }
+            this.currentJSON = (0, removeMarks_1.removeMarks)(this.tr.doc).toJSON();
+            // setting the node markup may have invalidated the following ops, so we calculate them again.
+            this.ops = (0, rfc6902_1.createPatch)(this.currentJSON, this.finalJSON);
+            return true;
+        }
+        return false;
+    }
+    recreateChangeMarkSteps() {
+        // Now the documents should be the same, except their marks, so everything should map 1:1.
+        // Second step: Iterate through the toDoc and make sure all marks are the same in tr.doc
+        this.toDoc.descendants((tNode, tPos) => {
+            if (!tNode.isInline) {
+                return true;
+            }
+            this.tr.doc.nodesBetween(tPos, tPos + tNode.nodeSize, (fNode, fPos) => {
+                if (!fNode.isInline) {
+                    return true;
+                }
+                const from = Math.max(tPos, fPos);
+                const to = Math.min(tPos + tNode.nodeSize, fPos + fNode.nodeSize);
+                fNode.marks.forEach((nodeMark) => {
+                    if (!nodeMark.isInSet(tNode.marks)) {
+                        this.tr.removeMark(from, to, nodeMark);
+                    }
+                });
+                tNode.marks.forEach((nodeMark) => {
+                    if (!nodeMark.isInSet(fNode.marks)) {
+                        this.tr.addMark(from, to, nodeMark);
+                    }
+                });
+            });
+        });
+    }
+    /**
+     * retrieve and possibly apply replace-step based from doc changes
+     * From http://prosemirror.net/examples/footnote/
+     */
+    addReplaceStep(toDoc, afterStepJSON) {
+        const fromDoc = this.schema.nodeFromJSON(this.currentJSON);
+        const step = (0, getReplaceStep_1.getReplaceStep)(fromDoc, toDoc);
+        if (!step) {
+            return false;
+        }
+        else if (!this.tr.maybeStep(step).failed) {
+            this.currentJSON = afterStepJSON;
+            return true; // @change previously null
+        }
+        throw new Error("No valid step found.");
+    }
+    /** retrieve and possibly apply text replace-steps based from doc changes */
+    addReplaceTextSteps(op, afterStepJSON) {
+        // We find the position number of the first character in the string
+        const op1 = { ...op, value: "xx" };
+        const op2 = { ...op, value: "yy" };
+        const afterOP1JSON = (0, copy_1.copy)(this.currentJSON);
+        const afterOP2JSON = (0, copy_1.copy)(this.currentJSON);
+        (0, rfc6902_1.applyPatch)(afterOP1JSON, [op1]);
+        (0, rfc6902_1.applyPatch)(afterOP2JSON, [op2]);
+        const op1Doc = this.schema.nodeFromJSON(afterOP1JSON);
+        const op2Doc = this.schema.nodeFromJSON(afterOP2JSON);
+        // get text diffs
+        const finalText = op.value;
+        const currentText = (0, getFromPath_1.getFromPath)(this.currentJSON, op.path);
+        const textDiffs = this.wordDiffs
+            ? (0, diff_1.diffWordsWithSpace)(currentText, finalText)
+            : (0, diff_1.diffChars)(currentText, finalText);
+        let offset = op1Doc.content.findDiffStart(op2Doc.content);
+        const marks = op1Doc.resolve(offset + 1).marks();
+        while (textDiffs.length) {
+            const diff = textDiffs.shift();
+            if (diff.added) {
+                const textNode = this.schema
+                    .nodeFromJSON({ type: "text", text: diff.value })
+                    .mark(marks);
+                if (textDiffs.length && textDiffs[0].removed) {
+                    const nextDiff = textDiffs.shift();
+                    this.tr.replaceWith(offset, offset + nextDiff.value.length, textNode);
+                }
+                else {
+                    this.tr.insert(offset, textNode);
+                }
+                offset += diff.value.length;
+            }
+            else if (diff.removed) {
+                if (textDiffs.length && textDiffs[0].added) {
+                    const nextDiff = textDiffs.shift();
+                    const textNode = this.schema
+                        .nodeFromJSON({ type: "text", text: nextDiff.value })
+                        .mark(marks);
+                    this.tr.replaceWith(offset, offset + diff.value.length, textNode);
+                    offset += nextDiff.value.length;
+                }
+                else {
+                    this.tr.delete(offset, offset + diff.value.length);
+                }
+            }
+            else {
+                offset += diff.value.length;
+            }
+        }
+        this.currentJSON = afterStepJSON;
+    }
+}
+exports.RecreateTransform = RecreateTransform;
+function recreateTransform(fromDoc, toDoc, options = {}) {
+    const recreator = new RecreateTransform(fromDoc, toDoc, options);
+    return recreator.init();
+}
@@ -0,0 +1,118 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+const vitest_1 = require("vitest");
+const schema_basic_1 = require("@tiptap/pm/schema-basic");
+const transform_1 = require("@tiptap/pm/transform");
+const recreateTransform_1 = require("./recreateTransform");
+/**
+ * recreateTransform diffs two documents and produces ProseMirror steps that turn
+ * `fromDoc` into `toDoc`. It is the backbone of collaborative/version diffing, so
+ * THE invariant that matters is: replaying the produced steps on `fromDoc` must
+ * reproduce `toDoc` exactly. Every test below re-applies the steps onto a fresh
+ * Transform seeded from `fromDoc` (not just trusting `tr.doc`) and asserts node
+ * equality with `.eq()`. If a regression makes any step wrong, the round-trip
+ * breaks and the test fails.
+ */
+// Real ProseMirror schema (the standard basic schema) with paragraph/heading +
+// strong/em marks — the same primitives the editor diffs in production.
+const doc = (...c) => schema_basic_1.schema.node("doc", null, c);
+const p = (...c) => schema_basic_1.schema.node("paragraph", null, c.length ? c : undefined);
+const h = (level, ...c) => schema_basic_1.schema.node("heading", { level }, c);
+const t = (text, ...marks) => schema_basic_1.schema.text(text, marks.length ? marks : undefined);
+const strong = schema_basic_1.schema.marks.strong.create();
+const em = schema_basic_1.schema.marks.em.create();
+// Replay the diff's steps onto a fresh Transform built from `fromDoc`. This is
+// the faithful "apply(diff) == target" check — it exercises the actual Step
+// objects rather than the transform's internal accumulated doc.
+function applyDiff(fromDoc, toDoc, options) {
+    const tr = (0, recreateTransform_1.recreateTransform)(fromDoc, toDoc, options);
+    const replay = new transform_1.Transform(fromDoc);
+    tr.steps.forEach((s) => {
+        const result = replay.maybeStep(s);
+        if (result.failed)
+            throw new Error(`step failed: ${result.failed}`);
+    });
+    return replay.doc;
+}
+(0, vitest_1.describe)("recreateTransform round-trip (apply(diff) == target)", () => {
+    (0, vitest_1.it)("reconstructs the target on plain text insertion", () => {
+        // Inserting " world" must yield exactly the target paragraph.
+        const from = doc(p(t("hello")));
+        const to = doc(p(t("hello world")));
+        (0, vitest_1.expect)(applyDiff(from, to).eq(to)).toBe(true);
+    });
+    (0, vitest_1.it)("reconstructs the target on text deletion", () => {
+        // Deleting a trailing word is the inverse of insertion and must round-trip.
+        const from = doc(p(t("hello world")));
+        const to = doc(p(t("hello")));
+        (0, vitest_1.expect)(applyDiff(from, to).eq(to)).toBe(true);
+    });
+    (0, vitest_1.it)("reconstructs the target when a word is replaced mid-string", () => {
+        // A char-level replace in the middle must not corrupt the surrounding text.
+        const from = doc(p(t("the quick brown fox")));
+        const to = doc(p(t("the slow brown fox")));
+        (0, vitest_1.expect)(applyDiff(from, to).eq(to)).toBe(true);
+    });
+    (0, vitest_1.it)("reconstructs the target when a mark is added (complexSteps path)", () => {
+        // Mark-only changes are diffed in a separate pass; the bolded run must match.
+        const from = doc(p(t("hello")));
+        const to = doc(p(t("hello", strong)));
+        const out = applyDiff(from, to);
+        (0, vitest_1.expect)(out.eq(to)).toBe(true);
+        // Sanity: the produced doc actually carries the strong mark.
+        (0, vitest_1.expect)(out.firstChild.firstChild.marks.length).toBe(1);
+    });
+    (0, vitest_1.it)("reconstructs the target when a mark is removed", () => {
+        // Removing the only mark must leave the same text with no marks.
+        const from = doc(p(t("hello", strong)));
+        const to = doc(p(t("hello")));
+        const out = applyDiff(from, to);
+        (0, vitest_1.expect)(out.eq(to)).toBe(true);
+        (0, vitest_1.expect)(out.firstChild.firstChild.marks.length).toBe(0);
+    });
+    (0, vitest_1.it)("reconstructs the target on a paragraph split into two blocks", () => {
+        // Structural change (one block -> two) must replay as valid replace steps.
+        const from = doc(p(t("hello world")));
+        const to = doc(p(t("hello")), p(t("world")));
+        const out = applyDiff(from, to);
+        (0, vitest_1.expect)(out.eq(to)).toBe(true);
+        (0, vitest_1.expect)(out.childCount).toBe(2);
+    });
+    (0, vitest_1.it)("reconstructs the target on a node-type change (paragraph -> heading)", () => {
+        // Type/attrs changes drive the setNodeMarkup branch; the node must become a
+        // heading while keeping its text.
+        const from = doc(p(t("hello")));
+        const to = doc(h(1, t("hello")));
+        const out = applyDiff(from, to);
+        (0, vitest_1.expect)(out.eq(to)).toBe(true);
+        (0, vitest_1.expect)(out.firstChild.type.name).toBe("heading");
+    });
+    (0, vitest_1.it)("reconstructs a combined structural + mark change", () => {
+        // Several diff kinds at once (new block + italic run) still round-trips.
+        const from = doc(p(t("alpha")));
+        const to = doc(p(t("alpha")), p(t("beta", em)));
+        const out = applyDiff(from, to);
+        (0, vitest_1.expect)(out.eq(to)).toBe(true);
+    });
+    (0, vitest_1.it)("produces an empty step list for identical documents", () => {
+        // No diff => no work; spurious steps would mean wasted/incorrect history.
+        const from = doc(p(t("same")));
+        const to = doc(p(t("same")));
+        const tr = (0, recreateTransform_1.recreateTransform)(from, to);
+        (0, vitest_1.expect)(tr.steps.length).toBe(0);
+        (0, vitest_1.expect)(tr.doc.eq(to)).toBe(true);
+    });
+    (0, vitest_1.it)("round-trips with complexSteps:false (marks diffed as replaces)", () => {
+        // With complexSteps off, mark changes are folded into replace steps rather
+        // than dedicated mark steps — the result must still equal the target.
+        const from = doc(p(t("hello")));
+        const to = doc(p(t("hello", strong)));
+        (0, vitest_1.expect)(applyDiff(from, to, { complexSteps: false }).eq(to)).toBe(true);
+    });
+    (0, vitest_1.it)("round-trips with wordDiffs:true (whole-word text diffing)", () => {
|
||||
// wordDiffs changes the granularity of the text diff, not the outcome.
|
||||
const from = doc(p(t("the quick brown fox")));
|
||||
const to = doc(p(t("the quick red fox")));
|
||||
(0, vitest_1.expect)(applyDiff(from, to, { wordDiffs: true }).eq(to)).toBe(true);
|
||||
});
|
||||
});
@@ -0,0 +1,9 @@
"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
exports.removeMarks = removeMarks;
const transform_1 = require("@tiptap/pm/transform");
function removeMarks(doc) {
    const tr = new transform_1.Transform(doc);
    tr.removeMark(0, doc.nodeSize - 2);
    return tr.doc;
}
@@ -0,0 +1,27 @@
"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
exports.simplifyTransform = simplifyTransform;
const transform_1 = require("@tiptap/pm/transform");
const getReplaceStep_1 = require("./getReplaceStep");
// join adjacent ReplaceSteps
function simplifyTransform(tr) {
    if (!tr.steps.length) {
        return undefined;
    }
    const newTr = new transform_1.Transform(tr.docs[0]);
    const oldSteps = tr.steps.slice();
    while (oldSteps.length) {
        let step = oldSteps.shift();
        while (oldSteps.length && step.merge(oldSteps[0])) {
            const addedStep = oldSteps.shift();
            if (step instanceof transform_1.ReplaceStep && addedStep instanceof transform_1.ReplaceStep) {
                step = (0, getReplaceStep_1.getReplaceStep)(newTr.doc, addedStep.apply(step.apply(newTr.doc).doc).doc);
            }
            else {
                step = step.merge(addedStep);
            }
        }
        newTr.step(step);
    }
    return newTr;
}
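Review note: the queue loop in `simplifyTransform` can be hard to follow without ProseMirror at hand. The sketch below mirrors only its control flow on a toy step type. `ToyStep`, `tryMerge` and `simplify` are hypothetical names introduced for illustration; the real function works on ProseMirror steps, recomputes merged `ReplaceStep`s through `getReplaceStep`, and returns `undefined` for an empty transform.

```typescript
// Toy step (hypothetical, not ProseMirror's): replace [from, to) with `text`.
interface ToyStep {
  from: number;
  to: number;
  text: string;
}

// Assumption for this sketch: two steps merge only when the second is a pure
// insertion placed exactly at the end of the text the first one inserted.
function tryMerge(a: ToyStep, b: ToyStep): ToyStep | null {
  const endOfInsert = a.from + a.text.length;
  if (b.from !== endOfInsert || b.to !== b.from) return null;
  return { from: a.from, to: a.to, text: a.text + b.text };
}

// Same shape as the loop above: take a step, keep folding the following
// steps into it while they merge, then emit the accumulated step.
function simplify(steps: ToyStep[]): ToyStep[] {
  const out: ToyStep[] = [];
  const queue = steps.slice();
  while (queue.length > 0) {
    let step = queue.shift() as ToyStep;
    for (;;) {
      const next = queue[0];
      if (next === undefined) break;
      const merged = tryMerge(step, next);
      if (merged === null) break;
      queue.shift();
      step = merged;
    }
    out.push(step);
  }
  return out;
}

// Typing "a", "b", "c" at position 5, then an unrelated insert at 20.
const simplified = simplify([
  { from: 5, to: 5, text: "a" },
  { from: 6, to: 6, text: "b" },
  { from: 7, to: 7, text: "c" },
  { from: 20, to: 20, text: "x" },
]);
// simplified has two steps: one inserting "abc" at 5, one inserting "x" at 20.
```

The three keystrokes collapse into a single step while the distant insert stays separate, which is the effect the "join adjacent ReplaceSteps" comment describes.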
@@ -0,0 +1,2 @@
"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
@@ -13,6 +13,7 @@ import { footnoteWarningsField } from "./lib/footnote-analyze.js";
import { buildPageTree } from "./lib/tree.js";
import { serializeDocmostMarkdown, parseDocmostMarkdown, } from "./lib/markdown-document.js";
import { replaceNodeById, deleteNodeById, assertUnambiguousMatch, insertNodeRelative, buildOutline, getNodeByRef, readTable, insertTableRow, deleteTableRow, updateTableCell, } from "./lib/node-ops.js";
import { searchInDoc } from "./lib/page-search.js";
import { withPageLock } from "./lib/page-lock.js";
import { applyTextEdits, } from "./lib/json-edit.js";
import { getCollabToken, performLogin } from "./lib/auth-utils.js";
@@ -872,6 +873,22 @@ export class DocmostClient {
            node: hit.node,
        };
    }
    /**
     * Find every occurrence of `query` on a page IN MEMORY, over the plain text of
     * each text container (reusing the same `getPageRaw` fetch as the other read
     * tools) — no server search endpoint, no whole-document round-trip through the
     * model. Returns `{ total, truncated, matches }`; each match carries a ref the
     * agent can hand straight to get_node/patch_node or a comment anchor, plus the
     * top-level block index and a short context window to build a unique selection.
     * The pure engine (`searchInDoc`) owns the traversal, glue, ReDoS guards and
     * the empty-query / invalid-regex errors.
     */
    async searchInPage(pageId, query, opts = {}) {
        await this.ensureAuthenticated();
        const data = await this.getPageRaw(pageId);
        const result = searchInDoc(data.content ?? { type: "doc", content: [] }, query, opts);
        return { pageId, query, ...result };
    }
    /**
     * Read a table as a matrix. `tableRef` is `#<index>` (from get_outline) or a
     * block id of any node inside the table. Returns the cell texts plus a
@@ -35,7 +35,7 @@ const VERSION = packageJson.version;
// name is not mentioned below (see its EXCEPTIONS list for the rare opt-outs).
// Exported for that test.
export const SERVER_INSTRUCTIONS = "Docmost editing guide — choose the tool by intent.\n" +
"READ: find a page -> search (workspace-wide full-text); list -> list_pages / list_spaces. Locate blocks and their ids CHEAPLY -> get_outline (compact top-level map; start here, not get_page_json). One block's subtree -> get_node (by attrs.id, or \"#<index>\" for tables, which carry no id). Whole page -> get_page (Markdown, lossy; inline <span data-comment-id> tags are comment anchors — markup, not text) or get_page_json (lossless ProseMirror with block ids). Hand a huge page (with images) to an external consumer without pulling it through the model context -> stash_page (returns a short-lived anonymous URL).\n" +
"READ: find a page -> search (workspace-wide full-text); list -> list_pages / list_spaces. Locate blocks and their ids CHEAPLY -> get_outline (compact top-level map; start here, not get_page_json). One block's subtree -> get_node (by attrs.id, or \"#<index>\" for tables, which carry no id). Find every occurrence of a string/regex ON a page (and where each is) -> search_in_page, NOT block-by-block get_node — it returns each hit's node ref + block index + context for a targeted comment. Whole page -> get_page (Markdown, lossy; inline <span data-comment-id> tags are comment anchors — markup, not text) or get_page_json (lossless ProseMirror with block ids). Hand a huge page (with images) to an external consumer without pulling it through the model context -> stash_page (returns a short-lived anonymous URL).\n" +
"EDIT: fix wording/typos/numbers -> edit_page_text (find/replace inside blocks, no node id needed). Change ONE block (paragraph/heading/callout/etc.) structurally -> patch_node (by attrs.id from get_outline). Add a block -> insert_node (before/after a block by attrs.id or by anchor text, or append). Remove a block -> delete_node (by attrs.id). Tables -> table_get / table_update_cell / table_insert_row / table_delete_row (address by \"#<index>\" from get_outline; table nodes have no attrs.id). Images -> insert_image (add from a web URL) / replace_image (swap an existing image). Footnotes -> insert_footnote. Bulk/structural rewrite -> update_page_json (full ProseMirror replace; prefer the granular tools above to avoid resending the whole ~100KB+ document). Complex/scripted rewrite (multiple coordinated edits, renumbering) -> docmost_transform: write a JS `(doc, ctx) => doc` transform, preview the diff with dryRun (default), then apply with dryRun:false; ctx.helpers includes commentsToFootnotes for turning inline comments into numbered footnotes.\n" +
"PAGES: new -> create_page (Markdown). Rename (title only) -> rename_page. Move -> move_page. Delete -> delete_page (SOFT delete — the page goes to trash and is restorable; nothing is permanent). Copy/replace a page's whole content from another page (server-side, no document through the model) -> copy_page_content. Sharing -> share_page / unshare_page / list_shares; share_page makes the page PUBLICLY accessible — do it only when explicitly asked.\n" +
"COMMENTS: create_comment is always inline and requires an EXACT selection — contiguous text from a single block, <=250 chars (fails rather than leaving an unanchored comment); reply to a thread via parentCommentId. Propose a concrete text fix for one-click human approval -> create_comment with suggestedText (the exact plain-text replacement for the selection; the selection must then be UNIQUE in the page — extend it with context if needed); prefer this over editing directly when the change is subjective or needs the author's sign-off. Manage -> list_comments, update_comment, resolve_comment (resolve/reopen, reversible — prefer over delete to close), delete_comment, check_new_comments.\n" +
@@ -141,6 +141,15 @@ export function createDocmostMcpServer(config) {
        const result = await docmostClient.getNode(pageId, nodeId);
        return jsonContent(result);
    });
    // Tool: search_in_page
    registerShared(SHARED_TOOL_SPECS.searchInPage, async ({ pageId, query, regex, caseSensitive, limit }) => {
        const result = await docmostClient.searchInPage(pageId, query, {
            regex,
            caseSensitive,
            limit,
        });
        return jsonContent(result);
    });
    // Tool: table_get
    server.registerTool("table_get", {
        description: "Read a table as a matrix. Returns {rows, cols, cells (text[][]), " +
@@ -0,0 +1,169 @@
/**
 * Pure, network-free in-page search over a ProseMirror/TipTap document tree.
 *
 * `searchInDoc(doc, query, opts)` finds every occurrence of a literal substring
 * (default) or a regular expression across the page's TEXT CONTAINERS and
 * reports WHERE each match is — the container's ref (usable verbatim with
 * get_node/patch_node and comment anchoring), the top-level block index, and a
 * short context window around the hit. It never touches the network, the DB, or
 * the schema mirror; like `comment-anchor.ts`, it can be tested in isolation.
 *
 * WHY plain text (not markdown): each container's inline text is glued into ONE
 * string via `blockPlainText`, so a match survives inline-mark boundaries
 * (bold/italic/link splits that fracture a run like "т.е." into several text
 * nodes) and comment-anchor spans never clutter the haystack.
 *
 * The SEARCH UNIT is a text container: a node whose direct children include
 * text nodes (a paragraph/heading, or the paragraph inside a table cell / list
 * item). ProseMirror keeps block vs. inline content exclusive, so a container
 * never nests another container — the walk reaches each cell/item's own text and
 * the context window is naturally scoped to that specific cell/item, not the
 * whole top-level block's glued text.
 */
import { blockPlainText } from "./node-ops.js";
/** True if `value` is a non-null plain object (and not an array). */
function isObject(value) {
    return value != null && typeof value === "object" && !Array.isArray(value);
}
/**
 * A text container is a node with a `content` array holding at least one text
 * node (a child with a string `text`). These are the paragraphs/headings whose
 * glued inline text we search.
 */
function isTextContainer(node) {
    return (isObject(node) &&
        Array.isArray(node.content) &&
        node.content.some((c) => isObject(c) && typeof c.text === "string"));
}
// Result-size defaults/ceiling.
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200;
// Context window on each side of a match.
const CONTEXT = 40;
// Anti-ReDoS guards. JS regex is not interruptible, so a pathological pattern
// on a large input can wedge the event loop; we bound BOTH inputs by size (not
// a timeout). These also bound the literal engine's work.
const MAX_PATTERN_LENGTH = 1000; // cap the query/pattern length
const MAX_CONTAINER_TEXT = 100_000; // cap the text scanned per container
/** Clamp the requested limit into [1, MAX_LIMIT], defaulting when absent. */
function resolveLimit(limit) {
    const n = typeof limit === "number" && Number.isFinite(limit) ? limit : DEFAULT_LIMIT;
    return Math.min(MAX_LIMIT, Math.max(1, Math.floor(n)));
}
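Review note: a few worked inputs make the clamp easier to check by eye. The block restates `resolveLimit` from this hunk with the same constants and runs it over edge cases; `clampSamples` is a name introduced here only for illustration. One behaviour worth noticing is that non-finite values (`NaN`, `Infinity`) fall back to the default of 50, not to the ceiling.

```typescript
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200;

// Same logic as the hunk above: default when absent or non-finite,
// then floor and clamp into [1, MAX_LIMIT].
function resolveLimit(limit?: number): number {
  const n = typeof limit === "number" && Number.isFinite(limit) ? limit : DEFAULT_LIMIT;
  return Math.min(MAX_LIMIT, Math.max(1, Math.floor(n)));
}

const clampSamples = {
  absent: resolveLimit(undefined), // 50
  zero: resolveLimit(0), // 1
  negative: resolveLimit(-3), // 1
  fractional: resolveLimit(2.9), // 2
  atCeiling: resolveLimit(200), // 200
  overCeiling: resolveLimit(500), // 200
  notANumber: resolveLimit(NaN), // 50
  infinite: resolveLimit(Infinity), // 50
};
```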
/**
 * Yield the [start, length] of every match of `query` in `text`, in
 * order. A literal engine uses indexOf (case-folded unless caseSensitive); a regex
 * engine uses a global RegExp. Zero-length regex matches (e.g. `\b`, `a*`) are
 * SKIPPED and lastIndex is advanced, so a pattern that can match the empty
 * string cannot flood the results or spin forever.
 */
function* eachMatch(text, query, re, caseSensitive) {
    if (re) {
        re.lastIndex = 0;
        let m;
        while ((m = re.exec(text)) != null) {
            const len = m[0].length;
            if (len === 0) {
                // Empty match: advance past this position and do not record it.
                re.lastIndex = m.index + 1;
                continue;
            }
            yield [m.index, len];
        }
        return;
    }
    // Literal engine. For case-insensitive search, fold BOTH sides only to locate
    // the indices; the reported match/context are always sliced from the original
    // text so the caller gets the real casing (needed to build a unique selection).
    const haystack = caseSensitive ? text : text.toLowerCase();
    const needle = caseSensitive ? query : query.toLowerCase();
    const len = needle.length;
    let from = 0;
    for (;;) {
        const idx = haystack.indexOf(needle, from);
        if (idx === -1)
            return;
        yield [idx, len];
        from = idx + len;
    }
}
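Review note: the zero-length guard is easiest to verify on a pattern that matches the empty string almost everywhere. The sketch below is a standalone restatement of the regex branch, not the shipped generator; `collectNonEmpty` is a hypothetical helper name.

```typescript
// Collect [index, matchedText] for every non-empty match of a global regex.
function collectNonEmpty(text: string, pattern: string): Array<[number, string]> {
  const re = new RegExp(pattern, "g");
  const hits: Array<[number, string]> = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      // An empty match leaves lastIndex where it was; step forward by hand,
      // otherwise exec would return the same empty match forever.
      re.lastIndex = m.index + 1;
      continue;
    }
    hits.push([m.index, m[0]]);
  }
  return hits;
}

// `a*` matches "" at index 0, "aaa" at index 1, and "" at indices 4 and 5.
const starHits = collectNonEmpty("baaac", "a*"); // [[1, "aaa"]]

// `\b` only ever matches the empty string, so nothing is recorded.
const boundaryHits = collectNonEmpty("a b", "\\b"); // []
```

Without the manual `lastIndex` bump the first call would never terminate, since `exec` keeps returning the empty match at index 0.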
/**
 * Search a ProseMirror document for `query` and return `{ total, truncated,
 * matches }`. `total` counts EVERY occurrence (even beyond the limit) and
 * `truncated` flags when the returned list was capped — nothing is silently
 * dropped.
 *
 * Throws a clear, model-actionable error (never a generic failure) on: an
 * empty/whitespace-only query, an over-long pattern, or — with `regex:true` —
 * an invalid RegExp, so the agent can fix its input.
 */
export function searchInDoc(doc, query, opts = {}) {
    // --- edge-case guards (fail loudly so the agent can correct the call) ---
    if (typeof query !== "string" || query.trim().length === 0) {
        throw new Error("search_in_page: query is empty — pass the text (or regex) to look for.");
    }
    if (query.length > MAX_PATTERN_LENGTH) {
        throw new Error(`search_in_page: query is too long (${query.length} chars; max ${MAX_PATTERN_LENGTH}). Shorten the search text/pattern.`);
    }
    const caseSensitive = opts.caseSensitive === true;
    const limit = resolveLimit(opts.limit);
    // Compile the regex up front so an invalid pattern is a clean tool error
    // rather than a failure deep in the traversal.
    let re = null;
    if (opts.regex === true) {
        try {
            re = new RegExp(query, caseSensitive ? "g" : "gi");
        }
        catch (e) {
            throw new Error(`search_in_page: invalid regular expression: ${e instanceof Error ? e.message : String(e)}`);
        }
    }
    const matches = [];
    let total = 0;
    const topLevel = isObject(doc) && Array.isArray(doc.content) ? doc.content : [];
    // Descend a top-level block, collecting matches from every text container
    // within it. blockIndex/topRef stay pinned to the enclosing top-level block.
    const descend = (node, blockIndex, topRef) => {
        if (!isObject(node))
            return;
        if (isTextContainer(node)) {
            // Glue this container's inline text into one string (mark-safe) and cap it
            // so a single non-interruptible regex exec can never run on an unbounded
            // input.
            let text = blockPlainText(node);
            if (text.length > MAX_CONTAINER_TEXT) {
                text = text.slice(0, MAX_CONTAINER_TEXT);
            }
            // The container's own id addresses it verbatim in get_node/patch_node and
            // comment anchoring; a container with no id (e.g. a table-cell paragraph)
            // falls back to the top-level block's #<index>.
            const id = isObject(node.attrs) && typeof node.attrs.id === "string" && node.attrs.id.length > 0
                ? node.attrs.id
                : topRef;
            for (const [idx, len] of eachMatch(text, query, re, caseSensitive)) {
                total++;
                if (matches.length < limit) {
                    matches.push({
                        nodeId: id,
                        blockIndex,
                        type: node.type,
                        before: text.slice(Math.max(0, idx - CONTEXT), idx),
                        match: text.slice(idx, idx + len),
                        after: text.slice(idx + len, idx + len + CONTEXT),
                    });
                }
            }
            // A text container holds inline content only — no nested containers to
            // recurse into.
            return;
        }
        if (Array.isArray(node.content)) {
            for (const child of node.content)
                descend(child, blockIndex, topRef);
        }
    };
    for (let i = 0; i < topLevel.length; i++) {
        descend(topLevel[i], i, `#${i}`);
    }
    return { total, truncated: total > matches.length, matches };
}
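Review note: the reason for searching glued container text instead of individual text nodes can be shown on a single paragraph. In the sketch below, `PMNode` and `gluePlainText` are hypothetical stand-ins; the real `blockPlainText` lives in `node-ops` and is not part of this diff, so its exact behaviour is assumed here to be a plain concatenation of the direct text children.

```typescript
// Minimal node shape for the sketch (assumed, not the project's type).
interface PMNode {
  type?: string;
  text?: string;
  attrs?: { id?: string };
  marks?: Array<{ type: string }>;
  content?: PMNode[];
}

// Hypothetical stand-in for blockPlainText: join the direct text children.
function gluePlainText(node: PMNode): string {
  return (node.content ?? [])
    .map((child) => (typeof child.text === "string" ? child.text : ""))
    .join("");
}

// "т.е." straddles a bold run and a plain run, so it lives in two text nodes.
const paragraph: PMNode = {
  type: "paragraph",
  attrs: { id: "p1" },
  content: [
    { type: "text", text: "foo т.", marks: [{ type: "bold" }] },
    { type: "text", text: "е. bar" },
  ],
};

// Node-by-node search misses the run entirely.
const perNodeHit = (paragraph.content ?? []).some((child) =>
  (child.text ?? "").includes("т.е."),
);

// Container-level search over the glued string finds it at offset 4.
const glued = gluePlainText(paragraph); // "foo т.е. bar"
const gluedIndex = glued.indexOf("т.е.");

// The ref follows the rule in the hunk: own attrs.id, else "#<topLevelIndex>".
const blockIndex = 0;
const ref = paragraph.attrs?.id ?? `#${blockIndex}`; // "p1"
```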
@@ -74,6 +74,48 @@ export const SHARED_TOOL_SPECS = {
      nodeId: z.string().min(1),
    }),
  },
  // --- in-page occurrence search (client-side, over ProseMirror plain text) ---
  searchInPage: {
    mcpName: 'search_in_page',
    inAppKey: 'searchInPage',
    description: 'Find every occurrence of a string (or regex) INSIDE one page and get ' +
      'WHERE each is — instead of pulling blocks one-by-one with get_node. ' +
      'Searches the plain text of each text block/cell (marks glued, so a match ' +
      'survives bold/italic/link splits; comment anchors do not interfere). ' +
      'Returns { total, truncated, matches:[{ nodeId, blockIndex, type, before, ' +
      'match, after }] }: `nodeId` is the block id (or "#<index>" for ' +
      'table/cell content) — pass it straight to get_node/patch_node or as a ' +
      'comment anchor; `blockIndex` is the get_outline index; `before`/`after` ' +
      'give ~40 chars of context to build a unique selection. `total` counts all ' +
      'hits and `truncated` is true when more than `limit` were found (nothing ' +
      'is silently dropped). Default is a literal, case-INSENSITIVE substring; ' +
      'set regex:true for a JS regular expression (char classes, word ' +
      'boundaries) and caseSensitive:true to match case. Ideal for systematic ' +
      'editorial sweeps (unquoted "ё", straight quotes, "т.е.", stray units). An ' +
      'invalid regex or an empty query returns a clear error to fix.',
    buildShape: (z) => ({
      pageId: z.string().min(1).describe('ID of the page to search'),
      query: z
        .string()
        .min(1)
        .describe('The text to find (a literal substring, or a regex when regex:true)'),
      regex: z
        .boolean()
        .optional()
        .describe('Treat query as a JS regular expression (default false).'),
      caseSensitive: z
        .boolean()
        .optional()
        .describe('Case-sensitive matching (default false).'),
      limit: z
        .number()
        .int()
        .min(1)
        .max(200)
        .optional()
        .describe('Max matches to RETURN (default 50, max 200); total is always reported.'),
    }),
  },
  // --- node delete ---
  deleteNode: {
    mcpName: 'delete_node',
@@ -47,6 +47,7 @@ import {
  deleteTableRow,
  updateTableCell,
} from "./lib/node-ops.js";
import { searchInDoc, SearchOptions } from "./lib/page-search.js";
import { withPageLock } from "./lib/page-lock.js";
import {
  applyTextEdits,
@@ -1093,6 +1094,27 @@ export class DocmostClient {
    };
  }

  /**
   * Find every occurrence of `query` on a page IN MEMORY, over the plain text of
   * each text container (reusing the same `getPageRaw` fetch as the other read
   * tools) — no server search endpoint, no whole-document round-trip through the
   * model. Returns `{ total, truncated, matches }`; each match carries a ref the
   * agent can hand straight to get_node/patch_node or a comment anchor, plus the
   * top-level block index and a short context window to build a unique selection.
   * The pure engine (`searchInDoc`) owns the traversal, glue, ReDoS guards and
   * the empty-query / invalid-regex errors.
   */
  async searchInPage(pageId: string, query: string, opts: SearchOptions = {}) {
    await this.ensureAuthenticated();
    const data = await this.getPageRaw(pageId);
    const result = searchInDoc(
      data.content ?? { type: "doc", content: [] },
      query,
      opts,
    );
    return { pageId, query, ...result };
  }

  /**
   * Read a table as a matrix. `tableRef` is `#<index>` (from get_outline) or a
   * block id of any node inside the table. Returns the cell texts plus a
@@ -46,7 +46,7 @@ const VERSION = packageJson.version;
// Exported for that test.
export const SERVER_INSTRUCTIONS =
"Docmost editing guide — choose the tool by intent.\n" +
"READ: find a page -> search (workspace-wide full-text); list -> list_pages / list_spaces. Locate blocks and their ids CHEAPLY -> get_outline (compact top-level map; start here, not get_page_json). One block's subtree -> get_node (by attrs.id, or \"#<index>\" for tables, which carry no id). Whole page -> get_page (Markdown, lossy; inline <span data-comment-id> tags are comment anchors — markup, not text) or get_page_json (lossless ProseMirror with block ids). Hand a huge page (with images) to an external consumer without pulling it through the model context -> stash_page (returns a short-lived anonymous URL).\n" +
"READ: find a page -> search (workspace-wide full-text); list -> list_pages / list_spaces. Locate blocks and their ids CHEAPLY -> get_outline (compact top-level map; start here, not get_page_json). One block's subtree -> get_node (by attrs.id, or \"#<index>\" for tables, which carry no id). Find every occurrence of a string/regex ON a page (and where each is) -> search_in_page, NOT block-by-block get_node — it returns each hit's node ref + block index + context for a targeted comment. Whole page -> get_page (Markdown, lossy; inline <span data-comment-id> tags are comment anchors — markup, not text) or get_page_json (lossless ProseMirror with block ids). Hand a huge page (with images) to an external consumer without pulling it through the model context -> stash_page (returns a short-lived anonymous URL).\n" +
"EDIT: fix wording/typos/numbers -> edit_page_text (find/replace inside blocks, no node id needed). Change ONE block (paragraph/heading/callout/etc.) structurally -> patch_node (by attrs.id from get_outline). Add a block -> insert_node (before/after a block by attrs.id or by anchor text, or append). Remove a block -> delete_node (by attrs.id). Tables -> table_get / table_update_cell / table_insert_row / table_delete_row (address by \"#<index>\" from get_outline; table nodes have no attrs.id). Images -> insert_image (add from a web URL) / replace_image (swap an existing image). Footnotes -> insert_footnote. Bulk/structural rewrite -> update_page_json (full ProseMirror replace; prefer the granular tools above to avoid resending the whole ~100KB+ document). Complex/scripted rewrite (multiple coordinated edits, renumbering) -> docmost_transform: write a JS `(doc, ctx) => doc` transform, preview the diff with dryRun (default), then apply with dryRun:false; ctx.helpers includes commentsToFootnotes for turning inline comments into numbered footnotes.\n" +
"PAGES: new -> create_page (Markdown). Rename (title only) -> rename_page. Move -> move_page. Delete -> delete_page (SOFT delete — the page goes to trash and is restorable; nothing is permanent). Copy/replace a page's whole content from another page (server-side, no document through the model) -> copy_page_content. Sharing -> share_page / unshare_page / list_shares; share_page makes the page PUBLICLY accessible — do it only when explicitly asked.\n" +
"COMMENTS: create_comment is always inline and requires an EXACT selection — contiguous text from a single block, <=250 chars (fails rather than leaving an unanchored comment); reply to a thread via parentCommentId. Propose a concrete text fix for one-click human approval -> create_comment with suggestedText (the exact plain-text replacement for the selection; the selection must then be UNIQUE in the page — extend it with context if needed); prefer this over editing directly when the change is subjective or needs the author's sign-off. Manage -> list_comments, update_comment, resolve_comment (resolve/reopen, reversible — prefer over delete to close), delete_comment, check_new_comments.\n" +
@@ -187,6 +187,19 @@ registerShared(SHARED_TOOL_SPECS.getNode, async ({ pageId, nodeId }) => {
    return jsonContent(result);
  });

  // Tool: search_in_page
  registerShared(
    SHARED_TOOL_SPECS.searchInPage,
    async ({ pageId, query, regex, caseSensitive, limit }) => {
      const result = await docmostClient.searchInPage(pageId, query, {
        regex,
        caseSensitive,
        limit,
      });
      return jsonContent(result);
    },
  );

  // Tool: table_get
  server.registerTool(
    "table_get",
@@ -0,0 +1,245 @@
/**
 * Pure, network-free in-page search over a ProseMirror/TipTap document tree.
 *
 * `searchInDoc(doc, query, opts)` finds every occurrence of a literal substring
 * (default) or a regular expression across the page's TEXT CONTAINERS and
 * reports WHERE each match is — the container's ref (usable verbatim with
 * get_node/patch_node and comment anchoring), the top-level block index, and a
 * short context window around the hit. It never touches the network, the DB, or
 * the schema mirror; like `comment-anchor.ts`, it can be tested in isolation.
 *
 * WHY plain text (not markdown): each container's inline text is glued into ONE
 * string via `blockPlainText`, so a match survives inline-mark boundaries
 * (bold/italic/link splits that fracture a run like "т.е." into several text
 * nodes) and comment-anchor spans never clutter the haystack.
 *
 * The SEARCH UNIT is a text container: a node whose direct children include
 * text nodes (a paragraph/heading, or the paragraph inside a table cell / list
 * item). ProseMirror keeps block vs. inline content exclusive, so a container
 * never nests another container — the walk reaches each cell/item's own text and
 * the context window is naturally scoped to that specific cell/item, not the
 * whole top-level block's glued text.
 */

import { blockPlainText } from "./node-ops.js";

/** True if `value` is a non-null plain object (and not an array). */
function isObject(value: any): value is Record<string, any> {
  return value != null && typeof value === "object" && !Array.isArray(value);
}

/**
 * A text container is a node with a `content` array holding at least one text
 * node (a child with a string `text`). These are the paragraphs/headings whose
 * glued inline text we search.
 */
function isTextContainer(node: any): boolean {
  return (
    isObject(node) &&
    Array.isArray(node.content) &&
    node.content.some((c: any) => isObject(c) && typeof c.text === "string")
  );
}

/** Options controlling the search engine and result size. */
export interface SearchOptions {
  /** Treat `query` as a RegExp instead of a literal substring (default false). */
  regex?: boolean;
  /** Case-sensitive matching (default false). */
  caseSensitive?: boolean;
  /** Max matches to RETURN (default 50, clamped to [1, 200]); total is unbounded. */
  limit?: number;
}

/** One located occurrence. */
export interface SearchMatch {
  /**
   * The container's ref: its `attrs.id` when it has one, otherwise
   * `#<topLevelIndex>` of the nearest top-level block (the same ref format
   * get_node/patch_node and comment anchoring accept). Table-cell/list-item
   * paragraphs that carry no id fall back to the `#<index>` form.
   */
  nodeId: string;
  /** The top-level block index (as in get_outline). */
  blockIndex: number;
  /** The container node's type (paragraph/heading/...). */
  type: string | undefined;
  /** ~40 chars of context immediately before the match (from THIS container). */
  before: string;
  /** The matched text. */
  match: string;
  /** ~40 chars of context immediately after the match (from THIS container). */
  after: string;
}

/** The search result. `truncated` is true when `total > matches.length`. */
export interface SearchResult {
  total: number;
  truncated: boolean;
  matches: SearchMatch[];
}

// Result-size defaults/ceiling.
const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 200;

// Context window on each side of a match.
const CONTEXT = 40;

// Anti-ReDoS guards. JS regex is not interruptible, so a pathological pattern
// on a large input can wedge the event loop; we bound BOTH inputs by size (not
// a timeout). These also bound the literal engine's work.
const MAX_PATTERN_LENGTH = 1000; // cap the query/pattern length
const MAX_CONTAINER_TEXT = 100_000; // cap the text scanned per container

/** Clamp the requested limit into [1, MAX_LIMIT], defaulting when absent. */
function resolveLimit(limit: number | undefined): number {
  const n = typeof limit === "number" && Number.isFinite(limit) ? limit : DEFAULT_LIMIT;
  return Math.min(MAX_LIMIT, Math.max(1, Math.floor(n)));
}

/**
 * Yield the [start, length] of every match of `query` in `text`, in
 * order. A literal engine uses indexOf (case-folded unless caseSensitive); a regex
 * engine uses a global RegExp. Zero-length regex matches (e.g. `\b`, `a*`) are
 * SKIPPED and lastIndex is advanced, so a pattern that can match the empty
 * string cannot flood the results or spin forever.
 */
function* eachMatch(
  text: string,
  query: string,
  re: RegExp | null,
  caseSensitive: boolean,
): Generator<[number, number]> {
  if (re) {
    re.lastIndex = 0;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) != null) {
      const len = m[0].length;
      if (len === 0) {
        // Empty match: advance past this position and do not record it.
        re.lastIndex = m.index + 1;
        continue;
      }
      yield [m.index, len];
    }
    return;
  }

  // Literal engine. For case-insensitive search, fold BOTH sides only to locate
  // the indices; the reported match/context are always sliced from the original
  // text so the caller gets the real casing (needed to build a unique selection).
  const haystack = caseSensitive ? text : text.toLowerCase();
  const needle = caseSensitive ? query : query.toLowerCase();
  const len = needle.length;
  let from = 0;
  for (;;) {
    const idx = haystack.indexOf(needle, from);
    if (idx === -1) return;
    yield [idx, len];
    from = idx + len;
  }
}
|
||||
|
||||
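The zero-length handling in the regex branch can be seen in isolation with a reduced sketch. `regexSpans` is a hypothetical name used only here; it mirrors the loop above but is not part of the PR, and it adds an explicit check for the `g` flag, since `exec` ignores `lastIndex` on a non-global, non-sticky pattern and the loop would otherwise never advance.

```typescript
// Hypothetical standalone helper mirroring the regex branch of eachMatch.
function* regexSpans(text: string, re: RegExp): Generator<[number, number]> {
  if (!re.global) {
    // Without the g flag exec() always restarts at 0, so the loop would not terminate.
    throw new Error("regexSpans: pattern must be global");
  }
  re.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      // Empty match: step one position forward and record nothing.
      re.lastIndex = m.index + 1;
      continue;
    }
    yield [m.index, m[0].length];
  }
}

// `a*` matches "a" once, then only empty strings, which are skipped.
const spansStar = Array.from(regexSpans("abc", /a*/g));
// `\b` only ever matches the empty string, so nothing is reported.
const spansBoundary = Array.from(regexSpans("a b", /\b/g));
// An ordinary pattern yields [start, length] pairs in document order.
const spansDigits = Array.from(regexSpans("v1 v22", /\d+/g));

console.log(spansStar, spansBoundary, spansDigits);
```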
/**
 * Search a ProseMirror document for `query` and return `{ total, truncated,
 * matches }`. `total` counts EVERY occurrence (even beyond the limit) and
 * `truncated` flags when the returned list was capped — nothing is silently
 * dropped.
 *
 * Throws a clear, model-actionable error (never a generic failure) on: an
 * empty/whitespace-only query, an over-long pattern, or — with `regex:true` —
 * an invalid RegExp, so the agent can fix its input.
 */
export function searchInDoc(
  doc: any,
  query: string,
  opts: SearchOptions = {},
): SearchResult {
  // --- edge-case guards (fail loudly so the agent can correct the call) ---
  if (typeof query !== "string" || query.trim().length === 0) {
    throw new Error(
      "search_in_page: query is empty — pass the text (or regex) to look for.",
    );
  }
  if (query.length > MAX_PATTERN_LENGTH) {
    throw new Error(
      `search_in_page: query is too long (${query.length} chars; max ${MAX_PATTERN_LENGTH}). Shorten the search text/pattern.`,
    );
  }

  const caseSensitive = opts.caseSensitive === true;
  const limit = resolveLimit(opts.limit);

  // Compile the regex up front so an invalid pattern is a clean tool error
  // rather than a failure deep in the traversal.
  let re: RegExp | null = null;
  if (opts.regex === true) {
    try {
      re = new RegExp(query, caseSensitive ? "g" : "gi");
    } catch (e) {
      throw new Error(
        `search_in_page: invalid regular expression: ${
          e instanceof Error ? e.message : String(e)
        }`,
      );
    }
  }

  const matches: SearchMatch[] = [];
  let total = 0;

  const topLevel =
    isObject(doc) && Array.isArray(doc.content) ? doc.content : [];

  // Descend a top-level block, collecting matches from every text container
  // within it. blockIndex/topRef stay pinned to the enclosing top-level block.
  const descend = (node: any, blockIndex: number, topRef: string): void => {
    if (!isObject(node)) return;

    if (isTextContainer(node)) {
      // Glue this container's inline text into one string (mark-safe) and cap it
      // so a single non-interruptible regex exec can never run on an unbounded
      // input.
      let text = blockPlainText(node);
      if (text.length > MAX_CONTAINER_TEXT) {
        text = text.slice(0, MAX_CONTAINER_TEXT);
      }

      // The container's own id addresses it verbatim in get_node/patch_node and
      // comment anchoring; a container with no id (e.g. a table-cell paragraph)
      // falls back to the top-level block's #<index>.
      const id =
        isObject(node.attrs) && typeof node.attrs.id === "string" && node.attrs.id.length > 0
          ? node.attrs.id
          : topRef;

      for (const [idx, len] of eachMatch(text, query, re, caseSensitive)) {
        total++;
        if (matches.length < limit) {
          matches.push({
            nodeId: id,
            blockIndex,
            type: node.type,
            before: text.slice(Math.max(0, idx - CONTEXT), idx),
            match: text.slice(idx, idx + len),
            after: text.slice(idx + len, idx + len + CONTEXT),
          });
        }
      }
      // A text container holds inline content only — no nested containers to
      // recurse into.
      return;
    }

    if (Array.isArray(node.content)) {
      for (const child of node.content) descend(child, blockIndex, topRef);
    }
  };

  for (let i = 0; i < topLevel.length; i++) {
    descend(topLevel[i], i, `#${i}`);
  }

  return { total, truncated: total > matches.length, matches };
}
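As a rough illustration of how a caller might consume the `{ total, truncated, matches }` shape, the sketch below groups hits by `nodeId` and reports whether the list was capped. The interfaces are re-declared locally from the fields visible in the diff; `hitsByNode`, `summarize` and the sample data are hypothetical and not part of the PR.

```typescript
// Local re-declarations of the result shape, based on the fields pushed in searchInDoc.
interface SearchMatch {
  nodeId: string;
  blockIndex: number;
  type: string | undefined;
  before: string;
  match: string;
  after: string;
}
interface SearchResult {
  total: number;
  truncated: boolean;
  matches: SearchMatch[];
}

// Hypothetical helper: bucket the returned hits by the ref they can be patched under.
function hitsByNode(result: SearchResult): Map<string, SearchMatch[]> {
  const groups = new Map<string, SearchMatch[]>();
  for (const hit of result.matches) {
    const bucket = groups.get(hit.nodeId);
    if (bucket) bucket.push(hit);
    else groups.set(hit.nodeId, [hit]);
  }
  return groups;
}

// Hypothetical helper: a one-line summary that never hides truncation.
function summarize(result: SearchResult): string {
  const parts: string[] = [];
  hitsByNode(result).forEach((hits, nodeId) => parts.push(`${nodeId}: ${hits.length}`));
  const head =
    `${result.matches.length} of ${result.total} hits shown` +
    (result.truncated ? " (truncated)" : "");
  return parts.length > 0 ? `${head}; ${parts.join(", ")}` : head;
}

// Hand-written sample: three occurrences found, only two returned (limit: 2).
const sample: SearchResult = {
  total: 3,
  truncated: true,
  matches: [
    { nodeId: "p1", blockIndex: 0, type: "paragraph", before: "The ", match: "cat", after: " sat on the cat mat." },
    { nodeId: "p1", blockIndex: 0, type: "paragraph", before: "The cat sat on the ", match: "cat", after: " mat." },
  ],
};

const summary = summarize(sample);
console.log(summary);
```

When `truncated` is true, a caller would presumably raise `limit` (up to 200) or narrow the query rather than treat the returned list as complete.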
@@ -110,6 +110,51 @@ export const SHARED_TOOL_SPECS = {
    }),
  },

  // --- in-page occurrence search (client-side, over ProseMirror plain text) ---

  searchInPage: {
    mcpName: 'search_in_page',
    inAppKey: 'searchInPage',
    description:
      'Find every occurrence of a string (or regex) INSIDE one page and get ' +
      'WHERE each is — instead of pulling blocks one-by-one with get_node. ' +
      'Searches the plain text of each text block/cell (marks glued, so a match ' +
      'survives bold/italic/link splits; comment anchors do not interfere). ' +
      'Returns { total, truncated, matches:[{ nodeId, blockIndex, type, before, ' +
      'match, after }] }: `nodeId` is the block id (or "#<index>" for ' +
      'table/cell content) — pass it straight to get_node/patch_node or as a ' +
      'comment anchor; `blockIndex` is the get_outline index; `before`/`after` ' +
      'give ~40 chars of context to build a unique selection. `total` counts all ' +
      'hits and `truncated` is true when more than `limit` were found (nothing ' +
      'is silently dropped). Default is a literal, case-INSENSITIVE substring; ' +
      'set regex:true for a JS regular expression (char classes, word ' +
      'boundaries) and caseSensitive:true to match case. Ideal for systematic ' +
      'editorial sweeps (unquoted "ё", straight quotes, "т.е.", stray units). An ' +
      'invalid regex or an empty query returns a clear error to fix.',
    buildShape: (z) => ({
      pageId: z.string().min(1).describe('ID of the page to search'),
      query: z
        .string()
        .min(1)
        .describe('The text to find (a literal substring, or a regex when regex:true)'),
      regex: z
        .boolean()
        .optional()
        .describe('Treat query as a JS regular expression (default false).'),
      caseSensitive: z
        .boolean()
        .optional()
        .describe('Case-sensitive matching (default false).'),
      limit: z
        .number()
        .int()
        .min(1)
        .max(200)
        .optional()
        .describe('Max matches to RETURN (default 50, max 200); total is always reported.'),
    }),
  },

  // --- node delete ---

  deleteNode: {

@@ -45,6 +45,7 @@ const HOST_CONTRACT_METHODS = [
  "getOutline",
  "getPageJson",
  "getNode",
  "searchInPage",
  "getTable",
  "listComments",
  "getComment",

@@ -0,0 +1,217 @@
import { test } from "node:test";
import assert from "node:assert/strict";

import { searchInDoc } from "../../build/lib/page-search.js";

// ---------------------------------------------------------------------------
// Document builders. Mirror the Docmost ProseMirror shape: paragraphs/headings
// carry an attrs.id and hold text nodes; a text node may carry marks, and
// adjacent runs with different marks are GLUED by blockPlainText so a match can
// straddle a mark boundary. Table cells hold id-less paragraphs.
// ---------------------------------------------------------------------------

const text = (t, marks) => (marks ? { type: "text", text: t, marks } : { type: "text", text: t });
const para = (id, ...children) => ({ type: "paragraph", attrs: { id }, content: children });
const heading = (id, level, t) => ({
  type: "heading",
  attrs: { id, level },
  content: [text(t)],
});

function doc(...content) {
  return { type: "doc", content };
}

test("literal substring: finds every occurrence with total/truncated and refs", () => {
  const d = doc(
    para("p1", text("The cat sat on the cat mat.")),
    heading("h1", 2, "Another cat here"),
  );
  const res = searchInDoc(d, "cat");
  assert.equal(res.total, 3);
  assert.equal(res.truncated, false);
  assert.equal(res.matches.length, 3);
  // First hit: paragraph p1, block index 0.
  assert.equal(res.matches[0].nodeId, "p1");
  assert.equal(res.matches[0].blockIndex, 0);
  assert.equal(res.matches[0].type, "paragraph");
  assert.equal(res.matches[0].match, "cat");
  // Third hit is in the heading (block index 1).
  assert.equal(res.matches[2].nodeId, "h1");
  assert.equal(res.matches[2].blockIndex, 1);
  assert.equal(res.matches[2].type, "heading");
});

test("context windows: before/after are drawn from the SAME container", () => {
  const d = doc(para("p1", text("alpha beta gamma delta")));
  const res = searchInDoc(d, "gamma");
  assert.equal(res.matches.length, 1);
  assert.equal(res.matches[0].before, "alpha beta ");
  assert.equal(res.matches[0].match, "gamma");
  assert.equal(res.matches[0].after, " delta");
});

test("context windows are bounded to ~40 chars each side", () => {
  const long = "x".repeat(100);
  const d = doc(para("p1", text(long + "NEEDLE" + long)));
  const res = searchInDoc(d, "NEEDLE");
  assert.equal(res.matches.length, 1);
  assert.equal(res.matches[0].before.length, 40);
  assert.equal(res.matches[0].after.length, 40);
});

test("case-insensitive by default; caseSensitive:true narrows", () => {
  const d = doc(para("p1", text("Cat CAT cat")));
  assert.equal(searchInDoc(d, "cat").total, 3);
  assert.equal(searchInDoc(d, "cat", { caseSensitive: true }).total, 1);
  // Reported match preserves the ORIGINAL casing even under a folded search.
  const res = searchInDoc(d, "cat");
  assert.deepEqual(
    res.matches.map((m) => m.match),
    ["Cat", "CAT", "cat"],
  );
});

test("match survives an inline mark boundary (glued runs)", () => {
  // "т.е." is fractured across three text nodes by bold/italic marks.
  const d = doc(
    para(
      "p1",
      text("вводное слово, "),
      text("т", [{ type: "bold" }]),
      text(".", [{ type: "italic" }]),
      text("е", [{ type: "bold" }]),
      text(". дальше"),
    ),
  );
  const res = searchInDoc(d, "т.е.");
  assert.equal(res.total, 1);
  assert.equal(res.matches[0].match, "т.е.");
  assert.equal(res.matches[0].nodeId, "p1");
});

test("regex engine: character classes and word boundaries", () => {
  const d = doc(para("p1", text("v1 v22 version v3")));
  const res = searchInDoc(d, "\\bv\\d+\\b", { regex: true });
  assert.deepEqual(
    res.matches.map((m) => m.match),
    ["v1", "v22", "v3"],
  );
  // "version" is not matched by \bv\d+\b.
  assert.equal(res.total, 3);
});

test("regex is case-insensitive by default and respects caseSensitive", () => {
  const d = doc(para("p1", text("Foo foo FOO")));
  assert.equal(searchInDoc(d, "foo", { regex: true }).total, 3);
  assert.equal(
    searchInDoc(d, "foo", { regex: true, caseSensitive: true }).total,
    1,
  );
});

test("regex empty/zero-length matches are skipped, not flooded", () => {
  const d = doc(para("p1", text("abc")));
  // `a*` can match the empty string at every position; we must not emit those.
  const res = searchInDoc(d, "a*", { regex: true });
  assert.equal(res.total, 1);
  assert.equal(res.matches[0].match, "a");
});

test("nodeId for a table cell paragraph WITHOUT an id falls back to #<topLevelIndex>", () => {
  // A table at top-level block index 1; its cell paragraphs carry no attrs.id.
  const cellPara = (t) => ({ type: "paragraph", content: [text(t)] });
  const d = doc(
    para("intro", text("before the table")),
    {
      type: "table",
      content: [
        {
          type: "tableRow",
          content: [
            { type: "tableCell", content: [cellPara("needle in a cell")] },
            { type: "tableHeader", content: [cellPara("another needle")] },
          ],
        },
      ],
    },
  );
  const res = searchInDoc(d, "needle");
  assert.equal(res.total, 2);
  // Both cell hits report the table's top-level #<index> (block 1) since the
  // cell paragraphs have no id.
  for (const m of res.matches) {
    assert.equal(m.nodeId, "#1");
    assert.equal(m.blockIndex, 1);
  }
  // Context is scoped to the specific cell, not the whole table's glued text.
  assert.equal(res.matches[0].after, " in a cell");
  assert.equal(res.matches[1].before, "another ");
});

test("nodeId uses attrs.id when the container has one (paragraph & heading)", () => {
  const d = doc(heading("h9", 1, "heading needle"), para("p9", text("para needle")));
  const res = searchInDoc(d, "needle");
  assert.equal(res.matches[0].nodeId, "h9");
  assert.equal(res.matches[1].nodeId, "p9");
});

test("limit caps the returned matches but total and truncated stay honest", () => {
  const d = doc(para("p1", text("x ".repeat(10).trim()))); // 10 'x'
  const res = searchInDoc(d, "x", { limit: 3 });
  assert.equal(res.total, 10);
  assert.equal(res.matches.length, 3);
  assert.equal(res.truncated, true);
});

test("limit is clamped to the [1, 200] range", () => {
  const d = doc(para("p1", text("a".repeat(5))));
  // A limit above the ceiling still returns all 5 (< 200) without truncation.
  const hi = searchInDoc(d, "a", { limit: 9999 });
  assert.equal(hi.matches.length, 5);
  assert.equal(hi.truncated, false);
  // A non-positive limit clamps up to 1.
  const lo = searchInDoc(d, "a", { limit: 0 });
  assert.equal(lo.matches.length, 1);
  assert.equal(lo.total, 5);
  assert.equal(lo.truncated, true);
});

test("invalid regex throws a clear tool error", () => {
  const d = doc(para("p1", text("hi")));
  assert.throws(
    () => searchInDoc(d, "(", { regex: true }),
    /invalid regular expression/i,
  );
});

test("empty or whitespace-only query is rejected", () => {
  const d = doc(para("p1", text("hi")));
  assert.throws(() => searchInDoc(d, ""), /query is empty/i);
  assert.throws(() => searchInDoc(d, " "), /query is empty/i);
  assert.throws(() => searchInDoc(d, undefined), /query is empty/i);
});

test("an over-long pattern is rejected (anti-ReDoS pattern cap)", () => {
  const d = doc(para("p1", text("hi")));
  assert.throws(() => searchInDoc(d, "a".repeat(1001)), /too long/i);
});

test("no matches yields an empty, non-truncated result", () => {
  const d = doc(para("p1", text("nothing to see")));
  const res = searchInDoc(d, "zebra");
  assert.deepEqual(res, { total: 0, truncated: false, matches: [] });
});

test("null-safe on a missing/empty doc", () => {
  assert.deepEqual(searchInDoc(null, "x"), {
    total: 0,
    truncated: false,
    matches: [],
  });
  assert.deepEqual(searchInDoc({ type: "doc" }, "x"), {
    total: 0,
    truncated: false,
    matches: [],
  });
});