fix(prosemirror-markdown): escape canon inline-extension triggers = $ ^ in link/alt text (#333 review F5)
F1 (round 1) wrapped the image alt in escapeLinkText, and that helper also guards the link-form media captions (attachment/pdf/embed). But its character class covered only stock CommonMark — NOT the Docmost inline EXTENSIONS this same PR registers on the marked instance: highlight `==x==` (canon #7), math `$x$` (canon #6), footnote `^[x]` (canon #2). Their triggers `= $ ^` are not CommonMark punctuation, so an alt or media filename like `x $A$ y`, `use ==bold==`, `^[fn]`, or `data $A$.csv` was silently turned into a math/highlight/footnote node on import — the same class of round-trip data loss F1 closed, reintroduced by this PR's own canon. Fix: add `= $ ^` to the escapeLinkText class (`/[\\`*_~[\]<&!()=$^]/g`). `\= \$ \^` decode back to literals (all ASCII punctuation) AND, being escape tokens, stop the extension tokenizer from matching — verified lossless byte-stable round-trip. Updated the helper comment to name the two trigger sets (CommonMark + Docmost inline extensions). Extended the adversarial round-trip tests: image alt gains `x $A$ y` / `5$ and 10$` / `use ==bold==` / `^[fn]` / `cost $5 == price`; pdf name gains `data $A$.csv` / `q3 ==final==.pdf` / `5$ and 10$.pdf` / `note ^[x].pdf` — all byte-stable with the node intact, so the hole can't reopen. package vitest: 658 passed; tsc clean. git-sync: 268. mcp: 454. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -78,23 +78,26 @@ export function convertProseMirrorToMarkdown(content: any): string {
|
||||
.replace(/\(/g, "%28")
|
||||
.replace(/\)/g, "%29");
|
||||
|
||||
// Backslash-escape every character a markdown link's `[text]` label would
|
||||
// otherwise INTERPRET, so a link-form media node's visible text (a filename or
|
||||
// provider carried in `attrs.name`/`attrs.provider`) round-trips byte-exact
|
||||
// (#293 canon #8 link-form). Escaping only `[ ] \` is NOT enough: the label is
|
||||
// parsed as inline content, so emphasis (`* _`), code (`` ` ``), strikethrough
|
||||
// (`~`), autolinks/raw-HTML (`<`), HTML entities (`&`), and image markers (`!`)
|
||||
// would all be consumed and lost when the importer reads `a.textContent` back
|
||||
// (e.g. `report *v2*.pdf` -> `report v2.pdf`). CommonMark treats a backslash
|
||||
// before ANY ASCII punctuation as that literal char, so escaping this active
|
||||
// set is always lossless on re-parse; the old `<div data-attachment-name>`
|
||||
// form carried arbitrary strings via escapeAttr, so anything less is a
|
||||
// data-loss regression on the git-sync data path. `( )` are escaped too: even
|
||||
// with `[ ]` escaped, an unescaped `](x)` sequence inside the label (e.g. a
|
||||
// name like `.pdf`) forms a false nested-link destination and
|
||||
// fragments the parse — escaping the parens removes that ambiguity entirely.
|
||||
// Backslash-escape every character that would be INTERPRETED inside a markdown
|
||||
// label re-parsed as inline content — used for a link-form media node's visible
|
||||
// text (`attrs.name`/`attrs.provider`, #293 canon #8) AND for an image `![alt]`
|
||||
// (canon #4) — so the value round-trips byte-exact. Two overlapping trigger
|
||||
// sets must be escaped:
|
||||
// 1. Stock CommonMark inline: emphasis (`* _`), code (`` ` ``), strikethrough
|
||||
// (`~`), autolinks/raw-HTML (`<`), HTML entities (`&`), image markers (`!`),
|
||||
// brackets (`[ ]`), and `( )` — even with `[ ]` escaped, an unescaped
|
||||
// `](x)` forms a false nested-link destination and fragments the parse.
|
||||
// 2. The Docmost inline EXTENSIONS this package registers on its marked
|
||||
// instance: highlight `==x==` (canon #7), math `$x$` (canon #6), and
|
||||
// footnote `^[x]` (canon #2). Their triggers `= $ ^` are NOT CommonMark
|
||||
// punctuation the stock lexer would treat specially, but the extension
|
||||
// tokenizers fire on them — so an alt/name like `x $A$ y`, `use ==b==`, or
|
||||
// `^[fn]` would be silently turned into a math/highlight/footnote node on
|
||||
// import unless the trigger is escaped. `\= \$ \^` decode back to literals
|
||||
// (all ASCII punctuation) and, being escape tokens, stop the extension
|
||||
// tokenizer from matching — verified lossless round-trip.
|
||||
const escapeLinkText = (value: unknown): string =>
|
||||
String(value ?? "").replace(/[\\`*_~[\]<&!()]/g, (c: string) => `\\${c}`);
|
||||
String(value ?? "").replace(/[\\`*_~[\]<&!()=$^]/g, (c: string) => `\\${c}`);
|
||||
|
||||
// #293 canon #6: the schema-HTML forms for math. These are the LOSSLESS forms
|
||||
// the raw-HTML path (columns/cells) and the mathInline fallback emit, and the
|
||||
|
||||
@@ -62,7 +62,13 @@ describe('#293 canon #4 — image serialization + attached img-comment', () => {
|
||||
// import; without escaping, a bracket/emphasis in a realistic description
|
||||
// would make the image node VANISH or collapse emphasis. Assert the image
|
||||
// survives with the exact alt AND the markdown is byte-stable on re-export.
|
||||
for (const alt of ['a]b[c', 'Figure [1]', 'the *new* logo', 'x_y_z', 'see ![img', 'a & b']) {
|
||||
for (const alt of [
|
||||
'a]b[c', 'Figure [1]', 'the *new* logo', 'x_y_z', 'see ![img', 'a & b',
|
||||
// Canon inline-extension triggers this same package introduces (F5): math
|
||||
// `$`, highlight `==`, footnote `^[` — an unescaped one turns the alt into
|
||||
// a math/highlight/footnote node on import.
|
||||
'x $A$ y', '5$ and 10$', 'use ==bold==', '^[fn]', 'cost $5 == price',
|
||||
]) {
|
||||
const md1 = convertProseMirrorToMarkdown(image({ alt }));
|
||||
const back = await markdownToProseMirror(md1);
|
||||
const img = findImage(back);
|
||||
|
||||
@@ -249,6 +249,13 @@ describe('#293 #8 LINK-FORM: pdf', () => {
|
||||
'tag <x> & y.pdf',
|
||||
'amp & here.pdf',
|
||||
'.pdf',
|
||||
// Canon inline-extension triggers (F5): math `$`, highlight `==`, footnote
|
||||
// `^[` — a filename carrying these must not become a math/highlight/footnote
|
||||
// node on import.
|
||||
'data $A$.csv',
|
||||
'q3 ==final==.pdf',
|
||||
'5$ and 10$.pdf',
|
||||
'note ^[x].pdf',
|
||||
]) {
|
||||
const { md1, md2, doc2 } = await roundTrip(mkDoc([{ type: 'pdf', attrs: { src: '/x', name } }]));
|
||||
expect(md2).toBe(md1); // byte-stable, no churn
|
||||
|
||||
Reference in New Issue
Block a user