fix(ai): raise reindex pre-seed TTL to the client poll cap; cover predicate clause; align docs (F11-F13)
F11: PRE_SEED_TTL_SECONDS 45->120 (= client REINDEX_POLL_CAP_MS). At concurrency
1 a queued reindex can wait past the old 45s; if the pre-seed expired while
pending, getMasked fell back to the COUNT and reported done, so the client
stopped polling and missed the climb. Tie the pre-seed TTL to the client cap.
F12: extend the lockstep integration spec — insertPage takes content; a
text_content=null + text-node-content page is IN and a math-only page is OUT,
pinning the structural "type":"text" clause (and the jsonb space-after-colon).
F13: list all three embeddable clauses in the reindex JSDoc/inline comments.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -187,15 +187,19 @@ export class EmbeddingIndexerService {
|
||||
/**
|
||||
* (Re)build embeddings for the EMBEDDABLE page set of a workspace — the same
|
||||
* set countEmbeddablePages counts (via getEmbeddablePageIds): non-deleted pages
|
||||
* that have non-empty textContent OR already have a stored embedding row, NOT
|
||||
* every non-deleted page. Iterating this set keeps the live `total` equal to
|
||||
* the steady-state denominator, so the progress counter climbs 0 -> total and
|
||||
* matches the before/after DB coverage exactly. Text-less pages are correctly
|
||||
* skipped (reindexPage no-ops on them); a page that lost its text but still has
|
||||
* stale embeddings stays in the set (the EXISTS clause) so it is visited and
|
||||
* its stale rows are cleared. Used by the bulk reindex
|
||||
* (WORKSPACE_CREATE_EMBEDDINGS, fired when AI Search is enabled and by the
|
||||
* manual "Reindex now" action).
|
||||
* that qualify under any of the three clauses of `embeddablePredicate` —
|
||||
* non-empty textContent, OR an empty/null textContent whose ProseMirror
|
||||
* `content` JSON has at least one text node (`"type":"text"`) that `jsonToText`
|
||||
* can extract, OR an already-stored (non-deleted) embedding row — NOT every
|
||||
* non-deleted page. Iterating this set keeps the live `total` equal to the
|
||||
* steady-state denominator, so the progress counter climbs 0 -> total and
|
||||
* matches the before/after DB coverage exactly. A page with truly no
|
||||
* extractable text (empty textContent AND content with only non-text/atom
|
||||
* nodes such as math) is correctly skipped (reindexPage no-ops on it); a page
|
||||
* that lost its text but still has stale embeddings stays in the set (the
|
||||
* EXISTS clause) so it is visited and its stale rows are cleared. Used by the
|
||||
* bulk reindex (WORKSPACE_CREATE_EMBEDDINGS, fired when AI Search is enabled
|
||||
* and by the manual "Reindex now" action).
|
||||
*
|
||||
* Resolves the embeddings model once up front: if the workspace has no
|
||||
* embeddings provider configured, the whole batch is skipped (otherwise each
|
||||
@@ -223,13 +227,16 @@ export class EmbeddingIndexerService {
|
||||
throw err;
|
||||
}
|
||||
|
||||
// Iterate the EMBEDDABLE set (same predicate as countEmbeddablePages), NOT
|
||||
// every non-deleted page: this makes `total` here equal the steady-state
|
||||
// denominator, so the live counter climbs 0 -> total and matches the
|
||||
// before/after DB count exactly (no 478 -> 500 -> 478 denominator jump).
|
||||
// Text-less pages are correctly skipped — reindexPage no-ops on them, and
|
||||
// a page that lost its text but still has stale embeddings IS in this set
|
||||
// (the EXISTS clause) so it is still visited and its stale rows cleared.
|
||||
// Iterate the EMBEDDABLE set (same three-clause predicate as
|
||||
// countEmbeddablePages), NOT every non-deleted page: this makes `total`
|
||||
// here equal the steady-state denominator, so the live counter climbs
|
||||
// 0 -> total and matches the before/after DB count exactly (no
|
||||
// 478 -> 500 -> 478 denominator jump). Pages whose text lives in the
|
||||
// ProseMirror `content` JSON (a text node) even with empty text_content ARE
|
||||
// in this set (the content-JSON clause) and get embedded; a page with no
|
||||
// extractable text at all is correctly skipped — reindexPage no-ops on it —
|
||||
// and a page that lost its text but still has stale embeddings IS in this
|
||||
// set (the EXISTS clause) so it is still visited and its stale rows cleared.
|
||||
const pageIds = await this.pageRepo.getEmbeddablePageIds(workspaceId);
|
||||
const total = pageIds.length;
|
||||
const startedAt = Date.now();
|
||||
|
||||
@@ -182,7 +182,11 @@ describe('AiSettingsService.reindex progress seed', () => {
|
||||
);
|
||||
const ttl = reindexProgress.start.mock.calls[0][2];
|
||||
expect(ttl).toBeGreaterThan(0);
|
||||
expect(ttl).toBeLessThanOrEqual(60); // short, not the full 1h record TTL
|
||||
// Short pre-seed TTL, distinct from the full 1h (3600s) record TTL, but
|
||||
// pinned to the client poll cap (120s) so a still-pending run can't expire
|
||||
// into a false "done" while the client is still polling (F11).
|
||||
expect(ttl).toBe(120);
|
||||
expect(ttl).toBeLessThanOrEqual(120);
|
||||
expect(aiQueue.add).toHaveBeenCalledTimes(1);
|
||||
// Seed must precede the enqueue so the first poll already reports done=0.
|
||||
expect(order).toEqual(['start', 'add']);
|
||||
|
||||
@@ -33,14 +33,27 @@ export function parsePositiveInt(raw: unknown): number | undefined {
|
||||
|
||||
/**
|
||||
* TTL (seconds) for the enqueue-time progress PRE-SEED written by `reindex()`
|
||||
* before the worker starts. Deliberately SHORT: if `aiQueue.add()` de-duplicates
|
||||
* against a job that is just finishing (the worker's finally already ran
|
||||
* `clear()` but removeOnComplete hasn't yet removed the job), no new worker runs
|
||||
* to overwrite/clear this seed — so a short TTL lets the phantom "reindexing:
|
||||
* 0 of N" expire in seconds instead of sticking for the full 1h record TTL. A
|
||||
* worker that DOES start re-seeds with the full TTL, so a real run is unaffected.
|
||||
* before the worker starts. Deliberately SHORT relative to the full 1h record
|
||||
* TTL: if `aiQueue.add()` de-duplicates against a job that is just finishing
|
||||
* (the worker's finally already ran `clear()` but removeOnComplete hasn't yet
|
||||
* removed the job), no new worker runs to overwrite/clear this seed — so this
|
||||
* shorter TTL lets the phantom "reindexing: 0 of N" expire instead of sticking
|
||||
* for the full 1h record TTL. A worker that DOES start re-seeds with the full
|
||||
* TTL, so a real run is unaffected.
|
||||
*
|
||||
* It MUST be >= the client poll cap (REINDEX_POLL_CAP_MS = 120000ms in
|
||||
* ai-provider-settings.tsx) though: the AI_QUEUE worker runs at concurrency 1
|
||||
* and shares the queue with page-level embedding jobs, so a queued reindex can
|
||||
* wait well beyond a few dozen seconds before the worker re-seeds with the full
|
||||
* TTL. If the pre-seed expired while the job is still pending, `get()` returns
|
||||
* null and getMasked() falls back to the steady-state COUNT (indexedPages ==
|
||||
* totalPages, reindexing=false) — the client reads that as "done & fully
|
||||
* indexed", clears its deadline and STOPS polling, so the admin never sees the
|
||||
* real climb. Pinning the pre-seed TTL to the client cap means a deduped phantom
|
||||
* is bounded to ~120s — the same window the client already polls — and a genuine
|
||||
* pending run never expires-into-"done" inside that window.
|
||||
*/
|
||||
const PRE_SEED_TTL_SECONDS = 45;
|
||||
const PRE_SEED_TTL_SECONDS = 120;
|
||||
|
||||
/**
|
||||
* Shape of the partial update accepted by `update`. Mirrors the validated
|
||||
@@ -128,10 +141,13 @@ export class AiSettingsService {
|
||||
let seeded = false;
|
||||
if ((await this.reindexProgress.get(workspaceId)) === null) {
|
||||
const totalPages = await this.pageRepo.countEmbeddablePages(workspaceId);
|
||||
// Short TTL: if add() below de-duplicates against a just-finishing job
|
||||
// whose worker already clear()ed but isn't removed yet, no worker runs to
|
||||
// clear this seed — the short TTL expires the phantom record in seconds
|
||||
// rather than leaving a stuck "reindexing: 0 of N" for the full record TTL.
|
||||
// Short TTL (vs the full 1h record TTL): if add() below de-duplicates
|
||||
// against a just-finishing job whose worker already clear()ed but isn't
|
||||
// removed yet, no worker runs to clear this seed — the shorter TTL expires
|
||||
// the phantom record rather than leaving a stuck "reindexing: 0 of N" for
|
||||
// the full record TTL. It is kept >= the client poll cap (120s) so a
|
||||
// genuine but still-pending run never expires into a false "done" while
|
||||
// the client is still polling (see PRE_SEED_TTL_SECONDS).
|
||||
await this.reindexProgress.start(
|
||||
workspaceId,
|
||||
totalPages,
|
||||
|
||||
@@ -42,10 +42,14 @@ describe('PageRepo embeddable-page set: getEmbeddablePageIds <-> countEmbeddable
|
||||
await destroyTestDb();
|
||||
});
|
||||
|
||||
// Insert a page with explicit text_content / deleted_at (createPage in db.ts
|
||||
// sets neither), returning its id so the test can assert membership.
|
||||
// Insert a page with explicit text_content / content / deleted_at (createPage
|
||||
// in db.ts sets none), returning its id so the test can assert membership.
|
||||
// `content` is the ProseMirror doc JSON (jsonb): postgres.js serializes a plain
|
||||
// object to JSON for jsonb columns, so we pass it through only when supplied so
|
||||
// the rest of the rows keep the DB default.
|
||||
async function insertPage(args: {
|
||||
textContent: string | null;
|
||||
content?: unknown;
|
||||
deletedAt?: Date | null;
|
||||
}): Promise<string> {
|
||||
const id = randomUUID();
|
||||
@@ -58,6 +62,7 @@ describe('PageRepo embeddable-page set: getEmbeddablePageIds <-> countEmbeddable
|
||||
spaceId,
|
||||
workspaceId,
|
||||
textContent: args.textContent,
|
||||
...(args.content !== undefined ? { content: args.content as any } : {}),
|
||||
deletedAt: args.deletedAt ?? null,
|
||||
})
|
||||
.execute();
|
||||
@@ -96,29 +101,59 @@ describe('PageRepo embeddable-page set: getEmbeddablePageIds <-> countEmbeddable
|
||||
// so the reindex can clear them).
|
||||
const noTextLiveEmbedding = await insertPage({ textContent: null });
|
||||
await insertEmbedding(noTextLiveEmbedding);
|
||||
// (c) non-deleted page with EMPTY text_content but ProseMirror `content` JSON
|
||||
// carrying a real text node — the content-JSON clause. This pins BOTH the
|
||||
// third OR-clause AND the space-after-colon: jsonb stores the key/value
|
||||
// separator as `"type": "text"` (a space after the colon), which is why
|
||||
// the predicate needs `[[:space:]]*`. `reindexPage` extracts this text, so
|
||||
// the page IS embeddable and the reindex MUST visit it.
|
||||
const noTextContentDoc = await insertPage({
|
||||
textContent: null,
|
||||
content: {
|
||||
type: 'doc',
|
||||
content: [
|
||||
{ type: 'paragraph', content: [{ type: 'text', text: 'hello' }] },
|
||||
],
|
||||
},
|
||||
});
|
||||
|
||||
// OUT of the set ----------------------------------------------------------
|
||||
// (c) non-deleted, text_content NULL, no embeddings.
|
||||
// (d) non-deleted, text_content NULL, no embeddings.
|
||||
await insertPage({ textContent: null });
|
||||
// (d) non-deleted, whitespace-only text (regex requires a non-space char).
|
||||
// (e) non-deleted, whitespace-only text (regex requires a non-space char).
|
||||
await insertPage({ textContent: ' \n\t ' });
|
||||
// (e) deleted page WITH body text — excluded by the non-deleted predicate.
|
||||
// (f) deleted page WITH body text — excluded by the non-deleted predicate.
|
||||
await insertPage({
|
||||
textContent: 'deleted but had text',
|
||||
deletedAt: new Date(),
|
||||
});
|
||||
// (f) non-deleted, no text, with ONLY a DELETED embedding row — the EXISTS
|
||||
// (g) non-deleted, no text, with ONLY a DELETED embedding row — the EXISTS
|
||||
// subquery filters pe.deleted_at IS NULL, so this stays out.
|
||||
const onlyDeletedEmbedding = await insertPage({ textContent: null });
|
||||
await insertEmbedding(onlyDeletedEmbedding, { deletedAt: new Date() });
|
||||
// (h) non-deleted, empty text_content, content JSON with ONLY a math atom
|
||||
// node — its LaTeX lives in `attrs.text` (a `"text":` KEY, not a
|
||||
// `"type":"text"` text node) and has no text serializer, so `jsonToText`
|
||||
// yields nothing and the page produces zero embeddings. The predicate
|
||||
// keys on the structural `"type":"text"` marker, so this stays OUT (a
|
||||
// bare `"text":` match would wrongly inflate the denominator).
|
||||
await insertPage({
|
||||
textContent: null,
|
||||
content: {
|
||||
type: 'doc',
|
||||
content: [{ type: 'mathBlock', attrs: { text: 'E=mc^2' } }],
|
||||
},
|
||||
});
|
||||
|
||||
const ids = await repo.getEmbeddablePageIds(workspaceId);
|
||||
const count = await repo.countEmbeddablePages(workspaceId);
|
||||
|
||||
// The two queries agree on the size (the load-bearing lockstep invariant)...
|
||||
expect(ids.length).toBe(count);
|
||||
// ...and the set is exactly the two qualifying pages, nothing else.
|
||||
expect(new Set(ids)).toEqual(new Set([withText, noTextLiveEmbedding]));
|
||||
expect(count).toBe(2);
|
||||
// ...and the set is exactly the three qualifying pages, nothing else.
|
||||
expect(new Set(ids)).toEqual(
|
||||
new Set([withText, noTextLiveEmbedding, noTextContentDoc]),
|
||||
);
|
||||
expect(count).toBe(3);
|
||||
});
|
||||
});
|
||||
|
||||
Reference in New Issue
Block a user