fix(ai): align reindex live denominator with the steady-state count
Review fixes for the reindex-progress counter (#242): 1. Denominator jump (478 -> 500 -> 478): reindexWorkspace iterated getIdsByWorkspace() (ALL non-deleted pages) but the seed/status use countEmbeddablePages (text OR existing-embedding), so the live total exceeded the steady-state total whenever empty/text-less pages existed. Add PageRepo.getEmbeddablePageIds() that selects the IDs of the EXACT same set countEmbeddablePages counts (deletedAt IS NULL AND (text_content matches a non-whitespace char OR an EXISTS non-deleted pageEmbeddings row)), and have reindexWorkspace iterate THAT set with total = its length. Iteration set and count source change together, so done reaches exactly total == the steady-state denominator. Dropping text-less pages is correct (reindexPage no-ops on them; a page that lost its text but still has stale embeddings is in the set via the EXISTS clause and still gets its stale rows cleared). Removed the contradictory "worker overwrites with the real page count" / "denominator matches" comment. 2. Mid-run re-trigger reset: reindex() unconditionally re-seeded done=0 before an enqueue that de-dupes a running job, so a second click/admin/tab reset the visible counter while the worker kept incrementing. Now seed only when get(workspaceId) === null; the worker's own start() remains the single authoritative reset. 3. TTL: documented that it is intentionally tied to write progress (start/increment) and never refreshed on get(), so a dead worker's record can't be kept alive forever by client polling. Tests: new embedding-reindex-progress.service.spec.ts (fake ioredis: hash -> ReindexProgress, malformed/missing/non-numeric -> null, non-finite startedAt -> 0, hgetall throws -> null, start/increment issue hset/hincrby+expire and swallow Redis errors); reindex() seed order + no-reseed-when-active guard; getMasked live test now uses progress.total=500 vs DB 478 to pin the progress branch; indexer specs updated to mock getEmbeddablePageIds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -278,6 +278,47 @@ export class PageRepo {
|
||||
return rows.map((r) => r.id);
|
||||
}
|
||||
|
||||
/**
|
||||
* IDs of the EMBEDDABLE page set for a workspace — the exact same set that
|
||||
* `countEmbeddablePages` counts (a page qualifies if it has non-empty
|
||||
* textContent OR already has a stored embedding row). The bulk reindex
|
||||
* iterates THIS set so the live "done" counter reaches exactly
|
||||
* `countEmbeddablePages` (the steady-state denominator), instead of iterating
|
||||
* every non-deleted page (which would push the denominator above the
|
||||
* steady-state value mid-run).
|
||||
*
|
||||
* IMPORTANT: the WHERE here MUST stay in lockstep with `countEmbeddablePages`
|
||||
* — if one changes, change both, or the live total and steady-state total
|
||||
* diverge again. Dropping text-less pages is correct: `reindexPage` no-ops on
|
||||
* a page with no extractable content anyway, and a page that lost its text but
|
||||
* still has stale embeddings IS in this set (the EXISTS clause), so it is still
|
||||
* visited and its stale rows are cleared.
|
||||
*/
|
||||
async getEmbeddablePageIds(workspaceId: string): Promise<string[]> {
|
||||
const rows = await this.db
|
||||
.selectFrom('pages as p')
|
||||
.select('p.id')
|
||||
.where('p.workspaceId', '=', workspaceId)
|
||||
.where('p.deletedAt', 'is', null)
|
||||
.where((eb) =>
|
||||
eb.or([
|
||||
// Has extractable body text (mirrors countEmbeddablePages: any
|
||||
// non-whitespace char; raw SQL -> snake_case column name).
|
||||
sql<boolean>`p.text_content ~ '[^[:space:]]'`,
|
||||
// OR already has at least one (non-deleted) embedding row.
|
||||
eb.exists(
|
||||
eb
|
||||
.selectFrom('pageEmbeddings as pe')
|
||||
.select(sql`1`.as('one'))
|
||||
.whereRef('pe.pageId', '=', 'p.id')
|
||||
.where('pe.deletedAt', 'is', null),
|
||||
),
|
||||
]),
|
||||
)
|
||||
.execute();
|
||||
return rows.map((r) => r.id);
|
||||
}
|
||||
|
||||
async deletePage(pageId: string): Promise<void> {
|
||||
let query = this.db.deleteFrom('pages');
|
||||
|
||||
|
||||
Reference in New Issue
Block a user