feat(ai): hybrid RRF retrieval, heading-breadcrumb chunks, merged search tool

Improve agent RAG quality with three changes, plus a roadmap doc for the rest.

- Indexer: prefix each chunk with its heading path ("Page > H1 > H2"), built by walking the ProseMirror JSON (heading nodes) so a `#` inside a fenced code block is never mistaken for a heading. Falls back to plain-text chunking on any error. buildChunkRows: drop indexOf-against-source offsets (breadcrumb prefixes break verbatim matching) for a cumulative cursor; offsets are provenance-only.
- Hybrid search: new migration adds a generated `fts` tsvector column + GIN index to page_embeddings (same english+f_unaccent config as pages.tsv). New PageEmbeddingRepo.hybridSearch fuses cosine + full-text rankings via Reciprocal Rank Fusion (k=60, equal weights) in one SQL query at chunk granularity.
- Tools: collapse semanticSearch + searchPages into one hybrid `searchPages` tool with a query-rewrite-oriented description; gracefully falls back to the REST full-text path when embeddings are unconfigured. Access control (space scope + page-permission post-filter) preserved. Add a query-rewrite hint to the default system prompt.
- docs/rag-improvements-plan.md: record what shipped and the deferred backlog (reranker, attachment indexing, eval harness, tuning).

Note: requires a corpus reindex to populate breadcrumbs on existing pages.
2026-06-18 03:43:01 +03:00
parent 91a63f0b2c
commit c8e41e8916
6 changed files with 555 additions and 145 deletions
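
For readers skimming the commit message, the heading-breadcrumb idea can be sketched roughly as follows. This is a minimal illustration, not the indexer shipped in this commit: the `PMNode` shape, `nodeText`, and `breadcrumbBlocks` are hypothetical names, and it assumes the usual ProseMirror JSON convention of `heading` nodes carrying `attrs.level`. It emits one prefixed string per top-level block and skips the actual chunk sizing and the plain-text fallback.

```typescript
// Hypothetical minimal shape of a ProseMirror JSON node (assumption).
interface PMNode {
  type: string;
  attrs?: { level?: number };
  text?: string;
  content?: PMNode[];
}

// Concatenate all text leaves under a node.
function nodeText(node: PMNode): string {
  if (node.text !== undefined) return node.text;
  return (node.content ?? []).map(nodeText).join('');
}

// Prefix every non-heading top-level block with "Page > H1 > H2".
// Only nodes whose type is 'heading' update the path, so a line that
// starts with '#' inside a code block is treated as ordinary body text.
function breadcrumbBlocks(pageTitle: string, doc: PMNode): string[] {
  const headings: string[] = []; // headings[i] is the active heading at level i + 1
  const out: string[] = [];
  for (const node of doc.content ?? []) {
    const text = nodeText(node).trim();
    if (node.type === 'heading') {
      const level = node.attrs?.level ?? 1;
      headings.length = level - 1; // drop same-level and deeper headings
      headings[level - 1] = text;
      continue;
    }
    if (!text) continue;
    // filter(Boolean) skips levels that were never set (e.g. H1 directly to H3).
    const path = [pageTitle, ...headings.filter(Boolean)].join(' > ');
    out.push(`${path}\n${text}`);
  }
  return out;
}
```

Because the path is derived from node types rather than from scanning text for `#`, the prefixed text no longer appears verbatim in the page source, which is consistent with the message's point about dropping indexOf-based offsets.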
--- a/apps/server/src/database/migrations/20260618T150000-page-embeddings-fts.ts
+++ b/apps/server/src/database/migrations/20260618T150000-page-embeddings-fts.ts
@@ -0,0 +1,48 @@
+import { type Kysely, sql } from 'kysely';
+
+/**
+ * Chunk-level lexical index for HYBRID retrieval (RRF) over `page_embeddings`.
+ *
+ * The agent's retrieval used to be either pure full-text (loopback REST over
+ * `pages.tsv`) OR pure vector (cosine over `page_embeddings.embedding`). Hybrid
+ * retrieval fuses BOTH rankings with Reciprocal Rank Fusion so exact keyword /
+ * identifier matches AND semantic matches both surface. The vector side already
+ * exists; this migration adds the missing LEXICAL side AT CHUNK GRANULARITY so
+ * both CTEs of the fused query rank the SAME chunk rows.
+ *
+ * `fts` is a GENERATED ALWAYS ... STORED `tsvector` built from `content` with
+ * the SAME text-search config as `pages.tsv`: `to_tsvector('english',
+ * f_unaccent(content))`. Using the identical config keeps lexical behaviour
+ * consistent with the existing page search (incl. unaccented Cyrillic content).
+ * `f_unaccent(text)` is declared IMMUTABLE (migration 20250729T213756), which is
+ * exactly what a GENERATED STORED column requires — so this needs NO trigger.
+ * The column is independent of the embedding vector dimension: it indexes text,
+ * not the vector, so it works for any model dimension.
+ *
+ * NOTE: `fts` is deliberately NOT added to the `PageEmbeddings` Kysely type. It
+ * is a generated column accessed ONLY via raw SQL (the hybrid query); adding it
+ * to the Kysely type would force it into the explicit-column insert in
+ * `insertChunks` and break inserts (a GENERATED column cannot be written to).
+ */
+export async function up(db: Kysely<any>): Promise<void> {
+  // Generated STORED tsvector mirroring pages.tsv's config. f_unaccent is
+  // IMMUTABLE so it is valid inside a GENERATED column expression (no trigger).
+  await sql`
+    ALTER TABLE page_embeddings
+      ADD COLUMN IF NOT EXISTS fts tsvector
+      GENERATED ALWAYS AS (to_tsvector('english', f_unaccent(content))) STORED
+  `.execute(db);
+
+  // GIN index for fast `fts @@ query` lexical matching on the chunk text.
+  await sql`
+    CREATE INDEX IF NOT EXISTS idx_page_embeddings_fts
+      ON page_embeddings USING gin(fts)
+  `.execute(db);
+}
+
+export async function down(db: Kysely<any>): Promise<void> {
+  await sql`DROP INDEX IF EXISTS idx_page_embeddings_fts`.execute(db);
+  await sql`
+    ALTER TABLE page_embeddings DROP COLUMN IF EXISTS fts
+  `.execute(db);
+}
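
The migration comment above describes fusing a vector ranking and a lexical ranking over the same chunk rows with Reciprocal Rank Fusion. As a rough illustration of the scoring only, here is an in-memory sketch; the commit itself performs the fusion in a single SQL query, which is not reproduced here. `rrfFuse` and `FusedHit` are hypothetical names, each input list is assumed to contain unique chunk ids ordered best first, and `k = 60` with equal weights follows the commit message.

```typescript
type ChunkId = string;

interface FusedHit {
  id: ChunkId;
  score: number;
}

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per chunk,
// with rank starting at 1. A chunk missing from a list contributes nothing
// for that list.
function rrfFuse(
  vectorRanked: ChunkId[],
  lexicalRanked: ChunkId[],
  k = 60,
): FusedHit[] {
  const scores = new Map<ChunkId, number>();
  const accumulate = (ranked: ChunkId[]): void => {
    ranked.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  };
  accumulate(vectorRanked);
  accumulate(lexicalRanked);
  // Highest fused score first; ties broken by id for a stable order.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}
```

For example, `rrfFuse(['a', 'b', 'c'], ['b', 'd'])` orders the chunks as `b`, `a`, `d`, `c`: `b` appears in both lists, so its two contributions outweigh the single top-rank contribution of `a`. This only works as intended when both rankings refer to the same chunk rows, which is why the lexical index is added at chunk granularity.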