gitmost

vvzvlad/gitmost

Fork 0

Commit Graph

Author	SHA1	Message	Date
vvzvlad	c8e41e8916	feat(ai): hybrid RRF retrieval, heading-breadcrumb chunks, merged search tool Improve agent RAG quality with three changes, plus a roadmap doc for the rest. - Indexer: prefix each chunk with its heading path ("Page > H1 > H2"), built by walking the ProseMirror JSON (heading nodes) so a `#` inside a fenced code block is never mistaken for a heading. Falls back to plain-text chunking on any error. buildChunkRows: drop indexOf-against-source offsets (breadcrumb prefixes break verbatim matching) for a cumulative cursor — offsets are provenance-only. - Hybrid search: new migration adds a generated `fts` tsvector column + GIN index to page_embeddings (same english+f_unaccent config as pages.tsv). New PageEmbeddingRepo.hybridSearch fuses cosine + full-text rankings via Reciprocal Rank Fusion (k=60, equal weights) in one SQL query at chunk granularity. - Tools: collapse semanticSearch + searchPages into one hybrid `searchPages` tool with a query-rewrite-oriented description; gracefully falls back to the REST full-text path when embeddings are unconfigured. Access control (space scope + page-permission post-filter) preserved. Add a query-rewrite hint to the default system prompt. - docs/rag-improvements-plan.md: record what shipped and the deferred backlog (reranker, attachment indexing, eval harness, tuning). Note: requires a corpus reindex to populate breadcrumbs on existing pages.	2026-06-18 03:43:01 +03:00

Author

SHA1

Message

Date

vvzvlad

c8e41e8916

feat(ai): hybrid RRF retrieval, heading-breadcrumb chunks, merged search tool

Improve agent RAG quality with three changes, plus a roadmap doc for the rest.

- Indexer: prefix each chunk with its heading path ("Page > H1 > H2"), built by
  walking the ProseMirror JSON (heading nodes) so a `#` inside a fenced code block
  is never mistaken for a heading. Falls back to plain-text chunking on any error.
  buildChunkRows: drop indexOf-against-source offsets (breadcrumb prefixes break
  verbatim matching) for a cumulative cursor — offsets are provenance-only.
- Hybrid search: new migration adds a generated `fts` tsvector column + GIN index
  to page_embeddings (same english+f_unaccent config as pages.tsv). New
  PageEmbeddingRepo.hybridSearch fuses cosine + full-text rankings via Reciprocal
  Rank Fusion (k=60, equal weights) in one SQL query at chunk granularity.
- Tools: collapse semanticSearch + searchPages into one hybrid `searchPages` tool
  with a query-rewrite-oriented description; gracefully falls back to the REST
  full-text path when embeddings are unconfigured. Access control (space scope +
  page-permission post-filter) preserved. Add a query-rewrite hint to the default
  system prompt.
- docs/rag-improvements-plan.md: record what shipped and the deferred backlog
  (reranker, attachment indexing, eval harness, tuning).

Note: requires a corpus reindex to populate breadcrumbs on existing pages.

2026-06-18 03:43:01 +03:00

1 Commits