From 459d636ffb4f8bf919163d0378127069cc9d556f Mon Sep 17 00:00:00 2001 From: agent_coder Date: Sun, 5 Jul 2026 01:36:57 +0300 Subject: [PATCH 1/3] fix(db): prevent the migration-order crash-loop from long-lived branches (#363, incident #361) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A long-lived branch can add a migration whose timestamped filename sorts BEFORE migrations already applied in prod (#234's 20260627T130000-ai-chat-runs merged after 20260704T120000-client-metrics was live). Kysely's migrator with the default ordered setting then rejects the applied set as "corrupted migrations" (no longer a prefix of the sorted list), throws, and the app crash-loops on boot — exactly incident #361 (502s for ~11 min after a develop deploy). #119 and #120 (June branches) are the next such threats. Two levels, both: 1. CI migration-order gate (a new `migration-order` job in test.yml, PR-only): fails the PR when an added migration sorts at/before the newest migration on the base branch, with an actionable message to rename it to a current timestamp before merge. This is the primary defense — makes back-dating impossible to merge accidentally. 2. `allowUnorderedMigrations: true` on BOTH Migrators (migration.service.ts startup auto-migrate + migrate.ts CLI): the runtime safety net — Kysely applies a not-yet-applied older migration instead of bricking startup, so a back-dated migration that bypasses the gate (manual push / hotfix branch) still boots. Trade-off documented inline: apply order across instances may differ from lexicographic, so migrations must stay independent (ours each create their own objects); the CI gate remains the primary line. Verified: allowUnorderedMigrations is a valid Kysely 0.28.17 Migrator option; server tsc clean; the gate script rejects a back-dated filename and passes a current one. No new deps, no migration, no runtime behavior change beyond the migrator resilience. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/test.yml | 36 +++++++++++++++++++ apps/server/src/database/migrate.ts | 4 +++ .../database/services/migration.service.ts | 10 ++++++ 3 files changed, 50 insertions(+) diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index 7ea47b94..5eb5c9f4 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -13,6 +13,42 @@ permissions: contents: read jobs: + # Guard against a long-lived branch adding a migration whose timestamped + # filename sorts BEFORE migrations already applied on the target branch (and + # thus in prod). The Kysely startup migrator rejects that as "corrupted + # migrations" and crash-loops the app on boot (incident #361). This gate fails + # the PR so the migration is renamed to a current timestamp before merge. Only + # runs for pull_request events (needs a base branch to diff against). + migration-order: + if: github.event_name == 'pull_request' + runs-on: ubuntu-latest + timeout-minutes: 5 + steps: + - name: Checkout (full history for the base-branch diff) + uses: actions/checkout@v4 + with: + fetch-depth: 0 + - name: Added migrations must sort after the newest on the base branch + env: + TARGET_BRANCH: ${{ github.base_ref }} + run: | + set -euo pipefail + MIG_DIR="apps/server/src/database/migrations" + git fetch --no-tags --depth=1 origin "$TARGET_BRANCH" + newest_on_target=$(git ls-tree -r --name-only "origin/${TARGET_BRANCH}" "$MIG_DIR" | sort | tail -1) + added=$(git diff --diff-filter=A --name-only "origin/${TARGET_BRANCH}...HEAD" -- "$MIG_DIR" || true) + bad=0 + for f in $added; do + if [[ "$f" < "$newest_on_target" || "$f" == "$newest_on_target" ]]; then + echo "::error::Migration $f sorts at or before the newest on ${TARGET_BRANCH} ($newest_on_target) — rename it with a CURRENT timestamp before merge (do not change its contents). See incident #361." + bad=1 + fi + done + if [ "$bad" -eq 0 ]; then + echo "Migration order OK (added migrations all sort after $newest_on_target)." + fi + exit $bad + test: runs-on: ubuntu-latest timeout-minutes: 20 diff --git a/apps/server/src/database/migrate.ts b/apps/server/src/database/migrate.ts index a5d58766..bc2db37e 100644 --- a/apps/server/src/database/migrate.ts +++ b/apps/server/src/database/migrate.ts @@ -24,6 +24,10 @@ const migrator = new Migrator({ path, migrationFolder, }), + // Match the startup auto-migrator (migration.service.ts): a back-dated + // migration from a long-lived branch must be applied, not rejected as + // "corrupted migrations" (incident #361). See that file for the full rationale. + allowUnorderedMigrations: true, }); run(db, migrator, migrationFolder); diff --git a/apps/server/src/database/services/migration.service.ts b/apps/server/src/database/services/migration.service.ts index 3247e5f8..2b63f728 100644 --- a/apps/server/src/database/services/migration.service.ts +++ b/apps/server/src/database/services/migration.service.ts @@ -19,6 +19,16 @@ export class MigrationService { path, migrationFolder: path.join(__dirname, '..', 'migrations'), }), + // A long-lived branch can add a migration whose timestamped filename sorts + // BEFORE migrations already applied in prod (e.g. #234's 20260627 landing + // after 20260704 was live). With the default (ordered) setting the startup + // migrator then sees "corrupted migrations" — the applied set is no longer a + // prefix of the sorted list — throws, and the app crash-loops on boot + // (incident #361: 502s for ~11 min). allowUnorderedMigrations runs any + // not-yet-applied migration regardless of filename order, so a back-dated + // migration is applied instead of bricking startup. A CI order-gate still + // discourages back-dating; this is the runtime safety net. + allowUnorderedMigrations: true, }); const { error, results } = await migrator.migrateToLatest(); -- 2.52.0 From 7b4617db703199c0c6ee91b05a1014c3497827de Mon Sep 17 00:00:00 2001 From: agent_coder Date: Sun, 5 Jul 2026 02:27:16 +0300 Subject: [PATCH 2/3] fix(#363 review F1): make the migration-order gate fail CLOSED (not open) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The CI gate — whose whole job is to BLOCK a back-dated migration — could pass open in exactly the scenario it guards (a long branch vs a moving base, i.e. #361): - Dropped the redundant `git fetch --depth=1`: the checkout already did fetch-depth:0 (full history), and the shallow graft truncated the BASE history, so `merge-base` (thus the three-dot `origin/base...HEAD` diff) failed when the base had moved ahead of the PR merge commit. - Removed `|| true` on the diff: it swallowed that failure → `added` empty → loop skipped → bad=0 → gate PASS. Now `set -e` aborts the job (fail CLOSED) on any diff error — a gate must never pass on error. Verified: yaml parses (jobs migration-order, test); a broken-ref diff with set -e and no `|| true` aborts before bad=0 (fail-closed) instead of passing. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/test.yml | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index 5eb5c9f4..a0943391 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -34,9 +34,16 @@ jobs: run: | set -euo pipefail MIG_DIR="apps/server/src/database/migrations" - git fetch --no-tags --depth=1 origin "$TARGET_BRANCH" + # checkout above already did fetch-depth:0 (full history). Fetch the base + # WITHOUT --depth (a shallow graft would truncate the base history and + # break the merge-base when the base has moved ahead of the PR merge — + # exactly the long-branch-vs-moving-base case this gate guards, #361). + git fetch --no-tags origin "$TARGET_BRANCH" newest_on_target=$(git ls-tree -r --name-only "origin/${TARGET_BRANCH}" "$MIG_DIR" | sort | tail -1) - added=$(git diff --diff-filter=A --name-only "origin/${TARGET_BRANCH}...HEAD" -- "$MIG_DIR" || true) + # NO `|| true`: a diff failure (e.g. an unresolved merge-base) must fail + # the job CLOSED — a gate whose job is to BLOCK must never pass on error. + # `set -e` above already aborts on a non-zero diff exit. + added=$(git diff --diff-filter=A --name-only "origin/${TARGET_BRANCH}...HEAD" -- "$MIG_DIR") bad=0 for f in $added; do if [[ "$f" < "$newest_on_target" || "$f" == "$newest_on_target" ]]; then -- 2.52.0 From 0050ad7ebb38bc23931405f2914156733fd61316 Mon Sep 17 00:00:00 2001 From: agent_coder Date: Sun, 5 Jul 2026 02:57:11 +0300 Subject: [PATCH 3/3] docs(#363 review F2): update AGENTS.md migration-ordering to the new tolerant behavior MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The "Migration ordering" section still described the OLD crash-loop-at-boot behavior this PR removes ("Kysely refuses to start … rejected at boot"). Rewrote it to the new two-layer model: the CI migration-order gate is the primary defense (rename to a current timestamp), and the runtime now sets allowUnorderedMigrations so the app applies a back-dated migration instead of crash-looping (with the note that #ensureNoMissingMigrations still guards a removed applied migration, and that migrations must stay independent since apply order can differ across instances). Co-Authored-By: Claude Opus 4.8 (1M context) --- AGENTS.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/AGENTS.md b/AGENTS.md index 9bedbc39..f541ebb8 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -250,7 +250,10 @@ pnpm --filter server migration:codegen # regenerate src/databa ``` Migration files live in `apps/server/src/database/migrations/` and are named `YYYYMMDDThhmmss-description.ts`. Fork-specific migrations only **add** tables (`page_embeddings`, `ai_chats`, `ai_chat_messages`, `ai_provider_credentials`, `ai_mcp_servers`, `page_template_references`) and columns (e.g. `pages.is_template`, a `NOT NULL DEFAULT false` boolean) — never drop/rewrite Docmost data. -**Migration ordering — always check when merging branches/features.** Kysely runs migrations in **alphabetical (= timestamp) order** and refuses to start if a *new* migration sorts **before** one already applied to the DB (`corrupted migrations: ... must always have a name that comes alphabetically after the last executed migration`). When you merge a branch or land a feature, verify your migration's timestamp still sorts **after every migration that may already be applied on the target** (`/bin/ls -1 apps/server/src/database/migrations | sort | tail`). Branches developed in parallel routinely break this: a feature branch adds `…T130000-…`, `main` meanwhile ships and deploys `…T150000-…`, and after the merge the older-timestamped file is rejected at boot. **Fix = rename your migration to a timestamp after the latest one already in the target** (content unchanged — the filename is the ordering key), then rebuild so the compiled `dist/database/migrations/` picks up the new name. +**Migration ordering — always check when merging branches/features.** Kysely runs migrations in **alphabetical (= timestamp) order**. A *new* migration that sorts **before** one already applied to the DB is a "back-dated" migration, which branches developed in parallel routinely produce: a feature branch adds `…T130000-…`, `develop` meanwhile ships and deploys `…T150000-…`, and after the merge the older-timestamped file has been skipped. Two layers guard this (both added for incident #361, where a back-dated migration crash-looped prod for ~11 min): + +- **CI gate (primary):** the `migration-order` job in `.github/workflows/test.yml` fails a PR whose added migration sorts at/before the newest on the base branch. **So the fix is to rename your migration to a timestamp after the latest one already in the target** (`/bin/ls -1 apps/server/src/database/migrations | sort | tail`; content unchanged — the filename is the ordering key), then rebuild so the compiled `dist/database/migrations/` picks up the new name. +- **Runtime safety net:** both Migrators (`migration.service.ts` startup auto-migrate + `migrate.ts` CLI) set `allowUnorderedMigrations: true`, so the app does **not** refuse to start on an out-of-order migration — it applies the skipped older one instead of crash-looping. Kysely's `#ensureNoMissingMigrations` guard is still on (a *removed* applied migration is still an error). Because apply order can then differ from lexicographic across instances, migrations must stay **independent** (each creates its own objects) — the CI gate remains the primary line; this net only covers a gate bypass (manual push / hotfix branch). ## Architecture — the big picture -- 2.52.0