docs(#363 review F2): update AGENTS.md migration-ordering to the new tolerant behavior

The "Migration ordering" section still described the OLD crash-loop-at-boot behavior this PR removes ("Kysely refuses to start … rejected at boot"). Rewrote it to the new two-layer model: the CI migration-order gate is the primary defense (rename to a current timestamp), and the runtime now sets allowUnorderedMigrations so the app applies a back-dated migration instead of crash-looping (with the note that #ensureNoMissingMigrations still guards a removed applied migration, and that migrations must stay independent since apply order can differ across instances).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7 changed files with 120 additions and 15 deletions
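To make the description's distinction between the old and new runtime behavior concrete, here is a minimal TypeScript sketch. It is a simplified model under stated assumptions, not Kysely's actual implementation: `firstMismatch` and `pendingMigrations` are hypothetical helper names, and the two migration names are the ones cited in this PR's commit messages.

```typescript
// Simplified model only; Kysely's real checks live inside its Migrator class.

// Ordered mode (modelled): the applied list must be a prefix of the sorted
// file list. Returns the index of the first applied entry that breaks the
// prefix property, or -1 when the applied list is still a clean prefix.
function firstMismatch(
  allFiles: readonly string[],
  applied: readonly string[],
): number {
  const sorted = [...allFiles].sort();
  return applied.findIndex((name, i) => sorted[i] !== name);
}

// Unordered mode (modelled): every file that has not been applied yet is
// pending, regardless of where its name sorts.
function pendingMigrations(
  allFiles: readonly string[],
  applied: readonly string[],
): string[] {
  const done = new Set(applied);
  return [...allFiles].sort().filter((name) => !done.has(name));
}

// The back-dated file lands after the newer one is already applied.
const files = [
  "20260704T120000-client-metrics",
  "20260627T130000-ai-chat-runs",
];
const appliedInProd = ["20260704T120000-client-metrics"];

const mismatchIndex = firstMismatch(files, appliedInProd); // 0: prefix broken
const pending = pendingMigrations(files, appliedInProd); // the back-dated file
```

In this model the ordered check reports a mismatch at index 0, which corresponds to the "corrupted migrations" failure, while the unordered view simply lists `20260627T130000-ai-chat-runs` as still to be applied.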
@@ -18,12 +18,48 @@ env:
  IMAGE: ghcr.io/vvzvlad/gitmost
 jobs:
-  # Run the reusable test suite first so a failing test blocks the image build.
+  # Run the reusable test suite. Together with the e2e jobs below it gates the
  # publish job (the image push), not the build itself — build runs in parallel.
  test:
    uses: ./.github/workflows/test.yml
  # Runs in parallel with the test/e2e jobs and only warms the buildx cache
  # (GHA cache, scope develop-amd64). No push happens here — the publish job
  # below is the only one that pushes the image.
  build:
-    needs: test
+    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Resolve version
        id: version
        run: echo "value=$(git describe --tags --always)" >> "$GITHUB_OUTPUT"
      - name: Build develop image (warm cache, no push)
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/amd64
          build-args: |
            APP_VERSION=${{ steps.version.outputs.value }}
            AI_AGENT_ROLES_CATALOG_URL=https://raw.githubusercontent.com/vvzvlad/gitmost/develop/agent-roles-catalog
          push: false
          cache-from: type=gha,scope=develop-amd64
          cache-to: type=gha,scope=develop-amd64,mode=max,ignore-error=true
  # The gate: rebuilds from the cache the build job just wrote (near-instant on
  # a cache hit; worst case — cache eviction — a full rebuild, which matches the
  # old sequential timing) and pushes :develop only when unit tests AND both
  # e2e suites AND the build are green.
  publish:
    needs: [test, e2e-server, e2e-mcp, build]
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
@@ -57,13 +93,10 @@ jobs:
          push: true
          tags: ${{ env.IMAGE }}:develop
          cache-from: type=gha,scope=develop-amd64
          cache-to: type=gha,scope=develop-amd64,mode=max,ignore-error=true
-  # e2e jobs run on every develop push but DO NOT gate the build/publish above:
-  # `build` stays `needs: test` only, so the :develop image still ships even if
-  # e2e fails. A failing e2e job turns the run red and triggers GitHub's email
-  # to the pusher — that red run + email is the intended notification, not a
-  # deploy block.
+  # e2e jobs gate the publish (image push), not the build: the :develop image
+  # is pushed only when unit tests AND both e2e suites pass (publish.needs
+  # lists them all).
  e2e-server:
    runs-on: ubuntu-latest
    # Hard cap: the full-AppModule e2e leaks open handles and hung jest to the 6h max.
@@ -124,9 +157,7 @@ jobs:
      - name: Run server e2e
        run: pnpm --filter ./apps/server test:e2e
-  # Same rationale as e2e-server: this job is intentionally NOT in
-  # `build.needs`. Deploy of the :develop image must not be blocked by e2e;
-  # a red run plus GitHub's email to the pusher is the notification mechanism.
+  # Gates the publish too — see the comment above e2e-server.
  e2e-mcp:
    runs-on: ubuntu-latest
    timeout-minutes: 20
@@ -13,6 +13,49 @@ permissions:
  contents: read
 jobs:
  # Guard against a long-lived branch adding a migration whose timestamped
  # filename sorts BEFORE migrations already applied on the target branch (and
  # thus in prod). The Kysely startup migrator rejects that as "corrupted
  # migrations" and crash-loops the app on boot (incident #361). This gate fails
  # the PR so the migration is renamed to a current timestamp before merge. Only
  # runs for pull_request events (needs a base branch to diff against).
  migration-order:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - name: Checkout (full history for the base-branch diff)
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Added migrations must sort after the newest on the base branch
        env:
          TARGET_BRANCH: ${{ github.base_ref }}
        run: |
          set -euo pipefail
          MIG_DIR="apps/server/src/database/migrations"
          # checkout above already did fetch-depth:0 (full history). Fetch the base
          # WITHOUT --depth (a shallow graft would truncate the base history and
          # break the merge-base when the base has moved ahead of the PR merge —
          # exactly the long-branch-vs-moving-base case this gate guards, #361).
          git fetch --no-tags origin "$TARGET_BRANCH"
          newest_on_target=$(git ls-tree -r --name-only "origin/${TARGET_BRANCH}" "$MIG_DIR" | sort | tail -1)
          # NO `|| true`: a diff failure (e.g. an unresolved merge-base) must fail
          # the job CLOSED — a gate whose job is to BLOCK must never pass on error.
          # `set -e` above already aborts on a non-zero diff exit.
          added=$(git diff --diff-filter=A --name-only "origin/${TARGET_BRANCH}...HEAD" -- "$MIG_DIR")
          bad=0
          for f in $added; do
            if [[ "$f" < "$newest_on_target" || "$f" == "$newest_on_target" ]]; then
              echo "::error::Migration $f sorts at or before the newest on ${TARGET_BRANCH} ($newest_on_target) — rename it with a CURRENT timestamp before merge (do not change its contents). See incident #361."
              bad=1
            fi
          done
          if [ "$bad" -eq 0 ]; then
            echo "Migration order OK (added migrations all sort after $newest_on_target)."
          fi
          exit $bad
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 20
@@ -250,7 +250,10 @@ pnpm --filter server migration:codegen                   # regenerate src/databa
 ```
 Migration files live in `apps/server/src/database/migrations/` and are named `YYYYMMDDThhmmss-description.ts`. Fork-specific migrations only **add** tables (`page_embeddings`, `ai_chats`, `ai_chat_messages`, `ai_provider_credentials`, `ai_mcp_servers`, `page_template_references`) and columns (e.g. `pages.is_template`, a `NOT NULL DEFAULT false` boolean) — never drop/rewrite Docmost data.
-**Migration ordering — always check when merging branches/features.** Kysely runs migrations in **alphabetical (= timestamp) order** and refuses to start if a *new* migration sorts **before** one already applied to the DB (`corrupted migrations: ... must always have a name that comes alphabetically after the last executed migration`). When you merge a branch or land a feature, verify your migration's timestamp still sorts **after every migration that may already be applied on the target** (`/bin/ls -1 apps/server/src/database/migrations | sort | tail`). Branches developed in parallel routinely break this: a feature branch adds `…T130000-…`, `main` meanwhile ships and deploys `…T150000-…`, and after the merge the older-timestamped file is rejected at boot. **Fix = rename your migration to a timestamp after the latest one already in the target** (content unchanged — the filename is the ordering key), then rebuild so the compiled `dist/database/migrations/` picks up the new name.
+**Migration ordering — always check when merging branches/features.** Kysely runs migrations in **alphabetical (= timestamp) order**. A *new* migration that sorts **before** one already applied to the DB is a "back-dated" migration, which branches developed in parallel routinely produce: a feature branch adds `…T130000-…`, `develop` meanwhile ships and deploys `…T150000-…`, and after the merge the older-timestamped file has been skipped. Two layers guard this (both added for incident #361, where a back-dated migration crash-looped prod for ~11 min):
 - **CI gate (primary):** the `migration-order` job in `.github/workflows/test.yml` fails a PR whose added migration sorts at/before the newest on the base branch. **So the fix is to rename your migration to a timestamp after the latest one already in the target** (`/bin/ls -1 apps/server/src/database/migrations | sort | tail`; content unchanged — the filename is the ordering key), then rebuild so the compiled `dist/database/migrations/` picks up the new name.
 - **Runtime safety net:** both Migrators (`migration.service.ts` startup auto-migrate + `migrate.ts` CLI) set `allowUnorderedMigrations: true`, so the app does **not** refuse to start on an out-of-order migration — it applies the skipped older one instead of crash-looping. Kysely's `#ensureNoMissingMigrations` guard is still on (a *removed* applied migration is still an error). Because apply order can then differ from lexicographic across instances, migrations must stay **independent** (each creates its own objects) — the CI gate remains the primary line; this net only covers a gate bypass (manual push / hotfix branch).
 ## Architecture — the big picture
@@ -5,6 +5,13 @@ RUN npm install -g pnpm@10.4.0
 FROM base AS builder
 # re2 (packages/mcp) always compiles from source under pnpm (the prebuilt-binary
 # download cannot identify the GitHub repo), so node-gyp needs python3/make/g++.
 # This stage is discarded, so the toolchain can stay installed.
 RUN apt-get update \
  && apt-get install -y --no-install-recommends python3 make g++ \
  && rm -rf /var/lib/apt/lists/*
 WORKDIR /app
 COPY . .
@@ -57,9 +64,16 @@ COPY --from=builder /app/patches /app/patches
 RUN chown -R node:node /app
-USER node
-RUN pnpm install --frozen-lockfile --prod
+# Toolchain is needed transiently to compile re2 during the prod install; install
+# and purge it in one layer to keep the final image slim. The install itself runs
+# as the node user via su to keep node_modules ownership without a costly chown layer.
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends python3 make g++ \
+ && su node -c "pnpm install --frozen-lockfile --prod" \
+ && apt-get purge -y --auto-remove python3 make g++ \
+ && rm -rf /var/lib/apt/lists/*
+USER node
 RUN mkdir -p /app/data/storage
@@ -24,6 +24,10 @@ const migrator = new Migrator({
    path,
    migrationFolder,
  }),
  // Match the startup auto-migrator (migration.service.ts): a back-dated
  // migration from a long-lived branch must be applied, not rejected as
  // "corrupted migrations" (incident #361). See that file for the full rationale.
  allowUnorderedMigrations: true,
 });
 run(db, migrator, migrationFolder);
@@ -19,6 +19,16 @@ export class MigrationService {
        path,
        migrationFolder: path.join(__dirname, '..', 'migrations'),
      }),
      // A long-lived branch can add a migration whose timestamped filename sorts
      // BEFORE migrations already applied in prod (e.g. #234's 20260627 landing
      // after 20260704 was live). With the default (ordered) setting the startup
      // migrator then sees "corrupted migrations" — the applied set is no longer a
      // prefix of the sorted list — throws, and the app crash-loops on boot
      // (incident #361: 502s for ~11 min). allowUnorderedMigrations runs any
      // not-yet-applied migration regardless of filename order, so a back-dated
      // migration is applied instead of bricking startup. A CI order-gate still
      // discourages back-dating; this is the runtime safety net.
      allowUnorderedMigrations: true,
    });
    const { error, results } = await migrator.migrateToLatest();
@@ -450,7 +450,7 @@ async function main() {
    // 8. get_page markdown round-trip sanity (table separator present)
    const md = await client.getPage(pageId);
    check("get_page md: table separator emitted", md.data.content.includes("| --- |"), "");
-    check("get_page md: callout exported as :::", md.data.content.includes(":::info"));
+    check("get_page md: callout exported as Obsidian '> [!info]'", md.data.content.includes("> [!info]"));
    // 9. comments: create / list / reply / update / check_new / delete
    const beforeComments = new Date(Date.now() - 1000).toISOString();
Author	SHA1	Message	Date
agent_coder	0050ad7ebb	docs(#363 review F2): update AGENTS.md migration-ordering to the new tolerant behavior The "Migration ordering" section still described the OLD crash-loop-at-boot behavior this PR removes ("Kysely refuses to start … rejected at boot"). Rewrote it to the new two-layer model: the CI migration-order gate is the primary defense (rename to a current timestamp), and the runtime now sets allowUnorderedMigrations so the app applies a back-dated migration instead of crash-looping (with the note that #ensureNoMissingMigrations still guards a removed applied migration, and that migrations must stay independent since apply order can differ across instances). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 02:57:11 +03:00
agent_coder	7b4617db70	fix(#363 review F1): make the migration-order gate fail CLOSED (not open) The CI gate — whose whole job is to BLOCK a back-dated migration — could pass open in exactly the scenario it guards (a long branch vs a moving base, i.e. #361): - Dropped the redundant `git fetch --depth=1`: the checkout already did fetch-depth:0 (full history), and the shallow graft truncated the BASE history, so `merge-base` (thus the three-dot `origin/base...HEAD` diff) failed when the base had moved ahead of the PR merge commit. - Removed `|| true` on the diff: it swallowed that failure → `added` empty → loop skipped → bad=0 → gate PASS. Now `set -e` aborts the job (fail CLOSED) on any diff error — a gate must never pass on error. Verified: yaml parses (jobs migration-order, test); a broken-ref diff with set -e and no `|| true` aborts before bad=0 (fail-closed) instead of passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 02:27:16 +03:00
agent_coder	459d636ffb	fix(db): prevent the migration-order crash-loop from long-lived branches (#363, incident #361) A long-lived branch can add a migration whose timestamped filename sorts BEFORE migrations already applied in prod (#234's 20260627T130000-ai-chat-runs merged after 20260704T120000-client-metrics was live). Kysely's migrator with the default ordered setting then rejects the applied set as "corrupted migrations" (no longer a prefix of the sorted list), throws, and the app crash-loops on boot — exactly incident #361 (502s for ~11 min after a develop deploy). #119 and #120 (June branches) are the next such threats. Two levels, both: 1. CI migration-order gate (a new `migration-order` job in test.yml, PR-only): fails the PR when an added migration sorts at/before the newest migration on the base branch, with an actionable message to rename it to a current timestamp before merge. This is the primary defense — makes back-dating impossible to merge accidentally. 2. `allowUnorderedMigrations: true` on BOTH Migrators (migration.service.ts startup auto-migrate + migrate.ts CLI): the runtime safety net — Kysely applies a not-yet-applied older migration instead of bricking startup, so a back-dated migration that bypasses the gate (manual push / hotfix branch) still boots. Trade-off documented inline: apply order across instances may differ from lexicographic, so migrations must stay independent (ours each create their own objects); the CI gate remains the primary line. Verified: allowUnorderedMigrations is a valid Kysely 0.28.17 Migrator option; server tsc clean; the gate script rejects a back-dated filename and passes a current one. No new deps, no migration, no runtime behavior change beyond the migrator resilience. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-05 01:36:57 +03:00
agent_vscode	5336f06d10	Merge pull request 'fix(e2e)+ci: canonical callout '> [!info]' in e2e-mcp + parallel build with a gate on publish' (#356) from fix/e2e-callout-and-gate-build into develop	2026-07-04 22:42:11 +03:00
agent_vscode	4bd579f7f6	ci(develop): build image in parallel with tests, gate only the publish Two-phase scheme instead of the sequential gate: the build job runs in parallel with test/e2e jobs and only warms the buildx GHA cache (push:false, cache-to mode=max); a new publish job (needs: test, e2e-server, e2e-mcp, build) rebuilds from the warm cache (near-instant on hit, full rebuild on eviction — same as the old sequential timing) and pushes :develop. GHCR login moved to publish; build-args blocks are kept textually identical between the two jobs so the cache hits. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:41:25 +03:00
agent_vscode	7bf1c91a95	ci(develop): gate the :develop image build on e2e suites Reverse the previous policy where e2e jobs only turned the run red without blocking the image publish: build.needs now lists test, e2e-server and e2e-mcp, so a failing test of any kind stops the :develop image from being built and pushed. Stale policy comments updated accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:33:06 +03:00
agent_vscode	6c82c54470	test(mcp): expect Obsidian '> [!info]' callout export in e2e (#333 canon) PR #333 deliberately changed the canonical markdown export of callout nodes to the Obsidian-native format ('> [!type]' + blockquote body, pinned by packages/prosemirror-markdown unit tests); the importer still parses both ':::type' fences and '> [!type]'. The get_page e2e assertion was missed in that switch and still expected ':::info', failing the e2e-mcp job on develop since `4369bbc5`. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:33:06 +03:00
agent_vscode	382e5196da	Merge pull request 'fix(docker): python3/make/g++ toolchain for the native re2 build' (#353) from fix/docker-re2-toolchain into develop	2026-07-04 22:11:49 +03:00
agent_vscode	76e0c08cec	fix(docker): install python3/make/g++ toolchain for re2 native build The develop image build broke at `pnpm install --frozen-lockfile`: the new native dependency re2@1.25.0 (packages/mcp, search_in_page #330) always compiles from source under pnpm — its prebuilt-binary downloader (install-artifact-from-github) cannot identify the GitHub repo because pnpm does not populate npm_package_repository_*/npm_package_json env vars ("No github repository was identified. Building locally ..."), and node:22-slim ships no python3/make/g++ for the node-gyp fallback. - builder stage: add a cache-friendly apt layer with python3 make g++ before COPY; the stage is discarded so the toolchain may stay. - installer stage: install the toolchain, run the prod install as the node user via `su node -c`, and purge the toolchain — all in one RUN layer so the final image stays slim and node_modules ownership needs no extra chown layer; USER node is restored right after. Fixes the failed run 28715009124 (develop docker build); release.yml uses the same Dockerfile and is covered too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-04 22:09:40 +03:00
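
The comparison and the fail-closed property that commits 459d636ffb and 7b4617db70 describe for the `migration-order` gate can be sketched in TypeScript as below. This is an illustration under assumptions, not the workflow's shell script: `migrationOrderGate`, `GateResult` and the `listAdded` callback are hypothetical names, the `20260705T090000-…` filename is an invented example of a renamed migration, and the sketch assumes plain ASCII filenames so that JavaScript string comparison agrees with the shell's sort order.

```typescript
// Hypothetical result shape for this sketch.
interface GateResult {
  ok: boolean;
  offenders: string[];
}

// Flags every added migration path that sorts at or before the newest
// migration on the target branch.
function migrationOrderGate(
  listAdded: () => string[],
  newestOnTarget: string,
): GateResult {
  // Deliberately no try/catch: if listing the added files fails, the error
  // propagates and the gate fails closed instead of reporting "ok".
  const added = listAdded();
  const offenders = added.filter((path) => path <= newestOnTarget);
  return { ok: offenders.length === 0, offenders };
}

const dir = "apps/server/src/database/migrations";
const newest = `${dir}/20260704T120000-client-metrics.ts`;

// Back-dated migration from a long-lived branch: rejected.
const backDated = migrationOrderGate(
  () => [`${dir}/20260627T130000-ai-chat-runs.ts`],
  newest,
);

// Same migration renamed to a later (hypothetical) timestamp: accepted.
const renamed = migrationOrderGate(
  () => [`${dir}/20260705T090000-ai-chat-runs.ts`],
  newest,
);
```

Here `backDated.ok` is `false` and `renamed.ok` is `true`. A `listAdded` callback that throws makes the whole call throw, which mirrors the effect of dropping `|| true` under `set -e`.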