fix(ai-chat): unconditional boot sweep + single-instance guard for autonomous runs (#184)

F1 (DECISION C): make the crash-recovery boot sweep UNCONDITIONAL. A fast restart (deploy/OOM within the old 10-min window of the last step) left a run stuck `running` forever, and the one-active-run gate then 409'd every future turn in that chat. On a fresh single-process boot any pending|running run is definitionally hung, so onModuleInit now settles ALL of them to `aborted` with no staleness window. AiChatRunRepo.sweepRunning takes an optional { staleMs } window, kept ONLY for the future phase-2 multi-instance timer sweep (the boot path passes no window). Repo + service tests assert a fresh `running` run (updatedAt = now) is settled, not skipped. F2 (DECISION A): treat phase-1 autonomousRuns as SINGLE-INSTANCE-ONLY. Stop and its AbortController are process-local, so cross-instance Stop is unreliable (phase 2). AiChatRunService now logs a startup WARNING when a horizontally-scaled deployment is detected — via EnvironmentService.isCloud() (CLOUD=true), the only horizontal-scaling signal this codebase has (the socket.io Redis adapter is always wired since REDIS_URL is mandatory, so it is not a discriminator). The constraint is documented in AGENTS.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 23:52:32 +03:00
parent 4c0a4eb9cc
commit c0844d5431
6 changed files with 236 additions and 37 deletions
--- a/apps/server/src/core/ai-chat/ai-chat-run.service.ts
+++ b/apps/server/src/core/ai-chat/ai-chat-run.service.ts
@@ -2,6 +2,7 @@ import { Injectable, Logger, OnModuleInit } from '@nestjs/common';
 import { AiChatRunRepo } from '@docmost/db/repos/ai-chat/ai-chat-run.repo';
 import { AiChatRun } from '@docmost/db/types/entity.types';
 import { isUniqueViolation, violatedConstraint } from '@docmost/db/utils';
+import { EnvironmentService } from '../../integrations/environment/environment.service';

 /** Name of the partial unique index enforcing "one active run per chat" (see the
 *  ai_chat_runs migration). A 23505 on THIS constraint is the race-safe signal
@@ -90,16 +91,28 @@ export class AiChatRunService implements OnModuleInit {
  // restart) still records `stop_requested_at` on the row.
  private readonly active = new Map<string, ActiveRun>();

-  constructor(private readonly runRepo: AiChatRunRepo) {}
+  constructor(
+    private readonly runRepo: AiChatRunRepo,
+    private readonly environment: EnvironmentService,
+  ) {}

  /**
-   * Crash-recovery sweep on server start: any run left pending/running that has
-   * been untouched past the staleness window is the relic of a process that died
-   * mid-turn; flip it to 'aborted'. Best-effort — a sweep failure is logged but
-   * MUST NOT block startup (mirrors AiChatService.onModuleInit for #183).
+   * Crash-recovery sweep on server start: settle EVERY run still left
+   * pending/running to 'aborted' (F1 / DECISION C). The boot sweep is
+   * UNCONDITIONAL — no staleness window — because phase 1 is single-process: on a
+   * fresh boot any pending|running run is definitionally hung (no live runner owns
+   * it), so even a fast restart (deploy/OOM within minutes of the last step) can
+   * no longer leave a run stuck 'running' forever (which would make the
+   * one-active-run gate 409 every future turn in that chat). The staleness window
+   * is reintroduced only for the phase-2 multi-instance timer sweep, where a
+   * booting replica must not abort a run another replica is actively executing.
+   * Best-effort — a sweep failure is logged but MUST NOT block startup (mirrors
+   * AiChatService.onModuleInit for #183).
   */
  async onModuleInit(): Promise<void> {
+    this.warnIfMultiInstance();
    try {
+      // No `staleMs`: unconditional boot sweep (F1). See AiChatRunRepo.sweepRunning.
      const swept = await this.runRepo.sweepRunning();
      if (swept > 0) {
        this.logger.log(
@@ -115,6 +128,36 @@ export class AiChatRunService implements OnModuleInit {
    }
  }

+  /**
+   * F2 (DECISION A): autonomous runs are SINGLE-INSTANCE-ONLY in phase 1. An
+   * explicit Stop, and the in-memory AbortController that backs it, are
+   * process-local: a Stop only aborts the live turn if it lands on the SAME
+   * replica that owns the run (it still stamps `stop_requested_at` cross-instance,
+   * but nothing reads that flag during an active run yet). Cross-instance pub/sub
+   * stop is phase 2. So if the deployment is horizontally scaled, warn loudly at
+   * startup that a Stop may not reach a run executing on another replica.
+   *
+   * DETECTION: this codebase always wires the socket.io Redis adapter (REDIS_URL
+   * is mandatory), so the adapter alone is NOT a horizontal-scaling signal. The
+   * authoritative signal the codebase has is `CLOUD=true` (EnvironmentService
+   * .isCloud()), the Docmost-cloud multi-replica deployment. We warn whenever that
+   * is set, because any workspace could enable settings.ai.autonomousRuns. A
+   * self-hosted operator running multiple replicas behind a load balancer is also
+   * multi-instance; the deploy docs (.env.example / AGENTS.md) spell out the
+   * single-instance constraint for that case.
+   */
+  private warnIfMultiInstance(): void {
+    if (this.environment.isCloud()) {
+      this.logger.warn(
+        'Autonomous agent runs (settings.ai.autonomousRuns) are SINGLE-INSTANCE-ONLY ' +
+          'in phase 1: a horizontally-scaled deployment was detected (CLOUD=true). ' +
+          'An explicit Stop only aborts a run executing on the same replica that owns ' +
+          'it (cross-instance Stop is not yet reliable — phase 2). Run a single ' +
+          'instance if you enable autonomousRuns, or keep the flag off.',
+      );
+    }
+  }
+
  /**
   * Start a run for a turn: insert the run row (status 'running', startedAt now),
   * register a fresh AbortController for it, and return a {@link RunHandle} whose