6 Commits

Author SHA1 Message Date
agent_coder
492d3d01b0 feat(#19): separate webhook per automation mechanism (update vs heal)
Split the single container-automation webhook URL into two independently
optional URLs — UpdateWebhookURL (fired on update/rollback/update-failed) and
HealWebhookURL (fired on auto-heal restart). The notifier routes each event to
its mechanism's URL by kind; an empty URL silences only that mechanism, so a
user can enable notifications for updates without heal (or vice-versa).
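
The routing described above can be sketched as follows. This is a minimal illustration, not the actual notifier code: the EventKind type, its string values, the NotificationSettings struct and the webhookURLFor helper are assumed names. Only the two URL field names and the update/heal grouping come from the commit message.

```go
package main

import "fmt"

// EventKind is a hypothetical stand-in for the notifier's event kinds.
type EventKind string

const (
	EventUpdated       EventKind = "updated"
	EventRollback      EventKind = "rollback"
	EventUpdateFailed  EventKind = "update-failed"
	EventHealRestarted EventKind = "heal-restarted"
)

// NotificationSettings holds the two independently optional URLs.
// The struct itself is an assumption; the field names are from the commit.
type NotificationSettings struct {
	UpdateWebhookURL string
	HealWebhookURL   string
}

// webhookURLFor picks the URL of the mechanism that produced the event.
// An empty result means that mechanism is silenced.
func webhookURLFor(s NotificationSettings, kind EventKind) string {
	switch kind {
	case EventUpdated, EventRollback, EventUpdateFailed:
		return s.UpdateWebhookURL
	case EventHealRestarted:
		return s.HealWebhookURL
	default:
		return ""
	}
}

func main() {
	// Updates are enabled, heal notifications are left empty (silenced).
	s := NotificationSettings{UpdateWebhookURL: "https://example.com/update-hook"}
	fmt.Printf("rollback -> %q\n", webhookURLFor(s, EventRollback))
	fmt.Printf("heal     -> %q\n", webhookURLFor(s, EventHealRestarted))
}
```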

Settings gain both fields (each validated http/https, {{message}} allowed), the
NotificationPanel exposes two labeled inputs, and the golden migration output is
updated. The delivery path (goroutine/recover/timeout, {{message}} GET vs POST,
per-container stack message format) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 22:47:25 +03:00
agent_coder
eb35e9c47f feat(automation): configurable webhook notifier for automation events
Add an opt-in webhook notification for container-automation events (image
update, rollback, update-failed, auto-heal restart), plugging into the existing
Notifier seam in notify.go.

- Settings: new ContainerAutomation.Notification.WebhookURL (shared across
  update + heal), persisted and validated in the settings update handler
  (optional; http/https only; accepts the {{message}} placeholder).
- webhookNotifier reads the current URL from the datastore per event (UI changes
  take effect without a restart). If the URL contains {{message}} it substitutes
  the URL-encoded message and issues a GET; otherwise it POSTs the message as the
  body. Delivery and the env/stack name lookups run in a goroutine with a 10s
  timeout, and recover() catches any panic there: strictly best-effort, it never
  blocks or crashes the automation daemon. multiNotifier fans events out to
  logNotifier + webhook and isolates a panic in any one notifier.
- Message format (maintainer's spec):
    Environment | <env>
    Stack [<name>]            (Container [<name>] for non-stack events)
    Update [<name>]: <old> -> <new>
  Auto-heal: 'Auto-heal: restarted unhealthy container'.
- New NotificationPanel in settings to configure the URL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 19:31:18 +03:00
claude code agent
be3bfd0513 fix(automation): maintainer pre-merge review — stale detection, daemon edge cases, parity (F1-F9)
F1: cap the image-status cache TTL at 5m (was 24h) — the cache is keyed by the
    LOCAL imageID, which doesn't change when upstream pushes a new image under the
    same tag, so the 24h TTL hid new images from both the badge and the auto-update
    daemon; a short TTL re-resolves the remote digest within the poll window.
F2: document that the update->rollback guard map is in-memory (restart implication).
F3: skip auto-update for an unnamed container when rollback is on (the endpoint+name
    keyed guard can't record it, so it would loop) — pure skipUnnamedForRollback + test.
F4: wrap the pre-update ContainerInspect in context.WithTimeout(endpointTimeout).
F5: document Reload() does not interrupt an in-flight tick.
F6: floor auto-heal CheckInterval at 1s (mirrors auto-update) + test.
F7: wontfix — migration is currently correct; namespace rework is out of scope.
F8: correct the misleading SSRF/AllowList comment (no filter is applied).
F9: frontend auto-heal interval floor + test; dedup STALE_TIME; fix invalidation comment.
Also refresh three stale '24h/long-lived cache' comments to match the 5m TTL.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 19:51:15 +03:00
claude code agent
cdf17d904d fix(automation): rollback robustness — transient inspect, start_period, digest images, shutdown, event order (#12 review)
F1: tolerate up to 3 consecutive health-gate inspect failures (reset on
success) before declaring an update failed, so a transient Docker API blip no
longer triggers a false rollback.

F2: detect baseCtx cancellation during the gate and abort without rolling back
or emitting update-failed (debug log only), instead of a misleading
"rollback failed" event on every shutdown mid-gate.

F3: derive the gate deadline as start + max(RollbackTimeout, StartPeriod+buffer)
via effectiveRollbackDeadline, reading the container's healthcheck StartPeriod
so a legitimately slow-starting container is not rolled back while starting.

F4: only enable the gate when the original reference is a proper tag (new
isTagReference helper); skip with a log line for digest-pinned / bare-image-id
containers that cannot be re-tagged.

F5: document the sequential-tick delay limitation of the gate poll.

F6: emit EventUpdated only after the gate confirms healthy (or immediately when
no gate is active); the rollback path emits only EventRollback, so the event
sequence is truthful.

F7: floor RollbackTimeout at 10s in backend and frontend validation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 10:57:54 +03:00
claude code agent
32a2b7a9ae feat(automation): health-gated rollback + per-endpoint + notify hook (#12, epic #3 M5)
P0 Health-gated rollback (standalone auto-update path): capture the previous
image id + reference + healthcheck before the recreate, then poll the new
container's health over a configurable window. On healthy proceed (and only
then clean up the old image); on unhealthy/exit/timeout re-tag the old image
back onto the original reference and Recreate (no pull) to restore it, reusing
Recreate's config preservation. The decision is a pure decideRollback() helper.

P1 Per-endpoint enable: ContainerAutomationDisabled flag on Endpoint (zero value
participates, no migration churn), checked by both daemons; settable via the
endpoint update API. UI control deferred (see report).

P2 Notifier seam: minimal Notifier interface + logNotifier, emitting structured
updated/rollback/update-failed/heal-restarted events from the daemon.

Settings: RollbackOnFailure + RollbackTimeout (default 120s) added to
ContainerAutomation.AutoUpdate, wired through defaults/migration/golden,
settings_update validation, the AutoUpdatePanel and the TS types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 10:41:55 +03:00
claude code agent
21b5ec3e05 fix(automation): git-stack honesty + ECR registry refresh + interval floor (#11 review)
F1: Stop routing git-backed stacks through a per-tick RedeployWhenChanged for
image-only updates. The git redeploy path short-circuits when the commit is
unchanged (so an upstream-digest update never applies) yet still git-fetches
every tick. Git stacks are now detect-only in the auto-apply path; their image
update lands on the next git change or via manual "Update now". File (non-git)
stacks still force-pull-redeploy immediately. The AutoUpdatePanel text no longer
promises daemon auto-update for git/externally-managed containers.

F2: Resolve registries for the file-stack redeploy the same way the established
userless/system path (RedeployWhenChanged) does, via the new
deployments.ResolveStackRegistries, which scopes to the stack author's endpoint access
and runs RefreshAndPersistECRTokens, instead of hand-passing Registry().ReadAll().
ECR-backed stacks now auto-update with fresh tokens.

F3: Add a 1m floor for the auto-update poll interval, enforced in the settings
Validate and mirrored in the frontend validation.

F4: Thread the application shutdownCtx into NewService and use it as the base
for the heal/update job operation contexts, so shutdown cancels in-flight work.

F5: Correct the updateEndpoint comment about monitor-only badge-cache warming
(only in-scope monitor-only containers are status-checked).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 10:24:58 +03:00