Split the single container-automation webhook URL into two independently
optional URLs — UpdateWebhookURL (fired on update/rollback/update-failed) and
HealWebhookURL (fired on auto-heal restart). The notifier routes each event to
its mechanism's URL by kind; an empty URL silences only that mechanism, so a
user can enable notifications for updates without heal (or vice-versa).
Settings gain both fields (each validated http/https, {{message}} allowed), the
NotificationPanel exposes two labeled inputs, and the golden migration output is
updated. Delivery path (goroutine/recover/timeout, {{message}} GET vs POST,
per-container stack message format) is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an opt-in webhook notification for container-automation events (image
update, rollback, update-failed, auto-heal restart), plugging into the existing
Notifier seam in notify.go.
- Settings: new ContainerAutomation.Notification.WebhookURL (shared across
update + heal), persisted and validated in the settings update handler
(optional; http/https only; accepts the {{message}} placeholder).
- webhookNotifier reads the current URL from the datastore per event (UI changes
take effect without a restart). If the URL contains {{message}} it substitutes
the URL-encoded message and issues a GET; otherwise it POSTs the message as the
body. Delivery, the env/stack name lookups, and any panic run in a goroutine
under recover() with a 10s timeout — strictly best-effort, never blocks or
crashes the automation daemon. multiNotifier fans events to logNotifier +
webhook and isolates a panic in any one notifier.
- Message format (maintainer's spec):
Environment | <env>
Stack [<name>] (Container [<name>] for non-stack events)
Update [<name>]: <old> -> <new>
Auto-heal: 'Auto-heal: restarted unhealthy container'.
- New NotificationPanel in settings to configure the URL.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make the container image-status badge actionable, matching native Portainer:
- Clicking "Update available" opens the update confirm dialog and runs the
existing update flow (standalone recreate-with-pull / stack redeploy), gated
and disabled while in flight to avoid a double submit. The confirm+apply logic
is extracted from UpdateNowButton into a shared useApplyContainerImageUpdate
hook so the details button and the list badge share one implementation.
- Clicking "Up to date" re-queries the registry. Because the server caches image
status (statusCache 5m + remoteDigestCache 5s), a plain refetch was a no-op, so
the endpoint gains an optional ?force=true that bypasses BOTH caches for a
manual re-check while still repopulating them; the default (auto badges + the
auto-update daemon) keeps using the caches unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extract the manual stream-and-flush loop from dockerLocalProxy.ServeHTTP
into a behaviour-preserving package-private streamResponse(w, body) helper,
and add docker_test.go regression tests for the riskiest path (it runs on
every Docker API response):
- DeliversFullBodyAndFlushesPerChunk: a >32KB body delivered as several
chunks (boundaries not aligned to the 32KB buffer), with the final Read
returning (n>0, io.EOF) simultaneously, asserts the streamed body equals
the input exactly (no loss/duplication) and that Flush ran more than once
(the per-chunk flush is the whole point of the change).
- StopsOnWriteErrorWithoutPanic: a writer that errors on first Write (and
does not implement http.Flusher, exercising the nil-flusher fallback)
breaks the loop after one write without panicking.
No production behaviour change — the loop body is identical, only moved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backend (the "logs arrive every ~5s / pipe clogged" bug):
- dockerLocalProxy.ServeHTTP streamed the docker socket response via
io.Copy, which buffers ~2KB into the ResponseWriter and only flushes
when full or on handler return. Low-throughput streaming endpoints
(container logs follow=1, events, stats, attach) therefore arrived in
multi-second batches. Stream manually and Flush() after each chunk so
they are delivered live. Behaviour is otherwise identical to io.Copy
(full-write contract, EOF handling, Debug error logging); hijacked
attach/exec go through a separate websocket handler, unaffected.
- NewSingleHostReverseProxyWithHostHeader: set FlushInterval = -1 so the
remote-endpoint path streams live too.
Frontend (maintainer UI asks):
- Remove the line-selection mechanic entirely (Copy-selected-lines and
Unselect buttons, selectLine/copySelection/clearSelection, selectedLines
state, line_selected highlight): selecting/copying is mouse-native. Copy
(all visible) and Download stay.
- Rename the unclear "Fetch" since-selector label to "Since".
- Move the settings controls into the widget header (rd-widget-header
default transclude slot) so they share one row with the "Log viewer
settings" title, reclaiming vertical space for the log pane.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F1: cap the image-status cache TTL at 5m (was 24h) — the cache is keyed by the
LOCAL imageID, which doesn't change when upstream pushes a new image under the
same tag, so the 24h TTL hid new images from both the badge and the auto-update
daemon; a short TTL re-resolves the remote digest within the poll window.
F2: document that the update->rollback guard map is in-memory (restart implication).
F3: skip auto-update for an unnamed container when rollback is on (the endpoint+name
keyed guard can't record it, so it would loop) — pure skipUnnamedForRollback + test.
F4: wrap the pre-update ContainerInspect in context.WithTimeout(endpointTimeout).
F5: document Reload() does not interrupt an in-flight tick.
F6: floor auto-heal CheckInterval at 1s (mirrors auto-update) + test.
F7: wontfix — migration is currently correct; namespace rework is out of scope.
F8: correct the misleading SSRF/AllowList comment (no filter is applied).
F9: front auto-heal interval floor + test; dedup STALE_TIME; fix invalidation comment.
Also refresh three stale '24h/long-lived cache' comments to match the 5m TTL.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F1: record rolled-back targets per service (endpointID/containerName + remote
digest) and skip auto-update during a 24h cooldown unless the remote digest
changes — breaks the infinite update→rollback loop on a persistently
unhealthy image, without blocking a genuinely new image.
F2: unit-test applyContainerUpdate dispatch/payload mapping.
F3: settings_update.go comments mention auto-heal AND auto-update.
F4: drop stale '(future M4)' TS docs; primitives are frontend-only.
F5: replace the anonymous ContainerAutomation settings struct with named
types (identical JSON tags).
F6: drop parseEnable (duplicate of boolLabel).
F7: remove the unused gitService dependency.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F1: tolerate up to 3 consecutive health-gate inspect failures (reset on
success) before declaring an update failed, so a transient Docker API blip no
longer triggers a false rollback.
F2: detect baseCtx cancellation during the gate and abort without rolling back
or emitting update-failed (debug log only), instead of a misleading
"rollback failed" event on every shutdown mid-gate.
F3: derive the gate deadline as start + max(RollbackTimeout, StartPeriod+buffer)
via effectiveRollbackDeadline, reading the container's healthcheck StartPeriod
so a legitimately slow-starting container is not rolled back while starting.
F4: only enable the gate when the original reference is a proper tag (new
isTagReference helper); skip with a log line for digest-pinned / bare-image-id
containers that cannot be re-tagged.
F5: document the sequential-tick delay limitation of the gate poll.
F6: emit EventUpdated only after the gate confirms healthy (or immediately when
no gate is active); the rollback path emits only EventRollback, so the event
sequence is truthful.
F7: floor RollbackTimeout at 10s in backend and frontend validation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
P0 Health-gated rollback (standalone auto-update path): capture the previous
image id + reference + healthcheck before the recreate, then poll the new
container's health over a configurable window. On healthy proceed (and only
then clean up the old image); on unhealthy/exit/timeout re-tag the old image
back onto the original reference and Recreate (no pull) to restore it, reusing
Recreate's config preservation. The decision is a pure decideRollback() helper.
P1 Per-endpoint enable: ContainerAutomationDisabled flag on Endpoint (zero value
participates, no migration churn), checked by both daemons; settable via the
endpoint update API. UI control deferred (see report).
P2 Notifier seam: minimal Notifier interface + logNotifier, emitting structured
updated/rollback/update-failed/heal-restarted events from the daemon.
Settings: RollbackOnFailure + RollbackTimeout (default 120s) added to
ContainerAutomation.AutoUpdate, wired through defaults/migration/golden,
settings_update validation, the AutoUpdatePanel and the TS types.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F1: Stop routing git-backed stacks through a per-tick RedeployWhenChanged for
image-only updates. The git redeploy path short-circuits when the commit is
unchanged (so an upstream-digest update never applies) yet still git-fetches
every tick. Git stacks are now detect-only in the auto-apply path; their image
update lands on the next git change or via manual "Update now". File (non-git)
stacks still force-pull-redeploy immediately. The AutoUpdatePanel text no longer
promises daemon auto-update for git/externally-managed containers.
F2: Resolve registries for the file-stack redeploy the same way the established
userless/system path (RedeployWhenChanged) does, via the new
deployments.ResolveStackRegistries: scope to the stack author's endpoint access
and RefreshAndPersistECRTokens, instead of hand-passing Registry().ReadAll().
ECR-backed stacks now auto-update with fresh tokens.
F3: Add a 1m floor for the auto-update poll interval, enforced in the settings
Validate and mirrored in the frontend validation.
F4: Thread the application shutdownCtx into NewService and use it as the base
for the heal/update job operation contexts, so shutdown cancels in-flight work.
F5: Correct the updateEndpoint comment about monitor-only badge-cache warming
(only in-scope monitor-only containers are status-checked).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an optional periodic auto-update daemon that detects outdated container
images and applies updates, replacing the containrrr/watchtower sidecar. It
extends M1's containerautomation service/scheduler/labels infrastructure and
reuses the existing zlib image-detection engine, the standalone Recreate path
and the stack deployer.
Backend:
- api/containerautomation/autoupdate.go: scheduler job iterating Docker
(non-edge) endpoints -> in-scope running containers -> ContainerImageStatus;
for Outdated: standalone -> ContainerService.Recreate(pull); stack-managed ->
one stack redeploy-with-pull per stack per tick (git via RedeployWhenChanged,
file via the deployer directly); external compose -> detect only. Monitor-only
containers are status-checked (warms the badge cache) but never applied.
Overlap guard (atomic), pull/registry-auth failure -> leave running container
untouched, conservative cleanup of the dangling old image on the Cleanup flag
(non-forced ImageRemove only succeeds when truly unused).
- labels.go: update enable / monitor-only labels with watchtower aliases,
InUpdateScope, IsMonitorOnly, and pure resolveContainerUpdateRouting /
groupContainersForUpdate (Go analogue of M3's TS routing + grouping).
- service.go: run both jobs, Reload restarts/stops each per settings; NewService
also takes ContainerService, StackDeployer and GitService.
- Settings.ContainerAutomation.AutoUpdate {Enabled, PollInterval, Scope,
Cleanup} with fresh-install defaults and a 2.43.0 backfill (extends M1's
migration; golden test data updated). settings handler validates + reloads.
Frontend:
- Global AutoUpdatePanel in SettingsView (enable / poll interval / scope /
cleanup) via useUpdateSettingsMutation, plus settings TS types.
- Read-only per-container Auto-update row in the container details view
(Docker labels are immutable at runtime), surfacing monitor-only.
Tests: Go unit tests for the update label aliases, scope, monitor-only, the
routing decision and the one-redeploy-per-stack grouping; vitest for the panel
and the per-container row.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
F1: ContainerImageStatus now reads the 24h statusCache (keyed by imageID)
before the remote registry digest lookup, so the cache is effective on the
input side for all callers instead of being write-only. This avoids the
rate-limited registry HEAD on repeat loads.
F2: add nodeName to the imageStatus query key so cached results cannot be
reused across nodes.
F3: correct the swagger annotations to reflect that engine-level issues
degrade to a 200 skipped/error status rather than 400/404.
F4: return a generic error message to the client instead of the raw
registry/engine error; the raw error is still logged server-side.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add native CE detection of "a newer image is available" for running
containers, surfaced as a read-only HTTP endpoint and a containers-list
badge/column. No applying of updates (M3/M4), no auto-heal (M1).
Backend:
- New CE handler GET /docker/{id}/containers/{containerId}/image_status
backed by the existing zlib/CE digest engine
(images.NewClientWithRegistry + ContainerImageStatus). Honors nodeName,
authz, and routes registry calls through the credential store / SSRF
AllowList. Engine failures degrade to a 200 {Status:"error"} so the UI
stays graceful. Response shape: {Status, Message?}.
Frontend (CE-only, no isBE gating; the EE ImageStatus component is left
untouched):
- useContainerImageStatus TanStack Query hook (5min staleTime, no
refetch-on-focus; backend caches 24h) calling the non-proxied endpoint.
- UpdateStatusBadge component (own assets, neutral on skipped/error).
- "Update available" column in the containers datatable; one cached,
non-blocking query per visible row.
Tests: Go response-shape unit test; vitest for the badge (all statuses)
and the hook (url + nodeName query param via msw).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a native, CE-only auto-heal daemon that restarts Docker containers whose
healthcheck reports "unhealthy", replacing the willfarrell/autoheal sidecar.
Backend:
- New package api/containerautomation (service lifecycle + scheduler job,
per-endpoint heal pass, label/scope parsing, in-memory cooldown/retry state).
- Settings.ContainerAutomation.AutoHeal {Enabled, CheckInterval, Scope} with
fresh-install defaults and a 2.43.0 migration backfilling existing installs.
- Settings update handler reloads/stops the job via a small Reloader interface
(no import cycle); service bootstrapped from main.go after stack schedules.
Frontend:
- Global AutoHealPanel in SettingsView (enable / interval / scope) via
useUpdateSettingsMutation, plus settings TS types.
- Read-only per-container Auto-heal row in the container details view (Docker
labels are immutable at runtime; opt-in is set via Create/Edit form labels).
Tests: Go unit tests for label/scope resolution and the cooldown/retry decision;
vitest for the panel and the per-container row.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>