Status: resolved
Priority: high
Created: 2026-03-07
Updated: 2026-03-07
Assigned to: unassigned
Planned node reboot could spend 90 to 120 seconds in shutdown because shared Proxmox NFS storages were not consistently ordered ahead of transport or provider teardown.
This incident had two independent cluster-level contributors that happened to surface in the same maintenance workflow.
The first was transport-related on baobab: AutoNAS-1 and AutoNAS-2 are mounted over 192.168.10.x through thunderbridge, but Thunderbolt bridge membership was being torn down before Proxmox attempted to unmount those remote NFS storages.
The second was provider-related on ebony and tapia: local AutoNAS exports were mounted back on the same node as Proxmox NFS storages. In that self-hosted topology, shutdown became sensitive to the ordering between umount.nfs4 and nfs-server.service.
The same investigation also exposed a separate maintenance-preflight issue in pgs: cleanup could block in kernel I/O wait when it touched stale remote NFS-backed storages.
The final fix therefore spans cluster maintenance, thunderbolts, autoNAS, and pve-guests-state, and should be tracked as a cluster issue rather than a project-local one.
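The provider-side ordering can be illustrated with a small generator for an nfs-server.service drop-in. This is a minimal sketch, not the autoNAS implementation: it assumes the usual Proxmox storage.cfg layout (an unindented `type: name` header followed by indented `key value` lines), and the function names, the sample addresses, and the `local_addresses` parameter are all hypothetical. Because systemd stops units in the reverse of their start order, `Before=<mount unit>` on nfs-server.service means the mount is unmounted before the NFS server stops. systemd escapes a literal hyphen in a mount path as `\x2d`, so the generated unit names carry that escape even though this issue writes the names unescaped.

```python
def parse_storage_cfg(text):
    """Parse storage.cfg-style text into a list of dicts (assumed layout)."""
    storages = []
    current = None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line[0].isspace():
            # Unindented header such as "nfs: AutoNAS-1"
            stype, _, name = line.partition(":")
            current = {"type": stype.strip(), "name": name.strip()}
            storages.append(current)
        elif current is not None:
            # Indented "key value" property of the current storage
            key, _, value = line.strip().partition(" ")
            current[key] = value.strip()
    return storages


def systemd_escape_path(path):
    """Approximate systemd path escaping: '/' becomes '-', unsafe bytes are hex-escaped."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-"
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            out.extend("\\x%02x" % b for b in ch.encode())
    return "".join(out)


def render_dropin(storages, local_addresses):
    """Return drop-in text ordering nfs-server.service before self-hosted NFS mounts."""
    units = sorted(
        systemd_escape_path(s.get("path", "/mnt/pve/" + s["name"])) + ".mount"
        for s in storages
        if s["type"] == "nfs" and s.get("server") in local_addresses
    )
    if not units:
        return ""
    return "[Unit]\nBefore=" + " ".join(units) + "\n"


# Hypothetical sample: addresses and export paths are placeholders, not real values.
SAMPLE = """\
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

nfs: AutoNAS-1
        path /mnt/pve/AutoNAS-1
        server 192.0.2.11
        export /srv/export-1

nfs: AutoNAS-2
        path /mnt/pve/AutoNAS-2
        server 192.0.2.12
        export /srv/export-2
"""

# On a node whose own address is 192.0.2.12, only AutoNAS-2 is self-hosted.
print(render_dropin(parse_storage_cfg(SAMPLE), {"192.0.2.12"}))
```

Passing the local addresses in explicitly keeps the sketch free of network lookups; a real generator would need its own way to decide which NFS servers are the node itself.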
Affected nodes: baobab, ebony, tapia
Kernel and components: 6.17.13-1-pve, tb-enlist@.service, autoNAS, pgs
Steps to reproduce:
- Run /usr/local/sbin/pgs suspend -v.
- Run systemctl reboot.
- Inspect journalctl -b -1 around the reboot window.
Expected: pgs suspend should not block because a remote NFS mount is stale.
Reboot timing before the fixes:
baobab:
TIME_TO_STOP_SECONDS 105.852
TIME_TO_FIRST_REPLY_SECONDS 130.230
DOWNTIME_SECONDS 24.377
ebony before self-hosted fix:
TIME_TO_STOP_SECONDS 120.275
TIME_TO_FIRST_REPLY_SECONDS 145.840
DOWNTIME_SECONDS 25.565
tapia before provider-ordering fix:
TIME_TO_STOP_SECONDS 123.285
TIME_TO_FIRST_REPLY_SECONDS 149.420
DOWNTIME_SECONDS 26.135
Reboot timing after the fixes:
- baobab: TIME_TO_STOP_SECONDS 14.599, TIME_TO_FIRST_REPLY_SECONDS 35.651
- ebony: TIME_TO_STOP_SECONDS 27.573, TIME_TO_FIRST_REPLY_SECONDS 53.288
- tapia: TIME_TO_STOP_SECONDS 28.305, TIME_TO_FIRST_REPLY_SECONDS 53.588
- tapia validation: TIME_TO_STOP_SECONDS 28.990, TIME_TO_FIRST_REPLY_SECONDS 53.384
Transport ordering failure on baobab:
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
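The gaps between these three journal lines can be checked with a little arithmetic. The sketch below only parses the timestamps quoted above. The year is taken from the issue's Created date, and it assumes the AutoNAS-2 unmount began at roughly the same moment as the AutoNAS-1 unmount, since only the AutoNAS-1 start line is quoted.

```python
from datetime import datetime

FMT = "%Y %b %d %H:%M:%S.%f"


def ts(stamp, year=2026):
    """Parse a journal timestamp; the year is an assumption from the Created field."""
    return datetime.strptime(f"{year} {stamp}", FMT)


detached = ts("Mar 07 08:48:17.989246")           # thunderbolt0 left thunderbridge
unmount_started = ts("Mar 07 08:48:30.540186")    # AutoNAS-1 unmount began
unmount_timed_out = ts("Mar 07 08:50:00.604036")  # AutoNAS-2 unmount timed out

gap_before_unmount = (unmount_started - detached).total_seconds()
unmount_wait = (unmount_timed_out - unmount_started).total_seconds()

print(f"transport gone before unmount: {gap_before_unmount:.3f}s")  # 12.551s
print(f"unmount wait until timeout:    {unmount_wait:.3f}s")        # 90.064s
```

The transport was already gone about 12.6 seconds before the unmount started, and the wait of roughly 90 seconds is consistent with systemd's default 90-second unit timeout.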
Preflight stale-NFS block in pgs:
[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
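The trace shows a task waiting on an NFS RPC during inode revalidation, which is what happens when cleanup touches a path on a stale NFS mount. The fix recorded in this issue limits pgs cleanup to local dir storages. The selection step might look like the following sketch. It is not the pgs code: the function name is hypothetical, the sample entries are placeholders, and it assumes the usual Proxmox storage.cfg layout of an unindented `type: name` header followed by indented `key value` lines.

```python
def local_dir_paths(storage_cfg_text):
    """Return the paths of `dir` storages only; nfs and other types are never touched."""
    paths = []
    current_type = None
    for line in storage_cfg_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not line[0].isspace():
            # Header line such as "dir: local" or "nfs: AutoNAS-1"
            current_type = stripped.partition(":")[0].strip()
        elif current_type == "dir":
            key, _, value = stripped.partition(" ")
            if key == "path":
                paths.append(value.strip())
    return paths


# Hypothetical sample: the server address and export path are placeholders.
SAMPLE = """\
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

nfs: AutoNAS-1
        path /mnt/pve/AutoNAS-1
        server 192.0.2.11
        export /srv/export-1
"""

print(local_dir_paths(SAMPLE))  # ['/var/lib/vz']
```

Filtering on the configured storage type, rather than probing each path, matters here because even a stat on a stale NFS path can block in the kernel.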
Provider-ordering fix validated on tapia:
TIME_TO_STOP_SECONDS 28.990
TIME_TO_FIRST_REPLY_SECONDS 53.384
DOWNTIME_SECONDS 24.394
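The three metrics appear to be related by DOWNTIME_SECONDS = TIME_TO_FIRST_REPLY_SECONDS - TIME_TO_STOP_SECONDS. That relationship is inferred from the numbers, not stated by the measurement tool. The sketch below recomputes it from the figures recorded in this issue and also shows how much shutdown time each node recovered. The dictionary layout and function name are illustrative only.

```python
# (time_to_stop, time_to_first_reply, reported_downtime), copied from this issue
BEFORE = {
    "baobab": (105.852, 130.230, 24.377),
    "ebony": (120.275, 145.840, 25.565),
    "tapia": (123.285, 149.420, 26.135),
}
# (time_to_stop, time_to_first_reply); tapia uses the validation run
AFTER = {
    "baobab": (14.599, 35.651),
    "ebony": (27.573, 53.288),
    "tapia": (28.990, 53.384),
}


def downtime(stop, first_reply):
    """Seconds between ICMP loss and first ICMP reply."""
    return round(first_reply - stop, 3)


for node, (stop, reply, reported) in BEFORE.items():
    saved = round(stop - AFTER[node][0], 3)
    print(node, "computed downtime:", downtime(stop, reply),
          "reported:", reported, "shutdown seconds saved:", saved)

print("tapia validation downtime:", downtime(*AFTER["tapia"]))  # 24.394
```

The computed downtimes match the reported values to within 0.001 s, and each node recovered a little over 90 seconds of shutdown time, which is about one default systemd timeout.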
Findings and fixes:
- baobab delay was dominated by NFS unmount timeout after Thunderbolt transport disappeared too early.
- Updated tb-enlist@.service with Before=network.target; reboot timing on baobab dropped from ~106s to ~15s.
- pgs preflight could block on stale remote NFS during storage cleanup.
- Changed pgs cleanup to scan only local dir storages; remote NFS is skipped intentionally.
- ebony delay was self-hosted AutoNAS-1: the node exported local storage and mounted it back as Proxmox NFS.
- autonas.service and autonas-boot-scan.service are now ordered before remote-fs.target and umount.target; ebony improved to ~28s shutdown-to-ICMP-loss.
- tapia still showed ~123s shutdown with that first AutoNAS patch because nfs-server.service still stopped too early for self-hosted AutoNAS-2.
- Generated /etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf from storage.cfg, adding explicit Before= ordering from nfs-server.service to matching self-hosted Proxmox mount units.
- Retested tapia twice after the nfs-server.service ordering fix; both tests converged around 29s to ICMP loss and 53s to first ICMP reply.
Rules to keep in place:
- Keep tb-enlist@.service ordered before network.target so transport-backed NFS over thunderbridge stays alive until remote filesystems unmount.
- Keep pgs cleanup limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
- Keep explicit nfs-server.service ordering against the matching Proxmox mnt-pve-*.mount units discovered from storage.cfg.
CHANGELOG.md entries that reference this issue:
- projects/thunderbolts/CHANGELOG.md: tb-enlist@.service now stays active until network.target stops... [ISSUE-2026-002]
- projects/autoNAS/CHANGELOG.md: self-hosted AutoNAS shutdown now adds explicit nfs-server.service ordering... [ISSUE-2026-002]
- projects/pve-guests-state/CHANGELOG.md: Suspend-artifact cleanup now scans only local dir storages... [ISSUE-2026-002]