Status: investigating
Priority: high
Created: 2026-03-07
Updated: 2026-03-07
Assigned to: unassigned
Planned node reboot on baobab spent ~106 seconds in shutdown because Proxmox NFS storages were still mounted after Thunderbolt transport had already been detached from thunderbridge.
During a controlled reboot validation on baobab, guest suspend worked correctly, but the host remained reachable over ICMP for almost two minutes after systemctl reboot. Journal analysis showed that the Thunderbolt bridge ports were detached early in shutdown, while Proxmox only attempted to unmount NFS storages later. Because AutoNAS-1 and AutoNAS-2 are mounted over 192.168.10.x through thunderbridge, the NFS unmount path lost transport and waited for timeout.
The same investigation exposed a second maintenance risk in pgs: preflight cleanup could block in kernel I/O wait when it touched remote NFS-backed storages that were stale or temporarily unavailable. That does not create the slow reboot itself, but it can block the maintenance preparation step.
Follow-up validation on ebony showed a different but related cluster behavior: AutoNAS-1 is currently exported by ebony itself. During reboot, autonas.service stops early, which leaves the node's own Proxmox NFS client mount for AutoNAS-1 stale; that mount then waits for its timeout during unmount. In the same window, VM 301 is-anjohibe (PBS anjothibe) is intentionally suspended by pgs, so PBS availability loss is expected during the maintenance window.
Validation on tapia initially showed the same class of topology problem for AutoNAS-2, which is locally exported there and mounted back as a Proxmox NFS storage. The first AutoNAS shutdown-ordering patch remained active, but reboot timing still stayed near the pre-fix range because mnt-pve-AutoNAS-2.mount waited for timeout during shutdown while PBS andrafiabe-AutoNAS had already become unreachable.
Follow-up work in the autoNAS project added an explicit nfs-server.service drop-in for self-hosted Proxmox NFS mounts discovered from storage.cfg. After that second patch, tapia reboot timing dropped into the same range as ebony, confirming that the remaining blocker was provider ordering on nfs-server.service, not on autonas.service.
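The discovery step described above can be sketched as follows. This is a minimal illustration, not the autoNAS implementation: the function names, the sample export path, and the sample storage.cfg text are hypothetical. Mount unit names are passed in by the caller rather than derived, on the assumption that the exact names would be obtained from systemd (for example with systemd-escape).

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    kind: str                      # storage type, e.g. "nfs" or "dir"
    name: str                      # storage ID, e.g. "AutoNAS-2"
    options: dict = field(default_factory=dict)


def parse_storage_cfg(text):
    """Parse storage.cfg-style text: 'type: id' headers followed by indented 'key value' lines."""
    storages, current = [], None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        if raw[0] in " \t":
            if current is not None:
                key, _, value = raw.strip().partition(" ")
                current.options[key] = value.strip()
        else:
            kind, _, name = raw.partition(":")
            current = Storage(kind.strip(), name.strip())
            storages.append(current)
    return storages


def self_hosted_nfs(storages, local_addresses):
    """Return NFS storages whose server is one of this node's own addresses."""
    return [s for s in storages
            if s.kind == "nfs" and s.options.get("server") in local_addresses]


def render_dropin(mount_units):
    """Render a drop-in that orders the provider before the given mount units.

    With Before=, systemd stops the listed mount units before it stops the
    unit carrying the drop-in, so the export outlives its local client mount.
    """
    return "[Unit]\nBefore=" + " ".join(mount_units) + "\n"


# Hypothetical sample; the export path is made up for illustration.
SAMPLE = """\
dir: local
\tpath /var/lib/vz

nfs: AutoNAS-1
\tserver 192.168.10.21
\tpath /mnt/pve/AutoNAS-1
\texport /srv/example

nfs: AutoNAS-2
\tserver 192.168.10.22
\tpath /mnt/pve/AutoNAS-2
\texport /srv/example
"""

hosted = self_hosted_nfs(parse_storage_cfg(SAMPLE), {"192.168.10.22"})
print([s.name for s in hosted])
print(render_dropin(["mnt-pve-AutoNAS-2.mount"]))
```

Matching on the server address keeps the drop-in limited to exports that the node consumes from itself; storages served by other nodes are left to the normal remote filesystem ordering.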
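Affected nodes: baobab confirmed, likely all nodes using Proxmox NFS storages over thunderbridge
Kernel and components: 6.17.13-1-pve, tb-enlist@.service, pgs
Steps to reproduce:
- On a node that mounts Proxmox NFS storages over thunderbridge, run /usr/local/sbin/pgs suspend -v.
- Run systemctl reboot.
- Inspect journalctl -b -1 around the reboot window.
Expected behavior:
- pgs suspend should not hang because a remote NFS mount is stale.
baobab: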
TIME_TO_STOP_SECONDS 105.852
TIME_TO_FIRST_REPLY_SECONDS 130.230
DOWNTIME_SECONDS 24.377
ebony:
TIME_TO_STOP_SECONDS 120.275
TIME_TO_FIRST_REPLY_SECONDS 145.840
DOWNTIME_SECONDS 25.565
tapia after cluster-wide AutoNAS rollout:
TIME_TO_STOP_SECONDS 123.285
TIME_TO_FIRST_REPLY_SECONDS 149.420
DOWNTIME_SECONDS 26.135
tapia after explicit nfs-server.service self-hosted ordering fix:
TIME_TO_STOP_SECONDS 28.305
TIME_TO_FIRST_REPLY_SECONDS 53.588
DOWNTIME_SECONDS 25.283
journalctl -b -1 on baobab showed:
- Bridge port thunderbolt0 was detached from thunderbridge at 08:48:17.989
- Unmounting of mnt-pve-AutoNAS-1.mount and mnt-pve-AutoNAS-2.mount started at 08:48:30.540
- mnt-pve-AutoNAS-1.mount and mnt-pve-AutoNAS-2.mount timed out at 08:50:00.604/0.605
journalctl -b -1 on ebony showed:
- autonas.service stopped at 11:04:22.326
- mnt-pve-AutoNAS-2.mount unmounted successfully by 11:04:38.693
- mnt-pve-AutoNAS-1.mount timed out at 11:06:08.679
- network.target stop and tb-enlist@thunderbolt0.service detach from thunderbridge
- pgs suspend blocked in nfs4_proc_getattr while scanning storage paths.
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:17.993120 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt1 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.
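The unmount wait can be read straight off the journal lines above. The following is a small sketch of that arithmetic; the helper name and regular expressions are illustrative, and the year is an assumption taken from the issue date because short journal timestamps omit it.

```python
import re
from datetime import datetime

ASSUMED_YEAR = 2026  # assumption: short journal timestamps omit the year; taken from the issue date

LINE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \S+ systemd\[\d+\]: (.*)$")
START = re.compile(r"^Unmounting (\S+\.mount) - ")
TIMED_OUT = re.compile(r"^(\S+\.mount): Unmounting timed out")


def unmount_waits(lines):
    """Map each mount unit to the seconds between 'Unmounting ...' and its timeout message."""
    started, waits = {}, {}
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue
        stamp = datetime.strptime(f"{ASSUMED_YEAR} {m.group(1)}", "%Y %b %d %H:%M:%S.%f")
        message = m.group(2)
        begun = START.match(message)
        failed = TIMED_OUT.match(message)
        if begun:
            started[begun.group(1)] = stamp
        elif failed and failed.group(1) in started:
            waits[failed.group(1)] = (stamp - started[failed.group(1)]).total_seconds()
    return waits


# The four systemd lines from the baobab excerpt above.
JOURNAL = [
    "Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...",
    "Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2...",
    "Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.",
    "Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.",
]

for unit, seconds in unmount_waits(JOURNAL).items():
    print(f"{unit}: waited {seconds:.3f} s before timing out")
```

Both mounts wait just over 90 seconds, which accounts for most of the roughly 106 seconds that baobab spent in shutdown.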
Blocked pgs stack during stale-NFS preflight:
[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
[<0>] __do_sys_newfstatat+0x43/0x90
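The stack above ends in a stat call on an NFS path, so one way to keep preflight from blocking is to never touch remote storage paths at all. The sketch below shows that filter under stated assumptions: it is not the pgs code, the function name is hypothetical, and the sample storage.cfg text (including the export path) is made up for illustration.

```python
def local_dir_paths(storage_cfg_text):
    """Return {storage_id: path} for 'dir' storages only.

    Entries of any other type (nfs, pbs, ...) are skipped on purpose, so the
    caller never stats a path that may sit on a stale remote mount.
    """
    paths = {}
    kind = name = None
    for raw in storage_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not raw[0].isspace():
            # Section header of the form 'type: id'.
            kind, _, name = line.partition(":")
            kind, name = kind.strip(), name.strip()
            continue
        parts = line.split(None, 1)
        if kind == "dir" and len(parts) == 2 and parts[0] == "path":
            paths[name] = parts[1]
    return paths


# Hypothetical sample configuration.
SAMPLE_CFG = """\
dir: local
\tpath /var/lib/vz
\tcontent iso,backup

nfs: AutoNAS-1
\tserver 192.168.10.21
\tpath /mnt/pve/AutoNAS-1
\texport /srv/example
"""

print(local_dir_paths(SAMPLE_CFG))
```

Filtering by storage type rather than probing each path avoids the problem entirely: a probe with a timeout would still leave a process stuck in kernel I/O wait on the stale mount.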
Validated timing after fixes on baobab:
TIME_TO_STOP_SECONDS 14.599
TIME_TO_FIRST_REPLY_SECONDS 35.651
DOWNTIME_SECONDS 21.053
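The reported figures appear consistent with DOWNTIME_SECONDS being TIME_TO_FIRST_REPLY_SECONDS minus TIME_TO_STOP_SECONDS. That relationship is inferred from the numbers, not stated by the measurement tool. A short sketch, using only values reported in this issue, checks it and derives how much shutdown time each fix removed; for tapia, "before" means after the first AutoNAS patch but before the nfs-server.service ordering fix.

```python
# (TIME_TO_STOP_SECONDS, TIME_TO_FIRST_REPLY_SECONDS, DOWNTIME_SECONDS) as reported in this issue.
RUNS = {
    "baobab": {"before": (105.852, 130.230, 24.377), "after": (14.599, 35.651, 21.053)},
    "tapia": {"before": (123.285, 149.420, 26.135), "after": (28.305, 53.588, 25.283)},
}


def derived_downtime(stop, first_reply):
    """Downtime as the gap between the host going silent and its first reply."""
    return round(first_reply - stop, 3)


def shutdown_saving(node):
    """Seconds of shutdown time removed between the 'before' and 'after' runs."""
    return round(RUNS[node]["before"][0] - RUNS[node]["after"][0], 3)


for node, phases in RUNS.items():
    for phase, (stop, first_reply, reported) in phases.items():
        derived = derived_downtime(stop, first_reply)
        print(f"{node} {phase}: derived {derived:.3f} s, reported {reported:.3f} s")
    print(f"{node}: shutdown shortened by {shutdown_saving(node):.3f} s")
```

The derived values differ from the reported ones by at most one millisecond, which is consistent with rounding of the underlying measurements. The shutdown savings land slightly above 90 seconds on both nodes, in line with a single unmount timeout being removed from the shutdown path.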
- AutoNAS-1 and AutoNAS-2 on baobab are Proxmox NFS storages mounted from 192.168.10.21 and 192.168.10.22 over thunderbridge.
- baobab showed shutdown delay dominated by NFS unmount timeout, not by boot.
- tb-enlist@.service had no ordering against network.target; systemd stopped Thunderbolt bridge membership before Proxmox unmounted remote storages.
- tb-enlist@.service now declares Before=network.target; deployed to baobab, then cluster-wide.
- pgs suspend can block in nfs4_proc_getattr while scanning storage paths on stale remote NFS mounts.
- pgs cleanup now scans only local dir storages; remote storages such as NFS are skipped intentionally.
baobab after both fixes:
- 10:48:12.354/10:48:12.356
- 10:48:12.460
- network.target stopped later at 10:48:16.152
- pgs resume completed successfully after reboot on baobab; state file survived boot and all 4 VMs + 1 CT were restored.
- Follow-up validation on ebony with current pgs and cluster-wide thunderbolts rollout: pgs suspend / resume succeeded for VMs 101, 102, 301; state file survived reboot and restore completed.
- ebony still showed long shutdown because AutoNAS-1 is currently provided by ebony itself through autonas. Stopping autonas.service made the node's own NFS client mount stale and mnt-pve-AutoNAS-1.mount waited for timeout.
- On ebony, PBS anjothibe availability loss during maintenance is expected because VM 301 is-anjohibe is intentionally suspended by pgs, and its datastore dependency is also on AutoNAS-1.
- ebony: autonas.service and autonas-boot-scan.service now declare Before=remote-fs.target and Before=umount.target.
ebony after AutoNAS patch:
- Before: TIME_TO_STOP_SECONDS 120.275, TIME_TO_FIRST_REPLY_SECONDS 145.840
- After: TIME_TO_STOP_SECONDS 27.573, TIME_TO_FIRST_REPLY_SECONDS 53.288
- mnt-pve-AutoNAS-2.mount still unmounted cleanly
- AutoNAS-1 no longer waited for the old 90s timeout, though a brief Stale file handle was still observed before the provider side stopped
- ebony: even with later provider shutdown, pvestatd briefly logged storage 'AutoNAS-1' is not online / Stale file handle during the maintenance window, so the self-hosted NFS topology remains fragile but no longer dominates shutdown time.
- Validation on tapia.
- pgs suspend / reboot / pgs resume succeeded on tapia for VMs 104, 107, 113, 302; state file survived reboot and all four guests were restored.
tapia still showed slow shutdown after the AutoNAS patch:
- TIME_TO_STOP_SECONDS 123.285, TIME_TO_FIRST_REPLY_SECONDS 149.420
- mnt-pve-AutoNAS-1.mount unmounted immediately at 11:45:01.827
- autonas.service and nfs-server.service stopped around 11:45:01.689/11:45:01.900
- mnt-pve-AutoNAS-2.mount then waited until timeout at 11:46:31.778
- network.target stopped only after that, at 11:46:31.781
- On tapia, the remaining delay is concentrated on self-hosted AutoNAS-2 (server 192.168.10.22) plus expected maintenance-window loss of PBS andrafiabe-AutoNAS (192.168.10.96).
- The autoNAS project now generates /etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf from storage.cfg, adding Before= ordering from nfs-server.service to the matching self-hosted Proxmox mount units.
tapia after the nfs-server.service ordering fix:
- Before: TIME_TO_STOP_SECONDS 123.285, TIME_TO_FIRST_REPLY_SECONDS 149.420
- After: TIME_TO_STOP_SECONDS 28.305, TIME_TO_FIRST_REPLY_SECONDS 53.588
- nfs-server.service stopped at 12:07:42.157, network.target stopped later at 12:07:47.230
- mnt-pve-AutoNAS-2.mount no longer dominated shutdown
- pgs suspend / reboot / pgs resume completed successfully for VMs 104, 107, 113, 302
- Keep the tb-enlist@.service ordering against network.target so storage traffic over thunderbridge remains alive until remote filesystems are unmounted.
- Keep the pgs cleanup path limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
- On ebony, exclude AutoNAS-1 from local use or replace that local dependency with a direct/local storage path.
- autonas and PBS on ebony).
- Keep the nfs-server.service self-hosted ordering drop-in as the cluster fix for nodes that export AutoNAS locally and also consume the same export back through Proxmox NFS.
- nfs-server.service ordering fix is deployed.
List CHANGELOG.md entries that reference this issue:
- projects/thunderbolts/CHANGELOG.md: [Unreleased] - tb-enlist@.service now stays active until network.target stops... [ISSUE-2026-002]
- projects/pve-guests-state/CHANGELOG.md: [1.5] - Suspend-artifact cleanup now scans only local dir storages... [ISSUE-2026-002]