Issue ISSUE-2026-002: Planned reboot stalls on shared NFS storages during maintenance shutdown

Issue ID: ISSUE-2026-002

Status: resolved
Priority: high
Created: 2026-03-07
Updated: 2026-03-07
Assigned to: unassigned


Summary

A planned node reboot could spend 90 to 120 seconds in shutdown because shared Proxmox NFS storages were not reliably ordered to unmount before their transport or provider was torn down.


Description

This incident had two independent cluster-level contributors that happened to surface in the same maintenance workflow.

The first was transport-related on baobab: AutoNAS-1 and AutoNAS-2 are mounted over 192.168.10.x through thunderbridge, but Thunderbolt bridge membership was being torn down before Proxmox attempted to unmount those remote NFS storages.

The second was provider-related on ebony and tapia: each node exported local AutoNAS storage and mounted that export back on itself as a Proxmox NFS storage. In that self-hosted topology, shutdown became sensitive to the ordering between umount.nfs4 and nfs-server.service.

The same investigation also exposed a separate maintenance-preflight issue in pgs (pve-guests-state): its cleanup step could block in kernel I/O wait when it touched stale remote NFS-backed storages.

The final fix therefore spans cluster maintenance, thunderbolts, autoNAS, and pve-guests-state, and should be tracked as a cluster issue rather than a project-local one.


Environment

  • Affected nodes: baobab, ebony, tapia
  • Component: cluster storage + maintenance workflow
  • Version/software: Proxmox VE 9.1 / kernel 6.17.13-1-pve, tb-enlist@.service, autoNAS, pgs

Steps to Reproduce

  1. On a node with shared Proxmox NFS storages, run /usr/local/sbin/pgs suspend -v.
  2. Trigger systemctl reboot.
  3. Measure ICMP availability during shutdown and boot.
  4. Inspect journalctl -b -1 around the reboot window.
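
The timings in step 3 can be derived from a series of timestamped ping probes. The sketch below is illustrative and is not the tooling used for this issue: `reboot_metrics` and the `(elapsed_seconds, got_reply)` sample format are assumptions, as is the reading of TIME_TO_STOP_SECONDS as the first unanswered probe after the reboot request. DOWNTIME_SECONDS is taken as the difference between the other two values, which is consistent with the figures reported in this issue.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class RebootMetrics(NamedTuple):
    time_to_stop: float         # seconds from reboot request to first unanswered probe
    time_to_first_reply: float  # seconds from reboot request to first reply after the outage
    downtime: float             # time_to_first_reply - time_to_stop


def reboot_metrics(samples: Iterable[Tuple[float, bool]]) -> Optional[RebootMetrics]:
    """Derive reboot timings from (seconds_since_reboot_request, got_reply) probes.

    Returns None if the host never stopped replying or never came back.
    """
    stop = None
    for elapsed, got_reply in sorted(samples):
        if stop is None:
            if not got_reply:
                stop = elapsed  # first probe that went unanswered
        elif got_reply:
            return RebootMetrics(stop, elapsed, elapsed - stop)
    return None


# Probe times taken from the repeated tapia validation in this issue.
tapia = reboot_metrics([(0.0, True), (28.990, False), (53.384, True)])
```

Fed the repeated tapia validation (ICMP loss at 28.990 s, first reply at 53.384 s), the function yields a downtime of 24.394 s, which matches the DOWNTIME_SECONDS value recorded for that run.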

Expected Behavior

  • NFS storages should unmount before either their transport or provider disappears.
  • Host should stop replying to ICMP shortly after reboot is requested.
  • pgs suspend should not block when a remote NFS mount is stale.

Actual Behavior

  • First validation on baobab:
    • TIME_TO_STOP_SECONDS 105.852
    • TIME_TO_FIRST_REPLY_SECONDS 130.230
    • DOWNTIME_SECONDS 24.377
  • Follow-up validation on ebony before self-hosted fix:
    • TIME_TO_STOP_SECONDS 120.275
    • TIME_TO_FIRST_REPLY_SECONDS 145.840
    • DOWNTIME_SECONDS 25.565
  • Follow-up validation on tapia before provider-ordering fix:
    • TIME_TO_STOP_SECONDS 123.285
    • TIME_TO_FIRST_REPLY_SECONDS 149.420
    • DOWNTIME_SECONDS 26.135
  • Revalidation after fixes:
    • baobab: TIME_TO_STOP_SECONDS 14.599, TIME_TO_FIRST_REPLY_SECONDS 35.651
    • ebony: TIME_TO_STOP_SECONDS 27.573, TIME_TO_FIRST_REPLY_SECONDS 53.288
    • tapia: TIME_TO_STOP_SECONDS 28.305, TIME_TO_FIRST_REPLY_SECONDS 53.588
    • repeated tapia validation: TIME_TO_STOP_SECONDS 28.990, TIME_TO_FIRST_REPLY_SECONDS 53.384

Logs/Evidence

Transport ordering failure on baobab:

Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
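
The gaps between these three lines can be computed directly from their timestamps. This is a minimal sketch: `journal_time` and `gap_seconds` are hypothetical helpers, the timestamp layout is assumed to be the one shown in the excerpt, and the year is assumed to be 2026 from the issue's creation date because the log lines omit it.

```python
from datetime import datetime

# Log lines copied from the baobab excerpt above.
DETACHED = "Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached"
UNMOUNTING = "Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1..."
TIMED_OUT = "Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating."


def journal_time(line: str, year: int = 2026) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS.ffffff' stamp; the year is supplied by the caller."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


def gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two journal lines."""
    return (journal_time(later) - journal_time(earlier)).total_seconds()


detach_to_unmount = gap_seconds(DETACHED, UNMOUNTING)    # about 12.55 s
unmount_to_timeout = gap_seconds(UNMOUNTING, TIMED_OUT)  # about 90.06 s
```

The bridge port was detached roughly 12.6 s before the first unmount began, and the timeout message followed about 90 s after that. The two systemd lines name different mounts, so the 90 s figure only approximates a single unit's unmount timeout.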

Preflight stale-NFS block in pgs:

[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]

Provider-ordering fix validated on tapia:

TIME_TO_STOP_SECONDS 28.990
TIME_TO_FIRST_REPLY_SECONDS 53.384
DOWNTIME_SECONDS 24.394

Investigation Notes

  • 2026-03-07: Confirmed baobab delay was dominated by NFS unmount timeout after Thunderbolt transport disappeared too early.
  • 2026-03-07: Patched tb-enlist@.service with Before=network.target; reboot timing on baobab dropped from ~106s to ~15s.
  • 2026-03-07: Confirmed pgs preflight could block on stale remote NFS during storage cleanup.
  • 2026-03-07: Patched pgs cleanup to scan only local dir storages; remote NFS is skipped intentionally.
  • 2026-03-07: Confirmed ebony delay was self-hosted AutoNAS-1: the node exported local storage and mounted it back as Proxmox NFS.
  • 2026-03-07: First AutoNAS patch kept autonas.service and autonas-boot-scan.service ordered before remote-fs.target and umount.target; ebony improved to ~28s shutdown-to-ICMP-loss.
  • 2026-03-07: tapia still showed ~123s shutdown with that first AutoNAS patch because nfs-server.service still stopped too early for self-hosted AutoNAS-2.
  • 2026-03-07: Implemented second-generation AutoNAS fix that generates /etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf from storage.cfg, adding explicit Before= ordering from nfs-server.service to matching self-hosted Proxmox mount units.
  • 2026-03-07: Revalidated tapia twice after the nfs-server.service ordering fix; both tests converged around 29s to ICMP loss and 53s to first ICMP reply.
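
The Before= fixes noted above rely on systemd stopping units in the reverse of their start-up order. The toy model below illustrates that reasoning; it is a sketch, not a reproduction of systemd's scheduler. `stop_order` is a hypothetical helper, the unit names are taken from this issue, and the edge from network.target to the mount reflects the default After=network.target ordering that systemd gives network mounts.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def stop_order(starts_after: Dict[str, Set[str]]) -> List[str]:
    """Return units in shutdown order, i.e. the reverse of a valid start-up order.

    starts_after maps a unit to the units that must start before it.
    """
    start = list(TopologicalSorter(starts_after).static_order())
    return list(reversed(start))


# Transport case (baobab): the enlist unit carries Before=network.target,
# and the NFS mount is ordered after network.target.
transport = stop_order({
    "network.target": {"tb-enlist@.service"},
    "mnt-pve-AutoNAS-1.mount": {"network.target"},
})

# Provider case (tapia): nfs-server.service carries Before= on the self-hosted mount.
provider = stop_order({
    "mnt-pve-AutoNAS-2.mount": {"nfs-server.service"},
})
```

In both cases the mount unit comes first in the stop order, so the unmount runs while its transport or NFS server is still up.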

Proposed Solution

  1. Keep Thunderbolt enlist units ordered before network.target so transport-backed NFS over thunderbridge stays alive until remote filesystems unmount.
  2. Keep pgs cleanup limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
  3. For self-hosted AutoNAS exports, generate explicit nfs-server.service ordering against the matching Proxmox mnt-pve-*.mount units discovered from storage.cfg.
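
Items 2 and 3 both start from storage.cfg, so they can be sketched together. The code below is an illustration under stated assumptions and is not the pgs or autoNAS implementation: every function name is hypothetical, the sample addresses and export paths are invented, and "self-hosted" is assumed to mean that an NFS storage's server address belongs to the local node. `mount_unit_for` simply reproduces the unit-name form seen in the logs; real code should obtain the name from `systemd-escape --path --suffix=mount`, since systemd escapes some characters in unit names.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical storage.cfg content; IDs follow this issue, addresses and paths are invented.
SAMPLE_CFG = """\
dir: local
    path /var/lib/vz
    content iso,vztmpl,backup

nfs: AutoNAS-1
    server 192.168.10.1
    export /srv/autonas-1
    path /mnt/pve/AutoNAS-1

nfs: AutoNAS-2
    server 192.168.10.2
    export /srv/autonas-2
    path /mnt/pve/AutoNAS-2
"""


def parse_storage_cfg(text: str) -> Dict[str, Dict[str, str]]:
    """Parse '<type>: <id>' headers followed by indented 'key value' lines."""
    storages: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        if not raw[0].isspace():  # unindented line opens a new storage section
            stype, _, sid = raw.partition(":")
            current = storages.setdefault(sid.strip(), {"type": stype.strip()})
        elif current is not None:
            key, _, value = raw.strip().partition(" ")
            current[key] = value.strip()
    return storages


def local_dir_storages(storages: Dict[str, Dict[str, str]]) -> List[str]:
    """Item 2: only directory-backed storages are candidates for cleanup."""
    return sorted(sid for sid, s in storages.items() if s["type"] == "dir")


def self_hosted_nfs(storages: Dict[str, Dict[str, str]], local_addresses: Set[str]) -> List[str]:
    """Item 3: NFS storages whose server is one of this node's own addresses."""
    return sorted(
        sid for sid, s in storages.items()
        if s["type"] == "nfs" and s.get("server") in local_addresses
    )


def mount_unit_for(storage_id: str) -> str:
    """Assumed unit name, in the form shown in the logs of this issue."""
    return f"mnt-pve-{storage_id}.mount"


def render_dropin(mount_units: Iterable[str]) -> str:
    """Render a drop-in body that orders nfs-server.service before each mount unit."""
    return "\n".join(["[Unit]"] + [f"Before={unit}" for unit in mount_units]) + "\n"


storages = parse_storage_cfg(SAMPLE_CFG)
cleanup_targets = local_dir_storages(storages)
dropin = render_dropin(mount_unit_for(s) for s in self_hosted_nfs(storages, {"192.168.10.2"}))
```

With the sample data, `cleanup_targets` contains only `local`, and `dropin` holds a `[Unit]` section with a single `Before=mnt-pve-AutoNAS-2.mount` line. Deriving the drop-in from storage.cfg keeps the ordering in step with whatever storages are actually defined on each node.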

Related Issues

  • ISSUE-2026-001

Changelog References

CHANGELOG.md entries that reference this issue:

  • projects/thunderbolts/CHANGELOG.md: tb-enlist@.service now stays active until network.target stops... [ISSUE-2026-002]
  • projects/autoNAS/CHANGELOG.md: self-hosted AutoNAS shutdown now adds explicit nfs-server.service ordering... [ISSUE-2026-002]
  • projects/pve-guests-state/CHANGELOG.md: Suspend-artifact cleanup now scans only local dir storages... [ISSUE-2026-002]