# Issue ISSUE-2026-002: Planned reboot stalls on shared NFS storages during maintenance shutdown

## Issue ID: ISSUE-2026-002

**Status:** resolved  
**Priority:** high  
**Created:** 2026-03-07  
**Updated:** 2026-03-07  
**Assigned to:** unassigned

---

## Summary

A planned node reboot could spend roughly 105 to 125 seconds in shutdown (measured as time to ICMP loss) because shared Proxmox NFS storages were not consistently unmounted before their transport or provider was torn down.

---

## Description

This incident had two independent cluster-level contributors that happened to surface in the same maintenance workflow.

The first was transport-related on `baobab`: `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, but Thunderbolt bridge membership was being torn down before Proxmox attempted to unmount those remote NFS storages.

The second was provider-related on `ebony` and `tapia`: each node exported local AutoNAS storage over NFS and mounted that export back on itself as a Proxmox NFS storage. In that self-hosted topology, shutdown was sensitive to the ordering between `umount.nfs4` and the stop of `nfs-server.service`.
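Both contributors come down to systemd stop ordering: units are stopped in the reverse of their start order, so a unit declared `Before=` another starts first and stops last. The toy model below sketches that rule only and is not a model of systemd itself. The `Before=` edges mirror the fixes described in this issue, `mnt-pve-*.mount` stands in for the affected mount units, and the ordering of network mounts after `network.target` is assumed from systemd's documented defaults.

```python
from graphlib import TopologicalSorter


def stop_order(before_edges):
    """Return the stop order implied by (first, later) pairs,
    where each pair means `first` declares Before=`later`."""
    sorter = TopologicalSorter()
    for first, later in before_edges:
        # `later` may only start once `first` has started.
        sorter.add(later, first)
    start_order = list(sorter.static_order())
    # Shutdown applies the inverse of the start-up order.
    return start_order[::-1]


# Transport case (baobab): the enlist unit must outlive the NFS mounts.
transport = stop_order([
    ("tb-enlist@.service", "network.target"),
    ("network.target", "mnt-pve-*.mount"),
])

# Provider case (ebony, tapia): the NFS server must outlive its own client mount.
provider = stop_order([
    ("nfs-server.service", "mnt-pve-*.mount"),
])

print(transport)
print(provider)
```

In both chains the mount unit is stopped first, which is the property the fixes are meant to guarantee.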

The same investigation also exposed a separate maintenance-preflight issue in `pgs`: cleanup could block in kernel I/O wait when it touched stale remote NFS-backed storages.

The final fix therefore spans cluster maintenance, `thunderbolts`, `autoNAS`, and `pve-guests-state`, and should be tracked as a cluster issue rather than a project-local one.

---

## Environment

- **Affected nodes:** `baobab`, `ebony`, `tapia`
- **Component:** cluster storage + maintenance workflow
- **Version/software:** Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `autoNAS`, `pgs`

---

## Steps to Reproduce

1. On a node with shared Proxmox NFS storages, run `/usr/local/sbin/pgs suspend -v`.
2. Trigger `systemctl reboot`.
3. Measure ICMP availability during shutdown and boot.
4. Inspect `journalctl -b -1` around the reboot window.
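
Step 3 can be scripted from a second machine. The following is a minimal sketch, assuming the three metrics reported in this issue are defined as time to the first lost ICMP reply, time to the first reply after that loss, and the difference between the two. The function names, the probe interval, and the use of Linux iputils `ping` are illustrative assumptions, not the tooling that produced the numbers below.

```python
import subprocess
import time


def probe(host):
    """Send one ICMP echo with a 1 s timeout (Linux iputils `ping` flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sample(host, duration=300.0, interval=0.2):
    """Collect (seconds_since_start, reachable) pairs for `duration` seconds."""
    start = time.monotonic()
    samples = []
    while (elapsed := time.monotonic() - start) < duration:
        samples.append((elapsed, probe(host)))
        time.sleep(interval)
    return samples


def summarize(samples):
    """Derive the three timing metrics from time-ordered samples."""
    stop = first_reply = None
    for elapsed, reachable in samples:
        if stop is None:
            if not reachable:
                stop = elapsed  # first lost reply after the reboot request
        elif reachable:
            first_reply = elapsed  # first reply after the outage began
            break
    downtime = None
    if stop is not None and first_reply is not None:
        downtime = round(first_reply - stop, 3)
    return {
        "TIME_TO_STOP_SECONDS": stop,
        "TIME_TO_FIRST_REPLY_SECONDS": first_reply,
        "DOWNTIME_SECONDS": downtime,
    }
```

Calling `summarize(sample("baobab"))` immediately after `systemctl reboot` is issued would give comparable figures, provided the hostname resolves from the measuring machine.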

---

## Expected Behavior

- NFS storages should unmount before either their transport or provider disappears.
- Host should stop replying to ICMP shortly after reboot is requested.
- `pgs suspend` should not block because a remote NFS mount is stale.

---

## Actual Behavior

- First validation on `baobab`:
  - `TIME_TO_STOP_SECONDS 105.852`
  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
  - `DOWNTIME_SECONDS 24.377`
- Follow-up validation on `ebony` before self-hosted fix:
  - `TIME_TO_STOP_SECONDS 120.275`
  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
  - `DOWNTIME_SECONDS 25.565`
- Follow-up validation on `tapia` before provider-ordering fix:
  - `TIME_TO_STOP_SECONDS 123.285`
  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
  - `DOWNTIME_SECONDS 26.135`
- Revalidation after fixes:
  - `baobab`: `TIME_TO_STOP_SECONDS 14.599`, `TIME_TO_FIRST_REPLY_SECONDS 35.651`
  - `ebony`: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
  - `tapia`: `TIME_TO_STOP_SECONDS 28.305`, `TIME_TO_FIRST_REPLY_SECONDS 53.588`
  - repeated `tapia` validation: `TIME_TO_STOP_SECONDS 28.990`, `TIME_TO_FIRST_REPLY_SECONDS 53.384`
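
A quick arithmetic check on these figures, as a sketch: it assumes `DOWNTIME_SECONDS` is simply the first-reply time minus the stop time, which agrees with the reported values to within a millisecond. The numbers are copied from the runs listed above; the repeated `tapia` run is left out.

```python
# (TIME_TO_STOP_SECONDS, TIME_TO_FIRST_REPLY_SECONDS) per node.
BEFORE = {
    "baobab": (105.852, 130.230),
    "ebony": (120.275, 145.840),
    "tapia": (123.285, 149.420),
}
AFTER = {
    "baobab": (14.599, 35.651),
    "ebony": (27.573, 53.288),
    "tapia": (28.305, 53.588),
}


def downtime(stop, first_reply):
    """ICMP outage window: first reply after reboot minus last reachable point."""
    return round(first_reply - stop, 3)


def shutdown_saved(node):
    """Seconds removed from the shutdown phase by the fixes."""
    return round(BEFORE[node][0] - AFTER[node][0], 3)


for node in BEFORE:
    print(
        node,
        "downtime before:", downtime(*BEFORE[node]),
        "downtime after:", downtime(*AFTER[node]),
        "shutdown saved:", shutdown_saved(node),
    )
```

The outage window stays at about 21 to 26 seconds on every node, while the shutdown phase shrinks by roughly 91 to 95 seconds, so the improvement comes entirely from the shutdown path rather than from a faster boot.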

---

## Logs/Evidence

Transport ordering failure on `baobab`:

```text
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
```
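
The gaps between those three journal lines can be computed directly. This is a small sketch using only the timestamps shown above. The timeout line names `AutoNAS-2` while the unmount line names `AutoNAS-1`, so the second gap assumes both unmounts began at about the same moment.

```python
from datetime import datetime


def parse_stamp(stamp, year=2026):
    """Parse a short journal timestamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


detached = parse_stamp("Mar 07 08:48:17.989246")
unmount_started = parse_stamp("Mar 07 08:48:30.540186")
timed_out = parse_stamp("Mar 07 08:50:00.604036")

# How long the bridge port had already been gone when unmounting began.
lead = (unmount_started - detached).total_seconds()
# How long the unmount job ran before systemd gave up on it.
stall = (timed_out - unmount_started).total_seconds()

print(f"transport gone {lead:.3f}s before unmount; unmount stalled {stall:.3f}s")
```

The bridge port was detached about 12.6 seconds before unmounting started, and the unmount then ran for about 90 seconds before timing out, which is consistent with a 90-second unit stop timeout.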

Preflight stale-NFS block in `pgs`:

```text
[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
```

Provider-ordering fix validated on `tapia`:

```text
TIME_TO_STOP_SECONDS 28.990
TIME_TO_FIRST_REPLY_SECONDS 53.384
DOWNTIME_SECONDS 24.394
```

---

## Investigation Notes

- 2026-03-07: Confirmed `baobab` delay was dominated by NFS unmount timeout after Thunderbolt transport disappeared too early.
- 2026-03-07: Patched `tb-enlist@.service` with `Before=network.target`; reboot timing on `baobab` dropped from ~106s to ~15s.
- 2026-03-07: Confirmed `pgs` preflight could block on stale remote NFS during storage cleanup.
- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote NFS is skipped intentionally.
- 2026-03-07: Confirmed `ebony` delay was caused by self-hosted `AutoNAS-1`: the node exported local storage and mounted it back as Proxmox NFS.
- 2026-03-07: First AutoNAS patch kept `autonas.service` and `autonas-boot-scan.service` ordered before `remote-fs.target` and `umount.target`; `ebony` improved to ~28s shutdown-to-ICMP-loss.
- 2026-03-07: `tapia` still showed ~123s shutdown with that first AutoNAS patch because `nfs-server.service` still stopped too early for self-hosted `AutoNAS-2`.
- 2026-03-07: Implemented second-generation AutoNAS fix that generates `/etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf` from `storage.cfg`, adding explicit `Before=` ordering from `nfs-server.service` to matching self-hosted Proxmox mount units.
- 2026-03-07: Revalidated `tapia` twice after the `nfs-server.service` ordering fix; both runs landed at ~28-29s to ICMP loss and ~53-54s to first ICMP reply.
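
The `pgs` cleanup change noted above amounts to filtering storages by type before touching any path. The sketch below assumes the storage definitions are already parsed into dictionaries with `id`, `type`, and `path` keys. The function name and the example entries are hypothetical and do not reflect the actual `pgs` code.

```python
def cleanup_roots(storages):
    """Return the paths that cleanup may scan: `dir` storages only.

    Remote types such as `nfs` are skipped on purpose, so a stale mount
    can never put the caller into an uninterruptible I/O wait.
    """
    roots = []
    for storage in storages:
        if storage.get("type") == "dir" and storage.get("path"):
            roots.append(storage["path"])
    return roots


# Hypothetical parsed entries for illustration.
EXAMPLE_STORAGES = [
    {"id": "local", "type": "dir", "path": "/var/lib/vz"},
    {"id": "AutoNAS-1", "type": "nfs", "path": "/mnt/pve/AutoNAS-1"},
    {"id": "AutoNAS-2", "type": "nfs", "path": "/mnt/pve/AutoNAS-2"},
]

print(cleanup_roots(EXAMPLE_STORAGES))
```

Filtering on the declared type means no filesystem call is made against a remote path at all, which is what keeps the preflight from blocking.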

---

## Proposed Solution

1. Keep Thunderbolt enlist units ordered before `network.target` so transport-backed NFS over `thunderbridge` stays alive until remote filesystems unmount.
2. Keep `pgs` cleanup limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
3. For self-hosted AutoNAS exports, generate explicit `nfs-server.service` ordering against the matching Proxmox `mnt-pve-*.mount` units discovered from `storage.cfg`.
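
Item 3 can be sketched as a small generator. This is an illustration under stated assumptions, not the AutoNAS implementation: it assumes a storage counts as self-hosted when its `server` value is one of the node's own addresses, the sample `storage.cfg` text and IP addresses are hypothetical, and `mount_unit` is a simplified reimplementation of `systemd-escape --path --suffix=mount` that does not collapse repeated slashes. systemd escapes a literal `-` inside a path component as `\x2d`, which is why the sketch does not simply join components with hyphens.

```python
def parse_storage_cfg(text):
    """Parse `<type>: <id>` sections followed by indented `key value` lines."""
    entries, current = [], None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line[0].isspace():
            storage_type, _, storage_id = line.partition(":")
            current = {"type": storage_type.strip(), "id": storage_id.strip()}
            entries.append(current)
        elif current is not None:
            key, _, value = line.strip().partition(" ")
            current[key] = value.strip()
    return entries


def mount_unit(path):
    """Approximate systemd path escaping for a mount point."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-.mount"
    out = []
    for index, char in enumerate(trimmed):
        if char == "/":
            out.append("-")
        elif (char.isascii() and char.isalnum()) or char in ":_" or (char == "." and index > 0):
            out.append(char)
        else:
            out.extend("\\x%02x" % byte for byte in char.encode())
    return "".join(out) + ".mount"


def render_dropin(cfg_text, local_addrs):
    """Build drop-in text ordering nfs-server.service before self-hosted mounts."""
    units = [
        mount_unit(entry.get("path", "/mnt/pve/" + entry["id"]))
        for entry in parse_storage_cfg(cfg_text)
        if entry["type"] == "nfs" and entry.get("server") in local_addrs
    ]
    if not units:
        return ""
    return "[Unit]\n" + "".join(f"Before={unit}\n" for unit in units)


# Hypothetical storage.cfg content; addresses are placeholders.
SAMPLE_CFG = """\
dir: local
\tpath /var/lib/vz

nfs: AutoNAS-1
\tpath /mnt/pve/AutoNAS-1
\tserver 192.168.10.2

nfs: AutoNAS-2
\tpath /mnt/pve/AutoNAS-2
\tserver 192.168.10.3
"""

print(render_dropin(SAMPLE_CFG, {"192.168.10.3"}))
```

On a node whose own address is `192.168.10.3`, only `AutoNAS-2` is treated as self-hosted, so the rendered drop-in orders `nfs-server.service` before that one mount unit and leaves the remote `AutoNAS-1` mount alone.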

---

## Related Issues

- ISSUE-2026-001

---

## Changelog References

CHANGELOG.md entries that reference this issue:

- `projects/thunderbolts/CHANGELOG.md`: `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
- `projects/autoNAS/CHANGELOG.md`: self-hosted AutoNAS shutdown now adds explicit `nfs-server.service` ordering... [ISSUE-2026-002]
- `projects/pve-guests-state/CHANGELOG.md`: Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]
