Move ISSUE-2026-002 to cluster issue tracking · 8f00f0f


bogdan committed 3 months ago

main

1 parent 302fe99

commit 8f00f0f

Showing 7 changed files with 228 additions and 175 deletions

+1 -0

CHANGELOG.md

@@ -12,6 +12,7 @@ Each entry should reference related issues using the format `[ISSUE-YYYY-NNN]`.
 
 ### Known Issues
 - [ISSUE-2025-001] Thunderbolt interfaces MTU resets to 1500 after networking restart (open)
+- PBS instances hosted inside VMs `301 is-anjohibe` and `302 is-andrafiabe` become temporarily unavailable during planned node reboot while `pgs suspend` freezes them; this is an accepted cluster-wide maintenance-window limitation that affects backup and restore availability only for that window
 
 ### Added
 - Added a central `cluster/projects/README.md` policy for current and future cluster-level projects


+141 -0

issues/ISSUE-2026-002.md


+                    +# Issue ISSUE-2026-002: Planned reboot stalls on shared NFS storages during maintenance shutdown
+                    +## Issue ID: ISSUE-2026-002
+                    +**Status:** resolved
+                    +**Priority:** high
+                    +**Created:** 2026-03-07
+                    +**Updated:** 2026-03-07
+                    +**Assigned to:** unassigned
+                    +---
+                    +## Summary
+                    +Planned node reboot could spend roughly 106 to 123 seconds in shutdown because shared Proxmox NFS storages were not consistently ordered ahead of transport or provider teardown.
+                    +---
+                    +## Description
+                    +This incident had two independent cluster-level contributors that happened to surface in the same maintenance workflow.
+                    +The first was transport-related on `baobab`: `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, but Thunderbolt bridge membership was being torn down before Proxmox attempted to unmount those remote NFS storages.
+                    +The second was provider-related on `ebony` and `tapia`: local AutoNAS exports were mounted back on the same node as Proxmox NFS storages. In that self-hosted topology, shutdown became sensitive to the ordering between `umount.nfs4` and `nfs-server.service`.
+                    +The same investigation also exposed a separate maintenance-preflight issue in `pgs`: cleanup could block in kernel I/O wait when it touched stale remote NFS-backed storages.
+                    +The final fix therefore spans cluster maintenance, `thunderbolts`, `autoNAS`, and `pve-guests-state`, and should be tracked as a cluster issue rather than a project-local one.
+                    +---
+                    +## Environment
+                    +- **Affected nodes:** `baobab`, `ebony`, `tapia`
+                    +- **Component:** cluster storage + maintenance workflow
+                    +- **Version/software:** Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `autoNAS`, `pgs`
+                    +---
+                    +## Steps to Reproduce
+                    +1. On a node with shared Proxmox NFS storages, run `/usr/local/sbin/pgs suspend -v`.
+                    +2. Trigger `systemctl reboot`.
+                    +3. Measure ICMP availability during shutdown and boot.
+                    +4. Inspect `journalctl -b -1` around the reboot window.
+                    +---
+                    +## Expected Behavior
+                    +- NFS storages should unmount before either their transport or provider disappears.
+                    +- Host should stop replying to ICMP shortly after reboot is requested.
+                    +- `pgs suspend` should not block because a remote NFS mount is stale.
+                    +---
+                    +## Actual Behavior
+                    +- First validation on `baobab`:
+                    +  - `TIME_TO_STOP_SECONDS 105.852`
+                    +  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
+                    +  - `DOWNTIME_SECONDS 24.377`
+                    +- Follow-up validation on `ebony` before self-hosted fix:
+                    +  - `TIME_TO_STOP_SECONDS 120.275`
+                    +  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
+                    +  - `DOWNTIME_SECONDS 25.565`
+                    +- Follow-up validation on `tapia` before provider-ordering fix:
+                    +  - `TIME_TO_STOP_SECONDS 123.285`
+                    +  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
+                    +  - `DOWNTIME_SECONDS 26.135`
+                    +- Revalidation after fixes:
+                    +  - `baobab`: `TIME_TO_STOP_SECONDS 14.599`, `TIME_TO_FIRST_REPLY_SECONDS 35.651`
+                    +  - `ebony`: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
+                    +  - `tapia`: `TIME_TO_STOP_SECONDS 28.305`, `TIME_TO_FIRST_REPLY_SECONDS 53.588`
+                    +  - repeated `tapia` validation: `TIME_TO_STOP_SECONDS 28.990`, `TIME_TO_FIRST_REPLY_SECONDS 53.384`
+                    +---
+                    +## Logs/Evidence
+                    +Transport ordering failure on `baobab`:
+                    +```text
+                    +Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
+                    +Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
+                    +Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
+                    +```
+                    +Preflight stale-NFS block in `pgs`:
+                    +```text
+                    +[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
+                    +[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
+                    +[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
+                    +```
+                    +Provider-ordering fix validated on `tapia`:
+                    +```text
+                    +TIME_TO_STOP_SECONDS 28.990
+                    +TIME_TO_FIRST_REPLY_SECONDS 53.384
+                    +DOWNTIME_SECONDS 24.394
+                    +```
+                    +---
+                    +## Investigation Notes
+                    +- 2026-03-07: Confirmed `baobab` delay was dominated by NFS unmount timeout after Thunderbolt transport disappeared too early.
+                    +- 2026-03-07: Patched `tb-enlist@.service` with `Before=network.target`; reboot timing on `baobab` dropped from ~106s to ~15s.
+                    +- 2026-03-07: Confirmed `pgs` preflight could block on stale remote NFS during storage cleanup.
+                    +- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote NFS is skipped intentionally.
+                    +- 2026-03-07: Confirmed `ebony` delay was self-hosted `AutoNAS-1`: the node exported local storage and mounted it back as Proxmox NFS.
+                    +- 2026-03-07: First AutoNAS patch kept `autonas.service` and `autonas-boot-scan.service` ordered before `remote-fs.target` and `umount.target`; `ebony` improved to ~28s shutdown-to-ICMP-loss.
+                    +- 2026-03-07: `tapia` still showed ~123s shutdown with that first AutoNAS patch because `nfs-server.service` still stopped too early for self-hosted `AutoNAS-2`.
+                    +- 2026-03-07: Implemented second-generation AutoNAS fix that generates `/etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf` from `storage.cfg`, adding explicit `Before=` ordering from `nfs-server.service` to matching self-hosted Proxmox mount units.
+                    +- 2026-03-07: Revalidated `tapia` twice after the `nfs-server.service` ordering fix; both tests converged around `29s` to ICMP loss and `53s` to first ICMP reply.
+                    +---
+                    +## Proposed Solution
+                    +1. Keep Thunderbolt enlist units ordered before `network.target` so transport-backed NFS over `thunderbridge` stays alive until remote filesystems unmount.
+                    +2. Keep `pgs` cleanup limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
+                    +3. For self-hosted AutoNAS exports, generate explicit `nfs-server.service` ordering against the matching Proxmox `mnt-pve-*.mount` units discovered from `storage.cfg`.
+                    +---
+                    +## Related Issues
+                    +- ISSUE-2026-001
+                    +---
+                    +## Changelog References
+                    +List CHANGELOG.md entries that reference this issue:
+                    +- `projects/thunderbolts/CHANGELOG.md`: `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
+                    +- `projects/autoNAS/CHANGELOG.md`: self-hosted AutoNAS shutdown now adds explicit `nfs-server.service` ordering... [ISSUE-2026-002]
+                    +- `projects/pve-guests-state/CHANGELOG.md`: Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]
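The self-hosted ordering fix described in the issue above (a drop-in for `nfs-server.service` generated from `storage.cfg`) can be illustrated with a minimal Python sketch. This is not the `autoNAS` implementation, which is not shown in this commit. The function names, the `local_addrs` parameter, and the fallback path `/mnt/pve/<id>` are assumptions made for illustration. The mount unit names follow the simplified form seen in the issue's log excerpt; `systemd-escape --path` is the authoritative way to derive a real unit name.

```python
from dataclasses import dataclass


@dataclass
class NfsStorage:
    storage_id: str
    server: str
    path: str


def parse_nfs_storages(storage_cfg: str) -> list[NfsStorage]:
    """Collect `nfs:` sections from storage.cfg-style text."""
    sections = []
    current = None  # properties of the nfs section being read, if any
    for raw in storage_cfg.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not raw[0].isspace():
            # An unindented line opens a new "<type>: <id>" section.
            kind, _, ident = line.partition(":")
            current = {"id": ident.strip()} if kind.strip() == "nfs" else None
            if current is not None:
                sections.append(current)
        elif current is not None:
            # Indented lines are "<key> <value>" properties of the section.
            parts = line.split(None, 1)
            current[parts[0]] = parts[1] if len(parts) > 1 else ""
    return [
        # Assumption: fall back to /mnt/pve/<id> when no path is listed.
        NfsStorage(s["id"], s.get("server", ""), s.get("path", f"/mnt/pve/{s['id']}"))
        for s in sections
    ]


def mount_unit_name(path: str) -> str:
    # Simplification: join path components with "-", as in the log excerpt.
    return path.strip("/").replace("/", "-") + ".mount"


def render_dropin(storage_cfg: str, local_addrs: set[str]) -> str:
    """Render drop-in text ordering nfs-server.service before self-hosted mounts."""
    units = [
        mount_unit_name(s.path)
        for s in parse_nfs_storages(storage_cfg)
        if s.server in local_addrs  # self-hosted: the NFS server is this node
    ]
    if not units:
        return ""
    return "[Unit]\nBefore=" + " ".join(units) + "\n"
```

Because systemd stops units in the reverse of their start order, a `Before=` entry on `nfs-server.service` means the listed mount units are unmounted before the NFS server stops, which is the ordering the issue identifies as missing on `tapia`.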

+83 -0

issues/TEMPLATE.md


+                    +# Issue Template
+                    +## Issue ID: ISSUE-YYYY-NNN
+                    +**Status:** [open|investigating|in-progress|resolved|closed]
+                    +**Priority:** [low|medium|high|critical]
+                    +**Created:** YYYY-MM-DD
+                    +**Updated:** YYYY-MM-DD
+                    +**Assigned to:** [name or unassigned]
+                    +---
+                    +## Summary
+                    +Brief one-line description of the issue.
+                    +---
+                    +## Description
+                    +Detailed description of the problem, behavior, or feature request.
+                    +---
+                    +## Environment
+                    +- **Affected nodes:** [baobab|ebony|tapia|all]
+                    +- **Component:** [network|storage|vm|backup|cluster|other]
+                    +- **Version/software:** (e.g., Proxmox version, kernel version, etc.)
+                    +---
+                    +## Steps to Reproduce
+                    +1. Step 1
+                    +2. Step 2
+                    +3. ...
+                    +---
+                    +## Expected Behavior
+                    +What should happen.
+                    +---
+                    +## Actual Behavior
+                    +What actually happens.
+                    +---
+                    +## Logs/Evidence
+                    +```text
+                    +Paste relevant logs, command output, or error messages here.
+                    +```
+                    +---
+                    +## Investigation Notes
+                    +- [Date] Note 1
+                    +- [Date] Note 2
+                    +---
+                    +## Proposed Solution
+                    +Describe the proposed fix or workaround.
+                    +---
+                    +## Related Issues
+                    +- ISSUE-YYYY-NNN (if any)
+                    +---
+                    +## Changelog References
+                    +List CHANGELOG.md entries that reference this issue:
+                    +- CHANGELOG entry: [date] - description

+1 -1

projects/autoNAS


@@ -1 +1 @@
-Subproject commit 5bf8614cfafe29b3ec048bdcbf89ce09a65990dc
+Subproject commit 574cfb19bfb9d8b91657144354490e40c3a9518c

+1 -1

projects/thunderbolts/CHANGELOG.md


@@ -10,7 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 - Invalid `ExecStop` syntax in `tb-enlist@.service` caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001]
 - Tapia-Baobab Thunderbolt recovery path hardened after reboot-time disconnect/reconnect events [ISSUE-2026-001]
-- `tb-enlist@.service` now stays active until `network.target` stops, so NFS storages routed over `thunderbridge` can unmount cleanly before Thunderbolt ports are detached [ISSUE-2026-002]
+- `tb-enlist@.service` now stays active until `network.target` stops, so NFS storages routed over `thunderbridge` can unmount cleanly before Thunderbolt ports are detached; this is the Thunderbolt-side fix for the cluster-wide maintenance shutdown incident [ISSUE-2026-002]
 
 ### Added
 - Automatic Thunderbolt recovery worker (`tb-recover.service`) and periodic timer (`tb-recover.timer`) for flap resilience [ISSUE-2026-001]


+1 -1

projects/thunderbolts/README.md


@@ -135,7 +135,7 @@ refreshes the interfaces.
 - *Slow shutdown with NFS on thunderbridge*: Verify the host has the updated
   `tb-enlist@.service` with `Before=network.target`; otherwise `thunderbridge`
   can disappear before Proxmox unmounts NFS storages and shutdown waits on NFS
-  timeouts.
+  timeouts. Full incident context is tracked in cluster issue `ISSUE-2026-002`.
 - *MTU mismatch complaints*: The service forces MTU 65520 on both sides; verify the
   connected devices also support it.
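The README check for slow shutdown ("verify the host has the updated `tb-enlist@.service` with `Before=network.target`") could be scripted roughly as follows. This is a sketch under stated assumptions: the helper name `orders_before` is hypothetical, and it inspects only the text of a single unit file without merging drop-ins. The runtime view from `systemctl show -p Before tb-enlist@thunderbolt0.service` remains the authoritative check.

```python
def orders_before(unit_text: str, target: str = "network.target") -> bool:
    """Return True if the [Unit] section lists `target` in a Before= setting."""
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if section == "Unit" and sep and key.strip() == "Before":
            # Before= takes a space-separated list of unit names.
            if target in value.split():
                return True
    return False
```

A host where this returns `False` for the deployed unit file would be a candidate for the early `thunderbridge` teardown described in this troubleshooting entry.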
 


+0 -172

projects/thunderbolts/issues/ISSUE-2026-002.md


+                    -# Issue ISSUE-2026-002: Planned reboot stalls on NFS storages over thunderbridge before network shutdown
+                    -## Issue ID: ISSUE-2026-002
+                    -**Status:** investigating
+                    -**Priority:** high
+                    -**Created:** 2026-03-07
+                    -**Updated:** 2026-03-07
+                    -**Assigned to:** unassigned
+                    -## Summary
+                    -Planned node reboot on `baobab` spent ~106 seconds in shutdown because Proxmox NFS storages were still mounted after Thunderbolt transport had already been detached from `thunderbridge`.
+                    -## Description
+                    -During a controlled reboot validation on `baobab`, guest suspend worked correctly, but the host remained reachable over ICMP for almost two minutes after `systemctl reboot`. Journal analysis showed that the Thunderbolt bridge ports were detached early in shutdown, while Proxmox only attempted to unmount NFS storages later. Because `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, the NFS unmount path lost transport and waited for timeout.
+                    -The same investigation exposed a second maintenance risk in `pgs`: preflight cleanup could block in kernel I/O wait when it touched remote NFS-backed storages that were stale or temporarily unavailable. That does not create the slow reboot itself, but it can block the maintenance preparation step.
+                    -Follow-up validation on `ebony` showed a different but related cluster behavior: `AutoNAS-1` is currently exported by `ebony` itself. During reboot, `autonas.service` stops early, which makes the node's own Proxmox NFS client mount for `AutoNAS-1` stale and it then waits for timeout during unmount. In the same window, VM `301 is-anjohibe` (PBS `anjothibe`) is intentionally suspended by `pgs`, so PBS availability loss is expected during the maintenance window.
+                    -Validation on `tapia` initially showed the same class of topology problem for `AutoNAS-2`, which is locally exported there and mounted back as a Proxmox NFS storage. The first AutoNAS shutdown-ordering patch remained active, but reboot timing still stayed near the pre-fix range because `mnt-pve-AutoNAS-2.mount` waited for timeout during shutdown while PBS `andrafiabe-AutoNAS` had already become unreachable.
+                    -Follow-up work in the `autoNAS` project added an explicit `nfs-server.service` drop-in for self-hosted Proxmox NFS mounts discovered from `storage.cfg`. After that second patch, `tapia` reboot timing dropped into the same range as `ebony`, confirming that the remaining blocker was provider ordering on `nfs-server.service`, not on `autonas.service`.
+                    -## Environment
+                    -- **Affected nodes:** `baobab` confirmed, likely all nodes using Proxmox NFS storages over `thunderbridge`
+                    -- **Component:** network + storage + maintenance workflow
+                    -- **Version/software:** Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `pgs`
+                    -## Steps to Reproduce
+                    -1. On a node with Proxmox NFS storages routed over `thunderbridge`, run `/usr/local/sbin/pgs suspend -v`.
+                    -2. Trigger `systemctl reboot`.
+                    -3. Measure ICMP availability during shutdown and boot.
+                    -4. Inspect `journalctl -b -1` around the reboot window.
+                    -## Expected Behavior
+                    -- NFS storages should unmount while Thunderbolt transport is still available.
+                    -- Host should stop replying to ICMP shortly after reboot is requested.
+                    -- `pgs suspend` should not hang because a remote NFS mount is stale.
+                    -## Actual Behavior
+                    -- First validation on `baobab`:
+                    -  - `TIME_TO_STOP_SECONDS 105.852`
+                    -  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
+                    -  - `DOWNTIME_SECONDS 24.377`
+                    -- Follow-up validation on `ebony`:
+                    -  - `TIME_TO_STOP_SECONDS 120.275`
+                    -  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
+                    -  - `DOWNTIME_SECONDS 25.565`
+                    -- Follow-up validation on `tapia` after cluster-wide AutoNAS rollout:
+                    -  - `TIME_TO_STOP_SECONDS 123.285`
+                    -  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
+                    -  - `DOWNTIME_SECONDS 26.135`
+                    -- Revalidation on `tapia` after explicit `nfs-server.service` self-hosted ordering fix:
+                    -  - `TIME_TO_STOP_SECONDS 28.305`
+                    -  - `TIME_TO_FIRST_REPLY_SECONDS 53.588`
+                    -  - `DOWNTIME_SECONDS 25.283`
+                    -- `journalctl -b -1` showed:
+                    -  - Thunderbolt bridge ports detached at `08:48:17.989`
+                    -  - NFS unmount only started at `08:48:30.540`
+                    -  - `mnt-pve-AutoNAS-1.mount` and `mnt-pve-AutoNAS-2.mount` timed out at `08:50:00.604/0.605`
+                    -- `journalctl -b -1` on `ebony` showed:
+                    -  - `autonas.service` stopped at `11:04:22.326`
+                    -  - `mnt-pve-AutoNAS-2.mount` unmounted successfully by `11:04:38.693`
+                    -  - `mnt-pve-AutoNAS-1.mount` timed out at `11:06:08.679`
+                    -  - only after that did `network.target` stop and `tb-enlist@thunderbolt0.service` detach from `thunderbridge`
+                    -- A later maintenance attempt also showed `pgs suspend` blocked in `nfs4_proc_getattr` while scanning storage paths.
+                    -## Logs/Evidence
+                    -```text
+                    -Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
+                    -Mar 07 08:48:17.993120 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt1 was detached
+                    -Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
+                    -Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2...
+                    -Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
+                    -Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.
+                    -```
+                    -Blocked `pgs` stack during stale-NFS preflight:
+                    -```text
+                    -[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
+                    -[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
+                    -[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
+                    -[<0>] __do_sys_newfstatat+0x43/0x90
+                    -```
+                    -Validated timing after fixes on `baobab`:
+                    -```text
+                    -TIME_TO_STOP_SECONDS 14.599
+                    -TIME_TO_FIRST_REPLY_SECONDS 35.651
+                    -DOWNTIME_SECONDS 21.053
+                    -```
+                    -## Investigation Notes
+                    -- 2026-03-07: Confirmed `AutoNAS-1` and `AutoNAS-2` on `baobab` are Proxmox NFS storages mounted from `192.168.10.21` and `192.168.10.22` over `thunderbridge`.
+                    -- 2026-03-07: First reboot validation on `baobab` showed shutdown delay dominated by NFS unmount timeout, not by boot.
+                    -- 2026-03-07: `tb-enlist@.service` had no ordering against `network.target`; systemd stopped Thunderbolt bridge membership before Proxmox unmounted remote storages.
+                    -- 2026-03-07: Patched shared `tb-enlist@.service` with `Before=network.target` and deployed to `baobab`, then cluster-wide.
+                    -- 2026-03-07: Separate maintenance attempt showed `pgs suspend` can block in `nfs4_proc_getattr` while scanning storage paths on stale remote NFS mounts.
+                    -- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote storages such as NFS are skipped intentionally.
+                    -- 2026-03-07: Revalidated on `baobab` after both fixes:
+                    -  - NFS unmount started at `10:48:12.354/10:48:12.356`
+                    -  - both NFS mounts unmounted successfully by `10:48:12.460`
+                    -  - `network.target` stopped later at `10:48:16.152`
+                    -  - ICMP loss dropped from ~106s to ~15s after reboot command
+                    -- 2026-03-07: `pgs resume` completed successfully after reboot on `baobab`; state file survived boot and all 4 VMs + 1 CT were restored.
+                    -- 2026-03-07: Validated `ebony` with current `pgs` and cluster-wide `thunderbolts` rollout. `pgs suspend` / `resume` succeeded for VMs `101`, `102`, `301`; state file survived reboot and restore completed.
+                    -- 2026-03-07: `ebony` still showed long shutdown because `AutoNAS-1` is currently provided by `ebony` itself through `autonas`. Stopping `autonas.service` made the node's own NFS client mount stale and `mnt-pve-AutoNAS-1.mount` waited for timeout.
+                    -- 2026-03-07: On `ebony`, PBS `anjothibe` availability loss during maintenance is expected because VM `301 is-anjohibe` is intentionally suspended by `pgs`, and its datastore dependency is also on `AutoNAS-1`.
+                    -- 2026-03-07: Implemented AutoNAS shutdown-ordering experiment on `ebony`: `autonas.service` and `autonas-boot-scan.service` now declare `Before=remote-fs.target` and `Before=umount.target`.
+                    -- 2026-03-07: Revalidated `ebony` after AutoNAS patch:
+                    -  - previous timing: `TIME_TO_STOP_SECONDS 120.275`, `TIME_TO_FIRST_REPLY_SECONDS 145.840`
+                    -  - new timing: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
+                    -  - `mnt-pve-AutoNAS-2.mount` still unmounted cleanly
+                    -  - `AutoNAS-1` no longer waited for the old 90s timeout, though a brief `Stale file handle` was still observed before the provider side stopped
+                    -- 2026-03-07: Residual issue on `ebony`: even with later provider shutdown, `pvestatd` briefly logged `storage 'AutoNAS-1' is not online` / `Stale file handle` during the maintenance window, so the self-hosted NFS topology remains fragile but no longer dominates shutdown time.
+                    -- 2026-03-07: Deployed the same AutoNAS ordering patch cluster-wide and revalidated `tapia`.
+                    -- 2026-03-07: `pgs suspend` / reboot / `pgs resume` succeeded on `tapia` for VMs `104`, `107`, `113`, `302`; state file survived reboot and all four guests were restored.
+                    -- 2026-03-07: `tapia` still showed slow shutdown after the AutoNAS patch:
+                    -  - `TIME_TO_STOP_SECONDS 123.285`, `TIME_TO_FIRST_REPLY_SECONDS 149.420`
+                    -  - `mnt-pve-AutoNAS-1.mount` unmounted immediately at `11:45:01.827`
+                    -  - `autonas.service` and `nfs-server.service` stopped around `11:45:01.689/11:45:01.900`
+                    -  - `mnt-pve-AutoNAS-2.mount` then waited until timeout at `11:46:31.778`
+                    -  - `network.target` stopped only after that, at `11:46:31.781`
+                    -- 2026-03-07: On `tapia`, the remaining delay is concentrated on self-hosted `AutoNAS-2` (`server 192.168.10.22`) plus expected maintenance-window loss of PBS `andrafiabe-AutoNAS` (`192.168.10.96`).
+                    -- 2026-03-07: Implemented a second-generation AutoNAS fix that generates `/etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf` from `storage.cfg`, adding `Before=` ordering from `nfs-server.service` to the matching self-hosted Proxmox mount units.
+                    -- 2026-03-07: Revalidated `tapia` after the `nfs-server.service` ordering fix:
+                    -  - previous timing after first AutoNAS patch: `TIME_TO_STOP_SECONDS 123.285`, `TIME_TO_FIRST_REPLY_SECONDS 149.420`
+                    -  - new timing: `TIME_TO_STOP_SECONDS 28.305`, `TIME_TO_FIRST_REPLY_SECONDS 53.588`
+                    -  - `nfs-server.service` stopped at `12:07:42.157`, `network.target` stopped later at `12:07:47.230`
+                    -  - the old ~90s timeout on `mnt-pve-AutoNAS-2.mount` no longer dominated shutdown
+                    -  - `pgs suspend` / reboot / `pgs resume` completed successfully for VMs `104`, `107`, `113`, `302`
+                    -## Proposed Solution
+                    -1. Keep Thunderbolt enlist units ordered before `network.target` so storage traffic over `thunderbridge` remains alive until remote filesystems are unmounted.
+                    -2. Keep `pgs` cleanup path limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
+                    -3. Do not mount a node's own AutoNAS export back onto the same node as a Proxmox NFS storage; on `ebony`, exclude `AutoNAS-1` from local use or replace that local dependency with a direct/local storage path.
+                    -4. Review colocated service dependencies before planned reboot, especially when the node provides the storage it also consumes (for example `autonas` and PBS on `ebony`).
+                    -5. Keep the generated `nfs-server.service` self-hosted ordering drop-in as the cluster fix for nodes that export AutoNAS locally and also consume the same export back through Proxmox NFS.
+                    -6. Validate the same shutdown path on the remaining nodes after storage-role cleanup and after the `nfs-server.service` ordering fix is deployed.
+                    -## Related Issues
+                    -- ISSUE-2026-001
+                    -## Changelog References
+                    -List CHANGELOG.md entries that reference this issue:
+                    -- `projects/thunderbolts/CHANGELOG.md`: [Unreleased] - `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
+                    -- `projects/pve-guests-state/CHANGELOG.md`: [1.5] - Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]
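The three timing fields in the issue appear to be related by `DOWNTIME_SECONDS = TIME_TO_FIRST_REPLY_SECONDS - TIME_TO_STOP_SECONDS`. The sketch below recomputes that relation and the shutdown improvement on `tapia` from the figures reported above. The function names are hypothetical, and the formula is inferred from the reported numbers rather than taken from the measurement tool, which is not shown here.

```python
def downtime(time_to_stop: float, time_to_first_reply: float) -> float:
    """Seconds without ICMP replies: first reply after boot minus last reply."""
    return round(time_to_first_reply - time_to_stop, 3)


def shutdown_gain(stop_before_fix: float, stop_after_fix: float) -> float:
    """Seconds of shutdown time saved by a fix."""
    return round(stop_before_fix - stop_after_fix, 3)


# tapia, before and after the nfs-server.service ordering fix (values from the issue)
tapia_before = {"stop": 123.285, "first_reply": 149.420}
tapia_after = {"stop": 28.305, "first_reply": 53.588}

print(downtime(tapia_before["stop"], tapia_before["first_reply"]))
print(downtime(tapia_after["stop"], tapia_after["first_reply"]))
print(shutdown_gain(tapia_before["stop"], tapia_after["stop"]))
```

The recomputed downtimes match the reported `26.135` and `25.283`, while the shutdown gain of about 95 seconds is close to the ~90s unmount timeout that the investigation notes identify as the dominant delay. For `baobab`, the recomputed values differ from the reported ones by 0.001, presumably because the reported fields were rounded independently.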
