Showing 7 changed files with 228 additions and 175 deletions
+1 -0
CHANGELOG.md
@@ -12,6 +12,7 @@ Each entry should reference related issues using the format `[ISSUE-YYYY-NNN]`.
12 12
 
13 13
 ### Known Issues
14 14
 - [ISSUE-2025-001] Thunderbolt interfaces MTU resets to 1500 after networking restart (open)
15
+- [ISSUE-2026-002] PBS instances hosted inside VMs `301 is-anjohibe` and `302 is-andrafiabe` are temporarily unavailable during a planned node reboot while `pgs suspend` freezes them; this is an accepted cluster-wide maintenance-window limitation and affects backup and restore availability only for that window
15 16
 
16 17
 ### Added
17 18
 - Added a central `cluster/projects/README.md` policy for current and future cluster-level projects
+141 -0
issues/ISSUE-2026-002.md
@@ -0,0 +1,141 @@
1
+# Issue ISSUE-2026-002: Planned reboot stalls on shared NFS storages during maintenance shutdown
2
+
3
+## Issue ID: ISSUE-2026-002
4
+
5
+**Status:** resolved  
6
+**Priority:** high  
7
+**Created:** 2026-03-07  
8
+**Updated:** 2026-03-07  
9
+**Assigned to:** unassigned
10
+
11
+---
12
+
13
+## Summary
14
+
15
+A planned node reboot could spend roughly 105 to 125 seconds in shutdown before ICMP loss because shared Proxmox NFS storages were not consistently unmounted before their transport or provider was torn down.
16
+
17
+---
18
+
19
+## Description
20
+
21
+This incident had two independent cluster-level contributors that happened to surface in the same maintenance workflow.
22
+
23
+The first was transport-related on `baobab`: `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, but Thunderbolt bridge membership was being torn down before Proxmox attempted to unmount those remote NFS storages.
24
+
25
+The second was provider-related on `ebony` and `tapia`: local AutoNAS exports were mounted back on the same node as Proxmox NFS storages. In that self-hosted topology, shutdown became sensitive to the ordering between `umount.nfs4` and `nfs-server.service`.
26
+
27
+The same investigation also exposed a separate maintenance-preflight issue in `pgs`: cleanup could block in kernel I/O wait when it touched stale remote NFS-backed storages.
28
+
29
+The final fix therefore spans cluster maintenance, `thunderbolts`, `autoNAS`, and `pve-guests-state`, so it is tracked as a cluster issue rather than a project-local one.
30
+
31
+---
32
+
33
+## Environment
34
+
35
+- **Affected nodes:** `baobab`, `ebony`, `tapia`
36
+- **Component:** cluster storage + maintenance workflow
37
+- **Version/software:** Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `autoNAS`, `pgs`
38
+
39
+---
40
+
41
+## Steps to Reproduce
42
+
43
+1. On a node with shared Proxmox NFS storages, run `/usr/local/sbin/pgs suspend -v`.
44
+2. Trigger `systemctl reboot`.
45
+3. Measure ICMP availability during shutdown and boot.
46
+4. Inspect `journalctl -b -1` around the reboot window.
47
+
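Step 3 does not say how the timing figures are derived from the ping log. The following is a minimal sketch, assuming the samples are `(seconds since the reboot command, replied?)` pairs collected by some external ping loop. The function name `reboot_metrics` is hypothetical, and the metric definitions are inferred from the metric names rather than confirmed by the tooling used here.

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[float, bool]  # (seconds since the reboot command, host replied?)


def reboot_metrics(samples: Iterable[Sample]) -> dict:
    """Derive the three timing figures from a chronological ping log.

    Assumed definitions (inferred from the metric names, not confirmed):
      TIME_TO_STOP_SECONDS        - first missed reply after the reboot command
      TIME_TO_FIRST_REPLY_SECONDS - first reply after that outage began
      DOWNTIME_SECONDS            - difference between the two
    """
    time_to_stop: Optional[float] = None
    time_to_first_reply: Optional[float] = None
    for elapsed, replied in samples:
        if time_to_stop is None:
            # Still waiting for the host to go silent.
            if not replied:
                time_to_stop = elapsed
        elif replied:
            # First reply after the outage started: the host is back.
            time_to_first_reply = elapsed
            break
    if time_to_stop is None or time_to_first_reply is None:
        raise ValueError("samples do not contain a complete outage")
    return {
        "TIME_TO_STOP_SECONDS": time_to_stop,
        "TIME_TO_FIRST_REPLY_SECONDS": time_to_first_reply,
        "DOWNTIME_SECONDS": round(time_to_first_reply - time_to_stop, 3),
    }
```

Keeping the calculation separate from the ping loop means the same function can be replayed against a saved log when a validation run needs to be rechecked.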
48
+---
49
+
50
+## Expected Behavior
51
+
52
+- NFS storages should unmount before either their transport or provider disappears.
53
+- Host should stop replying to ICMP shortly after reboot is requested.
54
+- `pgs suspend` should not block because a remote NFS mount is stale.
55
+
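The first expectation relies on systemd stopping units in the reverse of their start order. The sketch below models that rule with a topological sort; it is an illustration, not systemd's implementation. The unit names come from this issue. The edge from `network.target` to the mount unit assumes the NFS mount picks up systemd's default `After=network.target` ordering for network filesystems.

```python
from graphlib import TopologicalSorter
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]  # (unit that starts first, unit that starts second)


def stop_order(edges: Iterable[Edge]) -> List[str]:
    """Return the shutdown order implied by a set of start-ordering edges."""
    sorter: TopologicalSorter = TopologicalSorter()
    for first, second in edges:
        # `second` may only start once `first` has started.
        sorter.add(second, first)
    start_order = list(sorter.static_order())
    # Shutdown applies the inverse of the start-up order.
    return list(reversed(start_order))


# Transport case: Before=network.target on the enlist unit.
transport_edges = [
    ("tb-enlist@thunderbolt0.service", "network.target"),
    ("network.target", "mnt-pve-AutoNAS-1.mount"),  # assumed default for network mounts
]

# Provider case: Before=<mount unit> on nfs-server.service.
provider_edges = [
    ("nfs-server.service", "mnt-pve-AutoNAS-2.mount"),
]

print(stop_order(transport_edges))
print(stop_order(provider_edges))
```

In both cases the mount unit comes first in the stop order, which is the property the expectation asks for: the storage is unmounted while its transport and its provider are still running.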
56
+---
57
+
58
+## Actual Behavior
59
+
60
+- First validation on `baobab`:
61
+  - `TIME_TO_STOP_SECONDS 105.852`
62
+  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
63
+  - `DOWNTIME_SECONDS 24.377`
64
+- Follow-up validation on `ebony` before self-hosted fix:
65
+  - `TIME_TO_STOP_SECONDS 120.275`
66
+  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
67
+  - `DOWNTIME_SECONDS 25.565`
68
+- Follow-up validation on `tapia` before provider-ordering fix:
69
+  - `TIME_TO_STOP_SECONDS 123.285`
70
+  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
71
+  - `DOWNTIME_SECONDS 26.135`
72
+- Revalidation after fixes:
73
+  - `baobab`: `TIME_TO_STOP_SECONDS 14.599`, `TIME_TO_FIRST_REPLY_SECONDS 35.651`
74
+  - `ebony`: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
75
+  - `tapia`: `TIME_TO_STOP_SECONDS 28.305`, `TIME_TO_FIRST_REPLY_SECONDS 53.588`
76
+  - repeated `tapia` validation: `TIME_TO_STOP_SECONDS 28.990`, `TIME_TO_FIRST_REPLY_SECONDS 53.384`
77
+
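As a consistency check on the figures above, the short sketch below recomputes downtime as `TIME_TO_FIRST_REPLY_SECONDS - TIME_TO_STOP_SECONDS` and the shutdown time recovered on each node. The values are copied from the lists above; the helper names are hypothetical.

```python
# (TIME_TO_STOP_SECONDS, TIME_TO_FIRST_REPLY_SECONDS), copied from the lists above.
BEFORE = {
    "baobab": (105.852, 130.230),
    "ebony": (120.275, 145.840),
    "tapia": (123.285, 149.420),
}
AFTER = {
    "baobab": (14.599, 35.651),
    "ebony": (27.573, 53.288),
    "tapia": (28.305, 53.588),
}


def downtime(stop: float, first_reply: float) -> float:
    """Seconds between ICMP loss and the first ICMP reply after boot."""
    return round(first_reply - stop, 3)


def shutdown_saving(node: str) -> float:
    """Seconds of shutdown time recovered by the fixes on one node."""
    return round(BEFORE[node][0] - AFTER[node][0], 3)


for node in BEFORE:
    print(node, "downtime before:", downtime(*BEFORE[node]),
          "shutdown saved:", shutdown_saving(node))
```

The saving comes out at a little over 90 seconds on each node, which is consistent with removing a single unmount timeout of about 90 seconds, the interval visible between the unmount start and the timeout in the `baobab` journal excerpt. The recomputed `baobab` downtime can differ from the reported `24.377` by one thousandth, presumably because the reported figures were rounded independently.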
78
+---
79
+
80
+## Logs/Evidence
81
+
82
+Transport ordering failure on `baobab`:
83
+
84
+```text
85
+Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
86
+Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
87
+Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
88
+```
89
+
90
+Preflight stale-NFS block in `pgs`:
91
+
92
+```text
93
+[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
94
+[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
95
+[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
96
+```
97
+
98
+Provider-ordering fix validated on `tapia`:
99
+
100
+```text
101
+TIME_TO_STOP_SECONDS 28.990
102
+TIME_TO_FIRST_REPLY_SECONDS 53.384
103
+DOWNTIME_SECONDS 24.394
104
+```
105
+
106
+---
107
+
108
+## Investigation Notes
109
+
110
+- 2026-03-07: Confirmed `baobab` delay was dominated by NFS unmount timeout after Thunderbolt transport disappeared too early.
111
+- 2026-03-07: Patched `tb-enlist@.service` with `Before=network.target`; reboot timing on `baobab` dropped from ~106s to ~15s.
112
+- 2026-03-07: Confirmed `pgs` preflight could block on stale remote NFS during storage cleanup.
113
+- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote NFS is skipped intentionally.
114
+- 2026-03-07: Confirmed `ebony` delay was self-hosted `AutoNAS-1`: the node exported local storage and mounted it back as Proxmox NFS.
115
+- 2026-03-07: First AutoNAS patch added `Before=remote-fs.target` and `Before=umount.target` to `autonas.service` and `autonas-boot-scan.service`; `ebony` improved from ~120s to ~28s shutdown-to-ICMP-loss.
116
+- 2026-03-07: `tapia` still showed ~123s shutdown with that first AutoNAS patch because `nfs-server.service` still stopped too early for self-hosted `AutoNAS-2`.
117
+- 2026-03-07: Implemented second-generation AutoNAS fix that generates `/etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf` from `storage.cfg`, adding explicit `Before=` ordering from `nfs-server.service` to matching self-hosted Proxmox mount units.
118
+- 2026-03-07: Revalidated `tapia` twice after the `nfs-server.service` ordering fix; both tests converged around `29s` to ICMP loss and `53s` to first ICMP reply.
119
+
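The second-generation fix derives a drop-in from `storage.cfg`. The following is a minimal sketch of that idea, not the `autoNAS` implementation: it assumes the usual `type: id` header followed by indented `key value` lines, treats a storage as self-hosted when its `server` is in a caller-supplied set of local addresses, and falls back to `/mnt/pve/<id>` when no `path` is given. The export paths in the example are placeholders, and all function names are hypothetical.

```python
from typing import Dict, List, Set


def parse_storage_cfg(text: str) -> List[Dict[str, str]]:
    """Parse the `type: id` header / indented `key value` layout."""
    entries: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line[0].isspace():
            storage_type, _, storage_id = line.partition(":")
            entries.append({"type": storage_type.strip(), "id": storage_id.strip()})
        elif entries:
            key, _, value = line.strip().partition(" ")
            entries[-1][key] = value.strip()
    return entries


def mount_unit_for(path: str) -> str:
    """Apply systemd path escaping: '/' becomes '-', other special bytes become \\xNN."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-.mount"
    out: List[str] = []
    for index, char in enumerate(trimmed):
        if char == "/":
            out.append("-")
        elif (char.isascii() and char.isalnum()) or char in ":_" or (char == "." and index > 0):
            out.append(char)
        else:
            out.extend(f"\\x{byte:02x}" for byte in char.encode("utf-8"))
    return "".join(out) + ".mount"


def render_dropin(cfg_text: str, local_addresses: Set[str]) -> str:
    """Return drop-in text ordering nfs-server.service before self-hosted NFS mounts."""
    units = [
        mount_unit_for(entry.get("path", f"/mnt/pve/{entry['id']}"))
        for entry in parse_storage_cfg(cfg_text)
        if entry["type"] == "nfs" and entry.get("server") in local_addresses
    ]
    if not units:
        return ""
    return "[Unit]\n" + "".join(f"Before={unit}\n" for unit in units)


# Example input; export paths are placeholders, not taken from the cluster.
EXAMPLE_CFG = """\
dir: local
\tpath /var/lib/vz

nfs: AutoNAS-1
\texport /srv/autonas-1
\tpath /mnt/pve/AutoNAS-1
\tserver 192.168.10.21

nfs: AutoNAS-2
\texport /srv/autonas-2
\tpath /mnt/pve/AutoNAS-2
\tserver 192.168.10.22
"""

print(render_dropin(EXAMPLE_CFG, {"192.168.10.22"}))
```

Note that systemd's path escaping renders a literal hyphen inside a path component as `\x2d`, so the sketch emits the escaped unit name rather than the shortened form used in the notes above. On a real node, `systemd-escape --path --suffix=mount /mnt/pve/AutoNAS-2` is a quick way to confirm the exact name before relying on the drop-in.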
120
+---
121
+
122
+## Proposed Solution
123
+
124
+1. Keep Thunderbolt enlist units ordered before `network.target` so transport-backed NFS over `thunderbridge` stays alive until remote filesystems unmount.
125
+2. Keep `pgs` cleanup limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
126
+3. For self-hosted AutoNAS exports, generate explicit `nfs-server.service` ordering against the matching Proxmox `mnt-pve-*.mount` units discovered from `storage.cfg`.
127
+
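Item 2 can be illustrated with a small selection step. This is a sketch under the assumption that cleanup candidates are chosen from `storage.cfg` by storage type before any path is touched; it is not the `pgs` code, and `local_dir_paths` is a hypothetical name. Selecting from the configuration rather than probing mount points means no filesystem call ever reaches an NFS mount, which is where the blocked stack in `__nfs_revalidate_inode` originated.

```python
from typing import List


def local_dir_paths(cfg_text: str) -> List[str]:
    """Return the `path` of every `dir` storage; all other storage types are skipped."""
    paths: List[str] = []
    in_dir_section = False
    for line in cfg_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Section header such as "dir: local" or "nfs: AutoNAS-1".
            in_dir_section = line.split(":", 1)[0].strip() == "dir"
        elif in_dir_section:
            key, _, value = line.strip().partition(" ")
            if key == "path":
                paths.append(value.strip())
    return paths


example = (
    "dir: local\n"
    "\tpath /var/lib/vz\n"
    "\n"
    "nfs: AutoNAS-1\n"
    "\tpath /mnt/pve/AutoNAS-1\n"
    "\tserver 192.168.10.21\n"
)
print(local_dir_paths(example))
```

Only the returned paths would then be scanned for suspend artifacts, so a stale or unreachable NFS server cannot stall the maintenance preflight.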
128
+---
129
+
130
+## Related Issues
131
+
132
+- ISSUE-2026-001
133
+
134
+---
135
+
136
+## Changelog References
137
+
138
+List CHANGELOG.md entries that reference this issue:
139
+- `projects/thunderbolts/CHANGELOG.md`: `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
140
+- `projects/autoNAS/CHANGELOG.md`: self-hosted AutoNAS shutdown now adds explicit `nfs-server.service` ordering... [ISSUE-2026-002]
141
+- `projects/pve-guests-state/CHANGELOG.md`: Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]
+83 -0
issues/TEMPLATE.md
@@ -0,0 +1,83 @@
1
+# Issue Template
2
+
3
+## Issue ID: ISSUE-YYYY-NNN
4
+
5
+**Status:** [open|investigating|in-progress|resolved|closed]  
6
+**Priority:** [low|medium|high|critical]  
7
+**Created:** YYYY-MM-DD  
8
+**Updated:** YYYY-MM-DD  
9
+**Assigned to:** [name or unassigned]
10
+
11
+---
12
+
13
+## Summary
14
+
15
+Brief one-line description of the issue.
16
+
17
+---
18
+
19
+## Description
20
+
21
+Detailed description of the problem, behavior, or feature request.
22
+
23
+---
24
+
25
+## Environment
26
+
27
+- **Affected nodes:** [baobab|ebony|tapia|all]
28
+- **Component:** [network|storage|vm|backup|cluster|other]
29
+- **Version/software:** (e.g., Proxmox version, kernel version, etc.)
30
+
31
+---
32
+
33
+## Steps to Reproduce
34
+
35
+1. Step 1
36
+2. Step 2
37
+3. ...
38
+
39
+---
40
+
41
+## Expected Behavior
42
+
43
+What should happen.
44
+
45
+---
46
+
47
+## Actual Behavior
48
+
49
+What actually happens.
50
+
51
+---
52
+
53
+## Logs/Evidence
54
+
55
+```text
56
+Paste relevant logs, command output, or error messages here.
57
+```
58
+
59
+---
60
+
61
+## Investigation Notes
62
+
63
+- [Date] Note 1
64
+- [Date] Note 2
65
+
66
+---
67
+
68
+## Proposed Solution
69
+
70
+Describe the proposed fix or workaround.
71
+
72
+---
73
+
74
+## Related Issues
75
+
76
+- ISSUE-YYYY-NNN (if any)
77
+
78
+---
79
+
80
+## Changelog References
81
+
82
+List CHANGELOG.md entries that reference this issue:
83
+- CHANGELOG entry: [date] - description
+1 -1
projects/autoNAS
@@ -1 +1 @@
1
-Subproject commit 5bf8614cfafe29b3ec048bdcbf89ce09a65990dc
1
+Subproject commit 574cfb19bfb9d8b91657144354490e40c3a9518c
+1 -1
projects/thunderbolts/CHANGELOG.md
@@ -10,7 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
10 10
 ### Fixed
11 11
 - Invalid `ExecStop` syntax in `tb-enlist@.service` caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001]
12 12
 - Tapia-Baobab Thunderbolt recovery path hardened after reboot-time disconnect/reconnect events [ISSUE-2026-001]
13
-- `tb-enlist@.service` now stays active until `network.target` stops, so NFS storages routed over `thunderbridge` can unmount cleanly before Thunderbolt ports are detached [ISSUE-2026-002]
13
+- `tb-enlist@.service` now stays active until `network.target` stops, so NFS storages routed over `thunderbridge` can unmount cleanly before Thunderbolt ports are detached; this is the Thunderbolt-side fix for the cluster-wide maintenance shutdown incident [ISSUE-2026-002]
14 14
 
15 15
 ### Added
16 16
 - Automatic Thunderbolt recovery worker (`tb-recover.service`) and periodic timer (`tb-recover.timer`) for flap resilience [ISSUE-2026-001]
+1 -1
projects/thunderbolts/README.md
@@ -135,7 +135,7 @@ refreshes the interfaces.
135 135
 - *Slow shutdown with NFS on thunderbridge*: Verify the host has the updated
136 136
   `tb-enlist@.service` with `Before=network.target`; otherwise `thunderbridge`
137 137
   can disappear before Proxmox unmounts NFS storages and shutdown waits on NFS
138
-  timeouts.
138
+  timeouts. Full incident context is tracked in cluster issue `ISSUE-2026-002`.
139 139
 - *MTU mismatch complaints*: The service forces MTU 65520 on both sides; verify the
140 140
   connected devices also support it.
141 141
 
+0 -172
projects/thunderbolts/issues/ISSUE-2026-002.md
@@ -1,183 +0,0 @@
1
-# Issue ISSUE-2026-002: Planned reboot stalls on NFS storages over thunderbridge before network shutdown
2
-
3
-## Issue ID: ISSUE-2026-002
4
-
5
-**Status:** investigating  
6
-**Priority:** high  
7
-**Created:** 2026-03-07  
8
-**Updated:** 2026-03-07  
9
-**Assigned to:** unassigned
10
-
11
-
12
-## Summary
13
-
14
-Planned node reboot on `baobab` spent ~106 seconds in shutdown because Proxmox NFS storages were still mounted after Thunderbolt transport had already been detached from `thunderbridge`.
15
-
16
-
17
-## Description
18
-
19
-During a controlled reboot validation on `baobab`, guest suspend worked correctly, but the host remained reachable over ICMP for almost two minutes after `systemctl reboot`. Journal analysis showed that the Thunderbolt bridge ports were detached early in shutdown, while Proxmox only attempted to unmount NFS storages later. Because `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, the NFS unmount path lost transport and waited for timeout.
20
-
21
-The same investigation exposed a second maintenance risk in `pgs`: preflight cleanup could block in kernel I/O wait when it touched remote NFS-backed storages that were stale or temporarily unavailable. That does not create the slow reboot itself, but it can block the maintenance preparation step.
22
-
23
-Follow-up validation on `ebony` showed a different but related cluster behavior: `AutoNAS-1` is currently exported by `ebony` itself. During reboot, `autonas.service` stops early, which makes the node's own Proxmox NFS client mount for `AutoNAS-1` stale and it then waits for timeout during unmount. In the same window, VM `301 is-anjohibe` (PBS `anjothibe`) is intentionally suspended by `pgs`, so PBS availability loss is expected during the maintenance window.
24
-
25
-Validation on `tapia` initially showed the same class of topology problem for `AutoNAS-2`, which is locally exported there and mounted back as a Proxmox NFS storage. The first AutoNAS shutdown-ordering patch remained active, but reboot timing still stayed near the pre-fix range because `mnt-pve-AutoNAS-2.mount` waited for timeout during shutdown while PBS `andrafiabe-AutoNAS` had already become unreachable.
26
-
27
-Follow-up work in the `autoNAS` project added an explicit `nfs-server.service` drop-in for self-hosted Proxmox NFS mounts discovered from `storage.cfg`. After that second patch, `tapia` reboot timing dropped into the same range as `ebony`, confirming that the remaining blocker was provider ordering on `nfs-server.service`, not on `autonas.service`.
28
-
29
-
30
-## Environment
31
-
32
-- **Affected nodes:** `baobab` confirmed, likely all nodes using Proxmox NFS storages over `thunderbridge`
33
-- **Component:** network + storage + maintenance workflow
34
-- **Version/software:** Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `pgs`
35
-
36
-
37
-## Steps to Reproduce
38
-
39
-1. On a node with Proxmox NFS storages routed over `thunderbridge`, run `/usr/local/sbin/pgs suspend -v`.
40
-2. Trigger `systemctl reboot`.
41
-3. Measure ICMP availability during shutdown and boot.
42
-4. Inspect `journalctl -b -1` around the reboot window.
43
-
44
-
45
-## Expected Behavior
46
-
47
-- NFS storages should unmount while Thunderbolt transport is still available.
48
-- Host should stop replying to ICMP shortly after reboot is requested.
49
-- `pgs suspend` should not hang because a remote NFS mount is stale.
50
-
51
-
52
-## Actual Behavior
53
-
54
-- First validation on `baobab`:
55
-  - `TIME_TO_STOP_SECONDS 105.852`
56
-  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
57
-  - `DOWNTIME_SECONDS 24.377`
58
-- Follow-up validation on `ebony`:
59
-  - `TIME_TO_STOP_SECONDS 120.275`
60
-  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
61
-  - `DOWNTIME_SECONDS 25.565`
62
-- Follow-up validation on `tapia` after cluster-wide AutoNAS rollout:
63
-  - `TIME_TO_STOP_SECONDS 123.285`
64
-  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
65
-  - `DOWNTIME_SECONDS 26.135`
66
-- Revalidation on `tapia` after explicit `nfs-server.service` self-hosted ordering fix:
67
-  - `TIME_TO_STOP_SECONDS 28.305`
68
-  - `TIME_TO_FIRST_REPLY_SECONDS 53.588`
69
-  - `DOWNTIME_SECONDS 25.283`
70
-- `journalctl -b -1` showed:
71
-  - Thunderbolt bridge ports detached at `08:48:17.989`
72
-  - NFS unmount only started at `08:48:30.540`
73
-  - `mnt-pve-AutoNAS-1.mount` and `mnt-pve-AutoNAS-2.mount` timed out at `08:50:00.604/0.605`
74
-- `journalctl -b -1` on `ebony` showed:
75
-  - `autonas.service` stopped at `11:04:22.326`
76
-  - `mnt-pve-AutoNAS-2.mount` unmounted successfully by `11:04:38.693`
77
-  - `mnt-pve-AutoNAS-1.mount` timed out at `11:06:08.679`
78
-  - only after that did `network.target` stop and `tb-enlist@thunderbolt0.service` detach from `thunderbridge`
79
-- A later maintenance attempt also showed `pgs suspend` blocked in `nfs4_proc_getattr` while scanning storage paths.
80
-
81
-
82
-## Logs/Evidence
83
-
84
-```text
85
-Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
86
-Mar 07 08:48:17.993120 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt1 was detached
87
-Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
88
-Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2...
89
-Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
90
-Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.
91
-```
92
-
93
-Blocked `pgs` stack during stale-NFS preflight:
94
-
95
-```text
96
-[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
97
-[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
98
-[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
99
-[<0>] __do_sys_newfstatat+0x43/0x90
100
-```
101
-
102
-Validated timing after fixes on `baobab`:
103
-
104
-```text
105
-TIME_TO_STOP_SECONDS 14.599
106
-TIME_TO_FIRST_REPLY_SECONDS 35.651
107
-DOWNTIME_SECONDS 21.053
108
-```
109
-
110
-
111
-## Investigation Notes
112
-
113
-- 2026-03-07: Confirmed `AutoNAS-1` and `AutoNAS-2` on `baobab` are Proxmox NFS storages mounted from `192.168.10.21` and `192.168.10.22` over `thunderbridge`.
114
-- 2026-03-07: First reboot validation on `baobab` showed shutdown delay dominated by NFS unmount timeout, not by boot.
115
-- 2026-03-07: `tb-enlist@.service` had no ordering against `network.target`; systemd stopped Thunderbolt bridge membership before Proxmox unmounted remote storages.
116
-- 2026-03-07: Patched shared `tb-enlist@.service` with `Before=network.target` and deployed to `baobab`, then cluster-wide.
117
-- 2026-03-07: Separate maintenance attempt showed `pgs suspend` can block in `nfs4_proc_getattr` while scanning storage paths on stale remote NFS mounts.
118
-- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote storages such as NFS are skipped intentionally.
119
-- 2026-03-07: Revalidated on `baobab` after both fixes:
120
-  - NFS unmount started at `10:48:12.354/10:48:12.356`
121
-  - both NFS mounts unmounted successfully by `10:48:12.460`
122
-  - `network.target` stopped later at `10:48:16.152`
123
-  - ICMP loss dropped from ~106s to ~15s after reboot command
124
-- 2026-03-07: `pgs resume` completed successfully after reboot on `baobab`; state file survived boot and all 4 VMs + 1 CT were restored.
125
-- 2026-03-07: Validated `ebony` with current `pgs` and cluster-wide `thunderbolts` rollout. `pgs suspend` / `resume` succeeded for VMs `101`, `102`, `301`; state file survived reboot and restore completed.
126
-- 2026-03-07: `ebony` still showed long shutdown because `AutoNAS-1` is currently provided by `ebony` itself through `autonas`. Stopping `autonas.service` made the node's own NFS client mount stale and `mnt-pve-AutoNAS-1.mount` waited for timeout.
127
-- 2026-03-07: On `ebony`, PBS `anjothibe` availability loss during maintenance is expected because VM `301 is-anjohibe` is intentionally suspended by `pgs`, and its datastore dependency is also on `AutoNAS-1`.
128
-- 2026-03-07: Implemented AutoNAS shutdown-ordering experiment on `ebony`: `autonas.service` and `autonas-boot-scan.service` now declare `Before=remote-fs.target` and `Before=umount.target`.
129
-- 2026-03-07: Revalidated `ebony` after AutoNAS patch:
130
-  - previous timing: `TIME_TO_STOP_SECONDS 120.275`, `TIME_TO_FIRST_REPLY_SECONDS 145.840`
131
-  - new timing: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
132
-  - `mnt-pve-AutoNAS-2.mount` still unmounted cleanly
133
-  - `AutoNAS-1` no longer waited for the old 90s timeout, though a brief `Stale file handle` was still observed before the provider side stopped
134
-- 2026-03-07: Residual issue on `ebony`: even with later provider shutdown, `pvestatd` briefly logged `storage 'AutoNAS-1' is not online` / `Stale file handle` during the maintenance window, so the self-hosted NFS topology remains fragile but no longer dominates shutdown time.
135
-- 2026-03-07: Deployed the same AutoNAS ordering patch cluster-wide and revalidated `tapia`.
136
-- 2026-03-07: `pgs suspend` / reboot / `pgs resume` succeeded on `tapia` for VMs `104`, `107`, `113`, `302`; state file survived reboot and all four guests were restored.
137
-- 2026-03-07: `tapia` still showed slow shutdown after the AutoNAS patch:
138
-  - `TIME_TO_STOP_SECONDS 123.285`, `TIME_TO_FIRST_REPLY_SECONDS 149.420`
139
-  - `mnt-pve-AutoNAS-1.mount` unmounted immediately at `11:45:01.827`
140
-  - `autonas.service` and `nfs-server.service` stopped around `11:45:01.689/11:45:01.900`
141
-  - `mnt-pve-AutoNAS-2.mount` then waited until timeout at `11:46:31.778`
142
-  - `network.target` stopped only after that, at `11:46:31.781`
143
-- 2026-03-07: On `tapia`, the remaining delay is concentrated on self-hosted `AutoNAS-2` (`server 192.168.10.22`) plus expected maintenance-window loss of PBS `andrafiabe-AutoNAS` (`192.168.10.96`).
144
-- 2026-03-07: Implemented a second-generation AutoNAS fix that generates `/etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf` from `storage.cfg`, adding `Before=` ordering from `nfs-server.service` to the matching self-hosted Proxmox mount units.
145
-- 2026-03-07: Revalidated `tapia` after the `nfs-server.service` ordering fix:
146
-  - previous timing after first AutoNAS patch: `TIME_TO_STOP_SECONDS 123.285`, `TIME_TO_FIRST_REPLY_SECONDS 149.420`
147
-  - new timing: `TIME_TO_STOP_SECONDS 28.305`, `TIME_TO_FIRST_REPLY_SECONDS 53.588`
148
-  - `nfs-server.service` stopped at `12:07:42.157`, `network.target` stopped later at `12:07:47.230`
149
-  - the old ~90s timeout on `mnt-pve-AutoNAS-2.mount` no longer dominated shutdown
150
-  - `pgs suspend` / reboot / `pgs resume` completed successfully for VMs `104`, `107`, `113`, `302`
151
-
152
-
153
-## Proposed Solution
154
-
155
-1. Keep Thunderbolt enlist units ordered before `network.target` so storage traffic over `thunderbridge` remains alive until remote filesystems are unmounted.
156
-2. Keep `pgs` cleanup path limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
157
-3. Do not mount a node's own AutoNAS export back onto the same node as a Proxmox NFS storage; on `ebony`, exclude `AutoNAS-1` from local use or replace that local dependency with a direct/local storage path.
158
-4. Review colocated service dependencies before planned reboot, especially when the node provides the storage it also consumes (for example `autonas` and PBS on `ebony`).
159
-5. Keep the generated `nfs-server.service` self-hosted ordering drop-in as the cluster fix for nodes that export AutoNAS locally and also consume the same export back through Proxmox NFS.
160
-6. Validate the same shutdown path on the remaining nodes after storage-role cleanup and after the `nfs-server.service` ordering fix is deployed.
161
-
162
-
163
-## Related Issues
164
-
165
-- ISSUE-2026-001
166
-
167
-
168
-## Changelog References
169
-
170
-List CHANGELOG.md entries that reference this issue:
171
-- `projects/thunderbolts/CHANGELOG.md`: [Unreleased] - `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
172
-- `projects/pve-guests-state/CHANGELOG.md`: [1.5] - Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]