@@ -1 +1 @@ |
||
| 1 |
-Subproject commit d426b0effcb2e2195b7c6742718037862bd15767 |
|
| 1 |
+Subproject commit 9443424f399d74cacc9a7f8888c038aa23bd9713 |
|
@@ -12,6 +12,7 @@ |
||
| 12 | 12 |
- Cleanup explicitly ignores `vm-*-state-cp*.raw` checkpoint files and only targets `vm-*-state-suspend-YYYY-MM-DD.raw` |
| 13 | 13 |
- Repeated `pgs suspend` runs now merge with the existing state file instead of discarding prior `to_resume` intent |
| 14 | 14 |
- State now records `vm_details.suspend_volume` and `vm_details.suspend_file_date`, and `resume` skips auto-restore when a VM's suspend artifact changed after the state was saved |
| 15 |
+- Suspend-artifact cleanup now scans only local `dir` storages; remote storages such as NFS are skipped so planned maintenance cannot block in kernel I/O wait on a stale mount [ISSUE-2026-002] |
|
| 15 | 16 |
|
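The merge and change-detection entries above can be sketched roughly as follows. This is an illustration only, not the `pgs` implementation: the dictionary layout, the helper names `merge_state` and `artifact_unchanged`, and the sample values are assumptions. Only the field names `to_resume`, `vm_details`, `suspend_volume` and `suspend_file_date` come from the changelog.

```python
def merge_state(existing: dict, new: dict) -> dict:
    """Merge a new suspend run into an existing state (hypothetical layout).

    Guests already marked for resume are kept; newer per-VM details win.
    """
    combined = existing.get("to_resume", []) + new.get("to_resume", [])
    # dict.fromkeys drops duplicates while preserving first-seen order.
    to_resume = list(dict.fromkeys(combined))
    vm_details = {**existing.get("vm_details", {}), **new.get("vm_details", {})}
    return {"to_resume": to_resume, "vm_details": vm_details}


def artifact_unchanged(saved: dict, current_volume: str, current_file_date: str) -> bool:
    """True when the suspend artifact still matches what the state recorded."""
    return (
        saved.get("suspend_volume") == current_volume
        and saved.get("suspend_file_date") == current_file_date
    )


# Hypothetical sample data: two consecutive `pgs suspend` runs.
first_run = {
    "to_resume": ["101", "102"],
    "vm_details": {
        "101": {
            "suspend_volume": "local:101/vm-101-state-suspend-2026-03-07.raw",
            "suspend_file_date": "2026-03-07",
        }
    },
}
second_run = {"to_resume": ["102", "301"], "vm_details": {}}

merged = merge_state(first_run, second_run)
```

With this shape, a second run adds guest `301` without dropping `101`, and a resume step could skip auto-restore for any VM where `artifact_unchanged` returns `False`.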
| 16 | 17 |
## [1.4] - 2026-03-06 |
| 17 | 18 |
|
@@ -92,6 +93,7 @@ Tested on: |
||
| 92 | 93 |
- Mixed VM configurations (4GB-16GB RAM) |
| 93 | 94 |
- LXC containers with running services |
| 94 | 95 |
- Storage: local-dir, NFS mount points |
| 96 |
+- Planned reboot validation on `baobab`: shutdown-to-ICMP-loss improved from ~106s to ~15s once NFS over Thunderbolt stopped losing transport before unmount |
|
| 95 | 97 |
|
| 96 | 98 |
## Future Enhancements |
| 97 | 99 |
|
@@ -19,6 +19,7 @@ Systemd automation for shutdown and boot was intentionally abandoned |
||
| 19 | 19 |
- cleanup for orphan `vm-*-state-suspend-YYYY-MM-DD.raw` volumes |
| 20 | 20 |
- retry for certain quorum-related errors |
| 21 | 21 |
- dry-run for verification without side effects |
| 22 |
+- preflight cleanup limited to local `dir` storages, so that a stale remote NFS mount cannot block `pgs suspend` |
|
| 22 | 23 |
|
| 23 | 24 |
## Project layout |
| 24 | 25 |
|
@@ -90,6 +91,7 @@ sudo /usr/local/lib/xdev/pve-guests-state/uninstall.sh |
||
| 90 | 91 |
- after a fully successful `resume`, the state file is deleted |
| 91 | 92 |
- if `resume` encounters errors, the state file is kept for retry |
| 92 | 93 |
- `cleanup` and the `suspend` preflight touch only `vm-*-state-suspend-YYYY-MM-DD.raw` files; `vm-*-state-cp*.raw` files and other variants are left untouched |
| 94 |
+- `cleanup` and the `suspend` preflight scan only local storages of type `dir`; remote storages (for example NFS) are skipped intentionally so that a stale remote mount cannot block maintenance |
|
| 93 | 95 |
- a new `suspend` over an existing state file merges with it instead of resetting the list of guests to restore |
| 94 | 96 |
- the state file also records `suspend_volume`/`suspend_file_date` per VM to detect guests altered after the state was saved |
| 95 | 97 |
|
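The filename rule in the list above (touch only `vm-*-state-suspend-YYYY-MM-DD.raw`, leave `vm-*-state-cp*.raw` and other variants alone) could be expressed as a strict pattern match. A minimal sketch, assuming the `*` in the glob is a numeric VMID and that the date must be a real calendar date; the function name `is_suspend_artifact` is hypothetical.

```python
import re
from datetime import date

# Assumption: the wildcard after "vm-" is a numeric VMID.
SUSPEND_ARTIFACT_RE = re.compile(
    r"^vm-(?P<vmid>\d+)-state-suspend-(?P<date>\d{4}-\d{2}-\d{2})\.raw$"
)


def is_suspend_artifact(filename: str) -> bool:
    """Return True only for dated state-suspend volumes."""
    match = SUSPEND_ARTIFACT_RE.match(filename)
    if match is None:
        return False
    try:
        # Reject strings that look like a date but are not one (e.g. month 13).
        date.fromisoformat(match["date"])
    except ValueError:
        return False
    return True
```

Anchoring the pattern at both ends is what keeps checkpoint files such as `vm-101-state-cp1.raw` out of scope, since they never match the `state-suspend-<date>` suffix.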
@@ -52,9 +52,10 @@ The state file contains: |
||
| 52 | 52 |
|
| 53 | 53 |
### Cleanup |
| 54 | 54 |
|
| 55 |
-- scans the storages with `content images` defined in `/etc/pve/storage.cfg` |
|
| 55 |
+- scans only local storages of type `dir` with `content images` defined in `/etc/pve/storage.cfg` |
|
| 56 | 56 |
- looks exclusively for `vm-*-state-suspend-YYYY-MM-DD.raw` files |
| 57 | 57 |
- ignores files of the form `vm-*-state-cp*.raw` |
| 58 |
+- remote storages such as NFS are skipped intentionally, because a stale mount can block the process in kernel I/O wait right before maintenance |
|
| 58 | 59 |
- if a `state-suspend` volume is referenced by a valid suspended VM, it is kept |
| 59 | 60 |
- if a `state-suspend` volume is referenced but the VM no longer has a valid suspend state, `lock`, `vmstate` and the volume are cleaned up |
| 60 | 61 |
- if a `state-suspend` volume is no longer referenced by any VM, it is treated as an orphan and deleted
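
The storage selection described in this list could look roughly like the sketch below. It is not the `pgs` code: the parser, the helper names `parse_storage_cfg` and `local_image_storages`, and the sample configuration (including the export path) are assumptions made for illustration. It assumes the usual `storage.cfg` shape of a `type: name` header followed by indented `key value` lines.

```python
import re

# Section header such as "dir: local" or "nfs: AutoNAS-1".
HEADER_RE = re.compile(r"^(?P<type>[A-Za-z0-9_-]+):\s*(?P<name>\S+)\s*$")


def parse_storage_cfg(text: str) -> list[dict]:
    """Parse storage.cfg-style text into one dict per storage section."""
    storages: list[dict] = []
    current = None
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue
        header = HEADER_RE.match(raw)
        if header:
            current = {"type": header["type"], "name": header["name"]}
            storages.append(current)
        elif current is not None:
            parts = raw.split(None, 1)
            current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return storages


def local_image_storages(storages: list[dict]) -> list[tuple[str, str]]:
    """Keep only `dir` storages that list `images` in their content."""
    selected = []
    for storage in storages:
        content = [item.strip() for item in storage.get("content", "").split(",")]
        if storage["type"] == "dir" and "images" in content:
            selected.append((storage["name"], storage.get("path", "")))
    return selected


# Illustrative configuration; the NFS export path is made up.
SAMPLE_CFG = """\
dir: local
\tpath /var/lib/vz
\tcontent iso,vztmpl,backup,images

nfs: AutoNAS-1
\texport /export/example
\tpath /mnt/pve/AutoNAS-1
\tserver 192.168.10.21
\tcontent images,rootdir
"""

scan_targets = local_image_storages(parse_storage_cfg(SAMPLE_CFG))
```

Filtering on the storage type before touching any path is the point of the change: the NFS entry is dropped from `scan_targets` without a single filesystem call against `/mnt/pve/AutoNAS-1`, so a stale mount is never stat-ed.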
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 |
||
| 10 | 10 |
### Fixed |
| 11 | 11 |
- Invalid `ExecStop` syntax in `tb-enlist@.service` caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001] |
| 12 | 12 |
- Tapia-Baobab Thunderbolt recovery path hardened after reboot-time disconnect/reconnect events [ISSUE-2026-001] |
| 13 |
+- `tb-enlist@.service` now stays active until `network.target` stops, so NFS storages routed over `thunderbridge` can unmount cleanly before Thunderbolt ports are detached [ISSUE-2026-002] |
|
| 13 | 14 |
|
| 14 | 15 |
### Added |
| 15 | 16 |
- Automatic Thunderbolt recovery worker (`tb-recover.service`) and periodic timer (`tb-recover.timer`) for flap resilience [ISSUE-2026-001] |
@@ -101,7 +101,8 @@ interface names, then extend both helper functions so the script can locate it. |
||
| 101 | 101 |
Linux bridge, sets MTU 65520, and brings it up during early boot. |
| 102 | 102 |
- `tb-enlist@.service` attaches Thunderbolt NIC instances to the bridge, aligning |
| 103 | 103 |
their MTU and keeping them hotplug friendly; systemd stops the unit cleanly on |
| 104 |
- device removal. |
|
| 104 |
+ device removal. Because the unit declares `Before=network.target`, it stops only after |
|
| 105 |
+ `network.target` during shutdown, so remote filesystems over `thunderbridge` can unmount before the ports detach. |
|
| 105 | 106 |
- `90-thunderbolt-net-systemd.rules` tags `thunderbolt*` NICs so udev starts the |
| 106 | 107 |
enlist service automatically. |
| 107 | 108 |
|
@@ -131,6 +132,10 @@ refreshes the interfaces. |
||
| 131 | 132 |
`systemctl enable --now tb-bridge.service` on the host. |
| 132 | 133 |
- *NICs not joining*: Check `journalctl -u tb-enlist@thunderbolt0` for logs and make |
| 133 | 134 |
sure the udev rule is present under `/etc/udev/rules.d`. |
| 135 |
+- *Slow shutdown with NFS on thunderbridge*: Verify the host has the updated |
|
| 136 |
+ `tb-enlist@.service` with `Before=network.target`; otherwise `thunderbridge` |
|
| 137 |
+ can disappear before Proxmox unmounts NFS storages and shutdown waits on NFS |
|
| 138 |
+ timeouts. |
|
| 134 | 139 |
- *MTU mismatch complaints*: The service forces MTU 65520 on both sides; verify the |
| 135 | 140 |
connected devices also support it. |
| 136 | 141 |
|
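The ordering argument behind `Before=network.target` can be checked with a small model: systemd applies the inverse of the start order when stopping units. The sketch below is only a model of that rule, not a query against systemd. It assumes the NFS mount unit is ordered after `network.target` (the default for network mounts), and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Maps each unit to the units that must start before it.
# "tb-enlist@thunderbolt0.service" declares Before=network.target;
# the NFS mount being ordered After=network.target is an assumption
# based on systemd's default handling of network mounts.
STARTS_AFTER = {
    "network.target": {"tb-enlist@thunderbolt0.service"},
    "mnt-pve-AutoNAS-1.mount": {"network.target"},
}


def start_order(starts_after: dict) -> list[str]:
    """Units in an order that respects every Before=/After= edge."""
    return list(TopologicalSorter(starts_after).static_order())


def stop_order(starts_after: dict) -> list[str]:
    """Shutdown uses the inverse of the start-up order."""
    return list(reversed(start_order(starts_after)))


shutdown_sequence = stop_order(STARTS_AFTER)
```

Under these assumptions the mount unit stops first, `network.target` second, and the enlist unit last, which matches the sequence reported after the fix: NFS unmounts complete, then `network.target` stops, then the Thunderbolt ports detach.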
@@ -0,0 +1,170 @@ |
||
| 1 |
+# Issue ISSUE-2026-002: Planned reboot stalls on NFS storages over thunderbridge before network shutdown |
|
| 2 |
+ |
|
| 3 |
+## Issue ID: ISSUE-2026-002 |
|
| 4 |
+ |
|
| 5 |
+**Status:** investigating |
|
| 6 |
+**Priority:** high |
|
| 7 |
+**Created:** 2026-03-07 |
|
| 8 |
+**Updated:** 2026-03-07 |
|
| 9 |
+**Assigned to:** unassigned |
|
| 10 |
+ |
|
| 11 |
+--- |
|
| 12 |
+ |
|
| 13 |
+## Summary |
|
| 14 |
+ |
|
| 15 |
+Planned node reboot on `baobab` spent ~106 seconds in shutdown because Proxmox NFS storages were still mounted after Thunderbolt transport had already been detached from `thunderbridge`. |
|
| 16 |
+ |
|
| 17 |
+--- |
|
| 18 |
+ |
|
| 19 |
+## Description |
|
| 20 |
+ |
|
| 21 |
+During a controlled reboot validation on `baobab`, guest suspend worked correctly, but the host remained reachable over ICMP for almost two minutes after `systemctl reboot`. Journal analysis showed that the Thunderbolt bridge ports were detached early in shutdown, while Proxmox only attempted to unmount NFS storages later. Because `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, the NFS unmount path lost transport and waited for timeout. |
|
| 22 |
+ |
|
| 23 |
+The same investigation exposed a second maintenance risk in `pgs`: preflight cleanup could block in kernel I/O wait when it touched remote NFS-backed storages that were stale or temporarily unavailable. That does not create the slow reboot itself, but it can block the maintenance preparation step. |
|
| 24 |
+ |
|
| 25 |
+Follow-up validation on `ebony` showed a different but related cluster behavior: `AutoNAS-1` is currently exported by `ebony` itself. During reboot, `autonas.service` stops early, which leaves the node's own Proxmox NFS client mount for `AutoNAS-1` stale; that mount then waits for its timeout during unmount. In the same window, VM `301 is-anjohibe` (PBS `anjothibe`) is intentionally suspended by `pgs`, so PBS availability loss is expected during the maintenance window. |
|
| 26 |
+ |
|
| 27 |
+Validation on `tapia` showed the same class of topology problem for `AutoNAS-2`, which is locally exported there and mounted back as a Proxmox NFS storage. The AutoNAS shutdown-ordering patch remained active, but reboot timing still stayed near the pre-fix range because `mnt-pve-AutoNAS-2.mount` waited for timeout during shutdown while PBS `andrafiabe-AutoNAS` had already become unreachable. |
|
| 28 |
+ |
|
| 29 |
+--- |
|
| 30 |
+ |
|
| 31 |
+## Environment |
|
| 32 |
+ |
|
| 33 |
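+- **Affected nodes:** `baobab` confirmed for the Thunderbolt ordering problem; `ebony` and `tapia` confirmed for the related self-hosted AutoNAS variant; likely all nodes using Proxmox NFS storages over `thunderbridge` |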
|
| 34 |
+- **Component:** network + storage + maintenance workflow |
|
| 35 |
+- **Version/software:** Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `pgs` |
|
| 36 |
+ |
|
| 37 |
+--- |
|
| 38 |
+ |
|
| 39 |
+## Steps to Reproduce |
|
| 40 |
+ |
|
| 41 |
+1. On a node with Proxmox NFS storages routed over `thunderbridge`, run `/usr/local/sbin/pgs suspend -v`. |
|
| 42 |
+2. Trigger `systemctl reboot`. |
|
| 43 |
+3. Measure ICMP availability during shutdown and boot. |
|
| 44 |
+4. Inspect `journalctl -b -1` around the reboot window. |
|
| 45 |
+ |
|
| 46 |
+--- |
|
| 47 |
+ |
|
| 48 |
+## Expected Behavior |
|
| 49 |
+ |
|
| 50 |
+- NFS storages should unmount while Thunderbolt transport is still available. |
|
| 51 |
+- Host should stop replying to ICMP shortly after reboot is requested. |
|
| 52 |
+- `pgs suspend` should not hang because a remote NFS mount is stale. |
|
| 53 |
+ |
|
| 54 |
+--- |
|
| 55 |
+ |
|
| 56 |
+## Actual Behavior |
|
| 57 |
+ |
|
| 58 |
+- First validation on `baobab`: |
|
| 59 |
+ - `TIME_TO_STOP_SECONDS 105.852` |
|
| 60 |
+ - `TIME_TO_FIRST_REPLY_SECONDS 130.230` |
|
| 61 |
+ - `DOWNTIME_SECONDS 24.377` |
|
| 62 |
+- Follow-up validation on `ebony`: |
|
| 63 |
+ - `TIME_TO_STOP_SECONDS 120.275` |
|
| 64 |
+ - `TIME_TO_FIRST_REPLY_SECONDS 145.840` |
|
| 65 |
+ - `DOWNTIME_SECONDS 25.565` |
|
| 66 |
+- Follow-up validation on `tapia` after cluster-wide AutoNAS rollout: |
|
| 67 |
+ - `TIME_TO_STOP_SECONDS 123.285` |
|
| 68 |
+ - `TIME_TO_FIRST_REPLY_SECONDS 149.420` |
|
| 69 |
+ - `DOWNTIME_SECONDS 26.135` |
|
| 70 |
+- `journalctl -b -1` showed: |
|
| 71 |
+ - Thunderbolt bridge ports detached at `08:48:17.989` |
|
| 72 |
+ - NFS unmount only started at `08:48:30.540` |
|
| 73 |
+ - `mnt-pve-AutoNAS-2.mount` and `mnt-pve-AutoNAS-1.mount` timed out at `08:50:00.604` and `08:50:00.605` respectively |
|
| 74 |
+- `journalctl -b -1` on `ebony` showed: |
|
| 75 |
+ - `autonas.service` stopped at `11:04:22.326` |
|
| 76 |
+ - `mnt-pve-AutoNAS-2.mount` unmounted successfully by `11:04:38.693` |
|
| 77 |
+ - `mnt-pve-AutoNAS-1.mount` timed out at `11:06:08.679` |
|
| 78 |
+ - only after that did `network.target` stop and `tb-enlist@thunderbolt0.service` detach from `thunderbridge` |
|
| 79 |
+- A later maintenance attempt also showed `pgs suspend` blocked in `nfs4_proc_getattr` while scanning storage paths. |
|
| 80 |
+ |
|
| 81 |
+--- |
|
| 82 |
+ |
|
| 83 |
+## Logs/Evidence |
|
| 84 |
+ |
|
| 85 |
+```text |
|
| 86 |
+Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached |
|
| 87 |
+Mar 07 08:48:17.993120 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt1 was detached |
|
| 88 |
+Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1... |
|
| 89 |
+Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2... |
|
| 90 |
+Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating. |
|
| 91 |
+Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating. |
|
| 92 |
+``` |
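The two intervals that matter in this excerpt (how long the mounts outlived their transport, and how long the unmount waited) can be derived directly from the timestamps. A small sketch, assuming the journal lines keep the `Mon DD HH:MM:SS.ffffff` prefix shown above and that the year is 2026 as in the issue dates; the helper `parse_ts` is hypothetical.

```python
from datetime import datetime

LOG_EXCERPT = """\
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:17.993120 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt1 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.
"""


def parse_ts(line: str) -> datetime:
    """Parse the leading journal timestamp; the year is supplied explicitly."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime("2026 " + stamp, "%Y %b %d %H:%M:%S.%f")


def first_ts(lines: list[str], needle: str) -> datetime:
    """Timestamp of the first line containing `needle`."""
    return next(parse_ts(line) for line in lines if needle in line)


lines = LOG_EXCERPT.splitlines()
detached = first_ts(lines, "was detached")
unmount_started = first_ts(lines, "Unmounting mnt-")
timed_out = first_ts(lines, "timed out")

transport_gap = (unmount_started - detached).total_seconds()
unmount_wait = (timed_out - unmount_started).total_seconds()
print(f"transport lost {transport_gap:.3f}s before unmount began")
print(f"unmount waited {unmount_wait:.3f}s before systemd gave up")
```

The first interval is about 12.6 seconds and the second is just over 90 seconds, which accounts for most of the ~106 second shutdown.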
|
| 93 |
+ |
|
| 94 |
+Blocked `pgs` stack during stale-NFS preflight: |
|
| 95 |
+ |
|
| 96 |
+```text |
|
| 97 |
+[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc] |
|
| 98 |
+[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4] |
|
| 99 |
+[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs] |
|
| 100 |
+[<0>] __do_sys_newfstatat+0x43/0x90 |
|
| 101 |
+``` |
|
| 102 |
+ |
|
| 103 |
+Validated timing after fixes on `baobab`: |
|
| 104 |
+ |
|
| 105 |
+```text |
|
| 106 |
+TIME_TO_STOP_SECONDS 14.599 |
|
| 107 |
+TIME_TO_FIRST_REPLY_SECONDS 35.651 |
|
| 108 |
+DOWNTIME_SECONDS 21.053 |
|
| 109 |
+``` |
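The three reported metrics appear to be related by `DOWNTIME_SECONDS = TIME_TO_FIRST_REPLY_SECONDS - TIME_TO_STOP_SECONDS`. The measuring script is not part of this change, so that relation is an inference from the numbers, checked here against every run in the report. The run labels and the helper name are illustrative.

```python
# (time_to_stop, time_to_first_reply, reported_downtime), all in seconds,
# copied from the measurements in this issue.
RUNS = {
    "baobab before fixes": (105.852, 130.230, 24.377),
    "ebony before AutoNAS patch": (120.275, 145.840, 25.565),
    "tapia after AutoNAS rollout": (123.285, 149.420, 26.135),
    "baobab after fixes": (14.599, 35.651, 21.053),
}


def downtime_seconds(time_to_stop: float, time_to_first_reply: float) -> float:
    """Window during which the host did not answer ICMP."""
    return time_to_first_reply - time_to_stop


for label, (stop, reply, reported) in RUNS.items():
    computed = downtime_seconds(stop, reply)
    print(f"{label}: computed {computed:.3f}s, reported {reported:.3f}s")

# Shutdown-phase improvement on baobab between the first and last run.
shutdown_gain = RUNS["baobab before fixes"][0] - RUNS["baobab after fixes"][0]
```

Each computed value agrees with the reported one to within a millisecond of rounding. The comparison also shows where the improvement sits: time-to-stop on `baobab` dropped by roughly 91 seconds, while the unreachable window itself stayed in the 21 to 26 second range across all runs.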
|
| 110 |
+ |
|
| 111 |
+--- |
|
| 112 |
+ |
|
| 113 |
+## Investigation Notes |
|
| 114 |
+ |
|
| 115 |
+- 2026-03-07: Confirmed `AutoNAS-1` and `AutoNAS-2` on `baobab` are Proxmox NFS storages mounted from `192.168.10.21` and `192.168.10.22` over `thunderbridge`. |
|
| 116 |
+- 2026-03-07: First reboot validation on `baobab` showed shutdown delay dominated by NFS unmount timeout, not by boot. |
|
| 117 |
+- 2026-03-07: `tb-enlist@.service` had no ordering against `network.target`; systemd stopped Thunderbolt bridge membership before Proxmox unmounted remote storages. |
|
| 118 |
+- 2026-03-07: Patched shared `tb-enlist@.service` with `Before=network.target` and deployed to `baobab`, then cluster-wide. |
|
| 119 |
+- 2026-03-07: Separate maintenance attempt showed `pgs suspend` can block in `nfs4_proc_getattr` while scanning storage paths on stale remote NFS mounts. |
|
| 120 |
+- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote storages such as NFS are skipped intentionally. |
|
| 121 |
+- 2026-03-07: Revalidated on `baobab` after both fixes: |
|
| 122 |
+ - NFS unmounts started at `10:48:12.354` and `10:48:12.356` |
|
| 123 |
+ - both NFS mounts unmounted successfully by `10:48:12.460` |
|
| 124 |
+ - `network.target` stopped later at `10:48:16.152` |
|
| 125 |
+ - ICMP loss dropped from ~106s to ~15s after reboot command |
|
| 126 |
+- 2026-03-07: `pgs resume` completed successfully after reboot on `baobab`; state file survived boot and all 4 VMs + 1 CT were restored. |
|
| 127 |
+- 2026-03-07: Validated `ebony` with current `pgs` and cluster-wide `thunderbolts` rollout. `pgs suspend` / `resume` succeeded for VMs `101`, `102`, `301`; state file survived reboot and restore completed. |
|
| 128 |
+- 2026-03-07: `ebony` still showed long shutdown because `AutoNAS-1` is currently provided by `ebony` itself through `autonas`. Stopping `autonas.service` made the node's own NFS client mount stale and `mnt-pve-AutoNAS-1.mount` waited for timeout. |
|
| 129 |
+- 2026-03-07: On `ebony`, PBS `anjothibe` availability loss during maintenance is expected because VM `301 is-anjohibe` is intentionally suspended by `pgs`, and its datastore dependency is also on `AutoNAS-1`. |
|
| 130 |
+- 2026-03-07: Implemented AutoNAS shutdown-ordering experiment on `ebony`: `autonas.service` and `autonas-boot-scan.service` now declare `Before=remote-fs.target` and `Before=umount.target`. |
|
| 131 |
+- 2026-03-07: Revalidated `ebony` after AutoNAS patch: |
|
| 132 |
+ - previous timing: `TIME_TO_STOP_SECONDS 120.275`, `TIME_TO_FIRST_REPLY_SECONDS 145.840` |
|
| 133 |
+ - new timing: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288` |
|
| 134 |
+ - `mnt-pve-AutoNAS-2.mount` still unmounted cleanly |
|
| 135 |
+ - `AutoNAS-1` no longer waited for the old 90s timeout, though a brief `Stale file handle` was still observed before the provider side stopped |
|
| 136 |
+- 2026-03-07: Residual issue on `ebony`: even with later provider shutdown, `pvestatd` briefly logged `storage 'AutoNAS-1' is not online` / `Stale file handle` during the maintenance window, so the self-hosted NFS topology remains fragile but no longer dominates shutdown time. |
|
| 137 |
+- 2026-03-07: Deployed the same AutoNAS ordering patch cluster-wide and revalidated `tapia`. |
|
| 138 |
+- 2026-03-07: `pgs suspend` / reboot / `pgs resume` succeeded on `tapia` for VMs `104`, `107`, `113`, `302`; state file survived reboot and all four guests were restored. |
|
| 139 |
+- 2026-03-07: `tapia` still showed slow shutdown after the AutoNAS patch: |
|
| 140 |
+ - `TIME_TO_STOP_SECONDS 123.285`, `TIME_TO_FIRST_REPLY_SECONDS 149.420` |
|
| 141 |
+ - `mnt-pve-AutoNAS-1.mount` unmounted immediately at `11:45:01.827` |
|
| 142 |
+ - `autonas.service` and `nfs-server.service` stopped around `11:45:01.689/11:45:01.900` |
|
| 143 |
+ - `mnt-pve-AutoNAS-2.mount` then waited until timeout at `11:46:31.778` |
|
| 144 |
+ - `network.target` stopped only after that, at `11:46:31.781` |
|
| 145 |
+- 2026-03-07: On `tapia`, the remaining delay is concentrated on self-hosted `AutoNAS-2` (`server 192.168.10.22`) plus expected maintenance-window loss of PBS `andrafiabe-AutoNAS` (`192.168.10.96`). |
|
| 146 |
+ |
|
| 147 |
+--- |
|
| 148 |
+ |
|
| 149 |
+## Proposed Solution |
|
| 150 |
+ |
|
| 151 |
+1. Keep Thunderbolt enlist units ordered before `network.target` so storage traffic over `thunderbridge` remains alive until remote filesystems are unmounted. |
|
| 152 |
+2. Keep `pgs` cleanup path limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance. |
|
| 153 |
+3. Do not mount a node's own AutoNAS export back onto the same node as a Proxmox NFS storage; on `ebony`, exclude `AutoNAS-1` from local use or replace that local dependency with a direct/local storage path. |
|
| 154 |
+4. Review colocated service dependencies before planned reboot, especially when the node provides the storage it also consumes (for example `autonas` and PBS on `ebony`). |
|
| 155 |
+5. Apply the same self-hosted-storage review on `tapia`, where `AutoNAS-2` remains the dominant shutdown delay even after the AutoNAS ordering patch. |
|
| 156 |
+6. Validate the same shutdown path on the remaining nodes after storage-role cleanup. |
|
| 157 |
+ |
|
| 158 |
+--- |
|
| 159 |
+ |
|
| 160 |
+## Related Issues |
|
| 161 |
+ |
|
| 162 |
+- ISSUE-2026-001 |
|
| 163 |
+ |
|
| 164 |
+--- |
|
| 165 |
+ |
|
| 166 |
+## Changelog References |
|
| 167 |
+ |
|
| 168 |
+List CHANGELOG.md entries that reference this issue: |
|
| 169 |
+- `projects/thunderbolts/CHANGELOG.md`: [Unreleased] - `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002] |
|
| 170 |
+- `projects/pve-guests-state/CHANGELOG.md`: [1.5] - Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002] |
|