# Issue ISSUE-2026-002: Planned reboot stalls on NFS storages over thunderbridge before network shutdown

## Issue ID: ISSUE-2026-002

Status: investigating
Priority: high
Created: 2026-03-07
Updated: 2026-03-07
Assigned to: unassigned

---

## Summary

A planned node reboot on `baobab` spent ~106 seconds in shutdown because Proxmox NFS storages were still mounted after the Thunderbolt transport had already been detached from `thunderbridge`.

---

## Description

During a controlled reboot validation on `baobab`, guest suspend worked correctly, but the host remained reachable over ICMP for almost two minutes after `systemctl reboot`. Journal analysis showed that the Thunderbolt bridge ports were detached early in shutdown, while Proxmox only attempted to unmount the NFS storages later. Because `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, the NFS unmount path lost its transport and waited for the unmount timeout (about 90 seconds in the journal).

The same investigation exposed a second maintenance risk in `pgs`: preflight cleanup could block in kernel I/O wait when it touched remote NFS-backed storages that were stale or temporarily unavailable. This does not cause the slow reboot itself, but it can block the maintenance preparation step.

Follow-up validation on `ebony` showed a different but related cluster behavior: `AutoNAS-1` is currently exported by `ebony` itself. During reboot, `autonas.service` stops early, which makes the node's own Proxmox NFS client mount for `AutoNAS-1` stale; the unmount then waits for the timeout. In the same window, VM `301 is-anjohibe` (PBS `anjothibe`) is intentionally suspended by `pgs`, so PBS availability loss is expected during the maintenance window.

Validation on `tapia` showed the same class of topology problem for `AutoNAS-2`, which is exported locally there and mounted back as a Proxmox NFS storage. The AutoNAS shutdown-ordering patch remained active, but reboot timing stayed near the pre-fix range because `mnt-pve-AutoNAS-2.mount` waited for the timeout during shutdown while PBS `andrafiabe-AutoNAS` had already become unreachable.

---

## Environment

- Affected nodes: `baobab` confirmed for the transport-ordering stall; `ebony` and `tapia` confirmed for the related self-hosted AutoNAS stall; likely all nodes using Proxmox NFS storages over `thunderbridge`
- Component: network + storage + maintenance workflow
- Version/software: Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `pgs`

---

## Steps to Reproduce

1. On a node with Proxmox NFS storages routed over `thunderbridge`, run `/usr/local/sbin/pgs suspend -v`.
2. Trigger `systemctl reboot`.
3. Measure ICMP availability during shutdown and boot.
4. Inspect `journalctl -b -1` around the reboot window.
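
The measurement script used for step 3 is not part of this issue, so the following is only a minimal sketch of how the three reported metrics could be derived from ping samples. The function name `reboot_timings` and the sample format are assumptions; the definition `DOWNTIME_SECONDS = TIME_TO_FIRST_REPLY_SECONDS - TIME_TO_STOP_SECONDS` is inferred from the reported values, which agree with it to within rounding.

```python
from typing import Iterable, Optional, Tuple


def reboot_timings(samples: Iterable[Tuple[float, bool]]) -> Optional[dict]:
    """Derive reboot timings from (seconds_since_reboot_command, got_reply) samples.

    Hypothetical helper: returns None if the host never stopped replying
    or never came back within the sampled window.
    """
    time_to_stop = None          # first missed reply after the reboot command
    time_to_first_reply = None   # first reply seen after the host went silent
    for t, got_reply in samples:
        if time_to_stop is None:
            if not got_reply:
                time_to_stop = t
        elif got_reply:
            time_to_first_reply = t
            break
    if time_to_stop is None or time_to_first_reply is None:
        return None
    return {
        "TIME_TO_STOP_SECONDS": time_to_stop,
        "TIME_TO_FIRST_REPLY_SECONDS": time_to_first_reply,
        "DOWNTIME_SECONDS": time_to_first_reply - time_to_stop,
    }


# Synthetic samples, one probe per second: replies stop at t=2 and return at t=4.
samples = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (4.0, True)]
timings = reboot_timings(samples)
```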

---

## Expected Behavior

- NFS storages should unmount while Thunderbolt transport is still available.
- Host should stop replying to ICMP shortly after reboot is requested.
- `pgs suspend` should not hang because a remote NFS mount is stale.

---

## Actual Behavior

- First validation on `baobab`:
  - `TIME_TO_STOP_SECONDS 105.852`
  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
  - `DOWNTIME_SECONDS 24.377`
- Follow-up validation on `ebony`:
  - `TIME_TO_STOP_SECONDS 120.275`
  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
  - `DOWNTIME_SECONDS 25.565`
- Follow-up validation on `tapia` after cluster-wide AutoNAS rollout:
  - `TIME_TO_STOP_SECONDS 123.285`
  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
  - `DOWNTIME_SECONDS 26.135`
- `journalctl -b -1` on `baobab` showed:
  - Thunderbolt bridge ports detached at `08:48:17.989`
  - NFS unmount only started at `08:48:30.540`
  - `mnt-pve-AutoNAS-2.mount` and `mnt-pve-AutoNAS-1.mount` timed out at `08:50:00.604` and `08:50:00.605`
- `journalctl -b -1` on `ebony` showed:
  - `autonas.service` stopped at `11:04:22.326`
  - `mnt-pve-AutoNAS-2.mount` unmounted successfully by `11:04:38.693`
  - `mnt-pve-AutoNAS-1.mount` timed out at `11:06:08.679`
  - only after that did `network.target` stop and `tb-enlist@thunderbolt0.service` detach from `thunderbridge`
- A later maintenance attempt also showed `pgs suspend` blocked in `nfs4_proc_getattr` while scanning storage paths.

---

## Logs/Evidence

```text
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:17.993120 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt1 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:48:30.541335 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-2.mount - /mnt/pve/AutoNAS-2...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.
```
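
The two intervals that matter in this excerpt can be computed directly from the timestamps. The sketch below uses three of the journal lines above and assumes all entries fall on the same day; the helper name `first_time` is illustrative only. It yields roughly 12.6 s between bridge detach and the start of the unmount, and roughly 90.1 s of unmount wait for `AutoNAS-1`.

```python
import re
from datetime import datetime

# Three lines copied from the baobab journal excerpt.
JOURNAL_LINES = """\
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:50:00.605215 baobab systemd[1]: mnt-pve-AutoNAS-1.mount: Unmounting timed out. Terminating.
""".splitlines()

# Time of day with microseconds, as printed in the excerpt.
TIME_RE = re.compile(r"\b(\d{2}:\d{2}:\d{2}\.\d{6})\b")


def first_time(lines, needle):
    """Return the time of day of the first line containing `needle`."""
    for line in lines:
        if needle in line:
            stamp = TIME_RE.search(line).group(1)
            return datetime.strptime(stamp, "%H:%M:%S.%f")
    raise ValueError(f"no journal line contains {needle!r}")


detached = first_time(JOURNAL_LINES, "was detached")
unmount_start = first_time(JOURNAL_LINES, "Unmounting mnt-pve-AutoNAS-1.mount")
timed_out = first_time(JOURNAL_LINES, "Unmounting timed out")

# Transport was gone this long before systemd even tried to unmount.
transport_gap = (unmount_start - detached).total_seconds()
# The unmount job then waited this long before systemd gave up.
unmount_wait = (timed_out - unmount_start).total_seconds()

print(f"transport_gap={transport_gap:.3f}s unmount_wait={unmount_wait:.3f}s")
```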

Blocked `pgs` stack during stale-NFS preflight:

```text
[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
[<0>] __do_sys_newfstatat+0x43/0x90
```
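
The trace ends in `rpc_wait_bit_killable`, which suggests the blocked `stat` call can still be interrupted by a fatal signal. Under that assumption, one way to probe a possibly stale mount without hanging the caller is to run the `stat` in a child process with a timeout. This is only an illustrative sketch with a hypothetical helper name, not what `pgs` does; the fix recorded in this issue skips remote storages entirely.

```python
import subprocess


def path_responds(path: str, timeout_s: float = 2.0) -> bool:
    """Return True if `stat` on `path` finishes within `timeout_s` seconds.

    A missing path still counts as "responding": only a stat call that does
    not return in time (for example on a stale NFS mount) yields False.
    """
    try:
        subprocess.run(
            ["stat", "--", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,   # on expiry the child is killed, then reaped
            check=False,         # a non-zero exit status is not an error here
        )
    except subprocess.TimeoutExpired:
        return False
    return True


# The root filesystem is local, so this probe should return promptly.
root_ok = path_responds("/")
```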

Validated timing after fixes on `baobab`:

```text
TIME_TO_STOP_SECONDS 14.599
TIME_TO_FIRST_REPLY_SECONDS 35.651
DOWNTIME_SECONDS 21.053
```

---

## Investigation Notes

- 2026-03-07: Confirmed `AutoNAS-1` and `AutoNAS-2` on `baobab` are Proxmox NFS storages mounted from `192.168.10.21` and `192.168.10.22` over `thunderbridge`.
- 2026-03-07: First reboot validation on `baobab` showed shutdown delay dominated by NFS unmount timeout, not by boot.
- 2026-03-07: `tb-enlist@.service` had no ordering against `network.target`; systemd stopped Thunderbolt bridge membership before Proxmox unmounted remote storages.
- 2026-03-07: Patched shared `tb-enlist@.service` with `Before=network.target` and deployed to `baobab`, then cluster-wide. Because systemd stops units in the reverse of their start order, the enlist units now stop only after `network.target`.
- 2026-03-07: Separate maintenance attempt showed `pgs suspend` can block in `nfs4_proc_getattr` while scanning storage paths on stale remote NFS mounts.
- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote storages such as NFS are skipped intentionally.
- 2026-03-07: Revalidated on `baobab` after both fixes:
  - NFS unmount started at `10:48:12.354/10:48:12.356`
  - both NFS mounts unmounted successfully by `10:48:12.460`
  - `network.target` stopped later at `10:48:16.152`
  - time until ICMP loss after the reboot command dropped from ~106s to ~15s
- 2026-03-07: `pgs resume` completed successfully after reboot on `baobab`; state file survived boot and all 4 VMs + 1 CT were restored.
- 2026-03-07: Validated `ebony` with current `pgs` and cluster-wide `thunderbolts` rollout. `pgs suspend` / `resume` succeeded for VMs `101`, `102`, `301`; state file survived reboot and restore completed.
- 2026-03-07: `ebony` still showed long shutdown because `AutoNAS-1` is currently provided by `ebony` itself through `autonas`. Stopping `autonas.service` made the node's own NFS client mount stale and `mnt-pve-AutoNAS-1.mount` waited for timeout.
- 2026-03-07: On `ebony`, PBS `anjothibe` availability loss during maintenance is expected because VM `301 is-anjohibe` is intentionally suspended by `pgs`, and its datastore dependency is also on `AutoNAS-1`.
- 2026-03-07: Implemented AutoNAS shutdown-ordering experiment on `ebony`: `autonas.service` and `autonas-boot-scan.service` now declare `Before=remote-fs.target` and `Before=umount.target`.
- 2026-03-07: Revalidated `ebony` after AutoNAS patch:
  - previous timing: `TIME_TO_STOP_SECONDS 120.275`, `TIME_TO_FIRST_REPLY_SECONDS 145.840`
  - new timing: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
  - `mnt-pve-AutoNAS-2.mount` still unmounted cleanly
  - `AutoNAS-1` no longer waited for the old 90s timeout, though a brief `Stale file handle` was still observed before the provider side stopped
- 2026-03-07: Residual issue on `ebony`: even with later provider shutdown, `pvestatd` briefly logged `storage 'AutoNAS-1' is not online` / `Stale file handle` during the maintenance window, so the self-hosted NFS topology remains fragile but no longer dominates shutdown time.
- 2026-03-07: Deployed the same AutoNAS ordering patch cluster-wide and revalidated `tapia`.
- 2026-03-07: `pgs suspend` / reboot / `pgs resume` succeeded on `tapia` for VMs `104`, `107`, `113`, `302`; state file survived reboot and all four guests were restored.
- 2026-03-07: `tapia` still showed slow shutdown after the AutoNAS patch:
  - `TIME_TO_STOP_SECONDS 123.285`, `TIME_TO_FIRST_REPLY_SECONDS 149.420`
  - `mnt-pve-AutoNAS-1.mount` unmounted immediately at `11:45:01.827`
  - `autonas.service` and `nfs-server.service` stopped around `11:45:01.689/11:45:01.900`
  - `mnt-pve-AutoNAS-2.mount` then waited until timeout at `11:46:31.778`
  - `network.target` stopped only after that, at `11:46:31.781`
- 2026-03-07: On `tapia`, the remaining delay is concentrated on self-hosted `AutoNAS-2` (`server 192.168.10.22`) plus expected maintenance-window loss of PBS `andrafiabe-AutoNAS` (`192.168.10.96`).
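
The `pgs` change that limits cleanup to local `dir` storages is not shown in this issue. As a rough illustration of the idea only, the sketch below filters storage definitions in the section layout used by Proxmox `storage.cfg` (a `type: id` header followed by indented properties) and returns only `dir` entries, so no path on an NFS mount is ever touched. The function name, the sample configuration, and the `export` value are hypothetical.

```python
def local_dir_storages(storage_cfg: str) -> dict:
    """Map storage ID -> path for every `dir` storage in storage.cfg text."""
    result = {}
    current_type = None
    current_id = None
    for raw in storage_cfg.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not raw[0].isspace():
            # Unindented line is a section header, e.g. "nfs: AutoNAS-1".
            current_type, _, current_id = line.partition(":")
            current_type = current_type.strip()
            current_id = current_id.strip()
            continue
        # Indented line is a property of the current section.
        parts = line.split(None, 1)
        if current_type == "dir" and len(parts) == 2 and parts[0] == "path":
            result[current_id] = parts[1]
    return result


# Hypothetical configuration; only the AutoNAS-1 server address is from this issue.
SAMPLE_CFG = """\
dir: local
    path /var/lib/vz
    content iso,vztmpl,backup

nfs: AutoNAS-1
    export /srv/example-export
    path /mnt/pve/AutoNAS-1
    server 192.168.10.21
    content images
"""

scan_targets = local_dir_storages(SAMPLE_CFG)
```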

---

## Proposed Solution

1. Keep Thunderbolt enlist units ordered before `network.target` so storage traffic over `thunderbridge` remains alive until remote filesystems are unmounted.
2. Keep `pgs` cleanup path limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
3. Do not mount a node's own AutoNAS export back onto the same node as a Proxmox NFS storage; on `ebony`, exclude `AutoNAS-1` from local use or replace that local dependency with a direct/local storage path.
4. Review colocated service dependencies before planned reboot, especially when the node provides the storage it also consumes (for example `autonas` and PBS on `ebony`).
5. Apply the same self-hosted-storage review on `tapia`, where `AutoNAS-2` remains the dominant shutdown delay even after the AutoNAS ordering patch.
6. Validate the same shutdown path on the remaining nodes after storage-role cleanup.
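
Item 1 relies on systemd applying the inverse of the start order at shutdown. The toy model below makes that reasoning explicit: it builds a start order from ordering edges and reverses it. It is a simplification (systemd runs jobs in parallel and has many more units), and the `After=network.target` edge on the mount units is an assumption about systemd's default handling of network mounts rather than something shown in this issue.

```python
from graphlib import TopologicalSorter

# unit -> units that must be started before it.
STARTS_AFTER = {
    # tb-enlist declares Before=network.target (the patch described in this issue).
    "network.target": {"tb-enlist@thunderbolt0.service"},
    # Assumed default for network mounts: After=network.target.
    "mnt-pve-AutoNAS-1.mount": {"network.target"},
    "mnt-pve-AutoNAS-2.mount": {"network.target"},
}


def stop_order(starts_after):
    """Return one valid shutdown order: the reverse of a valid start order."""
    start_order = list(TopologicalSorter(starts_after).static_order())
    return start_order[::-1]


order = stop_order(STARTS_AFTER)
# Both mounts stop first, then network.target, and the enlist unit stops last,
# so the bridge stays up while the NFS mounts are being unmounted.
print(order)
```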

---

## Related Issues

- ISSUE-2026-001

---

## Changelog References

List CHANGELOG.md entries that reference this issue:
- `projects/thunderbolts/CHANGELOG.md`: [Unreleased] - `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
- `projects/pve-guests-state/CHANGELOG.md`: [1.5] - Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]