# Issue ISSUE-2026-002: Planned reboot stalls on shared NFS storages during maintenance shutdown

## Issue ID: ISSUE-2026-002

Status: resolved
Priority: high
Created: 2026-03-07
Updated: 2026-03-07
Assigned to: unassigned

---

## Summary

A planned node reboot could take roughly 105 to 125 seconds before the host stopped answering ICMP, because shared Proxmox NFS storages were not consistently unmounted before their transport or provider was torn down.

---

## Description

This incident had two independent cluster-level contributors that happened to surface in the same maintenance workflow.

The first was transport-related on `baobab`: `AutoNAS-1` and `AutoNAS-2` are mounted over `192.168.10.x` through `thunderbridge`, but Thunderbolt bridge membership was being torn down before Proxmox attempted to unmount those remote NFS storages.

The second was provider-related on `ebony` and `tapia`: local AutoNAS exports were mounted back on the same node as Proxmox NFS storages. In that self-hosted topology, shutdown became sensitive to the ordering between `umount.nfs4` and `nfs-server.service`.

The same investigation also exposed a separate maintenance-preflight issue in `pgs` (from the `pve-guests-state` project): cleanup could block in kernel I/O wait when it touched stale remote NFS-backed storages.

The final fix therefore spans cluster maintenance, `thunderbolts`, `autoNAS`, and `pve-guests-state`, and should be tracked as a cluster issue rather than a project-local one.

---

## Environment

- Affected nodes: `baobab`, `ebony`, `tapia`
- Component: cluster storage + maintenance workflow
- Version/software: Proxmox VE 9.1 / kernel `6.17.13-1-pve`, `tb-enlist@.service`, `autoNAS`, `pgs`

---

## Steps to Reproduce

1. On a node with shared Proxmox NFS storages, run `/usr/local/sbin/pgs suspend -v`.
2. Trigger `systemctl reboot`.
3. Measure ICMP availability during shutdown and boot.
4. Inspect `journalctl -b -1` around the reboot window.
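
Step 3 can be scripted. The following is a minimal sketch, not the tooling used for this issue. It assumes a hypothetical probe loop has already recorded `(seconds_since_reboot_request, replied)` samples, and it assumes that `TIME_TO_STOP_SECONDS` means the first unanswered probe and `TIME_TO_FIRST_REPLY_SECONDS` the first answered probe after that. `RebootTiming` and `summarize` are illustrative names.

```python
from typing import Iterable, NamedTuple, Optional


class RebootTiming(NamedTuple):
    time_to_stop: Optional[float]         # first missed probe after the reboot request
    time_to_first_reply: Optional[float]  # first answered probe after the outage began
    downtime: Optional[float]             # time_to_first_reply - time_to_stop


def summarize(samples: Iterable[tuple[float, bool]]) -> RebootTiming:
    """Reduce (seconds_since_reboot_request, replied) probe samples to timings."""
    stop = first_reply = None
    for elapsed, replied in sorted(samples):
        if stop is None:
            if not replied:
                stop = elapsed        # host stopped answering
        elif replied:
            first_reply = elapsed     # host is reachable again
            break
    downtime = None
    if stop is not None and first_reply is not None:
        downtime = round(first_reply - stop, 3)
    return RebootTiming(stop, first_reply, downtime)


# Illustrative samples only, not measurements from this issue.
samples = [(0.0, True), (10.0, True), (28.99, False), (40.0, False), (53.384, True)]
print(summarize(samples))
```

The resolution of the result is bounded by the probe interval, so a short interval (well under one second) is needed to get millisecond-style figures like the ones reported in this issue.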

---

## Expected Behavior

- NFS storages should unmount before either their transport or provider disappears.
- Host should stop replying to ICMP shortly after reboot is requested.
- `pgs suspend` should not block because a remote NFS mount is stale.

---

## Actual Behavior

- First validation on `baobab`:
  - `TIME_TO_STOP_SECONDS 105.852`
  - `TIME_TO_FIRST_REPLY_SECONDS 130.230`
  - `DOWNTIME_SECONDS 24.377`
- Follow-up validation on `ebony` before self-hosted fix:
  - `TIME_TO_STOP_SECONDS 120.275`
  - `TIME_TO_FIRST_REPLY_SECONDS 145.840`
  - `DOWNTIME_SECONDS 25.565`
- Follow-up validation on `tapia` before provider-ordering fix:
  - `TIME_TO_STOP_SECONDS 123.285`
  - `TIME_TO_FIRST_REPLY_SECONDS 149.420`
  - `DOWNTIME_SECONDS 26.135`
- Revalidation after fixes:
  - `baobab`: `TIME_TO_STOP_SECONDS 14.599`, `TIME_TO_FIRST_REPLY_SECONDS 35.651`
  - `ebony`: `TIME_TO_STOP_SECONDS 27.573`, `TIME_TO_FIRST_REPLY_SECONDS 53.288`
  - `tapia`: `TIME_TO_STOP_SECONDS 28.305`, `TIME_TO_FIRST_REPLY_SECONDS 53.588`
  - repeated `tapia` validation: `TIME_TO_STOP_SECONDS 28.990`, `TIME_TO_FIRST_REPLY_SECONDS 53.384`
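
The revalidation runs list only two of the three metrics. In the runs that report all three, `DOWNTIME_SECONDS` matches `TIME_TO_FIRST_REPLY_SECONDS` minus `TIME_TO_STOP_SECONDS` (to within 1 ms for `baobab`, presumably rounding), so the missing values can be derived. The sketch below does that arithmetic with the figures listed above. The helper name `downtime` is illustrative, and the derivation is an inference from the numbers, not a documented definition.

```python
from decimal import Decimal

# (TIME_TO_STOP_SECONDS, TIME_TO_FIRST_REPLY_SECONDS) copied from the runs above.
before = {
    "baobab": ("105.852", "130.230"),
    "ebony": ("120.275", "145.840"),
    "tapia": ("123.285", "149.420"),
}
after = {
    "baobab": ("14.599", "35.651"),
    "ebony": ("27.573", "53.288"),
    "tapia": ("28.305", "53.588"),
}


def downtime(stop: str, first_reply: str) -> Decimal:
    """Seconds without ICMP replies: first reply minus time to stop."""
    return Decimal(first_reply) - Decimal(stop)


for node in before:
    saved = Decimal(before[node][0]) - Decimal(after[node][0])
    print(
        f"{node}: downtime {downtime(*before[node])} s -> {downtime(*after[node])} s, "
        f"time to stop reduced by {saved} s"
    )
```

Read this way, the window without ICMP replies stayed in the 21 to 26 second range on all three nodes. What the fixes removed was roughly 91 to 95 seconds of waiting before the host went down at all.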

---

## Logs/Evidence

Transport ordering failure on `baobab`:

```text
Mar 07 08:48:17.989246 baobab NetworkManager[1096]: device (thunderbridge): bridge port thunderbolt0 was detached
Mar 07 08:48:30.540186 baobab systemd[1]: Unmounting mnt-pve-AutoNAS-1.mount - /mnt/pve/AutoNAS-1...
Mar 07 08:50:00.604036 baobab systemd[1]: mnt-pve-AutoNAS-2.mount: Unmounting timed out. Terminating.
```
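
The gaps between these three timestamps carry the argument, so here is a small sketch that computes them. It assumes the year 2026 (from the issue's creation date) and an English locale for the month abbreviation. Note that the excerpt shows the unmount start for `AutoNAS-1` and the timeout for `AutoNAS-2`, so treating the second gap as one unmount attempt assumes both unmounts began at about the same moment.

```python
from datetime import datetime

FMT = "%Y %b %d %H:%M:%S.%f"  # year is prepended because the journal lines omit it


def seconds_between(earlier: str, later: str, year: int = 2026) -> float:
    """Elapsed seconds between two journal timestamps from the same year."""
    start = datetime.strptime(f"{year} {earlier}", FMT)
    end = datetime.strptime(f"{year} {later}", FMT)
    return (end - start).total_seconds()


bridge_detached = "Mar 07 08:48:17.989246"
unmount_started = "Mar 07 08:48:30.540186"
unmount_timed_out = "Mar 07 08:50:00.604036"

print("bridge detached before unmount began:",
      seconds_between(bridge_detached, unmount_started))
print("unmount start to timeout:",
      seconds_between(unmount_started, unmount_timed_out))
```

The bridge port was detached about 12.6 seconds before the unmount began, and the timeout fired about 90.1 seconds after the unmount started, which is consistent with systemd's default 90-second unit timeout.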

Preflight stale-NFS block in `pgs`:

```text
[<0>] rpc_wait_bit_killable+0x11/0x80 [sunrpc]
[<0>] nfs4_do_call_sync+0x6a/0xc0 [nfsv4]
[<0>] __nfs_revalidate_inode+0xd4/0x320 [nfs]
```
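
The trace shows a task waiting on an RPC while revalidating an NFS inode, so any path access under a stale mount can block. The fix recorded in this issue is to scan only local `dir` storages. The sketch below illustrates that selection from the text of `storage.cfg` without touching any mount path. It is not the actual `pgs` code: `local_dir_paths` is a hypothetical name, the example export path and server address are made up, and the parser assumes the usual layout of an unindented `type: id` header followed by indented `key value` lines.

```python
def local_dir_paths(storage_cfg: str) -> dict[str, str]:
    """Map storage ID -> path for `dir` storages only; other types are ignored."""
    paths: dict[str, str] = {}
    current_type = current_id = None
    for line in storage_cfg.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace() and ":" in line:
            # Section header such as "nfs: AutoNAS-1"
            current_type, _, current_id = (part.strip() for part in line.partition(":"))
        elif current_type == "dir":
            key, _, value = line.strip().partition(" ")
            if key == "path":
                paths[current_id] = value.strip()
    return paths


# Hypothetical configuration for illustration only.
example = """\
dir: local
    path /var/lib/vz
    content iso,backup

nfs: AutoNAS-1
    export /srv/autonas/export1
    path /mnt/pve/AutoNAS-1
    server 192.168.10.2
"""
print(local_dir_paths(example))
```

Because the NFS entry is dropped at the configuration level, its mount point is never opened or stat-ed, and a stale export cannot hold up the cleanup.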

Provider-ordering fix validated on `tapia`:

```text
TIME_TO_STOP_SECONDS 28.990
TIME_TO_FIRST_REPLY_SECONDS 53.384
DOWNTIME_SECONDS 24.394
```

---

## Investigation Notes

- 2026-03-07: Confirmed `baobab` delay was dominated by NFS unmount timeout after Thunderbolt transport disappeared too early.
- 2026-03-07: Patched `tb-enlist@.service` with `Before=network.target`; reboot timing on `baobab` dropped from ~106s to ~15s.
- 2026-03-07: Confirmed `pgs` preflight could block on stale remote NFS during storage cleanup.
- 2026-03-07: Patched `pgs` cleanup to scan only local `dir` storages; remote NFS is skipped intentionally.
- 2026-03-07: Confirmed `ebony` delay was self-hosted `AutoNAS-1`: the node exported local storage and mounted it back as Proxmox NFS.
- 2026-03-07: First AutoNAS patch kept `autonas.service` and `autonas-boot-scan.service` ordered before `remote-fs.target` and `umount.target`; `ebony` improved to ~28s shutdown-to-ICMP-loss.
- 2026-03-07: `tapia` still showed ~123s shutdown with that first AutoNAS patch because `nfs-server.service` still stopped too early for self-hosted `AutoNAS-2`.
- 2026-03-07: Implemented second-generation AutoNAS fix that generates `/etc/systemd/system/nfs-server.service.d/50-autonas-self-hosted-proxmox.conf` from `storage.cfg`, adding explicit `Before=` ordering from `nfs-server.service` to matching self-hosted Proxmox mount units.
- 2026-03-07: Revalidated `tapia` twice after the `nfs-server.service` ordering fix; both tests converged around `29s` to ICMP loss and `53s` to first ICMP reply.
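
Both ordering fixes rely on the same systemd rule: units with an ordering dependency are stopped in the reverse of their start order, so a unit declared `Before=` another is stopped after it. The toy model below illustrates that rule with a topological sort. It is a sketch, not systemd's transaction logic. The instance name `tb-enlist@thunderbolt0.service` is an assumption (only the template name appears in this issue), mount unit names are spelled as in the log excerpt, and the `network.target` edge for the mount reflects systemd's default ordering for network mounts.

```python
from graphlib import TopologicalSorter


def stop_order(starts_after: dict[str, set[str]]) -> list[str]:
    """Return the shutdown order: the reverse of a valid start order.

    `starts_after[unit]` holds the units that must start before `unit`,
    which is the same relation as those units declaring Before=unit.
    """
    start_order = list(TopologicalSorter(starts_after).static_order())
    return start_order[::-1]


# Transport case (baobab): the enlist unit is ordered Before=network.target.
transport = {
    "network.target": {"tb-enlist@thunderbolt0.service"},
    "mnt-pve-AutoNAS-1.mount": {"network.target"},
    "remote-fs.target": {"mnt-pve-AutoNAS-1.mount"},
}

# Provider case (tapia): nfs-server.service is ordered Before= the mount unit.
provider = {
    "mnt-pve-AutoNAS-2.mount": {"nfs-server.service"},
}

print(stop_order(transport))
print(stop_order(provider))
```

In both printed orders the mount unit comes before the thing it depends on: the NFS mount is unmounted first, and only then is the Thunderbolt transport or the local NFS server stopped.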

---

## Proposed Solution

1. Keep Thunderbolt enlist units ordered before `network.target` so transport-backed NFS over `thunderbridge` stays alive until remote filesystems unmount.
2. Keep `pgs` cleanup limited to local directory-backed storages; do not let remote NFS availability gate planned maintenance.
3. For self-hosted AutoNAS exports, generate explicit `nfs-server.service` ordering against the matching Proxmox `mnt-pve-*.mount` units discovered from `storage.cfg`.
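
Item 3 could look roughly like the sketch below, which renders drop-in content for `nfs-server.service` from the text of `storage.cfg`. This is an illustration under stated assumptions, not the AutoNAS implementation. All function names are hypothetical, "self-hosted" is assumed to mean that the storage's `server` value is one of the node's own addresses (passed in by the caller), and `mount_unit` approximates `systemd-escape --path --suffix=mount`. Under systemd's escaping rules a hyphen inside a path component becomes `\x2d`, so generated names can look different from the plain spellings used elsewhere in this issue.

```python
import re

SECTION = re.compile(r"^(?P<type>[A-Za-z0-9]+):\s+(?P<id>\S+)\s*$")


def parse_storages(storage_cfg: str) -> list[dict[str, str]]:
    """Parse storage.cfg text into one dict per storage section."""
    storages: list[dict[str, str]] = []
    for line in storage_cfg.splitlines():
        header = SECTION.match(line)
        if header:
            storages.append({"type": header["type"], "id": header["id"]})
        elif storages and line.strip():
            key, _, value = line.strip().partition(" ")
            storages[-1][key] = value.strip()
    return storages


def mount_unit(path: str) -> str:
    """Approximate systemd path escaping for a mount point."""
    def escape(part: str, is_first: bool) -> str:
        out = []
        for i, ch in enumerate(part):
            keep = (ch.isascii() and ch.isalnum()) or ch in ":_"
            keep = keep or (ch == "." and not (is_first and i == 0))
            out.append(ch if keep else "".join(f"\\x{b:02x}" for b in ch.encode()))
        return "".join(out)

    parts = [p for p in path.split("/") if p]
    name = "-".join(escape(p, i == 0) for i, p in enumerate(parts)) or "-"
    return name + ".mount"


def render_dropin(storage_cfg: str, local_addresses: set[str]) -> str:
    """Return drop-in text ordering nfs-server.service before self-hosted mounts."""
    units = [
        mount_unit(s["path"])
        for s in parse_storages(storage_cfg)
        if s["type"] == "nfs" and "path" in s and s.get("server") in local_addresses
    ]
    if not units:
        return ""  # nothing self-hosted: no drop-in needed
    return "[Unit]\nBefore=" + " ".join(units) + "\n"
```

Returning an empty string when no storage matches lets a caller remove a stale drop-in instead of writing an empty `Before=` line. After writing or removing the file, `systemctl daemon-reload` is needed for the ordering to take effect.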

---

## Related Issues

- ISSUE-2026-001

---

## Changelog References

CHANGELOG.md entries that reference this issue:
- `projects/thunderbolts/CHANGELOG.md`: `tb-enlist@.service` now stays active until `network.target` stops... [ISSUE-2026-002]
- `projects/autoNAS/CHANGELOG.md`: self-hosted AutoNAS shutdown now adds explicit `nfs-server.service` ordering... [ISSUE-2026-002]
- `projects/pve-guests-state/CHANGELOG.md`: Suspend-artifact cleanup now scans only local `dir` storages... [ISSUE-2026-002]