# 2026-03-07 Trixie / Proxmox VE 9 Upgrade Journal

## Scope

Upgrade and recovery journal for the Madagascar cluster nodes:

- `tapia`
- `ebony`
- `baobab`

All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (`trixie`) / Proxmox VE 9.1.

## Common Pattern Observed

The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in `apt` or `dpkg`.

Recurring issues:

- EFI fallback binaries under `EFI/BOOT` were inconsistent across nodes.
- Boot order could still point to a non-Proxmox path even when the `proxmox` entry existed.
- Some systems still had `systemd-boot` style fallback artifacts while the host had moved to GRUB + `proxmox-boot-tool`.
- Testing was complicated by slow shutdowns and, in one case, missing hardware during boot.

## Node Journal

### tapia

Initial symptoms:

- Upgrade to `trixie` completed, but the node no longer booted normally.
- UEFI shell could see the ESP and Proxmox EFI payloads.
- Launching `\EFI\proxmox\grubx64.efi` initially dropped back into BIOS settings.
- Later, after loader repair, boot worked on the old kernel first.

Findings:

- The system had moved to Debian 13 and Proxmox VE 9 packages correctly.
- GRUB and EFI files existed, but the boot path was inconsistent after the upgrade.
- `EFI/proxmox/grub.cfg` on `tapia` had drifted from the standard Proxmox ESP stub and referenced Btrfs directly.
- `AutoNAS` also produced noisy failed units for unmanaged boot disk UUIDs.

Fixes applied:

- Offline disk repair on another node.
- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
- Ran:
  - `update-initramfs -u -k 6.8.12-19-pve`
  - `update-grub`
  - `proxmox-boot-tool refresh`
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
- Restored `EFI/proxmox/grub.cfg` to the standard Proxmox ESP stub:
  - `search.fs_uuid <ESP>`
  - `set prefix=($root)/grub`
  - `configfile $prefix/grub.cfg`
- Later confirmed `6.17.13-1-pve` boots correctly and made it the default again.
- Deployed an `AutoNAS` fix so unmanaged UUIDs are ignored instead of failing `autonas-attach@...` units.
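
The `grub.cfg` drift found on `tapia` can be detected mechanically. The following is a minimal sketch, not the tooling used during the recovery: `is_esp_stub` is a hypothetical helper, and its matching rules (exactly the three directives listed above, nothing else apart from comments and blank lines) are an assumption about what "standard stub" means.

```bash
# Sketch: succeed only if the file contains the three stub directives and
# no other statements. is_esp_stub and its matching rules are hypothetical.
is_esp_stub() {
    local cfg="$1"
    [ -r "$cfg" ] || return 2    # unreadable or missing file
    awk '
        /^[ \t]*(#|$)/                  { next }             # skip comments and blank lines
        /^[ \t]*search\.fs_uuid[ \t]/   { search = 1; next }
        /^[ \t]*set[ \t]+prefix=/       { prefix = 1; next }
        /^[ \t]*configfile[ \t]/        { config = 1; next }
                                        { extra = 1 }        # any other statement counts as drift
        END { exit !(search && prefix && config && !extra) }
    ' "$cfg"
}
```

Possible usage on a mounted ESP: `is_esp_stub /boot/efi/EFI/proxmox/grub.cfg || echo "grub.cfg is not the plain stub"`. A file that loads filesystem modules or references Btrfs directly, as on `tapia`, would fail this check.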

Final state:

- Running `6.17.13-1-pve`
- `systemctl --failed` empty
- Boot default set to `6.17.13-1-pve`

### ebony

Initial symptoms:

- Upgrade completed cleanly, but the node did not return after reboot.
- UEFI fallback could boot `memtest`, but Proxmox GRUB payloads returned to BIOS settings.
- After EFI repair, boot progressed to:
  - `Loading Linux...`
  - `Loading initial ramdisk...`

  and then stopped.

Findings:

- The fallback `EFI/BOOT/BOOTX64.EFI` was not aligned with the Proxmox boot chain and could route to memtest.
- GRUB loader repair was required.
- During one boot attempt, the NVMe device was physically absent; this caused the post-kernel boot stall and initially looked like a kernel/initramfs failure.
- Once hardware was restored, the newer kernel booted successfully.
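
The missing-NVMe stall suggests confirming that expected disks are visible before blaming the kernel or initramfs. A minimal sketch follows; `missing_disks` and the device IDs in the usage line are hypothetical, while `/dev/disk/by-id` is the standard udev directory of persistent device names.

```bash
# Sketch: print each expected disk ID that has no entry in the given
# by-id directory; return non-zero if any are missing.
# missing_disks is a hypothetical helper, not part of Proxmox or AutoNAS.
missing_disks() {
    local by_id_dir="$1"
    shift
    local id rc=0
    for id in "$@"; do
        if [ ! -e "$by_id_dir/$id" ]; then
            printf 'missing: %s\n' "$id"
            rc=1
        fi
    done
    return "$rc"
}
```

Possible usage, with a made-up ID: `missing_disks /dev/disk/by-id nvme-EXAMPLE_SERIAL || echo "hardware incomplete, do not reboot"`.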

Fixes applied:

- Offline disk repair on another node.
- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
- Ran:
  - `update-initramfs -u -k 6.8.12-19-pve`
  - `update-grub`
  - `proxmox-boot-tool refresh`
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi` payload and synchronized the other fallback EFI files.
- After the node booted with full hardware present, validated `6.17.13-1-pve` and set it as the default.
- Fixed stale `AutoNAS` export behavior by cleaning marked exports whose paths do not exist yet at boot.

Final state:

- Running `6.17.13-1-pve`
- `systemctl --failed` empty
- `AutoNAS-1` and `AutoNAS-2` active
- Boot default set to `6.17.13-1-pve`

### baobab

Initial symptoms:

- Upgrade completed cleanly, but the node failed to return after reboot.
- Before recovery, fallback `BOOTX64.EFI` was still a small `systemd-boot` style binary instead of the Proxmox shim.
- The node eventually required offline repair from another machine.

Findings:

- Package state was healthy; the failure was again in the EFI/boot path.
- `BootOrder` needed to prioritize the `proxmox` entry.
- `EFI/BOOT/BOOTX64.EFI` needed to point into the Proxmox chain, not the old fallback path.

Fixes applied:

- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` for the first stable boot after upgrade.
- Corrected `BootOrder` so `proxmox` is first.
- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi`.
- Offline repair after the failed reboot:
  - `fsck.vfat -a` on the ESP
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
  - `update-grub`
  - `proxmox-boot-tool refresh`
- Fixed remaining failed units unrelated to the OS upgrade:
  - `rc-local.service` now ignores missing optional disks instead of failing
  - removed orphan `discover_vms.service` and `discover_vms.timer`
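
The `BootOrder` correction can be derived from `efibootmgr` output instead of being typed by hand. This is a sketch under assumptions: `proxmox_first_order` is a hypothetical helper, and it assumes the entry label is exactly `proxmox`, matching the `--bootloader-id=proxmox` value used above.

```bash
# Sketch: read `efibootmgr` output on stdin and print a BootOrder string
# with the entry labelled "proxmox" moved to the front.
# proxmox_first_order is a hypothetical helper name.
proxmox_first_order() {
    awk '
        /^BootOrder:/ { order = $2 }
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
            id = substr($1, 5, 4)            # "Boot0001*" -> "0001"
            if ($2 == "proxmox") pve = id    # first word of the entry label
        }
        END {
            if (pve == "" || order == "") exit 1
            n = split(order, ids, ",")
            out = pve
            for (i = 1; i <= n; i++)
                if (toupper(ids[i]) != toupper(pve)) out = out "," ids[i]
            print out
        }
    '
}
```

On a booted EFI system, one way to apply the result as root is `order=$(efibootmgr | proxmox_first_order) && efibootmgr -o "$order"`. The helper exits non-zero when no `proxmox` entry exists, so the order is left untouched in that case.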

Final state:

- Running `6.8.12-19-pve`
- `systemctl --failed` empty
- Boot default left on `6.8.12-19-pve` as the conservative stable choice

## AutoNAS Follow-up

Two AutoNAS issues were identified and fixed during the upgrade recovery:

1. `attach-deferred` could fail for disks with UUIDs that are not managed by AutoNAS.
   - Fix: return success for unmanaged UUIDs so `systemd` does not mark the unit failed.

2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet.
   - Fix: remove AutoNAS-marked exports with missing paths during boot cleanup, then let normal mount/export flow recreate them when the disk is available.

Both fixes were deployed to:

- `baobab`
- `ebony`
- `tapia`

## Recovery Commands That Proved Useful

Most effective recovery sequence when a node no longer boots after the upgrade:

1. Move the system disk to another node.
2. Mount root and ESP.
3. Force a known-good kernel in `/etc/default/grub`.
4. Run:

```bash
update-initramfs -u -k <known-good-kernel>
update-grub
proxmox-boot-tool refresh
grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
```

5. Verify:

```bash
efibootmgr -v
proxmox-boot-tool status
ls -l /boot/efi/EFI/proxmox
ls -l /boot/efi/EFI/BOOT
```

6. If needed, replace `EFI/BOOT/BOOTX64.EFI` with Proxmox `shimx64.efi`.
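
Whether step 6 is needed can be decided with a byte-for-byte comparison rather than by eyeballing file sizes in `ls -l`. A minimal sketch, assuming the shim lives at `EFI/proxmox/shimx64.efi` on the same ESP; `fallback_matches_shim` is a hypothetical helper name.

```bash
# Sketch: compare the fallback loader against the Proxmox shim on one ESP.
# Returns 0 if identical, 1 if they differ, 2 if either file is missing.
# fallback_matches_shim is a hypothetical helper name.
fallback_matches_shim() {
    local esp="$1"
    local shim="$esp/EFI/proxmox/shimx64.efi"
    local fallback="$esp/EFI/BOOT/BOOTX64.EFI"
    [ -f "$shim" ] && [ -f "$fallback" ] || return 2
    cmp -s "$shim" "$fallback"
}
```

Possible usage: `fallback_matches_shim /boot/efi || echo "fallback loader is not the Proxmox shim"`. When they differ, back up the old fallback before copying the shim over it. As the `ebony` entry notes, the other fallback EFI files were synchronized as well.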

## Recommended Post-Upgrade Checklist

Before rebooting a node after the Debian 13 / PVE 9 upgrade:

1. Confirm package state is clean:
   - `dpkg --audit`
   - `apt-get -s full-upgrade`
2. Refresh boot assets:
   - `update-grub`
   - `proxmox-boot-tool refresh`
3. Verify EFI layout:
   - `efibootmgr -v`
   - `proxmox-boot-tool status`
   - `EFI/proxmox/grub.cfg` should be the standard ESP stub
   - `EFI/BOOT/BOOTX64.EFI` should route into the Proxmox chain, not an old `systemd-boot` or memtest fallback
4. Suspend guests manually before reboot:
   - run `/usr/local/sbin/pgs suspend -v`
   - do not rely on legacy `systemd` automation for guest suspend/resume
   - otherwise `pve-guests.service` can stall shutdown while waiting for VMs/CTs to stop
5. Verify all expected storage hardware is physically present before reboot.
6. Keep one older known-good kernel available in GRUB until the new kernel is validated on that node.
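
For item 6, the kernels GRUB can actually offer are visible in the generated menu. The sketch below lists menu titles from a `grub.cfg`; `list_grub_entries` is a hypothetical helper, and the default path `/boot/grub/grub.cfg` is an assumption about where `update-grub` writes on these nodes.

```bash
# Sketch: print the title of every menuentry and submenu in a grub.cfg.
# list_grub_entries is a hypothetical helper; the default path is an assumption.
list_grub_entries() {
    local cfg="${1:-/boot/grub/grub.cfg}"
    # Titles are the first single-quoted string on each menuentry/submenu line.
    awk -F"'" '/^[ \t]*(menuentry|submenu) / { print $2 }' "$cfg"
}
```

Running `list_grub_entries | grep 6.8.12-19-pve` should print at least one line while the older kernel is still installed. GRUB accepts a nested entry in `GRUB_DEFAULT` as `submenu title>entry title`, so the listed titles are also what a forced default would be built from.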

## Operational Note: Reboot Discipline

During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through `pgs`.

Observed effect:

- `pve-guests.service` remained in `deactivating`
- shutdown took a very long time
- guest stop operations had to be forced manually
- this obscured boot diagnostics and made the recovery look worse than the underlying boot issue

Operational rule going forward:

1. Before any planned node reboot for maintenance, run:

```bash
/usr/local/sbin/pgs suspend -v
```

2. Reboot only after guest suspend/shutdown has completed.
3. After the node or cluster is back in a stable state, run:

```bash
/usr/local/sbin/pgs resume -v
```
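
Steps 1 and 2 can be tied together so a reboot cannot start unless the suspend step succeeded. This is only a sketch: `suspend_then_reboot` is a hypothetical wrapper, and it assumes `pgs suspend -v` exits non-zero on failure, which this journal does not confirm. `PGS` and `REBOOT_CMD` are overridable so the flow can be dry-run.

```bash
# Sketch: reboot only if guest suspend reported success.
# Assumption: pgs returns a non-zero exit status when suspend fails.
PGS="${PGS:-/usr/local/sbin/pgs}"
REBOOT_CMD="${REBOOT_CMD:-systemctl reboot}"

suspend_then_reboot() {
    if ! "$PGS" suspend -v; then
        echo "guest suspend failed; not rebooting" >&2
        return 1
    fi
    # Left unquoted on purpose so "systemctl reboot" splits into command and argument.
    $REBOOT_CMD
}
```

A dry run that only prints the reboot command: `REBOOT_CMD='echo would run: systemctl reboot' suspend_then_reboot`.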

## Outcome

The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes.

Main lesson:

- the risky part of this upgrade was not package dependency resolution
- it was EFI and boot chain consistency after the transition to Debian 13 / Proxmox VE 9