Upgrade and recovery journal for the Madagascar cluster nodes:
- tapia
- ebony
- baobab

All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (trixie) / Proxmox VE 9.1.
The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in apt or dpkg.
Recurring issues:
- The fallback files under EFI/BOOT were inconsistent across nodes.
- proxmox entry existed.
- Old systemd-boot style fallback artifacts remained while the host had moved to GRUB + proxmox-boot-tool.

Initial symptoms:
- The upgrade to trixie completed, but the node no longer booted normally.
- Booting \EFI\proxmox\grubx64.efi initially dropped back into BIOS settings.

Findings:
- EFI/proxmox/grub.cfg on tapia had drifted from the standard Proxmox ESP stub and referenced Btrfs directly.
- AutoNAS also produced noisy failed units for unmanaged boot disk UUIDs.

Fixes applied:
- Set GRUB_DEFAULT to 6.8.12-19-pve during recovery.
- update-initramfs -u -k 6.8.12-19-pve
- update-grub
- proxmox-boot-tool refresh
- grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
- Restored EFI/proxmox/grub.cfg to the standard Proxmox ESP stub:
  search.fs_uuid <ESP>
  set prefix=($root)/grub
  configfile $prefix/grub.cfg
- Verified that 6.17.13-1-pve boots correctly and made it the default again.
- Deployed the AutoNAS fix so unmanaged UUIDs are ignored instead of failing autonas-attach@... units.

Final state:
- Running kernel: 6.17.13-1-pve
- systemctl --failed empty
- Default kernel: 6.17.13-1-pve

Initial symptoms:
- The firmware could start memtest, but Proxmox GRUB payloads returned to BIOS settings.
- Other boot attempts showed Loading Linux... and Loading initial ramdisk... and then stopped.

Findings:
- EFI/BOOT/BOOTX64.EFI was not aligned with the Proxmox boot chain and could route to memtest.

Fixes applied:
- Set GRUB_DEFAULT to 6.8.12-19-pve during recovery.
- update-initramfs -u -k 6.8.12-19-pve
- update-grub
- proxmox-boot-tool refresh
- grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
- Replaced EFI/BOOT/BOOTX64.EFI with the Proxmox shimx64.efi payload and synchronized the other fallback EFI files.
- Verified 6.17.13-1-pve and set it as the default.
- Fixed AutoNAS export behavior by cleaning marked exports whose paths do not exist yet at boot.

Final state:
- Running kernel: 6.17.13-1-pve
- systemctl --failed empty
- AutoNAS-1 and AutoNAS-2 active
- Default kernel: 6.17.13-1-pve

Initial symptoms:
- BOOTX64.EFI was still a small systemd-boot style binary instead of the Proxmox shim.

Findings:
- BootOrder needed to prioritize the proxmox entry.
- EFI/BOOT/BOOTX64.EFI needed to point into the Proxmox chain, not the old fallback path.

Fixes applied:
- Set GRUB_DEFAULT to 6.8.12-19-pve for the first stable boot after upgrade.
- Changed BootOrder so proxmox is first.
- Replaced EFI/BOOT/BOOTX64.EFI with the Proxmox shimx64.efi.
- Ran fsck.vfat -a on the ESP
- grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
- update-grub
- proxmox-boot-tool refresh
- rc-local.service now ignores missing optional disks instead of failing
- discover_vms.service and discover_vms.timer

Final state:
- Running kernel: 6.8.12-19-pve
- systemctl --failed empty
- Default kernel kept at 6.8.12-19-pve as the conservative stable choice

Two AutoNAS issues were identified and fixed during the upgrade recovery:
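1. attach-deferred could fail for disks with UUIDs that are not managed by AutoNAS.
   Fix: unmanaged UUIDs are now ignored, so systemd does not mark the unit failed.
2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet.
   Fix: marked exports whose paths do not exist yet at boot are now cleaned.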
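
The idea behind the first fix can be illustrated with a minimal sketch. This is not the AutoNAS code: the function name attach_if_managed, the MANAGED_LIST file and its one-UUID-per-line format are all assumptions made for illustration. It only shows that an unmanaged UUID is skipped with exit status 0, so a systemd unit wrapping the call would not be marked failed.

```shell
#!/bin/sh
# Hypothetical list of managed UUIDs, one per line (not the real AutoNAS layout).
MANAGED_LIST="${MANAGED_LIST:-/etc/autonas/managed-uuids}"

attach_if_managed() {
    uuid="$1"
    # -x matches whole lines only, -F treats the UUID as a literal string.
    # A missing list file makes grep fail, so every UUID is then treated as unmanaged.
    if ! grep -qxF -- "$uuid" "$MANAGED_LIST" 2>/dev/null; then
        echo "skipping unmanaged UUID $uuid"
        return 0    # success on purpose: nothing to do is not an error
    fi
    echo "attaching $uuid"
    # The real attach logic would run here and return its own status.
}
```

Called as attach_if_managed <uuid>, the function returns non-zero only when a managed disk actually fails to attach, which keeps systemctl --failed free of noise from boot disks.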
Both fixes were deployed to:
- baobab
- ebony
- tapia

Most effective recovery sequence when a node no longer boots after the upgrade:
Set GRUB_DEFAULT to a known-good kernel in /etc/default/grub.

update-initramfs -u -k <known-good-kernel>
update-grub
proxmox-boot-tool refresh
grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
efibootmgr -v
proxmox-boot-tool status
ls -l /boot/efi/EFI/proxmox
ls -l /boot/efi/EFI/BOOT

Replace EFI/BOOT/BOOTX64.EFI with Proxmox shimx64.efi.

Before rebooting a node after the Debian 13 / PVE 9 upgrade:
- dpkg --audit
- apt-get -s full-upgrade
- update-grub
- proxmox-boot-tool refresh
- efibootmgr -v
- proxmox-boot-tool status
- EFI/proxmox/grub.cfg should be the standard ESP stub
- EFI/BOOT/BOOTX64.EFI should route into the Proxmox chain, not an old systemd-boot or memtest fallback
- /usr/local/sbin/pgs suspend -v
- systemd automation for guest suspend/resume
- pve-guests.service can stall shutdown while waiting for VMs/CTs to stop

During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through pgs.
Observed effect:
- pve-guests.service remained in deactivating

Operational rule going forward:
/usr/local/sbin/pgs suspend -v
/usr/local/sbin/pgs resume -v
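
The rule above, together with the fallback-loader check from the pre-reboot list, could be wrapped in a small guard so that a reboot is refused when either precondition is not met. The following is a sketch under assumptions, not a script that exists on the nodes: the function names, the ESP and REBOOT_CMD variables, and the location of the shim at EFI/proxmox/shimx64.efi on the ESP are hypothetical. Only the pgs path and its suspend -v invocation come from this journal.

```shell
#!/bin/sh
PGS="${PGS:-/usr/local/sbin/pgs}"
ESP="${ESP:-/boot/efi}"                        # ESP mount point used in this journal
REBOOT_CMD="${REBOOT_CMD:-systemctl reboot}"   # overridable for testing

fallback_matches_shim() {
    # Assumption: the Proxmox shim is installed as EFI/proxmox/shimx64.efi.
    # cmp -s is silent and returns non-zero if the files differ or one is missing.
    cmp -s "$ESP/EFI/BOOT/BOOTX64.EFI" "$ESP/EFI/proxmox/shimx64.efi" 2>/dev/null
}

safe_reboot() {
    if ! fallback_matches_shim; then
        echo "BOOTX64.EFI differs from the Proxmox shim, not rebooting" >&2
        return 1
    fi
    # Suspend guests first so pve-guests.service has nothing to wait for.
    if ! "$PGS" suspend -v; then
        echo "guest suspend failed, not rebooting" >&2
        return 1
    fi
    # Intentionally unquoted so "systemctl reboot" splits into command and argument.
    $REBOOT_CMD
}
```

After sourcing the file, safe_reboot replaces a bare reboot. Guests still have to be resumed with /usr/local/sbin/pgs resume -v once the node is back up.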
The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes.
Main lesson: