2026-03-07 Trixie / Proxmox VE 9 Upgrade Journal

Scope

Upgrade and recovery journal for the Madagascar cluster nodes:

  • tapia
  • ebony
  • baobab

All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (trixie) / Proxmox VE 9.1.

Common Pattern Observed

The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in apt or dpkg.

Recurring issues:

  • EFI fallback binaries under EFI/BOOT were inconsistent across nodes.
  • Boot order could still point to a non-Proxmox path even when the proxmox entry existed.
  • Some systems still had systemd-boot style fallback artifacts while the host had moved to GRUB + proxmox-boot-tool.
  • Testing was complicated by slow shutdowns and, in one case, missing hardware during boot.
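
The boot-order issue in the list above can be checked from a running node before rebooting. A minimal sketch, assuming efibootmgr prints a BootOrder: line followed by BootXXXX entries whose label is separated from the device path by a tab; the function name first_boot_label is illustrative and not part of any tool:

```sh
# Print the label of the first entry in BootOrder.
# Reads efibootmgr output on stdin, e.g.: efibootmgr -v | first_boot_label
first_boot_label() {
  awk '
    /^BootOrder:/ { split($2, order, ","); first = toupper(order[1]); next }
    /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ {
      id = toupper(substr($0, 5, 4))   # four hex digits after "Boot"
      rest = substr($0, 9)             # "* label<TAB>device path"
      sub(/^[* ] /, "", rest)          # drop the active marker
      sub(/\t.*/, "", rest)            # drop the device path
      label[id] = rest
    }
    END { if (first != "" && (first in label)) print label[first] }
  '
}
```

If the printed label is anything other than proxmox, the firmware will try a different loader first even though the proxmox entry exists.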

Node Journal

tapia

Initial symptoms:

  • Upgrade to trixie completed, but the node no longer booted normally.
  • UEFI shell could see the ESP and Proxmox EFI payloads.
  • Launching \EFI\proxmox\grubx64.efi initially dropped back into BIOS settings.
  • After loader repair, the node first booted on the old kernel.

Findings:

  • The system had moved to Debian 13 and Proxmox VE 9 packages correctly.
  • GRUB and EFI files existed, but the boot path was inconsistent after the upgrade.
  • EFI/proxmox/grub.cfg on tapia had drifted from the standard Proxmox ESP stub and referenced Btrfs directly.
  • AutoNAS also produced noisy failed units for unmanaged boot disk UUIDs.

Fixes applied:

  • Offline disk repair on another node.
  • Forced GRUB_DEFAULT to 6.8.12-19-pve during recovery.
  • Ran:
    • update-initramfs -u -k 6.8.12-19-pve
    • update-grub
    • proxmox-boot-tool refresh
    • grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
  • Restored EFI/proxmox/grub.cfg to the standard Proxmox ESP stub:
    • search.fs_uuid <ESP>
    • set prefix=($root)/grub
    • configfile $prefix/grub.cfg
  • Later confirmed 6.17.13-1-pve boots correctly and made it the default again.
  • Deployed an AutoNAS fix so unmanaged UUIDs are ignored instead of failing autonas-attach@... units.
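
Drift in EFI/proxmox/grub.cfg, as found on tapia, can be detected with a simple heuristic. A minimal sketch, assuming the stub consists only of the three directives listed above (search.fs_uuid, set prefix, configfile) plus optional comments and blank lines; is_esp_stub is a hypothetical helper name:

```sh
# Return 0 if the file contains only the three stub directives.
is_esp_stub() {
  awk '
    /^[ \t]*#/ { next }                                  # comments
    /^[ \t]*$/ { next }                                  # blank lines
    /^[ \t]*search\.fs_uuid[ \t]/ { s = 1; next }
    /^[ \t]*set[ \t]+prefix=/     { p = 1; next }
    /^[ \t]*configfile[ \t]/      { c = 1; next }
    { extra = 1 }                                        # anything else is drift
    END { exit !(s && p && c && !extra) }
  ' "$1"
}
```

Usage: is_esp_stub /boot/efi/EFI/proxmox/grub.cfg || echo "grub.cfg differs from the ESP stub". A file that references a filesystem directly, as tapia's did, fails the check because of the extra directives.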

Final state:

  • Running 6.17.13-1-pve
  • systemctl --failed empty
  • Boot default set to 6.17.13-1-pve

ebony

Initial symptoms:

  • Upgrade completed cleanly, but the node did not return after reboot.
  • UEFI fallback could boot memtest, but Proxmox GRUB payloads returned to BIOS settings.
  • After EFI repair, boot progressed to the following messages and then stopped:
    • Loading Linux...
    • Loading initial ramdisk...

Findings:

  • The fallback EFI/BOOT/BOOTX64.EFI was not aligned with the Proxmox boot chain and could route to memtest.
  • GRUB loader repair was required.
  • During one boot attempt, the NVMe device was physically absent; this caused the post-kernel boot stall and initially looked like a kernel/initramfs failure.
  • Once hardware was restored, the newer kernel booted successfully.
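
A missing disk like this can be caught before the reboot by checking that every expected filesystem UUID is visible. A minimal sketch; missing_uuids is a hypothetical helper, and the UUID list has to come from the operator's own records since the journal does not list them:

```sh
# Print each expected UUID that has no entry in the given directory.
# On a live node the directory would be /dev/disk/by-uuid.
missing_uuids() {
  dir=$1
  shift
  for uuid in "$@"; do
    [ -e "$dir/$uuid" ] || echo "$uuid"
  done
}
```

Usage: missing_uuids /dev/disk/by-uuid <uuid-1> <uuid-2>. Empty output means all listed devices are present; any printed UUID is a reason to hold the reboot.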

Fixes applied:

  • Offline disk repair on another node.
  • Forced GRUB_DEFAULT to 6.8.12-19-pve during recovery.
  • Ran:
    • update-initramfs -u -k 6.8.12-19-pve
    • update-grub
    • proxmox-boot-tool refresh
    • grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
  • Replaced fallback EFI/BOOT/BOOTX64.EFI with the Proxmox shimx64.efi payload and synchronized the other fallback EFI files.
  • After the node booted with full hardware present, validated 6.17.13-1-pve and set it as the default.
  • Fixed stale AutoNAS export behavior by cleaning marked exports whose paths do not exist yet at boot.

Final state:

  • Running 6.17.13-1-pve
  • systemctl --failed empty
  • AutoNAS-1 and AutoNAS-2 active
  • Boot default set to 6.17.13-1-pve

baobab

Initial symptoms:

  • Upgrade completed cleanly, but the node failed to return after reboot.
  • Before recovery, fallback BOOTX64.EFI was still a small systemd-boot style binary instead of the Proxmox shim.
  • The node eventually required offline repair from another machine.

Findings:

  • Package state was healthy; the failure was again in the EFI/boot path.
  • BootOrder needed to prioritize the proxmox entry.
  • EFI/BOOT/BOOTX64.EFI needed to point into the Proxmox chain, not the old fallback path.
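
Whether the fallback binary already matches the Proxmox shim can be checked with a byte comparison. A minimal sketch, assuming the ESP is mounted at /boot/efi as in the commands in this journal; fallback_matches_shim is an illustrative name:

```sh
# Return 0 if EFI/BOOT/BOOTX64.EFI is byte-identical to the Proxmox shim,
# 1 if it differs, 2 if either file is missing.
fallback_matches_shim() {
  esp=${1:-/boot/efi}
  shim="$esp/EFI/proxmox/shimx64.efi"
  fallback="$esp/EFI/BOOT/BOOTX64.EFI"
  if [ ! -f "$shim" ] || [ ! -f "$fallback" ]; then
    echo "missing shim or fallback under $esp" >&2
    return 2
  fi
  cmp -s "$shim" "$fallback"
}
```

This only compares the fallback loader itself. On ebony the other fallback EFI files also had to be synchronized, so a match here is a necessary check rather than a complete one.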

Fixes applied:

  • Forced GRUB_DEFAULT to 6.8.12-19-pve for the first stable boot after upgrade.
  • Corrected BootOrder so proxmox is first.
  • Replaced fallback EFI/BOOT/BOOTX64.EFI with the Proxmox shimx64.efi.
  • Offline repair after the failed reboot:
    • fsck.vfat -a on the ESP
    • grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
    • update-grub
    • proxmox-boot-tool refresh
  • Fixed remaining failed units unrelated to the OS upgrade:
    • rc-local.service now ignores missing optional disks instead of failing
    • removed orphan discover_vms.service and discover_vms.timer

Final state:

  • Running 6.8.12-19-pve
  • systemctl --failed empty
  • Boot default left on 6.8.12-19-pve as the conservative stable choice

AutoNAS Follow-up

Two AutoNAS issues were identified and fixed during the upgrade recovery:

  1. attach-deferred could fail for disks with UUIDs that are not managed by AutoNAS.
    • Fix: return success for unmanaged UUIDs so systemd does not mark the unit failed.
  2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet.
    • Fix: remove AutoNAS-marked exports with missing paths during boot cleanup, then let normal mount/export flow recreate them when the disk is available.
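
The first fix follows a common pattern for templated units: treat "not mine" as success rather than failure. The sketch below only illustrates that pattern and is not the AutoNAS code. The function name attach_if_managed and the one-UUID-per-line file of managed UUIDs are assumptions made for the example:

```sh
# Exit 0 for UUIDs that are not in the managed list, so a unit wrapping
# this script is not marked failed for disks it should ignore.
attach_if_managed() {
  uuid=$1
  managed_file=$2   # hypothetical: one managed UUID per line
  if ! grep -qxF -- "$uuid" "$managed_file" 2>/dev/null; then
    echo "skip: $uuid is not managed" >&2
    return 0
  fi
  echo "attach: $uuid"   # placeholder for the real attach step
}
```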

Both fixes were deployed to:

  • baobab
  • ebony
  • tapia

Recovery Commands That Proved Useful

Most effective recovery sequence when a node no longer boots after the upgrade:

  1. Move the system disk to another node.
  2. Mount root and ESP.
  3. Force a known-good kernel via GRUB_DEFAULT in /etc/default/grub.
  4. Run:
    update-initramfs -u -k <known-good-kernel>
    update-grub
    proxmox-boot-tool refresh
    grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
  5. Verify:
    efibootmgr -v
    proxmox-boot-tool status
    ls -l /boot/efi/EFI/proxmox
    ls -l /boot/efi/EFI/BOOT
  6. If needed, replace EFI/BOOT/BOOTX64.EFI with Proxmox shimx64.efi.
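
Step 3 can be scripted so the edit is repeatable across nodes. A minimal sketch, assuming GRUB_DEFAULT sits on its own line in /etc/default/grub; set_grub_default is a hypothetical helper, and the menu entry identifier is passed in as-is because the journal does not record the exact value used:

```sh
# Replace (or append) the GRUB_DEFAULT line in a grub defaults file.
set_grub_default() {
  file=$1
  entry=$2
  tmp=$(mktemp) || return 1
  if ENTRY=$entry awk '
       /^GRUB_DEFAULT=/ { print "GRUB_DEFAULT=\"" ENVIRON["ENTRY"] "\""; done = 1; next }
       { print }
       END { if (!done) print "GRUB_DEFAULT=\"" ENVIRON["ENTRY"] "\"" }
     ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # keeps the original file's owner and mode
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

Usage: set_grub_default /etc/default/grub "<menu-entry-id>", followed by update-grub and proxmox-boot-tool refresh from step 4 so the change reaches the generated configuration.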

Recommended Post-Upgrade Checklist

Before rebooting a node after the Debian 13 / PVE 9 upgrade:

  1. Confirm package state is clean:
    • dpkg --audit
    • apt-get -s full-upgrade
  2. Refresh boot assets:
    • update-grub
    • proxmox-boot-tool refresh
  3. Verify EFI layout:
    • efibootmgr -v
    • proxmox-boot-tool status
    • EFI/proxmox/grub.cfg should be the standard ESP stub
    • EFI/BOOT/BOOTX64.EFI should route into the Proxmox chain, not an old systemd-boot or memtest fallback
  4. Suspend guests manually before reboot:
    • run /usr/local/sbin/pgs suspend -v
    • do not rely on legacy systemd automation for guest suspend/resume
    • otherwise pve-guests.service can stall shutdown while waiting for VMs/CTs to stop
  5. Verify all expected storage hardware is physically present before reboot.
  6. Keep one older known-good kernel available in GRUB until the new kernel is validated on that node.
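
The package-state check in step 1 can be reduced to a number. A minimal sketch, assuming the simulation output of apt-get -s marks planned installs and removals with lines beginning with Inst and Remv; count_pending_actions is an illustrative name:

```sh
# Count planned install/upgrade and removal actions in apt-get -s output.
# Reads the simulation output on stdin.
count_pending_actions() {
  grep -c -E '^(Inst|Remv) ' || true   # grep exits 1 when the count is 0
}
```

Usage: apt-get -s full-upgrade | count_pending_actions. A result of 0, together with empty output from dpkg --audit, suggests the package side is settled and attention can move to the boot path.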

Operational Note: Reboot Discipline

During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through pgs.

Observed effect:

  • pve-guests.service remained in deactivating
  • shutdown took a very long time
  • guest stop operations had to be forced manually
  • this obscured boot diagnostics and made the recovery look worse than the underlying boot issue

Operational rule going forward:

  1. Before any planned node reboot for maintenance, run:
    /usr/local/sbin/pgs suspend -v
  2. Reboot only after guest suspend/shutdown has completed.
  3. After the node or cluster is back in a stable state, run:
    /usr/local/sbin/pgs resume -v
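
The rule can be enforced with a small wrapper so the reboot never starts when the suspend step fails. A minimal sketch, assuming pgs returns a non-zero exit status on failure, which this journal does not confirm; suspend_then_reboot and the PGS and REBOOT_CMD variables are hypothetical names:

```sh
# Overridable so the wrapper can be dry-run with harmless commands.
PGS=${PGS:-/usr/local/sbin/pgs}
REBOOT_CMD=${REBOOT_CMD:-systemctl reboot}

# Suspend guests first; reboot only if that step reported success.
suspend_then_reboot() {
  if ! "$PGS" suspend -v; then
    echo "guest suspend failed; not rebooting" >&2
    return 1
  fi
  $REBOOT_CMD   # intentionally unquoted: command plus arguments
}
```

A dry run with REBOOT_CMD='echo reboot' shows what would happen without restarting the node. The resume step stays manual, since it should only run once the node or cluster is confirmed stable.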

Outcome

The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes.

Main lesson:

  • the risky part of this upgrade was not package dependency resolution
  • it was EFI and boot chain consistency after the transition to Debian 13 / Proxmox VE 9