# 2026-03-07 Trixie / Proxmox VE 9 Upgrade Journal

## Scope

Upgrade and recovery journal for the Madagascar cluster nodes:

- `tapia`
- `ebony`
- `baobab`

All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (`trixie`) / Proxmox VE 9.1.

## Common Pattern Observed

The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in `apt` or `dpkg`.

Recurring issues:

- EFI fallback binaries under `EFI/BOOT` were inconsistent across nodes.
- Boot order could still point to a non-Proxmox path even when the `proxmox` entry existed.
- Some systems still had `systemd-boot` style fallback artifacts while the host had moved to GRUB + `proxmox-boot-tool`.
- Testing was complicated by slow shutdowns and, on `ebony`, by an NVMe device that was physically absent during one boot attempt.

## Node Journal

### tapia

Initial symptoms:

- Upgrade to `trixie` completed, but the node no longer booted normally.
- UEFI shell could see the ESP and Proxmox EFI payloads.
- Launching `\EFI\proxmox\grubx64.efi` initially dropped back into BIOS settings.
- After loader repair, the node first booted successfully on the old kernel (`6.8.12-19-pve`).

Findings:

- The system had moved to Debian 13 and Proxmox VE 9 packages correctly.
- GRUB and EFI files existed, but the boot path was inconsistent after the upgrade.
- `EFI/proxmox/grub.cfg` on `tapia` had drifted from the standard Proxmox ESP stub and referenced Btrfs directly.
- `AutoNAS` also produced noisy failed units for unmanaged boot disk UUIDs.
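
The `grub.cfg` drift noted above can be detected with a small check. This is a minimal sketch, not a Proxmox tool: `is_esp_stub` is a hypothetical helper name, and it assumes the stub consists of the three directives listed in this journal (`search.fs_uuid`, `set prefix=($root)/grub`, `configfile $prefix/grub.cfg`) and nothing Btrfs-specific.

```bash
# Hypothetical helper: succeed only if the file looks like the three-line ESP stub.
# Return codes: 0 = looks like the stub, 1 = drifted, 2 = file not readable.
is_esp_stub() {
    local cfg="$1"
    [ -r "$cfg" ] || return 2
    # All three stub directives must be present (quotes around /grub are tolerated).
    grep -q '^[[:space:]]*search\.fs_uuid[[:space:]]' "$cfg" || return 1
    grep -Eq 'set prefix=\(\$root\).?/grub' "$cfg" || return 1
    grep -qF 'configfile $prefix/grub.cfg' "$cfg" || return 1
    # The stub should not reference Btrfs directly.
    ! grep -qi 'btrfs' "$cfg"
}
```

Assuming the ESP is mounted at `/boot/efi`, a call such as `is_esp_stub /boot/efi/EFI/proxmox/grub.cfg || echo "grub.cfg has drifted"` would flag the condition seen on `tapia`.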

Fixes applied:

- Offline disk repair on another node.
- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
- Ran:
  - `update-initramfs -u -k 6.8.12-19-pve`
  - `update-grub`
  - `proxmox-boot-tool refresh`
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
- Restored `EFI/proxmox/grub.cfg` to the standard Proxmox ESP stub:
  - `search.fs_uuid <ESP>`
  - `set prefix=($root)/grub`
  - `configfile $prefix/grub.cfg`
- Later confirmed `6.17.13-1-pve` boots correctly and made it the default again.
- Deployed an `AutoNAS` fix so unmanaged UUIDs are ignored instead of failing `autonas-attach@...` units.
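
Forcing `GRUB_DEFAULT` during recovery amounts to rewriting one line in the GRUB defaults file. The sketch below shows one way to do that idempotently; `set_grub_default` is a hypothetical helper, and the exact menu entry string to pass is an assumption that should be checked against the generated `grub.cfg` on the node.

```bash
# Hypothetical helper: set GRUB_DEFAULT in a defaults file, replacing any existing value.
set_grub_default() {
    local file="$1" entry="$2" tmp
    [ -f "$file" ] || return 1
    tmp="$(mktemp)" || return 1
    # Drop any existing GRUB_DEFAULT line, then append the new one.
    grep -v '^GRUB_DEFAULT=' "$file" > "$tmp" || true
    printf 'GRUB_DEFAULT="%s"\n' "$entry" >> "$tmp"
    # Write back through cat so the original file's ownership and mode are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

A call might look like `set_grub_default /etc/default/grub '<submenu title>><entry title for 6.8.12-19-pve>'`, where both titles are placeholders to be taken from `grep -E "submenu |menuentry " /boot/grub/grub.cfg`. `update-grub` still has to run afterwards for the change to take effect.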

Final state:

- Running `6.17.13-1-pve`
- `systemctl --failed` empty
- Boot default set to `6.17.13-1-pve`

### ebony

Initial symptoms:

- Upgrade completed cleanly, but the node did not return after reboot.
- UEFI fallback could boot `memtest`, but Proxmox GRUB payloads returned to BIOS settings.
- After EFI repair, boot progressed through `Loading Linux...` and `Loading initial ramdisk...`, then stalled.

Findings:

- The fallback `EFI/BOOT/BOOTX64.EFI` was not aligned with the Proxmox boot chain and could route to memtest.
- GRUB loader repair was required.
- During one boot attempt, the NVMe device was physically absent; this caused the post-kernel boot stall and initially looked like a kernel/initramfs failure.
- Once hardware was restored, the newer kernel booted successfully.
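
Because the missing NVMe device was easy to mistake for a kernel/initramfs problem, a presence check before reboot can rule it out early. The following is a sketch under assumptions: `missing_devices` is a hypothetical helper, and the device names passed to it would be whatever stable identifiers the node is expected to expose (for example entries under `/dev/disk/by-id`).

```bash
# Hypothetical helper: print each expected device name that is absent from a directory.
# Returns non-zero if at least one device is missing.
missing_devices() {
    local dir="$1" name rc=0
    shift
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            printf '%s\n' "$name"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage could be `missing_devices /dev/disk/by-id <expected-nvme-id> || echo "storage hardware missing"`, with `<expected-nvme-id>` standing in for the node's real identifier.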

Fixes applied:

- Offline disk repair on another node.
- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
- Ran:
  - `update-initramfs -u -k 6.8.12-19-pve`
  - `update-grub`
  - `proxmox-boot-tool refresh`
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi` payload and synchronized the other fallback EFI files.
- After the node booted with full hardware present, validated `6.17.13-1-pve` and set it as the default.
- Fixed stale `AutoNAS` export behavior by cleaning marked exports whose paths do not exist yet at boot.
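
Whether the fallback binary really is the Proxmox shim can be confirmed with a byte comparison. This sketch assumes the ESP is mounted at `/boot/efi` by default and that the shim lives at `EFI/proxmox/shimx64.efi`; `fallback_matches_shim` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if the fallback loader is byte-identical to the Proxmox shim.
fallback_matches_shim() {
    local esp="${1:-/boot/efi}"
    cmp -s "$esp/EFI/BOOT/BOOTX64.EFI" "$esp/EFI/proxmox/shimx64.efi"
}
```

A non-zero result means the files differ or one of them is missing, which is the state `ebony` was in before the repair.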

Final state:

- Running `6.17.13-1-pve`
- `systemctl --failed` empty
- `AutoNAS-1` and `AutoNAS-2` active
- Boot default set to `6.17.13-1-pve`

### baobab

Initial symptoms:

- Upgrade completed cleanly, but the node failed to return after reboot.
- Before recovery, fallback `BOOTX64.EFI` was still a small `systemd-boot` style binary instead of the Proxmox shim.
- The node eventually required offline repair from another machine.

Findings:

- Package state was healthy; the failure was again in the EFI/boot path.
- `BootOrder` needed to prioritize the `proxmox` entry.
- `EFI/BOOT/BOOTX64.EFI` needed to point into the Proxmox chain, not the old fallback path.
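
The `BootOrder` finding can be checked mechanically by resolving the first `BootOrder` ID to its label. The sketch below parses `efibootmgr` output from stdin; it assumes the usual `BootOrder: 0001,0000` and `Boot0001* label<TAB>device-path` layout, and `first_boot_label` is a hypothetical helper name.

```bash
# Hypothetical helper: read efibootmgr output on stdin and print the label of the
# entry that comes first in BootOrder.
first_boot_label() {
    awk '
        /^BootOrder:/ { split($2, order, ","); first = order[1] }
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]/ { lines[substr($0, 5, 4)] = $0 }
        END {
            line = lines[first]
            sub(/^Boot....[* ] ?/, "", line)   # drop the BootNNNN prefix and active marker
            sub(/\t.*/, "", line)              # drop the device path after the tab
            print line
        }
    '
}
```

On a node in the desired state, `efibootmgr | first_boot_label` should print `proxmox`. Reordering itself is done with `efibootmgr -o <id,id,...>`.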

Fixes applied:

- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` for the first stable boot after upgrade.
- Corrected `BootOrder` so `proxmox` is first.
- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi`.
- Offline repair after the failed reboot:
  - `fsck.vfat -a` on the ESP
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
  - `update-grub`
  - `proxmox-boot-tool refresh`
- Fixed remaining failed units unrelated to the OS upgrade:
  - `rc-local.service` now ignores missing optional disks instead of failing
  - removed orphan `discover_vms.service` and `discover_vms.timer`

Final state:

- Running `6.8.12-19-pve`
- `systemctl --failed` empty
- Boot default left on `6.8.12-19-pve` as the conservative stable choice

## AutoNAS Follow-up

Two AutoNAS issues were identified and fixed during the upgrade recovery:

1. `attach-deferred` could fail for disks with UUIDs that are not managed by AutoNAS.
   - Fix: return success for unmanaged UUIDs so `systemd` does not mark the unit failed.

2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet.
   - Fix: remove AutoNAS-marked exports with missing paths during boot cleanup, then let normal mount/export flow recreate them when the disk is available.

Both fixes were deployed to:

- `baobab`
- `ebony`
- `tapia`
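
The first fix follows a general pattern: a handler invoked for every disk should treat "not mine" as success rather than failure. The sketch below illustrates only that pattern. It is not the AutoNAS code; the function name, the managed-UUID list file, and the log messages are all hypothetical.

```bash
# Illustration of the pattern only; names and file layout are hypothetical.
# $1 = disk UUID, $2 = file listing managed UUIDs, one per line.
attach_if_managed() {
    local uuid="$1" managed_list="$2"
    if ! grep -qxF -- "$uuid" "$managed_list" 2>/dev/null; then
        # Unmanaged disk: nothing to do, and that is not an error,
        # so systemd does not mark the unit as failed.
        echo "skip: $uuid is not managed" >&2
        return 0
    fi
    echo "attach: $uuid"   # placeholder for the real attach logic
}
```

With this shape, a templated unit instantiated for an unmanaged boot disk exits 0 and no longer shows up in `systemctl --failed`.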

## Recovery Commands That Proved Useful

Most effective recovery sequence when a node no longer boots after the upgrade:

1. Move the system disk to another node.
2. Mount root and ESP.
3. Force a known-good kernel in `/etc/default/grub`.
4. Run:

```bash
update-initramfs -u -k <known-good-kernel>
update-grub
proxmox-boot-tool refresh
grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
```

5. Verify:

```bash
efibootmgr -v
proxmox-boot-tool status
ls -l /boot/efi/EFI/proxmox
ls -l /boot/efi/EFI/BOOT
```

6. If needed, replace `EFI/BOOT/BOOTX64.EFI` with Proxmox `shimx64.efi`.
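
Step 6 can be done so that it stays reversible. The following sketch copies the shim over the fallback path and keeps the previous binary alongside it; `install_fallback_shim` is a hypothetical helper, and the `/boot/efi` default and `.bak` suffix are assumptions.

```bash
# Hypothetical helper: replace EFI/BOOT/BOOTX64.EFI with the Proxmox shim,
# keeping a backup of whatever was there before.
install_fallback_shim() {
    local esp="${1:-/boot/efi}"
    local shim="$esp/EFI/proxmox/shimx64.efi"
    local fallback="$esp/EFI/BOOT/BOOTX64.EFI"
    [ -f "$shim" ] || { echo "missing $shim" >&2; return 1; }
    mkdir -p "$esp/EFI/BOOT" || return 1
    # Keep the previous fallback so the change can be reverted.
    if [ -f "$fallback" ]; then
        cp -f "$fallback" "$fallback.bak" || return 1
    fi
    cp -f "$shim" "$fallback"
}
```

As on `ebony`, the other fallback EFI files under `EFI/BOOT` may also need to be synchronized with the Proxmox ones for the chain to complete.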

## Recommended Post-Upgrade Checklist

Before rebooting a node after the Debian 13 / PVE 9 upgrade:

1. Confirm package state is clean:
   - `dpkg --audit`
   - `apt-get -s full-upgrade`
2. Refresh boot assets:
   - `update-grub`
   - `proxmox-boot-tool refresh`
3. Verify EFI layout:
   - `efibootmgr -v`
   - `proxmox-boot-tool status`
   - `EFI/proxmox/grub.cfg` should be the standard ESP stub
   - `EFI/BOOT/BOOTX64.EFI` should route into the Proxmox chain, not an old `systemd-boot` or memtest fallback
4. Suspend guests manually before reboot:
   - run `/usr/local/sbin/pgs suspend -v`
   - do not rely on legacy `systemd` automation for guest suspend/resume
   - if guests are still running at shutdown, `pve-guests.service` can stall while waiting for VMs/CTs to stop
5. Verify all expected storage hardware is physically present before reboot.
6. Keep one older known-good kernel available in GRUB until the new kernel is validated on that node.
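
Item 6 can be verified before reboot by counting the kernel images still installed. This is a minimal sketch that assumes kernel images are present as `/boot/vmlinuz-<version>`; both function names are hypothetical.

```bash
# Hypothetical helper: print the version of every kernel image found in a boot directory.
list_installed_kernels() {
    local boot="${1:-/boot}" f
    for f in "$boot"/vmlinuz-*; do
        [ -e "$f" ] || continue          # glob matched nothing
        printf '%s\n' "${f##*/vmlinuz-}"
    done
}

# Hypothetical helper: succeed only if at least two kernels are installed,
# so one can serve as the known-good fallback.
has_fallback_kernel() {
    local count
    count="$(list_installed_kernels "$1" | wc -l | tr -d '[:space:]')"
    [ "$count" -ge 2 ]
}
```

`has_fallback_kernel || echo "only one kernel installed"` is enough as a pre-reboot gate. It only confirms that a second kernel exists, not that it boots.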

## Operational Note: Reboot Discipline

During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through `pgs`.

Observed effect:

- `pve-guests.service` remained in `deactivating`
- shutdown took a very long time
- guest stop operations had to be forced manually
- this obscured boot diagnostics and made the recovery look worse than the underlying boot issue

Operational rule going forward:

1. Before any planned node reboot for maintenance, run:

```bash
/usr/local/sbin/pgs suspend -v
```

2. Reboot only after guest suspend/shutdown has completed.
3. After the node or cluster is back in a stable state, run:

```bash
/usr/local/sbin/pgs resume -v
```
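
The rule is easier to follow when the suspend step and the reboot are tied together. The wrapper below is a sketch under assumptions: it treats a non-zero exit from `pgs suspend -v` as failure (not confirmed here), and `reboot_after_suspend`, `PGS`, and `REBOOT_CMD` are hypothetical names introduced so the flow can be dry-run.

```bash
# Hypothetical wrapper: reboot only if guest suspend reported success.
# PGS and REBOOT_CMD can be overridden for a dry run.
reboot_after_suspend() {
    local pgs="${PGS:-/usr/local/sbin/pgs}"
    local reboot_cmd="${REBOOT_CMD:-systemctl reboot}"
    if ! "$pgs" suspend -v; then
        echo "guest suspend failed; not rebooting" >&2
        return 1
    fi
    # Intentionally unquoted so a command with arguments is split into words.
    $reboot_cmd
}
```

Setting `REBOOT_CMD='echo would reboot'` before calling the function exercises the suspend step without restarting the node.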

## Outcome

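The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes. `tapia` and `ebony` now run `6.17.13-1-pve`; `baobab` was left on `6.8.12-19-pve` as the conservative choice.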

Main lesson:

- The risky part of this upgrade was not package dependency resolution; it was EFI and boot chain consistency after the transition to Debian 13 / Proxmox VE 9.
