Bogdan Timofte authored 3 months ago

# 2026-03-07 Trixie / Proxmox VE 9 Upgrade Journal

## Scope

Upgrade and recovery journal for the Madagascar cluster nodes:

- `tapia`
- `ebony`
- `baobab`

All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (`trixie`) / Proxmox VE 9.1.

## Common Pattern Observed

The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in `apt` or `dpkg`.

Recurring issues:

- EFI fallback binaries under `EFI/BOOT` were inconsistent across nodes.
- Boot order could still point to a non-Proxmox path even when the `proxmox` entry existed.
- Some systems still had `systemd-boot` style fallback artifacts while the host had moved to GRUB + `proxmox-boot-tool`.
- Testing was complicated by slow shutdowns and, in one case, missing hardware during boot.

## Node Journal

### tapia

Initial symptoms:

- Upgrade to `trixie` completed, but the node no longer booted normally.
- UEFI shell could see the ESP and Proxmox EFI payloads.
- Launching `\EFI\proxmox\grubx64.efi` initially dropped back into BIOS settings.
- Later, after loader repair, boot worked on the old kernel first.

Findings:

- The system had moved to Debian 13 and Proxmox VE 9 packages correctly.
- GRUB and EFI files existed, but the boot path was inconsistent after the upgrade.
- `EFI/proxmox/grub.cfg` on `tapia` had drifted from the standard Proxmox ESP stub and referenced Btrfs directly.
- `AutoNAS` also produced noisy failed units for unmanaged boot disk UUIDs.

Fixes applied:

- Offline disk repair on another node.
- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
- Ran:
  - `update-initramfs -u -k 6.8.12-19-pve`
  - `update-grub`
  - `proxmox-boot-tool refresh`
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
- Restored `EFI/proxmox/grub.cfg` to the standard Proxmox ESP stub:
  - `search.fs_uuid <ESP>`
  - `set prefix=($root)/grub`
  - `configfile $prefix/grub.cfg`
- Later confirmed `6.17.13-1-pve` boots correctly and made it the default again.
- Deployed an `AutoNAS` fix so unmanaged UUIDs are ignored instead of failing `autonas-attach@...` units.

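The three-line ESP stub restored on `tapia` can be generated rather than typed by hand. The following is a minimal sketch, not the procedure actually used: `render_esp_stub` is a hypothetical helper name, and the trailing `root` argument on the `search.fs_uuid` line is an assumption modelled on the stub that Debian's `grub-install` writes, not something recorded in this journal.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the three-line GRUB stub for EFI/proxmox/grub.cfg.
# Argument: the filesystem UUID the stub should search for (<ESP> in the journal).
render_esp_stub() {
    local fs_uuid="$1"
    if [ -z "$fs_uuid" ]; then
        echo "usage: render_esp_stub <filesystem-uuid>" >&2
        return 1
    fi
    # Assumption: "root" names the GRUB variable that receives the matching device.
    printf 'search.fs_uuid %s root\n' "$fs_uuid"
    # Single quotes keep $root and $prefix literal; GRUB expands them at boot,
    # the shell must not.
    printf '%s\n' 'set prefix=($root)/grub'
    printf '%s\n' 'configfile $prefix/grub.cfg'
}
```

The UUID can be read with `blkid -s UUID -o value <device>`. Redirecting the output into `EFI/proxmox/grub.cfg` on the mounted ESP is left to the operator, after comparing it with the stub on a healthy node.
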
Final state:

- Running `6.17.13-1-pve`
- `systemctl --failed` empty
- Boot default set to `6.17.13-1-pve`

### ebony

Initial symptoms:

- Upgrade completed cleanly, but the node did not return after reboot.
- UEFI fallback could boot `memtest`, but Proxmox GRUB payloads returned to BIOS settings.
- After EFI repair, boot progressed to:
  - `Loading Linux...`
  - `Loading initial ramdisk...`

  and then stopped.

Findings:

- The fallback `EFI/BOOT/BOOTX64.EFI` was not aligned with the Proxmox boot chain and could route to memtest.
- GRUB loader repair was required.
- During one boot attempt, the NVMe device was physically absent; this caused the post-kernel boot stall and initially looked like a kernel/initramfs failure.
- Once hardware was restored, the newer kernel booted successfully.

Fixes applied:

- Offline disk repair on another node.
- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
- Ran:
  - `update-initramfs -u -k 6.8.12-19-pve`
  - `update-grub`
  - `proxmox-boot-tool refresh`
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi` payload and synchronized the other fallback EFI files.
- After the node booted with full hardware present, validated `6.17.13-1-pve` and set it as the default.
- Fixed stale `AutoNAS` export behavior by cleaning marked exports whose paths do not exist yet at boot.

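The fallback replacement done on `ebony` can be sketched as a small function. This is an illustration under assumptions, not the exact commands that were run: `install_fallback_shim` is a hypothetical name, the ESP mount point is passed in by the caller, and the set of "other fallback EFI files" is assumed to be `grubx64.efi` and `mmx64.efi`. Copying `grubx64.efi` matters because shim loads it from its own directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: make EFI/BOOT/BOOTX64.EFI a copy of the Proxmox shim.
# Argument: mount point of the ESP (for example /boot/efi).
install_fallback_shim() {
    local esp="$1"
    local src="$esp/EFI/proxmox"
    local dst="$esp/EFI/BOOT"
    local f

    if [ -z "$esp" ] || [ ! -f "$src/shimx64.efi" ]; then
        echo "install_fallback_shim: $src/shimx64.efi not found" >&2
        return 1
    fi
    mkdir -p "$dst" || return 1

    # Keep the previous fallback loader if it differs from the shim.
    if [ -f "$dst/BOOTX64.EFI" ] && ! cmp -s "$src/shimx64.efi" "$dst/BOOTX64.EFI"; then
        cp -f "$dst/BOOTX64.EFI" "$dst/BOOTX64.EFI.bak" || return 1
    fi
    cp -f "$src/shimx64.efi" "$dst/BOOTX64.EFI" || return 1

    # Assumption: these are the companion files that need to sit next to shim.
    for f in grubx64.efi mmx64.efi; do
        if [ -f "$src/$f" ]; then
            cp -f "$src/$f" "$dst/$f" || return 1
        fi
    done
    return 0
}
```

Afterwards, `cmp /boot/efi/EFI/proxmox/shimx64.efi /boot/efi/EFI/BOOT/BOOTX64.EFI` should report no difference.
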
Final state:

- Running `6.17.13-1-pve`
- `systemctl --failed` empty
- `AutoNAS-1` and `AutoNAS-2` active
- Boot default set to `6.17.13-1-pve`

### baobab

Initial symptoms:

- Upgrade completed cleanly, but the node failed to return after reboot.
- Before recovery, fallback `BOOTX64.EFI` was still a small `systemd-boot` style binary instead of the Proxmox shim.
- The node eventually required offline repair from another machine.

Findings:

- Package state was healthy; the failure was again in the EFI/boot path.
- `BootOrder` needed to prioritize the `proxmox` entry.
- `EFI/BOOT/BOOTX64.EFI` needed to point into the Proxmox chain, not the old fallback path.

Fixes applied:

- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` for the first stable boot after upgrade.
- Corrected `BootOrder` so `proxmox` is first.
- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi`.
- Offline repair after the failed reboot:
  - `fsck.vfat -a` on the ESP
  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
  - `update-grub`
  - `proxmox-boot-tool refresh`
- Fixed remaining failed units unrelated to the OS upgrade:
  - `rc-local.service` now ignores missing optional disks instead of failing
  - removed orphan `discover_vms.service` and `discover_vms.timer`

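The `BootOrder` correction on `baobab` amounts to moving the `proxmox` entry to the front of the existing order. A minimal sketch of that reordering, assuming the usual `efibootmgr` listing format (`BootOrder: ...` followed by `BootXXXX* label` lines); `proxmox_first_order` is a hypothetical helper that only computes the new order and changes nothing by itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `efibootmgr` output on stdin and print a BootOrder
# string with the entry labelled "proxmox" moved to the front.
# Exits non-zero if no such entry or no BootOrder line is found.
proxmox_first_order() {
    awk '
        /^BootOrder:/ { order = $2 }
        # Active entries carry "*", inactive ones a space, before the label.
        /^Boot[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][* ] proxmox([ \t]|$)/ {
            id = substr($0, 5, 4)
        }
        END {
            if (id == "" || order == "") exit 1
            out = id
            n = split(order, parts, ",")
            for (i = 1; i <= n; i++)
                if (toupper(parts[i]) != toupper(id)) out = out "," parts[i]
            print out
        }
    '
}
```

On a live node the result would be applied with `efibootmgr -o "$(efibootmgr | proxmox_first_order)"`, after checking the printed order by eye.
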
Final state:

- Running `6.8.12-19-pve`
- `systemctl --failed` empty
- Boot default left on `6.8.12-19-pve` as the conservative stable choice

## AutoNAS Follow-up

Two AutoNAS issues were identified and fixed during the upgrade recovery:

1. `attach-deferred` could fail for disks with UUIDs that are not managed by AutoNAS.
   - Fix: return success for unmanaged UUIDs so `systemd` does not mark the unit failed.

2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet.
   - Fix: remove AutoNAS-marked exports with missing paths during boot cleanup, then let normal mount/export flow recreate them when the disk is available.

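The idea behind the first fix can be illustrated with a short sketch. AutoNAS internals are not shown in this journal, so everything here is hypothetical: the function name, the notion of a plain-text file listing one managed UUID per line, and the log messages. The only point it demonstrates is that an unmanaged UUID leads to exit status 0, which `systemd` treats as success.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the AutoNAS source.
# Arguments: disk UUID, path to an assumed file with one managed UUID per line.
attach_deferred() {
    local uuid="$1"
    local managed_file="$2"

    # Exact, whole-line match so a UUID prefix cannot match by accident.
    if ! grep -qxF -- "$uuid" "$managed_file" 2>/dev/null; then
        echo "autonas: $uuid is not managed, skipping" >&2
        return 0   # success: the unit must not end up in `systemctl --failed`
    fi

    echo "autonas: attaching $uuid"
    # The real attach logic would run here.
}
```
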
Both fixes were deployed to:

- `baobab`
- `ebony`
- `tapia`

## Recovery Commands That Proved Useful

Most effective recovery sequence when a node no longer boots after the upgrade:

1. Move the system disk to another node.
2. Mount root and ESP.
3. Force a known-good kernel in `/etc/default/grub`.
4. Run:

   ```bash
   update-initramfs -u -k <known-good-kernel>
   update-grub
   proxmox-boot-tool refresh
   grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
   ```

5. Verify:

   ```bash
   efibootmgr -v
   proxmox-boot-tool status
   ls -l /boot/efi/EFI/proxmox
   ls -l /boot/efi/EFI/BOOT
   ```

6. If needed, replace `EFI/BOOT/BOOTX64.EFI` with Proxmox `shimx64.efi`.

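Step 3 of the sequence above, forcing a known-good kernel, comes down to rewriting the `GRUB_DEFAULT` line. A minimal sketch follows; `set_grub_default` is a hypothetical helper, and the exact menu entry string is an assumption that has to be checked against the `menuentry` and `submenu` titles in the generated `grub.cfg` on the node being repaired.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set GRUB_DEFAULT in a grub defaults file.
# Arguments: path to the file (normally /etc/default/grub), menu entry string.
set_grub_default() {
    local file="$1"
    local entry="$2"
    local tmpfile

    if [ ! -f "$file" ] || [ -z "$entry" ]; then
        echo "usage: set_grub_default <file> <menu-entry>" >&2
        return 1
    fi

    if grep -q '^GRUB_DEFAULT=' "$file"; then
        tmpfile="$(mktemp)" || return 1
        # The entry is passed through the environment so characters such as
        # ">" or "/" in the menu path need no escaping.
        ENTRY="$entry" awk '
            /^GRUB_DEFAULT=/ { print "GRUB_DEFAULT=\"" ENVIRON["ENTRY"] "\""; next }
            { print }
        ' "$file" > "$tmpfile" || { rm -f "$tmpfile"; return 1; }
        # Overwrite in place to keep the original ownership and mode.
        cat "$tmpfile" > "$file"
        rm -f "$tmpfile"
    else
        printf 'GRUB_DEFAULT="%s"\n' "$entry" >> "$file"
    fi
}
```

A call might look like `set_grub_default /etc/default/grub "Advanced options for Proxmox VE GNU/Linux>Proxmox VE GNU/Linux, with Linux 6.8.12-19-pve"`, where the title is illustrative only. The change takes effect after the `update-grub` and `proxmox-boot-tool refresh` calls in step 4.
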
## Recommended Post-Upgrade Checklist

Before rebooting a node after the Debian 13 / PVE 9 upgrade:

1. Confirm package state is clean:
   - `dpkg --audit`
   - `apt-get -s full-upgrade`
2. Refresh boot assets:
   - `update-grub`
   - `proxmox-boot-tool refresh`
3. Verify EFI layout:
   - `efibootmgr -v`
   - `proxmox-boot-tool status`
   - `EFI/proxmox/grub.cfg` should be the standard ESP stub
   - `EFI/BOOT/BOOTX64.EFI` should route into the Proxmox chain, not an old `systemd-boot` or memtest fallback
4. Suspend guests manually before reboot:
   - run `/usr/local/sbin/pgs suspend -v`
   - do not rely on legacy `systemd` automation for guest suspend/resume
   - otherwise `pve-guests.service` can stall shutdown while waiting for VMs/CTs to stop
5. Verify all expected storage hardware is physically present before reboot.
6. Keep one older known-good kernel available in GRUB until the new kernel is validated on that node.

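Item 6 of the checklist can be checked mechanically before a reboot. The sketch below assumes the standard Debian naming of kernel images and initramfs files (`vmlinuz-<version>` and `initrd.img-<version>`); `kernel_present` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if both the kernel image and its initramfs
# exist and are non-empty for the given version.
# Arguments: boot directory (normally /boot), kernel version string.
kernel_present() {
    local boot_dir="$1"
    local version="$2"
    [ -n "$version" ] \
        && [ -s "$boot_dir/vmlinuz-$version" ] \
        && [ -s "$boot_dir/initrd.img-$version" ]
}
```

For example, `kernel_present /boot 6.8.12-19-pve || echo "fallback kernel missing, do not reboot"` would flag a node where the older kernel had already been removed.
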
## Operational Note: Reboot Discipline

During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through `pgs`.

Observed effect:

- `pve-guests.service` remained in `deactivating`
- shutdown took a very long time
- guest stop operations had to be forced manually
- this obscured boot diagnostics and made the recovery look worse than the underlying boot issue

Operational rule going forward:

1. Before any planned node reboot for maintenance, run:

   ```bash
   /usr/local/sbin/pgs suspend -v
   ```

2. Reboot only after guest suspend/shutdown has completed.
3. After the node or cluster is back in a stable state, run:

   ```bash
   /usr/local/sbin/pgs resume -v
   ```

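Rule 2 can be backed by a quick check that no VM still reports `running` before the reboot is issued. This is a sketch under assumptions: it parses the tabular output of `qm list` and assumes the status is the third column, and it assumes that a guest handled by `pgs suspend` no longer shows as `running`. `count_running_vms` is a hypothetical helper, and containers would need a comparable check of their own.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `qm list` output on stdin and print the number of
# VMs whose STATUS column is "running". The header line is skipped.
count_running_vms() {
    awk 'NR > 1 && $3 == "running" { n++ } END { print n + 0 }'
}

# Intended use on a node (not executed here):
#   if [ "$(qm list | count_running_vms)" -eq 0 ]; then systemctl reboot; fi
```
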
## Outcome

The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes.

Main lesson:

- the risky part of this upgrade was not package dependency resolution
- it was EFI and boot chain consistency after the transition to Debian 13 / Proxmox VE 9