Status: investigating
Priority: high
Created: 2026-03-06
Updated: 2026-03-06
Assigned to: unassigned
On tapia, tb-enlist@thunderbolt0.service failed during ExecStop, and after a post-boot disconnect/reconnect the thunderbolt0 interface did not come back.
After reboot, the Tapia-Baobab Thunderbolt link briefly came up, then disconnected. A bad ExecStop= command in tb-enlist@.service caused unit failure (status=255) when systemd stopped the instance. In parallel, boltd logged a probing timeout after reconnect, and thunderbolt0 was no longer present on tapia.
- Environment: kernel 6.8.12-19-pve, systemd oneshot templated unit.
- Host: tapia with the current shared tb-enlist@.service.
- Checked with: systemctl status tb-enlist@thunderbolt0.service and ip link show thunderbolt0.
- Expected: tb-enlist@*.service should stop cleanly when a Thunderbolt netdev disappears.
- Actual: tb-enlist@thunderbolt0.service entered failed state on stop.
- Actual: ip rejected the shell syntax in ExecStop.
- Actual: thunderbolt0 disappeared on tapia and did not reappear after reconnect.
- Impact: when tapia reboots, the link stays down until a physical unplug/replug.

Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage.
Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION
Mar 06 08:27:22 tapia boltd[838]: probing: started [1000]
Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000)
Device "thunderbolt0" does not exist.
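
The ip error above is what results when 2>/dev/null and || true reach the program as literal arguments, because systemd does not run Exec lines through a shell. As a rough guard against regressions, a check along the following lines could flag such lines in a unit file. The function name find_shell_syntax and the regular expressions are illustrative assumptions, not part of the existing tooling, and the pattern may over-match or miss some cases.

```shell
#!/bin/sh
# Hypothetical helper: prints Exec*= lines that contain shell-only operators
# (||, &&, redirections) but do not start by invoking /bin/sh or /bin/bash.
find_shell_syntax() {
    grep -E '^Exec[A-Za-z]*=' "$1" \
        | grep -Ev '^Exec[A-Za-z]*=[-@:+!]*(/usr)?/bin/(ba)?sh( |$)' \
        | grep -E '(\|\||&&|[0-9]*>)'
}
```

Pass the path of the unit file as the only argument. The function exits non-zero when no suspicious line is found, so it can be used directly in a pre-deploy check.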
- tb-bridge.service was active and thunderbridge existed on both baobab and tapia.
- ExecStop lines used shell syntax in non-shell context:
  ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true
  ExecStop=/sbin/ip link set %i down 2>/dev/null || true
- Replaced with shell-free commands using the - prefix:
  ExecStop=-/sbin/ip link set %i nomaster
  ExecStop=-/sbin/ip link set %i down
- Fix applied on tapia and validated that the unit can be reset/stopped without entering failed.
- tapia link was observed up again (thunderbolt0 forwarding), confirming intermittent behavior.
- Added recovery units (tb-recover.service + tb-recover.timer) to re-enlist interfaces and force rescan when no thunderbolt netdev is present.
- Tested on tapia by intentionally stopping tb-enlist@thunderbolt0; the recovery timer re-attached the interface in the next cycle and it returned to forwarding state.
- Rolled out to baobab and ebony; timer enabled and active on all three nodes.
- Disconnect on tapia (host disconnected at 10:01:30); recovery happened after a reconnect event (new host found at 10:01:48), consistent with unplug/replug recovery.
- Updated tb-recover.sh: if no thunderbolt netdev after rescan, restart bolt.service and retrigger udev as fallback.
- Test on tapia using thunderbolt-net unbind/bind (0-1.0) passed; thunderbolt0 reappeared and returned to forwarding within seconds (TEST_PASS).
- Rolled out to baobab and ebony; tb-recover.timer active/enabled and tb-enlist@* units active on all nodes.
- Recurrence on tapia: thunderbridge up but thunderbolt0 missing entirely (tb-enlist@thunderbolt0 inactive), while the peer baobab port showed NO-CARRIER.
- Existing fallback (bolt.service restart) was insufficient; repeated boltd messages observed: failed to get boot_acl: Connection timed out.
- Recovered via NHI PCI remove + rescan; thunderbolt0 recreated and rejoined bridge.
- tb-recover.sh updated with a cooldown-limited NHI rescan fallback (and guarded boltd restart fallback) and deployed cluster-wide.
- On tapia running 6.17.13-1-pve, the first NHI rescan rediscovered peer host 0-1 but did not recreate 0-1.0; a second manual NHI reset at 03:42 recreated thunderbolt0 and restored forwarding.
- A peer host node without a *.0 service now triggers one bounded second NHI reset in the same tb-recover.sh run.

Use a layered recovery approach:
1. Keep ExecStop commands shell-free and use the systemd - prefix to ignore expected failures when the device is already gone.
2. Run periodic recovery (tb-recover.timer) that re-enlists existing thunderbolt netdevs and forces a controller/net udev retrigger when no thunderbolt netdev is present.
3. If the netdev is still missing, perform a cooldown-limited Thunderbolt NHI PCI remove + rescan (soft replug equivalent), then retrigger udev.
4. If the controller comes back only as a peer xdomain host node (for example 0-1) with no 0-1.0 service child, immediately perform one additional bounded NHI reset in the same recovery run.
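
The control flow of steps 2 to 4 can be sketched as below. This is not the content of the real tb-recover.sh: the function names, the sysfs-root argument, the stamp-file cooldown and the three hooks (do_enlist, do_udev_retrigger, do_nhi_reset) are all assumptions made for illustration. The caller is expected to define the hooks. Nodes whose name ends in -0 are assumed to be the local router and are skipped.

```shell
#!/bin/sh
# Sketch only. Hooks do_enlist, do_udev_retrigger and do_nhi_reset are
# hypothetical and must be defined by the caller before tb_recover runs.

# Succeeds if at least one thunderbolt* netdev exists under <sysroot>/class/net.
tb_has_netdev() {
    for dev in "$1"/class/net/thunderbolt*; do
        [ -e "$dev" ] && return 0
    done
    return 1
}

# Succeeds if some peer node (for example 0-1) has no service child (0-1.0).
tb_host_without_service() {
    for host in "$1"/bus/thunderbolt/devices/*-*; do
        [ -e "$host" ] || continue
        case ${host##*/} in
            *.*|*-0) continue ;;   # skip service nodes and the assumed local router
        esac
        found=no
        for svc in "$host".*; do
            [ -e "$svc" ] && found=yes
        done
        [ "$found" = no ] && return 0
    done
    return 1
}

# Succeeds if no reset is recorded in <stamp>, or the recorded epoch is at
# least <cooldown> seconds older than <now>.
tb_cooldown_elapsed() {
    [ -f "$1" ] || return 0
    last=$(cat "$1")
    [ $(($2 - last)) -ge "$3" ]
}

# Usage: tb_recover <sysroot> <stamp-file> <now-epoch> <cooldown-seconds>
tb_recover() {
    sysroot=$1 stamp=$2 now=$3 cooldown=$4
    # Step 2: re-enlist if a netdev exists, otherwise retrigger udev first.
    if tb_has_netdev "$sysroot"; then do_enlist; return 0; fi
    do_udev_retrigger
    if tb_has_netdev "$sysroot"; then do_enlist; return 0; fi
    # Step 3: cooldown-limited NHI reset, then retrigger udev.
    tb_cooldown_elapsed "$stamp" "$now" "$cooldown" || return 1
    echo "$now" > "$stamp"
    do_nhi_reset
    do_udev_retrigger
    # Step 4: exactly one extra reset if only the peer host node came back.
    if ! tb_has_netdev "$sysroot" && tb_host_without_service "$sysroot"; then
        do_nhi_reset
        do_udev_retrigger
    fi
    tb_has_netdev "$sysroot" && do_enlist
}
```

On a live node the call would presumably look like tb_recover /sys <stamp-file> "$(date +%s)" <cooldown>, with do_nhi_reset wrapping the PCI remove and rescan writes and do_udev_retrigger wrapping udevadm trigger. Keeping the hardware actions behind hooks lets the decision logic be exercised against a fake sysfs tree without touching the controller.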
List CHANGELOG.md entries that reference this issue:
- CHANGELOG entry: [Unreleased] - Fix invalid ExecStop in tb-enlist@.service to prevent failed unit on device removal [ISSUE-2026-001]