tb-enlist Fails on Device Disconnect, Leaving Thunderbolt Link Down After Reboot

Issue ID: ISSUE-2026-001

Status: investigating
Priority: high
Created: 2026-03-06
Updated: 2026-03-06
Assigned to: unassigned


Summary

On tapia, tb-enlist@thunderbolt0.service failed during ExecStop, and after a post-boot disconnect/reconnect the thunderbolt0 interface did not come back.


Description

After reboot, the tapia-baobab Thunderbolt link briefly came up, then disconnected. When systemd stopped the tb-enlist@thunderbolt0 instance, the ExecStop= commands in tb-enlist@.service failed (status=255) because they contained shell syntax (2>/dev/null || true), which systemd passes to ip as literal arguments instead of interpreting it. Separately, boltd logged a probing timeout after reconnect, and thunderbolt0 was no longer present on tapia.


Environment

  • Affected nodes: tapia (observed); potentially all nodes, since the same shared unit is deployed cluster-wide
  • Component: network (thunderbolt bridging/systemd integration)
  • Version/software: Proxmox VE 8.x, kernel 6.8.12-19-pve, systemd oneshot templated unit

Steps to Reproduce

  1. Boot tapia with current shared tb-enlist@.service.
  2. Let Thunderbolt peer connect, then trigger disconnect/remove event (observed during boot sequence).
  3. Check systemctl status tb-enlist@thunderbolt0.service and ip link show thunderbolt0.
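
Step 3 can be scripted roughly as follows. This is a minimal sketch: check_link, unit_failed, and the SYSFS_NET override are illustrative names and are not part of the deployed tooling.

```shell
#!/bin/sh
# Hypothetical helpers for step 3. SYSFS_NET is overridable only so the
# check can be exercised without real hardware.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

check_link() {
    # Report whether the kernel currently exposes the given netdev.
    iface=$1
    if [ -e "$SYSFS_NET/$iface" ]; then
        echo "$iface: present"
    else
        echo "$iface: missing"
    fi
}

unit_failed() {
    # systemctl is-failed exits 0 when the unit is in the failed state.
    systemctl is-failed --quiet "$1"
}
```

On an affected node, check_link thunderbolt0 would be expected to report "missing" while unit_failed tb-enlist@thunderbolt0.service returns success.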

Expected Behavior

  • tb-enlist@*.service should stop cleanly when a Thunderbolt netdev disappears.
  • Unit should not remain failed due to teardown path.
  • On reconnect, interface should be eligible to re-enlist normally.

Actual Behavior

  • tb-enlist@thunderbolt0.service entered failed state on stop.
  • ip rejected the shell tokens left in ExecStop (2>/dev/null) as invalid arguments.
  • thunderbolt0 disappeared on tapia and did not reappear after reconnect.
  • Behavior remains intermittent: after some tapia reboots, link stays down until physical unplug/replug.

Logs/Evidence

Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage.
Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION
Mar 06 08:27:22 tapia boltd[838]: probing: started [1000]
Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000)
Device "thunderbolt0" does not exist.
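
The first log line follows from how systemd runs ExecStop=: it executes the binary directly rather than through a shell, so redirections and || reach ip as plain words. The sketch below shows what ip received, using a hypothetical show_argv helper; the quoting stands in for systemd's behavior.

```shell
#!/bin/sh
# Hypothetical helper: print each argument a command would receive, one per line.
show_argv() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Quoting the shell operators mimics a non-shell exec: they become ordinary
# arguments, which is why ip complained about "2>/dev/null".
show_argv link set thunderbolt0 nomaster '2>/dev/null' '||' true
```

Run under a real shell without the quotes, the same line would pass only four arguments and discard stderr, which is what the original unit author presumably intended.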

Investigation Notes

  • 2026-03-06: Confirmed tb-bridge.service was active and thunderbridge existed on both baobab and tapia.
  • 2026-03-06: Confirmed old ExecStop lines used shell syntax in non-shell context:
    • ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true
    • ExecStop=/sbin/ip link set %i down 2>/dev/null || true
  • 2026-03-06: Implemented fix with systemd-native ignore-errors prefix:
    • ExecStop=-/sbin/ip link set %i nomaster
    • ExecStop=-/sbin/ip link set %i down
  • 2026-03-06: Deployed patch to tapia and validated that unit can be reset/stopped without entering failed.
  • 2026-03-06: User-induced flap still showed intermittent non-recovery pattern; remediation was not sufficient by itself.
  • 2026-03-06: After reboot at ~08:49 EET, tapia link was observed up again (thunderbolt0 forwarding), confirming intermittent behavior.
  • 2026-03-06: Added second-stage mitigation candidate: periodic recovery (tb-recover.service + tb-recover.timer) to re-enlist interfaces and force rescan when no thunderbolt netdev is present.
  • 2026-03-06: Validated mitigation on tapia by intentionally stopping tb-enlist@thunderbolt0; recovery timer re-attached interface in next cycle and returned forwarding state.
  • 2026-03-06: Rolled out mitigation to baobab and ebony; timer enabled and active on all three nodes.
  • 2026-03-06 10:01 EET: New flap captured on tapia (host disconnected at 10:01:30); recovery happened after reconnect event (new host found at 10:01:48), consistent with unplug/replug recovery.
  • 2026-03-06 10:05 EET: Added third-stage mitigation in tb-recover.sh: if no thunderbolt netdev after rescan, restart bolt.service and retrigger udev as fallback.
  • 2026-03-06 10:39 EET: Controlled flap test on tapia using thunderbolt-net unbind/bind (0-1.0) passed; thunderbolt0 reappeared and returned to forwarding within seconds (TEST_PASS).
  • 2026-03-06 10:46 EET: Latest mitigation rollout completed on baobab and ebony; tb-recover.timer active/enabled and tb-enlist@* units active on all nodes.
  • 2026-03-06 13:25 EET: Reboot-loop regression reproduced on tapia - thunderbridge up but thunderbolt0 missing entirely (tb-enlist@thunderbolt0 inactive), while peer baobab port showed NO-CARRIER.
  • 2026-03-06 13:22-14:02 EET: Existing fallback (bolt.service restart) was insufficient; repeated boltd messages observed: failed to get boot_acl: Connection timed out.
  • 2026-03-06 14:02 EET: Software recovery without cable succeeded via Thunderbolt NHI PCI remove + rescan; thunderbolt0 recreated and rejoined bridge.
  • 2026-03-06 14:04 EET: tb-recover.sh updated with a cooldown-limited NHI rescan fallback (and a guarded boltd restart fallback) and deployed cluster-wide.
  • 2026-03-07 03:35-03:42 EET: On tapia running 6.17.13-1-pve, first NHI rescan rediscovered peer host 0-1 but did not recreate 0-1.0; a second manual NHI reset at 03:42 recreated thunderbolt0 and restored forwarding.
  • 2026-03-07 03:4x EET: Recovery logic updated so a stale xdomain host node without a *.0 service triggers one bounded second NHI reset in the same tb-recover.sh run.
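
The controlled flap test from the 10:39 note can be approximated as below. This is a sketch, not the script that was actually run: the driver path assumes the usual sysfs layout for thunderbolt-net, and flap_test, wait_for_netdev, and the 30-second timeout are illustrative choices. Run it as root on a test node only.

```shell
#!/bin/sh
# Assumed sysfs locations; both are overridable so the logic can be dry-run.
TB_DRIVER="${TB_DRIVER:-/sys/bus/thunderbolt/drivers/thunderbolt-net}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

wait_for_netdev() {
    # Poll once per second until the interface exists or the timeout expires.
    iface=$1
    timeout=$2
    waited=0
    while :; do
        [ -e "$SYSFS_NET/$iface" ] && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

flap_test() {
    # Unbind and rebind the service (for example 0-1.0), then wait for the netdev.
    service=$1
    iface=$2
    echo "$service" > "$TB_DRIVER/unbind"
    sleep 2
    echo "$service" > "$TB_DRIVER/bind"
    if wait_for_netdev "$iface" 30; then
        echo TEST_PASS
    else
        echo TEST_FAIL
    fi
}
```

A call such as flap_test 0-1.0 thunderbolt0 only confirms that the netdev reappears; checking that the port returned to forwarding on the bridge is a separate step.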

Proposed Solution

Use a layered recovery approach:

  1. Keep ExecStop commands shell-free and use the systemd "-" prefix to ignore expected failures when the device is already gone.
  2. Run periodic recovery (tb-recover.timer) that re-enlists existing thunderbolt netdevs and forces a controller/net udev retrigger when no thunderbolt netdev is present.
  3. If the netdev is still missing, perform a cooldown-limited Thunderbolt NHI PCI remove + rescan (soft replug equivalent), then retrigger udev.
  4. If the controller comes back only as a peer xdomain host node (for example 0-1) with no 0-1.0 service child, immediately perform one additional bounded NHI reset in the same recovery run.
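
The escalation order of the recovery run could look roughly like the sketch below. It is not the deployed tb-recover.sh: every helper name, the cooldown file, the 300-second default, and the NHI_PCI_ADDR and PEER_HOST variables are assumptions made for illustration, and the real script may wait for devices to settle between steps.

```shell
#!/bin/sh
# Sketch only; names and defaults below are assumptions, not the deployed script.
NHI_PCI_ADDR="${NHI_PCI_ADDR:-}"      # PCI address of the Thunderbolt NHI, set by the operator
PEER_HOST="${PEER_HOST:-0-1}"         # peer xdomain host node named in the notes
COOLDOWN_FILE="${COOLDOWN_FILE:-/run/tb-recover.last-reset}"
COOLDOWN_SECS="${COOLDOWN_SECS:-300}"

count_netdevs() {
    # Number of thunderbolt* interfaces currently present.
    ls /sys/class/net 2>/dev/null | grep -c '^thunderbolt'
}

retrigger_udev() {
    udevadm trigger --subsystem-match=thunderbolt --subsystem-match=net --action=add
}

cooldown_ok() {
    # True when no reset is recorded or the last one is older than COOLDOWN_SECS.
    [ -f "$COOLDOWN_FILE" ] || return 0
    last=$(cat "$COOLDOWN_FILE")
    [ $(( $(date +%s) - last )) -ge "$COOLDOWN_SECS" ]
}

nhi_reset() {
    # Soft replug: remove the NHI from the PCI bus, then rescan the bus.
    [ -n "$NHI_PCI_ADDR" ] || return 1
    echo 1 > "/sys/bus/pci/devices/$NHI_PCI_ADDR/remove"
    sleep 2
    echo 1 > /sys/bus/pci/rescan
    date +%s > "$COOLDOWN_FILE"
}

stale_host() {
    # True if the peer host node exists without its .0 service child.
    [ -e "/sys/bus/thunderbolt/devices/$PEER_HOST" ] &&
        [ ! -e "/sys/bus/thunderbolt/devices/$PEER_HOST.0" ]
}

recover_once() {
    if [ "$(count_netdevs)" -gt 0 ]; then
        echo "re-enlist"              # netdev exists: only re-attach it to the bridge
        return 0
    fi
    echo "udev-retrigger"
    retrigger_udev
    if [ "$(count_netdevs)" -gt 0 ]; then
        return 0
    fi
    if ! cooldown_ok; then
        echo "cooldown-skip"          # avoid resetting the controller too often
        return 0
    fi
    echo "nhi-reset"
    nhi_reset
    if stale_host; then
        echo "nhi-reset-retry"        # one bounded second reset, never a loop
        nhi_reset
    fi
    retrigger_udev
}
```

Keeping the second reset as a single conditional rather than a loop bounds the worst case to two controller resets per timer cycle.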


Related Issues

  • ISSUE-2025-002
  • ISSUE-2025-001

Changelog References

  • CHANGELOG.md, [Unreleased]: Fix invalid ExecStop in tb-enlist@.service to prevent failed unit on device removal [ISSUE-2026-001]