Bogdan Timofte authored 3 months ago

# tb-enlist Fails on Device Disconnect, Leaving Thunderbolt Link Down After Reboot

## Issue ID: ISSUE-2026-001

**Status:** investigating
**Priority:** high
**Created:** 2026-03-06
**Updated:** 2026-03-06
**Assigned to:** unassigned

---

## Summary

On `tapia`, `tb-enlist@thunderbolt0.service` failed during `ExecStop`, and after a post-boot disconnect/reconnect the `thunderbolt0` interface did not come back.

---

## Description

After reboot, the Tapia-Baobab Thunderbolt link briefly came up, then disconnected. The `ExecStop=` commands in `tb-enlist@.service` contained shell syntax (`2>/dev/null || true`) that `ip` received as literal arguments, so the unit failed (`status=255`) when systemd stopped the instance. In parallel, `boltd` logged a probing timeout after reconnect, and `thunderbolt0` was no longer present on `tapia`.

---

## Environment

- **Affected nodes:** `tapia` (observed); all nodes are potentially affected because the same shared unit is deployed cluster-wide
- **Component:** network (thunderbolt bridging/systemd integration)
- **Version/software:** Proxmox VE 8.x, kernel `6.8.12-19-pve` (non-recovery also observed on `6.17.13-1-pve` on 2026-03-07), systemd oneshot templated unit

---

## Steps to Reproduce

1. Boot `tapia` with the current shared `tb-enlist@.service`.
2. Let the Thunderbolt peer connect, then trigger a disconnect/remove event (observed during the boot sequence).
3. Check `systemctl status tb-enlist@thunderbolt0.service` and `ip link show thunderbolt0`.

---

## Expected Behavior

- `tb-enlist@*.service` should stop cleanly when a Thunderbolt netdev disappears.
- The unit should not remain in `failed` state because of its teardown path.
- On reconnect, the interface should be re-enlisted normally.

---

## Actual Behavior

- `tb-enlist@thunderbolt0.service` entered `failed` state on stop.
- The error came from shell syntax in `ExecStop` (`2>/dev/null || true`), which `ip` rejected as invalid arguments.
- `thunderbolt0` disappeared on `tapia` and did not reappear after reconnect.
- Behavior remains intermittent: after some `tapia` reboots, the link stays down until a physical unplug/replug.

---

## Logs/Evidence

```text
Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage.
Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION
Mar 06 08:27:22 tapia boltd[838]: probing: started [1000]
Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000)
Device "thunderbolt0" does not exist.
```
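
The first log line is what one would expect when an `ExecStop=` line containing shell syntax reaches `ip` without going through a shell: systemd executes the command directly, so `2>/dev/null`, `||` and `true` arrive as ordinary arguments. A minimal sketch of that effect follows; `show_argv` is a hypothetical stand-in for `/sbin/ip`, and the quoting only imitates what systemd passes on.

```shell
#!/bin/sh
# Hypothetical stand-in for /sbin/ip: report how many arguments arrive and
# print each one, so the literal shell tokens are visible.
show_argv() {
    echo "argc=$#"
    for arg in "$@"; do
        printf '  [%s]\n' "$arg"
    done
}

# Old form: no shell is involved, so the redirection and "|| true" are
# passed through as three extra arguments (argc=7).
show_argv link set thunderbolt0 nomaster '2>/dev/null' '||' 'true'

# Fixed form: the leading "-" is consumed by systemd itself, so the
# command receives only the four intended arguments (argc=4).
show_argv link set thunderbolt0 nomaster
```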

---

## Investigation Notes

- 2026-03-06: Confirmed `tb-bridge.service` was active and `thunderbridge` existed on both `baobab` and `tapia`.
- 2026-03-06: Confirmed the old `ExecStop` lines used shell syntax, although systemd does not run these commands through a shell:
  - `ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true`
  - `ExecStop=/sbin/ip link set %i down 2>/dev/null || true`
- 2026-03-06: Implemented the fix with the systemd-native `-` prefix, which ignores a non-zero exit status:
  - `ExecStop=-/sbin/ip link set %i nomaster`
  - `ExecStop=-/sbin/ip link set %i down`
- 2026-03-06: Deployed the patch to `tapia` and validated that the unit can be reset/stopped without entering `failed`.
- 2026-03-06: A user-induced flap still showed the intermittent non-recovery pattern, so the `ExecStop` fix was not sufficient by itself.
- 2026-03-06: After reboot at ~08:49 EET, the `tapia` link was observed up again (`thunderbolt0` forwarding), confirming that the failure is intermittent.
- 2026-03-06: Added second-stage mitigation candidate: periodic recovery (`tb-recover.service` + `tb-recover.timer`) to re-enlist interfaces and force a rescan when no thunderbolt netdev is present.
- 2026-03-06: Validated the mitigation on `tapia` by intentionally stopping `tb-enlist@thunderbolt0`; the recovery timer re-attached the interface in the next cycle and it returned to `forwarding` state.
- 2026-03-06: Rolled out the mitigation to `baobab` and `ebony`; timer enabled and active on all three nodes.
- 2026-03-06 10:01 EET: New flap captured on `tapia` (`host disconnected` at `10:01:30`); recovery happened after the reconnect event (`new host found` at `10:01:48`), consistent with unplug/replug recovery.
- 2026-03-06 10:05 EET: Added third-stage mitigation in `tb-recover.sh`: if no thunderbolt netdev exists after the rescan, restart `bolt.service` and retrigger udev as fallback.
- 2026-03-06 10:39 EET: Controlled flap test on `tapia` using `thunderbolt-net` unbind/bind (`0-1.0`) passed; `thunderbolt0` reappeared and returned to `forwarding` within seconds (`TEST_PASS`).
- 2026-03-06 10:46 EET: Latest mitigation rollout completed on `baobab` and `ebony`; `tb-recover.timer` active/enabled and `tb-enlist@*` units active on all nodes.
- 2026-03-06 13:25 EET: Reboot-loop regression reproduced on `tapia`: `thunderbridge` up but `thunderbolt0` missing entirely (`tb-enlist@thunderbolt0` inactive), while the peer `baobab` port showed `NO-CARRIER`.
- 2026-03-06 13:22-14:02 EET: The existing fallback (`bolt.service` restart) was insufficient; repeated `boltd` messages observed: `failed to get boot_acl: Connection timed out`.
- 2026-03-06 14:02 EET: Software recovery without touching the cable succeeded via Thunderbolt NHI PCI `remove + rescan`; `thunderbolt0` was recreated and rejoined the bridge.
- 2026-03-06 14:04 EET: `tb-recover.sh` updated with an NHI rescan fallback limited by a cooldown (and a guarded `boltd` restart fallback) and deployed cluster-wide.
- 2026-03-07 03:35-03:42 EET: On `tapia` running `6.17.13-1-pve`, the first NHI rescan rediscovered peer host `0-1` but did not recreate `0-1.0`; a second manual NHI reset at `03:42` recreated `thunderbolt0` and restored `forwarding`.
- 2026-03-07 03:4x EET: Recovery logic updated so a stale xdomain host node without a `*.0` service triggers one bounded second NHI reset in the same `tb-recover.sh` run.
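
The states these notes refer to (no thunderbolt netdev, a host node `0-1` without its `0-1.0` service, a `thunderbolt-net` unbind/bind flap) can be checked or triggered through sysfs. The following is a minimal sketch under stated assumptions: the function names and the `SYSFS` override are hypothetical and are not taken from `tb-recover.sh`, which is not reproduced in this issue, and the one-second pause in the flap is an arbitrary choice.

```shell
#!/bin/sh
# Illustrative checks only; not the deployed tb-recover.sh.
# SYSFS is an assumed override so the functions can be tried on a fake tree.
SYSFS="${SYSFS:-/sys}"

# Succeeds if at least one thunderbolt* network device exists.
has_tb_netdev() {
    for dev in "$SYSFS"/class/net/thunderbolt*; do
        [ -e "$dev" ] && return 0
    done
    return 1
}

# Succeeds if host node $1 (e.g. 0-1) exists but its service child $1.0
# does not: the stale state seen after the first NHI rescan on 2026-03-07.
xdomain_is_stale() {
    [ -d "$SYSFS/bus/thunderbolt/devices/$1" ] &&
        [ ! -e "$SYSFS/bus/thunderbolt/devices/$1.0" ]
}

# Controlled flap: unbind and rebind service $1 (e.g. 0-1.0) on the
# thunderbolt-net driver. Requires root on a real system.
flap_service() {
    drv="$SYSFS/bus/thunderbolt/drivers/thunderbolt-net"
    echo "$1" > "$drv/unbind" || return 1
    sleep 1
    echo "$1" > "$drv/bind"
}

# Example (as root): has_tb_netdev || xdomain_is_stale 0-1 && echo "stale peer"
```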

---

## Proposed Solution

Use a layered recovery approach:
1. Keep `ExecStop` commands shell-free and use the systemd `-` prefix to ignore expected failures when the device is already gone.
2. Run periodic recovery (`tb-recover.timer`) that re-enlists existing thunderbolt netdevs and forces a controller/net udev retrigger when no thunderbolt netdev is present.
3. If the netdev is still missing, perform a Thunderbolt NHI PCI `remove + rescan` (the software equivalent of a replug), limited by a cooldown, then retrigger udev.
4. If the controller comes back only as a peer xdomain host node (for example `0-1`) with no `0-1.0` service child, immediately perform one additional bounded NHI reset in the same recovery run.
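
Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the deployed `tb-recover.sh`: the `NHI_ADDR`, `STAMP` and `COOLDOWN` variables, their defaults, the two-second pause and the settle timeout are all hypothetical. The PCI `remove` and `rescan` sysfs files and `udevadm trigger`/`udevadm settle` are standard interfaces.

```shell
#!/bin/sh
# Sketch only, not the deployed tb-recover.sh. Names and defaults are assumptions.
NHI_ADDR="${NHI_ADDR:-}"                          # PCI address of the Thunderbolt NHI (hypothetical variable)
STAMP="${STAMP:-/run/tb-recover.last-nhi-reset}"  # hypothetical cooldown stamp file
COOLDOWN="${COOLDOWN:-300}"                       # assumed minimum seconds between resets

# Succeeds when no reset is recorded or the last one is older than COOLDOWN.
cooldown_elapsed() {
    [ -f "$STAMP" ] || return 0
    last=$(cat "$STAMP")
    now=$(date +%s)
    [ $((now - last)) -ge "$COOLDOWN" ]
}

# Soft replug: remove the NHI from the PCI bus, rescan, then retrigger udev.
nhi_reset() {
    [ -n "$NHI_ADDR" ] || { echo "NHI_ADDR not set" >&2; return 1; }
    echo 1 > "/sys/bus/pci/devices/$NHI_ADDR/remove" || return 1
    sleep 2
    echo 1 > /sys/bus/pci/rescan || return 1
    date +%s > "$STAMP"
    udevadm trigger --subsystem-match=thunderbolt --subsystem-match=net
    udevadm settle --timeout=10
}

# Step 3, plus the single bounded retry from step 4.
# $1: command that succeeds when the thunderbolt netdev is present
# $2: command that succeeds when only a stale xdomain host node came back
recover() {
    "$1" && return 0
    cooldown_elapsed || { echo "NHI reset skipped: cooldown active" >&2; return 1; }
    nhi_reset || return 1
    "$1" && return 0
    if "$2"; then
        nhi_reset    # one additional reset in the same run, never more
    fi
    "$1"
}
```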

---

## Related Issues

- ISSUE-2025-002
- ISSUE-2025-001

---

## Changelog References

List CHANGELOG.md entries that reference this issue:
- CHANGELOG entry: [Unreleased] - Fix invalid `ExecStop` in `tb-enlist@.service` to prevent failed unit on device removal [ISSUE-2026-001]