# tb-enlist Fails on Device Disconnect, Leaving Thunderbolt Link Down After Reboot

## Issue ID: ISSUE-2026-001

**Status:** investigating  
**Priority:** high  
**Created:** 2026-03-06  
**Updated:** 2026-03-07  
**Assigned to:** unassigned

---

## Summary

On `tapia`, `tb-enlist@thunderbolt0.service` failed during `ExecStop`, and after a post-boot disconnect/reconnect the `thunderbolt0` interface did not come back.

---

## Description

After a reboot, the Thunderbolt link between `tapia` and `baobab` came up briefly and then disconnected. When systemd stopped the `tb-enlist@thunderbolt0.service` instance, an invalid `ExecStop=` command in `tb-enlist@.service` made the unit fail (`status=255`). Separately, `boltd` logged a probing timeout after the reconnect, and `thunderbolt0` was no longer present on `tapia`.

---

## Environment

- **Affected nodes:** `tapia` (observed); potentially all nodes, since the same shared unit is deployed cluster-wide
- **Component:** network (thunderbolt bridging/systemd integration)
- **Version/software:** Proxmox VE 8.x, kernel `6.8.12-19-pve`, systemd oneshot templated unit

---

## Steps to Reproduce

1. Boot `tapia` with the current shared `tb-enlist@.service`.
2. Let the Thunderbolt peer connect, then trigger a disconnect/remove event (observed during the boot sequence).
3. Check `systemctl status tb-enlist@thunderbolt0.service` and `ip link show thunderbolt0`.

---

## Expected Behavior

- `tb-enlist@*.service` should stop cleanly when a Thunderbolt netdev disappears.
- The unit should not be left in the `failed` state by its teardown path.
- On reconnect, the interface should be eligible to re-enlist normally.

---

## Actual Behavior

- `tb-enlist@thunderbolt0.service` entered the `failed` state on stop.
- The error shows `ip` rejecting the shell tokens from `ExecStop` as invalid arguments.
- `thunderbolt0` disappeared on `tapia` and did not reappear after reconnect.
- The behavior is intermittent: after some `tapia` reboots, the link stays down until a physical unplug/replug.

---

## Logs/Evidence

```text
Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage.
Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION
Mar 06 08:27:22 tapia boltd[838]: probing: started [1000]
Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000)
Device "thunderbolt0" does not exist.
```
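
The first log line is what one would expect if the redirection and `|| true` tokens reached `ip` as literal arguments, because systemd does not run `ExecStop=` lines through a shell. The sketch below only illustrates that difference; `show_argv` is a hypothetical stand-in for `/sbin/ip` that prints whatever arguments it receives.

```shell
#!/bin/sh
# Hypothetical stand-in for /sbin/ip: prints each received argument in brackets.
show_argv() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# No shell involved (the systemd case): every whitespace-separated token is
# handed to the program as-is. The quotes simulate that here, since the shell
# running this sketch would otherwise interpret the tokens itself.
no_shell_args=$(show_argv link set thunderbolt0 nomaster '2>/dev/null' '||' true)

# Shell involved: the redirection and "|| true" are consumed by the shell
# and never reach the program.
with_shell_args=$(show_argv link set thunderbolt0 nomaster 2>/dev/null || true)

printf 'without a shell:\n%s\n\nwith a shell:\n%s\n' "$no_shell_args" "$with_shell_args"
```

In the first case the stand-in receives seven arguments, including `2>/dev/null`, `||`, and `true`; in the second it receives only the four that `ip` can actually parse.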

---

## Investigation Notes

- 2026-03-06: Confirmed `tb-bridge.service` was active and `thunderbridge` existed on both `baobab` and `tapia`.
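- 2026-03-06: Confirmed the old `ExecStop` lines used shell syntax (`2>/dev/null`, `|| true`). systemd does not interpret these, so the tokens were passed to `ip` as literal arguments:
  - `ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true`
  - `ExecStop=/sbin/ip link set %i down 2>/dev/null || true`
- 2026-03-06: Implemented the fix with the systemd `-` prefix, which makes systemd ignore a non-zero exit status from the command: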
  - `ExecStop=-/sbin/ip link set %i nomaster`
  - `ExecStop=-/sbin/ip link set %i down`
- 2026-03-06: Deployed patch to `tapia` and validated that unit can be reset/stopped without entering `failed`.
- 2026-03-06: A user-induced link flap still intermittently failed to recover, so the `ExecStop` fix alone was not sufficient.
- 2026-03-06: After reboot at ~08:49 EET, `tapia` link was observed up again (`thunderbolt0` forwarding), confirming intermittent behavior.
- 2026-03-06: Added second-stage mitigation candidate: periodic recovery (`tb-recover.service` + `tb-recover.timer`) to re-enlist interfaces and force rescan when no thunderbolt netdev is present.
- 2026-03-06: Validated mitigation on `tapia` by intentionally stopping `tb-enlist@thunderbolt0`; recovery timer re-attached interface in next cycle and returned `forwarding` state.
- 2026-03-06: Rolled out mitigation to `baobab` and `ebony`; timer enabled and active on all three nodes.
- 2026-03-06 10:01 EET: New flap captured on `tapia` (`host disconnected` at `10:01:30`); recovery happened after reconnect event (`new host found` at `10:01:48`), consistent with unplug/replug recovery.
- 2026-03-06 10:05 EET: Added third-stage mitigation in `tb-recover.sh`: if no thunderbolt netdev after rescan, restart `bolt.service` and retrigger udev as fallback.
- 2026-03-06 10:39 EET: Controlled flap test on `tapia` using `thunderbolt-net` unbind/bind (`0-1.0`) passed; `thunderbolt0` reappeared and returned to `forwarding` within seconds (`TEST_PASS`).
- 2026-03-06 10:46 EET: Latest mitigation rollout completed on `baobab` and `ebony`; `tb-recover.timer` active/enabled and `tb-enlist@*` units active on all nodes.
- 2026-03-06 13:25 EET: Reboot-loop regression reproduced on `tapia` - `thunderbridge` up but `thunderbolt0` missing entirely (`tb-enlist@thunderbolt0` inactive), while peer `baobab` port showed `NO-CARRIER`.
- 2026-03-06 13:22-14:02 EET: Existing fallback (`bolt.service` restart) was insufficient; repeated `boltd` messages observed: `failed to get boot_acl: Connection timed out`.
- 2026-03-06 14:02 EET: Software recovery without cable succeeded via Thunderbolt NHI PCI `remove + rescan`; `thunderbolt0` recreated and rejoined bridge.
- 2026-03-06 14:04 EET: `tb-recover.sh` updated with a cooldown-limited NHI rescan fallback (and a guarded `boltd` restart fallback) and deployed cluster-wide.
- 2026-03-07 03:35-03:42 EET: On `tapia` running `6.17.13-1-pve`, first NHI rescan rediscovered peer host `0-1` but did not recreate `0-1.0`; a second manual NHI reset at `03:42` recreated `thunderbolt0` and restored `forwarding`.
- 2026-03-07 03:4x EET: Recovery logic updated so a stale xdomain host node without a `*.0` service triggers one bounded second NHI reset in the same `tb-recover.sh` run.
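
The NHI `remove + rescan` recovery and the stale-xdomain check from the last entries could look roughly like the sketch below. It is not the deployed `tb-recover.sh`. The function names and the `SYSFS_ROOT` and `NHI_SETTLE_SECONDS` variables are hypothetical, and two layout details are assumptions to verify on the running kernel: that the NHI is bound to a PCI driver named `thunderbolt`, and that xdomain host nodes report `DEVTYPE=thunderbolt_xdomain` in their `uevent` file.

```shell
#!/bin/sh
# Sketch only. SYSFS_ROOT is a hypothetical override so the logic can be
# exercised against a fake directory tree instead of the live /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the PCI addresses of devices bound to the "thunderbolt" PCI driver
# (assumed driver name). Only entries shaped like PCI addresses are matched.
list_nhi_devices() {
    for link in "$SYSFS_ROOT"/bus/pci/drivers/thunderbolt/*:*; do
        [ -e "$link" ] || continue
        basename "$link"
    done
}

# Return 0 if an xdomain host node (for example 0-1) exists without its
# first service child (for example 0-1.0); return 1 otherwise.
has_stale_xdomain() {
    for dev in "$SYSFS_ROOT"/bus/thunderbolt/devices/*-*; do
        [ -d "$dev" ] || continue
        # Skip service nodes such as 0-1.0; only host nodes are inspected.
        case "$(basename "$dev")" in *.*) continue ;; esac
        grep -q '^DEVTYPE=thunderbolt_xdomain$' "$dev/uevent" 2>/dev/null || continue
        [ -e "${dev}.0" ] || return 0
    done
    return 1
}

# Remove each NHI from the PCI bus, wait briefly, then trigger a bus rescan.
# Requires root on a live system.
nhi_remove_rescan() {
    for addr in $(list_nhi_devices); do
        echo 1 > "$SYSFS_ROOT/bus/pci/devices/$addr/remove"
    done
    sleep "${NHI_SETTLE_SECONDS:-2}"
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}
```

On a live node `nhi_remove_rescan` drops every device behind the controller until the rescan completes, which is why the notes above pair it with a cooldown and a single bounded retry.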

---

## Proposed Solution

Use a layered recovery approach:
1. Keep `ExecStop` commands shell-free and use the systemd `-` prefix to ignore expected failures when the device is already gone.
2. Run periodic recovery (`tb-recover.timer`) that re-enlists existing thunderbolt netdevs and forces a controller/net udev retrigger when no thunderbolt netdev is present.
3. If the netdev is still missing, perform a cooldown-limited Thunderbolt NHI PCI `remove + rescan` (soft replug equivalent), then retrigger udev.
4. If the controller comes back only as a peer xdomain host node (for example `0-1`) with no `0-1.0` service child, immediately perform one additional bounded NHI reset in the same recovery run.
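
The ordering of steps 2 to 4 can be sketched as a single recovery pass. This is an illustration under stated assumptions, not the contents of `tb-recover.sh`: every function name, the stamp file path, and the 300-second cooldown are hypothetical, and the hook functions are stubs that a real script would replace with actual probes and actions.

```shell
#!/bin/sh
# Sketch of one recovery pass. All names and defaults are hypothetical.
STAMP_FILE="${STAMP_FILE:-/run/tb-recover.last-nhi-reset}"  # assumed path
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-300}"                 # assumed value

# --- Hooks (stubs to be replaced) ---
# True if any thunderbolt* network device currently exists.
netdev_present() {
    for n in /sys/class/net/thunderbolt*; do
        [ -e "$n" ] && return 0
    done
    return 1
}
stale_xdomain()  { return 1; }          # real check: host node without *.0
enlist_netdevs() { echo "enlist"; }     # real action: start tb-enlist@ units
retrigger_udev() { echo "retrigger"; }  # real action: retrigger udev events
nhi_reset()      { echo "nhi-reset"; }  # real action: PCI remove + rescan

# True if no reset was recorded or the last one is older than the cooldown.
cooldown_elapsed() {
    last=$(cat "$STAMP_FILE" 2>/dev/null)
    [ -n "$last" ] || return 0
    now=$(date +%s)
    [ $((now - last)) -ge "$COOLDOWN_SECONDS" ]
}

run_recovery() {
    # Step 2: re-enlist if a netdev exists; otherwise retrigger udev first.
    if netdev_present; then enlist_netdevs; return 0; fi
    retrigger_udev
    if netdev_present; then enlist_netdevs; return 0; fi

    # Step 3: NHI reset, at most once per cooldown window.
    cooldown_elapsed || return 1
    date +%s > "$STAMP_FILE"
    nhi_reset
    retrigger_udev

    # Step 4: exactly one extra reset if only the peer host node came back.
    if ! netdev_present && stale_xdomain; then
        nhi_reset
        retrigger_udev
    fi
    if netdev_present; then enlist_netdevs; return 0; fi
    return 1
}
```

Recording the timestamp before the reset means a pass that hangs or fails midway still counts against the cooldown, which keeps a timer-driven script from resetting the controller on every cycle.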

---

## Related Issues

- ISSUE-2025-002
- ISSUE-2025-001

---

## Changelog References

CHANGELOG.md entries that reference this issue:
- CHANGELOG entry: [Unreleased] - Fix invalid `ExecStop` in `tb-enlist@.service` to prevent failed unit on device removal [ISSUE-2026-001]
