Blaming Madagascar/projects/thunderbolts/issues/ISSUE-2026-001.md at 5cf81a9f4028c69c09c76581b26cbec4414fe5ae · bogdan/Madagascar

bogdan Analyze reboot times and documents

Analyze reboot times and d... Bogdan Timofte authored 3 months ago	1	# tb-enlist Fails on Device Disconnect, Leaving Thunderbolt Link Down After Reboot
	2
	3	## Issue ID: ISSUE-2026-001
	4
	5	Status: investigating
	6	Priority: high
	7	Created: 2026-03-06
	8	Updated: 2026-03-06
	9	Assigned to: unassigned
	10
	11	---
	12
	13	## Summary
	14
	15	On `tapia`, `tb-enlist@thunderbolt0.service` failed during `ExecStop`, and after a post-boot disconnect/reconnect the `thunderbolt0` interface did not come back.
	16
	17	---
	18
	19	## Description
	20
	21	After reboot, the Tapia-Baobab Thunderbolt link briefly came up, then disconnected. A bad `ExecStop=` command in `tb-enlist@.service` caused unit failure (`status=255`) when systemd stopped the instance. In parallel, `boltd` logged a probing timeout after reconnect, and `thunderbolt0` was no longer present on `tapia`.
	22
	23	---
	24
	25	## Environment
	26
	27	- Affected nodes: tapia (observed); potentially all nodes, since the same shared unit is deployed cluster-wide
	28	- Component: network (thunderbolt bridging/systemd integration)
	29	- Version/software: Proxmox VE 8.x, kernel `6.8.12-19-pve`, systemd oneshot templated unit
	30
	31	---
	32
	33	## Steps to Reproduce
	34
	35	1. Boot `tapia` with current shared `tb-enlist@.service`.
	36	2. Let the Thunderbolt peer connect, then trigger a disconnect/remove event (observed during the boot sequence).
	37	3. Check `systemctl status tb-enlist@thunderbolt0.service` and `ip link show thunderbolt0`.
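
The checks in step 3 can be wrapped in a small helper. This is only a sketch: `tb_check` is a hypothetical name and is not part of the deployed scripts. It relies on `systemctl is-failed` and `ip link show dev` returning a non-zero exit status when the unit is not failed or the interface is absent.

```shell
#!/bin/sh
# Sketch: report the state of a tb-enlist instance and its netdev.
# tb_check is a hypothetical helper name, not part of the deployed scripts.
tb_check() {
    ifname=$1
    unit="tb-enlist@${ifname}.service"

    # `systemctl is-failed` exits 0 only when the unit is in the failed state.
    if systemctl is-failed --quiet "$unit"; then
        echo "unit=failed"
    else
        echo "unit=not-failed"
    fi

    # `ip link show dev IF` exits non-zero when the interface does not exist.
    if ip link show dev "$ifname" >/dev/null 2>&1; then
        echo "netdev=present"
    else
        echo "netdev=missing"
    fi
}

# Usage: tb_check thunderbolt0
```

For the failure described in this issue, the helper would be expected to report a failed unit together with a missing netdev.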
	38
	39	---
	40
	41	## Expected Behavior
	42
	43	- `tb-enlist@*.service` should stop cleanly when a Thunderbolt netdev disappears.
	44	- The unit should not remain in the `failed` state because of its teardown path.
	45	- On reconnect, the interface should be eligible to re-enlist normally.
	46
	47	---
	48
	49	## Actual Behavior
	50
	51	- `tb-enlist@thunderbolt0.service` entered failed state on stop.
	52	- The error showed that `ip` received invalid arguments from the `ExecStop` command line.
	53	- `thunderbolt0` disappeared on `tapia` and did not reappear after reconnect.
	54	- Behavior remains intermittent: after some `tapia` reboots, link stays down until physical unplug/replug.
	55
	56	---
	57
	58	## Logs/Evidence
	59
	60	```text
	61	Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage.
	62	Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION
	63	Mar 06 08:27:22 tapia boltd[838]: probing: started [1000]
	64	Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000)
	65	Device "thunderbolt0" does not exist.
	66	```
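
The first log line is consistent with `ip` receiving `2>/dev/null` as a literal argument. The sketch below illustrates the difference between running a command line through a shell and executing it directly, as systemd does. `show_argv` is a hypothetical stand-in for `ip`, and quoting is used only to emulate the no-shell case.

```shell
#!/bin/sh
# Print each argument on its own line, the way a program sees its argv.
show_argv() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Through a shell: the redirection and `|| true` are consumed by the shell,
# so the command receives only the four real arguments.
show_argv link set thunderbolt0 nomaster 2>/dev/null || true

# Without a shell (emulated here by quoting): every token arrives as a
# literal argument, including the redirection and the `||`.
show_argv link set thunderbolt0 nomaster '2>/dev/null' '||' 'true'
```

The second call prints seven arguments instead of four, which is why `ip` rejected `2>/dev/null` as garbage.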
	67
	68	---
	69
	70	## Investigation Notes
	71
	72	- 2026-03-06: Confirmed `tb-bridge.service` was active and `thunderbridge` existed on both `baobab` and `tapia`.
	73	- 2026-03-06: Confirmed that the old `ExecStop` lines used shell syntax, which systemd does not interpret because it executes commands without a shell:
	74	  - `ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true`
	75	  - `ExecStop=/sbin/ip link set %i down 2>/dev/null || true`
	76	- 2026-03-06: Implemented the fix using systemd's `-` prefix, which ignores a non-zero exit status:
	77	  - `ExecStop=-/sbin/ip link set %i nomaster`
	78	  - `ExecStop=-/sbin/ip link set %i down`
	79	- 2026-03-06: Deployed patch to `tapia` and validated that unit can be reset/stopped without entering `failed`.
	80	- 2026-03-06: A user-induced flap still intermittently failed to recover, so the `ExecStop` fix alone was not sufficient.
	81	- 2026-03-06: After reboot at ~08:49 EET, `tapia` link was observed up again (`thunderbolt0` forwarding), confirming intermittent behavior.
	82	- 2026-03-06: Added second-stage mitigation candidate: periodic recovery (`tb-recover.service` + `tb-recover.timer`) to re-enlist interfaces and force rescan when no thunderbolt netdev is present.
	83	- 2026-03-06: Validated mitigation on `tapia` by intentionally stopping `tb-enlist@thunderbolt0`; recovery timer re-attached interface in next cycle and returned `forwarding` state.
	84	- 2026-03-06: Rolled out mitigation to `baobab` and `ebony`; timer enabled and active on all three nodes.
	85	- 2026-03-06 10:01 EET: New flap captured on `tapia` (`host disconnected` at `10:01:30`); recovery happened after reconnect event (`new host found` at `10:01:48`), consistent with unplug/replug recovery.
	86	- 2026-03-06 10:05 EET: Added third-stage mitigation in `tb-recover.sh`: if no thunderbolt netdev after rescan, restart `bolt.service` and retrigger udev as fallback.
	87	- 2026-03-06 10:39 EET: Controlled flap test on `tapia` using `thunderbolt-net` unbind/bind (`0-1.0`) passed; `thunderbolt0` reappeared and returned to `forwarding` within seconds (`TEST_PASS`).
	88	- 2026-03-06 10:46 EET: Latest mitigation rollout completed on `baobab` and `ebony`; `tb-recover.timer` active/enabled and `tb-enlist@*` units active on all nodes.
	89	- 2026-03-06 13:25 EET: Reboot-loop regression reproduced on `tapia` - `thunderbridge` up but `thunderbolt0` missing entirely (`tb-enlist@thunderbolt0` inactive), while peer `baobab` port showed `NO-CARRIER`.
	90	- 2026-03-06 13:22-14:02 EET: Existing fallback (`bolt.service` restart) was insufficient; repeated `boltd` messages observed: `failed to get boot_acl: Connection timed out`.
	91	- 2026-03-06 14:02 EET: Software recovery without cable succeeded via Thunderbolt NHI PCI `remove + rescan`; `thunderbolt0` recreated and rejoined bridge.
	92	- 2026-03-06 14:04 EET: `tb-recover.sh` updated with a cooldown-limited NHI rescan fallback (and a guarded `boltd` restart fallback) and deployed cluster-wide.
	93	- 2026-03-07 03:35-03:42 EET: On `tapia` running `6.17.13-1-pve`, first NHI rescan rediscovered peer host `0-1` but did not recreate `0-1.0`; a second manual NHI reset at `03:42` recreated `thunderbolt0` and restored `forwarding`.
	94	- 2026-03-07 03:4x EET: Recovery logic updated so a stale xdomain host node without a `*.0` service triggers one bounded second NHI reset in the same `tb-recover.sh` run.
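
The stale-xdomain condition from the last two notes (a peer node such as `0-1` with no `0-1.0` service child) could be detected roughly as follows. This is a sketch, not the code in `tb-recover.sh`: the function name, the overridable `TB_DEVICES` variable, and the naming assumptions in the comments are illustrative. It also assumes that only host peers are attached, since an ordinary Thunderbolt device would look the same to this check.

```shell
#!/bin/sh
# TB_DEVICES is overridable so the logic can be exercised against a fake tree.
TB_DEVICES=${TB_DEVICES:-/sys/bus/thunderbolt/devices}

# Print peer nodes (e.g. 0-1) that have no matching service child (e.g. 0-1.0).
# Assumption: peers are named <domain>-<port> with a non-zero port; the local
# host router (<domain>-0) and entries containing a dot are skipped.
stale_xdomain_hosts() {
    for node in "$TB_DEVICES"/*-*; do
        [ -e "$node" ] || continue
        name=${node##*/}
        case $name in
            *.*) continue ;;   # a service entry such as 0-1.0
            *-0) continue ;;   # the local host router
        esac
        [ -e "$TB_DEVICES/$name.0" ] || echo "$name"
    done
}

# Usage: if [ -n "$(stale_xdomain_hosts)" ]; then echo "second NHI reset needed"; fi
```

An empty result means that every peer node has its service child, so no additional reset is needed.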
	95
	96	---
	97
	98	## Proposed Solution
	99
	100	Use a layered recovery approach:
	101	1. Keep `ExecStop` commands shell-free and use systemd `-` prefix to ignore expected failures when device is already gone.
	102	2. Run periodic recovery (`tb-recover.timer`) that re-enlists existing thunderbolt netdevs and forces controller/net udev retrigger when no thunderbolt netdev is present.
	103	3. If the netdev is still missing, perform a cooldown-limited Thunderbolt NHI PCI `remove + rescan` (soft replug equivalent), then retrigger udev.
	104	4. If the controller comes back only as a peer xdomain host node (for example `0-1`) with no `0-1.0` service child, immediately perform one additional bounded NHI reset in the same recovery run.
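
Steps 2 and 3 above can be sketched as a cooldown-guarded reset. This is not the deployed `tb-recover.sh`: the stamp file path, the 300-second cooldown, the function names, and passing the NHI PCI sysfs directory as an argument are all assumptions made for illustration. The bounded second reset from step 4 is not shown.

```shell
#!/bin/sh
# All paths and values below are illustrative defaults, not the deployed ones.
STAMP=${STAMP:-/run/tb-recover.last-nhi-reset}
COOLDOWN=${COOLDOWN:-300}                  # seconds between NHI resets (assumed)
NET_DIR=${NET_DIR:-/sys/class/net}
PCI_RESCAN=${PCI_RESCAN:-/sys/bus/pci/rescan}

# Succeeds if at least one thunderbolt* netdev exists.
have_tb_netdev() {
    for dev in "$NET_DIR"/thunderbolt*; do
        [ -e "$dev" ] && return 0
    done
    return 1
}

# Succeeds if no reset was recorded yet or the cooldown has expired.
cooldown_elapsed() {
    [ -f "$STAMP" ] || return 0
    last=$(cat "$STAMP")
    now=$(date +%s)
    [ $((now - last)) -ge "$COOLDOWN" ]
}

# $1 = sysfs directory of the NHI PCI function, e.g. /sys/bus/pci/devices/<addr>
nhi_reset() {
    cooldown_elapsed || { echo "cooldown active, skipping NHI reset"; return 1; }
    date +%s > "$STAMP"
    echo 1 > "$1/remove"       # detach the controller from the PCI bus
    echo 1 > "$PCI_RESCAN"     # rediscover it (soft replug equivalent)
}

recover() {
    have_tb_netdev && { echo "netdev present"; return 0; }
    nhi_reset "$1" || return 1
    udevadm trigger --subsystem-match=net --action=add
    echo "NHI reset issued"
}

# Usage (as root): recover /sys/bus/pci/devices/<addr>
```

Recording the timestamp before the reset keeps a failing reset from being retried on every timer cycle.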
	105
	106	---
	107
	108	## Related Issues
	109
	110	- ISSUE-2025-002
	111	- ISSUE-2025-001
	112
	113	---
	114
	115	## Changelog References
	116
	117	List CHANGELOG.md entries that reference this issue:
	118	- CHANGELOG entry: [Unreleased] - Fix invalid `ExecStop` in `tb-enlist@.service` to prevent failed unit on device removal [ISSUE-2026-001]