All notable changes to the Madagascar cluster will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
tb-recover.sh peer_ip_for_iface() used a static host:interface → peer-IP table that assumed kernel Thunderbolt interface numbers follow a fixed port order. The kernel assigns interface numbers dynamically; after a baobab reboot, thunderbolt0 was bound to the tapia connection (domain1, 1-1.0) while the table expected it to face ebony. As a result, assess_peer_health pinged the wrong peer on every 30-second cycle, accumulated two failures, and called recover_iface_cycle (ifdown/ifup) every ~5 minutes, which continuously aborted active ThunderboltIP sessions and kept all cluster nodes isolated on thunderbridge. Fixed by replacing the static table with a dynamic sysfs lookup: readlink /sys/class/net/<iface>/device resolves to the XDomain service path, whose parent directory exposes device_name (e.g. tapia), which is then mapped to the correct peer IP. The static table is retained as a fallback. [ISSUE-2026-003]

ExecStop syntax in tb-enlist@.service caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001]

tb-enlist@.service now stays active until network.target stops, so NFS storage routed over thunderbridge can unmount cleanly before Thunderbolt ports are detached; this is the Thunderbolt-side fix for the cluster-wide maintenance shutdown incident [ISSUE-2026-002]

tb-recover.service) and periodic timer (tb-recover.timer) for flap resilience [ISSUE-2026-001]

tb-recover.sh now escalates recovery by restarting bolt.service when a rescan alone does not recreate the thunderbolt net devices [ISSUE-2026-001]

tb-recover.sh now includes a cooldown-limited Thunderbolt NHI PCI remove+rescan fallback (soft replug path) for reboot cases where the netdev is missing [ISSUE-2026-001]

tb-recover.sh now retries the Thunderbolt NHI reset within the same recovery run when a peer xdomain host reappears without its *.0 network service [ISSUE-2026-001]

tb-recover.sh now probes the expected peer behind each Thunderbolt port and cycles the affected interface with ifdown/ifup when a port stays attached but logically detached [ISSUE-2026-001]
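
The dynamic sysfs lookup described in the peer_ip_for_iface() entry [ISSUE-2026-003] can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: SYSFS_ROOT, peer_name_for_iface, peer_ip_for_name, and the 192.0.2.x addresses are hypothetical placeholders. Only the readlink of /sys/class/net/<iface>/device and the device_name file in the parent directory come from the entry itself.

```shell
#!/bin/sh
# Sketch only: helper names, SYSFS_ROOT, and the peer IPs are hypothetical.

# Overridable so the lookup can be exercised against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the XDomain peer's device_name for a Thunderbolt net interface.
peer_name_for_iface() {
    iface="$1"
    dev_link="$SYSFS_ROOT/class/net/$iface/device"
    # Follow the symlink to the XDomain service directory (e.g. .../1-1/1-1.0).
    service_path="$(readlink -f "$dev_link")" || return 1
    # The parent directory is the XDomain host, which exposes device_name.
    name_file="$(dirname "$service_path")/device_name"
    [ -r "$name_file" ] || return 1
    cat "$name_file"
}

# Map a peer host name to its IP (placeholder documentation addresses).
peer_ip_for_name() {
    case "$1" in
        tapia) echo "192.0.2.11" ;;
        ebony) echo "192.0.2.12" ;;
        *) return 1 ;;
    esac
}

# Resolve the peer IP from whatever the kernel bound to this interface,
# instead of assuming a fixed interface-number-to-port order.
peer_ip_for_iface() {
    name="$(peer_name_for_iface "$1")" || return 1
    peer_ip_for_name "$name"
}
```

A caller would run something like `peer_ip_for_iface thunderbolt0` and, on a non-zero exit status, fall back to the static table that the entry says is retained.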