Changelog

All notable changes to the Madagascar cluster will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Fixed

  • tb-recover.sh: peer_ip_for_iface() used a static host:interface → peer-IP table that assumed kernel Thunderbolt interface numbers follow a fixed port order. The kernel assigns interface numbers dynamically: after a baobab reboot, thunderbolt0 was bound to the tapia connection (domain1, 1-1.0) while the table expected it to face ebony. As a result, assess_peer_health pinged the wrong peer on every 30-second cycle, accumulated two failures, and called recover_iface_cycle (ifdown/ifup) roughly every 5 minutes, which continuously aborted active ThunderboltIP sessions and kept all cluster nodes isolated on thunderbridge. Fixed by replacing the static table with a dynamic sysfs lookup: readlink /sys/class/net/<iface>/device resolves to the XDomain service path, whose parent directory exposes device_name (e.g. tapia); that name is then mapped to the correct peer IP. The static table is retained as a fallback. [ISSUE-2026-003]
  • Invalid ExecStop syntax in tb-enlist@.service caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001]
  • Tapia-Baobab Thunderbolt recovery path hardened after reboot-time disconnect/reconnect events [ISSUE-2026-001]
  • tb-enlist@.service now stays active until network.target stops, so NFS storage routed over thunderbridge can unmount cleanly before the Thunderbolt ports are detached. This is the Thunderbolt-side fix for the cluster-wide maintenance shutdown incident [ISSUE-2026-002]
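
The dynamic sysfs lookup described for ISSUE-2026-003 could look roughly like the sketch below. It is not the code from tb-recover.sh: the helper names peer_name_for_iface, peer_ip_for_name and static_peer_ip_for_iface, the SYSFS_ROOT override, and the 192.0.2.x addresses are all assumptions made for illustration. Only the readlink of /sys/class/net/<iface>/device and the device_name attribute in the parent directory come from the entry above.

```shell
#!/bin/sh
# Sketch only. All function names, SYSFS_ROOT and the IP addresses below are
# hypothetical placeholders, not the real cluster configuration.

# Print the peer host name behind a Thunderbolt network interface.
peer_name_for_iface() {
    iface="$1"
    sysfs="${SYSFS_ROOT:-/sys}"   # overridable so the lookup can be tested
    # The device link resolves to the XDomain service directory (e.g. .../1-1.0).
    service_dir=$(readlink -f "$sysfs/class/net/$iface/device") || return 1
    # Its parent directory is the XDomain peer, which exposes device_name.
    xdomain_dir=$(dirname "$service_dir")
    [ -r "$xdomain_dir/device_name" ] || return 1
    cat "$xdomain_dir/device_name"
}

# Map a peer host name to its IP (placeholder documentation addresses).
peer_ip_for_name() {
    case "$1" in
        tapia)  echo "192.0.2.11" ;;
        ebony)  echo "192.0.2.12" ;;
        baobab) echo "192.0.2.13" ;;
        *)      return 1 ;;
    esac
}

# Stand-in for the retained static table; a real table would go here.
static_peer_ip_for_iface() {
    return 1
}

# Dynamic lookup first, static table only as a fallback.
peer_ip_for_iface() {
    name=$(peer_name_for_iface "$1") && peer_ip_for_name "$name" && return 0
    static_peer_ip_for_iface "$1"
}
```

Because the peer name is read from sysfs on every call, the result follows whatever interface number the kernel assigned after a reboot instead of relying on a fixed port order.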

Added

  • Automatic Thunderbolt recovery worker (tb-recover.service) and periodic timer (tb-recover.timer) for flap resilience [ISSUE-2026-001]

Changed

  • tb-recover.sh now escalates recovery by restarting bolt.service when rescan alone does not recreate thunderbolt net devices [ISSUE-2026-001]
  • tb-recover.sh now includes a cooldown-limited Thunderbolt NHI PCI remove+rescan fallback (soft replug path) for reboot cases where the netdev is missing [ISSUE-2026-001]
  • tb-recover.sh now retries the Thunderbolt NHI reset within the same recovery run when a peer xdomain host reappears without its *.0 network service [ISSUE-2026-001]
  • tb-recover.sh now probes the expected peer behind each Thunderbolt port and cycles the affected interface with ifdown/ifup when a port stays attached but is logically detached [ISSUE-2026-001]
  • Added standardized shared-runtime install/uninstall flow that manages scripts, unit files, and udev rules without rewriting host network configuration
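
A cooldown guard for a disruptive step such as the NHI remove+rescan fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic used in tb-recover.sh: the function name cooldown_ok, the stamp-file approach, and the path and interval in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch only. cooldown_ok and the stamp-file mechanism are assumptions.

# cooldown_ok STAMP_FILE MIN_INTERVAL_SECONDS
# Succeeds (and records the current time) only if at least MIN_INTERVAL_SECONDS
# have passed since the last successful call for the same stamp file.
cooldown_ok() {
    stamp_file="$1"
    min_interval="$2"
    now=$(date +%s)
    last=0
    if [ -r "$stamp_file" ]; then
        last=$(cat "$stamp_file")
    fi
    last=${last:-0}               # treat an empty stamp file as "never ran"
    if [ $((now - last)) -lt "$min_interval" ]; then
        return 1                  # still cooling down, skip the action
    fi
    printf '%s\n' "$now" > "$stamp_file"
}

# Hypothetical usage (path, interval and nhi_soft_replug are placeholders):
#   if cooldown_ok /run/tb-recover/nhi-reset.stamp 600; then
#       nhi_soft_replug
#   fi
```

Keeping the timestamp in a file rather than a shell variable lets the cooldown survive across separate timer-triggered runs of the script.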

[2025-10-30]

Fixed

  • Thunderbolt interfaces not in bridge after MTU fix deployment [ISSUE-2025-002]
  • MTU reset to 1500 after systemctl restart networking [ISSUE-2025-001]

Added

  • Issue tracking system with structured templates
  • Defense-in-depth for thunderbolt network configuration (udev + ifupdown2 hooks)

Changed

  • Enhanced udev rules for thunderbolt device handling
  • Updated network interfaces.d with post-up hooks for MTU and bridge membership

[2025-10-29]

Added

  • Initial issue tracking setup
  • COPILOT_BACKUPS_INSTRUCTIONS.md for backup procedures
  • CHANGELOG.md for change documentation