# Madagascar's Thunderbolts

Thunderbolt networking toolkit for three Proxmox hosts (`baobab`, `ebony`, `tapia`).
The goal is to bring up a high-MTU Thunderbolt bridge (`thunderbridge`) early in boot,
enlist hot-plugged Thunderbolt NICs as they appear, and keep management networking
configs consistent across the cluster.

## Repository layout

```
deploy/attempt1/
├── common/                    # Shared bits copied to every host
│   ├── systemd/system/
│   │   ├── tb-bridge.service  # Ensures the bridge device exists and is up
│   │   └── tb-enlist@.service # Enlists hotplugged NICs into the bridge
│   └── udev/rules.d/
│       └── 90-…systemd.rules  # Starts tb-enlist@ for thunderbolt* devices
├── baobab/…                   # Node-specific /etc/network config
├── ebony/…
├── tapia/…
└── deploy_tb.sh               # Main deployment script
```

The repo currently holds a single deployment attempt (`deploy/attempt1`). If you
iterate on the design, prefer adding a new attempt directory so older snapshots
stay reproducible.

## Standardized lifecycle

This project has two distinct operational paths:

- Full bootstrap: `deploy/attempt1/deploy_tb.sh`
  - can update host-specific network configuration
  - use for initial deployment or deliberate network template rollout
- Shared runtime reinstall: `./setup.sh`
  - standardizes the shared runtime artifacts only
  - installs/removes `tb-recover.sh`, the shared systemd units, and the udev rule
  - intentionally leaves `/etc/network/interfaces` and `/etc/network/interfaces.d/10-thunderbolt` untouched

Standardized host paths for the shared runtime:

- canonical uninstall script: `/usr/local/lib/xdev/thunderbolts/uninstall.sh`
- canonical shared script: `/usr/local/lib/xdev/thunderbolts/tb-recover.sh`
- operator wrapper: `/usr/local/sbin/tb-recover.sh`
- installed docs: `/usr/local/share/doc/xdev/thunderbolts`

Use:

```bash
./setup.sh                 # reinstall shared runtime on baobab ebony tapia
./setup.sh baobab          # single host
./setup.sh --uninstall baobab
```
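After a reinstall, the standardized paths listed above can be spot-checked on a host. The following is a minimal sketch, not part of `setup.sh`: the function name `check_runtime_paths` and its optional root-prefix argument are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Report which of the standardized shared-runtime paths exist.
# The optional first argument is a hypothetical root prefix (empty on a real
# host) so the check can also be run against a scratch directory.
check_runtime_paths() {
  local root="${1:-}" missing=0 p
  for p in \
    /usr/local/lib/xdev/thunderbolts/uninstall.sh \
    /usr/local/lib/xdev/thunderbolts/tb-recover.sh \
    /usr/local/sbin/tb-recover.sh \
    /usr/local/share/doc/xdev/thunderbolts
  do
    if [ -e "${root}${p}" ]; then
      echo "ok      ${p}"
    else
      echo "missing ${p}"
      missing=1
    fi
  done
  return "$missing"   # 0 when every path exists, 1 otherwise
}
```

Run `check_runtime_paths` with no argument on a target host; after `./setup.sh --uninstall <host>` the same check should presumably report the paths as missing.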

## Prerequisites

- A local machine (the one you run the scripts from) with Bash 3 or newer, `ssh`,
  and `scp` available.
- Access to the target hosts as `root` (default username) over the management or
  Thunderbolt network; passwordless SSH is assumed.
- Target hosts run Proxmox (or any Debian-like system with ifupdown2 and systemd).
- `ip`, `systemctl`, and `udevadm` available on the remote hosts.

## How deployment works

`deploy_tb.sh` is idempotent. For each target host it:

- Chooses an IP by trying management first, then Thunderbolt (`get_mgmt_ip`/`get_tb_ip`).
- Uploads the shared udev rule and systemd units that prepare the `thunderbridge`
  device and attach Thunderbolt NICs when they hot-plug.
- Replaces `/etc/network/interfaces` with the host-specific template and places the
  Thunderbolt overlay in `/etc/network/interfaces.d/10-thunderbolt`.
- Reloads udev and systemd, triggers network reloads, enables the services, and
  prints a short status report (bridge state, enlisted NICs).

Run it from inside the attempt directory so relative paths resolve correctly.

```bash
cd deploy/attempt1
./deploy_tb.sh            # deploys to baobab, ebony, tapia
./deploy_tb.sh baobab     # deploys to a single host
./deploy_tb.sh tapia ebony
```
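The "management first, then Thunderbolt" address selection can be sketched as a loop over candidate addresses. This is an illustration under stated assumptions, not the code in `deploy_tb.sh`: `choose_ip` and `probe_ssh` are hypothetical names, and the addresses in the usage comment are documentation placeholders.

```bash
#!/usr/bin/env bash
# Probe a single address over SSH. BatchMode avoids password prompts and
# ConnectTimeout bounds the wait for an unreachable host.
probe_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=3 "root@$1" true 2>/dev/null
}

# Print the first candidate address that passes the given probe.
# Usage: choose_ip <probe-function> <ip> [<ip> ...]
choose_ip() {
  local probe="$1" ip
  shift
  for ip in "$@"; do
    [ -n "$ip" ] || continue        # skip networks where the host has no address
    if "$probe" "$ip"; then
      echo "$ip"
      return 0
    fi
    echo "unreachable: $ip" >&2     # report which address was tried
  done
  return 1                          # no candidate answered
}

# Example (placeholder addresses): management first, then Thunderbolt.
# choose_ip probe_ssh 192.0.2.11 198.51.100.11
```

Passing the probe as an argument keeps the selection order separate from how reachability is tested, so the loop can be exercised without network access.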

## Customising host lists and addresses

Edit the `get_mgmt_ip()` and `get_tb_ip()` helpers near the top of
`deploy/attempt1/deploy_tb.sh` to match your environment. Each host that you want
to target must:

1. Have a subdirectory named after the host inside `deploy/attempt1`.
2. Provide the full `/etc/network/interfaces` template.
3. Provide `etc/network/interfaces.d/10-thunderbolt` with the bridge definition
   and hotplug rules for Thunderbolt interfaces.

To add a new host, copy one of the existing directories, adjust static IPs and
interface names, then extend both helper functions so the script can locate it.
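As a rough guide, extending the helpers for a fourth host might look like the sketch below. The real functions in `deploy_tb.sh` may be structured differently; the host name `newhost` and every address shown are placeholders (RFC 5737 documentation ranges), not values from this repository.

```bash
#!/usr/bin/env bash
# Hypothetical shape of the lookup helpers: one case arm per host.
# All addresses are placeholders; "newhost" is an illustrative addition.
get_mgmt_ip() {
  case "$1" in
    baobab)  echo "192.0.2.11" ;;
    ebony)   echo "192.0.2.12" ;;
    tapia)   echo "192.0.2.13" ;;
    newhost) echo "192.0.2.14" ;;
    *)       return 1 ;;            # unknown host
  esac
}

get_tb_ip() {
  case "$1" in
    baobab)  echo "198.51.100.11" ;;
    ebony)   echo "198.51.100.12" ;;
    tapia)   echo "198.51.100.13" ;;
    newhost) echo "198.51.100.14" ;;
    *)       return 1 ;;            # unknown host
  esac
}
```

Returning a non-zero status for unknown hosts lets a caller fail early instead of connecting to an empty address.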

## What the systemd/udev pieces do

- `tb-bridge.service` (oneshot) makes sure the `thunderbridge` device exists as a
  Linux bridge, sets MTU 65520, and brings it up during early boot.
- `tb-enlist@.service` attaches each Thunderbolt NIC instance to the bridge and
  aligns its MTU, keeping the NICs hotplug friendly. systemd stops the unit
  cleanly when the device is removed. The unit is ordered `Before=network.target`,
  so at shutdown (where systemd reverses start-up ordering) it stops only after
  `network.target`, which lets remote filesystems mounted over `thunderbridge`
  unmount before the ports detach.
- `90-thunderbolt-net-systemd.rules` tags `thunderbolt*` NICs so udev starts the
  enlist service automatically.

These files live under `deploy/attempt1/common/` and are copied verbatim to the
remote host’s `/etc/systemd/system` and `/etc/udev/rules.d`.
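For orientation, the effect of the two units corresponds roughly to the `ip` commands below. This is a sketch of equivalent commands, not a copy of the unit files, which may do more or use different invocations; the function names `ensure_bridge` and `enlist_nic` are illustrative.

```bash
#!/usr/bin/env bash
# Approximate shell equivalents of tb-bridge.service and tb-enlist@.service.
# Nothing runs until one of the functions is called (as root on a host).
BRIDGE=thunderbridge
MTU=65520

ensure_bridge() {
  # Create the bridge only if it is missing, so reruns are harmless.
  ip link show dev "$BRIDGE" >/dev/null 2>&1 ||
    ip link add name "$BRIDGE" type bridge
  ip link set dev "$BRIDGE" mtu "$MTU"
  ip link set dev "$BRIDGE" up
}

enlist_nic() {
  local nic="$1"                          # for example thunderbolt0
  ip link set dev "$nic" mtu "$MTU"       # raise the NIC MTU before attaching
  ip link set dev "$nic" master "$BRIDGE" # make the NIC a port of the bridge
  ip link set dev "$nic" up
}
```

Calling `ensure_bridge` followed by `enlist_nic thunderbolt0` by hand can help separate a unit-file problem from a kernel or driver problem.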

## Validation checklist

After running the deploy script on a host:

- `systemctl status tb-bridge.service` should show an *active* oneshot unit
  (systemd reports `active (exited)` for a oneshot that sets `RemainAfterExit=yes`).
- `systemctl list-units 'tb-enlist@*'` should list one unit per detected Thunderbolt
  NIC, each *loaded* and *active*.
- `ip -d link show thunderbridge` should display MTU 65520 and `state UP`.
- `bridge link` should list your Thunderbolt interfaces as ports of `thunderbridge`
  once cables are connected.
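
The link checks above can be scripted by parsing command output. The helpers below are a minimal sketch with illustrative names (`bridge_link_ok`, `count_bridge_ports`); they read output from stdin so they can be fed from `ssh` or from a saved log.

```bash
#!/usr/bin/env bash
# stdin: output of `ip -d link show thunderbridge`.
# Succeeds only when the expected MTU and state are both present.
bridge_link_ok() {
  local out
  out="$(cat)"
  case "$out" in
    *" mtu 65520 "*) ;;
    *) echo "unexpected MTU" >&2; return 1 ;;
  esac
  case "$out" in
    *" state UP "*) ;;
    *) echo "bridge not UP" >&2; return 1 ;;
  esac
}

# stdin: output of `bridge link`.
# Prints how many ports name thunderbridge as their master.
count_bridge_ports() {
  grep -c 'master thunderbridge' || true   # grep exits 1 when the count is 0
}

# On a host:
#   ip -d link show thunderbridge | bridge_link_ok && echo "bridge OK"
#   bridge link | count_bridge_ports
```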

If you change the network definitions, re-run `./deploy_tb.sh <host>` to push the
updates. The script re-applies permissions, reloads systemd, retriggers udev, and
refreshes the interfaces.

## Troubleshooting tips

- *SSH unreachable*: Confirm management and Thunderbolt IPs in the helper functions
  are correct, and that firewalls allow SSH. The script prints which IP it tried.
- *Bridge missing after reboot*: Ensure `tb-bridge.service` is enabled; run
  `systemctl enable --now tb-bridge.service` on the host.
- *NICs not joining*: Check `journalctl -u tb-enlist@thunderbolt0` for logs and make
  sure the udev rule is present under `/etc/udev/rules.d`.
- *Slow shutdown with NFS on thunderbridge*: Verify the host has the updated
  `tb-enlist@.service` with `Before=network.target`; otherwise `thunderbridge`
  can disappear before Proxmox unmounts NFS storages and shutdown waits on NFS
  timeouts.
- *MTU mismatch complaints*: The service forces MTU 65520 on the bridge and on each
  enlisted NIC; verify that the device at the other end of each cable also
  supports it.
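
For the slow-shutdown case, one way to confirm the ordering on a host is to inspect the unit's `Before` property. The helper below is a sketch with an illustrative name (`has_network_ordering`); it assumes a systemd version whose `systemctl show` supports `--value`.

```bash
#!/usr/bin/env bash
# stdin: output of `systemctl show -p Before --value <unit>`,
# a space-separated list of unit names.
# Succeeds when network.target appears in that list.
has_network_ordering() {
  tr ' ' '\n' | grep -qx 'network.target'
}

# On a host (substitute the interface name as needed):
#   systemctl show -p Before --value tb-enlist@thunderbolt0.service \
#     | has_network_ordering || echo "unit lacks Before=network.target"
```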

## Extending beyond attempt1

Prefer copying `deploy/attempt1` into a new versioned folder (for example,
`attempt2`) when you experiment with alternate topologies or addresses. This keeps
previous rollouts reproducible and eases diffing of changes.
