# Madagascar Cluster Changelog

All notable changes to the Madagascar cluster configuration and infrastructure are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

Each entry should reference related issues using the format `[ISSUE-YYYY-NNN]`.

---

## [Unreleased]

### Known Issues
- [ISSUE-2025-001] Thunderbolt interface MTU resets to 1500 after a networking restart (open)
- PBS instances hosted inside VMs `301 is-anjohibe` and `302 is-andrafiabe` become temporarily unavailable during a planned node reboot while `pgs suspend` freezes them; this is an accepted cluster-wide limitation that affects backup and restore availability only for the duration of the maintenance window

### Added
- Added a central `cluster/projects/README.md` policy for current and future cluster-level projects

### Changed
- Consolidated `pve-net-hang-watchdog` into its own project folder under `cluster/projects/pve-net-hang-watchdog`
- Standardized project rules around well-known install paths, mandatory uninstall scripts, and uninstall-before-reinstall workflow
- Anchored the central project policy in the existing `autoNAS` install/uninstall workflow and documented its known lessons and current path exception
- Established `/usr/local/lib/xdev/<project-name>/uninstall.sh` as the canonical uninstall script location, with optional `/usr/local/sbin/xdev-<project-name>-uninstall` wrapper
- Added standard namespaced locations for installed documentation, configuration, operational data, cache, and optional file-based logs
- Removed the accidental empty `autoNAS/autoSMART` nested drop and kept `cluster/projects/autoSMART` as the canonical project location
- Standardized `cluster/projects/pve-guests-state` with dedicated install/uninstall scripts, namespaced host paths, migrated state location, and cleaned legacy project artifacts
- Standardized `cluster/projects/pve-net-hang-watchdog` with namespaced install paths, dedicated lifecycle scripts, and a defaults file under `/etc/default/xdev-pve-net-hang-watchdog`
- Updated `pve-net-hang-watchdog` install behavior so deployment also starts the service immediately, not just enables it for boot
- Added a standardized shared-runtime lifecycle for `cluster/projects/thunderbolts` that leaves network interface files untouched during reinstall/uninstall
- Documented the cluster-wide deployment rule that required services/timers must be activated with `systemctl enable --now` during install, not left merely enabled
- Standardized `cluster/projects/pve-backup-scheduler` around `/usr/local/lib/xdev/pve-backup-scheduler`, added canonical lifecycle scripts and `setup.sh`, and kept `/etc/pve/autobackup` as an explicit preserved config exception
- Standardized `cluster/projects/autoNAS` around `/usr/local/lib/xdev/autonas` and `/usr/local/sbin/autonas`, while keeping `/etc/pve/autonas` and `/mnt/autonas` as explicit shared-state exceptions
- Grouped cluster metadata and historical cache files under `cluster-context/` and moved legacy snapshots under `cluster-context/history/`
- Added cluster-wide deployment orchestration in `scripts/deploy-project.sh`, driven by `cluster-context/madagascar.json`, while preserving one-node deploy paths for development and testing
- Tightened lifecycle cleanup for `pve-guests-state` legacy systemd units and suppressed `thunderbolts` recovery noise on hosts without `bolt.service`
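
The `systemctl enable --now` deployment rule above distinguishes a unit that is merely enabled for boot from one that is also running. A minimal sketch of a check for that distinction follows. The helper names `unit_state` and `violates_enable_now_rule` are hypothetical and not part of any project in this repository; the sketch relies only on the exit codes of `systemctl is-enabled` and `systemctl is-active`, and treats any zero exit as "yes", which may be too coarse for oneshot services or statically enabled units.

```python
import subprocess
from typing import Callable


def unit_state(unit: str, run: Callable = subprocess.run) -> dict[str, bool]:
    """Report whether a systemd unit is enabled for boot and active right now."""

    def ok(verb: str) -> bool:
        # `systemctl is-enabled` and `systemctl is-active` exit with 0 on "yes".
        result = run(["systemctl", verb, "--quiet", unit], check=False)
        return result.returncode == 0

    return {"enabled": ok("is-enabled"), "active": ok("is-active")}


def violates_enable_now_rule(unit: str, run: Callable = subprocess.run) -> bool:
    """True unless the unit is both enabled and currently active."""
    state = unit_state(unit, run)
    return not (state["enabled"] and state["active"])
```

The `run` parameter exists only so the check can be exercised without a live systemd; on a node it defaults to `subprocess.run`.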

---

## [2025-10-30]

### Fixed
- [ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved
  - **Root cause**: `systemctl restart networking` resets the MTU to 1500 because the systemd services that set it are not re-triggered by a networking restart
  - **Solution**: Hybrid approach combining enhanced udev rules with `post-up` hooks
  - **Changes**: Updated udev rules and `interfaces.d` configs on all nodes (baobab, ebony, tapia)
  - **Testing**: Verified MTU 65520 persists after networking restart on all nodes
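
The MTU verification described above could be repeated with a short script along these lines. This is a sketch under assumptions, not the procedure actually used: it reads the standard Linux sysfs attribute `/sys/class/net/<interface>/mtu`, the function names are hypothetical, and the interface names passed in are placeholders because this changelog does not list the Thunderbolt interface names.

```python
from pathlib import Path

EXPECTED_MTU = 65520  # value recorded in this changelog
SYSFS_NET = Path("/sys/class/net")


def read_mtu(interface: str, sysfs_root: Path = SYSFS_NET) -> int:
    """Read the current MTU of an interface from sysfs."""
    return int((sysfs_root / interface / "mtu").read_text().strip())


def mtu_mismatches(interfaces, expected: int = EXPECTED_MTU,
                   sysfs_root: Path = SYSFS_NET) -> dict[str, int]:
    """Return {interface: actual_mtu} for every interface not at the expected MTU."""
    found = {name: read_mtu(name, sysfs_root) for name in interfaces}
    return {name: mtu for name, mtu in found.items() if mtu != expected}
```

Running `mtu_mismatches([...])` with the node's Thunderbolt interface names before and after `systemctl restart networking` should return an empty dict both times if the fix holds.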

### Added
- Issue tracking system in `cluster/issues/` directory
- CHANGELOG.md for documenting all cluster changes with issue references
- Template for issue documentation (`issues/TEMPLATE.md`)
- First documented issue: ISSUE-2025-001 regarding the Thunderbolt MTU reset problem
- `scripts/check_mcluster_network.sh` for cluster thunderbridge and network health checks (table output, ping tests from localhost and baobab)

### Changed
- Removed codebase-specific references from `madagascar.json` to keep it cluster-focused

---

## [2025-10-19]

### Added
- PBS (Proxmox Backup Server) configuration to `madagascar.json`
  - andrafiabe-AutoNAS (192.168.2.96)
  - anjohibe-AutoNAS (192.168.2.95)
- Node roles (primary/secondary) to cluster configuration

---

## [2025-10-18]

### Added
- Initial `madagascar.json` cluster cache file
- Cluster network documentation (Thunderbolt bridge configuration)
- WAN configuration for all nodes (vmbr443, vmbr444)
- Node-specific network information (baobab, ebony, tapia)
- `madagascar-changelog.json` for automation-triggered changes
- `README_madagascar_cache.md` with file contract documentation

### Infrastructure
- Thunderbolt bridge (thunderbridge) on 192.168.10.0/24 with MTU 65520
- WAN bridges on 192.168.2.0/24 (vmbr443) and 192.168.4.0/24 (vmbr444)
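
As an illustration of how the subnets listed above map addresses to bridges, here is a small sketch using Python's standard `ipaddress` module. The mapping is copied from the entries above; the `bridge_for` helper is hypothetical and is not part of `madagascar.json` or any cluster script.

```python
import ipaddress

# Bridge-to-subnet mapping copied from the Infrastructure entries above.
BRIDGE_SUBNETS = {
    "thunderbridge": ipaddress.ip_network("192.168.10.0/24"),
    "vmbr443": ipaddress.ip_network("192.168.2.0/24"),
    "vmbr444": ipaddress.ip_network("192.168.4.0/24"),
}


def bridge_for(address: str):
    """Return the name of the bridge whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for bridge, subnet in BRIDGE_SUBNETS.items():
        if ip in subnet:
            return bridge
    return None
```

An address outside all three subnets yields `None`, which makes the helper usable as a quick sanity check on addresses recorded in the cluster cache.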

---

## Format Guidelines

### Categories
- **Added** - new features, files, or configurations
- **Changed** - changes to existing functionality or configuration
- **Deprecated** - features or configurations that will be removed
- **Removed** - removed features or configurations
- **Fixed** - bug fixes (always reference issue number)
- **Security** - security-related changes

### Entry Format
```
- Brief description [ISSUE-YYYY-NNN] (optional details)
```

### Issue References
Always link changes to issues when applicable:
- Bug fixes must reference the issue
- New features should reference planning/feature issues
- Configuration changes should reference related issues or RFCs
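
The rule that bug fixes must reference an issue lends itself to a simple automated check. The sketch below is one possible approach, not an existing tool in this repository: `fixed_entries_missing_issue` is a hypothetical name, the pattern assumes `NNN` means exactly three digits, and only top-level bullets under a `### Fixed` heading are inspected.

```python
import re

# Matches references such as [ISSUE-2025-001]; three-digit NNN is an assumption.
ISSUE_REF = re.compile(r"\[ISSUE-\d{4}-\d{3}\]")


def fixed_entries_missing_issue(changelog_text: str) -> list[str]:
    """Return top-level entries under '### Fixed' that carry no issue reference."""
    missing = []
    in_fixed = False
    for line in changelog_text.splitlines():
        if line.startswith("#"):
            # Any heading starts a new section; only '### Fixed' is checked.
            in_fixed = line.strip().lower() == "### fixed"
            continue
        # Indented sub-bullets (root cause, solution, ...) are skipped on purpose.
        if in_fixed and line.startswith("- ") and not ISSUE_REF.search(line):
            missing.append(line[2:].strip())
    return missing
```

Passing the contents of `CHANGELOG.md` to this function returns an empty list when every Fixed entry follows the guideline.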