
Madagascar Cluster Changelog

All notable changes to the Madagascar cluster configuration and infrastructure are documented in this file.

The format is based on Keep a Changelog.

Each entry should reference related issues using the format [ISSUE-YYYY-NNN].


[Unreleased]

Known Issues

  • [ISSUE-2025-001] Thunderbolt interfaces MTU resets to 1500 after networking restart (open)
  • PBS instances hosted inside VMs 301 (is-anjohibe) and 302 (is-andrafiabe) become temporarily unavailable during a planned node reboot while pgs suspend freezes them; this is an accepted cluster-wide maintenance-window limitation and affects backup and restore availability only for that window

Added

  • Added a central cluster/projects/README.md policy for current and future cluster-level projects

Changed

  • Consolidated pve-net-hang-watchdog into its own project folder under cluster/projects/pve-net-hang-watchdog
  • Standardized project rules around well-known install paths, mandatory uninstall scripts, and uninstall-before-reinstall workflow
  • Anchored the central project policy in the existing autoNAS install/uninstall workflow and documented its known lessons and current path exception
  • Established /usr/local/lib/xdev/<project-name>/uninstall.sh as the canonical uninstall script location, with optional /usr/local/sbin/xdev-<project-name>-uninstall wrapper
  • Added standard namespaced locations for installed documentation, configuration, operational data, cache, and optional file-based logs
  • Removed the accidental empty autoNAS/autoSMART nested drop and kept cluster/projects/autoSMART as the canonical project location
  • Standardized cluster/projects/pve-guests-state with dedicated install/uninstall scripts, namespaced host paths, migrated state location, and cleaned legacy project artifacts
  • Standardized cluster/projects/pve-net-hang-watchdog with namespaced install paths, dedicated lifecycle scripts, and a defaults file under /etc/default/xdev-pve-net-hang-watchdog
  • Updated pve-net-hang-watchdog install behavior so deployment also starts the service immediately, not just enables it for boot
  • Added a standardized shared-runtime lifecycle for cluster/projects/thunderbolts that leaves network interface files untouched during reinstall/uninstall
  • Documented the cluster-wide deployment rule that required services/timers must be activated with systemctl enable --now during install, not left merely enabled
  • Standardized cluster/projects/pve-backup-scheduler around /usr/local/lib/xdev/pve-backup-scheduler, added canonical lifecycle scripts and setup.sh, and kept /etc/pve/autobackup as an explicit preserved config exception
  • Standardized cluster/projects/autoNAS around /usr/local/lib/xdev/autonas and /usr/local/sbin/autonas, while keeping /etc/pve/autonas and /mnt/autonas as explicit shared-state exceptions
  • Grouped cluster metadata and historical cache files under cluster-context/ and moved legacy snapshots under cluster-context/history/
  • Added cluster-wide deployment orchestration in scripts/deploy-project.sh, driven by cluster-context/madagascar.json, while preserving one-node deploy paths for development and testing
  • Tightened lifecycle cleanup for pve-guests-state legacy systemd units and suppressed thunderbolts recovery noise on hosts without bolt.service
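
The path and activation conventions listed above can be illustrated with a small helper. This is a minimal sketch, not code from the repository: the function names `lifecycle_paths` and `activation_command` and the unit name in the usage lines are hypothetical. Only the two path patterns and the `systemctl enable --now` rule come from the entries above.

```python
from pathlib import PurePosixPath

# Base directories taken from the changelog entries above.
XDEV_LIB = PurePosixPath("/usr/local/lib/xdev")
SBIN = PurePosixPath("/usr/local/sbin")


def lifecycle_paths(project_name):
    """Return the canonical uninstall locations for a project (hypothetical helper)."""
    if not project_name or "/" in project_name:
        raise ValueError(f"invalid project name: {project_name!r}")
    return {
        # Canonical uninstall script location.
        "uninstall_script": XDEV_LIB / project_name / "uninstall.sh",
        # Optional convenience wrapper.
        "uninstall_wrapper": SBIN / f"xdev-{project_name}-uninstall",
    }


def activation_command(unit):
    """Build the command that enables a unit for boot and starts it immediately."""
    return ["systemctl", "enable", "--now", unit]


if __name__ == "__main__":
    for label, path in lifecycle_paths("pve-net-hang-watchdog").items():
        print(f"{label}: {path}")
    # The unit name below is an assumption for illustration only.
    print(" ".join(activation_command("pve-net-hang-watchdog.service")))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` without a shell.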

[2025-10-30]

Fixed

  • [ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved
    • Root cause: systemctl restart networking resets the MTU because the systemd services that set it are not re-triggered
    • Solution: Hybrid approach with udev rule enhancement + post-up hooks
    • Changes: Updated udev rules and interfaces.d configs on all nodes (baobab, ebony, tapia)
    • Testing: Verified MTU 65520 persists after networking restart on all nodes
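
The persistence check described in the Testing step could be scripted roughly as follows. This is a minimal sketch, assuming the standard Linux sysfs attribute `/sys/class/net/<iface>/mtu`; the function names and the interface names in the usage lines are hypothetical, and this is not necessarily the procedure that was run on the nodes.

```python
from pathlib import Path

EXPECTED_MTU = 65520  # value stated in the changelog entry above


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Read the current MTU of an interface from sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_mismatches(ifaces, expected=EXPECTED_MTU, sysfs_root="/sys/class/net"):
    """Return {iface: actual_mtu} for every interface whose MTU differs from expected."""
    mismatches = {}
    for iface in ifaces:
        mtu = read_mtu(iface, sysfs_root)
        if mtu != expected:
            mismatches[iface] = mtu
    return mismatches


if __name__ == "__main__":
    # Interface names are placeholders; substitute the real thunderbolt interfaces.
    bad = mtu_mismatches(["thunderbolt0", "thunderbolt1"])
    print("all interfaces OK" if not bad else f"MTU mismatch: {bad}")
```

Running the check once before and once after `systemctl restart networking` would show whether the MTU survived the restart. The `sysfs_root` parameter exists only so the functions can be exercised against a temporary directory.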

Added

  • Issue tracking system in cluster/issues/ directory
  • CHANGELOG.md for documenting all cluster changes with issue references
  • Template for issue documentation (issues/TEMPLATE.md)
  • First documented issue: ISSUE-2025-001 regarding thunderbolt MTU reset problem
  • scripts/check_mcluster_network.sh for cluster thunderbridge and network health checks (table output, ping tests from localhost and baobab)

Changed

  • Removed codebase-specific references from madagascar.json to keep it cluster-focused

[2025-10-19]

Added

  • PBS (Proxmox Backup Server) configuration to madagascar.json
    • andrafiabe-AutoNAS (192.168.2.96)
    • anjohibe-AutoNAS (192.168.2.95)
  • Node roles (primary/secondary) to cluster configuration

[2025-10-18]

Added

  • Initial madagascar.json cluster cache file
  • Cluster network documentation (thunderbolt bridge configuration)
  • WAN configuration for all nodes (vmbr443, vmbr444)
  • Node-specific network information (baobab, ebony, tapia)
  • madagascar-changelog.json for automation-triggered changes
  • README_madagascar_cache.md with file contract documentation

Infrastructure

  • Thunderbolt bridge (thunderbridge) on 192.168.10.0/24 with MTU 65520
  • WAN bridges on 192.168.2.0/24 (vmbr443) and 192.168.4.0/24 (vmbr444)
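
To show how addresses recorded in this changelog map onto the bridges listed above, here is a minimal sketch using Python's standard `ipaddress` module. The `BRIDGES` dictionary layout and the `bridge_for` function are hypothetical and do not mirror the actual madagascar.json schema; only the bridge names and subnets are taken from this file.

```python
import ipaddress

# Bridge names and subnets as listed under Infrastructure.
BRIDGES = {
    "thunderbridge": ipaddress.ip_network("192.168.10.0/24"),
    "vmbr443": ipaddress.ip_network("192.168.2.0/24"),
    "vmbr444": ipaddress.ip_network("192.168.4.0/24"),
}


def bridge_for(address):
    """Return the name of the bridge whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in BRIDGES.items():
        if ip in network:
            return name
    return None


if __name__ == "__main__":
    # The PBS addresses from the 2025-10-19 entry fall in the vmbr443 subnet.
    for addr in ("192.168.2.95", "192.168.2.96"):
        print(addr, "->", bridge_for(addr))
```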

Format Guidelines

Categories

  • Added - new features, files, or configurations
  • Changed - changes to existing functionality or configuration
  • Deprecated - features or configurations that will be removed
  • Removed - removed features or configurations
  • Fixed - bug fixes (always reference issue number)
  • Security - security-related changes

Entry Format

- Brief description [ISSUE-YYYY-NNN] (optional details)

Issue References

Always link changes to issues when applicable:

  • Bug fixes must reference the issue
  • New features should reference planning/feature issues
  • Configuration changes should reference related issues or RFCs
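
The guidelines above lend themselves to a simple automated check. The following is a minimal sketch, not tooling that exists in the repository: the names `issue_refs` and `lint_entry` are hypothetical, and the pattern assumes that YYYY is exactly four digits and NNN exactly three, which the guidelines imply but do not state outright.

```python
import re

# Assumption: four-digit year, three-digit sequence number, e.g. [ISSUE-2025-001].
ISSUE_REF = re.compile(r"\[ISSUE-(\d{4})-(\d{3})\]")

# Categories listed under Format Guidelines.
CATEGORIES = {"Added", "Changed", "Deprecated", "Removed", "Fixed", "Security"}


def issue_refs(entry):
    """Return every issue reference found in a changelog entry."""
    return [match.group(0) for match in ISSUE_REF.finditer(entry)]


def lint_entry(category, entry):
    """Return a list of problems for one entry; an empty list means it passes."""
    problems = []
    if category not in CATEGORIES:
        problems.append(f"unknown category: {category}")
    # Only the Fixed category makes an issue reference mandatory.
    if category == "Fixed" and not issue_refs(entry):
        problems.append("Fixed entries must reference an issue")
    return problems


if __name__ == "__main__":
    print(lint_entry("Fixed", "[ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved"))
    print(lint_entry("Fixed", "MTU persistence issue resolved"))
```

The check enforces a reference only for Fixed entries, since the guidelines phrase the other cases as "should" rather than "must".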