Bogdan Timofte
authored
3 months ago
# Madagascar Cluster Changelog

All notable changes to the Madagascar cluster configuration and infrastructure are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

Each entry should reference related issues using the format `[ISSUE-YYYY-NNN]`.

---

## [Unreleased]

### Known Issues
- [ISSUE-2025-001] Thunderbolt interfaces MTU resets to 1500 after networking restart (open)
- PBS instances hosted inside VMs `301 is-anjohibe` and `302 is-andrafiabe` become temporarily unavailable during a planned node reboot while `pgs suspend` freezes them; this is an accepted cluster-wide maintenance-window limitation that affects backup and restore availability for that window only

### Added
- Added a central `cluster/projects/README.md` policy for current and future cluster-level projects

### Changed
- Consolidated `pve-net-hang-watchdog` into its own project folder under `cluster/projects/pve-net-hang-watchdog`
- Standardized project rules around well-known install paths, mandatory uninstall scripts, and uninstall-before-reinstall workflow
- Anchored the central project policy in the existing `autoNAS` install/uninstall workflow and documented its known lessons and current path exception
- Established `/usr/local/lib/xdev/<project-name>/uninstall.sh` as the canonical uninstall script location, with optional `/usr/local/sbin/xdev-<project-name>-uninstall` wrapper
- Added standard namespaced locations for installed documentation, configuration, operational data, cache, and optional file-based logs
- Removed the accidental empty `autoNAS/autoSMART` nested drop and kept `cluster/projects/autoSMART` as the canonical project location
- Standardized `cluster/projects/pve-guests-state` with dedicated install/uninstall scripts, namespaced host paths, migrated state location, and cleaned legacy project artifacts
- Standardized `cluster/projects/pve-net-hang-watchdog` with namespaced install paths, dedicated lifecycle scripts, and a defaults file under `/etc/default/xdev-pve-net-hang-watchdog`
- Updated `pve-net-hang-watchdog` install behavior so deployment also starts the service immediately, not just enables it for boot
- Added a standardized shared-runtime lifecycle for `cluster/projects/thunderbolts` that leaves network interface files untouched during reinstall/uninstall
- Documented the cluster-wide deployment rule that required services/timers must be activated with `systemctl enable --now` during install, not left merely enabled
- Standardized `cluster/projects/pve-backup-scheduler` around `/usr/local/lib/xdev/pve-backup-scheduler`, added canonical lifecycle scripts and `setup.sh`, and kept `/etc/pve/autobackup` as an explicit preserved config exception
- Standardized `cluster/projects/autoNAS` around `/usr/local/lib/xdev/autonas` and `/usr/local/sbin/autonas`, while keeping `/etc/pve/autonas` and `/mnt/autonas` as explicit shared-state exceptions
- Grouped cluster metadata and historical cache files under `cluster-context/` and moved legacy snapshots under `cluster-context/history/`
- Added cluster-wide deployment orchestration in `scripts/deploy-project.sh`, driven by `cluster-context/madagascar.json`, while preserving one-node deploy paths for development and testing
- Tightened lifecycle cleanup for `pve-guests-state` legacy systemd units and suppressed `thunderbolts` recovery noise on hosts without `bolt.service`
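
The path and activation conventions listed above can be sketched as small helpers. This is only an illustration, not part of any project's actual install scripts: the function names are hypothetical, and the unit name used in the usage comment is an assumption, since the changelog names the project but not its systemd unit.

```python
from pathlib import PurePosixPath

# Roots taken from the conventions recorded in this changelog.
XDEV_LIB_ROOT = PurePosixPath("/usr/local/lib/xdev")
XDEV_SBIN_ROOT = PurePosixPath("/usr/local/sbin")


def uninstall_script(project: str) -> PurePosixPath:
    """Canonical uninstall script location for a project (hypothetical helper)."""
    return XDEV_LIB_ROOT / project / "uninstall.sh"


def uninstall_wrapper(project: str) -> PurePosixPath:
    """Optional wrapper path for the uninstall script (hypothetical helper)."""
    return XDEV_SBIN_ROOT / f"xdev-{project}-uninstall"


def activation_command(unit: str) -> list[str]:
    """argv that enables a unit for boot and starts it immediately."""
    return ["systemctl", "enable", "--now", unit]


# Example usage; the unit name below is assumed for illustration only:
# uninstall_script("pve-backup-scheduler")
# activation_command("pve-net-hang-watchdog.service")
```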

---

## [2025-10-30]

### Fixed
- [ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved
  - **Root cause**: `systemctl restart networking` resets the MTU because systemd services do not re-trigger after the restart
  - **Solution**: Hybrid approach combining an enhanced udev rule with post-up hooks
  - **Changes**: Updated udev rules and interfaces.d configs on all nodes (baobab, ebony, tapia)
  - **Testing**: Verified MTU 65520 persists after networking restart on all nodes
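
A check like the one described under Testing could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used: it assumes the MTU is read from the standard Linux sysfs file `/sys/class/net/<interface>/mtu`, and the function names and the interface names in the usage note are hypothetical.

```python
from pathlib import Path

EXPECTED_MTU = 65520  # value recorded in the changelog entry
SYSFS_NET = Path("/sys/class/net")


def read_mtu(interface: str, sysfs_root: Path = SYSFS_NET) -> int:
    """Return the MTU the kernel currently reports for an interface."""
    return int((sysfs_root / interface / "mtu").read_text().strip())


def interfaces_with_wrong_mtu(interfaces, expected=EXPECTED_MTU, sysfs_root=SYSFS_NET):
    """Map each interface whose MTU differs from `expected` to its actual MTU."""
    wrong = {}
    for name in interfaces:
        mtu = read_mtu(name, sysfs_root)
        if mtu != expected:
            wrong[name] = mtu
    return wrong
```

Running `interfaces_with_wrong_mtu(["en05", "en06"])` on a node after `systemctl restart networking` (interface names assumed here) would return an empty dict if the fix holds.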

### Added
- Issue tracking system in `cluster/issues/` directory
- CHANGELOG.md for documenting all cluster changes with issue references
- Template for issue documentation (`issues/TEMPLATE.md`)
- First documented issue: ISSUE-2025-001 regarding thunderbolt MTU reset problem
- Added `scripts/check_mcluster_network.sh` for cluster thunderbridge and network health checks (table output, ping tests from localhost and baobab)

### Changed
- Removed codebase-specific references from `madagascar.json` to keep it cluster-focused

---

## [2025-10-19]

### Added
- PBS (Proxmox Backup Server) configuration to `madagascar.json`
  - andrafiabe-AutoNAS (192.168.2.96)
  - anjothibe-AutoNAS (192.168.2.95)
- Node roles (primary/secondary) to cluster configuration

---

## [2025-10-18]

### Added
- Initial `madagascar.json` cluster cache file
- Cluster network documentation (thunderbolt bridge configuration)
- WAN configuration for all nodes (vmbr443, vmbr444)
- Node-specific network information (baobab, ebony, tapia)
- `madagascar-changelog.json` for automation-triggered changes
- `README_madagascar_cache.md` with file contract documentation

### Infrastructure
- Thunderbolt bridge (thunderbridge) on 192.168.10.0/24 with MTU 65520
- WAN bridges on 192.168.2.0/24 (vmbr443) and 192.168.4.0/24 (vmbr444)

---

## Format Guidelines

### Categories
- **Added** - new features, files, or configurations
- **Changed** - changes to existing functionality or configuration
- **Deprecated** - features or configurations that will be removed
- **Removed** - removed features or configurations
- **Fixed** - bug fixes (always reference issue number)
- **Security** - security-related changes

### Entry Format
```
- Brief description [ISSUE-YYYY-NNN] (optional details)
```

### Issue References
Always link changes to issues when applicable:
- Bug fixes must reference the issue
- New features should reference planning/feature issues
- Configuration changes should reference related issues or RFCs
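
The reference rules above could be checked mechanically. The sketch below is only an illustration and not an existing tool in this repository: the function names are hypothetical, and the pattern assumes a four-digit year and a three-digit sequence number, inferred from `ISSUE-2025-001`.

```python
import re

# Assumed shape of a reference: [ISSUE-YYYY-NNN] with 4-digit year, 3-digit number.
ISSUE_REF = re.compile(r"\[ISSUE-\d{4}-\d{3}\]")


def issue_references(entry: str) -> list[str]:
    """Return every issue reference found in a changelog entry, in order."""
    return ISSUE_REF.findall(entry)


def entries_missing_reference(entries: list[str]) -> list[str]:
    """Return the entries that carry no issue reference at all."""
    return [entry for entry in entries if not ISSUE_REF.search(entry)]
```

Applied to the entries under a `### Fixed` heading, `entries_missing_reference` would list the bug fixes that still need an issue reference.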