Showing 113 changed files with 19270 additions and 0 deletions
+49 -0
.github/copilot-instructions.md
@@ -0,0 +1,49 @@
1
+# Copilot Instructions (project stub)
2
+
3
+> **Purpose:** Provide context and guidance for GitHub Copilot or other automated agents working with this repository.
4
+
5
+## Project overview
6
+
7
+- **Name:** _<project name goes here>_ (replace this placeholder).
8
+- **Goal:** Brief description of what the codebase / project is intended to accomplish.
9
+- **Deployment model:** Outline where production code lives, what is deployed to target systems, and any separation of developer docs versus runtime artifacts.
10
+
11
+## Key components
12
+
13
+- `bin/` – executable scripts used by the project.
14
+- `docs/` – developer documentation, design notes, and user guides.
15
+- `projects/` – subprojects, each of which may have its own `deployment/`, `scripts/`, and `.github` configuration.
16
+- `scripts/` – helper utilities for deployment, testing, or maintenance.
17
+- `issues/` – markdown issue tracker, one file per issue.
18
+
19
+*(Adjust these bullets to the particular structure of your repository.)*
20
+
21
+## Typical workflows
22
+
23
+1. **Development:** edit source under `deployment/` (if present), update tests, run `./scripts/run_tests.sh` (or similar).
24
+2. **Deployment:** use `./scripts/deploy_to_nodes.sh` or similar to push changes to cluster nodes; enable any systemd units as needed.
25
+3. **Debugging:** check logs with `journalctl -u <service>`; examine `dmesg`, `ip link`, etc. (customize to project).
26
+4. **Configuration:** configuration files live under `/etc/<project>` or are defined in `madagascar.json` and should be treated as source-of-truth.
27
+
28
+## Guidance for Copilot
29
+
30
+- When creating or modifying files, follow existing conventions for naming, documentation, and changelog entries.
31
+- Read `madagascar.json` (and any other top‑level JSON manifests) to understand cluster configuration and avoid hard‑coding.
32
+- Append changes to `madagascar-changelog.json` rather than rewriting it.
33
+- Use POSIX-compliant shell in `bin/` scripts, prefer Python for more complex logic.
34
+
35
+## Issue tracking
36
+
37
+- New issues should be added as markdown files in `issues/` named `YYYY_MM_DD-NN-description.md`.
38
+- Each issue must include description, steps to reproduce, logs, investigation notes, and resolution.
39
+- Update `CHANGELOG.md` with a brief entry when an issue is closed or a change is merged.
40
+
41
+## Example starter tasks for Copilot
42
+
43
+- Add a new utility script with proper shebang and logging function.
44
+- Implement a discovery script that reads `madagascar.json` and enumerates nodes or resources.
45
+- Scaffold a systemd unit file and accompanying installation script.
46
+
47
+---
48
+
49
+*This stub is intended as a starting point; customize it for the specific project or subproject.*
+105 -0
CHANGELOG.md
@@ -0,0 +1,105 @@
1
+# Madagascar Cluster Changelog
2
+
3
+All notable changes to the Madagascar cluster configuration and infrastructure are documented in this file.
4
+
5
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
6
+
7
+Each entry should reference related issues using the format `[ISSUE-YYYY-NNN]`.
8
+
9
+---
10
+
11
+## [Unreleased]
12
+
13
+### Known Issues
14
+- [ISSUE-2025-001] Thunderbolt interfaces MTU resets to 1500 after networking restart (open)
15
+
16
+### Added
17
+- Added a central `cluster/projects/README.md` policy for current and future cluster-level projects
18
+
19
+### Changed
20
+- Consolidated `pve-net-hang-watchdog` into its own project folder under `cluster/projects/pve-net-hang-watchdog`
21
+- Standardized project rules around well-known install paths, mandatory uninstall scripts, and uninstall-before-reinstall workflow
22
+- Anchored the central project policy in the existing `autoNAS` install/uninstall workflow and documented its known lessons and current path exception
23
+- Established `/usr/local/lib/xdev/<project-name>/uninstall.sh` as the canonical uninstall script location, with optional `/usr/local/sbin/xdev-<project-name>-uninstall` wrapper
24
+- Added standard namespaced locations for installed documentation, configuration, operational data, cache, and optional file-based logs
25
+- Removed the accidental empty `autoNAS/autoSMART` nested drop and kept `cluster/projects/autoSMART` as the canonical project location
26
+- Standardized `cluster/projects/pve-guests-state` with dedicated install/uninstall scripts, namespaced host paths, migrated state location, and cleaned legacy project artifacts
27
+- Standardized `cluster/projects/pve-net-hang-watchdog` with namespaced install paths, dedicated lifecycle scripts, and a defaults file under `/etc/default/xdev-pve-net-hang-watchdog`
28
+- Updated `pve-net-hang-watchdog` install behavior so deployment also starts the service immediately, not just enables it for boot
29
+- Added a standardized shared-runtime lifecycle for `cluster/projects/thunderbolts` that leaves network interface files untouched during reinstall/uninstall
30
+- Documented the cluster-wide deployment rule that required services/timers must be activated with `systemctl enable --now` during install, not left merely enabled
31
+- Standardized `cluster/projects/pve-backup-scheduler` around `/usr/local/lib/xdev/pve-backup-scheduler`, added canonical lifecycle scripts and `setup.sh`, and kept `/etc/pve/autobackup` as an explicit preserved config exception
32
+- Standardized `cluster/projects/autoNAS` around `/usr/local/lib/xdev/autonas` and `/usr/local/sbin/autonas`, while keeping `/etc/pve/autonas` and `/mnt/autonas` as explicit shared-state exceptions
33
+- Grouped cluster metadata and historical cache files under `cluster-context/` and moved legacy snapshots under `cluster-context/history/`
34
+- Added cluster-wide deployment orchestration in `scripts/deploy-project.sh`, driven by `cluster-context/madagascar.json`, while preserving one-node deploy paths for development and testing
35
+- Tightened lifecycle cleanup for `pve-guests-state` legacy systemd units and suppressed `thunderbolts` recovery noise on hosts without `bolt.service`
36
+
37
+---
38
+
39
+## [2025-10-30]
40
+
41
+### Fixed
42
+- [ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved
43
+  - **Root cause**: `systemctl restart networking` resets MTU because systemd services don't re-trigger
44
+  - **Solution**: Hybrid approach with udev rule enhancement + post-up hooks
45
+  - **Changes**: Updated udev rules and interfaces.d configs on all nodes (baobab, ebony, tapia)
46
+  - **Testing**: Verified MTU 65520 persists after networking restart on all nodes
47
+
48
+### Added
49
+- Issue tracking system in `cluster/issues/` directory
50
+- CHANGELOG.md for documenting all cluster changes with issue references
51
+- Template for issue documentation (`issues/TEMPLATE.md`)
52
+- First documented issue: ISSUE-2025-001 regarding thunderbolt MTU reset problem
53
+- Added `scripts/check_mcluster_network.sh` for cluster thunderbridge and network health checks (table output, ping tests from localhost and baobab).
54
+
55
+### Changed
56
+- Removed codebase-specific references from `madagascar.json` to keep it cluster-focused
57
+
58
+---
59
+
60
+## [2025-10-19]
61
+
62
+### Added
63
+- PBS (Proxmox Backup Server) configuration to `madagascar.json`
64
+  - andrafiabe-AutoNAS (192.168.2.96)
65
+  - anjothibe-AutoNAS (192.168.2.95)
66
+- Node roles (primary/secondary) to cluster configuration
67
+
68
+---
69
+
70
+## [2025-10-18]
71
+
72
+### Added
73
+- Initial `madagascar.json` cluster cache file
74
+- Cluster network documentation (thunderbolt bridge configuration)
75
+- WAN configuration for all nodes (vmbr443, vmbr444)
76
+- Node-specific network information (baobab, ebony, tapia)
77
+- `madagascar-changelog.json` for automation-triggered changes
78
+- `README_madagascar_cache.md` with file contract documentation
79
+
80
+### Infrastructure
81
+- Thunderbolt bridge (thunderbridge) on 192.168.10.0/24 with MTU 65520
82
+- WAN bridges on 192.168.2.0/24 (vmbr443) and 192.168.4.0/24 (vmbr444)
83
+
84
+---
85
+
86
+## Format Guidelines
87
+
88
+### Categories
89
+- **Added** - new features, files, or configurations
90
+- **Changed** - changes to existing functionality or configuration
91
+- **Deprecated** - features or configurations that will be removed
92
+- **Removed** - removed features or configurations
93
+- **Fixed** - bug fixes (always reference issue number)
94
+- **Security** - security-related changes
95
+
96
+### Entry Format
97
+```
98
+- Brief description [ISSUE-YYYY-NNN] (optional details)
99
+```
100
+
101
+### Issue References
102
+Always link changes to issues when applicable:
103
+- Bug fixes must reference the issue
104
+- New features should reference planning/feature issues
105
+- Configuration changes should reference related issues or RFCs
+64 -0
cluster-context/README.md
@@ -0,0 +1,64 @@
1
+Madagascar cluster context files
2
+
3
+Purpose
4
+
5
+These files provide a shared cluster-context cache and changelog for Madagascar. Other projects can read or append to these files to share knowledge about cluster layout, network configuration, and changes that may affect deployments.
6
+
7
+Files
8
+
9
+- `madagascar.json` - primary cache. Contains a schemaVersion, lastUpdated, source, and a `clusters` map keyed by cluster name. Each cluster can include hosts, network file paths, services and notes.
10
+
11
+- `madagascar-changelog.json` - append-only changelog. Contains an `entries` array. Each entry should include: `id`, `timestamp` (ISO 8601 UTC), `project`, `author`, `summary`, `details`, `affectedResources` (array), and `type` (info|change|breaking|deprecated).
12
+- `history/` - historical snapshots that are useful for reference but are not the current source of truth.
13
+
14
+Contract (madagascar.json)
15
+
16
+- schemaVersion: string
17
+- lastUpdated: ISO 8601 timestamp in UTC
18
+- source: project name that last updated the file
19
+- clusters: map of cluster objects. Cluster object fields:
20
+  - name: cluster name
21
+  - hosts: map of role->hostname or role->fqdn
22
+  - network: optional map with keys `interfacesFile` and `interfacesD` (relative paths)
23
+  - services: optional map of service name -> { enabled: bool, systemdUnit: path }
24
+  - notes: optional string
25
+
26
+Changelog entry contract (madagascar-changelog.json)
27
+
28
+- entries: array of objects, each with:
29
+  - id: unique id string (recommend prefix: project-YYYYMMDD-HHMM)
30
+  - timestamp: ISO 8601 UTC
31
+  - project: project name making the change
32
+  - author: author name or automation id
33
+  - summary: short summary
34
+  - details: longer description
35
+  - affectedResources: array of strings (paths or logical names)
36
+  - type: one of info|change|breaking|deprecated
37
+
38
+How to update
39
+
40
+Manual append example (bash + jq):
41
+
42
+```bash
43
+# create new entry JSON
44
+entry=$(jq -n --arg id "entry-$(date -u +%Y%m%d%H%M%S)" \
45
+  --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
46
+  --arg project "mysvc" \
47
+  --arg author "$USER" \
48
+  --arg summary "Updated network config" \
49
+  --arg details "Added new interface route needed by Madagascar" \
50
+  '{id: $id, timestamp: $ts, project: $project, author: $author, summary: $summary, details: $details, affectedResources:["network/interfaces"], type: "change"}')
51
+
52
+# append atomically
53
+jq --argjson e "$entry" '.entries += [$e]' cluster-context/madagascar-changelog.json > cluster-context/madagascar-changelog.json.tmp && mv cluster-context/madagascar-changelog.json.tmp cluster-context/madagascar-changelog.json
54
+```
55
+
56
+Automation guidance
57
+
58
+- Prefer creating unique `id` values (project prefix + timestamp + random suffix).
59
+- When automation updates `cluster-context/madagascar.json`, also add a changelog entry.
60
+- Keep `cluster-context/madagascar.json` small — only cache what's necessary.
61
+
62
+Notes
63
+
64
+- These files are meant to be shared between related projects. Treat `cluster-context/madagascar-changelog.json` as append-only; prefer appending rather than rewriting history.
+234 -0
cluster-context/history/2026-03-07-trixie-pve9-upgrade-journal.md
@@ -0,0 +1,234 @@
1
+# 2026-03-07 Trixie / Proxmox VE 9 Upgrade Journal
2
+
3
+## Scope
4
+
5
+Upgrade and recovery journal for the Madagascar cluster nodes:
6
+
7
+- `tapia`
8
+- `ebony`
9
+- `baobab`
10
+
11
+All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (`trixie`) / Proxmox VE 9.1.
12
+
13
+## Common Pattern Observed
14
+
15
+The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in `apt` or `dpkg`.
16
+
17
+Recurring issues:
18
+
19
+- EFI fallback binaries under `EFI/BOOT` were inconsistent across nodes.
20
+- Boot order could still point to a non-Proxmox path even when the `proxmox` entry existed.
21
+- Some systems still had `systemd-boot` style fallback artifacts while the host had moved to GRUB + `proxmox-boot-tool`.
22
+- Testing was complicated by slow shutdowns and, in one case, missing hardware during boot.
23
+
24
+## Node Journal
25
+
26
+### tapia
27
+
28
+Initial symptoms:
29
+
30
+- Upgrade to `trixie` completed, but the node no longer booted normally.
31
+- UEFI shell could see the ESP and Proxmox EFI payloads.
32
+- Launching `\EFI\proxmox\grubx64.efi` initially dropped back into BIOS settings.
33
+- Later, after loader repair, boot worked on the old kernel first.
34
+
35
+Findings:
36
+
37
+- The system had moved to Debian 13 and Proxmox VE 9 packages correctly.
38
+- GRUB and EFI files existed, but the boot path was inconsistent after the upgrade.
39
+- `EFI/proxmox/grub.cfg` on `tapia` had drifted from the standard Proxmox ESP stub and referenced Btrfs directly.
40
+- `AutoNAS` also produced noisy failed units for unmanaged boot disk UUIDs.
41
+
42
+Fixes applied:
43
+
44
+- Offline disk repair on another node.
45
+- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
46
+- Ran:
47
+  - `update-initramfs -u -k 6.8.12-19-pve`
48
+  - `update-grub`
49
+  - `proxmox-boot-tool refresh`
50
+  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
51
+- Restored `EFI/proxmox/grub.cfg` to the standard Proxmox ESP stub:
52
+  - `search.fs_uuid <ESP>`
53
+  - `set prefix=($root)/grub`
54
+  - `configfile $prefix/grub.cfg`
55
+- Later confirmed `6.17.13-1-pve` boots correctly and made it the default again.
56
+- Deployed an `AutoNAS` fix so unmanaged UUIDs are ignored instead of failing `autonas-attach@...` units.
57
+
58
+Final state:
59
+
60
+- Running `6.17.13-1-pve`
61
+- `systemctl --failed` empty
62
+- Boot default set to `6.17.13-1-pve`
63
+
64
+### ebony
65
+
66
+Initial symptoms:
67
+
68
+- Upgrade completed cleanly, but the node did not return after reboot.
69
+- UEFI fallback could boot `memtest`, but Proxmox GRUB payloads returned to BIOS settings.
70
+- After EFI repair, boot progressed to:
71
+  - `Loading Linux...`
72
+  - `Loading initial ramdisk...`
73
+  and then stopped.
74
+
75
+Findings:
76
+
77
+- The fallback `EFI/BOOT/BOOTX64.EFI` was not aligned with the Proxmox boot chain and could route to memtest.
78
+- GRUB loader repair was required.
79
+- During one boot attempt, the NVMe device was physically absent; this caused the post-kernel boot stall and initially looked like a kernel/initramfs failure.
80
+- Once hardware was restored, the newer kernel booted successfully.
81
+
82
+Fixes applied:
83
+
84
+- Offline disk repair on another node.
85
+- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery.
86
+- Ran:
87
+  - `update-initramfs -u -k 6.8.12-19-pve`
88
+  - `update-grub`
89
+  - `proxmox-boot-tool refresh`
90
+  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
91
+- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi` payload and synchronized the other fallback EFI files.
92
+- After the node booted with full hardware present, validated `6.17.13-1-pve` and set it as the default.
93
+- Fixed stale `AutoNAS` export behavior by cleaning marked exports whose paths do not exist yet at boot.
94
+
95
+Final state:
96
+
97
+- Running `6.17.13-1-pve`
98
+- `systemctl --failed` empty
99
+- `AutoNAS-1` and `AutoNAS-2` active
100
+- Boot default set to `6.17.13-1-pve`
101
+
102
+### baobab
103
+
104
+Initial symptoms:
105
+
106
+- Upgrade completed cleanly, but the node failed to return after reboot.
107
+- Before recovery, fallback `BOOTX64.EFI` was still a small `systemd-boot` style binary instead of the Proxmox shim.
108
+- The node eventually required offline repair from another machine.
109
+
110
+Findings:
111
+
112
+- Package state was healthy; the failure was again in the EFI/boot path.
113
+- `BootOrder` needed to prioritize the `proxmox` entry.
114
+- `EFI/BOOT/BOOTX64.EFI` needed to point into the Proxmox chain, not the old fallback path.
115
+
116
+Fixes applied:
117
+
118
+- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` for the first stable boot after upgrade.
119
+- Corrected `BootOrder` so `proxmox` is first.
120
+- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi`.
121
+- Offline repair after the failed reboot:
122
+  - `fsck.vfat -a` on the ESP
123
+  - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck`
124
+  - `update-grub`
125
+  - `proxmox-boot-tool refresh`
126
+- Fixed remaining failed units unrelated to the OS upgrade:
127
+  - `rc-local.service` now ignores missing optional disks instead of failing
128
+  - removed orphan `discover_vms.service` and `discover_vms.timer`
129
+
130
+Final state:
131
+
132
+- Running `6.8.12-19-pve`
133
+- `systemctl --failed` empty
134
+- Boot default left on `6.8.12-19-pve` as the conservative stable choice
135
+
136
+## AutoNAS Follow-up
137
+
138
+Two AutoNAS issues were identified and fixed during the upgrade recovery:
139
+
140
+1. `attach-deferred` could fail for disks with UUIDs that are not managed by AutoNAS.
141
+   - Fix: return success for unmanaged UUIDs so `systemd` does not mark the unit failed.
142
+
143
+2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet.
144
+   - Fix: remove AutoNAS-marked exports with missing paths during boot cleanup, then let normal mount/export flow recreate them when the disk is available.
145
+
146
+Both fixes were deployed to:
147
+
148
+- `baobab`
149
+- `ebony`
150
+- `tapia`
151
+
152
+## Recovery Commands That Proved Useful
153
+
154
+Most effective recovery sequence when a node no longer boots after the upgrade:
155
+
156
+1. Move the system disk to another node.
157
+2. Mount root and ESP.
158
+3. Force a known-good kernel in `/etc/default/grub`.
159
+4. Run:
160
+
161
+```bash
162
+update-initramfs -u -k <known-good-kernel>
163
+update-grub
164
+proxmox-boot-tool refresh
165
+grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck
166
+```
167
+
168
+5. Verify:
169
+
170
+```bash
171
+efibootmgr -v
172
+proxmox-boot-tool status
173
+ls -l /boot/efi/EFI/proxmox
174
+ls -l /boot/efi/EFI/BOOT
175
+```
176
+
177
+6. If needed, replace `EFI/BOOT/BOOTX64.EFI` with Proxmox `shimx64.efi`.
178
+
179
+## Recommended Post-Upgrade Checklist
180
+
181
+Before rebooting a node after the Debian 13 / PVE 9 upgrade:
182
+
183
+1. Confirm package state is clean:
184
+   - `dpkg --audit`
185
+   - `apt-get -s full-upgrade`
186
+2. Refresh boot assets:
187
+   - `update-grub`
188
+   - `proxmox-boot-tool refresh`
189
+3. Verify EFI layout:
190
+   - `efibootmgr -v`
191
+   - `proxmox-boot-tool status`
192
+   - `EFI/proxmox/grub.cfg` should be the standard ESP stub
193
+   - `EFI/BOOT/BOOTX64.EFI` should route into the Proxmox chain, not an old `systemd-boot` or memtest fallback
194
+4. Suspend guests manually before reboot:
195
+   - run `/usr/local/sbin/pgs suspend -v`
196
+   - do not rely on legacy `systemd` automation for guest suspend/resume
197
+   - otherwise `pve-guests.service` can stall shutdown while waiting for VMs/CTs to stop
198
+5. Verify all expected storage hardware is physically present before reboot.
199
+6. Keep one older known-good kernel available in GRUB until the new kernel is validated on that node.
200
+
201
+## Operational Note: Reboot Discipline
202
+
203
+During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through `pgs`.
204
+
205
+Observed effect:
206
+
207
+- `pve-guests.service` remained in `deactivating`
208
+- shutdown took a very long time
209
+- guest stop operations had to be forced manually
210
+- this obscured boot diagnostics and made the recovery look worse than the underlying boot issue
211
+
212
+Operational rule going forward:
213
+
214
+1. Before any planned node reboot for maintenance, run:
215
+
216
+```bash
217
+/usr/local/sbin/pgs suspend -v
218
+```
219
+
220
+2. Reboot only after guest suspend/shutdown has completed.
221
+3. After the node or cluster is back in a stable state, run:
222
+
223
+```bash
224
+/usr/local/sbin/pgs resume -v
225
+```
226
+
227
+## Outcome
228
+
229
+The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes.
230
+
231
+Main lesson:
232
+
233
+- the risky part of this upgrade was not package dependency resolution
234
+- it was EFI and boot chain consistency after the transition to Debian 13 / Proxmox VE 9
+86 -0
cluster-context/history/madagascar-cluster-summary.json
@@ -0,0 +1,86 @@
1
+{
2
+  "cluster": {
3
+    "name": "Madagascar",
4
+    "topology": {
5
+      "thunderbolt_chain": [
6
+        "ebony",
7
+        "baobab",
8
+        "tapia"
9
+      ]
10
+    },
11
+    "networks": {
12
+      "cluster_network": {
13
+        "name": "thunderbridge",
14
+        "cidr": "192.168.10.0/24",
15
+        "bridges": [
16
+          { "host": "baobab", "bridge_id": "8000.02ff1918f13a" },
17
+          { "host": "ebony",  "bridge_id": "8000.02518abab22f" },
18
+          { "host": "tapia",  "bridge_id": "8000.0231336db4df" }
19
+        ],
20
+        "hosts": [
21
+          {
22
+            "hostname": "ebony",
23
+            "ip": "192.168.10.92",
24
+            "interfaces": { "thunderbolt0_mac": "02:51:8a:ba:b2:2f" }
25
+          },
26
+          {
27
+            "hostname": "baobab",
28
+            "ip": "192.168.10.91",
29
+            "interfaces": {
30
+              "thunderbolt0_mac": "02:ff:19:18:f1:3a",
31
+              "thunderbolt1_mac": "02:ee:dc:db:e6:0b"
32
+            }
33
+          },
34
+          {
35
+            "hostname": "tapia",
36
+            "ip": "192.168.10.93",
37
+            "interfaces": { "thunderbolt0_mac": "02:31:33:6d:b4:df" }
38
+          }
39
+        ]
40
+      },
41
+      "internet_network": {
42
+        "name": "vmbr443",
43
+        "cidr": "192.168.2.0/24",
44
+        "notes": "VM-urile ies în internet prin aceste bridge-uri",
45
+        "hosts": [
46
+          {
47
+            "hostname": "ebony",
48
+            "ip": "192.168.2.92",
49
+            "underlay": "eno1.443",
50
+            "bridge_mac": "1c:69:7a:ab:26:2f"
51
+          },
52
+          {
53
+            "hostname": "baobab",
54
+            "ip": "192.168.2.91",
55
+            "underlay": "enp86s0.443",
56
+            "bridge_mac": "48:21:0b:60:9f:ab"
57
+          },
58
+          {
59
+            "hostname": "tapia",
60
+            "ip": "192.168.2.93",
61
+            "underlay": "eno1.443",
62
+            "bridge_mac": "1c:69:7a:aa:e3:5d"
63
+          }
64
+        ]
65
+      }
66
+    },
67
+    "services": {
68
+      "pbs": [
69
+        {
70
+          "name": "anjohibe",
71
+          "role": "proxmox-backup-server",
72
+          "hypervisor_host": "ebony",
73
+          "ips": { "internet": "192.168.2.95", "cluster": "192.168.10.95" },
74
+          "nas_virtual_ip": "192.168.10.21"
75
+        },
76
+        {
77
+          "name": "andrafiabe",
78
+          "role": "proxmox-backup-server",
79
+          "hypervisor_host": "tapia",
80
+          "ips": { "internet": "192.168.2.96", "cluster": "192.168.10.96" },
81
+          "nas_virtual_ip": "192.168.10.22"
82
+        }
83
+      ]
84
+    }
85
+  }
86
+}
+241 -0
cluster-context/history/madagascar-hardware.json
@@ -0,0 +1,241 @@
1
+{
2
+  "hardware": {
3
+    "baobab": {
4
+      "system": {
5
+        "manufacturer": "Intel(R) Client Systems",
6
+        "product_name": "NUC13ANHi7",
7
+        "version": "M89903-208",
8
+        "serial_number": "BTAN344005DW",
9
+        "family": "AN",
10
+        "sku": "NUC13ANHi7000",
11
+        "smbios_version": "3.5.0"
12
+      },
13
+      "bios": {
14
+        "vendor": "Intel Corp.",
15
+        "version": "ANRPL357.0038.2025.0416.1002",
16
+        "release_date": "2025-04-16",
17
+        "bios_revision": "5.27",
18
+        "firmware_revision": "10.23"
19
+      },
20
+      "cpu": {
21
+        "model": "13th Gen Intel(R) Core(TM) i7-1360P",
22
+        "family": 6,
23
+        "model_id": 186,
24
+        "stepping": 2,
25
+        "cores": 12,
26
+        "threads": 16,
27
+        "max_speed_mhz": 5000,
28
+        "l3_cache_mb": 18
29
+      },
30
+      "memory": {
31
+        "installed_total_gb": 64,
32
+        "max_capacity_reported_gb": 64,
33
+        "modules": [
34
+          {
35
+            "locator": "Controller0-ChannelA-DIMM0",
36
+            "size_gb": 32,
37
+            "type": "DDR4",
38
+            "speed_mtps": 2667,
39
+            "manufacturer": "Corsair",
40
+            "part_number": "CMSX64GX4M2A2666C18",
41
+            "rank": 2,
42
+            "voltage_v": 1.2
43
+          },
44
+          {
45
+            "locator": "Controller1-ChannelA-DIMM0",
46
+            "size_gb": 32,
47
+            "type": "DDR4",
48
+            "speed_mtps": 2667,
49
+            "manufacturer": "Corsair",
50
+            "part_number": "CMSX64GX4M2A2666C18",
51
+            "rank": 2,
52
+            "voltage_v": 1.2
53
+          }
54
+        ]
55
+      },
56
+      "storage": {
57
+        "nvme_controllers": [
58
+          "Samsung SM981/PM981/PM983 (01:00.0)"
59
+        ],
60
+        "m2_slots": [
61
+          { "designation": "M2_A", "pcie": "x4 Gen3", "status": "in_use" },
62
+          { "designation": "M2_B", "pcie": "x4 Gen3", "status": "in_use" }
63
+        ]
64
+      },
65
+      "gpu": "Intel Raptor Lake-P [Iris Xe Graphics] (rev 04)",
66
+      "network_controllers": [
67
+        "Intel Ethernet Controller I226-V (rev 04)",
68
+        "Intel Raptor Lake PCH CNVi WiFi (rev 01)"
69
+      ],
70
+      "thunderbolt": {
71
+        "generation": "Thunderbolt 4",
72
+        "controllers": [
73
+          "Raptor Lake-P Thunderbolt 4 USB Controller",
74
+          "Raptor Lake-P Thunderbolt 4 NHI (x2)"
75
+        ],
76
+        "mac_addresses": [
77
+          "02:ff:19:18:f1:3a",
78
+          "02:ee:dc:db:e6:0b"
79
+        ]
80
+      },
81
+      "tpm": {
82
+        "vendor_id": "INTC",
83
+        "spec_version": "2.0",
84
+        "firmware_revision": "600.18"
85
+      }
86
+    },
87
+    "ebony": {
88
+      "system": {
89
+        "manufacturer": "Intel(R) Client Systems",
90
+        "product_name": "NUC10i7FNH",
91
+        "version": "M38010-308",
92
+        "serial_number": "G6FN135001U0",
93
+        "family": "FN",
94
+        "sku": "BXNUC10i7FNHN",
95
+        "smbios_version": "3.3.0"
96
+      },
97
+      "bios": {
98
+        "vendor": "Intel Corp.",
99
+        "version": "FNCML357.0066.2024.1011.0925",
100
+        "release_date": "2024-10-11",
101
+        "bios_revision": "5.16",
102
+        "firmware_revision": "3.12"
103
+      },
104
+      "cpu": {
105
+        "model": "Intel(R) Core(TM) i7-10710U",
106
+        "family": 6,
107
+        "model_id": 166,
108
+        "stepping": 0,
109
+        "cores": 6,
110
+        "threads": 12,
111
+        "max_speed_mhz": 4700,
112
+        "l3_cache_mb": 12
113
+      },
114
+      "memory": {
115
+        "installed_total_gb": 64,
116
+        "max_capacity_reported_gb": 32,
117
+        "modules": [
118
+          {
119
+            "locator": "SODIMM1",
120
+            "size_gb": 32,
121
+            "type": "DDR4",
122
+            "speed_mtps": 2667,
123
+            "manufacturer": "029E",
124
+            "part_number": "CMSX64GX4M2A2666C18",
125
+            "rank": 2,
126
+            "voltage_v": 1.2
127
+          },
128
+          {
129
+            "locator": "SODIMM2",
130
+            "size_gb": 32,
131
+            "type": "DDR4",
132
+            "speed_mtps": 2667,
133
+            "manufacturer": "029E",
134
+            "part_number": "CMSX64GX4M2A2666C18",
135
+            "rank": 2,
136
+            "voltage_v": 1.2
137
+          }
138
+        ]
139
+      },
140
+      "storage": {
141
+        "nvme_controllers": [
142
+          "Samsung PM9A1/PM9A3/980PRO (3a:00.0)"
143
+        ]
144
+      },
145
+      "gpu": "Intel Comet Lake UHD Graphics (rev 04)",
146
+      "network_controllers": [
147
+        "Intel Ethernet Connection (10) I219-V",
148
+        "Intel Comet Lake PCH-LP CNVi WiFi (onboard, poate fi dezactivat)"
149
+      ],
150
+      "thunderbolt": {
151
+        "generation": "Thunderbolt 3",
152
+        "controllers": [
153
+          "Intel JHL7540 Titan Ridge (NHI)",
154
+          "Intel JHL7540 Titan Ridge USB Controller"
155
+        ],
156
+        "mac_addresses": [
157
+          "02:51:8a:ba:b2:2f"
158
+        ]
159
+      }
160
+    },
161
+    "tapia": {
162
+      "system": {
163
+        "manufacturer": "Intel(R) Client Systems",
164
+        "product_name": "NUC10i7FNH",
165
+        "version": "M38010-308",
166
+        "serial_number": "G6FN135001AK",
167
+        "family": "FN",
168
+        "sku": "BXNUC10i7FNH",
169
+        "smbios_version": "3.3.0"
170
+      },
171
+      "bios": {
172
+        "vendor": "Intel Corp.",
173
+        "version": "FNCML357.0066.2024.1011.0925",
174
+        "release_date": "2024-10-11",
175
+        "bios_revision": "5.16",
176
+        "firmware_revision": "3.12"
177
+      },
178
+      "cpu": {
179
+        "model": "Intel(R) Core(TM) i7-10710U",
180
+        "family": 6,
181
+        "model_id": 166,
182
+        "stepping": 0,
183
+        "cores": 6,
184
+        "threads": 12,
185
+        "max_speed_mhz": 4700,
186
+        "l3_cache_mb": 12
187
+      },
188
+      "memory": {
189
+        "installed_total_gb": 64,
190
+        "max_capacity_reported_gb": 32,
191
+        "modules": [
192
+          {
193
+            "locator": "SODIMM1",
194
+            "size_gb": 32,
195
+            "type": "DDR4",
196
+            "speed_mtps": 2667,
197
+            "manufacturer": "029E",
198
+            "part_number": "CMSX64GX4M2A2666C18",
199
+            "rank": 2,
200
+            "voltage_v": 1.2
201
+          },
202
+          {
203
+            "locator": "SODIMM2",
204
+            "size_gb": 32,
205
+            "type": "DDR4",
206
+            "speed_mtps": 2667,
207
+            "manufacturer": "029E",
208
+            "part_number": "CMSX64GX4M2A2666C18",
209
+            "rank": 2,
210
+            "voltage_v": 1.2
211
+          }
212
+        ]
213
+      },
214
+      "storage": {
215
+        "nvme_controllers": [
216
+          "Samsung SM981/PM981/PM983 (3a:00.0)"
217
+        ]
218
+      },
219
+      "gpu": "Intel Comet Lake UHD Graphics (rev 04)",
220
+      "network_controllers": [
221
+        "Intel Ethernet Connection (10) I219-V",
222
+        "Intel Comet Lake PCH-LP CNVi WiFi"
223
+      ],
224
+      "thunderbolt": {
225
+        "generation": "Thunderbolt 3",
226
+        "controllers": [
227
+          "Intel JHL7540 Titan Ridge (NHI)",
228
+          "Intel JHL7540 Titan Ridge USB Controller"
229
+        ],
230
+        "mac_addresses": [
231
+          "02:31:33:6d:b4:df"
232
+        ]
233
+      },
234
+      "tpm": {
235
+        "vendor_id": "CTNI",
236
+        "spec_version": "2.0",
237
+        "firmware_revision": "500.14"
238
+      }
239
+    }
240
+  }
241
+}
+98 -0
cluster-context/history/madagascar-network.json
@@ -0,0 +1,98 @@
1
+{
2
+  "cluster": {
3
+    "networks": {
4
+      "fabric": {
5
+        "type": "thunderbolt",
6
+        "bridge": "thunderbridge",
7
+        "cidr": "192.168.10.0/24",
8
+        "topology": ["ebony", "baobab", "tapia"],
9
+        "used_by_vms": true
10
+      },
11
+      "internet": {
12
+        "bridge": "vmbr443",
13
+        "vlan": 443,
14
+        "cidr": "192.168.2.0/24",
15
+        "used_by_vms": true
16
+      }
17
+    }
18
+  },
19
+  "hosts": [
20
+    {
21
+      "host": "baobab",
22
+      "network": {
23
+        "thunderbridge": {
24
+          "ipv4": "192.168.10.91/24",
25
+          "bridge_id": "8000.02ff1918f13a",
26
+          "thunderbolt_macs": ["02:ff:19:18:f1:3a", "02:ee:dc:db:e6:0b"]
27
+        },
28
+        "vmbr443": {
29
+          "ipv4": "192.168.2.91/24",
30
+          "mac": "48:21:0b:60:9f:ab",
31
+          "bridge_id": "8000.48210b609fab",
32
+          "uplink": "enp86s0.443",
33
+          "stp": false
34
+        }
35
+      }
36
+    },
37
+    {
38
+      "host": "ebony",
39
+      "network": {
40
+        "thunderbridge": {
41
+          "ipv4": "192.168.10.92/24",
42
+          "bridge_id": "8000.02518abab22f",
43
+          "thunderbolt_macs": ["02:51:8a:ba:b2:2f"]
44
+        },
45
+        "vmbr443": {
46
+          "ipv4": "192.168.2.92/24",
47
+          "mac": "1c:69:7a:ab:26:2f",
48
+          "bridge_id": "8000.1c697aab262f",
49
+          "uplink": "eno1.443",
50
+          "stp": false
51
+        }
52
+      }
53
+    },
54
+    {
55
+      "host": "tapia",
56
+      "network": {
57
+        "thunderbridge": {
58
+          "ipv4": "192.168.10.93/24",
59
+          "bridge_id": "8000.0231336db4df",
60
+          "thunderbolt_macs": ["02:31:33:6d:b4:df"]
61
+        },
62
+        "vmbr443": {
63
+          "ipv4": "192.168.2.93/24",
64
+          "mac": "1c:69:7a:aa:e3:5d",
65
+          "bridge_id": "8000.1c697aaae35d",
66
+          "uplink": "eno1.443",
67
+          "stp": false
68
+        }
69
+      }
70
+    }
71
+  ],
72
+  "services": {
73
+    "pbs": [
74
+      {
75
+        "name": "anjohibe",
76
+        "role": "proxmox-backup-server",
77
+        "type": "vm",
78
+        "host": "ebony",
79
+        "network": {
80
+          "thunderbridge": { "ipv4": "192.168.10.95/24" },
81
+          "vmbr443": { "ipv4": "192.168.2.95/24" }
82
+        },
83
+        "virtual_nas": { "ipv4": "192.168.10.21" }
84
+      },
85
+      {
86
+        "name": "andrafiabe",
87
+        "role": "proxmox-backup-server",
88
+        "type": "vm",
89
+        "host": "tapia",
90
+        "network": {
91
+          "thunderbridge": { "ipv4": "192.168.10.96/24" },
92
+          "vmbr443": { "ipv4": "192.168.2.96/24" }
93
+        },
94
+        "virtual_nas": { "ipv4": "192.168.10.22" }
95
+      }
96
+    ]
97
+  }
98
+}
+117 -0
cluster-context/madagascar.json
@@ -0,0 +1,117 @@
1
+{
2
+  "schemaVersion": "1.0",
3
+  "lastUpdated": "2025-10-19T00:00:00Z",
4
+  "description": "Cluster configuration for Madagascar Proxmox cluster",
5
+  "clusters": {
6
+    "madagascar": {
7
+      "name": "madagascar",
8
+      "description": "Proxmox VE cluster with 3 nodes: baobab, ebony, tapia",
9
+      "pveVersion": "8.x",
10
+      "pbsServers": [
11
+        {
12
+          "name": "andrafiabe-AutoNAS",
13
+          "ip": "192.168.2.96",
14
+          "hostname": "andrafiabe.madagascar.xdev.ro",
15
+          "repo": "backup",
16
+          "prunePolicy": "keep-all=1"
17
+        },
18
+        {
19
+          "name": "anjothibe-AutoNAS",
20
+          "ip": "192.168.2.95",
21
+          "hostname": "anjothibe.madagascar.xdev.ro",
22
+          "repo": "backup",
23
+          "prunePolicy": "keep-all=1"
24
+        }
25
+      ],
26
+      "lastUpdated": "2025-10-19T00:00:00Z",
27
+      "nodes": {
28
+        "baobab": {
29
+          "name": "baobab",
30
+          "role": "primary",
31
+          "wan": {
32
+            "vmbr443": {
33
+              "address": "192.168.2.91/24",
34
+              "gateway": "192.168.2.1"
35
+            },
36
+            "vmbr444": {
37
+              "address": "192.168.4.91/24"
38
+            }
39
+          },
40
+          "network": {
41
+            "thunderbridge": {
42
+              "bridge": "thunderbridge",
43
+              "address": "192.168.10.91/24",
44
+              "mtu": 65520
45
+            }
46
+          },
47
+          "services": {
48
+            "tb-bridge": {
49
+              "enabled": true
50
+            }
51
+          },
52
+          "notes": "Node entry populated from local deploy layout"
53
+        },
54
+        "ebony": {
55
+          "name": "ebony",
56
+          "role": "secondary",
57
+          "wan": {
58
+            "vmbr443": {
59
+              "address": "192.168.2.92/24",
60
+              "gateway": "192.168.2.1"
61
+            },
62
+            "vmbr444": {
63
+              "address": "192.168.4.92/24"
64
+            }
65
+          },
66
+          "network": {
67
+            "thunderbridge": {
68
+              "bridge": "thunderbridge",
69
+              "address": "192.168.10.92/24",
70
+              "mtu": 65520
71
+            }
72
+          }
73
+        },
74
+        "tapia": {
75
+          "name": "tapia",
76
+          "role": "secondary",
77
+          "wan": {
78
+            "vmbr443": {
79
+              "address": "192.168.2.93/24",
80
+              "gateway": "192.168.2.1"
81
+            },
82
+            "vmbr444": {
83
+              "address": "192.168.4.93/24"
84
+            }
85
+          },
86
+          "network": {
87
+            "thunderbridge": {
88
+              "bridge": "thunderbridge",
89
+              "address": "192.168.10.93/24",
90
+              "mtu": 65520
91
+            }
92
+          }
93
+        }
94
+      }
95
+    }
96
+  },
97
+  "clusterNetwork": {
98
+    "thunderbolt": {
99
+      "description": "Cluster thunderbolt bridge configuration",
100
+      "bridge": "thunderbridge",
101
+      "cidr": "192.168.10.0/24",
102
+      "mtu": 65520,
103
+      "dns": "192.168.2.2",
104
+      "nodes": {
105
+        "baobab": {
106
+          "address": "192.168.10.91/24"
107
+        },
108
+        "ebony": {
109
+          "address": "192.168.10.92/24"
110
+        },
111
+        "tapia": {
112
+          "address": "192.168.10.93/24"
113
+        }
114
+      }
115
+    }
116
+  }
117
+}
+190 -0
projects/README.md
@@ -0,0 +1,190 @@
1
+# Madagascar Cluster Projects
2
+
3
+Acest director este punctul unic de lucru pentru proiectele cluster-level actuale si viitoare.
4
+
5
+## Baza de referinta
6
+
7
+Workflow-ul de install, uninstall si reinstall documentat aici este bazat pe implementarea cea mai completa existenta in `autoNAS`.
8
+
9
+Referinte principale:
10
+- `cluster/projects/autoNAS/README.md`
11
+- `cluster/projects/autoNAS/DEVELOPMENT.md`
12
+- `cluster/projects/autoNAS/scripts/install.sh`
13
+- `cluster/projects/autoNAS/scripts/autonas-uninstall.sh`
14
+
15
+Observatie importanta:
16
+- `autoNAS` confirma workflow-ul corect de uninstall-inainte-de-reinstall si curatare a fisierelor orfane
17
+- `autoNAS` nu este inca aliniat complet la noua regula de locatie pentru comenzi operator-facing, deoarece instaleaza in prezent in `/usr/local/bin`
18
+- pentru proiectele noi, regula ramane `/usr/local/sbin`; `autoNAS` trebuie tratat ca precedent functional pentru workflow, nu ca standard final de layout
19
+
20
+## Namespace de organizatie
21
+
22
+Pentru claritate si evitarea coliziunilor intre proiecte, toate locatiile standard trebuie namespaced cu identificatorul de organizatie:
23
+
24
+- `xdev`
25
+
26
+Regula generala este:
27
+- folosim `<project-name>` pentru identitatea proiectului
28
+- folosim `xdev` in calea de instalare pentru fisiere interne, configuratie, date si documentatie
29
+
30
+## Reguli generale
31
+
32
+- Toate proiectele noi se creeaza sub `cluster/projects/<project-name>`.
33
+- Proiectele se deschid si se mentin din `cluster`, pentru a reduce divergenta intre workspace-uri si duplicarea documentatiei sau scripturilor.
34
+- Fiecare proiect trebuie sa aiba cel putin:
35
+  - `README.md`
36
+  - script de instalare
37
+  - script de dezinstalare
38
+  - instructiuni de operare si upgrade
39
+
40
+## Locatii well-known obligatorii
41
+
42
+Instalarile trebuie sa foloseasca locatii predictibile si stabile:
43
+
44
+- executabile si scripturi operator-facing: `/usr/local/sbin`
45
+- binare sau scripturi interne ale proiectului: `/usr/local/lib/xdev/<project-name>`
46
+- documentatie instalata pe host: `/usr/local/share/doc/xdev/<project-name>`
47
+- fisiere de configurare persistente: `/etc/xdev/<project-name>`
48
+- environment defaults: `/etc/default/xdev-<project-name>`
49
+- unitati systemd: `/etc/systemd/system`
50
+- stare persistenta si date operationale: `/var/lib/xdev/<project-name>`
51
+- cache temporar: `/var/cache/xdev/<project-name>` daca este necesar
52
+- loguri dedicate pe disc, daca proiectul chiar le scrie in fisier: `/var/log/xdev/<project-name>`
53
+
54
+Regula practica:
55
+- daca un operator trebuie sa ruleze comanda direct, ea merge in `/usr/local/sbin`
56
+- daca fisierul este suport intern pentru proiect, el merge in `/usr/local/lib/xdev/<project-name>`
57
+- daca fisierul este documentatie instalata local pentru host, el merge in `/usr/local/share/doc/xdev/<project-name>`
58
+- daca fisierul reprezinta configuratie editabila, el merge in `/etc/xdev/<project-name>` sau `/etc/default/xdev-<project-name>`
59
+- daca fisierul reprezinta stare, baza locala, lock, snapshot sau alta data operationala, el merge in `/var/lib/xdev/<project-name>`
60
+
61
+## Locatia standard pentru scripturile de dezinstalare
62
+
63
+Locatia standard canonica pentru scriptul de dezinstalare instalat pe host este:
64
+
65
+- `/usr/local/lib/xdev/<project-name>/uninstall.sh`
66
+
67
+Motivatie:
68
+- uninstall-ul este in primul rand parte din mecanismul intern de lifecycle al proiectului
69
+- trebuie sa poata fi apelat de installer pentru cleanup automat inainte de reinstall
70
+- trebuie versionat impreuna cu restul fisierelor interne ale proiectului
71
+- evita aglomerarea inutila a `/usr/local/sbin` cu scripturi care nu sunt folosite frecvent in operare zilnica
72
+
73
+Regula de naming:
74
+- scriptul canonic instalat pe host se numeste `uninstall.sh`
75
+- directorul proiectului da contextul complet: `/usr/local/lib/xdev/<project-name>/uninstall.sh`
76
+
77
+Expunere optionala pentru operator:
78
+- daca vrem o comanda manuala simpla si predictibila, se poate instala un wrapper sau symlink in:
79
+  - `/usr/local/sbin/xdev-<project-name>-uninstall`
80
+- acest wrapper trebuie sa apeleze scriptul canonic din `/usr/local/lib/xdev/<project-name>/uninstall.sh`
81
+- wrapperul din `/usr/local/sbin` este optional; scriptul canonic din `/usr/local/lib/xdev/<project-name>/` este obligatoriu
82
+
83
+## Instalare si dezinstalare
84
+
85
+- Orice instalare trebuie sa fie insotita de un script de dezinstalare livrat de acelasi proiect.
86
+- Scriptul de dezinstalare instalat pe host trebuie sa existe la `/usr/local/lib/xdev/<project-name>/uninstall.sh`.
87
+- Scriptul de dezinstalare trebuie sa elimine toate fisierele instalate de proiect:
88
+  - executabile
89
+  - fisiere din `/usr/local/lib/xdev/<project-name>`
90
+  - documentatie din `/usr/local/share/doc/xdev/<project-name>`
91
+  - unitati systemd
92
+  - fisiere de configurare generate de proiect, daca sunt gestionate exclusiv de el, din `/etc/xdev/<project-name>` sau `/etc/default/xdev-<project-name>`
93
+  - directoare de stare, date sau cache create de proiect, daca nu contin date care trebuie pastrate explicit, din `/var/lib/xdev/<project-name>` sau `/var/cache/xdev/<project-name>`
94
+- Scopul este prevenirea fisierelor orfane si a reinstalarilor peste artefacte ramase din versiuni anterioare.
95
+
96
+## Regula de reinstall
97
+
98
+- Toate reinstalarile se fac numai dupa dezinstalare completa.
99
+- Dezinstalarea se face numai cu scriptul original de uninstall al proiectului, nu prin stergeri manuale partiale.
100
+- Fluxul obligatoriu este:
101
+
102
+```text
103
+uninstall -> verificare curatare -> install
104
+```
105
+
106
+- Nu se face reinstall direct peste o instalare existenta, chiar daca pare functionala.
107
+- Daca scriptul de uninstall lipseste, instalarea proiectului este incompleta si trebuie corectata inainte de orice upgrade sau reinstall.
108
+
109
+## Cerinte pentru proiectele noi
110
+
111
+Fiecare proiect nou trebuie sa includa explicit:
112
+
113
+1. un `install` care foloseste locatiile well-known
114
+2. un `uninstall` care inverseaza complet instalarea
115
+3. un `README.md` cu:
116
+   - layout-ul fisierelor instalate
117
+   - comenzile de instalare
118
+   - comenzile de dezinstalare
119
+   - pasii de reinstall
120
+   - locatia uninstall-ului instalat pe host: `/usr/local/lib/xdev/<project-name>/uninstall.sh`
121
+   - locatiile pentru configuratie, documentatie si date
122
+4. daca exista systemd:
123
+   - `daemon-reload` la install si uninstall
124
+   - enable/disable/stop clar definite
125
+   - la deployment, serviciile si timer-ele care trebuie sa ramana active se pornesc cu `systemctl enable --now`, nu doar cu `enable`
126
+
127
+## Aplicare la proiectele existente
128
+
129
+Proiectele deja mutate sub `cluster/projects/` trebuie aliniate progresiv la aceste reguli.
130
+
131
+Prioritati:
132
+- confirmarea unui script de uninstall pentru fiecare proiect
133
+- standardizarea instalarii in `/usr/local/sbin` si `/usr/local/lib/xdev/<project-name>`
134
+- eliminarea reinstalarilor facute peste fisiere existente
135
+
136
+## Lectii confirmate in autoNAS
137
+
138
+Problemele deja identificate si rezolvate in `autoNAS`, care trebuie considerate reguli pentru proiectele viitoare:
139
+
140
+- reinstalarile peste versiuni vechi lasa fisiere orfane daca nu exista cleanup explicit
141
+- instalarea trebuie sa poata rula cleanup de versiune anterioara inainte de install
142
+- uninstaller-ul trebuie instalat pe host pentru a permite cleanup corect la upgrade sau reinstall
143
+- uninstall-ul trebuie sa curete agresiv fisierele istorice ramase din versiuni mai vechi
144
+- config-ul utilizatorului trebuie pastrat cand contine date reale, nu sters orbeste
145
+- serviciile systemd trebuie oprite, dezactivate, sterse si urmate de `daemon-reload`
146
+- la deployment, un serviciu necesar in productie nu trebuie lasat doar `enabled`; se foloseste `enable --now` pentru a evita deploy-uri cu servicii instalate dar nepornite
147
+- unele resurse necesita cleanup manual explicit daca pot contine date operationale, de exemplu exports NFS sau mount points active
148
+
149
+Fluxul validat de `autoNAS` este:
150
+
151
+```text
152
+detect previous install -> run original uninstall -> clean orphan files -> install new version -> preserve user data where required
153
+```
154
+
155
+## Regula operationala
156
+
157
+Cand se modifica un proiect existent sau se adauga unul nou, se actualizeaza si documentatia proiectului astfel incat procedura de:
158
+
159
+- install
160
+- uninstall
161
+- reinstall
162
+
163
+sa fie explicita, repetabila si fara artefacte ramase pe host.
164
+
165
+## Deploy cluster-wide
166
+
167
+Pentru rollout final pe cluster nu facem deploy nod cu nod manual daca proiectul este destinat cluster-wide.
168
+
169
+Regula practica este:
170
+- fiecare proiect trebuie sa pastreze si varianta pe un singur nod pentru development si testing
171
+- pentru rollout cluster-wide se foloseste orchestratorul comun din radacina:
172
+  - `cluster/scripts/deploy-project.sh <project-name>`
173
+
174
+Sursa de adevar pentru noduri:
175
+- `cluster/cluster-context/madagascar.json`
176
+
177
+Exemple:
178
+
179
+```bash
180
+./scripts/deploy-project.sh pve-guests-state
181
+./scripts/deploy-project.sh pve-net-hang-watchdog
182
+./scripts/deploy-project.sh pve-backup-scheduler
183
+./scripts/deploy-project.sh autoNAS
184
+./scripts/deploy-project.sh pve-guests-state install --node ebony
185
+```
186
+
187
+Cerinta pentru proiecte:
188
+- proiectele noi trebuie sa ofere fie `setup.sh`, fie `deploy.sh`
189
+- `setup.sh` ramane entrypoint-ul standard pentru install/uninstall pe un singur nod
190
+- orchestratorul comun decide nodurile pe baza `cluster-context/madagascar.json` si ruleaza proiectul pe toate tintele selectate
+1 -0
projects/autoNAS
@@ -0,0 +1 @@
1
+Subproject commit d426b0effcb2e2195b7c6742718037862bd15767
BIN
projects/autoSMART/.DS_Store
Binary file not shown.
+45 -0
projects/autoSMART/.deployignore
@@ -0,0 +1,45 @@
1
+# Exclude these files from deployment
2
+
3
+# Development metadata
4
+**/.metadata/**
5
+**/.settings/**
6
+**/Release/**
7
+**/Debug/**
8
+
9
+# OS files
10
+**/.DS_Store
11
+**/Thumbs.db
12
+
13
+# Version control
14
+**/.git/**
15
+**/.svn/**
16
+
17
+# IDE files
18
+**/.project
19
+**/.cproject
20
+**/.classpath
21
+
22
+# Temporary files
23
+**/tmp/**
24
+**/temp/**
25
+
26
+# Large binaries
27
+**/*.bin
28
+**/*.elf
29
+# **/*.rpm  # Commented out to allow offline packages
30
+**/*.o
31
+
32
+# Offline packages (comment out the line below to include packages in deployment)
33
+# packages/**
34
+
35
+# Other projects not related to autoSMART
36
+configi/**
37
+raduin/**
38
+radion/**
39
+linux/**
40
+ipconfig/**
41
+autoNAS/**
42
+VariaMediaDump/**
43
+Madagascar/**
44
+RemoteSystemsTempFiles/**
45
+
+144 -0
projects/autoSMART/DEBUG_RESOLUTION_REPORT.md
@@ -0,0 +1,144 @@
1
+# autoSMART Debug Resolution Report
2
+## Date: 2025-08-16
3
+
4
+### Issues Identified and Resolved
5
+
6
+#### ❌ Issue 1: Empty hdd_presence table
7
+**Problem**: Table `hdd_presence` was empty despite collector running
8
+**Root Causes**:
9
+1. SMART parameter parsing regex was incorrect for new smartctl format
10
+2. Database permission issues for sequence access
11
+3. Missing fields in smart_readings INSERT
12
+
13
+#### ✅ Solutions Implemented
14
+
15
+##### 1. Enhanced Debug Logging in smart-collector-daemon.pl
16
+- Added comprehensive debug logging throughout the collection process
17
+- Enhanced `get_or_create_hdd()` function with detailed presence tracking logs
18
+- Added device scanning and SMART parsing debug information
19
+- Added database connectivity testing in debug mode
20
+
21
+##### 2. Fixed SMART Parameter Parsing
22
+**Before**: Only supported old format
23
+```perl
24
+elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\w+\s+\w+\s+\w+\s+(\d+)/) {
25
+```
26
+
27
+**After**: Supports both old and new smartctl formats
28
+```perl
29
+elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)/) {
30
+    # New format: ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
31
+```
32
+
33
+##### 3. Fixed Database Schema Permissions
34
+**Problem**: `permission denied for sequence hdd_presence_id_seq`
35
+**Solution**: Added proper sequence permissions
36
+```sql
37
+GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO autosmart;
38
+```
39
+
40
+##### 4. Fixed smart_readings INSERT Statement
41
+**Before**: Missing required NOT NULL fields
42
+```perl
43
+INSERT INTO smart_readings (hdd_id, timestamp, temperature, parameters_json, reading_type)
44
+```
45
+
46
+**After**: Complete field list
47
+```perl
48
+INSERT INTO smart_readings (hdd_id, serial_number, device_path, node_id, timestamp, temperature, parameters_json, reading_type)
49
+```
50
+
51
+##### 5. Enhanced Configuration Preservation
52
+**Problem**: Install script overwrote existing `/etc/default/autosmart` configuration
53
+**Solution**: Implemented configuration merging in install.sh
54
+- Backup existing configuration with timestamp
55
+- Parse existing key-value pairs
56
+- Merge with new defaults while preserving user settings
57
+- Log preserved/added settings
58
+
59
+```bash
60
+# Backup existing configuration
61
+cp "/etc/default/autosmart" "/etc/default/autosmart.backup.$(date +%Y%m%d_%H%M%S)"
62
+
63
+# Read and preserve existing settings
64
+declare -A existing_config
65
+while IFS='=' read -r key value; do
66
+    if [[ $key =~ ^[A-Z_]+$ ]] && [[ -n $value ]]; then
67
+        value=$(echo "$value" | sed 's/^"//;s/"$//')
68
+        existing_config["$key"]="$value"
69
+    fi
70
+done < "/etc/default/autosmart"
71
+```
72
+
73
+### Testing Results
74
+
75
+#### ✅ Successful Data Collection
76
+```
77
+[DEBUG] Found model: ST4000VN006-3CW104
78
+[DEBUG] Found serial: ZW60K01R
79
+[DEBUG] SMART param (new format): Raw_Read_Error_Rate = 1176
80
+[DEBUG] SMART param (new format): Start_Stop_Count = 2300
81
+[DEBUG] Parsed device data - Model: ST4000VN006-3CW104, Serial: ZW60K01R, Temperature: 44, Parameters: 25
82
+[DEBUG] Created new hdd_presence record with id=2 for serial=ZW60K01R node=Bogdans-MacBook-Pro
83
+✓ SMART reading stored (ID: 18, temp: 44°C, type: full)
84
+```
85
+
86
+#### ✅ Database Population Confirmed
87
+```sql
88
+-- hdd_presence table
89
+ id | serial_number  |        node         |         data_start         |          data_end          | is_current 
90
+----+----------------+---------------------+----------------------------+----------------------------+------------
91
+  1 | S2HSNXRH402205 | Bogdans-MacBook-Pro | 2025-08-16 21:47:13.078524 | 2025-08-16 21:48:23.357763 | t
92
+  2 | ZW60K01R       | Bogdans-MacBook-Pro | 2025-08-16 21:47:13.873642 | 2025-08-16 21:48:24.204347 | t
93
+
94
+-- smart_readings summary
95
+ total_readings | unique_devices 
96
+----------------+----------------
97
+             16 |              2
98
+```
99
+
100
+### Configuration Management
101
+
102
+#### ✅ Debug Mode Activation
103
+```bash
104
+# Enable debug mode
105
+AUTOSMART_DEBUG="true"
106
+
107
+# Configuration preserved across deployments
108
+[INFO] ✓ Preserved existing setting: AUTOSMART_DEBUG="true"
109
+[INFO] ✓ Configuration merged successfully
110
+```
111
+
112
+### Deployment Process
113
+
114
+All fixes deployed successfully using:
115
+```bash
116
+./deploy.sh install ebony
117
+```
118
+
119
+### Files Modified
120
+
121
+1. **scripts/smart-collector-daemon.pl**
122
+   - Enhanced debug logging
123
+   - Fixed SMART parameter parsing regex
124
+   - Fixed smart_readings INSERT statement
125
+   - Added comprehensive error handling
126
+
127
+2. **scripts/install.sh**
128
+   - Implemented configuration preservation
129
+   - Added backup functionality
130
+   - Enhanced user setting migration
131
+
132
+3. **sql/schema-fixed.sql**
133
+   - Added proper sequence permissions
134
+
135
+### Summary
136
+
137
+The autoSMART system now successfully:
138
+- ✅ Detects and parses SMART data from all device types
139
+- ✅ Populates hdd_presence table with mobility tracking
140
+- ✅ Stores complete SMART readings with all metadata
141
+- ✅ Preserves user configuration across deployments
142
+- ✅ Provides comprehensive debug logging for troubleshooting
143
+
144
+All identified issues have been resolved and the system is ready for production use across the Madagascar cluster.
+1 -0
projects/autoSMART/Madagascar
@@ -0,0 +1 @@
1
+/Users/bogdan/Documents/Workspaces/Xdev/Madagascar
+0 -0
projects/autoSMART/README.md
No changes.
+19 -0
projects/autoSMART/cluster.json
@@ -0,0 +1,19 @@
1
+{
2
+  "cluster": {
3
+    "name": "madagascar",
4
+    "nodes": [
5
+      {
6
+        "hostname": "ebony",
7
+        "ip": "192.168.2.92"
8
+      },
9
+      {
10
+        "hostname": "baobab", 
11
+        "ip": "192.168.2.91"
12
+      },
13
+      {
14
+        "hostname": "tapia",
15
+        "ip": "192.168.2.94"
16
+      }
17
+    ]
18
+  }
19
+}
+5 -0
projects/autoSMART/config/autosmart-defaults.conf
@@ -0,0 +1,5 @@
1
+# AutoSMART Configuration
2
+# This file is sourced by AutoSMART scripts to set default behavior
3
+# Debug mode - set to "true" to enable verbose logging
4
+# When enabled, all AutoSMART operations will produce detailed debug output
5
+AUTOSMART_DEBUG="false"
+13 -0
projects/autoSMART/config/cluster-ebony.conf
@@ -0,0 +1,13 @@
1
+[database]
2
+host = 192.168.2.102
3
+port = 5432
4
+name = autosmart
5
+user = autosmart
6
+password = autoSMART2025!
7
+
8
+[collection]
9
+interval = 1800
10
+timeout = 60
11
+
12
+[node]
13
+id = ebony
+88 -0
projects/autoSMART/config/cluster.conf
@@ -0,0 +1,88 @@
1
+# autoSMART Cluster Configuration
2
+# Location: /etc/pve/autoSMART/cluster.conf
3
+# This file is shared across all Proxmox cluster nodes
4
+
5
+[cluster]
6
+# Cluster identification
7
+cluster_name = proxmox-cluster-main
8
+cluster_id = pve-cluster-001
9
+nodes = node91,node92,node93
10
+
11
+# Database configuration (shared cluster database)
12
+[database]
13
+host = 192.168.2.91
14
+port = 5432
15
+database = autosmart_cluster
16
+username = autosmart_cluster
17
+password = cluster_secure_password_here
18
+connection_timeout = 30
19
+pool_size = 10
20
+
21
+# OpenAI configuration (shared API key)
22
+[openai]
23
+api_key = your_cluster_openai_api_key_here
24
+model = gpt-4
25
+max_tokens = 1500
26
+temperature = 0.3
27
+rate_limit_delay = 2
28
+
29
+# Madagascar inventory integration
30
+[madagascar]
31
+inventory_path = /etc/pve/autoSMART/madagascar_inventory.json
32
+update_interval = 3600
33
+sync_across_nodes = true
34
+
35
+# Cluster-wide SMART monitoring parameters
36
+[smart_parameters]
37
+# Critical parameters (high weight for AI analysis)
38
+Reallocated_Sector_Ct = 1,10.0,true,Critical reallocated sectors
39
+Reallocated_Event_Count = 1,9.0,true,Reallocation events
40
+Current_Pending_Sector = 1,9.5,true,Pending sector reallocation
41
+Offline_Uncorrectable = 1,10.0,true,Uncorrectable sectors
42
+UDMA_CRC_Error_Count = 10,5.0,true,Communication errors
43
+Spin_Retry_Count = 1,8.0,true,Spindle motor retries
44
+
45
+# Important parameters (medium weight)
46
+Raw_Read_Error_Rate = 100000,3.0,true,Raw read errors
47
+Seek_Error_Rate = 100000,4.0,true,Seek operation errors
48
+Load_Cycle_Count = 100000,2.0,true,Head load cycles
49
+Power_On_Hours = 35000,2.0,true,Power-on time
50
+Temperature_Celsius = 50,3.0,true,Operating temperature
51
+
52
+# Monitoring parameters (low weight)
53
+Start_Stop_Count = 10000,1.0,true,Start/stop cycles
54
+Power_Cycle_Count = 10000,1.0,true,Power cycles
55
+Command_Timeout = 100,2.0,true,Command timeouts
56
+High_Fly_Writes = 1,4.0,true,Head fly height issues
57
+Airflow_Temperature_Cel = 45,1.5,true,Airflow temperature
58
+
59
+# Cluster-wide alert settings
60
+[alerts]
61
+email_enabled = true
62
+email_smtp_server = mail.domain.com
63
+email_smtp_port = 587
64
+email_username = autosmart@domain.com
65
+email_password = email_password_here
66
+email_recipients = admin@domain.com,ops@domain.com
67
+email_critical_only = false
68
+
69
+# Risk level alert thresholds
70
+alert_critical_immediate = true
71
+alert_high_delay_minutes = 30
72
+alert_moderate_delay_hours = 4
73
+alert_low_daily_summary = true
74
+
75
+# Data retention (cluster-wide policy)
76
+[retention]
77
+smart_readings_days = 365
78
+predictions_days = 180
79
+alerts_days = 90
80
+cleanup_interval_hours = 24
81
+
82
+# Cluster synchronization
83
+[synchronization]
84
+node_discovery_interval = 300
85
+health_check_interval = 60
86
+failover_enabled = true
87
+backup_nodes = node92,node93
88
+primary_node = node91
+0 -0
projects/autoSMART/config/cluster.json
No changes.
+30 -0
projects/autoSMART/config/database.conf
@@ -0,0 +1,30 @@
1
+# autoSMART Database Configuration
2
+# PostgreSQL connection settings
3
+
4
+[database]
5
+host = localhost
6
+port = 5432
7
+database = autosmart
8
+username = autosmart_user
9
+password = secure_password_here
10
+schema = smart_monitoring
11
+
12
+# Connection pool settings
13
+max_connections = 20
14
+connection_timeout = 30
15
+query_timeout = 60
16
+
17
+# Data retention policies
18
+retention_raw_data = 365        # days to keep raw SMART readings
19
+retention_predictions = 180     # days to keep AI predictions
20
+retention_alerts = 90           # days to keep alert history
21
+
22
+# Backup settings
23
+backup_enabled = true
24
+backup_schedule = "0 2 * * *"   # Daily at 2 AM
25
+backup_retention = 30           # days to keep backups
26
+
27
+[performance]
28
+batch_insert_size = 1000
29
+vacuum_schedule = "0 3 * * 0"   # Weekly vacuum
30
+analyze_schedule = "0 4 * * *"  # Daily analyze
+29 -0
projects/autoSMART/config/debug-ebony.sh
@@ -0,0 +1,29 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Debug Configuration for ebony
4
+export AUTOSMART_DEBUG=3
5
+export AUTOSMART_NODE_ID="ebony"
6
+export AUTOSMART_CLUSTER_CONFIG="/etc/pve/autoSMART/config/cluster.conf"
7
+
8
+# Database configuration
9
+export AUTOSMART_DB_HOST="192.168.2.102"
10
+export AUTOSMART_DB_USER="autosmart"
11
+export AUTOSMART_DB_PASS="autoSMART2025!"
12
+export AUTOSMART_DB_NAME="autosmart"
13
+export AUTOSMART_DB_PORT="5432"
14
+
15
+# Collection settings
16
+export SMART_COLLECTION_ENABLED="true"
17
+export MIGRATION_DETECTION_ENABLED="true"
18
+export DIFFERENTIAL_STORAGE_ENABLED="true"
19
+
20
+# Debug logging
21
+export AUTOSMART_LOG_LEVEL="DEBUG"
22
+export AUTOSMART_LOG_TO_SYSLOG="true"
23
+
24
+echo "autoSMART debug environment configured:"
25
+echo "  Node: $AUTOSMART_NODE_ID"
26
+echo "  Database: $AUTOSMART_DB_HOST:$AUTOSMART_DB_PORT/$AUTOSMART_DB_NAME"
27
+echo "  User: $AUTOSMART_DB_USER"
28
+echo "  Debug Level: $AUTOSMART_DEBUG"
29
+echo ""
+107 -0
projects/autoSMART/config/default
@@ -0,0 +1,107 @@
1
+# autoSMART Local Configuration
2
+# Location: /etc/default/autosmart
3
+# This file contains node-specific settings and debug flags
4
+
5
+# Node identification
6
+AUTOSMART_NODE_ID="$(hostname)"
7
+AUTOSMART_CLUSTER_CONFIG="/etc/pve/autoSMART/cluster.conf"
8
+
9
+# Debug settings
10
+AUTOSMART_DEBUG_ENABLED=false
11
+AUTOSMART_DEBUG_LEVEL=1          # 0=none, 1=basic, 2=verbose, 3=trace
12
+AUTOSMART_DEBUG_LOG_FILE="/var/log/autosmart/debug.log"
13
+AUTOSMART_DEBUG_MAX_SIZE="100M"
14
+AUTOSMART_DEBUG_ROTATE_COUNT=5
15
+
16
+# Local logging
17
+AUTOSMART_LOG_ENABLED=true
18
+AUTOSMART_LOG_LEVEL="info"       # debug, info, warn, error
19
+AUTOSMART_LOG_FILE="/var/log/autosmart/autosmart.log"
20
+AUTOSMART_LOG_SYSLOG=true
21
+AUTOSMART_LOG_FACILITY="daemon"
22
+
23
+# Collection settings (can override cluster defaults)
24
+AUTOSMART_COLLECTION_INTERVAL=300  # seconds (5 minutes)
25
+AUTOSMART_COLLECTION_TIMEOUT=30    # seconds
26
+AUTOSMART_COLLECTION_RETRIES=3
27
+AUTOSMART_COLLECTION_PARALLEL=true
28
+
29
+# Local storage paths
30
+AUTOSMART_PID_FILE="/var/run/autosmart.pid"
31
+AUTOSMART_LOCK_FILE="/var/lock/autosmart.lock"
32
+AUTOSMART_CACHE_DIR="/var/cache/autosmart"
33
+AUTOSMART_TEMP_DIR="/tmp/autosmart"
34
+
35
+# Process management
36
+AUTOSMART_DAEMON_USER="autosmart"
37
+AUTOSMART_DAEMON_GROUP="autosmart"
38
+AUTOSMART_MAX_MEMORY="256M"
39
+AUTOSMART_NICE_LEVEL=10
40
+
41
+# Local device discovery
42
+AUTOSMART_DEVICE_SCAN_ENABLED=true
43
+AUTOSMART_DEVICE_SCAN_PATHS="/dev/sd* /dev/nvme*"
44
+AUTOSMART_DEVICE_EXCLUDE_PATTERNS="loop*,dm-*,sr*"
45
+AUTOSMART_DEVICE_CACHE_TTL=3600    # seconds
46
+
47
+# Network settings
48
+AUTOSMART_BIND_ADDRESS="0.0.0.0"
49
+AUTOSMART_BIND_PORT=0              # 0 = disable local API
50
+AUTOSMART_CLUSTER_TIMEOUT=10       # seconds
51
+AUTOSMART_CLUSTER_RETRIES=2
52
+
53
+# Performance tuning
54
+AUTOSMART_WORKER_THREADS=4
55
+AUTOSMART_QUEUE_SIZE=1000
56
+AUTOSMART_BATCH_SIZE=10
57
+AUTOSMART_RATE_LIMIT_ENABLED=true
58
+AUTOSMART_RATE_LIMIT_REQUESTS=60   # per minute
59
+
60
+# Security
61
+AUTOSMART_SECURE_MODE=true
62
+AUTOSMART_SSL_VERIFY=true
63
+AUTOSMART_PERMISSIONS_CHECK=true
64
+AUTOSMART_CONFIG_VALIDATION=true
65
+
66
+# Emergency settings
67
+AUTOSMART_EMERGENCY_STOP_FILE="/etc/autosmart/EMERGENCY_STOP"
68
+AUTOSMART_SAFE_MODE_ENABLED=true
69
+AUTOSMART_RECOVERY_MODE=false
70
+
71
+# Development/Testing flags (production should be false)
72
+AUTOSMART_DEVELOPMENT_MODE=false
73
+AUTOSMART_MOCK_SMARTCTL=false
74
+AUTOSMART_MOCK_DATABASE=false
75
+AUTOSMART_MOCK_OPENAI=false
76
+AUTOSMART_TEST_MODE=false
77
+
78
+# Feature toggles
79
+AUTOSMART_FEATURE_AI_PREDICTIONS=true
80
+AUTOSMART_FEATURE_EMAIL_ALERTS=true
81
+AUTOSMART_FEATURE_CLUSTER_SYNC=true
82
+AUTOSMART_FEATURE_AUTO_DISCOVERY=true
83
+AUTOSMART_FEATURE_HEALTH_CHECKS=true
84
+
85
+# Compatibility settings
86
+AUTOSMART_LEGACY_SUPPORT=false
87
+AUTOSMART_STRICT_MODE=true
88
+AUTOSMART_BACKWARD_COMPATIBILITY=false
89
+
90
+# Monitoring and health checks
91
+AUTOSMART_HEALTH_CHECK_ENABLED=true
92
+AUTOSMART_HEALTH_CHECK_INTERVAL=60    # seconds
93
+AUTOSMART_HEALTH_CHECK_TIMEOUT=5     # seconds
94
+AUTOSMART_METRICS_ENABLED=true
95
+AUTOSMART_METRICS_PORT=9090
96
+
97
+# Resource limits
98
+AUTOSMART_MAX_OPEN_FILES=1024
99
+AUTOSMART_MAX_PROCESSES=50
100
+AUTOSMART_MEMORY_LIMIT="512M"
101
+AUTOSMART_CPU_LIMIT=80               # percentage
102
+
103
+# Maintenance
104
+AUTOSMART_AUTO_CLEANUP=true
105
+AUTOSMART_CLEANUP_INTERVAL=86400     # daily
106
+AUTOSMART_VACUUM_DATABASE=true
107
+AUTOSMART_OPTIMIZE_INTERVAL=604800   # weekly
+50 -0
projects/autoSMART/config/openai.conf
@@ -0,0 +1,50 @@
1
+# autoSMART OpenAI Configuration
2
+# AI prediction engine settings
3
+
4
+[openai]
5
+# API Configuration
6
+api_key = sk-your-openai-api-key-here
7
+api_endpoint = https://api.openai.com/v1
8
+model = gpt-4
9
+max_tokens = 2048
10
+temperature = 0.1              # Low temperature for consistent predictions
11
+
12
+# Request limits and retry
13
+max_requests_per_hour = 100
14
+retry_attempts = 3
15
+retry_delay = 5                # seconds between retries
16
+request_timeout = 60           # seconds
17
+
18
+[prediction]
19
+# Prediction parameters
20
+prediction_window_days = 30    # Predict failures within 30 days
21
+confidence_threshold = 0.7     # Minimum confidence for alerts
22
+historical_data_days = 90      # Use 90 days of historical data
23
+minimum_readings = 10          # Minimum readings before prediction
24
+
25
+# AI prompt configuration
26
+system_prompt = "You are an expert HDD failure prediction system. Analyze SMART data and provide failure probability with reasoning."
27
+include_context = true         # Include disk model, age, environment
28
+include_trends = true          # Include trend analysis in prompts
29
+
30
+[analysis]
31
+# Analysis frequency
32
+full_analysis_hours = 24       # Full AI analysis every 24 hours
33
+quick_check_hours = 6          # Quick check every 6 hours
34
+emergency_check_minutes = 30   # Emergency analysis for critical values
35
+
36
+# Batch processing
37
+batch_size = 10                # Analyze 10 disks per batch
38
+batch_delay = 2                # seconds between batch requests
39
+
40
+[features]
41
+# Feature engineering for AI
42
+enable_trend_analysis = true
43
+enable_anomaly_detection = true
44
+enable_correlation_analysis = true
45
+enable_environmental_factors = true
46
+
47
+# Advanced features
48
+enable_model_specific_analysis = true  # Different analysis per HDD model
49
+enable_failure_clustering = true       # Group similar failure patterns
50
+enable_seasonal_adjustment = true      # Account for seasonal temperature changes
+57 -0
projects/autoSMART/config/smart.conf
@@ -0,0 +1,57 @@
1
+# autoSMART SMART Parameters Configuration
2
+# Defines which SMART parameters to monitor and their thresholds
3
+
4
+[monitoring]
5
+# Collection interval in seconds
6
+collection_interval = 300      # 5 minutes
7
+collection_timeout = 30        # 30 seconds timeout per disk
8
+
9
+# Madagascar integration
10
+madagascar_inventory_file = /etc/madagascar/disk_inventory.json
11
+madagascar_api_endpoint = http://madagascar.local/api/v1/disks
12
+
13
+[smart_parameters]
14
+# Format: parameter_name = threshold,weight,enabled,description
15
+
16
+# Critical parameters (high weight, immediate attention)
17
+Raw_Read_Error_Rate = 100000,0.9,true,"Raw read error rate from disk surface"
18
+Reallocated_Sector_Ct = 5,0.95,true,"Count of reallocated sectors"
19
+Current_Pending_Sector = 1,0.9,true,"Count of sectors waiting for reallocation"
20
+Offline_Uncorrectable = 1,0.95,true,"Count of uncorrectable sectors"
21
+UDMA_CRC_Error_Count = 100,0.7,true,"Count of UDMA CRC errors"
22
+
23
+# Important parameters (medium weight)
24
+Spin_Retry_Count = 3,0.8,true,"Count of spin-up retry attempts"
25
+End-to-End_Error = 1,0.8,true,"End-to-end error detection count"
26
+Reported_Uncorrect = 1,0.85,true,"Count of uncorrectable errors reported"
27
+High_Fly_Writes = 1,0.7,true,"Count of high fly write operations"
28
+Airflow_Temperature_Cel = 50,0.6,true,"Temperature of airflow in Celsius"
29
+
30
+# Monitoring parameters (lower weight, trending)
31
+Temperature_Celsius = 55,0.6,true,"Drive temperature in Celsius"
32
+Power_On_Hours = 43800,0.4,true,"Total power-on hours (5 years)"
33
+Load_Cycle_Count = 300000,0.5,true,"Count of load/unload cycles"
34
+Start_Stop_Count = 10000,0.4,true,"Count of start/stop cycles"
35
+Power_Cycle_Count = 10000,0.4,true,"Count of power-on cycles"
36
+
37
+# Performance parameters (informational)
38
+Seek_Error_Rate = 100000,0.3,true,"Rate of seek errors"
39
+Throughput_Performance = 80,0.3,true,"Overall throughput performance"
40
+Spin_Up_Time = 10000,0.4,true,"Time required to spin up"
41
+
42
+[thresholds]
43
+# Global threshold multipliers
44
+temperature_warning = 0.9      # Warning at 90% of threshold
45
+temperature_critical = 1.0     # Critical at 100% of threshold
46
+sector_warning = 0.5           # Warning at 50% of threshold
47
+sector_critical = 1.0          # Critical at 100% of threshold
48
+
49
+# Trend analysis
50
+trend_window_hours = 168       # 7 days for trend analysis
51
+trend_deviation_threshold = 2.0 # Standard deviations for anomaly
52
+
53
+[exclusions]
54
+# Disk models/serials to exclude from monitoring
55
+exclude_models = "Virtual,QEMU,VMware"
56
+exclude_serials = ""
57
+exclude_by_size_gb = 8         # Exclude disks smaller than 8GB
+489 -0
projects/autoSMART/deploy.sh
@@ -0,0 +1,489 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Cluster Deployment Script
4
+# Version: 1.0  
5
+# Description: Complete cluster deployment and node installation for autoSMART
6
+
7
+set -e
8
+
9
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
10
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
11
+INSTALL_DIR="/opt/autoSMART"
12
+CONFIG_DIR="/etc/autosmart"
13
+SERVICE_NAME="autosmart"
14
+
15
+# Default configuration
16
+DB_HOST="${DB_HOST:-192.168.2.102}"
17
+DB_USER="${DB_USER:-autosmart}"
18
+DB_PASS="${DB_PASS:-autoSMART2025!}"
19
+DB_NAME="${DB_NAME:-autosmart}"
20
+
21
+# Node configuration
22
+NODE_ID="${NODE_ID:-$(hostname -s)}"
23
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
24
+
25
+# Operation modes
26
+FORCE_REINSTALL=false
27
+CONFIG_ONLY=false
28
+DATABASE_MODE=false
29
+
30
+# Colors for output
31
+RED='\033[0;31m'
32
+GREEN='\033[0;32m'
33
+YELLOW='\033[1;33m'
34
+BLUE='\033[0;34m'
35
+NC='\033[0m' # No Color
36
+
37
+log_info() {
38
+    echo -e "${BLUE}[INFO]${NC} $1"
39
+}
40
+
41
+log_success() {
42
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
43
+}
44
+
45
+log_warning() {
46
+    echo -e "${YELLOW}[WARNING]${NC} $1"
47
+}
48
+
49
+log_error() {
50
+    echo -e "${RED}[ERROR]${NC} $1"
51
+}
52
+
53
+show_usage() {
54
+    echo "autoSMART Cluster Deployment Script v1.0"
55
+    echo "========================================="
56
+    echo ""
57
+    echo "Usage: $0 [COMMAND] [IP_ADDRESS] [OPTIONS]"
58
+    echo ""
59
+    echo "Commands:"
60
+    echo "  install [IP]          Install autoSMART (local or remote node)"
61
+    echo "  install database      Install database schema remotely using psql"
62
+    echo "  uninstall [IP]        Remove autoSMART (local or remote node)"
63
+    echo "  status [IP]           Show autoSMART status (local or remote node)"
64
+    echo ""
65
+    echo "Cluster Options:"
66
+    echo "  --cluster             Execute command on entire cluster"
67
+    echo ""
68
+    echo "Database Options (for 'install database'):"
69
+    echo "  --db-host HOST        Database host (default: 192.168.2.102)"
70
+    echo "  --db-user USER        Database user (default: autosmart)"
71
+    echo "  --db-pass PASS        Database password (default: autoSMART2025!)"
72
+    echo "  --db-name NAME        Database name (default: autosmart)"
73
+    echo ""
74
+    echo "Examples:"
75
+    echo "  $0 install <node>                    # Install on a node (name or IP from cluster.json)"
76
+    echo "  $0 install database                  # Install database schema"
77
+    echo "  $0 status <node>                     # Check status on a node (name or IP from cluster.json)"
78
+    echo "  $0 install --cluster                 # Install on entire cluster"
79
+    echo "  $0 status --cluster                  # Check status on all nodes"
80
+}
81
+
82
+parse_arguments() {
83
+    COMMAND=""
84
+    TARGET_IP=""
85
+    CLUSTER_MODE=false
86
+    DATABASE_MODE=false
87
+    
88
+    # If no arguments provided, show help
89
+    if [[ $# -eq 0 ]]; then
90
+        show_usage
91
+        exit 0
92
+    fi
93
+    
94
+    while [[ $# -gt 0 ]]; do
95
+        case $1 in
96
+            install|uninstall|status)
97
+                COMMAND="$1"
98
+                shift
99
+                ;;
100
+            database)
101
+                if [[ "$COMMAND" == "install" ]]; then
102
+                    DATABASE_MODE=true
103
+                    shift
104
+                else
105
+                    log_error "database can only be used with install command"
106
+                    exit 1
107
+                fi
108
+                ;;
109
+            --help)
110
+                show_usage
111
+                exit 0
112
+                ;;
113
+            --cluster)
114
+                CLUSTER_MODE=true
115
+                shift
116
+                ;;
117
+            --db-host)
118
+                DB_HOST="$2"
119
+                shift 2
120
+                ;;
121
+            --db-user)
122
+                DB_USER="$2"
123
+                shift 2
124
+                ;;
125
+            --db-pass)
126
+                DB_PASS="$2"
127
+                shift 2
128
+                ;;
129
+            --db-name)
130
+                DB_NAME="$2"
131
+                shift 2
132
+                ;;
133
+            --*)
134
+                log_error "Unknown option: $1"
135
+                exit 1
136
+                ;;
137
+            *)
138
+                if [[ $1 =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
139
+                    TARGET_IP="$1"
140
+                    shift
141
+                else
142
+                    # Try to resolve node name from cluster.json
143
+                    local cluster_config="$SCRIPT_DIR/cluster.json"
144
+                    if [[ -f "$cluster_config" ]] && command -v jq &> /dev/null; then
145
+                        local resolved_ip=$(jq -r --arg name "$1" '.cluster.nodes[] | select(.hostname==$name) | .ip' "$cluster_config")
146
+                        if [[ -n "$resolved_ip" && "$resolved_ip" != "null" ]]; then
147
+                            TARGET_IP="$resolved_ip"
148
+                            shift
149
+                        else
150
+                            log_error "Unknown argument: $1 (not an IP or known node name)"
151
+                            exit 1
152
+                        fi
153
+                    else
154
+                        log_error "Unknown argument: $1"
155
+                        exit 1
156
+                    fi
157
+                fi
158
+                ;;
159
+        esac
160
+    done
161
+    
162
+    # Validate that a command was provided
163
+    if [[ -z "$COMMAND" ]]; then
164
+        log_error "No command specified"
165
+        show_usage
166
+        exit 1
167
+    fi
168
+    
169
+    if [[ "$CLUSTER_MODE" == true ]]; then
170
+        TARGET_IP=""
171
+    fi
172
+}
173
+
174
+show_header() {
175
+    log_info "�� autoSMART Cluster Deployment v1.0"
176
+    log_info "====================================="
177
+    log_info "Hardware-based HDD tracking with differential storage"
178
+    log_info ""
179
+    log_info "Operation: $COMMAND"
180
+    
181
+    if [[ "$CLUSTER_MODE" == true ]]; then
182
+        log_info "Target: Entire cluster (nodes from cluster.json)"
183
+    elif [[ -n "$TARGET_IP" ]]; then
184
+        log_info "Target: Remote node ($TARGET_IP)"
185
+    else
186
+        log_info "Target: Current node ($(hostname -s))"
187
+    fi
188
+    
189
+    log_info "Database: $DB_HOST:5432/$DB_NAME"
190
+    log_info ""
191
+}
192
+
193
+handle_database_deployment() {
194
+    log_info "💾 Installing autoSMART Database Schema"
195
+    log_info "======================================="
196
+    log_info "Target Database: $DB_HOST:5432/$DB_NAME"
197
+    log_info "Database User: $DB_USER"
198
+    log_info ""
199
+    
200
+    # Check if psql is available
201
+    if ! command -v psql &> /dev/null; then
202
+        log_error "psql client not found. Please install PostgreSQL client:"
203
+        log_error "  macOS: brew install postgresql"
204
+        log_error "  Ubuntu: sudo apt install postgresql-client"
205
+        log_error "  CentOS: sudo dnf install postgresql"
206
+        return 1
207
+    fi
208
+    
209
+    # Test database connection
210
+    log_info "🔗 Testing database connection..."
211
+    local psql_cmd="psql -h $DB_HOST -U $DB_USER -d $DB_NAME"
212
+    if [[ -n "$DB_PASS" ]]; then
213
+        export PGPASSWORD="$DB_PASS"
214
+    fi
215
+    
216
+    if ! $psql_cmd -c "SELECT version();" >/dev/null 2>&1; then
217
+        log_error "Cannot connect to database $DB_HOST:5432/$DB_NAME"
218
+        log_error "Please check:"
219
+        log_error "  • Database server is running"
220
+        log_error "  • Database '$DB_NAME' exists"
221
+        log_error "  • User '$DB_USER' has proper permissions"
222
+        log_error "  • Network connectivity to $DB_HOST"
223
+        return 1
224
+    fi
225
+    
226
+    log_success "✅ Database connection successful"
227
+    
228
+    # Check schema files
229
+    if [[ ! -f "$SCRIPT_DIR/sql/schema.sql" ]]; then
230
+        log_error "Schema file not found: $SCRIPT_DIR/sql/schema.sql"
231
+        return 1
232
+    fi
233
+    
234
+    # Install schema
235
+    log_info "📊 Installing database schema..."
236
+    if ! $psql_cmd -f "$SCRIPT_DIR/sql/schema.sql" >/dev/null 2>&1; then
237
+        log_error "Failed to install database schema"
238
+        log_error "Check for conflicts or permission issues"
239
+        return 1
240
+    fi
241
+    
242
+    log_success "✅ Database schema installed"
243
+    
244
+    # Verify installation
245
+    log_info "🔍 Verifying schema installation..."
246
+    local table_count=$($psql_cmd -t -c "
247
+        SELECT COUNT(*) FROM information_schema.tables 
248
+        WHERE table_schema = 'public' AND table_name LIKE '%smart%' OR table_name LIKE '%hdd%';
249
+    " 2>/dev/null | tr -d ' ')
250
+    
251
+    if [[ "$table_count" -lt 3 ]]; then
252
+        log_error "Schema verification failed. Expected tables not found."
253
+        return 1
254
+    fi
255
+    
256
+    log_success "✅ Schema verification passed ($table_count tables found)"
257
+    
258
+    # Show installed components
259
+    log_info "📋 Database Installation Summary:"
260
+    $psql_cmd -c "
261
+        SELECT 
262
+            'Table' as type,
263
+            table_name as name,
264
+            pg_size_pretty(pg_total_relation_size('public.'||table_name)) as size
265
+        FROM information_schema.tables 
266
+        WHERE table_schema = 'public'
267
+        UNION ALL
268
+        SELECT 
269
+            'View' as type,
270
+            viewname as name,
271
+            'N/A' as size
272
+        FROM pg_views 
273
+        WHERE schemaname = 'public'
274
+        ORDER BY type, name;
275
+    " 2>/dev/null || true
276
+    
277
+    log_success "✅ autoSMART database deployment completed successfully!"
278
+    log_info ""
279
+    log_info "🚀 Next Steps:"
280
+    log_info "  1. Deploy nodes: ./deploy.sh install <node>"
281
+    log_info "  2. Configure clusters in config files"
282
+    log_info "  3. Start collecting SMART data"
283
+    log_info ""
284
+    
285
+    return 0
286
+}
287
+
288
+handle_remote_deployment() {
289
+    local target_ip="$1"
290
+    local command="$2"
291
+    
292
+    # Determine the correct node name from cluster.json
293
+    local node_name=""
294
+    local cluster_config="$SCRIPT_DIR/cluster.json"
295
+    if [[ -f "$cluster_config" ]] && command -v jq &> /dev/null; then
296
+        node_name=$(jq -r --arg ip "$target_ip" '.cluster.nodes[] | select(.ip==$ip) | .hostname' "$cluster_config")
297
+        if [[ -z "$node_name" || "$node_name" == "null" ]]; then
298
+            # Fallback: try to get hostname from target machine
299
+            node_name=$(ssh -o ConnectTimeout=5 "root@$target_ip" "hostname -s" 2>/dev/null || echo "unknown-node")
300
+        fi
301
+    else
302
+        # Fallback: try to get hostname from target machine
303
+        node_name=$(ssh -o ConnectTimeout=5 "root@$target_ip" "hostname -s" 2>/dev/null || echo "unknown-node")
304
+    fi
305
+    
306
+    log_info "🌐 Remote deployment to $target_ip (node: $node_name)"
307
+    
308
+    # Test connectivity
309
+    log_info "🔍 Testing connectivity to $target_ip..."
310
+    if ! ping -c 1 -W 5 "$target_ip" >/dev/null 2>&1; then
311
+        log_error "Cannot reach $target_ip (ping failed)"
312
+        return 1
313
+    fi
314
+    
315
+    # Test SSH
316
+    log_info "🔐 Testing SSH access to $target_ip..."
317
+    if ! ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no "root@$target_ip" true 2>/dev/null; then
318
+        log_error "Cannot connect to $target_ip via SSH"
319
+        log_info "Setup SSH keys: ssh-copy-id root@$target_ip"
320
+        return 1
321
+    fi
322
+    
323
+    log_success "✅ SSH connection to $target_ip successful"
324
+    
325
+    # Create temp directory
326
+    local remote_temp="/tmp/autosmart-deploy-$(date +%s)"
327
+    log_info "📁 Creating remote directory: $remote_temp"
328
+    ssh "root@$target_ip" "mkdir -p $remote_temp"
329
+    
330
+    # Copy files
331
+    log_info "📦 Syncing project files to $target_ip..."
332
+    if ! rsync -avz --progress \
333
+        --exclude-from="$SCRIPT_DIR/.deployignore" \
334
+        --include='docs/' \
335
+        --include='docs/*.md' \
336
+        --exclude='.git*' \
337
+        --exclude='*.md' \
338
+        --exclude='deploy.sh' \
339
+        "$SCRIPT_DIR/" "root@$target_ip:$remote_temp/"; then
340
+        log_error "Failed to sync files to $target_ip"
341
+        return 1
342
+    fi
343
+    
344
+    # Execute install.sh
345
+    log_info "🚀 Executing $command on $target_ip..."
346
+    
347
+    local install_args="$command --node-id $node_name --db-host $DB_HOST"
348
+    
349
+    if ssh "root@$target_ip" "cd $remote_temp/scripts && bash install.sh $install_args"; then
350
+        log_success "✅ $command completed successfully on $target_ip"
351
+        ssh "root@$target_ip" "rm -rf $remote_temp"
352
+        return 0
353
+    else
354
+        log_error "❌ $command failed on $target_ip"
355
+        return 1
356
+    fi
357
+}
358
+
359
+handle_status() {
360
+    local target_ip="$1"
361
+    
362
+    if [[ -n "$target_ip" ]]; then
363
+        log_info "📊 Checking autoSMART status on $target_ip"
364
+        ssh "root@$target_ip" "systemctl status autosmart --no-pager"
365
+    else
366
+        log_info "📊 Checking autoSMART status on current node"
367
+        if command -v systemctl >/dev/null 2>&1; then
368
+            systemctl status autosmart --no-pager
369
+        else
370
+            log_error "systemctl not available"
371
+            return 1
372
+        fi
373
+    fi
374
+}
375
+
376
+handle_cluster_operation() {
377
+    local command="$1"
378
+    
379
+    log_info "🚀 Executing $command on cluster..."
380
+    
381
+    # Check if cluster.json exists
382
+    local cluster_config="$SCRIPT_DIR/cluster.json"
383
+    if [[ ! -f "$cluster_config" ]]; then
384
+        log_error "Cluster configuration not found: $cluster_config"
385
+        return 1
386
+    fi
387
+    
388
+    # Check if jq is available for JSON parsing
389
+    if ! command -v jq &> /dev/null; then
390
+        log_error "jq is required for cluster operations"
391
+        return 1
392
+    fi
393
+    
394
+    # Parse cluster configuration
395
+    local cluster_name=$(jq -r '.cluster.name' "$cluster_config")
396
+    local total_nodes=$(jq -r '.cluster.nodes | length' "$cluster_config")
397
+    
398
+    log_info "Cluster: $cluster_name ($total_nodes nodes)"
399
+    log_info ""
400
+    
401
+    local success_count=0
402
+    local failed_nodes=()
403
+    
404
+    # Process nodes
405
+    while IFS= read -r node_data; do
406
+        local node_hostname=$(echo "$node_data" | jq -r '.hostname')
407
+        local node_ip=$(echo "$node_data" | jq -r '.ip')
408
+        
409
+        log_info "🔧 Processing node: $node_hostname ($node_ip)"
410
+        
411
+        if handle_remote_deployment "$node_ip" "$command"; then
412
+            ((success_count++))
413
+            log_success "✅ $node_hostname completed successfully"
414
+        else
415
+            log_error "❌ $node_hostname failed"
416
+            failed_nodes+=("$node_hostname")
417
+        fi
418
+        
419
+        sleep 2
420
+        log_info ""
421
+    done < <(jq -c '.cluster.nodes[]' "$cluster_config")
422
+    
423
+    # Summary
424
+    log_info "📊 Cluster Summary:"
425
+    log_info "  • Successful: $success_count/$total_nodes"
426
+    
427
+    if [[ ${#failed_nodes[@]} -gt 0 ]]; then
428
+        log_error "  • Failed nodes: ${failed_nodes[*]}"
429
+    fi
430
+    
431
+    if [[ $success_count -eq $total_nodes ]]; then
432
+        log_success "🎉 All nodes processed successfully!"
433
+        return 0
434
+    else
435
+        log_error "❌ Some nodes failed"
436
+        return 1
437
+    fi
438
+}
439
+
440
+# Main execution
441
+main() {
442
+    parse_arguments "$@"
443
+    show_header
444
+    
445
+    # Handle database deployment mode
446
+    if [[ "$DATABASE_MODE" == true ]]; then
447
+        handle_database_deployment
448
+        exit $?
449
+    fi
450
+    
451
+    if [[ "$CLUSTER_MODE" == true ]]; then
452
+        handle_cluster_operation "$COMMAND"
453
+        exit $?
454
+    elif [[ -n "$TARGET_IP" ]]; then
455
+        if [[ "$COMMAND" == "status" ]]; then
456
+            handle_status "$TARGET_IP"
457
+        else
458
+            handle_remote_deployment "$TARGET_IP" "$COMMAND"
459
+        fi
460
+        exit $?
461
+    fi
462
+    
463
+    # Local execution
464
+    case "$COMMAND" in
465
+        status)
466
+            handle_status
467
+            ;;
468
+        install|uninstall)
469
+            if [[ "$(uname)" == "Darwin" ]]; then
470
+                log_error "Cannot install autoSMART on macOS development machine"
471
+                log_info "Deploy to target nodes instead:"
472
+                log_info "  ./deploy.sh install <node>    # Deploy to node from cluster.json"
473
+                log_info "  ./deploy.sh install --cluster       # Deploy to all nodes"
474
+                exit 1
475
+            fi
476
+            
477
+            log_info "🚀 Local deployment mode"
478
+            sudo bash "$SCRIPT_DIR/scripts/install.sh" "$COMMAND" --node-id "$NODE_ID"
479
+            ;;
480
+        *)
481
+            log_error "Unknown command: $COMMAND"
482
+            show_usage
483
+            exit 1
484
+            ;;
485
+    esac
486
+}
487
+
488
+# Run main
489
+main "$@"
+439 -0
projects/autoSMART/docs/API.md
@@ -0,0 +1,439 @@
1
+# autoSMART API Reference
2
+
3
+## 🔌 OpenAI API Integration
4
+
5
+### Overview
6
+
7
+autoSMART integrates with OpenAI's GPT models to provide intelligent HDD failure predictions based on SMART data analysis. This document covers the API integration, prompt engineering, and response processing.
8
+
9
+### Configuration
10
+
11
+#### Environment Variables
12
+```bash
13
+export OPENAI_API_KEY="sk-your-openai-api-key-here"
14
+export OPENAI_MODEL="gpt-4"  # or gpt-3.5-turbo for cost optimization
15
+export OPENAI_MAX_TOKENS=1000
16
+export OPENAI_TEMPERATURE=0.1  # Low temperature for consistent technical analysis
17
+```
18
+
19
+#### Database Configuration
20
+```sql
21
+-- Add OpenAI configuration to system_config
22
+INSERT INTO system_config (key, value, description) VALUES
23
+('openai_api_key', 'sk-your-key', 'OpenAI API key for failure predictions'),
24
+('openai_model', 'gpt-4', 'OpenAI model to use (gpt-4, gpt-3.5-turbo)'),
25
+('openai_max_tokens', '1000', 'Maximum tokens per API call'),
26
+('openai_temperature', '0.1', 'Temperature setting for consistent predictions'),
27
+('openai_timeout', '30', 'API timeout in seconds'),
28
+('prediction_interval_hours', '24', 'Hours between AI predictions per drive');
29
+```
30
+
31
+## 🤖 AI Prediction System
32
+
33
+### Prompt Engineering
34
+
35
+#### System Prompt Template
36
+```text
37
+You are an expert storage systems engineer specializing in HDD failure prediction and analysis. 
38
+
39
+Your expertise includes:
40
+- SMART parameter interpretation across all major manufacturers (WD, Seagate, Hitachi, Toshiba)  
41
+- Statistical analysis of drive health trends and patterns
42
+- Hardware failure mode identification and prediction
43
+- Maintenance recommendations based on drive condition
44
+
45
+Analyze the provided SMART data and historical trends to:
46
+1. Assess current drive health status
47
+2. Predict failure probability and timeline  
48
+3. Identify concerning parameter trends
49
+4. Provide specific maintenance recommendations
50
+
51
+Be precise, technical, and provide confidence levels for your predictions.
52
+Return responses in structured JSON format for automated processing.
53
+```
54
+
55
+#### User Prompt Templates
56
+
57
+##### Single Drive Analysis
58
+```json
59
+{
60
+  "task": "analyze_drive_health",
61
+  "drive_info": {
62
+    "serial_number": "WD-XXXXX",
63
+    "model": "WD4003FZEX",
64
+    "manufacturer": "Western Digital", 
65
+    "capacity_gb": 4000,
66
+    "age_days": 1825,
67
+    "power_on_hours": 15000
68
+  },
69
+  "current_smart": {
70
+    "Reallocated_Sector_Ct": 0,
71
+    "Spin_Retry_Count": 0,
72
+    "Current_Pending_Sector": 1,
73
+    "Offline_Uncorrectable": 0,
74
+    "UDMA_CRC_Error_Count": 0,
75
+    "Raw_Read_Error_Rate": 158584832,
76
+    "Seek_Error_Rate": 34405355,
77
+    "Power_On_Hours": 15234,
78
+    "Load_Cycle_Count": 45123,
79
+    "Temperature_Celsius": 42,
80
+    "Start_Stop_Count": 1205,
81
+    "Power_Cycle_Count": 1198
82
+  },
83
+  "historical_trends": {
84
+    "30_day_changes": {
85
+      "Current_Pending_Sector": [0, 0, 0, 1],
86
+      "Temperature_Celsius": [38, 39, 41, 42],
87
+      "Power_On_Hours": [14950, 15050, 15150, 15234]
88
+    },
89
+    "parameter_velocities": {
90
+      "Current_Pending_Sector": 0.033,
91
+      "Temperature_Celsius": 0.133
92
+    }
93
+  }
94
+}
95
+```
96
+
97
+##### Multi-Drive Comparative Analysis
98
+```json
99
+{
100
+  "task": "comparative_analysis", 
101
+  "drives": [
102
+    {
103
+      "serial_number": "WD-XXXXX1",
104
+      "health_score": 85,
105
+      "critical_parameters": ["Current_Pending_Sector"],
106
+      "smart_summary": {...}
107
+    },
108
+    {
109
+      "serial_number": "WD-XXXXX2", 
110
+      "health_score": 92,
111
+      "critical_parameters": [],
112
+      "smart_summary": {...}
113
+    }
114
+  ],
115
+  "analysis_context": {
116
+    "environment": "proxmox_cluster",
117
+    "usage_pattern": "high_io_database",
118
+    "temperature_environment": "datacenter"
119
+  }
120
+}
121
+```
122
+
123
+### Response Format
124
+
125
+#### Standard Health Assessment Response
126
+```json
127
+{
128
+  "prediction_id": "uuid-generated",
129
+  "timestamp": "2025-08-15T10:30:00Z",
130
+  "drive_serial": "WD-XXXXX",
131
+  "analysis": {
132
+    "health_score": 78,
133
+    "risk_level": "medium",
134
+    "failure_probability": {
135
+      "7_days": 0.02,
136
+      "30_days": 0.08,
137
+      "90_days": 0.15,
138
+      "1_year": 0.35
139
+    },
140
+    "predicted_failure_date": "2026-02-15",
141
+    "confidence_level": 0.75
142
+  },
143
+  "critical_findings": [
144
+    {
145
+      "parameter": "Current_Pending_Sector",
146
+      "current_value": 1,
147
+      "trend": "increasing",
148
+      "severity": "warning",
149
+      "description": "One sector is pending reallocation - monitor closely"
150
+    },
151
+    {
152
+      "parameter": "Temperature_Celsius", 
153
+      "current_value": 42,
154
+      "trend": "increasing",
155
+      "severity": "info",
156
+      "description": "Temperature trending upward but within normal range"
157
+    }
158
+  ],
159
+  "recommendations": [
160
+    {
161
+      "priority": "high",
162
+      "action": "monitor_pending_sectors",
163
+      "description": "Monitor pending sector count daily - consider replacement if count increases",
164
+      "timeline": "immediate"
165
+    },
166
+    {
167
+      "priority": "medium", 
168
+      "action": "improve_cooling",
169
+      "description": "Consider improving airflow to reduce operating temperature",
170
+      "timeline": "within_30_days"
171
+    }
172
+  ],
173
+  "manufacturer_specific": {
174
+    "western_digital": {
175
+      "expected_lifespan_hours": 50000,
176
+      "current_usage_percent": 30.5,
177
+      "wear_level_assessment": "normal"
178
+    }
179
+  }
180
+}
181
+```
182
+
183
+## 🔧 Implementation Details
184
+
185
+### SmartAnalyzer.pm API Integration
186
+
187
+#### Core API Methods
188
+```perl
189
+=head2 predict_failure
190
+
191
+Generate AI-powered failure prediction for a specific drive
192
+
193
+=cut
194
+
195
+sub predict_failure {
196
+    my ($self, $hdd_id, $options) = @_;
197
+    
198
+    # Gather drive data and historical trends
199
+    my $drive_data = $self->_gather_drive_data($hdd_id);
200
+    my $historical_data = $self->_analyze_trends($hdd_id, $options->{days} || 30);
201
+    
202
+    # Construct AI prompt
203
+    my $prompt = $self->_build_analysis_prompt($drive_data, $historical_data);
204
+    
205
+    # Call OpenAI API
206
+    my $prediction = $self->_call_openai_api($prompt);
207
+    
208
+    # Store prediction result
209
+    $self->_store_prediction($hdd_id, $prediction);
210
+    
211
+    return $prediction;
212
+}
213
+```
214
+
215
+#### API Request Handler
216
+```perl
217
+sub _call_openai_api {
218
+    my ($self, $prompt) = @_;
219
+    
220
+    my $ua = LWP::UserAgent->new(timeout => $self->{openai_timeout} || 30);
221
+    
222
+    my $request = HTTP::Request->new(POST => 'https://api.openai.com/v1/chat/completions');
223
+    $request->header('Authorization' => "Bearer $self->{openai_api_key}");
224
+    $request->header('Content-Type' => 'application/json');
225
+    
226
+    my $payload = {
227
+        model => $self->{openai_model} || 'gpt-4',
228
+        messages => [
229
+            {
230
+                role => "system",
231
+                content => $self->_get_system_prompt()
232
+            },
233
+            {
234
+                role => "user", 
235
+                content => encode_json($prompt)
236
+            }
237
+        ],
238
+        max_tokens => $self->{openai_max_tokens} || 1000,
239
+        temperature => $self->{openai_temperature} || 0.1,
240
+        response_format => { type => "json_object" }
241
+    };
242
+    
243
+    $request->content(encode_json($payload));
244
+    
245
+    my $response = $ua->request($request);
246
+    
247
+    if ($response->is_success) {
248
+        my $result = decode_json($response->content);
249
+        return decode_json($result->{choices}[0]{message}{content});
250
+    } else {
251
+        die "OpenAI API error: " . $response->status_line . "\n" . $response->content;
252
+    }
253
+}
254
+```
255
+
256
+### Error Handling and Retry Logic
257
+
258
+```perl
259
+sub _call_openai_api_with_retry {
260
+    my ($self, $prompt, $max_retries) = @_;
261
+    $max_retries ||= 3;
262
+    
263
+    for my $attempt (1..$max_retries) {
264
+        eval {
265
+            return $self->_call_openai_api($prompt);
266
+        };
267
+        
268
+        if ($@) {
269
+            $self->_log("OpenAI API attempt $attempt failed: $@", 2);
270
+            
271
+            if ($attempt < $max_retries) {
272
+                # Exponential backoff
273
+                my $delay = 2 ** $attempt;
274
+                $self->_log("Retrying in ${delay}s...", 2);
275
+                sleep($delay);
276
+            } else {
277
+                die "OpenAI API failed after $max_retries attempts: $@";
278
+            }
279
+        }
280
+    }
281
+}
282
+```
283
+
284
+## 📊 Prediction Storage and Retrieval
285
+
286
+### Database Schema for Predictions
287
+```sql
288
+-- Enhanced predictions table
289
+ALTER TABLE predictions ADD COLUMN api_model VARCHAR(50);
290
+ALTER TABLE predictions ADD COLUMN api_tokens_used INTEGER;
291
+ALTER TABLE predictions ADD COLUMN api_cost_estimate DECIMAL(10,6);
292
+ALTER TABLE predictions ADD COLUMN confidence_level DECIMAL(3,2);
293
+ALTER TABLE predictions ADD COLUMN failure_probability_7d DECIMAL(5,4);
294
+ALTER TABLE predictions ADD COLUMN failure_probability_30d DECIMAL(5,4);
295
+ALTER TABLE predictions ADD COLUMN failure_probability_90d DECIMAL(5,4);
296
+ALTER TABLE predictions ADD COLUMN failure_probability_1y DECIMAL(5,4);
297
+ALTER TABLE predictions ADD COLUMN predicted_failure_date DATE;
298
+ALTER TABLE predictions ADD COLUMN recommendations JSONB;
299
+ALTER TABLE predictions ADD COLUMN critical_findings JSONB;
300
+```
301
+
302
+### Prediction Retrieval Methods
303
+```perl
304
+=head2 get_latest_prediction
305
+
306
+Get the most recent prediction for a drive
307
+
308
+=cut
309
+
310
+sub get_latest_prediction {
311
+    my ($self, $hdd_id) = @_;
312
+    
313
+    my $sql = q{
314
+        SELECT p.*, hi.serial_number, hi.model_name
315
+        FROM predictions p
316
+        JOIN hdd_inventory hi ON p.hdd_id = hi.id
317
+        WHERE p.hdd_id = ?
318
+        ORDER BY p.timestamp DESC
319
+        LIMIT 1
320
+    };
321
+    
322
+    my $sth = $self->{db_handle}->prepare($sql);
323
+    $sth->execute($hdd_id);
324
+    
325
+    return $sth->fetchrow_hashref();
326
+}
327
+```
328
+
329
+## 🎯 Performance Optimization
330
+
331
+### API Usage Optimization
332
+
333
+#### Batch Processing
334
+```perl
335
+sub predict_multiple_drives {
336
+    my ($self, $hdd_ids, $options) = @_;
337
+    
338
+    # Group drives by similarity for efficient batch processing
339
+    my $drive_groups = $self->_group_drives_by_similarity($hdd_ids);
340
+    
341
+    my @predictions;
342
+    for my $group (@$drive_groups) {
343
+        if (scalar(@$group) > 1) {
344
+            # Use comparative analysis for similar drives
345
+            push @predictions, $self->_batch_comparative_analysis($group, $options);
346
+        } else {
347
+            # Use individual analysis for single drives
348
+            push @predictions, $self->predict_failure($group->[0], $options);
349
+        }
350
+    }
351
+    
352
+    return @predictions;
353
+}
354
+```
355
+
356
+#### Caching Strategy
357
+```perl
358
+sub _get_cached_prediction {
359
+    my ($self, $hdd_id, $cache_hours) = @_;
360
+    $cache_hours ||= 24;
361
+    
362
+    my $sql = q{
363
+        SELECT * FROM predictions 
364
+        WHERE hdd_id = ? 
365
+          AND timestamp > NOW() - INTERVAL ? hour
366
+        ORDER BY timestamp DESC 
367
+        LIMIT 1
368
+    };
369
+    
370
+    my $sth = $self->{db_handle}->prepare($sql);
371
+    $sth->execute($hdd_id, $cache_hours);
372
+    
373
+    return $sth->fetchrow_hashref();
374
+}
375
+```
376
+
377
+### Cost Management
378
+
379
+#### Token Usage Tracking
380
+```perl
381
+sub _track_api_usage {
382
+    my ($self, $hdd_id, $tokens_used, $model) = @_;
383
+    
384
+    # Estimate cost based on model pricing
385
+    my $cost_per_token = $model eq 'gpt-4' ? 0.00003 : 0.000002;
386
+    my $estimated_cost = $tokens_used * $cost_per_token;
387
+    
388
+    # Log usage statistics
389
+    my $sql = q{
390
+        INSERT INTO api_usage_log 
391
+        (hdd_id, timestamp, model, tokens_used, estimated_cost)
392
+        VALUES (?, NOW(), ?, ?, ?)
393
+    };
394
+    
395
+    $self->{db_handle}->do($sql, undef, $hdd_id, $model, $tokens_used, $estimated_cost);
396
+    
397
+    return $estimated_cost;
398
+}
399
+```
400
+
401
+## 📈 Analytics and Reporting
402
+
403
+### Prediction Accuracy Tracking
404
+```sql
405
+-- Track prediction accuracy over time
406
+CREATE VIEW prediction_accuracy AS
407
+SELECT 
408
+    p.hdd_id,
409
+    p.timestamp as prediction_date,
410
+    p.failure_probability_30d,
411
+    p.predicted_failure_date,
412
+    hi.status_changed_at,
413
+    CASE 
414
+        WHEN hi.status = 'failed' AND hi.status_changed_at <= p.predicted_failure_date THEN 'accurate'
415
+        WHEN hi.status = 'failed' AND hi.status_changed_at > p.predicted_failure_date THEN 'early'
416
+        WHEN hi.status = 'active' AND NOW() > p.predicted_failure_date THEN 'late'
417
+        ELSE 'pending'
418
+    END as accuracy_assessment
419
+FROM predictions p
420
+JOIN hdd_inventory hi ON p.hdd_id = hi.id
421
+WHERE p.timestamp > NOW() - INTERVAL '6 months';
422
+```
423
+
424
+### API Cost Analysis
425
+```sql
426
+-- Monitor API costs and usage patterns
427
+SELECT 
428
+    DATE_TRUNC('day', timestamp) as date,
429
+    model,
430
+    COUNT(*) as api_calls,
431
+    SUM(tokens_used) as total_tokens,
432
+    SUM(estimated_cost) as daily_cost
433
+FROM api_usage_log
434
+WHERE timestamp > NOW() - INTERVAL '30 days'
435
+GROUP BY DATE_TRUNC('day', timestamp), model
436
+ORDER BY date DESC, model;
437
+```
438
+
439
+This API reference provides comprehensive guidance for integrating and optimizing OpenAI API usage within the autoSMART system. The implementation focuses on accuracy, cost-effectiveness, and reliable failure prediction capabilities.
+264 -0
projects/autoSMART/docs/CHANGELOG.md
@@ -0,0 +1,264 @@
1
+# autoSMART Release Notes
2
+
3
+All notable changes and updates to autoSMART will be documented in this file.
4
+
5
+## [1.0.0] - August 15, 2025
6
+
7
+### 🎉 Initial Release - Production Ready
8
+
9
+We're excited to announce the first production release of autoSMART! This release provides a complete, enterprise-ready solution for intelligent HDD monitoring with AI-powered failure predictions.
10
+
11
+### ✨ What's New
12
+
13
+#### Core Features
14
+- **Smart HDD Tracking**: Automatically identifies and tracks all HDDs in your Proxmox cluster using hardware identifiers
15
+- **AI Failure Predictions**: Uses OpenAI GPT to predict drive failures before they happen
16
+- **Efficient Storage**: Advanced storage optimization reduces database size by 60-80%
17
+- **Migration Detection**: Automatically detects when drives move between servers
18
+- **Proxmox Integration**: Native support for Proxmox VE cluster environments
19
+
20
+#### Monitoring Capabilities
21
+- **Real-time Health Monitoring**: Continuous SMART parameter monitoring
22
+- **Configurable Alerts**: Customizable thresholds for all SMART parameters
23
+- **Historical Analysis**: Long-term trend analysis and reporting
24
+- **Performance Tracking**: Monitor drive performance degradation over time
25
+
26
+#### User Experience
27
+- **Easy Installation**: Simple deployment script for quick setup
28
+- **Comprehensive Reports**: Detailed health reports and failure predictions
29
+- **Web Dashboard**: (Coming in v1.1) Real-time monitoring interface
30
+- **Email Alerts**: Immediate notifications for critical issues
31
+
32
+### 🔧 System Requirements
33
+
34
+#### Minimum Requirements
35
+- **Operating System**: Proxmox VE 7.0+ or compatible Linux distribution
36
+- **Database**: PostgreSQL 13+ with 1GB+ available storage
37
+- **Perl**: Version 5.20+ with internet access for module installation
38
+- **Memory**: 512MB RAM minimum, 1GB recommended per node
39
+- **Network**: Stable network connection for database and API access
40
+
41
+#### Recommended Setup
42
+- **Database Server**: Dedicated PostgreSQL server with SSD storage
43
+- **Cluster Size**: Optimized for 3-50 node Proxmox clusters
44
+- **Storage**: 10GB+ database storage for large clusters with long retention
45
+- **Monitoring**: Integration with existing monitoring infrastructure
46
+
47
+### 📊 Performance Benefits
48
+
49
+#### Storage Efficiency
50
+- **60-80% smaller database** compared to traditional SMART logging
51
+- **Intelligent change detection** stores only modified parameters
52
+- **Automatic optimization** requires no manual configuration
53
+- **Scalable architecture** grows efficiently with cluster size
54
+
55
+#### Monitoring Accuracy
56
+- **Hardware-based tracking** eliminates drive identification issues
57
+- **Migration detection** maintains accurate drive history
58
+- **AI-powered analysis** provides reliable failure predictions
59
+- **Real-time alerts** enable proactive maintenance
60
+
61
+### 🚀 Getting Started
62
+
63
+#### Quick Installation
64
+```bash
65
+# 1. Download and extract autoSMART
66
+# 2. Run the installer
67
+sudo ./scripts/deploy.sh install
68
+
69
+# 3. Configure your database connection
70
+sudo vim /opt/autoSMART/config/autosmart.conf
71
+
72
+# 4. Start monitoring
73
+sudo systemctl start autosmart
74
+```
75
+
76
+#### First Steps
77
+1. **Verify Installation**: Check that all drives are detected and monitored
78
+2. **Configure Alerts**: Set up email notifications for your team
79
+3. **Review Reports**: Generate initial health reports for all drives
80
+4. **Set Thresholds**: Customize alert thresholds for your environment
81
+
82
+### 🏥 Health Monitoring
83
+
84
+#### What autoSMART Monitors
85
+- **Temperature**: Operating temperatures and thermal stress
86
+- **Error Rates**: Read/write errors and retry counts  
87
+- **Mechanical Health**: Spin-up time, seek errors, and mechanical issues
88
+- **Surface Quality**: Bad sectors, reallocated sectors, and surface scans
89
+- **Performance**: Transfer rates and response times
90
+
91
+#### AI Predictions
92
+- **Failure Probability**: Confidence scores for potential failures
93
+- **Time Estimates**: Predicted time until failure occurs
94
+- **Risk Assessment**: Categorization of failure risk levels
95
+- **Recommendation Engine**: Suggested maintenance actions
96
+
97
+### 🔔 Alert System
98
+
99
+#### Alert Types
100
+- **Critical**: Immediate action required (drive failure imminent)
101
+- **Warning**: Monitor closely (parameters approaching limits)  
102
+- **Info**: Normal operation (routine status updates)
103
+- **Prediction**: AI-identified potential issues
104
+
105
+#### Notification Methods
106
+- **Email**: Immediate email alerts for critical issues
107
+- **Logs**: Detailed logging for all events and changes
108
+- **Reports**: Regular summary reports with cluster health overview
109
+- **API Integration**: RESTful API for custom integrations (v1.1+)
110
+
111
+### 💡 Use Cases
112
+
113
+#### Preventive Maintenance
114
+- **Predict Failures**: Replace drives before they fail
115
+- **Schedule Maintenance**: Plan maintenance windows effectively
116
+- **Optimize Workloads**: Balance load based on drive health
117
+- **Track Warranties**: Monitor warranty status and replacement schedules
118
+
119
+#### Capacity Planning
120
+- **Growth Trends**: Monitor storage usage patterns
121
+- **Performance Planning**: Identify performance bottlenecks
122
+- **Cluster Expansion**: Plan future capacity requirements
123
+- **Cost Optimization**: Maximize drive utilization efficiency
124
+
125
+### 🛠️ Support & Documentation
126
+
127
+#### Getting Help
128
+- **Installation Guide**: Complete setup instructions in `docs/INSTALLATION.md`
129
+- **Configuration**: Detailed configuration options and examples
130
+- **Troubleshooting**: Common issues and solutions
131
+- **API Documentation**: Integration guides and examples
132
+
133
+#### Community
134
+- **Documentation**: Comprehensive guides for all features
135
+- **Support**: Technical support and assistance
136
+- **Updates**: Regular updates and security patches
137
+- **Feedback**: We welcome your feedback and suggestions
138
+
139
+### 🔮 What's Next
140
+
141
+#### Version 1.1 (Coming Soon)
142
+- **Web Dashboard**: Real-time monitoring interface
143
+- **Advanced Analytics**: Enhanced prediction models
144
+- **API Integration**: RESTful API for custom integrations
145
+- **Mobile Alerts**: SMS and mobile app notifications
146
+
147
+#### Future Releases
148
+- **Multi-Tenant Support**: Support for managed service providers
149
+- **Advanced ML Models**: Custom machine learning models
150
+- **Cloud Integration**: Cloud storage and analytics options
151
+- **Enterprise Features**: Advanced reporting and compliance tools
152
+
153
+---
154
+
155
+**Welcome to autoSMART v1.0!** 
156
+
157
+Thank you for choosing autoSMART for your drive monitoring needs. This release represents months of development and testing to provide you with a reliable, efficient, and intelligent monitoring solution.
158
+
159
+For technical support, documentation, or questions, please refer to the documentation in the `docs/` directory or contact our support team.
160
+
161
+#### Scripts and Tools
162
+- **collect-smart-data.pl**: Main data collection script
163
+- **analyze-smart-data.pl**: Analysis and prediction script  
164
+- **generate-reports.pl**: Report generation script
165
+- **test-differential-storage.pl**: Comprehensive storage optimization test suite
166
+
167
+#### Configuration System
168
+- **Proxmox cluster integration**:
169
+  - `/etc/pve/autoSMART/cluster.conf`: Cluster-wide shared configuration
170
+  - `/etc/default/autosmart`: Local node-specific configuration
171
+- **Flexible configuration**: Database connection, API keys, thresholds, intervals
172
+
173
+#### Documentation
174
+- Complete installation and setup guide
175
+- API integration documentation
176
+- Migration detection system documentation
177
+- Differential storage system documentation
178
+- Development and testing guides
179
+
180
+### 🔧 Technical Specifications
181
+
182
+#### Database Requirements
183
+- PostgreSQL 13+ with JSONB support
184
+- GIN indexes for JSONB columns
185
+- Recursive CTE support for data reconstruction
186
+- Extension support for advanced functions
187
+
188
+#### Performance Optimizations
189
+- Hardware-based tracking eliminates volatile path dependencies
190
+- Differential storage reduces data volume by 60-80%
191
+- Optimized indexes for time-series data
192
+- Efficient recursive queries for data reconstruction
193
+
194
+#### Storage Efficiency
195
+- **Baseline readings**: ~1% of all readings (first reading per HDD)
196
+- **Full readings**: ~15-20% of readings (critical changes + forced intervals)  
197
+- **Differential readings**: ~5-15% of readings (minor parameter changes)
198
+- **Skipped readings**: ~60-75% of readings (no changes detected)
199
+
200
+#### Migration Detection
201
+- Automatic detection of HDD movements between:
202
+  - Physical nodes in cluster
203
+  - Device paths (/dev/sdX changes)
204
+  - Slot positions in chassis
205
+- Complete audit trail of hardware movements
206
+- No data loss during migrations
207
+
208
+### 🎯 Phase 1 Completion Status
209
+
210
+- ✅ Project structure and organization
211
+- ✅ PostgreSQL schema with hardware tracking
212
+- ✅ Hardware-based SMART collector with migration detection
213
+- ✅ Differential storage optimization implementation
214
+- ✅ Proxmox cluster configuration system
215
+- ✅ Test suite and validation tools
216
+- ✅ Comprehensive documentation
217
+
218
+### 🔜 Next Phase (v1.1 - AI Integration)
219
+
220
+Planned features for Phase 2:
221
+- AI prediction engine implementation
222
+- Historical data analysis and pattern recognition  
223
+- Failure prediction algorithms refinement
224
+- Enhanced alerting system
225
+
226
+### 🏗️ Infrastructure Notes
227
+
228
+- **Test Database**: PostgreSQL on 192.168.2.102 (user: postgres, no password)
229
+- **Development Environment**: macOS with Perl 5.x
230
+- **Target Deployment**: Proxmox VE cluster with shared storage
231
+
232
+### 📊 Project Metrics
233
+
234
+- **Total files**: 25+ files across modules, scripts, SQL, and documentation
235
+- **Code quality**: Full error handling, logging, and validation
236
+- **Test coverage**: Comprehensive test suite for differential storage
237
+- **Documentation**: Complete user and developer documentation
238
+- **Database optimization**: 60-80% storage reduction achieved
239
+
240
+---
241
+
242
+## Development Guidelines
243
+
244
+### Version Numbering
245
+- **Major** (X.0.0): Breaking changes, major feature additions
246
+- **Minor** (X.Y.0): New features, backward compatible
247
+- **Patch** (X.Y.Z): Bug fixes, small improvements
248
+
249
+### Change Categories
250
+- 🎉 **Major Release**
251
+- ✨ **Added** - New features
252
+- 🔧 **Changed** - Changes in existing functionality  
253
+- 🐛 **Fixed** - Bug fixes
254
+- 🔒 **Security** - Security improvements
255
+- 🗑️ **Deprecated** - Soon-to-be removed features
256
+- ❌ **Removed** - Removed features
257
+
258
+### Future Releases
259
+
260
+Planning for upcoming versions:
261
+- **v1.1.0**: AI Integration Phase
262
+- **v1.2.0**: Production Deployment Phase  
263
+- **v1.3.0**: Advanced Analytics Phase
264
+- **v2.0.0**: Next Generation Architecture
+467 -0
projects/autoSMART/docs/DATABASE.md
@@ -0,0 +1,467 @@
1
+# autoSMART Database Documentation
2
+
3
+## Overview
4
+
5
+autoSMART uses PostgreSQL as its primary database for storing SMART data, HDD tracking information, predictions, and system configuration. The database is designed for multi-node cluster deployments with comprehensive HDD mobility tracking.
6
+
7
+## Database Schema
8
+
9
+### Core Tables
10
+
11
+#### `hdd_inventory`
12
+The central inventory table that tracks all HDDs across the cluster.
13
+
14
+```sql
15
+CREATE TABLE hdd_inventory (
16
+    id                  SERIAL PRIMARY KEY,
17
+    serial_number       VARCHAR(100) NOT NULL,
18
+    model_name          VARCHAR(200) NOT NULL,
19
+    firmware            VARCHAR(50),
20
+    size_gb             INTEGER,
21
+    manufacturer        VARCHAR(100),
22
+    current_device_path VARCHAR(50),
23
+    current_node_id     VARCHAR(50),
24
+    current_slot        VARCHAR(20),
25
+    madagascar_id       VARCHAR(100),
26
+    first_seen          TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
27
+    last_seen           TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
28
+    status              VARCHAR(20) DEFAULT 'active',
29
+    status_changed_at   TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
30
+    notes               TEXT,
31
+    created_at          TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
32
+    updated_at          TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
33
+    
34
+    CONSTRAINT unique_hardware_id UNIQUE (serial_number, model_name)
35
+);
36
+```
37
+
38
+**Key Features:**
39
+- **Hardware-based identification**: Uses `serial_number` + `model_name` as unique constraint
40
+- **Current location tracking**: `current_device_path`, `current_node_id` show where HDD is now
41
+- **Lifecycle management**: `first_seen`, `last_seen`, `status` track HDD lifecycle
42
+- **Madagascar integration**: `madagascar_id` field for cluster-specific identification
43
+
44
+#### `hdd_presence`
45
+Tracks HDD mobility across cluster nodes - records when HDDs are present on different nodes.
46
+
47
+```sql
48
+CREATE TABLE hdd_presence (
49
+    id SERIAL PRIMARY KEY,
50
+    serial_number VARCHAR(64) NOT NULL,
51
+    node VARCHAR(64) NOT NULL,
52
+    data_start TIMESTAMP NOT NULL,
53
+    data_end TIMESTAMP NOT NULL,
54
+    is_current BOOLEAN NOT NULL DEFAULT TRUE
55
+);
56
+```
57
+
58
+**Key Features:**
59
+- **Mobility tracking**: Records when HDDs move between nodes
60
+- **Time-based records**: `data_start`/`data_end` define presence periods
61
+- **Current vs Historic**: `is_current` flag marks active presence
62
+- **Independent of inventory**: Works independently of `hdd_inventory` for pure mobility data
63
+
64
+**Example Data:**
65
+```sql
66
+ id | serial_number  |   node    |         data_start         |          data_end          | is_current 
67
+----+----------------+-----------+----------------------------+----------------------------+------------
68
+  4 | ZW60K01R       | ebony     | 2025-08-16 22:05:15.863971 | 2025-08-16 22:05:15.863971 | t
69
+  3 | S2HSNXRH402205 | ebony     | 2025-08-16 22:05:15.109956 | 2025-08-16 22:05:15.109956 | t
70
+  2 | ZW60K01R       | baobab    | 2025-08-16 21:47:13.873642 | 2025-08-16 22:03:31.052316 | f
71
+  1 | S2HSNXRH402205 | tapia     | 2025-08-16 21:47:13.078524 | 2025-08-16 22:03:30.268985 | f
72
+```
73
+
74
+#### `smart_readings`
75
+Stores SMART data readings with differential storage optimization.
76
+
77
+```sql
78
+CREATE TABLE smart_readings (
79
+    id                   BIGSERIAL PRIMARY KEY,
80
+    hdd_id               INTEGER REFERENCES hdd_inventory(id),
81
+    serial_number        VARCHAR(100) NOT NULL,
82
+    device_path          VARCHAR(50),
83
+    node_id              VARCHAR(50),
84
+    timestamp            TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
85
+    collection_ok        BOOLEAN DEFAULT true,
86
+    temperature          INTEGER,
87
+    parameters_json      JSONB,
88
+    reading_type         VARCHAR(20) DEFAULT 'full',
89
+    changes_detected     BOOLEAN DEFAULT true,
90
+    changed_parameters   JSONB,
91
+    previous_reading_id  INTEGER REFERENCES smart_readings(id),
92
+    checksum             VARCHAR(64)
93
+);
94
+```
95
+
96
+**Reading Types:**
97
+- `baseline`: First reading for an HDD
98
+- `full`: Complete parameter set (forced by time interval)
99
+- `differential`: Only changed parameters (optimization)
100
+- `skipped`: No changes detected
101
+
102
+**Key Features:**
103
+- **Differential storage**: Only stores changes to reduce data volume
104
+- **Full context**: Links to `hdd_inventory` and includes node information
105
+- **Change tracking**: `previous_reading_id` creates reading chains
106
+- **JSONB parameters**: Flexible storage for SMART attributes
107
+
108
+#### `predictions`
109
+AI-generated failure predictions and analysis.
110
+
111
+```sql
112
+CREATE TABLE predictions (
113
+    id                    SERIAL PRIMARY KEY,
114
+    hdd_id                INTEGER REFERENCES hdd_inventory(id),
115
+    serial_number         VARCHAR(100) NOT NULL,
116
+    device_path           VARCHAR(50),
117
+    timestamp             TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
118
+    risk_level            VARCHAR(20),
119
+    failure_probability   DECIMAL(5,4),
120
+    predicted_failure_date DATE,
121
+    confidence_score      DECIMAL(5,4),
122
+    analysis_summary      TEXT,
123
+    recommendations       JSONB,
124
+    openai_response       JSONB,
125
+    created_at            TIMESTAMP WITH TIME ZONE DEFAULT NOW()
126
+);
127
+```
128
+
129
+#### `alert_history`
130
+Tracks all alerts sent about HDD issues.
131
+
132
+```sql
133
+CREATE TABLE alert_history (
134
+    id              SERIAL PRIMARY KEY,
135
+    hdd_id          INTEGER REFERENCES hdd_inventory(id),
136
+    serial_number   VARCHAR(100) NOT NULL,
137
+    alert_type      VARCHAR(50),
138
+    severity        VARCHAR(20),
139
+    message         TEXT,
140
+    sent_at         TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
141
+    sent_to         TEXT,
142
+    delivery_status VARCHAR(20) DEFAULT 'pending',
143
+    related_reading_id BIGINT REFERENCES smart_readings(id),
144
+    related_prediction_id INTEGER REFERENCES predictions(id)
145
+);
146
+```
147
+
148
+### Configuration Tables
149
+
150
+#### `system_config`
151
+Global system configuration parameters.
152
+
153
+```sql
154
+CREATE TABLE system_config (
155
+    id          SERIAL PRIMARY KEY,
156
+    config_key  VARCHAR(100) UNIQUE NOT NULL,
157
+    value       TEXT,
158
+    description TEXT,
159
+    created_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
160
+    updated_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW()
161
+);
162
+```
163
+
164
+**Default Configuration:**
165
+- `collection_interval_seconds`: SMART data collection frequency
166
+- `differential_storage_enabled`: Enable/disable storage optimization
167
+- `forced_storage_interval_hours`: Force full readings periodically
168
+- `critical_parameter_force_store`: Always store critical changes
169
+- `temperature_change_threshold`: Temperature delta for storage
170
+
171
+#### `smart_thresholds`
172
+SMART parameter warning and critical thresholds.
173
+
174
+```sql
175
+CREATE TABLE smart_thresholds (
176
+    id                SERIAL PRIMARY KEY,
177
+    parameter_name    VARCHAR(100) NOT NULL,
178
+    warning_threshold NUMERIC,
179
+    critical_threshold NUMERIC,
180
+    weight            NUMERIC DEFAULT 1.0,
181
+    enabled           BOOLEAN DEFAULT true,
182
+    description       TEXT,
183
+    created_at        TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
184
+    updated_at        TIMESTAMP WITH TIME ZONE DEFAULT NOW()
185
+);
186
+```
187
+
188
+## Views
189
+
190
+### `smart_readings_reconstructed`
191
+Reconstructs complete SMART data from differential storage.
192
+
193
+```sql
194
+CREATE VIEW smart_readings_reconstructed AS
195
+WITH RECURSIVE reading_chain AS (
196
+    -- Base case: get baseline readings
197
+    SELECT id, hdd_id, serial_number, timestamp, 
198
+           parameters_json, temperature, reading_type,
199
+           previous_reading_id, 1 as chain_level
200
+    FROM smart_readings 
201
+    WHERE reading_type IN ('baseline', 'full')
202
+    
203
+    UNION ALL
204
+    
205
+    -- Recursive case: follow the chain of differential readings
206
+    SELECT sr.id, sr.hdd_id, sr.serial_number, sr.timestamp,
207
+           COALESCE(rc.parameters_json, '{}'::jsonb) || sr.parameters_json as parameters_json,
208
+           COALESCE(sr.temperature, rc.temperature) as temperature,
209
+           sr.reading_type, sr.previous_reading_id,
210
+           rc.chain_level + 1
211
+    FROM smart_readings sr
212
+    JOIN reading_chain rc ON sr.previous_reading_id = rc.id
213
+    WHERE sr.reading_type = 'differential'
214
+)
215
+SELECT id, hdd_id, serial_number, timestamp,
216
+       parameters_json, temperature, reading_type, chain_level
217
+FROM reading_chain;
218
+```
219
+
220
+### `latest_smart_readings`
221
+Current SMART status for all active drives.
222
+
223
+```sql
224
+CREATE VIEW latest_smart_readings AS
225
+SELECT DISTINCT ON (sr.hdd_id)
226
+    sr.id, sr.hdd_id, sr.serial_number, sr.timestamp,
227
+    sr.parameters_json, sr.temperature,
228
+    hi.model_name, hi.manufacturer, hi.size_gb,
229
+    hi.current_device_path, hi.current_node_id
230
+FROM smart_readings_reconstructed sr
231
+JOIN hdd_inventory hi ON sr.hdd_id = hi.id
232
+ORDER BY sr.hdd_id, sr.timestamp DESC;
233
+```
234
+
235
+### `drive_health_summary`
236
+Comprehensive health overview for all drives.
237
+
238
+```sql
239
+CREATE VIEW drive_health_summary AS
240
+SELECT 
241
+    hi.id as hdd_id, hi.serial_number, hi.model_name,
242
+    hi.manufacturer, hi.current_device_path, hi.current_node_id,
243
+    hi.status, lsr.timestamp as last_reading, lsr.temperature,
244
+    p.risk_level, p.failure_probability, p.predicted_failure_date,
245
+    EXTRACT(EPOCH FROM (NOW() - lsr.timestamp))/3600 as hours_since_last_reading
246
+FROM hdd_inventory hi
247
+LEFT JOIN latest_smart_readings lsr ON hi.id = lsr.hdd_id
248
+LEFT JOIN LATERAL (
249
+    SELECT risk_level, failure_probability, predicted_failure_date
250
+    FROM predictions 
251
+    WHERE hdd_id = hi.id 
252
+    ORDER BY timestamp DESC 
253
+    LIMIT 1
254
+) p ON true
255
+WHERE hi.status = 'active';
256
+```
257
+
258
+## Functions
259
+
260
+### `update_hdd_presence()`
261
+Manages HDD presence tracking when a drive is detected on a node.
262
+
263
+```sql
264
+CREATE OR REPLACE FUNCTION update_hdd_presence(
265
+    p_serial_number VARCHAR(64),
266
+    p_node VARCHAR(64)
267
+) RETURNS VOID AS $$
268
+BEGIN
269
+    -- Mark all previous presence records for this serial as historic
270
+    UPDATE hdd_presence 
271
+    SET is_current = FALSE 
272
+    WHERE serial_number = p_serial_number AND is_current = TRUE AND node <> p_node;
273
+    
274
+    -- Check if there's already a current presence for this serial/node
275
+    IF EXISTS (SELECT 1 FROM hdd_presence WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE) THEN
276
+        -- Update data_end for existing current presence
277
+        UPDATE hdd_presence 
278
+        SET data_end = NOW() 
279
+        WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE;
280
+    ELSE
281
+        -- Create new presence record
282
+        INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current)
283
+        VALUES (p_serial_number, p_node, NOW(), NOW(), TRUE);
284
+    END IF;
285
+END;
286
+$$ LANGUAGE plpgsql;
287
+```
288
+
289
+### `should_store_smart_reading()`
290
+Determines if a SMART reading should be stored based on differential storage logic.
291
+
292
+```sql
293
+CREATE OR REPLACE FUNCTION should_store_smart_reading(
294
+    p_hdd_id INTEGER,
295
+    p_parameters_json JSONB,
296
+    p_checksum VARCHAR(64),
297
+    p_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW()
298
+) RETURNS TABLE(
299
+    should_store BOOLEAN,
300
+    reading_type VARCHAR(20),
301
+    changes_detected BOOLEAN,
302
+    changed_parameters JSONB,
303
+    previous_reading_id INTEGER
304
+) AS $$
305
+-- Function implementation handles:
306
+-- - Differential storage enabled/disabled
307
+-- - Checksum-based change detection
308
+-- - Force intervals for full readings
309
+-- - Reading type determination
310
+$$;
311
+```
312
+
313
+## Indexes
314
+
315
+### Performance Indexes
316
+```sql
317
+-- hdd_inventory indexes
318
+CREATE INDEX idx_hdd_inventory_device_path ON hdd_inventory(current_device_path);
319
+CREATE INDEX idx_hdd_inventory_node ON hdd_inventory(current_node_id);
320
+CREATE INDEX idx_hdd_inventory_status ON hdd_inventory(status);
321
+CREATE INDEX idx_hdd_inventory_last_seen ON hdd_inventory(last_seen);
322
+
323
+-- hdd_presence indexes
324
+CREATE INDEX idx_hdd_presence_serial_current ON hdd_presence(serial_number, is_current);
325
+CREATE INDEX idx_hdd_presence_node ON hdd_presence(node);
326
+CREATE INDEX idx_hdd_presence_data_end ON hdd_presence(data_end DESC);
327
+
328
+-- smart_readings indexes
329
+CREATE INDEX idx_smart_readings_hdd_id ON smart_readings(hdd_id);
330
+CREATE INDEX idx_smart_readings_timestamp ON smart_readings(timestamp DESC);
331
+CREATE INDEX idx_smart_readings_serial ON smart_readings(serial_number);
332
+CREATE INDEX idx_smart_readings_device_path ON smart_readings(device_path);
333
+CREATE INDEX idx_smart_readings_type ON smart_readings(reading_type);
334
+CREATE INDEX idx_smart_readings_checksum ON smart_readings(checksum);
335
+CREATE INDEX idx_smart_readings_previous ON smart_readings(previous_reading_id);
336
+
337
+-- JSONB indexes for flexible queries
338
+CREATE INDEX idx_smart_readings_parameters ON smart_readings USING GIN (parameters_json);
339
+CREATE INDEX idx_smart_readings_changed_params ON smart_readings USING GIN (changed_parameters);
340
+```
341
+
342
+## Data Flow
343
+
344
+### Collection Process
345
+1. **Device Discovery**: Collector scans `/dev/sd*` and `/dev/nvme*` devices
346
+2. **SMART Reading**: Uses `smartctl` to extract device information and parameters
347
+3. **HDD Registration**: `get_or_create_hdd()` adds new devices to `hdd_inventory`
348
+4. **Presence Tracking**: `update_hdd_presence()` records current node location
349
+5. **Data Storage**: Stores SMART readings with differential optimization
350
+6. **Change Detection**: Uses checksums to detect parameter changes
351
+
352
+### Mobility Tracking
353
+1. **HDD Detected**: When HDD is found on a new node
354
+2. **Historic Records**: Previous presence records marked `is_current = FALSE`
355
+3. **New Presence**: New record created with `is_current = TRUE`
356
+4. **Timeline**: Complete history maintained with `data_start`/`data_end` timestamps
357
+
358
+### Query Examples
359
+
360
+#### Find HDD History
361
+```sql
362
+SELECT serial_number, node, data_start, data_end, is_current
363
+FROM hdd_presence 
364
+WHERE serial_number = 'ZW60K01R' 
365
+ORDER BY data_start DESC;
366
+```
367
+
368
+#### Current HDD Locations
369
+```sql
370
+SELECT h.serial_number, h.model_name, p.node, h.current_device_path
371
+FROM hdd_inventory h
372
+JOIN hdd_presence p ON h.serial_number = p.serial_number
373
+WHERE p.is_current = TRUE;
374
+```
375
+
376
+#### SMART Parameter Trends
377
+```sql
378
+SELECT timestamp, 
379
+       parameters_json->>'Power_On_Hours' as power_hours,
380
+       parameters_json->>'Temperature_Celsius' as temp,
381
+       temperature
382
+FROM smart_readings_reconstructed 
383
+WHERE serial_number = 'ZW60K01R' 
384
+ORDER BY timestamp DESC 
385
+LIMIT 10;
386
+```
387
+
388
+#### Health Summary
389
+```sql
390
+SELECT * FROM drive_health_summary 
391
+WHERE current_node_id = 'ebony';
392
+```
393
+
394
+## Troubleshooting
395
+
396
+### Common Issues
397
+
398
+#### 1. Node ID Mismatch
399
+**Problem**: HDD presence shows wrong node name
400
+**Cause**: Deploy script used local hostname instead of target node name
401
+**Solution**: Deploy script now correctly determines target node name from `cluster.json`
402
+
403
+#### 2. Empty hdd_presence Table
404
+**Problem**: No mobility tracking data
405
+**Causes**:
406
+- SMART parameter parsing regex incompatible with new smartctl format
407
+- Missing database sequence permissions
408
+- Incomplete smart_readings INSERT statements
409
+
410
+**Solutions**:
411
+- Updated regex to support both old and new smartctl formats
412
+- Added sequence permissions: `GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO autosmart;`
413
+- Fixed INSERT to include all required fields
414
+
415
+#### 3. Differential Storage Issues
416
+**Problem**: Too much or too little data stored
417
+**Configuration**: Adjust in `system_config` table:
418
+```sql
419
+UPDATE system_config SET value = 'false' WHERE config_key = 'differential_storage_enabled';
420
+UPDATE system_config SET value = '12' WHERE config_key = 'forced_storage_interval_hours';
421
+```
422
+
423
+## Permissions
424
+
425
+### Database User Setup
426
+```sql
427
+-- Create autosmart user
428
+CREATE USER autosmart WITH PASSWORD 'autoSMART2025!';
429
+
430
+-- Grant permissions
431
+GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO autosmart;
432
+GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO autosmart;
433
+GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO autosmart;
434
+```
435
+
436
+### Sequence Permissions
437
+```sql
438
+-- Required for INSERT operations with SERIAL columns
439
+GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO autosmart;
440
+```
441
+
442
+## Maintenance
443
+
444
+### Regular Tasks
445
+1. **Monitor disk usage**: SMART readings table grows over time
446
+2. **Archive old data**: Consider archiving readings older than 1 year
447
+3. **Index maintenance**: REINDEX periodically for performance
448
+4. **Backup**: Regular PostgreSQL backups recommended
449
+
450
+### Performance Monitoring
451
+```sql
452
+-- Table sizes
453
+SELECT schemaname, tablename, 
454
+       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
455
+FROM pg_tables 
456
+WHERE schemaname = 'public' 
457
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
458
+
459
+-- Recent activity
460
+SELECT COUNT(*) as readings_today 
461
+FROM smart_readings 
462
+WHERE timestamp > CURRENT_DATE;
463
+
464
+SELECT COUNT(*) as active_drives 
465
+FROM hdd_inventory 
466
+WHERE status = 'active';
467
+```
+991 -0
projects/autoSMART/docs/DEVELOPMENT.md
@@ -0,0 +1,991 @@
1
+# autoSMART Development Guide
2
+
3
+## 📚 Developer Documentation Index
4
+
5
+This document serves as the complete guide for developers working on autoSMART. It includes development environment setup, architecture documentation, testing procedures, and developer-specific changelog.
6
+
7
+### Quick Navigation
8
+- [Codebase Structure](#codebase-structure)
9
+- [Development Environment Setup](#development-environment-setup)
10
+- [Architecture Overview](#architecture-overview)
11
+- [Database Development](#database-development)
12
+- [Module Development](#module-development)
13
+- [Testing Strategies](#testing-strategies)
14
+- [Deployment Procedures](#deployment-procedures)
15
+- [Developer Changelog](#developer-changelog)
16
+- [Technical Reference](#technical-reference)
17
+
18
+## 📁 Codebase Structure
19
+
20
+autoSMART follows a modular architecture with clear separation of concerns. Below is the complete directory structure and file descriptions:
21
+
22
+### Project Root
23
+```
24
+autoSMART/
25
+├── README.md                    # Symlink to docs/README.md (end-user documentation)
26
+├── .deployignore               # Files excluded from production deployment
27
+├── config/                     # Configuration files and templates
28
+├── docs/                       # Documentation (mixed deployment)
29
+├── lib/                        # Perl modules and core libraries
30
+├── scripts/                    # Executable scripts and utilities
31
+└── sql/                        # Database schema and SQL files
32
+```
33
+
34
+### 📁 `/config/` - Configuration Management
35
+Configuration files are organized by scope and environment:
36
+
37
+```
38
+config/
39
+├── cluster.conf                # Cluster-wide settings (shared across nodes)
40
+├── cluster-ebony.conf         # Node-specific configuration for ebony
41
+├── database.conf              # PostgreSQL connection settings
42
+├── openai.conf               # OpenAI API configuration and prompts
43
+├── smart.conf                # SMART parameter thresholds and monitoring rules
44
+├── default                   # Default/template configuration
45
+└── debug-ebony.sh           # Development debugging script for ebony node
46
+```
47
+
48
+#### Configuration File Details
49
+- **`cluster.conf`** (88 lines): 
50
+  - Cluster topology and node definitions
51
+  - Node hostnames, IP addresses, and roles
52
+  - Shared monitoring parameters across cluster
53
+  - Global system settings and defaults
54
+  - Inter-node communication configuration
55
+  
56
+- **`database.conf`** (30 lines): 
57
+  - PostgreSQL connection parameters (host, port, database, credentials)
58
+  - Connection pooling settings and timeouts
59
+  - Database-specific optimizations and tuning parameters
60
+  - SSL configuration and security settings
61
+  
62
+- **`openai.conf`** (50 lines): 
63
+  - OpenAI API key and model configuration
64
+  - Prompt templates for failure prediction analysis
65
+  - Response parsing rules and confidence thresholds
66
+  - Rate limiting and cost management settings
67
+  - Fallback configurations for API failures
68
+  
69
+- **`smart.conf`** (57 lines): 
70
+  - SMART parameter monitoring thresholds for different drive types
71
+  - Critical parameter definitions and escalation rules
72
+  - Alert generation rules and notification preferences
73
+  - Parameter collection intervals and scheduling
74
+  - Drive type specific monitoring configurations
75
+  
76
+- **`default`** (107 lines): 
77
+  - Default/template configuration for new node deployments
78
+  - Standard parameter values and system defaults
79
+  - Configuration validation rules and constraints
80
+  - Example configurations with detailed comments
81
+  
82
+- **`cluster-ebony.conf`** (13 lines): 
83
+  - Node-specific configuration overrides for ebony node
84
+  - Local network settings and hardware-specific parameters
85
+  - Custom thresholds for specific hardware configurations
86
+  
87
+- **`debug-ebony.sh`** (29 lines): 
88
+  - Development debugging utilities for ebony node
89
+  - Test data generation and validation scripts
90
+  - Development environment setup and configuration
91
+  - Debugging tools and diagnostic utilities
92
+
93
+### 📁 `/lib/` - Core Perl Modules
94
+Core business logic implemented as reusable Perl modules:
95
+
96
+```
97
+lib/
98
+├── SmartCollector.pm          # SMART data collection and hardware tracking
99
+└── PredictionEngine.pm        # AI-powered failure prediction engine
100
+```
101
+
102
+#### Module Architecture
103
+- **`SmartCollector.pm`** (802 lines):
104
+  - **Hardware Identification**: Device detection using serial numbers and model names
105
+  - **SMART Data Collection**: Integration with smartmontools for comprehensive parameter collection
106
+  - **Migration Detection**: Algorithms to detect when drives move between nodes or device paths
107
+  - **Differential Storage**: Intelligent storage system that only saves changed parameters
108
+  - **Database Layer**: PostgreSQL integration with connection pooling and error handling
109
+  - **Storage Efficiency**: Real-time monitoring of storage optimization effectiveness
110
+  - **Configuration Management**: Dynamic configuration loading and validation
111
+  - **Error Handling**: Comprehensive error handling with detailed logging
112
+  
113
+- **`PredictionEngine.pm`** (607 lines):
114
+  - **OpenAI Integration**: Direct API communication with GPT models
115
+  - **Prompt Engineering**: Sophisticated prompt templates for failure prediction
116
+  - **Response Processing**: Parsing and validation of AI-generated predictions
117
+  - **Confidence Scoring**: Statistical analysis of prediction reliability
118
+  - **Timeline Estimation**: Failure time prediction with confidence intervals
119
+  - **Cost Optimization**: API usage optimization and request batching
120
+  - **Error Recovery**: Robust error handling for API failures and rate limits
121
+
122
+### 📁 `/scripts/` - Executable Components
123
+Production scripts and development utilities:
124
+
125
+```
126
+scripts/
127
+├── autosmart-collector.pl      # Main data collection daemon
128
+├── autosmart-predictor.pl      # AI prediction processing
129
+├── autosmart-report.pl         # Report generation engine
130
+├── autosmart-migration-report.pl # Hardware migration analysis
131
+├── smart-collector-daemon.pl   # Background collection service
132
+├── deploy.sh                   # Unified deployment script
133
+├── deploy-production.sh        # Production cluster deployment
134
+├── install.sh                  # Symlink to deploy.sh for compatibility
135
+├── uninstall.sh               # Complete system removal
136
+├── monitor-cluster.sh          # Cluster health monitoring
137
+├── test-smart-collection.pl    # SMART collection testing
138
+├── test-differential-storage.pl # Storage optimization testing
139
+├── test-db-connection.pl       # Database connectivity testing
140
+└── simple-smart-test.pl        # Basic SMART functionality test
141
+```
142
+
143
+#### Script Categories
144
+
145
+##### Production Scripts
146
+- **`autosmart-collector.pl`** (348 lines): 
147
+  - Main collection daemon that runs on each node
148
+  - Scheduled SMART data collection and processing
149
+  - Hardware detection and migration tracking
150
+  - Integration with SmartCollector.pm module
151
+  - Command-line options for daemon mode, single-run, and debugging
152
+  
153
+- **`autosmart-predictor.pl`** (483 lines): 
154
+  - Processes collected data for AI predictions
155
+  - Batch processing of pending SMART readings
156
+  - Integration with PredictionEngine.pm for OpenAI communication
157
+  - Prediction result storage and confidence tracking
158
+  
159
+- **`autosmart-report.pl`** (662 lines): 
160
+  - Generates comprehensive health reports and alerts
161
+  - Configurable report formats (summary, detailed, trend analysis)
162
+  - Email notification system for critical alerts
163
+  - Historical data analysis and trend detection
164
+  
165
+- **`smart-collector-daemon.pl`** (252 lines): 
166
+  - Background service wrapper for collector
167
+  - Process management and restart capabilities
168
+  - Log rotation and system integration
169
+  - Service status monitoring and health checks
170
+
171
+##### Deployment Scripts  
172
+- **`deploy.sh`** (697 lines): 
173
+  - Unified deployment for single node or cluster
174
+  - Supports install, uninstall, and cluster deployment modes
175
+  - Automatic dependency checking and installation
176
+  - Configuration template deployment and customization
177
+  - System service registration and startup
178
+  
179
+- **`deploy-production.sh`** (116 lines): 
180
+  - Production-specific deployment procedures
181
+  - Multi-node cluster deployment automation
182
+  - Production safety checks and validation
183
+  - Rollback capabilities for failed deployments
184
+  
185
+- **`uninstall.sh`** (187 lines): 
186
+  - Complete system cleanup and removal
187
+  - Service stopping and deregistration
188
+  - File and directory cleanup
189
+  - Database cleanup options (configurable)
190
+  
191
+- **`monitor-cluster.sh`** (515 lines): 
192
+  - Ongoing cluster health monitoring
193
+  - Node status verification and reporting
194
+  - Service health checks across all cluster nodes
195
+  - Automated restart capabilities for failed services
196
+
197
+##### Development & Testing Scripts
198
+- **`test-smart-collection.pl`** (132 lines): 
199
+  - Validates SMART data collection functionality
200
+  - Tests hardware detection and identification
201
+  - Verifies database connectivity and data storage
202
+  - Performance benchmarking for collection operations
203
+  
204
+- **`test-differential-storage.pl`** (270 lines): 
205
+  - Comprehensive testing of storage optimization
206
+  - Validates differential storage algorithms
207
+  - Tests change detection and storage efficiency
208
+  - Performance analysis and optimization verification
209
+  
210
+- **`test-db-connection.pl`** (55 lines): 
211
+  - Database connectivity verification
212
+  - Connection pooling and timeout testing
213
+  - SQL execution validation
214
+  - Database performance testing
215
+  
216
+- **`simple-smart-test.pl`** (144 lines): 
217
+  - Basic functionality testing
218
+  - Quick validation of core components
219
+  - Integration testing for development
220
+  - Smoke testing for deployment validation
221
+
222
+##### Analysis Scripts
223
+- **`autosmart-migration-report.pl`** (615 lines): 
224
+  - Hardware migration tracking and analysis
225
+  - Migration pattern detection and reporting
226
+  - Historical migration data analysis
227
+  - Migration-related issue identification and troubleshooting
228
+
229
+### 📁 `/sql/` - Database Schema
230
+PostgreSQL database definitions and utilities:
231
+
232
+```
233
+sql/
234
+├── schema.sql                  # Complete production database schema
235
+└── schema-fixed.sql           # Schema with specific fixes/patches
236
+```
237
+
238
+#### Database Schema Components
239
+- **Core Tables**: 
240
+  - `hdd_inventory`: Hardware identification and location tracking
241
+  - `smart_readings`: SMART parameter data with differential storage
242
+  - `hdd_migrations`: Drive movement logging between nodes/paths
243
+- **AI Integration**: 
244
+  - `predictions`: AI-generated failure predictions with confidence scores
245
+  - `alert_history`: Alert notification tracking and escalation
246
+- **Configuration**: 
247
+  - `smart_thresholds`: Configurable parameter thresholds and alert rules
248
+  - `system_config`: System-wide configuration parameters
249
+- **Optimization**: 
250
+  - Differential storage functions (`should_store_smart_reading()`)
251
+  - Reconstructed views (`smart_readings_reconstructed`)
252
+  - Change detection algorithms with SHA256 checksums
253
+- **Indexing**: 
254
+  - Performance-optimized indexes for temporal queries
255
+  - Hardware identification indexes for fast lookups
256
+  - Composite indexes for complex query patterns
257
+
258
+##### Schema Files Details
259
+- **`schema.sql`** (726 lines):
260
+  - Complete production database schema
261
+  - Full table definitions with constraints and indexes
262
+  - PostgreSQL functions for differential storage
263
+  - Views for data reconstruction and reporting
264
+  - Trigger definitions for automated processes
265
+  
266
+- **`schema-fixed.sql`** (423 lines):
267
+  - Schema patches and specific fixes
268
+  - Migration scripts for schema updates
269
+  - Performance optimization adjustments
270
+  - Compatibility fixes for different PostgreSQL versions
271
+
272
+### 📁 `/docs/` - Documentation
273
+Documentation organized by audience and deployment status:
274
+
275
+```
276
+docs/
277
+├── README.md                   # End-user guide (DEPLOYED)
278
+├── INSTALLATION.md             # Setup and configuration (DEPLOYED)
279
+├── CHANGELOG.md               # Release notes for end-users (DEPLOYED)
280
+├── API.md                     # OpenAI API configuration (DEPLOYED)
281
+├── DEVELOPMENT.md             # Developer guide (NOT DEPLOYED)
282
+└── DIFFERENTIAL_STORAGE.md    # Technical storage details (NOT DEPLOYED)
283
+```
284
+
285
+#### Documentation Deployment Strategy
286
+- **Deployed docs**: End-user facing documentation
287
+- **Non-deployed docs**: Developer and technical implementation details
288
+
289
+### 🔧 Key File Relationships
290
+
291
+#### Data Flow Architecture
292
+```
293
+smartmontools → SmartCollector.pm → PostgreSQL → PredictionEngine.pm → OpenAI API
294
+     ↓               ↓                    ↓              ↓
295
+autosmart-collector.pl → Database → autosmart-predictor.pl → Reports
296
+```
297
+
298
+#### Configuration Hierarchy
299
+```
300
+cluster.conf (global) → node-specific.conf → smart.conf → openai.conf
301
+                                ↓
302
+                        Individual script configurations
303
+```
304
+
305
+#### Module Dependencies
306
+```
307
+autosmart-collector.pl
308
+├── SmartCollector.pm
309
+├── database.conf
310
+├── smart.conf
311
+└── cluster.conf
312
+
313
+autosmart-predictor.pl
314
+├── PredictionEngine.pm
315
+├── SmartCollector.pm (for data access)
316
+├── openai.conf
317
+└── database.conf
318
+```
319
+
320
+### 📊 Codebase Metrics
321
+
322
+#### File Type Distribution
323
+- **Perl Scripts**: 8 production scripts + 4 test scripts (12 total)
324
+- **Perl Modules**: 2 core modules (1,409 total lines)
325
+- **Shell Scripts**: 5 deployment/management scripts (1,645 total lines)
326
+- **SQL Files**: 2 schema files (1,149 total lines)
327
+- **Configuration**: 7 configuration files (374 total lines)
328
+- **Documentation**: 5 documentation files
329
+
330
+#### Code Complexity by Lines of Code
331
+- **SmartCollector.pm**: 802 lines (High complexity - hardware integration, differential storage)
332
+- **PredictionEngine.pm**: 607 lines (Medium complexity - API integration, data processing)
333
+- **Database Schema**: 726 lines (High complexity - advanced PostgreSQL features)
334
+- **Deploy Scripts**: 697 lines each (Medium complexity - system integration)
335
+- **Report Generation**: 662 lines (Medium complexity - data analysis and formatting)
336
+- **Migration Analysis**: 615 lines (Medium complexity - pattern detection)
337
+- **Cluster Monitoring**: 515 lines (Medium complexity - distributed system monitoring)
338
+
339
+#### Total Codebase Size
340
+- **Production Code**: ~4,500 lines (Perl modules + production scripts)
341
+- **Deployment & Management**: ~1,800 lines (deployment and monitoring scripts)
342
+- **Testing Code**: ~600 lines (test scripts and utilities)
343
+- **Database Schema**: ~1,150 lines (PostgreSQL schema and functions)
344
+- **Configuration**: ~375 lines (configuration templates and examples)
345
+- **Total**: ~8,400+ lines of code
346
+
347
+#### Testing Coverage Areas
348
+- **Unit Tests**: Module-specific functionality testing
349
+- **Integration Tests**: End-to-end data flow validation
350
+- **Performance Tests**: Storage efficiency and query optimization benchmarks
351
+- **Deployment Tests**: Installation and configuration validation across environments
352
+- **Regression Tests**: Automated testing for core functionality preservation
353
+
354
+### 🏗️ Development Workflow
355
+
356
+#### Getting Started with Development
357
+1. **Clone Repository**: Set up local development environment
358
+2. **Database Setup**: Configure PostgreSQL connection to development database
359
+3. **Perl Dependencies**: Install required CPAN modules
360
+4. **Configuration**: Copy and customize configuration templates
361
+5. **Testing**: Run test suite to verify setup
362
+
363
+#### Adding New Features
364
+1. **Module Development**: Extend existing Perl modules or create new ones
365
+2. **Script Integration**: Create or modify scripts to use new functionality
366
+3. **Database Changes**: Update schema if new data structures are needed
367
+4. **Testing**: Add comprehensive tests for new functionality
368
+5. **Documentation**: Update both end-user and developer documentation
369
+
370
+#### Code Organization Principles
371
+- **Separation of Concerns**: Each module and script has a specific, well-defined responsibility
372
+- **Configuration-Driven**: System behavior is controlled through configuration files rather than hard-coded values
373
+- **Database-Centric**: PostgreSQL serves as the central data store with business logic in database functions
374
+- **Modular Design**: Components can be developed, tested, and deployed independently
375
+- **Error Handling**: Comprehensive error handling and logging throughout all components
376
+- **Performance-First**: Optimized for high-volume data collection and processing
377
+- **Scalability**: Designed to scale across multiple nodes in a cluster environment
378
+
379
+#### Development Patterns Used
380
+- **Factory Pattern**: Configuration-based object creation in Perl modules
381
+- **Observer Pattern**: Event-driven processing for hardware changes and alerts
382
+- **Strategy Pattern**: Configurable algorithms for different drive types and thresholds
383
+- **Template Method**: Standardized data processing pipelines with customizable steps
384
+- **Singleton Pattern**: Database connection management and configuration loading
385
+- **Command Pattern**: Script-based operations with standardized interfaces
386
+
387
+#### Code Quality Standards
388
+- **Perl Best Practices**: Strict warnings, proper scoping, and defensive programming
389
+- **Database Normalization**: Proper relational design with referential integrity
390
+- **Configuration Validation**: Input validation and sanitization throughout
391
+- **Error Recovery**: Graceful degradation and automatic recovery mechanisms
392
+- **Performance Monitoring**: Built-in performance metrics and optimization tracking
393
+- **Security Practices**: SQL injection prevention, input validation, and secure configuration management
394
+
395
+## 🏗️ Development Environment Setup
396
+
397
+### Prerequisites
398
+
399
+#### System Requirements
400
+- **Operating System**: Linux/macOS (tested on macOS, deployed on Proxmox VE)
401
+- **Perl**: Version 5.20+ with CPAN access
402
+- **PostgreSQL**: Version 13+ with JSONB and extension support
403
+- **Git**: For version control and collaboration
404
+
405
+#### Development Database
406
+```bash
407
+# Current test database configuration
408
+Host: 192.168.2.102
409
+Database: autosmart  
410
+User: postgres
411
+Password: (no password)
412
+Port: 5432
413
+```
414
+
415
+#### Required Perl Modules
416
+```bash
417
+# Core database modules
418
+cpan install DBI DBD::Pg
419
+
420
+# JSON processing
421
+cpan install JSON::XS
422
+
423
+# System utilities  
424
+cpan install Config::Simple File::Slurp Time::HiRes
425
+
426
+# Security and hashing
427
+cpan install Digest::SHA
428
+
429
+# HTTP/API clients (for OpenAI integration)
430
+cpan install LWP::UserAgent HTTP::Request::Common
431
+
432
+# Optional: Development and testing
433
+cpan install Data::Dumper Test::More Test::Exception
434
+```
435
+
436
+### Development Workflow
437
+
438
+#### 1. Environment Setup
439
+```bash
440
+# Clone the project
441
+cd /Users/bogdan/Documents/workspace/
442
+git clone <autoSMART-repo>
443
+cd autoSMART
444
+
445
+# Set environment variables
446
+export AUTOSMART_DB_HOST=192.168.2.102
447
+export AUTOSMART_DB_NAME=autosmart
448
+export AUTOSMART_DB_USER=postgres
449
+export AUTOSMART_DB_PASS=
450
+export AUTOSMART_DB_PORT=5432
451
+
452
+# Optional: OpenAI API key for AI features
453
+export OPENAI_API_KEY=your-api-key-here
454
+```
455
+
456
+#### 2. Database Setup
457
+```bash
458
+# Initialize the database schema
459
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql
460
+
461
+# Verify installation
462
+psql -h 192.168.2.102 -U postgres -d autosmart -c "\\dt"
463
+```
464
+
465
+#### 3. Testing Environment
466
+```bash
467
+# Run the differential storage test suite
468
+cd scripts/
469
+perl test-differential-storage.pl
470
+
471
+# Test database connectivity
472
+perl -e "
473
+use DBI;
474
+my \$dsn = 'DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432';
475
+my \$dbh = DBI->connect(\$dsn, 'postgres', '', {RaiseError => 1});
476
+print \"Database connection successful!\\n\";
477
+\$dbh->disconnect();
478
+"
479
+```
480
+
481
+## 🧩 Architecture Overview
482
+
483
+### System Components
484
+
485
+```
486
+autoSMART Architecture
487
+┌─────────────────────────────────────────────────────────────┐
488
+│                    Proxmox Cluster                          │
489
+├─────────────────────┬─────────────────────┬─────────────────┤
490
+│      Node 1         │       Node 2        │      Node 3     │
491
+│                     │                     │                 │
492
+│ ┌─── SmartCollector ┤ ┌─── SmartCollector ┤ ┌─── SmartCollector
493
+│ │   - HDD Scanning  │ │   - HDD Scanning  │ │   - HDD Scanning
494
+│ │   - SMART Reading │ │   - SMART Reading │ │   - SMART Reading  
495
+│ │   - Migration Det │ │   - Migration Det │ │   - Migration Det
496
+│ └─── Data Storage   │ └─── Data Storage   │ └─── Data Storage
497
+└─────────────────────┴─────────────────────┴─────────────────┘
498
+                               │
499
+                      ┌────────▼─────────┐
500
+                      │   PostgreSQL DB   │
501
+                      │                  │
502
+                      │ • HDD Inventory  │
503
+                      │ • SMART Readings │
504
+                      │ • Migrations     │
505
+                      │ • AI Predictions │
506
+                      └────────┬─────────┘
507
+                               │
508
+                    ┌──────────▼───────────┐
509
+                    │    SmartAnalyzer     │
510
+                    │                      │
511
+                    │ • OpenAI API         │
512
+                    │ • Failure Prediction │
513
+                    │ • Pattern Analysis   │
514
+                    └──────────┬───────────┘
515
+                               │
516
+                    ┌──────────▼───────────┐
517
+                    │    SmartReporter     │
518
+                    │                      │
519
+                    │ • Alert Generation   │
520
+                    │ • Report Creation    │
521
+                    │ • Dashboard Data     │
522
+                    └──────────────────────┘
523
+```
524
+
525
+### Data Flow
526
+
527
+1. **Collection Phase**:
528
+   - SmartCollector scans HDDs on each node
529
+   - Hardware identification (serial + model)
530
+   - Migration detection if HDD moved
531
+   - Differential storage decision
532
+   - Store only changed/critical data
533
+
534
+2. **Analysis Phase**:
535
+   - SmartAnalyzer processes stored data
536
+   - Historical pattern analysis
537
+   - OpenAI API calls for predictions
538
+   - Risk assessment and trending
539
+
540
+3. **Reporting Phase**:
541
+   - SmartReporter generates alerts
542
+   - Dashboard data preparation
543
+   - Health reports creation
544
+   - Maintenance recommendations
545
+
546
+## 🔧 Module Development
547
+
548
+### SmartCollector.pm Development
549
+
550
+#### Key Methods to Understand
551
+```perl
552
+# Hardware identification and migration detection
553
+sub _detect_or_create_hdd($drive_info, $smart_data)
554
+
555
+# Differential storage decision making
556
+sub _should_store_reading($hdd_id, $smart_data)
557
+
558
+# Optimized data storage
559
+sub _insert_smart_reading_differential($hdd_id, $drive_info, $smart_data, $storage_info)
560
+```
561
+
562
+#### Adding New Features
563
+1. **New SMART Parameters**:
564
+   ```perl
565
+   # Add parameter processing in collect_smart_data()
566
+   if ($line =~ /New_Parameter.*\s+(\d+)/) {
567
+       $smart_data->{parameters}{'New_Parameter'} = $1;
568
+   }
569
+   ```
570
+
571
+2. **Custom Manufacturer Detection**:
572
+   ```perl
573
+   # Extend _detect_manufacturer() method
574
+   sub _detect_manufacturer {
575
+       my ($self, $model) = @_;
576
+       return 'Custom_Manufacturer' if $model =~ /CUSTOM_PATTERN/;
577
+       # ... existing logic
578
+   }
579
+   ```
580
+
581
+### SmartAnalyzer.pm Development
582
+
583
+#### AI Integration Patterns
584
+```perl
585
+# OpenAI API call structure
586
+sub _call_openai_api {
587
+    my ($self, $prompt, $smart_data) = @_;
588
+    
589
+    my $request = HTTP::Request->new(POST => 'https://api.openai.com/v1/chat/completions');
590
+    $request->header('Authorization' => "Bearer $self->{openai_api_key}");
591
+    $request->header('Content-Type' => 'application/json');
592
+    
593
+    my $payload = {
594
+        model => "gpt-4",
595
+        messages => [
596
+            {
597
+                role => "system", 
598
+                content => "You are an expert in HDD failure prediction..."
599
+            },
600
+            {
601
+                role => "user",
602
+                content => $prompt
603
+            }
604
+        ]
605
+    };
606
+    
607
+    # ... handle response
608
+}
609
+```
610
+
611
+## 🗃️ Database Development
612
+
613
+### Schema Evolution
614
+
615
+#### Adding New Tables
616
+```sql
617
+-- Always include migration scripts
618
+CREATE TABLE new_feature (
619
+    id SERIAL PRIMARY KEY,
620
+    hdd_id INTEGER REFERENCES hdd_inventory(id),
621
+    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
622
+);
623
+
624
+-- Add indexes for performance
625
+CREATE INDEX idx_new_feature_hdd_id ON new_feature(hdd_id);
626
+```
627
+
628
+#### Modifying Existing Tables
629
+```sql
630
+-- Use ALTER statements for compatibility
631
+ALTER TABLE smart_readings ADD COLUMN new_field VARCHAR(100);
632
+CREATE INDEX CONCURRENTLY idx_smart_readings_new_field ON smart_readings(new_field);
633
+```
634
+
635
+### Query Optimization
636
+
637
+#### Efficient SMART Data Queries
638
+```sql
639
+-- Use the reconstructed view for complete data
640
+SELECT * FROM smart_readings_reconstructed 
641
+WHERE hdd_id = $1 
642
+  AND timestamp > NOW() - INTERVAL '30 days'
643
+ORDER BY timestamp DESC;
644
+
645
+-- Use raw table for storage statistics
646
+SELECT reading_type, COUNT(*) 
647
+FROM smart_readings 
648
+WHERE timestamp > NOW() - INTERVAL '7 days'
649
+GROUP BY reading_type;
650
+```
651
+
652
+## 🧪 Testing Guidelines
653
+
654
+### Unit Testing
655
+```perl
656
+# Example test structure
657
+use Test::More tests => 5;
658
+use lib '../lib';
659
+use SmartCollector;
660
+
661
+my $collector = SmartCollector->new({
662
+    db_host => '192.168.2.102',
663
+    db_name => 'autosmart_test',
664
+    # ... test config
665
+});
666
+
667
+# Test hardware identification
668
+my $hdd_id = $collector->_detect_or_create_hdd($drive_info, $smart_data);
669
+ok($hdd_id > 0, "HDD identification successful");
670
+
671
+# Test differential storage
672
+my $storage_decision = $collector->_should_store_reading($hdd_id, $smart_data);
673
+ok($storage_decision->{store}, "Storage decision made");
674
+```
675
+
676
+### Integration Testing
677
+```bash
678
+# Run the comprehensive test suite
679
+cd scripts/
680
+perl test-differential-storage.pl
681
+
682
+# Test with real hardware (if available)
683
+perl collect-smart-data.pl --test-mode --device /dev/sdb
684
+```
685
+
686
+### Performance Testing
687
+```sql
688
+-- Test query performance
689
+EXPLAIN ANALYZE 
690
+SELECT * FROM smart_readings_reconstructed 
691
+WHERE hdd_id IN (1,2,3,4,5) 
692
+  AND timestamp > NOW() - INTERVAL '90 days';
693
+
694
+-- Monitor storage efficiency
695
+SELECT 
696
+    reading_type,
697
+    COUNT(*) as readings,
698
+    AVG(length(parameters_json::text)) as avg_size_bytes
699
+FROM smart_readings 
700
+WHERE timestamp > NOW() - INTERVAL '24 hours'
701
+GROUP BY reading_type;
702
+```
703
+
704
+## 🔍 Debugging and Troubleshooting
705
+
706
+### Logging System
707
+```perl
708
+# Enable debug logging
709
+$ENV{AUTOSMART_DEBUG} = 3;  # Maximum verbosity
710
+
711
+# Log levels:
712
+# 1 = Errors only
713
+# 2 = Warnings and errors  
714
+# 3 = Info, warnings, errors
715
+# 4 = Debug everything
716
+```
717
+
718
+### Common Issues
719
+
720
+#### Database Connection Problems
721
+```bash
722
+# Test database connectivity
723
+psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT version();"
724
+
725
+# Check permissions
726
+psql -h 192.168.2.102 -U postgres -d autosmart -c "\\dp smart_readings"
727
+```
728
+
729
+#### SMART Data Collection Issues
730
+```bash
731
+# Test smartctl access
732
+sudo smartctl -a /dev/sda
733
+
734
+# Check permissions
735
+ls -la /dev/sd*
736
+```
737
+
738
+#### Migration Detection Problems
739
+```sql
740
+-- Check migration logs
741
+SELECT * FROM hdd_migrations 
742
+ORDER BY detected_at DESC 
743
+LIMIT 10;
744
+
745
+-- Verify HDD inventory
746
+SELECT serial_number, model_name, current_device_path, current_node_id 
747
+FROM hdd_inventory 
748
+WHERE status = 'active';
749
+```
750
+
751
+## 📊 Performance Monitoring
752
+
753
+### Database Performance
754
+```sql
755
+-- Monitor table sizes
756
+SELECT schemaname, tablename, 
757
+       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
758
+FROM pg_tables 
759
+WHERE schemaname = 'public'
760
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
761
+
762
+-- Monitor query performance
763
+SELECT query, mean_time, calls 
764
+FROM pg_stat_statements 
765
+WHERE query LIKE '%smart_readings%'
766
+ORDER BY mean_time DESC;
767
+```
768
+
769
+### Application Performance
770
+```perl
771
+# Add timing to critical operations
772
+use Time::HiRes qw(time);
773
+
774
+my $start_time = time();
775
+my $result = $self->collect_smart_data($device_path);
776
+my $duration = time() - $start_time;
777
+
778
+$self->_log("SMART collection took ${duration}s for $device_path", 3);
779
+```
780
+
781
+## 🚀 Deployment Guidelines
782
+
783
+### Production Deployment
784
+1. **Database Setup**:
785
+   - Use dedicated PostgreSQL server
786
+   - Configure proper backup strategy
787
+   - Set up monitoring and alerting
788
+
789
+2. **Security Configuration**:
790
+   - Use dedicated database users with minimal privileges
791
+   - Secure API keys and configuration files
792
+   - Enable SSL connections for database
793
+
794
+3. **Performance Tuning**:
795
+   - Configure PostgreSQL for time-series workload
796
+   - Set up proper indexing strategy
797
+   - Monitor and optimize slow queries
798
+
799
+### Proxmox Integration
800
+```bash
801
+# Install on cluster nodes
802
+for node in pve01 pve02 pve03; do
803
+    scp -r autoSMART/ root@$node:/etc/pve/
804
+done
805
+
806
+# Configure systemd services
807
+systemctl enable autosmart-collector
808
+systemctl start autosmart-collector
809
+```
810
+
811
+## 📚 Additional Resources
812
+
813
+### Useful Commands
814
+```bash
815
+# Monitor system in real-time
816
+watch -n 30 'psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT COUNT(*) FROM smart_readings WHERE timestamp > NOW() - INTERVAL '\''1 hour'\''"'
817
+
818
+# Generate performance report
819
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/performance-report.sql
820
+```
821
+
822
+### Development Tools
823
+- **pgAdmin**: Database administration and query development
824
+- **Perl::Critic**: Code quality analysis
825
+- **Perl::Tidy**: Code formatting
826
+- **Git**: Version control with feature branches
827
+
828
+## 📝 Developer Changelog
829
+
830
+This section contains detailed technical changes, internal API modifications, and development-specific information that is not relevant for end-users.
831
+
832
+### [1.0.0] - 2025-08-15 - Development Details
833
+
834
+#### 🏗️ Architecture Changes
835
+- **Database Schema Evolution**: Complete redesign from simple SMART storage to differential storage architecture
836
+- **Hardware Tracking Implementation**: Added `hdd_inventory` and `hdd_migrations` tables for hardware-based identification
837
+- **Differential Storage Engine**: Implemented `should_store_smart_reading()` PostgreSQL function with configurable change detection
838
+- **Migration Detection Algorithm**: Created automatic hardware migration detection using serial numbers and model matching
839
+
840
+#### 🔧 Internal API Changes
841
+- **SmartCollector.pm Refactor**: 
842
+  - Added hardware identification methods (`identify_hardware()`, `detect_migration()`)
843
+  - Implemented differential storage integration (`should_store_reading()`)
844
+  - Added storage efficiency monitoring
845
+  - Breaking change: Constructor now requires database handle
846
+- **Database Functions**: 
847
+  - Added `should_store_smart_reading(jsonb, text, text, interval, text[])` function
848
+  - Added `smart_readings_reconstructed` view for seamless data access
849
+  - Added migration tracking triggers
850
+- **Configuration Schema**: 
851
+  - Split configuration into cluster-wide (`cluster.conf`) and node-specific (`autosmart.conf`)
852
+  - Added differential storage parameters (`force_storage_interval`, `critical_parameters`)
853
+
854
+#### 🧪 Testing Infrastructure
855
+- **Differential Storage Test Suite**: Added comprehensive test coverage in `test-differential-storage.pl`
856
+- **Migration Detection Tests**: Validated hardware tracking across different scenarios
857
+- **Performance Benchmarks**: Established baseline performance metrics for storage efficiency
858
+- **Database Integration Tests**: Added tests for PostgreSQL function behavior
859
+
860
+#### 📊 Performance Optimizations
861
+- **Storage Efficiency**: Achieved 60-80% database size reduction through differential storage
862
+- **Query Optimization**: Added proper indexing for hardware tracking and temporal queries
863
+- **Background Processing**: Implemented non-blocking collection and analysis workflows
864
+- **Memory Management**: Optimized Perl module memory usage for long-running processes
865
+
866
+#### 🔒 Security Enhancements
867
+- **Configuration Security**: Separated sensitive configuration from shared cluster config
868
+- **Database Security**: Implemented proper user permissions and access controls
869
+- **API Key Management**: Secure storage and rotation procedures for OpenAI API keys
870
+- **Audit Trail**: Complete logging of all system changes and data access
871
+
872
+#### 🐛 Known Technical Issues
873
+- **Large Dataset Performance**: Initial data collection on large clusters may require tuning
874
+- **Migration Detection Edge Cases**: Rare scenarios with identical drives may need manual verification
875
+- **PostgreSQL Version Compatibility**: Requires PostgreSQL 13+ for JSONB and advanced indexing features
876
+- **Perl Module Dependencies**: Some CPAN modules may require system-level library installation
877
+
878
+#### 🔮 Technical Roadmap
879
+- **Phase 2**: Real-time streaming data collection with Apache Kafka
880
+- **Phase 3**: Machine learning model training on historical data
881
+- **Phase 4**: Integration with Proxmox VE API for automated responses
882
+- **Phase 5**: Multi-tenant architecture for managed service providers
883
+
884
+#### 💻 Development Environment Notes
885
+- **Test Database**: Currently using `192.168.2.102` for development and testing
886
+- **Perl Version**: Developed and tested on Perl 5.32+
887
+- **PostgreSQL Extensions**: Requires `uuid-ossp` and `btree_gin` extensions
888
+- **Development Workflow**: Feature branch development with PR reviews required
889
+
890
+## 🔧 Technical Reference for Developers
891
+
892
+### Database Schema Reference
893
+- **Primary location**: `../sql/schema.sql`
894
+- **Documentation**: [DIFFERENTIAL_STORAGE.md](DIFFERENTIAL_STORAGE.md), [MIGRATION_DETECTION.md](MIGRATION_DETECTION.md)
895
+- **Sample queries**: `../sql/sample-queries.sql`
896
+- **Migration scripts**: `../sql/migrations/`
897
+
898
+### Perl Module Architecture
899
+- **SmartCollector.pm**: Data collection and hardware tracking
900
+  - Hardware manufacturer detection
901
+  - Migration detection and logging  
902
+  - Differential storage integration
903
+  - Storage efficiency monitoring
904
+- **SmartAnalyzer.pm**: AI-powered analysis and predictions  
905
+- **SmartReporter.pm**: Report generation and alerting
906
+- **Module documentation**: Inline POD documentation in each module
907
+
908
+### Configuration Management
909
+- **Cluster config**: `../config/cluster.conf` (shared across all nodes)
910
+- **Node config**: `../config/defaults/autosmart` (node-specific settings)
911
+- **OpenAI config**: `../config/openai.conf` (API configuration)
912
+- **Configuration documentation**: [INSTALLATION.md](INSTALLATION.md)
913
+
914
+### Scripts and Development Tools
915
+- **Collection**: `../scripts/collect-smart-data.pl`
916
+- **Analysis**: `../scripts/analyze-smart-data.pl`
917
+- **Reporting**: `../scripts/generate-reports.pl`
918
+- **Testing**: `../scripts/test-differential-storage.pl`
919
+- **Deployment**: `../scripts/deploy.sh`, `../scripts/deploy-production.sh`
920
+
921
+### Development Scenarios
922
+
923
+#### Scenario 1: Adding New SMART Parameters
924
+**Files to modify**:
925
+1. `lib/SmartCollector.pm` - Add parameter collection logic
926
+2. `sql/schema.sql` - Update parameter definitions if needed
927
+3. `scripts/test-differential-storage.pl` - Add parameter tests
928
+4. `docs/DIFFERENTIAL_STORAGE.md` - Document parameter behavior
929
+
930
+#### Scenario 2: Implementing New AI Prediction Models
931
+**Files to modify**:
932
+1. `lib/SmartAnalyzer.pm` - Add new prediction algorithms
933
+2. `docs/API.md` - Update API integration patterns
934
+3. `scripts/analyze-smart-data.pl` - Add model selection logic
935
+4. `sql/schema.sql` - Add prediction result tables if needed
936
+
937
+#### Scenario 3: Performance Optimization
938
+**Areas to investigate**:
939
+1. `docs/DIFFERENTIAL_STORAGE.md` - Storage optimization techniques
940
+2. `sql/schema.sql` - Index optimization
941
+3. `lib/SmartCollector.pm` - Collection efficiency
942
+4. PostgreSQL query performance using `EXPLAIN ANALYZE`
943
+
944
+#### Scenario 4: Adding New Hardware Support
945
+**Files to modify**:
946
+1. `lib/SmartCollector.pm` - Hardware detection logic
947
+2. `docs/MIGRATION_DETECTION.md` - Hardware tracking specifications
948
+3. `scripts/test-differential-storage.pl` - Hardware-specific tests
949
+4. Configuration templates for new hardware types
950
+
951
+### Code Quality Guidelines
952
+
953
+#### Perl Coding Standards
954
+```perl
955
+# Use strict and warnings
956
+use strict;
957
+use warnings;
958
+
959
+# Consistent indentation (4 spaces)
960
+sub example_function {
961
+    my ($self, $param) = @_;
962
+    
963
+    # Clear variable names
964
+    my $smart_data = $self->collect_smart_data($param);
965
+    
966
+    # Error handling
967
+    return unless defined $smart_data;
968
+    
969
+    return $smart_data;
970
+}
971
+```
972
+
973
+#### Database Development Patterns
974
+```sql
975
+-- Use transactions for data consistency
976
+BEGIN;
977
+    -- Multiple related operations
978
+    INSERT INTO hdd_inventory (...) VALUES (...);
979
+    INSERT INTO smart_readings (...) VALUES (...);
980
+COMMIT;
981
+
982
+-- Use proper indexing
983
+CREATE INDEX CONCURRENTLY idx_smart_readings_timestamp 
984
+ON smart_readings(timestamp DESC, serial_number);
985
+
986
+-- Use parameterized queries to prevent SQL injection
987
+my $sth = $dbh->prepare("SELECT * FROM smart_readings WHERE serial_number = ?");
988
+$sth->execute($serial_number);
989
+```
990
+
991
+This development guide provides the foundation for extending and maintaining the autoSMART system. Follow these guidelines to ensure code quality, performance, and reliability.
+204 -0
projects/autoSMART/docs/DIFFERENTIAL_STORAGE.md
@@ -0,0 +1,204 @@
1
+# autoSMART Differential Storage System
2
+
3
+## Overview
4
+
5
+The autoSMART v1.0 system now implements **differential storage optimization** to significantly reduce database storage requirements while maintaining full data integrity and analysis capabilities.
6
+
7
+## How It Works
8
+
9
+### Storage Strategy
10
+
11
+Instead of storing complete SMART readings for every collection cycle, the system intelligently stores only:
12
+
13
+1. **Baseline readings** - First reading for each HDD
14
+2. **Full readings** - When critical parameters change or forced intervals are reached
15
+3. **Differential readings** - When only non-critical parameters change (stores only the changes)
16
+4. **Skipped readings** - When no changes are detected (no storage)
17
+
18
+### Change Detection
19
+
20
+The system uses multiple methods to detect changes:
21
+
22
+- **Checksum comparison** - SHA256 hash of all parameters + temperature
23
+- **Parameter-level analysis** - Individual SMART parameter change detection
24
+- **Critical parameter monitoring** - Immediate storage for health-critical changes
25
+- **Temperature thresholds** - Configurable temperature change sensitivity
26
+- **Time-based forcing** - Periodic full readings regardless of changes (default: 24 hours)
27
+
28
+## Database Schema Changes
29
+
30
+### Enhanced smart_readings Table
31
+
32
+```sql
33
+ALTER TABLE smart_readings ADD COLUMN reading_type VARCHAR(20) DEFAULT 'full';
34
+ALTER TABLE smart_readings ADD COLUMN changes_detected BOOLEAN DEFAULT true;
35
+ALTER TABLE smart_readings ADD COLUMN changed_parameters JSONB;
36
+ALTER TABLE smart_readings ADD COLUMN previous_reading_id INTEGER REFERENCES smart_readings(id);
37
+ALTER TABLE smart_readings ADD COLUMN checksum VARCHAR(64);
38
+```
39
+
40
+### New PostgreSQL Function
41
+
42
+The `should_store_smart_reading()` function provides intelligent storage decisions:
43
+
44
+```sql
45
+SELECT should_store_smart_reading(hdd_id, parameters_json, checksum, current_timestamp);
46
+```
47
+
48
+Returns:
49
+- `should_store` - Boolean indicating if reading should be stored
50
+- `reading_type` - 'baseline', 'full', or 'differential'
51
+- `changes_detected` - Boolean indicating if changes were found
52
+- `changed_parameters` - JSON array of changed parameter names
53
+- `previous_reading_id` - Reference to previous reading for chaining
54
+
55
+### Reconstructed Data View
56
+
57
+The `smart_readings_reconstructed` view uses recursive SQL to rebuild complete SMART data from differential readings:
58
+
59
+```sql
60
+SELECT * FROM smart_readings_reconstructed WHERE hdd_id = 123;
61
+```
62
+
63
+## Configuration Parameters
64
+
65
+Add to `system_config` table:
66
+
67
+```sql
68
+INSERT INTO system_config (key, value, description) VALUES
69
+('differential_storage_enabled', 'true', 'Enable differential storage optimization'),
70
+('forced_storage_interval_hours', '24', 'Hours between forced full readings'),
71
+('critical_parameter_force_store', 'true', 'Force storage for critical parameter changes'),
72
+('temperature_change_threshold', '5', 'Temperature change threshold for storage (Celsius)');
73
+```
74
+
75
+## Updated Perl Modules
76
+
77
+### SmartCollector.pm Changes
78
+
79
+1. **New methods**:
80
+   - `_should_store_reading()` - Check storage requirements
81
+   - `_insert_smart_reading_differential()` - Store with differential info
82
+   - `_get_recent_storage_stats()` - Monitor storage efficiency
83
+
84
+2. **Enhanced collection**:
85
+   - Automatic change detection
86
+   - Storage type determination
87
+   - Efficiency reporting
88
+
89
+3. **Storage optimization**:
90
+   - Only changed parameters stored for differential readings
91
+   - Checksum validation
92
+   - Chain reference tracking
93
+
94
+## Benefits
95
+
96
+### Storage Reduction
97
+
98
+Expected storage reduction of **60-80%** for typical HDD environments:
99
+
100
+- **Baseline readings**: ~1% of all readings
101
+- **Full readings**: ~15-20% of readings (critical changes + forced intervals)
102
+- **Differential readings**: ~5-15% of readings (minor changes)
103
+- **Skipped readings**: ~60-75% of readings (no changes)
104
+
105
+### Performance Impact
106
+
107
+- **Minimal collection overhead**: Single database function call for decision
108
+- **Fast reconstruction**: Recursive SQL with indexes
109
+- **Efficient queries**: Reconstructed view handles complexity
110
+
111
+### Data Integrity
112
+
113
+- **Complete reconstruction**: All historical data accessible
114
+- **Change tracking**: Full audit trail of parameter changes
115
+- **Critical monitoring**: No loss of important health indicators
116
+
117
+## Usage Examples
118
+
119
+### Collection with Statistics
120
+
121
+```perl
122
+use SmartCollector;
123
+
124
+my $collector = SmartCollector->new($config);
125
+my $result = $collector->collect_all();
126
+
127
+print "Storage efficiency: " . $result->{storage_stats}->{efficiency_percent} . "%\n";
128
+print "Differential readings: " . $result->{storage_stats}->{differential} . "\n";
129
+```
130
+
131
+### Testing the System
132
+
133
+Run the comprehensive test suite:
134
+
135
+```bash
136
+cd /etc/pve/autoSMART
137
+./scripts/test-differential-storage.pl
138
+```
139
+
140
+This will:
141
+1. Create test HDD entries
142
+2. Test storage decisions for various change scenarios
143
+3. Validate data reconstruction
144
+4. Show storage efficiency statistics
145
+
146
+## Migration from Legacy Data
147
+
148
+Existing installations can migrate seamlessly:
149
+
150
+1. **Schema updates**: Run the enhanced schema SQL
151
+2. **Existing data**: Marked as 'full' readings automatically
152
+3. **No data loss**: All existing readings preserved
153
+4. **Gradual optimization**: New readings use differential storage immediately
154
+
155
+## Monitoring and Maintenance
156
+
157
+### Storage Statistics Query
158
+
159
+```sql
160
+SELECT 
161
+    reading_type,
162
+    COUNT(*) as count,
163
+    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as percentage
164
+FROM smart_readings 
165
+WHERE timestamp > NOW() - INTERVAL '7 days'
166
+GROUP BY reading_type;
167
+```
168
+
169
+### Reconstruction Performance
170
+
171
+```sql
172
+EXPLAIN ANALYZE 
173
+SELECT * FROM smart_readings_reconstructed 
174
+WHERE hdd_id = 123 AND timestamp > NOW() - INTERVAL '30 days';
175
+```
176
+
177
+### Space Savings Report
178
+
179
+```sql
180
+SELECT 
181
+    COUNT(*) as total_possible_readings,
182
+    COUNT(*) FILTER (WHERE reading_type != 'skipped') as stored_readings,
183
+    (COUNT(*) FILTER (WHERE reading_type != 'skipped') * 100.0 / COUNT(*)) as storage_percentage,
184
+    (100 - (COUNT(*) FILTER (WHERE reading_type != 'skipped') * 100.0 / COUNT(*))) as savings_percentage
185
+FROM smart_readings 
186
+WHERE timestamp > NOW() - INTERVAL '30 days';
187
+```
188
+
189
+## Critical Parameters List
190
+
191
+Default parameters that trigger immediate full storage:
192
+- Reallocated_Sector_Ct
193
+- Current_Pending_Sector  
194
+- Offline_Uncorrectable
195
+- Reallocated_Event_Count
196
+- Spin_Retry_Count
197
+
198
+Configure in `smart_thresholds` table with `weight >= 8.0`.
199
+
200
+## Conclusion
201
+
202
+The differential storage system provides significant storage optimization while maintaining complete data integrity and analytical capabilities. The system automatically adapts to HDD behavior patterns, storing more data when drives show issues and reducing storage when drives are stable.
203
+
204
+This optimization is particularly beneficial for large-scale deployments like the Madagascar cluster, where hundreds of HDDs generate continuous SMART data over years of operation.
+675 -0
projects/autoSMART/docs/INSTALLATION.md
@@ -0,0 +1,675 @@
1
+# autoSMART Installation and Setup Guide
2
+
3
+## 🚀 Quick Start
4
+
5
+### Prerequisites Checklist
6
+
7
+#### System Requirements
8
+- ✅ **Operating System**: Linux (Ubuntu 20.04+, CentOS 8+, Proxmox VE 7+)
9
+- ✅ **Perl**: Version 5.20+ with CPAN access
10
+- ✅ **PostgreSQL**: Version 13+ with JSONB support
11
+- ✅ **Hardware Access**: sudo/root access for SMART data collection
12
+- ✅ **Network**: Access to OpenAI API (optional, for AI predictions)
13
+
14
+#### Test Database Available
15
+```
16
+Host: 192.168.2.102
17
+Database: autosmart
18
+User: postgres
19
+Password: (no password)
20
+Port: 5432
21
+```
22
+
23
+## 🔧 Installation Steps
24
+
25
+### 1. System Dependencies
26
+
27
+#### Ubuntu/Debian
28
+```bash
29
+# Update system packages
30
+sudo apt update && sudo apt upgrade -y
31
+
32
+# Install system dependencies
33
+sudo apt install -y perl postgresql-client smartmontools git curl
34
+
35
+# Install PostgreSQL server (if not using remote database)
36
+sudo apt install -y postgresql postgresql-contrib
37
+
38
+# Install Perl development tools
39
+sudo apt install -y build-essential cpanminus libdbi-perl
40
+```
41
+
42
+#### CentOS/RHEL/Rocky Linux
43
+```bash
44
+# Update system packages
45
+sudo dnf update -y
46
+
47
+# Install system dependencies
48
+sudo dnf install -y perl postgresql smartmontools git curl
49
+
50
+# Install development tools
51
+sudo dnf groupinstall -y "Development Tools"
52
+sudo dnf install -y perl-App-cpanminus perl-DBI
53
+```
54
+
55
+#### Proxmox VE
56
+```bash
57
+# Proxmox already includes most dependencies
58
+apt update
59
+apt install -y cpanminus libdbi-perl libdbd-pg-perl libjson-xs-perl
60
+```
61
+
62
+### 2. Perl Modules Installation
63
+
64
+#### Required Modules
65
+```bash
66
+# Core database connectivity
67
+sudo cpanm DBI DBD::Pg
68
+
69
+# JSON processing
70
+sudo cpanm JSON::XS
71
+
72
+# Configuration and utilities
73
+sudo cpanm Config::Simple File::Slurp Time::HiRes Digest::SHA
74
+
75
+# HTTP clients for API integration
76
+sudo cpanm LWP::UserAgent HTTP::Request::Common
77
+
78
+# Optional: Testing modules
79
+sudo cpanm Test::More Test::Exception Data::Dumper
80
+```
81
+
82
+#### Verify Perl Module Installation
83
+```bash
84
+perl -e "
85
+use DBI;
86
+use JSON::XS; 
87
+use Config::Simple;
88
+use Digest::SHA;
89
+use LWP::UserAgent;
90
+print \"All required Perl modules installed successfully!\n\";
91
+"
92
+```
93
+
94
+### 3. Database Setup
95
+
96
+#### Option A: Use Test Database (Recommended for Development)
97
+```bash
98
+# Test connection to existing database
99
+psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT version();"
100
+
101
+# If successful, skip to step 4 - Project Installation
102
+```
103
+
104
+#### Option B: Local PostgreSQL Installation
105
+```bash
106
+# Install PostgreSQL
107
+sudo apt install -y postgresql postgresql-contrib
108
+
109
+# Start and enable PostgreSQL
110
+sudo systemctl start postgresql
111
+sudo systemctl enable postgresql
112
+
113
+# Create database and user
114
+sudo -u postgres psql << EOF
115
+CREATE DATABASE autosmart;
116
+CREATE USER autosmart WITH PASSWORD 'smartpassword';
117
+GRANT ALL PRIVILEGES ON DATABASE autosmart TO autosmart;
118
+ALTER USER autosmart CREATEDB;
119
+\q
120
+EOF
121
+```
122
+
123
+#### Option C: Remote PostgreSQL Setup
124
+```bash
125
+# Connect to your PostgreSQL server
126
+psql -h your-db-host -U postgres
127
+
128
+# Create database and configure
129
+CREATE DATABASE autosmart;
130
+CREATE USER autosmart WITH PASSWORD 'your-secure-password';
131
+GRANT ALL PRIVILEGES ON DATABASE autosmart TO autosmart;
132
+```
133
+
134
+### 4. Project Installation
135
+
136
+#### Download and Setup
137
+```bash
138
+# Create installation directory
139
+sudo mkdir -p /etc/pve/autoSMART
140
+cd /etc/pve/autoSMART
141
+
142
+# Clone or copy project files (adjust as needed)
143
+# git clone https://github.com/your-repo/autoSMART.git .
144
+# OR copy from development workspace:
145
+cp -r /Users/bogdan/Documents/workspace/autoSMART/* .
146
+
147
+# Set proper ownership and permissions
148
+sudo chown -R root:root .
149
+chmod +x scripts/*.pl
150
+chmod 600 config/cluster.conf
151
+```
152
+
153
+#### Directory Structure Verification
154
+```bash
155
+tree /etc/pve/autoSMART
156
+# Should show:
157
+# ├── config/
158
+# ├── docs/
159
+# ├── lib/
160
+# ├── scripts/
161
+# ├── sql/
162
+# └── README.md
163
+```
164
+
165
+### 5. Database Deployment
166
+
167
+autoSMART uses PostgreSQL for storing SMART data, configurations, and analysis results. You can deploy the database schema from your development machine using the included deployment scripts.
168
+
169
+#### Prerequisites for Database Deployment
170
+- ✅ **psql** client installed on development machine (macOS/Linux)
171
+- ✅ **Network access** to target PostgreSQL server
172
+- ✅ **Database credentials** with schema creation privileges
173
+- ✅ **Target database** already created and accessible
174
+
175
+#### Database Deployment with deploy.sh
176
+
177
+The `deploy.sh` script can install the database schema remotely using psql from your development machine:
178
+
179
+```bash
180
+# Show help and available options
181
+./deploy.sh
182
+
183
+# Deploy database schema to remote PostgreSQL server
184
+./deploy.sh install database --db-host 192.168.2.102 --db-user postgres --db-name autosmart
185
+
186
+# Deploy with custom credentials
187
+./deploy.sh install database \
188
+  --db-host your-postgres-server.local \
189
+  --db-user autosmart \
190
+  --db-pass your-password \
191
+  --db-name autosmart_prod
192
+```
193
+
194
+#### Manual Database Installation from Development Machine
195
+
196
+If you prefer manual control over the database installation:
197
+
198
+```bash
199
+# 1. Ensure psql is available on your development machine
200
+# macOS:
201
+brew install postgresql
202
+
203
+# Ubuntu/Debian:
204
+sudo apt install postgresql-client
205
+
206
+# 2. Test connection to target database
207
+psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT version();"
208
+
209
+# 3. Install the complete schema
210
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql
211
+
212
+# 4. Verify schema installation
213
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
214
+SELECT 
215
+    schemaname,
216
+    tablename,
217
+    tableowner 
218
+FROM pg_tables 
219
+WHERE schemaname = 'public' 
220
+ORDER BY tablename;
221
+"
222
+
223
+# 5. Check database functions and triggers
224
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
225
+SELECT 
226
+    proname as function_name,
227
+    pg_get_function_result(oid) as return_type
228
+FROM pg_proc 
229
+WHERE pronamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public')
230
+ORDER BY proname;
231
+"
232
+```
233
+
234
+#### Database Schema Components
235
+
236
+The autoSMART database schema includes:
237
+
238
+**Core Tables:**
239
+- `hdd_inventory` - Physical drive tracking and migration history
240
+- `smart_readings` - Raw SMART data collection (differential storage)
241
+- `smart_thresholds` - Drive-specific alert thresholds
242
+- `predictions` - AI-generated failure predictions
243
+- `alert_history` - System alerts and notifications
244
+- `system_config` - Cluster-wide configuration settings
245
+
246
+**Analytical Views:**
247
+- `smart_readings_reconstructed` - Full SMART data reconstruction from differential storage
248
+- `latest_smart_readings` - Most recent SMART values per drive
249
+- `drive_health_summary` - Drive health status and trend analysis
250
+
251
+**Functions and Triggers:**
252
+- `differential_storage_trigger()` - Automatic differential storage on SMART updates
253
+- `update_drive_health()` - Health score calculation
254
+- `cleanup_old_readings()` - Data retention management
255
+
256
+#### Database Verification Commands
257
+
258
+```bash
259
+# Verify all components are installed
260
+psql -h 192.168.2.102 -U postgres -d autosmart << EOF
261
+
262
+-- Check table count and sizes
263
+SELECT 
264
+    schemaname,
265
+    tablename,
266
+    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
267
+FROM pg_tables 
268
+WHERE schemaname = 'public';
269
+
270
+-- Check views
271
+SELECT 
272
+    schemaname,
273
+    viewname,
274
+    definition
275
+FROM pg_views 
276
+WHERE schemaname = 'public';
277
+
278
+-- Test differential storage function
279
+SELECT differential_storage_trigger() as function_test;
280
+
281
+-- Verify database is ready
282
+SELECT 'autoSMART database ready!' as status;
283
+
284
+EOF
285
+```
286
+
287
+#### Troubleshooting Database Installation
288
+
289
+**Connection Issues:**
290
+```bash
291
+# Test basic connectivity
292
+ping 192.168.2.102
293
+
294
+# Test PostgreSQL port
295
+telnet 192.168.2.102 5432
296
+
297
+# Test authentication
298
+psql -h 192.168.2.102 -U postgres -d postgres -c "SELECT current_user;"
299
+```
300
+
301
+**Schema Installation Issues:**
302
+```bash
303
+# Check for existing schema conflicts
304
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
305
+SELECT table_name FROM information_schema.tables 
306
+WHERE table_schema = 'public' AND table_name LIKE '%smart%';
307
+"
308
+
309
+# Force clean installation (⚠️ DESTRUCTIVE)
310
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
311
+DROP SCHEMA public CASCADE;
312
+CREATE SCHEMA public;
313
+GRANT ALL ON SCHEMA public TO postgres;
314
+GRANT ALL ON SCHEMA public TO public;
315
+"
316
+
317
+# Reinstall schema
318
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql
319
+```
320
+
321
+### 6. Database Schema Installation (Legacy Method)
322
+
323
+#### Using Test Database (192.168.2.102)
324
+```bash
325
+# Install the complete schema
326
+cd /etc/pve/autoSMART
327
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql
328
+
329
+# Verify installation
330
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
331
+SELECT table_name FROM information_schema.tables 
332
+WHERE table_schema = 'public' 
333
+ORDER BY table_name;
334
+"
335
+```
336
+
337
+#### Expected Tables
338
+- ✅ hdd_inventory
339
+- ✅ hdd_migrations  
340
+- ✅ smart_readings
341
+- ✅ predictions
342
+- ✅ smart_thresholds
343
+- ✅ alert_history
344
+- ✅ system_config
345
+
346
+#### Expected Views
347
+- ✅ smart_readings_reconstructed
348
+- ✅ latest_smart_readings
349
+- ✅ drive_health_summary
350
+
351
+### 7. Configuration
352
+
353
+#### Cluster Configuration
354
+```bash
355
+# Edit cluster-wide settings
356
+nano /etc/pve/autoSMART/config/cluster.conf
357
+```
358
+
359
+```ini
360
+[database]
361
+host = 192.168.2.102
362
+port = 5432
363
+name = autosmart
364
+user = postgres
365
+password = 
366
+
367
+[collection]
368
+interval = 1800
369
+timeout = 60
370
+madagascar_inventory_path = /opt/madagascar/inventory.json
371
+
372
+[ai_predictions]
373
+enabled = true
374
+openai_api_key = your-openai-api-key-here
375
+openai_model = gpt-4
376
+prediction_interval = 86400
377
+
378
+[alerts]
379
+enabled = true
380
+email_notifications = true
381
+slack_webhook = https://hooks.slack.com/your-webhook
382
+```
383
+
384
+#### Local Node Configuration
385
+```bash
386
+# Copy default configuration
387
+cp config/defaults/autosmart /etc/default/autosmart
388
+
389
+# Edit local settings
390
+nano /etc/default/autosmart
391
+```
392
+
393
+```bash
394
+# autoSMART local configuration
395
+AUTOSMART_DEBUG=2
396
+AUTOSMART_NODE_ID=$(hostname)
397
+AUTOSMART_CLUSTER_CONFIG="/etc/pve/autoSMART/config/cluster.conf"
398
+
399
+# Local database override (if needed)
400
+# AUTOSMART_DB_HOST=192.168.2.102
401
+# AUTOSMART_DB_USER=postgres
402
+# AUTOSMART_DB_PASS=
403
+
404
+# OpenAI API configuration
405
+OPENAI_API_KEY=your-api-key-here
406
+
407
+# Collection settings
408
+SMART_COLLECTION_ENABLED=true
409
+MIGRATION_DETECTION_ENABLED=true
410
+DIFFERENTIAL_STORAGE_ENABLED=true
411
+```
412
+
413
+### 8. Testing Installation
414
+
415
+#### Database Connectivity Test
416
+```bash
417
+cd /etc/pve/autoSMART
418
+perl -e "
419
+use lib 'lib';
420
+use DBI;
421
+
422
+my \$dsn = 'DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432';
423
+my \$dbh = DBI->connect(\$dsn, 'postgres', '', {RaiseError => 1});
424
+
425
+print \"✅ Database connection successful!\n\";
426
+
427
+# Test schema
428
+my \$sth = \$dbh->prepare('SELECT COUNT(*) FROM hdd_inventory');
429
+\$sth->execute();
430
+my (\$count) = \$sth->fetchrow_array();
431
+
432
+print \"✅ Schema installed - hdd_inventory table accessible\n\";
433
+\$dbh->disconnect();
434
+"
435
+```
436
+
437
+#### SMART Data Collection Test  
438
+```bash
439
+# Test SMART data access
440
+sudo smartctl -a /dev/sda | head -20
441
+
442
+# Test collection script (dry-run mode)
443
+cd /etc/pve/autoSMART/scripts
444
+sudo perl collect-smart-data.pl --test --dry-run
445
+```
446
+
447
+#### Differential Storage Test
448
+```bash
449
+# Run comprehensive storage test
450
+cd /etc/pve/autoSMART/scripts
451
+perl test-differential-storage.pl
452
+```
453
+
454
+Expected output:
455
+```
456
+=== autoSMART Differential Storage Test ===
457
+✓ Connected to database
458
+✓ Created test HDD (ID: 1)
459
+✓ Inserted baseline reading (ID: 1)
460
+✓ Identical reading test - Should store: NO (Type: baseline)
461
+✓ Temperature change reading - Should store: YES (Type: differential, ID: 2)
462
+✓ Critical change reading - Should store: YES (Type: full, ID: 3)
463
+--- Storage Statistics ---
464
+baseline     : 1 readings, avg size: 245 bytes
465
+differential : 1 readings, avg size: 89 bytes  
466
+full         : 1 readings, avg size: 245 bytes
467
+Total: 3 readings, estimated size: 579 bytes
468
+=== Test Complete ===
469
+```
470
+
471
+### 9. Service Configuration (Optional)
472
+
473
+#### SystemD Service Files
474
+
475
+Create `/etc/systemd/system/autosmart-collector.service`:
476
+```ini
477
+[Unit]
478
+Description=autoSMART Data Collector
479
+After=network.target postgresql.service
480
+
481
+[Service]
482
+Type=simple
483
+User=root
484
+WorkingDirectory=/etc/pve/autoSMART
485
+ExecStart=/usr/bin/perl scripts/collect-smart-data.pl --daemon
486
+Restart=always
487
+RestartSec=30
488
+
489
+[Install]
490
+WantedBy=multi-user.target
491
+```
492
+
493
+Create `/etc/systemd/system/autosmart-analyzer.service`:
494
+```ini
495
+[Unit]
496
+Description=autoSMART AI Analyzer
497
+After=network.target postgresql.service autosmart-collector.service
498
+
499
+[Service]
500
+Type=simple
501
+User=root
502
+WorkingDirectory=/etc/pve/autoSMART
503
+ExecStart=/usr/bin/perl scripts/analyze-smart-data.pl --daemon
504
+Restart=always
505
+RestartSec=60
506
+
507
+[Install]
508
+WantedBy=multi-user.target
509
+```
510
+
511
+#### Enable Services
512
+```bash
513
+# Reload systemd configuration
514
+sudo systemctl daemon-reload
515
+
516
+# Enable and start services
517
+sudo systemctl enable autosmart-collector
518
+sudo systemctl start autosmart-collector
519
+
520
+sudo systemctl enable autosmart-analyzer  
521
+sudo systemctl start autosmart-analyzer
522
+
523
+# Check service status
524
+sudo systemctl status autosmart-collector
525
+sudo systemctl status autosmart-analyzer
526
+```
527
+
528
+### 10. Verification and Monitoring
529
+
530
+#### Log Files
531
+```bash
532
+# View collection logs
533
+sudo journalctl -u autosmart-collector -f
534
+
535
+# View analysis logs  
536
+sudo journalctl -u autosmart-analyzer -f
537
+
538
+# Check syslog for SMART events
539
+sudo tail -f /var/log/syslog | grep -i smart
540
+```
541
+
542
+#### Database Monitoring
543
+```bash
544
+# Monitor data collection
545
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
546
+SELECT 
547
+    COUNT(*) as total_readings,
548
+    MAX(timestamp) as latest_reading,
549
+    COUNT(DISTINCT hdd_id) as active_drives
550
+FROM smart_readings 
551
+WHERE timestamp > NOW() - INTERVAL '24 hours';
552
+"
553
+
554
+# Monitor storage efficiency
555
+psql -h 192.168.2.102 -U postgres -d autosmart -c "
556
+SELECT 
557
+    reading_type,
558
+    COUNT(*) as count,
559
+    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as percentage
560
+FROM smart_readings 
561
+WHERE timestamp > NOW() - INTERVAL '24 hours'
562
+GROUP BY reading_type;
563
+"
564
+```
565
+
566
+## 🎯 Post-Installation Steps
567
+
568
+### 1. Madagascar Integration
569
+```bash
570
+# Ensure Madagascar inventory is accessible
571
+ls -la /opt/madagascar/inventory.json
572
+
573
+# Test Madagascar data parsing
574
+cd /etc/pve/autoSMART/scripts
575
+perl -e "
576
+use JSON::XS;
577
+my \$data = decode_json(qx(cat /opt/madagascar/inventory.json));
578
+print 'Madagascar drives found: ' . scalar(\@{\$data->{drives}}) . \"\n\";
579
+"
580
+```
581
+
582
+### 2. OpenAI API Setup (Optional)
583
+```bash
584
+# Test OpenAI API access
585
+curl -H "Authorization: Bearer your-api-key" \
586
+     -H "Content-Type: application/json" \
587
+     -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}' \
588
+     https://api.openai.com/v1/chat/completions
589
+```
590
+
591
+### 3. Alert Configuration
592
+```bash
593
+# Test email notifications (if configured)
594
+echo "Test autoSMART alert" | mail -s "autoSMART Test" admin@yourcompany.com
595
+
596
+# Test Slack webhook (if configured)
597
+curl -X POST -H 'Content-type: application/json' \
598
+     --data '{"text":"autoSMART installation test"}' \
599
+     YOUR_SLACK_WEBHOOK_URL
600
+```
601
+
602
+## 🔍 Troubleshooting
603
+
604
+### Common Issues
605
+
606
+#### Database Connection Failed
607
+```bash
608
+# Check PostgreSQL service
609
+sudo systemctl status postgresql
610
+
611
+# Test network connectivity
612
+telnet 192.168.2.102 5432
613
+
614
+# Check firewall
615
+sudo ufw status
616
+sudo iptables -L | grep 5432
617
+```
618
+
619
+#### Permission Denied for SMART Data
620
+```bash
621
+# Check smartctl permissions
622
+ls -la /usr/sbin/smartctl
623
+
624
+# Test with sudo
625
+sudo smartctl -a /dev/sda
626
+
627
+# Add user to disk group (if running as non-root)
628
+sudo usermod -a -G disk $USER
629
+```
630
+
631
+#### Perl Module Issues
632
+```bash
633
+# Check module installation
634
+perl -MDBI -e 'print "DBI version: $DBI::VERSION\n"'
635
+perl -MJSON::XS -e 'print "JSON::XS installed OK\n"'
636
+
637
+# Reinstall problematic modules
638
+sudo cpanm --force --reinstall DBD::Pg
639
+```
640
+
641
+### Log Analysis
642
+```bash
643
+# Enable debug logging
644
+export AUTOSMART_DEBUG=3
645
+
646
+# Run collection manually for debugging
647
+cd /etc/pve/autoSMART/scripts
648
+sudo perl collect-smart-data.pl --verbose --test
649
+```
650
+
651
+## 📚 Next Steps
652
+
653
+1. **Customize Configuration**: Adjust thresholds and intervals for your environment
654
+2. **Set Up Monitoring**: Configure alerts and dashboards  
655
+3. **Schedule Regular Backups**: Database and configuration files
656
+4. **Plan for Scaling**: Consider performance optimization for large deployments
657
+5. **Implement AI Predictions**: Configure OpenAI integration for failure prediction
658
+
659
+For more detailed information, see:
660
+- [DEVELOPMENT.md](DEVELOPMENT.md) - Development and customization guide
661
+- [API.md](API.md) - OpenAI API integration details
662
+- [DIFFERENTIAL_STORAGE.md](DIFFERENTIAL_STORAGE.md) - Storage optimization details
663
+- [MIGRATION_DETECTION.md](MIGRATION_DETECTION.md) - HDD tracking system details
664
+
665
+## 🆘 Getting Help
666
+
667
+If you encounter issues during installation:
668
+
669
+1. **Check logs**: `journalctl -u autosmart-collector -n 50`
670
+2. **Verify database**: Test database connectivity and schema
671
+3. **Test components**: Run individual scripts manually with debug output
672
+4. **Review configuration**: Ensure all paths and credentials are correct
673
+5. **Check dependencies**: Verify all system and Perl dependencies are installed
674
+
675
+The autoSMART system is designed to be robust and self-healing, but proper installation and configuration are essential for optimal performance.
+325 -0
projects/autoSMART/docs/README.md
@@ -0,0 +1,325 @@
1
+# autoSMART v1.0 - Intelligent HDD Monitoring & Failure Prediction
2
+
3
+autoSMART este un sistem inteligent de monitorizare SMART pentru HDD-urile din cluster-ul Proxmox, cu predicții de defectare bazate pe AI și stocare optimizată în PostgreSQL.
4
+
5
+## 🎯 **Scopul Proiectului**
6
+
7
+- **Monitorizare continuă** a parametrilor SMART pentru toate HDD-urile din cluster
8
+- **Predicții AI** pentru defectări iminente folosind OpenAI API
9
+- **Stocare long-term** în PostgreSQL pentru analize temporale
10
+- **Alerting proactiv** pentru mentenanță preventivă
11
+
12
+## Key Features
13
+
14
+- **🔍 Hardware-based HDD tracking**: Permanent identification using serial numbers and model names (not volatile /dev/sdX paths)
15
+- **🔄 Migration detection**: Automatic detection and logging when HDDs move between nodes or device paths
16
+- **💾 Differential storage optimization**: Store only SMART readings with changes, reducing database size by 60-80%
17
+- **🤖 AI-powered failure prediction**: Uses OpenAI GPT for intelligent drive failure forecasting
18
+- **🏥 Health monitoring**: Continuous SMART parameter analysis with configurable thresholds
19
+- **📊 Comprehensive reporting**: Detailed drive health reports and predictive analytics
20
+- **🔧 Proxmox cluster integration**: Designed for distributed Proxmox VE environments
21
+- **⚡ High performance**: PostgreSQL backend with optimized indexing and queries
22
+
23
+## 🚀 Quick Start
24
+
25
+### Prerequisites
26
+- **PostgreSQL 13+** for data storage
27
+- **Perl 5.20+** with required modules
28
+- **Proxmox VE** cluster environment
29
+- **smartmontools** for SMART data collection
30
+- **OpenAI API key** for failure predictions
31
+
32
+### Installation
33
+```bash
34
+# 1. Download autoSMART and run automated deployment
35
+git clone <repository-url>
36
+cd autoSMART
37
+sudo ./scripts/deploy.sh install
38
+
39
+# The deployment script automatically:
40
+# - Installs all dependencies (Perl modules, smartmontools, etc.)
41
+# - Creates system directories and sets permissions
42
+# - Deploys application files to /opt/autoSMART/
43
+# - Creates configuration files in /etc/autosmart/
44
+# - Registers and starts systemd services
45
+# - Performs initial system validation
46
+
47
+# 2. Configure database connection (interactive prompts during install)
48
+# 3. Configure OpenAI API key (interactive prompts during install)
49
+# 4. System is ready - services are automatically started
50
+```
51
+
52
+### Verification
53
+```bash
54
+# Check system status (all services should be active)
55
+sudo systemctl status autosmart
56
+
57
+# View recent SMART data collection
58
+sudo journalctl -u autosmart-collector -f
59
+
60
+# Generate initial health report
61
+sudo /opt/autoSMART/scripts/autosmart-report.pl --summary
62
+```
63
+
64
+## 📚 Documentation
65
+
66
+### Getting Started
67
+- **[CHANGELOG.md](CHANGELOG.md)** - Version history and release notes
68
+
69
+### System Configuration
70
+- **[API.md](API.md)** - OpenAI API integration and configuration
71
+
72
+## 🏥 Monitoring Dashboard
73
+
74
+autoSMART provides comprehensive monitoring capabilities:
75
+
76
+### Health Status Overview
77
+- Real-time drive health status for all cluster nodes
78
+- Critical parameter alerts and warnings
79
+- AI-powered failure predictions with confidence scores
80
+- Storage efficiency metrics
81
+
82
+### Historical Analysis
83
+- Long-term SMART parameter trends
84
+- Performance degradation tracking
85
+- Migration history between nodes
86
+- Predictive analytics reports
87
+
88
+### Alerting System
89
+- Configurable thresholds for all SMART parameters
90
+- Email/webhook notifications
91
+- Integration with monitoring systems
92
+- Escalation procedures for critical alerts
93
+
94
+## 🔧 System Architecture
95
+
96
+autoSMART operates as a distributed system across your Proxmox cluster:
97
+
98
+### Data Collection
99
+- Continuous SMART data collection from all nodes
100
+- Hardware-based drive identification
101
+- Migration detection and logging
102
+- Differential storage for efficiency
103
+
104
+### Analysis Engine
105
+- AI-powered failure prediction
106
+- Threshold-based alerting
107
+- Trend analysis and reporting
108
+- Performance optimization recommendations
109
+
110
+### Storage Layer
111
+- PostgreSQL database with optimized schema
112
+- Differential storage reducing size by 60-80%
113
+- Historical data retention policies
114
+- Automated backup and maintenance
115
+
116
+## 📁 Installed File Structure
117
+
118
+When autoSMART is installed on your system, it creates the following directory structure:
119
+
120
+### System Directories
121
+
122
+```
123
+/opt/autoSMART/                    # Main installation directory
124
+├── scripts/                      # Executable scripts and utilities
125
+│   ├── autosmart-collector.pl    # Main data collection daemon
126
+│   ├── autosmart-predictor.pl    # AI prediction processing
127
+│   ├── autosmart-report.pl       # Report generation engine
128
+│   ├── autosmart-migration-report.pl # Hardware migration analysis
129
+│   ├── smart-collector-daemon.pl # Background collection service
130
+│   ├── uninstall.sh             # System removal script
131
+│   ├── monitor-cluster.sh        # Cluster health monitoring
132
+│   └── test-*.pl                # Testing and validation utilities
133
+├── lib/                         # Perl modules and core libraries
134
+│   ├── SmartCollector.pm        # SMART data collection and hardware tracking
135
+│   └── PredictionEngine.pm      # AI-powered failure prediction engine
136
+├── config/                      # Configuration templates and examples
137
+│   └── (template files)        # Default configuration templates
138
+├── docs/                        # End-user documentation
139
+│   ├── README.md               # System overview and quick start
140
+│   ├── CHANGELOG.md            # Release notes and version history
141
+│   └── API.md                  # OpenAI API configuration guide
142
+
143
+/etc/autosmart/                   # System configuration directory
144
+├── autosmart.conf              # Main system configuration
145
+├── cluster.conf                # Cluster topology and node definitions
146
+├── database.conf               # PostgreSQL connection settings
147
+├── openai.conf                 # OpenAI API configuration and prompts
148
+└── smart.conf                  # SMART parameter thresholds and monitoring rules
149
+
150
+/etc/systemd/system/             # Systemd service files
151
+├── autosmart.service           # Main autoSMART service
152
+├── autosmart-collector.service # Data collection service
153
+└── autosmart-predictor.service # AI prediction service
154
+```
155
+
156
+### Configuration Files Detail
157
+
158
+#### `/etc/autosmart/autosmart.conf`
159
+Main system configuration file containing:
160
+- Database connection parameters
161
+- Collection intervals and scheduling
162
+- Local node identification and settings
163
+- Log levels and debugging options
164
+
165
+#### `/etc/autosmart/cluster.conf`
166
+Cluster-wide configuration shared across all nodes:
167
+- Node topology and IP addresses
168
+- Shared monitoring parameters
169
+- Cluster-wide alert settings
170
+- Inter-node communication settings
171
+
172
+#### `/etc/autosmart/database.conf`
173
+PostgreSQL database connection settings:
174
+- Database host, port, and credentials
175
+- Connection pooling configuration
176
+- SSL settings and security parameters
177
+- Performance tuning options
178
+
179
+#### `/etc/autosmart/openai.conf`
180
+OpenAI API integration configuration:
181
+- API key and model selection
182
+- Prompt templates for failure prediction
183
+- Response parsing and confidence thresholds
184
+- Rate limiting and cost management
185
+
186
+#### `/etc/autosmart/smart.conf`
187
+SMART parameter monitoring configuration:
188
+- Parameter thresholds for different drive types
189
+- Critical parameter definitions
190
+- Alert escalation rules and notifications
191
+- Drive-specific monitoring settings
192
+
193
+### Service Integration
194
+
195
+#### Systemd Services
196
+- **`autosmart.service`**: Main system service that manages other components
197
+- **`autosmart-collector.service`**: Background data collection service
198
+- **`autosmart-predictor.service`**: AI prediction processing service
199
+
200
+#### Service Management
201
+```bash
202
+# Start/stop services
203
+sudo systemctl start autosmart
204
+sudo systemctl stop autosmart
205
+
206
+# Enable/disable automatic startup
207
+sudo systemctl enable autosmart
208
+sudo systemctl disable autosmart
209
+
210
+# Check service status
211
+sudo systemctl status autosmart
212
+
213
+# View service logs using systemd journal
214
+sudo journalctl -u autosmart -f                    # Follow main service logs
215
+sudo journalctl -u autosmart-collector -f          # Follow data collection logs  
216
+sudo journalctl -u autosmart-predictor -f          # Follow AI prediction logs
217
+
218
+# View logs by time period
219
+sudo journalctl -u autosmart --since "1 hour ago"  # Last hour
220
+sudo journalctl -u autosmart --since today         # Today's logs
221
+sudo journalctl -u autosmart --since yesterday     # Yesterday's logs
222
+
223
+# View logs by priority level
224
+sudo journalctl -u autosmart -p err                # Error level and above
225
+sudo journalctl -u autosmart -p warning            # Warning level and above
226
+```
227
+
228
+### File Permissions
229
+
230
+#### Executable Files
231
+- All scripts in `/opt/autoSMART/scripts/` are executable (755)
232
+- Perl modules in `/opt/autoSMART/lib/` are readable (644)
233
+- Configuration files in `/etc/autosmart/` are readable by autosmart user (640)
234
+
235
+#### Log Management
236
+- All application logs are handled by systemd journal
237
+- No separate log files created in filesystem
238
+- Log retention managed by journald configuration
239
+- Logs accessible via `journalctl` commands
240
+- Automatic log rotation and cleanup by systemd
241
+
242
+### Storage Requirements
243
+
244
+#### Disk Space
245
+- **Installation**: ~50MB for application files and documentation
246
+- **Configuration**: ~1MB for all configuration files
247
+- **Logs**: Managed by systemd journal (configurable retention)
248
+- **Database**: Handled separately on PostgreSQL server
249
+
250
+#### Network Requirements
251
+- **Database Access**: Persistent connection to PostgreSQL server
252
+- **OpenAI API**: HTTPS access for AI predictions (configurable)
253
+- **Inter-node Communication**: SSH access between cluster nodes for deployment
254
+
255
+This file structure provides a complete, organized installation that integrates seamlessly with Linux system conventions while maintaining clear separation between application code, configuration, and operational data.
256
+
257
+## 📊 Performance Benefits
258
+
259
+### Storage Optimization
260
+- **60-80% reduction** in database storage through differential storage
261
+- **Intelligent change detection** stores only modified SMART parameters
262
+- **Baseline reconstruction** provides complete historical views
263
+- **Configurable retention** policies for long-term storage
264
+
265
+### Monitoring Efficiency
266
+- **Hardware-based tracking** eliminates /dev/sdX path volatility
267
+- **Migration detection** automatically tracks drive movements
268
+- **Real-time analysis** with configurable collection intervals
269
+- **Distributed architecture** scales across cluster nodes
270
+
271
+## 🚨 Alert Examples
272
+
273
+### Critical Alerts
274
+- **Imminent Failure**: AI predicts drive failure within 24-48 hours
275
+- **Temperature Critical**: Drive operating above safe temperature thresholds
276
+- **Reallocated Sectors**: Increasing bad sector count detected
277
+- **Spin Retry Count**: Mechanical issues detected
278
+
279
+### Warning Alerts
280
+- **Performance Degradation**: Slower response times detected
281
+- **Temperature Warning**: Operating temperatures approaching limits
282
+- **SMART Threshold**: Parameters approaching warning thresholds
283
+- **Migration Detected**: Drive moved to different node or path
284
+
285
+## 💡 Use Cases
286
+
287
+### Preventive Maintenance
288
+- Schedule drive replacements before failures occur
289
+- Optimize workload distribution based on drive health
290
+- Plan cluster maintenance windows effectively
291
+- Track warranty and replacement schedules
292
+
293
+### Capacity Planning
294
+- Monitor storage growth trends
295
+- Predict future storage requirements
296
+- Optimize drive allocation across nodes
297
+- Plan cluster expansion timing
298
+
299
+### Performance Optimization
300
+- Identify performance bottlenecks
301
+- Balance load across healthy drives
302
+- Optimize I/O patterns based on drive characteristics
303
+- Monitor storage tier performance
304
+
305
+## 🆘 Support & Troubleshooting
306
+
307
+### Common Issues
308
+- **Collection failures**: Check smartmontools installation
309
+- **Database connectivity**: Verify PostgreSQL connection settings
310
+- **API errors**: Validate OpenAI API key and quotas
311
+- **Performance issues**: Review differential storage configuration
312
+
313
+### Log Analysis
314
+Use systemd journal for comprehensive log analysis:
315
+- **All service logs**: `sudo journalctl -u autosmart*`
316
+- **Data collection**: `sudo journalctl -u autosmart-collector`
317
+- **AI predictions**: `sudo journalctl -u autosmart-predictor`
318
+- **System errors**: `sudo journalctl -u autosmart* -p err`
319
+
320
+### Getting Help
321
+For detailed installation, configuration, and troubleshooting information, refer to the complete documentation in the `docs/` directory.
322
+
323
+---
324
+
325
+**autoSMART v1.0** - Intelligent drive monitoring for mission-critical infrastructure
+607 -0
projects/autoSMART/lib/PredictionEngine.pm
@@ -0,0 +1,607 @@
1
+package PredictionEngine;
2
+
3
+use strict;
4
+use warnings;
5
+use DBI;
6
+use HTTP::Tiny;
7
+use JSON::XS;
8
+use Math::Round;
9
+use Config::Simple;
10
+use Time::Piece;
11
+
12
+=head1 NAME
13
+
14
+PredictionEngine - AI-powered HDD failure prediction for autoSMART
15
+
16
+=head1 DESCRIPTION
17
+
18
+This module integrates with OpenAI's API to analyze SMART data trends and predict
19
+HDD failures. It processes historical SMART data, generates feature vectors,
20
+and uses GPT models for intelligent failure prediction.
21
+
22
+=head1 SYNOPSIS
23
+
24
+    use PredictionEngine;
25
+    
26
+    my $predictor = PredictionEngine->new(
27
+        db_config     => '/path/to/database.conf',
28
+        openai_config => '/path/to/openai.conf'
29
+    );
30
+    
31
+    # Predict failure for specific drive
32
+    my $prediction = $predictor->predict_failure('/dev/sda');
33
+    
34
+    # Analyze all drives
35
+    my $results = $predictor->analyze_all_drives();
36
+
37
+=cut
38
+
39
+sub new {
40
+    my ($class, %args) = @_;
41
+    
42
+    my $self = {
43
+        db_config     => $args{db_config} || '/etc/autosmart/database.conf',
44
+        openai_config => $args{openai_config} || '/etc/autosmart/openai.conf',
45
+        debug         => $args{debug} || 0,
46
+        db_handle     => undef,
47
+        openai_key    => '',
48
+        model         => 'gpt-4',
49
+        http_client   => HTTP::Tiny->new(timeout => 30),
50
+    };
51
+    
52
+    bless $self, $class;
53
+    $self->_load_config();
54
+    $self->_connect_database();
55
+    
56
+    return $self;
57
+}
58
+
59
+=head2 _load_config
60
+
61
+Load OpenAI configuration
62
+
63
+=cut
64
+
65
+sub _load_config {
66
+    my $self = shift;
67
+    
68
+    my $cfg = Config::Simple->new($self->{openai_config})
69
+        or die "Cannot load OpenAI config: $self->{openai_config}";
70
+    
71
+    $self->{openai_key} = $cfg->param('openai.api_key')
72
+        or die "OpenAI API key not configured";
73
+    
74
+    $self->{model}      = $cfg->param('openai.model') || 'gpt-4';
75
+    $self->{max_tokens} = $cfg->param('openai.max_tokens') || 1000;
76
+    $self->{temperature} = $cfg->param('openai.temperature') || 0.3;
77
+    
78
+    $self->_log("OpenAI configuration loaded (model: $self->{model})");
79
+}
80
+
81
+=head2 _connect_database
82
+
83
+Establish PostgreSQL database connection
84
+
85
+=cut
86
+
87
+sub _connect_database {
88
+    my $self = shift;
89
+    
90
+    my $cfg = Config::Simple->new($self->{db_config})
91
+        or die "Cannot load database config: $self->{db_config}";
92
+    
93
+    my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
94
+        $cfg->param('database.database'),
95
+        $cfg->param('database.host'),
96
+        $cfg->param('database.port')
97
+    );
98
+    
99
+    $self->{db_handle} = DBI->connect(
100
+        $dsn,
101
+        $cfg->param('database.username'),
102
+        $cfg->param('database.password'),
103
+        { 
104
+            RaiseError => 1, 
105
+            AutoCommit => 1,
106
+            pg_enable_utf8 => 1 
107
+        }
108
+    ) or die "Database connection failed: $DBI::errstr";
109
+    
110
+    $self->_log("Database connection established");
111
+}
112
+
113
+=head2 get_drive_smart_history
114
+
115
+Retrieve SMART data history for a drive
116
+
117
+=cut
118
+
119
+sub get_drive_smart_history {
120
+    my ($self, $device_path, $days_back) = @_;
121
+    
122
+    $days_back ||= 90;  # Default 3 months
123
+    
124
+    my $sql = q{
125
+        SELECT 
126
+            sr.timestamp,
127
+            sr.temperature,
128
+            sr.parameters_json,
129
+            hi.model_name,
130
+            hi.serial_number,
131
+            hi.size_gb
132
+        FROM smart_readings sr
133
+        JOIN hdd_inventory hi ON sr.device_path = hi.device_path
134
+        WHERE sr.device_path = ?
135
+        AND sr.timestamp >= NOW() - INTERVAL ? DAY
136
+        ORDER BY sr.timestamp ASC
137
+    };
138
+    
139
+    my $sth = $self->{db_handle}->prepare($sql);
140
+    $sth->execute($device_path, $days_back);
141
+    
142
+    my @history = ();
143
+    while (my $row = $sth->fetchrow_hashref()) {
144
+        $row->{parameters} = decode_json($row->{parameters_json});
145
+        delete $row->{parameters_json};
146
+        push @history, $row;
147
+    }
148
+    
149
+    return \@history;
150
+}
151
+
152
+=head2 analyze_smart_trends
153
+
154
+Analyze SMART parameter trends for patterns
155
+
156
+=cut
157
+
158
+sub analyze_smart_trends {
159
+    my ($self, $history) = @_;
160
+    
161
+    return {} unless @$history >= 5;  # Need minimum data points
162
+    
163
+    my $trends = {};
164
+    my $critical_params = [
165
+        'Reallocated_Sector_Ct',
166
+        'Spin_Retry_Count', 
167
+        'Reallocated_Event_Count',
168
+        'Current_Pending_Sector',
169
+        'Offline_Uncorrectable',
170
+        'UDMA_CRC_Error_Count',
171
+        'Raw_Read_Error_Rate'
172
+    ];
173
+    
174
+    # Analyze each critical parameter
175
+    foreach my $param_name (@$critical_params) {
176
+        my @values = ();
177
+        my @timestamps = ();
178
+        
179
+        # Extract values for this parameter
180
+        foreach my $reading (@$history) {
181
+            next unless exists $reading->{parameters}->{$param_name};
182
+            
183
+            push @values, $reading->{parameters}->{$param_name}->{raw_value};
184
+            push @timestamps, $reading->{timestamp};
185
+        }
186
+        
187
+        next unless @values >= 3;
188
+        
189
+        # Calculate trend statistics
190
+        my $trend_analysis = $self->_calculate_trend_stats(\@values, \@timestamps);
191
+        
192
+        $trends->{$param_name} = {
193
+            current_value => $values[-1],
194
+            min_value     => $trend_analysis->{min},
195
+            max_value     => $trend_analysis->{max},
196
+            slope         => $trend_analysis->{slope},
197
+            volatility    => $trend_analysis->{volatility},
198
+            data_points   => scalar(@values),
199
+            concerning    => $self->_is_trend_concerning($param_name, $trend_analysis),
200
+        };
201
+    }
202
+    
203
+    # Analyze temperature trends
204
+    my @temperatures = map { $_->{temperature} } @$history;
205
+    if (@temperatures >= 3) {
206
+        my @temp_timestamps = map { $_->{timestamp} } @$history;
207
+        my $temp_stats = $self->_calculate_trend_stats(\@temperatures, \@temp_timestamps);
208
+        
209
+        $trends->{temperature} = {
210
+            current_temp  => $temperatures[-1],
211
+            avg_temp      => $temp_stats->{mean},
212
+            max_temp      => $temp_stats->{max},
213
+            slope         => $temp_stats->{slope},
214
+            concerning    => ($temp_stats->{max} > 60 || $temp_stats->{slope} > 0.1),
215
+        };
216
+    }
217
+    
218
+    return $trends;
219
+}
220
+
221
+=head2 _calculate_trend_stats
222
+
223
+Calculate statistical metrics for trend analysis
224
+
225
+=cut
226
+
227
+sub _calculate_trend_stats {
228
+    my ($self, $values, $timestamps) = @_;
229
+    
230
+    return {} unless @$values >= 2;
231
+    
232
+    # Basic statistics
233
+    my $sum = 0;
234
+    my $min = $values->[0];
235
+    my $max = $values->[0];
236
+    
237
+    foreach my $val (@$values) {
238
+        $sum += $val;
239
+        $min = $val if $val < $min;
240
+        $max = $val if $val > $max;
241
+    }
242
+    
243
+    my $mean = $sum / @$values;
244
+    
245
+    # Calculate variance
246
+    my $variance = 0;
247
+    foreach my $val (@$values) {
248
+        $variance += ($val - $mean) ** 2;
249
+    }
250
+    $variance /= (@$values - 1) if @$values > 1;
251
+    
252
+    # Simple linear regression for slope
253
+    my $slope = 0;
254
+    if (@$values >= 2) {
255
+        my $n = @$values;
256
+        my $sum_x = 0;
257
+        my $sum_y = 0;
258
+        my $sum_xy = 0;
259
+        my $sum_x2 = 0;
260
+        
261
+        for my $i (0..$#$values) {
262
+            my $x = $i;  # Use index as x (time progression)
263
+            my $y = $values->[$i];
264
+            
265
+            $sum_x += $x;
266
+            $sum_y += $y;
267
+            $sum_xy += $x * $y;
268
+            $sum_x2 += $x * $x;
269
+        }
270
+        
271
+        my $denominator = $n * $sum_x2 - $sum_x * $sum_x;
272
+        if ($denominator != 0) {
273
+            $slope = ($n * $sum_xy - $sum_x * $sum_y) / $denominator;
274
+        }
275
+    }
276
+    
277
+    return {
278
+        min        => $min,
279
+        max        => $max,
280
+        mean       => $mean,
281
+        variance   => $variance,
282
+        volatility => sqrt($variance),
283
+        slope      => $slope,
284
+    };
285
+}
286
+
287
+=head2 _is_trend_concerning
288
+
289
+Determine if a SMART parameter trend is concerning
290
+
291
+=cut
292
+
293
+sub _is_trend_concerning {
294
+    my ($self, $param_name, $stats) = @_;
295
+    
296
+    # Critical parameters that should never increase
297
+    my $critical_increasing = {
298
+        'Reallocated_Sector_Ct'     => 0,
299
+        'Reallocated_Event_Count'   => 0, 
300
+        'Current_Pending_Sector'    => 0,
301
+        'Offline_Uncorrectable'     => 0,
302
+        'Spin_Retry_Count'          => 10,
303
+    };
304
+    
305
+    if (exists $critical_increasing->{$param_name}) {
306
+        my $threshold = $critical_increasing->{$param_name};
307
+        
308
+        return 1 if $stats->{max} > $threshold;
309
+        return 1 if $stats->{slope} > 0.1 && $stats->{max} > 0;
310
+    }
311
+    
312
+    # High volatility is concerning
313
+    return 1 if $stats->{volatility} > ($stats->{mean} * 0.5) && $stats->{mean} > 0;
314
+    
315
+    return 0;
316
+}
317
+
318
+=head2 predict_failure
319
+
320
+Generate AI-powered failure prediction for a drive
321
+
322
+=cut
323
+
324
+sub predict_failure {
325
+    my ($self, $device_path, $days_back) = @_;
326
+    
327
+    $days_back ||= 90;
328
+    
329
+    # Get SMART history
330
+    my $history = $self->get_drive_smart_history($device_path, $days_back);
331
+    
332
+    unless (@$history >= 5) {
333
+        return {
334
+            device_path => $device_path,
335
+            prediction  => 'insufficient_data',
336
+            confidence  => 0,
337
+            risk_level  => 'unknown',
338
+            message     => 'Insufficient historical data for prediction'
339
+        };
340
+    }
341
+    
342
+    # Analyze trends
343
+    my $trends = $self->analyze_smart_trends($history);
344
+    
345
+    # Generate AI prompt
346
+    my $prompt = $self->_generate_prediction_prompt($device_path, $history, $trends);
347
+    
348
+    # Call OpenAI API
349
+    my $ai_response = $self->_call_openai_api($prompt);
350
+    
351
+    # Parse and store prediction
352
+    my $prediction = $self->_parse_prediction_response($ai_response, $device_path);
353
+    
354
+    # Store prediction in database
355
+    $self->_store_prediction($prediction);
356
+    
357
+    return $prediction;
358
+}
359
+
360
+=head2 _generate_prediction_prompt
361
+
362
+Generate detailed prompt for OpenAI API
363
+
364
+=cut
365
+
366
+sub _generate_prediction_prompt {
367
+    my ($self, $device_path, $history, $trends) = @_;
368
+    
369
+    my $drive_info = $history->[0];  # Basic drive info from first record
370
+    
371
+    my $prompt = "You are an expert HDD failure prediction system analyzing SMART data.\n\n";
372
+    
373
+    $prompt .= "DRIVE INFORMATION:\n";
374
+    $prompt .= "- Device: $device_path\n";
375
+    $prompt .= "- Model: " . ($drive_info->{model_name} || 'Unknown') . "\n";
376
+    $prompt .= "- Serial: " . ($drive_info->{serial_number} || 'Unknown') . "\n";
377
+    $prompt .= "- Size: " . ($drive_info->{size_gb} || 'Unknown') . " GB\n";
378
+    $prompt .= "- Data Points: " . scalar(@$history) . " readings\n\n";
379
+    
380
+    $prompt .= "CRITICAL SMART PARAMETER ANALYSIS:\n";
381
+    
382
+    foreach my $param_name (sort keys %$trends) {
383
+        next if $param_name eq 'temperature';
384
+        
385
+        my $trend = $trends->{$param_name};
386
+        $prompt .= "- $param_name:\n";
387
+        $prompt .= "  * Current: $trend->{current_value}\n";
388
+        $prompt .= "  * Range: $trend->{min_value} - $trend->{max_value}\n";
389
+        $prompt .= "  * Slope: " . sprintf("%.4f", $trend->{slope}) . "\n";
390
+        $prompt .= "  * Volatility: " . sprintf("%.2f", $trend->{volatility}) . "\n";
391
+        $prompt .= "  * Concerning: " . ($trend->{concerning} ? 'YES' : 'No') . "\n";
392
+    }
393
+    
394
+    if (exists $trends->{temperature}) {
395
+        my $temp = $trends->{temperature};
396
+        $prompt .= "\nTEMPERATURE ANALYSIS:\n";
397
+        $prompt .= "- Current: $temp->{current_temp}°C\n";
398
+        $prompt .= "- Average: " . sprintf("%.1f", $temp->{avg_temp}) . "°C\n";
399
+        $prompt .= "- Maximum: $temp->{max_temp}°C\n";
400
+        $prompt .= "- Trend: " . sprintf("%.3f", $temp->{slope}) . "°C per reading\n";
401
+    }
402
+    
403
+    $prompt .= "\nPLEASE ANALYZE THIS DATA AND PROVIDE:\n";
404
+    $prompt .= "1. Overall failure risk assessment (LOW/MODERATE/HIGH/CRITICAL)\n";
405
+    $prompt .= "2. Confidence level (0-100%)\n";  
406
+    $prompt .= "3. Estimated time to failure (if applicable)\n";
407
+    $prompt .= "4. Key concerning indicators\n";
408
+    $prompt .= "5. Recommended actions\n\n";
409
+    
410
+    $prompt .= "Format your response as JSON with fields: risk_level, confidence, time_to_failure_days, concerns, recommendations, reasoning\n";
411
+    
412
+    return $prompt;
413
+}
414
+
415
+=head2 _call_openai_api
416
+
417
+Make API call to OpenAI
418
+
419
+=cut
420
+
421
+sub _call_openai_api {
422
+    my ($self, $prompt) = @_;
423
+    
424
+    my $payload = {
425
+        model => $self->{model},
426
+        messages => [
427
+            {
428
+                role => 'system',
429
+                content => 'You are an expert HDD failure prediction system with deep knowledge of SMART parameters and drive reliability patterns.'
430
+            },
431
+            {
432
+                role => 'user', 
433
+                content => $prompt
434
+            }
435
+        ],
436
+        max_tokens => $self->{max_tokens},
437
+        temperature => $self->{temperature},
438
+    };
439
+    
440
+    my $response = $self->{http_client}->post(
441
+        'https://api.openai.com/v1/chat/completions',
442
+        {
443
+            headers => {
444
+                'Authorization' => "Bearer $self->{openai_key}",
445
+                'Content-Type'  => 'application/json',
446
+            },
447
+            content => encode_json($payload)
448
+        }
449
+    );
450
+    
451
+    unless ($response->{success}) {
452
+        die "OpenAI API call failed: $response->{status} $response->{reason}";
453
+    }
454
+    
455
+    my $result = decode_json($response->{content});
456
+    
457
+    return $result->{choices}->[0]->{message}->{content};
458
+}
459
+
460
+=head2 _parse_prediction_response
461
+
462
+Parse OpenAI response into structured prediction
463
+
464
+=cut
465
+
466
+sub _parse_prediction_response {
467
+    my ($self, $ai_response, $device_path) = @_;
468
+    
469
+    my $prediction = {
470
+        device_path => $device_path,
471
+        timestamp   => time(),
472
+        prediction  => 'unknown',
473
+        confidence  => 0,
474
+        risk_level  => 'unknown',
475
+        message     => $ai_response,
476
+    };
477
+    
478
+    # Try to parse JSON response
479
+    eval {
480
+        my $parsed = decode_json($ai_response);
481
+        
482
+        $prediction->{risk_level} = lc($parsed->{risk_level}) if $parsed->{risk_level};
483
+        $prediction->{confidence} = $parsed->{confidence} if defined $parsed->{confidence};
484
+        $prediction->{time_to_failure_days} = $parsed->{time_to_failure_days} if $parsed->{time_to_failure_days};
485
+        $prediction->{concerns} = $parsed->{concerns} if $parsed->{concerns};
486
+        $prediction->{recommendations} = $parsed->{recommendations} if $parsed->{recommendations};
487
+        $prediction->{reasoning} = $parsed->{reasoning} if $parsed->{reasoning};
488
+        
489
+        $prediction->{prediction} = 'success';
490
+    };
491
+    
492
+    if ($@) {
493
+        $self->_log("Failed to parse AI response as JSON, using raw text");
494
+        $prediction->{prediction} = 'text_response';
495
+        
496
+        # Try to extract basic info from text
497
+        if ($ai_response =~ /risk.*?:.*?(low|moderate|high|critical)/i) {
498
+            $prediction->{risk_level} = lc($1);
499
+        }
500
+        
501
+        if ($ai_response =~ /confidence.*?:.*?(\d+)/i) {
502
+            $prediction->{confidence} = $1;
503
+        }
504
+    }
505
+    
506
+    return $prediction;
507
+}
508
+
509
+=head2 _store_prediction
510
+
511
+Store prediction results in database
512
+
513
+=cut
514
+
515
+sub _store_prediction {
516
+    my ($self, $prediction) = @_;
517
+    
518
+    my $sql = q{
519
+        INSERT INTO predictions 
520
+        (device_path, timestamp, risk_level, confidence, time_to_failure_days,
521
+         concerns, recommendations, reasoning, raw_response)
522
+        VALUES (?, to_timestamp(?), ?, ?, ?, ?, ?, ?, ?)
523
+    };
524
+    
525
+    $self->{db_handle}->do($sql,
526
+        undef,
527
+        $prediction->{device_path},
528
+        $prediction->{timestamp},
529
+        $prediction->{risk_level},
530
+        $prediction->{confidence},
531
+        $prediction->{time_to_failure_days},
532
+        $prediction->{concerns},
533
+        $prediction->{recommendations},
534
+        $prediction->{reasoning},
535
+        $prediction->{message}
536
+    );
537
+}
538
+
539
+=head2 analyze_all_drives
540
+
541
+Run predictions for all active drives
542
+
543
+=cut
544
+
545
+sub analyze_all_drives {
546
+    my $self = shift;
547
+    
548
+    my $sql = q{
549
+        SELECT device_path, model_name, serial_number
550
+        FROM hdd_inventory 
551
+        WHERE status = 'active'
552
+        ORDER BY device_path
553
+    };
554
+    
555
+    my $sth = $self->{db_handle}->prepare($sql);
556
+    $sth->execute();
557
+    
558
+    my @results = ();
559
+    
560
+    while (my $row = $sth->fetchrow_hashref()) {
561
+        my $prediction = $self->predict_failure($row->{device_path});
562
+        push @results, $prediction;
563
+        
564
+        # Rate limiting - small delay between API calls
565
+        sleep(1);
566
+    }
567
+    
568
+    return \@results;
569
+}
570
+
571
+=head2 _log
572
+
573
+Internal logging method
574
+
575
+=cut
576
+
577
+sub _log {
578
+    my ($self, $message) = @_;
579
+    
580
+    my $timestamp = scalar(localtime());
581
+    print "[$timestamp] PredictionEngine: $message\n" if $self->{debug};
582
+}
583
+
584
+=head2 DESTROY
585
+
586
+Cleanup database connection
587
+
588
+=cut
589
+
590
+sub DESTROY {
591
+    my $self = shift;
592
+    $self->{db_handle}->disconnect() if $self->{db_handle};
593
+}
594
+
595
+1;
596
+
597
+__END__
598
+
599
+=head1 AUTHOR
600
+
601
+AutoSMART Development Team
602
+
603
+=head1 LICENSE
604
+
605
+This software is part of the autoSMART project.
606
+
607
+=cut
+802 -0
projects/autoSMART/lib/SmartCollector.pm
@@ -0,0 +1,802 @@
1
+package SmartCollector;
2
+
3
+use strict;
4
+use warnings;
5
+use DBI;
6
+use JSON::XS;
7
+use Time::HiRes qw(time);
8
+use File::Slurp;
9
+use Config::Simple;
10
+use Digest::SHA qw(sha256_hex);
11
+
12
+=head1 NAME
13
+
14
+SmartCollector - SMART data collection module for autoSMART
15
+
16
+=head1 DESCRIPTION
17
+
18
+This module handles the collection of SMART data from HDDs identified in Madagascar inventory,
19
+processes the data, and stores it in PostgreSQL for long-term analysis and AI predictions.
20
+
21
+=head1 SYNOPSIS
22
+
23
+    use SmartCollector;
24
+    
25
+    my $collector = SmartCollector->new(
26
+        config_file => '/path/to/smart.conf',
27
+        db_config   => '/path/to/database.conf'
28
+    );
29
+    
30
+    # Collect data from all monitored drives
31
+    $collector->collect_all();
32
+    
33
+    # Collect data from specific drive
34
+    $collector->collect_drive('/dev/sda');
35
+
36
+=cut
37
+
38
+sub new {
39
+    my ($class, %args) = @_;
40
+    
41
+    my $self = {
42
+        cluster_config => $args{cluster_config} || '/etc/pve/autoSMART/cluster.conf',
43
+        local_config   => $args{local_config} || '/etc/default/autosmart',
44
+        debug          => $args{debug} || 0,
45
+        node_id        => $args{node_id} || `hostname`,
46
+        smart_params   => {},
47
+        db_handle      => undef,
48
+        local_settings => {},
49
+    };
50
+    
51
+    chomp $self->{node_id};
52
+    
53
+    bless $self, $class;
54
+    $self->_load_local_config();
55
+    $self->_load_cluster_config();
56
+    $self->_connect_database();
57
+    
58
+    return $self;
59
+}
60
+
61
+=head2 _load_local_config
62
+
63
+Load local node-specific configuration from /etc/default/autosmart
64
+
65
+=cut
66
+
67
+sub _load_local_config {
68
+    my $self = shift;
69
+    
70
+    return unless -f $self->{local_config};
71
+    
72
+    open my $fh, '<', $self->{local_config} 
73
+        or die "Cannot read local config: $self->{local_config}: $!";
74
+    
75
+    while (my $line = <$fh>) {
76
+        chomp $line;
77
+        next if $line =~ /^\s*#/ || $line =~ /^\s*$/;
78
+        
79
+        if ($line =~ /^(\w+)=(.+)$/) {
80
+            my ($key, $value) = ($1, $2);
81
+            $value =~ s/^["']|["']$//g;  # Remove quotes
82
+            $self->{local_settings}->{$key} = $value;
83
+        }
84
+    }
85
+    
86
+    close $fh;
87
+    
88
+    # Apply debug settings
89
+    if ($self->{local_settings}->{AUTOSMART_DEBUG_ENABLED} eq 'true') {
90
+        $self->{debug} = $self->{local_settings}->{AUTOSMART_DEBUG_LEVEL} || 1;
91
+    }
92
+    
93
+    $self->_log("Loaded local configuration from $self->{local_config}");
94
+}
95
+
96
+=head2 _load_cluster_config
97
+
98
+Load cluster-wide configuration from Proxmox shared storage
99
+
100
+=cut
101
+
102
+sub _load_cluster_config {
103
+    my $self = shift;
104
+    
105
+    unless (-f $self->{cluster_config}) {
106
+        die "Cluster configuration not found: $self->{cluster_config}";
107
+    }
108
+    
109
+    my $cfg = Config::Simple->new($self->{cluster_config})
110
+        or die "Cannot load cluster config file: $self->{cluster_config}";
111
+    
112
+    # Load monitoring settings
113
+    $self->{collection_interval} = $cfg->param('cluster.collection_interval') 
114
+        || $self->{local_settings}->{AUTOSMART_COLLECTION_INTERVAL} || 300;
115
+    $self->{collection_timeout} = $cfg->param('cluster.collection_timeout') 
116
+        || $self->{local_settings}->{AUTOSMART_COLLECTION_TIMEOUT} || 30;
117
+    $self->{madagascar_inventory} = $cfg->param('madagascar.inventory_path');
118
+    
119
+    # Load cluster information
120
+    $self->{cluster_name} = $cfg->param('cluster.cluster_name');
121
+    $self->{cluster_nodes} = [split /,/, ($cfg->param('cluster.nodes') || '')];
122
+    
123
+    # Load SMART parameters from cluster config
124
+    my @param_keys = $cfg->param(-block => 'smart_parameters');
125
+    
126
+    foreach my $key (@param_keys) {
127
+        my $value = $cfg->param("smart_parameters.$key");
128
+        my ($threshold, $weight, $enabled, $description) = split /,/, $value, 4;
129
+        
130
+        $self->{smart_params}->{$key} = {
131
+            threshold   => $threshold,
132
+            weight      => $weight,
133
+            enabled     => ($enabled eq 'true'),
134
+            description => $description,
135
+        } if $enabled eq 'true';
136
+    }
137
+    
138
+    $self->_log("Loaded cluster configuration: $self->{cluster_name} (" . 
139
+                keys(%{$self->{smart_params}}) . " SMART parameters)");
140
+}
141
+
142
+=head2 _connect_database
143
+
144
+Establish PostgreSQL database connection using cluster configuration
145
+
146
+=cut
147
+
148
+sub _connect_database {
149
+    my $self = shift;
150
+    
151
+    my $cfg = Config::Simple->new($self->{cluster_config})
152
+        or die "Cannot load cluster config for database: $self->{cluster_config}";
153
+    
154
+    my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
155
+        $cfg->param('database.database'),
156
+        $cfg->param('database.host'),
157
+        $cfg->param('database.port')
158
+    );
159
+    
160
+    my $timeout = $cfg->param('database.connection_timeout') || 30;
161
+    
162
+    $self->{db_handle} = DBI->connect(
163
+        $dsn,
164
+        $cfg->param('database.username'),
165
+        $cfg->param('database.password'),
166
+        { 
167
+            RaiseError => 1, 
168
+            AutoCommit => 1,
169
+            pg_enable_utf8 => 1,
170
+            connect_timeout => $timeout,
171
+        }
172
+    ) or die "Database connection failed: $DBI::errstr";
173
+    
174
+    # Register this node in the cluster
175
+    $self->_register_node();
176
+    
177
+    $self->_log("Database connection established to cluster database");
178
+}
179
+
180
+=head2 get_madagascar_drives
181
+
182
+Get list of HDDs from Madagascar inventory (cluster-shared)
183
+
184
+=cut
185
+
186
+sub get_madagascar_drives {
187
+    my $self = shift;
188
+    
189
+    return [] unless -f $self->{madagascar_inventory};
190
+    
191
+    my $inventory_json = read_file($self->{madagascar_inventory});
192
+    my $inventory = decode_json($inventory_json);
193
+    
194
+    my @drives = ();
195
+    
196
+    # Extract HDD information from Madagascar inventory
197
+    if (ref $inventory eq 'HASH' && exists $inventory->{storage}) {
198
+        foreach my $storage (@{$inventory->{storage}}) {
199
+            # Only include drives for this node
200
+            next unless $storage->{node_id} eq $self->{node_id} || !$storage->{node_id};
201
+            next unless $storage->{type} eq 'HDD';
202
+            next unless $storage->{device_path};
203
+            
204
+            push @drives, {
205
+                device_path => $storage->{device_path},
206
+                serial      => $storage->{serial},
207
+                model       => $storage->{model},
208
+                size_gb     => $storage->{size_gb},
209
+                madagascar_id => $storage->{id},
210
+                node_id     => $self->{node_id},
211
+            };
212
+        }
213
+    }
214
+    
215
+    $self->_log("Found " . @drives . " HDDs for node $self->{node_id} in Madagascar inventory");
216
+    return \@drives;
217
+}
218
+
219
+=head2 collect_smart_data
220
+
221
+Collect SMART data from a specific drive
222
+
223
+=cut
224
+
225
+sub collect_smart_data {
226
+    my ($self, $device_path) = @_;
227
+    
228
+    my $cmd = "smartctl -A -f brief -j '$device_path' 2>/dev/null";
229
+    my $output = `$cmd`;
230
+    my $exit_code = $? >> 8;
231
+    
232
+    # Parse smartctl JSON output
233
+    my $smart_data = {};
234
+    
235
+    eval {
236
+        $smart_data = decode_json($output);
237
+    };
238
+    
239
+    if ($@) {
240
+        $self->_log("Failed to parse SMART data for $device_path: $@");
241
+        return undef;
242
+    }
243
+    
244
+    return $self->_process_smart_data($smart_data, $device_path);
245
+}
246
+
247
+=head2 _process_smart_data
248
+
249
+Process and normalize SMART data
250
+
251
+=cut
252
+
253
+sub _process_smart_data {
254
+    my ($self, $raw_data, $device_path) = @_;
255
+    
256
+    my $processed = {
257
+        device_path    => $device_path,
258
+        timestamp      => time(),
259
+        collection_ok  => ($raw_data->{smart_status}->{passed} || 0),
260
+        temperature    => 0,
261
+        parameters     => {},
262
+    };
263
+    
264
+    # Extract device information
265
+    if (exists $raw_data->{device}) {
266
+        $processed->{model_name}   = $raw_data->{device}->{model_name} || '';
267
+        $processed->{serial_number} = $raw_data->{device}->{serial_number} || '';
268
+        $processed->{firmware}     = $raw_data->{device}->{firmware_version} || '';
269
+    }
270
+    
271
+    # Extract temperature
272
+    if (exists $raw_data->{temperature}) {
273
+        $processed->{temperature} = $raw_data->{temperature}->{current} || 0;
274
+    }
275
+    
276
+    # Extract SMART attributes
277
+    if (exists $raw_data->{ata_smart_attributes}->{table}) {
278
+        foreach my $attr (@{$raw_data->{ata_smart_attributes}->{table}}) {
279
+            my $name = $attr->{name};
280
+            
281
+            # Only collect monitored parameters
282
+            next unless exists $self->{smart_params}->{$name};
283
+            
284
+            $processed->{parameters}->{$name} = {
285
+                id          => $attr->{id},
286
+                value       => $attr->{value},
287
+                worst       => $attr->{worst},
288
+                thresh      => $attr->{thresh},
289
+                raw_value   => $attr->{raw}->{value},
290
+                when_failed => $attr->{when_failed} || '',
291
+                flags       => $attr->{flags}->{string} || '',
292
+            };
293
+        }
294
+    }
295
+    
296
+    return $processed;
297
+}
298
+
299
+=head2 store_smart_data
300
+
301
+Store processed SMART data using hardware-based tracking with migration detection
302
+
303
+=cut
304
+
305
+sub store_smart_data {
306
+    my ($self, $drive_info, $smart_data) = @_;
307
+    
308
+    eval {
309
+        # Detect/handle HDD migration first
310
+        my $hdd_id = $self->_detect_or_create_hdd($drive_info, $smart_data);
311
+        
312
+        # Check if we should store this reading using differential storage
313
+        my $should_store = $self->_should_store_reading($hdd_id, $smart_data);
314
+        
315
+        if ($should_store->{store}) {
316
+            # Insert SMART reading with differential storage information
317
+            $self->_insert_smart_reading_differential($hdd_id, $drive_info, $smart_data, $should_store);
318
+            
319
+            $self->_log("Stored SMART data for HDD ID $hdd_id (Serial: $smart_data->{serial_number}, Type: $should_store->{type})", 2);
320
+        } else {
321
+            $self->_log("Skipped unchanged SMART data for HDD ID $hdd_id (Serial: $smart_data->{serial_number})", 3);
322
+        }
323
+    };
324
+    
325
+    if ($@) {
326
+        $self->_log("ERROR storing SMART data: $@", 1);
327
+        return 0;
328
+    }
329
+    
330
+    return 1;
331
+}
332
+
333
+=head2 _detect_or_create_hdd
334
+
335
+Detect HDD migration or create new HDD record using hardware identifiers
336
+
337
+=cut
338
+
339
+sub _detect_or_create_hdd {
340
+    my ($self, $drive_info, $smart_data) = @_;
341
+    
342
+    my $serial = $smart_data->{serial_number} || 'unknown';
343
+    my $model = $smart_data->{model_name} || 'unknown';
344
+    my $device_path = $drive_info->{device_path};
345
+    
346
+    # Call PostgreSQL function to detect migration
347
+    my $sth = $self->{db_handle}->prepare(q{
348
+        SELECT detect_hdd_migration(?, ?, ?, ?, ?, 'collector')
349
+    });
350
+    
351
+    $sth->execute(
352
+        $serial,
353
+        $model, 
354
+        $device_path,
355
+        $self->{node_id},
356
+        $drive_info->{slot} || undef
357
+    );
358
+    
359
+    my ($hdd_id) = $sth->fetchrow_array();
360
+    
361
+    # If NULL returned, this is a new HDD - create it
362
+    if (!defined $hdd_id) {
363
+        $hdd_id = $self->_create_new_hdd($drive_info, $smart_data);
364
+        $self->_log("New HDD discovered: $serial ($model) at $device_path", 2);
365
+    } else {
366
+        $self->_log("HDD tracked: ID $hdd_id, Serial $serial", 3);
367
+    }
368
+    
369
+    return $hdd_id;
370
+}
371
+
372
+=head2 _create_new_hdd
373
+
374
+Create new HDD record with hardware-based identification
375
+
376
+=cut
377
+
378
+sub _create_new_hdd {
379
+    my ($self, $drive_info, $smart_data) = @_;
380
+    
381
+    my $sql = q{
382
+        INSERT INTO hdd_inventory 
383
+        (serial_number, model_name, firmware, size_gb, manufacturer,
384
+         current_device_path, current_node_id, current_slot,
385
+         madagascar_id, first_seen, last_seen, status)
386
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, NOW(), NOW(), 'active')
387
+        RETURNING id
388
+    };
389
+    
390
+    my $sth = $self->{db_handle}->prepare($sql);
391
+    $sth->execute(
392
+        $smart_data->{serial_number} || 'unknown',
393
+        $smart_data->{model_name} || 'unknown',
394
+        $smart_data->{firmware} || '',
395
+        $drive_info->{size_gb} || 0,
396
+        $self->_extract_manufacturer($smart_data->{model_name}),
397
+        $drive_info->{device_path},
398
+        $self->{node_id},
399
+        $drive_info->{slot} || undef,
400
+        $drive_info->{madagascar_id}
401
+    );
402
+    
403
+    my ($hdd_id) = $sth->fetchrow_array();
404
+    
405
+    # Create discovery alert
406
+    $self->_create_discovery_alert($hdd_id, $drive_info, $smart_data);
407
+    
408
+    return $hdd_id;
409
+}
410
+
411
+=head2 _extract_manufacturer
412
+
413
+Extract manufacturer from model name
414
+
415
+=cut
416
+
417
+sub _extract_manufacturer {
418
+    my ($self, $model_name) = @_;
419
+    
420
+    return 'Unknown' unless $model_name;
421
+    
422
+    # Common HDD manufacturer patterns
423
+    my %manufacturers = (
424
+        qr/^WD|Western\s*Digital/i => 'Western Digital',
425
+        qr/^ST|Seagate/i           => 'Seagate',
426
+        qr/^HGST|Hitachi/i         => 'HGST/Hitachi', 
427
+        qr/^TOSHIBA/i              => 'Toshiba',
428
+        qr/^Samsung/i              => 'Samsung',
429
+        qr/^Maxtor/i               => 'Maxtor',
430
+        qr/^Fujitsu/i              => 'Fujitsu',
431
+    );
432
+    
433
+    foreach my $pattern (keys %manufacturers) {
434
+        return $manufacturers{$pattern} if $model_name =~ /$pattern/;
435
+    }
436
+    
437
+    # Extract first word as fallback
438
+    if ($model_name =~ /^(\w+)/) {
439
+        return $1;
440
+    }
441
+    
442
+    return 'Unknown';
443
+}
444
+
445
+=head2 _create_discovery_alert
446
+
447
+Create alert for new HDD discovery
448
+
449
+=cut
450
+
451
+sub _create_discovery_alert {
452
+    my ($self, $hdd_id, $drive_info, $smart_data) = @_;
453
+    
454
+    my $sql = q{
455
+        INSERT INTO alert_history 
456
+        (hdd_id, serial_number, device_path, node_id, alert_type, message)
457
+        VALUES (?, ?, ?, ?, 'discovery', ?)
458
+    };
459
+    
460
+    my $message = sprintf(
461
+        "New HDD discovered: %s (%s) at %s on node %s - Size: %s GB",
462
+        $smart_data->{serial_number} || 'unknown',
463
+        $smart_data->{model_name} || 'unknown',
464
+        $drive_info->{device_path},
465
+        $self->{node_id},
466
+        $drive_info->{size_gb} || '?'
467
+    );
468
+    
469
+    $self->{db_handle}->do($sql, undef,
470
+        $hdd_id,
471
+        $smart_data->{serial_number},
472
+        $drive_info->{device_path},
473
+        $self->{node_id},
474
+        $message
475
+    );
476
+}
477
+
478
+=head2 _should_store_reading
479
+
480
+Check if SMART reading should be stored using differential storage logic
481
+
482
+=cut
483
+
484
+sub _should_store_reading {
485
+    my ($self, $hdd_id, $smart_data) = @_;
486
+    
487
+    # Generate checksum of SMART parameters
488
+    my $parameters_json = encode_json($smart_data->{parameters});
489
+    my $checksum = sha256_hex($parameters_json . ($smart_data->{temperature} || ''));
490
+    
491
+    # Call PostgreSQL function to determine if we should store this reading
492
+    my $sth = $self->{db_handle}->prepare(q{
493
+        SELECT should_store_smart_reading(?, ?, ?, NOW())
494
+    });
495
+    
496
+    $sth->execute($hdd_id, $parameters_json, $checksum);
497
+    
498
+    my $result = $sth->fetchrow_hashref();
499
+    
500
+    return {
501
+        store => $result->{should_store},
502
+        type => $result->{reading_type},
503
+        changes_detected => $result->{changes_detected},
504
+        changed_parameters => $result->{changed_parameters},
505
+        previous_reading_id => $result->{previous_reading_id},
506
+        checksum => $checksum
507
+    };
508
+}
509
+
510
+=head2 _insert_smart_reading_differential
511
+
512
+Insert SMART reading with differential storage information
513
+
514
+=cut
515
+
516
+sub _insert_smart_reading_differential {
517
+    my ($self, $hdd_id, $drive_info, $smart_data, $storage_info) = @_;
518
+    
519
+    my $sql = q{
520
+        INSERT INTO smart_readings
521
+        (hdd_id, serial_number, device_path, node_id, timestamp, 
522
+         collection_ok, temperature, parameters_json, reading_type,
523
+         changes_detected, changed_parameters, previous_reading_id, checksum)
524
+        VALUES (?, ?, ?, ?, to_timestamp(?), ?, ?, ?, ?, ?, ?, ?, ?)
525
+    };
526
+    
527
+    # For differential readings, only store changed parameters
528
+    my $parameters_to_store;
529
+    if ($storage_info->{type} eq 'differential' && $storage_info->{changed_parameters}) {
530
+        # Extract only changed parameters
531
+        my $changed_params = decode_json($storage_info->{changed_parameters});
532
+        my $all_params = $smart_data->{parameters};
533
+        $parameters_to_store = {};
534
+        
535
+        for my $param_name (@$changed_params) {
536
+            $parameters_to_store->{$param_name} = $all_params->{$param_name};
537
+        }
538
+    } else {
539
+        # Store all parameters for baseline/full readings
540
+        $parameters_to_store = $smart_data->{parameters};
541
+    }
542
+    
543
+    my $parameters_json = encode_json($parameters_to_store);
544
+    
545
+    $self->{db_handle}->do($sql,
546
+        undef,
547
+        $hdd_id,
548
+        $smart_data->{serial_number},
549
+        $drive_info->{device_path},
550
+        $self->{node_id},
551
+        $smart_data->{timestamp},
552
+        $smart_data->{collection_ok},
553
+        $smart_data->{temperature},
554
+        $parameters_json,
555
+        $storage_info->{type},
556
+        $storage_info->{changes_detected} ? 'true' : 'false',
557
+        $storage_info->{changed_parameters},
558
+        $storage_info->{previous_reading_id},
559
+        $storage_info->{checksum}
560
+    );
561
+}
562
+
563
+=head2 _insert_smart_reading
564
+
565
+Insert SMART reading linked to hardware ID (legacy method for compatibility)
566
+
567
+=cut
568
+
569
+sub _insert_smart_reading {
570
+    my ($self, $hdd_id, $drive_info, $smart_data) = @_;
571
+    
572
+    my $sql = q{
573
+        INSERT INTO smart_readings
574
+        (hdd_id, serial_number, device_path, node_id, timestamp, 
575
+         collection_ok, temperature, parameters_json)
576
+        VALUES (?, ?, ?, ?, to_timestamp(?), ?, ?, ?)
577
+    };
578
+    
579
+    my $parameters_json = encode_json($smart_data->{parameters});
580
+    
581
+    $self->{db_handle}->do($sql,
582
+        undef,
583
+        $hdd_id,
584
+        $smart_data->{serial_number},
585
+        $drive_info->{device_path},
586
+        $self->{node_id},
587
+        $smart_data->{timestamp},
588
+        $smart_data->{collection_ok},
589
+        $smart_data->{temperature},
590
+        $parameters_json
591
+    );
592
+}
593
+
594
+=head2 collect_all
595
+
596
+Collect SMART data from all drives in Madagascar inventory
597
+
598
+=cut
599
+
600
+sub collect_all {
601
+    my $self = shift;
602
+    
603
+    my $drives = $self->get_madagascar_drives();
604
+    my $successful = 0;
605
+    my $failed = 0;
606
+    my $storage_stats = {
607
+        baseline => 0,
608
+        full => 0,
609
+        differential => 0,
610
+        skipped => 0
611
+    };
612
+    
613
+    foreach my $drive (@$drives) {
614
+        my $smart_data = $self->collect_smart_data($drive->{device_path});
615
+        
616
+        if ($smart_data && $self->store_smart_data($drive, $smart_data)) {
617
+            $successful++;
618
+        } else {
619
+            $failed++;
620
+            $self->_log("Failed to collect/store data for $drive->{device_path}");
621
+        }
622
+        
623
+        # Small delay between drives to avoid overwhelming system
624
+        select(undef, undef, undef, 0.1);
625
+    }
626
+    
627
+    # Get storage statistics for this collection run
628
+    my $stats = $self->_get_recent_storage_stats();
629
+    $self->_log("Collection complete: $successful successful, $failed failed");
630
+    $self->_log("Storage efficiency - Baseline: $stats->{baseline}, Full: $stats->{full}, Differential: $stats->{differential}, Skipped: $stats->{skipped}");
631
+    
632
+    return { 
633
+        successful => $successful, 
634
+        failed => $failed, 
635
+        total => scalar(@$drives),
636
+        storage_stats => $stats
637
+    };
638
+}
639
+
640
+=head2 _get_recent_storage_stats
641
+
642
+Get statistics about storage efficiency from recent readings
643
+
644
+=cut
645
+
646
+sub _get_recent_storage_stats {
647
+    my $self = shift;
648
+    
649
+    my $sql = q{
650
+        SELECT 
651
+            reading_type,
652
+            COUNT(*) as count
653
+        FROM smart_readings 
654
+        WHERE timestamp > NOW() - INTERVAL '1 hour'
655
+        GROUP BY reading_type
656
+        ORDER BY reading_type
657
+    };
658
+    
659
+    my $sth = $self->{db_handle}->prepare($sql);
660
+    $sth->execute();
661
+    
662
+    my $stats = {
663
+        baseline => 0,
664
+        full => 0,
665
+        differential => 0,
666
+        total => 0
667
+    };
668
+    
669
+    while (my $row = $sth->fetchrow_hashref()) {
670
+        $stats->{$row->{reading_type}} = $row->{count};
671
+        $stats->{total} += $row->{count};
672
+    }
673
+    
674
+    # Calculate efficiency percentage
675
+    my $efficient_readings = $stats->{differential} + $stats->{baseline};
676
+    my $efficiency_pct = $stats->{total} > 0 ? 
677
+        sprintf("%.1f", ($efficient_readings / $stats->{total}) * 100) : 0;
678
+    
679
+    $stats->{efficiency_percent} = $efficiency_pct;
680
+    
681
+    return $stats;
682
+}
683
+
684
+=head2 _log
685
+
686
+Internal logging method with enhanced debug levels
687
+
688
+=cut
689
+
690
+sub _log {
691
+    my ($self, $message, $level) = @_;
692
+    
693
+    $level ||= 1;  # Default to basic level
694
+    
695
+    # Check if we should log based on debug level
696
+    return unless $self->{debug} >= $level;
697
+    
698
+    my $timestamp = scalar(localtime());
699
+    my $node_id = $self->{node_id} || 'unknown';
700
+    my $prefix = "[$timestamp] [$node_id] SmartCollector";
701
+    
702
+    if ($self->{debug}) {
703
+        print "$prefix: $message\n";
704
+    }
705
+    
706
+    # Also log to syslog if enabled
707
+    if ($self->{local_settings}->{AUTOSMART_LOG_SYSLOG} eq 'true') {
708
+        eval {
709
+            use Sys::Syslog qw(:standard :macros);
710
+            my $facility = $self->{local_settings}->{AUTOSMART_LOG_FACILITY} || 'daemon';
711
+            openlog('autosmart', 'pid,ndelay', $facility);
712
+            syslog(LOG_INFO, "SmartCollector[$node_id]: $message");
713
+            closelog();
714
+        };
715
+    }
716
+    
717
+    # Log to file if specified
718
+    my $log_file = $self->{local_settings}->{AUTOSMART_DEBUG_LOG_FILE};
719
+    if ($log_file && $self->{debug} >= 2) {
720
+        eval {
721
+            open my $fh, '>>', $log_file;
722
+            print $fh "$prefix: $message\n";
723
+            close $fh;
724
+        };
725
+    }
726
+}
727
+
728
+=head2 _register_node
729
+
730
+Register this node in the cluster database
731
+
732
+=cut
733
+
734
+sub _register_node {
735
+    my $self = shift;
736
+    
737
+    eval {
738
+        # Create cluster_nodes table if it doesn't exist
739
+        $self->{db_handle}->do(q{
740
+            CREATE TABLE IF NOT EXISTS cluster_nodes (
741
+                node_id VARCHAR(100) PRIMARY KEY,
742
+                hostname VARCHAR(255),
743
+                ip_address INET,
744
+                last_seen TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
745
+                status VARCHAR(20) DEFAULT 'active',
746
+                version VARCHAR(50),
747
+                capabilities JSON,
748
+                created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
749
+            )
750
+        });
751
+        
752
+        # Register/update this node
753
+        my $hostname = `hostname -f`;
754
+        chomp $hostname;
755
+        
756
+        my $ip = `hostname -I | awk '{print \$1}'`;
757
+        chomp $ip;
758
+        
759
+        $self->{db_handle}->do(q{
760
+            INSERT INTO cluster_nodes 
761
+            (node_id, hostname, ip_address, last_seen, status, version)
762
+            VALUES (?, ?, ?, NOW(), 'active', '1.0')
763
+            ON CONFLICT (node_id)
764
+            DO UPDATE SET
765
+                hostname = EXCLUDED.hostname,
766
+                ip_address = EXCLUDED.ip_address,
767
+                last_seen = NOW(),
768
+                status = 'active'
769
+        }, undef, $self->{node_id}, $hostname, $ip);
770
+        
771
+        $self->_log("Registered node $self->{node_id} in cluster", 2);
772
+    };
773
+    
774
+    if ($@) {
775
+        $self->_log("Warning: Failed to register node: $@", 1);
776
+    }
777
+}
778
+
779
+=head2 DESTROY
780
+
781
+Cleanup database connection
782
+
783
+=cut
784
+
785
+sub DESTROY {
786
+    my $self = shift;
787
+    $self->{db_handle}->disconnect() if $self->{db_handle};
788
+}
789
+
790
+1;
791
+
792
+__END__
793
+
794
+=head1 AUTHOR
795
+
796
+AutoSMART Development Team
797
+
798
+=head1 LICENSE
799
+
800
+This software is part of the autoSMART project.
801
+
802
+=cut
+0 -0
projects/autoSMART/scripts/README.md
No changes.
+348 -0
projects/autoSMART/scripts/autosmart-collector.pl
@@ -0,0 +1,348 @@
1
+#!/usr/bin/perl
2
+
3
+use strict;
4
+use warnings;
5
+use FindBin qw($Bin);
6
+use lib "$Bin/../lib";
7
+
8
+use SmartCollector;
9
+use Getopt::Long;
10
+use POSIX qw(strftime);
11
+
12
+=head1 NAME
13
+
14
+autosmart-collector.pl - SMART data collection daemon for Proxmox cluster
15
+
16
+=head1 SYNOPSIS
17
+
18
+    autosmart-collector.pl [OPTIONS]
19
+
20
+=head1 OPTIONS
21
+
22
+    --cluster-config FILE  Cluster configuration file (default: /etc/pve/autoSMART/cluster.conf)
23
+    --local-config FILE    Local configuration file (default: /etc/default/autosmart)
24
+    --daemon              Run as daemon
25
+    --once                Run once and exit (for cron jobs)
26
+    --device PATH         Collect from specific device only
27
+    --debug               Enable debug logging
28
+    --help                Show this help
29
+
30
+=head1 DESCRIPTION
31
+
32
+This script collects SMART data from HDDs in a Proxmox cluster environment.
33
+Configuration is split between cluster-wide settings in /etc/pve/autoSMART/
34
+and local node settings in /etc/default/autosmart.
35
+
36
+=cut
37
+
38
+# Configuration
39
+my $cluster_config = '/etc/pve/autoSMART/cluster.conf';
40
+my $local_config = '/etc/default/autosmart';
41
+my $daemon_mode = 0;
42
+my $run_once = 0;
43
+my $specific_device = '';
44
+my $debug = 0;
45
+my $help = 0;
46
+
47
+GetOptions(
48
+    'cluster-config=s' => \$cluster_config,
49
+    'local-config=s'   => \$local_config,
50
+    'daemon'           => \$daemon_mode,
51
+    'once'             => \$run_once,
52
+    'device=s'         => \$specific_device,
53
+    'debug'            => \$debug,
54
+    'help'             => \$help,
55
+) or die "Error parsing command line arguments\n";
56
+
57
+if ($help) {
58
+    print_help();
59
+    exit 0;
60
+}
61
+
62
+# Load local configuration for environment setup
63
+my %local_settings = load_local_config($local_config);
64
+
65
+# Override debug flag from local config if not specified
66
+unless ($debug) {
67
+    $debug = ($local_settings{AUTOSMART_DEBUG_ENABLED} eq 'true') ? 
68
+             ($local_settings{AUTOSMART_DEBUG_LEVEL} || 1) : 0;
69
+}
70
+
71
+# Validate configuration files
72
+unless (-f $cluster_config) {
73
+    die "Cluster configuration not found: $cluster_config\n";
74
+}
75
+
76
+unless (-f $local_config) {
77
+    die "Local configuration not found: $local_config\n";
78
+}
79
+
80
+# Check for emergency stop
81
+if (-f ($local_settings{AUTOSMART_EMERGENCY_STOP_FILE} || '/etc/autosmart/EMERGENCY_STOP')) {
82
+    die "Emergency stop file detected - autoSMART is disabled\n";
83
+}
84
+
85
+# Initialize collector with Proxmox cluster configuration
86
+my $collector = SmartCollector->new(
87
+    cluster_config => $cluster_config,
88
+    local_config   => $local_config,
89
+    debug          => $debug,
90
+);
91
+
92
+log_message("autoSMART collector starting for cluster node...");
93
+
94
+if ($specific_device) {
95
+    # Collect from specific device
96
+    collect_specific_device($collector, $specific_device);
97
+} elsif ($run_once) {
98
+    # Single collection run
99
+    run_collection_cycle($collector);
100
+} elsif ($daemon_mode) {
101
+    # Daemon mode
102
+    run_daemon($collector, \%local_settings);
103
+} else {
104
+    # Default: single collection run
105
+    run_collection_cycle($collector);
106
+}
107
+
108
+log_message("autoSMART collector finished");
109
+
110
+=head2 load_local_config
111
+
112
+Load local configuration from /etc/default/autosmart
113
+
114
+=cut
115
+
116
+sub load_local_config {
117
+    my $config_file = shift;
118
+    
119
+    my %settings = ();
120
+    
121
+    return %settings unless -f $config_file;
122
+    
123
+    open my $fh, '<', $config_file 
124
+        or die "Cannot read local config: $config_file: $!";
125
+    
126
+    while (my $line = <$fh>) {
127
+        chomp $line;
128
+        next if $line =~ /^\s*#/ || $line =~ /^\s*$/;
129
+        
130
+        if ($line =~ /^(\w+)=(.+)$/) {
131
+            my ($key, $value) = ($1, $2);
132
+            $value =~ s/^["']|["']$//g;  # Remove quotes
133
+            $settings{$key} = $value;
134
+        }
135
+    }
136
+    
137
+    close $fh;
138
+    
139
+    return %settings;
140
+}
141
+
142
+=head2 collect_specific_device
143
+
144
+Collect SMART data from a specific device
145
+
146
+=cut
147
+
148
+sub collect_specific_device {
149
+    my ($collector, $device_path) = @_;
150
+    
151
+    log_message("Collecting SMART data from $device_path");
152
+    
153
+    my $smart_data = $collector->collect_smart_data($device_path);
154
+    
155
+    unless ($smart_data) {
156
+        log_message("ERROR: Failed to collect SMART data from $device_path");
157
+        exit 1;
158
+    }
159
+    
160
+    # Create minimal drive info for storage
161
+    my $drive_info = {
162
+        device_path => $device_path,
163
+        serial      => $smart_data->{serial_number} || 'unknown',
164
+        model       => $smart_data->{model_name} || 'unknown',
165
+        size_gb     => 0,
166
+        madagascar_id => "manual_$device_path",
167
+    };
168
+    
169
+    if ($collector->store_smart_data($drive_info, $smart_data)) {
170
+        log_message("Successfully stored SMART data for $device_path");
171
+    } else {
172
+        log_message("ERROR: Failed to store SMART data for $device_path");
173
+        exit 1;
174
+    }
175
+}
176
+
177
+=head2 run_collection_cycle
178
+
179
+Execute one complete collection cycle
180
+
181
+=cut
182
+
183
+sub run_collection_cycle {
184
+    my $collector = shift;
185
+    
186
+    log_message("Starting collection cycle");
187
+    
188
+    my $result = $collector->collect_all();
189
+    
190
+    log_message(sprintf(
191
+        "Collection cycle complete: %d successful, %d failed, %d total",
192
+        $result->{successful},
193
+        $result->{failed},
194
+        $result->{total}
195
+    ));
196
+    
197
+    # Exit with error code if any collections failed
198
+    if ($result->{failed} > 0) {
199
+        exit 1;
200
+    }
201
+}
202
+
203
+=head2 run_daemon
204
+
205
+Run as daemon with periodic collection
206
+
207
+=cut
208
+
209
+sub run_daemon {
210
+    my $collector = shift;
211
+    
212
+    # Get collection interval from config
213
+    my $cfg = Config::Simple->new("$config_dir/smart.conf");
214
+    my $interval = $cfg->param('monitoring.collection_interval') || 300;
215
+    
216
+    log_message("Running in daemon mode (interval: ${interval}s)");
217
+    
218
+    # Set up signal handlers for graceful shutdown
219
+    my $running = 1;
220
+    
221
+    $SIG{TERM} = sub {
222
+        log_message("Received SIGTERM, shutting down gracefully");
223
+        $running = 0;
224
+    };
225
+    
226
+    $SIG{INT} = sub {
227
+        log_message("Received SIGINT, shutting down gracefully");
228
+        $running = 0;
229
+    };
230
+    
231
+    # Main daemon loop
232
+    while ($running) {
233
+        my $start_time = time();
234
+        
235
+        eval {
236
+            run_collection_cycle($collector);
237
+        };
238
+        
239
+        if ($@) {
240
+            log_message("ERROR in collection cycle: $@");
241
+        }
242
+        
243
+        # Calculate sleep time to maintain interval
244
+        my $elapsed = time() - $start_time;
245
+        my $sleep_time = $interval - $elapsed;
246
+        
247
+        if ($sleep_time > 0) {
248
+            log_message("Sleeping for ${sleep_time}s until next collection");
249
+            
250
+            # Sleep in small chunks to allow signal handling
251
+            while ($sleep_time > 0 && $running) {
252
+                my $chunk = $sleep_time > 5 ? 5 : $sleep_time;
253
+                sleep($chunk);
254
+                $sleep_time -= $chunk;
255
+            }
256
+        } else {
257
+            log_message("WARNING: Collection took longer than interval (${elapsed}s > ${interval}s)");
258
+        }
259
+    }
260
+    
261
+    log_message("Daemon shutdown complete");
262
+}
263
+
264
+=head2 log_message
265
+
266
+Log message with timestamp
267
+
268
+=cut
269
+
270
+sub log_message {
271
+    my $message = shift;
272
+    
273
+    my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime());
274
+    print "[$timestamp] autosmart-collector: $message\n";
275
+}
276
+
277
+=head2 print_help
278
+
279
+Display help information
280
+
281
+=cut
282
+
283
+sub print_help {
284
+    print <<'EOF';
285
+autoSMART Data Collector v1.0
286
+
287
+USAGE:
288
+    autosmart-collector.pl [OPTIONS]
289
+
290
+OPTIONS:
291
+    --config-dir DIR     Configuration directory (default: /etc/autosmart)
292
+    --daemon             Run as daemon with periodic collection
293
+    --once               Run once and exit (useful for cron jobs)
294
+    --device PATH        Collect from specific device only (e.g., /dev/sda)
295
+    --debug              Enable debug logging
296
+    --help               Show this help message
297
+
298
+EXAMPLES:
299
+    # Run once (for cron jobs)
300
+    autosmart-collector.pl --once
301
+
302
+    # Run as daemon
303
+    autosmart-collector.pl --daemon
304
+
305
+    # Collect from specific device
306
+    autosmart-collector.pl --device /dev/sda
307
+
308
+    # Run with debug logging
309
+    autosmart-collector.pl --debug --once
310
+
311
+    # Use custom config directory
312
+    autosmart-collector.pl --config-dir /opt/autosmart/config --once
313
+
314
+CONFIGURATION:
315
+    Configuration files should be in /etc/autosmart/ or specified directory:
316
+    - smart.conf        SMART monitoring settings
317
+    - database.conf     PostgreSQL connection settings
318
+
319
+DAEMON MODE:
320
+    In daemon mode, the collector runs continuously and collects data at
321
+    intervals specified in smart.conf (monitoring.collection_interval).
322
+    
323
+    Send SIGTERM or SIGINT for graceful shutdown.
324
+
325
+CRON MODE:
326
+    Use --once flag for cron-based scheduling:
327
+    
328
+    # Collect every 5 minutes
329
+    */5 * * * * /usr/local/bin/autosmart-collector.pl --once
330
+
331
+EXIT CODES:
332
+    0   Success
333
+    1   Error (failed collections, missing config, etc.)
334
+
335
+EOF
336
+}
337
+
338
+__END__
339
+
340
+=head1 AUTHOR
341
+
342
+AutoSMART Development Team
343
+
344
+=head1 LICENSE
345
+
346
+This software is part of the autoSMART project.
347
+
348
+=cut
+615 -0
projects/autoSMART/scripts/autosmart-migration-report.pl
@@ -0,0 +1,615 @@
1
+#!/usr/bin/perl
2
+
3
+use strict;
4
+use warnings;
5
+use DBI;
6
+use Getopt::Long;
7
+use Config::Simple;
8
+use JSON::XS;
9
+use POSIX qw(strftime);
10
+
11
+=head1 NAME
12
+
13
+autosmart-migration-report.pl - HDD Migration Analysis and Reporting
14
+
15
+=head1 SYNOPSIS
16
+
17
+    autosmart-migration-report.pl [OPTIONS]
18
+
19
+=head1 OPTIONS
20
+
21
+    --config-dir DIR     Configuration directory (default: /etc/pve/autoSMART)
22
+    --days N            Days of migration history (default: 30)
23
+    --serial SERIAL     Report for specific HDD serial number
24
+    --node NODE         Report migrations for specific node
25
+    --type TYPE         Migration type: device_change, node_change, slot_change, all
26
+    --format FORMAT     Output format: text, json, csv (default: text)
27
+    --frequent-only     Show only frequently migrated drives (>3 migrations)
28
+    --recent-only       Show only recent migrations (<24h)
29
+    --output FILE       Write to file instead of stdout
30
+    --help              Show this help
31
+
32
+=head1 DESCRIPTION
33
+
34
+Analyze and report HDD migrations tracked by autoSMART. Shows drive movements
35
+between nodes, device path changes, and slot changes with detailed history.
36
+
37
+=cut
38
+
39
+# Configuration
40
+my $config_dir = '/etc/pve/autoSMART';
41
+my $days = 30;
42
+my $specific_serial = '';
43
+my $specific_node = '';
44
+my $migration_type = 'all';
45
+my $format = 'text';
46
+my $frequent_only = 0;
47
+my $recent_only = 0;
48
+my $output_file = '';
49
+my $help = 0;
50
+
51
+GetOptions(
52
+    'config-dir=s'  => \$config_dir,
53
+    'days=i'        => \$days,
54
+    'serial=s'      => \$specific_serial,
55
+    'node=s'        => \$specific_node,
56
+    'type=s'        => \$migration_type,
57
+    'format=s'      => \$format,
58
+    'frequent-only' => \$frequent_only,
59
+    'recent-only'   => \$recent_only,
60
+    'output=s'      => \$output_file,
61
+    'help'          => \$help,
62
+) or die "Error parsing command line arguments\n";
63
+
64
+if ($help) {
65
+    print_help();
66
+    exit 0;
67
+}
68
+
69
+# Validate options
70
+unless ($format =~ /^(text|json|csv)$/) {
71
+    die "Invalid format: $format (must be text, json, or csv)\n";
72
+}
73
+
74
+unless ($migration_type =~ /^(device_change|node_change|slot_change|all)$/) {
75
+    die "Invalid migration type: $migration_type\n";
76
+}
77
+
78
+# Connect to database
79
+my $db_config = "$config_dir/cluster.conf";
80
+unless (-f $db_config) {
81
+    die "Cluster configuration not found: $db_config\n";
82
+}
83
+
84
+my $cfg = Config::Simple->new($db_config);
85
+my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
86
+    $cfg->param('database.database'),
87
+    $cfg->param('database.host'),
88
+    $cfg->param('database.port')
89
+);
90
+
91
+my $dbh = DBI->connect(
92
+    $dsn,
93
+    $cfg->param('database.username'),
94
+    $cfg->param('database.password'),
95
+    { RaiseError => 1, AutoCommit => 1, pg_enable_utf8 => 1 }
96
+) or die "Database connection failed: $DBI::errstr";
97
+
98
+# Generate migration report
99
+my $report_data = generate_migration_report($dbh);
100
+
101
+# Output report
102
+my $output_handle = \*STDOUT;
103
+if ($output_file) {
104
+    open $output_handle, '>', $output_file 
105
+        or die "Cannot open output file $output_file: $!\n";
106
+}
107
+
108
+if ($format eq 'json') {
109
+    output_json($output_handle, $report_data);
110
+} elsif ($format eq 'csv') {
111
+    output_csv($output_handle, $report_data);
112
+} else {
113
+    output_text($output_handle, $report_data);
114
+}
115
+
116
+close $output_handle if $output_file;
117
+$dbh->disconnect();
118
+
119
+=head2 generate_migration_report
120
+
121
+Generate comprehensive migration report
122
+
123
+=cut
124
+
125
+sub generate_migration_report {
126
+    my $dbh = shift;
127
+    
128
+    my $data = {
129
+        generated_at => time(),
130
+        days_analyzed => $days,
131
+        filters => {
132
+            serial => $specific_serial,
133
+            node => $specific_node,
134
+            type => $migration_type,
135
+            frequent_only => $frequent_only,
136
+            recent_only => $recent_only,
137
+        }
138
+    };
139
+    
140
+    # Get migration statistics
141
+    $data->{statistics} = get_migration_statistics($dbh);
142
+    
143
+    # Get migration details
144
+    $data->{migrations} = get_migration_details($dbh);
145
+    
146
+    # Get frequently migrated drives
147
+    $data->{frequent_migrants} = get_frequent_migrants($dbh);
148
+    
149
+    # Get drive current status
150
+    $data->{drive_status} = get_drive_migration_status($dbh);
151
+    
152
+    return $data;
153
+}
154
+
155
+=head2 get_migration_statistics
156
+
157
+Get overall migration statistics
158
+
159
+=cut
160
+
161
+sub get_migration_statistics {
162
+    my $dbh = shift;
163
+    
164
+    my $stats = {};
165
+    
166
+    # Total migrations by type
167
+    my $sql = q{
168
+        SELECT 
169
+            migration_type,
170
+            COUNT(*) as count,
171
+            COUNT(DISTINCT serial_number) as unique_drives
172
+        FROM hdd_migrations
173
+        WHERE migration_timestamp >= NOW() - INTERVAL ? DAY
174
+        GROUP BY migration_type
175
+        ORDER BY count DESC
176
+    };
177
+    
178
+    my $sth = $dbh->prepare($sql);
179
+    $sth->execute($days);
180
+    
181
+    $stats->{by_type} = {};
182
+    while (my $row = $sth->fetchrow_hashref()) {
183
+        $stats->{by_type}->{$row->{migration_type}} = {
184
+            count => $row->{count},
185
+            unique_drives => $row->{unique_drives}
186
+        };
187
+    }
188
+    
189
+    # Migrations by node
190
+    $sql = q{
191
+        SELECT 
192
+            COALESCE(new_node_id, old_node_id) as node_id,
193
+            COUNT(*) as migrations_involving_node
194
+        FROM hdd_migrations
195
+        WHERE migration_timestamp >= NOW() - INTERVAL ? DAY
196
+        GROUP BY COALESCE(new_node_id, old_node_id)
197
+        ORDER BY migrations_involving_node DESC
198
+    };
199
+    
200
+    $sth = $dbh->prepare($sql);
201
+    $sth->execute($days);
202
+    
203
+    $stats->{by_node} = {};
204
+    while (my $row = $sth->fetchrow_hashref()) {
205
+        $stats->{by_node}->{$row->{node_id}} = $row->{migrations_involving_node};
206
+    }
207
+    
208
+    # Recent activity
209
+    $sql = q{
210
+        SELECT 
211
+            DATE(migration_timestamp) as date,
212
+            COUNT(*) as migrations_per_day
213
+        FROM hdd_migrations
214
+        WHERE migration_timestamp >= NOW() - INTERVAL ? DAY
215
+        GROUP BY DATE(migration_timestamp)
216
+        ORDER BY date DESC
217
+        LIMIT 7
218
+    };
219
+    
220
+    $sth = $dbh->prepare($sql);
221
+    $sth->execute($days);
222
+    
223
+    $stats->{recent_activity} = [];
224
+    while (my $row = $sth->fetchrow_hashref()) {
225
+        push @{$stats->{recent_activity}}, {
226
+            date => $row->{date},
227
+            count => $row->{migrations_per_day}
228
+        };
229
+    }
230
+    
231
+    return $stats;
232
+}
233
+
234
+=head2 get_migration_details
235
+
236
+Get detailed migration records
237
+
238
+=cut
239
+
240
+sub get_migration_details {
241
+    my $dbh = shift;
242
+    
243
+    my $sql = q{
244
+        SELECT 
245
+            m.serial_number,
246
+            hi.model_name,
247
+            hi.current_device_path,
248
+            hi.current_node_id,
249
+            m.migration_type,
250
+            m.migration_timestamp,
251
+            m.old_device_path,
252
+            m.old_node_id,
253
+            m.old_slot,
254
+            m.new_device_path,
255
+            m.new_node_id,
256
+            m.new_slot,
257
+            m.detected_by,
258
+            m.confidence_level,
259
+            m.trigger_reason,
260
+            m.verification_status
261
+        FROM hdd_migrations m
262
+        JOIN hdd_inventory hi ON m.hdd_id = hi.id
263
+        WHERE m.migration_timestamp >= NOW() - INTERVAL ? DAY
264
+    };
265
+    
266
+    my @params = ($days);
267
+    
268
+    # Add filters
269
+    if ($specific_serial) {
270
+        $sql .= " AND m.serial_number = ?";
271
+        push @params, $specific_serial;
272
+    }
273
+    
274
+    if ($specific_node) {
275
+        $sql .= " AND (m.old_node_id = ? OR m.new_node_id = ?)";
276
+        push @params, $specific_node, $specific_node;
277
+    }
278
+    
279
+    if ($migration_type ne 'all') {
280
+        $sql .= " AND m.migration_type = ?";
281
+        push @params, $migration_type;
282
+    }
283
+    
284
+    if ($recent_only) {
285
+        $sql .= " AND m.migration_timestamp >= NOW() - INTERVAL '24 hours'";
286
+    }
287
+    
288
+    $sql .= " ORDER BY m.migration_timestamp DESC LIMIT 100";
289
+    
290
+    my $sth = $dbh->prepare($sql);
291
+    $sth->execute(@params);
292
+    
293
+    my @migrations = ();
294
+    while (my $row = $sth->fetchrow_hashref()) {
295
+        push @migrations, $row;
296
+    }
297
+    
298
+    return \@migrations;
299
+}
300
+
301
+=head2 get_frequent_migrants
302
+
303
+Get drives that migrate frequently
304
+
305
+=cut
306
+
307
+sub get_frequent_migrants {
308
+    my $dbh = shift;
309
+    
310
+    my $min_migrations = $frequent_only ? 3 : 1;
311
+    
312
+    my $sql = q{
313
+        SELECT 
314
+            hi.serial_number,
315
+            hi.model_name,
316
+            hi.current_device_path,
317
+            hi.current_node_id,
318
+            hi.migration_count,
319
+            hi.last_migration,
320
+            hi.first_seen,
321
+            COUNT(m.id) as recent_migrations,
322
+            string_agg(DISTINCT m.migration_type, ', ') as migration_types
323
+        FROM hdd_inventory hi
324
+        LEFT JOIN hdd_migrations m ON hi.id = m.hdd_id
325
+            AND m.migration_timestamp >= NOW() - INTERVAL ? DAY
326
+        WHERE hi.migration_count >= ?
327
+        GROUP BY hi.id, hi.serial_number, hi.model_name, hi.current_device_path,
328
+                 hi.current_node_id, hi.migration_count, hi.last_migration, hi.first_seen
329
+        HAVING COUNT(m.id) > 0 OR hi.migration_count >= ?
330
+        ORDER BY hi.migration_count DESC, hi.last_migration DESC
331
+        LIMIT 20
332
+    };
333
+    
334
+    my $sth = $dbh->prepare($sql);
335
+    $sth->execute($days, $min_migrations, $min_migrations);
336
+    
337
+    my @frequent = ();
338
+    while (my $row = $sth->fetchrow_hashref()) {
339
+        push @frequent, $row;
340
+    }
341
+    
342
+    return \@frequent;
343
+}
344
+
345
+=head2 get_drive_migration_status
346
+
347
+Get current migration status of drives
348
+
349
+=cut
350
+
351
+sub get_drive_migration_status {
352
+    my $dbh = shift;
353
+    
354
+    my $sql = q{
355
+        SELECT 
356
+            migration_status,
357
+            COUNT(*) as drive_count
358
+        FROM drive_health_summary
359
+        GROUP BY migration_status
360
+        ORDER BY drive_count DESC
361
+    };
362
+    
363
+    my $sth = $dbh->prepare($sql);
364
+    $sth->execute();
365
+    
366
+    my %status = ();
367
+    while (my $row = $sth->fetchrow_hashref()) {
368
+        $status{$row->{migration_status}} = $row->{drive_count};
369
+    }
370
+    
371
+    return \%status;
372
+}
373
+
374
+=head2 output_text
375
+
376
+Output report as text
377
+
378
+=cut
379
+
380
+sub output_text {
381
+    my ($fh, $data) = @_;
382
+    
383
+    print $fh "\n" . "="x80 . "\n";
384
+    print $fh "autoSMART HDD Migration Report\n";
385
+    print $fh "Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime($data->{generated_at})) . "\n";
386
+    print $fh "Period: Last $data->{days_analyzed} days\n";
387
+    print $fh "="x80 . "\n\n";
388
+    
389
+    # Statistics
390
+    my $stats = $data->{statistics};
391
+    
392
+    print $fh "MIGRATION STATISTICS\n";
393
+    print $fh "-"x40 . "\n";
394
+    
395
+    if (%{$stats->{by_type}}) {
396
+        print $fh "By Type:\n";
397
+        foreach my $type (sort keys %{$stats->{by_type}}) {
398
+            my $info = $stats->{by_type}->{$type};
399
+            printf $fh "  %-15s: %d migrations (%d unique drives)\n",
400
+                   $type, $info->{count}, $info->{unique_drives};
401
+        }
402
+    }
403
+    
404
+    if (%{$stats->{by_node}}) {
405
+        print $fh "\nBy Node:\n";
406
+        foreach my $node (sort { $stats->{by_node}->{$b} <=> $stats->{by_node}->{$a} } 
407
+                         keys %{$stats->{by_node}}) {
408
+            printf $fh "  %-15s: %d migrations\n", $node, $stats->{by_node}->{$node};
409
+        }
410
+    }
411
+    
412
+    # Recent activity
413
+    if (@{$stats->{recent_activity}}) {
414
+        print $fh "\nRecent Activity (Last 7 days):\n";
415
+        foreach my $activity (@{$stats->{recent_activity}}) {
416
+            printf $fh "  %s: %d migrations\n", $activity->{date}, $activity->{count};
417
+        }
418
+    }
419
+    
420
+    # Drive migration status
421
+    if (%{$data->{drive_status}}) {
422
+        print $fh "\nDrive Migration Status:\n";
423
+        foreach my $status (sort keys %{$data->{drive_status}}) {
424
+            printf $fh "  %-20s: %d drives\n", $status, $data->{drive_status}->{$status};
425
+        }
426
+    }
427
+    
428
+    # Frequently migrated drives
429
+    if (@{$data->{frequent_migrants}}) {
430
+        print $fh "\n" . "="x80 . "\n";
431
+        print $fh "FREQUENTLY MIGRATED DRIVES\n";
432
+        print $fh "="x80 . "\n";
433
+        
434
+        foreach my $drive (@{$data->{frequent_migrants}}) {
435
+            printf $fh "\nSerial: %s (%s)\n", 
436
+                   $drive->{serial_number}, $drive->{model_name};
437
+            printf $fh "Current: %s @ %s\n", 
438
+                   $drive->{current_device_path} || 'unknown',
439
+                   $drive->{current_node_id} || 'unknown';
440
+            printf $fh "Total migrations: %d (Recent: %d)\n", 
441
+                   $drive->{migration_count}, $drive->{recent_migrations};
442
+            printf $fh "Last migration: %s\n", 
443
+                   $drive->{last_migration} || 'never';
444
+            printf $fh "Migration types: %s\n", 
445
+                   $drive->{migration_types} || 'none';
446
+        }
447
+    }
448
+    
449
+    # Recent migrations
450
+    if (@{$data->{migrations}}) {
451
+        print $fh "\n" . "="x80 . "\n";
452
+        print $fh "RECENT MIGRATIONS\n";
453
+        print $fh "="x80 . "\n";
454
+        
455
+        foreach my $migration (@{$data->{migrations}}) {
456
+            printf $fh "\n[%s] %s - %s\n",
457
+                   $migration->{migration_timestamp},
458
+                   $migration->{serial_number},
459
+                   uc($migration->{migration_type});
460
+            
461
+            if ($migration->{migration_type} eq 'node_change') {
462
+                printf $fh "  Moved: %s@%s -> %s@%s\n",
463
+                       $migration->{old_device_path} || '?',
464
+                       $migration->{old_node_id} || '?',
465
+                       $migration->{new_device_path} || '?',
466
+                       $migration->{new_node_id} || '?';
467
+            } elsif ($migration->{migration_type} eq 'device_change') {
468
+                printf $fh "  Device: %s -> %s (on %s)\n",
469
+                       $migration->{old_device_path} || '?',
470
+                       $migration->{new_device_path} || '?',
471
+                       $migration->{new_node_id} || '?';
472
+            }
473
+            
474
+            printf $fh "  Detected by: %s (confidence: %d/10)\n",
475
+                   $migration->{detected_by}, $migration->{confidence_level};
476
+            
477
+            if ($migration->{trigger_reason}) {
478
+                printf $fh "  Reason: %s\n", $migration->{trigger_reason};
479
+            }
480
+        }
481
+    }
482
+    
483
+    print $fh "\n";
484
+}
485
+
486
+=head2 output_json
487
+
488
+Output report as JSON
489
+
490
+=cut
491
+
492
+sub output_json {
493
+    my ($fh, $data) = @_;
494
+    
495
+    my $json = JSON::XS->new->pretty->encode($data);
496
+    print $fh $json;
497
+}
498
+
499
+=head2 output_csv
500
+
501
+Output migrations as CSV
502
+
503
+=cut
504
+
505
+sub output_csv {
506
+    my ($fh, $data) = @_;
507
+    
508
+    # CSV header
509
+    print $fh "timestamp,serial_number,model_name,migration_type,old_location,new_location,detected_by,confidence\n";
510
+    
511
+    foreach my $migration (@{$data->{migrations}}) {
512
+        my @fields = (
513
+            $migration->{migration_timestamp},
514
+            $migration->{serial_number},
515
+            $migration->{model_name} || '',
516
+            $migration->{migration_type},
517
+            sprintf("%s@%s", $migration->{old_device_path} || '', $migration->{old_node_id} || ''),
518
+            sprintf("%s@%s", $migration->{new_device_path} || '', $migration->{new_node_id} || ''),
519
+            $migration->{detected_by},
520
+            $migration->{confidence_level}
521
+        );
522
+        
523
+        # Escape CSV fields
524
+        @fields = map { escape_csv($_) } @fields;
525
+        print $fh join(',', @fields) . "\n";
526
+    }
527
+}
528
+
529
+=head2 escape_csv
530
+
531
+Escape CSV field
532
+
533
+=cut
534
+
535
+sub escape_csv {
536
+    my $field = shift || '';
537
+    
538
+    if ($field =~ /[",\n]/) {
539
+        $field =~ s/"/""/g;
540
+        $field = "\"$field\"";
541
+    }
542
+    
543
+    return $field;
544
+}
545
+
546
+=head2 print_help
547
+
548
+Display help information
549
+
550
+=cut
551
+
552
+sub print_help {
553
+    print <<'EOF';
554
+autoSMART HDD Migration Report v1.0
555
+
556
+USAGE:
557
+    autosmart-migration-report.pl [OPTIONS]
558
+
559
+OPTIONS:
560
+    --config-dir DIR     Configuration directory (default: /etc/pve/autoSMART)
561
+    --days N            Days of migration history to analyze (default: 30)
562
+    --serial SERIAL     Report for specific HDD serial number
563
+    --node NODE         Show migrations involving specific node
564
+    --type TYPE         Filter by migration type:
565
+                        device_change, node_change, slot_change, all (default)
566
+    --format FORMAT     Output format: text, json, csv (default: text)
567
+    --frequent-only     Show only frequently migrated drives (3+ migrations)
568
+    --recent-only       Show only migrations in last 24 hours
569
+    --output FILE       Write to file instead of stdout
570
+    --help              Show this help message
571
+
572
+EXAMPLES:
573
+    # Show all migrations in last 7 days
574
+    autosmart-migration-report.pl --days 7
575
+
576
+    # Show only node changes
577
+    autosmart-migration-report.pl --type node_change
578
+
579
+    # Show migrations for specific drive
580
+    autosmart-migration-report.pl --serial WD-WCC4N5123456
581
+
582
+    # Show frequently migrated drives
583
+    autosmart-migration-report.pl --frequent-only
584
+
585
+    # Export recent migrations as CSV
586
+    autosmart-migration-report.pl --recent-only --format csv --output migrations.csv
587
+
588
+MIGRATION TYPES:
589
+    device_change   Drive appeared at different /dev/sdX path
590
+    node_change     Drive moved between Proxmox nodes
591
+    slot_change     Drive moved to different physical slot/bay
592
+    discovery       New drive detected for first time
593
+
594
+OUTPUT:
595
+    The report includes:
596
+    - Overall migration statistics
597
+    - Frequently migrated drives
598
+    - Recent migration activity
599
+    - Detailed migration logs
600
+    - Drive migration status summary
601
+
602
+EOF
603
+}
604
+
605
+__END__
606
+
607
+=head1 AUTHOR
608
+
609
+AutoSMART Development Team
610
+
611
+=head1 LICENSE
612
+
613
+This software is part of the autoSMART project.
614
+
615
+=cut
+483 -0
projects/autoSMART/scripts/autosmart-predictor.pl
@@ -0,0 +1,483 @@
1
+#!/usr/bin/perl
2
+
3
+use strict;
4
+use warnings;
5
+use FindBin qw($Bin);
6
+use lib "$Bin/../lib";
7
+
8
+use PredictionEngine;
9
+use Getopt::Long;
10
+use JSON::XS;
11
+use POSIX qw(strftime);
12
+
13
+=head1 NAME
14
+
15
+autosmart-predictor.pl - AI-powered HDD failure prediction for autoSMART
16
+
17
+=head1 SYNOPSIS
18
+
19
+    autosmart-predictor.pl [OPTIONS]
20
+
21
+=head1 OPTIONS
22
+
23
+    --config-dir DIR     Configuration directory (default: /etc/autosmart)
24
+    --device PATH        Analyze specific device only
25
+    --all               Analyze all active drives
26
+    --days-back N       Days of history to analyze (default: 90)
27
+    --output FORMAT     Output format: text, json, csv (default: text)
28
+    --risk-level LEVEL  Show only drives with risk >= LEVEL (low, moderate, high, critical)
29
+    --quiet             Quiet mode - only output results
30
+    --debug             Enable debug logging
31
+    --help              Show this help
32
+
33
+=head1 DESCRIPTION
34
+
35
+This script uses AI (OpenAI GPT) to analyze SMART data trends and predict
36
+HDD failures. It processes historical SMART data stored by the collector
37
+and generates intelligent predictions with confidence levels.
38
+
39
+=cut
40
+
41
+# Configuration
42
+my $config_dir = '/etc/autosmart';
43
+my $specific_device = '';
44
+my $analyze_all = 0;
45
+my $days_back = 90;
46
+my $output_format = 'text';
47
+my $min_risk_level = '';
48
+my $quiet = 0;
49
+my $debug = 0;
50
+my $help = 0;
51
+
52
+GetOptions(
53
+    'config-dir=s'   => \$config_dir,
54
+    'device=s'       => \$specific_device,
55
+    'all'            => \$analyze_all,
56
+    'days-back=i'    => \$days_back,
57
+    'output=s'       => \$output_format,
58
+    'risk-level=s'   => \$min_risk_level,
59
+    'quiet'          => \$quiet,
60
+    'debug'          => \$debug,
61
+    'help'           => \$help,
62
+) or die "Error parsing command line arguments\n";
63
+
64
+if ($help) {
65
+    print_help();
66
+    exit 0;
67
+}
68
+
69
+# Validate options
70
+unless ($specific_device || $analyze_all) {
71
+    die "Must specify either --device PATH or --all\n";
72
+}
73
+
74
+if ($specific_device && $analyze_all) {
75
+    die "Cannot specify both --device and --all\n";
76
+}
77
+
78
+unless ($output_format =~ /^(text|json|csv)$/) {
79
+    die "Invalid output format: $output_format (must be text, json, or csv)\n";
80
+}
81
+
82
+if ($min_risk_level && $min_risk_level !~ /^(low|moderate|high|critical)$/) {
83
+    die "Invalid risk level: $min_risk_level (must be low, moderate, high, or critical)\n";
84
+}
85
+
86
+# Validate configuration directory
87
+unless (-d $config_dir) {
88
+    die "Configuration directory not found: $config_dir\n";
89
+}
90
+
91
+my $db_config = "$config_dir/database.conf";
92
+my $openai_config = "$config_dir/openai.conf";
93
+
94
+unless (-f $db_config && -f $openai_config) {
95
+    die "Required configuration files not found in $config_dir\n";
96
+}
97
+
98
+# Initialize prediction engine
99
+my $predictor = PredictionEngine->new(
100
+    db_config     => $db_config,
101
+    openai_config => $openai_config,
102
+    debug         => $debug,
103
+);
104
+
105
+log_message("autoSMART predictor starting...") unless $quiet;
106
+
107
+my @predictions = ();
108
+
109
+if ($specific_device) {
110
+    # Analyze specific device
111
+    log_message("Analyzing device: $specific_device") unless $quiet;
112
+    
113
+    my $prediction = $predictor->predict_failure($specific_device, $days_back);
114
+    push @predictions, $prediction;
115
+    
116
+} elsif ($analyze_all) {
117
+    # Analyze all active drives
118
+    log_message("Analyzing all active drives...") unless $quiet;
119
+    
120
+    @predictions = @{$predictor->analyze_all_drives()};
121
+}
122
+
123
+# Filter predictions by minimum risk level if specified
124
+if ($min_risk_level) {
125
+    @predictions = filter_by_risk_level(\@predictions, $min_risk_level);
126
+}
127
+
128
+# Output results
129
+output_predictions(\@predictions, $output_format);
130
+
131
+log_message("Analysis complete") unless $quiet;
132
+
133
+=head2 filter_by_risk_level
134
+
135
+Filter predictions by minimum risk level
136
+
137
+=cut
138
+
139
+sub filter_by_risk_level {
140
+    my ($predictions, $min_level) = @_;
141
+    
142
+    my %risk_order = (
143
+        'low'      => 1,
144
+        'moderate' => 2, 
145
+        'high'     => 3,
146
+        'critical' => 4,
147
+    );
148
+    
149
+    my $min_order = $risk_order{$min_level} || 1;
150
+    
151
+    return grep { 
152
+        exists $risk_order{$_->{risk_level}} && 
153
+        $risk_order{$_->{risk_level}} >= $min_order 
154
+    } @$predictions;
155
+}
156
+
157
+=head2 output_predictions
158
+
159
+Output predictions in specified format
160
+
161
+=cut
162
+
163
+sub output_predictions {
164
+    my ($predictions, $format) = @_;
165
+    
166
+    if ($format eq 'json') {
167
+        output_json($predictions);
168
+    } elsif ($format eq 'csv') {
169
+        output_csv($predictions);
170
+    } else {
171
+        output_text($predictions);
172
+    }
173
+}
174
+
175
+=head2 output_json
176
+
177
+Output predictions as JSON
178
+
179
+=cut
180
+
181
+sub output_json {
182
+    my $predictions = shift;
183
+    
184
+    my $json = JSON::XS->new->pretty->encode({
185
+        timestamp => time(),
186
+        predictions => $predictions,
187
+    });
188
+    
189
+    print $json;
190
+}
191
+
192
+=head2 output_csv
193
+
194
+Output predictions as CSV
195
+
196
+=cut
197
+
198
+sub output_csv {
199
+    my $predictions = shift;
200
+    
201
+    # CSV header
202
+    print "device_path,timestamp,risk_level,confidence,time_to_failure_days,concerns,recommendations\n";
203
+    
204
+    foreach my $pred (@$predictions) {
205
+        my @fields = (
206
+            $pred->{device_path} || '',
207
+            $pred->{timestamp} || '',
208
+            $pred->{risk_level} || '',
209
+            $pred->{confidence} || '',
210
+            $pred->{time_to_failure_days} || '',
211
+            escape_csv($pred->{concerns} || ''),
212
+            escape_csv($pred->{recommendations} || ''),
213
+        );
214
+        
215
+        print join(',', @fields) . "\n";
216
+    }
217
+}
218
+
219
+=head2 output_text
220
+
221
+Output predictions as human-readable text
222
+
223
+=cut
224
+
225
+sub output_text {
226
+    my $predictions = shift;
227
+    
228
+    unless (@$predictions) {
229
+        print "No predictions available.\n";
230
+        return;
231
+    }
232
+    
233
+    print "\n" . "="x80 . "\n";
234
+    print "autoSMART HDD Failure Prediction Report\n";
235
+    print "Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime()) . "\n";
236
+    print "="x80 . "\n\n";
237
+    
238
+    foreach my $pred (@$predictions) {
239
+        print_prediction_text($pred);
240
+        print "-"x80 . "\n";
241
+    }
242
+    
243
+    # Summary statistics
244
+    my %risk_counts = ();
245
+    my $total_confidence = 0;
246
+    my $confidence_count = 0;
247
+    
248
+    foreach my $pred (@$predictions) {
249
+        $risk_counts{$pred->{risk_level} || 'unknown'}++;
250
+        
251
+        if (defined $pred->{confidence} && $pred->{confidence} > 0) {
252
+            $total_confidence += $pred->{confidence};
253
+            $confidence_count++;
254
+        }
255
+    }
256
+    
257
+    print "\nSUMMARY:\n";
258
+    print "Total drives analyzed: " . scalar(@$predictions) . "\n";
259
+    
260
+    foreach my $level (qw(critical high moderate low unknown)) {
261
+        next unless $risk_counts{$level};
262
+        print sprintf("%-10s risk: %d drives\n", 
263
+                     ucfirst($level), $risk_counts{$level});
264
+    }
265
+    
266
+    if ($confidence_count > 0) {
267
+        my $avg_confidence = $total_confidence / $confidence_count;
268
+        print sprintf("Average confidence: %.1f%%\n", $avg_confidence);
269
+    }
270
+    
271
+    print "\n";
272
+}
273
+
274
+=head2 print_prediction_text
275
+
276
+Print a single prediction in text format
277
+
278
+=cut
279
+
280
+sub print_prediction_text {
281
+    my $pred = shift;
282
+    
283
+    print "DEVICE: $pred->{device_path}\n";
284
+    
285
+    if ($pred->{prediction} eq 'insufficient_data') {
286
+        print "STATUS: Insufficient data for analysis\n";
287
+        print "MESSAGE: $pred->{message}\n";
288
+        return;
289
+    }
290
+    
291
+    print "RISK LEVEL: " . format_risk_level($pred->{risk_level}) . "\n";
292
+    
293
+    if (defined $pred->{confidence}) {
294
+        print "CONFIDENCE: $pred->{confidence}%\n";
295
+    }
296
+    
297
+    if (defined $pred->{time_to_failure_days} && $pred->{time_to_failure_days} > 0) {
298
+        print "ESTIMATED TIME TO FAILURE: $pred->{time_to_failure_days} days\n";
299
+    }
300
+    
301
+    if ($pred->{concerns}) {
302
+        print "CONCERNS:\n";
303
+        print format_text_block($pred->{concerns}, "  ");
304
+    }
305
+    
306
+    if ($pred->{recommendations}) {
307
+        print "RECOMMENDATIONS:\n";
308
+        print format_text_block($pred->{recommendations}, "  ");
309
+    }
310
+    
311
+    if ($pred->{reasoning} && $debug) {
312
+        print "AI REASONING:\n";
313
+        print format_text_block($pred->{reasoning}, "  ");
314
+    }
315
+    
316
+    my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime($pred->{timestamp}));
317
+    print "ANALYZED: $timestamp\n";
318
+    
319
+    print "\n";
320
+}
321
+
322
+=head2 format_risk_level
323
+
324
+Format risk level with color coding (if terminal supports it)
325
+
326
+=cut
327
+
328
+sub format_risk_level {
329
+    my $level = shift || 'unknown';
330
+    
331
+    # Simple color codes (won't work in all terminals)
332
+    my %colors = (
333
+        'critical' => "\033[1;31m",  # Bold red
334
+        'high'     => "\033[0;31m",  # Red
335
+        'moderate' => "\033[0;33m",  # Yellow
336
+        'low'      => "\033[0;32m",  # Green
337
+        'unknown'  => "\033[0;37m",  # Gray
338
+    );
339
+    
340
+    my $reset = "\033[0m";
341
+    
342
+    # Only use colors if output is to terminal
343
+    if (-t STDOUT) {
344
+        return ($colors{$level} || '') . uc($level) . $reset;
345
+    } else {
346
+        return uc($level);
347
+    }
348
+}
349
+
350
+=head2 format_text_block
351
+
352
+Format multi-line text with indentation
353
+
354
+=cut
355
+
356
+sub format_text_block {
357
+    my ($text, $indent) = @_;
358
+    
359
+    return '' unless $text;
360
+    
361
+    $indent ||= '';
362
+    
363
+    my @lines = split /\n/, $text;
364
+    return join("\n", map { "$indent$_" } @lines) . "\n";
365
+}
366
+
367
+=head2 escape_csv
368
+
369
+Escape CSV field content
370
+
371
+=cut
372
+
373
+sub escape_csv {
374
+    my $field = shift || '';
375
+    
376
+    # Escape quotes and wrap in quotes if contains comma/quote/newline
377
+    if ($field =~ /[",\n]/) {
378
+        $field =~ s/"/""/g;
379
+        $field = "\"$field\"";
380
+    }
381
+    
382
+    return $field;
383
+}
384
+
385
+=head2 log_message
386
+
387
+Log message with timestamp
388
+
389
+=cut
390
+
391
+sub log_message {
392
+    my $message = shift;
393
+    
394
+    my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime());
395
+    print STDERR "[$timestamp] autosmart-predictor: $message\n";
396
+}
397
+
398
+=head2 print_help
399
+
400
+Display help information
401
+
402
+=cut
403
+
404
+sub print_help {
405
+    print <<'EOF';
406
+autoSMART AI Predictor v1.0
407
+
408
+USAGE:
409
+    autosmart-predictor.pl [OPTIONS]
410
+
411
+OPTIONS:
412
+    --config-dir DIR     Configuration directory (default: /etc/autosmart)
413
+    --device PATH        Analyze specific device only (e.g., /dev/sda)
414
+    --all               Analyze all active drives in inventory
415
+    --days-back N       Days of SMART history to analyze (default: 90)
416
+    --output FORMAT     Output format: text, json, csv (default: text)
417
+    --risk-level LEVEL  Show only drives with risk >= LEVEL
418
+                        (low, moderate, high, critical)
419
+    --quiet             Quiet mode - suppress status messages
420
+    --debug             Enable debug logging and show AI reasoning
421
+    --help              Show this help message
422
+
423
+EXAMPLES:
424
+    # Analyze specific drive
425
+    autosmart-predictor.pl --device /dev/sda
426
+
427
+    # Analyze all drives
428
+    autosmart-predictor.pl --all
429
+
430
+    # Analyze with 30 days of history
431
+    autosmart-predictor.pl --all --days-back 30
432
+
433
+    # Show only high/critical risk drives
434
+    autosmart-predictor.pl --all --risk-level high
435
+
436
+    # Output as JSON
437
+    autosmart-predictor.pl --all --output json
438
+
439
+    # Quiet CSV output for scripts
440
+    autosmart-predictor.pl --all --output csv --quiet
441
+
442
+RISK LEVELS:
443
+    LOW         No immediate concerns detected
444
+    MODERATE    Some parameters showing minor degradation
445
+    HIGH        Multiple concerning trends detected
446
+    CRITICAL    Immediate action required - failure likely soon
447
+
448
+OUTPUT FORMATS:
449
+    text        Human-readable report (default)
450
+    json        Machine-readable JSON format
451
+    csv         Comma-separated values for spreadsheets
452
+
453
+CONFIGURATION:
454
+    Required configuration files in /etc/autosmart/:
455
+    - database.conf     PostgreSQL connection settings
456
+    - openai.conf      OpenAI API configuration
457
+
458
+    The predictor requires historical SMART data collected by
459
+    autosmart-collector.pl to generate meaningful predictions.
460
+
461
+AI INTEGRATION:
462
+    This tool uses OpenAI's GPT models to analyze SMART parameter trends
463
+    and generate intelligent failure predictions. Ensure your OpenAI API
464
+    key is properly configured in openai.conf.
465
+
466
+EXIT CODES:
467
+    0   Success
468
+    1   Error (configuration, API failure, etc.)
469
+
470
+EOF
471
+}
472
+
473
+__END__
474
+
475
+=head1 AUTHOR
476
+
477
+AutoSMART Development Team
478
+
479
+=head1 LICENSE
480
+
481
+This software is part of the autoSMART project.
482
+
483
+=cut
+662 -0
projects/autoSMART/scripts/autosmart-report.pl
@@ -0,0 +1,662 @@
1
+#!/usr/bin/perl
2
+
3
+use strict;
4
+use warnings;
5
+use DBI;
6
+use Getopt::Long;
7
+use Config::Simple;
8
+use JSON::XS;
9
+use POSIX qw(strftime);
10
+
11
+=head1 NAME
12
+
13
+autosmart-report.pl - Generate comprehensive reports for autoSMART system
14
+
15
+=head1 SYNOPSIS
16
+
17
+    autosmart-report.pl [OPTIONS]
18
+
19
+=head1 OPTIONS
20
+
21
+    --config-dir DIR     Configuration directory (default: /etc/autosmart)
22
+    --report TYPE        Report type: summary, detailed, health, alerts, trends
23
+    --device PATH        Report for specific device only
24
+    --days N            Days of history to include (default: 30)
25
+    --format FORMAT     Output format: text, html, json (default: text)
26
+    --output FILE       Write to file instead of stdout
27
+    --help              Show this help
28
+
29
+=head1 DESCRIPTION
30
+
31
+Generate various reports from autoSMART data including drive health summaries,
32
+detailed SMART analysis, alert history, and trend analysis.
33
+
34
+=cut
35
+
36
+# Configuration
37
+my $config_dir = '/etc/autosmart';
38
+my $report_type = 'summary';
39
+my $specific_device = '';
40
+my $days = 30;
41
+my $format = 'text';
42
+my $output_file = '';
43
+my $help = 0;
44
+
45
+GetOptions(
46
+    'config-dir=s' => \$config_dir,
47
+    'report=s'     => \$report_type,
48
+    'device=s'     => \$specific_device,
49
+    'days=i'       => \$days,
50
+    'format=s'     => \$format,
51
+    'output=s'     => \$output_file,
52
+    'help'         => \$help,
53
+) or die "Error parsing command line arguments\n";
54
+
55
+if ($help) {
56
+    print_help();
57
+    exit 0;
58
+}
59
+
60
+# Validate options
61
+unless ($report_type =~ /^(summary|detailed|health|alerts|trends)$/) {
62
+    die "Invalid report type: $report_type\n";
63
+}
64
+
65
+unless ($format =~ /^(text|html|json)$/) {
66
+    die "Invalid format: $format\n";
67
+}
68
+
69
+# Connect to database
70
+my $db_config = "$config_dir/database.conf";
71
+unless (-f $db_config) {
72
+    die "Database configuration not found: $db_config\n";
73
+}
74
+
75
+my $cfg = Config::Simple->new($db_config);
76
+my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
77
+    $cfg->param('database.database'),
78
+    $cfg->param('database.host'),
79
+    $cfg->param('database.port')
80
+);
81
+
82
+my $dbh = DBI->connect(
83
+    $dsn,
84
+    $cfg->param('database.username'),
85
+    $cfg->param('database.password'),
86
+    { RaiseError => 1, AutoCommit => 1, pg_enable_utf8 => 1 }
87
+) or die "Database connection failed: $DBI::errstr";
88
+
89
+# Generate report
90
+my $report_data = generate_report($dbh, $report_type, $specific_device, $days);
91
+
92
+# Output report
93
+my $output_handle = \*STDOUT;
94
+if ($output_file) {
95
+    open $output_handle, '>', $output_file 
96
+        or die "Cannot open output file $output_file: $!\n";
97
+}
98
+
99
+if ($format eq 'json') {
100
+    output_json($output_handle, $report_data);
101
+} elsif ($format eq 'html') {
102
+    output_html($output_handle, $report_data, $report_type);
103
+} else {
104
+    output_text($output_handle, $report_data, $report_type);
105
+}
106
+
107
+close $output_handle if $output_file;
108
+
109
+$dbh->disconnect();
110
+
111
+=head2 generate_report
112
+
113
+Generate report data based on type
114
+
115
+=cut
116
+
117
+sub generate_report {
118
+    my ($dbh, $type, $device, $days) = @_;
119
+    
120
+    my $data = {
121
+        report_type => $type,
122
+        generated_at => time(),
123
+        days_included => $days,
124
+        specific_device => $device,
125
+    };
126
+    
127
+    if ($type eq 'summary') {
128
+        $data->{summary} = get_system_summary($dbh, $days);
129
+    } elsif ($type eq 'detailed') {
130
+        $data->{drives} = get_detailed_drive_info($dbh, $device, $days);
131
+    } elsif ($type eq 'health') {
132
+        $data->{health} = get_health_overview($dbh, $device);
133
+    } elsif ($type eq 'alerts') {
134
+        $data->{alerts} = get_alert_history($dbh, $device, $days);
135
+    } elsif ($type eq 'trends') {
136
+        $data->{trends} = get_trend_analysis($dbh, $device, $days);
137
+    }
138
+    
139
+    return $data;
140
+}
141
+
142
+=head2 get_system_summary
143
+
144
+Get high-level system summary
145
+
146
+=cut
147
+
148
+sub get_system_summary {
149
+    my ($dbh, $days) = @_;
150
+    
151
+    my $summary = {};
152
+    
153
+    # Drive counts by status
154
+    my $sth = $dbh->prepare(q{
155
+        SELECT status, COUNT(*) as count
156
+        FROM hdd_inventory 
157
+        GROUP BY status
158
+    });
159
+    $sth->execute();
160
+    
161
+    $summary->{drive_counts} = {};
162
+    while (my $row = $sth->fetchrow_hashref()) {
163
+        $summary->{drive_counts}->{$row->{status}} = $row->{count};
164
+    }
165
+    
166
+    # Recent predictions summary
167
+    $sth = $dbh->prepare(q{
168
+        SELECT risk_level, COUNT(*) as count
169
+        FROM predictions 
170
+        WHERE timestamp >= NOW() - INTERVAL ? DAY
171
+        GROUP BY risk_level
172
+    });
173
+    $sth->execute($days);
174
+    
175
+    $summary->{recent_predictions} = {};
176
+    while (my $row = $sth->fetchrow_hashref()) {
177
+        $summary->{recent_predictions}->{$row->{risk_level}} = $row->{count};
178
+    }
179
+    
180
+    # Recent alerts
181
+    $sth = $dbh->prepare(q{
182
+        SELECT alert_type, COUNT(*) as count
183
+        FROM alert_history 
184
+        WHERE sent_at >= NOW() - INTERVAL ? DAY
185
+        GROUP BY alert_type
186
+    });
187
+    $sth->execute($days);
188
+    
189
+    $summary->{recent_alerts} = {};
190
+    while (my $row = $sth->fetchrow_hashref()) {
191
+        $summary->{recent_alerts}->{$row->{alert_type}} = $row->{count};
192
+    }
193
+    
194
+    # Data collection stats
195
+    $sth = $dbh->prepare(q{
196
+        SELECT 
197
+            COUNT(*) as total_readings,
198
+            COUNT(DISTINCT device_path) as devices_with_data,
199
+            AVG(CASE WHEN collection_ok THEN 1.0 ELSE 0.0 END) * 100 as success_rate
200
+        FROM smart_readings
201
+        WHERE timestamp >= NOW() - INTERVAL ? DAY
202
+    });
203
+    $sth->execute($days);
204
+    
205
+    if (my $row = $sth->fetchrow_hashref()) {
206
+        $summary->{collection_stats} = {
207
+            total_readings => $row->{total_readings},
208
+            devices_with_data => $row->{devices_with_data},
209
+            success_rate => sprintf("%.1f", $row->{success_rate} || 0),
210
+        };
211
+    }
212
+    
213
+    return $summary;
214
+}
215
+
216
+=head2 get_detailed_drive_info
217
+
218
+Get detailed information for drives
219
+
220
+=cut
221
+
222
+sub get_detailed_drive_info {
223
+    my ($dbh, $device, $days) = @_;
224
+    
225
+    my $sql = q{
226
+        SELECT 
227
+            hi.device_path,
228
+            hi.model_name,
229
+            hi.serial_number,
230
+            hi.size_gb,
231
+            hi.status,
232
+            hi.first_seen,
233
+            hi.last_seen,
234
+            COUNT(sr.id) as reading_count,
235
+            AVG(sr.temperature) as avg_temperature,
236
+            MAX(sr.temperature) as max_temperature
237
+        FROM hdd_inventory hi
238
+        LEFT JOIN smart_readings sr ON hi.device_path = sr.device_path
239
+            AND sr.timestamp >= NOW() - INTERVAL ? DAY
240
+    };
241
+    
242
+    my @params = ($days);
243
+    
244
+    if ($device) {
245
+        $sql .= " WHERE hi.device_path = ?";
246
+        push @params, $device;
247
+    }
248
+    
249
+    $sql .= q{
250
+        GROUP BY hi.device_path, hi.model_name, hi.serial_number, 
251
+                 hi.size_gb, hi.status, hi.first_seen, hi.last_seen
252
+        ORDER BY hi.device_path
253
+    };
254
+    
255
+    my $sth = $dbh->prepare($sql);
256
+    $sth->execute(@params);
257
+    
258
+    my @drives = ();
259
+    while (my $row = $sth->fetchrow_hashref()) {
260
+        # Get latest prediction
261
+        my $pred_sth = $dbh->prepare(q{
262
+            SELECT risk_level, confidence, timestamp
263
+            FROM predictions
264
+            WHERE device_path = ?
265
+            ORDER BY timestamp DESC
266
+            LIMIT 1
267
+        });
268
+        $pred_sth->execute($row->{device_path});
269
+        
270
+        if (my $pred = $pred_sth->fetchrow_hashref()) {
271
+            $row->{latest_prediction} = $pred;
272
+        }
273
+        
274
+        # Get recent alerts
275
+        my $alert_sth = $dbh->prepare(q{
276
+            SELECT COUNT(*) as alert_count
277
+            FROM alert_history
278
+            WHERE device_path = ?
279
+            AND sent_at >= NOW() - INTERVAL ? DAY
280
+        });
281
+        $alert_sth->execute($row->{device_path}, $days);
282
+        
283
+        if (my $alert = $alert_sth->fetchrow_hashref()) {
284
+            $row->{recent_alert_count} = $alert->{alert_count};
285
+        }
286
+        
287
+        push @drives, $row;
288
+    }
289
+    
290
+    return \@drives;
291
+}
292
+
293
+=head2 get_health_overview
294
+
295
+Get current health overview
296
+
297
+=cut
298
+
299
+sub get_health_overview {
300
+    my ($dbh, $device) = @_;
301
+    
302
+    my $sql = q{
303
+        SELECT * FROM drive_health_summary
304
+    };
305
+    
306
+    my @params = ();
307
+    if ($device) {
308
+        $sql .= " WHERE device_path = ?";
309
+        push @params, $device;
310
+    }
311
+    
312
+    $sql .= " ORDER BY device_path";
313
+    
314
+    my $sth = $dbh->prepare($sql);
315
+    $sth->execute(@params);
316
+    
317
+    my @health_data = ();
318
+    while (my $row = $sth->fetchrow_hashref()) {
319
+        push @health_data, $row;
320
+    }
321
+    
322
+    return \@health_data;
323
+}
324
+
325
+=head2 get_alert_history
326
+
327
+Get alert history
328
+
329
+=cut
330
+
331
+sub get_alert_history {
332
+    my ($dbh, $device, $days) = @_;
333
+    
334
+    my $sql = q{
335
+        SELECT 
336
+            ah.device_path,
337
+            ah.alert_type,
338
+            ah.risk_level,
339
+            ah.message,
340
+            ah.sent_at,
341
+            ah.acknowledged,
342
+            ah.acknowledged_by,
343
+            hi.model_name
344
+        FROM alert_history ah
345
+        JOIN hdd_inventory hi ON ah.device_path = hi.device_path
346
+        WHERE ah.sent_at >= NOW() - INTERVAL ? DAY
347
+    };
348
+    
349
+    my @params = ($days);
350
+    
351
+    if ($device) {
352
+        $sql .= " AND ah.device_path = ?";
353
+        push @params, $device;
354
+    }
355
+    
356
+    $sql .= " ORDER BY ah.sent_at DESC";
357
+    
358
+    my $sth = $dbh->prepare($sql);
359
+    $sth->execute(@params);
360
+    
361
+    my @alerts = ();
362
+    while (my $row = $sth->fetchrow_hashref()) {
363
+        push @alerts, $row;
364
+    }
365
+    
366
+    return \@alerts;
367
+}
368
+
369
+=head2 get_trend_analysis
370
+
371
+Get trend analysis data
372
+
373
+=cut
374
+
375
+sub get_trend_analysis {
376
+    my ($dbh, $device, $days) = @_;
377
+    
378
+    # This is a simplified trend analysis
379
+    # In production, you might want more sophisticated analysis
380
+    
381
+    my $sql = q{
382
+        SELECT 
383
+            device_path,
384
+            DATE(timestamp) as date,
385
+            AVG(temperature) as avg_temp,
386
+            COUNT(*) as reading_count
387
+        FROM smart_readings
388
+        WHERE timestamp >= NOW() - INTERVAL ? DAY
389
+    };
390
+    
391
+    my @params = ($days);
392
+    
393
+    if ($device) {
394
+        $sql .= " AND device_path = ?";
395
+        push @params, $device;
396
+    }
397
+    
398
+    $sql .= q{
399
+        GROUP BY device_path, DATE(timestamp)
400
+        ORDER BY device_path, date
401
+    };
402
+    
403
+    my $sth = $dbh->prepare($sql);
404
+    $sth->execute(@params);
405
+    
406
+    my %trends = ();
407
+    while (my $row = $sth->fetchrow_hashref()) {
408
+        push @{$trends{$row->{device_path}}}, {
409
+            date => $row->{date},
410
+            avg_temp => sprintf("%.1f", $row->{avg_temp} || 0),
411
+            reading_count => $row->{reading_count},
412
+        };
413
+    }
414
+    
415
+    return \%trends;
416
+}
417
+
418
+=head2 output_text
419
+
420
+Output report as text
421
+
422
+=cut
423
+
424
+sub output_text {
425
+    my ($fh, $data, $type) = @_;
426
+    
427
+    print $fh "\n" . "="x80 . "\n";
428
+    print $fh "autoSMART System Report - " . ucfirst($type) . "\n";
429
+    print $fh "Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime($data->{generated_at})) . "\n";
430
+    print $fh "Time Period: Last $data->{days_included} days\n";
431
+    print $fh "="x80 . "\n\n";
432
+    
433
+    if ($type eq 'summary') {
434
+        output_summary_text($fh, $data->{summary});
435
+    } elsif ($type eq 'detailed') {
436
+        output_detailed_text($fh, $data->{drives});
437
+    } elsif ($type eq 'health') {
438
+        output_health_text($fh, $data->{health});
439
+    } elsif ($type eq 'alerts') {
440
+        output_alerts_text($fh, $data->{alerts});
441
+    } elsif ($type eq 'trends') {
442
+        output_trends_text($fh, $data->{trends});
443
+    }
444
+}
445
+
446
+=head2 output_summary_text
447
+
448
+Output summary in text format
449
+
450
+=cut
451
+
452
+sub output_summary_text {
453
+    my ($fh, $summary) = @_;
454
+    
455
+    print $fh "SYSTEM OVERVIEW\n";
456
+    print $fh "-"x40 . "\n";
457
+    
458
+    print $fh "Drive Status:\n";
459
+    foreach my $status (sort keys %{$summary->{drive_counts}}) {
460
+        printf $fh "  %-10s: %d drives\n", ucfirst($status), $summary->{drive_counts}->{$status};
461
+    }
462
+    
463
+    if (%{$summary->{recent_predictions}}) {
464
+        print $fh "\nRecent Risk Predictions:\n";
465
+        foreach my $level (qw(critical high moderate low)) {
466
+            next unless $summary->{recent_predictions}->{$level};
467
+            printf $fh "  %-10s: %d drives\n", ucfirst($level), $summary->{recent_predictions}->{$level};
468
+        }
469
+    }
470
+    
471
+    if (%{$summary->{recent_alerts}}) {
472
+        print $fh "\nRecent Alerts:\n";
473
+        foreach my $type (sort keys %{$summary->{recent_alerts}}) {
474
+            printf $fh "  %-15s: %d alerts\n", $type, $summary->{recent_alerts}->{$type};
475
+        }
476
+    }
477
+    
478
+    if ($summary->{collection_stats}) {
479
+        my $stats = $summary->{collection_stats};
480
+        print $fh "\nData Collection:\n";
481
+        print $fh "  Total readings: $stats->{total_readings}\n";
482
+        print $fh "  Devices monitored: $stats->{devices_with_data}\n";
483
+        print $fh "  Success rate: $stats->{success_rate}%\n";
484
+    }
485
+    
486
+    print $fh "\n";
487
+}
488
+
489
+=head2 output_json
490
+
491
+Output report as JSON
492
+
493
+=cut
494
+
495
+sub output_json {
496
+    my ($fh, $data) = @_;
497
+    
498
+    my $json = JSON::XS->new->pretty->encode($data);
499
+    print $fh $json;
500
+}
501
+
502
+=head2 output_html
503
+
504
+Output report as HTML (basic implementation)
505
+
506
+=cut
507
+
508
+sub output_html {
509
+    my ($fh, $data, $type) = @_;
510
+    
511
+    print $fh <<'EOF';
512
+<!DOCTYPE html>
513
+<html>
514
+<head>
515
+    <title>autoSMART Report</title>
516
+    <style>
517
+        body { font-family: Arial, sans-serif; margin: 20px; }
518
+        table { border-collapse: collapse; width: 100%; }
519
+        th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
520
+        th { background-color: #f2f2f2; }
521
+        .critical { color: #d32f2f; font-weight: bold; }
522
+        .high { color: #f57c00; font-weight: bold; }
523
+        .moderate { color: #fbc02d; }
524
+        .low { color: #388e3c; }
525
+    </style>
526
+</head>
527
+<body>
528
+EOF
529
+    
530
+    print $fh "<h1>autoSMART Report - " . ucfirst($type) . "</h1>\n";
531
+    print $fh "<p>Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime($data->{generated_at})) . "</p>\n";
532
+    
533
+    # Basic HTML output - could be expanded significantly
534
+    print $fh "<pre>" . encode_json($data) . "</pre>\n";
535
+    
536
+    print $fh "</body></html>\n";
537
+}
538
+
539
+# Additional text output functions would go here...
540
+sub output_detailed_text { 
541
+    my ($fh, $drives) = @_;
542
+    # Implementation for detailed drive output
543
+    print $fh "DETAILED DRIVE INFORMATION\n";
544
+    print $fh "-"x40 . "\n";
545
+    foreach my $drive (@$drives) {
546
+        print $fh "Device: $drive->{device_path}\n";
547
+        print $fh "Model: " . ($drive->{model_name} || 'Unknown') . "\n";
548
+        print $fh "Serial: " . ($drive->{serial_number} || 'Unknown') . "\n";
549
+        print $fh "Status: $drive->{status}\n";
550
+        if ($drive->{latest_prediction}) {
551
+            print $fh "Latest Risk: $drive->{latest_prediction}->{risk_level}\n";
552
+        }
553
+        print $fh "\n";
554
+    }
555
+}
556
+
557
+sub output_health_text { 
558
+    my ($fh, $health) = @_;
559
+    print $fh "DRIVE HEALTH OVERVIEW\n";
560
+    print $fh "-"x40 . "\n";
561
+    foreach my $drive (@$health) {
562
+        print $fh "$drive->{device_path}: $drive->{status}";
563
+        print $fh " (Risk: $drive->{risk_level})" if $drive->{risk_level};
564
+        print $fh "\n";
565
+    }
566
+}
567
+
568
+sub output_alerts_text { 
569
+    my ($fh, $alerts) = @_;
570
+    print $fh "ALERT HISTORY\n";
571
+    print $fh "-"x40 . "\n";
572
+    foreach my $alert (@$alerts) {
573
+        printf $fh "%s [%s] %s: %s\n",
574
+            $alert->{sent_at},
575
+            $alert->{alert_type},
576
+            $alert->{device_path},
577
+            $alert->{message} || '';
578
+    }
579
+}
580
+
581
+sub output_trends_text { 
582
+    my ($fh, $trends) = @_;
583
+    print $fh "TREND ANALYSIS\n";
584
+    print $fh "-"x40 . "\n";
585
+    foreach my $device (sort keys %$trends) {
586
+        print $fh "Device: $device\n";
587
+        foreach my $trend (@{$trends->{$device}}) {
588
+            print $fh "  $trend->{date}: Temp $trend->{avg_temp}°C ($trend->{reading_count} readings)\n";
589
+        }
590
+        print $fh "\n";
591
+    }
592
+}
593
+
594
+=head2 print_help
595
+
596
+Display help information
597
+
598
+=cut
599
+
600
+sub print_help {
601
+    print <<'EOF';
602
+autoSMART Report Generator v1.0
603
+
604
+USAGE:
605
+    autosmart-report.pl [OPTIONS]
606
+
607
+OPTIONS:
608
+    --config-dir DIR     Configuration directory (default: /etc/autosmart)
609
+    --report TYPE        Report type (default: summary)
610
+                         summary   - System overview and statistics
611
+                         detailed  - Detailed drive information
612
+                         health    - Current health status of all drives
613
+                         alerts    - Alert history
614
+                         trends    - Trend analysis
615
+    --device PATH        Generate report for specific device only
616
+    --days N            Days of history to include (default: 30)
617
+    --format FORMAT     Output format: text, html, json (default: text)
618
+    --output FILE       Write to file instead of stdout
619
+    --help              Show this help message
620
+
621
+EXAMPLES:
622
+    # System summary
623
+    autosmart-report.pl --report summary
624
+
625
+    # Detailed report for specific drive
626
+    autosmart-report.pl --report detailed --device /dev/sda
627
+
628
+    # Health status as HTML
629
+    autosmart-report.pl --report health --format html --output health.html
630
+
631
+    # Alert history for last week
632
+    autosmart-report.pl --report alerts --days 7
633
+
634
+    # Trend analysis as JSON
635
+    autosmart-report.pl --report trends --format json
636
+
637
+REPORT TYPES:
638
+    summary     High-level system statistics and overview
639
+    detailed    Comprehensive information about each drive
640
+    health      Current health status summary  
641
+    alerts      Recent alerts and notifications
642
+    trends      Temperature and performance trends
643
+
644
+OUTPUT FORMATS:
645
+    text        Human-readable text format (default)
646
+    html        HTML report with basic styling
647
+    json        Machine-readable JSON format
648
+
649
+EOF
650
+}
651
+
652
+__END__
653
+
654
+=head1 AUTHOR
655
+
656
+AutoSMART Development Team
657
+
658
+=head1 LICENSE
659
+
660
+This software is part of the autoSMART project.
661
+
662
+=cut
+98 -0
projects/autoSMART/scripts/deploy-production.sh
@@ -0,0 +1,98 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Production Deployment Script
4
+# Version: 1.0
5
+# Description: Deploy autoSMART system to Proxmox cluster
6
+
7
+set -e
8
+
9
+# Configuration
10
+DB_HOST="192.168.2.102"
11
+DB_USER="autosmart"
12
+DB_PASS="autoSMART2025!"
13
+DB_NAME="autosmart"
14
+
15
+CLUSTER_JSON="$(dirname "$0")/../cluster.json"
16
+NODES=()
17
+NODE_IPS=()
18
+if [[ -f "$CLUSTER_JSON" ]] && command -v jq &> /dev/null; then
19
+  while IFS= read -r node; do
20
+    NODES+=("$(echo "$node" | jq -r '.hostname')")
21
+    NODE_IPS+=("$(echo "$node" | jq -r '.ip')")
22
+  done < <(jq -c '.cluster.nodes[]' "$CLUSTER_JSON")
23
+fi
24
+DEPLOY_DIR="/opt/autoSMART"
25
+CONFIG_DIR="/etc/pve/autoSMART"
26
+
27
+echo "🚀 autoSMART Production Deployment"
28
+echo "=================================="
29
+
30
+for idx in "${!NODES[@]}"; do
31
+  NODE="${NODES[$idx]}"
32
+  NODE_IP="${NODE_IPS[$idx]}"
33
+  echo ""
34
+  echo "🔧 Deploying to node: $NODE ($NODE_IP)"
35
+  echo "------------------------"
36
+    
37
+  # Create directories
38
+  ssh root@$NODE_IP "mkdir -p $DEPLOY_DIR $CONFIG_DIR"
39
+    
40
+  # Copy files
41
+  scp -r /tmp/autoSMART-deploy/* root@$NODE_IP:$DEPLOY_DIR/
42
+    
43
+  # Install Perl dependencies
44
+  ssh root@$NODE_IP "
45
+    apt-get update -qq
46
+    apt-get install -y libdbi-perl libdbd-pg-perl libjson-perl libfile-slurp-perl smartmontools
47
+  "
48
+    
49
+  # Make scripts executable
50
+  ssh root@$NODE_IP "chmod +x $DEPLOY_DIR/scripts/*.sh $DEPLOY_DIR/scripts/*.pl"
51
+    
52
+  # Create node-specific configuration
53
+  ssh root@$NODE_IP "cat > $CONFIG_DIR/cluster-$NODE.conf << EOF
54
+# autoSMART Configuration for $NODE
55
+ExecStart=$DEPLOY_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE.conf
56
+Restart=always
57
+RestartSec=30
58
+User=root
59
+
60
+[Install]
61
+WantedBy=multi-user.target
62
+EOF"
63
+    
64
+    # Enable service (but don't start yet)
65
+    ssh root@192.168.2.$NODE "systemctl daemon-reload && systemctl enable autosmart"
66
+    
67
+    echo "✅ Node $NODE deployment complete"
68
+done
69
+
70
+# Test database connectivity
71
+    
72
+  # Install systemd service
73
+  ssh root@$NODE_IP "cat > /etc/systemd/system/autosmart.service << EOF
74
+echo ""
75
+echo "🔍 Testing database connectivity..."
76
+PGPASSWORD="$DB_PASS" psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c "
77
+SELECT 
78
+    COUNT(*) as total_drives,
79
+    COUNT(DISTINCT current_node_id) as active_nodes
80
+FROM hdd_inventory;
81
+"
82
+
83
+echo ""
84
+echo "🎉 Production deployment complete!"
85
+echo ""
86
+echo "To start services on all nodes:"
87
+echo "  for node in ebony ivory obsidian; do ssh root@192.168.2.\$node 'systemctl start autosmart'; done"
88
+echo ""
89
+echo "To monitor services:"
90
+echo "  for node in ebony ivory obsidian; do echo \"=== \$node ===\"; ssh root@192.168.2.\$node 'systemctl status autosmart'; done"
91
+echo ""
92
+echo "Database monitoring:"
93
+echo "  PGPASSWORD='$DB_PASS' psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c 'SELECT * FROM storage_efficiency_stats;'"
94
+
95
+# Cleanup
96
+rm -rf /tmp/autoSMART-deploy
97
+
98
+echo "✅ Deployment script complete!"
+0 -0
projects/autoSMART/scripts/deploy.sh
No changes.
+755 -0
projects/autoSMART/scripts/install-backup.sh
@@ -0,0 +1,755 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Node Installation Script
4
+# Version: 1.0  
5
+# Description: Install autoSMART on target nodes (Linux systems only)
6
+# Note: This script is called by deploy.sh and should run on target nodes
7
+
8
+set -e
9
+
10
+SCRIPT_DIR="$(cd "$(dirnameverify_dependencies() {
11
+    log_info "🔍 Verifying system dependencies..."
12
+    
13
+    local missing_packages=()
14
+    local package_manager=""SH_SOURCE[0]}")" && pwd)"
15
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
16
+INSTALL_DIR="/opt/autoSMART"
17
+CONFIG_DIR="/etc/autosmart"
18
+SERVICE_NAME="autosmart"
19
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
20
+
21
+# Default configuration (can be overridden by command line)
22
+DB_HOST="${DB_HOST:-192.168.2.102}"
23
+DB_USER="${DB_USER:-autosmart}"
24
+DB_PASS="${DB_PASS:-autoSMART2025!}"
25
+DB_NAME="${DB_NAME:-autosmart}"
26
+
27
+# Node configuration
28
+NODE_ID="${NODE_ID:-$(hostname -s)}"
29
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
30
+FULL_SCAN_INTERVAL="${FULL_SCAN_INTERVAL:-3600}"
31
+
32
+# Operation modes
33
+UNINSTALL=false
34
+FORCE_REINSTALL=false
35
+CONFIG_ONLY=false
36
+
37
+# Colors for output
38
+RED='\033[0;31m'
39
+GREEN='\033[0;32m'
40
+YELLOW='\033[1;33m'
41
+BLUE='\033[0;34m'
42
+NC='\033[0m' # No Color
43
+
44
+log_info() {
45
+    echo -e "${BLUE}[INFO]${NC} $1"
46
+}
47
+
48
+log_success() {
49
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
50
+}
51
+
52
+log_warning() {
53
+    echo -e "${YELLOW}[WARNING]${NC} $1"
54
+}
55
+
56
+log_error() {
57
+    echo -e "${RED}[ERROR]${NC} $1"
58
+}
59
+
60
+show_usage() {
61
+    echo "autoSMART Node Installation Script v1.0"
62
+    echo "========================================"
63
+    echo ""
64
+    echo "Usage: $0 [COMMAND] [OPTIONS]"
65
+    echo ""
66
+    echo "Commands:"
67
+    echo "  install               Install autoSMART on current node (default)"
68
+    echo "  uninstall             Remove autoSMART completely from current node"
69
+    echo ""
70
+    echo "Options:"
71
+    echo "  --help                Show this help message"
72
+    echo "  --force-reinstall     Clean installation (removes previous version)"
73
+    echo "  --config-only         Only create/update configuration files"
74
+    echo "  --db-host HOST        Database host (default: 192.168.2.102)"
75
+    echo "  --db-user USER        Database user (default: autosmart)"
76
+    echo "  --db-pass PASS        Database password (default: autoSMART2025!)"
77
+    echo "  --db-name NAME        Database name (default: autosmart)"
78
+    echo "  --node-id ID          Node identifier (default: hostname)"
79
+    echo "  --scan-interval SEC   Scan interval in seconds (default: 300)"
80
+    echo ""
81
+    echo "Note: This script should be called by deploy.sh, not run directly."
82
+    echo "For deployment from development machine, use: deploy.sh install <IP>"
83
+    echo ""
84
+}
85
+
86
+parse_arguments() {
87
+    COMMAND="install"  # Default command
88
+    
89
+    while [[ $# -gt 0 ]]; do
90
+        case $1 in
91
+            install|uninstall)
92
+                COMMAND="$1"
93
+                shift
94
+                ;;
95
+            --help)
96
+                show_usage
97
+                exit 0
98
+                ;;
99
+            --force-reinstall)
100
+                FORCE_REINSTALL=true
101
+                shift
102
+                ;;
103
+            --config-only)
104
+                CONFIG_ONLY=true
105
+                shift
106
+                ;;
107
+            --db-host)
108
+                DB_HOST="$2"
109
+                shift 2
110
+                ;;
111
+            --db-user)
112
+                DB_USER="$2"
113
+                shift 2
114
+                ;;
115
+            --db-pass)
116
+                DB_PASS="$2"
117
+                shift 2
118
+                ;;
119
+            --db-name)
120
+                DB_NAME="$2"
121
+                shift 2
122
+                ;;
123
+            --node-id)
124
+                NODE_ID="$2"
125
+                shift 2
126
+                ;;
127
+            --scan-interval)
128
+                SCAN_INTERVAL="$2"
129
+                shift 2
130
+                ;;
131
+            *)
132
+                log_error "Unknown option: $1"
133
+                show_usage
134
+                exit 1
135
+                ;;
136
+        esac
137
+    done
138
+}
139
+
140
+show_header() {
141
+    log_info "🔧 autoSMART Node Installation v1.0"
142
+    log_info "==================================="
143
+    log_info "Installing on target node: $(hostname)"
144
+    log_info ""
145
+    log_info "Operation: $COMMAND"
146
+    log_info "Node ID: $NODE_ID"
147
+    log_info "Database: $DB_HOST:5432/$DB_NAME"
148
+    if [[ "$COMMAND" == "install" ]]; then
149
+        log_info "Install Directory: $INSTALL_DIR"
150
+        log_info "Config Directory: $CONFIG_DIR"
151
+    fi
152
+    log_info ""
153
+}
154
+
155
+check_requirements() {
156
+    log_info "🔍 Checking system requirements..."
157
+    
158
+    # Check if running as root
159
+    if [[ $EUID -ne 0 ]]; then
160
+        log_error "This script must be run as root (use sudo)"
161
+        exit 1
162
+    fi
163
+    
164
+    # Check if running on Linux
165
+    if [[ "$(uname)" != "Linux" ]]; then
166
+        log_error "autoSMART can only be installed on Linux systems"
167
+        log_error "Current system: $(uname)"
168
+        exit 1
169
+    fi
170
+    
171
+    # Check systemd
172
+    if ! command -v systemctl &> /dev/null; then
173
+        log_error "systemd is required but not found"
174
+        exit 1
175
+    fi
176
+    
177
+    # Check and report dependency status
178
+    if ! verify_dependencies >/dev/null 2>&1; then
179
+        log_warning "Some dependencies are missing (will be installed automatically)"
180
+    fi
181
+    
182
+    # Check available space
183
+    AVAILABLE_SPACE=$(df / | tail -1 | awk '{print $4}')
184
+    if [[ $AVAILABLE_SPACE -lt 100000 ]]; then
185
+        log_warning "Less than 100MB available space. Installation may fail."
186
+    fi
187
+    
188
+    log_success "System requirements check passed"
189
+}
190
+
191
+handle_uninstall() {
192
+    log_info "🗑️  Uninstalling autoSMART..."
193
+    
194
+    # Stop and disable service
195
+    if systemctl is-active --quiet autosmart; then
196
+        systemctl stop autosmart
197
+    fi
198
+    if systemctl is-enabled --quiet autosmart; then
199
+        systemctl disable autosmart
200
+    fi
201
+    
202
+    # Remove service file
203
+    if [[ -f "$SYSTEMD_SERVICE" ]]; then
204
+        rm "$SYSTEMD_SERVICE"
205
+        systemctl daemon-reload
206
+    fi
207
+    
208
+    # Remove installation directory
209
+    if [[ -d "$INSTALL_DIR" ]]; then
210
+        rm -rf "$INSTALL_DIR"
211
+    fi
212
+    
213
+    # Remove configuration directory
214
+    if [[ -d "$CONFIG_DIR" ]]; then
215
+        rm -rf "$CONFIG_DIR"
216
+    fi
217
+    
218
+    # Remove log rotation
219
+    if [[ -f "/etc/logrotate.d/autosmart" ]]; then
220
+        rm "/etc/logrotate.d/autosmart"
221
+    fi
222
+    
223
+    log_success "✅ autoSMART uninstalled successfully"
224
+    exit 0
225
+}
226
+
227
+# Function to check if a package is installed
228
+check_package_installed() {
229
+    local package="$1"
230
+    local package_manager="$2"
231
+    
232
+    case "$package_manager" in
233
+        "apt-get")
234
+            dpkg -l | grep -q "^ii  $package " 2>/dev/null
235
+            ;;
236
+        "yum"|"dnf")
237
+            rpm -qa | grep -q "$package" 2>/dev/null
238
+            ;;
239
+        "zypper")
240
+            zypper se -i "$package" | grep -q "^i" 2>/dev/null
241
+            ;;
242
+        "pacman")
243
+            pacman -Q "$package" >/dev/null 2>&1
244
+            ;;
245
+        *)
246
+            return 1
247
+            ;;
248
+    esac
249
+}
250
+
251
+# Function to verify all dependencies are installed
252
+verify_dependencies() {
253
+    log_info "� Verifying system dependencies..."
254
+    
255
+    local missing_packages=()
256
+    local missing_perl_modules=()
257
+    local package_manager=""
258
+    
259
+    # Detect package manager
260
+    if command -v apt-get &> /dev/null; then
261
+        package_manager="apt-get"
262
+    elif command -v yum &> /dev/null; then
263
+        package_manager="yum"
264
+    elif command -v dnf &> /dev/null; then
265
+        package_manager="dnf"
266
+    elif command -v zypper &> /dev/null; then
267
+        package_manager="zypper"
268
+    elif command -v pacman &> /dev/null; then
269
+        package_manager="pacman"
270
+    else
271
+        log_warning "Unknown package manager. Dependency verification limited."
272
+        return 1
273
+    fi
274
+    
275
+    # Check system packages (including Perl modules from distribution)
276
+    local system_packages=("perl" "smartmontools" "postgresql-client" "curl" "wget")
277
+    local perl_packages=()
278
+    
279
+    # Add Perl module packages based on package manager
280
+    case "$package_manager" in
281
+        "apt-get")
282
+            perl_packages+=("libdbi-perl" "libdbd-pg-perl" "libjson-perl" "libfile-slurp-perl" 
283
+                           "libgetopt-long-descriptive-perl" "libconfig-simple-perl")
284
+            ;;
285
+        "yum"|"dnf")
286
+            perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp" 
287
+                           "perl-Getopt-Long" "perl-Config-Simple")
288
+            ;;
289
+        "zypper")
290
+            perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp" 
291
+                           "perl-Getopt-Long-Descriptive" "perl-Config-Simple")
292
+            ;;
293
+        "pacman")
294
+            perl_packages+=("perl-dbi" "perl-dbd-pg" "perl-json" "perl-file-slurp")
295
+            ;;
296
+    esac
297
+    
298
+    # Check system packages
299
+    for package in "${system_packages[@]}"; do
300
+        if ! check_package_installed "$package" "$package_manager"; then
301
+            missing_packages+=("$package")
302
+        fi
303
+    done
304
+    
305
+    # Check Perl packages from distribution
306
+    for package in "${perl_packages[@]}"; do
307
+        if ! check_package_installed "$package" "$package_manager"; then
308
+            missing_packages+=("$package")
309
+        fi
310
+    done
311
+    
312
+    # Report results
313
+    if [[ ${#missing_packages[@]} -eq 0 ]]; then
314
+        log_success "✅ All dependencies are available"
315
+        return 0
316
+    else
317
+        log_warning "Missing dependencies detected:"
318
+        if [[ ${#missing_packages[@]} -gt 0 ]]; then
319
+            log_warning "  Missing packages: ${missing_packages[*]}"
320
+        fi
321
+        return 1
322
+    fi
323
+}
324
+
325
+# Function to install dependencies from local packages (offline)
326
+install_dependencies_offline() {
327
+    log_info "�📦 Installing dependencies from local packages..."
328
+    
329
+    local packages_dir="$PROJECT_ROOT/packages"
330
+    
331
+    if [[ ! -d "$packages_dir" ]]; then
332
+        log_warning "Local packages directory not found: $packages_dir"
333
+        log_info "Falling back to online installation..."
334
+        return 1
335
+    fi
336
+    
337
+    local package_manager=""
338
+    if command -v apt-get &> /dev/null; then
339
+        package_manager="apt-get"
340
+        local deb_files=("$packages_dir"/*.deb)
341
+        if [[ -f "${deb_files[0]}" ]]; then
342
+            log_info "Installing .deb packages..."
343
+            dpkg -i "$packages_dir"/*.deb 2>/dev/null || {
344
+                log_info "Fixing broken dependencies..."
345
+                apt-get install -f -y >/dev/null 2>&1
346
+            }
347
+        fi
348
+    elif command -v yum &> /dev/null || command -v dnf &> /dev/null; then
349
+        package_manager="yum"
350
+        local rpm_files=("$packages_dir"/*.rpm)
351
+        if [[ -f "${rpm_files[0]}" ]]; then
352
+            log_info "Installing .rpm packages..."
353
+            if command -v dnf &> /dev/null; then
354
+                dnf install -y "$packages_dir"/*.rpm >/dev/null 2>&1
355
+            else
356
+                yum localinstall -y "$packages_dir"/*.rpm >/dev/null 2>&1
357
+            fi
358
+        fi
359
+    fi
360
+    
361
+    # Verify installation
362
+    if verify_dependencies >/dev/null 2>&1; then
363
+        log_success "✅ Offline dependencies installed successfully"
364
+        return 0
365
+    else
366
+        log_warning "Offline installation incomplete"
367
+        return 1
368
+    fi
369
+}
370
+
371
+# Enhanced dependency installation with offline support
372
+install_dependencies() {
373
+    log_info "📦 Installing system dependencies..."
374
+    
375
+    # First try to verify if dependencies are already installed
376
+    if verify_dependencies >/dev/null 2>&1; then
377
+        log_success "All dependencies already installed"
378
+        return 0
379
+    fi
380
+    
381
+    # If offline mode is enabled, only try offline installation
382
+    if [[ "$OFFLINE_MODE" == true ]]; then
383
+        log_info "Offline mode enabled - using local packages only"
384
+        if install_dependencies_offline; then
385
+            return 0
386
+        else
387
+            log_error "Offline installation failed and online installation is disabled"
388
+            log_error "Please check packages directory: $PROJECT_ROOT/packages"
389
+            exit 1
390
+        fi
391
+    fi
392
+    
393
+    # Try offline installation first
394
+    if install_dependencies_offline; then
395
+        return 0
396
+    fi
397
+    
398
+    # Fall back to online installation
399
+    log_info "Attempting online installation..."
400
+    
401
+    if command -v apt-get &> /dev/null; then
402
+        # Debian/Ubuntu
403
+        apt-get update -qq
404
+        PACKAGES=(
405
+            "perl"
406
+            "libdbi-perl"
407
+            "libdbd-pg-perl" 
408
+            "libjson-perl"
409
+            "libfile-slurp-perl"
410
+            "libgetopt-long-descriptive-perl"
411
+            "libconfig-simple-perl"
412
+            "smartmontools"
413
+            "postgresql-client"
414
+            "curl"
415
+            "wget"
416
+        )
417
+        
418
+        for package in "${PACKAGES[@]}"; do
419
+            if ! dpkg -l | grep -q "^ii  $package "; then
420
+                log_info "Installing $package..."
421
+                apt-get install -y "$package" >/dev/null 2>&1
422
+            fi
423
+        done
424
+        
425
+    elif command -v yum &> /dev/null; then
426
+        # RHEL/CentOS
427
+        yum update -y -q
428
+        PACKAGES=(
429
+            "perl"
430
+            "perl-DBI"
431
+            "perl-DBD-Pg"
432
+            "perl-JSON"
433
+            "perl-File-Slurp"
434
+            "perl-Getopt-Long"
435
+            "perl-Config-Simple"
436
+            "smartmontools"
437
+            "postgresql"
438
+            "curl"
439
+            "wget"
440
+        )
441
+        
442
+        for package in "${PACKAGES[@]}"; do
443
+            if ! rpm -qa | grep -q "$package"; then
444
+                log_info "Installing $package..."
445
+                yum install -y "$package" >/dev/null 2>&1
446
+            fi
447
+        done
448
+        
449
+    else
450
+        log_error "Unsupported package manager. Please install dependencies manually."
451
+        exit 1
452
+    fi
453
+    
454
+    log_success "Dependencies installed"
455
+}
456
+
457
+create_directories() {
458
+    log_info "📁 Creating directory structure..."
459
+    
460
+    # Create main directories
461
+    mkdir -p "$INSTALL_DIR"/{scripts,lib,config,docs}
462
+    mkdir -p "$CONFIG_DIR"
463
+    
464
+    # Set permissions
465
+    chmod 755 "$INSTALL_DIR"
466
+    chmod 755 "$CONFIG_DIR"
467
+    
468
+    log_success "Directories created"
469
+}
470
+
471
+copy_files() {
472
+    log_info "📋 Copying autoSMART files..."
473
+    
474
+    # Copy scripts
475
+    if [[ -d "$PROJECT_ROOT/scripts" ]]; then
476
+        cp -r "$PROJECT_ROOT/scripts"/* "$INSTALL_DIR/scripts/"
477
+        chmod +x "$INSTALL_DIR/scripts"/*.sh 2>/dev/null || true
478
+        chmod +x "$INSTALL_DIR/scripts"/*.pl 2>/dev/null || true
479
+    fi
480
+    
481
+    # Copy libraries
482
+    if [[ -d "$PROJECT_ROOT/lib" ]]; then
483
+        cp -r "$PROJECT_ROOT/lib"/* "$INSTALL_DIR/lib/"
484
+    fi
485
+    
486
+    # Copy documentation  
487
+    if [[ -d "$PROJECT_ROOT/docs" ]]; then
488
+        cp -r "$PROJECT_ROOT/docs"/* "$INSTALL_DIR/docs/"
489
+    fi
490
+    
491
+    # Copy SQL files
492
+    if [[ -d "$PROJECT_ROOT/sql" ]]; then
493
+        cp -r "$PROJECT_ROOT/sql" "$INSTALL_DIR/"
494
+    fi
495
+    
496
+    log_success "Files copied"
497
+}
498
+
499
+create_configuration() {
500
+    log_info "⚙️  Creating configuration files..."
501
+    
502
+    # Main configuration file
503
+    cat > "$CONFIG_DIR/autosmart.conf" << EOF
504
+# autoSMART Configuration File
505
+# Generated on $(date)
506
+
507
+[database]
508
+host = $DB_HOST
509
+port = 5432
510
+user = $DB_USER
511
+password = $DB_PASS
512
+database = $DB_NAME
513
+timeout = 30
514
+
515
+[node]
516
+id = $NODE_ID
517
+scan_interval = $SCAN_INTERVAL
518
+full_scan_interval = $FULL_SCAN_INTERVAL
519
+store_unchanged = false
520
+max_retries = 3
521
+
522
+[collection]
523
+temperature_threshold = 5
524
+parameter_changes_only = true
525
+enable_predictive_analysis = true
526
+health_check_interval = 86400
527
+
528
+[logging]
529
+level = INFO
530
+max_size = 10M
531
+rotate_count = 5
532
+syslog = true
533
+
534
+[alerts]
535
+enable = true
536
+temperature_critical = 60
537
+reallocated_sectors_warning = 1
538
+pending_sectors_critical = 5
539
+EOF
540
+
541
+    # YAML format configuration for Perl daemon
542
+    cat > "$CONFIG_DIR/cluster-$NODE_ID.conf" << EOF
543
+# autoSMART YAML Configuration for $NODE_ID
544
+database:
545
+  host: $DB_HOST
546
+  port: 5432
547
+  user: $DB_USER
548
+  password: $DB_PASS
549
+  database: $DB_NAME
550
+
551
+node:
552
+  id: $NODE_ID
553
+  scan_interval: $SCAN_INTERVAL
554
+  store_unchanged: false
555
+
556
+collection:
557
+  temperature_threshold: 5
558
+  parameter_changes_only: true
559
+  full_scan_interval: $FULL_SCAN_INTERVAL
560
+EOF
561
+    
562
+    # Set secure permissions on config files
563
+    chmod 600 "$CONFIG_DIR"/*.conf
564
+    
565
+    log_success "Configuration created"
566
+}
567
+
568
+create_systemd_service() {
569
+    log_info "🔧 Creating systemd service..."
570
+    
571
+    cat > "$SYSTEMD_SERVICE" << EOF
572
+[Unit]
573
+Description=autoSMART SMART Data Collector
574
+Documentation=file://$INSTALL_DIR/docs/README.md
575
+After=network.target postgresql.service
576
+Wants=postgresql.service
577
+
578
+[Service]
579
+Type=simple
580
+ExecStart=$INSTALL_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE_ID.conf --foreground
581
+ExecReload=/bin/kill -HUP \$MAINPID
582
+KillMode=process
583
+Restart=always
584
+RestartSec=30
585
+User=root
586
+Group=root
587
+
588
+# Security settings
589
+NoNewPrivileges=true
590
+ProtectSystem=strict
591
+ProtectHome=true
592
+ReadWritePaths=$CONFIG_DIR
593
+PrivateTmp=true
594
+
595
+# Resource limits
596
+LimitNOFILE=1024
597
+MemoryMax=100M
598
+CPUQuota=10%
599
+
600
+# Logging
601
+StandardOutput=journal
602
+StandardError=journal
603
+SyslogIdentifier=autosmart
604
+
605
+[Install]
606
+WantedBy=multi-user.target
607
+EOF
608
+    
609
+    # Reload systemd
610
+    systemctl daemon-reload
611
+    
612
+    log_success "Systemd service created"
613
+}
614
+
615
+test_database_connection() {
616
+    log_info "🔗 Testing database connection..."
617
+    
618
+    # Test connection using psql
619
+    if command -v psql &> /dev/null; then
620
+        if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version();" >/dev/null 2>&1; then
621
+            log_success "Database connection successful"
622
+        else
623
+            log_warning "Database connection failed. Service may not start correctly."
624
+            log_info "Please ensure:"
625
+            log_info "  • PostgreSQL server is running on $DB_HOST"
626
+            log_info "  • Database '$DB_NAME' exists"
627
+            log_info "  • User '$DB_USER' has proper permissions"
628
+        fi
629
+    else
630
+        log_warning "psql not found. Cannot test database connection."
631
+    fi
632
+}
633
+
634
+test_smart_detection() {
635
+    log_info "🔍 Testing SMART device detection..."
636
+    
637
+    DEVICES_FOUND=0
638
+    for device in /dev/sd? /dev/nvme?n?; do
639
+        if [[ -b "$device" ]] && smartctl -i "$device" >/dev/null 2>&1; then
640
+            MODEL=$(smartctl -i "$device" | grep "Device Model\|Model Number" | head -1 | cut -d: -f2 | xargs)
641
+            if [[ -n "$MODEL" ]]; then
642
+                log_info "  Found: $device - $MODEL"
643
+                ((DEVICES_FOUND++))
644
+            fi
645
+        fi
646
+    done
647
+    
648
+    if [[ $DEVICES_FOUND -gt 0 ]]; then
649
+        log_success "Detected $DEVICES_FOUND SMART-capable devices"
650
+    else
651
+        log_warning "No SMART-capable devices detected"
652
+    fi
653
+}
654
+
655
+finalize_installation() {
656
+    log_info "🎯 Finalizing installation..."
657
+    
658
+    # Enable service (but don't start yet)
659
+    systemctl enable "$SERVICE_NAME"
660
+    
661
+    # Create log rotation
662
+    cat > "/etc/logrotate.d/autosmart" << EOF
663
+/var/log/autosmart/*.log {
664
+    daily
665
+    rotate 7
666
+    compress
667
+    delaycompress
668
+    missingok
669
+    notifempty
670
+    postrotate
671
+        systemctl reload-or-restart autosmart
672
+    endscript
673
+}
674
+EOF
675
+    
676
+    log_success "Installation finalized"
677
+}
678
+
679
+show_completion_message() {
680
+    log_success "✅ autoSMART installation completed successfully!"
681
+    log_info ""
682
+    log_info "📋 Installation Summary:"
683
+    log_info "  • Install Directory: $INSTALL_DIR"
684
+    log_info "  • Config Directory: $CONFIG_DIR"
685
+    log_info "  • Service Name: $SERVICE_NAME"
686
+    log_info "  • Node ID: $NODE_ID"
687
+    log_info ""
688
+    log_info "🚀 Next Steps:"
689
+    log_info "  1. Start the service:"
690
+    log_info "     systemctl start $SERVICE_NAME"
691
+    log_info ""
692
+    log_info "  2. Check service status:"
693
+    log_info "     systemctl status $SERVICE_NAME"
694
+    log_info ""
695
+    log_info "  3. View logs:"
696
+    log_info "     journalctl -u $SERVICE_NAME -f"
697
+    log_info ""
698
+    log_info "📖 Documentation: $INSTALL_DIR/docs/README.md"
699
+    log_info "⚙️  Configuration: $CONFIG_DIR/autosmart.conf"
700
+    log_info ""
701
+    log_info "🎉 autoSMART is ready to monitor your storage devices!"
702
+}
703
+
704
+# Main execution
705
+main() {
706
+    parse_arguments "$@"
707
+    show_header
708
+    
709
+    case "$COMMAND" in
710
+        uninstall)
711
+            handle_uninstall
712
+            ;;
713
+        install)
714
+            check_requirements
715
+            
716
+            # Handle force reinstall
717
+            if [[ "$FORCE_REINSTALL" == true ]]; then
718
+                log_info "🗑️  Force reinstall: cleaning previous installation..."
719
+                handle_uninstall 2>/dev/null || true
720
+                sleep 2
721
+            fi
722
+            
723
+            # Handle config-only mode
724
+            if [[ "$CONFIG_ONLY" == true ]]; then
725
+                log_info "⚙️  Configuration-only mode"
726
+                if [[ ! -d "$INSTALL_DIR" ]]; then
727
+                    log_error "autoSMART is not installed. Run full installation first."
728
+                    exit 1
729
+                fi
730
+                create_configuration
731
+                log_success "✅ Configuration updated successfully!"
732
+                exit 0
733
+            fi
734
+            
735
+            # Full installation
736
+            install_dependencies
737
+            create_directories
738
+            copy_files
739
+            create_configuration
740
+            create_systemd_service
741
+            test_database_connection
742
+            test_smart_detection
743
+            finalize_installation
744
+            show_completion_message
745
+            ;;
746
+        *)
747
+            log_error "Unknown command: $COMMAND"
748
+            show_usage
749
+            exit 1
750
+            ;;
751
+    esac
752
+}
753
+
754
+# Run main function
755
+main "$@"
+789 -0
projects/autoSMART/scripts/install-simplified.sh
@@ -0,0 +1,789 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Node Installation Script
4
+# Version: 1.0  
5
+# Description: Install autoSMART on target nodes (Linux systems only)
6
+# Note: This script is called by deploy.sh and should run on target nodes
7
+
8
+set -e
9
+
10
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
11
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
12
+INSTALL_DIR="/opt/autoSMART"
13
+CONFIG_DIR="/etc/autosmart"
14
+SERVICE_NAME="autosmart"
15
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
16
+
17
+# Default configuration (can be overridden by command line)
18
+DB_HOST="${DB_HOST:-192.168.2.102}"
19
+DB_USER="${DB_USER:-autosmart}"
20
+DB_PASS="${DB_PASS:-autoSMART2025!}"
21
+DB_NAME="${DB_NAME:-autosmart}"
22
+
23
+# Node configuration
24
+NODE_ID="${NODE_ID:-$(hostname -s)}"
25
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
26
+FULL_SCAN_INTERVAL="${FULL_SCAN_INTERVAL:-3600}"
27
+
28
+# Operation modes
29
+UNINSTALL=false
30
+FORCE_REINSTALL=false
31
+CONFIG_ONLY=false
32
+
33
+# Colors for output
34
+RED='\033[0;31m'
35
+GREEN='\033[0;32m'
36
+YELLOW='\033[1;33m'
37
+BLUE='\033[0;34m'
38
+NC='\033[0m' # No Color
39
+
40
+log_info() {
41
+    echo -e "${BLUE}[INFO]${NC} $1"
42
+}
43
+
44
+log_success() {
45
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
46
+}
47
+
48
+log_warning() {
49
+    echo -e "${YELLOW}[WARNING]${NC} $1"
50
+}
51
+
52
+log_error() {
53
+    echo -e "${RED}[ERROR]${NC} $1"
54
+}
55
+
56
+show_usage() {
57
+    echo "autoSMART Node Installation Script v1.0"
58
+    echo "========================================"
59
+    echo ""
60
+    echo "Usage: $0 [COMMAND] [OPTIONS]"
61
+    echo ""
62
+    echo "Commands:"
63
+    echo "  install               Install autoSMART on current node (default)"
64
+    echo "  uninstall             Remove autoSMART completely from current node"
65
+    echo ""
66
+    echo "Options:"
67
+    echo "  --help                Show this help message"
68
+    echo "  --force-reinstall     Clean installation (removes previous version)"
69
+    echo "  --config-only         Only create/update configuration files"
70
+    echo "  --db-host HOST        Database host (default: 192.168.2.102)"
71
+    echo "  --db-user USER        Database user (default: autosmart)"
72
+    echo "  --db-pass PASS        Database password (default: autoSMART2025!)"
73
+    echo "  --db-name NAME        Database name (default: autosmart)"
74
+    echo "  --node-id ID          Node identifier (default: hostname)"
75
+    echo "  --scan-interval SEC   Scan interval in seconds (default: 300)"
76
+    echo ""
77
+    echo "Note: This script should be called by deploy.sh, not run directly."
78
+    echo "For deployment from development machine, use: deploy.sh install <IP>"
79
+    echo ""
80
+}
81
+
82
+parse_arguments() {
83
+    COMMAND="install"  # Default command
84
+    
85
+    while [[ $# -gt 0 ]]; do
86
+        case $1 in
87
+            install|uninstall)
88
+                COMMAND="$1"
89
+                shift
90
+                ;;
91
+            --help)
92
+                show_usage
93
+                exit 0
94
+                ;;
95
+            --force-reinstall)
96
+                FORCE_REINSTALL=true
97
+                shift
98
+                ;;
99
+            --config-only)
100
+                CONFIG_ONLY=true
101
+                shift
102
+                ;;
103
+            --db-host)
104
+                DB_HOST="$2"
105
+                shift 2
106
+                ;;
107
+            --db-user)
108
+                DB_USER="$2"
109
+                shift 2
110
+                ;;
111
+            --db-pass)
112
+                DB_PASS="$2"
113
+                shift 2
114
+                ;;
115
+            --db-name)
116
+                DB_NAME="$2"
117
+                shift 2
118
+                ;;
119
+            --node-id)
120
+                NODE_ID="$2"
121
+                shift 2
122
+                ;;
123
+            --scan-interval)
124
+                SCAN_INTERVAL="$2"
125
+                shift 2
126
+                ;;
127
+            *)
128
+                log_error "Unknown option: $1"
129
+                show_usage
130
+                exit 1
131
+                ;;
132
+        esac
133
+    done
134
+}
135
+
136
+show_header() {
137
+    log_info "🔧 autoSMART Node Installation v1.0"
138
+    log_info "==================================="
139
+    log_info "Installing on target node: $(hostname)"
140
+    log_info ""
141
+    log_info "Operation: $COMMAND"
142
+    log_info "Node ID: $NODE_ID"
143
+    log_info "Database: $DB_HOST:5432/$DB_NAME"
144
+    if [[ "$COMMAND" == "install" ]]; then
145
+        log_info "Install Directory: $INSTALL_DIR"
146
+        log_info "Config Directory: $CONFIG_DIR"
147
+    fi
148
+    log_info ""
149
+}
150
+
151
+check_requirements() {
152
+    log_info "🔍 Checking system requirements..."
153
+    
154
+    # Check if running as root
155
+    if [[ $EUID -ne 0 ]]; then
156
+        log_error "This script must be run as root (use sudo)"
157
+        exit 1
158
+    fi
159
+    
160
+    # Check if running on Linux
161
+    if [[ "$(uname)" != "Linux" ]]; then
162
+        log_error "autoSMART can only be installed on Linux systems"
163
+        log_error "Current system: $(uname)"
164
+        exit 1
165
+    fi
166
+    
167
+    # Check systemd
168
+    if ! command -v systemctl &> /dev/null; then
169
+        log_error "systemd is required but not found"
170
+        exit 1
171
+    fi
172
+    
173
+    # Check and report dependency status
174
+    if ! verify_dependencies >/dev/null 2>&1; then
175
+        log_warning "Some dependencies are missing (will be installed automatically)"
176
+    fi
177
+    
178
+    # Check available space
179
+    AVAILABLE_SPACE=$(df / | tail -1 | awk '{print $4}')
180
+    if [[ $AVAILABLE_SPACE -lt 100000 ]]; then
181
+        log_warning "Less than 100MB available space. Installation may fail."
182
+    fi
183
+    
184
+    log_success "System requirements check passed"
185
+}
186
+
187
+handle_uninstall() {
188
+    log_info "🗑️  Uninstalling autoSMART..."
189
+    
190
+    # Stop and disable service
191
+    if systemctl is-active --quiet autosmart; then
192
+        systemctl stop autosmart
193
+    fi
194
+    if systemctl is-enabled --quiet autosmart; then
195
+        systemctl disable autosmart
196
+    fi
197
+    
198
+    # Remove service file
199
+    if [[ -f "$SYSTEMD_SERVICE" ]]; then
200
+        rm "$SYSTEMD_SERVICE"
201
+        systemctl daemon-reload
202
+    fi
203
+    
204
+    # Remove installation directory
205
+    if [[ -d "$INSTALL_DIR" ]]; then
206
+        rm -rf "$INSTALL_DIR"
207
+    fi
208
+    
209
+    # Remove configuration directory
210
+    if [[ -d "$CONFIG_DIR" ]]; then
211
+        rm -rf "$CONFIG_DIR"
212
+    fi
213
+    
214
+    # Remove log rotation
215
+    if [[ -f "/etc/logrotate.d/autosmart" ]]; then
216
+        rm "/etc/logrotate.d/autosmart"
217
+    fi
218
+    
219
+    log_success "✅ autoSMART uninstalled successfully"
220
+    exit 0
221
+}
222
+
223
+# Function to check if a package is installed
224
+check_package_installed() {
225
+    local package="$1"
226
+    local package_manager="$2"
227
+    
228
+    case "$package_manager" in
229
+        "apt-get")
230
+            dpkg -l | grep -q "^ii  $package " 2>/dev/null
231
+            ;;
232
+        "yum"|"dnf")
233
+            rpm -qa | grep -q "$package" 2>/dev/null
234
+            ;;
235
+        "zypper")
236
+            zypper se -i "$package" | grep -q "^i" 2>/dev/null
237
+            ;;
238
+        "pacman")
239
+            pacman -Q "$package" >/dev/null 2>&1
240
+            ;;
241
+        *)
242
+            return 1
243
+            ;;
244
+    esac
245
+}
246
+
247
+# Function to verify all dependencies are installed
248
+verify_dependencies() {
249
+    log_info "🔍 Verifying system dependencies..."
250
+    
251
+    local missing_packages=()
252
+    local package_manager=""
253
+    
254
+    # Detect package manager
255
+    if command -v apt-get &> /dev/null; then
256
+        package_manager="apt-get"
257
+    elif command -v yum &> /dev/null; then
258
+        package_manager="yum"
259
+    elif command -v dnf &> /dev/null; then
260
+        package_manager="dnf"
261
+    elif command -v zypper &> /dev/null; then
262
+        package_manager="zypper"
263
+    elif command -v pacman &> /dev/null; then
264
+        package_manager="pacman"
265
+    else
266
+        log_warning "Unknown package manager. Dependency verification limited."
267
+        return 1
268
+    fi
269
+    
270
+    # Check system packages (including Perl modules from distribution)
271
+    local system_packages=("perl" "smartmontools" "postgresql-client" "curl" "wget")
272
+    local perl_packages=()
273
+    
274
+    # Add Perl module packages based on package manager
275
+    case "$package_manager" in
276
+        "apt-get")
277
+            perl_packages+=("libdbi-perl" "libdbd-pg-perl" "libjson-perl" "libfile-slurp-perl" 
278
+                           "libgetopt-long-descriptive-perl" "libconfig-simple-perl")
279
+            ;;
280
+        "yum"|"dnf")
281
+            perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp" 
282
+                           "perl-Getopt-Long" "perl-Config-Simple")
283
+            ;;
284
+        "zypper")
285
+            perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp" 
286
+                           "perl-Getopt-Long-Descriptive" "perl-Config-Simple")
287
+            ;;
288
+        "pacman")
289
+            perl_packages+=("perl-dbi" "perl-dbd-pg" "perl-json" "perl-file-slurp")
290
+            ;;
291
+    esac
292
+    
293
+    # Check system packages
294
+    for package in "${system_packages[@]}"; do
295
+        if ! check_package_installed "$package" "$package_manager"; then
296
+            missing_packages+=("$package")
297
+        fi
298
+    done
299
+    
300
+    # Check Perl packages from distribution
301
+    for package in "${perl_packages[@]}"; do
302
+        if ! check_package_installed "$package" "$package_manager"; then
303
+            missing_packages+=("$package")
304
+        fi
305
+    done
306
+    
307
+    # Report results
308
+    if [[ ${#missing_packages[@]} -eq 0 ]]; then
309
+        log_success "✅ All dependencies are available"
310
+        return 0
311
+    else
312
+        log_warning "Missing dependencies detected:"
313
+        if [[ ${#missing_packages[@]} -gt 0 ]]; then
314
+            log_warning "  Missing packages: ${missing_packages[*]}"
315
+        fi
316
+        return 1
317
+    fi
318
+}
319
+
320
+# Function to install dependencies
321
+install_dependencies() {
322
+    log_info "📦 Installing system dependencies..."
323
+    
324
+    # First check if dependencies are already installed
325
+    if verify_dependencies >/dev/null 2>&1; then
326
+        log_success "All dependencies already installed"
327
+        return 0
328
+    fi
329
+    
330
+    log_info "Installing missing dependencies..."
331
+    
332
+    if command -v apt-get &> /dev/null; then
333
+        # Debian/Ubuntu
334
+        log_info "Updating package lists..."
335
+        apt-get update -qq
336
+        
337
+        PACKAGES=(
338
+            "perl"
339
+            "libdbi-perl"
340
+            "libdbd-pg-perl" 
341
+            "libjson-perl"
342
+            "libfile-slurp-perl"
343
+            "libgetopt-long-descriptive-perl"
344
+            "libconfig-simple-perl"
345
+            "smartmontools"
346
+            "postgresql-client"
347
+            "curl"
348
+            "wget"
349
+        )
350
+        
351
+        for package in "${PACKAGES[@]}"; do
352
+            if ! check_package_installed "$package" "apt-get"; then
353
+                log_info "Installing $package..."
354
+                if ! apt-get install -y "$package" >/dev/null 2>&1; then
355
+                    log_error "Failed to install $package"
356
+                    exit 1
357
+                fi
358
+            fi
359
+        done
360
+        
361
+    elif command -v dnf &> /dev/null; then
362
+        # Fedora/RHEL 8+
363
+        log_info "Updating package lists..."
364
+        dnf update -y -q
365
+        
366
+        PACKAGES=(
367
+            "perl"
368
+            "perl-DBI"
369
+            "perl-DBD-Pg"
370
+            "perl-JSON"
371
+            "perl-File-Slurp"
372
+            "perl-Getopt-Long"
373
+            "perl-Config-Simple"
374
+            "smartmontools"
375
+            "postgresql"
376
+            "curl"
377
+            "wget"
378
+        )
379
+        
380
+        for package in "${PACKAGES[@]}"; do
381
+            if ! check_package_installed "$package" "dnf"; then
382
+                log_info "Installing $package..."
383
+                if ! dnf install -y "$package" >/dev/null 2>&1; then
384
+                    log_error "Failed to install $package"
385
+                    exit 1
386
+                fi
387
+            fi
388
+        done
389
+        
390
+    elif command -v yum &> /dev/null; then
391
+        # RHEL/CentOS 7
392
+        log_info "Updating package lists..."
393
+        yum update -y -q
394
+        
395
+        PACKAGES=(
396
+            "perl"
397
+            "perl-DBI"
398
+            "perl-DBD-Pg"
399
+            "perl-JSON"
400
+            "perl-File-Slurp"
401
+            "perl-Getopt-Long"
402
+            "perl-Config-Simple"
403
+            "smartmontools"
404
+            "postgresql"
405
+            "curl"
406
+            "wget"
407
+        )
408
+        
409
+        for package in "${PACKAGES[@]}"; do
410
+            if ! check_package_installed "$package" "yum"; then
411
+                log_info "Installing $package..."
412
+                if ! yum install -y "$package" >/dev/null 2>&1; then
413
+                    log_error "Failed to install $package"
414
+                    exit 1
415
+                fi
416
+            fi
417
+        done
418
+        
419
+    elif command -v zypper &> /dev/null; then
420
+        # openSUSE
421
+        log_info "Updating package lists..."
422
+        zypper refresh -q
423
+        
424
+        PACKAGES=(
425
+            "perl"
426
+            "perl-DBI"
427
+            "perl-DBD-Pg"
428
+            "perl-JSON"
429
+            "perl-File-Slurp"
430
+            "perl-Getopt-Long-Descriptive"
431
+            "perl-Config-Simple"
432
+            "smartmontools"
433
+            "postgresql"
434
+            "curl"
435
+            "wget"
436
+        )
437
+        
438
+        for package in "${PACKAGES[@]}"; do
439
+            if ! check_package_installed "$package" "zypper"; then
440
+                log_info "Installing $package..."
441
+                if ! zypper install -y "$package" >/dev/null 2>&1; then
442
+                    log_error "Failed to install $package"
443
+                    exit 1
444
+                fi
445
+            fi
446
+        done
447
+        
448
+    elif command -v pacman &> /dev/null; then
449
+        # Arch Linux
450
+        log_info "Updating package lists..."
451
+        pacman -Sy --noconfirm
452
+        
453
+        PACKAGES=(
454
+            "perl"
455
+            "perl-dbi"
456
+            "perl-dbd-pg"
457
+            "perl-json"
458
+            "perl-file-slurp"
459
+            "smartmontools"
460
+            "postgresql"
461
+            "curl"
462
+            "wget"
463
+        )
464
+        
465
+        for package in "${PACKAGES[@]}"; do
466
+            if ! check_package_installed "$package" "pacman"; then
467
+                log_info "Installing $package..."
468
+                if ! pacman -S --noconfirm "$package" >/dev/null 2>&1; then
469
+                    log_error "Failed to install $package"
470
+                    exit 1
471
+                fi
472
+            fi
473
+        done
474
+        
475
+    else
476
+        log_error "Unsupported package manager. Please install dependencies manually:"
477
+        log_error "  - perl, smartmontools, postgresql-client, curl, wget"
478
+        log_error "  - Perl modules: DBI, DBD::Pg, JSON, File::Slurp, Getopt::Long, Config::Simple"
479
+        exit 1
480
+    fi
481
+    
482
+    # Verify installation was successful
483
+    if verify_dependencies >/dev/null 2>&1; then
484
+        log_success "✅ All dependencies installed successfully"
485
+    else
486
+        log_error "Some dependencies may not have installed correctly"
487
+        exit 1
488
+    fi
489
+}
490
+
491
+create_directories() {
492
+    log_info "📁 Creating directory structure..."
493
+    
494
+    # Create main directories
495
+    mkdir -p "$INSTALL_DIR"/{scripts,lib,config,docs}
496
+    mkdir -p "$CONFIG_DIR"
497
+    
498
+    # Set permissions
499
+    chmod 755 "$INSTALL_DIR"
500
+    chmod 755 "$CONFIG_DIR"
501
+    
502
+    log_success "Directories created"
503
+}
504
+
505
+copy_files() {
506
+    log_info "📋 Copying autoSMART files..."
507
+    
508
+    # Copy scripts
509
+    if [[ -d "$PROJECT_ROOT/scripts" ]]; then
510
+        cp -r "$PROJECT_ROOT/scripts"/* "$INSTALL_DIR/scripts/"
511
+        chmod +x "$INSTALL_DIR/scripts"/*.sh 2>/dev/null || true
512
+        chmod +x "$INSTALL_DIR/scripts"/*.pl 2>/dev/null || true
513
+    fi
514
+    
515
+    # Copy libraries
516
+    if [[ -d "$PROJECT_ROOT/lib" ]]; then
517
+        cp -r "$PROJECT_ROOT/lib"/* "$INSTALL_DIR/lib/"
518
+    fi
519
+    
520
+    # Copy documentation  
521
+    if [[ -d "$PROJECT_ROOT/docs" ]]; then
522
+        cp -r "$PROJECT_ROOT/docs"/* "$INSTALL_DIR/docs/"
523
+    fi
524
+    
525
+    # Copy SQL files
526
+    if [[ -d "$PROJECT_ROOT/sql" ]]; then
527
+        cp -r "$PROJECT_ROOT/sql" "$INSTALL_DIR/"
528
+    fi
529
+    
530
+    log_success "Files copied"
531
+}
532
+
533
+create_configuration() {
534
+    log_info "⚙️  Creating configuration files..."
535
+    
536
+    # Main configuration file
537
+    cat > "$CONFIG_DIR/autosmart.conf" << EOF
538
+# autoSMART Configuration File
539
+# Generated on $(date)
540
+
541
+[database]
542
+host = $DB_HOST
543
+port = 5432
544
+user = $DB_USER
545
+password = $DB_PASS
546
+database = $DB_NAME
547
+timeout = 30
548
+
549
+[node]
550
+id = $NODE_ID
551
+scan_interval = $SCAN_INTERVAL
552
+full_scan_interval = $FULL_SCAN_INTERVAL
553
+store_unchanged = false
554
+max_retries = 3
555
+
556
+[collection]
557
+temperature_threshold = 5
558
+parameter_changes_only = true
559
+enable_predictive_analysis = true
560
+health_check_interval = 86400
561
+
562
+[logging]
563
+level = INFO
564
+max_size = 10M
565
+rotate_count = 5
566
+syslog = true
567
+
568
+[alerts]
569
+enable = true
570
+temperature_critical = 60
571
+reallocated_sectors_warning = 1
572
+pending_sectors_critical = 5
573
+EOF
574
+
575
+    # YAML format configuration for Perl daemon
576
+    cat > "$CONFIG_DIR/cluster-$NODE_ID.conf" << EOF
577
+# autoSMART YAML Configuration for $NODE_ID
578
+database:
579
+  host: $DB_HOST
580
+  port: 5432
581
+  user: $DB_USER
582
+  password: $DB_PASS
583
+  database: $DB_NAME
584
+
585
+node:
586
+  id: $NODE_ID
587
+  scan_interval: $SCAN_INTERVAL
588
+  store_unchanged: false
589
+
590
+collection:
591
+  temperature_threshold: 5
592
+  parameter_changes_only: true
593
+  full_scan_interval: $FULL_SCAN_INTERVAL
594
+EOF
595
+    
596
+    # Set secure permissions on config files
597
+    chmod 600 "$CONFIG_DIR"/*.conf
598
+    
599
+    log_success "Configuration created"
600
+}
601
+
602
+create_systemd_service() {
603
+    log_info "🔧 Creating systemd service..."
604
+    
605
+    cat > "$SYSTEMD_SERVICE" << EOF
606
+[Unit]
607
+Description=autoSMART SMART Data Collector
608
+Documentation=file://$INSTALL_DIR/docs/README.md
609
+After=network.target postgresql.service
610
+Wants=postgresql.service
611
+
612
+[Service]
613
+Type=simple
614
+ExecStart=$INSTALL_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE_ID.conf --foreground
615
+ExecReload=/bin/kill -HUP \$MAINPID
616
+KillMode=process
617
+Restart=always
618
+RestartSec=30
619
+User=root
620
+Group=root
621
+
622
+# Security settings
623
+NoNewPrivileges=true
624
+ProtectSystem=strict
625
+ProtectHome=true
626
+ReadWritePaths=$CONFIG_DIR
627
+PrivateTmp=true
628
+
629
+# Resource limits
630
+LimitNOFILE=1024
631
+MemoryMax=100M
632
+CPUQuota=10%
633
+
634
+# Logging
635
+StandardOutput=journal
636
+StandardError=journal
637
+SyslogIdentifier=autosmart
638
+
639
+[Install]
640
+WantedBy=multi-user.target
641
+EOF
642
+    
643
+    # Reload systemd
644
+    systemctl daemon-reload
645
+    
646
+    log_success "Systemd service created"
647
+}
648
+
649
+test_database_connection() {
650
+    log_info "🔗 Testing database connection..."
651
+    
652
+    # Test connection using psql
653
+    if command -v psql &> /dev/null; then
654
+        if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version();" >/dev/null 2>&1; then
655
+            log_success "Database connection successful"
656
+        else
657
+            log_warning "Database connection failed. Service may not start correctly."
658
+            log_info "Please ensure:"
659
+            log_info "  • PostgreSQL server is running on $DB_HOST"
660
+            log_info "  • Database '$DB_NAME' exists"
661
+            log_info "  • User '$DB_USER' has proper permissions"
662
+        fi
663
+    else
664
+        log_warning "psql not found. Cannot test database connection."
665
+    fi
666
+}
667
+
668
+test_smart_detection() {
669
+    log_info "🔍 Testing SMART device detection..."
670
+    
671
+    DEVICES_FOUND=0
672
+    for device in /dev/sd? /dev/nvme?n?; do
673
+        if [[ -b "$device" ]] && smartctl -i "$device" >/dev/null 2>&1; then
674
+            MODEL=$(smartctl -i "$device" | grep "Device Model\|Model Number" | head -1 | cut -d: -f2 | xargs)
675
+            if [[ -n "$MODEL" ]]; then
676
+                log_info "  Found: $device - $MODEL"
677
+                ((DEVICES_FOUND++))
678
+            fi
679
+        fi
680
+    done
681
+    
682
+    if [[ $DEVICES_FOUND -gt 0 ]]; then
683
+        log_success "Detected $DEVICES_FOUND SMART-capable devices"
684
+    else
685
+        log_warning "No SMART-capable devices detected"
686
+    fi
687
+}
688
+
689
+finalize_installation() {
690
+    log_info "🎯 Finalizing installation..."
691
+    
692
+    # Enable service (but don't start yet)
693
+    systemctl enable "$SERVICE_NAME"
694
+    
695
+    # Create log rotation
696
+    cat > "/etc/logrotate.d/autosmart" << EOF
697
+/var/log/autosmart/*.log {
698
+    daily
699
+    rotate 7
700
+    compress
701
+    delaycompress
702
+    missingok
703
+    notifempty
704
+    postrotate
705
+        systemctl reload-or-restart autosmart
706
+    endscript
707
+}
708
+EOF
709
+    
710
+    log_success "Installation finalized"
711
+}
712
+
713
+show_completion_message() {
714
+    log_success "✅ autoSMART installation completed successfully!"
715
+    log_info ""
716
+    log_info "📋 Installation Summary:"
717
+    log_info "  • Install Directory: $INSTALL_DIR"
718
+    log_info "  • Config Directory: $CONFIG_DIR"
719
+    log_info "  • Service Name: $SERVICE_NAME"
720
+    log_info "  • Node ID: $NODE_ID"
721
+    log_info ""
722
+    log_info "🚀 Next Steps:"
723
+    log_info "  1. Start the service:"
724
+    log_info "     systemctl start $SERVICE_NAME"
725
+    log_info ""
726
+    log_info "  2. Check service status:"
727
+    log_info "     systemctl status $SERVICE_NAME"
728
+    log_info ""
729
+    log_info "  3. View logs:"
730
+    log_info "     journalctl -u $SERVICE_NAME -f"
731
+    log_info ""
732
+    log_info "📖 Documentation: $INSTALL_DIR/docs/README.md"
733
+    log_info "⚙️  Configuration: $CONFIG_DIR/autosmart.conf"
734
+    log_info ""
735
+    log_info "🎉 autoSMART is ready to monitor your storage devices!"
736
+}
737
+
738
+# Main execution
739
+main() {
740
+    parse_arguments "$@"
741
+    show_header
742
+    
743
+    case "$COMMAND" in
744
+        uninstall)
745
+            handle_uninstall
746
+            ;;
747
+        install)
748
+            check_requirements
749
+            
750
+            # Handle force reinstall
751
+            if [[ "$FORCE_REINSTALL" == true ]]; then
752
+                log_info "🗑️  Force reinstall: cleaning previous installation..."
753
+                handle_uninstall 2>/dev/null || true
754
+                sleep 2
755
+            fi
756
+            
757
+            # Handle config-only mode
758
+            if [[ "$CONFIG_ONLY" == true ]]; then
759
+                log_info "⚙️  Configuration-only mode"
760
+                if [[ ! -d "$INSTALL_DIR" ]]; then
761
+                    log_error "autoSMART is not installed. Run full installation first."
762
+                    exit 1
763
+                fi
764
+                create_configuration
765
+                log_success "✅ Configuration updated successfully!"
766
+                exit 0
767
+            fi
768
+            
769
+            # Full installation
770
+            install_dependencies
771
+            create_directories
772
+            copy_files
773
+            create_configuration
774
+            create_systemd_service
775
+            test_database_connection
776
+            test_smart_detection
777
+            finalize_installation
778
+            show_completion_message
779
+            ;;
780
+        *)
781
+            log_error "Unknown command: $COMMAND"
782
+            show_usage
783
+            exit 1
784
+            ;;
785
+    esac
786
+}
787
+
788
+# Run main function
789
+main "$@"
+844 -0
projects/autoSMART/scripts/install.sh
@@ -0,0 +1,844 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Node Installation Script
4
+# Version: 1.0  
5
+# Description: Install autoSMART on target nodes (Linux systems only)
6
+# Note: This script is called by deploy.sh and should run on target nodes
7
+
8
+set -e
9
+
10
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
11
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
12
+INSTALL_DIR="/opt/autoSMART"
13
+CONFIG_DIR="/etc/autosmart"
14
+SERVICE_NAME="autosmart"
15
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
16
+
17
+# Default configuration (can be overridden by command line)
18
+DB_HOST="${DB_HOST:-192.168.2.102}"
19
+DB_USER="${DB_USER:-autosmart}"
20
+DB_PASS="${DB_PASS:-autoSMART2025!}"
21
+DB_NAME="${DB_NAME:-autosmart}"
22
+
23
+# Node configuration
24
+NODE_ID="${NODE_ID:-$(hostname -s)}"
25
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
26
+FULL_SCAN_INTERVAL="${FULL_SCAN_INTERVAL:-3600}"
27
+
28
+# Operation modes
29
+UNINSTALL=false
30
+FORCE_REINSTALL=false
31
+CONFIG_ONLY=false
32
+
33
+# Colors for output
34
+RED='\033[0;31m'
35
+GREEN='\033[0;32m'
36
+YELLOW='\033[1;33m'
37
+BLUE='\033[0;34m'
38
+NC='\033[0m' # No Color
39
+
40
+log_info() {
41
+    echo -e "${BLUE}[INFO]${NC} $1"
42
+}
43
+
44
+log_success() {
45
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
46
+}
47
+
48
+log_warning() {
49
+    echo -e "${YELLOW}[WARNING]${NC} $1"
50
+}
51
+
52
+log_error() {
53
+    echo -e "${RED}[ERROR]${NC} $1"
54
+}
55
+
56
+show_usage() {
57
+    echo "autoSMART Node Installation Script v1.0"
58
+    echo "========================================"
59
+    echo ""
60
+    echo "Usage: $0 [COMMAND] [OPTIONS]"
61
+    echo ""
62
+    echo "Commands:"
63
+    echo "  install               Install autoSMART on current node (default)"
64
+    echo "  uninstall             Remove autoSMART completely from current node"
65
+    echo ""
66
+    echo "Options:"
67
+    echo "  --help                Show this help message"
68
+    echo "  --force-reinstall     Clean installation (removes previous version)"
69
+    echo "  --config-only         Only create/update configuration files"
70
+    echo "  --db-host HOST        Database host (default: 192.168.2.102)"
71
+    echo "  --db-user USER        Database user (default: autosmart)"
72
+    echo "  --db-pass PASS        Database password (default: autoSMART2025!)"
73
+    echo "  --db-name NAME        Database name (default: autosmart)"
74
+    echo "  --node-id ID          Node identifier (default: hostname)"
75
+    echo "  --scan-interval SEC   Scan interval in seconds (default: 300)"
76
+    echo ""
77
+    echo "Note: This script should be called by deploy.sh, not run directly."
78
+    echo "For deployment from development machine, use: deploy.sh install <IP>"
79
+    echo ""
80
+}
81
+
82
+parse_arguments() {
83
+    COMMAND="install"  # Default command
84
+    
85
+    while [[ $# -gt 0 ]]; do
86
+        case $1 in
87
+            install|uninstall)
88
+                COMMAND="$1"
89
+                shift
90
+                ;;
91
+            --help)
92
+                show_usage
93
+                exit 0
94
+                ;;
95
+            --force-reinstall)
96
+                FORCE_REINSTALL=true
97
+                shift
98
+                ;;
99
+            --config-only)
100
+                CONFIG_ONLY=true
101
+                shift
102
+                ;;
103
+            --db-host)
104
+                DB_HOST="$2"
105
+                shift 2
106
+                ;;
107
+            --db-user)
108
+                DB_USER="$2"
109
+                shift 2
110
+                ;;
111
+            --db-pass)
112
+                DB_PASS="$2"
113
+                shift 2
114
+                ;;
115
+            --db-name)
116
+                DB_NAME="$2"
117
+                shift 2
118
+                ;;
119
+            --node-id)
120
+                NODE_ID="$2"
121
+                shift 2
122
+                ;;
123
+            --scan-interval)
124
+                SCAN_INTERVAL="$2"
125
+                shift 2
126
+                ;;
127
+            *)
128
+                log_error "Unknown option: $1"
129
+                show_usage
130
+                exit 1
131
+                ;;
132
+        esac
133
+    done
134
+}
135
+
136
+show_header() {
137
+    log_info "🔧 autoSMART Node Installation v1.0"
138
+    log_info "==================================="
139
+    log_info "Installing on target node: $(hostname)"
140
+    log_info ""
141
+    log_info "Operation: $COMMAND"
142
+    log_info "Node ID: $NODE_ID"
143
+    log_info "Database: $DB_HOST:5432/$DB_NAME"
144
+    if [[ "$COMMAND" == "install" ]]; then
145
+        log_info "Install Directory: $INSTALL_DIR"
146
+        log_info "Config Directory: $CONFIG_DIR"
147
+    fi
148
+    log_info ""
149
+}
150
+
151
+check_requirements() {
152
+    log_info "🔍 Checking system requirements..."
153
+    
154
+    # Check if running as root
155
+    if [[ $EUID -ne 0 ]]; then
156
+        log_error "This script must be run as root (use sudo)"
157
+        exit 1
158
+    fi
159
+    
160
+    # Check if running on Linux
161
+    if [[ "$(uname)" != "Linux" ]]; then
162
+        log_error "autoSMART can only be installed on Linux systems"
163
+        log_error "Current system: $(uname)"
164
+        exit 1
165
+    fi
166
+    
167
+    # Check systemd
168
+    if ! command -v systemctl &> /dev/null; then
169
+        log_error "systemd is required but not found"
170
+        exit 1
171
+    fi
172
+    
173
+    # Check and report dependency status
174
+    if ! verify_dependencies >/dev/null 2>&1; then
175
+        log_warning "Some dependencies are missing (will be installed automatically)"
176
+    fi
177
+    
178
+    # Check available space
179
+    AVAILABLE_SPACE=$(df / | tail -1 | awk '{print $4}')
180
+    if [[ $AVAILABLE_SPACE -lt 100000 ]]; then
181
+        log_warning "Less than 100MB available space. Installation may fail."
182
+    fi
183
+    
184
+    log_success "System requirements check passed"
185
+}
186
+
187
+handle_uninstall() {
188
+    log_info "🗑️  Uninstalling autoSMART..."
189
+    
190
+    # Stop and disable service
191
+    if systemctl is-active --quiet autosmart; then
192
+        systemctl stop autosmart
193
+    fi
194
+    if systemctl is-enabled --quiet autosmart; then
195
+        systemctl disable autosmart
196
+    fi
197
+    
198
+    # Remove service file
199
+    if [[ -f "$SYSTEMD_SERVICE" ]]; then
200
+        rm "$SYSTEMD_SERVICE"
201
+        systemctl daemon-reload
202
+    fi
203
+    
204
+    # Remove installation directory
205
+    if [[ -d "$INSTALL_DIR" ]]; then
206
+        rm -rf "$INSTALL_DIR"
207
+    fi
208
+    
209
+    # Remove configuration directory
210
+    if [[ -d "$CONFIG_DIR" ]]; then
211
+        rm -rf "$CONFIG_DIR"
212
+    fi
213
+    
214
+    # Remove log rotation
215
+    if [[ -f "/etc/logrotate.d/autosmart" ]]; then
216
+        rm "/etc/logrotate.d/autosmart"
217
+    fi
218
+    
219
+    log_success "✅ autoSMART uninstalled successfully"
220
+    exit 0
221
+}
222
+
223
+# Function to check if a package is installed
224
+check_package_installed() {
225
+    local package="$1"
226
+    local package_manager="$2"
227
+    
228
+    case "$package_manager" in
229
+        "apt-get")
230
+            dpkg -l | grep -q "^ii  $package\( \|:\)" 2>/dev/null
231
+            ;;
232
+        "yum"|"dnf")
233
+            rpm -qa | grep -q "$package" 2>/dev/null
234
+            ;;
235
+        "zypper")
236
+            zypper se -i "$package" | grep -q "^i" 2>/dev/null
237
+            ;;
238
+        "pacman")
239
+            pacman -Q "$package" >/dev/null 2>&1
240
+            ;;
241
+        *)
242
+            return 1
243
+            ;;
244
+    esac
245
+}
246
+
247
+# Function to verify all dependencies are installed
248
+verify_dependencies() {
249
+    log_info "🔍 Verifying system dependencies..."
250
+    
251
+    local missing_packages=()
252
+    local package_manager=""
253
+    
254
+    # Detect package manager
255
+    if command -v apt-get &> /dev/null; then
256
+        package_manager="apt-get"
257
+    elif command -v yum &> /dev/null; then
258
+        package_manager="yum"
259
+    elif command -v dnf &> /dev/null; then
260
+        package_manager="dnf"
261
+    elif command -v zypper &> /dev/null; then
262
+        package_manager="zypper"
263
+    elif command -v pacman &> /dev/null; then
264
+        package_manager="pacman"
265
+    else
266
+        log_warning "Unknown package manager. Dependency verification limited."
267
+        return 1
268
+    fi
269
+    
270
+    # Check system packages (including Perl modules from distribution)
271
+    local system_packages=("perl" "smartmontools" "postgresql-client" "curl" "wget")
272
+    local perl_packages=()
273
+    
274
+    # Add Perl module packages based on package manager
275
+    case "$package_manager" in
276
+        "apt-get")
277
+            perl_packages+=("libdbi-perl" "libdbd-pg-perl" "libjson-perl" "libfile-slurp-perl" 
278
+                           "libgetopt-long-descriptive-perl" "libconfig-simple-perl")
279
+            ;;
280
+        "yum"|"dnf")
281
+            perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp" 
282
+                           "perl-Getopt-Long" "perl-Config-Simple")
283
+            ;;
284
+        "zypper")
285
+            perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp" 
286
+                           "perl-Getopt-Long-Descriptive" "perl-Config-Simple")
287
+            ;;
288
+        "pacman")
289
+            perl_packages+=("perl-dbi" "perl-dbd-pg" "perl-json" "perl-file-slurp")
290
+            ;;
291
+    esac
292
+    
293
+    # Check system packages
294
+    for package in "${system_packages[@]}"; do
295
+        if ! check_package_installed "$package" "$package_manager"; then
296
+            missing_packages+=("$package")
297
+        fi
298
+    done
299
+    
300
+    # Check Perl packages from distribution
301
+    for package in "${perl_packages[@]}"; do
302
+        if ! check_package_installed "$package" "$package_manager"; then
303
+            missing_packages+=("$package")
304
+        fi
305
+    done
306
+    
307
+    # Report results
308
+    if [[ ${#missing_packages[@]} -eq 0 ]]; then
309
+        log_success "✅ All dependencies are available"
310
+        return 0
311
+    else
312
+        log_warning "Missing dependencies detected:"
313
+        if [[ ${#missing_packages[@]} -gt 0 ]]; then
314
+            log_warning "  Missing packages: ${missing_packages[*]}"
315
+        fi
316
+        return 1
317
+    fi
318
+}
319
+
320
+# Function to install dependencies
321
+install_dependencies() {
322
+    log_info "📦 Installing system dependencies..."
323
+    
324
+    # First check if dependencies are already installed
325
+    if verify_dependencies >/dev/null 2>&1; then
326
+        log_success "All dependencies already installed"
327
+        return 0
328
+    fi
329
+    
330
+    log_info "Installing missing dependencies..."
331
+    
332
+    if command -v apt-get &> /dev/null; then
333
+        # Debian/Ubuntu
334
+        log_info "Updating package lists..."
335
+        apt-get update -qq
336
+        
337
+        PACKAGES=(
338
+            "perl"
339
+            "libdbi-perl"
340
+            "libdbd-pg-perl" 
341
+            "libjson-perl"
342
+            "libfile-slurp-perl"
343
+            "libgetopt-long-descriptive-perl"
344
+            "libconfig-simple-perl"
345
+            "smartmontools"
346
+            "postgresql-client"
347
+            "curl"
348
+            "wget"
349
+        )
350
+        
351
+        for package in "${PACKAGES[@]}"; do
352
+            if ! check_package_installed "$package" "apt-get"; then
353
+                log_info "Installing $package..."
354
+                if ! apt-get install -y "$package" >/dev/null 2>&1; then
355
+                    log_error "Failed to install $package"
356
+                    exit 1
357
+                fi
358
+            fi
359
+        done
360
+        
361
+    elif command -v dnf &> /dev/null; then
362
+        # Fedora/RHEL 8+
363
+        log_info "Updating package lists..."
364
+        dnf update -y -q
365
+        
366
+        PACKAGES=(
367
+            "perl"
368
+            "perl-DBI"
369
+            "perl-DBD-Pg"
370
+            "perl-JSON"
371
+            "perl-File-Slurp"
372
+            "perl-Getopt-Long"
373
+            "perl-Config-Simple"
374
+            "smartmontools"
375
+            "postgresql"
376
+            "curl"
377
+            "wget"
378
+        )
379
+        
380
+        for package in "${PACKAGES[@]}"; do
381
+            if ! check_package_installed "$package" "dnf"; then
382
+                log_info "Installing $package..."
383
+                if ! dnf install -y "$package" >/dev/null 2>&1; then
384
+                    log_error "Failed to install $package"
385
+                    exit 1
386
+                fi
387
+            fi
388
+        done
389
+        
390
+    elif command -v yum &> /dev/null; then
391
+        # RHEL/CentOS 7
392
+        log_info "Updating package lists..."
393
+        yum update -y -q
394
+        
395
+        PACKAGES=(
396
+            "perl"
397
+            "perl-DBI"
398
+            "perl-DBD-Pg"
399
+            "perl-JSON"
400
+            "perl-File-Slurp"
401
+            "perl-Getopt-Long"
402
+            "perl-Config-Simple"
403
+            "smartmontools"
404
+            "postgresql"
405
+            "curl"
406
+            "wget"
407
+        )
408
+        
409
+        for package in "${PACKAGES[@]}"; do
410
+            if ! check_package_installed "$package" "yum"; then
411
+                log_info "Installing $package..."
412
+                if ! yum install -y "$package" >/dev/null 2>&1; then
413
+                    log_error "Failed to install $package"
414
+                    exit 1
415
+                fi
416
+            fi
417
+        done
418
+        
419
+    elif command -v zypper &> /dev/null; then
420
+        # openSUSE
421
+        log_info "Updating package lists..."
422
+        zypper refresh -q
423
+        
424
+        PACKAGES=(
425
+            "perl"
426
+            "perl-DBI"
427
+            "perl-DBD-Pg"
428
+            "perl-JSON"
429
+            "perl-File-Slurp"
430
+            "perl-Getopt-Long-Descriptive"
431
+            "perl-Config-Simple"
432
+            "smartmontools"
433
+            "postgresql"
434
+            "curl"
435
+            "wget"
436
+        )
437
+        
438
+        for package in "${PACKAGES[@]}"; do
439
+            if ! check_package_installed "$package" "zypper"; then
440
+                log_info "Installing $package..."
441
+                if ! zypper install -y "$package" >/dev/null 2>&1; then
442
+                    log_error "Failed to install $package"
443
+                    exit 1
444
+                fi
445
+            fi
446
+        done
447
+        
448
+    elif command -v pacman &> /dev/null; then
449
+        # Arch Linux
450
+        log_info "Updating package lists..."
451
+        pacman -Sy --noconfirm
452
+        
453
+        PACKAGES=(
454
+            "perl"
455
+            "perl-dbi"
456
+            "perl-dbd-pg"
457
+            "perl-json"
458
+            "perl-file-slurp"
459
+            "smartmontools"
460
+            "postgresql"
461
+            "curl"
462
+            "wget"
463
+        )
464
+        
465
+        for package in "${PACKAGES[@]}"; do
466
+            if ! check_package_installed "$package" "pacman"; then
467
+                log_info "Installing $package..."
468
+                if ! pacman -S --noconfirm "$package" >/dev/null 2>&1; then
469
+                    log_error "Failed to install $package"
470
+                    exit 1
471
+                fi
472
+            fi
473
+        done
474
+        
475
+    else
476
+        log_error "Unsupported package manager. Please install dependencies manually:"
477
+        log_error "  - perl, smartmontools, postgresql-client, curl, wget"
478
+        log_error "  - Perl modules: DBI, DBD::Pg, JSON, File::Slurp, Getopt::Long, Config::Simple"
479
+        exit 1
480
+    fi
481
+    
482
+    # Verify installation was successful
483
+    if verify_dependencies >/dev/null 2>&1; then
484
+        log_success "✅ All dependencies installed successfully"
485
+    else
486
+        log_error "Some dependencies may not have installed correctly"
487
+        exit 1
488
+    fi
489
+}
490
+
491
+create_directories() {
492
+    log_info "📁 Creating directory structure..."
493
+    
494
+    # Create main directories
495
+    mkdir -p "$INSTALL_DIR"/{scripts,lib,config,docs}
496
+    mkdir -p "$CONFIG_DIR"
497
+    
498
+    # Set permissions
499
+    chmod 755 "$INSTALL_DIR"
500
+    chmod 755 "$CONFIG_DIR"
501
+    
502
+    log_success "Directories created"
503
+}
504
+
505
+copy_files() {
506
+    log_info "📋 Copying autoSMART files..."
507
+    
508
+    # Copy scripts
509
+    if [[ -d "$PROJECT_ROOT/scripts" ]]; then
510
+        cp -r "$PROJECT_ROOT/scripts"/* "$INSTALL_DIR/scripts/"
511
+        chmod +x "$INSTALL_DIR/scripts"/*.sh 2>/dev/null || true
512
+        chmod +x "$INSTALL_DIR/scripts"/*.pl 2>/dev/null || true
513
+    fi
514
+    
515
+    # Copy libraries
516
+    if [[ -d "$PROJECT_ROOT/lib" ]]; then
517
+        cp -r "$PROJECT_ROOT/lib"/* "$INSTALL_DIR/lib/"
518
+    fi
519
+    
520
+    # Copy default configuration to /etc/default/autosmart
521
+    if [[ -f "/etc/default/autosmart" ]]; then
522
+        log_info "📝 Existing configuration found, merging with defaults..."
523
+        
524
+        # Backup existing configuration
525
+        cp "/etc/default/autosmart" "/etc/default/autosmart.backup.$(date +%Y%m%d_%H%M%S)"
526
+        
527
+        # Read existing configuration
528
+        declare -A existing_config
529
+        while IFS='=' read -r key value; do
530
+            if [[ $key =~ ^[A-Z_]+$ ]] && [[ -n $value ]]; then
531
+                # Remove quotes and store
532
+                value=$(echo "$value" | sed 's/^"//;s/"$//')
533
+                existing_config["$key"]="$value"
534
+            fi
535
+        done < "/etc/default/autosmart"
536
+        
537
+        # Start with new configuration template
538
+        if [[ -f "$PROJECT_ROOT/config/autosmart-defaults.conf" ]]; then
539
+            cp "$PROJECT_ROOT/config/autosmart-defaults.conf" "/etc/default/autosmart"
540
+        else
541
+            cat > "/etc/default/autosmart" << 'EOF'
542
+# AutoSMART Configuration
543
+AUTOSMART_DEBUG="false"
544
+EOF
545
+        fi
546
+        
547
+        # Merge existing values back
548
+        for key in "${!existing_config[@]}"; do
549
+            value="${existing_config[$key]}"
550
+            if grep -q "^${key}=" "/etc/default/autosmart"; then
551
+                # Update existing key with preserved value
552
+                sed -i "s|^${key}=.*|${key}=\"${value}\"|" "/etc/default/autosmart"
553
+                log_info "✓ Preserved existing setting: ${key}=\"${value}\""
554
+            else
555
+                # Add new key
556
+                echo "${key}=\"${value}\"" >> "/etc/default/autosmart"
557
+                log_info "✓ Added custom setting: ${key}=\"${value}\""
558
+            fi
559
+        done
560
+        
561
+        log_info "✓ Configuration merged successfully"
562
+        
563
+    elif [[ -f "$PROJECT_ROOT/config/autosmart-defaults.conf" ]]; then
564
+        cp "$PROJECT_ROOT/config/autosmart-defaults.conf" /etc/default/autosmart
565
+        log_info "✓ AutoSMART default configuration installed"
566
+    else
567
+        log_warning "Default configuration file not found, creating basic one"
568
+        cat > /etc/default/autosmart << 'EOF'
569
+# AutoSMART Configuration
570
+AUTOSMART_DEBUG="false"
571
+EOF
572
+    fi
573
+    
574
+    # Copy documentation  
575
+    if [[ -d "$PROJECT_ROOT/docs" ]]; then
576
+        cp -r "$PROJECT_ROOT/docs"/* "$INSTALL_DIR/docs/"
577
+    fi
578
+    
579
+    # Copy SQL files
580
+    if [[ -d "$PROJECT_ROOT/sql" ]]; then
581
+        cp -r "$PROJECT_ROOT/sql" "$INSTALL_DIR/"
582
+    fi
583
+    
584
+    log_success "Files copied"
585
+}
586
+
587
+create_configuration() {
588
+    log_info "⚙️  Creating configuration files..."
589
+    
590
+    # Main configuration file
591
+    cat > "$CONFIG_DIR/autosmart.conf" << EOF
592
+# autoSMART Configuration File
593
+# Generated on $(date)
594
+
595
+[database]
596
+host = $DB_HOST
597
+port = 5432
598
+user = $DB_USER
599
+password = $DB_PASS
600
+database = $DB_NAME
601
+timeout = 30
602
+
603
+[node]
604
+id = $NODE_ID
605
+scan_interval = $SCAN_INTERVAL
606
+full_scan_interval = $FULL_SCAN_INTERVAL
607
+store_unchanged = false
608
+max_retries = 3
609
+
610
+[collection]
611
+temperature_threshold = 5
612
+parameter_changes_only = true
613
+enable_predictive_analysis = true
614
+health_check_interval = 86400
615
+
616
+[logging]
617
+level = INFO
618
+max_size = 10M
619
+rotate_count = 5
620
+syslog = true
621
+
622
+[alerts]
623
+enable = true
624
+temperature_critical = 60
625
+reallocated_sectors_warning = 1
626
+pending_sectors_critical = 5
627
+EOF
628
+
629
+    # YAML format configuration for Perl daemon
630
+    cat > "$CONFIG_DIR/cluster-$NODE_ID.conf" << EOF
631
+# autoSMART YAML Configuration for $NODE_ID
632
+database:
633
+  host: $DB_HOST
634
+  port: 5432
635
+  user: $DB_USER
636
+  password: $DB_PASS
637
+  database: $DB_NAME
638
+
639
+node:
640
+  id: $NODE_ID
641
+  scan_interval: $SCAN_INTERVAL
642
+  store_unchanged: false
643
+
644
+collection:
645
+  temperature_threshold: 5
646
+  parameter_changes_only: true
647
+  full_scan_interval: $FULL_SCAN_INTERVAL
648
+EOF
649
+    
650
+    # Set secure permissions on config files
651
+    chmod 600 "$CONFIG_DIR"/*.conf
652
+    
653
+    log_success "Configuration created"
654
+}
655
+
656
+create_systemd_service() {
657
+    log_info "🔧 Creating systemd service..."
658
+    
659
+    cat > "$SYSTEMD_SERVICE" << EOF
660
+[Unit]
661
+Description=autoSMART SMART Data Collector
662
+Documentation=file://$INSTALL_DIR/docs/README.md
663
+After=network.target postgresql.service
664
+Wants=postgresql.service
665
+
666
+[Service]
667
+Type=simple
668
+EnvironmentFile=/etc/default/autosmart
669
+ExecStart=$INSTALL_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE_ID.conf --foreground
670
+ExecReload=/bin/kill -HUP \$MAINPID
671
+KillMode=process
672
+Restart=always
673
+RestartSec=30
674
+User=root
675
+Group=root
676
+
677
+# Security settings
678
+NoNewPrivileges=true
679
+ProtectSystem=strict
680
+ProtectHome=true
681
+ReadWritePaths=$CONFIG_DIR
682
+PrivateTmp=true
683
+
684
+# Resource limits
685
+LimitNOFILE=1024
686
+MemoryMax=100M
687
+CPUQuota=10%
688
+
689
+# Logging
690
+StandardOutput=journal
691
+StandardError=journal
692
+SyslogIdentifier=autosmart
693
+
694
+[Install]
695
+WantedBy=multi-user.target
696
+EOF
697
+    
698
+    # Reload systemd
699
+    systemctl daemon-reload
700
+    
701
+    log_success "Systemd service created"
702
+}
703
+
704
+test_database_connection() {
705
+    log_info "🔗 Testing database connection..."
706
+    
707
+    # Test connection using psql
708
+    if command -v psql &> /dev/null; then
709
+        if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version();" >/dev/null 2>&1; then
710
+            log_success "Database connection successful"
711
+        else
712
+            log_warning "Database connection failed. Service may not start correctly."
713
+            log_info "Please ensure:"
714
+            log_info "  • PostgreSQL server is running on $DB_HOST"
715
+            log_info "  • Database '$DB_NAME' exists"
716
+            log_info "  • User '$DB_USER' has proper permissions"
717
+        fi
718
+    else
719
+        log_warning "psql not found. Cannot test database connection."
720
+    fi
721
+}
722
+
723
+test_smart_detection() {
724
+    log_info "🔍 Testing SMART device detection..."
725
+    
726
+    DEVICES_FOUND=0
727
+    for device in /dev/sd? /dev/nvme?n?; do
728
+        if [[ -b "$device" ]] && smartctl -i "$device" >/dev/null 2>&1; then
729
+            MODEL=$(smartctl -i "$device" | grep "Device Model\|Model Number" | head -1 | cut -d: -f2 | xargs)
730
+            if [[ -n "$MODEL" ]]; then
731
+                log_info "  Found: $device - $MODEL"
732
+                DEVICES_FOUND=$((DEVICES_FOUND + 1))
733
+            fi
734
+        fi
735
+    done
736
+    
737
+    if [[ $DEVICES_FOUND -gt 0 ]]; then
738
+        log_success "Detected $DEVICES_FOUND SMART-capable devices"
739
+    else
740
+        log_warning "No SMART-capable devices detected"
741
+    fi
742
+}
743
+
744
+finalize_installation() {
745
+    log_info "🎯 Finalizing installation..."
746
+    
747
+    # Enable service (but don't start yet)
748
+    systemctl enable "$SERVICE_NAME"
749
+    
750
+    # Create log rotation
751
+    cat > "/etc/logrotate.d/autosmart" << EOF
752
+/var/log/autosmart/*.log {
753
+    daily
754
+    rotate 7
755
+    compress
756
+    delaycompress
757
+    missingok
758
+    notifempty
759
+    postrotate
760
+        systemctl reload-or-restart autosmart
761
+    endscript
762
+}
763
+EOF
764
+    
765
+    log_success "Installation finalized"
766
+}
767
+
768
+show_completion_message() {
769
+    log_success "✅ autoSMART installation completed successfully!"
770
+    log_info ""
771
+    log_info "📋 Installation Summary:"
772
+    log_info "  • Install Directory: $INSTALL_DIR"
773
+    log_info "  • Config Directory: $CONFIG_DIR"
774
+    log_info "  • Service Name: $SERVICE_NAME"
775
+    log_info "  • Node ID: $NODE_ID"
776
+    log_info ""
777
+    log_info "🚀 Next Steps:"
778
+    log_info "  1. Start the service:"
779
+    log_info "     systemctl start $SERVICE_NAME"
780
+    log_info ""
781
+    log_info "  2. Check service status:"
782
+    log_info "     systemctl status $SERVICE_NAME"
783
+    log_info ""
784
+    log_info "  3. View logs:"
785
+    log_info "     journalctl -u $SERVICE_NAME -f"
786
+    log_info ""
787
+    log_info "📖 Documentation: $INSTALL_DIR/docs/README.md"
788
+    log_info "⚙️  Configuration: $CONFIG_DIR/autosmart.conf"
789
+    log_info ""
790
+    log_info "🎉 autoSMART is ready to monitor your storage devices!"
791
+}
792
+
793
+# Main execution
794
+main() {
795
+    parse_arguments "$@"
796
+    show_header
797
+    
798
+    case "$COMMAND" in
799
+        uninstall)
800
+            handle_uninstall
801
+            ;;
802
+        install)
803
+            check_requirements
804
+            
805
+            # Handle force reinstall
806
+            if [[ "$FORCE_REINSTALL" == true ]]; then
807
+                log_info "🗑️  Force reinstall: cleaning previous installation..."
808
+                handle_uninstall 2>/dev/null || true
809
+                sleep 2
810
+            fi
811
+            
812
+            # Handle config-only mode
813
+            if [[ "$CONFIG_ONLY" == true ]]; then
814
+                log_info "⚙️  Configuration-only mode"
815
+                if [[ ! -d "$INSTALL_DIR" ]]; then
816
+                    log_error "autoSMART is not installed. Run full installation first."
817
+                    exit 1
818
+                fi
819
+                create_configuration
820
+                log_success "✅ Configuration updated successfully!"
821
+                exit 0
822
+            fi
823
+            
824
+            # Full installation
825
+            install_dependencies
826
+            create_directories
827
+            copy_files
828
+            create_configuration
829
+            create_systemd_service
830
+            test_database_connection
831
+            test_smart_detection
832
+            finalize_installation
833
+            show_completion_message
834
+            ;;
835
+        *)
836
+            log_error "Unknown command: $COMMAND"
837
+            show_usage
838
+            exit 1
839
+            ;;
840
+    esac
841
+}
842
+
843
+# Run main function
844
+main "$@"
+521 -0
projects/autoSMART/scripts/monitor-cluster.sh
@@ -0,0 +1,521 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Cluster Monitor
4
+# Version: 1.0
5
+# Description: Monitor autoSMART services across Proxmox cluster
6
+
7
+# Configuration
8
+CLUSTER_JSON="$(dirname "$0")/../cluster.json"
9
+NODES=()
10
+NODE_IPS=()
11
+if [[ -f "$CLUSTER_JSON" ]] && command -v jq &> /dev/null; then
12
+    while IFS= read -r node; do
13
+        NODES+=("$(echo "$node" | jq -r '.hostname')")
14
+        NODE_IPS+=("$(echo "$node" | jq -r '.ip')")
15
+    done < <(jq -c '.cluster.nodes[]' "$CLUSTER_JSON")
16
+fi
17
+DB_HOST="192.168.2.102"
18
+DB_USER="autosmart"
19
+DB_PASS="autoSMART2025!"
20
+DB_NAME="autosmart"
21
+
22
+# Colors for output
23
+RED='\033[0;31m'
24
+GREEN='\033[0;32m'
25
+YELLOW='\033[1;33m'
26
+BLUE='\033[0;34m'
27
+CYAN='\033[0;36m'
28
+NC='\033[0m' # No Color
29
+
30
+log_info() {
31
+    echo -e "${BLUE}[INFO]${NC} $1"
32
+}
33
+
34
+log_success() {
35
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
36
+}
37
+
38
+log_warning() {
39
+    echo -e "${YELLOW}[WARNING]${NC} $1"
40
+}
41
+
42
+log_error() {
43
+    echo -e "${RED}[ERROR]${NC} $1"
44
+}
45
+
46
+log_header() {
47
+    echo -e "${CYAN}$1${NC}"
48
+}
49
+
50
+show_usage() {
51
+    echo "autoSMART Cluster Monitor v1.0"
52
+    echo ""
53
+    echo "Usage: $0 [COMMAND] [OPTIONS]"
54
+    echo ""
55
+    echo "Commands:"
56
+    echo "  status                Show service status on all nodes"
57
+    echo "  logs [NODE]          Show recent logs (all nodes or specific node)"
58
+    echo "  start                Start services on all nodes"
59
+    echo "  stop                 Stop services on all nodes"
60
+    echo "  restart              Restart services on all nodes"
61
+    echo "  deploy               Deploy autoSMART to all nodes"
62
+    echo "  database             Show database statistics"
63
+    echo "  health               Show cluster health summary"
64
+    echo "  collect              Force immediate SMART collection on all nodes"
65
+    echo ""
66
+    echo "Options:"
67
+    echo "  --node NODE          Target specific node (name from cluster.json)"
68
+    echo "  --watch              Continuous monitoring (refresh every 10s)"
69
+    echo "  --verbose            Show detailed output"
70
+    echo ""
71
+    echo "Examples:"
72
+    echo "  $0 status                           # Show status on all nodes"
73
+    echo "  $0 status --node <node>            # Show status on node from cluster.json"
74
+    echo "  $0 logs <node>                     # Show logs from node in cluster.json"
75
+    echo "  $0 health --watch                  # Continuous health monitoring"
76
+    echo "  $0 deploy                          # Deploy to all nodes"
77
+    echo ""
78
+}
79
+
80
+check_node_connectivity() {
81
+    local node=$1
82
+    local ip=$2
83
+    
84
+    if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
85
+        return 0
86
+    else
87
+        return 1
88
+    fi
89
+}
90
+
91
+show_service_status() {
92
+    local target_node=$1
93
+    
94
+    log_header "🔍 autoSMART Service Status"
95
+    log_header "============================="
96
+    
97
+    for i in "${!NODES[@]}"; do
98
+        local node="${NODES[$i]}"
99
+        local ip="${NODE_IPS[$i]}"
100
+        
101
+        # Skip if specific node requested and this isn't it
102
+        if [[ -n "$target_node" && "$node" != "$target_node" ]]; then
103
+            continue
104
+        fi
105
+        
106
+        echo ""
107
+        log_info "Node: $node ($NODE_IP_BASE.$ip)"
108
+        echo "----------------------------------------"
109
+        
110
+        if check_node_connectivity "$node" "$ip"; then
111
+            local status_output
112
+            status_output=$(ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$NODE_IP_BASE.$ip" \
113
+                "systemctl is-active autosmart 2>/dev/null || echo 'inactive'; \
114
+                 systemctl is-enabled autosmart 2>/dev/null || echo 'disabled'; \
115
+                 uptime | awk '{print \$3, \$4}' | sed 's/,//'" 2>/dev/null)
116
+            
117
+            if [[ $? -eq 0 ]]; then
118
+                local active=$(echo "$status_output" | sed -n '1p')
119
+                local enabled=$(echo "$status_output" | sed -n '2p')
120
+                local uptime=$(echo "$status_output" | sed -n '3p')
121
+                
122
+                echo -n "  Status: "
123
+                if [[ "$active" == "active" ]]; then
124
+                    log_success "RUNNING"
125
+                else
126
+                    log_error "NOT RUNNING"
127
+                fi
128
+                
129
+                echo -n "  Enabled: "
130
+                if [[ "$enabled" == "enabled" ]]; then
131
+                    log_success "YES"
132
+                else
133
+                    log_warning "NO"
134
+                fi
135
+                
136
+                echo "  Uptime: $uptime"
137
+                
138
+                # Get recent activity
139
+                local last_log=$(ssh -o ConnectTimeout=5 "root@$NODE_IP_BASE.$ip" \
140
+                    "journalctl -u autosmart --no-pager -n 1 --output=short-iso 2>/dev/null | tail -1" 2>/dev/null)
141
+                if [[ -n "$last_log" ]]; then
142
+                    echo "  Last Activity: $(echo "$last_log" | awk '{print $1, $2}')"
143
+                fi
144
+                
145
+            else
146
+                log_error "SSH CONNECTION FAILED"
147
+            fi
148
+        else
149
+            log_error "NETWORK UNREACHABLE"
150
+        fi
151
+    done
152
+}
153
+
154
+show_logs() {
155
+    local target_node=$1
156
+    local lines=${2:-20}
157
+    
158
+    log_header "📋 Recent Logs"
159
+    log_header "==============="
160
+    
161
+    for i in "${!NODES[@]}"; do
162
+        local node="${NODES[$i]}"
163
+        local ip="${NODE_IPS[$i]}"
164
+        
165
+        # Skip if specific node requested and this isn't it
166
+        if [[ -n "$target_node" && "$node" != "$target_node" ]]; then
167
+            continue
168
+        fi
169
+        
170
+        echo ""
171
+        log_info "Node: $node ($NODE_IP_BASE.$ip)"
172
+        echo "----------------------------------------"
173
+        
174
+        if check_node_connectivity "$node" "$ip"; then
175
+            ssh -o ConnectTimeout=5 "root@$NODE_IP_BASE.$ip" \
176
+                "journalctl -u autosmart --no-pager -n $lines --output=short-iso 2>/dev/null || echo 'No logs available'" 2>/dev/null
177
+        else
178
+            log_error "Node unreachable"
179
+        fi
180
+    done
181
+}
182
+
183
+control_services() {
184
+    local action=$1
185
+    local target_node=$2
186
+    
187
+    log_header "🔧 ${action^} Services"
188
+    log_header "==================="
189
+    
190
+    for i in "${!NODES[@]}"; do
191
+        local node="${NODES[$i]}"
192
+        local ip="${NODE_IPS[$i]}"
193
+        
194
+        # Skip if specific node requested and this isn't it
195
+        if [[ -n "$target_node" && "$node" != "$target_node" ]]; then
196
+            continue
197
+        fi
198
+        
199
+        echo ""
200
+        log_info "Node: $node - ${action}ing autosmart service..."
201
+        
202
+        if check_node_connectivity "$node" "$ip"; then
203
+            if ssh -o ConnectTimeout=5 "root@$NODE_IP_BASE.$ip" "systemctl $action autosmart" 2>/dev/null; then
204
+                log_success "$node: Service ${action}ed successfully"
205
+            else
206
+                log_error "$node: Failed to $action service"
207
+            fi
208
+        else
209
+            log_error "$node: Node unreachable"
210
+        fi
211
+    done
212
+}
213
+
214
+show_database_stats() {
215
+    log_header "📊 Database Statistics"
216
+    log_header "====================="
217
+    
218
+    if command -v psql &> /dev/null; then
219
+        echo ""
220
+        log_info "Connection: $DB_HOST:5432/$DB_NAME"
221
+        echo ""
222
+        
223
+        # Test connection
224
+        if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT 1;" >/dev/null 2>&1; then
225
+            log_success "Database connection: OK"
226
+            echo ""
227
+            
228
+            # Get statistics
229
+            PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
230
+            SELECT 
231
+                'Total Drives' as metric, COUNT(DISTINCT serial_number)::text as value
232
+            FROM hdd_inventory
233
+            UNION ALL
234
+            SELECT 
235
+                'Active Nodes', COUNT(DISTINCT current_node_id)::text
236
+            FROM hdd_inventory WHERE last_seen > NOW() - INTERVAL '1 hour'
237
+            UNION ALL
238
+            SELECT 
239
+                'Total Readings', COUNT(*)::text
240
+            FROM smart_readings
241
+            UNION ALL
242
+            SELECT 
243
+                'Readings Today', COUNT(*)::text
244
+            FROM smart_readings WHERE timestamp > CURRENT_DATE
245
+            UNION ALL
246
+            SELECT 
247
+                'Latest Reading', MAX(timestamp)::text
248
+            FROM smart_readings;
249
+            " 2>/dev/null
250
+            
251
+            echo ""
252
+            log_info "Storage Efficiency:"
253
+            PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "
254
+            SELECT 
255
+                hi.serial_number,
256
+                hi.model_name,
257
+                COUNT(sr.id) as readings,
258
+                COUNT(DISTINCT sr.parameters_json) as unique_sets,
259
+                CASE 
260
+                    WHEN COUNT(DISTINCT sr.parameters_json) > 0 
261
+                    THEN ROUND((1 - COUNT(DISTINCT sr.parameters_json)::decimal / COUNT(sr.id)) * 100, 1)
262
+                    ELSE 0 
263
+                END as savings_percent
264
+            FROM hdd_inventory hi
265
+            LEFT JOIN smart_readings sr ON hi.id = sr.hdd_id
266
+            GROUP BY hi.id, hi.serial_number, hi.model_name
267
+            HAVING COUNT(sr.id) > 0
268
+            ORDER BY readings DESC;
269
+            " 2>/dev/null
270
+            
271
+        else
272
+            log_error "Database connection failed"
273
+            log_info "Please check:"
274
+            log_info "  • PostgreSQL server is running on $DB_HOST"
275
+            log_info "  • Database '$DB_NAME' exists"
276
+            log_info "  • User '$DB_USER' has proper permissions"
277
+        fi
278
+    else
279
+        log_warning "psql not installed. Cannot check database statistics."
280
+    fi
281
+}
282
+
283
+show_cluster_health() {
284
+    local watch_mode=$1
285
+    
286
+    while true; do
287
+        clear
288
+        log_header "🏥 Cluster Health Summary"
289
+        log_header "========================="
290
+        echo "Last Update: $(date)"
291
+        echo ""
292
+        
293
+        # Service status summary
294
+        local total_nodes=0
295
+        local active_nodes=0
296
+        local enabled_nodes=0
297
+        
298
+        for i in "${!NODES[@]}"; do
299
+            local node="${NODES[$i]}"
300
+            local ip="${NODE_IPS[$i]}"
301
+            
302
+            if check_node_connectivity "$node" "$ip"; then
303
+                ((total_nodes++))
304
+                
305
+                local status=$(ssh -o ConnectTimeout=5 "root@$NODE_IP_BASE.$ip" \
306
+                    "systemctl is-active autosmart 2>/dev/null" 2>/dev/null)
307
+                local enabled=$(ssh -o ConnectTimeout=5 "root@$NODE_IP_BASE.$ip" \
308
+                    "systemctl is-enabled autosmart 2>/dev/null" 2>/dev/null)
309
+                
310
+                if [[ "$status" == "active" ]]; then
311
+                    ((active_nodes++))
312
+                fi
313
+                
314
+                if [[ "$enabled" == "enabled" ]]; then
315
+                    ((enabled_nodes++))
316
+                fi
317
+            fi
318
+        done
319
+        
320
+        echo "📡 Cluster Status:"
321
+        echo "  • Total Nodes: $total_nodes/${#NODES[@]}"
322
+        echo "  • Active Services: $active_nodes/$total_nodes"
323
+        echo "  • Enabled Services: $enabled_nodes/$total_nodes"
324
+        echo ""
325
+        
326
+        # Quick database check
327
+        if command -v psql &> /dev/null; then
328
+            if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT 1;" >/dev/null 2>&1; then
329
+                local db_stats=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -t -c "
330
+                SELECT 
331
+                    COUNT(DISTINCT serial_number) || '|' ||
332
+                    COUNT(DISTINCT current_node_id) || '|' ||
333
+                    COUNT(*) || '|' ||
334
+                    MAX(timestamp)
335
+                FROM hdd_inventory hi 
336
+                LEFT JOIN smart_readings sr ON hi.id = sr.hdd_id;
337
+                " 2>/dev/null | xargs)
338
+                
339
+                IFS='|' read -r drives nodes readings latest <<< "$db_stats"
340
+                
341
+                echo "🗄️  Database Status:"
342
+                echo "  • Connection: OK"
343
+                echo "  • Drives Tracked: $drives"
344
+                echo "  • Active Nodes: $nodes"
345
+                echo "  • Total Readings: $readings"
346
+                echo "  • Latest Reading: $(echo "$latest" | cut -d'.' -f1)"
347
+            else
348
+                echo "🗄️  Database Status: ❌ CONNECTION FAILED"
349
+            fi
350
+        fi
351
+        
352
+        if [[ "$watch_mode" != "watch" ]]; then
353
+            break
354
+        fi
355
+        
356
+        echo ""
357
+        echo "Press Ctrl+C to exit watch mode..."
358
+        sleep 10
359
+    done
360
+}
361
+
362
+deploy_cluster() {
363
+    log_header "🚀 Cluster Deployment"
364
+    log_header "==================="
365
+    
366
+    local script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
367
+    local deploy_script="$script_dir/deploy-production.sh"
368
+    
369
+    if [[ -f "$deploy_script" ]]; then
370
+        log_info "Running cluster deployment script..."
371
+        bash "$deploy_script"
372
+    else
373
+        log_error "Deployment script not found: $deploy_script"
374
+        log_info "Deploying manually to each node..."
375
+        
376
+        for i in "${!NODES[@]}"; do
377
+            local node="${NODES[$i]}"
378
+            local ip="${NODE_IPS[$i]}"
379
+            
380
+            echo ""
381
+            log_info "Deploying to $node ($NODE_IP_BASE.$ip)..."
382
+            
383
+            if check_node_connectivity "$node" "$ip"; then
384
+                # Copy autoSMART files
385
+                scp -r "$(dirname "$script_dir")"/* "root@$NODE_IP_BASE.$ip:/tmp/autosmart-deploy/" 2>/dev/null
386
+                
387
+                # Run installation
388
+                ssh "root@$NODE_IP_BASE.$ip" "cd /tmp/autosmart-deploy/scripts && bash deploy.sh install --force-reinstall --node-id $node" 2>/dev/null
389
+                
390
+                if [[ $? -eq 0 ]]; then
391
+                    log_success "$node: Deployment successful"
392
+                else
393
+                    log_error "$node: Deployment failed"
394
+                fi
395
+            else
396
+                log_error "$node: Node unreachable"
397
+            fi
398
+        done
399
+    fi
400
+}
401
+
402
+force_collection() {
403
+    log_header "🔄 Force SMART Collection"
404
+    log_header "========================="
405
+    
406
+    for i in "${!NODES[@]}"; do
407
+        local node="${NODES[$i]}"
408
+        local ip="${NODE_IPS[$i]}"
409
+        
410
+        echo ""
411
+        log_info "Node: $node - Triggering SMART collection..."
412
+        
413
+        if check_node_connectivity "$node" "$ip"; then
414
+            # Send SIGHUP to daemon to trigger immediate collection
415
+            ssh "root@$NODE_IP_BASE.$ip" "pkill -HUP -f smart-collector-daemon || systemctl reload autosmart" 2>/dev/null
416
+            
417
+            if [[ $? -eq 0 ]]; then
418
+                log_success "$node: Collection triggered"
419
+            else
420
+                log_warning "$node: Signal sent, check service status"
421
+            fi
422
+        else
423
+            log_error "$node: Node unreachable"
424
+        fi
425
+    done
426
+}
427
+
428
+# Parse command line arguments
429
+COMMAND=""
430
+TARGET_NODE=""
431
+WATCH_MODE=false
432
+VERBOSE=false
433
+
434
+while [[ $# -gt 0 ]]; do
435
+    case $1 in
436
+        status|logs|start|stop|restart|deploy|database|health|collect)
437
+            COMMAND="$1"
438
+            shift
439
+            ;;
440
+        --node)
441
+            TARGET_NODE="$2"
442
+            shift 2
443
+            ;;
444
+        --watch)
445
+            WATCH_MODE=true
446
+            shift
447
+            ;;
448
+        --verbose)
449
+            VERBOSE=true
450
+            shift
451
+            ;;
452
+        --help)
453
+            show_usage
454
+            exit 0
455
+            ;;
456
+        ebony|ivory|obsidian)
457
+            # Allow node names as direct arguments for logs command
458
+            if [[ "$COMMAND" == "logs" ]]; then
459
+                TARGET_NODE="$1"
460
+            fi
461
+            shift
462
+            ;;
463
+        *)
464
+            if [[ -z "$COMMAND" ]]; then
465
+                COMMAND="$1"
466
+            else
467
+                log_error "Unknown option: $1"
468
+                show_usage
469
+                exit 1
470
+            fi
471
+            shift
472
+            ;;
473
+    esac
474
+done
475
+
476
+# Default command
477
+if [[ -z "$COMMAND" ]]; then
478
+    COMMAND="status"
479
+fi
480
+
481
+# Execute command
482
+case "$COMMAND" in
483
+    status)
484
+        if [[ "$WATCH_MODE" == true ]]; then
485
+            while true; do
486
+                clear
487
+                show_service_status "$TARGET_NODE"
488
+                echo ""
489
+                echo "Press Ctrl+C to exit watch mode..."
490
+                sleep 10
491
+            done
492
+        else
493
+            show_service_status "$TARGET_NODE"
494
+        fi
495
+        ;;
496
+    logs)
497
+        show_logs "$TARGET_NODE"
498
+        ;;
499
+    start|stop|restart)
500
+        control_services "$COMMAND" "$TARGET_NODE"
501
+        ;;
502
+    database)
503
+        show_database_stats
504
+        ;;
505
+    health)
506
+        show_cluster_health "$([[ "$WATCH_MODE" == true ]] && echo "watch")"
507
+        ;;
508
+    deploy)
509
+        deploy_cluster
510
+        ;;
511
+    collect)
512
+        force_collection
513
+        ;;
514
+    *)
515
+        log_error "Unknown command: $COMMAND"
516
+        show_usage
517
+        exit 1
518
+        ;;
519
+esac
520
+
521
+exit 0
+144 -0
projects/autoSMART/scripts/simple-smart-test.pl
@@ -0,0 +1,144 @@
1
+#!/usr/bin/perl
2
+
3
+=head1 NAME
4
+
5
+simple-smart-test.pl - Very simple SMART data test
6
+
7
+=head1 DESCRIPTION
8
+
9
+Direct SMART data collection and database storage test.
10
+
11
+=cut
12
+
13
+use strict;
14
+use warnings;
15
+use DBI;
16
+use JSON::XS;
17
+
18
+print "=== Simple SMART Test ===\n\n";
19
+
20
+# Database connection
21
+my $dsn = "DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432";
22
+my $dbh = DBI->connect($dsn, "autosmart", "autoSMART2025!", {
23
+    RaiseError => 1,
24
+    AutoCommit => 1,
25
+    PrintError => 0
26
+}) or die "Failed to connect to database: $DBI::errstr\n";
27
+
28
+print "✓ Database connected\n";
29
+
30
+# Test SMART data collection manually
31
+my @devices = glob('/dev/sd[a-z]');
32
+
33
+for my $device (@devices) {
34
+    print "\nTesting device: $device\n";
35
+    
36
+    # Get basic device info
37
+    my $smartctl_output = `smartctl -i $device 2>/dev/null`;
38
+    if ($? != 0) {
39
+        print "  ✗ SMART not available\n";
40
+        next;
41
+    }
42
+    
43
+    # Parse basic info
44
+    my ($model) = $smartctl_output =~ /Device Model:\s+(.+)/;
45
+    my ($serial) = $smartctl_output =~ /Serial Number:\s+(.+)/;
46
+    
47
+    if (!$model || !$serial) {
48
+        print "  ✗ Could not parse model/serial\n";
49
+        next;
50
+    }
51
+    
52
+    print "  Model: $model\n";
53
+    print "  Serial: $serial\n";
54
+    
55
+    # Get SMART attributes
56
+    my $smart_output = `smartctl -A $device 2>/dev/null`;
57
+    my %parameters;
58
+    
59
+    # Parse SMART attributes
60
+    for my $line (split /\n/, $smart_output) {
61
+        if ($line =~ /^\s*(\d+)\s+(\w+)\s+0x[\da-f]+\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(\d+)/) {
62
+            my ($id, $name, $current, $worst, $threshold, $raw) = ($1, $2, $3, $4, $5, $6);
63
+            $parameters{$name} = $raw;
64
+        }
65
+    }
66
+    
67
+    # Get temperature
68
+    my $temp = $parameters{'Temperature_Celsius'} || 0;
69
+    print "  Temperature: ${temp}°C\n";
70
+    print "  Parameters: " . scalar(keys %parameters) . "\n";
71
+    
72
+    # Check if HDD exists in inventory
73
+    my $sth = $dbh->prepare("SELECT id FROM hdd_inventory WHERE serial_number = ? AND model_name = ?");
74
+    $sth->execute($serial, $model);
75
+    my ($hdd_id) = $sth->fetchrow_array();
76
+    
77
+    if (!$hdd_id) {
78
+        # Create new HDD
79
+        print "  Creating new HDD entry...\n";
80
+        $sth = $dbh->prepare(q{
81
+            INSERT INTO hdd_inventory 
82
+            (serial_number, model_name, current_device_path, current_node_id, status)
83
+            VALUES (?, ?, ?, 'ebony', 'active')
84
+            RETURNING id
85
+        });
86
+        $sth->execute($serial, $model, $device);
87
+        ($hdd_id) = $sth->fetchrow_array();
88
+        print "  ✓ HDD created with ID: $hdd_id\n";
89
+    } else {
90
+        print "  ✓ HDD exists with ID: $hdd_id\n";
91
+        # Update location
92
+        $sth = $dbh->prepare("UPDATE hdd_inventory SET current_device_path = ?, last_seen = NOW() WHERE id = ?");
93
+        $sth->execute($device, $hdd_id);
94
+    }
95
+    
96
+    # Store SMART reading
97
+    print "  Storing SMART reading...\n";
98
+    my $parameters_json = encode_json(\%parameters);
99
+    
100
+    $sth = $dbh->prepare(q{
101
+        INSERT INTO smart_readings
102
+        (hdd_id, serial_number, device_path, node_id, timestamp, 
103
+         collection_ok, temperature, parameters_json, reading_type)
104
+        VALUES (?, ?, ?, 'ebony', NOW(), true, ?, ?, 'full')
105
+    });
106
+    
107
+    $sth->execute($hdd_id, $serial, $device, $temp, $parameters_json);
108
+    print "  ✓ SMART reading stored\n";
109
+}
110
+
111
+# Show results
112
+print "\n=== Database Summary ===\n";
113
+
114
+my $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_inventory");
115
+$sth->execute();
116
+my ($hdd_count) = $sth->fetchrow_array();
117
+print "HDD Inventory: $hdd_count drives\n";
118
+
119
+$sth = $dbh->prepare("SELECT COUNT(*) FROM smart_readings");
120
+$sth->execute();
121
+my ($reading_count) = $sth->fetchrow_array();
122
+print "SMART Readings: $reading_count readings\n";
123
+
124
+# Show latest readings
125
+$sth = $dbh->prepare(q{
126
+    SELECT hi.serial_number, hi.model_name, sr.timestamp, sr.temperature
127
+    FROM smart_readings sr
128
+    JOIN hdd_inventory hi ON sr.hdd_id = hi.id
129
+    ORDER BY sr.timestamp DESC
130
+    LIMIT 5
131
+});
132
+$sth->execute();
133
+
134
+print "\nLatest readings:\n";
135
+while (my $row = $sth->fetchrow_hashref()) {
136
+    printf "  %s (%s) - %s - %d°C\n",
137
+           substr($row->{serial_number}, 0, 12),
138
+           substr($row->{model_name}, 0, 20),
139
+           $row->{timestamp},
140
+           $row->{temperature} || 0;
141
+}
142
+
143
+$dbh->disconnect();
144
+print "\n=== Test Complete ===\n";
+384 -0
projects/autoSMART/scripts/smart-collector-daemon.pl
@@ -0,0 +1,384 @@
1
+#!/usr/bin/perl
2
+use strict;
3
+use warnings;
4
+use DBI;
5
+use JSON;
6
+use File::Slurp;
7
+use Getopt::Long;
8
+use POSIX qw(strftime);
9
+use Time::HiRes qw(sleep);
10
+
11
+# autoSMART Collector Daemon
12
+# Version: 1.0
13
+# Description: Automated SMART data collection daemon
14
+
15
+my $config_file;
16
+my $debug = (defined $ENV{AUTOSMART_DEBUG} && $ENV{AUTOSMART_DEBUG} eq 'true') ? 1 : 0;
17
+my $foreground = 0;
18
+
19
+GetOptions(
20
+    'config=s' => \$config_file,
21
+    'debug'    => \$debug,
22
+    'foreground' => \$foreground
23
+) or die "Usage: $0 --config <file> [--debug] [--foreground]\n";
24
+
25
+if (defined $ENV{AUTOSMART_DEBUG}) {
26
+    if ($ENV{AUTOSMART_DEBUG} eq 'true') {
27
+        $debug = 1;
28
+        log_message("AUTOSMART_DEBUG enabled via /etc/default/autonas or environment");
29
+    } else {
30
+        $debug = 0;
31
+        log_message("AUTOSMART_DEBUG disabled via /etc/default/autonas or environment");
32
+    }
33
+}
34
+
35
+die "Configuration file required\n" unless $config_file;
36
+die "Configuration file not found: $config_file\n" unless -f $config_file;
37
+
38
+# Load configuration
39
+my $config = load_config($config_file);
40
+my $node_id = $config->{node}{id} || `hostname -s`;
41
+chomp $node_id;
42
+
43
+log_message("Starting autoSMART collector daemon on node: $node_id");
44
+log_message("Configuration loaded from: $config_file");
45
+
46
+# Main collection loop
47
+my $last_full_scan = 0;
48
+my $scan_interval = $config->{node}{scan_interval} || 300;
49
+my $full_scan_interval = $config->{collection}{full_scan_interval} || 3600;
50
+
51
+while (1) {
52
+    eval {
53
+        my $current_time = time();
54
+        my $force_full = ($current_time - $last_full_scan) >= $full_scan_interval;
55
+        
56
+        if ($force_full) {
57
+            log_message("Performing full SMART scan (forced)");
58
+            $last_full_scan = $current_time;
59
+        }
60
+        
61
+        collect_smart_data($force_full);
62
+        
63
+    };
64
+    
65
+    if ($@) {
66
+        log_message("ERROR: Collection failed: $@");
67
+    }
68
+    
69
+    log_message("Sleeping for $scan_interval seconds...") if $debug;
70
+    sleep($scan_interval);
71
+}
72
+
73
+sub collect_smart_data {
74
+    my ($force_full) = @_;
75
+    
76
+    log_message("[DEBUG] Starting data collection cycle, force_full=" . ($force_full ? 'true' : 'false')) if $debug;
77
+    
78
+    # Connect to database
79
+    my $dsn = "DBI:Pg:host=$config->{database}{host};dbname=$config->{database}{database}";
80
+    log_message("[DEBUG] Connecting to database: $dsn") if $debug;
81
+    
82
+    my $dbh = DBI->connect($dsn, $config->{database}{user}, $config->{database}{password}, 
83
+                          {RaiseError => 1, AutoCommit => 1}) 
84
+        or die "Database connection failed: $DBI::errstr";
85
+    
86
+    log_message("✓ Database connected") if $debug;
87
+    
88
+    # Test database connectivity
89
+    if ($debug) {
90
+        eval {
91
+            my $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_inventory");
92
+            $sth->execute();
93
+            my ($count) = $sth->fetchrow_array();
94
+            log_message("[DEBUG] Database test: found $count HDDs in inventory");
95
+            
96
+            $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_presence WHERE is_current = TRUE");
97
+            $sth->execute();
98
+            my ($presence_count) = $sth->fetchrow_array();
99
+            log_message("[DEBUG] Database test: found $presence_count current HDD presence records");
100
+        };
101
+        if ($@) {
102
+            log_message("[DEBUG] Database test failed: $@");
103
+        }
104
+    }
105
+    
106
+    # Scan for devices
107
+    my @devices = glob('/dev/sd?');
108
+    push @devices, glob('/dev/nvme?n?');
109
+    
110
+    log_message("[DEBUG] Found " . scalar(@devices) . " potential devices: " . join(', ', @devices)) if $debug;
111
+    
112
+    foreach my $device (@devices) {
113
+        if (-b $device) {
114
+            log_message("[DEBUG] Processing block device: $device") if $debug;
115
+        } else {
116
+            log_message("[DEBUG] Skipping non-block device: $device") if $debug;
117
+            next;
118
+        }
119
+        
120
+        eval {
121
+            process_device($dbh, $device, $force_full);
122
+        };
123
+        
124
+        if ($@) {
125
+            log_message("ERROR processing device $device: $@");
126
+        }
127
+    }
128
+    
129
+    $dbh->disconnect();
130
+    log_message("Collection cycle complete") if $debug;
131
+}
132
+
133
+sub process_device {
134
+    my ($dbh, $device, $force_full) = @_;
135
+    
136
+    log_message("[DEBUG] process_device: Processing $device") if $debug;
137
+    
138
+    # Get SMART data
139
+    my $smartctl_cmd = "smartctl -A -i -H $device 2>&1";
140
+    log_message("[DEBUG] Running: $smartctl_cmd") if $debug;
141
+    my @smart_output = `$smartctl_cmd`;
142
+    my $exit_code = $? >> 8;
143
+    
144
+    if (!@smart_output) {
145
+        log_message("[DEBUG] No SMART output for $device") if $debug;
146
+        return;
147
+    }
148
+    
149
+    log_message("[DEBUG] Got " . scalar(@smart_output) . " lines of SMART output from $device (exit code: $exit_code)") if $debug;
150
+    
151
+    # Check if smartctl indicates the device doesn't support SMART
152
+    my $smart_output_text = join('', @smart_output);
153
+    if ($smart_output_text =~ /SMART support is.*Unavailable|Device does not support SMART|No such device/) {
154
+        log_message("[DEBUG] Device $device does not support SMART or is not accessible") if $debug;
155
+        return;
156
+    }
157
+    
158
+    my ($model, $serial, $temp, %smart_params);
159
+    
160
+    foreach my $line (@smart_output) {
161
+        chomp $line;
162
+        
163
+        if ($line =~ /Device Model:\s+(.+)/) {
164
+            $model = $1;
165
+            log_message("[DEBUG] Found model: $model") if $debug;
166
+        } elsif ($line =~ /Serial Number:\s+(.+)/) {
167
+            $serial = $1;
168
+            log_message("[DEBUG] Found serial: $serial") if $debug;
169
+        } elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\w+\s+\w+\s+\w+\s+(\d+)/) {
170
+            # Old format: ID ATTRIBUTE_NAME 0xXXXX DDD DDD DDD Pre-fail Always - RAW_VALUE
171
+            my ($id, $name, $raw) = ($1, $2, $3);
172
+            $name =~ s/\s+/_/g;
173
+            $smart_params{$name} = $raw;
174
+            
175
+            if ($debug && scalar(keys %smart_params) <= 5) {
176
+                log_message("[DEBUG] SMART param (old format): $name = $raw");
177
+            }
178
+            
179
+            if ($name =~ /Temperature|Temp/i) {
180
+                $temp = $raw if (!defined $temp || $raw > 0);
181
+            }
182
+        } elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)/) {
183
+            # New format: ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
184
+            my ($id, $name, $raw) = ($1, $2, $3);
185
+            $name =~ s/\s+/_/g;
186
+            $smart_params{$name} = $raw;
187
+            
188
+            if ($debug && scalar(keys %smart_params) <= 5) {
189
+                log_message("[DEBUG] SMART param (new format): $name = $raw");
190
+            }
191
+            
192
+            if ($name =~ /Temperature|Temp/i) {
193
+                $temp = $raw if (!defined $temp || $raw > 0);
194
+            }
195
+        }
196
+    }
197
+    
198
+    if (!$model || !$serial) {
199
+        log_message("[DEBUG] Missing critical data for $device - model: " . ($model || 'NULL') . ", serial: " . ($serial || 'NULL')) if $debug;
200
+        return;
201
+    }
202
+    
203
+    if (!%smart_params) {
204
+        log_message("[DEBUG] No SMART parameters found for $device") if $debug;
205
+        return;
206
+    }
207
+    
208
+    log_message("[DEBUG] Parsed device data - Model: $model, Serial: $serial, Temperature: " . ($temp || 'NULL') . ", Parameters: " . scalar(keys %smart_params)) if $debug;
209
+    
210
+    return unless ($model && $serial && %smart_params);
211
+    
212
+    log_message("Processing: $model ($serial) @ $device") if $debug;
213
+    
214
+    # Get or create HDD inventory entry
215
+    my $hdd_id = get_or_create_hdd($dbh, $serial, $model, $device);
216
+    
217
+    # Check if we should store this reading
218
+    my $params_json = encode_json(\%smart_params);
219
+    
220
+    if (!$force_full && !$config->{node}{store_unchanged}) {
221
+        # Check for recent identical reading
222
+        my $sth = $dbh->prepare("
223
+            SELECT id FROM smart_readings 
224
+            WHERE hdd_id = ? AND parameters_json = ? 
225
+            AND timestamp > NOW() - INTERVAL '1 hour'
226
+            LIMIT 1
227
+        ");
228
+        $sth->execute($hdd_id, $params_json);
229
+        
230
+        if ($sth->fetchrow_array()) {
231
+            log_message("  Skipping unchanged parameters") if $debug;
232
+            return;
233
+        }
234
+    }
235
+    
236
+    # Store SMART reading
237
+    my $reading_type = $force_full ? 'full' : 'differential';
238
+    
239
+    my $sth = $dbh->prepare("
240
+        INSERT INTO smart_readings (hdd_id, serial_number, device_path, node_id, timestamp, temperature, parameters_json, reading_type)
241
+        VALUES (?, ?, ?, ?, NOW(), ?, ?::jsonb, ?)
242
+        RETURNING id
243
+    ");
244
+    
245
+    my $reading_id = $dbh->selectrow_array($sth, undef, $hdd_id, $serial, $device, $node_id, $temp || 0, $params_json, $reading_type);
246
+    
247
+    log_message("  ✓ SMART reading stored (ID: $reading_id, temp: " . ($temp || 0) . "°C, type: $reading_type)") if $debug;
248
+}
249
+
250
+sub get_or_create_hdd {
251
+    my ($dbh, $serial, $model, $device_path) = @_;
252
+    
253
+    log_message("[DEBUG] get_or_create_hdd: serial=$serial, model=$model, device=$device_path, node=$node_id") if $debug;
254
+    
255
+    # Check if HDD exists
256
+    my $sth = $dbh->prepare("SELECT id FROM hdd_inventory WHERE serial_number = ?");
257
+    $sth->execute($serial);
258
+    my ($hdd_id) = $sth->fetchrow_array();
259
+    
260
+    log_message("[DEBUG] HDD lookup result: hdd_id=" . ($hdd_id || 'NULL') . " for serial=$serial") if $debug;
261
+    
262
+    if ($hdd_id) {
263
+        log_message("[DEBUG] Found existing HDD with id=$hdd_id, updating location and presence") if $debug;
264
+        
265
+        # Update current location in inventory
266
+        $dbh->do("UPDATE hdd_inventory SET current_device_path = ?, current_node_id = ?, last_seen = NOW() 
267
+                  WHERE id = ?", undef, $device_path, $node_id, $hdd_id);
268
+        log_message("[DEBUG] Updated hdd_inventory location for hdd_id=$hdd_id") if $debug;
269
+
270
+        # Mark all previous hdd_presence as historic for this serial
271
+        my $affected_rows = $dbh->do("UPDATE hdd_presence SET is_current = FALSE WHERE serial_number = ? AND is_current = TRUE AND node <> ?", undef, $serial, $node_id);
272
+        log_message("[DEBUG] Marked $affected_rows historic hdd_presence records for serial=$serial") if $debug;
273
+
274
+        # Check if there is already a current presence for this serial/node
275
+        my $sth2 = $dbh->prepare("SELECT id FROM hdd_presence WHERE serial_number = ? AND node = ? AND is_current = TRUE");
276
+        $sth2->execute($serial, $node_id);
277
+        my ($presence_id) = $sth2->fetchrow_array();
278
+        
279
+        if ($presence_id) {
280
+            log_message("[DEBUG] Found existing presence record id=$presence_id, updating data_end") if $debug;
281
+            # Update data_end
282
+            $dbh->do("UPDATE hdd_presence SET data_end = NOW() WHERE id = ?", undef, $presence_id);
283
+            log_message("[DEBUG] Updated data_end for presence_id=$presence_id") if $debug;
284
+        } else {
285
+            log_message("[DEBUG] No existing presence for serial=$serial node=$node_id, creating new record") if $debug;
286
+            # Create new presence record
287
+            $dbh->do("UPDATE hdd_presence SET is_current = FALSE WHERE serial_number = ? AND is_current = TRUE", undef, $serial);
288
+            $sth2 = $dbh->prepare("INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current) VALUES (?, ?, NOW(), NOW(), TRUE)");
289
+            $sth2->execute($serial, $node_id);
290
+            my $new_presence_id = $dbh->last_insert_id(undef, undef, 'hdd_presence', undef);
291
+            log_message("[DEBUG] Created new hdd_presence record with id=$new_presence_id for serial=$serial node=$node_id") if $debug;
292
+        }
293
+        return $hdd_id;
294
+    }
295
+    # Create new HDD entry
296
+    log_message("[DEBUG] Creating new HDD entry for serial=$serial model=$model") if $debug;
297
+    $sth = $dbh->prepare("
298
+        INSERT INTO hdd_inventory (serial_number, model_name, current_device_path, current_node_id, 
299
+                                   first_seen, last_seen)
300
+        VALUES (?, ?, ?, ?, NOW(), NOW())
301
+        RETURNING id
302
+    ");
303
+    my $new_id = $dbh->selectrow_array($sth, undef, $serial, $model, $device_path, $node_id);
304
+    log_message("[DEBUG] Created new HDD inventory entry with id=$new_id") if $debug;
305
+    
306
+    # Mark all previous hdd_presence as historic for this serial
307
+    my $affected_rows = $dbh->do("UPDATE hdd_presence SET is_current = FALSE WHERE serial_number = ? AND is_current = TRUE", undef, $serial);
308
+    log_message("[DEBUG] Marked $affected_rows historic hdd_presence records for new serial=$serial") if $debug;
309
+    
310
+    # Create new presence record
311
+    my $sth2 = $dbh->prepare("INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current) VALUES (?, ?, NOW(), NOW(), TRUE)");
312
+    $sth2->execute($serial, $node_id);
313
+    my $new_presence_id = $dbh->last_insert_id(undef, undef, 'hdd_presence', undef);
314
+    log_message("[DEBUG] Created new hdd_presence record with id=$new_presence_id for new serial=$serial node=$node_id") if $debug;
315
+    
316
+    return $new_id;
317
+}
318
+
319
+sub load_config {
320
+    my ($file) = @_;
321
+    
322
+    my $content = read_file($file);
323
+    my %config;
324
+    
325
+    # Simple YAML-like parser
326
+    my $current_section;
327
+    foreach my $line (split /\n/, $content) {
328
+        $line =~ s/^\s+|\s+$//g;
329
+        next if $line =~ /^#/ || $line eq '';
330
+        
331
+        if ($line =~ /^(\w+):$/) {
332
+            $current_section = $1;
333
+        } elsif ($line =~ /^\s*(\w+):\s*(.+)$/) {
334
+            $config{$current_section}{$1} = $2;
335
+        }
336
+    }
337
+    
338
+    return \%config;
339
+}
340
+
341
+sub log_message {
342
+    my ($message) = @_;
343
+    my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime);
344
+    print "[$timestamp] $message\n";
345
+}
346
+
347
+__END__
348
+
349
+=head1 NAME
350
+
351
+smart-collector-daemon.pl - autoSMART SMART Data Collection Daemon
352
+
353
+=head1 SYNOPSIS
354
+
355
+smart-collector-daemon.pl --config <config_file> [--debug] [--foreground]
356
+
357
+=head1 DESCRIPTION
358
+
359
+Automated daemon for collecting SMART data from storage devices and storing
360
+in PostgreSQL database with differential storage optimization.
361
+
362
+=head1 OPTIONS
363
+
364
+=over 4
365
+
366
+=item --config <file>
367
+
368
+Configuration file path (required)
369
+
370
+=item --debug
371
+
372
+Enable debug logging
373
+
374
+=item --foreground
375
+
376
+Run in foreground (don't daemonize)
377
+
378
+=back
379
+
380
+=head1 AUTHOR
381
+
382
+autoSMART v1.0 - Hardware-based HDD tracking system
383
+
384
+=cut
+55 -0
projects/autoSMART/scripts/test-db-connection.pl
@@ -0,0 +1,55 @@
1
+#!/usr/bin/perl
2
+
3
+use strict;
4
+use warnings;
5
+use DBI;
6
+
7
+print "=== autoSMART Database Test ===\n\n";
8
+
9
+my $dsn = "DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432";
10
+my $dbh = DBI->connect($dsn, "autosmart", "autoSMART2025!", {
11
+    RaiseError => 1,
12
+    AutoCommit => 1,
13
+    PrintError => 0
14
+}) or die "Failed to connect to database: $DBI::errstr\n";
15
+
16
+print "✓ Database connection successful\n";
17
+
18
+# Test tables exist
19
+my @tables = qw(hdd_inventory hdd_migrations smart_readings predictions smart_thresholds alert_history system_config);
20
+
21
+for my $table (@tables) {
22
+    my $sth = $dbh->prepare("SELECT COUNT(*) FROM $table");
23
+    $sth->execute();
24
+    my ($count) = $sth->fetchrow_array();
25
+    print "✓ Table $table: $count rows\n";
26
+}
27
+
28
+# Test views
29
+my @views = qw(smart_readings_reconstructed latest_smart_readings drive_health_summary);
30
+
31
+for my $view (@views) {
32
+    eval {
33
+        my $sth = $dbh->prepare("SELECT COUNT(*) FROM $view");
34
+        $sth->execute();
35
+        my ($count) = $sth->fetchrow_array();
36
+        print "✓ View $view: $count rows\n";
37
+    };
38
+    if ($@) {
39
+        print "✗ View $view: ERROR - $@\n";
40
+    }
41
+}
42
+
43
+# Test function
44
+eval {
45
+    my $sth = $dbh->prepare("SELECT should_store_smart_reading(1, '{}', 'test', NOW())");
46
+    $sth->execute();
47
+    print "✓ Function should_store_smart_reading: Available\n";
48
+};
49
+if ($@) {
50
+    print "✗ Function should_store_smart_reading: ERROR - $@\n";
51
+}
52
+
53
+$dbh->disconnect();
54
+
55
+print "\n=== Test Complete ===\n";
+79 -0
projects/autoSMART/scripts/test-debug.sh
@@ -0,0 +1,79 @@
1
+#!/bin/bash
2
+
3
+# Test script pentru debugging autoSMART collector
4
+# Acest script permite testarea manualelor cu debugging activ
5
+
6
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
7
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
8
+
9
+echo "🔍 Testing autoSMART collector debugging..."
10
+echo "Project root: $PROJECT_ROOT"
11
+
12
+# Verifică dacă există configurația de debug
13
+if [[ -f "/etc/default/autosmart" ]]; then
14
+    echo "✓ Found configuration file: /etc/default/autosmart"
15
+    echo "Current configuration:"
16
+    cat /etc/default/autosmart
17
+    echo ""
18
+else
19
+    echo "❌ Configuration file /etc/default/autosmart not found"
20
+    echo "Creating test configuration..."
21
+    sudo mkdir -p /etc/default
22
+    sudo tee /etc/default/autosmart > /dev/null << 'EOF'
23
+# AutoSMART Configuration - Test Debug Mode
24
+AUTOSMART_DEBUG="true"
25
+EOF
26
+    echo "✓ Test configuration created"
27
+fi
28
+
29
+# Verifică dacă există fișierul de configurare pentru daemon
30
+CONFIG_FILE="$PROJECT_ROOT/test-config.yaml"
31
+if [[ ! -f "$CONFIG_FILE" ]]; then
32
+    echo "Creating test configuration file: $CONFIG_FILE"
33
+    cat > "$CONFIG_FILE" << 'EOF'
34
+node:
35
+  id: test-node
36
+  scan_interval: 30
37
+  store_unchanged: false
38
+  
39
+collection:
40
+  full_scan_interval: 300
41
+  
42
+database:
43
+  host: 192.168.2.102
44
+  database: autosmart
45
+  user: autosmart
46
+  password: autosmart123
47
+EOF
48
+    echo "✓ Test config created: $CONFIG_FILE"
49
+fi
50
+
51
+echo ""
52
+echo "🔧 Available devices for testing:"
53
+ls -la /dev/sd* /dev/nvme* 2>/dev/null | head -10
54
+
55
+echo ""
56
+echo "💾 Testing database connectivity..."
57
+if command -v psql >/dev/null; then
58
+    echo "Testing connection to database..."
59
+    psql -h 192.168.2.102 -U autosmart -d autosmart -c "SELECT 'Database connection OK' as status;" 2>/dev/null || echo "❌ Database connection failed"
60
+else
61
+    echo "❌ psql not available for testing"
62
+fi
63
+
64
+echo ""
65
+echo "🚀 To run collector in debug mode:"
66
+echo "export AUTOSMART_DEBUG=true"
67
+echo "sudo -E perl $SCRIPT_DIR/smart-collector-daemon.pl --config $CONFIG_FILE --debug --foreground"
68
+
69
+echo ""
70
+echo "📊 To check hdd_presence table:"
71
+echo "psql -h 192.168.2.102 -U autosmart -d autosmart -c \"SELECT * FROM hdd_presence;\""
72
+
73
+echo ""
74
+echo "📋 To check hdd_inventory table:"
75
+echo "psql -h 192.168.2.102 -U autosmart -d autosmart -c \"SELECT id, serial_number, model_name, current_node_id, last_seen FROM hdd_inventory;\""
76
+
77
+echo ""
78
+echo "🔍 To check SMART readings:"
79
+echo "psql -h 192.168.2.102 -U autosmart -d autosmart -c \"SELECT COUNT(*) as total_readings FROM smart_readings;\""
+270 -0
projects/autoSMART/scripts/test-differential-storage.pl
@@ -0,0 +1,270 @@
1
+#!/usr/bin/perl
2
+
3
+=head1 NAME
4
+
5
+test-differential-storage.pl - Test differential SMART storage system
6
+
7
+=head1 DESCRIPTION
8
+
9
+This script tests the differential storage implementation by:
10
+1. Creating test HDD entries
11
+2. Inserting baseline SMART readings
12
+3. Inserting identical readings (should be skipped)
13
+4. Inserting readings with small changes (differential storage)
14
+5. Inserting readings with critical changes (full storage)
15
+6. Validating storage efficiency and reconstruction
16
+
17
+=cut
18
+
19
+use strict;
20
+use warnings;
21
+use FindBin qw($Bin);
22
+use lib "$Bin/../lib";
23
+
24
+use DBI;
25
+use JSON::XS;
26
+use Data::Dumper;
27
+use Time::HiRes qw(time);
28
+use Digest::SHA;
29
+
30
+# Database configuration
31
+my $config = {
32
+    db_host => $ENV{AUTOSMART_DB_HOST} || 'localhost',
33
+    db_port => $ENV{AUTOSMART_DB_PORT} || '5432',
34
+    db_name => $ENV{AUTOSMART_DB_NAME} || 'autosmart',
35
+    db_user => $ENV{AUTOSMART_DB_USER} || 'autosmart',
36
+    db_pass => $ENV{AUTOSMART_DB_PASS} || 'smartpassword',
37
+};
38
+
39
+print "=== autoSMART Differential Storage Test ===\n\n";
40
+
41
+# Connect to database
42
+my $dsn = "DBI:Pg:dbname=$config->{db_name};host=$config->{db_host};port=$config->{db_port}";
43
+my $dbh = DBI->connect($dsn, $config->{db_user}, $config->{db_pass}, {
44
+    RaiseError => 1,
45
+    AutoCommit => 1,
46
+    PrintError => 0
47
+}) or die "Failed to connect to database: $DBI::errstr\n";
48
+
49
+print "✓ Connected to database\n";
50
+
51
+# Clean up any existing test data
52
+cleanup_test_data($dbh);
53
+
54
+# Test 1: Create test HDD
55
+my $test_hdd_id = create_test_hdd($dbh);
56
+print "✓ Created test HDD (ID: $test_hdd_id)\n";
57
+
58
+# Test 2: Insert baseline reading
59
+my $baseline_reading = {
60
+    parameters => {
61
+        'Reallocated_Sector_Ct' => 0,
62
+        'Spin_Retry_Count' => 0,
63
+        'Current_Pending_Sector' => 0,
64
+        'Power_On_Hours' => 1000,
65
+        'Temperature_Celsius' => 35,
66
+        'Load_Cycle_Count' => 5000
67
+    },
68
+    temperature => 35
69
+};
70
+
71
+my $baseline_id = insert_test_reading($dbh, $test_hdd_id, $baseline_reading);
72
+print "✓ Inserted baseline reading (ID: $baseline_id)\n";
73
+
74
+# Test 3: Insert identical reading (should be skipped)
75
+sleep(1);
76
+my $identical_result = test_should_store($dbh, $test_hdd_id, $baseline_reading);
77
+print "✓ Identical reading test - Should store: " . 
78
+      ($identical_result->{should_store} ? "YES" : "NO") . 
79
+      " (Type: $identical_result->{reading_type})\n";
80
+
81
+# Test 4: Insert reading with temperature change only (differential)
82
+my $temp_change_reading = {
83
+    %$baseline_reading,
84
+    temperature => 38
85
+};
86
+$temp_change_reading->{parameters}{Temperature_Celsius} = 38;
87
+
88
+sleep(1);
89
+my $temp_result = test_should_store($dbh, $test_hdd_id, $temp_change_reading);
90
+my $temp_id = insert_test_reading($dbh, $test_hdd_id, $temp_change_reading, $temp_result);
91
+print "✓ Temperature change reading - Should store: " . 
92
+      ($temp_result->{should_store} ? "YES" : "NO") . 
93
+      " (Type: $temp_result->{reading_type}, ID: $temp_id)\n";
94
+
95
+# Test 5: Insert reading with critical parameter change (full)
96
+my $critical_reading = {
97
+    %$baseline_reading,
98
+    temperature => 40
99
+};
100
+$critical_reading->{parameters}{Reallocated_Sector_Ct} = 1;  # Critical parameter change
101
+$critical_reading->{parameters}{Temperature_Celsius} = 40;
102
+
103
+sleep(1);
104
+my $critical_result = test_should_store($dbh, $test_hdd_id, $critical_reading);
105
+my $critical_id = insert_test_reading($dbh, $test_hdd_id, $critical_reading, $critical_result);
106
+print "✓ Critical change reading - Should store: " . 
107
+      ($critical_result->{should_store} ? "YES" : "NO") . 
108
+      " (Type: $critical_result->{reading_type}, ID: $critical_id)\n";
109
+
110
+# Test 6: Validate reconstruction
111
+print "\n--- Testing Data Reconstruction ---\n";
112
+test_reconstruction($dbh, $test_hdd_id);
113
+
114
+# Test 7: Show storage statistics
115
+print "\n--- Storage Statistics ---\n";
116
+show_storage_stats($dbh, $test_hdd_id);
117
+
118
+print "\n=== Test Complete ===\n";
119
+
120
+$dbh->disconnect();
121
+
122
+sub cleanup_test_data {
123
+    my ($dbh) = @_;
124
+    
125
+    $dbh->do("DELETE FROM smart_readings WHERE serial_number = 'TEST_SERIAL_001'");
126
+    $dbh->do("DELETE FROM hdd_inventory WHERE serial_number = 'TEST_SERIAL_001'");
127
+}
128
+
129
+sub create_test_hdd {
130
+    my ($dbh) = @_;
131
+    
132
+    my $sql = q{
133
+        INSERT INTO hdd_inventory 
134
+        (serial_number, model_name, firmware, size_gb, manufacturer,
135
+         current_device_path, current_node_id, status)
136
+        VALUES ('TEST_SERIAL_001', 'TEST_MODEL_WD', '1.0', 1000, 'Western Digital',
137
+                '/dev/sdb', 'test-node', 'active')
138
+        RETURNING id
139
+    };
140
+    
141
+    my $sth = $dbh->prepare($sql);
142
+    $sth->execute();
143
+    
144
+    return $sth->fetchrow_array();
145
+}
146
+
147
+sub test_should_store {
148
+    my ($dbh, $hdd_id, $reading) = @_;
149
+    
150
+    my $parameters_json = encode_json($reading->{parameters});
151
+    my $checksum = Digest::SHA::sha256_hex($parameters_json . ($reading->{temperature} || ''));
152
+    
153
+    my $sth = $dbh->prepare(q{
154
+        SELECT should_store_smart_reading(?, ?, ?, NOW())
155
+    });
156
+    
157
+    $sth->execute($hdd_id, $parameters_json, $checksum);
158
+    
159
+    return $sth->fetchrow_hashref();
160
+}
161
+
162
+sub insert_test_reading {
163
+    my ($dbh, $hdd_id, $reading, $storage_info) = @_;
164
+    
165
+    # If no storage info provided, get it
166
+    if (!$storage_info) {
167
+        $storage_info = test_should_store($dbh, $hdd_id, $reading);
168
+        return undef unless $storage_info->{should_store};
169
+    }
170
+    
171
+    return undef unless $storage_info->{should_store};
172
+    
173
+    # For differential readings, only store changed parameters
174
+    my $parameters_to_store;
175
+    if ($storage_info->{reading_type} eq 'differential' && $storage_info->{changed_parameters}) {
176
+        my $changed_params = decode_json($storage_info->{changed_parameters});
177
+        $parameters_to_store = {};
178
+        
179
+        for my $param_name (@$changed_params) {
180
+            $parameters_to_store->{$param_name} = $reading->{parameters}{$param_name};
181
+        }
182
+    } else {
183
+        $parameters_to_store = $reading->{parameters};
184
+    }
185
+    
186
+    my $sql = q{
187
+        INSERT INTO smart_readings
188
+        (hdd_id, serial_number, device_path, node_id, timestamp, 
189
+         collection_ok, temperature, parameters_json, reading_type,
190
+         changes_detected, changed_parameters, previous_reading_id, checksum)
191
+        VALUES (?, ?, ?, ?, NOW(), ?, ?, ?, ?, ?, ?, ?, ?)
192
+        RETURNING id
193
+    };
194
+    
195
+    my $parameters_json = encode_json($parameters_to_store);
196
+    my $checksum = Digest::SHA::sha256_hex(encode_json($reading->{parameters}) . ($reading->{temperature} || ''));
197
+    
198
+    my $sth = $dbh->prepare($sql);
199
+    $sth->execute(
200
+        $hdd_id,
201
+        'TEST_SERIAL_001',
202
+        '/dev/sdb',
203
+        'test-node',
204
+        1, # collection_ok
205
+        $reading->{temperature},
206
+        $parameters_json,
207
+        $storage_info->{reading_type},
208
+        $storage_info->{changes_detected} ? 1 : 0,
209
+        $storage_info->{changed_parameters},
210
+        $storage_info->{previous_reading_id},
211
+        $checksum
212
+    );
213
+    
214
+    return $sth->fetchrow_array();
215
+}
216
+
217
+sub test_reconstruction {
218
+    my ($dbh, $hdd_id) = @_;
219
+    
220
+    my $sql = q{
221
+        SELECT id, timestamp, reading_type, chain_level, parameters_json, temperature
222
+        FROM smart_readings_reconstructed
223
+        WHERE hdd_id = ?
224
+        ORDER BY timestamp
225
+    };
226
+    
227
+    my $sth = $dbh->prepare($sql);
228
+    $sth->execute($hdd_id);
229
+    
230
+    while (my $row = $sth->fetchrow_hashref()) {
231
+        print "Reading ID: $row->{id}, Type: $row->{reading_type}, Chain: $row->{chain_level}\n";
232
+        print "  Temperature: $row->{temperature}°C\n";
233
+        
234
+        my $params = decode_json($row->{parameters_json});
235
+        for my $param (sort keys %$params) {
236
+            print "  $param: $params->{$param}\n";
237
+        }
238
+        print "\n";
239
+    }
240
+}
241
+
242
+sub show_storage_stats {
243
+    my ($dbh, $hdd_id) = @_;
244
+    
245
+    my $sql = q{
246
+        SELECT 
247
+            reading_type,
248
+            COUNT(*) as count,
249
+            AVG(length(parameters_json::text)) as avg_size
250
+        FROM smart_readings 
251
+        WHERE hdd_id = ?
252
+        GROUP BY reading_type
253
+        ORDER BY reading_type
254
+    };
255
+    
256
+    my $sth = $dbh->prepare($sql);
257
+    $sth->execute($hdd_id);
258
+    
259
+    my $total_readings = 0;
260
+    my $total_size = 0;
261
+    
262
+    while (my $row = $sth->fetchrow_hashref()) {
263
+        printf "%-12s: %d readings, avg size: %.0f bytes\n", 
264
+               $row->{reading_type}, $row->{count}, $row->{avg_size};
265
+        $total_readings += $row->{count};
266
+        $total_size += $row->{count} * $row->{avg_size};
267
+    }
268
+    
269
+    print "\nTotal: $total_readings readings, estimated size: " . int($total_size) . " bytes\n";
270
+}
+132 -0
projects/autoSMART/scripts/test-smart-collection.pl
@@ -0,0 +1,132 @@
1
+#!/usr/bin/perl
2
+
3
+=head1 NAME
4
+
5
+test-smart-collection.pl - Simple SMART data collection test
6
+
7
+=head1 DESCRIPTION
8
+
9
+Simplified SMART data collection test for autoSMART deployment verification.
10
+
11
+=cut
12
+
13
+use strict;
14
+use warnings;
15
+use FindBin qw($Bin);
16
+use lib "$Bin/../lib";
17
+
18
+use SmartCollector;
19
+use DBI;
20
+use JSON::XS;
21
+
22
+# Configuration from environment
23
+my $config = {
24
+    db_host => $ENV{AUTOSMART_DB_HOST} || '192.168.2.102',
25
+    db_port => $ENV{AUTOSMART_DB_PORT} || '5432',
26
+    db_name => $ENV{AUTOSMART_DB_NAME} || 'autosmart',
27
+    db_user => $ENV{AUTOSMART_DB_USER} || 'autosmart',
28
+    db_pass => $ENV{AUTOSMART_DB_PASS} || 'autoSMART2025!',
29
+    node_id => $ENV{AUTOSMART_NODE_ID} || 'ebony',
30
+    debug => $ENV{AUTOSMART_DEBUG} || 2,
31
+};
32
+
33
+print "=== autoSMART SMART Collection Test ===\n\n";
34
+
35
+# Test database connection
36
+print "Testing database connection...\n";
37
+my $dsn = "DBI:Pg:dbname=$config->{db_name};host=$config->{db_host};port=$config->{db_port}";
38
+my $dbh = DBI->connect($dsn, $config->{db_user}, $config->{db_pass}, {
39
+    RaiseError => 1,
40
+    AutoCommit => 1,
41
+    PrintError => 0
42
+}) or die "Failed to connect to database: $DBI::errstr\n";
43
+
44
+print "✓ Database connection successful\n\n";
45
+
46
+# Initialize collector
47
+print "Initializing SMART collector...\n";
48
+my $collector = SmartCollector->new($config);
49
+print "✓ Collector initialized\n\n";
50
+
51
+# Discover available drives
52
+print "Discovering storage devices...\n";
53
+my @devices = glob('/dev/sd[a-z]');
54
+
55
+for my $device (@devices) {
56
+    print "Found device: $device\n";
57
+    
58
+    # Test SMART data collection
59
+    print "  Collecting SMART data...\n";
60
+    my $smart_data = $collector->collect_smart_data($device);
61
+    
62
+    if ($smart_data) {
63
+        print "  ✓ SMART data collected successfully\n";
64
+        print "    Serial: $smart_data->{serial_number}\n";
65
+        print "    Model: $smart_data->{model_name}\n";
66
+        print "    Temperature: $smart_data->{temperature}°C\n";
67
+        print "    Parameters: " . scalar(keys %{$smart_data->{parameters}}) . "\n";
68
+        
69
+        # Create drive info structure
70
+        my $drive_info = {
71
+            device_path => $device,
72
+            serial_number => $smart_data->{serial_number},
73
+            model_name => $smart_data->{model_name}
74
+        };
75
+        
76
+        # Store in database
77
+        print "  Storing in database...\n";
78
+        if ($collector->store_smart_data($drive_info, $smart_data)) {
79
+            print "  ✓ Data stored successfully\n";
80
+        } else {
81
+            print "  ✗ Failed to store data\n";
82
+        }
83
+        
84
+    } else {
85
+        print "  ✗ Failed to collect SMART data\n";
86
+    }
87
+    print "\n";
88
+}
89
+
90
+# Check database contents
91
+print "Checking database contents:\n";
92
+
93
+my $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_inventory");
94
+$sth->execute();
95
+my ($hdd_count) = $sth->fetchrow_array();
96
+print "  HDD Inventory: $hdd_count drives\n";
97
+
98
+$sth = $dbh->prepare("SELECT COUNT(*) FROM smart_readings");
99
+$sth->execute();
100
+my ($reading_count) = $sth->fetchrow_array();
101
+print "  SMART Readings: $reading_count readings\n";
102
+
103
+$sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_migrations");
104
+$sth->execute();
105
+my ($migration_count) = $sth->fetchrow_array();
106
+print "  HDD Migrations: $migration_count migrations\n";
107
+
108
+# Show recent readings
109
+if ($reading_count > 0) {
110
+    print "\nRecent SMART readings:\n";
111
+    $sth = $dbh->prepare(q{
112
+        SELECT hi.serial_number, hi.model_name, sr.timestamp, sr.temperature, sr.reading_type
113
+        FROM smart_readings sr
114
+        JOIN hdd_inventory hi ON sr.hdd_id = hi.id
115
+        ORDER BY sr.timestamp DESC
116
+        LIMIT 5
117
+    });
118
+    $sth->execute();
119
+    
120
+    while (my $row = $sth->fetchrow_hashref()) {
121
+        printf "  %s (%s) - %s - %d°C - %s\n",
122
+               $row->{serial_number},
123
+               $row->{model_name},
124
+               $row->{timestamp},
125
+               $row->{temperature} || 0,
126
+               $row->{reading_type};
127
+    }
128
+}
129
+
130
+$dbh->disconnect();
131
+
132
+print "\n=== Test Complete ===\n";
+187 -0
projects/autoSMART/scripts/uninstall.sh
@@ -0,0 +1,187 @@
1
+#!/bin/bash
2
+
3
+# autoSMART Uninstaller
4
+# Version: 1.0
5
+# Description: Complete removal of autoSMART system to prevent orphaned files
6
+
7
+set -e
8
+
9
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
10
+INSTALL_DIR="/opt/autoSMART"
11
+CONFIG_DIR="/etc/autosmart"
12
+PVE_CONFIG_DIR="/etc/pve/autoSMART"
13
+SERVICE_NAME="autosmart"
14
+LOG_DIR="/var/log/autosmart"
15
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
16
+
17
+# Colors for output
18
+RED='\033[0;31m'
19
+GREEN='\033[0;32m'
20
+YELLOW='\033[1;33m'
21
+BLUE='\033[0;34m'
22
+NC='\033[0m' # No Color
23
+
24
+log_info() {
25
+    echo -e "${BLUE}[INFO]${NC} $1"
26
+}
27
+
28
+log_success() {
29
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
30
+}
31
+
32
+log_warning() {
33
+    echo -e "${YELLOW}[WARNING]${NC} $1"
34
+}
35
+
36
+log_error() {
37
+    echo -e "${RED}[ERROR]${NC} $1"
38
+}
39
+
40
+log_info "🗑️  autoSMART Uninstaller v1.0"
41
+log_info "==============================="
42
+
43
+# Check if running as root
44
+if [[ $EUID -ne 0 ]]; then
45
+   log_error "This script must be run as root (use sudo)"
46
+   exit 1
47
+fi
48
+
49
+# Stop and disable systemd service
50
+if systemctl is-active --quiet "$SERVICE_NAME" 2>/dev/null; then
51
+    log_info "Stopping autoSMART service..."
52
+    systemctl stop "$SERVICE_NAME"
53
+    log_success "Service stopped"
54
+else
55
+    log_info "Service is not running"
56
+fi
57
+
58
+if systemctl is-enabled --quiet "$SERVICE_NAME" 2>/dev/null; then
59
+    log_info "Disabling autoSMART service..."
60
+    systemctl disable "$SERVICE_NAME"
61
+    log_success "Service disabled"
62
+fi
63
+
64
+# Remove systemd service file
65
+if [[ -f "$SYSTEMD_SERVICE" ]]; then
66
+    log_info "Removing systemd service file..."
67
+    rm -f "$SYSTEMD_SERVICE"
68
+    systemctl daemon-reload
69
+    log_success "Service file removed"
70
+fi
71
+
72
+# Remove installation directory
73
+if [[ -d "$INSTALL_DIR" ]]; then
74
+    log_info "Removing installation directory: $INSTALL_DIR"
75
+    rm -rf "$INSTALL_DIR"
76
+    log_success "Installation directory removed"
77
+else
78
+    log_info "Installation directory does not exist"
79
+fi
80
+
81
+# Remove configuration directory
82
+if [[ -d "$CONFIG_DIR" ]]; then
83
+    log_info "Removing configuration directory: $CONFIG_DIR"
84
+    rm -rf "$CONFIG_DIR"
85
+    log_success "Configuration directory removed"
86
+else
87
+    log_info "Configuration directory does not exist"
88
+fi
89
+
90
+# Remove PVE configuration directory (if exists)
91
+if [[ -d "$PVE_CONFIG_DIR" ]]; then
92
+    log_info "Removing PVE configuration directory: $PVE_CONFIG_DIR"
93
+    rm -rf "$PVE_CONFIG_DIR"
94
+    log_success "PVE configuration directory removed"
95
+fi
96
+
97
+# Remove log directory
98
+if [[ -d "$LOG_DIR" ]]; then
99
+    log_info "Removing log directory: $LOG_DIR"
100
+    rm -rf "$LOG_DIR"
101
+    log_success "Log directory removed"
102
+fi
103
+
104
+# Remove cron jobs (if any)
105
+if crontab -l 2>/dev/null | grep -q "autosmart"; then
106
+    log_info "Removing autoSMART cron jobs..."
107
+    (crontab -l 2>/dev/null | grep -v "autosmart") | crontab -
108
+    log_success "Cron jobs removed"
109
+fi
110
+
111
+# Remove temporary files
112
+TEMP_FILES=(
113
+    "/tmp/autosmart*"
114
+    "/tmp/smart-*"
115
+    "/var/tmp/autosmart*"
116
+)
117
+
118
+for pattern in "${TEMP_FILES[@]}"; do
119
+    if ls $pattern 1> /dev/null 2>&1; then
120
+        log_info "Removing temporary files: $pattern"
121
+        rm -rf $pattern
122
+    fi
123
+done
124
+
125
+# Remove user and group (if created specifically for autoSMART)
126
+if id "autosmart" &>/dev/null; then
127
+    log_warning "Found autosmart user - leaving intact (may be used by database)"
128
+    log_info "To remove user manually: userdel autosmart"
129
+fi
130
+
131
+# Clean up any remaining processes
132
+PROCESSES=$(pgrep -f "autosmart|smart-collector" || true)
133
+if [[ -n "$PROCESSES" ]]; then
134
+    log_warning "Found running autoSMART processes: $PROCESSES"
135
+    log_info "Terminating processes..."
136
+    pkill -f "autosmart|smart-collector" || true
137
+    sleep 2
138
+    pkill -9 -f "autosmart|smart-collector" || true
139
+    log_success "Processes terminated"
140
+fi
141
+
142
+# Remove from PATH modifications (if any)
143
+PROFILE_FILES=(
144
+    "/etc/profile.d/autosmart.sh"
145
+    "/etc/bash.bashrc.d/autosmart.sh"
146
+)
147
+
148
+for file in "${PROFILE_FILES[@]}"; do
149
+    if [[ -f "$file" ]]; then
150
+        log_info "Removing PATH modification: $file"
151
+        rm -f "$file"
152
+    fi
153
+done
154
+
155
+# Clean package manager cache related to autoSMART dependencies
156
+log_info "Cleaning package cache..."
157
+if command -v apt-get &> /dev/null; then
158
+    apt-get clean >/dev/null 2>&1 || true
159
+elif command -v yum &> /dev/null; then
160
+    yum clean all >/dev/null 2>&1 || true
161
+fi
162
+
163
+# Final verification
164
+REMAINING_FILES=$(find /etc /opt /var -name "*autosmart*" -o -name "*autoSMART*" 2>/dev/null | head -10)
165
+if [[ -n "$REMAINING_FILES" ]]; then
166
+    log_warning "Some autoSMART files may still exist:"
167
+    echo "$REMAINING_FILES"
168
+    log_info "These may be database files or manually created configurations"
169
+fi
170
+
171
+log_success "✅ autoSMART uninstallation complete!"
172
+log_info ""
173
+log_info "📋 Summary of removed components:"
174
+log_info "  • Systemd service: $SERVICE_NAME"
175
+log_info "  • Installation directory: $INSTALL_DIR"
176
+log_info "  • Configuration directory: $CONFIG_DIR"
177
+log_info "  • Log directory: $LOG_DIR"
178
+log_info "  • Temporary files and processes"
179
+log_info ""
180
+log_info "💡 Notes:"
181
+log_info "  • Database data is preserved (not removed)"
182
+log_info "  • System packages (Perl, PostgreSQL client) are preserved"
183
+log_info "  • User 'autosmart' is preserved if it exists"
184
+log_info ""
185
+log_info "🔄 System is now clean and ready for fresh installation"
186
+
187
+exit 0
+389 -0
projects/autoSMART/sql/schema-fixed.sql
@@ -0,0 +1,389 @@
1
+-- autoSMART Database Schema - Fixed for PostgreSQL 15
2
+-- This version removes problematic syntax and creates a working schema
3
+
4
+-- Drop existing tables if they exist
5
+DROP TABLE IF EXISTS smart_readings CASCADE;
6
+DROP TABLE IF EXISTS predictions CASCADE;
7
+DROP TABLE IF EXISTS alert_history CASCADE;
8
+DROP TABLE IF EXISTS hdd_presence CASCADE;
9
+DROP TABLE IF EXISTS hdd_inventory CASCADE;
10
+
11
+-- Create required extensions
12
+CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
13
+CREATE EXTENSION IF NOT EXISTS "btree_gin";
14
+
15
+-- HDD Inventory Table (Hardware-based tracking)
16
+CREATE TABLE hdd_inventory (
17
+    id                  SERIAL PRIMARY KEY,
18
+    serial_number       VARCHAR(100) NOT NULL,
19
+    model_name          VARCHAR(200) NOT NULL,
20
+    firmware            VARCHAR(50),
21
+    size_gb             INTEGER,
22
+    manufacturer        VARCHAR(100),
23
+    current_device_path VARCHAR(50),
24
+    current_node_id     VARCHAR(50),
25
+    current_slot        VARCHAR(20),
26
+    madagascar_id       VARCHAR(100),
27
+    first_seen          TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
28
+    last_seen           TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
29
+    status              VARCHAR(20) DEFAULT 'active',
30
+    status_changed_at   TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
31
+    notes               TEXT,
32
+    created_at          TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
33
+    updated_at          TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
34
+    
35
+    -- Hardware identification constraint
36
+    CONSTRAINT unique_hardware_id UNIQUE (serial_number, model_name)
37
+);
38
+
39
+-- Create index for device path (but allow NULLs and duplicates)
40
+CREATE INDEX idx_hdd_inventory_device_path ON hdd_inventory(current_device_path) WHERE current_device_path IS NOT NULL;
41
+CREATE INDEX idx_hdd_inventory_node ON hdd_inventory(current_node_id);
42
+CREATE INDEX idx_hdd_inventory_status ON hdd_inventory(status);
43
+CREATE INDEX idx_hdd_inventory_last_seen ON hdd_inventory(last_seen);
44
+
45
+-- HDD Presence Table (tracks HDD mobility across nodes)
46
+CREATE TABLE hdd_presence (
47
+    id SERIAL PRIMARY KEY,
48
+    serial_number VARCHAR(64) NOT NULL,
49
+    node VARCHAR(64) NOT NULL,
50
+    data_start TIMESTAMP NOT NULL,
51
+    data_end TIMESTAMP NOT NULL,
52
+    is_current BOOLEAN NOT NULL DEFAULT TRUE
53
+);
54
+
55
+CREATE INDEX idx_hdd_presence_serial_current ON hdd_presence(serial_number, is_current);
56
+CREATE INDEX idx_hdd_presence_node ON hdd_presence(node);
57
+CREATE INDEX idx_hdd_presence_data_end ON hdd_presence(data_end DESC);
58
+
59
+-- SMART Readings Table (with differential storage)
60
+CREATE TABLE smart_readings (
61
+    id                   BIGSERIAL PRIMARY KEY,
62
+    hdd_id               INTEGER REFERENCES hdd_inventory(id),
63
+    serial_number        VARCHAR(100) NOT NULL,
64
+    device_path          VARCHAR(50),
65
+    node_id              VARCHAR(50),
66
+    timestamp            TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
67
+    collection_ok        BOOLEAN DEFAULT true,
68
+    temperature          INTEGER,
69
+    parameters_json      JSONB,
70
+    reading_type         VARCHAR(20) DEFAULT 'full',
71
+    changes_detected     BOOLEAN DEFAULT true,
72
+    changed_parameters   JSONB,
73
+    previous_reading_id  INTEGER REFERENCES smart_readings(id),
74
+    checksum             VARCHAR(64)
75
+);
76
+
77
+CREATE INDEX idx_smart_readings_hdd_id ON smart_readings(hdd_id);
78
+CREATE INDEX idx_smart_readings_timestamp ON smart_readings(timestamp DESC);
79
+CREATE INDEX idx_smart_readings_serial ON smart_readings(serial_number);
80
+CREATE INDEX idx_smart_readings_device_path ON smart_readings(device_path);
81
+CREATE INDEX idx_smart_readings_type ON smart_readings(reading_type);
82
+CREATE INDEX idx_smart_readings_checksum ON smart_readings(checksum);
83
+CREATE INDEX idx_smart_readings_previous ON smart_readings(previous_reading_id);
84
+
85
+-- GIN index for JSONB parameters
86
+CREATE INDEX idx_smart_readings_parameters ON smart_readings USING GIN (parameters_json);
87
+CREATE INDEX idx_smart_readings_changed_params ON smart_readings USING GIN (changed_parameters);
88
+
89
+-- Predictions Table  
90
+CREATE TABLE predictions (
91
+    id                    SERIAL PRIMARY KEY,
92
+    hdd_id                INTEGER REFERENCES hdd_inventory(id),
93
+    serial_number         VARCHAR(100) NOT NULL,
94
+    device_path           VARCHAR(50),
95
+    timestamp             TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
96
+    risk_level            VARCHAR(20),
97
+    failure_probability   DECIMAL(5,4),
98
+    predicted_failure_date DATE,
99
+    confidence_score      DECIMAL(5,4),
100
+    analysis_summary      TEXT,
101
+    recommendations       JSONB,
102
+    openai_response       JSONB,
103
+    created_at            TIMESTAMP WITH TIME ZONE DEFAULT NOW()
104
+);
105
+
106
+CREATE INDEX idx_predictions_hdd_id ON predictions(hdd_id);
107
+CREATE INDEX idx_predictions_timestamp ON predictions(timestamp DESC);
108
+CREATE INDEX idx_predictions_risk_level ON predictions(risk_level);
109
+CREATE INDEX idx_predictions_serial ON predictions(serial_number);
110
+
111
+-- Alert History Table
112
+CREATE TABLE alert_history (
113
+    id              SERIAL PRIMARY KEY,
114
+    hdd_id          INTEGER REFERENCES hdd_inventory(id),
115
+    serial_number   VARCHAR(100) NOT NULL,
116
+    alert_type      VARCHAR(50),
117
+    severity        VARCHAR(20),
118
+    message         TEXT,
119
+    sent_at         TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
120
+    sent_to         TEXT,
121
+    delivery_status VARCHAR(20) DEFAULT 'pending',
122
+    related_reading_id BIGINT REFERENCES smart_readings(id),
123
+    related_prediction_id INTEGER REFERENCES predictions(id)
124
+);
125
+
126
+CREATE INDEX idx_alert_history_hdd_id ON alert_history(hdd_id);
127
+CREATE INDEX idx_alert_history_sent_at ON alert_history(sent_at DESC);
128
+CREATE INDEX idx_alert_history_severity ON alert_history(severity);
129
+CREATE INDEX idx_alert_history_serial ON alert_history(serial_number);
130
+
131
+-- System Configuration Table
132
+CREATE TABLE IF NOT EXISTS system_config (
133
+    id          SERIAL PRIMARY KEY,
134
+    config_key  VARCHAR(100) UNIQUE NOT NULL,
135
+    value       TEXT,
136
+    description TEXT,
137
+    created_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
138
+    updated_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW()
139
+);
140
+
141
+-- Insert default configuration
142
+INSERT INTO system_config (config_key, value, description) VALUES
143
+('collection_interval_seconds', '1800', 'SMART data collection interval in seconds')
144
+ON CONFLICT (config_key) DO NOTHING;
145
+
146
+INSERT INTO system_config (config_key, value, description) VALUES
147
+('differential_storage_enabled', 'true', 'Enable differential storage optimization'),
148
+('forced_storage_interval_hours', '24', 'Hours between forced full readings'),
149
+('critical_parameter_force_store', 'true', 'Force storage for critical parameter changes'),
150
+('temperature_change_threshold', '5', 'Temperature change threshold for storage (Celsius)')
151
+ON CONFLICT (config_key) DO NOTHING;
152
+
153
+-- SMART Thresholds Table
154
+CREATE TABLE IF NOT EXISTS smart_thresholds (
155
+    id                SERIAL PRIMARY KEY,
156
+    parameter_name    VARCHAR(100) NOT NULL,
157
+    warning_threshold NUMERIC,
158
+    critical_threshold NUMERIC,
159
+    weight            NUMERIC DEFAULT 1.0,
160
+    enabled           BOOLEAN DEFAULT true,
161
+    description       TEXT,
162
+    created_at        TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
163
+    updated_at        TIMESTAMP WITH TIME ZONE DEFAULT NOW()
164
+);
165
+
166
+-- Insert default SMART thresholds
167
+INSERT INTO smart_thresholds (parameter_name, warning_threshold, critical_threshold, weight, description) VALUES
168
+('Reallocated_Sector_Ct', 1, 5, 10.0, 'Reallocated sector count - critical for drive health'),
169
+('Spin_Retry_Count', 1, 10, 8.0, 'Spindle motor retry attempts'),
170
+('Reallocated_Event_Count', 1, 10, 9.0, 'Number of reallocation events'),
171
+('Current_Pending_Sector', 1, 5, 9.5, 'Sectors waiting to be reallocated'),
172
+('Offline_Uncorrectable', 1, 1, 10.0, 'Uncorrectable sectors found during offline scan'),
173
+('UDMA_CRC_Error_Count', 10, 50, 5.0, 'Communication errors between drive and controller'),
174
+('Raw_Read_Error_Rate', 100000, 1000000, 3.0, 'Raw read error rate (varies by manufacturer)'),
175
+('Seek_Error_Rate', 100000, 1000000, 4.0, 'Seek error rate'),
176
+('Power_On_Hours', 35000, 50000, 2.0, 'Total power-on time in hours'),
177
+('Load_Cycle_Count', 100000, 300000, 2.0, 'Number of head load/unload cycles'),
178
+('Temperature_Celsius', 50, 60, 3.0, 'Drive operating temperature'),
179
+('Start_Stop_Count', 10000, 50000, 1.0, 'Drive start/stop cycles'),
180
+('Power_Cycle_Count', 10000, 20000, 1.0, 'Number of power cycles')
181
+ON CONFLICT (parameter_name) DO NOTHING;
182
+
183
+-- Create view for reconstructed SMART data (handles differential storage)
184
+CREATE VIEW smart_readings_reconstructed AS
185
+WITH RECURSIVE reading_chain AS (
186
+    -- Base case: get baseline readings
187
+    SELECT 
188
+        id, hdd_id, serial_number, timestamp, 
189
+        parameters_json, temperature, reading_type,
190
+        previous_reading_id, 1 as chain_level
191
+    FROM smart_readings 
192
+    WHERE reading_type IN ('baseline', 'full')
193
+    
194
+    UNION ALL
195
+    
196
+    -- Recursive case: follow the chain of differential readings
197
+    SELECT 
198
+        sr.id, sr.hdd_id, sr.serial_number, sr.timestamp,
199
+        -- Merge parameters from previous reading with current changes
200
+        COALESCE(rc.parameters_json, '{}'::jsonb) || sr.parameters_json as parameters_json,
201
+        COALESCE(sr.temperature, rc.temperature) as temperature,
202
+        sr.reading_type,
203
+        sr.previous_reading_id,
204
+        rc.chain_level + 1
205
+    FROM smart_readings sr
206
+    JOIN reading_chain rc ON sr.previous_reading_id = rc.id
207
+    WHERE sr.reading_type = 'differential'
208
+)
209
+SELECT 
210
+    id, hdd_id, serial_number, timestamp,
211
+    parameters_json, temperature, reading_type,
212
+    chain_level
213
+FROM reading_chain;
214
+
215
+-- Latest SMART readings for all drives (using reconstructed differential data)
216
+CREATE VIEW latest_smart_readings AS
217
+SELECT DISTINCT ON (sr.hdd_id)
218
+    sr.id,
219
+    sr.hdd_id,
220
+    sr.serial_number,
221
+    sr.timestamp,
222
+    sr.parameters_json,
223
+    sr.temperature,
224
+    hi.model_name,
225
+    hi.manufacturer,
226
+    hi.size_gb,
227
+    hi.current_device_path,
228
+    hi.current_node_id
229
+FROM smart_readings_reconstructed sr
230
+JOIN hdd_inventory hi ON sr.hdd_id = hi.id
231
+ORDER BY sr.hdd_id, sr.timestamp DESC;
232
+
233
+-- Drive health summary view
234
+CREATE VIEW drive_health_summary AS
235
+SELECT 
236
+    hi.id as hdd_id,
237
+    hi.serial_number,
238
+    hi.model_name,
239
+    hi.manufacturer,
240
+    hi.current_device_path,
241
+    hi.current_node_id,
242
+    hi.status,
243
+    lsr.timestamp as last_reading,
244
+    lsr.temperature,
245
+    p.risk_level,
246
+    p.failure_probability,
247
+    p.predicted_failure_date,
248
+    EXTRACT(EPOCH FROM (NOW() - lsr.timestamp))/3600 as hours_since_last_reading
249
+FROM hdd_inventory hi
250
+LEFT JOIN latest_smart_readings lsr ON hi.id = lsr.hdd_id
251
+LEFT JOIN LATERAL (
252
+    SELECT risk_level, failure_probability, predicted_failure_date
253
+    FROM predictions 
254
+    WHERE hdd_id = hi.id 
255
+    ORDER BY timestamp DESC 
256
+    LIMIT 1
257
+) p ON true
258
+WHERE hi.status = 'active';
259
+
260
+-- Function to check if SMART reading should be stored (simplified version)
261
+CREATE OR REPLACE FUNCTION should_store_smart_reading(
262
+    p_hdd_id INTEGER,
263
+    p_parameters_json JSONB,
264
+    p_checksum VARCHAR(64),
265
+    p_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW()
266
+) RETURNS TABLE(
267
+    should_store BOOLEAN,
268
+    reading_type VARCHAR(20),
269
+    changes_detected BOOLEAN,
270
+    changed_parameters JSONB,
271
+    previous_reading_id INTEGER
272
+) AS $$
273
+DECLARE
274
+    v_last_reading RECORD;
275
+    v_config_enabled BOOLEAN := true;
276
+    v_force_interval_hours INTEGER := 24;
277
+    v_temp_threshold INTEGER := 5;
278
+BEGIN
279
+    -- Get configuration
280
+    SELECT (value::boolean) INTO v_config_enabled 
281
+    FROM system_config WHERE config_key = 'differential_storage_enabled';
282
+    
283
+    SELECT (value::integer) INTO v_force_interval_hours 
284
+    FROM system_config WHERE config_key = 'forced_storage_interval_hours';
285
+    
286
+    SELECT (value::integer) INTO v_temp_threshold 
287
+    FROM system_config WHERE config_key = 'temperature_change_threshold';
288
+    
289
+    -- If differential storage is disabled, always store as full
290
+    IF v_config_enabled IS FALSE OR v_config_enabled IS NULL THEN
291
+        RETURN QUERY SELECT true, 'full'::varchar(20), true, NULL::jsonb, NULL::integer;
292
+        RETURN;
293
+    END IF;
294
+    
295
+    -- Get the last reading for this HDD
296
+    SELECT id, checksum, timestamp, parameters_json, temperature
297
+    INTO v_last_reading
298
+    FROM smart_readings 
299
+    WHERE hdd_id = p_hdd_id 
300
+    ORDER BY timestamp DESC 
301
+    LIMIT 1;
302
+    
303
+    -- If no previous reading, store as baseline
304
+    IF v_last_reading IS NULL THEN
305
+        RETURN QUERY SELECT true, 'baseline'::varchar(20), true, NULL::jsonb, NULL::integer;
306
+        RETURN;
307
+    END IF;
308
+    
309
+    -- If checksum matches, no changes detected
310
+    IF v_last_reading.checksum = p_checksum THEN
311
+        RETURN QUERY SELECT false, 'skipped'::varchar(20), false, NULL::jsonb, v_last_reading.id;
312
+        RETURN;
313
+    END IF;
314
+    
315
+    -- If forced interval exceeded, store as full
316
+    IF p_timestamp > v_last_reading.timestamp + (v_force_interval_hours || ' hours')::interval THEN
317
+        RETURN QUERY SELECT true, 'full'::varchar(20), true, NULL::jsonb, v_last_reading.id;
318
+        RETURN;
319
+    END IF;
320
+    
321
+    -- Otherwise, store as differential
322
+    RETURN QUERY SELECT true, 'differential'::varchar(20), true, '[]'::jsonb, v_last_reading.id;
323
+    RETURN;
324
+END;
325
+$$ LANGUAGE plpgsql;
326
+
327
+-- Function to update HDD presence tracking
328
+CREATE OR REPLACE FUNCTION update_hdd_presence(
329
+    p_serial_number VARCHAR(64),
330
+    p_node VARCHAR(64)
331
+) RETURNS VOID AS $$
332
+BEGIN
333
+    -- Mark all previous presence records for this serial as historic
334
+    UPDATE hdd_presence 
335
+    SET is_current = FALSE 
336
+    WHERE serial_number = p_serial_number AND is_current = TRUE AND node <> p_node;
337
+    
338
+    -- Check if there's already a current presence for this serial/node
339
+    IF EXISTS (SELECT 1 FROM hdd_presence WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE) THEN
340
+        -- Update data_end for existing current presence
341
+        UPDATE hdd_presence 
342
+        SET data_end = NOW() 
343
+        WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE;
344
+    ELSE
345
+        -- Create new presence record
346
+        INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current)
347
+        VALUES (p_serial_number, p_node, NOW(), NOW(), TRUE);
348
+    END IF;
349
+END;
350
+$$ LANGUAGE plpgsql;
351
+
352
+-- Function to update timestamps
353
+CREATE OR REPLACE FUNCTION update_timestamp() RETURNS TRIGGER AS $$
354
+BEGIN
355
+    NEW.updated_at = NOW();
356
+    RETURN NEW;
357
+END;
358
+$$ LANGUAGE plpgsql;
359
+
360
+-- Create triggers for timestamp updates
361
+CREATE TRIGGER update_hdd_inventory_timestamp
362
+    BEFORE UPDATE ON hdd_inventory
363
+    FOR EACH ROW EXECUTE FUNCTION update_timestamp();
364
+
365
+CREATE TRIGGER update_smart_thresholds_timestamp
366
+    BEFORE UPDATE ON smart_thresholds
367
+    FOR EACH ROW EXECUTE FUNCTION update_timestamp();
368
+
369
+CREATE TRIGGER update_system_config_timestamp
370
+    BEFORE UPDATE ON system_config
371
+    FOR EACH ROW EXECUTE FUNCTION update_timestamp();
372
+
373
+-- Grant permissions to autosmart user
374
+GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO autosmart;
375
+GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO autosmart;
376
+GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO autosmart;
377
+
378
+-- Grant specific permissions for hdd_presence table
379
+GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE hdd_presence TO autosmart;
380
+
381
+-- Final message
382
+DO $$ 
383
+BEGIN 
384
+    RAISE NOTICE 'autoSMART database schema deployed successfully!';
385
+    RAISE NOTICE 'Tables created: hdd_inventory, hdd_presence, smart_readings, predictions, smart_thresholds, alert_history, system_config';
386
+    RAISE NOTICE 'Views created: smart_readings_reconstructed, latest_smart_readings, drive_health_summary';
387
+    RAISE NOTICE 'Functions created: update_hdd_presence(), should_store_smart_reading()';
388
+    RAISE NOTICE 'Permissions granted to autosmart user';
389
+END $$;
+0 -0
projects/autoSMART/sql/schema.sql
No changes.
+1 -0
projects/pve-backup-scheduler
@@ -0,0 +1 @@
1
+Subproject commit f6606003657b44138d3c80b234e65074146345a9
BIN
projects/pve-guests-state/.DS_Store
Binary file not shown.
+102 -0
projects/pve-guests-state/CHANGELOG.md
@@ -0,0 +1,102 @@
1
+# PGS - Changelog
2
+
3
+## [1.5] - 2026-03-07
4
+
5
+### Added
6
+- Added `pgs cleanup` to scan image storages for orphan/stale `vm-*-state-suspend-YYYY-MM-DD.raw` volumes and remove them safely
7
+
8
+### Fixed
9
+- Stopped VMs are no longer classified as "already suspended to disk" from config flags alone; `bin/pgs` now requires `lock: suspended`, `vmstate:`, and a resolvable backing saved-state volume
10
+- Added cleanup for inconsistent suspend artifacts on stopped VMs, including stale suspend locks, stale `vmstate:` metadata, and orphaned saved-state volumes on storage
11
+- `pgs suspend` now runs suspend-artifact cleanup as a preflight, reducing same-day collisions with stale `state-suspend` volumes
12
+- Cleanup explicitly ignores `vm-*-state-cp*.raw` checkpoint files and only targets `vm-*-state-suspend-YYYY-MM-DD.raw`
13
+- Repeated `pgs suspend` runs now merge with the existing state file instead of discarding prior `to_resume` intent
14
+- State now records `vm_details.suspend_volume` and `vm_details.suspend_file_date`, and `resume` skips auto-restore when a VM's suspend artifact changed after the state was saved
15
+
16
+## [1.4] - 2026-03-06
17
+
18
+### Changed
19
+- Standardized install layout around `xdev` paths for uninstall, documentation, and runtime state
20
+- Added dedicated `scripts/install.sh` and `scripts/uninstall.sh` and reduced `setup.sh` to a local/remote wrapper
21
+- Updated `bin/pgs` to migrate legacy state from `/var/lib/pve-manager/pgs-state.json` to `/var/lib/xdev/pve-guests-state/pgs-state.json`
22
+- Promoted `bin/pgs` as the canonical executable and removed the duplicate top-level `pgs` file
23
+- Marked `systemd/` artifacts as legacy reference material instead of active install targets
24
+
25
+### Fixed
26
+- Fixed documentation to reflect the current manual workflow and the standardized host layout
27
+
28
+## [1.2] - 2026-03-05
29
+
30
+### Added
31
+- LXC container (CT) support: graceful shutdown before maintenance, auto-start after maintenance
32
+- New `ct_to_start` array in state JSON for CT restoration
33
+- `load_ct_info()` function using single `pct list` call
34
+- `shutdown_ct()` function with 120s timeout for graceful shutdown
35
+- `start_ct()` function for post-maintenance startup
36
+- TODO placeholder for critical VM/CT migration support
37
+
38
+### Changed
39
+- State file now includes `ct_to_start` array
40
+- Suspend operation processes VMs then CTs
41
+- Resume operation resumes VMs then starts CTs
42
+
43
+### Fixed
44
+- Fixed `pct list` column parsing (Status/Lock/Name column order)
45
+- Handle empty Lock column in `pct list` output
46
+
47
+## [1.1] - 2026-03-05
48
+
49
+### Fixed
50
+- Fixed `load_state()` outputting log messages to stdout, corrupting JSON parsing
51
+- Fixed empty arrays in JSON state file (was generating `[""]` instead of `[]`)
52
+- Fixed paused VMs being treated as "running" - now properly detects `paused` status
53
+
54
+### Changed
55
+- Optimized VM info loading: single `qm list` call instead of per-VM calls
56
+- Optimized suspend lock detection: read config files directly, no extra `qm` calls
57
+- Optimized status checking: only verify actual status for "running" VMs, rest trust `qm list`
58
+- Reduced scan time from ~180 seconds to ~2.5 seconds for 30+ VMs
59
+
60
+### Added
61
+- Proper systemd service setup for manual suspend before maintenance
62
+- Proper systemd service setup for manual resume after maintenance
63
+- Better handling of paused VMs: suspend to disk but don't auto-resume
64
+- Comprehensive journal logging with severity levels (INFO, WARNING, ERROR, SUCCESS)
65
+- Dry-run mode for testing without effects
66
+
67
+## [1.0] - 2026-03-05
68
+
69
+### Initial Release
70
+- Basic suspend/resume functionality
71
+- State file preservation
72
+- Manual testing scripts
73
+
74
+---
75
+
76
+## Performance Improvements
77
+
78
+| Operation | v1.0 | v1.1 | Improvement |
79
+|-----------|------|------|-------------|
80
+| Scan 30 VMs | ~180s | ~2.5s | **72x faster** |
81
+| System calls | Per-VM qm calls | Single qm list + file I/O | **Drastically reduced** |
82
+
83
+## Known Limitations
84
+
85
+- Requires passwordless SSH for cluster-wide operations
86
+- No critical VM/CT migration support yet (TODO)
87
+
88
+## Testing
89
+
90
+Tested on:
91
+- Proxmox VE 8.x with 30+ VMs and CTs
92
+- Mixed VM configurations (4GB-16GB RAM)
93
+- LXC containers with running services
94
+- Storage: local-dir, NFS mount points
95
+
96
+## Future Enhancements
97
+
98
+- [ ] Support for LXC container shutdown
99
+- [ ] Configurable exclusion list for VMs
100
+- [ ] Metrics/performance monitoring
101
+- [ ] Multi-node coordination for cluster-wide operations
102
+- [ ] Backup integration for backup snapshots before suspend
+81 -0
projects/pve-guests-state/INSTALL.md
@@ -0,0 +1,81 @@
1
+# Instalare
2
+
3
+## Cerinte
4
+
5
+- nod Proxmox VE cu acces root
6
+- `jq` disponibil pe host
7
+- acces SSH pentru instalare remote
8
+
9
+## Metoda recomandata
10
+
11
+Wrapper-ul [setup.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/setup.sh) este metoda standard de install si uninstall.
12
+
13
+### Instalare locala
14
+
15
+```bash
16
+sudo ./setup.sh --local
17
+```
18
+
19
+### Instalare remote
20
+
21
+```bash
22
+sudo ./setup.sh <node>
23
+sudo ./setup.sh --user admin <node>
24
+```
25
+
26
+## Ce instaleaza
27
+
28
+- `/usr/local/sbin/pgs`
29
+- `/usr/local/lib/xdev/pve-guests-state/uninstall.sh`
30
+- `/usr/local/sbin/xdev-pve-guests-state-uninstall`
31
+- `/usr/local/share/doc/xdev/pve-guests-state/*`
32
+- state runtime in `/var/lib/xdev/pve-guests-state/`
33
+
34
+## Verificare dupa install
35
+
36
+```bash
37
+/usr/local/sbin/pgs suspend --dry-run -v
38
+journalctl -t pgs -n 20
39
+```
40
+
41
+## Uninstall
42
+
43
+### Metoda recomandata
44
+
45
+```bash
46
+sudo ./setup.sh --local --uninstall
47
+sudo ./setup.sh --uninstall <node>
48
+```
49
+
50
+### Direct pe host
51
+
52
+```bash
53
+sudo /usr/local/lib/xdev/pve-guests-state/uninstall.sh
54
+```
55
+
56
+## Reinstall
57
+
58
+Fluxul acceptat este:
59
+
60
+```text
61
+uninstall -> install
62
+```
63
+
64
+Practic:
65
+- daca exista deja un install curent, installerul ruleaza mai intai uninstall-ul canonic
66
+- reinstall direct peste fisiere ramase dintr-o versiune veche nu este workflow-ul recomandat
67
+
68
+## State file
69
+
70
+Locatia curenta:
71
+
72
+```bash
73
+cat /var/lib/xdev/pve-guests-state/pgs-state.json
74
+```
75
+
76
+Compatibilitate:
77
+- daca exista vechiul fisier `/var/lib/pve-manager/pgs-state.json`, noua versiune il migreaza automat
78
+
79
+## Unitati systemd legacy
80
+
81
+Fisierele din [systemd](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/systemd) sunt pastrate doar ca referinta istorica. Scripturile actuale nu le instaleaza; dimpotriva, le elimina daca sunt prezente pe host.
+25 -0
projects/pve-guests-state/LICENSE
@@ -0,0 +1,25 @@
1
+BSD 2-Clause License
2
+
3
+Copyright (c) 2026, Proxmox VE Utilities
4
+All rights reserved.
5
+
6
+Redistribution and use in source and binary forms, with or without
7
+modification, are permitted provided that the following conditions are met:
8
+
9
+1. Redistributions of source code must retain the above copyright notice, this
10
+   list of conditions and the following disclaimer.
11
+
12
+2. Redistributions in binary form must reproduce the above copyright notice,
13
+   this list of conditions and the following disclaimer in the documentation
14
+   and/or other materials provided with the distribution.
15
+
16
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
17
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
18
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
19
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
20
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
21
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
22
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
23
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
24
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
25
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+101 -0
projects/pve-guests-state/README.md
@@ -0,0 +1,101 @@
1
+# PGS
2
+
3
+`pve-guests-state` este utilitarul manual pentru suspendarea si restaurarea guest-urilor Proxmox inainte si dupa lucrari de mentenanta.
4
+
5
+Modelul suportat este deliberat simplu:
6
+- `pgs suspend` se ruleaza manual inainte de mentenanta
7
+- `pgs resume` se ruleaza manual dupa revenirea stabila a clusterului
8
+- `pgs cleanup` poate fi rulat manual pentru audit sau cleanup al artefactelor stale de suspend
9
+
10
+Automatizarea prin systemd pentru shutdown si boot a fost abandonata intentionat. Contextul complet este in [docs/DECISIONS.md](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/docs/DECISIONS.md).
11
+
12
+## Capabilitati
13
+
14
+- suspend to disk pentru VM-uri QEMU care ruleaza
15
+- graceful shutdown pentru containere LXC care ruleaza
16
+- resume pentru VM-urile salvate in state
17
+- start pentru containerele salvate in state
18
+- cleanup pentru stale suspend images
19
+- cleanup pentru volume orphan `vm-*-state-suspend-YYYY-MM-DD.raw`
20
+- retry pentru anumite erori legate de quorum
21
+- dry-run pentru verificare fara efecte
22
+
23
+## Layout proiect
24
+
25
+- [bin/pgs](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/bin/pgs) - comanda principala
26
+- [scripts/install.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/scripts/install.sh) - instalare locala pe host
27
+- [scripts/uninstall.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/scripts/uninstall.sh) - uninstall canonic
28
+- [setup.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/setup.sh) - wrapper local/remote
29
+- [docs/TECHNICAL.md](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/docs/TECHNICAL.md) - detalii tehnice
30
+- [systemd/README.md](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/systemd/README.md) - statusul unitatilor legacy
31
+
32
+## Locatii instalate pe host
33
+
34
+- comanda operatorului: `/usr/local/sbin/pgs`
35
+- uninstall canonic: `/usr/local/lib/xdev/pve-guests-state/uninstall.sh`
36
+- wrapper optional pentru uninstall: `/usr/local/sbin/xdev-pve-guests-state-uninstall`
37
+- documentatie instalata: `/usr/local/share/doc/xdev/pve-guests-state`
38
+- state runtime: `/var/lib/xdev/pve-guests-state/pgs-state.json`
39
+
40
+Compatibilitate:
41
+- la primul run, daca exista vechiul state file `/var/lib/pve-manager/pgs-state.json`, acesta este migrat automat in locatia noua
42
+- installerul si uninstallerul curata si artefactele istorice `pve-reboot-manager.sh` si `pve-guest-state.sh`
43
+
44
+## Flux rapid
45
+
46
+```bash
47
+# instalare locala
48
+sudo ./setup.sh --local
49
+
50
+# test
51
+/usr/local/sbin/pgs suspend --dry-run -v
52
+/usr/local/sbin/pgs cleanup --dry-run -v
53
+
54
+# suspend inainte de mentenanta
55
+/usr/local/sbin/pgs suspend -v
56
+
57
+# resume dupa revenirea clusterului
58
+/usr/local/sbin/pgs resume -v
59
+
60
+# cleanup manual al artefactelor stale/orphan
61
+/usr/local/sbin/pgs cleanup -v
62
+```
63
+
64
+## Instalare si uninstall
65
+
66
+Instalare:
67
+
68
+```bash
69
+sudo ./setup.sh --local
70
+sudo ./setup.sh <node>
71
+```
72
+
73
+Uninstall:
74
+
75
+```bash
76
+sudo ./setup.sh --local --uninstall
77
+sudo ./setup.sh --uninstall <node>
78
+```
79
+
80
+Sau direct pe host:
81
+
82
+```bash
83
+sudo /usr/local/lib/xdev/pve-guests-state/uninstall.sh
84
+```
85
+
86
+## Observatii operationale
87
+
88
+- proiectul nu instaleaza configuratie persistenta proprie in `/etc`
89
+- proiectul nu instaleaza unitati systemd active; cele vechi sunt doar artefacte istorice si sunt eliminate la install/uninstall
90
+- dupa un `resume` complet reusit, state file-ul este sters
91
+- daca `resume` are erori, state file-ul este pastrat pentru retry
92
+- `cleanup` si preflight-ul din `suspend` ating doar fisiere `vm-*-state-suspend-YYYY-MM-DD.raw`; fisiere `vm-*-state-cp*.raw` sau alte variante raman neatinse
93
+- un nou `suspend` peste un state file existent face merge, nu reseteaza lista de guest-uri de restaurat
94
+- state file-ul retine si `suspend_volume`/`suspend_file_date` per VM pentru a detecta guest-uri alterate dupa salvarea state-ului
95
+
96
+## Debug rapid
97
+
98
+```bash
99
+journalctl -t pgs -n 50
100
+cat /var/lib/xdev/pve-guests-state/pgs-state.json
101
+```
+1352 -0
projects/pve-guests-state/bin/pgs
@@ -0,0 +1,1352 @@
1
+#!/bin/bash
2
+
3
+# pgs
4
+# Manages VM and CT suspend/shutdown for planned maintenance.
5
+#
6
+# Before maintenance (suspend mode):
7
+#   - Suspends all running VMs to disk
8
+#   - Gracefully shuts down all running CTs
9
+#   - Saves state to a list for restoration
10
+#   - VMs already suspended to disk: logged as warning, not auto-resumed
11
+#   - VMs suspended to RAM: suspended to disk but not auto-resumed (preserving user intent)
12
+#
13
+# After maintenance (resume mode):
14
+#   - Resumes VMs from the saved list
15
+#   - Starts CTs from the saved list
16
+#   - Logs warnings for VMs/CTs skipped
17
+#   - Logs errors for VMs/CTs that fail to resume/start
18
+#
19
+# Usage: pgs suspend|resume [--dry-run] [-v]
20
+#
21
+# Version: 1.4 - Standardized xdev state path with legacy state migration
22
+#
23
+# TODO: Implement critical VM/CT migration support.
24
+#       Critical guests (tagged or listed) should be live-migrated to another
25
+#       node before maintenance instead of suspended/stopped. Rules TBD:
26
+#       - Which guests are critical (tag? config flag? external list?)
27
+#       - Target node selection (least loaded? affinity rules?)
28
+#       - Fallback if migration fails (suspend locally?)
29
+#       - Post-maintenance: migrate back or leave on target node?
30
+
31
+PROJECT_ID="pve-guests-state"
32
+ORG_ID="xdev"
33
+DEFAULT_STATE_DIR="/var/lib/${ORG_ID}/${PROJECT_ID}"
34
+LEGACY_STATE_DIR="/var/lib/pve-manager"
35
+LEGACY_STATE_FILE="${LEGACY_STATE_DIR}/pgs-state.json"
36
+STATE_DIR="${PGS_STATE_DIR:-${DEFAULT_STATE_DIR}}"
37
+STATE_FILE="${STATE_DIR}/pgs-state.json"
38
+LOCK_FILE="/run/pgs.lock"
39
+SCRIPT_NAME=$(basename "$0")
40
+
41
+DRY_RUN=0
42
+VERBOSE=0
43
+QUORUM_RELAXED=0
44
+
45
+# Associative arrays for VM data (populated once)
46
+declare -A VM_STATUS
47
+declare -A VM_NAME
48
+declare -A VM_HAS_LOCK
49
+declare -A VM_VMSTATE
50
+declare -A VMSTATE_TO_VMID
51
+
52
+# Associative arrays for CT data (populated once)
53
+declare -A CT_STATUS
54
+declare -A CT_NAME
55
+
56
+# Logging functions.
57
+# When running inside systemd (JOURNAL_STREAM is set), stdout goes directly to
58
+# the journal - calling logger in addition causes duplicate entries. When running
59
+# interactively, use both echo (terminal) and logger (journal archive).
60
+_log() {
61
+    local level="$1" prefix="$2"; shift 2
62
+    echo "$prefix $*"
63
+    [[ -z "${JOURNAL_STREAM:-}" ]] && logger -t "$SCRIPT_NAME" -p "$level" "$*"
64
+}
65
+
66
+log_info() {
67
+    # When in systemd: always log regardless of VERBOSE (journal is the destination)
68
+    # When interactive: only log if -v is set
69
+    if [[ -n "${JOURNAL_STREAM:-}" ]] || [[ $VERBOSE -ge 1 ]]; then
70
+        _log user.info "[INFO]" "$@"
71
+    fi
72
+}
73
+
74
+log_debug() {
75
+    if [[ -n "${JOURNAL_STREAM:-}" ]] || [[ $VERBOSE -ge 2 ]]; then
76
+        _log user.debug "[DEBUG]" "$@"
77
+    fi
78
+}
79
+
80
+log_warning() {
81
+    _log user.warning "[WARNING]" "$@"
82
+}
83
+
84
+log_error() {
85
+    echo "[ERROR] $*" >&2
86
+    [[ -z "${JOURNAL_STREAM:-}" ]] && logger -t "$SCRIPT_NAME" -p user.err "$*"
87
+}
88
+
89
+log_success() {
90
+    _log user.notice "[SUCCESS]" "$@"
91
+}
92
+
93
+usage() {
94
+    cat <<EOF
95
+Usage: $0 suspend|resume|cleanup [OPTIONS]
96
+
97
+Manage VM and CT suspend/shutdown for planned maintenance.
98
+
99
+Commands:
100
+  suspend    Suspend running VMs to disk, shutdown running CTs
101
+  resume     Resume VMs and start CTs from saved state
102
+  cleanup    Remove stale suspend artifacts from config and storage
103
+
104
+Options:
105
+  -n, --dry-run    Show what would be done without making changes
106
+  -v, --verbose    Print informational messages (-vv adds debug detail)
107
+  -h, --help       Display this help and exit
108
+
109
+Examples:
110
+  $0 suspend              # Suspend VMs, shutdown CTs
111
+  $0 resume               # Resume VMs, start CTs
112
+  $0 cleanup -v           # Remove orphan/stale suspend artifacts
113
+  $0 cleanup -vv          # Include real filesystem paths in cleanup output
114
+  $0 suspend --dry-run    # Show what would happen
115
+EOF
116
+}
117
+
118
+refresh_vm_artifact_metadata() {
119
+    VM_HAS_LOCK=()
120
+    VM_VMSTATE=()
121
+    VMSTATE_TO_VMID=()
122
+
123
+    for conf in /etc/pve/qemu-server/*.conf; do
124
+        [[ ! -f "$conf" ]] && continue
125
+        local vmid=$(basename "$conf" .conf)
126
+        if grep -q '^lock: suspended$' "$conf" 2>/dev/null; then
127
+            VM_HAS_LOCK[$vmid]=1
128
+        fi
129
+        local vmstate
130
+        vmstate=$(awk -F': ' '/^vmstate: / {print $2; exit}' "$conf" 2>/dev/null)
131
+        if [[ -n "$vmstate" ]]; then
132
+            VM_VMSTATE[$vmid]="$vmstate"
133
+            VMSTATE_TO_VMID[$vmstate]="$vmid"
134
+        fi
135
+    done
136
+}
137
+
138
+load_vm_config_metadata() {
139
+    VM_STATUS=()
140
+    VM_NAME=()
141
+
142
+    while read -r vmid name status _rest; do
143
+        [[ "$vmid" == "VMID" ]] && continue
144
+        VM_NAME[$vmid]="$name"
145
+    done < <(qm list 2>/dev/null)
146
+
147
+    refresh_vm_artifact_metadata
148
+}
149
+
150
+# Load all VM info in one pass - FAST
151
+load_vm_info() {
152
+    load_vm_config_metadata
153
+
154
+    # Get status and name from qm list (single call)
155
+    while read -r vmid name status _rest; do
156
+        [[ "$vmid" == "VMID" ]] && continue  # skip header
157
+        VM_STATUS[$vmid]="$status"
158
+        VM_NAME[$vmid]="$name"
159
+    done < <(qm list 2>/dev/null)
160
+    
161
+    # For "running" VMs, get actual status (qm list shows "running" for paused/suspended VMs)
162
+    # This is only a few VMs so the overhead is acceptable
163
+    for vmid in "${!VM_STATUS[@]}"; do
164
+        if [[ "${VM_STATUS[$vmid]}" == "running" ]]; then
165
+            local real_status
166
+            real_status=$(qm status "$vmid" 2>/dev/null | awk '{print $2}')
167
+            [[ -n "$real_status" ]] && VM_STATUS[$vmid]="$real_status"
168
+        fi
169
+    done
170
+}
171
+
172
+array_contains() {
173
+    local needle="$1"
174
+    shift
175
+    local item
176
+    for item in "$@"; do
177
+        [[ "$item" == "$needle" ]] && return 0
178
+    done
179
+    return 1
180
+}
181
+
182
+append_unique() {
183
+    local -n target_ref=$1
184
+    local value="$2"
185
+
186
+    array_contains "$value" "${target_ref[@]}" || target_ref+=("$value")
187
+}
188
+
189
+remove_value() {
190
+    local -n target_ref=$1
191
+    local value="$2"
192
+    local filtered=()
193
+    local item
194
+
195
+    for item in "${target_ref[@]}"; do
196
+        [[ "$item" == "$value" ]] && continue
197
+        filtered+=("$item")
198
+    done
199
+
200
+    target_ref=("${filtered[@]}")
201
+}
202
+
203
+extract_suspend_file_date() {
204
+    local vmid="$1"
205
+    local volume="$2"
206
+    local volume_name="${volume##*/}"
207
+
208
+    if [[ "$volume_name" =~ ^vm-${vmid}-state-suspend-([0-9]{4}-[0-9]{2}-[0-9]{2})\.raw$ ]]; then
209
+        echo "${BASH_REMATCH[1]}"
210
+    fi
211
+}
212
+
213
+# Load all CT info in one pass - FAST
214
+load_ct_info() {
215
+    # pct list columns: VMID Status Lock Name
216
+    # When Lock is empty, read shifts Name into the lock variable
217
+    while read -r vmid status lock name; do
218
+        [[ "$vmid" == "VMID" ]] && continue  # skip header
219
+        if [[ -z "$name" ]]; then
220
+            # No lock present: lock actually holds the name
221
+            name="$lock"
222
+            lock=""
223
+        fi
224
+        CT_STATUS[$vmid]="$status"
225
+        CT_NAME[$vmid]="$name"
226
+    done < <(pct list 2>/dev/null)
227
+}
228
+
229
+# Get VM name (from cache)
230
+get_vm_name() {
231
+    echo "${VM_NAME[$1]:-unknown}"
232
+}
233
+
234
+vm_has_suspend_lock() {
235
+    local vmid="$1"
236
+    grep -q '^lock: suspended$' "/etc/pve/qemu-server/${vmid}.conf" 2>/dev/null
237
+}
238
+
239
+vm_has_vmstate_reference() {
240
+    local vmid="$1"
241
+    grep -q '^vmstate:' "/etc/pve/qemu-server/${vmid}.conf" 2>/dev/null
242
+}
243
+
244
+get_vm_vmstate_volume() {
245
+    local vmid="$1"
246
+    echo "${VM_VMSTATE[$vmid]:-}"
247
+}
248
+
249
+is_strict_suspend_volume_name() {
250
+    local vmid="$1"
251
+    local name="$2"
252
+    [[ "$name" =~ ^vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ ]]
253
+}
254
+
255
+storage_cleanup_supports_path_scan() {
256
+    local storage_type="$1"
257
+
258
+    # Cleanup walks filesystem paths directly under <path>/images.
259
+    # Keep this limited to local directory-backed storages so a stale remote
260
+    # mount cannot block planned maintenance in kernel I/O wait.
261
+    [[ "$storage_type" == "dir" ]]
262
+}
263
+
264
+vmstate_volume_looks_like_suspend_artifact() {
265
+    local vmid="$1"
266
+    local volume="$2"
267
+    local volume_name="${volume##*/}"
268
+
269
+    [[ -n "$volume" ]] || return 1
270
+    is_strict_suspend_volume_name "$vmid" "$volume_name"
271
+}
272
+
273
+resolve_storage_volume_path() {
274
+    local volume="$1"
275
+    pvesm path "$volume" 2>/dev/null
276
+}
277
+
278
+vmstate_volume_exists() {
279
+    local volume="$1"
280
+    local resolved_path
281
+
282
+    [[ -z "$volume" ]] && return 1
283
+    resolved_path=$(resolve_storage_volume_path "$volume") || return 1
284
+    [[ -n "$resolved_path" && -e "$resolved_path" ]]
285
+}
286
+
287
+remove_suspend_volume_by_volid() {
288
+    local vmid="$1"
289
+    local volume="$2"
290
+    local name="${VM_NAME[$vmid]:-unknown}"
291
+    local free_output
292
+
293
+    if ! vmstate_volume_looks_like_suspend_artifact "$vmid" "$volume"; then
294
+        log_warning "VM $vmid ($name) suspend volume does not look like a suspend artifact, leaving it untouched: ${volume:-none}"
295
+        return 1
296
+    fi
297
+
298
+    if [[ $DRY_RUN -eq 1 ]]; then
299
+        echo "would remove stale vmstate volume for VM $vmid ($name): $volume"
300
+        return 0
301
+    fi
302
+
303
+    free_output=$(pvesm free "$volume" 2>&1)
304
+    if [[ $? -eq 0 ]]; then
305
+        log_info "Removed stale vmstate volume for VM $vmid ($name): $volume"
306
+        return 0
307
+    fi
308
+
309
+    if maybe_relax_quorum "$free_output"; then
310
+        free_output=$(pvesm free "$volume" 2>&1)
311
+        if [[ $? -eq 0 ]]; then
312
+            log_info "Removed stale vmstate volume for VM $vmid ($name) after quorum recovery: $volume"
313
+            return 0
314
+        fi
315
+    fi
316
+
317
+    if echo "$free_output" | grep -qiE 'does not exist|no such file|not found'; then
318
+        log_info "Stale vmstate volume for VM $vmid ($name) was already absent: $volume"
319
+        return 0
320
+    fi
321
+
322
+    log_warning "VM $vmid ($name) stale vmstate volume could not be removed: $volume ($free_output)"
323
+    return 1
324
+}
325
+
326
+clear_vmstate_metadata() {
327
+    local vmid="$1"
328
+    local name="${VM_NAME[$vmid]:-unknown}"
329
+    local set_output
330
+
331
+    if [[ $DRY_RUN -eq 1 ]]; then
332
+        echo "would remove stale vmstate metadata for VM $vmid ($name)"
333
+        return 0
334
+    fi
335
+
336
+    set_output=$(qm set "$vmid" --delete vmstate 2>&1)
337
+    if [[ $? -eq 0 ]]; then
338
+        log_info "Removed stale vmstate metadata for VM $vmid ($name)"
339
+        return 0
340
+    fi
341
+
342
+    if maybe_relax_quorum "$set_output"; then
343
+        set_output=$(qm set "$vmid" --delete vmstate 2>&1)
344
+        if [[ $? -eq 0 ]]; then
345
+            log_info "Removed stale vmstate metadata for VM $vmid ($name) after quorum recovery"
346
+            return 0
347
+        fi
348
+    fi
349
+
350
+    log_warning "VM $vmid ($name) stale vmstate metadata could not be removed: $set_output"
351
+    return 1
352
+}
353
+
354
+free_stale_vmstate_volume() {
355
+    local vmid="$1"
356
+    local volume="$2"
357
+
358
+    remove_suspend_volume_by_volid "$vmid" "$volume"
359
+}
360
+
361
+cleanup_stale_suspend_artifacts() {
362
+    local vmid="$1"
363
+    local context="${2:-}"
364
+    local name="${VM_NAME[$vmid]:-unknown}"
365
+    local volume
366
+    local had_issue=0
367
+    local cleanup_failed=0
368
+
369
+    volume=$(get_vm_vmstate_volume "$vmid")
370
+
371
+    if vm_has_suspend_lock "$vmid"; then
372
+        had_issue=1
373
+        if ! unlock_vm_suspend_lock "$vmid" "$context"; then
374
+            cleanup_failed=1
375
+        fi
376
+    fi
377
+
378
+    if [[ -n "$volume" ]]; then
379
+        had_issue=1
380
+        if vmstate_volume_exists "$volume"; then
381
+            if ! free_stale_vmstate_volume "$vmid" "$volume"; then
382
+                cleanup_failed=1
383
+            fi
384
+        else
385
+            log_info "VM $vmid ($name) has stale vmstate metadata pointing to missing volume: $volume"
386
+        fi
387
+
388
+        if ! clear_vmstate_metadata "$vmid"; then
389
+            cleanup_failed=1
390
+        fi
391
+    fi
392
+
393
+    if [[ $had_issue -eq 0 ]]; then
394
+        return 0
395
+    fi
396
+
397
+    [[ $cleanup_failed -eq 0 ]]
398
+}
399
+
400
+vm_has_valid_suspend_state() {
401
+    local vmid="$1"
402
+    local volume
403
+
404
+    vm_has_suspend_lock "$vmid" || return 1
405
+    vm_has_vmstate_reference "$vmid" || return 1
406
+    volume=$(get_vm_vmstate_volume "$vmid")
407
+    vmstate_volume_looks_like_suspend_artifact "$vmid" "$volume" || return 1
408
+    vmstate_volume_exists "$volume"
409
+}
410
+
411
+get_referencing_vmid_for_vmstate() {
412
+    local target_volume="$1"
413
+    local vmid="${VMSTATE_TO_VMID[$target_volume]:-}"
414
+    [[ -n "$vmid" ]] || return 1
415
+    echo "$vmid"
416
+    return 0
417
+}
418
+
419
+list_suspend_artifact_files() {
420
+    awk '
421
+        BEGIN {
422
+            RS = ""
423
+            FS = "\n"
424
+        }
425
+        {
426
+            type = ""
427
+            name = ""
428
+            path = ""
429
+            content = ""
430
+            split($1, header_parts, /:[[:space:]]+/)
431
+            if (length(header_parts) >= 2) {
432
+                type = header_parts[1]
433
+                name = header_parts[2]
434
+            }
435
+
436
+            for (i = 2; i <= NF; i++) {
437
+                line = $i
438
+                sub(/^\t/, "", line)
439
+                if (line ~ /^path /) {
440
+                    path = substr(line, 6)
441
+                } else if (line ~ /^content /) {
442
+                    content = substr(line, 9)
443
+                }
444
+            }
445
+
446
+            if (name != "" && path != "" && content ~ /(^|,)images(,|$)/) {
447
+                print type "\t" name "\t" path
448
+            }
449
+        }
450
+    ' /etc/pve/storage.cfg 2>/dev/null | while IFS=$'\t' read -r storage_type storage path; do
451
+        [[ -z "$storage" || -z "$path" ]] && continue
452
+        if ! storage_cleanup_supports_path_scan "$storage_type"; then
453
+            continue
454
+        fi
455
+        [[ -d "${path}/images" ]] || continue
456
+        local file
457
+        for file in "${path}"/images/[0-9]*/vm-*-state-suspend-????-??-??.raw; do
458
+            [[ -e "$file" ]] || continue
459
+            local relative_path="${file#${path}/images/}"
460
+            [[ "$relative_path" == "$file" ]] && continue
461
+            local vm_dir="${relative_path%%/*}"
462
+            local file_name="${relative_path##*/}"
463
+            [[ "$vm_dir" =~ ^[0-9]+$ ]] || continue
464
+            is_strict_suspend_volume_name "$vm_dir" "$file_name" || continue
465
+            printf '%s\t%s:%s/%s\t%s\n' "$storage" "$storage" "$vm_dir" "$file_name" "$file"
466
+        done
467
+    done
468
+}
469
+
470
+cleanup_orphan_suspend_artifacts() {
471
+    local cleaned_count=0
472
+    local skipped_count=0
473
+    local fail_count=0
474
+    local storage
475
+    local volume
476
+    local file_path
477
+    local vmid
478
+
479
+    log_info "Scanning storages for orphan suspend-state volumes..."
480
+
481
+    while IFS=$'\t' read -r storage volume file_path; do
482
+        [[ -z "$volume" ]] && continue
483
+
484
+        if vmid=$(get_referencing_vmid_for_vmstate "$volume"); then
485
+            if vm_has_valid_suspend_state "$vmid"; then
486
+                log_info "Keeping active suspend-state volume for VM $vmid (${VM_NAME[$vmid]:-unknown}): $volume"
487
+                ((skipped_count++))
488
+            else
489
+                log_warning "VM $vmid (${VM_NAME[$vmid]:-unknown}) references inconsistent suspend artifacts - cleaning up"
490
+                if cleanup_stale_suspend_artifacts "$vmid" "during cleanup"; then
491
+                    ((cleaned_count++))
492
+                else
493
+                    ((fail_count++))
494
+                fi
495
+            fi
496
+            continue
497
+        fi
498
+
499
+        if [[ $DRY_RUN -eq 1 ]]; then
500
+            echo "would remove orphan suspend-state volume: $volume"
501
+            log_debug "real path: $file_path"
502
+            ((cleaned_count++))
503
+            continue
504
+        fi
505
+
506
+        if [[ "$volume" =~ ^([^:]+):([0-9]+)/vm-([0-9]+)-state-suspend-([0-9]{4}-[0-9]{2}-[0-9]{2})\.raw$ ]]; then
507
+            vmid="${BASH_REMATCH[3]}"
508
+        else
509
+            log_warning "Skipping suspicious suspend-state volume with unexpected name: $volume"
510
+            ((skipped_count++))
511
+            continue
512
+        fi
513
+
514
+        VM_NAME[$vmid]="${VM_NAME[$vmid]:-unknown}"
515
+        if remove_suspend_volume_by_volid "$vmid" "$volume"; then
516
+            log_info "Removed orphan suspend-state volume from $storage: $volume"
517
+            ((cleaned_count++))
518
+        else
519
+            ((fail_count++))
520
+        fi
521
+    done < <(list_suspend_artifact_files)
522
+
523
+    log_success "Suspend artifact cleanup complete: $cleaned_count cleaned, $skipped_count retained, $fail_count failed"
524
+    return $fail_count
525
+}
526
+
527
+unlock_vm_suspend_lock() {
528
+    local vmid="$1"
529
+    local context="${2:-}"
530
+    local name="${VM_NAME[$vmid]:-unknown}"
531
+    local unlock_output
532
+
533
+    if ! vm_has_suspend_lock "$vmid"; then
534
+        return 0
535
+    fi
536
+
537
+    if [[ $DRY_RUN -eq 1 ]]; then
538
+        if [[ -n "$context" ]]; then
539
+            echo "would remove stale suspend lock for VM $vmid ($name) $context"
540
+        else
541
+            echo "would remove stale suspend lock for VM $vmid ($name)"
542
+        fi
543
+        return 0
544
+    fi
545
+
546
+    unlock_output=$(qm unlock "$vmid" 2>&1)
547
+    if [[ $? -eq 0 ]]; then
548
+        if [[ -n "$context" ]]; then
549
+            log_info "Removed stale suspend lock for VM $vmid ($name) $context"
550
+        else
551
+            log_info "Removed stale suspend lock for VM $vmid ($name)"
552
+        fi
553
+        return 0
554
+    fi
555
+
556
+    if maybe_relax_quorum "$unlock_output"; then
557
+        unlock_output=$(qm unlock "$vmid" 2>&1)
558
+        if [[ $? -eq 0 ]]; then
559
+            if [[ -n "$context" ]]; then
560
+                log_info "Removed stale suspend lock for VM $vmid ($name) $context after quorum recovery"
561
+            else
562
+                log_info "Removed stale suspend lock for VM $vmid ($name) after quorum recovery"
563
+            fi
564
+            return 0
565
+        fi
566
+    fi
567
+
568
+    if [[ -n "$context" ]]; then
569
+        log_warning "VM $vmid ($name) has a stale suspend lock $context but it could not be removed: $unlock_output"
570
+    else
571
+        log_warning "VM $vmid ($name) has a stale suspend lock but it could not be removed: $unlock_output"
572
+    fi
573
+    return 1
574
+}
575
+
576
+unlock_vm_if_needed() {
577
+    unlock_vm_suspend_lock "$1" "while VM is running"
578
+}
579
+
580
+# Quorum-sensitive operations (qm suspend/start/resume) may fail during
581
+# cluster-wide maintenance when pmxcfs becomes read-only. In that case, relax
582
+# expected votes once and retry the failed operation.
583
+maybe_relax_quorum() {
584
+    local cmd_output="$1"
585
+
586
+    # Already attempted in this run.
587
+    if [[ $QUORUM_RELAXED -eq 1 ]]; then
588
+        return 1
589
+    fi
590
+
591
+    if echo "$cmd_output" | grep -qiE "cluster not ready - no quorum|/etc/pve/.+\\.conf\\.tmp.+(Permission denied|Device or resource busy)"; then
592
+        log_warning "Detected quorum-related write failure in /etc/pve - attempting temporary 'pvecm expected 1'"
593
+        if pvecm expected 1 >/dev/null 2>&1; then
594
+            QUORUM_RELAXED=1
595
+            log_warning "Applied 'pvecm expected 1' for this maintenance cycle; retrying operation"
596
+            return 0
597
+        fi
598
+        log_error "Failed to apply 'pvecm expected 1' after quorum-related error"
599
+    fi
600
+
601
+    return 1
602
+}
603
+
604
+# Suspend a VM to disk
605
+suspend_vm_to_disk() {
606
+    local vmid="$1"
607
+    local name="${VM_NAME[$vmid]:-unknown}"
608
+    local qm_output
609
+    local stale_path
610
+    local retry_output
611
+    local stale_retry_path
612
+    
613
+    if [[ $DRY_RUN -eq 1 ]]; then
614
+        echo "would suspend VM $vmid ($name) to disk"
615
+        return 0
616
+    fi
617
+    
618
+    log_info "Suspending VM $vmid ($name) to disk..."
619
+    qm_output=$(qm suspend "$vmid" --todisk 1 2>&1)
620
+    if [[ $? -eq 0 ]]; then
621
+        log_success "VM $vmid ($name) suspended to disk"
622
+        return 0
623
+    fi
624
+
625
+    # Recover from stale suspend image left from a previous interrupted suspend.
626
+    # Proxmox can emit either:
627
+    #   - "stale saved state disk image ('...raw' already exists)"
628
+    #   - "disk image '...raw' already exists"
629
+    stale_path=$(
630
+        echo "$qm_output" | sed -n \
631
+            -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
632
+            -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1
633
+    )
634
+    if [[ -n "$stale_path" && "$stale_path" =~ /vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ && -f "$stale_path" ]]; then
635
+        log_warning "VM $vmid ($name) has stale suspend image: $stale_path - removing and retrying once"
636
+        if rm -f -- "$stale_path"; then
637
+            retry_output=$(qm suspend "$vmid" --todisk 1 2>&1)
638
+            if [[ $? -eq 0 ]]; then
639
+                log_success "VM $vmid ($name) suspended to disk (after stale image cleanup)"
640
+                return 0
641
+            fi
642
+            if maybe_relax_quorum "$retry_output"; then
643
+                retry_output=$(qm suspend "$vmid" --todisk 1 2>&1)
644
+                if [[ $? -eq 0 ]]; then
645
+                    log_success "VM $vmid ($name) suspended to disk (after stale image cleanup + quorum recovery)"
646
+                    return 0
647
+                fi
648
+                stale_retry_path=$(
649
+                    echo "$retry_output" | sed -n \
650
+                        -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
651
+                        -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1
652
+                )
653
+                if [[ -n "$stale_retry_path" && "$stale_retry_path" =~ /vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ && -f "$stale_retry_path" ]]; then
654
+                    log_warning "VM $vmid ($name) retry left stale suspend image: $stale_retry_path - removing and retrying once more"
655
+                    if rm -f -- "$stale_retry_path"; then
656
+                        retry_output=$(qm suspend "$vmid" --todisk 1 2>&1)
657
+                        if [[ $? -eq 0 ]]; then
658
+                            log_success "VM $vmid ($name) suspended to disk (after stale image cleanup + quorum recovery + retry)"
659
+                            return 0
660
+                        fi
661
+                    fi
662
+                fi
663
+            fi
664
+            log_error "Failed to suspend VM $vmid ($name) after stale image cleanup: $retry_output"
665
+            return 1
666
+        fi
667
+        log_error "Failed to remove stale suspend image for VM $vmid ($name): $stale_path"
668
+        return 1
669
+    fi
670
+
671
+    if maybe_relax_quorum "$qm_output"; then
672
+        retry_output=$(qm suspend "$vmid" --todisk 1 2>&1)
673
+        if [[ $? -eq 0 ]]; then
674
+            log_success "VM $vmid ($name) suspended to disk (after quorum recovery)"
675
+            return 0
676
+        fi
677
+        stale_retry_path=$(
678
+            echo "$retry_output" | sed -n \
679
+                -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
680
+                -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1
681
+        )
682
+        if [[ -n "$stale_retry_path" && "$stale_retry_path" =~ /vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ && -f "$stale_retry_path" ]]; then
683
+            log_warning "VM $vmid ($name) quorum retry hit stale suspend image: $stale_retry_path - removing and retrying once more"
684
+            if rm -f -- "$stale_retry_path"; then
685
+                retry_output=$(qm suspend "$vmid" --todisk 1 2>&1)
686
+                if [[ $? -eq 0 ]]; then
687
+                    log_success "VM $vmid ($name) suspended to disk (after quorum recovery + stale retry)"
688
+                    return 0
689
+                fi
690
+            fi
691
+        fi
692
+        log_error "Failed to suspend VM $vmid ($name) after quorum recovery: $retry_output"
693
+        return 1
694
+    fi
695
+
696
+    log_error "Failed to suspend VM $vmid ($name) to disk: $qm_output"
697
+    return 1
698
+}
699
+
700
+# Resume a VM from disk suspend
701
+resume_vm() {
702
+    local vmid="$1"
703
+    local name="${VM_NAME[$vmid]:-unknown}"
704
+    local qm_output
705
+    local current_status
706
+    
707
+    if [[ $DRY_RUN -eq 1 ]]; then
708
+        echo "would resume VM $vmid ($name)"
709
+        return 0
710
+    fi
711
+    
712
+    log_info "Resuming VM $vmid ($name)..."
713
+    qm_output=$(qm resume "$vmid" 2>&1)
714
+    if [[ $? -eq 0 ]]; then
715
+        unlock_vm_if_needed "$vmid"
716
+        log_success "VM $vmid ($name) resumed successfully"
717
+        return 0
718
+    fi
719
+
720
+    if maybe_relax_quorum "$qm_output"; then
721
+        qm_output=$(qm resume "$vmid" 2>&1)
722
+        if [[ $? -eq 0 ]]; then
723
+            unlock_vm_if_needed "$vmid"
724
+            log_success "VM $vmid ($name) resumed successfully (after quorum recovery)"
725
+            return 0
726
+        fi
727
+        current_status=$(qm status "$vmid" 2>/dev/null | awk '{print $2}')
728
+        if [[ "$current_status" == "running" ]]; then
729
+            unlock_vm_if_needed "$vmid"
730
+            log_warning "VM $vmid ($name) is running despite resume error after quorum recovery - treating as resumed"
731
+            return 2
732
+        fi
733
+        log_error "Failed to resume VM $vmid ($name) after quorum recovery: $qm_output"
734
+        return 1
735
+    fi
736
+
737
+    if echo "$qm_output" | grep -qi "already running"; then
738
+        unlock_vm_if_needed "$vmid"
739
+        log_warning "VM $vmid ($name) is already running - treating as resumed"
740
+        return 2
741
+    fi
742
+
743
+    current_status=$(qm status "$vmid" 2>/dev/null | awk '{print $2}')
744
+    if [[ "$current_status" == "running" ]]; then
745
+        unlock_vm_if_needed "$vmid"
746
+        log_warning "VM $vmid ($name) is running despite resume error - treating as resumed"
747
+        return 2
748
+    fi
749
+
750
+    log_error "Failed to resume VM $vmid ($name): $qm_output"
751
+    return 1
752
+}
753
+
754
+# Graceful shutdown a CT
755
+shutdown_ct() {
756
+    local ctid="$1"
757
+    local name="${CT_NAME[$ctid]:-unknown}"
758
+    
759
+    if [[ $DRY_RUN -eq 1 ]]; then
760
+        echo "would shutdown CT $ctid ($name)"
761
+        return 0
762
+    fi
763
+    
764
+    log_info "Shutting down CT $ctid ($name)..."
765
+    if pct shutdown "$ctid" --timeout 120; then
766
+        log_success "CT $ctid ($name) shut down gracefully"
767
+        return 0
768
+    else
769
+        log_error "Failed to shutdown CT $ctid ($name)"
770
+        return 1
771
+    fi
772
+}
773
+
774
+# Start a CT
775
+start_ct() {
776
+    local ctid="$1"
777
+    local name="${CT_NAME[$ctid]:-unknown}"
778
+    local pct_output
779
+    
780
+    if [[ $DRY_RUN -eq 1 ]]; then
781
+        echo "would start CT $ctid ($name)"
782
+        return 0
783
+    fi
784
+    
785
+    log_info "Starting CT $ctid ($name)..."
786
+    pct_output=$(pct start "$ctid" 2>&1)
787
+    if [[ $? -eq 0 ]]; then
788
+        log_success "CT $ctid ($name) started successfully"
789
+        return 0
790
+    fi
791
+
792
+    if maybe_relax_quorum "$pct_output"; then
793
+        pct_output=$(pct start "$ctid" 2>&1)
794
+        if [[ $? -eq 0 ]]; then
795
+            log_success "CT $ctid ($name) started successfully (after quorum recovery)"
796
+            return 0
797
+        fi
798
+        if [[ "$(pct status "$ctid" 2>/dev/null | awk '{print $2}')" == "running" ]]; then
799
+            log_warning "CT $ctid ($name) is running despite start error after quorum recovery - treating as started"
800
+            return 2
801
+        fi
802
+        log_error "Failed to start CT $ctid ($name) after quorum recovery: $pct_output"
803
+        return 1
804
+    fi
805
+
806
+    if echo "$pct_output" | grep -qi "already running"; then
807
+        log_warning "CT $ctid ($name) is already running - treating as started"
808
+        return 2
809
+    fi
810
+
811
+    if [[ "$(pct status "$ctid" 2>/dev/null | awk '{print $2}')" == "running" ]]; then
812
+        log_warning "CT $ctid ($name) is running despite start error - treating as started"
813
+        return 2
814
+    fi
815
+
816
+    log_error "Failed to start CT $ctid ($name): $pct_output"
817
+    return 1
818
+}
819
+
820
+# Save state to JSON file
821
+# Usage: save_state vm_resume_array vm_suspended_array ct_start_array
822
+save_state() {
823
+    local -n to_resume_ref=$1
824
+    local -n was_suspended_ref=$2
825
+    local -n ct_to_start_ref=$3
826
+    local existing_state_json=""
827
+    local existing_to_resume=()
828
+    local existing_was_suspended=()
829
+    local existing_ct_to_start=()
830
+    local final_to_resume=()
831
+    local final_was_suspended=()
832
+    local final_ct_to_start=()
833
+    local vmid
834
+    local volume
835
+    local suspend_date
836
+    local -A existing_vm_volume=()
837
+    local -A existing_vm_date=()
838
+    local -A current_vm_volume=()
839
+    local -A current_vm_date=()
840
+    
841
+    if [[ $DRY_RUN -eq 1 ]]; then
842
+        echo "would save state to $STATE_FILE"
843
+        echo "  to_resume (VMs): ${to_resume_ref[*]}"
844
+        echo "  was_suspended (VMs): ${was_suspended_ref[*]}"
845
+        echo "  ct_to_start (CTs): ${ct_to_start_ref[*]}"
846
+        return 0
847
+    fi
848
+
849
+    if existing_state_json=$(load_state 2>/dev/null); then
850
+        mapfile -t existing_to_resume < <(echo "$existing_state_json" | jq -r '.to_resume[]?' 2>/dev/null)
851
+        mapfile -t existing_was_suspended < <(echo "$existing_state_json" | jq -r '.was_suspended[]?' 2>/dev/null)
852
+        mapfile -t existing_ct_to_start < <(echo "$existing_state_json" | jq -r '.ct_to_start[]?' 2>/dev/null)
853
+        while IFS=$'\t' read -r vmid volume suspend_date; do
854
+            [[ -z "$vmid" ]] && continue
855
+            existing_vm_volume[$vmid]="$volume"
856
+            existing_vm_date[$vmid]="$suspend_date"
857
+        done < <(
858
+            echo "$existing_state_json" | jq -r '
859
+                (.vm_details // {})
860
+                | to_entries[]
861
+                | [.key, (.value.suspend_volume // ""), (.value.suspend_file_date // "")]
862
+                | @tsv
863
+            ' 2>/dev/null
864
+        )
865
+    fi
866
+
867
+    refresh_vm_artifact_metadata
868
+
869
+    for vmid in "${to_resume_ref[@]}"; do
870
+        append_unique final_to_resume "$vmid"
871
+        volume="${VM_VMSTATE[$vmid]:-}"
872
+        suspend_date=$(extract_suspend_file_date "$vmid" "$volume")
873
+        current_vm_volume[$vmid]="$volume"
874
+        current_vm_date[$vmid]="$suspend_date"
875
+    done
876
+
877
+    for vmid in "${existing_to_resume[@]}"; do
878
+        append_unique final_to_resume "$vmid"
879
+    done
880
+
881
+    for vmid in "${existing_was_suspended[@]}"; do
882
+        if ! array_contains "$vmid" "${final_to_resume[@]}"; then
883
+            append_unique final_was_suspended "$vmid"
884
+        fi
885
+    done
886
+
887
+    for vmid in "${was_suspended_ref[@]}"; do
888
+        if array_contains "$vmid" "${final_to_resume[@]}"; then
889
+            volume="${VM_VMSTATE[$vmid]:-}"
890
+            if [[ -n "$volume" ]]; then
891
+                current_vm_volume[$vmid]="$volume"
892
+                current_vm_date[$vmid]="$(extract_suspend_file_date "$vmid" "$volume")"
893
+            fi
894
+            continue
895
+        fi
896
+        append_unique final_was_suspended "$vmid"
897
+        volume="${VM_VMSTATE[$vmid]:-}"
898
+        suspend_date=$(extract_suspend_file_date "$vmid" "$volume")
899
+        current_vm_volume[$vmid]="$volume"
900
+        current_vm_date[$vmid]="$suspend_date"
901
+    done
902
+
903
+    for vmid in "${final_to_resume[@]}"; do
904
+        remove_value final_was_suspended "$vmid"
905
+    done
906
+
907
+    for vmid in "${existing_ct_to_start[@]}"; do
908
+        append_unique final_ct_to_start "$vmid"
909
+    done
910
+    for vmid in "${ct_to_start_ref[@]}"; do
911
+        append_unique final_ct_to_start "$vmid"
912
+    done
913
+    
914
+    # Create JSON arrays (handle empty arrays properly)
915
+    local to_resume_json="[]"
916
+    local was_suspended_json="[]"
917
+    local ct_to_start_json="[]"
918
+    local vm_details_json="{}"
919
+    
920
+    if [[ ${#final_to_resume[@]} -gt 0 ]]; then
921
+        to_resume_json=$(printf '%s\n' "${final_to_resume[@]}" | jq -R . | jq -s .)
922
+    fi
923
+    if [[ ${#final_was_suspended[@]} -gt 0 ]]; then
924
+        was_suspended_json=$(printf '%s\n' "${final_was_suspended[@]}" | jq -R . | jq -s .)
925
+    fi
926
+    if [[ ${#final_ct_to_start[@]} -gt 0 ]]; then
927
+        ct_to_start_json=$(printf '%s\n' "${final_ct_to_start[@]}" | jq -R . | jq -s .)
928
+    fi
929
+
930
+    for vmid in "${final_to_resume[@]}"; do
931
+        volume="${current_vm_volume[$vmid]:-${existing_vm_volume[$vmid]:-}}"
932
+        suspend_date="${current_vm_date[$vmid]:-${existing_vm_date[$vmid]:-}}"
933
+        vm_details_json=$(
934
+            jq \
935
+                --arg vmid "$vmid" \
936
+                --arg mode "to_resume" \
937
+                --arg volume "$volume" \
938
+                --arg suspend_date "$suspend_date" \
939
+                '
940
+                .[$vmid] = {
941
+                    mode: $mode,
942
+                    suspend_volume: $volume,
943
+                    suspend_file_date: $suspend_date
944
+                }
945
+                ' <<<"$vm_details_json"
946
+        )
947
+    done
948
+
949
+    for vmid in "${final_was_suspended[@]}"; do
950
+        volume="${current_vm_volume[$vmid]:-${existing_vm_volume[$vmid]:-}}"
951
+        suspend_date="${current_vm_date[$vmid]:-${existing_vm_date[$vmid]:-}}"
952
+        vm_details_json=$(
953
+            jq \
954
+                --arg vmid "$vmid" \
955
+                --arg mode "was_suspended" \
956
+                --arg volume "$volume" \
957
+                --arg suspend_date "$suspend_date" \
958
+                '
959
+                .[$vmid] = {
960
+                    mode: $mode,
961
+                    suspend_volume: $volume,
962
+                    suspend_file_date: $suspend_date
963
+                }
964
+                ' <<<"$vm_details_json"
965
+        )
966
+    done
967
+    
968
+    cat > "$STATE_FILE" <<EOF
969
+{
970
+    "timestamp": "$(date -Iseconds)",
971
+    "hostname": "$(hostname)",
972
+    "to_resume": $to_resume_json,
973
+    "was_suspended": $was_suspended_json,
974
+    "ct_to_start": $ct_to_start_json,
975
+    "vm_details": $vm_details_json
976
+}
977
+EOF
978
+    
979
+    log_info "State saved to $STATE_FILE"
980
+}
981
+
982
+# Load state from JSON file (outputs JSON only, no logging to avoid capture issues)
983
+load_state() {
984
+    if [[ ! -f "$STATE_FILE" ]]; then
985
+        return 1
986
+    fi
987
+    cat "$STATE_FILE"
988
+}
989
+
990
+# Remove state file after resume is complete
991
+clear_state() {
992
+    if [[ $DRY_RUN -eq 1 ]]; then
993
+        echo "would remove state file $STATE_FILE"
994
+        return 0
995
+    fi
996
+    
997
+    if [[ -f "$STATE_FILE" ]]; then
998
+        rm -f "$STATE_FILE"
999
+        log_info "State file removed"
1000
+    fi
1001
+}
1002
+
1003
+migrate_legacy_state_if_needed() {
1004
+    if [[ "${STATE_FILE}" == "${LEGACY_STATE_FILE}" ]]; then
1005
+        return 0
1006
+    fi
1007
+
1008
+    if [[ -f "${LEGACY_STATE_FILE}" && ! -f "${STATE_FILE}" ]]; then
1009
+        mkdir -p "${STATE_DIR}"
1010
+        mv "${LEGACY_STATE_FILE}" "${STATE_FILE}"
1011
+        log_warning "Migrated legacy state file from ${LEGACY_STATE_FILE} to ${STATE_FILE}"
1012
+    fi
1013
+}
1014
+
1015
+# Main suspend operation
1016
+do_suspend() {
1017
+    log_info "Starting suspend/shutdown operation on $(hostname)"
1018
+
1019
+    # Clean stale suspend artifacts before creating new suspend volumes.
1020
+    load_vm_config_metadata
1021
+    if ! cleanup_orphan_suspend_artifacts; then
1022
+        log_warning "Suspend artifact preflight cleanup had failures; continuing with suspend operation"
1023
+    fi
1024
+
1025
+    # Load all VM and CT info in one pass
1026
+    load_vm_info
1027
+    load_ct_info
1028
+    
1029
+    local to_resume=()
1030
+    local was_suspended=()
1031
+    local ct_to_start=()
1032
+    local suspend_count=0
1033
+    local skip_count=0
1034
+    local fail_count=0
1035
+    
1036
+    # --- Process QEMU VMs ---
1037
+    log_info "Processing QEMU VMs..."
1038
+    for conf in /etc/pve/qemu-server/*.conf; do
1039
+        [[ ! -f "$conf" ]] && continue
1040
+        
1041
+        local vmid=$(basename "$conf" .conf)
1042
+        local name="${VM_NAME[$vmid]:-unknown}"
1043
+        local status="${VM_STATUS[$vmid]:-stopped}"
1044
+        
1045
+        case "$status" in
1046
+            running)
1047
+                # Running VM: suspend to disk, add to resume list
1048
+                if suspend_vm_to_disk "$vmid"; then
1049
+                    to_resume+=("$vmid")
1050
+                    ((suspend_count++))
1051
+                else
1052
+                    ((fail_count++))
1053
+                fi
1054
+                ;;
1055
+            suspended)
1056
+                # Suspended to RAM: save state to disk but DON'T add to resume list
1057
+                log_warning "VM $vmid ($name) is suspended to RAM - saving to disk but will NOT auto-resume (was manually suspended)"
1058
+                if suspend_vm_to_disk "$vmid"; then
1059
+                    was_suspended+=("$vmid")
1060
+                    ((suspend_count++))
1061
+                else
1062
+                    ((fail_count++))
1063
+                fi
1064
+                ;;
1065
+            stopped)
1066
+                # Could be stopped normally or suspended to disk
1067
+                if vm_has_valid_suspend_state "$vmid"; then
1068
+                    log_warning "VM $vmid ($name) is already suspended to disk - will NOT auto-resume"
1069
+                    was_suspended+=("$vmid")
1070
+                    ((skip_count++))
1071
+                elif vm_has_suspend_lock "$vmid" || vm_has_vmstate_reference "$vmid"; then
1072
+                    log_warning "VM $vmid ($name) has inconsistent suspend artifacts - treating them as stale"
1073
+                    if cleanup_stale_suspend_artifacts "$vmid" "while VM is stopped"; then
1074
+                        ((skip_count++))
1075
+                    else
1076
+                        ((fail_count++))
1077
+                    fi
1078
+                else
1079
+                    log_info "VM $vmid ($name) is stopped, skipping"
1080
+                fi
1081
+                ;;
1082
+            paused)
1083
+                # Paused/suspended to RAM: save state to disk but DON'T auto-resume
1084
+                log_warning "VM $vmid ($name) is paused/suspended to RAM - saving to disk but will NOT auto-resume (was manually paused)"
1085
+                if suspend_vm_to_disk "$vmid"; then
1086
+                    was_suspended+=("$vmid")
1087
+                    ((suspend_count++))
1088
+                else
1089
+                    ((fail_count++))
1090
+                fi
1091
+                ;;
1092
+            *)
1093
+                log_info "VM $vmid ($name) status '$status', skipping"
1094
+                ;;
1095
+        esac
1096
+    done
1097
+    
1098
+    # --- Process LXC Containers ---
1099
+    log_info "Processing LXC containers..."
1100
+    for conf in /etc/pve/lxc/*.conf; do
1101
+        [[ ! -f "$conf" ]] && continue
1102
+        
1103
+        local ctid=$(basename "$conf" .conf)
1104
+        local name="${CT_NAME[$ctid]:-unknown}"
1105
+        local status="${CT_STATUS[$ctid]:-stopped}"
1106
+        
1107
+        case "$status" in
1108
+            running)
1109
+                # Running CT: graceful shutdown, add to start list
1110
+                if shutdown_ct "$ctid"; then
1111
+                    ct_to_start+=("$ctid")
1112
+                    ((suspend_count++))
1113
+                else
1114
+                    ((fail_count++))
1115
+                fi
1116
+                ;;
1117
+            stopped)
1118
+                log_info "CT $ctid ($name) is stopped, skipping"
1119
+                ;;
1120
+            *)
1121
+                log_info "CT $ctid ($name) status '$status', skipping"
1122
+                ;;
1123
+        esac
1124
+    done
1125
+    
1126
+    # Save state
1127
+    save_state to_resume was_suspended ct_to_start
1128
+    
1129
+    # Summary
1130
+    log_success "Suspend/shutdown complete: $suspend_count processed, $skip_count skipped, $fail_count failed"
1131
+    log_info "VMs to auto-resume: ${to_resume[*]:-none}"
1132
+    log_info "VMs NOT to auto-resume (were suspended): ${was_suspended[*]:-none}"
1133
+    log_info "CTs to auto-start: ${ct_to_start[*]:-none}"
1134
+    
1135
+    return $fail_count
1136
+}
1137
+
1138
+do_cleanup() {
1139
+    log_info "Starting suspend artifact cleanup on $(hostname)"
1140
+
1141
+    load_vm_config_metadata
1142
+    cleanup_orphan_suspend_artifacts
1143
+    return $?
1144
+}
1145
+
1146
+# Main resume operation
1147
+do_resume() {
1148
+    log_info "Starting resume/start operation on $(hostname)"
1149
+    
1150
+    # Load all VM and CT info in one pass
1151
+    load_vm_info
1152
+    load_ct_info
1153
+    
1154
+    local state_json
1155
+    state_json=$(load_state)
1156
+    if [[ $? -ne 0 ]]; then
1157
+        log_warning "No saved state - nothing to resume"
1158
+        return 0
1159
+    fi
1160
+    
1161
+    # Parse state file
1162
+    local to_resume=($(echo "$state_json" | jq -r '.to_resume[]' 2>/dev/null))
1163
+    local was_suspended=($(echo "$state_json" | jq -r '.was_suspended[]' 2>/dev/null))
1164
+    local ct_to_start=($(echo "$state_json" | jq -r '.ct_to_start[]' 2>/dev/null))
1165
+    local saved_timestamp=$(echo "$state_json" | jq -r '.timestamp' 2>/dev/null)
1166
+    local -A saved_vm_volume=()
1167
+    local -A saved_vm_date=()
1168
+    local saved_volume
1169
+    local current_volume
1170
+
1171
+    while IFS=$'\t' read -r vmid saved_volume saved_date; do
1172
+        [[ -z "$vmid" ]] && continue
1173
+        saved_vm_volume[$vmid]="$saved_volume"
1174
+        saved_vm_date[$vmid]="$saved_date"
1175
+    done < <(
1176
+        echo "$state_json" | jq -r '
1177
+            (.vm_details // {})
1178
+            | to_entries[]
1179
+            | [.key, (.value.suspend_volume // ""), (.value.suspend_file_date // "")]
1180
+            | @tsv
1181
+        ' 2>/dev/null
1182
+    )
1183
+    
1184
+    log_info "State file from: $saved_timestamp"
1185
+    
1186
+    local resume_count=0
1187
+    local skip_count=0
1188
+    local fail_count=0
1189
+    
1190
+    # --- Resume QEMU VMs ---
1191
+    
1192
+    # Log warnings for VMs that won't be resumed
1193
+    for vmid in "${was_suspended[@]}"; do
1194
+        local name="${VM_NAME[$vmid]:-unknown}"
1195
+        log_warning "VM $vmid ($name) was already suspended before maintenance - NOT auto-resuming"
1196
+        ((skip_count++))
1197
+    done
1198
+    
1199
+    # Resume VMs that should be resumed
1200
+    for vmid in "${to_resume[@]}"; do
1201
+        local name="${VM_NAME[$vmid]:-unknown}"
1202
+        
1203
+        # Verify VM still exists and has suspend lock
1204
+        if [[ ! -f "/etc/pve/qemu-server/${vmid}.conf" ]]; then
1205
+            log_error "VM $vmid config not found - skipping"
1206
+            ((fail_count++))
1207
+            continue
1208
+        fi
1209
+        
1210
+        if [[ -z "${VM_HAS_LOCK[$vmid]}" ]]; then
1211
+            log_warning "VM $vmid ($name) no longer has suspend lock - may have been manually resumed"
1212
+            ((skip_count++))
1213
+            continue
1214
+        fi
1215
+
1216
+        saved_volume="${saved_vm_volume[$vmid]:-}"
1217
+        current_volume="${VM_VMSTATE[$vmid]:-}"
1218
+        if [[ -n "$saved_volume" && "$current_volume" != "$saved_volume" ]]; then
1219
+            log_warning "VM $vmid ($name) suspend volume changed since state file (${saved_vm_date[$vmid]:-unknown date}): saved=$saved_volume current=${current_volume:-none} - skipping auto-resume"
1220
+            ((skip_count++))
1221
+            continue
1222
+        fi
1223
+        
1224
+        resume_vm "$vmid"
1225
+        case $? in
1226
+            0) ((resume_count++)) ;;
1227
+            2) ((skip_count++)) ;;
1228
+            *) ((fail_count++)) ;;
1229
+        esac
1230
+    done
1231
+    
1232
+    # --- Start LXC Containers ---
1233
+    for ctid in "${ct_to_start[@]}"; do
1234
+        local name="${CT_NAME[$ctid]:-unknown}"
1235
+        
1236
+        # Verify CT still exists
1237
+        if [[ ! -f "/etc/pve/lxc/${ctid}.conf" ]]; then
1238
+            log_error "CT $ctid config not found - skipping"
1239
+            ((fail_count++))
1240
+            continue
1241
+        fi
1242
+        
1243
+        # Check if already running (someone started it manually)
1244
+        if [[ "${CT_STATUS[$ctid]}" == "running" ]]; then
1245
+            log_warning "CT $ctid ($name) is already running - skipping"
1246
+            ((skip_count++))
1247
+            continue
1248
+        fi
1249
+        
1250
+        start_ct "$ctid"
1251
+        case $? in
1252
+            0) ((resume_count++)) ;;
1253
+            2) ((skip_count++)) ;;
1254
+            *) ((fail_count++)) ;;
1255
+        esac
1256
+    done
1257
+    
1258
+    # Clear state file only on full success; keep it for retry if any failures.
1259
+    if [[ $fail_count -eq 0 ]]; then
1260
+        clear_state
1261
+    else
1262
+        log_warning "Resume/start encountered failures - keeping state file for retry"
1263
+    fi
1264
+    
1265
+    # Summary
1266
+    log_success "Resume/start complete: $resume_count restored, $skip_count skipped, $fail_count failed"
1267
+    
1268
+    return $fail_count
1269
+}
1270
+
1271
+# Acquire lock to prevent concurrent runs
1272
+acquire_lock() {
1273
+    if [[ $DRY_RUN -eq 1 ]]; then
1274
+        return 0
1275
+    fi
1276
+    
1277
+    if [[ -f "$LOCK_FILE" ]]; then
1278
+        local pid=$(cat "$LOCK_FILE" 2>/dev/null)
1279
+        if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
1280
+            log_error "Another instance is running (PID $pid)"
1281
+            exit 1
1282
+        fi
1283
+        # Stale lock file
1284
+        rm -f "$LOCK_FILE"
1285
+    fi
1286
+    
1287
+    echo $$ > "$LOCK_FILE"
1288
+    trap "rm -f '$LOCK_FILE'" EXIT
1289
+}
1290
+
1291
+# Parse command line
1292
+COMMAND=""
1293
+while [[ $# -gt 0 ]]; do
1294
+    case "$1" in
1295
+        suspend|resume|cleanup)
1296
+            COMMAND="$1"
1297
+            shift
1298
+            ;;
1299
+        -n|--dry-run)
1300
+            DRY_RUN=1
1301
+            shift
1302
+            ;;
1303
+        -v|--verbose)
1304
+            ((VERBOSE++))
1305
+            shift
1306
+            ;;
1307
+        -vv)
1308
+            VERBOSE=2
1309
+            shift
1310
+            ;;
1311
+        -h|--help)
1312
+            usage
1313
+            exit 0
1314
+            ;;
1315
+        *)
1316
+            echo "Unknown option: $1" >&2
1317
+            usage
1318
+            exit 1
1319
+            ;;
1320
+    esac
1321
+done
1322
+
1323
+if [[ -z "$COMMAND" ]]; then
1324
+    echo "Error: No command specified" >&2
1325
+    usage
1326
+    exit 1
1327
+fi
1328
+
1329
+# Ensure state directory exists
1330
+mkdir -p "$STATE_DIR"
1331
+
1332
+# Migrate state from the legacy location used by older installs.
1333
+migrate_legacy_state_if_needed
1334
+
1335
+# Acquire lock
1336
+acquire_lock
1337
+
1338
+# Execute command
1339
+case "$COMMAND" in
1340
+    suspend)
1341
+        do_suspend
1342
+        exit $?
1343
+        ;;
1344
+    resume)
1345
+        do_resume
1346
+        exit $?
1347
+        ;;
1348
+    cleanup)
1349
+        do_cleanup
1350
+        exit $?
1351
+        ;;
1352
+esac
+104 -0
projects/pve-guests-state/docs/DECISIONS.md
@@ -0,0 +1,104 @@
1
+# PGS - Intenție, Probleme și Compromis
2
+
3
+## Intenție
4
+
5
+Scopul inițial a fost simplu:
6
+- să se poată salva starea guest-urilor active înainte de lucrări de mentenanță;
7
+- să se poată restaura după revenirea nodurilor;
8
+- să se evite pornirea guest-urilor care erau deja suspendate sau oprite înainte de operație.
9
+
10
+Pentru VM-uri QEMU, asta a însemnat `qm suspend --todisk 1`.
11
+Pentru containere LXC, asta a însemnat `pct shutdown`, urmat de `pct start` la restaurare.
12
+
13
+## Abordarea inițială
14
+
15
+Prima variantă a fost automată:
16
+- `systemd` apela `suspend` la oprirea nodului;
17
+- `systemd` apela `resume` la revenirea nodului;
18
+- un fișier JSON local păstra lista guest-urilor care trebuiau restaurate.
19
+
20
+Motivația a fost să existe un flux "hands-off" pentru reboot și shutdown.
21
+
22
+## Probleme întâmpinate
23
+
24
+În practică, abordarea automată a fost fragilă pe un cluster Proxmox real.
25
+
26
+Problemele observate:
27
+- imagini stale de suspend:
28
+  - `disk image '...state-suspend-....raw' already exists`
29
+- scrieri eșuate în `pmxcfs` / `/etc/pve`:
30
+  - `Permission denied`
31
+  - `Device or resource busy`
32
+- ferestre fără quorum în timpul opririi sau revenirii nodurilor;
33
+- VM-uri care porneau, dar rămâneau cu `lock: suspended`;
34
+- restaurări parțiale, cu numai o parte din guest-uri repornite;
35
+- comportament dependent de ordinea exactă în care reveneau rețeaua, corosync, storage-ul și `pve-cluster`.
36
+
37
+Problema structurală a fost aceasta:
38
+- `qm suspend` și `qm resume` nu sunt operații pur locale;
39
+- ele au nevoie de scrieri coerente în `/etc/pve`;
40
+- `/etc/pve` depinde de `pmxcfs` și de starea clusterului;
41
+- în scenarii în care mai multe noduri se opresc sau pornesc în același timp, această dependență devine sursă de curse și inconsistențe.
42
+
43
+Am adăugat mai multe remedieri tactice:
44
+- cleanup pentru imagini stale;
45
+- retry după eșec;
46
+- relaxare temporară de quorum cu `pvecm expected 1`;
47
+- curățare automată pentru `lock: suspended`.
48
+
49
+Aceste remedieri au redus unele erori, dar nu au schimbat faptul că modelul automat rămânea nedeterminist în scenarii de mentenanță pe întregul cluster.
50
+
51
+## Concluzie
52
+
53
+Automatizarea la shutdown/boot nu a fost suficient de robustă pentru mediul real.
54
+
55
+Mai exact:
56
+- pentru reboot de un singur nod, putea funcționa uneori acceptabil;
57
+- pentru lucrări în care mai multe noduri sau întregul cluster sunt oprite, rezultatele nu au fost suficient de predictibile.
58
+
59
+Problema nu mai era un bug punctual, ci o nepotrivire între design și condițiile reale de operare.
60
+
61
+## Compromisul ales
62
+
63
+A fost aleasă o variantă mai simplă și mai controlabilă:
64
+- fără automatizare `systemd`;
65
+- fără hook-uri la shutdown sau boot;
66
+- suspendarea se rulează manual înainte de mentenanță;
67
+- restaurarea se rulează manual după revenirea clusterului.
68
+
69
+Comenzile sunt:
70
+
71
+```bash
72
+/usr/local/sbin/pgs suspend -v
73
+/usr/local/sbin/pgs resume -v
74
+```
75
+
76
+## De ce acest compromis este acceptabil
77
+
78
+Se pierde comoditatea automatizării, dar se câștigă:
79
+- control explicit asupra momentului execuției;
80
+- posibilitatea de a aștepta revenirea clusterului înainte de `resume`;
81
+- debug mai simplu;
82
+- mai puține efecte surprinzătoare în timpul shutdown-ului;
83
+- separare clară între pregătirea mentenanței și restaurarea ulterioară.
84
+
85
+Practic, operatorul decide când clusterul este suficient de stabil pentru restaurare, în loc să lase asta pe seama ordinii de pornire a serviciilor.
86
+
87
+## Ce rămâne intenționat în cod
88
+
89
+Deși automatizarea a fost eliminată, scriptul păstrează unele protecții utile:
90
+- detecție și cleanup pentru imagini stale de suspend;
91
+- tratament pentru guest-uri deja pornite sau deja suspendate;
92
+- cleanup pentru `lock: suspended` când este posibil;
93
+- jurnalizare clară în `journalctl -t pgs`.
94
+
95
+Acestea rămân utile și în fluxul manual.
96
+
97
+## Ce nu mai face proiectul
98
+
99
+Proiectul nu mai încearcă:
100
+- să orchestreze reboot-ul nodurilor;
101
+- să decidă automat momentul corect pentru restore;
102
+- să garanteze restaurare automată după revenirea clusterului.
103
+
104
+Acesta este acum un utilitar manual de guest state, nu un manager de reboot.
+97 -0
projects/pve-guests-state/docs/TECHNICAL.md
@@ -0,0 +1,97 @@
1
+# PGS - Technical Notes
2
+
3
+## Rol
4
+
5
+`pgs` ofera un flux manual si predictibil pentru:
6
+- suspend to disk la VM-uri QEMU aflate in rulare
7
+- shutdown graceful la containere LXC aflate in rulare
8
+- resume/start dupa mentenanta pe baza unui state file local
9
+
10
+## Comanda instalata
11
+
12
+- locatie: `/usr/local/sbin/pgs`
13
+- uninstall canonic: `/usr/local/lib/xdev/pve-guests-state/uninstall.sh`
14
+- documentatie instalata: `/usr/local/share/doc/xdev/pve-guests-state`
15
+
16
+## State runtime
17
+
18
+- locatie curenta: `/var/lib/xdev/pve-guests-state/pgs-state.json`
19
+- locatie legacy acceptata pentru migrare: `/var/lib/pve-manager/pgs-state.json`
20
+- lock file: `/run/pgs.lock`
21
+
22
+State file-ul contine:
23
+- `timestamp`
24
+- `hostname`
25
+- `to_resume`
26
+- `was_suspended`
27
+- `ct_to_start`
28
+- `vm_details`
29
+  - `mode`
30
+  - `suspend_volume`
31
+  - `suspend_file_date`
32
+
33
+## Comenzi
34
+
35
+```bash
36
+/usr/local/sbin/pgs suspend [-v] [--dry-run]
37
+/usr/local/sbin/pgs resume [-v] [--dry-run]
38
+/usr/local/sbin/pgs cleanup [-v] [--dry-run]
39
+```
40
+
41
+## Comportament
42
+
43
+### Suspend
44
+
45
+- preflight cleanup pentru volume orphan/stale `vm-*-state-suspend-YYYY-MM-DD.raw`
46
+- VM running -> `qm suspend --todisk 1` -> adaugat in `to_resume`
47
+- VM paused/suspended RAM -> suspend to disk, dar nu intra in `to_resume`
48
+- VM deja suspendat pe disk -> warning, fara auto-resume; detectia pentru disk suspend cere `lock: suspended`, `vmstate:` in config si un volum de saved-state rezolvabil in storage
49
+- CT running -> `pct shutdown --timeout 120` -> adaugat in `ct_to_start`
50
+- daca exista deja state file, un nou `suspend` face merge peste state-ul existent si pastreaza intentia anterioara de `to_resume`
51
+- pentru fiecare VM retinut in state se salveaza si `suspend_volume` plus `suspend_file_date`
52
+
53
+### Cleanup
54
+
55
+- scaneaza storage-urile cu `content images` definite in `/etc/pve/storage.cfg`
56
+- cauta exclusiv fisiere `vm-*-state-suspend-YYYY-MM-DD.raw`
57
+- ignora fisiere de forma `vm-*-state-cp*.raw`
58
+- daca un volum `state-suspend` este referit de un VM valid suspendat, il pastreaza
59
+- daca un volum `state-suspend` este referit, dar VM-ul nu mai are stare valida de suspend, curata `lock`, `vmstate` si volumul
60
+- daca un volum `state-suspend` nu mai este referit de niciun VM, il trateaza ca orphan si il sterge
61
+
62
+### Resume
63
+
64
+- VMs din `to_resume` -> `qm resume`
65
+- CTs din `ct_to_start` -> `pct start`
66
+- daca `suspend_volume` curent nu mai corespunde cu cel din state, VM-ul este tratat ca alterat dupa salvarea state-ului si nu este auto-resumat
67
+- daca apar esecuri, state file-ul ramane pentru retry
68
+- daca totul reuseste, state file-ul este sters
69
+
70
+## Protectii implementate
71
+
72
+- stale suspend image cleanup
73
+- cleanup pentru volume orphan `vm-*-state-suspend-YYYY-MM-DD.raw`
74
+- retry dupa erori specifice de quorum
75
+- `pvecm expected 1` in fereastra de mentenanta, cand eroarea indica lipsa de quorum
76
+- cleanup pentru `lock: suspended` cand VM-ul este deja running
77
+- cleanup pentru artefacte stale de suspend pe VM-uri `stopped`: `lock: suspended`, `vmstate:` ramas in config si volume orphaned de saved-state
78
+- lock local pentru a preveni rulari concurente
79
+
80
+## Logging
81
+
82
+- interactiva: output pe terminal
83
+- prin systemd/journal stream: evitarea dublarii mesajelor in journal
84
+- tag jurnal: `pgs`
85
+
86
+Exemple:
87
+
88
+```bash
89
+journalctl -t pgs -n 50 --no-pager
90
+journalctl -t pgs -f
91
+```
92
+
93
+## Note de design
94
+
95
+- proiectul nu mai foloseste unitati systemd pentru execuție automata
96
+- fisierele din `systemd/` sunt legacy si nu fac parte din install-ul curent
97
+- proiectul nu are inca propriul config persistent in `/etc/xdev/...`
+8 -0
projects/pve-guests-state/pve-guests-state.code-workspace
@@ -0,0 +1,8 @@
1
+{
2
+	"folders": [
3
+		{
4
+			"path": "."
5
+		}
6
+	],
7
+	"settings": {}
8
+}
+141 -0
projects/pve-guests-state/scripts/install.sh
@@ -0,0 +1,141 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="pve-guests-state"
6
+ORG_ID="xdev"
7
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
8
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
9
+STATE_DIR="/var/lib/${ORG_ID}/${PROJECT_ID}"
10
+COMMAND_PATH="/usr/local/sbin/pgs"
11
+UNINSTALL_PATH="${INSTALL_DIR}/uninstall.sh"
12
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
13
+
14
+SOURCE_DIR=""
15
+
16
+usage() {
17
+    cat <<EOF
18
+Usage: $0 [--source-dir <path>]
19
+
20
+Install ${PROJECT_ID} on the current host.
21
+EOF
22
+}
23
+
24
+require_root() {
25
+    if [[ "${EUID}" -ne 0 ]]; then
26
+        echo "ERROR: this script must be run as root" >&2
27
+        exit 1
28
+    fi
29
+}
30
+
31
+resolve_source_dir() {
32
+    if [[ -n "${SOURCE_DIR}" ]]; then
33
+        SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
34
+    else
35
+        SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
36
+    fi
37
+}
38
+
39
+validate_source_tree() {
40
+    local required_files=(
41
+        "${SOURCE_DIR}/bin/pgs"
42
+        "${SOURCE_DIR}/scripts/uninstall.sh"
43
+        "${SOURCE_DIR}/README.md"
44
+        "${SOURCE_DIR}/INSTALL.md"
45
+        "${SOURCE_DIR}/CHANGELOG.md"
46
+        "${SOURCE_DIR}/LICENSE"
47
+        "${SOURCE_DIR}/docs/DECISIONS.md"
48
+        "${SOURCE_DIR}/docs/TECHNICAL.md"
49
+    )
50
+    local file=""
51
+    for file in "${required_files[@]}"; do
52
+        if [[ ! -f "${file}" ]]; then
53
+            echo "ERROR: missing required source file: ${file}" >&2
54
+            exit 1
55
+        fi
56
+    done
57
+}
58
+
59
+cleanup_legacy_artifacts() {
60
+    rm -f /usr/local/sbin/pve-reboot-manager.sh
61
+    rm -f /usr/local/sbin/pve-guest-state.sh
62
+    rm -f /root/bin/pgs
63
+    rm -f /root/bin/pve-reboot-manager.sh
64
+    rm -f /root/bin/pve-guest-state.sh
65
+
66
+    systemctl disable pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true
67
+    systemctl stop pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true
68
+    rm -f /etc/systemd/system/pve-suspend-vms.service
69
+    rm -f /etc/systemd/system/pve-resume-vms.service
70
+    systemctl daemon-reload
71
+    systemctl reset-failed pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true
72
+}
73
+
74
+run_existing_uninstall() {
75
+    if [[ -x "${UNINSTALL_PATH}" ]]; then
76
+        echo "Existing installation detected. Running canonical uninstall first..."
77
+        "${UNINSTALL_PATH}" --force || true
78
+    else
79
+        bash "${SOURCE_DIR}/scripts/uninstall.sh" --force || true
80
+    fi
81
+}
82
+
83
+install_docs() {
84
+    mkdir -p "${DOC_DIR}/docs"
85
+    cp "${SOURCE_DIR}/README.md" "${DOC_DIR}/"
86
+    cp "${SOURCE_DIR}/INSTALL.md" "${DOC_DIR}/"
87
+    cp "${SOURCE_DIR}/CHANGELOG.md" "${DOC_DIR}/"
88
+    cp "${SOURCE_DIR}/LICENSE" "${DOC_DIR}/"
89
+    cp "${SOURCE_DIR}/docs/DECISIONS.md" "${DOC_DIR}/docs/"
90
+    cp "${SOURCE_DIR}/docs/TECHNICAL.md" "${DOC_DIR}/docs/"
91
+}
92
+
93
+main() {
94
+    while [[ $# -gt 0 ]]; do
95
+        case "$1" in
96
+            --source-dir)
97
+                SOURCE_DIR="$2"
98
+                shift 2
99
+                ;;
100
+            -h|--help)
101
+                usage
102
+                exit 0
103
+                ;;
104
+            *)
105
+                echo "ERROR: unknown option: $1" >&2
106
+                usage
107
+                exit 1
108
+                ;;
109
+        esac
110
+    done
111
+
112
+    require_root
113
+    resolve_source_dir
114
+    validate_source_tree
115
+
116
+    echo "=== Installing ${PROJECT_ID} ==="
117
+    run_existing_uninstall
118
+
119
+    mkdir -p "${INSTALL_DIR}" "${DOC_DIR}" "${STATE_DIR}" /usr/local/sbin
120
+
121
+    cleanup_legacy_artifacts
122
+
123
+    install -m 0755 "${SOURCE_DIR}/bin/pgs" "${COMMAND_PATH}"
124
+    install -m 0755 "${SOURCE_DIR}/scripts/uninstall.sh" "${UNINSTALL_PATH}"
125
+    ln -sfn "${UNINSTALL_PATH}" "${UNINSTALL_WRAPPER}"
126
+
127
+    install_docs
128
+
129
+    echo "Installed paths:"
130
+    echo "  command: ${COMMAND_PATH}"
131
+    echo "  uninstall: ${UNINSTALL_PATH}"
132
+    echo "  docs: ${DOC_DIR}"
133
+    echo "  state: ${STATE_DIR}"
134
+    echo ""
135
+    echo "Running dry-run verification..."
136
+    "${COMMAND_PATH}" suspend --dry-run -v 2>&1 | tail -3 || true
137
+    echo ""
138
+    echo "Installation completed."
139
+}
140
+
141
+main "$@"
+86 -0
projects/pve-guests-state/scripts/uninstall.sh
@@ -0,0 +1,86 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="pve-guests-state"
6
+ORG_ID="xdev"
7
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
8
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
9
+STATE_DIR="/var/lib/${ORG_ID}/${PROJECT_ID}"
10
+STATE_FILE="${STATE_DIR}/pgs-state.json"
11
+COMMAND_PATH="/usr/local/sbin/pgs"
12
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
13
+
14
+FORCE_MODE=0
15
+
16
+log() {
17
+    if [[ "${FORCE_MODE}" -eq 0 ]]; then
18
+        echo "$@"
19
+    fi
20
+}
21
+
22
+require_root() {
23
+    if [[ "${EUID}" -ne 0 ]]; then
24
+        echo "ERROR: this script must be run as root" >&2
25
+        exit 1
26
+    fi
27
+}
28
+
29
+cleanup_legacy_artifacts() {
30
+    rm -f /usr/local/sbin/pve-reboot-manager.sh
31
+    rm -f /usr/local/sbin/pve-guest-state.sh
32
+    rm -f /root/bin/pgs
33
+    rm -f /root/bin/pve-reboot-manager.sh
34
+    rm -f /root/bin/pve-guest-state.sh
35
+
36
+    rm -f /var/lib/pve-manager/pgs-state.json
37
+    rm -f /var/lib/pve-manager/guest-state.json
38
+    rm -f /var/lib/pve-manager/reboot-vm-state.json
39
+}
40
+
41
+main() {
42
+    while [[ $# -gt 0 ]]; do
43
+        case "$1" in
44
+            --force)
45
+                FORCE_MODE=1
46
+                shift
47
+                ;;
48
+            -h|--help)
49
+                echo "Usage: $0 [--force]"
50
+                exit 0
51
+                ;;
52
+            *)
53
+                echo "ERROR: unknown option: $1" >&2
54
+                exit 1
55
+                ;;
56
+        esac
57
+    done
58
+
59
+    require_root
60
+
61
+    log "=== Uninstalling ${PROJECT_ID} ==="
62
+
63
+    systemctl disable pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true
64
+    systemctl stop pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true
65
+    rm -f /etc/systemd/system/pve-suspend-vms.service
66
+    rm -f /etc/systemd/system/pve-resume-vms.service
67
+    systemctl daemon-reload
68
+    systemctl reset-failed pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true
69
+
70
+    rm -f "${UNINSTALL_WRAPPER}"
71
+    rm -f "${COMMAND_PATH}"
72
+    rm -f "${STATE_FILE}"
73
+    rm -rf "${DOC_DIR}"
74
+    rm -rf "${INSTALL_DIR}"
75
+    rm -rf "${STATE_DIR}"
76
+
77
+    cleanup_legacy_artifacts
78
+
79
+    rmdir /usr/local/lib/${ORG_ID} 2>/dev/null || true
80
+    rmdir /usr/local/share/doc/${ORG_ID} 2>/dev/null || true
81
+    rmdir /var/lib/${ORG_ID} 2>/dev/null || true
82
+
83
+    log "Uninstall complete."
84
+}
85
+
86
+main "$@"
+153 -0
projects/pve-guests-state/setup.sh
@@ -0,0 +1,153 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="pve-guests-state"
6
+ORG_ID="xdev"
7
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
8
+MODE="install"
9
+REMOTE_NODE=""
10
+REMOTE_USER="root"
11
+LOCAL_MODE=0
12
+
13
+show_help() {
14
+    cat <<EOF
15
+PGS setup wrapper
16
+
17
+Usage: $0 [OPTIONS] [<target_node>]
18
+
19
+Options:
20
+  -h, --help           Show this help message
21
+  -l, --local          Run on localhost
22
+  -u, --uninstall      Uninstall instead of install
23
+  --user <user>        Remote SSH user (default: root)
24
+
25
+Examples:
26
+  $0 --local
27
+  $0 pve-node-2
28
+  $0 --user admin pve-node-2
29
+  $0 --uninstall pve-node-2
30
+  $0 --local --uninstall
31
+EOF
32
+}
33
+
34
+run_local_install() {
35
+    bash "${SCRIPT_DIR}/scripts/install.sh" --source-dir "${SCRIPT_DIR}"
36
+}
37
+
38
+run_local_uninstall() {
39
+    local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
40
+
41
+    if [[ -x "${canonical}" ]]; then
42
+        "${canonical}"
43
+    else
44
+        bash "${SCRIPT_DIR}/scripts/uninstall.sh"
45
+    fi
46
+}
47
+
48
+copy_remote_tree() {
49
+    local target="$1"
50
+    local remote_tmp="$2"
51
+
52
+    ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/bin' '${remote_tmp}/scripts' '${remote_tmp}/docs'"
53
+    scp -q "${SCRIPT_DIR}/bin/pgs" "${target}:${remote_tmp}/bin/"
54
+    scp -q "${SCRIPT_DIR}/scripts/install.sh" "${target}:${remote_tmp}/scripts/"
55
+    scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
56
+    scp -q "${SCRIPT_DIR}/README.md" "${target}:${remote_tmp}/"
57
+    scp -q "${SCRIPT_DIR}/INSTALL.md" "${target}:${remote_tmp}/"
58
+    scp -q "${SCRIPT_DIR}/CHANGELOG.md" "${target}:${remote_tmp}/"
59
+    scp -q "${SCRIPT_DIR}/LICENSE" "${target}:${remote_tmp}/"
60
+    scp -q "${SCRIPT_DIR}/docs/DECISIONS.md" "${target}:${remote_tmp}/docs/"
61
+    scp -q "${SCRIPT_DIR}/docs/TECHNICAL.md" "${target}:${remote_tmp}/docs/"
62
+}
63
+
64
+run_remote_install() {
65
+    local target="$1"
66
+    local remote_tmp="/tmp/${PROJECT_ID}.$$"
67
+    local remote_prefix=""
68
+
69
+    [[ "${REMOTE_USER}" != "root" ]] && remote_prefix="sudo "
70
+
71
+    copy_remote_tree "${target}" "${remote_tmp}"
72
+    ssh "${target}" "${remote_prefix}bash '${remote_tmp}/scripts/install.sh' --source-dir '${remote_tmp}'"
73
+    ssh "${target}" "rm -rf '${remote_tmp}'"
74
+}
75
+
76
+run_remote_uninstall() {
77
+    local target="$1"
78
+    local remote_tmp="/tmp/${PROJECT_ID}-uninstall.$$"
79
+    local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
80
+
81
+    ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts'"
82
+    scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
83
+    if [[ "${REMOTE_USER}" == "root" ]]; then
84
+        ssh "${target}" "if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi"
85
+    else
86
+        ssh "${target}" "sudo bash -lc \"if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi\""
87
+    fi
88
+    ssh "${target}" "rm -rf '${remote_tmp}'"
89
+}
90
+
91
+while [[ $# -gt 0 ]]; do
92
+    case "$1" in
93
+        -h|--help)
94
+            show_help
95
+            exit 0
96
+            ;;
97
+        -l|--local)
98
+            LOCAL_MODE=1
99
+            shift
100
+            ;;
101
+        -u|--uninstall)
102
+            MODE="uninstall"
103
+            shift
104
+            ;;
105
+        --user)
106
+            REMOTE_USER="$2"
107
+            shift 2
108
+            ;;
109
+        -*)
110
+            echo "ERROR: unknown option: $1" >&2
111
+            show_help
112
+            exit 1
113
+            ;;
114
+        *)
115
+            REMOTE_NODE="$1"
116
+            shift
117
+            ;;
118
+    esac
119
+done
120
+
121
+if [[ -z "${REMOTE_NODE}" && ${LOCAL_MODE} -eq 0 ]]; then
122
+    LOCAL_MODE=1
123
+fi
124
+
125
+echo "================================"
126
+echo "PGS - ${MODE}"
127
+echo "================================"
128
+
129
+if [[ ${LOCAL_MODE} -eq 1 ]]; then
130
+    echo "Target: localhost"
131
+    echo ""
132
+    if [[ "${MODE}" == "install" ]]; then
133
+        run_local_install
134
+    else
135
+        run_local_uninstall
136
+    fi
137
+    exit 0
138
+fi
139
+
140
+TARGET="${REMOTE_USER}@${REMOTE_NODE}"
141
+echo "Target: ${TARGET}"
142
+echo ""
143
+
144
+if ! ping -c 1 "${REMOTE_NODE}" >/dev/null 2>&1; then
145
+    echo "ERROR: cannot reach ${REMOTE_NODE}" >&2
146
+    exit 1
147
+fi
148
+
149
+if [[ "${MODE}" == "install" ]]; then
150
+    run_remote_install "${TARGET}"
151
+else
152
+    run_remote_uninstall "${TARGET}"
153
+fi
+9 -0
projects/pve-guests-state/systemd/README.md
@@ -0,0 +1,9 @@
1
+Legacy automation units retained for reference only.
2
+
3
+These systemd unit files are not installed by the current project workflow.
4
+The supported operational model is manual:
5
+
6
+- `/usr/local/sbin/pgs suspend -v`
7
+- `/usr/local/sbin/pgs resume -v`
8
+
9
+Current install and uninstall scripts explicitly remove these legacy units from hosts.
+27 -0
projects/pve-guests-state/systemd/pve-resume-vms.service
@@ -0,0 +1,27 @@
1
+[Unit]
2
+Description=Resume PVE VMs manually
3
+Documentation=man:qm(1)
4
+
5
+# Only run if we have a state file from previous suspend
6
+ConditionPathExists=/var/lib/pve-manager/pgs-state.json
7
+
8
+# We need pve-cluster for /etc/pve access
9
+Requires=pve-cluster.service
10
+After=pve-cluster.service
11
+
12
+# Run after storage is available
13
+After=pve-storage.target
14
+
15
+# Run before the standard pve-guests service to handle our VMs first
16
+Before=pve-guests.service
17
+
18
+[Service]
19
+Type=oneshot
20
+ExecStart=/usr/local/sbin/pgs resume -v
21
+# Allow generous time for VMs to resume
22
+TimeoutStartSec=900
23
+Restart=on-failure
24
+RestartSec=20
25
+
26
+[Install]
27
+WantedBy=multi-user.target
+32 -0
projects/pve-guests-state/systemd/pve-suspend-vms.service
@@ -0,0 +1,32 @@
1
+[Unit]
2
+Description=Suspend PVE VMs to disk manually
3
+Documentation=man:qm(1)
4
+
5
+# Only run if pve-cluster is available (not rescue/recovery)
6
+ConditionPathExists=/var/lib/pve-cluster/config.db
7
+
8
+# We need storage and cluster access when stopping (suspend needs these alive)
9
+Requires=pve-cluster.service
10
+After=pve-cluster.service network.target local-fs.target remote-fs.target
11
+
12
+# Start AFTER pve-guests → during shutdown we stop BEFORE pve-guests
13
+# Critical: ensures we suspend VMs before pve-guests runs "stopall"
14
+After=pve-guests.service
15
+
16
+[Service]
17
+Type=oneshot
18
+RemainAfterExit=yes
19
+
20
+# Trivial start - just marks the service as "active" while the node is up
21
+# The actual work happens in ExecStop during shutdown
22
+ExecStart=/bin/true
23
+
24
+# REAL work: suspend VMs and shutdown CTs when system is going down
25
+ExecStop=/usr/local/sbin/pgs suspend -v
26
+
27
+# Allow generous time for all VMs to suspend to disk
28
+TimeoutStopSec=900
29
+
30
+[Install]
31
+WantedBy=multi-user.target
32
+
+15 -0
projects/pve-net-hang-watchdog/CHANGELOG.md
@@ -0,0 +1,15 @@
1
+# pve-net-hang-watchdog Changelog
2
+
3
+## [1.0] - 2026-03-06
4
+
5
+### Added
6
+- Dedicated `scripts/install.sh` and `scripts/uninstall.sh`
7
+- `setup.sh` wrapper for local and remote lifecycle operations
8
+- Standardized defaults file at `/etc/default/xdev-pve-net-hang-watchdog`
9
+- Installed documentation under `/usr/local/share/doc/xdev/pve-net-hang-watchdog`
10
+
11
+### Changed
12
+- Standardized uninstall path to `/usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh`
13
+- Updated systemd unit to use the namespaced defaults file
14
+- Standardized project documentation and install workflow
15
+- Installer now performs `systemctl enable --now` so the watchdog is active immediately after install
+66 -0
projects/pve-net-hang-watchdog/INSTALL.md
@@ -0,0 +1,66 @@
1
+# Instalare
2
+
3
+## Metoda recomandata
4
+
5
+Wrapper-ul `setup.sh` este metoda standard de install si uninstall.
6
+
7
+### Instalare locala
8
+
9
+```bash
10
+sudo ./setup.sh --local
11
+```
12
+
13
+### Instalare remote
14
+
15
+```bash
16
+sudo ./setup.sh <node>
17
+sudo ./setup.sh --user admin <node>
18
+```
19
+
20
+## Ce instaleaza
21
+
22
+- `/usr/local/sbin/pve-net-hang-watchdog.sh`
23
+- `/usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh`
24
+- `/usr/local/sbin/xdev-pve-net-hang-watchdog-uninstall`
25
+- `/etc/default/xdev-pve-net-hang-watchdog`
26
+- `/etc/systemd/system/pve-net-hang-watchdog.service`
27
+- `/usr/local/share/doc/xdev/pve-net-hang-watchdog/*`
28
+
29
+## Activare
30
+
31
+Installerul face:
32
+- `systemctl daemon-reload`
33
+- `systemctl enable --now pve-net-hang-watchdog.service`
34
+
35
+Verificare:
36
+
37
+```bash
38
+sudo systemctl status pve-net-hang-watchdog.service
39
+```
40
+
41
+## Configurare
42
+
43
+Defaults instalate:
44
+
45
+```bash
46
+sudo editor /etc/default/xdev-pve-net-hang-watchdog
47
+```
48
+
49
+Parametri suportati:
50
+- `WATCH_BRIDGE`
51
+- `WATCH_IFACE`
52
+- `COOLDOWN_SECONDS`
53
+- `HANG_PATTERN`
54
+
55
+## Uninstall
56
+
57
+```bash
58
+sudo ./setup.sh --local --uninstall
59
+sudo ./setup.sh --uninstall <node>
60
+```
61
+
62
+Sau direct pe host:
63
+
64
+```bash
65
+sudo /usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh
66
+```
+72 -0
projects/pve-net-hang-watchdog/README.md
@@ -0,0 +1,72 @@
1
+# pve-net-hang-watchdog
2
+
3
+`pve-net-hang-watchdog` este un serviciu simplu care urmareste jurnalul kernel pentru hang-uri de NIC si incearca recuperarea uplink-ului prin `ifdown` si `ifup`.
4
+
5
+## Rol
6
+
7
+Util pentru noduri Proxmox unde interfata fizica din spatele unui bridge WAN poate intra in stare de hang hardware, iar recovery-ul cel mai pragmatic este ciclarea link-ului.
8
+
9
+## Componente
10
+
11
+- [bin/pve-net-hang-watchdog.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-net-hang-watchdog/bin/pve-net-hang-watchdog.sh) - scriptul principal
12
+- [systemd/pve-net-hang-watchdog.service](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-net-hang-watchdog/systemd/pve-net-hang-watchdog.service) - unitatea systemd
13
+- [config/xdev-pve-net-hang-watchdog](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-net-hang-watchdog/config/xdev-pve-net-hang-watchdog) - defaults standard
14
+- [scripts/install.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-net-hang-watchdog/scripts/install.sh) - install local
15
+- [scripts/uninstall.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-net-hang-watchdog/scripts/uninstall.sh) - uninstall canonic
16
+- [setup.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-net-hang-watchdog/setup.sh) - wrapper local/remote
17
+
18
+## Locatii instalate pe host
19
+
20
+- comanda/daemon script: `/usr/local/sbin/pve-net-hang-watchdog.sh`
21
+- uninstall canonic: `/usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh`
22
+- wrapper optional pentru uninstall: `/usr/local/sbin/xdev-pve-net-hang-watchdog-uninstall`
23
+- defaults: `/etc/default/xdev-pve-net-hang-watchdog`
24
+- unitate systemd: `/etc/systemd/system/pve-net-hang-watchdog.service`
25
+- documentatie instalata: `/usr/local/share/doc/xdev/pve-net-hang-watchdog`
26
+
27
+## Configurare
28
+
29
+Parametri suportati prin defaults:
30
+
31
+- `WATCH_BRIDGE`
32
+- `WATCH_IFACE`
33
+- `COOLDOWN_SECONDS`
34
+- `HANG_PATTERN`
35
+
36
+Daca `WATCH_IFACE` este gol, scriptul incearca sa descopere automat interfata fizica din `bridge-ports`.
37
+
38
+## Flux rapid
39
+
40
+```bash
41
+sudo ./setup.sh --local
42
+sudo systemctl status pve-net-hang-watchdog.service
43
+```
44
+
45
+## Operare
46
+
47
+Loguri:
48
+
49
+```bash
50
+journalctl -u pve-net-hang-watchdog.service -f
51
+```
52
+
53
+Configurare:
54
+
55
+```bash
56
+sudo editor /etc/default/xdev-pve-net-hang-watchdog
57
+sudo systemctl restart pve-net-hang-watchdog.service
58
+```
59
+
60
+Installerul face si `enable --now`, deci dupa instalare serviciul este deja pornit.
61
+
62
+## Uninstall
63
+
64
+```bash
65
+sudo ./setup.sh --local --uninstall
66
+```
67
+
68
+Sau direct:
69
+
70
+```bash
71
+sudo /usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh
72
+```
+102 -0
projects/pve-net-hang-watchdog/bin/pve-net-hang-watchdog.sh
@@ -0,0 +1,102 @@
1
+#!/bin/bash
2
+
3
+set -u
4
+
5
+WATCH_BRIDGE="${WATCH_BRIDGE:-vmbr443}"
6
+WATCH_IFACE="${WATCH_IFACE:-}"
7
+COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-30}"
8
+HANG_PATTERN="${HANG_PATTERN:-Detected Hardware Unit Hang:}"
9
+
10
+log() {
11
+    printf '%s %s\n' "$(date -Is)" "$*" >&2
12
+}
13
+
14
+discover_watch_iface() {
15
+    local candidate=""
16
+
17
+    if [[ -n "$WATCH_IFACE" ]]; then
18
+        printf '%s\n' "$WATCH_IFACE"
19
+        return 0
20
+    fi
21
+
22
+    if [[ -r /etc/network/interfaces ]]; then
23
+        candidate="$(
24
+            awk -v bridge="$WATCH_BRIDGE" '
25
+                $1 == "iface" && $2 == bridge { in_bridge = 1; next }
26
+                $1 == "iface" && $2 != bridge { in_bridge = 0 }
27
+                in_bridge && $1 == "bridge-ports" { print $2; exit }
28
+            ' /etc/network/interfaces
29
+        )"
30
+    fi
31
+
32
+    if [[ -z "$candidate" && -d /etc/network/interfaces.d ]]; then
33
+        candidate="$(
34
+            awk -v bridge="$WATCH_BRIDGE" '
35
+                $1 == "iface" && $2 == bridge { in_bridge = 1; next }
36
+                $1 == "iface" && $2 != bridge { in_bridge = 0 }
37
+                in_bridge && $1 == "bridge-ports" { print $2; exit }
38
+            ' /etc/network/interfaces.d/* 2>/dev/null
39
+        )"
40
+    fi
41
+
42
+    if [[ -n "$candidate" ]]; then
43
+        printf '%s\n' "${candidate%%.*}"
44
+        return 0
45
+    fi
46
+
47
+    return 1
48
+}
49
+
50
+require_command() {
51
+    local cmd="$1"
52
+    if ! command -v "$cmd" >/dev/null 2>&1; then
53
+        log "missing required command: $cmd"
54
+        exit 1
55
+    fi
56
+}
57
+
58
+recover_iface() {
59
+    local iface="$1"
60
+
61
+    log "hardware hang detected on $iface; cycling link with ifdown/ifup"
62
+    ifdown --force "$iface" || log "ifdown reported a non-zero exit code for $iface"
63
+    sleep 2
64
+    if ! ifup "$iface"; then
65
+        log "ifup failed for $iface"
66
+        return 1
67
+    fi
68
+    log "link recovery finished for $iface"
69
+}
70
+
71
+main() {
72
+    local iface=""
73
+    local last_recovery=0
74
+    local now=0
75
+    local line=""
76
+
77
+    require_command journalctl
78
+    require_command ifdown
79
+    require_command ifup
80
+
81
+    if ! iface="$(discover_watch_iface)"; then
82
+        log "failed to determine uplink interface for bridge $WATCH_BRIDGE"
83
+        exit 1
84
+    fi
85
+
86
+    log "watching journald for '$HANG_PATTERN' on interface $iface"
87
+
88
+    while IFS= read -r line; do
89
+        [[ "$line" == *"$iface: $HANG_PATTERN"* ]] || continue
90
+
91
+        now="$(date +%s)"
92
+        if (( now - last_recovery < COOLDOWN_SECONDS )); then
93
+            log "skipping duplicate event for $iface during cooldown (${COOLDOWN_SECONDS}s)"
94
+            continue
95
+        fi
96
+
97
+        last_recovery="$now"
98
+        recover_iface "$iface"
99
+    done < <(journalctl --dmesg --follow --since now --output=cat)
100
+}
101
+
102
+main "$@"
+18 -0
projects/pve-net-hang-watchdog/config/xdev-pve-net-hang-watchdog
@@ -0,0 +1,18 @@
1
+# Default environment for pve-net-hang-watchdog
2
+#
3
+# Copy or install to:
4
+#   /etc/default/xdev-pve-net-hang-watchdog
5
+#
6
+# Uncomment to override defaults.
7
+
8
+# Bridge whose uplink should be monitored for NIC hardware hang recovery.
9
+# WATCH_BRIDGE=vmbr443
10
+
11
+# Explicit interface to recover. If empty, the script auto-discovers bridge-ports.
12
+# WATCH_IFACE=
13
+
14
+# Minimum number of seconds between recoveries for duplicate events.
15
+# COOLDOWN_SECONDS=30
16
+
17
+# Journal pattern that identifies the hardware hang message.
18
+# HANG_PATTERN=Detected Hardware Unit Hang:
+130 -0
projects/pve-net-hang-watchdog/scripts/install.sh
@@ -0,0 +1,130 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="pve-net-hang-watchdog"
6
+ORG_ID="xdev"
7
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
8
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
9
+COMMAND_PATH="/usr/local/sbin/pve-net-hang-watchdog.sh"
10
+UNINSTALL_PATH="${INSTALL_DIR}/uninstall.sh"
11
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
12
+CONFIG_PATH="/etc/default/${ORG_ID}-${PROJECT_ID}"
13
+UNIT_PATH="/etc/systemd/system/${PROJECT_ID}.service"
14
+
15
+SOURCE_DIR=""
16
+
17
+usage() {
18
+    cat <<EOF
19
+Usage: $0 [--source-dir <path>]
20
+
21
+Install ${PROJECT_ID} on the current host.
22
+EOF
23
+}
24
+
25
+require_root() {
26
+    if [[ "${EUID}" -ne 0 ]]; then
27
+        echo "ERROR: this script must be run as root" >&2
28
+        exit 1
29
+    fi
30
+}
31
+
32
+resolve_source_dir() {
33
+    if [[ -n "${SOURCE_DIR}" ]]; then
34
+        SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
35
+    else
36
+        SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
37
+    fi
38
+}
39
+
40
+validate_source_tree() {
41
+    local required_files=(
42
+        "${SOURCE_DIR}/bin/pve-net-hang-watchdog.sh"
43
+        "${SOURCE_DIR}/systemd/pve-net-hang-watchdog.service"
44
+        "${SOURCE_DIR}/config/xdev-pve-net-hang-watchdog"
45
+        "${SOURCE_DIR}/scripts/uninstall.sh"
46
+        "${SOURCE_DIR}/README.md"
47
+        "${SOURCE_DIR}/INSTALL.md"
48
+        "${SOURCE_DIR}/CHANGELOG.md"
49
+    )
50
+    local file=""
51
+    for file in "${required_files[@]}"; do
52
+        if [[ ! -f "${file}" ]]; then
53
+            echo "ERROR: missing required source file: ${file}" >&2
54
+            exit 1
55
+        fi
56
+    done
57
+}
58
+
59
+run_existing_uninstall() {
60
+    if [[ -x "${UNINSTALL_PATH}" ]]; then
61
+        echo "Existing installation detected. Running canonical uninstall first..."
62
+        "${UNINSTALL_PATH}" --force || true
63
+    else
64
+        bash "${SOURCE_DIR}/scripts/uninstall.sh" --force || true
65
+    fi
66
+}
67
+
68
+install_docs() {
69
+    mkdir -p "${DOC_DIR}"
70
+    cp "${SOURCE_DIR}/README.md" "${DOC_DIR}/"
71
+    cp "${SOURCE_DIR}/INSTALL.md" "${DOC_DIR}/"
72
+    cp "${SOURCE_DIR}/CHANGELOG.md" "${DOC_DIR}/"
73
+}
74
+
75
+main() {
76
+    while [[ $# -gt 0 ]]; do
77
+        case "$1" in
78
+            --source-dir)
79
+                SOURCE_DIR="$2"
80
+                shift 2
81
+                ;;
82
+            -h|--help)
83
+                usage
84
+                exit 0
85
+                ;;
86
+            *)
87
+                echo "ERROR: unknown option: $1" >&2
88
+                usage
89
+                exit 1
90
+                ;;
91
+        esac
92
+    done
93
+
94
+    require_root
95
+    resolve_source_dir
96
+    validate_source_tree
97
+
98
+    echo "=== Installing ${PROJECT_ID} ==="
99
+    run_existing_uninstall
100
+
101
+    mkdir -p "${INSTALL_DIR}" "${DOC_DIR}" /usr/local/sbin /etc/default
102
+
103
+    install -m 0755 "${SOURCE_DIR}/bin/pve-net-hang-watchdog.sh" "${COMMAND_PATH}"
104
+    install -m 0755 "${SOURCE_DIR}/scripts/uninstall.sh" "${UNINSTALL_PATH}"
105
+    ln -sfn "${UNINSTALL_PATH}" "${UNINSTALL_WRAPPER}"
106
+
107
+    if [[ ! -f "${CONFIG_PATH}" ]]; then
108
+        install -m 0644 "${SOURCE_DIR}/config/xdev-pve-net-hang-watchdog" "${CONFIG_PATH}"
109
+    else
110
+        echo "Preserving existing config: ${CONFIG_PATH}"
111
+    fi
112
+
113
+    install -m 0644 "${SOURCE_DIR}/systemd/pve-net-hang-watchdog.service" "${UNIT_PATH}"
114
+    systemctl daemon-reload
115
+    systemctl enable --now pve-net-hang-watchdog.service >/dev/null 2>&1
116
+
117
+    install_docs
118
+
119
+    echo "Installed paths:"
120
+    echo "  command: ${COMMAND_PATH}"
121
+    echo "  uninstall: ${UNINSTALL_PATH}"
122
+    echo "  config: ${CONFIG_PATH}"
123
+    echo "  systemd: ${UNIT_PATH}"
124
+    echo "  docs: ${DOC_DIR}"
125
+    echo "  service: enabled and started"
126
+    echo ""
127
+    echo "Installation completed."
128
+}
129
+
130
+main "$@"
+68 -0
projects/pve-net-hang-watchdog/scripts/uninstall.sh
@@ -0,0 +1,68 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="pve-net-hang-watchdog"
6
+ORG_ID="xdev"
7
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
8
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
9
+COMMAND_PATH="/usr/local/sbin/pve-net-hang-watchdog.sh"
10
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
11
+CONFIG_PATH="/etc/default/${ORG_ID}-${PROJECT_ID}"
12
+UNIT_PATH="/etc/systemd/system/${PROJECT_ID}.service"
13
+
14
+FORCE_MODE=0
15
+
16
+log() {
17
+    if [[ "${FORCE_MODE}" -eq 0 ]]; then
18
+        echo "$@"
19
+    fi
20
+}
21
+
22
+require_root() {
23
+    if [[ "${EUID}" -ne 0 ]]; then
24
+        echo "ERROR: this script must be run as root" >&2
25
+        exit 1
26
+    fi
27
+}
28
+
29
+main() {
30
+    while [[ $# -gt 0 ]]; do
31
+        case "$1" in
32
+            --force)
33
+                FORCE_MODE=1
34
+                shift
35
+                ;;
36
+            -h|--help)
37
+                echo "Usage: $0 [--force]"
38
+                exit 0
39
+                ;;
40
+            *)
41
+                echo "ERROR: unknown option: $1" >&2
42
+                exit 1
43
+                ;;
44
+        esac
45
+    done
46
+
47
+    require_root
48
+
49
+    log "=== Uninstalling ${PROJECT_ID} ==="
50
+
51
+    systemctl disable pve-net-hang-watchdog.service >/dev/null 2>&1 || true
52
+    systemctl stop pve-net-hang-watchdog.service >/dev/null 2>&1 || true
53
+    rm -f "${UNIT_PATH}"
54
+    systemctl daemon-reload
55
+
56
+    rm -f "${UNINSTALL_WRAPPER}"
57
+    rm -f "${COMMAND_PATH}"
58
+    rm -f "${CONFIG_PATH}"
59
+    rm -rf "${DOC_DIR}"
60
+    rm -rf "${INSTALL_DIR}"
61
+
62
+    rmdir /usr/local/lib/${ORG_ID} 2>/dev/null || true
63
+    rmdir /usr/local/share/doc/${ORG_ID} 2>/dev/null || true
64
+
65
+    log "Uninstall complete."
66
+}
67
+
68
+main "$@"
+139 -0
projects/pve-net-hang-watchdog/setup.sh
@@ -0,0 +1,139 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="pve-net-hang-watchdog"
6
+ORG_ID="xdev"
7
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
8
+MODE="install"
9
+REMOTE_NODE=""
10
+REMOTE_USER="root"
11
+LOCAL_MODE=0
12
+
13
+show_help() {
14
+    cat <<EOF
15
+${PROJECT_ID} setup wrapper
16
+
17
+Usage: $0 [OPTIONS] [<target_node>]
18
+
19
+Options:
20
+  -h, --help           Show this help message
21
+  -l, --local          Run on localhost
22
+  -u, --uninstall      Uninstall instead of install
23
+  --user <user>        Remote SSH user (default: root)
24
+EOF
25
+}
26
+
27
+run_local_install() {
28
+    bash "${SCRIPT_DIR}/scripts/install.sh" --source-dir "${SCRIPT_DIR}"
29
+}
30
+
31
+run_local_uninstall() {
32
+    local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
33
+    if [[ -x "${canonical}" ]]; then
34
+        "${canonical}"
35
+    else
36
+        bash "${SCRIPT_DIR}/scripts/uninstall.sh"
37
+    fi
38
+}
39
+
40
+copy_remote_tree() {
41
+    local target="$1"
42
+    local remote_tmp="$2"
43
+
44
+    ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/bin' '${remote_tmp}/scripts' '${remote_tmp}/systemd' '${remote_tmp}/config'"
45
+    scp -q "${SCRIPT_DIR}/bin/pve-net-hang-watchdog.sh" "${target}:${remote_tmp}/bin/"
46
+    scp -q "${SCRIPT_DIR}/scripts/install.sh" "${target}:${remote_tmp}/scripts/"
47
+    scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
48
+    scp -q "${SCRIPT_DIR}/systemd/pve-net-hang-watchdog.service" "${target}:${remote_tmp}/systemd/"
49
+    scp -q "${SCRIPT_DIR}/config/xdev-pve-net-hang-watchdog" "${target}:${remote_tmp}/config/"
50
+    scp -q "${SCRIPT_DIR}/README.md" "${target}:${remote_tmp}/"
51
+    scp -q "${SCRIPT_DIR}/INSTALL.md" "${target}:${remote_tmp}/"
52
+    scp -q "${SCRIPT_DIR}/CHANGELOG.md" "${target}:${remote_tmp}/"
53
+}
54
+
55
+run_remote_install() {
56
+    local target="$1"
57
+    local remote_tmp="/tmp/${PROJECT_ID}.$$"
58
+    local remote_prefix=""
59
+
60
+    [[ "${REMOTE_USER}" != "root" ]] && remote_prefix="sudo "
61
+
62
+    copy_remote_tree "${target}" "${remote_tmp}"
63
+    ssh "${target}" "${remote_prefix}bash '${remote_tmp}/scripts/install.sh' --source-dir '${remote_tmp}'"
64
+    ssh "${target}" "rm -rf '${remote_tmp}'"
65
+}
66
+
67
+run_remote_uninstall() {
68
+    local target="$1"
69
+    local remote_tmp="/tmp/${PROJECT_ID}-uninstall.$$"
70
+    local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
71
+
72
+    ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts'"
73
+    scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
74
+    if [[ "${REMOTE_USER}" == "root" ]]; then
75
+        ssh "${target}" "if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi"
76
+    else
77
+        ssh "${target}" "sudo bash -lc \"if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi\""
78
+    fi
79
+    ssh "${target}" "rm -rf '${remote_tmp}'"
80
+}
81
+
82
+while [[ $# -gt 0 ]]; do
83
+    case "$1" in
84
+        -h|--help)
85
+            show_help
86
+            exit 0
87
+            ;;
88
+        -l|--local)
89
+            LOCAL_MODE=1
90
+            shift
91
+            ;;
92
+        -u|--uninstall)
93
+            MODE="uninstall"
94
+            shift
95
+            ;;
96
+        --user)
97
+            REMOTE_USER="$2"
98
+            shift 2
99
+            ;;
100
+        -*)
101
+            echo "ERROR: unknown option: $1" >&2
102
+            show_help
103
+            exit 1
104
+            ;;
105
+        *)
106
+            REMOTE_NODE="$1"
107
+            shift
108
+            ;;
109
+    esac
110
+done
111
+
112
+if [[ -z "${REMOTE_NODE}" && ${LOCAL_MODE} -eq 0 ]]; then
113
+    LOCAL_MODE=1
114
+fi
115
+
116
+echo "================================"
117
+echo "${PROJECT_ID} - ${MODE}"
118
+echo "================================"
119
+
120
+if [[ ${LOCAL_MODE} -eq 1 ]]; then
121
+    if [[ "${MODE}" == "install" ]]; then
122
+        run_local_install
123
+    else
124
+        run_local_uninstall
125
+    fi
126
+    exit 0
127
+fi
128
+
129
+TARGET="${REMOTE_USER}@${REMOTE_NODE}"
130
+if ! ping -c 1 "${REMOTE_NODE}" >/dev/null 2>&1; then
131
+    echo "ERROR: cannot reach ${REMOTE_NODE}" >&2
132
+    exit 1
133
+fi
134
+
135
+if [[ "${MODE}" == "install" ]]; then
136
+    run_remote_install "${TARGET}"
137
+else
138
+    run_remote_uninstall "${TARGET}"
139
+fi
+14 -0
projects/pve-net-hang-watchdog/systemd/pve-net-hang-watchdog.service
@@ -0,0 +1,14 @@
1
+[Unit]
2
+Description=Recover network uplink after NIC hardware hangs
3
+After=systemd-journald.service network.target
4
+Requires=systemd-journald.service
5
+
6
+[Service]
7
+Type=simple
8
+EnvironmentFile=-/etc/default/xdev-pve-net-hang-watchdog
9
+ExecStart=/usr/local/sbin/pve-net-hang-watchdog.sh
10
+Restart=always
11
+RestartSec=2
12
+
13
+[Install]
14
+WantedBy=multi-user.target
BIN
projects/thunderbolts/.DS_Store
Binary file not shown.
+59 -0
projects/thunderbolts/.github/copilot-instructions.md
@@ -0,0 +1,59 @@
1
+# Copilot Instructions for Madagascar Thunderbolts & Backups
2
+
3
+## Big Picture Architecture
4
+- The codebase manages high-MTU Thunderbolt networking and automated backups for a Proxmox cluster (`baobab`, `ebony`, `tapia`).
5
+- Networking: Early boot systemd/udev units create and maintain a `thunderbridge` (MTU 65520), hotplug Thunderbolt NICs, and ensure persistent bridge membership.
6
+- Backups: Autonomous agent scripts (in `backups/`) discover VMs, run scheduled backups, and log lifecycle events.
7
+- All node, network, and backup config is centralized in `cluster/madagascar.json`.
8
+
9
+## Critical Developer Workflows
10
+- **Network Deploy:**
11
+  - Run `deploy/attempt1/deploy_tb.sh` from its directory to push configs and services to all nodes.
12
+  - Validate with `scripts/check_thunderbridge.sh` (checks bridge ports, MTU, and cluster connectivity).
13
+- **Backup Deploy:**
14
+  - Use `backups/scripts/deploy_to_nodes.sh` to install backup agents on all nodes.
15
+  - Backup agent lifecycle is managed by systemd timers (`backup_agent.timer`).
16
+- **Issue Tracking:**
17
+  - All issues documented in `issues/` using `TEMPLATE.md`.
18
+  - Every fix/change must be referenced in `CHANGELOG.md`.
19
+
20
+## Project-Specific Conventions
21
+- **Network config:**
22
+  - Node-specific overlays in `deploy/attempt1/<node>/etc/network/interfaces.d/10-thunderbolt`.
23
+  - Shared systemd/udev units in `deploy/attempt1/common/`.
24
+  - Always use post-up hooks for bridge membership and MTU persistence.
25
+- **SSH Automation:**
26
+  - Scripts use `-o LogLevel=ERROR` to suppress known hosts warnings.
27
+  - Management and Thunderbolt IPs are set in deploy scripts; update helpers for new nodes.
28
+- **Versioning:**
29
+  - New network designs go in new `attemptN` folders for reproducibility.
30
+- **Backups:**
31
+  - All backup config and manifests reference `madagascar.json` for node/IP discovery.
32
+  - Backup agent logs lifecycle events and changes in `madagascar-changelog.json` (if present).
33
+
34
+## Integration Points & Data Flows
35
+- **Network:**
36
+  - Systemd/udev units interact via device events; enlist services attach NICs to bridge.
37
+  - Deploy script pushes all config and reloads services atomically.
38
+- **Backups:**
39
+  - Agent scripts SSH into nodes, discover VMs, and run backups using Proxmox CLI.
40
+  - Results and metadata are logged for auditability.
41
+
42
+## Key Files & Directories
43
+- `deploy/attempt1/deploy_tb.sh`: Main network deploy script
44
+- `deploy/attempt1/common/`: Shared systemd/udev units
45
+- `deploy/attempt1/<node>/etc/network/interfaces.d/10-thunderbolt`: Node overlays
46
+- `scripts/check_thunderbridge.sh`: Cluster network health check
47
+- `cluster/madagascar.json`: Canonical node/network/backup config
48
+- `backups/`: Backup agent, deployment, and documentation
49
+- `issues/`: Issue tracker
50
+- `CHANGELOG.md`: Change log
51
+
52
+## Example Patterns
53
+- To add a node: copy an existing node directory, update IPs, extend deploy script helpers.
54
+- To troubleshoot: check systemd unit status, bridge membership, and kernel logs.
55
+- To automate: use provided scripts, keep configs in sync with `madagascar.json`, and document all changes.
56
+
57
+---
58
+
59
+For questions or unclear conventions, review `README.md` and issue templates, or ask for clarification in the issue tracker.
+43 -0
projects/thunderbolts/CHANGELOG.md
@@ -0,0 +1,43 @@
1
+# Changelog
2
+
3
+All notable changes to the Madagascar cluster will be documented in this file.
4
+
5
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
+
8
+## [Unreleased]
9
+
10
+### Fixed
11
+- Invalid `ExecStop` syntax in `tb-enlist@.service` caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001]
12
+- Tapia-Baobab Thunderbolt recovery path hardened after reboot-time disconnect/reconnect events [ISSUE-2026-001]
13
+
14
+### Added
15
+- Automatic Thunderbolt recovery worker (`tb-recover.service`) and periodic timer (`tb-recover.timer`) for flap resilience [ISSUE-2026-001]
16
+
17
+### Changed
18
+- `tb-recover.sh` now escalates recovery by restarting `bolt.service` when rescan alone does not recreate thunderbolt net devices [ISSUE-2026-001]
19
+- `tb-recover.sh` now includes cooldowned Thunderbolt NHI PCI `remove+rescan` fallback (soft replug path) for reboot cases where netdev is missing [ISSUE-2026-001]
20
+- `tb-recover.sh` now retries the Thunderbolt NHI reset within the same recovery run when a peer xdomain host reappears without its `*.0` network service [ISSUE-2026-001]
21
+- `tb-recover.sh` now probes the expected peer behind each Thunderbolt port and cycles the affected interface with `ifdown/ifup` when a port stays attached but logically detached [ISSUE-2026-001]
22
+- Added standardized shared-runtime install/uninstall flow that manages scripts, unit files, and udev rules without rewriting host network configuration
23
+
24
+## [2025-10-30]
25
+
26
+### Fixed
27
+- Thunderbolt interfaces not in bridge after MTU fix deployment [ISSUE-2025-002]
28
+- MTU reset to 1500 after systemctl restart networking [ISSUE-2025-001]
29
+
30
+### Added
31
+- Issue tracking system with structured templates
32
+- Defense-in-depth for thunderbolt network configuration (udev + ifupdown2 hooks)
33
+
34
+### Changed
35
+- Enhanced udev rules for thunderbolt device handling
36
+- Updated network interfaces.d with post-up hooks for MTU and bridge membership
37
+
38
+## [2025-10-29]
39
+
40
+### Added
41
+- Initial issue tracking setup
42
+- COPILOT_BACKUPS_INSTRUCTIONS.md for backup procedures
43
+- CHANGELOG.md for change documentation</content>
+113 -0
projects/thunderbolts/COPILOT_BACKUPS_INSTRUCTIONS.md
@@ -0,0 +1,113 @@
1
+COPILOT instructions — VM backup management (project scaffold)
2
+
3
+Purpose
4
+
5
+This document provides context and instructions for an automated assistant (copilot) to start building a project that manages VM backups for the Madagascar cluster. The detailed backup behaviors (retention, snapshot type, schedule) will be added later. For now we focus on cluster context, knowledge sources, file contracts, and recommended initial tasks.
6
+
7
+Context & what the agent already knows
8
+
9
+- The cluster name is `madagascar` and node names are available under `clusters.madagascar.nodes` in `cluster-context/madagascar.json`.
10
+- `cluster-context/madagascar.json` is the canonical source of cluster context available to this project: it may contain node hostnames, network information, and references to where configurations originate.
11
+- `madagascar-changelog.json` (if present in the same directory) is an append-only changelog recommended for recording automation changes; prefer appending entries rather than rewriting.
12
+
13
+Primary goals for the backup project (to be specified later)
14
+
15
+- Discover VMs across cluster nodes.
16
+- Create consistent backups (snapshots, exports) per VM on a regular schedule.
17
+- Store backups in a target storage (local NAS, remote S3-compatible, etc.).
18
+- Maintain retention and pruning policies.
19
+- Integrate with `cluster-context/madagascar.json` for cluster information and to avoid stepping on other projects' config.
20
+
21
+Files the assistant should read and keep in mind
22
+
23
+- `cluster-context/madagascar.json` — primary source of truth for node hostnames, network addresses, and where configuration is defined.
24
+- `madagascar-changelog.json` — append-only log to record changes made by automation (if present).
25
+- `CHANGELOG.md` — human-readable changelog documenting all cluster changes with issue references.
26
+- `issues/` directory — contains detailed issue documentation. Each issue has format `ISSUE-YYYY-NNN.md`.
27
+
28
+Data contract (minimal) — how `cluster-context/madagascar.json` will be used by backups
29
+
30
+- Inputs:
31
+  - Node list: `clusters.madagascar.nodes` keys
32
+  - Node access: hostname(s) under `nodes.<node>.hosts` (ssh target or provisioning endpoint)
33
+  - Node VM network context: used to determine which subnets/backups touches (from `wan`/`thunderbridge`)
34
+- Outputs:
35
+  - Backup metadata appended to `madagascar-changelog.json` (id, timestamp, project: `backups`, summary, details, affectedResources)
36
+  - (Optional) A `backups.json` manifest in repo or storage describing performed backups.
37
+
38
+Assumptions (inferred, verify early)
39
+
40
+- The ops runner will have SSH access to each node via hostnames in `cluster-context/madagascar.json`.
41
+- VM management is Proxmox (PVE) given file style (vmbr*). If different, adapt tooling.
42
+- jq is available on automation host for simple JSON operations; Python is acceptable for more complex logic.
43
+
44
+Starter tasks for the copilot (priority order)
45
+
46
+1. Discovery script: `./discover_vms.sh` (or Python) that:
47
+  - Reads `./cluster-context/madagascar.json` to get nodes and hostnames.
48
+  - SSH into each node and lists VMs (for Proxmox: `qm list` or `pvesh` / `pct list` for containers).
49
+  - Produces a `backups/manifest-<date>.json` with the discovered VMs.
50
+2. Backup runner: `./run_backup.sh` which takes a VM id and node, creates a snapshot/export, and uploads it to configured storage. Keep steps idempotent and record metadata.
51
+3. Pruner: `./prune_backups.sh` to remove old backups according to retention policy (to be defined).
52
+4. Integration tests: small harness that runs discovery against a mocked inventory or a minimal local mock environment and validates outputs.
53
+5. Changelog integration: every automated change to `cluster-context/madagascar.json` or backup metadata must append an entry in `cluster-context/madagascar-changelog.json` describing reason and affected resources.
54
+
55
+Developer guidance & best practices
56
+
57
+- Treat `cluster-context/madagascar.json` as the source of truth for discovery; do not hardcode hostnames elsewhere.
58
+- When writing automation that mutates `cluster-context/madagascar.json`, always also append a changelog entry and prefer atomic updates (write to tmp file then rename).
59
+- Prefer small, single-purpose scripts. Keep complex logic in Python where JSON and SSH handling is easier.
60
+- Add unit tests for parsing and manifest generation.
61
+
62
+---
63
+
64
+## Copilot Automation Instructions (Network & Issue Tracking)
65
+
66
+### Cluster Discovery & Network Checks
67
+- Use `cluster/cluster-context/madagascar.json` for node, IP, and service info.
68
+- To verify thunderbolt networking, run `scripts/check_thunderbridge.sh`:
69
+  - Checks bridge membership and MTU for all thunderbolt interfaces.
70
+  - Verifies cluster network connectivity (ping between all nodes).
71
+- For troubleshooting, check kernel logs (`dmesg`), interface status (`ip link show`), and bridge membership (`bridge link`).
72
+
73
+### Issue Tracking Workflow
74
+- All issues are tracked in `issues/` as Markdown files using `TEMPLATE.md`.
75
+- Each issue gets a unique ID (e.g., ISSUE-2025-001, ISSUE-2025-002).
76
+- Document:
77
+  - Summary, environment, steps to reproduce, expected/actual behavior
78
+  - Logs/evidence, investigation notes, proposed solution
79
+  - Related issues and changelog references
80
+- Update `CHANGELOG.md` for every fix, enhancement, or regression.
81
+- Close issues only after full deployment and verification.
82
+
83
+### Copilot Automation Conventions
84
+- Always verify changes on all affected nodes (baobab, ebony, tapia, etc.).
85
+- Use defense-in-depth for network fixes (udev rules + ifupdown2 hooks).
86
+- Scripts should be POSIX-compliant for maximum compatibility.
87
+- Suppress SSH warnings for clean output (`-o LogLevel=ERROR`).
88
+- Document every change and test result in the issue tracker and changelog.
89
+
90
+### Example Copilot Tasks
91
+- Deploy network fixes: `deploy/attempt1/deploy_tb.sh <node>`
92
+- Check thunderbolt status: `scripts/check_thunderbridge.sh`
93
+- Investigate hardware/network issues: kernel logs, interface status, bridge membership
94
+- Document and close issues: update `issues/` and `CHANGELOG.md`
95
+
96
+### References
97
+- `cluster/cluster-context/madagascar.json`: Node, network, and backup server definitions
98
+- `issues/`: Issue tracker and templates
99
+- `CHANGELOG.md`: Change documentation
100
+- `scripts/check_thunderbridge.sh`: Cluster network health check
101
+- `deploy/attempt1/deploy_tb.sh`: Network deployment script
102
+
103
+### Maintenance
104
+- Regularly run network and backup checks.
105
+- Update documentation and changelogs for every change.
106
+- Use Copilot to automate repetitive tasks and ensure consistency across the cluster.
107
+
108
+Next steps for the user
109
+
110
+- Provide backup policy details (snapshot vs export, retention counts, storage endpoint credentials).
111
+- Confirm VM manager (Proxmox vs KVM/libvirt vs other).
112
+
113
+If you want, I can now scaffold `./discover_vms.sh`, `./run_backup.sh` (stubs), and a small `backups/README.md` describing configuration fields. Which do you prefer I create first: discovery script (bash) or Python scaffold?
+62 -0
projects/thunderbolts/INSTALL.md
@@ -0,0 +1,62 @@
1
+# Instalare
2
+
3
+Acest proiect are acum doua fluxuri distincte:
4
+
5
+1. `deploy/attempt1/deploy_tb.sh`
6
+   - bootstrap complet
7
+   - poate actualiza si fisierele de retea per-host
8
+2. `scripts/install.sh` sau `setup.sh`
9
+   - reinstalare/upgrade pentru shared runtime
10
+   - NU atinge `/etc/network/interfaces` si `interfaces.d/10-thunderbolt`
11
+
12
+## Reinstalare standardizata
13
+
14
+### Local
15
+
16
+```bash
17
+sudo ./setup.sh --local
18
+```
19
+
20
+### Remote
21
+
22
+```bash
23
+sudo ./setup.sh baobab
24
+sudo ./setup.sh ebony tapia
25
+```
26
+
27
+Ce instaleaza:
28
+- `/usr/local/lib/xdev/thunderbolts/tb-recover.sh`
29
+- `/usr/local/lib/xdev/thunderbolts/uninstall.sh`
30
+- `/usr/local/sbin/tb-recover.sh`
31
+- `/usr/local/sbin/xdev-thunderbolts-uninstall`
32
+- `/etc/systemd/system/tb-bridge.service`
33
+- `/etc/systemd/system/tb-enlist@.service`
34
+- `/etc/systemd/system/tb-recover.service`
35
+- `/etc/systemd/system/tb-recover.timer`
36
+- `/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`
37
+- `/usr/local/share/doc/xdev/thunderbolts/*`
38
+
39
+Ce NU atinge:
40
+- `/etc/network/interfaces`
41
+- `/etc/network/interfaces.d/10-thunderbolt`
42
+
43
+## Uninstall standardizat
44
+
45
+```bash
46
+sudo ./setup.sh --local --uninstall
47
+sudo ./setup.sh --uninstall baobab
48
+```
49
+
50
+Sau direct pe host:
51
+
52
+```bash
53
+sudo /usr/local/lib/xdev/thunderbolts/uninstall.sh
54
+```
55
+
56
+Uninstall-ul elimina doar shared runtime:
57
+- unit-urile systemd
58
+- regula udev
59
+- `tb-recover.sh`
60
+- documentatia instalata
61
+
62
+Nu restaureaza si nu sterge fisierele de retea.
+141 -0
projects/thunderbolts/README.md
@@ -0,0 +1,141 @@
1
+# Madagascar's Thunderbolts
2
+
3
+Thunderbolt networking toolkit for three Proxmox hosts (`baobab`, `ebony`, `tapia`).  
4
+The goal is to bring up a high-MTU Thunderbolt bridge (`thunderbridge`) early in boot,
5
+enlist hot-plugged Thunderbolt NICs as they appear, and keep management networking
6
+configs consistent across the cluster.
7
+
8
+## Repository layout
9
+
10
+```
11
+deploy/attempt1/
12
+├── common/                    # Shared bits copied to every host
13
+│   ├── systemd/system/
14
+│   │   ├── tb-bridge.service  # Ensures the bridge device exists and is up
15
+│   │   └── tb-enlist@.service # Enlists hotplugged NICs into the bridge
16
+│   └── udev/rules.d/
17
+│       └── 90-…systemd.rules  # Starts tb-enlist@ for thunderbolt* devices
18
+├── baobab/…                   # Node-specific /etc/network config
19
+├── ebony/…
20
+├── tapia/…
21
+└── deploy_tb.sh               # Main deployment script
22
+```
23
+
24
+The repo currently holds a single deployment attempt (`deploy/attempt1`). If you
25
+iterate on the design, prefer adding a new attempt directory so older snapshots
26
+stay reproducible.
27
+
28
+## Standardized lifecycle
29
+
30
+This project now has two distinct operational paths:
31
+
32
+- Full bootstrap: `deploy/attempt1/deploy_tb.sh`
33
+  - can update host-specific network configuration
34
+  - use for initial deployment or deliberate network template rollout
35
+- Shared runtime reinstall: `./setup.sh`
36
+  - standardizes the shared runtime artifacts only
37
+  - installs/removes `tb-recover.sh`, the shared systemd units, and the udev rule
38
+  - intentionally leaves `/etc/network/interfaces` and `/etc/network/interfaces.d/10-thunderbolt` untouched
39
+
40
+Standardized host paths for the shared runtime:
41
+
42
+- canonical uninstall: `/usr/local/lib/xdev/thunderbolts/uninstall.sh`
43
+- canonical shared script: `/usr/local/lib/xdev/thunderbolts/tb-recover.sh`
44
+- operator wrapper: `/usr/local/sbin/tb-recover.sh`
45
+- installed docs: `/usr/local/share/doc/xdev/thunderbolts`
46
+
47
+Use:
48
+
49
+```bash
50
+./setup.sh                 # reinstall shared runtime on baobab ebony tapia
51
+./setup.sh baobab          # single host
52
+./setup.sh --uninstall baobab
53
+```
54
+
55
+## Prerequisites
56
+
57
+- Machine with Bash ≥3, `ssh`, and `scp` available.
58
+- Access to the target hosts as `root` (default username) over the management or
59
+  Thunderbolt network; passwordless SSH is assumed.
60
+- Target hosts run Proxmox (or any Debian-like system with ifupdown2 and systemd).
61
+- `ip`, `systemctl`, and `udevadm` available on the remote hosts.
62
+
63
+## How deployment works
64
+
65
+`deploy_tb.sh` is idempotent. For each target host it:
66
+
67
+- Chooses an IP by trying management first, then Thunderbolt (`get_mgmt_ip`/`get_tb_ip`).
68
+- Uploads shared udev and systemd units that prepare the `thunderbridge` device and
69
+  attach Thunderbolt NICs when they hot-plug.
70
+- Replaces `/etc/network/interfaces` with the host-specific template and places the
71
+  Thunderbolt overlay in `/etc/network/interfaces.d/10-thunderbolt`.
72
+- Reloads udev and systemd, triggers network reloads, enables the services, and
73
+  prints a short status report (bridge state, enlisted NICs).
74
+
75
+Run it from inside the attempt directory so relative paths resolve correctly.
76
+
77
+```bash
78
+cd deploy/attempt1
79
+./deploy_tb.sh            # deploys to baobab, ebony, tapia
80
+./deploy_tb.sh baobab     # deploys to a single host
81
+./deploy_tb.sh tapia ebony
82
+```
83
+
84
+## Customising host lists and addresses
85
+
86
+Edit the `get_mgmt_ip()` and `get_tb_ip()` helpers near the top of
87
+`deploy/attempt1/deploy_tb.sh` to match your environment. Each host that you want
88
+to target must:
89
+
90
+1. Have a subdirectory named after the host inside `deploy/attempt1`.
91
+2. Provide the full `/etc/network/interfaces` template.
92
+3. Provide `etc/network/interfaces.d/10-thunderbolt` with the bridge definition
93
+   and hotplug rules for Thunderbolt interfaces.
94
+
95
+To add a new host, copy one of the existing directories, adjust static IPs and
96
+interface names, then extend both helper functions so the script can locate it.
97
+
98
+## What the systemd/udev pieces do
99
+
100
+- `tb-bridge.service` (oneshot) makes sure the `thunderbridge` device exists as a
101
+  Linux bridge, sets MTU 65520, and brings it up during early boot.
102
+- `tb-enlist@.service` attaches Thunderbolt NIC instances to the bridge, aligning
103
+  their MTU and keeping them hotplug friendly; systemd stops the unit cleanly on
104
+  device removal.
105
+- `90-thunderbolt-net-systemd.rules` tags `thunderbolt*` NICs so udev starts the
106
+  enlist service automatically.
107
+
108
+These files live under `deploy/attempt1/common/` and are copied verbatim to the
109
+remote host’s `/etc/systemd/system` and `/etc/udev/rules.d`.
110
+
111
+## Validation checklist
112
+
113
+After running the deploy script on a host:
114
+
115
+- `systemctl status tb-bridge.service` should show an *active* oneshot unit.
116
+- `systemctl list-units 'tb-enlist@*'` should list one unit per detected Thunderbolt
117
+  NIC, each *loaded* and *active*.
118
+- `ip -d link show thunderbridge` should display MTU 65520 and `state UP`.
119
+- `bridge link` should list your Thunderbolt interfaces as ports of `thunderbridge`
120
+  once cables are connected.
121
+
122
+If you change the network definitions, re-run `./deploy_tb.sh <host>` to push the
123
+updates. The script re-applies permissions, reloads systemd, retriggers udev, and
124
+refreshes the interfaces.
125
+
126
+## Troubleshooting tips
127
+
128
+- *SSH unreachable*: Confirm management and Thunderbolt IPs in the helper functions
129
+  are correct, and that firewalls allow SSH. The script prints which IP it tried.
130
+- *Bridge missing after reboot*: Ensure `tb-bridge.service` is enabled; run
131
+  `systemctl enable --now tb-bridge.service` on the host.
132
+- *NICs not joining*: Check `journalctl -u tb-enlist@thunderbolt0` for logs and make
133
+  sure the udev rule is present under `/etc/udev/rules.d`.
134
+- *MTU mismatch complaints*: The service forces MTU 65520 on both sides; verify the
135
+  connected devices also support it.
136
+
137
+## Extending beyond attempt1
138
+
139
+Prefer copying `deploy/attempt1` into a new versioned folder (for example,
140
+`attempt2`) when you experiment with alternate topologies or addresses. This keeps
141
+previous rollouts reproducible and eases diffing of changes.
+1 -0
projects/thunderbolts/cluster
@@ -0,0 +1 @@
1
+../cluster
BIN
projects/thunderbolts/deploy/attempt1/.DS_Store
Binary file not shown.
+45 -0
projects/thunderbolts/deploy/attempt1/baobab/etc/network/interfaces
@@ -0,0 +1,45 @@
1
+# network interface settings; autogenerated
2
+# Please do NOT modify this file directly, unless you know what
3
+# you're doing.
4
+#
5
+# If you want to manage parts of the network configuration manually,
6
+# please utilize the 'source' or 'source-directory' directives to do
7
+# so.
8
+# PVE will preserve these directives, but will NOT read its network
9
+# configuration from sourced files, so do not attempt to move any of
10
+# the PVE managed interfaces into external files!
11
+
12
+auto lo
13
+iface lo inet loopback
14
+
15
+auto enp86s0
16
+iface enp86s0 inet manual
17
+
18
+iface enp86s0.442 inet manual
19
+
20
+iface enp86s0.443 inet manual
21
+
22
+iface enp86s0.444 inet manual
23
+source /etc/network/interfaces.d/*
24
+
25
+auto vmbr443
26
+iface vmbr443 inet static
27
+	address 192.168.2.91/24
28
+	gateway 192.168.2.1
29
+	bridge-ports enp86s0.443
30
+	bridge-stp off
31
+	bridge-fd 0
32
+
33
+auto vmbr444
34
+iface vmbr444 inet static
35
+	address 192.168.4.91/24
36
+	bridge-ports enp86s0.444
37
+	bridge-stp off
38
+	bridge-fd 0
39
+
40
+auto vmbr442
41
+iface vmbr442 inet manual
42
+	bridge-ports enp86s0.442
43
+	bridge-stp off
44
+	bridge-fd 0
45
+
+26 -0
projects/thunderbolts/deploy/attempt1/baobab/etc/network/interfaces.d/10-thunderbolt
@@ -0,0 +1,26 @@
1
+# Modular network configuration for baobab - Thunderbolt networking
2
+# ifupdown2-safe: bridge comes up alone; TB ports hotplug in later
3
+
4
+# Thunderbolt ports appear late — do NOT 'auto' them
5
+allow-hotplug thunderbolt0
6
+iface thunderbolt0 inet manual
7
+    pre-up ip link set dev $IFACE mtu 65520 || true
8
+    post-up ip link set dev $IFACE mtu 65520 || true
9
+    post-up ip link set dev $IFACE master thunderbridge || true
10
+
11
+allow-hotplug thunderbolt1
12
+iface thunderbolt1 inet manual
13
+    pre-up ip link set dev $IFACE mtu 65520 || true
14
+    post-up ip link set dev $IFACE mtu 65520 || true
15
+    post-up ip link set dev $IFACE master thunderbridge || true
16
+
17
+# Bridge must exist and stay up even with zero members
18
+auto thunderbridge
19
+iface thunderbridge inet static
20
+    address 192.168.10.91/24
21
+    bridge-ports none
22
+    bridge-stp off
23
+    bridge-fd 0
24
+    mtu 65520
25
+    pre-up ip link add name $IFACE type bridge 2>/dev/null || true
26
+    post-up ip link set dev $IFACE up
+316 -0
projects/thunderbolts/deploy/attempt1/common/sbin/tb-recover.sh
@@ -0,0 +1,316 @@
1
+#!/usr/bin/env bash
2
+set -euo pipefail
3
+
4
+BRIDGE="thunderbridge"
5
+MTU="65520"
6
+FOUND_TB_IFACE=0
7
+STATE_DIR="/run/tb-recover"
8
+LAST_BOLT_RESTART_FILE="${STATE_DIR}/last_bolt_restart_epoch"
9
+BOLT_RESTART_COOLDOWN_SEC=600
10
+LAST_NHI_RESCAN_FILE="${STATE_DIR}/last_nhi_rescan_epoch"
11
+NHI_RESCAN_COOLDOWN_SEC=600
12
+NHI_SETTLE_SEC=8
13
+PEER_FAIL_THRESHOLD="${TB_PEER_FAIL_THRESHOLD:-2}"
14
+IFACE_CYCLE_COOLDOWN_SEC="${TB_IFACE_CYCLE_COOLDOWN_SEC:-300}"
15
+IFACE_CYCLE_SETTLE_SEC="${TB_IFACE_CYCLE_SETTLE_SEC:-5}"
16
+PING_TIMEOUT_SEC="${TB_PING_TIMEOUT_SEC:-1}"
17
+LOCAL_HOST="$(hostname -s 2>/dev/null || hostname)"
18
+
19
+mkdir -p "$STATE_DIR"
20
+
21
+log() {
22
+  printf '%s %s\n' "$(date -Is)" "$*"
23
+}
24
+
25
+command_exists() {
26
+  command -v "$1" >/dev/null 2>&1
27
+}
28
+
29
+counter_file_for_iface() {
30
+  printf '%s/peer-fail-%s.count\n' "$STATE_DIR" "$1"
31
+}
32
+
33
+cooldown_file_for_iface() {
34
+  printf '%s/last-iface-cycle-%s.epoch\n' "$STATE_DIR" "$1"
35
+}
36
+
37
+read_epoch_file() {
38
+  local file="$1"
39
+  local value="0"
40
+
41
+  if [ -f "$file" ]; then
42
+    value="$(cat "$file" 2>/dev/null || echo 0)"
43
+  fi
44
+
45
+  case "$value" in
46
+    ''|*[!0-9]*)
47
+      value=0
48
+      ;;
49
+  esac
50
+
51
+  printf '%s\n' "$value"
52
+}
53
+
54
+read_counter_file() {
55
+  read_epoch_file "$1"
56
+}
57
+
58
+peer_ip_for_iface() {
59
+  local iface="$1"
60
+
61
+  case "${LOCAL_HOST}:${iface}" in
62
+    baobab:thunderbolt0)
63
+      printf '%s\n' "192.168.10.92"
64
+      ;;
65
+    baobab:thunderbolt1)
66
+      printf '%s\n' "192.168.10.93"
67
+      ;;
68
+    ebony:thunderbolt0)
69
+      printf '%s\n' "192.168.10.91"
70
+      ;;
71
+    tapia:thunderbolt0)
72
+      printf '%s\n' "192.168.10.91"
73
+      ;;
74
+    *)
75
+      return 1
76
+      ;;
77
+  esac
78
+}
79
+
80
+iface_is_forwarding() {
81
+  local iface="$1"
82
+  local state_file="/sys/class/net/${iface}/brport/state"
83
+
84
+  [ -r "$state_file" ] || return 1
85
+  [ "$(cat "$state_file" 2>/dev/null || echo 0)" = "3" ]
86
+}
87
+
88
+iface_is_oper_up() {
89
+  local iface="$1"
90
+  local operstate_file="/sys/class/net/${iface}/operstate"
91
+
92
+  [ -r "$operstate_file" ] || return 1
93
+  [ "$(cat "$operstate_file" 2>/dev/null || true)" = "up" ]
94
+}
95
+
96
+probe_peer_ip() {
97
+  local peer_ip="$1"
98
+
99
+  ip neigh del "$peer_ip" dev "$BRIDGE" 2>/dev/null || true
100
+  ping -I "$BRIDGE" -n -c 1 -W "$PING_TIMEOUT_SEC" "$peer_ip" >/dev/null 2>&1
101
+}
102
+
103
+recover_iface_cycle() {
104
+  local iface="$1"
105
+  local peer_ip="$2"
106
+  local now
107
+  local last_cycle
108
+  local cooldown_file
109
+
110
+  now="$(date +%s)"
111
+  cooldown_file="$(cooldown_file_for_iface "$iface")"
112
+  last_cycle="$(read_epoch_file "$cooldown_file")"
113
+  if [ $((now - last_cycle)) -lt "$IFACE_CYCLE_COOLDOWN_SEC" ]; then
114
+    log "peer ${peer_ip} still unhealthy on ${iface}, but iface cycle is cooling down"
115
+    return 0
116
+  fi
117
+
118
+  log "peer ${peer_ip} unhealthy on ${iface}; cycling link with ifdown/ifup"
119
+  if command_exists ifdown && command_exists ifup; then
120
+    ifdown --force "$iface" || log "ifdown reported a non-zero exit code for ${iface}"
121
+    sleep 2
122
+    if ! ifup "$iface"; then
123
+      log "ifup failed for ${iface}"
124
+      return 1
125
+    fi
126
+  else
127
+    log "ifdown/ifup unavailable; falling back to ip link bounce for ${iface}"
128
+    ip link set "$iface" down || true
129
+    sleep 2
130
+    ip link set "$iface" up || true
131
+  fi
132
+
133
+  ip link set "$iface" mtu "$MTU" || true
134
+  ip link set "$iface" master "$BRIDGE" || true
135
+  systemctl start "tb-enlist@${iface}.service" || true
136
+  printf '%s\n' "$now" > "$cooldown_file"
137
+  rm -f "$(counter_file_for_iface "$iface")"
138
+  sleep "$IFACE_CYCLE_SETTLE_SEC"
139
+}
140
+
141
+assess_peer_health() {
142
+  local iface="$1"
143
+  local peer_ip=""
144
+  local counter_file=""
145
+  local fail_count=0
146
+
147
+  if ! peer_ip="$(peer_ip_for_iface "$iface")"; then
148
+    return 0
149
+  fi
150
+
151
+  counter_file="$(counter_file_for_iface "$iface")"
152
+
153
+  if ! iface_is_oper_up "$iface" || ! iface_is_forwarding "$iface"; then
154
+    rm -f "$counter_file"
155
+    return 0
156
+  fi
157
+
158
+  if probe_peer_ip "$peer_ip"; then
159
+    rm -f "$counter_file"
160
+    return 0
161
+  fi
162
+
163
+  fail_count="$(read_counter_file "$counter_file")"
164
+  fail_count=$((fail_count + 1))
165
+  printf '%s\n' "$fail_count" > "$counter_file"
166
+  log "peer probe failed on ${iface} towards ${peer_ip} (${fail_count}/${PEER_FAIL_THRESHOLD})"
167
+
168
+  if [ "$fail_count" -lt "$PEER_FAIL_THRESHOLD" ]; then
169
+    return 0
170
+  fi
171
+
172
+  recover_iface_cycle "$iface" "$peer_ip"
173
+}
174
+
175
+has_tb_netdev() {
176
+  ls /sys/class/net/thunderbolt* >/dev/null 2>&1
177
+}
178
+
179
+has_stale_tb_xdomain() {
180
+  local dev=""
181
+  for dev in /sys/bus/thunderbolt/devices/[0-9]-[1-9]*; do
182
+    [ -e "$dev" ] || continue
183
+    case "${dev##*/}" in
184
+      *.*|*:*)
185
+        continue
186
+        ;;
187
+    esac
188
+
189
+    if ! ls "${dev}".* >/dev/null 2>&1; then
190
+      return 0
191
+    fi
192
+  done
193
+
194
+  return 1
195
+}
196
+
197
+trigger_tb_rescan() {
198
+  local domain=""
199
+  for domain in /sys/bus/thunderbolt/devices/domain*; do
200
+    [ -e "$domain/rescan" ] && echo 1 > "$domain/rescan" || true
201
+  done
202
+
203
+  udevadm trigger --subsystem-match=thunderbolt --action=change || true
204
+  udevadm trigger --subsystem-match=net --action=add || true
205
+}
206
+
207
+run_nhi_rescan() {
208
+  local epoch="$1"
209
+  local dev=""
210
+  local cls=""
211
+  local drv=""
212
+  local nhi_pci=""
213
+
214
+  for dev in /sys/bus/pci/devices/*; do
215
+    [ -e "$dev/class" ] || continue
216
+    [ -e "$dev/driver" ] || continue
217
+    [ -w "$dev/remove" ] || continue
218
+    cls="$(cat "$dev/class" 2>/dev/null || true)"
219
+    drv="$(basename "$(readlink -f "$dev/driver" 2>/dev/null || true)")"
220
+    if [ "$cls" = "0x088000" ] && [ "$drv" = "thunderbolt" ]; then
221
+      nhi_pci="$dev"
222
+      break
223
+    fi
224
+  done
225
+
226
+  if [ -n "$nhi_pci" ]; then
227
+    echo 1 > "$nhi_pci/remove" || true
228
+    sleep 1
229
+    echo 1 > /sys/bus/pci/rescan || true
230
+    printf '%s\n' "$epoch" > "$LAST_NHI_RESCAN_FILE"
231
+    return 0
232
+  fi
233
+
234
+  return 1
235
+}
236
+
237
+# Keep the bridge present and up before trying to enslave ports.
238
+ip link show "$BRIDGE" >/dev/null 2>&1 || ip link add name "$BRIDGE" type bridge || true
239
+ip link set "$BRIDGE" mtu "$MTU" || true
240
+ip link set "$BRIDGE" up || true
241
+
242
+for path in /sys/class/net/thunderbolt*; do
243
+  [ -e "$path" ] || continue
244
+  IFACE="${path##*/}"
245
+  FOUND_TB_IFACE=1
246
+  ip link set "$IFACE" up || true
247
+  ip link set "$IFACE" mtu "$MTU" || true
248
+  ip link set "$IFACE" master "$BRIDGE" || true
249
+  systemctl start "tb-enlist@${IFACE}.service" || true
250
+done
251
+
252
+# If no thunderbolt netdev exists but a TB domain exists, force a rescan + udev retrigger.
253
+if [ "$FOUND_TB_IFACE" -eq 0 ] && [ -d /sys/bus/thunderbolt/devices ]; then
254
+  trigger_tb_rescan
255
+
256
+  # Escalate with cooldown: try PCI NHI remove+rescan to emulate a soft replug.
257
+  sleep 2
258
+  if ! has_tb_netdev; then
259
+    now="$(date +%s)"
260
+    last="0"
261
+    if [ -f "$LAST_BOLT_RESTART_FILE" ]; then
262
+      last="$(cat "$LAST_BOLT_RESTART_FILE" 2>/dev/null || echo 0)"
263
+    fi
264
+
265
+    case "$last" in
266
+      ''|*[!0-9]*)
267
+        last=0
268
+        ;;
269
+    esac
270
+
271
+    nhi_last="0"
272
+    if [ -f "$LAST_NHI_RESCAN_FILE" ]; then
273
+      nhi_last="$(cat "$LAST_NHI_RESCAN_FILE" 2>/dev/null || echo 0)"
274
+    fi
275
+    case "$nhi_last" in
276
+      ''|*[!0-9]*)
277
+        nhi_last=0
278
+        ;;
279
+    esac
280
+
281
+    if [ $((now - nhi_last)) -ge "$NHI_RESCAN_COOLDOWN_SEC" ]; then
282
+      if run_nhi_rescan "$now"; then
283
+        sleep "$NHI_SETTLE_SEC"
284
+        trigger_tb_rescan
285
+
286
+        # On newer kernels the first NHI reset can stop at the peer xdomain host
287
+        # node without recreating the matching *.0 network service.
288
+        if ! has_tb_netdev && has_stale_tb_xdomain; then
289
+          retry_now="$(date +%s)"
290
+          if run_nhi_rescan "$retry_now"; then
291
+            sleep "$NHI_SETTLE_SEC"
292
+            trigger_tb_rescan
293
+          fi
294
+        fi
295
+      fi
296
+    fi
297
+
298
+    # Secondary fallback with cooldown: restart boltd if interface is still missing
299
+    # and the host actually uses that service.
300
+    if ! has_tb_netdev; then
301
+      if [ $((now - last)) -ge "$BOLT_RESTART_COOLDOWN_SEC" ]; then
302
+        if systemctl list-unit-files bolt.service >/dev/null 2>&1; then
303
+          systemctl restart bolt.service || true
304
+          printf '%s\n' "$now" > "$LAST_BOLT_RESTART_FILE"
305
+        fi
306
+      fi
307
+    fi
308
+
309
+    trigger_tb_rescan
310
+  fi
311
+fi
312
+
313
+for path in /sys/class/net/thunderbolt*; do
314
+  [ -e "$path" ] || continue
315
+  assess_peer_health "${path##*/}"
316
+done
+18 -0
projects/thunderbolts/deploy/attempt1/common/systemd/system/tb-bridge.service
@@ -0,0 +1,18 @@
1
+# /etc/systemd/system/tb-bridge.service
2
+[Unit]
3
+Description=Ensure thunderbridge exists early
4
+DefaultDependencies=no
5
+After=network-pre.target
6
+Before=network.target
7
+
8
+[Service]
9
+Type=oneshot
10
+RemainAfterExit=yes
11
+# Create only if it doesn't exist
12
+ExecStart=/bin/sh -c '/sbin/ip link show thunderbridge >/dev/null 2>&1 || /sbin/ip link add thunderbridge type bridge'
13
+# Set params every time (harmless if already set)
14
+ExecStart=/sbin/ip link set thunderbridge mtu 65520
15
+ExecStart=/sbin/ip link set thunderbridge up
16
+
17
+[Install]
18
+WantedBy=multi-user.target
+24 -0
projects/thunderbolts/deploy/attempt1/common/systemd/system/tb-enlist@.service
@@ -0,0 +1,24 @@
1
+# /etc/systemd/system/tb-enlist@.service
2
+[Unit]
3
+Description=Attach %I to thunderbridge with MTU
4
+# Pornește numai când device-ul există
5
+BindsTo=sys-subsystem-net-devices-%i.device
6
+After=sys-subsystem-net-devices-%i.device tb-bridge.service
7
+Requires=tb-bridge.service
8
+# Păstrează porturile thunderbolt în bridge până când shutdown-ul ajunge
9
+# efectiv la oprirea rețelei; altfel NFS de pe 192.168.10.x pierde
10
+# transportul înainte de unmount și stă în timeout.
11
+Before=network.target
12
+
13
+[Service]
14
+Type=oneshot
15
+RemainAfterExit=yes
16
+# Setează MTU pe iface și bridge, apoi master
17
+ExecStart=/sbin/ip link set %i up
18
+ExecStart=/sbin/ip link set %i mtu 65520
19
+ExecStart=/sbin/ip link set thunderbridge mtu 65520
20
+ExecStart=/sbin/ip link set %i master thunderbridge
21
+
22
+# La stop (device remove), desprinde curat
23
+ExecStop=-/sbin/ip link set %i nomaster
24
+ExecStop=-/sbin/ip link set %i down
+8 -0
projects/thunderbolts/deploy/attempt1/common/systemd/system/tb-recover.service
@@ -0,0 +1,8 @@
1
+[Unit]
2
+Description=Recover Thunderbolt net interfaces into thunderbridge
3
+After=tb-bridge.service bolt.service
4
+Wants=tb-bridge.service
5
+
6
+[Service]
7
+Type=oneshot
8
+ExecStart=/usr/local/sbin/tb-recover.sh
+11 -0
projects/thunderbolts/deploy/attempt1/common/systemd/system/tb-recover.timer
@@ -0,0 +1,11 @@
1
+[Unit]
2
+Description=Periodic Thunderbolt recovery probe
3
+
4
+[Timer]
5
+OnBootSec=30s
6
+OnUnitActiveSec=30s
7
+AccuracySec=5s
8
+Unit=tb-recover.service
9
+
10
+[Install]
11
+WantedBy=timers.target
+4 -0
projects/thunderbolts/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules
@@ -0,0 +1,4 @@
1
+# /etc/udev/rules.d/90-thunderbolt-net-systemd.rules
2
+ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \
3
+  RUN+="/sbin/ip link set %k mtu 65520", \
4
+  TAG+="systemd", ENV{SYSTEMD_WANTS}="tb-enlist@%k.service"
+129 -0
projects/thunderbolts/deploy/attempt1/deploy_tb.sh
@@ -0,0 +1,129 @@
1
+#!/usr/bin/env bash
2
+# deploy_tb.sh — Thunderbolt bridge deploy (Bash 3 compatible)
3
+
4
+set -eo pipefail
5
+
6
+# ---------- EDIT THESE ----------
7
+get_mgmt_ip() {
8
+  case "$1" in
9
+    baobab) echo "192.168.2.91" ;;
10
+    ebony)  echo "192.168.2.92" ;;
11
+    tapia)  echo "192.168.2.93" ;;
12
+    *)      echo "" ;;
13
+  esac
14
+}
15
+get_tb_ip() {
16
+  case "$1" in
17
+    baobab) echo "192.168.10.91" ;;
18
+    ebony)  echo "192.168.10.92" ;;
19
+    tapia)  echo "192.168.10.93" ;;
20
+    *)      echo "" ;;
21
+  esac
22
+}
23
+# --------------------------------
24
+
25
+TARGETS=("$@")
26
+if [ ${#TARGETS[@]} -eq 0 ]; then
27
+  TARGETS=(baobab ebony tapia)
28
+fi
29
+
30
+SSH_USER="root"
31
+SSH_OPTS="-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
32
+BASE_DIR="$(pwd)"
33
+
34
+COMMON_UDEV="$BASE_DIR/common/udev/rules.d/90-thunderbolt-net-systemd.rules"
35
+COMMON_SVC1="$BASE_DIR/common/systemd/system/tb-enlist@.service"
36
+COMMON_SVC2="$BASE_DIR/common/systemd/system/tb-bridge.service"
37
+COMMON_SVC3="$BASE_DIR/common/systemd/system/tb-recover.service"
38
+COMMON_TMR1="$BASE_DIR/common/systemd/system/tb-recover.timer"
39
+COMMON_BIN1="$BASE_DIR/common/sbin/tb-recover.sh"
40
+
41
+require() {
42
+  for f in "$@"; do
43
+    [ -f "$f" ] || { echo "Missing required file: $f" >&2; exit 1; }
44
+  done
45
+}
46
+
47
+# try mgmt IP first, then TB IP; print chosen IP and return 0 if SSH works
48
+pick_ip() {
49
+  local host="$1" ip=""
50
+  ip="$(get_mgmt_ip "$host")"
51
+  if [ -n "$ip" ] && ssh $SSH_OPTS -q "${SSH_USER}@${ip}" true 2>/dev/null; then
52
+    echo "$ip"; return 0
53
+  fi
54
+  ip="$(get_tb_ip "$host")"
55
+  if [ -n "$ip" ] && ssh $SSH_OPTS -q "${SSH_USER}@${ip}" true 2>/dev/null; then
56
+    echo "$ip"; return 0
57
+  fi
58
+  # fall back to mgmt for error messaging
59
+  ip="$(get_mgmt_ip "$host")"
60
+  [ -n "$ip" ] && echo "$ip"
61
+  return 1
62
+}
63
+
64
+deploy_node() {
65
+  local host="$1"
66
+  local node_dir="$BASE_DIR/$host"
67
+  [ -d "$node_dir" ] || { echo "No node directory: $node_dir" >&2; exit 1; }
68
+
69
+  local ip
70
+  ip="$(pick_ip "$host")" || {
71
+    echo "!! [$host] SSH not reachable on $(get_mgmt_ip "$host") or $(get_tb_ip "$host")). Fix IPs or firewall." >&2
72
+    exit 1
73
+  }
74
+
75
+  echo "==> [$host@$ip] prepare remote dirs"
76
+  ssh $SSH_OPTS "${SSH_USER}@${ip}" "mkdir -p /etc/udev/rules.d /etc/systemd/system /etc/network/interfaces.d /usr/local/sbin"
77
+
78
+  echo "==> [$host@$ip] copy COMMON files"
79
+  scp -q "$COMMON_UDEV" "${SSH_USER}@${ip}:/etc/udev/rules.d/90-thunderbolt-net-systemd.rules"
80
+  scp -q "$COMMON_SVC1" "${SSH_USER}@${ip}:/etc/systemd/system/tb-enlist@.service"
81
+  scp -q "$COMMON_SVC2" "${SSH_USER}@${ip}:/etc/systemd/system/tb-bridge.service"
82
+  scp -q "$COMMON_SVC3" "${SSH_USER}@${ip}:/etc/systemd/system/tb-recover.service"
83
+  scp -q "$COMMON_TMR1" "${SSH_USER}@${ip}:/etc/systemd/system/tb-recover.timer"
84
+  scp -q "$COMMON_BIN1" "${SSH_USER}@${ip}:/usr/local/sbin/tb-recover.sh"
85
+
86
+  echo "==> [$host@$ip] copy NODE config"
87
+  require "$node_dir/etc/network/interfaces" "$node_dir/etc/network/interfaces.d/10-thunderbolt"
88
+  scp -q "$node_dir/etc/network/interfaces" "${SSH_USER}@${ip}:/etc/network/interfaces"
89
+  scp -q "$node_dir/etc/network/interfaces.d/10-thunderbolt" "${SSH_USER}@${ip}:/etc/network/interfaces.d/10-thunderbolt"
90
+
91
+  echo "==> [$host@$ip] enable + reload"
92
+  ssh $SSH_OPTS "${SSH_USER}@${ip}" bash -s <<'EOF'
93
+set -e
94
+chmod 0644 /etc/udev/rules.d/90-thunderbolt-net-systemd.rules
95
+chmod 0644 /etc/systemd/system/tb-enlist@.service
96
+chmod 0644 /etc/systemd/system/tb-bridge.service
97
+chmod 0644 /etc/systemd/system/tb-recover.service
98
+chmod 0644 /etc/systemd/system/tb-recover.timer
99
+chmod 0755 /usr/local/sbin/tb-recover.sh
100
+systemctl daemon-reload
101
+udevadm control --reload
102
+command -v ifreload >/dev/null 2>&1 && ifreload -a || true
103
+systemctl enable --now tb-bridge.service
104
+systemctl enable --now tb-recover.timer
105
+systemctl start tb-recover.service
106
+udevadm trigger --subsystem-match=net --action=add
107
+EOF
108
+
109
+  echo "==> [$host@$ip] status"
110
+  ssh $SSH_OPTS "${SSH_USER}@${ip}" bash -s <<'EOF'
111
+set -e
112
+systemctl --no-pager --plain --full status tb-bridge.service | sed -n '1,6p'
113
+systemctl --no-pager --plain --full status tb-recover.timer | sed -n '1,8p'
114
+systemctl --no-pager --plain --full list-units 'tb-enlist@*.service' | sed -n '1,12p' || true
115
+ip -d link show thunderbridge | sed -n '1,3p'
116
+bridge link | grep -E 'thunderbolt|thunderbridge' || true
117
+EOF
118
+
119
+  echo "==> [$host@$ip] done."
120
+  echo
121
+}
122
+
123
+require "$COMMON_UDEV" "$COMMON_SVC1" "$COMMON_SVC2" "$COMMON_SVC3" "$COMMON_TMR1" "$COMMON_BIN1"
124
+
125
+for h in "${TARGETS[@]}"; do
126
+  deploy_node "$h"
127
+done
128
+
129
+echo "All done. Go poke the cables and watch systemd behave."
+41 -0
projects/thunderbolts/deploy/attempt1/ebony/etc/network/interfaces
@@ -0,0 +1,41 @@
1
+# network interface settings; autogenerated
2
+# Please do NOT modify this file directly, unless you know what
3
+# you're doing.
4
+#
5
+# If you want to manage parts of the network configuration manually,
6
+# please utilize the 'source' or 'source-directory' directives to do
7
+# so.
8
+# PVE will preserve these directives, but will NOT read its network
9
+# configuration from sourced files, so do not attempt to move any of
10
+# the PVE managed interfaces into external files!
11
+
12
+auto lo
13
+iface lo inet loopback
14
+
15
+auto eno1
16
+iface eno1 inet manual
17
+
18
+iface eno1.442 inet manual
19
+
20
+auto vmbr443
21
+iface vmbr443 inet static
22
+	address 192.168.2.92/24
23
+	gateway 192.168.2.1
24
+	bridge-ports eno1.443
25
+	bridge-stp off
26
+	bridge-fd 0
27
+
28
+auto vmbr444
29
+iface vmbr444 inet static
30
+	address 192.168.4.92/24
31
+	bridge-ports eno1.444
32
+	bridge-stp off
33
+	bridge-fd 0
34
+
35
+auto vmbr442
36
+iface vmbr442 inet manual
37
+	bridge-ports eno1.442
38
+	bridge-stp off
39
+	bridge-fd 0
40
+
41
+source /etc/network/interfaces.d/*
+20 -0
projects/thunderbolts/deploy/attempt1/ebony/etc/network/interfaces.d/10-thunderbolt
@@ -0,0 +1,20 @@
1
+# Modular network configuration for ebony - Thunderbolt networking
2
+# ifupdown2-safe: bridge comes up alone; TB ports hotplug in later
3
+
4
+# Thunderbolt NIC appears late — don't 'auto' it
5
+allow-hotplug thunderbolt0
6
+iface thunderbolt0 inet manual
7
+    pre-up ip link set dev $IFACE mtu 65520 || true
8
+    post-up ip link set dev $IFACE mtu 65520 || true
9
+    post-up ip link set dev $IFACE master thunderbridge || true
10
+
11
+# Bridge must exist even with zero members
12
+auto thunderbridge
13
+iface thunderbridge inet static
14
+    address 192.168.10.92/24
15
+    bridge-ports none
16
+    bridge-stp off
17
+    bridge-fd 0
18
+    mtu 65520
19
+    pre-up ip link add name $IFACE type bridge 2>/dev/null || true
20
+    post-up ip link set dev $IFACE up
+4 -0
projects/thunderbolts/deploy/attempt1/ebony/etc/udev/rules.d/90-thunderbolt-net-systemd.rules
@@ -0,0 +1,4 @@
1
+# /etc/udev/rules.d/90-thunderbolt-net-systemd.rules
2
+ACTION=="add", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \
3
+  RUN+="/sbin/ip link set %k mtu 65520", \
4
+  TAG+="systemd", ENV{SYSTEMD_WANTS}="tb-enlist@%k.service"
+41 -0
projects/thunderbolts/deploy/attempt1/tapia/etc/network/interfaces
@@ -0,0 +1,41 @@
1
+# network interface settings; autogenerated
2
+# Please do NOT modify this file directly, unless you know what
3
+# you're doing.
4
+#
5
+# If you want to manage parts of the network configuration manually,
6
+# please utilize the 'source' or 'source-directory' directives to do
7
+# so.
8
+# PVE will preserve these directives, but will NOT read its network
9
+# configuration from sourced files, so do not attempt to move any of
10
+# the PVE managed interfaces into external files!
11
+
12
+auto lo
13
+iface lo inet loopback
14
+
15
+auto eno1
16
+iface eno1 inet manual
17
+
18
+iface eno1.442 inet manual
19
+
20
+auto vmbr443
21
+iface vmbr443 inet static
22
+	address 192.168.2.93/24
23
+	gateway 192.168.2.1
24
+	bridge-ports eno1.443
25
+	bridge-stp off
26
+	bridge-fd 0
27
+
28
+auto vmbr444
29
+iface vmbr444 inet static
30
+	address 192.168.4.93/24
31
+	bridge-ports eno1.444
32
+	bridge-stp off
33
+	bridge-fd 0
34
+
35
+auto vmbr442
36
+iface vmbr442 inet manual
37
+	bridge-ports eno1.442
38
+	bridge-stp off
39
+	bridge-fd 0
40
+
41
+source /etc/network/interfaces.d/*
+20 -0
projects/thunderbolts/deploy/attempt1/tapia/etc/network/interfaces.d/10-thunderbolt
@@ -0,0 +1,20 @@
1
+# Modular network configuration for tapia - Thunderbolt networking
2
+# ifupdown2-safe: bridge comes up alone; TB ports hotplug in later
3
+
4
+# Thunderbolt NIC appears late — don't 'auto' it
5
+allow-hotplug thunderbolt0
6
+iface thunderbolt0 inet manual
7
+    pre-up ip link set dev $IFACE mtu 65520 || true
8
+    post-up ip link set dev $IFACE mtu 65520 || true
9
+    post-up ip link set dev $IFACE master thunderbridge || true
10
+
11
+# Bridge must exist even with zero members
12
+auto thunderbridge
13
+iface thunderbridge inet static
14
+    address 192.168.10.93/24
15
+    bridge-ports none
16
+    bridge-stp off
17
+    bridge-fd 0
18
+    mtu 65520
19
+    pre-up ip link add name $IFACE type bridge 2>/dev/null || true
20
+    post-up ip link set dev $IFACE up
+325 -0
projects/thunderbolts/issues/ISSUE-2025-001.md
@@ -0,0 +1,325 @@
1
+# Issue ISSUE-2025-001: Thunderbolt interfaces MTU resets to 1500 after networking restart
2
+
3
+**Status:** closed  
4
+**Priority:** high  
5
+**Created:** 2025-10-30  
6
+**Updated:** 2025-10-30  
7
+**Assigned to:** unassigned
8
+**Resolution:** Fixed with hybrid approach (udev rule + post-up hook)
9
+
10
+---
11
+
12
+## Summary
13
+
14
+`systemctl restart networking` causes thunderbolt interfaces to reset MTU from 65520 to default 1500.
15
+
16
+---
17
+
18
+## Description
19
+
20
+After executing `systemctl restart networking` on cluster nodes, the thunderbolt interfaces (thunderbolt0, thunderbolt1) lose their configured MTU of 65520 and revert to the default 1500. This also sometimes occurs after system reboot, though the behavior is not 100% reproducible on reboot.
21
+
22
+The MTU configuration is critical for thunderbolt bridge performance and should persist across networking restarts.
23
+
24
+---
25
+
26
+## Environment
27
+
28
+- **Affected nodes:** all (baobab, ebony, tapia)
29
+- **Component:** network
30
+- **Version/software:** Proxmox VE 8.x, ifupdown2, thunderbolt networking
31
+
32
+---
33
+
34
+## Steps to Reproduce
35
+
36
+1. Verify current thunderbolt interface MTU: `ip link show thunderbolt0`
37
+2. Observe MTU is set to 65520
38
+3. Execute: `systemctl restart networking`
39
+4. Check MTU again: `ip link show thunderbolt0`
40
+5. MTU has reverted to 1500
41
+
42
+**Reboot scenario (intermittent):**
43
+1. Reboot node
44
+2. After boot, check thunderbolt interface MTU
45
+3. Sometimes MTU is 1500 instead of expected 65520
46
+
47
+---
48
+
49
+## Expected Behavior
50
+
51
+Thunderbolt interfaces should maintain MTU 65520 after:
52
+- `systemctl restart networking`
53
+- System reboot
54
+
55
+---
56
+
57
+## Actual Behavior
58
+
59
+MTU resets to 1500 (default) after networking restart. Reboot behavior is inconsistent but sometimes exhibits the same issue.
60
+
61
+---
62
+
63
+## Logs/Evidence
64
+
65
+```bash
66
+# Before restart
67
+ip link show thunderbolt0
68
+# ... mtu 65520 ...
69
+
70
+# After systemctl restart networking
71
+ip link show thunderbolt0
72
+# ... mtu 1500 ...
73
+```
74
+
75
+---
76
+
77
+## Investigation Notes
78
+
79
+- [2025-10-30] Issue reported. Configuration files in `/etc/network/interfaces.d/10-thunderbolt` contain `pre-up ip link set dev $IFACE mtu 65520 || true` but this may not be executed consistently during networking restart.
80
+- [2025-10-30] The `allow-hotplug` directive for thunderbolt interfaces may cause race conditions where the interface is brought up before the pre-up script runs.
81
+- [2025-10-30] Reboot inconsistency suggests timing or udev rule interaction issues.
82
+
83
+### Deep Investigation (2025-10-30)
84
+
85
+**Current Configuration Analysis:**
86
+
87
+1. **Interface Configuration** (`/etc/network/interfaces.d/10-thunderbolt`):
88
+   - Uses `allow-hotplug` for thunderbolt0 and thunderbolt1
89
+   - Has `pre-up ip link set dev $IFACE mtu 65520 || true` in iface stanza
90
+   - Bridge has `mtu 65520` in its static configuration
91
+
92
+2. **Systemd Services**:
93
+   - `tb-bridge.service`: Creates bridge early, sets MTU 65520
94
+   - `tb-enlist@.service`: Triggered by udev on thunderbolt interface add, sets MTU and enslaves to bridge
95
+   - Services have proper ordering: `After=sys-subsystem-net-devices-%i.device tb-bridge.service`
96
+
97
+3. **Udev Rule** (`/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`):
98
+   - Triggers `tb-enlist@.service` when thunderbolt interfaces appear
99
+   - Does NOT directly set MTU via udev
100
+
101
+**Root Cause Analysis:**
102
+
103
+The problem occurs during `systemctl restart networking` because:
104
+
105
+1. **ifupdown2 behavior**: When restarting networking, ifupdown2:
106
+   - Takes DOWN all `allow-hotplug` interfaces
107
+   - Brings them back UP based on configuration
108
+   - During this process, `pre-up` scripts execute BEFORE the interface is brought up
109
+
110
+2. **Timing Issue**: The sequence is:
111
+   ```
112
+   networking.service restart
113
+   → ifdown thunderbolt0 (MTU reset to default 1500 by kernel)
114
+   → pre-up script runs (sets MTU 65520)
115
+   → ifup brings interface up
116
+   → RACE: systemd tb-enlist@.service might not re-trigger OR might run before ifupdown finishes
117
+   ```
118
+
119
+3. **Why systemd services don't help during networking restart**:
120
+   - `tb-enlist@.service` is triggered by udev on device ADD event
121
+   - During `networking restart`, the device is not removed/added, just brought down/up
122
+   - Therefore, systemd service does NOT re-execute
123
+   - The MTU setting relies ONLY on the `pre-up` script in interfaces configuration
124
+
125
+4. **Why it sometimes fails on reboot**:
126
+   - Race condition between:
127
+     - ifupdown bringing up the interface (with pre-up MTU setting)
128
+     - systemd tb-enlist@ service being triggered by udev
129
+   - If systemd service wins the race and enslaves interface before ifupdown sets MTU, the MTU might not stick
130
+
131
+**Key Finding**: The `pre-up` script in `/etc/network/interfaces.d/10-thunderbolt` SHOULD work, but there's likely a timing issue or the script is not being executed properly during networking restart with ifupdown2.
132
+
133
+---
134
+
135
+## Proposed Solutions
136
+
137
+### Solution 1: Add MTU setting to udev rule (RECOMMENDED)
138
+
139
+Add MTU setting directly in the udev rule that triggers when thunderbolt interfaces appear. This ensures MTU is set immediately when the interface is created, before any other service touches it.
140
+
141
+**Implementation:**
142
+
143
+Modify `/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`:
144
+
145
+```bash
146
+# /etc/udev/rules.d/90-thunderbolt-net-systemd.rules
147
+ACTION=="add", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \
148
+  RUN+="/sbin/ip link set %k mtu 65520", \
149
+  TAG+="systemd", ENV{SYSTEMD_WANTS}="tb-enlist@%k.service"
150
+```
151
+
152
+**Pros:**
153
+- Runs immediately on device add, before any other service
154
+- Independent of ifupdown2 behavior
155
+- Handles both boot and hotplug scenarios
156
+- Simple, one-line change
157
+
158
+**Cons:**
159
+- Must be deployed to all nodes
160
+
161
+### Solution 2: Add post-up hook in interfaces configuration
162
+
163
+Add a `post-up` hook in addition to `pre-up` to ensure MTU is set after the interface is fully up.
164
+
165
+**Implementation:**
166
+
167
+Modify `/etc/network/interfaces.d/10-thunderbolt`:
168
+
169
+```bash
170
+allow-hotplug thunderbolt0
171
+iface thunderbolt0 inet manual
172
+    pre-up ip link set dev $IFACE mtu 65520 || true
173
+    post-up ip link set dev $IFACE mtu 65520 || true
174
+```
175
+
176
+**Pros:**
177
+- Uses existing ifupdown2 mechanisms
178
+- MTU set twice (pre and post) increases reliability
179
+- No new files needed
180
+
181
+**Cons:**
182
+- Still relies on ifupdown2 executing hooks correctly
183
+- May not fix the race condition completely
184
+
185
+### Solution 3: Modify tb-enlist@ service to always set MTU
186
+
187
+Make the systemd service idempotent and ensure it sets MTU even if the device was already up.
188
+
189
+**Implementation:**
190
+
191
+Modify `/etc/systemd/system/tb-enlist@.service`:
192
+
193
+```ini
194
+[Unit]
195
+Description=Attach %I to thunderbridge with MTU
196
+BindsTo=sys-subsystem-net-devices-%i.device
197
+After=sys-subsystem-net-devices-%i.device tb-bridge.service network.target
198
+Requires=tb-bridge.service
199
+
200
+[Service]
201
+Type=oneshot
202
+RemainAfterExit=yes
203
+# Always set MTU first, regardless of current state
204
+ExecStartPre=/sbin/ip link set %i mtu 65520 || true
205
+ExecStart=/sbin/ip link set %i up
206
+ExecStart=/sbin/ip link set %i mtu 65520
207
+ExecStart=/sbin/ip link set thunderbridge mtu 65520
208
+ExecStart=/sbin/ip link set %i master thunderbridge
209
+
210
+ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true
211
+ExecStop=/sbin/ip link set %i down 2>/dev/null || true
212
+
213
+# Add this to re-run service on networking.service restart
214
+[Install]
215
+Also=network.target
216
+```
217
+
218
+**Pros:**
219
+- Comprehensive, handles multiple scenarios
220
+- Can be triggered manually if needed
221
+
222
+**Cons:**
223
+- More complex
224
+- Still might not trigger on `networking.service` restart without additional changes
225
+
226
+### Solution 4: Hybrid approach (MOST ROBUST)
227
+
228
+Combine Solution 1 (udev) with Solution 2 (post-up hook).
229
+
230
+**Implementation:**
231
+
232
+1. Add MTU to udev rule (Solution 1)
233
+2. Keep both pre-up and add post-up in interfaces.d config (Solution 2)
234
+3. Ensure bridge always has MTU set in its configuration
235
+
236
+This creates multiple layers of MTU enforcement:
237
+- Udev sets it immediately on device appearance
238
+- pre-up sets it before ifup
239
+- post-up sets it after interface is fully up
240
+- systemd service sets it when enslaving to bridge
241
+
242
+**Pros:**
243
+- Defense in depth
244
+- Handles all edge cases
245
+- Most reliable solution
246
+
247
+**Cons:**
248
+- Slight redundancy (MTU set multiple times)
249
+
250
+---
251
+
252
+## Recommended Implementation Plan
253
+
254
+**Phase 1: Quick Fix (Solution 1)**
255
+1. Deploy updated udev rule to all nodes
256
+2. Reload udev rules: `udevadm control --reload-rules`
257
+3. Test with `systemctl restart networking`
258
+4. Verify MTU persists
259
+
260
+**Phase 2: If needed (Solution 4)**
261
+1. Add post-up hook to interfaces.d/10-thunderbolt
262
+2. Update tb-enlist@ service with ExecStartPre
263
+3. Deploy and test
264
+
265
+**Testing Protocol:**
266
+```bash
267
+# On each node:
268
+# 1. Check current MTU
269
+ip link show thunderbolt0 | grep mtu
270
+
271
+# 2. Restart networking
272
+systemctl restart networking
273
+
274
+# 3. Verify MTU persisted
275
+ip link show thunderbolt0 | grep mtu
276
+# Should show: mtu 65520
277
+
278
+# 4. Test reboot persistence
279
+reboot
280
+# After boot:
281
+ip link show thunderbolt0 | grep mtu
282
+```
283
+
284
+---
285
+
286
+## Related Issues
287
+
288
+None yet.
289
+
290
+---
291
+
292
+## Changelog References
293
+
294
+None yet. Will be referenced when fix is implemented.
295
+
296
+---
297
+
298
+## Resolution (2025-10-30)
299
+
300
+**Issue Status: RESOLVED**
301
+
302
+### Root Cause Confirmed
303
+The MTU reset occurred because `systemctl restart networking` triggers ifupdown2 to bring interfaces down and back up, but the existing `pre-up` hooks in interfaces.d were insufficient. The systemd services (`tb-enlist@.service`) don't re-trigger on networking restart since the device isn't removed/added.
304
+
305
+### Solution Implemented
306
+Deployed **hybrid approach** combining:
307
+1. **Enhanced udev rule**: Added MTU setting on device add/change events
308
+2. **Post-up hook**: Added `post-up` script in interfaces.d to ensure MTU after interface bring-up
309
+
310
+### Changes Made
311
+- **Udev rule** (`/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`): Added `RUN+="/sbin/ip link set %k mtu 65520"` for immediate MTU setting
312
+- **Interfaces config** (`/etc/network/interfaces.d/10-thunderbolt`): Added `post-up ip link set dev $IFACE mtu 65520 || true` for all thunderbolt interfaces
313
+
314
+### Testing Results
315
+- **ebony**: ✅ MTU persists after `systemctl restart networking`
316
+- **tapia**: ✅ MTU persists after `systemctl restart networking`  
317
+- **baobab**: ✅ Both thunderbolt0 and thunderbolt1 maintain MTU after restart
318
+
319
+### Files Modified
320
+- `deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules`
321
+- `deploy/attempt1/ebony/etc/network/interfaces.d/10-thunderbolt`
322
+- `deploy/attempt1/tapia/etc/network/interfaces.d/10-thunderbolt`
323
+- `deploy/attempt1/baobab/etc/network/interfaces.d/10-thunderbolt`
324
+
325
+The fix ensures MTU 65520 persists across all scenarios: boot, hotplug, and networking restart.
+88 -0
projects/thunderbolts/issues/ISSUE-2025-002.md
@@ -0,0 +1,88 @@
1
+# Thunderbolt Interfaces Not in Bridge After MTU Fix
2
+
3
+## Issue ID: ISSUE-2025-002
4
+
5
+**Status:** closed  
6
+**Priority:** high  
7
+**Created:** 2025-10-30  
8
+**Updated:** 2025-10-30  
9
+**Assigned to:** unassigned
10
+
11
+---
12
+
13
+## Summary
14
+
15
+After applying the MTU fix, thunderbolt interfaces are no longer members of the thunderbridge.
16
+
17
+---
18
+
19
+## Description
20
+
21
+Following the deployment of the MTU persistence fix (post-up hooks in interfaces.d), the thunderbolt interfaces failed to join the thunderbridge after `systemctl restart networking`. This regression broke cluster connectivity via thunderbolt.
22
+
23
+---
24
+
25
+## Environment
26
+
27
+- **Affected nodes:** baobab, ebony, tapia
28
+- **Component:** network (thunderbolt bridging)
29
+- **Version/software:** Proxmox VE 8.x, ifupdown2, systemd services
30
+
31
+---
32
+
33
+## Steps to Reproduce
34
+
35
+1. Deploy MTU fix with post-up hooks in `/etc/network/interfaces.d/10-thunderbolt`.
36
+2. Run `systemctl restart networking`.
37
+3. Check `bridge link show` - thunderbolt interfaces not in thunderbridge.
38
+
39
+---
40
+
41
+## Expected Behavior
42
+
43
+Thunderbolt interfaces should remain in thunderbridge with MTU 65520 after networking restart.
44
+
45
+---
46
+
47
+## Actual Behavior
48
+
49
+Interfaces have correct MTU but are not added to the bridge, causing loss of cluster connectivity.
50
+
51
+---
52
+
53
+## Logs/Evidence
54
+
55
+```
56
+# After restart networking
57
+$ bridge link show
58
+(no thunderbolt interfaces listed)
59
+
60
+$ ip link show thunderbolt0
61
+thunderbolt0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 ...
62
+```
63
+
64
+---
65
+
66
+## Investigation Notes
67
+
68
+- 2025-10-30: Root cause identified - ifupdown2 brings up interfaces but systemd enlist services don't re-trigger on networking restart. Defense-in-depth needed.
69
+- 2025-10-30: Added `post-up ip link set dev $IFACE master thunderbridge || true` to interfaces.d files.
70
+
71
+---
72
+
73
+## Proposed Solution
74
+
75
+Add bridge membership to post-up hooks in `/etc/network/interfaces.d/10-thunderbolt` for all nodes.
76
+
77
+---
78
+
79
+## Related Issues
80
+
81
+- ISSUE-2025-001 (MTU reset issue)
82
+
83
+---
84
+
85
+## Changelog References
86
+
87
+- CHANGELOG entry: [2025-10-30] - Fixed bridge membership regression after MTU fix deployment.</content>
88
+<parameter name="filePath">/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/thunderbolts/issues/ISSUE-2025-002.md
+118 -0
projects/thunderbolts/issues/ISSUE-2026-001.md
@@ -0,0 +1,118 @@
1
+# tb-enlist Fails on Device Disconnect, Leaving Thunderbolt Link Down After Reboot
2
+
3
+## Issue ID: ISSUE-2026-001
4
+
5
+**Status:** investigating  
6
+**Priority:** high  
7
+**Created:** 2026-03-06  
8
+**Updated:** 2026-03-06  
9
+**Assigned to:** unassigned
10
+
11
+---
12
+
13
+## Summary
14
+
15
+On `tapia`, `tb-enlist@thunderbolt0.service` failed during `ExecStop`, and after a post-boot disconnect/reconnect the `thunderbolt0` interface did not come back.
16
+
17
+---
18
+
19
+## Description
20
+
21
+After reboot, the Tapia-Baobab Thunderbolt link briefly came up, then disconnected. A bad `ExecStop=` command in `tb-enlist@.service` caused unit failure (`status=255`) when systemd stopped the instance. In parallel, `boltd` logged a probing timeout after reconnect, and `thunderbolt0` was no longer present on `tapia`.
22
+
23
+---
24
+
25
+## Environment
26
+
27
+- **Affected nodes:** tapia (observed), all (same shared unit deployed cluster-wide)
28
+- **Component:** network (thunderbolt bridging/systemd integration)
29
+- **Version/software:** Proxmox VE 8.x, kernel `6.8.12-19-pve`, systemd oneshot templated unit
30
+
31
+---
32
+
33
+## Steps to Reproduce
34
+
35
+1. Boot `tapia` with current shared `tb-enlist@.service`.
36
+2. Let Thunderbolt peer connect, then trigger disconnect/remove event (observed during boot sequence).
37
+3. Check `systemctl status tb-enlist@thunderbolt0.service` and `ip link show thunderbolt0`.
38
+
39
+---
40
+
41
+## Expected Behavior
42
+
43
+- `tb-enlist@*.service` should stop cleanly when a Thunderbolt netdev disappears.
44
+- Unit should not remain failed due to teardown path.
45
+- On reconnect, interface should be eligible to re-enlist normally.
46
+
47
+---
48
+
49
+## Actual Behavior
50
+
51
+- `tb-enlist@thunderbolt0.service` entered failed state on stop.
52
+- Error included invalid arguments in `ExecStop`.
53
+- `thunderbolt0` disappeared on `tapia` and did not reappear after reconnect.
54
+- Behavior remains intermittent: after some `tapia` reboots, link stays down until physical unplug/replug.
55
+
56
+---
57
+
58
+## Logs/Evidence
59
+
60
+```text
61
+Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage.
62
+Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION
63
+Mar 06 08:27:22 tapia boltd[838]: probing: started [1000]
64
+Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000)
65
+Device "thunderbolt0" does not exist.
66
+```
67
+
68
+---
69
+
70
+## Investigation Notes
71
+
72
+- 2026-03-06: Confirmed `tb-bridge.service` was active and `thunderbridge` existed on both `baobab` and `tapia`.
73
+- 2026-03-06: Confirmed old `ExecStop` lines used shell syntax in non-shell context:
74
+  - `ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true`
75
+  - `ExecStop=/sbin/ip link set %i down 2>/dev/null || true`
76
+- 2026-03-06: Implemented fix with systemd-native ignore-errors prefix:
77
+  - `ExecStop=-/sbin/ip link set %i nomaster`
78
+  - `ExecStop=-/sbin/ip link set %i down`
79
+- 2026-03-06: Deployed patch to `tapia` and validated that unit can be reset/stopped without entering `failed`.
80
+- 2026-03-06: User-induced flap still showed intermittent non-recovery pattern; remediation was not sufficient by itself.
81
+- 2026-03-06: After reboot at ~08:49 EET, `tapia` link was observed up again (`thunderbolt0` forwarding), confirming intermittent behavior.
82
+- 2026-03-06: Added second-stage mitigation candidate: periodic recovery (`tb-recover.service` + `tb-recover.timer`) to re-enlist interfaces and force rescan when no thunderbolt netdev is present.
83
+- 2026-03-06: Validated mitigation on `tapia` by intentionally stopping `tb-enlist@thunderbolt0`; recovery timer re-attached interface in next cycle and returned `forwarding` state.
84
+- 2026-03-06: Rolled out mitigation to `baobab` and `ebony`; timer enabled and active on all three nodes.
85
+- 2026-03-06 10:01 EET: New flap captured on `tapia` (`host disconnected` at `10:01:30`); recovery happened after reconnect event (`new host found` at `10:01:48`), consistent with unplug/replug recovery.
86
+- 2026-03-06 10:05 EET: Added third-stage mitigation in `tb-recover.sh`: if no thunderbolt netdev after rescan, restart `bolt.service` and retrigger udev as fallback.
87
+- 2026-03-06 10:39 EET: Controlled flap test on `tapia` using `thunderbolt-net` unbind/bind (`0-1.0`) passed; `thunderbolt0` reappeared and returned to `forwarding` within seconds (`TEST_PASS`).
88
+- 2026-03-06 10:46 EET: Latest mitigation rollout completed on `baobab` and `ebony`; `tb-recover.timer` active/enabled and `tb-enlist@*` units active on all nodes.
89
+- 2026-03-06 13:25 EET: Reboot-loop regression reproduced on `tapia` - `thunderbridge` up but `thunderbolt0` missing entirely (`tb-enlist@thunderbolt0` inactive), while peer `baobab` port showed `NO-CARRIER`.
90
+- 2026-03-06 13:22-14:02 EET: Existing fallback (`bolt.service` restart) was insufficient; repeated `boltd` messages observed: `failed to get boot_acl: Connection timed out`.
91
+- 2026-03-06 14:02 EET: Software recovery without cable succeeded via Thunderbolt NHI PCI `remove + rescan`; `thunderbolt0` recreated and rejoined bridge.
92
+- 2026-03-06 14:04 EET: `tb-recover.sh` updated with cooldowned NHI rescan fallback (and guarded `boltd` restart fallback) and deployed cluster-wide.
93
+- 2026-03-07 03:35-03:42 EET: On `tapia` running `6.17.13-1-pve`, first NHI rescan rediscovered peer host `0-1` but did not recreate `0-1.0`; a second manual NHI reset at `03:42` recreated `thunderbolt0` and restored `forwarding`.
94
+- 2026-03-07 03:4x EET: Recovery logic updated so a stale xdomain host node without a `*.0` service triggers one bounded second NHI reset in the same `tb-recover.sh` run.
95
+
96
+---
97
+
98
+## Proposed Solution
99
+
100
+Use a two-layer recovery approach:
101
+1. Keep `ExecStop` commands shell-free and use systemd `-` prefix to ignore expected failures when device is already gone.
102
+2. Run periodic recovery (`tb-recover.timer`) that re-enlists existing thunderbolt netdevs and forces controller/net udev retrigger when no thunderbolt netdev is present.
103
+3. If netdev is still missing, perform cooldowned Thunderbolt NHI PCI `remove + rescan` (soft replug equivalent), then retrigger udev.
104
+4. If the controller comes back only as a peer xdomain host node (for example `0-1`) with no `0-1.0` service child, immediately perform one additional bounded NHI reset in the same recovery run.
105
+
106
+---
107
+
108
+## Related Issues
109
+
110
+- ISSUE-2025-002
111
+- ISSUE-2025-001
112
+
113
+---
114
+
115
+## Changelog References
116
+
117
+List CHANGELOG.md entries that reference this issue:
118
+- CHANGELOG entry: [Unreleased] - Fix invalid `ExecStop` in `tb-enlist@.service` to prevent failed unit on device removal [ISSUE-2026-001]
+83 -0
projects/thunderbolts/issues/TEMPLATE.md
@@ -0,0 +1,83 @@
1
+# Issue Template
2
+
3
+## Issue ID: ISSUE-YYYY-NNN
4
+
5
+**Status:** [open|investigating|in-progress|resolved|closed]  
6
+**Priority:** [low|medium|high|critical]  
7
+**Created:** YYYY-MM-DD  
8
+**Updated:** YYYY-MM-DD  
9
+**Assigned to:** [name or unassigned]
10
+
11
+---
12
+
13
+## Summary
14
+
15
+Brief one-line description of the issue.
16
+
17
+---
18
+
19
+## Description
20
+
21
+Detailed description of the problem, behavior, or feature request.
22
+
23
+---
24
+
25
+## Environment
26
+
27
+- **Affected nodes:** [baobab|ebony|tapia|all]
28
+- **Component:** [network|storage|vm|backup|cluster|other]
29
+- **Version/software:** (e.g., Proxmox 8.x, kernel version, etc.)
30
+
31
+---
32
+
33
+## Steps to Reproduce
34
+
35
+1. Step 1
36
+2. Step 2
37
+3. ...
38
+
39
+---
40
+
41
+## Expected Behavior
42
+
43
+What should happen.
44
+
45
+---
46
+
47
+## Actual Behavior
48
+
49
+What actually happens.
50
+
51
+---
52
+
53
+## Logs/Evidence
54
+
55
+```
56
+Paste relevant logs, command output, or error messages here.
57
+```
58
+
59
+---
60
+
61
+## Investigation Notes
62
+
63
+- [Date] Note 1
64
+- [Date] Note 2
65
+
66
+---
67
+
68
+## Proposed Solution
69
+
70
+Describe the proposed fix or workaround.
71
+
72
+---
73
+
74
+## Related Issues
75
+
76
+- ISSUE-YYYY-NNN (if any)
77
+
78
+---
79
+
80
+## Changelog References
81
+
82
+List CHANGELOG.md entries that reference this issue:
83
+- CHANGELOG entry: [date] - description
+59 -0
projects/thunderbolts/scripts/check_mcluster_network.sh
@@ -0,0 +1,59 @@
1
+#!/usr/bin/env bash
2
+# check_mcluster_network.sh — Minimal cluster network health check (pretty table)
3
+
4
+set -e
5
+
6
+NODES=(baobab ebony tapia autonas1 autonas2)
7
+CLUSTER_IPS=(192.168.10.91 192.168.10.92 192.168.10.93 192.168.10.95 192.168.10.96)
8
+MGMT_IPS=(192.168.2.91 192.168.2.92 192.168.2.93 192.168.2.95 192.168.2.96)
9
+SSH_OPTS="-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR"
10
+
11
+# Thunderbridge/thunderbolt status (unchanged)
12
+for i in "${!NODES[@]}"; do
13
+    node="${NODES[$i]}"
14
+    mgmt_ip="${MGMT_IPS[$i]}"
15
+    if [[ "$node" == autonas* ]]; then
16
+        continue
17
+    fi
18
+    mtu=$(ssh $SSH_OPTS root@$mgmt_ip "ip link show thunderbridge 2>/dev/null | grep mtu | awk '{print \$5}'" || echo "fail")
19
+    ports=$(ssh $SSH_OPTS root@$mgmt_ip "bridge link | grep master.*thunderbridge | awk '{print \$2}'" | xargs)
20
+    echo "$node: thunderbridge mtu=$mtu ports=$ports"
21
+    ssh $SSH_OPTS root@$mgmt_ip "ip -o link show | grep 'thunderbolt'" | while read -r line; do
22
+        iface=$(echo "$line" | awk '{print $2}')
23
+        mtu=$(echo "$line" | awk '{print $5}')
24
+        up=$(echo "$line" | grep -q 'UP' && echo "up" || echo "down")
25
+        forwarding=$(ssh $SSH_OPTS root@$mgmt_ip "bridge link show dev $iface 2>/dev/null" | grep -q 'state forwarding' && echo "forwarding" || echo "not-forwarding")
26
+        echo "  $iface mtu=$mtu $up $forwarding"
27
+    done
28
+done
29
+
30
+echo
31
+# Table header
32
+printf "%-10s |" "Node"
33
+for node in "${NODES[@]}"; do
34
+    printf " %10s |" "$node"
35
+done
36
+echo
37
+# localhost row
38
+printf "%-10s |" "localhost"
39
+for j in "${!NODES[@]}"; do
40
+    dst_cluster="${CLUSTER_IPS[$j]}"
41
+    if ping -c 1 -W 1 $dst_cluster >/dev/null 2>&1; then
42
+        printf " %10s |" "OK"
43
+    else
44
+        printf " %10s |" "FAILED"
45
+    fi
46
+done
47
+echo
48
+# baobab row
49
+printf "%-10s |" "baobab"
50
+baobab_mgmt="${MGMT_IPS[0]}"
51
+for j in "${!NODES[@]}"; do
52
+    dst_cluster="${CLUSTER_IPS[$j]}"
53
+    if ssh $SSH_OPTS root@$baobab_mgmt "ping -c 1 -W 1 $dst_cluster >/dev/null 2>&1"; then
54
+        printf " %10s |" "OK"
55
+    else
56
+        printf " %10s |" "FAILED"
57
+    fi
58
+done
59
+echo
+144 -0
projects/thunderbolts/scripts/install.sh
@@ -0,0 +1,144 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="thunderbolts"
6
+ORG_ID="xdev"
7
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
8
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
9
+RECOVER_CANONICAL="${INSTALL_DIR}/tb-recover.sh"
10
+RECOVER_WRAPPER="/usr/local/sbin/tb-recover.sh"
11
+UNINSTALL_PATH="${INSTALL_DIR}/uninstall.sh"
12
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
13
+UDEV_RULE_PATH="/etc/udev/rules.d/90-thunderbolt-net-systemd.rules"
14
+TB_BRIDGE_UNIT="/etc/systemd/system/tb-bridge.service"
15
+TB_ENLIST_UNIT="/etc/systemd/system/tb-enlist@.service"
16
+TB_RECOVER_UNIT="/etc/systemd/system/tb-recover.service"
17
+TB_RECOVER_TIMER="/etc/systemd/system/tb-recover.timer"
18
+
19
+SOURCE_DIR=""
20
+
21
+usage() {
22
+    cat <<EOF
23
+Usage: $0 [--source-dir <path>]
24
+
25
+Install shared thunderbolt runtime artifacts on the current host.
26
+This workflow does NOT modify /etc/network/interfaces or interfaces.d/10-thunderbolt.
27
+EOF
28
+}
29
+
30
+require_root() {
31
+    if [[ "${EUID}" -ne 0 ]]; then
32
+        echo "ERROR: this script must be run as root" >&2
33
+        exit 1
34
+    fi
35
+}
36
+
37
+resolve_source_dir() {
38
+    if [[ -n "${SOURCE_DIR}" ]]; then
39
+        SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
40
+    else
41
+        SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
42
+    fi
43
+}
44
+
45
+validate_source_tree() {
46
+    local required_files=(
47
+        "${SOURCE_DIR}/deploy/attempt1/common/sbin/tb-recover.sh"
48
+        "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-bridge.service"
49
+        "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-enlist@.service"
50
+        "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.service"
51
+        "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.timer"
52
+        "${SOURCE_DIR}/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules"
53
+        "${SOURCE_DIR}/scripts/uninstall.sh"
54
+        "${SOURCE_DIR}/README.md"
55
+        "${SOURCE_DIR}/INSTALL.md"
56
+        "${SOURCE_DIR}/CHANGELOG.md"
57
+    )
58
+    local file=""
59
+    for file in "${required_files[@]}"; do
60
+        if [[ ! -f "${file}" ]]; then
61
+            echo "ERROR: missing required source file: ${file}" >&2
62
+            exit 1
63
+        fi
64
+    done
65
+}
66
+
67
+run_existing_uninstall() {
68
+    if [[ -x "${UNINSTALL_PATH}" ]]; then
69
+        echo "Existing installation detected. Running canonical uninstall first..."
70
+        "${UNINSTALL_PATH}" --force || true
71
+    else
72
+        bash "${SOURCE_DIR}/scripts/uninstall.sh" --force || true
73
+    fi
74
+}
75
+
76
+install_docs() {
77
+    mkdir -p "${DOC_DIR}"
78
+    cp "${SOURCE_DIR}/README.md" "${DOC_DIR}/"
79
+    cp "${SOURCE_DIR}/INSTALL.md" "${DOC_DIR}/"
80
+    cp "${SOURCE_DIR}/CHANGELOG.md" "${DOC_DIR}/"
81
+}
82
+
83
+main() {
84
+    while [[ $# -gt 0 ]]; do
85
+        case "$1" in
86
+            --source-dir)
87
+                SOURCE_DIR="$2"
88
+                shift 2
89
+                ;;
90
+            -h|--help)
91
+                usage
92
+                exit 0
93
+                ;;
94
+            *)
95
+                echo "ERROR: unknown option: $1" >&2
96
+                usage
97
+                exit 1
98
+                ;;
99
+        esac
100
+    done
101
+
102
+    require_root
103
+    resolve_source_dir
104
+    validate_source_tree
105
+
106
+    echo "=== Installing ${PROJECT_ID} shared runtime ==="
107
+    run_existing_uninstall
108
+
109
+    mkdir -p "${INSTALL_DIR}" "${DOC_DIR}" /usr/local/sbin /etc/udev/rules.d /etc/systemd/system
110
+
111
+    install -m 0755 "${SOURCE_DIR}/deploy/attempt1/common/sbin/tb-recover.sh" "${RECOVER_CANONICAL}"
112
+    ln -sfn "${RECOVER_CANONICAL}" "${RECOVER_WRAPPER}"
113
+
114
+    install -m 0755 "${SOURCE_DIR}/scripts/uninstall.sh" "${UNINSTALL_PATH}"
115
+    ln -sfn "${UNINSTALL_PATH}" "${UNINSTALL_WRAPPER}"
116
+
117
+    install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules" "${UDEV_RULE_PATH}"
118
+    install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-bridge.service" "${TB_BRIDGE_UNIT}"
119
+    install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-enlist@.service" "${TB_ENLIST_UNIT}"
120
+    install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.service" "${TB_RECOVER_UNIT}"
121
+    install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.timer" "${TB_RECOVER_TIMER}"
122
+
123
+    install_docs
124
+
125
+    systemctl daemon-reload
126
+    udevadm control --reload-rules
127
+    systemctl enable --now tb-bridge.service
128
+    systemctl enable --now tb-recover.timer
129
+    systemctl start tb-recover.service || true
130
+    udevadm trigger --subsystem-match=net --action=add || true
131
+
132
+    echo "Installed paths:"
133
+    echo "  runtime: ${INSTALL_DIR}"
134
+    echo "  recover wrapper: ${RECOVER_WRAPPER}"
135
+    echo "  uninstall: ${UNINSTALL_PATH}"
136
+    echo "  udev rule: ${UDEV_RULE_PATH}"
137
+    echo "  systemd units: tb-bridge.service tb-enlist@.service tb-recover.service tb-recover.timer"
138
+    echo "  docs: ${DOC_DIR}"
139
+    echo ""
140
+    echo "Network interface files were left untouched."
141
+    echo "Installation completed."
142
+}
143
+
144
+main "$@"
+83 -0
projects/thunderbolts/scripts/uninstall.sh
@@ -0,0 +1,83 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="thunderbolts"
6
+ORG_ID="xdev"
7
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
8
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
9
+RECOVER_WRAPPER="/usr/local/sbin/tb-recover.sh"
10
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
11
+UDEV_RULE_PATH="/etc/udev/rules.d/90-thunderbolt-net-systemd.rules"
12
+TB_BRIDGE_UNIT="/etc/systemd/system/tb-bridge.service"
13
+TB_ENLIST_UNIT="/etc/systemd/system/tb-enlist@.service"
14
+TB_RECOVER_UNIT="/etc/systemd/system/tb-recover.service"
15
+TB_RECOVER_TIMER="/etc/systemd/system/tb-recover.timer"
16
+
17
+FORCE_MODE=0
18
+
19
+log() {
20
+    if [[ "${FORCE_MODE}" -eq 0 ]]; then
21
+        echo "$@"
22
+    fi
23
+}
24
+
25
+require_root() {
26
+    if [[ "${EUID}" -ne 0 ]]; then
27
+        echo "ERROR: this script must be run as root" >&2
28
+        exit 1
29
+    fi
30
+}
31
+
32
+stop_enlist_instances() {
33
+    local units
34
+    units="$(systemctl list-units --all 'tb-enlist@*.service' --no-legend --no-pager 2>/dev/null | awk '{print $1}')"
35
+    if [[ -n "${units}" ]]; then
36
+        # shellcheck disable=SC2086
37
+        systemctl stop ${units} >/dev/null 2>&1 || true
38
+    fi
39
+}
40
+
41
+main() {
42
+    while [[ $# -gt 0 ]]; do
43
+        case "$1" in
44
+            --force)
45
+                FORCE_MODE=1
46
+                shift
47
+                ;;
48
+            -h|--help)
49
+                echo "Usage: $0 [--force]"
50
+                exit 0
51
+                ;;
52
+            *)
53
+                echo "ERROR: unknown option: $1" >&2
54
+                exit 1
55
+                ;;
56
+        esac
57
+    done
58
+
59
+    require_root
60
+
61
+    log "=== Uninstalling ${PROJECT_ID} shared runtime ==="
62
+
63
+    stop_enlist_instances
64
+    systemctl disable --now tb-recover.timer >/dev/null 2>&1 || true
65
+    systemctl stop tb-recover.service >/dev/null 2>&1 || true
66
+    systemctl disable tb-bridge.service >/dev/null 2>&1 || true
67
+    systemctl stop tb-bridge.service >/dev/null 2>&1 || true
68
+
69
+    rm -f "${TB_RECOVER_TIMER}" "${TB_RECOVER_UNIT}" "${TB_ENLIST_UNIT}" "${TB_BRIDGE_UNIT}" "${UDEV_RULE_PATH}"
70
+    rm -f "${UNINSTALL_WRAPPER}" "${RECOVER_WRAPPER}"
71
+    rm -rf "${DOC_DIR}" "${INSTALL_DIR}"
72
+
73
+    systemctl daemon-reload
74
+    udevadm control --reload-rules
75
+
76
+    rmdir /usr/local/lib/${ORG_ID} 2>/dev/null || true
77
+    rmdir /usr/local/share/doc/${ORG_ID} 2>/dev/null || true
78
+
79
+    log "Shared runtime removed."
80
+    log "Network interface configuration was left untouched."
81
+}
82
+
83
+main "$@"
+166 -0
projects/thunderbolts/setup.sh
@@ -0,0 +1,166 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+PROJECT_ID="thunderbolts"
6
+ORG_ID="xdev"
7
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
8
+MODE="install"
9
+REMOTE_USER="root"
10
+LOCAL_MODE=0
11
+TARGETS=()
12
+
13
+get_mgmt_ip() {
14
+    case "$1" in
15
+        baobab) echo "192.168.2.91" ;;
16
+        ebony) echo "192.168.2.92" ;;
17
+        tapia) echo "192.168.2.93" ;;
18
+        *) echo "" ;;
19
+    esac
20
+}
21
+
22
+resolve_target() {
23
+    local host="$1"
24
+    local ip=""
25
+
26
+    if [[ "$host" == *@* ]]; then
27
+        echo "$host"
28
+        return 0
29
+    fi
30
+
31
+    ip="$(get_mgmt_ip "$host")"
32
+    if [[ -n "$ip" ]]; then
33
+        echo "${REMOTE_USER}@${ip}"
34
+    else
35
+        echo "${REMOTE_USER}@${host}"
36
+    fi
37
+}
38
+
39
+show_help() {
40
+    cat <<EOF
41
+${PROJECT_ID} setup wrapper
42
+
43
+Usage: $0 [OPTIONS] [host...]
44
+
45
+Options:
46
+  -h, --help           Show this help message
47
+  -l, --local          Run on localhost
48
+  -u, --uninstall      Uninstall instead of install
49
+  --user <user>        Remote SSH user (default: root)
50
+
51
+Without explicit hosts, remote mode defaults to: baobab ebony tapia
52
+EOF
53
+}
54
+
55
+run_local_install() {
56
+    bash "${SCRIPT_DIR}/scripts/install.sh" --source-dir "${SCRIPT_DIR}"
57
+}
58
+
59
+run_local_uninstall() {
60
+    local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
61
+    if [[ -x "${canonical}" ]]; then
62
+        "${canonical}"
63
+    else
64
+        bash "${SCRIPT_DIR}/scripts/uninstall.sh"
65
+    fi
66
+}
67
+
68
+copy_remote_tree() {
69
+    local target="$1"
70
+    local remote_tmp="$2"
71
+
72
+    ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts' '${remote_tmp}/deploy/attempt1/common/sbin' '${remote_tmp}/deploy/attempt1/common/systemd/system' '${remote_tmp}/deploy/attempt1/common/udev/rules.d'"
73
+    scp -q "${SCRIPT_DIR}/scripts/install.sh" "${target}:${remote_tmp}/scripts/"
74
+    scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
75
+    scp -q "${SCRIPT_DIR}/README.md" "${target}:${remote_tmp}/"
76
+    scp -q "${SCRIPT_DIR}/INSTALL.md" "${target}:${remote_tmp}/"
77
+    scp -q "${SCRIPT_DIR}/CHANGELOG.md" "${target}:${remote_tmp}/"
78
+    scp -q "${SCRIPT_DIR}/deploy/attempt1/common/sbin/tb-recover.sh" "${target}:${remote_tmp}/deploy/attempt1/common/sbin/"
79
+    scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-bridge.service" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
80
+    scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-enlist@.service" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
81
+    scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-recover.service" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
82
+    scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-recover.timer" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
83
+    scp -q "${SCRIPT_DIR}/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules" "${target}:${remote_tmp}/deploy/attempt1/common/udev/rules.d/"
84
+}
85
+
86
+run_remote_install() {
87
+    local target="$1"
88
+    local remote_tmp="/tmp/${PROJECT_ID}.$$"
89
+    local remote_prefix=""
90
+
91
+    [[ "${REMOTE_USER}" != "root" ]] && remote_prefix="sudo "
92
+
93
+    copy_remote_tree "${target}" "${remote_tmp}"
94
+    ssh "${target}" "${remote_prefix}bash '${remote_tmp}/scripts/install.sh' --source-dir '${remote_tmp}'"
95
+    ssh "${target}" "rm -rf '${remote_tmp}'"
96
+}
97
+
98
+run_remote_uninstall() {
99
+    local target="$1"
100
+    local remote_tmp="/tmp/${PROJECT_ID}-uninstall.$$"
101
+    local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
102
+
103
+    ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts'"
104
+    scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
105
+    if [[ "${REMOTE_USER}" == "root" ]]; then
106
+        ssh "${target}" "if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi"
107
+    else
108
+        ssh "${target}" "sudo bash -lc \"if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi\""
109
+    fi
110
+    ssh "${target}" "rm -rf '${remote_tmp}'"
111
+}
112
+
113
+while [[ $# -gt 0 ]]; do
114
+    case "$1" in
115
+        -h|--help)
116
+            show_help
117
+            exit 0
118
+            ;;
119
+        -l|--local)
120
+            LOCAL_MODE=1
121
+            shift
122
+            ;;
123
+        -u|--uninstall)
124
+            MODE="uninstall"
125
+            shift
126
+            ;;
127
+        --user)
128
+            REMOTE_USER="$2"
129
+            shift 2
130
+            ;;
131
+        -*)
132
+            echo "ERROR: unknown option: $1" >&2
133
+            show_help
134
+            exit 1
135
+            ;;
136
+        *)
137
+            TARGETS+=("$1")
138
+            shift
139
+            ;;
140
+    esac
141
+done
142
+
143
+if [[ ${#TARGETS[@]} -eq 0 && ${LOCAL_MODE} -eq 0 ]]; then
144
+    TARGETS=(baobab ebony tapia)
145
+fi
146
+
147
+echo "================================"
148
+echo "${PROJECT_ID} - ${MODE}"
149
+echo "================================"
150
+
151
+if [[ ${LOCAL_MODE} -eq 1 ]]; then
152
+    if [[ "${MODE}" == "install" ]]; then
153
+        run_local_install
154
+    else
155
+        run_local_uninstall
156
+    fi
157
+    exit 0
158
+fi
159
+
160
+for host in "${TARGETS[@]}"; do
161
+    if [[ "${MODE}" == "install" ]]; then
162
+        run_remote_install "$(resolve_target "${host}")"
163
+    else
164
+        run_remote_uninstall "$(resolve_target "${host}")"
165
+    fi
166
+done
+11 -0
projects/thunderbolts/thunderbolts.code-workspace
@@ -0,0 +1,11 @@
1
+{
2
+	"folders": [
3
+		{
4
+			"path": "."
5
+		},
6
+		{
7
+			"path": "../backups"
8
+		}
9
+	],
10
+	"settings": {}
11
+}
+77 -0
scripts/cluster-nodes.sh
@@ -0,0 +1,77 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
6
+ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
7
+CONFIG_PATH="${ROOT_DIR}/cluster-context/madagascar.json"
8
+CLUSTER_NAME="madagascar"
9
+FORMAT="ip"
10
+
11
+usage() {
12
+    cat <<EOF
13
+Usage: $0 [--cluster <name>] [--format ip|name|name=ip]
14
+
15
+Reads node information from cluster-context/madagascar.json.
16
+EOF
17
+}
18
+
19
+while [[ $# -gt 0 ]]; do
20
+    case "$1" in
21
+        --cluster)
22
+            CLUSTER_NAME="$2"
23
+            shift 2
24
+            ;;
25
+        --format)
26
+            FORMAT="$2"
27
+            shift 2
28
+            ;;
29
+        -h|--help)
30
+            usage
31
+            exit 0
32
+            ;;
33
+        *)
34
+            echo "ERROR: unknown option: $1" >&2
35
+            usage
36
+            exit 1
37
+            ;;
38
+    esac
39
+done
40
+
41
+if [[ ! -f "${CONFIG_PATH}" ]]; then
42
+    echo "ERROR: missing cluster config: ${CONFIG_PATH}" >&2
43
+    exit 1
44
+fi
45
+
46
+case "${FORMAT}" in
47
+    ip)
48
+        jq -r --arg cluster "${CLUSTER_NAME}" '
49
+            .clusters[$cluster].nodes
50
+            | to_entries[]
51
+            | (
52
+                .value.ip
53
+                // .value.wan.vmbr443.address
54
+                // empty
55
+              )
56
+            | split("/")[0]
57
+        ' "${CONFIG_PATH}"
58
+        ;;
59
+    name)
60
+        jq -r --arg cluster "${CLUSTER_NAME}" '
61
+            .clusters[$cluster].nodes
62
+            | to_entries[]
63
+            | .key
64
+        ' "${CONFIG_PATH}"
65
+        ;;
66
+    name=ip)
67
+        jq -r --arg cluster "${CLUSTER_NAME}" '
68
+            .clusters[$cluster].nodes
69
+            | to_entries[]
70
+            | .key + "=" + ((.value.ip // .value.wan.vmbr443.address // empty) | split("/")[0])
71
+        ' "${CONFIG_PATH}"
72
+        ;;
73
+    *)
74
+        echo "ERROR: unsupported format: ${FORMAT}" >&2
75
+        exit 1
76
+        ;;
77
+esac
+256 -0
scripts/deploy-project.sh
@@ -0,0 +1,256 @@
1
+#!/bin/bash
2
+
3
+set -euo pipefail
4
+
5
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
6
+ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
7
+CONFIG_PATH="${ROOT_DIR}/cluster-context/madagascar.json"
8
+CLUSTER_NAME="madagascar"
9
+COMMAND="install"
10
+REMOTE_USER="root"
11
+DRY_RUN=0
12
+PROJECT_NAME=""
13
+PROJECT_DIR=""
14
+DEPLOY_MODE=""
15
+NODE_FILTERS=()
16
+TARGETS=()
17
+
18
+usage() {
19
+    cat <<EOF
20
+Usage: $0 <project> [command] [options]
21
+
22
+Commands:
23
+  install      Deploy/install project on selected nodes (default)
24
+  uninstall    Remove project from selected nodes
25
+  status       Query project status on selected nodes (deploy.sh projects only)
26
+  start        Start services on selected nodes (deploy.sh projects only)
27
+  restart      Restart services on selected nodes (deploy.sh projects only)
28
+  stop         Stop services on selected nodes (deploy.sh projects only)
29
+
30
+Options:
31
+  --cluster <name>    Cluster name in cluster-context/madagascar.json (default: madagascar)
32
+  --node <name|ip>    Restrict to one node. Can be repeated.
33
+  --user <user>       Remote SSH user for setup.sh projects (default: root)
34
+  --dry-run           Show resolved targets and commands without executing
35
+  -h, --help          Show this help
36
+
37
+Examples:
38
+  $0 pve-guests-state
39
+  $0 pve-guests-state install --node ebony
40
+  $0 autoNAS install
41
+  $0 autoSMART status --node 192.168.2.92
42
+EOF
43
+}
44
+
45
+require_config() {
46
+    if [[ ! -f "${CONFIG_PATH}" ]]; then
47
+        echo "ERROR: missing cluster config: ${CONFIG_PATH}" >&2
48
+        exit 1
49
+    fi
50
+}
51
+
52
+require_project() {
53
+    PROJECT_DIR="${ROOT_DIR}/projects/${PROJECT_NAME}"
54
+    if [[ ! -d "${PROJECT_DIR}" ]]; then
55
+        echo "ERROR: unknown project: ${PROJECT_NAME}" >&2
56
+        exit 1
57
+    fi
58
+
59
+    if [[ -x "${PROJECT_DIR}/setup.sh" ]]; then
60
+        DEPLOY_MODE="setup"
61
+        return
62
+    fi
63
+
64
+    if [[ -x "${PROJECT_DIR}/deploy.sh" ]]; then
65
+        DEPLOY_MODE="deploy"
66
+        return
67
+    fi
68
+
69
+    echo "ERROR: project ${PROJECT_NAME} has neither setup.sh nor deploy.sh" >&2
70
+    exit 1
71
+}
72
+
73
+load_targets() {
74
+    local entry=""
75
+    TARGETS=()
76
+
77
+    while IFS= read -r entry; do
78
+        [[ -n "${entry}" ]] && TARGETS+=("${entry}")
79
+    done < <(
80
+        jq -r --arg cluster "${CLUSTER_NAME}" '
81
+            .clusters[$cluster].nodes
82
+            | to_entries[]
83
+            | .key + "\t" + ((.value.ip // .value.wan.vmbr443.address // empty) | split("/")[0])
84
+        ' "${CONFIG_PATH}"
85
+    )
86
+
87
+    if [[ ${#TARGETS[@]} -eq 0 ]]; then
88
+        echo "ERROR: no targets found for cluster ${CLUSTER_NAME}" >&2
89
+        exit 1
90
+    fi
91
+}
92
+
93
+match_filter() {
94
+    local filter="$1"
95
+    local node_name="$2"
96
+    local node_ip="$3"
97
+
98
+    [[ "${filter}" == "${node_name}" || "${filter}" == "${node_ip}" ]]
99
+}
100
+
101
+filter_targets() {
102
+    local filtered=()
103
+    local entry=""
104
+    local filter=""
105
+    local node_name=""
106
+    local node_ip=""
107
+
108
+    if [[ ${#NODE_FILTERS[@]} -eq 0 ]]; then
109
+        return
110
+    fi
111
+
112
+    for entry in "${TARGETS[@]}"; do
113
+        node_name="${entry%%$'\t'*}"
114
+        node_ip="${entry#*$'\t'}"
115
+        for filter in "${NODE_FILTERS[@]}"; do
116
+            if match_filter "${filter}" "${node_name}" "${node_ip}"; then
117
+                filtered+=("${entry}")
118
+                break
119
+            fi
120
+        done
121
+    done
122
+
123
+    TARGETS=("${filtered[@]}")
124
+
125
+    if [[ ${#TARGETS[@]} -eq 0 ]]; then
126
+        echo "ERROR: no targets matched the provided --node filters" >&2
127
+        exit 1
128
+    fi
129
+}
130
+
131
+run_setup_project() {
132
+    local node_name="$1"
133
+    local node_ip="$2"
134
+    local cmd=()
135
+
136
+    case "${COMMAND}" in
137
+        install)
138
+            cmd=(bash "${PROJECT_DIR}/setup.sh" --user "${REMOTE_USER}" "${node_ip}")
139
+            ;;
140
+        uninstall)
141
+            cmd=(bash "${PROJECT_DIR}/setup.sh" --user "${REMOTE_USER}" --uninstall "${node_ip}")
142
+            ;;
143
+        *)
144
+            echo "ERROR: command ${COMMAND} is not supported for setup.sh-only projects" >&2
145
+            exit 1
146
+            ;;
147
+    esac
148
+
149
+    echo "==> ${PROJECT_NAME}: ${COMMAND} on ${node_name} (${node_ip})"
150
+    if [[ "${DRY_RUN}" -eq 1 ]]; then
151
+        printf 'DRY-RUN:'
152
+        printf ' %q' "${cmd[@]}"
153
+        echo
154
+        return
155
+    fi
156
+
157
+    (cd "${PROJECT_DIR}" && "${cmd[@]}")
158
+}
159
+
160
+run_deploy_project() {
161
+    local node_name="$1"
162
+    local node_ip="$2"
163
+    local cmd=(bash "${PROJECT_DIR}/deploy.sh" "${COMMAND}" "${node_ip}")
164
+
165
+    echo "==> ${PROJECT_NAME}: ${COMMAND} on ${node_name} (${node_ip})"
166
+    if [[ "${DRY_RUN}" -eq 1 ]]; then
167
+        printf 'DRY-RUN:'
168
+        printf ' %q' "${cmd[@]}"
169
+        echo
170
+        return
171
+    fi
172
+
173
+    (cd "${PROJECT_DIR}" && "${cmd[@]}")
174
+}
175
+
176
+parse_args() {
177
+    if [[ $# -lt 1 ]]; then
178
+        usage
179
+        exit 1
180
+    fi
181
+
182
+    PROJECT_NAME="$1"
183
+    shift
184
+
185
+    if [[ $# -gt 0 ]]; then
186
+        case "$1" in
187
+            install|uninstall|status|start|restart|stop)
188
+                COMMAND="$1"
189
+                shift
190
+                ;;
191
+        esac
192
+    fi
193
+
194
+    while [[ $# -gt 0 ]]; do
195
+        case "$1" in
196
+            --cluster)
197
+                CLUSTER_NAME="$2"
198
+                shift 2
199
+                ;;
200
+            --node)
201
+                NODE_FILTERS+=("$2")
202
+                shift 2
203
+                ;;
204
+            --user)
205
+                REMOTE_USER="$2"
206
+                shift 2
207
+                ;;
208
+            --dry-run)
209
+                DRY_RUN=1
210
+                shift
211
+                ;;
212
+            -h|--help)
213
+                usage
214
+                exit 0
215
+                ;;
216
+            *)
217
+                echo "ERROR: unknown option: $1" >&2
218
+                usage
219
+                exit 1
220
+                ;;
221
+        esac
222
+    done
223
+}
224
+
225
+main() {
226
+    parse_args "$@"
227
+    require_config
228
+    require_project
229
+    load_targets
230
+    filter_targets
231
+
232
+    echo "Project: ${PROJECT_NAME}"
233
+    echo "Mode: ${DEPLOY_MODE}"
234
+    echo "Command: ${COMMAND}"
235
+    echo "Cluster: ${CLUSTER_NAME}"
236
+    echo "Targets:"
237
+    printf '  %s\n' "${TARGETS[@]}"
238
+    echo
239
+
240
+    local entry=""
241
+    local node_name=""
242
+    local node_ip=""
243
+
244
+    for entry in "${TARGETS[@]}"; do
245
+        node_name="${entry%%$'\t'*}"
246
+        node_ip="${entry#*$'\t'}"
247
+        if [[ "${DEPLOY_MODE}" == "setup" ]]; then
248
+            run_setup_project "${node_name}" "${node_ip}"
249
+        else
250
+            run_deploy_project "${node_name}" "${node_ip}"
251
+        fi
252
+        echo
253
+    done
254
+}
255
+
256
+main "$@"