@@ -0,0 +1,49 @@ |
||
| 1 |
+# Copilot Instructions (project stub) |
|
| 2 |
+ |
|
| 3 |
+> **Purpose:** Provide context and guidance for GitHub Copilot or other automated agents working with this repository. |
|
| 4 |
+ |
|
| 5 |
+## Project overview |
|
| 6 |
+ |
|
| 7 |
+- **Name:** _<project name goes here>_ (replace this placeholder). |
|
| 8 |
+- **Goal:** Brief description of what the codebase / project is intended to accomplish. |
|
| 9 |
+- **Deployment model:** Outline where production code lives, what is deployed to target systems, and any separation of developer docs versus runtime artifacts. |
|
| 10 |
+ |
|
| 11 |
+## Key components |
|
| 12 |
+ |
|
| 13 |
+- `bin/` – executable scripts used by the project. |
|
| 14 |
+- `docs/` – developer documentation, design notes, and user guides. |
|
| 15 |
+- `projects/` – subprojects, each of which may have its own `deployment/`, `scripts/`, and `.github` configuration. |
|
| 16 |
+- `scripts/` – helper utilities for deployment, testing, or maintenance. |
|
| 17 |
+- `issues/` – markdown issue tracker, one file per issue. |
|
| 18 |
+ |
|
| 19 |
+*(Adjust these bullets to the particular structure of your repository.)* |
|
| 20 |
+ |
|
| 21 |
+## Typical workflows |
|
| 22 |
+ |
|
| 23 |
+1. **Development:** edit source under `deployment/` (if present), update tests, run `./scripts/run_tests.sh` (or similar). |
|
| 24 |
+2. **Deployment:** use `./scripts/deploy_to_nodes.sh` or similar to push changes to cluster nodes; enable any systemd units as needed. |
|
| 25 |
+3. **Debugging:** check logs with `journalctl -u <service>`; examine `dmesg`, `ip link`, etc. (customize to project). |
|
| 26 |
+4. **Configuration:** configuration files live under `/etc/<project>` or are defined in `madagascar.json`; treat these as the source of truth. |
|
| 27 |
+ |
|
| 28 |
+## Guidance for Copilot |
|
| 29 |
+ |
|
| 30 |
+- When creating or modifying files, follow existing conventions for naming, documentation, and changelog entries. |
|
| 31 |
+- Read `madagascar.json` (and any other top-level JSON manifests) to understand cluster configuration and avoid hard-coding values. |
|
| 32 |
+- Append changes to `madagascar-changelog.json` rather than rewriting it. |
|
| 33 |
+- Use POSIX-compliant shell in `bin/` scripts; prefer Python for more complex logic. |
|
| 34 |
+ |
|
| 35 |
+## Issue tracking |
|
| 36 |
+ |
|
| 37 |
+- New issues should be added as markdown files in `issues/` named `YYYY_MM_DD-NN-description.md`. |
|
| 38 |
+- Each issue must include description, steps to reproduce, logs, investigation notes, and resolution. |
|
| 39 |
+- Update `CHANGELOG.md` with a brief entry when an issue is closed or a change is merged. |
|
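A minimal sketch of how an issue file name in the `YYYY_MM_DD-NN-description.md` pattern could be generated. The function name `issue_filename`, the two-digit zero padding of `NN`, and the lowercase hyphenated slug are assumptions for illustration, not rules stated by the project.

```python
import re
from datetime import date


def issue_filename(day: date, seq: int, description: str) -> str:
    """Build an issue file name in the YYYY_MM_DD-NN-description.md pattern."""
    # Assumption: the description is lowercased and reduced to hyphen-separated words.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # Assumption: NN is a two-digit, zero-padded sequence number.
    return f"{day:%Y_%m_%d}-{seq:02d}-{slug}.md"


print(issue_filename(date(2025, 10, 30), 1, "Thunderbolt MTU reset"))
```

Generating the name from a `date` object rather than a hand-typed string keeps the prefix sortable and avoids malformed dates.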
| 40 |
+ |
|
| 41 |
+## Example starter tasks for Copilot |
|
| 42 |
+ |
|
| 43 |
+- Add a new utility script with proper shebang and logging function. |
|
| 44 |
+- Implement a discovery script that reads `madagascar.json` and enumerates nodes or resources. |
|
| 45 |
+- Scaffold a systemd unit file and accompanying installation script. |
|
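As a rough illustration of the discovery starter task, the sketch below enumerates nodes from a manifest. It assumes a `cluster -> networks -> <network> -> hosts` layout with `hostname` and `ip` keys; the embedded sample and the function name `enumerate_nodes` are illustrative, and a real script would load the manifest from disk.

```python
import json

# Trimmed, illustrative sample; the key layout is an assumption about the manifest.
SAMPLE = """
{
  "cluster": {
    "name": "Madagascar",
    "networks": {
      "cluster_network": {
        "name": "thunderbridge",
        "hosts": [
          {"hostname": "ebony", "ip": "192.168.10.92"},
          {"hostname": "baobab", "ip": "192.168.10.91"},
          {"hostname": "tapia", "ip": "192.168.10.93"}
        ]
      }
    }
  }
}
"""


def enumerate_nodes(manifest, network="cluster_network"):
    """Return (hostname, ip) pairs for one network, sorted by hostname."""
    hosts = manifest["cluster"]["networks"][network]["hosts"]
    return sorted((h["hostname"], h["ip"]) for h in hosts)


for hostname, ip in enumerate_nodes(json.loads(SAMPLE)):
    print(f"{hostname}\t{ip}")
```

Reading host names and addresses from the manifest, instead of listing them in the script, is what keeps the script free of hard-coded cluster details.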
| 46 |
+ |
|
| 47 |
+--- |
|
| 48 |
+ |
|
| 49 |
+*This stub is intended as a starting point; customize it for the specific project or subproject.* |
|
@@ -0,0 +1,105 @@ |
||
| 1 |
+# Madagascar Cluster Changelog |
|
| 2 |
+ |
|
| 3 |
+All notable changes to the Madagascar cluster configuration and infrastructure are documented in this file. |
|
| 4 |
+ |
|
| 5 |
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). |
|
| 6 |
+ |
|
| 7 |
+Each entry should reference related issues using the format `[ISSUE-YYYY-NNN]`. |
|
| 8 |
+ |
|
| 9 |
+--- |
|
| 10 |
+ |
|
| 11 |
+## [Unreleased] |
|
| 12 |
+ |
|
| 13 |
+### Known Issues |
|
| 14 |
+- [ISSUE-2025-001] Thunderbolt interfaces MTU resets to 1500 after networking restart (open) |
|
| 15 |
+ |
|
| 16 |
+### Added |
|
| 17 |
+- Added a central `cluster/projects/README.md` policy for current and future cluster-level projects |
|
| 18 |
+ |
|
| 19 |
+### Changed |
|
| 20 |
+- Consolidated `pve-net-hang-watchdog` into its own project folder under `cluster/projects/pve-net-hang-watchdog` |
|
| 21 |
+- Standardized project rules around well-known install paths, mandatory uninstall scripts, and uninstall-before-reinstall workflow |
|
| 22 |
+- Anchored the central project policy in the existing `autoNAS` install/uninstall workflow and documented its known lessons and current path exception |
|
| 23 |
+- Established `/usr/local/lib/xdev/<project-name>/uninstall.sh` as the canonical uninstall script location, with optional `/usr/local/sbin/xdev-<project-name>-uninstall` wrapper |
|
| 24 |
+- Added standard namespaced locations for installed documentation, configuration, operational data, cache, and optional file-based logs |
|
| 25 |
+- Removed the accidental empty `autoNAS/autoSMART` nested drop and kept `cluster/projects/autoSMART` as the canonical project location |
|
| 26 |
+- Standardized `cluster/projects/pve-guests-state` with dedicated install/uninstall scripts, namespaced host paths, migrated state location, and cleaned legacy project artifacts |
|
| 27 |
+- Standardized `cluster/projects/pve-net-hang-watchdog` with namespaced install paths, dedicated lifecycle scripts, and a defaults file under `/etc/default/xdev-pve-net-hang-watchdog` |
|
| 28 |
+- Updated `pve-net-hang-watchdog` install behavior so deployment also starts the service immediately, not just enables it for boot |
|
| 29 |
+- Added a standardized shared-runtime lifecycle for `cluster/projects/thunderbolts` that leaves network interface files untouched during reinstall/uninstall |
|
| 30 |
+- Documented the cluster-wide deployment rule that required services/timers must be activated with `systemctl enable --now` during install, not left merely enabled |
|
| 31 |
+- Standardized `cluster/projects/pve-backup-scheduler` around `/usr/local/lib/xdev/pve-backup-scheduler`, added canonical lifecycle scripts and `setup.sh`, and kept `/etc/pve/autobackup` as an explicit preserved config exception |
|
| 32 |
+- Standardized `cluster/projects/autoNAS` around `/usr/local/lib/xdev/autonas` and `/usr/local/sbin/autonas`, while keeping `/etc/pve/autonas` and `/mnt/autonas` as explicit shared-state exceptions |
|
| 33 |
+- Grouped cluster metadata and historical cache files under `cluster-context/` and moved legacy snapshots under `cluster-context/history/` |
|
| 34 |
+- Added cluster-wide deployment orchestration in `scripts/deploy-project.sh`, driven by `cluster-context/madagascar.json`, while preserving one-node deploy paths for development and testing |
|
| 35 |
+- Tightened lifecycle cleanup for `pve-guests-state` legacy systemd units and suppressed `thunderbolts` recovery noise on hosts without `bolt.service` |
|
| 36 |
+ |
|
| 37 |
+--- |
|
| 38 |
+ |
|
| 39 |
+## [2025-10-30] |
|
| 40 |
+ |
|
| 41 |
+### Fixed |
|
| 42 |
+- [ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved |
|
| 43 |
+ - **Root cause**: `systemctl restart networking` resets the MTU because the systemd services that set it are not re-triggered by a networking restart |
|
| 44 |
+ - **Solution**: Hybrid approach with udev rule enhancement + post-up hooks |
|
| 45 |
+ - **Changes**: Updated udev rules and interfaces.d configs on all nodes (baobab, ebony, tapia) |
|
| 46 |
+ - **Testing**: Verified MTU 65520 persists after networking restart on all nodes |
|
| 47 |
+ |
|
| 48 |
+### Added |
|
| 49 |
+- Issue tracking system in `cluster/issues/` directory |
|
| 50 |
+- CHANGELOG.md for documenting all cluster changes with issue references |
|
| 51 |
+- Template for issue documentation (`issues/TEMPLATE.md`) |
|
| 52 |
+- First documented issue: ISSUE-2025-001 regarding thunderbolt MTU reset problem |
|
| 53 |
+- `scripts/check_mcluster_network.sh` for cluster thunderbridge and network health checks (table output, ping tests from localhost and baobab) |
|
| 54 |
+ |
|
| 55 |
+### Changed |
|
| 56 |
+- Removed codebase-specific references from `madagascar.json` to keep it cluster-focused |
|
| 57 |
+ |
|
| 58 |
+--- |
|
| 59 |
+ |
|
| 60 |
+## [2025-10-19] |
|
| 61 |
+ |
|
| 62 |
+### Added |
|
| 63 |
+- PBS (Proxmox Backup Server) configuration to `madagascar.json` |
|
| 64 |
+ - andrafiabe-AutoNAS (192.168.2.96) |
|
| 65 |
+ - anjohibe-AutoNAS (192.168.2.95) |
|
| 66 |
+- Node roles (primary/secondary) to cluster configuration |
|
| 67 |
+ |
|
| 68 |
+--- |
|
| 69 |
+ |
|
| 70 |
+## [2025-10-18] |
|
| 71 |
+ |
|
| 72 |
+### Added |
|
| 73 |
+- Initial `madagascar.json` cluster cache file |
|
| 74 |
+- Cluster network documentation (thunderbolt bridge configuration) |
|
| 75 |
+- WAN configuration for all nodes (vmbr443, vmbr444) |
|
| 76 |
+- Node-specific network information (baobab, ebony, tapia) |
|
| 77 |
+- `madagascar-changelog.json` for automation-triggered changes |
|
| 78 |
+- `README_madagascar_cache.md` with file contract documentation |
|
| 79 |
+ |
|
| 80 |
+### Infrastructure |
|
| 81 |
+- Thunderbolt bridge (thunderbridge) on 192.168.10.0/24 with MTU 65520 |
|
| 82 |
+- WAN bridges on 192.168.2.0/24 (vmbr443) and 192.168.4.0/24 (vmbr444) |
|
| 83 |
+ |
|
| 84 |
+--- |
|
| 85 |
+ |
|
| 86 |
+## Format Guidelines |
|
| 87 |
+ |
|
| 88 |
+### Categories |
|
| 89 |
+- **Added** - new features, files, or configurations |
|
| 90 |
+- **Changed** - changes to existing functionality or configuration |
|
| 91 |
+- **Deprecated** - features or configurations that will be removed |
|
| 92 |
+- **Removed** - removed features or configurations |
|
| 93 |
+- **Fixed** - bug fixes (always reference issue number) |
|
| 94 |
+- **Security** - security-related changes |
|
| 95 |
+ |
|
| 96 |
+### Entry Format |
|
| 97 |
+``` |
|
| 98 |
+- Brief description [ISSUE-YYYY-NNN] (optional details) |
|
| 99 |
+``` |
|
| 100 |
+ |
|
| 101 |
+### Issue References |
|
| 102 |
+Always link changes to issues when applicable: |
|
| 103 |
+- Bug fixes must reference the issue |
|
| 104 |
+- New features should reference planning/feature issues |
|
| 105 |
+- Configuration changes should reference related issues or RFCs |
|
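The issue-reference rule can be checked mechanically. The following is a small sketch, assuming references always use a four-digit year and a three-digit number as in `[ISSUE-2025-001]`; the helper names are hypothetical.

```python
import re

# Matches references such as [ISSUE-2025-001]; three digits are assumed for NNN.
ISSUE_REF = re.compile(r"\[ISSUE-(\d{4})-(\d{3})\]")


def issue_refs(line):
    """Return every issue id referenced in a changelog line, in order."""
    return [f"ISSUE-{year}-{num}" for year, num in ISSUE_REF.findall(line)]


def fixed_entry_is_valid(line):
    """A 'Fixed' entry must reference at least one issue."""
    return bool(issue_refs(line))


print(issue_refs("- [ISSUE-2025-001] Thunderbolt interfaces MTU persistence issue resolved"))
```

A check like this could run over the lines under each `### Fixed` heading before a change is merged.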
@@ -0,0 +1,64 @@ |
||
| 1 |
+Madagascar cluster context files |
|
| 2 |
+ |
|
| 3 |
+Purpose |
|
| 4 |
+ |
|
| 5 |
+These files provide a shared cluster-context cache and changelog for Madagascar. Other projects can read or append to these files to share knowledge about cluster layout, network configuration, and changes that may affect deployments. |
|
| 6 |
+ |
|
| 7 |
+Files |
|
| 8 |
+ |
|
| 9 |
+- `madagascar.json` - primary cache. Contains a schemaVersion, lastUpdated, source, and a `clusters` map keyed by cluster name. Each cluster can include hosts, network file paths, services and notes. |
|
| 10 |
+ |
|
| 11 |
+- `madagascar-changelog.json` - append-only changelog. Contains an `entries` array. Each entry should include: `id`, `timestamp` (ISO 8601 UTC), `project`, `author`, `summary`, `details`, `affectedResources` (array), and `type` (info|change|breaking|deprecated). |
|
| 12 |
+- `history/` - historical snapshots that are useful for reference but are not the current source of truth. |
|
| 13 |
+ |
|
| 14 |
+Contract (madagascar.json) |
|
| 15 |
+ |
|
| 16 |
+- schemaVersion: string |
|
| 17 |
+- lastUpdated: ISO 8601 timestamp in UTC |
|
| 18 |
+- source: project name that last updated the file |
|
| 19 |
+- clusters: map of cluster objects. Cluster object fields: |
|
| 20 |
+ - name: cluster name |
|
| 21 |
+ - hosts: map of role->hostname or role->fqdn |
|
| 22 |
+ - network: optional map with keys `interfacesFile` and `interfacesD` (relative paths) |
|
| 23 |
+ - services: optional map of service name -> { enabled: bool, systemdUnit: path }
|
|
| 24 |
+ - notes: optional string |
|
| 25 |
+ |
|
| 26 |
+Changelog entry contract (madagascar-changelog.json) |
|
| 27 |
+ |
|
| 28 |
+- entries: array of objects, each with: |
|
| 29 |
+ - id: unique id string (recommend prefix: project-YYYYMMDD-HHMM) |
|
| 30 |
+ - timestamp: ISO 8601 UTC |
|
| 31 |
+ - project: project name making the change |
|
| 32 |
+ - author: author name or automation id |
|
| 33 |
+ - summary: short summary |
|
| 34 |
+ - details: longer description |
|
| 35 |
+ - affectedResources: array of strings (paths or logical names) |
|
| 36 |
+ - type: one of info|change|breaking|deprecated |
|
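The entry contract above can be expressed as a small validator. This is a sketch only: `validate_entry` is a hypothetical helper, and the timestamp check assumes the exact `YYYY-MM-DDTHH:MM:SSZ` form rather than every ISO 8601 variant.

```python
from datetime import datetime

REQUIRED_FIELDS = ("id", "timestamp", "project", "author",
                   "summary", "details", "affectedResources", "type")
ALLOWED_TYPES = {"info", "change", "breaking", "deprecated"}


def validate_entry(entry):
    """Return a list of contract violations; an empty list means the entry passes."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if errors:
        return errors
    if entry["type"] not in ALLOWED_TYPES:
        errors.append(f"invalid type: {entry['type']}")
    resources = entry["affectedResources"]
    if not isinstance(resources, list) or not all(isinstance(r, str) for r in resources):
        errors.append("affectedResources must be an array of strings")
    try:
        # Assumption: timestamps use the YYYY-MM-DDTHH:MM:SSZ form.
        datetime.strptime(entry["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        errors.append("timestamp is not ISO 8601 UTC (YYYY-MM-DDTHH:MM:SSZ)")
    return errors
```

Returning a list of messages instead of raising on the first problem lets automation report every violation in one pass.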
| 37 |
+ |
|
| 38 |
+How to update |
|
| 39 |
+ |
|
| 40 |
+Manual append example (bash + jq): |
|
| 41 |
+ |
|
| 42 |
+```bash |
|
| 43 |
+# create new entry JSON |
|
| 44 |
+entry=$(jq -n --arg id "entry-$(date -u +%Y%m%d%H%M%S)" \ |
|
| 45 |
+ --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \ |
|
| 46 |
+ --arg project "mysvc" \ |
|
| 47 |
+ --arg author "$USER" \ |
|
| 48 |
+ --arg summary "Updated network config" \ |
|
| 49 |
+ --arg details "Added new interface route needed by Madagascar" \ |
|
| 50 |
+ '{id: $id, timestamp: $ts, project: $project, author: $author, summary: $summary, details: $details, affectedResources:["network/interfaces"], type: "change"}')
|
|
| 51 |
+ |
|
| 52 |
+# append atomically |
|
| 53 |
+jq --argjson e "$entry" '.entries += [$e]' cluster-context/madagascar-changelog.json > cluster-context/madagascar-changelog.json.tmp && mv cluster-context/madagascar-changelog.json.tmp cluster-context/madagascar-changelog.json |
|
| 54 |
+``` |
|
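For automation that is not shell-based, the same append could be sketched in Python with only the standard library. The function name `append_entry` is hypothetical; the approach mirrors the write-to-temp-then-rename pattern of the jq example.

```python
import json
import os
import tempfile


def append_entry(path, entry):
    """Append one entry to the `entries` array and replace the file atomically."""
    with open(path, encoding="utf-8") as fh:
        changelog = json.load(fh)
    changelog.setdefault("entries", []).append(entry)

    # Write to a temporary file in the same directory so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(changelog, tmp, indent=2)
            tmp.write("\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temporary file if writing or renaming failed.
        os.unlink(tmp_path)
        raise
```

The rename makes the file swap atomic, but two writers running at the same time can still overwrite each other's entry, so concurrent automation would need a lock. Note also that `mkstemp` creates the file with owner-only permissions, which may need adjusting if other users read the changelog.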
| 55 |
+ |
|
| 56 |
+Automation guidance |
|
| 57 |
+ |
|
| 58 |
+- Prefer creating unique `id` values (project prefix + timestamp + random suffix). |
|
| 59 |
+- When automation updates `cluster-context/madagascar.json`, also add a changelog entry. |
|
| 60 |
+- Keep `cluster-context/madagascar.json` small — only cache what's necessary. |
|
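One possible way to build ids in the recommended shape (project prefix, timestamp, random suffix). The exact layout and the six-character hex suffix are assumptions; `make_entry_id` is a hypothetical helper.

```python
import secrets
from datetime import datetime, timezone


def make_entry_id(project):
    """Build an id from a project prefix, a UTC timestamp, and a random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    # Assumption: six hex characters are enough to avoid collisions within one second.
    return f"{project}-{stamp}-{secrets.token_hex(3)}"


print(make_entry_id("mysvc"))
```

The random suffix matters when several automation runs for the same project write entries within the same second.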
| 61 |
+ |
|
| 62 |
+Notes |
|
| 63 |
+ |
|
| 64 |
+- These files are meant to be shared between related projects. Treat `cluster-context/madagascar-changelog.json` as append-only; prefer appending rather than rewriting history. |
|
@@ -0,0 +1,234 @@ |
||
| 1 |
+# 2026-03-07 Trixie / Proxmox VE 9 Upgrade Journal |
|
| 2 |
+ |
|
| 3 |
+## Scope |
|
| 4 |
+ |
|
| 5 |
+Upgrade and recovery journal for the Madagascar cluster nodes: |
|
| 6 |
+ |
|
| 7 |
+- `tapia` |
|
| 8 |
+- `ebony` |
|
| 9 |
+- `baobab` |
|
| 10 |
+ |
|
| 11 |
+All three nodes were upgraded from Debian 12 / Proxmox VE 8 to Debian 13 (`trixie`) / Proxmox VE 9.1. |
|
| 12 |
+ |
|
| 13 |
+## Common Pattern Observed |
|
| 14 |
+ |
|
| 15 |
+The package upgrade itself completed cleanly on all nodes. The disruptive failures were in the boot path after the upgrade, not in `apt` or `dpkg`. |
|
| 16 |
+ |
|
| 17 |
+Recurring issues: |
|
| 18 |
+ |
|
| 19 |
+- EFI fallback binaries under `EFI/BOOT` were inconsistent across nodes. |
|
| 20 |
+- Boot order could still point to a non-Proxmox path even when the `proxmox` entry existed. |
|
| 21 |
+- Some systems still had `systemd-boot` style fallback artifacts while the host had moved to GRUB + `proxmox-boot-tool`. |
|
| 22 |
+- Testing was complicated by slow shutdowns and, in one case, missing hardware during boot. |
|
| 23 |
+ |
|
| 24 |
+## Node Journal |
|
| 25 |
+ |
|
| 26 |
+### tapia |
|
| 27 |
+ |
|
| 28 |
+Initial symptoms: |
|
| 29 |
+ |
|
| 30 |
+- Upgrade to `trixie` completed, but the node no longer booted normally. |
|
| 31 |
+- UEFI shell could see the ESP and Proxmox EFI payloads. |
|
| 32 |
+- Launching `\EFI\proxmox\grubx64.efi` initially dropped back into BIOS settings. |
|
| 33 |
+- Later, after loader repair, boot worked on the old kernel first. |
|
| 34 |
+ |
|
| 35 |
+Findings: |
|
| 36 |
+ |
|
| 37 |
+- The system had moved to Debian 13 and Proxmox VE 9 packages correctly. |
|
| 38 |
+- GRUB and EFI files existed, but the boot path was inconsistent after the upgrade. |
|
| 39 |
+- `EFI/proxmox/grub.cfg` on `tapia` had drifted from the standard Proxmox ESP stub and referenced Btrfs directly. |
|
| 40 |
+- `AutoNAS` also produced noisy failed units for unmanaged boot disk UUIDs. |
|
| 41 |
+ |
|
| 42 |
+Fixes applied: |
|
| 43 |
+ |
|
| 44 |
+- Offline disk repair on another node. |
|
| 45 |
+- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery. |
|
| 46 |
+- Ran: |
|
| 47 |
+ - `update-initramfs -u -k 6.8.12-19-pve` |
|
| 48 |
+ - `update-grub` |
|
| 49 |
+ - `proxmox-boot-tool refresh` |
|
| 50 |
+ - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck` |
|
| 51 |
+- Restored `EFI/proxmox/grub.cfg` to the standard Proxmox ESP stub: |
|
| 52 |
+ - `search.fs_uuid <ESP>` |
|
| 53 |
+ - `set prefix=($root)/grub` |
|
| 54 |
+ - `configfile $prefix/grub.cfg` |
|
| 55 |
+- Later confirmed `6.17.13-1-pve` boots correctly and made it the default again. |
|
| 56 |
+- Deployed an `AutoNAS` fix so unmanaged UUIDs are ignored instead of failing `autonas-attach@...` units. |
|
| 57 |
+ |
|
| 58 |
+Final state: |
|
| 59 |
+ |
|
| 60 |
+- Running `6.17.13-1-pve` |
|
| 61 |
+- `systemctl --failed` empty |
|
| 62 |
+- Boot default set to `6.17.13-1-pve` |
|
| 63 |
+ |
|
| 64 |
+### ebony |
|
| 65 |
+ |
|
| 66 |
+Initial symptoms: |
|
| 67 |
+ |
|
| 68 |
+- Upgrade completed cleanly, but the node did not return after reboot. |
|
| 69 |
+- UEFI fallback could boot `memtest`, but Proxmox GRUB payloads returned to BIOS settings. |
|
| 70 |
+- After EFI repair, boot progressed to: |
|
| 71 |
+ - `Loading Linux...` |
|
| 72 |
+ - `Loading initial ramdisk...` |
|
| 73 |
+ and then stopped. |
|
| 74 |
+ |
|
| 75 |
+Findings: |
|
| 76 |
+ |
|
| 77 |
+- The fallback `EFI/BOOT/BOOTX64.EFI` was not aligned with the Proxmox boot chain and could route to memtest. |
|
| 78 |
+- GRUB loader repair was required. |
|
| 79 |
+- During one boot attempt, the NVMe device was physically absent; this caused the post-kernel boot stall and initially looked like a kernel/initramfs failure. |
|
| 80 |
+- Once hardware was restored, the newer kernel booted successfully. |
|
| 81 |
+ |
|
| 82 |
+Fixes applied: |
|
| 83 |
+ |
|
| 84 |
+- Offline disk repair on another node. |
|
| 85 |
+- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` during recovery. |
|
| 86 |
+- Ran: |
|
| 87 |
+ - `update-initramfs -u -k 6.8.12-19-pve` |
|
| 88 |
+ - `update-grub` |
|
| 89 |
+ - `proxmox-boot-tool refresh` |
|
| 90 |
+ - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck` |
|
| 91 |
+- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi` payload and synchronized the other fallback EFI files. |
|
| 92 |
+- After the node booted with full hardware present, validated `6.17.13-1-pve` and set it as the default. |
|
| 93 |
+- Fixed stale `AutoNAS` export behavior by cleaning marked exports whose paths do not exist yet at boot. |
|
| 94 |
+ |
|
| 95 |
+Final state: |
|
| 96 |
+ |
|
| 97 |
+- Running `6.17.13-1-pve` |
|
| 98 |
+- `systemctl --failed` empty |
|
| 99 |
+- `AutoNAS-1` and `AutoNAS-2` active |
|
| 100 |
+- Boot default set to `6.17.13-1-pve` |
|
| 101 |
+ |
|
| 102 |
+### baobab |
|
| 103 |
+ |
|
| 104 |
+Initial symptoms: |
|
| 105 |
+ |
|
| 106 |
+- Upgrade completed cleanly, but the node failed to return after reboot. |
|
| 107 |
+- Before recovery, fallback `BOOTX64.EFI` was still a small `systemd-boot` style binary instead of the Proxmox shim. |
|
| 108 |
+- The node eventually required offline repair from another machine. |
|
| 109 |
+ |
|
| 110 |
+Findings: |
|
| 111 |
+ |
|
| 112 |
+- Package state was healthy; the failure was again in the EFI/boot path. |
|
| 113 |
+- `BootOrder` needed to prioritize the `proxmox` entry. |
|
| 114 |
+- `EFI/BOOT/BOOTX64.EFI` needed to point into the Proxmox chain, not the old fallback path. |
|
| 115 |
+ |
|
| 116 |
+Fixes applied: |
|
| 117 |
+ |
|
| 118 |
+- Forced `GRUB_DEFAULT` to `6.8.12-19-pve` for the first stable boot after upgrade. |
|
| 119 |
+- Corrected `BootOrder` so `proxmox` is first. |
|
| 120 |
+- Replaced fallback `EFI/BOOT/BOOTX64.EFI` with the Proxmox `shimx64.efi`. |
|
| 121 |
+- Offline repair after the failed reboot: |
|
| 122 |
+ - `fsck.vfat -a` on the ESP |
|
| 123 |
+ - `grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck` |
|
| 124 |
+ - `update-grub` |
|
| 125 |
+ - `proxmox-boot-tool refresh` |
|
| 126 |
+- Fixed remaining failed units unrelated to the OS upgrade: |
|
| 127 |
+ - `rc-local.service` now ignores missing optional disks instead of failing |
|
| 128 |
+ - removed orphan `discover_vms.service` and `discover_vms.timer` |
|
| 129 |
+ |
|
| 130 |
+Final state: |
|
| 131 |
+ |
|
| 132 |
+- Running `6.8.12-19-pve` |
|
| 133 |
+- `systemctl --failed` empty |
|
| 134 |
+- Boot default left on `6.8.12-19-pve` as the conservative stable choice |
|
| 135 |
+ |
|
| 136 |
+## AutoNAS Follow-up |
|
| 137 |
+ |
|
| 138 |
+Two AutoNAS issues were identified and fixed during the upgrade recovery: |
|
| 139 |
+ |
|
| 140 |
+1. `attach-deferred` could fail for disks with UUIDs that are not managed by AutoNAS. |
|
| 141 |
+ - Fix: return success for unmanaged UUIDs so `systemd` does not mark the unit failed. |
|
| 142 |
+ |
|
| 143 |
+2. Boot-time cleanup preserved stale AutoNAS exports even when the export path did not exist yet. |
|
| 144 |
+ - Fix: remove AutoNAS-marked exports with missing paths during boot cleanup, then let normal mount/export flow recreate them when the disk is available. |
|
| 145 |
+ |
|
| 146 |
+Both fixes were deployed to: |
|
| 147 |
+ |
|
| 148 |
+- `baobab` |
|
| 149 |
+- `ebony` |
|
| 150 |
+- `tapia` |
|
| 151 |
+ |
|
| 152 |
+## Recovery Commands That Proved Useful |
|
| 153 |
+ |
|
| 154 |
+Most effective recovery sequence when a node no longer boots after the upgrade: |
|
| 155 |
+ |
|
| 156 |
+1. Move the system disk to another node. |
|
| 157 |
+2. Mount root and ESP. |
|
| 158 |
+3. Force a known-good kernel in `/etc/default/grub`. |
|
| 159 |
+4. Run: |
|
| 160 |
+ |
|
| 161 |
+```bash |
|
| 162 |
+update-initramfs -u -k <known-good-kernel> |
|
| 163 |
+update-grub |
|
| 164 |
+proxmox-boot-tool refresh |
|
| 165 |
+grub-install.real --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=proxmox --recheck |
|
| 166 |
+``` |
|
| 167 |
+ |
|
| 168 |
+5. Verify: |
|
| 169 |
+ |
|
| 170 |
+```bash |
|
| 171 |
+efibootmgr -v |
|
| 172 |
+proxmox-boot-tool status |
|
| 173 |
+ls -l /boot/efi/EFI/proxmox |
|
| 174 |
+ls -l /boot/efi/EFI/BOOT |
|
| 175 |
+``` |
|
| 176 |
+ |
|
| 177 |
+6. If needed, replace `EFI/BOOT/BOOTX64.EFI` with Proxmox `shimx64.efi`. |
|
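Whether step 6 is needed can be decided by comparing the two binaries. The sketch below assumes the ESP is mounted and that the Proxmox shim lives at `EFI/proxmox/shimx64.efi`; both the location and the helper names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def fallback_matches_shim(esp):
    """True when EFI/BOOT/BOOTX64.EFI is byte-identical to the Proxmox shim."""
    esp = Path(esp)
    # Assumption: the shim lives at EFI/proxmox/shimx64.efi on the mounted ESP.
    fallback = esp / "EFI" / "BOOT" / "BOOTX64.EFI"
    shim = esp / "EFI" / "proxmox" / "shimx64.efi"
    return fallback.is_file() and shim.is_file() and sha256_of(fallback) == sha256_of(shim)
```

On a repaired node, `fallback_matches_shim("/boot/efi")` would be expected to return `True`; a `False` result points at a stale fallback binary such as the `systemd-boot` or memtest payloads described above.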
| 178 |
+ |
|
| 179 |
+## Recommended Post-Upgrade Checklist |
|
| 180 |
+ |
|
| 181 |
+Before rebooting a node after the Debian 13 / PVE 9 upgrade: |
|
| 182 |
+ |
|
| 183 |
+1. Confirm package state is clean: |
|
| 184 |
+ - `dpkg --audit` |
|
| 185 |
+ - `apt-get -s full-upgrade` |
|
| 186 |
+2. Refresh boot assets: |
|
| 187 |
+ - `update-grub` |
|
| 188 |
+ - `proxmox-boot-tool refresh` |
|
| 189 |
+3. Verify EFI layout: |
|
| 190 |
+ - `efibootmgr -v` |
|
| 191 |
+ - `proxmox-boot-tool status` |
|
| 192 |
+ - `EFI/proxmox/grub.cfg` should be the standard ESP stub |
|
| 193 |
+ - `EFI/BOOT/BOOTX64.EFI` should route into the Proxmox chain, not an old `systemd-boot` or memtest fallback |
|
| 194 |
+4. Suspend guests manually before reboot: |
|
| 195 |
+ - run `/usr/local/sbin/pgs suspend -v` |
|
| 196 |
+ - do not rely on legacy `systemd` automation for guest suspend/resume |
|
| 197 |
+ - otherwise `pve-guests.service` can stall shutdown while waiting for VMs/CTs to stop |
|
| 198 |
+5. Verify all expected storage hardware is physically present before reboot. |
|
| 199 |
+6. Keep one older known-good kernel available in GRUB until the new kernel is validated on that node. |
|
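The `BootOrder` part of step 3 can be checked from captured `efibootmgr -v` output. This is a sketch under stated assumptions: the sample output is illustrative and shortened, the parser assumes a tab separates each entry label from its device path, and `first_boot_label` is a hypothetical helper.

```python
import re
from typing import Optional

# Illustrative `efibootmgr -v` output, shortened; real device paths are longer.
SAMPLE = (
    "BootCurrent: 0001\n"
    "BootOrder: 0001,0000\n"
    "Boot0000* UEFI OS\tHD(1,GPT)/File(\\EFI\\BOOT\\BOOTX64.EFI)\n"
    "Boot0001* proxmox\tHD(1,GPT)/File(\\EFI\\proxmox\\shimx64.efi)\n"
)

# Entry lines look like "Boot0001* label<TAB>device path".
ENTRY = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+([^\t]*)")


def first_boot_label(output: str) -> Optional[str]:
    """Return the label of the first entry listed in BootOrder, if any."""
    order = []
    labels = {}
    for line in output.splitlines():
        if line.startswith("BootOrder:"):
            order = [n.strip().upper() for n in line.split(":", 1)[1].split(",") if n.strip()]
            continue
        match = ENTRY.match(line)
        if match:
            labels[match.group(1).upper()] = match.group(2).strip()
    return labels.get(order[0]) if order else None


print(first_boot_label(SAMPLE) == "proxmox")
```

If the first label is anything other than `proxmox`, the firmware may still boot a fallback path even though the `proxmox` entry exists, which matches the failure pattern recorded in this journal.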
| 200 |
+ |
|
| 201 |
+## Operational Note: Reboot Discipline |
|
| 202 |
+ |
|
| 203 |
+During this upgrade, one avoidable failure mode was a reboot started without first suspending or stopping guests through `pgs`. |
|
| 204 |
+ |
|
| 205 |
+Observed effect: |
|
| 206 |
+ |
|
| 207 |
+- `pve-guests.service` remained in `deactivating` |
|
| 208 |
+- shutdown took a very long time |
|
| 209 |
+- guest stop operations had to be forced manually |
|
| 210 |
+- this obscured boot diagnostics and made the recovery look worse than the underlying boot issue |
|
| 211 |
+ |
|
| 212 |
+Operational rule going forward: |
|
| 213 |
+ |
|
| 214 |
+1. Before any planned node reboot for maintenance, run: |
|
| 215 |
+ |
|
| 216 |
+```bash |
|
| 217 |
+/usr/local/sbin/pgs suspend -v |
|
| 218 |
+``` |
|
| 219 |
+ |
|
| 220 |
+2. Reboot only after guest suspend/shutdown has completed. |
|
| 221 |
+3. After the node or cluster is back in a stable state, run: |
|
| 222 |
+ |
|
| 223 |
+```bash |
|
| 224 |
+/usr/local/sbin/pgs resume -v |
|
| 225 |
+``` |
|
| 226 |
+ |
|
| 227 |
+## Outcome |
|
| 228 |
+ |
|
| 229 |
+The cluster upgrade completed successfully, but only after boot-path recovery on all three nodes. |
|
| 230 |
+ |
|
| 231 |
+Main lesson: |
|
| 232 |
+ |
|
| 233 |
+- the risky part of this upgrade was not package dependency resolution |
|
| 234 |
+- it was EFI and boot chain consistency after the transition to Debian 13 / Proxmox VE 9 |
|
@@ -0,0 +1,86 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "cluster": {
|
|
| 3 |
+ "name": "Madagascar", |
|
| 4 |
+ "topology": {
|
|
| 5 |
+ "thunderbolt_chain": [ |
|
| 6 |
+ "ebony", |
|
| 7 |
+ "baobab", |
|
| 8 |
+ "tapia" |
|
| 9 |
+ ] |
|
| 10 |
+ }, |
|
| 11 |
+ "networks": {
|
|
| 12 |
+ "cluster_network": {
|
|
| 13 |
+ "name": "thunderbridge", |
|
| 14 |
+ "cidr": "192.168.10.0/24", |
|
| 15 |
+ "bridges": [ |
|
| 16 |
+ { "host": "baobab", "bridge_id": "8000.02ff1918f13a" },
|
|
| 17 |
+ { "host": "ebony", "bridge_id": "8000.02518abab22f" },
|
|
| 18 |
+ { "host": "tapia", "bridge_id": "8000.0231336db4df" }
|
|
| 19 |
+ ], |
|
| 20 |
+ "hosts": [ |
|
| 21 |
+ {
|
|
| 22 |
+ "hostname": "ebony", |
|
| 23 |
+ "ip": "192.168.10.92", |
|
| 24 |
+ "interfaces": { "thunderbolt0_mac": "02:51:8a:ba:b2:2f" }
|
|
| 25 |
+ }, |
|
| 26 |
+ {
|
|
| 27 |
+ "hostname": "baobab", |
|
| 28 |
+ "ip": "192.168.10.91", |
|
| 29 |
+ "interfaces": {
|
|
| 30 |
+ "thunderbolt0_mac": "02:ff:19:18:f1:3a", |
|
| 31 |
+ "thunderbolt1_mac": "02:ee:dc:db:e6:0b" |
|
| 32 |
+ } |
|
| 33 |
+ }, |
|
| 34 |
+ {
|
|
| 35 |
+ "hostname": "tapia", |
|
| 36 |
+ "ip": "192.168.10.93", |
|
| 37 |
+ "interfaces": { "thunderbolt0_mac": "02:31:33:6d:b4:df" }
|
|
| 38 |
+ } |
|
| 39 |
+ ] |
|
| 40 |
+ }, |
|
| 41 |
+ "internet_network": {
|
|
| 42 |
+ "name": "vmbr443", |
|
| 43 |
+ "cidr": "192.168.2.0/24", |
|
| 44 |
+ "notes": "VMs reach the internet through these bridges", |
|
| 45 |
+ "hosts": [ |
|
| 46 |
+ {
|
|
| 47 |
+ "hostname": "ebony", |
|
| 48 |
+ "ip": "192.168.2.92", |
|
| 49 |
+ "underlay": "eno1.443", |
|
| 50 |
+ "bridge_mac": "1c:69:7a:ab:26:2f" |
|
| 51 |
+ }, |
|
| 52 |
+ {
|
|
| 53 |
+ "hostname": "baobab", |
|
| 54 |
+ "ip": "192.168.2.91", |
|
| 55 |
+ "underlay": "enp86s0.443", |
|
| 56 |
+ "bridge_mac": "48:21:0b:60:9f:ab" |
|
| 57 |
+ }, |
|
| 58 |
+ {
|
|
| 59 |
+ "hostname": "tapia", |
|
| 60 |
+ "ip": "192.168.2.93", |
|
| 61 |
+ "underlay": "eno1.443", |
|
| 62 |
+ "bridge_mac": "1c:69:7a:aa:e3:5d" |
|
| 63 |
+ } |
|
| 64 |
+ ] |
|
| 65 |
+ } |
|
| 66 |
+ }, |
|
| 67 |
+ "services": {
|
|
| 68 |
+ "pbs": [ |
|
| 69 |
+ {
|
|
| 70 |
+ "name": "anjohibe", |
|
| 71 |
+ "role": "proxmox-backup-server", |
|
| 72 |
+ "hypervisor_host": "ebony", |
|
| 73 |
+ "ips": { "internet": "192.168.2.95", "cluster": "192.168.10.95" },
|
|
| 74 |
+ "nas_virtual_ip": "192.168.10.21" |
|
| 75 |
+ }, |
|
| 76 |
+ {
|
|
| 77 |
+ "name": "andrafiabe", |
|
| 78 |
+ "role": "proxmox-backup-server", |
|
| 79 |
+ "hypervisor_host": "tapia", |
|
| 80 |
+ "ips": { "internet": "192.168.2.96", "cluster": "192.168.10.96" },
|
|
| 81 |
+ "nas_virtual_ip": "192.168.10.22" |
|
| 82 |
+ } |
|
| 83 |
+ ] |
|
| 84 |
+ } |
|
| 85 |
+ } |
|
| 86 |
+} |
|
@@ -0,0 +1,241 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "hardware": {
|
|
| 3 |
+ "baobab": {
|
|
| 4 |
+ "system": {
|
|
| 5 |
+ "manufacturer": "Intel(R) Client Systems", |
|
| 6 |
+ "product_name": "NUC13ANHi7", |
|
| 7 |
+ "version": "M89903-208", |
|
| 8 |
+ "serial_number": "BTAN344005DW", |
|
| 9 |
+ "family": "AN", |
|
| 10 |
+ "sku": "NUC13ANHi7000", |
|
| 11 |
+ "smbios_version": "3.5.0" |
|
| 12 |
+ }, |
|
| 13 |
+ "bios": {
|
|
| 14 |
+ "vendor": "Intel Corp.", |
|
| 15 |
+ "version": "ANRPL357.0038.2025.0416.1002", |
|
| 16 |
+ "release_date": "2025-04-16", |
|
| 17 |
+ "bios_revision": "5.27", |
|
| 18 |
+ "firmware_revision": "10.23" |
|
| 19 |
+ }, |
|
| 20 |
+ "cpu": {
|
|
| 21 |
+ "model": "13th Gen Intel(R) Core(TM) i7-1360P", |
|
| 22 |
+ "family": 6, |
|
| 23 |
+ "model_id": 186, |
|
| 24 |
+ "stepping": 2, |
|
| 25 |
+ "cores": 12, |
|
| 26 |
+ "threads": 16, |
|
| 27 |
+ "max_speed_mhz": 5000, |
|
| 28 |
+ "l3_cache_mb": 18 |
|
| 29 |
+ }, |
|
| 30 |
+ "memory": {
|
|
| 31 |
+ "installed_total_gb": 64, |
|
| 32 |
+ "max_capacity_reported_gb": 64, |
|
| 33 |
+ "modules": [ |
|
| 34 |
+ {
|
|
| 35 |
+ "locator": "Controller0-ChannelA-DIMM0", |
|
| 36 |
+ "size_gb": 32, |
|
| 37 |
+ "type": "DDR4", |
|
| 38 |
+ "speed_mtps": 2667, |
|
| 39 |
+ "manufacturer": "Corsair", |
|
| 40 |
+ "part_number": "CMSX64GX4M2A2666C18", |
|
| 41 |
+ "rank": 2, |
|
| 42 |
+ "voltage_v": 1.2 |
|
| 43 |
+ }, |
|
| 44 |
+ {
|
|
| 45 |
+ "locator": "Controller1-ChannelA-DIMM0", |
|
| 46 |
+ "size_gb": 32, |
|
| 47 |
+ "type": "DDR4", |
|
| 48 |
+ "speed_mtps": 2667, |
|
| 49 |
+ "manufacturer": "Corsair", |
|
| 50 |
+ "part_number": "CMSX64GX4M2A2666C18", |
|
| 51 |
+ "rank": 2, |
|
| 52 |
+ "voltage_v": 1.2 |
|
| 53 |
+ } |
|
| 54 |
+ ] |
|
| 55 |
+ }, |
|
| 56 |
+ "storage": {
|
|
| 57 |
+ "nvme_controllers": [ |
|
| 58 |
+ "Samsung SM981/PM981/PM983 (01:00.0)" |
|
| 59 |
+ ], |
|
| 60 |
+ "m2_slots": [ |
|
| 61 |
+ { "designation": "M2_A", "pcie": "x4 Gen3", "status": "in_use" },
|
|
| 62 |
+ { "designation": "M2_B", "pcie": "x4 Gen3", "status": "in_use" }
|
|
| 63 |
+ ] |
|
| 64 |
+ }, |
|
| 65 |
+ "gpu": "Intel Raptor Lake-P [Iris Xe Graphics] (rev 04)", |
|
| 66 |
+ "network_controllers": [ |
|
| 67 |
+ "Intel Ethernet Controller I226-V (rev 04)", |
|
| 68 |
+ "Intel Raptor Lake PCH CNVi WiFi (rev 01)" |
|
| 69 |
+ ], |
|
| 70 |
+ "thunderbolt": {
|
|
| 71 |
+ "generation": "Thunderbolt 4", |
|
| 72 |
+ "controllers": [ |
|
| 73 |
+ "Raptor Lake-P Thunderbolt 4 USB Controller", |
|
| 74 |
+ "Raptor Lake-P Thunderbolt 4 NHI (x2)" |
|
| 75 |
+ ], |
|
| 76 |
+ "mac_addresses": [ |
|
| 77 |
+ "02:ff:19:18:f1:3a", |
|
| 78 |
+ "02:ee:dc:db:e6:0b" |
|
| 79 |
+ ] |
|
| 80 |
+ }, |
|
| 81 |
+ "tpm": {
|
|
| 82 |
+ "vendor_id": "INTC", |
|
| 83 |
+ "spec_version": "2.0", |
|
| 84 |
+ "firmware_revision": "600.18" |
|
| 85 |
+ } |
|
| 86 |
+ }, |
|
| 87 |
+ "ebony": {
|
|
| 88 |
+ "system": {
|
|
| 89 |
+ "manufacturer": "Intel(R) Client Systems", |
|
| 90 |
+ "product_name": "NUC10i7FNH", |
|
| 91 |
+ "version": "M38010-308", |
|
| 92 |
+ "serial_number": "G6FN135001U0", |
|
| 93 |
+ "family": "FN", |
|
| 94 |
+ "sku": "BXNUC10i7FNHN", |
|
| 95 |
+ "smbios_version": "3.3.0" |
|
| 96 |
+ }, |
|
| 97 |
+ "bios": {
|
|
| 98 |
+ "vendor": "Intel Corp.", |
|
| 99 |
+ "version": "FNCML357.0066.2024.1011.0925", |
|
| 100 |
+ "release_date": "2024-10-11", |
|
| 101 |
+ "bios_revision": "5.16", |
|
| 102 |
+ "firmware_revision": "3.12" |
|
| 103 |
+ }, |
|
| 104 |
+ "cpu": {
|
|
| 105 |
+ "model": "Intel(R) Core(TM) i7-10710U", |
|
| 106 |
+ "family": 6, |
|
| 107 |
+ "model_id": 166, |
|
| 108 |
+ "stepping": 0, |
|
| 109 |
+ "cores": 6, |
|
| 110 |
+ "threads": 12, |
|
| 111 |
+ "max_speed_mhz": 4700, |
|
| 112 |
+ "l3_cache_mb": 12 |
|
| 113 |
+ }, |
|
| 114 |
+ "memory": {
|
|
| 115 |
+ "installed_total_gb": 64, |
|
| 116 |
+ "max_capacity_reported_gb": 32, |
|
| 117 |
+ "modules": [ |
|
| 118 |
+ {
|
|
| 119 |
+ "locator": "SODIMM1", |
|
| 120 |
+ "size_gb": 32, |
|
| 121 |
+ "type": "DDR4", |
|
| 122 |
+ "speed_mtps": 2667, |
|
| 123 |
+ "manufacturer": "029E", |
|
| 124 |
+ "part_number": "CMSX64GX4M2A2666C18", |
|
| 125 |
+ "rank": 2, |
|
| 126 |
+ "voltage_v": 1.2 |
|
| 127 |
+ }, |
|
| 128 |
+ {
|
|
| 129 |
+ "locator": "SODIMM2", |
|
| 130 |
+ "size_gb": 32, |
|
| 131 |
+ "type": "DDR4", |
|
| 132 |
+ "speed_mtps": 2667, |
|
| 133 |
+ "manufacturer": "029E", |
|
| 134 |
+ "part_number": "CMSX64GX4M2A2666C18", |
|
| 135 |
+ "rank": 2, |
|
| 136 |
+ "voltage_v": 1.2 |
|
| 137 |
+ } |
|
| 138 |
+ ] |
|
| 139 |
+ }, |
|
| 140 |
+ "storage": {
|
|
| 141 |
+ "nvme_controllers": [ |
|
| 142 |
+ "Samsung PM9A1/PM9A3/980PRO (3a:00.0)" |
|
| 143 |
+ ] |
|
| 144 |
+ }, |
|
| 145 |
+ "gpu": "Intel Comet Lake UHD Graphics (rev 04)", |
|
| 146 |
+ "network_controllers": [ |
|
| 147 |
+ "Intel Ethernet Connection (10) I219-V", |
|
| 148 |
+ "Intel Comet Lake PCH-LP CNVi WiFi (onboard, can be disabled)" |
|
| 149 |
+ ], |
|
| 150 |
+ "thunderbolt": {
|
|
| 151 |
+ "generation": "Thunderbolt 3", |
|
| 152 |
+ "controllers": [ |
|
| 153 |
+ "Intel JHL7540 Titan Ridge (NHI)", |
|
| 154 |
+ "Intel JHL7540 Titan Ridge USB Controller" |
|
| 155 |
+ ], |
|
| 156 |
+ "mac_addresses": [ |
|
| 157 |
+ "02:51:8a:ba:b2:2f" |
|
| 158 |
+ ] |
|
| 159 |
+ } |
|
| 160 |
+ }, |
|
| 161 |
+ "tapia": {
|
|
| 162 |
+ "system": {
|
|
| 163 |
+ "manufacturer": "Intel(R) Client Systems", |
|
| 164 |
+ "product_name": "NUC10i7FNH", |
|
| 165 |
+ "version": "M38010-308", |
|
| 166 |
+ "serial_number": "G6FN135001AK", |
|
| 167 |
+ "family": "FN", |
|
| 168 |
+ "sku": "BXNUC10i7FNH", |
|
| 169 |
+ "smbios_version": "3.3.0" |
|
| 170 |
+ }, |
|
| 171 |
+ "bios": {
|
|
| 172 |
+ "vendor": "Intel Corp.", |
|
| 173 |
+ "version": "FNCML357.0066.2024.1011.0925", |
|
| 174 |
+ "release_date": "2024-10-11", |
|
| 175 |
+ "bios_revision": "5.16", |
|
| 176 |
+ "firmware_revision": "3.12" |
|
| 177 |
+ }, |
|
| 178 |
+ "cpu": {
|
|
| 179 |
+ "model": "Intel(R) Core(TM) i7-10710U", |
|
| 180 |
+ "family": 6, |
|
| 181 |
+ "model_id": 166, |
|
| 182 |
+ "stepping": 0, |
|
| 183 |
+ "cores": 6, |
|
| 184 |
+ "threads": 12, |
|
| 185 |
+ "max_speed_mhz": 4700, |
|
| 186 |
+ "l3_cache_mb": 12 |
|
| 187 |
+ }, |
|
| 188 |
+ "memory": {
|
|
| 189 |
+ "installed_total_gb": 64, |
|
| 190 |
+ "max_capacity_reported_gb": 32, |
|
| 191 |
+ "modules": [ |
|
| 192 |
+ {
|
|
| 193 |
+ "locator": "SODIMM1", |
|
| 194 |
+ "size_gb": 32, |
|
| 195 |
+ "type": "DDR4", |
|
| 196 |
+ "speed_mtps": 2667, |
|
| 197 |
+ "manufacturer": "029E", |
|
| 198 |
+ "part_number": "CMSX64GX4M2A2666C18", |
|
| 199 |
+ "rank": 2, |
|
| 200 |
+ "voltage_v": 1.2 |
|
| 201 |
+ }, |
|
| 202 |
+ {
|
|
| 203 |
+ "locator": "SODIMM2", |
|
| 204 |
+ "size_gb": 32, |
|
| 205 |
+ "type": "DDR4", |
|
| 206 |
+ "speed_mtps": 2667, |
|
| 207 |
+ "manufacturer": "029E", |
|
| 208 |
+ "part_number": "CMSX64GX4M2A2666C18", |
|
| 209 |
+ "rank": 2, |
|
| 210 |
+ "voltage_v": 1.2 |
|
| 211 |
+ } |
|
| 212 |
+ ] |
|
| 213 |
+ }, |
|
| 214 |
+ "storage": {
|
|
| 215 |
+ "nvme_controllers": [ |
|
| 216 |
+ "Samsung SM981/PM981/PM983 (3a:00.0)" |
|
| 217 |
+ ] |
|
| 218 |
+ }, |
|
| 219 |
+ "gpu": "Intel Comet Lake UHD Graphics (rev 04)", |
|
| 220 |
+ "network_controllers": [ |
|
| 221 |
+ "Intel Ethernet Connection (10) I219-V", |
|
| 222 |
+ "Intel Comet Lake PCH-LP CNVi WiFi" |
|
| 223 |
+ ], |
|
| 224 |
+ "thunderbolt": {
|
|
| 225 |
+ "generation": "Thunderbolt 3", |
|
| 226 |
+ "controllers": [ |
|
| 227 |
+ "Intel JHL7540 Titan Ridge (NHI)", |
|
| 228 |
+ "Intel JHL7540 Titan Ridge USB Controller" |
|
| 229 |
+ ], |
|
| 230 |
+ "mac_addresses": [ |
|
| 231 |
+ "02:31:33:6d:b4:df" |
|
| 232 |
+ ] |
|
| 233 |
+ }, |
|
| 234 |
+ "tpm": {
|
|
| 235 |
+ "vendor_id": "CTNI", |
|
| 236 |
+ "spec_version": "2.0", |
|
| 237 |
+ "firmware_revision": "500.14" |
|
| 238 |
+ } |
|
| 239 |
+ } |
|
| 240 |
+ } |
|
| 241 |
+} |
|
@@ -0,0 +1,98 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "cluster": {
|
|
| 3 |
+ "networks": {
|
|
| 4 |
+ "fabric": {
|
|
| 5 |
+ "type": "thunderbolt", |
|
| 6 |
+ "bridge": "thunderbridge", |
|
| 7 |
+ "cidr": "192.168.10.0/24", |
|
| 8 |
+ "topology": ["ebony", "baobab", "tapia"], |
|
| 9 |
+ "used_by_vms": true |
|
| 10 |
+ }, |
|
| 11 |
+ "internet": {
|
|
| 12 |
+ "bridge": "vmbr443", |
|
| 13 |
+ "vlan": 443, |
|
| 14 |
+ "cidr": "192.168.2.0/24", |
|
| 15 |
+ "used_by_vms": true |
|
| 16 |
+ } |
|
| 17 |
+ } |
|
| 18 |
+ }, |
|
| 19 |
+ "hosts": [ |
|
| 20 |
+ {
|
|
| 21 |
+ "host": "baobab", |
|
| 22 |
+ "network": {
|
|
| 23 |
+ "thunderbridge": {
|
|
| 24 |
+ "ipv4": "192.168.10.91/24", |
|
| 25 |
+ "bridge_id": "8000.02ff1918f13a", |
|
| 26 |
+ "thunderbolt_macs": ["02:ff:19:18:f1:3a", "02:ee:dc:db:e6:0b"] |
|
| 27 |
+ }, |
|
| 28 |
+ "vmbr443": {
|
|
| 29 |
+ "ipv4": "192.168.2.91/24", |
|
| 30 |
+ "mac": "48:21:0b:60:9f:ab", |
|
| 31 |
+ "bridge_id": "8000.48210b609fab", |
|
| 32 |
+ "uplink": "enp86s0.443", |
|
| 33 |
+ "stp": false |
|
| 34 |
+ } |
|
| 35 |
+ } |
|
| 36 |
+ }, |
|
| 37 |
+ {
|
|
| 38 |
+ "host": "ebony", |
|
| 39 |
+ "network": {
|
|
| 40 |
+ "thunderbridge": {
|
|
| 41 |
+ "ipv4": "192.168.10.92/24", |
|
| 42 |
+ "bridge_id": "8000.02518abab22f", |
|
| 43 |
+ "thunderbolt_macs": ["02:51:8a:ba:b2:2f"] |
|
| 44 |
+ }, |
|
| 45 |
+ "vmbr443": {
|
|
| 46 |
+ "ipv4": "192.168.2.92/24", |
|
| 47 |
+ "mac": "1c:69:7a:ab:26:2f", |
|
| 48 |
+ "bridge_id": "8000.1c697aab262f", |
|
| 49 |
+ "uplink": "eno1.443", |
|
| 50 |
+ "stp": false |
|
| 51 |
+ } |
|
| 52 |
+ } |
|
| 53 |
+ }, |
|
| 54 |
+ {
|
|
| 55 |
+ "host": "tapia", |
|
| 56 |
+ "network": {
|
|
| 57 |
+ "thunderbridge": {
|
|
| 58 |
+ "ipv4": "192.168.10.93/24", |
|
| 59 |
+ "bridge_id": "8000.0231336db4df", |
|
| 60 |
+ "thunderbolt_macs": ["02:31:33:6d:b4:df"] |
|
| 61 |
+ }, |
|
| 62 |
+ "vmbr443": {
|
|
| 63 |
+ "ipv4": "192.168.2.93/24", |
|
| 64 |
+ "mac": "1c:69:7a:aa:e3:5d", |
|
| 65 |
+ "bridge_id": "8000.1c697aaae35d", |
|
| 66 |
+ "uplink": "eno1.443", |
|
| 67 |
+ "stp": false |
|
| 68 |
+ } |
|
| 69 |
+ } |
|
| 70 |
+ } |
|
| 71 |
+ ], |
|
| 72 |
+ "services": {
|
|
| 73 |
+ "pbs": [ |
|
| 74 |
+ {
|
|
| 75 |
+ "name": "anjohibe", |
|
| 76 |
+ "role": "proxmox-backup-server", |
|
| 77 |
+ "type": "vm", |
|
| 78 |
+ "host": "ebony", |
|
| 79 |
+ "network": {
|
|
| 80 |
+ "thunderbridge": { "ipv4": "192.168.10.95/24" },
|
|
| 81 |
+ "vmbr443": { "ipv4": "192.168.2.95/24" }
|
|
| 82 |
+ }, |
|
| 83 |
+ "virtual_nas": { "ipv4": "192.168.10.21" }
|
|
| 84 |
+ }, |
|
| 85 |
+ {
|
|
| 86 |
+ "name": "andrafiabe", |
|
| 87 |
+ "role": "proxmox-backup-server", |
|
| 88 |
+ "type": "vm", |
|
| 89 |
+ "host": "tapia", |
|
| 90 |
+ "network": {
|
|
| 91 |
+ "thunderbridge": { "ipv4": "192.168.10.96/24" },
|
|
| 92 |
+ "vmbr443": { "ipv4": "192.168.2.96/24" }
|
|
| 93 |
+ }, |
|
| 94 |
+ "virtual_nas": { "ipv4": "192.168.10.22" }
|
|
| 95 |
+ } |
|
| 96 |
+ ] |
|
| 97 |
+ } |
|
| 98 |
+} |
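In the host entries above, each `bridge_id` is the bridge priority (`8000`, hexadecimal) followed by the bridge MAC address with the colons removed, which is how Linux reports bridge IDs. A minimal sketch for cross-checking an entry, assuming the default priority; `bridge_id_from_mac` is a hypothetical helper, not part of this repository:

```bash
# Hypothetical helper: derive the expected bridge ID from a MAC address.
# Assumes the default bridge priority 8000 (hex), as seen in every entry above.
bridge_id_from_mac() {
    # Strip the colons and lowercase the hex digits.
    mac=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    printf '8000.%s\n' "$mac"
}

bridge_id_from_mac "02:ff:19:18:f1:3a"   # prints 8000.02ff1918f13a
```

The printed value matches the `bridge_id` recorded for `thunderbridge` on `baobab`.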
|
@@ -0,0 +1,117 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "schemaVersion": "1.0", |
|
| 3 |
+ "lastUpdated": "2025-10-19T00:00:00Z", |
|
| 4 |
+ "description": "Cluster configuration for Madagascar Proxmox cluster", |
|
| 5 |
+ "clusters": {
|
|
| 6 |
+ "madagascar": {
|
|
| 7 |
+ "name": "madagascar", |
|
| 8 |
+ "description": "Proxmox VE cluster with 3 nodes: baobab, ebony, tapia", |
|
| 9 |
+ "pveVersion": "8.x", |
|
| 10 |
+ "pbsServers": [ |
|
| 11 |
+ {
|
|
| 12 |
+ "name": "andrafiabe-AutoNAS", |
|
| 13 |
+ "ip": "192.168.2.96", |
|
| 14 |
+ "hostname": "andrafiabe.madagascar.xdev.ro", |
|
| 15 |
+ "repo": "backup", |
|
| 16 |
+ "prunePolicy": "keep-all=1" |
|
| 17 |
+ }, |
|
| 18 |
+ {
|
|
| 19 |
+ "name": "anjothibe-AutoNAS", |
|
| 20 |
+ "ip": "192.168.2.95", |
|
| 21 |
+ "hostname": "anjothibe.madagascar.xdev.ro", |
|
| 22 |
+ "repo": "backup", |
|
| 23 |
+ "prunePolicy": "keep-all=1" |
|
| 24 |
+ } |
|
| 25 |
+ ], |
|
| 26 |
+ "lastUpdated": "2025-10-19T00:00:00Z", |
|
| 27 |
+ "nodes": {
|
|
| 28 |
+ "baobab": {
|
|
| 29 |
+ "name": "baobab", |
|
| 30 |
+ "role": "primary", |
|
| 31 |
+ "wan": {
|
|
| 32 |
+ "vmbr443": {
|
|
| 33 |
+ "address": "192.168.2.91/24", |
|
| 34 |
+ "gateway": "192.168.2.1" |
|
| 35 |
+ }, |
|
| 36 |
+ "vmbr444": {
|
|
| 37 |
+ "address": "192.168.4.91/24" |
|
| 38 |
+ } |
|
| 39 |
+ }, |
|
| 40 |
+ "network": {
|
|
| 41 |
+ "thunderbridge": {
|
|
| 42 |
+ "bridge": "thunderbridge", |
|
| 43 |
+ "address": "192.168.10.91/24", |
|
| 44 |
+ "mtu": 65520 |
|
| 45 |
+ } |
|
| 46 |
+ }, |
|
| 47 |
+ "services": {
|
|
| 48 |
+ "tb-bridge": {
|
|
| 49 |
+ "enabled": true |
|
| 50 |
+ } |
|
| 51 |
+ }, |
|
| 52 |
+ "notes": "Node entry populated from local deploy layout" |
|
| 53 |
+ }, |
|
| 54 |
+ "ebony": {
|
|
| 55 |
+ "name": "ebony", |
|
| 56 |
+ "role": "secondary", |
|
| 57 |
+ "wan": {
|
|
| 58 |
+ "vmbr443": {
|
|
| 59 |
+ "address": "192.168.2.92/24", |
|
| 60 |
+ "gateway": "192.168.2.1" |
|
| 61 |
+ }, |
|
| 62 |
+ "vmbr444": {
|
|
| 63 |
+ "address": "192.168.4.92/24" |
|
| 64 |
+ } |
|
| 65 |
+ }, |
|
| 66 |
+ "network": {
|
|
| 67 |
+ "thunderbridge": {
|
|
| 68 |
+ "bridge": "thunderbridge", |
|
| 69 |
+ "address": "192.168.10.92/24", |
|
| 70 |
+ "mtu": 65520 |
|
| 71 |
+ } |
|
| 72 |
+ } |
|
| 73 |
+ }, |
|
| 74 |
+ "tapia": {
|
|
| 75 |
+ "name": "tapia", |
|
| 76 |
+ "role": "secondary", |
|
| 77 |
+ "wan": {
|
|
| 78 |
+ "vmbr443": {
|
|
| 79 |
+ "address": "192.168.2.93/24", |
|
| 80 |
+ "gateway": "192.168.2.1" |
|
| 81 |
+ }, |
|
| 82 |
+ "vmbr444": {
|
|
| 83 |
+ "address": "192.168.4.93/24" |
|
| 84 |
+ } |
|
| 85 |
+ }, |
|
| 86 |
+ "network": {
|
|
| 87 |
+ "thunderbridge": {
|
|
| 88 |
+ "bridge": "thunderbridge", |
|
| 89 |
+ "address": "192.168.10.93/24", |
|
| 90 |
+ "mtu": 65520 |
|
| 91 |
+ } |
|
| 92 |
+ } |
|
| 93 |
+ } |
|
| 94 |
+ } |
|
| 95 |
+ } |
|
| 96 |
+ }, |
|
| 97 |
+ "clusterNetwork": {
|
|
| 98 |
+ "thunderbolt": {
|
|
| 99 |
+ "description": "Cluster thunderbolt bridge configuration", |
|
| 100 |
+ "bridge": "thunderbridge", |
|
| 101 |
+ "cidr": "192.168.10.0/24", |
|
| 102 |
+ "mtu": 65520, |
|
| 103 |
+ "dns": "192.168.2.2", |
|
| 104 |
+ "nodes": {
|
|
| 105 |
+ "baobab": {
|
|
| 106 |
+ "address": "192.168.10.91/24" |
|
| 107 |
+ }, |
|
| 108 |
+ "ebony": {
|
|
| 109 |
+ "address": "192.168.10.92/24" |
|
| 110 |
+ }, |
|
| 111 |
+ "tapia": {
|
|
| 112 |
+ "address": "192.168.10.93/24" |
|
| 113 |
+ } |
|
| 114 |
+ } |
|
| 115 |
+ } |
|
| 116 |
+ } |
|
| 117 |
+} |
|
@@ -0,0 +1,190 @@ |
||
| 1 |
+# Madagascar Cluster Projects |
|
| 2 |
+ |
|
| 3 |
+This directory is the single working location for current and future cluster-level projects. |
|
| 4 |
+ |
|
| 5 |
+## Reference baseline |
|
| 6 |
+ |
|
| 7 |
+The install, uninstall, and reinstall workflow documented here is based on the most complete existing implementation, found in `autoNAS`. |
|
| 8 |
+ |
|
| 9 |
+Main references: |
|
| 10 |
+- `cluster/projects/autoNAS/README.md` |
|
| 11 |
+- `cluster/projects/autoNAS/DEVELOPMENT.md` |
|
| 12 |
+- `cluster/projects/autoNAS/scripts/install.sh` |
|
| 13 |
+- `cluster/projects/autoNAS/scripts/autonas-uninstall.sh` |
|
| 14 |
+ |
|
| 15 |
+Important note: |
|
| 16 |
+- `autoNAS` confirms the correct workflow of uninstall-before-reinstall and cleanup of orphaned files |
|
| 17 |
+- `autoNAS` is not yet fully aligned with the new location rule for operator-facing commands, because it currently installs into `/usr/local/bin` |
|
| 18 |
+- for new projects, the rule remains `/usr/local/sbin`; treat `autoNAS` as a working precedent for the workflow, not as the final layout standard |
|
| 19 |
+ |
|
| 20 |
+## Organization namespace |
|
| 21 |
+ |
|
| 22 |
+For clarity and to avoid collisions between projects, all standard locations must be namespaced with the organization identifier: |
|
| 23 |
+ |
|
| 24 |
+- `xdev` |
|
| 25 |
+ |
|
| 26 |
+The general rule is: |
|
| 27 |
+- use `<project-name>` for the project's identity |
|
| 28 |
+- use `xdev` in the install path for internal files, configuration, data, and documentation |
|
| 29 |
+ |
|
| 30 |
+## General rules |
|
| 31 |
+ |
|
| 32 |
+- All new projects are created under `cluster/projects/<project-name>`. |
|
| 33 |
+- Projects are opened and maintained from `cluster`, to reduce divergence between workspaces and duplication of documentation or scripts. |
|
| 34 |
+- Every project must have at least: |
|
| 35 |
+  - `README.md` |
|
| 36 |
+  - an install script |
|
| 37 |
+  - an uninstall script |
|
| 38 |
+  - operating and upgrade instructions |
|
| 39 |
+ |
|
| 40 |
+## Mandatory well-known locations |
|
| 41 |
+ |
|
| 42 |
+Installations must use predictable, stable locations: |
|
| 43 |
+ |
|
| 44 |
+- operator-facing executables and scripts: `/usr/local/sbin` |
|
| 45 |
+- project-internal binaries or scripts: `/usr/local/lib/xdev/<project-name>` |
|
| 46 |
+- documentation installed on the host: `/usr/local/share/doc/xdev/<project-name>` |
|
| 47 |
+- persistent configuration files: `/etc/xdev/<project-name>` |
|
| 48 |
+- environment defaults: `/etc/default/xdev-<project-name>` |
|
| 49 |
+- systemd units: `/etc/systemd/system` |
|
| 50 |
+- persistent state and operational data: `/var/lib/xdev/<project-name>` |
|
| 51 |
+- temporary cache, if needed: `/var/cache/xdev/<project-name>` |
|
| 52 |
+- dedicated on-disk logs, only if the project actually writes log files: `/var/log/xdev/<project-name>` |
|
| 53 |
+ |
|
| 54 |
+Rule of thumb: |
|
| 55 |
+- if an operator needs to run the command directly, it goes in `/usr/local/sbin` |
|
| 56 |
+- if the file is internal support for the project, it goes in `/usr/local/lib/xdev/<project-name>` |
|
| 57 |
+- if the file is documentation installed locally on the host, it goes in `/usr/local/share/doc/xdev/<project-name>` |
|
| 58 |
+- if the file is editable configuration, it goes in `/etc/xdev/<project-name>` or `/etc/default/xdev-<project-name>` |
|
| 59 |
+- if the file is state, a local database, a lock, a snapshot, or other operational data, it goes in `/var/lib/xdev/<project-name>` |
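The locations above can be collected in one place so that install and uninstall scripts agree on them. A minimal sketch, assuming a hypothetical helper named `xdev_layout` and illustrative variable names (neither is defined by this repository):

```bash
# Hypothetical helper: print the well-known locations for one project
# as KEY=value lines. The paths come from the list above; the names are illustrative.
xdev_layout() {
    project=$1
    cat <<EOF
SBIN_DIR=/usr/local/sbin
LIB_DIR=/usr/local/lib/xdev/$project
DOC_DIR=/usr/local/share/doc/xdev/$project
CONF_DIR=/etc/xdev/$project
DEFAULTS_FILE=/etc/default/xdev-$project
UNIT_DIR=/etc/systemd/system
STATE_DIR=/var/lib/xdev/$project
CACHE_DIR=/var/cache/xdev/$project
LOG_DIR=/var/log/xdev/$project
EOF
}

xdev_layout pve-guests-state
```

An install script could load these with `eval "$(xdev_layout <project-name>)"` and refer to `$LIB_DIR`, `$CONF_DIR`, and so on, instead of repeating the paths.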
|
| 60 |
+ |
|
| 61 |
+## Standard location for uninstall scripts |
|
| 62 |
+ |
|
| 63 |
+The canonical standard location for the uninstall script installed on the host is: |
|
| 64 |
+ |
|
| 65 |
+- `/usr/local/lib/xdev/<project-name>/uninstall.sh` |
|
| 66 |
+ |
|
| 67 |
+Rationale: |
|
| 68 |
+- uninstall is first and foremost part of the project's internal lifecycle mechanism |
|
| 69 |
+- the installer must be able to call it for automatic cleanup before a reinstall |
|
| 70 |
+- it must be versioned together with the rest of the project's internal files |
|
| 71 |
+- it avoids cluttering `/usr/local/sbin` with scripts that are not used frequently in day-to-day operation |
|
| 72 |
+ |
|
| 73 |
+Naming rule: |
|
| 74 |
+- the canonical script installed on the host is named `uninstall.sh` |
|
| 75 |
+- the project directory provides the full context: `/usr/local/lib/xdev/<project-name>/uninstall.sh` |
|
| 76 |
+ |
|
| 77 |
+Optional operator exposure: |
|
| 78 |
+- if a simple, predictable manual command is wanted, a wrapper or symlink can be installed at: |
|
| 79 |
+  - `/usr/local/sbin/xdev-<project-name>-uninstall` |
|
| 80 |
+- this wrapper must call the canonical script at `/usr/local/lib/xdev/<project-name>/uninstall.sh` |
|
| 81 |
+- the wrapper in `/usr/local/sbin` is optional; the canonical script in `/usr/local/lib/xdev/<project-name>/` is mandatory |
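A wrapper along these lines would satisfy the rules above. This is a sketch only: `uninstall_wrapper` and the `XDEV_LIB_ROOT` override are hypothetical names introduced for illustration, and a real wrapper in `/usr/local/sbin` would hard-code its project name.

```bash
# uninstall_wrapper PROJECT [ARGS...]: forward to the canonical uninstall script.
# XDEV_LIB_ROOT is an assumed override for testing; it defaults to the documented path.
uninstall_wrapper() {
    project=$1
    shift
    canonical="${XDEV_LIB_ROOT:-/usr/local/lib/xdev}/$project/uninstall.sh"
    if [ ! -x "$canonical" ]; then
        echo "canonical uninstall script not found: $canonical" >&2
        return 1
    fi
    # Pass any remaining arguments through unchanged.
    "$canonical" "$@"
}
```

Failing loudly when the canonical script is missing keeps the wrapper from hiding an incomplete installation.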
|
| 82 |
+ |
|
| 83 |
+## Install and uninstall |
|
| 84 |
+ |
|
| 85 |
+- Every installation must be accompanied by an uninstall script shipped by the same project. |
|
| 86 |
+- The uninstall script installed on the host must exist at `/usr/local/lib/xdev/<project-name>/uninstall.sh`. |
|
| 87 |
+- The uninstall script must remove all files installed by the project: |
|
| 88 |
+  - executables |
|
| 89 |
+  - files in `/usr/local/lib/xdev/<project-name>` |
|
| 90 |
+  - documentation in `/usr/local/share/doc/xdev/<project-name>` |
|
| 91 |
+  - systemd units |
|
| 92 |
+  - configuration files generated by the project, if managed exclusively by it, in `/etc/xdev/<project-name>` or `/etc/default/xdev-<project-name>` |
|
| 93 |
+  - state, data, or cache directories created by the project, unless they contain data that must be explicitly preserved, in `/var/lib/xdev/<project-name>` or `/var/cache/xdev/<project-name>` |
|
| 94 |
+- The goal is to prevent orphaned files and reinstalls on top of artifacts left over from previous versions. |
|
| 95 |
+ |
|
| 96 |
+## Reinstall rule |
|
| 97 |
+ |
|
| 98 |
+- All reinstalls happen only after a complete uninstall. |
|
| 99 |
+- Uninstall is done only with the project's original uninstall script, not through partial manual deletions. |
|
| 100 |
+- The mandatory flow is: |
|
| 101 |
+ |
|
| 102 |
+```text |
|
| 103 |
+uninstall -> verify cleanup -> install |
|
| 104 |
+``` |
|
| 105 |
+ |
|
| 106 |
+- Never reinstall directly over an existing installation, even if it appears functional. |
|
| 107 |
+- If the uninstall script is missing, the project's installation is incomplete and must be corrected before any upgrade or reinstall. |
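The mandatory flow can be enforced in code rather than by convention. A minimal sketch, assuming a hypothetical `reinstall` helper that receives the uninstall command, the install command, and the paths that must be gone before installing:

```bash
# reinstall UNINSTALL_CMD INSTALL_CMD PATH...: enforce uninstall -> verify -> install.
# The PATH arguments are locations that must no longer exist after uninstall.
reinstall() {
    uninstall_cmd=$1
    install_cmd=$2
    shift 2

    "$uninstall_cmd" || { echo "uninstall failed" >&2; return 1; }

    # Verify cleanup: refuse to install on top of leftovers.
    for leftover in "$@"; do
        if [ -e "$leftover" ]; then
            echo "leftover after uninstall: $leftover" >&2
            return 1
        fi
    done

    "$install_cmd"
}
```

A call might look like `reinstall /usr/local/lib/xdev/<project-name>/uninstall.sh ./install.sh /usr/local/lib/xdev/<project-name>`, where the script names are placeholders for whatever the project ships.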
|
| 108 |
+ |
|
| 109 |
+## Requirements for new projects |
|
| 110 |
+ |
|
| 111 |
+Every new project must explicitly include: |
|
| 112 |
+ |
|
| 113 |
+1. an `install` that uses the well-known locations |
|
| 114 |
+2. an `uninstall` that completely reverses the installation |
|
| 115 |
+3. a `README.md` with: |
|
| 116 |
+   - the layout of installed files |
|
| 117 |
+   - the install commands |
|
| 118 |
+   - the uninstall commands |
|
| 119 |
+   - the reinstall steps |
|
| 120 |
+   - the location of the uninstall script installed on the host: `/usr/local/lib/xdev/<project-name>/uninstall.sh` |
|
| 121 |
+   - the locations for configuration, documentation, and data |
|
| 122 |
+4. if systemd is involved: |
|
| 123 |
+   - `daemon-reload` on install and uninstall |
|
| 124 |
+   - clearly defined enable/disable/stop steps |
|
| 125 |
+   - at deployment, services and timers that must remain active are started with `systemctl enable --now`, not just `enable` |
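A sketch of the systemd handling described in point 4, assuming hypothetical helper names `deploy_unit` and `remove_unit`. The `systemctl` subcommands are standard; everything else is illustrative:

```bash
# Sketch: systemd lifecycle for one unit; the unit file is supplied by the caller.
# UNIT_DIR defaults to the documented location for units.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

deploy_unit() {
    # $1 = path to the unit file shipped by the project
    unit=$(basename "$1")
    cp "$1" "$UNIT_DIR/$unit" &&
    systemctl daemon-reload &&
    systemctl enable --now "$unit"   # enable and start in one step
}

remove_unit() {
    # $1 = unit name, for example a hypothetical demo.service
    systemctl disable --now "$1"     # stop and disable in one step
    rm -f "$UNIT_DIR/$1"
    systemctl daemon-reload
}
```

Using `enable --now` at deploy time avoids the case where a unit is installed and enabled but never started until the next boot.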
|
| 126 |
+ |
|
| 127 |
+## Applying this to existing projects |
|
| 128 |
+ |
|
| 129 |
+Projects already moved under `cluster/projects/` must be aligned progressively with these rules. |
|
| 130 |
+ |
|
| 131 |
+Priorities: |
|
| 132 |
+- confirming that an uninstall script exists for every project |
|
| 133 |
+- standardizing installation in `/usr/local/sbin` and `/usr/local/lib/xdev/<project-name>` |
|
| 134 |
+- eliminating reinstalls performed on top of existing files |
|
| 135 |
+ |
|
| 136 |
+## Lessons confirmed in autoNAS |
|
| 137 |
+ |
|
| 138 |
+Problems already identified and resolved in `autoNAS`, which must be treated as rules for future projects: |
|
| 139 |
+ |
|
| 140 |
+- reinstalls over old versions leave orphaned files unless there is explicit cleanup |
|
| 141 |
+- the installer must be able to run cleanup of the previous version before installing |
|
| 142 |
+- the uninstaller must be installed on the host to allow correct cleanup on upgrade or reinstall |
|
| 143 |
+- uninstall must aggressively clean up historical files left over from older versions |
|
| 144 |
+- user configuration must be preserved when it contains real data, not deleted blindly |
|
| 145 |
+- systemd services must be stopped, disabled, and removed, followed by `daemon-reload` |
|
| 146 |
+- at deployment, a service required in production must not be left merely `enabled`; use `enable --now` to avoid deploys with services installed but not started |
|
| 147 |
+- some resources require explicit manual cleanup if they may contain operational data, for example NFS exports or active mount points |
|
| 148 |
+ |
|
| 149 |
+The flow validated by `autoNAS` is: |
|
| 150 |
+ |
|
| 151 |
+```text |
|
| 152 |
+detect previous install -> run original uninstall -> clean orphan files -> install new version -> preserve user data where required |
|
| 153 |
+``` |
|
| 154 |
+ |
|
| 155 |
+## Operational rule |
|
| 156 |
+ |
|
| 157 |
+When an existing project is modified or a new one is added, the project's documentation is also updated so that the procedures for: |
|
| 158 |
+ |
|
| 159 |
+- install |
|
| 160 |
+- uninstall |
|
| 161 |
+- reinstall |
|
| 162 |
+ |
|
| 163 |
+are explicit, repeatable, and leave no leftover artifacts on the host. |
|
| 164 |
+ |
|
| 165 |
+## Cluster-wide deploy |
|
| 166 |
+ |
|
| 167 |
+For the final rollout on the cluster, do not deploy manually node by node if the project is intended to run cluster-wide. |
|
| 168 |
+ |
|
| 169 |
+The rule of thumb is: |
|
| 170 |
+- every project must also keep a single-node variant for development and testing |
|
| 171 |
+- for cluster-wide rollout, use the shared orchestrator at the repository root: |
|
| 172 |
+  - `cluster/scripts/deploy-project.sh <project-name>` |
|
| 173 |
+ |
|
| 174 |
+Source of truth for nodes: |
|
| 175 |
+- `cluster/cluster-context/madagascar.json` |
|
| 176 |
+ |
|
| 177 |
+Examples: |
|
| 178 |
+ |
|
| 179 |
+```bash |
|
| 180 |
+./scripts/deploy-project.sh pve-guests-state |
|
| 181 |
+./scripts/deploy-project.sh pve-net-hang-watchdog |
|
| 182 |
+./scripts/deploy-project.sh pve-backup-scheduler |
|
| 183 |
+./scripts/deploy-project.sh autoNAS |
|
| 184 |
+./scripts/deploy-project.sh pve-guests-state install --node ebony |
|
| 185 |
+``` |
|
| 186 |
+ |
|
| 187 |
+Requirements for projects: |
|
| 188 |
+- new projects must provide either `setup.sh` or `deploy.sh` |
|
| 189 |
+- `setup.sh` remains the standard entry point for install/uninstall on a single node |
|
| 190 |
+- the shared orchestrator selects the nodes based on `cluster-context/madagascar.json` and runs the project on all selected targets |
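Target selection in the orchestrator might look roughly like the following. How `deploy-project.sh` actually reads `madagascar.json` is not shown here, so this sketch takes the node list as an argument; `select_targets` is a hypothetical name, and only the `--node` flag comes from the examples above.

```bash
# select_targets "NODE..." [--node NAME]: print one target node per line.
# Without --node, every node in the list is a target.
select_targets() {
    all_nodes=$1
    shift
    wanted=""
    while [ $# -gt 0 ]; do
        case $1 in
            --node)
                if [ $# -ge 2 ]; then wanted=$2; shift 2; else shift; fi
                ;;
            *)
                shift   # ignore arguments this sketch does not handle
                ;;
        esac
    done
    for node in $all_nodes; do
        if [ -z "$wanted" ] || [ "$node" = "$wanted" ]; then
            echo "$node"
        fi
    done
}

select_targets "baobab ebony tapia" --node ebony   # prints ebony
```

Keeping selection separate from execution makes it easy to preview which nodes a rollout would touch before anything is changed.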
|
@@ -0,0 +1 @@ |
||
| 1 |
+Subproject commit d426b0effcb2e2195b7c6742718037862bd15767 |
|
@@ -0,0 +1,45 @@ |
||
| 1 |
+# Exclude these files from deployment |
|
| 2 |
+ |
|
| 3 |
+# Development metadata |
|
| 4 |
+**/.metadata/** |
|
| 5 |
+**/.settings/** |
|
| 6 |
+**/Release/** |
|
| 7 |
+**/Debug/** |
|
| 8 |
+ |
|
| 9 |
+# OS files |
|
| 10 |
+**/.DS_Store |
|
| 11 |
+**/Thumbs.db |
|
| 12 |
+ |
|
| 13 |
+# Version control |
|
| 14 |
+**/.git/** |
|
| 15 |
+**/.svn/** |
|
| 16 |
+ |
|
| 17 |
+# IDE files |
|
| 18 |
+**/.project |
|
| 19 |
+**/.cproject |
|
| 20 |
+**/.classpath |
|
| 21 |
+ |
|
| 22 |
+# Temporary files |
|
| 23 |
+**/tmp/** |
|
| 24 |
+**/temp/** |
|
| 25 |
+ |
|
| 26 |
+# Large binaries |
|
| 27 |
+**/*.bin |
|
| 28 |
+**/*.elf |
|
| 29 |
+# **/*.rpm # Commented out to allow offline packages |
|
| 30 |
+**/*.o |
|
| 31 |
+ |
|
| 32 |
+# Offline packages (comment out the line below to include packages in deployment) |
|
| 33 |
+# packages/** |
|
| 34 |
+ |
|
| 35 |
+# Other projects not related to autoSMART |
|
| 36 |
+configi/** |
|
| 37 |
+raduin/** |
|
| 38 |
+radion/** |
|
| 39 |
+linux/** |
|
| 40 |
+ipconfig/** |
|
| 41 |
+autoNAS/** |
|
| 42 |
+VariaMediaDump/** |
|
| 43 |
+Madagascar/** |
|
| 44 |
+RemoteSystemsTempFiles/** |
|
| 45 |
+ |
|
@@ -0,0 +1,144 @@ |
||
| 1 |
+# autoSMART Debug Resolution Report |
|
| 2 |
+## Date: 2025-08-16 |
|
| 3 |
+ |
|
| 4 |
+### Issues Identified and Resolved |
|
| 5 |
+ |
|
| 6 |
+#### ❌ Issue 1: Empty hdd_presence table |
|
| 7 |
+**Problem**: Table `hdd_presence` was empty despite collector running |
|
| 8 |
+**Root Causes**: |
|
| 9 |
+1. SMART parameter parsing regex was incorrect for new smartctl format |
|
| 10 |
+2. Database permission issues for sequence access |
|
| 11 |
+3. Missing fields in smart_readings INSERT |
|
| 12 |
+ |
|
| 13 |
+#### ✅ Solutions Implemented |
|
| 14 |
+ |
|
| 15 |
+##### 1. Enhanced Debug Logging in smart-collector-daemon.pl |
|
| 16 |
+- Added comprehensive debug logging throughout the collection process |
|
| 17 |
+- Enhanced `get_or_create_hdd()` function with detailed presence tracking logs |
|
| 18 |
+- Added device scanning and SMART parsing debug information |
|
| 19 |
+- Added database connectivity testing in debug mode |
|
| 20 |
+ |
|
| 21 |
+##### 2. Fixed SMART Parameter Parsing |
|
| 22 |
+**Before**: Only supported old format |
|
| 23 |
+```perl |
|
| 24 |
+elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\w+\s+\w+\s+\w+\s+(\d+)/) {
|
|
| 25 |
+``` |
|
| 26 |
+ |
|
| 27 |
+**After**: Supports both old and new smartctl formats |
|
| 28 |
+```perl |
|
| 29 |
+elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)/) {
|
|
| 30 |
+ # New format: ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
|
| 31 |
+``` |
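For quick checks outside the daemon, the same column layout can be read with plain word splitting. This is a shell sketch, not the Perl code used by `smart-collector-daemon.pl`; `parse_smart_attr` is a hypothetical name, and the sample row in the usage line is illustrative apart from the attribute name and raw value.

```bash
# parse_smart_attr LINE: print "NAME RAW_VALUE" for one smartctl attribute row.
# Columns: ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
parse_smart_attr() (
    set -f            # no filename expansion while splitting the line
    set -- $1         # split on whitespace into positional parameters
    [ $# -ge 10 ] || return 1
    case $1 in ''|*[!0-9]*) return 1 ;; esac   # ID must be numeric
    case $3 in 0x*) ;; *) return 1 ;; esac     # FLAG is hexadecimal
    echo "$2 ${10}"
)

parse_smart_attr "  1 Raw_Read_Error_Rate 0x000f 083 064 044 Pre-fail Always - 1176"
```

Header rows and other non-attribute lines are rejected because their first field is not numeric.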
|
| 32 |
+ |
|
| 33 |
+##### 3. Fixed Database Schema Permissions |
|
| 34 |
+**Problem**: `permission denied for sequence hdd_presence_id_seq` |
|
| 35 |
+**Solution**: Added proper sequence permissions |
|
| 36 |
+```sql |
|
| 37 |
+GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO autosmart; |
|
| 38 |
+``` |
|
| 39 |
+ |
|
| 40 |
+##### 4. Fixed smart_readings INSERT Statement |
|
| 41 |
+**Before**: Missing required NOT NULL fields |
|
| 42 |
+```perl |
|
| 43 |
+INSERT INTO smart_readings (hdd_id, timestamp, temperature, parameters_json, reading_type) |
|
| 44 |
+``` |
|
| 45 |
+ |
|
| 46 |
+**After**: Complete field list |
|
| 47 |
+```perl |
|
| 48 |
+INSERT INTO smart_readings (hdd_id, serial_number, device_path, node_id, timestamp, temperature, parameters_json, reading_type) |
|
| 49 |
+``` |
|
| 50 |
+ |
|
| 51 |
+##### 5. Enhanced Configuration Preservation |
|
| 52 |
+**Problem**: Install script overwrote existing `/etc/default/autosmart` configuration |
|
| 53 |
+**Solution**: Implemented configuration merging in install.sh |
|
| 54 |
+- Backup existing configuration with timestamp |
|
| 55 |
+- Parse existing key-value pairs |
|
| 56 |
+- Merge with new defaults while preserving user settings |
|
| 57 |
+- Log preserved/added settings |
|
| 58 |
+ |
|
| 59 |
+```bash |
|
| 60 |
+# Backup existing configuration |
|
| 61 |
+cp "/etc/default/autosmart" "/etc/default/autosmart.backup.$(date +%Y%m%d_%H%M%S)" |
|
| 62 |
+ |
|
| 63 |
+# Read and preserve existing settings |
|
| 64 |
+declare -A existing_config |
|
| 65 |
+while IFS='=' read -r key value; do |
|
| 66 |
+ if [[ $key =~ ^[A-Z_]+$ ]] && [[ -n $value ]]; then |
|
| 67 |
+ value=$(echo "$value" | sed 's/^"//;s/"$//') |
|
| 68 |
+ existing_config["$key"]="$value" |
|
| 69 |
+ fi |
|
| 70 |
+done < "/etc/default/autosmart" |
|
| 71 |
+``` |
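The merge step that follows the loop above is not shown in this report. One way it could work is sketched here with `awk`, under the assumption that the new defaults file is the template and that any non-empty user value replaces the default for the same key. `merge_config` is a hypothetical name, and keys that exist only in the old file are dropped in this sketch.

```bash
# merge_config DEFAULTS EXISTING: print DEFAULTS with values overridden by EXISTING.
# Comments and other lines from DEFAULTS are passed through unchanged.
merge_config() {
    awk -F= '
        # First file (existing settings): remember non-empty KEY=value pairs.
        FILENAME == ARGV[1] {
            if ($0 ~ /^[A-Z_]+=/) {
                value = substr($0, length($1) + 2)
                if (value != "") existing[$1] = value
            }
            next
        }
        # Second file (new defaults): substitute the preserved value if there is one.
        /^[A-Z_]+=/ && ($1 in existing) { print $1 "=" existing[$1]; next }
        { print }
    ' "$2" "$1"
}
```

The output can be written to a temporary file and moved into place only after the timestamped backup has been created.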
|
| 72 |
+ |
|
| 73 |
+### Testing Results |
|
| 74 |
+ |
|
| 75 |
+#### ✅ Successful Data Collection |
|
| 76 |
+``` |
|
| 77 |
+[DEBUG] Found model: ST4000VN006-3CW104 |
|
| 78 |
+[DEBUG] Found serial: ZW60K01R |
|
| 79 |
+[DEBUG] SMART param (new format): Raw_Read_Error_Rate = 1176 |
|
| 80 |
+[DEBUG] SMART param (new format): Start_Stop_Count = 2300 |
|
| 81 |
+[DEBUG] Parsed device data - Model: ST4000VN006-3CW104, Serial: ZW60K01R, Temperature: 44, Parameters: 25 |
|
| 82 |
+[DEBUG] Created new hdd_presence record with id=2 for serial=ZW60K01R node=Bogdans-MacBook-Pro |
|
| 83 |
+✓ SMART reading stored (ID: 18, temp: 44°C, type: full) |
|
| 84 |
+``` |
|
| 85 |
+ |
|
| 86 |
+#### ✅ Database Population Confirmed |
|
| 87 |
+```sql |
|
| 88 |
+-- hdd_presence table |
|
| 89 |
+ id | serial_number | node | data_start | data_end | is_current |
|
| 90 |
+----+----------------+---------------------+----------------------------+----------------------------+------------ |
|
| 91 |
+ 1 | S2HSNXRH402205 | Bogdans-MacBook-Pro | 2025-08-16 21:47:13.078524 | 2025-08-16 21:48:23.357763 | t |
|
| 92 |
+ 2 | ZW60K01R | Bogdans-MacBook-Pro | 2025-08-16 21:47:13.873642 | 2025-08-16 21:48:24.204347 | t |
|
| 93 |
+ |
|
| 94 |
+-- smart_readings summary |
|
| 95 |
+ total_readings | unique_devices |
|
| 96 |
+----------------+---------------- |
|
| 97 |
+ 16 | 2 |
|
| 98 |
+``` |
|
| 99 |
+ |
|
| 100 |
+### Configuration Management |
|
| 101 |
+ |
|
| 102 |
+#### ✅ Debug Mode Activation |
|
| 103 |
+```bash |
|
| 104 |
+# Enable debug mode |
|
| 105 |
+AUTOSMART_DEBUG="true" |
|
| 106 |
+ |
|
| 107 |
+# Configuration preserved across deployments |
|
| 108 |
+[INFO] ✓ Preserved existing setting: AUTOSMART_DEBUG="true" |
|
| 109 |
+[INFO] ✓ Configuration merged successfully |
|
| 110 |
+``` |
|
| 111 |
+ |
|
| 112 |
+### Deployment Process |
|
| 113 |
+ |
|
| 114 |
+All fixes deployed successfully using: |
|
| 115 |
+```bash |
|
| 116 |
+./deploy.sh install ebony |
|
| 117 |
+``` |
|
| 118 |
+ |
|
| 119 |
+### Files Modified |
|
| 120 |
+ |
|
| 121 |
+1. **scripts/smart-collector-daemon.pl** |
|
| 122 |
+ - Enhanced debug logging |
|
| 123 |
+ - Fixed SMART parameter parsing regex |
|
| 124 |
+ - Fixed smart_readings INSERT statement |
|
| 125 |
+ - Added comprehensive error handling |
|
| 126 |
+ |
|
| 127 |
+2. **scripts/install.sh** |
|
| 128 |
+ - Implemented configuration preservation |
|
| 129 |
+ - Added backup functionality |
|
| 130 |
+ - Enhanced user setting migration |
|
| 131 |
+ |
|
| 132 |
+3. **sql/schema-fixed.sql** |
|
| 133 |
+ - Added proper sequence permissions |
|
| 134 |
+ |
|
| 135 |
+### Summary |
|
| 136 |
+ |
|
| 137 |
+The autoSMART system now successfully: |
|
| 138 |
+- ✅ Detects and parses SMART data from all device types |
|
| 139 |
+- ✅ Populates hdd_presence table with mobility tracking |
|
| 140 |
+- ✅ Stores complete SMART readings with all metadata |
|
| 141 |
+- ✅ Preserves user configuration across deployments |
|
| 142 |
+- ✅ Provides comprehensive debug logging for troubleshooting |
|
| 143 |
+ |
|
| 144 |
+All identified issues have been resolved and the system is ready for production use across the Madagascar cluster. |
|
@@ -0,0 +1 @@ |
||
| 1 |
+/Users/bogdan/Documents/Workspaces/Xdev/Madagascar |
|
@@ -0,0 +1,19 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "cluster": {
|
|
| 3 |
+ "name": "madagascar", |
|
| 4 |
+ "nodes": [ |
|
| 5 |
+ {
|
|
| 6 |
+ "hostname": "ebony", |
|
| 7 |
+ "ip": "192.168.2.92" |
|
| 8 |
+ }, |
|
| 9 |
+ {
|
|
| 10 |
+ "hostname": "baobab", |
|
| 11 |
+ "ip": "192.168.2.91" |
|
| 12 |
+ }, |
|
| 13 |
+ {
|
|
| 14 |
+ "hostname": "tapia", |
|
| 15 |
+ "ip": "192.168.2.93" |
|
| 16 |
+ } |
|
| 17 |
+ ] |
|
| 18 |
+ } |
|
| 19 |
+} |
|
@@ -0,0 +1,5 @@ |
||
| 1 |
+# AutoSMART Configuration |
|
| 2 |
+# This file is sourced by AutoSMART scripts to set default behavior |
|
| 3 |
+# Debug mode - set to "true" to enable verbose logging |
|
| 4 |
+# When enabled, all AutoSMART operations will produce detailed debug output |
|
| 5 |
+AUTOSMART_DEBUG="false" |
|
@@ -0,0 +1,13 @@ |
||
| 1 |
+[database] |
|
| 2 |
+host = 192.168.2.102 |
|
| 3 |
+port = 5432 |
|
| 4 |
+name = autosmart |
|
| 5 |
+user = autosmart |
|
| 6 |
+password = autoSMART2025! |
|
| 7 |
+ |
|
| 8 |
+[collection] |
|
| 9 |
+interval = 1800 |
|
| 10 |
+timeout = 60 |
|
| 11 |
+ |
|
| 12 |
+[node] |
|
| 13 |
+id = ebony |
|
@@ -0,0 +1,88 @@ |
||
| 1 |
+# autoSMART Cluster Configuration |
|
| 2 |
+# Location: /etc/pve/autoSMART/cluster.conf |
|
| 3 |
+# This file is shared across all Proxmox cluster nodes |
|
| 4 |
+ |
|
| 5 |
+[cluster] |
|
| 6 |
+# Cluster identification |
|
| 7 |
+cluster_name = proxmox-cluster-main |
|
| 8 |
+cluster_id = pve-cluster-001 |
|
| 9 |
+nodes = node91,node92,node93 |
|
| 10 |
+ |
|
| 11 |
+# Database configuration (shared cluster database) |
|
| 12 |
+[database] |
|
| 13 |
+host = 192.168.2.91 |
|
| 14 |
+port = 5432 |
|
| 15 |
+database = autosmart_cluster |
|
| 16 |
+username = autosmart_cluster |
|
| 17 |
+password = cluster_secure_password_here |
|
| 18 |
+connection_timeout = 30 |
|
| 19 |
+pool_size = 10 |
|
| 20 |
+ |
|
| 21 |
+# OpenAI configuration (shared API key) |
|
| 22 |
+[openai] |
|
| 23 |
+api_key = your_cluster_openai_api_key_here |
|
| 24 |
+model = gpt-4 |
|
| 25 |
+max_tokens = 1500 |
|
| 26 |
+temperature = 0.3 |
|
| 27 |
+rate_limit_delay = 2 |
|
| 28 |
+ |
|
| 29 |
+# Madagascar inventory integration |
|
| 30 |
+[madagascar] |
|
| 31 |
+inventory_path = /etc/pve/autoSMART/madagascar_inventory.json |
|
| 32 |
+update_interval = 3600 |
|
| 33 |
+sync_across_nodes = true |
|
| 34 |
+ |
|
| 35 |
+# Cluster-wide SMART monitoring parameters |
|
| 36 |
+[smart_parameters] |
|
| 37 |
+# Critical parameters (high weight for AI analysis) |
|
| 38 |
+Reallocated_Sector_Ct = 1,10.0,true,Critical reallocated sectors |
|
| 39 |
+Reallocated_Event_Count = 1,9.0,true,Reallocation events |
|
| 40 |
+Current_Pending_Sector = 1,9.5,true,Pending sector reallocation |
|
| 41 |
+Offline_Uncorrectable = 1,10.0,true,Uncorrectable sectors |
|
| 42 |
+UDMA_CRC_Error_Count = 10,5.0,true,Communication errors |
|
| 43 |
+Spin_Retry_Count = 1,8.0,true,Spindle motor retries |
|
| 44 |
+ |
|
| 45 |
+# Important parameters (medium weight) |
|
| 46 |
+Raw_Read_Error_Rate = 100000,3.0,true,Raw read errors |
|
| 47 |
+Seek_Error_Rate = 100000,4.0,true,Seek operation errors |
|
| 48 |
+Load_Cycle_Count = 100000,2.0,true,Head load cycles |
|
| 49 |
+Power_On_Hours = 35000,2.0,true,Power-on time |
|
| 50 |
+Temperature_Celsius = 50,3.0,true,Operating temperature |
|
| 51 |
+ |
|
| 52 |
+# Monitoring parameters (low weight) |
|
| 53 |
+Start_Stop_Count = 10000,1.0,true,Start/stop cycles |
|
| 54 |
+Power_Cycle_Count = 10000,1.0,true,Power cycles |
|
| 55 |
+Command_Timeout = 100,2.0,true,Command timeouts |
|
| 56 |
+High_Fly_Writes = 1,4.0,true,Head fly height issues |
|
| 57 |
+Airflow_Temperature_Cel = 45,1.5,true,Airflow temperature |
|
| 58 |
+ |
|
| 59 |
+# Cluster-wide alert settings |
|
| 60 |
+[alerts] |
|
| 61 |
+email_enabled = true |
|
| 62 |
+email_smtp_server = mail.domain.com |
|
| 63 |
+email_smtp_port = 587 |
|
| 64 |
+email_username = autosmart@domain.com |
|
| 65 |
+email_password = email_password_here |
|
| 66 |
+email_recipients = admin@domain.com,ops@domain.com |
|
| 67 |
+email_critical_only = false |
|
| 68 |
+ |
|
| 69 |
+# Risk level alert thresholds |
|
| 70 |
+alert_critical_immediate = true |
|
| 71 |
+alert_high_delay_minutes = 30 |
|
| 72 |
+alert_moderate_delay_hours = 4 |
|
| 73 |
+alert_low_daily_summary = true |
|
| 74 |
+ |
|
| 75 |
+# Data retention (cluster-wide policy) |
|
| 76 |
+[retention] |
|
| 77 |
+smart_readings_days = 365 |
|
| 78 |
+predictions_days = 180 |
|
| 79 |
+alerts_days = 90 |
|
| 80 |
+cleanup_interval_hours = 24 |
|
| 81 |
+ |
|
| 82 |
+# Cluster synchronization |
|
| 83 |
+[synchronization] |
|
| 84 |
+node_discovery_interval = 300 |
|
| 85 |
+health_check_interval = 60 |
|
| 86 |
+failover_enabled = true |
|
| 87 |
+backup_nodes = node92,node93 |
|
| 88 |
+primary_node = node91 |
|
@@ -0,0 +1,30 @@ |
||
| 1 |
+# autoSMART Database Configuration |
|
| 2 |
+# PostgreSQL connection settings |
|
| 3 |
+ |
|
| 4 |
+[database] |
|
| 5 |
+host = localhost |
|
| 6 |
+port = 5432 |
|
| 7 |
+database = autosmart |
|
| 8 |
+username = autosmart_user |
|
| 9 |
+password = secure_password_here |
|
| 10 |
+schema = smart_monitoring |
|
| 11 |
+ |
|
| 12 |
+# Connection pool settings |
|
| 13 |
+max_connections = 20 |
|
| 14 |
+connection_timeout = 30 |
|
| 15 |
+query_timeout = 60 |
|
| 16 |
+ |
|
| 17 |
+# Data retention policies |
|
| 18 |
+retention_raw_data = 365 # days to keep raw SMART readings |
|
| 19 |
+retention_predictions = 180 # days to keep AI predictions |
|
| 20 |
+retention_alerts = 90 # days to keep alert history |
|
| 21 |
+ |
|
| 22 |
+# Backup settings |
|
| 23 |
+backup_enabled = true |
|
| 24 |
+backup_schedule = "0 2 * * *" # Daily at 2 AM |
|
| 25 |
+backup_retention = 30 # days to keep backups |
|
| 26 |
+ |
|
| 27 |
+[performance] |
|
| 28 |
+batch_insert_size = 1000 |
|
| 29 |
+vacuum_schedule = "0 3 * * 0" # Weekly vacuum |
|
| 30 |
+analyze_schedule = "0 4 * * *" # Daily analyze |
|
@@ -0,0 +1,29 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Debug Configuration for ebony |
|
| 4 |
+export AUTOSMART_DEBUG=3 |
|
| 5 |
+export AUTOSMART_NODE_ID="ebony" |
|
| 6 |
+export AUTOSMART_CLUSTER_CONFIG="/etc/pve/autoSMART/config/cluster.conf" |
|
| 7 |
+ |
|
| 8 |
+# Database configuration |
|
| 9 |
+export AUTOSMART_DB_HOST="192.168.2.102" |
|
| 10 |
+export AUTOSMART_DB_USER="autosmart" |
|
| 11 |
+export AUTOSMART_DB_PASS="autoSMART2025!" |
|
| 12 |
+export AUTOSMART_DB_NAME="autosmart" |
|
| 13 |
+export AUTOSMART_DB_PORT="5432" |
|
| 14 |
+ |
|
| 15 |
+# Collection settings |
|
| 16 |
+export SMART_COLLECTION_ENABLED="true" |
|
| 17 |
+export MIGRATION_DETECTION_ENABLED="true" |
|
| 18 |
+export DIFFERENTIAL_STORAGE_ENABLED="true" |
|
| 19 |
+ |
|
| 20 |
+# Debug logging |
|
| 21 |
+export AUTOSMART_LOG_LEVEL="DEBUG" |
|
| 22 |
+export AUTOSMART_LOG_TO_SYSLOG="true" |
|
| 23 |
+ |
|
| 24 |
+echo "autoSMART debug environment configured:" |
|
| 25 |
+echo " Node: $AUTOSMART_NODE_ID" |
|
| 26 |
+echo " Database: $AUTOSMART_DB_HOST:$AUTOSMART_DB_PORT/$AUTOSMART_DB_NAME" |
|
| 27 |
+echo " User: $AUTOSMART_DB_USER" |
|
| 28 |
+echo " Debug Level: $AUTOSMART_DEBUG" |
|
| 29 |
+echo "" |
|
@@ -0,0 +1,107 @@ |
||
| 1 |
+# autoSMART Local Configuration |
|
| 2 |
+# Location: /etc/default/autosmart |
|
| 3 |
+# This file contains node-specific settings and debug flags |
|
| 4 |
+ |
|
| 5 |
+# Node identification |
|
| 6 |
+AUTOSMART_NODE_ID="$(hostname)" |
|
| 7 |
+AUTOSMART_CLUSTER_CONFIG="/etc/pve/autoSMART/cluster.conf" |
|
| 8 |
+ |
|
| 9 |
+# Debug settings |
|
| 10 |
+AUTOSMART_DEBUG_ENABLED=false |
|
| 11 |
+AUTOSMART_DEBUG_LEVEL=1 # 0=none, 1=basic, 2=verbose, 3=trace |
|
| 12 |
+AUTOSMART_DEBUG_LOG_FILE="/var/log/autosmart/debug.log" |
|
| 13 |
+AUTOSMART_DEBUG_MAX_SIZE="100M" |
|
| 14 |
+AUTOSMART_DEBUG_ROTATE_COUNT=5 |
|
| 15 |
+ |
|
| 16 |
+# Local logging |
|
| 17 |
+AUTOSMART_LOG_ENABLED=true |
|
| 18 |
+AUTOSMART_LOG_LEVEL="info" # debug, info, warn, error |
|
| 19 |
+AUTOSMART_LOG_FILE="/var/log/autosmart/autosmart.log" |
|
| 20 |
+AUTOSMART_LOG_SYSLOG=true |
|
| 21 |
+AUTOSMART_LOG_FACILITY="daemon" |
|
| 22 |
+ |
|
| 23 |
+# Collection settings (can override cluster defaults) |
|
| 24 |
+AUTOSMART_COLLECTION_INTERVAL=300 # seconds (5 minutes) |
|
| 25 |
+AUTOSMART_COLLECTION_TIMEOUT=30 # seconds |
|
| 26 |
+AUTOSMART_COLLECTION_RETRIES=3 |
|
| 27 |
+AUTOSMART_COLLECTION_PARALLEL=true |
|
| 28 |
+ |
|
| 29 |
+# Local storage paths |
|
| 30 |
+AUTOSMART_PID_FILE="/var/run/autosmart.pid" |
|
| 31 |
+AUTOSMART_LOCK_FILE="/var/lock/autosmart.lock" |
|
| 32 |
+AUTOSMART_CACHE_DIR="/var/cache/autosmart" |
|
| 33 |
+AUTOSMART_TEMP_DIR="/tmp/autosmart" |
|
| 34 |
+ |
|
| 35 |
+# Process management |
|
| 36 |
+AUTOSMART_DAEMON_USER="autosmart" |
|
| 37 |
+AUTOSMART_DAEMON_GROUP="autosmart" |
|
| 38 |
+AUTOSMART_MAX_MEMORY="256M" |
|
| 39 |
+AUTOSMART_NICE_LEVEL=10 |
|
| 40 |
+ |
|
| 41 |
+# Local device discovery |
|
| 42 |
+AUTOSMART_DEVICE_SCAN_ENABLED=true |
|
| 43 |
+AUTOSMART_DEVICE_SCAN_PATHS="/dev/sd* /dev/nvme*" |
|
| 44 |
+AUTOSMART_DEVICE_EXCLUDE_PATTERNS="loop*,dm-*,sr*" |
|
| 45 |
+AUTOSMART_DEVICE_CACHE_TTL=3600 # seconds |
|
| 46 |
+ |
|
| 47 |
+# Network settings |
|
| 48 |
+AUTOSMART_BIND_ADDRESS="0.0.0.0" |
|
| 49 |
+AUTOSMART_BIND_PORT=0 # 0 = disable local API |
|
| 50 |
+AUTOSMART_CLUSTER_TIMEOUT=10 # seconds |
|
| 51 |
+AUTOSMART_CLUSTER_RETRIES=2 |
|
| 52 |
+ |
|
| 53 |
+# Performance tuning |
|
| 54 |
+AUTOSMART_WORKER_THREADS=4 |
|
| 55 |
+AUTOSMART_QUEUE_SIZE=1000 |
|
| 56 |
+AUTOSMART_BATCH_SIZE=10 |
|
| 57 |
+AUTOSMART_RATE_LIMIT_ENABLED=true |
|
| 58 |
+AUTOSMART_RATE_LIMIT_REQUESTS=60 # per minute |
|
| 59 |
+ |
|
| 60 |
+# Security |
|
| 61 |
+AUTOSMART_SECURE_MODE=true |
|
| 62 |
+AUTOSMART_SSL_VERIFY=true |
|
| 63 |
+AUTOSMART_PERMISSIONS_CHECK=true |
|
| 64 |
+AUTOSMART_CONFIG_VALIDATION=true |
|
| 65 |
+ |
|
| 66 |
+# Emergency settings |
|
| 67 |
+AUTOSMART_EMERGENCY_STOP_FILE="/etc/autosmart/EMERGENCY_STOP" |
|
| 68 |
+AUTOSMART_SAFE_MODE_ENABLED=true |
|
| 69 |
+AUTOSMART_RECOVERY_MODE=false |
|
| 70 |
+ |
|
| 71 |
+# Development/Testing flags (production should be false) |
|
| 72 |
+AUTOSMART_DEVELOPMENT_MODE=false |
|
| 73 |
+AUTOSMART_MOCK_SMARTCTL=false |
|
| 74 |
+AUTOSMART_MOCK_DATABASE=false |
|
| 75 |
+AUTOSMART_MOCK_OPENAI=false |
|
| 76 |
+AUTOSMART_TEST_MODE=false |
|
| 77 |
+ |
|
| 78 |
+# Feature toggles |
|
| 79 |
+AUTOSMART_FEATURE_AI_PREDICTIONS=true |
|
| 80 |
+AUTOSMART_FEATURE_EMAIL_ALERTS=true |
|
| 81 |
+AUTOSMART_FEATURE_CLUSTER_SYNC=true |
|
| 82 |
+AUTOSMART_FEATURE_AUTO_DISCOVERY=true |
|
| 83 |
+AUTOSMART_FEATURE_HEALTH_CHECKS=true |
|
| 84 |
+ |
|
| 85 |
+# Compatibility settings |
|
| 86 |
+AUTOSMART_LEGACY_SUPPORT=false |
|
| 87 |
+AUTOSMART_STRICT_MODE=true |
|
| 88 |
+AUTOSMART_BACKWARD_COMPATIBILITY=false |
|
| 89 |
+ |
|
| 90 |
+# Monitoring and health checks |
|
| 91 |
+AUTOSMART_HEALTH_CHECK_ENABLED=true |
|
| 92 |
+AUTOSMART_HEALTH_CHECK_INTERVAL=60 # seconds |
|
| 93 |
+AUTOSMART_HEALTH_CHECK_TIMEOUT=5 # seconds |
|
| 94 |
+AUTOSMART_METRICS_ENABLED=true |
|
| 95 |
+AUTOSMART_METRICS_PORT=9090 |
|
| 96 |
+ |
|
| 97 |
+# Resource limits |
|
| 98 |
+AUTOSMART_MAX_OPEN_FILES=1024 |
|
| 99 |
+AUTOSMART_MAX_PROCESSES=50 |
|
| 100 |
+AUTOSMART_MEMORY_LIMIT="512M" |
|
| 101 |
+AUTOSMART_CPU_LIMIT=80 # percentage |
|
| 102 |
+ |
|
| 103 |
+# Maintenance |
|
| 104 |
+AUTOSMART_AUTO_CLEANUP=true |
|
| 105 |
+AUTOSMART_CLEANUP_INTERVAL=86400 # daily |
|
| 106 |
+AUTOSMART_VACUUM_DATABASE=true |
|
| 107 |
+AUTOSMART_OPTIMIZE_INTERVAL=604800 # weekly |
|
@@ -0,0 +1,50 @@ |
||
| 1 |
+# autoSMART OpenAI Configuration |
|
| 2 |
+# AI prediction engine settings |
|
| 3 |
+ |
|
| 4 |
+[openai] |
|
| 5 |
+# API Configuration |
|
| 6 |
+api_key = sk-your-openai-api-key-here |
|
| 7 |
+api_endpoint = https://api.openai.com/v1 |
|
| 8 |
+model = gpt-4 |
|
| 9 |
+max_tokens = 2048 |
|
| 10 |
+temperature = 0.1 # Low temperature for consistent predictions |
|
| 11 |
+ |
|
| 12 |
+# Request limits and retry |
|
| 13 |
+max_requests_per_hour = 100 |
|
| 14 |
+retry_attempts = 3 |
|
| 15 |
+retry_delay = 5 # seconds between retries |
|
| 16 |
+request_timeout = 60 # seconds |
|
| 17 |
+ |
|
| 18 |
+[prediction] |
|
| 19 |
+# Prediction parameters |
|
| 20 |
+prediction_window_days = 30 # Predict failures within 30 days |
|
| 21 |
+confidence_threshold = 0.7 # Minimum confidence for alerts |
|
| 22 |
+historical_data_days = 90 # Use 90 days of historical data |
|
| 23 |
+minimum_readings = 10 # Minimum readings before prediction |
|
| 24 |
+ |
|
| 25 |
+# AI prompt configuration |
|
| 26 |
+system_prompt = "You are an expert HDD failure prediction system. Analyze SMART data and provide failure probability with reasoning." |
|
| 27 |
+include_context = true # Include disk model, age, environment |
|
| 28 |
+include_trends = true # Include trend analysis in prompts |
|
| 29 |
+ |
|
| 30 |
+[analysis] |
|
| 31 |
+# Analysis frequency |
|
| 32 |
+full_analysis_hours = 24 # Full AI analysis every 24 hours |
|
| 33 |
+quick_check_hours = 6 # Quick check every 6 hours |
|
| 34 |
+emergency_check_minutes = 30 # Emergency analysis for critical values |
|
| 35 |
+ |
|
| 36 |
+# Batch processing |
|
| 37 |
+batch_size = 10 # Analyze 10 disks per batch |
|
| 38 |
+batch_delay = 2 # seconds between batch requests |
|
| 39 |
+ |
|
| 40 |
+[features] |
|
| 41 |
+# Feature engineering for AI |
|
| 42 |
+enable_trend_analysis = true |
|
| 43 |
+enable_anomaly_detection = true |
|
| 44 |
+enable_correlation_analysis = true |
|
| 45 |
+enable_environmental_factors = true |
|
| 46 |
+ |
|
| 47 |
+# Advanced features |
|
| 48 |
+enable_model_specific_analysis = true # Different analysis per HDD model |
|
| 49 |
+enable_failure_clustering = true # Group similar failure patterns |
|
| 50 |
+enable_seasonal_adjustment = true # Account for seasonal temperature changes |
|
@@ -0,0 +1,57 @@ |
||
| 1 |
+# autoSMART SMART Parameters Configuration |
|
| 2 |
+# Defines which SMART parameters to monitor and their thresholds |
|
| 3 |
+ |
|
| 4 |
+[monitoring] |
|
| 5 |
+# Collection interval in seconds |
|
| 6 |
+collection_interval = 300 # 5 minutes |
|
| 7 |
+collection_timeout = 30 # 30 seconds timeout per disk |
|
| 8 |
+ |
|
| 9 |
+# Madagascar integration |
|
| 10 |
+madagascar_inventory_file = /etc/madagascar/disk_inventory.json |
|
| 11 |
+madagascar_api_endpoint = http://madagascar.local/api/v1/disks |
|
| 12 |
+ |
|
| 13 |
+[smart_parameters] |
|
| 14 |
+# Format: parameter_name = threshold,weight,enabled,description |
|
| 15 |
+ |
|
| 16 |
+# Critical parameters (high weight, immediate attention) |
|
| 17 |
+Raw_Read_Error_Rate = 100000,0.9,true,"Raw read error rate from disk surface" |
|
| 18 |
+Reallocated_Sector_Ct = 5,0.95,true,"Count of reallocated sectors" |
|
| 19 |
+Current_Pending_Sector = 1,0.9,true,"Count of sectors waiting for reallocation" |
|
| 20 |
+Offline_Uncorrectable = 1,0.95,true,"Count of uncorrectable sectors" |
|
| 21 |
+UDMA_CRC_Error_Count = 100,0.7,true,"Count of UDMA CRC errors" |
|
| 22 |
+ |
|
| 23 |
+# Important parameters (medium weight) |
|
| 24 |
+Spin_Retry_Count = 3,0.8,true,"Count of spin-up retry attempts" |
|
| 25 |
+End-to-End_Error = 1,0.8,true,"End-to-end error detection count" |
|
| 26 |
+Reported_Uncorrect = 1,0.85,true,"Count of uncorrectable errors reported" |
|
| 27 |
+High_Fly_Writes = 1,0.7,true,"Count of high fly write operations" |
|
| 28 |
+Airflow_Temperature_Cel = 50,0.6,true,"Temperature of airflow in Celsius" |
|
| 29 |
+ |
|
| 30 |
+# Monitoring parameters (lower weight, trending) |
|
| 31 |
+Temperature_Celsius = 55,0.6,true,"Drive temperature in Celsius" |
|
| 32 |
+Power_On_Hours = 43800,0.4,true,"Total power-on hours (5 years)" |
|
| 33 |
+Load_Cycle_Count = 300000,0.5,true,"Count of load/unload cycles" |
|
| 34 |
+Start_Stop_Count = 10000,0.4,true,"Count of start/stop cycles" |
|
| 35 |
+Power_Cycle_Count = 10000,0.4,true,"Count of power-on cycles" |
|
| 36 |
+ |
|
| 37 |
+# Performance parameters (informational) |
|
| 38 |
+Seek_Error_Rate = 100000,0.3,true,"Rate of seek errors" |
|
| 39 |
+Throughput_Performance = 80,0.3,true,"Overall throughput performance" |
|
| 40 |
+Spin_Up_Time = 10000,0.4,true,"Time required to spin up" |
|
| 41 |
+ |
|
| 42 |
+[thresholds] |
|
| 43 |
+# Global threshold multipliers |
|
| 44 |
+temperature_warning = 0.9 # Warning at 90% of threshold |
|
| 45 |
+temperature_critical = 1.0 # Critical at 100% of threshold |
|
| 46 |
+sector_warning = 0.5 # Warning at 50% of threshold |
|
| 47 |
+sector_critical = 1.0 # Critical at 100% of threshold |
|
| 48 |
+ |
|
| 49 |
+# Trend analysis |
|
| 50 |
+trend_window_hours = 168 # 7 days for trend analysis |
|
| 51 |
+trend_deviation_threshold = 2.0 # Standard deviations for anomaly |
|
| 52 |
+ |
|
| 53 |
+[exclusions] |
|
| 54 |
+# Disk models/serials to exclude from monitoring |
|
| 55 |
+exclude_models = "Virtual,QEMU,VMware" |
|
| 56 |
+exclude_serials = "" |
|
| 57 |
+exclude_by_size_gb = 8 # Exclude disks smaller than 8GB |
|
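The `[smart_parameters]` comment in this file gives the entry format as `parameter_name = threshold,weight,enabled,description`. As a minimal sketch of how one such line could be split in shell (the function name `parse_smart_param` and its pipe-separated output are illustrative assumptions, not part of autoSMART):

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of autoSMART): split one
# "name = threshold,weight,enabled,description" line into its fields.
parse_smart_param() {
    local line="$1"
    local name value rest threshold weight enabled description

    name="${line%%=*}"     # text before the first '='
    value="${line#*=}"     # text after the first '='

    # Trim surrounding whitespace from both halves
    name="$(printf '%s' "$name" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
    value="$(printf '%s' "$value" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"

    # Peel off the first three comma-separated fields; the remainder
    # (which may itself contain commas) is the description.
    threshold="${value%%,*}"; rest="${value#*,}"
    weight="${rest%%,*}";     rest="${rest#*,}"
    enabled="${rest%%,*}";    description="${rest#*,}"

    # Drop optional double quotes around the description
    description="${description%\"}"
    description="${description#\"}"

    printf '%s|%s|%s|%s|%s\n' "$name" "$threshold" "$weight" "$enabled" "$description"
}

parse_smart_param 'Temperature_Celsius = 55,0.6,true,"Drive temperature in Celsius"'
# → Temperature_Celsius|55|0.6|true|Drive temperature in Celsius
```

Only the first three commas are treated as separators, so a description that contains commas survives intact. The same function handles the unquoted descriptions used in `cluster.conf`.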
@@ -0,0 +1,489 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Cluster Deployment Script |
|
| 4 |
+# Version: 1.0 |
|
| 5 |
+# Description: Complete cluster deployment and node installation for autoSMART |
|
| 6 |
+ |
|
| 7 |
+set -e |
|
| 8 |
+ |
|
| 9 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 10 |
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" |
|
| 11 |
+INSTALL_DIR="/opt/autoSMART" |
|
| 12 |
+CONFIG_DIR="/etc/autosmart" |
|
| 13 |
+SERVICE_NAME="autosmart" |
|
| 14 |
+ |
|
| 15 |
+# Default configuration |
|
| 16 |
+DB_HOST="${DB_HOST:-192.168.2.102}"
|
|
| 17 |
+DB_USER="${DB_USER:-autosmart}"
|
|
| 18 |
+DB_PASS="${DB_PASS:-autoSMART2025!}"
|
|
| 19 |
+DB_NAME="${DB_NAME:-autosmart}"
|
|
| 20 |
+ |
|
| 21 |
+# Node configuration |
|
| 22 |
+NODE_ID="${NODE_ID:-$(hostname -s)}"
|
|
| 23 |
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
|
|
| 24 |
+ |
|
| 25 |
+# Operation modes |
|
| 26 |
+FORCE_REINSTALL=false |
|
| 27 |
+CONFIG_ONLY=false |
|
| 28 |
+DATABASE_MODE=false |
|
| 29 |
+ |
|
| 30 |
+# Colors for output |
|
| 31 |
+RED='\033[0;31m' |
|
| 32 |
+GREEN='\033[0;32m' |
|
| 33 |
+YELLOW='\033[1;33m' |
|
| 34 |
+BLUE='\033[0;34m' |
|
| 35 |
+NC='\033[0m' # No Color |
|
| 36 |
+ |
|
| 37 |
+log_info() {
|
|
| 38 |
+ echo -e "${BLUE}[INFO]${NC} $1"
|
|
| 39 |
+} |
|
| 40 |
+ |
|
| 41 |
+log_success() {
|
|
| 42 |
+ echo -e "${GREEN}[SUCCESS]${NC} $1"
|
|
| 43 |
+} |
|
| 44 |
+ |
|
| 45 |
+log_warning() {
|
|
| 46 |
+ echo -e "${YELLOW}[WARNING]${NC} $1"
|
|
| 47 |
+} |
|
| 48 |
+ |
|
| 49 |
+log_error() {
|
|
| 50 |
+ echo -e "${RED}[ERROR]${NC} $1"
|
|
| 51 |
+} |
|
| 52 |
+ |
|
| 53 |
+show_usage() {
|
|
| 54 |
+ echo "autoSMART Cluster Deployment Script v1.0" |
|
| 55 |
+ echo "=========================================" |
|
| 56 |
+ echo "" |
|
| 57 |
+ echo "Usage: $0 [COMMAND] [IP_ADDRESS] [OPTIONS]" |
|
| 58 |
+ echo "" |
|
| 59 |
+ echo "Commands:" |
|
| 60 |
+ echo " install [IP] Install autoSMART (local or remote node)" |
|
| 61 |
+ echo " install database Install database schema remotely using psql" |
|
| 62 |
+ echo " uninstall [IP] Remove autoSMART (local or remote node)" |
|
| 63 |
+ echo " status [IP] Show autoSMART status (local or remote node)" |
|
| 64 |
+ echo "" |
|
| 65 |
+ echo "Cluster Options:" |
|
| 66 |
+ echo " --cluster Execute command on entire cluster" |
|
| 67 |
+ echo "" |
|
| 68 |
+ echo "Database Options (for 'install database'):" |
|
| 69 |
+ echo " --db-host HOST Database host (default: 192.168.2.102)" |
|
| 70 |
+ echo " --db-user USER Database user (default: autosmart)" |
|
| 71 |
+ echo " --db-pass PASS Database password (default: autoSMART2025!)" |
|
| 72 |
+ echo " --db-name NAME Database name (default: autosmart)" |
|
| 73 |
+ echo "" |
|
| 74 |
+ echo "Examples:" |
|
| 75 |
+ echo " $0 install <node> # Install on a node (name or IP from cluster.json)" |
|
| 76 |
+ echo " $0 install database # Install database schema" |
|
| 77 |
+ echo " $0 status <node> # Check status on a node (name or IP from cluster.json)" |
|
| 78 |
+ echo " $0 install --cluster # Install on entire cluster" |
|
| 79 |
+ echo " $0 status --cluster # Check status on all nodes" |
|
| 80 |
+} |
|
| 81 |
+ |
|
| 82 |
+parse_arguments() {
|
|
| 83 |
+ COMMAND="" |
|
| 84 |
+ TARGET_IP="" |
|
| 85 |
+ CLUSTER_MODE=false |
|
| 86 |
+ DATABASE_MODE=false |
|
| 87 |
+ |
|
| 88 |
+ # If no arguments provided, show help |
|
| 89 |
+ if [[ $# -eq 0 ]]; then |
|
| 90 |
+ show_usage |
|
| 91 |
+ exit 0 |
|
| 92 |
+ fi |
|
| 93 |
+ |
|
| 94 |
+ while [[ $# -gt 0 ]]; do |
|
| 95 |
+ case $1 in |
|
| 96 |
+ install|uninstall|status) |
|
| 97 |
+ COMMAND="$1" |
|
| 98 |
+ shift |
|
| 99 |
+ ;; |
|
| 100 |
+ database) |
|
| 101 |
+ if [[ "$COMMAND" == "install" ]]; then |
|
| 102 |
+ DATABASE_MODE=true |
|
| 103 |
+ shift |
|
| 104 |
+ else |
|
| 105 |
+ log_error "database can only be used with install command" |
|
| 106 |
+ exit 1 |
|
| 107 |
+ fi |
|
| 108 |
+ ;; |
|
| 109 |
+ --help) |
|
| 110 |
+ show_usage |
|
| 111 |
+ exit 0 |
|
| 112 |
+ ;; |
|
| 113 |
+ --cluster) |
|
| 114 |
+ CLUSTER_MODE=true |
|
| 115 |
+ shift |
|
| 116 |
+ ;; |
|
| 117 |
+ --db-host) |
|
| 118 |
+ DB_HOST="$2" |
|
| 119 |
+ shift 2 |
|
| 120 |
+ ;; |
|
| 121 |
+ --db-user) |
|
| 122 |
+ DB_USER="$2" |
|
| 123 |
+ shift 2 |
|
| 124 |
+ ;; |
|
| 125 |
+ --db-pass) |
|
| 126 |
+ DB_PASS="$2" |
|
| 127 |
+ shift 2 |
|
| 128 |
+ ;; |
|
| 129 |
+ --db-name) |
|
| 130 |
+ DB_NAME="$2" |
|
| 131 |
+ shift 2 |
|
| 132 |
+ ;; |
|
| 133 |
+ --*) |
|
| 134 |
+ log_error "Unknown option: $1" |
|
| 135 |
+ exit 1 |
|
| 136 |
+ ;; |
|
| 137 |
+ *) |
|
| 138 |
+ if [[ $1 =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
|
|
| 139 |
+ TARGET_IP="$1" |
|
| 140 |
+ shift |
|
| 141 |
+ else |
|
| 142 |
+ # Try to resolve node name from cluster.json |
|
| 143 |
+ local cluster_config="$SCRIPT_DIR/cluster.json" |
|
| 144 |
+ if [[ -f "$cluster_config" ]] && command -v jq &> /dev/null; then |
|
| 145 |
+ local resolved_ip=$(jq -r --arg name "$1" '.cluster.nodes[] | select(.hostname==$name) | .ip' "$cluster_config") |
|
| 146 |
+ if [[ -n "$resolved_ip" && "$resolved_ip" != "null" ]]; then |
|
| 147 |
+ TARGET_IP="$resolved_ip" |
|
| 148 |
+ shift |
|
| 149 |
+ else |
|
| 150 |
+ log_error "Unknown argument: $1 (not an IP or known node name)" |
|
| 151 |
+ exit 1 |
|
| 152 |
+ fi |
|
| 153 |
+ else |
|
| 154 |
+ log_error "Unknown argument: $1" |
|
| 155 |
+ exit 1 |
|
| 156 |
+ fi |
|
| 157 |
+ fi |
|
| 158 |
+ ;; |
|
| 159 |
+ esac |
|
| 160 |
+ done |
|
| 161 |
+ |
|
| 162 |
+ # Validate that a command was provided |
|
| 163 |
+ if [[ -z "$COMMAND" ]]; then |
|
| 164 |
+ log_error "No command specified" |
|
| 165 |
+ show_usage |
|
| 166 |
+ exit 1 |
|
| 167 |
+ fi |
|
| 168 |
+ |
|
| 169 |
+ if [[ "$CLUSTER_MODE" == true ]]; then |
|
| 170 |
+ TARGET_IP="" |
|
| 171 |
+ fi |
|
| 172 |
+} |
|
| 173 |
+ |
|
| 174 |
+show_header() {
|
|
| 175 |
+ log_info "autoSMART Cluster Deployment v1.0" |
|
| 176 |
+ log_info "=====================================" |
|
| 177 |
+ log_info "Hardware-based HDD tracking with differential storage" |
|
| 178 |
+ log_info "" |
|
| 179 |
+ log_info "Operation: $COMMAND" |
|
| 180 |
+ |
|
| 181 |
+ if [[ "$CLUSTER_MODE" == true ]]; then |
|
| 182 |
+ log_info "Target: Entire cluster (nodes from cluster.json)" |
|
| 183 |
+ elif [[ -n "$TARGET_IP" ]]; then |
|
| 184 |
+ log_info "Target: Remote node ($TARGET_IP)" |
|
| 185 |
+ else |
|
| 186 |
+ log_info "Target: Current node ($(hostname -s))" |
|
| 187 |
+ fi |
|
| 188 |
+ |
|
| 189 |
+ log_info "Database: $DB_HOST:5432/$DB_NAME" |
|
| 190 |
+ log_info "" |
|
| 191 |
+} |
|
| 192 |
+ |
|
| 193 |
+handle_database_deployment() {
|
|
| 194 |
+ log_info "💾 Installing autoSMART Database Schema" |
|
| 195 |
+ log_info "=======================================" |
|
| 196 |
+ log_info "Target Database: $DB_HOST:5432/$DB_NAME" |
|
| 197 |
+ log_info "Database User: $DB_USER" |
|
| 198 |
+ log_info "" |
|
| 199 |
+ |
|
| 200 |
+ # Check if psql is available |
|
| 201 |
+ if ! command -v psql &> /dev/null; then |
|
| 202 |
+ log_error "psql client not found. Please install PostgreSQL client:" |
|
| 203 |
+ log_error " macOS: brew install postgresql" |
|
| 204 |
+ log_error " Ubuntu: sudo apt install postgresql-client" |
|
| 205 |
+ log_error " CentOS: sudo dnf install postgresql" |
|
| 206 |
+ return 1 |
|
| 207 |
+ fi |
|
| 208 |
+ |
|
| 209 |
+ # Test database connection |
|
| 210 |
+ log_info "🔗 Testing database connection..." |
|
| 211 |
+ local psql_cmd="psql -h $DB_HOST -U $DB_USER -d $DB_NAME" |
|
| 212 |
+ if [[ -n "$DB_PASS" ]]; then |
|
| 213 |
+ export PGPASSWORD="$DB_PASS" |
|
| 214 |
+ fi |
|
| 215 |
+ |
|
| 216 |
+ if ! $psql_cmd -c "SELECT version();" >/dev/null 2>&1; then |
|
| 217 |
+ log_error "Cannot connect to database $DB_HOST:5432/$DB_NAME" |
|
| 218 |
+ log_error "Please check:" |
|
| 219 |
+ log_error " • Database server is running" |
|
| 220 |
+ log_error " • Database '$DB_NAME' exists" |
|
| 221 |
+ log_error " • User '$DB_USER' has proper permissions" |
|
| 222 |
+ log_error " • Network connectivity to $DB_HOST" |
|
| 223 |
+ return 1 |
|
| 224 |
+ fi |
|
| 225 |
+ |
|
| 226 |
+ log_success "✅ Database connection successful" |
|
| 227 |
+ |
|
| 228 |
+ # Check schema files |
|
| 229 |
+ if [[ ! -f "$SCRIPT_DIR/sql/schema.sql" ]]; then |
|
| 230 |
+ log_error "Schema file not found: $SCRIPT_DIR/sql/schema.sql" |
|
| 231 |
+ return 1 |
|
| 232 |
+ fi |
|
| 233 |
+ |
|
| 234 |
+ # Install schema |
|
| 235 |
+ log_info "📊 Installing database schema..." |
|
| 236 |
+ if ! $psql_cmd -f "$SCRIPT_DIR/sql/schema.sql" >/dev/null 2>&1; then |
|
| 237 |
+ log_error "Failed to install database schema" |
|
| 238 |
+ log_error "Check for conflicts or permission issues" |
|
| 239 |
+ return 1 |
|
| 240 |
+ fi |
|
| 241 |
+ |
|
| 242 |
+ log_success "✅ Database schema installed" |
|
| 243 |
+ |
|
| 244 |
+ # Verify installation |
|
| 245 |
+ log_info "🔍 Verifying schema installation..." |
|
| 246 |
+ local table_count=$($psql_cmd -t -c " |
|
| 247 |
+ SELECT COUNT(*) FROM information_schema.tables |
|
| 248 |
+ WHERE table_schema = 'public' AND (table_name LIKE '%smart%' OR table_name LIKE '%hdd%'); |
|
| 249 |
+ " 2>/dev/null | tr -d ' ') |
|
| 250 |
+ |
|
| 251 |
+ if [[ "$table_count" -lt 3 ]]; then |
|
| 252 |
+ log_error "Schema verification failed. Expected tables not found." |
|
| 253 |
+ return 1 |
|
| 254 |
+ fi |
|
| 255 |
+ |
|
| 256 |
+ log_success "✅ Schema verification passed ($table_count tables found)" |
|
| 257 |
+ |
|
| 258 |
+ # Show installed components |
|
| 259 |
+ log_info "📋 Database Installation Summary:" |
|
| 260 |
+ $psql_cmd -c " |
|
| 261 |
+ SELECT |
|
| 262 |
+ 'Table' as type, |
|
| 263 |
+ table_name as name, |
|
| 264 |
+ pg_size_pretty(pg_total_relation_size('public.'||table_name)) as size
|
|
| 265 |
+ FROM information_schema.tables |
|
| 266 |
+ WHERE table_schema = 'public' AND table_type = 'BASE TABLE' |
|
| 267 |
+ UNION ALL |
|
| 268 |
+ SELECT |
|
| 269 |
+ 'View' as type, |
|
| 270 |
+ viewname as name, |
|
| 271 |
+ 'N/A' as size |
|
| 272 |
+ FROM pg_views |
|
| 273 |
+ WHERE schemaname = 'public' |
|
| 274 |
+ ORDER BY type, name; |
|
| 275 |
+ " 2>/dev/null || true |
|
| 276 |
+ |
|
| 277 |
+ log_success "✅ autoSMART database deployment completed successfully!" |
|
| 278 |
+ log_info "" |
|
| 279 |
+ log_info "🚀 Next Steps:" |
|
| 280 |
+ log_info " 1. Deploy nodes: ./deploy.sh install <node>" |
|
| 281 |
+ log_info " 2. Configure clusters in config files" |
|
| 282 |
+ log_info " 3. Start collecting SMART data" |
|
| 283 |
+ log_info "" |
|
| 284 |
+ |
|
| 285 |
+ return 0 |
|
| 286 |
+} |
|
| 287 |
+ |
|
| 288 |
+handle_remote_deployment() {
|
|
| 289 |
+ local target_ip="$1" |
|
| 290 |
+ local command="$2" |
|
| 291 |
+ |
|
| 292 |
+ # Determine the correct node name from cluster.json |
|
| 293 |
+ local node_name="" |
|
| 294 |
+ local cluster_config="$SCRIPT_DIR/cluster.json" |
|
| 295 |
+ if [[ -f "$cluster_config" ]] && command -v jq &> /dev/null; then |
|
| 296 |
+ node_name=$(jq -r --arg ip "$target_ip" '.cluster.nodes[] | select(.ip==$ip) | .hostname' "$cluster_config") |
|
| 297 |
+ if [[ -z "$node_name" || "$node_name" == "null" ]]; then |
|
| 298 |
+ # Fallback: try to get hostname from target machine |
|
| 299 |
+ node_name=$(ssh -o ConnectTimeout=5 "root@$target_ip" "hostname -s" 2>/dev/null || echo "unknown-node") |
|
| 300 |
+ fi |
|
| 301 |
+ else |
|
| 302 |
+ # Fallback: try to get hostname from target machine |
|
| 303 |
+ node_name=$(ssh -o ConnectTimeout=5 "root@$target_ip" "hostname -s" 2>/dev/null || echo "unknown-node") |
|
| 304 |
+ fi |
|
| 305 |
+ |
|
| 306 |
+ log_info "🌐 Remote deployment to $target_ip (node: $node_name)" |
|
| 307 |
+ |
|
| 308 |
+ # Test connectivity |
|
| 309 |
+ log_info "🔍 Testing connectivity to $target_ip..." |
|
| 310 |
+ if ! ping -c 1 -W 5 "$target_ip" >/dev/null 2>&1; then |
|
| 311 |
+ log_error "Cannot reach $target_ip (ping failed)" |
|
| 312 |
+ return 1 |
|
| 313 |
+ fi |
|
| 314 |
+ |
|
| 315 |
+ # Test SSH |
|
| 316 |
+ log_info "🔐 Testing SSH access to $target_ip..." |
|
| 317 |
+ if ! ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no "root@$target_ip" true 2>/dev/null; then |
|
| 318 |
+ log_error "Cannot connect to $target_ip via SSH" |
|
| 319 |
+ log_info "Setup SSH keys: ssh-copy-id root@$target_ip" |
|
| 320 |
+ return 1 |
|
| 321 |
+ fi |
|
| 322 |
+ |
|
| 323 |
+ log_success "✅ SSH connection to $target_ip successful" |
|
| 324 |
+ |
|
| 325 |
+ # Create temp directory |
|
| 326 |
+ local remote_temp="/tmp/autosmart-deploy-$(date +%s)" |
|
| 327 |
+ log_info "📁 Creating remote directory: $remote_temp" |
|
| 328 |
+ ssh "root@$target_ip" "mkdir -p $remote_temp" |
|
| 329 |
+ |
|
| 330 |
+ # Copy files |
|
| 331 |
+ log_info "📦 Syncing project files to $target_ip..." |
|
| 332 |
+ if ! rsync -avz --progress \ |
|
| 333 |
+ --exclude-from="$SCRIPT_DIR/.deployignore" \ |
|
| 334 |
+ --include='docs/' \ |
|
| 335 |
+ --include='docs/*.md' \ |
|
| 336 |
+ --exclude='.git*' \ |
|
| 337 |
+ --exclude='*.md' \ |
|
| 338 |
+ --exclude='deploy.sh' \ |
|
| 339 |
+ "$SCRIPT_DIR/" "root@$target_ip:$remote_temp/"; then |
|
| 340 |
+ log_error "Failed to sync files to $target_ip" |
|
| 341 |
+ return 1 |
|
| 342 |
+ fi |
|
| 343 |
+ |
|
| 344 |
+ # Execute install.sh |
|
| 345 |
+ log_info "🚀 Executing $command on $target_ip..." |
|
| 346 |
+ |
|
| 347 |
+ local install_args="$command --node-id $node_name --db-host $DB_HOST" |
|
| 348 |
+ |
|
| 349 |
+ if ssh "root@$target_ip" "cd $remote_temp/scripts && bash install.sh $install_args"; then |
|
| 350 |
+ log_success "✅ $command completed successfully on $target_ip" |
|
| 351 |
+ ssh "root@$target_ip" "rm -rf $remote_temp" |
|
| 352 |
+ return 0 |
|
| 353 |
+ else |
|
| 354 |
+ log_error "❌ $command failed on $target_ip" |
|
| 355 |
+ return 1 |
|
| 356 |
+ fi |
|
| 357 |
+} |
|
| 358 |
+ |
|
| 359 |
+handle_status() {
|
|
| 360 |
+ local target_ip="$1" |
|
| 361 |
+ |
|
| 362 |
+ if [[ -n "$target_ip" ]]; then |
|
| 363 |
+ log_info "📊 Checking autoSMART status on $target_ip" |
|
| 364 |
+ ssh "root@$target_ip" "systemctl status autosmart --no-pager" |
|
| 365 |
+ else |
|
| 366 |
+ log_info "📊 Checking autoSMART status on current node" |
|
| 367 |
+ if command -v systemctl >/dev/null 2>&1; then |
|
| 368 |
+ systemctl status autosmart --no-pager |
|
| 369 |
+ else |
|
| 370 |
+ log_error "systemctl not available" |
|
| 371 |
+ return 1 |
|
| 372 |
+ fi |
|
| 373 |
+ fi |
|
| 374 |
+} |
|
| 375 |
+ |
|
| 376 |
+handle_cluster_operation() {
|
|
| 377 |
+ local command="$1" |
|
| 378 |
+ |
|
| 379 |
+ log_info "🚀 Executing $command on cluster..." |
|
| 380 |
+ |
|
| 381 |
+ # Check if cluster.json exists |
|
| 382 |
+ local cluster_config="$SCRIPT_DIR/cluster.json" |
|
| 383 |
+ if [[ ! -f "$cluster_config" ]]; then |
|
| 384 |
+ log_error "Cluster configuration not found: $cluster_config" |
|
| 385 |
+ return 1 |
|
| 386 |
+ fi |
|
| 387 |
+ |
|
| 388 |
+ # Check if jq is available for JSON parsing |
|
| 389 |
+ if ! command -v jq &> /dev/null; then |
|
| 390 |
+ log_error "jq is required for cluster operations" |
|
| 391 |
+ return 1 |
|
| 392 |
+ fi |
|
| 393 |
+ |
|
| 394 |
+ # Parse cluster configuration |
|
| 395 |
+ local cluster_name=$(jq -r '.cluster.name' "$cluster_config") |
|
| 396 |
+ local total_nodes=$(jq -r '.cluster.nodes | length' "$cluster_config") |
|
| 397 |
+ |
|
| 398 |
+ log_info "Cluster: $cluster_name ($total_nodes nodes)" |
|
| 399 |
+ log_info "" |
|
| 400 |
+ |
|
| 401 |
+ local success_count=0 |
|
| 402 |
+ local failed_nodes=() |
|
| 403 |
+ |
|
| 404 |
+ # Process nodes |
|
| 405 |
+ while IFS= read -r node_data; do |
|
| 406 |
+ local node_hostname=$(echo "$node_data" | jq -r '.hostname') |
|
| 407 |
+ local node_ip=$(echo "$node_data" | jq -r '.ip') |
|
| 408 |
+ |
|
| 409 |
+ log_info "🔧 Processing node: $node_hostname ($node_ip)" |
|
| 410 |
+ |
|
| 411 |
+ if handle_remote_deployment "$node_ip" "$command"; then |
|
| 412 |
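+ success_count=$((success_count + 1)) # ((success_count++)) returns status 1 when the count is 0, which aborts the script under set -e |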
|
| 413 |
+ log_success "✅ $node_hostname completed successfully" |
|
| 414 |
+ else |
|
| 415 |
+ log_error "❌ $node_hostname failed" |
|
| 416 |
+ failed_nodes+=("$node_hostname")
|
|
| 417 |
+ fi |
|
| 418 |
+ |
|
| 419 |
+ sleep 2 |
|
| 420 |
+ log_info "" |
|
| 421 |
+ done < <(jq -c '.cluster.nodes[]' "$cluster_config") |
|
| 422 |
+ |
|
| 423 |
+ # Summary |
|
| 424 |
+ log_info "📊 Cluster Summary:" |
|
| 425 |
+ log_info " • Successful: $success_count/$total_nodes" |
|
| 426 |
+ |
|
| 427 |
+ if [[ ${#failed_nodes[@]} -gt 0 ]]; then
|
|
| 428 |
+ log_error " • Failed nodes: ${failed_nodes[*]}"
|
|
| 429 |
+ fi |
|
| 430 |
+ |
|
| 431 |
+ if [[ $success_count -eq $total_nodes ]]; then |
|
| 432 |
+ log_success "🎉 All nodes processed successfully!" |
|
| 433 |
+ return 0 |
|
| 434 |
+ else |
|
| 435 |
+ log_error "❌ Some nodes failed" |
|
| 436 |
+ return 1 |
|
| 437 |
+ fi |
|
| 438 |
+} |
|
| 439 |
+ |
|
| 440 |
+# Main execution |
|
| 441 |
+main() {
|
|
| 442 |
+ parse_arguments "$@" |
|
| 443 |
+ show_header |
|
| 444 |
+ |
|
| 445 |
+ # Handle database deployment mode |
|
| 446 |
+ if [[ "$DATABASE_MODE" == true ]]; then |
|
| 447 |
+ handle_database_deployment |
|
| 448 |
+ exit $? |
|
| 449 |
+ fi |
|
| 450 |
+ |
|
| 451 |
+ if [[ "$CLUSTER_MODE" == true ]]; then |
|
| 452 |
+ handle_cluster_operation "$COMMAND" |
|
| 453 |
+ exit $? |
|
| 454 |
+ elif [[ -n "$TARGET_IP" ]]; then |
|
| 455 |
+ if [[ "$COMMAND" == "status" ]]; then |
|
| 456 |
+ handle_status "$TARGET_IP" |
|
| 457 |
+ else |
|
| 458 |
+ handle_remote_deployment "$TARGET_IP" "$COMMAND" |
|
| 459 |
+ fi |
|
| 460 |
+ exit $? |
|
| 461 |
+ fi |
|
| 462 |
+ |
|
| 463 |
+ # Local execution |
|
| 464 |
+ case "$COMMAND" in |
|
| 465 |
+ status) |
|
| 466 |
+ handle_status |
|
| 467 |
+ ;; |
|
| 468 |
+ install|uninstall) |
|
| 469 |
+ if [[ "$(uname)" == "Darwin" ]]; then |
|
| 470 |
+ log_error "Cannot install autoSMART on macOS development machine" |
|
| 471 |
+ log_info "Deploy to target nodes instead:" |
|
| 472 |
+ log_info " ./deploy.sh install <node> # Deploy to node from cluster.json" |
|
| 473 |
+ log_info " ./deploy.sh install --cluster # Deploy to all nodes" |
|
| 474 |
+ exit 1 |
|
| 475 |
+ fi |
|
| 476 |
+ |
|
| 477 |
+ log_info "🚀 Local deployment mode" |
|
| 478 |
+ sudo bash "$SCRIPT_DIR/scripts/install.sh" "$COMMAND" --node-id "$NODE_ID" |
|
| 479 |
+ ;; |
|
| 480 |
+ *) |
|
| 481 |
+ log_error "Unknown command: $COMMAND" |
|
| 482 |
+ show_usage |
|
| 483 |
+ exit 1 |
|
| 484 |
+ ;; |
|
| 485 |
+ esac |
|
| 486 |
+} |
|
| 487 |
+ |
|
| 488 |
+# Run main |
|
| 489 |
+main "$@" |
|
@@ -0,0 +1,439 @@ |
||
| 1 |
+# autoSMART API Reference |
|
| 2 |
+ |
|
| 3 |
+## 🔌 OpenAI API Integration |
|
| 4 |
+ |
|
| 5 |
+### Overview |
|
| 6 |
+ |
|
| 7 |
+autoSMART integrates with OpenAI's GPT models to provide intelligent HDD failure predictions based on SMART data analysis. This document covers the API integration, prompt engineering, and response processing. |
|
| 8 |
+ |
|
| 9 |
+### Configuration |
|
| 10 |
+ |
|
| 11 |
+#### Environment Variables |
|
| 12 |
+```bash |
|
| 13 |
+export OPENAI_API_KEY="sk-your-openai-api-key-here" |
|
| 14 |
+export OPENAI_MODEL="gpt-4" # or gpt-3.5-turbo for cost optimization |
|
| 15 |
+export OPENAI_MAX_TOKENS=1000 |
|
| 16 |
+export OPENAI_TEMPERATURE=0.1 # Low temperature for consistent technical analysis |
|
| 17 |
+``` |
|
| 18 |
+ |
|
| 19 |
+#### Database Configuration |
|
| 20 |
+```sql |
|
| 21 |
+-- Add OpenAI configuration to system_config |
|
| 22 |
+INSERT INTO system_config (key, value, description) VALUES |
|
| 23 |
+('openai_api_key', 'sk-your-key', 'OpenAI API key for failure predictions'),
|
|
| 24 |
+('openai_model', 'gpt-4', 'OpenAI model to use (gpt-4, gpt-3.5-turbo)'),
|
|
| 25 |
+('openai_max_tokens', '1000', 'Maximum tokens per API call'),
|
|
| 26 |
+('openai_temperature', '0.1', 'Temperature setting for consistent predictions'),
|
|
| 27 |
+('openai_timeout', '30', 'API timeout in seconds'),
|
|
| 28 |
+('prediction_interval_hours', '24', 'Hours between AI predictions per drive');
|
|
| 29 |
+``` |
|
| 30 |
+ |
|
| 31 |
+## 🤖 AI Prediction System |
|
| 32 |
+ |
|
| 33 |
+### Prompt Engineering |
|
| 34 |
+ |
|
| 35 |
+#### System Prompt Template |
|
| 36 |
+```text |
|
| 37 |
+You are an expert storage systems engineer specializing in HDD failure prediction and analysis. |
|
| 38 |
+ |
|
| 39 |
+Your expertise includes: |
|
| 40 |
+- SMART parameter interpretation across all major manufacturers (WD, Seagate, Hitachi, Toshiba) |
|
| 41 |
+- Statistical analysis of drive health trends and patterns |
|
| 42 |
+- Hardware failure mode identification and prediction |
|
| 43 |
+- Maintenance recommendations based on drive condition |
|
| 44 |
+ |
|
| 45 |
+Analyze the provided SMART data and historical trends to: |
|
| 46 |
+1. Assess current drive health status |
|
| 47 |
+2. Predict failure probability and timeline |
|
| 48 |
+3. Identify concerning parameter trends |
|
| 49 |
+4. Provide specific maintenance recommendations |
|
| 50 |
+ |
|
| 51 |
+Be precise, technical, and provide confidence levels for your predictions. |
|
| 52 |
+Return responses in structured JSON format for automated processing. |
|
| 53 |
+``` |
|
| 54 |
+ |
|
| 55 |
+#### User Prompt Templates |
|
| 56 |
+ |
|
| 57 |
+##### Single Drive Analysis |
|
| 58 |
+```json |
|
| 59 |
+{
|
|
| 60 |
+ "task": "analyze_drive_health", |
|
| 61 |
+ "drive_info": {
|
|
| 62 |
+ "serial_number": "WD-XXXXX", |
|
| 63 |
+ "model": "WD4003FZEX", |
|
| 64 |
+ "manufacturer": "Western Digital", |
|
| 65 |
+ "capacity_gb": 4000, |
|
| 66 |
+ "age_days": 1825, |
|
| 67 |
+ "power_on_hours": 15234 |
|
| 68 |
+ }, |
|
| 69 |
+ "current_smart": {
|
|
| 70 |
+ "Reallocated_Sector_Ct": 0, |
|
| 71 |
+ "Spin_Retry_Count": 0, |
|
| 72 |
+ "Current_Pending_Sector": 1, |
|
| 73 |
+ "Offline_Uncorrectable": 0, |
|
| 74 |
+ "UDMA_CRC_Error_Count": 0, |
|
| 75 |
+ "Raw_Read_Error_Rate": 158584832, |
|
| 76 |
+ "Seek_Error_Rate": 34405355, |
|
| 77 |
+ "Power_On_Hours": 15234, |
|
| 78 |
+ "Load_Cycle_Count": 45123, |
|
| 79 |
+ "Temperature_Celsius": 42, |
|
| 80 |
+ "Start_Stop_Count": 1205, |
|
| 81 |
+ "Power_Cycle_Count": 1198 |
|
| 82 |
+ }, |
|
| 83 |
+ "historical_trends": {
|
|
| 84 |
+ "30_day_changes": {
|
|
| 85 |
+ "Current_Pending_Sector": [0, 0, 0, 1], |
|
| 86 |
+ "Temperature_Celsius": [38, 39, 41, 42], |
|
| 87 |
+ "Power_On_Hours": [14950, 15050, 15150, 15234] |
|
| 88 |
+ }, |
|
| 89 |
+ "parameter_velocities": {
|
|
| 90 |
+ "Current_Pending_Sector": 0.033, |
|
| 91 |
+ "Temperature_Celsius": 0.133 |
|
| 92 |
+ } |
|
| 93 |
+ } |
|
| 94 |
+} |
|
| 95 |
+``` |
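
The `parameter_velocities` values in the sample above are consistent with a simple rate: the last sample minus the first, divided by the 30-day window. The sketch below assumes that definition (the source does not state how velocities are computed) and uses a hypothetical helper, `param_velocity`, written in POSIX shell with `awk` for the floating-point division.

```shell
#!/bin/sh
# Hypothetical helper, not part of autoSMART.
# Assumption: velocity = (last sample - first sample) / window length in days.
param_velocity() {
    first=$1
    last=$2
    days=$3
    awk -v a="$first" -v b="$last" -v d="$days" \
        'BEGIN { printf "%.3f\n", (b - a) / d }'
}

param_velocity 0 1 30     # Current_Pending_Sector: prints 0.033
param_velocity 38 42 30   # Temperature_Celsius:    prints 0.133
```

Both results match the sample payload, which suggests (but does not prove) that the velocities are per-day averages over the trend window.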
|
| 96 |
+ |
|
| 97 |
+##### Multi-Drive Comparative Analysis |
|
| 98 |
+```json |
|
| 99 |
+{
|
|
| 100 |
+ "task": "comparative_analysis", |
|
| 101 |
+ "drives": [ |
|
| 102 |
+ {
|
|
| 103 |
+ "serial_number": "WD-XXXXX1", |
|
| 104 |
+ "health_score": 85, |
|
| 105 |
+ "critical_parameters": ["Current_Pending_Sector"], |
|
| 106 |
+ "smart_summary": {...}
|
|
| 107 |
+ }, |
|
| 108 |
+ {
|
|
| 109 |
+ "serial_number": "WD-XXXXX2", |
|
| 110 |
+ "health_score": 92, |
|
| 111 |
+ "critical_parameters": [], |
|
| 112 |
+ "smart_summary": {...}
|
|
| 113 |
+ } |
|
| 114 |
+ ], |
|
| 115 |
+ "analysis_context": {
|
|
| 116 |
+ "environment": "proxmox_cluster", |
|
| 117 |
+ "usage_pattern": "high_io_database", |
|
| 118 |
+ "temperature_environment": "datacenter" |
|
| 119 |
+ } |
|
| 120 |
+} |
|
| 121 |
+``` |
|
| 122 |
+ |
|
| 123 |
+### Response Format |
|
| 124 |
+ |
|
| 125 |
+#### Standard Health Assessment Response |
|
| 126 |
+```json |
|
| 127 |
+{
|
|
| 128 |
+ "prediction_id": "uuid-generated", |
|
| 129 |
+ "timestamp": "2025-08-15T10:30:00Z", |
|
| 130 |
+ "drive_serial": "WD-XXXXX", |
|
| 131 |
+ "analysis": {
|
|
| 132 |
+ "health_score": 78, |
|
| 133 |
+ "risk_level": "medium", |
|
| 134 |
+ "failure_probability": {
|
|
| 135 |
+ "7_days": 0.02, |
|
| 136 |
+ "30_days": 0.08, |
|
| 137 |
+ "90_days": 0.15, |
|
| 138 |
+ "1_year": 0.35 |
|
| 139 |
+ }, |
|
| 140 |
+ "predicted_failure_date": "2026-02-15", |
|
| 141 |
+ "confidence_level": 0.75 |
|
| 142 |
+ }, |
|
| 143 |
+ "critical_findings": [ |
|
| 144 |
+ {
|
|
| 145 |
+ "parameter": "Current_Pending_Sector", |
|
| 146 |
+ "current_value": 1, |
|
| 147 |
+ "trend": "increasing", |
|
| 148 |
+ "severity": "warning", |
|
| 149 |
+ "description": "One sector is pending reallocation - monitor closely" |
|
| 150 |
+ }, |
|
| 151 |
+ {
|
|
| 152 |
+ "parameter": "Temperature_Celsius", |
|
| 153 |
+ "current_value": 42, |
|
| 154 |
+ "trend": "increasing", |
|
| 155 |
+ "severity": "info", |
|
| 156 |
+ "description": "Temperature trending upward but within normal range" |
|
| 157 |
+ } |
|
| 158 |
+ ], |
|
| 159 |
+ "recommendations": [ |
|
| 160 |
+ {
|
|
| 161 |
+ "priority": "high", |
|
| 162 |
+ "action": "monitor_pending_sectors", |
|
| 163 |
+ "description": "Monitor pending sector count daily - consider replacement if count increases", |
|
| 164 |
+ "timeline": "immediate" |
|
| 165 |
+ }, |
|
| 166 |
+ {
|
|
| 167 |
+ "priority": "medium", |
|
| 168 |
+ "action": "improve_cooling", |
|
| 169 |
+ "description": "Consider improving airflow to reduce operating temperature", |
|
| 170 |
+ "timeline": "within_30_days" |
|
| 171 |
+ } |
|
| 172 |
+ ], |
|
| 173 |
+ "manufacturer_specific": {
|
|
| 174 |
+ "western_digital": {
|
|
| 175 |
+ "expected_lifespan_hours": 50000, |
|
| 176 |
+ "current_usage_percent": 30.5, |
|
| 177 |
+ "wear_level_assessment": "normal" |
|
| 178 |
+ } |
|
| 179 |
+ } |
|
| 180 |
+} |
|
| 181 |
+``` |
|
| 182 |
+ |
|
| 183 |
+## 🔧 Implementation Details |
|
| 184 |
+ |
|
| 185 |
+### SmartAnalyzer.pm API Integration |
|
| 186 |
+ |
|
| 187 |
+#### Core API Methods |
|
| 188 |
+```perl |
|
| 189 |
+=head2 predict_failure |
|
| 190 |
+ |
|
| 191 |
+Generate AI-powered failure prediction for a specific drive |
|
| 192 |
+ |
|
| 193 |
+=cut |
|
| 194 |
+ |
|
| 195 |
+sub predict_failure {
|
|
| 196 |
+ my ($self, $hdd_id, $options) = @_; |
|
| 197 |
+ |
|
| 198 |
+ # Gather drive data and historical trends |
|
| 199 |
+ my $drive_data = $self->_gather_drive_data($hdd_id); |
|
| 200 |
+ my $historical_data = $self->_analyze_trends($hdd_id, $options->{days} || 30);
|
|
| 201 |
+ |
|
| 202 |
+ # Construct AI prompt |
|
| 203 |
+ my $prompt = $self->_build_analysis_prompt($drive_data, $historical_data); |
|
| 204 |
+ |
|
| 205 |
+ # Call OpenAI API |
|
| 206 |
+ my $prediction = $self->_call_openai_api($prompt); |
|
| 207 |
+ |
|
| 208 |
+ # Store prediction result |
|
| 209 |
+ $self->_store_prediction($hdd_id, $prediction); |
|
| 210 |
+ |
|
| 211 |
+ return $prediction; |
|
| 212 |
+} |
|
| 213 |
+``` |
|
| 214 |
+ |
|
| 215 |
+#### API Request Handler |
|
| 216 |
+```perl |
|
| 217 |
+sub _call_openai_api {
|
|
| 218 |
+ my ($self, $prompt) = @_; |
|
| 219 |
+ |
|
| 220 |
+ my $ua = LWP::UserAgent->new(timeout => $self->{openai_timeout} || 30);
|
|
| 221 |
+ |
|
| 222 |
+ my $request = HTTP::Request->new(POST => 'https://api.openai.com/v1/chat/completions'); |
|
| 223 |
+ $request->header('Authorization' => "Bearer $self->{openai_api_key}");
|
|
| 224 |
+ $request->header('Content-Type' => 'application/json');
|
|
| 225 |
+ |
|
| 226 |
+ my $payload = {
|
|
| 227 |
+ model => $self->{openai_model} || 'gpt-4',
|
|
| 228 |
+ messages => [ |
|
| 229 |
+ {
|
|
| 230 |
+ role => "system", |
|
| 231 |
+ content => $self->_get_system_prompt() |
|
| 232 |
+ }, |
|
| 233 |
+ {
|
|
| 234 |
+ role => "user", |
|
| 235 |
+ content => encode_json($prompt) |
|
| 236 |
+ } |
|
| 237 |
+ ], |
|
| 238 |
+ max_tokens => $self->{openai_max_tokens} || 1000,
|
|
| 239 |
+ temperature => $self->{openai_temperature} // 0.1,
|
|
| 240 |
+ response_format => { type => "json_object" }
|
|
| 241 |
+ }; |
|
| 242 |
+ |
|
| 243 |
+ $request->content(encode_json($payload)); |
|
| 244 |
+ |
|
| 245 |
+ my $response = $ua->request($request); |
|
| 246 |
+ |
|
| 247 |
+ if ($response->is_success) {
|
|
| 248 |
+ my $result = decode_json($response->content); |
|
| 249 |
+ return decode_json($result->{choices}[0]{message}{content});
|
|
| 250 |
+ } else {
|
|
| 251 |
+ die "OpenAI API error: " . $response->status_line . "\n" . $response->content; |
|
| 252 |
+ } |
|
| 253 |
+} |
|
| 254 |
+``` |
|
| 255 |
+ |
|
| 256 |
+### Error Handling and Retry Logic |
|
| 257 |
+ |
|
| 258 |
+```perl |
|
| 259 |
+sub _call_openai_api_with_retry {
|
|
| 260 |
+ my ($self, $prompt, $max_retries) = @_; |
|
| 261 |
+ $max_retries ||= 3; |
|
| 262 |
+ |
|
| 263 |
+ for my $attempt (1..$max_retries) {
|
|
| 264 |
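+ # A return inside eval only exits the eval block, so capture its value instead |
|
| 265 |
+ my $result = eval { $self->_call_openai_api($prompt) };
|
|
| 266 |
+ return $result unless $@; |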
|
| 267 |
+ |
|
| 268 |
+ if ($@) {
|
|
| 269 |
+ $self->_log("OpenAI API attempt $attempt failed: $@", 2);
|
|
| 270 |
+ |
|
| 271 |
+ if ($attempt < $max_retries) {
|
|
| 272 |
+ # Exponential backoff |
|
| 273 |
+ my $delay = 2 ** $attempt; |
|
| 274 |
+ $self->_log("Retrying in ${delay}s...", 2);
|
|
| 275 |
+ sleep($delay); |
|
| 276 |
+ } else {
|
|
| 277 |
+ die "OpenAI API failed after $max_retries attempts: $@"; |
|
| 278 |
+ } |
|
| 279 |
+ } |
|
| 280 |
+ } |
|
| 281 |
+} |
|
| 282 |
+``` |
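
The backoff schedule used by `_call_openai_api_with_retry` (a delay of 2^attempt seconds, so 2 s and then 4 s with the default three attempts) can be tried out from a shell. The following is a minimal POSIX-shell sketch, not part of autoSMART: `retry_with_backoff`, `backoff_delay`, and the `SLEEP_CMD` override are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of autoSMART.

# Delay in seconds before the next try: 2^attempt.
backoff_delay() {
    echo $((1 << $1))
}

# Usage: retry_with_backoff MAX_RETRIES COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_RETRIES attempts have failed.
retry_with_backoff() {
    max_retries=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -lt "$max_retries" ]; then
            delay=$(backoff_delay "$attempt")
            echo "attempt $attempt failed, retrying in ${delay}s..." >&2
            # SLEEP_CMD can be set to a no-op (e.g. true) in tests.
            "${SLEEP_CMD:-sleep}" "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "failed after $max_retries attempts" >&2
    return 1
}

backoff_delay 1   # prints 2
backoff_delay 2   # prints 4
```

A call such as `retry_with_backoff 3 some_command --flag` returns 0 as soon as one attempt succeeds, and returns 1 only after the final attempt has failed, which is the behaviour the Perl wrapper is meant to have.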
|
| 283 |
+ |
|
| 284 |
+## 📊 Prediction Storage and Retrieval |
|
| 285 |
+ |
|
| 286 |
+### Database Schema for Predictions |
|
| 287 |
+```sql |
|
| 288 |
+-- Enhanced predictions table |
|
| 289 |
+ALTER TABLE predictions ADD COLUMN api_model VARCHAR(50); |
|
| 290 |
+ALTER TABLE predictions ADD COLUMN api_tokens_used INTEGER; |
|
| 291 |
+ALTER TABLE predictions ADD COLUMN api_cost_estimate DECIMAL(10,6); |
|
| 292 |
+ALTER TABLE predictions ADD COLUMN confidence_level DECIMAL(3,2); |
|
| 293 |
+ALTER TABLE predictions ADD COLUMN failure_probability_7d DECIMAL(5,4); |
|
| 294 |
+ALTER TABLE predictions ADD COLUMN failure_probability_30d DECIMAL(5,4); |
|
| 295 |
+ALTER TABLE predictions ADD COLUMN failure_probability_90d DECIMAL(5,4); |
|
| 296 |
+ALTER TABLE predictions ADD COLUMN failure_probability_1y DECIMAL(5,4); |
|
| 297 |
+ALTER TABLE predictions ADD COLUMN predicted_failure_date DATE; |
|
| 298 |
+ALTER TABLE predictions ADD COLUMN recommendations JSONB; |
|
| 299 |
+ALTER TABLE predictions ADD COLUMN critical_findings JSONB; |
|
| 300 |
+``` |
|
| 301 |
+ |
|
| 302 |
+### Prediction Retrieval Methods |
|
| 303 |
+```perl |
|
| 304 |
+=head2 get_latest_prediction |
|
| 305 |
+ |
|
| 306 |
+Get the most recent prediction for a drive |
|
| 307 |
+ |
|
| 308 |
+=cut |
|
| 309 |
+ |
|
| 310 |
+sub get_latest_prediction {
|
|
| 311 |
+ my ($self, $hdd_id) = @_; |
|
| 312 |
+ |
|
| 313 |
+ my $sql = q{
|
|
| 314 |
+ SELECT p.*, hi.serial_number, hi.model_name |
|
| 315 |
+ FROM predictions p |
|
| 316 |
+ JOIN hdd_inventory hi ON p.hdd_id = hi.id |
|
| 317 |
+ WHERE p.hdd_id = ? |
|
| 318 |
+ ORDER BY p.timestamp DESC |
|
| 319 |
+ LIMIT 1 |
|
| 320 |
+ }; |
|
| 321 |
+ |
|
| 322 |
+ my $sth = $self->{db_handle}->prepare($sql);
|
|
| 323 |
+ $sth->execute($hdd_id); |
|
| 324 |
+ |
|
| 325 |
+ return $sth->fetchrow_hashref(); |
|
| 326 |
+} |
|
| 327 |
+``` |
|
| 328 |
+ |
|
| 329 |
+## 🎯 Performance Optimization |
|
| 330 |
+ |
|
| 331 |
+### API Usage Optimization |
|
| 332 |
+ |
|
| 333 |
+#### Batch Processing |
|
| 334 |
+```perl |
|
| 335 |
+sub predict_multiple_drives {
|
|
| 336 |
+ my ($self, $hdd_ids, $options) = @_; |
|
| 337 |
+ |
|
| 338 |
+ # Group drives by similarity for efficient batch processing |
|
| 339 |
+ my $drive_groups = $self->_group_drives_by_similarity($hdd_ids); |
|
| 340 |
+ |
|
| 341 |
+ my @predictions; |
|
| 342 |
+ for my $group (@$drive_groups) {
|
|
| 343 |
+ if (scalar(@$group) > 1) {
|
|
| 344 |
+ # Use comparative analysis for similar drives |
|
| 345 |
+ push @predictions, $self->_batch_comparative_analysis($group, $options); |
|
| 346 |
+ } else {
|
|
| 347 |
+ # Use individual analysis for single drives |
|
| 348 |
+ push @predictions, $self->predict_failure($group->[0], $options); |
|
| 349 |
+ } |
|
| 350 |
+ } |
|
| 351 |
+ |
|
| 352 |
+ return @predictions; |
|
| 353 |
+} |
|
| 354 |
+``` |
|
| 355 |
+ |
|
| 356 |
+#### Caching Strategy |
|
| 357 |
+```perl |
|
| 358 |
+sub _get_cached_prediction {
|
|
| 359 |
+ my ($self, $hdd_id, $cache_hours) = @_; |
|
| 360 |
+ $cache_hours ||= 24; |
|
| 361 |
+ |
|
| 362 |
+ my $sql = q{
|
|
| 363 |
+ SELECT * FROM predictions |
|
| 364 |
+ WHERE hdd_id = ? |
|
| 365 |
+ AND timestamp > NOW() - (? * INTERVAL '1 hour') |
|
| 366 |
+ ORDER BY timestamp DESC |
|
| 367 |
+ LIMIT 1 |
|
| 368 |
+ }; |
|
| 369 |
+ |
|
| 370 |
+ my $sth = $self->{db_handle}->prepare($sql);
|
|
| 371 |
+ $sth->execute($hdd_id, $cache_hours); |
|
| 372 |
+ |
|
| 373 |
+ return $sth->fetchrow_hashref(); |
|
| 374 |
+} |
|
| 375 |
+``` |
|
| 376 |
+ |
|
| 377 |
+### Cost Management |
|
| 378 |
+ |
|
| 379 |
+#### Token Usage Tracking |
|
| 380 |
+```perl |
|
| 381 |
+sub _track_api_usage {
|
|
| 382 |
+ my ($self, $hdd_id, $tokens_used, $model) = @_; |
|
| 383 |
+ |
|
| 384 |
+ # Rough estimate using flat per-token rates (actual pricing differs for input vs. output tokens and changes over time) |
|
| 385 |
+ my $cost_per_token = $model eq 'gpt-4' ? 0.00003 : 0.000002; |
|
| 386 |
+ my $estimated_cost = $tokens_used * $cost_per_token; |
|
| 387 |
+ |
|
| 388 |
+ # Log usage statistics |
|
| 389 |
+ my $sql = q{
|
|
| 390 |
+ INSERT INTO api_usage_log |
|
| 391 |
+ (hdd_id, timestamp, model, tokens_used, estimated_cost) |
|
| 392 |
+ VALUES (?, NOW(), ?, ?, ?) |
|
| 393 |
+ }; |
|
| 394 |
+ |
|
| 395 |
+ $self->{db_handle}->do($sql, undef, $hdd_id, $model, $tokens_used, $estimated_cost);
|
|
| 396 |
+ |
|
| 397 |
+ return $estimated_cost; |
|
| 398 |
+} |
|
| 399 |
+``` |
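
To make the arithmetic in `_track_api_usage` concrete, here is a small POSIX-shell sketch that applies the same two flat rates. `estimate_cost` is a hypothetical helper, and the rates are the illustrative constants from the Perl snippet above, not current OpenAI pricing.

```shell
#!/bin/sh
# Hypothetical helper, not part of autoSMART.
# Rates copied from the Perl snippet; treat them as placeholders.
estimate_cost() {
    model=$1
    tokens=$2
    case "$model" in
        gpt-4) rate=0.00003 ;;
        *)     rate=0.000002 ;;
    esac
    # Six decimal places, matching the DECIMAL(10,6) api_cost_estimate column.
    awk -v t="$tokens" -v r="$rate" 'BEGIN { printf "%.6f\n", t * r }'
}

estimate_cost gpt-4 1000           # prints 0.030000
estimate_cost gpt-3.5-turbo 1000   # prints 0.002000
```

With the default `openai_max_tokens` of 1000, a single `gpt-4` call is therefore estimated at about 0.03 under these placeholder rates.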
|
| 400 |
+ |
|
| 401 |
+## 📈 Analytics and Reporting |
|
| 402 |
+ |
|
| 403 |
+### Prediction Accuracy Tracking |
|
| 404 |
+```sql |
|
| 405 |
+-- Track prediction accuracy over time |
|
| 406 |
+CREATE VIEW prediction_accuracy AS |
|
| 407 |
+SELECT |
|
| 408 |
+ p.hdd_id, |
|
| 409 |
+ p.timestamp as prediction_date, |
|
| 410 |
+ p.failure_probability_30d, |
|
| 411 |
+ p.predicted_failure_date, |
|
| 412 |
+ hi.status_changed_at, |
|
| 413 |
+ CASE |
|
| 414 |
+ WHEN hi.status = 'failed' AND hi.status_changed_at <= p.predicted_failure_date THEN 'accurate' |
|
| 415 |
+ WHEN hi.status = 'failed' AND hi.status_changed_at > p.predicted_failure_date THEN 'early' |
|
| 416 |
+ WHEN hi.status = 'active' AND NOW() > p.predicted_failure_date THEN 'late' |
|
| 417 |
+ ELSE 'pending' |
|
| 418 |
+ END as accuracy_assessment |
|
| 419 |
+FROM predictions p |
|
| 420 |
+JOIN hdd_inventory hi ON p.hdd_id = hi.id |
|
| 421 |
+WHERE p.timestamp > NOW() - INTERVAL '6 months'; |
|
| 422 |
+``` |
|
| 423 |
+ |
|
| 424 |
+### API Cost Analysis |
|
| 425 |
+```sql |
|
| 426 |
+-- Monitor API costs and usage patterns |
|
| 427 |
+SELECT |
|
| 428 |
+ DATE_TRUNC('day', timestamp) as date,
|
|
| 429 |
+ model, |
|
| 430 |
+ COUNT(*) as api_calls, |
|
| 431 |
+ SUM(tokens_used) as total_tokens, |
|
| 432 |
+ SUM(estimated_cost) as daily_cost |
|
| 433 |
+FROM api_usage_log |
|
| 434 |
+WHERE timestamp > NOW() - INTERVAL '30 days' |
|
| 435 |
+GROUP BY DATE_TRUNC('day', timestamp), model
|
|
| 436 |
+ORDER BY date DESC, model; |
|
| 437 |
+``` |
|
| 438 |
+ |
|
| 439 |
+This reference covers configuration, prompt templates, the response format, retry handling, prediction storage, and cost tracking for autoSMART's OpenAI integration. |
|
@@ -0,0 +1,264 @@ |
||
| 1 |
+# autoSMART Release Notes |
|
| 2 |
+ |
|
| 3 |
+All notable changes and updates to autoSMART will be documented in this file. |
|
| 4 |
+ |
|
| 5 |
+## [1.0.0] - August 15, 2025 |
|
| 6 |
+ |
|
| 7 |
+### 🎉 Initial Release - Production Ready |
|
| 8 |
+ |
|
| 9 |
+We're excited to announce the first production release of autoSMART! This release provides a complete, enterprise-ready solution for intelligent HDD monitoring with AI-powered failure predictions. |
|
| 10 |
+ |
|
| 11 |
+### ✨ What's New |
|
| 12 |
+ |
|
| 13 |
+#### Core Features |
|
| 14 |
+- **Smart HDD Tracking**: Automatically identifies and tracks all HDDs in your Proxmox cluster using hardware identifiers |
|
| 15 |
+- **AI Failure Predictions**: Uses OpenAI GPT to predict drive failures before they happen |
|
| 16 |
+- **Efficient Storage**: Advanced storage optimization reduces database size by 60-80% |
|
| 17 |
+- **Migration Detection**: Automatically detects when drives move between servers |
|
| 18 |
+- **Proxmox Integration**: Native support for Proxmox VE cluster environments |
|
| 19 |
+ |
|
| 20 |
+#### Monitoring Capabilities |
|
| 21 |
+- **Real-time Health Monitoring**: Continuous SMART parameter monitoring |
|
| 22 |
+- **Configurable Alerts**: Customizable thresholds for all SMART parameters |
|
| 23 |
+- **Historical Analysis**: Long-term trend analysis and reporting |
|
| 24 |
+- **Performance Tracking**: Monitor drive performance degradation over time |
|
| 25 |
+ |
|
| 26 |
+#### User Experience |
|
| 27 |
+- **Easy Installation**: Simple deployment script for quick setup |
|
| 28 |
+- **Comprehensive Reports**: Detailed health reports and failure predictions |
|
| 29 |
+- **Web Dashboard**: (Coming in v1.1) Real-time monitoring interface |
|
| 30 |
+- **Email Alerts**: Immediate notifications for critical issues |
|
| 31 |
+ |
|
| 32 |
+### 🔧 System Requirements |
|
| 33 |
+ |
|
| 34 |
+#### Minimum Requirements |
|
| 35 |
+- **Operating System**: Proxmox VE 7.0+ or compatible Linux distribution |
|
| 36 |
+- **Database**: PostgreSQL 13+ with 1GB+ available storage |
|
| 37 |
+- **Perl**: Version 5.20+; internet access is required to install modules |
|
| 38 |
+- **Memory**: 512MB RAM minimum, 1GB recommended per node |
|
| 39 |
+- **Network**: Stable network connection for database and API access |
|
| 40 |
+ |
|
| 41 |
+#### Recommended Setup |
|
| 42 |
+- **Database Server**: Dedicated PostgreSQL server with SSD storage |
|
| 43 |
+- **Cluster Size**: Optimized for 3-50 node Proxmox clusters |
|
| 44 |
+- **Storage**: 10GB+ database storage for large clusters with long retention |
|
| 45 |
+- **Monitoring**: Integration with existing monitoring infrastructure |
|
| 46 |
+ |
|
| 47 |
+### 📊 Performance Benefits |
|
| 48 |
+ |
|
| 49 |
+#### Storage Efficiency |
|
| 50 |
+- **60-80% smaller database** compared to traditional SMART logging |
|
| 51 |
+- **Intelligent change detection** stores only modified parameters |
|
| 52 |
+- **Automatic optimization** requires no manual configuration |
|
| 53 |
+- **Scalable architecture** grows efficiently with cluster size |
|
| 54 |
+ |
|
| 55 |
+#### Monitoring Accuracy |
|
| 56 |
+- **Hardware-based tracking** eliminates drive identification issues |
|
| 57 |
+- **Migration detection** maintains accurate drive history |
|
| 58 |
+- **AI-powered analysis** provides reliable failure predictions |
|
| 59 |
+- **Real-time alerts** enable proactive maintenance |
|
| 60 |
+ |
|
| 61 |
+### 🚀 Getting Started |
|
| 62 |
+ |
|
| 63 |
+#### Quick Installation |
|
| 64 |
+```bash |
|
| 65 |
+# 1. Download and extract autoSMART |
|
| 66 |
+# 2. Run the installer |
|
| 67 |
+sudo ./scripts/deploy.sh install |
|
| 68 |
+ |
|
| 69 |
+# 3. Configure your database connection |
|
| 70 |
+sudo vim /opt/autoSMART/config/autosmart.conf |
|
| 71 |
+ |
|
| 72 |
+# 4. Start monitoring |
|
| 73 |
+sudo systemctl start autosmart |
|
| 74 |
+``` |
|
| 75 |
+ |
|
| 76 |
+#### First Steps |
|
| 77 |
+1. **Verify Installation**: Check that all drives are detected and monitored |
|
| 78 |
+2. **Configure Alerts**: Set up email notifications for your team |
|
| 79 |
+3. **Review Reports**: Generate initial health reports for all drives |
|
| 80 |
+4. **Set Thresholds**: Customize alert thresholds for your environment |
|
| 81 |
+ |
|
| 82 |
+### 🏥 Health Monitoring |
|
| 83 |
+ |
|
| 84 |
+#### What autoSMART Monitors |
|
| 85 |
+- **Temperature**: Operating temperatures and thermal stress |
|
| 86 |
+- **Error Rates**: Read/write errors and retry counts |
|
| 87 |
+- **Mechanical Health**: Spin-up time, seek errors, and mechanical issues |
|
| 88 |
+- **Surface Quality**: Bad sectors, reallocated sectors, and surface scans |
|
| 89 |
+- **Performance**: Transfer rates and response times |
|
| 90 |
+ |
|
| 91 |
+#### AI Predictions |
|
| 92 |
+- **Failure Probability**: Confidence scores for potential failures |
|
| 93 |
+- **Time Estimates**: Predicted time until failure occurs |
|
| 94 |
+- **Risk Assessment**: Categorization of failure risk levels |
|
| 95 |
+- **Recommendation Engine**: Suggested maintenance actions |
|
| 96 |
+ |
|
| 97 |
+### 🔔 Alert System |
|
| 98 |
+ |
|
| 99 |
+#### Alert Types |
|
| 100 |
+- **Critical**: Immediate action required (drive failure imminent) |
|
| 101 |
+- **Warning**: Monitor closely (parameters approaching limits) |
|
| 102 |
+- **Info**: Normal operation (routine status updates) |
|
| 103 |
+- **Prediction**: AI-identified potential issues |
|
| 104 |
+ |
|
| 105 |
+#### Notification Methods |
|
| 106 |
+- **Email**: Immediate email alerts for critical issues |
|
| 107 |
+- **Logs**: Detailed logging for all events and changes |
|
| 108 |
+- **Reports**: Regular summary reports with cluster health overview |
|
| 109 |
+- **API Integration**: RESTful API for custom integrations (v1.1+) |
|
| 110 |
+ |
|
| 111 |
+### 💡 Use Cases |
|
| 112 |
+ |
|
| 113 |
+#### Preventive Maintenance |
|
| 114 |
+- **Predict Failures**: Replace drives before they fail |
|
| 115 |
+- **Schedule Maintenance**: Plan maintenance windows effectively |
|
| 116 |
+- **Optimize Workloads**: Balance load based on drive health |
|
| 117 |
+- **Track Warranties**: Monitor warranty status and replacement schedules |
|
| 118 |
+ |
|
| 119 |
+#### Capacity Planning |
|
| 120 |
+- **Growth Trends**: Monitor storage usage patterns |
|
| 121 |
+- **Performance Planning**: Identify performance bottlenecks |
|
| 122 |
+- **Cluster Expansion**: Plan future capacity requirements |
|
| 123 |
+- **Cost Optimization**: Maximize drive utilization efficiency |
|
| 124 |
+ |
|
| 125 |
+### 🛠️ Support & Documentation |
|
| 126 |
+ |
|
| 127 |
+#### Getting Help |
|
| 128 |
+- **Installation Guide**: Complete setup instructions in `docs/INSTALLATION.md` |
|
| 129 |
+- **Configuration**: Detailed configuration options and examples |
|
| 130 |
+- **Troubleshooting**: Common issues and solutions |
|
| 131 |
+- **API Documentation**: Integration guides and examples |
|
| 132 |
+ |
|
| 133 |
+#### Community |
|
| 134 |
+- **Documentation**: Comprehensive guides for all features |
|
| 135 |
+- **Support**: Technical support and assistance |
|
| 136 |
+- **Updates**: Regular updates and security patches |
|
| 137 |
+- **Feedback**: We welcome your feedback and suggestions |
|
| 138 |
+ |
|
| 139 |
+### 🔮 What's Next |
|
| 140 |
+ |
|
| 141 |
+#### Version 1.1 (Coming Soon) |
|
| 142 |
+- **Web Dashboard**: Real-time monitoring interface |
|
| 143 |
+- **Advanced Analytics**: Enhanced prediction models |
|
| 144 |
+- **API Integration**: RESTful API for custom integrations |
|
| 145 |
+- **Mobile Alerts**: SMS and mobile app notifications |
|
| 146 |
+ |
|
| 147 |
+#### Future Releases |
|
| 148 |
+- **Multi-Tenant Support**: Support for managed service providers |
|
| 149 |
+- **Advanced ML Models**: Custom machine learning models |
|
| 150 |
+- **Cloud Integration**: Cloud storage and analytics options |
|
| 151 |
+- **Enterprise Features**: Advanced reporting and compliance tools |
|
| 152 |
+ |
|
| 153 |
+--- |
|
| 154 |
+ |
|
| 155 |
+**Welcome to autoSMART v1.0!** |
|
| 156 |
+ |
|
| 157 |
+Thank you for choosing autoSMART for your drive monitoring needs. This release represents months of development and testing to provide you with a reliable, efficient, and intelligent monitoring solution. |
|
| 158 |
+ |
|
| 159 |
+For technical support, documentation, or questions, please refer to the documentation in the `docs/` directory or contact our support team. |
|
| 160 |
+ |
|
| 161 |
+#### Scripts and Tools |
|
| 162 |
+- **collect-smart-data.pl**: Main data collection script |
|
| 163 |
+- **analyze-smart-data.pl**: Analysis and prediction script |
|
| 164 |
+- **generate-reports.pl**: Report generation script |
|
| 165 |
+- **test-differential-storage.pl**: Comprehensive storage optimization test suite |
|
| 166 |
+ |
|
| 167 |
+#### Configuration System |
|
| 168 |
+- **Proxmox cluster integration**: |
|
| 169 |
+ - `/etc/pve/autoSMART/cluster.conf`: Cluster-wide shared configuration |
|
| 170 |
+ - `/etc/default/autosmart`: Local node-specific configuration |
|
| 171 |
+- **Flexible configuration**: Database connection, API keys, thresholds, intervals |
|
| 172 |
+ |
|
| 173 |
+#### Documentation |
|
| 174 |
+- Complete installation and setup guide |
|
| 175 |
+- API integration documentation |
|
| 176 |
+- Migration detection system documentation |
|
| 177 |
+- Differential storage system documentation |
|
| 178 |
+- Development and testing guides |
|
| 179 |
+ |
|
| 180 |
+### 🔧 Technical Specifications |
|
| 181 |
+ |
|
| 182 |
+#### Database Requirements |
|
| 183 |
+- PostgreSQL 13+ with JSONB support |
|
| 184 |
+- GIN indexes for JSONB columns |
|
| 185 |
+- Recursive CTE support for data reconstruction |
|
| 186 |
+- Extension support for advanced functions |
|
| 187 |
+ |
|
| 188 |
+#### Performance Optimizations |
|
| 189 |
+- Hardware-based tracking eliminates volatile path dependencies |
|
| 190 |
+- Differential storage reduces data volume by 60-80% |
|
| 191 |
+- Optimized indexes for time-series data |
|
| 192 |
+- Efficient recursive queries for data reconstruction |
|
| 193 |
+ |
|
| 194 |
+#### Storage Efficiency |
|
| 195 |
+- **Baseline readings**: ~1% of all readings (first reading per HDD) |
|
| 196 |
+- **Full readings**: ~15-20% of readings (critical changes + forced intervals) |
|
| 197 |
+- **Differential readings**: ~5-15% of readings (minor parameter changes) |
|
| 198 |
+- **Skipped readings**: ~60-75% of readings (no changes detected) |
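
As a rough worked example of the percentages above, the sketch below estimates how many rows are actually written for a given number of collected readings. It assumes that a skipped reading writes no row at all, which the list implies but does not state; `stored_rows` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper, not part of autoSMART.
# Usage: stored_rows TOTAL_READINGS SKIPPED_PERCENT
stored_rows() {
    total=$1
    skipped_pct=$2
    echo $(( total * (100 - skipped_pct) / 100 ))
}

stored_rows 10000 60   # prints 4000
stored_rows 10000 75   # prints 2500
```

Under that assumption, 10,000 collected readings would result in roughly 2,500 to 4,000 stored rows.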
|
| 199 |
+ |
|
| 200 |
+#### Migration Detection |
|
| 201 |
+- Automatic detection of HDD movements between: |
|
| 202 |
+ - Physical nodes in cluster |
|
| 203 |
+ - Device paths (/dev/sdX changes) |
|
| 204 |
+ - Slot positions in chassis |
|
| 205 |
+- Complete audit trail of hardware movements |
|
| 206 |
+- No data loss during migrations |
|
| 207 |
+ |
|
| 208 |
+### 🎯 Phase 1 Completion Status |
|
| 209 |
+ |
|
| 210 |
+- ✅ Project structure and organization |
|
| 211 |
+- ✅ PostgreSQL schema with hardware tracking |
|
| 212 |
+- ✅ Hardware-based SMART collector with migration detection |
|
| 213 |
+- ✅ Differential storage optimization implementation |
|
| 214 |
+- ✅ Proxmox cluster configuration system |
|
| 215 |
+- ✅ Test suite and validation tools |
|
| 216 |
+- ✅ Comprehensive documentation |
|
| 217 |
+ |
|
| 218 |
+### 🔜 Next Phase (v1.1 - AI Integration) |
|
| 219 |
+ |
|
| 220 |
+Planned features for Phase 2: |
|
| 221 |
+- AI prediction engine implementation |
|
| 222 |
+- Historical data analysis and pattern recognition |
|
| 223 |
+- Failure prediction algorithms refinement |
|
| 224 |
+- Enhanced alerting system |
|
| 225 |
+ |
|
| 226 |
+### 🏗️ Infrastructure Notes |
|
| 227 |
+ |
|
| 228 |
+- **Test Database**: PostgreSQL on 192.168.2.102 (user: postgres, no password) |
|
| 229 |
+- **Development Environment**: macOS with Perl 5.x |
|
| 230 |
+- **Target Deployment**: Proxmox VE cluster with shared storage |
|
| 231 |
+ |
|
| 232 |
+### 📊 Project Metrics |
|
| 233 |
+ |
|
| 234 |
+- **Total files**: 25+ files across modules, scripts, SQL, and documentation |
|
| 235 |
+- **Code quality**: Full error handling, logging, and validation |
|
| 236 |
+- **Test coverage**: Comprehensive test suite for differential storage |
|
| 237 |
+- **Documentation**: Complete user and developer documentation |
|
| 238 |
+- **Database optimization**: 60-80% storage reduction achieved |
|
| 239 |
+ |
|
| 240 |
+--- |
|
| 241 |
+ |
|
| 242 |
+## Development Guidelines |
|
| 243 |
+ |
|
| 244 |
+### Version Numbering |
|
| 245 |
+- **Major** (X.0.0): Breaking changes, major feature additions |
|
| 246 |
+- **Minor** (X.Y.0): New features, backward compatible |
|
| 247 |
+- **Patch** (X.Y.Z): Bug fixes, small improvements |
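
The numbering scheme above can be sketched as a small POSIX-shell function. `bump_version` is a hypothetical helper for illustration only; the project does not document a release tool, and the function assumes a plain `X.Y.Z` input.

```shell
#!/bin/sh
# Hypothetical helper, not part of autoSMART.
# Usage: bump_version X.Y.Z major|minor|patch
bump_version() {
    version=$1
    part=$2
    major=${version%%.*}   # text before the first dot
    rest=${version#*.}     # text after the first dot (Y.Z)
    minor=${rest%%.*}
    patch=${rest#*.}
    case "$part" in
        major) major=$((major + 1)); minor=0; patch=0 ;;
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "unknown part: $part" >&2; return 1 ;;
    esac
    echo "$major.$minor.$patch"
}

bump_version 1.0.0 minor   # prints 1.1.0
bump_version 1.3.0 major   # prints 2.0.0
bump_version 1.1.0 patch   # prints 1.1.1
```

A minor or major bump resets the lower-order fields to zero, which is how `1.0.0` leads to the planned `1.1.0` and `1.3.0` leads to `2.0.0`.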
|
| 248 |
+ |
|
| 249 |
+### Change Categories |
|
| 250 |
+- 🎉 **Major Release** |
|
| 251 |
+- ✨ **Added** - New features |
|
| 252 |
+- 🔧 **Changed** - Changes in existing functionality |
|
| 253 |
+- 🐛 **Fixed** - Bug fixes |
|
| 254 |
+- 🔒 **Security** - Security improvements |
|
| 255 |
+- 🗑️ **Deprecated** - Soon-to-be removed features |
|
| 256 |
+- ❌ **Removed** - Removed features |
|
| 257 |
+ |
|
| 258 |
+### Future Releases |
|
| 259 |
+ |
|
| 260 |
+Planning for upcoming versions: |
|
| 261 |
+- **v1.1.0**: AI Integration Phase |
|
| 262 |
+- **v1.2.0**: Production Deployment Phase |
|
| 263 |
+- **v1.3.0**: Advanced Analytics Phase |
|
| 264 |
+- **v2.0.0**: Next Generation Architecture |
|
@@ -0,0 +1,467 @@ |
||
| 1 |
+# autoSMART Database Documentation |
|
| 2 |
+ |
|
| 3 |
+## Overview |
|
| 4 |
+ |
|
| 5 |
+autoSMART uses PostgreSQL as its primary database for storing SMART data, HDD tracking information, predictions, and system configuration. The database is designed for multi-node cluster deployments with comprehensive HDD mobility tracking. |
|
| 6 |
+ |
|
| 7 |
+## Database Schema |
|
| 8 |
+ |
|
| 9 |
+### Core Tables |
|
| 10 |
+ |
|
| 11 |
+#### `hdd_inventory` |
|
| 12 |
+The central inventory table that tracks all HDDs across the cluster. |
|
| 13 |
+ |
|
| 14 |
+```sql |
|
| 15 |
+CREATE TABLE hdd_inventory ( |
|
| 16 |
+ id SERIAL PRIMARY KEY, |
|
| 17 |
+ serial_number VARCHAR(100) NOT NULL, |
|
| 18 |
+ model_name VARCHAR(200) NOT NULL, |
|
| 19 |
+ firmware VARCHAR(50), |
|
| 20 |
+ size_gb INTEGER, |
|
| 21 |
+ manufacturer VARCHAR(100), |
|
| 22 |
+ current_device_path VARCHAR(50), |
|
| 23 |
+ current_node_id VARCHAR(50), |
|
| 24 |
+ current_slot VARCHAR(20), |
|
| 25 |
+ madagascar_id VARCHAR(100), |
|
| 26 |
+ first_seen TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 27 |
+ last_seen TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 28 |
+ status VARCHAR(20) DEFAULT 'active', |
|
| 29 |
+ status_changed_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 30 |
+ notes TEXT, |
|
| 31 |
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 32 |
+ updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 33 |
+ |
|
| 34 |
+ CONSTRAINT unique_hardware_id UNIQUE (serial_number, model_name) |
|
| 35 |
+); |
|
| 36 |
+``` |
|
| 37 |
+ |
|
| 38 |
+**Key Features:** |
|
| 39 |
+- **Hardware-based identification**: The `unique_hardware_id` constraint on (`serial_number`, `model_name`) identifies a drive independently of its device path or node |
|
| 40 |
+- **Current location tracking**: `current_device_path` and `current_node_id` record where the HDD is currently attached |
|
| 41 |
+- **Lifecycle management**: `first_seen`, `last_seen`, `status` track HDD lifecycle |
|
| 42 |
+- **Madagascar integration**: `madagascar_id` field for cluster-specific identification |
|
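Because a drive is identified by its hardware rather than its location, it keeps one inventory row when it moves. The sketch below imitates that behavior in Python with an in-memory dictionary. It borrows the name `get_or_create_hdd` from the collection process, but the signature and body are assumptions, not the project's actual implementation.

```python
from datetime import datetime, timezone

def get_or_create_hdd(inventory, serial_number, model_name, node_id, device_path, now=None):
    """Look a drive up by (serial_number, model_name); create it on first sight."""
    now = now or datetime.now(timezone.utc)
    key = (serial_number, model_name)   # mirrors the unique_hardware_id constraint
    hdd = inventory.get(key)
    if hdd is None:
        hdd = {"id": len(inventory) + 1, "first_seen": now, "status": "active"}
        inventory[key] = hdd
    # Location and last_seen are refreshed on every sighting.
    hdd.update(current_node_id=node_id, current_device_path=device_path, last_seen=now)
    return hdd["id"]
```

Calling it twice for the same serial number and model on different nodes returns the same id and only changes the recorded location.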
| 43 |
+ |
|
| 44 |
+#### `hdd_presence` |
|
| 45 |
+Tracks HDD mobility across cluster nodes by recording the period during which each HDD was present on a node. |
|
| 46 |
+ |
|
| 47 |
+```sql |
|
| 48 |
+CREATE TABLE hdd_presence ( |
|
| 49 |
+ id SERIAL PRIMARY KEY, |
|
| 50 |
+ serial_number VARCHAR(64) NOT NULL, |
|
| 51 |
+ node VARCHAR(64) NOT NULL, |
|
| 52 |
+ data_start TIMESTAMP NOT NULL, |
|
| 53 |
+ data_end TIMESTAMP NOT NULL, |
|
| 54 |
+ is_current BOOLEAN NOT NULL DEFAULT TRUE |
|
| 55 |
+); |
|
| 56 |
+``` |
|
| 57 |
+ |
|
| 58 |
+**Key Features:** |
|
| 59 |
+- **Mobility tracking**: Records when HDDs move between nodes |
|
| 60 |
+- **Time-based records**: `data_start`/`data_end` define presence periods |
|
| 61 |
+- **Current vs Historic**: `is_current` flag marks active presence |
|
| 62 |
+- **Independent of inventory**: Has no foreign key to `hdd_inventory`; rows are keyed by `serial_number` alone, so the table holds pure mobility data |
|
| 63 |
+ |
|
| 64 |
+**Example Data:** |
|
| 65 |
+```sql |
|
| 66 |
+ id | serial_number | node | data_start | data_end | is_current |
|
| 67 |
+----+----------------+-----------+----------------------------+----------------------------+------------ |
|
| 68 |
+ 4 | ZW60K01R | ebony | 2025-08-16 22:05:15.863971 | 2025-08-16 22:05:15.863971 | t |
|
| 69 |
+ 3 | S2HSNXRH402205 | ebony | 2025-08-16 22:05:15.109956 | 2025-08-16 22:05:15.109956 | t |
|
| 70 |
+ 2 | ZW60K01R | baobab | 2025-08-16 21:47:13.873642 | 2025-08-16 22:03:31.052316 | f |
|
| 71 |
+ 1 | S2HSNXRH402205 | tapia | 2025-08-16 21:47:13.078524 | 2025-08-16 22:03:30.268985 | f |
|
| 72 |
+``` |
|
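The example rows can be read as presence periods. As an illustration, this Python sketch computes the length of a period from the `data_start` and `data_end` strings exactly as they appear above; the helper name `presence_duration` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def presence_duration(data_start: str, data_end: str):
    """Length of one presence period as a timedelta."""
    return datetime.strptime(data_end, FMT) - datetime.strptime(data_start, FMT)

# Row id=2 from the example data: ZW60K01R on baobab.
baobab = presence_duration("2025-08-16 21:47:13.873642", "2025-08-16 22:03:31.052316")
# Row id=4: first sighting on ebony, so data_start equals data_end.
ebony = presence_duration("2025-08-16 22:05:15.863971", "2025-08-16 22:05:15.863971")
print(baobab, ebony)
```

A freshly created record has zero length until the next collection run extends its `data_end`.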
| 73 |
+ |
|
| 74 |
+#### `smart_readings` |
|
| 75 |
+Stores SMART data readings with differential storage optimization. |
|
| 76 |
+ |
|
| 77 |
+```sql |
|
| 78 |
+CREATE TABLE smart_readings ( |
|
| 79 |
+ id BIGSERIAL PRIMARY KEY, |
|
| 80 |
+ hdd_id INTEGER REFERENCES hdd_inventory(id), |
|
| 81 |
+ serial_number VARCHAR(100) NOT NULL, |
|
| 82 |
+ device_path VARCHAR(50), |
|
| 83 |
+ node_id VARCHAR(50), |
|
| 84 |
+ timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 85 |
+ collection_ok BOOLEAN DEFAULT true, |
|
| 86 |
+ temperature INTEGER, |
|
| 87 |
+ parameters_json JSONB, |
|
| 88 |
+ reading_type VARCHAR(20) DEFAULT 'full', |
|
| 89 |
+ changes_detected BOOLEAN DEFAULT true, |
|
| 90 |
+ changed_parameters JSONB, |
|
| 91 |
+ previous_reading_id BIGINT REFERENCES smart_readings(id), |
|
| 92 |
+ checksum VARCHAR(64) |
|
| 93 |
+); |
|
| 94 |
+``` |
|
| 95 |
+ |
|
| 96 |
+**Reading Types:** |
|
| 97 |
+- `baseline`: First reading for an HDD |
|
| 98 |
+- `full`: Complete parameter set (forced by time interval) |
|
| 99 |
+- `differential`: Only changed parameters (optimization) |
|
| 100 |
+- `skipped`: No changes detected |
|
| 101 |
+ |
|
| 102 |
+**Key Features:** |
|
| 103 |
+- **Differential storage**: Only stores changes to reduce data volume |
|
| 104 |
+- **Full context**: Links to `hdd_inventory` and includes node information |
|
| 105 |
+- **Change tracking**: `previous_reading_id` creates reading chains |
|
| 106 |
+- **JSONB parameters**: Flexible storage for SMART attributes |
|
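Differential storage depends on two operations: detecting that anything changed, and extracting what changed. The Python sketch below illustrates both. The use of SHA-256 is an assumption (the `checksum` column is `VARCHAR(64)`, which fits a SHA-256 hex digest, but the source does not name the algorithm), and both function names are hypothetical.

```python
import hashlib
import json

def parameters_checksum(parameters: dict) -> str:
    """Stable hex digest of a parameter set (SHA-256 is an assumption)."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_parameters(previous: dict, current: dict) -> dict:
    """Parameters that are new or whose value differs from the previous reading."""
    return {name: value for name, value in current.items()
            if name not in previous or previous[name] != value}
```

If only `Temperature_Celsius` moves between two readings, `changed_parameters` returns just that key, which is what a `differential` row would need to carry.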
| 107 |
+ |
|
| 108 |
+#### `predictions` |
|
| 109 |
+AI-generated failure predictions and analysis. |
|
| 110 |
+ |
|
| 111 |
+```sql |
|
| 112 |
+CREATE TABLE predictions ( |
|
| 113 |
+ id SERIAL PRIMARY KEY, |
|
| 114 |
+ hdd_id INTEGER REFERENCES hdd_inventory(id), |
|
| 115 |
+ serial_number VARCHAR(100) NOT NULL, |
|
| 116 |
+ device_path VARCHAR(50), |
|
| 117 |
+ timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 118 |
+ risk_level VARCHAR(20), |
|
| 119 |
+ failure_probability DECIMAL(5,4), |
|
| 120 |
+ predicted_failure_date DATE, |
|
| 121 |
+ confidence_score DECIMAL(5,4), |
|
| 122 |
+ analysis_summary TEXT, |
|
| 123 |
+ recommendations JSONB, |
|
| 124 |
+ openai_response JSONB, |
|
| 125 |
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 126 |
+); |
|
| 127 |
+``` |
|
| 128 |
+ |
|
| 129 |
+#### `alert_history` |
|
| 130 |
+Tracks all alerts sent about HDD issues. |
|
| 131 |
+ |
|
| 132 |
+```sql |
|
| 133 |
+CREATE TABLE alert_history ( |
|
| 134 |
+ id SERIAL PRIMARY KEY, |
|
| 135 |
+ hdd_id INTEGER REFERENCES hdd_inventory(id), |
|
| 136 |
+ serial_number VARCHAR(100) NOT NULL, |
|
| 137 |
+ alert_type VARCHAR(50), |
|
| 138 |
+ severity VARCHAR(20), |
|
| 139 |
+ message TEXT, |
|
| 140 |
+ sent_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 141 |
+ sent_to TEXT, |
|
| 142 |
+ delivery_status VARCHAR(20) DEFAULT 'pending', |
|
| 143 |
+ related_reading_id BIGINT REFERENCES smart_readings(id), |
|
| 144 |
+ related_prediction_id INTEGER REFERENCES predictions(id) |
|
| 145 |
+); |
|
| 146 |
+``` |
|
| 147 |
+ |
|
| 148 |
+### Configuration Tables |
|
| 149 |
+ |
|
| 150 |
+#### `system_config` |
|
| 151 |
+Global system configuration parameters. |
|
| 152 |
+ |
|
| 153 |
+```sql |
|
| 154 |
+CREATE TABLE system_config ( |
|
| 155 |
+ id SERIAL PRIMARY KEY, |
|
| 156 |
+ config_key VARCHAR(100) UNIQUE NOT NULL, |
|
| 157 |
+ value TEXT, |
|
| 158 |
+ description TEXT, |
|
| 159 |
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 160 |
+ updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 161 |
+); |
|
| 162 |
+``` |
|
| 163 |
+ |
|
| 164 |
+**Default Configuration:** |
|
| 165 |
+- `collection_interval_seconds`: SMART data collection frequency |
|
| 166 |
+- `differential_storage_enabled`: Enable/disable storage optimization |
|
| 167 |
+- `forced_storage_interval_hours`: Force full readings periodically |
|
| 168 |
+- `critical_parameter_force_store`: Always store critical changes |
|
| 169 |
+- `temperature_change_threshold`: Temperature change that triggers storing a reading |
|
| 170 |
+ |
|
| 171 |
+#### `smart_thresholds` |
|
| 172 |
+SMART parameter warning and critical thresholds. |
|
| 173 |
+ |
|
| 174 |
+```sql |
|
| 175 |
+CREATE TABLE smart_thresholds ( |
|
| 176 |
+ id SERIAL PRIMARY KEY, |
|
| 177 |
+ parameter_name VARCHAR(100) NOT NULL, |
|
| 178 |
+ warning_threshold NUMERIC, |
|
| 179 |
+ critical_threshold NUMERIC, |
|
| 180 |
+ weight NUMERIC DEFAULT 1.0, |
|
| 181 |
+ enabled BOOLEAN DEFAULT true, |
|
| 182 |
+ description TEXT, |
|
| 183 |
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 184 |
+ updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 185 |
+); |
|
| 186 |
+``` |
|
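One way a threshold row could be applied to a reading is sketched below in Python. It assumes that higher values are worse and that a NULL threshold means "not configured"; neither rule is stated in this document, and `classify` is a hypothetical name.

```python
from typing import Optional

def classify(value: float, warning: Optional[float], critical: Optional[float]) -> str:
    """Return 'critical', 'warning', or 'ok' for one SMART parameter value."""
    # The critical threshold is checked first so it wins when both are exceeded.
    if critical is not None and value >= critical:
        return "critical"
    if warning is not None and value >= warning:
        return "warning"
    return "ok"
```

Parameters where lower values are worse would need the comparisons reversed.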
| 187 |
+ |
|
| 188 |
+## Views |
|
| 189 |
+ |
|
| 190 |
+### `smart_readings_reconstructed` |
|
| 191 |
+Reconstructs complete SMART data from differential storage. |
|
| 192 |
+ |
|
| 193 |
+```sql |
|
| 194 |
+CREATE VIEW smart_readings_reconstructed AS |
|
| 195 |
+WITH RECURSIVE reading_chain AS ( |
|
| 196 |
+ -- Base case: get baseline readings |
|
| 197 |
+ SELECT id, hdd_id, serial_number, timestamp, |
|
| 198 |
+ parameters_json, temperature, reading_type, |
|
| 199 |
+ previous_reading_id, 1 as chain_level |
|
| 200 |
+ FROM smart_readings |
|
| 201 |
+ WHERE reading_type IN ('baseline', 'full') |
|
| 202 |
+ |
|
| 203 |
+ UNION ALL |
|
| 204 |
+ |
|
| 205 |
+ -- Recursive case: follow the chain of differential readings |
|
| 206 |
+ SELECT sr.id, sr.hdd_id, sr.serial_number, sr.timestamp, |
|
| 207 |
+ COALESCE(rc.parameters_json, '{}'::jsonb) || sr.parameters_json as parameters_json, |
|
| 208 |
+ COALESCE(sr.temperature, rc.temperature) as temperature, |
|
| 209 |
+ sr.reading_type, sr.previous_reading_id, |
|
| 210 |
+ rc.chain_level + 1 |
|
| 211 |
+ FROM smart_readings sr |
|
| 212 |
+ JOIN reading_chain rc ON sr.previous_reading_id = rc.id |
|
| 213 |
+ WHERE sr.reading_type = 'differential' |
|
| 214 |
+) |
|
| 215 |
+SELECT id, hdd_id, serial_number, timestamp, |
|
| 216 |
+ parameters_json, temperature, reading_type, chain_level |
|
| 217 |
+FROM reading_chain; |
|
| 218 |
+``` |
|
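The recursive step relies on the JSONB `||` operator, where keys from the right-hand object overwrite keys from the left-hand one. The Python sketch below mimics that merge along one chain. It illustrates the semantics only; `reconstruct` is a hypothetical name.

```python
def reconstruct(chain):
    """Merge a baseline/full reading with the differentials that follow it.

    `chain` is ordered oldest first; each item is (parameters, temperature).
    """
    parameters, temperature = {}, None
    for diff_parameters, diff_temperature in chain:
        parameters = {**parameters, **diff_parameters}   # like jsonb || jsonb
        if diff_temperature is not None:                 # like COALESCE(sr.temperature, rc.temperature)
            temperature = diff_temperature
    return parameters, temperature
```

A differential that carries only `Power_On_Hours` therefore inherits every other parameter, and the temperature, from the readings before it.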
| 219 |
+ |
|
| 220 |
+### `latest_smart_readings` |
|
| 221 |
+Latest reconstructed SMART reading for each drive in `hdd_inventory`. The view itself does not filter on `status`. |
|
| 222 |
+ |
|
| 223 |
+```sql |
|
| 224 |
+CREATE VIEW latest_smart_readings AS |
|
| 225 |
+SELECT DISTINCT ON (sr.hdd_id) |
|
| 226 |
+ sr.id, sr.hdd_id, sr.serial_number, sr.timestamp, |
|
| 227 |
+ sr.parameters_json, sr.temperature, |
|
| 228 |
+ hi.model_name, hi.manufacturer, hi.size_gb, |
|
| 229 |
+ hi.current_device_path, hi.current_node_id |
|
| 230 |
+FROM smart_readings_reconstructed sr |
|
| 231 |
+JOIN hdd_inventory hi ON sr.hdd_id = hi.id |
|
| 232 |
+ORDER BY sr.hdd_id, sr.timestamp DESC; |
|
| 233 |
+``` |
|
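`DISTINCT ON (sr.hdd_id)` combined with `ORDER BY sr.hdd_id, sr.timestamp DESC` keeps one row per drive, namely the newest. The same selection can be sketched in Python as follows; `latest_per_drive` is a hypothetical name and the readings are plain dictionaries.

```python
def latest_per_drive(readings):
    """Keep the newest reading per hdd_id."""
    latest = {}
    for reading in readings:
        kept = latest.get(reading["hdd_id"])
        # Replace the kept row only when this one is strictly newer.
        if kept is None or reading["timestamp"] > kept["timestamp"]:
            latest[reading["hdd_id"]] = reading
    return latest
```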
| 234 |
+ |
|
| 235 |
+### `drive_health_summary` |
|
| 236 |
+Comprehensive health overview for all drives. |
|
| 237 |
+ |
|
| 238 |
+```sql |
|
| 239 |
+CREATE VIEW drive_health_summary AS |
|
| 240 |
+SELECT |
|
| 241 |
+ hi.id as hdd_id, hi.serial_number, hi.model_name, |
|
| 242 |
+ hi.manufacturer, hi.current_device_path, hi.current_node_id, |
|
| 243 |
+ hi.status, lsr.timestamp as last_reading, lsr.temperature, |
|
| 244 |
+ p.risk_level, p.failure_probability, p.predicted_failure_date, |
|
| 245 |
+ EXTRACT(EPOCH FROM (NOW() - lsr.timestamp))/3600 as hours_since_last_reading |
|
| 246 |
+FROM hdd_inventory hi |
|
| 247 |
+LEFT JOIN latest_smart_readings lsr ON hi.id = lsr.hdd_id |
|
| 248 |
+LEFT JOIN LATERAL ( |
|
| 249 |
+ SELECT risk_level, failure_probability, predicted_failure_date |
|
| 250 |
+ FROM predictions |
|
| 251 |
+ WHERE hdd_id = hi.id |
|
| 252 |
+ ORDER BY timestamp DESC |
|
| 253 |
+ LIMIT 1 |
|
| 254 |
+) p ON true |
|
| 255 |
+WHERE hi.status = 'active'; |
|
| 256 |
+``` |
|
| 257 |
+ |
|
| 258 |
+## Functions |
|
| 259 |
+ |
|
| 260 |
+### `update_hdd_presence()` |
|
| 261 |
+Manages HDD presence tracking when a drive is detected on a node. |
|
| 262 |
+ |
|
| 263 |
+```sql |
|
| 264 |
+CREATE OR REPLACE FUNCTION update_hdd_presence( |
|
| 265 |
+ p_serial_number VARCHAR(64), |
|
| 266 |
+ p_node VARCHAR(64) |
|
| 267 |
+) RETURNS VOID AS $$ |
|
| 268 |
+BEGIN |
|
| 269 |
+ -- Mark current presence records for this serial on other nodes as historic |
|
| 270 |
+ UPDATE hdd_presence |
|
| 271 |
+ SET is_current = FALSE |
|
| 272 |
+ WHERE serial_number = p_serial_number AND is_current = TRUE AND node <> p_node; |
|
| 273 |
+ |
|
| 274 |
+ -- Check if there's already a current presence for this serial/node |
|
| 275 |
+ IF EXISTS (SELECT 1 FROM hdd_presence WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE) THEN |
|
| 276 |
+ -- Update data_end for existing current presence |
|
| 277 |
+ UPDATE hdd_presence |
|
| 278 |
+ SET data_end = NOW() |
|
| 279 |
+ WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE; |
|
| 280 |
+ ELSE |
|
| 281 |
+ -- Create new presence record |
|
| 282 |
+ INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current) |
|
| 283 |
+ VALUES (p_serial_number, p_node, NOW(), NOW(), TRUE); |
|
| 284 |
+ END IF; |
|
| 285 |
+END; |
|
| 286 |
+$$ LANGUAGE plpgsql; |
|
| 287 |
+``` |
|
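To make the control flow easier to follow, here is an in-memory Python imitation of the same three steps. It is a sketch for illustration, operating on a list of dictionaries instead of the `hdd_presence` table, and the explicit `now` argument is an assumption added for testability.

```python
from datetime import datetime

def update_hdd_presence(presence, serial_number, node, now):
    """Imitate the PL/pgSQL function; `presence` is a list of dict rows."""
    # Step 1: close current records for this serial on any other node.
    for row in presence:
        if row["serial_number"] == serial_number and row["is_current"] and row["node"] != node:
            row["is_current"] = False
    # Step 2: extend the current record on this node if there is one.
    current = [row for row in presence
               if row["serial_number"] == serial_number
               and row["node"] == node and row["is_current"]]
    if current:
        for row in current:
            row["data_end"] = now
    else:
        # Step 3: otherwise open a new record with data_start == data_end.
        presence.append({"serial_number": serial_number, "node": node,
                         "data_start": now, "data_end": now, "is_current": True})
```

Two sightings on `baobab` followed by one on `ebony` leave a closed `baobab` row and a new `ebony` row whose `data_start` equals its `data_end`, the same shape as the example data for `hdd_presence`.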
| 288 |
+ |
|
| 289 |
+### `should_store_smart_reading()` |
|
| 290 |
+Decides whether a SMART reading should be stored and, if so, with which reading type, based on the differential storage settings. |
|
| 291 |
+ |
|
| 292 |
+```sql |
|
| 293 |
+CREATE OR REPLACE FUNCTION should_store_smart_reading( |
|
| 294 |
+ p_hdd_id INTEGER, |
|
| 295 |
+ p_parameters_json JSONB, |
|
| 296 |
+ p_checksum VARCHAR(64), |
|
| 297 |
+ p_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 298 |
+) RETURNS TABLE( |
|
| 299 |
+ should_store BOOLEAN, |
|
| 300 |
+ reading_type VARCHAR(20), |
|
| 301 |
+ changes_detected BOOLEAN, |
|
| 302 |
+ changed_parameters JSONB, |
|
| 303 |
+ previous_reading_id BIGINT |
|
| 304 |
+) AS $$ |
|
| 305 |
+-- Implementation omitted here; it handles: |
|
| 306 |
+-- - Differential storage enabled/disabled |
|
| 307 |
+-- - Checksum-based change detection |
|
| 308 |
+-- - Force intervals for full readings |
|
| 309 |
+-- - Reading type determination |
|
| 310 |
+$$; |
|
| 311 |
+``` |
|
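Since the function body is not shown in this document, the following Python sketch only illustrates how the four listed concerns could fit together. The order of the checks, the shape of `previous`, and the choice not to store `skipped` readings are all assumptions; `decide_reading` is a hypothetical name.

```python
from datetime import datetime, timedelta

def decide_reading(previous, checksum, now, differential_enabled, forced_interval_hours):
    """Return (should_store, reading_type) for a new reading.

    `previous` is None, or a dict holding the last stored 'checksum' and the
    'last_full' timestamp of the most recent baseline/full reading.
    """
    if previous is None:
        return True, "baseline"        # first reading for this HDD
    if not differential_enabled:
        return True, "full"            # optimization switched off
    if now - previous["last_full"] >= timedelta(hours=forced_interval_hours):
        return True, "full"            # periodic forced full reading
    if checksum == previous["checksum"]:
        return False, "skipped"        # nothing changed
    return True, "differential"        # store only what changed
```

With a 12-hour forced interval, an unchanged checksum one hour after the last full reading is skipped, while the same checksum 13 hours later produces a `full` reading.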
| 312 |
+ |
|
| 313 |
+## Indexes |
|
| 314 |
+ |
|
| 315 |
+### Performance Indexes |
|
| 316 |
+```sql |
|
| 317 |
+-- hdd_inventory indexes |
|
| 318 |
+CREATE INDEX idx_hdd_inventory_device_path ON hdd_inventory(current_device_path); |
|
| 319 |
+CREATE INDEX idx_hdd_inventory_node ON hdd_inventory(current_node_id); |
|
| 320 |
+CREATE INDEX idx_hdd_inventory_status ON hdd_inventory(status); |
|
| 321 |
+CREATE INDEX idx_hdd_inventory_last_seen ON hdd_inventory(last_seen); |
|
| 322 |
+ |
|
| 323 |
+-- hdd_presence indexes |
|
| 324 |
+CREATE INDEX idx_hdd_presence_serial_current ON hdd_presence(serial_number, is_current); |
|
| 325 |
+CREATE INDEX idx_hdd_presence_node ON hdd_presence(node); |
|
| 326 |
+CREATE INDEX idx_hdd_presence_data_end ON hdd_presence(data_end DESC); |
|
| 327 |
+ |
|
| 328 |
+-- smart_readings indexes |
|
| 329 |
+CREATE INDEX idx_smart_readings_hdd_id ON smart_readings(hdd_id); |
|
| 330 |
+CREATE INDEX idx_smart_readings_timestamp ON smart_readings(timestamp DESC); |
|
| 331 |
+CREATE INDEX idx_smart_readings_serial ON smart_readings(serial_number); |
|
| 332 |
+CREATE INDEX idx_smart_readings_device_path ON smart_readings(device_path); |
|
| 333 |
+CREATE INDEX idx_smart_readings_type ON smart_readings(reading_type); |
|
| 334 |
+CREATE INDEX idx_smart_readings_checksum ON smart_readings(checksum); |
|
| 335 |
+CREATE INDEX idx_smart_readings_previous ON smart_readings(previous_reading_id); |
|
| 336 |
+ |
|
| 337 |
+-- JSONB indexes for flexible queries |
|
| 338 |
+CREATE INDEX idx_smart_readings_parameters ON smart_readings USING GIN (parameters_json); |
|
| 339 |
+CREATE INDEX idx_smart_readings_changed_params ON smart_readings USING GIN (changed_parameters); |
|
| 340 |
+``` |
|
| 341 |
+ |
|
| 342 |
+## Data Flow |
|
| 343 |
+ |
|
| 344 |
+### Collection Process |
|
| 345 |
+1. **Device Discovery**: Collector scans `/dev/sd*` and `/dev/nvme*` devices |
|
| 346 |
+2. **SMART Reading**: Uses `smartctl` to extract device information and parameters |
|
| 347 |
+3. **HDD Registration**: `get_or_create_hdd()` adds new devices to `hdd_inventory` |
|
| 348 |
+4. **Presence Tracking**: `update_hdd_presence()` records current node location |
|
| 349 |
+5. **Data Storage**: Stores SMART readings with differential optimization |
|
| 350 |
+6. **Change Detection**: Uses checksums to detect parameter changes |
|
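The device discovery step can be sketched in Python as a filter over candidate paths. Whether the real collector skips partitions is not stated here, so the whole-disk rule and the name `whole_disks` are assumptions.

```python
import re

# Whole-disk names only; partitions such as /dev/sda1 or /dev/nvme0n1p1 are skipped.
WHOLE_DISK = re.compile(r"^/dev/(sd[a-z]+|nvme\d+n\d+)$")

def whole_disks(paths):
    """Filter candidate device paths down to whole disks, sorted for stable output."""
    return sorted(path for path in paths if WHOLE_DISK.match(path))
```

On a live system the candidates could come from `glob.glob("/dev/sd*") + glob.glob("/dev/nvme*")`; the filter then drops partitions and NVMe controller nodes such as `/dev/nvme0`.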
| 351 |
+ |
|
| 352 |
+### Mobility Tracking |
|
| 353 |
+1. **HDD Detected**: The collector reports an HDD on a node where it has no current presence record |
|
| 354 |
+2. **Historic Records**: Current presence records for that HDD on other nodes are marked `is_current = FALSE` |
|
| 355 |
+3. **New Presence**: A new record is created with `is_current = TRUE` |
|
| 356 |
+4. **Timeline**: The complete history is kept through the `data_start`/`data_end` timestamps |
|
| 357 |
+ |
|
| 358 |
+### Query Examples |
|
| 359 |
+ |
|
| 360 |
+#### Find HDD History |
|
| 361 |
+```sql |
|
| 362 |
+SELECT serial_number, node, data_start, data_end, is_current |
|
| 363 |
+FROM hdd_presence |
|
| 364 |
+WHERE serial_number = 'ZW60K01R' |
|
| 365 |
+ORDER BY data_start DESC; |
|
| 366 |
+``` |
|
| 367 |
+ |
|
| 368 |
+#### Current HDD Locations |
|
| 369 |
+```sql |
|
| 370 |
+SELECT h.serial_number, h.model_name, p.node, h.current_device_path |
|
| 371 |
+FROM hdd_inventory h |
|
| 372 |
+JOIN hdd_presence p ON h.serial_number = p.serial_number |
|
| 373 |
+WHERE p.is_current = TRUE; |
|
| 374 |
+``` |
|
| 375 |
+ |
|
| 376 |
+#### SMART Parameter Trends |
|
| 377 |
+```sql |
|
| 378 |
+SELECT timestamp, |
|
| 379 |
+ parameters_json->>'Power_On_Hours' as power_hours, |
|
| 380 |
+ parameters_json->>'Temperature_Celsius' as temp, |
|
| 381 |
+ temperature |
|
| 382 |
+FROM smart_readings_reconstructed |
|
| 383 |
+WHERE serial_number = 'ZW60K01R' |
|
| 384 |
+ORDER BY timestamp DESC |
|
| 385 |
+LIMIT 10; |
|
| 386 |
+``` |
|
| 387 |
+ |
|
| 388 |
+#### Health Summary |
|
| 389 |
+```sql |
|
| 390 |
+SELECT * FROM drive_health_summary |
|
| 391 |
+WHERE current_node_id = 'ebony'; |
|
| 392 |
+``` |
|
| 393 |
+ |
|
| 394 |
+## Troubleshooting |
|
| 395 |
+ |
|
| 396 |
+### Common Issues |
|
| 397 |
+ |
|
| 398 |
+#### 1. Node ID Mismatch |
|
| 399 |
+**Problem**: HDD presence shows wrong node name |
|
| 400 |
+**Cause**: Deploy script used local hostname instead of target node name |
|
| 401 |
+**Solution**: Deploy script now correctly determines target node name from `cluster.json` |
|
| 402 |
+ |
|
| 403 |
+#### 2. Empty hdd_presence Table |
|
| 404 |
+**Problem**: No mobility tracking data |
|
| 405 |
+**Causes**: |
|
| 406 |
+- SMART parameter parsing regex incompatible with new smartctl format |
|
| 407 |
+- Missing database sequence permissions |
|
| 408 |
+- Incomplete smart_readings INSERT statements |
|
| 409 |
+ |
|
| 410 |
+**Solutions**: |
|
| 411 |
+- Updated regex to support both old and new smartctl formats |
|
| 412 |
+- Added sequence permissions: `GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO autosmart;` |
|
| 413 |
+- Fixed INSERT to include all required fields |
|
| 414 |
+ |
|
| 415 |
+#### 3. Differential Storage Issues |
|
| 416 |
+**Problem**: Too much or too little data stored |
|
| 417 |
+**Configuration**: Adjust in `system_config` table: |
|
| 418 |
+```sql |
|
| 419 |
+UPDATE system_config SET value = 'false' WHERE config_key = 'differential_storage_enabled'; |
|
| 420 |
+UPDATE system_config SET value = '12' WHERE config_key = 'forced_storage_interval_hours'; |
|
| 421 |
+``` |
|
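Because `system_config.value` is a `TEXT` column, settings such as `'false'` and `'12'` reach the application as strings. The Python sketch below shows one way a caller could convert them; the conversion rules and the name `parse_config_value` are assumptions.

```python
def parse_config_value(raw: str):
    """Convert a system_config TEXT value to bool, int, or float, else keep the string."""
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for convert in (int, float):
        try:
            return convert(lowered)
        except ValueError:
            continue
    return raw
```

With this helper, `'false'` becomes the boolean `False` and `'12'` becomes the integer `12`, so a check such as `if enabled:` behaves as intended instead of treating the non-empty string `'false'` as true.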
| 422 |
+ |
|
| 423 |
+## Permissions |
|
| 424 |
+ |
|
| 425 |
+### Database User Setup |
|
| 426 |
+```sql |
|
| 427 |
+-- Create autosmart user |
|
| 428 |
+CREATE USER autosmart WITH PASSWORD 'autoSMART2025!'; |
|
| 429 |
+ |
|
| 430 |
+-- Grant permissions |
|
| 431 |
+GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO autosmart; |
|
| 432 |
+GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO autosmart; |
|
| 433 |
+GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO autosmart; |
|
| 434 |
+``` |
|
| 435 |
+ |
|
| 436 |
+### Sequence Permissions |
|
| 437 |
+```sql |
|
| 438 |
+-- Minimum sequence privileges for INSERTs into SERIAL columns; already covered by GRANT ALL above |
|
| 439 |
+GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO autosmart; |
|
| 440 |
+``` |
|
| 441 |
+ |
|
| 442 |
+## Maintenance |
|
| 443 |
+ |
|
| 444 |
+### Regular Tasks |
|
| 445 |
+1. **Monitor disk usage**: SMART readings table grows over time |
|
| 446 |
+2. **Archive old data**: Consider archiving readings older than 1 year |
|
| 447 |
+3. **Index maintenance**: Run `REINDEX` periodically if index bloat degrades query performance |
|
| 448 |
+4. **Backup**: Regular PostgreSQL backups recommended |
|
| 449 |
+ |
|
| 450 |
+### Performance Monitoring |
|
| 451 |
+```sql |
|
| 452 |
+-- Table sizes |
|
| 453 |
+SELECT schemaname, tablename, |
|
| 454 |
+ pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size |
|
| 455 |
+FROM pg_tables |
|
| 456 |
+WHERE schemaname = 'public' |
|
| 457 |
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC; |
|
| 458 |
+ |
|
| 459 |
+-- Recent activity |
|
| 460 |
+SELECT COUNT(*) as readings_today |
|
| 461 |
+FROM smart_readings |
|
| 462 |
+WHERE timestamp >= CURRENT_DATE; |
|
| 463 |
+ |
|
| 464 |
+SELECT COUNT(*) as active_drives |
|
| 465 |
+FROM hdd_inventory |
|
| 466 |
+WHERE status = 'active'; |
|
| 467 |
+``` |
|
@@ -0,0 +1,991 @@ |
||
| 1 |
+# autoSMART Development Guide |
|
| 2 |
+ |
|
| 3 |
+## 📚 Developer Documentation Index |
|
| 4 |
+ |
|
| 5 |
+This document serves as the complete guide for developers working on autoSMART. It includes development environment setup, architecture documentation, testing procedures, and a developer-specific changelog. |
|
| 6 |
+ |
|
| 7 |
+### Quick Navigation |
|
| 8 |
+- [Codebase Structure](#codebase-structure) |
|
| 9 |
+- [Development Environment Setup](#development-environment-setup) |
|
| 10 |
+- [Architecture Overview](#architecture-overview) |
|
| 11 |
+- [Database Development](#database-development) |
|
| 12 |
+- [Module Development](#module-development) |
|
| 13 |
+- [Testing Strategies](#testing-strategies) |
|
| 14 |
+- [Deployment Procedures](#deployment-procedures) |
|
| 15 |
+- [Developer Changelog](#developer-changelog) |
|
| 16 |
+- [Technical Reference](#technical-reference) |
|
| 17 |
+ |
|
| 18 |
+## 📁 Codebase Structure |
|
| 19 |
+ |
|
| 20 |
+autoSMART follows a modular architecture with clear separation of concerns. The complete directory structure and file descriptions follow: |
|
| 21 |
+ |
|
| 22 |
+### Project Root |
|
| 23 |
+``` |
|
| 24 |
+autoSMART/ |
|
| 25 |
+├── README.md # Symlink to docs/README.md (end-user documentation) |
|
| 26 |
+├── .deployignore # Files excluded from production deployment |
|
| 27 |
+├── config/ # Configuration files and templates |
|
| 28 |
+├── docs/ # Documentation (mixed deployment) |
|
| 29 |
+├── lib/ # Perl modules and core libraries |
|
| 30 |
+├── scripts/ # Executable scripts and utilities |
|
| 31 |
+└── sql/ # Database schema and SQL files |
|
| 32 |
+``` |
|
| 33 |
+ |
|
| 34 |
+### 📁 `/config/` - Configuration Management |
|
| 35 |
+Configuration files are organized by scope and environment: |
|
| 36 |
+ |
|
| 37 |
+``` |
|
| 38 |
+config/ |
|
| 39 |
+├── cluster.conf # Cluster-wide settings (shared across nodes) |
|
| 40 |
+├── cluster-ebony.conf # Node-specific configuration for ebony |
|
| 41 |
+├── database.conf # PostgreSQL connection settings |
|
| 42 |
+├── openai.conf # OpenAI API configuration and prompts |
|
| 43 |
+├── smart.conf # SMART parameter thresholds and monitoring rules |
|
| 44 |
+├── default # Default/template configuration |
|
| 45 |
+└── debug-ebony.sh # Development debugging script for ebony node |
|
| 46 |
+``` |
|
| 47 |
+ |
|
| 48 |
+#### Configuration File Details |
|
| 49 |
+- **`cluster.conf`** (88 lines): |
|
| 50 |
+ - Cluster topology and node definitions |
|
| 51 |
+ - Node hostnames, IP addresses, and roles |
|
| 52 |
+ - Shared monitoring parameters across cluster |
|
| 53 |
+ - Global system settings and defaults |
|
| 54 |
+ - Inter-node communication configuration |
|
| 55 |
+ |
|
| 56 |
+- **`database.conf`** (30 lines): |
|
| 57 |
+ - PostgreSQL connection parameters (host, port, database, credentials) |
|
| 58 |
+ - Connection pooling settings and timeouts |
|
| 59 |
+ - Database-specific optimizations and tuning parameters |
|
| 60 |
+ - SSL configuration and security settings |
|
| 61 |
+ |
|
| 62 |
+- **`openai.conf`** (50 lines): |
|
| 63 |
+ - OpenAI API key and model configuration |
|
| 64 |
+ - Prompt templates for failure prediction analysis |
|
| 65 |
+ - Response parsing rules and confidence thresholds |
|
| 66 |
+ - Rate limiting and cost management settings |
|
| 67 |
+ - Fallback configurations for API failures |
|
| 68 |
+ |
|
| 69 |
+- **`smart.conf`** (57 lines): |
|
| 70 |
+ - SMART parameter monitoring thresholds for different drive types |
|
| 71 |
+ - Critical parameter definitions and escalation rules |
|
| 72 |
+ - Alert generation rules and notification preferences |
|
| 73 |
+ - Parameter collection intervals and scheduling |
|
| 74 |
+ - Drive type specific monitoring configurations |
|
| 75 |
+ |
|
| 76 |
+- **`default`** (107 lines): |
|
| 77 |
+ - Default/template configuration for new node deployments |
|
| 78 |
+ - Standard parameter values and system defaults |
|
| 79 |
+ - Configuration validation rules and constraints |
|
| 80 |
+ - Example configurations with detailed comments |
|
| 81 |
+ |
|
| 82 |
+- **`cluster-ebony.conf`** (13 lines): |
|
| 83 |
+ - Node-specific configuration overrides for ebony node |
|
| 84 |
+ - Local network settings and hardware-specific parameters |
|
| 85 |
+ - Custom thresholds for specific hardware configurations |
|
| 86 |
+ |
|
| 87 |
+- **`debug-ebony.sh`** (29 lines): |
|
| 88 |
+ - Development debugging utilities for ebony node |
|
| 89 |
+ - Test data generation and validation scripts |
|
| 90 |
+ - Development environment setup and configuration |
|
| 91 |
+ - Debugging tools and diagnostic utilities |
|
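The relationship between `cluster.conf` and a node file such as `cluster-ebony.conf` is one of layering: node-specific values override cluster-wide ones. The Python sketch below shows that precedence on already-parsed settings. The file format is not described here, so parsing is left out, and the function name and the `temperature_warning` key are hypothetical.

```python
def effective_config(cluster: dict, node_overrides: dict) -> dict:
    """Cluster-wide settings with node-specific values taking precedence."""
    merged = dict(cluster)          # copy so the shared settings stay untouched
    merged.update(node_overrides)   # node values win on key collisions
    return merged
```

Keys that the node file does not mention keep their cluster-wide value.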
| 92 |
+ |
|
| 93 |
+### 📁 `/lib/` - Core Perl Modules |
|
| 94 |
+Core business logic implemented as reusable Perl modules: |
|
| 95 |
+ |
|
| 96 |
+``` |
|
| 97 |
+lib/ |
|
| 98 |
+├── SmartCollector.pm # SMART data collection and hardware tracking |
|
| 99 |
+└── PredictionEngine.pm # AI-powered failure prediction engine |
|
| 100 |
+``` |
|
| 101 |
+ |
|
| 102 |
+#### Module Architecture |
|
| 103 |
+- **`SmartCollector.pm`** (802 lines): |
|
| 104 |
+ - **Hardware Identification**: Device detection using serial numbers and model names |
|
| 105 |
+ - **SMART Data Collection**: Integration with smartmontools for comprehensive parameter collection |
|
| 106 |
+ - **Migration Detection**: Algorithms to detect when drives move between nodes or device paths |
|
| 107 |
+ - **Differential Storage**: Intelligent storage system that only saves changed parameters |
|
| 108 |
+ - **Database Layer**: PostgreSQL integration with connection pooling and error handling |
|
| 109 |
+ - **Storage Efficiency**: Real-time monitoring of storage optimization effectiveness |
|
| 110 |
+ - **Configuration Management**: Dynamic configuration loading and validation |
|
| 111 |
+ - **Error Handling**: Comprehensive error handling with detailed logging |
|
| 112 |
+ |
|
| 113 |
+- **`PredictionEngine.pm`** (607 lines): |
|
| 114 |
+ - **OpenAI Integration**: Direct API communication with GPT models |
|
| 115 |
+ - **Prompt Engineering**: Sophisticated prompt templates for failure prediction |
|
| 116 |
+ - **Response Processing**: Parsing and validation of AI-generated predictions |
|
| 117 |
+ - **Confidence Scoring**: Statistical analysis of prediction reliability |
|
| 118 |
+ - **Timeline Estimation**: Failure time prediction with confidence intervals |
|
| 119 |
+ - **Cost Optimization**: API usage optimization and request batching |
|
| 120 |
+ - **Error Recovery**: Robust error handling for API failures and rate limits |
|
| 121 |
+ |
|
| 122 |
+### 📁 `/scripts/` - Executable Components |
|
| 123 |
+Production scripts and development utilities: |
|
| 124 |
+ |
|
| 125 |
+``` |
|
| 126 |
+scripts/ |
|
| 127 |
+├── autosmart-collector.pl # Main data collection daemon |
|
| 128 |
+├── autosmart-predictor.pl # AI prediction processing |
|
| 129 |
+├── autosmart-report.pl # Report generation engine |
|
| 130 |
+├── autosmart-migration-report.pl # Hardware migration analysis |
|
| 131 |
+├── smart-collector-daemon.pl # Background collection service |
|
| 132 |
+├── deploy.sh # Unified deployment script |
|
| 133 |
+├── deploy-production.sh # Production cluster deployment |
|
| 134 |
+├── install.sh # Symlink to deploy.sh for compatibility |
|
| 135 |
+├── uninstall.sh # Complete system removal |
|
| 136 |
+├── monitor-cluster.sh # Cluster health monitoring |
|
| 137 |
+├── test-smart-collection.pl # SMART collection testing |
|
| 138 |
+├── test-differential-storage.pl # Storage optimization testing |
|
| 139 |
+├── test-db-connection.pl # Database connectivity testing |
|
| 140 |
+└── simple-smart-test.pl # Basic SMART functionality test |
|
| 141 |
+``` |
|
| 142 |
+ |
|
| 143 |
+#### Script Categories |
|
| 144 |
+ |
|
| 145 |
+##### Production Scripts |
|
| 146 |
+- **`autosmart-collector.pl`** (348 lines): |
|
| 147 |
+ - Main collection daemon that runs on each node |
|
| 148 |
+ - Scheduled SMART data collection and processing |
|
| 149 |
+ - Hardware detection and migration tracking |
|
| 150 |
+ - Integration with SmartCollector.pm module |
|
| 151 |
+ - Command-line options for daemon mode, single-run, and debugging |
|
| 152 |
+ |
|
| 153 |
+- **`autosmart-predictor.pl`** (483 lines): |
|
| 154 |
+ - Processes collected data for AI predictions |
|
| 155 |
+ - Batch processing of pending SMART readings |
|
| 156 |
+ - Integration with PredictionEngine.pm for OpenAI communication |
|
| 157 |
+ - Prediction result storage and confidence tracking |
|
| 158 |
+ |
|
| 159 |
+- **`autosmart-report.pl`** (662 lines): |
|
| 160 |
+ - Generates comprehensive health reports and alerts |
|
| 161 |
+ - Configurable report formats (summary, detailed, trend analysis) |
|
| 162 |
+ - Email notification system for critical alerts |
|
| 163 |
+ - Historical data analysis and trend detection |
|
| 164 |
+ |
|
| 165 |
+- **`smart-collector-daemon.pl`** (252 lines): |
|
| 166 |
+ - Background service wrapper for collector |
|
| 167 |
+ - Process management and restart capabilities |
|
| 168 |
+ - Log rotation and system integration |
|
| 169 |
+ - Service status monitoring and health checks |
|
| 170 |
+ |
|
| 171 |
+##### Deployment Scripts |
|
| 172 |
+- **`deploy.sh`** (697 lines): |
|
| 173 |
+ - Unified deployment for single node or cluster |
|
| 174 |
+ - Supports install, uninstall, and cluster deployment modes |
|
| 175 |
+ - Automatic dependency checking and installation |
|
| 176 |
+ - Configuration template deployment and customization |
|
| 177 |
+ - System service registration and startup |
|
| 178 |
+ |
|
| 179 |
+- **`deploy-production.sh`** (116 lines): |
|
| 180 |
+ - Production-specific deployment procedures |
|
| 181 |
+ - Multi-node cluster deployment automation |
|
| 182 |
+ - Production safety checks and validation |
|
| 183 |
+ - Rollback capabilities for failed deployments |
|
| 184 |
+ |
|
| 185 |
+- **`uninstall.sh`** (187 lines): |
|
| 186 |
+ - Complete system cleanup and removal |
|
| 187 |
+ - Service stopping and deregistration |
|
| 188 |
+ - File and directory cleanup |
|
| 189 |
+ - Database cleanup options (configurable) |
|
| 190 |
+ |
|
| 191 |
+- **`monitor-cluster.sh`** (515 lines): |
|
| 192 |
+ - Ongoing cluster health monitoring |
|
| 193 |
+ - Node status verification and reporting |
|
| 194 |
+ - Service health checks across all cluster nodes |
|
| 195 |
+ - Automated restart capabilities for failed services |
|
| 196 |
+ |
|
| 197 |
+##### Development & Testing Scripts |
|
| 198 |
+- **`test-smart-collection.pl`** (132 lines): |
|
| 199 |
+ - Validates SMART data collection functionality |
|
| 200 |
+ - Tests hardware detection and identification |
|
| 201 |
+ - Verifies database connectivity and data storage |
|
| 202 |
+ - Performance benchmarking for collection operations |
|
| 203 |
+ |
|
| 204 |
+- **`test-differential-storage.pl`** (270 lines): |
|
| 205 |
+ - Comprehensive testing of storage optimization |
|
| 206 |
+ - Validates differential storage algorithms |
|
| 207 |
+ - Tests change detection and storage efficiency |
|
| 208 |
+ - Performance analysis and optimization verification |
|
| 209 |
+ |
|
| 210 |
+- **`test-db-connection.pl`** (55 lines): |
|
| 211 |
+ - Database connectivity verification |
|
| 212 |
+ - Connection pooling and timeout testing |
|
| 213 |
+ - SQL execution validation |
|
| 214 |
+ - Database performance testing |
|
| 215 |
+ |
|
| 216 |
+- **`simple-smart-test.pl`** (144 lines): |
|
| 217 |
+ - Basic functionality testing |
|
| 218 |
+ - Quick validation of core components |
|
| 219 |
+ - Integration testing for development |
|
| 220 |
+ - Smoke testing for deployment validation |
|
| 221 |
+ |
|
| 222 |
+##### Analysis Scripts |
|
| 223 |
+- **`autosmart-migration-report.pl`** (615 lines): |
|
| 224 |
+ - Hardware migration tracking and analysis |
|
| 225 |
+ - Migration pattern detection and reporting |
|
| 226 |
+ - Historical migration data analysis |
|
| 227 |
+ - Migration-related issue identification and troubleshooting |
|
| 228 |
+ |
|
| 229 |
+### 📁 `/sql/` - Database Schema |
|
| 230 |
+PostgreSQL database definitions and utilities: |
|
| 231 |
+ |
|
| 232 |
+``` |
|
| 233 |
+sql/ |
|
| 234 |
+├── schema.sql # Complete production database schema |
|
| 235 |
+└── schema-fixed.sql # Schema with specific fixes/patches |
|
| 236 |
+``` |
|
| 237 |
+ |
|
| 238 |
+#### Database Schema Components |
|
| 239 |
+- **Core Tables**: |
|
| 240 |
+ - `hdd_inventory`: Hardware identification and location tracking |
|
| 241 |
+ - `smart_readings`: SMART parameter data with differential storage |
|
| 242 |
+ - `hdd_migrations`: Drive movement logging between nodes/paths |
|
| 243 |
+- **AI Integration**: |
|
| 244 |
+ - `predictions`: AI-generated failure predictions with confidence scores |
|
| 245 |
+ - `alert_history`: Alert notification tracking and escalation |
|
| 246 |
+- **Configuration**: |
|
| 247 |
+ - `smart_thresholds`: Configurable parameter thresholds and alert rules |
|
| 248 |
+ - `system_config`: System-wide configuration parameters |
|
| 249 |
+- **Optimization**: |
|
| 250 |
+ - Differential storage functions (`should_store_smart_reading()`) |
|
| 251 |
+ - Reconstructed views (`smart_readings_reconstructed`) |
|
| 252 |
+ - Change detection algorithms with SHA256 checksums |
|
| 253 |
+- **Indexing**: |
|
| 254 |
+ - Performance-optimized indexes for temporal queries |
|
| 255 |
+ - Hardware identification indexes for fast lookups |
|
| 256 |
+ - Composite indexes for complex query patterns |
|
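The checksum-based change detection listed under **Optimization** can be illustrated with a small shell sketch. This is not the logic of `should_store_smart_reading()` itself, which lives in the schema. The helper names `params_checksum` and `should_store` are hypothetical, the input is assumed to be one `name=value` line per SMART parameter, and `sha256sum` is assumed to be available (GNU coreutils; on macOS `shasum -a 256` is the usual substitute).

```bash
# Checksum of a set of "name=value" lines read from stdin.
# Sorting first makes the result independent of the order of the lines.
params_checksum() {
    LC_ALL=C sort | sha256sum | cut -d' ' -f1
}

# Exit status 0 means "store this reading".
# $1 = checksum of the last stored reading (empty if none), $2 = current checksum.
should_store() {
    previous_checksum=$1
    current_checksum=$2
    [ "$previous_checksum" != "$current_checksum" ]
}
```

A reading whose checksum equals the previous one carries no new information and can be skipped. Forced periodic storage and critical parameters would need extra conditions on top of this comparison.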
| 257 |
+ |
|
| 258 |
+##### Schema Files Details |
|
| 259 |
+- **`schema.sql`** (726 lines): |
|
| 260 |
+ - Complete production database schema |
|
| 261 |
+ - Full table definitions with constraints and indexes |
|
| 262 |
+ - PostgreSQL functions for differential storage |
|
| 263 |
+ - Views for data reconstruction and reporting |
|
| 264 |
+ - Trigger definitions for automated processes |
|
| 265 |
+ |
|
| 266 |
+- **`schema-fixed.sql`** (423 lines): |
|
| 267 |
+ - Schema patches and specific fixes |
|
| 268 |
+ - Migration scripts for schema updates |
|
| 269 |
+ - Performance optimization adjustments |
|
| 270 |
+ - Compatibility fixes for different PostgreSQL versions |
|
| 271 |
+ |
|
| 272 |
+### 📁 `/docs/` - Documentation |
|
| 273 |
+Documentation organized by audience and deployment status: |
|
| 274 |
+ |
|
| 275 |
+``` |
|
| 276 |
+docs/ |
|
| 277 |
+├── README.md # End-user guide (DEPLOYED) |
|
| 278 |
+├── INSTALLATION.md # Setup and configuration (DEPLOYED) |
|
| 279 |
+├── CHANGELOG.md # Release notes for end-users (DEPLOYED) |
|
| 280 |
+├── API.md # OpenAI API configuration (DEPLOYED) |
|
| 281 |
+├── DEVELOPMENT.md # Developer guide (NOT DEPLOYED) |
|
| 282 |
+└── DIFFERENTIAL_STORAGE.md # Technical storage details (NOT DEPLOYED) |
|
| 283 |
+``` |
|
| 284 |
+ |
|
| 285 |
+#### Documentation Deployment Strategy |
|
| 286 |
+- **Deployed docs**: End-user facing documentation |
|
| 287 |
+- **Non-deployed docs**: Developer and technical implementation details |
|
| 288 |
+ |
|
| 289 |
+### 🔧 Key File Relationships |
|
| 290 |
+ |
|
| 291 |
+#### Data Flow Architecture |
|
| 292 |
+``` |
|
| 293 |
+smartmontools → SmartCollector.pm → PostgreSQL → PredictionEngine.pm → OpenAI API |
|
| 294 |
+ ↓ ↓ ↓ ↓ |
|
| 295 |
+autosmart-collector.pl → Database → autosmart-predictor.pl → Reports |
|
| 296 |
+``` |
|
| 297 |
+ |
|
| 298 |
+#### Configuration Hierarchy |
|
| 299 |
+``` |
|
| 300 |
+cluster.conf (global) → node-specific.conf → smart.conf → openai.conf |
|
| 301 |
+ ↓ |
|
| 302 |
+ Individual script configurations |
|
| 303 |
+``` |
|
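One way to read the hierarchy above is as a chain of layers in which later files override earlier ones. The sketch below works under that assumption and further assumes flat `key = value` files. The real parser, file format, and precedence rules may differ, and `merge_configs` is a hypothetical helper name.

```bash
# Print the effective settings from a chain of config files.
# Files are read in the order given; a key seen again replaces the earlier value.
merge_configs() {
    for file in "$@"; do
        # Unreadable or missing layers are skipped silently.
        if [ -r "$file" ]; then
            cat "$file"
            echo
        fi
    done | awk '
        /^[ \t]*(#|$)/      { next }    # comments and blank lines
        index($0, "=") == 0 { next }    # not a key = value line
        {
            pos = index($0, "=")
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (!(key in value)) order[++n] = key   # remember first-seen order
            value[key] = val
        }
        END { for (i = 1; i <= n; i++) print order[i] "=" value[order[i]] }
    '
}
```

For example, `merge_configs cluster.conf node-specific.conf smart.conf openai.conf` would print one `key=value` line per setting, with node-level values replacing cluster-wide defaults.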
| 304 |
+ |
|
| 305 |
+#### Module Dependencies |
|
| 306 |
+``` |
|
| 307 |
+autosmart-collector.pl |
|
| 308 |
+├── SmartCollector.pm |
|
| 309 |
+├── database.conf |
|
| 310 |
+├── smart.conf |
|
| 311 |
+└── cluster.conf |
|
| 312 |
+ |
|
| 313 |
+autosmart-predictor.pl |
|
| 314 |
+├── PredictionEngine.pm |
|
| 315 |
+├── SmartCollector.pm (for data access) |
|
| 316 |
+├── openai.conf |
|
| 317 |
+└── database.conf |
|
| 318 |
+``` |
|
| 319 |
+ |
|
| 320 |
+### 📊 Codebase Metrics |
|
| 321 |
+ |
|
| 322 |
+#### File Type Distribution |
|
| 323 |
+- **Perl Scripts**: 8 production scripts + 4 test scripts (12 total) |
|
| 324 |
+- **Perl Modules**: 2 core modules (1,409 total lines) |
|
| 325 |
+- **Shell Scripts**: 5 deployment/management scripts (1,645 total lines) |
|
| 326 |
+- **SQL Files**: 2 schema files (1,149 total lines) |
|
| 327 |
+- **Configuration**: 7 configuration files (374 total lines) |
|
| 328 |
+- **Documentation**: 6 documentation files (as listed in the `docs/` tree) |
|
| 329 |
+ |
|
| 330 |
+#### Code Complexity by Lines of Code |
|
| 331 |
+- **SmartCollector.pm**: 802 lines (High complexity - hardware integration, differential storage) |
|
| 332 |
+- **PredictionEngine.pm**: 607 lines (Medium complexity - API integration, data processing) |
|
| 333 |
+- **Database Schema**: 726 lines (High complexity - advanced PostgreSQL features) |
|
| 334 |
+- **Deploy Script (`deploy.sh`)**: 697 lines (Medium complexity - system integration) |
|
| 335 |
+- **Report Generation**: 662 lines (Medium complexity - data analysis and formatting) |
|
| 336 |
+- **Migration Analysis**: 615 lines (Medium complexity - pattern detection) |
|
| 337 |
+- **Cluster Monitoring**: 515 lines (Medium complexity - distributed system monitoring) |
|
| 338 |
+ |
|
| 339 |
+#### Total Codebase Size |
|
| 340 |
+- **Production Code**: ~4,500 lines (Perl modules + production scripts) |
|
| 341 |
+- **Deployment & Management**: ~1,800 lines (deployment and monitoring scripts) |
|
| 342 |
+- **Testing Code**: ~600 lines (test scripts and utilities) |
|
| 343 |
+- **Database Schema**: ~1,150 lines (PostgreSQL schema and functions) |
|
| 344 |
+- **Configuration**: ~375 lines (configuration templates and examples) |
|
| 345 |
+- **Total**: ~8,400+ lines of code |
|
| 346 |
+ |
|
| 347 |
+#### Testing Coverage Areas |
|
| 348 |
+- **Unit Tests**: Module-specific functionality testing |
|
| 349 |
+- **Integration Tests**: End-to-end data flow validation |
|
| 350 |
+- **Performance Tests**: Storage efficiency and query optimization benchmarks |
|
| 351 |
+- **Deployment Tests**: Installation and configuration validation across environments |
|
| 352 |
+- **Regression Tests**: Automated testing for core functionality preservation |
|
| 353 |
+ |
|
| 354 |
+### 🏗️ Development Workflow |
|
| 355 |
+ |
|
| 356 |
+#### Getting Started with Development |
|
| 357 |
+1. **Clone Repository**: Set up local development environment |
|
| 358 |
+2. **Database Setup**: Configure PostgreSQL connection to development database |
|
| 359 |
+3. **Perl Dependencies**: Install required CPAN modules |
|
| 360 |
+4. **Configuration**: Copy and customize configuration templates |
|
| 361 |
+5. **Testing**: Run test suite to verify setup |
|
| 362 |
+ |
|
| 363 |
+#### Adding New Features |
|
| 364 |
+1. **Module Development**: Extend existing Perl modules or create new ones |
|
| 365 |
+2. **Script Integration**: Create or modify scripts to use new functionality |
|
| 366 |
+3. **Database Changes**: Update schema if new data structures are needed |
|
| 367 |
+4. **Testing**: Add comprehensive tests for new functionality |
|
| 368 |
+5. **Documentation**: Update both end-user and developer documentation |
|
| 369 |
+ |
|
| 370 |
+#### Code Organization Principles |
|
| 371 |
+- **Separation of Concerns**: Each module and script has a specific, well-defined responsibility |
|
| 372 |
+- **Configuration-Driven**: System behavior is controlled through configuration files rather than hard-coded values |
|
| 373 |
+- **Database-Centric**: PostgreSQL serves as the central data store with business logic in database functions |
|
| 374 |
+- **Modular Design**: Components can be developed, tested, and deployed independently |
|
| 375 |
+- **Error Handling**: Comprehensive error handling and logging throughout all components |
|
| 376 |
+- **Performance-First**: Optimized for high-volume data collection and processing |
|
| 377 |
+- **Scalability**: Designed to scale across multiple nodes in a cluster environment |
|
| 378 |
+ |
|
| 379 |
+#### Development Patterns Used |
|
| 380 |
+- **Factory Pattern**: Configuration-based object creation in Perl modules |
|
| 381 |
+- **Observer Pattern**: Event-driven processing for hardware changes and alerts |
|
| 382 |
+- **Strategy Pattern**: Configurable algorithms for different drive types and thresholds |
|
| 383 |
+- **Template Method**: Standardized data processing pipelines with customizable steps |
|
| 384 |
+- **Singleton Pattern**: Database connection management and configuration loading |
|
| 385 |
+- **Command Pattern**: Script-based operations with standardized interfaces |
|
| 386 |
+ |
|
| 387 |
+#### Code Quality Standards |
|
| 388 |
+- **Perl Best Practices**: `use strict` and `use warnings`, proper lexical scoping, and defensive programming |
|
| 389 |
+- **Database Normalization**: Proper relational design with referential integrity |
|
| 390 |
+- **Configuration Validation**: Input validation and sanitization throughout |
|
| 391 |
+- **Error Recovery**: Graceful degradation and automatic recovery mechanisms |
|
| 392 |
+- **Performance Monitoring**: Built-in performance metrics and optimization tracking |
|
| 393 |
+- **Security Practices**: SQL injection prevention, input validation, and secure configuration management |
|
| 394 |
+ |
|
| 395 |
+## 🏗️ Development Environment Setup |
|
| 396 |
+ |
|
| 397 |
+### Prerequisites |
|
| 398 |
+ |
|
| 399 |
+#### System Requirements |
|
| 400 |
+- **Operating System**: Linux/macOS (tested on macOS, deployed on Proxmox VE) |
|
| 401 |
+- **Perl**: Version 5.20+ with CPAN access |
|
| 402 |
+- **PostgreSQL**: Version 13+ with JSONB and extension support |
|
| 403 |
+- **Git**: For version control and collaboration |
|
| 404 |
+ |
|
| 405 |
+#### Development Database |
|
| 406 |
+```bash |
|
| 407 |
+# Current test database configuration |
|
| 408 |
+Host: 192.168.2.102 |
|
| 409 |
+Database: autosmart |
|
| 410 |
+User: postgres |
|
| 411 |
+Password: (no password) |
|
| 412 |
+Port: 5432 |
|
| 413 |
+``` |
|
| 414 |
+ |
|
| 415 |
+#### Required Perl Modules |
|
| 416 |
+```bash |
|
| 417 |
+# Core database modules |
|
| 418 |
+cpan DBI DBD::Pg |
|
| 419 |
+ |
|
| 420 |
+# JSON processing |
|
| 421 |
+cpan JSON::XS |
|
| 422 |
+ |
|
| 423 |
+# System utilities |
|
| 424 |
+cpan Config::Simple File::Slurp Time::HiRes |
|
| 425 |
+ |
|
| 426 |
+# Security and hashing |
|
| 427 |
+cpan Digest::SHA |
|
| 428 |
+ |
|
| 429 |
+# HTTP/API clients (for OpenAI integration) |
|
| 430 |
+cpan LWP::UserAgent HTTP::Request::Common |
|
| 431 |
+ |
|
| 432 |
+# Optional: Development and testing |
|
| 433 |
+cpan Data::Dumper Test::More Test::Exception |
|
| 434 |
+``` |
|
| 435 |
+ |
|
| 436 |
+### Development Workflow |
|
| 437 |
+ |
|
| 438 |
+#### 1. Environment Setup |
|
| 439 |
+```bash |
|
| 440 |
+# Clone the project |
|
| 441 |
+cd /Users/bogdan/Documents/workspace/ |
|
| 442 |
+git clone <autoSMART-repo> |
|
| 443 |
+cd autoSMART |
|
| 444 |
+ |
|
| 445 |
+# Set environment variables |
|
| 446 |
+export AUTOSMART_DB_HOST=192.168.2.102 |
|
| 447 |
+export AUTOSMART_DB_NAME=autosmart |
|
| 448 |
+export AUTOSMART_DB_USER=postgres |
|
| 449 |
+export AUTOSMART_DB_PASS= |
|
| 450 |
+export AUTOSMART_DB_PORT=5432 |
|
| 451 |
+ |
|
| 452 |
+# Optional: OpenAI API key for AI features |
|
| 453 |
+export OPENAI_API_KEY=your-api-key-here |
|
| 454 |
+``` |
|
| 455 |
+ |
|
| 456 |
+#### 2. Database Setup |
|
| 457 |
+```bash |
|
| 458 |
+# Initialize the database schema |
|
| 459 |
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql |
|
| 460 |
+ |
|
| 461 |
+# Verify installation |
|
| 462 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c "\\dt" |
|
| 463 |
+``` |
|
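The `psql` commands above repeat the host and user literally. A possible alternative is to derive the connection string from the `AUTOSMART_DB_*` variables set during environment setup. The sketch below builds a libpq connection URI, which `psql` accepts in place of a database name. The function name `autosmart_db_uri` and the fallback defaults (`localhost`, `postgres`, `5432`, `autosmart`) are assumptions for illustration.

```bash
# Build a libpq connection URI from the AUTOSMART_DB_* environment variables.
# Unset variables fall back to illustrative defaults.
autosmart_db_uri() {
    printf 'postgresql://%s@%s:%s/%s\n' \
        "${AUTOSMART_DB_USER:-postgres}" \
        "${AUTOSMART_DB_HOST:-localhost}" \
        "${AUTOSMART_DB_PORT:-5432}" \
        "${AUTOSMART_DB_NAME:-autosmart}"
}

# Usage (not executed here):
#   psql "$(autosmart_db_uri)" -f sql/schema.sql
```

The password is deliberately kept out of the URI. If one is configured, libpq reads it from the `PGPASSWORD` environment variable or from `~/.pgpass`.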
| 464 |
+ |
|
| 465 |
+#### 3. Testing Environment |
|
| 466 |
+```bash |
|
| 467 |
+# Run the differential storage test suite |
|
| 468 |
+cd scripts/ |
|
| 469 |
+perl test-differential-storage.pl |
|
| 470 |
+ |
|
| 471 |
+# Test database connectivity |
|
| 472 |
+perl -e " |
|
| 473 |
+use DBI; |
|
| 474 |
+my \$dsn = 'DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432'; |
|
| 475 |
+my \$dbh = DBI->connect(\$dsn, 'postgres', '', {RaiseError => 1});
|
|
| 476 |
+print \"Database connection successful!\\n\"; |
|
| 477 |
+\$dbh->disconnect(); |
|
| 478 |
+" |
|
| 479 |
+``` |
|
| 480 |
+ |
|
| 481 |
+## 🧩 Architecture Overview |
|
| 482 |
+ |
|
| 483 |
+### System Components |
|
| 484 |
+ |
|
| 485 |
+``` |
|
| 486 |
+autoSMART Architecture |
|
| 487 |
+┌─────────────────────────────────────────────────────────────┐ |
|
| 488 |
+│ Proxmox Cluster │ |
|
| 489 |
+├─────────────────────┬─────────────────────┬─────────────────┤ |
|
| 490 |
+│ Node 1 │ Node 2 │ Node 3 │ |
|
| 491 |
+│ │ │ │ |
|
| 492 |
+│ ┌─── SmartCollector ┤ ┌─── SmartCollector ┤ ┌─── SmartCollector |
|
| 493 |
+│ │ - HDD Scanning │ │ - HDD Scanning │ │ - HDD Scanning |
|
| 494 |
+│ │ - SMART Reading │ │ - SMART Reading │ │ - SMART Reading |
|
| 495 |
+│ │ - Migration Det │ │ - Migration Det │ │ - Migration Det |
|
| 496 |
+│ └─── Data Storage │ └─── Data Storage │ └─── Data Storage |
|
| 497 |
+└─────────────────────┴─────────────────────┴─────────────────┘ |
|
| 498 |
+ │ |
|
| 499 |
+ ┌────────▼─────────┐ |
|
| 500 |
+ │ PostgreSQL DB │ |
|
| 501 |
+ │ │ |
|
| 502 |
+ │ • HDD Inventory │ |
|
| 503 |
+ │ • SMART Readings │ |
|
| 504 |
+ │ • Migrations │ |
|
| 505 |
+ │ • AI Predictions │ |
|
| 506 |
+ └────────┬─────────┘ |
|
| 507 |
+ │ |
|
| 508 |
+ ┌──────────▼───────────┐ |
|
| 509 |
+ │ SmartAnalyzer │ |
|
| 510 |
+ │ │ |
|
| 511 |
+ │ • OpenAI API │ |
|
| 512 |
+ │ • Failure Prediction │ |
|
| 513 |
+ │ • Pattern Analysis │ |
|
| 514 |
+ └──────────┬───────────┘ |
|
| 515 |
+ │ |
|
| 516 |
+ ┌──────────▼───────────┐ |
|
| 517 |
+ │ SmartReporter │ |
|
| 518 |
+ │ │ |
|
| 519 |
+ │ • Alert Generation │ |
|
| 520 |
+ │ • Report Creation │ |
|
| 521 |
+ │ • Dashboard Data │ |
|
| 522 |
+ └──────────────────────┘ |
|
| 523 |
+``` |
|
| 524 |
+ |
|
| 525 |
+### Data Flow |
|
| 526 |
+ |
|
| 527 |
+1. **Collection Phase**: |
|
| 528 |
+ - SmartCollector scans HDDs on each node |
|
| 529 |
+ - Hardware identification (serial + model) |
|
| 530 |
+ - Migration detection if HDD moved |
|
| 531 |
+ - Differential storage decision |
|
| 532 |
+ - Store only changed/critical data |
|
| 533 |
+ |
|
| 534 |
+2. **Analysis Phase**: |
|
| 535 |
+ - SmartAnalyzer processes stored data |
|
| 536 |
+ - Historical pattern analysis |
|
| 537 |
+ - OpenAI API calls for predictions |
|
| 538 |
+ - Risk assessment and trending |
|
| 539 |
+ |
|
| 540 |
+3. **Reporting Phase**: |
|
| 541 |
+ - SmartReporter generates alerts |
|
| 542 |
+ - Dashboard data preparation |
|
| 543 |
+ - Health reports creation |
|
| 544 |
+ - Maintenance recommendations |
|
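The migration-detection step of the collection phase can be sketched as a lookup on hardware identity (serial + model) followed by a comparison of location. The sketch assumes a flat inventory file with `serial|model|node|device_path` lines purely for illustration. The real system keeps this state in `hdd_inventory` and records moves in `hdd_migrations`, and `detect_migration` is a hypothetical helper name.

```bash
# Classify a drive as "new", "unchanged", or "migrated <old> -> <new>".
# $1 = inventory file, $2 = serial, $3 = model, $4 = current node, $5 = current device path
detect_migration() {
    inventory=$1 serial=$2 model=$3 node=$4 path=$5
    awk -F'|' -v s="$serial" -v m="$model" -v n="$node" -v p="$path" '
        $1 == s && $2 == m {
            found = 1
            if ($3 != n || $4 != p)
                print "migrated " $3 ":" $4 " -> " n ":" p
            else
                print "unchanged"
        }
        END { if (!found) print "new" }
    ' "$inventory"
}
```

Matching on serial and model rather than on the device path is what lets a drive keep its history when it moves between nodes or is renamed by the kernel.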
| 545 |
+ |
|
| 546 |
+## 🔧 Module Development |
|
| 547 |
+ |
|
| 548 |
+### SmartCollector.pm Development |
|
| 549 |
+ |
|
| 550 |
+#### Key Methods to Understand |
|
| 551 |
+```perl |
|
| 552 |
+# Hardware identification and migration detection |
|
| 553 |
+sub _detect_or_create_hdd($drive_info, $smart_data) |
|
| 554 |
+ |
|
| 555 |
+# Differential storage decision making |
|
| 556 |
+sub _should_store_reading($hdd_id, $smart_data) |
|
| 557 |
+ |
|
| 558 |
+# Optimized data storage |
|
| 559 |
+sub _insert_smart_reading_differential($hdd_id, $drive_info, $smart_data, $storage_info) |
|
| 560 |
+``` |
|
| 561 |
+ |
|
| 562 |
+#### Adding New Features |
|
| 563 |
+1. **New SMART Parameters**: |
|
| 564 |
+ ```perl |
|
| 565 |
+ # Add parameter processing in collect_smart_data() |
|
| 566 |
+ if ($line =~ /New_Parameter.*\s+(\d+)/) {
|
|
| 567 |
+ $smart_data->{parameters}{'New_Parameter'} = $1;
|
|
| 568 |
+ } |
|
| 569 |
+ ``` |
|
| 570 |
+ |
|
| 571 |
+2. **Custom Manufacturer Detection**: |
|
| 572 |
+ ```perl |
|
| 573 |
+ # Extend _detect_manufacturer() method |
|
| 574 |
+ sub _detect_manufacturer {
|
|
| 575 |
+ my ($self, $model) = @_; |
|
| 576 |
+ return 'Custom_Manufacturer' if $model =~ /CUSTOM_PATTERN/; |
|
| 577 |
+ # ... existing logic |
|
| 578 |
+ } |
|
| 579 |
+ ``` |
|
| 580 |
+ |
|
| 581 |
+### SmartAnalyzer.pm Development |
|
| 582 |
+ |
|
| 583 |
+#### AI Integration Patterns |
|
| 584 |
+```perl |
|
| 585 |
+# OpenAI API call structure |
|
| 586 |
+sub _call_openai_api {
|
|
| 587 |
+ my ($self, $prompt, $smart_data) = @_; |
|
| 588 |
+ |
|
| 589 |
+ my $request = HTTP::Request->new(POST => 'https://api.openai.com/v1/chat/completions'); |
|
| 590 |
+ $request->header('Authorization' => "Bearer $self->{openai_api_key}");
|
|
| 591 |
+ $request->header('Content-Type' => 'application/json');
|
|
| 592 |
+ |
|
| 593 |
+ my $payload = {
|
|
| 594 |
+ model => "gpt-4", |
|
| 595 |
+ messages => [ |
|
| 596 |
+ {
|
|
| 597 |
+ role => "system", |
|
| 598 |
+ content => "You are an expert in HDD failure prediction..." |
|
| 599 |
+ }, |
|
| 600 |
+ {
|
|
| 601 |
+ role => "user", |
|
| 602 |
+ content => $prompt |
|
| 603 |
+ } |
|
| 604 |
+ ] |
|
| 605 |
+ }; |
|
| 606 |
+ |
|
| 607 |
+ # ... handle response |
|
| 608 |
+} |
|
| 609 |
+``` |
|
| 610 |
+ |
|
| 611 |
+## 🗃️ Database Development |
|
| 612 |
+ |
|
| 613 |
+### Schema Evolution |
|
| 614 |
+ |
|
| 615 |
+#### Adding New Tables |
|
| 616 |
+```sql |
|
| 617 |
+-- Always include migration scripts |
|
| 618 |
+CREATE TABLE new_feature ( |
|
| 619 |
+ id SERIAL PRIMARY KEY, |
|
| 620 |
+ hdd_id INTEGER REFERENCES hdd_inventory(id), |
|
| 621 |
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 622 |
+); |
|
| 623 |
+ |
|
| 624 |
+-- Add indexes for performance |
|
| 625 |
+CREATE INDEX idx_new_feature_hdd_id ON new_feature(hdd_id); |
|
| 626 |
+``` |
|
| 627 |
+ |
|
| 628 |
+#### Modifying Existing Tables |
|
| 629 |
+```sql |
|
| 630 |
+-- Use ALTER statements for compatibility; run CREATE INDEX CONCURRENTLY outside a transaction block |
|
| 631 |
+ALTER TABLE smart_readings ADD COLUMN new_field VARCHAR(100); |
|
| 632 |
+CREATE INDEX CONCURRENTLY idx_smart_readings_new_field ON smart_readings(new_field); |
|
| 633 |
+``` |
|
| 634 |
+ |
|
| 635 |
+### Query Optimization |
|
| 636 |
+ |
|
| 637 |
+#### Efficient SMART Data Queries |
|
| 638 |
+```sql |
|
| 639 |
+-- Use the reconstructed view for complete data |
|
| 640 |
+SELECT * FROM smart_readings_reconstructed |
|
| 641 |
+WHERE hdd_id = $1 |
|
| 642 |
+ AND timestamp > NOW() - INTERVAL '30 days' |
|
| 643 |
+ORDER BY timestamp DESC; |
|
| 644 |
+ |
|
| 645 |
+-- Use raw table for storage statistics |
|
| 646 |
+SELECT reading_type, COUNT(*) |
|
| 647 |
+FROM smart_readings |
|
| 648 |
+WHERE timestamp > NOW() - INTERVAL '7 days' |
|
| 649 |
+GROUP BY reading_type; |
|
| 650 |
+``` |
|
| 651 |
+ |
|
| 652 |
+## 🧪 Testing Guidelines |
|
| 653 |
+ |
|
| 654 |
+### Unit Testing |
|
| 655 |
+```perl |
|
| 656 |
+# Example test structure |
|
| 657 |
+use Test::More tests => 2; # must match the number of assertions below |
|
| 658 |
+use lib '../lib'; |
|
| 659 |
+use SmartCollector; |
|
| 660 |
+ |
|
| 661 |
+my $collector = SmartCollector->new({
|
|
| 662 |
+ db_host => '192.168.2.102', |
|
| 663 |
+ db_name => 'autosmart_test', |
|
| 664 |
+ # ... test config |
|
| 665 |
+}); |
|
| 666 |
+ |
|
| 667 |
+# Test hardware identification |
|
| 668 |
+my $hdd_id = $collector->_detect_or_create_hdd($drive_info, $smart_data); |
|
| 669 |
+ok($hdd_id > 0, "HDD identification successful"); |
|
| 670 |
+ |
|
| 671 |
+# Test differential storage |
|
| 672 |
+my $storage_decision = $collector->_should_store_reading($hdd_id, $smart_data); |
|
| 673 |
+ok($storage_decision->{store}, "Storage decision made");
|
|
| 674 |
+``` |
|
| 675 |
+ |
|
| 676 |
+### Integration Testing |
|
| 677 |
+```bash |
|
| 678 |
+# Run the comprehensive test suite |
|
| 679 |
+cd scripts/ |
|
| 680 |
+perl test-differential-storage.pl |
|
| 681 |
+ |
|
| 682 |
+# Test with real hardware (if available) |
|
| 683 |
+perl collect-smart-data.pl --test-mode --device /dev/sdb |
|
| 684 |
+``` |
|
| 685 |
+ |
|
| 686 |
+### Performance Testing |
|
| 687 |
+```sql |
|
| 688 |
+-- Test query performance |
|
| 689 |
+EXPLAIN ANALYZE |
|
| 690 |
+SELECT * FROM smart_readings_reconstructed |
|
| 691 |
+WHERE hdd_id IN (1,2,3,4,5) |
|
| 692 |
+ AND timestamp > NOW() - INTERVAL '90 days'; |
|
| 693 |
+ |
|
| 694 |
+-- Monitor storage efficiency |
|
| 695 |
+SELECT |
|
| 696 |
+ reading_type, |
|
| 697 |
+ COUNT(*) as readings, |
|
| 698 |
+ AVG(octet_length(parameters_json::text)) as avg_size_bytes |
|
| 699 |
+FROM smart_readings |
|
| 700 |
+WHERE timestamp > NOW() - INTERVAL '24 hours' |
|
| 701 |
+GROUP BY reading_type; |
|
| 702 |
+``` |
|
| 703 |
+ |
|
| 704 |
+## 🔍 Debugging and Troubleshooting |
|
| 705 |
+ |
|
| 706 |
+### Logging System |
|
| 707 |
+```perl |
|
| 708 |
+# Enable debug logging |
|
| 709 |
+$ENV{AUTOSMART_DEBUG} = 4; # Maximum verbosity (level 4, debug everything)
|
|
| 710 |
+ |
|
| 711 |
+# Log levels: |
|
| 712 |
+# 1 = Errors only |
|
| 713 |
+# 2 = Warnings and errors |
|
| 714 |
+# 3 = Info, warnings, errors |
|
| 715 |
+# 4 = Debug everything |
|
| 716 |
+``` |
|
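The level table above amounts to a simple threshold check. As a sketch, a shell helper honoring the same `AUTOSMART_DEBUG` variable might look like the following. Whether the project's shell scripts read this variable at all is an assumption, as are the function name `autosmart_log`, the labels, and the default level of 1.

```bash
# Print a message to stderr if its level is within the configured verbosity.
# $1 = message level (1 = error ... 4 = debug), remaining arguments = message text.
autosmart_log() {
    level=$1
    shift
    # Messages above the configured level are dropped.
    [ "$level" -le "${AUTOSMART_DEBUG:-1}" ] || return 0
    case $level in
        1) label=ERROR ;;
        2) label=WARN ;;
        3) label=INFO ;;
        *) label=DEBUG ;;
    esac
    printf '[%s] %s\n' "$label" "$*" >&2
}
```

With `AUTOSMART_DEBUG=2`, a call such as `autosmart_log 3 "scan finished"` prints nothing, while `autosmart_log 1 "smartctl failed"` is always shown.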
| 717 |
+ |
|
| 718 |
+### Common Issues |
|
| 719 |
+ |
|
| 720 |
+#### Database Connection Problems |
|
| 721 |
+```bash |
|
| 722 |
+# Test database connectivity |
|
| 723 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT version();" |
|
| 724 |
+ |
|
| 725 |
+# Check permissions |
|
| 726 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c "\\dp smart_readings" |
|
| 727 |
+``` |
|
| 728 |
+ |
|
| 729 |
+#### SMART Data Collection Issues |
|
| 730 |
+```bash |
|
| 731 |
+# Test smartctl access |
|
| 732 |
+sudo smartctl -a /dev/sda |
|
| 733 |
+ |
|
| 734 |
+# Check permissions |
|
| 735 |
+ls -la /dev/sd* |
|
| 736 |
+``` |
|
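When checking by hand what the collector should be seeing, it can help to reduce the `smartctl` attribute table to `name=value` pairs. The sketch below assumes the classic ATA attribute layout printed by `smartctl -A` (ten columns, raw value last). NVMe and SAS drives report health data in a different format and are not handled. `parse_smart_attributes` is a hypothetical helper name.

```bash
# Read `smartctl -A` output on stdin and print "ATTRIBUTE_NAME=raw_value" lines.
# Only rows that start with a numeric attribute ID are considered.
parse_smart_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 { print $2 "=" $10 }'
}

# Usage (not executed here):
#   sudo smartctl -A /dev/sda | parse_smart_attributes
```

Only the first token of the raw value is kept, so a temperature reported as `34 (Min/Max 20/48)` becomes `Temperature_Celsius=34`.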
| 737 |
+ |
|
| 738 |
+#### Migration Detection Problems |
|
| 739 |
+```sql |
|
| 740 |
+-- Check migration logs |
|
| 741 |
+SELECT * FROM hdd_migrations |
|
| 742 |
+ORDER BY detected_at DESC |
|
| 743 |
+LIMIT 10; |
|
| 744 |
+ |
|
| 745 |
+-- Verify HDD inventory |
|
| 746 |
+SELECT serial_number, model_name, current_device_path, current_node_id |
|
| 747 |
+FROM hdd_inventory |
|
| 748 |
+WHERE status = 'active'; |
|
| 749 |
+``` |
|
| 750 |
+ |
|
| 751 |
+## 📊 Performance Monitoring |
|
| 752 |
+ |
|
| 753 |
+### Database Performance |
|
| 754 |
+```sql |
|
| 755 |
+-- Monitor table sizes |
|
| 756 |
+SELECT schemaname, tablename, |
|
| 757 |
+ pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size |
|
| 758 |
+FROM pg_tables |
|
| 759 |
+WHERE schemaname = 'public' |
|
| 760 |
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC; |
|
| 761 |
+ |
|
| 762 |
+-- Monitor query performance (requires the pg_stat_statements extension; column names as of PostgreSQL 13+) |
|
| 763 |
+SELECT query, mean_exec_time, calls |
|
| 764 |
+FROM pg_stat_statements |
|
| 765 |
+WHERE query LIKE '%smart_readings%' |
|
| 766 |
+ORDER BY mean_exec_time DESC; |
|
| 767 |
+``` |
|
| 768 |
+ |
|
| 769 |
+### Application Performance |
|
| 770 |
+```perl |
|
| 771 |
+# Add timing to critical operations |
|
| 772 |
+use Time::HiRes qw(time); |
|
| 773 |
+ |
|
| 774 |
+my $start_time = time(); |
|
| 775 |
+my $result = $self->collect_smart_data($device_path); |
|
| 776 |
+my $duration = time() - $start_time; |
|
| 777 |
+ |
|
| 778 |
+$self->_log("SMART collection took ${duration}s for $device_path", 3);
|
|
| 779 |
+``` |
|
| 780 |
+ |
|
| 781 |
+## 🚀 Deployment Guidelines |
|
| 782 |
+ |
|
| 783 |
+### Production Deployment |
|
| 784 |
+1. **Database Setup**: |
|
| 785 |
+ - Use dedicated PostgreSQL server |
|
| 786 |
+ - Configure proper backup strategy |
|
| 787 |
+ - Set up monitoring and alerting |
|
| 788 |
+ |
|
| 789 |
+2. **Security Configuration**: |
|
| 790 |
+ - Use dedicated database users with minimal privileges |
|
| 791 |
+ - Secure API keys and configuration files |
|
| 792 |
+ - Enable SSL connections for database |
|
| 793 |
+ |
|
| 794 |
+3. **Performance Tuning**: |
|
| 795 |
+ - Configure PostgreSQL for time-series workload |
|
| 796 |
+ - Set up proper indexing strategy |
|
| 797 |
+ - Monitor and optimize slow queries |
|
| 798 |
+ |
|
| 799 |
+### Proxmox Integration |
|
| 800 |
+```bash |
|
| 801 |
+# Install on cluster nodes |
|
| 802 |
+for node in pve01 pve02 pve03; do |
|
| 803 |
+ scp -r autoSMART/ root@$node:/etc/pve/ |
|
| 804 |
+done |
|
| 805 |
+ |
|
| 806 |
+# Configure systemd services (run these on each node) |
|
| 807 |
+systemctl enable autosmart-collector |
|
| 808 |
+systemctl start autosmart-collector |
|
| 809 |
+``` |
|
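The loop above stops giving useful feedback once one node is unreachable. A slightly more defensive variant, sketched here, runs a per-node step for every node and reports which ones failed. The node names and the `/etc/pve/` target are taken from the example above. The helper names `for_each_node` and `deploy_node` are hypothetical, and passwordless root SSH between nodes is assumed.

```bash
NODES="pve01 pve02 pve03"

# Run the given command once per node, with the node name appended as last argument.
# Continues past failures and returns non-zero if any node failed.
for_each_node() {
    failed=""
    for node in $NODES; do
        "$@" "$node" || failed="$failed $node"
    done
    if [ -n "$failed" ]; then
        echo "failed on:$failed" >&2
        return 1
    fi
}

# Copy the project to one node, then enable and start the collector there.
deploy_node() {
    node=$1
    scp -r autoSMART/ "root@$node:/etc/pve/" &&
        ssh "root@$node" systemctl enable --now autosmart-collector
}

# Usage (not executed here):
#   for_each_node deploy_node
```

Running `systemctl` through `ssh` enables the service on the target node rather than on the machine that starts the deployment.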
| 810 |
+ |
|
| 811 |
+## 📚 Additional Resources |
|
| 812 |
+ |
|
| 813 |
+### Useful Commands |
|
| 814 |
+```bash |
|
| 815 |
+# Monitor system in real-time |
|
| 816 |
+watch -n 30 'psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT COUNT(*) FROM smart_readings WHERE timestamp > NOW() - INTERVAL '\''1 hour'\''"' |
|
| 817 |
+ |
|
| 818 |
+# Generate performance report |
|
| 819 |
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/performance-report.sql |
|
| 820 |
+``` |
|
| 821 |
+ |
|
| 822 |
+### Development Tools |
|
| 823 |
+- **pgAdmin**: Database administration and query development |
|
| 824 |
+- **Perl::Critic**: Code quality analysis |
|
| 825 |
+- **Perl::Tidy**: Code formatting |
|
| 826 |
+- **Git**: Version control with feature branches |
|
| 827 |
+ |
|
| 828 |
+## 📝 Developer Changelog |
|
| 829 |
+ |
|
| 830 |
+This section contains detailed technical changes, internal API modifications, and development-specific information that are not relevant to end users. |
|
| 831 |
+ |
|
| 832 |
+### [1.0.0] - 2025-08-15 - Development Details |
|
| 833 |
+ |
|
| 834 |
+#### 🏗️ Architecture Changes |
|
| 835 |
+- **Database Schema Evolution**: Complete redesign from simple SMART storage to differential storage architecture |
|
| 836 |
+- **Hardware Tracking Implementation**: Added `hdd_inventory` and `hdd_migrations` tables for hardware-based identification |
|
| 837 |
+- **Differential Storage Engine**: Implemented `should_store_smart_reading()` PostgreSQL function with configurable change detection |
|
| 838 |
+- **Migration Detection Algorithm**: Created automatic hardware migration detection using serial numbers and model matching |
|
| 839 |
+ |
|
| 840 |
+#### 🔧 Internal API Changes |
|
| 841 |
+- **SmartCollector.pm Refactor**: |
|
| 842 |
+ - Added hardware identification methods (`identify_hardware()`, `detect_migration()`) |
|
| 843 |
+ - Implemented differential storage integration (`should_store_reading()`) |
|
| 844 |
+ - Added storage efficiency monitoring |
|
| 845 |
+ - Breaking change: Constructor now requires database handle |
|
| 846 |
+- **Database Functions**: |
|
| 847 |
+ - Added `should_store_smart_reading(jsonb, text, text, interval, text[])` function |
|
| 848 |
+ - Added `smart_readings_reconstructed` view for seamless data access |
|
| 849 |
+ - Added migration tracking triggers |
|
| 850 |
+- **Configuration Schema**: |
|
| 851 |
+ - Split configuration into a cluster-wide file (`cluster.conf`) and a node-specific file (`autosmart.conf`) |
|
| 852 |
+ - Added differential storage parameters (`force_storage_interval`, `critical_parameters`) |
|
| 853 |
+ |
|
| 854 |
+#### 🧪 Testing Infrastructure |
|
| 855 |
+- **Differential Storage Test Suite**: Added comprehensive test coverage in `test-differential-storage.pl` |
|
| 856 |
+- **Migration Detection Tests**: Validated hardware tracking across different scenarios |
|
| 857 |
+- **Performance Benchmarks**: Established baseline performance metrics for storage efficiency |
|
| 858 |
+- **Database Integration Tests**: Added tests for PostgreSQL function behavior |
|
| 859 |
+ |
|
| 860 |
+#### 📊 Performance Optimizations |
|
| 861 |
+- **Storage Efficiency**: Achieved 60-80% database size reduction through differential storage |
|
| 862 |
+- **Query Optimization**: Added proper indexing for hardware tracking and temporal queries |
|
| 863 |
+- **Background Processing**: Implemented non-blocking collection and analysis workflows |
|
| 864 |
+- **Memory Management**: Optimized Perl module memory usage for long-running processes |
|
| 865 |
+ |
|
| 866 |
+#### 🔒 Security Enhancements |
|
| 867 |
+- **Configuration Security**: Separated sensitive configuration from shared cluster config |
|
| 868 |
+- **Database Security**: Implemented proper user permissions and access controls |
|
| 869 |
+- **API Key Management**: Secure storage and rotation procedures for OpenAI API keys |
|
| 870 |
+- **Audit Trail**: Complete logging of all system changes and data access |
|
| 871 |
+ |
|
| 872 |
+#### 🐛 Known Technical Issues |
|
| 873 |
+- **Large Dataset Performance**: Initial data collection on large clusters may require tuning |
|
| 874 |
+- **Migration Detection Edge Cases**: Rare scenarios with identical drives may need manual verification |
|
| 875 |
+- **PostgreSQL Version Compatibility**: Requires PostgreSQL 13+ for JSONB and advanced indexing features |
|
| 876 |
+- **Perl Module Dependencies**: Some CPAN modules may require system-level library installation |
|
| 877 |
+ |
|
| 878 |
+#### 🔮 Technical Roadmap |
|
| 879 |
+- **Phase 2**: Real-time streaming data collection with Apache Kafka |
|
| 880 |
+- **Phase 3**: Machine learning model training on historical data |
|
| 881 |
+- **Phase 4**: Integration with Proxmox VE API for automated responses |
|
| 882 |
+- **Phase 5**: Multi-tenant architecture for managed service providers |
|
| 883 |
+ |
|
| 884 |
+#### 💻 Development Environment Notes |
|
| 885 |
+- **Test Database**: Currently using `192.168.2.102` for development and testing |
|
| 886 |
+- **Perl Version**: Developed and tested on Perl 5.32+ |
|
| 887 |
+- **PostgreSQL Extensions**: Requires `uuid-ossp` and `btree_gin` extensions |
|
| 888 |
+- **Development Workflow**: Feature branch development with PR reviews required |
|
| 889 |
+ |
|
| 890 |
+## 🔧 Technical Reference for Developers |
|
| 891 |
+ |
|
| 892 |
+### Database Schema Reference |
|
| 893 |
+- **Primary location**: `../sql/schema.sql` |
|
| 894 |
+- **Documentation**: [DIFFERENTIAL_STORAGE.md](DIFFERENTIAL_STORAGE.md), [MIGRATION_DETECTION.md](MIGRATION_DETECTION.md) |
|
| 895 |
+- **Sample queries**: `../sql/sample-queries.sql` |
|
| 896 |
+- **Migration scripts**: `../sql/migrations/` |
|
| 897 |
+ |
|
| 898 |
+### Perl Module Architecture |
|
| 899 |
+- **SmartCollector.pm**: Data collection and hardware tracking |
|
| 900 |
+ - Hardware manufacturer detection |
|
| 901 |
+ - Migration detection and logging |
|
| 902 |
+ - Differential storage integration |
|
| 903 |
+ - Storage efficiency monitoring |
|
| 904 |
+- **SmartAnalyzer.pm**: AI-powered analysis and predictions |
|
| 905 |
+- **SmartReporter.pm**: Report generation and alerting |
|
| 906 |
+- **Module documentation**: Inline POD documentation in each module |
|
| 907 |
+ |
|
| 908 |
+### Configuration Management |
|
| 909 |
+- **Cluster config**: `../config/cluster.conf` (shared across all nodes) |
|
| 910 |
+- **Node config**: `../config/defaults/autosmart` (node-specific settings) |
|
| 911 |
+- **OpenAI config**: `../config/openai.conf` (API configuration) |
|
| 912 |
+- **Configuration documentation**: [INSTALLATION.md](INSTALLATION.md) |
|
| 913 |
+ |
|
| 914 |
+### Scripts and Development Tools |
|
| 915 |
+- **Collection**: `../scripts/collect-smart-data.pl` |
|
| 916 |
+- **Analysis**: `../scripts/analyze-smart-data.pl` |
|
| 917 |
+- **Reporting**: `../scripts/generate-reports.pl` |
|
| 918 |
+- **Testing**: `../scripts/test-differential-storage.pl` |
|
| 919 |
+- **Deployment**: `../scripts/deploy.sh`, `../scripts/deploy-production.sh` |
|
| 920 |
+ |
|
| 921 |
+### Development Scenarios |
|
| 922 |
+ |
|
| 923 |
+#### Scenario 1: Adding New SMART Parameters |
|
| 924 |
+**Files to modify**: |
|
| 925 |
+1. `lib/SmartCollector.pm` - Add parameter collection logic |
|
| 926 |
+2. `sql/schema.sql` - Update parameter definitions if needed |
|
| 927 |
+3. `scripts/test-differential-storage.pl` - Add parameter tests |
|
| 928 |
+4. `docs/DIFFERENTIAL_STORAGE.md` - Document parameter behavior |
|
| 929 |
+ |
|
| 930 |
+#### Scenario 2: Implementing New AI Prediction Models |
|
| 931 |
+**Files to modify**: |
|
| 932 |
+1. `lib/SmartAnalyzer.pm` - Add new prediction algorithms |
|
| 933 |
+2. `docs/API.md` - Update API integration patterns |
|
| 934 |
+3. `scripts/analyze-smart-data.pl` - Add model selection logic |
|
| 935 |
+4. `sql/schema.sql` - Add prediction result tables if needed |
|
| 936 |
+ |
|
| 937 |
+#### Scenario 3: Performance Optimization |
|
| 938 |
+**Areas to investigate**: |
|
| 939 |
+1. `docs/DIFFERENTIAL_STORAGE.md` - Storage optimization techniques |
|
| 940 |
+2. `sql/schema.sql` - Index optimization |
|
| 941 |
+3. `lib/SmartCollector.pm` - Collection efficiency |
|
| 942 |
+4. PostgreSQL query performance using `EXPLAIN ANALYZE` |
|
| 943 |
+ |
|
| 944 |
+#### Scenario 4: Adding New Hardware Support |
|
| 945 |
+**Files to modify**: |
|
| 946 |
+1. `lib/SmartCollector.pm` - Hardware detection logic |
|
| 947 |
+2. `docs/MIGRATION_DETECTION.md` - Hardware tracking specifications |
|
| 948 |
+3. `scripts/test-differential-storage.pl` - Hardware-specific tests |
|
| 949 |
+4. Configuration templates for new hardware types |
|
| 950 |
+ |
|
| 951 |
+### Code Quality Guidelines |
|
| 952 |
+ |
|
| 953 |
+#### Perl Coding Standards |
|
| 954 |
+```perl |
|
| 955 |
+# Use strict and warnings |
|
| 956 |
+use strict; |
|
| 957 |
+use warnings; |
|
| 958 |
+ |
|
| 959 |
+# Consistent indentation (4 spaces) |
|
| 960 |
+sub example_function {
|
|
| 961 |
+ my ($self, $param) = @_; |
|
| 962 |
+ |
|
| 963 |
+ # Clear variable names |
|
| 964 |
+ my $smart_data = $self->collect_smart_data($param); |
|
| 965 |
+ |
|
| 966 |
+ # Error handling |
|
| 967 |
+ return unless defined $smart_data; |
|
| 968 |
+ |
|
| 969 |
+ return $smart_data; |
|
| 970 |
+} |
|
| 971 |
+``` |
|
| 972 |
+ |
|
| 973 |
+#### Database Development Patterns |
|
| 974 |
+```sql |
|
| 975 |
+-- Use transactions for data consistency |
|
| 976 |
+BEGIN; |
|
| 977 |
+ -- Multiple related operations |
|
| 978 |
+ INSERT INTO hdd_inventory (...) VALUES (...); |
|
| 979 |
+ INSERT INTO smart_readings (...) VALUES (...); |
|
| 980 |
+COMMIT; |
|
| 981 |
+ |
|
| 982 |
+-- Use proper indexing |
|
| 983 |
+CREATE INDEX CONCURRENTLY idx_smart_readings_timestamp |
|
| 984 |
+ON smart_readings(timestamp DESC, serial_number); |
|
| 985 |
+ |
|
| 986 |
+-- Use parameterized queries to prevent SQL injection (Perl DBI, shown as comments):
|
| 987 |
+--   my $sth = $dbh->prepare("SELECT * FROM smart_readings WHERE serial_number = ?");
|
|
| 988 |
+--   $sth->execute($serial_number);
|
| 989 |
+``` |
|
| 990 |
+ |
|
| 991 |
+This development guide provides the foundation for extending and maintaining the autoSMART system. Follow these guidelines to ensure code quality, performance, and reliability. |
|
@@ -0,0 +1,204 @@ |
||
| 1 |
+# autoSMART Differential Storage System |
|
| 2 |
+ |
|
| 3 |
+## Overview |
|
| 4 |
+ |
|
| 5 |
+The autoSMART v1.0 system now implements **differential storage optimization** to significantly reduce database storage requirements while maintaining full data integrity and analysis capabilities. |
|
| 6 |
+ |
|
| 7 |
+## How It Works |
|
| 8 |
+ |
|
| 9 |
+### Storage Strategy |
|
| 10 |
+ |
|
| 11 |
+Instead of storing complete SMART readings for every collection cycle, the system classifies each cycle as one of four cases and stores only what is needed:
|
| 12 |
+ |
|
| 13 |
+1. **Baseline readings** - First reading for each HDD |
|
| 14 |
+2. **Full readings** - When critical parameters change or forced intervals are reached |
|
| 15 |
+3. **Differential readings** - When only non-critical parameters change (stores only the changes) |
|
| 16 |
+4. **Skipped readings** - When no changes are detected (no storage) |
|
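The four cases above can be sketched as a small decision function. This is an illustrative sketch only: the name `classify_reading`, its argument order, and the precedence of the checks are assumptions, not the logic of the actual `should_store_smart_reading()` database function.

```bash
# Illustrative sketch only; names and precedence are assumptions.
# Arguments:
#   $1 has_previous      1 if the drive already has a stored reading, else 0
#   $2 checksum_changed  1 if the new checksum differs from the previous one
#   $3 critical_changed  1 if any critical parameter changed
#   $4 hours_since_full  whole hours since the last stored full reading
#   $5 forced_interval   hours between forced full readings (24 by default)
classify_reading() {
    if [ "$1" -eq 0 ]; then
        echo baseline          # first reading for this drive
    elif [ "$3" -eq 1 ] || [ "$4" -ge "$5" ]; then
        echo full              # critical change or forced interval reached
    elif [ "$2" -eq 1 ]; then
        echo differential      # only non-critical values changed
    else
        echo skipped           # nothing changed, nothing stored
    fi
}

classify_reading 0 1 0 0 24    # prints: baseline
classify_reading 1 1 1 2 24    # prints: full
classify_reading 1 1 0 2 24    # prints: differential
classify_reading 1 0 0 2 24    # prints: skipped
```

Checking the critical and forced-interval conditions before the checksum means a stable drive still gets a periodic full reading, which matches the time-based forcing described in this document.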
| 17 |
+ |
|
| 18 |
+### Change Detection |
|
| 19 |
+ |
|
| 20 |
+The system uses multiple methods to detect changes: |
|
| 21 |
+ |
|
| 22 |
+- **Checksum comparison** - SHA-256 hash of all parameters plus temperature
|
| 23 |
+- **Parameter-level analysis** - Individual SMART parameter change detection |
|
| 24 |
+- **Critical parameter monitoring** - Immediate storage for health-critical changes |
|
| 25 |
+- **Temperature thresholds** - Configurable temperature change sensitivity |
|
| 26 |
+- **Time-based forcing** - Periodic full readings regardless of changes (default: 24 hours) |
|
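The checksum comparison can be pictured with a short shell sketch. It assumes parameters are serialized as sorted `name=value` lines followed by a temperature line (the serialization actually used by `SmartCollector.pm` is not shown in this document) and that `sha256sum` from GNU coreutils is available; the function name `reading_checksum` is hypothetical.

```bash
# Sketch under assumptions: the canonical form (sorted name=value lines followed
# by a temperature line) is hypothetical, not the collector's actual format.
reading_checksum() {
    # $1: newline-separated name=value pairs, $2: temperature in Celsius
    {
        printf '%s\n' "$1" | LC_ALL=C sort    # sort so parameter order does not matter
        printf 'temperature=%s\n' "$2"
    } | sha256sum | cut -d ' ' -f 1           # keep only the 64-character hex digest
}

params=$(printf 'Reallocated_Sector_Ct=0\nSpin_Retry_Count=0')
previous=$(reading_checksum "$params" 35)
current=$(reading_checksum "$params" 36)

if [ "$previous" = "$current" ]; then
    echo "no change detected"
else
    echo "change detected"
fi
```

A 64-character hex digest fits the `checksum VARCHAR(64)` column. Note that a bare checksum flags any temperature change, so the configurable temperature threshold would have to be applied in a separate step.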
| 27 |
+ |
|
| 28 |
+## Database Schema Changes |
|
| 29 |
+ |
|
| 30 |
+### Enhanced smart_readings Table |
|
| 31 |
+ |
|
| 32 |
+```sql |
|
| 33 |
+ALTER TABLE smart_readings ADD COLUMN reading_type VARCHAR(20) DEFAULT 'full'; |
|
| 34 |
+ALTER TABLE smart_readings ADD COLUMN changes_detected BOOLEAN DEFAULT true; |
|
| 35 |
+ALTER TABLE smart_readings ADD COLUMN changed_parameters JSONB; |
|
| 36 |
+ALTER TABLE smart_readings ADD COLUMN previous_reading_id INTEGER REFERENCES smart_readings(id); |
|
| 37 |
+ALTER TABLE smart_readings ADD COLUMN checksum VARCHAR(64); |
|
| 38 |
+``` |
|
| 39 |
+ |
|
| 40 |
+### New PostgreSQL Function |
|
| 41 |
+ |
|
| 42 |
+The `should_store_smart_reading()` function provides intelligent storage decisions: |
|
| 43 |
+ |
|
| 44 |
+```sql |
|
| 45 |
+SELECT * FROM should_store_smart_reading(hdd_id, parameters_json, checksum, current_timestamp);
|
| 46 |
+``` |
|
| 47 |
+ |
|
| 48 |
+Returns: |
|
| 49 |
+- `should_store` - Boolean indicating if reading should be stored |
|
| 50 |
+- `reading_type` - 'baseline', 'full', or 'differential' |
|
| 51 |
+- `changes_detected` - Boolean indicating if changes were found |
|
| 52 |
+- `changed_parameters` - JSON array of changed parameter names |
|
| 53 |
+- `previous_reading_id` - Reference to previous reading for chaining |
|
| 54 |
+ |
|
| 55 |
+### Reconstructed Data View |
|
| 56 |
+ |
|
| 57 |
+The `smart_readings_reconstructed` view uses recursive SQL to rebuild complete SMART data from differential readings: |
|
| 58 |
+ |
|
| 59 |
+```sql |
|
| 60 |
+SELECT * FROM smart_readings_reconstructed WHERE hdd_id = 123; |
|
| 61 |
+``` |
|
| 62 |
+ |
|
| 63 |
+## Configuration Parameters |
|
| 64 |
+ |
|
| 65 |
+Add to `system_config` table: |
|
| 66 |
+ |
|
| 67 |
+```sql |
|
| 68 |
+INSERT INTO system_config (key, value, description) VALUES |
|
| 69 |
+('differential_storage_enabled', 'true', 'Enable differential storage optimization'),
|
|
| 70 |
+('forced_storage_interval_hours', '24', 'Hours between forced full readings'),
|
|
| 71 |
+('critical_parameter_force_store', 'true', 'Force storage for critical parameter changes'),
|
|
| 72 |
+('temperature_change_threshold', '5', 'Temperature change threshold for storage (Celsius)');
|
|
| 73 |
+``` |
|
| 74 |
+ |
|
| 75 |
+## Updated Perl Modules |
|
| 76 |
+ |
|
| 77 |
+### SmartCollector.pm Changes |
|
| 78 |
+ |
|
| 79 |
+1. **New methods**: |
|
| 80 |
+ - `_should_store_reading()` - Check storage requirements |
|
| 81 |
+ - `_insert_smart_reading_differential()` - Store with differential info |
|
| 82 |
+ - `_get_recent_storage_stats()` - Monitor storage efficiency |
|
| 83 |
+ |
|
| 84 |
+2. **Enhanced collection**: |
|
| 85 |
+ - Automatic change detection |
|
| 86 |
+ - Storage type determination |
|
| 87 |
+ - Efficiency reporting |
|
| 88 |
+ |
|
| 89 |
+3. **Storage optimization**: |
|
| 90 |
+ - Only changed parameters stored for differential readings |
|
| 91 |
+ - Checksum validation |
|
| 92 |
+ - Chain reference tracking |
|
| 93 |
+ |
|
| 94 |
+## Benefits |
|
| 95 |
+ |
|
| 96 |
+### Storage Reduction |
|
| 97 |
+ |
|
| 98 |
+Expected storage reduction of **60-80%** for typical HDD environments: |
|
| 99 |
+ |
|
| 100 |
+- **Baseline readings**: ~1% of all readings |
|
| 101 |
+- **Full readings**: ~15-20% of readings (critical changes + forced intervals) |
|
| 102 |
+- **Differential readings**: ~5-15% of readings (minor changes) |
|
| 103 |
+- **Skipped readings**: ~60-75% of readings (no changes) |
|
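As a rough illustration of how these shares translate into savings, the sketch below computes the fraction of collection cycles that produce no row at all. The function name and the sample counts are hypothetical. Actual byte savings also depend on how much smaller differential rows are than full ones, so this row-count estimate is only a lower bound on the reduction.

```bash
# Hypothetical helper: share of collection cycles that stored nothing.
storage_savings_percent() {
    # $1 baseline, $2 full, $3 differential, $4 skipped (counts of collection cycles)
    awk -v b="$1" -v f="$2" -v d="$3" -v s="$4" 'BEGIN {
        total = b + f + d + s
        if (total == 0) { print "0.0"; exit }   # avoid division by zero
        printf "%.1f\n", s * 100 / total
    }'
}

# Sample mix within the ranges listed above: 1 baseline, 18 full,
# 11 differential, and 70 skipped cycles out of 100.
storage_savings_percent 1 18 11 70    # prints: 70.0
```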
| 104 |
+ |
|
| 105 |
+### Performance Impact |
|
| 106 |
+ |
|
| 107 |
+- **Minimal collection overhead**: Single database function call for decision |
|
| 108 |
+- **Fast reconstruction**: Recursive SQL with indexes |
|
| 109 |
+- **Efficient queries**: Reconstructed view handles complexity |
|
| 110 |
+ |
|
| 111 |
+### Data Integrity |
|
| 112 |
+ |
|
| 113 |
+- **Complete reconstruction**: All historical data accessible |
|
| 114 |
+- **Change tracking**: Full audit trail of parameter changes |
|
| 115 |
+- **Critical monitoring**: No loss of important health indicators |
|
| 116 |
+ |
|
| 117 |
+## Usage Examples |
|
| 118 |
+ |
|
| 119 |
+### Collection with Statistics |
|
| 120 |
+ |
|
| 121 |
+```perl |
|
| 122 |
+use SmartCollector; |
|
| 123 |
+ |
|
| 124 |
+my $collector = SmartCollector->new($config); |
|
| 125 |
+my $result = $collector->collect_all(); |
|
| 126 |
+ |
|
| 127 |
+print "Storage efficiency: " . $result->{storage_stats}->{efficiency_percent} . "%\n";
|
|
| 128 |
+print "Differential readings: " . $result->{storage_stats}->{differential} . "\n";
|
|
| 129 |
+``` |
|
| 130 |
+ |
|
| 131 |
+### Testing the System |
|
| 132 |
+ |
|
| 133 |
+Run the comprehensive test suite: |
|
| 134 |
+ |
|
| 135 |
+```bash |
|
| 136 |
+cd /etc/pve/autoSMART |
|
| 137 |
+./scripts/test-differential-storage.pl |
|
| 138 |
+``` |
|
| 139 |
+ |
|
| 140 |
+This will: |
|
| 141 |
+1. Create test HDD entries |
|
| 142 |
+2. Test storage decisions for various change scenarios |
|
| 143 |
+3. Validate data reconstruction |
|
| 144 |
+4. Show storage efficiency statistics |
|
| 145 |
+ |
|
| 146 |
+## Migration from Legacy Data |
|
| 147 |
+ |
|
| 148 |
+Existing installations can migrate seamlessly: |
|
| 149 |
+ |
|
| 150 |
+1. **Schema updates**: Run the enhanced schema SQL |
|
| 151 |
+2. **Existing data**: Marked as 'full' readings automatically |
|
| 152 |
+3. **No data loss**: All existing readings preserved |
|
| 153 |
+4. **Gradual optimization**: New readings use differential storage immediately |
|
| 154 |
+ |
|
| 155 |
+## Monitoring and Maintenance |
|
| 156 |
+ |
|
| 157 |
+### Storage Statistics Query |
|
| 158 |
+ |
|
| 159 |
+```sql |
|
| 160 |
+SELECT |
|
| 161 |
+ reading_type, |
|
| 162 |
+ COUNT(*) as count, |
|
| 163 |
+ COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as percentage |
|
| 164 |
+FROM smart_readings |
|
| 165 |
+WHERE timestamp > NOW() - INTERVAL '7 days' |
|
| 166 |
+GROUP BY reading_type; |
|
| 167 |
+``` |
|
| 168 |
+ |
|
| 169 |
+### Reconstruction Performance |
|
| 170 |
+ |
|
| 171 |
+```sql |
|
| 172 |
+EXPLAIN ANALYZE |
|
| 173 |
+SELECT * FROM smart_readings_reconstructed |
|
| 174 |
+WHERE hdd_id = 123 AND timestamp > NOW() - INTERVAL '30 days'; |
|
| 175 |
+``` |
|
| 176 |
+ |
|
| 177 |
+### Space Savings Report |
|
| 178 |
+ |
|
| 179 |
+```sql |
|
| 180 |
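+SELECT  -- Note: skipped cycles are not stored, so savings_percentage is 0 unless they are also logged as rows with reading_type = 'skipped'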
|
| 181 |
+ COUNT(*) as total_possible_readings, |
|
| 182 |
+ COUNT(*) FILTER (WHERE reading_type != 'skipped') as stored_readings, |
|
| 183 |
+ (COUNT(*) FILTER (WHERE reading_type != 'skipped') * 100.0 / COUNT(*)) as storage_percentage, |
|
| 184 |
+ (100 - (COUNT(*) FILTER (WHERE reading_type != 'skipped') * 100.0 / COUNT(*))) as savings_percentage |
|
| 185 |
+FROM smart_readings |
|
| 186 |
+WHERE timestamp > NOW() - INTERVAL '30 days'; |
|
| 187 |
+``` |
|
| 188 |
+ |
|
| 189 |
+## Critical Parameters List |
|
| 190 |
+ |
|
| 191 |
+Default parameters that trigger immediate full storage: |
|
| 192 |
+- Reallocated_Sector_Ct |
|
| 193 |
+- Current_Pending_Sector |
|
| 194 |
+- Offline_Uncorrectable |
|
| 195 |
+- Reallocated_Event_Count |
|
| 196 |
+- Spin_Retry_Count |
|
| 197 |
+ |
|
| 198 |
+Configure in `smart_thresholds` table with `weight >= 8.0`. |
|
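A minimal sketch of checking a parameter name against the default list above. The helper name `is_critical_param` is hypothetical, and hard-coding the list is a simplification: this document places the configurable source of truth in the `smart_thresholds` table.

```bash
# Hypothetical helper: succeeds (exit status 0) for the default critical parameters.
is_critical_param() {
    case "$1" in
        Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable) return 0 ;;
        Reallocated_Event_Count|Spin_Retry_Count) return 0 ;;
        *) return 1 ;;
    esac
}

for name in Reallocated_Sector_Ct Temperature_Celsius; do
    if is_critical_param "$name"; then
        echo "$name: critical, store a full reading"
    else
        echo "$name: non-critical"
    fi
done
```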
| 199 |
+ |
|
| 200 |
+## Conclusion |
|
| 201 |
+ |
|
| 202 |
+The differential storage system provides significant storage optimization while maintaining complete data integrity and analytical capabilities. The system automatically adapts to HDD behavior patterns, storing more data when drives show issues and reducing storage when drives are stable. |
|
| 203 |
+ |
|
| 204 |
+This optimization is particularly beneficial for large-scale deployments like the Madagascar cluster, where hundreds of HDDs generate continuous SMART data over years of operation. |
|
@@ -0,0 +1,675 @@ |
||
| 1 |
+# autoSMART Installation and Setup Guide |
|
| 2 |
+ |
|
| 3 |
+## 🚀 Quick Start |
|
| 4 |
+ |
|
| 5 |
+### Prerequisites Checklist |
|
| 6 |
+ |
|
| 7 |
+#### System Requirements |
|
| 8 |
+- ✅ **Operating System**: Linux (Ubuntu 20.04+, CentOS 8+, Proxmox VE 7+) |
|
| 9 |
+- ✅ **Perl**: Version 5.20+ with CPAN access |
|
| 10 |
+- ✅ **PostgreSQL**: Version 13+ with JSONB support |
|
| 11 |
+- ✅ **Hardware Access**: sudo/root access for SMART data collection |
|
| 12 |
+- ✅ **Network**: Access to OpenAI API (optional, for AI predictions) |
|
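The checklist above can be turned into a quick preflight check. This is a sketch, not part of the project: the helper names `have_cmd` and `perl_at_least` are hypothetical, and it only checks that the tools are on `PATH` and that Perl meets the 5.20 minimum stated above.

```bash
# Hypothetical preflight helpers for the prerequisites listed above.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

perl_at_least() {
    # $1: minimum version in Perl's numeric form, e.g. 5.020 for Perl 5.20
    perl -e 'exit($] >= $ARGV[0] ? 0 : 1)' "$1"
}

for tool in perl psql smartctl; do
    if have_cmd "$tool"; then
        echo "ok: $tool"
    else
        echo "missing: $tool"
    fi
done

if have_cmd perl && perl_at_least 5.020; then
    echo "ok: perl >= 5.20"
else
    echo "missing: perl >= 5.20"
fi
```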
| 13 |
+ |
|
| 14 |
+#### Test Database Available |
|
| 15 |
+``` |
|
| 16 |
+Host: 192.168.2.102 |
|
| 17 |
+Database: autosmart |
|
| 18 |
+User: postgres |
|
| 19 |
+Password: (no password) |
|
| 20 |
+Port: 5432 |
|
| 21 |
+``` |
|
| 22 |
+ |
|
| 23 |
+## 🔧 Installation Steps |
|
| 24 |
+ |
|
| 25 |
+### 1. System Dependencies |
|
| 26 |
+ |
|
| 27 |
+#### Ubuntu/Debian |
|
| 28 |
+```bash |
|
| 29 |
+# Update system packages |
|
| 30 |
+sudo apt update && sudo apt upgrade -y |
|
| 31 |
+ |
|
| 32 |
+# Install system dependencies |
|
| 33 |
+sudo apt install -y perl postgresql-client smartmontools git curl |
|
| 34 |
+ |
|
| 35 |
+# Install PostgreSQL server (if not using remote database) |
|
| 36 |
+sudo apt install -y postgresql postgresql-contrib |
|
| 37 |
+ |
|
| 38 |
+# Install Perl development tools |
|
| 39 |
+sudo apt install -y build-essential cpanminus libdbi-perl |
|
| 40 |
+``` |
|
| 41 |
+ |
|
| 42 |
+#### CentOS/RHEL/Rocky Linux |
|
| 43 |
+```bash |
|
| 44 |
+# Update system packages |
|
| 45 |
+sudo dnf update -y |
|
| 46 |
+ |
|
| 47 |
+# Install system dependencies |
|
| 48 |
+sudo dnf install -y perl postgresql smartmontools git curl |
|
| 49 |
+ |
|
| 50 |
+# Install development tools |
|
| 51 |
+sudo dnf groupinstall -y "Development Tools" |
|
| 52 |
+sudo dnf install -y perl-App-cpanminus perl-DBI |
|
| 53 |
+``` |
|
| 54 |
+ |
|
| 55 |
+#### Proxmox VE |
|
| 56 |
+```bash |
|
| 57 |
+# Proxmox already includes most dependencies |
|
| 58 |
+apt update |
|
| 59 |
+apt install -y cpanminus libdbi-perl libdbd-pg-perl libjson-xs-perl |
|
| 60 |
+``` |
|
| 61 |
+ |
|
| 62 |
+### 2. Perl Modules Installation |
|
| 63 |
+ |
|
| 64 |
+#### Required Modules |
|
| 65 |
+```bash |
|
| 66 |
+# Core database connectivity |
|
| 67 |
+sudo cpanm DBI DBD::Pg |
|
| 68 |
+ |
|
| 69 |
+# JSON processing |
|
| 70 |
+sudo cpanm JSON::XS |
|
| 71 |
+ |
|
| 72 |
+# Configuration and utilities |
|
| 73 |
+sudo cpanm Config::Simple File::Slurp Time::HiRes Digest::SHA |
|
| 74 |
+ |
|
| 75 |
+# HTTP clients for API integration |
|
| 76 |
+sudo cpanm LWP::UserAgent HTTP::Request::Common |
|
| 77 |
+ |
|
| 78 |
+# Optional: Testing modules |
|
| 79 |
+sudo cpanm Test::More Test::Exception Data::Dumper |
|
| 80 |
+``` |
|
| 81 |
+ |
|
| 82 |
+#### Verify Perl Module Installation |
|
| 83 |
+```bash |
|
| 84 |
+perl -e " |
|
| 85 |
+use DBI; |
|
| 86 |
+use JSON::XS; |
|
| 87 |
+use Config::Simple; |
|
| 88 |
+use Digest::SHA; |
|
| 89 |
+use LWP::UserAgent; |
|
| 90 |
+print \"All required Perl modules installed successfully!\n\"; |
|
| 91 |
+" |
|
| 92 |
+``` |
|
| 93 |
+ |
|
| 94 |
+### 3. Database Setup |
|
| 95 |
+ |
|
| 96 |
+#### Option A: Use Test Database (Recommended for Development) |
|
| 97 |
+```bash |
|
| 98 |
+# Test connection to existing database |
|
| 99 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT version();" |
|
| 100 |
+ |
|
| 101 |
+# If successful, skip to step 4 - Project Installation |
|
| 102 |
+``` |
|
| 103 |
+ |
|
| 104 |
+#### Option B: Local PostgreSQL Installation |
|
| 105 |
+```bash |
|
| 106 |
+# Install PostgreSQL |
|
| 107 |
+sudo apt install -y postgresql postgresql-contrib |
|
| 108 |
+ |
|
| 109 |
+# Start and enable PostgreSQL |
|
| 110 |
+sudo systemctl start postgresql |
|
| 111 |
+sudo systemctl enable postgresql |
|
| 112 |
+ |
|
| 113 |
+# Create database and user |
|
| 114 |
+sudo -u postgres psql << EOF |
|
| 115 |
+CREATE DATABASE autosmart; |
|
| 116 |
+CREATE USER autosmart WITH PASSWORD 'smartpassword'; |
|
| 117 |
+GRANT ALL PRIVILEGES ON DATABASE autosmart TO autosmart; |
|
| 118 |
+ALTER USER autosmart CREATEDB; |
|
| 119 |
+\q |
|
| 120 |
+EOF |
|
| 121 |
+``` |
|
| 122 |
+ |
|
| 123 |
+#### Option C: Remote PostgreSQL Setup |
|
| 124 |
+```bash |
|
| 125 |
+# Connect to your PostgreSQL server |
|
| 126 |
+psql -h your-db-host -U postgres << EOF
|
| 127 |
+-- Create database and configure
|
| 128 |
+CREATE DATABASE autosmart;
|
| 129 |
+CREATE USER autosmart WITH PASSWORD 'your-secure-password';
|
| 130 |
+GRANT ALL PRIVILEGES ON DATABASE autosmart TO autosmart;
|
| 131 |
+EOF
|
| 132 |
+``` |
|
| 133 |
+ |
|
| 134 |
+### 4. Project Installation |
|
| 135 |
+ |
|
| 136 |
+#### Download and Setup |
|
| 137 |
+```bash |
|
| 138 |
+# Create installation directory |
|
| 139 |
+sudo mkdir -p /etc/pve/autoSMART |
|
| 140 |
+cd /etc/pve/autoSMART |
|
| 141 |
+ |
|
| 142 |
+# Clone or copy project files (adjust as needed) |
|
| 143 |
+# git clone https://github.com/your-repo/autoSMART.git . |
|
| 144 |
+# OR copy from development workspace: |
|
| 145 |
+sudo cp -r /Users/bogdan/Documents/workspace/autoSMART/* .
|
| 146 |
+ |
|
| 147 |
+# Set proper ownership and permissions |
|
| 148 |
+sudo chown -R root:root . |
|
| 149 |
+sudo chmod +x scripts/*.pl
|
| 150 |
+sudo chmod 600 config/cluster.conf
|
| 151 |
+``` |
|
| 152 |
+ |
|
| 153 |
+#### Directory Structure Verification |
|
| 154 |
+```bash |
|
| 155 |
+tree /etc/pve/autoSMART |
|
| 156 |
+# Should show: |
|
| 157 |
+# ├── config/ |
|
| 158 |
+# ├── docs/ |
|
| 159 |
+# ├── lib/ |
|
| 160 |
+# ├── scripts/ |
|
| 161 |
+# ├── sql/ |
|
| 162 |
+# └── README.md |
|
| 163 |
+``` |
|
| 164 |
+ |
|
| 165 |
+### 5. Database Deployment |
|
| 166 |
+ |
|
| 167 |
+autoSMART uses PostgreSQL for storing SMART data, configurations, and analysis results. You can deploy the database schema from your development machine using the included deployment scripts. |
|
| 168 |
+ |
|
| 169 |
+#### Prerequisites for Database Deployment |
|
| 170 |
+- ✅ **psql** client installed on development machine (macOS/Linux) |
|
| 171 |
+- ✅ **Network access** to target PostgreSQL server |
|
| 172 |
+- ✅ **Database credentials** with schema creation privileges |
|
| 173 |
+- ✅ **Target database** already created and accessible |
|
| 174 |
+ |
|
| 175 |
+#### Database Deployment with deploy.sh |
|
| 176 |
+ |
|
| 177 |
+The `deploy.sh` script can install the database schema remotely using psql from your development machine: |
|
| 178 |
+ |
|
| 179 |
+```bash |
|
| 180 |
+# Show help and available options |
|
| 181 |
+./deploy.sh |
|
| 182 |
+ |
|
| 183 |
+# Deploy database schema to remote PostgreSQL server |
|
| 184 |
+./deploy.sh install database --db-host 192.168.2.102 --db-user postgres --db-name autosmart |
|
| 185 |
+ |
|
| 186 |
+# Deploy with custom credentials |
|
| 187 |
+./deploy.sh install database \ |
|
| 188 |
+ --db-host your-postgres-server.local \ |
|
| 189 |
+ --db-user autosmart \ |
|
| 190 |
+ --db-pass your-password \ |
|
| 191 |
+ --db-name autosmart_prod |
|
| 192 |
+``` |
|
| 193 |
+ |
|
| 194 |
+#### Manual Database Installation from Development Machine |
|
| 195 |
+ |
|
| 196 |
+If you prefer manual control over the database installation: |
|
| 197 |
+ |
|
| 198 |
+```bash |
|
| 199 |
+# 1. Ensure psql is available on your development machine |
|
| 200 |
+# macOS: |
|
| 201 |
+brew install postgresql |
|
| 202 |
+ |
|
| 203 |
+# Ubuntu/Debian: |
|
| 204 |
+sudo apt install postgresql-client |
|
| 205 |
+ |
|
| 206 |
+# 2. Test connection to target database |
|
| 207 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c "SELECT version();" |
|
| 208 |
+ |
|
| 209 |
+# 3. Install the complete schema |
|
| 210 |
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql |
|
| 211 |
+ |
|
| 212 |
+# 4. Verify schema installation |
|
| 213 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 214 |
+SELECT |
|
| 215 |
+ schemaname, |
|
| 216 |
+ tablename, |
|
| 217 |
+ tableowner |
|
| 218 |
+FROM pg_tables |
|
| 219 |
+WHERE schemaname = 'public' |
|
| 220 |
+ORDER BY tablename; |
|
| 221 |
+" |
|
| 222 |
+ |
|
| 223 |
+# 5. Check database functions and triggers |
|
| 224 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 225 |
+SELECT |
|
| 226 |
+ proname as function_name, |
|
| 227 |
+ pg_get_function_result(oid) as return_type |
|
| 228 |
+FROM pg_proc |
|
| 229 |
+WHERE pronamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'public') |
|
| 230 |
+ORDER BY proname; |
|
| 231 |
+" |
|
| 232 |
+``` |
|
| 233 |
+ |
|
| 234 |
+#### Database Schema Components |
|
| 235 |
+ |
|
| 236 |
+The autoSMART database schema includes: |
|
| 237 |
+ |
|
| 238 |
+**Core Tables:** |
|
| 239 |
+- `hdd_inventory` - Physical drive tracking and migration history |
|
| 240 |
+- `smart_readings` - Raw SMART data collection (differential storage) |
|
| 241 |
+- `smart_thresholds` - Drive-specific alert thresholds |
|
| 242 |
+- `predictions` - AI-generated failure predictions |
|
| 243 |
+- `alert_history` - System alerts and notifications |
|
| 244 |
+- `system_config` - Cluster-wide configuration settings |
|
| 245 |
+ |
|
| 246 |
+**Analytical Views:** |
|
| 247 |
+- `smart_readings_reconstructed` - Full SMART data reconstruction from differential storage |
|
| 248 |
+- `latest_smart_readings` - Most recent SMART values per drive |
|
| 249 |
+- `drive_health_summary` - Drive health status and trend analysis |
|
| 250 |
+ |
|
| 251 |
+**Functions and Triggers:** |
|
| 252 |
+- `differential_storage_trigger()` - Automatic differential storage on SMART updates |
|
| 253 |
+- `update_drive_health()` - Health score calculation |
|
| 254 |
+- `cleanup_old_readings()` - Data retention management |
|
| 255 |
+ |
|
| 256 |
+#### Database Verification Commands |
|
| 257 |
+ |
|
| 258 |
+```bash |
|
| 259 |
+# Verify all components are installed |
|
| 260 |
+psql -h 192.168.2.102 -U postgres -d autosmart << EOF |
|
| 261 |
+ |
|
| 262 |
+-- Check table count and sizes |
|
| 263 |
+SELECT |
|
| 264 |
+ schemaname, |
|
| 265 |
+ tablename, |
|
| 266 |
+ pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size |
|
| 267 |
+FROM pg_tables |
|
| 268 |
+WHERE schemaname = 'public'; |
|
| 269 |
+ |
|
| 270 |
+-- Check views |
|
| 271 |
+SELECT |
|
| 272 |
+ schemaname, |
|
| 273 |
+ viewname, |
|
| 274 |
+ definition |
|
| 275 |
+FROM pg_views |
|
| 276 |
+WHERE schemaname = 'public'; |
|
| 277 |
+ |
|
| 278 |
+-- Check that the differential storage function exists (trigger functions cannot be called directly)
|
| 279 |
+SELECT proname AS function_test FROM pg_proc WHERE proname = 'differential_storage_trigger';
|
| 280 |
+ |
|
| 281 |
+-- Verify database is ready |
|
| 282 |
+SELECT 'autoSMART database ready!' as status; |
|
| 283 |
+ |
|
| 284 |
+EOF |
|
| 285 |
+``` |
|
| 286 |
+ |
|
| 287 |
+#### Troubleshooting Database Installation |
|
| 288 |
+ |
|
| 289 |
+**Connection Issues:** |
|
| 290 |
+```bash |
|
| 291 |
+# Test basic connectivity |
|
| 292 |
+ping 192.168.2.102 |
|
| 293 |
+ |
|
| 294 |
+# Test PostgreSQL port |
|
| 295 |
+telnet 192.168.2.102 5432 |
|
| 296 |
+ |
|
| 297 |
+# Test authentication |
|
| 298 |
+psql -h 192.168.2.102 -U postgres -d postgres -c "SELECT current_user;" |
|
| 299 |
+``` |
|
| 300 |
+ |
|
| 301 |
+**Schema Installation Issues:** |
|
| 302 |
+```bash |
|
| 303 |
+# Check for existing schema conflicts |
|
| 304 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 305 |
+SELECT table_name FROM information_schema.tables |
|
| 306 |
+WHERE table_schema = 'public' AND table_name LIKE '%smart%'; |
|
| 307 |
+" |
|
| 308 |
+ |
|
| 309 |
+# Force clean installation (⚠️ DESTRUCTIVE) |
|
| 310 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 311 |
+DROP SCHEMA public CASCADE; |
|
| 312 |
+CREATE SCHEMA public; |
|
| 313 |
+GRANT ALL ON SCHEMA public TO postgres; |
|
| 314 |
+GRANT ALL ON SCHEMA public TO public; |
|
| 315 |
+" |
|
| 316 |
+ |
|
| 317 |
+# Reinstall schema |
|
| 318 |
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql |
|
| 319 |
+``` |
|
| 320 |
+ |
|
| 321 |
+### 6. Database Schema Installation (Legacy Method) |
|
| 322 |
+ |
|
| 323 |
+#### Using Test Database (192.168.2.102) |
|
| 324 |
+```bash |
|
| 325 |
+# Install the complete schema |
|
| 326 |
+cd /etc/pve/autoSMART |
|
| 327 |
+psql -h 192.168.2.102 -U postgres -d autosmart -f sql/schema.sql |
|
| 328 |
+ |
|
| 329 |
+# Verify installation |
|
| 330 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 331 |
+SELECT table_name FROM information_schema.tables |
|
| 332 |
+WHERE table_schema = 'public' |
|
| 333 |
+ORDER BY table_name; |
|
| 334 |
+" |
|
| 335 |
+``` |
|
| 336 |
+ |
|
| 337 |
+#### Expected Tables |
|
| 338 |
+- ✅ hdd_inventory |
|
| 339 |
+- ✅ hdd_migrations |
|
| 340 |
+- ✅ smart_readings |
|
| 341 |
+- ✅ predictions |
|
| 342 |
+- ✅ smart_thresholds |
|
| 343 |
+- ✅ alert_history |
|
| 344 |
+- ✅ system_config |
|
| 345 |
+ |
|
| 346 |
+#### Expected Views |
|
| 347 |
+- ✅ smart_readings_reconstructed |
|
| 348 |
+- ✅ latest_smart_readings |
|
| 349 |
+- ✅ drive_health_summary |
|
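The expected tables and views above can be checked mechanically by comparing them with what the database reports. The sketch below only does the comparison, and the helper name `missing_objects` is hypothetical. In practice the second argument could come from something like `psql -At -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"` with the connection options used elsewhere in this guide.

```bash
# Hypothetical helper: print expected names that are absent from the actual list.
missing_objects() (
    # $1: expected names, $2: names actually present (both newline-separated)
    expected=$(mktemp) && actual=$(mktemp) || exit 1
    printf '%s\n' "$1" | LC_ALL=C sort -u > "$expected"
    printf '%s\n' "$2" | LC_ALL=C sort -u > "$actual"
    LC_ALL=C comm -23 "$expected" "$actual"    # lines that appear only in the expected list
    rm -f "$expected" "$actual"
)

expected_tables=$(printf '%s\n' hdd_inventory hdd_migrations smart_readings \
    predictions smart_thresholds alert_history system_config)
# Sample "present" list with one table left out on purpose.
present_tables=$(printf '%s\n' hdd_inventory smart_readings predictions \
    smart_thresholds alert_history system_config)

missing_objects "$expected_tables" "$present_tables"    # prints: hdd_migrations
```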
| 350 |
+ |
|
| 351 |
+### 7. Configuration |
|
| 352 |
+ |
|
| 353 |
+#### Cluster Configuration |
|
| 354 |
+```bash |
|
| 355 |
+# Edit cluster-wide settings |
|
| 356 |
+nano /etc/pve/autoSMART/config/cluster.conf |
|
| 357 |
+``` |
|
| 358 |
+ |
|
| 359 |
+```ini |
|
| 360 |
+[database] |
|
| 361 |
+host = 192.168.2.102 |
|
| 362 |
+port = 5432 |
|
| 363 |
+name = autosmart |
|
| 364 |
+user = postgres |
|
| 365 |
+password = |
|
| 366 |
+ |
|
| 367 |
+[collection] |
|
| 368 |
+interval = 1800 |
|
| 369 |
+timeout = 60 |
|
| 370 |
+madagascar_inventory_path = /opt/madagascar/inventory.json |
|
| 371 |
+ |
|
| 372 |
+[ai_predictions] |
|
| 373 |
+enabled = true |
|
| 374 |
+openai_api_key = your-openai-api-key-here |
|
| 375 |
+openai_model = gpt-4 |
|
| 376 |
+prediction_interval = 86400 |
|
| 377 |
+ |
|
| 378 |
+[alerts] |
|
| 379 |
+enabled = true |
|
| 380 |
+email_notifications = true |
|
| 381 |
+slack_webhook = https://hooks.slack.com/your-webhook |
|
| 382 |
+``` |
|
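For shell scripts that need a value from `cluster.conf`, a small INI reader is enough. The sketch below is an assumption-laden illustration: `ini_get` is a hypothetical helper, it handles only simple `key = value` lines inside `[section]` blocks, and the sample file is a shortened copy of the configuration above.

```bash
# Hypothetical helper: print the value of one key in one section of an INI file.
ini_get() {
    # $1: file, $2: section name, $3: key
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[.*\][ \t]*$/ {                 # section header such as [database]
            current = $0
            gsub(/^[ \t]*\[|\][ \t]*$/, "", current)
            next
        }
        current == section && index($0, "=") > 0 {
            k = substr($0, 1, index($0, "=") - 1)
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim whitespace around the value
            if (k == key) { print v; exit }
        }
    ' "$1"
}

conf=$(mktemp)
cat > "$conf" <<'EOF'
[database]
host = 192.168.2.102
port = 5432

[collection]
interval = 1800
EOF

ini_get "$conf" database host          # prints: 192.168.2.102
ini_get "$conf" collection interval    # prints: 1800
rm -f "$conf"
```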
| 383 |
+ |
|
| 384 |
+#### Local Node Configuration |
|
| 385 |
+```bash |
|
| 386 |
+# Copy default configuration |
|
| 387 |
+cp config/defaults/autosmart /etc/default/autosmart |
|
| 388 |
+ |
|
| 389 |
+# Edit local settings |
|
| 390 |
+nano /etc/default/autosmart |
|
| 391 |
+``` |
|
| 392 |
+ |
|
| 393 |
+```bash |
|
| 394 |
+# autoSMART local configuration |
|
| 395 |
+AUTOSMART_DEBUG=2 |
|
| 396 |
+AUTOSMART_NODE_ID=$(hostname) |
|
| 397 |
+AUTOSMART_CLUSTER_CONFIG="/etc/pve/autoSMART/config/cluster.conf" |
|
| 398 |
+ |
|
| 399 |
+# Local database override (if needed) |
|
| 400 |
+# AUTOSMART_DB_HOST=192.168.2.102 |
|
| 401 |
+# AUTOSMART_DB_USER=postgres |
|
| 402 |
+# AUTOSMART_DB_PASS= |
|
| 403 |
+ |
|
| 404 |
+# OpenAI API configuration |
|
| 405 |
+OPENAI_API_KEY=your-api-key-here |
|
| 406 |
+ |
|
| 407 |
+# Collection settings |
|
| 408 |
+SMART_COLLECTION_ENABLED=true |
|
| 409 |
+MIGRATION_DETECTION_ENABLED=true |
|
| 410 |
+DIFFERENTIAL_STORAGE_ENABLED=true |
|
| 411 |
+``` |
|
| 412 |
+ |
|
| 413 |
+### 8. Testing Installation |
|
| 414 |
+ |
|
| 415 |
+#### Database Connectivity Test |
|
| 416 |
+```bash |
|
| 417 |
+cd /etc/pve/autoSMART |
|
| 418 |
+perl -e " |
|
| 419 |
+use lib 'lib'; |
|
| 420 |
+use DBI; |
|
| 421 |
+ |
|
| 422 |
+my \$dsn = 'DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432'; |
|
| 423 |
+my \$dbh = DBI->connect(\$dsn, 'postgres', '', {RaiseError => 1});
|
|
| 424 |
+ |
|
| 425 |
+print \"✅ Database connection successful!\n\"; |
|
| 426 |
+ |
|
| 427 |
+# Test schema |
|
| 428 |
+my \$sth = \$dbh->prepare('SELECT COUNT(*) FROM hdd_inventory');
|
|
| 429 |
+\$sth->execute(); |
|
| 430 |
+my (\$count) = \$sth->fetchrow_array(); |
|
| 431 |
+ |
|
| 432 |
+print \"✅ Schema installed - hdd_inventory table accessible\n\"; |
|
| 433 |
+\$dbh->disconnect(); |
|
| 434 |
+" |
|
| 435 |
+``` |
|
| 436 |
+ |
|
| 437 |
+#### SMART Data Collection Test |
|
| 438 |
+```bash |
|
| 439 |
+# Test SMART data access |
|
| 440 |
+sudo smartctl -a /dev/sda | head -20 |
|
| 441 |
+ |
|
| 442 |
+# Test collection script (dry-run mode) |
|
| 443 |
+cd /etc/pve/autoSMART/scripts |
|
| 444 |
+sudo perl collect-smart-data.pl --test --dry-run |
|
| 445 |
+``` |
|
| 446 |
+ |
|
| 447 |
+#### Differential Storage Test |
|
| 448 |
+```bash |
|
| 449 |
+# Run comprehensive storage test |
|
| 450 |
+cd /etc/pve/autoSMART/scripts |
|
| 451 |
+perl test-differential-storage.pl |
|
| 452 |
+``` |
|
| 453 |
+ |
|
| 454 |
+Expected output: |
|
| 455 |
+``` |
|
| 456 |
+=== autoSMART Differential Storage Test === |
|
| 457 |
+✓ Connected to database |
|
| 458 |
+✓ Created test HDD (ID: 1) |
|
| 459 |
+✓ Inserted baseline reading (ID: 1) |
|
| 460 |
+✓ Identical reading test - Should store: NO (Type: baseline) |
|
| 461 |
+✓ Temperature change reading - Should store: YES (Type: differential, ID: 2) |
|
| 462 |
+✓ Critical change reading - Should store: YES (Type: full, ID: 3) |
|
| 463 |
+--- Storage Statistics --- |
|
| 464 |
+baseline : 1 readings, avg size: 245 bytes |
|
| 465 |
+differential : 1 readings, avg size: 89 bytes |
|
| 466 |
+full : 1 readings, avg size: 245 bytes |
|
| 467 |
+Total: 3 readings, estimated size: 579 bytes |
|
| 468 |
+=== Test Complete === |
|
| 469 |
+``` |
|
| 470 |
+ |
|
| 471 |
+### 9. Service Configuration (Optional) |
|
| 472 |
+ |
|
| 473 |
+#### systemd Service Files
|
| 474 |
+ |
|
| 475 |
+Create `/etc/systemd/system/autosmart-collector.service`: |
|
| 476 |
+```ini |
|
| 477 |
+[Unit] |
|
| 478 |
+Description=autoSMART Data Collector |
|
| 479 |
+After=network.target postgresql.service |
|
| 480 |
+ |
|
| 481 |
+[Service] |
|
| 482 |
+Type=simple |
|
| 483 |
+User=root |
|
| 484 |
+WorkingDirectory=/etc/pve/autoSMART |
|
| 485 |
+ExecStart=/usr/bin/perl scripts/collect-smart-data.pl --daemon |
|
| 486 |
+Restart=always |
|
| 487 |
+RestartSec=30 |
|
| 488 |
+ |
|
| 489 |
+[Install] |
|
| 490 |
+WantedBy=multi-user.target |
|
| 491 |
+``` |
|
| 492 |
+ |
|
| 493 |
+Create `/etc/systemd/system/autosmart-analyzer.service`: |
|
| 494 |
+```ini |
|
| 495 |
+[Unit] |
|
| 496 |
+Description=autoSMART AI Analyzer |
|
| 497 |
+After=network.target postgresql.service autosmart-collector.service |
|
| 498 |
+ |
|
| 499 |
+[Service] |
|
| 500 |
+Type=simple |
|
| 501 |
+User=root |
|
| 502 |
+WorkingDirectory=/etc/pve/autoSMART |
|
| 503 |
+ExecStart=/usr/bin/perl scripts/analyze-smart-data.pl --daemon |
|
| 504 |
+Restart=always |
|
| 505 |
+RestartSec=60 |
|
| 506 |
+ |
|
| 507 |
+[Install] |
|
| 508 |
+WantedBy=multi-user.target |
|
| 509 |
+``` |
|
| 510 |
+ |
|
| 511 |
+#### Enable Services |
|
| 512 |
+```bash |
|
| 513 |
+# Reload systemd configuration |
|
| 514 |
+sudo systemctl daemon-reload |
|
| 515 |
+ |
|
| 516 |
+# Enable and start services |
|
| 517 |
+sudo systemctl enable autosmart-collector |
|
| 518 |
+sudo systemctl start autosmart-collector |
|
| 519 |
+ |
|
| 520 |
+sudo systemctl enable autosmart-analyzer |
|
| 521 |
+sudo systemctl start autosmart-analyzer |
|
| 522 |
+ |
|
| 523 |
+# Check service status |
|
| 524 |
+sudo systemctl status autosmart-collector |
|
| 525 |
+sudo systemctl status autosmart-analyzer |
|
| 526 |
+``` |
|
| 527 |
+ |
|
| 528 |
+### 10. Verification and Monitoring |
|
| 529 |
+ |
|
| 530 |
+#### Log Files |
|
| 531 |
+```bash |
|
| 532 |
+# View collection logs |
|
| 533 |
+sudo journalctl -u autosmart-collector -f |
|
| 534 |
+ |
|
| 535 |
+# View analysis logs |
|
| 536 |
+sudo journalctl -u autosmart-analyzer -f |
|
| 537 |
+ |
|
| 538 |
+# Check syslog for SMART events |
|
| 539 |
+sudo tail -f /var/log/syslog | grep -i smart |
|
| 540 |
+``` |
|
| 541 |
+ |
|
| 542 |
+#### Database Monitoring |
|
| 543 |
+```bash |
|
| 544 |
+# Monitor data collection |
|
| 545 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 546 |
+SELECT |
|
| 547 |
+ COUNT(*) as total_readings, |
|
| 548 |
+ MAX(timestamp) as latest_reading, |
|
| 549 |
+ COUNT(DISTINCT hdd_id) as active_drives |
|
| 550 |
+FROM smart_readings |
|
| 551 |
+WHERE timestamp > NOW() - INTERVAL '24 hours'; |
|
| 552 |
+" |
|
| 553 |
+ |
|
| 554 |
+# Monitor storage efficiency |
|
| 555 |
+psql -h 192.168.2.102 -U postgres -d autosmart -c " |
|
| 556 |
+SELECT |
|
| 557 |
+ reading_type, |
|
| 558 |
+ COUNT(*) as count, |
|
| 559 |
+ COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as percentage |
|
| 560 |
+FROM smart_readings |
|
| 561 |
+WHERE timestamp > NOW() - INTERVAL '24 hours' |
|
| 562 |
+GROUP BY reading_type; |
|
| 563 |
+" |
|
| 564 |
+``` |
|
| 565 |
+ |
|
| 566 |
+## 🎯 Post-Installation Steps |
|
| 567 |
+ |
|
| 568 |
+### 1. Madagascar Integration |
|
| 569 |
+```bash |
|
| 570 |
+# Ensure Madagascar inventory is accessible |
|
| 571 |
+ls -la /opt/madagascar/inventory.json |
|
| 572 |
+ |
|
| 573 |
+# Test Madagascar data parsing |
|
| 574 |
+cd /etc/pve/autoSMART/scripts |
|
| 575 |
+perl -e " |
|
| 576 |
+use JSON::XS; |
|
| 577 |
+my \$data = decode_json(qx(cat /opt/madagascar/inventory.json)); |
|
| 578 |
+print 'Madagascar drives found: ' . scalar(\@{\$data->{drives}}) . \"\n\";
|
|
| 579 |
+" |
|
| 580 |
+``` |
|
| 581 |
+ |
|
| 582 |
+### 2. OpenAI API Setup (Optional) |
|
| 583 |
+```bash |
|
| 584 |
+# Test OpenAI API access |
|
| 585 |
+curl -H "Authorization: Bearer your-api-key" \ |
|
| 586 |
+ -H "Content-Type: application/json" \ |
|
| 587 |
+ -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello"}]}' \
|
|
| 588 |
+ https://api.openai.com/v1/chat/completions |
|
| 589 |
+``` |
|
| 590 |
+ |
|
| 591 |
+### 3. Alert Configuration |
|
| 592 |
+```bash |
|
| 593 |
+# Test email notifications (if configured) |
|
| 594 |
+echo "Test autoSMART alert" | mail -s "autoSMART Test" admin@yourcompany.com |
|
| 595 |
+ |
|
| 596 |
+# Test Slack webhook (if configured) |
|
| 597 |
+curl -X POST -H 'Content-type: application/json' \ |
|
| 598 |
+ --data '{"text":"autoSMART installation test"}' \
|
|
| 599 |
+ YOUR_SLACK_WEBHOOK_URL |
|
| 600 |
+``` |
|
| 601 |
+ |
|
| 602 |
+## 🔍 Troubleshooting |
|
| 603 |
+ |
|
| 604 |
+### Common Issues |
|
| 605 |
+ |
|
| 606 |
+#### Database Connection Failed |
|
| 607 |
+```bash |
|
| 608 |
+# Check PostgreSQL service |
|
| 609 |
+sudo systemctl status postgresql |
|
| 610 |
+ |
|
| 611 |
+# Test network connectivity |
|
| 612 |
+telnet 192.168.2.102 5432 |
|
| 613 |
+ |
|
| 614 |
+# Check firewall |
|
| 615 |
+sudo ufw status |
|
| 616 |
+sudo iptables -L -n | grep 5432 |
|
| 617 |
+``` |
|
| 618 |
+ |
|
| 619 |
+#### Permission Denied for SMART Data |
|
| 620 |
+```bash |
|
| 621 |
+# Check smartctl permissions |
|
| 622 |
+ls -la /usr/sbin/smartctl |
|
| 623 |
+ |
|
| 624 |
+# Test with sudo |
|
| 625 |
+sudo smartctl -a /dev/sda |
|
| 626 |
+ |
|
| 627 |
+# Add user to disk group (if running as non-root) |
|
| 628 |
+sudo usermod -a -G disk $USER |
|
| 629 |
+``` |
|
| 630 |
+ |
|
| 631 |
+#### Perl Module Issues |
|
| 632 |
+```bash |
|
| 633 |
+# Check module installation |
|
| 634 |
+perl -MDBI -e 'print "DBI version: $DBI::VERSION\n"' |
|
| 635 |
+perl -MJSON::XS -e 'print "JSON::XS installed OK\n"' |
|
| 636 |
+ |
|
| 637 |
+# Reinstall problematic modules |
|
| 638 |
+sudo cpanm --force --reinstall DBD::Pg |
|
| 639 |
+``` |
|
| 640 |
+ |
|
| 641 |
+### Log Analysis |
|
| 642 |
+```bash |
|
| 643 |
+# Enable debug logging |
|
| 644 |
+export AUTOSMART_DEBUG=3 |
|
| 645 |
+ |
|
| 646 |
+# Run collection manually for debugging |
|
| 647 |
+cd /etc/pve/autoSMART/scripts |
|
| 648 |
+sudo perl collect-smart-data.pl --verbose --test |
|
| 649 |
+``` |
|
| 650 |
+ |
|
| 651 |
+## 📚 Next Steps |
|
| 652 |
+ |
|
| 653 |
+1. **Customize Configuration**: Adjust thresholds and intervals for your environment |
|
| 654 |
+2. **Set Up Monitoring**: Configure alerts and dashboards |
|
| 655 |
+3. **Schedule Regular Backups**: Database and configuration files |
|
| 656 |
+4. **Plan for Scaling**: Consider performance optimization for large deployments |
|
| 657 |
+5. **Implement AI Predictions**: Configure OpenAI integration for failure prediction |
|
| 658 |
+ |
|
| 659 |
+For more detailed information, see: |
|
| 660 |
+- [DEVELOPMENT.md](DEVELOPMENT.md) - Development and customization guide |
|
| 661 |
+- [API.md](API.md) - OpenAI API integration details |
|
| 662 |
+- [DIFFERENTIAL_STORAGE.md](DIFFERENTIAL_STORAGE.md) - Storage optimization details |
|
| 663 |
+- [MIGRATION_DETECTION.md](MIGRATION_DETECTION.md) - HDD tracking system details |
|
| 664 |
+ |
|
| 665 |
+## 🆘 Getting Help |
|
| 666 |
+ |
|
| 667 |
+If you encounter issues during installation: |
|
| 668 |
+ |
|
| 669 |
+1. **Check logs**: `journalctl -u autosmart-collector -n 50` |
|
| 670 |
+2. **Verify database**: Test database connectivity and schema |
|
| 671 |
+3. **Test components**: Run individual scripts manually with debug output |
|
| 672 |
+4. **Review configuration**: Ensure all paths and credentials are correct |
|
| 673 |
+5. **Check dependencies**: Verify all system and Perl dependencies are installed |
|
| 674 |
+ |
|
| 675 |
+The autoSMART system is designed to be robust and self-healing, but proper installation and configuration are essential for optimal performance. |
|
@@ -0,0 +1,325 @@ |
||
| 1 |
+# autoSMART v1.0 - Intelligent HDD Monitoring & Failure Prediction |
|
| 2 |
+ |
|
| 3 |
+autoSMART is an intelligent SMART monitoring system for the HDDs in a Proxmox cluster, with AI-based failure prediction and optimized storage in PostgreSQL. |
|
| 4 |
+ |
|
| 5 |
+## 🎯 **Project Goals** |
|
| 6 |
+ |
|
| 7 |
+- **Continuous monitoring** of SMART parameters for all HDDs in the cluster |
|
| 8 |
+- **AI predictions** of imminent failures using the OpenAI API |
|
| 9 |
+- **Long-term storage** in PostgreSQL for time-series analysis |
|
| 10 |
+- **Proactive alerting** for preventive maintenance |
|
| 11 |
+ |
|
| 12 |
+## Key Features |
|
| 13 |
+ |
|
| 14 |
+- **🔍 Hardware-based HDD tracking**: Permanent identification using serial numbers and model names (not volatile /dev/sdX paths) |
|
| 15 |
+- **🔄 Migration detection**: Automatic detection and logging when HDDs move between nodes or device paths |
|
| 16 |
+- **💾 Differential storage optimization**: Store only SMART readings with changes, reducing database size by 60-80% |
|
| 17 |
+- **🤖 AI-powered failure prediction**: Uses OpenAI GPT for intelligent drive failure forecasting |
|
| 18 |
+- **🏥 Health monitoring**: Continuous SMART parameter analysis with configurable thresholds |
|
| 19 |
+- **📊 Comprehensive reporting**: Detailed drive health reports and predictive analytics |
|
| 20 |
+- **🔧 Proxmox cluster integration**: Designed for distributed Proxmox VE environments |
|
| 21 |
+- **⚡ High performance**: PostgreSQL backend with optimized indexing and queries |
|
| 22 |
+ |
|
| 23 |
+## 🚀 Quick Start |
|
| 24 |
+ |
|
| 25 |
+### Prerequisites |
|
| 26 |
+- **PostgreSQL 13+** for data storage |
|
| 27 |
+- **Perl 5.20+** with required modules |
|
| 28 |
+- **Proxmox VE** cluster environment |
|
| 29 |
+- **smartmontools** for SMART data collection |
|
| 30 |
+- **OpenAI API key** for failure predictions |
|
| 31 |
+ |
|
| 32 |
+### Installation |
|
| 33 |
+```bash |
|
| 34 |
+# 1. Download autoSMART and run automated deployment |
|
| 35 |
+git clone <repository-url> |
|
| 36 |
+cd autoSMART |
|
| 37 |
+sudo ./scripts/deploy.sh install |
|
| 38 |
+ |
|
| 39 |
+# The deployment script automatically: |
|
| 40 |
+# - Installs all dependencies (Perl modules, smartmontools, etc.) |
|
| 41 |
+# - Creates system directories and sets permissions |
|
| 42 |
+# - Deploys application files to /opt/autoSMART/ |
|
| 43 |
+# - Creates configuration files in /etc/autosmart/ |
|
| 44 |
+# - Registers and starts systemd services |
|
| 45 |
+# - Performs initial system validation |
|
| 46 |
+ |
|
| 47 |
+# 2. Configure database connection (interactive prompts during install) |
|
| 48 |
+# 3. Configure OpenAI API key (interactive prompts during install) |
|
| 49 |
+# 4. System is ready - services are automatically started |
|
| 50 |
+``` |
|
| 51 |
+ |
|
| 52 |
+### Verification |
|
| 53 |
+```bash |
|
| 54 |
+# Check system status (all services should be active) |
|
| 55 |
+sudo systemctl status autosmart |
|
| 56 |
+ |
|
| 57 |
+# View recent SMART data collection |
|
| 58 |
+sudo journalctl -u autosmart-collector -f |
|
| 59 |
+ |
|
| 60 |
+# Generate initial health report |
|
| 61 |
+sudo /opt/autoSMART/scripts/autosmart-report.pl --summary |
|
| 62 |
+``` |
|
| 63 |
+ |
|
| 64 |
+## 📚 Documentation |
|
| 65 |
+ |
|
| 66 |
+### Getting Started |
|
| 67 |
+- **[CHANGELOG.md](CHANGELOG.md)** - Version history and release notes |
|
| 68 |
+ |
|
| 69 |
+### System Configuration |
|
| 70 |
+- **[API.md](API.md)** - OpenAI API integration and configuration |
|
| 71 |
+ |
|
| 72 |
+## 🏥 Monitoring Dashboard |
|
| 73 |
+ |
|
| 74 |
+autoSMART provides comprehensive monitoring capabilities: |
|
| 75 |
+ |
|
| 76 |
+### Health Status Overview |
|
| 77 |
+- Real-time drive health status for all cluster nodes |
|
| 78 |
+- Critical parameter alerts and warnings |
|
| 79 |
+- AI-powered failure predictions with confidence scores |
|
| 80 |
+- Storage efficiency metrics |
|
| 81 |
+ |
|
| 82 |
+### Historical Analysis |
|
| 83 |
+- Long-term SMART parameter trends |
|
| 84 |
+- Performance degradation tracking |
|
| 85 |
+- Migration history between nodes |
|
| 86 |
+- Predictive analytics reports |
|
| 87 |
+ |
|
| 88 |
+### Alerting System |
|
| 89 |
+- Configurable thresholds for all SMART parameters |
|
| 90 |
+- Email/webhook notifications |
|
| 91 |
+- Integration with monitoring systems |
|
| 92 |
+- Escalation procedures for critical alerts |
|
| 93 |
+ |
|
| 94 |
+## 🔧 System Architecture |
|
| 95 |
+ |
|
| 96 |
+autoSMART operates as a distributed system across your Proxmox cluster: |
|
| 97 |
+ |
|
| 98 |
+### Data Collection |
|
| 99 |
+- Continuous SMART data collection from all nodes |
|
| 100 |
+- Hardware-based drive identification |
|
| 101 |
+- Migration detection and logging |
|
| 102 |
+- Differential storage for efficiency |
|
| 103 |
+ |
|
| 104 |
+### Analysis Engine |
|
| 105 |
+- AI-powered failure prediction |
|
| 106 |
+- Threshold-based alerting |
|
| 107 |
+- Trend analysis and reporting |
|
| 108 |
+- Performance optimization recommendations |
|
| 109 |
+ |
|
| 110 |
+### Storage Layer |
|
| 111 |
+- PostgreSQL database with optimized schema |
|
| 112 |
+- Differential storage reducing size by 60-80% |
|
| 113 |
+- Historical data retention policies |
|
| 114 |
+- Automated backup and maintenance |
|
| 115 |
+ |
|
| 116 |
+## 📁 Installed File Structure |
|
| 117 |
+ |
|
| 118 |
+When autoSMART is installed on your system, it creates the following directory structure: |
|
| 119 |
+ |
|
| 120 |
+### System Directories |
|
| 121 |
+ |
|
| 122 |
+``` |
|
| 123 |
+/opt/autoSMART/ # Main installation directory |
|
| 124 |
+├── scripts/ # Executable scripts and utilities |
|
| 125 |
+│ ├── autosmart-collector.pl # Main data collection daemon |
|
| 126 |
+│ ├── autosmart-predictor.pl # AI prediction processing |
|
| 127 |
+│ ├── autosmart-report.pl # Report generation engine |
|
| 128 |
+│ ├── autosmart-migration-report.pl # Hardware migration analysis |
|
| 129 |
+│ ├── smart-collector-daemon.pl # Background collection service |
|
| 130 |
+│ ├── uninstall.sh # System removal script |
|
| 131 |
+│ ├── monitor-cluster.sh # Cluster health monitoring |
|
| 132 |
+│ └── test-*.pl # Testing and validation utilities |
|
| 133 |
+├── lib/ # Perl modules and core libraries |
|
| 134 |
+│ ├── SmartCollector.pm # SMART data collection and hardware tracking |
|
| 135 |
+│ └── PredictionEngine.pm # AI-powered failure prediction engine |
|
| 136 |
+├── config/ # Configuration templates and examples |
|
| 137 |
+│ └── (template files) # Default configuration templates |
|
| 138 |
+├── docs/ # End-user documentation |
|
| 139 |
+│ ├── README.md # System overview and quick start |
|
| 140 |
+│ ├── CHANGELOG.md # Release notes and version history |
|
| 141 |
+│ └── API.md # OpenAI API configuration guide |
|
| 142 |
+ |
|
| 143 |
+/etc/autosmart/ # System configuration directory |
|
| 144 |
+├── autosmart.conf # Main system configuration |
|
| 145 |
+├── cluster.conf # Cluster topology and node definitions |
|
| 146 |
+├── database.conf # PostgreSQL connection settings |
|
| 147 |
+├── openai.conf # OpenAI API configuration and prompts |
|
| 148 |
+└── smart.conf # SMART parameter thresholds and monitoring rules |
|
| 149 |
+ |
|
| 150 |
+/etc/systemd/system/ # Systemd service files |
|
| 151 |
+├── autosmart.service # Main autoSMART service |
|
| 152 |
+├── autosmart-collector.service # Data collection service |
|
| 153 |
+└── autosmart-predictor.service # AI prediction service |
|
| 154 |
+``` |
|
| 155 |
+ |
|
| 156 |
+### Configuration Files Detail |
|
| 157 |
+ |
|
| 158 |
+#### `/etc/autosmart/autosmart.conf` |
|
| 159 |
+Main system configuration file containing: |
|
| 160 |
+- Database connection parameters |
|
| 161 |
+- Collection intervals and scheduling |
|
| 162 |
+- Local node identification and settings |
|
| 163 |
+- Log levels and debugging options |
|
| 164 |
+ |
|
| 165 |
+#### `/etc/autosmart/cluster.conf` |
|
| 166 |
+Cluster-wide configuration shared across all nodes: |
|
| 167 |
+- Node topology and IP addresses |
|
| 168 |
+- Shared monitoring parameters |
|
| 169 |
+- Cluster-wide alert settings |
|
| 170 |
+- Inter-node communication settings |
|
| 171 |
+ |
|
| 172 |
+#### `/etc/autosmart/database.conf` |
|
| 173 |
+PostgreSQL database connection settings: |
|
| 174 |
+- Database host, port, and credentials |
|
| 175 |
+- Connection pooling configuration |
|
| 176 |
+- SSL settings and security parameters |
|
| 177 |
+- Performance tuning options |
|
| 178 |
+ |
|
| 179 |
+#### `/etc/autosmart/openai.conf` |
|
| 180 |
+OpenAI API integration configuration: |
|
| 181 |
+- API key and model selection |
|
| 182 |
+- Prompt templates for failure prediction |
|
| 183 |
+- Response parsing and confidence thresholds |
|
| 184 |
+- Rate limiting and cost management |
|
| 185 |
+ |
|
| 186 |
+#### `/etc/autosmart/smart.conf` |
|
| 187 |
+SMART parameter monitoring configuration: |
|
| 188 |
+- Parameter thresholds for different drive types |
|
| 189 |
+- Critical parameter definitions |
|
| 190 |
+- Alert escalation rules and notifications |
|
| 191 |
+- Drive-specific monitoring settings |
|
| 192 |
+ |
|
| 193 |
+### Service Integration |
|
| 194 |
+ |
|
| 195 |
+#### Systemd Services |
|
| 196 |
+- **`autosmart.service`**: Main system service that manages other components |
|
| 197 |
+- **`autosmart-collector.service`**: Background data collection service |
|
| 198 |
+- **`autosmart-predictor.service`**: AI prediction processing service |
|
| 199 |
+ |
|
| 200 |
+#### Service Management |
|
| 201 |
+```bash |
|
| 202 |
+# Start/stop services |
|
| 203 |
+sudo systemctl start autosmart |
|
| 204 |
+sudo systemctl stop autosmart |
|
| 205 |
+ |
|
| 206 |
+# Enable/disable automatic startup |
|
| 207 |
+sudo systemctl enable autosmart |
|
| 208 |
+sudo systemctl disable autosmart |
|
| 209 |
+ |
|
| 210 |
+# Check service status |
|
| 211 |
+sudo systemctl status autosmart |
|
| 212 |
+ |
|
| 213 |
+# View service logs using systemd journal |
|
| 214 |
+sudo journalctl -u autosmart -f # Follow main service logs |
|
| 215 |
+sudo journalctl -u autosmart-collector -f # Follow data collection logs |
|
| 216 |
+sudo journalctl -u autosmart-predictor -f # Follow AI prediction logs |
|
| 217 |
+ |
|
| 218 |
+# View logs by time period |
|
| 219 |
+sudo journalctl -u autosmart --since "1 hour ago" # Last hour |
|
| 220 |
+sudo journalctl -u autosmart --since today # Today's logs |
|
| 221 |
+sudo journalctl -u autosmart --since yesterday # Yesterday's logs |
|
| 222 |
+ |
|
| 223 |
+# View logs by priority level |
|
| 224 |
+sudo journalctl -u autosmart -p err # Error level and above |
|
| 225 |
+sudo journalctl -u autosmart -p warning # Warning level and above |
|
| 226 |
+``` |
|
| 227 |
+ |
|
| 228 |
+### File Permissions |
|
| 229 |
+ |
|
| 230 |
+#### Executable Files |
|
| 231 |
+- All scripts in `/opt/autoSMART/scripts/` are executable (755) |
|
| 232 |
+- Perl modules in `/opt/autoSMART/lib/` are readable (644) |
|
| 233 |
+- Configuration files in `/etc/autosmart/` are readable by autosmart user (640) |
|
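The modes listed above can be spot-checked after installation. A minimal sketch, assuming GNU coreutils `stat` (standard on Debian-based Proxmox nodes); `check_mode` is a hypothetical helper, and the example paths are taken from the layout described in this README:

```shell
# check_mode PATH EXPECTED_OCTAL
# Returns 0 when PATH has exactly the expected permission bits,
# 1 on a mismatch, and 2 when the file cannot be inspected.
check_mode() {
    actual=$(stat -c '%a' "$1") || return 2
    if [ "$actual" != "$2" ]; then
        echo "MISMATCH $1: mode $actual, expected $2" >&2
        return 1
    fi
}

# Example calls (adjust paths to your installation):
#   check_mode /opt/autoSMART/scripts/autosmart-report.pl 755
#   check_mode /opt/autoSMART/lib/SmartCollector.pm 644
#   check_mode /etc/autosmart/database.conf 640
```

A mismatch is reported on stderr, so the function can be used in a loop over several files without cluttering normal output.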
| 234 |
+ |
|
| 235 |
+#### Log Management |
|
| 236 |
+- All application logs are handled by systemd journal |
|
| 237 |
+- No separate log files created in filesystem |
|
| 238 |
+- Log retention managed by journald configuration |
|
| 239 |
+- Logs accessible via `journalctl` commands |
|
| 240 |
+- Automatic log rotation and cleanup by systemd |
|
| 241 |
+ |
|
| 242 |
+### Storage Requirements |
|
| 243 |
+ |
|
| 244 |
+#### Disk Space |
|
| 245 |
+- **Installation**: ~50MB for application files and documentation |
|
| 246 |
+- **Configuration**: ~1MB for all configuration files |
|
| 247 |
+- **Logs**: Managed by systemd journal (configurable retention) |
|
| 248 |
+- **Database**: Handled separately on PostgreSQL server |
|
| 249 |
+ |
|
| 250 |
+#### Network Requirements |
|
| 251 |
+- **Database Access**: Persistent connection to PostgreSQL server |
|
| 252 |
+- **OpenAI API**: HTTPS access for AI predictions (configurable) |
|
| 253 |
+- **Inter-node Communication**: SSH access between cluster nodes for deployment |
|
| 254 |
+ |
|
| 255 |
+This file structure provides a complete, organized installation that integrates seamlessly with Linux system conventions while maintaining clear separation between application code, configuration, and operational data. |
|
| 256 |
+ |
|
| 257 |
+## 📊 Performance Benefits |
|
| 258 |
+ |
|
| 259 |
+### Storage Optimization |
|
| 260 |
+- **60-80% reduction** in database storage through differential storage |
|
| 261 |
+- **Intelligent change detection** stores only modified SMART parameters |
|
| 262 |
+- **Baseline reconstruction** provides complete historical views |
|
| 263 |
+- **Configurable retention** policies for long-term storage |
|
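The change-detection idea behind differential storage can be sketched in a few lines. This is an illustration only: it assumes each reading is serialized as one `name=value` pair per line, which is a simplification and not autoSMART's actual storage format, and `changed_params` is a hypothetical helper:

```shell
# changed_params PREVIOUS CURRENT
# Prints the name=value pairs from CURRENT that are new or differ
# from PREVIOUS. Empty output means nothing changed, so there is
# nothing to store for this reading.
changed_params() {
    { printf '%s\n' "$1"; echo '---'; printf '%s\n' "$2"; } |
    awk -F= '
        $0 == "---" { second = 1; next }      # separator between readings
        !second     { prev[$1] = $2; next }   # remember previous values
        !($1 in prev) || prev[$1] != $2 { print }
    '
}

previous='Temperature_Celsius=35
Reallocated_Sector_Ct=0'
current='Temperature_Celsius=38
Reallocated_Sector_Ct=0'

changed_params "$previous" "$current"   # prints Temperature_Celsius=38
```

Only the changed pair would be written as a differential record; a complete view is then rebuilt by applying the stored differences on top of the most recent baseline.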
| 264 |
+ |
|
| 265 |
+### Monitoring Efficiency |
|
| 266 |
+- **Hardware-based tracking** eliminates /dev/sdX path volatility |
|
| 267 |
+- **Migration detection** automatically tracks drive movements |
|
| 268 |
+- **Real-time analysis** with configurable collection intervals |
|
| 269 |
+- **Distributed architecture** scales across cluster nodes |
|
| 270 |
+ |
|
| 271 |
+## 🚨 Alert Examples |
|
| 272 |
+ |
|
| 273 |
+### Critical Alerts |
|
| 274 |
+- **Imminent Failure**: AI predicts drive failure within 24-48 hours |
|
| 275 |
+- **Temperature Critical**: Drive operating above safe temperature thresholds |
|
| 276 |
+- **Reallocated Sectors**: Increasing bad sector count detected |
|
| 277 |
+- **Spin Retry Count**: Mechanical issues detected |
|
| 278 |
+ |
|
| 279 |
+### Warning Alerts |
|
| 280 |
+- **Performance Degradation**: Slower response times detected |
|
| 281 |
+- **Temperature Warning**: Operating temperatures approaching limits |
|
| 282 |
+- **SMART Threshold**: Parameters approaching warning thresholds |
|
| 283 |
+- **Migration Detected**: Drive moved to different node or path |
|
| 284 |
+ |
|
| 285 |
+## 💡 Use Cases |
|
| 286 |
+ |
|
| 287 |
+### Preventive Maintenance |
|
| 288 |
+- Schedule drive replacements before failures occur |
|
| 289 |
+- Optimize workload distribution based on drive health |
|
| 290 |
+- Plan cluster maintenance windows effectively |
|
| 291 |
+- Track warranty and replacement schedules |
|
| 292 |
+ |
|
| 293 |
+### Capacity Planning |
|
| 294 |
+- Monitor storage growth trends |
|
| 295 |
+- Predict future storage requirements |
|
| 296 |
+- Optimize drive allocation across nodes |
|
| 297 |
+- Plan cluster expansion timing |
|
| 298 |
+ |
|
| 299 |
+### Performance Optimization |
|
| 300 |
+- Identify performance bottlenecks |
|
| 301 |
+- Balance load across healthy drives |
|
| 302 |
+- Optimize I/O patterns based on drive characteristics |
|
| 303 |
+- Monitor storage tier performance |
|
| 304 |
+ |
|
| 305 |
+## 🆘 Support & Troubleshooting |
|
| 306 |
+ |
|
| 307 |
+### Common Issues |
|
| 308 |
+- **Collection failures**: Check smartmontools installation |
|
| 309 |
+- **Database connectivity**: Verify PostgreSQL connection settings |
|
| 310 |
+- **API errors**: Validate OpenAI API key and quotas |
|
| 311 |
+- **Performance issues**: Review differential storage configuration |
|
| 312 |
+ |
|
| 313 |
+### Log Analysis |
|
| 314 |
+Use systemd journal for comprehensive log analysis: |
|
| 315 |
+- **All service logs**: `sudo journalctl -u 'autosmart*'` |
|
| 316 |
+- **Data collection**: `sudo journalctl -u autosmart-collector` |
|
| 317 |
+- **AI predictions**: `sudo journalctl -u autosmart-predictor` |
|
| 318 |
+- **System errors**: `sudo journalctl -u 'autosmart*' -p err` |
|
| 319 |
+ |
|
| 320 |
+### Getting Help |
|
| 321 |
+For detailed installation, configuration, and troubleshooting information, refer to the complete documentation in the `docs/` directory. |
|
| 322 |
+ |
|
| 323 |
+--- |
|
| 324 |
+ |
|
| 325 |
+**autoSMART v1.0** - Intelligent drive monitoring for mission-critical infrastructure |
|
@@ -0,0 +1,607 @@ |
||
| 1 |
+package PredictionEngine; |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use DBI; |
|
| 6 |
+use HTTP::Tiny; |
|
| 7 |
+use JSON::XS; |
|
| 8 |
+use Math::Round; |
|
| 9 |
+use Config::Simple; |
|
| 10 |
+use Time::Piece; |
|
| 11 |
+ |
|
| 12 |
+=head1 NAME |
|
| 13 |
+ |
|
| 14 |
+PredictionEngine - AI-powered HDD failure prediction for autoSMART |
|
| 15 |
+ |
|
| 16 |
+=head1 DESCRIPTION |
|
| 17 |
+ |
|
| 18 |
+This module integrates with OpenAI's API to analyze SMART data trends and predict |
|
| 19 |
+HDD failures. It processes historical SMART data, generates feature vectors, |
|
| 20 |
+and uses GPT models for intelligent failure prediction. |
|
| 21 |
+ |
|
| 22 |
+=head1 SYNOPSIS |
|
| 23 |
+ |
|
| 24 |
+ use PredictionEngine; |
|
| 25 |
+ |
|
| 26 |
+ my $predictor = PredictionEngine->new( |
|
| 27 |
+ db_config => '/path/to/database.conf', |
|
| 28 |
+ openai_config => '/path/to/openai.conf' |
|
| 29 |
+ ); |
|
| 30 |
+ |
|
| 31 |
+ # Predict failure for specific drive |
|
| 32 |
+ my $prediction = $predictor->predict_failure('/dev/sda');
|
|
| 33 |
+ |
|
| 34 |
+ # Analyze all drives |
|
| 35 |
+ my $results = $predictor->analyze_all_drives(); |
|
| 36 |
+ |
|
| 37 |
+=cut |
|
| 38 |
+ |
|
| 39 |
+sub new {
|
|
| 40 |
+ my ($class, %args) = @_; |
|
| 41 |
+ |
|
| 42 |
+ my $self = {
|
|
| 43 |
+ db_config => $args{db_config} || '/etc/autosmart/database.conf',
|
|
| 44 |
+ openai_config => $args{openai_config} || '/etc/autosmart/openai.conf',
|
|
| 45 |
+ debug => $args{debug} || 0,
|
|
| 46 |
+ db_handle => undef, |
|
| 47 |
+ openai_key => '', |
|
| 48 |
+ model => 'gpt-4', |
|
| 49 |
+ http_client => HTTP::Tiny->new(timeout => 30), |
|
| 50 |
+ }; |
|
| 51 |
+ |
|
| 52 |
+ bless $self, $class; |
|
| 53 |
+ $self->_load_config(); |
|
| 54 |
+ $self->_connect_database(); |
|
| 55 |
+ |
|
| 56 |
+ return $self; |
|
| 57 |
+} |
|
| 58 |
+ |
|
| 59 |
+=head2 _load_config |
|
| 60 |
+ |
|
| 61 |
+Load OpenAI configuration |
|
| 62 |
+ |
|
| 63 |
+=cut |
|
| 64 |
+ |
|
| 65 |
+sub _load_config {
|
|
| 66 |
+ my $self = shift; |
|
| 67 |
+ |
|
| 68 |
+ my $cfg = Config::Simple->new($self->{openai_config})
|
|
| 69 |
+ or die "Cannot load OpenAI config: $self->{openai_config}";
|
|
| 70 |
+ |
|
| 71 |
+ $self->{openai_key} = $cfg->param('openai.api_key')
|
|
| 72 |
+ or die "OpenAI API key not configured"; |
|
| 73 |
+ |
|
| 74 |
+ $self->{model} = $cfg->param('openai.model') || 'gpt-4';
|
|
| 75 |
+ $self->{max_tokens} = $cfg->param('openai.max_tokens') || 1000;
|
|
| 76 |
+ $self->{temperature} = $cfg->param('openai.temperature') // 0.3; # defined-or keeps an explicit 0
|
|
| 77 |
+ |
|
| 78 |
+ $self->_log("OpenAI configuration loaded (model: $self->{model})");
|
|
| 79 |
+} |
|
| 80 |
+ |
|
| 81 |
+=head2 _connect_database |
|
| 82 |
+ |
|
| 83 |
+Establish PostgreSQL database connection |
|
| 84 |
+ |
|
| 85 |
+=cut |
|
| 86 |
+ |
|
| 87 |
+sub _connect_database {
|
|
| 88 |
+ my $self = shift; |
|
| 89 |
+ |
|
| 90 |
+ my $cfg = Config::Simple->new($self->{db_config})
|
|
| 91 |
+ or die "Cannot load database config: $self->{db_config}";
|
|
| 92 |
+ |
|
| 93 |
+ my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
|
|
| 94 |
+ $cfg->param('database.database'),
|
|
| 95 |
+ $cfg->param('database.host'),
|
|
| 96 |
+ $cfg->param('database.port')
|
|
| 97 |
+ ); |
|
| 98 |
+ |
|
| 99 |
+ $self->{db_handle} = DBI->connect(
|
|
| 100 |
+ $dsn, |
|
| 101 |
+ $cfg->param('database.username'),
|
|
| 102 |
+ $cfg->param('database.password'),
|
|
| 103 |
+ {
|
|
| 104 |
+ RaiseError => 1, |
|
| 105 |
+ AutoCommit => 1, |
|
| 106 |
+ pg_enable_utf8 => 1 |
|
| 107 |
+ } |
|
| 108 |
+ ) or die "Database connection failed: $DBI::errstr"; |
|
| 109 |
+ |
|
| 110 |
+ $self->_log("Database connection established");
|
|
| 111 |
+} |
|
| 112 |
+ |
|
| 113 |
+=head2 get_drive_smart_history |
|
| 114 |
+ |
|
| 115 |
+Retrieve SMART data history for a drive |
|
| 116 |
+ |
|
| 117 |
+=cut |
|
| 118 |
+ |
|
| 119 |
+sub get_drive_smart_history {
|
|
| 120 |
+ my ($self, $device_path, $days_back) = @_; |
|
| 121 |
+ |
|
| 122 |
+ $days_back ||= 90; # Default 3 months |
|
| 123 |
+ |
|
| 124 |
+ my $sql = q{
|
|
| 125 |
+ SELECT |
|
| 126 |
+ sr.timestamp, |
|
| 127 |
+ sr.temperature, |
|
| 128 |
+ sr.parameters_json, |
|
| 129 |
+ hi.model_name, |
|
| 130 |
+ hi.serial_number, |
|
| 131 |
+ hi.size_gb |
|
| 132 |
+ FROM smart_readings sr |
|
| 133 |
+ JOIN hdd_inventory hi ON sr.device_path = hi.device_path |
|
| 134 |
+ WHERE sr.device_path = ? |
|
| 135 |
+ AND sr.timestamp >= NOW() - (?::int * INTERVAL '1 day') |
|
| 136 |
+ ORDER BY sr.timestamp ASC |
|
| 137 |
+ }; |
|
| 138 |
+ |
|
| 139 |
+ my $sth = $self->{db_handle}->prepare($sql);
|
|
| 140 |
+ $sth->execute($device_path, $days_back); |
|
| 141 |
+ |
|
| 142 |
+ my @history = (); |
|
| 143 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 144 |
+ $row->{parameters} = decode_json($row->{parameters_json});
|
|
| 145 |
+ delete $row->{parameters_json};
|
|
| 146 |
+ push @history, $row; |
|
| 147 |
+ } |
|
| 148 |
+ |
|
| 149 |
+ return \@history; |
|
| 150 |
+} |
|
| 151 |
+ |
|
| 152 |
+=head2 analyze_smart_trends |
|
| 153 |
+ |
|
| 154 |
+Analyze SMART parameter trends for patterns |
|
| 155 |
+ |
|
| 156 |
+=cut |
|
| 157 |
+ |
|
| 158 |
+sub analyze_smart_trends {
|
|
| 159 |
+ my ($self, $history) = @_; |
|
| 160 |
+ |
|
| 161 |
+ return {} unless @$history >= 5; # Need minimum data points
|
|
| 162 |
+ |
|
| 163 |
+ my $trends = {};
|
|
| 164 |
+ my $critical_params = [ |
|
| 165 |
+ 'Reallocated_Sector_Ct', |
|
| 166 |
+ 'Spin_Retry_Count', |
|
| 167 |
+ 'Reallocated_Event_Count', |
|
| 168 |
+ 'Current_Pending_Sector', |
|
| 169 |
+ 'Offline_Uncorrectable', |
|
| 170 |
+ 'UDMA_CRC_Error_Count', |
|
| 171 |
+ 'Raw_Read_Error_Rate' |
|
| 172 |
+ ]; |
|
| 173 |
+ |
|
| 174 |
+ # Analyze each critical parameter |
|
| 175 |
+ foreach my $param_name (@$critical_params) {
|
|
| 176 |
+ my @values = (); |
|
| 177 |
+ my @timestamps = (); |
|
| 178 |
+ |
|
| 179 |
+ # Extract values for this parameter |
|
| 180 |
+ foreach my $reading (@$history) {
|
|
| 181 |
+ next unless exists $reading->{parameters}->{$param_name};
|
|
| 182 |
+ |
|
| 183 |
+ push @values, $reading->{parameters}->{$param_name}->{raw_value};
|
|
| 184 |
+ push @timestamps, $reading->{timestamp};
|
|
| 185 |
+ } |
|
| 186 |
+ |
|
| 187 |
+ next unless @values >= 3; |
|
| 188 |
+ |
|
| 189 |
+ # Calculate trend statistics |
|
| 190 |
+ my $trend_analysis = $self->_calculate_trend_stats(\@values, \@timestamps); |
|
| 191 |
+ |
|
| 192 |
+ $trends->{$param_name} = {
|
|
| 193 |
+ current_value => $values[-1], |
|
| 194 |
+ min_value => $trend_analysis->{min},
|
|
| 195 |
+ max_value => $trend_analysis->{max},
|
|
| 196 |
+ slope => $trend_analysis->{slope},
|
|
| 197 |
+ volatility => $trend_analysis->{volatility},
|
|
| 198 |
+ data_points => scalar(@values), |
|
| 199 |
+ concerning => $self->_is_trend_concerning($param_name, $trend_analysis), |
|
| 200 |
+ }; |
|
| 201 |
+ } |
|
| 202 |
+ |
|
| 203 |
+ # Analyze temperature trends |
|
| 204 |
+ my @temperatures = map { $_->{temperature} } @$history;
|
|
| 205 |
+ if (@temperatures >= 3) {
|
|
| 206 |
+ my @temp_timestamps = map { $_->{timestamp} } @$history;
|
|
| 207 |
+ my $temp_stats = $self->_calculate_trend_stats(\@temperatures, \@temp_timestamps); |
|
| 208 |
+ |
|
| 209 |
+ $trends->{temperature} = {
|
|
| 210 |
+ current_temp => $temperatures[-1], |
|
| 211 |
+ avg_temp => $temp_stats->{mean},
|
|
| 212 |
+ max_temp => $temp_stats->{max},
|
|
| 213 |
+ slope => $temp_stats->{slope},
|
|
| 214 |
+ concerning => ($temp_stats->{max} > 60 || $temp_stats->{slope} > 0.1),
|
|
| 215 |
+ }; |
|
| 216 |
+ } |
|
| 217 |
+ |
|
| 218 |
+ return $trends; |
|
| 219 |
+} |
|
| 220 |
+ |
|
| 221 |
+=head2 _calculate_trend_stats |
|
| 222 |
+ |
|
| 223 |
+Calculate statistical metrics for trend analysis |
|
| 224 |
+ |
|
| 225 |
+=cut |
|
| 226 |
+ |
|
| 227 |
+sub _calculate_trend_stats {
|
|
| 228 |
+ my ($self, $values, $timestamps) = @_; |
|
| 229 |
+ |
|
| 230 |
+ return {} unless @$values >= 2;
|
|
| 231 |
+ |
|
| 232 |
+ # Basic statistics |
|
| 233 |
+ my $sum = 0; |
|
| 234 |
+ my $min = $values->[0]; |
|
| 235 |
+ my $max = $values->[0]; |
|
| 236 |
+ |
|
| 237 |
+ foreach my $val (@$values) {
|
|
| 238 |
+ $sum += $val; |
|
| 239 |
+ $min = $val if $val < $min; |
|
| 240 |
+ $max = $val if $val > $max; |
|
| 241 |
+ } |
|
| 242 |
+ |
|
| 243 |
+ my $mean = $sum / @$values; |
|
| 244 |
+ |
|
| 245 |
+ # Calculate variance |
|
| 246 |
+ my $variance = 0; |
|
| 247 |
+ foreach my $val (@$values) {
|
|
| 248 |
+ $variance += ($val - $mean) ** 2; |
|
| 249 |
+ } |
|
| 250 |
+ $variance /= (@$values - 1) if @$values > 1; |
|
| 251 |
+ |
|
| 252 |
+ # Simple linear regression for slope |
|
| 253 |
+ my $slope = 0; |
|
| 254 |
+ if (@$values >= 2) {
|
|
| 255 |
+ my $n = @$values; |
|
| 256 |
+ my $sum_x = 0; |
|
| 257 |
+ my $sum_y = 0; |
|
| 258 |
+ my $sum_xy = 0; |
|
| 259 |
+ my $sum_x2 = 0; |
|
| 260 |
+ |
|
| 261 |
+ for my $i (0..$#$values) {
|
|
| 262 |
+ my $x = $i; # Use index as x (time progression) |
|
| 263 |
+ my $y = $values->[$i]; |
|
| 264 |
+ |
|
| 265 |
+ $sum_x += $x; |
|
| 266 |
+ $sum_y += $y; |
|
| 267 |
+ $sum_xy += $x * $y; |
|
| 268 |
+ $sum_x2 += $x * $x; |
|
| 269 |
+ } |
|
| 270 |
+ |
|
| 271 |
+ my $denominator = $n * $sum_x2 - $sum_x * $sum_x; |
|
| 272 |
+ if ($denominator != 0) {
|
|
| 273 |
+ $slope = ($n * $sum_xy - $sum_x * $sum_y) / $denominator; |
|
| 274 |
+ } |
|
| 275 |
+ } |
|
| 276 |
+ |
|
| 277 |
+ return {
|
|
| 278 |
+ min => $min, |
|
| 279 |
+ max => $max, |
|
| 280 |
+ mean => $mean, |
|
| 281 |
+ variance => $variance, |
|
| 282 |
+ volatility => sqrt($variance), |
|
| 283 |
+ slope => $slope, |
|
| 284 |
+ }; |
|
| 285 |
+} |
|
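Review note on `_calculate_trend_stats`: the loop accumulates the sums of an ordinary least-squares fit with the reading index as the x value, so the slope is expressed per reading rather than per unit of time (the `$timestamps` argument is not used). The variance is the sample variance with an n - 1 denominator. The notation below is introduced here only for illustration; it is a restatement of the code above, not an addition to it.

```latex
\bar{y} = \frac{1}{n}\sum_{i=0}^{n-1} y_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=0}^{n-1}\left(y_i - \bar{y}\right)^2,
\qquad
\text{slope} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}
                    {n\sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\quad x_i = i .

% Worked example with three temperature readings y = (40, 41, 43):
% n = 3, \sum x = 3, \sum y = 124, \sum xy = 127, \sum x^2 = 5
\text{slope} = \frac{3 \cdot 127 - 3 \cdot 124}{3 \cdot 5 - 3^2} = \frac{9}{6} = 1.5
```

In this example the slope of 1.5 per reading is well above the 0.1 threshold that the callers treat as concerning.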
| 286 |
+ |
|
| 287 |
+=head2 _is_trend_concerning |
|
| 288 |
+ |
|
| 289 |
+Determine if a SMART parameter trend is concerning |
|
| 290 |
+ |
|
| 291 |
+=cut |
|
| 292 |
+ |
|
| 293 |
+sub _is_trend_concerning {
|
|
| 294 |
+ my ($self, $param_name, $stats) = @_; |
|
| 295 |
+ |
|
| 296 |
+ # Critical parameters that should never increase |
|
| 297 |
+ my $critical_increasing = {
|
|
| 298 |
+ 'Reallocated_Sector_Ct' => 0, |
|
| 299 |
+ 'Reallocated_Event_Count' => 0, |
|
| 300 |
+ 'Current_Pending_Sector' => 0, |
|
| 301 |
+ 'Offline_Uncorrectable' => 0, |
|
| 302 |
+ 'Spin_Retry_Count' => 10, |
|
| 303 |
+ }; |
|
| 304 |
+ |
|
| 305 |
+ if (exists $critical_increasing->{$param_name}) {
|
|
| 306 |
+ my $threshold = $critical_increasing->{$param_name};
|
|
| 307 |
+ |
|
| 308 |
+ return 1 if $stats->{max} > $threshold;
|
|
| 309 |
+ return 1 if $stats->{slope} > 0.1 && $stats->{max} > 0;
|
|
| 310 |
+ } |
|
| 311 |
+ |
|
| 312 |
+ # High volatility is concerning |
|
| 313 |
+ return 1 if $stats->{volatility} > ($stats->{mean} * 0.5) && $stats->{mean} > 0;
|
|
| 314 |
+ |
|
| 315 |
+ return 0; |
|
| 316 |
+} |
|
| 317 |
+ |
|
| 318 |
+=head2 predict_failure |
|
| 319 |
+ |
|
| 320 |
+Generate AI-powered failure prediction for a drive |
|
| 321 |
+ |
|
| 322 |
+=cut |
|
| 323 |
+ |
|
| 324 |
+sub predict_failure {
|
|
| 325 |
+ my ($self, $device_path, $days_back) = @_; |
|
| 326 |
+ |
|
| 327 |
+ $days_back ||= 90; |
|
| 328 |
+ |
|
| 329 |
+ # Get SMART history |
|
| 330 |
+ my $history = $self->get_drive_smart_history($device_path, $days_back); |
|
| 331 |
+ |
|
| 332 |
+ unless (@$history >= 5) {
|
|
| 333 |
+ return {
|
|
| 334 |
+ device_path => $device_path, |
|
| 335 |
+ prediction => 'insufficient_data', |
|
| 336 |
+ confidence => 0, |
|
| 337 |
+ risk_level => 'unknown', |
|
| 338 |
+ message => 'Insufficient historical data for prediction' |
|
| 339 |
+ }; |
|
| 340 |
+ } |
|
| 341 |
+ |
|
| 342 |
+ # Analyze trends |
|
| 343 |
+ my $trends = $self->analyze_smart_trends($history); |
|
| 344 |
+ |
|
| 345 |
+ # Generate AI prompt |
|
| 346 |
+ my $prompt = $self->_generate_prediction_prompt($device_path, $history, $trends); |
|
| 347 |
+ |
|
| 348 |
+ # Call OpenAI API |
|
| 349 |
+ my $ai_response = $self->_call_openai_api($prompt); |
|
| 350 |
+ |
|
| 351 |
+ # Parse and store prediction |
|
| 352 |
+ my $prediction = $self->_parse_prediction_response($ai_response, $device_path); |
|
| 353 |
+ |
|
| 354 |
+ # Store prediction in database |
|
| 355 |
+ $self->_store_prediction($prediction); |
|
| 356 |
+ |
|
| 357 |
+ return $prediction; |
|
| 358 |
+} |
|
| 359 |
+ |
|
| 360 |
+=head2 _generate_prediction_prompt |
|
| 361 |
+ |
|
| 362 |
+Generate detailed prompt for OpenAI API |
|
| 363 |
+ |
|
| 364 |
+=cut |
|
| 365 |
+ |
|
| 366 |
+sub _generate_prediction_prompt {
|
|
| 367 |
+ my ($self, $device_path, $history, $trends) = @_; |
|
| 368 |
+ |
|
| 369 |
+ my $drive_info = $history->[0]; # Basic drive info from first record |
|
| 370 |
+ |
|
| 371 |
+ my $prompt = "You are an expert HDD failure prediction system analyzing SMART data.\n\n"; |
|
| 372 |
+ |
|
| 373 |
+ $prompt .= "DRIVE INFORMATION:\n"; |
|
| 374 |
+ $prompt .= "- Device: $device_path\n"; |
|
| 375 |
+ $prompt .= "- Model: " . ($drive_info->{model_name} || 'Unknown') . "\n";
|
|
| 376 |
+ $prompt .= "- Serial: " . ($drive_info->{serial_number} || 'Unknown') . "\n";
|
|
| 377 |
+ $prompt .= "- Size: " . ($drive_info->{size_gb} || 'Unknown') . " GB\n";
|
|
| 378 |
+ $prompt .= "- Data Points: " . scalar(@$history) . " readings\n\n"; |
|
| 379 |
+ |
|
| 380 |
+ $prompt .= "CRITICAL SMART PARAMETER ANALYSIS:\n"; |
|
| 381 |
+ |
|
| 382 |
+ foreach my $param_name (sort keys %$trends) {
|
|
| 383 |
+ next if $param_name eq 'temperature'; |
|
| 384 |
+ |
|
| 385 |
+ my $trend = $trends->{$param_name};
|
|
| 386 |
+ $prompt .= "- $param_name:\n"; |
|
| 387 |
+ $prompt .= " * Current: $trend->{current_value}\n";
|
|
| 388 |
+ $prompt .= " * Range: $trend->{min_value} - $trend->{max_value}\n";
|
|
| 389 |
+ $prompt .= " * Slope: " . sprintf("%.4f", $trend->{slope}) . "\n";
|
|
| 390 |
+ $prompt .= " * Volatility: " . sprintf("%.2f", $trend->{volatility}) . "\n";
|
|
| 391 |
+ $prompt .= " * Concerning: " . ($trend->{concerning} ? 'YES' : 'No') . "\n";
|
|
| 392 |
+ } |
|
| 393 |
+ |
|
| 394 |
+ if (exists $trends->{temperature}) {
|
|
| 395 |
+ my $temp = $trends->{temperature};
|
|
| 396 |
+ $prompt .= "\nTEMPERATURE ANALYSIS:\n"; |
|
| 397 |
+ $prompt .= "- Current: $temp->{current_temp}°C\n";
|
|
| 398 |
+ $prompt .= "- Average: " . sprintf("%.1f", $temp->{avg_temp}) . "°C\n";
|
|
| 399 |
+ $prompt .= "- Maximum: $temp->{max_temp}°C\n";
|
|
| 400 |
+ $prompt .= "- Trend: " . sprintf("%.3f", $temp->{slope}) . "°C per reading\n";
|
|
| 401 |
+ } |
|
| 402 |
+ |
|
| 403 |
+ $prompt .= "\nPLEASE ANALYZE THIS DATA AND PROVIDE:\n"; |
|
| 404 |
+ $prompt .= "1. Overall failure risk assessment (LOW/MODERATE/HIGH/CRITICAL)\n"; |
|
| 405 |
+ $prompt .= "2. Confidence level (0-100%)\n"; |
|
| 406 |
+ $prompt .= "3. Estimated time to failure (if applicable)\n"; |
|
| 407 |
+ $prompt .= "4. Key concerning indicators\n"; |
|
| 408 |
+ $prompt .= "5. Recommended actions\n\n"; |
|
| 409 |
+ |
|
| 410 |
+ $prompt .= "Format your response as JSON with fields: risk_level, confidence, time_to_failure_days, concerns, recommendations, reasoning\n"; |
|
| 411 |
+ |
|
| 412 |
+ return $prompt; |
|
| 413 |
+} |
|
| 414 |
+ |
|
| 415 |
+=head2 _call_openai_api |
|
| 416 |
+ |
|
| 417 |
+Make API call to OpenAI |
|
| 418 |
+ |
|
| 419 |
+=cut |
|
| 420 |
+ |
|
| 421 |
+sub _call_openai_api {
|
|
| 422 |
+ my ($self, $prompt) = @_; |
|
| 423 |
+ |
|
| 424 |
+ my $payload = {
|
|
| 425 |
+ model => $self->{model},
|
|
| 426 |
+ messages => [ |
|
| 427 |
+ {
|
|
| 428 |
+ role => 'system', |
|
| 429 |
+ content => 'You are an expert HDD failure prediction system with deep knowledge of SMART parameters and drive reliability patterns.' |
|
| 430 |
+ }, |
|
| 431 |
+ {
|
|
| 432 |
+ role => 'user', |
|
| 433 |
+ content => $prompt |
|
| 434 |
+ } |
|
| 435 |
+ ], |
|
| 436 |
+ max_tokens => $self->{max_tokens},
|
|
| 437 |
+ temperature => $self->{temperature},
|
|
| 438 |
+ }; |
|
| 439 |
+ |
|
| 440 |
+ my $response = $self->{http_client}->post(
|
|
| 441 |
+ 'https://api.openai.com/v1/chat/completions', |
|
| 442 |
+ {
|
|
| 443 |
+ headers => {
|
|
| 444 |
+ 'Authorization' => "Bearer $self->{openai_key}",
|
|
| 445 |
+ 'Content-Type' => 'application/json', |
|
| 446 |
+ }, |
|
| 447 |
+ content => encode_json($payload) |
|
| 448 |
+ } |
|
| 449 |
+ ); |
|
| 450 |
+ |
|
| 451 |
+ unless ($response->{success}) {
|
|
| 452 |
+ die "OpenAI API call failed: $response->{status} $response->{reason}";
|
|
| 453 |
+ } |
|
| 454 |
+ |
|
| 455 |
+ my $result = decode_json($response->{content});
|
|
| 456 |
+ |
|
| 457 |
+ return $result->{choices}->[0]->{message}->{content};
|
|
| 458 |
+} |
|
| 459 |
+ |
|
| 460 |
+=head2 _parse_prediction_response |
|
| 461 |
+ |
|
| 462 |
+Parse OpenAI response into structured prediction |
|
| 463 |
+ |
|
| 464 |
+=cut |
|
| 465 |
+ |
|
| 466 |
+sub _parse_prediction_response {
|
|
| 467 |
+ my ($self, $ai_response, $device_path) = @_; |
|
| 468 |
+ |
|
| 469 |
+ my $prediction = {
|
|
| 470 |
+ device_path => $device_path, |
|
| 471 |
+ timestamp => time(), |
|
| 472 |
+ prediction => 'unknown', |
|
| 473 |
+ confidence => 0, |
|
| 474 |
+ risk_level => 'unknown', |
|
| 475 |
+ message => $ai_response, |
|
| 476 |
+ }; |
|
| 477 |
+ |
|
| 478 |
+ # Try to parse JSON response |
|
| 479 |
+ eval {
|
|
| 480 |
+ my $parsed = decode_json($ai_response); |
|
| 481 |
+ |
|
| 482 |
+ $prediction->{risk_level} = lc($parsed->{risk_level}) if $parsed->{risk_level};
|
|
| 483 |
+ $prediction->{confidence} = $parsed->{confidence} if defined $parsed->{confidence};
|
|
| 484 |
+ $prediction->{time_to_failure_days} = $parsed->{time_to_failure_days} if defined $parsed->{time_to_failure_days};
|
|
| 485 |
+ $prediction->{concerns} = $parsed->{concerns} if $parsed->{concerns};
|
|
| 486 |
+ $prediction->{recommendations} = $parsed->{recommendations} if $parsed->{recommendations};
|
|
| 487 |
+ $prediction->{reasoning} = $parsed->{reasoning} if $parsed->{reasoning};
|
|
| 488 |
+ |
|
| 489 |
+ $prediction->{prediction} = 'success';
|
|
| 490 |
+ }; |
|
| 491 |
+ |
|
| 492 |
+ if ($@) {
|
|
| 493 |
+ $self->_log("Failed to parse AI response as JSON, using raw text");
|
|
| 494 |
+ $prediction->{prediction} = 'text_response';
|
|
| 495 |
+ |
|
| 496 |
+ # Try to extract basic info from text |
|
| 497 |
+ if ($ai_response =~ /risk.*?:.*?(low|moderate|high|critical)/i) {
|
|
| 498 |
+ $prediction->{risk_level} = lc($1);
|
|
| 499 |
+ } |
|
| 500 |
+ |
|
| 501 |
+ if ($ai_response =~ /confidence.*?:.*?(\d+)/i) {
|
|
| 502 |
+ $prediction->{confidence} = $1;
|
|
| 503 |
+ } |
|
| 504 |
+ } |
|
| 505 |
+ |
|
| 506 |
+ return $prediction; |
|
| 507 |
+} |
|
| 508 |
+ |
|
| 509 |
+=head2 _store_prediction |
|
| 510 |
+ |
|
| 511 |
+Store prediction results in database |
|
| 512 |
+ |
|
| 513 |
+=cut |
|
| 514 |
+ |
|
| 515 |
+sub _store_prediction {
|
|
| 516 |
+ my ($self, $prediction) = @_; |
|
| 517 |
+ |
|
| 518 |
+ my $sql = q{
|
|
| 519 |
+ INSERT INTO predictions |
|
| 520 |
+ (device_path, timestamp, risk_level, confidence, time_to_failure_days, |
|
| 521 |
+ concerns, recommendations, reasoning, raw_response) |
|
| 522 |
+ VALUES (?, to_timestamp(?), ?, ?, ?, ?, ?, ?, ?) |
|
| 523 |
+ }; |
|
| 524 |
+ |
|
| 525 |
+ $self->{db_handle}->do($sql,
|
|
| 526 |
+ undef, |
|
| 527 |
+ $prediction->{device_path},
|
|
| 528 |
+ $prediction->{timestamp},
|
|
| 529 |
+ $prediction->{risk_level},
|
|
| 530 |
+ $prediction->{confidence},
|
|
| 531 |
+ $prediction->{time_to_failure_days},
|
|
| 532 |
+ $prediction->{concerns},
|
|
| 533 |
+ $prediction->{recommendations},
|
|
| 534 |
+ $prediction->{reasoning},
|
|
| 535 |
+ $prediction->{message}
|
|
| 536 |
+ ); |
|
| 537 |
+} |
|
| 538 |
+ |
|
| 539 |
+=head2 analyze_all_drives |
|
| 540 |
+ |
|
| 541 |
+Run predictions for all active drives |
|
| 542 |
+ |
|
| 543 |
+=cut |
|
| 544 |
+ |
|
| 545 |
+sub analyze_all_drives {
|
|
| 546 |
+ my $self = shift; |
|
| 547 |
+ |
|
| 548 |
+ my $sql = q{
|
|
| 549 |
+ SELECT device_path, model_name, serial_number |
|
| 550 |
+ FROM hdd_inventory |
|
| 551 |
+ WHERE status = 'active' |
|
| 552 |
+ ORDER BY device_path |
|
| 553 |
+ }; |
|
| 554 |
+ |
|
| 555 |
+ my $sth = $self->{db_handle}->prepare($sql);
|
|
| 556 |
+ $sth->execute(); |
|
| 557 |
+ |
|
| 558 |
+ my @results = (); |
|
| 559 |
+ |
|
| 560 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 561 |
+ my $prediction = $self->predict_failure($row->{device_path});
|
|
| 562 |
+ push @results, $prediction; |
|
| 563 |
+ |
|
| 564 |
+ # Rate limiting - small delay between API calls |
|
| 565 |
+ sleep(1); |
|
| 566 |
+ } |
|
| 567 |
+ |
|
| 568 |
+ return \@results; |
|
| 569 |
+} |
|
| 570 |
+ |
|
| 571 |
+=head2 _log |
|
| 572 |
+ |
|
| 573 |
+Internal logging method |
|
| 574 |
+ |
|
| 575 |
+=cut |
|
| 576 |
+ |
|
| 577 |
+sub _log {
|
|
| 578 |
+ my ($self, $message) = @_; |
|
| 579 |
+ |
|
| 580 |
+ my $timestamp = scalar(localtime()); |
|
| 581 |
+ print "[$timestamp] PredictionEngine: $message\n" if $self->{debug};
|
|
| 582 |
+} |
|
| 583 |
+ |
|
| 584 |
+=head2 DESTROY |
|
| 585 |
+ |
|
| 586 |
+Cleanup database connection |
|
| 587 |
+ |
|
| 588 |
+=cut |
|
| 589 |
+ |
|
| 590 |
+sub DESTROY {
|
|
| 591 |
+ my $self = shift; |
|
| 592 |
+ $self->{db_handle}->disconnect() if $self->{db_handle};
|
|
| 593 |
+} |
|
| 594 |
+ |
|
| 595 |
+1; |
|
| 596 |
+ |
|
| 597 |
+__END__ |
|
| 598 |
+ |
|
| 599 |
+=head1 AUTHOR |
|
| 600 |
+ |
|
| 601 |
+AutoSMART Development Team |
|
| 602 |
+ |
|
| 603 |
+=head1 LICENSE |
|
| 604 |
+ |
|
| 605 |
+This software is part of the autoSMART project. |
|
| 606 |
+ |
|
| 607 |
+=cut |
|
@@ -0,0 +1,802 @@ |
||
| 1 |
+package SmartCollector; |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use DBI; |
|
| 6 |
+use JSON::XS; |
|
| 7 |
+use Time::HiRes qw(time); |
|
| 8 |
+use File::Slurp; |
|
| 9 |
+use Config::Simple; |
|
| 10 |
+use Digest::SHA qw(sha256_hex); |
|
| 11 |
+ |
|
| 12 |
+=head1 NAME |
|
| 13 |
+ |
|
| 14 |
+SmartCollector - SMART data collection module for autoSMART |
|
| 15 |
+ |
|
| 16 |
+=head1 DESCRIPTION |
|
| 17 |
+ |
|
| 18 |
+This module handles the collection of SMART data from HDDs identified in Madagascar inventory, |
|
| 19 |
+processes the data, and stores it in PostgreSQL for long-term analysis and AI predictions. |
|
| 20 |
+ |
|
| 21 |
+=head1 SYNOPSIS |
|
| 22 |
+ |
|
| 23 |
+ use SmartCollector; |
|
| 24 |
+ |
|
| 25 |
+ my $collector = SmartCollector->new( |
|
| 26 |
+ cluster_config => '/etc/pve/autoSMART/cluster.conf', |
|
| 27 |
+ local_config => '/etc/default/autosmart' |
|
| 28 |
+ ); |
|
| 29 |
+ |
|
| 30 |
+ # Collect data from all monitored drives |
|
| 31 |
+ $collector->collect_all(); |
|
| 32 |
+ |
|
| 33 |
+ # Collect data from specific drive |
|
| 34 |
+ $collector->collect_drive('/dev/sda');
|
|
| 35 |
+ |
|
| 36 |
+=cut |
|
| 37 |
+ |
|
| 38 |
+sub new {
|
|
| 39 |
+ my ($class, %args) = @_; |
|
| 40 |
+ |
|
| 41 |
+ my $self = {
|
|
| 42 |
+ cluster_config => $args{cluster_config} || '/etc/pve/autoSMART/cluster.conf',
|
|
| 43 |
+ local_config => $args{local_config} || '/etc/default/autosmart',
|
|
| 44 |
+ debug => $args{debug} || 0,
|
|
| 45 |
+ node_id => $args{node_id} || `hostname`,
|
|
| 46 |
+ smart_params => {},
|
|
| 47 |
+ db_handle => undef, |
|
| 48 |
+ local_settings => {},
|
|
| 49 |
+ }; |
|
| 50 |
+ |
|
| 51 |
+ chomp $self->{node_id};
|
|
| 52 |
+ |
|
| 53 |
+ bless $self, $class; |
|
| 54 |
+ $self->_load_local_config(); |
|
| 55 |
+ $self->_load_cluster_config(); |
|
| 56 |
+ $self->_connect_database(); |
|
| 57 |
+ |
|
| 58 |
+ return $self; |
|
| 59 |
+} |
|
| 60 |
+ |
|
| 61 |
+=head2 _load_local_config |
|
| 62 |
+ |
|
| 63 |
+Load local node-specific configuration from /etc/default/autosmart |
|
| 64 |
+ |
|
| 65 |
+=cut |
|
| 66 |
+ |
|
| 67 |
+sub _load_local_config {
|
|
| 68 |
+ my $self = shift; |
|
| 69 |
+ |
|
| 70 |
+ return unless -f $self->{local_config};
|
|
| 71 |
+ |
|
| 72 |
+ open my $fh, '<', $self->{local_config}
|
|
| 73 |
+ or die "Cannot read local config: $self->{local_config}: $!";
|
|
| 74 |
+ |
|
| 75 |
+ while (my $line = <$fh>) {
|
|
| 76 |
+ chomp $line; |
|
| 77 |
+ next if $line =~ /^\s*#/ || $line =~ /^\s*$/; |
|
| 78 |
+ |
|
| 79 |
+ if ($line =~ /^(\w+)=(.+)$/) {
|
|
| 80 |
+ my ($key, $value) = ($1, $2); |
|
| 81 |
+ $value =~ s/^["']|["']$//g; # Remove quotes |
|
| 82 |
+ $self->{local_settings}->{$key} = $value;
|
|
| 83 |
+ } |
|
| 84 |
+ } |
|
| 85 |
+ |
|
| 86 |
+ close $fh; |
|
| 87 |
+ |
|
| 88 |
+ # Apply debug settings |
|
| 89 |
+ if (($self->{local_settings}->{AUTOSMART_DEBUG_ENABLED} // '') eq 'true') {
|
|
| 90 |
+ $self->{debug} = $self->{local_settings}->{AUTOSMART_DEBUG_LEVEL} || 1;
|
|
| 91 |
+ } |
|
| 92 |
+ |
|
| 93 |
+ $self->_log("Loaded local configuration from $self->{local_config}");
|
|
| 94 |
+} |
|
| 95 |
+ |
|
| 96 |
+=head2 _load_cluster_config |
|
| 97 |
+ |
|
| 98 |
+Load cluster-wide configuration from Proxmox shared storage |
|
| 99 |
+ |
|
| 100 |
+=cut |
|
| 101 |
+ |
|
| 102 |
+sub _load_cluster_config {
|
|
| 103 |
+ my $self = shift; |
|
| 104 |
+ |
|
| 105 |
+ unless (-f $self->{cluster_config}) {
|
|
| 106 |
+ die "Cluster configuration not found: $self->{cluster_config}";
|
|
| 107 |
+ } |
|
| 108 |
+ |
|
| 109 |
+ my $cfg = Config::Simple->new($self->{cluster_config})
|
|
| 110 |
+ or die "Cannot load cluster config file: $self->{cluster_config}";
|
|
| 111 |
+ |
|
| 112 |
+ # Load monitoring settings |
|
| 113 |
+ $self->{collection_interval} = $cfg->param('cluster.collection_interval')
|
|
| 114 |
+ || $self->{local_settings}->{AUTOSMART_COLLECTION_INTERVAL} || 300;
|
|
| 115 |
+ $self->{collection_timeout} = $cfg->param('cluster.collection_timeout')
|
|
| 116 |
+ || $self->{local_settings}->{AUTOSMART_COLLECTION_TIMEOUT} || 30;
|
|
| 117 |
+ $self->{madagascar_inventory} = $cfg->param('madagascar.inventory_path');
|
|
| 118 |
+ |
|
| 119 |
+ # Load cluster information |
|
| 120 |
+ $self->{cluster_name} = $cfg->param('cluster.cluster_name');
|
|
| 121 |
+ $self->{cluster_nodes} = [split /,/, join(',', grep { defined } $cfg->param('cluster.nodes'))]; # list context: Config::Simple may already split comma-separated values
|
|
| 122 |
+ |
|
| 123 |
+ # Load SMART parameters from cluster config |
|
| 124 |
+ my @param_keys = keys %{ $cfg->param(-block => 'smart_parameters') || {} }; # param(-block => ...) returns a hash reference |
|
| 125 |
+ |
|
| 126 |
+ foreach my $key (@param_keys) {
|
|
| 127 |
+ my $value = join ',', grep { defined } $cfg->param("smart_parameters.$key"); # rejoin in case Config::Simple split the value on commas
|
|
| 128 |
+ my ($threshold, $weight, $enabled, $description) = split /,/, $value, 4; |
|
| 129 |
+ |
|
| 130 |
+ $self->{smart_params}->{$key} = {
|
|
| 131 |
+ threshold => $threshold, |
|
| 132 |
+ weight => $weight, |
|
| 133 |
+ enabled => ($enabled eq 'true'), |
|
| 134 |
+ description => $description, |
|
| 135 |
+ } if $enabled eq 'true'; |
|
| 136 |
+ } |
|
| 137 |
+ |
|
| 138 |
+ $self->_log("Loaded cluster configuration: $self->{cluster_name} (" .
|
|
| 139 |
+ keys(%{$self->{smart_params}}) . " SMART parameters)");
|
|
| 140 |
+} |
|
| 141 |
+ |
|
| 142 |
+=head2 _connect_database |
|
| 143 |
+ |
|
| 144 |
+Establish PostgreSQL database connection using cluster configuration |
|
| 145 |
+ |
|
| 146 |
+=cut |
|
| 147 |
+ |
|
| 148 |
+sub _connect_database {
|
|
| 149 |
+ my $self = shift; |
|
| 150 |
+ |
|
| 151 |
+ my $cfg = Config::Simple->new($self->{cluster_config})
|
|
| 152 |
+ or die "Cannot load cluster config for database: $self->{cluster_config}";
|
|
| 153 |
+ |
|
| 154 |
+ my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
|
|
| 155 |
+ $cfg->param('database.database'),
|
|
| 156 |
+ $cfg->param('database.host'),
|
|
| 157 |
+ $cfg->param('database.port')
|
|
| 158 |
+ ); |
|
| 159 |
+ |
|
| 160 |
+ my $timeout = $cfg->param('database.connection_timeout') || 30;
|
|
| 161 |
+ |
|
| 162 |
+ $self->{db_handle} = DBI->connect(
|
|
| 163 |
+ $dsn, |
|
| 164 |
+ $cfg->param('database.username'),
|
|
| 165 |
+ $cfg->param('database.password'),
|
|
| 166 |
+ {
|
|
| 167 |
+ RaiseError => 1, |
|
| 168 |
+ AutoCommit => 1, |
|
| 169 |
+ pg_enable_utf8 => 1, |
|
| 170 |
+ connect_timeout => $timeout, |
|
| 171 |
+ } |
|
| 172 |
+ ) or die "Database connection failed: $DBI::errstr"; |
|
| 173 |
+ |
|
| 174 |
+ # Register this node in the cluster |
|
| 175 |
+ $self->_register_node(); |
|
| 176 |
+ |
|
| 177 |
+ $self->_log("Database connection established to cluster database");
|
|
| 178 |
+} |
|
| 179 |
+ |
|
| 180 |
+=head2 get_madagascar_drives |
|
| 181 |
+ |
|
| 182 |
+Get list of HDDs from Madagascar inventory (cluster-shared) |
|
| 183 |
+ |
|
| 184 |
+=cut |
|
| 185 |
+ |
|
| 186 |
+sub get_madagascar_drives {
|
|
| 187 |
+ my $self = shift; |
|
| 188 |
+ |
|
| 189 |
+ return [] unless -f $self->{madagascar_inventory};
|
|
| 190 |
+ |
|
| 191 |
+ my $inventory_json = read_file($self->{madagascar_inventory});
|
|
| 192 |
+ my $inventory = decode_json($inventory_json); |
|
| 193 |
+ |
|
| 194 |
+ my @drives = (); |
|
| 195 |
+ |
|
| 196 |
+ # Extract HDD information from Madagascar inventory |
|
| 197 |
+ if (ref $inventory eq 'HASH' && exists $inventory->{storage}) {
|
|
| 198 |
+ foreach my $storage (@{$inventory->{storage}}) {
|
|
| 199 |
+ # Only include drives for this node |
|
| 200 |
+ next if $storage->{node_id} && $storage->{node_id} ne $self->{node_id};
|
|
| 201 |
+ next unless ($storage->{type} // '') eq 'HDD';
|
|
| 202 |
+ next unless $storage->{device_path};
|
|
| 203 |
+ |
|
| 204 |
+ push @drives, {
|
|
| 205 |
+ device_path => $storage->{device_path},
|
|
| 206 |
+ serial => $storage->{serial},
|
|
| 207 |
+ model => $storage->{model},
|
|
| 208 |
+ size_gb => $storage->{size_gb},
|
|
| 209 |
+ madagascar_id => $storage->{id},
|
|
| 210 |
+ node_id => $self->{node_id},
|
|
| 211 |
+ }; |
|
| 212 |
+ } |
|
| 213 |
+ } |
|
| 214 |
+ |
|
| 215 |
+ $self->_log("Found " . @drives . " HDDs for node $self->{node_id} in Madagascar inventory");
|
|
| 216 |
+ return \@drives; |
|
| 217 |
+} |
|
| 218 |
+ |
|
| 219 |
+=head2 collect_smart_data |
|
| 220 |
+ |
|
| 221 |
+Collect SMART data from a specific drive |
|
| 222 |
+ |
|
| 223 |
+=cut |
|
| 224 |
+ |
|
| 225 |
+sub collect_smart_data {
|
|
| 226 |
+ my ($self, $device_path) = @_; |
|
| 227 |
+ |
|
| 228 |
+ my $cmd = "smartctl -i -H -A -f brief -j '$device_path' 2>/dev/null"; # -i and -H add the identity and health fields parsed below |
|
| 229 |
+ my $output = `$cmd`; |
|
| 230 |
+ my $exit_code = $? >> 8; |
|
| 231 |
+ |
|
| 232 |
+ # Parse smartctl JSON output |
|
| 233 |
+ my $smart_data = {};
|
|
| 234 |
+ |
|
| 235 |
+ eval {
|
|
| 236 |
+ $smart_data = decode_json($output); |
|
| 237 |
+ }; |
|
| 238 |
+ |
|
| 239 |
+ if ($@) {
|
|
| 240 |
+ $self->_log("Failed to parse SMART data for $device_path: $@");
|
|
| 241 |
+ return undef; |
|
| 242 |
+ } |
|
| 243 |
+ |
|
| 244 |
+ return $self->_process_smart_data($smart_data, $device_path); |
|
| 245 |
+} |
|
| 246 |
+ |
|
| 247 |
+=head2 _process_smart_data |
|
| 248 |
+ |
|
| 249 |
+Process and normalize SMART data |
|
| 250 |
+ |
|
| 251 |
+=cut |
|
| 252 |
+ |
|
| 253 |
+sub _process_smart_data {
|
|
| 254 |
+ my ($self, $raw_data, $device_path) = @_; |
|
| 255 |
+ |
|
| 256 |
+ my $processed = {
|
|
| 257 |
+ device_path => $device_path, |
|
| 258 |
+ timestamp => time(), |
|
| 259 |
+ collection_ok => ($raw_data->{smart_status}->{passed} || 0),
|
|
| 260 |
+ temperature => 0, |
|
| 261 |
+ parameters => {},
|
|
| 262 |
+ }; |
|
| 263 |
+ |
|
| 264 |
+ # Extract device information |
|
| 265 |
+ if (exists $raw_data->{device}) {
|
|
| 266 |
+ $processed->{model_name} = $raw_data->{model_name} || ''; # smartctl JSON reports identity fields at the top level
|
|
| 267 |
+ $processed->{serial_number} = $raw_data->{serial_number} || '';
|
|
| 268 |
+ $processed->{firmware} = $raw_data->{firmware_version} || '';
|
|
| 269 |
+ } |
|
| 270 |
+ |
|
| 271 |
+ # Extract temperature |
|
| 272 |
+ if (exists $raw_data->{temperature}) {
|
|
| 273 |
+ $processed->{temperature} = $raw_data->{temperature}->{current} || 0;
|
|
| 274 |
+ } |
|
| 275 |
+ |
|
| 276 |
+ # Extract SMART attributes |
|
| 277 |
+ if (exists $raw_data->{ata_smart_attributes}->{table}) {
|
|
| 278 |
+ foreach my $attr (@{$raw_data->{ata_smart_attributes}->{table}}) {
|
|
| 279 |
+ my $name = $attr->{name};
|
|
| 280 |
+ |
|
| 281 |
+ # Only collect monitored parameters |
|
| 282 |
+ next unless exists $self->{smart_params}->{$name};
|
|
| 283 |
+ |
|
| 284 |
+ $processed->{parameters}->{$name} = {
|
|
| 285 |
+ id => $attr->{id},
|
|
| 286 |
+ value => $attr->{value},
|
|
| 287 |
+ worst => $attr->{worst},
|
|
| 288 |
+ thresh => $attr->{thresh},
|
|
| 289 |
+ raw_value => $attr->{raw}->{value},
|
|
| 290 |
+ when_failed => $attr->{when_failed} || '',
|
|
| 291 |
+ flags => $attr->{flags}->{string} || '',
|
|
| 292 |
+ }; |
|
| 293 |
+ } |
|
| 294 |
+ } |
|
| 295 |
+ |
|
| 296 |
+ return $processed; |
|
| 297 |
+} |
|
| 298 |
+ |
|
| 299 |
+=head2 store_smart_data |
|
| 300 |
+ |
|
| 301 |
+Store processed SMART data using hardware-based tracking with migration detection |
|
| 302 |
+ |
|
| 303 |
+=cut |
|
| 304 |
+ |
|
| 305 |
+sub store_smart_data {
|
|
| 306 |
+ my ($self, $drive_info, $smart_data) = @_; |
|
| 307 |
+ |
|
| 308 |
+ eval {
|
|
| 309 |
+ # Detect/handle HDD migration first |
|
| 310 |
+ my $hdd_id = $self->_detect_or_create_hdd($drive_info, $smart_data); |
|
| 311 |
+ |
|
| 312 |
+ # Check if we should store this reading using differential storage |
|
| 313 |
+ my $should_store = $self->_should_store_reading($hdd_id, $smart_data); |
|
| 314 |
+ |
|
| 315 |
+ if ($should_store->{store}) {
|
|
| 316 |
+ # Insert SMART reading with differential storage information |
|
| 317 |
+ $self->_insert_smart_reading_differential($hdd_id, $drive_info, $smart_data, $should_store); |
|
| 318 |
+ |
|
| 319 |
+ $self->_log("Stored SMART data for HDD ID $hdd_id (Serial: $smart_data->{serial_number}, Type: $should_store->{type})", 2);
|
|
| 320 |
+ } else {
|
|
| 321 |
+ $self->_log("Skipped unchanged SMART data for HDD ID $hdd_id (Serial: $smart_data->{serial_number})", 3);
|
|
| 322 |
+ } |
|
| 323 |
+ }; |
|
| 324 |
+ |
|
| 325 |
+ if ($@) {
|
|
| 326 |
+ $self->_log("ERROR storing SMART data: $@", 1);
|
|
| 327 |
+ return 0; |
|
| 328 |
+ } |
|
| 329 |
+ |
|
| 330 |
+ return 1; |
|
| 331 |
+} |
|
| 332 |
+ |
|
| 333 |
+=head2 _detect_or_create_hdd |
|
| 334 |
+ |
|
| 335 |
+Detect HDD migration or create new HDD record using hardware identifiers |
|
| 336 |
+ |
|
| 337 |
+=cut |
|
| 338 |
+ |
|
| 339 |
+sub _detect_or_create_hdd {
|
|
| 340 |
+ my ($self, $drive_info, $smart_data) = @_; |
|
| 341 |
+ |
|
| 342 |
+ my $serial = $smart_data->{serial_number} || 'unknown';
|
|
| 343 |
+ my $model = $smart_data->{model_name} || 'unknown';
|
|
| 344 |
+ my $device_path = $drive_info->{device_path};
|
|
| 345 |
+ |
|
| 346 |
+ # Call PostgreSQL function to detect migration |
|
| 347 |
+ my $sth = $self->{db_handle}->prepare(q{
|
|
| 348 |
+ SELECT detect_hdd_migration(?, ?, ?, ?, ?, 'collector') |
|
| 349 |
+ }); |
|
| 350 |
+ |
|
| 351 |
+ $sth->execute( |
|
| 352 |
+ $serial, |
|
| 353 |
+ $model, |
|
| 354 |
+ $device_path, |
|
| 355 |
+ $self->{node_id},
|
|
| 356 |
+ $drive_info->{slot} || undef
|
|
| 357 |
+ ); |
|
| 358 |
+ |
|
| 359 |
+ my ($hdd_id) = $sth->fetchrow_array(); |
|
| 360 |
+ |
|
| 361 |
+ # If NULL returned, this is a new HDD - create it |
|
| 362 |
+ if (!defined $hdd_id) {
|
|
| 363 |
+ $hdd_id = $self->_create_new_hdd($drive_info, $smart_data); |
|
| 364 |
+ $self->_log("New HDD discovered: $serial ($model) at $device_path", 2);
|
|
| 365 |
+ } else {
|
|
| 366 |
+ $self->_log("HDD tracked: ID $hdd_id, Serial $serial", 3);
|
|
| 367 |
+ } |
|
| 368 |
+ |
|
| 369 |
+ return $hdd_id; |
|
| 370 |
+} |
|
| 371 |
+ |
|
| 372 |
+=head2 _create_new_hdd |
|
| 373 |
+ |
|
| 374 |
+Create new HDD record with hardware-based identification |
|
| 375 |
+ |
|
| 376 |
+=cut |
|
| 377 |
+ |
|
| 378 |
+sub _create_new_hdd {
|
|
| 379 |
+ my ($self, $drive_info, $smart_data) = @_; |
|
| 380 |
+ |
|
| 381 |
+ my $sql = q{
|
|
| 382 |
+ INSERT INTO hdd_inventory |
|
| 383 |
+ (serial_number, model_name, firmware, size_gb, manufacturer, |
|
| 384 |
+ current_device_path, current_node_id, current_slot, |
|
| 385 |
+ madagascar_id, first_seen, last_seen, status) |
|
| 386 |
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, NOW(), NOW(), 'active') |
|
| 387 |
+ RETURNING id |
|
| 388 |
+ }; |
|
| 389 |
+ |
|
| 390 |
+ my $sth = $self->{db_handle}->prepare($sql);
|
|
| 391 |
+ $sth->execute( |
|
| 392 |
+ $smart_data->{serial_number} || 'unknown',
|
|
| 393 |
+ $smart_data->{model_name} || 'unknown',
|
|
| 394 |
+ $smart_data->{firmware} || '',
|
|
| 395 |
+ $drive_info->{size_gb} || 0,
|
|
| 396 |
+ $self->_extract_manufacturer($smart_data->{model_name}),
|
|
| 397 |
+ $drive_info->{device_path},
|
|
| 398 |
+ $self->{node_id},
|
|
| 399 |
+ $drive_info->{slot} || undef,
|
|
| 400 |
+ $drive_info->{madagascar_id}
|
|
| 401 |
+ ); |
|
| 402 |
+ |
|
| 403 |
+ my ($hdd_id) = $sth->fetchrow_array(); |
|
| 404 |
+ |
|
| 405 |
+ # Create discovery alert |
|
| 406 |
+ $self->_create_discovery_alert($hdd_id, $drive_info, $smart_data); |
|
| 407 |
+ |
|
| 408 |
+ return $hdd_id; |
|
| 409 |
+} |
|
| 410 |
+ |
|
| 411 |
+=head2 _extract_manufacturer |
|
| 412 |
+ |
|
| 413 |
+Extract manufacturer from model name |
|
| 414 |
+ |
|
| 415 |
+=cut |
|
| 416 |
+ |
|
| 417 |
+sub _extract_manufacturer {
|
|
| 418 |
+ my ($self, $model_name) = @_; |
|
| 419 |
+ |
|
| 420 |
+ return 'Unknown' unless $model_name; |
|
| 421 |
+ |
|
| 422 |
+ # Common HDD manufacturer patterns |
|
| 423 |
+ my %manufacturers = ( |
|
| 424 |
+ qr/^WD|Western\s*Digital/i => 'Western Digital', |
|
| 425 |
+ qr/^ST|Seagate/i => 'Seagate', |
|
| 426 |
+ qr/^HGST|Hitachi/i => 'HGST/Hitachi', |
|
| 427 |
+ qr/^TOSHIBA/i => 'Toshiba', |
|
| 428 |
+ qr/^Samsung/i => 'Samsung', |
|
| 429 |
+ qr/^Maxtor/i => 'Maxtor', |
|
| 430 |
+ qr/^Fujitsu/i => 'Fujitsu', |
|
| 431 |
+ ); |
|
| 432 |
+ |
|
| 433 |
+ foreach my $pattern (keys %manufacturers) {
|
|
| 434 |
+ return $manufacturers{$pattern} if $model_name =~ /$pattern/;
|
|
| 435 |
+ } |
|
| 436 |
+ |
|
| 437 |
+ # Extract first word as fallback |
|
| 438 |
+ if ($model_name =~ /^(\w+)/) {
|
|
| 439 |
+ return $1; |
|
| 440 |
+ } |
|
| 441 |
+ |
|
| 442 |
+ return 'Unknown'; |
|
| 443 |
+} |
|
| 444 |
+ |
|
| 445 |
+=head2 _create_discovery_alert |
|
| 446 |
+ |
|
| 447 |
+Create alert for new HDD discovery |
|
| 448 |
+ |
|
| 449 |
+=cut |
|
| 450 |
+ |
|
| 451 |
+sub _create_discovery_alert {
|
|
| 452 |
+ my ($self, $hdd_id, $drive_info, $smart_data) = @_; |
|
| 453 |
+ |
|
| 454 |
+ my $sql = q{
|
|
| 455 |
+ INSERT INTO alert_history |
|
| 456 |
+ (hdd_id, serial_number, device_path, node_id, alert_type, message) |
|
| 457 |
+ VALUES (?, ?, ?, ?, 'discovery', ?) |
|
| 458 |
+ }; |
|
| 459 |
+ |
|
| 460 |
+ my $message = sprintf( |
|
| 461 |
+ "New HDD discovered: %s (%s) at %s on node %s - Size: %s GB", |
|
| 462 |
+ $smart_data->{serial_number} || 'unknown',
|
|
| 463 |
+ $smart_data->{model_name} || 'unknown',
|
|
| 464 |
+ $drive_info->{device_path},
|
|
| 465 |
+ $self->{node_id},
|
|
| 466 |
+ $drive_info->{size_gb} || '?'
|
|
| 467 |
+ ); |
|
| 468 |
+ |
|
| 469 |
+ $self->{db_handle}->do($sql, undef,
|
|
| 470 |
+ $hdd_id, |
|
| 471 |
+ $smart_data->{serial_number},
|
|
| 472 |
+ $drive_info->{device_path},
|
|
| 473 |
+ $self->{node_id},
|
|
| 474 |
+ $message |
|
| 475 |
+ ); |
|
| 476 |
+} |
|
| 477 |
+ |
|
| 478 |
+=head2 _should_store_reading |
|
| 479 |
+ |
|
| 480 |
+Check if SMART reading should be stored using differential storage logic |
|
| 481 |
+ |
|
| 482 |
+=cut |
|
| 483 |
+ |
|
| 484 |
+sub _should_store_reading {
|
|
| 485 |
+ my ($self, $hdd_id, $smart_data) = @_; |
|
| 486 |
+ |
|
| 487 |
+ # Generate checksum of SMART parameters |
|
| 488 |
+ my $parameters_json = JSON::XS->new->utf8->canonical->encode($smart_data->{parameters}); # canonical (sorted keys) keeps the checksum stable across runs
|
|
| 489 |
+ my $checksum = sha256_hex($parameters_json . ($smart_data->{temperature} || ''));
|
|
| 490 |
+ |
|
| 491 |
+ # Call PostgreSQL function to determine if we should store this reading |
|
| 492 |
+ my $sth = $self->{db_handle}->prepare(q{
|
|
| 493 |
+ SELECT * FROM should_store_smart_reading(?, ?, ?, NOW()) |
|
| 494 |
+ }); |
|
| 495 |
+ |
|
| 496 |
+ $sth->execute($hdd_id, $parameters_json, $checksum); |
|
| 497 |
+ |
|
| 498 |
+ my $result = $sth->fetchrow_hashref(); |
|
| 499 |
+ |
|
| 500 |
+ return {
|
|
| 501 |
+ store => $result->{should_store},
|
|
| 502 |
+ type => $result->{reading_type},
|
|
| 503 |
+ changes_detected => $result->{changes_detected},
|
|
| 504 |
+ changed_parameters => $result->{changed_parameters},
|
|
| 505 |
+ previous_reading_id => $result->{previous_reading_id},
|
|
| 506 |
+ checksum => $checksum |
|
| 507 |
+ }; |
|
| 508 |
+} |
|
| 509 |
+ |
|
| 510 |
+=head2 _insert_smart_reading_differential |
|
| 511 |
+ |
|
| 512 |
+Insert SMART reading with differential storage information |
|
| 513 |
+ |
|
| 514 |
+=cut |
|
| 515 |
+ |
|
| 516 |
+sub _insert_smart_reading_differential {
|
|
| 517 |
+ my ($self, $hdd_id, $drive_info, $smart_data, $storage_info) = @_; |
|
| 518 |
+ |
|
| 519 |
+ my $sql = q{
|
|
| 520 |
+ INSERT INTO smart_readings |
|
| 521 |
+ (hdd_id, serial_number, device_path, node_id, timestamp, |
|
| 522 |
+ collection_ok, temperature, parameters_json, reading_type, |
|
| 523 |
+ changes_detected, changed_parameters, previous_reading_id, checksum) |
|
| 524 |
+ VALUES (?, ?, ?, ?, to_timestamp(?), ?, ?, ?, ?, ?, ?, ?, ?) |
|
| 525 |
+ }; |
|
| 526 |
+ |
|
| 527 |
+ # For differential readings, only store changed parameters |
|
| 528 |
+ my $parameters_to_store; |
|
| 529 |
+ if ($storage_info->{type} eq 'differential' && $storage_info->{changed_parameters}) {
|
|
| 530 |
+ # Extract only changed parameters |
|
| 531 |
+ my $changed_params = decode_json($storage_info->{changed_parameters});
|
|
| 532 |
+ my $all_params = $smart_data->{parameters};
|
|
| 533 |
+ $parameters_to_store = {};
|
|
| 534 |
+ |
|
| 535 |
+ for my $param_name (@$changed_params) {
|
|
| 536 |
+ $parameters_to_store->{$param_name} = $all_params->{$param_name};
|
|
| 537 |
+ } |
|
| 538 |
+ } else {
|
|
| 539 |
+ # Store all parameters for baseline/full readings |
|
| 540 |
+ $parameters_to_store = $smart_data->{parameters};
|
|
| 541 |
+ } |
|
| 542 |
+ |
|
| 543 |
+ my $parameters_json = encode_json($parameters_to_store); |
|
| 544 |
+ |
|
| 545 |
+ $self->{db_handle}->do($sql,
|
|
| 546 |
+ undef, |
|
| 547 |
+ $hdd_id, |
|
| 548 |
+ $smart_data->{serial_number},
|
|
| 549 |
+ $drive_info->{device_path},
|
|
| 550 |
+ $self->{node_id},
|
|
| 551 |
+ $smart_data->{timestamp},
|
|
| 552 |
+ $smart_data->{collection_ok},
|
|
| 553 |
+ $smart_data->{temperature},
|
|
| 554 |
+ $parameters_json, |
|
| 555 |
+ $storage_info->{type},
|
|
| 556 |
+ $storage_info->{changes_detected} ? 'true' : 'false',
|
|
| 557 |
+ $storage_info->{changed_parameters},
|
|
| 558 |
+ $storage_info->{previous_reading_id},
|
|
| 559 |
+ $storage_info->{checksum}
|
|
| 560 |
+ ); |
|
| 561 |
+} |
|
| 562 |
+ |
|
| 563 |
+=head2 _insert_smart_reading |
|
| 564 |
+ |
|
| 565 |
+Insert SMART reading linked to hardware ID (legacy method for compatibility) |
|
| 566 |
+ |
|
| 567 |
+=cut |
|
| 568 |
+ |
|
| 569 |
+sub _insert_smart_reading {
|
|
| 570 |
+ my ($self, $hdd_id, $drive_info, $smart_data) = @_; |
|
| 571 |
+ |
|
| 572 |
+ my $sql = q{
|
|
| 573 |
+ INSERT INTO smart_readings |
|
| 574 |
+ (hdd_id, serial_number, device_path, node_id, timestamp, |
|
| 575 |
+ collection_ok, temperature, parameters_json) |
|
| 576 |
+ VALUES (?, ?, ?, ?, to_timestamp(?), ?, ?, ?) |
|
| 577 |
+ }; |
|
| 578 |
+ |
|
| 579 |
+ my $parameters_json = encode_json($smart_data->{parameters});
|
|
| 580 |
+ |
|
| 581 |
+ $self->{db_handle}->do($sql,
|
|
| 582 |
+ undef, |
|
| 583 |
+ $hdd_id, |
|
| 584 |
+ $smart_data->{serial_number},
|
|
| 585 |
+ $drive_info->{device_path},
|
|
| 586 |
+ $self->{node_id},
|
|
| 587 |
+ $smart_data->{timestamp},
|
|
| 588 |
+ $smart_data->{collection_ok},
|
|
| 589 |
+ $smart_data->{temperature},
|
|
| 590 |
+ $parameters_json |
|
| 591 |
+ ); |
|
| 592 |
+} |
|
| 593 |
+ |
|
| 594 |
+=head2 collect_all |
|
| 595 |
+ |
|
| 596 |
+Collect SMART data from all drives in Madagascar inventory |
|
| 597 |
+ |
|
| 598 |
+=cut |
|
| 599 |
+ |
|
| 600 |
+sub collect_all {
|
|
| 601 |
+ my $self = shift; |
|
| 602 |
+ |
|
| 603 |
+ my $drives = $self->get_madagascar_drives(); |
|
| 604 |
+ my $successful = 0; |
|
| 605 |
+ my $failed = 0; |
|
| 606 |
+ my $storage_stats = {
|
|
| 607 |
+ baseline => 0, |
|
| 608 |
+ full => 0, |
|
| 609 |
+ differential => 0, |
|
| 610 |
+ skipped => 0 |
|
| 611 |
+ }; |
|
| 612 |
+ |
|
| 613 |
+ foreach my $drive (@$drives) {
|
|
| 614 |
+ my $smart_data = $self->collect_smart_data($drive->{device_path});
|
|
| 615 |
+ |
|
| 616 |
+ if ($smart_data && $self->store_smart_data($drive, $smart_data)) {
|
|
| 617 |
+ $successful++; |
|
| 618 |
+ } else {
|
|
| 619 |
+ $failed++; |
|
| 620 |
+ $self->_log("Failed to collect/store data for $drive->{device_path}");
|
|
| 621 |
+ } |
|
| 622 |
+ |
|
| 623 |
+ # Small delay between drives to avoid overwhelming system |
|
| 624 |
+ select(undef, undef, undef, 0.1); |
|
| 625 |
+ } |
|
| 626 |
+ |
|
| 627 |
+ # Get storage statistics for this collection run |
|
| 628 |
+ my $stats = $self->_get_recent_storage_stats(); |
|
| 629 |
+ $self->_log("Collection complete: $successful successful, $failed failed");
|
|
| 630 |
+ $self->_log("Storage efficiency - Baseline: $stats->{baseline}, Full: $stats->{full}, Differential: $stats->{differential}, Total: $stats->{total}");
|
|
| 631 |
+ |
|
| 632 |
+ return {
|
|
| 633 |
+ successful => $successful, |
|
| 634 |
+ failed => $failed, |
|
| 635 |
+ total => scalar(@$drives), |
|
| 636 |
+ storage_stats => $stats |
|
| 637 |
+ }; |
|
| 638 |
+} |
|
| 639 |
+ |
|
| 640 |
+=head2 _get_recent_storage_stats |
|
| 641 |
+ |
|
| 642 |
+Get statistics about storage efficiency from recent readings |
|
| 643 |
+ |
|
| 644 |
+=cut |
|
| 645 |
+ |
|
| 646 |
+sub _get_recent_storage_stats {
|
|
| 647 |
+ my $self = shift; |
|
| 648 |
+ |
|
| 649 |
+ my $sql = q{
|
|
| 650 |
+ SELECT |
|
| 651 |
+ reading_type, |
|
| 652 |
+ COUNT(*) as count |
|
| 653 |
+ FROM smart_readings |
|
| 654 |
+ WHERE timestamp > NOW() - INTERVAL '1 hour' |
|
| 655 |
+ GROUP BY reading_type |
|
| 656 |
+ ORDER BY reading_type |
|
| 657 |
+ }; |
|
| 658 |
+ |
|
| 659 |
+ my $sth = $self->{db_handle}->prepare($sql);
|
|
| 660 |
+ $sth->execute(); |
|
| 661 |
+ |
|
| 662 |
+ my $stats = {
|
|
| 663 |
+ baseline => 0, |
|
| 664 |
+ full => 0, |
|
| 665 |
+ differential => 0, |
|
| 666 |
+ total => 0 |
|
| 667 |
+ }; |
|
| 668 |
+ |
|
| 669 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 670 |
+ $stats->{$row->{reading_type}} = $row->{count};
|
|
| 671 |
+ $stats->{total} += $row->{count};
|
|
| 672 |
+ } |
|
| 673 |
+ |
|
| 674 |
+ # Calculate efficiency percentage |
|
| 675 |
+ my $efficient_readings = $stats->{differential} + $stats->{baseline};
|
|
| 676 |
+ my $efficiency_pct = $stats->{total} > 0 ?
|
|
| 677 |
+ sprintf("%.1f", ($efficient_readings / $stats->{total}) * 100) : 0;
|
|
| 678 |
+ |
|
| 679 |
+ $stats->{efficiency_percent} = $efficiency_pct;
|
|
| 680 |
+ |
|
| 681 |
+ return $stats; |
|
| 682 |
+} |
|
| 683 |
+ |
|
| 684 |
+=head2 _log |
|
| 685 |
+ |
|
| 686 |
+Internal logging method with enhanced debug levels |
|
| 687 |
+ |
|
| 688 |
+=cut |
|
| 689 |
+ |
|
| 690 |
+sub _log {
|
|
| 691 |
+ my ($self, $message, $level) = @_; |
|
| 692 |
+ |
|
| 693 |
+ $level ||= 1; # Default to basic level |
|
| 694 |
+ |
|
| 695 |
+ # Check if we should log based on debug level |
|
| 696 |
+ return unless $self->{debug} >= $level;
|
|
| 697 |
+ |
|
| 698 |
+ my $timestamp = scalar(localtime()); |
|
| 699 |
+ my $node_id = $self->{node_id} || 'unknown';
|
|
| 700 |
+ my $prefix = "[$timestamp] [$node_id] SmartCollector"; |
|
| 701 |
+ |
|
| 702 |
+ if ($self->{debug}) {
|
|
| 703 |
+ print "$prefix: $message\n"; |
|
| 704 |
+ } |
|
| 705 |
+ |
|
| 706 |
+ # Also log to syslog if enabled |
|
| 707 |
+ if (($self->{local_settings}->{AUTOSMART_LOG_SYSLOG} // '') eq 'true') {
|
|
| 708 |
+ eval {
|
|
| 709 |
+ use Sys::Syslog qw(:standard :macros); |
|
| 710 |
+ my $facility = $self->{local_settings}->{AUTOSMART_LOG_FACILITY} || 'daemon';
|
|
| 711 |
+ openlog('autosmart', 'pid,ndelay', $facility);
|
|
| 712 |
+ syslog(LOG_INFO, "SmartCollector[$node_id]: $message"); |
|
| 713 |
+ closelog(); |
|
| 714 |
+ }; |
|
| 715 |
+ } |
|
| 716 |
+ |
|
| 717 |
+ # Log to file if specified |
|
| 718 |
+ my $log_file = $self->{local_settings}->{AUTOSMART_DEBUG_LOG_FILE};
|
|
| 719 |
+ if ($log_file && $self->{debug} >= 2) {
|
|
| 720 |
+ eval {
|
|
| 721 |
+ open my $fh, '>>', $log_file or die "Cannot open log file $log_file: $!"; |
|
| 722 |
+ print $fh "$prefix: $message\n"; |
|
| 723 |
+ close $fh; |
|
| 724 |
+ }; |
|
| 725 |
+ } |
|
| 726 |
+} |
|
| 727 |
+ |
|
| 728 |
+=head2 _register_node |
|
| 729 |
+ |
|
| 730 |
+Register this node in the cluster database |
|
| 731 |
+ |
|
| 732 |
+=cut |
|
| 733 |
+ |
|
| 734 |
+sub _register_node {
|
|
| 735 |
+ my $self = shift; |
|
| 736 |
+ |
|
| 737 |
+ eval {
|
|
| 738 |
+ # Create cluster_nodes table if it doesn't exist |
|
| 739 |
+ $self->{db_handle}->do(q{
|
|
| 740 |
+ CREATE TABLE IF NOT EXISTS cluster_nodes ( |
|
| 741 |
+ node_id VARCHAR(100) PRIMARY KEY, |
|
| 742 |
+ hostname VARCHAR(255), |
|
| 743 |
+ ip_address INET, |
|
| 744 |
+ last_seen TIMESTAMP WITH TIME ZONE DEFAULT NOW(), |
|
| 745 |
+ status VARCHAR(20) DEFAULT 'active', |
|
| 746 |
+ version VARCHAR(50), |
|
| 747 |
+ capabilities JSON, |
|
| 748 |
+ created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 749 |
+ ) |
|
| 750 |
+ }); |
|
| 751 |
+ |
|
| 752 |
+ # Register/update this node |
|
| 753 |
+ my $hostname = `hostname -f`; |
|
| 754 |
+ chomp $hostname; |
|
| 755 |
+ |
|
| 756 |
+ my $ip = `hostname -I | awk '{print \$1}'`;
|
|
| 757 |
+ chomp $ip; |
|
| 758 |
+ |
|
| 759 |
+ $self->{db_handle}->do(q{
|
|
| 760 |
+ INSERT INTO cluster_nodes |
|
| 761 |
+ (node_id, hostname, ip_address, last_seen, status, version) |
|
| 762 |
+ VALUES (?, ?, ?, NOW(), 'active', '1.0') |
|
| 763 |
+ ON CONFLICT (node_id) |
|
| 764 |
+ DO UPDATE SET |
|
| 765 |
+ hostname = EXCLUDED.hostname, |
|
| 766 |
+ ip_address = EXCLUDED.ip_address, |
|
| 767 |
+ last_seen = NOW(), |
|
| 768 |
+ status = 'active' |
|
| 769 |
+ }, undef, $self->{node_id}, $hostname, $ip);
|
|
| 770 |
+ |
|
| 771 |
+ $self->_log("Registered node $self->{node_id} in cluster", 2);
|
|
| 772 |
+ }; |
|
| 773 |
+ |
|
| 774 |
+ if ($@) {
|
|
| 775 |
+ $self->_log("Warning: Failed to register node: $@", 1);
|
|
| 776 |
+ } |
|
| 777 |
+} |
|
| 778 |
+ |
|
| 779 |
+=head2 DESTROY |
|
| 780 |
+ |
|
| 781 |
+Cleanup database connection |
|
| 782 |
+ |
|
| 783 |
+=cut |
|
| 784 |
+ |
|
| 785 |
+sub DESTROY {
|
|
| 786 |
+ my $self = shift; |
|
| 787 |
+ $self->{db_handle}->disconnect() if $self->{db_handle};
|
|
| 788 |
+} |
|
| 789 |
+ |
|
| 790 |
+1; |
|
| 791 |
+ |
|
| 792 |
+__END__ |
|
| 793 |
+ |
|
| 794 |
+=head1 AUTHOR |
|
| 795 |
+ |
|
| 796 |
+AutoSMART Development Team |
|
| 797 |
+ |
|
| 798 |
+=head1 LICENSE |
|
| 799 |
+ |
|
| 800 |
+This software is part of the autoSMART project. |
|
| 801 |
+ |
|
| 802 |
+=cut |
|
@@ -0,0 +1,348 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use FindBin qw($Bin); |
|
| 6 |
+use lib "$Bin/../lib"; |
|
| 7 |
+ |
|
| 8 |
+use SmartCollector; |
|
| 9 |
+use Getopt::Long; |
|
| 10 |
+use POSIX qw(strftime); |
|
| 11 |
+ |
|
| 12 |
+=head1 NAME |
|
| 13 |
+ |
|
| 14 |
+autosmart-collector.pl - SMART data collection daemon for Proxmox cluster |
|
| 15 |
+ |
|
| 16 |
+=head1 SYNOPSIS |
|
| 17 |
+ |
|
| 18 |
+ autosmart-collector.pl [OPTIONS] |
|
| 19 |
+ |
|
| 20 |
+=head1 OPTIONS |
|
| 21 |
+ |
|
| 22 |
+ --cluster-config FILE Cluster configuration file (default: /etc/pve/autoSMART/cluster.conf) |
|
| 23 |
+ --local-config FILE Local configuration file (default: /etc/default/autosmart) |
|
| 24 |
+ --daemon Run as daemon |
|
| 25 |
+ --once Run once and exit (for cron jobs) |
|
| 26 |
+ --device PATH Collect from specific device only |
|
| 27 |
+ --debug Enable debug logging |
|
| 28 |
+ --help Show this help |
|
| 29 |
+ |
|
| 30 |
+=head1 DESCRIPTION |
|
| 31 |
+ |
|
| 32 |
+This script collects SMART data from HDDs in a Proxmox cluster environment. |
|
| 33 |
+Configuration is split between cluster-wide settings in /etc/pve/autoSMART/ |
|
| 34 |
+and local node settings in /etc/default/autosmart. |
|
| 35 |
+ |
|
| 36 |
+=cut |
|
| 37 |
+ |
|
| 38 |
+# Configuration |
|
| 39 |
+my $cluster_config = '/etc/pve/autoSMART/cluster.conf'; |
|
| 40 |
+my $local_config = '/etc/default/autosmart'; |
|
| 41 |
+my $daemon_mode = 0; |
|
| 42 |
+my $run_once = 0; |
|
| 43 |
+my $specific_device = ''; |
|
| 44 |
+my $debug = 0; |
|
| 45 |
+my $help = 0; |
|
| 46 |
+ |
|
| 47 |
+GetOptions( |
|
| 48 |
+ 'cluster-config=s' => \$cluster_config, |
|
| 49 |
+ 'local-config=s' => \$local_config, |
|
| 50 |
+ 'daemon' => \$daemon_mode, |
|
| 51 |
+ 'once' => \$run_once, |
|
| 52 |
+ 'device=s' => \$specific_device, |
|
| 53 |
+ 'debug' => \$debug, |
|
| 54 |
+ 'help' => \$help, |
|
| 55 |
+) or die "Error parsing command line arguments\n"; |
|
| 56 |
+ |
|
| 57 |
+if ($help) {
|
|
| 58 |
+ print_help(); |
|
| 59 |
+ exit 0; |
|
| 60 |
+} |
|
| 61 |
+ |
|
| 62 |
+# Load local configuration for environment setup |
|
| 63 |
+my %local_settings = load_local_config($local_config); |
|
| 64 |
+ |
|
| 65 |
+# Override debug flag from local config if not specified |
|
| 66 |
+unless ($debug) {
|
|
| 67 |
+ $debug = (($local_settings{AUTOSMART_DEBUG_ENABLED} // '') eq 'true') ?
|
|
| 68 |
+ ($local_settings{AUTOSMART_DEBUG_LEVEL} || 1) : 0;
|
|
| 69 |
+} |
|
| 70 |
+ |
|
| 71 |
+# Validate configuration files |
|
| 72 |
+unless (-f $cluster_config) {
|
|
| 73 |
+ die "Cluster configuration not found: $cluster_config\n"; |
|
| 74 |
+} |
|
| 75 |
+ |
|
| 76 |
+unless (-f $local_config) {
|
|
| 77 |
+ die "Local configuration not found: $local_config\n"; |
|
| 78 |
+} |
|
| 79 |
+ |
|
| 80 |
+# Check for emergency stop |
|
| 81 |
+if (-f ($local_settings{AUTOSMART_EMERGENCY_STOP_FILE} || '/etc/autosmart/EMERGENCY_STOP')) {
|
|
| 82 |
+ die "Emergency stop file detected - autoSMART is disabled\n"; |
|
| 83 |
+} |
|
| 84 |
+ |
|
| 85 |
+# Initialize collector with Proxmox cluster configuration |
|
| 86 |
+my $collector = SmartCollector->new( |
|
| 87 |
+ cluster_config => $cluster_config, |
|
| 88 |
+ local_config => $local_config, |
|
| 89 |
+ debug => $debug, |
|
| 90 |
+); |
|
| 91 |
+ |
|
| 92 |
+log_message("autoSMART collector starting for cluster node...");
|
|
| 93 |
+ |
|
| 94 |
+if ($specific_device) {
|
|
| 95 |
+ # Collect from specific device |
|
| 96 |
+ collect_specific_device($collector, $specific_device); |
|
| 97 |
+} elsif ($run_once) {
|
|
| 98 |
+ # Single collection run |
|
| 99 |
+ run_collection_cycle($collector); |
|
| 100 |
+} elsif ($daemon_mode) {
|
|
| 101 |
+ # Daemon mode |
|
| 102 |
+ run_daemon($collector, \%local_settings); |
|
| 103 |
+} else {
|
|
| 104 |
+ # Default: single collection run |
|
| 105 |
+ run_collection_cycle($collector); |
|
| 106 |
+} |
|
| 107 |
+ |
|
| 108 |
+log_message("autoSMART collector finished");
|
|
| 109 |
+ |
|
| 110 |
+=head2 load_local_config |
|
| 111 |
+ |
|
| 112 |
+Load local configuration from /etc/default/autosmart |
|
| 113 |
+ |
|
| 114 |
+=cut |
|
| 115 |
+ |
|
| 116 |
+sub load_local_config {
|
|
| 117 |
+ my $config_file = shift; |
|
| 118 |
+ |
|
| 119 |
+ my %settings = (); |
|
| 120 |
+ |
|
| 121 |
+ return %settings unless -f $config_file; |
|
| 122 |
+ |
|
| 123 |
+ open my $fh, '<', $config_file |
|
| 124 |
+ or die "Cannot read local config: $config_file: $!"; |
|
| 125 |
+ |
|
| 126 |
+ while (my $line = <$fh>) {
|
|
| 127 |
+ chomp $line; |
|
| 128 |
+ next if $line =~ /^\s*#/ || $line =~ /^\s*$/; |
|
| 129 |
+ |
|
| 130 |
+ if ($line =~ /^(\w+)=(.+)$/) {
|
|
| 131 |
+ my ($key, $value) = ($1, $2); |
|
| 132 |
+ $value =~ s/^["']|["']$//g; # Remove quotes |
|
| 133 |
+ $settings{$key} = $value;
|
|
| 134 |
+ } |
|
| 135 |
+ } |
|
| 136 |
+ |
|
| 137 |
+ close $fh; |
|
| 138 |
+ |
|
| 139 |
+ return %settings; |
|
| 140 |
+} |
|
| 141 |
+ |
|
| 142 |
+=head2 collect_specific_device |
|
| 143 |
+ |
|
| 144 |
+Collect SMART data from a specific device |
|
| 145 |
+ |
|
| 146 |
+=cut |
|
| 147 |
+ |
|
| 148 |
+sub collect_specific_device {
|
|
| 149 |
+ my ($collector, $device_path) = @_; |
|
| 150 |
+ |
|
| 151 |
+ log_message("Collecting SMART data from $device_path");
|
|
| 152 |
+ |
|
| 153 |
+ my $smart_data = $collector->collect_smart_data($device_path); |
|
| 154 |
+ |
|
| 155 |
+ unless ($smart_data) {
|
|
| 156 |
+ log_message("ERROR: Failed to collect SMART data from $device_path");
|
|
| 157 |
+ exit 1; |
|
| 158 |
+ } |
|
| 159 |
+ |
|
| 160 |
+ # Create minimal drive info for storage |
|
| 161 |
+ my $drive_info = {
|
|
| 162 |
+ device_path => $device_path, |
|
| 163 |
+ serial => $smart_data->{serial_number} || 'unknown',
|
|
| 164 |
+ model => $smart_data->{model_name} || 'unknown',
|
|
| 165 |
+ size_gb => 0, |
|
| 166 |
+ madagascar_id => "manual_$device_path", |
|
| 167 |
+ }; |
|
| 168 |
+ |
|
| 169 |
+ if ($collector->store_smart_data($drive_info, $smart_data)) {
|
|
| 170 |
+ log_message("Successfully stored SMART data for $device_path");
|
|
| 171 |
+ } else {
|
|
| 172 |
+ log_message("ERROR: Failed to store SMART data for $device_path");
|
|
| 173 |
+ exit 1; |
|
| 174 |
+ } |
|
| 175 |
+} |
|
| 176 |
+ |
|
| 177 |
+=head2 run_collection_cycle |
|
| 178 |
+ |
|
| 179 |
+Execute one complete collection cycle |
|
| 180 |
+ |
|
| 181 |
+=cut |
|
| 182 |
+ |
|
| 183 |
+sub run_collection_cycle {
|
|
| 184 |
+ my $collector = shift; |
|
| 185 |
+ |
|
| 186 |
+ log_message("Starting collection cycle");
|
|
| 187 |
+ |
|
| 188 |
+ my $result = $collector->collect_all(); |
|
| 189 |
+ |
|
| 190 |
+ log_message(sprintf( |
|
| 191 |
+ "Collection cycle complete: %d successful, %d failed, %d total", |
|
| 192 |
+ $result->{successful},
|
|
| 193 |
+ $result->{failed},
|
|
| 194 |
+ $result->{total}
|
|
| 195 |
+ )); |
|
| 196 |
+ |
|
| 197 |
+ # Exit with error code if any collections failed |
|
| 198 |
+ if ($result->{failed} > 0) {
|
|
| 199 |
+ exit 1; |
|
| 200 |
+ } |
|
| 201 |
+} |
|
| 202 |
+ |
|
| 203 |
+=head2 run_daemon |
|
| 204 |
+ |
|
| 205 |
+Run as daemon with periodic collection |
|
| 206 |
+ |
|
| 207 |
+=cut |
|
| 208 |
+ |
|
| 209 |
+sub run_daemon {
|
|
| 210 |
+ my ($collector, $local_settings) = @_; |
|
| 211 |
+ |
|
| 212 |
+ # Get collection interval from the local settings passed in by the caller |
|
| 213 |
+ # (AUTOSMART_COLLECTION_INTERVAL is an assumed key name; adjust to the real config)
|
|
| 214 |
+ my $interval = $local_settings->{AUTOSMART_COLLECTION_INTERVAL} || 300;
|
|
| 215 |
+ |
|
| 216 |
+ log_message("Running in daemon mode (interval: ${interval}s)");
|
|
| 217 |
+ |
|
| 218 |
+ # Set up signal handlers for graceful shutdown |
|
| 219 |
+ my $running = 1; |
|
| 220 |
+ |
|
| 221 |
+ $SIG{TERM} = sub {
|
|
| 222 |
+ log_message("Received SIGTERM, shutting down gracefully");
|
|
| 223 |
+ $running = 0; |
|
| 224 |
+ }; |
|
| 225 |
+ |
|
| 226 |
+ $SIG{INT} = sub {
|
|
| 227 |
+ log_message("Received SIGINT, shutting down gracefully");
|
|
| 228 |
+ $running = 0; |
|
| 229 |
+ }; |
|
| 230 |
+ |
|
| 231 |
+ # Main daemon loop |
|
| 232 |
+ while ($running) {
|
|
| 233 |
+ my $start_time = time(); |
|
| 234 |
+ |
|
| 235 |
+ eval {
|
|
| 236 |
+ run_collection_cycle($collector); |
|
| 237 |
+ }; |
|
| 238 |
+ |
|
| 239 |
+ if ($@) {
|
|
| 240 |
+ log_message("ERROR in collection cycle: $@");
|
|
| 241 |
+ } |
|
| 242 |
+ |
|
| 243 |
+ # Calculate sleep time to maintain interval |
|
| 244 |
+ my $elapsed = time() - $start_time; |
|
| 245 |
+ my $sleep_time = $interval - $elapsed; |
|
| 246 |
+ |
|
| 247 |
+ if ($sleep_time > 0) {
|
|
| 248 |
+ log_message("Sleeping for ${sleep_time}s until next collection");
|
|
| 249 |
+ |
|
| 250 |
+ # Sleep in small chunks to allow signal handling |
|
| 251 |
+ while ($sleep_time > 0 && $running) {
|
|
| 252 |
+ my $chunk = $sleep_time > 5 ? 5 : $sleep_time; |
|
| 253 |
+ sleep($chunk); |
|
| 254 |
+ $sleep_time -= $chunk; |
|
| 255 |
+ } |
|
| 256 |
+ } else {
|
|
| 257 |
+ log_message("WARNING: Collection took longer than interval (${elapsed}s > ${interval}s)");
|
|
| 258 |
+ } |
|
| 259 |
+ } |
|
| 260 |
+ |
|
| 261 |
+ log_message("Daemon shutdown complete");
|
|
| 262 |
+} |
|
| 263 |
+ |
|
| 264 |
+=head2 log_message |
|
| 265 |
+ |
|
| 266 |
+Log message with timestamp |
|
| 267 |
+ |
|
| 268 |
+=cut |
|
| 269 |
+ |
|
| 270 |
+sub log_message {
|
|
| 271 |
+ my $message = shift; |
|
| 272 |
+ |
|
| 273 |
+ my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime());
|
|
| 274 |
+ print "[$timestamp] autosmart-collector: $message\n"; |
|
| 275 |
+} |
|
| 276 |
+ |
|
| 277 |
+=head2 print_help |
|
| 278 |
+ |
|
| 279 |
+Display help information |
|
| 280 |
+ |
|
| 281 |
+=cut |
|
| 282 |
+ |
|
| 283 |
+sub print_help {
|
|
| 284 |
+ print <<'EOF'; |
|
| 285 |
+autoSMART Data Collector v1.0 |
|
| 286 |
+ |
|
| 287 |
+USAGE: |
|
| 288 |
+ autosmart-collector.pl [OPTIONS] |
|
| 289 |
+ |
|
| 290 |
+OPTIONS: |
|
| 291 |
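+ --cluster-config FILE Cluster configuration file (default: /etc/pve/autoSMART/cluster.conf) |
|
+ --local-config FILE Local configuration file (default: /etc/default/autosmart) |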
|
| 292 |
+ --daemon Run as daemon with periodic collection |
|
| 293 |
+ --once Run once and exit (useful for cron jobs) |
|
| 294 |
+ --device PATH Collect from specific device only (e.g., /dev/sda) |
|
| 295 |
+ --debug Enable debug logging |
|
| 296 |
+ --help Show this help message |
|
| 297 |
+ |
|
| 298 |
+EXAMPLES: |
|
| 299 |
+ # Run once (for cron jobs) |
|
| 300 |
+ autosmart-collector.pl --once |
|
| 301 |
+ |
|
| 302 |
+ # Run as daemon |
|
| 303 |
+ autosmart-collector.pl --daemon |
|
| 304 |
+ |
|
| 305 |
+ # Collect from specific device |
|
| 306 |
+ autosmart-collector.pl --device /dev/sda |
|
| 307 |
+ |
|
| 308 |
+ # Run with debug logging |
|
| 309 |
+ autosmart-collector.pl --debug --once |
|
| 310 |
+ |
|
| 311 |
+ # Use a custom cluster configuration file |
|
| 312 |
+ autosmart-collector.pl --cluster-config /opt/autosmart/config/cluster.conf --once |
|
| 313 |
+ |
|
| 314 |
+CONFIGURATION: |
|
| 315 |
+ Configuration is split between two files: |
|
| 316 |
+ - /etc/pve/autoSMART/cluster.conf Cluster-wide settings |
|
| 317 |
+ - /etc/default/autosmart Local node settings |
|
| 318 |
+ |
|
| 319 |
+DAEMON MODE: |
|
| 320 |
+ In daemon mode, the collector runs continuously and collects data at |
|
| 321 |
+ intervals taken from the configuration (default: 300 seconds). |
|
| 322 |
+ |
|
| 323 |
+ Send SIGTERM or SIGINT for graceful shutdown. |
|
| 324 |
+ |
|
| 325 |
+CRON MODE: |
|
| 326 |
+ Use --once flag for cron-based scheduling: |
|
| 327 |
+ |
|
| 328 |
+ # Collect every 5 minutes |
|
| 329 |
+ */5 * * * * /usr/local/bin/autosmart-collector.pl --once |
|
| 330 |
+ |
|
| 331 |
+EXIT CODES: |
|
| 332 |
+ 0 Success |
|
| 333 |
+ 1 Error (failed collections, missing config, etc.) |
|
| 334 |
+ |
|
| 335 |
+EOF |
|
| 336 |
+} |
|
| 337 |
+ |
|
| 338 |
+__END__ |
|
| 339 |
+ |
|
| 340 |
+=head1 AUTHOR |
|
| 341 |
+ |
|
| 342 |
+AutoSMART Development Team |
|
| 343 |
+ |
|
| 344 |
+=head1 LICENSE |
|
| 345 |
+ |
|
| 346 |
+This software is part of the autoSMART project. |
|
| 347 |
+ |
|
| 348 |
+=cut |
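On `run_daemon`: `exit` is not trapped by `eval`, so a collection cycle that calls `exit 1` ends the daemon on the first failed drive instead of being logged. The following is a minimal sketch of one way to separate the two concerns: the cycle returns the failure count, and only a one-shot caller turns it into an exit status. `cycle_failures` and `fake_collect_all` are hypothetical names used for illustration and are not part of this PR.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $collector->collect_all(); it returns the same
# hash shape as the method in SmartCollector.pm.
sub fake_collect_all {
    return { successful => 2, failed => 1, total => 3 };
}

# Return the number of failed drives instead of exiting, so daemon mode
# can log the problem and keep looping.
sub cycle_failures {
    my ($collect) = @_;
    my $result = $collect->();
    return $result->{failed};
}

my $failed = cycle_failures(\&fake_collect_all);
print "failed drives: $failed\n";

# A one-shot (--once) caller could then do: exit($failed > 0 ? 1 : 0);
```

With this shape the daemon loop can treat a non-zero count as a warning, while cron-style runs keep the documented exit codes.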
|
@@ -0,0 +1,615 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use DBI; |
|
| 6 |
+use Getopt::Long; |
|
| 7 |
+use Config::Simple; |
|
| 8 |
+use JSON::XS; |
|
| 9 |
+use POSIX qw(strftime); |
|
| 10 |
+ |
|
| 11 |
+=head1 NAME |
|
| 12 |
+ |
|
| 13 |
+autosmart-migration-report.pl - HDD Migration Analysis and Reporting |
|
| 14 |
+ |
|
| 15 |
+=head1 SYNOPSIS |
|
| 16 |
+ |
|
| 17 |
+ autosmart-migration-report.pl [OPTIONS] |
|
| 18 |
+ |
|
| 19 |
+=head1 OPTIONS |
|
| 20 |
+ |
|
| 21 |
+ --config-dir DIR Configuration directory (default: /etc/pve/autoSMART) |
|
| 22 |
+ --days N Days of migration history (default: 30) |
|
| 23 |
+ --serial SERIAL Report for specific HDD serial number |
|
| 24 |
+ --node NODE Report migrations for specific node |
|
| 25 |
+ --type TYPE Migration type: device_change, node_change, slot_change, all |
|
| 26 |
+ --format FORMAT Output format: text, json, csv (default: text) |
|
| 27 |
+ --frequent-only Show only frequently migrated drives (3 or more migrations) |
|
| 28 |
+ --recent-only Show only recent migrations (<24h) |
|
| 29 |
+ --output FILE Write to file instead of stdout |
|
| 30 |
+ --help Show this help |
|
| 31 |
+ |
|
| 32 |
+=head1 DESCRIPTION |
|
| 33 |
+ |
|
| 34 |
+Analyze and report HDD migrations tracked by autoSMART. Shows drive movements |
|
| 35 |
+between nodes, device path changes, and slot changes with detailed history. |
|
| 36 |
+ |
|
| 37 |
+=cut |
|
| 38 |
+ |
|
| 39 |
+# Configuration |
|
| 40 |
+my $config_dir = '/etc/pve/autoSMART'; |
|
| 41 |
+my $days = 30; |
|
| 42 |
+my $specific_serial = ''; |
|
| 43 |
+my $specific_node = ''; |
|
| 44 |
+my $migration_type = 'all'; |
|
| 45 |
+my $format = 'text'; |
|
| 46 |
+my $frequent_only = 0; |
|
| 47 |
+my $recent_only = 0; |
|
| 48 |
+my $output_file = ''; |
|
| 49 |
+my $help = 0; |
|
| 50 |
+ |
|
| 51 |
+GetOptions( |
|
| 52 |
+ 'config-dir=s' => \$config_dir, |
|
| 53 |
+ 'days=i' => \$days, |
|
| 54 |
+ 'serial=s' => \$specific_serial, |
|
| 55 |
+ 'node=s' => \$specific_node, |
|
| 56 |
+ 'type=s' => \$migration_type, |
|
| 57 |
+ 'format=s' => \$format, |
|
| 58 |
+ 'frequent-only' => \$frequent_only, |
|
| 59 |
+ 'recent-only' => \$recent_only, |
|
| 60 |
+ 'output=s' => \$output_file, |
|
| 61 |
+ 'help' => \$help, |
|
| 62 |
+) or die "Error parsing command line arguments\n"; |
|
| 63 |
+ |
|
| 64 |
+if ($help) {
|
|
| 65 |
+ print_help(); |
|
| 66 |
+ exit 0; |
|
| 67 |
+} |
|
| 68 |
+ |
|
| 69 |
+# Validate options |
|
| 70 |
+unless ($format =~ /^(text|json|csv)$/) {
|
|
| 71 |
+ die "Invalid format: $format (must be text, json, or csv)\n"; |
|
| 72 |
+} |
|
| 73 |
+ |
|
| 74 |
+unless ($migration_type =~ /^(device_change|node_change|slot_change|all)$/) {
|
|
| 75 |
+ die "Invalid migration type: $migration_type\n"; |
|
| 76 |
+} |
|
| 77 |
+ |
|
| 78 |
+# Connect to database |
|
| 79 |
+my $db_config = "$config_dir/cluster.conf"; |
|
| 80 |
+unless (-f $db_config) {
|
|
| 81 |
+ die "Cluster configuration not found: $db_config\n"; |
|
| 82 |
+} |
|
| 83 |
+ |
|
| 84 |
+my $cfg = Config::Simple->new($db_config); |
|
| 85 |
+my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
|
|
| 86 |
+ $cfg->param('database.database'),
|
|
| 87 |
+ $cfg->param('database.host'),
|
|
| 88 |
+ $cfg->param('database.port')
|
|
| 89 |
+); |
|
| 90 |
+ |
|
| 91 |
+my $dbh = DBI->connect( |
|
| 92 |
+ $dsn, |
|
| 93 |
+ $cfg->param('database.username'),
|
|
| 94 |
+ $cfg->param('database.password'),
|
|
| 95 |
+ { RaiseError => 1, AutoCommit => 1, pg_enable_utf8 => 1 }
|
|
| 96 |
+) or die "Database connection failed: $DBI::errstr"; |
|
| 97 |
+ |
|
| 98 |
+# Generate migration report |
|
| 99 |
+my $report_data = generate_migration_report($dbh); |
|
| 100 |
+ |
|
| 101 |
+# Output report |
|
| 102 |
+my $output_handle = \*STDOUT; |
|
| 103 |
+if ($output_file) {
|
|
| 104 |
+ open $output_handle, '>', $output_file |
|
| 105 |
+ or die "Cannot open output file $output_file: $!\n"; |
|
| 106 |
+} |
|
| 107 |
+ |
|
| 108 |
+if ($format eq 'json') {
|
|
| 109 |
+ output_json($output_handle, $report_data); |
|
| 110 |
+} elsif ($format eq 'csv') {
|
|
| 111 |
+ output_csv($output_handle, $report_data); |
|
| 112 |
+} else {
|
|
| 113 |
+ output_text($output_handle, $report_data); |
|
| 114 |
+} |
|
| 115 |
+ |
|
| 116 |
+close $output_handle if $output_file; |
|
| 117 |
+$dbh->disconnect(); |
|
| 118 |
+ |
|
| 119 |
+=head2 generate_migration_report |
|
| 120 |
+ |
|
| 121 |
+Generate comprehensive migration report |
|
| 122 |
+ |
|
| 123 |
+=cut |
|
| 124 |
+ |
|
| 125 |
+sub generate_migration_report {
|
|
| 126 |
+ my $dbh = shift; |
|
| 127 |
+ |
|
| 128 |
+ my $data = {
|
|
| 129 |
+ generated_at => time(), |
|
| 130 |
+ days_analyzed => $days, |
|
| 131 |
+ filters => {
|
|
| 132 |
+ serial => $specific_serial, |
|
| 133 |
+ node => $specific_node, |
|
| 134 |
+ type => $migration_type, |
|
| 135 |
+ frequent_only => $frequent_only, |
|
| 136 |
+ recent_only => $recent_only, |
|
| 137 |
+ } |
|
| 138 |
+ }; |
|
| 139 |
+ |
|
| 140 |
+ # Get migration statistics |
|
| 141 |
+ $data->{statistics} = get_migration_statistics($dbh);
|
|
| 142 |
+ |
|
| 143 |
+ # Get migration details |
|
| 144 |
+ $data->{migrations} = get_migration_details($dbh);
|
|
| 145 |
+ |
|
| 146 |
+ # Get frequently migrated drives |
|
| 147 |
+ $data->{frequent_migrants} = get_frequent_migrants($dbh);
|
|
| 148 |
+ |
|
| 149 |
+ # Get drive current status |
|
| 150 |
+ $data->{drive_status} = get_drive_migration_status($dbh);
|
|
| 151 |
+ |
|
| 152 |
+ return $data; |
|
| 153 |
+} |
|
| 154 |
+ |
|
| 155 |
+=head2 get_migration_statistics |
|
| 156 |
+ |
|
| 157 |
+Get overall migration statistics |
|
| 158 |
+ |
|
| 159 |
+=cut |
|
| 160 |
+ |
|
| 161 |
+sub get_migration_statistics {
|
|
| 162 |
+ my $dbh = shift; |
|
| 163 |
+ |
|
| 164 |
+ my $stats = {};
|
|
| 165 |
+ |
|
| 166 |
+ # Total migrations by type |
|
| 167 |
+ my $sql = q{
|
|
| 168 |
+ SELECT |
|
| 169 |
+ migration_type, |
|
| 170 |
+ COUNT(*) as count, |
|
| 171 |
+ COUNT(DISTINCT serial_number) as unique_drives |
|
| 172 |
+ FROM hdd_migrations |
|
| 173 |
+ WHERE migration_timestamp >= NOW() - (? * INTERVAL '1 day') |
|
| 174 |
+ GROUP BY migration_type |
|
| 175 |
+ ORDER BY count DESC |
|
| 176 |
+ }; |
|
| 177 |
+ |
|
| 178 |
+ my $sth = $dbh->prepare($sql); |
|
| 179 |
+ $sth->execute($days); |
|
| 180 |
+ |
|
| 181 |
+ $stats->{by_type} = {};
|
|
| 182 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 183 |
+ $stats->{by_type}->{$row->{migration_type}} = {
|
|
| 184 |
+ count => $row->{count},
|
|
| 185 |
+ unique_drives => $row->{unique_drives}
|
|
| 186 |
+ }; |
|
| 187 |
+ } |
|
| 188 |
+ |
|
| 189 |
+ # Migrations by node |
|
| 190 |
+ $sql = q{
|
|
| 191 |
+ SELECT |
|
| 192 |
+ COALESCE(new_node_id, old_node_id) as node_id, |
|
| 193 |
+ COUNT(*) as migrations_involving_node |
|
| 194 |
+ FROM hdd_migrations |
|
| 195 |
+ WHERE migration_timestamp >= NOW() - (? * INTERVAL '1 day') |
|
| 196 |
+ GROUP BY COALESCE(new_node_id, old_node_id) |
|
| 197 |
+ ORDER BY migrations_involving_node DESC |
|
| 198 |
+ }; |
|
| 199 |
+ |
|
| 200 |
+ $sth = $dbh->prepare($sql); |
|
| 201 |
+ $sth->execute($days); |
|
| 202 |
+ |
|
| 203 |
+ $stats->{by_node} = {};
|
|
| 204 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 205 |
+ $stats->{by_node}->{$row->{node_id}} = $row->{migrations_involving_node};
|
|
| 206 |
+ } |
|
| 207 |
+ |
|
| 208 |
+ # Recent activity |
|
| 209 |
+ $sql = q{
|
|
| 210 |
+ SELECT |
|
| 211 |
+ DATE(migration_timestamp) as date, |
|
| 212 |
+ COUNT(*) as migrations_per_day |
|
| 213 |
+ FROM hdd_migrations |
|
| 214 |
+ WHERE migration_timestamp >= NOW() - (? * INTERVAL '1 day') |
|
| 215 |
+ GROUP BY DATE(migration_timestamp) |
|
| 216 |
+ ORDER BY date DESC |
|
| 217 |
+ LIMIT 7 |
|
| 218 |
+ }; |
|
| 219 |
+ |
|
| 220 |
+ $sth = $dbh->prepare($sql); |
|
| 221 |
+ $sth->execute($days); |
|
| 222 |
+ |
|
| 223 |
+ $stats->{recent_activity} = [];
|
|
| 224 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 225 |
+ push @{$stats->{recent_activity}}, {
|
|
| 226 |
+ date => $row->{date},
|
|
| 227 |
+ count => $row->{migrations_per_day}
|
|
| 228 |
+ }; |
|
| 229 |
+ } |
|
| 230 |
+ |
|
| 231 |
+ return $stats; |
|
| 232 |
+} |
|
| 233 |
+ |
|
| 234 |
+=head2 get_migration_details |
|
| 235 |
+ |
|
| 236 |
+Get detailed migration records |
|
| 237 |
+ |
|
| 238 |
+=cut |
|
| 239 |
+ |
|
| 240 |
+sub get_migration_details {
|
|
| 241 |
+ my $dbh = shift; |
|
| 242 |
+ |
|
| 243 |
+ my $sql = q{
|
|
| 244 |
+ SELECT |
|
| 245 |
+ m.serial_number, |
|
| 246 |
+ hi.model_name, |
|
| 247 |
+ hi.current_device_path, |
|
| 248 |
+ hi.current_node_id, |
|
| 249 |
+ m.migration_type, |
|
| 250 |
+ m.migration_timestamp, |
|
| 251 |
+ m.old_device_path, |
|
| 252 |
+ m.old_node_id, |
|
| 253 |
+ m.old_slot, |
|
| 254 |
+ m.new_device_path, |
|
| 255 |
+ m.new_node_id, |
|
| 256 |
+ m.new_slot, |
|
| 257 |
+ m.detected_by, |
|
| 258 |
+ m.confidence_level, |
|
| 259 |
+ m.trigger_reason, |
|
| 260 |
+ m.verification_status |
|
| 261 |
+ FROM hdd_migrations m |
|
| 262 |
+ JOIN hdd_inventory hi ON m.hdd_id = hi.id |
|
| 263 |
+ WHERE m.migration_timestamp >= NOW() - (? * INTERVAL '1 day') |
|
| 264 |
+ }; |
|
| 265 |
+ |
|
| 266 |
+ my @params = ($days); |
|
| 267 |
+ |
|
| 268 |
+ # Add filters |
|
| 269 |
+ if ($specific_serial) {
|
|
| 270 |
+ $sql .= " AND m.serial_number = ?"; |
|
| 271 |
+ push @params, $specific_serial; |
|
| 272 |
+ } |
|
| 273 |
+ |
|
| 274 |
+ if ($specific_node) {
|
|
| 275 |
+ $sql .= " AND (m.old_node_id = ? OR m.new_node_id = ?)"; |
|
| 276 |
+ push @params, $specific_node, $specific_node; |
|
| 277 |
+ } |
|
| 278 |
+ |
|
| 279 |
+ if ($migration_type ne 'all') {
|
|
| 280 |
+ $sql .= " AND m.migration_type = ?"; |
|
| 281 |
+ push @params, $migration_type; |
|
| 282 |
+ } |
|
| 283 |
+ |
|
| 284 |
+ if ($recent_only) {
|
|
| 285 |
+ $sql .= " AND m.migration_timestamp >= NOW() - INTERVAL '24 hours'"; |
|
| 286 |
+ } |
|
| 287 |
+ |
|
| 288 |
+ $sql .= " ORDER BY m.migration_timestamp DESC LIMIT 100"; |
|
| 289 |
+ |
|
| 290 |
+ my $sth = $dbh->prepare($sql); |
|
| 291 |
+ $sth->execute(@params); |
|
| 292 |
+ |
|
| 293 |
+ my @migrations = (); |
|
| 294 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 295 |
+ push @migrations, $row; |
|
| 296 |
+ } |
|
| 297 |
+ |
|
| 298 |
+ return \@migrations; |
|
| 299 |
+} |
|
| 300 |
+ |
|
| 301 |
+=head2 get_frequent_migrants |
|
| 302 |
+ |
|
| 303 |
+Get drives that migrate frequently |
|
| 304 |
+ |
|
| 305 |
+=cut |
|
| 306 |
+ |
|
| 307 |
+sub get_frequent_migrants {
|
|
| 308 |
+ my $dbh = shift; |
|
| 309 |
+ |
|
| 310 |
+ my $min_migrations = $frequent_only ? 3 : 1; |
|
| 311 |
+ |
|
| 312 |
+ my $sql = q{
|
|
| 313 |
+ SELECT |
|
| 314 |
+ hi.serial_number, |
|
| 315 |
+ hi.model_name, |
|
| 316 |
+ hi.current_device_path, |
|
| 317 |
+ hi.current_node_id, |
|
| 318 |
+ hi.migration_count, |
|
| 319 |
+ hi.last_migration, |
|
| 320 |
+ hi.first_seen, |
|
| 321 |
+ COUNT(m.id) as recent_migrations, |
|
| 322 |
+ string_agg(DISTINCT m.migration_type, ', ') as migration_types |
|
| 323 |
+ FROM hdd_inventory hi |
|
| 324 |
+ LEFT JOIN hdd_migrations m ON hi.id = m.hdd_id |
|
| 325 |
+ AND m.migration_timestamp >= NOW() - (? * INTERVAL '1 day') |
|
| 326 |
+ WHERE hi.migration_count >= ? |
|
| 327 |
+ GROUP BY hi.id, hi.serial_number, hi.model_name, hi.current_device_path, |
|
| 328 |
+ hi.current_node_id, hi.migration_count, hi.last_migration, hi.first_seen |
|
| 329 |
+ HAVING COUNT(m.id) > 0 OR hi.migration_count >= ? |
|
| 330 |
+ ORDER BY hi.migration_count DESC, hi.last_migration DESC |
|
| 331 |
+ LIMIT 20 |
|
| 332 |
+ }; |
|
| 333 |
+ |
|
| 334 |
+ my $sth = $dbh->prepare($sql); |
|
| 335 |
+ $sth->execute($days, $min_migrations, $min_migrations); |
|
| 336 |
+ |
|
| 337 |
+ my @frequent = (); |
|
| 338 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 339 |
+ push @frequent, $row; |
|
| 340 |
+ } |
|
| 341 |
+ |
|
| 342 |
+ return \@frequent; |
|
| 343 |
+} |
|
| 344 |
+ |
|
| 345 |
+=head2 get_drive_migration_status |
|
| 346 |
+ |
|
| 347 |
+Get current migration status of drives |
|
| 348 |
+ |
|
| 349 |
+=cut |
|
| 350 |
+ |
|
| 351 |
+sub get_drive_migration_status {
|
|
| 352 |
+ my $dbh = shift; |
|
| 353 |
+ |
|
| 354 |
+ my $sql = q{
|
|
| 355 |
+ SELECT |
|
| 356 |
+ migration_status, |
|
| 357 |
+ COUNT(*) as drive_count |
|
| 358 |
+ FROM drive_health_summary |
|
| 359 |
+ GROUP BY migration_status |
|
| 360 |
+ ORDER BY drive_count DESC |
|
| 361 |
+ }; |
|
| 362 |
+ |
|
| 363 |
+ my $sth = $dbh->prepare($sql); |
|
| 364 |
+ $sth->execute(); |
|
| 365 |
+ |
|
| 366 |
+ my %status = (); |
|
| 367 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 368 |
+ $status{$row->{migration_status}} = $row->{drive_count};
|
|
| 369 |
+ } |
|
| 370 |
+ |
|
| 371 |
+ return \%status; |
|
| 372 |
+} |
|
| 373 |
+ |
|
| 374 |
+=head2 output_text |
|
| 375 |
+ |
|
| 376 |
+Output report as text |
|
| 377 |
+ |
|
| 378 |
+=cut |
|
| 379 |
+ |
|
| 380 |
+sub output_text {
|
|
| 381 |
+ my ($fh, $data) = @_; |
|
| 382 |
+ |
|
| 383 |
+ print $fh "\n" . "="x80 . "\n"; |
|
| 384 |
+ print $fh "autoSMART HDD Migration Report\n"; |
|
| 385 |
+ print $fh "Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime($data->{generated_at})) . "\n";
|
|
| 386 |
+ print $fh "Period: Last $data->{days_analyzed} days\n";
|
|
| 387 |
+ print $fh "="x80 . "\n\n"; |
|
| 388 |
+ |
|
| 389 |
+ # Statistics |
|
| 390 |
+ my $stats = $data->{statistics};
|
|
| 391 |
+ |
|
| 392 |
+ print $fh "MIGRATION STATISTICS\n"; |
|
| 393 |
+ print $fh "-"x40 . "\n"; |
|
| 394 |
+ |
|
| 395 |
+ if (%{$stats->{by_type}}) {
|
|
| 396 |
+ print $fh "By Type:\n"; |
|
| 397 |
+ foreach my $type (sort keys %{$stats->{by_type}}) {
|
|
| 398 |
+ my $info = $stats->{by_type}->{$type};
|
|
| 399 |
+ printf $fh " %-15s: %d migrations (%d unique drives)\n", |
|
| 400 |
+ $type, $info->{count}, $info->{unique_drives};
|
|
| 401 |
+ } |
|
| 402 |
+ } |
|
| 403 |
+ |
|
| 404 |
+ if (%{$stats->{by_node}}) {
|
|
| 405 |
+ print $fh "\nBy Node:\n"; |
|
| 406 |
+ foreach my $node (sort { $stats->{by_node}->{$b} <=> $stats->{by_node}->{$a} }
|
|
| 407 |
+ keys %{$stats->{by_node}}) {
|
|
| 408 |
+ printf $fh " %-15s: %d migrations\n", $node, $stats->{by_node}->{$node};
|
|
| 409 |
+ } |
|
| 410 |
+ } |
|
| 411 |
+ |
|
| 412 |
+ # Recent activity |
|
| 413 |
+ if (@{$stats->{recent_activity}}) {
|
|
| 414 |
+ print $fh "\nRecent Activity (Last 7 days):\n"; |
|
| 415 |
+ foreach my $activity (@{$stats->{recent_activity}}) {
|
|
| 416 |
+ printf $fh " %s: %d migrations\n", $activity->{date}, $activity->{count};
|
|
| 417 |
+ } |
|
| 418 |
+ } |
|
| 419 |
+ |
|
| 420 |
+ # Drive migration status |
|
| 421 |
+ if (%{$data->{drive_status}}) {
|
|
| 422 |
+ print $fh "\nDrive Migration Status:\n"; |
|
| 423 |
+ foreach my $status (sort keys %{$data->{drive_status}}) {
|
|
| 424 |
+ printf $fh " %-20s: %d drives\n", $status, $data->{drive_status}->{$status};
|
|
| 425 |
+ } |
|
| 426 |
+ } |
|
| 427 |
+ |
|
| 428 |
+ # Frequently migrated drives |
|
| 429 |
+ if (@{$data->{frequent_migrants}}) {
|
|
| 430 |
+ print $fh "\n" . "="x80 . "\n"; |
|
| 431 |
+ print $fh "FREQUENTLY MIGRATED DRIVES\n"; |
|
| 432 |
+ print $fh "="x80 . "\n"; |
|
| 433 |
+ |
|
| 434 |
+ foreach my $drive (@{$data->{frequent_migrants}}) {
|
|
| 435 |
+ printf $fh "\nSerial: %s (%s)\n", |
|
| 436 |
+ $drive->{serial_number}, $drive->{model_name};
|
|
| 437 |
+ printf $fh "Current: %s @ %s\n", |
|
| 438 |
+ $drive->{current_device_path} || 'unknown',
|
|
| 439 |
+ $drive->{current_node_id} || 'unknown';
|
|
| 440 |
+ printf $fh "Total migrations: %d (Recent: %d)\n", |
|
| 441 |
+ $drive->{migration_count}, $drive->{recent_migrations};
|
|
| 442 |
+ printf $fh "Last migration: %s\n", |
|
| 443 |
+ $drive->{last_migration} || 'never';
|
|
| 444 |
+ printf $fh "Migration types: %s\n", |
|
| 445 |
+ $drive->{migration_types} || 'none';
|
|
| 446 |
+ } |
|
| 447 |
+ } |
|
| 448 |
+ |
|
| 449 |
+ # Recent migrations |
|
| 450 |
+ if (@{$data->{migrations}}) {
|
|
| 451 |
+ print $fh "\n" . "="x80 . "\n"; |
|
| 452 |
+ print $fh "RECENT MIGRATIONS\n"; |
|
| 453 |
+ print $fh "="x80 . "\n"; |
|
| 454 |
+ |
|
| 455 |
+ foreach my $migration (@{$data->{migrations}}) {
|
|
| 456 |
+ printf $fh "\n[%s] %s - %s\n", |
|
| 457 |
+ $migration->{migration_timestamp},
|
|
| 458 |
+ $migration->{serial_number},
|
|
| 459 |
+ uc($migration->{migration_type});
|
|
| 460 |
+ |
|
| 461 |
+ if ($migration->{migration_type} eq 'node_change') {
|
|
| 462 |
+ printf $fh " Moved: %s@%s -> %s@%s\n", |
|
| 463 |
+ $migration->{old_device_path} || '?',
|
|
| 464 |
+ $migration->{old_node_id} || '?',
|
|
| 465 |
+ $migration->{new_device_path} || '?',
|
|
| 466 |
+ $migration->{new_node_id} || '?';
|
|
| 467 |
+ } elsif ($migration->{migration_type} eq 'device_change') {
|
|
| 468 |
+ printf $fh " Device: %s -> %s (on %s)\n", |
|
| 469 |
+ $migration->{old_device_path} || '?',
|
|
| 470 |
+ $migration->{new_device_path} || '?',
|
|
| 471 |
+ $migration->{new_node_id} || '?';
|
|
| 472 |
+ } |
|
| 473 |
+ |
|
| 474 |
+ printf $fh " Detected by: %s (confidence: %d/10)\n", |
|
| 475 |
+ $migration->{detected_by}, $migration->{confidence_level};
|
|
| 476 |
+ |
|
| 477 |
+ if ($migration->{trigger_reason}) {
|
|
| 478 |
+ printf $fh " Reason: %s\n", $migration->{trigger_reason};
|
|
| 479 |
+ } |
|
| 480 |
+ } |
|
| 481 |
+ } |
|
| 482 |
+ |
|
| 483 |
+ print $fh "\n"; |
|
| 484 |
+} |
|
| 485 |
+ |
|
| 486 |
+=head2 output_json |
|
| 487 |
+ |
|
| 488 |
+Output report as JSON |
|
| 489 |
+ |
|
| 490 |
+=cut |
|
| 491 |
+ |
|
| 492 |
+sub output_json {
|
|
| 493 |
+ my ($fh, $data) = @_; |
|
| 494 |
+ |
|
| 495 |
+ my $json = JSON::XS->new->pretty->encode($data); |
|
| 496 |
+ print $fh $json; |
|
| 497 |
+} |
|
| 498 |
+ |
|
| 499 |
+=head2 output_csv |
|
| 500 |
+ |
|
| 501 |
+Output migrations as CSV |
|
| 502 |
+ |
|
| 503 |
+=cut |
|
| 504 |
+ |
|
| 505 |
+sub output_csv {
|
|
| 506 |
+ my ($fh, $data) = @_; |
|
| 507 |
+ |
|
| 508 |
+ # CSV header |
|
| 509 |
+ print $fh "timestamp,serial_number,model_name,migration_type,old_location,new_location,detected_by,confidence\n"; |
|
| 510 |
+ |
|
| 511 |
+ foreach my $migration (@{$data->{migrations}}) {
|
|
| 512 |
+ my @fields = ( |
|
| 513 |
+ $migration->{migration_timestamp},
|
|
| 514 |
+ $migration->{serial_number},
|
|
| 515 |
+ $migration->{model_name} || '',
|
|
| 516 |
+ $migration->{migration_type},
|
|
| 517 |
+ sprintf("%s@%s", $migration->{old_device_path} || '', $migration->{old_node_id} || ''),
|
|
| 518 |
+ sprintf("%s@%s", $migration->{new_device_path} || '', $migration->{new_node_id} || ''),
|
|
| 519 |
+ $migration->{detected_by},
|
|
| 520 |
+ $migration->{confidence_level}
|
|
| 521 |
+ ); |
|
| 522 |
+ |
|
| 523 |
+ # Escape CSV fields |
|
| 524 |
+ @fields = map { escape_csv($_) } @fields;
|
|
| 525 |
+ print $fh join(',', @fields) . "\n";
|
|
| 526 |
+ } |
|
| 527 |
+} |
|
| 528 |
+ |
|
| 529 |
+=head2 escape_csv |
|
| 530 |
+ |
|
| 531 |
+Escape CSV field |
|
| 532 |
+ |
|
| 533 |
+=cut |
|
| 534 |
+ |
|
| 535 |
+sub escape_csv {
|
|
| 536 |
+ my $field = $_[0] // ''; # keep a literal 0 (e.g. confidence 0) instead of blanking it |
|
| 537 |
+ |
|
| 538 |
+ if ($field =~ /[",\n]/) {
|
|
| 539 |
+ $field =~ s/"/""/g; |
|
| 540 |
+ $field = "\"$field\""; |
|
| 541 |
+ } |
|
| 542 |
+ |
|
| 543 |
+ return $field; |
|
| 544 |
+} |
|
| 545 |
+ |
|
| 546 |
+=head2 print_help |
|
| 547 |
+ |
|
| 548 |
+Display help information |
|
| 549 |
+ |
|
| 550 |
+=cut |
|
| 551 |
+ |
|
| 552 |
+sub print_help {
|
|
| 553 |
+ print <<'EOF'; |
|
| 554 |
+autoSMART HDD Migration Report v1.0 |
|
| 555 |
+ |
|
| 556 |
+USAGE: |
|
| 557 |
+ autosmart-migration-report.pl [OPTIONS] |
|
| 558 |
+ |
|
| 559 |
+OPTIONS: |
|
| 560 |
+ --config-dir DIR Configuration directory (default: /etc/pve/autoSMART) |
|
| 561 |
+ --days N Days of migration history to analyze (default: 30) |
|
| 562 |
+ --serial SERIAL Report for specific HDD serial number |
|
| 563 |
+ --node NODE Show migrations involving specific node |
|
| 564 |
+ --type TYPE Filter by migration type: |
|
| 565 |
+ device_change, node_change, slot_change, all (default) |
|
| 566 |
+ --format FORMAT Output format: text, json, csv (default: text) |
|
| 567 |
+ --frequent-only Show only frequently migrated drives (3+ migrations) |
|
| 568 |
+ --recent-only Show only migrations in last 24 hours |
|
| 569 |
+ --output FILE Write to file instead of stdout |
|
| 570 |
+ --help Show this help message |
|
| 571 |
+ |
|
| 572 |
+EXAMPLES: |
|
| 573 |
+ # Show all migrations in last 7 days |
|
| 574 |
+ autosmart-migration-report.pl --days 7 |
|
| 575 |
+ |
|
| 576 |
+ # Show only node changes |
|
| 577 |
+ autosmart-migration-report.pl --type node_change |
|
| 578 |
+ |
|
| 579 |
+ # Show migrations for specific drive |
|
| 580 |
+ autosmart-migration-report.pl --serial WD-WCC4N5123456 |
|
| 581 |
+ |
|
| 582 |
+ # Show frequently migrated drives |
|
| 583 |
+ autosmart-migration-report.pl --frequent-only |
|
| 584 |
+ |
|
| 585 |
+ # Export recent migrations as CSV |
|
| 586 |
+ autosmart-migration-report.pl --recent-only --format csv --output migrations.csv |
|
| 587 |
+ |
|
| 588 |
+MIGRATION TYPES: |
|
| 589 |
+ device_change Drive appeared at different /dev/sdX path |
|
| 590 |
+ node_change Drive moved between Proxmox nodes |
|
| 591 |
+ slot_change Drive moved to different physical slot/bay |
|
| 592 |
+ discovery New drive detected for first time |
|
| 593 |
+ |
|
| 594 |
+OUTPUT: |
|
| 595 |
+ The report includes: |
|
| 596 |
+ - Overall migration statistics |
|
| 597 |
+ - Frequently migrated drives |
|
| 598 |
+ - Recent migration activity |
|
| 599 |
+ - Detailed migration logs |
|
| 600 |
+ - Drive migration status summary |
|
| 601 |
+ |
|
| 602 |
+EOF |
|
| 603 |
+} |
|
| 604 |
+ |
|
| 605 |
+__END__ |
|
| 606 |
+ |
|
| 607 |
+=head1 AUTHOR |
|
| 608 |
+ |
|
| 609 |
+AutoSMART Development Team |
|
| 610 |
+ |
|
| 611 |
+=head1 LICENSE |
|
| 612 |
+ |
|
| 613 |
+This software is part of the autoSMART project. |
|
| 614 |
+ |
|
| 615 |
+=cut |
|
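Review note on `get_migration_details` in the file above: the optional filters append SQL fragments and bind values in lockstep. A minimal sketch of that pattern as a standalone helper follows, assuming a PostgreSQL backend (the queries use `string_agg`). The name `build_migration_filter`, its option keys, and the node name `pve2` are hypothetical and not part of the patch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the patch): returns the WHERE clause and
# the bind values for the optional filters, in matching order.
sub build_migration_filter {
    my (%opt) = @_;

    # PostgreSQL-style interval arithmetic; the day count stays a bind value.
    my $sql    = "WHERE m.migration_timestamp >= NOW() - (? * INTERVAL '1 day')";
    my @params = ($opt{days} // 30);

    if (defined $opt{serial} && length $opt{serial}) {
        $sql .= " AND m.serial_number = ?";
        push @params, $opt{serial};
    }
    if (defined $opt{node} && length $opt{node}) {
        # The node may be either the source or the destination of a move.
        $sql .= " AND (m.old_node_id = ? OR m.new_node_id = ?)";
        push @params, $opt{node}, $opt{node};
    }
    if (defined $opt{type} && $opt{type} ne 'all') {
        $sql .= " AND m.migration_type = ?";
        push @params, $opt{type};
    }

    return ($sql, @params);
}

# Example call with made-up values.
my ($where, @bind) = build_migration_filter(days => 7, node => 'pve2');
print "$where\n";
print join(', ', @bind), "\n";
```

Keeping each fragment next to its bind values makes it easy to check that the number of `?` placeholders equals the number of values later passed to `execute`.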
@@ -0,0 +1,483 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use FindBin qw($Bin); |
|
| 6 |
+use lib "$Bin/../lib"; |
|
| 7 |
+ |
|
| 8 |
+use PredictionEngine; |
|
| 9 |
+use Getopt::Long; |
|
| 10 |
+use JSON::XS; |
|
| 11 |
+use POSIX qw(strftime); |
|
| 12 |
+ |
|
| 13 |
+=head1 NAME |
|
| 14 |
+ |
|
| 15 |
+autosmart-predictor.pl - AI-powered HDD failure prediction for autoSMART |
|
| 16 |
+ |
|
| 17 |
+=head1 SYNOPSIS |
|
| 18 |
+ |
|
| 19 |
+ autosmart-predictor.pl [OPTIONS] |
|
| 20 |
+ |
|
| 21 |
+=head1 OPTIONS |
|
| 22 |
+ |
|
| 23 |
+ --config-dir DIR Configuration directory (default: /etc/autosmart) |
|
| 24 |
+ --device PATH Analyze specific device only |
|
| 25 |
+ --all Analyze all active drives |
|
| 26 |
+ --days-back N Days of history to analyze (default: 90) |
|
| 27 |
+ --output FORMAT Output format: text, json, csv (default: text) |
|
| 28 |
+ --risk-level LEVEL Show only drives with risk >= LEVEL (low, moderate, high, critical) |
|
| 29 |
+ --quiet Quiet mode - only output results |
|
| 30 |
+ --debug Enable debug logging |
|
| 31 |
+ --help Show this help |
|
| 32 |
+ |
|
| 33 |
+=head1 DESCRIPTION |
|
| 34 |
+ |
|
| 35 |
+This script uses AI (OpenAI GPT) to analyze SMART data trends and predict |
|
| 36 |
+HDD failures. It processes historical SMART data stored by the collector |
|
| 37 |
+and generates intelligent predictions with confidence levels. |
|
| 38 |
+ |
|
| 39 |
+=cut |
|
| 40 |
+ |
|
| 41 |
+# Configuration |
|
| 42 |
+my $config_dir = '/etc/autosmart'; |
|
| 43 |
+my $specific_device = ''; |
|
| 44 |
+my $analyze_all = 0; |
|
| 45 |
+my $days_back = 90; |
|
| 46 |
+my $output_format = 'text'; |
|
| 47 |
+my $min_risk_level = ''; |
|
| 48 |
+my $quiet = 0; |
|
| 49 |
+my $debug = 0; |
|
| 50 |
+my $help = 0; |
|
| 51 |
+ |
|
| 52 |
+GetOptions( |
|
| 53 |
+ 'config-dir=s' => \$config_dir, |
|
| 54 |
+ 'device=s' => \$specific_device, |
|
| 55 |
+ 'all' => \$analyze_all, |
|
| 56 |
+ 'days-back=i' => \$days_back, |
|
| 57 |
+ 'output=s' => \$output_format, |
|
| 58 |
+ 'risk-level=s' => \$min_risk_level, |
|
| 59 |
+ 'quiet' => \$quiet, |
|
| 60 |
+ 'debug' => \$debug, |
|
| 61 |
+ 'help' => \$help, |
|
| 62 |
+) or die "Error parsing command line arguments\n"; |
|
| 63 |
+ |
|
| 64 |
+if ($help) {
|
|
| 65 |
+ print_help(); |
|
| 66 |
+ exit 0; |
|
| 67 |
+} |
|
| 68 |
+ |
|
| 69 |
+# Validate options |
|
| 70 |
+unless ($specific_device || $analyze_all) {
|
|
| 71 |
+ die "Must specify either --device PATH or --all\n"; |
|
| 72 |
+} |
|
| 73 |
+ |
|
| 74 |
+if ($specific_device && $analyze_all) {
|
|
| 75 |
+ die "Cannot specify both --device and --all\n"; |
|
| 76 |
+} |
|
| 77 |
+ |
|
| 78 |
+unless ($output_format =~ /^(text|json|csv)$/) {
|
|
| 79 |
+ die "Invalid output format: $output_format (must be text, json, or csv)\n"; |
|
| 80 |
+} |
|
| 81 |
+ |
|
| 82 |
+if ($min_risk_level && $min_risk_level !~ /^(low|moderate|high|critical)$/) {
|
|
| 83 |
+ die "Invalid risk level: $min_risk_level (must be low, moderate, high, or critical)\n"; |
|
| 84 |
+} |
|
| 85 |
+ |
|
| 86 |
+# Validate configuration directory |
|
| 87 |
+unless (-d $config_dir) {
|
|
| 88 |
+ die "Configuration directory not found: $config_dir\n"; |
|
| 89 |
+} |
|
| 90 |
+ |
|
| 91 |
+my $db_config = "$config_dir/database.conf"; |
|
| 92 |
+my $openai_config = "$config_dir/openai.conf"; |
|
| 93 |
+ |
|
| 94 |
+unless (-f $db_config && -f $openai_config) {
|
|
| 95 |
+ die "Required configuration files not found in $config_dir\n"; |
|
| 96 |
+} |
|
| 97 |
+ |
|
| 98 |
+# Initialize prediction engine |
|
| 99 |
+my $predictor = PredictionEngine->new( |
|
| 100 |
+ db_config => $db_config, |
|
| 101 |
+ openai_config => $openai_config, |
|
| 102 |
+ debug => $debug, |
|
| 103 |
+); |
|
| 104 |
+ |
|
| 105 |
+log_message("autoSMART predictor starting...") unless $quiet;
|
|
| 106 |
+ |
|
| 107 |
+my @predictions = (); |
|
| 108 |
+ |
|
| 109 |
+if ($specific_device) {
|
|
| 110 |
+ # Analyze specific device |
|
| 111 |
+ log_message("Analyzing device: $specific_device") unless $quiet;
|
|
| 112 |
+ |
|
| 113 |
+ my $prediction = $predictor->predict_failure($specific_device, $days_back); |
|
| 114 |
+ push @predictions, $prediction; |
|
| 115 |
+ |
|
| 116 |
+} elsif ($analyze_all) {
|
|
| 117 |
+ # Analyze all active drives |
|
| 118 |
+ log_message("Analyzing all active drives...") unless $quiet;
|
|
| 119 |
+ |
|
| 120 |
+ @predictions = @{$predictor->analyze_all_drives()};
|
|
| 121 |
+} |
|
| 122 |
+ |
|
| 123 |
+# Filter predictions by minimum risk level if specified |
|
| 124 |
+if ($min_risk_level) {
|
|
| 125 |
+ @predictions = filter_by_risk_level(\@predictions, $min_risk_level); |
|
| 126 |
+} |
|
| 127 |
+ |
|
| 128 |
+# Output results |
|
| 129 |
+output_predictions(\@predictions, $output_format); |
|
| 130 |
+ |
|
| 131 |
+log_message("Analysis complete") unless $quiet;
|
|
| 132 |
+ |
|
| 133 |
+=head2 filter_by_risk_level |
|
| 134 |
+ |
|
| 135 |
+Filter predictions by minimum risk level |
|
| 136 |
+ |
|
| 137 |
+=cut |
|
| 138 |
+ |
|
| 139 |
+sub filter_by_risk_level {
|
|
| 140 |
+ my ($predictions, $min_level) = @_; |
|
| 141 |
+ |
|
| 142 |
+ my %risk_order = ( |
|
| 143 |
+ 'low' => 1, |
|
| 144 |
+ 'moderate' => 2, |
|
| 145 |
+ 'high' => 3, |
|
| 146 |
+ 'critical' => 4, |
|
| 147 |
+ ); |
|
| 148 |
+ |
|
| 149 |
+ my $min_order = $risk_order{$min_level} || 1;
|
|
| 150 |
+ |
|
| 151 |
+ return grep {
|
|
| 152 |
+ exists $risk_order{$_->{risk_level}} &&
|
|
| 153 |
+ $risk_order{$_->{risk_level}} >= $min_order
|
|
| 154 |
+ } @$predictions; |
|
| 155 |
+} |
|
| 156 |
+ |
|
| 157 |
+=head2 output_predictions |
|
| 158 |
+ |
|
| 159 |
+Output predictions in specified format |
|
| 160 |
+ |
|
| 161 |
+=cut |
|
| 162 |
+ |
|
| 163 |
+sub output_predictions {
|
|
| 164 |
+ my ($predictions, $format) = @_; |
|
| 165 |
+ |
|
| 166 |
+ if ($format eq 'json') {
|
|
| 167 |
+ output_json($predictions); |
|
| 168 |
+ } elsif ($format eq 'csv') {
|
|
| 169 |
+ output_csv($predictions); |
|
| 170 |
+ } else {
|
|
| 171 |
+ output_text($predictions); |
|
| 172 |
+ } |
|
| 173 |
+} |
|
| 174 |
+ |
|
| 175 |
+=head2 output_json |
|
| 176 |
+ |
|
| 177 |
+Output predictions as JSON |
|
| 178 |
+ |
|
| 179 |
+=cut |
|
| 180 |
+ |
|
| 181 |
+sub output_json {
|
|
| 182 |
+ my $predictions = shift; |
|
| 183 |
+ |
|
| 184 |
+ my $json = JSON::XS->new->pretty->encode({
|
|
| 185 |
+ timestamp => time(), |
|
| 186 |
+ predictions => $predictions, |
|
| 187 |
+ }); |
|
| 188 |
+ |
|
| 189 |
+ print $json; |
|
| 190 |
+} |
|
| 191 |
+ |
|
| 192 |
+=head2 output_csv |
|
| 193 |
+ |
|
| 194 |
+Output predictions as CSV |
|
| 195 |
+ |
|
| 196 |
+=cut |
|
| 197 |
+ |
|
| 198 |
+sub output_csv {
|
|
| 199 |
+ my $predictions = shift; |
|
| 200 |
+ |
|
| 201 |
+ # CSV header |
|
| 202 |
+ print "device_path,timestamp,risk_level,confidence,time_to_failure_days,concerns,recommendations\n"; |
|
| 203 |
+ |
|
| 204 |
+ foreach my $pred (@$predictions) {
|
|
| 205 |
+ my @fields = ( |
|
| 206 |
+ $pred->{device_path} || '',
|
|
| 207 |
+ $pred->{timestamp} || '',
|
|
| 208 |
+ $pred->{risk_level} || '',
|
|
| 209 |
+ $pred->{confidence} // '',
|
|
| 210 |
+ $pred->{time_to_failure_days} // '',
|
|
| 211 |
+ escape_csv($pred->{concerns} || ''),
|
|
| 212 |
+ escape_csv($pred->{recommendations} || ''),
|
|
| 213 |
+ ); |
|
| 214 |
+ |
|
| 215 |
+ print join(',', @fields) . "\n";
|
|
| 216 |
+ } |
|
| 217 |
+} |
|
| 218 |
+ |
|
| 219 |
+=head2 output_text |
|
| 220 |
+ |
|
| 221 |
+Output predictions as human-readable text |
|
| 222 |
+ |
|
| 223 |
+=cut |
|
| 224 |
+ |
|
| 225 |
+sub output_text {
|
|
| 226 |
+ my $predictions = shift; |
|
| 227 |
+ |
|
| 228 |
+ unless (@$predictions) {
|
|
| 229 |
+ print "No predictions available.\n"; |
|
| 230 |
+ return; |
|
| 231 |
+ } |
|
| 232 |
+ |
|
| 233 |
+ print "\n" . "="x80 . "\n"; |
|
| 234 |
+ print "autoSMART HDD Failure Prediction Report\n"; |
|
| 235 |
+ print "Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime()) . "\n";
|
|
| 236 |
+ print "="x80 . "\n\n"; |
|
| 237 |
+ |
|
| 238 |
+ foreach my $pred (@$predictions) {
|
|
| 239 |
+ print_prediction_text($pred); |
|
| 240 |
+ print "-"x80 . "\n"; |
|
| 241 |
+ } |
|
| 242 |
+ |
|
| 243 |
+ # Summary statistics |
|
| 244 |
+ my %risk_counts = (); |
|
| 245 |
+ my $total_confidence = 0; |
|
| 246 |
+ my $confidence_count = 0; |
|
| 247 |
+ |
|
| 248 |
+ foreach my $pred (@$predictions) {
|
|
| 249 |
+ $risk_counts{$pred->{risk_level} || 'unknown'}++;
|
|
| 250 |
+ |
|
| 251 |
+ if (defined $pred->{confidence} && $pred->{confidence} > 0) {
|
|
| 252 |
+ $total_confidence += $pred->{confidence};
|
|
| 253 |
+ $confidence_count++; |
|
| 254 |
+ } |
|
| 255 |
+ } |
|
| 256 |
+ |
|
| 257 |
+ print "\nSUMMARY:\n"; |
|
| 258 |
+ print "Total drives analyzed: " . scalar(@$predictions) . "\n"; |
|
| 259 |
+ |
|
| 260 |
+ foreach my $level (qw(critical high moderate low unknown)) {
|
|
| 261 |
+ next unless $risk_counts{$level};
|
|
| 262 |
+ printf("%-10s risk: %d drives\n",
|
|
| 263 |
+ ucfirst($level), $risk_counts{$level});
|
|
| 264 |
+ } |
|
| 265 |
+ |
|
| 266 |
+ if ($confidence_count > 0) {
|
|
| 267 |
+ my $avg_confidence = $total_confidence / $confidence_count; |
|
| 268 |
+ printf("Average confidence: %.1f%%\n", $avg_confidence);
|
|
| 269 |
+ } |
|
| 270 |
+ |
|
| 271 |
+ print "\n"; |
|
| 272 |
+} |
|
| 273 |
+ |
|
| 274 |
+=head2 print_prediction_text |
|
| 275 |
+ |
|
| 276 |
+Print a single prediction in text format |
|
| 277 |
+ |
|
| 278 |
+=cut |
|
| 279 |
+ |
|
| 280 |
+sub print_prediction_text {
|
|
| 281 |
+ my $pred = shift; |
|
| 282 |
+ |
|
| 283 |
+ print "DEVICE: $pred->{device_path}\n";
|
|
| 284 |
+ |
|
| 285 |
+ if (($pred->{prediction} // '') eq 'insufficient_data') {
|
|
| 286 |
+ print "STATUS: Insufficient data for analysis\n"; |
|
| 287 |
+ print "MESSAGE: $pred->{message}\n";
|
|
| 288 |
+ return; |
|
| 289 |
+ } |
|
| 290 |
+ |
|
| 291 |
+ print "RISK LEVEL: " . format_risk_level($pred->{risk_level}) . "\n";
|
|
| 292 |
+ |
|
| 293 |
+ if (defined $pred->{confidence}) {
|
|
| 294 |
+ print "CONFIDENCE: $pred->{confidence}%\n";
|
|
| 295 |
+ } |
|
| 296 |
+ |
|
| 297 |
+ if (defined $pred->{time_to_failure_days} && $pred->{time_to_failure_days} > 0) {
|
|
| 298 |
+ print "ESTIMATED TIME TO FAILURE: $pred->{time_to_failure_days} days\n";
|
|
| 299 |
+ } |
|
| 300 |
+ |
|
| 301 |
+ if ($pred->{concerns}) {
|
|
| 302 |
+ print "CONCERNS:\n"; |
|
| 303 |
+ print format_text_block($pred->{concerns}, " ");
|
|
| 304 |
+ } |
|
| 305 |
+ |
|
| 306 |
+ if ($pred->{recommendations}) {
|
|
| 307 |
+ print "RECOMMENDATIONS:\n"; |
|
| 308 |
+ print format_text_block($pred->{recommendations}, " ");
|
|
| 309 |
+ } |
|
| 310 |
+ |
|
| 311 |
+ if ($pred->{reasoning} && $debug) {
|
|
| 312 |
+ print "AI REASONING:\n"; |
|
| 313 |
+ print format_text_block($pred->{reasoning}, " ");
|
|
| 314 |
+ } |
|
| 315 |
+ |
|
| 316 |
+ my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime($pred->{timestamp}));
|
|
| 317 |
+ print "ANALYZED: $timestamp\n"; |
|
| 318 |
+ |
|
| 319 |
+ print "\n"; |
|
| 320 |
+} |
|
| 321 |
+ |
|
| 322 |
+=head2 format_risk_level |
|
| 323 |
+ |
|
| 324 |
+Format risk level with color coding (if terminal supports it) |
|
| 325 |
+ |
|
| 326 |
+=cut |
|
| 327 |
+ |
|
| 328 |
+sub format_risk_level {
|
|
| 329 |
+ my $level = shift || 'unknown'; |
|
| 330 |
+ |
|
| 331 |
+ # Simple color codes (won't work in all terminals) |
|
| 332 |
+ my %colors = ( |
|
| 333 |
+ 'critical' => "\033[1;31m", # Bold red |
|
| 334 |
+ 'high' => "\033[0;31m", # Red |
|
| 335 |
+ 'moderate' => "\033[0;33m", # Yellow |
|
| 336 |
+ 'low' => "\033[0;32m", # Green |
|
| 337 |
+ 'unknown' => "\033[0;37m", # Gray |
|
| 338 |
+ ); |
|
| 339 |
+ |
|
| 340 |
+ my $reset = "\033[0m"; |
|
| 341 |
+ |
|
| 342 |
+ # Only use colors if output is to terminal |
|
| 343 |
+ if (-t STDOUT) {
|
|
| 344 |
+ return ($colors{$level} || '') . uc($level) . $reset;
|
|
| 345 |
+ } else {
|
|
| 346 |
+ return uc($level); |
|
| 347 |
+ } |
|
| 348 |
+} |
|
| 349 |
+ |
|
| 350 |
+=head2 format_text_block |
|
| 351 |
+ |
|
| 352 |
+Format multi-line text with indentation |
|
| 353 |
+ |
|
| 354 |
+=cut |
|
| 355 |
+ |
|
| 356 |
+sub format_text_block {
|
|
| 357 |
+ my ($text, $indent) = @_; |
|
| 358 |
+ |
|
| 359 |
+ return '' unless $text; |
|
| 360 |
+ |
|
| 361 |
+ $indent ||= ''; |
|
| 362 |
+ |
|
| 363 |
+ my @lines = split /\n/, $text; |
|
| 364 |
+ return join("\n", map { "$indent$_" } @lines) . "\n";
|
|
| 365 |
+} |
|
| 366 |
+ |
|
| 367 |
+=head2 escape_csv |
|
| 368 |
+ |
|
| 369 |
+Escape CSV field content |
|
| 370 |
+ |
|
| 371 |
+=cut |
|
| 372 |
+ |
|
| 373 |
+sub escape_csv {
|
|
| 374 |
+ my $field = $_[0] // ''; # keep a literal 0 instead of blanking it |
|
| 375 |
+ |
|
| 376 |
+ # Escape quotes and wrap in quotes if contains comma/quote/newline |
|
| 377 |
+ if ($field =~ /[",\n]/) {
|
|
| 378 |
+ $field =~ s/"/""/g; |
|
| 379 |
+ $field = "\"$field\""; |
|
| 380 |
+ } |
|
| 381 |
+ |
|
| 382 |
+ return $field; |
|
| 383 |
+} |
|
| 384 |
+ |
|
| 385 |
+=head2 log_message |
|
| 386 |
+ |
|
| 387 |
+Log message with timestamp |
|
| 388 |
+ |
|
| 389 |
+=cut |
|
| 390 |
+ |
|
| 391 |
+sub log_message {
|
|
| 392 |
+ my $message = shift; |
|
| 393 |
+ |
|
| 394 |
+ my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime());
|
|
| 395 |
+ print STDERR "[$timestamp] autosmart-predictor: $message\n"; |
|
| 396 |
+} |
|
| 397 |
+ |
|
| 398 |
+=head2 print_help |
|
| 399 |
+ |
|
| 400 |
+Display help information |
|
| 401 |
+ |
|
| 402 |
+=cut |
|
| 403 |
+ |
|
| 404 |
+sub print_help {
|
|
| 405 |
+ print <<'EOF'; |
|
| 406 |
+autoSMART AI Predictor v1.0 |
|
| 407 |
+ |
|
| 408 |
+USAGE: |
|
| 409 |
+ autosmart-predictor.pl [OPTIONS] |
|
| 410 |
+ |
|
| 411 |
+OPTIONS: |
|
| 412 |
+ --config-dir DIR Configuration directory (default: /etc/autosmart) |
|
| 413 |
+ --device PATH Analyze specific device only (e.g., /dev/sda) |
|
| 414 |
+ --all Analyze all active drives in inventory |
|
| 415 |
+ --days-back N Days of SMART history to analyze (default: 90) |
|
| 416 |
+ --output FORMAT Output format: text, json, csv (default: text) |
|
| 417 |
+ --risk-level LEVEL Show only drives with risk >= LEVEL |
|
| 418 |
+ (low, moderate, high, critical) |
|
| 419 |
+ --quiet Quiet mode - suppress status messages |
|
| 420 |
+ --debug Enable debug logging and show AI reasoning |
|
| 421 |
+ --help Show this help message |
|
| 422 |
+ |
|
| 423 |
+EXAMPLES: |
|
| 424 |
+ # Analyze specific drive |
|
| 425 |
+ autosmart-predictor.pl --device /dev/sda |
|
| 426 |
+ |
|
| 427 |
+ # Analyze all drives |
|
| 428 |
+ autosmart-predictor.pl --all |
|
| 429 |
+ |
|
| 430 |
+ # Analyze with 30 days of history |
|
| 431 |
+ autosmart-predictor.pl --all --days-back 30 |
|
| 432 |
+ |
|
| 433 |
+ # Show only high/critical risk drives |
|
| 434 |
+ autosmart-predictor.pl --all --risk-level high |
|
| 435 |
+ |
|
| 436 |
+ # Output as JSON |
|
| 437 |
+ autosmart-predictor.pl --all --output json |
|
| 438 |
+ |
|
| 439 |
+ # Quiet CSV output for scripts |
|
| 440 |
+ autosmart-predictor.pl --all --output csv --quiet |
|
| 441 |
+ |
|
| 442 |
+RISK LEVELS: |
|
| 443 |
+ LOW No immediate concerns detected |
|
| 444 |
+ MODERATE Some parameters showing minor degradation |
|
| 445 |
+ HIGH Multiple concerning trends detected |
|
| 446 |
+ CRITICAL Immediate action required - failure likely soon |
|
| 447 |
+ |
|
| 448 |
+OUTPUT FORMATS: |
|
| 449 |
+ text Human-readable report (default) |
|
| 450 |
+ json Machine-readable JSON format |
|
| 451 |
+ csv Comma-separated values for spreadsheets |
|
| 452 |
+ |
|
| 453 |
+CONFIGURATION: |
|
| 454 |
+ Required configuration files in /etc/autosmart/: |
|
| 455 |
+ - database.conf PostgreSQL connection settings |
|
| 456 |
+ - openai.conf OpenAI API configuration |
|
| 457 |
+ |
|
| 458 |
+ The predictor requires historical SMART data collected by |
|
| 459 |
+ autosmart-collector.pl to generate meaningful predictions. |
|
| 460 |
+ |
|
| 461 |
+AI INTEGRATION: |
|
| 462 |
+ This tool uses OpenAI's GPT models to analyze SMART parameter trends |
|
| 463 |
+ and generate intelligent failure predictions. Ensure your OpenAI API |
|
| 464 |
+ key is properly configured in openai.conf. |
|
| 465 |
+ |
|
| 466 |
+EXIT CODES: |
|
| 467 |
+ 0 Success |
|
| 468 |
+ 1 Error (configuration, API failure, etc.) |
|
| 469 |
+ |
|
| 470 |
+EOF |
|
| 471 |
+} |
|
| 472 |
+ |
|
| 473 |
+__END__ |
|
| 474 |
+ |
|
| 475 |
+=head1 AUTHOR |
|
| 476 |
+ |
|
| 477 |
+AutoSMART Development Team |
|
| 478 |
+ |
|
| 479 |
+=head1 LICENSE |
|
| 480 |
+ |
|
| 481 |
+This software is part of the autoSMART project. |
|
| 482 |
+ |
|
| 483 |
+=cut |
|
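Review note on `filter_by_risk_level` in the file above: the `--risk-level` option works by mapping each level name to a rank and keeping predictions at or above the requested rank. A self-contained sketch of that idea follows. The sample records are made up for illustration, and the explicit handling of a missing `risk_level` is an assumption about how such records should be treated, not something the patch states.

```perl
use strict;
use warnings;

# Rank of each risk level, lowest to highest.
my %RISK_ORDER = (low => 1, moderate => 2, high => 3, critical => 4);

# Keep predictions whose risk_level ranks at or above $min_level.
# Records with a missing or unrecognised risk_level are dropped.
sub filter_by_risk_level {
    my ($predictions, $min_level) = @_;
    my $min_order = $RISK_ORDER{$min_level} || 1;

    return grep {
        my $level = $_->{risk_level} // '';
        exists $RISK_ORDER{$level} && $RISK_ORDER{$level} >= $min_order;
    } @$predictions;
}

# Made-up sample records, for illustration only.
my @sample = (
    { device_path => '/dev/sda', risk_level => 'low' },
    { device_path => '/dev/sdb', risk_level => 'high' },
    { device_path => '/dev/sdc', risk_level => 'critical' },
    { device_path => '/dev/sdd' },    # no risk_level, e.g. insufficient data
);

my @kept = filter_by_risk_level(\@sample, 'high');
print "$_->{device_path}\n" for @kept;
```

Normalising an undefined `risk_level` to an empty string before the hash lookup avoids "uninitialized value" warnings under `use warnings` while leaving the filtering result unchanged.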
@@ -0,0 +1,662 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use DBI; |
|
| 6 |
+use Getopt::Long; |
|
| 7 |
+use Config::Simple; |
|
| 8 |
+use JSON::XS; |
|
| 9 |
+use POSIX qw(strftime); |
|
| 10 |
+ |
|
| 11 |
+=head1 NAME |
|
| 12 |
+ |
|
| 13 |
+autosmart-report.pl - Generate comprehensive reports for autoSMART system |
|
| 14 |
+ |
|
| 15 |
+=head1 SYNOPSIS |
|
| 16 |
+ |
|
| 17 |
+ autosmart-report.pl [OPTIONS] |
|
| 18 |
+ |
|
| 19 |
+=head1 OPTIONS |
|
| 20 |
+ |
|
| 21 |
+ --config-dir DIR Configuration directory (default: /etc/autosmart) |
|
| 22 |
+ --report TYPE Report type: summary, detailed, health, alerts, trends |
|
| 23 |
+ --device PATH Report for specific device only |
|
| 24 |
+ --days N Days of history to include (default: 30) |
|
| 25 |
+ --format FORMAT Output format: text, html, json (default: text) |
|
| 26 |
+ --output FILE Write to file instead of stdout |
|
| 27 |
+ --help Show this help |
|
| 28 |
+ |
|
| 29 |
+=head1 DESCRIPTION |
|
| 30 |
+ |
|
| 31 |
+Generate various reports from autoSMART data including drive health summaries, |
|
| 32 |
+detailed SMART analysis, alert history, and trend analysis. |
|
| 33 |
+ |
|
| 34 |
+=cut |
|
| 35 |
+ |
|
| 36 |
+# Configuration |
|
| 37 |
+my $config_dir = '/etc/autosmart'; |
|
| 38 |
+my $report_type = 'summary'; |
|
| 39 |
+my $specific_device = ''; |
|
| 40 |
+my $days = 30; |
|
| 41 |
+my $format = 'text'; |
|
| 42 |
+my $output_file = ''; |
|
| 43 |
+my $help = 0; |
|
| 44 |
+ |
|
| 45 |
+GetOptions( |
|
| 46 |
+ 'config-dir=s' => \$config_dir, |
|
| 47 |
+ 'report=s' => \$report_type, |
|
| 48 |
+ 'device=s' => \$specific_device, |
|
| 49 |
+ 'days=i' => \$days, |
|
| 50 |
+ 'format=s' => \$format, |
|
| 51 |
+ 'output=s' => \$output_file, |
|
| 52 |
+ 'help' => \$help, |
|
| 53 |
+) or die "Error parsing command line arguments\n"; |
|
| 54 |
+ |
|
| 55 |
+if ($help) {
|
|
| 56 |
+ print_help(); |
|
| 57 |
+ exit 0; |
|
| 58 |
+} |
|
| 59 |
+ |
|
| 60 |
+# Validate options |
|
| 61 |
+unless ($report_type =~ /^(summary|detailed|health|alerts|trends)$/) {
|
|
| 62 |
+ die "Invalid report type: $report_type\n"; |
|
| 63 |
+} |
|
| 64 |
+ |
|
| 65 |
+unless ($format =~ /^(text|html|json)$/) {
|
|
| 66 |
+ die "Invalid format: $format\n"; |
|
| 67 |
+} |
|
| 68 |
+ |
|
| 69 |
+# Connect to database |
|
| 70 |
+my $db_config = "$config_dir/database.conf"; |
|
| 71 |
+unless (-f $db_config) {
|
|
| 72 |
+ die "Database configuration not found: $db_config\n"; |
|
| 73 |
+} |
|
| 74 |
+ |
|
| 75 |
+my $cfg = Config::Simple->new($db_config); |
|
| 76 |
+my $dsn = sprintf("DBI:Pg:database=%s;host=%s;port=%s",
|
|
| 77 |
+ $cfg->param('database.database'),
|
|
| 78 |
+ $cfg->param('database.host'),
|
|
| 79 |
+ $cfg->param('database.port')
|
|
| 80 |
+); |
|
| 81 |
+ |
|
| 82 |
+my $dbh = DBI->connect( |
|
| 83 |
+ $dsn, |
|
| 84 |
+ $cfg->param('database.username'),
|
|
| 85 |
+ $cfg->param('database.password'),
|
|
| 86 |
+ { RaiseError => 1, AutoCommit => 1, pg_enable_utf8 => 1 }
|
|
| 87 |
+) or die "Database connection failed: $DBI::errstr"; |
|
| 88 |
+ |
|
| 89 |
+# Generate report |
|
| 90 |
+my $report_data = generate_report($dbh, $report_type, $specific_device, $days); |
|
| 91 |
+ |
|
| 92 |
+# Output report |
|
| 93 |
+my $output_handle = \*STDOUT; |
|
| 94 |
+if ($output_file) {
|
|
| 95 |
+ open $output_handle, '>', $output_file |
|
| 96 |
+ or die "Cannot open output file $output_file: $!\n"; |
|
| 97 |
+} |
|
| 98 |
+ |
|
| 99 |
+if ($format eq 'json') {
|
|
| 100 |
+ output_json($output_handle, $report_data); |
|
| 101 |
+} elsif ($format eq 'html') {
|
|
| 102 |
+ output_html($output_handle, $report_data, $report_type); |
|
| 103 |
+} else {
|
|
| 104 |
+ output_text($output_handle, $report_data, $report_type); |
|
| 105 |
+} |
|
| 106 |
+ |
|
| 107 |
+close $output_handle if $output_file; |
|
| 108 |
+ |
|
| 109 |
+$dbh->disconnect(); |
|
| 110 |
+ |
|
| 111 |
+=head2 generate_report |
|
| 112 |
+ |
|
| 113 |
+Generate report data based on type |
|
| 114 |
+ |
|
| 115 |
+=cut |
|
| 116 |
+ |
|
| 117 |
+sub generate_report {
|
|
| 118 |
+ my ($dbh, $type, $device, $days) = @_; |
|
| 119 |
+ |
|
| 120 |
+ my $data = {
|
|
| 121 |
+ report_type => $type, |
|
| 122 |
+ generated_at => time(), |
|
| 123 |
+ days_included => $days, |
|
| 124 |
+ specific_device => $device, |
|
| 125 |
+ }; |
|
| 126 |
+ |
|
| 127 |
+ if ($type eq 'summary') {
|
|
| 128 |
+ $data->{summary} = get_system_summary($dbh, $days);
|
|
| 129 |
+ } elsif ($type eq 'detailed') {
|
|
| 130 |
+ $data->{drives} = get_detailed_drive_info($dbh, $device, $days);
|
|
| 131 |
+ } elsif ($type eq 'health') {
|
|
| 132 |
+ $data->{health} = get_health_overview($dbh, $device);
|
|
| 133 |
+ } elsif ($type eq 'alerts') {
|
|
| 134 |
+ $data->{alerts} = get_alert_history($dbh, $device, $days);
|
|
| 135 |
+ } elsif ($type eq 'trends') {
|
|
| 136 |
+ $data->{trends} = get_trend_analysis($dbh, $device, $days);
|
|
| 137 |
+ } |
|
| 138 |
+ |
|
| 139 |
+ return $data; |
|
| 140 |
+} |
|
| 141 |
+ |
|
| 142 |
+=head2 get_system_summary |
|
| 143 |
+ |
|
| 144 |
+Get high-level system summary |
|
| 145 |
+ |
|
| 146 |
+=cut |
|
| 147 |
+ |
|
| 148 |
+sub get_system_summary {
|
|
| 149 |
+ my ($dbh, $days) = @_; |
|
| 150 |
+ |
|
| 151 |
+ my $summary = {};
|
|
| 152 |
+ |
|
| 153 |
+ # Drive counts by status |
|
| 154 |
+ my $sth = $dbh->prepare(q{
|
|
| 155 |
+ SELECT status, COUNT(*) as count |
|
| 156 |
+ FROM hdd_inventory |
|
| 157 |
+ GROUP BY status |
|
| 158 |
+ }); |
|
| 159 |
+ $sth->execute(); |
|
| 160 |
+ |
|
| 161 |
+ $summary->{drive_counts} = {};
|
|
| 162 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 163 |
+ $summary->{drive_counts}->{$row->{status}} = $row->{count};
|
|
| 164 |
+ } |
|
| 165 |
+ |
|
| 166 |
+ # Recent predictions summary |
|
| 167 |
+ $sth = $dbh->prepare(q{
|
|
| 168 |
+ SELECT risk_level, COUNT(*) as count |
|
| 169 |
+ FROM predictions |
|
| 170 |
+ WHERE timestamp >= NOW() - (?::int * INTERVAL '1 day') |
|
| 171 |
+ GROUP BY risk_level |
|
| 172 |
+ }); |
|
| 173 |
+ $sth->execute($days); |
|
| 174 |
+ |
|
| 175 |
+ $summary->{recent_predictions} = {};
|
|
| 176 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 177 |
+ $summary->{recent_predictions}->{$row->{risk_level}} = $row->{count};
|
|
| 178 |
+ } |
|
| 179 |
+ |
|
| 180 |
+ # Recent alerts |
|
| 181 |
+ $sth = $dbh->prepare(q{
|
|
| 182 |
+ SELECT alert_type, COUNT(*) as count |
|
| 183 |
+ FROM alert_history |
|
| 184 |
+ WHERE sent_at >= NOW() - (?::int * INTERVAL '1 day') |
|
| 185 |
+ GROUP BY alert_type |
|
| 186 |
+ }); |
|
| 187 |
+ $sth->execute($days); |
|
| 188 |
+ |
|
| 189 |
+ $summary->{recent_alerts} = {};
|
|
| 190 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 191 |
+ $summary->{recent_alerts}->{$row->{alert_type}} = $row->{count};
|
|
| 192 |
+ } |
|
| 193 |
+ |
|
| 194 |
+ # Data collection stats |
|
| 195 |
+ $sth = $dbh->prepare(q{
|
|
| 196 |
+ SELECT |
|
| 197 |
+ COUNT(*) as total_readings, |
|
| 198 |
+ COUNT(DISTINCT device_path) as devices_with_data, |
|
| 199 |
+ AVG(CASE WHEN collection_ok THEN 1.0 ELSE 0.0 END) * 100 as success_rate |
|
| 200 |
+ FROM smart_readings |
|
| 201 |
+ WHERE timestamp >= NOW() - (?::int * INTERVAL '1 day') |
|
| 202 |
+ }); |
|
| 203 |
+ $sth->execute($days); |
|
| 204 |
+ |
|
| 205 |
+ if (my $row = $sth->fetchrow_hashref()) {
|
|
| 206 |
+ $summary->{collection_stats} = {
|
|
| 207 |
+ total_readings => $row->{total_readings},
|
|
| 208 |
+ devices_with_data => $row->{devices_with_data},
|
|
| 209 |
+ success_rate => sprintf("%.1f", $row->{success_rate} || 0),
|
|
| 210 |
+ }; |
|
| 211 |
+ } |
|
| 212 |
+ |
|
| 213 |
+ return $summary; |
|
| 214 |
+} |
|
| 215 |
+ |
|
| 216 |
+=head2 get_detailed_drive_info |
|
| 217 |
+ |
|
| 218 |
+Get detailed information for drives |
|
| 219 |
+ |
|
| 220 |
+=cut |
|
| 221 |
+ |
|
| 222 |
+sub get_detailed_drive_info {
|
|
| 223 |
+ my ($dbh, $device, $days) = @_; |
|
| 224 |
+ |
|
| 225 |
+ my $sql = q{
|
|
| 226 |
+ SELECT |
|
| 227 |
+ hi.device_path, |
|
| 228 |
+ hi.model_name, |
|
| 229 |
+ hi.serial_number, |
|
| 230 |
+ hi.size_gb, |
|
| 231 |
+ hi.status, |
|
| 232 |
+ hi.first_seen, |
|
| 233 |
+ hi.last_seen, |
|
| 234 |
+ COUNT(sr.id) as reading_count, |
|
| 235 |
+ AVG(sr.temperature) as avg_temperature, |
|
| 236 |
+ MAX(sr.temperature) as max_temperature |
|
| 237 |
+ FROM hdd_inventory hi |
|
| 238 |
+ LEFT JOIN smart_readings sr ON hi.device_path = sr.device_path |
|
| 239 |
+ AND sr.timestamp >= NOW() - (?::int * INTERVAL '1 day') |
|
| 240 |
+ }; |
|
| 241 |
+ |
|
| 242 |
+ my @params = ($days); |
|
| 243 |
+ |
|
| 244 |
+ if ($device) {
|
|
| 245 |
+ $sql .= " WHERE hi.device_path = ?"; |
|
| 246 |
+ push @params, $device; |
|
| 247 |
+ } |
|
| 248 |
+ |
|
| 249 |
+ $sql .= q{
|
|
| 250 |
+ GROUP BY hi.device_path, hi.model_name, hi.serial_number, |
|
| 251 |
+ hi.size_gb, hi.status, hi.first_seen, hi.last_seen |
|
| 252 |
+ ORDER BY hi.device_path |
|
| 253 |
+ }; |
|
| 254 |
+ |
|
| 255 |
+ my $sth = $dbh->prepare($sql); |
|
| 256 |
+ $sth->execute(@params); |
|
| 257 |
+ |
|
| 258 |
+ my @drives = (); |
|
| 259 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 260 |
+ # Get latest prediction |
|
| 261 |
+ my $pred_sth = $dbh->prepare(q{
|
|
| 262 |
+ SELECT risk_level, confidence, timestamp |
|
| 263 |
+ FROM predictions |
|
| 264 |
+ WHERE device_path = ? |
|
| 265 |
+ ORDER BY timestamp DESC |
|
| 266 |
+ LIMIT 1 |
|
| 267 |
+ }); |
|
| 268 |
+ $pred_sth->execute($row->{device_path});
|
|
| 269 |
+ |
|
| 270 |
+ if (my $pred = $pred_sth->fetchrow_hashref()) {
|
|
| 271 |
+ $row->{latest_prediction} = $pred;
|
|
| 272 |
+ } |
|
| 273 |
+ |
|
| 274 |
+ # Get recent alerts |
|
| 275 |
+ my $alert_sth = $dbh->prepare(q{
|
|
| 276 |
+ SELECT COUNT(*) as alert_count |
|
| 277 |
+ FROM alert_history |
|
| 278 |
+ WHERE device_path = ? |
|
| 279 |
+ AND sent_at >= NOW() - (?::int * INTERVAL '1 day') |
|
| 280 |
+ }); |
|
| 281 |
+ $alert_sth->execute($row->{device_path}, $days);
|
|
| 282 |
+ |
|
| 283 |
+ if (my $alert = $alert_sth->fetchrow_hashref()) {
|
|
| 284 |
+ $row->{recent_alert_count} = $alert->{alert_count};
|
|
| 285 |
+ } |
|
| 286 |
+ |
|
| 287 |
+ push @drives, $row; |
|
| 288 |
+ } |
|
| 289 |
+ |
|
| 290 |
+ return \@drives; |
|
| 291 |
+} |
|
| 292 |
+ |
|
| 293 |
+=head2 get_health_overview |
|
| 294 |
+ |
|
| 295 |
+Get current health overview |
|
| 296 |
+ |
|
| 297 |
+=cut |
|
| 298 |
+ |
|
| 299 |
+sub get_health_overview {
|
|
| 300 |
+ my ($dbh, $device) = @_; |
|
| 301 |
+ |
|
| 302 |
+ my $sql = q{
|
|
| 303 |
+ SELECT * FROM drive_health_summary |
|
| 304 |
+ }; |
|
| 305 |
+ |
|
| 306 |
+ my @params = (); |
|
| 307 |
+ if ($device) {
|
|
| 308 |
+ $sql .= " WHERE device_path = ?"; |
|
| 309 |
+ push @params, $device; |
|
| 310 |
+ } |
|
| 311 |
+ |
|
| 312 |
+ $sql .= " ORDER BY device_path"; |
|
| 313 |
+ |
|
| 314 |
+ my $sth = $dbh->prepare($sql); |
|
| 315 |
+ $sth->execute(@params); |
|
| 316 |
+ |
|
| 317 |
+ my @health_data = (); |
|
| 318 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 319 |
+ push @health_data, $row; |
|
| 320 |
+ } |
|
| 321 |
+ |
|
| 322 |
+ return \@health_data; |
|
| 323 |
+} |
|
| 324 |
+ |
|
| 325 |
+=head2 get_alert_history |
|
| 326 |
+ |
|
| 327 |
+Get alert history |
|
| 328 |
+ |
|
| 329 |
+=cut |
|
| 330 |
+ |
|
| 331 |
+sub get_alert_history {
|
|
| 332 |
+ my ($dbh, $device, $days) = @_; |
|
| 333 |
+ |
|
| 334 |
+ my $sql = q{
|
|
| 335 |
+ SELECT |
|
| 336 |
+ ah.device_path, |
|
| 337 |
+ ah.alert_type, |
|
| 338 |
+ ah.risk_level, |
|
| 339 |
+ ah.message, |
|
| 340 |
+ ah.sent_at, |
|
| 341 |
+ ah.acknowledged, |
|
| 342 |
+ ah.acknowledged_by, |
|
| 343 |
+ hi.model_name |
|
| 344 |
+ FROM alert_history ah |
|
| 345 |
+ JOIN hdd_inventory hi ON ah.device_path = hi.device_path |
|
| 346 |
+ WHERE ah.sent_at >= NOW() - (?::int * INTERVAL '1 day') |
|
| 347 |
+ }; |
|
| 348 |
+ |
|
| 349 |
+ my @params = ($days); |
|
| 350 |
+ |
|
| 351 |
+ if ($device) {
|
|
| 352 |
+ $sql .= " AND ah.device_path = ?"; |
|
| 353 |
+ push @params, $device; |
|
| 354 |
+ } |
|
| 355 |
+ |
|
| 356 |
+ $sql .= " ORDER BY ah.sent_at DESC"; |
|
| 357 |
+ |
|
| 358 |
+ my $sth = $dbh->prepare($sql); |
|
| 359 |
+ $sth->execute(@params); |
|
| 360 |
+ |
|
| 361 |
+ my @alerts = (); |
|
| 362 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 363 |
+ push @alerts, $row; |
|
| 364 |
+ } |
|
| 365 |
+ |
|
| 366 |
+ return \@alerts; |
|
| 367 |
+} |
|
| 368 |
+ |
|
| 369 |
+=head2 get_trend_analysis |
|
| 370 |
+ |
|
| 371 |
+Get trend analysis data |
|
| 372 |
+ |
|
| 373 |
+=cut |
|
| 374 |
+ |
|
| 375 |
+sub get_trend_analysis {
|
|
| 376 |
+ my ($dbh, $device, $days) = @_; |
|
| 377 |
+ |
|
| 378 |
+ # This is a simplified trend analysis |
|
| 379 |
+ # In production, you might want more sophisticated analysis |
|
| 380 |
+ |
|
| 381 |
+ my $sql = q{
|
|
| 382 |
+ SELECT |
|
| 383 |
+ device_path, |
|
| 384 |
+ DATE(timestamp) as date, |
|
| 385 |
+ AVG(temperature) as avg_temp, |
|
| 386 |
+ COUNT(*) as reading_count |
|
| 387 |
+ FROM smart_readings |
|
| 388 |
+ WHERE timestamp >= NOW() - (?::int * INTERVAL '1 day') |
|
| 389 |
+ }; |
|
| 390 |
+ |
|
| 391 |
+ my @params = ($days); |
|
| 392 |
+ |
|
| 393 |
+ if ($device) {
|
|
| 394 |
+ $sql .= " AND device_path = ?"; |
|
| 395 |
+ push @params, $device; |
|
| 396 |
+ } |
|
| 397 |
+ |
|
| 398 |
+ $sql .= q{
|
|
| 399 |
+ GROUP BY device_path, DATE(timestamp) |
|
| 400 |
+ ORDER BY device_path, date |
|
| 401 |
+ }; |
|
| 402 |
+ |
|
| 403 |
+ my $sth = $dbh->prepare($sql); |
|
| 404 |
+ $sth->execute(@params); |
|
| 405 |
+ |
|
| 406 |
+ my %trends = (); |
|
| 407 |
+ while (my $row = $sth->fetchrow_hashref()) {
|
|
| 408 |
+ push @{$trends{$row->{device_path}}}, {
|
|
| 409 |
+ date => $row->{date},
|
|
| 410 |
+ avg_temp => sprintf("%.1f", $row->{avg_temp} || 0),
|
|
| 411 |
+ reading_count => $row->{reading_count},
|
|
| 412 |
+ }; |
|
| 413 |
+ } |
|
| 414 |
+ |
|
| 415 |
+ return \%trends; |
|
| 416 |
+} |
|
| 417 |
+ |
|
| 418 |
+=head2 output_text |
|
| 419 |
+ |
|
| 420 |
+Output report as text |
|
| 421 |
+ |
|
| 422 |
+=cut |
|
| 423 |
+ |
|
| 424 |
+sub output_text {
|
|
| 425 |
+ my ($fh, $data, $type) = @_; |
|
| 426 |
+ |
|
| 427 |
+ print $fh "\n" . "="x80 . "\n"; |
|
| 428 |
+ print $fh "autoSMART System Report - " . ucfirst($type) . "\n"; |
|
| 429 |
+ print $fh "Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime($data->{generated_at})) . "\n";
|
|
| 430 |
+ print $fh "Time Period: Last $data->{days_included} days\n";
|
|
| 431 |
+ print $fh "="x80 . "\n\n"; |
|
| 432 |
+ |
|
| 433 |
+ if ($type eq 'summary') {
|
|
| 434 |
+ output_summary_text($fh, $data->{summary});
|
|
| 435 |
+ } elsif ($type eq 'detailed') {
|
|
| 436 |
+ output_detailed_text($fh, $data->{drives});
|
|
| 437 |
+ } elsif ($type eq 'health') {
|
|
| 438 |
+ output_health_text($fh, $data->{health});
|
|
| 439 |
+ } elsif ($type eq 'alerts') {
|
|
| 440 |
+ output_alerts_text($fh, $data->{alerts});
|
|
| 441 |
+ } elsif ($type eq 'trends') {
|
|
| 442 |
+ output_trends_text($fh, $data->{trends});
|
|
| 443 |
+ } |
|
| 444 |
+} |
|
| 445 |
+ |
|
| 446 |
+=head2 output_summary_text |
|
| 447 |
+ |
|
| 448 |
+Output summary in text format |
|
| 449 |
+ |
|
| 450 |
+=cut |
|
| 451 |
+ |
|
| 452 |
+sub output_summary_text {
|
|
| 453 |
+ my ($fh, $summary) = @_; |
|
| 454 |
+ |
|
| 455 |
+ print $fh "SYSTEM OVERVIEW\n"; |
|
| 456 |
+ print $fh "-"x40 . "\n"; |
|
| 457 |
+ |
|
| 458 |
+ print $fh "Drive Status:\n"; |
|
| 459 |
+ foreach my $status (sort keys %{$summary->{drive_counts}}) {
|
|
| 460 |
+ printf $fh " %-10s: %d drives\n", ucfirst($status), $summary->{drive_counts}->{$status};
|
|
| 461 |
+ } |
|
| 462 |
+ |
|
| 463 |
+ if (%{$summary->{recent_predictions}}) {
|
|
| 464 |
+ print $fh "\nRecent Risk Predictions:\n"; |
|
| 465 |
+ foreach my $level (qw(critical high moderate low)) {
|
|
| 466 |
+ next unless $summary->{recent_predictions}->{$level};
|
|
| 467 |
+ printf $fh " %-10s: %d drives\n", ucfirst($level), $summary->{recent_predictions}->{$level};
|
|
| 468 |
+ } |
|
| 469 |
+ } |
|
| 470 |
+ |
|
| 471 |
+ if (%{$summary->{recent_alerts}}) {
|
|
| 472 |
+ print $fh "\nRecent Alerts:\n"; |
|
| 473 |
+ foreach my $type (sort keys %{$summary->{recent_alerts}}) {
|
|
| 474 |
+ printf $fh " %-15s: %d alerts\n", $type, $summary->{recent_alerts}->{$type};
|
|
| 475 |
+ } |
|
| 476 |
+ } |
|
| 477 |
+ |
|
| 478 |
+ if ($summary->{collection_stats}) {
|
|
| 479 |
+ my $stats = $summary->{collection_stats};
|
|
| 480 |
+ print $fh "\nData Collection:\n"; |
|
| 481 |
+ print $fh " Total readings: $stats->{total_readings}\n";
|
|
| 482 |
+ print $fh " Devices monitored: $stats->{devices_with_data}\n";
|
|
| 483 |
+ print $fh " Success rate: $stats->{success_rate}%\n";
|
|
| 484 |
+ } |
|
| 485 |
+ |
|
| 486 |
+ print $fh "\n"; |
|
| 487 |
+} |
|
| 488 |
+ |
|
| 489 |
+=head2 output_json |
|
| 490 |
+ |
|
| 491 |
+Output report as JSON |
|
| 492 |
+ |
|
| 493 |
+=cut |
|
| 494 |
+ |
|
| 495 |
+sub output_json {
|
|
| 496 |
+ my ($fh, $data) = @_; |
|
| 497 |
+ |
|
| 498 |
+ my $json = JSON::XS->new->pretty->encode($data); |
|
| 499 |
+ print $fh $json; |
|
| 500 |
+} |
|
| 501 |
+ |
|
| 502 |
+=head2 output_html |
|
| 503 |
+ |
|
| 504 |
+Output report as HTML (basic implementation) |
|
| 505 |
+ |
|
| 506 |
+=cut |
|
| 507 |
+ |
|
| 508 |
+sub output_html {
|
|
| 509 |
+ my ($fh, $data, $type) = @_; |
|
| 510 |
+ |
|
| 511 |
+ print $fh <<'EOF'; |
|
| 512 |
+<!DOCTYPE html> |
|
| 513 |
+<html> |
|
| 514 |
+<head> |
|
| 515 |
+ <title>autoSMART Report</title> |
|
| 516 |
+ <style> |
|
| 517 |
+ body { font-family: Arial, sans-serif; margin: 20px; }
|
|
| 518 |
+ table { border-collapse: collapse; width: 100%; }
|
|
| 519 |
+ th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
|
|
| 520 |
+ th { background-color: #f2f2f2; }
|
|
| 521 |
+ .critical { color: #d32f2f; font-weight: bold; }
|
|
| 522 |
+ .high { color: #f57c00; font-weight: bold; }
|
|
| 523 |
+ .moderate { color: #fbc02d; }
|
|
| 524 |
+ .low { color: #388e3c; }
|
|
| 525 |
+ </style> |
|
| 526 |
+</head> |
|
| 527 |
+<body> |
|
| 528 |
+EOF |
|
| 529 |
+ |
|
| 530 |
+ print $fh "<h1>autoSMART Report - " . ucfirst($type) . "</h1>\n"; |
|
| 531 |
+ print $fh "<p>Generated: " . strftime("%Y-%m-%d %H:%M:%S", localtime($data->{generated_at})) . "</p>\n";
|
|
| 532 |
+ |
|
| 533 |
+ # Basic HTML output - could be expanded significantly |
|
| 534 |
+ print $fh "<pre>" . encode_json($data) . "</pre>\n"; |
|
| 535 |
+ |
|
| 536 |
+ print $fh "</body></html>\n"; |
|
| 537 |
+} |
|
| 538 |
+ |
|
| 539 |
+# Additional text output functions would go here... |
|
| 540 |
+sub output_detailed_text {
|
|
| 541 |
+ my ($fh, $drives) = @_; |
|
| 542 |
+ # Implementation for detailed drive output |
|
| 543 |
+ print $fh "DETAILED DRIVE INFORMATION\n"; |
|
| 544 |
+ print $fh "-"x40 . "\n"; |
|
| 545 |
+ foreach my $drive (@$drives) {
|
|
| 546 |
+ print $fh "Device: $drive->{device_path}\n";
|
|
| 547 |
+ print $fh "Model: " . ($drive->{model_name} || 'Unknown') . "\n";
|
|
| 548 |
+ print $fh "Serial: " . ($drive->{serial_number} || 'Unknown') . "\n";
|
|
| 549 |
+ print $fh "Status: $drive->{status}\n";
|
|
| 550 |
+ if ($drive->{latest_prediction}) {
|
|
| 551 |
+ print $fh "Latest Risk: $drive->{latest_prediction}->{risk_level}\n";
|
|
| 552 |
+ } |
|
| 553 |
+ print $fh "\n"; |
|
| 554 |
+ } |
|
| 555 |
+} |
|
| 556 |
+ |
|
| 557 |
+sub output_health_text {
|
|
| 558 |
+ my ($fh, $health) = @_; |
|
| 559 |
+ print $fh "DRIVE HEALTH OVERVIEW\n"; |
|
| 560 |
+ print $fh "-"x40 . "\n"; |
|
| 561 |
+ foreach my $drive (@$health) {
|
|
| 562 |
+ print $fh "$drive->{device_path}: $drive->{status}";
|
|
| 563 |
+ print $fh " (Risk: $drive->{risk_level})" if $drive->{risk_level};
|
|
| 564 |
+ print $fh "\n"; |
|
| 565 |
+ } |
|
| 566 |
+} |
|
| 567 |
+ |
|
| 568 |
+sub output_alerts_text {
|
|
| 569 |
+ my ($fh, $alerts) = @_; |
|
| 570 |
+ print $fh "ALERT HISTORY\n"; |
|
| 571 |
+ print $fh "-"x40 . "\n"; |
|
| 572 |
+ foreach my $alert (@$alerts) {
|
|
| 573 |
+ printf $fh "%s [%s] %s: %s\n", |
|
| 574 |
+ $alert->{sent_at},
|
|
| 575 |
+ $alert->{alert_type},
|
|
| 576 |
+ $alert->{device_path},
|
|
| 577 |
+ $alert->{message} || '';
|
|
| 578 |
+ } |
|
| 579 |
+} |
|
| 580 |
+ |
|
| 581 |
+sub output_trends_text {
|
|
| 582 |
+ my ($fh, $trends) = @_; |
|
| 583 |
+ print $fh "TREND ANALYSIS\n"; |
|
| 584 |
+ print $fh "-"x40 . "\n"; |
|
| 585 |
+ foreach my $device (sort keys %$trends) {
|
|
| 586 |
+ print $fh "Device: $device\n"; |
|
| 587 |
+ foreach my $trend (@{$trends->{$device}}) {
|
|
| 588 |
+ print $fh " $trend->{date}: Temp $trend->{avg_temp}°C ($trend->{reading_count} readings)\n";
|
|
| 589 |
+ } |
|
| 590 |
+ print $fh "\n"; |
|
| 591 |
+ } |
|
| 592 |
+} |
|
| 593 |
+ |
|
| 594 |
+=head2 print_help |
|
| 595 |
+ |
|
| 596 |
+Display help information |
|
| 597 |
+ |
|
| 598 |
+=cut |
|
| 599 |
+ |
|
| 600 |
+sub print_help {
|
|
| 601 |
+ print <<'EOF'; |
|
| 602 |
+autoSMART Report Generator v1.0 |
|
| 603 |
+ |
|
| 604 |
+USAGE: |
|
| 605 |
+ autosmart-report.pl [OPTIONS] |
|
| 606 |
+ |
|
| 607 |
+OPTIONS: |
|
| 608 |
+ --config-dir DIR Configuration directory (default: /etc/autosmart) |
|
| 609 |
+ --report TYPE Report type (default: summary) |
|
| 610 |
+ summary - System overview and statistics |
|
| 611 |
+ detailed - Detailed drive information |
|
| 612 |
+ health - Current health status of all drives |
|
| 613 |
+ alerts - Alert history |
|
| 614 |
+ trends - Trend analysis |
|
| 615 |
+ --device PATH Generate report for specific device only |
|
| 616 |
+ --days N Days of history to include (default: 30) |
|
| 617 |
+ --format FORMAT Output format: text, html, json (default: text) |
|
| 618 |
+ --output FILE Write to file instead of stdout |
|
| 619 |
+ --help Show this help message |
|
| 620 |
+ |
|
| 621 |
+EXAMPLES: |
|
| 622 |
+ # System summary |
|
| 623 |
+ autosmart-report.pl --report summary |
|
| 624 |
+ |
|
| 625 |
+ # Detailed report for specific drive |
|
| 626 |
+ autosmart-report.pl --report detailed --device /dev/sda |
|
| 627 |
+ |
|
| 628 |
+ # Health status as HTML |
|
| 629 |
+ autosmart-report.pl --report health --format html --output health.html |
|
| 630 |
+ |
|
| 631 |
+ # Alert history for last week |
|
| 632 |
+ autosmart-report.pl --report alerts --days 7 |
|
| 633 |
+ |
|
| 634 |
+ # Trend analysis as JSON |
|
| 635 |
+ autosmart-report.pl --report trends --format json |
|
| 636 |
+ |
|
| 637 |
+REPORT TYPES: |
|
| 638 |
+ summary High-level system statistics and overview |
|
| 639 |
+ detailed Comprehensive information about each drive |
|
| 640 |
+ health Current health status summary |
|
| 641 |
+ alerts Recent alerts and notifications |
|
| 642 |
+ trends Temperature and performance trends |
|
| 643 |
+ |
|
| 644 |
+OUTPUT FORMATS: |
|
| 645 |
+ text Human-readable text format (default) |
|
| 646 |
+ html HTML report with basic styling |
|
| 647 |
+ json Machine-readable JSON format |
|
| 648 |
+ |
|
| 649 |
+EOF |
|
| 650 |
+} |
|
| 651 |
+ |
|
| 652 |
+__END__ |
|
| 653 |
+ |
|
| 654 |
+=head1 AUTHOR |
|
| 655 |
+ |
|
| 656 |
+AutoSMART Development Team |
|
| 657 |
+ |
|
| 658 |
+=head1 LICENSE |
|
| 659 |
+ |
|
| 660 |
+This software is part of the autoSMART project. |
|
| 661 |
+ |
|
| 662 |
+=cut |
|
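The options documented in the help text above (`--report`, `--format`, `--output`) lend themselves to scheduled runs that write one file per report. The following is a minimal sketch under stated assumptions: `build_report_args` is a hypothetical helper that is not part of this change, and the `<type>-<stamp>.<ext>` file naming scheme is invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the arguments for one report run, one per line.
# The flags and format names come from the script's help text; the output
# file naming scheme (<type>-<stamp>.<ext>) is an assumption of this sketch.
build_report_args() {
    local type="$1" format="$2" outdir="$3" stamp="$4" ext

    # Map each supported output format to a file extension.
    case "$format" in
        text) ext="txt" ;;
        html) ext="html" ;;
        json) ext="json" ;;
        *)
            printf 'unsupported format: %s\n' "$format" >&2
            return 1
            ;;
    esac

    printf '%s\n' --report "$type" --format "$format" \
        --output "$outdir/$type-$stamp.$ext"
}
```

In bash, the output could be collected with `mapfile -t args < <(build_report_args health html /var/reports "$(date +%F)")` and passed on as `autosmart-report.pl "${args[@]}"`; the `/var/reports` directory is likewise only an example.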
@@ -0,0 +1,98 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Production Deployment Script |
|
| 4 |
+# Version: 1.0 |
|
| 5 |
+# Description: Deploy autoSMART system to Proxmox cluster |
|
| 6 |
+ |
|
| 7 |
+set -e |
|
| 8 |
+ |
|
| 9 |
+# Configuration |
|
| 10 |
+DB_HOST="192.168.2.102" |
|
| 11 |
+DB_USER="autosmart" |
|
| 12 |
+DB_PASS="autoSMART2025!" |
|
| 13 |
+DB_NAME="autosmart" |
|
| 14 |
+ |
|
| 15 |
+CLUSTER_JSON="$(dirname "$0")/../cluster.json" |
|
| 16 |
+NODES=() |
|
| 17 |
+NODE_IPS=() |
|
| 18 |
+if [[ -f "$CLUSTER_JSON" ]] && command -v jq &> /dev/null; then |
|
| 19 |
+ while IFS= read -r node; do |
|
| 20 |
+ NODES+=("$(echo "$node" | jq -r '.hostname')")
|
|
| 21 |
+ NODE_IPS+=("$(echo "$node" | jq -r '.ip')")
|
|
| 22 |
+ done < <(jq -c '.cluster.nodes[]' "$CLUSTER_JSON") |
|
| 23 |
+fi |
|
| 24 |
+DEPLOY_DIR="/opt/autoSMART" |
|
| 25 |
+CONFIG_DIR="/etc/pve/autoSMART" |
|
| 26 |
+ |
|
| 27 |
+echo "🚀 autoSMART Production Deployment" |
|
| 28 |
+echo "==================================" |
|
| 29 |
+ |
|
| 30 |
+for idx in "${!NODES[@]}"; do
|
|
| 31 |
+ NODE="${NODES[$idx]}"
|
|
| 32 |
+ NODE_IP="${NODE_IPS[$idx]}"
|
|
| 33 |
+ echo "" |
|
| 34 |
+ echo "🔧 Deploying to node: $NODE ($NODE_IP)" |
|
| 35 |
+ echo "------------------------" |
|
| 36 |
+ |
|
| 37 |
+ # Create directories |
|
| 38 |
+ ssh root@$NODE_IP "mkdir -p $DEPLOY_DIR $CONFIG_DIR" |
|
| 39 |
+ |
|
| 40 |
+ # Copy files |
|
| 41 |
+ scp -r /tmp/autoSMART-deploy/* root@$NODE_IP:$DEPLOY_DIR/ |
|
| 42 |
+ |
|
| 43 |
+ # Install Perl dependencies |
|
| 44 |
+ ssh root@$NODE_IP " |
|
| 45 |
+ apt-get update -qq |
|
| 46 |
+ apt-get install -y libdbi-perl libdbd-pg-perl libjson-perl libfile-slurp-perl smartmontools |
|
| 47 |
+ " |
|
| 48 |
+ |
|
| 49 |
+ # Make scripts executable |
|
| 50 |
+ ssh root@$NODE_IP "chmod +x $DEPLOY_DIR/scripts/*.sh $DEPLOY_DIR/scripts/*.pl" |
|
| 51 |
+ |
|
| 52 |
+ # Create node-specific configuration |
|
| 53 |
+ ssh root@$NODE_IP "cat > $CONFIG_DIR/cluster-$NODE.conf << EOF |
|
| 54 |
+# autoSMART Configuration for $NODE |
|
| 55 |
+ExecStart=$DEPLOY_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE.conf |
|
| 56 |
+Restart=always |
|
| 57 |
+RestartSec=30 |
|
| 58 |
+User=root |
|
| 59 |
+ |
|
| 60 |
+[Install] |
|
| 61 |
+WantedBy=multi-user.target |
|
| 62 |
+EOF" |
|
| 63 |
+ |
|
| 64 |
+ # Enable service (but don't start yet) |
|
| 65 |
+ ssh root@$NODE_IP "systemctl daemon-reload && systemctl enable autosmart" |
|
| 66 |
+ |
|
| 67 |
+ echo "✅ Node $NODE deployment complete" |
|
| 68 |
+done |
|
| 69 |
+ |
|
| 70 |
+# Test database connectivity |
|
| 71 |
+ |
|
| 72 |
+ # Install systemd service |
|
| 73 |
+ ssh root@$NODE_IP "cat > /etc/systemd/system/autosmart.service << EOF |
|
| 74 |
+echo "" |
|
| 75 |
+echo "🔍 Testing database connectivity..." |
|
| 76 |
+PGPASSWORD="$DB_PASS" psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c " |
|
| 77 |
+SELECT |
|
| 78 |
+ COUNT(*) as total_drives, |
|
| 79 |
+ COUNT(DISTINCT current_node_id) as active_nodes |
|
| 80 |
+FROM hdd_inventory; |
|
| 81 |
+" |
|
| 82 |
+ |
|
| 83 |
+echo "" |
|
| 84 |
+echo "🎉 Production deployment complete!" |
|
| 85 |
+echo "" |
|
| 86 |
+echo "To start services on all nodes:" |
|
| 87 |
+echo " for node in ebony ivory obsidian; do ssh root@\$node 'systemctl start autosmart'; done" |
|
| 88 |
+echo "" |
|
| 89 |
+echo "To monitor services:" |
|
| 90 |
+echo " for node in ebony ivory obsidian; do echo \"=== \$node ===\"; ssh root@\$node 'systemctl status autosmart'; done" |
|
| 91 |
+echo "" |
|
| 92 |
+echo "Database monitoring:" |
|
| 93 |
+echo " PGPASSWORD='$DB_PASS' psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c 'SELECT * FROM storage_efficiency_stats;'" |
|
| 94 |
+ |
|
| 95 |
+# Cleanup |
|
| 96 |
+rm -rf /tmp/autoSMART-deploy |
|
| 97 |
+ |
|
| 98 |
+echo "✅ Deployment script complete!" |
|
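The deployment loop above builds its SSH targets by string interpolation, which makes it easy to mix up a node's hostname and its IP address. One possible guard is sketched below, assuming a hypothetical `ssh_target` helper that is not part of the deployment script. It accepts only four numeric octets in the range 0-255 and fails loudly otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the deployment script): build the SSH
# target for a node from its IP address. Anything that is not four numeric
# octets in the range 0-255 is rejected, so a hostname cannot be spliced
# into an address by mistake.
ssh_target() {
    local ip="$1" octet old_ifs

    # Only digits and single dots are allowed, with no leading or trailing dot.
    case "$ip" in
        ''|*[!0-9.]*|.*|*.|*..*)
            printf 'invalid node IP: %s\n' "$ip" >&2
            return 1
            ;;
    esac

    # Split on dots into the positional parameters.
    old_ifs="$IFS"
    IFS=.
    set -- $ip
    IFS="$old_ifs"

    if [ "$#" -ne 4 ]; then
        printf 'invalid node IP: %s\n' "$ip" >&2
        return 1
    fi
    for octet in "$@"; do
        if [ "${#octet}" -gt 3 ] || [ "$octet" -gt 255 ]; then
            printf 'invalid node IP: %s\n' "$ip" >&2
            return 1
        fi
    done

    printf 'root@%s\n' "$ip"
}
```

Inside the loop this could be used as `target="$(ssh_target "$NODE_IP")" || exit 1`, followed by `ssh "$target" ...`, so that a malformed entry in `cluster.json` stops the run before any remote command is issued.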
@@ -0,0 +1,755 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Node Installation Script |
|
| 4 |
+# Version: 1.0 |
|
| 5 |
+# Description: Install autoSMART on target nodes (Linux systems only) |
|
| 6 |
+# Note: This script is called by deploy.sh and should run on target nodes |
|
| 7 |
+ |
|
| 8 |
+set -e |
|
| 9 |
+ |
|
| 10 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" |
|
| 15 |
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" |
|
| 16 |
+INSTALL_DIR="/opt/autoSMART" |
|
| 17 |
+CONFIG_DIR="/etc/autosmart" |
|
| 18 |
+SERVICE_NAME="autosmart" |
|
| 19 |
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
|
|
| 20 |
+ |
|
| 21 |
+# Default configuration (can be overridden by command line) |
|
| 22 |
+DB_HOST="${DB_HOST:-192.168.2.102}"
|
|
| 23 |
+DB_USER="${DB_USER:-autosmart}"
|
|
| 24 |
+DB_PASS="${DB_PASS:-autoSMART2025!}"
|
|
| 25 |
+DB_NAME="${DB_NAME:-autosmart}"
|
|
| 26 |
+ |
|
| 27 |
+# Node configuration |
|
| 28 |
+NODE_ID="${NODE_ID:-$(hostname -s)}"
|
|
| 29 |
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
|
|
| 30 |
+FULL_SCAN_INTERVAL="${FULL_SCAN_INTERVAL:-3600}"
|
|
| 31 |
+ |
|
| 32 |
+# Operation modes |
|
| 33 |
+UNINSTALL=false |
|
| 34 |
+FORCE_REINSTALL=false |
|
| 35 |
+CONFIG_ONLY=false |
|
| 36 |
+ |
|
| 37 |
+# Colors for output |
|
| 38 |
+RED='\033[0;31m' |
|
| 39 |
+GREEN='\033[0;32m' |
|
| 40 |
+YELLOW='\033[1;33m' |
|
| 41 |
+BLUE='\033[0;34m' |
|
| 42 |
+NC='\033[0m' # No Color |
|
| 43 |
+ |
|
| 44 |
+log_info() {
|
|
| 45 |
+ echo -e "${BLUE}[INFO]${NC} $1"
|
|
| 46 |
+} |
|
| 47 |
+ |
|
| 48 |
+log_success() {
|
|
| 49 |
+ echo -e "${GREEN}[SUCCESS]${NC} $1"
|
|
| 50 |
+} |
|
| 51 |
+ |
|
| 52 |
+log_warning() {
|
|
| 53 |
+ echo -e "${YELLOW}[WARNING]${NC} $1"
|
|
| 54 |
+} |
|
| 55 |
+ |
|
| 56 |
+log_error() {
|
|
| 57 |
+ echo -e "${RED}[ERROR]${NC} $1"
|
|
| 58 |
+} |
|
| 59 |
+ |
|
| 60 |
+show_usage() {
|
|
| 61 |
+ echo "autoSMART Node Installation Script v1.0" |
|
| 62 |
+ echo "========================================" |
|
| 63 |
+ echo "" |
|
| 64 |
+ echo "Usage: $0 [COMMAND] [OPTIONS]" |
|
| 65 |
+ echo "" |
|
| 66 |
+ echo "Commands:" |
|
| 67 |
+ echo " install Install autoSMART on current node (default)" |
|
| 68 |
+ echo " uninstall Remove autoSMART completely from current node" |
|
| 69 |
+ echo "" |
|
| 70 |
+ echo "Options:" |
|
| 71 |
+ echo " --help Show this help message" |
|
| 72 |
+ echo " --force-reinstall Clean installation (removes previous version)" |
|
| 73 |
+ echo " --config-only Only create/update configuration files" |
|
| 74 |
+ echo " --db-host HOST Database host (default: 192.168.2.102)" |
|
| 75 |
+ echo " --db-user USER Database user (default: autosmart)" |
|
| 76 |
+ echo " --db-pass PASS Database password (default: autoSMART2025!)" |
|
| 77 |
+ echo " --db-name NAME Database name (default: autosmart)" |
|
| 78 |
+ echo " --node-id ID Node identifier (default: hostname)" |
|
| 79 |
+ echo " --scan-interval SEC Scan interval in seconds (default: 300)" |
|
| 80 |
+ echo "" |
|
| 81 |
+ echo "Note: This script should be called by deploy.sh, not run directly." |
|
| 82 |
+ echo "For deployment from development machine, use: deploy.sh install <IP>" |
|
| 83 |
+ echo "" |
|
| 84 |
+} |
|
| 85 |
+ |
|
| 86 |
+parse_arguments() {
|
|
| 87 |
+ COMMAND="install" # Default command |
|
| 88 |
+ |
|
| 89 |
+ while [[ $# -gt 0 ]]; do |
|
| 90 |
+ case $1 in |
|
| 91 |
+ install|uninstall) |
|
| 92 |
+ COMMAND="$1" |
|
| 93 |
+ shift |
|
| 94 |
+ ;; |
|
| 95 |
+ --help) |
|
| 96 |
+ show_usage |
|
| 97 |
+ exit 0 |
|
| 98 |
+ ;; |
|
| 99 |
+ --force-reinstall) |
|
| 100 |
+ FORCE_REINSTALL=true |
|
| 101 |
+ shift |
|
| 102 |
+ ;; |
|
| 103 |
+ --config-only) |
|
| 104 |
+ CONFIG_ONLY=true |
|
| 105 |
+ shift |
|
| 106 |
+ ;; |
|
| 107 |
+ --db-host) |
|
| 108 |
+ DB_HOST="$2" |
|
| 109 |
+ shift 2 |
|
| 110 |
+ ;; |
|
| 111 |
+ --db-user) |
|
| 112 |
+ DB_USER="$2" |
|
| 113 |
+ shift 2 |
|
| 114 |
+ ;; |
|
| 115 |
+ --db-pass) |
|
| 116 |
+ DB_PASS="$2" |
|
| 117 |
+ shift 2 |
|
| 118 |
+ ;; |
|
| 119 |
+ --db-name) |
|
| 120 |
+ DB_NAME="$2" |
|
| 121 |
+ shift 2 |
|
| 122 |
+ ;; |
|
| 123 |
+ --node-id) |
|
| 124 |
+ NODE_ID="$2" |
|
| 125 |
+ shift 2 |
|
| 126 |
+ ;; |
|
| 127 |
+ --scan-interval) |
|
| 128 |
+ SCAN_INTERVAL="$2" |
|
| 129 |
+ shift 2 |
|
| 130 |
+ ;; |
|
| 131 |
+ *) |
|
| 132 |
+ log_error "Unknown option: $1" |
|
| 133 |
+ show_usage |
|
| 134 |
+ exit 1 |
|
| 135 |
+ ;; |
|
| 136 |
+ esac |
|
| 137 |
+ done |
|
| 138 |
+} |
|
| 139 |
+ |
|
| 140 |
+show_header() {
|
|
| 141 |
+ log_info "🔧 autoSMART Node Installation v1.0" |
|
| 142 |
+ log_info "===================================" |
|
| 143 |
+ log_info "Installing on target node: $(hostname)" |
|
| 144 |
+ log_info "" |
|
| 145 |
+ log_info "Operation: $COMMAND" |
|
| 146 |
+ log_info "Node ID: $NODE_ID" |
|
| 147 |
+ log_info "Database: $DB_HOST:5432/$DB_NAME" |
|
| 148 |
+ if [[ "$COMMAND" == "install" ]]; then |
|
| 149 |
+ log_info "Install Directory: $INSTALL_DIR" |
|
| 150 |
+ log_info "Config Directory: $CONFIG_DIR" |
|
| 151 |
+ fi |
|
| 152 |
+ log_info "" |
|
| 153 |
+} |
|
| 154 |
+ |
|
| 155 |
+check_requirements() {
|
|
| 156 |
+ log_info "🔍 Checking system requirements..." |
|
| 157 |
+ |
|
| 158 |
+ # Check if running as root |
|
| 159 |
+ if [[ $EUID -ne 0 ]]; then |
|
| 160 |
+ log_error "This script must be run as root (use sudo)" |
|
| 161 |
+ exit 1 |
|
| 162 |
+ fi |
|
| 163 |
+ |
|
| 164 |
+ # Check if running on Linux |
|
| 165 |
+ if [[ "$(uname)" != "Linux" ]]; then |
|
| 166 |
+ log_error "autoSMART can only be installed on Linux systems" |
|
| 167 |
+ log_error "Current system: $(uname)" |
|
| 168 |
+ exit 1 |
|
| 169 |
+ fi |
|
| 170 |
+ |
|
| 171 |
+ # Check systemd |
|
| 172 |
+ if ! command -v systemctl &> /dev/null; then |
|
| 173 |
+ log_error "systemd is required but not found" |
|
| 174 |
+ exit 1 |
|
| 175 |
+ fi |
|
| 176 |
+ |
|
| 177 |
+ # Check and report dependency status |
|
| 178 |
+ if ! verify_dependencies >/dev/null 2>&1; then |
|
| 179 |
+ log_warning "Some dependencies are missing (will be installed automatically)" |
|
| 180 |
+ fi |
|
| 181 |
+ |
|
| 182 |
+ # Check available space |
|
| 183 |
+ AVAILABLE_SPACE=$(df / | tail -1 | awk '{print $4}')
|
|
| 184 |
+ if [[ $AVAILABLE_SPACE -lt 100000 ]]; then |
|
| 185 |
+ log_warning "Less than 100MB available space. Installation may fail." |
|
| 186 |
+ fi |
|
| 187 |
+ |
|
| 188 |
+ log_success "System requirements check passed" |
|
| 189 |
+} |
|
| 190 |
+ |
|
| 191 |
+handle_uninstall() {
|
|
| 192 |
+ log_info "🗑️ Uninstalling autoSMART..." |
|
| 193 |
+ |
|
| 194 |
+ # Stop and disable service |
|
| 195 |
+ if systemctl is-active --quiet autosmart; then |
|
| 196 |
+ systemctl stop autosmart |
|
| 197 |
+ fi |
|
| 198 |
+ if systemctl is-enabled --quiet autosmart; then |
|
| 199 |
+ systemctl disable autosmart |
|
| 200 |
+ fi |
|
| 201 |
+ |
|
| 202 |
+ # Remove service file |
|
| 203 |
+ if [[ -f "$SYSTEMD_SERVICE" ]]; then |
|
| 204 |
+ rm "$SYSTEMD_SERVICE" |
|
| 205 |
+ systemctl daemon-reload |
|
| 206 |
+ fi |
|
| 207 |
+ |
|
| 208 |
+ # Remove installation directory |
|
| 209 |
+ if [[ -d "$INSTALL_DIR" ]]; then |
|
| 210 |
+ rm -rf "$INSTALL_DIR" |
|
| 211 |
+ fi |
|
| 212 |
+ |
|
| 213 |
+ # Remove configuration directory |
|
| 214 |
+ if [[ -d "$CONFIG_DIR" ]]; then |
|
| 215 |
+ rm -rf "$CONFIG_DIR" |
|
| 216 |
+ fi |
|
| 217 |
+ |
|
| 218 |
+ # Remove log rotation |
|
| 219 |
+ if [[ -f "/etc/logrotate.d/autosmart" ]]; then |
|
| 220 |
+ rm "/etc/logrotate.d/autosmart" |
|
| 221 |
+ fi |
|
| 222 |
+ |
|
| 223 |
+ log_success "✅ autoSMART uninstalled successfully" |
|
| 224 |
+ return 0 # return instead of exit so --force-reinstall can continue with the install |
|
| 225 |
+} |
|
| 226 |
+ |
|
| 227 |
+# Function to check if a package is installed |
|
| 228 |
+check_package_installed() {
|
|
| 229 |
+ local package="$1" |
|
| 230 |
+ local package_manager="$2" |
|
| 231 |
+ |
|
| 232 |
+ case "$package_manager" in |
|
| 233 |
+ "apt-get") |
|
| 234 |
+ dpkg-query -W -f='${Status}' "$package" 2>/dev/null | grep -q "install ok installed" |
|
| 235 |
+ ;; |
|
| 236 |
+ "yum"|"dnf") |
|
| 237 |
+ rpm -q "$package" >/dev/null 2>&1 |
|
| 238 |
+ ;; |
|
| 239 |
+ "zypper") |
|
| 240 |
+ zypper se -i "$package" | grep -q "^i" 2>/dev/null |
|
| 241 |
+ ;; |
|
| 242 |
+ "pacman") |
|
| 243 |
+ pacman -Q "$package" >/dev/null 2>&1 |
|
| 244 |
+ ;; |
|
| 245 |
+ *) |
|
| 246 |
+ return 1 |
|
| 247 |
+ ;; |
|
| 248 |
+ esac |
|
| 249 |
+} |
|
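`check_package_installed` decides whether a package is present by matching its name against package-manager output. The sketch below illustrates why whole-line matching matters for that kind of check. It works on a plain list of names so it can run anywhere. The helper `package_in_list` is hypothetical, and feeding it from something like `rpm -qa --qf '%{NAME}\n'` is an assumption about how it might be wired in, not something the installer does.

```shell
#!/usr/bin/env bash
# Hypothetical helper, shown for illustration only.

# Read package names on stdin (one per line) and succeed only when
# "$1" appears as a complete line.
#   -x  match whole lines only
#   -F  treat the name as a literal string, not a regular expression
package_in_list() {
    local package="$1"
    grep -qxF -- "$package"
}

# Sample data standing in for real package-manager output.
installed=$'perl-DBI\nperl-JSON\nsmartmontools'

for name in perl perl-DBI smartmontools; do
    if printf '%s\n' "$installed" | package_in_list "$name"; then
        echo "$name: installed"
    else
        echo "$name: missing"
    fi
done
```

With a plain substring match, `perl` would be reported as installed here because `perl-DBI` contains it. The whole-line match reports it as missing.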
| 250 |
+ |
|
| 251 |
+# Function to verify all dependencies are installed |
|
| 252 |
+verify_dependencies() {
|
|
| 253 |
+ log_info "🔍 Verifying system dependencies..." |
|
| 254 |
+ |
|
| 255 |
+ local missing_packages=() |
|
| 256 |
+ local missing_perl_modules=() |
|
| 257 |
+ local package_manager="" |
|
| 258 |
+ |
|
| 259 |
+ # Detect package manager |
|
| 260 |
+ if command -v apt-get &> /dev/null; then |
|
| 261 |
+ package_manager="apt-get" |
|
| 262 |
+ elif command -v yum &> /dev/null; then |
|
| 263 |
+ package_manager="yum" |
|
| 264 |
+ elif command -v dnf &> /dev/null; then |
|
| 265 |
+ package_manager="dnf" |
|
| 266 |
+ elif command -v zypper &> /dev/null; then |
|
| 267 |
+ package_manager="zypper" |
|
| 268 |
+ elif command -v pacman &> /dev/null; then |
|
| 269 |
+ package_manager="pacman" |
|
| 270 |
+ else |
|
| 271 |
+ log_warning "Unknown package manager. Dependency verification limited." |
|
| 272 |
+ return 1 |
|
| 273 |
+ fi |
|
| 274 |
+ |
|
| 275 |
+ # Check system packages (including Perl modules from distribution) |
|
| 276 |
+ local system_packages=("perl" "smartmontools" "postgresql-client" "curl" "wget")
|
|
| 277 |
+ local perl_packages=() |
|
| 278 |
+ |
|
| 279 |
+ # Add Perl module packages based on package manager |
|
| 280 |
+ case "$package_manager" in |
|
| 281 |
+ "apt-get") |
|
| 282 |
+ perl_packages+=("libdbi-perl" "libdbd-pg-perl" "libjson-perl" "libfile-slurp-perl"
|
|
| 283 |
+ "libgetopt-long-descriptive-perl" "libconfig-simple-perl") |
|
| 284 |
+ ;; |
|
| 285 |
+ "yum"|"dnf") |
|
| 286 |
+ perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp"
|
|
| 287 |
+ "perl-Getopt-Long" "perl-Config-Simple") |
|
| 288 |
+ ;; |
|
| 289 |
+ "zypper") |
|
| 290 |
+ perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp"
|
|
| 291 |
+ "perl-Getopt-Long-Descriptive" "perl-Config-Simple") |
|
| 292 |
+ ;; |
|
| 293 |
+ "pacman") |
|
| 294 |
+ perl_packages+=("perl-dbi" "perl-dbd-pg" "perl-json" "perl-file-slurp")
|
|
| 295 |
+ ;; |
|
| 296 |
+ esac |
|
| 297 |
+ |
|
| 298 |
+ # Check system packages |
|
| 299 |
+ for package in "${system_packages[@]}"; do
|
|
| 300 |
+ if ! check_package_installed "$package" "$package_manager"; then |
|
| 301 |
+ missing_packages+=("$package")
|
|
| 302 |
+ fi |
|
| 303 |
+ done |
|
| 304 |
+ |
|
| 305 |
+ # Check Perl packages from distribution |
|
| 306 |
+ for package in "${perl_packages[@]}"; do
|
|
| 307 |
+ if ! check_package_installed "$package" "$package_manager"; then |
|
| 308 |
+ missing_packages+=("$package")
|
|
| 309 |
+ fi |
|
| 310 |
+ done |
|
| 311 |
+ |
|
| 312 |
+ # Report results |
|
| 313 |
+ if [[ ${#missing_packages[@]} -eq 0 ]]; then
|
|
| 314 |
+ log_success "✅ All dependencies are available" |
|
| 315 |
+ return 0 |
|
| 316 |
+ else |
|
| 317 |
+ log_warning "Missing dependencies detected:" |
|
| 318 |
+ if [[ ${#missing_packages[@]} -gt 0 ]]; then
|
|
| 319 |
+ log_warning " Missing packages: ${missing_packages[*]}"
|
|
| 320 |
+ fi |
|
| 321 |
+ return 1 |
|
| 322 |
+ fi |
|
| 323 |
+} |
|
| 324 |
+ |
|
| 325 |
+# Function to install dependencies from local packages (offline) |
|
| 326 |
+install_dependencies_offline() {
|
|
| 327 |
+ log_info "📦 Installing dependencies from local packages..." |
|
| 328 |
+ |
|
| 329 |
+ local packages_dir="$PROJECT_ROOT/packages" |
|
| 330 |
+ |
|
| 331 |
+ if [[ ! -d "$packages_dir" ]]; then |
|
| 332 |
+ log_warning "Local packages directory not found: $packages_dir" |
|
| 333 |
+ log_info "Falling back to online installation..." |
|
| 334 |
+ return 1 |
|
| 335 |
+ fi |
|
| 336 |
+ |
|
| 337 |
+ local package_manager="" |
|
| 338 |
+ if command -v apt-get &> /dev/null; then |
|
| 339 |
+ package_manager="apt-get" |
|
| 340 |
+ local deb_files=("$packages_dir"/*.deb)
|
|
| 341 |
+ if [[ -f "${deb_files[0]}" ]]; then
|
|
| 342 |
+ log_info "Installing .deb packages..." |
|
| 343 |
+ dpkg -i "$packages_dir"/*.deb 2>/dev/null || {
|
|
| 344 |
+ log_info "Fixing broken dependencies..." |
|
| 345 |
+ apt-get install -f -y >/dev/null 2>&1 |
|
| 346 |
+ } |
|
| 347 |
+ fi |
|
| 348 |
+ elif command -v yum &> /dev/null || command -v dnf &> /dev/null; then |
|
| 349 |
+ package_manager="yum" |
|
| 350 |
+ local rpm_files=("$packages_dir"/*.rpm)
|
|
| 351 |
+ if [[ -f "${rpm_files[0]}" ]]; then
|
|
| 352 |
+ log_info "Installing .rpm packages..." |
|
| 353 |
+ if command -v dnf &> /dev/null; then |
|
| 354 |
+ dnf install -y "$packages_dir"/*.rpm >/dev/null 2>&1 |
|
| 355 |
+ else |
|
| 356 |
+ yum localinstall -y "$packages_dir"/*.rpm >/dev/null 2>&1 |
|
| 357 |
+ fi |
|
| 358 |
+ fi |
|
| 359 |
+ fi |
|
| 360 |
+ |
|
| 361 |
+ # Verify installation |
|
| 362 |
+ if verify_dependencies >/dev/null 2>&1; then |
|
| 363 |
+ log_success "✅ Offline dependencies installed successfully" |
|
| 364 |
+ return 0 |
|
| 365 |
+ else |
|
| 366 |
+ log_warning "Offline installation incomplete" |
|
| 367 |
+ return 1 |
|
| 368 |
+ fi |
|
| 369 |
+} |
|
| 370 |
+ |
|
| 371 |
+# Enhanced dependency installation with offline support |
|
| 372 |
+install_dependencies() {
|
|
| 373 |
+ log_info "📦 Installing system dependencies..." |
|
| 374 |
+ |
|
| 375 |
+ # First try to verify if dependencies are already installed |
|
| 376 |
+ if verify_dependencies >/dev/null 2>&1; then |
|
| 377 |
+ log_success "All dependencies already installed" |
|
| 378 |
+ return 0 |
|
| 379 |
+ fi |
|
| 380 |
+ |
|
| 381 |
+ # If offline mode is enabled, only try offline installation |
|
| 382 |
+ if [[ "$OFFLINE_MODE" == true ]]; then |
|
| 383 |
+ log_info "Offline mode enabled - using local packages only" |
|
| 384 |
+ if install_dependencies_offline; then |
|
| 385 |
+ return 0 |
|
| 386 |
+ else |
|
| 387 |
+ log_error "Offline installation failed and online installation is disabled" |
|
| 388 |
+ log_error "Please check packages directory: $PROJECT_ROOT/packages" |
|
| 389 |
+ exit 1 |
|
| 390 |
+ fi |
|
| 391 |
+ fi |
|
| 392 |
+ |
|
| 393 |
+ # Try offline installation first |
|
| 394 |
+ if install_dependencies_offline; then |
|
| 395 |
+ return 0 |
|
| 396 |
+ fi |
|
| 397 |
+ |
|
| 398 |
+ # Fall back to online installation |
|
| 399 |
+ log_info "Attempting online installation..." |
|
| 400 |
+ |
|
| 401 |
+ if command -v apt-get &> /dev/null; then |
|
| 402 |
+ # Debian/Ubuntu |
|
| 403 |
+ apt-get update -qq |
|
| 404 |
+ PACKAGES=( |
|
| 405 |
+ "perl" |
|
| 406 |
+ "libdbi-perl" |
|
| 407 |
+ "libdbd-pg-perl" |
|
| 408 |
+ "libjson-perl" |
|
| 409 |
+ "libfile-slurp-perl" |
|
| 410 |
+ "libgetopt-long-descriptive-perl" |
|
| 411 |
+ "libconfig-simple-perl" |
|
| 412 |
+ "smartmontools" |
|
| 413 |
+ "postgresql-client" |
|
| 414 |
+ "curl" |
|
| 415 |
+ "wget" |
|
| 416 |
+ ) |
|
| 417 |
+ |
|
| 418 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 419 |
+ if ! check_package_installed "$package" "apt-get"; then |
|
| 420 |
+ log_info "Installing $package..." |
|
| 421 |
+ apt-get install -y "$package" >/dev/null 2>&1 |
|
| 422 |
+ fi |
|
| 423 |
+ done |
|
| 424 |
+ |
|
| 425 |
+ elif command -v yum &> /dev/null; then |
|
| 426 |
+ # RHEL/CentOS |
|
| 427 |
+ yum makecache -q # refresh metadata only; avoid upgrading every installed package |
|
| 428 |
+ PACKAGES=( |
|
| 429 |
+ "perl" |
|
| 430 |
+ "perl-DBI" |
|
| 431 |
+ "perl-DBD-Pg" |
|
| 432 |
+ "perl-JSON" |
|
| 433 |
+ "perl-File-Slurp" |
|
| 434 |
+ "perl-Getopt-Long" |
|
| 435 |
+ "perl-Config-Simple" |
|
| 436 |
+ "smartmontools" |
|
| 437 |
+ "postgresql" |
|
| 438 |
+ "curl" |
|
| 439 |
+ "wget" |
|
| 440 |
+ ) |
|
| 441 |
+ |
|
| 442 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 443 |
+ if ! check_package_installed "$package" "yum"; then |
|
| 444 |
+ log_info "Installing $package..." |
|
| 445 |
+ yum install -y "$package" >/dev/null 2>&1 |
|
| 446 |
+ fi |
|
| 447 |
+ done |
|
| 448 |
+ |
|
| 449 |
+ else |
|
| 450 |
+ log_error "Unsupported package manager. Please install dependencies manually." |
|
| 451 |
+ exit 1 |
|
| 452 |
+ fi |
|
| 453 |
+ |
|
| 454 |
+ log_success "Dependencies installed" |
|
| 455 |
+} |
|
| 456 |
+ |
|
| 457 |
+create_directories() {
|
|
| 458 |
+ log_info "📁 Creating directory structure..." |
|
| 459 |
+ |
|
| 460 |
+ # Create main directories |
|
| 461 |
+ mkdir -p "$INSTALL_DIR"/{scripts,lib,config,docs}
|
|
| 462 |
+ mkdir -p "$CONFIG_DIR" |
|
| 463 |
+ |
|
| 464 |
+ # Set permissions |
|
| 465 |
+ chmod 755 "$INSTALL_DIR" |
|
| 466 |
+ chmod 755 "$CONFIG_DIR" |
|
| 467 |
+ |
|
| 468 |
+ log_success "Directories created" |
|
| 469 |
+} |
|
| 470 |
+ |
|
| 471 |
+copy_files() {
|
|
| 472 |
+ log_info "📋 Copying autoSMART files..." |
|
| 473 |
+ |
|
| 474 |
+ # Copy scripts |
|
| 475 |
+ if [[ -d "$PROJECT_ROOT/scripts" ]]; then |
|
| 476 |
+ cp -r "$PROJECT_ROOT/scripts"/* "$INSTALL_DIR/scripts/" |
|
| 477 |
+ chmod +x "$INSTALL_DIR/scripts"/*.sh 2>/dev/null || true |
|
| 478 |
+ chmod +x "$INSTALL_DIR/scripts"/*.pl 2>/dev/null || true |
|
| 479 |
+ fi |
|
| 480 |
+ |
|
| 481 |
+ # Copy libraries |
|
| 482 |
+ if [[ -d "$PROJECT_ROOT/lib" ]]; then |
|
| 483 |
+ cp -r "$PROJECT_ROOT/lib"/* "$INSTALL_DIR/lib/" |
|
| 484 |
+ fi |
|
| 485 |
+ |
|
| 486 |
+ # Copy documentation |
|
| 487 |
+ if [[ -d "$PROJECT_ROOT/docs" ]]; then |
|
| 488 |
+ cp -r "$PROJECT_ROOT/docs"/* "$INSTALL_DIR/docs/" |
|
| 489 |
+ fi |
|
| 490 |
+ |
|
| 491 |
+ # Copy SQL files |
|
| 492 |
+ if [[ -d "$PROJECT_ROOT/sql" ]]; then |
|
| 493 |
+ cp -r "$PROJECT_ROOT/sql" "$INSTALL_DIR/" |
|
| 494 |
+ fi |
|
| 495 |
+ |
|
| 496 |
+ log_success "Files copied" |
|
| 497 |
+} |
|
| 498 |
+ |
|
| 499 |
+create_configuration() {
|
|
| 500 |
+ log_info "⚙️ Creating configuration files..." |
|
| 501 |
+ |
|
| 502 |
+ # Main configuration file |
|
| 503 |
+ cat > "$CONFIG_DIR/autosmart.conf" << EOF |
|
| 504 |
+# autoSMART Configuration File |
|
| 505 |
+# Generated on $(date) |
|
| 506 |
+ |
|
| 507 |
+[database] |
|
| 508 |
+host = $DB_HOST |
|
| 509 |
+port = 5432 |
|
| 510 |
+user = $DB_USER |
|
| 511 |
+password = $DB_PASS |
|
| 512 |
+database = $DB_NAME |
|
| 513 |
+timeout = 30 |
|
| 514 |
+ |
|
| 515 |
+[node] |
|
| 516 |
+id = $NODE_ID |
|
| 517 |
+scan_interval = $SCAN_INTERVAL |
|
| 518 |
+full_scan_interval = $FULL_SCAN_INTERVAL |
|
| 519 |
+store_unchanged = false |
|
| 520 |
+max_retries = 3 |
|
| 521 |
+ |
|
| 522 |
+[collection] |
|
| 523 |
+temperature_threshold = 5 |
|
| 524 |
+parameter_changes_only = true |
|
| 525 |
+enable_predictive_analysis = true |
|
| 526 |
+health_check_interval = 86400 |
|
| 527 |
+ |
|
| 528 |
+[logging] |
|
| 529 |
+level = INFO |
|
| 530 |
+max_size = 10M |
|
| 531 |
+rotate_count = 5 |
|
| 532 |
+syslog = true |
|
| 533 |
+ |
|
| 534 |
+[alerts] |
|
| 535 |
+enable = true |
|
| 536 |
+temperature_critical = 60 |
|
| 537 |
+reallocated_sectors_warning = 1 |
|
| 538 |
+pending_sectors_critical = 5 |
|
| 539 |
+EOF |
|
| 540 |
+ |
|
| 541 |
+ # YAML format configuration for Perl daemon |
|
| 542 |
+ cat > "$CONFIG_DIR/cluster-$NODE_ID.conf" << EOF |
|
| 543 |
+# autoSMART YAML Configuration for $NODE_ID |
|
| 544 |
+database: |
|
| 545 |
+ host: $DB_HOST |
|
| 546 |
+ port: 5432 |
|
| 547 |
+ user: $DB_USER |
|
| 548 |
+ password: $DB_PASS |
|
| 549 |
+ database: $DB_NAME |
|
| 550 |
+ |
|
| 551 |
+node: |
|
| 552 |
+ id: $NODE_ID |
|
| 553 |
+ scan_interval: $SCAN_INTERVAL |
|
| 554 |
+ store_unchanged: false |
|
| 555 |
+ |
|
| 556 |
+collection: |
|
| 557 |
+ temperature_threshold: 5 |
|
| 558 |
+ parameter_changes_only: true |
|
| 559 |
+ full_scan_interval: $FULL_SCAN_INTERVAL |
|
| 560 |
+EOF |
|
| 561 |
+ |
|
| 562 |
+ # Set secure permissions on config files |
|
| 563 |
+ chmod 600 "$CONFIG_DIR"/*.conf |
|
| 564 |
+ |
|
| 565 |
+ log_success "Configuration created" |
|
| 566 |
+} |
|
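`create_configuration` writes files that contain the database password and then restricts them with `chmod 600`. A minimal sketch of creating such a file with restrictive permissions from the start is shown below. The helper `write_private_file` and the demo path are hypothetical. Note that a umask only affects newly created files, so an existing file keeps whatever mode it already has.

```shell
#!/usr/bin/env bash
# Hypothetical helper, shown for illustration only.

# Copy stdin into "$1", creating the file with mode 600.
# The subshell keeps the tightened umask from leaking into the caller.
write_private_file() {
    local target="$1"
    ( umask 077 && cat > "$target" )
}

demo_dir="$(mktemp -d)"
printf 'password = example\n' | write_private_file "$demo_dir/demo.conf"
ls -l "$demo_dir/demo.conf"
```

A heredoc can be piped into the helper the same way, for example `write_private_file "$CONFIG_DIR/autosmart.conf" << EOF`, which avoids the short window in which the file exists with default permissions.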
| 567 |
+ |
|
| 568 |
+create_systemd_service() {
|
|
| 569 |
+ log_info "🔧 Creating systemd service..." |
|
| 570 |
+ |
|
| 571 |
+ cat > "$SYSTEMD_SERVICE" << EOF |
|
| 572 |
+[Unit] |
|
| 573 |
+Description=autoSMART SMART Data Collector |
|
| 574 |
+Documentation=file://$INSTALL_DIR/docs/README.md |
|
| 575 |
+After=network.target postgresql.service |
|
| 576 |
+Wants=postgresql.service |
|
| 577 |
+ |
|
| 578 |
+[Service] |
|
| 579 |
+Type=simple |
|
| 580 |
+ExecStart=$INSTALL_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE_ID.conf --foreground |
|
| 581 |
+ExecReload=/bin/kill -HUP \$MAINPID |
|
| 582 |
+KillMode=process |
|
| 583 |
+Restart=always |
|
| 584 |
+RestartSec=30 |
|
| 585 |
+User=root |
|
| 586 |
+Group=root |
|
| 587 |
+ |
|
| 588 |
+# Security settings |
|
| 589 |
+NoNewPrivileges=true |
|
| 590 |
+ProtectSystem=strict |
|
| 591 |
+ProtectHome=true |
|
| 592 |
+ReadWritePaths=$CONFIG_DIR |
|
| 593 |
+PrivateTmp=true |
|
| 594 |
+ |
|
| 595 |
+# Resource limits |
|
| 596 |
+LimitNOFILE=1024 |
|
| 597 |
+MemoryMax=100M |
|
| 598 |
+CPUQuota=10% |
|
| 599 |
+ |
|
| 600 |
+# Logging |
|
| 601 |
+StandardOutput=journal |
|
| 602 |
+StandardError=journal |
|
| 603 |
+SyslogIdentifier=autosmart |
|
| 604 |
+ |
|
| 605 |
+[Install] |
|
| 606 |
+WantedBy=multi-user.target |
|
| 607 |
+EOF |
|
| 608 |
+ |
|
| 609 |
+ # Reload systemd |
|
| 610 |
+ systemctl daemon-reload |
|
| 611 |
+ |
|
| 612 |
+ log_success "Systemd service created" |
|
| 613 |
+} |
|
| 614 |
+ |
|
| 615 |
+test_database_connection() {
|
|
| 616 |
+ log_info "🔗 Testing database connection..." |
|
| 617 |
+ |
|
| 618 |
+ # Test connection using psql |
|
| 619 |
+ if command -v psql &> /dev/null; then |
|
| 620 |
+ if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version();" >/dev/null 2>&1; then |
|
| 621 |
+ log_success "Database connection successful" |
|
| 622 |
+ else |
|
| 623 |
+ log_warning "Database connection failed. Service may not start correctly." |
|
| 624 |
+ log_info "Please ensure:" |
|
| 625 |
+ log_info " • PostgreSQL server is running on $DB_HOST" |
|
| 626 |
+ log_info " • Database '$DB_NAME' exists" |
|
| 627 |
+ log_info " • User '$DB_USER' has proper permissions" |
|
| 628 |
+ fi |
|
| 629 |
+ else |
|
| 630 |
+ log_warning "psql not found. Cannot test database connection." |
|
| 631 |
+ fi |
|
| 632 |
+} |
|
| 633 |
+ |
|
| 634 |
+test_smart_detection() {
|
|
| 635 |
+ log_info "🔍 Testing SMART device detection..." |
|
| 636 |
+ |
|
| 637 |
+ DEVICES_FOUND=0 |
|
| 638 |
+ for device in /dev/sd? /dev/nvme?n?; do |
|
| 639 |
+ if [[ -b "$device" ]] && smartctl -i "$device" >/dev/null 2>&1; then |
|
| 640 |
+ MODEL=$(smartctl -i "$device" | grep "Device Model\|Model Number" | head -1 | cut -d: -f2 | xargs) |
|
| 641 |
+ if [[ -n "$MODEL" ]]; then |
|
| 642 |
+ log_info " Found: $device - $MODEL" |
|
| 643 |
+ DEVICES_FOUND=$((DEVICES_FOUND + 1)) # ((var++)) returns non-zero when var is 0, which aborts under set -e |
|
| 644 |
+ fi |
|
| 645 |
+ fi |
|
| 646 |
+ done |
|
| 647 |
+ |
|
| 648 |
+ if [[ $DEVICES_FOUND -gt 0 ]]; then |
|
| 649 |
+ log_success "Detected $DEVICES_FOUND SMART-capable devices" |
|
| 650 |
+ else |
|
| 651 |
+ log_warning "No SMART-capable devices detected" |
|
| 652 |
+ fi |
|
| 653 |
+} |
|
| 654 |
+ |
|
| 655 |
+finalize_installation() {
|
|
| 656 |
+ log_info "🎯 Finalizing installation..." |
|
| 657 |
+ |
|
| 658 |
+ # Enable service (but don't start yet) |
|
| 659 |
+ systemctl enable "$SERVICE_NAME" |
|
| 660 |
+ |
|
| 661 |
+ # Create log rotation |
|
| 662 |
+ cat > "/etc/logrotate.d/autosmart" << EOF |
|
| 663 |
+/var/log/autosmart/*.log {
|
|
| 664 |
+ daily |
|
| 665 |
+ rotate 7 |
|
| 666 |
+ compress |
|
| 667 |
+ delaycompress |
|
| 668 |
+ missingok |
|
| 669 |
+ notifempty |
|
| 670 |
+ postrotate |
|
| 671 |
+ systemctl reload-or-restart autosmart |
|
| 672 |
+ endscript |
|
| 673 |
+} |
|
| 674 |
+EOF |
|
| 675 |
+ |
|
| 676 |
+ log_success "Installation finalized" |
|
| 677 |
+} |
|
| 678 |
+ |
|
| 679 |
+show_completion_message() {
|
|
| 680 |
+ log_success "✅ autoSMART installation completed successfully!" |
|
| 681 |
+ log_info "" |
|
| 682 |
+ log_info "📋 Installation Summary:" |
|
| 683 |
+ log_info " • Install Directory: $INSTALL_DIR" |
|
| 684 |
+ log_info " • Config Directory: $CONFIG_DIR" |
|
| 685 |
+ log_info " • Service Name: $SERVICE_NAME" |
|
| 686 |
+ log_info " • Node ID: $NODE_ID" |
|
| 687 |
+ log_info "" |
|
| 688 |
+ log_info "🚀 Next Steps:" |
|
| 689 |
+ log_info " 1. Start the service:" |
|
| 690 |
+ log_info " systemctl start $SERVICE_NAME" |
|
| 691 |
+ log_info "" |
|
| 692 |
+ log_info " 2. Check service status:" |
|
| 693 |
+ log_info " systemctl status $SERVICE_NAME" |
|
| 694 |
+ log_info "" |
|
| 695 |
+ log_info " 3. View logs:" |
|
| 696 |
+ log_info " journalctl -u $SERVICE_NAME -f" |
|
| 697 |
+ log_info "" |
|
| 698 |
+ log_info "📖 Documentation: $INSTALL_DIR/docs/README.md" |
|
| 699 |
+ log_info "⚙️ Configuration: $CONFIG_DIR/autosmart.conf" |
|
| 700 |
+ log_info "" |
|
| 701 |
+ log_info "🎉 autoSMART is ready to monitor your storage devices!" |
|
| 702 |
+} |
|
| 703 |
+ |
|
| 704 |
+# Main execution |
|
| 705 |
+main() {
|
|
| 706 |
+ parse_arguments "$@" |
|
| 707 |
+ show_header |
|
| 708 |
+ |
|
| 709 |
+ case "$COMMAND" in |
|
| 710 |
+ uninstall) |
|
| 711 |
+ handle_uninstall |
|
| 712 |
+ ;; |
|
| 713 |
+ install) |
|
| 714 |
+ check_requirements |
|
| 715 |
+ |
|
| 716 |
+ # Handle force reinstall |
|
| 717 |
+ if [[ "$FORCE_REINSTALL" == true ]]; then |
|
| 718 |
+ log_info "🗑️ Force reinstall: cleaning previous installation..." |
|
| 719 |
+ handle_uninstall 2>/dev/null || true |
|
| 720 |
+ sleep 2 |
|
| 721 |
+ fi |
|
| 722 |
+ |
|
| 723 |
+ # Handle config-only mode |
|
| 724 |
+ if [[ "$CONFIG_ONLY" == true ]]; then |
|
| 725 |
+ log_info "⚙️ Configuration-only mode" |
|
| 726 |
+ if [[ ! -d "$INSTALL_DIR" ]]; then |
|
| 727 |
+ log_error "autoSMART is not installed. Run full installation first." |
|
| 728 |
+ exit 1 |
|
| 729 |
+ fi |
|
| 730 |
+ create_configuration |
|
| 731 |
+ log_success "✅ Configuration updated successfully!" |
|
| 732 |
+ exit 0 |
|
| 733 |
+ fi |
|
| 734 |
+ |
|
| 735 |
+ # Full installation |
|
| 736 |
+ install_dependencies |
|
| 737 |
+ create_directories |
|
| 738 |
+ copy_files |
|
| 739 |
+ create_configuration |
|
| 740 |
+ create_systemd_service |
|
| 741 |
+ test_database_connection |
|
| 742 |
+ test_smart_detection |
|
| 743 |
+ finalize_installation |
|
| 744 |
+ show_completion_message |
|
| 745 |
+ ;; |
|
| 746 |
+ *) |
|
| 747 |
+ log_error "Unknown command: $COMMAND" |
|
| 748 |
+ show_usage |
|
| 749 |
+ exit 1 |
|
| 750 |
+ ;; |
|
| 751 |
+ esac |
|
| 752 |
+} |
|
| 753 |
+ |
|
| 754 |
+# Run main function |
|
| 755 |
+main "$@" |
|
@@ -0,0 +1,789 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Node Installation Script |
|
| 4 |
+# Version: 1.0 |
|
| 5 |
+# Description: Install autoSMART on target nodes (Linux systems only) |
|
| 6 |
+# Note: This script is called by deploy.sh and should run on target nodes |
|
| 7 |
+ |
|
| 8 |
+set -e |
|
| 9 |
+ |
|
| 10 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 11 |
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" |
|
| 12 |
+INSTALL_DIR="/opt/autoSMART" |
|
| 13 |
+CONFIG_DIR="/etc/autosmart" |
|
| 14 |
+SERVICE_NAME="autosmart" |
|
| 15 |
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
|
|
| 16 |
+ |
|
| 17 |
+# Default configuration (can be overridden by command line) |
|
| 18 |
+DB_HOST="${DB_HOST:-192.168.2.102}"
|
|
| 19 |
+DB_USER="${DB_USER:-autosmart}"
|
|
| 20 |
+DB_PASS="${DB_PASS:-autoSMART2025!}"
|
|
| 21 |
+DB_NAME="${DB_NAME:-autosmart}"
|
|
| 22 |
+ |
|
| 23 |
+# Node configuration |
|
| 24 |
+NODE_ID="${NODE_ID:-$(hostname -s)}"
|
|
| 25 |
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
|
|
| 26 |
+FULL_SCAN_INTERVAL="${FULL_SCAN_INTERVAL:-3600}"
|
|
| 27 |
+ |
|
| 28 |
+# Operation modes |
|
| 29 |
+UNINSTALL=false |
|
| 30 |
+FORCE_REINSTALL=false |
|
| 31 |
+CONFIG_ONLY=false |
|
| 32 |
+ |
|
| 33 |
+# Colors for output |
|
| 34 |
+RED='\033[0;31m' |
|
| 35 |
+GREEN='\033[0;32m' |
|
| 36 |
+YELLOW='\033[1;33m' |
|
| 37 |
+BLUE='\033[0;34m' |
|
| 38 |
+NC='\033[0m' # No Color |
|
| 39 |
+ |
|
| 40 |
+log_info() {
|
|
| 41 |
+ echo -e "${BLUE}[INFO]${NC} $1"
|
|
| 42 |
+} |
|
| 43 |
+ |
|
| 44 |
+log_success() {
|
|
| 45 |
+ echo -e "${GREEN}[SUCCESS]${NC} $1"
|
|
| 46 |
+} |
|
| 47 |
+ |
|
| 48 |
+log_warning() {
|
|
| 49 |
+ echo -e "${YELLOW}[WARNING]${NC} $1"
|
|
| 50 |
+} |
|
| 51 |
+ |
|
| 52 |
+log_error() {
|
|
| 53 |
+ echo -e "${RED}[ERROR]${NC} $1"
|
|
| 54 |
+} |
|
| 55 |
+ |
|
| 56 |
+show_usage() {
|
|
| 57 |
+ echo "autoSMART Node Installation Script v1.0" |
|
| 58 |
+ echo "========================================" |
|
| 59 |
+ echo "" |
|
| 60 |
+ echo "Usage: $0 [COMMAND] [OPTIONS]" |
|
| 61 |
+ echo "" |
|
| 62 |
+ echo "Commands:" |
|
| 63 |
+ echo " install Install autoSMART on current node (default)" |
|
| 64 |
+ echo " uninstall Remove autoSMART completely from current node" |
|
| 65 |
+ echo "" |
|
| 66 |
+ echo "Options:" |
|
| 67 |
+ echo " --help Show this help message" |
|
| 68 |
+ echo " --force-reinstall Clean installation (removes previous version)" |
|
| 69 |
+ echo " --config-only Only create/update configuration files" |
|
| 70 |
+ echo " --db-host HOST Database host (default: 192.168.2.102)" |
|
| 71 |
+ echo " --db-user USER Database user (default: autosmart)" |
|
| 72 |
+ echo " --db-pass PASS Database password (default: autoSMART2025!)" |
|
| 73 |
+ echo " --db-name NAME Database name (default: autosmart)" |
|
| 74 |
+ echo " --node-id ID Node identifier (default: hostname)" |
|
| 75 |
+ echo " --scan-interval SEC Scan interval in seconds (default: 300)" |
|
| 76 |
+ echo "" |
|
| 77 |
+ echo "Note: This script should be called by deploy.sh, not run directly." |
|
| 78 |
+ echo "For deployment from development machine, use: deploy.sh install <IP>" |
|
| 79 |
+ echo "" |
|
| 80 |
+} |
|
| 81 |
+ |
|
| 82 |
+parse_arguments() {
|
|
| 83 |
+ COMMAND="install" # Default command |
|
| 84 |
+ |
|
| 85 |
+ while [[ $# -gt 0 ]]; do |
|
| 86 |
+ case $1 in |
|
| 87 |
+ install|uninstall) |
|
| 88 |
+ COMMAND="$1" |
|
| 89 |
+ shift |
|
| 90 |
+ ;; |
|
| 91 |
+ --help) |
|
| 92 |
+ show_usage |
|
| 93 |
+ exit 0 |
|
| 94 |
+ ;; |
|
| 95 |
+ --force-reinstall) |
|
| 96 |
+ FORCE_REINSTALL=true |
|
| 97 |
+ shift |
|
| 98 |
+ ;; |
|
| 99 |
+ --config-only) |
|
| 100 |
+ CONFIG_ONLY=true |
|
| 101 |
+ shift |
|
| 102 |
+ ;; |
|
| 103 |
+ --db-host) |
|
| 104 |
+ DB_HOST="$2" |
|
| 105 |
+ shift 2 |
|
| 106 |
+ ;; |
|
| 107 |
+ --db-user) |
|
| 108 |
+ DB_USER="$2" |
|
| 109 |
+ shift 2 |
|
| 110 |
+ ;; |
|
| 111 |
+ --db-pass) |
|
| 112 |
+ DB_PASS="$2" |
|
| 113 |
+ shift 2 |
|
| 114 |
+ ;; |
|
| 115 |
+ --db-name) |
|
| 116 |
+ DB_NAME="$2" |
|
| 117 |
+ shift 2 |
|
| 118 |
+ ;; |
|
| 119 |
+ --node-id) |
|
| 120 |
+ NODE_ID="$2" |
|
| 121 |
+ shift 2 |
|
| 122 |
+ ;; |
|
| 123 |
+ --scan-interval) |
|
| 124 |
+ SCAN_INTERVAL="$2" |
|
| 125 |
+ shift 2 |
|
| 126 |
+ ;; |
|
| 127 |
+ *) |
|
| 128 |
+ log_error "Unknown option: $1" |
|
| 129 |
+ show_usage |
|
| 130 |
+ exit 1 |
|
| 131 |
+ ;; |
|
| 132 |
+ esac |
|
| 133 |
+ done |
|
| 134 |
+} |
|
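Each value-taking option in `parse_arguments` reads `$2` and then runs `shift 2`. If the option is the last word on the command line, `$2` is empty and `shift 2` fails, which under `set -e` ends the script without a message. A minimal sketch of a guard is shown below. The helper name `require_value` is hypothetical and not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper, shown for illustration only.

# Succeed when an option still has a value after it.
#   $1  the option name, used in the error message
#   $2  the number of remaining arguments (pass "$#")
require_value() {
    local option="$1" remaining="$2"
    if (( remaining < 2 )); then
        echo "Option $option requires a value" >&2
        return 1
    fi
}

# Example: parse a single value-taking option from a fixed argument list.
set -- --db-host db.example.internal
if require_value "$1" "$#"; then
    DB_HOST="$2"
    shift 2
    echo "DB_HOST=$DB_HOST"
fi
```

Inside the `case` statement this could be used as `require_value "$1" "$#" || { show_usage; exit 1; }` before the assignment, so a missing value produces a usage message rather than a silent exit.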
| 135 |
+ |
|
| 136 |
+show_header() {
|
|
| 137 |
+ log_info "🔧 autoSMART Node Installation v1.0" |
|
| 138 |
+ log_info "===================================" |
|
| 139 |
+ log_info "Installing on target node: $(hostname)" |
|
| 140 |
+ log_info "" |
|
| 141 |
+ log_info "Operation: $COMMAND" |
|
| 142 |
+ log_info "Node ID: $NODE_ID" |
|
| 143 |
+ log_info "Database: $DB_HOST:5432/$DB_NAME" |
|
| 144 |
+ if [[ "$COMMAND" == "install" ]]; then |
|
| 145 |
+ log_info "Install Directory: $INSTALL_DIR" |
|
| 146 |
+ log_info "Config Directory: $CONFIG_DIR" |
|
| 147 |
+ fi |
|
| 148 |
+ log_info "" |
|
| 149 |
+} |
|
| 150 |
+ |
|
| 151 |
+check_requirements() {
|
|
| 152 |
+ log_info "🔍 Checking system requirements..." |
|
| 153 |
+ |
|
| 154 |
+ # Check if running as root |
|
| 155 |
+ if [[ $EUID -ne 0 ]]; then |
|
| 156 |
+ log_error "This script must be run as root (use sudo)" |
|
| 157 |
+ exit 1 |
|
| 158 |
+ fi |
|
| 159 |
+ |
|
| 160 |
+ # Check if running on Linux |
|
| 161 |
+ if [[ "$(uname)" != "Linux" ]]; then |
|
| 162 |
+ log_error "autoSMART can only be installed on Linux systems" |
|
| 163 |
+ log_error "Current system: $(uname)" |
|
| 164 |
+ exit 1 |
|
| 165 |
+ fi |
|
| 166 |
+ |
|
| 167 |
+ # Check systemd |
|
| 168 |
+ if ! command -v systemctl &> /dev/null; then |
|
| 169 |
+ log_error "systemd is required but not found" |
|
| 170 |
+ exit 1 |
|
| 171 |
+ fi |
|
| 172 |
+ |
|
| 173 |
+ # Check and report dependency status |
|
| 174 |
+ if ! verify_dependencies >/dev/null 2>&1; then |
|
| 175 |
+ log_warning "Some dependencies are missing (will be installed automatically)" |
|
| 176 |
+ fi |
|
| 177 |
+ |
|
| 178 |
+ # Check available space |
|
| 179 |
+ AVAILABLE_SPACE=$(df -Pk / | awk 'NR==2 {print $4}')
|
|
| 180 |
+ if [[ $AVAILABLE_SPACE -lt 100000 ]]; then |
|
| 181 |
+ log_warning "Less than 100MB available space. Installation may fail." |
|
| 182 |
+ fi |
|
| 183 |
+ |
|
| 184 |
+ log_success "System requirements check passed" |
|
| 185 |
+} |
|
| 186 |
+ |
|
| 187 |
+handle_uninstall() {
|
|
| 188 |
+ log_info "🗑️ Uninstalling autoSMART..." |
|
| 189 |
+ |
|
| 190 |
+ # Stop and disable service |
|
| 191 |
+ if systemctl is-active --quiet autosmart; then |
|
| 192 |
+ systemctl stop autosmart |
|
| 193 |
+ fi |
|
| 194 |
+ if systemctl is-enabled --quiet autosmart; then |
|
| 195 |
+ systemctl disable autosmart |
|
| 196 |
+ fi |
|
| 197 |
+ |
|
| 198 |
+ # Remove service file |
|
| 199 |
+ if [[ -f "$SYSTEMD_SERVICE" ]]; then |
|
| 200 |
+ rm "$SYSTEMD_SERVICE" |
|
| 201 |
+ systemctl daemon-reload |
|
| 202 |
+ fi |
|
| 203 |
+ |
|
| 204 |
+ # Remove installation directory |
|
| 205 |
+ if [[ -d "$INSTALL_DIR" ]]; then |
|
| 206 |
+ rm -rf "$INSTALL_DIR" |
|
| 207 |
+ fi |
|
| 208 |
+ |
|
| 209 |
+ # Remove configuration directory |
|
| 210 |
+ if [[ -d "$CONFIG_DIR" ]]; then |
|
| 211 |
+ rm -rf "$CONFIG_DIR" |
|
| 212 |
+ fi |
|
| 213 |
+ |
|
| 214 |
+ # Remove log rotation |
|
| 215 |
+ if [[ -f "/etc/logrotate.d/autosmart" ]]; then |
|
| 216 |
+ rm "/etc/logrotate.d/autosmart" |
|
| 217 |
+ fi |
|
| 218 |
+ |
|
| 219 |
+ log_success "✅ autoSMART uninstalled successfully" |
|
| 220 |
+ return 0 # return instead of exit so --force-reinstall can continue with the install |
|
| 221 |
+} |
|
| 222 |
+ |
|
| 223 |
+# Function to check if a package is installed |
|
| 224 |
+check_package_installed() {
|
|
| 225 |
+ local package="$1" |
|
| 226 |
+ local package_manager="$2" |
|
| 227 |
+ |
|
| 228 |
+ case "$package_manager" in |
|
| 229 |
+ "apt-get") |
|
| 230 |
+ dpkg-query -W -f='${Status}' "$package" 2>/dev/null | grep -q "install ok installed" |
|
| 231 |
+ ;; |
|
| 232 |
+ "yum"|"dnf") |
|
| 233 |
+ rpm -q "$package" >/dev/null 2>&1 |
|
| 234 |
+ ;; |
|
| 235 |
+ "zypper") |
|
| 236 |
+ zypper se -i "$package" | grep -q "^i" 2>/dev/null |
|
| 237 |
+ ;; |
|
| 238 |
+ "pacman") |
|
| 239 |
+ pacman -Q "$package" >/dev/null 2>&1 |
|
| 240 |
+ ;; |
|
| 241 |
+ *) |
|
| 242 |
+ return 1 |
|
| 243 |
+ ;; |
|
| 244 |
+ esac |
|
| 245 |
+} |
|
| 246 |
+ |
|
| 247 |
+# Function to verify all dependencies are installed |
|
| 248 |
+verify_dependencies() {
|
|
| 249 |
+ log_info "🔍 Verifying system dependencies..." |
|
| 250 |
+ |
|
| 251 |
+ local missing_packages=() |
|
| 252 |
+ local package_manager="" |
|
| 253 |
+ |
|
| 254 |
+ # Detect package manager |
|
| 255 |
+ if command -v apt-get &> /dev/null; then |
|
| 256 |
+ package_manager="apt-get" |
|
| 257 |
+ elif command -v yum &> /dev/null; then |
|
| 258 |
+ package_manager="yum" |
|
| 259 |
+ elif command -v dnf &> /dev/null; then |
|
| 260 |
+ package_manager="dnf" |
|
| 261 |
+ elif command -v zypper &> /dev/null; then |
|
| 262 |
+ package_manager="zypper" |
|
| 263 |
+ elif command -v pacman &> /dev/null; then |
|
| 264 |
+ package_manager="pacman" |
|
| 265 |
+ else |
|
| 266 |
+ log_warning "Unknown package manager. Dependency verification limited." |
|
| 267 |
+ return 1 |
|
| 268 |
+ fi |
|
| 269 |
+ |
|
| 270 |
+ # Check system packages (including Perl modules from distribution) |
|
| 271 |
+ local system_packages=("perl" "smartmontools" "postgresql-client" "curl" "wget")
|
|
| 272 |
+ local perl_packages=() |
|
| 273 |
+ |
|
| 274 |
+ # Add Perl module packages based on package manager |
|
| 275 |
+ case "$package_manager" in |
|
| 276 |
+ "apt-get") |
|
| 277 |
+ perl_packages+=("libdbi-perl" "libdbd-pg-perl" "libjson-perl" "libfile-slurp-perl"
|
|
| 278 |
+ "libgetopt-long-descriptive-perl" "libconfig-simple-perl") |
|
| 279 |
+ ;; |
|
| 280 |
+ "yum"|"dnf") |
|
| 281 |
+ perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp"
|
|
| 282 |
+ "perl-Getopt-Long" "perl-Config-Simple") |
|
| 283 |
+ ;; |
|
| 284 |
+ "zypper") |
|
| 285 |
+ perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp"
|
|
| 286 |
+ "perl-Getopt-Long-Descriptive" "perl-Config-Simple") |
|
| 287 |
+ ;; |
|
| 288 |
+ "pacman") |
|
| 289 |
+ perl_packages+=("perl-dbi" "perl-dbd-pg" "perl-json" "perl-file-slurp")
|
|
| 290 |
+ ;; |
|
| 291 |
+ esac |
|
| 292 |
+ |
|
| 293 |
+ # Check system packages |
|
| 294 |
+ for package in "${system_packages[@]}"; do
|
|
| 295 |
+ if ! check_package_installed "$package" "$package_manager"; then |
|
| 296 |
+ missing_packages+=("$package")
|
|
| 297 |
+ fi |
|
| 298 |
+ done |
|
| 299 |
+ |
|
| 300 |
+ # Check Perl packages from distribution |
|
| 301 |
+ for package in "${perl_packages[@]}"; do
|
|
| 302 |
+ if ! check_package_installed "$package" "$package_manager"; then |
|
| 303 |
+ missing_packages+=("$package")
|
|
| 304 |
+ fi |
|
| 305 |
+ done |
|
| 306 |
+ |
|
| 307 |
+ # Report results |
|
| 308 |
+ if [[ ${#missing_packages[@]} -eq 0 ]]; then
|
|
| 309 |
+ log_success "✅ All dependencies are available" |
|
| 310 |
+ return 0 |
|
| 311 |
+ else |
|
| 312 |
+ log_warning "Missing dependencies detected:" |
|
| 313 |
+ if [[ ${#missing_packages[@]} -gt 0 ]]; then
|
|
| 314 |
+ log_warning " Missing packages: ${missing_packages[*]}"
|
|
| 315 |
+ fi |
|
| 316 |
+ return 1 |
|
| 317 |
+ fi |
|
| 318 |
+} |
|
| 319 |
+ |
|
| 320 |
+# Function to install dependencies |
|
| 321 |
+install_dependencies() {
|
|
| 322 |
+ log_info "📦 Installing system dependencies..." |
|
| 323 |
+ |
|
| 324 |
+ # First check if dependencies are already installed |
|
| 325 |
+ if verify_dependencies >/dev/null 2>&1; then |
|
| 326 |
+ log_success "All dependencies already installed" |
|
| 327 |
+ return 0 |
|
| 328 |
+ fi |
|
| 329 |
+ |
|
| 330 |
+ log_info "Installing missing dependencies..." |
|
| 331 |
+ |
|
| 332 |
+ if command -v apt-get &> /dev/null; then |
|
| 333 |
+ # Debian/Ubuntu |
|
| 334 |
+ log_info "Updating package lists..." |
|
| 335 |
+ apt-get update -qq |
|
| 336 |
+ |
|
| 337 |
+ PACKAGES=( |
|
| 338 |
+ "perl" |
|
| 339 |
+ "libdbi-perl" |
|
| 340 |
+ "libdbd-pg-perl" |
|
| 341 |
+ "libjson-perl" |
|
| 342 |
+ "libfile-slurp-perl" |
|
| 343 |
+ "libgetopt-long-descriptive-perl" |
|
| 344 |
+ "libconfig-simple-perl" |
|
| 345 |
+ "smartmontools" |
|
| 346 |
+ "postgresql-client" |
|
| 347 |
+ "curl" |
|
| 348 |
+ "wget" |
|
| 349 |
+ ) |
|
| 350 |
+ |
|
| 351 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 352 |
+ if ! check_package_installed "$package" "apt-get"; then |
|
| 353 |
+ log_info "Installing $package..." |
|
| 354 |
+ if ! apt-get install -y "$package" >/dev/null 2>&1; then |
|
| 355 |
+ log_error "Failed to install $package" |
|
| 356 |
+ exit 1 |
|
| 357 |
+ fi |
|
| 358 |
+ fi |
|
| 359 |
+ done |
|
| 360 |
+ |
|
| 361 |
+ elif command -v dnf &> /dev/null; then |
|
| 362 |
+ # Fedora/RHEL 8+ |
|
| 363 |
+ log_info "Updating package lists..." |
|
| 364 |
+ dnf makecache -q
|
| 365 |
+ |
|
| 366 |
+ PACKAGES=( |
|
| 367 |
+ "perl" |
|
| 368 |
+ "perl-DBI" |
|
| 369 |
+ "perl-DBD-Pg" |
|
| 370 |
+ "perl-JSON" |
|
| 371 |
+ "perl-File-Slurp" |
|
| 372 |
+ "perl-Getopt-Long" |
|
| 373 |
+ "perl-Config-Simple" |
|
| 374 |
+ "smartmontools" |
|
| 375 |
+ "postgresql" |
|
| 376 |
+ "curl" |
|
| 377 |
+ "wget" |
|
| 378 |
+ ) |
|
| 379 |
+ |
|
| 380 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 381 |
+ if ! check_package_installed "$package" "dnf"; then |
|
| 382 |
+ log_info "Installing $package..." |
|
| 383 |
+ if ! dnf install -y "$package" >/dev/null 2>&1; then |
|
| 384 |
+ log_error "Failed to install $package" |
|
| 385 |
+ exit 1 |
|
| 386 |
+ fi |
|
| 387 |
+ fi |
|
| 388 |
+ done |
|
| 389 |
+ |
|
| 390 |
+ elif command -v yum &> /dev/null; then |
|
| 391 |
+ # RHEL/CentOS 7 |
|
| 392 |
+ log_info "Updating package lists..." |
|
| 393 |
+ yum makecache -q
|
| 394 |
+ |
|
| 395 |
+ PACKAGES=( |
|
| 396 |
+ "perl" |
|
| 397 |
+ "perl-DBI" |
|
| 398 |
+ "perl-DBD-Pg" |
|
| 399 |
+ "perl-JSON" |
|
| 400 |
+ "perl-File-Slurp" |
|
| 401 |
+ "perl-Getopt-Long" |
|
| 402 |
+ "perl-Config-Simple" |
|
| 403 |
+ "smartmontools" |
|
| 404 |
+ "postgresql" |
|
| 405 |
+ "curl" |
|
| 406 |
+ "wget" |
|
| 407 |
+ ) |
|
| 408 |
+ |
|
| 409 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 410 |
+ if ! check_package_installed "$package" "yum"; then |
|
| 411 |
+ log_info "Installing $package..." |
|
| 412 |
+ if ! yum install -y "$package" >/dev/null 2>&1; then |
|
| 413 |
+ log_error "Failed to install $package" |
|
| 414 |
+ exit 1 |
|
| 415 |
+ fi |
|
| 416 |
+ fi |
|
| 417 |
+ done |
|
| 418 |
+ |
|
| 419 |
+ elif command -v zypper &> /dev/null; then |
|
| 420 |
+ # openSUSE |
|
| 421 |
+ log_info "Updating package lists..." |
|
| 422 |
+ zypper refresh -q |
|
| 423 |
+ |
|
| 424 |
+ PACKAGES=( |
|
| 425 |
+ "perl" |
|
| 426 |
+ "perl-DBI" |
|
| 427 |
+ "perl-DBD-Pg" |
|
| 428 |
+ "perl-JSON" |
|
| 429 |
+ "perl-File-Slurp" |
|
| 430 |
+ "perl-Getopt-Long-Descriptive" |
|
| 431 |
+ "perl-Config-Simple" |
|
| 432 |
+ "smartmontools" |
|
| 433 |
+ "postgresql" |
|
| 434 |
+ "curl" |
|
| 435 |
+ "wget" |
|
| 436 |
+ ) |
|
| 437 |
+ |
|
| 438 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 439 |
+ if ! check_package_installed "$package" "zypper"; then |
|
| 440 |
+ log_info "Installing $package..." |
|
| 441 |
+ if ! zypper install -y "$package" >/dev/null 2>&1; then |
|
| 442 |
+ log_error "Failed to install $package" |
|
| 443 |
+ exit 1 |
|
| 444 |
+ fi |
|
| 445 |
+ fi |
|
| 446 |
+ done |
|
| 447 |
+ |
|
| 448 |
+ elif command -v pacman &> /dev/null; then |
|
| 449 |
+ # Arch Linux |
|
| 450 |
+ log_info "Updating package lists..." |
|
| 451 |
+ pacman -Sy --noconfirm |
|
| 452 |
+ |
|
| 453 |
+ PACKAGES=( |
|
| 454 |
+ "perl" |
|
| 455 |
+ "perl-dbi" |
|
| 456 |
+ "perl-dbd-pg" |
|
| 457 |
+ "perl-json" |
|
| 458 |
+ "perl-file-slurp" |
|
| 459 |
+ "smartmontools" |
|
| 460 |
+ "postgresql" |
|
| 461 |
+ "curl" |
|
| 462 |
+ "wget" |
|
| 463 |
+ ) |
|
| 464 |
+ |
|
| 465 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 466 |
+ if ! check_package_installed "$package" "pacman"; then |
|
| 467 |
+ log_info "Installing $package..." |
|
| 468 |
+ if ! pacman -S --noconfirm "$package" >/dev/null 2>&1; then |
|
| 469 |
+ log_error "Failed to install $package" |
|
| 470 |
+ exit 1 |
|
| 471 |
+ fi |
|
| 472 |
+ fi |
|
| 473 |
+ done |
|
| 474 |
+ |
|
| 475 |
+ else |
|
| 476 |
+ log_error "Unsupported package manager. Please install dependencies manually:" |
|
| 477 |
+ log_error " - perl, smartmontools, postgresql-client, curl, wget" |
|
| 478 |
+ log_error " - Perl modules: DBI, DBD::Pg, JSON, File::Slurp, Getopt::Long, Config::Simple" |
|
| 479 |
+ exit 1 |
|
| 480 |
+ fi |
|
| 481 |
+ |
|
| 482 |
+ # Verify installation was successful |
|
| 483 |
+ if verify_dependencies >/dev/null 2>&1; then |
|
| 484 |
+ log_success "✅ All dependencies installed successfully" |
|
| 485 |
+ else |
|
| 486 |
+ log_error "Some dependencies may not have installed correctly" |
|
| 487 |
+ exit 1 |
|
| 488 |
+ fi |
|
| 489 |
+} |
|
| 490 |
+ |
|
| 491 |
+create_directories() {
|
|
| 492 |
+ log_info "📁 Creating directory structure..." |
|
| 493 |
+ |
|
| 494 |
+ # Create main directories |
|
| 495 |
+ mkdir -p "$INSTALL_DIR"/{scripts,lib,config,docs}
|
|
| 496 |
+ mkdir -p "$CONFIG_DIR" |
|
| 497 |
+ |
|
| 498 |
+ # Set permissions |
|
| 499 |
+ chmod 755 "$INSTALL_DIR" |
|
| 500 |
+ chmod 755 "$CONFIG_DIR" |
|
| 501 |
+ |
|
| 502 |
+ log_success "Directories created" |
|
| 503 |
+} |
|
| 504 |
+ |
|
| 505 |
+copy_files() {
|
|
| 506 |
+ log_info "📋 Copying autoSMART files..." |
|
| 507 |
+ |
|
| 508 |
+ # Copy scripts |
|
| 509 |
+ if [[ -d "$PROJECT_ROOT/scripts" ]]; then |
|
| 510 |
+ cp -r "$PROJECT_ROOT/scripts"/* "$INSTALL_DIR/scripts/" |
|
| 511 |
+ chmod +x "$INSTALL_DIR/scripts"/*.sh 2>/dev/null || true |
|
| 512 |
+ chmod +x "$INSTALL_DIR/scripts"/*.pl 2>/dev/null || true |
|
| 513 |
+ fi |
|
| 514 |
+ |
|
| 515 |
+ # Copy libraries |
|
| 516 |
+ if [[ -d "$PROJECT_ROOT/lib" ]]; then |
|
| 517 |
+ cp -r "$PROJECT_ROOT/lib"/* "$INSTALL_DIR/lib/" |
|
| 518 |
+ fi |
|
| 519 |
+ |
|
| 520 |
+ # Copy documentation |
|
| 521 |
+ if [[ -d "$PROJECT_ROOT/docs" ]]; then |
|
| 522 |
+ cp -r "$PROJECT_ROOT/docs"/* "$INSTALL_DIR/docs/" |
|
| 523 |
+ fi |
|
| 524 |
+ |
|
| 525 |
+ # Copy SQL files |
|
| 526 |
+ if [[ -d "$PROJECT_ROOT/sql" ]]; then |
|
| 527 |
+ cp -r "$PROJECT_ROOT/sql" "$INSTALL_DIR/" |
|
| 528 |
+ fi |
|
| 529 |
+ |
|
| 530 |
+ log_success "Files copied" |
|
| 531 |
+} |
|
| 532 |
+ |
|
| 533 |
+create_configuration() {
|
|
| 534 |
+ log_info "⚙️ Creating configuration files..." |
|
| 535 |
+ |
|
| 536 |
+ # Main configuration file |
|
| 537 |
+ cat > "$CONFIG_DIR/autosmart.conf" << EOF |
|
| 538 |
+# autoSMART Configuration File |
|
| 539 |
+# Generated on $(date) |
|
| 540 |
+ |
|
| 541 |
+[database] |
|
| 542 |
+host = $DB_HOST |
|
| 543 |
+port = 5432 |
|
| 544 |
+user = $DB_USER |
|
| 545 |
+password = $DB_PASS |
|
| 546 |
+database = $DB_NAME |
|
| 547 |
+timeout = 30 |
|
| 548 |
+ |
|
| 549 |
+[node] |
|
| 550 |
+id = $NODE_ID |
|
| 551 |
+scan_interval = $SCAN_INTERVAL |
|
| 552 |
+full_scan_interval = $FULL_SCAN_INTERVAL |
|
| 553 |
+store_unchanged = false |
|
| 554 |
+max_retries = 3 |
|
| 555 |
+ |
|
| 556 |
+[collection] |
|
| 557 |
+temperature_threshold = 5 |
|
| 558 |
+parameter_changes_only = true |
|
| 559 |
+enable_predictive_analysis = true |
|
| 560 |
+health_check_interval = 86400 |
|
| 561 |
+ |
|
| 562 |
+[logging] |
|
| 563 |
+level = INFO |
|
| 564 |
+max_size = 10M |
|
| 565 |
+rotate_count = 5 |
|
| 566 |
+syslog = true |
|
| 567 |
+ |
|
| 568 |
+[alerts] |
|
| 569 |
+enable = true |
|
| 570 |
+temperature_critical = 60 |
|
| 571 |
+reallocated_sectors_warning = 1 |
|
| 572 |
+pending_sectors_critical = 5 |
|
| 573 |
+EOF |
|
| 574 |
+ |
|
| 575 |
+ # YAML format configuration for Perl daemon |
|
| 576 |
+ cat > "$CONFIG_DIR/cluster-$NODE_ID.conf" << EOF |
|
| 577 |
+# autoSMART YAML Configuration for $NODE_ID |
|
| 578 |
+database: |
|
| 579 |
+ host: $DB_HOST |
|
| 580 |
+ port: 5432 |
|
| 581 |
+ user: $DB_USER |
|
| 582 |
+ password: $DB_PASS |
|
| 583 |
+ database: $DB_NAME |
|
| 584 |
+ |
|
| 585 |
+node: |
|
| 586 |
+ id: $NODE_ID |
|
| 587 |
+ scan_interval: $SCAN_INTERVAL |
|
| 588 |
+ store_unchanged: false |
|
| 589 |
+ |
|
| 590 |
+collection: |
|
| 591 |
+ temperature_threshold: 5 |
|
| 592 |
+ parameter_changes_only: true |
|
| 593 |
+ full_scan_interval: $FULL_SCAN_INTERVAL |
|
| 594 |
+EOF |
|
| 595 |
+ |
|
| 596 |
+ # Set secure permissions on config files |
|
| 597 |
+ chmod 600 "$CONFIG_DIR"/*.conf |
|
| 598 |
+ |
|
| 599 |
+ log_success "Configuration created" |
|
| 600 |
+} |
|
| 601 |
+ |
|
| 602 |
+create_systemd_service() {
|
|
| 603 |
+ log_info "🔧 Creating systemd service..." |
|
| 604 |
+ |
|
| 605 |
+ cat > "$SYSTEMD_SERVICE" << EOF |
|
| 606 |
+[Unit] |
|
| 607 |
+Description=autoSMART SMART Data Collector |
|
| 608 |
+Documentation=file://$INSTALL_DIR/docs/README.md |
|
| 609 |
+After=network.target postgresql.service |
|
| 610 |
+Wants=postgresql.service |
|
| 611 |
+ |
|
| 612 |
+[Service] |
|
| 613 |
+Type=simple |
|
| 614 |
+ExecStart=$INSTALL_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE_ID.conf --foreground |
|
| 615 |
+ExecReload=/bin/kill -HUP \$MAINPID |
|
| 616 |
+KillMode=process |
|
| 617 |
+Restart=always |
|
| 618 |
+RestartSec=30 |
|
| 619 |
+User=root |
|
| 620 |
+Group=root |
|
| 621 |
+ |
|
| 622 |
+# Security settings |
|
| 623 |
+NoNewPrivileges=true |
|
| 624 |
+ProtectSystem=strict |
|
| 625 |
+ProtectHome=true |
|
| 626 |
+ReadWritePaths=$CONFIG_DIR |
|
| 627 |
+PrivateTmp=true |
|
| 628 |
+ |
|
| 629 |
+# Resource limits |
|
| 630 |
+LimitNOFILE=1024 |
|
| 631 |
+MemoryMax=100M |
|
| 632 |
+CPUQuota=10% |
|
| 633 |
+ |
|
| 634 |
+# Logging |
|
| 635 |
+StandardOutput=journal |
|
| 636 |
+StandardError=journal |
|
| 637 |
+SyslogIdentifier=autosmart |
|
| 638 |
+ |
|
| 639 |
+[Install] |
|
| 640 |
+WantedBy=multi-user.target |
|
| 641 |
+EOF |
|
| 642 |
+ |
|
| 643 |
+ # Reload systemd |
|
| 644 |
+ systemctl daemon-reload |
|
| 645 |
+ |
|
| 646 |
+ log_success "Systemd service created" |
|
| 647 |
+} |
|
| 648 |
+ |
|
| 649 |
+test_database_connection() {
|
|
| 650 |
+ log_info "🔗 Testing database connection..." |
|
| 651 |
+ |
|
| 652 |
+ # Test connection using psql |
|
| 653 |
+ if command -v psql &> /dev/null; then |
|
| 654 |
+ if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version();" >/dev/null 2>&1; then |
|
| 655 |
+ log_success "Database connection successful" |
|
| 656 |
+ else |
|
| 657 |
+ log_warning "Database connection failed. Service may not start correctly." |
|
| 658 |
+ log_info "Please ensure:" |
|
| 659 |
+ log_info " • PostgreSQL server is running on $DB_HOST" |
|
| 660 |
+ log_info " • Database '$DB_NAME' exists" |
|
| 661 |
+ log_info " • User '$DB_USER' has proper permissions" |
|
| 662 |
+ fi |
|
| 663 |
+ else |
|
| 664 |
+ log_warning "psql not found. Cannot test database connection." |
|
| 665 |
+ fi |
|
| 666 |
+} |
|
| 667 |
+ |
|
| 668 |
+test_smart_detection() {
|
|
| 669 |
+ log_info "🔍 Testing SMART device detection..." |
|
| 670 |
+ |
|
| 671 |
+ DEVICES_FOUND=0 |
|
| 672 |
+ for device in /dev/sd? /dev/nvme?n?; do |
|
| 673 |
+ if [[ -b "$device" ]] && smartctl -i "$device" >/dev/null 2>&1; then |
|
| 674 |
+ MODEL=$(smartctl -i "$device" | grep "Device Model\|Model Number" | head -1 | cut -d: -f2 | xargs) |
|
| 675 |
+ if [[ -n "$MODEL" ]]; then |
|
| 676 |
+ log_info " Found: $device - $MODEL" |
|
| 677 |
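+ # Plain assignment: ((DEVICES_FOUND++)) returns status 1 while the counter is 0,
+ # which would abort a script running under set -e
+ DEVICES_FOUND=$((DEVICES_FOUND + 1))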
|
| 678 |
+ fi |
|
| 679 |
+ fi |
|
| 680 |
+ done |
|
| 681 |
+ |
|
| 682 |
+ if [[ $DEVICES_FOUND -gt 0 ]]; then |
|
| 683 |
+ log_success "Detected $DEVICES_FOUND SMART-capable devices" |
|
| 684 |
+ else |
|
| 685 |
+ log_warning "No SMART-capable devices detected" |
|
| 686 |
+ fi |
|
| 687 |
+} |
|
| 688 |
+ |
|
| 689 |
+finalize_installation() {
|
|
| 690 |
+ log_info "🎯 Finalizing installation..." |
|
| 691 |
+ |
|
| 692 |
+ # Enable service (but don't start yet) |
|
| 693 |
+ systemctl enable "$SERVICE_NAME" |
|
| 694 |
+ |
|
| 695 |
+ # Create log rotation |
|
| 696 |
+ cat > "/etc/logrotate.d/autosmart" << EOF |
|
| 697 |
+/var/log/autosmart/*.log {
|
|
| 698 |
+ daily |
|
| 699 |
+ rotate 7 |
|
| 700 |
+ compress |
|
| 701 |
+ delaycompress |
|
| 702 |
+ missingok |
|
| 703 |
+ notifempty |
|
| 704 |
+ postrotate |
|
| 705 |
+ systemctl reload-or-restart autosmart |
|
| 706 |
+ endscript |
|
| 707 |
+} |
|
| 708 |
+EOF |
|
| 709 |
+ |
|
| 710 |
+ log_success "Installation finalized" |
|
| 711 |
+} |
|
| 712 |
+ |
|
| 713 |
+show_completion_message() {
|
|
| 714 |
+ log_success "✅ autoSMART installation completed successfully!" |
|
| 715 |
+ log_info "" |
|
| 716 |
+ log_info "📋 Installation Summary:" |
|
| 717 |
+ log_info " • Install Directory: $INSTALL_DIR" |
|
| 718 |
+ log_info " • Config Directory: $CONFIG_DIR" |
|
| 719 |
+ log_info " • Service Name: $SERVICE_NAME" |
|
| 720 |
+ log_info " • Node ID: $NODE_ID" |
|
| 721 |
+ log_info "" |
|
| 722 |
+ log_info "🚀 Next Steps:" |
|
| 723 |
+ log_info " 1. Start the service:" |
|
| 724 |
+ log_info " systemctl start $SERVICE_NAME" |
|
| 725 |
+ log_info "" |
|
| 726 |
+ log_info " 2. Check service status:" |
|
| 727 |
+ log_info " systemctl status $SERVICE_NAME" |
|
| 728 |
+ log_info "" |
|
| 729 |
+ log_info " 3. View logs:" |
|
| 730 |
+ log_info " journalctl -u $SERVICE_NAME -f" |
|
| 731 |
+ log_info "" |
|
| 732 |
+ log_info "📖 Documentation: $INSTALL_DIR/docs/README.md" |
|
| 733 |
+ log_info "⚙️ Configuration: $CONFIG_DIR/autosmart.conf" |
|
| 734 |
+ log_info "" |
|
| 735 |
+ log_info "🎉 autoSMART is ready to monitor your storage devices!" |
|
| 736 |
+} |
|
| 737 |
+ |
|
| 738 |
+# Main execution |
|
| 739 |
+main() {
|
|
| 740 |
+ parse_arguments "$@" |
|
| 741 |
+ show_header |
|
| 742 |
+ |
|
| 743 |
+ case "$COMMAND" in |
|
| 744 |
+ uninstall) |
|
| 745 |
+ handle_uninstall |
|
| 746 |
+ ;; |
|
| 747 |
+ install) |
|
| 748 |
+ check_requirements |
|
| 749 |
+ |
|
| 750 |
+ # Handle force reinstall |
|
| 751 |
+ if [[ "$FORCE_REINSTALL" == true ]]; then |
|
| 752 |
+ log_info "🗑️ Force reinstall: cleaning previous installation..." |
|
| 753 |
+ # Run in a subshell so that an exit inside handle_uninstall does not end the reinstall
+ (handle_uninstall) 2>/dev/null || true
|
| 754 |
+ sleep 2 |
|
| 755 |
+ fi |
|
| 756 |
+ |
|
| 757 |
+ # Handle config-only mode |
|
| 758 |
+ if [[ "$CONFIG_ONLY" == true ]]; then |
|
| 759 |
+ log_info "⚙️ Configuration-only mode" |
|
| 760 |
+ if [[ ! -d "$INSTALL_DIR" ]]; then |
|
| 761 |
+ log_error "autoSMART is not installed. Run full installation first." |
|
| 762 |
+ exit 1 |
|
| 763 |
+ fi |
|
| 764 |
+ create_configuration |
|
| 765 |
+ log_success "✅ Configuration updated successfully!" |
|
| 766 |
+ exit 0 |
|
| 767 |
+ fi |
|
| 768 |
+ |
|
| 769 |
+ # Full installation |
|
| 770 |
+ install_dependencies |
|
| 771 |
+ create_directories |
|
| 772 |
+ copy_files |
|
| 773 |
+ create_configuration |
|
| 774 |
+ create_systemd_service |
|
| 775 |
+ test_database_connection |
|
| 776 |
+ test_smart_detection |
|
| 777 |
+ finalize_installation |
|
| 778 |
+ show_completion_message |
|
| 779 |
+ ;; |
|
| 780 |
+ *) |
|
| 781 |
+ log_error "Unknown command: $COMMAND" |
|
| 782 |
+ show_usage |
|
| 783 |
+ exit 1 |
|
| 784 |
+ ;; |
|
| 785 |
+ esac |
|
| 786 |
+} |
|
| 787 |
+ |
|
| 788 |
+# Run main function |
|
| 789 |
+main "$@" |
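Scripts that run under `set -e` need care with arithmetic commands: `((count++))` returns a non-zero status whenever the expression evaluates to 0, so the first increment of a zero counter can terminate the script. The following is a minimal sketch of a safer counting pattern, not code from the installer; `count_items` is a hypothetical helper introduced only for illustration.

```shell
#!/bin/bash
set -e

# Hypothetical helper (not part of the installer): prints how many
# arguments it received.
count_items() {
    local count=0
    local item
    for item in "$@"; do
        # ((count++)) would return status 1 while count is 0 and, under
        # set -e, end the script. A plain assignment always succeeds.
        count=$((count + 1))
    done
    echo "$count"
}

count_items /dev/sda /dev/sdb /dev/nvme0n1
```

The same assignment form works for any counter that starts at zero, which is why it is a common choice in scripts that enable `set -e`.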
|
@@ -0,0 +1,844 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Node Installation Script |
|
| 4 |
+# Version: 1.0 |
|
| 5 |
+# Description: Install autoSMART on target nodes (Linux systems only) |
|
| 6 |
+# Note: This script is called by deploy.sh and should run on target nodes |
|
| 7 |
+ |
|
| 8 |
+set -e |
|
| 9 |
+ |
|
| 10 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 11 |
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" |
|
| 12 |
+INSTALL_DIR="/opt/autoSMART" |
|
| 13 |
+CONFIG_DIR="/etc/autosmart" |
|
| 14 |
+SERVICE_NAME="autosmart" |
|
| 15 |
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
|
|
| 16 |
+ |
|
| 17 |
+# Default configuration (can be overridden by command line) |
|
| 18 |
+DB_HOST="${DB_HOST:-192.168.2.102}"
|
|
| 19 |
+DB_USER="${DB_USER:-autosmart}"
|
|
| 20 |
+DB_PASS="${DB_PASS:-autoSMART2025!}"
|
|
| 21 |
+DB_NAME="${DB_NAME:-autosmart}"
|
|
| 22 |
+ |
|
| 23 |
+# Node configuration |
|
| 24 |
+NODE_ID="${NODE_ID:-$(hostname -s)}"
|
|
| 25 |
+SCAN_INTERVAL="${SCAN_INTERVAL:-300}"
|
|
| 26 |
+FULL_SCAN_INTERVAL="${FULL_SCAN_INTERVAL:-3600}"
|
|
| 27 |
+ |
|
| 28 |
+# Operation modes |
|
| 29 |
+UNINSTALL=false |
|
| 30 |
+FORCE_REINSTALL=false |
|
| 31 |
+CONFIG_ONLY=false |
|
| 32 |
+ |
|
| 33 |
+# Colors for output |
|
| 34 |
+RED='\033[0;31m' |
|
| 35 |
+GREEN='\033[0;32m' |
|
| 36 |
+YELLOW='\033[1;33m' |
|
| 37 |
+BLUE='\033[0;34m' |
|
| 38 |
+NC='\033[0m' # No Color |
|
| 39 |
+ |
|
| 40 |
+log_info() {
|
|
| 41 |
+ echo -e "${BLUE}[INFO]${NC} $1"
|
|
| 42 |
+} |
|
| 43 |
+ |
|
| 44 |
+log_success() {
|
|
| 45 |
+ echo -e "${GREEN}[SUCCESS]${NC} $1"
|
|
| 46 |
+} |
|
| 47 |
+ |
|
| 48 |
+log_warning() {
|
|
| 49 |
+ echo -e "${YELLOW}[WARNING]${NC} $1"
|
|
| 50 |
+} |
|
| 51 |
+ |
|
| 52 |
+log_error() {
|
|
| 53 |
+ echo -e "${RED}[ERROR]${NC} $1"
|
|
| 54 |
+} |
|
| 55 |
+ |
|
| 56 |
+show_usage() {
|
|
| 57 |
+ echo "autoSMART Node Installation Script v1.0" |
|
| 58 |
+ echo "========================================" |
|
| 59 |
+ echo "" |
|
| 60 |
+ echo "Usage: $0 [COMMAND] [OPTIONS]" |
|
| 61 |
+ echo "" |
|
| 62 |
+ echo "Commands:" |
|
| 63 |
+ echo " install Install autoSMART on current node (default)" |
|
| 64 |
+ echo " uninstall Remove autoSMART completely from current node" |
|
| 65 |
+ echo "" |
|
| 66 |
+ echo "Options:" |
|
| 67 |
+ echo " --help Show this help message" |
|
| 68 |
+ echo " --force-reinstall Clean installation (removes previous version)" |
|
| 69 |
+ echo " --config-only Only create/update configuration files" |
|
| 70 |
+ echo " --db-host HOST Database host (default: 192.168.2.102)" |
|
| 71 |
+ echo " --db-user USER Database user (default: autosmart)" |
|
| 72 |
+ echo " --db-pass PASS Database password (default: autoSMART2025!)" |
|
| 73 |
+ echo " --db-name NAME Database name (default: autosmart)" |
|
| 74 |
+ echo " --node-id ID Node identifier (default: hostname)" |
|
| 75 |
+ echo " --scan-interval SEC Scan interval in seconds (default: 300)" |
|
| 76 |
+ echo "" |
|
| 77 |
+ echo "Note: This script should be called by deploy.sh, not run directly." |
|
| 78 |
+ echo "For deployment from development machine, use: deploy.sh install <IP>" |
|
| 79 |
+ echo "" |
|
| 80 |
+} |
|
| 81 |
+ |
|
| 82 |
+parse_arguments() {
|
|
| 83 |
+ COMMAND="install" # Default command |
|
| 84 |
+ |
|
| 85 |
+ while [[ $# -gt 0 ]]; do |
|
| 86 |
+ case $1 in |
|
| 87 |
+ install|uninstall) |
|
| 88 |
+ COMMAND="$1" |
|
| 89 |
+ shift |
|
| 90 |
+ ;; |
|
| 91 |
+ --help) |
|
| 92 |
+ show_usage |
|
| 93 |
+ exit 0 |
|
| 94 |
+ ;; |
|
| 95 |
+ --force-reinstall) |
|
| 96 |
+ FORCE_REINSTALL=true |
|
| 97 |
+ shift |
|
| 98 |
+ ;; |
|
| 99 |
+ --config-only) |
|
| 100 |
+ CONFIG_ONLY=true |
|
| 101 |
+ shift |
|
| 102 |
+ ;; |
|
| 103 |
+ --db-host) |
|
| 104 |
+ DB_HOST="$2" |
|
| 105 |
+ shift 2 |
|
| 106 |
+ ;; |
|
| 107 |
+ --db-user) |
|
| 108 |
+ DB_USER="$2" |
|
| 109 |
+ shift 2 |
|
| 110 |
+ ;; |
|
| 111 |
+ --db-pass) |
|
| 112 |
+ DB_PASS="$2" |
|
| 113 |
+ shift 2 |
|
| 114 |
+ ;; |
|
| 115 |
+ --db-name) |
|
| 116 |
+ DB_NAME="$2" |
|
| 117 |
+ shift 2 |
|
| 118 |
+ ;; |
|
| 119 |
+ --node-id) |
|
| 120 |
+ NODE_ID="$2" |
|
| 121 |
+ shift 2 |
|
| 122 |
+ ;; |
|
| 123 |
+ --scan-interval) |
|
| 124 |
+ SCAN_INTERVAL="$2" |
|
| 125 |
+ shift 2 |
|
| 126 |
+ ;; |
|
| 127 |
+ *) |
|
| 128 |
+ log_error "Unknown option: $1" |
|
| 129 |
+ show_usage |
|
| 130 |
+ exit 1 |
|
| 131 |
+ ;; |
|
| 132 |
+ esac |
|
| 133 |
+ done |
|
| 134 |
+} |
|
| 135 |
+ |
|
| 136 |
+show_header() {
|
|
| 137 |
+ log_info "🔧 autoSMART Node Installation v1.0" |
|
| 138 |
+ log_info "===================================" |
|
| 139 |
+ log_info "Installing on target node: $(hostname)" |
|
| 140 |
+ log_info "" |
|
| 141 |
+ log_info "Operation: $COMMAND" |
|
| 142 |
+ log_info "Node ID: $NODE_ID" |
|
| 143 |
+ log_info "Database: $DB_HOST:5432/$DB_NAME" |
|
| 144 |
+ if [[ "$COMMAND" == "install" ]]; then |
|
| 145 |
+ log_info "Install Directory: $INSTALL_DIR" |
|
| 146 |
+ log_info "Config Directory: $CONFIG_DIR" |
|
| 147 |
+ fi |
|
| 148 |
+ log_info "" |
|
| 149 |
+} |
|
| 150 |
+ |
|
| 151 |
+check_requirements() {
|
|
| 152 |
+ log_info "🔍 Checking system requirements..." |
|
| 153 |
+ |
|
| 154 |
+ # Check if running as root |
|
| 155 |
+ if [[ $EUID -ne 0 ]]; then |
|
| 156 |
+ log_error "This script must be run as root (use sudo)" |
|
| 157 |
+ exit 1 |
|
| 158 |
+ fi |
|
| 159 |
+ |
|
| 160 |
+ # Check if running on Linux |
|
| 161 |
+ if [[ "$(uname)" != "Linux" ]]; then |
|
| 162 |
+ log_error "autoSMART can only be installed on Linux systems" |
|
| 163 |
+ log_error "Current system: $(uname)" |
|
| 164 |
+ exit 1 |
|
| 165 |
+ fi |
|
| 166 |
+ |
|
| 167 |
+ # Check systemd |
|
| 168 |
+ if ! command -v systemctl &> /dev/null; then |
|
| 169 |
+ log_error "systemd is required but not found" |
|
| 170 |
+ exit 1 |
|
| 171 |
+ fi |
|
| 172 |
+ |
|
| 173 |
+ # Check and report dependency status |
|
| 174 |
+ if ! verify_dependencies >/dev/null 2>&1; then |
|
| 175 |
+ log_warning "Some dependencies are missing (will be installed automatically)" |
|
| 176 |
+ fi |
|
| 177 |
+ |
|
| 178 |
+ # Check available space |
|
| 179 |
+ AVAILABLE_SPACE=$(df -Pk / | tail -1 | awk '{print $4}')
|
|
| 180 |
+ if [[ $AVAILABLE_SPACE -lt 100000 ]]; then |
|
| 181 |
+ log_warning "Less than 100MB available space. Installation may fail." |
|
| 182 |
+ fi |
|
| 183 |
+ |
|
| 184 |
+ log_success "System requirements check passed" |
|
| 185 |
+} |
|
| 186 |
+ |
|
| 187 |
+handle_uninstall() {
|
|
| 188 |
+ log_info "🗑️ Uninstalling autoSMART..." |
|
| 189 |
+ |
|
| 190 |
+ # Stop and disable service |
|
| 191 |
+ if systemctl is-active --quiet autosmart; then |
|
| 192 |
+ systemctl stop autosmart |
|
| 193 |
+ fi |
|
| 194 |
+ if systemctl is-enabled --quiet autosmart; then |
|
| 195 |
+ systemctl disable autosmart |
|
| 196 |
+ fi |
|
| 197 |
+ |
|
| 198 |
+ # Remove service file |
|
| 199 |
+ if [[ -f "$SYSTEMD_SERVICE" ]]; then |
|
| 200 |
+ rm "$SYSTEMD_SERVICE" |
|
| 201 |
+ systemctl daemon-reload |
|
| 202 |
+ fi |
|
| 203 |
+ |
|
| 204 |
+ # Remove installation directory |
|
| 205 |
+ if [[ -d "$INSTALL_DIR" ]]; then |
|
| 206 |
+ rm -rf "$INSTALL_DIR" |
|
| 207 |
+ fi |
|
| 208 |
+ |
|
| 209 |
+ # Remove configuration directory |
|
| 210 |
+ if [[ -d "$CONFIG_DIR" ]]; then |
|
| 211 |
+ rm -rf "$CONFIG_DIR" |
|
| 212 |
+ fi |
|
| 213 |
+ |
|
| 214 |
+ # Remove log rotation |
|
| 215 |
+ if [[ -f "/etc/logrotate.d/autosmart" ]]; then |
|
| 216 |
+ rm "/etc/logrotate.d/autosmart" |
|
| 217 |
+ fi |
|
| 218 |
+ |
|
| 219 |
+ log_success "✅ autoSMART uninstalled successfully" |
|
| 220 |
+ # Return instead of exit so that --force-reinstall can continue after cleanup
+ return 0
|
| 221 |
+} |
|
| 222 |
+ |
|
| 223 |
+# Function to check if a package is installed |
|
| 224 |
+check_package_installed() {
|
|
| 225 |
+ local package="$1" |
|
| 226 |
+ local package_manager="$2" |
|
| 227 |
+ |
|
| 228 |
+ case "$package_manager" in |
|
| 229 |
+ "apt-get") |
|
| 230 |
+ dpkg-query -W -f='${Status}' "$package" 2>/dev/null | grep -q "install ok installed"
|
| 231 |
+ ;; |
|
| 232 |
+ "yum"|"dnf") |
|
| 233 |
+ rpm -q "$package" >/dev/null 2>&1
|
| 234 |
+ ;; |
|
| 235 |
+ "zypper") |
|
| 236 |
+ zypper se -i "$package" | grep -q "^i" 2>/dev/null |
|
| 237 |
+ ;; |
|
| 238 |
+ "pacman") |
|
| 239 |
+ pacman -Q "$package" >/dev/null 2>&1 |
|
| 240 |
+ ;; |
|
| 241 |
+ *) |
|
| 242 |
+ return 1 |
|
| 243 |
+ ;; |
|
| 244 |
+ esac |
|
| 245 |
+} |
|
| 246 |
+ |
|
| 247 |
+# Function to verify all dependencies are installed |
|
| 248 |
+verify_dependencies() {
|
|
| 249 |
+ log_info "🔍 Verifying system dependencies..." |
|
| 250 |
+ |
|
| 251 |
+ local missing_packages=() |
|
| 252 |
+ local package_manager="" |
|
| 253 |
+ |
|
| 254 |
+ # Detect package manager |
|
| 255 |
+ if command -v apt-get &> /dev/null; then |
|
| 256 |
+ package_manager="apt-get" |
|
| 257 |
+ elif command -v yum &> /dev/null; then |
|
| 258 |
+ package_manager="yum" |
|
| 259 |
+ elif command -v dnf &> /dev/null; then |
|
| 260 |
+ package_manager="dnf" |
|
| 261 |
+ elif command -v zypper &> /dev/null; then |
|
| 262 |
+ package_manager="zypper" |
|
| 263 |
+ elif command -v pacman &> /dev/null; then |
|
| 264 |
+ package_manager="pacman" |
|
| 265 |
+ else |
|
| 266 |
+ log_warning "Unknown package manager. Dependency verification limited." |
|
| 267 |
+ return 1 |
|
| 268 |
+ fi |
|
| 269 |
+ |
|
| 270 |
+ # Check system packages (including Perl modules from distribution) |
|
| 271 |
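+ local system_packages=("perl" "smartmontools" "curl" "wget")
+ # The PostgreSQL client package name differs per distribution
+ # (matches the lists used by install_dependencies)
+ if [[ "$package_manager" == "apt-get" ]]; then
+     system_packages+=("postgresql-client")
+ else
+     system_packages+=("postgresql")
+ fi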
|
|
| 272 |
+ local perl_packages=() |
|
| 273 |
+ |
|
| 274 |
+ # Add Perl module packages based on package manager |
|
| 275 |
+ case "$package_manager" in |
|
| 276 |
+ "apt-get") |
|
| 277 |
+ perl_packages+=("libdbi-perl" "libdbd-pg-perl" "libjson-perl" "libfile-slurp-perl"
|
|
| 278 |
+ "libgetopt-long-descriptive-perl" "libconfig-simple-perl") |
|
| 279 |
+ ;; |
|
| 280 |
+ "yum"|"dnf") |
|
| 281 |
+ perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp"
|
|
| 282 |
+ "perl-Getopt-Long" "perl-Config-Simple") |
|
| 283 |
+ ;; |
|
| 284 |
+ "zypper") |
|
| 285 |
+ perl_packages+=("perl-DBI" "perl-DBD-Pg" "perl-JSON" "perl-File-Slurp"
|
|
| 286 |
+ "perl-Getopt-Long-Descriptive" "perl-Config-Simple") |
|
| 287 |
+ ;; |
|
| 288 |
+ "pacman") |
|
| 289 |
+ perl_packages+=("perl-dbi" "perl-dbd-pg" "perl-json" "perl-file-slurp")
|
|
| 290 |
+ ;; |
|
| 291 |
+ esac |
|
| 292 |
+ |
|
| 293 |
+ # Check system packages |
|
| 294 |
+ for package in "${system_packages[@]}"; do
|
|
| 295 |
+ if ! check_package_installed "$package" "$package_manager"; then |
|
| 296 |
+ missing_packages+=("$package")
|
|
| 297 |
+ fi |
|
| 298 |
+ done |
|
| 299 |
+ |
|
| 300 |
+ # Check Perl packages from distribution |
|
| 301 |
+ for package in "${perl_packages[@]}"; do
|
|
| 302 |
+ if ! check_package_installed "$package" "$package_manager"; then |
|
| 303 |
+ missing_packages+=("$package")
|
|
| 304 |
+ fi |
|
| 305 |
+ done |
|
| 306 |
+ |
|
| 307 |
+ # Report results |
|
| 308 |
+ if [[ ${#missing_packages[@]} -eq 0 ]]; then
|
|
| 309 |
+ log_success "✅ All dependencies are available" |
|
| 310 |
+ return 0 |
|
| 311 |
+ else |
|
| 312 |
+ log_warning "Missing dependencies detected:" |
|
| 313 |
+ if [[ ${#missing_packages[@]} -gt 0 ]]; then
|
|
| 314 |
+ log_warning " Missing packages: ${missing_packages[*]}"
|
|
| 315 |
+ fi |
|
| 316 |
+ return 1 |
|
| 317 |
+ fi |
|
| 318 |
+} |
|
| 319 |
+ |
|
| 320 |
+# Function to install dependencies |
|
| 321 |
+install_dependencies() {
|
|
| 322 |
+ log_info "📦 Installing system dependencies..." |
|
| 323 |
+ |
|
| 324 |
+ # First check if dependencies are already installed |
|
| 325 |
+ if verify_dependencies >/dev/null 2>&1; then |
|
| 326 |
+ log_success "All dependencies already installed" |
|
| 327 |
+ return 0 |
|
| 328 |
+ fi |
|
| 329 |
+ |
|
| 330 |
+ log_info "Installing missing dependencies..." |
|
| 331 |
+ |
|
| 332 |
+ if command -v apt-get &> /dev/null; then |
|
| 333 |
+ # Debian/Ubuntu |
|
| 334 |
+ log_info "Updating package lists..." |
|
| 335 |
+ apt-get update -qq |
|
| 336 |
+ |
|
| 337 |
+ PACKAGES=( |
|
| 338 |
+ "perl" |
|
| 339 |
+ "libdbi-perl" |
|
| 340 |
+ "libdbd-pg-perl" |
|
| 341 |
+ "libjson-perl" |
|
| 342 |
+ "libfile-slurp-perl" |
|
| 343 |
+ "libgetopt-long-descriptive-perl" |
|
| 344 |
+ "libconfig-simple-perl" |
|
| 345 |
+ "smartmontools" |
|
| 346 |
+ "postgresql-client" |
|
| 347 |
+ "curl" |
|
| 348 |
+ "wget" |
|
| 349 |
+ ) |
|
| 350 |
+ |
|
| 351 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 352 |
+ if ! check_package_installed "$package" "apt-get"; then |
|
| 353 |
+ log_info "Installing $package..." |
|
| 354 |
+ if ! apt-get install -y "$package" >/dev/null 2>&1; then |
|
| 355 |
+ log_error "Failed to install $package" |
|
| 356 |
+ exit 1 |
|
| 357 |
+ fi |
|
| 358 |
+ fi |
|
| 359 |
+ done |
|
| 360 |
+ |
|
| 361 |
+ elif command -v dnf &> /dev/null; then |
|
| 362 |
+ # Fedora/RHEL 8+ |
|
| 363 |
+ log_info "Updating package lists..." |
|
| 364 |
+ dnf makecache -q
|
| 365 |
+ |
|
| 366 |
+ PACKAGES=( |
|
| 367 |
+ "perl" |
|
| 368 |
+ "perl-DBI" |
|
| 369 |
+ "perl-DBD-Pg" |
|
| 370 |
+ "perl-JSON" |
|
| 371 |
+ "perl-File-Slurp" |
|
| 372 |
+ "perl-Getopt-Long" |
|
| 373 |
+ "perl-Config-Simple" |
|
| 374 |
+ "smartmontools" |
|
| 375 |
+ "postgresql" |
|
| 376 |
+ "curl" |
|
| 377 |
+ "wget" |
|
| 378 |
+ ) |
|
| 379 |
+ |
|
| 380 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 381 |
+ if ! check_package_installed "$package" "dnf"; then |
|
| 382 |
+ log_info "Installing $package..." |
|
| 383 |
+ if ! dnf install -y "$package" >/dev/null 2>&1; then |
|
| 384 |
+ log_error "Failed to install $package" |
|
| 385 |
+ exit 1 |
|
| 386 |
+ fi |
|
| 387 |
+ fi |
|
| 388 |
+ done |
|
| 389 |
+ |
|
| 390 |
+ elif command -v yum &> /dev/null; then |
|
| 391 |
+ # RHEL/CentOS 7 |
|
| 392 |
+ log_info "Updating package lists..." |
|
| 393 |
+ yum makecache -q
|
| 394 |
+ |
|
| 395 |
+ PACKAGES=( |
|
| 396 |
+ "perl" |
|
| 397 |
+ "perl-DBI" |
|
| 398 |
+ "perl-DBD-Pg" |
|
| 399 |
+ "perl-JSON" |
|
| 400 |
+ "perl-File-Slurp" |
|
| 401 |
+ "perl-Getopt-Long" |
|
| 402 |
+ "perl-Config-Simple" |
|
| 403 |
+ "smartmontools" |
|
| 404 |
+ "postgresql" |
|
| 405 |
+ "curl" |
|
| 406 |
+ "wget" |
|
| 407 |
+ ) |
|
| 408 |
+ |
|
| 409 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 410 |
+ if ! check_package_installed "$package" "yum"; then |
|
| 411 |
+ log_info "Installing $package..." |
|
| 412 |
+ if ! yum install -y "$package" >/dev/null 2>&1; then |
|
| 413 |
+ log_error "Failed to install $package" |
|
| 414 |
+ exit 1 |
|
| 415 |
+ fi |
|
| 416 |
+ fi |
|
| 417 |
+ done |
|
| 418 |
+ |
|
| 419 |
+ elif command -v zypper &> /dev/null; then |
|
| 420 |
+ # openSUSE |
|
| 421 |
+ log_info "Updating package lists..." |
|
| 422 |
+ zypper refresh -q |
|
| 423 |
+ |
|
| 424 |
+ PACKAGES=( |
|
| 425 |
+ "perl" |
|
| 426 |
+ "perl-DBI" |
|
| 427 |
+ "perl-DBD-Pg" |
|
| 428 |
+ "perl-JSON" |
|
| 429 |
+ "perl-File-Slurp" |
|
| 430 |
+ "perl-Getopt-Long-Descriptive" |
|
| 431 |
+ "perl-Config-Simple" |
|
| 432 |
+ "smartmontools" |
|
| 433 |
+ "postgresql" |
|
| 434 |
+ "curl" |
|
| 435 |
+ "wget" |
|
| 436 |
+ ) |
|
| 437 |
+ |
|
| 438 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 439 |
+ if ! check_package_installed "$package" "zypper"; then |
|
| 440 |
+ log_info "Installing $package..." |
|
| 441 |
+ if ! zypper install -y "$package" >/dev/null 2>&1; then |
|
| 442 |
+ log_error "Failed to install $package" |
|
| 443 |
+ exit 1 |
|
| 444 |
+ fi |
|
| 445 |
+ fi |
|
| 446 |
+ done |
|
| 447 |
+ |
|
| 448 |
+ elif command -v pacman &> /dev/null; then |
|
| 449 |
+ # Arch Linux |
|
| 450 |
+ log_info "Updating package lists..." |
|
| 451 |
+ pacman -Sy --noconfirm |
|
| 452 |
+ |
|
| 453 |
+ PACKAGES=( |
|
| 454 |
+ "perl" |
|
| 455 |
+ "perl-dbi" |
|
| 456 |
+ "perl-dbd-pg" |
|
| 457 |
+ "perl-json" |
|
| 458 |
+ "perl-file-slurp" |
|
| 459 |
+ "smartmontools" |
|
| 460 |
+ "postgresql" |
|
| 461 |
+ "curl" |
|
| 462 |
+ "wget" |
|
| 463 |
+ ) |
|
| 464 |
+ |
|
| 465 |
+ for package in "${PACKAGES[@]}"; do
|
|
| 466 |
+ if ! check_package_installed "$package" "pacman"; then |
|
| 467 |
+ log_info "Installing $package..." |
|
| 468 |
+ if ! pacman -S --noconfirm "$package" >/dev/null 2>&1; then |
|
| 469 |
+ log_error "Failed to install $package" |
|
| 470 |
+ exit 1 |
|
| 471 |
+ fi |
|
| 472 |
+ fi |
|
| 473 |
+ done |
|
| 474 |
+ |
|
| 475 |
+ else |
|
| 476 |
+ log_error "Unsupported package manager. Please install dependencies manually:" |
|
| 477 |
+ log_error " - perl, smartmontools, postgresql-client, curl, wget" |
|
| 478 |
+ log_error " - Perl modules: DBI, DBD::Pg, JSON, File::Slurp, Getopt::Long, Config::Simple" |
|
| 479 |
+ exit 1 |
|
| 480 |
+ fi |
|
| 481 |
+ |
|
| 482 |
+ # Verify installation was successful |
|
| 483 |
+ if verify_dependencies >/dev/null 2>&1; then |
|
| 484 |
+ log_success "✅ All dependencies installed successfully" |
|
| 485 |
+ else |
|
| 486 |
+ log_error "Some dependencies may not have installed correctly" |
|
| 487 |
+ exit 1 |
|
| 488 |
+ fi |
|
| 489 |
+} |
|
| 490 |
+ |
|
| 491 |
+create_directories() {
|
|
| 492 |
+ log_info "📁 Creating directory structure..." |
|
| 493 |
+ |
|
| 494 |
+ # Create main directories |
|
| 495 |
+ mkdir -p "$INSTALL_DIR"/{scripts,lib,config,docs}
|
|
| 496 |
+ mkdir -p "$CONFIG_DIR" |
|
| 497 |
+ |
|
| 498 |
+ # Set permissions |
|
| 499 |
+ chmod 755 "$INSTALL_DIR" |
|
| 500 |
+ chmod 755 "$CONFIG_DIR" |
|
| 501 |
+ |
|
| 502 |
+ log_success "Directories created" |
|
| 503 |
+} |
|
| 504 |
+ |
|
| 505 |
+copy_files() {
|
|
| 506 |
+ log_info "📋 Copying autoSMART files..." |
|
| 507 |
+ |
|
| 508 |
+ # Copy scripts |
|
| 509 |
+ if [[ -d "$PROJECT_ROOT/scripts" ]]; then |
|
| 510 |
+ cp -r "$PROJECT_ROOT/scripts"/* "$INSTALL_DIR/scripts/" |
|
| 511 |
+ chmod +x "$INSTALL_DIR/scripts"/*.sh 2>/dev/null || true |
|
| 512 |
+ chmod +x "$INSTALL_DIR/scripts"/*.pl 2>/dev/null || true |
|
| 513 |
+ fi |
|
| 514 |
+ |
|
| 515 |
+ # Copy libraries |
|
| 516 |
+ if [[ -d "$PROJECT_ROOT/lib" ]]; then |
|
| 517 |
+ cp -r "$PROJECT_ROOT/lib"/* "$INSTALL_DIR/lib/" |
|
| 518 |
+ fi |
|
| 519 |
+ |
|
| 520 |
+ # Copy default configuration to /etc/default/autosmart |
|
| 521 |
+ if [[ -f "/etc/default/autosmart" ]]; then |
|
| 522 |
+ log_info "📝 Existing configuration found, merging with defaults..." |
|
| 523 |
+ |
|
| 524 |
+ # Backup existing configuration |
|
| 525 |
+ cp "/etc/default/autosmart" "/etc/default/autosmart.backup.$(date +%Y%m%d_%H%M%S)" |
|
| 526 |
+ |
|
| 527 |
+ # Read existing configuration |
|
| 528 |
+ declare -A existing_config |
|
| 529 |
+ while IFS='=' read -r key value; do |
|
| 530 |
+ if [[ $key =~ ^[A-Z_]+$ ]] && [[ -n $value ]]; then |
|
| 531 |
+ # Remove quotes and store |
|
| 532 |
+ value=$(echo "$value" | sed 's/^"//;s/"$//') |
|
| 533 |
+ existing_config["$key"]="$value" |
|
| 534 |
+ fi |
|
| 535 |
+ done < "/etc/default/autosmart" |
|
| 536 |
+ |
|
| 537 |
+ # Start with new configuration template |
|
| 538 |
+ if [[ -f "$PROJECT_ROOT/config/autosmart-defaults.conf" ]]; then |
|
| 539 |
+ cp "$PROJECT_ROOT/config/autosmart-defaults.conf" "/etc/default/autosmart" |
|
| 540 |
+ else |
|
| 541 |
+ cat > "/etc/default/autosmart" << 'EOF' |
|
| 542 |
+# AutoSMART Configuration |
|
| 543 |
+AUTOSMART_DEBUG="false" |
|
| 544 |
+EOF |
|
| 545 |
+ fi |
|
| 546 |
+ |
|
| 547 |
+ # Merge existing values back |
|
| 548 |
+ for key in "${!existing_config[@]}"; do
|
|
| 549 |
+ value="${existing_config[$key]}"
|
|
| 550 |
+ if grep -q "^${key}=" "/etc/default/autosmart"; then
|
|
| 551 |
+ # Update existing key with preserved value |
|
| 552 |
+ sed -i "s|^${key}=.*|${key}=\"${value}\"|" "/etc/default/autosmart"
|
|
| 553 |
+ log_info "✓ Preserved existing setting: ${key}=\"${value}\""
|
|
| 554 |
+ else |
|
| 555 |
+ # Add new key |
|
| 556 |
+ echo "${key}=\"${value}\"" >> "/etc/default/autosmart"
|
|
| 557 |
+ log_info "✓ Added custom setting: ${key}=\"${value}\""
|
|
| 558 |
+ fi |
|
| 559 |
+ done |
|
| 560 |
+ |
|
| 561 |
+ log_info "✓ Configuration merged successfully" |
|
| 562 |
+ |
|
| 563 |
+ elif [[ -f "$PROJECT_ROOT/config/autosmart-defaults.conf" ]]; then |
|
| 564 |
+ cp "$PROJECT_ROOT/config/autosmart-defaults.conf" /etc/default/autosmart |
|
| 565 |
+ log_info "✓ AutoSMART default configuration installed" |
|
| 566 |
+ else |
|
| 567 |
+ log_warning "Default configuration file not found, creating basic one" |
|
| 568 |
+ cat > /etc/default/autosmart << 'EOF' |
|
| 569 |
+# AutoSMART Configuration |
|
| 570 |
+AUTOSMART_DEBUG="false" |
|
| 571 |
+EOF |
|
| 572 |
+ fi |
|
| 573 |
+ |
|
| 574 |
+ # Copy documentation |
|
| 575 |
+ if [[ -d "$PROJECT_ROOT/docs" ]]; then |
|
| 576 |
+ cp -r "$PROJECT_ROOT/docs"/* "$INSTALL_DIR/docs/" |
|
| 577 |
+ fi |
|
| 578 |
+ |
|
| 579 |
+ # Copy SQL files |
|
| 580 |
+ if [[ -d "$PROJECT_ROOT/sql" ]]; then |
|
| 581 |
+ cp -r "$PROJECT_ROOT/sql" "$INSTALL_DIR/" |
|
| 582 |
+ fi |
|
| 583 |
+ |
|
| 584 |
+ log_success "Files copied" |
|
| 585 |
+} |
|
| 586 |
+ |
|
| 587 |
+create_configuration() {
|
|
| 588 |
+ log_info "⚙️ Creating configuration files..." |
|
| 589 |
+ |
|
| 590 |
+ # Main configuration file |
|
| 591 |
+ cat > "$CONFIG_DIR/autosmart.conf" << EOF |
|
| 592 |
+# autoSMART Configuration File |
|
| 593 |
+# Generated on $(date) |
|
| 594 |
+ |
|
| 595 |
+[database] |
|
| 596 |
+host = $DB_HOST |
|
| 597 |
+port = 5432 |
|
| 598 |
+user = $DB_USER |
|
| 599 |
+password = $DB_PASS |
|
| 600 |
+database = $DB_NAME |
|
| 601 |
+timeout = 30 |
|
| 602 |
+ |
|
| 603 |
+[node] |
|
| 604 |
+id = $NODE_ID |
|
| 605 |
+scan_interval = $SCAN_INTERVAL |
|
| 606 |
+full_scan_interval = $FULL_SCAN_INTERVAL |
|
| 607 |
+store_unchanged = false |
|
| 608 |
+max_retries = 3 |
|
| 609 |
+ |
|
| 610 |
+[collection] |
|
| 611 |
+temperature_threshold = 5 |
|
| 612 |
+parameter_changes_only = true |
|
| 613 |
+enable_predictive_analysis = true |
|
| 614 |
+health_check_interval = 86400 |
|
| 615 |
+ |
|
| 616 |
+[logging] |
|
| 617 |
+level = INFO |
|
| 618 |
+max_size = 10M |
|
| 619 |
+rotate_count = 5 |
|
| 620 |
+syslog = true |
|
| 621 |
+ |
|
| 622 |
+[alerts] |
|
| 623 |
+enable = true |
|
| 624 |
+temperature_critical = 60 |
|
| 625 |
+reallocated_sectors_warning = 1 |
|
| 626 |
+pending_sectors_critical = 5 |
|
| 627 |
+EOF |
|
| 628 |
+ |
|
| 629 |
+ # YAML format configuration for Perl daemon |
|
| 630 |
+ cat > "$CONFIG_DIR/cluster-$NODE_ID.conf" << EOF |
|
| 631 |
+# autoSMART YAML Configuration for $NODE_ID |
|
| 632 |
+database: |
|
| 633 |
+ host: $DB_HOST |
|
| 634 |
+ port: 5432 |
|
| 635 |
+ user: $DB_USER |
|
| 636 |
+ password: $DB_PASS |
|
| 637 |
+ database: $DB_NAME |
|
| 638 |
+ |
|
| 639 |
+node: |
|
| 640 |
+ id: $NODE_ID |
|
| 641 |
+ scan_interval: $SCAN_INTERVAL |
|
| 642 |
+ store_unchanged: false |
|
| 643 |
+ |
|
| 644 |
+collection: |
|
| 645 |
+ temperature_threshold: 5 |
|
| 646 |
+ parameter_changes_only: true |
|
| 647 |
+ full_scan_interval: $FULL_SCAN_INTERVAL |
|
| 648 |
+EOF |
|
| 649 |
+ |
|
| 650 |
+ # Set secure permissions on config files |
|
| 651 |
+ chmod 600 "$CONFIG_DIR"/*.conf |
|
| 652 |
+ |
|
| 653 |
+ log_success "Configuration created" |
|
| 654 |
+} |
|
| 655 |
+ |
|
| 656 |
+create_systemd_service() {
|
|
| 657 |
+ log_info "🔧 Creating systemd service..." |
|
| 658 |
+ |
|
| 659 |
+ cat > "$SYSTEMD_SERVICE" << EOF |
|
| 660 |
+[Unit] |
|
| 661 |
+Description=autoSMART SMART Data Collector |
|
| 662 |
+Documentation=file://$INSTALL_DIR/docs/README.md |
|
| 663 |
+After=network.target postgresql.service |
|
| 664 |
+Wants=postgresql.service |
|
| 665 |
+ |
|
| 666 |
+[Service] |
|
| 667 |
+Type=simple |
|
| 668 |
+EnvironmentFile=/etc/default/autosmart |
|
| 669 |
+ExecStart=$INSTALL_DIR/scripts/smart-collector-daemon.pl --config $CONFIG_DIR/cluster-$NODE_ID.conf --foreground |
|
| 670 |
+ExecReload=/bin/kill -HUP \$MAINPID |
|
| 671 |
+KillMode=process |
|
| 672 |
+Restart=always |
|
| 673 |
+RestartSec=30 |
|
| 674 |
+User=root |
|
| 675 |
+Group=root |
|
| 676 |
+ |
|
| 677 |
+# Security settings |
|
| 678 |
+NoNewPrivileges=true |
|
| 679 |
+ProtectSystem=strict |
|
| 680 |
+ProtectHome=true |
|
| 681 |
+ReadWritePaths=$CONFIG_DIR |
|
| 682 |
+PrivateTmp=true |
|
| 683 |
+ |
|
| 684 |
+# Resource limits |
|
| 685 |
+LimitNOFILE=1024 |
|
| 686 |
+MemoryMax=100M |
|
| 687 |
+CPUQuota=10% |
|
| 688 |
+ |
|
| 689 |
+# Logging |
|
| 690 |
+StandardOutput=journal |
|
| 691 |
+StandardError=journal |
|
| 692 |
+SyslogIdentifier=autosmart |
|
| 693 |
+ |
|
| 694 |
+[Install] |
|
| 695 |
+WantedBy=multi-user.target |
|
| 696 |
+EOF |
|
| 697 |
+ |
|
| 698 |
+ # Reload systemd |
|
| 699 |
+ systemctl daemon-reload |
|
| 700 |
+ |
|
| 701 |
+ log_success "Systemd service created" |
|
| 702 |
+} |
|
| 703 |
+ |
|
| 704 |
+test_database_connection() {
|
|
| 705 |
+ log_info "🔗 Testing database connection..." |
|
| 706 |
+ |
|
| 707 |
+ # Test connection using psql |
|
| 708 |
+ if command -v psql &> /dev/null; then |
|
| 709 |
+ if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT version();" >/dev/null 2>&1; then |
|
| 710 |
+ log_success "Database connection successful" |
|
| 711 |
+ else |
|
| 712 |
+ log_warning "Database connection failed. Service may not start correctly." |
|
| 713 |
+ log_info "Please ensure:" |
|
| 714 |
+ log_info " • PostgreSQL server is running on $DB_HOST" |
|
| 715 |
+ log_info " • Database '$DB_NAME' exists" |
|
| 716 |
+ log_info " • User '$DB_USER' has proper permissions" |
|
| 717 |
+ fi |
|
| 718 |
+ else |
|
| 719 |
+ log_warning "psql not found. Cannot test database connection." |
|
| 720 |
+ fi |
|
| 721 |
+} |
|
| 722 |
+ |
|
| 723 |
+test_smart_detection() {
|
|
| 724 |
+ log_info "🔍 Testing SMART device detection..." |
|
| 725 |
+ |
|
| 726 |
+ DEVICES_FOUND=0 |
|
| 727 |
+ for device in /dev/sd? /dev/nvme?n?; do |
|
| 728 |
+ if [[ -b "$device" ]] && smartctl -i "$device" >/dev/null 2>&1; then |
|
| 729 |
+ MODEL=$(smartctl -i "$device" | grep "Device Model\|Model Number" | head -1 | cut -d: -f2 | xargs) |
|
| 730 |
+ if [[ -n "$MODEL" ]]; then |
|
| 731 |
+ log_info " Found: $device - $MODEL" |
|
| 732 |
+ DEVICES_FOUND=$((DEVICES_FOUND + 1)) |
|
| 733 |
+ fi |
|
| 734 |
+ fi |
|
| 735 |
+ done |
|
| 736 |
+ |
|
| 737 |
+ if [[ $DEVICES_FOUND -gt 0 ]]; then |
|
| 738 |
+ log_success "Detected $DEVICES_FOUND SMART-capable devices" |
|
| 739 |
+ else |
|
| 740 |
+ log_warning "No SMART-capable devices detected" |
|
| 741 |
+ fi |
|
| 742 |
+} |
|
| 743 |
+ |
|
| 744 |
+finalize_installation() {
|
|
| 745 |
+ log_info "🎯 Finalizing installation..." |
|
| 746 |
+ |
|
| 747 |
+ # Enable service (but don't start yet) |
|
| 748 |
+ systemctl enable "$SERVICE_NAME" |
|
| 749 |
+ |
|
| 750 |
+ # Create log rotation |
|
| 751 |
+ cat > "/etc/logrotate.d/autosmart" << EOF |
|
| 752 |
+/var/log/autosmart/*.log {
|
|
| 753 |
+ daily |
|
| 754 |
+ rotate 7 |
|
| 755 |
+ compress |
|
| 756 |
+ delaycompress |
|
| 757 |
+ missingok |
|
| 758 |
+ notifempty |
|
| 759 |
+ postrotate |
|
| 760 |
+ systemctl reload-or-restart autosmart |
|
| 761 |
+ endscript |
|
| 762 |
+} |
|
| 763 |
+EOF |
|
| 764 |
+ |
|
| 765 |
+ log_success "Installation finalized" |
|
| 766 |
+} |
|
| 767 |
+ |
|
| 768 |
+show_completion_message() {
|
|
| 769 |
+ log_success "✅ autoSMART installation completed successfully!" |
|
| 770 |
+ log_info "" |
|
| 771 |
+ log_info "📋 Installation Summary:" |
|
| 772 |
+ log_info " • Install Directory: $INSTALL_DIR" |
|
| 773 |
+ log_info " • Config Directory: $CONFIG_DIR" |
|
| 774 |
+ log_info " • Service Name: $SERVICE_NAME" |
|
| 775 |
+ log_info " • Node ID: $NODE_ID" |
|
| 776 |
+ log_info "" |
|
| 777 |
+ log_info "🚀 Next Steps:" |
|
| 778 |
+ log_info " 1. Start the service:" |
|
| 779 |
+ log_info " systemctl start $SERVICE_NAME" |
|
| 780 |
+ log_info "" |
|
| 781 |
+ log_info " 2. Check service status:" |
|
| 782 |
+ log_info " systemctl status $SERVICE_NAME" |
|
| 783 |
+ log_info "" |
|
| 784 |
+ log_info " 3. View logs:" |
|
| 785 |
+ log_info " journalctl -u $SERVICE_NAME -f" |
|
| 786 |
+ log_info "" |
|
| 787 |
+ log_info "📖 Documentation: $INSTALL_DIR/docs/README.md" |
|
| 788 |
+ log_info "⚙️ Configuration: $CONFIG_DIR/autosmart.conf" |
|
| 789 |
+ log_info "" |
|
| 790 |
+ log_info "🎉 autoSMART is ready to monitor your storage devices!" |
|
| 791 |
+} |
|
| 792 |
+ |
|
| 793 |
+# Main execution |
|
| 794 |
+main() {
|
|
| 795 |
+ parse_arguments "$@" |
|
| 796 |
+ show_header |
|
| 797 |
+ |
|
| 798 |
+ case "$COMMAND" in |
|
| 799 |
+ uninstall) |
|
| 800 |
+ handle_uninstall |
|
| 801 |
+ ;; |
|
| 802 |
+ install) |
|
| 803 |
+ check_requirements |
|
| 804 |
+ |
|
| 805 |
+ # Handle force reinstall |
|
| 806 |
+ if [[ "$FORCE_REINSTALL" == true ]]; then |
|
| 807 |
+ log_info "🗑️ Force reinstall: cleaning previous installation..." |
|
| 808 |
+ handle_uninstall 2>/dev/null || true |
|
| 809 |
+ sleep 2 |
|
| 810 |
+ fi |
|
| 811 |
+ |
|
| 812 |
+ # Handle config-only mode |
|
| 813 |
+ if [[ "$CONFIG_ONLY" == true ]]; then |
|
| 814 |
+ log_info "⚙️ Configuration-only mode" |
|
| 815 |
+ if [[ ! -d "$INSTALL_DIR" ]]; then |
|
| 816 |
+ log_error "autoSMART is not installed. Run full installation first." |
|
| 817 |
+ exit 1 |
|
| 818 |
+ fi |
|
| 819 |
+ create_configuration |
|
| 820 |
+ log_success "✅ Configuration updated successfully!" |
|
| 821 |
+ exit 0 |
|
| 822 |
+ fi |
|
| 823 |
+ |
|
| 824 |
+ # Full installation |
|
| 825 |
+ install_dependencies |
|
| 826 |
+ create_directories |
|
| 827 |
+ copy_files |
|
| 828 |
+ create_configuration |
|
| 829 |
+ create_systemd_service |
|
| 830 |
+ test_database_connection |
|
| 831 |
+ test_smart_detection |
|
| 832 |
+ finalize_installation |
|
| 833 |
+ show_completion_message |
|
| 834 |
+ ;; |
|
| 835 |
+ *) |
|
| 836 |
+ log_error "Unknown command: $COMMAND" |
|
| 837 |
+ show_usage |
|
| 838 |
+ exit 1 |
|
| 839 |
+ ;; |
|
| 840 |
+ esac |
|
| 841 |
+} |
|
| 842 |
+ |
|
| 843 |
+# Run main function |
|
| 844 |
+main "$@" |
|
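The configuration merge in `copy_files` above rewrites preserved values with `sed`, which can misbehave when a value contains the delimiter (`|`), `&`, or a backslash. A minimal sketch of the same overlay done in pure Bash follows. `merge_config` is a hypothetical helper name, not something the installer defines, and the sketch assumes the same `KEY="value"` format with upper-case keys.

```bash
#!/usr/bin/env bash
# Sketch only: merge_config is a hypothetical helper, not part of the installer.
# It overlays KEY="value" pairs from an old config onto a fresh template.
merge_config() {
    local old_file=$1 template=$2 out=$3
    local -A old=()
    local key value line

    # Collect settings from the old file (upper-case keys only, as in copy_files).
    while IFS='=' read -r key value || [[ -n $key ]]; do
        [[ $key =~ ^[A-Z_]+$ && -n $value ]] || continue
        value=${value#\"}; value=${value%\"}   # strip one pair of surrounding quotes
        old[$key]=$value
    done < "$old_file"

    # Rewrite template lines whose key has a preserved value; copy the rest as-is.
    while IFS= read -r line || [[ -n $line ]]; do
        key=
        [[ $line =~ ^([A-Z_]+)= ]] && key=${BASH_REMATCH[1]}
        if [[ -n $key && -n ${old[$key]+set} ]]; then
            printf '%s="%s"\n' "$key" "${old[$key]}"
            unset "old[$key]"
        else
            printf '%s\n' "$line"
        fi
    done < "$template" > "$out"

    # Append settings that exist only in the old file.
    for key in "${!old[@]}"; do
        printf '%s="%s"\n' "$key" "${old[$key]}"
    done >> "$out"
}
```

Because values are written with `printf` rather than substituted into a `sed` expression, no escaping of the value is needed. Writing to a separate output file also leaves the original untouched until the merge has finished.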
@@ -0,0 +1,521 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# autoSMART Cluster Monitor |
|
| 4 |
+# Version: 1.0 |
|
| 5 |
+# Description: Monitor autoSMART services across Proxmox cluster |
|
| 6 |
+ |
|
| 7 |
+# Configuration |
|
| 8 |
+CLUSTER_JSON="$(dirname "$0")/../cluster.json" |
|
| 9 |
+NODES=() |
|
| 10 |
+NODE_IPS=() |
|
| 11 |
+if [[ -f "$CLUSTER_JSON" ]] && command -v jq &> /dev/null; then |
|
| 12 |
+ while IFS= read -r node; do |
|
| 13 |
+ NODES+=("$(echo "$node" | jq -r '.hostname')")
|
|
| 14 |
+ NODE_IPS+=("$(echo "$node" | jq -r '.ip')")
|
|
| 15 |
+ done < <(jq -c '.cluster.nodes[]' "$CLUSTER_JSON") |
|
| 16 |
+fi |
|
| 17 |
+DB_HOST="192.168.2.102" |
|
| 18 |
+DB_USER="autosmart" |
|
| 19 |
+DB_PASS="${DB_PASS:-autoSMART2025!}" # may be overridden from the environment |
|
| 20 |
+DB_NAME="autosmart" |
|
| 21 |
+ |
|
| 22 |
+# Colors for output |
|
| 23 |
+RED='\033[0;31m' |
|
| 24 |
+GREEN='\033[0;32m' |
|
| 25 |
+YELLOW='\033[1;33m' |
|
| 26 |
+BLUE='\033[0;34m' |
|
| 27 |
+CYAN='\033[0;36m' |
|
| 28 |
+NC='\033[0m' # No Color |
|
| 29 |
+ |
|
| 30 |
+log_info() {
|
|
| 31 |
+ echo -e "${BLUE}[INFO]${NC} $1"
|
|
| 32 |
+} |
|
| 33 |
+ |
|
| 34 |
+log_success() {
|
|
| 35 |
+ echo -e "${GREEN}[SUCCESS]${NC} $1"
|
|
| 36 |
+} |
|
| 37 |
+ |
|
| 38 |
+log_warning() {
|
|
| 39 |
+ echo -e "${YELLOW}[WARNING]${NC} $1"
|
|
| 40 |
+} |
|
| 41 |
+ |
|
| 42 |
+log_error() {
|
|
| 43 |
+ echo -e "${RED}[ERROR]${NC} $1"
|
|
| 44 |
+} |
|
| 45 |
+ |
|
| 46 |
+log_header() {
|
|
| 47 |
+ echo -e "${CYAN}$1${NC}"
|
|
| 48 |
+} |
|
| 49 |
+ |
|
| 50 |
+show_usage() {
|
|
| 51 |
+ echo "autoSMART Cluster Monitor v1.0" |
|
| 52 |
+ echo "" |
|
| 53 |
+ echo "Usage: $0 [COMMAND] [OPTIONS]" |
|
| 54 |
+ echo "" |
|
| 55 |
+ echo "Commands:" |
|
| 56 |
+ echo " status Show service status on all nodes" |
|
| 57 |
+ echo " logs [NODE] Show recent logs (all nodes or specific node)" |
|
| 58 |
+ echo " start Start services on all nodes" |
|
| 59 |
+ echo " stop Stop services on all nodes" |
|
| 60 |
+ echo " restart Restart services on all nodes" |
|
| 61 |
+ echo " deploy Deploy autoSMART to all nodes" |
|
| 62 |
+ echo " database Show database statistics" |
|
| 63 |
+ echo " health Show cluster health summary" |
|
| 64 |
+ echo " collect Force immediate SMART collection on all nodes" |
|
| 65 |
+ echo "" |
|
| 66 |
+ echo "Options:" |
|
| 67 |
+ echo " --node NODE Target specific node (name from cluster.json)" |
|
| 68 |
+ echo " --watch Continuous monitoring (refresh every 10s)" |
|
| 69 |
+ echo " --verbose Show detailed output" |
|
| 70 |
+ echo "" |
|
| 71 |
+ echo "Examples:" |
|
| 72 |
+ echo " $0 status # Show status on all nodes" |
|
| 73 |
+ echo " $0 status --node <node> # Show status on node from cluster.json" |
|
| 74 |
+ echo " $0 logs <node> # Show logs from node in cluster.json" |
|
| 75 |
+ echo " $0 health --watch # Continuous health monitoring" |
|
| 76 |
+ echo " $0 deploy # Deploy to all nodes" |
|
| 77 |
+ echo "" |
|
| 78 |
+} |
|
| 79 |
+ |
|
| 80 |
+check_node_connectivity() {
|
|
| 81 |
+ local node=$1 |
|
| 82 |
+ local ip=$2 |
|
| 83 |
+ |
|
| 84 |
+ if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then |
|
| 85 |
+ return 0 |
|
| 86 |
+ else |
|
| 87 |
+ return 1 |
|
| 88 |
+ fi |
|
| 89 |
+} |
|
| 90 |
+ |
|
| 91 |
+show_service_status() {
|
|
| 92 |
+ local target_node=$1 |
|
| 93 |
+ |
|
| 94 |
+ log_header "🔍 autoSMART Service Status" |
|
| 95 |
+ log_header "=============================" |
|
| 96 |
+ |
|
| 97 |
+ for i in "${!NODES[@]}"; do
|
|
| 98 |
+ local node="${NODES[$i]}"
|
|
| 99 |
+ local ip="${NODE_IPS[$i]}"
|
|
| 100 |
+ |
|
| 101 |
+ # Skip if specific node requested and this isn't it |
|
| 102 |
+ if [[ -n "$target_node" && "$node" != "$target_node" ]]; then |
|
| 103 |
+ continue |
|
| 104 |
+ fi |
|
| 105 |
+ |
|
| 106 |
+ echo "" |
|
| 107 |
+ log_info "Node: $node ($ip)" |
|
| 108 |
+ echo "----------------------------------------" |
|
| 109 |
+ |
|
| 110 |
+ if check_node_connectivity "$node" "$ip"; then |
|
| 111 |
+ local status_output |
|
| 112 |
+ status_output=$(ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no "root@$ip" \ |
|
| 113 |
+ "systemctl is-active autosmart 2>/dev/null || echo 'inactive'; \ |
|
| 114 |
+ systemctl is-enabled autosmart 2>/dev/null || echo 'disabled'; \ |
|
| 115 |
+ uptime | awk '{print \$3, \$4}' | sed 's/,//'" 2>/dev/null)
|
|
| 116 |
+ |
|
| 117 |
+ if [[ $? -eq 0 ]]; then |
|
| 118 |
+ local active=$(echo "$status_output" | sed -n '1p') |
|
| 119 |
+ local enabled=$(echo "$status_output" | sed -n '2p') |
|
| 120 |
+ local uptime=$(echo "$status_output" | sed -n '3p') |
|
| 121 |
+ |
|
| 122 |
+ echo -n " Status: " |
|
| 123 |
+ if [[ "$active" == "active" ]]; then |
|
| 124 |
+ log_success "RUNNING" |
|
| 125 |
+ else |
|
| 126 |
+ log_error "NOT RUNNING" |
|
| 127 |
+ fi |
|
| 128 |
+ |
|
| 129 |
+ echo -n " Enabled: " |
|
| 130 |
+ if [[ "$enabled" == "enabled" ]]; then |
|
| 131 |
+ log_success "YES" |
|
| 132 |
+ else |
|
| 133 |
+ log_warning "NO" |
|
| 134 |
+ fi |
|
| 135 |
+ |
|
| 136 |
+ echo " Uptime: $uptime" |
|
| 137 |
+ |
|
| 138 |
+ # Get recent activity |
|
| 139 |
+ local last_log=$(ssh -o ConnectTimeout=5 "root@$ip" \ |
|
| 140 |
+ "journalctl -u autosmart --no-pager -n 1 --output=short-iso 2>/dev/null | tail -1" 2>/dev/null) |
|
| 141 |
+ if [[ -n "$last_log" ]]; then |
|
| 142 |
+ echo " Last Activity: $(echo "$last_log" | awk '{print $1, $2}')"
|
|
| 143 |
+ fi |
|
| 144 |
+ |
|
| 145 |
+ else |
|
| 146 |
+ log_error "SSH CONNECTION FAILED" |
|
| 147 |
+ fi |
|
| 148 |
+ else |
|
| 149 |
+ log_error "NETWORK UNREACHABLE" |
|
| 150 |
+ fi |
|
| 151 |
+ done |
|
| 152 |
+} |
|
| 153 |
+ |
|
| 154 |
+show_logs() {
|
|
| 155 |
+ local target_node=$1 |
|
| 156 |
+ local lines=${2:-20}
|
|
| 157 |
+ |
|
| 158 |
+ log_header "📋 Recent Logs" |
|
| 159 |
+ log_header "===============" |
|
| 160 |
+ |
|
| 161 |
+ for i in "${!NODES[@]}"; do
|
|
| 162 |
+ local node="${NODES[$i]}"
|
|
| 163 |
+ local ip="${NODE_IPS[$i]}"
|
|
| 164 |
+ |
|
| 165 |
+ # Skip if specific node requested and this isn't it |
|
| 166 |
+ if [[ -n "$target_node" && "$node" != "$target_node" ]]; then |
|
| 167 |
+ continue |
|
| 168 |
+ fi |
|
| 169 |
+ |
|
| 170 |
+ echo "" |
|
| 171 |
+ log_info "Node: $node ($ip)" |
|
| 172 |
+ echo "----------------------------------------" |
|
| 173 |
+ |
|
| 174 |
+ if check_node_connectivity "$node" "$ip"; then |
|
| 175 |
+ ssh -o ConnectTimeout=5 "root@$ip" \ |
|
| 176 |
+ "journalctl -u autosmart --no-pager -n $lines --output=short-iso 2>/dev/null || echo 'No logs available'" 2>/dev/null |
|
| 177 |
+ else |
|
| 178 |
+ log_error "Node unreachable" |
|
| 179 |
+ fi |
|
| 180 |
+ done |
|
| 181 |
+} |
|
| 182 |
+ |
|
| 183 |
+control_services() {
|
|
| 184 |
+ local action=$1 |
|
| 185 |
+ local target_node=$2 |
|
| 186 |
+ |
|
| 187 |
+ log_header "🔧 ${action^} Services"
|
|
| 188 |
+ log_header "===================" |
|
| 189 |
+ |
|
| 190 |
+ for i in "${!NODES[@]}"; do
|
|
| 191 |
+ local node="${NODES[$i]}"
|
|
| 192 |
+ local ip="${NODE_IPS[$i]}"
|
|
| 193 |
+ |
|
| 194 |
+ # Skip if specific node requested and this isn't it |
|
| 195 |
+ if [[ -n "$target_node" && "$node" != "$target_node" ]]; then |
|
| 196 |
+ continue |
|
| 197 |
+ fi |
|
| 198 |
+ |
|
| 199 |
+ echo "" |
|
| 200 |
+ log_info "Node: $node - running 'systemctl $action autosmart'..."
|
|
| 201 |
+ |
|
| 202 |
+ if check_node_connectivity "$node" "$ip"; then |
|
| 203 |
+ if ssh -o ConnectTimeout=5 "root@$ip" "systemctl $action autosmart" 2>/dev/null; then |
|
| 204 |
+ log_success "$node: Service $action succeeded"
|
|
| 205 |
+ else |
|
| 206 |
+ log_error "$node: Failed to $action service" |
|
| 207 |
+ fi |
|
| 208 |
+ else |
|
| 209 |
+ log_error "$node: Node unreachable" |
|
| 210 |
+ fi |
|
| 211 |
+ done |
|
| 212 |
+} |
|
| 213 |
+ |
|
| 214 |
+show_database_stats() {
|
|
| 215 |
+ log_header "📊 Database Statistics" |
|
| 216 |
+ log_header "=====================" |
|
| 217 |
+ |
|
| 218 |
+ if command -v psql &> /dev/null; then |
|
| 219 |
+ echo "" |
|
| 220 |
+ log_info "Connection: $DB_HOST:5432/$DB_NAME" |
|
| 221 |
+ echo "" |
|
| 222 |
+ |
|
| 223 |
+ # Test connection |
|
| 224 |
+ if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT 1;" >/dev/null 2>&1; then |
|
| 225 |
+ log_success "Database connection: OK" |
|
| 226 |
+ echo "" |
|
| 227 |
+ |
|
| 228 |
+ # Get statistics |
|
| 229 |
+ PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c " |
|
| 230 |
+ SELECT |
|
| 231 |
+ 'Total Drives' as metric, COUNT(DISTINCT serial_number)::text as value |
|
| 232 |
+ FROM hdd_inventory |
|
| 233 |
+ UNION ALL |
|
| 234 |
+ SELECT |
|
| 235 |
+ 'Active Nodes', COUNT(DISTINCT current_node_id)::text |
|
| 236 |
+ FROM hdd_inventory WHERE last_seen > NOW() - INTERVAL '1 hour' |
|
| 237 |
+ UNION ALL |
|
| 238 |
+ SELECT |
|
| 239 |
+ 'Total Readings', COUNT(*)::text |
|
| 240 |
+ FROM smart_readings |
|
| 241 |
+ UNION ALL |
|
| 242 |
+ SELECT |
|
| 243 |
+ 'Readings Today', COUNT(*)::text |
|
| 244 |
+ FROM smart_readings WHERE timestamp > CURRENT_DATE |
|
| 245 |
+ UNION ALL |
|
| 246 |
+ SELECT |
|
| 247 |
+ 'Latest Reading', MAX(timestamp)::text |
|
| 248 |
+ FROM smart_readings; |
|
| 249 |
+ " 2>/dev/null |
|
| 250 |
+ |
|
| 251 |
+ echo "" |
|
| 252 |
+ log_info "Storage Efficiency:" |
|
| 253 |
+ PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c " |
|
| 254 |
+ SELECT |
|
| 255 |
+ hi.serial_number, |
|
| 256 |
+ hi.model_name, |
|
| 257 |
+ COUNT(sr.id) as readings, |
|
| 258 |
+ COUNT(DISTINCT sr.parameters_json) as unique_sets, |
|
| 259 |
+ CASE |
|
| 260 |
+ WHEN COUNT(DISTINCT sr.parameters_json) > 0 |
|
| 261 |
+ THEN ROUND((1 - COUNT(DISTINCT sr.parameters_json)::decimal / COUNT(sr.id)) * 100, 1) |
|
| 262 |
+ ELSE 0 |
|
| 263 |
+ END as savings_percent |
|
| 264 |
+ FROM hdd_inventory hi |
|
| 265 |
+ LEFT JOIN smart_readings sr ON hi.id = sr.hdd_id |
|
| 266 |
+ GROUP BY hi.id, hi.serial_number, hi.model_name |
|
| 267 |
+ HAVING COUNT(sr.id) > 0 |
|
| 268 |
+ ORDER BY readings DESC; |
|
| 269 |
+ " 2>/dev/null |
|
| 270 |
+ |
|
| 271 |
+ else |
|
| 272 |
+ log_error "Database connection failed" |
|
| 273 |
+ log_info "Please check:" |
|
| 274 |
+ log_info " • PostgreSQL server is running on $DB_HOST" |
|
| 275 |
+ log_info " • Database '$DB_NAME' exists" |
|
| 276 |
+ log_info " • User '$DB_USER' has proper permissions" |
|
| 277 |
+ fi |
|
| 278 |
+ else |
|
| 279 |
+ log_warning "psql not installed. Cannot check database statistics." |
|
| 280 |
+ fi |
|
| 281 |
+} |
|
| 282 |
+ |
|
| 283 |
+show_cluster_health() {
|
|
| 284 |
+ local watch_mode=$1 |
|
| 285 |
+ |
|
| 286 |
+ while true; do |
|
| 287 |
+ clear |
|
| 288 |
+ log_header "🏥 Cluster Health Summary" |
|
| 289 |
+ log_header "=========================" |
|
| 290 |
+ echo "Last Update: $(date)" |
|
| 291 |
+ echo "" |
|
| 292 |
+ |
|
| 293 |
+ # Service status summary |
|
| 294 |
+ local total_nodes=0 |
|
| 295 |
+ local active_nodes=0 |
|
| 296 |
+ local enabled_nodes=0 |
|
| 297 |
+ |
|
| 298 |
+ for i in "${!NODES[@]}"; do
|
|
| 299 |
+ local node="${NODES[$i]}"
|
|
| 300 |
+ local ip="${NODE_IPS[$i]}"
|
|
| 301 |
+ |
|
| 302 |
+ if check_node_connectivity "$node" "$ip"; then |
|
| 303 |
+ ((total_nodes++)) |
|
| 304 |
+ |
|
| 305 |
+ local status=$(ssh -o ConnectTimeout=5 "root@$ip" \ |
|
| 306 |
+ "systemctl is-active autosmart 2>/dev/null" 2>/dev/null) |
|
| 307 |
+ local enabled=$(ssh -o ConnectTimeout=5 "root@$ip" \ |
|
| 308 |
+ "systemctl is-enabled autosmart 2>/dev/null" 2>/dev/null) |
|
| 309 |
+ |
|
| 310 |
+ if [[ "$status" == "active" ]]; then |
|
| 311 |
+ ((active_nodes++)) |
|
| 312 |
+ fi |
|
| 313 |
+ |
|
| 314 |
+ if [[ "$enabled" == "enabled" ]]; then |
|
| 315 |
+ ((enabled_nodes++)) |
|
| 316 |
+ fi |
|
| 317 |
+ fi |
|
| 318 |
+ done |
|
| 319 |
+ |
|
| 320 |
+ echo "📡 Cluster Status:" |
|
| 321 |
+ echo " • Reachable Nodes: $total_nodes/${#NODES[@]}"
|
|
| 322 |
+ echo " • Active Services: $active_nodes/$total_nodes" |
|
| 323 |
+ echo " • Enabled Services: $enabled_nodes/$total_nodes" |
|
| 324 |
+ echo "" |
|
| 325 |
+ |
|
| 326 |
+ # Quick database check |
|
| 327 |
+ if command -v psql &> /dev/null; then |
|
| 328 |
+ if PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -c "SELECT 1;" >/dev/null 2>&1; then |
|
| 329 |
+ local db_stats=$(PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -t -c " |
|
| 330 |
+ SELECT |
|
| 331 |
+ COUNT(DISTINCT serial_number) || '|' || |
|
| 332 |
+ COUNT(DISTINCT current_node_id) || '|' || |
|
| 333 |
+ COUNT(*) || '|' || |
|
| 334 |
+ MAX(timestamp) |
|
| 335 |
+ FROM hdd_inventory hi |
|
| 336 |
+ LEFT JOIN smart_readings sr ON hi.id = sr.hdd_id; |
|
| 337 |
+ " 2>/dev/null | xargs) |
|
| 338 |
+ |
|
| 339 |
+ IFS='|' read -r drives nodes readings latest <<< "$db_stats" |
|
| 340 |
+ |
|
| 341 |
+ echo "🗄️ Database Status:" |
|
| 342 |
+ echo " • Connection: OK" |
|
| 343 |
+ echo " • Drives Tracked: $drives" |
|
| 344 |
+ echo " • Active Nodes: $nodes" |
|
| 345 |
+ echo " • Total Readings: $readings" |
|
| 346 |
+ echo " • Latest Reading: $(echo "$latest" | cut -d'.' -f1)" |
|
| 347 |
+ else |
|
| 348 |
+ echo "🗄️ Database Status: ❌ CONNECTION FAILED" |
|
| 349 |
+ fi |
|
| 350 |
+ fi |
|
| 351 |
+ |
|
| 352 |
+ if [[ "$watch_mode" != "watch" ]]; then |
|
| 353 |
+ break |
|
| 354 |
+ fi |
|
| 355 |
+ |
|
| 356 |
+ echo "" |
|
| 357 |
+ echo "Press Ctrl+C to exit watch mode..." |
|
| 358 |
+ sleep 10 |
|
| 359 |
+ done |
|
| 360 |
+} |
|
| 361 |
+ |
|
| 362 |
+deploy_cluster() {
|
|
| 363 |
+ log_header "🚀 Cluster Deployment" |
|
| 364 |
+ log_header "===================" |
|
| 365 |
+ |
|
| 366 |
+ local script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 367 |
+ local deploy_script="$script_dir/deploy-production.sh" |
|
| 368 |
+ |
|
| 369 |
+ if [[ -f "$deploy_script" ]]; then |
|
| 370 |
+ log_info "Running cluster deployment script..." |
|
| 371 |
+ bash "$deploy_script" |
|
| 372 |
+ else |
|
| 373 |
+ log_error "Deployment script not found: $deploy_script" |
|
| 374 |
+ log_info "Deploying manually to each node..." |
|
| 375 |
+ |
|
| 376 |
+ for i in "${!NODES[@]}"; do
|
|
| 377 |
+ local node="${NODES[$i]}"
|
|
| 378 |
+ local ip="${NODE_IPS[$i]}"
|
|
| 379 |
+ |
|
| 380 |
+ echo "" |
|
| 381 |
+ log_info "Deploying to $node ($ip)..." |
|
| 382 |
+ |
|
| 383 |
+ if check_node_connectivity "$node" "$ip"; then |
|
| 384 |
+ # Create the staging directory, then copy autoSMART files |
|
| 385 |
+ ssh -o ConnectTimeout=5 "root@$ip" "mkdir -p /tmp/autosmart-deploy" && scp -r "$(dirname "$script_dir")"/* "root@$ip:/tmp/autosmart-deploy/" 2>/dev/null |
|
| 386 |
+ |
|
| 387 |
+ # Run installation |
|
| 388 |
+ ssh "root@$ip" "cd /tmp/autosmart-deploy/scripts && bash deploy.sh install --force-reinstall --node-id $node" 2>/dev/null |
|
| 389 |
+ |
|
| 390 |
+ if [[ $? -eq 0 ]]; then |
|
| 391 |
+ log_success "$node: Deployment successful" |
|
| 392 |
+ else |
|
| 393 |
+ log_error "$node: Deployment failed" |
|
| 394 |
+ fi |
|
| 395 |
+ else |
|
| 396 |
+ log_error "$node: Node unreachable" |
|
| 397 |
+ fi |
|
| 398 |
+ done |
|
| 399 |
+ fi |
|
| 400 |
+} |
|
| 401 |
+ |
|
| 402 |
+force_collection() {
|
|
| 403 |
+ log_header "🔄 Force SMART Collection" |
|
| 404 |
+ log_header "=========================" |
|
| 405 |
+ |
|
| 406 |
+ for i in "${!NODES[@]}"; do
|
|
| 407 |
+ local node="${NODES[$i]}"
|
|
| 408 |
+ local ip="${NODE_IPS[$i]}"
|
|
| 409 |
+ |
|
| 410 |
+ echo "" |
|
| 411 |
+ log_info "Node: $node - Triggering SMART collection..." |
|
| 412 |
+ |
|
| 413 |
+ if check_node_connectivity "$node" "$ip"; then |
|
| 414 |
+ # Send SIGHUP to the daemon through its unit (ExecReload); 'pkill -f' over ssh would also match the remote shell |
|
| 415 |
+ ssh -o ConnectTimeout=5 "root@$ip" "systemctl reload autosmart" 2>/dev/null |
|
| 416 |
+ |
|
| 417 |
+ if [[ $? -eq 0 ]]; then |
|
| 418 |
+ log_success "$node: Collection triggered" |
|
| 419 |
+ else |
|
| 420 |
+ log_warning "$node: Reload failed, check service status" |
|
| 421 |
+ fi |
|
| 422 |
+ else |
|
| 423 |
+ log_error "$node: Node unreachable" |
|
| 424 |
+ fi |
|
| 425 |
+ done |
|
| 426 |
+} |
|
| 427 |
+ |
|
| 428 |
+# Parse command line arguments |
|
| 429 |
+COMMAND="" |
|
| 430 |
+TARGET_NODE="" |
|
| 431 |
+WATCH_MODE=false |
|
| 432 |
+VERBOSE=false |
|
| 433 |
+ |
|
| 434 |
+while [[ $# -gt 0 ]]; do |
|
| 435 |
+ case $1 in |
|
| 436 |
+ status|logs|start|stop|restart|deploy|database|health|collect) |
|
| 437 |
+ COMMAND="$1" |
|
| 438 |
+ shift |
|
| 439 |
+ ;; |
|
| 440 |
+ --node) |
|
| 441 |
+ TARGET_NODE="$2" |
|
| 442 |
+ shift $(( $# > 1 ? 2 : 1 )) # a plain 'shift 2' fails and loops forever when --node is the last argument |
|
| 443 |
+ ;; |
|
| 444 |
+ --watch) |
|
| 445 |
+ WATCH_MODE=true |
|
| 446 |
+ shift |
|
| 447 |
+ ;; |
|
| 448 |
+ --verbose) |
|
| 449 |
+ VERBOSE=true |
|
| 450 |
+ shift |
|
| 451 |
+ ;; |
|
| 452 |
+ --help) |
|
| 453 |
+ show_usage |
|
| 454 |
+ exit 0 |
|
| 455 |
+ ;; |
|
| 456 |
+ ebony|ivory|obsidian) |
|
| 457 |
+ # Allow node names as direct arguments for logs command |
|
| 458 |
+ if [[ "$COMMAND" == "logs" ]]; then |
|
| 459 |
+ TARGET_NODE="$1" |
|
| 460 |
+ fi |
|
| 461 |
+ shift |
|
| 462 |
+ ;; |
|
| 463 |
+ *) |
|
| 464 |
+ if [[ -z "$COMMAND" ]]; then |
|
| 465 |
+ COMMAND="$1" |
|
| 466 |
+ else |
|
| 467 |
+ log_error "Unknown option: $1" |
|
| 468 |
+ show_usage |
|
| 469 |
+ exit 1 |
|
| 470 |
+ fi |
|
| 471 |
+ shift |
|
| 472 |
+ ;; |
|
| 473 |
+ esac |
|
| 474 |
+done |
|
| 475 |
+ |
|
| 476 |
+# Default command |
|
| 477 |
+if [[ -z "$COMMAND" ]]; then |
|
| 478 |
+ COMMAND="status" |
|
| 479 |
+fi |
|
| 480 |
+ |
|
| 481 |
+# Execute command |
|
| 482 |
+case "$COMMAND" in |
|
| 483 |
+ status) |
|
| 484 |
+ if [[ "$WATCH_MODE" == true ]]; then |
|
| 485 |
+ while true; do |
|
| 486 |
+ clear |
|
| 487 |
+ show_service_status "$TARGET_NODE" |
|
| 488 |
+ echo "" |
|
| 489 |
+ echo "Press Ctrl+C to exit watch mode..." |
|
| 490 |
+ sleep 10 |
|
| 491 |
+ done |
|
| 492 |
+ else |
|
| 493 |
+ show_service_status "$TARGET_NODE" |
|
| 494 |
+ fi |
|
| 495 |
+ ;; |
|
| 496 |
+ logs) |
|
| 497 |
+ show_logs "$TARGET_NODE" |
|
| 498 |
+ ;; |
|
| 499 |
+ start|stop|restart) |
|
| 500 |
+ control_services "$COMMAND" "$TARGET_NODE" |
|
| 501 |
+ ;; |
|
| 502 |
+ database) |
|
| 503 |
+ show_database_stats |
|
| 504 |
+ ;; |
|
| 505 |
+ health) |
|
| 506 |
+ show_cluster_health "$([[ "$WATCH_MODE" == true ]] && echo "watch")" |
|
| 507 |
+ ;; |
|
| 508 |
+ deploy) |
|
| 509 |
+ deploy_cluster |
|
| 510 |
+ ;; |
|
| 511 |
+ collect) |
|
| 512 |
+ force_collection |
|
| 513 |
+ ;; |
|
| 514 |
+ *) |
|
| 515 |
+ log_error "Unknown command: $COMMAND" |
|
| 516 |
+ show_usage |
|
| 517 |
+ exit 1 |
|
| 518 |
+ ;; |
|
| 519 |
+esac |
|
| 520 |
+ |
|
| 521 |
+exit 0 |
|
@@ -0,0 +1,144 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+=head1 NAME |
|
| 4 |
+ |
|
| 5 |
+simple-smart-test.pl - Very simple SMART data test |
|
| 6 |
+ |
|
| 7 |
+=head1 DESCRIPTION |
|
| 8 |
+ |
|
| 9 |
+Direct SMART data collection and database storage test. |
|
| 10 |
+ |
|
| 11 |
+=cut |
|
| 12 |
+ |
|
| 13 |
+use strict; |
|
| 14 |
+use warnings; |
|
| 15 |
+use DBI; |
|
| 16 |
+use JSON::XS; |
|
| 17 |
+ |
|
| 18 |
+print "=== Simple SMART Test ===\n\n"; |
|
| 19 |
+ |
|
| 20 |
+# Database connection |
|
| 21 |
+my $dsn = "DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432"; |
|
| 22 |
+my $dbh = DBI->connect($dsn, "autosmart", "autoSMART2025!", {
|
|
| 23 |
+ RaiseError => 1, |
|
| 24 |
+ AutoCommit => 1, |
|
| 25 |
+ PrintError => 0 |
|
| 26 |
+}) or die "Failed to connect to database: $DBI::errstr\n"; |
|
| 27 |
+ |
|
| 28 |
+print "✓ Database connected\n"; |
|
| 29 |
+ |
|
| 30 |
+# Test SMART data collection manually |
|
| 31 |
+my @devices = glob('/dev/sd[a-z]');
|
|
| 32 |
+ |
|
| 33 |
+for my $device (@devices) {
|
|
| 34 |
+ print "\nTesting device: $device\n"; |
|
| 35 |
+ |
|
| 36 |
+ # Get basic device info |
|
| 37 |
+ my $smartctl_output = `smartctl -i $device 2>/dev/null`; |
|
| 38 |
+ if ($? != 0) {
|
|
| 39 |
+ print " ✗ SMART not available\n"; |
|
| 40 |
+ next; |
|
| 41 |
+ } |
|
| 42 |
+ |
|
| 43 |
+ # Parse basic info |
|
| 44 |
+ my ($model) = $smartctl_output =~ /Device Model:\s+(.+)/; |
|
| 45 |
+ my ($serial) = $smartctl_output =~ /Serial Number:\s+(.+)/; |
|
| 46 |
+ |
|
| 47 |
+ if (!$model || !$serial) {
|
|
| 48 |
+ print " ✗ Could not parse model/serial\n"; |
|
| 49 |
+ next; |
|
| 50 |
+ } |
|
| 51 |
+ |
|
| 52 |
+ print " Model: $model\n"; |
|
| 53 |
+ print " Serial: $serial\n"; |
|
| 54 |
+ |
|
| 55 |
+ # Get SMART attributes |
|
| 56 |
+ my $smart_output = `smartctl -A $device 2>/dev/null`; |
|
| 57 |
+ my %parameters; |
|
| 58 |
+ |
|
| 59 |
+ # Parse SMART attributes |
|
| 60 |
+ for my $line (split /\n/, $smart_output) {
|
|
| 61 |
+ if ($line =~ /^\s*(\d+)\s+(\w+)\s+0x[\da-f]+\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(\d+)/) {
|
|
| 62 |
+ my ($id, $name, $current, $worst, $threshold, $raw) = ($1, $2, $3, $4, $5, $6); |
|
| 63 |
+ $parameters{$name} = $raw;
|
|
| 64 |
+ } |
|
| 65 |
+ } |
|
| 66 |
+ |
|
| 67 |
+ # Get temperature |
|
| 68 |
+ my $temp = $parameters{'Temperature_Celsius'} || 0;
|
|
| 69 |
+ print " Temperature: ${temp}°C\n";
|
|
| 70 |
+ print " Parameters: " . scalar(keys %parameters) . "\n"; |
|
| 71 |
+ |
|
| 72 |
+ # Check if HDD exists in inventory |
|
| 73 |
+ my $sth = $dbh->prepare("SELECT id FROM hdd_inventory WHERE serial_number = ? AND model_name = ?");
|
|
| 74 |
+ $sth->execute($serial, $model); |
|
| 75 |
+ my ($hdd_id) = $sth->fetchrow_array(); |
|
| 76 |
+ |
|
| 77 |
+ if (!$hdd_id) {
|
|
| 78 |
+ # Create new HDD |
|
| 79 |
+ print " Creating new HDD entry...\n"; |
|
| 80 |
+ $sth = $dbh->prepare(q{
|
|
| 81 |
+ INSERT INTO hdd_inventory |
|
| 82 |
+ (serial_number, model_name, current_device_path, current_node_id, status) |
|
| 83 |
+ VALUES (?, ?, ?, 'ebony', 'active') |
|
| 84 |
+ RETURNING id |
|
| 85 |
+ }); |
|
| 86 |
+ $sth->execute($serial, $model, $device); |
|
| 87 |
+ ($hdd_id) = $sth->fetchrow_array(); |
|
| 88 |
+ print " ✓ HDD created with ID: $hdd_id\n"; |
|
| 89 |
+ } else {
|
|
| 90 |
+ print " ✓ HDD exists with ID: $hdd_id\n"; |
|
| 91 |
+ # Update location |
|
| 92 |
+ $sth = $dbh->prepare("UPDATE hdd_inventory SET current_device_path = ?, last_seen = NOW() WHERE id = ?");
|
|
| 93 |
+ $sth->execute($device, $hdd_id); |
|
| 94 |
+ } |
|
| 95 |
+ |
|
| 96 |
+ # Store SMART reading |
|
| 97 |
+ print " Storing SMART reading...\n"; |
|
| 98 |
+ my $parameters_json = encode_json(\%parameters); |
|
| 99 |
+ |
|
| 100 |
+ $sth = $dbh->prepare(q{
|
|
| 101 |
+ INSERT INTO smart_readings |
|
| 102 |
+ (hdd_id, serial_number, device_path, node_id, timestamp, |
|
| 103 |
+ collection_ok, temperature, parameters_json, reading_type) |
|
| 104 |
+ VALUES (?, ?, ?, 'ebony', NOW(), true, ?, ?, 'full') |
|
| 105 |
+ }); |
|
| 106 |
+ |
|
| 107 |
+ $sth->execute($hdd_id, $serial, $device, $temp, $parameters_json); |
|
| 108 |
+ print " ✓ SMART reading stored\n"; |
|
| 109 |
+} |
|
| 110 |
+ |
|
| 111 |
+# Show results |
|
| 112 |
+print "\n=== Database Summary ===\n"; |
|
| 113 |
+ |
|
| 114 |
+my $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_inventory");
|
|
| 115 |
+$sth->execute(); |
|
| 116 |
+my ($hdd_count) = $sth->fetchrow_array(); |
|
| 117 |
+print "HDD Inventory: $hdd_count drives\n"; |
|
| 118 |
+ |
|
| 119 |
+$sth = $dbh->prepare("SELECT COUNT(*) FROM smart_readings");
|
|
| 120 |
+$sth->execute(); |
|
| 121 |
+my ($reading_count) = $sth->fetchrow_array(); |
|
| 122 |
+print "SMART Readings: $reading_count readings\n"; |
|
| 123 |
+ |
|
| 124 |
+# Show latest readings |
|
| 125 |
+$sth = $dbh->prepare(q{
|
|
| 126 |
+ SELECT hi.serial_number, hi.model_name, sr.timestamp, sr.temperature |
|
| 127 |
+ FROM smart_readings sr |
|
| 128 |
+ JOIN hdd_inventory hi ON sr.hdd_id = hi.id |
|
| 129 |
+ ORDER BY sr.timestamp DESC |
|
| 130 |
+ LIMIT 5 |
|
| 131 |
+}); |
|
| 132 |
+$sth->execute(); |
|
| 133 |
+ |
|
| 134 |
+print "\nLatest readings:\n"; |
|
| 135 |
+while (my $row = $sth->fetchrow_hashref()) {
|
|
| 136 |
+ printf " %s (%s) - %s - %d°C\n", |
|
| 137 |
+ substr($row->{serial_number}, 0, 12),
|
|
| 138 |
+ substr($row->{model_name}, 0, 20),
|
|
| 139 |
+ $row->{timestamp},
|
|
| 140 |
+ $row->{temperature} || 0;
|
|
| 141 |
+} |
|
| 142 |
+ |
|
| 143 |
+$dbh->disconnect(); |
|
| 144 |
+print "\n=== Test Complete ===\n"; |
|
@@ -0,0 +1,384 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+use strict; |
|
| 3 |
+use warnings; |
|
| 4 |
+use DBI; |
|
| 5 |
+use JSON; |
|
| 6 |
+use File::Slurp; |
|
| 7 |
+use Getopt::Long; |
|
| 8 |
+use POSIX qw(strftime); |
|
| 9 |
+use Time::HiRes qw(sleep); |
|
| 10 |
+ |
|
| 11 |
+# autoSMART Collector Daemon |
|
| 12 |
+# Version: 1.0 |
|
| 13 |
+# Description: Automated SMART data collection daemon |
|
| 14 |
+ |
|
| 15 |
+my $config_file; |
|
| 16 |
+my $debug = (defined $ENV{AUTOSMART_DEBUG} && $ENV{AUTOSMART_DEBUG} eq 'true') ? 1 : 0;
|
|
| 17 |
+my $foreground = 0; |
|
| 18 |
+ |
|
| 19 |
+GetOptions( |
|
| 20 |
+ 'config=s' => \$config_file, |
|
| 21 |
+ 'debug' => \$debug, |
|
| 22 |
+ 'foreground' => \$foreground |
|
| 23 |
+) or die "Usage: $0 --config <file> [--debug] [--foreground]\n"; |
|
| 24 |
+ |
|
| 25 |
+if (defined $ENV{AUTOSMART_DEBUG}) {
|
|
| 26 |
+ if ($ENV{AUTOSMART_DEBUG} eq 'true') {
|
|
| 27 |
+ $debug = 1; |
|
| 28 |
+ log_message("AUTOSMART_DEBUG enabled via /etc/default/autosmart or environment");
|
|
| 29 |
+ } else {
|
|
| 30 |
+ $debug = 0; |
|
| 31 |
+ log_message("AUTOSMART_DEBUG disabled via /etc/default/autosmart or environment");
|
|
| 32 |
+ } |
|
| 33 |
+} |
|
| 34 |
+ |
|
| 35 |
+die "Configuration file required\n" unless $config_file; |
|
| 36 |
+die "Configuration file not found: $config_file\n" unless -f $config_file; |
|
| 37 |
+ |
|
| 38 |
+# Load configuration |
|
| 39 |
+my $config = load_config($config_file); |
|
| 40 |
+my $node_id = $config->{node}{id} || `hostname -s`;
|
|
| 41 |
+chomp $node_id; |
|
| 42 |
+ |
|
| 43 |
+log_message("Starting autoSMART collector daemon on node: $node_id");
|
|
| 44 |
+log_message("Configuration loaded from: $config_file");
|
|
| 45 |
+ |
|
| 46 |
+# Main collection loop |
|
| 47 |
+my $last_full_scan = 0; |
|
| 48 |
+my $scan_interval = $config->{node}{scan_interval} || 300;
|
|
| 49 |
+my $full_scan_interval = $config->{collection}{full_scan_interval} || 3600;
|
|
| 50 |
+ |
|
| 51 |
+while (1) {
|
|
| 52 |
+ eval {
|
|
| 53 |
+ my $current_time = time(); |
|
| 54 |
+ my $force_full = ($current_time - $last_full_scan) >= $full_scan_interval; |
|
| 55 |
+ |
|
| 56 |
+ if ($force_full) {
|
|
| 57 |
+ log_message("Performing full SMART scan (forced)");
|
|
| 58 |
+ $last_full_scan = $current_time; |
|
| 59 |
+ } |
|
| 60 |
+ |
|
| 61 |
+ collect_smart_data($force_full); |
|
| 62 |
+ |
|
| 63 |
+ }; |
|
| 64 |
+ |
|
| 65 |
+ if ($@) {
|
|
| 66 |
+ log_message("ERROR: Collection failed: $@");
|
|
| 67 |
+ } |
|
| 68 |
+ |
|
| 69 |
+ log_message("Sleeping for $scan_interval seconds...") if $debug;
|
|
| 70 |
+ sleep($scan_interval); |
|
| 71 |
+} |
|
| 72 |
+ |
|
| 73 |
+sub collect_smart_data {
|
|
| 74 |
+ my ($force_full) = @_; |
|
| 75 |
+ |
|
| 76 |
+ log_message("[DEBUG] Starting data collection cycle, force_full=" . ($force_full ? 'true' : 'false')) if $debug;
|
|
| 77 |
+ |
|
| 78 |
+ # Connect to database |
|
| 79 |
+ my $dsn = "DBI:Pg:host=$config->{database}{host};dbname=$config->{database}{database}";
|
|
| 80 |
+ log_message("[DEBUG] Connecting to database: $dsn") if $debug;
|
|
| 81 |
+ |
|
| 82 |
+ my $dbh = DBI->connect($dsn, $config->{database}{user}, $config->{database}{password},
|
|
| 83 |
+ {RaiseError => 1, AutoCommit => 1})
|
|
| 84 |
+ or die "Database connection failed: $DBI::errstr"; |
|
| 85 |
+ |
|
| 86 |
+ log_message("✓ Database connected") if $debug;
|
|
| 87 |
+ |
|
| 88 |
+ # Test database connectivity |
|
| 89 |
+ if ($debug) {
|
|
| 90 |
+ eval {
|
|
| 91 |
+ my $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_inventory");
|
|
| 92 |
+ $sth->execute(); |
|
| 93 |
+ my ($count) = $sth->fetchrow_array(); |
|
| 94 |
+ log_message("[DEBUG] Database test: found $count HDDs in inventory");
|
|
| 95 |
+ |
|
| 96 |
+ $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_presence WHERE is_current = TRUE");
|
|
| 97 |
+ $sth->execute(); |
|
| 98 |
+ my ($presence_count) = $sth->fetchrow_array(); |
|
| 99 |
+ log_message("[DEBUG] Database test: found $presence_count current HDD presence records");
|
|
| 100 |
+ }; |
|
| 101 |
+ if ($@) {
|
|
| 102 |
+ log_message("[DEBUG] Database test failed: $@");
|
|
| 103 |
+ } |
|
| 104 |
+ } |
|
| 105 |
+ |
|
| 106 |
+ # Scan for devices |
|
| 107 |
+ my @devices = glob('/dev/sd?');
|
|
| 108 |
+ push @devices, glob('/dev/nvme?n?');
|
|
| 109 |
+ |
|
| 110 |
+ log_message("[DEBUG] Found " . scalar(@devices) . " potential devices: " . join(', ', @devices)) if $debug;
|
|
| 111 |
+ |
|
| 112 |
+ foreach my $device (@devices) {
|
|
| 113 |
+ if (-b $device) {
|
|
| 114 |
+ log_message("[DEBUG] Processing block device: $device") if $debug;
|
|
| 115 |
+ } else {
|
|
| 116 |
+ log_message("[DEBUG] Skipping non-block device: $device") if $debug;
|
|
| 117 |
+ next; |
|
| 118 |
+ } |
|
| 119 |
+ |
|
| 120 |
+ eval {
|
|
| 121 |
+ process_device($dbh, $device, $force_full); |
|
| 122 |
+ }; |
|
| 123 |
+ |
|
| 124 |
+ if ($@) {
|
|
| 125 |
+ log_message("ERROR processing device $device: $@");
|
|
| 126 |
+ } |
|
| 127 |
+ } |
|
| 128 |
+ |
|
| 129 |
+ $dbh->disconnect(); |
|
| 130 |
+ log_message("Collection cycle complete") if $debug;
|
|
| 131 |
+} |
|
| 132 |
+ |
|
| 133 |
+sub process_device {
|
|
| 134 |
+ my ($dbh, $device, $force_full) = @_; |
|
| 135 |
+ |
|
| 136 |
+ log_message("[DEBUG] process_device: Processing $device") if $debug;
|
|
| 137 |
+ |
|
| 138 |
+ # Get SMART data |
|
| 139 |
+ my $smartctl_cmd = "smartctl -A -i -H $device 2>&1"; |
|
| 140 |
+ log_message("[DEBUG] Running: $smartctl_cmd") if $debug;
|
|
| 141 |
+ my @smart_output = `$smartctl_cmd`; |
|
| 142 |
+ my $exit_code = $? >> 8; |
|
| 143 |
+ |
|
| 144 |
+ if (!@smart_output) {
|
|
| 145 |
+ log_message("[DEBUG] No SMART output for $device") if $debug;
|
|
| 146 |
+ return; |
|
| 147 |
+ } |
|
| 148 |
+ |
|
| 149 |
+ log_message("[DEBUG] Got " . scalar(@smart_output) . " lines of SMART output from $device (exit code: $exit_code)") if $debug;
|
|
| 150 |
+ |
|
| 151 |
+ # Check if smartctl indicates the device doesn't support SMART |
|
| 152 |
+ my $smart_output_text = join('', @smart_output);
|
|
| 153 |
+ if ($smart_output_text =~ /SMART support is.*Unavailable|Device does not support SMART|No such device/) {
|
|
| 154 |
+ log_message("[DEBUG] Device $device does not support SMART or is not accessible") if $debug;
|
|
| 155 |
+ return; |
|
| 156 |
+ } |
|
| 157 |
+ |
|
| 158 |
+ my ($model, $serial, $temp, %smart_params); |
|
| 159 |
+ |
|
| 160 |
+ foreach my $line (@smart_output) {
|
|
| 161 |
+ chomp $line; |
|
| 162 |
+ |
|
| 163 |
+ if ($line =~ /Device Model:\s+(.+)/) {
|
|
| 164 |
+ $model = $1; |
|
| 165 |
+ log_message("[DEBUG] Found model: $model") if $debug;
|
|
| 166 |
+ } elsif ($line =~ /Serial Number:\s+(.+)/) {
|
|
| 167 |
+ $serial = $1; |
|
| 168 |
+ log_message("[DEBUG] Found serial: $serial") if $debug;
|
|
| 169 |
+ } elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\w+\s+\w+\s+\w+\s+(\d+)/) {
|
|
| 170 |
+ # Old format: ID ATTRIBUTE_NAME 0xXXXX DDD DDD DDD Pre-fail Always - RAW_VALUE |
|
| 171 |
+ my ($id, $name, $raw) = ($1, $2, $3); |
|
| 172 |
+ $name =~ s/\s+/_/g; |
|
| 173 |
+ $smart_params{$name} = $raw;
|
|
| 174 |
+ |
|
| 175 |
+ if ($debug && scalar(keys %smart_params) <= 5) {
|
|
| 176 |
+ log_message("[DEBUG] SMART param (old format): $name = $raw");
|
|
| 177 |
+ } |
|
| 178 |
+ |
|
| 179 |
+ if ($name =~ /Temperature|Temp/i) {
|
|
| 180 |
+ $temp = $raw if (!defined $temp || $raw > 0); |
|
| 181 |
+ } |
|
| 182 |
+ } elsif ($line =~ /^\s*(\d+)\s+(.+?)\s+0x\w+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)/) {
|
|
| 183 |
+ # New format: ID ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE |
|
| 184 |
+ my ($id, $name, $raw) = ($1, $2, $3); |
|
| 185 |
+ $name =~ s/\s+/_/g; |
|
| 186 |
+ $smart_params{$name} = $raw;
|
|
| 187 |
+ |
|
| 188 |
+ if ($debug && scalar(keys %smart_params) <= 5) {
|
|
| 189 |
+ log_message("[DEBUG] SMART param (new format): $name = $raw");
|
|
| 190 |
+ } |
|
| 191 |
+ |
|
| 192 |
+ if ($name =~ /Temperature|Temp/i) {
|
|
| 193 |
+ $temp = $raw if (!defined $temp || $raw > 0); |
|
| 194 |
+ } |
|
| 195 |
+ } |
|
| 196 |
+ } |
|
| 197 |
+ |
|
| 198 |
+ if (!$model || !$serial) {
|
|
| 199 |
+ log_message("[DEBUG] Missing critical data for $device - model: " . ($model || 'NULL') . ", serial: " . ($serial || 'NULL')) if $debug;
|
|
| 200 |
+ return; |
|
| 201 |
+ } |
|
| 202 |
+ |
|
| 203 |
+ if (!%smart_params) {
|
|
| 204 |
+ log_message("[DEBUG] No SMART parameters found for $device") if $debug;
|
|
| 205 |
+ return; |
|
| 206 |
+ } |
|
| 207 |
+ |
|
| 208 |
+ log_message("[DEBUG] Parsed device data - Model: $model, Serial: $serial, Temperature: " . ($temp || 'NULL') . ", Parameters: " . scalar(keys %smart_params)) if $debug;
|
|
| 209 |
+ |
|
| 210 |
+ return unless ($model && $serial && %smart_params); |
|
| 211 |
+ |
|
| 212 |
+ log_message("Processing: $model ($serial) @ $device") if $debug;
|
|
| 213 |
+ |
|
| 214 |
+ # Get or create HDD inventory entry |
|
| 215 |
+ my $hdd_id = get_or_create_hdd($dbh, $serial, $model, $device); |
|
| 216 |
+ |
|
| 217 |
+ # Check if we should store this reading |
|
| 218 |
+ my $params_json = encode_json(\%smart_params); |
|
| 219 |
+ |
|
| 220 |
+ if (!$force_full && !$config->{node}{store_unchanged}) {
|
|
| 221 |
+ # Check for recent identical reading |
|
| 222 |
+ my $sth = $dbh->prepare("
|
|
| 223 |
+ SELECT id FROM smart_readings |
|
| 224 |
+ WHERE hdd_id = ? AND parameters_json = ? |
|
| 225 |
+ AND timestamp > NOW() - INTERVAL '1 hour' |
|
| 226 |
+ LIMIT 1 |
|
| 227 |
+ "); |
|
| 228 |
+ $sth->execute($hdd_id, $params_json); |
|
| 229 |
+ |
|
| 230 |
+ if ($sth->fetchrow_array()) {
|
|
| 231 |
+ log_message(" Skipping unchanged parameters") if $debug;
|
|
| 232 |
+ return; |
|
| 233 |
+ } |
|
| 234 |
+ } |
|
| 235 |
+ |
|
| 236 |
+ # Store SMART reading |
|
| 237 |
+ my $reading_type = $force_full ? 'full' : 'differential'; |
|
| 238 |
+ |
|
| 239 |
+ my $sth = $dbh->prepare("
|
|
| 240 |
+ INSERT INTO smart_readings (hdd_id, serial_number, device_path, node_id, timestamp, temperature, parameters_json, reading_type) |
|
| 241 |
+ VALUES (?, ?, ?, ?, NOW(), ?, ?::jsonb, ?) |
|
| 242 |
+ RETURNING id |
|
| 243 |
+ "); |
|
| 244 |
+ |
|
| 245 |
+ my $reading_id = $dbh->selectrow_array($sth, undef, $hdd_id, $serial, $device, $node_id, $temp || 0, $params_json, $reading_type); |
|
| 246 |
+ |
|
| 247 |
+ log_message(" ✓ SMART reading stored (ID: $reading_id, temp: " . ($temp || 0) . "°C, type: $reading_type)") if $debug;
|
|
| 248 |
+} |
|
| 249 |
+ |
|
| 250 |
+sub get_or_create_hdd {
|
|
| 251 |
+ my ($dbh, $serial, $model, $device_path) = @_; |
|
| 252 |
+ |
|
| 253 |
+ log_message("[DEBUG] get_or_create_hdd: serial=$serial, model=$model, device=$device_path, node=$node_id") if $debug;
|
|
| 254 |
+ |
|
| 255 |
+ # Check if HDD exists |
|
| 256 |
+ my $sth = $dbh->prepare("SELECT id FROM hdd_inventory WHERE serial_number = ?");
|
|
| 257 |
+ $sth->execute($serial); |
|
| 258 |
+ my ($hdd_id) = $sth->fetchrow_array(); |
|
| 259 |
+ |
|
| 260 |
+ log_message("[DEBUG] HDD lookup result: hdd_id=" . ($hdd_id || 'NULL') . " for serial=$serial") if $debug;
|
|
| 261 |
+ |
|
| 262 |
+ if ($hdd_id) {
|
|
| 263 |
+ log_message("[DEBUG] Found existing HDD with id=$hdd_id, updating location and presence") if $debug;
|
|
| 264 |
+ |
|
| 265 |
+ # Update current location in inventory |
|
| 266 |
+ $dbh->do("UPDATE hdd_inventory SET current_device_path = ?, current_node_id = ?, last_seen = NOW()
|
|
| 267 |
+ WHERE id = ?", undef, $device_path, $node_id, $hdd_id); |
|
| 268 |
+ log_message("[DEBUG] Updated hdd_inventory location for hdd_id=$hdd_id") if $debug;
|
|
| 269 |
+ |
|
| 270 |
+ # Mark current hdd_presence records on other nodes as historic for this serial |
|
| 271 |
+ my $affected_rows = $dbh->do("UPDATE hdd_presence SET is_current = FALSE WHERE serial_number = ? AND is_current = TRUE AND node <> ?", undef, $serial, $node_id);
|
|
| 272 |
+ log_message("[DEBUG] Marked $affected_rows historic hdd_presence records for serial=$serial") if $debug;
|
|
| 273 |
+ |
|
| 274 |
+ # Check if there is already a current presence for this serial/node |
|
| 275 |
+ my $sth2 = $dbh->prepare("SELECT id FROM hdd_presence WHERE serial_number = ? AND node = ? AND is_current = TRUE");
|
|
| 276 |
+ $sth2->execute($serial, $node_id); |
|
| 277 |
+ my ($presence_id) = $sth2->fetchrow_array(); |
|
| 278 |
+ |
|
| 279 |
+ if ($presence_id) {
|
|
| 280 |
+ log_message("[DEBUG] Found existing presence record id=$presence_id, updating data_end") if $debug;
|
|
| 281 |
+ # Update data_end |
|
| 282 |
+ $dbh->do("UPDATE hdd_presence SET data_end = NOW() WHERE id = ?", undef, $presence_id);
|
|
| 283 |
+ log_message("[DEBUG] Updated data_end for presence_id=$presence_id") if $debug;
|
|
| 284 |
+ } else {
|
|
| 285 |
+ log_message("[DEBUG] No existing presence for serial=$serial node=$node_id, creating new record") if $debug;
|
|
| 286 |
+ # Create new presence record |
|
| 287 |
+ $dbh->do("UPDATE hdd_presence SET is_current = FALSE WHERE serial_number = ? AND is_current = TRUE", undef, $serial);
|
|
| 288 |
+ $sth2 = $dbh->prepare("INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current) VALUES (?, ?, NOW(), NOW(), TRUE)");
|
|
| 289 |
+ $sth2->execute($serial, $node_id); |
|
| 290 |
+ my $new_presence_id = $dbh->last_insert_id(undef, undef, 'hdd_presence', undef); |
|
| 291 |
+ log_message("[DEBUG] Created new hdd_presence record with id=$new_presence_id for serial=$serial node=$node_id") if $debug;
|
|
| 292 |
+ } |
|
| 293 |
+ return $hdd_id; |
|
| 294 |
+ } |
|
| 295 |
+ # Create new HDD entry |
|
| 296 |
+ log_message("[DEBUG] Creating new HDD entry for serial=$serial model=$model") if $debug;
|
|
| 297 |
+ $sth = $dbh->prepare("
|
|
| 298 |
+ INSERT INTO hdd_inventory (serial_number, model_name, current_device_path, current_node_id, |
|
| 299 |
+ first_seen, last_seen) |
|
| 300 |
+ VALUES (?, ?, ?, ?, NOW(), NOW()) |
|
| 301 |
+ RETURNING id |
|
| 302 |
+ "); |
|
| 303 |
+ my $new_id = $dbh->selectrow_array($sth, undef, $serial, $model, $device_path, $node_id); |
|
| 304 |
+ log_message("[DEBUG] Created new HDD inventory entry with id=$new_id") if $debug;
|
|
| 305 |
+ |
|
| 306 |
+ # Mark all previous hdd_presence as historic for this serial |
|
| 307 |
+ my $affected_rows = $dbh->do("UPDATE hdd_presence SET is_current = FALSE WHERE serial_number = ? AND is_current = TRUE", undef, $serial);
|
|
| 308 |
+ log_message("[DEBUG] Marked $affected_rows historic hdd_presence records for new serial=$serial") if $debug;
|
|
| 309 |
+ |
|
| 310 |
+ # Create new presence record |
|
| 311 |
+ my $sth2 = $dbh->prepare("INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current) VALUES (?, ?, NOW(), NOW(), TRUE)");
|
|
| 312 |
+ $sth2->execute($serial, $node_id); |
|
| 313 |
+ my $new_presence_id = $dbh->last_insert_id(undef, undef, 'hdd_presence', undef); |
|
| 314 |
+ log_message("[DEBUG] Created new hdd_presence record with id=$new_presence_id for new serial=$serial node=$node_id") if $debug;
|
|
| 315 |
+ |
|
| 316 |
+ return $new_id; |
|
| 317 |
+} |
|
| 318 |
+ |
|
| 319 |
+sub load_config {
|
|
| 320 |
+ my ($file) = @_; |
|
| 321 |
+ |
|
| 322 |
+ my $content = read_file($file); |
|
| 323 |
+ my %config; |
|
| 324 |
+ |
|
| 325 |
+ # Simple YAML-like parser |
|
| 326 |
+ my $current_section; |
|
| 327 |
+ foreach my $line (split /\n/, $content) {
|
|
| 328 |
+ $line =~ s/^\s+|\s+$//g; |
|
| 329 |
+ next if $line =~ /^#/ || $line eq ''; |
|
| 330 |
+ |
|
| 331 |
+ if ($line =~ /^(\w+):$/) {
|
|
| 332 |
+ $current_section = $1; |
|
| 333 |
+ } elsif ($line =~ /^\s*(\w+):\s*(.+)$/) {
|
|
| 334 |
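+ my ($key, $value) = ($1, $2); $value = 0 if $value =~ /^(?:false|no|off)$/i; $config{$current_section}{$key} = $value; # the string "false" is truthy in Perl, so map it to 0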
|
|
| 335 |
+ } |
|
| 336 |
+ } |
|
| 337 |
+ |
|
| 338 |
+ return \%config; |
|
| 339 |
+} |
|
| 340 |
+ |
|
| 341 |
+sub log_message {
|
|
| 342 |
+ my ($message) = @_; |
|
| 343 |
+ my $timestamp = strftime("%Y-%m-%d %H:%M:%S", localtime);
|
|
| 344 |
+ print "[$timestamp] $message\n"; |
|
| 345 |
+} |
|
| 346 |
+ |
|
| 347 |
+__END__ |
|
| 348 |
+ |
|
| 349 |
+=head1 NAME |
|
| 350 |
+ |
|
| 351 |
+smart-collector-daemon.pl - autoSMART SMART Data Collection Daemon |
|
| 352 |
+ |
|
| 353 |
+=head1 SYNOPSIS |
|
| 354 |
+ |
|
| 355 |
+smart-collector-daemon.pl --config <config_file> [--debug] [--foreground] |
|
| 356 |
+ |
|
| 357 |
+=head1 DESCRIPTION |
|
| 358 |
+ |
|
| 359 |
+Automated daemon for collecting SMART data from storage devices and storing |
|
| 360 |
+it in a PostgreSQL database with differential storage optimization. |
|
| 361 |
+ |
|
| 362 |
+=head1 OPTIONS |
|
| 363 |
+ |
|
| 364 |
+=over 4 |
|
| 365 |
+ |
|
| 366 |
+=item --config <file> |
|
| 367 |
+ |
|
| 368 |
+Configuration file path (required) |
|
| 369 |
+ |
|
| 370 |
+=item --debug |
|
| 371 |
+ |
|
| 372 |
+Enable debug logging |
|
| 373 |
+ |
|
| 374 |
+=item --foreground |
|
| 375 |
+ |
|
| 376 |
+Run in foreground (don't daemonize) |
|
| 377 |
+ |
|
| 378 |
+=back |
|
| 379 |
+ |
|
| 380 |
+=head1 AUTHOR |
|
| 381 |
+ |
|
| 382 |
+autoSMART v1.0 - Hardware-based HDD tracking system |
|
| 383 |
+ |
|
| 384 |
+=cut |
|
@@ -0,0 +1,55 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+use strict; |
|
| 4 |
+use warnings; |
|
| 5 |
+use DBI; |
|
| 6 |
+ |
|
| 7 |
+print "=== autoSMART Database Test ===\n\n"; |
|
| 8 |
+ |
|
| 9 |
+my $dsn = "DBI:Pg:dbname=autosmart;host=192.168.2.102;port=5432"; |
|
| 10 |
+my $dbh = DBI->connect($dsn, "autosmart", "autoSMART2025!", {
|
|
| 11 |
+ RaiseError => 1, |
|
| 12 |
+ AutoCommit => 1, |
|
| 13 |
+ PrintError => 0 |
|
| 14 |
+}) or die "Failed to connect to database: $DBI::errstr\n"; |
|
| 15 |
+ |
|
| 16 |
+print "✓ Database connection successful\n"; |
|
| 17 |
+ |
|
| 18 |
+# Test tables exist |
|
| 19 |
+my @tables = qw(hdd_inventory hdd_migrations smart_readings predictions smart_thresholds alert_history system_config); |
|
| 20 |
+ |
|
| 21 |
+for my $table (@tables) {
|
|
| 22 |
+ my $sth = $dbh->prepare("SELECT COUNT(*) FROM $table");
|
|
| 23 |
+ $sth->execute(); |
|
| 24 |
+ my ($count) = $sth->fetchrow_array(); |
|
| 25 |
+ print "✓ Table $table: $count rows\n"; |
|
| 26 |
+} |
|
| 27 |
+ |
|
| 28 |
+# Test views |
|
| 29 |
+my @views = qw(smart_readings_reconstructed latest_smart_readings drive_health_summary); |
|
| 30 |
+ |
|
| 31 |
+for my $view (@views) {
|
|
| 32 |
+ eval {
|
|
| 33 |
+ my $sth = $dbh->prepare("SELECT COUNT(*) FROM $view");
|
|
| 34 |
+ $sth->execute(); |
|
| 35 |
+ my ($count) = $sth->fetchrow_array(); |
|
| 36 |
+ print "✓ View $view: $count rows\n"; |
|
| 37 |
+ }; |
|
| 38 |
+ if ($@) {
|
|
| 39 |
+ print "✗ View $view: ERROR - $@\n"; |
|
| 40 |
+ } |
|
| 41 |
+} |
|
| 42 |
+ |
|
| 43 |
+# Test function |
|
| 44 |
+eval {
|
|
| 45 |
+ my $sth = $dbh->prepare("SELECT should_store_smart_reading(1, '{}', 'test', NOW())");
|
|
| 46 |
+ $sth->execute(); |
|
| 47 |
+ print "✓ Function should_store_smart_reading: Available\n"; |
|
| 48 |
+}; |
|
| 49 |
+if ($@) {
|
|
| 50 |
+ print "✗ Function should_store_smart_reading: ERROR - $@\n"; |
|
| 51 |
+} |
|
| 52 |
+ |
|
| 53 |
+$dbh->disconnect(); |
|
| 54 |
+ |
|
| 55 |
+print "\n=== Test Complete ===\n"; |
|
@@ -0,0 +1,79 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# Test script for debugging the autoSMART collector |
|
| 4 |
+# This script supports manual testing with debugging enabled |
|
| 5 |
+ |
|
| 6 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 7 |
+PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" |
|
| 8 |
+ |
|
| 9 |
+echo "🔍 Testing autoSMART collector debugging..." |
|
| 10 |
+echo "Project root: $PROJECT_ROOT" |
|
| 11 |
+ |
|
| 12 |
+# Check whether the debug configuration exists |
|
| 13 |
+if [[ -f "/etc/default/autosmart" ]]; then |
|
| 14 |
+ echo "✓ Found configuration file: /etc/default/autosmart" |
|
| 15 |
+ echo "Current configuration:" |
|
| 16 |
+ cat /etc/default/autosmart |
|
| 17 |
+ echo "" |
|
| 18 |
+else |
|
| 19 |
+ echo "❌ Configuration file /etc/default/autosmart not found" |
|
| 20 |
+ echo "Creating test configuration..." |
|
| 21 |
+ sudo mkdir -p /etc/default |
|
| 22 |
+ sudo tee /etc/default/autosmart > /dev/null << 'EOF' |
|
| 23 |
+# AutoSMART Configuration - Test Debug Mode |
|
| 24 |
+AUTOSMART_DEBUG="true" |
|
| 25 |
+EOF |
|
| 26 |
+ echo "✓ Test configuration created" |
|
| 27 |
+fi |
|
| 28 |
+ |
|
| 29 |
+# Check whether the daemon configuration file exists |
|
| 30 |
+CONFIG_FILE="$PROJECT_ROOT/test-config.yaml" |
|
| 31 |
+if [[ ! -f "$CONFIG_FILE" ]]; then |
|
| 32 |
+ echo "Creating test configuration file: $CONFIG_FILE" |
|
| 33 |
+ cat > "$CONFIG_FILE" << 'EOF' |
|
| 34 |
+node: |
|
| 35 |
+ id: test-node |
|
| 36 |
+ scan_interval: 30 |
|
| 37 |
+ store_unchanged: false |
|
| 38 |
+ |
|
| 39 |
+collection: |
|
| 40 |
+ full_scan_interval: 300 |
|
| 41 |
+ |
|
| 42 |
+database: |
|
| 43 |
+ host: 192.168.2.102 |
|
| 44 |
+ database: autosmart |
|
| 45 |
+ user: autosmart |
|
| 46 |
+ password: autosmart123 |
|
| 47 |
+EOF |
|
| 48 |
+ echo "✓ Test config created: $CONFIG_FILE" |
|
| 49 |
+fi |
|
| 50 |
+ |
|
| 51 |
+echo "" |
|
| 52 |
+echo "🔧 Available devices for testing:" |
|
| 53 |
+ls -la /dev/sd* /dev/nvme* 2>/dev/null | head -10 |
|
| 54 |
+ |
|
| 55 |
+echo "" |
|
| 56 |
+echo "💾 Testing database connectivity..." |
|
| 57 |
+if command -v psql >/dev/null; then |
|
| 58 |
+ echo "Testing connection to database..." |
|
| 59 |
+ psql -h 192.168.2.102 -U autosmart -d autosmart -c "SELECT 'Database connection OK' as status;" 2>/dev/null || echo "❌ Database connection failed" |
|
| 60 |
+else |
|
| 61 |
+ echo "❌ psql not available for testing" |
|
| 62 |
+fi |
|
| 63 |
+ |
|
| 64 |
+echo "" |
|
| 65 |
+echo "🚀 To run collector in debug mode:" |
|
| 66 |
+echo "export AUTOSMART_DEBUG=true" |
|
| 67 |
+echo "sudo -E perl $SCRIPT_DIR/smart-collector-daemon.pl --config $CONFIG_FILE --debug --foreground" |
|
| 68 |
+ |
|
| 69 |
+echo "" |
|
| 70 |
+echo "📊 To check hdd_presence table:" |
|
| 71 |
+echo "psql -h 192.168.2.102 -U autosmart -d autosmart -c \"SELECT * FROM hdd_presence;\"" |
|
| 72 |
+ |
|
| 73 |
+echo "" |
|
| 74 |
+echo "📋 To check hdd_inventory table:" |
|
| 75 |
+echo "psql -h 192.168.2.102 -U autosmart -d autosmart -c \"SELECT id, serial_number, model_name, current_node_id, last_seen FROM hdd_inventory;\"" |
|
| 76 |
+ |
|
| 77 |
+echo "" |
|
| 78 |
+echo "🔍 To check SMART readings:" |
|
| 79 |
+echo "psql -h 192.168.2.102 -U autosmart -d autosmart -c \"SELECT COUNT(*) as total_readings FROM smart_readings;\"" |
|
@@ -0,0 +1,270 @@ |
||
| 1 |
+#!/usr/bin/perl |
|
| 2 |
+ |
|
| 3 |
+=head1 NAME |
|
| 4 |
+ |
|
| 5 |
+test-differential-storage.pl - Test differential SMART storage system |
|
| 6 |
+ |
|
| 7 |
+=head1 DESCRIPTION |
|
| 8 |
+ |
|
| 9 |
+This script tests the differential storage implementation by: |
|
| 10 |
+1. Creating test HDD entries |
|
| 11 |
+2. Inserting baseline SMART readings |
|
| 12 |
+3. Inserting identical readings (should be skipped) |
|
| 13 |
+4. Inserting readings with small changes (differential storage) |
|
| 14 |
+5. Inserting readings with critical changes (full storage) |
|
| 15 |
+6. Validating storage efficiency and reconstruction |
|
| 16 |
+ |
|
| 17 |
+=cut |
|
| 18 |
+ |
|
| 19 |
+use strict; |
|
| 20 |
+use warnings; |
|
| 21 |
+use FindBin qw($Bin); |
|
| 22 |
+use lib "$Bin/../lib"; |
|
| 23 |
+ |
|
| 24 |
+use DBI; |
|
| 25 |
+use JSON::XS; |
|
| 26 |
+use Data::Dumper; |
|
| 27 |
+use Time::HiRes qw(time); |
|
| 28 |
+use Digest::SHA; |
|
| 29 |
+ |
|
| 30 |
+# Database configuration |
|
| 31 |
+my $config = {
|
|
| 32 |
+ db_host => $ENV{AUTOSMART_DB_HOST} || 'localhost',
|
|
| 33 |
+ db_port => $ENV{AUTOSMART_DB_PORT} || '5432',
|
|
| 34 |
+ db_name => $ENV{AUTOSMART_DB_NAME} || 'autosmart',
|
|
| 35 |
+ db_user => $ENV{AUTOSMART_DB_USER} || 'autosmart',
|
|
| 36 |
+ db_pass => $ENV{AUTOSMART_DB_PASS} || 'smartpassword',
|
|
| 37 |
+}; |
|
| 38 |
+ |
|
| 39 |
+print "=== autoSMART Differential Storage Test ===\n\n"; |
|
| 40 |
+ |
|
| 41 |
+# Connect to database |
|
| 42 |
+my $dsn = "DBI:Pg:dbname=$config->{db_name};host=$config->{db_host};port=$config->{db_port}";
|
|
| 43 |
+my $dbh = DBI->connect($dsn, $config->{db_user}, $config->{db_pass}, {
|
|
| 44 |
+ RaiseError => 1, |
|
| 45 |
+ AutoCommit => 1, |
|
| 46 |
+ PrintError => 0 |
|
| 47 |
+}) or die "Failed to connect to database: $DBI::errstr\n"; |
|
| 48 |
+ |
|
| 49 |
+print "✓ Connected to database\n"; |
|
| 50 |
+ |
|
| 51 |
+# Clean up any existing test data |
|
| 52 |
+cleanup_test_data($dbh); |
|
| 53 |
+ |
|
| 54 |
+# Test 1: Create test HDD |
|
| 55 |
+my $test_hdd_id = create_test_hdd($dbh); |
|
| 56 |
+print "✓ Created test HDD (ID: $test_hdd_id)\n"; |
|
| 57 |
+ |
|
| 58 |
+# Test 2: Insert baseline reading |
|
| 59 |
+my $baseline_reading = {
|
|
| 60 |
+ parameters => {
|
|
| 61 |
+ 'Reallocated_Sector_Ct' => 0, |
|
| 62 |
+ 'Spin_Retry_Count' => 0, |
|
| 63 |
+ 'Current_Pending_Sector' => 0, |
|
| 64 |
+ 'Power_On_Hours' => 1000, |
|
| 65 |
+ 'Temperature_Celsius' => 35, |
|
| 66 |
+ 'Load_Cycle_Count' => 5000 |
|
| 67 |
+ }, |
|
| 68 |
+ temperature => 35 |
|
| 69 |
+}; |
|
| 70 |
+ |
|
| 71 |
+my $baseline_id = insert_test_reading($dbh, $test_hdd_id, $baseline_reading); |
|
| 72 |
+print "✓ Inserted baseline reading (ID: $baseline_id)\n"; |
|
| 73 |
+ |
|
| 74 |
+# Test 3: Insert identical reading (should be skipped) |
|
| 75 |
+sleep(1); |
|
| 76 |
+my $identical_result = test_should_store($dbh, $test_hdd_id, $baseline_reading); |
|
| 77 |
+print "✓ Identical reading test - Should store: " . |
|
| 78 |
+ ($identical_result->{should_store} ? "YES" : "NO") .
|
|
| 79 |
+ " (Type: $identical_result->{reading_type})\n";
|
|
| 80 |
+ |
|
| 81 |
+# Test 4: Insert reading with temperature change only (differential)
+my $temp_change_reading = {
+    %$baseline_reading,
+    temperature => 38
+};
+$temp_change_reading->{parameters} = { %{ $baseline_reading->{parameters} }, Temperature_Celsius => 38 };  # copy, so the shared baseline hash is not mutated
+
+sleep(1);
+my $temp_result = test_should_store($dbh, $test_hdd_id, $temp_change_reading);
+my $temp_id = insert_test_reading($dbh, $test_hdd_id, $temp_change_reading, $temp_result);
+print "✓ Temperature change reading - Should store: " .
+    ($temp_result->{should_store} ? "YES" : "NO") .
+    " (Type: $temp_result->{reading_type}, ID: $temp_id)\n";
+
+# Test 5: Insert reading with critical parameter change (full)
+my $critical_reading = {
+    %$baseline_reading,
+    temperature => 40
+};
+$critical_reading->{parameters} = { %{ $baseline_reading->{parameters} }, Reallocated_Sector_Ct => 1 };  # Critical parameter change, applied to a copy
+$critical_reading->{parameters}{Temperature_Celsius} = 40;
+
+sleep(1);
+my $critical_result = test_should_store($dbh, $test_hdd_id, $critical_reading);
+my $critical_id = insert_test_reading($dbh, $test_hdd_id, $critical_reading, $critical_result);
+print "✓ Critical change reading - Should store: " .
+    ($critical_result->{should_store} ? "YES" : "NO") .
+    " (Type: $critical_result->{reading_type}, ID: $critical_id)\n";
+
+# Test 6: Validate reconstruction
+print "\n--- Testing Data Reconstruction ---\n";
+test_reconstruction($dbh, $test_hdd_id);
+
+# Test 7: Show storage statistics
+print "\n--- Storage Statistics ---\n";
+show_storage_stats($dbh, $test_hdd_id);
+
+print "\n=== Test Complete ===\n";
+
+$dbh->disconnect();
+
+sub cleanup_test_data {
+    my ($dbh) = @_;
+
+    $dbh->do("DELETE FROM smart_readings WHERE serial_number = 'TEST_SERIAL_001'");
+    $dbh->do("DELETE FROM hdd_inventory WHERE serial_number = 'TEST_SERIAL_001'");
+}
+
+sub create_test_hdd {
+    my ($dbh) = @_;
+
+    my $sql = q{
+        INSERT INTO hdd_inventory
+        (serial_number, model_name, firmware, size_gb, manufacturer,
+         current_device_path, current_node_id, status)
+        VALUES ('TEST_SERIAL_001', 'TEST_MODEL_WD', '1.0', 1000, 'Western Digital',
+                '/dev/sdb', 'test-node', 'active')
+        RETURNING id
+    };
+
+    my $sth = $dbh->prepare($sql);
+    $sth->execute();
+
+    return $sth->fetchrow_array();
+}
+
+sub test_should_store {
+    my ($dbh, $hdd_id, $reading) = @_;
+
+    my $parameters_json = encode_json($reading->{parameters});
+    my $checksum = Digest::SHA::sha256_hex($parameters_json . ($reading->{temperature} || ''));
+
+    my $sth = $dbh->prepare(q{
+        SELECT * FROM should_store_smart_reading(?, ?, ?, NOW())
+    });
+
+    $sth->execute($hdd_id, $parameters_json, $checksum);
+
+    return $sth->fetchrow_hashref();
+}
+
+sub insert_test_reading {
+    my ($dbh, $hdd_id, $reading, $storage_info) = @_;
+
+    # If no storage info provided, get it
+    if (!$storage_info) {
+        $storage_info = test_should_store($dbh, $hdd_id, $reading);
+        return undef unless $storage_info->{should_store};
+    }
+
+    return undef unless $storage_info->{should_store};
+
+    # For differential readings, only store changed parameters
+    my $parameters_to_store;
+    if ($storage_info->{reading_type} eq 'differential' && $storage_info->{changed_parameters}) {
+        my $changed_params = decode_json($storage_info->{changed_parameters});
+        $parameters_to_store = {};
+
+        for my $param_name (@$changed_params) {
+            $parameters_to_store->{$param_name} = $reading->{parameters}{$param_name};
+        }
+    } else {
+        $parameters_to_store = $reading->{parameters};
+    }
+
+    my $sql = q{
+        INSERT INTO smart_readings
+        (hdd_id, serial_number, device_path, node_id, timestamp,
+         collection_ok, temperature, parameters_json, reading_type,
+         changes_detected, changed_parameters, previous_reading_id, checksum)
+        VALUES (?, ?, ?, ?, NOW(), ?, ?, ?, ?, ?, ?, ?, ?)
+        RETURNING id
+    };
+
+    my $parameters_json = encode_json($parameters_to_store);
+    my $checksum = Digest::SHA::sha256_hex(encode_json($reading->{parameters}) . ($reading->{temperature} || ''));
+
+    my $sth = $dbh->prepare($sql);
+    $sth->execute(
+        $hdd_id,
+        'TEST_SERIAL_001',
+        '/dev/sdb',
+        'test-node',
+        1, # collection_ok
+        $reading->{temperature},
+        $parameters_json,
+        $storage_info->{reading_type},
+        $storage_info->{changes_detected} ? 1 : 0,
+        $storage_info->{changed_parameters},
+        $storage_info->{previous_reading_id},
+        $checksum
+    );
+
+    return $sth->fetchrow_array();
+}
+
+sub test_reconstruction {
+    my ($dbh, $hdd_id) = @_;
+
+    my $sql = q{
+        SELECT id, timestamp, reading_type, chain_level, parameters_json, temperature
+        FROM smart_readings_reconstructed
+        WHERE hdd_id = ?
+        ORDER BY timestamp
+    };
+
+    my $sth = $dbh->prepare($sql);
+    $sth->execute($hdd_id);
+
+    while (my $row = $sth->fetchrow_hashref()) {
+        print "Reading ID: $row->{id}, Type: $row->{reading_type}, Chain: $row->{chain_level}\n";
+        print "  Temperature: $row->{temperature}°C\n";
+
+        my $params = decode_json($row->{parameters_json});
+        for my $param (sort keys %$params) {
+            print "  $param: $params->{$param}\n";
+        }
+        print "\n";
+    }
+}
+
+sub show_storage_stats {
+    my ($dbh, $hdd_id) = @_;
+
+    my $sql = q{
+        SELECT
+            reading_type,
+            COUNT(*) as count,
+            AVG(length(parameters_json::text)) as avg_size
+        FROM smart_readings
+        WHERE hdd_id = ?
+        GROUP BY reading_type
+        ORDER BY reading_type
+    };
+
+    my $sth = $dbh->prepare($sql);
+    $sth->execute($hdd_id);
+
+    my $total_readings = 0;
+    my $total_size = 0;
+
+    while (my $row = $sth->fetchrow_hashref()) {
+        printf "%-12s: %d readings, avg size: %.0f bytes\n",
+            $row->{reading_type}, $row->{count}, $row->{avg_size};
+        $total_readings += $row->{count};
+        $total_size += $row->{count} * $row->{avg_size};
+    }
+
+    print "\nTotal: $total_readings readings, estimated size: " . int($total_size) . " bytes\n";
+}
@@ -0,0 +1,132 @@
+#!/usr/bin/perl
+
+=head1 NAME
+
+test-smart-collection.pl - Simple SMART data collection test
+
+=head1 DESCRIPTION
+
+Simplified SMART data collection test for autoSMART deployment verification.
+
+=cut
+
+use strict;
+use warnings;
+use FindBin qw($Bin);
+use lib "$Bin/../lib";
+
+use SmartCollector;
+use DBI;
+use JSON::XS;
+
+# Configuration from environment
+my $config = {
+    db_host => $ENV{AUTOSMART_DB_HOST} || '192.168.2.102',
+    db_port => $ENV{AUTOSMART_DB_PORT} || '5432',
+    db_name => $ENV{AUTOSMART_DB_NAME} || 'autosmart',
+    db_user => $ENV{AUTOSMART_DB_USER} || 'autosmart',
+    db_pass => $ENV{AUTOSMART_DB_PASS} || 'autoSMART2025!',
+    node_id => $ENV{AUTOSMART_NODE_ID} || 'ebony',
+    debug   => $ENV{AUTOSMART_DEBUG} || 2,
+};
+
+print "=== autoSMART SMART Collection Test ===\n\n";
+
+# Test database connection
+print "Testing database connection...\n";
+my $dsn = "DBI:Pg:dbname=$config->{db_name};host=$config->{db_host};port=$config->{db_port}";
+my $dbh = DBI->connect($dsn, $config->{db_user}, $config->{db_pass}, {
+    RaiseError => 1,
+    AutoCommit => 1,
+    PrintError => 0
+}) or die "Failed to connect to database: $DBI::errstr\n";
+
+print "✓ Database connection successful\n\n";
+
+# Initialize collector
+print "Initializing SMART collector...\n";
+my $collector = SmartCollector->new($config);
+print "✓ Collector initialized\n\n";
+
+# Discover available drives
+print "Discovering storage devices...\n";
+my @devices = glob('/dev/sd[a-z]');
+
+for my $device (@devices) {
+    print "Found device: $device\n";
+
+    # Test SMART data collection
+    print "  Collecting SMART data...\n";
+    my $smart_data = $collector->collect_smart_data($device);
+
+    if ($smart_data) {
+        print "  ✓ SMART data collected successfully\n";
+        print "    Serial: $smart_data->{serial_number}\n";
+        print "    Model: $smart_data->{model_name}\n";
+        print "    Temperature: $smart_data->{temperature}°C\n";
+        print "    Parameters: " . scalar(keys %{$smart_data->{parameters}}) . "\n";
+
+        # Create drive info structure
+        my $drive_info = {
+            device_path => $device,
+            serial_number => $smart_data->{serial_number},
+            model_name => $smart_data->{model_name}
+        };
+
+        # Store in database
+        print "  Storing in database...\n";
+        if ($collector->store_smart_data($drive_info, $smart_data)) {
+            print "  ✓ Data stored successfully\n";
+        } else {
+            print "  ✗ Failed to store data\n";
+        }
+
+    } else {
+        print "  ✗ Failed to collect SMART data\n";
+    }
+    print "\n";
+}
+
+# Check database contents
+print "Checking database contents:\n";
+
+my $sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_inventory");
+$sth->execute();
+my ($hdd_count) = $sth->fetchrow_array();
+print "  HDD Inventory: $hdd_count drives\n";
+
+$sth = $dbh->prepare("SELECT COUNT(*) FROM smart_readings");
+$sth->execute();
+my ($reading_count) = $sth->fetchrow_array();
+print "  SMART Readings: $reading_count readings\n";
+
+$sth = $dbh->prepare("SELECT COUNT(*) FROM hdd_migrations");
+$sth->execute();
+my ($migration_count) = $sth->fetchrow_array();
+print "  HDD Migrations: $migration_count migrations\n";
+
+# Show recent readings
+if ($reading_count > 0) {
+    print "\nRecent SMART readings:\n";
+    $sth = $dbh->prepare(q{
+        SELECT hi.serial_number, hi.model_name, sr.timestamp, sr.temperature, sr.reading_type
+        FROM smart_readings sr
+        JOIN hdd_inventory hi ON sr.hdd_id = hi.id
+        ORDER BY sr.timestamp DESC
+        LIMIT 5
+    });
+    $sth->execute();
+
+    while (my $row = $sth->fetchrow_hashref()) {
+        printf "  %s (%s) - %s - %d°C - %s\n",
+            $row->{serial_number},
+            $row->{model_name},
+            $row->{timestamp},
+            $row->{temperature} || 0,
+            $row->{reading_type};
+    }
+}
+
+$dbh->disconnect();
+
+print "\n=== Test Complete ===\n";
@@ -0,0 +1,187 @@
+#!/bin/bash
+
+# autoSMART Uninstaller
+# Version: 1.0
+# Description: Complete removal of autoSMART system to prevent orphaned files
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+INSTALL_DIR="/opt/autoSMART"
+CONFIG_DIR="/etc/autosmart"
+PVE_CONFIG_DIR="/etc/pve/autoSMART"
+SERVICE_NAME="autosmart"
+LOG_DIR="/var/log/autosmart"
+SYSTEMD_SERVICE="/etc/systemd/system/${SERVICE_NAME}.service"
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+log_info() {
+    echo -e "${BLUE}[INFO]${NC} $1"
+}
+
+log_success() {
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
+}
+
+log_warning() {
+    echo -e "${YELLOW}[WARNING]${NC} $1"
+}
+
+log_error() {
+    echo -e "${RED}[ERROR]${NC} $1"
+}
+
+log_info "🗑️ autoSMART Uninstaller v1.0"
+log_info "==============================="
+
+# Check if running as root
+if [[ $EUID -ne 0 ]]; then
+    log_error "This script must be run as root (use sudo)"
+    exit 1
+fi
+
+# Stop and disable systemd service
+if systemctl is-active --quiet "$SERVICE_NAME" 2>/dev/null; then
+    log_info "Stopping autoSMART service..."
+    systemctl stop "$SERVICE_NAME"
+    log_success "Service stopped"
+else
+    log_info "Service is not running"
+fi
+
+if systemctl is-enabled --quiet "$SERVICE_NAME" 2>/dev/null; then
+    log_info "Disabling autoSMART service..."
+    systemctl disable "$SERVICE_NAME"
+    log_success "Service disabled"
+fi
+
+# Remove systemd service file
+if [[ -f "$SYSTEMD_SERVICE" ]]; then
+    log_info "Removing systemd service file..."
+    rm -f "$SYSTEMD_SERVICE"
+    systemctl daemon-reload
+    log_success "Service file removed"
+fi
+
+# Remove installation directory
+if [[ -d "$INSTALL_DIR" ]]; then
+    log_info "Removing installation directory: $INSTALL_DIR"
+    rm -rf "$INSTALL_DIR"
+    log_success "Installation directory removed"
+else
+    log_info "Installation directory does not exist"
+fi
+
+# Remove configuration directory
+if [[ -d "$CONFIG_DIR" ]]; then
+    log_info "Removing configuration directory: $CONFIG_DIR"
+    rm -rf "$CONFIG_DIR"
+    log_success "Configuration directory removed"
+else
+    log_info "Configuration directory does not exist"
+fi
+
+# Remove PVE configuration directory (if exists)
+if [[ -d "$PVE_CONFIG_DIR" ]]; then
+    log_info "Removing PVE configuration directory: $PVE_CONFIG_DIR"
+    rm -rf "$PVE_CONFIG_DIR"
+    log_success "PVE configuration directory removed"
+fi
+
+# Remove log directory
+if [[ -d "$LOG_DIR" ]]; then
+    log_info "Removing log directory: $LOG_DIR"
+    rm -rf "$LOG_DIR"
+    log_success "Log directory removed"
+fi
+
+# Remove cron jobs (if any)
+if crontab -l 2>/dev/null | grep -q "autosmart"; then
+    log_info "Removing autoSMART cron jobs..."
+    (crontab -l 2>/dev/null | grep -v "autosmart") | crontab -
+    log_success "Cron jobs removed"
+fi
+
+# Remove temporary files
+TEMP_FILES=(
+    "/tmp/autosmart*"
+    "/tmp/smart-*"
+    "/var/tmp/autosmart*"
+)
+
+for pattern in "${TEMP_FILES[@]}"; do
+    if ls $pattern 1> /dev/null 2>&1; then  # unquoted on purpose so the glob expands
+        log_info "Removing temporary files: $pattern"
+        rm -rf $pattern
+    fi
+done
+
+# Report the autosmart user if present (left intact; it may be used by the database)
+if id "autosmart" &>/dev/null; then
+    log_warning "Found autosmart user - leaving intact (may be used by database)"
+    log_info "To remove user manually: userdel autosmart"
+fi
+
+# Clean up any remaining processes
+PROCESSES=$(pgrep -f "autosmart|smart-collector" || true)
+if [[ -n "$PROCESSES" ]]; then
+    log_warning "Found running autoSMART processes: $PROCESSES"
+    log_info "Terminating processes..."
+    pkill -f "autosmart|smart-collector" || true
+    sleep 2
+    pkill -9 -f "autosmart|smart-collector" || true
+    log_success "Processes terminated"
+fi
+
+# Remove PATH modifications (if any)
+PROFILE_FILES=(
+    "/etc/profile.d/autosmart.sh"
+    "/etc/bash.bashrc.d/autosmart.sh"
+)
+
+for file in "${PROFILE_FILES[@]}"; do
+    if [[ -f "$file" ]]; then
+        log_info "Removing PATH modification: $file"
+        rm -f "$file"
+    fi
+done
+
+# Clean the package manager cache (clears the whole cache, not only autoSMART dependencies)
+log_info "Cleaning package cache..."
+if command -v apt-get &> /dev/null; then
+    apt-get clean >/dev/null 2>&1 || true
+elif command -v yum &> /dev/null; then
+    yum clean all >/dev/null 2>&1 || true
+fi
+
+# Final verification
+REMAINING_FILES=$(find /etc /opt /var -name "*autosmart*" -o -name "*autoSMART*" 2>/dev/null | head -10)
+if [[ -n "$REMAINING_FILES" ]]; then
+    log_warning "Some autoSMART files may still exist:"
+    echo "$REMAINING_FILES"
+    log_info "These may be database files or manually created configurations"
+fi
+
+log_success "✅ autoSMART uninstallation complete!"
+log_info ""
+log_info "📋 Summary of removed components:"
+log_info "  • Systemd service: $SERVICE_NAME"
+log_info "  • Installation directory: $INSTALL_DIR"
+log_info "  • Configuration directory: $CONFIG_DIR"
+log_info "  • Log directory: $LOG_DIR"
+log_info "  • Temporary files and processes"
+log_info ""
+log_info "💡 Notes:"
+log_info "  • Database data is preserved (not removed)"
+log_info "  • System packages (Perl, PostgreSQL client) are preserved"
+log_info "  • User 'autosmart' is preserved if it exists"
+log_info ""
+log_info "🔄 System is now clean and ready for fresh installation"
+
+exit 0
@@ -0,0 +1,389 @@
+-- autoSMART Database Schema - Fixed for PostgreSQL 15
+-- This version removes problematic syntax and creates a working schema
+
+-- Drop existing tables if they exist
+DROP TABLE IF EXISTS smart_readings CASCADE;
+DROP TABLE IF EXISTS predictions CASCADE;
+DROP TABLE IF EXISTS alert_history CASCADE;
+DROP TABLE IF EXISTS hdd_presence CASCADE;
+DROP TABLE IF EXISTS hdd_inventory CASCADE;
+
+-- Create required extensions
+CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
+CREATE EXTENSION IF NOT EXISTS "btree_gin";
+
+-- HDD Inventory Table (Hardware-based tracking)
+CREATE TABLE hdd_inventory (
+    id SERIAL PRIMARY KEY,
+    serial_number VARCHAR(100) NOT NULL,
+    model_name VARCHAR(200) NOT NULL,
+    firmware VARCHAR(50),
+    size_gb INTEGER,
+    manufacturer VARCHAR(100),
+    current_device_path VARCHAR(50),
+    current_node_id VARCHAR(50),
+    current_slot VARCHAR(20),
+    madagascar_id VARCHAR(100),
+    first_seen TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    last_seen TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    status VARCHAR(20) DEFAULT 'active',
+    status_changed_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    notes TEXT,
+    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+
+    -- Hardware identification constraint
+    CONSTRAINT unique_hardware_id UNIQUE (serial_number, model_name)
+);
+
+-- Create index for device path (but allow NULLs and duplicates)
+CREATE INDEX idx_hdd_inventory_device_path ON hdd_inventory(current_device_path) WHERE current_device_path IS NOT NULL;
+CREATE INDEX idx_hdd_inventory_node ON hdd_inventory(current_node_id);
+CREATE INDEX idx_hdd_inventory_status ON hdd_inventory(status);
+CREATE INDEX idx_hdd_inventory_last_seen ON hdd_inventory(last_seen);
+
+-- HDD Presence Table (tracks HDD mobility across nodes)
+CREATE TABLE hdd_presence (
+    id SERIAL PRIMARY KEY,
+    serial_number VARCHAR(64) NOT NULL,
+    node VARCHAR(64) NOT NULL,
+    data_start TIMESTAMP NOT NULL,
+    data_end TIMESTAMP NOT NULL,
+    is_current BOOLEAN NOT NULL DEFAULT TRUE
+);
+
+CREATE INDEX idx_hdd_presence_serial_current ON hdd_presence(serial_number, is_current);
+CREATE INDEX idx_hdd_presence_node ON hdd_presence(node);
+CREATE INDEX idx_hdd_presence_data_end ON hdd_presence(data_end DESC);
+
+-- SMART Readings Table (with differential storage)
+CREATE TABLE smart_readings (
+    id BIGSERIAL PRIMARY KEY,
+    hdd_id INTEGER REFERENCES hdd_inventory(id),
+    serial_number VARCHAR(100) NOT NULL,
+    device_path VARCHAR(50),
+    node_id VARCHAR(50),
+    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    collection_ok BOOLEAN DEFAULT true,
+    temperature INTEGER,
+    parameters_json JSONB,
+    reading_type VARCHAR(20) DEFAULT 'full',
+    changes_detected BOOLEAN DEFAULT true,
+    changed_parameters JSONB,
+    previous_reading_id BIGINT REFERENCES smart_readings(id),
+    checksum VARCHAR(64)
+);
+
+CREATE INDEX idx_smart_readings_hdd_id ON smart_readings(hdd_id);
+CREATE INDEX idx_smart_readings_timestamp ON smart_readings(timestamp DESC);
+CREATE INDEX idx_smart_readings_serial ON smart_readings(serial_number);
+CREATE INDEX idx_smart_readings_device_path ON smart_readings(device_path);
+CREATE INDEX idx_smart_readings_type ON smart_readings(reading_type);
+CREATE INDEX idx_smart_readings_checksum ON smart_readings(checksum);
+CREATE INDEX idx_smart_readings_previous ON smart_readings(previous_reading_id);
+
+-- GIN index for JSONB parameters
+CREATE INDEX idx_smart_readings_parameters ON smart_readings USING GIN (parameters_json);
+CREATE INDEX idx_smart_readings_changed_params ON smart_readings USING GIN (changed_parameters);
+
+-- Predictions Table
+CREATE TABLE predictions (
+    id SERIAL PRIMARY KEY,
+    hdd_id INTEGER REFERENCES hdd_inventory(id),
+    serial_number VARCHAR(100) NOT NULL,
+    device_path VARCHAR(50),
+    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    risk_level VARCHAR(20),
+    failure_probability DECIMAL(5,4),
+    predicted_failure_date DATE,
+    confidence_score DECIMAL(5,4),
+    analysis_summary TEXT,
+    recommendations JSONB,
+    openai_response JSONB,
+    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
+);
+
+CREATE INDEX idx_predictions_hdd_id ON predictions(hdd_id);
+CREATE INDEX idx_predictions_timestamp ON predictions(timestamp DESC);
+CREATE INDEX idx_predictions_risk_level ON predictions(risk_level);
+CREATE INDEX idx_predictions_serial ON predictions(serial_number);
+
+-- Alert History Table
+CREATE TABLE alert_history (
+    id SERIAL PRIMARY KEY,
+    hdd_id INTEGER REFERENCES hdd_inventory(id),
+    serial_number VARCHAR(100) NOT NULL,
+    alert_type VARCHAR(50),
+    severity VARCHAR(20),
+    message TEXT,
+    sent_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    sent_to TEXT,
+    delivery_status VARCHAR(20) DEFAULT 'pending',
+    related_reading_id BIGINT REFERENCES smart_readings(id),
+    related_prediction_id INTEGER REFERENCES predictions(id)
+);
+
+CREATE INDEX idx_alert_history_hdd_id ON alert_history(hdd_id);
+CREATE INDEX idx_alert_history_sent_at ON alert_history(sent_at DESC);
+CREATE INDEX idx_alert_history_severity ON alert_history(severity);
+CREATE INDEX idx_alert_history_serial ON alert_history(serial_number);
+
+-- System Configuration Table
+CREATE TABLE IF NOT EXISTS system_config (
+    id SERIAL PRIMARY KEY,
+    config_key VARCHAR(100) UNIQUE NOT NULL,
+    value TEXT,
+    description TEXT,
+    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
+);
+
+-- Insert default configuration
+INSERT INTO system_config (config_key, value, description) VALUES
+('collection_interval_seconds', '1800', 'SMART data collection interval in seconds')
+ON CONFLICT (config_key) DO NOTHING;
+
+INSERT INTO system_config (config_key, value, description) VALUES
+('differential_storage_enabled', 'true', 'Enable differential storage optimization'),
+('forced_storage_interval_hours', '24', 'Hours between forced full readings'),
+('critical_parameter_force_store', 'true', 'Force storage for critical parameter changes'),
+('temperature_change_threshold', '5', 'Temperature change threshold for storage (Celsius)')
+ON CONFLICT (config_key) DO NOTHING;
+
+-- SMART Thresholds Table
+CREATE TABLE IF NOT EXISTS smart_thresholds (
+    id SERIAL PRIMARY KEY,
+    parameter_name VARCHAR(100) UNIQUE NOT NULL, -- UNIQUE is required by ON CONFLICT (parameter_name) below
+    warning_threshold NUMERIC,
+    critical_threshold NUMERIC,
+    weight NUMERIC DEFAULT 1.0,
+    enabled BOOLEAN DEFAULT true,
+    description TEXT,
+    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
+    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
+);
+
+-- Insert default SMART thresholds
+INSERT INTO smart_thresholds (parameter_name, warning_threshold, critical_threshold, weight, description) VALUES
+('Reallocated_Sector_Ct', 1, 5, 10.0, 'Reallocated sector count - critical for drive health'),
+('Spin_Retry_Count', 1, 10, 8.0, 'Spindle motor retry attempts'),
+('Reallocated_Event_Count', 1, 10, 9.0, 'Number of reallocation events'),
+('Current_Pending_Sector', 1, 5, 9.5, 'Sectors waiting to be reallocated'),
+('Offline_Uncorrectable', 1, 1, 10.0, 'Uncorrectable sectors found during offline scan'),
+('UDMA_CRC_Error_Count', 10, 50, 5.0, 'Communication errors between drive and controller'),
+('Raw_Read_Error_Rate', 100000, 1000000, 3.0, 'Raw read error rate (varies by manufacturer)'),
+('Seek_Error_Rate', 100000, 1000000, 4.0, 'Seek error rate'),
+('Power_On_Hours', 35000, 50000, 2.0, 'Total power-on time in hours'),
+('Load_Cycle_Count', 100000, 300000, 2.0, 'Number of head load/unload cycles'),
+('Temperature_Celsius', 50, 60, 3.0, 'Drive operating temperature'),
+('Start_Stop_Count', 10000, 50000, 1.0, 'Drive start/stop cycles'),
+('Power_Cycle_Count', 10000, 20000, 1.0, 'Number of power cycles')
+ON CONFLICT (parameter_name) DO NOTHING;
+
+-- Create view for reconstructed SMART data (handles differential storage)
+CREATE VIEW smart_readings_reconstructed AS
+WITH RECURSIVE reading_chain AS (
+    -- Base case: get baseline readings
+    SELECT
+        id, hdd_id, serial_number, timestamp,
+        parameters_json, temperature, reading_type,
+        previous_reading_id, 1 as chain_level
+    FROM smart_readings
+    WHERE reading_type IN ('baseline', 'full')
+
+    UNION ALL
+
+    -- Recursive case: follow the chain of differential readings
+    SELECT
+        sr.id, sr.hdd_id, sr.serial_number, sr.timestamp,
+        -- Merge parameters from previous reading with current changes
+        COALESCE(rc.parameters_json, '{}'::jsonb) || COALESCE(sr.parameters_json, '{}'::jsonb) as parameters_json,
+        COALESCE(sr.temperature, rc.temperature) as temperature,
+        sr.reading_type,
+        sr.previous_reading_id,
+        rc.chain_level + 1
+    FROM smart_readings sr
+    JOIN reading_chain rc ON sr.previous_reading_id = rc.id
+    WHERE sr.reading_type = 'differential'
+)
+SELECT
+    id, hdd_id, serial_number, timestamp,
+    parameters_json, temperature, reading_type,
+    chain_level
+FROM reading_chain;
+
+-- Latest SMART readings for all drives (using reconstructed differential data)
+CREATE VIEW latest_smart_readings AS
+SELECT DISTINCT ON (sr.hdd_id)
+    sr.id,
+    sr.hdd_id,
+    sr.serial_number,
+    sr.timestamp,
+    sr.parameters_json,
+    sr.temperature,
+    hi.model_name,
+    hi.manufacturer,
+    hi.size_gb,
+    hi.current_device_path,
+    hi.current_node_id
+FROM smart_readings_reconstructed sr
+JOIN hdd_inventory hi ON sr.hdd_id = hi.id
+ORDER BY sr.hdd_id, sr.timestamp DESC;
+
+-- Drive health summary view
+CREATE VIEW drive_health_summary AS
+SELECT
+    hi.id as hdd_id,
+    hi.serial_number,
+    hi.model_name,
+    hi.manufacturer,
+    hi.current_device_path,
+    hi.current_node_id,
+    hi.status,
+    lsr.timestamp as last_reading,
+    lsr.temperature,
+    p.risk_level,
+    p.failure_probability,
+    p.predicted_failure_date,
+    EXTRACT(EPOCH FROM (NOW() - lsr.timestamp))/3600 as hours_since_last_reading
+FROM hdd_inventory hi
+LEFT JOIN latest_smart_readings lsr ON hi.id = lsr.hdd_id
+LEFT JOIN LATERAL (
+    SELECT risk_level, failure_probability, predicted_failure_date
+    FROM predictions
+    WHERE hdd_id = hi.id
+    ORDER BY timestamp DESC
+    LIMIT 1
+) p ON true
+WHERE hi.status = 'active';
+
+-- Function to check if SMART reading should be stored (simplified version) |
|
| 261 |
+CREATE OR REPLACE FUNCTION should_store_smart_reading( |
|
| 262 |
+ p_hdd_id INTEGER, |
|
| 263 |
+ p_parameters_json JSONB, |
|
| 264 |
+ p_checksum VARCHAR(64), |
|
| 265 |
+ p_timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW() |
|
| 266 |
+) RETURNS TABLE( |
|
| 267 |
+ should_store BOOLEAN, |
|
| 268 |
+ reading_type VARCHAR(20), |
|
| 269 |
+ changes_detected BOOLEAN, |
|
| 270 |
+ changed_parameters JSONB, |
|
| 271 |
+ previous_reading_id INTEGER |
|
| 272 |
+) AS $$ |
|
| 273 |
+DECLARE |
|
| 274 |
+ v_last_reading RECORD; |
|
| 275 |
+ v_config_enabled BOOLEAN := true; |
|
| 276 |
+ v_force_interval_hours INTEGER := 24; |
|
| 277 |
+ v_temp_threshold INTEGER := 5; |
|
| 278 |
+BEGIN |
|
| 279 |
+ -- Get configuration |
|
| 280 |
+ SELECT (value::boolean) INTO v_config_enabled |
|
| 281 |
+ FROM system_config WHERE config_key = 'differential_storage_enabled'; |
|
| 282 |
+ |
|
| 283 |
+ v_force_interval_hours := COALESCE( -- keep the 24h default when the key is missing |
|
| 284 |
+ (SELECT value::integer FROM system_config WHERE config_key = 'forced_storage_interval_hours'), 24); |
|
| 285 |
+ |
|
| 286 |
+ SELECT (value::integer) INTO v_temp_threshold |
|
| 287 |
+ FROM system_config WHERE config_key = 'temperature_change_threshold'; |
|
| 288 |
+ |
|
| 289 |
+ -- If differential storage is disabled, always store as full |
|
| 290 |
+ IF v_config_enabled IS FALSE OR v_config_enabled IS NULL THEN |
|
| 291 |
+ RETURN QUERY SELECT true, 'full'::varchar(20), true, NULL::jsonb, NULL::integer; |
|
| 292 |
+ RETURN; |
|
| 293 |
+ END IF; |
|
| 294 |
+ |
|
| 295 |
+ -- Get the last reading for this HDD |
|
| 296 |
+ SELECT id, checksum, timestamp, parameters_json, temperature |
|
| 297 |
+ INTO v_last_reading |
|
| 298 |
+ FROM smart_readings |
|
| 299 |
+ WHERE hdd_id = p_hdd_id |
|
| 300 |
+ ORDER BY timestamp DESC |
|
| 301 |
+ LIMIT 1; |
|
| 302 |
+ |
|
| 303 |
+ -- If no previous reading, store as baseline |
|
| 304 |
+ IF v_last_reading IS NULL THEN |
|
| 305 |
+ RETURN QUERY SELECT true, 'baseline'::varchar(20), true, NULL::jsonb, NULL::integer; |
|
| 306 |
+ RETURN; |
|
| 307 |
+ END IF; |
|
| 308 |
+ |
|
| 309 |
+ -- If checksum matches, no changes detected |
|
| 310 |
+ IF v_last_reading.checksum = p_checksum THEN |
|
| 311 |
+ RETURN QUERY SELECT false, 'skipped'::varchar(20), false, NULL::jsonb, v_last_reading.id; |
|
| 312 |
+ RETURN; |
|
| 313 |
+ END IF; |
|
| 314 |
+ |
|
| 315 |
+ -- If forced interval exceeded, store as full |
|
| 316 |
+ IF p_timestamp > v_last_reading.timestamp + make_interval(hours => v_force_interval_hours) THEN |
|
| 317 |
+ RETURN QUERY SELECT true, 'full'::varchar(20), true, NULL::jsonb, v_last_reading.id; |
|
| 318 |
+ RETURN; |
|
| 319 |
+ END IF; |
|
| 320 |
+ |
|
| 321 |
+ -- Otherwise, store as differential |
|
| 322 |
+ RETURN QUERY SELECT true, 'differential'::varchar(20), true, '[]'::jsonb, v_last_reading.id; |
|
| 323 |
+ RETURN; |
|
| 324 |
+END; |
|
| 325 |
+$$ LANGUAGE plpgsql; |
|
| 326 |
+ |
|
| 327 |
+-- Function to update HDD presence tracking |
|
| 328 |
+CREATE OR REPLACE FUNCTION update_hdd_presence( |
|
| 329 |
+ p_serial_number VARCHAR(64), |
|
| 330 |
+ p_node VARCHAR(64) |
|
| 331 |
+) RETURNS VOID AS $$ |
|
| 332 |
+BEGIN |
|
| 333 |
+ -- Mark current presence records for this serial on other nodes as historic |
|
| 334 |
+ UPDATE hdd_presence |
|
| 335 |
+ SET is_current = FALSE |
|
| 336 |
+ WHERE serial_number = p_serial_number AND is_current = TRUE AND node <> p_node; |
|
| 337 |
+ |
|
| 338 |
+ -- Check if there's already a current presence for this serial/node |
|
| 339 |
+ IF EXISTS (SELECT 1 FROM hdd_presence WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE) THEN |
|
| 340 |
+ -- Update data_end for existing current presence |
|
| 341 |
+ UPDATE hdd_presence |
|
| 342 |
+ SET data_end = NOW() |
|
| 343 |
+ WHERE serial_number = p_serial_number AND node = p_node AND is_current = TRUE; |
|
| 344 |
+ ELSE |
|
| 345 |
+ -- Create new presence record |
|
| 346 |
+ INSERT INTO hdd_presence (serial_number, node, data_start, data_end, is_current) |
|
| 347 |
+ VALUES (p_serial_number, p_node, NOW(), NOW(), TRUE); |
|
| 348 |
+ END IF; |
|
| 349 |
+END; |
|
| 350 |
+$$ LANGUAGE plpgsql; |
|
| 351 |
+ |
|
| 352 |
+-- Function to update timestamps |
|
| 353 |
+CREATE OR REPLACE FUNCTION update_timestamp() RETURNS TRIGGER AS $$ |
|
| 354 |
+BEGIN |
|
| 355 |
+ NEW.updated_at = NOW(); |
|
| 356 |
+ RETURN NEW; |
|
| 357 |
+END; |
|
| 358 |
+$$ LANGUAGE plpgsql; |
|
| 359 |
+ |
|
| 360 |
+-- Create triggers for timestamp updates |
|
| 361 |
+CREATE TRIGGER update_hdd_inventory_timestamp |
|
| 362 |
+ BEFORE UPDATE ON hdd_inventory |
|
| 363 |
+ FOR EACH ROW EXECUTE FUNCTION update_timestamp(); |
|
| 364 |
+ |
|
| 365 |
+CREATE TRIGGER update_smart_thresholds_timestamp |
|
| 366 |
+ BEFORE UPDATE ON smart_thresholds |
|
| 367 |
+ FOR EACH ROW EXECUTE FUNCTION update_timestamp(); |
|
| 368 |
+ |
|
| 369 |
+CREATE TRIGGER update_system_config_timestamp |
|
| 370 |
+ BEFORE UPDATE ON system_config |
|
| 371 |
+ FOR EACH ROW EXECUTE FUNCTION update_timestamp(); |
|
| 372 |
+ |
|
| 373 |
+-- Grant permissions to autosmart user |
|
| 374 |
+GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO autosmart; |
|
| 375 |
+GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO autosmart; |
|
| 376 |
+GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA public TO autosmart; |
|
| 377 |
+ |
|
| 378 |
+-- Grant specific permissions for hdd_presence table |
|
| 379 |
+GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE hdd_presence TO autosmart; |
|
| 380 |
+ |
|
| 381 |
+-- Final message |
|
| 382 |
+DO $$ |
|
| 383 |
+BEGIN |
|
| 384 |
+ RAISE NOTICE 'autoSMART database schema deployed successfully!'; |
|
| 385 |
+ RAISE NOTICE 'Tables created: hdd_inventory, hdd_presence, smart_readings, predictions, smart_thresholds, alert_history, system_config'; |
|
| 386 |
+ RAISE NOTICE 'Views created: smart_readings_reconstructed, latest_smart_readings, drive_health_summary'; |
|
| 387 |
+ RAISE NOTICE 'Functions created: update_hdd_presence(), should_store_smart_reading(), update_timestamp()'; |
|
| 388 |
+ RAISE NOTICE 'Permissions granted to autosmart user'; |
|
| 389 |
+END $$; |
|
@@ -0,0 +1 @@ |
||
| 1 |
+Subproject commit f6606003657b44138d3c80b234e65074146345a9 |
|
@@ -0,0 +1,102 @@ |
||
| 1 |
+# PGS - Changelog |
|
| 2 |
+ |
|
| 3 |
+## [1.5] - 2026-03-07 |
|
| 4 |
+ |
|
| 5 |
+### Added |
|
| 6 |
+- Added `pgs cleanup` to scan image storages for orphan/stale `vm-*-state-suspend-YYYY-MM-DD.raw` volumes and remove them safely |
|
| 7 |
+ |
|
| 8 |
+### Fixed |
|
| 9 |
+- Stopped VMs are no longer classified as "already suspended to disk" from config flags alone; `bin/pgs` now requires `lock: suspended`, `vmstate:`, and a resolvable backing saved-state volume |
|
| 10 |
+- Added cleanup for inconsistent suspend artifacts on stopped VMs, including stale suspend locks, stale `vmstate:` metadata, and orphaned saved-state volumes on storage |
|
| 11 |
+- `pgs suspend` now runs suspend-artifact cleanup as a preflight, reducing same-day collisions with stale `state-suspend` volumes |
|
| 12 |
+- Cleanup explicitly ignores `vm-*-state-cp*.raw` checkpoint files and only targets `vm-*-state-suspend-YYYY-MM-DD.raw` |
|
| 13 |
+- Repeated `pgs suspend` runs now merge with the existing state file instead of discarding prior `to_resume` intent |
|
| 14 |
+- State now records `vm_details.suspend_volume` and `vm_details.suspend_file_date`, and `resume` skips auto-restore when a VM's suspend artifact changed after the state was saved |
|
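The cleanup scope described in the 1.5 entries can be illustrated with a minimal sketch. The function name `is_suspend_artifact` and the sample volume IDs are hypothetical; only the file-name patterns come from the changelog.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for dated suspend-state volumes.
is_suspend_artifact() {
    local name="${1##*/}"   # drop any storage or directory prefix
    [[ "$name" =~ ^vm-[0-9]+-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ ]]
}

# Made-up volume IDs, used only to exercise the pattern.
for candidate in \
    "local:100/vm-100-state-suspend-2026-03-07.raw" \
    "local:100/vm-100-state-cp1.raw" \
    "local:100/vm-100-disk-0.raw"; do
    if is_suspend_artifact "$candidate"; then
        echo "cleanup candidate: $candidate"
    else
        echo "ignored: $candidate"
    fi
done
```

Anchoring the pattern at both ends is what keeps checkpoint files such as `vm-*-state-cp*.raw` out of scope.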
| 15 |
+ |
|
| 16 |
+## [1.4] - 2026-03-06 |
|
| 17 |
+ |
|
| 18 |
+### Changed |
|
| 19 |
+- Standardized install layout around `xdev` paths for uninstall, documentation, and runtime state |
|
| 20 |
+- Added dedicated `scripts/install.sh` and `scripts/uninstall.sh` and reduced `setup.sh` to a local/remote wrapper |
|
| 21 |
+- Updated `bin/pgs` to migrate legacy state from `/var/lib/pve-manager/pgs-state.json` to `/var/lib/xdev/pve-guests-state/pgs-state.json` |
|
| 22 |
+- Promoted `bin/pgs` as the canonical executable and removed the duplicate top-level `pgs` file |
|
| 23 |
+- Marked `systemd/` artifacts as legacy reference material instead of active install targets |
|
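A one-time state migration like the one mentioned above could look roughly like this. It is a sketch under assumptions: the function name `migrate_legacy_state`, the choice never to overwrite an existing new state file, and the sample JSON are all illustrative and not taken from `bin/pgs`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: move a legacy state file to the new location once.
migrate_legacy_state() {
    local legacy_file="$1" state_file="$2"

    # Nothing to do without a legacy file; never overwrite newer state (assumption).
    [[ -f "$legacy_file" ]] || return 0
    [[ -e "$state_file" ]] && return 0

    mkdir -p "$(dirname "$state_file")" || return 1
    mv "$legacy_file" "$state_file"
}

# Demo in a scratch directory instead of the real /var/lib paths.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/pve-manager"
echo '{"to_resume":[]}' > "$demo_root/pve-manager/pgs-state.json"   # sample content
migrate_legacy_state \
    "$demo_root/pve-manager/pgs-state.json" \
    "$demo_root/xdev/pve-guests-state/pgs-state.json"
ls "$demo_root/xdev/pve-guests-state"
```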
| 24 |
+ |
|
| 25 |
+### Fixed |
|
| 26 |
+- Fixed documentation to reflect the current manual workflow and the standardized host layout |
|
| 27 |
+ |
|
| 28 |
+## [1.2] - 2026-03-05 |
|
| 29 |
+ |
|
| 30 |
+### Added |
|
| 31 |
+- LXC container (CT) support: graceful shutdown before maintenance, auto-start after maintenance |
|
| 32 |
+- New `ct_to_start` array in state JSON for CT restoration |
|
| 33 |
+- `load_ct_info()` function using single `pct list` call |
|
| 34 |
+- `shutdown_ct()` function with 120s timeout for graceful shutdown |
|
| 35 |
+- `start_ct()` function for post-maintenance startup |
|
| 36 |
+- TODO placeholder for critical VM/CT migration support |
|
| 37 |
+ |
|
| 38 |
+### Changed |
|
| 39 |
+- State file now includes `ct_to_start` array |
|
| 40 |
+- Suspend operation processes VMs then CTs |
|
| 41 |
+- Resume operation resumes VMs then starts CTs |
|
| 42 |
+ |
|
| 43 |
+### Fixed |
|
| 44 |
+- Fixed `pct list` column parsing (Status/Lock/Name column order) |
|
| 45 |
+- Handle empty Lock column in `pct list` output |
|
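The empty Lock column issue can be reproduced without a Proxmox host. The sketch below parses sample text standing in for `pct list` output; the container IDs, names, and the extra `CT_LOCK` array are made up for illustration.

```bash
#!/usr/bin/env bash
declare -A CT_STATUS CT_NAME CT_LOCK

# Reads `pct list`-style text from stdin (columns: VMID Status Lock Name).
parse_ct_list() {
    local vmid status lock name
    while read -r vmid status lock name; do
        [[ "$vmid" == "VMID" || -z "$vmid" ]] && continue   # skip header and blank lines
        if [[ -z "$name" ]]; then
            # Lock column was empty, so `read` put the name into $lock.
            name="$lock"
            lock=""
        fi
        CT_STATUS[$vmid]="$status"
        CT_LOCK[$vmid]="$lock"
        CT_NAME[$vmid]="$name"
    done
}

# Sample text standing in for real command output.
parse_ct_list <<'EOF'
VMID       Status     Lock         Name
101        running                 web
102        stopped    backup       db
EOF

echo "101: ${CT_STATUS[101]} name=${CT_NAME[101]} lock='${CT_LOCK[101]}'"
echo "102: ${CT_STATUS[102]} name=${CT_NAME[102]} lock='${CT_LOCK[102]}'"
```

Because `read` splits on whitespace, a missing Lock value shifts the name one field to the left; checking for an empty last field is a simple way to detect that.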
| 46 |
+ |
|
| 47 |
+## [1.1] - 2026-03-05 |
|
| 48 |
+ |
|
| 49 |
+### Fixed |
|
| 50 |
+- Fixed `load_state()` outputting log messages to stdout, corrupting JSON parsing |
|
| 51 |
+- Fixed empty arrays in JSON state file (was generating `[""]` instead of `[]`) |
|
| 52 |
+- Fixed paused VMs being treated as "running" - now properly detects `paused` status |
|
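For the empty-array fix, here is a minimal sketch of building a JSON array from a bash array without producing `[""]`. The helper `json_array` is hypothetical and assumes values that need no JSON escaping, such as numeric VMIDs; it does not claim to be how `bin/pgs` does it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the arguments as a JSON array of strings.
# Assumes values need no JSON escaping (e.g. numeric VMIDs).
json_array() {
    local out="" item
    for item in "$@"; do
        out+="${out:+,}\"${item}\""
    done
    printf '[%s]\n' "$out"
}

vms=()
json_array "${vms[@]}"      # prints []
vms=(100 101)
json_array "${vms[@]}"      # prints ["100","101"]
```

Expanding with `"${vms[@]}"` yields zero arguments for an empty array, whereas quoting the whole list as one string would pass a single empty value.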
| 53 |
+ |
|
| 54 |
+### Changed |
|
| 55 |
+- Optimized VM info loading: single `qm list` call instead of per-VM calls |
|
| 56 |
+- Optimized suspend lock detection: read config files directly, no extra `qm` calls |
|
| 57 |
+- Optimized status checking: only verify actual status for "running" VMs, rest trust `qm list` |
|
| 58 |
+- Reduced scan time from ~180 seconds to ~2.5 seconds for 30+ VMs |
|
| 59 |
+ |
|
| 60 |
+### Added |
|
| 61 |
+- Proper systemd service setup for manual suspend before maintenance |
|
| 62 |
+- Proper systemd service setup for manual resume after maintenance |
|
| 63 |
+- Better handling of paused VMs: suspend to disk but don't auto-resume |
|
| 64 |
+- Comprehensive journal logging with severity levels (INFO, WARNING, ERROR, SUCCESS) |
|
| 65 |
+- Dry-run mode for testing without effects |
|
| 66 |
+ |
|
| 67 |
+## [1.0] - 2026-03-05 |
|
| 68 |
+ |
|
| 69 |
+### Initial Release |
|
| 70 |
+- Basic suspend/resume functionality |
|
| 71 |
+- State file preservation |
|
| 72 |
+- Manual testing scripts |
|
| 73 |
+ |
|
| 74 |
+--- |
|
| 75 |
+ |
|
| 76 |
+## Performance Improvements |
|
| 77 |
+ |
|
| 78 |
+| Operation | v1.0 | v1.1 | Improvement | |
|
| 79 |
+|-----------|------|------|-------------| |
|
| 80 |
+| Scan 30 VMs | ~180s | ~2.5s | **72x faster** | |
|
| 81 |
+| System calls | Per-VM qm calls | Single qm list + file I/O | **Drastically reduced** | |
|
| 82 |
+ |
|
| 83 |
+## Known Limitations |
|
| 84 |
+ |
|
| 85 |
+- Requires passwordless SSH for cluster-wide operations |
|
| 86 |
+- No critical VM/CT migration support yet (TODO) |
|
| 87 |
+ |
|
| 88 |
+## Testing |
|
| 89 |
+ |
|
| 90 |
+Tested on: |
|
| 91 |
+- Proxmox VE 8.x with 30+ VMs and CTs |
|
| 92 |
+- Mixed VM configurations (4GB-16GB RAM) |
|
| 93 |
+- LXC containers with running services |
|
| 94 |
+- Storage: local-dir, NFS mount points |
|
| 95 |
+ |
|
| 96 |
+## Future Enhancements |
|
| 97 |
+ |
|
| 98 |
+- [x] Support for LXC container shutdown (added in 1.2) |
|
| 99 |
+- [ ] Configurable exclusion list for VMs |
|
| 100 |
+- [ ] Metrics/performance monitoring |
|
| 101 |
+- [ ] Multi-node coordination for cluster-wide operations |
|
| 102 |
+- [ ] Backup integration for backup snapshots before suspend |
|
@@ -0,0 +1,81 @@ |
||
| 1 |
+# Installation
+
+## Requirements
+
+- Proxmox VE node with root access
+- `jq` available on the host
+- SSH access for remote installation
+
+## Recommended method
+
+The [setup.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/setup.sh) wrapper is the standard way to install and uninstall.
+
+### Local installation
+
+```bash
+sudo ./setup.sh --local
+```
+
+### Remote installation
+
+```bash
+sudo ./setup.sh <node>
+sudo ./setup.sh --user admin <node>
+```
+
+## What gets installed
+
+- `/usr/local/sbin/pgs`
+- `/usr/local/lib/xdev/pve-guests-state/uninstall.sh`
+- `/usr/local/sbin/xdev-pve-guests-state-uninstall`
+- `/usr/local/share/doc/xdev/pve-guests-state/*`
+- runtime state in `/var/lib/xdev/pve-guests-state/`
+
+## Post-install verification
+
+```bash
+/usr/local/sbin/pgs suspend --dry-run -v
+journalctl -t pgs -n 20
+```
+
+## Uninstall
+
+### Recommended method
+
+```bash
+sudo ./setup.sh --local --uninstall
+sudo ./setup.sh --uninstall <node>
+```
+
+### Directly on the host
+
+```bash
+sudo /usr/local/lib/xdev/pve-guests-state/uninstall.sh
+```
+
+## Reinstall
+
+The supported flow is:
+
+```text
+uninstall -> install
+```
+
+In practice:
+- if a current install already exists, the installer first runs the canonical uninstall
+- reinstalling directly over files left behind by an old version is not the recommended workflow
+
+## State file
+
+Current location:
+
+```bash
+cat /var/lib/xdev/pve-guests-state/pgs-state.json
+```
+
+Compatibility:
+- if the old file `/var/lib/pve-manager/pgs-state.json` exists, the new version migrates it automatically
+
+## Legacy systemd units
+
+The files in [systemd](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/systemd) are kept only as historical reference. The current scripts do not install them; on the contrary, they remove them if present on the host.
|
@@ -0,0 +1,25 @@ |
||
| 1 |
+BSD 2-Clause License |
|
| 2 |
+ |
|
| 3 |
+Copyright (c) 2026, Proxmox VE Utilities |
|
| 4 |
+All rights reserved. |
|
| 5 |
+ |
|
| 6 |
+Redistribution and use in source and binary forms, with or without |
|
| 7 |
+modification, are permitted provided that the following conditions are met: |
|
| 8 |
+ |
|
| 9 |
+1. Redistributions of source code must retain the above copyright notice, this |
|
| 10 |
+ list of conditions and the following disclaimer. |
|
| 11 |
+ |
|
| 12 |
+2. Redistributions in binary form must reproduce the above copyright notice, |
|
| 13 |
+ this list of conditions and the following disclaimer in the documentation |
|
| 14 |
+ and/or other materials provided with the distribution. |
|
| 15 |
+ |
|
| 16 |
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" |
|
| 17 |
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE |
|
| 18 |
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE |
|
| 19 |
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE |
|
| 20 |
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL |
|
| 21 |
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR |
|
| 22 |
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER |
|
| 23 |
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, |
|
| 24 |
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE |
|
| 25 |
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
|
@@ -0,0 +1,101 @@ |
||
| 1 |
+# PGS
+
+`pve-guests-state` is the manual utility for suspending and restoring Proxmox guests before and after maintenance work.
+
+The supported model is deliberately simple:
+- `pgs suspend` is run manually before maintenance
+- `pgs resume` is run manually once the cluster is stable again
+- `pgs cleanup` can be run manually to audit or clean up stale suspend artifacts
+
+Automation through systemd for shutdown and boot was abandoned intentionally. The full context is in [docs/DECISIONS.md](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/docs/DECISIONS.md).
+
+## Capabilities
+
+- suspend to disk for running QEMU VMs
+- graceful shutdown for running LXC containers
+- resume for the VMs saved in the state
+- start for the containers saved in the state
+- cleanup of stale suspend images
+- cleanup of orphan `vm-*-state-suspend-YYYY-MM-DD.raw` volumes
+- retry for certain quorum-related errors
+- dry-run for checking without side effects
+
+## Project layout
+
+- [bin/pgs](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/bin/pgs) - main command
+- [scripts/install.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/scripts/install.sh) - local installation on the host
+- [scripts/uninstall.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/scripts/uninstall.sh) - canonical uninstall
+- [setup.sh](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/setup.sh) - local/remote wrapper
+- [docs/TECHNICAL.md](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/docs/TECHNICAL.md) - technical details
+- [systemd/README.md](/Users/bogdan/Documents/Workspaces/Xdev/Madagascar/cluster/projects/pve-guests-state/systemd/README.md) - status of the legacy units
+
+## Installed locations on the host
+
+- operator command: `/usr/local/sbin/pgs`
+- canonical uninstall: `/usr/local/lib/xdev/pve-guests-state/uninstall.sh`
+- optional uninstall wrapper: `/usr/local/sbin/xdev-pve-guests-state-uninstall`
+- installed documentation: `/usr/local/share/doc/xdev/pve-guests-state`
+- runtime state: `/var/lib/xdev/pve-guests-state/pgs-state.json`
+
+Compatibility:
+- on the first run, if the old state file `/var/lib/pve-manager/pgs-state.json` exists, it is migrated automatically to the new location
+- the installer and uninstaller also clean up the historical artifacts `pve-reboot-manager.sh` and `pve-guest-state.sh`
+
+## Quick start
+
+```bash
+# local installation
+sudo ./setup.sh --local
+
+# test
+/usr/local/sbin/pgs suspend --dry-run -v
+/usr/local/sbin/pgs cleanup --dry-run -v
+
+# suspend before maintenance
+/usr/local/sbin/pgs suspend -v
+
+# resume once the cluster is back
+/usr/local/sbin/pgs resume -v
+
+# manual cleanup of stale/orphan artifacts
+/usr/local/sbin/pgs cleanup -v
+```
+
+## Install and uninstall
+
+Install:
+
+```bash
+sudo ./setup.sh --local
+sudo ./setup.sh <node>
+```
+
+Uninstall:
+
+```bash
+sudo ./setup.sh --local --uninstall
+sudo ./setup.sh --uninstall <node>
+```
+
+Or directly on the host:
+
+```bash
+sudo /usr/local/lib/xdev/pve-guests-state/uninstall.sh
+```
+
+## Operational notes
+
+- the project does not install any persistent configuration of its own in `/etc`
+- the project does not install active systemd units; the old ones are only historical artifacts and are removed at install/uninstall
+- after a fully successful `resume`, the state file is deleted
+- if `resume` hits errors, the state file is kept for a retry
+- `cleanup` and the preflight in `suspend` touch only `vm-*-state-suspend-YYYY-MM-DD.raw` files; `vm-*-state-cp*.raw` files and other variants are left untouched
+- a new `suspend` over an existing state file merges into it rather than resetting the list of guests to restore
+- the state file also records `suspend_volume`/`suspend_file_date` per VM, to detect guests altered after the state was saved
|
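The merge behaviour in the notes above amounts to a duplicate-free union of the old and new guest lists. A minimal sketch, assuming plain VMID lists; the function name `merge_unique` and the sample IDs are hypothetical and not taken from `bin/pgs`:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: union of VMID lists, preserving first-seen order.
merge_unique() {
    local -A seen=()
    local item
    for item in "$@"; do
        [[ -n "${seen[$item]:-}" ]] && continue   # already emitted
        seen[$item]=1
        printf '%s\n' "$item"
    done
}

previous=(100 101)      # made-up VMIDs from an earlier state file
current=(101 105)       # made-up VMIDs found by the new suspend run
mapfile -t merged < <(merge_unique "${previous[@]}" "${current[@]}")
echo "${merged[*]}"     # prints: 100 101 105
```

Keeping the earlier entries first means a guest recorded by the first `suspend` stays on the restore list even if the second run no longer sees it as running.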
| 95 |
+ |
|
| 96 |
+## Quick debugging
+
+```bash
+journalctl -t pgs -n 50
+cat /var/lib/xdev/pve-guests-state/pgs-state.json
+```
|
@@ -0,0 +1,1352 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+# pgs |
|
| 4 |
+# Manages VM and CT suspend/shutdown for planned maintenance. |
|
| 5 |
+# |
|
| 6 |
+# Before maintenance (suspend mode): |
|
| 7 |
+# - Suspends all running VMs to disk |
|
| 8 |
+# - Gracefully shuts down all running CTs |
|
| 9 |
+# - Saves state to a list for restoration |
|
| 10 |
+# - VMs already suspended to disk: logged as warning, not auto-resumed |
|
| 11 |
+# - VMs suspended to RAM: suspended to disk but not auto-resumed (preserving user intent) |
|
| 12 |
+# |
|
| 13 |
+# After maintenance (resume mode): |
|
| 14 |
+# - Resumes VMs from the saved list |
|
| 15 |
+# - Starts CTs from the saved list |
|
| 16 |
+# - Logs warnings for VMs/CTs skipped |
|
| 17 |
+# - Logs errors for VMs/CTs that fail to resume/start |
|
| 18 |
+# |
|
| 19 |
+# Usage: pgs suspend|resume|cleanup [--dry-run] [-v] |
|
| 20 |
+# |
|
| 21 |
+# Version: 1.4 - Standardized xdev state path with legacy state migration |
|
| 22 |
+# |
|
| 23 |
+# TODO: Implement critical VM/CT migration support. |
|
| 24 |
+# Critical guests (tagged or listed) should be live-migrated to another |
|
| 25 |
+# node before maintenance instead of suspended/stopped. Rules TBD: |
|
| 26 |
+# - Which guests are critical (tag? config flag? external list?) |
|
| 27 |
+# - Target node selection (least loaded? affinity rules?) |
|
| 28 |
+# - Fallback if migration fails (suspend locally?) |
|
| 29 |
+# - Post-maintenance: migrate back or leave on target node? |
|
| 30 |
+ |
|
| 31 |
+PROJECT_ID="pve-guests-state" |
|
| 32 |
+ORG_ID="xdev" |
|
| 33 |
+DEFAULT_STATE_DIR="/var/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 34 |
+LEGACY_STATE_DIR="/var/lib/pve-manager" |
|
| 35 |
+LEGACY_STATE_FILE="${LEGACY_STATE_DIR}/pgs-state.json"
|
|
| 36 |
+STATE_DIR="${PGS_STATE_DIR:-${DEFAULT_STATE_DIR}}"
|
|
| 37 |
+STATE_FILE="${STATE_DIR}/pgs-state.json"
|
|
| 38 |
+LOCK_FILE="/run/pgs.lock" |
|
| 39 |
+SCRIPT_NAME=$(basename "$0") |
|
| 40 |
+ |
|
| 41 |
+DRY_RUN=0 |
|
| 42 |
+VERBOSE=0 |
|
| 43 |
+QUORUM_RELAXED=0 |
|
| 44 |
+ |
|
| 45 |
+# Associative arrays for VM data (populated once) |
|
| 46 |
+declare -A VM_STATUS |
|
| 47 |
+declare -A VM_NAME |
|
| 48 |
+declare -A VM_HAS_LOCK |
|
| 49 |
+declare -A VM_VMSTATE |
|
| 50 |
+declare -A VMSTATE_TO_VMID |
|
| 51 |
+ |
|
| 52 |
+# Associative arrays for CT data (populated once) |
|
| 53 |
+declare -A CT_STATUS |
|
| 54 |
+declare -A CT_NAME |
|
| 55 |
+ |
|
| 56 |
+# Logging functions. |
|
| 57 |
+# When running inside systemd (JOURNAL_STREAM is set), stdout goes directly to |
|
| 58 |
+# the journal - calling logger in addition causes duplicate entries. When running |
|
| 59 |
+# interactively, use both echo (terminal) and logger (journal archive). |
|
| 60 |
+_log() {
|
|
| 61 |
+ local level="$1" prefix="$2"; shift 2 |
|
| 62 |
+ echo "$prefix $*" |
|
| 63 |
+ [[ -z "${JOURNAL_STREAM:-}" ]] && logger -t "$SCRIPT_NAME" -p "$level" "$*"
|
|
| 64 |
+} |
|
| 65 |
+ |
|
| 66 |
+log_info() {
|
|
| 67 |
+ # When in systemd: always log regardless of VERBOSE (journal is the destination) |
|
| 68 |
+ # When interactive: only log if -v is set |
|
| 69 |
+ if [[ -n "${JOURNAL_STREAM:-}" ]] || [[ $VERBOSE -ge 1 ]]; then
|
|
| 70 |
+ _log user.info "[INFO]" "$@" |
|
| 71 |
+ fi |
|
| 72 |
+} |
|
| 73 |
+ |
|
| 74 |
+log_debug() {
|
|
| 75 |
+ if [[ -n "${JOURNAL_STREAM:-}" ]] || [[ $VERBOSE -ge 2 ]]; then
|
|
| 76 |
+ _log user.debug "[DEBUG]" "$@" |
|
| 77 |
+ fi |
|
| 78 |
+} |
|
| 79 |
+ |
|
| 80 |
+log_warning() {
|
|
| 81 |
+ _log user.warning "[WARNING]" "$@" |
|
| 82 |
+} |
|
| 83 |
+ |
|
| 84 |
+log_error() {
|
|
| 85 |
+ echo "[ERROR] $*" >&2 |
|
| 86 |
+ [[ -z "${JOURNAL_STREAM:-}" ]] && logger -t "$SCRIPT_NAME" -p user.err "$*"
|
|
| 87 |
+} |
|
| 88 |
+ |
|
| 89 |
+log_success() {
|
|
| 90 |
+ _log user.notice "[SUCCESS]" "$@" |
|
| 91 |
+} |
|
| 92 |
+ |
|
| 93 |
+usage() {
|
|
| 94 |
+ cat <<EOF |
|
| 95 |
+Usage: $0 suspend|resume|cleanup [OPTIONS] |
|
| 96 |
+ |
|
| 97 |
+Manage VM and CT suspend/shutdown for planned maintenance. |
|
| 98 |
+ |
|
| 99 |
+Commands: |
|
| 100 |
+ suspend Suspend running VMs to disk, shutdown running CTs |
|
| 101 |
+ resume Resume VMs and start CTs from saved state |
|
| 102 |
+ cleanup Remove stale suspend artifacts from config and storage |
|
| 103 |
+ |
|
| 104 |
+Options: |
|
| 105 |
+ -n, --dry-run Show what would be done without making changes |
|
| 106 |
+ -v, --verbose Print informational messages (-vv adds debug detail) |
|
| 107 |
+ -h, --help Display this help and exit |
|
| 108 |
+ |
|
| 109 |
+Examples: |
|
| 110 |
+ $0 suspend # Suspend VMs, shutdown CTs |
|
| 111 |
+ $0 resume # Resume VMs, start CTs |
|
| 112 |
+ $0 cleanup -v # Remove orphan/stale suspend artifacts |
|
| 113 |
+ $0 cleanup -vv # Include real filesystem paths in cleanup output |
|
| 114 |
+ $0 suspend --dry-run # Show what would happen |
|
| 115 |
+EOF |
|
| 116 |
+} |
|
| 117 |
+ |
|
| 118 |
+refresh_vm_artifact_metadata() {
|
|
| 119 |
+ VM_HAS_LOCK=() |
|
| 120 |
+ VM_VMSTATE=() |
|
| 121 |
+ VMSTATE_TO_VMID=() |
|
| 122 |
+ |
|
| 123 |
+ for conf in /etc/pve/qemu-server/*.conf; do |
|
| 124 |
+ [[ ! -f "$conf" ]] && continue |
|
| 125 |
+ local vmid; vmid=$(basename "$conf" .conf) |
|
| 126 |
+ if grep -q '^lock: suspended$' "$conf" 2>/dev/null; then |
|
| 127 |
+ VM_HAS_LOCK[$vmid]=1 |
|
| 128 |
+ fi |
|
| 129 |
+ local vmstate |
|
| 130 |
+ vmstate=$(awk -F': ' '/^vmstate: / {print $2; exit}' "$conf" 2>/dev/null)
|
|
| 131 |
+ if [[ -n "$vmstate" ]]; then |
|
| 132 |
+ VM_VMSTATE[$vmid]="$vmstate" |
|
| 133 |
+ VMSTATE_TO_VMID[$vmstate]="$vmid" |
|
| 134 |
+ fi |
|
| 135 |
+ done |
|
| 136 |
+} |
|
| 137 |
+ |
|
| 138 |
+load_vm_config_metadata() {
|
|
| 139 |
+ VM_STATUS=() |
|
| 140 |
+ VM_NAME=() |
|
| 141 |
+ |
|
| 142 |
+ while read -r vmid name status _rest; do |
|
| 143 |
+ [[ "$vmid" == "VMID" ]] && continue |
|
| 144 |
+ VM_NAME[$vmid]="$name" |
|
| 145 |
+ done < <(qm list 2>/dev/null) |
|
| 146 |
+ |
|
| 147 |
+ refresh_vm_artifact_metadata |
|
| 148 |
+} |
|
| 149 |
+ |
|
| 150 |
+# Load all VM info in one pass - FAST |
|
| 151 |
+load_vm_info() {
|
|
| 152 |
+ load_vm_config_metadata |
|
| 153 |
+ |
|
| 154 |
+ # Get status and name from qm list (single call) |
|
| 155 |
+ while read -r vmid name status _rest; do |
|
| 156 |
+ [[ "$vmid" == "VMID" ]] && continue # skip header |
|
| 157 |
+ VM_STATUS[$vmid]="$status" |
|
| 158 |
+ VM_NAME[$vmid]="$name" |
|
| 159 |
+ done < <(qm list 2>/dev/null) |
|
| 160 |
+ |
|
| 161 |
+ # For "running" VMs, get actual status (qm list shows "running" for paused/suspended VMs) |
|
| 162 |
+ # This is only a few VMs so the overhead is acceptable |
|
| 163 |
+ for vmid in "${!VM_STATUS[@]}"; do
|
|
| 164 |
+ if [[ "${VM_STATUS[$vmid]}" == "running" ]]; then
|
|
| 165 |
+ local real_status |
|
| 166 |
+ real_status=$(qm status "$vmid" 2>/dev/null | awk '{print $2}')
|
|
| 167 |
+ [[ -n "$real_status" ]] && VM_STATUS[$vmid]="$real_status" |
|
| 168 |
+ fi |
|
| 169 |
+ done |
|
| 170 |
+} |
|
| 171 |
+ |
|
| 172 |
+array_contains() {
|
|
| 173 |
+ local needle="$1" |
|
| 174 |
+ shift |
|
| 175 |
+ local item |
|
| 176 |
+ for item in "$@"; do |
|
| 177 |
+ [[ "$item" == "$needle" ]] && return 0 |
|
| 178 |
+ done |
|
| 179 |
+ return 1 |
|
| 180 |
+} |
|
| 181 |
+ |
|
| 182 |
+append_unique() {
|
|
| 183 |
+ local -n target_ref=$1 |
|
| 184 |
+ local value="$2" |
|
| 185 |
+ |
|
| 186 |
+ array_contains "$value" "${target_ref[@]}" || target_ref+=("$value")
|
|
| 187 |
+} |
|
| 188 |
+ |
|
| 189 |
+remove_value() {
|
|
| 190 |
+ local -n target_ref=$1 |
|
| 191 |
+ local value="$2" |
|
| 192 |
+ local filtered=() |
|
| 193 |
+ local item |
|
| 194 |
+ |
|
| 195 |
+ for item in "${target_ref[@]}"; do
|
|
| 196 |
+ [[ "$item" == "$value" ]] && continue |
|
| 197 |
+ filtered+=("$item")
|
|
| 198 |
+ done |
|
| 199 |
+ |
|
| 200 |
+ target_ref=("${filtered[@]}")
|
|
| 201 |
+} |
|
| 202 |
+ |
|
| 203 |
+extract_suspend_file_date() {
|
|
| 204 |
+ local vmid="$1" |
|
| 205 |
+ local volume="$2" |
|
| 206 |
+ local volume_name="${volume##*/}"
|
|
| 207 |
+ |
|
| 208 |
+ if [[ "$volume_name" =~ ^vm-${vmid}-state-suspend-([0-9]{4}-[0-9]{2}-[0-9]{2})\.raw$ ]]; then
|
|
| 209 |
+ echo "${BASH_REMATCH[1]}"
|
|
| 210 |
+ fi |
|
| 211 |
+} |
|
| 212 |
+ |
|
| 213 |
+# Load all CT info in one pass - FAST |
|
| 214 |
+load_ct_info() {
|
|
| 215 |
+ # pct list columns: VMID Status Lock Name |
|
| 216 |
+ # When Lock is empty, read shifts Name into the lock variable |
|
| 217 |
+ while read -r vmid status lock name; do |
|
| 218 |
+ [[ "$vmid" == "VMID" ]] && continue # skip header |
|
| 219 |
+ if [[ -z "$name" ]]; then |
|
| 220 |
+ # No lock present: lock actually holds the name |
|
| 221 |
+ name="$lock" |
|
| 222 |
+ lock="" |
|
| 223 |
+ fi |
|
| 224 |
+ CT_STATUS[$vmid]="$status" |
|
| 225 |
+ CT_NAME[$vmid]="$name" |
|
| 226 |
+ done < <(pct list 2>/dev/null) |
|
| 227 |
+} |
|
| 228 |
+ |
|
| 229 |
+# Get VM name (from cache) |
|
| 230 |
+get_vm_name() {
|
|
| 231 |
+ echo "${VM_NAME[$1]:-unknown}"
|
|
| 232 |
+} |
|
| 233 |
+ |
|
| 234 |
+vm_has_suspend_lock() {
|
|
| 235 |
+ local vmid="$1" |
|
| 236 |
+ grep -q '^lock: suspended$' "/etc/pve/qemu-server/${vmid}.conf" 2>/dev/null
|
|
| 237 |
+} |
|
| 238 |
+ |
|
| 239 |
+vm_has_vmstate_reference() {
|
|
| 240 |
+ local vmid="$1" |
|
| 241 |
+ grep -q '^vmstate:' "/etc/pve/qemu-server/${vmid}.conf" 2>/dev/null
|
|
| 242 |
+} |
|
| 243 |
+ |
|
| 244 |
+get_vm_vmstate_volume() {
|
|
| 245 |
+ local vmid="$1" |
|
| 246 |
+ echo "${VM_VMSTATE[$vmid]:-}"
|
|
| 247 |
+} |
|
| 248 |
+ |
|
| 249 |
+is_strict_suspend_volume_name() {
|
|
| 250 |
+ local vmid="$1" |
|
| 251 |
+ local name="$2" |
|
| 252 |
+ [[ "$name" =~ ^vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ ]]
|
|
| 253 |
+} |
|
| 254 |
+ |
|
| 255 |
+storage_cleanup_supports_path_scan() {
|
|
| 256 |
+ local storage_type="$1" |
|
| 257 |
+ |
|
| 258 |
+ # Cleanup walks filesystem paths directly under <path>/images. |
|
| 259 |
+ # Keep this limited to local directory-backed storages so a stale remote |
|
| 260 |
+ # mount cannot block planned maintenance in kernel I/O wait. |
|
| 261 |
+ [[ "$storage_type" == "dir" ]] |
|
| 262 |
+} |
|
| 263 |
+ |
|
| 264 |
+vmstate_volume_looks_like_suspend_artifact() {
|
|
| 265 |
+ local vmid="$1" |
|
| 266 |
+ local volume="$2" |
|
| 267 |
+ local volume_name="${volume##*/}"
|
|
| 268 |
+ |
|
| 269 |
+ [[ -n "$volume" ]] || return 1 |
|
| 270 |
+ is_strict_suspend_volume_name "$vmid" "$volume_name" |
|
| 271 |
+} |
|
| 272 |
+ |
|
| 273 |
+resolve_storage_volume_path() {
|
|
| 274 |
+ local volume="$1" |
|
| 275 |
+ pvesm path "$volume" 2>/dev/null |
|
| 276 |
+} |
|
| 277 |
+ |
|
| 278 |
+vmstate_volume_exists() {
|
|
| 279 |
+ local volume="$1" |
|
| 280 |
+ local resolved_path |
|
| 281 |
+ |
|
| 282 |
+ [[ -z "$volume" ]] && return 1 |
|
| 283 |
+ resolved_path=$(resolve_storage_volume_path "$volume") || return 1 |
|
| 284 |
+ [[ -n "$resolved_path" && -e "$resolved_path" ]] |
|
| 285 |
+} |
|
| 286 |
+ |
|
| 287 |
+remove_suspend_volume_by_volid() {
|
|
| 288 |
+ local vmid="$1" |
|
| 289 |
+ local volume="$2" |
|
| 290 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 291 |
+ local free_output |
|
| 292 |
+ |
|
| 293 |
+ if ! vmstate_volume_looks_like_suspend_artifact "$vmid" "$volume"; then |
|
| 294 |
+ log_warning "VM $vmid ($name) suspend volume does not look like a suspend artifact, leaving it untouched: ${volume:-none}"
|
|
| 295 |
+ return 1 |
|
| 296 |
+ fi |
|
| 297 |
+ |
|
| 298 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 299 |
+ echo "would remove stale vmstate volume for VM $vmid ($name): $volume" |
|
| 300 |
+ return 0 |
|
| 301 |
+ fi |
|
| 302 |
+ |
|
| 303 |
+ free_output=$(pvesm free "$volume" 2>&1) |
|
| 304 |
+ if [[ $? -eq 0 ]]; then |
|
| 305 |
+ log_info "Removed stale vmstate volume for VM $vmid ($name): $volume" |
|
| 306 |
+ return 0 |
|
| 307 |
+ fi |
|
| 308 |
+ |
|
| 309 |
+ if maybe_relax_quorum "$free_output"; then |
|
| 310 |
+ free_output=$(pvesm free "$volume" 2>&1) |
|
| 311 |
+ if [[ $? -eq 0 ]]; then |
|
| 312 |
+ log_info "Removed stale vmstate volume for VM $vmid ($name) after quorum recovery: $volume" |
|
| 313 |
+ return 0 |
|
| 314 |
+ fi |
|
| 315 |
+ fi |
|
| 316 |
+ |
|
| 317 |
+ if echo "$free_output" | grep -qiE 'does not exist|no such file|not found'; then |
|
| 318 |
+ log_info "Stale vmstate volume for VM $vmid ($name) was already absent: $volume" |
|
| 319 |
+ return 0 |
|
| 320 |
+ fi |
|
| 321 |
+ |
|
| 322 |
+ log_warning "VM $vmid ($name) stale vmstate volume could not be removed: $volume ($free_output)" |
|
| 323 |
+ return 1 |
|
| 324 |
+} |
|
| 325 |
+ |
|
| 326 |
+clear_vmstate_metadata() {
|
|
| 327 |
+ local vmid="$1" |
|
| 328 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 329 |
+ local set_output |
|
| 330 |
+ |
|
| 331 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 332 |
+ echo "would remove stale vmstate metadata for VM $vmid ($name)" |
|
| 333 |
+ return 0 |
|
| 334 |
+ fi |
|
| 335 |
+ |
|
| 336 |
+ set_output=$(qm set "$vmid" --delete vmstate 2>&1) |
|
| 337 |
+ if [[ $? -eq 0 ]]; then |
|
| 338 |
+ log_info "Removed stale vmstate metadata for VM $vmid ($name)" |
|
| 339 |
+ return 0 |
|
| 340 |
+ fi |
|
| 341 |
+ |
|
| 342 |
+ if maybe_relax_quorum "$set_output"; then |
|
| 343 |
+ set_output=$(qm set "$vmid" --delete vmstate 2>&1) |
|
| 344 |
+ if [[ $? -eq 0 ]]; then |
|
| 345 |
+ log_info "Removed stale vmstate metadata for VM $vmid ($name) after quorum recovery" |
|
| 346 |
+ return 0 |
|
| 347 |
+ fi |
|
| 348 |
+ fi |
|
| 349 |
+ |
|
| 350 |
+ log_warning "VM $vmid ($name) stale vmstate metadata could not be removed: $set_output" |
|
| 351 |
+ return 1 |
|
| 352 |
+} |
|
| 353 |
+ |
|
| 354 |
+free_stale_vmstate_volume() {
|
|
| 355 |
+ local vmid="$1" |
|
| 356 |
+ local volume="$2" |
|
| 357 |
+ |
|
| 358 |
+ remove_suspend_volume_by_volid "$vmid" "$volume" |
|
| 359 |
+} |
|
| 360 |
+ |
|
| 361 |
+cleanup_stale_suspend_artifacts() {
|
|
| 362 |
+ local vmid="$1" |
|
| 363 |
+ local context="${2:-}"
|
|
| 364 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 365 |
+ local volume |
|
| 366 |
+ local had_issue=0 |
|
| 367 |
+ local cleanup_failed=0 |
|
| 368 |
+ |
|
| 369 |
+ volume=$(get_vm_vmstate_volume "$vmid") |
|
| 370 |
+ |
|
| 371 |
+ if vm_has_suspend_lock "$vmid"; then |
|
| 372 |
+ had_issue=1 |
|
| 373 |
+ if ! unlock_vm_suspend_lock "$vmid" "$context"; then |
|
| 374 |
+ cleanup_failed=1 |
|
| 375 |
+ fi |
|
| 376 |
+ fi |
|
| 377 |
+ |
|
| 378 |
+ if [[ -n "$volume" ]]; then |
|
| 379 |
+ had_issue=1 |
|
| 380 |
+ if vmstate_volume_exists "$volume"; then |
|
| 381 |
+ if ! free_stale_vmstate_volume "$vmid" "$volume"; then |
|
| 382 |
+ cleanup_failed=1 |
|
| 383 |
+ fi |
|
| 384 |
+ else |
|
| 385 |
+ log_info "VM $vmid ($name) has stale vmstate metadata pointing to missing volume: $volume" |
|
| 386 |
+ fi |
|
| 387 |
+ |
|
| 388 |
+ if ! clear_vmstate_metadata "$vmid"; then |
|
| 389 |
+ cleanup_failed=1 |
|
| 390 |
+ fi |
|
| 391 |
+ fi |
|
| 392 |
+ |
|
| 393 |
+ if [[ $had_issue -eq 0 ]]; then |
|
| 394 |
+ return 0 |
|
| 395 |
+ fi |
|
| 396 |
+ |
|
| 397 |
+ [[ $cleanup_failed -eq 0 ]] |
|
| 398 |
+} |
|
| 399 |
+ |
|
| 400 |
+vm_has_valid_suspend_state() {
|
|
| 401 |
+ local vmid="$1" |
|
| 402 |
+ local volume |
|
| 403 |
+ |
|
| 404 |
+ vm_has_suspend_lock "$vmid" || return 1 |
|
| 405 |
+ vm_has_vmstate_reference "$vmid" || return 1 |
|
| 406 |
+ volume=$(get_vm_vmstate_volume "$vmid") |
|
| 407 |
+ vmstate_volume_looks_like_suspend_artifact "$vmid" "$volume" || return 1 |
|
| 408 |
+ vmstate_volume_exists "$volume" |
|
| 409 |
+} |
|
| 410 |
+ |
|
| 411 |
+get_referencing_vmid_for_vmstate() {
|
|
| 412 |
+ local target_volume="$1" |
|
| 413 |
+ local vmid="${VMSTATE_TO_VMID[$target_volume]:-}"
|
|
| 414 |
+ [[ -n "$vmid" ]] || return 1 |
|
| 415 |
+ echo "$vmid" |
|
| 416 |
+ return 0 |
|
| 417 |
+} |
|
| 418 |
+ |
|
| 419 |
+list_suspend_artifact_files() {
|
|
| 420 |
+ awk ' |
|
| 421 |
+ BEGIN {
|
|
| 422 |
+ RS = "" |
|
| 423 |
+ FS = "\n" |
|
| 424 |
+ } |
|
| 425 |
+ {
|
|
| 426 |
+ type = "" |
|
| 427 |
+ name = "" |
|
| 428 |
+ path = "" |
|
| 429 |
+ content = "" |
|
| 430 |
+ split($1, header_parts, /:[[:space:]]+/) |
|
| 431 |
+ if (length(header_parts) >= 2) {
|
|
| 432 |
+ type = header_parts[1] |
|
| 433 |
+ name = header_parts[2] |
|
| 434 |
+ } |
|
| 435 |
+ |
|
| 436 |
+ for (i = 2; i <= NF; i++) {
|
|
| 437 |
+ line = $i |
|
| 438 |
+ sub(/^\t/, "", line) |
|
| 439 |
+ if (line ~ /^path /) {
|
|
| 440 |
+ path = substr(line, 6) |
|
| 441 |
+ } else if (line ~ /^content /) {
|
|
| 442 |
+ content = substr(line, 9) |
|
| 443 |
+ } |
|
| 444 |
+ } |
|
| 445 |
+ |
|
| 446 |
+ if (name != "" && path != "" && content ~ /(^|,)images(,|$)/) {
|
|
| 447 |
+ print type "\t" name "\t" path |
|
| 448 |
+ } |
|
| 449 |
+ } |
|
| 450 |
+ ' /etc/pve/storage.cfg 2>/dev/null | while IFS=$'\t' read -r storage_type storage path; do |
|
| 451 |
+ [[ -z "$storage" || -z "$path" ]] && continue |
|
| 452 |
+ if ! storage_cleanup_supports_path_scan "$storage_type"; then |
|
| 453 |
+ continue |
|
| 454 |
+ fi |
|
| 455 |
+ [[ -d "${path}/images" ]] || continue
|
|
| 456 |
+ local file |
|
| 457 |
+ for file in "${path}"/images/[0-9]*/vm-*-state-suspend-????-??-??.raw; do
|
|
| 458 |
+ [[ -e "$file" ]] || continue |
|
| 459 |
+ local relative_path="${file#"${path}/images/"}"
|
|
| 460 |
+ [[ "$relative_path" == "$file" ]] && continue |
|
| 461 |
+ local vm_dir="${relative_path%%/*}"
|
|
| 462 |
+ local file_name="${relative_path##*/}"
|
|
| 463 |
+ [[ "$vm_dir" =~ ^[0-9]+$ ]] || continue |
|
| 464 |
+ is_strict_suspend_volume_name "$vm_dir" "$file_name" || continue |
|
| 465 |
+ printf '%s\t%s:%s/%s\t%s\n' "$storage" "$storage" "$vm_dir" "$file_name" "$file" |
|
| 466 |
+ done |
|
| 467 |
+ done |
|
| 468 |
+} |
|
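The awk stage of `list_suspend_artifact_files` relies on paragraph mode (`RS = ""`), where each blank-line-separated stanza becomes one record and each line becomes one field. A reduced sketch of that stage follows. The sample configuration and the function name `parse_image_storages` are hypothetical; the input only mimics the general shape of `/etc/pve/storage.cfg`.

```shell
#!/usr/bin/env bash
# Hypothetical storage.cfg-style input: "type: name" header, tab-indented keys.
sample_cfg=$(printf '%s\n' \
    'dir: local' \
    $'\tpath /var/lib/vz' \
    $'\tcontent iso,images,backup' \
    '' \
    'nfs: remote' \
    $'\tpath /mnt/pve/remote' \
    $'\tcontent images' \
    '' \
    'dir: isos' \
    $'\tpath /srv/isos' \
    $'\tcontent iso')

parse_image_storages() {
    awk '
        BEGIN { RS = ""; FS = "\n" }   # one record per stanza, one field per line
        {
            type = ""; name = ""; path = ""; content = ""
            if (split($1, header, /:[ \t]+/) >= 2) {
                type = header[1]
                name = header[2]
            }
            for (i = 2; i <= NF; i++) {
                line = $i
                sub(/^[ \t]+/, "", line)
                if (line ~ /^path /) {
                    path = substr(line, 6)
                } else if (line ~ /^content /) {
                    content = substr(line, 9)
                }
            }
            # Keep only storages whose content list includes "images".
            if (name != "" && path != "" && content ~ /(^|,)images(,|$)/) {
                print type "\t" name "\t" path
            }
        }
    '
}

parse_image_storages <<<"$sample_cfg"
```

With this sample, the `local` and `remote` stanzas are printed as tab-separated `type name path` rows, and `isos` is dropped because its content list lacks `images`. The type filter (`dir` only) is applied later by the shell loop, not by awk.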
| 469 |
+ |
|
| 470 |
+cleanup_orphan_suspend_artifacts() {
|
|
| 471 |
+ local cleaned_count=0 |
|
| 472 |
+ local skipped_count=0 |
|
| 473 |
+ local fail_count=0 |
|
| 474 |
+ local storage |
|
| 475 |
+ local volume |
|
| 476 |
+ local file_path |
|
| 477 |
+ local vmid |
|
| 478 |
+ |
|
| 479 |
+ log_info "Scanning storages for orphan suspend-state volumes..." |
|
| 480 |
+ |
|
| 481 |
+ while IFS=$'\t' read -r storage volume file_path; do |
|
| 482 |
+ [[ -z "$volume" ]] && continue |
|
| 483 |
+ |
|
| 484 |
+ if vmid=$(get_referencing_vmid_for_vmstate "$volume"); then |
|
| 485 |
+ if vm_has_valid_suspend_state "$vmid"; then |
|
| 486 |
+ log_info "Keeping active suspend-state volume for VM $vmid (${VM_NAME[$vmid]:-unknown}): $volume"
|
|
| 487 |
+ ((skipped_count++)) |
|
| 488 |
+ else |
|
| 489 |
+ log_warning "VM $vmid (${VM_NAME[$vmid]:-unknown}) references inconsistent suspend artifacts - cleaning up"
|
|
| 490 |
+ if cleanup_stale_suspend_artifacts "$vmid" "during cleanup"; then |
|
| 491 |
+ ((cleaned_count++)) |
|
| 492 |
+ else |
|
| 493 |
+ ((fail_count++)) |
|
| 494 |
+ fi |
|
| 495 |
+ fi |
|
| 496 |
+ continue |
|
| 497 |
+ fi |
|
| 498 |
+ |
|
| 499 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 500 |
+ echo "would remove orphan suspend-state volume: $volume" |
|
| 501 |
+ log_debug "real path: $file_path" |
|
| 502 |
+ ((cleaned_count++)) |
|
| 503 |
+ continue |
|
| 504 |
+ fi |
|
| 505 |
+ |
|
| 506 |
+ if [[ "$volume" =~ ^([^:]+):([0-9]+)/vm-([0-9]+)-state-suspend-([0-9]{4}-[0-9]{2}-[0-9]{2})\.raw$ ]]; then
|
|
| 507 |
+ vmid="${BASH_REMATCH[3]}"
|
|
| 508 |
+ else |
|
| 509 |
+ log_warning "Skipping suspicious suspend-state volume with unexpected name: $volume" |
|
| 510 |
+ ((skipped_count++)) |
|
| 511 |
+ continue |
|
| 512 |
+ fi |
|
| 513 |
+ |
|
| 514 |
+ VM_NAME[$vmid]="${VM_NAME[$vmid]:-unknown}"
|
|
| 515 |
+ if remove_suspend_volume_by_volid "$vmid" "$volume"; then |
|
| 516 |
+ log_info "Removed orphan suspend-state volume from $storage: $volume" |
|
| 517 |
+ ((cleaned_count++)) |
|
| 518 |
+ else |
|
| 519 |
+ ((fail_count++)) |
|
| 520 |
+ fi |
|
| 521 |
+ done < <(list_suspend_artifact_files) |
|
| 522 |
+ |
|
| 523 |
+ log_success "Suspend artifact cleanup complete: $cleaned_count cleaned, $skipped_count retained, $fail_count failed" |
|
| 524 |
+ [[ $fail_count -eq 0 ]] # exit statuses wrap at 256, so report pass/fail instead of the raw count |
|
| 525 |
+} |
|
| 526 |
+ |
|
| 527 |
+unlock_vm_suspend_lock() {
|
|
| 528 |
+ local vmid="$1" |
|
| 529 |
+ local context="${2:-}"
|
|
| 530 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 531 |
+ local unlock_output |
|
| 532 |
+ |
|
| 533 |
+ if ! vm_has_suspend_lock "$vmid"; then |
|
| 534 |
+ return 0 |
|
| 535 |
+ fi |
|
| 536 |
+ |
|
| 537 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 538 |
+ if [[ -n "$context" ]]; then |
|
| 539 |
+ echo "would remove stale suspend lock for VM $vmid ($name) $context" |
|
| 540 |
+ else |
|
| 541 |
+ echo "would remove stale suspend lock for VM $vmid ($name)" |
|
| 542 |
+ fi |
|
| 543 |
+ return 0 |
|
| 544 |
+ fi |
|
| 545 |
+ |
|
| 546 |
+ unlock_output=$(qm unlock "$vmid" 2>&1) |
|
| 547 |
+ if [[ $? -eq 0 ]]; then |
|
| 548 |
+ if [[ -n "$context" ]]; then |
|
| 549 |
+ log_info "Removed stale suspend lock for VM $vmid ($name) $context" |
|
| 550 |
+ else |
|
| 551 |
+ log_info "Removed stale suspend lock for VM $vmid ($name)" |
|
| 552 |
+ fi |
|
| 553 |
+ return 0 |
|
| 554 |
+ fi |
|
| 555 |
+ |
|
| 556 |
+ if maybe_relax_quorum "$unlock_output"; then |
|
| 557 |
+ unlock_output=$(qm unlock "$vmid" 2>&1) |
|
| 558 |
+ if [[ $? -eq 0 ]]; then |
|
| 559 |
+ if [[ -n "$context" ]]; then |
|
| 560 |
+ log_info "Removed stale suspend lock for VM $vmid ($name) $context after quorum recovery" |
|
| 561 |
+ else |
|
| 562 |
+ log_info "Removed stale suspend lock for VM $vmid ($name) after quorum recovery" |
|
| 563 |
+ fi |
|
| 564 |
+ return 0 |
|
| 565 |
+ fi |
|
| 566 |
+ fi |
|
| 567 |
+ |
|
| 568 |
+ if [[ -n "$context" ]]; then |
|
| 569 |
+ log_warning "VM $vmid ($name) has a stale suspend lock $context but it could not be removed: $unlock_output" |
|
| 570 |
+ else |
|
| 571 |
+ log_warning "VM $vmid ($name) has a stale suspend lock but it could not be removed: $unlock_output" |
|
| 572 |
+ fi |
|
| 573 |
+ return 1 |
|
| 574 |
+} |
|
| 575 |
+ |
|
| 576 |
+unlock_vm_if_needed() {
|
|
| 577 |
+ unlock_vm_suspend_lock "$1" "while VM is running" |
|
| 578 |
+} |
|
| 579 |
+ |
|
| 580 |
+# Quorum-sensitive operations (e.g. qm suspend/start/resume/unlock, qm set, pct start, pvesm free) may fail during |
|
| 581 |
+# cluster-wide maintenance when pmxcfs becomes read-only. In that case, relax |
|
| 582 |
+# expected votes once and retry the failed operation. |
|
| 583 |
+maybe_relax_quorum() {
|
|
| 584 |
+ local cmd_output="$1" |
|
| 585 |
+ |
|
| 586 |
+ # Already attempted in this run. |
|
| 587 |
+ if [[ $QUORUM_RELAXED -eq 1 ]]; then |
|
| 588 |
+ return 1 |
|
| 589 |
+ fi |
|
| 590 |
+ |
|
| 591 |
+ if echo "$cmd_output" | grep -qiE "cluster not ready - no quorum|/etc/pve/.+\\.conf\\.tmp.+(Permission denied|Device or resource busy)"; then |
|
| 592 |
+ log_warning "Detected quorum-related write failure in /etc/pve - attempting temporary 'pvecm expected 1'" |
|
| 593 |
+ if pvecm expected 1 >/dev/null 2>&1; then |
|
| 594 |
+ QUORUM_RELAXED=1 |
|
| 595 |
+ log_warning "Applied 'pvecm expected 1' for this maintenance cycle; retrying operation" |
|
| 596 |
+ return 0 |
|
| 597 |
+ fi |
|
| 598 |
+ log_error "Failed to apply 'pvecm expected 1' after quorum-related error" |
|
| 599 |
+ fi |
|
| 600 |
+ |
|
| 601 |
+ return 1 |
|
| 602 |
+} |
|
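Every caller of `maybe_relax_quorum` repeats the same capture, check, and retry-once sequence. The sketch below shows one way that sequence could be factored out. It is an illustration under assumptions, not the script's actual structure: `run_with_quorum_retry` and `flaky_op` are hypothetical, `pvecm` is replaced by a stub so the example runs anywhere, and the pattern match is reduced to the "no quorum" case.

```shell
#!/usr/bin/env bash
QUORUM_RELAXED=0
PVECM_CALLS=0

# Stub standing in for the real `pvecm expected 1`.
pvecm() { PVECM_CALLS=$((PVECM_CALLS + 1)); return 0; }

# Simplified version of the guard: relax at most once per run.
maybe_relax_quorum() {
    local cmd_output="$1"
    [[ $QUORUM_RELAXED -eq 1 ]] && return 1
    if grep -qiE 'cluster not ready - no quorum' <<<"$cmd_output"; then
        if pvecm expected 1 >/dev/null 2>&1; then
            QUORUM_RELAXED=1
            return 0
        fi
    fi
    return 1
}

# Hypothetical wrapper: run a command, retry once if the failure looks quorum-related.
run_with_quorum_retry() {
    local output
    if output=$("$@" 2>&1); then
        printf '%s\n' "$output"
        return 0
    fi
    if maybe_relax_quorum "$output"; then
        output=$("$@" 2>&1) && { printf '%s\n' "$output"; return 0; }
    fi
    printf '%s\n' "$output" >&2
    return 1
}

# Hypothetical operation that fails until quorum has been relaxed.
flaky_op() {
    if [[ $QUORUM_RELAXED -eq 1 ]]; then
        echo "ok"
        return 0
    fi
    echo "cluster not ready - no quorum?"
    return 1
}

run_with_quorum_retry flaky_op   # first attempt fails, retry prints "ok"
```

A wrapper like this would not cover the stale-image branches in `suspend_vm_to_disk`, which need to inspect the output between attempts.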
| 603 |
+ |
|
| 604 |
+# Suspend a VM to disk |
|
| 605 |
+suspend_vm_to_disk() {
|
|
| 606 |
+ local vmid="$1" |
|
| 607 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 608 |
+ local qm_output |
|
| 609 |
+ local stale_path |
|
| 610 |
+ local retry_output |
|
| 611 |
+ local stale_retry_path |
|
| 612 |
+ |
|
| 613 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 614 |
+ echo "would suspend VM $vmid ($name) to disk" |
|
| 615 |
+ return 0 |
|
| 616 |
+ fi |
|
| 617 |
+ |
|
| 618 |
+ log_info "Suspending VM $vmid ($name) to disk..." |
|
| 619 |
+ qm_output=$(qm suspend "$vmid" --todisk 1 2>&1) |
|
| 620 |
+ if [[ $? -eq 0 ]]; then |
|
| 621 |
+ log_success "VM $vmid ($name) suspended to disk" |
|
| 622 |
+ return 0 |
|
| 623 |
+ fi |
|
| 624 |
+ |
|
| 625 |
+ # Recover from stale suspend image left from a previous interrupted suspend. |
|
| 626 |
+ # Proxmox can emit either: |
|
| 627 |
+ # - "stale saved state disk image ('...raw' already exists)"
|
|
| 628 |
+ # - "disk image '...raw' already exists" |
|
| 629 |
+ stale_path=$( |
|
| 630 |
+ echo "$qm_output" | sed -n \ |
|
| 631 |
+ -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
|
|
| 632 |
+ -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1 |
|
| 633 |
+ ) |
|
| 634 |
+ if [[ -n "$stale_path" && "$stale_path" =~ /vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ && -f "$stale_path" ]]; then
|
|
| 635 |
+ log_warning "VM $vmid ($name) has stale suspend image: $stale_path - removing and retrying once" |
|
| 636 |
+ if rm -f -- "$stale_path"; then |
|
| 637 |
+ retry_output=$(qm suspend "$vmid" --todisk 1 2>&1) |
|
| 638 |
+ if [[ $? -eq 0 ]]; then |
|
| 639 |
+ log_success "VM $vmid ($name) suspended to disk (after stale image cleanup)" |
|
| 640 |
+ return 0 |
|
| 641 |
+ fi |
|
| 642 |
+ if maybe_relax_quorum "$retry_output"; then |
|
| 643 |
+ retry_output=$(qm suspend "$vmid" --todisk 1 2>&1) |
|
| 644 |
+ if [[ $? -eq 0 ]]; then |
|
| 645 |
+ log_success "VM $vmid ($name) suspended to disk (after stale image cleanup + quorum recovery)" |
|
| 646 |
+ return 0 |
|
| 647 |
+ fi |
|
| 648 |
+ stale_retry_path=$( |
|
| 649 |
+ echo "$retry_output" | sed -n \ |
|
| 650 |
+ -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
|
|
| 651 |
+ -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1 |
|
| 652 |
+ ) |
|
| 653 |
+ if [[ -n "$stale_retry_path" && "$stale_retry_path" =~ /vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ && -f "$stale_retry_path" ]]; then
|
|
| 654 |
+ log_warning "VM $vmid ($name) retry left stale suspend image: $stale_retry_path - removing and retrying once more" |
|
| 655 |
+ if rm -f -- "$stale_retry_path"; then |
|
| 656 |
+ retry_output=$(qm suspend "$vmid" --todisk 1 2>&1) |
|
| 657 |
+ if [[ $? -eq 0 ]]; then |
|
| 658 |
+ log_success "VM $vmid ($name) suspended to disk (after stale image cleanup + quorum recovery + retry)" |
|
| 659 |
+ return 0 |
|
| 660 |
+ fi |
|
| 661 |
+ fi |
|
| 662 |
+ fi |
|
| 663 |
+ fi |
|
| 664 |
+ log_error "Failed to suspend VM $vmid ($name) after stale image cleanup: $retry_output" |
|
| 665 |
+ return 1 |
|
| 666 |
+ fi |
|
| 667 |
+ log_error "Failed to remove stale suspend image for VM $vmid ($name): $stale_path" |
|
| 668 |
+ return 1 |
|
| 669 |
+ fi |
|
| 670 |
+ |
|
| 671 |
+ if maybe_relax_quorum "$qm_output"; then |
|
| 672 |
+ retry_output=$(qm suspend "$vmid" --todisk 1 2>&1) |
|
| 673 |
+ if [[ $? -eq 0 ]]; then |
|
| 674 |
+ log_success "VM $vmid ($name) suspended to disk (after quorum recovery)" |
|
| 675 |
+ return 0 |
|
| 676 |
+ fi |
|
| 677 |
+ stale_retry_path=$( |
|
| 678 |
+ echo "$retry_output" | sed -n \ |
|
| 679 |
+ -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
|
|
| 680 |
+ -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1 |
|
| 681 |
+ ) |
|
| 682 |
+ if [[ -n "$stale_retry_path" && "$stale_retry_path" =~ /vm-${vmid}-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$ && -f "$stale_retry_path" ]]; then
|
|
| 683 |
+ log_warning "VM $vmid ($name) quorum retry hit stale suspend image: $stale_retry_path - removing and retrying once more" |
|
| 684 |
+ if rm -f -- "$stale_retry_path"; then |
|
| 685 |
+ retry_output=$(qm suspend "$vmid" --todisk 1 2>&1) |
|
| 686 |
+ if [[ $? -eq 0 ]]; then |
|
| 687 |
+ log_success "VM $vmid ($name) suspended to disk (after quorum recovery + stale retry)" |
|
| 688 |
+ return 0 |
|
| 689 |
+ fi |
|
| 690 |
+ fi |
|
| 691 |
+ fi |
|
| 692 |
+ log_error "Failed to suspend VM $vmid ($name) after quorum recovery: $retry_output" |
|
| 693 |
+ return 1 |
|
| 694 |
+ fi |
|
| 695 |
+ |
|
| 696 |
+ log_error "Failed to suspend VM $vmid ($name) to disk: $qm_output" |
|
| 697 |
+ return 1 |
|
| 698 |
+} |
|
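`suspend_vm_to_disk` runs the same two-expression `sed` extraction three times. A small helper such as the hypothetical `extract_stale_image_path` below could hold that logic once. The sample messages are made up and only follow the two shapes quoted in the comments above; real Proxmox output may differ.

```shell
#!/usr/bin/env bash
# Read command output on stdin and print the first stale image path found, if any.
extract_stale_image_path() {
    sed -n \
        -e "s/.*stale saved state[[:space:]]*disk image ('\\([^']*\\)' already exists).*/\\1/p" \
        -e "s/.*disk image '\\([^']*\\)' already exists.*/\\1/p" | head -n 1
}

# Hypothetical error strings, one per message shape.
msg_a="stale saved state disk image ('/var/lib/vz/images/105/vm-105-state-suspend-2024-03-17.raw' already exists)"
msg_b="unable to create image: disk image '/var/lib/vz/images/106/vm-106-state-suspend-2024-03-18.raw' already exists"

extract_stale_image_path <<<"$msg_a"
extract_stale_image_path <<<"$msg_b"
extract_stale_image_path <<<"some unrelated failure"   # prints nothing
```

The extracted path should still be validated against the VMID-anchored file name pattern and `-f` before any `rm`, as the function above already does.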
| 699 |
+ |
|
| 700 |
+# Resume a VM from disk suspend |
|
| 701 |
+resume_vm() {
|
|
| 702 |
+ local vmid="$1" |
|
| 703 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 704 |
+ local qm_output |
|
| 705 |
+ local current_status |
|
| 706 |
+ |
|
| 707 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 708 |
+ echo "would resume VM $vmid ($name)" |
|
| 709 |
+ return 0 |
|
| 710 |
+ fi |
|
| 711 |
+ |
|
| 712 |
+ log_info "Resuming VM $vmid ($name)..." |
|
| 713 |
+ qm_output=$(qm resume "$vmid" 2>&1) |
|
| 714 |
+ if [[ $? -eq 0 ]]; then |
|
| 715 |
+ unlock_vm_if_needed "$vmid" |
|
| 716 |
+ log_success "VM $vmid ($name) resumed successfully" |
|
| 717 |
+ return 0 |
|
| 718 |
+ fi |
|
| 719 |
+ |
|
| 720 |
+ if maybe_relax_quorum "$qm_output"; then |
|
| 721 |
+ qm_output=$(qm resume "$vmid" 2>&1) |
|
| 722 |
+ if [[ $? -eq 0 ]]; then |
|
| 723 |
+ unlock_vm_if_needed "$vmid" |
|
| 724 |
+ log_success "VM $vmid ($name) resumed successfully (after quorum recovery)" |
|
| 725 |
+ return 0 |
|
| 726 |
+ fi |
|
| 727 |
+ current_status=$(qm status "$vmid" 2>/dev/null | awk '{print $2}')
|
|
| 728 |
+ if [[ "$current_status" == "running" ]]; then |
|
| 729 |
+ unlock_vm_if_needed "$vmid" |
|
| 730 |
+ log_warning "VM $vmid ($name) is running despite resume error after quorum recovery - treating as resumed" |
|
| 731 |
+ return 2 |
|
| 732 |
+ fi |
|
| 733 |
+ log_error "Failed to resume VM $vmid ($name) after quorum recovery: $qm_output" |
|
| 734 |
+ return 1 |
|
| 735 |
+ fi |
|
| 736 |
+ |
|
| 737 |
+ if echo "$qm_output" | grep -qi "already running"; then |
|
| 738 |
+ unlock_vm_if_needed "$vmid" |
|
| 739 |
+ log_warning "VM $vmid ($name) is already running - treating as resumed" |
|
| 740 |
+ return 2 |
|
| 741 |
+ fi |
|
| 742 |
+ |
|
| 743 |
+ current_status=$(qm status "$vmid" 2>/dev/null | awk '{print $2}')
|
|
| 744 |
+ if [[ "$current_status" == "running" ]]; then |
|
| 745 |
+ unlock_vm_if_needed "$vmid" |
|
| 746 |
+ log_warning "VM $vmid ($name) is running despite resume error - treating as resumed" |
|
| 747 |
+ return 2 |
|
| 748 |
+ fi |
|
| 749 |
+ |
|
| 750 |
+ log_error "Failed to resume VM $vmid ($name): $qm_output" |
|
| 751 |
+ return 1 |
|
| 752 |
+} |
|
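`resume_vm` distinguishes three outcomes by exit status: 0 for a clean resume, 2 when the VM turned out to be running anyway, and 1 for failure. A caller that wants to keep that distinction has to capture the status rather than test it with a plain `if`. The sketch below shows one way to do so. `resume_vm_stub` and the VMIDs are hypothetical and exist only to make the example runnable.

```shell
#!/usr/bin/env bash
# Stub using the same convention as resume_vm: 0 = resumed, 2 = already running, 1 = failed.
resume_vm_stub() {
    case "$1" in
        100) return 0 ;;
        101) return 2 ;;
        *)   return 1 ;;
    esac
}

resumed=0 already_running=0 failed=0
for vmid in 100 101 102; do
    rc=0
    resume_vm_stub "$vmid" || rc=$?   # capture the status without aborting
    case $rc in
        0) resumed=$((resumed + 1)) ;;
        2) already_running=$((already_running + 1)) ;;
        *) failed=$((failed + 1)) ;;
    esac
done

echo "resumed=$resumed already_running=$already_running failed=$failed"
```

The counters use `$((x + 1))` assignments because `((x++))` returns a non-zero status when `x` is 0, which matters if the surrounding script runs under `set -e`.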
| 753 |
+ |
|
| 754 |
+# Gracefully shut down a CT |
|
| 755 |
+shutdown_ct() {
|
|
| 756 |
+ local ctid="$1" |
|
| 757 |
+ local name="${CT_NAME[$ctid]:-unknown}"
|
|
| 758 |
+ |
|
| 759 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 760 |
+ echo "would shutdown CT $ctid ($name)" |
|
| 761 |
+ return 0 |
|
| 762 |
+ fi |
|
| 763 |
+ |
|
| 764 |
+ log_info "Shutting down CT $ctid ($name)..." |
|
| 765 |
+ if pct shutdown "$ctid" --timeout 120; then |
|
| 766 |
+ log_success "CT $ctid ($name) shut down gracefully" |
|
| 767 |
+ return 0 |
|
| 768 |
+ else |
|
| 769 |
+ log_error "Failed to shutdown CT $ctid ($name)" |
|
| 770 |
+ return 1 |
|
| 771 |
+ fi |
|
| 772 |
+} |
|
| 773 |
+ |
|
| 774 |
+# Start a CT |
|
| 775 |
+start_ct() {
|
|
| 776 |
+ local ctid="$1" |
|
| 777 |
+ local name="${CT_NAME[$ctid]:-unknown}"
|
|
| 778 |
+ local pct_output |
|
| 779 |
+ |
|
| 780 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 781 |
+ echo "would start CT $ctid ($name)" |
|
| 782 |
+ return 0 |
|
| 783 |
+ fi |
|
| 784 |
+ |
|
| 785 |
+ log_info "Starting CT $ctid ($name)..." |
|
| 786 |
+ pct_output=$(pct start "$ctid" 2>&1) |
|
| 787 |
+ if [[ $? -eq 0 ]]; then |
|
| 788 |
+ log_success "CT $ctid ($name) started successfully" |
|
| 789 |
+ return 0 |
|
| 790 |
+ fi |
|
| 791 |
+ |
|
| 792 |
+ if maybe_relax_quorum "$pct_output"; then |
|
| 793 |
+ pct_output=$(pct start "$ctid" 2>&1) |
|
| 794 |
+ if [[ $? -eq 0 ]]; then |
|
| 795 |
+ log_success "CT $ctid ($name) started successfully (after quorum recovery)" |
|
| 796 |
+ return 0 |
|
| 797 |
+ fi |
|
| 798 |
+ if [[ "$(pct status "$ctid" 2>/dev/null | awk '{print $2}')" == "running" ]]; then
|
|
| 799 |
+ log_warning "CT $ctid ($name) is running despite start error after quorum recovery - treating as started" |
|
| 800 |
+ return 2 |
|
| 801 |
+ fi |
|
| 802 |
+ log_error "Failed to start CT $ctid ($name) after quorum recovery: $pct_output" |
|
| 803 |
+ return 1 |
|
| 804 |
+ fi |
|
| 805 |
+ |
|
| 806 |
+ if echo "$pct_output" | grep -qi "already running"; then |
|
| 807 |
+ log_warning "CT $ctid ($name) is already running - treating as started" |
|
| 808 |
+ return 2 |
|
| 809 |
+ fi |
|
| 810 |
+ |
|
| 811 |
+ if [[ "$(pct status "$ctid" 2>/dev/null | awk '{print $2}')" == "running" ]]; then
|
|
| 812 |
+ log_warning "CT $ctid ($name) is running despite start error - treating as started" |
|
| 813 |
+ return 2 |
|
| 814 |
+ fi |
|
| 815 |
+ |
|
| 816 |
+ log_error "Failed to start CT $ctid ($name): $pct_output" |
|
| 817 |
+ return 1 |
|
| 818 |
+} |
|
| 819 |
+ |
|
| 820 |
+# Save state to JSON file |
|
| 821 |
+# Usage: save_state vm_resume_array vm_suspended_array ct_start_array (pass array names, not values; they are bound via namerefs) |
|
| 822 |
+save_state() {
|
|
| 823 |
+ local -n to_resume_ref=$1 |
|
| 824 |
+ local -n was_suspended_ref=$2 |
|
| 825 |
+ local -n ct_to_start_ref=$3 |
|
| 826 |
+ local existing_state_json="" |
|
| 827 |
+ local existing_to_resume=() |
|
| 828 |
+ local existing_was_suspended=() |
|
| 829 |
+ local existing_ct_to_start=() |
|
| 830 |
+ local final_to_resume=() |
|
| 831 |
+ local final_was_suspended=() |
|
| 832 |
+ local final_ct_to_start=() |
|
| 833 |
+ local vmid |
|
| 834 |
+ local volume |
|
| 835 |
+ local suspend_date |
|
| 836 |
+ local -A existing_vm_volume=() |
|
| 837 |
+ local -A existing_vm_date=() |
|
| 838 |
+ local -A current_vm_volume=() |
|
| 839 |
+ local -A current_vm_date=() |
|
| 840 |
+ |
|
| 841 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 842 |
+ echo "would save state to $STATE_FILE" |
|
| 843 |
+ echo " to_resume (VMs): ${to_resume_ref[*]}"
|
|
| 844 |
+ echo " was_suspended (VMs): ${was_suspended_ref[*]}"
|
|
| 845 |
+ echo " ct_to_start (CTs): ${ct_to_start_ref[*]}"
|
|
| 846 |
+ return 0 |
|
| 847 |
+ fi |
|
| 848 |
+ |
|
| 849 |
+ if existing_state_json=$(load_state 2>/dev/null); then |
|
| 850 |
+ mapfile -t existing_to_resume < <(echo "$existing_state_json" | jq -r '.to_resume[]?' 2>/dev/null) |
|
| 851 |
+ mapfile -t existing_was_suspended < <(echo "$existing_state_json" | jq -r '.was_suspended[]?' 2>/dev/null) |
|
| 852 |
+ mapfile -t existing_ct_to_start < <(echo "$existing_state_json" | jq -r '.ct_to_start[]?' 2>/dev/null) |
|
| 853 |
+ while IFS=$'\t' read -r vmid volume suspend_date; do |
|
| 854 |
+ [[ -z "$vmid" ]] && continue |
|
| 855 |
+ existing_vm_volume[$vmid]="$volume" |
|
| 856 |
+ existing_vm_date[$vmid]="$suspend_date" |
|
| 857 |
+ done < <( |
|
| 858 |
+ echo "$existing_state_json" | jq -r ' |
|
| 859 |
+ (.vm_details // {})
|
|
| 860 |
+ | to_entries[] |
|
| 861 |
+ | [.key, (.value.suspend_volume // ""), (.value.suspend_file_date // "")] |
|
| 862 |
+ | @tsv |
|
| 863 |
+ ' 2>/dev/null |
|
| 864 |
+ ) |
|
| 865 |
+ fi |
|
| 866 |
+ |
|
| 867 |
+ refresh_vm_artifact_metadata |
|
| 868 |
+ |
|
| 869 |
+ for vmid in "${to_resume_ref[@]}"; do
|
|
| 870 |
+ append_unique final_to_resume "$vmid" |
|
| 871 |
+ volume="${VM_VMSTATE[$vmid]:-}"
|
|
| 872 |
+ suspend_date=$(extract_suspend_file_date "$vmid" "$volume") |
|
| 873 |
+ current_vm_volume[$vmid]="$volume" |
|
| 874 |
+ current_vm_date[$vmid]="$suspend_date" |
|
| 875 |
+ done |
|
| 876 |
+ |
|
| 877 |
+ for vmid in "${existing_to_resume[@]}"; do
|
|
| 878 |
+ append_unique final_to_resume "$vmid" |
|
| 879 |
+ done |
|
| 880 |
+ |
|
| 881 |
+ for vmid in "${existing_was_suspended[@]}"; do
|
|
| 882 |
+ if ! array_contains "$vmid" "${final_to_resume[@]}"; then
|
|
| 883 |
+ append_unique final_was_suspended "$vmid" |
|
| 884 |
+ fi |
|
| 885 |
+ done |
|
| 886 |
+ |
|
| 887 |
+ for vmid in "${was_suspended_ref[@]}"; do
|
|
| 888 |
+ if array_contains "$vmid" "${final_to_resume[@]}"; then
|
|
| 889 |
+ volume="${VM_VMSTATE[$vmid]:-}"
|
|
| 890 |
+ if [[ -n "$volume" ]]; then |
|
| 891 |
+ current_vm_volume[$vmid]="$volume" |
|
| 892 |
+ current_vm_date[$vmid]="$(extract_suspend_file_date "$vmid" "$volume")" |
|
| 893 |
+ fi |
|
| 894 |
+ continue |
|
| 895 |
+ fi |
|
| 896 |
+ append_unique final_was_suspended "$vmid" |
|
| 897 |
+ volume="${VM_VMSTATE[$vmid]:-}"
|
|
| 898 |
+ suspend_date=$(extract_suspend_file_date "$vmid" "$volume") |
|
| 899 |
+ current_vm_volume[$vmid]="$volume" |
|
| 900 |
+ current_vm_date[$vmid]="$suspend_date" |
|
| 901 |
+ done |
|
| 902 |
+ |
|
| 903 |
+ for vmid in "${final_to_resume[@]}"; do
|
|
| 904 |
+ remove_value final_was_suspended "$vmid" |
|
| 905 |
+ done |
|
| 906 |
+ |
|
| 907 |
+ for vmid in "${existing_ct_to_start[@]}"; do
|
|
| 908 |
+ append_unique final_ct_to_start "$vmid" |
|
| 909 |
+ done |
|
| 910 |
+ for vmid in "${ct_to_start_ref[@]}"; do
|
|
| 911 |
+ append_unique final_ct_to_start "$vmid" |
|
| 912 |
+ done |
|
| 913 |
+ |
|
| 914 |
+ # Create JSON arrays (handle empty arrays properly) |
|
| 915 |
+ local to_resume_json="[]" |
|
| 916 |
+ local was_suspended_json="[]" |
|
| 917 |
+ local ct_to_start_json="[]" |
|
| 918 |
+ local vm_details_json="{}"
|
|
| 919 |
+ |
|
| 920 |
+ if [[ ${#final_to_resume[@]} -gt 0 ]]; then
|
|
| 921 |
+ to_resume_json=$(printf '%s\n' "${final_to_resume[@]}" | jq -R . | jq -s .)
|
|
| 922 |
+ fi |
|
| 923 |
+ if [[ ${#final_was_suspended[@]} -gt 0 ]]; then
|
|
| 924 |
+ was_suspended_json=$(printf '%s\n' "${final_was_suspended[@]}" | jq -R . | jq -s .)
|
|
| 925 |
+ fi |
|
| 926 |
+ if [[ ${#final_ct_to_start[@]} -gt 0 ]]; then
|
|
| 927 |
+ ct_to_start_json=$(printf '%s\n' "${final_ct_to_start[@]}" | jq -R . | jq -s .)
|
|
| 928 |
+ fi |
|
| 929 |
+ |
|
| 930 |
+ for vmid in "${final_to_resume[@]}"; do
|
|
| 931 |
+ volume="${current_vm_volume[$vmid]:-${existing_vm_volume[$vmid]:-}}"
|
|
| 932 |
+ suspend_date="${current_vm_date[$vmid]:-${existing_vm_date[$vmid]:-}}"
|
|
| 933 |
+ vm_details_json=$( |
|
| 934 |
+ jq \ |
|
| 935 |
+ --arg vmid "$vmid" \ |
|
| 936 |
+ --arg mode "to_resume" \ |
|
| 937 |
+ --arg volume "$volume" \ |
|
| 938 |
+ --arg suspend_date "$suspend_date" \ |
|
| 939 |
+ ' |
|
| 940 |
+ .[$vmid] = {
|
|
| 941 |
+ mode: $mode, |
|
| 942 |
+ suspend_volume: $volume, |
|
| 943 |
+ suspend_file_date: $suspend_date |
|
| 944 |
+ } |
|
| 945 |
+ ' <<<"$vm_details_json" |
|
| 946 |
+ ) |
|
| 947 |
+ done |
|
| 948 |
+ |
|
| 949 |
+ for vmid in "${final_was_suspended[@]}"; do
|
|
| 950 |
+ volume="${current_vm_volume[$vmid]:-${existing_vm_volume[$vmid]:-}}"
|
|
| 951 |
+ suspend_date="${current_vm_date[$vmid]:-${existing_vm_date[$vmid]:-}}"
|
|
| 952 |
+ vm_details_json=$( |
|
| 953 |
+ jq \ |
|
| 954 |
+ --arg vmid "$vmid" \ |
|
| 955 |
+ --arg mode "was_suspended" \ |
|
| 956 |
+ --arg volume "$volume" \ |
|
| 957 |
+ --arg suspend_date "$suspend_date" \ |
|
| 958 |
+ ' |
|
| 959 |
+ .[$vmid] = {
|
|
| 960 |
+ mode: $mode, |
|
| 961 |
+ suspend_volume: $volume, |
|
| 962 |
+ suspend_file_date: $suspend_date |
|
| 963 |
+ } |
|
| 964 |
+ ' <<<"$vm_details_json" |
|
| 965 |
+ ) |
|
| 966 |
+ done |
|
| 967 |
+ |
|
| 968 |
+ cat > "$STATE_FILE" <<EOF |
|
| 969 |
+{
|
|
| 970 |
+ "timestamp": "$(date -Iseconds)", |
|
| 971 |
+ "hostname": "$(hostname)", |
|
| 972 |
+ "to_resume": $to_resume_json, |
|
| 973 |
+ "was_suspended": $was_suspended_json, |
|
| 974 |
+ "ct_to_start": $ct_to_start_json, |
|
| 975 |
+ "vm_details": $vm_details_json |
|
| 976 |
+} |
|
| 977 |
+EOF |
|
| 978 |
+ |
|
| 979 |
+ log_info "State saved to $STATE_FILE" |
|
| 980 |
+} |
|
| 981 |
+ |
|
| 982 |
+# Load state from JSON file (prints only the JSON and logs nothing, so callers can capture stdout cleanly) |
|
| 983 |
+load_state() {
|
|
| 984 |
+ if [[ ! -f "$STATE_FILE" ]]; then |
|
| 985 |
+ return 1 |
|
| 986 |
+ fi |
|
| 987 |
+ cat "$STATE_FILE" |
|
| 988 |
+} |
|
| 989 |
+ |
|
| 990 |
+# Remove state file after resume is complete |
|
| 991 |
+clear_state() {
|
|
| 992 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 993 |
+ echo "would remove state file $STATE_FILE" |
|
| 994 |
+ return 0 |
|
| 995 |
+ fi |
|
| 996 |
+ |
|
| 997 |
+ if [[ -f "$STATE_FILE" ]]; then |
|
| 998 |
+ rm -f "$STATE_FILE" |
|
| 999 |
+ log_info "State file removed" |
|
| 1000 |
+ fi |
|
| 1001 |
+} |
|
| 1002 |
+ |
|
| 1003 |
+migrate_legacy_state_if_needed() {
|
|
| 1004 |
+ if [[ "${STATE_FILE}" == "${LEGACY_STATE_FILE}" ]]; then
|
|
| 1005 |
+ return 0 |
|
| 1006 |
+ fi |
|
| 1007 |
+ |
|
| 1008 |
+ if [[ -f "${LEGACY_STATE_FILE}" && ! -f "${STATE_FILE}" ]]; then
|
|
| 1009 |
+ mkdir -p "${STATE_DIR}"
|
|
| 1010 |
+ mv "${LEGACY_STATE_FILE}" "${STATE_FILE}"
|
|
| 1011 |
+ log_warning "Migrated legacy state file from ${LEGACY_STATE_FILE} to ${STATE_FILE}"
|
|
| 1012 |
+ fi |
|
| 1013 |
+} |
|
| 1014 |
+ |
|
| 1015 |
+# Main suspend operation |
|
| 1016 |
+do_suspend() {
|
|
| 1017 |
+ log_info "Starting suspend/shutdown operation on $(hostname)" |
|
| 1018 |
+ |
|
| 1019 |
+ # Clean stale suspend artifacts before creating new suspend volumes. |
|
| 1020 |
+ load_vm_config_metadata |
|
| 1021 |
+ if ! cleanup_orphan_suspend_artifacts; then |
|
| 1022 |
+ log_warning "Suspend artifact preflight cleanup had failures; continuing with suspend operation" |
|
| 1023 |
+ fi |
|
| 1024 |
+ |
|
| 1025 |
+ # Load all VM and CT info in one pass |
|
| 1026 |
+ load_vm_info |
|
| 1027 |
+ load_ct_info |
|
| 1028 |
+ |
|
| 1029 |
+ local to_resume=() |
|
| 1030 |
+ local was_suspended=() |
|
| 1031 |
+ local ct_to_start=() |
|
| 1032 |
+ local suspend_count=0 |
|
| 1033 |
+ local skip_count=0 |
|
| 1034 |
+ local fail_count=0 |
|
| 1035 |
+ |
|
| 1036 |
+ # --- Process QEMU VMs --- |
|
| 1037 |
+ log_info "Processing QEMU VMs..." |
|
| 1038 |
+ for conf in /etc/pve/qemu-server/*.conf; do |
|
| 1039 |
+ [[ ! -f "$conf" ]] && continue |
|
| 1040 |
+ |
|
| 1041 |
+ local vmid=$(basename "$conf" .conf) |
|
| 1042 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 1043 |
+ local status="${VM_STATUS[$vmid]:-stopped}"
|
|
| 1044 |
+ |
|
| 1045 |
+ case "$status" in |
|
| 1046 |
+ running) |
|
| 1047 |
+ # Running VM: suspend to disk, add to resume list |
|
| 1048 |
+ if suspend_vm_to_disk "$vmid"; then |
|
| 1049 |
+ to_resume+=("$vmid")
|
|
| 1050 |
+ ((suspend_count++)) |
|
| 1051 |
+ else |
|
| 1052 |
+ ((fail_count++)) |
|
| 1053 |
+ fi |
|
| 1054 |
+ ;; |
|
| 1055 |
+ suspended) |
|
| 1056 |
+ # Suspended to RAM: save state to disk but DON'T add to resume list |
|
| 1057 |
+ log_warning "VM $vmid ($name) is suspended to RAM - saving to disk but will NOT auto-resume (was manually suspended)" |
|
| 1058 |
+ if suspend_vm_to_disk "$vmid"; then |
|
| 1059 |
+ was_suspended+=("$vmid")
|
|
| 1060 |
+ ((suspend_count++)) |
|
| 1061 |
+ else |
|
| 1062 |
+ ((fail_count++)) |
|
| 1063 |
+ fi |
|
| 1064 |
+ ;; |
|
| 1065 |
+ stopped) |
|
| 1066 |
+ # Could be stopped normally or suspended to disk |
|
| 1067 |
+ if vm_has_valid_suspend_state "$vmid"; then |
|
| 1068 |
+ log_warning "VM $vmid ($name) is already suspended to disk - will NOT auto-resume" |
|
| 1069 |
+ was_suspended+=("$vmid")
|
|
| 1070 |
+ ((skip_count++)) |
|
| 1071 |
+ elif vm_has_suspend_lock "$vmid" || vm_has_vmstate_reference "$vmid"; then |
|
| 1072 |
+ log_warning "VM $vmid ($name) has inconsistent suspend artifacts - treating them as stale" |
|
| 1073 |
+ if cleanup_stale_suspend_artifacts "$vmid" "while VM is stopped"; then |
|
| 1074 |
+ ((skip_count++)) |
|
| 1075 |
+ else |
|
| 1076 |
+ ((fail_count++)) |
|
| 1077 |
+ fi |
|
| 1078 |
+ else |
|
| 1079 |
+ log_info "VM $vmid ($name) is stopped, skipping" |
|
| 1080 |
+ fi |
|
| 1081 |
+ ;; |
|
| 1082 |
+ paused) |
|
| 1083 |
+ # Paused/suspended to RAM: save state to disk but DON'T auto-resume |
|
| 1084 |
+ log_warning "VM $vmid ($name) is paused/suspended to RAM - saving to disk but will NOT auto-resume (was manually paused)" |
|
| 1085 |
+ if suspend_vm_to_disk "$vmid"; then |
|
| 1086 |
+ was_suspended+=("$vmid")
|
|
| 1087 |
+ ((suspend_count++)) |
|
| 1088 |
+ else |
|
| 1089 |
+ ((fail_count++)) |
|
| 1090 |
+ fi |
|
| 1091 |
+ ;; |
|
| 1092 |
+ *) |
|
| 1093 |
+ log_info "VM $vmid ($name) status '$status', skipping" |
|
| 1094 |
+ ;; |
|
| 1095 |
+ esac |
|
| 1096 |
+ done |
|
| 1097 |
+ |
|
| 1098 |
+ # --- Process LXC Containers --- |
|
| 1099 |
+ log_info "Processing LXC containers..." |
|
| 1100 |
+ for conf in /etc/pve/lxc/*.conf; do |
|
| 1101 |
+ [[ ! -f "$conf" ]] && continue |
|
| 1102 |
+ |
|
| 1103 |
+ local ctid=$(basename "$conf" .conf) |
|
| 1104 |
+ local name="${CT_NAME[$ctid]:-unknown}"
|
|
| 1105 |
+ local status="${CT_STATUS[$ctid]:-stopped}"
|
|
| 1106 |
+ |
|
| 1107 |
+ case "$status" in |
|
| 1108 |
+ running) |
|
| 1109 |
+ # Running CT: graceful shutdown, add to start list |
|
| 1110 |
+ if shutdown_ct "$ctid"; then |
|
| 1111 |
+ ct_to_start+=("$ctid")
|
|
| 1112 |
+ ((suspend_count++)) |
|
| 1113 |
+ else |
|
| 1114 |
+ ((fail_count++)) |
|
| 1115 |
+ fi |
|
| 1116 |
+ ;; |
|
| 1117 |
+ stopped) |
|
| 1118 |
+ log_info "CT $ctid ($name) is stopped, skipping" |
|
| 1119 |
+ ;; |
|
| 1120 |
+ *) |
|
| 1121 |
+ log_info "CT $ctid ($name) status '$status', skipping" |
|
| 1122 |
+ ;; |
|
| 1123 |
+ esac |
|
| 1124 |
+ done |
|
| 1125 |
+ |
|
| 1126 |
+ # Save state |
|
| 1127 |
+ save_state to_resume was_suspended ct_to_start |
|
| 1128 |
+ |
|
| 1129 |
+ # Summary |
|
| 1130 |
+ log_success "Suspend/shutdown complete: $suspend_count processed, $skip_count skipped, $fail_count failed" |
|
| 1131 |
+ log_info "VMs to auto-resume: ${to_resume[*]:-none}"
|
|
| 1132 |
+ log_info "VMs NOT to auto-resume (were suspended): ${was_suspended[*]:-none}"
|
|
| 1133 |
+ log_info "CTs to auto-start: ${ct_to_start[*]:-none}"
|
|
| 1134 |
+ |
|
| 1135 |
+ return $fail_count |
|
| 1136 |
+} |
|
| 1137 |
+ |
|
| 1138 |
+do_cleanup() {
|
|
| 1139 |
+ log_info "Starting suspend artifact cleanup on $(hostname)" |
|
| 1140 |
+ |
|
| 1141 |
+ load_vm_config_metadata |
|
| 1142 |
+ cleanup_orphan_suspend_artifacts |
|
| 1143 |
+ return $? |
|
| 1144 |
+} |
|
| 1145 |
+ |
|
| 1146 |
+# Main resume operation |
|
| 1147 |
+do_resume() {
|
|
| 1148 |
+ log_info "Starting resume/start operation on $(hostname)" |
|
| 1149 |
+ |
|
| 1150 |
+ # Load all VM and CT info in one pass |
|
| 1151 |
+ load_vm_info |
|
| 1152 |
+ load_ct_info |
|
| 1153 |
+ |
|
| 1154 |
+ local state_json |
|
| 1155 |
+ state_json=$(load_state) |
|
| 1156 |
+ if [[ $? -ne 0 ]]; then |
|
| 1157 |
+ log_warning "No saved state - nothing to resume" |
|
| 1158 |
+ return 0 |
|
| 1159 |
+ fi |
|
| 1160 |
+ |
|
| 1161 |
+ # Parse state file |
|
| 1162 |
+ local to_resume=($(echo "$state_json" | jq -r '.to_resume[]' 2>/dev/null)) |
|
| 1163 |
+ local was_suspended=($(echo "$state_json" | jq -r '.was_suspended[]' 2>/dev/null)) |
|
| 1164 |
+ local ct_to_start=($(echo "$state_json" | jq -r '.ct_to_start[]' 2>/dev/null)) |
|
| 1165 |
+ local saved_timestamp=$(echo "$state_json" | jq -r '.timestamp' 2>/dev/null) |
|
| 1166 |
+ local -A saved_vm_volume=() |
|
| 1167 |
+ local -A saved_vm_date=() |
|
| 1168 |
+ local saved_volume |
|
| 1169 |
+ local current_volume |
|
| 1170 |
+ |
|
| 1171 |
+ while IFS=$'\t' read -r vmid saved_volume saved_date; do |
|
| 1172 |
+ [[ -z "$vmid" ]] && continue |
|
| 1173 |
+ saved_vm_volume[$vmid]="$saved_volume" |
|
| 1174 |
+ saved_vm_date[$vmid]="$saved_date" |
|
| 1175 |
+ done < <( |
|
| 1176 |
+ echo "$state_json" | jq -r ' |
|
| 1177 |
+ (.vm_details // {})
|
|
| 1178 |
+ | to_entries[] |
|
| 1179 |
+ | [.key, (.value.suspend_volume // ""), (.value.suspend_file_date // "")] |
|
| 1180 |
+ | @tsv |
|
| 1181 |
+ ' 2>/dev/null |
|
| 1182 |
+ ) |
|
| 1183 |
+ |
|
| 1184 |
+ log_info "State file from: $saved_timestamp" |
|
| 1185 |
+ |
|
| 1186 |
+ local resume_count=0 |
|
| 1187 |
+ local skip_count=0 |
|
| 1188 |
+ local fail_count=0 |
|
| 1189 |
+ |
|
| 1190 |
+ # --- Resume QEMU VMs --- |
|
| 1191 |
+ |
|
| 1192 |
+ # Log warnings for VMs that won't be resumed |
|
| 1193 |
+ for vmid in "${was_suspended[@]}"; do
|
|
| 1194 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 1195 |
+ log_warning "VM $vmid ($name) was already suspended before maintenance - NOT auto-resuming" |
|
| 1196 |
+ ((skip_count++)) |
|
| 1197 |
+ done |
|
| 1198 |
+ |
|
| 1199 |
+ # Resume VMs that should be resumed |
|
| 1200 |
+ for vmid in "${to_resume[@]}"; do
|
|
| 1201 |
+ local name="${VM_NAME[$vmid]:-unknown}"
|
|
| 1202 |
+ |
|
| 1203 |
+ # Verify VM still exists and has suspend lock |
|
| 1204 |
+ if [[ ! -f "/etc/pve/qemu-server/${vmid}.conf" ]]; then
|
|
| 1205 |
+ log_error "VM $vmid config not found - skipping" |
|
| 1206 |
+ ((fail_count++)) |
|
| 1207 |
+ continue |
|
| 1208 |
+ fi |
|
| 1209 |
+ |
|
| 1210 |
+ if [[ -z "${VM_HAS_LOCK[$vmid]}" ]]; then
|
|
| 1211 |
+ log_warning "VM $vmid ($name) no longer has suspend lock - may have been manually resumed" |
|
| 1212 |
+ ((skip_count++)) |
|
| 1213 |
+ continue |
|
| 1214 |
+ fi |
|
| 1215 |
+ |
|
| 1216 |
+ saved_volume="${saved_vm_volume[$vmid]:-}"
|
|
| 1217 |
+ current_volume="${VM_VMSTATE[$vmid]:-}"
|
|
| 1218 |
+ if [[ -n "$saved_volume" && "$current_volume" != "$saved_volume" ]]; then |
|
| 1219 |
+ log_warning "VM $vmid ($name) suspend volume changed since state file (${saved_vm_date[$vmid]:-unknown date}): saved=$saved_volume current=${current_volume:-none} - skipping auto-resume"
|
|
| 1220 |
+ ((skip_count++)) |
|
| 1221 |
+ continue |
|
| 1222 |
+ fi |
|
| 1223 |
+ |
|
| 1224 |
+ resume_vm "$vmid" |
|
| 1225 |
+ case $? in |
|
| 1226 |
+ 0) ((resume_count++)) ;; |
|
| 1227 |
+ 2) ((skip_count++)) ;; |
|
| 1228 |
+ *) ((fail_count++)) ;; |
|
| 1229 |
+ esac |
|
| 1230 |
+ done |
|
| 1231 |
+ |
|
| 1232 |
+ # --- Start LXC Containers --- |
|
| 1233 |
+ for ctid in "${ct_to_start[@]}"; do
|
|
| 1234 |
+ local name="${CT_NAME[$ctid]:-unknown}"
|
|
| 1235 |
+ |
|
| 1236 |
+ # Verify CT still exists |
|
| 1237 |
+ if [[ ! -f "/etc/pve/lxc/${ctid}.conf" ]]; then
|
|
| 1238 |
+ log_error "CT $ctid config not found - skipping" |
|
| 1239 |
+ ((fail_count++)) |
|
| 1240 |
+ continue |
|
| 1241 |
+ fi |
|
| 1242 |
+ |
|
| 1243 |
+ # Check if already running (someone started it manually) |
|
| 1244 |
+ if [[ "${CT_STATUS[$ctid]}" == "running" ]]; then
|
|
| 1245 |
+ log_warning "CT $ctid ($name) is already running - skipping" |
|
| 1246 |
+ ((skip_count++)) |
|
| 1247 |
+ continue |
|
| 1248 |
+ fi |
|
| 1249 |
+ |
|
| 1250 |
+ start_ct "$ctid" |
|
| 1251 |
+ case $? in |
|
| 1252 |
+ 0) ((resume_count++)) ;; |
|
| 1253 |
+ 2) ((skip_count++)) ;; |
|
| 1254 |
+ *) ((fail_count++)) ;; |
|
| 1255 |
+ esac |
|
| 1256 |
+ done |
|
| 1257 |
+ |
|
| 1258 |
+ # Clear state file only on full success; keep it for retry if any failures. |
|
| 1259 |
+ if [[ $fail_count -eq 0 ]]; then |
|
| 1260 |
+ clear_state |
|
| 1261 |
+ else |
|
| 1262 |
+ log_warning "Resume/start encountered failures - keeping state file for retry" |
|
| 1263 |
+ fi |
|
| 1264 |
+ |
|
| 1265 |
+ # Summary |
|
| 1266 |
+ log_success "Resume/start complete: $resume_count restored, $skip_count skipped, $fail_count failed" |
|
| 1267 |
+ |
|
| 1268 |
+ return $fail_count |
|
| 1269 |
+} |
|
| 1270 |
+ |
|
| 1271 |
+# Acquire lock to prevent concurrent runs |
|
| 1272 |
+acquire_lock() {
|
|
| 1273 |
+ if [[ $DRY_RUN -eq 1 ]]; then |
|
| 1274 |
+ return 0 |
|
| 1275 |
+ fi |
|
| 1276 |
+ |
|
| 1277 |
+ if [[ -f "$LOCK_FILE" ]]; then |
|
| 1278 |
+ local pid=$(cat "$LOCK_FILE" 2>/dev/null) |
|
| 1279 |
+ if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then |
|
| 1280 |
+ log_error "Another instance is running (PID $pid)" |
|
| 1281 |
+ exit 1 |
|
| 1282 |
+ fi |
|
| 1283 |
+ # Stale lock file |
|
| 1284 |
+ rm -f "$LOCK_FILE" |
|
| 1285 |
+ fi |
|
| 1286 |
+ |
|
| 1287 |
+ echo $$ > "$LOCK_FILE" |
|
| 1288 |
+ trap "rm -f '$LOCK_FILE'" EXIT |
|
| 1289 |
+} |
|
| 1290 |
+ |
|
| 1291 |
+# Parse command line |
|
| 1292 |
+COMMAND="" |
|
| 1293 |
+while [[ $# -gt 0 ]]; do |
|
| 1294 |
+ case "$1" in |
|
| 1295 |
+ suspend|resume|cleanup) |
|
| 1296 |
+ COMMAND="$1" |
|
| 1297 |
+ shift |
|
| 1298 |
+ ;; |
|
| 1299 |
+ -n|--dry-run) |
|
| 1300 |
+ DRY_RUN=1 |
|
| 1301 |
+ shift |
|
| 1302 |
+ ;; |
|
| 1303 |
+ -v|--verbose) |
|
| 1304 |
+ ((VERBOSE++)) |
|
| 1305 |
+ shift |
|
| 1306 |
+ ;; |
|
| 1307 |
+ -vv) |
|
| 1308 |
+ VERBOSE=2 |
|
| 1309 |
+ shift |
|
| 1310 |
+ ;; |
|
| 1311 |
+ -h|--help) |
|
| 1312 |
+ usage |
|
| 1313 |
+ exit 0 |
|
| 1314 |
+ ;; |
|
| 1315 |
+ *) |
|
| 1316 |
+ echo "Unknown option: $1" >&2 |
|
| 1317 |
+ usage |
|
| 1318 |
+ exit 1 |
|
| 1319 |
+ ;; |
|
| 1320 |
+ esac |
|
| 1321 |
+done |
|
| 1322 |
+ |
|
| 1323 |
+if [[ -z "$COMMAND" ]]; then |
|
| 1324 |
+ echo "Error: No command specified" >&2 |
|
| 1325 |
+ usage |
|
| 1326 |
+ exit 1 |
|
| 1327 |
+fi |
|
| 1328 |
+ |
|
| 1329 |
+# Ensure state directory exists |
|
| 1330 |
+mkdir -p "$STATE_DIR" |
|
| 1331 |
+ |
|
| 1332 |
+# Migrate state from the legacy location used by older installs. |
|
| 1333 |
+migrate_legacy_state_if_needed |
|
| 1334 |
+ |
|
| 1335 |
+# Acquire lock |
|
| 1336 |
+acquire_lock |
|
| 1337 |
+ |
|
| 1338 |
+# Execute command |
|
| 1339 |
+case "$COMMAND" in |
|
| 1340 |
+ suspend) |
|
| 1341 |
+ do_suspend |
|
| 1342 |
+ exit $? |
|
| 1343 |
+ ;; |
|
| 1344 |
+ resume) |
|
| 1345 |
+ do_resume |
|
| 1346 |
+ exit $? |
|
| 1347 |
+ ;; |
|
| 1348 |
+ cleanup) |
|
| 1349 |
+ do_cleanup |
|
| 1350 |
+ exit $? |
|
| 1351 |
+ ;; |
|
| 1352 |
+esac |
|
@@ -0,0 +1,104 @@ |
||
| 1 |
+# PGS - Intent, Problems, and Trade-off |
|
| 2 |
+ |
|
| 3 |
+## Intent |
|
| 4 |
+ |
|
| 5 |
+The original goal was simple: |
|
| 6 |
+- save the state of active guests before maintenance work; |
|
| 7 |
+- restore them once the nodes are back; |
|
| 8 |
+- avoid starting guests that were already suspended or stopped before the operation. |
|
| 9 |
+ |
|
| 10 |
+For QEMU VMs, this meant `qm suspend --todisk 1`. |
|
| 11 |
+For LXC containers, this meant `pct shutdown`, followed by `pct start` on restore. |
|
| 12 |
+ |
|
| 13 |
+## Initial approach |
|
| 14 |
+ |
|
| 15 |
+The first version was automatic: |
|
| 16 |
+- `systemd` called `suspend` when the node shut down; |
|
| 17 |
+- `systemd` called `resume` when the node came back; |
|
| 18 |
+- a local JSON file kept the list of guests that had to be restored. |
|
| 19 |
+ |
|
| 20 |
+The motivation was a "hands-off" flow for reboot and shutdown. |
|
| 21 |
+ |
|
| 22 |
+## Problems encountered |
|
| 23 |
+ |
|
| 24 |
+In practice, the automatic approach was fragile on a real Proxmox cluster. |
|
| 25 |
+ |
|
| 26 |
+Observed problems: |
|
| 27 |
+- stale suspend images: |
|
| 28 |
+ - `disk image '...state-suspend-....raw' already exists` |
|
| 29 |
+- failed writes to `pmxcfs` / `/etc/pve`: |
|
| 30 |
+ - `Permission denied` |
|
| 31 |
+ - `Device or resource busy` |
|
| 32 |
+- windows without quorum while nodes were shutting down or coming back; |
|
| 33 |
+- VMs that started but were left with `lock: suspended`; |
|
| 34 |
+- partial restores, with only some of the guests restarted; |
|
| 35 |
+- behavior that depended on the exact order in which the network, corosync, storage, and `pve-cluster` came back. |
|
| 36 |
+ |
|
| 37 |
+The structural problem was this: |
|
| 38 |
+- `qm suspend` and `qm resume` are not purely local operations; |
|
| 39 |
+- they need consistent writes to `/etc/pve`; |
|
| 40 |
+- `/etc/pve` depends on `pmxcfs` and on the cluster state; |
|
| 41 |
+- when several nodes shut down or start at the same time, this dependency becomes a source of races and inconsistencies. |
|
| 42 |
+ |
|
| 43 |
+We added several tactical fixes: |
|
| 44 |
+- cleanup of stale images; |
|
| 45 |
+- retry after failure; |
|
| 46 |
+- temporary quorum relaxation with `pvecm expected 1`; |
|
| 47 |
+- automatic cleanup of `lock: suspended`. |
|
| 48 |
+ |
|
| 49 |
+These fixes reduced some errors, but they did not change the fact that the automatic model remained nondeterministic in cluster-wide maintenance scenarios. |
|
| 50 |
+ |
|
| 51 |
+## Conclusion |
|
| 52 |
+ |
|
| 53 |
+Automation at shutdown/boot was not robust enough for the real environment. |
|
| 54 |
+ |
|
| 55 |
+More precisely: |
|
| 56 |
+- for a single-node reboot, it could sometimes work acceptably; |
|
| 57 |
+- for work in which several nodes or the whole cluster are shut down, the results were not predictable enough. |
|
| 58 |
+ |
|
| 59 |
+The problem was no longer an isolated bug, but a mismatch between the design and the real operating conditions. |
|
| 60 |
+ |
|
| 61 |
+## The chosen trade-off |
|
| 62 |
+ |
|
| 63 |
+A simpler and more controllable variant was chosen: |
|
| 64 |
+- no `systemd` automation; |
|
| 65 |
+- no hooks at shutdown or boot; |
|
| 66 |
+- suspend is run manually before maintenance; |
|
| 67 |
+- restore is run manually after the cluster is back. |
|
| 68 |
+ |
|
| 69 |
+The commands are: |
|
| 70 |
+ |
|
| 71 |
+```bash |
|
| 72 |
+/usr/local/sbin/pgs suspend -v |
|
| 73 |
+/usr/local/sbin/pgs resume -v |
|
| 74 |
+``` |
|
| 75 |
+ |
|
| 76 |
+## Why this trade-off is acceptable |
|
| 77 |
+ |
|
| 78 |
+The convenience of automation is lost, but the gains are: |
|
| 79 |
+- explicit control over when execution happens; |
|
| 80 |
+- the ability to wait for the cluster to come back before `resume`; |
|
| 81 |
+- simpler debugging; |
|
| 82 |
+- fewer surprising effects during shutdown; |
|
| 83 |
+- clear separation between maintenance preparation and the later restore. |
|
| 84 |
+ |
|
| 85 |
+In practice, the operator decides when the cluster is stable enough for restore, instead of leaving that to the service startup order. |
|
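As a rough illustration of waiting for the cluster before `resume`, the sketch below polls for quorum and returns once it is reported. It assumes `pvecm status` prints a `Quorate: Yes` line. The function names and the `STATUS_CMD` override are hypothetical and are not part of `pgs`.

```bash
#!/bin/bash
# Sketch only: wait until the cluster reports quorum before running `pgs resume`.
# STATUS_CMD is a hypothetical override (useful for testing); default is `pvecm status`.

cluster_is_quorate() {
    # Assumption: the status output contains a line like "Quorate:          Yes".
    # The expansion is intentionally unquoted so "pvecm status" splits into command + argument.
    ${STATUS_CMD:-pvecm status} 2>/dev/null | grep -Eq '^Quorate:[[:space:]]+Yes'
}

wait_for_quorum() {
    # $1 = number of attempts (default 30), $2 = seconds between attempts (default 10)
    local attempts="${1:-30}" delay="${2:-10}" i
    for ((i = 1; i <= attempts; i++)); do
        cluster_is_quorate && return 0
        sleep "$delay"
    done
    return 1
}
```

One possible use is `wait_for_quorum 30 10 && /usr/local/sbin/pgs resume -v`. Quorum alone does not show that storage is back, so the operator's judgment still applies.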
| 86 |
+ |
|
| 87 |
+## What intentionally remains in the code |
|
| 88 |
+ |
|
| 89 |
+Although the automation was removed, the script keeps some useful safeguards: |
|
| 90 |
+- detection and cleanup of stale suspend images; |
|
| 91 |
+- handling of guests that are already running or already suspended; |
|
| 92 |
+- cleanup of `lock: suspended` where possible; |
|
| 93 |
+- clear logging in `journalctl -t pgs`. |
|
| 94 |
+ |
|
| 95 |
+These remain useful in the manual flow as well. |
|
| 96 |
+ |
|
| 97 |
+## What the project no longer does |
|
| 98 |
+ |
|
| 99 |
+The project no longer tries to: |
|
| 100 |
+- orchestrate node reboots; |
|
| 101 |
+- decide automatically the right moment for restore; |
|
| 102 |
+- guarantee automatic restore after the cluster comes back. |
|
| 103 |
+ |
|
| 104 |
+This is now a manual guest-state utility, not a reboot manager. |
|
@@ -0,0 +1,97 @@ |
||
| 1 |
+# PGS - Technical Notes |
|
| 2 |
+ |
|
| 3 |
+## Role |
|
| 4 |
+ |
|
| 5 |
+`pgs` provides a manual, predictable flow for: |
|
| 6 |
+- suspend to disk for running QEMU VMs |
|
| 7 |
+- graceful shutdown for running LXC containers |
|
| 8 |
+- resume/start after maintenance, based on a local state file |
|
| 9 |
+ |
|
| 10 |
+## Installed command |
|
| 11 |
+ |
|
| 12 |
+- location: `/usr/local/sbin/pgs` |
|
| 13 |
+- canonical uninstall: `/usr/local/lib/xdev/pve-guests-state/uninstall.sh` |
|
| 14 |
+- installed documentation: `/usr/local/share/doc/xdev/pve-guests-state` |
|
| 15 |
+ |
|
| 16 |
+## Runtime state |
|
| 17 |
+ |
|
| 18 |
+- current location: `/var/lib/xdev/pve-guests-state/pgs-state.json` |
|
| 19 |
+- legacy location accepted for migration: `/var/lib/pve-manager/pgs-state.json` |
|
| 20 |
+- lock file: `/run/pgs.lock` |
|
| 21 |
+ |
|
| 22 |
+The state file contains: |
|
| 23 |
+- `timestamp` |
|
| 24 |
+- `hostname` |
|
| 25 |
+- `to_resume` |
|
| 26 |
+- `was_suspended` |
|
| 27 |
+- `ct_to_start` |
|
| 28 |
+- `vm_details` |
|
| 29 |
+ - `mode` |
|
| 30 |
+ - `suspend_volume` |
|
| 31 |
+ - `suspend_file_date` |
|
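To make the layout of the state file concrete, the sketch below renders the top-level fields listed above from bash arrays. It is an illustration, not the actual `save_state` code: the helper names, the timestamp format, and the use of `uname -n` for the host name are assumptions. `vm_details` is left empty because the values of its per-VM fields are not documented here.

```bash
#!/bin/bash
# Sketch only: render the documented top-level state fields as JSON.
# json_array and render_state are hypothetical helper names.

json_array() {
    # Join numeric guest IDs with commas: "101 102" -> [101,102]; no arguments -> [].
    local IFS=,
    printf '[%s]' "$*"
}

render_state() {
    # $1..$3 are the NAMES of arrays holding guest IDs (namerefs need bash 4.3+).
    local -n resume_ref=$1 suspended_ref=$2 ct_ref=$3
    cat <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "hostname": "$(uname -n)",
  "to_resume": $(json_array "${resume_ref[@]}"),
  "was_suspended": $(json_array "${suspended_ref[@]}"),
  "ct_to_start": $(json_array "${ct_ref[@]}"),
  "vm_details": {}
}
EOF
}
```

A call would look like `render_state to_resume was_suspended ct_to_start`, with the three arrays defined by the caller.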
| 32 |
+ |
|
| 33 |
+## Commands |
|
| 34 |
+ |
|
| 35 |
+```bash |
|
| 36 |
+/usr/local/sbin/pgs suspend [-v] [--dry-run] |
|
| 37 |
+/usr/local/sbin/pgs resume [-v] [--dry-run] |
|
| 38 |
+/usr/local/sbin/pgs cleanup [-v] [--dry-run] |
|
| 39 |
+``` |
|
| 40 |
+ |
|
| 41 |
+## Behavior |
|
| 42 |
+ |
|
| 43 |
+### Suspend |
|
| 44 |
+ |
|
| 45 |
+- preflight cleanup of orphan/stale `vm-*-state-suspend-YYYY-MM-DD.raw` volumes |
|
| 46 |
+- running VM -> `qm suspend --todisk 1` -> added to `to_resume` |
|
| 47 |
+- VM paused/suspended to RAM -> suspend to disk, but not added to `to_resume` |
|
| 48 |
+- VM already suspended to disk -> warning, no auto-resume; detecting a disk suspend requires `lock: suspended`, `vmstate:` in the config, and a saved-state volume that resolves in storage |
|
| 49 |
+- running CT -> `pct shutdown --timeout 120` -> added to `ct_to_start` |
|
| 50 |
+- if a state file already exists, a new `suspend` merges into the existing state and preserves the earlier `to_resume` intent |
|
| 51 |
+- for every VM recorded in the state, `suspend_volume` and `suspend_file_date` are saved as well |
|
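The merge rule for a repeated `suspend` can be read as a set union: IDs recorded by the earlier run stay in `to_resume`, and newly suspended IDs are appended. The sketch below shows that reading only. `merge_unique` is a hypothetical helper, and the script's actual merge code may differ.

```bash
#!/bin/bash
# Sketch only: order-preserving union of guest IDs (requires bash 4+ for associative arrays).

merge_unique() {
    local -A seen=()
    local id
    local out=()
    for id in "$@"; do
        # Skip IDs that were already emitted.
        [[ -n "${seen[$id]:-}" ]] && continue
        seen[$id]=1
        out+=("$id")
    done
    echo "${out[*]}"
}
```

For example, `merge_unique "${previous_to_resume[@]}" "${new_to_resume[@]}"` keeps the earlier entries first, with both array names being placeholders.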
| 52 |
+ |
|
| 53 |
+### Cleanup |
|
| 54 |
+ |
|
| 55 |
+- scans the storages with `content images` defined in `/etc/pve/storage.cfg` |
|
| 56 |
+- looks only for `vm-*-state-suspend-YYYY-MM-DD.raw` files |
|
| 57 |
+- ignores files of the form `vm-*-state-cp*.raw` |
|
| 58 |
+- if a `state-suspend` volume is referenced by a validly suspended VM, it is kept |
|
| 59 |
+- if a `state-suspend` volume is referenced but the VM no longer has a valid suspend state, `lock`, `vmstate`, and the volume are cleaned up |
|
| 60 |
+- if a `state-suspend` volume is no longer referenced by any VM, it is treated as an orphan and deleted |
|
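The file name rules above can be sketched as a single pattern match. This is a minimal illustration under the assumption that the VM ID is numeric. `classify_state_volume` is a hypothetical name and not a function of the script.

```bash
#!/bin/bash
# Sketch only: decide whether a file name is a suspend-state volume.
# Prints "suspend <vmid>" for vm-<vmid>-state-suspend-YYYY-MM-DD.raw, otherwise "ignore".

classify_state_volume() {
    local name="${1##*/}"   # keep only the file name, dropping any directory part
    local re='^vm-([0-9]+)-state-suspend-[0-9]{4}-[0-9]{2}-[0-9]{2}\.raw$'
    if [[ "$name" =~ $re ]]; then
        echo "suspend ${BASH_REMATCH[1]}"
    else
        # Covers checkpoint states (vm-*-state-cp*.raw), disks, and anything else.
        echo "ignore"
    fi
}
```

Anchoring the pattern at both ends is what keeps `vm-*-state-cp*.raw` files out of the cleanup scope.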
| 61 |
+ |
|
| 62 |
+### Resume |
|
| 63 |
+ |
|
| 64 |
+- VMs in `to_resume` -> `qm resume` |
|
| 65 |
+- CTs in `ct_to_start` -> `pct start` |
|
| 66 |
+- if the current `suspend_volume` no longer matches the one in the state, the VM is treated as altered after the state was saved and is not auto-resumed |
|
| 67 |
+- if failures occur, the state file is kept for retry |
|
| 68 |
+- if everything succeeds, the state file is deleted |
|
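The per-VM checks that precede `qm resume` can be condensed into one decision function. The sketch mirrors the lock check and the volume comparison in `do_resume`. `resume_decision` and its output strings are hypothetical and exist only for illustration.

```bash
#!/bin/bash
# Sketch only: decide whether a VM from to_resume should be auto-resumed.
# $1 = non-empty if the VM config still has a suspend lock
# $2 = suspend volume recorded in the state file (may be empty for older state files)
# $3 = suspend volume currently referenced by the VM config

resume_decision() {
    local has_lock="$1" saved_volume="$2" current_volume="$3"
    if [[ -z "$has_lock" ]]; then
        echo "skip: no suspend lock"
    elif [[ -n "$saved_volume" && "$current_volume" != "$saved_volume" ]]; then
        echo "skip: suspend volume changed"
    else
        echo "resume"
    fi
}
```

An empty saved volume does not block the resume, which keeps state files without `vm_details` usable.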
| 69 |
+ |
|
| 70 |
+## Implemented safeguards |
|
| 71 |
+ |
|
| 72 |
+- stale suspend image cleanup |
|
| 73 |
+- cleanup of orphan `vm-*-state-suspend-YYYY-MM-DD.raw` volumes |
|
| 74 |
+- retry after specific quorum errors |
|
| 75 |
+- `pvecm expected 1` during the maintenance window, when the error indicates missing quorum |
|
| 76 |
+- cleanup of `lock: suspended` when the VM is already running |
|
| 77 |
+- cleanup of stale suspend artifacts on `stopped` VMs: `lock: suspended`, a leftover `vmstate:` in the config, and orphaned saved-state volumes |
|
| 78 |
+- local lock to prevent concurrent runs |
|
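For the local lock, a PID file that is checked and then written leaves a short window in which two runs can both pass the check. One possible hardening, shown here as a sketch and not as the script's current behavior, is to create the file with bash's `noclobber` option so that creation fails if the file already exists. The `/tmp` default path and the function names are assumptions made so the example runs unprivileged.

```bash
#!/bin/bash
# Sketch only: PID lock file created with noclobber.
# LOCK_FILE defaults to a hypothetical path; the real script uses /run/pgs.lock.

LOCK_FILE="${LOCK_FILE:-/tmp/pgs-example.lock}"

try_lock() {
    # With noclobber set, ">" refuses to overwrite an existing regular file.
    # $$ is the PID of the main shell, even inside the subshell.
    if ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null; then
        return 0
    fi
    local pid
    pid=$(cat "$LOCK_FILE" 2>/dev/null)
    if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
        return 1    # a live process holds the lock
    fi
    # Stale lock: remove it and retry once.
    rm -f "$LOCK_FILE"
    ( set -o noclobber; echo "$$" > "$LOCK_FILE" ) 2>/dev/null
}

release_lock() {
    rm -f "$LOCK_FILE"
}
```

The stale-lock path still has a small race between `rm` and the retry, so this narrows the window without closing it completely.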
| 79 |
+ |
|
| 80 |
+## Logging |
|
| 81 |
+ |
|
| 82 |
+- interactive: output on the terminal |
|
| 83 |
+- via systemd/journal stream: avoids duplicating messages in the journal |
|
| 84 |
+- journal tag: `pgs` |
|
| 85 |
+ |
|
| 86 |
+Examples: |
|
| 87 |
+ |
|
| 88 |
+```bash |
|
| 89 |
+journalctl -t pgs -n 50 --no-pager |
|
| 90 |
+journalctl -t pgs -f |
|
| 91 |
+``` |
|
| 92 |
+ |
|
| 93 |
+## Design notes |
|
| 94 |
+ |
|
| 95 |
+- the project no longer uses systemd units for automatic execution |
|
| 96 |
+- the files under `systemd/` are legacy and are not part of the current install |
|
| 97 |
+- the project does not yet have its own persistent config under `/etc/xdev/...` |
|
@@ -0,0 +1,8 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "folders": [ |
|
| 3 |
+ {
|
|
| 4 |
+ "path": "." |
|
| 5 |
+ } |
|
| 6 |
+ ], |
|
| 7 |
+ "settings": {}
|
|
| 8 |
+} |
|
@@ -0,0 +1,141 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="pve-guests-state" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 8 |
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
|
|
| 9 |
+STATE_DIR="/var/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 10 |
+COMMAND_PATH="/usr/local/sbin/pgs" |
|
| 11 |
+UNINSTALL_PATH="${INSTALL_DIR}/uninstall.sh"
|
|
| 12 |
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
|
|
| 13 |
+ |
|
| 14 |
+SOURCE_DIR="" |
|
| 15 |
+ |
|
| 16 |
+usage() {
|
|
| 17 |
+ cat <<EOF |
|
| 18 |
+Usage: $0 [--source-dir <path>] |
|
| 19 |
+ |
|
| 20 |
+Install ${PROJECT_ID} on the current host.
|
|
| 21 |
+EOF |
|
| 22 |
+} |
|
| 23 |
+ |
|
| 24 |
+require_root() {
|
|
| 25 |
+ if [[ "${EUID}" -ne 0 ]]; then
|
|
| 26 |
+ echo "ERROR: this script must be run as root" >&2 |
|
| 27 |
+ exit 1 |
|
| 28 |
+ fi |
|
| 29 |
+} |
|
| 30 |
+ |
|
| 31 |
+resolve_source_dir() {
|
|
| 32 |
+ if [[ -n "${SOURCE_DIR}" ]]; then
|
|
| 33 |
+ SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
|
|
| 34 |
+ else |
|
| 35 |
+ SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
|
| 36 |
+ fi |
|
| 37 |
+} |
|
| 38 |
+ |
|
| 39 |
+validate_source_tree() {
|
|
| 40 |
+ local required_files=( |
|
| 41 |
+ "${SOURCE_DIR}/bin/pgs"
|
|
| 42 |
+ "${SOURCE_DIR}/scripts/uninstall.sh"
|
|
| 43 |
+ "${SOURCE_DIR}/README.md"
|
|
| 44 |
+ "${SOURCE_DIR}/INSTALL.md"
|
|
| 45 |
+ "${SOURCE_DIR}/CHANGELOG.md"
|
|
| 46 |
+ "${SOURCE_DIR}/LICENSE"
|
|
| 47 |
+ "${SOURCE_DIR}/docs/DECISIONS.md"
|
|
| 48 |
+ "${SOURCE_DIR}/docs/TECHNICAL.md"
|
|
| 49 |
+ ) |
|
| 50 |
+ local file="" |
|
| 51 |
+ for file in "${required_files[@]}"; do
|
|
| 52 |
+ if [[ ! -f "${file}" ]]; then
|
|
| 53 |
+ echo "ERROR: missing required source file: ${file}" >&2
|
|
| 54 |
+ exit 1 |
|
| 55 |
+ fi |
|
| 56 |
+ done |
|
| 57 |
+} |
|
| 58 |
+ |
|
| 59 |
+cleanup_legacy_artifacts() {
|
|
| 60 |
+ rm -f /usr/local/sbin/pve-reboot-manager.sh |
|
| 61 |
+ rm -f /usr/local/sbin/pve-guest-state.sh |
|
| 62 |
+ rm -f /root/bin/pgs |
|
| 63 |
+ rm -f /root/bin/pve-reboot-manager.sh |
|
| 64 |
+ rm -f /root/bin/pve-guest-state.sh |
|
| 65 |
+ |
|
| 66 |
+ systemctl disable pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true |
|
| 67 |
+ systemctl stop pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true |
|
| 68 |
+ rm -f /etc/systemd/system/pve-suspend-vms.service |
|
| 69 |
+ rm -f /etc/systemd/system/pve-resume-vms.service |
|
| 70 |
+ systemctl daemon-reload |
|
| 71 |
+ systemctl reset-failed pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true |
|
| 72 |
+} |
|
| 73 |
+ |
|
| 74 |
+run_existing_uninstall() {
|
|
| 75 |
+ if [[ -x "${UNINSTALL_PATH}" ]]; then
|
|
| 76 |
+ echo "Existing installation detected. Running canonical uninstall first..." |
|
| 77 |
+ "${UNINSTALL_PATH}" --force || true
|
|
| 78 |
+ else |
|
| 79 |
+ bash "${SOURCE_DIR}/scripts/uninstall.sh" --force || true
|
|
| 80 |
+ fi |
|
| 81 |
+} |
|
| 82 |
+ |
|
| 83 |
+install_docs() {
|
|
| 84 |
+ mkdir -p "${DOC_DIR}/docs"
|
|
| 85 |
+ cp "${SOURCE_DIR}/README.md" "${DOC_DIR}/"
|
|
| 86 |
+ cp "${SOURCE_DIR}/INSTALL.md" "${DOC_DIR}/"
|
|
| 87 |
+ cp "${SOURCE_DIR}/CHANGELOG.md" "${DOC_DIR}/"
|
|
| 88 |
+ cp "${SOURCE_DIR}/LICENSE" "${DOC_DIR}/"
|
|
| 89 |
+ cp "${SOURCE_DIR}/docs/DECISIONS.md" "${DOC_DIR}/docs/"
|
|
| 90 |
+ cp "${SOURCE_DIR}/docs/TECHNICAL.md" "${DOC_DIR}/docs/"
|
|
| 91 |
+} |
|
| 92 |
+ |
|
| 93 |
+main() {
|
|
| 94 |
+ while [[ $# -gt 0 ]]; do |
|
| 95 |
+ case "$1" in |
|
| 96 |
+ --source-dir) |
|
| 97 |
+ SOURCE_DIR="${2:?--source-dir requires a path}" |
|
| 98 |
+ shift 2 |
|
| 99 |
+ ;; |
|
| 100 |
+ -h|--help) |
|
| 101 |
+ usage |
|
| 102 |
+ exit 0 |
|
| 103 |
+ ;; |
|
| 104 |
+ *) |
|
| 105 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 106 |
+ usage |
|
| 107 |
+ exit 1 |
|
| 108 |
+ ;; |
|
| 109 |
+ esac |
|
| 110 |
+ done |
|
| 111 |
+ |
|
| 112 |
+ require_root |
|
| 113 |
+ resolve_source_dir |
|
| 114 |
+ validate_source_tree |
|
| 115 |
+ |
|
| 116 |
+ echo "=== Installing ${PROJECT_ID} ==="
|
|
| 117 |
+ run_existing_uninstall |
|
| 118 |
+ |
|
| 119 |
+ mkdir -p "${INSTALL_DIR}" "${DOC_DIR}" "${STATE_DIR}" /usr/local/sbin
|
|
| 120 |
+ |
|
| 121 |
+ cleanup_legacy_artifacts |
|
| 122 |
+ |
|
| 123 |
+ install -m 0755 "${SOURCE_DIR}/bin/pgs" "${COMMAND_PATH}"
|
|
| 124 |
+ install -m 0755 "${SOURCE_DIR}/scripts/uninstall.sh" "${UNINSTALL_PATH}"
|
|
| 125 |
+ ln -sfn "${UNINSTALL_PATH}" "${UNINSTALL_WRAPPER}"
|
|
| 126 |
+ |
|
| 127 |
+ install_docs |
|
| 128 |
+ |
|
| 129 |
+ echo "Installed paths:" |
|
| 130 |
+ echo " command: ${COMMAND_PATH}"
|
|
| 131 |
+ echo " uninstall: ${UNINSTALL_PATH}"
|
|
| 132 |
+ echo " docs: ${DOC_DIR}"
|
|
| 133 |
+ echo " state: ${STATE_DIR}"
|
|
| 134 |
+ echo "" |
|
| 135 |
+ echo "Running dry-run verification..." |
|
| 136 |
+ "${COMMAND_PATH}" suspend --dry-run -v 2>&1 | tail -3 || true
|
|
| 137 |
+ echo "" |
|
| 138 |
+ echo "Installation completed." |
|
| 139 |
+} |
|
| 140 |
+ |
|
| 141 |
+main "$@" |
|
@@ -0,0 +1,86 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="pve-guests-state" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 8 |
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
|
|
| 9 |
+STATE_DIR="/var/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 10 |
+STATE_FILE="${STATE_DIR}/pgs-state.json"
|
|
| 11 |
+COMMAND_PATH="/usr/local/sbin/pgs" |
|
| 12 |
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
|
|
| 13 |
+ |
|
| 14 |
+FORCE_MODE=0 |
|
| 15 |
+ |
|
| 16 |
+log() {
|
|
| 17 |
+ if [[ "${FORCE_MODE}" -eq 0 ]]; then
|
|
| 18 |
+ echo "$@" |
|
| 19 |
+ fi |
|
| 20 |
+} |
|
| 21 |
+ |
|
| 22 |
+require_root() {
|
|
| 23 |
+ if [[ "${EUID}" -ne 0 ]]; then
|
|
| 24 |
+ echo "ERROR: this script must be run as root" >&2 |
|
| 25 |
+ exit 1 |
|
| 26 |
+ fi |
|
| 27 |
+} |
|
| 28 |
+ |
|
| 29 |
+cleanup_legacy_artifacts() {
|
|
| 30 |
+ rm -f /usr/local/sbin/pve-reboot-manager.sh |
|
| 31 |
+ rm -f /usr/local/sbin/pve-guest-state.sh |
|
| 32 |
+ rm -f /root/bin/pgs |
|
| 33 |
+ rm -f /root/bin/pve-reboot-manager.sh |
|
| 34 |
+ rm -f /root/bin/pve-guest-state.sh |
|
| 35 |
+ |
|
| 36 |
+ rm -f /var/lib/pve-manager/pgs-state.json |
|
| 37 |
+ rm -f /var/lib/pve-manager/guest-state.json |
|
| 38 |
+ rm -f /var/lib/pve-manager/reboot-vm-state.json |
|
| 39 |
+} |
|
| 40 |
+ |
|
| 41 |
+main() {
|
|
| 42 |
+ while [[ $# -gt 0 ]]; do |
|
| 43 |
+ case "$1" in |
|
| 44 |
+ --force) |
|
| 45 |
+ FORCE_MODE=1 |
|
| 46 |
+ shift |
|
| 47 |
+ ;; |
|
| 48 |
+ -h|--help) |
|
| 49 |
+ echo "Usage: $0 [--force]" |
|
| 50 |
+ exit 0 |
|
| 51 |
+ ;; |
|
| 52 |
+ *) |
|
| 53 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 54 |
+ exit 1 |
|
| 55 |
+ ;; |
|
| 56 |
+ esac |
|
| 57 |
+ done |
|
| 58 |
+ |
|
| 59 |
+ require_root |
|
| 60 |
+ |
|
| 61 |
+ log "=== Uninstalling ${PROJECT_ID} ==="
|
|
| 62 |
+ |
|
| 63 |
+ systemctl disable pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true |
|
| 64 |
+ systemctl stop pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true |
|
| 65 |
+ rm -f /etc/systemd/system/pve-suspend-vms.service |
|
| 66 |
+ rm -f /etc/systemd/system/pve-resume-vms.service |
|
| 67 |
+ systemctl daemon-reload |
|
| 68 |
+ systemctl reset-failed pve-suspend-vms.service pve-resume-vms.service >/dev/null 2>&1 || true |
|
| 69 |
+ |
|
| 70 |
+ rm -f "${UNINSTALL_WRAPPER}"
|
|
| 71 |
+ rm -f "${COMMAND_PATH}"
|
|
| 72 |
+ rm -f "${STATE_FILE}"
|
|
| 73 |
+ rm -rf "${DOC_DIR}"
|
|
| 74 |
+ rm -rf "${INSTALL_DIR}"
|
|
| 75 |
+ rm -rf "${STATE_DIR}"
|
|
| 76 |
+ |
|
| 77 |
+ cleanup_legacy_artifacts |
|
| 78 |
+ |
|
| 79 |
+ rmdir "/usr/local/lib/${ORG_ID}" 2>/dev/null || true
|
|
| 80 |
+ rmdir "/usr/local/share/doc/${ORG_ID}" 2>/dev/null || true
|
|
| 81 |
+ rmdir "/var/lib/${ORG_ID}" 2>/dev/null || true
|
|
| 82 |
+ |
|
| 83 |
+ log "Uninstall complete." |
|
| 84 |
+} |
|
| 85 |
+ |
|
| 86 |
+main "$@" |
|
@@ -0,0 +1,153 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="pve-guests-state" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 8 |
+MODE="install" |
|
| 9 |
+REMOTE_NODE="" |
|
| 10 |
+REMOTE_USER="root" |
|
| 11 |
+LOCAL_MODE=0 |
|
| 12 |
+ |
|
| 13 |
+show_help() {
|
|
| 14 |
+ cat <<EOF |
|
| 15 |
+PGS setup wrapper |
|
| 16 |
+ |
|
| 17 |
+Usage: $0 [OPTIONS] [<target_node>] |
|
| 18 |
+ |
|
| 19 |
+Options: |
|
| 20 |
+ -h, --help Show this help message |
|
| 21 |
+ -l, --local Run on localhost |
|
| 22 |
+ -u, --uninstall Uninstall instead of install |
|
| 23 |
+ --user <user> Remote SSH user (default: root) |
|
| 24 |
+ |
|
| 25 |
+Examples: |
|
| 26 |
+ $0 --local |
|
| 27 |
+ $0 pve-node-2 |
|
| 28 |
+ $0 --user admin pve-node-2 |
|
| 29 |
+ $0 --uninstall pve-node-2 |
|
| 30 |
+ $0 --local --uninstall |
|
| 31 |
+EOF |
|
| 32 |
+} |
|
| 33 |
+ |
|
| 34 |
+run_local_install() {
|
|
| 35 |
+ bash "${SCRIPT_DIR}/scripts/install.sh" --source-dir "${SCRIPT_DIR}"
|
|
| 36 |
+} |
|
| 37 |
+ |
|
| 38 |
+run_local_uninstall() {
|
|
| 39 |
+ local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
|
|
| 40 |
+ |
|
| 41 |
+ if [[ -x "${canonical}" ]]; then
|
|
| 42 |
+ "${canonical}"
|
|
| 43 |
+ else |
|
| 44 |
+ bash "${SCRIPT_DIR}/scripts/uninstall.sh"
|
|
| 45 |
+ fi |
|
| 46 |
+} |
|
| 47 |
+ |
|
| 48 |
+copy_remote_tree() {
|
|
| 49 |
+ local target="$1" |
|
| 50 |
+ local remote_tmp="$2" |
|
| 51 |
+ |
|
| 52 |
+ ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/bin' '${remote_tmp}/scripts' '${remote_tmp}/docs'"
|
|
| 53 |
+ scp -q "${SCRIPT_DIR}/bin/pgs" "${target}:${remote_tmp}/bin/"
|
|
| 54 |
+ scp -q "${SCRIPT_DIR}/scripts/install.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 55 |
+ scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 56 |
+ scp -q "${SCRIPT_DIR}/README.md" "${target}:${remote_tmp}/"
|
|
| 57 |
+ scp -q "${SCRIPT_DIR}/INSTALL.md" "${target}:${remote_tmp}/"
|
|
| 58 |
+ scp -q "${SCRIPT_DIR}/CHANGELOG.md" "${target}:${remote_tmp}/"
|
|
| 59 |
+ scp -q "${SCRIPT_DIR}/LICENSE" "${target}:${remote_tmp}/"
|
|
| 60 |
+ scp -q "${SCRIPT_DIR}/docs/DECISIONS.md" "${target}:${remote_tmp}/docs/"
|
|
| 61 |
+ scp -q "${SCRIPT_DIR}/docs/TECHNICAL.md" "${target}:${remote_tmp}/docs/"
|
|
| 62 |
+} |
|
| 63 |
+ |
|
| 64 |
+run_remote_install() {
|
|
| 65 |
+ local target="$1" |
|
| 66 |
+ local remote_tmp="/tmp/${PROJECT_ID}.$$"
|
|
| 67 |
+ local remote_prefix="" |
|
| 68 |
+ |
|
| 69 |
+ [[ "${REMOTE_USER}" != "root" ]] && remote_prefix="sudo "
|
|
| 70 |
+ |
|
| 71 |
+ copy_remote_tree "${target}" "${remote_tmp}"
|
|
| 72 |
+ ssh "${target}" "${remote_prefix}bash '${remote_tmp}/scripts/install.sh' --source-dir '${remote_tmp}'"
|
|
| 73 |
+ ssh "${target}" "rm -rf '${remote_tmp}'"
|
|
| 74 |
+} |
|
| 75 |
+ |
|
| 76 |
+run_remote_uninstall() {
|
|
| 77 |
+ local target="$1" |
|
| 78 |
+ local remote_tmp="/tmp/${PROJECT_ID}-uninstall.$$"
|
|
| 79 |
+ local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
|
|
| 80 |
+ |
|
| 81 |
+ ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts'"
|
|
| 82 |
+ scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 83 |
+ if [[ "${REMOTE_USER}" == "root" ]]; then
|
|
| 84 |
+ ssh "${target}" "if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi"
|
|
| 85 |
+ else |
|
| 86 |
+ ssh "${target}" "sudo bash -lc \"if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi\""
|
|
| 87 |
+ fi |
|
| 88 |
+ ssh "${target}" "rm -rf '${remote_tmp}'"
|
|
| 89 |
+} |
|
| 90 |
+ |
|
| 91 |
+while [[ $# -gt 0 ]]; do |
|
| 92 |
+ case "$1" in |
|
| 93 |
+ -h|--help) |
|
| 94 |
+ show_help |
|
| 95 |
+ exit 0 |
|
| 96 |
+ ;; |
|
| 97 |
+ -l|--local) |
|
| 98 |
+ LOCAL_MODE=1 |
|
| 99 |
+ shift |
|
| 100 |
+ ;; |
|
| 101 |
+ -u|--uninstall) |
|
| 102 |
+ MODE="uninstall" |
|
| 103 |
+ shift |
|
| 104 |
+ ;; |
|
| 105 |
+ --user) |
|
| 106 |
+ REMOTE_USER="${2:?--user requires an argument}" |
|
| 107 |
+ shift 2 |
|
| 108 |
+ ;; |
|
| 109 |
+ -*) |
|
| 110 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 111 |
+ show_help |
|
| 112 |
+ exit 1 |
|
| 113 |
+ ;; |
|
| 114 |
+ *) |
|
| 115 |
+ REMOTE_NODE="$1" |
|
| 116 |
+ shift |
|
| 117 |
+ ;; |
|
| 118 |
+ esac |
|
| 119 |
+done |
|
| 120 |
+ |
|
| 121 |
+if [[ -z "${REMOTE_NODE}" && ${LOCAL_MODE} -eq 0 ]]; then
|
|
| 122 |
+ LOCAL_MODE=1 |
|
| 123 |
+fi |
|
| 124 |
+ |
|
| 125 |
+echo "================================" |
|
| 126 |
+echo "PGS - ${MODE}"
|
|
| 127 |
+echo "================================" |
|
| 128 |
+ |
|
| 129 |
+if [[ ${LOCAL_MODE} -eq 1 ]]; then
|
|
| 130 |
+ echo "Target: localhost" |
|
| 131 |
+ echo "" |
|
| 132 |
+ if [[ "${MODE}" == "install" ]]; then
|
|
| 133 |
+ run_local_install |
|
| 134 |
+ else |
|
| 135 |
+ run_local_uninstall |
|
| 136 |
+ fi |
|
| 137 |
+ exit 0 |
|
| 138 |
+fi |
|
| 139 |
+ |
|
| 140 |
+TARGET="${REMOTE_USER}@${REMOTE_NODE}"
|
|
| 141 |
+echo "Target: ${TARGET}"
|
|
| 142 |
+echo "" |
|
| 143 |
+ |
|
| 144 |
+if ! ping -c 1 "${REMOTE_NODE}" >/dev/null 2>&1; then
|
|
| 145 |
+ echo "ERROR: cannot reach ${REMOTE_NODE}" >&2
|
|
| 146 |
+ exit 1 |
|
| 147 |
+fi |
|
| 148 |
+ |
|
| 149 |
+if [[ "${MODE}" == "install" ]]; then
|
|
| 150 |
+ run_remote_install "${TARGET}"
|
|
| 151 |
+else |
|
| 152 |
+ run_remote_uninstall "${TARGET}"
|
|
| 153 |
+fi |
|
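The option loop in the wrapper above reads `$2` for `--user` without checking that a value was supplied; under `set -u` a bare `--user` would abort with an "unbound variable" message rather than a usage error. The following is a minimal sketch of one way to guard that case, not the project's actual parser. `parse_args` is a hypothetical helper name; the variable names mirror the wrapper.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: parses a subset of the wrapper's options and returns
# non-zero with a readable message when --user has no value.
parse_args() {
    REMOTE_USER="root"
    REMOTE_NODE=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --user)
                if [[ $# -lt 2 ]]; then
                    echo "ERROR: --user requires an argument" >&2
                    return 2
                fi
                REMOTE_USER="$2"
                shift 2
                ;;
            -*)
                echo "ERROR: unknown option: $1" >&2
                return 2
                ;;
            *)
                REMOTE_NODE="$1"
                shift
                ;;
        esac
    done
}

parse_args --user admin pve-node-2
echo "${REMOTE_USER}@${REMOTE_NODE}"   # prints admin@pve-node-2
```

Returning a status instead of exiting keeps the helper usable from tests; the caller decides whether to print help and exit.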
@@ -0,0 +1,9 @@ |
||
| 1 |
+Legacy automation units retained for reference only. |
|
| 2 |
+ |
|
| 3 |
+These systemd unit files are not installed by the current project workflow. |
|
| 4 |
+The supported operational model is manual: |
|
| 5 |
+ |
|
| 6 |
+- `/usr/local/sbin/pgs suspend -v` |
|
| 7 |
+- `/usr/local/sbin/pgs resume -v` |
|
| 8 |
+ |
|
| 9 |
+Current install and uninstall scripts explicitly remove these legacy units from hosts. |
|
@@ -0,0 +1,27 @@ |
||
| 1 |
+[Unit] |
|
| 2 |
+Description=Resume PVE VMs manually |
|
| 3 |
+Documentation=man:qm(1) |
|
| 4 |
+ |
|
| 5 |
+# Only run if we have a state file from previous suspend |
|
| 6 |
+ConditionPathExists=/var/lib/pve-manager/pgs-state.json |
|
| 7 |
+ |
|
| 8 |
+# We need pve-cluster for /etc/pve access |
|
| 9 |
+Requires=pve-cluster.service |
|
| 10 |
+After=pve-cluster.service |
|
| 11 |
+ |
|
| 12 |
+# Run after storage is available |
|
| 13 |
+After=pve-storage.target |
|
| 14 |
+ |
|
| 15 |
+# Run before the standard pve-guests service to handle our VMs first |
|
| 16 |
+Before=pve-guests.service |
|
| 17 |
+ |
|
| 18 |
+[Service] |
|
| 19 |
+Type=oneshot |
|
| 20 |
+ExecStart=/usr/local/sbin/pgs resume -v |
|
| 21 |
+# Allow generous time for VMs to resume |
|
| 22 |
+TimeoutStartSec=900 |
|
| 23 |
+Restart=on-failure |
|
| 24 |
+RestartSec=20 |
|
| 25 |
+ |
|
| 26 |
+[Install] |
|
| 27 |
+WantedBy=multi-user.target |
|
@@ -0,0 +1,32 @@ |
||
| 1 |
+[Unit] |
|
| 2 |
+Description=Suspend PVE VMs to disk manually |
|
| 3 |
+Documentation=man:qm(1) |
|
| 4 |
+ |
|
| 5 |
+# Only run if pve-cluster is available (not rescue/recovery) |
|
| 6 |
+ConditionPathExists=/var/lib/pve-cluster/config.db |
|
| 7 |
+ |
|
| 8 |
+# We need storage and cluster access when stopping (suspend needs these alive) |
|
| 9 |
+Requires=pve-cluster.service |
|
| 10 |
+After=pve-cluster.service network.target local-fs.target remote-fs.target |
|
| 11 |
+ |
|
| 12 |
+# Start AFTER pve-guests → during shutdown we stop BEFORE pve-guests |
|
| 13 |
+# Critical: ensures we suspend VMs before pve-guests runs "stopall" |
|
| 14 |
+After=pve-guests.service |
|
| 15 |
+ |
|
| 16 |
+[Service] |
|
| 17 |
+Type=oneshot |
|
| 18 |
+RemainAfterExit=yes |
|
| 19 |
+ |
|
| 20 |
+# Trivial start - just marks the service as "active" while the node is up |
|
| 21 |
+# The actual work happens in ExecStop during shutdown |
|
| 22 |
+ExecStart=/bin/true |
|
| 23 |
+ |
|
| 24 |
+# REAL work: suspend VMs and shutdown CTs when system is going down |
|
| 25 |
+ExecStop=/usr/local/sbin/pgs suspend -v |
|
| 26 |
+ |
|
| 27 |
+# Allow generous time for all VMs to suspend to disk |
|
| 28 |
+TimeoutStopSec=900 |
|
| 29 |
+ |
|
| 30 |
+[Install] |
|
| 31 |
+WantedBy=multi-user.target |
|
| 32 |
+ |
|
@@ -0,0 +1,15 @@ |
||
| 1 |
+# pve-net-hang-watchdog Changelog |
|
| 2 |
+ |
|
| 3 |
+## [1.0] - 2026-03-06 |
|
| 4 |
+ |
|
| 5 |
+### Added |
|
| 6 |
+- Dedicated `scripts/install.sh` and `scripts/uninstall.sh` |
|
| 7 |
+- `setup.sh` wrapper for local and remote lifecycle operations |
|
| 8 |
+- Standardized defaults file at `/etc/default/xdev-pve-net-hang-watchdog` |
|
| 9 |
+- Installed documentation under `/usr/local/share/doc/xdev/pve-net-hang-watchdog` |
|
| 10 |
+ |
|
| 11 |
+### Changed |
|
| 12 |
+- Standardized uninstall path to `/usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh` |
|
| 13 |
+- Updated systemd unit to use the namespaced defaults file |
|
| 14 |
+- Standardized project documentation and install workflow |
|
| 15 |
+- Installer now performs `systemctl enable --now` so the watchdog is active immediately after install |
|
@@ -0,0 +1,66 @@ |
||
| 1 |
+# Installation |
|
| 2 |
+ |
|
| 3 |
+## Recommended method |
|
| 4 |
+ |
|
| 5 |
+The `setup.sh` wrapper is the standard way to install and uninstall. |
|
| 6 |
+ |
|
| 7 |
+### Local install |
|
| 8 |
+ |
|
| 9 |
+```bash |
|
| 10 |
+sudo ./setup.sh --local |
|
| 11 |
+``` |
|
| 12 |
+ |
|
| 13 |
+### Remote install |
|
| 14 |
+ |
|
| 15 |
+```bash |
|
| 16 |
+sudo ./setup.sh <node> |
|
| 17 |
+sudo ./setup.sh --user admin <node> |
|
| 18 |
+``` |
|
| 19 |
+ |
|
| 20 |
+## What gets installed |
|
| 21 |
+ |
|
| 22 |
+- `/usr/local/sbin/pve-net-hang-watchdog.sh` |
|
| 23 |
+- `/usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh` |
|
| 24 |
+- `/usr/local/sbin/xdev-pve-net-hang-watchdog-uninstall` |
|
| 25 |
+- `/etc/default/xdev-pve-net-hang-watchdog` |
|
| 26 |
+- `/etc/systemd/system/pve-net-hang-watchdog.service` |
|
| 27 |
+- `/usr/local/share/doc/xdev/pve-net-hang-watchdog/*` |
|
| 28 |
+ |
|
| 29 |
+## Activation |
|
| 30 |
+ |
|
| 31 |
+The installer runs: |
|
| 32 |
+- `systemctl daemon-reload` |
|
| 33 |
+- `systemctl enable --now pve-net-hang-watchdog.service` |
|
| 34 |
+ |
|
| 35 |
+To verify: |
|
| 36 |
+ |
|
| 37 |
+```bash |
|
| 38 |
+sudo systemctl status pve-net-hang-watchdog.service |
|
| 39 |
+``` |
|
| 40 |
+ |
|
| 41 |
+## Configuration |
|
| 42 |
+ |
|
| 43 |
+Installed defaults: |
|
| 44 |
+ |
|
| 45 |
+```bash |
|
| 46 |
+sudo editor /etc/default/xdev-pve-net-hang-watchdog |
|
| 47 |
+``` |
|
| 48 |
+ |
|
| 49 |
+Supported parameters: |
|
| 50 |
+- `WATCH_BRIDGE` |
|
| 51 |
+- `WATCH_IFACE` |
|
| 52 |
+- `COOLDOWN_SECONDS` |
|
| 53 |
+- `HANG_PATTERN` |
|
| 54 |
+ |
|
| 55 |
+## Uninstall |
|
| 56 |
+ |
|
| 57 |
+```bash |
|
| 58 |
+sudo ./setup.sh --local --uninstall |
|
| 59 |
+sudo ./setup.sh --uninstall <node> |
|
| 60 |
+``` |
|
| 61 |
+ |
|
| 62 |
+Or directly on the host: |
|
| 63 |
+ |
|
| 64 |
+```bash |
|
| 65 |
+sudo /usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh |
|
| 66 |
+``` |
|
@@ -0,0 +1,72 @@ |
||
| 1 |
+# pve-net-hang-watchdog |
|
| 2 |
+ |
|
| 3 |
+`pve-net-hang-watchdog` is a simple service that watches the kernel journal for NIC hangs and tries to recover the uplink with `ifdown` and `ifup`. |
|
| 4 |
+ |
|
| 5 |
+## Purpose |
|
| 6 |
+ |
|
| 7 |
+Useful for Proxmox nodes where the physical interface behind a WAN bridge can enter a hardware hang state and the most pragmatic recovery is cycling the link. |
|
| 8 |
+ |
|
| 9 |
+## Components |
|
| 10 |
+ |
|
| 11 |
+- [bin/pve-net-hang-watchdog.sh](bin/pve-net-hang-watchdog.sh) - main script |
|
| 12 |
+- [systemd/pve-net-hang-watchdog.service](systemd/pve-net-hang-watchdog.service) - systemd unit |
|
| 13 |
+- [config/xdev-pve-net-hang-watchdog](config/xdev-pve-net-hang-watchdog) - standard defaults |
|
| 14 |
+- [scripts/install.sh](scripts/install.sh) - local install |
|
| 15 |
+- [scripts/uninstall.sh](scripts/uninstall.sh) - canonical uninstall |
|
| 16 |
+- [setup.sh](setup.sh) - local/remote wrapper |
|
| 17 |
+ |
|
| 18 |
+## Installed locations on the host |
|
| 19 |
+ |
|
| 20 |
+- command/daemon script: `/usr/local/sbin/pve-net-hang-watchdog.sh` |
|
| 21 |
+- canonical uninstall: `/usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh` |
|
| 22 |
+- optional uninstall wrapper: `/usr/local/sbin/xdev-pve-net-hang-watchdog-uninstall` |
|
| 23 |
+- defaults: `/etc/default/xdev-pve-net-hang-watchdog` |
|
| 24 |
+- systemd unit: `/etc/systemd/system/pve-net-hang-watchdog.service` |
|
| 25 |
+- installed documentation: `/usr/local/share/doc/xdev/pve-net-hang-watchdog` |
|
| 26 |
+ |
|
| 27 |
+## Configuration |
|
| 28 |
+ |
|
| 29 |
+Parameters supported through the defaults file: |
|
| 30 |
+ |
|
| 31 |
+- `WATCH_BRIDGE` |
|
| 32 |
+- `WATCH_IFACE` |
|
| 33 |
+- `COOLDOWN_SECONDS` |
|
| 34 |
+- `HANG_PATTERN` |
|
| 35 |
+ |
|
| 36 |
+If `WATCH_IFACE` is empty, the script tries to auto-discover the physical interface from the bridge's `bridge-ports` setting. |
|
| 37 |
+ |
|
| 38 |
+## Quick start |
|
| 39 |
+ |
|
| 40 |
+```bash |
|
| 41 |
+sudo ./setup.sh --local |
|
| 42 |
+sudo systemctl status pve-net-hang-watchdog.service |
|
| 43 |
+``` |
|
| 44 |
+ |
|
| 45 |
+## Operation |
|
| 46 |
+ |
|
| 47 |
+Logs: |
|
| 48 |
+ |
|
| 49 |
+```bash |
|
| 50 |
+journalctl -u pve-net-hang-watchdog.service -f |
|
| 51 |
+``` |
|
| 52 |
+ |
|
| 53 |
+Configuration: |
|
| 54 |
+ |
|
| 55 |
+```bash |
|
| 56 |
+sudo editor /etc/default/xdev-pve-net-hang-watchdog |
|
| 57 |
+sudo systemctl restart pve-net-hang-watchdog.service |
|
| 58 |
+``` |
|
| 59 |
+ |
|
| 60 |
+The installer also runs `systemctl enable --now`, so the service is already running after installation. |
|
| 61 |
+ |
|
| 62 |
+## Uninstall |
|
| 63 |
+ |
|
| 64 |
+```bash |
|
| 65 |
+sudo ./setup.sh --local --uninstall |
|
| 66 |
+``` |
|
| 67 |
+ |
|
| 68 |
+Or directly: |
|
| 69 |
+ |
|
| 70 |
+```bash |
|
| 71 |
+sudo /usr/local/lib/xdev/pve-net-hang-watchdog/uninstall.sh |
|
| 72 |
+``` |
|
@@ -0,0 +1,102 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -u |
|
| 4 |
+ |
|
| 5 |
+WATCH_BRIDGE="${WATCH_BRIDGE:-vmbr443}"
|
|
| 6 |
+WATCH_IFACE="${WATCH_IFACE:-}"
|
|
| 7 |
+COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-30}"
|
|
| 8 |
+HANG_PATTERN="${HANG_PATTERN:-Detected Hardware Unit Hang:}"
|
|
| 9 |
+ |
|
| 10 |
+log() {
|
|
| 11 |
+ printf '%s %s\n' "$(date -Is)" "$*" >&2 |
|
| 12 |
+} |
|
| 13 |
+ |
|
| 14 |
+discover_watch_iface() {
|
|
| 15 |
+ local candidate="" |
|
| 16 |
+ |
|
| 17 |
+ if [[ -n "$WATCH_IFACE" ]]; then |
|
| 18 |
+ printf '%s\n' "$WATCH_IFACE" |
|
| 19 |
+ return 0 |
|
| 20 |
+ fi |
|
| 21 |
+ |
|
| 22 |
+ if [[ -r /etc/network/interfaces ]]; then |
|
| 23 |
+ candidate="$( |
|
| 24 |
+ awk -v bridge="$WATCH_BRIDGE" ' |
|
| 25 |
+ $1 == "iface" && $2 == bridge { in_bridge = 1; next }
|
|
| 26 |
+ $1 == "iface" && $2 != bridge { in_bridge = 0 }
|
|
| 27 |
+ in_bridge && $1 == "bridge-ports" { print $2; exit }
|
|
| 28 |
+ ' /etc/network/interfaces |
|
| 29 |
+ )" |
|
| 30 |
+ fi |
|
| 31 |
+ |
|
| 32 |
+ if [[ -z "$candidate" && -d /etc/network/interfaces.d ]]; then |
|
| 33 |
+ candidate="$( |
|
| 34 |
+ awk -v bridge="$WATCH_BRIDGE" ' |
|
| 35 |
+ $1 == "iface" && $2 == bridge { in_bridge = 1; next }
|
|
| 36 |
+ $1 == "iface" && $2 != bridge { in_bridge = 0 }
|
|
| 37 |
+ in_bridge && $1 == "bridge-ports" { print $2; exit }
|
|
| 38 |
+ ' /etc/network/interfaces.d/* 2>/dev/null |
|
| 39 |
+ )" |
|
| 40 |
+ fi |
|
| 41 |
+ |
|
| 42 |
+ if [[ -n "$candidate" ]]; then |
|
| 43 |
+ printf '%s\n' "${candidate%%.*}"
|
|
| 44 |
+ return 0 |
|
| 45 |
+ fi |
|
| 46 |
+ |
|
| 47 |
+ return 1 |
|
| 48 |
+} |
|
| 49 |
+ |
|
| 50 |
+require_command() {
|
|
| 51 |
+ local cmd="$1" |
|
| 52 |
+ if ! command -v "$cmd" >/dev/null 2>&1; then |
|
| 53 |
+ log "missing required command: $cmd" |
|
| 54 |
+ exit 1 |
|
| 55 |
+ fi |
|
| 56 |
+} |
|
| 57 |
+ |
|
| 58 |
+recover_iface() {
|
|
| 59 |
+ local iface="$1" |
|
| 60 |
+ |
|
| 61 |
+ log "hardware hang detected on $iface; cycling link with ifdown/ifup" |
|
| 62 |
+ ifdown --force "$iface" || log "ifdown reported a non-zero exit code for $iface" |
|
| 63 |
+ sleep 2 |
|
| 64 |
+ if ! ifup "$iface"; then |
|
| 65 |
+ log "ifup failed for $iface" |
|
| 66 |
+ return 1 |
|
| 67 |
+ fi |
|
| 68 |
+ log "link recovery finished for $iface" |
|
| 69 |
+} |
|
| 70 |
+ |
|
| 71 |
+main() {
|
|
| 72 |
+ local iface="" |
|
| 73 |
+ local last_recovery=0 |
|
| 74 |
+ local now=0 |
|
| 75 |
+ local line="" |
|
| 76 |
+ |
|
| 77 |
+ require_command journalctl |
|
| 78 |
+ require_command ifdown |
|
| 79 |
+ require_command ifup |
|
| 80 |
+ |
|
| 81 |
+ if ! iface="$(discover_watch_iface)"; then |
|
| 82 |
+ log "failed to determine uplink interface for bridge $WATCH_BRIDGE" |
|
| 83 |
+ exit 1 |
|
| 84 |
+ fi |
|
| 85 |
+ |
|
| 86 |
+ log "watching journald for '$HANG_PATTERN' on interface $iface" |
|
| 87 |
+ |
|
| 88 |
+ while IFS= read -r line; do |
|
| 89 |
+ [[ "$line" == *"$iface: $HANG_PATTERN"* ]] || continue |
|
| 90 |
+ |
|
| 91 |
+ now="$(date +%s)" |
|
| 92 |
+ if (( now - last_recovery < COOLDOWN_SECONDS )); then |
|
| 93 |
+ log "skipping duplicate event for $iface during cooldown (${COOLDOWN_SECONDS}s)"
|
|
| 94 |
+ continue |
|
| 95 |
+ fi |
|
| 96 |
+ |
|
| 97 |
+ last_recovery="$now" |
|
| 98 |
+ recover_iface "$iface" |
|
| 99 |
+ done < <(journalctl --dmesg --follow --since now --output=cat) |
|
| 100 |
+} |
|
| 101 |
+ |
|
| 102 |
+main "$@" |
|
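The `bridge-ports` discovery in the watchdog script above can be tried in isolation. The sketch below runs the same awk program against a throwaway sample file rather than `/etc/network/interfaces`. The sample contents, the interface names, and the `find_bridge_port` helper name are assumptions made for illustration only.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical sample stanzas; a real host would read /etc/network/interfaces.
sample="$(mktemp)"
trap 'rm -f "$sample"' EXIT
cat >"$sample" <<'EOF'
auto vmbr0
iface vmbr0 inet static
    bridge-ports eno1

auto vmbr443
iface vmbr443 inet manual
    bridge-ports enp3s0.443
    bridge-stp off
EOF

# Print the first bridge-ports entry found inside the stanza of the given bridge.
find_bridge_port() {
    local bridge="$1" file="$2"
    awk -v bridge="$bridge" '
        $1 == "iface" && $2 == bridge { in_bridge = 1; next }
        $1 == "iface" && $2 != bridge { in_bridge = 0 }
        in_bridge && $1 == "bridge-ports" { print $2; exit }
    ' "$file"
}

port="$(find_bridge_port vmbr443 "$sample")"
echo "raw port: ${port}"            # prints raw port: enp3s0.443
echo "physical: ${port%%.*}"        # prints physical: enp3s0
```

The `${port%%.*}` expansion drops everything from the first dot onward, which is how the script maps a VLAN sub-interface back to the physical NIC. Only the first entry on a `bridge-ports` line is considered, so a bridge with several ports would need `WATCH_IFACE` set explicitly.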
@@ -0,0 +1,18 @@ |
||
| 1 |
+# Default environment for pve-net-hang-watchdog |
|
| 2 |
+# |
|
| 3 |
+# Copy or install to: |
|
| 4 |
+# /etc/default/xdev-pve-net-hang-watchdog |
|
| 5 |
+# |
|
| 6 |
+# Uncomment to override defaults. |
|
| 7 |
+ |
|
| 8 |
+# Bridge whose uplink should be monitored for NIC hardware hang recovery. |
|
| 9 |
+# WATCH_BRIDGE=vmbr443 |
|
| 10 |
+ |
|
| 11 |
+# Explicit interface to recover. If empty, the script auto-discovers bridge-ports. |
|
| 12 |
+# WATCH_IFACE= |
|
| 13 |
+ |
|
| 14 |
+# Minimum number of seconds between recoveries for duplicate events. |
|
| 15 |
+# COOLDOWN_SECONDS=30 |
|
| 16 |
+ |
|
| 17 |
+# Journal pattern that identifies the hardware hang message. |
|
| 18 |
+# HANG_PATTERN=Detected Hardware Unit Hang: |
|
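To make the `COOLDOWN_SECONDS` setting concrete, here is a small sketch of the cooldown comparison with fixed timestamps in place of `date +%s`. `should_recover` is a hypothetical helper name, and the timestamps are arbitrary; the comparison mirrors the one in the watchdog loop.

```shell
#!/bin/bash
set -euo pipefail

COOLDOWN_SECONDS=30
last_recovery=0

# Hypothetical helper: returns 0 when a recovery may run at time "$1"
# (epoch seconds) and records it; returns 1 inside the cooldown window.
should_recover() {
    local now="$1"
    if (( now - last_recovery < COOLDOWN_SECONDS )); then
        return 1
    fi
    last_recovery="$now"
    return 0
}

for t in 1000 1010 1029 1030; do
    if should_recover "$t"; then
        echo "t=${t} recover"
    else
        echo "t=${t} skip (cooldown)"
    fi
done
```

With a 30 second cooldown, the events at 1010 and 1029 are skipped and the one at 1030 triggers a new recovery, because the comparison is strict (`<`).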
@@ -0,0 +1,130 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="pve-net-hang-watchdog" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 8 |
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
|
|
| 9 |
+COMMAND_PATH="/usr/local/sbin/pve-net-hang-watchdog.sh" |
|
| 10 |
+UNINSTALL_PATH="${INSTALL_DIR}/uninstall.sh"
|
|
| 11 |
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
|
|
| 12 |
+CONFIG_PATH="/etc/default/${ORG_ID}-${PROJECT_ID}"
|
|
| 13 |
+UNIT_PATH="/etc/systemd/system/${PROJECT_ID}.service"
|
|
| 14 |
+ |
|
| 15 |
+SOURCE_DIR="" |
|
| 16 |
+ |
|
| 17 |
+usage() {
|
|
| 18 |
+ cat <<EOF |
|
| 19 |
+Usage: $0 [--source-dir <path>] |
|
| 20 |
+ |
|
| 21 |
+Install ${PROJECT_ID} on the current host.
|
|
| 22 |
+EOF |
|
| 23 |
+} |
|
| 24 |
+ |
|
| 25 |
+require_root() {
|
|
| 26 |
+ if [[ "${EUID}" -ne 0 ]]; then
|
|
| 27 |
+ echo "ERROR: this script must be run as root" >&2 |
|
| 28 |
+ exit 1 |
|
| 29 |
+ fi |
|
| 30 |
+} |
|
| 31 |
+ |
|
| 32 |
+resolve_source_dir() {
|
|
| 33 |
+ if [[ -n "${SOURCE_DIR}" ]]; then
|
|
| 34 |
+ SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
|
|
| 35 |
+ else |
|
| 36 |
+ SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
|
| 37 |
+ fi |
|
| 38 |
+} |
|
| 39 |
+ |
|
| 40 |
+validate_source_tree() {
|
|
| 41 |
+ local required_files=( |
|
| 42 |
+ "${SOURCE_DIR}/bin/pve-net-hang-watchdog.sh"
|
|
| 43 |
+ "${SOURCE_DIR}/systemd/pve-net-hang-watchdog.service"
|
|
| 44 |
+ "${SOURCE_DIR}/config/xdev-pve-net-hang-watchdog"
|
|
| 45 |
+ "${SOURCE_DIR}/scripts/uninstall.sh"
|
|
| 46 |
+ "${SOURCE_DIR}/README.md"
|
|
| 47 |
+ "${SOURCE_DIR}/INSTALL.md"
|
|
| 48 |
+ "${SOURCE_DIR}/CHANGELOG.md"
|
|
| 49 |
+ ) |
|
| 50 |
+ local file="" |
|
| 51 |
+ for file in "${required_files[@]}"; do
|
|
| 52 |
+ if [[ ! -f "${file}" ]]; then
|
|
| 53 |
+ echo "ERROR: missing required source file: ${file}" >&2
|
|
| 54 |
+ exit 1 |
|
| 55 |
+ fi |
|
| 56 |
+ done |
|
| 57 |
+} |
|
| 58 |
+ |
|
| 59 |
+run_existing_uninstall() {
|
|
| 60 |
+ if [[ -x "${UNINSTALL_PATH}" ]]; then
|
|
| 61 |
+ echo "Existing installation detected. Running canonical uninstall first..." |
|
| 62 |
+ "${UNINSTALL_PATH}" --force || true
|
|
| 63 |
+ else |
|
| 64 |
+ bash "${SOURCE_DIR}/scripts/uninstall.sh" --force || true
|
|
| 65 |
+ fi |
|
| 66 |
+} |
|
| 67 |
+ |
|
| 68 |
+install_docs() {
|
|
| 69 |
+ mkdir -p "${DOC_DIR}"
|
|
| 70 |
+ cp "${SOURCE_DIR}/README.md" "${DOC_DIR}/"
|
|
| 71 |
+ cp "${SOURCE_DIR}/INSTALL.md" "${DOC_DIR}/"
|
|
| 72 |
+ cp "${SOURCE_DIR}/CHANGELOG.md" "${DOC_DIR}/"
|
|
| 73 |
+} |
|
| 74 |
+ |
|
| 75 |
+main() {
|
|
| 76 |
+ while [[ $# -gt 0 ]]; do |
|
| 77 |
+ case "$1" in |
|
| 78 |
+ --source-dir) |
|
| 79 |
+ SOURCE_DIR="${2:?--source-dir requires a path}" |
|
| 80 |
+ shift 2 |
|
| 81 |
+ ;; |
|
| 82 |
+ -h|--help) |
|
| 83 |
+ usage |
|
| 84 |
+ exit 0 |
|
| 85 |
+ ;; |
|
| 86 |
+ *) |
|
| 87 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 88 |
+ usage |
|
| 89 |
+ exit 1 |
|
| 90 |
+ ;; |
|
| 91 |
+ esac |
|
| 92 |
+ done |
|
| 93 |
+ |
|
| 94 |
+ require_root |
|
| 95 |
+ resolve_source_dir |
|
| 96 |
+ validate_source_tree |
|
| 97 |
+ |
|
| 98 |
+ echo "=== Installing ${PROJECT_ID} ==="
|
|
| 99 |
+ run_existing_uninstall |
|
| 100 |
+ |
|
| 101 |
+ mkdir -p "${INSTALL_DIR}" "${DOC_DIR}" /usr/local/sbin /etc/default
|
|
| 102 |
+ |
|
| 103 |
+ install -m 0755 "${SOURCE_DIR}/bin/pve-net-hang-watchdog.sh" "${COMMAND_PATH}"
|
|
| 104 |
+ install -m 0755 "${SOURCE_DIR}/scripts/uninstall.sh" "${UNINSTALL_PATH}"
|
|
| 105 |
+ ln -sfn "${UNINSTALL_PATH}" "${UNINSTALL_WRAPPER}"
|
|
| 106 |
+ |
|
| 107 |
+ if [[ ! -f "${CONFIG_PATH}" ]]; then
|
|
| 108 |
+ install -m 0644 "${SOURCE_DIR}/config/xdev-pve-net-hang-watchdog" "${CONFIG_PATH}"
|
|
| 109 |
+ else |
|
| 110 |
+ echo "Preserving existing config: ${CONFIG_PATH}"
|
|
| 111 |
+ fi |
|
| 112 |
+ |
|
| 113 |
+ install -m 0644 "${SOURCE_DIR}/systemd/pve-net-hang-watchdog.service" "${UNIT_PATH}"
|
|
| 114 |
+ systemctl daemon-reload |
|
| 115 |
+ systemctl enable --now pve-net-hang-watchdog.service >/dev/null 2>&1 |
|
| 116 |
+ |
|
| 117 |
+ install_docs |
|
| 118 |
+ |
|
| 119 |
+ echo "Installed paths:" |
|
| 120 |
+ echo " command: ${COMMAND_PATH}"
|
|
| 121 |
+ echo " uninstall: ${UNINSTALL_PATH}"
|
|
| 122 |
+ echo " config: ${CONFIG_PATH}"
|
|
| 123 |
+ echo " systemd: ${UNIT_PATH}"
|
|
| 124 |
+ echo " docs: ${DOC_DIR}"
|
|
| 125 |
+ echo " service: enabled and started" |
|
| 126 |
+ echo "" |
|
| 127 |
+ echo "Installation completed." |
|
| 128 |
+} |
|
| 129 |
+ |
|
| 130 |
+main "$@" |
|
@@ -0,0 +1,68 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="pve-net-hang-watchdog" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 8 |
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
|
|
| 9 |
+COMMAND_PATH="/usr/local/sbin/pve-net-hang-watchdog.sh" |
|
| 10 |
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
|
|
| 11 |
+CONFIG_PATH="/etc/default/${ORG_ID}-${PROJECT_ID}"
|
|
| 12 |
+UNIT_PATH="/etc/systemd/system/${PROJECT_ID}.service"
|
|
| 13 |
+ |
|
| 14 |
+FORCE_MODE=0 |
|
| 15 |
+ |
|
| 16 |
+log() {
|
|
| 17 |
+ if [[ "${FORCE_MODE}" -eq 0 ]]; then
|
|
| 18 |
+ echo "$@" |
|
| 19 |
+ fi |
|
| 20 |
+} |
|
| 21 |
+ |
|
| 22 |
+require_root() {
|
|
| 23 |
+ if [[ "${EUID}" -ne 0 ]]; then
|
|
| 24 |
+ echo "ERROR: this script must be run as root" >&2 |
|
| 25 |
+ exit 1 |
|
| 26 |
+ fi |
|
| 27 |
+} |
|
| 28 |
+ |
|
| 29 |
+main() {
|
|
| 30 |
+ while [[ $# -gt 0 ]]; do |
|
| 31 |
+ case "$1" in |
|
| 32 |
+ --force) |
|
| 33 |
+ FORCE_MODE=1 |
|
| 34 |
+ shift |
|
| 35 |
+ ;; |
|
| 36 |
+ -h|--help) |
|
| 37 |
+ echo "Usage: $0 [--force]" |
|
| 38 |
+ exit 0 |
|
| 39 |
+ ;; |
|
| 40 |
+ *) |
|
| 41 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 42 |
+ exit 1 |
|
| 43 |
+ ;; |
|
| 44 |
+ esac |
|
| 45 |
+ done |
|
| 46 |
+ |
|
| 47 |
+ require_root |
|
| 48 |
+ |
|
| 49 |
+ log "=== Uninstalling ${PROJECT_ID} ==="
|
|
| 50 |
+ |
|
| 51 |
+ systemctl disable pve-net-hang-watchdog.service >/dev/null 2>&1 || true |
|
| 52 |
+ systemctl stop pve-net-hang-watchdog.service >/dev/null 2>&1 || true |
|
| 53 |
+ rm -f "${UNIT_PATH}"
|
|
| 54 |
+ systemctl daemon-reload |
|
| 55 |
+ |
|
| 56 |
+ rm -f "${UNINSTALL_WRAPPER}"
|
|
| 57 |
+ rm -f "${COMMAND_PATH}"
|
|
| 58 |
+ rm -f "${CONFIG_PATH}"
|
|
| 59 |
+ rm -rf "${DOC_DIR}"
|
|
| 60 |
+ rm -rf "${INSTALL_DIR}"
|
|
| 61 |
+ |
|
| 62 |
+ rmdir "/usr/local/lib/${ORG_ID}" 2>/dev/null || true
|
|
| 63 |
+ rmdir "/usr/local/share/doc/${ORG_ID}" 2>/dev/null || true
|
|
| 64 |
+ |
|
| 65 |
+ log "Uninstall complete." |
|
| 66 |
+} |
|
| 67 |
+ |
|
| 68 |
+main "$@" |
|
@@ -0,0 +1,139 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="pve-net-hang-watchdog" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 8 |
+MODE="install" |
|
| 9 |
+REMOTE_NODE="" |
|
| 10 |
+REMOTE_USER="root" |
|
| 11 |
+LOCAL_MODE=0 |
|
| 12 |
+ |
|
| 13 |
+show_help() {
|
|
| 14 |
+ cat <<EOF |
|
| 15 |
+${PROJECT_ID} setup wrapper
|
|
| 16 |
+ |
|
| 17 |
+Usage: $0 [OPTIONS] [<target_node>] |
|
| 18 |
+ |
|
| 19 |
+Options: |
|
| 20 |
+ -h, --help Show this help message |
|
| 21 |
+ -l, --local Run on localhost |
|
| 22 |
+ -u, --uninstall Uninstall instead of install |
|
| 23 |
+ --user <user> Remote SSH user (default: root) |
|
| 24 |
+EOF |
|
| 25 |
+} |
|
| 26 |
+ |
|
| 27 |
+run_local_install() {
|
|
| 28 |
+ bash "${SCRIPT_DIR}/scripts/install.sh" --source-dir "${SCRIPT_DIR}"
|
|
| 29 |
+} |
|
| 30 |
+ |
|
| 31 |
+run_local_uninstall() {
|
|
| 32 |
+ local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
|
|
| 33 |
+ if [[ -x "${canonical}" ]]; then
|
|
| 34 |
+ "${canonical}"
|
|
| 35 |
+ else |
|
| 36 |
+ bash "${SCRIPT_DIR}/scripts/uninstall.sh"
|
|
| 37 |
+ fi |
|
| 38 |
+} |
|
| 39 |
+ |
|
| 40 |
+copy_remote_tree() {
|
|
| 41 |
+ local target="$1" |
|
| 42 |
+ local remote_tmp="$2" |
|
| 43 |
+ |
|
| 44 |
+ ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/bin' '${remote_tmp}/scripts' '${remote_tmp}/systemd' '${remote_tmp}/config'"
|
|
| 45 |
+ scp -q "${SCRIPT_DIR}/bin/pve-net-hang-watchdog.sh" "${target}:${remote_tmp}/bin/"
|
|
| 46 |
+ scp -q "${SCRIPT_DIR}/scripts/install.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 47 |
+ scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 48 |
+ scp -q "${SCRIPT_DIR}/systemd/pve-net-hang-watchdog.service" "${target}:${remote_tmp}/systemd/"
|
|
| 49 |
+ scp -q "${SCRIPT_DIR}/config/xdev-pve-net-hang-watchdog" "${target}:${remote_tmp}/config/"
|
|
| 50 |
+ scp -q "${SCRIPT_DIR}/README.md" "${target}:${remote_tmp}/"
|
|
| 51 |
+ scp -q "${SCRIPT_DIR}/INSTALL.md" "${target}:${remote_tmp}/"
|
|
| 52 |
+ scp -q "${SCRIPT_DIR}/CHANGELOG.md" "${target}:${remote_tmp}/"
|
|
| 53 |
+} |
|
| 54 |
+ |
|
| 55 |
+run_remote_install() {
|
|
| 56 |
+ local target="$1" |
|
| 57 |
+ local remote_tmp="/tmp/${PROJECT_ID}.$$"
|
|
| 58 |
+ local remote_prefix="" |
|
| 59 |
+ |
|
| 60 |
+ [[ "${REMOTE_USER}" != "root" ]] && remote_prefix="sudo "
|
|
| 61 |
+ |
|
| 62 |
+ copy_remote_tree "${target}" "${remote_tmp}"
|
|
| 63 |
+ ssh "${target}" "${remote_prefix}bash '${remote_tmp}/scripts/install.sh' --source-dir '${remote_tmp}'"
|
|
| 64 |
+ ssh "${target}" "rm -rf '${remote_tmp}'"
|
|
| 65 |
+} |
|
| 66 |
+ |
|
| 67 |
+run_remote_uninstall() {
|
|
| 68 |
+ local target="$1" |
|
| 69 |
+ local remote_tmp="/tmp/${PROJECT_ID}-uninstall.$$"
|
|
| 70 |
+ local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
|
|
| 71 |
+ |
|
| 72 |
+ ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts'"
|
|
| 73 |
+ scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 74 |
+ if [[ "${REMOTE_USER}" == "root" ]]; then
|
|
| 75 |
+ ssh "${target}" "if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi"
|
|
| 76 |
+ else |
|
| 77 |
+ ssh "${target}" "sudo bash -lc \"if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi\""
|
|
| 78 |
+ fi |
|
| 79 |
+ ssh "${target}" "rm -rf '${remote_tmp}'"
|
|
| 80 |
+} |
|
| 81 |
+ |
|
| 82 |
+while [[ $# -gt 0 ]]; do |
|
| 83 |
+ case "$1" in |
|
| 84 |
+ -h|--help) |
|
| 85 |
+ show_help |
|
| 86 |
+ exit 0 |
|
| 87 |
+ ;; |
|
| 88 |
+ -l|--local) |
|
| 89 |
+ LOCAL_MODE=1 |
|
| 90 |
+ shift |
|
| 91 |
+ ;; |
|
| 92 |
+ -u|--uninstall) |
|
| 93 |
+ MODE="uninstall" |
|
| 94 |
+ shift |
|
| 95 |
+ ;; |
|
| 96 |
+ --user) |
|
| 97 |
+ REMOTE_USER="${2:?--user requires an argument}" |
|
| 98 |
+ shift 2 |
|
| 99 |
+ ;; |
|
| 100 |
+ -*) |
|
| 101 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 102 |
+ show_help |
|
| 103 |
+ exit 1 |
|
| 104 |
+ ;; |
|
| 105 |
+ *) |
|
| 106 |
+ REMOTE_NODE="$1" |
|
| 107 |
+ shift |
|
| 108 |
+ ;; |
|
| 109 |
+ esac |
|
| 110 |
+done |
|
| 111 |
+ |
|
| 112 |
+if [[ -z "${REMOTE_NODE}" && ${LOCAL_MODE} -eq 0 ]]; then
|
|
| 113 |
+ LOCAL_MODE=1 |
|
| 114 |
+fi |
|
| 115 |
+ |
|
| 116 |
+echo "================================" |
|
| 117 |
+echo "${PROJECT_ID} - ${MODE}"
|
|
| 118 |
+echo "================================" |
|
| 119 |
+ |
|
| 120 |
+if [[ ${LOCAL_MODE} -eq 1 ]]; then
|
|
| 121 |
+ if [[ "${MODE}" == "install" ]]; then
|
|
| 122 |
+ run_local_install |
|
| 123 |
+ else |
|
| 124 |
+ run_local_uninstall |
|
| 125 |
+ fi |
|
| 126 |
+ exit 0 |
|
| 127 |
+fi |
|
| 128 |
+ |
|
| 129 |
+TARGET="${REMOTE_USER}@${REMOTE_NODE}"
|
|
| 130 |
+if ! ping -c 1 "${REMOTE_NODE}" >/dev/null 2>&1; then
|
|
| 131 |
+ echo "ERROR: cannot reach ${REMOTE_NODE}" >&2
|
|
| 132 |
+ exit 1 |
|
| 133 |
+fi |
|
| 134 |
+ |
|
| 135 |
+if [[ "${MODE}" == "install" ]]; then
|
|
| 136 |
+ run_remote_install "${TARGET}"
|
|
| 137 |
+else |
|
| 138 |
+ run_remote_uninstall "${TARGET}"
|
|
| 139 |
+fi |
|
@@ -0,0 +1,14 @@ |
||
| 1 |
+[Unit] |
|
| 2 |
+Description=Recover network uplink after NIC hardware hangs |
|
| 3 |
+After=systemd-journald.service network.target |
|
| 4 |
+Requires=systemd-journald.service |
|
| 5 |
+ |
|
| 6 |
+[Service] |
|
| 7 |
+Type=simple |
|
| 8 |
+EnvironmentFile=-/etc/default/xdev-pve-net-hang-watchdog |
|
| 9 |
+ExecStart=/usr/local/sbin/pve-net-hang-watchdog.sh |
|
| 10 |
+Restart=always |
|
| 11 |
+RestartSec=2 |
|
| 12 |
+ |
|
| 13 |
+[Install] |
|
| 14 |
+WantedBy=multi-user.target |
|
@@ -0,0 +1,59 @@ |
||
| 1 |
+# Copilot Instructions for Madagascar Thunderbolts & Backups |
|
| 2 |
+ |
|
| 3 |
+## Big Picture Architecture |
|
| 4 |
+- The codebase manages high-MTU Thunderbolt networking and automated backups for a Proxmox cluster (`baobab`, `ebony`, `tapia`). |
|
| 5 |
+- Networking: Early boot systemd/udev units create and maintain a `thunderbridge` (MTU 65520), hotplug Thunderbolt NICs, and ensure persistent bridge membership. |
|
| 6 |
+- Backups: Autonomous agent scripts (in `backups/`) discover VMs, run scheduled backups, and log lifecycle events. |
|
| 7 |
+- All node, network, and backup config is centralized in `cluster/madagascar.json`. |
|
| 8 |
+ |
|
| 9 |
+## Critical Developer Workflows |
|
| 10 |
+- **Network Deploy:** |
|
| 11 |
+ - Run `deploy/attempt1/deploy_tb.sh` from its directory to push configs and services to all nodes. |
|
| 12 |
+ - Validate with `scripts/check_thunderbridge.sh` (checks bridge ports, MTU, and cluster connectivity). |
|
| 13 |
+- **Backup Deploy:** |
|
| 14 |
+ - Use `backups/scripts/deploy_to_nodes.sh` to install backup agents on all nodes. |
|
| 15 |
+ - Backup agent lifecycle is managed by systemd timers (`backup_agent.timer`). |
|
| 16 |
+- **Issue Tracking:** |
|
| 17 |
+ - All issues documented in `issues/` using `TEMPLATE.md`. |
|
| 18 |
+ - Every fix/change must be referenced in `CHANGELOG.md`. |
|
| 19 |
+ |
|
| 20 |
+## Project-Specific Conventions |
|
| 21 |
+- **Network config:** |
|
| 22 |
+ - Node-specific overlays in `deploy/attempt1/<node>/etc/network/interfaces.d/10-thunderbolt`. |
|
| 23 |
+ - Shared systemd/udev units in `deploy/attempt1/common/`. |
|
| 24 |
+ - Always use post-up hooks for bridge membership and MTU persistence. |
|
| 25 |
+- **SSH Automation:** |
|
| 26 |
+ - Scripts use `-o LogLevel=ERROR` to suppress known hosts warnings. |
|
| 27 |
+ - Management and Thunderbolt IPs are set in deploy scripts; update helpers for new nodes. |
|
| 28 |
+- **Versioning:** |
|
| 29 |
+ - New network designs go in new `attemptN` folders for reproducibility. |
|
| 30 |
+- **Backups:** |
|
| 31 |
+ - All backup config and manifests reference `madagascar.json` for node/IP discovery. |
|
| 32 |
+ - Backup agent logs lifecycle events and changes in `madagascar-changelog.json` (if present). |
|
| 33 |
+ |
|
| 34 |
+## Integration Points & Data Flows |
|
| 35 |
+- **Network:** |
|
| 36 |
+ - Systemd/udev units interact via device events; enlist services attach NICs to bridge. |
|
| 37 |
+ - The deploy script pushes all config, then reloads udev, systemd, and networking in a single run. |
|
| 38 |
+- **Backups:** |
|
| 39 |
+ - Agent scripts SSH into nodes, discover VMs, and run backups using Proxmox CLI. |
|
| 40 |
+ - Results and metadata are logged for auditability. |
|
| 41 |
+ |
|
| 42 |
+## Key Files & Directories |
|
| 43 |
+- `deploy/attempt1/deploy_tb.sh`: Main network deploy script |
|
| 44 |
+- `deploy/attempt1/common/`: Shared systemd/udev units |
|
| 45 |
+- `deploy/attempt1/<node>/etc/network/interfaces.d/10-thunderbolt`: Node overlays |
|
| 46 |
+- `scripts/check_thunderbridge.sh`: Cluster network health check |
|
| 47 |
+- `cluster/madagascar.json`: Canonical node/network/backup config |
|
| 48 |
+- `backups/`: Backup agent, deployment, and documentation |
|
| 49 |
+- `issues/`: Issue tracker |
|
| 50 |
+- `CHANGELOG.md`: Change log |
|
| 51 |
+ |
|
| 52 |
+## Example Patterns |
|
| 53 |
+- To add a node: copy an existing node directory, update IPs, extend deploy script helpers. |
|
| 54 |
+- To troubleshoot: check systemd unit status, bridge membership, and kernel logs. |
|
| 55 |
+- To automate: use provided scripts, keep configs in sync with `madagascar.json`, and document all changes. |
|
| 56 |
+ |
|
| 57 |
+--- |
|
| 58 |
+ |
|
| 59 |
+For questions or unclear conventions, review `README.md` and issue templates, or ask for clarification in the issue tracker. |
|
@@ -0,0 +1,43 @@ |
||
| 1 |
+# Changelog |
|
| 2 |
+ |
|
| 3 |
+All notable changes to the Madagascar cluster will be documented in this file. |
|
| 4 |
+ |
|
| 5 |
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), |
|
| 6 |
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). |
|
| 7 |
+ |
|
| 8 |
+## [Unreleased] |
|
| 9 |
+ |
|
| 10 |
+### Fixed |
|
| 11 |
+- Invalid `ExecStop` syntax in `tb-enlist@.service` caused failed unit teardown on Thunderbolt device removal [ISSUE-2026-001] |
|
| 12 |
+- Tapia-Baobab Thunderbolt recovery path hardened after reboot-time disconnect/reconnect events [ISSUE-2026-001] |
|
| 13 |
+ |
|
| 14 |
+### Added |
|
| 15 |
+- Automatic Thunderbolt recovery worker (`tb-recover.service`) and periodic timer (`tb-recover.timer`) for flap resilience [ISSUE-2026-001] |
|
| 16 |
+ |
|
| 17 |
+### Changed |
|
| 18 |
+- `tb-recover.sh` now escalates recovery by restarting `bolt.service` when rescan alone does not recreate thunderbolt net devices [ISSUE-2026-001] |
|
| 19 |
+- `tb-recover.sh` now includes a cooldown-limited Thunderbolt NHI PCI `remove+rescan` fallback (soft replug path) for reboot cases where the netdev is missing [ISSUE-2026-001] |
|
| 20 |
+- `tb-recover.sh` now retries the Thunderbolt NHI reset within the same recovery run when a peer xdomain host reappears without its `*.0` network service [ISSUE-2026-001] |
|
| 21 |
+- `tb-recover.sh` now probes the expected peer behind each Thunderbolt port and cycles the affected interface with `ifdown/ifup` when a port stays physically attached but is logically detached [ISSUE-2026-001] |
|
| 22 |
+- Added standardized shared-runtime install/uninstall flow that manages scripts, unit files, and udev rules without rewriting host network configuration |
|
| 23 |
+ |
|
| 24 |
+## [2025-10-30] |
|
| 25 |
+ |
|
| 26 |
+### Fixed |
|
| 27 |
+- Thunderbolt interfaces not in bridge after MTU fix deployment [ISSUE-2025-002] |
|
| 28 |
+- MTU reset to 1500 after systemctl restart networking [ISSUE-2025-001] |
|
| 29 |
+ |
|
| 30 |
+### Added |
|
| 31 |
+- Issue tracking system with structured templates |
|
| 32 |
+- Defense-in-depth for thunderbolt network configuration (udev + ifupdown2 hooks) |
|
| 33 |
+ |
|
| 34 |
+### Changed |
|
| 35 |
+- Enhanced udev rules for thunderbolt device handling |
|
| 36 |
+- Updated network interfaces.d with post-up hooks for MTU and bridge membership |
|
| 37 |
+ |
|
| 38 |
+## [2025-10-29] |
|
| 39 |
+ |
|
| 40 |
+### Added |
|
| 41 |
+- Initial issue tracking setup |
|
| 42 |
+- COPILOT_BACKUPS_INSTRUCTIONS.md for backup procedures |
|
| 43 |
+- CHANGELOG.md for change documentation |
|
@@ -0,0 +1,113 @@ |
||
| 1 |
+COPILOT instructions — VM backup management (project scaffold) |
|
| 2 |
+ |
|
| 3 |
+Purpose |
|
| 4 |
+ |
|
| 5 |
+This document provides context and instructions for an automated assistant (copilot) to start building a project that manages VM backups for the Madagascar cluster. The detailed backup behaviors (retention, snapshot type, schedule) will be added later. For now we focus on cluster context, knowledge sources, file contracts, and recommended initial tasks. |
|
| 6 |
+ |
|
| 7 |
+Context & what the agent already knows |
|
| 8 |
+ |
|
| 9 |
+- The cluster name is `madagascar` and node names are available under `clusters.madagascar.nodes` in `cluster-context/madagascar.json`. |
|
| 10 |
+- `cluster-context/madagascar.json` is the canonical source of cluster context available to this project: it may contain node hostnames, network information, and references to where configurations originate. |
|
| 11 |
+- `madagascar-changelog.json` (if present in the same directory) is an append-only changelog recommended for recording automation changes; prefer appending entries rather than rewriting. |
|
| 12 |
+ |
|
| 13 |
+Primary goals for the backup project (to be specified later) |
|
| 14 |
+ |
|
| 15 |
+- Discover VMs across cluster nodes. |
|
| 16 |
+- Create consistent backups (snapshots, exports) per VM on a regular schedule. |
|
| 17 |
+- Store backups in a target storage (local NAS, remote S3-compatible, etc.). |
|
| 18 |
+- Maintain retention and pruning policies. |
|
| 19 |
+- Integrate with `cluster-context/madagascar.json` for cluster information and to avoid stepping on other projects' config. |
|
| 20 |
+ |
|
| 21 |
+Files the assistant should read and keep in mind |
|
| 22 |
+ |
|
| 23 |
+- `cluster-context/madagascar.json` — primary source of truth for node hostnames, network addresses, and where configuration is defined. |
|
| 24 |
+- `madagascar-changelog.json` — append-only log to record changes made by automation (if present). |
|
| 25 |
+- `CHANGELOG.md` — human-readable changelog documenting all cluster changes with issue references. |
|
| 26 |
+- `issues/` directory — contains detailed issue documentation. Each issue has format `ISSUE-YYYY-NNN.md`. |
|
| 27 |
+ |
|
| 28 |
+Data contract (minimal) — how `cluster-context/madagascar.json` will be used by backups |
|
| 29 |
+ |
|
| 30 |
+- Inputs: |
|
| 31 |
+ - Node list: `clusters.madagascar.nodes` keys |
|
| 32 |
+ - Node access: hostname(s) under `nodes.<node>.hosts` (ssh target or provisioning endpoint) |
|
| 33 |
+ - Node VM network context: used to determine which subnets a backup touches (from `wan`/`thunderbridge`) |
|
| 34 |
+- Outputs: |
|
| 35 |
+ - Backup metadata appended to `madagascar-changelog.json` (id, timestamp, project: `backups`, summary, details, affectedResources) |
|
| 36 |
+ - (Optional) A `backups.json` manifest in repo or storage describing performed backups. |
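The output fields listed above could be assembled along these lines. This is a minimal sketch, not the project's actual format: `make_changelog_entry` is a hypothetical helper, the single-line JSON object layout is an assumption, and the arguments are assumed to contain no double quotes, backslashes, or newlines (use jq or Python when real JSON escaping is needed).

```shell
# Hypothetical helper: print one changelog entry as a single-line JSON object.
# Field names follow the data contract; the layout itself is an assumption.
make_changelog_entry() {
    id="$1"
    summary="$2"
    details="$3"
    resources="$4"   # comma-separated, e.g. "baobab,vm-100"; may be empty
    timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

    # Turn "a,b" into "a","b" for the affectedResources array.
    if [ -n "$resources" ]; then
        resources_json="\"$(printf '%s' "$resources" | sed 's/,/","/g')\""
    else
        resources_json=""
    fi

    printf '{"id":"%s","timestamp":"%s","project":"backups","summary":"%s","details":"%s","affectedResources":[%s]}\n' \
        "$id" "$timestamp" "$summary" "$details" "$resources_json"
}
```

The function only prints the entry; how it is appended to `madagascar-changelog.json` depends on that file's structure, which is not specified here.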
|
| 37 |
+ |
|
| 38 |
+Assumptions (inferred, verify early) |
|
| 39 |
+ |
|
| 40 |
+- The ops runner will have SSH access to each node via hostnames in `cluster-context/madagascar.json`. |
|
| 41 |
+- VM management is assumed to be Proxmox (PVE), judging by the `vmbr*` bridge naming in the network files. If it is something else, adapt the tooling. |
|
| 42 |
+- jq is available on automation host for simple JSON operations; Python is acceptable for more complex logic. |
|
| 43 |
+ |
|
| 44 |
+Starter tasks for the copilot (priority order) |
|
| 45 |
+ |
|
| 46 |
+1. Discovery script: `./discover_vms.sh` (or Python) that: |
|
| 47 |
+ - Reads `./cluster-context/madagascar.json` to get nodes and hostnames. |
|
| 48 |
+ - SSHes into each node and lists VMs (for Proxmox: `qm list` or `pvesh`; `pct list` for containers). |
|
| 49 |
+ - Produces a `backups/manifest-<date>.json` with the discovered VMs. |
|
| 50 |
+2. Backup runner: `./run_backup.sh` which takes a VM id and node, creates a snapshot/export, and uploads it to configured storage. Keep steps idempotent and record metadata. |
|
| 51 |
+3. Pruner: `./prune_backups.sh` to remove old backups according to retention policy (to be defined). |
|
| 52 |
+4. Integration tests: small harness that runs discovery against a mocked inventory or a minimal local mock environment and validates outputs. |
|
| 53 |
+5. Changelog integration: every automated change to `cluster-context/madagascar.json` or backup metadata must append an entry in `cluster-context/madagascar-changelog.json` describing reason and affected resources. |
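As a rough sketch of the parsing half of the discovery task, the function below turns `qm list` output into one `vmid name status` line per VM. It assumes the usual column layout (a header row, then VMID, NAME, and STATUS as the first three columns). `parse_qm_list` is a hypothetical name, and the SSH transport and manifest writing are left out.

```shell
# Hypothetical helper: read `qm list` output on stdin and print
# "vmid name status" for every data row.
# Assumes a header row followed by rows whose first column is a numeric VMID.
parse_qm_list() {
    awk 'NR > 1 && $1 ~ /^[0-9]+$/ { print $1, $2, $3 }'
}
```

It would typically sit at the end of a pipeline such as `ssh -o LogLevel=ERROR root@<node> qm list | parse_qm_list`, with the node names taken from `cluster-context/madagascar.json`.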
|
| 54 |
+ |
|
| 55 |
+Developer guidance & best practices |
|
| 56 |
+ |
|
| 57 |
+- Treat `cluster-context/madagascar.json` as the source of truth for discovery; do not hardcode hostnames elsewhere. |
|
| 58 |
+- When writing automation that mutates `cluster-context/madagascar.json`, always also append a changelog entry and prefer atomic updates (write to tmp file then rename). |
|
| 59 |
+- Prefer small, single-purpose scripts. Keep complex logic in Python where JSON and SSH handling is easier. |
|
| 60 |
+- Add unit tests for parsing and manifest generation. |
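The "write to tmp file then rename" advice above can be sketched as a small helper. `atomic_write` is a hypothetical name; the sketch relies on the temp file living in the same directory as the target so that the final `mv` is a rename within one filesystem.

```shell
# Hypothetical helper: write stdin to "$1" atomically.
# The data is staged in a temp file next to the target and then renamed over it,
# so readers see either the old file or the new one, never a partial write.
atomic_write() {
    target="$1"
    dir="$(dirname "$target")"
    tmp="$(mktemp "$dir/.tmp.XXXXXX")" || return 1
    if cat > "$tmp" && mv -f "$tmp" "$target"; then
        return 0
    fi
    # Clean up the staging file if anything failed.
    rm -f "$tmp"
    return 1
}
```

For example, `jq '.' cluster-context/madagascar.json | atomic_write cluster-context/madagascar.json` would rewrite the file in one step. Note that `mktemp` creates the file with restrictive permissions, so a real script may need to restore the original mode afterwards.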
|
| 61 |
+ |
|
| 62 |
+--- |
|
| 63 |
+ |
|
| 64 |
+## Copilot Automation Instructions (Network & Issue Tracking) |
|
| 65 |
+ |
|
| 66 |
+### Cluster Discovery & Network Checks |
|
| 67 |
+- Use `cluster/cluster-context/madagascar.json` for node, IP, and service info. |
|
| 68 |
+- To verify thunderbolt networking, run `scripts/check_thunderbridge.sh`: |
|
| 69 |
+ - Checks bridge membership and MTU for all thunderbolt interfaces. |
|
| 70 |
+ - Verifies cluster network connectivity (ping between all nodes). |
|
| 71 |
+- For troubleshooting, check kernel logs (`dmesg`), interface status (`ip link show`), and bridge membership (`bridge link`). |
|
| 72 |
+ |
|
| 73 |
+### Issue Tracking Workflow |
|
| 74 |
+- All issues are tracked in `issues/` as Markdown files using `TEMPLATE.md`. |
|
| 75 |
+- Each issue gets a unique ID (e.g., ISSUE-2025-001, ISSUE-2025-002). |
|
| 76 |
+- Document: |
|
| 77 |
+ - Summary, environment, steps to reproduce, expected/actual behavior |
|
| 78 |
+ - Logs/evidence, investigation notes, proposed solution |
|
| 79 |
+ - Related issues and changelog references |
|
| 80 |
+- Update `CHANGELOG.md` for every fix, enhancement, or regression. |
|
| 81 |
+- Close issues only after full deployment and verification. |
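To illustrate the ID scheme above, the following sketch derives the next free `ISSUE-YYYY-NNN` identifier from the files already present in `issues/`. `next_issue_id` is a hypothetical helper, and it assumes every issue file is named exactly `ISSUE-YYYY-NNN.md` with a three-digit sequence number.

```shell
# Hypothetical helper: print the next unused issue ID for a given year.
# Usage: next_issue_id <issues-dir> <year>
next_issue_id() {
    issues_dir="$1"
    year="$2"
    max=0
    for f in "$issues_dir"/ISSUE-"$year"-[0-9][0-9][0-9].md; do
        [ -e "$f" ] || continue          # glob matched nothing
        n="${f##*-}"                     # e.g. "002.md"
        n="${n%.md}"                     # e.g. "002"
        n="$(printf '%s' "$n" | sed 's/^0*//')"   # strip leading zeros
        [ -n "$n" ] || n=0
        if [ "$n" -gt "$max" ]; then
            max="$n"
        fi
    done
    printf 'ISSUE-%s-%03d\n' "$year" "$((max + 1))"
}
```

A new issue could then be started with something like `cp issues/TEMPLATE.md "issues/$(next_issue_id issues 2026).md"`.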
|
| 82 |
+ |
|
| 83 |
+### Copilot Automation Conventions |
|
| 84 |
+- Always verify changes on all affected nodes (baobab, ebony, tapia, etc.). |
|
| 85 |
+- Use defense-in-depth for network fixes (udev rules + ifupdown2 hooks). |
|
| 86 |
+- Scripts should be POSIX-compliant for maximum compatibility. |
|
| 87 |
+- Suppress SSH warnings for clean output (`-o LogLevel=ERROR`). |
|
| 88 |
+- Document every change and test result in the issue tracker and changelog. |
|
| 89 |
+ |
|
| 90 |
+### Example Copilot Tasks |
|
| 91 |
+- Deploy network fixes: `deploy/attempt1/deploy_tb.sh <node>` |
|
| 92 |
+- Check thunderbolt status: `scripts/check_thunderbridge.sh` |
|
| 93 |
+- Investigate hardware/network issues: kernel logs, interface status, bridge membership |
|
| 94 |
+- Document and close issues: update `issues/` and `CHANGELOG.md` |
|
| 95 |
+ |
|
| 96 |
+### References |
|
| 97 |
+- `cluster/cluster-context/madagascar.json`: Node, network, and backup server definitions |
|
| 98 |
+- `issues/`: Issue tracker and templates |
|
| 99 |
+- `CHANGELOG.md`: Change documentation |
|
| 100 |
+- `scripts/check_thunderbridge.sh`: Cluster network health check |
|
| 101 |
+- `deploy/attempt1/deploy_tb.sh`: Network deployment script |
|
| 102 |
+ |
|
| 103 |
+### Maintenance |
|
| 104 |
+- Regularly run network and backup checks. |
|
| 105 |
+- Update documentation and changelogs for every change. |
|
| 106 |
+- Use Copilot to automate repetitive tasks and ensure consistency across the cluster. |
|
| 107 |
+ |
|
| 108 |
+Next steps for the user |
|
| 109 |
+ |
|
| 110 |
+- Provide backup policy details (snapshot vs export, retention counts, storage endpoint credentials). |
|
| 111 |
+- Confirm VM manager (Proxmox vs KVM/libvirt vs other). |
|
| 112 |
+ |
|
| 113 |
+If you want, I can now scaffold `./discover_vms.sh`, `./run_backup.sh` (stubs), and a small `backups/README.md` describing configuration fields. Which do you prefer I create first: discovery script (bash) or Python scaffold? |
|
@@ -0,0 +1,62 @@ |
||
| 1 |
+# Installation |
|
| 2 |
+ |
|
| 3 |
+This project now has two distinct flows: |
|
| 4 |
+ |
|
| 5 |
+1. `deploy/attempt1/deploy_tb.sh` |
|
| 6 |
+ - full bootstrap |
|
| 7 |
+ - can also update the per-host network files |
|
| 8 |
+2. `scripts/install.sh` or `setup.sh` |
|
| 9 |
+ - reinstall/upgrade for the shared runtime |
|
| 10 |
+ - does NOT touch `/etc/network/interfaces` or `interfaces.d/10-thunderbolt` |
|
| 11 |
+ |
|
| 12 |
+## Standardized reinstall |
|
| 13 |
+ |
|
| 14 |
+### Local |
|
| 15 |
+ |
|
| 16 |
+```bash |
|
| 17 |
+sudo ./setup.sh --local |
|
| 18 |
+``` |
|
| 19 |
+ |
|
| 20 |
+### Remote |
|
| 21 |
+ |
|
| 22 |
+```bash |
|
| 23 |
+sudo ./setup.sh baobab |
|
| 24 |
+sudo ./setup.sh ebony tapia |
|
| 25 |
+``` |
|
| 26 |
+ |
|
| 27 |
+What it installs: |
|
| 28 |
+- `/usr/local/lib/xdev/thunderbolts/tb-recover.sh` |
|
| 29 |
+- `/usr/local/lib/xdev/thunderbolts/uninstall.sh` |
|
| 30 |
+- `/usr/local/sbin/tb-recover.sh` |
|
| 31 |
+- `/usr/local/sbin/xdev-thunderbolts-uninstall` |
|
| 32 |
+- `/etc/systemd/system/tb-bridge.service` |
|
| 33 |
+- `/etc/systemd/system/tb-enlist@.service` |
|
| 34 |
+- `/etc/systemd/system/tb-recover.service` |
|
| 35 |
+- `/etc/systemd/system/tb-recover.timer` |
|
| 36 |
+- `/etc/udev/rules.d/90-thunderbolt-net-systemd.rules` |
|
| 37 |
+- `/usr/local/share/doc/xdev/thunderbolts/*` |
|
| 38 |
+ |
|
| 39 |
+What it does NOT touch: |
|
| 40 |
+- `/etc/network/interfaces` |
|
| 41 |
+- `/etc/network/interfaces.d/10-thunderbolt` |
|
| 42 |
+ |
|
| 43 |
+## Standardized uninstall |
|
| 44 |
+ |
|
| 45 |
+```bash |
|
| 46 |
+sudo ./setup.sh --local --uninstall |
|
| 47 |
+sudo ./setup.sh --uninstall baobab |
|
| 48 |
+``` |
|
| 49 |
+ |
|
| 50 |
+Or directly on the host: |
|
| 51 |
+ |
|
| 52 |
+```bash |
|
| 53 |
+sudo /usr/local/lib/xdev/thunderbolts/uninstall.sh |
|
| 54 |
+``` |
|
| 55 |
+ |
|
| 56 |
+The uninstall removes only the shared runtime: |
|
| 57 |
+- the systemd units |
|
| 58 |
+- the udev rule |
|
| 59 |
+- `tb-recover.sh` |
|
| 60 |
+- the installed documentation |
|
| 61 |
+ |
|
| 62 |
+It neither restores nor deletes the network files. |
|
@@ -0,0 +1,141 @@ |
||
| 1 |
+# Madagascar's Thunderbolts |
|
| 2 |
+ |
|
| 3 |
+Thunderbolt networking toolkit for three Proxmox hosts (`baobab`, `ebony`, `tapia`). |
|
| 4 |
+The goal is to bring up a high-MTU Thunderbolt bridge (`thunderbridge`) early in boot, |
|
| 5 |
+enlist hot-plugged Thunderbolt NICs as they appear, and keep management networking |
|
| 6 |
+configs consistent across the cluster. |
|
| 7 |
+ |
|
| 8 |
+## Repository layout |
|
| 9 |
+ |
|
| 10 |
+``` |
|
| 11 |
+deploy/attempt1/ |
|
| 12 |
+├── common/ # Shared bits copied to every host |
|
| 13 |
+│ ├── systemd/system/ |
|
| 14 |
+│ │ ├── tb-bridge.service # Ensures the bridge device exists and is up |
|
| 15 |
+│ │ └── tb-enlist@.service # Enlists hotplugged NICs into the bridge |
|
| 16 |
+│ └── udev/rules.d/ |
|
| 17 |
+│ └── 90-…systemd.rules # Starts tb-enlist@ for thunderbolt* devices |
|
| 18 |
+├── baobab/… # Node-specific /etc/network config |
|
| 19 |
+├── ebony/… |
|
| 20 |
+├── tapia/… |
|
| 21 |
+└── deploy_tb.sh # Main deployment script |
|
| 22 |
+``` |
|
| 23 |
+ |
|
| 24 |
+The repo currently holds a single deployment attempt (`deploy/attempt1`). If you |
|
| 25 |
+iterate on the design, prefer adding a new attempt directory so older snapshots |
|
| 26 |
+stay reproducible. |
|
| 27 |
+ |
|
| 28 |
+## Standardized lifecycle |
|
| 29 |
+ |
|
| 30 |
+This project now has two distinct operational paths: |
|
| 31 |
+ |
|
| 32 |
+- Full bootstrap: `deploy/attempt1/deploy_tb.sh` |
|
| 33 |
+ - can update host-specific network configuration |
|
| 34 |
+ - use for initial deployment or deliberate network template rollout |
|
| 35 |
+- Shared runtime reinstall: `./setup.sh` |
|
| 36 |
+ - standardizes the shared runtime artifacts only |
|
| 37 |
+ - installs/removes `tb-recover.sh`, the shared systemd units, and the udev rule |
|
| 38 |
+ - intentionally leaves `/etc/network/interfaces` and `/etc/network/interfaces.d/10-thunderbolt` untouched |
|
| 39 |
+ |
|
| 40 |
+Standardized host paths for the shared runtime: |
|
| 41 |
+ |
|
| 42 |
+- canonical uninstall: `/usr/local/lib/xdev/thunderbolts/uninstall.sh` |
|
| 43 |
+- canonical shared script: `/usr/local/lib/xdev/thunderbolts/tb-recover.sh` |
|
| 44 |
+- operator wrapper: `/usr/local/sbin/tb-recover.sh` |
|
| 45 |
+- installed docs: `/usr/local/share/doc/xdev/thunderbolts` |
|
| 46 |
+ |
|
| 47 |
+Use: |
|
| 48 |
+ |
|
| 49 |
+```bash |
|
| 50 |
+./setup.sh # reinstall shared runtime on baobab ebony tapia |
|
| 51 |
+./setup.sh baobab # single host |
|
| 52 |
+./setup.sh --uninstall baobab |
|
| 53 |
+``` |
|
| 54 |
+ |
|
| 55 |
+## Prerequisites |
|
| 56 |
+ |
|
| 57 |
+- Machine with Bash ≥3, `ssh`, and `scp` available. |
|
| 58 |
+- Access to the target hosts as `root` (default username) over the management or |
|
| 59 |
+ Thunderbolt network; passwordless SSH is assumed. |
|
| 60 |
+- Target hosts run Proxmox (or any Debian-like system with ifupdown2 and systemd). |
|
| 61 |
+- `ip`, `systemctl`, and `udevadm` available on the remote hosts. |
|
| 62 |
+ |
|
| 63 |
+## How deployment works |
|
| 64 |
+ |
|
| 65 |
+`deploy_tb.sh` is idempotent. For each target host it: |
|
| 66 |
+ |
|
| 67 |
+- Chooses an IP by trying management first, then Thunderbolt (`get_mgmt_ip`/`get_tb_ip`). |
|
| 68 |
+- Uploads shared udev and systemd units that prepare the `thunderbridge` device and |
|
| 69 |
+ attach Thunderbolt NICs when they hot-plug. |
|
| 70 |
+- Replaces `/etc/network/interfaces` with the host-specific template and places the |
|
| 71 |
+ Thunderbolt overlay in `/etc/network/interfaces.d/10-thunderbolt`. |
|
| 72 |
+- Reloads udev and systemd, triggers network reloads, enables the services, and |
|
| 73 |
+ prints a short status report (bridge state, enlisted NICs). |
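The "management first, then Thunderbolt" selection in the first step can be sketched as follows. The real bodies of `get_mgmt_ip` and `get_tb_ip` are not shown in this README, so the versions here are hypothetical stand-ins with illustrative addresses for `baobab` only. `is_reachable` and `choose_ip` are assumed names as well.

```shell
# Hypothetical stand-ins for the helpers named above; addresses are illustrative.
get_mgmt_ip() {
    case "$1" in
        baobab) echo "192.168.2.91" ;;
        *) return 1 ;;
    esac
}

get_tb_ip() {
    case "$1" in
        baobab) echo "192.168.10.91" ;;
        *) return 1 ;;
    esac
}

# Assumed reachability probe: a quick SSH no-op with a short timeout.
is_reachable() {
    ssh -o BatchMode=yes -o ConnectTimeout=3 -o LogLevel=ERROR "root@$1" true 2>/dev/null
}

# Print the first reachable address for a host, trying management before Thunderbolt.
choose_ip() {
    host="$1"
    for getter in get_mgmt_ip get_tb_ip; do
        ip="$("$getter" "$host")" || continue
        if is_reachable "$ip"; then
            echo "$ip"
            return 0
        fi
    done
    return 1
}
```

Keeping the probe in its own function makes the fallback order easy to test without touching the network.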
|
| 74 |
+ |
|
| 75 |
+Run it from inside the attempt directory so relative paths resolve correctly. |
|
| 76 |
+ |
|
| 77 |
+```bash |
|
| 78 |
+cd deploy/attempt1 |
|
| 79 |
+./deploy_tb.sh # deploys to baobab, ebony, tapia |
|
| 80 |
+./deploy_tb.sh baobab # deploys to a single host |
|
| 81 |
+./deploy_tb.sh tapia ebony |
|
| 82 |
+``` |
|
| 83 |
+ |
|
| 84 |
+## Customising host lists and addresses |
|
| 85 |
+ |
|
| 86 |
+Edit the `get_mgmt_ip()` and `get_tb_ip()` helpers near the top of |
|
| 87 |
+`deploy/attempt1/deploy_tb.sh` to match your environment. Each host that you want |
|
| 88 |
+to target must: |
|
| 89 |
+ |
|
| 90 |
+1. Have a subdirectory named after the host inside `deploy/attempt1`. |
|
| 91 |
+2. Provide the full `/etc/network/interfaces` template. |
|
| 92 |
+3. Provide `etc/network/interfaces.d/10-thunderbolt` with the bridge definition |
|
| 93 |
+ and hotplug rules for Thunderbolt interfaces. |
|
| 94 |
+ |
|
| 95 |
+To add a new host, copy one of the existing directories, adjust static IPs and |
|
| 96 |
+interface names, then extend both helper functions so the script can locate it. |
|
| 97 |
+ |
|
| 98 |
+## What the systemd/udev pieces do |
|
| 99 |
+ |
|
| 100 |
+- `tb-bridge.service` (oneshot) makes sure the `thunderbridge` device exists as a |
|
| 101 |
+ Linux bridge, sets MTU 65520, and brings it up during early boot. |
|
| 102 |
+- `tb-enlist@.service` attaches Thunderbolt NIC instances to the bridge, aligning |
|
| 103 |
+ their MTU and keeping them hotplug friendly; systemd stops the unit cleanly on |
|
| 104 |
+ device removal. |
|
| 105 |
+- `90-thunderbolt-net-systemd.rules` tags `thunderbolt*` NICs so udev starts the |
|
| 106 |
+ enlist service automatically. |
|
| 107 |
+ |
|
| 108 |
+These files live under `deploy/attempt1/common/` and are copied verbatim to the |
|
| 109 |
+remote host’s `/etc/systemd/system` and `/etc/udev/rules.d`. |
|
| 110 |
+ |
|
| 111 |
+## Validation checklist |
|
| 112 |
+ |
|
| 113 |
+After running the deploy script on a host: |
|
| 114 |
+ |
|
| 115 |
+- `systemctl status tb-bridge.service` should show an *active* oneshot unit. |
|
| 116 |
+- `systemctl list-units 'tb-enlist@*'` should list one unit per detected Thunderbolt |
|
| 117 |
+ NIC, each *loaded* and *active*. |
|
| 118 |
+- `ip -d link show thunderbridge` should display MTU 65520 and `state UP`. |
|
| 119 |
+- `bridge link` should list your Thunderbolt interfaces as ports of `thunderbridge` |
|
| 120 |
+ once cables are connected. |
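The MTU and bridge-membership checks above can also be scripted against sysfs. This is a minimal sketch rather than a replacement for `scripts/check_thunderbridge.sh`: `check_mtu`, `check_bridge_member`, and the `SYSFS_NET` override are hypothetical names introduced here for illustration.

```shell
# SYSFS_NET is an assumed override so the checks can run against a fixture directory.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Succeeds when the interface exists and reports the expected MTU (default 65520).
check_mtu() {
    iface="$1"
    expected="${2:-65520}"
    mtu_file="$SYSFS_NET/$iface/mtu"
    [ -r "$mtu_file" ] || { echo "$iface: missing"; return 1; }
    actual="$(cat "$mtu_file")"
    if [ "$actual" = "$expected" ]; then
        echo "$iface: mtu $actual ok"
    else
        echo "$iface: mtu $actual, expected $expected"
        return 1
    fi
}

# Succeeds when the interface is a port of the given bridge (default thunderbridge).
# An enslaved interface exposes a "master" symlink pointing at its bridge.
check_bridge_member() {
    iface="$1"
    bridge="${2:-thunderbridge}"
    link="$(readlink "$SYSFS_NET/$iface/master" 2>/dev/null)" || return 1
    [ "${link##*/}" = "$bridge" ]
}
```

On a host, `check_mtu thunderbridge && check_bridge_member thunderbolt0` would cover two of the checklist items in one line.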
|
| 121 |
+ |
|
| 122 |
+If you change the network definitions, re-run `./deploy_tb.sh <host>` to push the |
|
| 123 |
+updates. The script re-applies permissions, reloads systemd, retriggers udev, and |
|
| 124 |
+refreshes the interfaces. |
|
| 125 |
+ |
|
| 126 |
+## Troubleshooting tips |
|
| 127 |
+ |
|
| 128 |
+- *SSH unreachable*: Confirm management and Thunderbolt IPs in the helper functions |
|
| 129 |
+ are correct, and that firewalls allow SSH. The script prints which IP it tried. |
|
| 130 |
+- *Bridge missing after reboot*: Ensure `tb-bridge.service` is enabled; run |
|
| 131 |
+ `systemctl enable --now tb-bridge.service` on the host. |
|
| 132 |
+- *NICs not joining*: Check `journalctl -u tb-enlist@thunderbolt0` for logs and make |
|
| 133 |
+ sure the udev rule is present under `/etc/udev/rules.d`. |
|
| 134 |
+- *MTU mismatch complaints*: The service forces MTU 65520 on both sides; verify the |
|
| 135 |
+ connected devices also support it. |
|
| 136 |
+ |
|
| 137 |
+## Extending beyond attempt1 |
|
| 138 |
+ |
|
| 139 |
+Prefer copying `deploy/attempt1` into a new versioned folder (for example, |
|
| 140 |
+`attempt2`) when you experiment with alternate topologies or addresses. This keeps |
|
| 141 |
+previous rollouts reproducible and eases diffing of changes. |
|
@@ -0,0 +1 @@ |
||
| 1 |
+../cluster |
|
@@ -0,0 +1,45 @@ |
||
| 1 |
+# network interface settings; autogenerated |
|
| 2 |
+# Please do NOT modify this file directly, unless you know what |
|
| 3 |
+# you're doing. |
|
| 4 |
+# |
|
| 5 |
+# If you want to manage parts of the network configuration manually, |
|
| 6 |
+# please utilize the 'source' or 'source-directory' directives to do |
|
| 7 |
+# so. |
|
| 8 |
+# PVE will preserve these directives, but will NOT read its network |
|
| 9 |
+# configuration from sourced files, so do not attempt to move any of |
|
| 10 |
+# the PVE managed interfaces into external files! |
|
| 11 |
+ |
|
| 12 |
+auto lo |
|
| 13 |
+iface lo inet loopback |
|
| 14 |
+ |
|
| 15 |
+auto enp86s0 |
|
| 16 |
+iface enp86s0 inet manual |
|
| 17 |
+ |
|
| 18 |
+iface enp86s0.442 inet manual |
|
| 19 |
+ |
|
| 20 |
+iface enp86s0.443 inet manual |
|
| 21 |
+ |
|
| 22 |
+iface enp86s0.444 inet manual |
|
| 23 |
+source /etc/network/interfaces.d/* |
|
| 24 |
+ |
|
| 25 |
+auto vmbr443 |
|
| 26 |
+iface vmbr443 inet static |
|
| 27 |
+ address 192.168.2.91/24 |
|
| 28 |
+ gateway 192.168.2.1 |
|
| 29 |
+ bridge-ports enp86s0.443 |
|
| 30 |
+ bridge-stp off |
|
| 31 |
+ bridge-fd 0 |
|
| 32 |
+ |
|
| 33 |
+auto vmbr444 |
|
| 34 |
+iface vmbr444 inet static |
|
| 35 |
+ address 192.168.4.91/24 |
|
| 36 |
+ bridge-ports enp86s0.444 |
|
| 37 |
+ bridge-stp off |
|
| 38 |
+ bridge-fd 0 |
|
| 39 |
+ |
|
| 40 |
+auto vmbr442 |
|
| 41 |
+iface vmbr442 inet manual |
|
| 42 |
+ bridge-ports enp86s0.442 |
|
| 43 |
+ bridge-stp off |
|
| 44 |
+ bridge-fd 0 |
|
| 45 |
+ |
|
@@ -0,0 +1,26 @@ |
||
| 1 |
+# Modular network configuration for baobab - Thunderbolt networking |
|
| 2 |
+# ifupdown2-safe: bridge comes up alone; TB ports hotplug in later |
|
| 3 |
+ |
|
| 4 |
+# Thunderbolt ports appear late — do NOT 'auto' them |
|
| 5 |
+allow-hotplug thunderbolt0 |
|
| 6 |
+iface thunderbolt0 inet manual |
|
| 7 |
+ pre-up ip link set dev $IFACE mtu 65520 || true |
|
| 8 |
+ post-up ip link set dev $IFACE mtu 65520 || true |
|
| 9 |
+ post-up ip link set dev $IFACE master thunderbridge || true |
|
| 10 |
+ |
|
| 11 |
+allow-hotplug thunderbolt1 |
|
| 12 |
+iface thunderbolt1 inet manual |
|
| 13 |
+ pre-up ip link set dev $IFACE mtu 65520 || true |
|
| 14 |
+ post-up ip link set dev $IFACE mtu 65520 || true |
|
| 15 |
+ post-up ip link set dev $IFACE master thunderbridge || true |
|
| 16 |
+ |
|
| 17 |
+# Bridge must exist and stay up even with zero members |
|
| 18 |
+auto thunderbridge |
|
| 19 |
+iface thunderbridge inet static |
|
| 20 |
+ address 192.168.10.91/24 |
|
| 21 |
+ bridge-ports none |
|
| 22 |
+ bridge-stp off |
|
| 23 |
+ bridge-fd 0 |
|
| 24 |
+ mtu 65520 |
|
| 25 |
+ pre-up ip link add name $IFACE type bridge 2>/dev/null || true |
|
| 26 |
+ post-up ip link set dev $IFACE up |
|
@@ -0,0 +1,316 @@ |
||
| 1 |
+#!/usr/bin/env bash |
|
| 2 |
+set -euo pipefail |
|
| 3 |
+ |
|
| 4 |
+BRIDGE="thunderbridge" |
|
| 5 |
+MTU="65520" |
|
| 6 |
+FOUND_TB_IFACE=0 |
|
| 7 |
+STATE_DIR="/run/tb-recover" |
|
| 8 |
+LAST_BOLT_RESTART_FILE="${STATE_DIR}/last_bolt_restart_epoch" |
|
| 9 |
+BOLT_RESTART_COOLDOWN_SEC=600 |
|
| 10 |
+LAST_NHI_RESCAN_FILE="${STATE_DIR}/last_nhi_rescan_epoch" |
|
| 11 |
+NHI_RESCAN_COOLDOWN_SEC=600 |
|
| 12 |
+NHI_SETTLE_SEC=8 |
|
| 13 |
+PEER_FAIL_THRESHOLD="${TB_PEER_FAIL_THRESHOLD:-2}" |
|
| 14 |
+IFACE_CYCLE_COOLDOWN_SEC="${TB_IFACE_CYCLE_COOLDOWN_SEC:-300}" |
|
| 15 |
+IFACE_CYCLE_SETTLE_SEC="${TB_IFACE_CYCLE_SETTLE_SEC:-5}" |
|
| 16 |
+PING_TIMEOUT_SEC="${TB_PING_TIMEOUT_SEC:-1}" |
|
| 17 |
+LOCAL_HOST="$(hostname -s 2>/dev/null || hostname)" |
|
| 18 |
+ |
|
| 19 |
+mkdir -p "$STATE_DIR" |
|
| 20 |
+ |
|
| 21 |
+log() { |
|
| 22 |
+ printf '%s %s\n' "$(date -Is)" "$*" |
|
| 23 |
+} |
|
| 24 |
+ |
|
| 25 |
+command_exists() { |
|
| 26 |
+ command -v "$1" >/dev/null 2>&1 |
|
| 27 |
+} |
|
| 28 |
+ |
|
| 29 |
+counter_file_for_iface() { |
|
| 30 |
+ printf '%s/peer-fail-%s.count\n' "$STATE_DIR" "$1" |
|
| 31 |
+} |
|
| 32 |
+ |
|
| 33 |
+cooldown_file_for_iface() { |
|
| 34 |
+ printf '%s/last-iface-cycle-%s.epoch\n' "$STATE_DIR" "$1" |
|
| 35 |
+} |
|
| 36 |
+ |
|
| 37 |
+read_epoch_file() { |
|
| 38 |
+ local file="$1" |
|
| 39 |
+ local value="0" |
|
| 40 |
+ |
|
| 41 |
+ if [ -f "$file" ]; then |
|
| 42 |
+ value="$(cat "$file" 2>/dev/null || echo 0)" |
|
| 43 |
+ fi |
|
| 44 |
+ |
|
| 45 |
+ case "$value" in |
|
| 46 |
+ ''|*[!0-9]*) |
|
| 47 |
+ value=0 |
|
| 48 |
+ ;; |
|
| 49 |
+ esac |
|
| 50 |
+ |
|
| 51 |
+ printf '%s\n' "$value" |
|
| 52 |
+} |
|
| 53 |
+ |
|
| 54 |
+read_counter_file() { |
|
| 55 |
+ read_epoch_file "$1" |
|
| 56 |
+} |
|
| 57 |
+ |
|
| 58 |
+peer_ip_for_iface() { |
|
| 59 |
+ local iface="$1" |
|
| 60 |
+ |
|
| 61 |
+ case "${LOCAL_HOST}:${iface}" in |
|
| 62 |
+ baobab:thunderbolt0) |
|
| 63 |
+ printf '%s\n' "192.168.10.92" |
|
| 64 |
+ ;; |
|
| 65 |
+ baobab:thunderbolt1) |
|
| 66 |
+ printf '%s\n' "192.168.10.93" |
|
| 67 |
+ ;; |
|
| 68 |
+ ebony:thunderbolt0) |
|
| 69 |
+ printf '%s\n' "192.168.10.91" |
|
| 70 |
+ ;; |
|
| 71 |
+ tapia:thunderbolt0) |
|
| 72 |
+ printf '%s\n' "192.168.10.91" |
|
| 73 |
+ ;; |
|
| 74 |
+ *) |
|
| 75 |
+ return 1 |
|
| 76 |
+ ;; |
|
| 77 |
+ esac |
|
| 78 |
+} |
|
| 79 |
+ |
|
| 80 |
+iface_is_forwarding() { |
|
| 81 |
+ local iface="$1" |
|
| 82 |
+ local state_file="/sys/class/net/${iface}/brport/state" |
|
| 83 |
+ |
|
| 84 |
+ [ -r "$state_file" ] || return 1 |
|
| 85 |
+ [ "$(cat "$state_file" 2>/dev/null || echo 0)" = "3" ] |
|
| 86 |
+} |
|
| 87 |
+ |
|
| 88 |
+iface_is_oper_up() { |
|
| 89 |
+ local iface="$1" |
|
| 90 |
+ local operstate_file="/sys/class/net/${iface}/operstate" |
|
| 91 |
+ |
|
| 92 |
+ [ -r "$operstate_file" ] || return 1 |
|
| 93 |
+ [ "$(cat "$operstate_file" 2>/dev/null || true)" = "up" ] |
|
| 94 |
+} |
|
| 95 |
+ |
|
| 96 |
+probe_peer_ip() { |
|
| 97 |
+ local peer_ip="$1" |
|
| 98 |
+ |
|
| 99 |
+ ip neigh del "$peer_ip" dev "$BRIDGE" 2>/dev/null || true |
|
| 100 |
+ ping -I "$BRIDGE" -n -c 1 -W "$PING_TIMEOUT_SEC" "$peer_ip" >/dev/null 2>&1 |
|
| 101 |
+} |
|
| 102 |
+ |
|
| 103 |
+recover_iface_cycle() { |
|
| 104 |
+ local iface="$1" |
|
| 105 |
+ local peer_ip="$2" |
|
| 106 |
+ local now |
|
| 107 |
+ local last_cycle |
|
| 108 |
+ local cooldown_file |
|
| 109 |
+ |
|
| 110 |
+ now="$(date +%s)" |
|
| 111 |
+ cooldown_file="$(cooldown_file_for_iface "$iface")" |
|
| 112 |
+ last_cycle="$(read_epoch_file "$cooldown_file")" |
|
| 113 |
+ if [ $((now - last_cycle)) -lt "$IFACE_CYCLE_COOLDOWN_SEC" ]; then |
|
| 114 |
+ log "peer ${peer_ip} still unhealthy on ${iface}, but iface cycle is cooling down" |
|
| 115 |
+ return 0 |
|
| 116 |
+ fi |
|
| 117 |
+ |
|
| 118 |
+ log "peer ${peer_ip} unhealthy on ${iface}; cycling link with ifdown/ifup" |
|
| 119 |
+ if command_exists ifdown && command_exists ifup; then |
|
| 120 |
+ ifdown --force "$iface" || log "ifdown reported a non-zero exit code for ${iface}" |
|
| 121 |
+ sleep 2 |
|
| 122 |
+ if ! ifup "$iface"; then |
|
| 123 |
+ log "ifup failed for ${iface}" |
|
| 124 |
+ return 1 |
|
| 125 |
+ fi |
|
| 126 |
+ else |
|
| 127 |
+ log "ifdown/ifup unavailable; falling back to ip link bounce for ${iface}" |
|
| 128 |
+ ip link set "$iface" down || true |
|
| 129 |
+ sleep 2 |
|
| 130 |
+ ip link set "$iface" up || true |
|
| 131 |
+ fi |
|
| 132 |
+ |
|
| 133 |
+ ip link set "$iface" mtu "$MTU" || true |
|
| 134 |
+ ip link set "$iface" master "$BRIDGE" || true |
|
| 135 |
+ systemctl start "tb-enlist@${iface}.service" || true |
|
| 136 |
+ printf '%s\n' "$now" > "$cooldown_file" |
|
| 137 |
+ rm -f "$(counter_file_for_iface "$iface")" |
|
| 138 |
+ sleep "$IFACE_CYCLE_SETTLE_SEC" |
|
| 139 |
+} |
|
| 140 |
+ |
|
| 141 |
+assess_peer_health() { |
|
| 142 |
+ local iface="$1" |
|
| 143 |
+ local peer_ip="" |
|
| 144 |
+ local counter_file="" |
|
| 145 |
+ local fail_count=0 |
|
| 146 |
+ |
|
| 147 |
+ if ! peer_ip="$(peer_ip_for_iface "$iface")"; then |
|
| 148 |
+ return 0 |
|
| 149 |
+ fi |
|
| 150 |
+ |
|
| 151 |
+ counter_file="$(counter_file_for_iface "$iface")" |
|
| 152 |
+ |
|
| 153 |
+ if ! iface_is_oper_up "$iface" || ! iface_is_forwarding "$iface"; then |
|
| 154 |
+ rm -f "$counter_file" |
|
| 155 |
+ return 0 |
|
| 156 |
+ fi |
|
| 157 |
+ |
|
| 158 |
+ if probe_peer_ip "$peer_ip"; then |
|
| 159 |
+ rm -f "$counter_file" |
|
| 160 |
+ return 0 |
|
| 161 |
+ fi |
|
| 162 |
+ |
|
| 163 |
+ fail_count="$(read_counter_file "$counter_file")" |
|
| 164 |
+ fail_count=$((fail_count + 1)) |
|
| 165 |
+ printf '%s\n' "$fail_count" > "$counter_file" |
|
| 166 |
+ log "peer probe failed on ${iface} towards ${peer_ip} (${fail_count}/${PEER_FAIL_THRESHOLD})" |
|
| 167 |
+ |
|
| 168 |
+ if [ "$fail_count" -lt "$PEER_FAIL_THRESHOLD" ]; then |
|
| 169 |
+ return 0 |
|
| 170 |
+ fi |
|
| 171 |
+ |
|
| 172 |
+ recover_iface_cycle "$iface" "$peer_ip" |
|
| 173 |
+} |
|
| 174 |
+ |
|
| 175 |
+has_tb_netdev() { |
|
| 176 |
+ ls /sys/class/net/thunderbolt* >/dev/null 2>&1 |
|
| 177 |
+} |
|
| 178 |
+ |
|
| 179 |
+has_stale_tb_xdomain() { |
|
| 180 |
+ local dev="" |
|
| 181 |
+ for dev in /sys/bus/thunderbolt/devices/[0-9]-[1-9]*; do |
|
| 182 |
+ [ -e "$dev" ] || continue |
|
| 183 |
+ case "${dev##*/}" in |
|
| 184 |
+ *.*|*:*) |
|
| 185 |
+ continue |
|
| 186 |
+ ;; |
|
| 187 |
+ esac |
|
| 188 |
+ |
|
| 189 |
+ if ! ls "${dev}".* >/dev/null 2>&1; then
|
|
| 190 |
+ return 0 |
|
| 191 |
+ fi |
|
| 192 |
+ done |
|
| 193 |
+ |
|
| 194 |
+ return 1 |
|
| 195 |
+} |
|
| 196 |
+ |
|
| 197 |
+trigger_tb_rescan() {
|
|
| 198 |
+ local domain="" |
|
| 199 |
+ for domain in /sys/bus/thunderbolt/devices/domain*; do |
|
| 200 |
+ [ -e "$domain/rescan" ] && echo 1 > "$domain/rescan" || true |
|
| 201 |
+ done |
|
| 202 |
+ |
|
| 203 |
+ udevadm trigger --subsystem-match=thunderbolt --action=change || true |
|
| 204 |
+ udevadm trigger --subsystem-match=net --action=add || true |
|
| 205 |
+} |
|
| 206 |
+ |
|
| 207 |
+run_nhi_rescan() {
|
|
| 208 |
+ local epoch="$1" |
|
| 209 |
+ local dev="" |
|
| 210 |
+ local cls="" |
|
| 211 |
+ local drv="" |
|
| 212 |
+ local nhi_pci="" |
|
| 213 |
+ |
|
| 214 |
+ for dev in /sys/bus/pci/devices/*; do |
|
| 215 |
+ [ -e "$dev/class" ] || continue |
|
| 216 |
+ [ -e "$dev/driver" ] || continue |
|
| 217 |
+ [ -w "$dev/remove" ] || continue |
|
| 218 |
+ cls="$(cat "$dev/class" 2>/dev/null || true)" |
|
| 219 |
+ drv="$(basename "$(readlink -f "$dev/driver" 2>/dev/null || true)")" |
|
| 220 |
+ if [ "$cls" = "0x088000" ] && [ "$drv" = "thunderbolt" ]; then |
|
| 221 |
+ nhi_pci="$dev" |
|
| 222 |
+ break |
|
| 223 |
+ fi |
|
| 224 |
+ done |
|
| 225 |
+ |
|
| 226 |
+ if [ -n "$nhi_pci" ]; then |
|
| 227 |
+ echo 1 > "$nhi_pci/remove" || true |
|
| 228 |
+ sleep 1 |
|
| 229 |
+ echo 1 > /sys/bus/pci/rescan || true |
|
| 230 |
+ printf '%s\n' "$epoch" > "$LAST_NHI_RESCAN_FILE" |
|
| 231 |
+ return 0 |
|
| 232 |
+ fi |
|
| 233 |
+ |
|
| 234 |
+ return 1 |
|
| 235 |
+} |
|
| 236 |
+ |
|
| 237 |
+# Keep the bridge present and up before trying to enslave ports. |
|
| 238 |
+ip link show "$BRIDGE" >/dev/null 2>&1 || ip link add name "$BRIDGE" type bridge || true |
|
| 239 |
+ip link set "$BRIDGE" mtu "$MTU" || true |
|
| 240 |
+ip link set "$BRIDGE" up || true |
|
| 241 |
+ |
|
| 242 |
+for path in /sys/class/net/thunderbolt*; do |
|
| 243 |
+ [ -e "$path" ] || continue |
|
| 244 |
+ IFACE="${path##*/}"
|
|
| 245 |
+ FOUND_TB_IFACE=1 |
|
| 246 |
+ ip link set "$IFACE" up || true |
|
| 247 |
+ ip link set "$IFACE" mtu "$MTU" || true |
|
| 248 |
+ ip link set "$IFACE" master "$BRIDGE" || true |
|
| 249 |
+ systemctl start "tb-enlist@${IFACE}.service" || true
|
|
| 250 |
+done |
|
| 251 |
+ |
|
| 252 |
+# If no thunderbolt netdev exists but a TB domain exists, force a rescan + udev retrigger. |
|
| 253 |
+if [ "$FOUND_TB_IFACE" -eq 0 ] && [ -d /sys/bus/thunderbolt/devices ]; then |
|
| 254 |
+ trigger_tb_rescan |
|
| 255 |
+ |
|
| 256 |
+ # Escalate with cooldown: try PCI NHI remove+rescan to emulate a soft replug. |
|
| 257 |
+ sleep 2 |
|
| 258 |
+ if ! has_tb_netdev; then |
|
| 259 |
+ now="$(date +%s)" |
|
| 260 |
+ last="0" |
|
| 261 |
+ if [ -f "$LAST_BOLT_RESTART_FILE" ]; then |
|
| 262 |
+ last="$(cat "$LAST_BOLT_RESTART_FILE" 2>/dev/null || echo 0)" |
|
| 263 |
+ fi |
|
| 264 |
+ |
|
| 265 |
+ case "$last" in |
|
| 266 |
+ ''|*[!0-9]*) |
|
| 267 |
+ last=0 |
|
| 268 |
+ ;; |
|
| 269 |
+ esac |
|
| 270 |
+ |
|
| 271 |
+ nhi_last="0" |
|
| 272 |
+ if [ -f "$LAST_NHI_RESCAN_FILE" ]; then |
|
| 273 |
+ nhi_last="$(cat "$LAST_NHI_RESCAN_FILE" 2>/dev/null || echo 0)" |
|
| 274 |
+ fi |
|
| 275 |
+ case "$nhi_last" in |
|
| 276 |
+ ''|*[!0-9]*) |
|
| 277 |
+ nhi_last=0 |
|
| 278 |
+ ;; |
|
| 279 |
+ esac |
|
| 280 |
+ |
|
| 281 |
+ if [ $((now - nhi_last)) -ge "$NHI_RESCAN_COOLDOWN_SEC" ]; then |
|
| 282 |
+ if run_nhi_rescan "$now"; then |
|
| 283 |
+ sleep "$NHI_SETTLE_SEC" |
|
| 284 |
+ trigger_tb_rescan |
|
| 285 |
+ |
|
| 286 |
+ # On newer kernels the first NHI reset can stop at the peer xdomain host |
|
| 287 |
+ # node without recreating the matching *.0 network service. |
|
| 288 |
+ if ! has_tb_netdev && has_stale_tb_xdomain; then |
|
| 289 |
+ retry_now="$(date +%s)" |
|
| 290 |
+ if run_nhi_rescan "$retry_now"; then |
|
| 291 |
+ sleep "$NHI_SETTLE_SEC" |
|
| 292 |
+ trigger_tb_rescan |
|
| 293 |
+ fi |
|
| 294 |
+ fi |
|
| 295 |
+ fi |
|
| 296 |
+ fi |
|
| 297 |
+ |
|
| 298 |
+ # Secondary fallback with cooldown: restart boltd if interface is still missing |
|
| 299 |
+ # and the host actually uses that service. |
|
| 300 |
+ if ! has_tb_netdev; then |
|
| 301 |
+ if [ $((now - last)) -ge "$BOLT_RESTART_COOLDOWN_SEC" ]; then |
|
| 302 |
+ if systemctl list-unit-files bolt.service >/dev/null 2>&1; then |
|
| 303 |
+ systemctl restart bolt.service || true |
|
| 304 |
+ printf '%s\n' "$now" > "$LAST_BOLT_RESTART_FILE" |
|
| 305 |
+ fi |
|
| 306 |
+ fi |
|
| 307 |
+ fi |
|
| 308 |
+ |
|
| 309 |
+ trigger_tb_rescan |
|
| 310 |
+ fi |
|
| 311 |
+fi |
|
| 312 |
+ |
|
| 313 |
+for path in /sys/class/net/thunderbolt*; do |
|
| 314 |
+ [ -e "$path" ] || continue |
|
| 315 |
+ assess_peer_health "${path##*/}"
|
|
| 316 |
+done |
|
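The escalation block in `tb-recover.sh` reads two timestamp files (`LAST_BOLT_RESTART_FILE` and `LAST_NHI_RESCAN_FILE`) using the same sequence: read, sanitize, compare against a cooldown. A minimal sketch of that pattern as a reusable helper is shown below. The names `read_epoch_file` and `cooldown_elapsed` are hypothetical and do not appear in the script; this illustrates the pattern and is not the author's implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of tb-recover.sh): cooldown gating
# based on an epoch timestamp stored in a state file.

# read_epoch_file FILE
# Prints the stored epoch, or 0 when the file is missing, empty,
# or contains anything other than digits.
read_epoch_file() {
  local file="$1" value="0"
  if [ -f "$file" ]; then
    value="$(cat "$file" 2>/dev/null || echo 0)"
  fi
  case "$value" in
    ''|*[!0-9]*) value=0 ;;
  esac
  printf '%s\n' "$value"
}

# cooldown_elapsed FILE COOLDOWN_SEC NOW
# Exit status 0 when at least COOLDOWN_SEC seconds separate NOW
# from the stored timestamp, non-zero otherwise.
cooldown_elapsed() {
  local file="$1" cooldown="$2" now="$3" last
  last="$(read_epoch_file "$file")"
  [ $((now - last)) -ge "$cooldown" ]
}
```

With such a helper, each gate could be written as `if cooldown_elapsed "$LAST_NHI_RESCAN_FILE" "$NHI_RESCAN_COOLDOWN_SEC" "$now"; then ...`, which keeps the sanitizing `case` in one place.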
@@ -0,0 +1,18 @@ |
||
| 1 |
+# /etc/systemd/system/tb-bridge.service |
|
| 2 |
+[Unit] |
|
| 3 |
+Description=Ensure thunderbridge exists early |
|
| 4 |
+DefaultDependencies=no |
|
| 5 |
+After=network-pre.target |
|
| 6 |
+Before=network.target |
|
| 7 |
+ |
|
| 8 |
+[Service] |
|
| 9 |
+Type=oneshot |
|
| 10 |
+RemainAfterExit=yes |
|
| 11 |
+# Create only if it doesn't exist |
|
| 12 |
+ExecStart=/bin/sh -c '/sbin/ip link show thunderbridge >/dev/null 2>&1 || /sbin/ip link add thunderbridge type bridge' |
|
| 13 |
+# Set params every time (harmless if already set) |
|
| 14 |
+ExecStart=/sbin/ip link set thunderbridge mtu 65520 |
|
| 15 |
+ExecStart=/sbin/ip link set thunderbridge up |
|
| 16 |
+ |
|
| 17 |
+[Install] |
|
| 18 |
+WantedBy=multi-user.target |
|
@@ -0,0 +1,24 @@ |
||
| 1 |
+# /etc/systemd/system/tb-enlist@.service |
|
| 2 |
+[Unit] |
|
| 3 |
+Description=Attach %I to thunderbridge with MTU |
|
| 4 |
+# Start only when the device exists |
|
| 5 |
+BindsTo=sys-subsystem-net-devices-%i.device |
|
| 6 |
+After=sys-subsystem-net-devices-%i.device tb-bridge.service |
|
| 7 |
+Requires=tb-bridge.service |
|
| 8 |
+# Keep the thunderbolt ports in the bridge until shutdown actually reaches |
|
| 9 |
+# network teardown; otherwise NFS on 192.168.10.x loses its transport |
|
| 10 |
+# before unmount and hangs until it times out. |
|
| 11 |
+Before=network.target |
|
| 12 |
+ |
|
| 13 |
+[Service] |
|
| 14 |
+Type=oneshot |
|
| 15 |
+RemainAfterExit=yes |
|
| 16 |
+# Set MTU on the iface and the bridge, then enslave the iface to the bridge |
|
| 17 |
+ExecStart=/sbin/ip link set %i up |
|
| 18 |
+ExecStart=/sbin/ip link set %i mtu 65520 |
|
| 19 |
+ExecStart=/sbin/ip link set thunderbridge mtu 65520 |
|
| 20 |
+ExecStart=/sbin/ip link set %i master thunderbridge |
|
| 21 |
+ |
|
| 22 |
+# On stop (device removal), detach cleanly |
|
| 23 |
+ExecStop=-/sbin/ip link set %i nomaster |
|
| 24 |
+ExecStop=-/sbin/ip link set %i down |
|
@@ -0,0 +1,8 @@ |
||
| 1 |
+[Unit] |
|
| 2 |
+Description=Recover Thunderbolt net interfaces into thunderbridge |
|
| 3 |
+After=tb-bridge.service bolt.service |
|
| 4 |
+Wants=tb-bridge.service |
|
| 5 |
+ |
|
| 6 |
+[Service] |
|
| 7 |
+Type=oneshot |
|
| 8 |
+ExecStart=/usr/local/sbin/tb-recover.sh |
|
@@ -0,0 +1,11 @@ |
||
| 1 |
+[Unit] |
|
| 2 |
+Description=Periodic Thunderbolt recovery probe |
|
| 3 |
+ |
|
| 4 |
+[Timer] |
|
| 5 |
+OnBootSec=30s |
|
| 6 |
+OnUnitActiveSec=30s |
|
| 7 |
+AccuracySec=5s |
|
| 8 |
+Unit=tb-recover.service |
|
| 9 |
+ |
|
| 10 |
+[Install] |
|
| 11 |
+WantedBy=timers.target |
|
@@ -0,0 +1,4 @@ |
||
| 1 |
+# /etc/udev/rules.d/90-thunderbolt-net-systemd.rules |
|
| 2 |
+ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \ |
|
| 3 |
+ RUN+="/sbin/ip link set %k mtu 65520", \ |
|
| 4 |
+ TAG+="systemd", ENV{SYSTEMD_WANTS}="tb-enlist@%k.service"
|
|
@@ -0,0 +1,129 @@ |
||
| 1 |
+#!/usr/bin/env bash |
|
| 2 |
+# deploy_tb.sh — Thunderbolt bridge deploy (Bash 3 compatible) |
|
| 3 |
+ |
|
| 4 |
+set -eo pipefail |
|
| 5 |
+ |
|
| 6 |
+# ---------- EDIT THESE ---------- |
|
| 7 |
+get_mgmt_ip() {
|
|
| 8 |
+ case "$1" in |
|
| 9 |
+ baobab) echo "192.168.2.91" ;; |
|
| 10 |
+ ebony) echo "192.168.2.92" ;; |
|
| 11 |
+ tapia) echo "192.168.2.93" ;; |
|
| 12 |
+ *) echo "" ;; |
|
| 13 |
+ esac |
|
| 14 |
+} |
|
| 15 |
+get_tb_ip() {
|
|
| 16 |
+ case "$1" in |
|
| 17 |
+ baobab) echo "192.168.10.91" ;; |
|
| 18 |
+ ebony) echo "192.168.10.92" ;; |
|
| 19 |
+ tapia) echo "192.168.10.93" ;; |
|
| 20 |
+ *) echo "" ;; |
|
| 21 |
+ esac |
|
| 22 |
+} |
|
| 23 |
+# -------------------------------- |
|
| 24 |
+ |
|
| 25 |
+TARGETS=("$@")
|
|
| 26 |
+if [ ${#TARGETS[@]} -eq 0 ]; then
|
|
| 27 |
+ TARGETS=(baobab ebony tapia) |
|
| 28 |
+fi |
|
| 29 |
+ |
|
| 30 |
+SSH_USER="root" |
|
| 31 |
+SSH_OPTS="-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" |
|
| 32 |
+BASE_DIR="$(pwd)" |
|
| 33 |
+ |
|
| 34 |
+COMMON_UDEV="$BASE_DIR/common/udev/rules.d/90-thunderbolt-net-systemd.rules" |
|
| 35 |
+COMMON_SVC1="$BASE_DIR/common/systemd/system/tb-enlist@.service" |
|
| 36 |
+COMMON_SVC2="$BASE_DIR/common/systemd/system/tb-bridge.service" |
|
| 37 |
+COMMON_SVC3="$BASE_DIR/common/systemd/system/tb-recover.service" |
|
| 38 |
+COMMON_TMR1="$BASE_DIR/common/systemd/system/tb-recover.timer" |
|
| 39 |
+COMMON_BIN1="$BASE_DIR/common/sbin/tb-recover.sh" |
|
| 40 |
+ |
|
| 41 |
+require() {
|
|
| 42 |
+ for f in "$@"; do |
|
| 43 |
+ [ -f "$f" ] || { echo "Missing required file: $f" >&2; exit 1; }
|
|
| 44 |
+ done |
|
| 45 |
+} |
|
| 46 |
+ |
|
| 47 |
+# try mgmt IP first, then TB IP; print chosen IP and return 0 if SSH works |
|
| 48 |
+pick_ip() {
|
|
| 49 |
+ local host="$1" ip="" |
|
| 50 |
+ ip="$(get_mgmt_ip "$host")" |
|
| 51 |
+ if [ -n "$ip" ] && ssh $SSH_OPTS -q "${SSH_USER}@${ip}" true 2>/dev/null; then
|
|
| 52 |
+ echo "$ip"; return 0 |
|
| 53 |
+ fi |
|
| 54 |
+ ip="$(get_tb_ip "$host")" |
|
| 55 |
+ if [ -n "$ip" ] && ssh $SSH_OPTS -q "${SSH_USER}@${ip}" true 2>/dev/null; then
|
|
| 56 |
+ echo "$ip"; return 0 |
|
| 57 |
+ fi |
|
| 58 |
+ # fall back to mgmt for error messaging |
|
| 59 |
+ ip="$(get_mgmt_ip "$host")" |
|
| 60 |
+ [ -n "$ip" ] && echo "$ip" |
|
| 61 |
+ return 1 |
|
| 62 |
+} |
|
| 63 |
+ |
|
| 64 |
+deploy_node() {
|
|
| 65 |
+ local host="$1" |
|
| 66 |
+ local node_dir="$BASE_DIR/$host" |
|
| 67 |
+ [ -d "$node_dir" ] || { echo "No node directory: $node_dir" >&2; exit 1; }
|
|
| 68 |
+ |
|
| 69 |
+ local ip |
|
| 70 |
+ ip="$(pick_ip "$host")" || {
|
|
| 71 |
+ echo "!! [$host] SSH not reachable on $(get_mgmt_ip "$host") or $(get_tb_ip "$host"). Fix IPs or firewall." >&2 |
|
| 72 |
+ exit 1 |
|
| 73 |
+ } |
|
| 74 |
+ |
|
| 75 |
+ echo "==> [$host@$ip] prepare remote dirs" |
|
| 76 |
+ ssh $SSH_OPTS "${SSH_USER}@${ip}" "mkdir -p /etc/udev/rules.d /etc/systemd/system /etc/network/interfaces.d /usr/local/sbin"
|
|
| 77 |
+ |
|
| 78 |
+ echo "==> [$host@$ip] copy COMMON files" |
|
| 79 |
+ scp -q "$COMMON_UDEV" "${SSH_USER}@${ip}:/etc/udev/rules.d/90-thunderbolt-net-systemd.rules"
|
|
| 80 |
+ scp -q "$COMMON_SVC1" "${SSH_USER}@${ip}:/etc/systemd/system/tb-enlist@.service"
|
|
| 81 |
+ scp -q "$COMMON_SVC2" "${SSH_USER}@${ip}:/etc/systemd/system/tb-bridge.service"
|
|
| 82 |
+ scp -q "$COMMON_SVC3" "${SSH_USER}@${ip}:/etc/systemd/system/tb-recover.service"
|
|
| 83 |
+ scp -q "$COMMON_TMR1" "${SSH_USER}@${ip}:/etc/systemd/system/tb-recover.timer"
|
|
| 84 |
+ scp -q "$COMMON_BIN1" "${SSH_USER}@${ip}:/usr/local/sbin/tb-recover.sh"
|
|
| 85 |
+ |
|
| 86 |
+ echo "==> [$host@$ip] copy NODE config" |
|
| 87 |
+ require "$node_dir/etc/network/interfaces" "$node_dir/etc/network/interfaces.d/10-thunderbolt" |
|
| 88 |
+ scp -q "$node_dir/etc/network/interfaces" "${SSH_USER}@${ip}:/etc/network/interfaces"
|
|
| 89 |
+ scp -q "$node_dir/etc/network/interfaces.d/10-thunderbolt" "${SSH_USER}@${ip}:/etc/network/interfaces.d/10-thunderbolt"
|
|
| 90 |
+ |
|
| 91 |
+ echo "==> [$host@$ip] enable + reload" |
|
| 92 |
+ ssh $SSH_OPTS "${SSH_USER}@${ip}" bash -s <<'EOF'
|
|
| 93 |
+set -e |
|
| 94 |
+chmod 0644 /etc/udev/rules.d/90-thunderbolt-net-systemd.rules |
|
| 95 |
+chmod 0644 /etc/systemd/system/tb-enlist@.service |
|
| 96 |
+chmod 0644 /etc/systemd/system/tb-bridge.service |
|
| 97 |
+chmod 0644 /etc/systemd/system/tb-recover.service |
|
| 98 |
+chmod 0644 /etc/systemd/system/tb-recover.timer |
|
| 99 |
+chmod 0755 /usr/local/sbin/tb-recover.sh |
|
| 100 |
+systemctl daemon-reload |
|
| 101 |
+udevadm control --reload |
|
| 102 |
+command -v ifreload >/dev/null 2>&1 && ifreload -a || true |
|
| 103 |
+systemctl enable --now tb-bridge.service |
|
| 104 |
+systemctl enable --now tb-recover.timer |
|
| 105 |
+systemctl start tb-recover.service |
|
| 106 |
+udevadm trigger --subsystem-match=net --action=add |
|
| 107 |
+EOF |
|
| 108 |
+ |
|
| 109 |
+ echo "==> [$host@$ip] status" |
|
| 110 |
+ ssh $SSH_OPTS "${SSH_USER}@${ip}" bash -s <<'EOF'
|
|
| 111 |
+set -e |
|
| 112 |
+systemctl --no-pager --plain --full status tb-bridge.service | sed -n '1,6p' |
|
| 113 |
+systemctl --no-pager --plain --full status tb-recover.timer | sed -n '1,8p' |
|
| 114 |
+systemctl --no-pager --plain --full list-units 'tb-enlist@*.service' | sed -n '1,12p' || true |
|
| 115 |
+ip -d link show thunderbridge | sed -n '1,3p' |
|
| 116 |
+bridge link | grep -E 'thunderbolt|thunderbridge' || true |
|
| 117 |
+EOF |
|
| 118 |
+ |
|
| 119 |
+ echo "==> [$host@$ip] done." |
|
| 120 |
+ echo |
|
| 121 |
+} |
|
| 122 |
+ |
|
| 123 |
+require "$COMMON_UDEV" "$COMMON_SVC1" "$COMMON_SVC2" "$COMMON_SVC3" "$COMMON_TMR1" "$COMMON_BIN1" |
|
| 124 |
+ |
|
| 125 |
+for h in "${TARGETS[@]}"; do
|
|
| 126 |
+ deploy_node "$h" |
|
| 127 |
+done |
|
| 128 |
+ |
|
| 129 |
+echo "All done. Go poke the cables and watch systemd behave." |
|
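In `deploy_node`, the `ssh` calls receive `$SSH_OPTS` while the `scp` calls do not, so the copies run without the `BatchMode`, timeout, and host-key settings. One possible way to keep both in sync is to store the options once in an array (arrays are available in Bash 3) and wrap both commands, as sketched below. The names `SSH_OPT_ARGS`, `run_cmd`, `run_ssh`, and `run_scp` are hypothetical, and the `DRY_RUN` switch is an assumption added only so the sketch can be inspected without a network.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one option list shared by ssh and scp.
SSH_OPT_ARGS=(
  -o BatchMode=yes
  -o ConnectTimeout=5
  -o StrictHostKeyChecking=no
  -o UserKnownHostsFile=/dev/null
)

# run_cmd CMD [ARGS...]
# Executes the command, or prints it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Both wrappers pass the same -o options; scp accepts -o like ssh does.
run_ssh() { run_cmd ssh "${SSH_OPT_ARGS[@]}" "$@"; }
run_scp() { run_cmd scp -q "${SSH_OPT_ARGS[@]}" "$@"; }
```

A copy step would then read `run_scp "$COMMON_UDEV" "${SSH_USER}@${ip}:/etc/udev/rules.d/90-thunderbolt-net-systemd.rules"`, and a first-time connection to a node should not stop at a host-key prompt.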
@@ -0,0 +1,41 @@ |
||
| 1 |
+# network interface settings; autogenerated |
|
| 2 |
+# Please do NOT modify this file directly, unless you know what |
|
| 3 |
+# you're doing. |
|
| 4 |
+# |
|
| 5 |
+# If you want to manage parts of the network configuration manually, |
|
| 6 |
+# please utilize the 'source' or 'source-directory' directives to do |
|
| 7 |
+# so. |
|
| 8 |
+# PVE will preserve these directives, but will NOT read its network |
|
| 9 |
+# configuration from sourced files, so do not attempt to move any of |
|
| 10 |
+# the PVE managed interfaces into external files! |
|
| 11 |
+ |
|
| 12 |
+auto lo |
|
| 13 |
+iface lo inet loopback |
|
| 14 |
+ |
|
| 15 |
+auto eno1 |
|
| 16 |
+iface eno1 inet manual |
|
| 17 |
+ |
|
| 18 |
+iface eno1.442 inet manual |
|
| 19 |
+ |
|
| 20 |
+auto vmbr443 |
|
| 21 |
+iface vmbr443 inet static |
|
| 22 |
+ address 192.168.2.92/24 |
|
| 23 |
+ gateway 192.168.2.1 |
|
| 24 |
+ bridge-ports eno1.443 |
|
| 25 |
+ bridge-stp off |
|
| 26 |
+ bridge-fd 0 |
|
| 27 |
+ |
|
| 28 |
+auto vmbr444 |
|
| 29 |
+iface vmbr444 inet static |
|
| 30 |
+ address 192.168.4.92/24 |
|
| 31 |
+ bridge-ports eno1.444 |
|
| 32 |
+ bridge-stp off |
|
| 33 |
+ bridge-fd 0 |
|
| 34 |
+ |
|
| 35 |
+auto vmbr442 |
|
| 36 |
+iface vmbr442 inet manual |
|
| 37 |
+ bridge-ports eno1.442 |
|
| 38 |
+ bridge-stp off |
|
| 39 |
+ bridge-fd 0 |
|
| 40 |
+ |
|
| 41 |
+source /etc/network/interfaces.d/* |
|
@@ -0,0 +1,20 @@ |
||
| 1 |
+# Modular network configuration for ebony - Thunderbolt networking |
|
| 2 |
+# ifupdown2-safe: bridge comes up alone; TB ports hotplug in later |
|
| 3 |
+ |
|
| 4 |
+# Thunderbolt NIC appears late — don't 'auto' it |
|
| 5 |
+allow-hotplug thunderbolt0 |
|
| 6 |
+iface thunderbolt0 inet manual |
|
| 7 |
+ pre-up ip link set dev $IFACE mtu 65520 || true |
|
| 8 |
+ post-up ip link set dev $IFACE mtu 65520 || true |
|
| 9 |
+ post-up ip link set dev $IFACE master thunderbridge || true |
|
| 10 |
+ |
|
| 11 |
+# Bridge must exist even with zero members |
|
| 12 |
+auto thunderbridge |
|
| 13 |
+iface thunderbridge inet static |
|
| 14 |
+ address 192.168.10.92/24 |
|
| 15 |
+ bridge-ports none |
|
| 16 |
+ bridge-stp off |
|
| 17 |
+ bridge-fd 0 |
|
| 18 |
+ mtu 65520 |
|
| 19 |
+ pre-up ip link add name $IFACE type bridge 2>/dev/null || true |
|
| 20 |
+ post-up ip link set dev $IFACE up |
|
@@ -0,0 +1,4 @@ |
||
| 1 |
+# /etc/udev/rules.d/90-thunderbolt-net-systemd.rules |
|
| 2 |
+ACTION=="add", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \ |
|
| 3 |
+ RUN+="/sbin/ip link set %k mtu 65520", \ |
|
| 4 |
+ TAG+="systemd", ENV{SYSTEMD_WANTS}="tb-enlist@%k.service"
|
|
@@ -0,0 +1,41 @@ |
||
| 1 |
+# network interface settings; autogenerated |
|
| 2 |
+# Please do NOT modify this file directly, unless you know what |
|
| 3 |
+# you're doing. |
|
| 4 |
+# |
|
| 5 |
+# If you want to manage parts of the network configuration manually, |
|
| 6 |
+# please utilize the 'source' or 'source-directory' directives to do |
|
| 7 |
+# so. |
|
| 8 |
+# PVE will preserve these directives, but will NOT read its network |
|
| 9 |
+# configuration from sourced files, so do not attempt to move any of |
|
| 10 |
+# the PVE managed interfaces into external files! |
|
| 11 |
+ |
|
| 12 |
+auto lo |
|
| 13 |
+iface lo inet loopback |
|
| 14 |
+ |
|
| 15 |
+auto eno1 |
|
| 16 |
+iface eno1 inet manual |
|
| 17 |
+ |
|
| 18 |
+iface eno1.442 inet manual |
|
| 19 |
+ |
|
| 20 |
+auto vmbr443 |
|
| 21 |
+iface vmbr443 inet static |
|
| 22 |
+ address 192.168.2.93/24 |
|
| 23 |
+ gateway 192.168.2.1 |
|
| 24 |
+ bridge-ports eno1.443 |
|
| 25 |
+ bridge-stp off |
|
| 26 |
+ bridge-fd 0 |
|
| 27 |
+ |
|
| 28 |
+auto vmbr444 |
|
| 29 |
+iface vmbr444 inet static |
|
| 30 |
+ address 192.168.4.93/24 |
|
| 31 |
+ bridge-ports eno1.444 |
|
| 32 |
+ bridge-stp off |
|
| 33 |
+ bridge-fd 0 |
|
| 34 |
+ |
|
| 35 |
+auto vmbr442 |
|
| 36 |
+iface vmbr442 inet manual |
|
| 37 |
+ bridge-ports eno1.442 |
|
| 38 |
+ bridge-stp off |
|
| 39 |
+ bridge-fd 0 |
|
| 40 |
+ |
|
| 41 |
+source /etc/network/interfaces.d/* |
|
@@ -0,0 +1,20 @@ |
||
| 1 |
+# Modular network configuration for tapia - Thunderbolt networking |
|
| 2 |
+# ifupdown2-safe: bridge comes up alone; TB ports hotplug in later |
|
| 3 |
+ |
|
| 4 |
+# Thunderbolt NIC appears late — don't 'auto' it |
|
| 5 |
+allow-hotplug thunderbolt0 |
|
| 6 |
+iface thunderbolt0 inet manual |
|
| 7 |
+ pre-up ip link set dev $IFACE mtu 65520 || true |
|
| 8 |
+ post-up ip link set dev $IFACE mtu 65520 || true |
|
| 9 |
+ post-up ip link set dev $IFACE master thunderbridge || true |
|
| 10 |
+ |
|
| 11 |
+# Bridge must exist even with zero members |
|
| 12 |
+auto thunderbridge |
|
| 13 |
+iface thunderbridge inet static |
|
| 14 |
+ address 192.168.10.93/24 |
|
| 15 |
+ bridge-ports none |
|
| 16 |
+ bridge-stp off |
|
| 17 |
+ bridge-fd 0 |
|
| 18 |
+ mtu 65520 |
|
| 19 |
+ pre-up ip link add name $IFACE type bridge 2>/dev/null || true |
|
| 20 |
+ post-up ip link set dev $IFACE up |
|
@@ -0,0 +1,325 @@ |
||
| 1 |
+# Issue ISSUE-2025-001: Thunderbolt interfaces MTU resets to 1500 after networking restart |
|
| 2 |
+ |
|
| 3 |
+**Status:** closed |
|
| 4 |
+**Priority:** high |
|
| 5 |
+**Created:** 2025-10-30 |
|
| 6 |
+**Updated:** 2025-10-30 |
|
| 7 |
+**Assigned to:** unassigned |
|
| 8 |
+**Resolution:** Fixed with hybrid approach (udev rule + post-up hook) |
|
| 9 |
+ |
|
| 10 |
+--- |
|
| 11 |
+ |
|
| 12 |
+## Summary |
|
| 13 |
+ |
|
| 14 |
+`systemctl restart networking` causes thunderbolt interfaces to reset MTU from 65520 to default 1500. |
|
| 15 |
+ |
|
| 16 |
+--- |
|
| 17 |
+ |
|
| 18 |
+## Description |
|
| 19 |
+ |
|
| 20 |
+After executing `systemctl restart networking` on cluster nodes, the thunderbolt interfaces (thunderbolt0, thunderbolt1) lose their configured MTU of 65520 and revert to the default 1500. This also sometimes occurs after system reboot, though the behavior is not 100% reproducible on reboot. |
|
| 21 |
+ |
|
| 22 |
+The MTU configuration is critical for thunderbolt bridge performance and should persist across networking restarts. |
|
| 23 |
+ |
|
| 24 |
+--- |
|
| 25 |
+ |
|
| 26 |
+## Environment |
|
| 27 |
+ |
|
| 28 |
+- **Affected nodes:** all (baobab, ebony, tapia) |
|
| 29 |
+- **Component:** network |
|
| 30 |
+- **Version/software:** Proxmox VE 8.x, ifupdown2, thunderbolt networking |
|
| 31 |
+ |
|
| 32 |
+--- |
|
| 33 |
+ |
|
| 34 |
+## Steps to Reproduce |
|
| 35 |
+ |
|
| 36 |
+1. Verify current thunderbolt interface MTU: `ip link show thunderbolt0` |
|
| 37 |
+2. Observe MTU is set to 65520 |
|
| 38 |
+3. Execute: `systemctl restart networking` |
|
| 39 |
+4. Check MTU again: `ip link show thunderbolt0` |
|
| 40 |
+5. MTU has reverted to 1500 |
|
| 41 |
+ |
|
| 42 |
+**Reboot scenario (intermittent):** |
|
| 43 |
+1. Reboot node |
|
| 44 |
+2. After boot, check thunderbolt interface MTU |
|
| 45 |
+3. Sometimes MTU is 1500 instead of expected 65520 |
|
| 46 |
+ |
|
| 47 |
+--- |
|
| 48 |
+ |
|
| 49 |
+## Expected Behavior |
|
| 50 |
+ |
|
| 51 |
+Thunderbolt interfaces should maintain MTU 65520 after: |
|
| 52 |
+- `systemctl restart networking` |
|
| 53 |
+- System reboot |
|
| 54 |
+ |
|
| 55 |
+--- |
|
| 56 |
+ |
|
| 57 |
+## Actual Behavior |
|
| 58 |
+ |
|
| 59 |
+MTU resets to 1500 (default) after networking restart. Reboot behavior is inconsistent but sometimes exhibits the same issue. |
|
| 60 |
+ |
|
| 61 |
+--- |
|
| 62 |
+ |
|
| 63 |
+## Logs/Evidence |
|
| 64 |
+ |
|
| 65 |
+```bash |
|
| 66 |
+# Before restart |
|
| 67 |
+ip link show thunderbolt0 |
|
| 68 |
+# ... mtu 65520 ... |
|
| 69 |
+ |
|
| 70 |
+# After systemctl restart networking |
|
| 71 |
+ip link show thunderbolt0 |
|
| 72 |
+# ... mtu 1500 ... |
|
| 73 |
+``` |
|
| 74 |
+ |
|
| 75 |
+--- |
|
| 76 |
+ |
|
| 77 |
+## Investigation Notes |
|
| 78 |
+ |
|
| 79 |
+- [2025-10-30] Issue reported. Configuration files in `/etc/network/interfaces.d/10-thunderbolt` contain `pre-up ip link set dev $IFACE mtu 65520 || true` but this may not be executed consistently during networking restart. |
|
| 80 |
+- [2025-10-30] The `allow-hotplug` directive for thunderbolt interfaces may cause race conditions where the interface is brought up before the pre-up script runs. |
|
| 81 |
+- [2025-10-30] Reboot inconsistency suggests timing or udev rule interaction issues. |
|
| 82 |
+ |
|
| 83 |
+### Deep Investigation (2025-10-30) |
|
| 84 |
+ |
|
| 85 |
+**Current Configuration Analysis:** |
|
| 86 |
+ |
|
| 87 |
+1. **Interface Configuration** (`/etc/network/interfaces.d/10-thunderbolt`): |
|
| 88 |
+ - Uses `allow-hotplug` for thunderbolt0 and thunderbolt1 |
|
| 89 |
+ - Has `pre-up ip link set dev $IFACE mtu 65520 || true` in iface stanza |
|
| 90 |
+ - Bridge has `mtu 65520` in its static configuration |
|
| 91 |
+ |
|
| 92 |
+2. **Systemd Services**: |
|
| 93 |
+ - `tb-bridge.service`: Creates bridge early, sets MTU 65520 |
|
| 94 |
+ - `tb-enlist@.service`: Triggered by udev on thunderbolt interface add, sets MTU and enslaves to bridge |
|
| 95 |
+ - Services have proper ordering: `After=sys-subsystem-net-devices-%i.device tb-bridge.service` |
|
| 96 |
+ |
|
| 97 |
+3. **Udev Rule** (`/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`): |
|
| 98 |
+ - Triggers `tb-enlist@.service` when thunderbolt interfaces appear |
|
| 99 |
+ - Does NOT directly set MTU via udev |
|
| 100 |
+ |
|
| 101 |
+**Root Cause Analysis:** |
|
| 102 |
+ |
|
| 103 |
+The problem occurs during `systemctl restart networking` because: |
|
| 104 |
+ |
|
| 105 |
+1. **ifupdown2 behavior**: When restarting networking, ifupdown2: |
|
| 106 |
+ - Takes DOWN all `allow-hotplug` interfaces |
|
| 107 |
+ - Brings them back UP based on configuration |
|
| 108 |
+ - During this process, `pre-up` scripts execute BEFORE the interface is brought up |
|
| 109 |
+ |
|
| 110 |
+2. **Timing Issue**: The sequence is: |
|
| 111 |
+ ``` |
|
| 112 |
+ networking.service restart |
|
| 113 |
+ → ifdown thunderbolt0 (MTU reset to default 1500 by kernel) |
|
| 114 |
+ → pre-up script runs (sets MTU 65520) |
|
| 115 |
+ → ifup brings interface up |
|
| 116 |
+ → RACE: systemd tb-enlist@.service might not re-trigger OR might run before ifupdown finishes |
|
| 117 |
+ ``` |
|
| 118 |
+ |
|
| 119 |
+3. **Why systemd services don't help during networking restart**: |
|
| 120 |
+ - `tb-enlist@.service` is triggered by udev on device ADD event |
|
| 121 |
+ - During `networking restart`, the device is not removed/added, just brought down/up |
|
| 122 |
+ - Therefore, systemd service does NOT re-execute |
|
| 123 |
+ - The MTU setting relies ONLY on the `pre-up` script in interfaces configuration |
|
| 124 |
+ |
|
| 125 |
+4. **Why it sometimes fails on reboot**: |
|
| 126 |
+ - Race condition between: |
|
| 127 |
+ - ifupdown bringing up the interface (with pre-up MTU setting) |
|
| 128 |
+ - systemd tb-enlist@ service being triggered by udev |
|
| 129 |
+ - If systemd service wins the race and enslaves interface before ifupdown sets MTU, the MTU might not stick |
|
| 130 |
+ |
|
| 131 |
+**Key Finding**: The `pre-up` script in `/etc/network/interfaces.d/10-thunderbolt` SHOULD work, but there's likely a timing issue or the script is not being executed properly during networking restart with ifupdown2. |
|
| 132 |
+ |
|
| 133 |
+--- |
|
| 134 |
+ |
|
| 135 |
+## Proposed Solutions |
|
| 136 |
+ |
|
| 137 |
+### Solution 1: Add MTU setting to udev rule (RECOMMENDED) |
|
| 138 |
+ |
|
| 139 |
+Add MTU setting directly in the udev rule that triggers when thunderbolt interfaces appear. This ensures MTU is set immediately when the interface is created, before any other service touches it. |
|
| 140 |
+ |
|
| 141 |
+**Implementation:** |
|
| 142 |
+ |
|
| 143 |
+Modify `/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`: |
|
| 144 |
+ |
|
| 145 |
+```bash |
|
| 146 |
+# /etc/udev/rules.d/90-thunderbolt-net-systemd.rules |
|
| 147 |
+ACTION=="add", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \ |
|
| 148 |
+ RUN+="/sbin/ip link set %k mtu 65520", \ |
|
| 149 |
+ TAG+="systemd", ENV{SYSTEMD_WANTS}="tb-enlist@%k.service"
|
|
| 150 |
+``` |
|
| 151 |
+ |
|
| 152 |
+**Pros:** |
|
| 153 |
+- Runs immediately on device add, before any other service |
|
| 154 |
+- Independent of ifupdown2 behavior |
|
| 155 |
+- Handles both boot and hotplug scenarios |
|
| 156 |
+- Simple, one-line change |
|
| 157 |
+ |
|
| 158 |
+**Cons:** |
|
| 159 |
+- Must be deployed to all nodes |
|
| 160 |
+ |
|
| 161 |
+### Solution 2: Add post-up hook in interfaces configuration |
|
| 162 |
+ |
|
| 163 |
+Add a `post-up` hook in addition to `pre-up` to ensure MTU is set after the interface is fully up. |
|
| 164 |
+ |
|
| 165 |
+**Implementation:** |
|
| 166 |
+ |
|
| 167 |
+Modify `/etc/network/interfaces.d/10-thunderbolt`: |
|
| 168 |
+ |
|
| 169 |
+```bash |
|
| 170 |
+allow-hotplug thunderbolt0 |
|
| 171 |
+iface thunderbolt0 inet manual |
|
| 172 |
+ pre-up ip link set dev $IFACE mtu 65520 || true |
|
| 173 |
+ post-up ip link set dev $IFACE mtu 65520 || true |
|
| 174 |
+``` |
|
| 175 |
+ |
|
| 176 |
+**Pros:** |
|
| 177 |
+- Uses existing ifupdown2 mechanisms |
|
| 178 |
+- MTU set twice (pre and post) increases reliability |
|
| 179 |
+- No new files needed |
|
| 180 |
+ |
|
| 181 |
+**Cons:** |
|
| 182 |
+- Still relies on ifupdown2 executing hooks correctly |
|
| 183 |
+- May not fix the race condition completely |
|
| 184 |
+ |
|
| 185 |
+### Solution 3: Modify tb-enlist@ service to always set MTU |
|
| 186 |
+ |
|
| 187 |
+Make the systemd service idempotent and ensure it sets MTU even if the device was already up. |
|
| 188 |
+ |
|
| 189 |
+**Implementation:** |
|
| 190 |
+ |
|
| 191 |
+Modify `/etc/systemd/system/tb-enlist@.service`: |
|
| 192 |
+ |
|
| 193 |
+```ini |
|
| 194 |
+[Unit] |
|
| 195 |
+Description=Attach %I to thunderbridge with MTU |
|
| 196 |
+BindsTo=sys-subsystem-net-devices-%i.device |
|
| 197 |
+After=sys-subsystem-net-devices-%i.device tb-bridge.service network.target |
|
| 198 |
+Requires=tb-bridge.service |
|
| 199 |
+ |
|
| 200 |
+[Service] |
|
| 201 |
+Type=oneshot |
|
| 202 |
+RemainAfterExit=yes |
|
| 203 |
+# Always set MTU first, regardless of current state |
|
| 204 |
+ExecStartPre=-/sbin/ip link set %i mtu 65520 |
|
| 205 |
+ExecStart=/sbin/ip link set %i up |
|
| 206 |
+ExecStart=/sbin/ip link set %i mtu 65520 |
|
| 207 |
+ExecStart=/sbin/ip link set thunderbridge mtu 65520 |
|
| 208 |
+ExecStart=/sbin/ip link set %i master thunderbridge |
|
| 209 |
+ |
|
| 210 |
+ExecStop=-/sbin/ip link set %i nomaster |
|
| 211 |
+ExecStop=-/sbin/ip link set %i down |
|
| 212 |
+ |
|
| 213 |
+# Note: Also= only names extra units to enable/disable together with this one; it does not re-run the service on networking.service restart |
|
| 214 |
+[Install] |
|
| 215 |
+Also=network.target |
|
| 216 |
+``` |
|
| 217 |
+ |
|
| 218 |
+**Pros:** |
|
| 219 |
+- Comprehensive, handles multiple scenarios |
|
| 220 |
+- Can be triggered manually if needed |
|
| 221 |
+ |
|
| 222 |
+**Cons:** |
|
| 223 |
+- More complex |
|
| 224 |
+- Still might not trigger on `networking.service` restart without additional changes |
|
| 225 |
+ |
|
| 226 |
+### Solution 4: Hybrid approach (MOST ROBUST) |
|
| 227 |
+ |
|
| 228 |
+Combine Solution 1 (udev) with Solution 2 (post-up hook). |
|
| 229 |
+ |
|
| 230 |
+**Implementation:** |
|
| 231 |
+ |
|
| 232 |
+1. Add MTU to udev rule (Solution 1) |
|
| 233 |
+2. Keep pre-up and add post-up in the interfaces.d config (Solution 2) |
|
| 234 |
+3. Ensure bridge always has MTU set in its configuration |
|
| 235 |
+ |
|
| 236 |
+This creates multiple layers of MTU enforcement: |
|
| 237 |
+- Udev sets it immediately on device appearance |
|
| 238 |
+- pre-up sets it before ifup |
|
| 239 |
+- post-up sets it after interface is fully up |
|
| 240 |
+- systemd service sets it when enslaving to bridge |
|
| 241 |
+ |
|
| 242 |
+**Pros:** |
|
| 243 |
+- Defense in depth |
|
| 244 |
+- Handles all edge cases |
|
| 245 |
+- Most reliable solution |
|
| 246 |
+ |
|
| 247 |
+**Cons:** |
|
| 248 |
+- Slight redundancy (MTU set multiple times) |
|
| 249 |
+ |
|
| 250 |
+--- |
|
| 251 |
+ |
|
| 252 |
+## Recommended Implementation Plan |
|
| 253 |
+ |
|
| 254 |
+**Phase 1: Quick Fix (Solution 1)** |
|
| 255 |
+1. Deploy updated udev rule to all nodes |
|
| 256 |
+2. Reload udev rules: `udevadm control --reload-rules` |
|
| 257 |
+3. Test with `systemctl restart networking` |
|
| 258 |
+4. Verify MTU persists |
|
| 259 |
+ |
|
| 260 |
+**Phase 2: If needed (Solution 4)** |
|
| 261 |
+1. Add post-up hook to interfaces.d/10-thunderbolt |
|
| 262 |
+2. Update tb-enlist@ service with ExecStartPre (from Solution 3) |
|
| 263 |
+3. Deploy and test |
|
| 264 |
+ |
|
| 265 |
+**Testing Protocol:** |
|
| 266 |
+```bash |
|
| 267 |
+# On each node: |
|
| 268 |
+# 1. Check current MTU |
|
| 269 |
+ip link show thunderbolt0 | grep mtu |
|
| 270 |
+ |
|
| 271 |
+# 2. Restart networking |
|
| 272 |
+systemctl restart networking |
|
| 273 |
+ |
|
| 274 |
+# 3. Verify MTU persisted |
|
| 275 |
+ip link show thunderbolt0 | grep mtu |
|
| 276 |
+# Should show: mtu 65520 |
|
| 277 |
+ |
|
| 278 |
+# 4. Test reboot persistence |
|
| 279 |
+reboot |
|
| 280 |
+# After boot: |
|
| 281 |
+ip link show thunderbolt0 | grep mtu |
|
| 282 |
+``` |
|
| 283 |
+ |
|
| 284 |
+--- |
|
| 285 |
+ |
|
| 286 |
+## Related Issues |
|
| 287 |
+ |
|
| 288 |
+None yet. |
|
| 289 |
+ |
|
| 290 |
+--- |
|
| 291 |
+ |
|
| 292 |
+## Changelog References |
|
| 293 |
+ |
|
| 294 |
+None yet. Will be referenced when fix is implemented. |
|
| 295 |
+ |
|
| 296 |
+--- |
|
| 297 |
+ |
|
| 298 |
+## Resolution (2025-10-30) |
|
| 299 |
+ |
|
| 300 |
+**Issue Status: RESOLVED** |
|
| 301 |
+ |
|
| 302 |
+### Root Cause Confirmed |
|
| 303 |
+The MTU reset occurred because `systemctl restart networking` triggers ifupdown2 to bring interfaces down and back up, but the existing `pre-up` hooks in interfaces.d were insufficient. The systemd services (`tb-enlist@.service`) don't re-trigger on networking restart since the device isn't removed/added. |
|
| 304 |
+ |
|
| 305 |
+### Solution Implemented |
|
| 306 |
+Deployed **hybrid approach** combining: |
|
| 307 |
+1. **Enhanced udev rule**: Added MTU setting on device add/change events |
|
| 308 |
+2. **Post-up hook**: Added `post-up` script in interfaces.d to ensure MTU after interface bring-up |
|
| 309 |
+ |
|
| 310 |
+### Changes Made |
|
| 311 |
+- **Udev rule** (`/etc/udev/rules.d/90-thunderbolt-net-systemd.rules`): Added `RUN+="/sbin/ip link set %k mtu 65520"` for immediate MTU setting |
|
| 312 |
+- **Interfaces config** (`/etc/network/interfaces.d/10-thunderbolt`): Added `post-up ip link set dev $IFACE mtu 65520 || true` for all thunderbolt interfaces |
|
| 313 |
+ |
|
| 314 |
+### Testing Results |
|
| 315 |
+- **ebony**: ✅ MTU persists after `systemctl restart networking` |
|
| 316 |
+- **tapia**: ✅ MTU persists after `systemctl restart networking` |
|
| 317 |
+- **baobab**: ✅ Both thunderbolt0 and thunderbolt1 maintain MTU after restart |
|
| 318 |
+ |
|
| 319 |
+### Files Modified |
|
| 320 |
+- `deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules` |
|
| 321 |
+- `deploy/attempt1/ebony/etc/network/interfaces.d/10-thunderbolt` |
|
| 322 |
+- `deploy/attempt1/tapia/etc/network/interfaces.d/10-thunderbolt` |
|
| 323 |
+- `deploy/attempt1/baobab/etc/network/interfaces.d/10-thunderbolt` |
|
| 324 |
+ |
|
| 325 |
+The fix ensures MTU 65520 persists across all scenarios: boot, hotplug, and networking restart. |
|
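The manual MTU checks in the testing protocol can be wrapped in a small helper. The following is a minimal sketch, not part of the deployed fix: `mtu_from_link_line` and `check_mtu` are hypothetical names, and the parsing assumes the one-line-per-interface format printed by `ip -o link show`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that a thunderbolt interface carries the expected MTU.

EXPECTED_MTU=65520

# Extract the MTU from one line of `ip -o link show` output.
# Prints nothing if the line has no "mtu" token.
mtu_from_link_line() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }' <<<"$1"
}

# Return 0 if the given link line reports the expected MTU, 1 otherwise.
check_mtu() {
  local line="$1" mtu
  mtu="$(mtu_from_link_line "$line")"
  [[ "$mtu" == "$EXPECTED_MTU" ]]
}

# Example use on a live node (not run here):
#   ip -o link show | grep thunderbolt | while read -r line; do
#     check_mtu "$line" && echo "ok: $line" || echo "MTU mismatch: $line"
#   done
```

Running the same loop before and after `systemctl restart networking` gives a quick pass/fail signal per interface.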
@@ -0,0 +1,88 @@ |
||
| 1 |
+# Thunderbolt Interfaces Not in Bridge After MTU Fix |
|
| 2 |
+ |
|
| 3 |
+## Issue ID: ISSUE-2025-002 |
|
| 4 |
+ |
|
| 5 |
+**Status:** closed |
|
| 6 |
+**Priority:** high |
|
| 7 |
+**Created:** 2025-10-30 |
|
| 8 |
+**Updated:** 2025-10-30 |
|
| 9 |
+**Assigned to:** unassigned |
|
| 10 |
+ |
|
| 11 |
+--- |
|
| 12 |
+ |
|
| 13 |
+## Summary |
|
| 14 |
+ |
|
| 15 |
+After applying the MTU fix, thunderbolt interfaces are no longer members of the thunderbridge. |
|
| 16 |
+ |
|
| 17 |
+--- |
|
| 18 |
+ |
|
| 19 |
+## Description |
|
| 20 |
+ |
|
| 21 |
+Following the deployment of the MTU persistence fix (post-up hooks in interfaces.d), the thunderbolt interfaces failed to join the thunderbridge after `systemctl restart networking`. This regression broke cluster connectivity via thunderbolt. |
|
| 22 |
+ |
|
| 23 |
+--- |
|
| 24 |
+ |
|
| 25 |
+## Environment |
|
| 26 |
+ |
|
| 27 |
+- **Affected nodes:** baobab, ebony, tapia |
|
| 28 |
+- **Component:** network (thunderbolt bridging) |
|
| 29 |
+- **Version/software:** Proxmox VE 8.x, ifupdown2, systemd services |
|
| 30 |
+ |
|
| 31 |
+--- |
|
| 32 |
+ |
|
| 33 |
+## Steps to Reproduce |
|
| 34 |
+ |
|
| 35 |
+1. Deploy MTU fix with post-up hooks in `/etc/network/interfaces.d/10-thunderbolt`. |
|
| 36 |
+2. Run `systemctl restart networking`. |
|
| 37 |
+3. Run `bridge link show`; the thunderbolt interfaces are missing from thunderbridge. |
|
| 38 |
+ |
|
| 39 |
+--- |
|
| 40 |
+ |
|
| 41 |
+## Expected Behavior |
|
| 42 |
+ |
|
| 43 |
+Thunderbolt interfaces should remain in thunderbridge with MTU 65520 after networking restart. |
|
| 44 |
+ |
|
| 45 |
+--- |
|
| 46 |
+ |
|
| 47 |
+## Actual Behavior |
|
| 48 |
+ |
|
| 49 |
+Interfaces have correct MTU but are not added to the bridge, causing loss of cluster connectivity. |
|
| 50 |
+ |
|
| 51 |
+--- |
|
| 52 |
+ |
|
| 53 |
+## Logs/Evidence |
|
| 54 |
+ |
|
| 55 |
+``` |
|
| 56 |
+# After restart networking |
|
| 57 |
+$ bridge link show |
|
| 58 |
+(no thunderbolt interfaces listed) |
|
| 59 |
+ |
|
| 60 |
+$ ip link show thunderbolt0 |
|
| 61 |
+thunderbolt0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 ... |
|
| 62 |
+``` |
|
| 63 |
+ |
|
| 64 |
+--- |
|
| 65 |
+ |
|
| 66 |
+## Investigation Notes |
|
| 67 |
+ |
|
| 68 |
+- 2025-10-30: Root cause identified - ifupdown2 brings up interfaces but systemd enlist services don't re-trigger on networking restart. Defense-in-depth needed. |
|
| 69 |
+- 2025-10-30: Added `post-up ip link set dev $IFACE master thunderbridge || true` to interfaces.d files. |
|
| 70 |
+ |
|
| 71 |
+--- |
|
| 72 |
+ |
|
| 73 |
+## Proposed Solution |
|
| 74 |
+ |
|
| 75 |
+Add bridge membership to post-up hooks in `/etc/network/interfaces.d/10-thunderbolt` for all nodes. |
|
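Bridge membership can be checked from the same `ip -o link show` output used for the MTU check. The sketch below is illustrative only: `is_bridge_member` is a hypothetical helper, and it assumes that an enslaved interface is reported with a `master <bridge>` token on its line.

```bash
#!/usr/bin/env bash
# Hypothetical check: is an interface enslaved to the expected bridge?

# Return 0 if a line of `ip -o link show` output names the bridge as master.
# usage: is_bridge_member <link_line> <bridge_name>
is_bridge_member() {
  local line="$1" bridge="$2"
  [[ " $line " == *" master $bridge "* ]]
}

# Example use on a live node (not run here):
#   line="$(ip -o link show thunderbolt0)"
#   is_bridge_member "$line" thunderbridge || echo "thunderbolt0 is not in thunderbridge"
```

A check like this after `systemctl restart networking` would catch the regression described in this issue even when the MTU is correct.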
| 76 |
+ |
|
| 77 |
+--- |
|
| 78 |
+ |
|
| 79 |
+## Related Issues |
|
| 80 |
+ |
|
| 81 |
+- ISSUE-2025-001 (MTU reset issue) |
|
| 82 |
+ |
|
| 83 |
+--- |
|
| 84 |
+ |
|
| 85 |
+## Changelog References |
|
| 86 |
+ |
|
| 87 |
+- CHANGELOG entry: [2025-10-30] - Fixed bridge membership regression after MTU fix deployment. |
|
|
@@ -0,0 +1,118 @@ |
||
| 1 |
+# tb-enlist Fails on Device Disconnect, Leaving Thunderbolt Link Down After Reboot |
|
| 2 |
+ |
|
| 3 |
+## Issue ID: ISSUE-2026-001 |
|
| 4 |
+ |
|
| 5 |
+**Status:** investigating |
|
| 6 |
+**Priority:** high |
|
| 7 |
+**Created:** 2026-03-06 |
|
| 8 |
+**Updated:** 2026-03-06 |
|
| 9 |
+**Assigned to:** unassigned |
|
| 10 |
+ |
|
| 11 |
+--- |
|
| 12 |
+ |
|
| 13 |
+## Summary |
|
| 14 |
+ |
|
| 15 |
+On `tapia`, `tb-enlist@thunderbolt0.service` failed during `ExecStop`, and after a post-boot disconnect/reconnect the `thunderbolt0` interface did not come back. |
|
| 16 |
+ |
|
| 17 |
+--- |
|
| 18 |
+ |
|
| 19 |
+## Description |
|
| 20 |
+ |
|
| 21 |
+After reboot, the Tapia-Baobab Thunderbolt link briefly came up, then disconnected. A bad `ExecStop=` command in `tb-enlist@.service` caused unit failure (`status=255`) when systemd stopped the instance. In parallel, `boltd` logged a probing timeout after reconnect, and `thunderbolt0` was no longer present on `tapia`. |
|
| 22 |
+ |
|
| 23 |
+--- |
|
| 24 |
+ |
|
| 25 |
+## Environment |
|
| 26 |
+ |
|
| 27 |
+- **Affected nodes:** tapia (observed), all (same shared unit deployed cluster-wide) |
|
| 28 |
+- **Component:** network (thunderbolt bridging/systemd integration) |
|
| 29 |
+- **Version/software:** Proxmox VE 8.x, kernel `6.8.12-19-pve`, systemd oneshot templated unit |
|
| 30 |
+ |
|
| 31 |
+--- |
|
| 32 |
+ |
|
| 33 |
+## Steps to Reproduce |
|
| 34 |
+ |
|
| 35 |
+1. Boot `tapia` with current shared `tb-enlist@.service`. |
|
| 36 |
+2. Let Thunderbolt peer connect, then trigger disconnect/remove event (observed during boot sequence). |
|
| 37 |
+3. Check `systemctl status tb-enlist@thunderbolt0.service` and `ip link show thunderbolt0`. |
|
| 38 |
+ |
|
| 39 |
+--- |
|
| 40 |
+ |
|
| 41 |
+## Expected Behavior |
|
| 42 |
+ |
|
| 43 |
+- `tb-enlist@*.service` should stop cleanly when a Thunderbolt netdev disappears. |
|
| 44 |
+- Unit should not remain failed due to teardown path. |
|
| 45 |
+- On reconnect, interface should be eligible to re-enlist normally. |
|
| 46 |
+ |
|
| 47 |
+--- |
|
| 48 |
+ |
|
| 49 |
+## Actual Behavior |
|
| 50 |
+ |
|
| 51 |
+- `tb-enlist@thunderbolt0.service` entered failed state on stop. |
|
| 52 |
+- Error included invalid arguments in `ExecStop`. |
|
| 53 |
+- `thunderbolt0` disappeared on `tapia` and did not reappear after reconnect. |
|
| 54 |
+- Behavior remains intermittent: after some `tapia` reboots, link stays down until physical unplug/replug. |
|
| 55 |
+ |
|
| 56 |
+--- |
|
| 57 |
+ |
|
| 58 |
+## Logs/Evidence |
|
| 59 |
+ |
|
| 60 |
+```text |
|
| 61 |
+Mar 06 08:27:07 tapia ip[4054]: Error: either "dev" is duplicate, or "2>/dev/null" is a garbage. |
|
| 62 |
+Mar 06 08:27:07 tapia systemd[1]: tb-enlist@thunderbolt0.service: Control process exited, code=exited, status=255/EXCEPTION |
|
| 63 |
+Mar 06 08:27:22 tapia boltd[838]: probing: started [1000] |
|
| 64 |
+Mar 06 08:27:24 tapia boltd[838]: probing: timeout, done: [2002832] (2000000) |
|
| 65 |
+Device "thunderbolt0" does not exist. |
|
| 66 |
+``` |
|
| 67 |
+ |
|
| 68 |
+--- |
|
| 69 |
+ |
|
| 70 |
+## Investigation Notes |
|
| 71 |
+ |
|
| 72 |
+- 2026-03-06: Confirmed `tb-bridge.service` was active and `thunderbridge` existed on both `baobab` and `tapia`. |
|
| 73 |
+- 2026-03-06: Confirmed old `ExecStop` lines used shell syntax in non-shell context: |
|
| 74 |
+ - `ExecStop=/sbin/ip link set %i nomaster 2>/dev/null || true` |
|
| 75 |
+ - `ExecStop=/sbin/ip link set %i down 2>/dev/null || true` |
|
| 76 |
+- 2026-03-06: Implemented fix with systemd-native ignore-errors prefix: |
|
| 77 |
+ - `ExecStop=-/sbin/ip link set %i nomaster` |
|
| 78 |
+ - `ExecStop=-/sbin/ip link set %i down` |
|
| 79 |
+- 2026-03-06: Deployed patch to `tapia` and validated that unit can be reset/stopped without entering `failed`. |
|
| 80 |
+- 2026-03-06: User-induced flap still showed intermittent non-recovery pattern; remediation was not sufficient by itself. |
|
| 81 |
+- 2026-03-06: After reboot at ~08:49 EET, `tapia` link was observed up again (`thunderbolt0` forwarding), confirming intermittent behavior. |
|
| 82 |
+- 2026-03-06: Added second-stage mitigation candidate: periodic recovery (`tb-recover.service` + `tb-recover.timer`) to re-enlist interfaces and force rescan when no thunderbolt netdev is present. |
|
| 83 |
+- 2026-03-06: Validated mitigation on `tapia` by intentionally stopping `tb-enlist@thunderbolt0`; recovery timer re-attached interface in next cycle and returned `forwarding` state. |
|
| 84 |
+- 2026-03-06: Rolled out mitigation to `baobab` and `ebony`; timer enabled and active on all three nodes. |
|
| 85 |
+- 2026-03-06 10:01 EET: New flap captured on `tapia` (`host disconnected` at `10:01:30`); recovery happened after reconnect event (`new host found` at `10:01:48`), consistent with unplug/replug recovery. |
|
| 86 |
+- 2026-03-06 10:05 EET: Added third-stage mitigation in `tb-recover.sh`: if no thunderbolt netdev after rescan, restart `bolt.service` and retrigger udev as fallback. |
|
| 87 |
+- 2026-03-06 10:39 EET: Controlled flap test on `tapia` using `thunderbolt-net` unbind/bind (`0-1.0`) passed; `thunderbolt0` reappeared and returned to `forwarding` within seconds (`TEST_PASS`). |
|
| 88 |
+- 2026-03-06 10:46 EET: Latest mitigation rollout completed on `baobab` and `ebony`; `tb-recover.timer` active/enabled and `tb-enlist@*` units active on all nodes. |
|
| 89 |
+- 2026-03-06 13:25 EET: Reboot-loop regression reproduced on `tapia` - `thunderbridge` up but `thunderbolt0` missing entirely (`tb-enlist@thunderbolt0` inactive), while peer `baobab` port showed `NO-CARRIER`. |
|
| 90 |
+- 2026-03-06 13:22-14:02 EET: Existing fallback (`bolt.service` restart) was insufficient; repeated `boltd` messages observed: `failed to get boot_acl: Connection timed out`. |
|
| 91 |
+- 2026-03-06 14:02 EET: Software recovery without cable succeeded via Thunderbolt NHI PCI `remove + rescan`; `thunderbolt0` recreated and rejoined bridge. |
|
| 92 |
+- 2026-03-06 14:04 EET: `tb-recover.sh` updated with cooldowned NHI rescan fallback (and guarded `boltd` restart fallback) and deployed cluster-wide. |
|
| 93 |
+- 2026-03-07 03:35-03:42 EET: On `tapia` running `6.17.13-1-pve`, first NHI rescan rediscovered peer host `0-1` but did not recreate `0-1.0`; a second manual NHI reset at `03:42` recreated `thunderbolt0` and restored `forwarding`. |
|
| 94 |
+- 2026-03-07 03:4x EET: Recovery logic updated so a stale xdomain host node without a `*.0` service triggers one bounded second NHI reset in the same `tb-recover.sh` run. |
|
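The `ExecStop` failure noted above comes from systemd not passing command lines through a shell, so tokens such as `2>/dev/null` and `||` reach the program as literal arguments. The sketch below only simulates that difference in plain bash; `show_args`, `via_shell`, and `without_shell` are hypothetical names and no systemd unit is involved.

```bash
#!/usr/bin/env bash
# Demonstration only: how the same tokens look with and without a shell.

# Print each argument on its own line, numbered from 1, as a program would receive it.
show_args() {
  local i=1 arg
  for arg in "$@"; do
    printf '%d=%s\n' "$i" "$arg"
    i=$((i + 1))
  done
}

# Through a shell, the redirection and `|| true` are consumed by the shell itself,
# so the program sees four arguments.
via_shell() { show_args link set thunderbolt0 nomaster 2>/dev/null || true; }

# Without a shell, every token is a literal argument. The quoting below simulates
# that case: the program sees seven arguments, including "2>/dev/null".
without_shell() { show_args link set thunderbolt0 nomaster '2>/dev/null' '||' 'true'; }
```

This matches the logged `ip` error, which complains about `"2>/dev/null"` as an unexpected argument, and is why the `-` prefix is the systemd-native way to ignore a failing stop command.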
| 95 |
+ |
|
| 96 |
+--- |
|
| 97 |
+ |
|
| 98 |
+## Proposed Solution |
|
| 99 |
+ |
|
| 100 |
+Use a layered recovery approach: |
|
| 101 |
+1. Keep `ExecStop` commands shell-free and use systemd `-` prefix to ignore expected failures when device is already gone. |
|
| 102 |
+2. Run periodic recovery (`tb-recover.timer`) that re-enlists existing thunderbolt netdevs and forces controller/net udev retrigger when no thunderbolt netdev is present. |
|
| 103 |
+3. If netdev is still missing, perform cooldowned Thunderbolt NHI PCI `remove + rescan` (soft replug equivalent), then retrigger udev. |
|
| 104 |
+4. If the controller comes back only as a peer xdomain host node (for example `0-1`) with no `0-1.0` service child, immediately perform one additional bounded NHI reset in the same recovery run. |
|
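Two pieces of the steps above lend themselves to small pure helpers: the cooldown gate in front of the NHI reset, and the detection of a peer host node without a `*.0` service child. The sketch below is a rough illustration under assumptions, not the logic of `tb-recover.sh`: `cooldown_elapsed` and `stale_hosts` are hypothetical names, and the caller is assumed to pass only peer xdomain device names such as `0-1` and `0-1.0`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the recovery flow; names and inputs are illustrative.

# Return 0 when at least <interval> seconds have passed since <last>.
# usage: cooldown_elapsed <last_epoch> <now_epoch> <interval_seconds>
cooldown_elapsed() {
  local last="$1" now="$2" interval="$3"
  (( now - last >= interval ))
}

# Print every peer host node (e.g. "0-1") that has no "<host>.0" service child
# in the given list of device names.
# usage: stale_hosts <name>...
stale_hosts() {
  local name other found
  for name in "$@"; do
    [[ "$name" == *.* ]] && continue      # skip service children themselves
    found=0
    for other in "$@"; do
      [[ "$other" == "$name.0" ]] && found=1
    done
    (( found == 0 )) && echo "$name"
  done
  return 0
}
```

In a recovery run, a non-empty `stale_hosts` result after the first rescan would be the trigger for the single additional bounded reset, while `cooldown_elapsed` keeps the timer from resetting the controller on every cycle.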
| 105 |
+ |
|
| 106 |
+--- |
|
| 107 |
+ |
|
| 108 |
+## Related Issues |
|
| 109 |
+ |
|
| 110 |
+- ISSUE-2025-002 |
|
| 111 |
+- ISSUE-2025-001 |
|
| 112 |
+ |
|
| 113 |
+--- |
|
| 114 |
+ |
|
| 115 |
+## Changelog References |
|
| 116 |
+ |
|
| 117 |
+List CHANGELOG.md entries that reference this issue: |
|
| 118 |
+- CHANGELOG entry: [Unreleased] - Fix invalid `ExecStop` in `tb-enlist@.service` to prevent failed unit on device removal [ISSUE-2026-001] |
|
@@ -0,0 +1,83 @@ |
||
| 1 |
+# Issue Template |
|
| 2 |
+ |
|
| 3 |
+## Issue ID: ISSUE-YYYY-NNN |
|
| 4 |
+ |
|
| 5 |
+**Status:** [open|investigating|in-progress|resolved|closed] |
|
| 6 |
+**Priority:** [low|medium|high|critical] |
|
| 7 |
+**Created:** YYYY-MM-DD |
|
| 8 |
+**Updated:** YYYY-MM-DD |
|
| 9 |
+**Assigned to:** [name or unassigned] |
|
| 10 |
+ |
|
| 11 |
+--- |
|
| 12 |
+ |
|
| 13 |
+## Summary |
|
| 14 |
+ |
|
| 15 |
+Brief one-line description of the issue. |
|
| 16 |
+ |
|
| 17 |
+--- |
|
| 18 |
+ |
|
| 19 |
+## Description |
|
| 20 |
+ |
|
| 21 |
+Detailed description of the problem, behavior, or feature request. |
|
| 22 |
+ |
|
| 23 |
+--- |
|
| 24 |
+ |
|
| 25 |
+## Environment |
|
| 26 |
+ |
|
| 27 |
+- **Affected nodes:** [baobab|ebony|tapia|all] |
|
| 28 |
+- **Component:** [network|storage|vm|backup|cluster|other] |
|
| 29 |
+- **Version/software:** (e.g., Proxmox 8.x, kernel version, etc.) |
|
| 30 |
+ |
|
| 31 |
+--- |
|
| 32 |
+ |
|
| 33 |
+## Steps to Reproduce |
|
| 34 |
+ |
|
| 35 |
+1. Step 1 |
|
| 36 |
+2. Step 2 |
|
| 37 |
+3. ... |
|
| 38 |
+ |
|
| 39 |
+--- |
|
| 40 |
+ |
|
| 41 |
+## Expected Behavior |
|
| 42 |
+ |
|
| 43 |
+What should happen. |
|
| 44 |
+ |
|
| 45 |
+--- |
|
| 46 |
+ |
|
| 47 |
+## Actual Behavior |
|
| 48 |
+ |
|
| 49 |
+What actually happens. |
|
| 50 |
+ |
|
| 51 |
+--- |
|
| 52 |
+ |
|
| 53 |
+## Logs/Evidence |
|
| 54 |
+ |
|
| 55 |
+``` |
|
| 56 |
+Paste relevant logs, command output, or error messages here. |
|
| 57 |
+``` |
|
| 58 |
+ |
|
| 59 |
+--- |
|
| 60 |
+ |
|
| 61 |
+## Investigation Notes |
|
| 62 |
+ |
|
| 63 |
+- [Date] Note 1 |
|
| 64 |
+- [Date] Note 2 |
|
| 65 |
+ |
|
| 66 |
+--- |
|
| 67 |
+ |
|
| 68 |
+## Proposed Solution |
|
| 69 |
+ |
|
| 70 |
+Describe the proposed fix or workaround. |
|
| 71 |
+ |
|
| 72 |
+--- |
|
| 73 |
+ |
|
| 74 |
+## Related Issues |
|
| 75 |
+ |
|
| 76 |
+- ISSUE-YYYY-NNN (if any) |
|
| 77 |
+ |
|
| 78 |
+--- |
|
| 79 |
+ |
|
| 80 |
+## Changelog References |
|
| 81 |
+ |
|
| 82 |
+List CHANGELOG.md entries that reference this issue: |
|
| 83 |
+- CHANGELOG entry: [date] - description |
|
@@ -0,0 +1,59 @@ |
||
| 1 |
+#!/usr/bin/env bash |
|
| 2 |
+# check_mcluster_network.sh — Minimal cluster network health check (pretty table) |
|
| 3 |
+ |
|
| 4 |
+set -e |
|
| 5 |
+ |
|
| 6 |
+NODES=(baobab ebony tapia autonas1 autonas2) |
|
| 7 |
+CLUSTER_IPS=(192.168.10.91 192.168.10.92 192.168.10.93 192.168.10.95 192.168.10.96) |
|
| 8 |
+MGMT_IPS=(192.168.2.91 192.168.2.92 192.168.2.93 192.168.2.95 192.168.2.96) |
|
| 9 |
+SSH_OPTS="-o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR" |
|
| 10 |
+ |
|
| 11 |
+# Thunderbridge/thunderbolt status per node |
|
| 12 |
+for i in "${!NODES[@]}"; do
|
|
| 13 |
+ node="${NODES[$i]}"
|
|
| 14 |
+ mgmt_ip="${MGMT_IPS[$i]}"
|
|
| 15 |
+ if [[ "$node" == autonas* ]]; then |
|
| 16 |
+ continue |
|
| 17 |
+ fi |
|
| 18 |
+ mtu=$(ssh $SSH_OPTS root@$mgmt_ip "ip link show thunderbridge 2>/dev/null | grep mtu | awk '{print \$5}'" || echo "fail")
|
|
| 19 |
+ ports=$(ssh $SSH_OPTS root@$mgmt_ip "bridge link | grep master.*thunderbridge | awk '{print \$2}'" | xargs)
|
|
| 20 |
+ echo "$node: thunderbridge mtu=$mtu ports=$ports" |
|
| 21 |
+ ssh $SSH_OPTS root@$mgmt_ip "ip -o link show | grep 'thunderbolt'" | while read -r line; do |
|
| 22 |
+ iface=$(echo "$line" | awk '{print $2}' | tr -d ':')
|
|
| 23 |
+ mtu=$(echo "$line" | awk '{print $5}')
|
|
| 24 |
+ up=$(echo "$line" | grep -q 'UP' && echo "up" || echo "down") |
|
| 25 |
+ forwarding=$(ssh -n $SSH_OPTS root@$mgmt_ip "bridge link show dev $iface 2>/dev/null" | grep -q 'state forwarding' && echo "forwarding" || echo "not-forwarding") |
|
| 26 |
+ echo " $iface mtu=$mtu $up $forwarding" |
|
| 27 |
+ done |
|
| 28 |
+done |
|
| 29 |
+ |
|
| 30 |
+echo |
|
| 31 |
+# Table header |
|
| 32 |
+printf "%-10s |" "Node" |
|
| 33 |
+for node in "${NODES[@]}"; do
|
|
| 34 |
+ printf " %10s |" "$node" |
|
| 35 |
+done |
|
| 36 |
+echo |
|
| 37 |
+# localhost row |
|
| 38 |
+printf "%-10s |" "localhost" |
|
| 39 |
+for j in "${!NODES[@]}"; do
|
|
| 40 |
+ dst_cluster="${CLUSTER_IPS[$j]}"
|
|
| 41 |
+ if ping -c 1 -W 1 $dst_cluster >/dev/null 2>&1; then |
|
| 42 |
+ printf " %10s |" "OK" |
|
| 43 |
+ else |
|
| 44 |
+ printf " %10s |" "FAILED" |
|
| 45 |
+ fi |
|
| 46 |
+done |
|
| 47 |
+echo |
|
| 48 |
+# baobab row |
|
| 49 |
+printf "%-10s |" "baobab" |
|
| 50 |
+baobab_mgmt="${MGMT_IPS[0]}"
|
|
| 51 |
+for j in "${!NODES[@]}"; do
|
|
| 52 |
+ dst_cluster="${CLUSTER_IPS[$j]}"
|
|
| 53 |
+ if ssh $SSH_OPTS root@$baobab_mgmt "ping -c 1 -W 1 $dst_cluster >/dev/null 2>&1"; then |
|
| 54 |
+ printf " %10s |" "OK" |
|
| 55 |
+ else |
|
| 56 |
+ printf " %10s |" "FAILED" |
|
| 57 |
+ fi |
|
| 58 |
+done |
|
| 59 |
+echo |
|
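The localhost and baobab rows repeat the same loop with a different probe. A possible refactor, sketched here under assumptions, renders any row from a label and a probe command; `print_row` and `probe_local` are hypothetical names that do not appear in the script.

```bash
#!/usr/bin/env bash
# Hypothetical refactor: one function renders any row of the reachability table.

# usage: print_row <label> <probe_command> <ip>...
# <probe_command> is called with one IP and must return 0 when it is reachable.
print_row() {
  local label="$1" probe="$2" ip
  shift 2
  printf "%-10s |" "$label"
  for ip in "$@"; do
    if "$probe" "$ip"; then
      printf " %10s |" "OK"
    else
      printf " %10s |" "FAILED"
    fi
  done
  echo
}

# Probe from the local machine, using the same ping options as the script.
probe_local() { ping -c 1 -W 1 "$1" >/dev/null 2>&1; }

# Example use (not run here):
#   print_row localhost probe_local "${CLUSTER_IPS[@]}"
```

A probe for the baobab row would wrap the same `ping` in the script's `ssh` call, which also makes it straightforward to add rows for the remaining nodes.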
@@ -0,0 +1,144 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="thunderbolts" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 8 |
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
|
|
| 9 |
+RECOVER_CANONICAL="${INSTALL_DIR}/tb-recover.sh"
|
|
| 10 |
+RECOVER_WRAPPER="/usr/local/sbin/tb-recover.sh" |
|
| 11 |
+UNINSTALL_PATH="${INSTALL_DIR}/uninstall.sh"
|
|
| 12 |
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
|
|
| 13 |
+UDEV_RULE_PATH="/etc/udev/rules.d/90-thunderbolt-net-systemd.rules" |
|
| 14 |
+TB_BRIDGE_UNIT="/etc/systemd/system/tb-bridge.service" |
|
| 15 |
+TB_ENLIST_UNIT="/etc/systemd/system/tb-enlist@.service" |
|
| 16 |
+TB_RECOVER_UNIT="/etc/systemd/system/tb-recover.service" |
|
| 17 |
+TB_RECOVER_TIMER="/etc/systemd/system/tb-recover.timer" |
|
| 18 |
+ |
|
| 19 |
+SOURCE_DIR="" |
|
| 20 |
+ |
|
| 21 |
+usage() {
|
|
| 22 |
+ cat <<EOF |
|
| 23 |
+Usage: $0 [--source-dir <path>] |
|
| 24 |
+ |
|
| 25 |
+Install shared thunderbolt runtime artifacts on the current host. |
|
| 26 |
+This workflow does NOT modify /etc/network/interfaces or interfaces.d/10-thunderbolt. |
|
| 27 |
+EOF |
|
| 28 |
+} |
|
| 29 |
+ |
|
| 30 |
+require_root() {
|
|
| 31 |
+ if [[ "${EUID}" -ne 0 ]]; then
|
|
| 32 |
+ echo "ERROR: this script must be run as root" >&2 |
|
| 33 |
+ exit 1 |
|
| 34 |
+ fi |
|
| 35 |
+} |
|
| 36 |
+ |
|
| 37 |
+resolve_source_dir() {
|
|
| 38 |
+ if [[ -n "${SOURCE_DIR}" ]]; then
|
|
| 39 |
+ SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
|
|
| 40 |
+ else |
|
| 41 |
+ SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
|
| 42 |
+ fi |
|
| 43 |
+} |
|
| 44 |
+ |
|
| 45 |
+validate_source_tree() {
|
|
| 46 |
+ local required_files=( |
|
| 47 |
+ "${SOURCE_DIR}/deploy/attempt1/common/sbin/tb-recover.sh"
|
|
| 48 |
+ "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-bridge.service"
|
|
| 49 |
+ "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-enlist@.service"
|
|
| 50 |
+ "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.service"
|
|
| 51 |
+ "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.timer"
|
|
| 52 |
+ "${SOURCE_DIR}/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules"
|
|
| 53 |
+ "${SOURCE_DIR}/scripts/uninstall.sh"
|
|
| 54 |
+ "${SOURCE_DIR}/README.md"
|
|
| 55 |
+ "${SOURCE_DIR}/INSTALL.md"
|
|
| 56 |
+ "${SOURCE_DIR}/CHANGELOG.md"
|
|
| 57 |
+ ) |
|
| 58 |
+ local file="" |
|
| 59 |
+ for file in "${required_files[@]}"; do
|
|
| 60 |
+ if [[ ! -f "${file}" ]]; then
|
|
| 61 |
+ echo "ERROR: missing required source file: ${file}" >&2
|
|
| 62 |
+ exit 1 |
|
| 63 |
+ fi |
|
| 64 |
+ done |
|
| 65 |
+} |
|
| 66 |
+ |
|
| 67 |
+run_existing_uninstall() {
|
|
| 68 |
+ if [[ -x "${UNINSTALL_PATH}" ]]; then
|
|
| 69 |
+ echo "Existing installation detected. Running canonical uninstall first..." |
|
| 70 |
+ "${UNINSTALL_PATH}" --force || true
|
|
| 71 |
+ else |
|
| 72 |
+ bash "${SOURCE_DIR}/scripts/uninstall.sh" --force || true
|
|
| 73 |
+ fi |
|
| 74 |
+} |
|
| 75 |
+ |
|
| 76 |
+install_docs() {
|
|
| 77 |
+ mkdir -p "${DOC_DIR}"
|
|
| 78 |
+ cp "${SOURCE_DIR}/README.md" "${DOC_DIR}/"
|
|
| 79 |
+ cp "${SOURCE_DIR}/INSTALL.md" "${DOC_DIR}/"
|
|
| 80 |
+ cp "${SOURCE_DIR}/CHANGELOG.md" "${DOC_DIR}/"
|
|
| 81 |
+} |
|
| 82 |
+ |
|
| 83 |
+main() {
|
|
| 84 |
+ while [[ $# -gt 0 ]]; do |
|
| 85 |
+ case "$1" in |
|
| 86 |
+ --source-dir) |
|
| 87 |
+ SOURCE_DIR="${2:?--source-dir requires a path}" |
|
| 88 |
+ shift 2 |
|
| 89 |
+ ;; |
|
| 90 |
+ -h|--help) |
|
| 91 |
+ usage |
|
| 92 |
+ exit 0 |
|
| 93 |
+ ;; |
|
| 94 |
+ *) |
|
| 95 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 96 |
+ usage |
|
| 97 |
+ exit 1 |
|
| 98 |
+ ;; |
|
| 99 |
+ esac |
|
| 100 |
+ done |
|
| 101 |
+ |
|
| 102 |
+ require_root |
|
| 103 |
+ resolve_source_dir |
|
| 104 |
+ validate_source_tree |
|
| 105 |
+ |
|
| 106 |
+ echo "=== Installing ${PROJECT_ID} shared runtime ==="
|
|
| 107 |
+ run_existing_uninstall |
|
| 108 |
+ |
|
| 109 |
+ mkdir -p "${INSTALL_DIR}" "${DOC_DIR}" /usr/local/sbin /etc/udev/rules.d /etc/systemd/system
|
|
| 110 |
+ |
|
| 111 |
+ install -m 0755 "${SOURCE_DIR}/deploy/attempt1/common/sbin/tb-recover.sh" "${RECOVER_CANONICAL}"
|
|
| 112 |
+ ln -sfn "${RECOVER_CANONICAL}" "${RECOVER_WRAPPER}"
|
|
| 113 |
+ |
|
| 114 |
+ install -m 0755 "${SOURCE_DIR}/scripts/uninstall.sh" "${UNINSTALL_PATH}"
|
|
| 115 |
+ ln -sfn "${UNINSTALL_PATH}" "${UNINSTALL_WRAPPER}"
|
|
| 116 |
+ |
|
| 117 |
+ install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules" "${UDEV_RULE_PATH}"
|
|
| 118 |
+ install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-bridge.service" "${TB_BRIDGE_UNIT}"
|
|
| 119 |
+ install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-enlist@.service" "${TB_ENLIST_UNIT}"
|
|
| 120 |
+ install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.service" "${TB_RECOVER_UNIT}"
|
|
| 121 |
+ install -m 0644 "${SOURCE_DIR}/deploy/attempt1/common/systemd/system/tb-recover.timer" "${TB_RECOVER_TIMER}"
|
|
| 122 |
+ |
|
| 123 |
+ install_docs |
|
| 124 |
+ |
|
| 125 |
+ systemctl daemon-reload |
|
| 126 |
+ udevadm control --reload-rules |
|
| 127 |
+ systemctl enable --now tb-bridge.service |
|
| 128 |
+ systemctl enable --now tb-recover.timer |
|
| 129 |
+ systemctl start tb-recover.service || true |
|
| 130 |
+ udevadm trigger --subsystem-match=net --action=add || true |
|
| 131 |
+ |
|
| 132 |
+ echo "Installed paths:" |
|
| 133 |
+ echo " runtime: ${INSTALL_DIR}"
|
|
| 134 |
+ echo " recover wrapper: ${RECOVER_WRAPPER}"
|
|
| 135 |
+ echo " uninstall: ${UNINSTALL_PATH}"
|
|
| 136 |
+ echo " udev rule: ${UDEV_RULE_PATH}"
|
|
| 137 |
+ echo " systemd units: tb-bridge.service tb-enlist@.service tb-recover.service tb-recover.timer" |
|
| 138 |
+ echo " docs: ${DOC_DIR}"
|
|
| 139 |
+ echo "" |
|
| 140 |
+ echo "Network interface files were left untouched." |
|
| 141 |
+ echo "Installation completed." |
|
| 142 |
+} |
|
| 143 |
+ |
|
| 144 |
+main "$@" |
|
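After an install, it can be useful to confirm that the files listed under "Installed paths" actually exist. The sketch below is one way to do that and is not part of the installer: `missing_paths` is a hypothetical helper that only tests for existence.

```bash
#!/usr/bin/env bash
# Hypothetical post-install check: report which expected paths are absent.

# usage: missing_paths <path>...
# Prints each path that does not exist; returns 1 if any are missing, 0 otherwise.
missing_paths() {
  local path rc=0
  for path in "$@"; do
    if [[ ! -e "$path" ]]; then
      echo "$path"
      rc=1
    fi
  done
  return "$rc"
}

# Example use with the installer's variables (not run here):
#   missing_paths "${RECOVER_CANONICAL}" "${UDEV_RULE_PATH}" "${TB_BRIDGE_UNIT}" \
#     "${TB_ENLIST_UNIT}" "${TB_RECOVER_UNIT}" "${TB_RECOVER_TIMER}" \
#     || echo "install incomplete" >&2
```

Because `-e` follows symlinks, a wrapper whose target was removed would also be reported as missing.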
@@ -0,0 +1,83 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="thunderbolts" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+INSTALL_DIR="/usr/local/lib/${ORG_ID}/${PROJECT_ID}"
|
|
| 8 |
+DOC_DIR="/usr/local/share/doc/${ORG_ID}/${PROJECT_ID}"
|
|
| 9 |
+RECOVER_WRAPPER="/usr/local/sbin/tb-recover.sh" |
|
| 10 |
+UNINSTALL_WRAPPER="/usr/local/sbin/${ORG_ID}-${PROJECT_ID}-uninstall"
|
|
| 11 |
+UDEV_RULE_PATH="/etc/udev/rules.d/90-thunderbolt-net-systemd.rules" |
|
| 12 |
+TB_BRIDGE_UNIT="/etc/systemd/system/tb-bridge.service" |
|
| 13 |
+TB_ENLIST_UNIT="/etc/systemd/system/tb-enlist@.service" |
|
| 14 |
+TB_RECOVER_UNIT="/etc/systemd/system/tb-recover.service" |
|
| 15 |
+TB_RECOVER_TIMER="/etc/systemd/system/tb-recover.timer" |
|
| 16 |
+ |
|
| 17 |
+FORCE_MODE=0 |
|
| 18 |
+ |
|
| 19 |
+log() {
|
|
| 20 |
+ if [[ "${FORCE_MODE}" -eq 0 ]]; then
|
|
| 21 |
+ echo "$@" |
|
| 22 |
+ fi |
|
| 23 |
+} |
|
| 24 |
+ |
|
| 25 |
+require_root() {
|
|
| 26 |
+ if [[ "${EUID}" -ne 0 ]]; then
|
|
| 27 |
+ echo "ERROR: this script must be run as root" >&2 |
|
| 28 |
+ exit 1 |
|
| 29 |
+ fi |
|
| 30 |
+} |
|
| 31 |
+ |
|
| 32 |
+stop_enlist_instances() {
|
|
| 33 |
+ local units |
|
| 34 |
+ units="$(systemctl list-units --all 'tb-enlist@*.service' --no-legend --no-pager 2>/dev/null | awk '{print $1}')"
|
|
| 35 |
+ if [[ -n "${units}" ]]; then
|
|
| 36 |
+ # shellcheck disable=SC2086 |
|
| 37 |
+ systemctl stop ${units} >/dev/null 2>&1 || true
|
|
| 38 |
+ fi |
|
| 39 |
+} |
|
| 40 |
+ |
|
| 41 |
+main() {
|
|
| 42 |
+ while [[ $# -gt 0 ]]; do |
|
| 43 |
+ case "$1" in |
|
| 44 |
+ --force) |
|
| 45 |
+ FORCE_MODE=1 |
|
| 46 |
+ shift |
|
| 47 |
+ ;; |
|
| 48 |
+ -h|--help) |
|
| 49 |
+ echo "Usage: $0 [--force]" |
|
| 50 |
+ exit 0 |
|
| 51 |
+ ;; |
|
| 52 |
+ *) |
|
| 53 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 54 |
+ exit 1 |
|
| 55 |
+ ;; |
|
| 56 |
+ esac |
|
| 57 |
+ done |
|
| 58 |
+ |
|
| 59 |
+ require_root |
|
| 60 |
+ |
|
| 61 |
+ log "=== Uninstalling ${PROJECT_ID} shared runtime ==="
|
|
| 62 |
+ |
|
| 63 |
+ stop_enlist_instances |
|
| 64 |
+ systemctl disable --now tb-recover.timer >/dev/null 2>&1 || true |
|
| 65 |
+ systemctl stop tb-recover.service >/dev/null 2>&1 || true |
|
| 66 |
+ systemctl disable tb-bridge.service >/dev/null 2>&1 || true |
|
| 67 |
+ systemctl stop tb-bridge.service >/dev/null 2>&1 || true |
|
| 68 |
+ |
|
| 69 |
+ rm -f "${TB_RECOVER_TIMER}" "${TB_RECOVER_UNIT}" "${TB_ENLIST_UNIT}" "${TB_BRIDGE_UNIT}" "${UDEV_RULE_PATH}"
|
|
| 70 |
+ rm -f "${UNINSTALL_WRAPPER}" "${RECOVER_WRAPPER}"
|
|
| 71 |
+ rm -rf "${DOC_DIR}" "${INSTALL_DIR}"
|
|
| 72 |
+ |
|
| 73 |
+ systemctl daemon-reload |
|
| 74 |
+ udevadm control --reload-rules |
|
| 75 |
+ |
|
| 76 |
+ rmdir /usr/local/lib/${ORG_ID} 2>/dev/null || true
|
|
| 77 |
+ rmdir /usr/local/share/doc/${ORG_ID} 2>/dev/null || true
|
|
| 78 |
+ |
|
| 79 |
+ log "Shared runtime removed." |
|
| 80 |
+ log "Network interface configuration was left untouched." |
|
| 81 |
+} |
|
| 82 |
+ |
|
| 83 |
+main "$@" |
|
@@ -0,0 +1,166 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+PROJECT_ID="thunderbolts" |
|
| 6 |
+ORG_ID="xdev" |
|
| 7 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 8 |
+MODE="install" |
|
| 9 |
+REMOTE_USER="root" |
|
| 10 |
+LOCAL_MODE=0 |
|
| 11 |
+TARGETS=() |
|
| 12 |
+ |
|
| 13 |
+get_mgmt_ip() {
|
|
| 14 |
+ case "$1" in |
|
| 15 |
+ baobab) echo "192.168.2.91" ;; |
|
| 16 |
+ ebony) echo "192.168.2.92" ;; |
|
| 17 |
+ tapia) echo "192.168.2.93" ;; |
|
| 18 |
+ *) echo "" ;; |
|
| 19 |
+ esac |
|
| 20 |
+} |
|
| 21 |
+ |
|
| 22 |
+resolve_target() {
|
|
| 23 |
+ local host="$1" |
|
| 24 |
+ local ip="" |
|
| 25 |
+ |
|
| 26 |
+ if [[ "$host" == *@* ]]; then |
|
| 27 |
+ echo "$host" |
|
| 28 |
+ return 0 |
|
| 29 |
+ fi |
|
| 30 |
+ |
|
| 31 |
+ ip="$(get_mgmt_ip "$host")" |
|
| 32 |
+ if [[ -n "$ip" ]]; then |
|
| 33 |
+ echo "${REMOTE_USER}@${ip}"
|
|
| 34 |
+ else |
|
| 35 |
+ echo "${REMOTE_USER}@${host}"
|
|
| 36 |
+ fi |
|
| 37 |
+} |
|
| 38 |
+ |
|
| 39 |
+show_help() {
|
|
| 40 |
+ cat <<EOF |
|
| 41 |
+${PROJECT_ID} setup wrapper
|
|
| 42 |
+ |
|
| 43 |
+Usage: $0 [OPTIONS] [host...] |
|
| 44 |
+ |
|
| 45 |
+Options: |
|
| 46 |
+ -h, --help Show this help message |
|
| 47 |
+ -l, --local Run on localhost |
|
| 48 |
+ -u, --uninstall Uninstall instead of install |
|
| 49 |
+ --user <user> Remote SSH user (default: root) |
|
| 50 |
+ |
|
| 51 |
+Without explicit hosts, remote mode defaults to: baobab ebony tapia |
|
| 52 |
+EOF |
|
| 53 |
+} |
|
| 54 |
+ |
|
| 55 |
+run_local_install() {
|
|
| 56 |
+ bash "${SCRIPT_DIR}/scripts/install.sh" --source-dir "${SCRIPT_DIR}"
|
|
| 57 |
+} |
|
| 58 |
+ |
|
| 59 |
+run_local_uninstall() {
|
|
| 60 |
+ local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
|
|
| 61 |
+ if [[ -x "${canonical}" ]]; then
|
|
| 62 |
+ "${canonical}"
|
|
| 63 |
+ else |
|
| 64 |
+ bash "${SCRIPT_DIR}/scripts/uninstall.sh"
|
|
| 65 |
+ fi |
|
| 66 |
+} |
|
| 67 |
+ |
|
| 68 |
+copy_remote_tree() {
|
|
| 69 |
+ local target="$1" |
|
| 70 |
+ local remote_tmp="$2" |
|
| 71 |
+ |
|
| 72 |
+ ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts' '${remote_tmp}/deploy/attempt1/common/sbin' '${remote_tmp}/deploy/attempt1/common/systemd/system' '${remote_tmp}/deploy/attempt1/common/udev/rules.d'"
|
|
| 73 |
+ scp -q "${SCRIPT_DIR}/scripts/install.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 74 |
+ scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 75 |
+ scp -q "${SCRIPT_DIR}/README.md" "${target}:${remote_tmp}/"
|
|
| 76 |
+ scp -q "${SCRIPT_DIR}/INSTALL.md" "${target}:${remote_tmp}/"
|
|
| 77 |
+ scp -q "${SCRIPT_DIR}/CHANGELOG.md" "${target}:${remote_tmp}/"
|
|
| 78 |
+ scp -q "${SCRIPT_DIR}/deploy/attempt1/common/sbin/tb-recover.sh" "${target}:${remote_tmp}/deploy/attempt1/common/sbin/"
|
|
| 79 |
+ scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-bridge.service" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
|
|
| 80 |
+ scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-enlist@.service" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
|
|
| 81 |
+ scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-recover.service" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
|
|
| 82 |
+ scp -q "${SCRIPT_DIR}/deploy/attempt1/common/systemd/system/tb-recover.timer" "${target}:${remote_tmp}/deploy/attempt1/common/systemd/system/"
|
|
| 83 |
+ scp -q "${SCRIPT_DIR}/deploy/attempt1/common/udev/rules.d/90-thunderbolt-net-systemd.rules" "${target}:${remote_tmp}/deploy/attempt1/common/udev/rules.d/"
|
|
| 84 |
+} |
|
| 85 |
+ |
|
| 86 |
+run_remote_install() {
|
|
| 87 |
+ local target="$1" |
|
| 88 |
+ local remote_tmp="/tmp/${PROJECT_ID}.$$"
|
|
| 89 |
+ local remote_prefix="" |
|
| 90 |
+ |
|
| 91 |
+ [[ "${REMOTE_USER}" != "root" ]] && remote_prefix="sudo "
|
|
| 92 |
+ |
|
| 93 |
+ copy_remote_tree "${target}" "${remote_tmp}"
|
|
| 94 |
+ ssh "${target}" "${remote_prefix}bash '${remote_tmp}/scripts/install.sh' --source-dir '${remote_tmp}'"
|
|
| 95 |
+ ssh "${target}" "rm -rf '${remote_tmp}'"
|
|
| 96 |
+} |
|
| 97 |
+ |
|
| 98 |
+run_remote_uninstall() {
|
|
| 99 |
+ local target="$1" |
|
| 100 |
+ local remote_tmp="/tmp/${PROJECT_ID}-uninstall.$$"
|
|
| 101 |
+ local canonical="/usr/local/lib/${ORG_ID}/${PROJECT_ID}/uninstall.sh"
|
|
| 102 |
+ |
|
| 103 |
+ ssh "${target}" "rm -rf '${remote_tmp}' && mkdir -p '${remote_tmp}/scripts'"
|
|
| 104 |
+ scp -q "${SCRIPT_DIR}/scripts/uninstall.sh" "${target}:${remote_tmp}/scripts/"
|
|
| 105 |
+ if [[ "${REMOTE_USER}" == "root" ]]; then
|
|
| 106 |
+ ssh "${target}" "if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi"
|
|
| 107 |
+ else |
|
| 108 |
+ ssh "${target}" "sudo bash -lc \"if [ -x '${canonical}' ]; then '${canonical}'; else bash '${remote_tmp}/scripts/uninstall.sh'; fi\""
|
|
| 109 |
+ fi |
|
| 110 |
+ ssh "${target}" "rm -rf '${remote_tmp}'"
|
|
| 111 |
+} |
|
| 112 |
+ |
|
| 113 |
+while [[ $# -gt 0 ]]; do |
|
| 114 |
+ case "$1" in |
|
| 115 |
+ -h|--help) |
|
| 116 |
+ show_help |
|
| 117 |
+ exit 0 |
|
| 118 |
+ ;; |
|
| 119 |
+ -l|--local) |
|
| 120 |
+ LOCAL_MODE=1 |
|
| 121 |
+ shift |
|
| 122 |
+ ;; |
|
| 123 |
+ -u|--uninstall) |
|
| 124 |
+ MODE="uninstall" |
|
| 125 |
+ shift |
|
| 126 |
+ ;; |
|
| 127 |
+ --user) |
|
| 128 |
+ REMOTE_USER="${2:?--user requires a value}" |
|
| 129 |
+ shift 2 |
|
| 130 |
+ ;; |
|
| 131 |
+ -*) |
|
| 132 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 133 |
+ show_help |
|
| 134 |
+ exit 1 |
|
| 135 |
+ ;; |
|
| 136 |
+ *) |
|
| 137 |
+ TARGETS+=("$1")
|
|
| 138 |
+ shift |
|
| 139 |
+ ;; |
|
| 140 |
+ esac |
|
| 141 |
+done |
|
| 142 |
+ |
|
| 143 |
+if [[ ${#TARGETS[@]} -eq 0 && ${LOCAL_MODE} -eq 0 ]]; then
|
|
| 144 |
+ TARGETS=(baobab ebony tapia) |
|
| 145 |
+fi |
|
| 146 |
+ |
|
| 147 |
+echo "================================" |
|
| 148 |
+echo "${PROJECT_ID} - ${MODE}"
|
|
| 149 |
+echo "================================" |
|
| 150 |
+ |
|
| 151 |
+if [[ ${LOCAL_MODE} -eq 1 ]]; then
|
|
| 152 |
+ if [[ "${MODE}" == "install" ]]; then
|
|
| 153 |
+ run_local_install |
|
| 154 |
+ else |
|
| 155 |
+ run_local_uninstall |
|
| 156 |
+ fi |
|
| 157 |
+ exit 0 |
|
| 158 |
+fi |
|
| 159 |
+ |
|
| 160 |
+for host in "${TARGETS[@]}"; do
|
|
| 161 |
+ if [[ "${MODE}" == "install" ]]; then
|
|
| 162 |
+ run_remote_install "$(resolve_target "${host}")"
|
|
| 163 |
+ else |
|
| 164 |
+ run_remote_uninstall "$(resolve_target "${host}")"
|
|
| 165 |
+ fi |
|
| 166 |
+done |
|
@@ -0,0 +1,11 @@ |
||
| 1 |
+{
|
|
| 2 |
+ "folders": [ |
|
| 3 |
+ {
|
|
| 4 |
+ "path": "." |
|
| 5 |
+ }, |
|
| 6 |
+ {
|
|
| 7 |
+ "path": "../backups" |
|
| 8 |
+ } |
|
| 9 |
+ ], |
|
| 10 |
+ "settings": {}
|
|
| 11 |
+} |
|
@@ -0,0 +1,77 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 6 |
+ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
|
|
| 7 |
+CONFIG_PATH="${ROOT_DIR}/cluster-context/madagascar.json"
|
|
| 8 |
+CLUSTER_NAME="madagascar" |
|
| 9 |
+FORMAT="ip" |
|
| 10 |
+ |
|
| 11 |
+usage() {
|
|
| 12 |
+ cat <<EOF |
|
| 13 |
+Usage: $0 [--cluster <name>] [--format ip|name|name=ip] |
|
| 14 |
+ |
|
| 15 |
+Reads node information from cluster-context/madagascar.json. |
|
| 16 |
+EOF |
|
| 17 |
+} |
|
| 18 |
+ |
|
| 19 |
+while [[ $# -gt 0 ]]; do |
|
| 20 |
+ case "$1" in |
|
| 21 |
+ --cluster) |
|
| 22 |
+ CLUSTER_NAME="${2:?--cluster requires a value}" |
|
| 23 |
+ shift 2 |
|
| 24 |
+ ;; |
|
| 25 |
+ --format) |
|
| 26 |
+ FORMAT="${2:?--format requires a value}" |
|
| 27 |
+ shift 2 |
|
| 28 |
+ ;; |
|
| 29 |
+ -h|--help) |
|
| 30 |
+ usage |
|
| 31 |
+ exit 0 |
|
| 32 |
+ ;; |
|
| 33 |
+ *) |
|
| 34 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 35 |
+ usage |
|
| 36 |
+ exit 1 |
|
| 37 |
+ ;; |
|
| 38 |
+ esac |
|
| 39 |
+done |
|
| 40 |
+ |
|
| 41 |
+if [[ ! -f "${CONFIG_PATH}" ]]; then
|
|
| 42 |
+ echo "ERROR: missing cluster config: ${CONFIG_PATH}" >&2
|
|
| 43 |
+ exit 1 |
|
| 44 |
+fi |
|
| 45 |
+ |
|
| 46 |
+case "${FORMAT}" in
|
|
| 47 |
+ ip) |
|
| 48 |
+ jq -r --arg cluster "${CLUSTER_NAME}" '
|
|
| 49 |
+ .clusters[$cluster].nodes |
|
| 50 |
+ | to_entries[] |
|
| 51 |
+ | ( |
|
| 52 |
+ .value.ip |
|
| 53 |
+ // .value.wan.vmbr443.address |
|
| 54 |
+ // empty |
|
| 55 |
+ ) |
|
| 56 |
+ | split("/")[0]
|
|
| 57 |
+ ' "${CONFIG_PATH}"
|
|
| 58 |
+ ;; |
|
| 59 |
+ name) |
|
| 60 |
+ jq -r --arg cluster "${CLUSTER_NAME}" '
|
|
| 61 |
+ .clusters[$cluster].nodes |
|
| 62 |
+ | to_entries[] |
|
| 63 |
+ | .key |
|
| 64 |
+ ' "${CONFIG_PATH}"
|
|
| 65 |
+ ;; |
|
| 66 |
+ name=ip) |
|
| 67 |
+ jq -r --arg cluster "${CLUSTER_NAME}" '
|
|
| 68 |
+ .clusters[$cluster].nodes |
|
| 69 |
+ | to_entries[] |
|
| 70 |
+ | .key + "=" + ((.value.ip // .value.wan.vmbr443.address // empty) | split("/")[0])
|
|
| 71 |
+ ' "${CONFIG_PATH}"
|
|
| 72 |
+ ;; |
|
| 73 |
+ *) |
|
| 74 |
+ echo "ERROR: unsupported format: ${FORMAT}" >&2
|
|
| 75 |
+ exit 1 |
|
| 76 |
+ ;; |
|
| 77 |
+esac |
|
@@ -0,0 +1,256 @@ |
||
| 1 |
+#!/bin/bash |
|
| 2 |
+ |
|
| 3 |
+set -euo pipefail |
|
| 4 |
+ |
|
| 5 |
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
| 6 |
+ROOT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)"
|
|
| 7 |
+CONFIG_PATH="${ROOT_DIR}/cluster-context/madagascar.json"
|
|
| 8 |
+CLUSTER_NAME="madagascar" |
|
| 9 |
+COMMAND="install" |
|
| 10 |
+REMOTE_USER="root" |
|
| 11 |
+DRY_RUN=0 |
|
| 12 |
+PROJECT_NAME="" |
|
| 13 |
+PROJECT_DIR="" |
|
| 14 |
+DEPLOY_MODE="" |
|
| 15 |
+NODE_FILTERS=() |
|
| 16 |
+TARGETS=() |
|
| 17 |
+ |
|
| 18 |
+usage() {
|
|
| 19 |
+ cat <<EOF |
|
| 20 |
+Usage: $0 <project> [command] [options] |
|
| 21 |
+ |
|
| 22 |
+Commands: |
|
| 23 |
+ install Deploy/install project on selected nodes (default) |
|
| 24 |
+ uninstall Remove project from selected nodes |
|
| 25 |
+ status Query project status on selected nodes (deploy.sh projects only) |
|
| 26 |
+ start Start services on selected nodes (deploy.sh projects only) |
|
| 27 |
+ restart Restart services on selected nodes (deploy.sh projects only) |
|
| 28 |
+ stop Stop services on selected nodes (deploy.sh projects only) |
|
| 29 |
+ |
|
| 30 |
+Options: |
|
| 31 |
+ --cluster <name> Cluster name in cluster-context/madagascar.json (default: madagascar) |
|
| 32 |
+ --node <name|ip> Restrict to one node. Can be repeated. |
|
| 33 |
+ --user <user> Remote SSH user for setup.sh projects (default: root) |
|
| 34 |
+ --dry-run Show resolved targets and commands without executing |
|
| 35 |
+ -h, --help Show this help |
|
| 36 |
+ |
|
| 37 |
+Examples: |
|
| 38 |
+ $0 pve-guests-state |
|
| 39 |
+ $0 pve-guests-state install --node ebony |
|
| 40 |
+ $0 autoNAS install |
|
| 41 |
+ $0 autoSMART status --node 192.168.2.92 |
|
| 42 |
+EOF |
|
| 43 |
+} |
|
| 44 |
+ |
|
| 45 |
+require_config() {
|
|
| 46 |
+ if [[ ! -f "${CONFIG_PATH}" ]]; then
|
|
| 47 |
+ echo "ERROR: missing cluster config: ${CONFIG_PATH}" >&2
|
|
| 48 |
+ exit 1 |
|
| 49 |
+ fi |
|
| 50 |
+} |
|
| 51 |
+ |
|
| 52 |
+require_project() {
|
|
| 53 |
+ PROJECT_DIR="${ROOT_DIR}/projects/${PROJECT_NAME}"
|
|
| 54 |
+ if [[ ! -d "${PROJECT_DIR}" ]]; then
|
|
| 55 |
+ echo "ERROR: unknown project: ${PROJECT_NAME}" >&2
|
|
| 56 |
+ exit 1 |
|
| 57 |
+ fi |
|
| 58 |
+ |
|
| 59 |
+ if [[ -x "${PROJECT_DIR}/setup.sh" ]]; then
|
|
| 60 |
+ DEPLOY_MODE="setup" |
|
| 61 |
+ return |
|
| 62 |
+ fi |
|
| 63 |
+ |
|
| 64 |
+ if [[ -x "${PROJECT_DIR}/deploy.sh" ]]; then
|
|
| 65 |
+ DEPLOY_MODE="deploy" |
|
| 66 |
+ return |
|
| 67 |
+ fi |
|
| 68 |
+ |
|
| 69 |
+ echo "ERROR: project ${PROJECT_NAME} has neither setup.sh nor deploy.sh" >&2
|
|
| 70 |
+ exit 1 |
|
| 71 |
+} |
|
| 72 |
+ |
|
| 73 |
+load_targets() {
|
|
| 74 |
+ local entry="" |
|
| 75 |
+ TARGETS=() |
|
| 76 |
+ |
|
| 77 |
+ while IFS= read -r entry; do |
|
| 78 |
+ [[ -n "${entry}" ]] && TARGETS+=("${entry}")
|
|
| 79 |
+ done < <( |
|
| 80 |
+ jq -r --arg cluster "${CLUSTER_NAME}" '
|
|
| 81 |
+ .clusters[$cluster].nodes |
|
| 82 |
+ | to_entries[] |
|
| 83 |
+ | .key + "\t" + ((.value.ip // .value.wan.vmbr443.address // empty) | split("/")[0])
|
|
| 84 |
+ ' "${CONFIG_PATH}"
|
|
| 85 |
+ ) |
|
| 86 |
+ |
|
| 87 |
+ if [[ ${#TARGETS[@]} -eq 0 ]]; then
|
|
| 88 |
+ echo "ERROR: no targets found for cluster ${CLUSTER_NAME}" >&2
|
|
| 89 |
+ exit 1 |
|
| 90 |
+ fi |
|
| 91 |
+} |
|
| 92 |
+ |
|
| 93 |
+match_filter() {
|
|
| 94 |
+ local filter="$1" |
|
| 95 |
+ local node_name="$2" |
|
| 96 |
+ local node_ip="$3" |
|
| 97 |
+ |
|
| 98 |
+ [[ "${filter}" == "${node_name}" || "${filter}" == "${node_ip}" ]]
|
|
| 99 |
+} |
|
| 100 |
+ |
|
| 101 |
+filter_targets() {
|
|
| 102 |
+ local filtered=() |
|
| 103 |
+ local entry="" |
|
| 104 |
+ local filter="" |
|
| 105 |
+ local node_name="" |
|
| 106 |
+ local node_ip="" |
|
| 107 |
+ |
|
| 108 |
+ if [[ ${#NODE_FILTERS[@]} -eq 0 ]]; then
|
|
| 109 |
+ return |
|
| 110 |
+ fi |
|
| 111 |
+ |
|
| 112 |
+ for entry in "${TARGETS[@]}"; do
|
|
| 113 |
+ node_name="${entry%%$'\t'*}"
|
|
| 114 |
+ node_ip="${entry#*$'\t'}"
|
|
| 115 |
+ for filter in "${NODE_FILTERS[@]}"; do
|
|
| 116 |
+ if match_filter "${filter}" "${node_name}" "${node_ip}"; then
|
|
| 117 |
+ filtered+=("${entry}")
|
|
| 118 |
+ break |
|
| 119 |
+ fi |
|
| 120 |
+ done |
|
| 121 |
+ done |
|
| 122 |
+ |
|
| 123 |
+ TARGETS=(${filtered[@]+"${filtered[@]}"}) # empty-safe under 'set -u' on bash < 4.4, so the error below is reached
|
|
| 124 |
+ |
|
| 125 |
+ if [[ ${#TARGETS[@]} -eq 0 ]]; then
|
|
| 126 |
+ echo "ERROR: no targets matched the provided --node filters" >&2 |
|
| 127 |
+ exit 1 |
|
| 128 |
+ fi |
|
| 129 |
+} |
|
| 130 |
+ |
|
| 131 |
+run_setup_project() {
|
|
| 132 |
+ local node_name="$1" |
|
| 133 |
+ local node_ip="$2" |
|
| 134 |
+ local cmd=() |
|
| 135 |
+ |
|
| 136 |
+ case "${COMMAND}" in
|
|
| 137 |
+ install) |
|
| 138 |
+ cmd=(bash "${PROJECT_DIR}/setup.sh" --user "${REMOTE_USER}" "${node_ip}")
|
|
| 139 |
+ ;; |
|
| 140 |
+ uninstall) |
|
| 141 |
+ cmd=(bash "${PROJECT_DIR}/setup.sh" --user "${REMOTE_USER}" --uninstall "${node_ip}")
|
|
| 142 |
+ ;; |
|
| 143 |
+ *) |
|
| 144 |
+ echo "ERROR: command ${COMMAND} is not supported for setup.sh-only projects" >&2
|
|
| 145 |
+ exit 1 |
|
| 146 |
+ ;; |
|
| 147 |
+ esac |
|
| 148 |
+ |
|
| 149 |
+ echo "==> ${PROJECT_NAME}: ${COMMAND} on ${node_name} (${node_ip})"
|
|
| 150 |
+ if [[ "${DRY_RUN}" -eq 1 ]]; then
|
|
| 151 |
+ printf 'DRY-RUN:' |
|
| 152 |
+ printf ' %q' "${cmd[@]}"
|
|
| 153 |
+ echo |
|
| 154 |
+ return |
|
| 155 |
+ fi |
|
| 156 |
+ |
|
| 157 |
+ (cd "${PROJECT_DIR}" && "${cmd[@]}")
|
|
| 158 |
+} |
|
| 159 |
+ |
|
| 160 |
+run_deploy_project() {
|
|
| 161 |
+ local node_name="$1" |
|
| 162 |
+ local node_ip="$2" |
|
| 163 |
+ local cmd=(bash "${PROJECT_DIR}/deploy.sh" "${COMMAND}" "${node_ip}")
|
|
| 164 |
+ |
|
| 165 |
+ echo "==> ${PROJECT_NAME}: ${COMMAND} on ${node_name} (${node_ip})"
|
|
| 166 |
+ if [[ "${DRY_RUN}" -eq 1 ]]; then
|
|
| 167 |
+ printf 'DRY-RUN:' |
|
| 168 |
+ printf ' %q' "${cmd[@]}"
|
|
| 169 |
+ echo |
|
| 170 |
+ return |
|
| 171 |
+ fi |
|
| 172 |
+ |
|
| 173 |
+ (cd "${PROJECT_DIR}" && "${cmd[@]}")
|
|
| 174 |
+} |
|
| 175 |
+ |
|
| 176 |
+parse_args() {
|
|
| 177 |
+ if [[ $# -lt 1 ]]; then |
|
| 178 |
+ usage |
|
| 179 |
+ exit 1 |
|
| 180 |
+ fi |
|
| 181 |
+ |
|
| 182 |
+ case "$1" in -h|--help) usage; exit 0 ;; esac; PROJECT_NAME="$1" # honor -h/--help given in place of a project name |
|
| 183 |
+ shift |
|
| 184 |
+ |
|
| 185 |
+ if [[ $# -gt 0 ]]; then |
|
| 186 |
+ case "$1" in |
|
| 187 |
+ install|uninstall|status|start|restart|stop) |
|
| 188 |
+ COMMAND="$1" |
|
| 189 |
+ shift |
|
| 190 |
+ ;; |
|
| 191 |
+ esac |
|
| 192 |
+ fi |
|
| 193 |
+ |
|
| 194 |
+ while [[ $# -gt 0 ]]; do |
|
| 195 |
+ case "$1" in |
|
| 196 |
+ --cluster) |
|
| 197 |
+ CLUSTER_NAME="${2:?--cluster requires a value}" |
|
| 198 |
+ shift 2 |
|
| 199 |
+ ;; |
|
| 200 |
+ --node) |
|
| 201 |
+ NODE_FILTERS+=("${2:?--node requires a value}")
|
|
| 202 |
+ shift 2 |
|
| 203 |
+ ;; |
|
| 204 |
+ --user) |
|
| 205 |
+ REMOTE_USER="${2:?--user requires a value}" |
|
| 206 |
+ shift 2 |
|
| 207 |
+ ;; |
|
| 208 |
+ --dry-run) |
|
| 209 |
+ DRY_RUN=1 |
|
| 210 |
+ shift |
|
| 211 |
+ ;; |
|
| 212 |
+ -h|--help) |
|
| 213 |
+ usage |
|
| 214 |
+ exit 0 |
|
| 215 |
+ ;; |
|
| 216 |
+ *) |
|
| 217 |
+ echo "ERROR: unknown option: $1" >&2 |
|
| 218 |
+ usage |
|
| 219 |
+ exit 1 |
|
| 220 |
+ ;; |
|
| 221 |
+ esac |
|
| 222 |
+ done |
|
| 223 |
+} |
|
| 224 |
+ |
|
| 225 |
+main() {
|
|
| 226 |
+ parse_args "$@" |
|
| 227 |
+ require_config |
|
| 228 |
+ require_project |
|
| 229 |
+ load_targets |
|
| 230 |
+ filter_targets |
|
| 231 |
+ |
|
| 232 |
+ echo "Project: ${PROJECT_NAME}"
|
|
| 233 |
+ echo "Mode: ${DEPLOY_MODE}"
|
|
| 234 |
+ echo "Command: ${COMMAND}"
|
|
| 235 |
+ echo "Cluster: ${CLUSTER_NAME}"
|
|
| 236 |
+ echo "Targets:" |
|
| 237 |
+ printf ' %s\n' "${TARGETS[@]}"
|
|
| 238 |
+ echo |
|
| 239 |
+ |
|
| 240 |
+ local entry="" |
|
| 241 |
+ local node_name="" |
|
| 242 |
+ local node_ip="" |
|
| 243 |
+ |
|
| 244 |
+ for entry in "${TARGETS[@]}"; do
|
|
| 245 |
+ node_name="${entry%%$'\t'*}"
|
|
| 246 |
+ node_ip="${entry#*$'\t'}"
|
|
| 247 |
+ if [[ "${DEPLOY_MODE}" == "setup" ]]; then
|
|
| 248 |
+ run_setup_project "${node_name}" "${node_ip}"
|
|
| 249 |
+ else |
|
| 250 |
+ run_deploy_project "${node_name}" "${node_ip}"
|
|
| 251 |
+ fi |
|
| 252 |
+ echo |
|
| 253 |
+ done |
|
| 254 |
+} |
|
| 255 |
+ |
|
| 256 |
+main "$@" |
|