autoSMART v1.0 - Intelligent HDD Monitoring & Failure Prediction

autoSMART is an intelligent SMART monitoring system for the HDDs in a Proxmox cluster, with AI-based failure prediction and optimized storage in PostgreSQL.

🎯 Project Goals

  • Continuous monitoring of SMART parameters for all HDDs in the cluster
  • AI predictions of imminent failures using the OpenAI API
  • Long-term storage in PostgreSQL for time-series analysis
  • Proactive alerting for preventive maintenance

Key Features

  • 🔍 Hardware-based HDD tracking: Permanent identification using serial numbers and model names (not volatile /dev/sdX paths)
  • 🔄 Migration detection: Automatic detection and logging when HDDs move between nodes or device paths
  • 💾 Differential storage optimization: Store only SMART readings with changes, reducing database size by 60-80%
  • 🤖 AI-powered failure prediction: Uses OpenAI GPT for intelligent drive failure forecasting
  • 🏥 Health monitoring: Continuous SMART parameter analysis with configurable thresholds
  • 📊 Comprehensive reporting: Detailed drive health reports and predictive analytics
  • 🔧 Proxmox cluster integration: Designed for distributed Proxmox VE environments
  • ⚡ High performance: PostgreSQL backend with optimized indexing and queries
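
As an illustration of hardware-based tracking, the following Python sketch derives a stable drive identifier from the model name and serial number instead of the device path. It assumes the input is the parsed output of `smartctl --json -i`, which reports `model_name` and `serial_number`; the function name `drive_identity`, the `MODEL|SERIAL` key format, and the sample values are illustrative assumptions, not autoSMART's actual Perl implementation.

```python
import json


def drive_identity(info: dict) -> str:
    """Build a stable drive ID from model and serial number, never from /dev/sdX."""
    # Collapse repeated whitespace so cosmetic differences do not split one drive in two.
    model = " ".join(str(info.get("model_name", "")).split())
    serial = str(info.get("serial_number", "")).strip()
    if not model or not serial:
        raise ValueError("model_name and serial_number are both required")
    return f"{model}|{serial}".upper()


# Sample shaped like `smartctl --json -i /dev/sda` output (values are made up).
sample = json.loads(
    '{"model_name": "WDC  WD40EFRX-68N32N0", "serial_number": "WD-WCC7K0000000"}'
)
print(drive_identity(sample))  # prints WDC WD40EFRX-68N32N0|WD-WCC7K0000000
```

Because the device path never enters the key, the same drive yields the same identifier whether it appears as /dev/sda on one node or /dev/sdc on another.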

🚀 Quick Start

Prerequisites

  • PostgreSQL 13+ for data storage
  • Perl 5.20+ with required modules
  • Proxmox VE cluster environment
  • smartmontools for SMART data collection
  • OpenAI API key for failure predictions

Installation

# 1. Download autoSMART and run automated deployment
git clone <repository-url>
cd autoSMART
sudo ./scripts/deploy.sh install

# The deployment script automatically:
# - Installs all dependencies (Perl modules, smartmontools, etc.)
# - Creates system directories and sets permissions
# - Deploys application files to /opt/autoSMART/
# - Creates configuration files in /etc/autosmart/
# - Registers and starts systemd services
# - Performs initial system validation

# 2. Configure database connection (interactive prompts during install)
# 3. Configure OpenAI API key (interactive prompts during install)
# 4. System is ready - services are automatically started

Verification

# Check system status (all three services should be active)
sudo systemctl status autosmart autosmart-collector autosmart-predictor

# View recent SMART data collection
sudo journalctl -u autosmart-collector -f

# Generate initial health report
sudo /opt/autoSMART/scripts/autosmart-report.pl --summary

📚 Documentation

Getting Started

System Configuration

  • API.md - OpenAI API integration and configuration

🏥 Monitoring Dashboard

autoSMART provides comprehensive monitoring capabilities:

Health Status Overview

  • Real-time drive health status for all cluster nodes
  • Critical parameter alerts and warnings
  • AI-powered failure predictions with confidence scores
  • Storage efficiency metrics

Historical Analysis

  • Long-term SMART parameter trends
  • Performance degradation tracking
  • Migration history between nodes
  • Predictive analytics reports
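
One simple way to quantify a long-term SMART trend is a least-squares slope over the stored history. The sketch below is only an illustration in Python: the function name `slope_per_day` and the sample values are assumptions, and the README does not state which trend method autoSMART actually uses.

```python
def slope_per_day(samples):
    """Least-squares slope of value over time; samples are (day, value) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov_xy / var_x


# Hypothetical Reallocated_Sector_Ct raw values sampled on days 0..4.
history = [(0, 0), (1, 0), (2, 2), (3, 4), (4, 4)]
print(f"{slope_per_day(history):.2f} sectors/day")  # prints 1.20 sectors/day
```

A slope that stays above zero for a counter such as Reallocated_Sector_Ct is a more useful degradation signal than any single reading.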

Alerting System

  • Configurable thresholds for all SMART parameters
  • Email/webhook notifications
  • Integration with monitoring systems
  • Escalation procedures for critical alerts

🔧 System Architecture

autoSMART operates as a distributed system across your Proxmox cluster:

Data Collection

  • Continuous SMART data collection from all nodes
  • Hardware-based drive identification
  • Migration detection and logging
  • Differential storage for efficiency
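
Migration detection can be sketched as a comparison between the last known location of a drive and the location seen in the current collection run. The Python below is a minimal illustration under assumptions: the `Location` type, the `detect_migration` function, and the message format are hypothetical, and the drive ID is assumed to be a hardware-based key (model plus serial number).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    node: str
    device_path: str


def detect_migration(last_seen: dict, drive_id: str, current: Location) -> Optional[str]:
    """Record the current location and return a log message if the drive moved."""
    previous = last_seen.get(drive_id)
    last_seen[drive_id] = current
    if previous is None or previous == current:
        return None  # first sighting, or nothing changed
    if previous.node != current.node:
        return f"{drive_id}: moved from node {previous.node} to {current.node}"
    return (
        f"{drive_id}: path changed {previous.device_path} -> "
        f"{current.device_path} on {current.node}"
    )


seen = {}
detect_migration(seen, "D1", Location("pve1", "/dev/sda"))
print(detect_migration(seen, "D1", Location("pve2", "/dev/sdb")))
# prints D1: moved from node pve1 to pve2
```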

Analysis Engine

  • AI-powered failure prediction
  • Threshold-based alerting
  • Trend analysis and reporting
  • Performance optimization recommendations

Storage Layer

  • PostgreSQL database with optimized schema
  • Differential storage reducing size by 60-80%
  • Historical data retention policies
  • Automated backup and maintenance

📁 Installed File Structure

When autoSMART is installed on your system, it creates the following directory structure:

System Directories

/opt/autoSMART/                    # Main installation directory
├── scripts/                      # Executable scripts and utilities
│   ├── autosmart-collector.pl    # Main data collection daemon
│   ├── autosmart-predictor.pl    # AI prediction processing
│   ├── autosmart-report.pl       # Report generation engine
│   ├── autosmart-migration-report.pl # Hardware migration analysis
│   ├── smart-collector-daemon.pl # Background collection service
│   ├── uninstall.sh             # System removal script
│   ├── monitor-cluster.sh        # Cluster health monitoring
│   └── test-*.pl                # Testing and validation utilities
├── lib/                         # Perl modules and core libraries
│   ├── SmartCollector.pm        # SMART data collection and hardware tracking
│   └── PredictionEngine.pm      # AI-powered failure prediction engine
├── config/                      # Configuration templates and examples
│   └── (template files)        # Default configuration templates
└── docs/                        # End-user documentation
    ├── README.md               # System overview and quick start
    ├── CHANGELOG.md            # Release notes and version history
    └── API.md                  # OpenAI API configuration guide

/etc/autosmart/                   # System configuration directory
├── autosmart.conf              # Main system configuration
├── cluster.conf                # Cluster topology and node definitions
├── database.conf               # PostgreSQL connection settings
├── openai.conf                 # OpenAI API configuration and prompts
└── smart.conf                  # SMART parameter thresholds and monitoring rules

/etc/systemd/system/             # Systemd service files
├── autosmart.service           # Main autoSMART service
├── autosmart-collector.service # Data collection service
└── autosmart-predictor.service # AI prediction service

Configuration Files Detail

/etc/autosmart/autosmart.conf

Main system configuration file containing:

  • Database connection parameters
  • Collection intervals and scheduling
  • Local node identification and settings
  • Log levels and debugging options

/etc/autosmart/cluster.conf

Cluster-wide configuration shared across all nodes:

  • Node topology and IP addresses
  • Shared monitoring parameters
  • Cluster-wide alert settings
  • Inter-node communication settings

/etc/autosmart/database.conf

PostgreSQL database connection settings:

  • Database host, port, and credentials
  • Connection pooling configuration
  • SSL settings and security parameters
  • Performance tuning options

/etc/autosmart/openai.conf

OpenAI API integration configuration:

  • API key and model selection
  • Prompt templates for failure prediction
  • Response parsing and confidence thresholds
  • Rate limiting and cost management
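
To make the response parsing and confidence threshold concrete, here is a hedged Python sketch. The reply shape (a JSON object with `failure_probability` and `confidence`, both between 0 and 1), the function name `parse_prediction`, and the default threshold of 0.7 are all assumptions for illustration; the actual prompt and reply format are defined by the templates in openai.conf and are not documented here.

```python
import json


def parse_prediction(raw_reply: str, min_confidence: float = 0.7):
    """Return (failure_probability, confidence), or None if the reply is unusable."""
    try:
        data = json.loads(raw_reply)
        probability = float(data["failure_probability"])
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return None  # not JSON, wrong shape, or missing fields
    if not (0.0 <= probability <= 1.0 and 0.0 <= confidence <= 1.0):
        return None  # out-of-range values are treated as malformed
    if confidence < min_confidence:
        return None  # too uncertain to act on
    return probability, confidence


print(parse_prediction('{"failure_probability": 0.9, "confidence": 0.8}'))
# prints (0.9, 0.8)
```

Treating malformed and low-confidence replies the same way (no prediction) keeps a bad model response from triggering an alert.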

/etc/autosmart/smart.conf

SMART parameter monitoring configuration:

  • Parameter thresholds for different drive types
  • Critical parameter definitions
  • Alert escalation rules and notifications
  • Drive-specific monitoring settings

Service Integration

Systemd Services

  • autosmart.service: Main system service that manages other components
  • autosmart-collector.service: Background data collection service
  • autosmart-predictor.service: AI prediction processing service

Service Management

# Start/stop services
sudo systemctl start autosmart
sudo systemctl stop autosmart

# Enable/disable automatic startup
sudo systemctl enable autosmart
sudo systemctl disable autosmart

# Check service status
sudo systemctl status autosmart

# View service logs using systemd journal
sudo journalctl -u autosmart -f                    # Follow main service logs
sudo journalctl -u autosmart-collector -f          # Follow data collection logs  
sudo journalctl -u autosmart-predictor -f          # Follow AI prediction logs

# View logs by time period
sudo journalctl -u autosmart --since "1 hour ago"  # Last hour
sudo journalctl -u autosmart --since today         # Today's logs
sudo journalctl -u autosmart --since yesterday     # Yesterday's logs

# View logs by priority level
sudo journalctl -u autosmart -p err                # Error level and above
sudo journalctl -u autosmart -p warning            # Warning level and above

File Permissions

Executable Files

  • All scripts in /opt/autoSMART/scripts/ are executable (755)
  • Perl modules in /opt/autoSMART/lib/ are readable (644)
  • Configuration files in /etc/autosmart/ are readable by autosmart user (640)
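
The modes listed above can be audited with a few lines of Python. This is a sketch, not part of autoSMART: `mode_of` and `check_modes` are hypothetical helper names, and the demo runs against a temporary file so it does not depend on an installed system.

```python
import os
import stat
import tempfile


def mode_of(path: str) -> str:
    """Return the permission bits of path as an octal string such as '755'."""
    return format(stat.S_IMODE(os.stat(path).st_mode), "o")


def check_modes(expected: dict) -> list:
    """Return (path, expected_mode, actual_mode) for every path whose mode differs."""
    mismatches = []
    for path, want in expected.items():
        have = mode_of(path)
        if have != want:
            mismatches.append((path, want, have))
    return mismatches


# Demo on a throwaway file: it is created with 644 although 755 is expected.
with tempfile.TemporaryDirectory() as workdir:
    script = os.path.join(workdir, "example-script.pl")
    open(script, "w").close()
    os.chmod(script, 0o644)
    print(check_modes({script: "755"}))
```

On an installed node, the same function could be given a mapping such as `{"/etc/autosmart/database.conf": "640"}`; an empty result means every listed file has the expected mode.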

Log Management

  • All application logs are handled by systemd journal
  • No separate log files created in filesystem
  • Log retention managed by journald configuration
  • Logs accessible via journalctl commands
  • Automatic log rotation and cleanup by systemd

Storage Requirements

Disk Space

  • Installation: ~50MB for application files and documentation
  • Configuration: ~1MB for all configuration files
  • Logs: Managed by systemd journal (configurable retention)
  • Database: Handled separately on PostgreSQL server

Network Requirements

  • Database Access: Persistent connection to PostgreSQL server
  • OpenAI API: HTTPS access for AI predictions (configurable)
  • Inter-node Communication: SSH access between cluster nodes for deployment

This layout follows standard Linux conventions and keeps application code (/opt/autoSMART/), configuration (/etc/autosmart/), service units (/etc/systemd/system/), and logs (systemd journal) clearly separated.

📊 Performance Benefits

Storage Optimization

  • 60-80% reduction in database storage through differential storage
  • Intelligent change detection stores only modified SMART parameters
  • Baseline reconstruction provides complete historical views
  • Configurable retention policies for long-term storage
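
The idea behind differential storage and baseline reconstruction can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual schema or Perl code: readings are modeled as plain dictionaries of SMART attribute values, and the function names `diff_reading` and `reconstruct` are hypothetical.

```python
def diff_reading(previous: dict, current: dict) -> dict:
    """Return only the attributes that are new or changed since the previous reading."""
    return {
        name: value
        for name, value in current.items()
        if name not in previous or previous[name] != value
    }


def reconstruct(baseline: dict, deltas: list) -> dict:
    """Replay stored deltas on top of the baseline to rebuild the latest full reading."""
    full = dict(baseline)
    for delta in deltas:
        full.update(delta)
    return full


# Made-up readings from three consecutive collection runs.
readings = [
    {"Reallocated_Sector_Ct": 0, "Temperature_Celsius": 34, "Power_On_Hours": 1200},
    {"Reallocated_Sector_Ct": 0, "Temperature_Celsius": 35, "Power_On_Hours": 1201},
    {"Reallocated_Sector_Ct": 0, "Temperature_Celsius": 35, "Power_On_Hours": 1202},
]
baseline = readings[0]
deltas = [diff_reading(prev, cur) for prev, cur in zip(readings, readings[1:])]

print(deltas)
# prints [{'Temperature_Celsius': 35, 'Power_On_Hours': 1201}, {'Power_On_Hours': 1202}]
print(reconstruct(baseline, deltas) == readings[-1])  # prints True
```

An empty delta means nothing changed and nothing needs to be written, which is where the storage saving comes from. The sketch does not handle attributes that disappear between readings.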

Monitoring Efficiency

  • Hardware-based tracking eliminates /dev/sdX path volatility
  • Migration detection automatically tracks drive movements
  • Real-time analysis with configurable collection intervals
  • Distributed architecture scales across cluster nodes

🚨 Alert Examples

Critical Alerts

  • Imminent Failure: AI predicts drive failure within 24-48 hours
  • Temperature Critical: Drive operating above safe temperature thresholds
  • Reallocated Sectors: Increasing bad sector count detected
  • Spin Retry Count: Mechanical issues detected

Warning Alerts

  • Performance Degradation: Slower response times detected
  • Temperature Warning: Operating temperatures approaching limits
  • SMART Threshold: Parameters approaching warning thresholds
  • Migration Detected: Drive moved to different node or path
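
The threshold-based part of these alerts can be illustrated with a small Python classifier. The limits below are invented for the example (real values would come from /etc/autosmart/smart.conf), and the names `THRESHOLDS`, `classify`, and `worst_severity` are hypothetical; only the attribute names follow smartctl's usual output.

```python
# Hypothetical limits on raw values; not autoSMART's shipped defaults.
THRESHOLDS = {
    "Temperature_Celsius": {"warning": 45, "critical": 55},
    "Reallocated_Sector_Ct": {"warning": 1, "critical": 50},
    "Spin_Retry_Count": {"warning": 1, "critical": 5},
}

SEVERITY_ORDER = ["ok", "warning", "critical"]


def classify(attribute: str, raw_value: int) -> str:
    """Map one SMART raw value to 'ok', 'warning', or 'critical'."""
    limits = THRESHOLDS.get(attribute)
    if limits is None:
        return "ok"  # attribute is not monitored in this sketch
    if raw_value >= limits["critical"]:
        return "critical"
    if raw_value >= limits["warning"]:
        return "warning"
    return "ok"


def worst_severity(reading: dict) -> str:
    """Return the most severe classification across all attributes of a reading."""
    levels = (classify(name, value) for name, value in reading.items())
    return max(levels, key=SEVERITY_ORDER.index, default="ok")


print(worst_severity({"Temperature_Celsius": 46, "Reallocated_Sector_Ct": 0}))
# prints warning
```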

💡 Use Cases

Preventive Maintenance

  • Schedule drive replacements before failures occur
  • Optimize workload distribution based on drive health
  • Plan cluster maintenance windows effectively
  • Track warranty and replacement schedules

Capacity Planning

  • Monitor storage growth trends
  • Predict future storage requirements
  • Optimize drive allocation across nodes
  • Plan cluster expansion timing

Performance Optimization

  • Identify performance bottlenecks
  • Balance load across healthy drives
  • Optimize I/O patterns based on drive characteristics
  • Monitor storage tier performance

🆘 Support & Troubleshooting

Common Issues

  • Collection failures: Check smartmontools installation
  • Database connectivity: Verify PostgreSQL connection settings
  • API errors: Validate OpenAI API key and quotas
  • Performance issues: Review differential storage configuration

Log Analysis

Use the systemd journal for log analysis. Quote the unit pattern so the shell passes it to journalctl unexpanded:

  • All service logs: sudo journalctl -u 'autosmart*'
  • Data collection: sudo journalctl -u autosmart-collector
  • AI predictions: sudo journalctl -u autosmart-predictor
  • System errors: sudo journalctl -u 'autosmart*' -p err

Getting Help

For detailed installation, configuration, and troubleshooting information, refer to the complete documentation in the docs/ directory.


autoSMART v1.0 - Intelligent drive monitoring for mission-critical infrastructure