Bogdan Timofte authored 3 months ago

# autoSMART v1.0 - Intelligent HDD Monitoring & Failure Prediction

autoSMART is an intelligent SMART monitoring system for the HDDs in a Proxmox cluster, with AI-based failure predictions and optimized storage in PostgreSQL.

## 🎯 **Project Goals**

- **Continuous monitoring** of SMART parameters for all HDDs in the cluster
- **AI predictions** of imminent failures using the OpenAI API
- **Long-term storage** in PostgreSQL for time-series analysis
- **Proactive alerting** for preventive maintenance

## Key Features

- **🔍 Hardware-based HDD tracking**: Permanent identification using serial numbers and model names (not volatile /dev/sdX paths)
- **🔄 Migration detection**: Automatic detection and logging when HDDs move between nodes or device paths
- **💾 Differential storage optimization**: Store only SMART readings with changes, reducing database size by 60-80%
- **🤖 AI-powered failure prediction**: Uses OpenAI GPT for intelligent drive failure forecasting
- **🏥 Health monitoring**: Continuous SMART parameter analysis with configurable thresholds
- **📊 Comprehensive reporting**: Detailed drive health reports and predictive analytics
- **🔧 Proxmox cluster integration**: Designed for distributed Proxmox VE environments
- **⚡ High performance**: PostgreSQL backend with optimized indexing and queries

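
As a minimal sketch of hardware-based identification, the function below derives a stable key from the model and serial number reported by `smartctl -i`. It assumes ATA-style output with `Device Model:` and `Serial Number:` fields. The `drive_key` name and the `model|serial` key format are illustrative assumptions, not taken from autoSMART's code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a stable drive key from `smartctl -i` output on stdin.
# Assumes ATA-style fields ("Device Model:", "Serial Number:"); SAS and NVMe
# drives label these fields differently and would need extra patterns.
drive_key() {
  awk -F': *' '
    /^Device Model:/  { model  = $2 }
    /^Serial Number:/ { serial = $2 }
    END {
      if (model == "" || serial == "") exit 1   # refuse to guess an identity
      print model "|" serial
    }
  '
}

# Usage (requires smartmontools and root):
#   smartctl -i /dev/sda | drive_key
```

Because the key depends only on what the drive reports about itself, it stays the same when the kernel assigns a different `/dev/sdX` name after a reboot or a move to another node.
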
## 🚀 Quick Start

### Prerequisites
- **PostgreSQL 13+** for data storage
- **Perl 5.20+** with required modules
- **Proxmox VE** cluster environment
- **smartmontools** for SMART data collection
- **OpenAI API key** for failure predictions

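
Before running the installer, it can help to confirm that the local tools are present. The sketch below is a hypothetical pre-flight check, not part of autoSMART: `missing_commands` and `perl_is_recent_enough` are illustrative names, and checking for the `psql` client assumes it is installed on the node even when the PostgreSQL server runs elsewhere.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper: print each named command that is not on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Hypothetical check for the Perl 5.20+ prerequisite; $] holds the numeric
# interpreter version (5.020000 for Perl 5.20.0).
perl_is_recent_enough() {
  perl -e 'exit($] >= 5.020 ? 0 : 1)'
}

# Usage:
#   missing_commands perl psql smartctl
#   perl_is_recent_enough || echo "Perl 5.20+ required" >&2
```
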
### Installation
```bash
# 1. Download autoSMART and run automated deployment
git clone <repository-url>
cd autoSMART
sudo ./scripts/deploy.sh install

# The deployment script automatically:
# - Installs all dependencies (Perl modules, smartmontools, etc.)
# - Creates system directories and sets permissions
# - Deploys application files to /opt/autoSMART/
# - Creates configuration files in /etc/autosmart/
# - Registers and starts systemd services
# - Performs initial system validation

# 2. Configure database connection (interactive prompts during install)
# 3. Configure OpenAI API key (interactive prompts during install)
# 4. System is ready - services are automatically started
```

### Verification
```bash
# Check system status (all services should be active)
sudo systemctl status autosmart

# View recent SMART data collection
sudo journalctl -u autosmart-collector -f

# Generate initial health report
sudo /opt/autoSMART/scripts/autosmart-report.pl --summary
```

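
To check all three units at once rather than one at a time, a small wrapper around `systemctl is-active` is enough. This is a sketch under the assumption that the unit names match those listed in this README; `check_services` is an illustrative name and is not shipped with autoSMART.

```bash
#!/usr/bin/env bash
# Report the state of each autoSMART unit named in this README.
# `systemctl is-active --quiet` exits 0 only when the unit is active.
check_services() {
  local unit failed=0
  for unit in autosmart autosmart-collector autosmart-predictor; do
    if systemctl is-active --quiet "$unit"; then
      echo "OK   $unit"
    else
      echo "DOWN $unit"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_services || echo "at least one autoSMART service is not active" >&2
```
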
## 📚 Documentation

### Getting Started
- **[CHANGELOG.md](CHANGELOG.md)** - Version history and release notes

### System Configuration
- **[API.md](API.md)** - OpenAI API integration and configuration

## 🏥 Monitoring Dashboard

autoSMART provides comprehensive monitoring capabilities:

### Health Status Overview
- Real-time drive health status for all cluster nodes
- Critical parameter alerts and warnings
- AI-powered failure predictions with confidence scores
- Storage efficiency metrics

### Historical Analysis
- Long-term SMART parameter trends
- Performance degradation tracking
- Migration history between nodes
- Predictive analytics reports

### Alerting System
- Configurable thresholds for all SMART parameters
- Email/webhook notifications
- Integration with monitoring systems
- Escalation procedures for critical alerts

## 🔧 System Architecture

autoSMART operates as a distributed system across your Proxmox cluster:

### Data Collection
- Continuous SMART data collection from all nodes
- Hardware-based drive identification
- Migration detection and logging
- Differential storage for efficiency

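
The collection step can be pictured as reducing `smartctl -A` output to a compact list of attributes. The sketch below assumes the ATA attribute table, where each row starts with a numeric ID and the raw value sits in column 10. The `parse_smart_attributes` name is hypothetical, and autoSMART's own collector may parse the data differently.

```bash
#!/usr/bin/env bash
# Hypothetical parser: reduce `smartctl -A` output (stdin) to "ID NAME RAW" lines.
# Assumes the ATA attribute table, whose rows start with a numeric ID and carry
# the raw value in column 10. Trailing annotations such as "(Min/Max 20/48)"
# are dropped because only the first raw token is kept.
parse_smart_attributes() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 { print $1, $2, $10 }'
}

# Usage (requires smartmontools and root):
#   smartctl -A /dev/sda | parse_smart_attributes
```
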
### Analysis Engine
- AI-powered failure prediction
- Threshold-based alerting
- Trend analysis and reporting
- Performance optimization recommendations

### Storage Layer
- PostgreSQL database with optimized schema
- Differential storage reducing size by 60-80%
- Historical data retention policies
- Automated backup and maintenance

## 📁 Installed File Structure

When autoSMART is installed on your system, it creates the following directory structure:

### System Directories

```
/opt/autoSMART/                           # Main installation directory
├── scripts/                              # Executable scripts and utilities
│   ├── autosmart-collector.pl            # Main data collection daemon
│   ├── autosmart-predictor.pl            # AI prediction processing
│   ├── autosmart-report.pl               # Report generation engine
│   ├── autosmart-migration-report.pl     # Hardware migration analysis
│   ├── smart-collector-daemon.pl         # Background collection service
│   ├── uninstall.sh                      # System removal script
│   ├── monitor-cluster.sh                # Cluster health monitoring
│   └── test-*.pl                         # Testing and validation utilities
├── lib/                                  # Perl modules and core libraries
│   ├── SmartCollector.pm                 # SMART data collection and hardware tracking
│   └── PredictionEngine.pm               # AI-powered failure prediction engine
├── config/                               # Configuration templates and examples
│   └── (template files)                  # Default configuration templates
├── docs/                                 # End-user documentation
│   ├── README.md                         # System overview and quick start
│   ├── CHANGELOG.md                      # Release notes and version history
│   └── API.md                            # OpenAI API configuration guide

/etc/autosmart/                           # System configuration directory
├── autosmart.conf                        # Main system configuration
├── cluster.conf                          # Cluster topology and node definitions
├── database.conf                         # PostgreSQL connection settings
├── openai.conf                           # OpenAI API configuration and prompts
└── smart.conf                            # SMART parameter thresholds and monitoring rules

/etc/systemd/system/                      # Systemd service files
├── autosmart.service                     # Main autoSMART service
├── autosmart-collector.service           # Data collection service
└── autosmart-predictor.service           # AI prediction service
```

### Configuration Files Detail

#### `/etc/autosmart/autosmart.conf`
Main system configuration file containing:
- Database connection parameters
- Collection intervals and scheduling
- Local node identification and settings
- Log levels and debugging options

#### `/etc/autosmart/cluster.conf`
Cluster-wide configuration shared across all nodes:
- Node topology and IP addresses
- Shared monitoring parameters
- Cluster-wide alert settings
- Inter-node communication settings

#### `/etc/autosmart/database.conf`
PostgreSQL database connection settings:
- Database host, port, and credentials
- Connection pooling configuration
- SSL settings and security parameters
- Performance tuning options

#### `/etc/autosmart/openai.conf`
OpenAI API integration configuration:
- API key and model selection
- Prompt templates for failure prediction
- Response parsing and confidence thresholds
- Rate limiting and cost management

#### `/etc/autosmart/smart.conf`
SMART parameter monitoring configuration:
- Parameter thresholds for different drive types
- Critical parameter definitions
- Alert escalation rules and notifications
- Drive-specific monitoring settings

### Service Integration

#### Systemd Services
- **`autosmart.service`**: Main system service that manages other components
- **`autosmart-collector.service`**: Background data collection service
- **`autosmart-predictor.service`**: AI prediction processing service

#### Service Management
```bash
# Start/stop services
sudo systemctl start autosmart
sudo systemctl stop autosmart

# Enable/disable automatic startup
sudo systemctl enable autosmart
sudo systemctl disable autosmart

# Check service status
sudo systemctl status autosmart

# View service logs using systemd journal
sudo journalctl -u autosmart -f              # Follow main service logs
sudo journalctl -u autosmart-collector -f    # Follow data collection logs
sudo journalctl -u autosmart-predictor -f    # Follow AI prediction logs

# View logs by time period
sudo journalctl -u autosmart --since "1 hour ago"   # Last hour
sudo journalctl -u autosmart --since today          # Today's logs
sudo journalctl -u autosmart --since yesterday      # Yesterday's logs

# View logs by priority level
sudo journalctl -u autosmart -p err          # Error level and above
sudo journalctl -u autosmart -p warning      # Warning level and above
```

### File Permissions

#### Executable Files
- All scripts in `/opt/autoSMART/scripts/` are executable (755)
- Perl modules in `/opt/autoSMART/lib/` are readable (644)
- Configuration files in `/etc/autosmart/` are readable by autosmart user (640)

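
The modes listed above can be audited with a short helper. This is a hypothetical sketch, not an autoSMART script: `has_mode` is an illustrative name, and it relies on GNU coreutils `stat -c '%a'`, which is what a Debian-based Proxmox node provides.

```bash
#!/usr/bin/env bash
# Hypothetical audit helper: succeed only if FILE has the expected octal mode.
# Uses GNU coreutils `stat -c '%a'`, which prints the mode as octal digits.
# Returns 2 when the file cannot be inspected at all.
has_mode() {
  local file=$1 expected=$2 actual
  actual=$(stat -c '%a' "$file") || return 2
  [ "$actual" = "$expected" ]
}

# Usage, with the modes listed above:
#   has_mode /etc/autosmart/database.conf 640 || echo "unexpected mode" >&2
#   has_mode /opt/autoSMART/scripts/autosmart-report.pl 755
```
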
#### Log Management
- All application logs are handled by systemd journal
- No separate log files created in filesystem
- Log retention managed by journald configuration
- Logs accessible via `journalctl` commands
- Automatic log rotation and cleanup by systemd

### Storage Requirements

#### Disk Space
- **Installation**: ~50MB for application files and documentation
- **Configuration**: ~1MB for all configuration files
- **Logs**: Managed by systemd journal (configurable retention)
- **Database**: Handled separately on PostgreSQL server

#### Network Requirements
- **Database Access**: Persistent connection to PostgreSQL server
- **OpenAI API**: HTTPS access for AI predictions (configurable)
- **Inter-node Communication**: SSH access between cluster nodes for deployment

This file structure provides a complete, organized installation that integrates seamlessly with Linux system conventions while maintaining clear separation between application code, configuration, and operational data.

## 📊 Performance Benefits

### Storage Optimization
- **60-80% reduction** in database storage through differential storage
- **Intelligent change detection** stores only modified SMART parameters
- **Baseline reconstruction** provides complete historical views
- **Configurable retention** policies for long-term storage

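
The idea behind differential storage and baseline reconstruction can be sketched with two small functions. The snapshot format used here (one `ATTRIBUTE_ID RAW_VALUE` pair per line in a text file) and the function names are assumptions for illustration only. autoSMART itself stores readings in PostgreSQL, and its actual schema is not shown in this README.

```bash
#!/usr/bin/env bash
# Snapshots are text files with one "ATTRIBUTE_ID RAW_VALUE" pair per line.
# This file format is an assumption for illustration, not autoSMART's schema.

# Print only the pairs in CURRENT that are new or differ from PREVIOUS.
changed_attributes() {
  local previous=$1 current=$2
  awk '
    FILENAME == ARGV[1] { prev[$1] = $2; next }        # load previous snapshot
    !($1 in prev) || prev[$1] != $2 { print $1, $2 }   # keep changes only
  ' "$previous" "$current"
}

# Rebuild a full snapshot by applying a delta on top of a baseline.
# Later files win, so values in DELTA override those in BASELINE.
apply_delta() {
  local baseline=$1 delta=$2
  awk '{ value[$1] = $2 } END { for (id in value) print id, value[id] }' \
    "$baseline" "$delta" | sort -n
}

# Usage:
#   changed_attributes prev.txt cur.txt > delta.txt
#   apply_delta prev.txt delta.txt
```

Most SMART attributes stay constant between consecutive readings, so the delta is usually much smaller than the full snapshot. This sketch does not represent attributes that disappear between readings.
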
### Monitoring Efficiency
- **Hardware-based tracking** eliminates /dev/sdX path volatility
- **Migration detection** automatically tracks drive movements
- **Real-time analysis** with configurable collection intervals
- **Distributed architecture** scales across cluster nodes

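
Migration detection follows from hardware-based tracking: once a drive is keyed by serial number, a change of node or device path for a known serial is a migration. The sketch below assumes a plain-text registry with one `SERIAL NODE DEVICE_PATH` line per known drive. Both the registry format and the `detect_migration` name are hypothetical.

```bash
#!/usr/bin/env bash
# Registry format (assumed for illustration): one "SERIAL NODE DEVICE_PATH" line
# per known drive, recording where it was last seen.

# Print a migration message when SERIAL is known but NODE or PATH changed.
# Prints nothing for unchanged or previously unseen drives.
detect_migration() {
  local registry=$1 serial=$2 node=$3 path=$4
  awk -v s="$serial" -v n="$node" -v p="$path" '
    # Appending "" forces a string comparison, so all-digit serials such as
    # "0012" and "12" are not treated as equal numbers.
    ($1 "") == (s "") && ($2 != n || $3 != p) {
      printf "%s moved: %s:%s -> %s:%s\n", s, $2, $3, n, p
    }
  ' "$registry"
}

# Usage:
#   detect_migration registry.txt WD-WCC7K0123456 pve2 /dev/sdb
```
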
## 🚨 Alert Examples

### Critical Alerts
- **Imminent Failure**: AI predicts drive failure within 24-48 hours
- **Temperature Critical**: Drive operating above safe temperature thresholds
- **Reallocated Sectors**: Increasing bad sector count detected
- **Spin Retry Count**: Mechanical issues detected

### Warning Alerts
- **Performance Degradation**: Slower response times detected
- **Temperature Warning**: Operating temperatures approaching limits
- **SMART Threshold**: Parameters approaching warning thresholds
- **Migration Detected**: Drive moved to different node or path

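
The split between warning and critical alerts amounts to comparing a reading against two thresholds. A minimal sketch, assuming integer readings where higher is worse: the `classify_reading` name is hypothetical, and the numbers in the usage line are placeholders rather than autoSMART defaults (the real thresholds live in `/etc/autosmart/smart.conf`).

```bash
#!/usr/bin/env bash
# Hypothetical classifier: map an integer reading to OK, WARNING, or CRITICAL.
# The thresholds are caller-supplied; the values in the usage line are
# placeholders, not autoSMART defaults.
classify_reading() {
  local value=$1 warn=$2 crit=$3
  if   [ "$value" -ge "$crit" ]; then echo CRITICAL
  elif [ "$value" -ge "$warn" ]; then echo WARNING
  else                                echo OK
  fi
}

# Usage: drive temperature in degrees Celsius
#   classify_reading 47 45 55
```
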
## 💡 Use Cases

### Preventive Maintenance
- Schedule drive replacements before failures occur
- Optimize workload distribution based on drive health
- Plan cluster maintenance windows effectively
- Track warranty and replacement schedules

### Capacity Planning
- Monitor storage growth trends
- Predict future storage requirements
- Optimize drive allocation across nodes
- Plan cluster expansion timing

### Performance Optimization
- Identify performance bottlenecks
- Balance load across healthy drives
- Optimize I/O patterns based on drive characteristics
- Monitor storage tier performance

## 🆘 Support & Troubleshooting

### Common Issues
- **Collection failures**: Check smartmontools installation
- **Database connectivity**: Verify PostgreSQL connection settings
- **API errors**: Validate OpenAI API key and quotas
- **Performance issues**: Review differential storage configuration

### Log Analysis
Use the systemd journal for comprehensive log analysis (quote the unit pattern so the shell does not expand it):
- **All service logs**: `sudo journalctl -u 'autosmart*'`
- **Data collection**: `sudo journalctl -u autosmart-collector`
- **AI predictions**: `sudo journalctl -u autosmart-predictor`
- **System errors**: `sudo journalctl -u 'autosmart*' -p err`

### Getting Help
For detailed installation, configuration, and troubleshooting information, refer to the complete documentation in the `docs/` directory.

---

**autoSMART v1.0** - Intelligent drive monitoring for mission-critical infrastructure