
autoSMART Differential Storage System

Overview

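autoSMART v1.0 implements differential storage: instead of saving every SMART reading in full, it stores only what changed between collection cycles. This reduces database storage requirements (an expected 60-80% in typical environments, see Storage Reduction) while keeping every reading reconstructible for analysis.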

How It Works

Storage Strategy

Instead of storing a complete SMART reading for every collection cycle, the system classifies each reading as one of four types and stores only what that type requires:

  1. Baseline readings - First reading for each HDD
  2. Full readings - When critical parameters change or forced intervals are reached
  3. Differential readings - When only non-critical parameters change (stores only the changes)
  4. Skipped readings - When no changes are detected (no storage)
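
The four reading types above can be sketched as a single decision function. This is a minimal Python illustration, not the project's code (the real decision is made in PostgreSQL and Perl). The function name, its arguments, and the order in which the rules are checked are assumptions; the critical parameter names are the defaults listed under Critical Parameters List.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default critical parameters, as listed later in this document.
CRITICAL = {
    "Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable",
    "Reallocated_Event_Count", "Spin_Retry_Count",
}

def classify_reading(previous: Optional[dict], current: dict,
                     last_full_at: Optional[datetime], now: datetime,
                     forced_interval: timedelta = timedelta(hours=24)) -> str:
    """Return 'baseline', 'full', 'differential', or 'skipped' (hypothetical helper)."""
    if previous is None:
        return "baseline"                      # first reading for this HDD
    if last_full_at is None or now - last_full_at >= forced_interval:
        return "full"                          # forced interval reached
    changed = {name for name in current.keys() | previous.keys()
               if current.get(name) != previous.get(name)}
    if changed & CRITICAL:
        return "full"                          # a critical parameter changed
    if changed:
        return "differential"                  # only non-critical changes
    return "skipped"                           # nothing changed, nothing stored
```

A reading whose only change is a non-critical counter such as Power_On_Hours would come back as 'differential', while any change to Reallocated_Sector_Ct would come back as 'full'.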

Change Detection

The system uses multiple methods to detect changes:

  • Checksum comparison - SHA-256 hash of all parameters plus temperature
  • Parameter-level analysis - Individual SMART parameter change detection
  • Critical parameter monitoring - Immediate storage for health-critical changes
  • Temperature thresholds - Configurable temperature change sensitivity
  • Time-based forcing - Periodic full readings regardless of changes (default: 24 hours)
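
The checksum, parameter-level, and temperature checks above might look like the following Python sketch. It is illustrative only: the helper names, the canonical JSON encoding used as hash input, and the choice of "at least the threshold" for temperature are assumptions, not the project's actual implementation.

```python
import hashlib
import json

def reading_checksum(parameters: dict, temperature: float) -> str:
    """SHA-256 hex digest over a canonical JSON encoding (sorted keys)."""
    payload = json.dumps({"parameters": parameters, "temperature": temperature},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_parameters(previous: dict, current: dict) -> list:
    """Names whose value differs, or that appear in only one of the readings."""
    return sorted(name for name in previous.keys() | current.keys()
                  if previous.get(name) != current.get(name))

def temperature_changed(previous: float, current: float,
                        threshold: float = 5.0) -> bool:
    """True when the temperature moved by at least `threshold` degrees Celsius."""
    return abs(current - previous) >= threshold
```

Sorting the keys before hashing means two readings with identical values produce the same checksum regardless of the order in which parameters were collected. The 64-character hex digest fits the checksum VARCHAR(64) column.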

Database Schema Changes

Enhanced smart_readings Table

ALTER TABLE smart_readings ADD COLUMN reading_type VARCHAR(20) DEFAULT 'full';
ALTER TABLE smart_readings ADD COLUMN changes_detected BOOLEAN DEFAULT true;
ALTER TABLE smart_readings ADD COLUMN changed_parameters JSONB;
ALTER TABLE smart_readings ADD COLUMN previous_reading_id INTEGER REFERENCES smart_readings(id);
ALTER TABLE smart_readings ADD COLUMN checksum VARCHAR(64);

New PostgreSQL Function

The should_store_smart_reading() function decides whether a reading should be stored and, if so, as which type:

SELECT * FROM should_store_smart_reading(hdd_id, parameters_json, checksum, current_timestamp);

Returns:

  • should_store - Boolean indicating whether the reading should be stored
  • reading_type - 'baseline', 'full', or 'differential'
  • changes_detected - Boolean indicating whether changes were found
  • changed_parameters - JSON array of changed parameter names
  • previous_reading_id - Reference to the previous reading, used for chaining

Reconstructed Data View

The smart_readings_reconstructed view uses recursive SQL to rebuild complete SMART data from differential readings:

SELECT * FROM smart_readings_reconstructed WHERE hdd_id = 123;
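
Conceptually, reconstruction follows previous_reading_id links back to the nearest baseline or full reading and then replays the stored changes forward. The Python sketch below illustrates that idea on in-memory rows. The row shape (a dict with 'reading_type', 'parameters', and 'previous_reading_id') is a hypothetical stand-in for smart_readings, and the view's actual SQL may differ.

```python
def reconstruct(readings: dict, reading_id: int) -> dict:
    """Rebuild the complete parameter set for `reading_id`.

    `readings` maps id -> row. Differential rows are assumed to hold only the
    parameters that changed; baseline and full rows hold everything.
    """
    chain = []
    current = reading_id
    # Walk back until we reach a row that holds a complete parameter set.
    while current is not None:
        row = readings[current]
        chain.append(row)
        if row["reading_type"] in ("baseline", "full"):
            break
        current = row["previous_reading_id"]
    else:
        raise ValueError("chain has no baseline or full reading")

    # Replay oldest first so newer values overwrite older ones.
    parameters = {}
    for row in reversed(chain):
        parameters.update(row["parameters"])
    return parameters
```

Because forced full readings occur periodically (every 24 hours by default), a chain only has to be walked back as far as the most recent full reading. This sketch does not handle a parameter that disappears between readings.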

Configuration Parameters

Add to system_config table:

INSERT INTO system_config (key, value, description) VALUES
('differential_storage_enabled', 'true', 'Enable differential storage optimization'),
('forced_storage_interval_hours', '24', 'Hours between forced full readings'),
('critical_parameter_force_store', 'true', 'Force storage for critical parameter changes'),
('temperature_change_threshold', '5', 'Temperature change threshold for storage (Celsius)');
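
The values above are inserted as text, so a consumer has to convert them before use. A minimal Python sketch of that conversion follows. The function name and the accepted spellings of "true" are assumptions; the fallback defaults mirror the INSERT statement above.

```python
def parse_config(rows: dict) -> dict:
    """Convert string values read from system_config into typed settings.

    `rows` maps key -> value as text (hypothetical in-memory form of the table).
    """
    def as_bool(text: str) -> bool:
        return text.strip().lower() in ("true", "1", "yes", "on")

    return {
        "differential_storage_enabled":
            as_bool(rows.get("differential_storage_enabled", "true")),
        "forced_storage_interval_hours":
            int(rows.get("forced_storage_interval_hours", "24")),
        "critical_parameter_force_store":
            as_bool(rows.get("critical_parameter_force_store", "true")),
        "temperature_change_threshold":
            float(rows.get("temperature_change_threshold", "5")),
    }
```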

Updated Perl Modules

SmartCollector.pm Changes

  1. New methods:

    • _should_store_reading() - Check storage requirements
    • _insert_smart_reading_differential() - Store with differential info
    • _get_recent_storage_stats() - Monitor storage efficiency
  2. Enhanced collection:

    • Automatic change detection
    • Storage type determination
    • Efficiency reporting
  3. Storage optimization:

    • Only changed parameters stored for differential readings
    • Checksum validation
    • Chain reference tracking

Benefits

Storage Reduction

Expected storage reduction is 60-80% for typical HDD environments, based on this estimated mix of readings:

  • Baseline readings: ~1% of all readings
  • Full readings: ~15-20% of readings (critical changes + forced intervals)
  • Differential readings: ~5-15% of readings (minor changes)
  • Skipped readings: ~60-75% of readings (no changes)
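
To see how a mix like the one above translates into a reduction figure, the arithmetic can be sketched in Python. The function name is hypothetical, and the relative size of a differential row (20% of a full row here) is an illustrative assumption, not a measured value.

```python
def estimated_reduction(baseline: float, full: float, differential: float,
                        skipped: float, differential_cost: float = 0.2) -> float:
    """Percent of storage saved relative to storing every reading in full.

    The four shares are fractions of all collection cycles and must sum to 1.
    `differential_cost` is the assumed size of a differential row relative to
    a full row.
    """
    total = baseline + full + differential + skipped
    if abs(total - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Skipped cycles cost nothing; differential rows cost a fraction of a full row.
    stored = baseline + full + differential * differential_cost
    return round((1.0 - stored) * 100.0, 1)
```

For example, a mix of 1% baseline, 18% full, 11% differential, and 70% skipped gives a stored fraction of 0.01 + 0.18 + 0.11 × 0.2 = 0.212, a reduction of roughly 79%, which falls inside the 60-80% range quoted above.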

Performance Impact

  • Minimal collection overhead: Single database function call for decision
  • Fast reconstruction: Recursive SQL with indexes
  • Efficient queries: Reconstructed view handles complexity

Data Integrity

  • Complete reconstruction: All historical data accessible
  • Change tracking: Full audit trail of parameter changes
  • Critical monitoring: No loss of important health indicators

Usage Examples

Collection with Statistics

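use SmartCollector;

# $config holds the collector settings and is assumed to be loaded earlier
my $collector = SmartCollector->new($config);
my $result = $collector->collect_all();

# storage_stats summarizes how the readings from this run were stored
print "Storage efficiency: " . $result->{storage_stats}->{efficiency_percent} . "%\n";
print "Differential readings: " . $result->{storage_stats}->{differential} . "\n";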

Testing the System

Run the comprehensive test suite:

cd /etc/pve/autoSMART
./scripts/test-differential-storage.pl

This will:

  1. Create test HDD entries
  2. Test storage decisions for various change scenarios
  3. Validate data reconstruction
  4. Show storage efficiency statistics

Migration from Legacy Data

Existing installations can migrate seamlessly:

  1. Schema updates: Run the ALTER TABLE statements listed under Database Schema Changes
  2. Existing data: Marked as 'full' readings automatically, because the new reading_type column defaults to 'full'
  3. No data loss: All existing readings preserved
  4. Gradual optimization: New readings use differential storage immediately

Monitoring and Maintenance

Storage Statistics Query

SELECT 
    reading_type,
    COUNT(*) as count,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() as percentage
FROM smart_readings 
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY reading_type;

Reconstruction Performance

EXPLAIN ANALYZE 
SELECT * FROM smart_readings_reconstructed 
WHERE hdd_id = 123 AND timestamp > NOW() - INTERVAL '30 days';

Space Savings Report

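-- Only meaningful if skipped cycles are logged as rows with reading_type = 'skipped';
-- when skipped readings are not stored at all, savings_percentage is always 0.
SELECT
    COUNT(*) as total_possible_readings,
    COUNT(*) FILTER (WHERE reading_type != 'skipped') as stored_readings,
    (COUNT(*) FILTER (WHERE reading_type != 'skipped') * 100.0 / NULLIF(COUNT(*), 0)) as storage_percentage,
    (100 - (COUNT(*) FILTER (WHERE reading_type != 'skipped') * 100.0 / NULLIF(COUNT(*), 0))) as savings_percentage
FROM smart_readings
WHERE timestamp > NOW() - INTERVAL '30 days';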
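
Since skipped readings are described under Storage Strategy as not stored, another way to estimate savings is to compare the stored row count against the number of collection cycles that should have run. The Python sketch below shows that calculation. The function name is hypothetical, and the collection interval is an input you must supply, since this document does not state it.

```python
def savings_percentage(stored_rows: int, hdd_count: int,
                       period_hours: float, interval_minutes: float) -> float:
    """Percent of expected collection cycles that did not produce a stored row."""
    if hdd_count <= 0 or period_hours <= 0 or interval_minutes <= 0:
        raise ValueError("counts and durations must be positive")
    # One reading per HDD per collection cycle would have been stored without
    # differential storage.
    expected = hdd_count * period_hours * 60.0 / interval_minutes
    stored = min(stored_rows, expected)  # clamp in case of extra manual runs
    return round(100.0 * (1.0 - stored / expected), 1)
```

With 10 HDDs collected hourly over 30 days (720 hours), 7,200 readings are expected; if 2,160 rows were stored, the savings come to about 70%. Note that this counts rows, so it treats a differential row as costing as much as a full one and therefore understates the space saved.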

Critical Parameters List

Default parameters that trigger immediate full storage:

  • Reallocated_Sector_Ct
  • Current_Pending_Sector
  • Offline_Uncorrectable
  • Reallocated_Event_Count
  • Spin_Retry_Count

To change this list, edit the smart_thresholds table; parameters with weight >= 8.0 are treated as critical.
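
The weight rule can be illustrated with a short Python sketch. The (name, weight) pair format is a hypothetical stand-in for rows read from smart_thresholds, whose actual columns are not shown in this document, and the weights in the usage note are made up for the example.

```python
CRITICAL_WEIGHT = 8.0  # threshold stated in this document

def critical_parameters(thresholds: list) -> set:
    """Names of parameters whose weight marks them as critical.

    `thresholds` is a list of (parameter_name, weight) pairs.
    """
    return {name for name, weight in thresholds if weight >= CRITICAL_WEIGHT}

def requires_full_storage(changed: list, thresholds: list) -> bool:
    """True when any changed parameter is critical, forcing a full reading."""
    return bool(critical_parameters(thresholds) & set(changed))
```

For instance, with illustrative weights of 10.0 for Reallocated_Sector_Ct and 2.0 for Power_On_Hours, a change to Power_On_Hours alone would not force a full reading, while a change to Reallocated_Sector_Ct would.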

Conclusion

The differential storage system provides significant storage optimization while maintaining complete data integrity and analytical capabilities. The system automatically adapts to HDD behavior patterns, storing more data when drives show issues and reducing storage when drives are stable.

This optimization is particularly beneficial for large-scale deployments like the Madagascar cluster, where hundreds of HDDs generate continuous SMART data over years of operation.