Showing 2 changed files with 84 additions and 53 deletions
+16 -8
HealthProbe/Doc/04-project/Import-Optimization-Log.md
@@ -583,6 +583,7 @@ rows exist".
 | 2026-06-03 | pending | Clarify capture mode in record-import diagnostics. | Two full-profile repeated snapshots after SQLite capture-state persistence completed successfully with stable checksum and identical record count (`2,646,613`). The first ran in `6.3s` with `SummedProcessingElapsed: 0.0s`, `SummedInsertElapsed: 0.0s`, `SummedFinalizeElapsed: 2.6s`; the second ran in `6.4s` with `0.0s` processing/insert and `2.7s` finalize. Heart Rate `922,526` and Active Energy `346,478` each completed around `0.2s`, proving the heavy full reimport path was avoided. The report wording still said "`N` samples in 1 anchored segment", which is ambiguous for inherited unchanged summaries; diagnostics now label unchanged empty-delta, delta-apply, and full-import modes explicitly. |
 | 2026-06-03 | pending | Add capture-mode summary to diagnostics. | Repeated full-profile captures rarely produce a perfect no-delta report because at least one metric can change between manual runs. Diagnostic reports now include aggregate `CaptureModes` counts plus per-metric `captureMode`, so comparisons can separate unchanged empty-delta metrics from delta-applied metrics and full imports without manually reading every `record_import` line. Expected signal: stable checksum plus high `unchangedDelta` count and zero summed processing/insert confirms the fast path even when a few metrics changed. |
 | 2026-06-03 | pending | Add delta-event counts to diagnostics. | A full-profile follow-up completed in `47.4s` with `127/127` complete, `0` degraded, `CaptureModes: unchangedDelta=115, delta=12, initialImport=0`, `SummedProcessingElapsed: 25.9s`, `SummedInsertElapsed: 0.2s`, and `SummedFinalizeElapsed: 16.0s`. This confirms anchors work and no full import ran. Remaining cost is delta application for large metrics: Heart Rate `23.5s` total (`14.1s` processing, `8.8s` finalize), Active Energy `7.1s`, and Basal Energy `6.0s`. Diagnostics now report aggregate/per-metric `DeltaEvents` so future logs can separate true HealthKit delta size from the final visible record count. |
+| 2026-06-04 | pending | Rebuild delta compact archives without large intermediate record arrays. | Delta captures still need the legacy per-type content hash and compact record archive until the SwiftData bridge is fully retired. The delta path now streams the previous compact archive into the UUID map without first decoding a `[HealthRecordValue]`, and rebuilds the new compact archive/hash directly from the map instead of sorting and materializing a second large record array. Expected signal: lower `processingElapsed` for high-volume delta metrics such as Heart Rate, Active Energy, and Basal Energy while `snapshotChecksum` remains stable for equivalent content. |
 
 ## Current Diagnosis
 
@@ -639,6 +640,10 @@ The likely bottleneck is per-row SQLite work:
   reconstruction of legacy compact record archives and type hashes. Future logs
   should compare `DeltaEvents` against processing/finalize time before changing
   page sizes or HealthKit query strategy.
+- The current delta optimization keeps checksum semantics unchanged and only
+  removes avoidable Swift allocations. A larger future optimization would need a
+  deliberate replacement for the legacy per-type fingerprint hash, not an ad hoc
+  switch to SQLite aggregate hashes.
 
 ## Open Issues / Observations
 
@@ -681,17 +686,20 @@ Prioritize experiments in this order:
    Rate, Active Energy, and Basal Energy. If delta events are small while
    processing/finalize remain large, optimize legacy compact archive/hash
    reconstruction rather than HealthKit fetch or SQLite insert.
-5. Run a non-chain-start/full-scan benchmark after skipping unchanged `verified` events and fast-pathing already-open visibility ranges. Compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, `Steps insertElapsed`, and `Walking + Running Distance insertElapsed`.
-6. Reduce any remaining per-sample SQLite writes for unchanged existing samples during non-chain-start full scans.
-7. Profile whether index maintenance dominates first-import insert cost.
-8. Consider a guarded bulk-import mode for first observations:
+5. Compare the next full-profile delta report against the 2026-06-03 `47.4s`
+   run: Heart Rate processing `14.1s`, Active Energy processing `4.9s`, Basal
+   Energy processing `4.1s`, and `SummedProcessingElapsed: 25.9s`.
+6. Run a non-chain-start/full-scan benchmark after skipping unchanged `verified` events and fast-pathing already-open visibility ranges. Compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, `Steps insertElapsed`, and `Walking + Running Distance insertElapsed`.
+7. Reduce any remaining per-sample SQLite writes for unchanged existing samples during non-chain-start full scans.
+8. Profile whether index maintenance dominates first-import insert cost.
+9. Consider a guarded bulk-import mode for first observations:
    - keep archive semantics unchanged;
    - only relax work that can be safely reconstructed or validated;
    - re-enable normal idempotent paths for incremental observations.
-9. Run a fresh first-import benchmark after the unused-index removal and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, and `Active Energy insertElapsed`.
-10. Investigate whether first-import-only deferred index creation or temporary staging tables can reduce `samples` / `sample_versions` / `sample_observation_events` write cost without weakening final archive integrity.
-11. Revisit adaptive page sizes only after SQLite write-path costs are reduced.
-12. Revisit background / scheduled collection once initial import can finish reliably and post-import UI recovery is bounded.
+10. Run a fresh first-import benchmark after the unused-index removal and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, and `Active Energy insertElapsed`.
+11. Investigate whether first-import-only deferred index creation or temporary staging tables can reduce `samples` / `sample_versions` / `sample_observation_events` write cost without weakening final archive integrity.
+12. Revisit adaptive page sizes only after SQLite write-path costs are reduced.
+13. Revisit background / scheduled collection once initial import can finish reliably and post-import UI recovery is bounded.
 
 ## Verification Checklist For Each Optimization
 
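Review note on the "checksum semantics unchanged" claim in the log entry above: the new `rebuildRecordArchive` in `HealthKitService.swift` walks `recordMap` directly, and Swift dictionaries do not guarantee iteration order. The content hash therefore stays stable only if the fingerprints are canonicalized before hashing. Whether `HashService.typeHash` already does this is not visible in this diff. A minimal sketch of the idea follows; `orderIndependentTypeHash` and the FNV-1a mixing are hypothetical stand-ins, not the project's implementation.

```swift
import Foundation

// Hypothetical stand-in for HashService.typeHash; the real implementation
// is not shown in this diff.
func orderIndependentTypeHash(typeIdentifier: String, recordFingerprints: [String]) -> String {
    // Sorting first makes the result independent of dictionary iteration order.
    let canonical = recordFingerprints.sorted()
    var hash: UInt64 = 0xcbf29ce484222325 // FNV-1a 64-bit offset basis
    for text in [typeIdentifier] + canonical {
        for byte in text.utf8 {
            hash ^= UInt64(byte)
            hash = hash &* 0x100000001b3 // FNV-1a 64-bit prime
        }
        // Separator byte so ["ab", "c"] and ["a", "bc"] hash differently.
        hash ^= 0x0A
        hash = hash &* 0x100000001b3
    }
    return String(hash, radix: 16)
}

// Same fingerprints in two different orders, as two dictionary walks might yield.
let fingerprints = ["fp-2", "fp-1", "fp-3"]
let forward = orderIndependentTypeHash(
    typeIdentifier: "HeartRate",
    recordFingerprints: fingerprints
)
let reversed = orderIndependentTypeHash(
    typeIdentifier: "HeartRate",
    recordFingerprints: Array(fingerprints.reversed())
)
```

If the real hash is order-sensitive, sorting only the fingerprint strings would still avoid materializing a second `[HealthRecordValue]` array, which is the allocation this change set out to remove.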
+68 -45
HealthProbe/Services/HealthKitService.swift
@@ -1299,45 +1299,20 @@ final class HealthKitService {
             progress: progress
         )
 
-        let sortedRecordsStartedAt = Date()
-        let sortedKeys = recordMap.keys.sorted {
-            guard let left = recordMap[$0],
-                  let right = recordMap[$1] else {
-                return $0 < $1
-            }
-            if left.startDate != right.startDate {
-                return left.startDate < right.startDate
-            }
-            return left.recordFingerprint < right.recordFingerprint
-        }
-        var sortedRecords: [HealthRecordValue] = []
-        sortedRecords.reserveCapacity(sortedKeys.count)
-        for sampleUUIDHash in sortedKeys {
-            guard let record = recordMap[sampleUUIDHash] else { continue }
-            sortedRecords.append(
-                HealthRecordValue(
-                    typeIdentifier: typeIdentifier,
-                    sampleUUIDHash: sampleUUIDHash,
-                    recordFingerprint: record.recordFingerprint,
-                    startDate: record.startDate,
-                    endDate: record.endDate,
-                    displayValue: record.displayValue
-                )
-            )
-        }
-        let contentHash = HashService.typeHash(
+        let archiveRebuildStartedAt = Date()
+        let rebuiltArchive = Self.rebuildRecordArchive(
             typeIdentifier: typeIdentifier,
-            recordFingerprints: sortedRecords.map(\.recordFingerprint)
+            recordMap: recordMap
         )
-        captureTimings.processingElapsedSeconds += Date().timeIntervalSince(sortedRecordsStartedAt)
+        captureTimings.processingElapsedSeconds += Date().timeIntervalSince(archiveRebuildStartedAt)
 
         progress?.updateBlockProgress(
             typeIdentifier,
             detail: pageNumber == 1 ? "Imported 1 page" : "Imported \(pageNumber) pages",
-            recordCount: sortedRecords.count
+            recordCount: rebuiltArchive.count
         )
 
-        guard !sortedRecords.isEmpty || anchor != nil else {
+        guard rebuiltArchive.count > 0 || anchor != nil else {
             return SampleDistribution(
                 totalCount: 0,
                 bins: [],
@@ -1351,31 +1326,72 @@ final class HealthKitService {
             )
         }
 
-        let binStart = earliestDate ?? sortedRecords.first?.startDate ?? previousDistribution.earliestRecordDate ?? Date()
-        let rawBinEnd = latestDate ?? sortedRecords.last?.endDate ?? previousDistribution.latestRecordDate ?? binStart
+        let binStart = earliestDate ?? rebuiltArchive.earliestDate ?? previousDistribution.earliestRecordDate ?? Date()
+        let rawBinEnd = latestDate ?? rebuiltArchive.latestDate ?? previousDistribution.latestRecordDate ?? binStart
         let binEnd = rawBinEnd > binStart ? rawBinEnd : binStart.addingTimeInterval(1)
 
         return SampleDistribution(
-            totalCount: sortedRecords.count,
+            totalCount: rebuiltArchive.count,
             bins: [
                 SampleDistribution.Bin(
                     start: binStart,
                     end: binEnd,
-                    count: sortedRecords.count,
-                    contentHash: contentHash,
+                    count: rebuiltArchive.count,
+                    contentHash: rebuiltArchive.contentHash,
                     anchorData: anchor.flatMap(Self.archiveAnchor(_:))
                 )
             ],
-            records: sortedRecords,
-            contentHash: contentHash,
+            records: [],
+            contentHash: rebuiltArchive.contentHash,
             yearlyCounts: nil,
-            recordArchiveData: nil,
+            recordArchiveData: rebuiltArchive.recordArchiveData,
             captureMode: .delta,
             deltaEventCount: processedEventCount,
             timingBreakdown: captureTimings.importBreakdown
         )
     }
 
+    private static func rebuildRecordArchive(
+        typeIdentifier: String,
+        recordMap: [String: SampleRecordPayload]
+    ) -> RebuiltRecordArchive {
+        var writer = HealthRecordArchive.makeCompactWriter(
+            typeIdentifier: typeIdentifier,
+            estimatedRecordCount: recordMap.count
+        )
+        var recordFingerprints: [String] = []
+        recordFingerprints.reserveCapacity(recordMap.count)
+        var earliestDate: Date?
+        var latestDate: Date?
+
+        for (sampleUUIDHash, record) in recordMap {
+            recordFingerprints.append(record.recordFingerprint)
+            earliestDate = min(earliestDate ?? record.startDate, record.startDate)
+            latestDate = max(latestDate ?? record.endDate, record.endDate)
+            writer.append(
+                HealthRecordValue(
+                    typeIdentifier: typeIdentifier,
+                    sampleUUIDHash: sampleUUIDHash,
+                    recordFingerprint: record.recordFingerprint,
+                    startDate: record.startDate,
+                    endDate: record.endDate,
+                    displayValue: record.displayValue
+                )
+            )
+        }
+
+        return RebuiltRecordArchive(
+            count: recordMap.count,
+            contentHash: HashService.typeHash(
+                typeIdentifier: typeIdentifier,
+                recordFingerprints: recordFingerprints
+            ),
+            earliestDate: earliestDate,
+            latestDate: latestDate,
+            recordArchiveData: writer.finalize()
+        )
+    }
+
     private func fetchInitialDistributionStreaming(
         for sampleType: HKSampleType,
         typeIdentifier: String,
@@ -2502,6 +2518,14 @@ private struct SampleRecordPayload: Sendable {
     let displayValue: String?
 }
 
+private struct RebuiltRecordArchive: Sendable {
+    let count: Int
+    let contentHash: String
+    let earliestDate: Date?
+    let latestDate: Date?
+    let recordArchiveData: Data
+}
+
 private enum HealthRecordArchiveReadError: LocalizedError {
     case missingArchive(typeIdentifier: String, count: Int)
     case decodeFailed(typeIdentifier: String)
@@ -2707,13 +2731,9 @@ private struct PreviousDistributionState: Sendable {
             return [:]
         }
 
-        guard let records = HealthRecordArchive.decode(recordArchiveData) else {
-            throw HealthRecordArchiveReadError.decodeFailed(typeIdentifier: typeIdentifier)
-        }
-
         var recordMap: [String: SampleRecordPayload] = [:]
-        recordMap.reserveCapacity(records.count)
-        for record in records {
+        recordMap.reserveCapacity(count)
+        let didRead = HealthRecordArchive.forEachRecord(in: recordArchiveData) { record in
             recordMap[record.sampleUUIDHash] = SampleRecordPayload(
                 recordFingerprint: record.recordFingerprint,
                 startDate: record.startDate,
@@ -2721,6 +2741,9 @@ private struct PreviousDistributionState: Sendable {
                 displayValue: record.displayValue
             )
         }
+        guard didRead else {
+            throw HealthRecordArchiveReadError.decodeFailed(typeIdentifier: typeIdentifier)
+        }
         return recordMap
     }
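Review note on the last two hunks: the change replaces "decode everything into an array, then loop" with a callback-style reader that reports success through a `Bool`. The body of `HealthRecordArchive.forEachRecord(in:)` is not part of this diff, so the sketch below only illustrates the pattern under assumed details: the `Record` type, the tab-separated line format, and the free function `forEachRecord` are all hypothetical.

```swift
import Foundation

// Hypothetical record shape; the real SampleRecordPayload carries more fields.
struct Record {
    let sampleUUIDHash: String
    let recordFingerprint: String
}

// Hypothetical archive format: one "uuidHash<TAB>fingerprint" entry per line.
// Walks the bytes line by line and hands each record to `body` without
// building an intermediate [Record]. Returns false on the first malformed line.
func forEachRecord(in data: Data, _ body: (Record) -> Void) -> Bool {
    var lineStart = data.startIndex
    while lineStart < data.endIndex {
        let lineEnd = data[lineStart...].firstIndex(of: 0x0A) ?? data.endIndex
        let line = data[lineStart..<lineEnd]
        guard let tab = line.firstIndex(of: 0x09),
              let key = String(data: line[..<tab], encoding: .utf8),
              let fingerprint = String(data: line[line.index(after: tab)...], encoding: .utf8)
        else {
            return false
        }
        body(Record(sampleUUIDHash: key, recordFingerprint: fingerprint))
        // Skip past the newline, or stop if this was the final line.
        lineStart = lineEnd < data.endIndex ? data.index(after: lineEnd) : data.endIndex
    }
    return true
}

// Usage: fill the map directly from the callback, then check the result.
let archive = Data("a1\tfp-1\nb2\tfp-2\n".utf8)
var recordMap: [String: String] = [:]
let didRead = forEachRecord(in: archive) { record in
    recordMap[record.sampleUUIDHash] = record.recordFingerprint
}
let malformedRead = forEachRecord(in: Data("no-tab-here\n".utf8)) { _ in }
```

With this shape the map can be partially filled before a malformed entry is detected, so the result must be checked before the map is used. The diff does this with `guard didRead else { throw ... }` ahead of `return recordMap`.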