Skip unchanged sample verification events · a281c51


bogdan committed 5 days ago

main

1 parent 56bfdbe

commit a281c51

Showing 3 changed files with 63 additions and 24 deletions

+46 -6

HealthProbe/Doc/04-project/Import-Optimization-Log.md

	@@ -142,6 +142,41 @@ Conclusion: direct inserts for brand-new dependent rows produced a valid but
142	142	modest first-import gain. The large reimport improvement was not representative
143	143	of a clean first snapshot. SQLite insert remains the dominant bottleneck.
144	144
	145	+### 2026-06-02 Non-Chain-Start Full Scan After Index Removal
	146	+
	147	+Commit context: after `ff59257` (`Drop unused sample import indexes`)
	148	+Source: user-provided diagnostic report with `previousSnapshotID` present and
	149	+`isChainStart: false`.
	150	+
	151	+This is not a comparable first-import benchmark for the unused-index removal,
	152	+but it is important because it shows that non-initial captures can be slower
	153	+than first imports when the app performs a full-history scan.
	154	+
	155	+| Metric | Value |
	156	+|--------|-------|
	157	+| Wall clock | 22m 33s |
	158	+| Summed metric total | 22m 14s |
	159	+| Summed fetch | 52.0s |
	160	+| Summed processing | 2m 23s |
	161	+| Summed insert | 18m 44s |
	162	+| Summed finalize | 11.5s |
	163	+| Heart Rate count | 922,440 |
	164	+| Heart Rate total | 13m 30s |
	165	+| Heart Rate fetch | 24.3s |
	166	+| Heart Rate processing | 1m 29s |
	167	+| Heart Rate insert | 11m 25s |
	168	+| Active Energy count | 348,698 |
	169	+| Active Energy insert | 4m 44s |
	170	+| Steps insert | 40.4s |
	171	+| Walking + Running Distance insert | 36.0s |
	172	+
	173	+Conclusion: this run should not be used to judge first-import index removal.
	174	+However, it indicates a separate bottleneck: subsequent full scans still spend
	175	+most of their time in SQLite writes, likely because unchanged samples are still
	176	+touching the archive write path. The next implementation target should reduce
	177	+per-sample work for unchanged existing samples during verification/full-scan
	178	+captures.
	179	+
145	180	## Optimization Iterations
146	181
147	182	| Date | Commit | Change | Result / Status |
	@@ -159,6 +194,8 @@ of a clean first snapshot. SQLite insert remains the dominant bottleneck.
159	194	| 2026-06-02 | `c138b7b` | Increased initial import write chunk sizes. | Marginal improvement: summed insert from 15m44s to 15m24s on the next comparable run. |
160	195	| 2026-06-02 | `44d9ebd` | Used direct inserts for dependent rows when `samples` creates a new sample. | Confirmed modest first-import gain: wall clock 18m30s -> 17m13s, summed insert 15m24s -> 14m38s, Heart Rate insert 9m58s -> 8m59s. |
161	196	| 2026-06-02 | `ff59257` | Removed unused `samples` indexes on global UUID hash and semantic fingerprint. | Awaiting comparable first-import report. Expected signal is lower `SummedInsertElapsed`; deleted-object lookup remains covered by `(sample_type_id, sample_uuid_hash)`. |
	197	+| 2026-06-02 | pending | Captured non-chain-start full-scan report after index removal. | Not comparable for first-import performance; reveals a separate full-scan/unchanged-sample write bottleneck. |
	198	+| 2026-06-02 | pending | Stopped writing `verified` observation events for unchanged existing samples. | Awaiting comparable non-chain-start/full-scan report. Expected signal is lower `SummedInsertElapsed` and especially lower Heart Rate insert time when most rows are unchanged. |
162	199
163	200	## Current Diagnosis
164	201
	@@ -183,21 +220,24 @@ The likely bottleneck is per-row SQLite work:
183	220	- A previous Heart Rate import appeared to stall for long periods around roughly 900k records, but later progress resumed; avoid classifying this as a hard timeout without report evidence.
184	221	- After a completed import, the app may remain unresponsive for more than one minute. This needs separate timing around post-import cache rebuild, UI refresh, report generation, and main-thread work.
185	222	- Partial / old imported observations can pollute comparisons. Fresh first-snapshot performance comparisons should use a confirmed reset database.
	223	+- Non-chain-start full scans can be slower than first imports if unchanged existing samples still write per-sample archive evidence.
186	224
187	225	## Next Experiments
188	226
189	227	Prioritize experiments in this order:
190	228
191	229	1. Add explicit post-import timings if the app is still unresponsive after the operation reports success.
192		-2. Profile whether index maintenance dominates first-import insert cost.
193		-3. Consider a guarded bulk-import mode for first observations:
	230	+2. Run a non-chain-start/full-scan benchmark after skipping unchanged `verified` events and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, `Steps insertElapsed`, and `Walking + Running Distance insertElapsed`.
	231	+3. Reduce remaining per-sample SQLite writes for unchanged existing samples during non-chain-start full scans, especially open visibility-range existence checks.
	232	+4. Profile whether index maintenance dominates first-import insert cost.
	233	+5. Consider a guarded bulk-import mode for first observations:
194	234	- keep archive semantics unchanged;
195	235	- only relax work that can be safely reconstructed or validated;
196	236	- re-enable normal idempotent paths for incremental observations.
197		-4. Run a fresh first-import benchmark after the unused-index removal and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, and `Active Energy insertElapsed`.
198		-5. Investigate whether first-import-only deferred index creation or temporary staging tables can reduce `samples` / `sample_versions` / `sample_observation_events` write cost without weakening final archive integrity.
199		-6. Revisit adaptive page sizes only after SQLite write-path costs are reduced.
200		-7. Revisit background / scheduled collection once initial import can finish reliably and post-import UI recovery is bounded.
	237	+6. Run a fresh first-import benchmark after the unused-index removal and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, and `Active Energy insertElapsed`.
	238	+7. Investigate whether first-import-only deferred index creation or temporary staging tables can reduce `samples` / `sample_versions` / `sample_observation_events` write cost without weakening final archive integrity.
	239	+8. Revisit adaptive page sizes only after SQLite write-path costs are reduced.
	240	+9. Revisit background / scheduled collection once initial import can finish reliably and post-import UI recovery is bounded.
201	241
202	242	## Verification Checklist For Each Optimization
203	243
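As a quick cross-check of the conclusion in the 2026-06-02 full-scan entry, the table values can be turned into an insert share and a per-sample insert cost. The sketch below only re-derives numbers from the figures quoted in that table; the variable names are illustrative and do not come from the app or its diagnostic report format.

```swift
import Foundation

// Durations copied from the 2026-06-02 full-scan table, converted to seconds.
let summedTotal = 22.0 * 60 + 14        // Summed metric total: 22m 14s
let summedInsert = 18.0 * 60 + 44       // Summed insert: 18m 44s
let heartRateInsert = 11.0 * 60 + 25    // Heart Rate insert: 11m 25s
let heartRateCount = 922_440.0          // Heart Rate count
let activeEnergyInsert = 4.0 * 60 + 44  // Active Energy insert: 4m 44s
let activeEnergyCount = 348_698.0       // Active Energy count

// Fraction of the summed metric total that was spent in the insert phase.
let insertShare = summedInsert / summedTotal

// Average insert-phase cost per sample, in milliseconds.
let heartRateMsPerSample = heartRateInsert / heartRateCount * 1000
let activeEnergyMsPerSample = activeEnergyInsert / activeEnergyCount * 1000

print(String(format: "insert share of summed total: %.1f%%", insertShare * 100))
print(String(format: "Heart Rate insert: %.2f ms/sample", heartRateMsPerSample))
print(String(format: "Active Energy insert: %.2f ms/sample", activeEnergyMsPerSample))
```

On these figures the insert phase accounts for roughly 84% of the summed total, at about 0.74 ms per Heart Rate sample and 0.81 ms per Active Energy sample, which is consistent with the entry's reading that the full scan is dominated by SQLite writes.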

+14 -18

HealthProbe/Services/SQLiteHealthArchiveStore.swift


             statementCache: statementCache
-        let writeKind: ArchiveV2SampleWriteKind
-        let eventKind: String
         if versionResult.inserted {
-            writeKind = .updated
-            eventKind = "representationChanged"
-        } else {
-            writeKind = .unchanged
-            eventKind = "verified"
+            try insertObservationEvent(
+                observationID: observationID,
+                sampleID: sampleResult.id,
+                versionID: versionResult.id,
+                eventKind: "representationChanged",
+                evidenceKind: "healthkit_sample",
+                observedAt: row.observedAt,
+                db: db,
+                statementCache: statementCache
+            )
-        try insertObservationEvent(
-            observationID: observationID,
-            sampleID: sampleResult.id,
-            versionID: versionResult.id,
-            eventKind: eventKind,
-            evidenceKind: "healthkit_sample",
-            observedAt: row.observedAt,
-            db: db,
-            statementCache: statementCache
-        )
         if versionResult.inserted {
             try closeOpenVisibilityRanges(
                 sampleID: sampleResult.id,
-        return ArchiveV2SampleWriteResult(sampleTypeID: sampleTypeID, kind: writeKind)
+        return ArchiveV2SampleWriteResult(
+            sampleTypeID: sampleTypeID,
+            kind: versionResult.inserted ? .updated : .unchanged
+        )
     private func createObservation(
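The behavioral change in this hunk can be summarized in isolation: for a sample that already exists, an observation event is now written only when a new version row was inserted, and the unchanged case returns without touching the event table. The following is a minimal standalone sketch of that branch, not the store's actual code. `WriteKind`, `EventSink`, and `recordExistingSample` are hypothetical stand-ins, and the in-memory array replaces the real `sample_observation_events` insert.

```swift
// Hypothetical stand-in for ArchiveV2SampleWriteKind; illustrative only.
enum WriteKind: Equatable { case updated, unchanged }

// Hypothetical in-memory replacement for the observation-event table.
struct EventSink {
    private(set) var eventKinds: [String] = []
    mutating func insert(eventKind: String) { eventKinds.append(eventKind) }
}

// Mirrors the branch in the diff for an already-existing sample:
// an event is recorded only when a new version was inserted.
func recordExistingSample(versionInserted: Bool, events: inout EventSink) -> WriteKind {
    if versionInserted {
        events.insert(eventKind: "representationChanged")
    }
    // Unchanged samples no longer produce a "verified" event.
    return versionInserted ? .updated : .unchanged
}

var events = EventSink()
let first = recordExistingSample(versionInserted: false, events: &events)
let second = recordExistingSample(versionInserted: true, events: &events)
print(first, second, events.eventKinds)
```

Under this model, a full scan in which most samples are unchanged performs no event write for those rows, which is the per-sample saving the optimization log expects to show up in `SummedInsertElapsed`.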

+3 -0

HealthProbeTests/SQLiteHealthArchiveStoreTests.swift


@@ -72,6 +72,9 @@ final class SQLiteHealthArchiveStoreTests: XCTestCase {
         XCTAssertEqual(firstWrite.unchangedCount, 0)
         XCTAssertEqual(try countRows(in: "samples", at: url), 1)
         XCTAssertEqual(try countRows(in: "sample_versions", at: url), 1, versionDebugRows)
+        XCTAssertEqual(try countRows(in: "sample_observation_events", at: url), 1)
+        XCTAssertEqual(try countRows(in: "sample_observation_events WHERE event_kind = 'appeared'", at: url), 1)
+        XCTAssertEqual(try countRows(in: "sample_observation_events WHERE event_kind = 'verified'", at: url), 0)
         XCTAssertEqual(try countRows(in: "sample_visibility_ranges", at: url), 1, visibilityDebugRows)
         XCTAssertEqual(try countRows(in: "source_revisions", at: url), 1)
         XCTAssertFalse(try tableExists("archive_samples", at: url))

