Count archive events from event index · e7d45a2

+30 -17

HealthProbe/Doc/04-project/Import-Optimization-Log.md

	@@ -594,7 +594,8 @@ rows exist".
594	594	\| 2026-06-04 \| `f73f076` \| Add archive finalization phase timings to diagnostics. \| The post-`7e3b997` report proved the top-level `finalizeElapsed` bucket was too opaque for the next optimization. Diagnostics now split finalization into event-count/previous-summary lookup, type-summary work, daily-aggregate work, observation-type-run update, and residual other time. Follow-up report with `reportSchemaVersion: 3` and `buildFingerprint: 1.0(1)-1780606903-92064` completed in `22.6s`, with `127/127` complete, `CaptureModes: unchangedDelta=119, delta=8`, and `DeltaEvents: 27`. Finalization was `10.3s`: event-count/previous-summary lookup `1.8s`, type-summary `0.0s`, daily aggregates `7.3s`, run update `0.0s`, and other `1.2s`. Heart Rate had `9` delta events and spent `4.8s` finalizing, of which `3.8s` was daily aggregate work and `0.9s` was event-count/previous-summary lookup. Conclusion: the remaining finalize bottleneck is not type-summary fallback; it is changed-type daily aggregate maintenance, especially Heart Rate. \|
595	595	\| 2026-06-04 \| older build / schema v2 \| Captured large first-import baseline on a bigger device database. \| Initial full-profile snapshot on an older build completed with `127/127` metrics and `8,421,978` records, but it used `reportSchemaVersion: 2` and has no build fingerprint. Treat it as a volume/shape baseline, not a precise current-build comparison. Wall clock was `166m10s`; summed fetch `5m19s`, processing `20m29s`, insert `137m31s`, finalize `1m53s`. The high-volume types dominated: Heart Rate `2,225,738` records and `46m57s` total (`39m16s` insert), Active Energy `1,914,449` records and `41m35s` total (`35m21s` insert), another high-volume type around `2,007,920` records and `41m20s` total (`34m29s` insert), and Basal Energy `1,116,074` records and `21m37s` total (`17m48s` insert). Conclusion: for clean first imports on very large databases, SQLite insert/index/write-path cost remains the central risk; incremental daily-aggregate optimization should not add first-import indexes without measurement. \|
596	596	\| 2026-06-05 \| `6041bac` \| Split daily aggregate finalization timings. \| The first finalization phase report identified daily aggregate work as the remaining changed-type bottleneck, but `finalizeDailyAggregateElapsed` still mixed affected-bucket lookup, previous aggregate copy, destination delete, affected-bucket rebuild, replacement insert, and residual SQL/transaction overhead. Diagnostics now emit aggregate and per-type daily subphase fields: bucket lookup, copy, delete, rebuild, insert, and other. Follow-up report with `buildFingerprint: 1.0(1)-1780618540-92064` completed in `23.5s`, with `127/127` complete, `CaptureModes: unchangedDelta=118, delta=9`, and `DeltaEvents: 97`. Finalization was `10.5s`, daily aggregate work was `7.4s`, and daily rebuild alone was `6.9s`; daily copy was only `0.5s`. Heart Rate had `40` delta events and spent `4.8s` finalizing, of which `3.8s` was daily aggregate rebuild. Conclusion: copying previous materialized daily rows is not the bottleneck; affected-bucket rebuild scans are. \|
597		-\| 2026-06-05 \| pending \| Rebuild changed daily aggregate buckets from time-ranged versions. \| The changed-bucket rebuild query previously started from all samples for the type and only then filtered version `start_date`; for Heart Rate this can traverse roughly `900k` visible rows to rebuild a few affected days. The query now starts from `sample_versions(start_date, sample_id)` for the affected date window, joins to `samples` for type filtering, and joins open visibility ranges by `(sample_id, version_id, last_observation_id)`. Expected signal: repeated full-profile captures should reduce `SummedFinalizeDailyAggregateRebuildElapsed`, especially Heart Rate's `3.8s` daily rebuild. Risk to monitor: the new `sample_versions(start_date, sample_id)` index adds first-import write/index cost, so keep checking large first-import insert timing before accepting this as a permanent schema tradeoff. \|
	597	+\| 2026-06-05 \| `bf5a861` \| Rebuild changed daily aggregate buckets from time-ranged versions. \| Confirmed on two full-profile repeated captures with `buildFingerprint: 1.0(1)-1780640325-92064`. The overnight-data run completed in `21.8s` with `127/127` complete, `CaptureModes: unchangedDelta=111, delta=16`, and `DeltaEvents: 322`; daily aggregate rebuild dropped from `6.9s` to `0.0s`, with daily aggregate work now only `0.6s` copy. The low-delta run completed in `6.6s` with `CaptureModes: unchangedDelta=125, delta=2`; daily aggregate rebuild again stayed `0.0s`. Conclusion: the time-ranged `sample_versions(start_date, sample_id)` query solved the affected-bucket rebuild bottleneck. Continue monitoring first-import insert timing because the new index is a write-path tradeoff. \|
	598	+\| 2026-06-05 \| pending \| Count observation events from the event table first. \| After `bf5a861`, remaining finalization cost moved to event-count / previous-summary lookup: `2.5s` on the overnight-data run and `0.2s` on low-delta. `eventCounts` still started from `samples` filtered by type, which can scan a high-volume type to count a few events. The query now starts from `sample_observation_events(observation_id, event_kind)` and joins to `samples` only to filter type. Expected signal: lower `SummedFinalizeEventCountElapsed`, especially Heart Rate's `0.9s` and Cycling Distance's `0.7s` in the overnight report. \|
598	599
599	600	## Current Diagnosis
600	601
	@@ -717,6 +718,17 @@ The likely bottleneck is per-row SQLite work:
717	718	`3.8s` rebuilding affected daily buckets. The next experiment changes the
718	719	affected-bucket rebuild query shape so it starts from time-ranged
719	720	`sample_versions` instead of all samples of the type.
	721	+- The `bf5a861` follow-up reports validated that time-ranged daily aggregate
	722	+ rebuilds worked: `SummedFinalizeDailyAggregateRebuildElapsed` was `0.0s` on
	723	+ both an overnight-data run (`322` delta events, `21.8s` wall clock) and a
	724	+ low-delta run (`2` delta events, `6.6s` wall clock). Daily aggregate cost is
	725	+ no longer the active repeated-capture bottleneck.
	726	+- After daily rebuild was removed, the next measured finalization floor is
	727	+ event-count / previous-summary lookup. The old event-count query started from
	728	+ `samples(sample_type_id)` and could traverse a high-volume type to count a
	729	+ handful of events. The next query shape starts from
	730	+ `sample_observation_events(observation_id, event_kind)` and joins to samples
	731	+ for type filtering.
720	732	- A large older-build first import on an `8.4M`-record database completed but
721	733	took `166m10s`, with `137m31s` summed insert time. This confirms that full
722	734	authorized backup volume can be much larger than the original 15-type test
	@@ -780,32 +792,33 @@ Prioritize experiments in this order:
780	792	identity unless the build provenance is otherwise certain. `sourceCommit`
781	793	and `sourceDirty` are useful when present, but may be `unknown` for normal
782	794	Xcode test installs.
783		-8. Run a repeated full-profile capture after the time-ranged daily aggregate
784		- rebuild query. Compare `SummedFinalizeDailyAggregateRebuildElapsed` and Heart
785		- Rate `finalizeDailyAggregateRebuildElapsed` against the `6.9s` total /
786		- `3.8s` Heart Rate baseline from `6041bac`. Also watch first-import insert
787		- timing on the next clean large-database import because the new
788		- `sample_versions(start_date, sample_id)` index is a write-path tradeoff.
789		-9. Investigate replacing legacy compact `recordArchiveData` delta rebuild with
	795	+8. Run a repeated full-profile capture after counting observation events from
	796	+ the event table first. Compare `SummedFinalizeEventCountElapsed` and per-type
	797	+ event-count time against the post-`bf5a861` overnight baseline: `2.5s` total,
	798	+ Heart Rate `0.9s`, and Cycling Distance `0.7s`.
	799	+9. Keep watching first-import insert timing on the next clean large-database
	800	+ import because the new `sample_versions(start_date, sample_id)` index from
	801	+ `bf5a861` is a write-path tradeoff.
	802	+10. Investigate replacing legacy compact `recordArchiveData` delta rebuild with
790	803	a SQLite-derived capture-state/hash path. The current repeated full-profile
791	804	reports still spend about `4s` processing Heart Rate for tiny deltas because
792	805	the Swift compact archive is decoded and rewritten for the whole 900k-row
793	806	type.
794		-10. Investigate full-profile empty anchored-query cost for zero-count types.
	807	+11. Investigate full-profile empty anchored-query cost for zero-count types.
795	808	Compare slow empty types across reports before changing behavior; any skip or
796	809	lower-frequency strategy must preserve the promise that full authorized
797	810	backup can notice newly appearing data.
798		-11. Run a non-chain-start/full-scan benchmark after skipping unchanged `verified` events and fast-pathing already-open visibility ranges. Compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, `Steps insertElapsed`, and `Walking + Running Distance insertElapsed`.
799		-12. Reduce any remaining per-sample SQLite writes for unchanged existing samples during non-chain-start full scans.
800		-13. Profile whether index maintenance dominates first-import insert cost.
801		-14. Consider a guarded bulk-import mode for first observations:
	811	+12. Run a non-chain-start/full-scan benchmark after skipping unchanged `verified` events and fast-pathing already-open visibility ranges. Compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, `Steps insertElapsed`, and `Walking + Running Distance insertElapsed`.
	812	+13. Reduce any remaining per-sample SQLite writes for unchanged existing samples during non-chain-start full scans.
	813	+14. Profile whether index maintenance dominates first-import insert cost.
	814	+15. Consider a guarded bulk-import mode for first observations:
802	815	- keep archive semantics unchanged;
803	816	- only relax work that can be safely reconstructed or validated;
804	817	- re-enable normal idempotent paths for incremental observations.
805		-15. Run a fresh first-import benchmark after the unused-index removal and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, and `Active Energy insertElapsed`.
806		-16. Investigate whether first-import-only deferred index creation or temporary staging tables can reduce `samples` / `sample_versions` / `sample_observation_events` write cost without weakening final archive integrity.
807		-17. Revisit adaptive page sizes only after SQLite write-path costs are reduced.
808		-18. Revisit background / scheduled collection once initial import can finish reliably and post-import UI recovery is bounded.
	818	+16. Run a fresh first-import benchmark after the unused-index removal and compare `SummedInsertElapsed`, `Heart Rate insertElapsed`, and `Active Energy insertElapsed`.
	819	+17. Investigate whether first-import-only deferred index creation or temporary staging tables can reduce `samples` / `sample_versions` / `sample_observation_events` write cost without weakening final archive integrity.
	820	+18. Revisit adaptive page sizes only after SQLite write-path costs are reduced.
	821	+19. Revisit background / scheduled collection once initial import can finish reliably and post-import UI recovery is bounded.
809	822
810	823	## Verification Checklist For Each Optimization
811	824

●	HealthProbe/Services/SQLiteHealthArchiveStore.swift	+4 -5

	@@ -3910,11 +3910,10 @@ actor SQLiteHealthArchiveStore: HealthArchiveStore {
3910	3910	) throws -> (appeared: Int, disappeared: Int, representationChanged: Int) {
3911	3911	let sql = """
3912	3912	SELECT e.event_kind, COUNT(*)
3913		- FROM samples s INDEXED BY idx_samples_type_id
3914		- JOIN sample_observation_events e INDEXED BY idx_events_sample
3915		- ON e.sample_id = s.id
3916		- AND e.observation_id = ?
3917		- WHERE s.sample_type_id = ?
	3913	+ FROM sample_observation_events e INDEXED BY idx_events_observation_kind
	3914	+ JOIN samples s ON s.id = e.sample_id
	3915	+ WHERE e.observation_id = ?
	3916	+ AND s.sample_type_id = ?
3918	3917	GROUP BY e.event_kind
3919	3918	"""
3920	3919	return try withStatement(sql, db: db) { statement in
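
The query-shape change in this hunk can be tried in isolation. The following is a minimal sketch, not the store's actual implementation: it assumes a cut-down schema in which only the table, column, and index names visible in the diff are real, and the helper names (`makeDemoDatabase`, `eventCounts`), the seed rows, and the `[String: Int]` return shape are hypothetical. The real function returns an `(appeared, disappeared, representationChanged)` tuple and goes through `withStatement`. It also assumes the `SQLite3` module is available, as it is on Apple platforms.

```swift
import SQLite3

// Hypothetical, simplified schema. Table, column, and index names follow the
// diff; column types and the seed rows are assumptions for illustration.
let demoSchema = """
CREATE TABLE samples (id INTEGER PRIMARY KEY, sample_type_id INTEGER NOT NULL);
CREATE TABLE sample_observation_events (
    sample_id INTEGER NOT NULL,
    observation_id INTEGER NOT NULL,
    event_kind TEXT NOT NULL
);
CREATE INDEX idx_events_observation_kind
    ON sample_observation_events (observation_id, event_kind);
INSERT INTO samples (id, sample_type_id) VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO sample_observation_events (sample_id, observation_id, event_kind) VALUES
    (1, 7, 'appeared'), (2, 7, 'appeared'), (3, 7, 'appeared'),
    (1, 8, 'disappeared');
"""

// Event-first query shape from the diff: start from the events of one
// observation, then join to samples only to filter by type.
let eventFirstSQL = """
SELECT e.event_kind, COUNT(*)
FROM sample_observation_events e INDEXED BY idx_events_observation_kind
JOIN samples s ON s.id = e.sample_id
WHERE e.observation_id = ?
  AND s.sample_type_id = ?
GROUP BY e.event_kind
"""

// Opens an in-memory database and loads the demo schema (hypothetical helper).
func makeDemoDatabase() -> OpaquePointer? {
    var db: OpaquePointer?
    guard sqlite3_open(":memory:", &db) == SQLITE_OK,
          sqlite3_exec(db, demoSchema, nil, nil, nil) == SQLITE_OK else {
        sqlite3_close(db)
        return nil
    }
    return db
}

// Returns event counts keyed by event_kind for one observation and one type.
// Bind order matches the diff: observation_id first, then sample_type_id.
func eventCounts(db: OpaquePointer, observationID: Int64, sampleTypeID: Int64) -> [String: Int] {
    var statement: OpaquePointer?
    guard sqlite3_prepare_v2(db, eventFirstSQL, -1, &statement, nil) == SQLITE_OK else {
        return [:]
    }
    defer { sqlite3_finalize(statement) }
    sqlite3_bind_int64(statement, 1, observationID)
    sqlite3_bind_int64(statement, 2, sampleTypeID)

    var counts: [String: Int] = [:]
    while sqlite3_step(statement) == SQLITE_ROW {
        guard let kind = sqlite3_column_text(statement, 0) else { continue }
        counts[String(cString: kind)] = Int(sqlite3_column_int64(statement, 1))
    }
    return counts
}

if let db = makeDemoDatabase() {
    defer { sqlite3_close(db) }
    print(eventCounts(db: db, observationID: 7, sampleTypeID: 10))
}
```

Prefixing `eventFirstSQL` with `EXPLAIN QUERY PLAN` against a realistically sized database is one way to check that the planner really drives the join from `idx_events_observation_kind` rather than from a scan of the type's samples.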