Version: 1.1 Last Updated: 2026-05-23 Status: Canonical database/storage design
The database is the central piece of HealthProbe. The app can only reconstruct, analyze, export, and explain HealthKit history if the archive is complete, correct, queryable, and stable across product changes.
UI can be refactored cheaply. A wrong archive design can permanently lose evidence, make large analyses impossible on low-end devices, or prevent future recovery-compatible exports. All storage work must start from this document.
SQLite archive is the source of truth. Core Data is a rebuildable cache. SwiftData is legacy/prototype only.
Store differentially. Do not append recurring complete snapshots of large HealthKit datasets. Store identities, payload versions, observation events/ranges, and aggregates.
Analyze in SQL, not RAM. Diffs, counts, point-in-time reconstruction, export selection, and consolidation heuristics must use SQLite indexes, joins, CTEs, temporary tables, and paged/streaming results.
Support legacy devices. The target includes iOS 15-era devices such as iPhone 6s-class Health collection setups. Do not require SwiftData.
Preserve recovery-compatible structure. The app will not restore or re-publish data, but archives/exports must preserve enough identity, payload, provenance, relationships, hashes, and observation history for external recovery/salvage tooling.
Never treat counts as sufficient truth. Counts are cached for reports/UI, but record identity, payload versions, visibility history, and aggregate context are required for interpretation.
No real personal data in repository artifacts. Database fixtures, docs, tests, and examples must use synthetic values only.
HealthProbeArchive.sqlite
Responsibilities: - canonical HealthKit observation history; - sample identity and payload versioning; - source/device/metadata/relationship preservation; - point-in-time reconstruction; - adjacent and selected-observation diffs; - consolidation heuristics; - materialized aggregates; - streaming/paged exports; - integrity manifests and future schema migrations.
This database must be queryable without loading high-frequency datasets into Swift arrays.
Core Data cache store.
Responsibilities: - expensive counts already computed from SQLite; - observation list rows; - dashboard/timeline summaries; - per-type summary rows; - report/export metadata; - app settings and lightweight UI state.
Rules: - cache rows are disposable; - cache rows must be rebuildable from SQLite; - if Core Data and SQLite disagree, SQLite wins; - Core Data must not contain the only copy of any record-level evidence.
Current SwiftData models are a prototype implementation detail. New storage work should not expand them.
There are no real deployments, only test installs. During the archive v2 refactor, old SwiftData stores and prototype SQLite archives may be ignored, deleted, or reinitialized. Do not build backward compatibility or one-way import for the old prototype schema unless a later product decision explicitly changes this policy.
An observation is one local capture attempt/result at a specific time on the current device chain. It is not a full copy of all visible records.
An observation records: - when capture started/ended; - app/schema/OS context; - timezone context at observation time; - selected type registry; - per-type capture quality; - HealthKit anchors; - events and aggregate changes observed during the capture.
Implementation note, 2026-05-24: the app creates one archive observations
row for each user-visible capture/snapshot. Anchored HealthKit pages,
HKDeletedObject evidence, and the final type verification rows for that
capture attach to that same observation id so SQL diffs, cache rows, and
transition UI can reference a stable capture boundary.
A sample identity is the stable record or semantic record HealthProbe tracks over time.
Identity inputs may include: - HealthKit UUID hash when available; - strict fingerprint; - semantic fingerprint; - sample type; - date range; - value/unit/category/workout fields; - source revision where relevant.
HealthKit UUID hash is important but not enough for every future use case. Apple exports and backup database extracts may require semantic/fuzzy matching.
A sample version is the payload representation observed for a sample identity.
A new version is created only when the representation changes: - start/end dates; - value/unit/category/workout fields; - source revision; - metadata hash; - related sample/workout/event links.
HealthProbe stores visibility as events and/or compressed ranges:
- appeared;
- verified/seen;
- disappeared/no longer visible;
- representation changed;
- deleted-object evidence where HealthKit exposes HKDeletedObject.
This allows point-in-time reconstruction without duplicating every visible record into every observation.
SQLite stores materialized aggregates because many reports and screens need expensive counts/sums repeatedly.
Aggregates are archive-derived evidence, not the source of truth. They must be rebuildable from sample/version/event tables.
Exact names may evolve, but the shape and constraints should remain.
CREATE TABLE schema_migrations (
version INTEGER PRIMARY KEY,
applied_at REAL NOT NULL,
description TEXT NOT NULL
);
CREATE TABLE archive_metadata (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
CREATE TABLE device_chains (
id INTEGER PRIMARY KEY,
device_chain_hash TEXT NOT NULL UNIQUE,
created_at REAL NOT NULL,
recovered_from_keychain INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE observations (
id INTEGER PRIMARY KEY,
device_chain_id INTEGER NOT NULL REFERENCES device_chains(id),
observed_at REAL NOT NULL,
started_at REAL,
ended_at REAL,
status TEXT NOT NULL,
trigger_reason TEXT NOT NULL,
app_version TEXT,
os_version TEXT,
time_zone_identifier TEXT,
time_zone_seconds_from_gmt INTEGER,
schema_version INTEGER NOT NULL,
selected_type_set_hash TEXT,
notes TEXT
);
CREATE INDEX idx_observations_device_time
ON observations(device_chain_id, observed_at);
CREATE TABLE sample_types (
id INTEGER PRIMARY KEY,
type_identifier TEXT NOT NULL UNIQUE,
display_name TEXT,
category TEXT
);
CREATE TABLE observation_type_runs (
id INTEGER PRIMARY KEY,
observation_id INTEGER NOT NULL REFERENCES observations(id),
sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
status TEXT NOT NULL,
started_at REAL,
ended_at REAL,
anchor_before BLOB,
anchor_after BLOB,
inserted_event_count INTEGER NOT NULL DEFAULT 0,
deleted_event_count INTEGER NOT NULL DEFAULT 0,
verified_visible_count INTEGER,
error_kind TEXT,
error_message_hash TEXT,
UNIQUE(observation_id, sample_type_id)
);
CREATE INDEX idx_type_runs_type_observation
ON observation_type_runs(sample_type_id, observation_id);
CREATE TABLE sources (
id INTEGER PRIMARY KEY,
source_name_hash TEXT,
bundle_identifier TEXT
);
CREATE TABLE source_revisions (
id INTEGER PRIMARY KEY,
source_id INTEGER NOT NULL REFERENCES sources(id),
product_type TEXT,
version TEXT,
operating_system_version TEXT,
UNIQUE(source_id, product_type, version, operating_system_version)
);
CREATE TABLE hk_devices (
id INTEGER PRIMARY KEY,
device_hash TEXT,
manufacturer_hash TEXT,
model TEXT,
hardware_version TEXT,
firmware_version TEXT,
software_version TEXT,
local_identifier_hash TEXT,
udi_hash TEXT
);
CREATE TABLE metadata_blobs (
id INTEGER PRIMARY KEY,
metadata_hash TEXT NOT NULL UNIQUE,
metadata_json TEXT NOT NULL
);
Privacy note: raw personal/device identifiers should be hashed or omitted according to policy. Store enough provenance for local analysis and recovery-compatible exports without leaking identifiers into logs or repository fixtures.
CREATE TABLE samples (
id INTEGER PRIMARY KEY,
sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
sample_uuid_hash TEXT,
strict_fingerprint TEXT NOT NULL,
semantic_fingerprint TEXT,
fuzzy_key TEXT,
first_seen_observation_id INTEGER NOT NULL REFERENCES observations(id),
first_seen_at REAL NOT NULL,
UNIQUE(sample_type_id, strict_fingerprint)
);
CREATE INDEX idx_samples_uuid_hash
ON samples(sample_uuid_hash);
CREATE INDEX idx_samples_type_semantic
ON samples(sample_type_id, semantic_fingerprint);
CREATE TABLE sample_versions (
id INTEGER PRIMARY KEY,
sample_id INTEGER NOT NULL REFERENCES samples(id),
payload_hash TEXT NOT NULL,
start_date REAL NOT NULL,
end_date REAL NOT NULL,
value_kind TEXT,
numeric_value REAL,
unit TEXT,
category_value INTEGER,
workout_activity_type INTEGER,
duration_seconds REAL,
source_revision_id INTEGER REFERENCES source_revisions(id),
hk_device_id INTEGER REFERENCES hk_devices(id),
metadata_id INTEGER REFERENCES metadata_blobs(id),
created_observation_id INTEGER NOT NULL REFERENCES observations(id),
UNIQUE(sample_id, payload_hash)
);
CREATE INDEX idx_sample_versions_sample
ON sample_versions(sample_id);
CREATE INDEX idx_sample_versions_time
ON sample_versions(start_date, end_date);
CREATE TABLE sample_observation_events (
id INTEGER PRIMARY KEY,
observation_id INTEGER NOT NULL REFERENCES observations(id),
sample_id INTEGER NOT NULL REFERENCES samples(id),
version_id INTEGER REFERENCES sample_versions(id),
event_kind TEXT NOT NULL,
observed_at REAL NOT NULL,
evidence_kind TEXT,
UNIQUE(observation_id, sample_id, event_kind)
);
CREATE INDEX idx_events_observation_kind
ON sample_observation_events(observation_id, event_kind);
CREATE INDEX idx_events_sample
ON sample_observation_events(sample_id, observation_id);
CREATE TABLE sample_visibility_ranges (
sample_id INTEGER NOT NULL REFERENCES samples(id),
version_id INTEGER REFERENCES sample_versions(id),
first_observation_id INTEGER NOT NULL REFERENCES observations(id),
last_observation_id INTEGER REFERENCES observations(id),
first_seen_at REAL NOT NULL,
last_seen_at REAL,
PRIMARY KEY (sample_id, version_id, first_observation_id)
);
CREATE INDEX idx_visibility_open_ranges
ON sample_visibility_ranges(last_observation_id);
CREATE INDEX idx_visibility_point_lookup
ON sample_visibility_ranges(first_observation_id, last_observation_id);
Range convention:
- last_observation_id IS NULL means still visible at the latest verified observation for that type;
- last_observation_id stores the observation that closed the range and is therefore an exclusive end;
- closed samples/versions are visible when first_observation_id <= target.id and last_observation_id > target.id;
- deleted-object evidence should create an event even when full payload is not available.
CREATE TABLE sample_relationships (
id INTEGER PRIMARY KEY,
observation_id INTEGER REFERENCES observations(id),
source_sample_id INTEGER NOT NULL REFERENCES samples(id),
target_sample_id INTEGER NOT NULL REFERENCES samples(id),
relationship_kind TEXT NOT NULL,
metadata_id INTEGER REFERENCES metadata_blobs(id),
UNIQUE(observation_id, source_sample_id, target_sample_id, relationship_kind)
);
CREATE INDEX idx_relationship_source
ON sample_relationships(source_sample_id, relationship_kind);
CREATE INDEX idx_relationship_target
ON sample_relationships(target_sample_id, relationship_kind);
Relationships are required for recovery-compatible archives. Even if iOS HealthKit exposes limited relationships, the schema must not prevent future preservation.
CREATE TABLE observation_type_summaries (
observation_id INTEGER NOT NULL REFERENCES observations(id),
sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
visible_record_count INTEGER NOT NULL,
appeared_count INTEGER NOT NULL DEFAULT 0,
disappeared_count INTEGER NOT NULL DEFAULT 0,
representation_changed_count INTEGER NOT NULL DEFAULT 0,
earliest_start_date REAL,
latest_end_date REAL,
value_sum REAL,
value_max REAL,
aggregate_hash TEXT,
PRIMARY KEY (observation_id, sample_type_id)
);
CREATE TABLE daily_type_aggregates (
observation_id INTEGER NOT NULL REFERENCES observations(id),
sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
bucket_start REAL NOT NULL,
bucket_end REAL NOT NULL,
visible_record_count INTEGER NOT NULL,
value_sum REAL,
value_max REAL,
source_revision_id INTEGER,
aggregate_hash TEXT,
PRIMARY KEY (observation_id, sample_type_id, bucket_start, source_revision_id)
);
CREATE INDEX idx_daily_type_bucket
ON daily_type_aggregates(sample_type_id, bucket_start);
Aggregates feed reports and the Core Data cache. They are also important for consolidation heuristics because a count drop with stable aggregate value may be representation change, not meaningful loss.
CREATE TABLE export_manifests (
id INTEGER PRIMARY KEY,
export_id TEXT NOT NULL UNIQUE,
created_at REAL NOT NULL,
export_kind TEXT NOT NULL,
from_observation_id INTEGER REFERENCES observations(id),
to_observation_id INTEGER REFERENCES observations(id),
filter_json TEXT,
manifest_hash TEXT NOT NULL,
record_count INTEGER NOT NULL
);
CREATE TABLE export_items (
export_manifest_id INTEGER NOT NULL REFERENCES export_manifests(id),
sample_id INTEGER NOT NULL REFERENCES samples(id),
version_id INTEGER REFERENCES sample_versions(id),
item_hash TEXT NOT NULL,
PRIMARY KEY (export_manifest_id, sample_id, version_id)
);
Exports should be reproducible from the archive when possible. Manifest hashes let external tools verify that a recovery-compatible export matches the archived evidence.
These decisions close Milestone 1 for archive v2 unless a later dated entry changes them.
Archive timestamps are stored as Unix seconds in UTC using SQLite REAL.
Rules:
- write dates with Date.timeIntervalSince1970;
- read dates with Date(timeIntervalSince1970:);
- never store local-time interpreted timestamps in archive date columns;
- canonical hash/export text uses ISO 8601 UTC with fractional seconds;
- aggregate bucket rows store UTC bucket boundaries plus the observation timezone context that produced them.
Use two hash classes:
- Integrity/content hashes: plain SHA-256, lower-case hex. Use for payload hashes, metadata hashes, aggregate hashes, export item hashes, and manifest hashes.
- Privacy-sensitive identifiers: HMAC-SHA256, lower-case hex, using a locally stored archive secret. Use for HealthKit UUIDs, device identifiers, source names, local identifiers, UDI-like values, and device-chain identifiers.
All hash inputs must include a domain/version prefix such as hp:v2:sample_uuid: or hp:v2:payload:. Do not hash raw strings without a domain prefix.
The local archive secret is device-local application state. If the secret is lost, future captures may start a new device chain. Export files include already-computed hashes and manifest/item hashes, not the local archive secret.
device_chain_hash identifies one local capture chain, not a globally unique device.
Initial implementation:
- create or recover a random chain seed from Keychain;
- compute device_chain_hash = HMAC-SHA256(archiveSecret, "hp:v2:device_chain:" + chainSeed);
- set recovered_from_keychain = 1 when the seed survived app reinstall and was reused;
- start a new chain if the seed is missing or explicitly reset.
HealthKit device metadata may be stored as hashed provenance in hk_devices, but it must not be used as a cross-device comparison key.
sample_uuid_hash is the preferred stable HealthKit identity when HealthKit exposes a UUID.
strict_fingerprint is a deterministic exact fallback/verification key built from canonical fields:
- type identifier;
- start/end timestamps in canonical UTC text;
- value kind and canonical value representation;
- unit/category/workout fields where applicable;
- source bundle identifier hash when available.
Do not include SQLite row ids in fingerprints. If HealthKit UUID is available, a payload change creates a new sample_versions row for the same sample identity. If HealthKit UUID is not available, an exact strict fingerprint change may create appeared/disappeared evidence, with semantic_fingerprint used only as weak consolidation context.
payload_hash is SHA-256 over the canonical sample payload representation, including dates, value/unit/category/workout fields, source revision fields, device provenance hashes, metadata hash, and relationship payload when available. A new sample_versions row is created when payload_hash changes.
Implementation note, 2026-05-24: archive v2 capture must derive payload_hash from the same normalized row values that are persisted. Unknown HealthKit OS versions, including 0.0.0, are stored as absent. Capture, verification, and deletion bookkeeping must use archive v2 identity/version/visibility tables, not the removed legacy archive_samples mirror.
semantic_fingerprint is type-specific and optional. It supports consolidation heuristics and fuzzy backup/export reconciliation, but it is never sufficient by itself to prove record identity.
Raw sample timestamps remain UTC. Daily/monthly aggregate buckets are computed using the device timezone active at observation time because user-facing health summaries are local-day concepts.
Rules:
- store time_zone_identifier and time_zone_seconds_from_gmt on observations;
- bucket boundaries are midnight-to-midnight in that observation timezone, stored as UTC seconds;
- old observations are not retroactively re-bucketed when the device timezone changes;
- exports include timezone metadata so external tools can reinterpret buckets if needed.
Maintain sample_visibility_ranges eagerly in the same transaction that writes observation events. Point-in-time queries should read ranges, not rebuild them from events on every request.
Integrity tools may rebuild ranges from sample_observation_events into temporary tables and compare the result with stored ranges. Validation rebuild is for tests/repair checks, not the normal query path.
Store every relationship the capture/import surface can observe. Relationship rows are append-only evidence for what was known at an observation; do not rewrite relationship_kind to encode later disappearance.
If a related sample disappears, endpoint visibility ranges and events explain that disappearance. Relationship exports should include relationships when both endpoints are included, and may include unresolved endpoint hashes when allowed by the export scope.
Aggregates are computed in SQLite after each successful observation/type run and materialized in observation_type_summaries and daily_type_aggregates.
SQL-first does not mean recompute every count live for the UI. It means heavy computation is done by SQLite using indexes, joins, CTEs, temporary tables, and paged result sets. Repeated UI/report reads consume materialized SQLite aggregates and the Core Data cache.
Structured exports use a versioned canonical envelope with deterministic ordering.
Rules:
- export format version starts at 1;
- JSON object keys are sorted for canonical bytes;
- record/item order is deterministic: sample type, start date, end date, sample identity hash, version hash;
- each exported item has an item_hash = SHA-256(canonical item JSON);
- manifest_hash = SHA-256(canonical export metadata + ordered item_hash list + counts + filter description);
- large exports compute hashes incrementally while streaming/paging rows from SQLite.
Manifest hashes must cover exported content through item hashes, not only counts or first/last dates.
For each user-visible capture:
observations row with status running.observation_type_runs.samples;sample_versions;HKDeletedObject:
completed, needs_review, partial_*, or failed.SQLite writes may be committed per page/type to keep memory and lock time bounded, but they must share the capture's observation id.
SQLite commit must happen before Core Data cache update. Cache rebuild failures must not corrupt archive truth.
Capture pages may be retried. Writes must be idempotent using uniqueness constraints:
- (sample_type_id, strict_fingerprint) for samples;
- (sample_id, payload_hash) for versions;
- (observation_id, sample_id, event_kind) for events;
- range primary key for visibility.
HealthKit anchors are capture implementation state. Store them per type run, but do not treat anchors as forensic truth. If an anchor is unusable, the archive must still support rebuilding the current visible state from a full type scan.
Point-in-time reconstruction should use ranges, not full snapshot tables.
Conceptual query:
SELECT s.id AS sample_id, sv.id AS version_id, sv.start_date, sv.end_date,
sv.value_kind, sv.numeric_value, sv.unit
FROM sample_visibility_ranges r
JOIN samples s ON s.id = r.sample_id
JOIN sample_versions sv ON sv.id = r.version_id
JOIN observations target ON target.id = :observation_id
WHERE s.sample_type_id = :sample_type_id
AND r.first_observation_id <= target.id
AND (r.last_observation_id IS NULL OR r.last_observation_id > target.id)
ORDER BY sv.start_date, s.strict_fingerprint
LIMIT :limit OFFSET :offset;
Implementation may optimize this with temporary tables or materialized visible sets for selected observations.
Diffs must run in SQLite.
CREATE TEMP TABLE prev_visible AS
SELECT r.sample_id, r.version_id
FROM sample_visibility_ranges r
WHERE r.first_observation_id <= :previous
AND (r.last_observation_id IS NULL OR r.last_observation_id > :previous);
CREATE TEMP TABLE curr_visible AS
SELECT r.sample_id, r.version_id
FROM sample_visibility_ranges r
WHERE r.first_observation_id <= :current
AND (r.last_observation_id IS NULL OR r.last_observation_id > :current);
CREATE INDEX temp_prev_sample ON prev_visible(sample_id);
CREATE INDEX temp_curr_sample ON curr_visible(sample_id);
-- Disappeared.
SELECT p.sample_id
FROM prev_visible p
LEFT JOIN curr_visible c ON c.sample_id = p.sample_id
WHERE c.sample_id IS NULL;
-- Appeared.
SELECT c.sample_id
FROM curr_visible c
LEFT JOIN prev_visible p ON p.sample_id = c.sample_id
WHERE p.sample_id IS NULL;
-- Representation changed.
SELECT c.sample_id, p.version_id AS before_version_id, c.version_id AS after_version_id
FROM curr_visible c
JOIN prev_visible p ON p.sample_id = c.sample_id
WHERE p.version_id != c.version_id;
Result sets must be paged. Counts can be materialized into observation_type_summaries and Core Data cache.
Consolidation likely when: - many old high-frequency records disappear; - newer records cover similar date windows; - aggregate sums remain within tolerance; - sample density decreases while duration/interval length increases; - source/provenance context is compatible.
Required evidence: - record-count delta; - value-sum delta; - coverage-window overlap; - interval-length/density comparison; - source/source-revision breakdown; - uncertainty label if evidence is incomplete.
Never classify count drops alone as loss.
Core Data entities should mirror presentation needs, not archive internals.
Core Data may contain two categories: - rebuildable UI/report cache derived from SQLite; - non-forensic local app state/settings.
Deleting rebuildable cache rows must not delete the SQLite archive. User settings may be preserved across cache rebuilds.
Candidate cache entities:
- CachedObservationRow;
- CachedTypeSummary;
- CachedDailyAggregate;
- CachedDiffSummary;
- CachedExportManifest;
- CachedArchiveHealth;
- AppSetting.
Every cache row should include: - archive schema version; - cache schema version; - source observation id(s); - source aggregate/hash where applicable; - computed_at timestamp.
Invalidation rules: - archive reset or future archive migration invalidates cache; - selected type registry change invalidates affected summaries; - aggregate rebuild invalidates corresponding Core Data rows; - app can delete all cache rows and rebuild from SQLite.
Exports are scoped and recovery-compatible.
Every structured export should include: - export id; - archive schema version; - app version; - observation id(s); - selected type filters; - record count; - manifest hash; - per-record sample identity/fingerprint; - payload version hash; - dates, values, units, category/workout fields; - source/provenance metadata where available and allowed; - relationships where available; - provenance-loss warnings for external re-publication workflows.
Exports must stream/page from SQLite. Do not build large JSON/CSV exports entirely in RAM.
Current status: HealthProbe has no real deployments, only test installations. The archive v2 refactor does not need backward compatibility with the old SwiftData/prototype SQLite schema.
For the current refactor: - old SwiftData stores and prototype SQLite archives may be deleted, ignored, or reinitialized; - no one-way migration from old prototype stores is required; - test users/developers must expect prototype data loss when moving to archive v2; - reset behavior must be documented in test/release notes and must not be presented as data preservation; - Core Data cache stores remain disposable and rebuildable.
For future real archives:
- migrations must be versioned in schema_migrations;
- migrations must be tested on synthetic large archives;
- migration failure must not silently delete the archive;
- cache stores may be deleted and rebuilt;
- archive reset must require explicit user confirmation.
Archive integrity checks:
- SQLite PRAGMA integrity_check;
- schema version check;
- missing FK/reference checks;
- aggregate rebuild spot checks;
- manifest hash verification;
- open visibility range sanity checks;
- duplicate identity checks;
- export reproducibility checks.
Use WAL mode for normal operation. Consider periodic checkpoints when safe.
Unit/integration tests must cover: - idempotent repeated capture page writes; - first observation for a type; - appeared/disappeared/representation-changed events; - visibility range open/close behavior; - point-in-time reconstruction; - diff queries on large synthetic datasets; - aggregate rebuild; - Core Data cache rebuild after deletion; - export manifest reproducibility; - recovery-compatible export fields; - prototype-store reset/reinitialization behavior for current test installs; - future archive migrations once real archive versions exist; - memory ceiling during export and diff.
No real HealthKit data in fixtures.
These do not block the archive v2 foundation, but they should be revisited before advanced import/export tooling:
Record any changed decision here with a date before implementing schema changes.