# HealthProbe - Database Design

**Version:** 1.1
**Last Updated:** 2026-05-23
**Status:** Canonical database/storage design

## 1. Purpose

The database is the central piece of HealthProbe. The app can only reconstruct, analyze, export, and explain HealthKit history if the archive is complete, correct, queryable, and stable across product changes.

UI can be refactored cheaply. A wrong archive design can permanently lose evidence, make large analyses impossible on low-end devices, or prevent future recovery-compatible exports. All storage work must start from this document.

## 2. Non-Negotiable Requirements

1. **SQLite archive is the source of truth.**
   Core Data is a rebuildable cache. SwiftData is legacy/prototype only.

2. **Store differentially.**
   Do not append recurring complete snapshots of large HealthKit datasets. Store identities, payload versions, observation events/ranges, and aggregates.

3. **Analyze in SQL, not RAM.**
   Diffs, counts, point-in-time reconstruction, export selection, and consolidation heuristics must use SQLite indexes, joins, CTEs, temporary tables, and paged/streaming results.

4. **Support legacy devices.**
   The target includes iOS 15-era devices such as iPhone 6s-class Health collection setups. Do not require SwiftData.

5. **Preserve recovery-compatible structure.**
   The app will not restore or re-publish data, but archives/exports must preserve enough identity, payload, provenance, relationships, hashes, and observation history for external recovery/salvage tooling.

6. **Never treat counts as sufficient truth.**
   Counts are cached for reports/UI, but record identity, payload versions, visibility history, and aggregate context are required for interpretation.

7. **No real personal data in repository artifacts.**
   Database fixtures, docs, tests, and examples must use synthetic values only.

## 3. Storage Layers

### 3.1 SQLite Archive / Analysis Database

`HealthProbeArchive.sqlite`

Responsibilities:
- canonical HealthKit observation history;
- sample identity and payload versioning;
- source/device/metadata/relationship preservation;
- point-in-time reconstruction;
- adjacent and selected-observation diffs;
- consolidation heuristics;
- materialized aggregates;
- streaming/paged exports;
- integrity manifests and future schema migrations.

This database must be queryable without loading high-frequency datasets into Swift arrays.

### 3.2 Core Data UI / Report Cache

Core Data cache store.

Responsibilities:
- expensive counts already computed from SQLite;
- observation list rows;
- dashboard/timeline summaries;
- per-type summary rows;
- report/export metadata;
- app settings and lightweight UI state.

Rules:
- cache rows are disposable;
- cache rows must be rebuildable from SQLite;
- if Core Data and SQLite disagree, SQLite wins;
- Core Data must not contain the only copy of any record-level evidence.

### 3.3 SwiftData Legacy Store

Current SwiftData models are a prototype implementation detail. New storage work should not expand them.

There are no real deployments, only test installs. During the archive v2 refactor, old SwiftData stores and prototype SQLite archives may be ignored, deleted, or reinitialized. Do not build backward compatibility or one-way import for the old prototype schema unless a later product decision explicitly changes this policy.

## 4. Conceptual Model

### Observation

An observation is one local capture attempt/result at a specific time on the current device chain. It is not a full copy of all visible records.

An observation records:
- when capture started/ended;
- app/schema/OS context;
- timezone context at observation time;
- selected type registry;
- per-type capture quality;
- HealthKit anchors;
- events and aggregate changes observed during the capture.

Implementation note, 2026-05-24: the app creates one archive `observations`
row for each user-visible capture/snapshot. Anchored HealthKit pages,
`HKDeletedObject` evidence, and the final type verification rows for that
capture attach to that same observation id so SQL diffs, cache rows, and
transition UI can reference a stable capture boundary.

### Terminology

- **Capture**: the act of querying HealthKit and writing results to the archive.
- **Observation**: the durable archive record created by a capture attempt.
- **Snapshot**: a reconstructed view of records visible at a selected observation. Do not store snapshot copies for high-volume data.
- **Diff**: SQL-derived comparison between two observations on the same local device chain.

### Sample Identity

A sample identity is the stable record or semantic record HealthProbe tracks over time.

Identity inputs may include:
- HealthKit UUID hash when available;
- strict fingerprint;
- semantic fingerprint;
- sample type;
- date range;
- value/unit/category/workout fields;
- source revision where relevant.

HealthKit UUID hash is important but not enough for every future use case. Apple exports and backup database extracts may require semantic/fuzzy matching.

### Sample Version

A sample version is the payload representation observed for a sample identity.

A new version is created only when the representation changes:
- start/end dates;
- value/unit/category/workout fields;
- source revision;
- metadata hash;
- related sample/workout/event links.

### Visibility/Event History

HealthProbe stores visibility as events and/or compressed ranges:
- appeared;
- verified/seen;
- disappeared/no longer visible;
- representation changed;
- deleted-object evidence where HealthKit exposes `HKDeletedObject`.

This allows point-in-time reconstruction without duplicating every visible record into every observation.

### Aggregate Cache In SQLite

SQLite stores materialized aggregates because many reports and screens need expensive counts/sums repeatedly.

Aggregates are archive-derived evidence, not the source of truth. They must be rebuildable from sample/version/event tables.

## 5. Target SQLite Schema

Exact names may evolve, but the shape and constraints should remain.

### 5.1 Schema And Metadata

```sql
CREATE TABLE schema_migrations (
    version INTEGER PRIMARY KEY,
    applied_at REAL NOT NULL,
    description TEXT NOT NULL
);

CREATE TABLE archive_metadata (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL
);
```

### 5.2 Device Chain And Observations

```sql
CREATE TABLE device_chains (
    id INTEGER PRIMARY KEY,
    device_chain_hash TEXT NOT NULL UNIQUE,
    created_at REAL NOT NULL,
    recovered_from_keychain INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE observations (
    id INTEGER PRIMARY KEY,
    device_chain_id INTEGER NOT NULL REFERENCES device_chains(id),
    observed_at REAL NOT NULL,
    started_at REAL,
    ended_at REAL,
    status TEXT NOT NULL,
    trigger_reason TEXT NOT NULL,
    app_version TEXT,
    os_version TEXT,
    time_zone_identifier TEXT,
    time_zone_seconds_from_gmt INTEGER,
    schema_version INTEGER NOT NULL,
    selected_type_set_hash TEXT,
    notes TEXT
);

CREATE INDEX idx_observations_device_time
ON observations(device_chain_id, observed_at);
```

### 5.3 Per-Type Capture Runs And Anchors

```sql
CREATE TABLE sample_types (
    id INTEGER PRIMARY KEY,
    type_identifier TEXT NOT NULL UNIQUE,
    display_name TEXT,
    category TEXT
);

CREATE TABLE observation_type_runs (
    id INTEGER PRIMARY KEY,
    observation_id INTEGER NOT NULL REFERENCES observations(id),
    sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
    status TEXT NOT NULL,
    started_at REAL,
    ended_at REAL,
    anchor_before BLOB,
    anchor_after BLOB,
    inserted_event_count INTEGER NOT NULL DEFAULT 0,
    deleted_event_count INTEGER NOT NULL DEFAULT 0,
    verified_visible_count INTEGER,
    error_kind TEXT,
    error_message_hash TEXT,
    UNIQUE(observation_id, sample_type_id)
);

CREATE INDEX idx_type_runs_type_observation
ON observation_type_runs(sample_type_id, observation_id);
```

### 5.4 Sources, Devices, Metadata

```sql
CREATE TABLE sources (
    id INTEGER PRIMARY KEY,
    source_name_hash TEXT,
    bundle_identifier TEXT
);

CREATE TABLE source_revisions (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(id),
    product_type TEXT,
    version TEXT,
    operating_system_version TEXT,
    UNIQUE(source_id, product_type, version, operating_system_version)
);

CREATE TABLE hk_devices (
    id INTEGER PRIMARY KEY,
    device_hash TEXT,
    manufacturer_hash TEXT,
    model TEXT,
    hardware_version TEXT,
    firmware_version TEXT,
    software_version TEXT,
    local_identifier_hash TEXT,
    udi_hash TEXT
);

CREATE TABLE metadata_blobs (
    id INTEGER PRIMARY KEY,
    metadata_hash TEXT NOT NULL UNIQUE,
    metadata_json TEXT NOT NULL
);
```

Privacy note: raw personal/device identifiers should be hashed or omitted according to policy. Store enough provenance for local analysis and recovery-compatible exports without leaking identifiers into logs or repository fixtures.

### 5.5 Samples And Payload Versions

```sql
CREATE TABLE samples (
    id INTEGER PRIMARY KEY,
    sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
    sample_uuid_hash TEXT,
    strict_fingerprint TEXT NOT NULL,
    semantic_fingerprint TEXT,
    fuzzy_key TEXT,
    first_seen_observation_id INTEGER NOT NULL REFERENCES observations(id),
    first_seen_at REAL NOT NULL,
    UNIQUE(sample_type_id, strict_fingerprint)
);

CREATE INDEX idx_samples_uuid_hash
ON samples(sample_uuid_hash);

CREATE INDEX idx_samples_type_semantic
ON samples(sample_type_id, semantic_fingerprint);

CREATE TABLE sample_versions (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES samples(id),
    payload_hash TEXT NOT NULL,
    start_date REAL NOT NULL,
    end_date REAL NOT NULL,
    value_kind TEXT,
    numeric_value REAL,
    unit TEXT,
    category_value INTEGER,
    workout_activity_type INTEGER,
    duration_seconds REAL,
    source_revision_id INTEGER REFERENCES source_revisions(id),
    hk_device_id INTEGER REFERENCES hk_devices(id),
    metadata_id INTEGER REFERENCES metadata_blobs(id),
    created_observation_id INTEGER NOT NULL REFERENCES observations(id),
    UNIQUE(sample_id, payload_hash)
);

CREATE INDEX idx_sample_versions_sample
ON sample_versions(sample_id);

CREATE INDEX idx_sample_versions_time
ON sample_versions(start_date, end_date);
```

### 5.6 Observation Events And Visibility Ranges

```sql
CREATE TABLE sample_observation_events (
    id INTEGER PRIMARY KEY,
    observation_id INTEGER NOT NULL REFERENCES observations(id),
    sample_id INTEGER NOT NULL REFERENCES samples(id),
    version_id INTEGER REFERENCES sample_versions(id),
    event_kind TEXT NOT NULL,
    observed_at REAL NOT NULL,
    evidence_kind TEXT,
    UNIQUE(observation_id, sample_id, event_kind)
);

CREATE INDEX idx_events_observation_kind
ON sample_observation_events(observation_id, event_kind);

CREATE INDEX idx_events_sample
ON sample_observation_events(sample_id, observation_id);

CREATE TABLE sample_visibility_ranges (
    sample_id INTEGER NOT NULL REFERENCES samples(id),
    version_id INTEGER REFERENCES sample_versions(id),
    first_observation_id INTEGER NOT NULL REFERENCES observations(id),
    last_observation_id INTEGER REFERENCES observations(id),
    first_seen_at REAL NOT NULL,
    last_seen_at REAL,
    PRIMARY KEY (sample_id, version_id, first_observation_id)
);

CREATE INDEX idx_visibility_open_ranges
ON sample_visibility_ranges(last_observation_id);

CREATE INDEX idx_visibility_point_lookup
ON sample_visibility_ranges(first_observation_id, last_observation_id);
```

Range convention:
- `last_observation_id IS NULL` means still visible at the latest verified observation for that type;
- `last_observation_id` stores the observation that closed the range and is therefore an exclusive end;
- closed samples/versions are visible when `first_observation_id <= target.id` and `last_observation_id > target.id`;
- deleted-object evidence should create an event even when full payload is not available.

### 5.7 Relationships

```sql
CREATE TABLE sample_relationships (
    id INTEGER PRIMARY KEY,
    observation_id INTEGER REFERENCES observations(id),
    source_sample_id INTEGER NOT NULL REFERENCES samples(id),
    target_sample_id INTEGER NOT NULL REFERENCES samples(id),
    relationship_kind TEXT NOT NULL,
    metadata_id INTEGER REFERENCES metadata_blobs(id),
    UNIQUE(observation_id, source_sample_id, target_sample_id, relationship_kind)
);

CREATE INDEX idx_relationship_source
ON sample_relationships(source_sample_id, relationship_kind);

CREATE INDEX idx_relationship_target
ON sample_relationships(target_sample_id, relationship_kind);
```

Relationships are required for recovery-compatible archives. Even if iOS HealthKit exposes limited relationships, the schema must not prevent future preservation.

### 5.8 Materialized Aggregates

```sql
CREATE TABLE observation_type_summaries (
    observation_id INTEGER NOT NULL REFERENCES observations(id),
    sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
    visible_record_count INTEGER NOT NULL,
    appeared_count INTEGER NOT NULL DEFAULT 0,
    disappeared_count INTEGER NOT NULL DEFAULT 0,
    representation_changed_count INTEGER NOT NULL DEFAULT 0,
    earliest_start_date REAL,
    latest_end_date REAL,
    value_sum REAL,
    value_max REAL,
    aggregate_hash TEXT,
    PRIMARY KEY (observation_id, sample_type_id)
);

CREATE TABLE daily_type_aggregates (
    observation_id INTEGER NOT NULL REFERENCES observations(id),
    sample_type_id INTEGER NOT NULL REFERENCES sample_types(id),
    bucket_start REAL NOT NULL,
    bucket_end REAL NOT NULL,
    visible_record_count INTEGER NOT NULL,
    value_sum REAL,
    value_max REAL,
    source_revision_id INTEGER,
    aggregate_hash TEXT,
    PRIMARY KEY (observation_id, sample_type_id, bucket_start, source_revision_id)
);

CREATE INDEX idx_daily_type_bucket
ON daily_type_aggregates(sample_type_id, bucket_start);
```

Aggregates feed reports and the Core Data cache. They are also important for consolidation heuristics because a count drop with stable aggregate value may be representation change, not meaningful loss.

### 5.9 Exports And Manifests

```sql
CREATE TABLE export_manifests (
    id INTEGER PRIMARY KEY,
    export_id TEXT NOT NULL UNIQUE,
    created_at REAL NOT NULL,
    export_kind TEXT NOT NULL,
    from_observation_id INTEGER REFERENCES observations(id),
    to_observation_id INTEGER REFERENCES observations(id),
    filter_json TEXT,
    manifest_hash TEXT NOT NULL,
    record_count INTEGER NOT NULL
);

CREATE TABLE export_items (
    export_manifest_id INTEGER NOT NULL REFERENCES export_manifests(id),
    sample_id INTEGER NOT NULL REFERENCES samples(id),
    version_id INTEGER REFERENCES sample_versions(id),
    item_hash TEXT NOT NULL,
    PRIMARY KEY (export_manifest_id, sample_id, version_id)
);
```

Exports should be reproducible from the archive when possible. Manifest hashes let external tools verify that a recovery-compatible export matches the archived evidence.

## 6. Archive V2 Implementation Decisions

These decisions close Milestone 1 for archive v2 unless a later dated entry changes them.

### 6.1 Timestamps

Archive timestamps are stored as Unix seconds in UTC using SQLite `REAL`.

Rules:
- write dates with `Date.timeIntervalSince1970`;
- read dates with `Date(timeIntervalSince1970:)`;
- never store local-time interpreted timestamps in archive date columns;
- canonical hash/export text uses ISO 8601 UTC with fractional seconds;
- aggregate bucket rows store UTC bucket boundaries plus the observation timezone context that produced them.

### 6.2 Hashing And Privacy

Use two hash classes:
- **Integrity/content hashes:** plain `SHA-256`, lower-case hex. Use for payload hashes, metadata hashes, aggregate hashes, export item hashes, and manifest hashes.
- **Privacy-sensitive identifiers:** `HMAC-SHA256`, lower-case hex, using a locally stored archive secret. Use for HealthKit UUIDs, device identifiers, source names, local identifiers, UDI-like values, and device-chain identifiers.

All hash inputs must include a domain/version prefix such as `hp:v2:sample_uuid:` or `hp:v2:payload:`. Do not hash raw strings without a domain prefix.

The local archive secret is device-local application state. If the secret is lost, future captures may start a new device chain. Export files include already-computed hashes and manifest/item hashes, not the local archive secret.

### 6.3 Device Chain Identity

`device_chain_hash` identifies one local capture chain, not a globally unique device.

Initial implementation:
- create or recover a random chain seed from Keychain;
- compute `device_chain_hash = HMAC-SHA256(archiveSecret, "hp:v2:device_chain:" + chainSeed)`;
- set `recovered_from_keychain = 1` when the seed survived app reinstall and was reused;
- start a new chain if the seed is missing or explicitly reset.

HealthKit device metadata may be stored as hashed provenance in `hk_devices`, but it must not be used as a cross-device comparison key.

### 6.4 Fingerprints And Payload Versions

`sample_uuid_hash` is the preferred stable HealthKit identity when HealthKit exposes a UUID.

`strict_fingerprint` is a deterministic exact fallback/verification key built from canonical fields:
- type identifier;
- start/end timestamps in canonical UTC text;
- value kind and canonical value representation;
- unit/category/workout fields where applicable;
- source bundle identifier hash when available.

Do not include SQLite row ids in fingerprints. If HealthKit UUID is available, a payload change creates a new `sample_versions` row for the same sample identity. If HealthKit UUID is not available, an exact strict fingerprint change may create appeared/disappeared evidence, with `semantic_fingerprint` used only as weak consolidation context.

`payload_hash` is `SHA-256` over the canonical sample payload representation, including dates, value/unit/category/workout fields, source revision fields, device provenance hashes, metadata hash, and relationship payload when available. A new `sample_versions` row is created when `payload_hash` changes.

Implementation note, 2026-05-24: archive v2 capture must derive `payload_hash` from the same normalized row values that are persisted. Unknown HealthKit OS versions, including `0.0.0`, are stored as absent. Capture, verification, and deletion bookkeeping must use archive v2 identity/version/visibility tables, not the removed legacy `archive_samples` mirror.

`semantic_fingerprint` is type-specific and optional. It supports consolidation heuristics and fuzzy backup/export reconciliation, but it is never sufficient by itself to prove record identity.

### 6.5 Timezone And Aggregate Buckets

Raw sample timestamps remain UTC. Daily/monthly aggregate buckets are computed using the device timezone active at observation time because user-facing health summaries are local-day concepts.

Rules:
- store `time_zone_identifier` and `time_zone_seconds_from_gmt` on `observations`;
- bucket boundaries are midnight-to-midnight in that observation timezone, stored as UTC seconds;
- old observations are not retroactively re-bucketed when the device timezone changes;
- exports include timezone metadata so external tools can reinterpret buckets if needed.

### 6.6 Visibility Range Maintenance

Maintain `sample_visibility_ranges` eagerly in the same transaction that writes observation events. Point-in-time queries should read ranges, not rebuild them from events on every request.

Integrity tools may rebuild ranges from `sample_observation_events` into temporary tables and compare the result with stored ranges. Validation rebuild is for tests/repair checks, not the normal query path.

### 6.7 Relationship Preservation

Store every relationship the capture/import surface can observe. Relationship rows are append-only evidence for what was known at an observation; do not rewrite `relationship_kind` to encode later disappearance.

If a related sample disappears, endpoint visibility ranges and events explain that disappearance. Relationship exports should include relationships when both endpoints are included, and may include unresolved endpoint hashes when allowed by the export scope.

### 6.8 Aggregates

Aggregates are computed in SQLite after each successful observation/type run and materialized in `observation_type_summaries` and `daily_type_aggregates`.

SQL-first does not mean recompute every count live for the UI. It means heavy computation is done by SQLite using indexes, joins, CTEs, temporary tables, and paged result sets. Repeated UI/report reads consume materialized SQLite aggregates and the Core Data cache.

### 6.9 Export Manifest Canonicalization

Structured exports use a versioned canonical envelope with deterministic ordering.

Rules:
- export format version starts at `1`;
- JSON object keys are sorted for canonical bytes;
- record/item order is deterministic: sample type, start date, end date, sample identity hash, version hash;
- each exported item has an `item_hash = SHA-256(canonical item JSON)`;
- `manifest_hash = SHA-256(canonical export metadata + ordered item_hash list + counts + filter description)`;
- large exports compute hashes incrementally while streaming/paging rows from SQLite.

Manifest hashes must cover exported content through item hashes, not only counts or first/last dates.

## 7. Write Path

### 7.1 Capture Transaction Shape

For each user-visible capture:

1. Create one `observations` row with status `running`.
2. For each observation/type run, update `observation_type_runs`.
3. For each added/visible sample page:
   - upsert source/source revision/device/metadata;
   - upsert `samples`;
   - upsert `sample_versions`;
   - insert observation event;
   - update visibility ranges.
4. For each `HKDeletedObject`:
   - find sample by UUID hash;
   - insert deleted/disappeared event;
   - close open visibility ranges.
5. Record final per-type verification counts for the same observation id.
6. Recompute affected materialized aggregates.
7. Mark the observation `completed`, `needs_review`, `partial_*`, or `failed`.

SQLite writes may be committed per page/type to keep memory and lock time
bounded, but they must share the capture's observation id.

8. Update/rebuild Core Data cache after SQLite commit.

SQLite commit must happen before Core Data cache update. Cache rebuild failures must not corrupt archive truth.

### 7.2 Idempotency

Capture pages may be retried. Writes must be idempotent using uniqueness constraints:
- `(sample_type_id, strict_fingerprint)` for samples;
- `(sample_id, payload_hash)` for versions;
- `(observation_id, sample_id, event_kind)` for events;
- range primary key for visibility.

### 7.3 Anchor Handling

HealthKit anchors are capture implementation state. Store them per type run, but do not treat anchors as forensic truth. If an anchor is unusable, the archive must still support rebuilding the current visible state from a full type scan.

## 8. Point-In-Time Reconstruction

Point-in-time reconstruction should use ranges, not full snapshot tables.

Conceptual query:

```sql
SELECT s.id AS sample_id, sv.id AS version_id, sv.start_date, sv.end_date,
       sv.value_kind, sv.numeric_value, sv.unit
FROM sample_visibility_ranges r
JOIN samples s ON s.id = r.sample_id
JOIN sample_versions sv ON sv.id = r.version_id
JOIN observations target ON target.id = :observation_id
WHERE s.sample_type_id = :sample_type_id
  AND r.first_observation_id <= target.id
  AND (r.last_observation_id IS NULL OR r.last_observation_id > target.id)
ORDER BY sv.start_date, s.strict_fingerprint
LIMIT :limit OFFSET :offset;
```

Implementation may optimize this with temporary tables or materialized visible sets for selected observations.

## 9. Diff Between Observations

Diffs must run in SQLite.

```sql
CREATE TEMP TABLE prev_visible AS
SELECT r.sample_id, r.version_id
FROM sample_visibility_ranges r
WHERE r.first_observation_id <= :previous
  AND (r.last_observation_id IS NULL OR r.last_observation_id > :previous);

CREATE TEMP TABLE curr_visible AS
SELECT r.sample_id, r.version_id
FROM sample_visibility_ranges r
WHERE r.first_observation_id <= :current
  AND (r.last_observation_id IS NULL OR r.last_observation_id > :current);

CREATE INDEX temp_prev_sample ON prev_visible(sample_id);
CREATE INDEX temp_curr_sample ON curr_visible(sample_id);

-- Disappeared.
SELECT p.sample_id
FROM prev_visible p
LEFT JOIN curr_visible c ON c.sample_id = p.sample_id
WHERE c.sample_id IS NULL;

-- Appeared.
SELECT c.sample_id
FROM curr_visible c
LEFT JOIN prev_visible p ON p.sample_id = c.sample_id
WHERE p.sample_id IS NULL;

-- Representation changed.
SELECT c.sample_id, p.version_id AS before_version_id, c.version_id AS after_version_id
FROM curr_visible c
JOIN prev_visible p ON p.sample_id = c.sample_id
WHERE p.version_id != c.version_id;
```

Result sets must be paged. Counts can be materialized into `observation_type_summaries` and Core Data cache.

## 10. Consolidation Heuristics

Consolidation likely when:
- many old high-frequency records disappear;
- newer records cover similar date windows;
- aggregate sums remain within tolerance;
- sample density decreases while duration/interval length increases;
- source/provenance context is compatible.

Required evidence:
- record-count delta;
- value-sum delta;
- coverage-window overlap;
- interval-length/density comparison;
- source/source-revision breakdown;
- uncertainty label if evidence is incomplete.

Never classify count drops alone as loss.

## 11. Core Data Cache Contract

Core Data entities should mirror presentation needs, not archive internals.

Core Data may contain two categories:
- rebuildable UI/report cache derived from SQLite;
- non-forensic local app state/settings.

Deleting rebuildable cache rows must not delete the SQLite archive. User settings may be preserved across cache rebuilds.

Candidate cache entities:
- `CachedObservationRow`;
- `CachedTypeSummary`;
- `CachedDailyAggregate`;
- `CachedDiffSummary`;
- `CachedExportManifest`;
- `CachedArchiveHealth`;
- `AppSetting`.

Every cache row should include:
- archive schema version;
- cache schema version;
- source observation id(s);
- source aggregate/hash where applicable;
- computed_at timestamp.

Invalidation rules:
- archive reset or future archive migration invalidates cache;
- selected type registry change invalidates affected summaries;
- aggregate rebuild invalidates corresponding Core Data rows;
- app can delete all cache rows and rebuild from SQLite.

## 12. Export Requirements

Exports are scoped and recovery-compatible.

Every structured export should include:
- export id;
- archive schema version;
- app version;
- observation id(s);
- selected type filters;
- record count;
- manifest hash;
- per-record sample identity/fingerprint;
- payload version hash;
- dates, values, units, category/workout fields;
- source/provenance metadata where available and allowed;
- relationships where available;
- provenance-loss warnings for external re-publication workflows.

Exports must stream/page from SQLite. Do not build large JSON/CSV exports entirely in RAM.

## 13. Reset And Future Migration Policy

Current status: HealthProbe has no real deployments, only test installations. The archive v2 refactor does not need backward compatibility with the old SwiftData/prototype SQLite schema.

For the current refactor:
- old SwiftData stores and prototype SQLite archives may be deleted, ignored, or reinitialized;
- no one-way migration from old prototype stores is required;
- test users/developers must expect prototype data loss when moving to archive v2;
- reset behavior must be documented in test/release notes and must not be presented as data preservation;
- Core Data cache stores remain disposable and rebuildable.

For future real archives:
- migrations must be versioned in `schema_migrations`;
- migrations must be tested on synthetic large archives;
- migration failure must not silently delete the archive;
- cache stores may be deleted and rebuilt;
- archive reset must require explicit user confirmation.

## 14. Integrity And Verification

Archive integrity checks:
- SQLite `PRAGMA integrity_check`;
- schema version check;
- missing FK/reference checks;
- aggregate rebuild spot checks;
- manifest hash verification;
- open visibility range sanity checks;
- duplicate identity checks;
- export reproducibility checks.

Use WAL mode for normal operation. Consider periodic checkpoints when safe.

## 15. Performance Rules

- Use prepared statements for bulk writes.
- Batch within transactions.
- Use indexes listed in the schema and add query-specific indexes only with evidence.
- Prefer integer primary keys internally.
- Store archive dates as Unix seconds UTC.
- Page record tables and exports.
- Use temporary tables for large selected-observation diffs.
- Do not decode large archived payloads into Swift collections.
- Profile on low-memory/legacy-class devices.

## 16. Testing Requirements

Unit/integration tests must cover:
- idempotent repeated capture page writes;
- first observation for a type;
- appeared/disappeared/representation-changed events;
- visibility range open/close behavior;
- point-in-time reconstruction;
- diff queries on large synthetic datasets;
- aggregate rebuild;
- Core Data cache rebuild after deletion;
- export manifest reproducibility;
- recovery-compatible export fields;
- prototype-store reset/reinitialization behavior for current test installs;
- future archive migrations once real archive versions exist;
- memory ceiling during export and diff.

No real HealthKit data in fixtures.

## 17. Deferred Design Questions

These do not block the archive v2 foundation, but they should be revisited before advanced import/export tooling:

1. Exact semantic/fuzzy fingerprint fields per HealthKit sample family.
2. Relationship extraction surface available from HealthKit vs backup/XML exports.
3. Optional user-controlled export profiles for more/less provenance disclosure.
4. Repair tooling for rebuilding visibility ranges after future real archive migrations.

Record any changed decision here with a date before implementing schema changes.