COPILOT instructions — VM backup management (project scaffold)

Purpose

This document provides context and instructions for an automated assistant (copilot) to start building a project that manages VM backups for the Madagascar cluster. The detailed backup behaviors (retention, snapshot type, schedule) will be added later. For now we focus on cluster context, knowledge sources, file contracts, and recommended initial tasks.

Context & what the agent already knows

  • The cluster name is madagascar and node names are available under clusters.madagascar.nodes in cluster-context/madagascar.json.
  • cluster-context/madagascar.json is the canonical source of cluster context available to this project: it may contain node hostnames, network information, and references to where configurations originate.
  • madagascar-changelog.json (if present in the same directory) is an append-only changelog recommended for recording automation changes; prefer appending entries rather than rewriting.

Primary goals for the backup project (to be specified later)

  • Discover VMs across cluster nodes.
  • Create consistent backups (snapshots, exports) per VM on a regular schedule.
  • Store backups in a target storage (local NAS, remote S3-compatible, etc.).
  • Maintain retention and pruning policies.
  • Integrate with cluster-context/madagascar.json for cluster information and to avoid stepping on other projects' config.

Files the assistant should read and keep in mind

  • cluster-context/madagascar.json — primary source of truth for node hostnames, network addresses, and where configuration is defined.
  • madagascar-changelog.json — append-only log to record changes made by automation (if present).
  • CHANGELOG.md — human-readable changelog documenting all cluster changes with issue references.
  • issues/ directory — contains detailed issue documentation. Each issue has format ISSUE-YYYY-NNN.md.

Data contract (minimal) — how cluster-context/madagascar.json will be used by backups

  • Inputs:
    • Node list: clusters.madagascar.nodes keys
    • Node access: hostname(s) under nodes.<node>.hosts (SSH target or provisioning endpoint)
    • Node VM network context: used to determine which subnets a backup touches (from wan/thunderbridge)
  • Outputs:
    • Backup metadata appended to madagascar-changelog.json (id, timestamp, project: backups, summary, details, affectedResources)
    • (Optional) A backups.json manifest in repo or storage describing performed backups.
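
The input side of this contract can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the helper names `load_node_hosts` and `load_context` are hypothetical, and because the exact shape of nodes.<node>.hosts is not specified above, the code assumes it is a single string, a list of strings, or a mapping whose values are hostnames. The example hostnames are placeholders.

```python
import json
from pathlib import Path


def load_node_hosts(context, cluster="madagascar"):
    """Return {node_name: [hostname, ...]} from clusters.<cluster>.nodes."""
    nodes = context["clusters"][cluster]["nodes"]
    result = {}
    for name, node in nodes.items():
        hosts = node.get("hosts", [])
        # Assumption: "hosts" is a string, a list, or a mapping of label -> hostname.
        if isinstance(hosts, str):
            hosts = [hosts]
        elif isinstance(hosts, dict):
            hosts = list(hosts.values())
        result[name] = list(hosts)
    return result


def load_context(path="cluster-context/madagascar.json"):
    """Read the cluster context file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Illustrative data only; real hostnames come from the context file.
sample = {
    "clusters": {"madagascar": {"nodes": {
        "baobab": {"hosts": ["baobab.example.internal"]},
        "ebony": {"hosts": "ebony.example.internal"},
    }}}
}
node_hosts = load_node_hosts(sample)
print(node_hosts)
```

Normalizing every node to a list of hostnames keeps later steps (SSH fan-out, manifest generation) independent of how the context file happens to store them.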

Assumptions (inferred, verify early)

  • The ops runner will have SSH access to each node via hostnames in cluster-context/madagascar.json.
  • The VM manager is Proxmox VE (PVE), inferred from the vmbr* bridge naming in the context file. If it is something else, adapt the tooling.
  • jq is available on the automation host for simple JSON operations; Python is acceptable for more complex logic.

Starter tasks for the copilot (priority order)

  1. Discovery script: ./discover_vms.sh (or Python) that:
    • Reads ./cluster-context/madagascar.json to get nodes and hostnames.
    • Connects to each node over SSH and lists VMs (for Proxmox: qm list or pvesh for VMs, pct list for containers).
    • Produces a backups/manifest-<date>.json with the discovered VMs.
  2. Backup runner: ./run_backup.sh which takes a VM id and node, creates a snapshot/export, and uploads it to configured storage. Keep steps idempotent and record metadata.
  3. Pruner: ./prune_backups.sh to remove old backups according to retention policy (to be defined).
  4. Integration tests: small harness that runs discovery against a mocked inventory or a minimal local mock environment and validates outputs.
  5. Changelog integration: every automated change to cluster-context/madagascar.json or to backup metadata must append an entry to cluster-context/madagascar-changelog.json describing the reason and the affected resources.
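
The parsing half of the discovery task (item 1) might look like the sketch below, assuming the Proxmox guess holds. It only handles the text that `qm list` prints; running the command over SSH is left out. The function names, the manifest layout, the ISO date format used for `<date>`, and the sample VM names are all assumptions made for illustration, since the manifest schema has not been defined yet.

```python
import json
from datetime import date


def parse_qm_list(output):
    """Parse `qm list` text into a list of {"vmid", "name", "status"} dicts."""
    vms = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header row and blank lines: data rows start with a numeric VMID.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        vms.append({"vmid": int(fields[0]), "name": fields[1], "status": fields[2]})
    return vms


def build_manifest(outputs_by_node, day):
    """Combine per-node `qm list` output into one manifest dict (hypothetical schema)."""
    return {
        "cluster": "madagascar",
        "date": day.isoformat(),
        "nodes": {node: parse_qm_list(text) for node, text in outputs_by_node.items()},
    }


def manifest_path(day):
    """Return the manifest location, assuming <date> means an ISO date."""
    return f"backups/manifest-{day.isoformat()}.json"


# Sample output with made-up VMs, shaped like the usual `qm list` table.
SAMPLE = """\
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID
       100 web01                running    2048              32.00 1234
       101 db01                 stopped    4096              64.00 0
"""
manifest = build_manifest({"baobab": SAMPLE}, date(2025, 1, 15))
print(json.dumps(manifest, indent=2))
```

Keeping the parser separate from the SSH call means it can be unit-tested against captured output, which fits the mocked-inventory harness described in item 4.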

Developer guidance & best practices

  • Treat cluster-context/madagascar.json as the source of truth for discovery; do not hardcode hostnames elsewhere.
  • When writing automation that mutates cluster-context/madagascar.json, always also append a changelog entry and prefer atomic updates (write to tmp file then rename).
  • Prefer small, single-purpose scripts. Keep complex logic in Python, where JSON and SSH handling are easier.
  • Add unit tests for parsing and manifest generation.
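
The "write to a tmp file, then rename" guidance, combined with a changelog append, can be sketched as follows. This is a hedged illustration: it assumes madagascar-changelog.json holds a top-level JSON array, which this document does not confirm, and the function names and the UUID-based `id` are hypothetical. The entry fields follow the data contract (id, timestamp, project, summary, details, affectedResources).

```python
import json
import os
import tempfile
import uuid
from datetime import datetime, timezone


def atomic_write_json(path, data):
    """Write JSON to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def append_changelog_entry(path, summary, details, affected_resources):
    """Append one entry; existing entries are preserved, never edited."""
    entries = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            entries = json.load(handle)  # assumption: top-level JSON array
    entry = {
        "id": uuid.uuid4().hex,  # hypothetical id scheme
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "project": "backups",
        "summary": summary,
        "details": details,
        "affectedResources": list(affected_resources),
    }
    entries.append(entry)
    atomic_write_json(path, entries)
    return entry
```

Creating the temp file in the target's own directory is what makes the final rename atomic; a temp file on another filesystem would turn the rename into a copy.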

Copilot Automation Instructions (Network & Issue Tracking)

Cluster Discovery & Network Checks

  • Use cluster/cluster-context/madagascar.json for node, IP, and service info.
  • To verify thunderbolt networking, run scripts/check_thunderbridge.sh:
    • Checks bridge membership and MTU for all thunderbolt interfaces.
    • Verifies cluster network connectivity (ping between all nodes).
  • For troubleshooting, check kernel logs (dmesg), interface status (ip link show), and bridge membership (bridge link).
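
As a rough illustration of the bridge-membership and MTU checks, the sketch below inspects JSON produced by `ip -j link show`. It does not reproduce what scripts/check_thunderbridge.sh actually does, which is not shown in this document. The interface prefix `thunderbolt`, the bridge name `thunderbridge`, and the MTU of 65520 are placeholder assumptions passed in as parameters; substitute the real values from the cluster context.

```python
import json


def check_links(ip_json, prefix, bridge, mtu):
    """Return problem strings for interfaces whose name starts with prefix."""
    problems = []
    for link in json.loads(ip_json):
        name = link.get("ifname", "")
        if not name.startswith(prefix):
            continue
        # "master" is present only when the interface is enslaved to a bridge.
        if link.get("master") != bridge:
            problems.append(f"{name}: not a member of {bridge}")
        if link.get("mtu") != mtu:
            problems.append(f"{name}: mtu {link.get('mtu')} != {mtu}")
    return problems


# Hand-written sample in the shape of `ip -j link show` output (values are made up).
SAMPLE = json.dumps([
    {"ifname": "thunderbolt0", "mtu": 65520, "master": "thunderbridge"},
    {"ifname": "thunderbolt1", "mtu": 1500},
    {"ifname": "eno1", "mtu": 1500, "master": "vmbr0"},
])
problems = check_links(SAMPLE, prefix="thunderbolt", bridge="thunderbridge", mtu=65520)
for problem in problems:
    print(problem)
```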

Issue Tracking Workflow

  • All issues are tracked in issues/ as Markdown files using TEMPLATE.md.
  • Each issue gets a unique ID (e.g., ISSUE-2025-001, ISSUE-2025-002).
  • Document:
    • Summary, environment, steps to reproduce, expected/actual behavior
    • Logs/evidence, investigation notes, proposed solution
    • Related issues and changelog references
  • Update CHANGELOG.md for every fix, enhancement, or regression.
  • Close issues only after full deployment and verification.
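
Allocating the next issue ID from the files already in issues/ could be done along these lines. This is a sketch that assumes the NNN counter is zero-padded to three digits and restarts each year, which matches the examples above but is not stated as a rule; `next_issue_id` is a hypothetical helper.

```python
import re

# Matches file names of the form ISSUE-YYYY-NNN.md.
ISSUE_RE = re.compile(r"^ISSUE-(\d{4})-(\d{3})\.md$")


def next_issue_id(filenames, year):
    """Return the next free ISSUE-<year>-NNN id given existing file names."""
    numbers = []
    for name in filenames:
        match = ISSUE_RE.match(name)
        if match and int(match.group(1)) == year:
            numbers.append(int(match.group(2)))
    return f"ISSUE-{year}-{max(numbers, default=0) + 1:03d}"


existing = ["ISSUE-2025-001.md", "ISSUE-2025-002.md", "TEMPLATE.md"]
print(next_issue_id(existing, 2025))
```

In practice the file list would come from something like `os.listdir("issues")`; non-matching names such as TEMPLATE.md are ignored.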

Copilot Automation Conventions

  • Always verify changes on all affected nodes (baobab, ebony, tapia, etc.).
  • Use defense-in-depth for network fixes (udev rules + ifupdown2 hooks).
  • Scripts should be POSIX-compliant for maximum compatibility.
  • Suppress SSH warnings for clean output (-o LogLevel=ERROR).
  • Document every change and test result in the issue tracker and changelog.
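
The "verify on all affected nodes" and "suppress SSH warnings" conventions can be combined in one small helper, sketched below. The function names are hypothetical, `BatchMode=yes` is an added assumption (it makes ssh fail rather than prompt for a password in unattended runs), and the 30-second timeout is arbitrary. The `runner` parameter exists so tests can substitute a fake for `subprocess.run`.

```python
import subprocess


def ssh_argv(host, remote_command):
    """Build an ssh command line with warnings suppressed."""
    return [
        "ssh",
        "-o", "LogLevel=ERROR",  # convention from this document
        "-o", "BatchMode=yes",   # assumption: never prompt interactively
        host,
        remote_command,
    ]


def run_on_nodes(hosts_by_node, remote_command, runner=subprocess.run):
    """Run one command on every node; return {node: (returncode, stdout)}."""
    results = {}
    for node, host in hosts_by_node.items():
        proc = runner(
            ssh_argv(host, remote_command),
            capture_output=True,
            text=True,
            timeout=30,
        )
        results[node] = (proc.returncode, proc.stdout)
    return results
```

A call such as `run_on_nodes({"baobab": "baobab.example.internal"}, "ip link show")` (placeholder hostname) returns one result per node, so a failed node shows up as a non-zero return code instead of being silently skipped.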

Example Copilot Tasks

  • Deploy network fixes: deploy/attempt1/deploy_tb.sh <node>
  • Check thunderbolt status: scripts/check_thunderbridge.sh
  • Investigate hardware/network issues: kernel logs, interface status, bridge membership
  • Document and close issues: update issues/ and CHANGELOG.md

References

  • cluster/cluster-context/madagascar.json: Node, network, and backup server definitions
  • issues/: Issue tracker and templates
  • CHANGELOG.md: Change documentation
  • scripts/check_thunderbridge.sh: Cluster network health check
  • deploy/attempt1/deploy_tb.sh: Network deployment script

Maintenance

  • Regularly run network and backup checks.
  • Update documentation and changelogs for every change.
  • Use Copilot to automate repetitive tasks and ensure consistency across the cluster.

Next steps for the user

  • Provide backup policy details (snapshot vs export, retention counts, storage endpoint credentials).
  • Confirm VM manager (Proxmox vs KVM/libvirt vs other).

If you want, I can now scaffold ./discover_vms.sh, ./run_backup.sh (stubs), and a small backups/README.md describing configuration fields. Which do you prefer I create first: discovery script (bash) or Python scaffold?