tolerance — cross-machine divergence bounding tool

Purpose

Replicate ONE run-spec stem on every cluster box at once, collect each box’s fresh NPZ, and report
the pairwise max‖x_a − x_b‖ divergence — the bounded-divergence contract that stands in for
bit-identity when ARM (mac) and x86 (pc2/workstation) can’t be byte-equal.

Role in the system

  • Cluster sibling of dispatch (dispatch.py): dispatch PARTITIONS different stems across boxes
    for throughput; tolerance REPLICATES the same stem to every box to measure their disagreement.
  • Built entirely on dispatch’s primitives — import dispatch as dsp gives it load_cluster,
    sync_cluster, _launch_one, _read_one_state, collect_run, RunOpts, StemRecord.
  • Hosts come from analysis/cluster.yaml (the same registry dispatch reads).
  • All divergence numbers go through the canonical measurement facade (signal_measurement’s
    divergence_series) — never a hand-rolled ‖a − b‖.
  • A command-line tool (__main__ → main(argv)), not a pipeline stage; the orchestrator does not call it.

Inputs / Outputs

  • In: a run-spec stem; --cluster host registry; --keys (default z_b,omega_b); poll cadence;
    --no-sync to skip the pre-launch HEAD sync.
  • Out: console-only — one max‖x_a − x_b‖ line per (box_a, box_b, key) triple. No NPZ/report is
    written; the per-box NPZ logs land in the normal dated run dirs via the launch path.

Key functions

  • replicate — launch the stem on every box, poll all to terminal, collect each box’s fresh npz → {box: npz} — analysis/tolerance.py:51
  • pairwise_divergence — {box: npz} → [(a, b, key, max_divergence)] via the facade (ACTUAL state) — analysis/tolerance.py:89
  • newest_npz — most-recently-written .npz under a tree (by st_mtime) — analysis/tolerance.py:24
  • main — argparse front-end: sync → replicate → diff → print — analysis/tolerance.py:108
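
The {box: npz} to row-list shape of pairwise_divergence can be sketched as below. This is an illustration under stated assumptions, not the code in analysis/tolerance.py: pairwise_rows, max_div and toy_max_div are hypothetical names, and toy_max_div is only a stand-in so the demo runs without the real facade (real numbers should still come from signal_measurement's divergence_series).

```python
import itertools
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[str, str, str, float]  # (box_a, box_b, key, max_divergence)


def pairwise_rows(
    npzs: Dict[str, dict],
    keys: Sequence[str],
    max_div: Callable[[dict, dict, str], float],
) -> List[Row]:
    """Build one row per (box pair, key); max_div is an injected stand-in."""
    rows: List[Row] = []
    # combinations() yields each unordered pair exactly once (i < j).
    for (a, npz_a), (b, npz_b) in itertools.combinations(npzs.items(), 2):
        for key in keys:
            rows.append((a, b, key, max_div(npz_a, npz_b, key)))
    return rows


def toy_max_div(npz_a: dict, npz_b: dict, key: str) -> float:
    """Demo-only stand-in for the facade: largest per-step absolute difference."""
    return max(abs(x - y) for x, y in zip(npz_a[key], npz_b[key]))


def demo() -> List[Row]:
    # Plain dicts stand in for loaded NPZ files; the values are made up.
    npzs = {
        "mac": {"z_b": [1.0, 2.0]},
        "pc2": {"z_b": [1.0, 2.5]},
        "workstation": {"z_b": [1.25, 2.0]},
    }
    return pairwise_rows(npzs, ["z_b"], toy_max_div)
```

With N boxes and K keys this produces N(N-1)/2 × K rows, so three boxes and the two default keys would give six lines of output.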

Footguns

Boxes must run the SAME committed code — the sync step is load-bearing

Before launch, sync_cluster brings every box to controller HEAD. The mac leg may still have a dirty
tree, but the changes are comment/YAML-only (AST-verified neutral), so the code stays numerically
equivalent. Passing --no-sync skips this step, and the tool can then silently diff outputs produced by
different code. (analysis/INSIGHTS.md, tolerance [baseline].)

newest_npz picks by mtime, not name — guards against stale run-stamped siblings

A re-run leaves old run-stamped .npz files in the tree. Selecting by st_mtime (not an alphabetical
sort) avoids silently diffing against a previous artifact. (analysis/INSIGHTS.md, tolerance [footgun].)
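
A minimal sketch of mtime-based selection, to show why an alphabetical sort can pick the wrong file. It is not the real newest_npz: the name newest_npz_sketch, the recursive rglob search, and the None return for an empty tree are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def newest_npz_sketch(root: Path) -> Optional[Path]:
    """Return the most recently modified .npz under root, or None if none exist."""
    candidates = list(root.rglob("*.npz"))
    if not candidates:
        return None
    # Compare modification times, not names: a stale run-stamped file can
    # sort last alphabetically while being older on disk.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def demo() -> str:
    """Build a tree where the alphabetically last file is the stale one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        stale = root / "run_z_old.npz"
        fresh = root / "run_a_new.npz"
        stale.write_bytes(b"")
        fresh.write_bytes(b"")
        os.utime(stale, (1_000, 1_000))  # force explicit, distinct mtimes
        os.utime(fresh, (2_000, 2_000))
        return newest_npz_sketch(root).name
```

In the demo, sorting by name would select run_z_old.npz, while the mtime comparison selects run_a_new.npz.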

The facade's stdout chatter is swallowed deliberately

pairwise_divergence wraps each facade call in contextlib.redirect_stdout to suppress the
scoreboard-coverage prints; only the returned divergence arrays are consumed. A missing key or shape
mismatch is caught, reported as a (warn) line, and degraded to nan — it never crashes the sweep.
(analysis/INSIGHTS.md, tolerance [io].)
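
The swallow-and-degrade pattern described above might look roughly like this. The sketch makes several assumptions: quiet_max, chatty_ok and missing_key are hypothetical names, the facade call is passed in as a zero-argument callable, and KeyError/ValueError are a guess at which exceptions a missing key or shape mismatch would raise.

```python
import contextlib
import io
import math
from typing import Callable, Sequence, Tuple

Series = Tuple[Sequence[float], Sequence[float]]  # (t, d)


def quiet_max(series_fn: Callable[[], Series], label: str) -> float:
    """Call series_fn with stdout suppressed; return max(d), or nan on bad data."""
    sink = io.StringIO()
    try:
        with contextlib.redirect_stdout(sink):
            _t, d = series_fn()  # any prints in here land in sink
        return float(max(d))
    except (KeyError, ValueError) as exc:  # assumed failure modes
        # The redirect has already exited here, so the warning stays visible.
        print(f"(warn) {label}: {exc!r}")
        return math.nan


def chatty_ok() -> Series:
    """Stand-in for a facade call that prints coverage chatter."""
    print("scoreboard coverage: 3/3")
    return [0.0, 1.0, 2.0], [0.0, 1e-12, 3e-12]


def missing_key() -> Series:
    """Stand-in for a facade call on an NPZ that lacks the requested key."""
    raise KeyError("omega_b")
```

Keeping the warning print outside the redirected block matters: if it ran inside the `with`, the (warn) line would be swallowed along with the chatter.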

Pseudocode

main(stem):
    hosts = load_cluster(cluster.yaml)
    unless --no-sync: sync_cluster(hosts)          # every box → controller HEAD
    npzs = replicate(stem, hosts, opts)
    if len(npzs) < 2: bail ("need >=2 boxes to diff")
    for (a, b, key, mx) in pairwise_divergence(npzs, keys): print mx

replicate(stem, hosts, opts):
    for host: _launch_one(host)                    # fire on ALL boxes first (one sim wall-clock, not N)
    poll all to terminal (FINISHED/FAILED/VANISHED) # concurrently, up to max_polls
    for FINISHED boxes: collect fresh npz (_collect_box_npz)
    return {box: npz}

pairwise_divergence(npzs, keys):
    rows = []
    for each box pair (i<j), each key:
        t, d = sm.divergence_series(npz_i, npz_j, key)   # ‖actual_i − actual_j‖ per step
        mx   = max(d)                                    # nan on key-absent / shape mismatch
        rows.append((box_i, box_j, key, mx))
    return rows
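
The launch-first, poll-second ordering in replicate can be sketched in runnable form as follows. This is a simplified illustration, not dispatch's API: launch and read_state are injected hypothetical callables standing in for _launch_one and _read_one_state, the polling is a sequential round-robin rather than truly concurrent, and the "TIMEOUT" label for boxes that never reach a terminal state is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

TERMINAL = {"FINISHED", "FAILED", "VANISHED"}


def launch_then_poll(
    hosts: Iterable[str],
    launch: Callable[[str], None],
    read_state: Callable[[str], str],
    max_polls: int,
    wait: Callable[[], None] = lambda: None,
) -> Dict[str, str]:
    """Start every host first, then poll until each is terminal or polls run out."""
    host_list = list(hosts)
    for host in host_list:
        launch(host)  # fire on all boxes before waiting on any of them
    states: Dict[str, str] = {}
    pending = set(host_list)
    for _ in range(max_polls):
        for host in sorted(pending):
            state = read_state(host)
            if state in TERMINAL:
                states[host] = state
        pending.difference_update(states)
        if not pending:
            break
        wait()  # e.g. time.sleep(poll_interval) in a real loop
    for host in pending:
        states[host] = "TIMEOUT"  # hypothetical label for exhausted polls
    return states


def demo() -> Tuple[List[str], Dict[str, str]]:
    launched: List[str] = []
    # Number of polls each fake box needs before it reports its final state.
    remaining = {"mac": 2, "pc2": 1, "workstation": 3}
    final = {"mac": "FINISHED", "pc2": "FAILED", "workstation": "FINISHED"}

    def read_state(host: str) -> str:
        remaining[host] -= 1
        return final[host] if remaining[host] <= 0 else "RUNNING"

    states = launch_then_poll(remaining, launched.append, read_state, max_polls=5)
    return launched, states
```

Because every box is started before the first poll, the total wait is roughly one simulation's wall-clock time rather than the sum over all boxes.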

dispatch · signal_measurement · detached_run · orchestrator · terminology