tolerance — cross-machine divergence bounding tool
Purpose
Replicate ONE run-spec stem on every cluster box at once, collect each box’s fresh NPZ, and report
the pairwise max‖x_a − x_b‖ divergence, the bounded-divergence contract that stands in for
bit-identity when ARM (mac) and x86 (pc2/workstation) can’t be byte-equal.
Role in the system
- Cluster sibling of dispatch (dispatch.py): dispatch PARTITIONS different stems across boxes for throughput; tolerance REPLICATES the same stem to every box to measure their disagreement.
- Built entirely on dispatch’s primitives: import dispatch as dsp gives it load_cluster, sync_cluster, _launch_one, _read_one_state, collect_run, RunOpts, StemRecord.
- Hosts come from analysis/cluster.yaml (the same registry dispatch reads).
- All divergence numbers go through the canonical measurement facade (signal_measurement’s divergence_series), never a hand-rolled ‖a − b‖.
- A command-line tool (__main__ → main(argv)), not a pipeline stage; orchestrator does not call it.
Inputs / Outputs
- In: a run-spec stem; --cluster host registry; --keys (default z_b,omega_b); poll cadence; --no-sync to skip the pre-launch HEAD sync.
- Out: console-only, one max‖x_a − x_b‖ line per (box_a, box_b, key) triple. No NPZ/report is written; the per-box NPZ logs land in the normal dated run dirs via the launch path.
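The inputs listed above could be parsed roughly as follows. This is a minimal sketch, not the tool's actual parser: the flags --cluster, --keys and --no-sync come from this page, while the --poll-interval name, its default, the --cluster default, and the comma-separated handling of --keys are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the inputs listed on this page."""
    p = argparse.ArgumentParser(prog="tolerance")
    p.add_argument("stem", help="run-spec stem to replicate on every box")
    # Assumption: the registry path doubles as the default value.
    p.add_argument("--cluster", default="analysis/cluster.yaml",
                   help="host registry (same file dispatch reads)")
    # Assumption: keys arrive as one comma-separated string.
    p.add_argument("--keys", default="z_b,omega_b",
                   help="comma-separated state keys to compare")
    # Assumption: flag name and default are invented; the page only says
    # that a poll cadence is configurable.
    p.add_argument("--poll-interval", type=float, default=30.0,
                   help="seconds between state polls (hypothetical flag)")
    p.add_argument("--no-sync", action="store_true",
                   help="skip the pre-launch HEAD sync")
    return p


def parse(argv):
    """Parse argv and split --keys into a list of key names."""
    args = build_parser().parse_args(argv)
    args.keys = [k.strip() for k in args.keys.split(",") if k.strip()]
    return args
```

Called as parse(["some_stem", "--no-sync"]), the sketch yields the default key list and a true no_sync flag.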
Key functions
- replicate: launch the stem on every box, poll all to terminal, collect each box’s fresh npz → {box: npz} (analysis/tolerance.py:51)
- pairwise_divergence: {box: npz} → [(a, b, key, max_divergence)] via the facade (ACTUAL state) (analysis/tolerance.py:89)
- newest_npz: most-recently-written .npz under a tree (by st_mtime) (analysis/tolerance.py:24)
- main: argparse front-end, sync → replicate → diff → print (analysis/tolerance.py:108)
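The launch-everything-then-poll shape of replicate can be illustrated with a small sketch. It is a single-threaded round-robin stand-in, not the real implementation: launch and read_state are hypothetical callables standing in for dispatch's _launch_one and _read_one_state (whose real signatures are not shown here), and the "RUNNING" placeholder state is invented. Only the three terminal state names come from this page.

```python
import time
from typing import Callable, Dict, Iterable

# Terminal states named on this page.
TERMINAL = {"FINISHED", "FAILED", "VANISHED"}


def launch_then_poll(
    hosts: Iterable[str],
    launch: Callable[[str], None],       # hypothetical stand-in for _launch_one
    read_state: Callable[[str], str],    # hypothetical stand-in for _read_one_state
    max_polls: int = 100,
    poll_s: float = 0.0,
) -> Dict[str, str]:
    """Fire the run on every host first, then poll until all are terminal."""
    hosts = list(hosts)
    # Launch on ALL boxes before polling any, so the total wall-clock is
    # roughly one simulation rather than one per box.
    for h in hosts:
        launch(h)
    states = {h: "RUNNING" for h in hosts}  # placeholder non-terminal state
    for _ in range(max_polls):
        for h in hosts:
            if states[h] not in TERMINAL:
                states[h] = read_state(h)
        if all(s in TERMINAL for s in states.values()):
            break
        time.sleep(poll_s)
    return states
```

In this sketch a caller would then collect an NPZ only from hosts whose final state is FINISHED; hosts still non-terminal after max_polls are simply left out.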
Footguns
Boxes must run the SAME committed code — the sync step is load-bearing
Before launch, sync_cluster brings every box to controller HEAD. The mac leg may have a dirty tree,
but that is comment/YAML-only (AST-verified neutral), so it stays numerically equivalent. Skipping the
sync with --no-sync can silently diff different code. (analysis/INSIGHTS.md, tolerance [baseline].)
newest_npz picks by mtime, not name: guards against stale run-stamped siblings
A re-run leaves old run-stamped .npz files in the tree. Selecting by st_mtime (not an alphabetical
sort) avoids silently diffing against a previous artifact. (analysis/INSIGHTS.md, tolerance [footgun].)
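A minimal sketch of mtime-based selection, assuming a recursive search under one root directory. The function name newest_npz_sketch and the choice to return None for an empty tree are illustrative assumptions; the real newest_npz may behave differently.

```python
from pathlib import Path
from typing import Optional


def newest_npz_sketch(root: Path) -> Optional[Path]:
    """Return the most recently modified .npz under root, or None if there is none."""
    candidates = list(Path(root).rglob("*.npz"))
    if not candidates:
        return None
    # Compare modification times, not names: an older run-stamped file can
    # sort last alphabetically and would be picked by a name-based sort.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The point of the design is that file names carry no ordering guarantee across re-runs, whereas the modification time tracks which artifact was actually written last.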
The facade's stdout chatter is swallowed deliberately
pairwise_divergence wraps each facade call in contextlib.redirect_stdout to suppress the
scoreboard-coverage prints; only the returned divergence arrays are consumed. A missing key or shape
mismatch is caught, reported as a (warn) line, and degraded to nan; it never crashes the sweep.
(analysis/INSIGHTS.md, tolerance [io].)
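The swallow-and-degrade pattern described above might look like the following sketch. The series_fn parameter is a hypothetical stand-in for the facade's divergence_series, assumed to return a (t, d) pair as in the pseudocode on this page. Which exception types the real code catches is not stated, so KeyError and ValueError are assumptions.

```python
import contextlib
import io
import math
from typing import Callable, Sequence, Tuple


def quiet_max_divergence(
    series_fn: Callable[..., Tuple[Sequence[float], Sequence[float]]],
    npz_a,
    npz_b,
    key: str,
) -> float:
    """Call series_fn with its stdout discarded; return max divergence or nan."""
    try:
        # Anything the callee prints goes into a throwaway buffer.
        with contextlib.redirect_stdout(io.StringIO()):
            _t, d = series_fn(npz_a, npz_b, key)
        return float(max(d))
    except (KeyError, ValueError) as exc:  # assumed failure modes
        # stdout is restored by now, so the warning is visible.
        print(f"(warn) {key}: {exc}")
        return math.nan
```

Because the redirect sits inside the try block, the warning is printed after stdout has been restored, so it reaches the console while the callee's chatter does not.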
Pseudocode
main(stem):
    hosts = load_cluster(cluster.yaml)
    unless --no-sync: sync_cluster(hosts)    # every box → controller HEAD
    npzs = replicate(stem, hosts, opts)
    if len(npzs) < 2: bail ("need >=2 boxes to diff")
    for (a, b, key, mx) in pairwise_divergence(npzs, keys): print mx

replicate(stem, hosts, opts):
    for host: _launch_one(host)    # fire on ALL boxes first (one sim wall-clock, not N)
    poll all to terminal (FINISHED/FAILED/VANISHED)    # concurrently, up to max_polls
    for FINISHED boxes: collect fresh npz (_collect_box_npz)
    return {box: npz}

pairwise_divergence(npzs, keys):
    for each box pair (i<j), each key:
        t, d = sm.divergence_series(npz_i, npz_j, key)    # ‖actual_i − actual_j‖ per step
        row = (box_i, box_j, key, max(d))    # nan on key-absent / shape mismatch
    return rows
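The pair-and-key enumeration in pairwise_divergence can be sketched with itertools.combinations. This is an illustration under assumptions: max_div is a hypothetical callable that returns the maximum divergence for one pair and key (in the real tool that number must come from the measurement facade, not from hand-rolled arithmetic), and sorting box names for a stable order is a choice made here, not something the source confirms.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[str, str, str, float]


def pairwise_rows(
    npzs: Dict[str, object],
    keys: Sequence[str],
    max_div: Callable[[object, object, str], float],  # hypothetical wrapper
) -> List[Row]:
    """Build one (box_a, box_b, key, max_divergence) row per pair and key."""
    rows: List[Row] = []
    # combinations yields each unordered pair exactly once (i < j).
    for a, b in combinations(sorted(npzs), 2):
        for key in keys:
            rows.append((a, b, key, max_div(npzs[a], npzs[b], key)))
    return rows
```

With n boxes and k keys this produces n(n−1)/2 × k rows, so three boxes and the two default keys give six output lines.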
Related
dispatch · signal_measurement · detached_run · orchestrator · terminology