detached_run — detached tmux launcher for long sims

Purpose

Wraps run_spec.py in a detached tmux session per stem so a closed SSH window or a VSCode
restart cannot kill a running sim. Verbs: run / list / status / watch [--attach] / log [--follow].

Role in the system

  • The remote-execution primitive the cluster dispatcher builds on: dispatch calls these verbs
    in-process on the local Mac leg and shells the same analysis/detached_run.py <verb> over SSH on
    the Linux boxes (invoke_detached).
  • Launches the pipeline front-end run_spec.py (the Orchestrator(stem).run() gateway) — it does not
    reach into orchestrator / runner itself; it only owns the process lifecycle around them.
  • Shares the dated log root (logs/logs_<date>/<stem>, always flat — the box never knows about
    campaigns) with the pipeline via utils.infra.LOGS_DIR: the wrapper’s console.log / exit_status
    sit beside the NPZ the run writes. dispatch nests on collect; the box side stays flat.
  • Owns the wire vocabulary PollKind (FINISHED/RUNNING/VANISHED/FAILED) — it is the producer:
    classify() emits a PollKind and status() serialises it onto the status line that dispatch
    parses back. dispatch imports dr.PollKind; it keeps its own lowercase State lifecycle separate.

Inputs / Outputs

  • In: one or more run-spec stems (the spec name = the file stem); verb + flags from argv.
  • Out: per-stem console.log (captured stdout+stderr) and exit_status sentinel under the dated
    run dir; one tmux session named sim_<sanitized>_<6-char-sha1>; status lines on stdout whose tokens
    are the PollKind values (FINISHED/RUNNING/VANISHED/FAILED(<code>)).

Key functions

  • PollKind — the wire poll vocabulary StrEnum (UPPERCASE; byte-stable on the SSH line) — analysis/detached_run.py:28
  • run — launch each stem in its own detached session, skipping any already alive — analysis/detached_run.py:125
  • status — wire report (stem: FINISHED|RUNNING|VANISHED|FAILED(n)), tails the log on failure — analysis/detached_run.py:163
  • classify — sentinel-first state machine, returns (PollKind, exit_code|None) — analysis/detached_run.py:70
  • session_name — tmux-legal, injective name = sanitized stem + SHA-1 hash — analysis/detached_run.py:40
  • inner_command — the shlex-quoted shell line tmux runs (redirect-then-echo $?) — analysis/detached_run.py:62
  • main — verb dispatch — analysis/detached_run.py:210

Footguns

list/status key on TODAY's dated dir — a midnight-crossing run goes invisible

The run resolves its log root once, at launch time, while list/status resolve it again each time they
are invoked. A run that crosses local midnight therefore keeps writing under the dir stamped at launch,
but a bare status the next day scans the new day’s dir and reports nothing.
Check long runs on the same calendar day they were launched.
(analysis/INSIGHTS.md, detached_run [run][footgun].)
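
To make the midnight footgun concrete, here is a minimal sketch of locating a stem's run dir across every dated dir instead of only today's. The helper name find_run_dirs and the logs_root argument are hypothetical and not part of detached_run.py; the only layout assumed is the flat <logs_root>/logs_<date>/<stem> described above, and the date format is deliberately not parsed.

```python
from pathlib import Path


def find_run_dirs(logs_root: Path, stem: str) -> list[Path]:
    """Return every <logs_root>/logs_*/<stem> directory, sorted by path.

    Hypothetical helper: matches dated dirs by the "logs_" prefix only,
    so it makes no assumption about how <date> is formatted.
    """
    return sorted(
        dated / stem
        for dated in logs_root.glob("logs_*")
        if (dated / stem).is_dir()
    )
```

A run launched yesterday would show up in this listing even though a bare status today does not report it.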

Linux cluster needs loginctl enable-linger <user> once

Without it, systemd reaps the tmux server on SSH logout and kills every session — defeating the whole
point. macOS launchd has no such issue. (analysis/INSIGHTS.md, detached_run [perf][run].)

Sanitized session names alone collide — the SHA-1 hash is load-bearing

Mapping ./-/whitespace → _ is many-to-one, so distinct stems could share a session name and make
is_alive/status lie. A 6-char SHA-1 of the raw stem is appended for injectivity; _known_stems
reads the real stem name off disk so it never has to invert the lossy sanitization.
(analysis/INSIGHTS.md, detached_run [footgun].)
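
The naming scheme can be sketched as follows. This is an illustration under assumptions, not the code at analysis/detached_run.py:40: the exact set of characters that get replaced is assumed here to be everything outside [A-Za-z0-9_], and the 6 characters are assumed to be the leading hex digits of the SHA-1 digest.

```python
import hashlib
import re


def session_name(stem: str) -> str:
    """Build sim_<sanitized>_<6-char-sha1> for a raw stem (illustrative)."""
    # Lossy step: replace every character outside [A-Za-z0-9_] with "_".
    # The real character set may differ; this one is an assumption.
    safe = re.sub(r"[^A-Za-z0-9_]", "_", stem)
    # Disambiguating step: hash the RAW stem, not the sanitized one, so
    # stems that sanitize identically still get different suffixes.
    digest = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:6]
    return f"sim_{safe}_{digest}"
```

With this sketch, "a.b" and "a b" share the sanitized part a_b but end in different suffixes. A truncated digest makes a clash between such stems very unlikely rather than strictly impossible.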

No checkpoint/resume — a killed run is re-launched from scratch

Nothing is serialized mid-trajectory; scope is intentionally re-run only. run() first touches
the console log (so list/status see it immediately) and clears any prior exit_status sentinel so
a stale result can’t impersonate the new launch. (analysis/INSIGHTS.md, detached_run [run].)
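
The pre-launch bookkeeping described above might look like the sketch below. The function names prepare_run_dir and tmux_launch_argv are hypothetical; the file names console.log and exit_status and the tmux new-session -d -s invocation come from this document.

```python
from pathlib import Path


def prepare_run_dir(run_dir: Path) -> None:
    """Make a fresh launch visible and remove any stale result (sketch)."""
    run_dir.mkdir(parents=True, exist_ok=True)
    # Touch the console log so list/status can see the stem immediately.
    (run_dir / "console.log").touch()
    # Drop a sentinel left by an earlier run; otherwise the old exit code
    # would be reported for the run that is about to start.
    (run_dir / "exit_status").unlink(missing_ok=True)


def tmux_launch_argv(session: str, inner: str) -> list[str]:
    """argv for a detached tmux session running the inner shell line."""
    return ["tmux", "new-session", "-d", "-s", session, inner]
```

Clearing the sentinel before starting tmux matters because classification is sentinel-first: a leftover exit_status would win over the fact that a new session is alive.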

Pseudocode (the sentinel state machine)

run(stem):  touch console.log ; rm old exit_status ; tmux new-session -d -s sim_<safe>_<sha> "<inner>"
inner_command: cd ROOT && $PYTHON -u run_spec.py <stem> > console.log 2>&1 ; echo $? > exit_status
                                                  # redirect-then-echo (NOT | tee) → $? is run_spec's own code
classify(sentinel?, code, alive) -> (PollKind, exit_code|None):
    sentinel? → (FINISHED, 0) if code==0 else (FAILED, int(code))   # durable, outlives the session
    else      → (RUNNING|VANISHED, None)                            # no sentinel + dead = killed before echo
status() then formats the wire line: f"{kind}({ecode})" for FAILED, else str(kind)
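
The state machine above can be written out as runnable Python. This is a sketch under assumptions rather than the module's code: PollKind is modelled as a str-based Enum standing in for the StrEnum, the sentinel content is passed in as text, and wire_token is a hypothetical name for the formatting step that status() performs.

```python
from enum import Enum
from typing import Optional, Tuple


class PollKind(str, Enum):
    """Stand-in for the wire vocabulary (the real one is a StrEnum)."""
    FINISHED = "FINISHED"
    RUNNING = "RUNNING"
    VANISHED = "VANISHED"
    FAILED = "FAILED"

    def __str__(self) -> str:
        return self.value


def classify(has_sentinel: bool, code: Optional[str],
             alive: bool) -> Tuple[PollKind, Optional[int]]:
    """Sentinel first: a written exit code outranks session liveness."""
    if has_sentinel:
        ecode = int(code.strip())  # sentinel text, e.g. "0\n"
        if ecode == 0:
            return PollKind.FINISHED, 0
        return PollKind.FAILED, ecode
    # No sentinel: a live session is still running; a dead one was
    # killed before the echo could record an exit code.
    return (PollKind.RUNNING if alive else PollKind.VANISHED), None


def wire_token(kind: PollKind, ecode: Optional[int]) -> str:
    """Format the token placed on the status line (hypothetical name)."""
    return f"{kind}({ecode})" if kind is PollKind.FAILED else str(kind)
```

For example, a sentinel containing "3" yields the token FAILED(3) whether or not the session is still alive, while a missing sentinel with a dead session yields VANISHED.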

PYTHON = sys.executable (absolute path) is used as the interpreter — portable across boxes and it
sidesteps conda activate, which a fresh non-interactive tmux shell never sources.
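
A minimal sketch of how the inner shell line could be assembled, assuming the function receives the project root and the run dir as paths (the real signature at analysis/detached_run.py:62 may differ). Every interpolated value goes through shlex.quote, and the interpreter is sys.executable as described above.

```python
import shlex
import sys
from pathlib import Path


def inner_command(root: Path, stem: str, run_dir: Path) -> str:
    """Shell line for tmux: redirect output, then record the exit code."""
    log = shlex.quote(str(run_dir / "console.log"))
    sentinel = shlex.quote(str(run_dir / "exit_status"))
    return (
        f"cd {shlex.quote(str(root))} && "
        # -u keeps stdout unbuffered so console.log fills in as the sim runs.
        f"{shlex.quote(sys.executable)} -u run_spec.py {shlex.quote(stem)} "
        # Plain redirect, not a pipe: with "| tee", $? would be the exit
        # status of tee (the last pipeline stage) instead of run_spec.py.
        f"> {log} 2>&1 ; echo $? > {sentinel}"
    )
```

Quoting the stem means a name containing spaces or shell metacharacters reaches run_spec.py as a single argument.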
