detached_run — detached tmux launcher for long sims
Purpose
Wraps run_spec.py in a detached tmux session per stem so a closed SSH window or a VSCode
restart cannot kill a running sim. Verbs: run / list / status / watch [--attach] / log [--follow].
Role in the system
- The remote-execution primitive the cluster dispatcher builds on: dispatch calls these verbs
in-process on the local Mac leg and shells the same analysis/detached_run.py <verb> over SSH on
the Linux boxes (invoke_detached).
- Launches the pipeline front-end run_spec.py (the Orchestrator(stem).run() gateway). It does not
reach into orchestrator / runner itself; it only owns the process lifecycle around them.
- Shares the dated log root (logs/logs_<date>/<stem>, always flat; the box never knows about
campaigns) with the pipeline via utils.infra.LOGS_DIR: the wrapper’s console.log / exit_status
sit beside the NPZ the run writes. dispatch nests on collect; the box side stays flat.
- Owns the wire vocabulary PollKind (FINISHED / RUNNING / VANISHED / FAILED). It is the producer:
classify() emits a PollKind and status() serialises it onto the status line that dispatch
parses back. dispatch imports dr.PollKind; it keeps its own lowercase State lifecycle separate.
Inputs / Outputs
- In: one or more run-spec stems (the spec name = the file stem); verb + flags from argv.
- Out: per-stem console.log (captured stdout+stderr) and exit_status sentinel under the dated
run dir; one tmux session named sim_<sanitized>_<6-char-sha1>; status lines on stdout whose tokens
are the PollKind values (FINISHED / RUNNING / VANISHED / FAILED(<code>)).
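The status-line shape described above (stem: TOKEN, with FAILED carrying a parenthesised exit code) could be read back on the consuming side roughly as follows. This is a minimal sketch: parse_status_line and its regex are hypothetical names for illustration, not the dispatcher's actual parser, and the exact "stem: TOKEN" spacing is assumed from the wire report described under Key functions.

```python
import re
from typing import Optional, Tuple

# Hypothetical parser for one wire line such as "my_spec: FAILED(2)".
# The four tokens are the PollKind values; the regex itself is an assumption.
_STATUS_LINE = re.compile(
    r"^(?P<stem>.+): (?P<kind>FINISHED|RUNNING|VANISHED|FAILED)"
    r"(?:\((?P<code>-?\d+)\))?$"
)


def parse_status_line(line: str) -> Tuple[str, str, Optional[int]]:
    """Split a status line into (stem, kind, exit_code or None)."""
    match = _STATUS_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised status line: {line!r}")
    code = match.group("code")
    return (
        match.group("stem"),
        match.group("kind"),
        int(code) if code is not None else None,
    )
```

Keeping the tokens uppercase and fixed means the consumer can match them literally, which is the point of a byte-stable wire vocabulary.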
Key functions
- PollKind: the wire poll vocabulary StrEnum (UPPERCASE; byte-stable on the SSH line) (analysis/detached_run.py:28)
- run: launch each stem in its own detached session, skipping any already alive (analysis/detached_run.py:125)
- status: wire report (stem: FINISHED|RUNNING|VANISHED|FAILED(n)), tails the log on failure (analysis/detached_run.py:163)
- classify: sentinel-first state machine, returns (PollKind, exit_code|None) (analysis/detached_run.py:70)
- session_name: tmux-legal, injective name = sanitized stem + SHA-1 hash (analysis/detached_run.py:40)
- inner_command: the shlex-quoted shell line tmux runs (redirect-then-echo $?) (analysis/detached_run.py:62)
- main: verb dispatch (analysis/detached_run.py:210)
Footguns
list / status key on TODAY's dated dir: a midnight-crossing run goes invisible
The log root is resolved at launch time. A run that crosses local midnight still writes under the dir
stamped at launch, but a bare status the next day scans the new day’s dir and reports nothing.
Check long runs on the same calendar day they were launched.
(analysis/INSIGHTS.md, detached_run [run][footgun].)
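When a run has already crossed midnight, its directory can still be located by hand by scanning every dated dir instead of only today's. The sketch below is one way to do that; find_run_dir is a hypothetical helper, not something the wrapper provides, and it assumes only the logs_<date>/<stem> layout described above (it does not assume any particular date format).

```python
from pathlib import Path
from typing import Optional


def find_run_dir(logs_root: Path, stem: str) -> Optional[Path]:
    """Return the most recently modified logs_*/<stem> directory, if any.

    Hypothetical helper: scans all dated dirs under logs_root rather than
    only the one stamped with today's date.
    """
    candidates = [
        dated / stem
        for dated in logs_root.glob("logs_*")
        if (dated / stem).is_dir()
    ]
    if not candidates:
        return None
    # Pick the newest by modification time, so a re-run on a later day wins.
    return max(candidates, key=lambda run_dir: run_dir.stat().st_mtime)
```

Once the directory is found, console.log and exit_status inside it can be inspected directly.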
Linux cluster needs loginctl enable-linger <user> once
Without it, systemd reaps the tmux server on SSH logout and kills every session, defeating the whole
point. macOS launchd has no such issue. (analysis/INSIGHTS.md, detached_run [perf][run].)
Sanitized session names alone collide: the SHA-1 hash is load-bearing
Mapping . / - / whitespace → _ is many-to-one, so distinct stems could share a session name and make
is_alive / status lie. A 6-char SHA-1 of the raw stem is appended for (practical) injectivity; _known_stems
reads the real stem name off disk so it never has to invert the lossy sanitization.
(analysis/INSIGHTS.md, detached_run [footgun].)
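The naming scheme can be illustrated with a short sketch. It assumes the sanitizer replaces exactly ".", "-" and whitespace with "_" and that the suffix is the first six hex characters of the SHA-1 of the raw stem; the real session_name may differ in both details.

```python
import hashlib
import re


def session_name(stem: str) -> str:
    """Sketch of sim_<sanitized>_<6-char-sha1> (character set is assumed)."""
    # Lossy step: several distinct stems collapse to the same sanitized string.
    safe = re.sub(r"[.\-\s]", "_", stem)
    # Disambiguating step: hash the RAW stem, not the sanitized one.
    tag = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:6]
    return f"sim_{safe}_{tag}"


# "a.b" and "a-b" sanitize to the same "a_b"; only the hash suffix separates them.
for raw in ("a.b", "a-b"):
    print(raw, "->", session_name(raw))
```

Hashing the sanitized string instead would reproduce the collision, which is why the raw stem is the hash input.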
No checkpoint/resume: a killed run is re-launched from scratch
Nothing is serialized mid-trajectory; scope is intentionally re-run only. run() first touches
the console log (so list / status see it immediately) and clears any prior exit_status sentinel so
a stale result can’t impersonate the new launch. (analysis/INSIGHTS.md, detached_run [run].)
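The pre-launch housekeeping amounts to two filesystem operations. A minimal sketch, assuming the run directory holds console.log and exit_status as described; prepare_launch is a hypothetical name, not a function in the module.

```python
from pathlib import Path


def prepare_launch(run_dir: Path) -> None:
    """Make the run visible and remove any stale result before starting tmux."""
    run_dir.mkdir(parents=True, exist_ok=True)
    # Touch (not truncate) the log so list/status can see the stem right away.
    (run_dir / "console.log").touch()
    # Drop a sentinel left by an earlier run; missing_ok needs Python 3.8+.
    (run_dir / "exit_status").unlink(missing_ok=True)
```

Clearing the sentinel before launch matters because the state machine checks the sentinel first: a leftover file would be reported as the new run's result.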
Pseudocode (the sentinel state machine)
run(stem):      touch console.log ; rm old exit_status ; tmux new-session -d -s sim_<safe>_<sha> "<inner>"
inner_command:  cd ROOT && $PYTHON -u run_spec.py <stem> > console.log 2>&1 ; echo $? > exit_status
                # redirect-then-echo (NOT | tee) → $? is run_spec's own code
classify(sentinel?, code, alive) -> (PollKind, exit_code|None):
    sentinel?  → (FINISHED, 0) if code==0 else (FAILED, int(code))   # durable, outlives the session
    else       → (RUNNING if alive else VANISHED, None)              # no sentinel + dead = killed before echo
status() then formats the wire line: f"{kind}({ecode})" for FAILED, else str(kind)
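The state machine above translates almost line for line into Python. The sketch below mirrors the pseudocode's signature; wire_token is a hypothetical helper standing in for the formatting done inside status(), and PollKind is written as a str/Enum mix-in (rather than the StrEnum the module uses) only so the example also runs on Python older than 3.11.

```python
from enum import Enum
from typing import Optional, Tuple


class PollKind(str, Enum):
    FINISHED = "FINISHED"
    RUNNING = "RUNNING"
    VANISHED = "VANISHED"
    FAILED = "FAILED"


def classify(
    has_sentinel: bool, code: Optional[str], alive: bool
) -> Tuple[PollKind, Optional[int]]:
    """Sentinel-first: a written exit_status wins over tmux liveness."""
    if has_sentinel:
        ecode = int(code)  # int() tolerates the trailing newline from echo
        return (PollKind.FINISHED, 0) if ecode == 0 else (PollKind.FAILED, ecode)
    # No sentinel: either still running, or killed before echo could run.
    return (PollKind.RUNNING if alive else PollKind.VANISHED, None)


def wire_token(kind: PollKind, ecode: Optional[int]) -> str:
    """Format the status token; only FAILED carries its exit code."""
    return f"{kind.value}({ecode})" if kind is PollKind.FAILED else kind.value
```

Checking the sentinel before liveness is what makes the result durable: the file still answers the question after the tmux session has exited and been cleaned up.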
PYTHON = sys.executable (absolute path) is used as the interpreter — portable across boxes and it
sidesteps conda activate, which a fresh non-interactive tmux shell never sources.
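A sketch of how such a shell line might be assembled, assuming console.log and exit_status live in a per-stem run directory. The parameter names (root, run_dir) are illustrative; the real inner_command may take different arguments.

```python
import shlex
import sys
from pathlib import Path


def inner_command(root: Path, stem: str, run_dir: Path) -> str:
    """Build the shell line tmux runs: redirect output, then record $?."""
    q = shlex.quote  # quote every interpolated value against spaces/metachars
    log = run_dir / "console.log"
    sentinel = run_dir / "exit_status"
    return (
        f"cd {q(str(root))} && "
        # Absolute interpreter path: no reliance on PATH or conda activation.
        f"{q(sys.executable)} -u run_spec.py {q(stem)} > {q(str(log))} 2>&1 ; "
        # Plain redirect (no pipe), so $? is the Python process's own status.
        f"echo $? > {q(str(sentinel))}"
    )
```

With a pipe to tee, $? would report tee's exit status instead, which is why the redirect-then-echo form is used.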
Related
dispatch · orchestrator · runner · logger · terminology