scoreboard — mission success gates (the judgment layer)

Purpose

Defines the scored mission objectives: for each one, how it is judged — reduction, direction,
pass/fail gate, weight, human meaning, and the after: startup-skip window. The catalog
(metric_catalog’s metrics.yaml) says what is measured; this file says how it is judged.
These two files own separate concerns. Three entries today: tracking error, coverage, throttle fraction.

Structure

A single top-level variables: list. Each item is one scored objective with:

  • name — the logged variable key (matched by leaf name against the raw log).
  • weight — relative importance in the overall verdict (p_e 3 > coverage 2 > throttle 1).
  • success: — the judgment block: reduction, direction, gate, after, label, meaning.
  • (optional) threshold / reductions: — for diagnostic variables that need a band/floor reduction (e.g. s_min_J’s frac_below at 0.025).

Key knobs/sections

key pathwhat it controlsdefault/examplenote
variables[].namethe logged signal this gate scoresp_e, area_coverage_frac, s_min_Jmatched by leaf name (last dotted segment) — see completeness bar below
variables[].weightrelative weight in the verdict3 / 2 / 1p_e is the tightest, highest-weight gate
success.reductionhow the per-step signal collapses to one numberp99 / final / frac_belowreductions live in data_analyzer (Reductions enum)
success.directionwhich way is goodlower / higherlower for errors/throttle, higher for coverage
success.gatethe pass/fail threshold0.2 m / 0.99 / nullnull = diagnostic only, never fails the run
success.afterstartup-skip window (seconds)90.0excludes the startup transient; steps inside it are recorded but NOT scored
threshold + reductionsband/floor for diagnostics0.025, [min, frac_below]s_min_J floor mirrors controller.arm.null_space.freeze_floor
success.label / success.meaninghuman-facing strings for the console verdict”camera tracking error”per the no-internal-nomenclature rule — every metric explained

Who reads it

  • pre_run_loaderScoreboardSpecLoader parses this into ScoreboardSpec objects (sibling of MetricSpecLoader, which parses metrics.yaml). pre_run_loader.py:117.
  • validation/run_scoreboard.py — the console tool that prints the verdict (the model output; reports do NOT auto-inject scoreboard content).
  • loggerscoreboard_metric_coverage uses this file as the log-completeness bar (NOT the full metrics.yaml catalog), matched by leaf variable name.
  • parameter_loader / metric_catalog — adjacent config owners (built matrices, the metric catalog) that this judgment layer pairs with.

Footguns

Two files, two concerns — don't merge them

metrics.yaml (metric_catalog) = the variable catalog, what is measured. scoreboard.yaml = the
judgment layer, how it is judged. MetricSpecLoader and ScoreboardSpecLoader are separate parsers.
A gate change belongs here; a new logged signal belongs in the catalog. (analysis/INSIGHTS.md
[pre_run_loader, config]; .claude/rules/MEASUREMENT.md.)

Completeness bar = the scoreboard set, NOT the full catalog

logger’s scoreboard_metric_coverage matches raw log keys against this file, not metrics.yaml.
The full catalog has include:true metrics that are config-conditional (e.g. reactive-scorer scores
absent in POSE mode), so they can’t be a universal bar. Adding an entry here makes it a hard
write-completeness requirement (save_raw_log(require_complete=True)). (analysis/INSIGHTS.md [logger, io].)

The frac_below threshold must stay in sync with the controller floor

s_min_J’s threshold: 0.025 mirrors controller.arm.null_space.freeze_floor (the speed-throttle
floor) — and the catalog’s derate_fraction threshold mirrors the same value. All three must move
together when the controller floor is retuned. (YAMLs_by_domain/INSIGHTS.md [config, baseline].)

after: 90.0 is load-bearing, not cosmetic

All gates exclude the first 90 s (startup transient). This is the operational window where the arm has
acquired its first target and the controller has settled. TASK_10 tried 15 s and full-helix coverage
regressed 0.998 → 0.78. Metrics inside the skip are recorded but not scored. (YAMLs_by_domain/INSIGHTS.md
[config, baseline]; matches camera_guidance.targeting.startup.time = 90.0.)

Reduction semantics are not obvious — read the enum

final = the LAST logged value (coverage marking is monotone by design, so a late-achieved 0.99 passes).
frac_below scores the share of steps below threshold and is a diagnostic here (gate: null, never
fails). p99 allows 1% of steps above the gate. The Reductions enum abs-maps most signals — see
data_analyzer’s magnitude footgun before adding a new reduction. (analysis/INSIGHTS.md [data_analyzer, math].)

Keep it lean — one in, one out

The scoreboard is deliberately the minimum-viable scorecard (3 entries since CHAIN_5). The companion
metrics.yaml enforces “retire one metric per addition”; the same discipline keeps this file from
accreting cruft. Don’t add a fourth gate without retiring or strongly justifying it.
(YAMLs_by_domain/INSIGHTS.md [history].)

metric_catalog · pre_run_loader · parameter_loader · logger · data_analyzer · breve_controller · robot · terminology · current_sota

For the exhaustive key list, read YAMLs_by_domain/scoreboard.yaml itself — this page is a navigable reference, not a dump.