scoreboard — mission success gates (the judgment layer)
Purpose
Defines the scored mission objectives: for each one, how it is judged —
reduction,direction,
pass/failgate,weight, humanmeaning, and theafter:startup-skip window. The catalog
(metric_catalog’smetrics.yaml) says what is measured; this file says how it is judged.
These two files own separate concerns. Three entries today: tracking error, coverage, throttle fraction.
Structure
A single top-level variables: list. Each item is one scored objective with:
name— the logged variable key (matched by leaf name against the raw log).weight— relative importance in the overall verdict (p_e 3 > coverage 2 > throttle 1).success:— the judgment block:reduction,direction,gate,after,label,meaning.- (optional)
threshold/reductions:— for diagnostic variables that need a band/floor reduction (e.g.s_min_J’sfrac_belowat 0.025).
Key knobs/sections
| key path | what it controls | default/example | note |
|---|---|---|---|
variables[].name | the logged signal this gate scores | p_e, area_coverage_frac, s_min_J | matched by leaf name (last dotted segment) — see completeness bar below |
variables[].weight | relative weight in the verdict | 3 / 2 / 1 | p_e is the tightest, highest-weight gate |
success.reduction | how the per-step signal collapses to one number | p99 / final / frac_below | reductions live in data_analyzer (Reductions enum) |
success.direction | which way is good | lower / higher | lower for errors/throttle, higher for coverage |
success.gate | the pass/fail threshold | 0.2 m / 0.99 / null | null = diagnostic only, never fails the run |
success.after | startup-skip window (seconds) | 90.0 | excludes the startup transient; steps inside it are recorded but NOT scored |
threshold + reductions | band/floor for diagnostics | 0.025, [min, frac_below] | s_min_J floor mirrors controller.arm.null_space.freeze_floor |
success.label / success.meaning | human-facing strings for the console verdict | ”camera tracking error” | per the no-internal-nomenclature rule — every metric explained |
Who reads it
- pre_run_loader —
ScoreboardSpecLoaderparses this intoScoreboardSpecobjects (sibling ofMetricSpecLoader, which parsesmetrics.yaml).pre_run_loader.py:117. validation/run_scoreboard.py— the console tool that prints the verdict (the model output; reports do NOT auto-inject scoreboard content).- logger —
scoreboard_metric_coverageuses this file as the log-completeness bar (NOT the fullmetrics.yamlcatalog), matched by leaf variable name. - parameter_loader / metric_catalog — adjacent config owners (built matrices, the metric catalog) that this judgment layer pairs with.
Footguns
Two files, two concerns — don't merge them
metrics.yaml(metric_catalog) = the variable catalog, what is measured.scoreboard.yaml= the
judgment layer, how it is judged.MetricSpecLoaderandScoreboardSpecLoaderare separate parsers.
A gate change belongs here; a new logged signal belongs in the catalog. (analysis/INSIGHTS.md
[pre_run_loader, config];.claude/rules/MEASUREMENT.md.)
Completeness bar = the scoreboard set, NOT the full catalog
logger’s
scoreboard_metric_coveragematches raw log keys against this file, notmetrics.yaml.
The full catalog hasinclude:truemetrics that are config-conditional (e.g. reactive-scorer scores
absent in POSE mode), so they can’t be a universal bar. Adding an entry here makes it a hard
write-completeness requirement (save_raw_log(require_complete=True)). (analysis/INSIGHTS.md[logger, io].)
The
frac_belowthreshold must stay in sync with the controller floor
s_min_J’sthreshold: 0.025mirrorscontroller.arm.null_space.freeze_floor(the speed-throttle
floor) — and the catalog’sderate_fractionthreshold mirrors the same value. All three must move
together when the controller floor is retuned. (YAMLs_by_domain/INSIGHTS.md[config, baseline].)
after: 90.0is load-bearing, not cosmeticAll gates exclude the first 90 s (startup transient). This is the operational window where the arm has
acquired its first target and the controller has settled. TASK_10 tried 15 s and full-helix coverage
regressed 0.998 → 0.78. Metrics inside the skip are recorded but not scored. (YAMLs_by_domain/INSIGHTS.md
[config, baseline]; matchescamera_guidance.targeting.startup.time = 90.0.)
Reduction semantics are not obvious — read the enum
final= the LAST logged value (coverage marking is monotone by design, so a late-achieved 0.99 passes).
frac_belowscores the share of steps belowthresholdand is a diagnostic here (gate: null, never
fails).p99allows 1% of steps above the gate. TheReductionsenum abs-maps most signals — see
data_analyzer’s magnitude footgun before adding a new reduction. (analysis/INSIGHTS.md[data_analyzer, math].)
Keep it lean — one in, one out
The scoreboard is deliberately the minimum-viable scorecard (3 entries since CHAIN_5). The companion
metrics.yamlenforces “retire one metric per addition”; the same discipline keeps this file from
accreting cruft. Don’t add a fourth gate without retiring or strongly justifying it.
(YAMLs_by_domain/INSIGHTS.md[history].)
Related
metric_catalog · pre_run_loader · parameter_loader · logger · data_analyzer · breve_controller · robot · terminology · current_sota
For the exhaustive key list, read YAMLs_by_domain/scoreboard.yaml itself — this page is a navigable reference, not a dump.