Long-form companion to analysis/statistics.py (the
inference owner carved out of data_analyzer.py). Every
function here turns one or two signals into a single number that answers
a question — “do these move together?”, “which one leads?”, “is this
spike real or luck?”. Each claim below is machine-checked by a named
test in validation/test_data_primitives.py;
the test name is given at the end of each section, and the worked
numbers are reproduced there. No number in this file is invented — they
are all either hand-computable or pinned by a test.
A single signal answers “how big?” — you reduce it (rms, p99, max) to one number. Correlation and effect-size statistics answer a different kind of question: how do two things relate? There are only four shapes of question, and the whole toolbox is organized around them:
| Question | Reach for | Returns |
|---|---|---|
| Do \(a\) and \(b\) rise and fall together? | pearson_r (linear), spearman_rho
(monotone), tail_corr (in the worst cases) |
a number in \([-1, 1]\) |
| Does \(a\) lead \(b\) in time? | xcorr_peak_lag |
a peak and a lag (seconds) |
| Is the trend of \(b\) vs \(a\) a straight line? | linfit |
slope, intercept, residual |
| Are the values of \(a\) systematically bigger than \(b\)? | cliffs_delta |
a number in \([-1, 1]\) |
| How spread out is one number across many runs/seeds? | ci95, rel_spread,
exceedance_curve |
a width / a curve |
| Is a signal elevated near events beyond coincidence? | event_effect_size (+ the control machinery) |
an effect size + counts |
If you remember nothing else: Pearson trusts the values, Spearman trusts only the order, and Cliff’s delta trusts only who-beats-whom. Each throws away more information than the last, and in exchange becomes harder to fool. The art is matching how much you throw away to how much you trust your data.
Before any statistic, two housekeeping rules apply to all of these functions, so we state them once.
Real logged signals contain NaN (a metric that was not
recorded at some step) and inf (a blown-up quantity). A
correlation computed naively over such an array returns NaN
and silently poisons your whole study. So every two-signal function
begins by calling the private helper _finite_pair(a, b),
which (i) truncates both arrays to their common length, then (ii) keeps
only the positions where both are finite. Single-sample
functions use the analogous finite_numeric_samples. The
rule:
Fact 0 (NaN-awareness). Every statistic operates on the finite, length-matched overlap of its inputs, and returns
NaN(not a garbage number, and not an exception) when there is too little data to answer — fewer than two finite points, or a constant signal with no variation to correlate.
This is why you will see the guard
if a.size < 2: return float("nan") at the top of nearly
every function. Returning NaN rather than raising is
deliberate: a sweep over fifty runs should mark the one degenerate run
as “no answer” and keep going, not crash.
Machine-check: test_corr_nan_safe_and_constant,
test_cliffs_delta_nan_safe_and_empty.
The problem. You suspect the end-effector position error \(p_e\) gets worse as the arm approaches a singularity (small \(\sigma_{\min}\)). You have both signals over a run. Is there a relationship, and how strong?
The intuition. Pearson’s \(r\) asks: when \(a\) is above its average, is \(b\) also above its average, by a proportional amount? If yes, \(r\) is near \(+1\); if \(b\) tends to be below when \(a\) is above, \(r\) is near \(-1\); if there is no linear tendency, \(r \approx 0\). It is the cosine of the angle between the two mean-centered signals — a purely geometric “do these point the same way” measure.
The formula. For paired samples \(\{(a_i, b_i)\}_{i=1}^n\) with means \(\bar a, \bar b\),
\[ r \;=\; \frac{\sum_{i}(a_i - \bar a)(b_i - \bar b)} {\sqrt{\sum_i (a_i - \bar a)^2}\,\sqrt{\sum_i (b_i - \bar b)^2}} \;=\; \frac{\operatorname{cov}(a,b)}{\sigma_a\,\sigma_b}. \]
The implementation defers to np.corrcoef(a, b)[0, 1],
which computes exactly this. The denominator explains the “constant
signal” guard: if either signal never varies, \(\sigma = 0\) and the ratio is undefined, so
we return NaN.
A tiny worked example. Take \(a = (1, 2, 3, 4, 5)\) and \(b = 2a + 7 = (9, 11, 13, 15, 17)\). Because
\(b\) is an exactly linear
function of \(a\) with positive slope,
the mean-centered versions are parallel, and \(r = 1\) to floating precision. This is
precisely the statistics_smoke() assertion.
How it fools you. Pearson sees only straight-line association. If \(b = a^2\) on \([-1, 1]\), the relationship is perfectly deterministic yet \(r = 0\), because the upward-left and upward-right halves cancel. And a single wild outlier can drag \(r\) from \(0\) to \(0.9\). When you do not trust linearity or outliers, do not use Pearson — use Spearman (next).
Machine-check: test_pearson_perfect_linear,
test_pearson_matches_numpy_corrcoef.
The problem. The relationship between conditioning \(\kappa\) and tracking error is surely monotone (worse conditioning, worse error) but almost certainly not a straight line — it probably curves. Pearson would understate it. We want “do they move together in the same direction?” without assuming the shape.
The intuition — the rank trick. Replace every value by its rank (smallest \(\to 1\), next \(\to 2\), …), throwing away the actual magnitudes and keeping only the ordering. Then run Pearson on the ranks. Any strictly increasing relationship — linear, quadratic, exponential, logarithmic — becomes a perfectly straight line in rank-space, so Spearman reports \(\rho = 1\). We have traded sensitivity to magnitude for robustness to shape and to outliers (a wild value is just “the largest rank,” no more).
The mechanism. The helper _rankdata
assigns ranks, and — this is the subtle part — gives tied values
their average rank (if two samples tie for ranks 3 and 4, both
receive \(3.5\)). Without this, ties
would inject a false ordering. Then \(\rho =
\texttt{pearson\_r}(\operatorname{rank}(a),
\operatorname{rank}(b))\).
A tiny worked example. Let \(a = (1, 2, 3, 4)\) and \(b = (1, 8, 27, 64) = a^3\). This is wildly nonlinear, so Pearson would report less than \(1\). But the order is identical — rank of \(a\) is \((1,2,3,4)\) and so is the rank of \(b\) — so \(\rho = 1\) exactly. That monotone-but-curved case is the pinned test below.
The moral. Pearson asks “is it a line?”; Spearman asks “is it monotone?” Reach for Spearman whenever the physical relationship is “more of this means more of that” but you have no reason to believe the curve is straight — which, in this project, is most of the time.
Machine-check:
test_spearman_monotone_nonlinear_is_one,
test_spearman_matches_scipy_when_available.
tail_corr — does the association survive in the worst
cases?The problem. A risk-aware study does not care whether two signals correlate on average — it cares whether they correlate when things are going badly. The average behavior can be benign while the tail is where the mission fails.
The intuition. Restrict attention to the worst slice
of \(a\) — the steps where \(a\) is at or above its \(q\)-th percentile — and compute the
correlation only there. With the default \(q =
90\), this is the correlation within the top 10% of \(a\). If \(a\) and \(b\) correlate overall but
tail_corr collapses to near zero, the association is a
calm-weather artifact; if tail_corr stays high, the
coupling is real exactly where it matters.
The mechanism.
sel = a >= percentile(a, q) selects the tail (note this
reuses the one canonical percentile, which lives
with the math primitives in data_analyzer —
tail_corr imports it lazily). Then it runs Spearman
(default) or Pearson on the restricted pair via the
CorrMethod selector. The same < 2 points
guard applies after restriction — a tail too thin to judge returns
NaN.
How it fools you. The tail is, by construction, a
small sample, so tail_corr is noisier than the full-sample
correlation. Treat a single tail correlation as a hint, not a verdict —
corroborate it across runs (Section 7).
Machine-check: covered by the correlation suite
(test_corr_nan_safe_and_constant) since
tail_corr composes _finite_pair,
percentile, and
spearman_rho/pearson_r, each pinned
individually.
xcorr_peak_lag —
who leads whom?The problem. When the base spacecraft wobbles, does the arm tracking error react at the same instant, or does it lag the wobble by some delay? Correlation at zero lag cannot answer this; we need to slide one signal against the other in time.
The intuition. Shift \(b\) backward and forward in time relative to \(a\), and at each shift compute how well they line up. The shift that maximizes the alignment is the lag. If the best alignment happens when we delay \(b\), then \(a\) precedes \(b\) — \(a\) leads. The sign convention here is: a positive lag means \(a\) precedes \(b\) (i.e. \(b\) is the delayed copy).
The mechanism. Both signals are \(z\)-scored (mean-subtracted, divided by
standard deviation, with a tiny \(10^{-12}\) floor so a flat signal does not
divide by zero). Then np.correlate(..., mode="full") sweeps
all shifts; we take the shift of largest absolute correlation
(absolute, so a strong anti-correlation at some lag is still
detected) and convert it to seconds via the sample step
dt.
A tiny worked example. Cross-correlating a signal
with itself must peak at lag \(0\) — there is no shift that aligns a
signal better than no shift at all. Construct \(b\) as \(a\) delayed by exactly \(k\) samples, and
xcorr_peak_lag recovers a lag of \(k\,dt\). Both are pinned below.
The moral. Correlation tells you that two signals are related; cross-correlation lag tells you in what temporal order — the first whisper of a causal story (though, as always, lead is not proof of cause).
Machine-check: test_xcorr_zero_lag_for_identical,
test_xcorr_recovers_known_shift.
linfit
— the ordinary-least-squares lineThe problem. You have established (via Spearman) that \(b\) rises with \(a\) and you believe the trend is roughly linear. Now you want the actual numbers: how much does \(b\) change per unit of \(a\), and how well does a line actually describe the data?
The intuition. Fit the straight line \(b \approx m\,a + c\) that minimizes the sum of squared vertical gaps between the line and the data — ordinary least squares. The slope \(m\) is the “exchange rate” (\(\Delta b\) per \(\Delta a\)); the intercept \(c\) is the offset.
The mechanism and the honesty check.
np.polyfit(a, b, 1) returns \((m,
c)\). But a slope is meaningless if the data is not actually
line-shaped, so linfit also returns a goodness
number:
\[ \texttt{max\_rel\_resid} \;=\; \frac{\max_i \lvert b_i - (m\,a_i + c)\rvert}{\max_i \lvert b_i\rvert}. \]
This is the worst vertical miss, expressed as a fraction of the signal’s scale. If it is near \(0\), the line is an excellent description; if it is large, the slope is a fiction you should not quote. Reporting the slope with its worst residual is the difference between a measurement and a guess.
A tiny worked example. For \(a = \texttt{linspace}(0,1,50)\) and \(b = 2a + 0.5\), linfit returns
slope \(2\), intercept \(0.5\), and max_rel_resid
essentially \(0\) — the exact
statistics_smoke() assertion.
Machine-check: test_linfit_exact_line,
test_linfit_nan_safe_and_residual.
The problem. Two batches of numbers — say \(p_e\) over the steps near a singularity event, versus \(p_e\) over quiet steps. Are the near-event values genuinely larger, and by how much? A difference of means is fragile (outliers, skew, unequal spread). We want a robust “how separated are these two groups?”
The intuition — count the contests. Pit every value of group \(a\) against every value of group \(b\) in a head-to-head. Cliff’s \(\delta\) is simply (fraction of contests \(a\) wins) minus (fraction \(a\) loses):
\[ \delta \;=\; \frac{\#\{(i,j): a_i > b_j\} \;-\; \#\{(i,j): a_i < b_j\}}{n_a\, n_b}. \]
It lives in \([-1, 1]\). \(\delta = +1\) means every value of \(a\) exceeds every value of \(b\) (total dominance); \(\delta = 0\) means the groups are thoroughly interleaved (no effect); \(\delta = -1\) is total dominance the other way. Notice what it never does: it never looks at how much \(a_i\) exceeds \(b_j\), only whether it does. That is what makes it immune to outliers and to the shape of either distribution — it is the order-statistic cousin of “who beats whom” from Section 1’s hierarchy.
A tiny worked example. With \(a = (1, 2, 3)\) and \(b = (4, 5, 6)\), every one of the \(3 \times 3 = 9\) contests is a loss for \(a\), so \(\delta = (0 - 9)/9 = -1\). Identical groups give \(\delta = 0\). Both are pinned.
The moral. When you need to report “is regime A worse than regime B, robustly?”, Cliff’s \(\delta\) is the honest answer — it makes no normality assumption and degrades gracefully. A rough field guide: \(\lvert\delta\rvert \lesssim 0.15\) negligible, \(\approx 0.33\) medium, \(\gtrsim 0.47\) large.
Machine-check:
test_cliffs_delta_complete_separation,
test_cliffs_delta_identical_is_zero,
test_cliffs_delta_nan_safe_and_empty.
ci95, rel_spread,
exceedance_curveThe functions so far digest signals within one run. The next three digest one scalar across many runs — the seeds or arm configurations of a fragility campaign. The input is a small list of per-run numbers (e.g. “\(p_e\) p99 from each of 8 seeds”).
ci95 — how trustworthy is the mean?
Given \(n\) per-run values with sample
standard deviation \(s\), the \(95\%\) confidence half-width under a normal
approximation is
\[ \texttt{ci95} \;=\; 1.96\,\frac{s}{\sqrt{n}}. \]
The reported mean is “\(\bar x \pm
\texttt{ci95}\)”. The \(\sqrt
n\) is the whole story of why more seeds help: halving the
uncertainty costs four times the runs. Returns NaN
for \(n < 2\) (a single run has no
spread to estimate). Machine-check:
test_ci95_matches_normal_formula.
rel_spread — a quick scale-free spread.
Defined as \((\max -
\min)/\text{mean}\), this is a crude but instantly readable “how
much does the outcome swing relative to its size?” A
rel_spread of \(0.02\)
says the worst and best runs differ by 2% of the mean — tight; a value
of \(0.6\) says the result is barely
reproducible. It is deliberately simple (no distributional assumption);
for anything load-bearing, prefer ci95. Machine-check:
test_rel_spread_basic.
exceedance_curve — the whole tail, not one
number. Returns the sorted values together with the empirical
survival function \(1 -
\widehat{F}(x)\), i.e. the fraction of runs at or above each
level. This is the picture behind a risk statement: reading off the
curve gives “in 10% of runs, \(p_e\)
p99 exceeds \(X\).” Where a single
percentile is one point, the exceedance curve is the entire upper tail.
Machine-check:
test_exceedance_curve_is_survival.
The problem. This is the subtlest tool, and the most important for risk work. You observe that conditioning \(\kappa\) seems to spike near freeze/thaw events. But a busy signal spikes all the time — maybe the near-event spikes are nothing special. How do you separate a genuine event effect from the background rate of spiking?
The idea — a matched control. Borrow the logic of a controlled experiment. Build two groups: the signal’s values near events, and the signal’s values at comparable steps far from any event. If the near-event group is reliably larger (Cliff’s \(\delta\) from Section 6), the event effect is real; if the two groups are interleaved, the “spikes” were just the signal being its usual noisy self.
The three pieces.
event_proximity_mask(events, n, prox) — dilate each
event index into a window \(\pm\,\texttt{prox}\) and union them,
producing the boolean “near an event” mask over all \(n\) steps. This is the treatment
set. Machine-check:
test_event_proximity_mask_dilates_and_unions.
matched_control_pool(n, exclude_idx, n_samples, rng, gap)
— draw a seeded random sample of step indices that lie strictly
more than gap away from any event. This is the
control set, and the explicit rng is what makes
the whole study reproducible: same seed, same controls,
same verdict. Machine-check:
test_matched_control_pool_avoids_events_and_is_seeded.
event_effect_size(signal, events, exclude_idx, ...)
(in analysis/statistics.py) ties them together: it forms
the near-event sample and a control sample \(\texttt{control\_mult}\times\) larger, then
returns Cliff’s \(\delta\) between them
along with the two sample sizes. A large positive \(\delta\) with healthy counts is the
evidence that the events genuinely elevate the signal.
Machine-check:
test_event_effect_size_separates_event_signal.
The moral. A raw “the signal spikes near events” plot is suggestive; the matched-control effect size is evidence. The seeded control pool is what lets a skeptical reader rerun your study and get your exact number — which is the entire point of putting these in a tested, owned module rather than a throwaire script.
(Companion: matched_control_windows builds
width-matched control windows* rather than point samples, for when
the events are intervals;
test_matched_control_windows_widths_and_disjoint.)*
| You want to know… | Use | Robust to outliers? | Assumes a shape? |
|---|---|---|---|
| Linear co-movement of two signals | pearson_r |
No | Yes (linear) |
| Monotone co-movement (any curve) | spearman_rho |
Yes | No (monotone) |
| Co-movement in the worst \(q\%\) | tail_corr |
Yes (Spearman default) | No |
| Temporal lead/lag between two signals | xcorr_peak_lag |
Moderately | No |
| Slope + honesty of a linear trend | linfit (read max_rel_resid!) |
No | Yes (linear) |
| Whether group A exceeds group B | cliffs_delta |
Yes | No |
| Uncertainty of a mean across runs | ci95 |
No | Yes (normal) |
| Quick reproducibility spread | rel_spread |
No | No |
| The full upper tail across runs | exceedance_curve |
— | No |
| Whether events genuinely move a signal | event_effect_size + controls |
Yes (Cliff’s \(\delta\)) | No |
The throughline: the more robust you need to be, the
more you retreat from values to ranks to
who-beats-whom — and the price is sensitivity. Pick the
leftmost tool your data’s trustworthiness can afford, and report the
honesty number (max_rel_resid, ci95, the
sample counts) alongside the headline so the result is a measurement,
not a hope.
All functions live in analysis/statistics.py and are
pinned by validation/test_data_primitives.py.
Import them directly from the home —
from analysis.statistics import pearson_r, spearman_rho, tail_corr, xcorr_peak_lag, linfit, cliffs_delta, ci95, rel_spread, exceedance_curve, event_effect_size
— and feed them logged signals via
series(...)/measure(...) from
analysis.data_analyzer, so the inputs are the pipeline’s
own numbers, not a parallel re-derivation.