Reading Two Signals at Once — the correlation & inference toolbox

Long-form companion to analysis/statistics.py (the inference owner carved out of data_analyzer.py). Every function here turns one or two signals into a single number that answers a question — “do these move together?”, “which one leads?”, “is this spike real or luck?”. Each claim below is machine-checked by a named test in validation/test_data_primitives.py; the test name is given at the end of each section, and the worked numbers are reproduced there. No number in this file is invented — they are all either hand-computable or pinned by a test.

The one thing to hold onto

A single signal answers “how big?” — you reduce it (rms, p99, max) to one number. Correlation and effect-size statistics answer a different kind of question: how do two things relate? There are only four shapes of question, and the whole toolbox is organized around them:

Question	Reach for	Returns
Do \(a\) and \(b\) rise and fall together?	`pearson_r` (linear), `spearman_rho` (monotone), `tail_corr` (in the worst cases)	a number in \([-1, 1]\)
Does \(a\) lead \(b\) in time?	`xcorr_peak_lag`	a peak and a lag (seconds)
Is the trend of \(b\) vs \(a\) a straight line?	`linfit`	slope, intercept, residual
Are the values of \(a\) systematically bigger than \(b\)?	`cliffs_delta`	a number in \([-1, 1]\)
How spread out is one number across many runs/seeds?	`ci95`, `rel_spread`, `exceedance_curve`	a width / a curve
Is a signal elevated near events beyond coincidence?	`event_effect_size` (+ the control machinery)	an effect size + counts

If you remember nothing else: Pearson trusts the values, Spearman trusts only the order, and Cliff’s delta trusts only who-beats-whom. Each throws away more information than the last, and in exchange becomes harder to fool. The art is matching how much you throw away to how much you trust your data.

0 · The contract every function shares: finite-pairing

Before any statistic, two housekeeping rules apply to all of these functions, so we state them once.

Real logged signals contain NaN (a metric that was not recorded at some step) and inf (a blown-up quantity). A correlation computed naively over such an array returns NaN and silently poisons your whole study. So every two-signal function begins by calling the private helper _finite_pair(a, b), which (i) truncates both arrays to their common length, then (ii) keeps only the positions where both are finite. Single-sample functions use the analogous finite_numeric_samples. The rule:

Fact 0 (NaN-awareness). Every statistic operates on the finite, length-matched overlap of its inputs, and returns NaN (not a garbage number, and not an exception) when there is too little data to answer — fewer than two finite points, or a constant signal with no variation to correlate.

This is why you will see the guard if a.size < 2: return float("nan") at the top of nearly every function. Returning NaN rather than raising is deliberate: a sweep over fifty runs should mark the one degenerate run as “no answer” and keep going, not crash.

Machine-check: test_corr_nan_safe_and_constant, test_cliffs_delta_nan_safe_and_empty.

1 · Pearson’s \(r\) — straight-line association

The problem. You suspect the end-effector position error \(p_e\) gets worse as the arm approaches a singularity (small \(\sigma_{\min}\)). You have both signals over a run. Is there a relationship, and how strong?

The intuition. Pearson’s \(r\) asks: when \(a\) is above its average, is \(b\) also above its average, by a proportional amount? If yes, \(r\) is near \(+1\); if \(b\) tends to be below when \(a\) is above, \(r\) is near \(-1\); if there is no linear tendency, \(r \approx 0\). It is the cosine of the angle between the two mean-centered signals — a purely geometric “do these point the same way” measure.

The formula. For paired samples \(\{(a_i, b_i)\}_{i=1}^n\) with means \(\bar a, \bar b\),

\[ r \;=\; \frac{\sum_{i}(a_i - \bar a)(b_i - \bar b)} {\sqrt{\sum_i (a_i - \bar a)^2}\,\sqrt{\sum_i (b_i - \bar b)^2}} \;=\; \frac{\operatorname{cov}(a,b)}{\sigma_a\,\sigma_b}. \]

The implementation defers to np.corrcoef(a, b)[0, 1], which computes exactly this. The denominator explains the “constant signal” guard: if either signal never varies, \(\sigma = 0\) and the ratio is undefined, so we return NaN.

A tiny worked example. Take \(a = (1, 2, 3, 4, 5)\) and \(b = 2a + 7 = (9, 11, 13, 15, 17)\). Because \(b\) is an exactly linear function of \(a\) with positive slope, the mean-centered versions are parallel, and \(r = 1\) to floating precision. This is precisely the statistics_smoke() assertion.

How it fools you. Pearson sees only straight-line association. If \(b = a^2\) on \([-1, 1]\), the relationship is perfectly deterministic yet \(r = 0\), because the upward-left and upward-right halves cancel. And a single wild outlier can drag \(r\) from \(0\) to \(0.9\). When you do not trust linearity or outliers, do not use Pearson — use Spearman (next).

Machine-check: test_pearson_perfect_linear, test_pearson_matches_numpy_corrcoef.

2 · Spearman’s \(\rho\) — monotone association via ranks

The problem. The relationship between conditioning \(\kappa\) and tracking error is surely monotone (worse conditioning, worse error) but almost certainly not a straight line — it probably curves. Pearson would understate it. We want “do they move together in the same direction?” without assuming the shape.

The intuition — the rank trick. Replace every value by its rank (smallest \(\to 1\), next \(\to 2\), …), throwing away the actual magnitudes and keeping only the ordering. Then run Pearson on the ranks. Any strictly increasing relationship — linear, quadratic, exponential, logarithmic — becomes a perfectly straight line in rank-space, so Spearman reports \(\rho = 1\). We have traded sensitivity to magnitude for robustness to shape and to outliers (a wild value is just “the largest rank,” no more).

The mechanism. The helper _rankdata assigns ranks, and — this is the subtle part — gives tied values their average rank (if two samples tie for ranks 3 and 4, both receive \(3.5\)). Without this, ties would inject a false ordering. Then \(\rho = \texttt{pearson\_r}(\operatorname{rank}(a), \operatorname{rank}(b))\).

A tiny worked example. Let \(a = (1, 2, 3, 4)\) and \(b = (1, 8, 27, 64) = a^3\). This is wildly nonlinear, so Pearson would report less than \(1\). But the order is identical — rank of \(a\) is \((1,2,3,4)\) and so is the rank of \(b\) — so \(\rho = 1\) exactly. That monotone-but-curved case is the pinned test below.

The moral. Pearson asks “is it a line?”; Spearman asks “is it monotone?” Reach for Spearman whenever the physical relationship is “more of this means more of that” but you have no reason to believe the curve is straight — which, in this project, is most of the time.

Machine-check: test_spearman_monotone_nonlinear_is_one, test_spearman_matches_scipy_when_available.

3 · `tail_corr` — does the association survive in the worst cases?

The problem. A risk-aware study does not care whether two signals correlate on average — it cares whether they correlate when things are going badly. The average behavior can be benign while the tail is where the mission fails.

The intuition. Restrict attention to the worst slice of \(a\) — the steps where \(a\) is at or above its \(q\)-th percentile — and compute the correlation only there. With the default \(q = 90\), this is the correlation within the top 10% of \(a\). If \(a\) and \(b\) correlate overall but tail_corr collapses to near zero, the association is a calm-weather artifact; if tail_corr stays high, the coupling is real exactly where it matters.

The mechanism. sel = a >= percentile(a, q) selects the tail (note this reuses the one canonical percentile, which lives with the math primitives in data_analyzer — tail_corr imports it lazily). Then it runs Spearman (default) or Pearson on the restricted pair via the CorrMethod selector. The same < 2 points guard applies after restriction — a tail too thin to judge returns NaN.

How it fools you. The tail is, by construction, a small sample, so tail_corr is noisier than the full-sample correlation. Treat a single tail correlation as a hint, not a verdict — corroborate it across runs (Section 7).

Machine-check: covered by the correlation suite (test_corr_nan_safe_and_constant) since tail_corr composes _finite_pair, percentile, and spearman_rho/pearson_r, each pinned individually.

4 · `xcorr_peak_lag` — who leads whom?

The problem. When the base spacecraft wobbles, does the arm tracking error react at the same instant, or does it lag the wobble by some delay? Correlation at zero lag cannot answer this; we need to slide one signal against the other in time.

The intuition. Shift \(b\) backward and forward in time relative to \(a\), and at each shift compute how well they line up. The shift that maximizes the alignment is the lag. If the best alignment happens when we delay \(b\), then \(a\) precedes \(b\) — \(a\) leads. The sign convention here is: a positive lag means \(a\) precedes \(b\) (i.e. \(b\) is the delayed copy).

The mechanism. Both signals are \(z\)-scored (mean-subtracted, divided by standard deviation, with a tiny \(10^{-12}\) floor so a flat signal does not divide by zero). Then np.correlate(..., mode="full") sweeps all shifts; we take the shift of largest absolute correlation (absolute, so a strong anti-correlation at some lag is still detected) and convert it to seconds via the sample step dt.

A tiny worked example. Cross-correlating a signal with itself must peak at lag \(0\) — there is no shift that aligns a signal better than no shift at all. Construct \(b\) as \(a\) delayed by exactly \(k\) samples, and xcorr_peak_lag recovers a lag of \(k\,dt\). Both are pinned below.

The moral. Correlation tells you that two signals are related; cross-correlation lag tells you in what temporal order — the first whisper of a causal story (though, as always, lead is not proof of cause).

Machine-check: test_xcorr_zero_lag_for_identical, test_xcorr_recovers_known_shift.

5 · `linfit` — the ordinary-least-squares line

The problem. You have established (via Spearman) that \(b\) rises with \(a\) and you believe the trend is roughly linear. Now you want the actual numbers: how much does \(b\) change per unit of \(a\), and how well does a line actually describe the data?

The intuition. Fit the straight line \(b \approx m\,a + c\) that minimizes the sum of squared vertical gaps between the line and the data — ordinary least squares. The slope \(m\) is the “exchange rate” (\(\Delta b\) per \(\Delta a\)); the intercept \(c\) is the offset.

The mechanism and the honesty check. np.polyfit(a, b, 1) returns \((m, c)\). But a slope is meaningless if the data is not actually line-shaped, so linfit also returns a goodness number:

\[ \texttt{max\_rel\_resid} \;=\; \frac{\max_i \lvert b_i - (m\,a_i + c)\rvert}{\max_i \lvert b_i\rvert}. \]

This is the worst vertical miss, expressed as a fraction of the signal’s scale. If it is near \(0\), the line is an excellent description; if it is large, the slope is a fiction you should not quote. Reporting the slope with its worst residual is the difference between a measurement and a guess.

A tiny worked example. For \(a = \texttt{linspace}(0,1,50)\) and \(b = 2a + 0.5\), linfit returns slope \(2\), intercept \(0.5\), and max_rel_resid essentially \(0\) — the exact statistics_smoke() assertion.

Machine-check: test_linfit_exact_line, test_linfit_nan_safe_and_residual.

6 · Cliff’s \(\delta\) — a nonparametric effect size

The problem. Two batches of numbers — say \(p_e\) over the steps near a singularity event, versus \(p_e\) over quiet steps. Are the near-event values genuinely larger, and by how much? A difference of means is fragile (outliers, skew, unequal spread). We want a robust “how separated are these two groups?”

The intuition — count the contests. Pit every value of group \(a\) against every value of group \(b\) in a head-to-head. Cliff’s \(\delta\) is simply (fraction of contests \(a\) wins) minus (fraction \(a\) loses):

\[ \delta \;=\; \frac{\#\{(i,j): a_i > b_j\} \;-\; \#\{(i,j): a_i < b_j\}}{n_a\, n_b}. \]

It lives in \([-1, 1]\). \(\delta = +1\) means every value of \(a\) exceeds every value of \(b\) (total dominance); \(\delta = 0\) means the groups are thoroughly interleaved (no effect); \(\delta = -1\) is total dominance the other way. Notice what it never does: it never looks at how much \(a_i\) exceeds \(b_j\), only whether it does. That is what makes it immune to outliers and to the shape of either distribution — it is the order-statistic cousin of “who beats whom” from Section 1’s hierarchy.

A tiny worked example. With \(a = (1, 2, 3)\) and \(b = (4, 5, 6)\), every one of the \(3 \times 3 = 9\) contests is a loss for \(a\), so \(\delta = (0 - 9)/9 = -1\). Identical groups give \(\delta = 0\). Both are pinned.

The moral. When you need to report “is regime A worse than regime B, robustly?”, Cliff’s \(\delta\) is the honest answer — it makes no normality assumption and degrades gracefully. A rough field guide: \(\lvert\delta\rvert \lesssim 0.15\) negligible, \(\approx 0.33\) medium, \(\gtrsim 0.47\) large.

Machine-check: test_cliffs_delta_complete_separation, test_cliffs_delta_identical_is_zero, test_cliffs_delta_nan_safe_and_empty.

7 · Ensemble dispersion — `ci95`, `rel_spread`, `exceedance_curve`

The functions so far digest signals within one run. The next three digest one scalar across many runs — the seeds or arm configurations of a fragility campaign. The input is a small list of per-run numbers (e.g. “\(p_e\) p99 from each of 8 seeds”).

ci95 — how trustworthy is the mean? Given \(n\) per-run values with sample standard deviation \(s\), the \(95\%\) confidence half-width under a normal approximation is

\[ \texttt{ci95} \;=\; 1.96\,\frac{s}{\sqrt{n}}. \]

The reported mean is “\(\bar x \pm \texttt{ci95}\)”. The \(\sqrt n\) is the whole story of why more seeds help: halving the uncertainty costs four times the runs. Returns NaN for \(n < 2\) (a single run has no spread to estimate). Machine-check: test_ci95_matches_normal_formula.

rel_spread — a quick scale-free spread. Defined as \((\max - \min)/\text{mean}\), this is a crude but instantly readable “how much does the outcome swing relative to its size?” A rel_spread of \(0.02\) says the worst and best runs differ by 2% of the mean — tight; a value of \(0.6\) says the result is barely reproducible. It is deliberately simple (no distributional assumption); for anything load-bearing, prefer ci95. Machine-check: test_rel_spread_basic.

exceedance_curve — the whole tail, not one number. Returns the sorted values together with the empirical survival function \(1 - \widehat{F}(x)\), i.e. the fraction of runs at or above each level. This is the picture behind a risk statement: reading off the curve gives “in 10% of runs, \(p_e\) p99 exceeds \(X\).” Where a single percentile is one point, the exceedance curve is the entire upper tail. Machine-check: test_exceedance_curve_is_survival.

8 · The event study — is the spike real, or a coincidence?

The problem. This is the subtlest tool, and the most important for risk work. You observe that conditioning \(\kappa\) seems to spike near freeze/thaw events. But a busy signal spikes all the time — maybe the near-event spikes are nothing special. How do you separate a genuine event effect from the background rate of spiking?

The idea — a matched control. Borrow the logic of a controlled experiment. Build two groups: the signal’s values near events, and the signal’s values at comparable steps far from any event. If the near-event group is reliably larger (Cliff’s \(\delta\) from Section 6), the event effect is real; if the two groups are interleaved, the “spikes” were just the signal being its usual noisy self.

The three pieces.

event_proximity_mask(events, n, prox) — dilate each event index into a window \(\pm\,\texttt{prox}\) and union them, producing the boolean “near an event” mask over all \(n\) steps. This is the treatment set. Machine-check: test_event_proximity_mask_dilates_and_unions.
matched_control_pool(n, exclude_idx, n_samples, rng, gap) — draw a seeded random sample of step indices that lie strictly more than gap away from any event. This is the control set, and the explicit rng is what makes the whole study reproducible: same seed, same controls, same verdict. Machine-check: test_matched_control_pool_avoids_events_and_is_seeded.
event_effect_size(signal, events, exclude_idx, ...) (in analysis/statistics.py) ties them together: it forms the near-event sample and a control sample \(\texttt{control\_mult}\times\) larger, then returns Cliff’s \(\delta\) between them along with the two sample sizes. A large positive \(\delta\) with healthy counts is the evidence that the events genuinely elevate the signal. Machine-check: test_event_effect_size_separates_event_signal.

The moral. A raw “the signal spikes near events” plot is suggestive; the matched-control effect size is evidence. The seeded control pool is what lets a skeptical reader rerun your study and get your exact number — which is the entire point of putting these in a tested, owned module rather than a throwaire script.

(Companion: matched_control_windows builds width-matched control windows* rather than point samples, for when the events are intervals; test_matched_control_windows_widths_and_disjoint.)*

Choosing the right tool — a one-screen summary

You want to know…	Use	Robust to outliers?	Assumes a shape?
Linear co-movement of two signals	`pearson_r`	No	Yes (linear)
Monotone co-movement (any curve)	`spearman_rho`	Yes	No (monotone)
Co-movement in the worst \(q\%\)	`tail_corr`	Yes (Spearman default)	No
Temporal lead/lag between two signals	`xcorr_peak_lag`	Moderately	No
Slope + honesty of a linear trend	`linfit` (read `max_rel_resid`!)	No	Yes (linear)
Whether group A exceeds group B	`cliffs_delta`	Yes	No
Uncertainty of a mean across runs	`ci95`	No	Yes (normal)
Quick reproducibility spread	`rel_spread`	No	No
The full upper tail across runs	`exceedance_curve`	—	No
Whether events genuinely move a signal	`event_effect_size` + controls	Yes (Cliff’s \(\delta\))	No

The throughline: the more robust you need to be, the more you retreat from values to ranks to who-beats-whom — and the price is sensitivity. Pick the leftmost tool your data’s trustworthiness can afford, and report the honesty number (max_rel_resid, ci95, the sample counts) alongside the headline so the result is a measurement, not a hope.

All functions live in analysis/statistics.py and are pinned by validation/test_data_primitives.py. Import them directly from the home — from analysis.statistics import pearson_r, spearman_rho, tail_corr, xcorr_peak_lag, linfit, cliffs_delta, ci95, rel_spread, exceedance_curve, event_effect_size — and feed them logged signals via series(...)/measure(...) from analysis.data_analyzer, so the inputs are the pipeline’s own numbers, not a parallel re-derivation.