Doctoral Research · Space Robotics Inspection with a Free-Flying Space Manipulator
A Doctoral Research Journal Aerospace Engineering

The structural-clone detector (validation/ast_duplication.py)

A stdlib-only static-analysis tool that finds copy-paste-then-rename duplication across the codebase — the kind a textual git diff misses, because renaming a local variable changes zero AST node types. It scores every pair of functions by how much of their subtree shape they share.

Enforced by the originality gate. Terse field notes live in validation/INSIGHTS.md; this is the depth.

[!note] One-line use python validation/ast_duplication.py GNC --threshold 0.70 → ranked markdown report under logs/logs_<date>/.

Algorithm — multiset Jaccard of subtree fingerprints

Each function becomes a multiset of subtree-shape fingerprints; the similarity of two functions is the multiset Jaccard of those fingerprints.

  1. Fingerprint (bottom-up blake2b Merkle hash). For every node in the function body, fingerprint(node) = blake2b("(" + token(node) + "," + child_fingerprints + ")"). Because each node’s hash folds in all its descendants’ hashes, two subtrees collide iff they have identical normalized shape. Nested def/class bodies are opaque leaves (a function is scored on its own structure, not its inner helpers).
  2. Feature multiset. Walk the body and tally every subtree’s fingerprint into a Counter (identical sub-shapes counted with multiplicity — three identical if blocks count as three).
  3. Score. For two multisets a, b: J_multiset = Σ min(a[k], b[k]) / Σ max(a[k], b[k]) over all fingerprints k — in [0, 1], 1.0 iff structurally identical post-normalization. (A set-Jaccard J_set is also reported but not used for ranking.)

Why subtree fingerprints and not the obvious alternatives: a bag of node types is order-blind (a - b looks like b - a); k-grams over the token stream over-penalize a single control-flow swap (one matchif/elif edit shifts every window). Subtree hashes degrade gracefully — an edit kills only the subtrees that contain it, leaving every untouched leaf subtree still matched (the Baxter et al. lineage).

Normalization — the recall/precision dial

What gets erased vs. kept is the whole ballgame:

Erased (→ recall) Kept (→ precision)
local var / arg names → NAME call / method / attribute names (np.zerosnp.asarray)
literal values, bucketed by kind (NUM/STR/BOOL/NONE) literal kind (a number is not a string)
docstrings (stripped before fingerprinting) control-flow & operator structure
the receiver variable of a method call (cells.appendout.appendNAME.append) the method name on the receiver

Two floors keep the noise down: --min-nodes 15 drops trivial one-liners, and a size-band early-out skips any pair whose larger/smaller body-size ratio exceeds 3.0 (such a pair can never clear the threshold — an exact Jaccard upper bound, so the prune is free).

Clone-type matrix

Type What it is Caught?
1 exact copy J = 1.0
2 renamed (the design target) J = 1.0 (locals/receivers normalize; call names stay)
3 lightly edited (e.g. match ↔︎ if/elif) ✅ graceful — J decays with the edit, stays ≥ ~0.70
4 semantic (same result, different structure) blind — different shape, low J

Benign false-positive families to expect: boilerplate getters, __repr__/__post_init__, and value/derivative twins (a function and its time-derivative share most structure). A high score is necessary but not sufficient — a human confirms each pair.

How to read a J score (on this tree)

After the Jun-22 dedup (the formatted_dict clone and the BaseController/BreveController J=1.0 twins removed via [[breve_core_controller]]), the live GNC maximum is 0.827:

J meaning example here
1.0 structurally identical post-normalization (none — deduped)
≥0.90 near copy-paste; the gate’s fail line (none)
0.74–0.83 the deliberate-fork band BaseController.calc_posture ↔︎ BreveController.calc_posture (0.827) — a Template-Method fork that must stay independent
0.50–0.70 boilerplate / value-derivative twins g_vc_desired ↔︎ g_vc_desired_dot (0.792); arm_gain_scale ↔︎ base_gain_scale (0.737)
<0.50 unrelated

API & invocation

from validation.ast_duplication import collect, rank_pairs
funcs = collect(("GNC",), min_nodes=15)          # list[Func]; the only disk step
pairs = rank_pairs(funcs, threshold=0.70, top=40) # list[Pair], sorted by J desc

Limitations (verbatim from the tool’s own report)

Catches Type-1/2 and mild Type-3 (renamed / lightly-edited) clones; BLIND to Type-4 semantic clones (same result, different structure). A high score is necessary-but-not-sufficient — a human confirms each pair. Expect boilerplate getters / __repr__ blocks as benign false-positive families.

Module-level preamble (imports, constants) is not scored — only def bodies. Pairs below --min-nodes never appear.

See also