validation/ast_duplication.py)A stdlib-only static-analysis tool that finds
copy-paste-then-rename duplication across the codebase
— the kind a textual git diff misses, because renaming a
local variable changes zero AST node types. It scores every
pair of functions by how much of their subtree shape
they share.
Enforced by the originality gate.
Terse field notes live in validation/INSIGHTS.md; this is
the depth.
[!note] One-line use
python validation/ast_duplication.py GNC --threshold 0.70→ ranked markdown report underlogs/logs_<date>/.
Each function becomes a multiset of subtree-shape fingerprints; the similarity of two functions is the multiset Jaccard of those fingerprints.
fingerprint(node) = blake2b("(" + token(node) + "," + child_fingerprints + ")").
Because each node’s hash folds in all its descendants’ hashes,
two subtrees collide iff they have identical normalized shape. Nested
def/class bodies are opaque leaves (a function
is scored on its own structure, not its inner helpers).Counter (identical sub-shapes
counted with multiplicity — three identical if blocks count
as three).a, b:
J_multiset = Σ min(a[k], b[k]) / Σ max(a[k], b[k]) over all
fingerprints k — in [0, 1], 1.0
iff structurally identical post-normalization. (A set-Jaccard
J_set is also reported but not used for ranking.)Why subtree fingerprints and not the obvious alternatives: a
bag of node types is order-blind (a - b
looks like b - a); k-grams over the token
stream over-penalize a single control-flow swap (one
match→if/elif edit shifts every window).
Subtree hashes degrade gracefully — an edit kills only
the subtrees that contain it, leaving every untouched leaf subtree still
matched (the Baxter et al. lineage).
What gets erased vs. kept is the whole ballgame:
| Erased (→ recall) | Kept (→ precision) |
|---|---|
local var / arg names → NAME |
call / method / attribute names
(np.zeros ≠ np.asarray) |
literal values, bucketed by kind
(NUM/STR/BOOL/NONE) |
literal kind (a number is not a string) |
| docstrings (stripped before fingerprinting) | control-flow & operator structure |
the receiver variable of a method call (cells.append ≡
out.append → NAME.append) |
the method name on the receiver |
Two floors keep the noise down: --min-nodes 15 drops
trivial one-liners, and a size-band early-out skips any
pair whose larger/smaller body-size ratio exceeds 3.0 (such
a pair can never clear the threshold — an exact Jaccard upper bound, so
the prune is free).
| Type | What it is | Caught? |
|---|---|---|
| 1 | exact copy | ✅ J = 1.0 |
| 2 | renamed (the design target) | ✅ J = 1.0 (locals/receivers normalize; call names
stay) |
| 3 | lightly edited (e.g. match ↔︎ if/elif) |
✅ graceful — J decays with the edit, stays ≥
~0.70 |
| 4 | semantic (same result, different structure) | ❌ blind — different shape, low J |
Benign false-positive families to expect:
boilerplate getters, __repr__/__post_init__,
and value/derivative twins (a function and its time-derivative share
most structure). A high score is necessary but not
sufficient — a human confirms each pair.
After the Jun-22 dedup (the formatted_dict clone and the
BaseController/BreveController J=1.0 twins
removed via [[breve_core_controller]]), the live GNC maximum is
0.827:
| J | meaning | example here |
|---|---|---|
1.0 |
structurally identical post-normalization | (none — deduped) |
≥0.90 |
near copy-paste; the gate’s fail line | (none) |
0.74–0.83 |
the deliberate-fork band | BaseController.calc_posture ↔︎
BreveController.calc_posture (0.827) — a Template-Method
fork that must stay independent |
0.50–0.70 |
boilerplate / value-derivative twins | g_vc_desired ↔︎ g_vc_desired_dot (0.792);
arm_gain_scale ↔︎ base_gain_scale (0.737) |
<0.50 |
unrelated | — |
from validation.ast_duplication import collect, rank_pairs
funcs = collect(("GNC",), min_nodes=15) # list[Func]; the only disk step
pairs = rank_pairs(funcs, threshold=0.70, top=40) # list[Pair], sorted by J desc
Func: .qualname (dotted
ClassName.method), .path,
.lineno, .size, .feats
(Counter).Pair: .a, .b (each a
Func), .j_multiset, .j_set,
.shared (count of shared fingerprints).python validation/ast_duplication.py <roots...> [--threshold 0.70] [--min-nodes 15] [--top 40] [--out PATH].rank | J_ms | J_set | shared | A | B | same-file).Catches Type-1/2 and mild Type-3 (renamed / lightly-edited) clones; BLIND to Type-4 semantic clones (same result, different structure). A high score is necessary-but-not-sufficient — a human confirms each pair. Expect boilerplate getters /
__repr__blocks as benign false-positive families.
Module-level preamble (imports, constants) is not scored — only
def bodies. Pairs below --min-nodes never
appear.
validation/tests/test_ast_duplication.py — the 11 pins
(rename-invariance = 1.0, call-names-kept < 1.0, literal bucketing,
docstring-strip, min-nodes floor,
match-vs-if/elif ≥ 0.70, empty-safe, hermetic
e2e, out-of-ROOT).validation/INSIGHTS.md → the terse
ast_duplication field entry.