ast_duplication — structural clone detector (AST fingerprints + Jaccard)
Purpose
Finds copy-paste-then-rename duplicate functions that a text diff misses. Each function body is
reduced to a multiset of subtree-shape fingerprints (a bottom-up Merkle hash over the AST, with
names/literals normalised so a rename changes nothing), then every function pair is scored by multiset
Jaccard similarity. High-scoring cross-file pairs are the consolidation worklist behind the
originality gate — e.g. thevalidate_*_baseline.pyharness-dedup target.
Role in the system
- Standalone tooling, not a pipeline stage — run as a script (
python analysis/ast_duplication.py analysis utils GNC validation --cross-module --threshold 0.70), never called by orchestrator in the
live loop. Console summary by default;--out PATHalso writes a markdown report. - The detector behind the DRY / originality gate:
--cross-modulefilters to pairs whose two functions
live in different files — the dedup candidates. It is necessary-but-not-sufficient evidence; a human
confirms every pair (see Footguns). - Lives in
analysis/(corrected from a formervalidation/mis-path, commitd6d5cc0). Self-contained:
stdlib only (ast,hashlib.blake2b,Counter) — no project imports, no NPZ, no plotting.
Inputs / Outputs
- In: root dirs to scan (default
analysis, utils, GNC, validation),--threshold(multiset-J, default
0.70),--min-nodes(size floor, default 15),--top(always-report N, default 40),--cross-module,
--out. Skipslegacy/,__pycache__/,/tests/dirs. - Out: console
[scan]/[pairs]/[top]lines; optional pure-markdown report (ranked table:
J_ms,J_set, shared count, the twopath:line · qualnameends, same-file flag).
Key functions
feature_counts— body → multiset of subtree fingerprints (docstring stripped, nested defs opaque) —analysis/ast_duplication.py:121jaccard—(multiset_J, set_J)for two fingerprintCounters —analysis/ast_duplication.py:129functions_from_source— parse source →list[Func]with dottedClassName.methodqualnames —analysis/ast_duplication.py:143rank_pairs— score alli<jpairs (size-band early-out), keep≥threshold+ top N —analysis/ast_duplication.py:164collect— read + parse every.pyunder roots → eligibleFunclist (the only disk step) —analysis/ast_duplication.py:212render_md— pure-markdown report assembly (header block + ranked table) —analysis/ast_duplication.py:230main— CLI entry: scan → rank → print → optional report —analysis/ast_duplication.py:253_fingerprint— bottom-upblake2bMerkle hash of(token, child-fingerprints)—analysis/ast_duplication.py:107_tok— rename-invariant token:Name→NAME, literals bucketed by kind, call/attr names kept —analysis/ast_duplication.py:66
Footguns
A high score is necessary-but-not-sufficient — a human confirms every pair
Catches Type-1/2 (exact / renamed) and mild Type-3 (small-edit) clones; BLIND to Type-4
(semantic) clones — same result via different structure scores LOW. Never auto-act on the report; it is a
shortlist, not a verdict. (analysis/ast_duplication.py:19LIMITATIONS block.)
Benign false-positive families
Boilerplate getters,
__repr__/__str__blocks, and trivial property stubs score high by structure
alone — expect them in the table and discount them.--min-nodesis the trivial-duplicate floor (below
it, one-liners match each other spuriously). (analysis/ast_duplication.py:23.)
Size-band early-out can hide a real partial overlap
rank_pairsskips any pair whose larger/smaller body-size ratio exceedsSIZE_BAND = 3.0
(analysis/ast_duplication.py:33) — such a pair can never clear threshold, but a small helper genuinely
copied into a much larger function won’t surface. Only function bodies are fingerprinted; the
module-levelROOT/sys.pathpreamble never enters the corpus, so it cannot inflate scores.
rank_pairsstashes counters on the function objectLast-run totals are read back as
rank_pairs.last_scored/.last_banded
(analysis/ast_duplication.py:186) — function-attribute state, so it is not reentrant and a parallel
caller would clobber it. Fine for the single-shot CLI; mind it if ever imported into a loop.
Pseudocode
collect(roots, min_nodes): # only disk step
for each .py under roots (skip legacy/__pycache__/tests):
for each def (dotted qualname, nested defs disjoint):
feats = feature_counts(body) # multiset of subtree fingerprints
keep if feats.size >= min_nodes
rank_pairs(funcs, threshold, top):
for i < j:
if min size 0 or max/min > SIZE_BAND -> band out (skip)
score (j_multiset, j_set) = jaccard(feats_i, feats_j)
keep = pairs >= threshold, plus always the global top N
sort by j_multiset desc
main: scan -> rank -> [--cross-module: drop same-file pairs] -> print -> [--out: render_md]