ast_duplication — structural clone detector (AST fingerprints + Jaccard)

Purpose

Finds copy-paste-then-rename duplicate functions that a text diff misses. Each function body is
reduced to a multiset of subtree-shape fingerprints (a bottom-up Merkle hash over the AST, with
names/literals normalised so a rename changes nothing), then every function pair is scored by multiset
Jaccard
similarity. High-scoring cross-file pairs are the consolidation worklist behind the
originality gate — e.g. the validate_*_baseline.py harness-dedup target.

Role in the system

  • Standalone tooling, not a pipeline stage — run as a script (python analysis/ast_duplication.py analysis utils GNC validation --cross-module --threshold 0.70), never called by orchestrator in the
    live loop. Console summary by default; --out PATH also writes a markdown report.
  • The detector behind the DRY / originality gate: --cross-module filters to pairs whose two functions
    live in different files — the dedup candidates. It is necessary-but-not-sufficient evidence; a human
    confirms every pair (see Footguns).
  • Lives in analysis/ (corrected from a former validation/ mis-path, commit d6d5cc0). Self-contained:
    stdlib only (ast, hashlib.blake2b, Counter) — no project imports, no NPZ, no plotting.

Inputs / Outputs

  • In: root dirs to scan (default analysis, utils, GNC, validation), --threshold (multiset-J, default
    0.70), --min-nodes (size floor, default 15), --top (always-report N, default 40), --cross-module,
    --out. Skips legacy/, __pycache__/, /tests/ dirs.
  • Out: console [scan]/[pairs]/[top] lines; optional pure-markdown report (ranked table:
    J_ms, J_set, shared count, the two path:line · qualname ends, same-file flag).

Key functions

  • feature_counts — body → multiset of subtree fingerprints (docstring stripped, nested defs opaque) — analysis/ast_duplication.py:121
  • jaccard(multiset_J, set_J) for two fingerprint Counters — analysis/ast_duplication.py:129
  • functions_from_source — parse source → list[Func] with dotted ClassName.method qualnames — analysis/ast_duplication.py:143
  • rank_pairs — score all i<j pairs (size-band early-out), keep ≥threshold + top N — analysis/ast_duplication.py:164
  • collect — read + parse every .py under roots → eligible Func list (the only disk step) — analysis/ast_duplication.py:212
  • render_md — pure-markdown report assembly (header block + ranked table) — analysis/ast_duplication.py:230
  • main — CLI entry: scan → rank → print → optional report — analysis/ast_duplication.py:253
  • _fingerprint — bottom-up blake2b Merkle hash of (token, child-fingerprints)analysis/ast_duplication.py:107
  • _tok — rename-invariant token: NameNAME, literals bucketed by kind, call/attr names kept — analysis/ast_duplication.py:66

Footguns

A high score is necessary-but-not-sufficient — a human confirms every pair

Catches Type-1/2 (exact / renamed) and mild Type-3 (small-edit) clones; BLIND to Type-4
(semantic) clones — same result via different structure scores LOW. Never auto-act on the report; it is a
shortlist, not a verdict. (analysis/ast_duplication.py:19 LIMITATIONS block.)

Benign false-positive families

Boilerplate getters, __repr__/__str__ blocks, and trivial property stubs score high by structure
alone — expect them in the table and discount them. --min-nodes is the trivial-duplicate floor (below
it, one-liners match each other spuriously). (analysis/ast_duplication.py:23.)

Size-band early-out can hide a real partial overlap

rank_pairs skips any pair whose larger/smaller body-size ratio exceeds SIZE_BAND = 3.0
(analysis/ast_duplication.py:33) — such a pair can never clear threshold, but a small helper genuinely
copied into a much larger function won’t surface. Only function bodies are fingerprinted; the
module-level ROOT/sys.path preamble never enters the corpus, so it cannot inflate scores.

rank_pairs stashes counters on the function object

Last-run totals are read back as rank_pairs.last_scored / .last_banded
(analysis/ast_duplication.py:186) — function-attribute state, so it is not reentrant and a parallel
caller would clobber it. Fine for the single-shot CLI; mind it if ever imported into a loop.

Pseudocode

collect(roots, min_nodes):                 # only disk step
  for each .py under roots (skip legacy/__pycache__/tests):
    for each def (dotted qualname, nested defs disjoint):
      feats = feature_counts(body)         # multiset of subtree fingerprints
      keep if feats.size >= min_nodes

rank_pairs(funcs, threshold, top):
  for i < j:
    if min size 0 or max/min > SIZE_BAND -> band out (skip)
    score (j_multiset, j_set) = jaccard(feats_i, feats_j)
  keep = pairs >= threshold, plus always the global top N
  sort by j_multiset desc

main: scan -> rank -> [--cross-module: drop same-file pairs] -> print -> [--out: render_md]

metric_alignment · orchestrator · terminology · tags