alexey.zh


Finding Structurally Duplicate Go Functions with AST Hashing

You know that feeling when you're reviewing a PR and you see a function that looks suspiciously familiar? Same structure, different variable names, slightly different literals. Someone copy-pasted and tweaked it, and now there are two places to update every time something changes.

godedup catches this automatically. It finds structurally duplicate functions in Go codebases - even when the copies have been superficially modified. This post is about the algorithms that make it work, and a few implementation details I found interesting to think through.


The Problem With Text-Based Approaches

The naive approach is text diffing. Take two functions, run them through a standard diff algorithm, see if they're similar enough. This immediately falls apart:

// Function A
func (r *UserRepo) findByID(ctx context.Context, id int64) (*User, error) {
    row := r.db.QueryRowContext(ctx, "SELECT * FROM users WHERE id = $1", id)
    var u User
    if err := row.Scan(&u.ID, &u.Name, &u.Email); err != nil {
        return nil, fmt.Errorf("findByID: %w", err)
    }
    return &u, nil
}

// Function B
func (r *OrderRepo) findByID(ctx context.Context, id int64) (*Order, error) {
    row := r.db.QueryRowContext(ctx, "SELECT * FROM orders WHERE id = $1", id)
    var o Order
    if err := row.Scan(&o.ID, &o.Status, &o.Amount); err != nil {
        return nil, fmt.Errorf("findByID: %w", err)
    }
    return &o, nil
}

Text diff says these are 60% different. A human says these are the same function. The structure - a query, a scan, an error return, a result return - is identical. Only the types and column names differ.

What you actually want is to compare shape, not text. That means going through the AST.


The Core Insight: Normalize, Then Hash

The key idea is that two functions are structurally equivalent if their AST subtrees are isomorphic - same shape, same node types, same operators, same control flow - after you normalize away the parts that don't matter.

What doesn't matter for structural comparison:

  • Variable and parameter names (userID vs orderID)
  • String literals ("users" vs "orders")
  • Numeric literals (1 vs 42)
  • Package qualifiers (fmt.Println vs log.Println)

What does matter:

  • Node types (IfStmt, ForStmt, ReturnStmt)
  • Operators (+ vs - are different)
  • Control flow structure (a loop containing an if is different from an if containing a loop)
  • nil, true, false - these have semantic meaning

The implementation is a recursive hash function over the AST. For each node type, it combines a stable ID for the node type with the hashes of its children:

func (h *Hasher) hashNode(node ast.Node) uint64 {
    if node == nil {
        return nodeID("absent") // optional children like Init or Else may be nil
    }
    switch n := node.(type) {
    case *ast.IfStmt:
        return combine(nodeID("IfStmt"),
            h.hashNode(n.Init),
            h.hashNode(n.Cond),
            h.hashNode(n.Body),
            h.hashNode(n.Else),
        )

    case *ast.Ident:
        // Normalize: all identifiers hash identically
        // EXCEPT nil/true/false which have semantic meaning
        switch n.Name {
        case "nil":   return nodeID("nil")
        case "true":  return nodeID("true")
        case "false": return nodeID("false")
        }
        return nodeID("Ident") // all other identifiers are equivalent

    case *ast.BasicLit:
        // Normalize: literals hash only by kind, not value
        // "users" and "orders" produce the same hash
        return combine(nodeID("BasicLit"), uint64(n.Kind))

    case *ast.SelectorExpr:
        // Normalize package qualifier: only the selected name matters
        // fmt.Println and log.Println hash identically
        return combine(nodeID("SelectorExpr"), nodeID(n.Sel.Name))
    // ... remaining node kinds follow the same pattern
    }
    return nodeID("unhandled") // fallback so the function compiles for kinds not listed here
}

The nodeID function maps a string name to a stable uint64 using the first 8 bytes of its SHA-256 hash. The combine function mixes multiple values using FNV-style multiplication - fast, good avalanche, and order-dependent (so [A, B] and [B, A] produce different hashes):

func combine(vals ...uint64) uint64 {
    var h uint64 = 0xcbf29ce484222325 // FNV offset basis
    for _, v := range vals {
        h ^= v
        h *= 0x100000001b3 // FNV prime
    }
    return h
}

After hashing, both findByID functions above produce the same uint64. Detecting exact clones is then just grouping by hash - O(n) with a map.


Two Representations Per Function

Here's a design decision that enables both exact and near-clone detection from a single parse pass.

Each function gets two hash representations stored in FuncInfo:

type FuncInfo struct {
    TopHash  uint64   // hash of the entire function body
    StmtSeq  []uint64 // per-statement hashes
    // ...
}

TopHash is the hash of the complete function - used for exact clone detection. If two functions have the same TopHash, they're structurally identical.

StmtSeq is a slice where each element is the hash of one top-level statement. This is what enables near-clone detection.

Computing both is trivial:

stmtSeq := make([]uint64, 0, len(fn.Body.List))
for _, stmt := range fn.Body.List {
    stmtSeq = append(stmtSeq, h.hashNode(stmt))
}
topHash := hashUint64Slice(stmtSeq)

TopHash is derived from StmtSeq - it's the hash of the sequence of statement hashes. So you get both representations for the cost of one AST traversal.
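Given the combine function shown earlier, one plausible hashUint64Slice is just combine applied to the slice. A sketch, not necessarily the tool's exact definition; any order-dependent mix works:

```go
package main

import "fmt"

// combine repeated here so the sketch is self-contained (FNV-style mix)
func combine(vals ...uint64) uint64 {
	var h uint64 = 0xcbf29ce484222325
	for _, v := range vals {
		h ^= v
		h *= 0x100000001b3
	}
	return h
}

// hashUint64Slice: the function-level hash is the ordered mix of the
// per-statement hashes, so reordering statements changes TopHash
func hashUint64Slice(vals []uint64) uint64 {
	return combine(vals...)
}

func main() {
	fmt.Println(hashUint64Slice([]uint64{1, 2}) != hashUint64Slice([]uint64{2, 1})) // order matters
}
```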


Near-Clone Detection: Edit Distance on Hash Sequences

Two functions are near-clones if one has a few extra or different statements compared to the other. The canonical algorithm for "how many insertions/deletions does it take to transform sequence A into sequence B" is Levenshtein edit distance.

The twist: instead of computing edit distance on characters or lines, we compute it on the StmtSeq - the sequence of statement hashes.

func editDistance(a, b []uint64) int {
    la, lb := len(a), len(b)
    dp := make([][]int, la+1)
    for i := range dp {
        dp[i] = make([]int, lb+1)
    }
    for i := 0; i <= la; i++ { dp[i][0] = i }
    for j := 0; j <= lb; j++ { dp[0][j] = j }

    for i := 1; i <= la; i++ {
        for j := 1; j <= lb; j++ {
            if a[i-1] == b[j-1] {
                dp[i][j] = dp[i-1][j-1] // statements are structurally identical
            } else {
                dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
            }
        }
    }
    return dp[la][lb]
}

Similarity is then normalized to [0.0, 1.0]:

func Similarity(a, b *FuncInfo) float64 {
    dist := editDistance(a.StmtSeq, b.StmtSeq)
    maxLen := max(len(a.StmtSeq), len(b.StmtSeq))
    return 1.0 - float64(dist)/float64(maxLen)
}

This means: if two functions share 8 out of 10 statements (by structure), they score 0.80. The default threshold is 0.85, so that pair would not be reported - you need at least 85% structural overlap.
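The 8-of-10 arithmetic can be checked directly. Here editDistance is copied from above, with the similarity computed inline (the builtin min/max need Go 1.21+):

```go
package main

import "fmt"

// editDistance as in the post: Levenshtein over statement hashes
func editDistance(a, b []uint64) int {
	la, lb := len(a), len(b)
	dp := make([][]int, la+1)
	for i := range dp {
		dp[i] = make([]int, lb+1)
	}
	for i := 0; i <= la; i++ { dp[i][0] = i }
	for j := 0; j <= lb; j++ { dp[0][j] = j }

	for i := 1; i <= la; i++ {
		for j := 1; j <= lb; j++ {
			if a[i-1] == b[j-1] {
				dp[i][j] = dp[i-1][j-1] // statements are structurally identical
			} else {
				dp[i][j] = 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
			}
		}
	}
	return dp[la][lb]
}

func main() {
	// 8 shared statement hashes, 2 substituted
	a := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	b := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 99, 100}
	d := editDistance(a, b)
	sim := 1.0 - float64(d)/float64(max(len(a), len(b)))
	fmt.Println(d, sim) // 2 0.8
}
```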

The practical effect: adding a logging statement or an extra validation check doesn't make two otherwise-identical functions invisible to the detector. The near-clone detection catches exactly the "same function, one copy got an extra guard clause" pattern.


Grouping Near-Clones: Union-Find

If function A is 90% similar to B, and B is 88% similar to C, then A, B, and C probably all belong to the same clone group. Union-Find handles this transitivity correctly.
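A minimal union-find over function names is enough to get that transitive grouping. This is a sketch; the tool's union also records the pairwise similarity, which is omitted here:

```go
package main

import "fmt"

// parent maps each name to its representative; absent names are their own root
var parent = map[string]string{}

func find(x string) string {
	if _, ok := parent[x]; !ok {
		parent[x] = x
	}
	if parent[x] != x {
		parent[x] = find(parent[x]) // path compression
	}
	return parent[x]
}

func union(a, b string) {
	parent[find(a)] = find(b)
}

func main() {
	union("A", "B") // A ~ B (90% similar)
	union("B", "C") // B ~ C (88% similar)
	// A and C were never compared directly, yet land in the same group
	fmt.Println(find("A") == find("C")) // true
}
```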

The pairwise comparison is O(n^2) - for every pair of candidate functions (those not already in an exact clone group), compute similarity and union them if above threshold:

for i := 0; i < len(candidates); i++ {
    for j := i + 1; j < len(candidates); j++ {
        a, b := candidates[i], candidates[j]

        // fast pre-filter: skip pairs with very different statement counts
        ratio := float64(a.NumStmts) / float64(b.NumStmts)
        if ratio < 0.7 || ratio > 1.43 {
            continue
        }

        sim := hash.Similarity(&a, &b)
        if sim >= cfg.MinSimilarity {
            union(a.Name, b.Name, sim)
        }
    }
}

The statement count pre-filter is worth noting: if function A has 5 statements and function B has 20, the edit distance is at least 15, so they can't possibly score above 0.25 similarity (at best 5 matching statements out of 20). In general a length ratio of r caps the similarity at r, so the filter skips pairs where the ratio is outside [0.7, 1.43] before doing the O(n·m) dynamic programming - significant in practice.

A subtle bug to be aware of: when you later compute the minimum similarity across a group, the similarity map is keyed by [2]string{a.Name, b.Name} in insertion order. But when you iterate over the group (which comes from a map), order is random. So you need to try both orderings:

na, nb := group[i].Name, group[j].Name
if s, ok := similarity[[2]string{na, nb}]; ok && s < minSim {
    minSim = s
} else if s, ok := similarity[[2]string{nb, na}]; ok && s < minSim {
    minSim = s
}

Miss this and near-clone groups all report 100% similarity regardless of actual score.
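An alternative to the double lookup is to canonicalize the key on both insert and read, so each pair has exactly one representation. A sketch, not the tool's code:

```go
package main

import "fmt"

// pairKey orders the two names so {A,B} and {B,A} map to the same key
func pairKey(a, b string) [2]string {
	if a > b {
		a, b = b, a
	}
	return [2]string{a, b}
}

func main() {
	similarity := map[[2]string]float64{}
	similarity[pairKey("B", "A")] = 0.9  // stored in either order...
	fmt.Println(similarity[pairKey("A", "B")]) // ...found in either order: 0.9
}
```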


The Two-Pass Architecture

Detection runs in two sequential passes:

Pass 1 - Exact clones, O(n):
Group all functions by TopHash. Any group with 2+ members is an exact clone group. Mark all members so they're excluded from Pass 2.

Pass 2 - Near clones, O(n^2):
Only compare functions not already in an exact clone group. This is both a correctness choice (exact clones would trivially satisfy the near-clone threshold and pollute groups) and a performance choice (exact clones are often numerous - validate() copied across 10 packages - and skipping them keeps the O(n^2) set small).

// Pass 1: O(n) exact detection
exactGroups := make(map[uint64][]hash.FuncInfo)
for _, f := range funcs {
    exactGroups[f.TopHash] = append(exactGroups[f.TopHash], f)
}

// mark members of any group with 2+ functions as exact clones
inExactClone := make(map[string]bool)
for _, group := range exactGroups {
    if len(group) < 2 {
        continue
    }
    for _, f := range group {
        inExactClone[f.Name] = true
    }
}

// Pass 2: O(n^2) near detection on remaining functions
var candidates []hash.FuncInfo
for _, f := range funcs {
    if !inExactClone[f.Name] {
        candidates = append(candidates, f)
    }
}

What It Finds in Practice

Running it on a real codebase immediately found things worth fixing:

GROUP  TYPE   SIM   FUNCTION                          LOCATION                              STMTS  LINES
-----------------------------------------------------------------------------------------------------
1      EXACT  100%  auth.(*UserStore).GetUserByID     internal/auth/user_store.go:102       7      24
1      EXACT  100%  auth.(*UserStore).GetUserByEmail  internal/auth/user_store.go:127       7      23
1      EXACT  100%  auth.(*UserStore).GetUserByName   internal/auth/user_store.go:151       7      22
------------------------------------------------------------------------------------------------------
2      EXACT  100%  http.disableClientCache           internal/http/router.go:45            3      5
2      EXACT  100%  branding.disableClientCache       internal/wiki/branding/routes.go:249  3      5

The first group is three database query functions - same structure, different WHERE clauses. They should be one generic function or at least share a helper. The second is a middleware function that got copy-pasted into two packages instead of being placed in a shared location.

Both are actionable. Neither would have been caught by a linter.


What's Next

A few directions worth exploring:

Baseline support - the biggest adoption blocker for existing codebases is that there are already dozens of clones accumulated over years. A --save-baseline / --diff-baseline flag would let teams adopt the tool without failing CI on pre-existing debt.

SARIF output - SARIF is the standard format for GitHub Code Scanning. One output flag and findings appear as inline PR annotations with file links. Roughly 50 lines of output code.

LSH for scale - the O(n^2) near-clone pass starts showing latency on codebases with 5000+ functions. Locality-Sensitive Hashing on the StmtSeq arrays would reduce it to near-O(n) by only comparing functions that land in the same hash bucket.


The tool is at github.com/hashmap-kz/godedup - go install github.com/hashmap-kz/godedup@latest and point it at any Go project. It runs in seconds and exits 0, so there's no friction in trying it.

If you hit false positives or miss cases you expected to catch, issues are open. The normalization rules are the most interesting part to tune.
