Ch3nyang's blog

Ever wondered how git diff manages to efficiently compare files and show you exactly what changed? The algorithms behind this seemingly simple operation are quite fascinating.

Consider two files \(A\) and \(B\) containing \(N=7\) and \(M=6\) lines respectively:

Text

 A     B
===   ===
 A     C
 B     B
 C     A
 A     B
 B     A
 B     C
 A

We can represent this comparison problem as navigating through an \(N \times M\) grid:

[Figure: the edit grid]

From any coordinate \((x, y)\), we have three possible moves:

  • Move right to \((x+1, y)\) — represents deleting line \(x+1\) from the first file
  • Move down to \((x, y+1)\) — represents adding line \(y+1\) from the second file
  • Move diagonally to \((x+1, y+1)\) — represents that lines \(x+1\) and \(y+1\) match (no edit needed)

The diagonal move is called a “snake” in diff terminology, and crucially, it doesn’t increase our edit distance.

Our objective is to find a path from \((0, 0)\) to \((N, M)\) that minimizes the total number of horizontal and vertical moves—in other words, we want to minimize the edit distance.

This is essentially the classic shortest-edit-script problem. Note that, unlike the Levenshtein distance, we allow only insertions and deletions (no substitutions), which makes it equivalent to finding the longest common subsequence (LCS) of the two files.
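To make the shortest-path framing concrete, here is a small sketch (my own illustration, not part of any real diff tool) that runs a 0-1 BFS directly on the grid: free diagonal snakes go to the front of the deque, cost-1 edits go to the back:

```python
from collections import deque

def grid_bfs_distance(A: list[str], B: list[str]) -> int:
    """Shortest path from (0, 0) to (N, M) on the edit grid.

    Diagonal moves (matching lines) cost 0; horizontal and vertical
    moves cost 1, so a 0-1 BFS with a deque finds the edit distance.
    """
    N, M = len(A), len(B)
    dist = {(0, 0): 0}
    dq = deque([(0, 0)])
    while dq:
        x, y = dq.popleft()
        d = dist[(x, y)]
        if (x, y) == (N, M):
            return d
        # Free diagonal move (a "snake") when the next lines match.
        if x < N and y < M and A[x] == B[y] and dist.get((x + 1, y + 1), d + 1) > d:
            dist[(x + 1, y + 1)] = d
            dq.appendleft((x + 1, y + 1))
        # Cost-1 moves: delete A[x] (right) or insert B[y] (down).
        for nx, ny in ((x + 1, y), (x, y + 1)):
            if nx <= N and ny <= M and dist.get((nx, ny), d + 2) > d + 1:
                dist[(nx, ny)] = d + 1
                dq.append((nx, ny))
    return dist[(N, M)]

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
print(grid_bfs_distance(A, B))  # → 5
```

This brute-force search is far too slow for real files, but it verifies that our example pair is 5 edits apart.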

Dynamic Programming Approach

The most straightforward solution to this problem uses dynamic programming:

\[dp[i][j] = \begin{cases} \min\left(dp[i-1][j], dp[i][j-1]\right) + 1 & A[i] \neq B[j] \\ dp[i-1][j-1] & A[i] = B[j] \end{cases}\]

With boundary conditions:

\[\begin{aligned} dp[0][0] &= 0 \\ dp[i][0] &= i \\ dp[0][j] &= j \end{aligned}\]

[Figure: filling the DP table]

Here’s the Python implementation:

Python

def diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
    """
    Calculate the minimum edit distance (insertions and deletions only) between two sequences and return the detailed edit path.
    
    Args:
        A: Source sequence
        B: Target sequence
    
    Returns:
        tuple: A tuple containing two elements
            - int: Minimum edit distance (minimum number of operations required)
            - list: Edit path, each element is a 4-tuple (operation, element_value, A_index, B_index)
    
    Edit operations:
        '=': Elements are the same, no modification needed
        '-': Delete operation, exists in A but not in B
        '+': Insert operation, exists in B but not in A
    """
    N, M = len(A), len(B)
    dp = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        dp[i][0] = i
    for j in range(1, M + 1):
        dp[0][j] = j
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                if dp[i - 1][j] < dp[i][j - 1]:
                    dp[i][j] = dp[i - 1][j] + 1
                else:
                    dp[i][j] = dp[i][j - 1] + 1
    i, j = N, M
    path = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i - 1] == B[j - 1]:
            path.append(('=', A[i - 1], i - 1, j - 1))
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] <= dp[i - 1][j]):
            path.append(('+', B[j - 1], None, j - 1))
            j -= 1
        else:
            path.append(('-', A[i - 1], i - 1, None))
            i -= 1
    path.reverse()
    return dp[N][M], path

When multiple optimal paths exist, the backtracking process requires careful consideration of tie-breaking rules. In our implementation, ties are resolved so that within an ambiguous region, deletions appear before additions in the output. This choice affects the visual appearance of the diff but doesn’t change the minimum edit distance.

Note: To make additions appear before deletions instead, simply change dp[i][j - 1] <= dp[i - 1][j] to dp[i][j - 1] < dp[i - 1][j] in the backtracking logic.
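To see that the tie-break changes only the presentation, here is a compact sketch (the `additions_first` flag is my own name for the toggle) that backtracks the same table under both rules:

```python
def diff_ops(A: list[str], B: list[str], additions_first: bool = False) -> tuple[int, list[str]]:
    """DP diff returning (distance, op list); the flag controls tie-breaking."""
    N, M = len(A), len(B)
    dp = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        dp[i][0] = i
    for j in range(M + 1):
        dp[0][j] = j
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1]) + 1
    ops, i, j = [], N, M
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i - 1] == B[j - 1]:
            ops.append('='); i -= 1; j -= 1
        elif j > 0 and (i == 0 or (dp[i][j - 1] < dp[i - 1][j] if additions_first
                                   else dp[i][j - 1] <= dp[i - 1][j])):
            ops.append('+'); j -= 1
        else:
            ops.append('-'); i -= 1
    return dp[N][M], ops[::-1]

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
d1, ops1 = diff_ops(A, B)
d2, ops2 = diff_ops(A, B, additions_first=True)
assert d1 == d2 == 5                 # identical edit distance
assert sorted(ops1) == sorted(ops2)  # same multiset of operations
```

Both calls produce five edits; only the ordering inside ambiguous regions can differ.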

Here’s the execution result:

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = diff(A, B)
print("Minimum edit distance:", distance)
print("diff:")
for op, val, i, j in path:
    if op == '=':
        print(f"  {val}")
    elif op == '-':
        print(f"- {val}")
    elif op == '+':
        print(f"+ {val}")
Output:
Minimum edit distance: 5
diff:
- A
- B
  C
- A
  B
+ A
  B
  A
+ C

This works perfectly! However, our DP algorithm has both time and space complexity of \(O(NM)\). For large files, this can become quite expensive. Can we do better?
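On the space side, yes: if we only need the distance and not the edit path, two DP rows suffice. A sketch:

```python
def edit_distance_two_rows(A: list[str], B: list[str]) -> int:
    """Insert/delete edit distance in O(min(N, M)) space.

    Each DP row depends only on the previous one, so we keep two rows
    instead of the full table. (Recovering the edit *path* needs either
    the full table or a divide-and-conquer trick.)
    """
    if len(A) < len(B):
        A, B = B, A  # make B the shorter sequence
    prev = list(range(len(B) + 1))  # boundary row: dp[0][j] = j
    for i, a in enumerate(A, 1):
        curr = [i] + [0] * len(B)   # boundary column: dp[i][0] = i
        for j, b in enumerate(B, 1):
            if a == b:
                curr[j] = prev[j - 1]
            else:
                curr[j] = min(prev[j], curr[j - 1]) + 1
        prev = curr
    return prev[len(B)]
```

Hirschberg’s divide-and-conquer technique recovers the path in linear space as well, but the time cost stays \(O(NM)\), which motivates a different algorithm entirely.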

The Myers Algorithm

The Myers algorithm offers a more efficient solution, with \(O((N+M)D)\) time and \(O(N)\) space (the original paper also describes an \(O(N\log N + D^2)\) expected-time refinement), where \(D\) is the minimum edit distance. This is particularly advantageous when files are similar (small \(D\)) compared to their size.

The key insight is to work with a different coordinate system. We define the diagonal \(k\) as the difference between our horizontal and vertical positions: \(k = x - y\).

Since we need exactly \(N\) horizontal moves and \(M\) vertical moves to reach \((N, M)\), we know that the final diagonal will be \(k = N - M\). Moreover, the minimum edit distance \(D\) must satisfy:

\[D \geq |N - M|\]

The algorithm explores paths in order of increasing edit distance \(D\), using a \(k-D\) coordinate system:

[Figure: the k-D coordinate system]

Instead of filling an entire \(N \times M\) table, we incrementally explore paths with \(D = 0, 1, 2, \ldots\) edits until we reach the target. For each edit distance \(D\), the possible diagonals range from \(k = -D\) to \(k = D\) in steps of 2: each edit changes \(k\) by \(\pm 1\), so after \(D\) edits, \(k\) has the same parity as \(D\).

We maintain an array \(V\) where \(V[k]\) stores the farthest \(x\)-coordinate reached on diagonal \(k\) using exactly \(D\) edits.

Initially, \(V[1] = 0\). This sentinel makes the very first iteration (\(D = 0\), \(k = 0\)) take the \(k = -D\) branch and read \(x = V[1] = 0\), starting us at the origin.

The strategy is to maximize \(x\) on each diagonal, allowing us to take as many “free” diagonal moves (snakes) as possible. To reach diagonal \(k\) with \(D\) edits, we can come from:

  • Diagonal \(k+1\) by moving down (insertion): \(x = V[k+1]\)
  • Diagonal \(k-1\) by moving right (deletion): \(x = V[k-1] + 1\)

The choice depends on boundary conditions and which option gives us a larger \(x\):

  • When \(k = -D\): We must come from \(k+1\) (can’t go further left), so \(x = V[k+1]\)
  • When \(k = D\): We must come from \(k-1\) (can’t go further right), so \(x = V[k-1] + 1\)
  • When \(-D < k < D\): We choose whichever gives the larger \(x\) value

    \[x = \max(V[k-1] + 1, V[k+1])\]

Combining these, we get:

Python

if k == -D or (k != D and V.get(k - 1, -1) < V.get(k + 1, -1)):
    x = V.get(k + 1, 0)
else:
    x = V.get(k - 1, 0) + 1

With \(x\), we can calculate \(y\):

\[y = x - k\]

Then we can perform snake moves until \(A[x] \neq B[y]\):

Python

while x < N and y < M and A[x] == B[y]:
    x += 1
    y += 1

We continue this process, incrementally increasing \(D\), until we reach our target: when \(V[N - M] \geq N\), we’ve found the shortest path from \((0, 0)\) to \((N, M)\).
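Stripped of path reconstruction, this forward pass is remarkably compact. A distance-only sketch:

```python
def myers_distance(A: list[str], B: list[str]) -> int:
    """Myers forward pass: returns only the minimum edit distance."""
    N, M = len(A), len(B)
    V = {1: 0}  # V[k] = farthest x reached on diagonal k
    for D in range(N + M + 1):
        for k in range(-D, D + 1, 2):
            if k == -D or (k != D and V.get(k - 1, -1) < V.get(k + 1, -1)):
                x = V.get(k + 1, 0)      # from k+1: vertical move (insertion)
            else:
                x = V.get(k - 1, 0) + 1  # from k-1: horizontal move (deletion)
            y = x - k
            while x < N and y < M and A[x] == B[y]:  # follow the snake
                x += 1
                y += 1
            V[k] = x
            if x >= N and y >= M:
                return D
    return N + M
```

Note that a single \(V\) dictionary can be updated in place: within one sweep, every \(k\) shares the same parity, so the \(k \pm 1\) entries still hold the previous depth’s values when they are read.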

The following diagrams illustrate how the Myers algorithm progresses:

[Figures: the Myers algorithm's exploration, step by step]

Here’s the Python implementation:

Python

def myers_diff(A: list[str], B: list[str]) -> tuple[int, list[tuple[str, str, int | None, int | None]]]:
    """
    Calculate edit distance using the Myers algorithm.
    
    Returns the same format as the DP version for easy comparison.
    """
    N, M = len(A), len(B)
    if N == 0:
        return M, [('+', B[j], None, j) for j in range(M)]
    if M == 0:
        return N, [('-', A[i], i, None) for i in range(N)]
    
    max_d = N + M
    V = {1: 0}
    trace = [V.copy()]
    
    for D in range(max_d + 1):
        new_V = {}
        for k in range(-D, D + 1, 2):
            if k == -D or (k != D and V.get(k - 1, -1) < V.get(k + 1, -1)):
                x = V.get(k + 1, 0)
            else:
                x = V.get(k - 1, 0) + 1
            y = x - k
            
            while x < N and y < M and A[x] == B[y]:
                x += 1
                y += 1
            
            new_V[k] = x
            
            if x >= N and y >= M:
                path = []
                x, y = N, M
                for d in range(D, -1, -1):
                    k = x - y
                    V_prev = trace[d]
                    
                    if k == -d or (k != d and V_prev.get(k - 1, -1) < V_prev.get(k + 1, -1)):
                        prev_k = k + 1
                    else:
                        prev_k = k - 1
                    
                    prev_x = V_prev.get(prev_k, 0)
                    prev_y = prev_x - prev_k
                    
                    while x > prev_x and y > prev_y:
                        path.append(('=', A[x - 1], x - 1, y - 1))
                        x -= 1
                        y -= 1
                    
                    if d > 0:
                        if x == prev_x:
                            path.append(('+', B[y - 1], None, y - 1))
                            y -= 1
                        else:
                            path.append(('-', A[x - 1], x - 1, None))
                            x -= 1
                
                path.reverse()
                return D, path
        
        V = new_V
        trace.append(V.copy())
    
    return max_d, []

Here’s the execution result:

A = ["A", "B", "C", "A", "B", "B", "A"]
B = ["C", "B", "A", "B", "A", "C"]
distance, path = myers_diff(A, B)
print("Minimum edit distance:", distance)
print("diff:")
for op, val, i, j in path:
    if op == '=':
        print(f"  {val}")
    elif op == '-':
        print(f"- {val}")
    elif op == '+':
        print(f"+ {val}")
Output:
Minimum edit distance: 5
diff:
- A
- B
  C
- A
  B
+ A
  B
  A
+ C

Performance Analysis

Comparing the Algorithms

While the Myers algorithm boasts better theoretical complexity—\(O((N+M)D)\) time and \(O(N)\) space compared to DP’s \(O(NM)\) for both—the real-world performance depends heavily on how similar the input files are.

When \(D\) approaches \(N + M\) (completely different files), Myers degrades to \(O((N + M)^2)\) time complexity, potentially performing worse than the straightforward DP approach. Let’s benchmark both algorithms under various conditions:
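The benchmark harness isn’t reproduced in full; a sketch along these lines (the names `make_pair` and `bench` are my own) generates input pairs at a given similarity and times an implementation:

```python
import random
import statistics
import time

def make_pair(n: int, m: int, similarity: float, seed: int = 0) -> tuple[list[str], list[str]]:
    """Generate two line sequences where B keeps roughly `similarity` of A's lines."""
    rng = random.Random(seed)
    a = [f"line-{i}" for i in range(n)]
    b = [a[i] if i < n and rng.random() < similarity else f"changed-{i}"
         for i in range(m)]
    return a, b

def bench(fn, a: list[str], b: list[str], repeats: int = 5) -> tuple[float, float]:
    """Run fn(a, b) several times; return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(a, b)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)
```

Feeding both diff implementations through such a harness across sizes and similarity levels yields tables like the one below: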

Benchmark
N=100, M=100, similarity=0.9 | DP: 0.0008s (±0.0001) | Myers: 0.0001s (±0.0000)
N=100, M=100, similarity=0.8 | DP: 0.0008s (±0.0001) | Myers: 0.0002s (±0.0001)
N=100, M=100, similarity=0.7 | DP: 0.0008s (±0.0000) | Myers: 0.0003s (±0.0001)
N=100, M=100, similarity=0.6 | DP: 0.0008s (±0.0000) | Myers: 0.0006s (±0.0001)
N=100, M=100, similarity=0.5 | DP: 0.0008s (±0.0001) | Myers: 0.0008s (±0.0002)
N=100, M=100, similarity=0.4 | DP: 0.0008s (±0.0000) | Myers: 0.0011s (±0.0002)
N=100, M=100, similarity=0.3 | DP: 0.0008s (±0.0000) | Myers: 0.0013s (±0.0002)
N=100, M=100, similarity=0.2 | DP: 0.0008s (±0.0000) | Myers: 0.0015s (±0.0001)
N=100, M=100, similarity=0.1 | DP: 0.0008s (±0.0000) | Myers: 0.0017s (±0.0001)
N=200, M=200, similarity=0.9 | DP: 0.0031s (±0.0001) | Myers: 0.0002s (±0.0001)
N=200, M=200, similarity=0.8 | DP: 0.0031s (±0.0001) | Myers: 0.0006s (±0.0002)
N=200, M=200, similarity=0.7 | DP: 0.0031s (±0.0000) | Myers: 0.0013s (±0.0003)
N=200, M=200, similarity=0.6 | DP: 0.0032s (±0.0001) | Myers: 0.0021s (±0.0004)
N=200, M=200, similarity=0.5 | DP: 0.0032s (±0.0002) | Myers: 0.0032s (±0.0004)
N=200, M=200, similarity=0.4 | DP: 0.0032s (±0.0001) | Myers: 0.0042s (±0.0005)
N=200, M=200, similarity=0.3 | DP: 0.0032s (±0.0003) | Myers: 0.0052s (±0.0004)
N=200, M=200, similarity=0.2 | DP: 0.0032s (±0.0003) | Myers: 0.0060s (±0.0004)
N=200, M=200, similarity=0.1 | DP: 0.0032s (±0.0003) | Myers: 0.0066s (±0.0003)
N=300, M=300, similarity=0.9 | DP: 0.0074s (±0.0002) | Myers: 0.0004s (±0.0001)
N=300, M=300, similarity=0.8 | DP: 0.0075s (±0.0005) | Myers: 0.0013s (±0.0003)
N=300, M=300, similarity=0.7 | DP: 0.0076s (±0.0003) | Myers: 0.0028s (±0.0006)
N=300, M=300, similarity=0.6 | DP: 0.0077s (±0.0003) | Myers: 0.0045s (±0.0006)
N=300, M=300, similarity=0.5 | DP: 0.0078s (±0.0003) | Myers: 0.0072s (±0.0011)
N=300, M=300, similarity=0.4 | DP: 0.0078s (±0.0003) | Myers: 0.0097s (±0.0013)
N=300, M=300, similarity=0.3 | DP: 0.0079s (±0.0004) | Myers: 0.0119s (±0.0010)
N=300, M=300, similarity=0.2 | DP: 0.0080s (±0.0004) | Myers: 0.0141s (±0.0010)
N=300, M=300, similarity=0.1 | DP: 0.0079s (±0.0004) | Myers: 0.0156s (±0.0007)
N=400, M=400, similarity=0.9 | DP: 0.0140s (±0.0007) | Myers: 0.0007s (±0.0002)
N=400, M=400, similarity=0.8 | DP: 0.0141s (±0.0004) | Myers: 0.0024s (±0.0005)
N=400, M=400, similarity=0.7 | DP: 0.0145s (±0.0008) | Myers: 0.0050s (±0.0007)
N=400, M=400, similarity=0.6 | DP: 0.0147s (±0.0006) | Myers: 0.0085s (±0.0012)
N=400, M=400, similarity=0.5 | DP: 0.0152s (±0.0011) | Myers: 0.0127s (±0.0015)
N=400, M=400, similarity=0.4 | DP: 0.0151s (±0.0007) | Myers: 0.0176s (±0.0016)
N=400, M=400, similarity=0.3 | DP: 0.0150s (±0.0008) | Myers: 0.0221s (±0.0018)
N=400, M=400, similarity=0.2 | DP: 0.0149s (±0.0006) | Myers: 0.0258s (±0.0014)
N=400, M=400, similarity=0.1 | DP: 0.0150s (±0.0012) | Myers: 0.0283s (±0.0010)
N=500, M=500, similarity=0.9 | DP: 0.0227s (±0.0009) | Myers: 0.0010s (±0.0002)
N=500, M=500, similarity=0.8 | DP: 0.0231s (±0.0008) | Myers: 0.0036s (±0.0007)
N=500, M=500, similarity=0.7 | DP: 0.0234s (±0.0007) | Myers: 0.0077s (±0.0011)
N=500, M=500, similarity=0.6 | DP: 0.0239s (±0.0007) | Myers: 0.0131s (±0.0016)
N=500, M=500, similarity=0.5 | DP: 0.0243s (±0.0009) | Myers: 0.0206s (±0.0020)
N=500, M=500, similarity=0.4 | DP: 0.0244s (±0.0012) | Myers: 0.0283s (±0.0021)
N=500, M=500, similarity=0.3 | DP: 0.0244s (±0.0010) | Myers: 0.0354s (±0.0024)
N=500, M=500, similarity=0.2 | DP: 0.0243s (±0.0010) | Myers: 0.0417s (±0.0027)
N=500, M=500, similarity=0.1 | DP: 0.0243s (±0.0007) | Myers: 0.0452s (±0.0020)
N=600, M=600, similarity=0.9 | DP: 0.0337s (±0.0015) | Myers: 0.0015s (±0.0003)
N=600, M=600, similarity=0.8 | DP: 0.0342s (±0.0015) | Myers: 0.0053s (±0.0010)
N=600, M=600, similarity=0.7 | DP: 0.0348s (±0.0014) | Myers: 0.0109s (±0.0015)
N=600, M=600, similarity=0.6 | DP: 0.0360s (±0.0017) | Myers: 0.0198s (±0.0020)
N=600, M=600, similarity=0.5 | DP: 0.0360s (±0.0010) | Myers: 0.0302s (±0.0028)
N=600, M=600, similarity=0.4 | DP: 0.0364s (±0.0020) | Myers: 0.0407s (±0.0034)
N=600, M=600, similarity=0.3 | DP: 0.0359s (±0.0012) | Myers: 0.0513s (±0.0031)
N=600, M=600, similarity=0.2 | DP: 0.0363s (±0.0016) | Myers: 0.0614s (±0.0040)
N=600, M=600, similarity=0.1 | DP: 0.0362s (±0.0014) | Myers: 0.0666s (±0.0033)

Benchmark Results

The results clearly show that Myers excels when sequences are highly similar (high similarity ratios), where it can leverage many “snake” moves. However, as files become increasingly different, Myers’ performance degrades and can actually become slower than the straightforward DP approach. This makes intuitive sense: when \(D\) approaches \(N + M\), the \(D^2\) factor dominates, negating Myers’ advantages.

Key takeaway: For typical version control scenarios where files change incrementally, Myers is usually the better choice. For comparing completely unrelated files, DP might be more predictable.

Improving Diff Readability

While both algorithms produce optimal results in terms of edit distance, they don’t always generate the most human-readable diffs. Consider these two equivalent outputs:

More readable:

Diff

for (int i = 0; i < n; i++) {
    process1(i);
}
+for (int i = 0; i < n; i++) {
+    process2(i);
+}

Less readable:

Diff

for (int i = 0; i < n; i++) {
    process1(i);
+}
+for (int i = 0; i < n; i++) {
+    process2(i);
}

Similarly, consider:

More readable (grouped changes):

Diff

if (isSocketReady()) {
-    sendDataPart1();
-    sendDataPart2();
+    sendDataPartA();
+    sendDataPartB();
}

Less readable (interleaved changes):

Diff

if (isSocketReady()) {
-    sendDataPart1();
+    sendDataPartA();
-    sendDataPart2();
+    sendDataPartB();
}

Both achieve the same edit distance, but the first versions are significantly easier to understand at a glance.

Neither algorithm inherently optimizes for readability, but we can apply post-processing to improve the output. A simple approach involves grouping consecutive deletions and additions, then attempting to shift block boundaries to create more intuitive change patterns.

Here’s a basic implementation of such post-processing:

Python

def optimize_diff_readability(path: list[tuple[str, str, int | None, int | None]]) -> list[tuple[str, str, int | None, int | None]]:
    """
    Post-process diff output to improve readability by grouping
    related changes and minimizing interleaved operations.
    """
    # This is a simplified example - real implementations like
    # git's diff use much more sophisticated heuristics
    optimized = []
    i = 0
    
    while i < len(path):
        if path[i][0] in ('-', '+'):
            # Collect a maximal run of changes (deletions and additions,
            # possibly interleaved)...
            deletions, additions = [], []
            while i < len(path) and path[i][0] in ('-', '+'):
                (deletions if path[i][0] == '-' else additions).append(path[i])
                i += 1
            
            # ...then emit them grouped: all deletions first, then all additions
            optimized.extend(deletions)
            optimized.extend(additions)
        else:
            optimized.append(path[i])
            i += 1
    
    return optimized

This is just the tip of the iceberg—production diff tools like Git employ much more sophisticated algorithms for optimizing readability, including:

  • Patience diff: An alternative to Myers that tends to produce more intuitive results for code
  • Histogram diff: A variant of patience diff with better performance characteristics
  • Word-level and character-level diffing: For more granular comparisons within lines
  • Semantic awareness: Understanding code structure to make more meaningful comparisons
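As a small taste of word-level diffing, Python’s standard difflib (which implements its own matching heuristic, not the exact algorithms above) can compare token sequences directly:

```python
import difflib

old = "the quick brown fox jumps over the lazy dog".split()
new = "the quick red fox leaps over the lazy dog".split()

# ndiff yields one marked entry per token:
#   '  ' unchanged, '- ' only in old, '+ ' only in new
for entry in difflib.ndiff(old, new):
    print(entry)
```

Running this highlights only the replaced words (brown/red, jumps/leaps) while everything else stays as context.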

Conclusion

The humble git diff command embodies decades of algorithmic research and optimization. While the dynamic programming approach provides a solid foundation with predictable \(O(NM)\) performance, the Myers algorithm offers superior efficiency for the common case of comparing similar files—precisely the scenario encountered in version control.

The choice between algorithms depends on your specific use case:

  • Use Myers when files are likely to be similar (version control, incremental changes)
  • Use DP when you need predictable performance regardless of file similarity
  • Consider both and choose dynamically based on a quick similarity estimate
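The third option can be sketched with difflib’s cheap similarity bounds (the 0.5 threshold and the function name are illustrative choices, not from any real tool):

```python
import difflib

def pick_diff_algorithm(a: list[str], b: list[str], threshold: float = 0.5) -> str:
    """Route to Myers for similar inputs, DP otherwise.

    real_quick_ratio() and quick_ratio() are fast upper bounds on the
    true similarity; if even the upper bound is below the threshold,
    the inputs differ substantially and DP's predictable O(NM) wins.
    """
    m = difflib.SequenceMatcher(None, a, b)
    if m.real_quick_ratio() >= threshold and m.quick_ratio() >= threshold:
        return "myers"
    return "dp"
```

Both bounds are much cheaper than either diff, so the pre-check adds negligible overhead.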

Understanding these algorithms not only demystifies one of our most commonly used development tools but also provides insight into the broader field of sequence comparison—with applications ranging from bioinformatics (DNA sequence alignment) to plagiarism detection and beyond.

The next time you run git diff, you’ll know there’s some serious algorithmic magic happening under the hood!