leastfixedpoint

Diff for Javascript, revisited

This page is a mirrored copy of an article originally posted on the (now sadly defunct) LShift blog; see the archive index here.

Fri, 9 May 2008

Last weekend I finally revisited the diff-in-javascript code I’d written a couple of years back, adding (very simple) patch-like and diff3-like functionality.

On the way, not only did I discover Khanna, Kunal and Pierce’s excellent paper “A Formal Investigation of Diff3”, but I found revctrl.org, the revision-control wiki, which I’m just starting to get my teeth into. I’m looking forward to learning more about merge algorithms.

The code I wrote last weekend is available: just download diff.js. The tools included:

- Diff.diff_comm, which produces a comm(1)-like analysis of two files
- Diff.diff_patch, which produces a terser diff(1)-like analysis
- Diff.patch, which applies a diff_patch-style patch to a base file
- Diff.diff3_merge, which performs a diff3-like three-way merge

Let’s try out a few examples. First, we need to set up a few “files” that we’ll run through the tools. The tools work with files represented as arrays of strings - each string representing one line in the file - so we’ll define three such arrays:

var base = "the quick brown fox jumped over a dog".split(/\s+/);
var derived1 = "the quick fox jumps over some lazy dog".split(/\s+/);
var derived2 =
  "the quick brown fox jumps over some record dog".split(/\s+/);

Examining base shows us that we have the right format:

js> uneval(base);
["the", "quick", "brown", "fox", "jumped", "over", "a", "dog"]
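Incidentally, uneval is specific to the SpiderMonkey js shell used in these transcripts; in other JavaScript environments, JSON.stringify gives a similar quick check:

```javascript
// uneval is non-standard (SpiderMonkey's js shell only); JSON.stringify
// is a portable way to eyeball the same array-of-strings structure.
var base = "the quick brown fox jumped over a dog".split(/\s+/);
console.log(JSON.stringify(base));
// ["the","quick","brown","fox","jumped","over","a","dog"]
```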

First, let’s run the subroutine I originally wrote back in 2006:

js> uneval(Diff.diff_comm(base, derived1));
[{common:["the", "quick"]},
 {file1:["brown"], file2:[]},
 {common:["fox"]},
 {file1:["jumped"], file2:["jumps"]},
 {common:["over"]},
 {file1:["a"], file2:["some", "lazy"]},
 {common:["dog"]}]

The result is an analysis of the chunks in the two files that are the same, and those that are different. The same results can be presented in a terser differential format, similar to Unix diff:

js> uneval(Diff.diff_patch(base, derived1));
[{file1:{offset:2, length:1, chunk:["brown"]},
  file2:{offset:2, length:0, chunk:[]}},

 {file1:{offset:4, length:1, chunk:["jumped"]},
  file2:{offset:3, length:1, chunk:["jumps"]}},

 {file1:{offset:6, length:1, chunk:["a"]},
  file2:{offset:5, length:2, chunk:["some", "lazy"]}}]

Note the similarity of the results to the line-numbers and line-counts that diff(1) outputs by default.
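To make the correspondence concrete, here is a small hypothetical helper (not part of diff.js) that renders one of these hunks as a diff(1)-style change command, assuming the 0-based offsets shown above:

```javascript
// Hypothetical helper: renders a diff_patch hunk in diff(1)'s default
// "3c3"-style header notation. Assumes the 0-based offsets shown above;
// diff(1) itself reports 1-based line numbers.
function hunkHeader(hunk) {
  function range(side) {
    var start = side.offset + 1; // convert to 1-based
    // For an empty side, diff(1) names the line *before* the change point:
    if (side.length === 0) return String(side.offset);
    if (side.length === 1) return String(start);
    return start + "," + (side.offset + side.length);
  }
  var op = hunk.file1.length === 0 ? "a"   // addition
         : hunk.file2.length === 0 ? "d"   // deletion
         : "c";                            // change
  return range(hunk.file1) + op + range(hunk.file2);
}

// Using the first hunk from the example output above:
var hunk = {file1: {offset: 2, length: 1, chunk: ["brown"]},
            file2: {offset: 2, length: 0, chunk: []}};
console.log(hunkHeader(hunk)); // "3d2"
```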

The next example takes a diff-like patch, and uses it to reconstruct a derived file from the base file:

js> uneval(Diff.patch(base, Diff.diff_patch(base, derived1)));
["the", "quick", "fox", "jumps", "over", "some", "lazy", "dog"]
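The reconstruction itself is driven entirely by the hunk offsets. Here is a minimal sketch of how such a patch can be applied (this is an illustration of the idea, not diff.js's own implementation):

```javascript
// Minimal patch application, assuming the hunk format shown above:
// copy unchanged lines between hunks, and substitute each file2 chunk
// for the corresponding file1 chunk.
function applyPatch(file, patch) {
  var result = [];
  var pos = 0;
  patch.forEach(function (hunk) {
    // copy the unchanged lines up to the start of this hunk
    while (pos < hunk.file1.offset) result.push(file[pos++]);
    // substitute the replacement chunk for the original one
    result.push.apply(result, hunk.file2.chunk);
    pos += hunk.file1.length;
  });
  // copy any trailing unchanged lines
  while (pos < file.length) result.push(file[pos++]);
  return result;
}

var base = "the quick brown fox jumped over a dog".split(/\s+/);
var patch = [
  {file1: {offset: 2, length: 1, chunk: ["brown"]},
   file2: {offset: 2, length: 0, chunk: []}},
  {file1: {offset: 4, length: 1, chunk: ["jumped"]},
   file2: {offset: 3, length: 1, chunk: ["jumps"]}},
  {file1: {offset: 6, length: 1, chunk: ["a"]},
   file2: {offset: 5, length: 2, chunk: ["some", "lazy"]}}
];
console.log(applyPatch(base, patch).join(" "));
// "the quick fox jumps over some lazy dog"
```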

Finally, we attempt a diff3 style merge of the changes made in derived1 and derived2, hopefully ending up with a file without conflicts:

js> uneval(Diff.diff3_merge(derived1, base, derived2, true));
[{ok:["the", "quick", "fox", "jumps", "over"]},
 {conflict:{a:["some", "lazy"], aIndex:5,
            o:["a"], oIndex:6,
            b:["some", "record"], bIndex:6}},
 {ok:["dog"]}]

We see that the result isn’t what we’d hoped for: while two regions are unproblematic and merge cleanly, the algorithm couldn’t decide how to merge one part of the inputs, and has left the conflict details in the output for the user to resolve.
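For comparison, merge(1) would present such a conflict inline with marker lines. A hypothetical formatter (again not part of diff.js, and assuming the {ok: …}/{conflict: …} shapes shown above) might look like this:

```javascript
// Hypothetical formatter: renders Diff.diff3_merge output with
// merge(1)-style conflict markers. The "a"/"o"/"b" labels stand in for
// the usual file names.
function formatMerge(mergeResult) {
  var lines = [];
  mergeResult.forEach(function (region) {
    if (region.ok) {
      lines.push.apply(lines, region.ok);           // clean region
    } else {
      lines.push("<<<<<<< a");                       // side a's version
      lines.push.apply(lines, region.conflict.a);
      lines.push("||||||| o");                       // the common base
      lines.push.apply(lines, region.conflict.o);
      lines.push("=======");                         // side b's version
      lines.push.apply(lines, region.conflict.b);
      lines.push(">>>>>>> b");
    }
  });
  return lines.join("\n");
}

// Using the merge result from the example above:
var merged = [
  {ok: ["the", "quick", "fox", "jumps", "over"]},
  {conflict: {a: ["some", "lazy"], aIndex: 5,
              o: ["a"], oIndex: 6,
              b: ["some", "record"], bIndex: 6}},
  {ok: ["dog"]}
];
console.log(formatMerge(merged));
```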

There are algorithms that give different (and usually better) results than diff3 - the paper I mentioned above explains some of the problems with diff3, and I’m looking forward to reading about what alternatives others have come up with at revctrl.org.

Comments

On 14 May, 2008 at 3:04 pm, David Roussel wrote:

Have you seen the diff and patch implemented in http://code.google.com/p/google-diff-match-patch/

It’s in javascript, python and java. And there is a demo page for the javascript version.

It seems more advanced than the standard gnu diff and patch.

It would certainly do a better job than svn merge!

On 14 May, 2008 at 7:22 pm, tonyg wrote:

Yes, I had a quick look just now. Impressive stuff.

On 18 May, 2010 at 3:19 pm, Leo wrote:

This thing is great. Exactly what I was looking for.

But…

It gets too slow for bigger texts.

What is the complexity? Can’t it be dropped to linear by some simple heuristics?

On 20 June, 2010 at 8:08 am, tonyg wrote:

IIRC the complexity is O(n^2), but I may well not be recalling correctly. Do take a look at the paper for details!

What kind of heuristics did you have in mind?