"""
Module difflib -- helpers for computing deltas between objects.

Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
    Use SequenceMatcher to return list of the best "good enough" matches.

Function context_diff(a, b):
    For two lists of strings, return a delta in context diff format.

Function ndiff(a, b):
    Return a delta: the difference between `a` and `b` (lists of strings).

Function restore(delta, which):
    Return one of the two sequences that generated an ndiff delta.

Function unified_diff(a, b):
    For two lists of strings, return a delta in unified diff format.

Class SequenceMatcher:
    A flexible class for comparing pairs of sequences of any type.

Class Differ:
    For producing human-readable deltas from sequences of lines of text.
"""

__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
           'Differ', 'IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff',
           'unified_diff']
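
# An illustrative sketch (not part of the original module source): the
# high-level helpers wrap SequenceMatcher, e.g.
#
#     lines1 = 'one\ntwo\n'.splitlines(1)
#     lines2 = 'one\nthree\n'.splitlines(1)
#     for line in unified_diff(lines1, lines2, 'a.txt', 'b.txt'):
#         print line,
#
# `lines1`/`lines2` are hypothetical newline-terminated line lists, as
# returned by a file object's readlines().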

class SequenceMatcher:

    """
    SequenceMatcher is a flexible class for comparing pairs of sequences of
    any type, so long as the sequence elements are hashable.  The basic
    algorithm predates, and is a little fancier than, an algorithm
    published in the late 1980's by Ratcliff and Obershelp under the
    hyperbolic name "gestalt pattern matching".  The basic idea is to find
    the longest contiguous matching subsequence that contains no "junk"
    elements (R-O doesn't address junk).  The same idea is then applied
    recursively to the pieces of the sequences to the left and to the right
    of the matching subsequence.  This does not yield minimal edit
    sequences, but does tend to yield matches that "look right" to people.

    SequenceMatcher tries to compute a "human-friendly diff" between two
    sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
    longest *contiguous* & junk-free matching subsequence.  That's what
    catches peoples' eyes.  The Windows(tm) windiff has another interesting
    notion, pairing up elements that appear uniquely in each sequence.
    That, and the method here, appear to yield more intuitive difference
    reports than does diff.  This method appears to be the least vulnerable
    to synching up on blocks of "junk lines", though (like blank lines in
    ordinary text files, or maybe "<P>" lines in HTML files).  That may be
    because this is the only method of the 3 that has a *concept* of
    "junk" <wink>.

    Example, comparing two strings, and considering blanks to be "junk":

    >>> s = SequenceMatcher(lambda x: x == " ",
    ...                     "private Thread currentThread;",
    ...                     "private volatile Thread currentThread;")
    >>>

    .ratio() returns a float in [0, 1], measuring the "similarity" of the
    sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
    sequences are close matches:

    >>> print round(s.ratio(), 3)
    0.866
    >>>

    If you're only interested in where the sequences match,
    .get_matching_blocks() is handy:

    >>> for block in s.get_matching_blocks():
    ...     print "a[%d] and b[%d] match for %d elements" % block
    a[0] and b[0] match for 8 elements
    a[8] and b[17] match for 6 elements
    a[14] and b[23] match for 15 elements
    a[29] and b[38] match for 0 elements

    Note that the last tuple returned by .get_matching_blocks() is always a
    dummy, (len(a), len(b), 0), and this is the only case in which the last
    tuple element (number of elements matched) is 0.
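
    For instance (an illustrative sketch, not in the original docs), the
    dummy sentinel makes it easy to walk the unmatched gaps as well:

    >>> i = j = 0
    >>> for ai, bj, size in s.get_matching_blocks():
    ...     # a[i:ai] and b[j:bj] are the unmatched gaps (possibly empty)
    ...     i, j = ai + size, bj + size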

    If you want to know how to change the first sequence into the second,
    use .get_opcodes():

    >>> for opcode in s.get_opcodes():
    ...     print "%6s a[%d:%d] b[%d:%d]" % opcode
     equal a[0:8] b[0:8]
    insert a[8:8] b[8:17]
     equal a[8:14] b[17:23]
     equal a[14:29] b[23:38]

    See the Differ class for a fancy human-friendly file differencer, which
    uses SequenceMatcher both to compare sequences of lines, and to compare
    sequences of characters within similar (near-matching) lines.

    See also function get_close_matches() in this module, which shows how
    simple code building on SequenceMatcher can be used to do useful work.
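
    A tiny illustration of that idea (an illustrative sketch, not part of
    the original docs): score candidates against a target and keep the best:

    >>> scored = [(SequenceMatcher(None, "appel", x).ratio(), x)
    ...           for x in ["ape", "apple", "peach"]]
    >>> max(scored)[1]
    'apple'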

    Timing:  Basic R-O is cubic time worst case and quadratic time expected
    case.  SequenceMatcher is quadratic time for the worst case and has
    expected-case behavior dependent in a complicated way on how many
    elements the sequences have in common; best case time is linear.

    Methods:

    __init__(isjunk=None, a='', b='')
        Construct a SequenceMatcher.

    set_seqs(a, b)
        Set the two sequences to be compared.

    set_seq1(a)
        Set the first sequence to be compared.

    set_seq2(b)
        Set the second sequence to be compared.

    find_longest_match(alo, ahi, blo, bhi)
        Find longest matching block in a[alo:ahi] and b[blo:bhi].

    get_matching_blocks()
        Return list of triples describing matching subsequences.

    get_opcodes()
        Return list of 5-tuples describing how to turn a into b.

    ratio()
        Return a measure of the sequences' similarity (float in [0,1]).

    quick_ratio()
        Return an upper bound on .ratio() relatively quickly.

    real_quick_ratio()
        Return an upper bound on ratio() very quickly.
    """

    def __init__(self, isjunk=None, a='', b=''):
143 """Construct a SequenceMatcher.
145 Optional arg isjunk is None (the default), or a one-argument
146 function that takes a sequence element and returns true iff the
147 element is junk. None is equivalent to passing "lambda x: 0", i.e.
148 no elements are considered to be junk. For example, pass
149 lambda x: x in " \\t"
150 if you're comparing lines as sequences of characters, and don't
151 want to synch up on blanks or hard tabs.
153 Optional arg a is the first of two sequences to be compared. By
154 default, an empty string. The elements of a must be hashable. See
155 also .set_seqs() and .set_seq1().
157 Optional arg b is the second of two sequences to be compared. By
158 default, an empty string. The elements of b must be hashable. See
159 also .set_seqs() and .set_seq2().
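
        For example (an illustrative addition, not in the original docs),
        the default isjunk=None treats no elements as junk:

        >>> SequenceMatcher(None, "abcd", "abcd").ratio()
        1.0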
        """

        # Members:
        # a
        #      first sequence
        # b
        #      second sequence; differences are computed as "what do
        #      we need to do to 'a' to change it into 'b'?"
        # b2j
        #      for x in b, b2j[x] is a list of the indices (into b)
        #      at which x appears; junk elements do not appear
        # fullbcount
        #      for x in b, fullbcount[x] == the number of times x
        #      appears in b; only materialized if really needed (used
        #      only for computing quick_ratio())
        # matching_blocks
        #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
        #      ascending & non-overlapping in i and in j; terminated by
        #      a dummy (len(a), len(b), 0) sentinel
        # opcodes
        #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
        #      one of
        #          'replace'   a[i1:i2] should be replaced by b[j1:j2]
        #          'delete'    a[i1:i2] should be deleted
        #          'insert'    b[j1:j2] should be inserted
        #          'equal'     a[i1:i2] == b[j1:j2]
        # isjunk
        #      a user-supplied function taking a sequence element and
        #      returning true iff the element is "junk" -- this has
        #      subtle but helpful effects on the algorithm, which I'll
        #      get around to writing up someday <0.9 wink>.
        #      DON'T USE!  Only __chain_b uses this.  Use isbjunk.
        # isbjunk
        #      for x in b, isbjunk(x) == isjunk(x) but much faster;
        #      it's really the has_key method of a hidden dict.
        #      DOES NOT WORK for x in a!
        # isbpopular
        #      for x in b, isbpopular(x) is true iff b is reasonably long
        #      (at least 200 elements) and x accounts for more than 1% of
        #      its elements.  DOES NOT WORK for x in a!

        self.isjunk = isjunk
        self.a = self.b = None
        self.set_seqs(a, b)

    def set_seqs(self, a, b):
        """Set the two sequences to be compared.

        >>> s = SequenceMatcher()
        >>> s.set_seqs("abcd", "bcde")
        >>> s.ratio()
        0.75
        """

        self.set_seq1(a)
        self.set_seq2(b)

    def set_seq1(self, a):
        """Set the first sequence to be compared.

        The second sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq1("bcde")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.

        See also set_seqs() and set_seq2().
        """

        if a is self.a:
            return
        self.a = a
        self.matching_blocks = self.opcodes = None

    def set_seq2(self, b):
        """Set the second sequence to be compared.

        The first sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq2("abcd")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.

        See also set_seqs() and set_seq1().
        """

        if b is self.b:
            return
        self.b = b
        self.matching_blocks = self.opcodes = None
        self.fullbcount = None
        self.__chain_b()

    # For each element x in b, set b2j[x] to a list of the indices in
    # b where x appears; the indices are in increasing order; note that
    # the number of times x appears in b is len(b2j[x]) ...
    # when self.isjunk is defined, junk elements don't show up in this
    # map at all, which stops the central find_longest_match method
    # from starting any matching block at a junk element ...
    # also creates the fast isbjunk function ...
    # b2j also does not contain entries for "popular" elements, meaning
    # elements that account for more than 1% of the total elements, and
    # when the sequence is reasonably large (>= 200 elements); this can
    # be viewed as an adaptive notion of semi-junk, and yields an enormous
    # speedup when, e.g., comparing program files with hundreds of
    # instances of "return NULL;" ...
    # note that this is only called when b changes; so for cross-product
    # kinds of matches, it's best to call set_seq2 once, then set_seq1
    # repeatedly
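
    # An illustrative sketch of that cross-product pattern (not in the
    # original comments); `target` and `pool` are hypothetical:
    #
    #     s = SequenceMatcher()
    #     s.set_seq2(target)          # expensive preprocessing, done once
    #     for candidate in pool:
    #         s.set_seq1(candidate)   # cheap
    #         if s.ratio() > 0.6:
    #             print candidate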

    def __chain_b(self):
        # Because isjunk is a user-defined (not C) function, and we test
        # for junk a LOT, it's important to minimize the number of calls.
        # Before the tricks described here, __chain_b was by far the most
        # time-consuming routine in the whole module!  If anyone sees
        # Jim Roskind, thank him again for profile.py -- I never would
        # have guessed that.
        # The first trick is to build b2j ignoring the possibility
        # of junk.  I.e., we don't call isjunk at all yet.  Throwing
        # out the junk later is much cheaper than building b2j "right"
        # from the start.
        b = self.b
        n = len(b)
        self.b2j = b2j = {}
        populardict = {}
        for i, elt in enumerate(b):
            if b2j.has_key(elt):
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

        # Purge leftover indices for popular elements.
        for elt in populardict:
            del b2j[elt]

        # Now b2j.keys() contains elements uniquely, and especially when
        # the sequence is a string, that's usually a good deal smaller
        # than len(string).  The difference is the number of isjunk calls
        # saved.
        isjunk, junkdict = self.isjunk, {}
        if isjunk:
            for d in populardict, b2j:
                for elt in d.keys():
                    if isjunk(elt):
                        junkdict[elt] = 1   # value irrelevant -- it's a set
                        del d[elt]

        # Now for x in b, isjunk(x) == x in junkdict, but the
        # latter is much faster.  Note too that while there may be a
        # lot of junk in the sequence, the number of *unique* junk
        # elements is probably small.  So the memory burden of keeping
        # this dict alive is likely trivial compared to the size of b2j.
        self.isbjunk = junkdict.has_key
        self.isbpopular = populardict.has_key

    def find_longest_match(self, alo, ahi, blo, bhi):
340 """Find longest matching block in a[alo:ahi] and b[blo:bhi].
342 If isjunk is not defined:
344 Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
345 alo <= i <= i+k <= ahi
346 blo <= j <= j+k <= bhi
347 and for all (i',j',k') meeting those conditions,
350 and if i == i', j <= j'
352 In other words, of all maximal matching blocks, return one that
353 starts earliest in a, and of all those maximal matching blocks that
354 start earliest in a, return the one that starts earliest in b.
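
        For instance (an illustrative addition, not in the original docs),
        with two equally long matches the leftmost in a, then in b, wins:

        >>> SequenceMatcher(None, "ab", "ab ab").find_longest_match(0, 2, 0, 5)
        (0, 0, 2)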
356 >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
357 >>> s.find_longest_match(0, 5, 0, 9)

        If isjunk is defined, first the longest matching block is
        determined as above, but with the additional restriction that no
        junk element appears in the block.  Then that block is extended as
        far as possible by matching (only) junk elements on both sides.  So
        the resulting block never matches on junk except as identical junk
        happens to be adjacent to an "interesting" match.

        Here's the same example as before, but considering blanks to be
        junk.  That prevents " abcd" from matching the " abcd" at the tail
        end of the second sequence directly.  Instead only the "abcd" can
        match, and matches the leftmost "abcd" in the second sequence:

        >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
        >>> s.find_longest_match(0, 5, 0, 9)
        (1, 0, 4)

        If no blocks match, return (alo, blo, 0).

        >>> s = SequenceMatcher(None, "ab", "c")
        >>> s.find_longest_match(0, 2, 0, 1)
        (0, 0, 0)
        """

        # CAUTION:  stripping common prefix or suffix would be incorrect.
        # E.g.,
        #    ab
        #    acab
        # Longest matching block is "ab", but if common prefix is
        # stripped, it's "a" (tied with "b").  UNIX(tm) diff does so
        # strip, so ends up claiming that ab is changed to acab by
        # inserting "ca" in the middle.  That's minimal but unintuitive:
        # "it's obvious" that someone inserted "ac" at the front.
        # Windiff ends up at the same place as diff, but by pairing up
        # the unique 'b's and then matching the first two 'a's.

        a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
        besti, bestj, bestsize = alo, blo, 0
        # find longest junk-free match
        # during an iteration of the loop, j2len[j] = length of longest
        # junk-free match ending with a[i-1] and b[j]
        j2len = {}
        nothing = []
        for i in xrange(alo, ahi):
            # look at all instances of a[i] in b; note that because
            # b2j has no junk keys, the loop is skipped if a[i] is junk
            j2lenget = j2len.get
            newj2len = {}
            for j in b2j.get(a[i], nothing):
                # a[i] matches b[j]
                if j < blo:
                    continue
                if j >= bhi:
                    break
                k = newj2len[j] = j2lenget(j-1, 0) + 1
                if k > bestsize:
                    besti, bestj, bestsize = i-k+1, j-k+1, k
            j2len = newj2len

        # Extend the best by non-junk elements on each end.  In particular,
        # "popular" non-junk elements aren't in b2j, which greatly speeds
        # the inner loop above, but also means "the best" match so far
        # doesn't contain any junk *or* popular non-junk elements.
        while besti > alo and bestj > blo and \
              not isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              not isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize += 1

        # Now that we have a wholly interesting match (albeit possibly
        # empty!), we may as well suck up the matching junk on each
        # side of it too.  Can't think of a good reason not to, and it
        # saves post-processing the (possibly considerable) expense of
        # figuring out what to do with it.  In the case of an empty
        # interesting match, this is clearly the right thing to do,
        # because no other kind of match is possible in the regions.
        while besti > alo and bestj > blo and \
              isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize = bestsize + 1

        return besti, bestj, bestsize

    def get_matching_blocks(self):
        """Return list of triples describing matching subsequences.

        Each triple is of the form (i, j, n), and means that
        a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
        i and in j.

        The last triple is a dummy, (len(a), len(b), 0), and is the only
        triple with n==0.

        >>> s = SequenceMatcher(None, "abxcd", "abcd")
        >>> s.get_matching_blocks()
        [(0, 0, 2), (3, 2, 2), (5, 4, 0)]
        """

        if self.matching_blocks is not None:
            return self.matching_blocks
        self.matching_blocks = []
        la, lb = len(self.a), len(self.b)
        self.__helper(0, la, 0, lb, self.matching_blocks)
        self.matching_blocks.append( (la, lb, 0) )
        return self.matching_blocks

    # builds list of matching blocks covering a[alo:ahi] and
    # b[blo:bhi], appending them in increasing order to answer

    def __helper(self, alo, ahi, blo, bhi, answer):
        i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
        # a[alo:i] vs b[blo:j] unknown
        # a[i:i+k] same as b[j:j+k]
        # a[i+k:ahi] vs b[j+k:bhi] unknown
        if k:
            if alo < i and blo < j:
                self.__helper(alo, i, blo, j, answer)
            answer.append(x)
            if i+k < ahi and j+k < bhi:
                self.__helper(i+k, ahi, j+k, bhi, answer)

    def get_opcodes(self):
        """Return list of 5-tuples describing how to turn a into b.

        Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
        has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
        tuple preceding it, and likewise for j1 == the previous j2.

        The tags are strings, with these meanings:

        'replace':  a[i1:i2] should be replaced by b[j1:j2]
        'delete':   a[i1:i2] should be deleted.
                    Note that j1==j2 in this case.
        'insert':   b[j1:j2] should be inserted at a[i1:i1].
                    Note that i1==i2 in this case.
        'equal':    a[i1:i2] == b[j1:j2]
505 >>> s = SequenceMatcher(None, a, b)
506 >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
507 ... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
508 ... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
509 delete a[0:1] (q) b[0:0] ()
510 equal a[1:3] (ab) b[0:2] (ab)
511 replace a[3:4] (x) b[2:3] (y)
512 equal a[4:6] (cd) b[3:5] (cd)
513 insert a[6:6] () b[5:6] (f)
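
        The opcodes fully determine b given a; for instance (an
        illustrative addition, not part of the original docs), the b-side
        slices concatenate to b:

        >>> ''.join([b[j1:j2] for tag, i1, i2, j1, j2 in s.get_opcodes()]) == b
        True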
        """

        if self.opcodes is not None:
            return self.opcodes
        i = j = 0
        self.opcodes = answer = []
        for ai, bj, size in self.get_matching_blocks():
            # invariant:  we've pumped out correct diffs to change
            # a[:i] into b[:j], and the next matching block is
            # a[ai:ai+size] == b[bj:bj+size].  So we need to pump
            # out a diff to change a[i:ai] into b[j:bj], pump out
            # the matching block, and move (i,j) beyond the match
            tag = ''
            if i < ai and j < bj:
                tag = 'replace'
            elif i < ai:
                tag = 'delete'
            elif j < bj:
                tag = 'insert'
            if tag:
                answer.append( (tag, i, ai, j, bj) )
            i, j = ai+size, bj+size
            # the list of matching blocks is terminated by a
            # sentinel with size 0
            if size:
                answer.append( ('equal', ai, i, bj, j) )
        return answer

    def get_grouped_opcodes(self, n=3):
        """ Isolate change clusters by eliminating ranges with no changes.

        Return a generator of groups with up to n lines of context.
        Each group is in the same format as returned by get_opcodes().

        >>> from pprint import pprint
        >>> a = map(str, range(1,40))
        >>> b = a[:]
        >>> b[8:8] = ['i']     # Make an insertion
        >>> b[20] += 'x'       # Make a replacement
        >>> b[23:28] = []      # Make a deletion
        >>> b[30] += 'y'       # Make another replacement
        >>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
        [[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
         [('equal', 16, 19, 17, 20),
          ('replace', 19, 20, 20, 21),
          ('equal', 20, 22, 21, 23),
          ('delete', 22, 27, 23, 23),
          ('equal', 27, 30, 23, 26)],
         [('equal', 31, 34, 27, 30),
          ('replace', 34, 35, 30, 31),
          ('equal', 35, 38, 31, 34)]]
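
        If the sequences are identical there is nothing to group (an
        illustrative addition, not in the original docs):

        >>> list(SequenceMatcher(None, a, a).get_grouped_opcodes())
        []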
        """

        codes = self.get_opcodes()
        # Fixup leading and trailing groups if they show no changes.
        if codes[0][0] == 'equal':
            tag, i1, i2, j1, j2 = codes[0]
            codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2
        if codes[-1][0] == 'equal':
            tag, i1, i2, j1, j2 = codes[-1]
            codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n)

        nn = n + n
        group = []
        for tag, i1, i2, j1, j2 in codes:
            # End the current group and start a new one whenever
            # there is a large range with no changes.
            if tag == 'equal' and i2-i1 > nn:
                group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n)))
                yield group
                group = []
                i1, j1 = max(i1, i2-n), max(j1, j2-n)
            group.append((tag, i1, i2, j1, j2))
        if group and not (len(group)==1 and group[0][0] == 'equal'):
            yield group
591 """Return a measure of the sequences' similarity (float in [0,1]).
593 Where T is the total number of elements in both sequences, and
594 M is the number of matches, this is 2,0*M / T.
595 Note that this is 1 if the sequences are identical, and 0 if
596 they have nothing in common.
598 .ratio() is expensive to compute if you haven't already computed
599 .get_matching_blocks() or .get_opcodes(), in which case you may
600 want to try .quick_ratio() or .real_quick_ratio() first to get an
603 >>> s = SequenceMatcher(None, "abcd", "bcde")
608 >>> s.real_quick_ratio()
612 matches
= reduce(lambda sum, triple
: sum + triple
[-1],
613 self
.get_matching_blocks(), 0)
614 return 2.0 * matches
/ (len(self
.a
) + len(self
.b
))

    def quick_ratio(self):
        """Return an upper bound on ratio() relatively quickly.

        This isn't defined beyond that it is an upper bound on .ratio(), and
        is faster to compute.
        """

        # viewing a and b as multisets, set matches to the cardinality
        # of their intersection; this counts the number of matches
        # without regard to order, so is clearly an upper bound
        if self.fullbcount is None:
            self.fullbcount = fullbcount = {}
            for elt in self.b:
                fullbcount[elt] = fullbcount.get(elt, 0) + 1
        fullbcount = self.fullbcount
        # avail[x] is the number of times x appears in 'b' less the
        # number of times we've seen it in 'a' so far ... kinda
        avail = {}
        availhas, matches = avail.has_key, 0
        for elt in self.a:
            if availhas(elt):
                numb = avail[elt]
            else:
                numb = fullbcount.get(elt, 0)
            avail[elt] = numb - 1
            if numb > 0:
                matches = matches + 1
        return 2.0 * matches / (len(self.a) + len(self.b))

    def real_quick_ratio(self):
        """Return an upper bound on ratio() very quickly.

        This isn't defined beyond that it is an upper bound on .ratio(), and
        is faster to compute than either .ratio() or .quick_ratio().
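
        For instance (an illustrative addition, not in the original docs),
        the bounds nest:

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.real_quick_ratio() >= s.quick_ratio() >= s.ratio()
        True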
        """

        la, lb = len(self.a), len(self.b)
        # can't have more matches than the number of elements in the
        # shorter sequence
        return 2.0 * min(la, lb) / (la + lb)

def get_close_matches(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return list of the best "good enough" matches.

    word is a sequence for which close matches are desired (typically a
    string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

    Optional arg n (default 3) is the maximum number of close matches to
    return.  n must be > 0.

    Optional arg cutoff (default 0.6) is a float in [0, 1].  Possibilities
    that don't score at least that similar to word are ignored.

    The best (no more than n) matches among the possibilities are returned
    in a list, sorted by similarity score, most similar first.

    >>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
    ['apple', 'ape']
    >>> import keyword as _keyword
    >>> get_close_matches("wheel", _keyword.kwlist)
    ['while']
    >>> get_close_matches("apple", _keyword.kwlist)
    []
    >>> get_close_matches("accept", _keyword.kwlist)
    ['except']
    """
687 raise ValueError("n must be > 0: " + `n`
)
688 if not 0.0 <= cutoff
<= 1.0:
689 raise ValueError("cutoff must be in [0.0, 1.0]: " + `cutoff`
)
691 s
= SequenceMatcher()
693 for x
in possibilities
:
695 if s
.real_quick_ratio() >= cutoff
and \
696 s
.quick_ratio() >= cutoff
and \
698 result
.append((s
.ratio(), x
))
701 # Retain only the best n.
703 # Move best-scorer to head of list.
706 return [x
for score
, x
in result
]

def _count_leading(line, ch):
    """
    Return number of `ch` characters at the start of `line`.

    Example:

    >>> _count_leading('   abc', ' ')
    3
    """

    i, n = 0, len(line)
    while i < n and line[i] == ch:
        i += 1
    return i

class Differ:
    r"""
    Differ is a class for comparing sequences of lines of text, and
    producing human-readable differences or deltas.  Differ uses
    SequenceMatcher both to compare sequences of lines, and to compare
    sequences of characters within similar (near-matching) lines.

    Each line of a Differ delta begins with a two-letter code:

        '- '    line unique to sequence 1
        '+ '    line unique to sequence 2
        '  '    line common to both sequences
        '? '    line not present in either input sequence

    Lines beginning with '? ' attempt to guide the eye to intraline
    differences, and were not present in either input sequence.  These lines
    can be confusing if the sequences contain tab characters.

    Note that Differ makes no claim to produce a *minimal* diff.  To the
    contrary, minimal diffs are often counter-intuitive, because they synch
    up anywhere possible, sometimes accidental matches 100 pages apart.
    Restricting synch points to contiguous matches preserves some notion of
    locality, at the occasional cost of producing a longer diff.

    Example: Comparing two texts.

    First we set up the texts, sequences of individual single-line strings
    ending with newlines (such sequences can also be obtained from the
    `readlines()` method of file-like objects):

    >>> text1 = '''  1. Beautiful is better than ugly.
    ...   2. Explicit is better than implicit.
    ...   3. Simple is better than complex.
    ...   4. Complex is better than complicated.
    ... '''.splitlines(1)
    >>> text2 = '''  1. Beautiful is better than ugly.
    ...   3.   Simple is better than complex.
    ...   4. Complicated is better than complex.
    ...   5. Flat is better than nested.
    ... '''.splitlines(1)

    Next we instantiate a Differ object:

    >>> d = Differ()

    Note that when instantiating a Differ object we may pass functions to
    filter out line and character 'junk'.  See Differ.__init__ for details.

    Finally, we compare the two:

    >>> result = list(d.compare(text1, text2))

    'result' is a list of strings, so let's pretty-print it:

    >>> from pprint import pprint as _pprint
    >>> _pprint(result)
    ['    1. Beautiful is better than ugly.\n',
     '-   2. Explicit is better than implicit.\n',
     '-   3. Simple is better than complex.\n',
     '+   3.   Simple is better than complex.\n',
     '?     ++\n',
     '-   4. Complex is better than complicated.\n',
     '?            ^                     ---- ^\n',
     '+   4. Complicated is better than complex.\n',
     '?           ++++ ^                      ^\n',
     '+   5. Flat is better than nested.\n']

    As a single multi-line string it looks like this:

    >>> print ''.join(result),
        1. Beautiful is better than ugly.
    -   2. Explicit is better than implicit.
    -   3. Simple is better than complex.
    +   3.   Simple is better than complex.
    ?     ++
    -   4. Complex is better than complicated.
    ?            ^                     ---- ^
    +   4. Complicated is better than complex.
    ?           ++++ ^                      ^
    +   5. Flat is better than nested.

    Methods:

    __init__(linejunk=None, charjunk=None)
        Construct a text differencer, with optional filters.

    compare(a, b)
        Compare two sequences of lines; generate the resulting delta.
    """

    def __init__(self, linejunk=None, charjunk=None):
        """
        Construct a text differencer, with optional filters.

        The two optional keyword parameters are for filter functions:

        - `linejunk`: A function that should accept a single string argument,
          and return true iff the string is junk. The module-level function
          `IS_LINE_JUNK` may be used to filter out lines without visible
          characters, except for at most one splat ('#').  It is recommended
          to leave linejunk None; as of Python 2.3, the underlying
          SequenceMatcher class has grown an adaptive notion of "noise" lines
          that's better than any static definition the author has ever been
          able to craft.

        - `charjunk`: A function that should accept a string of length 1. The
          module-level function `IS_CHARACTER_JUNK` may be used to filter out
          whitespace characters (a blank or tab; **note**: bad idea to include
          newline in this!).  Use of IS_CHARACTER_JUNK is recommended.
        """

        self.linejunk = linejunk
        self.charjunk = charjunk

    def compare(self, a, b):
        r"""
        Compare two sequences of lines; generate the resulting delta.

        Each sequence must contain individual single-line strings ending with
        newlines. Such sequences can be obtained from the `readlines()` method
        of file-like objects.  The delta generated also consists of newline-
        terminated strings, ready to be printed as-is via the writelines()
        method of a file-like object.

        Example:

        >>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
        ...                                'ore\ntree\nemu\n'.splitlines(1))),
        - one
        ?  ^
        + ore
        ?  ^
        - two
        - three
        ?  -
        + tree
        + emu
        """

        cruncher = SequenceMatcher(self.linejunk, a, b)
        for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
            if tag == 'replace':
                g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
            elif tag == 'delete':
                g = self._dump('-', a, alo, ahi)
            elif tag == 'insert':
                g = self._dump('+', b, blo, bhi)
            elif tag == 'equal':
                g = self._dump(' ', a, alo, ahi)
            else:
                raise ValueError, 'unknown tag ' + `tag`

            for line in g:
                yield line

    def _dump(self, tag, x, lo, hi):
        """Generate comparison results for a same-tagged range."""
        for i in xrange(lo, hi):
            yield '%s %s' % (tag, x[i])

    def _plain_replace(self, a, alo, ahi, b, blo, bhi):
        assert alo < ahi and blo < bhi
        # dump the shorter block first -- reduces the burden on short-term
        # memory if the blocks are of very different sizes
        if bhi - blo < ahi - alo:
            first  = self._dump('+', b, blo, bhi)
            second = self._dump('-', a, alo, ahi)
        else:
            first  = self._dump('-', a, alo, ahi)
            second = self._dump('+', b, blo, bhi)

        for g in first, second:
            for line in g:
                yield line

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        r"""
        When replacing one block of lines with another, search the blocks
        for *similar* lines; the best-matching pair (if any) is used as a
        synch point, and intraline difference marking is done on the
        similar pair.  Lots of work, but often worth it.

        Example:

        >>> d = Differ()
        >>> results = d._fancy_replace(['abcDefghiJkl\n'], 0, 1,
        ...                            ['abcdefGhijkl\n'], 0, 1)
        >>> print ''.join(results),
        - abcDefghiJkl
        ?    ^  ^  ^
        + abcdefGhijkl
        ?    ^  ^  ^
        """

        # don't synch up unless the lines have a similarity score of at
        # least cutoff; best_ratio tracks the best score seen so far
        best_ratio, cutoff = 0.74, 0.75
        cruncher = SequenceMatcher(self.charjunk)
        eqi, eqj = None, None   # 1st indices of equal lines (if any)

        # search for the pair that matches best without being identical
        # (identical lines must be junk lines, & we don't want to synch up
        # on junk -- unless we have to)
        for j in xrange(blo, bhi):
            bj = b[j]
            cruncher.set_seq2(bj)
            for i in xrange(alo, ahi):
                ai = a[i]
                if ai == bj:
                    if eqi is None:
                        eqi, eqj = i, j
                    continue
                cruncher.set_seq1(ai)
                # computing similarity is expensive, so use the quick
                # upper bounds first -- have seen this speed up messy
                # compares by a factor of 3.
                # note that ratio() is only expensive to compute the first
                # time it's called on a sequence pair; the expensive part
                # of the computation is cached by cruncher
                if cruncher.real_quick_ratio() > best_ratio and \
                      cruncher.quick_ratio() > best_ratio and \
                      cruncher.ratio() > best_ratio:
                    best_ratio, best_i, best_j = cruncher.ratio(), i, j
        if best_ratio < cutoff:
            # no non-identical "pretty close" pair
            if eqi is None:
                # no identical pair either -- treat it as a straight replace
                for line in self._plain_replace(a, alo, ahi, b, blo, bhi):
                    yield line
                return
            # no close pair, but an identical pair -- synch up on that
            best_i, best_j, best_ratio = eqi, eqj, 1.0
        else:
            # there's a close pair, so forget the identical pair (if any)
            eqi = None

        # a[best_i] very similar to b[best_j]; eqi is None iff they're not
        # identical

        # pump out diffs from before the synch point
        for line in self._fancy_helper(a, alo, best_i, b, blo, best_j):
            yield line

        # do intraline marking on the synch pair
        aelt, belt = a[best_i], b[best_j]
        if eqi is None:
            # pump out a '-', '?', '+', '?' quad for the synched lines
            atags = btags = ""
            cruncher.set_seqs(aelt, belt)
            for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
                la, lb = ai2 - ai1, bj2 - bj1
                if tag == 'replace':
                    atags += '^' * la
                    btags += '^' * lb
                elif tag == 'delete':
                    atags += '-' * la
                elif tag == 'insert':
                    btags += '+' * lb
                elif tag == 'equal':
                    atags += ' ' * la
                    btags += ' ' * lb
                else:
                    raise ValueError, 'unknown tag ' + `tag`
            for line in self._qformat(aelt, belt, atags, btags):
                yield line
        else:
            # the synch pair is identical
            yield '  ' + aelt

        # pump out diffs from after the synch point
        for line in self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi):
            yield line

    def _fancy_helper(self, a, alo, ahi, b, blo, bhi):
        g = []
        if alo < ahi:
            if blo < bhi:
                g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
            else:
                g = self._dump('-', a, alo, ahi)
        elif blo < bhi:
            g = self._dump('+', b, blo, bhi)

        for line in g:
            yield line

    def _qformat(self, aline, bline, atags, btags):
        r"""
        Format "?" output and deal with leading tabs.

        Example:

        >>> d = Differ()
        >>> results = d._qformat('\tabcDefghiJkl\n', '\t\tabcdefGhijkl\n',
        ...                      '  ^ ^  ^      ', '+  ^ ^  ^      ')
        >>> for line in results: print repr(line)
        ...
        '- \tabcDefghiJkl\n'
        '? \t ^ ^  ^\n'
        '+ \t\tabcdefGhijkl\n'
        '? \t  ^ ^  ^\n'
        """

        # Can hurt, but will probably help most of the time.
        common = min(_count_leading(aline, "\t"),
                     _count_leading(bline, "\t"))
        common = min(common, _count_leading(atags[:common], " "))
        atags = atags[common:].rstrip()
        btags = btags[common:].rstrip()

        yield "- " + aline
        if atags:
            yield "? %s%s\n" % ("\t" * common, atags)

        yield "+ " + bline
        if btags:
            yield "? %s%s\n" % ("\t" * common, btags)

# With respect to junk, an earlier version of ndiff simply refused to
# *start* a match with a junk element.  The result was cases like this:
#     before: private Thread currentThread;
#     after:  private volatile Thread currentThread;
# If you consider whitespace to be junk, the longest contiguous match
# not starting with junk is "e Thread currentThread".  So ndiff reported
# that "e volatil" was inserted between the 't' and the 'e' in "private".
# While an accurate view, to people that's absurd.  The current version
# looks for matching blocks that are entirely junk-free, then extends the
# longest one of those as far as possible but only with matching junk.
# So now "currentThread" is matched, then extended to suck up the
# preceding blank; then "private" is matched, and extended to suck up the
# following blank; then "Thread" is matched; and finally ndiff reports
# that "volatile " was inserted before "Thread".  The only quibble
# remaining is that perhaps it was really the case that " volatile"
# was inserted after "private".  I can live with that <wink>.
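
# That behavior is easy to see directly (an illustrative sketch, not in
# the original comments):
#
#     m = SequenceMatcher(lambda x: x == " ",
#                         "private Thread currentThread;",
#                         "private volatile Thread currentThread;")
#     # m.get_matching_blocks() synchs on whole words plus adjacent
#     # blanks, rather than on "e Thread currentThread".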

import re

def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match):
    r"""
    Return 1 for ignorable line: iff `line` is blank or contains a single '#'.

    Examples:

    >>> IS_LINE_JUNK('\n')
    True
    >>> IS_LINE_JUNK('  #   \n')
    True
    >>> IS_LINE_JUNK('hello\n')
    False
    """

    return pat(line) is not None

def IS_CHARACTER_JUNK(ch, ws=" \t"):
    r"""
    Return 1 for ignorable character: iff `ch` is a space or tab.

    Examples:

    >>> IS_CHARACTER_JUNK(' ')
    True
    >>> IS_CHARACTER_JUNK('\t')
    True
    >>> IS_CHARACTER_JUNK('\n')
    False
    >>> IS_CHARACTER_JUNK('x')
    False
    """

    return ch in ws

def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
                 tofiledate='', n=3, lineterm='\n'):
    r"""
    Compare two sequences of lines; generate the delta as a unified diff.

    Unified diffs are a compact way of showing line changes and a few
    lines of context.  The number of context lines is set by 'n' which
    defaults to three.

    By default, the diff control lines (those with ---, +++, or @@) are
    created with a trailing newline.  This is helpful so that inputs
    created from file.readlines() result in diffs that are suitable for
    file.writelines() since both the inputs and outputs have trailing
    newlines.

    For inputs that do not have trailing newlines, set the lineterm
    argument to "" so that the output will be uniformly newline free.

    The unidiff format normally has a header for filenames and modification
    times.  Any or all of these may be specified using strings for
    'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.  The modification
    times are normally expressed in the format returned by time.ctime().

    Example:

    >>> for line in unified_diff('one two three four'.split(),
    ...             'zero one tree four'.split(), 'Original', 'Current',
    ...             'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003',
    ...             lineterm=''):
    ...     print line
    --- Original Sat Jan 26 23:30:50 1991
    +++ Current Fri Jun 06 10:20:52 2003
    @@ -1,4 +1,4 @@
    +zero
     one
    -two
    -three
    +tree
     four
    """

    started = False
    for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
        if not started:
            yield '--- %s %s%s' % (fromfile, fromfiledate, lineterm)
            yield '+++ %s %s%s' % (tofile, tofiledate, lineterm)
            started = True
        i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
        yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                for line in a[i1:i2]:
                    yield ' ' + line
                continue
            if tag == 'replace' or tag == 'delete':
                for line in a[i1:i2]:
                    yield '-' + line
            if tag == 'replace' or tag == 'insert':
                for line in b[j1:j2]:
                    yield '+' + line
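
# A typical use of unified_diff (an illustrative sketch, not from the
# original source); `a_lines` and `b_lines` are hypothetical
# newline-terminated line lists, e.g. from readlines():
#
#     import sys
#     sys.stdout.writelines(unified_diff(a_lines, b_lines,
#                                        fromfile='a.txt', tofile='b.txt'))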

# See http://www.unix.org/single_unix_specification/
def context_diff(a, b, fromfile='', tofile='',
                 fromfiledate='', tofiledate='', n=3, lineterm='\n'):
    r"""
    Compare two sequences of lines; generate the delta as a context diff.

    Context diffs are a compact way of showing line changes and a few
    lines of context.  The number of context lines is set by 'n' which
    defaults to three.

    By default, the diff control lines (those with *** or ---) are
    created with a trailing newline.  This is helpful so that inputs
    created from file.readlines() result in diffs that are suitable for
    file.writelines() since both the inputs and outputs have trailing
    newlines.

    For inputs that do not have trailing newlines, set the lineterm
    argument to "" so that the output will be uniformly newline free.

    The context diff format normally has a header for filenames and
    modification times.  Any or all of these may be specified using
    strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.
    The modification times are normally expressed in the format returned
    by time.ctime().  If not specified, the strings default to blanks.

    Example:

    >>> print ''.join(context_diff('one\ntwo\nthree\nfour\n'.splitlines(1),
    ...       'zero\none\ntree\nfour\n'.splitlines(1), 'Original', 'Current',
    ...       'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:22:46 2003')),
    *** Original Sat Jan 26 23:30:50 1991
    --- Current Fri Jun 06 10:22:46 2003
    ***************
    *** 1,4 ****
      one
    ! two
    ! three
      four
    --- 1,4 ----
    + zero
      one
    ! tree
      four
    """

    started = False
    prefixmap = {'insert':'+ ', 'delete':'- ', 'replace':'! ', 'equal':'  '}
    for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
        if not started:
            yield '*** %s %s%s' % (fromfile, fromfiledate, lineterm)
            yield '--- %s %s%s' % (tofile, tofiledate, lineterm)
            started = True

        yield '***************%s' % (lineterm,)
        if group[-1][2] - group[0][1] >= 2:
            yield '*** %d,%d ****%s' % (group[0][1]+1, group[-1][2], lineterm)
        else:
            yield '*** %d ****%s' % (group[-1][2], lineterm)
        visiblechanges = [e for e in group if e[0] in ('replace', 'delete')]
        if visiblechanges:
            for tag, i1, i2, _, _ in group:
                if tag != 'insert':
                    for line in a[i1:i2]:
                        yield prefixmap[tag] + line

        if group[-1][4] - group[0][3] >= 2:
            yield '--- %d,%d ----%s' % (group[0][3]+1, group[-1][4], lineterm)
        else:
            yield '--- %d ----%s' % (group[-1][4], lineterm)
        visiblechanges = [e for e in group if e[0] in ('replace', 'insert')]
        if visiblechanges:
            for tag, _, _, j1, j2 in group:
                if tag != 'delete':
                    for line in b[j1:j2]:
                        yield prefixmap[tag] + line

def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
    r"""
    Compare `a` and `b` (lists of strings); return a `Differ`-style delta.

    Optional keyword parameters `linejunk` and `charjunk` are for filter
    functions (or None):

    - linejunk: A function that should accept a single string argument, and
      return true iff the string is junk.  The default is None, and is
      recommended; as of Python 2.3, an adaptive notion of "noise" lines is
      used that does a good job on its own.

    - charjunk: A function that should accept a string of length 1.  The
      default is module-level function IS_CHARACTER_JUNK, which filters out
      whitespace characters (a blank or tab; note: bad idea to include newline
      in this!).

    Tools/scripts/ndiff.py is a command-line front-end to this function.

    Example:

    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
    ...              'ore\ntree\nemu\n'.splitlines(1))
    >>> print ''.join(diff),
    - one
    ?  ^
    + ore
    ?  ^
    - two
    - three
    ?  -
    + tree
    + emu
    """

    return Differ(linejunk, charjunk).compare(a, b)

def restore(delta, which):
    r"""
    Generate one of the two sequences that generated a delta.

    Given a `delta` produced by `Differ.compare()` or `ndiff()`, extract
    lines originating from file 1 or 2 (parameter `which`), stripping off line
    prefixes.

    Examples:

    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
    ...              'ore\ntree\nemu\n'.splitlines(1))
    >>> diff = list(diff)
    >>> print ''.join(restore(diff, 1)),
    one
    two
    three
    >>> print ''.join(restore(diff, 2)),
    ore
    tree
    emu
    """

    try:
        tag = {1: "- ", 2: "+ "}[int(which)]
    except KeyError:
        raise ValueError, ('unknown delta choice (must be 1 or 2): %r'
                           % which)
    prefixes = ("  ", tag)
    for line in delta:
        if line[:2] in prefixes:
            yield line[2:]

def _test():
    import doctest, difflib
    return doctest.testmod(difflib)

if __name__ == "__main__":
    _test()