\section{\module{heapq} ---
         Heap queue algorithm}
\declaremodule{standard}{heapq}
\modulesynopsis{Heap queue algorithm (a.k.a. priority queue).}
\moduleauthor{Kevin O'Connor}{}
\sectionauthor{Guido van Rossum}{guido@python.org}
% Theoretical explanation:
\sectionauthor{Fran\c cois Pinard}{}
This module provides an implementation of the heap queue algorithm,
also known as the priority queue algorithm.
Heaps are arrays for which
\code{\var{heap}[\var{k}] <= \var{heap}[2*\var{k}+1]} and
\code{\var{heap}[\var{k}] <= \var{heap}[2*\var{k}+2]}
for all \var{k}, counting elements from zero.  For the sake of
comparison, non-existing elements are considered to be infinite.  The
interesting property of a heap is that \code{\var{heap}[0]} is always
its smallest element.
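As an illustration of the invariant, here is a brief sketch in modern
Python syntax (the \code{check_invariant} helper is ours, not part of
the module):

```python
import heapq

def check_invariant(heap):
    # Verify heap[k] <= heap[2*k+1] and heap[k] <= heap[2*k+2]
    # for every index k; missing children count as infinite.
    n = len(heap)
    for k in range(n):
        for child in (2 * k + 1, 2 * k + 2):
            if child < n and heap[k] > heap[child]:
                return False
    return True

heap = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
heapq.heapify(heap)         # rearrange the list into heap order
assert check_invariant(heap)
assert heap[0] == 0         # the smallest element is always at index 0
```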
The API below differs from textbook heap algorithms in two aspects:
(a) We use zero-based indexing.  This makes the relationship between the
index for a node and the indexes for its children slightly less
obvious, but is more suitable since Python uses zero-based indexing.
(b) Our pop method returns the smallest item, not the largest (called a
"min heap" in textbooks; a "max heap" is more common in texts because
of its suitability for in-place sorting).
These two make it possible to view the heap as a regular Python list
without surprises: \code{\var{heap}[0]} is the smallest item, and
\code{\var{heap}.sort()} maintains the heap invariant!
To create a heap, use a list initialized to \code{[]}, or you can
transform a populated list into a heap via function \function{heapify()}.
The following functions are provided:
\begin{funcdesc}{heappush}{heap, item}
Push the value \var{item} onto the \var{heap}, maintaining the
heap invariant.
\end{funcdesc}
\begin{funcdesc}{heappop}{heap}
Pop and return the smallest item from the \var{heap}, maintaining the
heap invariant.  If the heap is empty, \exception{IndexError} is raised.
\end{funcdesc}
\begin{funcdesc}{heapify}{x}
Transform list \var{x} into a heap, in-place, in linear time.
\end{funcdesc}
\begin{funcdesc}{heapreplace}{heap, item}
Pop and return the smallest item from the \var{heap}, and also push
the new \var{item}.  The heap size doesn't change.
If the heap is empty, \exception{IndexError} is raised.
This is more efficient than \function{heappop()} followed
by \function{heappush()}, and can be more appropriate when using
a fixed-size heap.  Note that the value returned may be larger
than \var{item}!  That constrains reasonable uses of this routine
unless written as part of a conditional replacement:

\begin{verbatim}
        if item > heap[0]:
            item = heapreplace(heap, item)
\end{verbatim}
\end{funcdesc}
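For instance, the conditional-replacement pattern can maintain a
fixed-size heap of the largest items seen so far (a sketch in modern
Python syntax; the sample stream and the size of 3 are our
illustrative choices):

```python
import heapq

stream = [5, 1, 9, 3, 7, 2, 8]
largest = stream[:3]        # fixed-size heap of the 3 largest so far
heapq.heapify(largest)

for item in stream[3:]:
    # Replace the smallest of the current three only when the new
    # item is bigger; heapreplace pops and pushes in one step.
    if item > largest[0]:
        heapq.heapreplace(largest, item)

assert sorted(largest) == [7, 8, 9]
```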
Example of use:

\begin{verbatim}
>>> from heapq import heappush, heappop
>>> heap = []
>>> data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
>>> for item in data:
...     heappush(heap, item)
...
>>> sorted = []
>>> while heap:
...     sorted.append(heappop(heap))
...
>>> print sorted
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> data.sort()
>>> print data == sorted
True
\end{verbatim}
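Alternatively, \function{heapify()} turns an already-populated list
into a heap in a single call; a brief sketch in modern Python syntax:

```python
from heapq import heapify, heappop

data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
heapify(data)                       # linear-time, in-place
result = []
while data:
    result.append(heappop(data))    # pops items in increasing order

assert result == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```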
The module also offers two general purpose functions based on heaps.
\begin{funcdesc}{nlargest}{n, iterable}
Return a list with the \var{n} largest elements from the dataset defined
by \var{iterable}.  Equivalent to: \code{sorted(iterable, reverse=True)[:n]}
\end{funcdesc}
\begin{funcdesc}{nsmallest}{n, iterable}
Return a list with the \var{n} smallest elements from the dataset defined
by \var{iterable}.  Equivalent to: \code{sorted(iterable)[:n]}
\end{funcdesc}
Both functions perform best for smaller values of \var{n}.  For larger
values, it is more efficient to use the \function{sorted()} function.  Also,
when \code{n==1}, it is more efficient to use the builtin \function{min()}
and \function{max()} functions.
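A brief sketch of both functions and their stated equivalents (the
grades list is our example data):

```python
import heapq

grades = [84, 91, 77, 68, 95, 88, 73]
assert heapq.nlargest(3, grades) == [95, 91, 88]
assert heapq.nsmallest(2, grades) == [68, 73]

# The same results via the equivalents given above:
assert sorted(grades, reverse=True)[:3] == [95, 91, 88]
assert sorted(grades)[:2] == [68, 73]
```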
\subsection{Theory}

(This explanation is due to François Pinard.  The Python
code for this module was contributed by Kevin O'Connor.)
Heaps are arrays for which \code{a[\var{k}] <= a[2*\var{k}+1]} and
\code{a[\var{k}] <= a[2*\var{k}+2]}
for all \var{k}, counting elements from 0.  For the sake of comparison,
non-existing elements are considered to be infinite.  The interesting
property of a heap is that \code{a[0]} is always its smallest element.
The strange invariant above is meant to be an efficient memory
representation for a tournament.  The numbers below are \var{k}, not
\code{a[\var{k}]}:

\begin{verbatim}
                               0

              1                                 2

      3               4                5               6

  7       8       9       10      11      12      13      14

15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
\end{verbatim}

In the tree above, each cell \var{k} is topping \code{2*\var{k}+1} and
\code{2*\var{k}+2}.
In a usual binary tournament we see in sports, each cell is the winner
over the two cells it tops, and we can trace the winner down the tree
to see all opponents s/he had.  However, in many computer applications
of such tournaments, we do not need to trace the history of a winner.
To be more memory efficient, when a winner is promoted, we try to
replace it by something else at a lower level, and the rule becomes
that a cell and the two cells it tops contain three different items,
but the top cell "wins" over the two topped cells.
If this heap invariant is protected at all times, index 0 is clearly
the overall winner.  The simplest algorithmic way to remove it and
find the "next" winner is to move some loser (let's say cell 30 in the
diagram above) into the 0 position, and then percolate this new 0 down
the tree, exchanging values, until the invariant is re-established.
This is clearly logarithmic on the total number of items in the tree.
By iterating over all items, you get an O(n log n) sort.
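That sort can be written directly with the module's functions; a
minimal sketch (the \code{heapsort} name is our illustrative choice):

```python
import heapq

def heapsort(iterable):
    # Build a heap in linear time, then pop the smallest item
    # n times: O(n log n) overall.
    heap = list(iterable)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

assert heapsort([1, 3, 5, 7, 9, 2, 4, 6, 8, 0]) == list(range(10))
```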
A nice feature of this sort is that you can efficiently insert new
items while the sort is going on, provided that the inserted items are
not "better" than the last 0'th element you extracted.  This is
especially useful in simulation contexts, where the tree holds all
incoming events, and the "win" condition means the smallest scheduled
time.  When an event schedules other events for execution, they are
scheduled into the future, so they can easily go into the heap.  So, a
heap is a good structure for implementing schedulers (this is what I
used for my MIDI sequencer :-).
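A minimal scheduler sketch along these lines (the event names and
times are our illustrative choices): the heap holds \code{(time,
action)} pairs, and popping always yields the earliest event.

```python
import heapq

events = []
heapq.heappush(events, (10, "note-off"))
heapq.heappush(events, (5, "note-on"))
heapq.heappush(events, (7, "tempo-change"))

order = []
while events:
    when, action = heapq.heappop(events)  # earliest scheduled first
    order.append(action)

assert order == ["note-on", "tempo-change", "note-off"]
```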
Various structures for implementing schedulers have been extensively
studied, and heaps are good for this, as they are reasonably speedy,
the speed is almost constant, and the worst case is not much different
from the average case.  However, there are other representations which
are more efficient overall, yet the worst cases might be terrible.
Heaps are also very useful in big disk sorts.  You most probably all
know that a big sort implies producing "runs" (which are pre-sorted
sequences, whose size is usually related to the amount of CPU memory),
followed by merging passes for these runs, which merging is often
very cleverly organised\footnote{The disk balancing algorithms which
are current, nowadays, are more annoying than clever, and this is a
consequence of the seeking capabilities of the disks.  On devices
which cannot seek, like big tape drives, the story was quite
different, and one had to be very clever to ensure (far in advance)
that each tape movement would be the most effective possible (that is,
would best participate at "progressing" the merge).  Some tapes were
even able to read backwards, and this was also used to avoid the
rewinding time.  Believe me, real good tape sorts were quite
spectacular to watch!  From all times, sorting has always been a
Great Art! :-)}.
It is very important that the initial
sort produces the longest runs possible.  Tournaments are a good way
to achieve that.  If, using all the memory available to hold a
tournament, you replace and percolate items that happen to fit the
current run, you'll produce runs which are twice the size of the
memory for random input, and much better for input fuzzily ordered.
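The run-merging step is itself a natural heap application.  A minimal
sketch (the \code{merge_runs} name and the sample runs are ours;
later versions of this module also provide \function{heapq.merge()}
for exactly this job):

```python
import heapq

def merge_runs(*runs):
    # Seed the heap with the first element of each sorted run; each
    # entry is (value, run index, position), so popping always yields
    # the globally smallest remaining value.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        merged.append(value)
        if pos + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return merged

assert merge_runs([1, 4, 7], [2, 5, 8], [3, 6, 9]) == list(range(1, 10))
```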
Moreover, if you output the 0'th item on disk and get an input which
may not fit in the current tournament (because the value "wins" over
the last output value), it cannot fit in the heap, so the size of the
heap decreases.  The freed memory could be cleverly reused immediately
for progressively building a second heap, which grows at exactly the
same rate the first heap is melting.  When the first heap completely
vanishes, you switch heaps and start a new run.  Clever and quite
effective!
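A simplified sketch of this run-generation idea (replacement
selection): here the second heap is modeled by a plain deferred list
that seeds the next run, and the \var{memory_size} parameter stands
in for the amount of CPU memory; both simplifications and all names
are ours.

```python
import heapq

def replacement_selection(items, memory_size=4):
    # Produce sorted runs: items that "lose" against the last value
    # output are deferred to the next run instead of the current one.
    items = list(items)
    runs = []
    while items:
        heap = items[:memory_size]
        rest = items[memory_size:]
        heapq.heapify(heap)
        run, deferred = [], []
        for item in rest:
            smallest = heapq.heappop(heap)
            run.append(smallest)
            if item >= smallest:
                heapq.heappush(heap, item)  # still fits this run
            else:
                deferred.append(item)       # saved for the next run
        run.extend(heapq.heappop(heap) for _ in range(len(heap)))
        runs.append(run)
        items = deferred
    return runs

assert replacement_selection([5, 1, 6, 2, 9, 3, 8, 4, 7, 0]) == \
    [[1, 2, 3, 5, 6, 7, 8, 9], [0, 4]]
```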
In a word, heaps are useful memory structures to know.  I use them in
a few applications, and I think it is good to keep a `heap' module
around. :-)