1 <chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
2 xml:id="manual.ext.containers.pbds" xreflabel="pbds">
4 <title>Policy-Based Data Structures</title>
6 <keyword>ISO C++</keyword>
7 <keyword>policy</keyword>
8 <keyword>container</keyword>
9 <keyword>data</keyword>
10 <keyword>structure</keyword>
11 <keyword>associated</keyword>
12 <keyword>tree</keyword>
13 <keyword>trie</keyword>
14 <keyword>hash</keyword>
15 <keyword>metaprogramming</keyword>
18 <?dbhtml filename="policy_data_structures.html"?>
20 <!-- 2006-04-01 Ami Tavory -->
21 <!-- 2011-05-25 Benjamin Kosnik -->
24 <section xml:id="pbds.intro">
25 <info><title>Intro</title></info>
28 This is a library of policy-based elementary data structures:
29 associative containers and priority queues. It is designed for
high performance, flexibility, semantic safety, and conformance to
31 the corresponding containers in <literal>std</literal> and
32 <literal>std::tr1</literal> (except for some points where it differs
38 <section xml:id="pbds.intro.issues">
39 <info><title>Performance Issues</title></info>
44 An attempt is made to categorize the wide variety of possible
45 container designs in terms of performance-impacting factors. These
46 performance factors are translated into design policies and
47 incorporated into container design.
There is tension in unraveling these factors into a coherent set of
policies. Every attempt is made to keep the set of factors
minimal. However, in many cases multiple factors make for long
template names. Every attempt is made to use aliases and typedefs in
the source files, but the generated names for external symbols can
still be long in binary files and in debuggers.
In many cases, the longer names allow capabilities and behaviours
controlled by macros to also be unambiguously emitted as distinct
66 Specific issues found while unraveling performance factors in the
67 design of associative containers and priority queues follow.
70 <section xml:id="pbds.intro.issues.associative">
71 <info><title>Associative</title></info>
74 Associative containers depend on their composite policies to a very
75 large extent. Implicitly hard-wiring policies can hamper their
76 performance and limit their functionality. An efficient hash-based
77 container, for example, requires policies for testing key
78 equivalence, hashing keys, translating hash values into positions
79 within the hash table, and determining when and how to resize the
80 table internally. A tree-based container can efficiently support
81 order statistics, i.e. the ability to query what is the order of
82 each key within the sequence of keys in the container, but only if
83 the container is supplied with a policy to internally update
84 meta-data. There are many other such examples.
88 Ideally, all associative containers would share the same
89 interface. Unfortunately, underlying data structures and mapping
90 semantics differentiate between different containers. For example,
91 suppose one writes a generic function manipulating an associative
96 template<typename Cntnr>
98 some_op_sequence(Cntnr& r_cnt)
105 Given this, then what can one assume about the instantiating
106 container? The answer varies according to its underlying data
107 structure. If the underlying data structure of
108 <literal>Cntnr</literal> is based on a tree or trie, then the order
109 of elements is well defined; otherwise, it is not, in general. If
110 the underlying data structure of <literal>Cntnr</literal> is based
on a collision-chaining hash table, then modifying
<literal>r_cnt</literal> will not invalidate its iterators' order;
113 if the underlying data structure is a probing hash table, then this
114 is not the case. If the underlying data structure is based on a tree
115 or trie, then a reference to the container can efficiently be split;
116 otherwise, it cannot, in general. If the underlying data structure
117 is a red-black tree, then splitting a reference to the container is
118 exception-free; if it is an ordered-vector tree, exceptions can be
124 <section xml:id="pbds.intro.issues.priority_queue">
<info><title>Priority Queues</title></info>
128 Priority queues are useful when one needs to efficiently access a
129 minimum (or maximum) value as the set of values changes.
133 Most useful data structures for priority queues have a relatively
134 simple structure, as they are geared toward relatively simple
requirements. Unfortunately, these structures do not support access
to an arbitrary value, which turns out to be necessary in many
algorithms, for example when decreasing an arbitrary value in a
graph algorithm. Therefore, some extra mechanism is necessary
for accessing arbitrary values. There are at least two
140 alternatives: embedding an associative container in a priority
141 queue, or allowing cross-referencing through iterators. The first
142 solution adds significant overhead; the second solution requires a
143 precise definition of iterator invalidation. Which is the next
148 Priority queues, like hash-based containers, store values in an
149 order that is meaningless and undefined externally. For example, a
150 <code>push</code> operation can internally reorganize the
values. Because of this characteristic, describing a priority
queue's iterators is difficult: on one hand, the values to which
153 iterators point can remain valid, but on the other, the logical
154 order of iterators can change unpredictably.
Roughly speaking, any element that is both inserted into a priority
queue (e.g., through <code>push</code>) and removed
from it (e.g., through <code>pop</code>) incurs a
logarithmic overhead (in the amortized sense). Different underlying
162 data structures place the actual cost differently: some are
163 optimized for amortized complexity, whereas others guarantee that
164 specific operations only have a constant cost. One underlying data
165 structure might be chosen if modifying a value is frequent
166 (Dijkstra's shortest-path algorithm), whereas a different one might
be chosen otherwise. Unfortunately, an array-based binary heap, an
underlying data structure that optimizes (in the amortized sense)
<code>push</code> and <code>pop</code> operations, differs from the
others in terms of its invalidation guarantees. Other design
171 decisions also impact the cost and placement of the overhead, at the
172 expense of more difference in the kinds of operations that the
173 underlying data structure can support. These differences pose a
174 challenge when creating a uniform interface for priority queues.
179 <section xml:id="pbds.intro.motivation">
180 <info><title>Goals</title></info>
Many fine associative-container libraries have already been written,
most notably the C++ standard's associative containers. Why
then write another library? This section shows some possible
advantages of this library in light of the challenges described in
the introduction. Many of these points stem from the fact that
the ISO C++ process introduced associative containers in a
two-step process (first standardizing tree-based containers,
only then adding hash-based containers, which are fundamentally
different), did not standardize priority queues as containers,
and (in our opinion) overloaded the iterator concept.
195 <section xml:id="pbds.intro.motivation.associative">
196 <info><title>Associative</title></info>
200 <section xml:id="motivation.associative.policy">
201 <info><title>Policy Choices</title></info>
203 Associative containers require a relatively large number of
204 policies to function efficiently in various settings. In some
205 cases this is needed for making their common operations more
206 efficient, and in other cases this allows them to support a
larger set of operations.
Hash-based containers, for example, support look-up and
insertion methods (<function>find</function> and
<function>insert</function>). In order to locate elements
quickly, they are supplied a hash functor, which specifies
how to transform a key object into some size type; a hash
functor might transform <constant>"hello"</constant>
into <constant>1123002298</constant>. A hash table, though,
requires transforming each key object into a size-type
value in some specific domain; a hash table with 128
entries might transform <constant>"hello"</constant> into
position <constant>63</constant>. The policy by which the
hash value is transformed into a position within the table
can dramatically affect performance. Hash-based containers
also do not resize naturally (as opposed to tree-based
containers, for example). The appropriate resize policy is
unfortunately intertwined with the policy that transforms
a hash value into a position within the table.
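The two-stage mapping described above (key to hash value, then hash value to table position) can be sketched as follows. This is a minimal illustration using only the standard library; the helper names <code>mod_position</code>, <code>mask_position</code> and <code>position_of</code> are hypothetical and are not part of this library's interface, and the positions produced depend on the hash functor in use.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical range-hashing policy: reduce a hash value modulo the
// table size. Works for any non-zero table size.
inline std::size_t mod_position(std::size_t hash_value, std::size_t table_size)
{
  return hash_value % table_size;
}

// Hypothetical range-hashing policy: keep the low bits of the hash value.
// Only correct when the table size is a power of two.
inline std::size_t mask_position(std::size_t hash_value, std::size_t table_size)
{
  assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
  return hash_value & (table_size - 1);
}

// Stage one hashes the key; stage two maps the hash value into the table.
inline std::size_t position_of(const std::string& key, std::size_t table_size)
{
  const std::size_t hash_value = std::hash<std::string>()(key);
  return mod_position(hash_value, table_size);
}
```

For a power-of-two table size both policies give the same position, but the mask variant ties the resize policy to powers of two, which is one way the two policies become intertwined.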
235 Tree-based containers, for example, also support look-up and
236 insertion methods, and are primarily useful when maintaining
237 order between elements is important. In some cases, though,
238 one can utilize their balancing algorithms for completely
Figure A shows a tree in which each node contains two entries:
a floating-point key, and some size-type
245 <emphasis>metadata</emphasis> (in bold beneath it) that is
246 the number of nodes in the sub-tree. (The root has key 0.99,
247 and has 5 nodes (including itself) in its sub-tree.) A
248 container based on this data structure can obviously answer
249 efficiently whether 0.3 is in the container object, but it
250 can also answer what is the order of 0.3 among all those in
251 the container object: see <xref linkend="biblio.clrs2001"/>.
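An order-statistics query of this kind can be sketched with this library's tree container, assuming GCC's libstdc++ and its <code>ext/pb_ds</code> headers are available. The keys below are illustrative and are not those of the figure; <code>ordered_set</code>, <code>make_example_set</code> and <code>order_of</code> are hypothetical names introduced for the example.

```cpp
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree set of doubles whose node-update policy maintains
// the size of each sub-tree as metadata.
typedef __gnu_pbds::tree<
  double,
  __gnu_pbds::null_type,
  std::less<double>,
  __gnu_pbds::rb_tree_tag,
  __gnu_pbds::tree_order_statistics_node_update> ordered_set;

// Build a set with five illustrative keys.
inline ordered_set make_example_set()
{
  ordered_set s;
  const double keys[] = { 0.99, 0.3, 0.5, 0.1, 1.5 };
  for (std::size_t i = 0; i < 5; ++i)
    s.insert(keys[i]);
  return s;
}

// Number of keys strictly smaller than k, i.e. the zero-based order of k.
inline std::size_t order_of(const ordered_set& s, double k)
{
  return s.order_of_key(k);
}
```

With the keys above, <code>order_of(make_example_set(), 0.3)</code> yields 1, because only 0.1 precedes 0.3.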
As another example, Figure B shows a tree in which each node
contains two entries: a half-open geometric line interval,
258 and a number <emphasis>metadata</emphasis> (in bold beneath
259 it) that is the largest endpoint of all intervals in its
260 sub-tree. (The root describes the interval <constant>[20,
261 36)</constant>, and the largest endpoint in its sub-tree is
262 99.) A container based on this data structure can obviously
263 answer efficiently whether <constant>[3, 41)</constant> is
264 in the container object, but it can also answer efficiently
265 whether the container object has intervals that intersect
266 <constant>[3, 41)</constant>. These types of queries are
267 very useful in geometric algorithms and lease-management
272 It is important to note, however, that as the trees are
273 modified, their internal structure changes. To maintain
274 these invariants, one must supply some policy that is aware
275 of these changes. Without this, it would be better to use a
276 linked list (in itself very efficient for these purposes).
283 <title>Node Invariants</title>
286 <imagedata align="center" format="PNG" scale="100"
287 fileref="../images/pbds_node_invariants.png"/>
290 <phrase>Node Invariants</phrase>
297 <section xml:id="motivation.associative.underlying">
298 <info><title>Underlying Data Structures</title></info>
300 The standard C++ library contains associative containers based on
301 red-black trees and collision-chaining hash tables. These are
302 very useful, but they are not ideal for all types of
307 The figure below shows the different underlying data structures
308 currently supported in this library.
312 <title>Underlying Associative Data Structures</title>
315 <imagedata align="center" format="PNG" scale="100"
316 fileref="../images/pbds_different_underlying_dss_1.png"/>
319 <phrase>Underlying Associative Data Structures</phrase>
325 A shows a collision-chaining hash-table, B shows a probing
326 hash-table, C shows a red-black tree, D shows a splay tree, E shows
a tree based on an ordered vector (implicit in the order of the
328 elements), F shows a PATRICIA trie, and G shows a list-based
329 container with update policies.
333 Each of these data structures has some performance benefits, in
334 terms of speed, size or both. For now, note that vector-based trees
335 and probing hash tables manipulate memory more efficiently than
336 red-black trees and collision-chaining hash tables, and that
337 list-based associative containers are very useful for constructing
342 Now consider a function manipulating a generic associative
346 template<class Cntnr>
348 some_op_sequence(Cntnr &r_cnt)
355 Ideally, the underlying data structure
356 of <classname>Cntnr</classname> would not affect what can be
357 done with <varname>r_cnt</varname>. Unfortunately, this is not
362 For example, if <classname>Cntnr</classname>
363 is <classname>std::map</classname>, then the function can
367 std::for_each(r_cnt.find(foo), r_cnt.find(bar), foobar)
370 in order to apply <classname>foobar</classname> to all
371 elements between <classname>foo</classname> and
372 <classname>bar</classname>. If
373 <classname>Cntnr</classname> is a hash-based container,
374 then this call's results are undefined.
378 Also, if <classname>Cntnr</classname> is tree-based, the type
379 and object of the comparison functor can be
accessed. If <classname>Cntnr</classname> is hash-based, these
381 queries are nonsensical.
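As a small sketch of this difference, the generic helper below retrieves and applies the comparison functor of a tree-based standard container. The name <code>precedes</code> is hypothetical. Instantiating it with <classname>std::unordered_map</classname> would fail to compile, since that container has neither <code>key_compare</code> nor <code>key_comp</code>.

```cpp
#include <functional>
#include <map>
#include <string>

// For a tree-based container the comparison functor is part of the
// interface, so generic code can query its type and obtain its object.
template<typename Cntnr>
bool precedes(const Cntnr& c,
              const typename Cntnr::key_type& a,
              const typename Cntnr::key_type& b)
{
  typename Cntnr::key_compare cmp = c.key_comp();
  return cmp(a, b);
}
```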
385 There are various other differences based on the container's
386 underlying data structure. For one, they can be constructed by,
387 and queried for, different policies. Furthermore:
393 Containers based on C, D, E and F store elements in a
394 meaningful order; the others store elements in a meaningless
395 (and probably time-varying) order. By implication, only
396 containers based on C, D, E and F can
397 support <function>erase</function> operations taking an
398 iterator and returning an iterator to the following element
399 without performance loss.
405 Containers based on C, D, E, and F can be split and joined
406 efficiently, while the others cannot. Containers based on C
407 and D, furthermore, can guarantee that this is exception-free;
408 containers based on E cannot guarantee this.
414 Containers based on all but E can guarantee that
erasing an element is exception-free; containers based on E
416 cannot guarantee this. Containers based on all but B and E
417 can guarantee that modifying an object of their type does
418 not invalidate iterators or references to their elements,
419 while containers based on B and E cannot. Containers based
420 on C, D, and E can furthermore make a stronger guarantee,
421 namely that modifying an object of their type does not
422 affect the order of iterators.
428 A unified tag and traits system (as used for the C++ standard
429 library iterators, for example) can ease generic manipulation of
430 associative containers based on different underlying data
436 <section xml:id="motivation.associative.iterators">
437 <info><title>Iterators</title></info>
Iterators are central to the design of the standard library
440 containers, because of the container/algorithm/iterator
441 decomposition that allows an algorithm to operate on a range
442 through iterators of some sequence. Iterators, then, are useful
443 because they allow going over a
444 specific <emphasis>sequence</emphasis>. The standard library
445 also uses iterators for accessing a
446 specific <emphasis>element</emphasis>: when an associative
447 container returns one through <function>find</function>. The
448 standard library consistently uses the same types of iterators
449 for both purposes: going over a range, and accessing a specific
450 found element. Before the introduction of hash-based containers
451 to the standard library, this made sense (with the exception of
452 priority queues, which are discussed later).
When the standard associative containers are used together with
non-order-preserving associative containers (and also with
priority-queue containers), there may be a need for
different types of iterators for self-organizing containers:
the iterator concept seems overloaded to mean two different
things (in some cases). <!-- <remark> XXX
462 "ds_gen.html#find_range">Design::Associative
463 Containers::Data-Structure Genericity::Point-Type and Range-Type
464 Methods</remark>. -->
467 <section xml:id="associative.iterators.using">
469 <title>Using Point Iterators for Range Operations</title>
472 Suppose <classname>cntnr</classname> is some associative
473 container, and say <varname>c</varname> is an object of
474 type <classname>cntnr</classname>. Then what will be the outcome
479 std::for_each(c.find(1), c.find(5), foo);
If <varname>c</varname> is a tree-based container
484 object, then an in-order walk will
485 apply <classname>foo</classname> to the relevant elements,
486 as in the graphic below, label A. If <varname>c</varname> is
487 a hash-based container, then the order of elements between any
488 two elements is undefined (and probably time-varying); there is
489 no guarantee that the elements traversed will coincide with the
490 <emphasis>logical</emphasis> elements between 1 and 5, as in
495 <title>Range Iteration in Different Data Structures</title>
498 <imagedata align="center" format="PNG" scale="100"
499 fileref="../images/pbds_point_iterators_range_ops_1.png"/>
<phrase>Range Iteration in Different Data Structures</phrase>
In our opinion, this problem is not caused merely by the fact
that red-black trees are order preserving while
collision-chaining hash tables are (generally) not; it
is more fundamental. Most of the standard's containers
512 order sequences in a well-defined manner that is
513 determined by their <emphasis>interface</emphasis>:
514 calling <function>insert</function> on a tree-based
515 container modifies its sequence in a predictable way, as
516 does calling <function>push_back</function> on a list or
517 a vector. Conversely, collision-chaining hash tables,
518 probing hash tables, priority queues, and list-based
519 containers (which are very useful for "multimaps") are
self-organizing data structures; each
operation modifies their sequences in a manner that is
522 (practically) determined by their
523 <emphasis>implementation</emphasis>.
527 Consequently, applying an algorithm to a sequence obtained from most
528 containers may or may not make sense, but applying it to a
529 sub-sequence of a self-organizing container does not.
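The order-preserving case can be sketched with <classname>std::map</classname>. This is a minimal illustration; <code>collect_keys</code>, <code>sample_map</code> and <code>keys_between</code> are hypothetical names. The same call on a hash-based container would compile, but the set of elements visited would depend on the implementation.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Functor that records the key of every entry it is applied to.
struct collect_keys
{
  std::vector<int>* out;
  explicit collect_keys(std::vector<int>* o) : out(o) { }
  void operator()(const std::pair<const int, int>& entry) const
  { out->push_back(entry.first); }
};

// A map holding the keys 1 through 7, inserted in descending order.
inline std::map<int, int> sample_map()
{
  std::map<int, int> m;
  for (int k = 7; k >= 1; --k)
    m[k] = k * k;
  return m;
}

// Keys visited when walking the half-open range [find(lo), find(hi)).
// Assumes both lo and hi are present in m and lo does not follow hi.
inline std::vector<int> keys_between(const std::map<int, int>& m, int lo, int hi)
{
  std::vector<int> visited;
  std::for_each(m.find(lo), m.find(hi), collect_keys(&visited));
  return visited;
}
```

For the sample map, walking from key 1 to key 5 visits 1, 2, 3 and 4 regardless of insertion order, because the sequence is determined by the interface.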
533 <section xml:id="associative.iterators.cost">
535 <title>Cost to Point Iterators to Enable Range Operations</title>
538 Suppose <varname>c</varname> is some collision-chaining
539 hash-based container object, and one calls
541 <programlisting>c.find(3)</programlisting>
543 Then what composes the returned iterator?
547 In the graphic below, label A shows the simplest (and
548 most efficient) implementation of a collision-chaining
549 hash table. The little box marked
550 <classname>point_iterator</classname> shows an object
551 that contains a pointer to the element's node. Note that
552 this "iterator" has no way to move to the next element (
554 <function>operator++</function>). Conversely, the little
555 box marked <classname>iterator</classname> stores both a
556 pointer to the element, as well as some other
information (the bucket number of the element). The
second iterator, then, is "heavier" than the first one:
it requires more time and space. If we were to use a
different container to cross-reference into this
hash-table using these iterators, it would take much
563 incrementing these iterators, so why is this extra
568 Alternatively, one might create a collision-chaining hash-table
569 where the lists might be linked, forming a monolithic total-element
570 list, as in the graphic below, label B. Here the iterators are as
571 light as can be, but the hash-table's operations are more
576 <title>Point Iteration in Hash Data Structures</title>
579 <imagedata align="center" format="PNG" scale="100"
580 fileref="../images/pbds_point_iterators_range_ops_2.png"/>
583 <phrase>Point Iteration in Hash Data Structures</phrase>
589 It should be noted that containers based on collision-chaining
590 hash-tables are not the only ones with this type of behavior;
591 many other self-organizing data structures display it as well.
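The difference in weight between the two kinds of iterators can be sketched with hypothetical types. None of the names below are this library's actual classes, and the extra members of <code>range_iterator</code> (a pointer to the bucket array and the bucket count) are an assumption about what advancing past the end of a chain requires.

```cpp
#include <cstddef>

// Hypothetical node of a collision-chaining hash table.
struct node
{
  int   value;
  node* next_in_bucket;
};

// Point-type iterator: a single pointer. It can be dereferenced but
// has no way to move to the next element.
struct point_iterator
{
  node* p_node;
  int& operator*() const { return p_node->value; }
};

// Range-type iterator: also carries the bucket information needed to
// continue into the next non-empty bucket when a chain ends.
struct range_iterator
{
  node*       p_node;
  node**      buckets;
  std::size_t bucket_index;
  std::size_t bucket_count;

  int& operator*() const { return p_node->value; }

  range_iterator& operator++()
  {
    p_node = p_node->next_in_bucket;
    // End of chain: scan forward for the next non-empty bucket, if any.
    while (p_node == 0 && ++bucket_index < bucket_count)
      p_node = buckets[bucket_index];
    return *this;
  }
};
```

A container that cross-references into the table only needs the point-type iterator, so storing the range-type one wastes the extra members for every stored reference.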
595 <section xml:id="associative.iterators.invalidation">
596 <info><title>Invalidation Guarantees</title></info>
597 <para>Consider the following snippet:</para>
Following the call to <function>erase</function>, what is the
validity of <varname>it</varname>: can it be de-referenced?
Can it be incremented?
610 The answer depends on the underlying data structure of the
611 container. The graphic below shows three cases: A1 and A2 show
612 a red-black tree; B1 and B2 show a probing hash-table; C1 and C2
613 show a collision-chaining hash table.
617 <title>Effect of erase in different underlying data structures</title>
620 <imagedata align="center" format="PNG" scale="100"
621 fileref="../images/pbds_invalidation_guarantee_erase.png"/>
624 <phrase>Effect of erase in different underlying data structures</phrase>
632 Erasing 5 from A1 yields A2. Clearly, an iterator to 3 can
633 be de-referenced and incremented. The sequence of iterators
634 changed, but in a way that is well-defined by the interface.
640 Erasing 5 from B1 yields B2. Clearly, an iterator to 3 is
641 not valid at all - it cannot be de-referenced or
642 incremented; the order of iterators changed in a way that is
643 (practically) determined by the implementation and not by
650 Erasing 5 from C1 yields C2. Here the situation is more
651 complicated. On the one hand, there is no problem in
652 de-referencing <classname>it</classname>. On the other hand,
653 the order of iterators changed in a way that is
654 (practically) determined by the implementation and not by
661 So in the standard library containers, it is not always possible
662 to express whether <varname>it</varname> is valid or not. This
663 is true also for <function>insert</function>. Again, the
664 iterator concept seems overloaded.
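The red-black tree case (A1 to A2) can be sketched with <classname>std::map</classname>, whose <function>erase</function> invalidates only iterators to the erased element. The function name and the keys are illustrative.

```cpp
#include <map>

// Erase key 5 while holding an iterator to key 3, then report the key
// reached by incrementing that iterator.
inline int next_key_after_erase()
{
  std::map<int, int> m;
  for (int k = 1; k <= 7; k += 2)   // keys 1, 3, 5, 7
    m[k] = 0;

  std::map<int, int>::iterator it = m.find(3);
  m.erase(5);

  // The iterator to 3 is still valid; its successor is now 7, a change
  // that follows from the interface rather than the implementation.
  ++it;
  return it->first;
}
```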
667 </section> <!--iterators-->
670 <section xml:id="motivation.associative.functions">
671 <info><title>Functional</title></info>
The design of the functional overlay to the underlying data
structures differs slightly from some of the conventions used in
the C++ standard. A strict public interface should comprise
only those methods which depend on the class's internal
structure; other operations are best designed as external
functions (see <xref linkend="biblio.meyers02both"/>). With this
682 rubric, the standard associative containers lack some useful
683 methods, and provide other methods which would be better
687 <section xml:id="motivation.associative.functions.erase">
688 <info><title><function>erase</function></title></info>
693 Order-preserving standard associative containers provide the
702 which takes an iterator, erases the corresponding
703 element, and returns an iterator to the following
element. Standard hash-based associative
containers also provide this method. This seemingly
increases genericity between associative containers,
707 since it is possible to use
typename C::iterator it = c.begin();
typename C::iterator e_it = c.end();

// Erase matching elements; otherwise advance to the next element.
while (it != e_it)
  it = pred(*it) ? c.erase(it) : ++it;
in order to erase from a container object
<varname>c</varname> all elements which match a
predicate <classname>pred</classname>. However, in a
721 different sense this actually decreases genericity: an
722 integral implication of this method is that tree-based
723 associative containers' memory use is linear in the total
724 number of elements they store, while hash-based
725 containers' memory use is unbounded in the total number of
726 elements they store. Assume a hash-based container is
727 allowed to decrease its size when an element is
728 erased. Then the elements might be rehashed, which means
729 that there is no "next" element - it is simply
730 undefined. Consequently, it is possible to infer from the
731 fact that the standard library's hash-based containers
732 provide this method that they cannot downsize when
733 elements are erased. As a consequence, different code is
734 needed to manipulate different containers, assuming that
memory should be conserved. Therefore, this library's
non-order-preserving associative containers omit this
743 All associative containers include a conditional-erase method
753 which erases all elements matching a predicate. This is probably the
754 only way to ensure linear-time multiple-item erase which can
755 actually downsize a container.
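A conditional erase on a hash-based container can be sketched as follows, assuming GCC's libstdc++ with its <code>ext/pb_ds</code> headers. The predicate <code>has_even_key</code> and the helper <code>erase_even_keys</code> are hypothetical names introduced for the example.

```cpp
#include <cstddef>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::cc_hash_table<int, int> hash_map_t;

// Predicate over stored entries: true for entries with an even key.
struct has_even_key
{
  bool operator()(const std::pair<const int, int>& entry) const
  { return entry.first % 2 == 0; }
};

// Fill a table with the keys 0..n-1, erase the even keys in a single
// call, and return the number of entries that were erased.
inline std::size_t erase_even_keys(int n)
{
  hash_map_t c;
  for (int k = 0; k < n; ++k)
    c[k] = k * k;
  return c.erase_if(has_even_key());
}
```

Because no iterator to a "next" element is handed back to the caller, the container is free to reorganize or downsize while erasing.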
761 The standard associative containers provide methods for
762 multiple-item erase of the form
769 erasing a range of elements given by a pair of
iterators. For tree-based or trie-based containers, this can
be implemented more efficiently as a (small) sequence of split
772 and join operations. For other, unordered, containers, this
773 method isn't much better than an external loop. Moreover,
774 if <varname>c</varname> is a hash-based container,
778 c.erase(c.find(2), c.find(5))
781 is almost certain to do something
782 different than erasing all elements whose keys are between 2
783 and 5, and is likely to produce other undefined behavior.
787 </section> <!-- erase -->
789 <section xml:id="motivation.associative.functions.split">
792 <function>split</function> and <function>join</function>
796 It is well-known that tree-based and trie-based container
797 objects can be efficiently split or joined (See
798 <xref linkend="biblio.clrs2001"/>). Externally splitting or
799 joining trees is super-linear, and, furthermore, can throw
800 exceptions. Split and join methods, consequently, seem good
choices for tree-based container methods, especially since, as
noted just before, they are efficient replacements for erasing
806 </section> <!-- split -->
808 <section xml:id="motivation.associative.functions.insert">
811 <function>insert</function>
815 The standard associative containers provide methods of the form
818 template<class It>
824 for inserting a range of elements given by a pair of
825 iterators. At best, this can be implemented as an external loop,
826 or, even more efficiently, as a join operation (for the case of
827 tree-based or trie-based containers). Moreover, these methods seem
828 similar to constructors taking a range given by a pair of
829 iterators; the constructors, however, are transactional, whereas
830 the insert methods are not; this is possibly confusing.
833 </section> <!-- insert -->
835 <section xml:id="motivation.associative.functions.compare">
838 <function>operator==</function> and <function>operator<=</function>
Associative containers are parametrized by policies that allow
testing key equivalence: a hash-based container can do this through
845 its equivalence functor, and a tree-based container can do this
846 through its comparison functor. In addition, some standard
847 associative containers have global function operators, like
848 <function>operator==</function> and <function>operator<=</function>,
849 that allow comparing entire associative containers.
853 In our opinion, these functions are better left out. To begin
854 with, they do not significantly improve over an external
855 loop. More importantly, however, they are possibly misleading -
856 <function>operator==</function>, for example, usually checks for
857 equivalence, or interchangeability, but the associative
858 container cannot check for values' equivalence, only keys'
equivalence; also, are two containers considered equivalent if
they store the same values in different order? This is an
863 </section> <!-- compare -->
865 </section> <!-- functional -->
867 </section> <!--associative-->
869 <section xml:id="pbds.intro.motivation.priority_queue">
870 <info><title>Priority Queues</title></info>
872 <section xml:id="motivation.priority_queue.policy">
873 <info><title>Policy Choices</title></info>
876 Priority queues are containers that allow efficiently inserting
877 values and accessing the maximal value (in the sense of the
878 container's comparison functor). Their interface
879 supports <function>push</function>
and <function>pop</function>. The standard
container <classname>std::priority_queue</classname> indeed supports
these methods, but little else. For algorithmic and
883 software-engineering purposes, other methods are needed:
889 Many graph algorithms (see
890 <xref linkend="biblio.clrs2001"/>) require increasing a
891 value in a priority queue (again, in the sense of the
892 container's comparison functor), or joining two
893 priority-queue objects.
898 <para>The return type of <classname>priority_queue</classname>'s
899 <function>push</function> method is a point-type iterator, which can
900 be used for modifying or erasing arbitrary values. For
903 priority_queue<int> p;
904 priority_queue<int>::point_iterator it = p.push(3);
908 <para>These types of cross-referencing operations are necessary
909 for making priority queues useful for different applications,
910 especially graph applications.</para>
915 It is sometimes necessary to erase an arbitrary value in a
916 priority queue. For example, consider
917 the <function>select</function> function for monitoring
923 select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds,
924 struct timeval *timeout);
927 then, as the select documentation states:
931 The nfds argument specifies the range of file
932 descriptors to be tested. The select() function tests file
933 descriptors in the range of 0 to nfds-1.</quote>
937 It stands to reason, therefore, that we might wish to
938 maintain a minimal value for <varname>nfds</varname>, and
939 priority queues immediately come to mind. Note, though, that
when a socket is closed, the minimal file descriptor might
941 change; in the absence of an efficient means to erase an
942 arbitrary value from a priority queue, we might as well
943 avoid its use altogether.
947 The standard containers typically support iterators. It is
949 for <classname>std::priority_queue</classname> to omit them
(See <xref linkend="biblio.meyers01stl"/>). One might
ask why priority queues need to support iterators, since
952 they are self-organizing containers with a different purpose
953 than abstracting sequences. There are several reasons:
958 Iterators (even in self-organizing containers) are
959 useful for many purposes: cross-referencing
960 containers, serialization, and debugging code that uses
967 The standard library's hash-based containers support
968 iterators, even though they too are self-organizing
969 containers with a different purpose than abstracting
976 In standard-library-like containers, it is natural to specify the
977 interface of operations for modifying a value or erasing
a value (discussed previously) in terms of iterators.
979 It should be noted that the standard
980 containers also use iterators for accessing and
981 manipulating a specific value. In hash-based
982 containers, one checks the existence of a key by
983 comparing the iterator returned by <function>find</function> to the
984 iterator returned by <function>end</function>, and not by comparing a
985 pointer returned by <function>find</function> to <type>NULL</type>.
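The modify and erase operations discussed in the points above can be sketched with this library's priority queue, assuming GCC's libstdc++ with its <code>ext/pb_ds</code> headers. The typedef <code>pq_t</code>, the function name and the values are illustrative; a pairing heap is chosen here because its point iterators stay valid while other values are pushed.

```cpp
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>

typedef __gnu_pbds::priority_queue<int, std::less<int>,
                                   __gnu_pbds::pairing_heap_tag> pq_t;

// Push three values, change one and erase another through the point
// iterators returned by push, then return the resulting top value
// (the maximum under std::less).
inline int top_after_modify_and_erase()
{
  pq_t p;
  pq_t::point_iterator it3 = p.push(3);
  pq_t::point_iterator it9 = p.push(9);
  p.push(5);

  p.modify(it3, 7);   // 3 becomes 7: contents are now 7, 9, 5
  p.erase(it9);       // remove 9:    contents are now 7, 5
  return p.top();
}
```

Neither operation is available on <classname>std::priority_queue</classname>, which offers no handle to a value once it has been pushed.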
994 <section xml:id="motivation.priority_queue.underlying">
995 <info><title>Underlying Data Structures</title></info>
998 There are three main implementations of priority queues: the
999 first employs a binary heap, typically one which uses a
1000 sequence; the second uses a tree (or forest of trees), which is
1001 typically less structured than an associative container's tree;
1002 the third simply uses an associative container. These are
1003 shown in the figure below with labels A1 and A2, B, and C.
1007 <title>Underlying Priority Queue Data Structures</title>
1010 <imagedata align="center" format="PNG" scale="100"
1011 fileref="../images/pbds_different_underlying_dss_2.png"/>
1014 <phrase>Underlying Priority Queue Data Structures</phrase>
1020 No single implementation can completely replace any of the
1021 others. Some have better <function>push</function>
1022 and <function>pop</function> amortized performance, some have
1023 better bounded (worst case) response time than others, some
1024 optimize a single method at the expense of others, etc. In
1025 general the "best" implementation is dictated by the specific
1030 As with associative containers, the more implementations
1031 co-exist, the more necessary a traits mechanism is for handling
1032 generic containers safely and efficiently. This is especially
1033 important for priority queues, since the invalidation guarantees
1034 of one of the most useful data structures - binary heaps - are
1035 markedly different from those of most of the others.
1040 <section xml:id="motivation.priority_queue.binary_heap">
1041 <info><title>Binary Heaps</title></info>
1045 Binary heaps are one of the most useful underlying
1046 data structures for priority queues. They are very efficient in
1047 terms of memory (since they don't require per-value structure
1048 metadata), and have the best amortized <function>push</function> and
1049 <function>pop</function> performance for primitive types like
1054 The standard library's <classname>priority_queue</classname>
1055 implements this data structure as an adapter over a sequence,
1057 <classname>std::vector</classname>
1058 or <classname>std::deque</classname>, which correspond to labels
1059 A1 and A2 respectively in the graphic above.
1063 This is indeed an elegant example of the adapter concept and
1064 the algorithm/container/iterator decomposition. (See <xref linkend="biblio.nelson96stlpq"/>). There are
1065 several reasons why a binary-heap priority queue
1066 may be better implemented as a container instead of a
1073 <classname>std::priority_queue</classname> cannot erase values
1074 from its adapted sequence (irrespective of the sequence
1075 type). This means that the memory use of
1076 an <classname>std::priority_queue</classname> object is always
1077 proportional to the maximal number of values it ever contained,
1078 and not to the number of values that it currently
1079 contains. (See <filename>performance/priority_queue_text_pop_mem_usage.cc</filename>.)
1080 This implementation of binary heaps acts very differently than
1081 other underlying data structures (See also pairing heaps).
1087 Some combinations of adapted sequences and value types
1088 are very inefficient or just don't make sense. If one uses
1089 <classname>std::priority_queue<std::string,
1090 std::vector<std::string> ></classname>, for example, then not only will each
1091 operation perform a logarithmic number of
1092 <classname>std::string</classname> assignments, but, furthermore, any
1093 operation (including <function>pop</function>) can render the container
1094 useless due to exceptions. Conversely, if one uses
1095 <classname>std::priority_queue<int, std::deque<int>
1096 ></classname>, then each operation incurs a logarithmic
1097 number of indirect accesses (through pointers) unnecessarily.
1098 It might be better to let the container make a conservative
1099 deduction whether to use the structure in the graphic above, labels A1 or A2.
1105 There does not seem to be a systematic way to determine
1106 what exactly can be done with the priority queue.
1111 If <varname>p</varname> is a priority queue adapting an
1112 <classname>std::vector</classname>, then it is possible to iterate over
1113 all values by using <function>&p.top()</function> and
1114 <function>&p.top() + p.size()</function>, but this will not work
1115 if <varname>p</varname> is adapting an <classname>std::deque</classname>; in any
1116 case, one cannot use <function>p.begin()</function> and
1117 <function>p.end()</function>. If a different sequence is adapted, it
1118 is even more difficult to determine what can be
1125 If <varname>p</varname> is a priority queue adapting an
1126 <classname>std::deque</classname>, then the reference returned by
1132 will remain valid until it is popped,
1133 but if <varname>p</varname> adapts an <classname>std::vector</classname>, the
1134 next <function>push</function> will invalidate it. If a different
1135 sequence is adapted, it is even more difficult to
1136 determine what can be done.
1144 Sequence-based binary heaps can still implement
1145 linear-time <function>erase</function> and <function>modify</function> operations.
1146 This means that if one needs to erase a small
1147 (say logarithmic) number of values, then one might still
1148 choose this underlying data structure. Using
1149 <classname>std::priority_queue</classname>, however, this will generally
1150 change the order of growth of the entire sequence of
1158 </section> <!-- goals/motivation -->
1159 </section> <!-- intro -->
1162 <section xml:id="containers.pbds.using">
1163 <info><title>Using</title></info>
1164 <?dbhtml filename="policy_data_structures_using.html"?>
1166 <section xml:id="pbds.using.prereq">
1167 <info><title>Prerequisites</title></info>
1169 <para>The library contains only header files, and does not require any
1170 other libraries except the standard C++ library. All classes are
1171 defined in namespace <code>__gnu_pbds</code>. The library internally
1172 uses macros beginning with <code>PB_DS</code>, but
1173 <code>#undef</code>s anything it <code>#define</code>s (except for
1174 header guards). Compiling the library in an environment where macros
1175 beginning with <code>PB_DS</code> are already defined may yield unpredictable
1176 results in compilation, execution, or both.</para>
1179 Further dependencies are necessary to create the visual output
1180 for the performance tests. To create these graphs, an
1181 additional package is needed: <command>pychart</command>.
1185 <section xml:id="pbds.using.organization">
1186 <info><title>Organization</title></info>
1189 The various data structures are organized as follows.
1201 <classname>basic_branch</classname>
1202 is an abstract base class for branch-based
1203 associative-containers
1209 <classname>tree</classname>
1210 is a concrete base class for tree-based
1211 associative-containers
1217 <classname>trie</classname>
1218 is a concrete base class for trie-based
1219 associative-containers
1232 <classname>basic_hash_table</classname>
1233 is an abstract base class for hash-based
1234 associative-containers
1240 <classname>cc_hash_table</classname>
1241 is a concrete collision-chaining hash-based
1242 associative container
1248 <classname>gp_hash_table</classname>
1249 is a concrete (general) probing hash-based
1250 associative container
1263 <classname>list_update</classname>
1264 list-based update-policy associative container
1276 <classname>priority_queue</classname>
1285 The hierarchy is composed naturally so that commonality is
1286 captured by base classes. Thus <function>operator[]</function>
1287 is defined at the base of any hierarchy, since all derived
1288 containers support it. Conversely <function>split</function> is
1289 defined in <classname>basic_branch</classname>, since only
1290 tree-like containers support it.
1294 In addition, there are the following diagnostics classes,
1295 used to report errors specific to this library's data
1300 <title>Exception Hierarchy</title>
1303 <imagedata align="center" format="PDF" scale="75"
1304 fileref="../images/pbds_exception_hierarchy.pdf"/>
1307 <imagedata align="center" format="PNG" scale="100"
1308 fileref="../images/pbds_exception_hierarchy.png"/>
1311 <phrase>Exception Hierarchy</phrase>
1318 <section xml:id="pbds.using.tutorial">
1319 <info><title>Tutorial</title></info>
1321 <section xml:id="pbds.using.tutorial.basic">
1322 <info><title>Basic Use</title></info>
1325 For the most part, the policy-based containers in
1326 namespace <literal>__gnu_pbds</literal> have the same interface as
1327 the equivalent containers in the standard C++ library, except for
1328 the names used for the container classes themselves. For example,
1329 this shows basic operations on a collision-chaining hash-based
1333 #include <ext/pb_ds/assoc_container.hpp>
1337 __gnu_pbds::cc_hash_table<int, char> c;
1339 assert(c.find(1) == c.end());
1344 The container is called
1345 <classname>__gnu_pbds::cc_hash_table</classname> instead of
1346 <classname>std::unordered_map</classname>, since <quote>unordered
1347 map</quote> does not necessarily mean a hash-based map as implied by
1348 the C++ library (C++11 or TR1). For example, list-based associative
1349 containers, which are very useful for the construction of
1350 "multimaps," are also unordered.
1353 <para>This snippet shows a red-black tree based container:</para>
1356 #include <ext/pb_ds/assoc_container.hpp>
1360 __gnu_pbds::tree<int, char> c;
1362 assert(c.find(2) != c.end());
1366 <para>The container is called <classname>tree</classname> instead of
1367 <classname>map</classname> since the underlying data structures are
1368 being named with specificity.
1372 The member function naming convention is to strive to be the same as
1373 the equivalent member functions in other C++ standard library
1374 containers. The familiar methods are unchanged:
1375 <function>begin</function>, <function>end</function>,
1376 <function>size</function>, <function>empty</function>, and
1377 <function>clear</function>.
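<para>As a minimal sketch of these familiar members in use, the following function fills a tree-based map, walks it with <function>begin</function> and <function>end</function>, and then empties it. The function name <function>keys_in_order_demo</function> is hypothetical and exists only for this illustration; the header name assumes a GCC installation that ships <filename>ext/pb_ds</filename>.</para>

```cpp
#include <cassert>
#include <string>
#include <ext/pb_ds/assoc_container.hpp>

// Hypothetical helper: builds a small tree-based map and returns the
// mapped characters in iteration order.
std::string keys_in_order_demo()
{
  typedef __gnu_pbds::tree<int, char> map_t;

  map_t c;
  assert(c.empty());

  // Keys are inserted out of order on purpose.
  c[3] = 'c';
  c[1] = 'a';
  c[2] = 'b';
  assert(c.size() == 3);

  // A tree-based container is iterated in key order.
  std::string out;
  for (map_t::iterator it = c.begin(); it != c.end(); ++it)
    out += it->second;

  c.clear();
  assert(c.empty());
  return out;
}
```

<para>Calling the function from <function>main</function> and printing the result is enough to try it out.</para>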
1381 This isn't to say that things are exactly as one would expect, given
1382 the container requirements and interfaces in the C++ standard.
1386 The names of containers' policies and policy accessors are
1387 different from the usual. For example, if <type>hash_type</type> is
1388 some type of hash-based container, then</para>
1395 gives the type of its hash functor, and if <varname>obj</varname> is
1396 some hash-based container object, then
1403 <para>will return a reference to its hash-functor object.</para>
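<para>A small sketch of both the nested type and the accessor, assuming a collision-chaining table. The functor <classname>identity_hash</classname> and the helper <function>hash_through_accessor</function> are hypothetical names introduced for this illustration.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <ext/pb_ds/assoc_container.hpp>

// Hypothetical hash functor used only for this illustration.
struct identity_hash
{
  std::size_t operator()(int key) const
  { return static_cast<std::size_t>(key); }
};

typedef __gnu_pbds::cc_hash_table<int, char, identity_hash> hash_type;

// hash_type::hash_fn names the type of the hash functor.
static_assert(std::is_same<hash_type::hash_fn, identity_hash>::value,
              "hash_fn is the functor the table was instantiated with");

// get_hash_fn() returns a reference to the functor object held by obj.
std::size_t hash_through_accessor(hash_type& obj, int key)
{
  hash_type::hash_fn& h = obj.get_hash_fn();
  return h(key);
}
```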
1407 Similarly, if <type>tree_type</type> is some type of tree-based
1416 gives the type of its comparison functor, and if
1417 <varname>obj</varname> is some tree-based container object,
1425 <para>will return a reference to its comparison-functor object.</para>
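<para>The tree-based counterpart can be sketched the same way. Here the tree is instantiated with <classname>std::greater</classname> so that the effect of the comparison functor is visible; <function>ordered_before</function> is a hypothetical helper.</para>

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::tree<int, char, std::greater<int> > tree_type;

// tree_type::cmp_fn names the comparison functor type, and
// get_cmp_fn() returns a reference to the functor object held by obj.
bool ordered_before(tree_type& obj, int a, int b)
{
  tree_type::cmp_fn& cmp = obj.get_cmp_fn();
  return cmp(a, b);
}
```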
1428 It would be nice to give names consistent with those in the existing
1429 C++ standard (inclusive of TR1). Unfortunately, these standard
1430 containers don't consistently name types and methods. For example,
1431 <classname>std::tr1::unordered_map</classname> uses
1432 <type>hasher</type> for the hash functor, but
1433 <classname>std::map</classname> uses <type>key_compare</type> for
1434 the comparison functor. The accessors are named inconsistently as well:
1435 <classname>std::tr1::unordered_map</classname> uses <function>hash_function</function>
1436 for accessing the hash functor, but <classname>std::map</classname> uses
1437 <function>key_comp</function> for accessing the comparison functor.
1441 Instead, <literal>__gnu_pbds</literal> attempts to be internally
1442 consistent, and uses standard-derived terminology if possible.
1446 Another source of difference is in scope:
1447 <literal>__gnu_pbds</literal> contains more types of associative
1448 containers than the standard C++ library, and more opportunities
1449 to configure these new containers, since different types of
1450 associative containers are useful in different settings.
1454 Namespace <literal>__gnu_pbds</literal> contains different classes for
1455 hash-based containers, tree-based containers, trie-based containers,
1456 and list-based containers.
1460 Since associative containers share parts of their interface, they
1461 are organized as a class hierarchy.
1464 <para>Each type or method is defined in the most-common ancestor
1465 in which it makes sense.
1468 <para>For example, all associative containers support iteration
1469 expressed in the following form:
1487 But not all containers contain or use hash functors. Yet, both
1488 collision-chaining and (general) probing hash-based associative
1489 containers have a hash functor, so
1490 <classname>basic_hash_table</classname> contains the interface:
1495 get_hash_fn() const;
1502 so all hash-based associative containers inherit the same
1503 hash-functor accessor methods.
1506 </section> <!--basic use -->
1508 <section xml:id="pbds.using.tutorial.configuring">
1511 Configuring via Template Parameters
1516 In general, each of this library's containers is
1517 parametrized by more policies than those of the standard library. For
1518 example, the standard hash-based container is parametrized as
1522 template<typename Key, typename Mapped, typename Hash,
1523 typename Pred, typename Allocator, bool Cache_Hash_Code>
1524 class unordered_map;
1528 and so can be configured by key type, mapped type, a functor
1529 that translates keys to unsigned integral types, an equivalence
1530 predicate, an allocator, and an indicator whether to store hash
1531 values with each entry. This library's collision-chaining
1532 hash-based container is parametrized as
1535 template<typename Key, typename Mapped, typename Hash_Fn,
1536 typename Eq_Fn, typename Comb_Hash_Fn,
1537 typename Resize_Policy, bool Store_Hash,
1538 typename Allocator>
1539 class cc_hash_table;
1543 and so can be configured by the first four types of
1544 <classname>std::tr1::unordered_map</classname>, then a
1545 policy for translating the key-hash result into a position
1546 within the table, then a policy by which the table resizes,
1547 an indicator whether to store hash values with each entry,
1548 and an allocator (which is typically the last template
1549 parameter in standard containers).
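<para>As an illustration of overriding a single policy, the sketch below replaces the default range-hashing policy with a modulo-based one and leaves the remaining parameters at their defaults. The typedef <type>mod_table</type> and the function <function>count_multiples_of_16</function> are hypothetical; <classname>direct_mod_range_hashing</classname> is assumed to be available from <filename>ext/pb_ds/hash_policy.hpp</filename>.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Key, mapped, hash, and equality types are given explicitly; the fifth
// argument selects a modulo range-hashing policy. The resize policy,
// the store-hash flag, and the allocator keep their default values.
typedef __gnu_pbds::cc_hash_table<
  int, char,
  std::hash<int>,
  std::equal_to<int>,
  __gnu_pbds::direct_mod_range_hashing<> > mod_table;

// Hypothetical helper: inserts n keys that all share their low four bits
// and returns the number of stored entries.
std::size_t count_multiples_of_16(int n)
{
  mod_table t;
  for (int i = 0; i < n; ++i)
    t[16 * i] = 'x';
  return t.size();
}
```

<para>Keys that share their low bits are the kind of skewed input for which a modulo policy may behave differently from a mask policy; which one performs better depends on the hash functor and the key distribution.</para>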
1553 Nearly all policy parameters have default values, so this
1554 need not be considered for casual use. It is important to
1555 note, however, that hash-based containers' policies can
1556 dramatically alter their performance in different settings,
1557 and that tree-based containers' policies can make them
1558 useful for other purposes than just look-up.
1562 <para>As opposed to associative containers, priority queues have
1563 relatively few configuration options. The priority queue is
1564 parametrized as follows:</para>
1566 template<typename Value_Type, typename Cmp_Fn, typename Tag,
1567 typename Allocator>
1568 class priority_queue;
1571 <para>The <classname>Value_Type</classname>, <classname>Cmp_Fn</classname>, and
1572 <classname>Allocator</classname> parameters are the container's value type,
1573 comparison-functor type, and allocator type, respectively;
1574 these are very similar to the standard's priority queue. The
1575 <classname>Tag</classname> parameter is different: there are a number of
1576 pre-defined tag types corresponding to binary heaps, binomial
1577 heaps, etc., and <classname>Tag</classname> should be instantiated
1578 by one of them.</para>
1580 <para>Note that as opposed to the
1581 <classname>std::priority_queue</classname>,
1582 <classname>__gnu_pbds::priority_queue</classname> is not a
1583 sequence-adapter; it is a regular container.</para>
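<para>A brief sketch of a priority queue instantiated with an explicit tag. It assumes the pairing-heap tag <classname>pairing_heap_tag</classname> and uses the iterator returned by <function>push</function> to modify a value in place; <function>top_after_modify</function> is a hypothetical name.</para>

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>

typedef __gnu_pbds::priority_queue<
  int, std::less<int>, __gnu_pbds::pairing_heap_tag> pq_type;

// Pushes three values, lowers the largest one through its iterator,
// and returns the new top.
int top_after_modify()
{
  pq_type p;
  p.push(3);
  pq_type::point_iterator it = p.push(7);  // iterator to the value 7
  p.push(5);
  assert(p.top() == 7);

  p.modify(it, 1);        // the value 7 becomes 1
  assert(p.size() == 3);
  return p.top();
}
```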
1587 <section xml:id="pbds.using.tutorial.traits">
1590 Querying Container Attributes
1595 <para>A container's underlying data structure
1596 affects its performance; unfortunately, it can also affect
1597 its interface. When manipulating associative containers
1598 generically, it is often useful to be able to statically
1599 determine what they can support and what they cannot.
1602 <para>Happily, the standard provides a good solution to a similar
1603 problem - that of the different behavior of iterators. If
1604 <classname>It</classname> is an iterator, then
1607 typename std::iterator_traits<It>::iterator_category
1610 <para>is one of a small number of pre-defined tag classes, and
1613 typename std::iterator_traits<It>::value_type
1616 <para>is the value type to which the iterator "points".</para>
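<para>For readers less familiar with this idiom, here is a short sketch of dispatching on the iterator category using only the standard library. The names <function>is_random_access</function> and <function>is_random_access_impl</function> are hypothetical.</para>

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

// Overloads selected by the iterator's category tag.
inline bool is_random_access_impl(std::random_access_iterator_tag)
{ return true; }

inline bool is_random_access_impl(std::input_iterator_tag)
{ return false; }

// Looks up the category through iterator_traits and dispatches on it.
template<typename It>
bool is_random_access(It)
{
  typedef typename std::iterator_traits<It>::iterator_category category;
  return is_random_access_impl(category());
}
```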
1619 Similarly, in this library, if <type>C</type> is a
1620 container, then <classname>container_traits</classname> is a
1621 trait class that stores information about the kind of
1622 container that is implemented.
1625 typename container_traits<C>::container_category
1628 is one of a small number of predefined tag structures that
1629 uniquely identifies the type of underlying data structure.
1632 <para>In most cases, however, the exact underlying data
1633 structure is not really important, but what is important is
1634 one of its other attributes: whether it guarantees storing
1635 elements by key order, for example. For this one can
1638 typename container_traits<C>::order_preserving
1644 typename container_traits<C>::invalidation_guarantee
1647 <para>is the container's invalidation guarantee. Invalidation
1648 guarantees are especially important regarding priority queues,
1649 since in this library's design, iterators are practically the
1650 only way to manipulate them.</para>
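<para>The queries above can be sketched for two concrete containers as follows. The helper names are hypothetical, and the tag names <classname>rb_tree_tag</classname>, <classname>cc_hash_tag</classname>, and <classname>range_invalidation_guarantee</classname> are assumed to be the ones declared in <filename>ext/pb_ds/tag_and_trait.hpp</filename>.</para>

```cpp
#include <cassert>
#include <type_traits>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

typedef __gnu_pbds::tree<int, char>          tree_map;
typedef __gnu_pbds::cc_hash_table<int, char> hash_map;

// True if the container stores its elements in key order.
template<typename C>
bool keeps_key_order()
{ return __gnu_pbds::container_traits<C>::order_preserving; }

// True if the container's underlying data structure is identified by Tag.
template<typename C, typename Tag>
bool has_category()
{
  typedef typename __gnu_pbds::container_traits<C>::container_category cat;
  return std::is_same<cat, Tag>::value;
}

// True if the container offers the strongest (range) guarantee.
template<typename C>
bool has_range_guarantee()
{
  typedef typename __gnu_pbds::container_traits<C>::invalidation_guarantee g;
  return std::is_same<g, __gnu_pbds::range_invalidation_guarantee>::value;
}
```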
1653 <section xml:id="pbds.using.tutorial.point_range_iteration">
1656 Point and Range Iteration
1661 <para>This library differentiates between two types of methods
1662 and iterators: point-type, and range-type. For example,
1663 <function>find</function> and <function>insert</function> are point-type methods, since
1664 they each deal with a specific element; their returned
1665 iterators are point-type iterators. <function>begin</function> and
1666 <function>end</function> are range-type methods, since they are not used to
1667 find a specific element, but rather to go over all elements in
1668 a container object; their returned iterators are range-type
1672 <para>Most containers store elements in an order that is
1673 determined by their interface. Correspondingly, it is fine that
1674 their point-type iterators are synonymous with their range-type
1675 iterators. For example, in the following snippet
1678 std::for_each(c.find(1), c.find(5), foo);
1681 two point-type iterators (returned by <function>find</function>) are used
1682 for a range-type purpose - going over all elements whose key is
1687 Conversely, the above snippet makes no sense for
1688 self-organizing containers - ones that order (and reorder)
1689 their elements by implementation. It would be nice to have a
1690 uniform iterator system that would allow the above snippet to
1691 compile only if it made sense.
1695 This could trivially be done by specializing
1696 <function>std::for_each</function> for the case of iterators returned by
1697 <classname>std::tr1::unordered_map</classname>, but this would only solve the
1698 problem for one algorithm and one container. Fundamentally, the
1699 problem is that one can loop using a self-organizing
1700 container's point-type iterators.
1704 This library's containers define two families of
1705 iterators: <type>point_const_iterator</type> and
1706 <type>point_iterator</type> are the iterator types returned by
1707 point-type methods; <type>const_iterator</type> and
1708 <type>iterator</type> are the iterator types returned by range-type
1712 class <- some container ->
1717 typedef <- something -> const_iterator;
1719 typedef <- something -> iterator;
1721 typedef <- something -> point_const_iterator;
1723 typedef <- something -> point_iterator;
1730 const_iterator begin () const;
1734 point_const_iterator find(...) const;
1736 point_iterator find(...);
1741 For containers whose interface defines sequence order, it
1742 is very simple: point-type and range-type iterators are exactly
1743 the same, which means that the above snippet will compile if it
1744 is used for an order-preserving associative container.
1748 For self-organizing containers, however, (hash-based
1749 containers as a special example), the preceding snippet will
1750 not compile, because their point-type iterators do not support
1751 <function>operator++</function>.
1754 <para>In any case, both for order-preserving and self-organizing
1755 containers, the following snippet will compile:
1758 typename Cntnr::point_iterator it = c.find(2);
1762 because a range-type iterator can always be converted to a
1763 point-type iterator.
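<para>A sketch of generic code written against <type>point_iterator</type> only, so that it works for both kinds of container. The helper <function>lookup_or</function> is hypothetical and assumes a container that maps <type>int</type> keys to <type>char</type> values.</para>

```cpp
#include <cassert>
#include <ext/pb_ds/assoc_container.hpp>

// Returns the value mapped to key, or fallback if the key is absent.
// Only find, end, and point_iterator are used, so no assumption is
// made about whether the container preserves order.
template<typename Cntnr>
char lookup_or(Cntnr& c, int key, char fallback)
{
  typename Cntnr::point_iterator it = c.find(key);
  return it == c.end() ? fallback : it->second;
}
```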
1766 <para>Distinguishing between iterator types also
1767 raises the point that a container's iterators might have
1768 different invalidation rules concerning their de-referencing
1769 abilities and movement abilities. This now corresponds exactly
1770 to the question of whether point-type and range-type iterators
1771 are valid. As explained above, <classname>container_traits</classname> allows
1772 querying a container for its data structure attributes. The
1773 iterator-invalidation guarantees are certainly a property of
1774 the underlying data structure, and so
1777 container_traits<C>::invalidation_guarantee
1781 gives one of three pre-determined types that answer this
1786 </section> <!-- tutorial -->
1788 <section xml:id="pbds.using.examples">
1789 <info><title>Examples</title></info>
1791 Additional code examples are provided in the source
1792 distribution, as part of the regression and performance
1796 <section xml:id="pbds.using.examples.basic">
1797 <info><title>Intermediate Use</title></info>
1803 <filename>basic_map.cc</filename>
1810 <filename>basic_set.cc</filename>
1816 Conditionally erasing values from an associative container object:
1817 <filename>erase_if.cc</filename>
1823 Basic use of multimaps:
1824 <filename>basic_multimap.cc</filename>
1830 Basic use of multisets:
1831 <filename>basic_multiset.cc</filename>
1837 Basic use of priority queues:
1838 <filename>basic_priority_queue.cc</filename>
1844 Splitting and joining priority queues:
1845 <filename>priority_queue_split_join.cc</filename>
1851 Conditionally erasing values from a priority queue:
1852 <filename>priority_queue_erase_if.cc</filename>
1859 <section xml:id="pbds.using.examples.query">
1860 <info><title>Querying with <classname>container_traits</classname> </title></info>
1864 Using <classname>container_traits</classname> to query
1865 about underlying data structure behavior:
1866 <filename>assoc_container_traits.cc</filename>
1872 A non-compiling example showing wrong use of finding keys in
1873 hash-based containers: <filename>hash_find_neg.cc</filename>
1878 Using <classname>container_traits</classname>
1879 to query about underlying data structure behavior:
1880 <filename>priority_queue_container_traits.cc</filename>
1888 <section xml:id="pbds.using.examples.container">
1889 <info><title>By Container Method</title></info>
1892 <section xml:id="pbds.using.examples.container.hash">
1893 <info><title>Hash-Based</title></info>
1895 <section xml:id="pbds.using.examples.container.hash.resize">
1896 <info><title>size Related</title></info>
1901 Setting the initial size of a hash-based container
1903 <filename>hash_initial_size.cc</filename>
1909 A non-compiling example showing how not to resize a
1910 hash-based container object:
1911 <filename>hash_resize_neg.cc</filename>
1917 Resizing a hash-based container object:
1918 <filename>hash_resize.cc</filename>
1924 Showing an illegal resize of a hash-based container
1926 <filename>hash_illegal_resize.cc</filename>
1932 Changing the load factors of a hash-based container
1933 object: <filename>hash_load_set_change.cc</filename>
1939 <section xml:id="pbds.using.examples.container.hash.hashor">
1940 <info><title>Hashing Function Related</title></info>
1946 Using a modulo range-hashing function for the case of an
1947 unknown skewed key distribution:
1948 <filename>hash_mod.cc</filename>
1954 Writing a range-hashing functor for the case of a known
1955 skewed key distribution:
1956 <filename>shift_mask.cc</filename>
1962 Storing the hash value along with each key:
1963 <filename>store_hash.cc</filename>
1969 Writing a ranged-hash functor:
1970 <filename>ranged_hash.cc</filename>
1979 <section xml:id="pbds.using.examples.container.branch">
1980 <info><title>Branch-Based</title></info>
1983 <section xml:id="pbds.using.examples.container.branch.split">
1984 <info><title>split or join Related</title></info>
1989 Joining two tree-based container objects:
1990 <filename>tree_join.cc</filename>
1996 Splitting a PATRICIA trie container object:
1997 <filename>trie_split.cc</filename>
2003 Order statistics while joining two tree-based container
2005 <filename>tree_order_statistics_join.cc</filename>
2012 <section xml:id="pbds.using.examples.container.branch.invariants">
2013 <info><title>Node Invariants</title></info>
2018 Using trees for order statistics:
2019 <filename>tree_order_statistics.cc</filename>
2025 Augmenting trees to support operations on line
2027 <filename>tree_intervals.cc</filename>
2034 <section xml:id="pbds.using.examples.container.branch.trie">
2035 <info><title>trie</title></info>
2039 Using a PATRICIA trie for DNA strings:
2040 <filename>trie_dna.cc</filename>
2047 trie for finding all entries whose key matches a given prefix:
2048 <filename>trie_prefix_search.cc</filename>
2057 <section xml:id="pbds.using.examples.container.priority_queue">
2058 <info><title>Priority Queues</title></info>
2062 Cross referencing an associative container and a priority
2063 queue: <filename>priority_queue_xref.cc</filename>
2069 Cross referencing a vector and a priority queue using a
2070 very simple version of Dijkstra's shortest path
2072 <filename>priority_queue_dijkstra.cc</filename>
2084 </section> <!-- using -->
2086 <!-- S03: Design -->
2089 <section xml:id="containers.pbds.design">
2090 <info><title>Design</title></info>
2091 <?dbhtml filename="policy_data_structures_design.html"?>
2094 <section xml:id="pbds.design.concepts">
2095 <info><title>Concepts</title></info>
2097 <section xml:id="pbds.design.concepts.null_type">
2098 <info><title>Null Policy Classes</title></info>
2101 Associative containers are typically parametrized by various
2102 policies. For example, a hash-based associative container is
2103 parametrized by a hash-functor, transforming each key into a
2104 non-negative numerical type. Each such value is then further mapped
2105 into a position within the table. The mapping of a key into a
2106 position within the table is therefore a two-step process.
2110 In some cases, instantiations are redundant. For example, when the
2111 keys are integers, it is possible to use a redundant hash policy,
2112 which transforms each key into its value.
2116 In some other cases, these policies are irrelevant. For example, a
2117 hash-based associative container might transform keys into positions
2118 within a table by a different method than the two-step method
2119 described above. In such a case, the hash functor is simply
2124 When a policy is either redundant or irrelevant, it can be replaced
2125 by <classname>null_type</classname>.
2129 For example, a <emphasis>set</emphasis> is an associative
2130 container with one of its template parameters (the one for the
2131 mapped type) replaced with <classname>null_type</classname>. Other
2132 places where this technique makes simplifications possible
2133 include node updates in tree and trie data structures, and hash
2134 and probe functions for hash data structures.
2138 <section xml:id="pbds.design.concepts.associative_semantics">
2139 <info><title>Map and Set Semantics</title></info>
2141 <section xml:id="concepts.associative_semantics.set_vs_map">
2144 Distinguishing Between Maps and Sets
2149 Anyone familiar with the standard knows that there are four kinds
2150 of associative containers: maps, sets, multimaps, and
2151 multisets. The map datatype associates each key to
2156 Sets are associative containers that simply store keys -
2157 they do not map them to anything. In the standard, each map class
2158 has a corresponding set class. E.g.,
2159 <classname>std::map<int, char></classname> maps each
2160 <classname>int</classname> to a <classname>char</classname>, but
2161 <classname>std::set<int></classname> simply stores
2162 <classname>int</classname>s. In this library, however, there are no
2163 distinct classes for maps and sets. Instead, an associative
2164 container's <classname>Mapped</classname> template parameter is a policy: if
2165 it is instantiated by <classname>null_type</classname>, then it
2166 is a "set"; otherwise, it is a "map". E.g.,
2169 cc_hash_table<int, char>
2172 is a "map" mapping each <type>int</type> value to a <type>
2176 cc_hash_table<int, null_type>
2179 is a type that uniquely stores <type>int</type> values.
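<para>A minimal sketch of such a "set" in use, assuming <classname>null_type</classname> is available in namespace <literal>__gnu_pbds</literal>. The typedef <type>int_set</type> and the function <function>distinct_count</function> are hypothetical.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

// A hash-based "set": the Mapped parameter is null_type.
typedef __gnu_pbds::cc_hash_table<int, __gnu_pbds::null_type> int_set;

// Counts the distinct values in the range [first, last).
std::size_t distinct_count(const int* first, const int* last)
{
  int_set s;
  for (; first != last; ++first)
    s.insert(*first);   // a key that is already present is not added again
  return s.size();
}
```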
2181 <para>Once the <classname>Mapped</classname> template parameter is instantiated
2182 by <classname>null_type</classname>, then
2183 the "set" acts very similarly to the standard's sets - it does not
2184 map each key to a distinct <classname>null_type</classname> object. Also,
2185 the container's <type>value_type</type> is essentially
2186 its <type>key_type</type> - just as with the standard's sets
2190 The standard's multimaps and multisets allow, respectively,
2191 non-uniquely mapping keys and non-uniquely storing keys. The
2193 reasons why this might be necessary are 1) that a key might be
2194 decomposed into a primary key and a secondary key, 2) that a
2195 key might appear more than once, or 3) any arbitrary
2196 combination of 1)s and 2)s. Correspondingly,
2197 one should use 1) "maps" mapping primary keys to secondary
2198 keys, 2) "maps" mapping keys to size types, or 3) any arbitrary
2199 combination of 1)s and 2)s. Thus, for example, an
2200 <classname>std::multiset<int></classname> might be used to store
2201 multiple instances of integers, but using this library's
2202 containers, one might use
2205 tree<int, size_t>
2209 i.e., a <classname>map</classname> of <type>int</type>s to
2210 <type>size_t</type>s.
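<para>One way this counting "multiset" might be used is sketched below. The typedef <type>int_counts</type> and the helpers <function>add_occurrence</function> and <function>occurrences</function> are hypothetical names.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

// Maps each stored integer to the number of times it was added.
typedef __gnu_pbds::tree<int, std::size_t> int_counts;

// operator[] creates a zero count for a new key, then the count is bumped.
void add_occurrence(int_counts& c, int key)
{ ++c[key]; }

// Returns how many times key was added, without inserting it.
std::size_t occurrences(int_counts& c, int key)
{
  int_counts::point_iterator it = c.find(key);
  return it == c.end() ? 0 : it->second;
}
```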
2213 These "multimaps" and "multisets" might be confusing to
2214 anyone familiar with the standard's <classname>std::multimap</classname> and
2215 <classname>std::multiset</classname>, because there is no clear
2216 correspondence between the two. For example, in some cases
2217 where one uses <classname>std::multiset</classname> in the standard, one might use
2218 in this library a "multimap" of "multisets" - i.e., a
2219 container that maps primary keys each to an associative
2220 container that maps each secondary key to the number of times
2225 When one uses a "multimap," one should choose with care the
2226 type of container used for secondary keys.
2228 </section> <!-- map vs set -->
2231 <section xml:id="concepts.associative_semantics.multi">
2232 <info><title>Alternatives to <classname>std::multiset</classname> and <classname>std::multimap</classname></title></info>
2235 Brace oneself: this library does not contain containers like
2236 <classname>std::multimap</classname> or
2237 <classname>std::multiset</classname>. Instead, these data
2238 structures can be synthesized via manipulation of the
2239 <classname>Mapped</classname> template parameter.
2242 One maps the unique part of a key - the primary key, into an
2243 associative-container of the (originally) non-unique parts of
2244 the key - the secondary key. A primary associative-container
2245 is an associative container of primary keys; a secondary
2246 associative-container is an associative container of
2251 Stepping back a bit, and starting from the beginning.
2256 Maps (or sets) allow mapping (or storing) unique-key values.
2257 The standard library also supplies associative containers which
2258 map (or store) multiple values with equivalent keys:
2259 <classname>std::multimap</classname>, <classname>std::multiset</classname>,
2260 <classname>std::tr1::unordered_multimap</classname>, and
2261 <classname>std::tr1::unordered_multiset</classname>. We first discuss how these might
2262 be used, then why we think it is best to avoid them.
2266 Suppose one builds a simple bank-account application that
2267 records for each client (identified by an <classname>std::string</classname>)
2268 and account-id (marked by an <type>unsigned long</type>) -
2269 the balance in the account (described by a
2270 <type>float</type>). Suppose further that ordering this
2271 information is not useful, so a hash-based container is
2272 preferable to a tree based container. Then one can use
2276 std::tr1::unordered_map<std::pair<std::string, unsigned long>, float, ...>
2280 which hashes every combination of client and account-id. This
2281 might work well, except for the fact that it is now impossible
2282 to efficiently list all of the accounts of a specific client
2283 (this would practically require iterating over all
2284 entries). Instead, one can use
2288 std::tr1::unordered_multimap<std::pair<std::string, unsigned long>, float, ...>
2292 which hashes every client, and decides equivalence based on
2293 client only. This will ensure that all accounts belonging to a
2294 specific user are stored consecutively.
Also, suppose one wants a priority queue of integers
2299 (a container that supports <function>push</function>,
2300 <function>pop</function>, and <function>top</function> operations, the last of which
2301 returns the largest <type>int</type>) that also supports
2302 operations such as <function>find</function> and <function>lower_bound</function>. A
2303 reasonable solution is to build an adapter over
2304 <classname>std::set<int></classname>. In this adapter,
2305 <function>push</function> will just call the tree-based
2306 associative container's <function>insert</function> method; <function>pop</function>
2307 will call its <function>end</function> method, and use it to return the
preceding element (which must be the largest). This might
work well, except that the container object cannot hold
multiple instances of the same integer (<function>push(4)</function>
will be a no-op if <constant>4</constant> is already in the
container object). If duplicate keys are necessary, then one
2313 might build the adapter over an
2314 <classname>std::multiset<int></classname>.
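<para>The adapter described above might be sketched as follows. This is
a minimal illustration using only the standard library; the class
<classname>searchable_int_queue</classname> and its members are hypothetical names
chosen for this sketch and belong neither to this library nor to the standard.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical adapter: a max-priority queue of ints that also supports
// find, built over std::multiset<int> so that duplicates are kept.
class searchable_int_queue
{
public:
  typedef std::multiset<int>::const_iterator const_iterator;

  // Unlike with std::set<int>, pushing an existing value stores another copy.
  void push(int value) { m_values.insert(value); }

  // The largest element is the last one in an in-order walk.
  int top() const
  {
    assert(!m_values.empty());
    return *m_values.rbegin();
  }

  // Remove a single instance of the largest element: the one just before end().
  void pop()
  {
    assert(!m_values.empty());
    m_values.erase(std::prev(m_values.end()));
  }

  const_iterator find(int value) const { return m_values.find(value); }
  const_iterator end() const { return m_values.end(); }
  std::size_t size() const { return m_values.size(); }

private:
  std::multiset<int> m_values;
};
```

<para>With this sketch, pushing <constant>4</constant> twice stores two
instances, and a single <function>pop</function> removes only one of them.</para>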
2318 The standard library's non-unique-mapping containers are useful
when (1) a key can be decomposed into a primary key and a
2320 secondary key, (2) a key is needed multiple times, or (3) any
2321 combination of (1) and (2).
2325 The graphic below shows how the standard library's container
2326 design works internally; in this figure nodes shaded equally
2327 represent equivalent-key values. Equivalent keys are stored
2328 consecutively using the properties of the underlying data
2329 structure: binary search trees (label A) store equivalent-key
2330 values consecutively (in the sense of an in-order walk)
2331 naturally; collision-chaining hash tables (label B) store
equivalent-key values in the same bucket, and the bucket can be
arranged so that equivalent-key values are consecutive.
2337 <title>Non-unique Mapping Standard Containers</title>
2340 <imagedata align="center" format="PNG" scale="100"
2341 fileref="../images/pbds_embedded_lists_1.png"/>
2344 <phrase>Non-unique Mapping Standard Containers</phrase>
Put differently, the standard's non-unique mapping
associative-containers are associative containers that map
primary keys to linked lists that are embedded into the
container. The graphic below shows again the two
containers from the first graphic above, this time with
the embedded linked lists of the grayed nodes marked
explicitly.
2359 <figure xml:id="fig.pbds_embedded_lists_2">
2361 Effect of embedded lists in
2362 <classname>std::multimap</classname>
2366 <imagedata align="center" format="PNG" scale="100"
2367 fileref="../images/pbds_embedded_lists_2.png"/>
2371 Effect of embedded lists in
2372 <classname>std::multimap</classname>
2379 These embedded linked lists have several disadvantages.
2385 The underlying data structure embeds the linked lists
2386 according to its own consideration, which means that the
2387 search path for a value might include several different
equivalent-key values. For example, the search path for
the black node in either label A or label B of the first graphic
includes more than a single gray node.
2396 The links of the linked lists are the underlying data
2397 structures' nodes, which typically are quite structured. In
the case of tree-based containers (the graphic above, label
2399 B), each "link" is actually a node with three pointers (one
2400 to a parent and two to children), and a
2401 relatively-complicated iteration algorithm. The linked
2402 lists, therefore, can take up quite a lot of memory, and
2403 iterating over all values equal to a given key (through the
2404 return value of the standard
library's <function>equal_range</function>) can be
expensive.
The primary key is stored multiple times; this uses more memory.
2418 Finally, the interface of this design excludes several
2419 useful underlying data structures. Of all the unordered
2420 self-organizing data structures, practically only
2421 collision-chaining hash tables can (efficiently) guarantee
2422 that equivalent-key values are stored consecutively.
2428 The above reasons hold even when the ratio of secondary keys to
2429 primary keys (or average number of identical keys) is small, but
2430 when it is large, there are more severe problems:
The underlying data structures order the links inside each
embedded linked list according to their internal
considerations, which effectively means that the links
are unordered. Irrespective of the underlying data
structure, searching for a specific value can degrade to
linear complexity.
As in the previous point, considerations that apply to
primary keys cannot be applied to the secondary keys. For
example, it is not possible to maintain secondary
keys in sorted order.
2456 While the interface "understands" that all equivalent-key
2457 values constitute a distinct list (through
2458 <function>equal_range</function>), the underlying data
2459 structure typically does not. This means that operations such
2460 as erasing from a tree-based container all values whose keys
are equivalent to a given key can be super-linear in the
size of the tree; this is also true for several other
operations that target a specific list.
2470 In this library, all associative containers map
2471 (or store) unique-key values. One can (1) map primary keys to
2472 secondary associative-containers (containers of
secondary keys) or non-associative containers, (2) map identical
2474 keys to a size-type representing the number of times they
2475 occur, or (3) any combination of (1) and (2). Instead of
2476 allowing multiple equivalent-key values, this library
2477 supplies associative containers based on underlying
2478 data structures that are suitable as secondary
2479 associative-containers.
In the figure below, labels A and B show the equivalent
underlying data structures in this library, as mapped to
labels A and B, respectively, of the first graphic above. Each shaded
box represents some size-type or secondary
associative-container.
2491 <title>Non-unique Mapping Containers</title>
2494 <imagedata align="center" format="PNG" scale="100"
2495 fileref="../images/pbds_embedded_lists_3.png"/>
2498 <phrase>Non-unique Mapping Containers</phrase>
2504 In the first example above, then, one would use an associative
2505 container mapping each user to an associative container which
2506 maps each application id to a start time (see
2507 <filename>example/basic_multimap.cc</filename>); in the second
2508 example, one would use an associative container mapping
2509 each <classname>int</classname> to some size-type indicating the
2510 number of times it logically occurs
(see <filename>example/basic_multiset.cc</filename>).
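<para>The two arrangements described above can be illustrated roughly as
follows. This sketch deliberately uses standard containers as stand-ins for
this library's containers, and applies them to the bank-account and integer
examples discussed earlier; all type and function names in it are hypothetical
and are unrelated to the example files named above.</para>

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Primary key (client) -> secondary container (account id -> balance).
// The secondary keys are kept sorted, independently of the primary container.
typedef std::map<unsigned long, float> accounts_of_client;
typedef std::unordered_map<std::string, accounts_of_client> bank;

// Identical keys -> number of logical occurrences.
typedef std::map<int, std::size_t> counted_ints;

inline void add_account(bank& b, const std::string& client,
                        unsigned long id, float balance)
{ b[client][id] = balance; }

// Listing one client's accounts touches only that client's secondary container.
inline std::size_t num_accounts(const bank& b, const std::string& client)
{
  bank::const_iterator it = b.find(client);
  return it == b.end() ? 0 : it->second.size();
}

inline void push(counted_ints& c, int value) { ++c[value]; }

// Remove one logical occurrence; drop the entry when its count reaches zero.
inline void pop_one(counted_ints& c, int value)
{
  counted_ints::iterator it = c.find(value);
  if (it != c.end() && --it->second == 0)
    c.erase(it);
}
```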
2515 See the discussion in list-based container types for containers
2516 especially suited as secondary associative-containers.
2520 </section> <!-- map and set semantics -->
2522 <section xml:id="pbds.design.concepts.iterator_semantics">
2523 <info><title>Iterator Semantics</title></info>
2525 <section xml:id="concepts.iterator_semantics.point_and_range">
2526 <info><title>Point and Range Iterators</title></info>
Iterator concepts are bifurcated in this design, and
comprise point-type and range-type iteration.
2534 A point-type iterator is an iterator that refers to a specific
2535 element as returned through an
2536 associative-container's <function>find</function> method.
2540 A range-type iterator is an iterator that is used to go over a
2541 sequence of elements, as returned by a container's
<function>begin</function> method.
2546 A point-type method is a method that
2547 returns a point-type iterator; a range-type method is a method
2548 that returns a range-type iterator.
2551 <para>For most containers, these types are synonymous; for
2552 self-organizing containers, such as hash-based containers or
2553 priority queues, these are inherently different (in any
2554 implementation, including that of C++ standard library
2555 components), but in this design, it is made explicit. They are
2561 <section xml:id="concepts.iterator_semantics.both">
2562 <info><title>Distinguishing Point and Range Iterators</title></info>
<para>When using this library, it is necessary to differentiate
2565 between two types of methods and iterators: point-type methods and
2566 iterators, and range-type methods and iterators. Each associative
2567 container's interface includes the methods:</para>
<programlisting>
  point_const_iterator
  find(const_key_reference r_key) const;
  point_iterator
  find(const_key_reference r_key);
  std::pair<point_iterator,bool>
  insert(const_reference r_val);
</programlisting>
<para>The relationship between these iterator types varies between
container types. The figure below
shows the most general invariant between point-type and
range-type iterators: <emphasis>A</emphasis> shows that <literal>iterator</literal> can
always be converted to <literal>point_iterator</literal>. <emphasis>B</emphasis>
shows invariants for order-preserving containers: point-type
iterators are synonymous with range-type iterators.
Orthogonally, <emphasis>C</emphasis> shows invariants for "set"
containers: iterators are synonymous with const iterators.</para>
2590 <title>Point Iterator Hierarchy</title>
2593 <imagedata align="center" format="PNG" scale="100"
2594 fileref="../images/pbds_point_iterator_hierarchy.png"/>
2597 <phrase>Point Iterator Hierarchy</phrase>
<para>Note that point-type iterators in self-organizing containers
(hash-based associative containers) lack movement
operators, such as <literal>operator++</literal>; in fact, this
is the reason why this library's design differs from that of the standard C++ library
on this point.</para>
<para>Typically, one can determine an iterator's movement
capabilities using
<literal>std::iterator_traits<It>::iterator_category</literal>,
which is a <literal>struct</literal> indicating the iterator's
movement capabilities. Unfortunately, none of the standard predefined
categories reflects an iterator's <emphasis>not</emphasis> having any
movement capabilities whatsoever. Consequently,
<literal>pb_ds</literal> adds a type
<literal>trivial_iterator_tag</literal> (whose name is taken from
a concept in C++ standardese), which is the category of iterators
with no movement capabilities. All other standard C++ library
tags, such as <literal>forward_iterator_tag</literal>, retain their
usual meaning.</para>
2625 <section xml:id="pbds.design.concepts.invalidation">
2626 <info><title>Invalidation Guarantees</title></info>
2628 If one manipulates a container object, then iterators previously
2629 obtained from it can be invalidated. In some cases a
2630 previously-obtained iterator cannot be de-referenced; in other cases,
2631 the iterator's next or previous element might have changed
unpredictably. This corresponds exactly to the question of whether a
point-type or range-type iterator (see previous concept) is valid or
not. In this design, one can query a container (at compile time) about
its invalidation guarantees.
2640 Given three different types of associative containers, a modifying
2641 operation (in that example, <function>erase</function>) invalidated
2642 iterators in three different ways: the iterator of one container
2643 remained completely valid - it could be de-referenced and
2644 incremented; the iterator of a different container could not even be
2645 de-referenced; the iterator of the third container could be
2646 de-referenced, but its "next" iterator changed unpredictably.
Distinguishing between point and range types allows fine-grained
2651 invalidation guarantees, because these questions correspond exactly
2652 to the question of whether point-type iterators and range-type
2653 iterators are valid. The graphic below shows tags corresponding to
2654 different types of invalidation guarantees.
2658 <title>Invalidation Guarantee Tags Hierarchy</title>
2661 <imagedata align="center" format="PDF" scale="75"
2662 fileref="../images/pbds_invalidation_tag_hierarchy.pdf"/>
2665 <imagedata align="center" format="PNG" scale="100"
2666 fileref="../images/pbds_invalidation_tag_hierarchy.png"/>
2669 <phrase>Invalidation Guarantee Tags Hierarchy</phrase>
2677 <classname>basic_invalidation_guarantee</classname>
2678 corresponds to a basic guarantee that a point-type iterator,
2679 a found pointer, or a found reference, remains valid as long
2680 as the container object is not modified.
2686 <classname>point_invalidation_guarantee</classname>
2687 corresponds to a guarantee that a point-type iterator, a
2688 found pointer, or a found reference, remains valid even if
2689 the container object is modified.
2695 <classname>range_invalidation_guarantee</classname>
2696 corresponds to a guarantee that a range-type iterator remains
2697 valid even if the container object is modified.
2702 <para>To find the invalidation guarantee of a
2703 container, one can use</para>
2705 typename container_traits<Cntnr>::invalidation_guarantee
<para>Note that this hierarchy corresponds to the logic it
represents: if a container has range-invalidation guarantees,
then it must also have point-invalidation guarantees;
correspondingly, its invalidation guarantee (in this case
<classname>range_invalidation_guarantee</classname>)
can be cast to its base class (in this case <classname>point_invalidation_guarantee</classname>).
This means that this hierarchy can be used easily with
standard metaprogramming techniques, by specializing on the
type of <literal>invalidation_guarantee</literal>.</para>
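<para>Dispatching on the invalidation-guarantee tag might look roughly like the
following sketch. It assumes a GCC toolchain whose pb_ds headers expose the
<literal>__gnu_pbds</literal> namespace and <classname>null_type</classname>; the
functions <function>survives</function> and
<function>point_iterators_survive_modification</function> are hypothetical helpers
written for this illustration.</para>

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// Overloads selected by the guarantee tag. A range_invalidation_guarantee
// argument converts to its base class, point_invalidation_guarantee, so it
// selects the second overload.
inline bool survives(__gnu_pbds::basic_invalidation_guarantee) { return false; }
inline bool survives(__gnu_pbds::point_invalidation_guarantee) { return true; }

// Hypothetical helper: do point-type iterators of Cntnr remain valid
// after the container is modified?
template<typename Cntnr>
bool point_iterators_survive_modification()
{
  typedef typename __gnu_pbds::container_traits<Cntnr>::invalidation_guarantee
    guarantee;
  return survives(guarantee());
}
```

<para>Under these assumptions, a red-black tree container selects the second
overload through the base-class conversion, while a probing hash table selects
the first.</para>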
2719 These types of problems were addressed, in a more general
2720 setting, in <xref linkend="biblio.meyers96more"/> - Item 2. In
2721 our opinion, an invalidation-guarantee hierarchy would solve
these problems in all container types - not just associative
containers.
2727 </section> <!-- iterator semantics -->
2729 <section xml:id="pbds.design.concepts.genericity">
2730 <info><title>Genericity</title></info>
2733 The design attempts to address the following problem of
2734 data-structure genericity. When writing a function manipulating
a generic container object, what is the behavior of the object?
Suppose one writes
<programlisting>
  template<typename Cntnr>
  void
  some_op_sequence(Cntnr &r_container)
  {
    // ...
  }
</programlisting>
then one needs to address the following questions in the body
2749 of <function>some_op_sequence</function>:
2755 Which types and methods does <literal>Cntnr</literal> support?
Containers based on hash tables can be queried for the
2757 hash-functor type and object; this is meaningless for tree-based
2758 containers. Containers based on trees can be split, joined, or
2759 can erase iterators and return the following iterator; this
2760 cannot be done by hash-based containers.
2766 What are the exception and invalidation guarantees
2767 of <literal>Cntnr</literal>? A container based on a probing
2768 hash-table invalidates all iterators when it is modified; this
2769 is not the case for containers based on node-based
2770 trees. Containers based on a node-based tree can be split or
2771 joined without exceptions; this is not the case for containers
2772 based on vector-based trees.
How does the container maintain its elements? Tree-based and
trie-based containers store elements by key order; others,
typically, do not. Containers based on splay trees or on lists
with update policies "cache" frequently accessed elements;
containers based on most other underlying data structures do
not.
2788 How does one query a container about characteristics and
2789 capabilities? What is the relationship between two different
data structures, if any?
<para>The remainder of this section explains these issues in
detail.</para>
2799 <section xml:id="concepts.genericity.tag">
2800 <info><title>Tag</title></info>
2802 Tags are very useful for manipulating generic types. For example, if
2803 <literal>It</literal> is an iterator class, then <literal>typename
2804 It::iterator_category</literal> or <literal>typename
2805 std::iterator_traits<It>::iterator_category</literal> will
yield its category, and <literal>typename
std::iterator_traits<It>::value_type</literal> will yield its
value type.
This library contains a container tag hierarchy corresponding to the
diagram below.
2817 <title>Container Tag Hierarchy</title>
2820 <imagedata align="center" format="PDF" scale="75"
2821 fileref="../images/pbds_container_tag_hierarchy.pdf"/>
2824 <imagedata align="center" format="PNG" scale="100"
2825 fileref="../images/pbds_container_tag_hierarchy.png"/>
2828 <phrase>Container Tag Hierarchy</phrase>
2834 Given any container <type>Cntnr</type>, the tag of
2835 the underlying data structure can be found via <literal>typename
2836 Cntnr::container_category</literal>.
2839 </section> <!-- tag -->
2841 <section xml:id="concepts.genericity.traits">
2842 <info><title>Traits</title></info>
<para>Additionally, a traits mechanism can be used to query a
container type for its attributes. Given any container
<literal>Cntnr</literal>, <literal>container_traits<Cntnr></literal>
is a traits class identifying the properties of the
container.</para>
<para>To find if a container can throw when a key is erased (which
is true for vector-based trees, for example), one can
use</para>
2855 <programlisting>container_traits<Cntnr>::erase_can_throw</programlisting>
2858 Some of the definitions in <classname>container_traits</classname>
2859 are dependent on other
2860 definitions. If <classname>container_traits<Cntnr>::order_preserving</classname>
2861 is <constant>true</constant> (which is the case for containers
based on trees and tries), then the container can be split or
joined; in this
2864 case, <classname>container_traits<Cntnr>::split_join_can_throw</classname>
2865 indicates whether splits or joins can throw exceptions (which is
2866 true for vector-based trees);
2867 otherwise <classname>container_traits<Cntnr>::split_join_can_throw</classname>
2868 will yield a compilation error. (This is somewhat similar to a
compile-time version of the COM model.)
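<para>A traits query of this kind might be wrapped as in the sketch below. It
assumes a GCC toolchain whose pb_ds headers expose the
<literal>__gnu_pbds</literal> namespace; <function>keeps_key_order</function> is a
hypothetical helper, not part of the library.</para>

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// Hypothetical helper: true if Cntnr stores its elements in key order,
// as reported by the library's container_traits.
template<typename Cntnr>
bool keeps_key_order()
{
  return __gnu_pbds::container_traits<Cntnr>::order_preserving;
}
```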
2872 </section> <!-- traits -->
2874 </section> <!-- genericity -->
2875 </section> <!-- concepts -->
2877 <section xml:id="pbds.design.container">
2878 <info><title>By Container</title></info>
2881 <section xml:id="pbds.design.container.hash">
2882 <info><title>hash</title></info>
<!--
/// general terms / background
/// range hashing policies
/// ranged-hash policies
/// trigger policies
// policy interactions
/// probe/size/trigger
/// eq/hash/storing hash values
/// size/load-check trigger
-->
2904 <section xml:id="container.hash.interface">
2905 <info><title>Interface</title></info>
2910 The collision-chaining hash-based container has the
2911 following declaration.</para>
<programlisting>
  template<
    typename Key,
    typename Mapped,
    typename Hash_Fn = std::hash<Key>,
    typename Eq_Fn = std::equal_to<Key>,
    typename Comb_Hash_Fn = direct_mask_range_hashing<>,
    typename Resize_Policy = /* default explained below */,
    bool Store_Hash = false,
    typename Allocator = std::allocator<char> >
  class cc_hash_table;
</programlisting>
2925 <para>The parameters have the following meaning:</para>
2928 <listitem><para><classname>Key</classname> is the key type.</para></listitem>
2930 <listitem><para><classname>Mapped</classname> is the mapped-policy.</para></listitem>
2932 <listitem><para><classname>Hash_Fn</classname> is a key hashing functor.</para></listitem>
2934 <listitem><para><classname>Eq_Fn</classname> is a key equivalence functor.</para></listitem>
<listitem><para><classname>Comb_Hash_Fn</classname> is a range-hashing functor;
2937 it describes how to translate hash values into positions
2938 within the table. </para></listitem>
2940 <listitem><para><classname>Resize_Policy</classname> describes how a container object
2941 should change its internal size. </para></listitem>
2943 <listitem><para><classname>Store_Hash</classname> indicates whether the hash value
2944 should be stored with each entry. </para></listitem>
2946 <listitem><para><classname>Allocator</classname> is an allocator
2947 type.</para></listitem>
<para>The probing hash-based container has the following
declaration.</para>
<programlisting>
  template<
    typename Key,
    typename Mapped,
    typename Hash_Fn = std::hash<Key>,
    typename Eq_Fn = std::equal_to<Key>,
    typename Comb_Probe_Fn = direct_mask_range_hashing<>,
    typename Probe_Fn = /* default explained below */,
    typename Resize_Policy = /* default explained below */,
    bool Store_Hash = false,
    typename Allocator = std::allocator<char> >
  class gp_hash_table;
</programlisting>
2966 <para>The parameters are identical to those of the
2967 collision-chaining container, except for the following.</para>
2970 <listitem><para><classname>Comb_Probe_Fn</classname> describes how to transform a probe
2971 sequence into a sequence of positions within the table.</para></listitem>
2973 <listitem><para><classname>Probe_Fn</classname> describes a probe sequence policy.</para></listitem>
2976 <para>Some of the default template values depend on the values of
2977 other parameters, and are explained below.</para>
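<para>As an illustration of supplying some of these parameters explicitly, the
sketch below instantiates a collision-chaining table with a modulo-based
combining functor and leaves the remaining parameters at their defaults. It
assumes a GCC toolchain whose pb_ds headers expose the
<literal>__gnu_pbds</literal> namespace; the names
<classname>word_count_table</classname> and <function>count_word</function> are
hypothetical.</para>

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>
#include <functional>
#include <string>

// Collision-chaining table that maps hash values to positions with a modulo
// operation rather than the default bit mask.
typedef __gnu_pbds::cc_hash_table<
  std::string,                              // Key
  int,                                      // Mapped
  std::hash<std::string>,                   // Hash_Fn
  std::equal_to<std::string>,               // Eq_Fn
  __gnu_pbds::direct_mod_range_hashing<> >  // Comb_Hash_Fn
  word_count_table;

// operator[] inserts a value-initialized (zero) count for a new word.
inline void count_word(word_count_table& t, const std::string& word)
{ ++t[word]; }
```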
2980 <section xml:id="container.hash.details">
2981 <info><title>Details</title></info>
2983 <section xml:id="container.hash.details.hash_policies">
2984 <info><title>Hash Policies</title></info>
2986 <section xml:id="details.hash_policies.general">
2987 <info><title>General</title></info>
2989 <para>Following is an explanation of some functions which hashing
2990 involves. The graphic below illustrates the discussion.</para>
2993 <title>Hash functions, ranged-hash functions, and
2994 range-hashing functions</title>
2997 <imagedata align="center" format="PNG" scale="100"
2998 fileref="../images/pbds_hash_ranged_hash_range_hashing_fns.png"/>
3001 <phrase>Hash functions, ranged-hash functions, and
3002 range-hashing functions</phrase>
3007 <para>Let U be a domain (e.g., the integers, or the
3008 strings of 3 characters). A hash-table algorithm needs to map
3009 elements of U "uniformly" into the range [0,..., m -
3010 1] (where m is a non-negative integral value, and
3011 is, in general, time varying). I.e., the algorithm needs
3012 a ranged-hash function</para>
3015 f : U × Z<subscript>+</subscript> → Z<subscript>+</subscript>
3018 <para>such that for any u in U ,</para>
3020 <para>0 ≤ f(u, m) ≤ m - 1</para>
<para>and which has "good uniformity" properties (see
<xref linkend="biblio.knuth98sorting"/>).
One common solution is to use the composition of the hash
function</para>
3028 <para>h : U → Z<subscript>+</subscript> ,</para>
3030 <para>which maps elements of U into the non-negative
3031 integrals, and</para>
3033 <para>g : Z<subscript>+</subscript> × Z<subscript>+</subscript> →
3034 Z<subscript>+</subscript>,</para>
3036 <para>which maps a non-negative hash value, and a non-negative
3037 range upper-bound into a non-negative integral in the range
3038 between 0 (inclusive) and the range upper bound (exclusive),
3039 i.e., for any r in Z<subscript>+</subscript>,</para>
3041 <para>0 ≤ g(r, m) ≤ m - 1</para>
<para>The resulting ranged-hash function is</para>
3046 <!-- ranged_hash_composed_of_hash_and_range_hashing -->
3048 <title>Ranged Hash Function</title>
3050 f(u , m) = g(h(u), m)
3054 <para>From the above, it is obvious that given g and
3055 h, f can always be composed (however the converse
3056 is not true). The standard's hash-based containers allow specifying
3057 a hash function, and use a hard-wired range-hashing function;
3058 the ranged-hash function is implicitly composed.</para>
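<para>The composition f(u, m) = g(h(u), m) can be sketched in a few lines. The
following uses only the standard library;
<classname>mod_range_hashing</classname> and
<classname>composed_ranged_hash</classname> are hypothetical names for this
illustration and are not this library's policy classes.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Range-hashing function g(r, m): here, the modulo variant.
struct mod_range_hashing
{
  std::size_t operator()(std::size_t r, std::size_t m) const
  {
    assert(m > 0);
    return r % m;
  }
};

// Ranged-hash function f(u, m) = g(h(u), m), composed from a hash functor h
// and a range-hashing functor g.
template<typename Key,
         typename Hash_Fn = std::hash<Key>,
         typename Range_Fn = mod_range_hashing>
struct composed_ranged_hash
{
  std::size_t operator()(const Key& u, std::size_t m) const
  { return Range_Fn()(Hash_Fn()(u), m); }
};
```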
3060 <para>The above describes the case where a key is to be mapped
3061 into a single position within a hash table, e.g.,
3062 in a collision-chaining table. In other cases, a key is to be
3063 mapped into a sequence of positions within a table,
3064 e.g., in a probing table. Similar terms apply in this
3065 case: the table requires a ranged probe function,
mapping a key into a sequence of positions within the table.
This is typically achieved by composing a hash function
mapping the key into a non-negative integral type, a
probe function transforming the hash value into a
sequence of hash values, and a range-hashing function
transforming the sequence of hash values into a sequence of
positions.
3076 <section xml:id="details.hash_policies.range">
3077 <info><title>Range Hashing</title></info>
<para>Some common choices for range-hashing functions are the
division, multiplication, and middle-square methods (<xref linkend="biblio.knuth98sorting"/>), defined
as</para>
<equation>
<title>Range-Hashing, Division Method</title>
<mathphrase>g(r, m) = r mod m</mathphrase>
</equation>
<para>g(r, m) = ⌈ u/v ( a r mod v ) ⌉</para>
<para>and</para>
<para>g(r, m) = ⌈ u/v ( r<superscript>2</superscript> mod v ) ⌉</para>
<para>respectively, for some positive integrals u and
v (typically powers of 2), and some a. Each of
these range-hashing functions works best for some different
setting.</para>
<para>The division method (see above) is a
very common choice. However, even this single method can be
implemented in two very different ways: using the
low-level % (modulo) operation (for any m), or the
low-level & (bit-mask) operation (for the case where
m is a power of 2), i.e.,</para>
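<equation>
<title>Division via Prime Modulo</title>
<mathphrase>g(r, m) = r % m</mathphrase>
</equation>
<para>and</para>
<equation>
<title>Division via Bit Mask</title>
<mathphrase>g(r, m) = r & (m - 1), (with m =
2<superscript>k</superscript> for some k)</mathphrase>
</equation>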
3129 <para>respectively.</para>
3131 <para>The % (modulo) implementation has the advantage that for
3132 m a prime far from a power of 2, g(r, m) is
3133 affected by all the bits of r (minimizing the chance of
3134 collision). It has the disadvantage of using the costly modulo
3135 operation. This method is hard-wired into SGI's implementation
<para>The & (bit-mask) implementation has the advantage of
relying on the fast bit-wise and operation. It has the
disadvantage that g(r, m) is affected only by the
low-order bits of r. This method is hard-wired into
Dinkumware's implementation.</para>
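<para>The two implementations of the division method discussed above can be
compared with a short sketch. The functions <function>mod_range</function> and
<function>mask_range</function> are hypothetical and use only the standard
library; they are not this library's policy classes.</para>

```cpp
#include <cassert>
#include <cstddef>

// Division via modulo: valid for any m > 0; every bit of r can affect the result.
inline std::size_t mod_range(std::size_t r, std::size_t m)
{
  assert(m > 0);
  return r % m;
}

// Division via bit mask: valid only when m is a power of 2; only the
// low-order bits of r affect the result.
inline std::size_t mask_range(std::size_t r, std::size_t m)
{
  assert(m > 0 && (m & (m - 1)) == 0);
  return r & (m - 1);
}
```

<para>For example, the hash values 0x10 and 0x20 differ only in bits above the
low four, so with m = 16 the bit mask maps both to position 0, whereas a modulo
by the prime 17 keeps them apart.</para>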
3147 <section xml:id="details.hash_policies.ranged">
3148 <info><title>Ranged Hash</title></info>
<para>In some cases it is beneficial to allow the
client to directly specify a ranged-hash function. It is
true that the writer of the ranged-hash function cannot rely
on the values of m having specific numerical properties
suitable for hashing (in the sense used in <xref linkend="biblio.knuth98sorting"/>), since
the values of m are determined by a resize policy with
possibly orthogonal considerations.</para>
<para>There are two cases where a ranged-hash function can be
superior. The first is when using perfect hashing; the
second is when the values of m can be used to estimate
the "general" number of distinct values required. This is
described in the following.</para>
<para>Let</para>
s = [ s<subscript>0</subscript>,..., s<subscript>t - 1</subscript>]
<para>be a string of t characters, each of which is from
domain S. Consider the following ranged-hash
function:</para>
3175 A Standard String Hash Function
3178 f<subscript>1</subscript>(s, m) = ∑ <subscript>i =
3179 0</subscript><superscript>t - 1</superscript> s<subscript>i</subscript> a<superscript>i</superscript> mod m
3184 <para>where a is some non-negative integral value. This is
3185 the standard string-hashing function used in SGI's
3186 implementation (with a = 5). Its advantage is that
3187 it takes into account all of the characters of the string.</para>
<para>Now assume that s is the string representation of a
long DNA sequence (and so S = {'A', 'C', 'G',
3191 'T'}). In this case, scanning the entire string might be
3192 prohibitively expensive. A possible alternative might be to use
3193 only the first k characters of the string, where</para>
3195 <para>|S|<superscript>k</superscript> ≥ m ,</para>
3197 <para>i.e., using the hash function</para>
3201 Only k String DNA Hash
3204 f<subscript>2</subscript>(s, m) = ∑ <subscript>i
3205 = 0</subscript><superscript>k - 1</superscript> s<subscript>i</subscript> a<superscript>i</superscript> mod m
3209 <para>requiring scanning over only</para>
3211 <para>k = log<subscript>4</subscript>( m )</para>
3213 <para>characters.</para>
3215 <para>Other more elaborate hash-functions might scan k
3216 characters starting at a random position (determined at each
resize), or scan k random positions (determined at
3218 each resize), i.e., using</para>
<para>f<subscript>3</subscript>(s, m) = ∑ <subscript>i = r<subscript>0</subscript></subscript><superscript>r<subscript>0</subscript> + k - 1</superscript> s<subscript>i</subscript>
a<superscript>i</superscript> mod m ,</para>
<para>and</para>
<para>f<subscript>4</subscript>(s, m) = ∑ <subscript>i = 0</subscript><superscript>k - 1</superscript> s<subscript>r<subscript>i</subscript></subscript> a<superscript>r<subscript>i</subscript></superscript> mod m ,</para>
3230 <para>respectively, for r<subscript>0</subscript>,..., r<subscript>k-1</subscript>
3231 each in the (inclusive) range [0,...,t-1].</para>
<para>It should be noted that the above functions cannot be
decomposed into a hash function and a range-hashing function in the manner of
the composed ranged-hash function f(u, m) = g(h(u), m).</para>
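<para>The prefix-scanning function f<subscript>2</subscript> above might be
sketched as follows, using only the standard library. The names
<function>prefix_length</function> and <function>dna_prefix_hash</function> are
hypothetical, the multiplier default of 5 simply follows the value quoted above,
and the sketch assumes m is small enough that the intermediate products do not
overflow.</para>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Smallest k such that 4^k >= m (|S| = 4 for DNA strings).
inline std::size_t prefix_length(std::size_t m)
{
  std::size_t k = 0;
  std::size_t reach = 1;  // 4^k
  while (reach < m)
    {
      reach *= 4;
      ++k;
    }
  return k;
}

// f2(s, m): sum of s_i * a^i over the first k characters only, reduced mod m.
inline std::size_t dna_prefix_hash(const std::string& s, std::size_t m,
                                   std::size_t a = 5)
{
  assert(m > 0);
  const std::size_t k = std::min(prefix_length(m), s.size());
  std::size_t sum = 0;
  std::size_t power = 1 % m;  // a^i mod m
  for (std::size_t i = 0; i < k; ++i)
    {
      sum = (sum + static_cast<unsigned char>(s[i]) * power) % m;
      power = (power * a) % m;
    }
  return sum;
}
```

<para>Since the number of characters scanned depends on m, the hash and the
range cannot be separated: two strings sharing their first k characters always
land on the same position for that m.</para>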
3239 <section xml:id="details.hash_policies.implementation">
3240 <info><title>Implementation</title></info>
<para>This sub-subsection describes the implementation of
the above in this library. It first explains range-hashing
functions in collision-chaining tables, then ranged-hash
functions in collision-chaining tables, then probing-based
tables, and finally lists the relevant classes in this
library.</para>
3249 <section xml:id="hash_policies.implementation.collision-chaining">
3251 Range-Hashing and Ranged-Hashes in Collision-Chaining Tables
3255 <para><classname>cc_hash_table</classname> is
3256 parametrized by <classname>Hash_Fn</classname> and <classname>Comb_Hash_Fn</classname>, a
3257 hash functor and a combining hash functor, respectively.</para>
<para>In general, <classname>Comb_Hash_Fn</classname> is considered a
range-hashing functor. <classname>cc_hash_table</classname>
synthesizes a ranged-hash function from <classname>Hash_Fn</classname> and
<classname>Comb_Hash_Fn</classname>. The figure below shows an <function>insert</function> sequence
diagram for this case. The user inserts an element (point A),
the container transforms the key into a non-negative integral
using the hash functor (points B and C), and transforms the
result into a position using the combining functor (points D
and E).</para>
3270 <title>Insert hash sequence diagram</title>
3273 <imagedata align="center" format="PNG" scale="100"
3274 fileref="../images/pbds_hash_range_hashing_seq_diagram.png"/>
3277 <phrase>Insert hash sequence diagram</phrase>
<para>If <classname>cc_hash_table</classname>'s
hash functor, <classname>Hash_Fn</classname>, is instantiated by <classname>null_type</classname>, then <classname>Comb_Hash_Fn</classname> is taken to be
a ranged-hash function. The graphic below shows an <function>insert</function> sequence
diagram. The user inserts an element (point A), the container
transforms the key into a position using the combining functor
(points B and C).</para>
<figure>
<title>Insert hash sequence diagram with a null policy</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_hash_range_hashing_seq_diagram2.png"/>
</imageobject>
<textobject>
<phrase>Insert hash sequence diagram with a null policy</phrase>
</textobject>
</mediaobject>
</figure>
3304 <section xml:id="hash_policies.implementation.probe">
3308 <para><classname>gp_hash_table</classname> is parametrized by
3309 <classname>Hash_Fn</classname>, <classname>Probe_Fn</classname>,
3310 and <classname>Comb_Probe_Fn</classname>. As before, if
3311 <classname>Hash_Fn</classname> and <classname>Probe_Fn</classname>
3312 are both <classname>null_type</classname>, then
3313 <classname>Comb_Probe_Fn</classname> is a ranged-probe
3314 functor. Otherwise, <classname>Hash_Fn</classname> is a hash
3315 functor, <classname>Probe_Fn</classname> is a functor for offsets
3316 from a hash value, and <classname>Comb_Probe_Fn</classname>
transforms a probe sequence into a sequence of positions within
the table.</para>
3322 <section xml:id="hash_policies.implementation.predefined">
<info><title>
Pre-Defined Policies
</title></info>
3327 <para>This library contains some pre-defined classes
3328 implementing range-hashing and probing functions:</para>
3331 <listitem><para><classname>direct_mask_range_hashing</classname>
3332 and <classname>direct_mod_range_hashing</classname>
3333 are range-hashing functions based on a bit-mask and a modulo
3334 operation, respectively.</para></listitem>
<listitem><para><classname>linear_probe_fn</classname> and
3337 <classname>quadratic_probe_fn</classname> are
3338 a linear probe and a quadratic probe function,
3339 respectively.</para></listitem>
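<para>The ideas behind these policies can be sketched in a few lines
of plain C++. The functions below are hypothetical stand-ins written
for illustration; they are not the library's classes, and the offsets
shown for the probe functions are an assumption about the simplest
linear and quadratic forms.</para>

```cpp
#include <cassert>
#include <cstddef>

// Range hashing by bit-mask; table_size must be a power of 2.
inline std::size_t mask_range_hash(std::size_t hash, std::size_t table_size)
{
  assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
  return hash & (table_size - 1);
}

// Range hashing by modulo; works for any non-zero table size.
inline std::size_t mod_range_hash(std::size_t hash, std::size_t table_size)
{
  assert(table_size != 0);
  return hash % table_size;
}

// Offset of the i-th probe from the initial hash value.
inline std::size_t linear_probe_offset(std::size_t i) { return i; }
inline std::size_t quadratic_probe_offset(std::size_t i) { return i * i; }
```

<para>The mask variant avoids a division but constrains the table size,
which is why the choice of range-hashing function and size policy are
usually made together.</para>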
<para>
The graphic below shows the relationships.
</para>

<figure>
<title>Hash policy class diagram</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_hash_policy_cd.png"/>
</imageobject>
<textobject>
<phrase>Hash policy class diagram</phrase>
</textobject>
</mediaobject>
</figure>
3361 </section> <!-- impl -->
3365 <section xml:id="container.hash.details.resize_policies">
3366 <info><title>Resize Policies</title></info>
3368 <section xml:id="resize_policies.general">
3369 <info><title>General</title></info>
3371 <para>Hash-tables, as opposed to trees, do not naturally grow or
3372 shrink. It is necessary to specify policies to determine how
3373 and when a hash table should change its size. Usually, resize
3374 policies can be decomposed into orthogonal policies:</para>
3377 <listitem><para>A size policy indicating how a hash table
3378 should grow (e.g., it should multiply by powers of
3379 2).</para></listitem>
3381 <listitem><para>A trigger policy indicating when a hash
3382 table should grow (e.g., a load factor is
3383 exceeded).</para></listitem>
3388 <section xml:id="resize_policies.size">
3389 <info><title>Size Policies</title></info>
3392 <para>Size policies determine how a hash table changes size. These
3393 policies are simple, and there are relatively few sensible
3394 options. An exponential-size policy (with the initial size and
3395 growth factors both powers of 2) works well with a mask-based
3396 range-hashing function, and is the
3397 hard-wired policy used by Dinkumware. A
3398 prime-list based policy works well with a modulo-prime range
3399 hashing function and is the hard-wired policy used by SGI's
3400 implementation.</para>
3404 <section xml:id="resize_policies.trigger">
3405 <info><title>Trigger Policies</title></info>
3407 <para>Trigger policies determine when a hash table changes size.
3408 Following is a description of two policies: load-check
3409 policies, and collision-check policies.</para>
3411 <para>Load-check policies are straightforward. The user specifies
two factors, α<subscript>min</subscript> and
α<subscript>max</subscript>, and the hash table maintains the
invariant that</para>

<para>α<subscript>min</subscript> ≤ (number of
stored elements) / (hash-table size) ≤
α<subscript>max</subscript>
<!-- <remark>load factor min max</remark> -->
</para>
3422 <para>Collision-check policies work in the opposite direction of
3423 load-check policies. They focus on keeping the number of
3424 collisions moderate and hoping that the size of the table will
3425 not grow very large, instead of keeping a moderate load-factor
3426 and hoping that the number of collisions will be small. A
3427 maximal collision-check policy resizes when the longest
3428 probe-sequence grows too large.</para>
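<para>A load-check decision can be sketched as follows. This is a
simplified illustration, not the library's implementation; the
enumeration <classname>resize_decision</classname> and the function
<function>load_check</function> are hypothetical names.</para>

```cpp
#include <cassert>
#include <cstddef>

enum resize_decision { keep_size, grow, shrink };

// Compares the current load with the two user-specified factors and
// reports which way, if any, the table should change size.
inline resize_decision load_check(std::size_t num_elements,
                                  std::size_t table_size,
                                  double alpha_min, double alpha_max)
{
  assert(table_size > 0 && alpha_min < alpha_max);
  const double load =
    static_cast<double>(num_elements) / static_cast<double>(table_size);
  if (load > alpha_max)
    return grow;
  if (load < alpha_min)
    return shrink;
  return keep_size;
}
```

<para>A collision-check policy would instead count collisions during a
search and request a resize once the count passes a limit.</para>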
3430 <para>Consider the graphic below. Let the size of the hash table
3431 be denoted by m, the length of a probe sequence be denoted by k,
and some load factor be denoted by α. We would like to
calculate the minimal length k such that, if there were α
3434 m elements in the hash table, a probe sequence of length k would
3435 be found with probability at most 1/m.</para>
<figure>
<title>Balls and bins</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_balls_and_bins.png"/>
</imageobject>
<textobject>
<phrase>Balls and bins</phrase>
</textobject>
</mediaobject>
</figure>
3450 <para>Denote the probability that a probe sequence of length
3451 k appears in bin i by p<subscript>i</subscript>, the
3452 length of the probe sequence of bin i by
3453 l<subscript>i</subscript>, and assume uniform distribution. Then</para>
<equation>
<title>
Probability of Probe Sequence of Length k
</title>
<mathphrase>
p<subscript>1</subscript> =
P(l<subscript>1</subscript> ≥ k) =
P(l<subscript>1</subscript> ≥ α ( 1 + k / α - 1 ) ) ≤ (a)
e ^ ( - ( α ( k / α - 1 )<superscript>2</superscript> ) / 2 )
</mathphrase>
</equation>
3476 <para>where (a) follows from the Chernoff bound (<xref linkend="biblio.motwani95random"/>). To
3477 calculate the probability that some bin contains a probe
sequence of length at least k, we note that the
l<subscript>i</subscript> are negatively-dependent
(<xref linkend="biblio.dubhashi98neg"/>). Let
I(.) denote the indicator function. Then</para>
<equation>
<title>
Probability Probe Sequence in Some Bin
</title>
<mathphrase>
P( exists<subscript>i</subscript> l<subscript>i</subscript> ≥ k ) =
P ( ∑ <subscript>i = 1</subscript><superscript>m</superscript>
I(l<subscript>i</subscript> ≥ k) ≥ 1 ) =
P ( ∑ <subscript>i = 1</subscript><superscript>m</superscript> I (
l<subscript>i</subscript> ≥ k ) ≥ m p<subscript>1</subscript> ( 1 + 1 / (m
p<subscript>1</subscript>) - 1 ) ) ≤ (a)
e ^ ( ( - m p<subscript>1</subscript> ( 1 / (m p<subscript>1</subscript>)
- 1 ) <superscript>2</superscript> ) / 2 ) ,
</mathphrase>
</equation>
3503 <para>where (a) follows from the fact that the Chernoff bound can
3504 be applied to negatively-dependent variables (<xref
3505 linkend="biblio.dubhashi98neg"/>). Inserting the first probability
equation into the second one, and equating with 1/m, we
obtain</para>

<para>k ~ √ ( 2 α ln ( 2 m ln(m) ) ).</para>
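<para>To get a feel for the magnitudes involved, the estimate can be
evaluated numerically. The sketch below reads the estimate as
k ~ √( 2 α ln( 2 m ln(m) ) ); that grouping of the logarithm is an
assumption drawn from the derivation above, and
<function>probe_length_estimate</function> is a hypothetical helper,
not a library function.</para>

```cpp
#include <cassert>
#include <cmath>

// Evaluates sqrt(2 * alpha * ln(2 * m * ln(m))) for load factor alpha
// and table size m (taken as a double to keep the arithmetic simple).
inline double probe_length_estimate(double alpha, double m)
{
  assert(alpha > 0.0 && m > 1.0);
  return std::sqrt(2.0 * alpha * std::log(2.0 * m * std::log(m)));
}
```

<para>For a load factor of 0.5 and a table of 1024 entries this gives a
value slightly above 3, and the estimate grows only slowly with the
table size.</para>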
3515 <section xml:id="resize_policies.impl">
3516 <info><title>Implementation</title></info>
3518 <para>This sub-subsection describes the implementation of the
3519 above in this library. It first describes resize policies and
3520 their decomposition into trigger and size policies, then
3521 describes pre-defined classes, and finally discusses controlled
access to the policies' internals.</para>
3524 <section xml:id="resize_policies.impl.decomposition">
3525 <info><title>Decomposition</title></info>
<para>Each hash-based container is parametrized by a
<classname>Resize_Policy</classname> parameter; the container derives
publicly from <classname>Resize_Policy</classname>. For
example:</para>

<programlisting>
cc_hash_table<typename Key,
	      typename Mapped,
	      ...
	      typename Resize_Policy
	      ...> : public Resize_Policy
</programlisting>
3540 <para>As a container object is modified, it continuously notifies
3541 its <classname>Resize_Policy</classname> base of internal changes
3542 (e.g., collisions encountered and elements being
3543 inserted). It queries its <classname>Resize_Policy</classname> base whether
3544 it needs to be resized, and if so, to what size.</para>
3546 <para>The graphic below shows a (possible) sequence diagram
3547 of an insert operation. The user inserts an element; the hash
3548 table notifies its resize policy that a search has started
3549 (point A); in this case, a single collision is encountered -
3550 the table notifies its resize policy of this (point B); the
3551 container finally notifies its resize policy that the search
3552 has ended (point C); it then queries its resize policy whether
3553 a resize is needed, and if so, what is the new size (points D
3554 to G); following the resize, it notifies the policy that a
3555 resize has completed (point H); finally, the element is
3556 inserted, and the policy notified (point I).</para>
<figure>
<title>Insert resize sequence diagram</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_insert_resize_sequence_diagram1.png"/>
</imageobject>
<textobject>
<phrase>Insert resize sequence diagram</phrase>
</textobject>
</mediaobject>
</figure>
<para>In practice, a resize policy can usually be decomposed
orthogonally into a size policy and a trigger policy. Consequently,
the library contains a single class for instantiating a resize
policy: <classname>hash_standard_resize_policy</classname>
is parametrized by <classname>Size_Policy</classname> and
<classname>Trigger_Policy</classname>, derives publicly from
both, and acts as a standard delegate (<xref linkend="biblio.gof"/>)
to these policies.</para>
3581 <para>The two graphics immediately below show sequence diagrams
3582 illustrating the interaction between the standard resize policy
3583 and its trigger and size policies, respectively.</para>
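<figure>
<title>Standard resize policy trigger sequence
diagram</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_insert_resize_sequence_diagram2.png"/>
</imageobject>
<textobject>
<phrase>Standard resize policy trigger sequence
diagram</phrase>
</textobject>
</mediaobject>
</figure>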
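<figure>
<title>Standard resize policy size sequence
diagram</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_insert_resize_sequence_diagram3.png"/>
</imageobject>
<textobject>
<phrase>Standard resize policy size sequence
diagram</phrase>
</textobject>
</mediaobject>
</figure>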
3618 <section xml:id="resize_policies.impl.predefined">
3619 <info><title>Predefined Policies</title></info>
3620 <para>The library includes the following
3621 instantiations of size and trigger policies:</para>
3624 <listitem><para><classname>hash_load_check_resize_trigger</classname>
3625 implements a load check trigger policy.</para></listitem>
3627 <listitem><para><classname>cc_hash_max_collision_check_resize_trigger</classname>
3628 implements a collision check trigger policy.</para></listitem>
3630 <listitem><para><classname>hash_exponential_size_policy</classname>
3631 implements an exponential-size policy (which should be used
3632 with mask range hashing).</para></listitem>
<listitem><para><classname>hash_prime_size_policy</classname>
implements a size policy based on a sequence of primes
(which should
be used with mod range hashing).</para></listitem>
3640 <para>The graphic below gives an overall picture of the resize-related
3641 classes. <classname>basic_hash_table</classname>
3642 is parametrized by <classname>Resize_Policy</classname>, which it subclasses
3643 publicly. This class is currently instantiated only by <classname>hash_standard_resize_policy</classname>.
3644 <classname>hash_standard_resize_policy</classname>
3645 itself is parametrized by <classname>Trigger_Policy</classname> and
3646 <classname>Size_Policy</classname>. Currently, <classname>Trigger_Policy</classname> is
3647 instantiated by <classname>hash_load_check_resize_trigger</classname>,
3648 or <classname>cc_hash_max_collision_check_resize_trigger</classname>;
3649 <classname>Size_Policy</classname> is instantiated by <classname>hash_exponential_size_policy</classname>,
3650 or <classname>hash_prime_size_policy</classname>.</para>
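<para>A minimal sketch of composing these classes (assuming GCC's
<filename>ext/pb_ds</filename> headers) pairs the prime size policy
with modulo range hashing. The names <classname>int_hash</classname>,
<classname>prime_resize_t</classname>,
<classname>prime_map_t</classname>, and
<function>insert_squares</function> are hypothetical.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Hypothetical hash functor used only for this illustration.
struct int_hash
{
  std::size_t operator()(int key) const
  { return static_cast<std::size_t>(key); }
};

// Resize policy assembled from a size policy and a trigger policy.
typedef __gnu_pbds::hash_standard_resize_policy<
  __gnu_pbds::hash_prime_size_policy,
  __gnu_pbds::hash_load_check_resize_trigger<> > prime_resize_t;

// Table that combines mod range hashing with the prime size policy.
typedef __gnu_pbds::cc_hash_table<
  int, int, int_hash, std::equal_to<int>,
  __gnu_pbds::direct_mod_range_hashing<>,
  prime_resize_t> prime_map_t;

// Inserts i -> i*i for 0 <= i < n and returns the number of entries.
inline std::size_t insert_squares(prime_map_t& m, int n)
{
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
    m[i] = i * i;
  return m.size();
}
```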
3654 <section xml:id="resize_policies.impl.internals">
<info><title>Controlling Access to Internals</title></info>
3657 <para>There are cases where (controlled) access to resize
3658 policies' internals is beneficial. E.g., it is sometimes
3659 useful to query a hash-table for the table's actual size (as
3660 opposed to its <function>size()</function> - the number of values it
3661 currently holds); it is sometimes useful to set a table's
3662 initial size, externally resize it, or change load factors.</para>
3664 <para>Clearly, supporting such methods both decreases the
3665 encapsulation of hash-based containers, and increases the
3666 diversity between different associative-containers' interfaces.
Conversely, omitting such methods can decrease containers'
usability.</para>
3670 <para>In order to avoid, to the extent possible, the above
3671 conflict, the hash-based containers themselves do not address
3672 any of these questions; this is deferred to the resize policies,
3673 which are easier to change or replace. Thus, for example,
3674 neither <classname>cc_hash_table</classname> nor
3675 <classname>gp_hash_table</classname>
contains methods for querying the actual size of the table; this
3677 is deferred to <classname>hash_standard_resize_policy</classname>.</para>
3679 <para>Furthermore, the policies themselves are parametrized by
template arguments that determine the methods they support
(<xref linkend="biblio.alexandrescu01modern"/>
shows techniques for doing so). <classname>hash_standard_resize_policy</classname>
3684 is parametrized by <classname>External_Size_Access</classname> that
3685 determines whether it supports methods for querying the actual
3686 size of the table or resizing it. <classname>hash_load_check_resize_trigger</classname>
3687 is parametrized by <classname>External_Load_Access</classname> that
3688 determines whether it supports methods for querying or
3689 modifying the loads. <classname>cc_hash_max_collision_check_resize_trigger</classname>
3690 is parametrized by <classname>External_Load_Access</classname> that
determines whether it supports methods for querying the
load.</para>
3694 <para>Some operations, for example, resizing a container at
3695 run time, or changing the load factors of a load-check trigger
3696 policy, require the container itself to resize. As mentioned
3697 above, the hash-based containers themselves do not contain
3698 these types of methods, only their resize policies.
3699 Consequently, there must be some mechanism for a resize policy
3700 to manipulate the hash-based container. As the hash-based
3701 container is a subclass of the resize policy, this is done
3702 through virtual methods. Each hash-based container has a
<literal>private</literal> <literal>virtual</literal> method:</para>

<programlisting>
virtual void
do_resize
(size_type new_size);
</programlisting>
3710 <para>which resizes the container. Implementations of
3711 <classname>Resize_Policy</classname> can export public methods for resizing
3712 the container externally; these methods internally call
3713 <classname>do_resize</classname> to resize the table.</para>
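<para>The following sketch shows external size access from the user's
side, assuming GCC's <filename>ext/pb_ds</filename> headers and a
resize policy instantiated with
<classname>External_Size_Access</classname> set to
<literal>true</literal>. The names <classname>int_hash</classname>,
<classname>resizable_map_t</classname>, and
<function>reserve_at_least</function> are hypothetical.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/hash_policy.hpp>

// Hypothetical hash functor used only for this illustration.
struct int_hash
{
  std::size_t operator()(int key) const
  { return static_cast<std::size_t>(key); }
};

// Resize policy that exposes the table's actual size and allows
// external resizing (third argument: External_Size_Access).
typedef __gnu_pbds::hash_standard_resize_policy<
  __gnu_pbds::hash_exponential_size_policy<>,
  __gnu_pbds::hash_load_check_resize_trigger<>,
  true> external_resize_t;

typedef __gnu_pbds::cc_hash_table<
  int, char, int_hash, std::equal_to<int>,
  __gnu_pbds::direct_mask_range_hashing<>,
  external_resize_t> resizable_map_t;

// Asks the table to hold at least n slots; returns the actual size.
inline std::size_t reserve_at_least(resizable_map_t& m, std::size_t n)
{
  m.resize(n); // exported by the resize policy, not by the container
  assert(m.get_actual_size() >= n);
  return m.get_actual_size();
}
```

<para>Because the size policy here is exponential, the actual size
after the call is whatever size the policy issues at or above the
request, not necessarily the requested value itself.</para>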
3721 </section> <!-- resize policies -->
3723 <section xml:id="container.hash.details.policy_interaction">
3724 <info><title>Policy Interactions</title></info>
<para>Hash tables are, unfortunately, especially sensitive to the
choice of policies. One of the more complicated aspects of this
3729 is that poor combinations of good policies can form a poor
3730 container. Following are some considerations.</para>
3732 <section xml:id="policy_interaction.probesizetrigger">
3733 <info><title>probe/size/trigger</title></info>
3735 <para>Some combinations do not work well for probing containers.
3736 For example, combining a quadratic probe policy with an
3737 exponential size policy can yield a poor container: when an
3738 element is inserted, a trigger policy might decide that there
3739 is no need to resize, as the table still contains unused
3740 entries; the probe sequence, however, might never reach any of
3741 the unused entries.</para>
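<para>The effect is easy to reproduce in isolation. The sketch below
counts how many distinct positions a probe sequence can reach in a
table of a given size, assuming the offsets are i for a linear probe
and i*i for a quadratic probe and that positions are taken modulo the
table size. <function>reachable_slots</function> is a hypothetical
helper, not part of the library.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct positions visited by the probe offsets
// 0 .. table_size - 1, starting from position 0.
inline std::size_t reachable_slots(std::size_t table_size, bool quadratic)
{
  assert(table_size > 0);
  std::set<std::size_t> seen;
  for (std::size_t i = 0; i < table_size; ++i)
    {
      const std::size_t offset = quadratic ? i * i : i;
      seen.insert(offset % table_size);
    }
  return seen.size();
}
```

<para>With a table of size 8, the quadratic sequence only ever visits
three positions, so five entries can stay unused while an insert still
fails to find a free one; a linear probe over the same table reaches
all eight.</para>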
<para>Unfortunately, this library cannot detect such problems at
compile time (they are halting reducible). It therefore defines
an exception class, <classname>insert_error</classname>, which is
thrown in this case.</para>
3750 <section xml:id="policy_interaction.hashtrigger">
3751 <info><title>hash/trigger</title></info>
3753 <para>Some trigger policies are especially susceptible to poor
3754 hash functions. Suppose, as an extreme case, that the hash
3755 function transforms each key to the same hash value. After some
3756 inserts, a collision detecting policy will always indicate that
3757 the container needs to grow.</para>
3759 <para>The library, therefore, by design, limits each operation to
one resize. For each <function>insert</function>, for example, it queries
3761 only once whether a resize is needed.</para>
3765 <section xml:id="policy_interaction.eqstorehash">
3766 <info><title>equivalence functors/storing hash values/hash</title></info>
3768 <para><classname>cc_hash_table</classname> and
3769 <classname>gp_hash_table</classname> are
3770 parametrized by an equivalence functor and by a
<classname>Store_Hash</classname> parameter. If the latter parameter is
<literal>true</literal>, then the container stores with each entry
a hash value, and uses this value in case of collisions to
determine whether to apply the equivalence functor. This can lower the
3775 cost of collision for some types, but increase the cost of
3776 collisions for other types.</para>
3778 <para>If a ranged-hash function or ranged probe function is
3779 directly supplied, however, then it makes no sense to store the
3780 hash value with each entry. This library's container will
3781 fail at compilation, by design, if this is attempted.</para>
3785 <section xml:id="policy_interaction.sizeloadtrigger">
3786 <info><title>size/load-check trigger</title></info>
3788 <para>Assume a size policy issues an increasing sequence of sizes
a, a q, a q<superscript>2</superscript>, a q<superscript>3</superscript>, ... For
3790 example, an exponential size policy might issue the sequence of
3791 sizes 8, 16, 32, 64, ...</para>
3793 <para>If a load-check trigger policy is used, with loads
3794 α<subscript>min</subscript> and α<subscript>max</subscript>,
3795 respectively, then it is a good idea to have:</para>
3798 <listitem><para>α<subscript>max</subscript> ~ 1 / q</para></listitem>
3800 <listitem><para>α<subscript>min</subscript> < 1 / (2 q)</para></listitem>
3803 <para>This will ensure that the amortized hash cost of each
3804 modifying operation is at most approximately 3.</para>
<para>α<subscript>min</subscript> ~ α<subscript>max</subscript> is, in
any case, a bad choice, and α<subscript>min</subscript> >
α<subscript>max</subscript> is horrendous.</para>
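<para>The growth half of this argument can be checked with a small
simulation. The sketch below is an assumption-laden model, not the
library's accounting: it only grows the table (it ignores
α<subscript>min</subscript> and shrinking), charges one hash per
insertion, and charges one hash per stored key on every resize.
<function>amortized_hash_cost</function> is a hypothetical name.</para>

```cpp
#include <cassert>
#include <cstddef>

// Simulates n insertions into a table of initial size a that grows by a
// factor q whenever the load exceeds alpha_max. Returns the average
// number of hash computations per insertion.
inline double amortized_hash_cost(std::size_t n, std::size_t a,
                                  std::size_t q, double alpha_max)
{
  assert(n > 0 && a > 0 && q > 1 && alpha_max > 0.0);
  std::size_t table_size = a, stored = 0, hashes = 0;
  for (std::size_t i = 0; i < n; ++i)
    {
      ++stored;
      ++hashes;                 // hash the inserted key
      if (stored > alpha_max * table_size)
        {
          table_size *= q;      // next size issued by the size policy
          hashes += stored;     // rehash every stored key
        }
    }
  return static_cast<double>(hashes) / static_cast<double>(n);
}
```

<para>For the doubling sequence 8, 16, 32, ... with
α<subscript>max</subscript> = 0.5, the average stays close to 3 even
when measured immediately after a resize.</para>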
3814 </section> <!-- details -->
3816 </section> <!-- hash -->
3819 <section xml:id="pbds.design.container.tree">
3820 <info><title>tree</title></info>
3822 <section xml:id="container.tree.interface">
3823 <info><title>Interface</title></info>
3825 <para>The tree-based container has the following declaration:</para>
<programlisting>
template<
	  typename Key,
	  typename Mapped,
	  typename Cmp_Fn = std::less<Key>,
	  typename Tag = rb_tree_tag,
	  template<
	  typename Const_Node_Iterator,
	  typename Node_Iterator,
	  typename Cmp_Fn_,
	  typename Allocator_>
	  class Node_Update = null_node_update,
	  typename Allocator = std::allocator<char> >
class tree;
</programlisting>
3842 <para>The parameters have the following meaning:</para>
<orderedlist>
<listitem>
<para><classname>Key</classname> is the key type.</para></listitem>

<listitem>
<para><classname>Mapped</classname> is the mapped-policy.</para></listitem>

<listitem>
<para><classname>Cmp_Fn</classname> is a key comparison functor.</para></listitem>

<listitem>
<para><classname>Tag</classname> specifies which underlying data structure
to use.</para></listitem>

<listitem>
<para><classname>Node_Update</classname> is a policy for updating node
invariants.</para></listitem>

<listitem>
<para><classname>Allocator</classname> is an allocator
type.</para></listitem>
</orderedlist>
3867 <para>The <classname>Tag</classname> parameter specifies which underlying
3868 data structure to use. Instantiating it by <classname>rb_tree_tag</classname>, <classname>splay_tree_tag</classname>, or
3869 <classname>ov_tree_tag</classname>,
3870 specifies an underlying red-black tree, splay tree, or
3871 ordered-vector tree, respectively; any other tag is illegal.
Note that containers based on the former two contain more types
and methods than the latter (e.g.,
<classname>reverse_iterator</classname> and <function>rbegin</function>), and have different
exception and invalidation guarantees.</para>
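<para>As a minimal sketch (assuming GCC's
<filename>ext/pb_ds</filename> headers), the three tags can be swapped
without changing the code that uses the container. The typedef names
and <function>sorted_keys</function> are hypothetical.</para>

```cpp
#include <cassert>
#include <functional>
#include <vector>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::rb_tree_tag> rb_set_t;
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::splay_tree_tag> splay_set_t;
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::ov_tree_tag> ov_set_t;

// Inserts the keys in [first, last) into a Set and returns them in the
// container's iteration order (duplicates are dropped).
template<class Set>
std::vector<int> sorted_keys(const int* first, const int* last)
{
  Set s;
  for (; first != last; ++first)
    s.insert(*first);
  std::vector<int> out;
  for (typename Set::const_iterator it = s.begin(); it != s.end(); ++it)
    out.push_back(*it);
  assert(out.size() == s.size());
  return out;
}
```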
3879 <section xml:id="container.tree.details">
3880 <info><title>Details</title></info>
3882 <section xml:id="container.tree.node">
3883 <info><title>Node Invariants</title></info>
3886 <para>Consider the two trees in the graphic below, labels A and B. The first
3887 is a tree of floats; the second is a tree of pairs, each
3888 signifying a geometric line interval. Each element in a tree is referred to as a node of the tree. Of course, each of
3889 these trees can support the usual queries: the first can easily
3890 search for <classname>0.4</classname>; the second can easily search for
3891 <classname>std::make_pair(10, 41)</classname>.</para>
3893 <para>Each of these trees can efficiently support other queries.
The first can efficiently determine that the 2nd key in the
tree is <constant>0.3</constant>; the second can efficiently determine
whether any of its intervals overlaps
<classname>std::make_pair(29,42)</classname> (useful in geometric
applications or distributed file systems with leases, for
example). It should be noted that a <classname>std::set</classname> can
only solve these types of problems with linear complexity.</para>
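<para>The first kind of query can be sketched with the library's
order-statistics policy (assuming GCC's
<filename>ext/pb_ds</filename> headers). The key values below are
chosen for illustration and are not taken from the figure;
<classname>ordered_set_t</classname> and
<function>fill_example</function> are hypothetical names.</para>

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// Red-black tree of doubles that maintains subtree sizes in its nodes.
typedef __gnu_pbds::tree<
  double, __gnu_pbds::null_type, std::less<double>,
  __gnu_pbds::rb_tree_tag,
  __gnu_pbds::tree_order_statistics_node_update> ordered_set_t;

// Fills the set with four illustrative keys.
inline void fill_example(ordered_set_t& s)
{
  s.insert(0.4);
  s.insert(0.1);
  s.insert(0.7);
  s.insert(0.3);
  assert(s.size() == 4);
}
```

<para>With this policy, <function>find_by_order</function> takes a
zero-based position and <function>order_of_key</function> returns the
number of keys smaller than its argument.</para>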
3902 <para>In order to do so, each tree stores some metadata in
each node, and maintains node invariants (see <xref linkend="biblio.clrs2001"/>). The first stores in
3904 each node the size of the sub-tree rooted at the node; the
3905 second stores at each node the maximal endpoint of the
3906 intervals at the sub-tree rooted at the node.</para>
<figure>
<title>Tree node invariants</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_tree_node_invariants.png"/>
</imageobject>
<textobject>
<phrase>Tree node invariants</phrase>
</textobject>
</mediaobject>
</figure>
<para>Supporting such trees is difficult for a number of
reasons:</para>
3925 <listitem><para>There must be a way to specify what a node's metadata
3926 should be (if any).</para></listitem>
3928 <listitem><para>Various operations can invalidate node
3929 invariants. The graphic below shows how a right rotation,
3930 performed on A, results in B, with nodes x and y having
3931 corrupted invariants (the grayed nodes in C). The graphic shows
3932 how an insert, performed on D, results in E, with nodes x and y
3933 having corrupted invariants (the grayed nodes in F). It is not
3934 feasible to know outside the tree the effect of an operation on
3935 the nodes of the tree.</para></listitem>
3937 <listitem><para>The search paths of standard associative containers are
3938 defined by comparisons between keys, and not through
3939 metadata.</para></listitem>
3941 <listitem><para>It is not feasible to know in advance which methods trees
3942 can support. Besides the usual <classname>find</classname> method, the
3943 first tree can support a <classname>find_by_order</classname> method, while
3944 the second can support an <classname>overlaps</classname> method.</para></listitem>
<figure>
<title>Tree node invalidation</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_tree_node_invalidations.png"/>
</imageobject>
<textobject>
<phrase>Tree node invalidation</phrase>
</textobject>
</mediaobject>
</figure>
<para>These problems are solved by a combination of two means:
node iterators, and template-template node updater
parameters.</para>
3964 <section xml:id="container.tree.node.iterators">
3965 <info><title>Node Iterators</title></info>
3968 <para>Each tree-based container defines two additional iterator
3969 types, <classname>const_node_iterator</classname>
3970 and <classname>node_iterator</classname>.
3971 These iterators allow descending from a node to one of its
children. Node iterators allow search paths different from those
3973 determined by the comparison functor. The <classname>tree</classname>
3974 supports the methods:</para>
<para>The <function>node_begin</function> pair (const and non-const)
returns node iterators corresponding to the
root node of the tree; the <function>node_end</function> pair returns node iterators
corresponding to a just-after-leaf node.</para>
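<para>A minimal sketch of descending a tree through node iterators
follows (assuming GCC's <filename>ext/pb_ds</filename> headers). The
helper names are hypothetical; the iterator type is deduced rather
than named, since only <function>node_begin</function>,
<function>node_end</function>, <function>get_l_child</function>, and
<function>get_r_child</function> are needed.</para>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::rb_tree_tag> int_set_t;

// Counts the nodes of the subtree rooted at nd_it.
template<class NodeIt>
std::size_t count_nodes(NodeIt nd_it, NodeIt end_nd_it)
{
  if (nd_it == end_nd_it)
    return 0;
  return 1 + count_nodes(nd_it.get_l_child(), end_nd_it)
           + count_nodes(nd_it.get_r_child(), end_nd_it);
}

// Returns the number of nodes on the longest root-to-leaf path.
template<class NodeIt>
std::size_t subtree_height(NodeIt nd_it, NodeIt end_nd_it)
{
  if (nd_it == end_nd_it)
    return 0;
  return 1 + std::max(subtree_height(nd_it.get_l_child(), end_nd_it),
                      subtree_height(nd_it.get_r_child(), end_nd_it));
}

inline std::size_t node_count(const int_set_t& s)
{
  const std::size_t n = count_nodes(s.node_begin(), s.node_end());
  assert(n == s.size());
  return n;
}

inline std::size_t tree_height(const int_set_t& s)
{ return subtree_height(s.node_begin(), s.node_end()); }
```

<para>Neither traversal consults the comparison functor; both follow
child links only, which is the kind of search path that ordinary
iterators cannot express.</para>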
3994 <section xml:id="container.tree.node.updator">
<info><title>Node Updater</title></info>
3997 <para>The tree-based containers are parametrized by a
3998 <classname>Node_Update</classname> template-template parameter. A
3999 tree-based container instantiates
4000 <classname>Node_Update</classname> to some
4001 <classname>node_update</classname> class, and publicly subclasses
4002 <classname>node_update</classname>. The graphic below shows this
scheme, as well as some predefined policies (which are explained
below).</para>
<figure>
<title>A tree and its update policy</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_tree_node_updator_policy_cd.png"/>
</imageobject>
<textobject>
<phrase>A tree and its update policy</phrase>
</textobject>
</mediaobject>
</figure>
4019 <para><classname>node_update</classname> (an instantiation of
4020 <classname>Node_Update</classname>) must define <classname>metadata_type</classname> as
4021 the type of metadata it requires. For order statistics,
4022 e.g., <classname>metadata_type</classname> might be <classname>size_t</classname>.
The tree defines within each node a <classname>metadata_type</classname>
object.</para>
4026 <para><classname>node_update</classname> must also define the following method
4027 for restoring node invariants:</para>
<programlisting>
void
operator()(node_iterator nd_it, const_node_iterator end_nd_it)
</programlisting>
<para>In this method, <varname>nd_it</varname> is a
<classname>node_iterator</classname> corresponding to a node whose
A) descendants all have valid invariants, and B) own
invariants might be violated; <varname>end_nd_it</varname> is
a <classname>const_node_iterator</classname> corresponding to a
just-after-leaf node. This method should correct the node
invariants of the node pointed to by
<varname>nd_it</varname>. For example, say node x in the
graphic below label A has an invalid invariant, but its children,
y and z, have valid invariants. After the invocation, all three
nodes should have valid invariants, as in label B.</para>
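<para>A user-defined policy following this contract might look like
the sketch below (assuming GCC's <filename>ext/pb_ds</filename>
headers). The policy <classname>subtree_sum_node_update</classname>,
its <function>total</function> method, and the typedef
<classname>sum_map_t</classname> are hypothetical. Each node's metadata
is the sum of the mapped values in its subtree; the policy reaches the
tree's <function>node_begin</function> and
<function>node_end</function> by declaring them pure virtual.</para>

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

template<class Node_CItr, class Node_Itr, class Cmp_Fn, class Alloc>
struct subtree_sum_node_update
{
  // Metadata stored in every node.
  typedef long metadata_type;

  // Extra query; becomes a method of the tree through public inheritance.
  long total() const
  {
    Node_CItr root = node_begin();
    return root == node_end() ? 0 : root.get_metadata();
  }

  // Restores the invariant of nd_it, assuming its children are valid.
  void operator()(Node_Itr nd_it, Node_CItr end_nd_it)
  {
    long sum = (*nd_it)->second;
    if (!(nd_it.get_l_child() == end_nd_it))
      sum += nd_it.get_l_child().get_metadata();
    if (!(nd_it.get_r_child() == end_nd_it))
      sum += nd_it.get_r_child().get_metadata();
    const_cast<long&>(nd_it.get_metadata()) = sum;
  }

  // Implemented by the tree, which derives from this policy.
  virtual Node_CItr node_begin() const = 0;
  virtual Node_CItr node_end() const = 0;

  virtual ~subtree_sum_node_update() { }
};

typedef __gnu_pbds::tree<int, int, std::less<int>,
                         __gnu_pbds::rb_tree_tag,
                         subtree_sum_node_update> sum_map_t;
```

<para>Note that the metadata is refreshed only by operations that
change the tree's structure; assigning a new mapped value through an
iterator does not invoke the update functor in this sketch.</para>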
<figure>
<title>Restoring node invariants</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_restoring_node_invariants.png"/>
</imageobject>
<textobject>
<phrase>Restoring node invariants</phrase>
</textobject>
</mediaobject>
</figure>
4059 <para>When a tree operation might invalidate some node invariant,
4060 it invokes this method in its <classname>node_update</classname> base to
4061 restore the invariant. For example, the graphic below shows
4062 an <function>insert</function> operation (point A); the tree performs some
4063 operations, and calls the update functor three times (points B,
C, and D). (It is well known that any <function>insert</function>,
<function>erase</function>, <function>split</function>, or <function>join</function> can restore
all node invariants by a small number of node invariant updates (<xref linkend="biblio.clrs2001"/>).)</para>
<figure>
<title>Insert update sequence</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_update_seq_diagram.png"/>
</imageobject>
<textobject>
<phrase>Insert update sequence</phrase>
</textobject>
</mediaobject>
</figure>
4082 <para>To complete the description of the scheme, three questions
4083 need to be answered:</para>
4086 <listitem><para>How can a tree which supports order statistics define a
4087 method such as <classname>find_by_order</classname>?</para></listitem>
4089 <listitem><para>How can the node updater base access methods of the
4090 tree?</para></listitem>
4092 <listitem><para>How can the following cyclic dependency be resolved?
4093 <classname>node_update</classname> is a base class of the tree, yet it
4094 uses node iterators defined in the tree (its child).</para></listitem>
4097 <para>The first two questions are answered by the fact that
4098 <classname>node_update</classname> (an instantiation of
4099 <classname>Node_Update</classname>) is a <emphasis>public</emphasis> base class
4100 of the tree. Consequently:</para>
4103 <listitem><para>Any public methods of
4104 <classname>node_update</classname> are automatically methods of
4105 the tree (<xref linkend="biblio.alexandrescu01modern"/>).
4106 Thus an order-statistics node updater,
<classname>tree_order_statistics_node_update</classname>, defines
4108 the <function>find_by_order</function> method; any tree
4109 instantiated by this policy consequently supports this method as
4110 well.</para></listitem>
4112 <listitem><para>In C++, if a base class declares a method as
4113 <literal>virtual</literal>, it is
4114 <literal>virtual</literal> in its subclasses. If
4115 <classname>node_update</classname> needs to access one of the
tree's methods, say the member function
<function>end</function>, it simply declares that method as
pure <literal>virtual</literal>.</para></listitem>
4121 <para>The cyclic dependency is solved through template-template
4122 parameters. <classname>Node_Update</classname> is parametrized by
4123 the tree's node iterators, its comparison functor, and its
4124 allocator type. Thus, instantiations of
<classname>Node_Update</classname> have all information
required.</para>
4128 <para>This library assumes that constructing a metadata object and
4129 modifying it are exception free. Suppose that during some method,
4130 say <classname>insert</classname>, a metadata-related operation
4131 (e.g., changing the value of a metadata) throws an exception. Ack!
4132 Rolling back the method is unusually complex.</para>
4134 <para>Previously, a distinction was made between redundant
4135 policies and null policies. Node invariants show a
4136 case where null policies are required.</para>
4138 <para>Assume a regular tree is required, one which need not
4139 support order statistics or interval overlap queries.
Seemingly, in this case a redundant policy - a policy which
doesn't affect nodes' contents - would suffice. This would lead
to the following drawbacks:</para>
4145 <listitem><para>Each node would carry a useless metadata object, wasting
4146 space.</para></listitem>
4148 <listitem><para>The tree cannot know if its
4149 <classname>Node_Update</classname> policy actually modifies a
4150 node's metadata (this is halting reducible). In the graphic
4151 below, assume the shaded node is inserted. The tree would have
4152 to traverse the useless path shown to the root, applying
4153 redundant updates all the way.</para></listitem>
<figure>
<title>Useless update path</title>
<mediaobject>
<imageobject>
<imagedata align="center" format="PNG" scale="100"
fileref="../images/pbds_rationale_null_node_updator.png"/>
</imageobject>
<textobject>
<phrase>Useless update path</phrase>
</textobject>
</mediaobject>
</figure>
<para>A null policy class, <classname>null_node_update</classname>,
solves both of these problems. The tree detects that node
invariants are irrelevant, and defines all accordingly.</para>
4177 <section xml:id="container.tree.details.split">
4178 <info><title>Split and Join</title></info>
4180 <para>Tree-based containers support split and join methods.
4181 It is possible to split a tree so that it passes
4182 all nodes with keys larger than a given key to a different
4183 tree. These methods have the following advantages over the
4184 alternative of externally inserting to the destination
4185 tree and erasing from the source tree:</para>
4188 <listitem><para>These methods are efficient - red-black trees are split
4189 and joined in poly-logarithmic complexity; ordered-vector
4190 trees are split and joined at linear complexity. The
4191 alternatives have super-linear complexity.</para></listitem>
4193 <listitem><para>Aside from orders of growth, these operations perform
4194 few allocations and de-allocations. For red-black trees, allocations are not performed,
4195 and the methods are exception-free. </para></listitem>
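<para>A minimal sketch of both operations (assuming GCC's
<filename>ext/pb_ds</filename> headers) follows. The typedef
<classname>int_set_t</classname> and the helper
<function>split_above</function> are hypothetical names.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                         __gnu_pbds::rb_tree_tag> int_set_t;

// Moves every key larger than pivot from src into dst and returns the
// number of keys that dst holds afterwards.
inline std::size_t split_above(int_set_t& src, int pivot, int_set_t& dst)
{
  src.split(pivot, dst);
  assert(dst.empty() || pivot < *dst.begin());
  return dst.size();
}
```

<para>The inverse operation is <function>join</function>, which moves
all of another tree's nodes into this one and leaves the other tree
empty; it requires that the key ranges of the two trees do not
overlap.</para>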
4199 </section> <!-- details -->
4201 </section> <!-- tree -->
4204 <section xml:id="pbds.design.container.trie">
4205 <info><title>Trie</title></info>
4207 <section xml:id="container.trie.interface">
4208 <info><title>Interface</title></info>
4210 <para>The trie-based container has the following declaration:</para>
4212 template<typename Key,
4214 typename Cmp_Fn = std::less<Key>,
4215 typename Tag = pat_trie_tag,
4216 template<typename Const_Node_Iterator,
4217 typename Node_Iterator,
4218 typename E_Access_Traits_,
4219 typename Allocator_>
4220 class Node_Update = null_node_update,
4221 typename Allocator = std::allocator<char> >
4225 <para>The parameters have the following meaning:</para>
4228 <listitem><para><classname>Key</classname> is the key type.</para></listitem>
4230 <listitem><para><classname>Mapped</classname> is the mapped-policy.</para></listitem>
4232 <listitem><para><classname>E_Access_Traits</classname> is described below.</para></listitem>
4234 <listitem><para><classname>Tag</classname> specifies which underlying data structure
4235 to use, and is described shortly.</para></listitem>
4237 <listitem><para><classname>Node_Update</classname> is a policy for updating node
4238 invariants. This is described below.</para></listitem>
4240 <listitem><para><classname>Allocator</classname> is an allocator
4241 type.</para></listitem>
4244 <para>The <classname>Tag</classname> parameter specifies which underlying
4245 data structure to use. Instantiating it with <classname>pat_trie_tag</classname> specifies an
4246 underlying PATRICIA trie (explained shortly); any other tag is
4247 currently illegal.</para>
4249 <para>Following is a description of a (PATRICIA) trie
4250 (this implementation follows <xref linkend="biblio.okasaki98mereable"/> and
4251 <xref linkend="biblio.filliatre2000ptset"/>).
4254 <para>A (PATRICIA) trie is similar to a tree, but with the
4255 following differences:</para>
4258 <listitem><para>It explicitly views keys as a sequence of elements.
4259 E.g., a trie can view a string as a sequence of
4260 characters; a trie can view a number as a sequence of
4261 bits.</para></listitem>
4263 <listitem><para>It is not (necessarily) binary. Each node has fan-out n
4264 + 1, where n is the number of distinct
4265 elements.</para></listitem>
4267 <listitem><para>It stores values only at leaf nodes.</para></listitem>
4269 <listitem><para>Internal nodes have the properties that A) each has at
4270 least two children, and B) each shares a common prefix with
4271 all of its descendants.</para></listitem>
4274 <para>A (PATRICIA) trie has some useful properties:</para>
4277 <listitem><para>It can be configured to use large node fan-out, giving it
4278 very efficient find performance (albeit at the cost of insertion
4279 complexity and size).</para></listitem>
4281 <listitem><para>It works well for common-prefix keys.</para></listitem>
4283 <listitem><para>It can efficiently support queries such as which
4284 keys match a certain prefix. This is sometimes useful in file
4285 systems and routers, and for "type-ahead" aka predictive text matching
4286 on mobile devices.</para></listitem>
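<para>A minimal sketch of a string set stored in a PATRICIA trie follows. It
assumes the libstdc++ headers named in the includes, and it uses the
access-traits name <classname>trie_string_access_traits</classname> found in
current libstdc++ headers, which differs from the older name used in this
text. The helper <function>keys_in_order</function> is hypothetical.</para>

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// A set of strings stored in a PATRICIA trie, with no node invariants.
typedef __gnu_pbds::trie<std::string, __gnu_pbds::null_type,
                         __gnu_pbds::trie_string_access_traits<>,
                         __gnu_pbds::pat_trie_tag> string_trie;

// Hypothetical helper: copies the keys out in the trie's iteration order.
inline std::vector<std::string> keys_in_order(const string_trie& t)
{
  std::vector<std::string> out;
  for (string_trie::const_iterator it = t.begin(); it != t.end(); ++it)
    out.push_back(*it);
  return out;
}
```

<para>Because the trie is order-preserving, iteration is expected to visit the
keys in lexicographic order, with a key that is a prefix of another key
visited first.</para>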
4292 <section xml:id="container.trie.details">
4293 <info><title>Details</title></info>
4295 <section xml:id="container.trie.details.etraits">
4296 <info><title>Element Access Traits</title></info>
4298 <para>A trie inherently views its keys as sequences of elements.
4299 For example, a trie can view a string as a sequence of
4300 characters. A trie needs to map each of n elements to a
4301 number in {0, ..., n - 1}. For example, a trie can map a
4302 character <varname>c</varname> to
4303 <programlisting>static_cast<size_t>(c)</programlisting>.</para>
4305 <para>Seemingly, then, a trie can assume that its keys support
4306 (const) iterators, and that the <classname>value_type</classname> of this
4307 iterator can be cast to a <classname>size_t</classname>. There are several
4308 reasons, though, to decouple the mechanism by which the trie
4309 accesses its keys' elements from the trie:</para>
4312 <listitem><para>In some cases, the numerical value of an element is
4313 inappropriate. Consider a trie storing DNA strings. It is
4314 logical to use a trie with a fan-out of 5 = 1 + |{'A', 'C',
4315 'G', 'T'}|. This requires mapping 'T' to 3, though.</para></listitem>
4317 <listitem><para>In some cases the keys' iterators are different from what
4318 is needed. For example, a trie can be used to search for
4319 common suffixes, by using strings'
4320 <classname>reverse_iterator</classname>. As another example, a trie mapping
4321 UNICODE strings would have a huge fan-out if each node
4322 branched on a UNICODE character; instead, one can define an
4323 iterator iterating over groups of 8 bits (or fewer).</para></listitem>
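<para>The DNA case above can be made concrete with a small standalone sketch.
The function <function>dna_index</function> is a hypothetical helper, not part
of the library; it only shows the kind of element-to-position mapping that an
access-traits class would have to supply.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not part of the library: maps a DNA base to a
// dense index in {0, ..., 3}, independent of its character code.
inline std::size_t dna_index(char base)
{
  switch (base)
    {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  throw std::invalid_argument("not a DNA base");
    }
}
```

<para>Casting the character directly would give a value far outside the range
needed for a fan-out of 5, which is why the mapping is kept separate from the
key's own representation.</para>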
4327 consequently, parametrized by <classname>E_Access_Traits</classname> -
4328 traits which instruct how to access sequences' elements.
4329 <classname>string_trie_e_access_traits</classname>
4330 is a traits class for strings. Each such traits class defines some
4333 typename E_Access_Traits::const_iterator
4336 <para>is a const iterator iterating over a key's elements. The
4337 traits class must also define methods for obtaining an iterator
4338 to the first and last element of a key.</para>
4340 <para>The graphic below shows a
4341 (PATRICIA) trie resulting from inserting the words: "I wish
4342 that I could ever see a poem lovely as a trie" (which,
4343 unfortunately, does not rhyme).</para>
4345 <para>The leaf nodes contain values; each internal node contains
4346 two <classname>typename E_Access_Traits::const_iterator</classname>
4347 objects, indicating the maximal common prefix of all keys in
4348 the sub-tree. For example, the shaded internal node roots a
4349 sub-tree with leaves "a" and "as". The maximal common prefix is
4350 "a". The internal node contains, consequently, two const
4351 iterators, one pointing to <varname>'a'</varname>, and the other to
4352 <varname>'s'</varname>.</para>
4355 <title>A PATRICIA trie</title>
4358 <imagedata align="center" format="PNG" scale="100"
4359 fileref="../images/pbds_pat_trie.png"/>
4362 <phrase>A PATRICIA trie</phrase>
4369 <section xml:id="container.trie.details.node">
4370 <info><title>Node Invariants</title></info>
4372 <para>Trie-based containers support node invariants, as do
4373 tree-based containers. There are two minor
4374 differences, though, which, unfortunately, prevent them from
4375 sharing the same node-updating policies:</para>
4379 <para>A trie's <classname>Node_Update</classname> template-template
4380 parameter is parametrized by <classname>E_Access_Traits</classname>, while
4381 a tree's <classname>Node_Update</classname> template-template parameter is
4382 parametrized by <classname>Cmp_Fn</classname>.</para></listitem>
4384 <listitem><para>Tree-based containers store values in all nodes, while
4385 trie-based containers (at least in this implementation) store
4386 values in leaves.</para></listitem>
4389 <para>The graphic below shows the scheme, as well as some predefined
4390 policies (which are explained below).</para>
4393 <title>A trie and its update policy</title>
4396 <imagedata align="center" format="PNG" scale="100"
4397 fileref="../images/pbds_trie_node_updator_policy_cd.png"/>
4400 <phrase>A trie and its update policy</phrase>
4406 <para>This library offers the following pre-defined trie node
4407 updating policies:</para>
4412 <classname>trie_order_statistics_node_update</classname>
4413 supports order statistics.
4417 <listitem><para><classname>trie_prefix_search_node_update</classname>
4418 supports searching for ranges that match a given prefix.</para></listitem>
4420 <listitem><para><classname>null_node_update</classname>
4421 is the null node updater.</para></listitem>
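<para>The prefix-search policy can be tried out with a sketch along the
following lines. It assumes the libstdc++ headers named in the includes and
the current header name <classname>trie_string_access_traits</classname>; the
helper <function>keys_with_prefix</function> is hypothetical and simply wraps
the <function>prefix_range</function> method that the policy adds to the
container.</para>

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/trie_policy.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// A string set whose nodes maintain the prefix-search invariant.
typedef __gnu_pbds::trie<std::string, __gnu_pbds::null_type,
                         __gnu_pbds::trie_string_access_traits<>,
                         __gnu_pbds::pat_trie_tag,
                         __gnu_pbds::trie_prefix_search_node_update> prefix_trie;

// Hypothetical helper: collects all keys of `t` that start with `prefix`.
inline std::vector<std::string>
keys_with_prefix(const prefix_trie& t, const std::string& prefix)
{
  typedef prefix_trie::const_iterator const_iterator;
  const std::pair<const_iterator, const_iterator> range = t.prefix_range(prefix);
  std::vector<std::string> out;
  for (const_iterator it = range.first; it != range.second; ++it)
    out.push_back(*it);
  return out;
}
```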
4426 <section xml:id="container.trie.details.split">
4427 <info><title>Split and Join</title></info>
4428 <para>Trie-based containers support split and join methods; the
4429 rationale is the same as that for tree-based containers supporting
4430 these methods.</para>
4433 </section> <!-- details -->
4435 </section> <!-- trie -->
4437 <!-- list_update -->
4438 <section xml:id="pbds.design.container.list">
4439 <info><title>List</title></info>
4441 <section xml:id="container.list.interface">
4442 <info><title>Interface</title></info>
4444 <para>The list-based container has the following declaration:</para>
4446 template<typename Key,
4448 typename Eq_Fn = std::equal_to<Key>,
4449 typename Update_Policy = lu_move_to_front_policy<>,
4450 typename Allocator = std::allocator<char> >
4454 <para>The parameters have the following meaning:</para>
4459 <classname>Key</classname> is the key type.
4465 <classname>Mapped</classname> is the mapped-policy.
4471 <classname>Eq_Fn</classname> is a key equivalence functor.
4477 <classname>Update_Policy</classname> is a policy updating positions in
4478 the list based on access patterns. It is described in the
4479 following subsection.
4485 <classname>Allocator</classname> is an allocator type.
4490 <para>A list-based associative container is a container that
4491 stores elements in a linked list. It does not keep the elements
4492 in any particular order related to the keys. List-based
4493 containers are primarily useful for creating "multimaps". In fact,
4494 list-based containers are designed in this library expressly for
4495 this purpose.</para>
4497 <para>List-based containers might also be useful for some rare
4498 cases, where a key is encapsulated to the extent that only
4499 key-equivalence can be tested. Hash-based containers need to know
4500 how to transform a key into a size type, and tree-based containers
4501 need to know if some key is larger than another. List-based
4502 associative containers, conversely, only need to know if two keys
4503 are equivalent.</para>
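<para>The following sketch shows a key type that offers nothing but equality
being used in a list-based map. It assumes the libstdc++ header
<filename>ext/pb_ds/assoc_container.hpp</filename>; the type
<classname>opaque_id</classname> and the helper <function>lookup</function>
are hypothetical.</para>

```cpp
#include <cassert>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>

// Hypothetical key type: it can be compared for equality, but it offers
// neither a hash function nor an ordering.
struct opaque_id
{
  int raw;
  explicit opaque_id(int r = 0) : raw(r) { }
};

inline bool operator==(const opaque_id& a, const opaque_id& b)
{ return a.raw == b.raw; }

// A list-based map; the default Eq_Fn only needs operator==.
typedef __gnu_pbds::list_update<opaque_id, int> id_map;

// Hypothetical helper: returns the mapped value, or `fallback` if absent.
inline int lookup(id_map& m, const opaque_id& key, int fallback)
{
  id_map::point_iterator it = m.find(key);
  return it == m.end() ? fallback : it->second;
}
```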
4505 <para>Since a list-based associative container does not order
4506 elements by keys, is it possible to order the list in some
4507 useful manner? Remarkably, many on-line competitive
4508 algorithms exist for reordering lists to reflect access
4509 prediction. (See <xref linkend="biblio.motwani95random"/> and <xref linkend="biblio.andrew04mtf"/>).
4514 <section xml:id="container.list.details">
4515 <info><title>Details</title></info>
4518 <section xml:id="container.list.details.ds">
4519 <info><title>Underlying Data Structure</title></info>
4521 <para>The graphic below shows a
4522 simple list of integer keys. If we search for the integer 6, we
4523 are paying an overhead: the link with key 6 is only the fifth
4524 link; if it were the first link, it could be accessed
4528 <title>A simple list</title>
4531 <imagedata align="center" format="PNG" scale="100"
4532 fileref="../images/pbds_simple_list.png"/>
4535 <phrase>A simple list</phrase>
4540 <para>List-update algorithms reorder lists as elements are
4541 accessed. They try to determine, by the access history, which
4542 keys to move to the front of the list. Some of these algorithms
4543 require adding some metadata alongside each entry.</para>
4545 <para>For example, in the graphic below label A shows the counter
4546 algorithm. Each node contains both a key and a count metadata
4547 (shown in bold). When an element is accessed (e.g. 6) its count is
4548 incremented, as shown in label B. If the count reaches some
4549 predetermined value, say 10, as shown in label C, the count is set
4550 to 0 and the node is moved to the front of the list, as in label
4555 <title>The counter algorithm</title>
4558 <imagedata align="center" format="PNG" scale="100"
4559 fileref="../images/pbds_list_update.png"/>
4562 <phrase>The counter algorithm</phrase>
4570 <section xml:id="container.list.details.policies">
4571 <info><title>Policies</title></info>
4573 <para>This library allows instantiating lists with policies
4574 implementing any algorithm moving nodes to the front of the
4575 list (policies implementing algorithms interchanging nodes are
4576 unsupported).</para>
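<para>To make the counter algorithm from the previous subsection concrete,
here is a standalone sketch built on <classname>std::list</classname>. It is
not the library's implementation; the class
<classname>counter_list</classname> and its members are hypothetical and only
illustrate the move-to-front rule.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>

// Standalone illustration of the counter algorithm; this is not the
// library's implementation. Each node holds a key and a count.
class counter_list
{
  typedef std::list<std::pair<int, std::size_t> > node_list;

public:
  explicit counter_list(std::size_t max_count) : m_max_count(max_count) { }

  // New keys are appended with a count of zero.
  void push_back(int key)
  { m_nodes.push_back(std::make_pair(key, std::size_t(0))); }

  // Searches for `key`. On a hit the node's count is incremented; when the
  // count reaches max_count it is reset and the node moves to the front.
  bool find(int key)
  {
    for (node_list::iterator it = m_nodes.begin(); it != m_nodes.end(); ++it)
      if (it->first == key)
        {
          if (++it->second == m_max_count)
            {
              it->second = 0;
              m_nodes.splice(m_nodes.begin(), m_nodes, it);
            }
          return true;
        }
    return false;
  }

  int front_key() const { return m_nodes.front().first; }

private:
  node_list m_nodes;
  std::size_t m_max_count;
};
```

<para>With a maximum count of 3, the third successful search for a key moves
that key to the front; earlier searches leave the order unchanged.</para>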
4578 <para>Associative containers based on lists are parametrized by an
4579 <classname>Update_Policy</classname> parameter. This parameter defines the
4580 type of metadata each node contains, how to create the
4581 metadata, and how to decide, using this metadata, whether to
4582 move a node to the front of the list. A list-based associative
4583 container object derives (publicly) from its update policy.
4586 <para>An instantiation of <classname>Update_Policy</classname> must define
4587 internally <classname>update_metadata</classname> as the metadata it
4588 requires. Internally, each node of the list contains, besides
4589 the usual key and data, an instance of <classname>typename
4590 Update_Policy::update_metadata</classname>.</para>
4592 <para>An instantiation of <classname>Update_Policy</classname> must define
4593 internally two operators:</para>
4599 operator()(update_metadata &);
4602 <para>The first is called by the container object, when creating a
4603 new node, to create the node's metadata. The second is called
4604 by the container object, when a node is accessed (i.e.,
4605 when a find operation's key is equivalent to the key of the
4606 node), to determine whether to move the node to the front of
4610 <para>The library contains two predefined implementations of
4611 list-update policies. The first
4612 is <classname>lu_counter_policy</classname>, which implements the
4613 counter algorithm described above. The second is
4614 <classname>lu_move_to_front_policy</classname>,
4615 which unconditionally moves an accessed element to the front of
4616 the list. The latter type is very useful in this library,
4617 since there is no need to associate metadata with each element.
4618 (See <xref linkend="biblio.andrew04mtf"/>
4623 <section xml:id="container.list.details.mapped">
4624 <info><title>Use in Multimaps</title></info>
4626 <para>In this library, there are no equivalents for the standard's
4627 multimaps and multisets; instead one uses an associative
4628 container mapping primary keys to secondary keys.</para>
4630 <para>List-based containers are especially useful as associative
4631 containers for secondary keys. In fact, they are implemented
4632 here expressly for this purpose.</para>
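<para>A multimap-like structure of this kind might be sketched as follows: a
hash table maps each primary key to a list-based set of secondary keys. The
sketch assumes the libstdc++ header
<filename>ext/pb_ds/assoc_container.hpp</filename>; the helpers
<function>add_pair</function> and <function>secondary_count</function> are
hypothetical.</para>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <ext/pb_ds/assoc_container.hpp>

// Secondary keys live in a list-based set; primary keys map to such sets.
typedef __gnu_pbds::list_update<int, __gnu_pbds::null_type> secondary_set;
typedef __gnu_pbds::cc_hash_table<std::string, secondary_set> multimap_like;

// Hypothetical helper: associates `secondary` with `primary`.
inline void add_pair(multimap_like& m, const std::string& primary, int secondary)
{
  m[primary].insert(secondary);
}

// Hypothetical helper: number of secondary keys stored under `primary`.
inline std::size_t secondary_count(multimap_like& m, const std::string& primary)
{
  multimap_like::point_iterator it = m.find(primary);
  if (it == m.end())
    return 0;
  return it->second.size();
}
```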
4634 <para>To begin with, these containers use very little per-entry
4635 structure memory overhead, since they can be implemented as
4636 singly-linked lists. (Arrays use even lower per-entry memory
4637 overhead, but they are less flexible in moving around entries,
4638 and have weaker invalidation guarantees).</para>
4640 <para>More importantly, though, list-based containers use very
4641 little per-container memory overhead. The memory overhead of an
4642 empty list-based container is practically that of a pointer.
4643 This is important when they are used as secondary
4644 associative-containers in situations where the average ratio of
4645 secondary keys to primary keys is low (or even 1).</para>
4647 <para>In order to reduce the per-container memory overhead as much
4648 as possible, they are implemented as closely as possible to
4649 singly-linked lists.</para>
4654 List-based containers do not store internally the number
4655 of values that they hold. This means that their <function>size</function>
4656 method has linear complexity (as <classname>std::list</classname> was permitted to have before C++11).
4657 Note that finding the number of equivalent-key values in a
4658 standard multimap also has linear complexity (because it must be
4659 done via <function>std::distance</function> of the
4660 multimap's <function>equal_range</function> method), but usually with
4667 Most associative-container objects each hold a policy
4668 object (a hash-based container object holds a
4669 hash functor). List-based containers, conversely, only have
4670 class-wide policy objects.
4678 </section> <!-- details -->
4680 </section> <!-- list -->
4683 <!-- priority_queue -->
4684 <section xml:id="pbds.design.container.priority_queue">
4685 <info><title>Priority Queue</title></info>
4687 <section xml:id="container.priority_queue.interface">
4688 <info><title>Interface</title></info>
4690 <para>The priority queue container has the following
4694 template<typename Value_Type,
4695 typename Cmp_Fn = std::less<Value_Type>,
4696 typename Tag = pairing_heap_tag,
4697 typename Allocator = std::allocator<char > >
4698 class priority_queue;
4701 <para>The parameters have the following meaning:</para>
4704 <listitem><para><classname>Value_Type</classname> is the value type.</para></listitem>
4706 <listitem><para><classname>Cmp_Fn</classname> is a value comparison functor.</para></listitem>
4708 <listitem><para><classname>Tag</classname> specifies which underlying data structure
4709 to use.</para></listitem>
4711 <listitem><para><classname>Allocator</classname> is an allocator
4712 type.</para></listitem>
4715 <para>The <classname>Tag</classname> parameter specifies which underlying
4716 data structure to use. Instantiating it with <classname>pairing_heap_tag</classname>, <classname>binary_heap_tag</classname>,
4717 <classname>binomial_heap_tag</classname>,
4718 <classname>rc_binomial_heap_tag</classname>,
4719 or <classname>thin_heap_tag</classname>,
4720 specifies, respectively,
4721 an underlying pairing heap (<xref linkend="biblio.fredman86pairing"/>),
4722 binary heap (<xref linkend="biblio.clrs2001"/>),
4723 binomial heap (<xref linkend="biblio.clrs2001"/>),
4724 a binomial heap with a redundant binary counter (<xref linkend="biblio.maverick_lowerbounds"/>),
4725 or a thin heap (<xref linkend="biblio.kt99fat_heaps"/>).
4729 As mentioned in the tutorial,
4730 <classname>__gnu_pbds::priority_queue</classname> shares most of the
4731 same interface with <classname>std::priority_queue</classname>.
4732 E.g. if <varname>q</varname> is a priority queue of type
4733 <classname>Q</classname>, then <function>q.top()</function> will
4734 return the "largest" value in the container (according to
4736 Q::cmp_fn</classname>). <classname>__gnu_pbds::priority_queue</classname>
4737 has a larger (and very slightly different) interface than
4738 <classname>std::priority_queue</classname>, however, since typically
4739 <classname>push</classname> and <classname>pop</classname> are deemed
4740 insufficient for manipulating priority-queues. </para>
4742 <para>Different settings require different priority-queue
4743 implementations, which are described later; the traits section
4744 discusses ways to differentiate between the traits of
4745 different implementations.</para>
4750 <section xml:id="container.priority_queue.details">
4751 <info><title>Details</title></info>
4753 <section xml:id="container.priority_queue.details.iterators">
4754 <info><title>Iterators</title></info>
4756 <para>There are many different underlying data structures for
4757 implementing priority queues. Unfortunately, most such
4758 structures are oriented towards making <function>push</function> and
4759 <function>top</function> efficient, and consequently don't allow efficient
4760 access of other elements: for instance, they cannot support an efficient
4761 <function>find</function> method. In the use case where it
4762 is important to both access and "do something with" an
4763 arbitrary value, one would be out of luck. For example, many graph algorithms require
4764 modifying a value (typically increasing it in the sense of the
4765 priority queue's comparison functor).</para>
4767 <para>In order to access and manipulate an arbitrary value in a
4768 priority queue, one needs to reference the internals of the
4769 priority queue from some form of an associative container -
4770 this is unavoidable. Of course, in order to maintain the
4771 encapsulation of the priority queue, this needs to be done in a
4772 way that minimizes exposure to implementation internals.</para>
4774 <para>In this library the priority queue's <function>push</function>
4775 method returns an iterator, which, if valid, can be used for subsequent <function>modify</function> and
4776 <function>erase</function> operations. This both preserves the priority
4777 queue's encapsulation, and allows accessing arbitrary values (since the
4778 returned iterators from the <function>push</function> operation can be
4779 stored in some form of associative container).</para>
4781 <para>Priority queues' iterators present a problem regarding their
4782 invalidation guarantees. One assumes that calling
4783 <function>operator++</function> on an iterator will associate it
4784 with the "next" value. Priority-queues are
4785 self-organizing: each operation changes what the "next" value
4786 means. Consequently, it does not make sense that <function>push</function>
4787 will return an iterator that can be incremented - this can have
4788 no possible use. Also, as in the case of hash-based containers,
4789 it is awkward to define if a subsequent <function>push</function> operation
4790 invalidates a prior returned iterator: it invalidates it in the
4791 sense that its "next" value is not related to what it
4792 previously considered to be its "next" value. However, it might not
4793 invalidate it, in the sense that it can be
4794 de-referenced and used for <function>modify</function> and <function>erase</function>
4797 <para>Similarly to the case of the other unordered associative
4798 containers, this library uses a distinction between
4799 point-type and range-type iterators. A priority queue's <classname>iterator</classname> can always be
4800 converted to a <classname>point_iterator</classname>, and a
4801 <classname>const_iterator</classname> can always be converted to a
4802 <classname>point_const_iterator</classname>.</para>
4804 <para>The following snippet demonstrates manipulating an arbitrary
4807 // A priority queue of integers.
4808 priority_queue<int > p;
4810 // Insert some values into the priority queue.
4811 priority_queue<int >::point_iterator it = p.push(0);
4816 // Now modify a value.
4817 p.modify(it, 3);
4819 assert(p.top() == 3);
4823 <para>It should be noted that an alternative design could embed an
4824 associative container in a priority queue. Could, but most
4825 probably should not. To begin with, it should be noted that one
4826 could always encapsulate a priority queue and an associative
4827 container mapping values to priority queue iterators with no
4828 performance loss. One cannot, however, "un-encapsulate" a priority
4829 queue embedding an associative container, which might lead to
4830 performance loss. Assume that one needs to associate each value
4831 with some data unrelated to priority queues. Then using
4832 this library's design, one could use an
4833 associative container mapping each value to a pair consisting of
4834 this data and a priority queue's iterator. The embedded
4835 design would need two associative containers. Similar
4836 problems might arise in cases where a value can reside
4837 simultaneously in many priority queues.</para>
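<para>The external-index design described above might look roughly like the
following sketch, which keeps the point iterators returned by
<function>push</function> in a <classname>std::map</classname>. It assumes the
libstdc++ header <filename>ext/pb_ds/priority_queue.hpp</filename>; the class
<classname>indexed_queue</classname> and its members are hypothetical.</para>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <ext/pb_ds/priority_queue.hpp>

// (priority, name) pairs ordered by std::less, so top() is the largest pair.
typedef std::pair<int, std::string> entry;
typedef __gnu_pbds::priority_queue<entry, std::less<entry>,
                                   __gnu_pbds::pairing_heap_tag> entry_queue;

// Hypothetical wrapper: a priority queue plus an external index from names
// to the point iterators returned by push().
class indexed_queue
{
  typedef std::map<std::string, entry_queue::point_iterator> index_type;

public:
  // Pushes an entry and remembers its point iterator. This sketch assumes
  // that each name is pushed at most once.
  void push(const std::string& name, int priority)
  { m_index.insert(std::make_pair(name, m_queue.push(entry(priority, name)))); }

  // Changes the priority of a previously pushed entry; returns false if
  // the name is unknown.
  bool reprioritize(const std::string& name, int priority)
  {
    index_type::iterator pos = m_index.find(name);
    if (pos == m_index.end())
      return false;
    m_queue.modify(pos->second, entry(priority, name));
    return true;
  }

  const std::string& top_name() const { return m_queue.top().second; }

private:
  entry_queue m_queue;
  index_type m_index;
};
```

<para>A pairing heap is used here because its point iterators stay valid
across later operations, which is what makes storing them worthwhile.</para>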
4842 <section xml:id="container.priority_queue.details.d">
4843 <info><title>Underlying Data Structure</title></info>
4845 <para>There are three main implementations of priority queues: the
4846 first employs a binary heap, typically one which uses a
4847 sequence; the second uses a tree (or forest of trees), which is
4848 typically less structured than an associative container's tree;
4849 the third simply uses an associative container. These are
4850 shown in the graphic below, in labels A1 and A2, label B, and label C.</para>
4853 <title>Underlying Priority-Queue Data-Structures.</title>
4856 <imagedata align="center" format="PNG" scale="100"
4857 fileref="../images/pbds_priority_queue_different_underlying_dss.png"/>
4860 <phrase>Underlying Priority-Queue Data-Structures.</phrase>
4865 <para>Roughly speaking, any value that is both pushed and popped
4866 from a priority queue must incur a logarithmic expense (in the
4867 amortized sense). Any priority queue implementation that would
4868 avoid this, would violate known bounds on comparison-based
4869 sorting (see <xref linkend="biblio.clrs2001"/> and <xref linkend="biblio.brodal96priority"/>).
4872 <para>Most implementations do
4873 not differ in the asymptotic amortized complexity of
4874 <function>push</function> and <function>pop</function> operations, but they differ in
4875 the constants involved, in the complexity of other operations
4876 (e.g., <function>modify</function>), and in the worst-case
4877 complexity of single operations. In general, the more
4878 "structured" an implementation (i.e., the more internal
4879 invariants it possesses), the higher its amortized complexity
4880 of <function>push</function> and <function>pop</function> operations.</para>
4882 <para>This library implements different algorithms using a
4883 single class: <classname>priority_queue</classname>.
4884 Instantiating the <classname>Tag</classname> template parameter "selects"
4885 the implementation:</para>
4889 Instantiating <classname>Tag = binary_heap_tag</classname> creates
4890 a binary heap of the form represented in the graphic by labels A1 or A2. The former is internally
4891 selected by priority_queue
4892 if <classname>Value_Type</classname> is instantiated by a primitive type
4893 (e.g., an <type>int</type>); the latter is
4894 internally selected for all other types (e.g.,
4895 <classname>std::string</classname>). This implementation is relatively
4896 unstructured, and so has good <classname>push</classname> and <classname>pop</classname>
4897 performance; it is the "best-in-kind" for primitive
4898 types, e.g., <type>int</type>s. Conversely, it has
4899 poor worst-case performance, and can support only linear-time
4900 <function>modify</function> and <function>erase</function> operations.</para></listitem>
4902 <listitem><para>Instantiating <classname>Tag =
4903 pairing_heap_tag</classname> creates a pairing heap of the form
4904 represented by label B in the graphic above. This
4905 implementation too is relatively unstructured, and so has good
4906 <function>push</function> and <function>pop</function>
4907 performance; it is the "best-in-kind" for non-primitive types,
4908 e.g., <classname>std::string</classname>s. It also has very good
4909 worst-case <function>push</function> and
4910 <function>join</function> performance (O(1)), but has high
4911 worst-case <function>pop</function>
4912 complexity.</para></listitem>
4914 <listitem><para>Instantiating <classname>Tag =
4915 binomial_heap_tag</classname> creates a binomial heap of the
4916 form represented by label B in the graphic above. This
4917 implementation is more structured than a pairing heap, and so
4918 has worse <function>push</function> and <function>pop</function>
4919 performance. Conversely, it has sub-linear worst-case bounds for,
4920 e.g., <function>pop</function>, and so it might be preferred in
4921 cases where responsiveness is important.</para></listitem>
4923 <listitem><para>Instantiating <classname>Tag =
4924 rc_binomial_heap_tag</classname> creates a binomial heap of the
4925 form represented in label B above, accompanied by a redundant
4926 counter which governs the trees. This implementation is
4927 therefore more structured than a binomial heap, and so has worse
4928 <function>push</function> and <function>pop</function>
4929 performance. Conversely, it guarantees O(1)
4930 <function>push</function> complexity, and so it might be
4931 preferred in cases where the responsiveness of a binomial heap
4932 is insufficient.</para></listitem>
4934 <listitem><para>Instantiating <classname>Tag =
4935 thin_heap_tag</classname> creates a thin heap of the form
4936 represented by the label B in the graphic above. This
4937 implementation too is more structured than a pairing heap, and
4938 so has worse <function>push</function> and
4939 <function>pop</function> performance. Conversely, it has better
4940 worst-case complexity than, and identical amortized complexity to, a Fibonacci
4941 heap, and so might be more appropriate for some graph
4942 algorithms.</para></listitem>
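<para>Since only the <classname>Tag</classname> argument changes between
implementations, generic code can be written once and instantiated for each
of them. The sketch below assumes the libstdc++ header
<filename>ext/pb_ds/priority_queue.hpp</filename>; the function
<function>pops_in_descending_order</function> is a hypothetical helper.</para>

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>

// Hypothetical helper: pushes the same values into a priority queue built
// on the data structure selected by Tag, then pops them and checks that
// they come out in descending order.
template<typename Tag>
bool pops_in_descending_order()
{
  typedef __gnu_pbds::priority_queue<int, std::less<int>, Tag> queue_type;
  queue_type q;
  const int values[] = { 4, 1, 7, 3, 9, 2 };
  for (unsigned i = 0; i < sizeof(values) / sizeof(values[0]); ++i)
    q.push(values[i]);

  int previous = q.top();
  q.pop();
  while (!q.empty())
    {
      if (q.top() > previous)
        return false;
      previous = q.top();
      q.pop();
    }
  return true;
}
```

<para>The observable behaviour of <function>push</function>,
<function>top</function> and <function>pop</function> is the same for every
tag; what differs is the cost profile described in the list above.</para>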
4945 <para>Of course, one can use any order-preserving associative
4946 container as a priority queue, as in label C of the graphic above, possibly by creating an adapter class
4947 over the associative container (much as
4948 <classname>std::priority_queue</classname> can adapt <classname>std::vector</classname>).
4949 This has the advantage that no cross-referencing is necessary
4950 at all; the priority queue itself is an associative container.
4951 Most associative containers are too structured to compete with
4952 priority queues in terms of <function>push</function> and <function>pop</function>
4959 <section xml:id="container.priority_queue.details.traits">
4960 <info><title>Traits</title></info>
4962 <para>It would be nice if all priority queues could
4963 share exactly the same behavior regardless of implementation. Sadly, this is not possible. One example is join operations: joining
4964 two binary heaps might throw an exception (without corrupting
4965 either of the heaps on which it operates), but joining two pairing
4966 heaps is exception free.</para>
4968 <para>Tags and traits are very useful for manipulating generic
4969 types. <classname>__gnu_pbds::priority_queue</classname>
4970 publicly defines <classname>container_category</classname> as one of the tags. Given any
4971 container <classname>Cntnr</classname>, the tag of the underlying
4972 data structure can be found via <classname>typename
4973 Cntnr::container_category</classname>; this is one of the possible tags shown in the graphic below.
4977 <title>Priority-Queue Data-Structure Tags.</title>
4980 <imagedata align="center" format="PNG" scale="100"
4981 fileref="../images/pbds_priority_queue_tag_hierarchy.png"/>
4984 <phrase>Priority-Queue Data-Structure Tags.</phrase>
4990 <para>Additionally, a traits mechanism can be used to query a
4991 container type for its attributes. Given any container
4992 <classname>Cntnr</classname>, then <programlisting>__gnu_pbds::container_traits<Cntnr></programlisting>
4993 is a traits class identifying the properties of the
4996 <para>To find if a container might throw if two of its objects are
4999 container_traits<Cntnr>::split_join_can_throw
5004 Different priority-queue implementations have different invalidation guarantees. This is
5005 especially important, since there is no way to access an arbitrary
5006 value of priority queues except for iterators. Similarly to
5007 associative containers, one can use
5009 container_traits<Cntnr>::invalidation_guarantee
5011 to get the invalidation guarantee type of a priority queue.</para>
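<para>Both queries can be wrapped in small generic helpers, as in the
following sketch. It assumes the libstdc++ headers
<filename>ext/pb_ds/priority_queue.hpp</filename> and
<filename>ext/pb_ds/tag_and_trait.hpp</filename>; the helper names
<function>join_can_throw</function>,
<function>keeps_point_iterators</function> and
<function>point_iterators_stay_valid</function> are hypothetical.</para>

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

typedef __gnu_pbds::priority_queue<int, std::less<int>,
                                   __gnu_pbds::pairing_heap_tag> pairing_queue;
typedef __gnu_pbds::priority_queue<int, std::less<int>,
                                   __gnu_pbds::binary_heap_tag> binary_queue;

// Hypothetical helper: true if joining two Cntnr objects might throw.
template<typename Cntnr>
bool join_can_throw()
{ return __gnu_pbds::container_traits<Cntnr>::split_join_can_throw; }

// Tag dispatch on the invalidation guarantee: the more specific overload
// is chosen when the guarantee is at least point_invalidation_guarantee.
inline bool keeps_point_iterators(__gnu_pbds::basic_invalidation_guarantee)
{ return false; }
inline bool keeps_point_iterators(__gnu_pbds::point_invalidation_guarantee)
{ return true; }

// Hypothetical helper: true if point-type iterators of Cntnr stay valid
// when the container is modified.
template<typename Cntnr>
bool point_iterators_stay_valid()
{
  typedef typename __gnu_pbds::container_traits<Cntnr>::invalidation_guarantee
    guarantee;
  return keeps_point_iterators(guarantee());
}
```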
5013 <para>It is easy to understand from the graphic above what <classname>container_traits<Cntnr>::invalidation_guarantee</classname>
5014 will be for different implementations. All implementations of
5015 the type represented by label B have <classname>point_invalidation_guarantee</classname>:
5016 the container can freely internally reorganize the nodes -
5017 range-type iterators are invalidated, but point-type iterators
5018 are always valid. Implementations of the type represented by labels A1 and A2 have <classname>basic_invalidation_guarantee</classname>:
5019 the container can freely internally reallocate the array - both
5020 point-type and range-type iterators might be invalidated.</para>
5023 This has major implications, and constitutes a good reason to avoid
5024 using binary heaps. A binary heap can perform <function>modify</function>
5025 or <function>erase</function> efficiently given a valid point-type
5026 iterator. However, in order to supply it with a valid point-type
5027 iterator, one needs to iterate (linearly) over all
5028 values, then supply the relevant iterator (recall that a
5029 range-type iterator can always be converted to a point-type
5030 iterator). This means that if the number of <function>modify</function> or
5031 <function>erase</function> operations is non-negligible (say
5032 super-logarithmic in the total sequence of operations), binary
5033 heaps will perform badly.
5038 </section> <!-- details -->
5040 </section> <!-- priority_queue -->
5044 </section> <!-- container -->
5046 </section> <!-- design -->
5051 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" parse="xml"
5052 href="test_policy_data_structures.xml">
5055 <!-- S05: Reference/Acknowledgments -->
5056 <section xml:id="pbds.ack">
5057 <info><title>Acknowledgments</title></info>
5058 <?dbhtml filename="policy_data_structures_ack.html"?>
5061 Written by Ami Tavory and Vladimir Dreizin (IBM Haifa Research
5062 Laboratories), and Benjamin Kosnik (Red Hat).
5066 This library was partially written at IBM's Haifa Research Labs.
5067 It is based heavily on policy-based design and uses many useful
5068 techniques from Modern C++ Design: Generic Programming and Design
5069 Patterns Applied by Andrei Alexandrescu.
5073 Two ideas are borrowed from the SGI-STL implementation:
5079 The prime-based resize policies use a list of primes taken from
5080 the SGI-STL implementation.
5086 The red-black trees contain both a root node and a header node
5087 (containing metadata), connected in a way that forward and
5088 reverse iteration can be performed efficiently.
5094 Some test utilities borrow ideas from
5095 <link xmlns:xlink="http://www.w3.org/1999/xlink"
5096 xlink:href="http://www.boost.org/libs/timer/">boost::timer</link>.
5100 We would like to thank Scott Meyers for useful comments (without
5101 attributing to him any flaws in the design or implementation of the
5104 <para>We would like to thank Matt Austern for the suggestion to
5105 include tries.</para>
5108 <!-- S06: Biblio -->
5109 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" parse="xml"
5110 href="policy_data_structures_biblio.xml">