doc/F.dbstree.tex

   1 \documentclass[twocolumn]{article}
   2
   3 \title{The Dynamic Binary search tree}
   4 \author{}
   5
   6 \begin{document}
   7
   8 \maketitle
   9
  10 \section{Introduction}
  11
  12 Binary trees are an important data organization in which searching for
  13 an element is performed in $O(\log{n})$ steps. Dynamic trees were new
  14 elements can be added and removed may become unbalanced and thus more
  15 sophisticated binary tree implementations have been suggested to keep
  16 the tree in $O(\log{n})$ limits.
  17
  18 Our goal is controlling the height of a tree to small values. A balanced
  19 tree fulfills this goal but a tree need not necessarily be balanced to have
  20 its height controlled.
  21
  22 The algorithm for an efficient implementation of dynamic tree is described,
  23 along with comparisons with existing dynamic tree implementations.
  24
  25 \section{Description of the tree}
  26
  27 The basic idea is using a simple binary search tree and setting a limit on
  28 it's height. Let us use for the rest of this study the number 32 as height
  29 limit. This restricts the total number of nodes to no more than $2^{32}$ as
  30 long as duplicate elements are not allowed.
  31 We should notice that due to the logarithmic nature of complexity,
  32 in real-life applications a balanced tree will never reach such values for
  33 it's height.
  34
  35 Supposing we can efficiently fulfill this condition, locating an element in
  36 the tree will be performed in no more than 32 steps. Typically
  37 $O(32)<O(\log{n})$. Actually it may result in a few more steps.
  38 In practice it is an unimportant
  39 difference and therefore it is as fast as any balanced tree.
  40
  41 Removing an element from the tree can be easily done without the necessity
  42 of a balancing operation as long as by removing an element the
  43 height of the tree is not increased. This condition can be implemented.
  44
  45 If the node to be removed has two child nodes then the procedure is to find
  46 the biggest element from the smaller ones or the smallest from the bigger
  47 ones and replace the node to be removed with this element. A sample code is
  48 illustrated in \ref{app:rmv}
  49
  50 Adding a new element to the tree is the most important operation. While
  51 inserting the element the number of comparisons performed is counted. The
  52 simple rule, when this number reaches 32 the tree must be rebalanced, is
  53 very efficient.
  54
  55 Balancing the tree is accomplished in $n$ steps. However, the frequency at
  56 which the tree needs to be balanced again is a function of $\frac{1}{2^n}$.
  57 As a result adding a new element is in the great majority of the cases done
  58 in less than 32 steps; but rebalancing is more rare as the number of nodes
  59 increases
  60 \footnote{A wosre-case behavior were a lot of elements have to be inserted
  61 in an empty tree and they arrive sorted is solved in \ref{init}}.
  62
  63 \subsection{Balancing the tree}
  64
  65 The binary tree can be entirely reconstructed in $n$ steps using a simple
  66 algorithm. First, an array of $n$ pointers is allocated an all the elements
  67 of the tree are placed in this array in sorted order. Let us remember than an
  68 always-balanced tree implementation requires extra memory for the node
  69 structures and therefore the extra allocation of the array is yet efficient;
  70 furthermore, in modern multithreaded database systems an operation which
  71 alters part of the data is given full control of the hardware and the memory
  72 usage burst will in this case be safely performed.
  73
  74 The next step is to reset the root of the tree and start inserting elements
  75 in the order at which they will result in a perfectly balanced tree. This is
  76 done by first inserting the middle element $N/2$ of the array, then
  77 elements $N/4$ and $3 N / 4$, then $N/8$, $3 N / 8$, $5 N / 8 \ldots$
  78 and so on.
  79
  80 By this procedure $2^k - 1$ where $k$ the integer for which
  81 $2^k \leq N < 2^{k + 1}$ nodes can be inserted. For the remaining $N-(2^k -1)$
  82 nodes, if any, various algorithms of $O(n)$ steps can be applied. One way
  83 that is used in the sample code is to proceed normally as if there were
  84 $2^{k+1}$ nodes and exclude the elements $i$ for which the modulo of the
  85 division $i * 2^k / N$ is greater or equal to $N-(2^k -1)$ or zero; in other
  86 words excluding those numbers $a$ for which there is a number $b$ so that
  87 \begin{equation}
  88 \left\lfloor \frac{a N}{2^{k+1}} \right\rfloor =
  89 \left\lfloor \frac{b N}{2^k} \right\rfloor
  90 \: , \;\; 0 < b < 2^k
  91 \end{equation}
  92
  93 The complexity of the above algorithm will be linear if when inserting nodes
  94 in the new tree redundant comparisons are avoided.
  95
  96 A sample implementation is provided at \ref{code:balance}.
  97
  98 \section{Characteristics of the tree}
  99
 100 The main feature of the this tree is that after is has been balanced, the
 101 more nodes it contains, the more new elements be distributed in external
 102 nodes and thus it will be much more unlikely to become ``unbalanced''.
 103
 104 \begin{itemize}
 105 \item
 106    Memory requirements for this tree are minimal. The standard memory needed
 107    is as much as any simple binary tree and the same as a double linked
 108    list. Memory usage bursts occur when
 109    balancing is performed but this is generally a rare and very fast action.
 110    Memory due to recursion is not trivial since tree height is limited
 111    to 32.
 112 \item
 113    Locating elements will happen in less than 32 steps. Generally after
 114    balancing has been performed locating elements is a matter of $\log{n}$.
 115 \item
 116    Removing elements is also a matter of less than 32 steps.
 117    However, there are no rotations performed and therefore removing an
 118    element is possibly performed faster than in an always-balanced tree.
 119 \item
 120    Adding elements in most of the cases is a matter of less than 32 very
 121    fast steps.
 122    Balancing is rare and becomes even more rare as the tree grows and it
 123    requires $n$ simple steps.
 124
 125    In long term performance, for a growing tree the number of nodes $n$
 126    divided with the number of steps spent for balancing, tends to be
 127    independent of $n$.
 128 \end{itemize}
 129
 130 Tests have shown that the tree will need to balance 2 times for
 131 1,000,000 elements arriving in random order, once after 20,000 elements
 132 and once again at about 400,000.
 133
 134 \section{Database initialization}
 135 \label{init}
 136
 137 One case were the suggested tree would prove inefficient is inserting a lot
 138 of elements in sorted order in an \emph{empty} tree.
 139 If the tree is not empty and has been balanced before then the arriving
 140 elements will be distributed on the external nodes.
 141
 142 This is a case which would occur when initially loading the data from
 143 storage devices. The workaround to this problem is similar to the way the
 144 tree is balanced. The -sorted- elements will have to be inserted in the tree
 145 in the order that they will make a balanced tree, first the middle element,
 146 $N/4$, $3*N/4$ and so on.
 147
 148 Once a perfectly balance tree is constructed there should be no worry for
 149 this worse-case occurring unless most of the elements are removed.
 150
 151 Generally, in cases were a long sequence of elements is to be inserted
 152 in the tree, if the tree has few nodes it may be preferable to sort the
 153 sequence of elements and then insert the right elements first.
 154 This procedure is worth the tradeoff if the arriving elements are almost
 155 sorted, and in this case special sorting algorithms can be applied.
 156
 157 Another solution is to make the elements really random, by using a random
 158 tree. In a \emph{random tree}, the comparison of elements is heads or tails.
 159 After the sequence of elements has been inserted in a random tree (which
 160 will never be unbalanced), they can be read in-order and sent to the dbs
 161 tree.
 162
 163 \section{Conclusions}
 164
 165 The tree structure with the characteristics that have been presented is
 166 efficient for specific database requirements.
 167
 168 When the database is idle or the specific tree is not used, is consumes the
 169 least memory compared to other balanced tree structures.
 170 That is the primary reason for the design of this structure.
 171
 172 When the tree is just accessed for locating an element this is as fast as
 173 any balanced tree.
 174
 175 Theoretically, balanced tree implementations which perform rotations after
 176 a node is inserted or removed from the tree are better. In practice, and
 177 more specificly in a database environment limiting the height of the tree to
 178 a `small' number is more than adequate.
 179 Except from memory efficiency, the suggested tree structure may possibly
 180 behave faster as well; the worse case scenario will either occur at the
 181 initialization of the database or else is can be easilly prevented.
 182
 183 \onecolumn
 184 \section{Sample code}
 185
 186 A sample implementation in C of the presented tree is provided.
 187 For the sample code the simple definition of a node with a pointer to
 188 data is used. The function {\tt compare ()} is the application-specific
 189 function which compares the data of two nodes.
 190
 191 \begin{verbatim}
 192 #define MAX_HEIGHT 32
 193 typedef struct tnode node;
 194 struct tnode {
 195    char *data;
 196    node *less, *more;
 197 };
 198 node *root;
 199 \end{verbatim}
 200
 201 \subsection{Removing a node}
 202 \label{app:rmv}
 203
 204 When removing an element which has two children,
 205 both the smallest from the bigger elements and the
 206 biggest from the smaller elements are located. By convention the deepest one
 207 is selected to replace the node to be removed.
 208
 209 \begin{verbatim}
 210 void remove (node *n)
 211 {
 212     node *np, *nl, *nr, *nlp, *nrp;
 213     int isroot;
 214
 215     if (!(isroot = (np = root) == n))
 216         for (;;)
 217             if (compare (np->data, n->data) > 0)
 218                 if (np->less == n) break;
 219                 else np = np->less;
 220             else
 221                 if (np->more == n) break;
 222                 else np = np->more;
 223
 224     if (!(n->less && n->more))
 225     {
 226         if (isroot) root = (n->less) ? n->less : n->more;
 227         else
 228             if (np->less == n) np->less = (n->less) ? n->less : n->more;
 229             else np->more = (n->less) ? n->less : n->more;
 230         return;
 231     }
 232     for (i = 0, nlp = NULL, nl = n->less; nl->more; i++)
 233         nlp = nl, nl = nl->more;
 234     for (j = 0, nrp = NULL, nr = n->more; nr->less; j++)
 235         nrp = nr, nr = nr->less;
 236     if (i >= j)
 237     {
 238         if (isroot) root = nl;
 239         else
 240             if (np->less == n) np->less = nl;
 241             else               np->more = nl;
 242         if (nlp)
 243         {
 244              nlp->more = nl->less
 245              nl->less = n->less;
 246         }
 247         nl->more = n->more;
 248     } else {
 249         if (isroot) root = nr;
 250         else
 251             if (np->less == n) np->less = nr;
 252             else               np->more = nr;
 253         if (nrp)
 254         {
 255             nrp->less = nr->more
 256             nr->more = n->more;
 257         }
 258         nr->less = n->less;
 259     }
 260 }
 261 \end{verbatim}
 262
 263 \subsection{Balancing}
 264 \label{code:balance}
 265
 266 The algorithm for reconstructing a balanced tree.
 267 {\tt alloca} is a stack allocation mechanism.
 268
 269 Special care should be taken that the variables {\tt i, j, D, k} in
 270 {\tt balance()} are 64 bit integers. If not then the algorithm will fail for
 271 trees with more than about 65000 nodes.
 272
 273 \begin{verbatim}
 274 unsigned int suint;
 275 node **snpp;
 276
 277 void count_nodes (node *n)
 278 {
 279     ++suint;
 280     if (n->less) count_nodes (n->less);
 281     if (n->more) count_nodes (n->more);
 282 }
 283 void tree_to_array (node *n)
 284 {
 285     if (!n) return;
 286     tree_to_array (n->less);
 287     *snpp++ = n;
 288     tree_to_array (n->more);
 289 }
 290 void balance ()
 291 {
 292     node **npp;
 293     long int i, j, D, k, k2, k3;
 294
 295     /** Generate a sorted array **/
 296     suint = 0;
 297     count_nodes (root);
 298     npp = snpp = (node**) alloca (sizeof (node*) * suint);
 299     tree_to_array (root);
 300     /** Insert the top 2^k -1 nodes **/
 301     root = npp [suint / 2];
 302     for (i = 4; i <= suint + 1; i *= 2)
 303         for (j = 2; j < i; j += 4)
 304         {
 305             k2 = suint * j / i;
 306             npp [k2]->less = npp [suint * (j - 1) / i];
 307             npp [k2]->more = npp [suint * (j + 1) / i];
 308         }
 309     /** Test whether there are nodes left **/
 310     if ((k = suint + 1 - i / 2)) == 0)
 311     {
 312         for (i /= 2, j = 1; j < i; j += 2)
 313             k2 = suint * j / i,
 314             npp [k2]->less = npp [k2]->more = NULL;
 315         return;
 316     }
 317     /** Proceed normally but for specific nodes **/
 318     for (j = 2; j < i; j += 4)
 319     {
 320         k3 = suint * j / i;
 321         D = (k2 = suint * (j - 1) / i) * (i / 2) % suint;
 322         if (D >= k || D == 0)    /* k2 Excluded */
 323             npp [k3]->less = NULL;
 324         else {                   /* k2 Inserted */
 325             npp [k3]->less = npp [k2];
 326             npp [k2]->less = npp [k2]->more = NULL;
 327         }
 328         D = (k2 = suint * (j + 1) / i) * (i / 2) % suint;
 329         if (D >= k || D == 0)
 330             npp [k3]->more = NULL;
 331         else {
 332             npp [k3]->more = npp [k2];
 333             npp [k2]->less = npp [k2]->more = NULL;
 334         }
 335     }
 336     insert (npp [0]);
 337 }
 338 \end{verbatim}
 339
 340 \subsection{Inserting a node}
 341
 342 The procedure is similar to that of a simple binary search tree.
 343
 344 \begin{verbatim}
 345 int insert (node *n)
 346 {
 347     node *np;
 348     int h;
 349
 350     n->less = n->more = NULL;
 351     if (!(np = root))
 352     {
 353         root = n;
 354         return 0;
 355     }
 356     for (h = 0;; h++)
 357         if (compare (np->data, n->data) > 0)
 358             if (!np->less) { np->less = n; break; }
 359             else np = np->less;
 360         else
 361             if (!np->more) { np->more = n; break; }
 362             else np = np->more;
 363      if (h == MAX_HEIGHT)
 364      {
 365          balance ();
 366          return 0;
 367      } else if (h > MAX_HEIGHT) {
 368          // If that ever happens please send me a mail and I'll send
 369          // you 10.000$ for having a tree with 4bill nodes
 370          // or make MAX_HEIGHT 34 and go to 16bills
 371          printf ("Tree full. Reached 2^%i nodes\n", MAX_HEIGHT);
 372          return -1;
 373      }
 374      return h;
 375 }
 376 \end{verbatim}
 377
 378 \end{document}