<!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
<!-- This collection of hypertext pages is Copyright 1995, 1996 by Steve Summit. -->
<!-- This material may be freely redistributed and used -->
<!-- but may not be republished or sold without permission. -->
<link rev="owner" href="mailto:scs@eskimo.com">
<link rev="made" href="mailto:scs@eskimo.com">
<title>section 6.5: Self-referential Structures</title>
<link href="sx9e.html" rev=precedes>
<link href="sx10.html" rel=precedes>
<link href="sx9.html" rev=subdocument>
<H2>section 6.5: Self-referential Structures</H2>
<p>In section 4.10, we met recursive functions.
Now, we're going to meet recursively-defined data structures.
Don't throw up your hands:
the two should be easier to understand in combination.
</p><p>The mention of ``quadratic running time'' is tangential,
but it's a useful-enough concept that it might be worth a bit of explanation.
If we were keeping a simple list in sorted order,
each time we had a new word to install,
we'd have to scan over the old list to find the new word's place.
On average, we'd have to scan over half the old list.
(Even if we used binary search to find the position,
we'd still have to move some part of the list to insert it.)
Therefore, the more words that were in the list,
the longer it would take to install each new word.
It turns out that the running time of this linear insertion
algorithm would grow as the square of the number of items in the list
(that's what ``quadratically'' means).
If you doubled the size of the list,
the running time would be four times longer.
An algorithm like this may seem to work fine
when you run it on small test inputs,
but when you run it on a real problem
consisting of a thousand or ten thousand or a million words,
it bogs down hopelessly.
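</p><p>To make the quadratic behavior concrete, here is a small sketch (ours, not the text's; the names are our own) of installing each new word into a simple sorted array of strings. Both the scan and the shift take time proportional to the number of words already present, so building the whole list this way takes time proportional to the square of its final length:

```c
#include <string.h>

#define MAXWORDS 1000

char *list[MAXWORDS];   /* the sorted list of words */
int nwords = 0;         /* how many are in it so far */

/* install: insert w into its proper place in the sorted list.
 * Finding the place takes a scan over the old list, and making
 * room takes a shift of everything after it; both costs grow
 * with the number of words already installed. */
void install(char *w)
{
	int i, j;

	for (i = 0; i < nwords; i++)    /* scan for the position */
		if (strcmp(w, list[i]) < 0)
			break;
	for (j = nwords; j > i; j--)    /* shift the tail to make room */
		list[j] = list[j-1];
	list[i] = w;
	nwords++;
}
```

(For simplicity this sketch stores the caller's pointer directly; a real program would copy the word into allocated memory, as discussed later in this section.)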
</p><p>A binary tree is a great way to keep a set of words in sorted order.
The definition of a binary tree is simply that, at each node,
all items in the left subtree are less than the item at that node,
and all items in the right subtree are greater.
(Note that the top item in the left subtree
is not necessarily <em>immediately</em> less than the item at that node;
the immediately-preceding item is merely down in the left subtree somewhere,
along with all the rest of the preceding items.
In the ``now is the time'' example,
the word ``now'' is neither the first, last, nor middle word in the sorted list;
it's merely the word that happened to be installed first.
The word preceding it is ``men'';
the word following it is ``of.''
The first word in the sorted list is ``aid,''
and the last word is ``to.'')
</p><p>The binary tree may not immediately seem like much of an
improvement over the linear array--we still have to scan over
part of the existing tree in order to insert each new word,
and the time to add each new word will get longer as there are
more words in the tree.
But, if you do the math, it turns out
that on average you have to scan over a much smaller part of the tree;
it's not a simple fraction of the tree,
but rather the log (base two) of the number of items already in the tree.
Furthermore,
inserting a new node doesn't involve reshuffling any old data.
Therefore,
the running time of binary tree insertion doesn't slow down
nearly as badly as linear insertion does.
</p><p>The reason that the word ``binary'' comes up so often
is that it simply means ``two.''
The binary number system has two digits (0 and 1);
a binary operator has two operands;
binary search eliminates half
of the possibilities at each step;
a binary tree has two subtrees at each node.
</p><p>One other bit of nomenclature:
the word ``node'' simply refers to
one of the structures in a set of structures
that is linked together in some way,
and as we're about to see,
we're going to use a set of linked structures to implement a binary tree.
Just as we talk about a ``cell'' or ``element'' of an array,
we talk about a ``node'' in a tree or linked list.
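</p><p>For reference, here is the self-referential node structure from page 140 that the rest of this section revolves around. Each node carries the word itself, its count, and pointers to two more instances of the same structure:

```c
struct tnode {             /* the tree node: */
	char *word;            /* points to the text of the word */
	int count;             /* number of occurrences */
	struct tnode *left;    /* left child */
	struct tnode *right;   /* right child */
};
```
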
</p><p>When we look at the description
of the algorithm for finding out whether a word is already in the tree,
we may begin to see why the binary tree is more efficient than the linear list.
When searching through a linear list,
each time we discard a value that's not the one we're looking for,
we've only discarded that one value;
we still have the entire rest of the list to search.
In a binary tree, however,
whenever we move down the tree,
we've just eliminated half of the tree.
(We might say that a binary tree is a data structure
which makes binary search automatic.)
Consider guessing a number between one and a hundred by asking
``Is it 1? Is it 2? Is it 3?'' etc.,
versus asking
``Is it less than 50?
Is it greater than 25?
Is it less than 12?''
(Halving the range at each step finds the number in at most seven questions.)
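</p><p>The search just described takes only a few lines of C. This is our own sketch, not the book's code (the book folds the search into <TT>addtree</TT>), but it uses the same <TT>struct tnode</TT>; notice how each comparison discards an entire subtree:

```c
#include <string.h>

struct tnode {             /* the tree node (as in the text) */
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

/* lookup: return the node containing w, or NULL if w is not in
 * the tree.  Each strcmp sends us down exactly one subtree,
 * discarding the other half of the remaining tree. */
struct tnode *lookup(struct tnode *p, char *w)
{
	int cond;

	while (p != NULL) {
		if ((cond = strcmp(w, p->word)) == 0)
			return p;       /* found it */
		else if (cond < 0)
			p = p->left;    /* w sorts earlier: go left */
		else
			p = p->right;   /* w sorts later: go right */
	}
	return NULL;            /* fell off a leaf: not present */
}
```
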
</p><p>Make sure you're comfortable with the idea of a structure
which contains pointers to other instances of itself.
If you draw some little box-and-arrow diagrams for a binary tree,
the idea should fall into place easily.
(As the authors point out,
what would be impossible
would be for a structure to contain
not a pointer to another instance of itself
but rather another entire instance of itself,
because then that instance would contain another,
and so on, and the structure
would be infinitely big.)
</p><p>Note that <TT>addtree</TT> accepts as an argument the tree to insert into,
and returns a pointer to a tree,
because it may have to modify the tree in the process of
adding a new node to it.
If it doesn't have to modify the tree
(more precisely, if it doesn't have to modify the top or
<dfn>root</dfn> of the tree)
it returns the same pointer it was handed.
</p><p>Another thing to note
is the technique used to mark the edges or ``leaves'' of the tree.
We said that a null pointer was a special pointer value
guaranteed not to point anywhere,
and it is therefore an excellent marker to use
when a left or right subtree does not exist.
Whenever a new node is built,
<TT>addtree</TT> initializes both subtree pointers to null;
later, another chain of calls to <TT>addtree</TT>
may replace one or the other of these with a new subtree.
(Eventually, both might be replaced.)
</p><p>If you don't completely see how <TT>addtree</TT> works,
leave it for a moment and look at <TT>treeprint</TT> first.
</p><p>The bottom of page 141 discusses a tremendously important issue:
Although we only have one copy of the <TT>addtree</TT> function
(which may call itself recursively many times),
by the time we're done,
we'll have many instances of the <TT>tnode</TT> structure
(one for each unique word in the input).
Therefore, we have to arrange somehow
that memory for these multiple instances is properly allocated.
We can't use a local variable
of type <TT>struct tnode</TT> in <TT>addtree</TT>,
because local variables disappear when their containing function returns.
We can't use a <TT>static</TT> variable
of type <TT>struct tnode</TT> in <TT>addtree</TT>,
or a global variable of type <TT>struct tnode</TT>,
because then we'd have only one node in the whole program,
with no way to get any more.
</p><p>What we need is some brand-new memory.
Somehow, we have to arrange it
so that each time <TT>addtree</TT> builds a brand-new node,
it does so in another new piece of brand-new memory.
Since each node contains a pointer (<TT>char *</TT>)
to a string, the memory for that string has to be dynamically allocated, too.
(If we didn't allocate memory for each new string,
all the strings would end up being stored
in the <TT>word</TT> array in <TT>main</TT> on page 140,
and they'd step all over each other,
and we'd only be able to see the last word we read.)
</p><p>For the moment,
we defer the question of exactly where this brand-new memory is to come from
by defining two functions to do it.
<TT>talloc</TT> is going to return a pointer to a
brand-new piece of memory suitable for holding a <TT>struct tnode</TT>,
and <TT>strdup</TT> is going to return a pointer to a
brand-new piece of memory containing a copy of a string.
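</p><p>Anticipating the discussion of <TT>malloc</TT> below, <TT>talloc</TT> is just a one-line wrapper; this is essentially the version on page 142 of the text, shown here with the <TT>struct tnode</TT> declaration it needs:

```c
#include <stdlib.h>

struct tnode {             /* the tree node (as in the text) */
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

/* talloc: return a pointer to a brand-new piece of memory big
 * enough to hold one struct tnode, or NULL if no memory is left */
struct tnode *talloc(void)
{
	return malloc(sizeof(struct tnode));
}
```
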
</p><p><TT>treeprint</TT> is probably
the cleanest, simplest recursive function there is.
If you've been putting off getting comfortable with recursive functions,
now is the time.
</p><p>Suppose it's our job to print a binary tree:
we've just been handed a pointer to the base (root) of the tree.
The only node we've got easy access to is the root node,
but that's not the first or the last element to print or anything;
it's generally a random node
somewhere in the middle of the eventual sorted list
(distinguished only by the fact that it happened to be inserted first).
The node that needs to be printed first
is buried somewhere down in the left subtree,
and the node to print just before the node we've got easy access to
is buried somewhere else down in the left subtree,
and the node to print next
(after the one we've got)
is buried somewhere down in the right subtree.
What we do know is that
everything down in the left subtree is to be printed before the node we've got,
and everything down in the right subtree is to be printed after.
A pseudocode description of our task, therefore, might be
<pre><I>	print the left subtree (in order)
	print the node we're at
	print the right subtree (in order)
</I></pre>How can we print the left subtree, in order?
The left subtree is itself a tree,
so printing it out sounds about as hard as printing an entire tree,
which is what we were supposed to do.
In fact, it's exactly as hard:
it's the same problem.
Are we going in circles?
Are we getting anywhere?
Yes, we are, because the left subtree,
even though it is still a tree,
is at least <em>smaller</em> than the full tree we started with.
The same is true of the right subtree.
Therefore,
we can use a recursive call to do the hard work
of printing the subtrees,
and all we have to do is the easy part: print the node we're at.
The fact that the subtrees are smaller
gives us the leverage we need
to make a recursive algorithm work.
</p><p>In any recursive function,
it is (obviously) important to terminate the recursion,
that is, to make sure that the function doesn't recursively call itself forever.
In the case of binary trees,
when you reach a ``leaf'' of the tree
(more precisely, when the left or right subtree is a null pointer),
there's nothing more to visit,
so the recursion can stop.
We can test for this in two different ways,
either before or after we make the ``last'' recursive call:
<pre>	void treeprint(struct tnode *p)
	{
		if(p->left != NULL)
			treeprint(p->left);
		printf("%4d %s\n", p->count, p->word);
		if(p->right != NULL)
			treeprint(p->right);
	}
</pre>or<pre>	void treeprint(struct tnode *p)
	{
		if(p == NULL)
			return;
		treeprint(p->left);
		printf("%4d %s\n", p->count, p->word);
		treeprint(p->right);
	}
</pre>
In most cases,
there's little difference between one approach and the other.
Here, though, the second approach
(which is equivalent to the code on page 142)
has a distinct advantage:
it will work even if the very first call is on an empty tree
(in this case, if there were no words in the input).
As we mentioned earlier,
it's extremely nice if programs work well at their boundary conditions,
even if we don't think those conditions are likely to occur.
</p><p>(One more thing to notice
is that it's quite possible
for a node to have a left subtree but not a right, or vice versa;
one example is the node labeled ``of''
in the tree on page 139.)
</p><p>Another impressive thing about a recursive <TT>treeprint</TT>
function is that it's not just <em>a</em> way of writing it,
or a nifty way of writing it;
it's really the <em>only</em> way of writing it.
You might try to figure out how to write a nonrecursive version.
Once you've printed something down in the left subtree,
how do you know where to go back up to?
Our <TT>struct tnode</TT> only has pointers down the tree;
there aren't any pointers back to the ``parent'' of each node.
If you write a nonrecursive version,
you have to keep track of how you got to where you are,
and it's not enough to keep track of the parent of the node you're at;
you have to keep a stack of <em>all</em> the nodes you've passed down through.
When you write a recursive version, on the other hand,
the normal function-call stack
essentially keeps track of all this for you.
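</p><p>To see just how much bookkeeping the call stack is doing for us, here is a sketch (our own, not the book's) of a nonrecursive in-order walk that maintains the stack of passed-through nodes explicitly. So that its result can be checked, it collects node pointers into an array instead of printing them; the names <TT>treewalk</TT> and <TT>MAXDEPTH</TT> are ours, and <TT>MAXDEPTH</TT> is an assumed bound on the tree's depth:

```c
#include <stddef.h>

struct tnode {             /* the tree node (as in the text) */
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

#define MAXDEPTH 100    /* assumed limit on tree depth */

/* treewalk: nonrecursive in-order traversal.  out[] receives
 * pointers to the nodes in sorted order; the return value is how
 * many there were.  The stack[] array must remember every node
 * we've passed down through on the way left -- exactly the
 * bookkeeping the recursive version gets for free. */
int treewalk(struct tnode *p, struct tnode **out)
{
	struct tnode *stack[MAXDEPTH];
	int sp = 0, n = 0;

	while (p != NULL || sp > 0) {
		while (p != NULL) {     /* dive left, remembering our path */
			stack[sp++] = p;
			p = p->left;
		}
		p = stack[--sp];        /* back up to the deepest unvisited node */
		out[n++] = p;           /* ``visit'' (e.g. print) it */
		p = p->right;           /* then walk its right subtree */
	}
	return n;
}
```
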
</p><p>We now return to the problem of dynamic memory allocation.
The basic approach builds on something we've been seeing
glimpses of for a few chapters now:
a general-purpose function
which returns a pointer to a block of <TT>n</TT> bytes of memory.
(The authors presented a primitive version of such a function in section 5.4,
and we used it in the sorting program in section 5.6.)
Our problem is then reduced to
(1) remembering to call this allocation function when we need to,
and (2) figuring out how many bytes we need.
Problem 1 is stubborn,
but problem 2 is solved by the <TT>sizeof</TT> operator we met in section 6.3.
</p><p>You don't need to worry about all the details of the
``digression on a problem related to storage allocators.''
The vast majority of the time,
this problem is taken care of for you,
by the system library function <TT>malloc</TT>.
</p><p>The problem of <TT>malloc</TT>'s return type
is not quite as bad as the authors make it out to be.
In ANSI C, the <TT>void *</TT> type is a ``generic'' pointer type,
specifically intended to be used
where you need a pointer which can be a pointer to any data type.
Since <TT>void *</TT> is never a pointer to anything by itself,
but is always intended to be converted (``coerced'') into some other type,
it turns out that a cast is not strictly required:
in code like
<pre>	struct tnode *tp = malloc(sizeof(struct tnode));
</pre>or<pre>	return malloc(sizeof(struct tnode));
</pre>the compiler is willing to convert the pointer types implicitly,
and without requiring you to insert explicit casts.
(If you feel more comfortable with the casts, though,
you're welcome to leave them in.)
</p><p><TT>strdup</TT> is a handy little function that does two things:
it allocates enough memory for one of your strings,
and it copies your string to the new memory,
returning a pointer to it.
(It encapsulates a pattern which we first saw
in the <TT>readlines</TT> function on page 109 in section 5.6.)
Note the <TT>+1</TT> in the call to <TT>malloc</TT>!
Accidentally calling <TT>malloc(strlen(s))</TT>
is an easy but serious mistake.
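</p><p><TT>strdup</TT>'s definition (page 143 shows one) is just a few lines. In this sketch we rename it <TT>dupstr</TT> so as not to collide with the <TT>strdup</TT> many C libraries already provide; the <TT>+1</TT> is the crucial part, since <TT>strlen</TT> does not count the terminating <TT>'\0'</TT>:

```c
#include <stdlib.h>
#include <string.h>

/* dupstr: make a duplicate of s in freshly-allocated memory,
 * returning a pointer to it (or NULL if no memory is left) */
char *dupstr(char *s)
{
	char *p;

	p = malloc(strlen(s) + 1);   /* +1 for the '\0' */
	if (p != NULL)
		strcpy(p, s);
	return p;
}
```
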
</p><p>As we mentioned at the beginning of chapter 5,
memory allocation can be hard to get right,
and is at the root of many difficulties and bugs in many C programs.
Here are some rules and other things to remember:
<OL><li>Make sure you know where things are allocated,
either by the compiler or by you.
Watch out for things like
the local <TT>line</TT> array we've been tending to use with <TT>getline</TT>,
and the local <TT>word</TT> array on page 140.
When a function writes to an array or a pointer supplied by the caller,
it depends on the caller to have allocated storage correctly.
When you're the caller,
make sure you pass a valid pointer!
Make sure you understand why
<pre>	char *p;
	getline(p, 100);
</pre>is wrong and can't work.
(A pointer is not an array;
what does that 100 mean?
If <TT>getline</TT> is only allowed to read at most 100 characters,
where have we allocated those 100 characters?)
<li>Be aware of any situations
where a single array or data structure
is used to store multiple different things,
such as
the local <TT>line</TT> array we've been tending to use with <TT>getline</TT>,
and the local <TT>word</TT> array on page 140.
These arrays are overwritten with each new line, word, etc.,
so if you need to keep all of the lines or words around,
you must copy them immediately to allocated memory
(as the line-sorting program on pages 108-9 in section 5.6 did,
but as the longest line program on page 29 in section 1.9
and the pattern-matching programs on page 69 in section 4.1
and pages 116-7 in section 5.10
did <em>not</em> have to do).
<li>Make sure you allocate enough memory!
If you allocate memory for an array of 10 things,
don't accidentally store 11 things in it.
If you have a string that's 10 characters long,
make sure you always allocate 11 characters for it
(including one for the terminating <TT>'\0'</TT>).
<li>When you free (deallocate) memory,
make sure that you don't have any pointers lying around
which still point to it
(or if you do, make sure not to use them).
<li>Always check the return value from memory-allocation functions.
Memory is never infinite:
sooner or later,
you will run out of memory,
and allocation functions generally return a null pointer when this happens.
<li>When you're not using dynamically-allocated memory any more,
free it,
if it's convenient to do so and the program's not just about to exit.
Otherwise, you may eventually have so much memory allocated
to stuff you're not using any more
that there's no more memory left for new stuff you need to allocate.
(Note, though, that
on all but a few broken systems,
all memory is automatically and definitively returned
to the operating system when your program exits,
so if one of your programs doesn't free some memory,
you don't have to worry that it's wasted forever.)
</OL>
<p>Unfortunately,
checking the return values from memory allocation functions
everywhere they're called
requires a few more lines of code,
so it is often left out of sample code in textbooks.
Here are versions of <TT>main</TT> and <TT>addtree</TT>
for the word-counting program
(pages 140-1 in the text)
which do check for out-of-memory conditions:
<pre>/* word frequency count */

main()
{
	struct tnode *root;
	char word[MAXWORD];

	root = NULL;
	while (getword(word, MAXWORD) != EOF) {
		if (isalpha(word[0])) {
			root = addtree(root, word);
			if (root == NULL) {
				printf("out of memory\n");
				exit(1);
			}
		}
	}
	treeprint(root);
	return 0;
}
</pre><pre>struct tnode *addtree(struct tnode *p, char *w)
{
	int cond;

	if (p == NULL) {	/* a new word has arrived */
		p = talloc();	/* make a new node */
		if (p == NULL)
			return NULL;
		p->word = strdup(w);
		if (p->word == NULL) {
			free(p);
			return NULL;
		}
		p->count = 1;
		p->left = p->right = NULL;
	} else if ((cond = strcmp(w, p->word)) == 0)
		p->count++;	/* repeated word */
	else if (cond &lt; 0) {	/* less than: into left subtree */
		p->left = addtree(p->left, w);
		if(p->left == NULL)
			return NULL;
	} else {	/* greater than: into right subtree */
		p->right = addtree(p->right, w);
		if(p->right == NULL)
			return NULL;
	}
	return p;
}
</pre>Many programmers would collapse the calls and tests:
<pre>struct tnode *addtree(struct tnode *p, char *w)
{
	int cond;

	if (p == NULL) {	/* a new word has arrived */
		if ((p = talloc()) == NULL)
			return NULL;
		if ((p->word = strdup(w)) == NULL) {
			free(p);
			return NULL;
		}
		p->count = 1;
		p->left = p->right = NULL;
	} else if ((cond = strcmp(w, p->word)) == 0)
		p->count++;	/* repeated word */
	else if (cond &lt; 0) {	/* less than: into left subtree */
		if ((p->left = addtree(p->left, w)) == NULL)
			return NULL;
	} else {	/* greater than: into right subtree */
		if ((p->right = addtree(p->right, w)) == NULL)
			return NULL;
	}
	return p;
}
</pre>
<a href="sx9e.html" rev=precedes>prev</a>
<a href="sx10.html" rel=precedes>next</a>
<a href="sx9.html" rev=subdocument>up</a>
<a href="top.html">top</a>
<hr>
This page by <a href="http://www.eskimo.com/~scs/">Steve Summit</a>
// <a href="copyright.html">Copyright</a> 1995, 1996
// <a href="mailto:scs@eskimo.com">mail feedback</a>