<!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
<!-- This collection of hypertext pages is Copyright 1995, 1996 by Steve Summit. -->
<!-- This material may be freely redistributed and used -->
<!-- but may not be republished or sold without permission. -->
<link rev="owner" href="mailto:scs@eskimo.com">
<link rev="made" href="mailto:scs@eskimo.com">
<title>section 6.5: Self-referential Structures</title>
<link href="sx9e.html" rev=precedes>
<link href="sx10.html" rel=precedes>
<link href="sx9.html" rev=subdocument>
<H2>section 6.5: Self-referential Structures</H2>
<p>In section 4.10, we met recursive functions.
Now, we're going to meet recursively-defined data structures.
Don't throw up your hands:
the two should be easier to understand in combination.
</p><p>The mention of ``quadratic running time'' is tangential,
but it's a useful-enough concept that it might be worth a bit of explanation.
If we were keeping a simple list in sorted order,
each time we had a new word to install,
we'd have to scan over the old list to find the new word's place.
On average, we'd have to scan over half the old list.
(Even if we used binary search to find the position,
we'd still have to move some part of the list to insert it.)
Therefore, the more words that were in the list,
the longer it would take to install each new word.
It turns out that the running time of this linear insertion
algorithm would grow as the square of the number of items in the list
(that's what ``quadratically'' means).
If you doubled the size of the list,
the running time would be four times longer.
An algorithm like this may seem to work fine
when you run it on small test inputs,
but when you run it on a real problem
consisting of a thousand or ten thousand or a million words,
it bogs down hopelessly.
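</p><p>To make the quadratic behavior concrete, here is a small sketch (ours, not the text's; the names are our own) of installing each new word into a simple sorted array of strings. Both the scan and the shift take time proportional to the number of words already present, so building the whole list this way takes time proportional to the square of its final length:

```c
#include <string.h>

#define MAXWORDS 1000

char *list[MAXWORDS];   /* the sorted list of words */
int nwords = 0;         /* how many are in it so far */

/* install: insert w into its proper place in the sorted list.
 * Finding the place takes a scan over the old list, and making
 * room takes a shift of everything after it; both costs grow
 * with the number of words already installed. */
void install(char *w)
{
	int i, j;

	for (i = 0; i < nwords; i++)    /* scan for the position */
		if (strcmp(w, list[i]) < 0)
			break;
	for (j = nwords; j > i; j--)    /* shift the tail to make room */
		list[j] = list[j-1];
	list[i] = w;
	nwords++;
}
```

(For simplicity this sketch stores the caller's pointer directly; a real program would copy the word into allocated memory, as discussed later in this section.)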
</p><p>A binary tree is a great way to keep a set of words in sorted order.
The definition of a binary tree is simply that, at each node,
all items in the left subtree are less than the item at that node,
and all items in the right subtree are greater.
(Note that the top item in the left subtree
is not necessarily <em>immediately</em> less than the item at that node;
the immediately-preceding item is merely down in the left subtree somewhere,
along with all the rest of the preceding items.
In the ``now is the time'' example,
the word ``now'' is neither the first, last, nor middle word in the sorted list;
it's merely the word that happened to be installed first.
The word preceding it is ``men'';
the word following it is ``of.''
The first word in the sorted list is ``aid,''
and the last word is ``to.'')
</p><p>The binary tree may not immediately seem like much of an
improvement over the linear array--we still have to scan over
part of the existing tree in order to insert each new word,
and the time to add each new word will get longer as there are
more words in the tree.
But, if you do the math, it turns out
that on average you have to scan over a much smaller part of the tree;
it's not a simple fraction of the tree,
but rather the log (base two) of the number of items already in the tree.
Furthermore,
inserting a new node doesn't involve reshuffling any old data.
Therefore,
the running time of binary tree insertion doesn't slow down
nearly as badly as linear insertion does.
</p><p>The reason that the word ``binary'' comes up so often
is that it simply means ``two.''
The binary number system has two digits (0 and 1);
a binary operator has two operands;
binary search eliminates half
of the possibilities at each step;
a binary tree has two subtrees at each node.
</p><p>One other bit of nomenclature:
the word ``node'' simply refers to
one of the structures in a set of structures
that is linked together in some way,
and as we're about to see,
we're going to use a set of linked structures to implement a binary tree.
Just as we talk about a ``cell'' or ``element'' of an array,
we talk about a ``node'' in a tree or linked list.
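</p><p>For reference, here is the self-referential node structure from page 140 that the rest of this section revolves around. Each node carries the word itself, its count, and pointers to two more instances of the same structure:

```c
struct tnode {             /* the tree node: */
	char *word;            /* points to the text of the word */
	int count;             /* number of occurrences */
	struct tnode *left;    /* left child */
	struct tnode *right;   /* right child */
};
```
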
</p><p>When we look at the description
of the algorithm for finding out whether a word is already in the tree,
we may begin to see why the binary tree is more efficient than the linear list.
When searching through a linear list,
each time we discard a value that's not the one we're looking for,
we've only discarded that one value;
we still have the entire rest of the list to search.
In a binary tree, however,
whenever we move down the tree,
we've just eliminated half of the tree.
(We might say that a binary tree is a data structure
which makes binary search automatic.)
Consider guessing a number between one and a hundred by asking
``Is it 1? Is it 2? Is it 3?'' etc.,
versus asking
``Is it less than 50?
Is it greater than 25?
Is it less than 12?''
(Halving the range at each step finds the number in at most seven questions.)
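</p><p>The search just described takes only a few lines of C. This is our own sketch, not the book's code (the book folds the search into <TT>addtree</TT>), but it uses the same <TT>struct tnode</TT>; notice how each comparison discards an entire subtree:

```c
#include <string.h>

struct tnode {             /* the tree node (as in the text) */
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

/* lookup: return the node containing w, or NULL if w is not in
 * the tree.  Each strcmp sends us down exactly one subtree,
 * discarding the other half of the remaining tree. */
struct tnode *lookup(struct tnode *p, char *w)
{
	int cond;

	while (p != NULL) {
		if ((cond = strcmp(w, p->word)) == 0)
			return p;       /* found it */
		else if (cond < 0)
			p = p->left;    /* w sorts earlier: go left */
		else
			p = p->right;   /* w sorts later: go right */
	}
	return NULL;            /* fell off a leaf: not present */
}
```
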
</p><p>Make sure you're comfortable with the idea of a structure
which contains pointers to other instances of itself.
If you draw some little box-and-arrow diagrams for a binary tree,
the idea should fall into place easily.
(As the authors point out,
what would be impossible
would be for a structure to contain
not a pointer to another instance of itself
but rather another entire instance of itself,
because then that instance would contain another,
and so on, and the structure
would be infinitely big.)
</p><p>Note that <TT>addtree</TT> accepts as an argument the tree to insert into,
and returns a pointer to a tree,
because it may have to modify the tree in the process of
adding a new node to it.
If it doesn't have to modify the tree
(more precisely, if it doesn't have to modify the top or
<dfn>root</dfn> of the tree)
it returns the same pointer it was handed.
</p><p>Another thing to note
is the technique used to mark the edges or ``leaves'' of the tree.
We said that a null pointer was a special pointer value
guaranteed not to point anywhere,
and it is therefore an excellent marker to use
when a left or right subtree does not exist.
Whenever a new node is built,
<TT>addtree</TT> initializes both subtree pointers to null;
later, another chain of calls to <TT>addtree</TT>
may replace one or the other of these with a new subtree.
(Eventually, both might be replaced.)
</p><p>If you don't completely see how <TT>addtree</TT> works,
leave it for a moment and look at <TT>treeprint</TT> first.
</p><p>The bottom of page 141 discusses a tremendously important issue:
Although we only have one copy of the <TT>addtree</TT> function
(which may call itself recursively many times),
by the time we're done,
we'll have many instances of the <TT>tnode</TT> structure
(one for each unique word in the input).
Therefore, we have to arrange somehow
that memory for these multiple instances is properly allocated.
We can't use a local variable
of type <TT>struct tnode</TT> in <TT>addtree</TT>,
because local variables disappear when their containing function returns.
We can't use a <TT>static</TT> variable
of type <TT>struct tnode</TT> in <TT>addtree</TT>,
or a global variable of type <TT>struct tnode</TT>,
because then we'd have only one node in the whole program,
with no way to get any more.
</p><p>What we need is some brand-new memory.
Somehow, we have to arrange it
so that each time <TT>addtree</TT> builds a brand-new node,
it does so in another new piece of brand-new memory.
Since each node contains a pointer (<TT>char *</TT>)
to a string, the memory for that string has to be dynamically allocated, too.
(If we didn't allocate memory for each new string,
all the strings would end up being stored
in the <TT>word</TT> array in <TT>main</TT> on page 140,
and they'd step all over each other,
and we'd only be able to see the last word we read.)
</p><p>For the moment,
we defer the question of exactly where this brand-new memory is to come from
by defining two functions to do it.
<TT>talloc</TT> is going to return a pointer to a
brand-new piece of memory suitable for holding a <TT>struct tnode</TT>,
and <TT>strdup</TT> is going to return a pointer to a
brand-new piece of memory containing a copy of a string.
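</p><p>Anticipating the discussion of <TT>malloc</TT> below, <TT>talloc</TT> is just a one-line wrapper; this is essentially the version on page 142 of the text, shown here with the <TT>struct tnode</TT> declaration it needs:

```c
#include <stdlib.h>

struct tnode {             /* the tree node (as in the text) */
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

/* talloc: return a pointer to a brand-new piece of memory big
 * enough to hold one struct tnode, or NULL if no memory is left */
struct tnode *talloc(void)
{
	return malloc(sizeof(struct tnode));
}
```
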
</p><p><TT>treeprint</TT> is probably
the cleanest, simplest recursive function there is.
If you've been putting off getting comfortable with recursive functions,
now is the time.
</p><p>Suppose it's our job to print a binary tree:
we've just been handed a pointer to the base (root) of the tree.
The only node we've got easy access to is the root node,
but that's not the first or the last element to print or anything;
it's generally a random node
somewhere in the middle of the eventual sorted list
(distinguished only by the fact that it happened to be inserted first).
The node that needs to be printed first
is buried somewhere down in the left subtree,
and the node to print just before the node we've got easy access to
is buried somewhere else down in the left subtree,
and the node to print next
(after the one we've got)
is buried somewhere down in the right subtree.
What we do know is that
everything down in the left subtree is to be printed before the node we've got,
and everything down in the right subtree is to be printed after.
A pseudocode description of our task, therefore, might be
<pre><I>	print the left subtree (in order)
	print the node we're at
	print the right subtree (in order)
</I></pre>How can we print the left subtree, in order?
The left subtree is itself a tree,
so printing it out sounds about as hard as printing an entire tree,
which is what we were supposed to do.
In fact, it's exactly as hard:
it's the same problem.
Are we going in circles?
Are we getting anywhere?
Yes, we are, because the left subtree,
even though it is still a tree,
is at least <em>smaller</em> than the full tree we started with.
The same is true of the right subtree.
Therefore,
we can use a recursive call to do the hard work
of printing the subtrees,
and all we have to do is the easy part: print the node we're at.
The fact that the subtrees are smaller
gives us the leverage we need
to make a recursive algorithm work.
</p><p>In any recursive function,
it is (obviously) important to terminate the recursion,
that is, to make sure that the function doesn't recursively call itself forever.
In the case of binary trees,
when you reach a ``leaf'' of the tree
(more precisely, when the left or right subtree is a null pointer),
there's nothing more to visit,
so the recursion can stop.
We can test for this in two different ways,
either before or after we make the ``last'' recursive call:
<pre>	void treeprint(struct tnode *p)
	{
		if(p->left != NULL)
			treeprint(p->left);
		printf("%4d %s\n", p->count, p->word);
		if(p->right != NULL)
			treeprint(p->right);
	}
</pre>or<pre>	void treeprint(struct tnode *p)
	{
		if(p == NULL)
			return;
		treeprint(p->left);
		printf("%4d %s\n", p->count, p->word);
		treeprint(p->right);
	}
</pre>
In most cases,
there's little difference between one approach and the other.
Here, though, the second approach
(which is equivalent to the code on page 142)
has a distinct advantage:
it will work even if the very first call is on an empty tree
(in this case, if there were no words in the input).
As we mentioned earlier,
it's extremely nice if programs work well at their boundary conditions,
even if we don't think those conditions are likely to occur.
</p><p>(One more thing to notice
is that it's quite possible
for a node to have a left subtree but not a right, or vice versa;
one example is the node labeled ``of''
in the tree on page 139.)
</p><p>Another impressive thing about a recursive <TT>treeprint</TT>
function is that it's not just <em>a</em> way of writing it,
or a nifty way of writing it;
it's really the <em>only</em> way of writing it.
You might try to figure out how to write a nonrecursive version.
Once you've printed something down in the left subtree,
how do you know where to go back up to?
Our <TT>struct tnode</TT> only has pointers down the tree;
there aren't any pointers back to the ``parent'' of each node.
If you write a nonrecursive version,
you have to keep track of how you got to where you are,
and it's not enough to keep track of the parent of the node you're at;
you have to keep a stack of <em>all</em> the nodes you've passed down through.
When you write a recursive version, on the other hand,
the normal function-call stack
essentially keeps track of all this for you.
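</p><p>To see just how much bookkeeping the call stack is doing for us, here is a sketch (our own, not the book's) of a nonrecursive in-order walk that maintains the stack of passed-through nodes explicitly. So that its result can be checked, it collects node pointers into an array instead of printing them; the names <TT>treewalk</TT> and <TT>MAXDEPTH</TT> are ours, and <TT>MAXDEPTH</TT> is an assumed bound on the tree's depth:

```c
#include <stddef.h>

struct tnode {             /* the tree node (as in the text) */
	char *word;
	int count;
	struct tnode *left;
	struct tnode *right;
};

#define MAXDEPTH 100    /* assumed limit on tree depth */

/* treewalk: nonrecursive in-order traversal.  out[] receives
 * pointers to the nodes in sorted order; the return value is how
 * many there were.  The stack[] array must remember every node
 * we've passed down through on the way left -- exactly the
 * bookkeeping the recursive version gets for free. */
int treewalk(struct tnode *p, struct tnode **out)
{
	struct tnode *stack[MAXDEPTH];
	int sp = 0, n = 0;

	while (p != NULL || sp > 0) {
		while (p != NULL) {     /* dive left, remembering our path */
			stack[sp++] = p;
			p = p->left;
		}
		p = stack[--sp];        /* back up to the deepest unvisited node */
		out[n++] = p;           /* ``visit'' (e.g. print) it */
		p = p->right;           /* then walk its right subtree */
	}
	return n;
}
```
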
</p><p>We now return to the problem of dynamic memory allocation.
The basic approach builds on something we've been seeing
glimpses of for a few chapters now:
a general-purpose function
which returns a pointer to a block of <TT>n</TT> bytes of memory.
(The authors presented a primitive version of such a function in section 5.4,
and we used it in the sorting program in section 5.6.)
Our problem is then reduced to
(1) remembering to call this allocation function when we need to,
and (2) figuring out how many bytes we need.
Problem 1 is stubborn,
but problem 2 is solved by the <TT>sizeof</TT> operator we met in section 6.3.
</p><p>You don't need to worry about all the details of the
``digression on a problem related to storage allocators.''
The vast majority of the time,
this problem is taken care of for you,
by the system library function <TT>malloc</TT>.
</p><p>The problem of <TT>malloc</TT>'s return type
is not quite as bad as the authors make it out to be.
In ANSI C, the <TT>void *</TT> type is a ``generic'' pointer type,
specifically intended to be used
where you need a pointer which can be a pointer to any data type.
Since <TT>void *</TT> is never a pointer to anything by itself,
but is always intended to be converted (``coerced'') into some other type,
it turns out that a cast is not strictly required:
in code like
<pre>	struct tnode *tp = malloc(sizeof(struct tnode));
</pre>or<pre>	return malloc(sizeof(struct tnode));
</pre>the compiler is willing to convert the pointer types implicitly,
and without requiring you to insert explicit casts.
(If you feel more comfortable with the casts, though,
you're welcome to leave them in.)
</p><p><TT>strdup</TT> is a handy little function that does two things:
it allocates enough memory for one of your strings,
and it copies your string to the new memory,
returning a pointer to it.
(It encapsulates a pattern which we first saw
in the <TT>readlines</TT> function on page 109 in section 5.6.)
Note the <TT>+1</TT> in the call to <TT>malloc</TT>!
Accidentally calling <TT>malloc(strlen(s))</TT>
is an easy but serious mistake.
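</p><p><TT>strdup</TT>'s definition (page 143 shows one) is just a few lines. In this sketch we rename it <TT>dupstr</TT> so as not to collide with the <TT>strdup</TT> many C libraries already provide; the <TT>+1</TT> is the crucial part, since <TT>strlen</TT> does not count the terminating <TT>'\0'</TT>:

```c
#include <stdlib.h>
#include <string.h>

/* dupstr: make a duplicate of s in freshly-allocated memory,
 * returning a pointer to it (or NULL if no memory is left) */
char *dupstr(char *s)
{
	char *p;

	p = malloc(strlen(s) + 1);   /* +1 for the '\0' */
	if (p != NULL)
		strcpy(p, s);
	return p;
}
```
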
</p><p>As we mentioned at the beginning of chapter 5,
memory allocation can be hard to get right,
and is at the root of many difficulties and bugs in many C programs.
Here are some rules and other things to remember:
<OL><li>Make sure you know where things are allocated,
either by the compiler or by you.
Watch out for things like
the local <TT>line</TT> array we've been tending to use with <TT>getline</TT>,
and the local <TT>word</TT> array on page 140.
When a function writes to an array or a pointer supplied by the caller,
it depends on the caller to have allocated storage correctly.
When you're the caller,
make sure you pass a valid pointer!
Make sure you understand why
<pre>	char *p;
	getline(p, 100);
</pre>is wrong and can't work.
(A pointer is not an array;
what does that 100 mean?
If <TT>getline</TT> is only allowed to read at most 100 characters,
where have we allocated those 100 characters?)
<li>Be aware of any situations
where a single array or data structure
is used to store multiple different things,
such as
the local <TT>line</TT> array we've been tending to use with <TT>getline</TT>,
and the local <TT>word</TT> array on page 140.
These arrays are overwritten with each new line, word, etc.,
so if you need to keep all of the lines or words around,
you must copy them immediately to allocated memory
(as the line-sorting program on pages 108-9 in section 5.6 did,
but as the longest line program on page 29 in section 1.9
and the pattern-matching programs on page 69 in section 4.1
and pages 116-7 in section 5.10
did <em>not</em> have to do).
<li>Make sure you allocate enough memory!
If you allocate memory for an array of 10 things,
don't accidentally store 11 things in it.
If you have a string that's 10 characters long,
make sure you always allocate 11 characters for it
(including one for the terminating <TT>'\0'</TT>).
<li>When you free (deallocate) memory,
make sure that you don't have any pointers lying around
which still point to it
(or if you do, make sure not to use them).
<li>Always check the return value from memory-allocation functions.
Memory is never infinite:
sooner or later,
you will run out of memory,
and allocation functions generally return a null pointer when this happens.
<li>When you're not using dynamically-allocated memory any more,
free it,
if it's convenient to do so and the program's not just about to exit.
Otherwise, you may eventually have so much memory allocated
to stuff you're not using any more
that there's no more memory left for new stuff you need to allocate.
(Note, though, that
on all but a few broken systems,
all memory is automatically and definitively returned
to the operating system when your program exits,
so if one of your programs doesn't free some memory,
you don't have to worry that it's wasted forever.)
</OL>
<p>Unfortunately,
checking the return values from memory allocation functions
everywhere they're called
requires a few more lines of code,
so it is often left out of sample code in textbooks.
Here are versions of <TT>main</TT> and <TT>addtree</TT>
for the word-counting program
(pages 140-1 in the text)
which do check for out-of-memory conditions:
<pre>/* word frequency count */

main()
{
	struct tnode *root;
	char word[MAXWORD];

	root = NULL;
	while (getword(word, MAXWORD) != EOF) {
		if (isalpha(word[0])) {
			root = addtree(root, word);
			if (root == NULL) {
				printf("out of memory\n");
				exit(1);
			}
		}
	}
	treeprint(root);
	return 0;
}
</pre><pre>struct tnode *addtree(struct tnode *p, char *w)
{
	int cond;

	if (p == NULL) {	/* a new word has arrived */
		p = talloc();	/* make a new node */
		if (p == NULL)
			return NULL;
		p->word = strdup(w);
		if (p->word == NULL) {
			free(p);
			return NULL;
		}
		p->count = 1;
		p->left = p->right = NULL;
	} else if ((cond = strcmp(w, p->word)) == 0)
		p->count++;	/* repeated word */
	else if (cond &lt; 0) {	/* less than: into left subtree */
		p->left = addtree(p->left, w);
		if(p->left == NULL)
			return NULL;
	} else {	/* greater than: into right subtree */
		p->right = addtree(p->right, w);
		if(p->right == NULL)
			return NULL;
	}
	return p;
}
</pre>Many programmers would collapse the calls and tests:
<pre>struct tnode *addtree(struct tnode *p, char *w)
{
	int cond;

	if (p == NULL) {	/* a new word has arrived */
		if ((p = talloc()) == NULL)
			return NULL;
		if ((p->word = strdup(w)) == NULL) {
			free(p);
			return NULL;
		}
		p->count = 1;
		p->left = p->right = NULL;
	} else if ((cond = strcmp(w, p->word)) == 0)
		p->count++;	/* repeated word */
	else if (cond &lt; 0) {	/* less than: into left subtree */
		if ((p->left = addtree(p->left, w)) == NULL)
			return NULL;
	} else {	/* greater than: into right subtree */
		if ((p->right = addtree(p->right, w)) == NULL)
			return NULL;
	}
	return p;
}
</pre>
<a href="sx9e.html" rev=precedes>prev</a>
<a href="sx10.html" rel=precedes>next</a>
<a href="sx9.html" rev=subdocument>up</a>
<a href="top.html">top</a>
<hr>
This page by <a href="http://www.eskimo.com/~scs/">Steve Summit</a>
// <a href="copyright.html">Copyright</a> 1995, 1996
// <a href="mailto:scs@eskimo.com">mail feedback</a>