<!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
<!-- This collection of hypertext pages is Copyright 1995-7 by Steve Summit. -->
<!-- This material may be freely redistributed and used -->
<!-- but may not be republished or sold without permission. -->
<html>
<head>
<link rev="owner" href="mailto:scs@eskimo.com">
<link rev="made" href="mailto:scs@eskimo.com">
<title>Chapter 8: Strings</title>
<link href="sx7c.html" rev=precedes>
<link href="sx9.html" rel=precedes>
<link href="top.html" rev=subdocument>
</head>
<body>
<H1>Chapter 8: Strings</H1>
<p>Strings in C are represented by arrays of characters.
The end of the string is marked with a special character,
the <dfn>null character</dfn>,
which is simply the character with the value 0.
(The null character has no relation except in name to the
<dfn>null pointer</dfn>.
In the ASCII character set, the null character is named NUL.)
The null or string-terminating character is represented by
another character escape sequence, <TT>\0</TT>.
(We've seen it once already, in the <TT>getline</TT> function
of chapter 6.)
</p><p>Because C has no built-in facilities for manipulating entire
arrays (copying them, comparing them, etc.),
it also has very few built-in facilities for manipulating strings.
</p><p>In fact, C's only truly built-in string-handling is that
it allows us to use
<dfn>string constants</dfn> (also called <dfn>string
literals</dfn>) in our code.
Whenever we write a string, enclosed in double quotes, C
automatically creates an array of characters for us,
containing that string, terminated by the <TT>\0</TT> character.
For example, we can declare and define an array of characters,
and initialize it with a string constant:
<pre>
char string[] = "Hello, world!";
</pre>
In this case, we can leave out the dimension of the array,
since the compiler can compute it for us based on the
size of the initializer (14, including the terminating <TT>\0</TT>).
This is the only case where the compiler sizes a string array
for us, however; in other cases, it will be necessary that
<em>we</em> decide how big the arrays and other data structures
we use to hold strings are.
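</p><p>As a quick check of that count, here is a minimal sketch using the
<TT>sizeof</TT> operator, which this chapter has not introduced.
The names <TT>greeting</TT> and <TT>greeting_size</TT> are made up for the illustration.
Applied to an array, <TT>sizeof</TT> yields its size in bytes,
and since a <TT>char</TT> occupies exactly one byte,
that is also the number of characters, counting the <TT>\0</TT>.

```c
#include <stddef.h>

/* Hypothetical example array; the compiler sizes it from the initializer. */
static char greeting[] = "Hello, world!";

/* Returns the number of chars in greeting, including the terminating '\0'. */
size_t greeting_size(void)
{
    return sizeof greeting;
}
```

For this initializer the function returns 14: thirteen visible characters plus the <TT>\0</TT>.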
</p><p>To do anything else with strings, we must typically call functions.
The C library contains a few basic string manipulation functions,
and to learn more about strings, we'll be looking at how these
functions might be implemented.
</p><p>
Since C never lets us assign entire arrays,
we use the <TT>strcpy</TT> function to copy one string to another:
<pre>
#include &lt;string.h&gt;

char string1[] = "Hello, world!";
char string2[20];

strcpy(string2, string1);
</pre>
The destination string is <TT>strcpy</TT>'s first
argument, so that a call to <TT>strcpy</TT> mimics an
assignment expression (with the destination on the left-hand side).
Notice that we had to allocate <TT>string2</TT> big enough to
hold the string that would be copied to it.
Also, at the top of any source file where we're using the
standard library's string-handling functions (such as
<TT>strcpy</TT>)
we must include the line
<pre>
#include &lt;string.h&gt;
</pre>
which contains external declarations for these functions.
</p><p>
Since C won't let us compare entire arrays, either,
we must call a function to do that, too.
The standard library's <TT>strcmp</TT> function compares two
strings, and returns 0 if they are identical, or a negative
number if the first string is alphabetically ``less than'' the
second string, or a positive number if the first string is
``greater.''
(Roughly speaking,
what it means for one string to be ``less than'' another
is that it would come first in a dictionary or telephone book,
although there are a few anomalies,
because the comparison goes by character values:
in ASCII, for example, every upper-case letter comes before every lower-case one.)
Here is an example:
<pre>
char string3[] = "this is";
char string4[] = "a test";

if(strcmp(string3, string4) == 0)
    printf("strings are equal\n");
else
    printf("strings are different\n");
</pre>
This code fragment will print ``strings are different''.
Notice that <TT>strcmp</TT> does <em>not</em> return a Boolean,
true/false answer:
its result is 0, which C treats as false, precisely when the strings are equal.
So it's not a good idea to write something like
<pre>
if(strcmp(string3, string4))
    ...
</pre>
because it will behave backwards from what you might reasonably
expect.
(Nevertheless, if you start reading other people's code,
you're likely to come across conditionals like
<TT>if(strcmp(a, b))</TT> or even
<TT>if(!strcmp(a, b))</TT>.
The first does something if the strings are unequal;
the second does something if they're equal.
You can read these more easily if you pretend for a moment
that <TT>strcmp</TT>'s name were <TT>strdiff</TT>, instead.)
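</p><p>To make the three kinds of result easier to see,
the following sketch reduces <TT>strcmp</TT>'s return value to -1, 0, or 1.
<TT>compare_sign</TT> is a hypothetical helper, not a library function,
and the <TT>const</TT> qualifier (not yet covered in these notes)
merely promises that the function won't modify its arguments.

```c
#include <string.h>

/* Hypothetical helper: returns -1 if a sorts before b, 0 if the two
   strings are identical, and 1 if a sorts after b.  Only the sign of
   strcmp's result is meaningful, so that is all we look at. */
int compare_sign(const char a[], const char b[])
{
    int r = strcmp(a, b);

    if (r < 0)
        return -1;
    if (r > 0)
        return 1;
    return 0;
}
```

With the strings from the example above,
<TT>compare_sign("this is", "a test")</TT> returns 1,
because <TT>'t'</TT> has a larger value than <TT>'a'</TT>.
Assuming ASCII, <TT>compare_sign("Zebra", "apple")</TT> returns -1,
which is one of the anomalies just mentioned.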
</p><p>
Another standard library function is <TT>strcat</TT>,
which concatenates strings.
It does <em>not</em> concatenate two strings together
and give you a third, new string;
what it really does is append one string onto the end of another.
(If it gave you a new string, it would have to allocate memory
for it somewhere, and the standard library string functions
generally never do that for you automatically.)
Here's an example:
<pre>
char string5[20] = "Hello, ";
char string6[] = "world!";

printf("%s\n", string5);

strcat(string5, string6);

printf("%s\n", string5);
</pre>
The first call to <TT>printf</TT> prints ``Hello, '',
and the second one prints ``Hello, world!'',
indicating that the contents of <TT>string6</TT> have been tacked
on to the end of <TT>string5</TT>.
Notice that we declared <TT>string5</TT> with extra space,
to make room for the appended characters.
</p><p>If you have a string and you want to know its length (perhaps
so that you can check whether it will fit in
some other array you've allocated for it),
you can call <TT>strlen</TT>,
which returns the length of the string
(i.e. the number of characters in it),
not including the <TT>\0</TT>:
<pre>
char string7[] = "abc";
int len = strlen(string7);
printf("%d\n", len);
</pre>
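</p><p>The ``will it fit'' check mentioned above might be sketched as follows.
<TT>checked_copy</TT> is a hypothetical helper, not part of the standard library;
it also uses <TT>size_t</TT> (the unsigned type <TT>strlen</TT> actually returns)
and <TT>const</TT>, neither of which these notes have covered yet.
The key point is that the destination needs room for
<TT>strlen(src)</TT> characters <em>plus one</em> for the <TT>\0</TT>.

```c
#include <string.h>

/* Hypothetical helper: copy src into dest only if it fits.
   destsize is the total size of the dest array, in chars.
   Returns 1 if the copy was made, or 0 if src (with its '\0')
   would not fit, in which case dest is left untouched. */
int checked_copy(char dest[], size_t destsize, const char src[])
{
    if (strlen(src) + 1 > destsize)
        return 0;
    strcpy(dest, src);
    return 1;
}
```

A caller with an array <TT>char buf[8];</TT> could write
<TT>checked_copy(buf, sizeof buf, "Hello")</TT> and test the result
before using <TT>buf</TT>.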
</p><p>Finally, you can print strings out
with <TT>printf</TT>
using the <TT>%s</TT> format specifier,
as we've been doing in these examples already
(e.g. <TT>printf("%s\n", string5);</TT>).
</p><p>
Since a string is just an array of characters,
all of the string-handling functions we've just seen
can be written quite simply,
using no techniques more complicated than the ones we already know.
In fact,
it's quite instructive to look at how these functions might be implemented.
Here is a version of <TT>strcpy</TT>:
<pre>
mystrcpy(char dest[], char src[])
{
    int i = 0;

    while(src[i] != '\0')
        {
        dest[i] = src[i];
        i++;
        }

    dest[i] = '\0';
}
</pre>
We've called it <TT>mystrcpy</TT> instead of <TT>strcpy</TT>
so that it won't clash with the version that's already in the
standard library.
Its operation is simple:
it looks at characters in the <TT>src</TT> string one at a time,
and as long as they're not <TT>\0</TT>,
assigns them, one by one,
to the corresponding positions in the <TT>dest</TT> string.
When it's done, it terminates the <TT>dest</TT> string
by appending a <TT>\0</TT>.
(After exiting the <TT>while</TT> loop,
<TT>i</TT> is guaranteed to have a value one
greater than the subscript of the last character in <TT>src</TT>.)
For comparison, here's a way of writing the same code, using a
<TT>for</TT> loop:
<pre>
for(i = 0; src[i] != '\0'; i++)
    dest[i] = src[i];

dest[i] = '\0';
</pre>
Yet a third possibility
is to move the test for the terminating <TT>\0</TT> character
out of the <TT>for</TT> loop header and into the body of the loop,
using an explicit <TT>if</TT> and <TT>break</TT> statement,
so that we can perform the test after the assignment
and therefore use the assignment inside the loop
to copy the <TT>\0</TT> to <TT>dest</TT>, too:
<pre>
for(i = 0; ; i++)
    {
    dest[i] = src[i];
    if(src[i] == '\0')
        break;
    }
</pre>
</p><p>(There are in fact many, many ways to write <TT>strcpy</TT>.
Many programmers like to combine the assignment and test,
using an expression like <TT>(dest[i] = src[i]) != '\0'</TT>.
This is actually the same sort of combined operation
as we used in our <TT>getchar</TT> loop in chapter 6.)
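</p><p>Spelled out in full, the combined form might look like the sketch below.
The name <TT>mystrcpy2</TT> is made up here,
and the explicit <TT>void</TT> return type and the <TT>const</TT> qualifier
are additions that the versions above do not use.

```c
/* Hypothetical variant of mystrcpy that combines assignment and test.
   dest must be large enough to hold src, including its '\0'. */
void mystrcpy2(char dest[], const char src[])
{
    int i = 0;

    /* The assignment copies one character, '\0' included; its value
       is the character just copied, so the loop ends immediately
       after the terminating '\0' has been stored in dest. */
    while ((dest[i] = src[i]) != '\0')
        i++;
}
```

Because the <TT>\0</TT> is copied by the loop itself,
no separate <TT>dest[i] = '\0'</TT> is needed afterwards.
The inner parentheses matter:
without them, <TT>!=</TT> would be evaluated before <TT>=</TT>.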
</p><p>Here is a version of <TT>strcmp</TT>:
<pre>
mystrcmp(char str1[], char str2[])
{
    int i = 0;

    while(1)
        {
        if(str1[i] != str2[i])
            return str1[i] - str2[i];
        if(str1[i] == '\0' || str2[i] == '\0')
            return 0;
        i++;
        }
}
</pre>
Characters are compared one at a time.
If two characters in one position differ,
the strings are different,
and we are supposed to return a value less than zero
if the first string (<TT>str1</TT>)
is alphabetically less than the second string.
Since characters in C are represented
by their numeric character set values,
and since most reasonable character sets
assign values to characters in alphabetical order,
we can simply subtract the two differing characters from each other:
the expression <TT>str1[i] - str2[i]</TT>
will yield a negative result if the <TT>i</TT>'th character of
<TT>str1</TT> is less than the corresponding character in <TT>str2</TT>.
(As it turns out,
this will behave a bit strangely
when comparing upper- and lower-case letters,
but it's the traditional approach,
which the standard versions of <TT>strcmp</TT> tend to use.)
If the characters are the same, we continue around the loop,
<em>unless</em> the characters we just compared were (both)
<TT>\0</TT>, in which case we've reached the end of both strings,
and the strings are equal.
Notice that we used
what may at first appear to be an infinite loop--the
controlling expression is the constant 1,
which is always true.
What actually happens is that
the loop runs until one of the two <TT>return</TT>
statements breaks out of it (and the entire function).
Note also that when one string is longer than the other,
the first test will notice this
(because one string will contain a real character at the <TT>[i]</TT> location,
while the other will contain <TT>\0</TT>, and these are not equal)
and the return value will be computed by subtracting the real
character's value from 0, or vice versa.
(Thus the shorter string will be treated as
``less than'' the longer.)
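</p><p>One refinement is worth sketching.
The C standard specifies that <TT>strcmp</TT> compares characters
as if they had type <TT>unsigned char</TT>,
whereas plain <TT>char</TT> may be signed,
in which case characters with values above 127 come out negative.
The variant below, under the made-up name <TT>mystrcmp_u</TT>,
converts before subtracting;
it also folds the two tests into the loop condition.
Casts and <TT>const</TT> have not been covered in these notes,
so treat this as an illustration rather than as the library's actual code.

```c
/* Hypothetical variant of mystrcmp that compares as unsigned char. */
int mystrcmp_u(const char str1[], const char str2[])
{
    int i = 0;

    /* Advance while the characters match and str1 has not ended.
       If str2 ends first, the characters differ and the loop stops. */
    while (str1[i] != '\0' && str1[i] == str2[i])
        i++;

    /* Either both characters are '\0' (difference 0), or this is the
       first position at which the two strings differ. */
    return (unsigned char)str1[i] - (unsigned char)str2[i];
}
```

As before, a shorter string compares ``less than'' a longer one
that begins with the same characters,
since its <TT>\0</TT> is subtracted from a real character or vice versa.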
</p><p>Finally, here is a version of <TT>strlen</TT>:
<pre>
int mystrlen(char str[])
{
    int i;

    for(i = 0; str[i] != '\0'; i++)
        {}

    return i;
}
</pre>
In this case, all we have to do is find the <TT>\0</TT> that
terminates the string,
and it turns out that the three control expressions of the
<TT>for</TT> loop do all the work;
there's nothing left to do in the body.
Therefore, we use an empty pair of braces <TT>{}</TT> as the loop
body.
Equivalently, we could use a <dfn>null statement</dfn>,
which is simply a semicolon:
<pre>
for(i = 0; str[i] != '\0'; i++)
    ;
</pre>
Empty loop bodies can be a bit startling at first,
but they're not unheard of.
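</p><p>The same two techniques, scanning for the <TT>\0</TT>
and copying characters one at a time,
are enough to sketch <TT>strcat</TT>, which was described earlier in this chapter but not implemented.
The name <TT>mystrcat</TT> and the details below are illustrative assumptions,
not the library's actual code.

```c
/* Hypothetical version of strcat: append src to the end of dest.
   dest must have room for both strings plus one terminating '\0'. */
void mystrcat(char dest[], const char src[])
{
    int i = 0, j = 0;

    while (dest[i] != '\0')     /* find the end of dest */
        i++;

    while (src[j] != '\0')      /* copy src's characters after it */
        dest[i++] = src[j++];

    dest[i] = '\0';             /* terminate the combined string */
}
```

The first loop is <TT>mystrlen</TT> in disguise,
and the second is <TT>mystrcpy</TT> with a different starting position in <TT>dest</TT>.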
</p><p>Everything we've looked at so far has come out of C's standard libraries.
As one last example, let's write a <TT>substr</TT> function,
for extracting a substring out of a larger string.
We might call it like this:
<pre>
char string8[] = "this is a test";
char string9[10];
substr(string9, string8, 5, 4);
printf("%s\n", string9);
</pre>
The idea is that we'll extract a substring of length 4,
starting at character 5 (0-based) of <TT>string8</TT>,
and copy the substring to <TT>string9</TT>.
Just as with <TT>strcpy</TT>,
it's our responsibility to declare the destination string
(<TT>string9</TT>)
big enough.
Here is an implementation of <TT>substr</TT>.
Not surprisingly, it's quite similar to <TT>strcpy</TT>:
<pre>
substr(char dest[], char src[], int offset, int len)
{
    int i;
    for(i = 0; i &lt; len &amp;&amp; src[offset + i] != '\0'; i++)
        dest[i] = src[i + offset];
    dest[i] = '\0';
}
</pre>
If you compare this code to the code for <TT>mystrcpy</TT>,
you'll see that the only differences are that characters are
fetched from <TT>src[offset + i]</TT> instead of <TT>src[i]</TT>,
and that the loop stops when <TT>len</TT> characters have been copied
(or when the <TT>src</TT> string runs out of characters,
whichever comes first).
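</p><p>One case this <TT>substr</TT> does not guard against is an
<TT>offset</TT> that lies beyond the end of <TT>src</TT>:
the loop would then start reading past the terminating <TT>\0</TT>.
A possible defensive variant is sketched below.
The name <TT>substr_checked</TT> and the policy of producing an empty string
for an out-of-range offset are assumptions made for this example.

```c
#include <string.h>

/* Hypothetical variant of substr that tolerates a bad offset.
   dest must have room for len characters plus the '\0'. */
void substr_checked(char dest[], const char src[], int offset, int len)
{
    int i;
    int srclen = (int)strlen(src);

    /* An offset outside the string is moved to the '\0' itself,
       so that the loop below copies nothing. */
    if (offset < 0 || offset > srclen)
        offset = srclen;

    for (i = 0; i < len && src[offset + i] != '\0'; i++)
        dest[i] = src[offset + i];
    dest[i] = '\0';
}
```

With the earlier call, <TT>substr_checked(string9, string8, 5, 4)</TT>
leaves ``is a'' in <TT>string9</TT>,
and an offset of, say, 100 leaves the empty string.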
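</p><p>In this chapter, we've been careless about declaring the return
types of the string functions, and (with the exception of
<TT>mystrcmp</TT> and <TT>mystrlen</TT>) they haven't returned values.
(A function written without a return type defaults to <TT>int</TT>
in the original ANSI C;
compilers following the 1999 and later standards
require the type to be written out.)
The real <TT>strcpy</TT> and <TT>strcat</TT> do return values,
but they're of type ``pointer to character,'' which
we haven't discussed yet.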
</p><p>
When working with strings,
it's important to keep firmly in mind the differences
between characters and strings.
We must also occasionally remember the way characters are represented,
and the relation between character values and integers.
</p><p>As we have had several occasions to mention,
a character is represented internally as a small integer,
with a value depending on the character set in use.
For example, we might find that <TT>'A'</TT> had the value 65,
that <TT>'a'</TT> had the value 97,
and that <TT>'+'</TT> had the value 43.
(These are, in fact, the values in the ASCII character set,
which most computers use.
However,
you don't need to learn these values,
because the vast majority of the time,
you use character constants to refer to characters,
and the compiler worries about the values for you.
Using character constants in preference to raw numeric values
also makes your programs more portable.)
</p><p>As we also have mentioned,
there is a big difference between a character and a string,
even a string which contains only one character
(other than the <TT>\0</TT>).
For example, <TT>'A'</TT> is <em>not</em> the same as <TT>"A"</TT>.
To drive home this point,
let's illustrate it with a few examples.
</p><p>If you have a string:
<pre>
char string[] = "hello, world!";
</pre>
you can modify its first character by saying
<pre>
string[0] = 'H';
</pre>
(Of course, there's nothing magic about the first character;
you can modify any character in the string in this way.
Be aware, though,
that it is not always safe to modify strings in-place like this;
we'll say more
about the modifiability of strings
in a later chapter on pointers.)
Since you're replacing a character,
you want a character constant, <TT>'H'</TT>.
It would <em>not</em> be right to write
<pre>
string[0] = "H"; /* WRONG */
</pre>
because <TT>"H"</TT> is a string
(an array of characters),
not a single character.
(The destination of the assignment, <TT>string[0]</TT>,
is a <TT>char</TT>,
but the right-hand side is a string;
these types don't match.)
</p><p>On the other hand, when you need a string, you must use a string.
To print a single newline,
you could call
<pre>
printf("\n");
</pre>
It would <em>not</em> be correct to call
<pre>
printf('\n'); /* WRONG */
</pre>
<TT>printf</TT> always wants a string as its first argument.
(As one final example,
<TT>putchar</TT> wants a single character,
so <TT>putchar('\n')</TT> would be correct,
and <TT>putchar("\n")</TT> would be incorrect.)
</p><p>We must also remember the difference between characters and integers.
If we treat the character <TT>'1'</TT> as an integer,
perhaps by saying
<pre>
int i = '1';
</pre>
we will probably <em>not</em> get the value 1 in <TT>i</TT>;
we'll get the value of the character <TT>'1'</TT>
in the machine's character set.
(In ASCII, it's 49.)
When we do need to find the numeric value of a digit character
(or to go the other way,
to get the digit character with a particular value)
we can make use of the fact that,
in any character set used by C,
the values for the digit characters,
whatever they are, are contiguous.
In other words,
no matter what values <TT>'0'</TT> and <TT>'1'</TT> have,
<TT>'1' - '0'</TT> will be 1
(and, obviously, <TT>'0' - '0'</TT> will be 0).
So, for a variable <TT>c</TT> holding some digit character,
the expression
<pre>
c - '0'
</pre>
gives us its value.
(Similarly, for an integer value <TT>i</TT>,
<TT>i + '0'</TT> gives us the corresponding digit character,
as long as 0 &lt;= <TT>i</TT> &lt;= 9.)
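</p><p>Applying <TT>c - '0'</TT> to each character in turn
is enough to convert a whole string of digits to a number.
The sketch below shows the idea under a made-up name, <TT>digits_to_int</TT>;
it handles only unsigned decimal digits
and makes no attempt to detect overflow.

```c
/* Hypothetical helper: convert the leading decimal digits of s to an
   int.  Conversion stops at the first character that is not a digit
   (which includes the terminating '\0').  No sign, no overflow check. */
int digits_to_int(const char s[])
{
    int i;
    int n = 0;

    for (i = 0; s[i] >= '0' && s[i] <= '9'; i++)
        n = 10 * n + (s[i] - '0');   /* shift left one place, add digit */

    return n;
}
```

The range test <TT>s[i] &gt;= '0' &amp;&amp; s[i] &lt;= '9'</TT>
relies on the same guarantee that the digit characters are contiguous.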
</p><p>Just as the character <TT>'1'</TT> is not the integer 1,
the string <TT>"123"</TT> is not the integer 123.
When we have a string of digits,
we can convert it to the corresponding integer
by calling the standard function <TT>atoi</TT>,
which is declared in <TT>&lt;stdlib.h&gt;</TT>:
<pre>
char string[] = "123";
int i = atoi(string);
int j = atoi("456");
</pre>
Later we'll learn how to go in the other direction,
to convert an integer into a string.
(One way,
as long as what you want to do is print the number out,
is to call <TT>printf</TT>,
using <TT>%d</TT> in the format string.)
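</p><p>When the digits are wanted in an array rather than on the screen,
one option is the standard function <TT>sprintf</TT>,
which works like <TT>printf</TT> but stores its output in a string.
The wrapper below is only a sketch:
<TT>int_to_string</TT> is a hypothetical name,
and it is the caller's job to supply a destination that is large enough
(32 characters is ample for an <TT>int</TT> on common systems).

```c
#include <stdio.h>

/* Hypothetical helper: write the decimal representation of n into
   dest, followed by '\0'.  dest must be big enough for every digit,
   a possible minus sign, and the terminator. */
void int_to_string(char dest[], int n)
{
    sprintf(dest, "%d", n);
}
```

After <TT>int_to_string(buf, 123)</TT>, the array <TT>buf</TT>
holds the four characters <TT>'1'</TT>, <TT>'2'</TT>, <TT>'3'</TT>, and <TT>\0</TT>.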
</p><hr>
<p>
Read sequentially:
<a href="sx7c.html" rev=precedes>prev</a>
<a href="sx9.html" rel=precedes>next</a>
<a href="top.html" rev=subdocument>up</a>
<a href="top.html">top</a>
</p>
<p>
This page by <a href="http://www.eskimo.com/~scs/">Steve Summit</a>
// <a href="copyright.html">Copyright</a> 1995-1997
// <a href="mailto:scs@eskimo.com">mail feedback</a>
</p>
</body>
</html>