   1 <!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
   2 <!-- This collection of hypertext pages is Copyright 1995-7 by Steve Summit. -->
   3 <!-- This material may be freely redistributed and used -->
   4 <!-- but may not be republished or sold without permission. -->
   5 <html>
   6 <head>
   7 <link rev="owner" href="mailto:scs@eskimo.com">
   8 <link rev="made" href="mailto:scs@eskimo.com">
   9 <title>17.1: Text Data Files</title>
  10 <link href="sx3.html" rev=precedes>
  11 <link href="sx3b.html" rel=precedes>
  12 <link href="sx3.html" rev=subdocument>
  13 </head>
  14 <body>
  15 <H2>17.1: Text Data Files</H2>
  16
  17 <p>Text data files,
  18 it must be admitted,
  19 are not always as compact
  20 or as efficient to read and write
  21 as binary files.
  22 It can be a bit more work to set up the code which reads and writes them.
  23 But they have some powerful advantages:
  24 any time you need to,
  25 you can look at them using ordinary text editors and other tools.
  26 If program A is writing a data file
  27 which program B is supposed to
  28 be able to read but cannot,
  29 you can immediately look at the file
  30 to see if it's in the correct format
  31 and so determine whether it's program A's or B's fault.
  32 If program A has not been written yet,
  33 you can easily create a data file by hand to test program B with.
  34 Text files are automatically portable between machines,
  35 even those where integers and other data types
  36 are of different sizes or are laid out differently in memory.
  37 Because they're not expected to have the rigid formats of binary files,
  38 it tends to be more natural to arrange
  39 text files so
  40 that as the data file format changes slightly,
  41 newer (or older) versions of the software
  42 can read older (or newer) versions of the data file.
  43 Text data files are the focus of this chapter;
  44 they're what I use all the time,
  45 and they're what I recommend you use
  46 unless you have compelling reasons not to.
  47 </p><p>When we're using text data files, we acknowledge
  48 that the <em>internal</em> and <em>external</em> representations
  49 of our data
  50 are quite different.
  51 For example, a value of type <TT>int</TT>
  52 will usually be represented internally as a 2- or 4-byte
  53 (16- or 32-bit)
  54 piece of memory.
  55 Externally, though,
  56 that integer will be represented as a string of characters
  57 representing its decimal or hexadecimal value.
  58 Converting back and forth
  59 between the internal and external representations
  60 is easy enough.
  61 To go from the internal representation to the external,
  62 we'll almost always use <TT>printf</TT> or <TT>fprintf</TT>;
  63 for example,
  64 to convert an <TT>int</TT> we might use <TT>%d</TT> or <TT>%x</TT> format.
  65 To convert from the external representation back to the internal,
  66 we could use <TT>scanf</TT> or <TT>fscanf</TT>,
  67 or read the characters in some other way
  68 and then use
  69 functions like
  70 <TT>atoi</TT>, <TT>strtol</TT>, or <TT>sscanf</TT>.
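Of these, <TT>atoi</TT> is the most convenient,
but it gives no indication when the text is not a number at all.
As a minimal sketch of a stricter conversion,
here is a hypothetical helper, <TT>parse_int</TT>
(the name and interface are our own invention, not part of the standard library),
built on the standard <TT>strtol</TT>:

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* parse_int is a hypothetical helper, not a standard function.
 * It converts the decimal text in s to an int, storing it in *out.
 * Returns 1 on success, 0 if s holds no number, has trailing junk,
 * or is out of range for an int.  Trailing whitespace (such as the
 * \n left by fgets) is accepted. */
int parse_int(const char *s, int *out)
{
        char *end;
        long v;

        errno = 0;
        v = strtol(s, &end, 10);
        if(end == s)
                return 0;               /* no digits at all */
        if(errno == ERANGE || v < INT_MIN || v > INT_MAX)
                return 0;               /* too big or too small */
        while(isspace((unsigned char)*end))
                end++;
        if(*end != '\0')
                return 0;               /* junk after the number */
        *out = (int)v;
        return 1;
}
```

A caller that would otherwise write <TT>a[i] = atoi(line)</TT>
could instead test <TT>parse_int(line, &amp;a[i])</TT>
and print a diagnostic when it returns 0.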
  71 </p><p>We have a great many options
  72 when it comes to performing this mapping,
  73 that is,
  74 when converting between the internal and external representations.
  75 Our choice may be determined by the layout we want the data file to have,
  76 or by what's easiest to implement,
  77 or by some combination of these factors.
  78 Some of the choices are pretty arbitrary;
  79 but in any case,
  80 what matters most is obviously that
  81 the reading and writing code ``match'',
  82 that is,
  83 that the data file writing code write the data in the right format
  84 such that the data file reading code can accurately read it.
  85 For the rest of this section,
  86 we'll explore several ways of writing and reading data
  87 to and from text data files,
  88 using various combinations of the stdio functions
  89 (and perhaps one or two of our own).
  90 </p><p>Suppose we had an array of integers:
  91 <pre>
  92         int a[10];
  93 </pre>
  94 and suppose it had been filled up with values,
  95 and suppose we wanted to write them out to a data file.
  96 We could write them all on one line, separated by spaces:
  97 <pre>
  98         fprintf(ofp, "%d %d %d %d %d %d %d %d %d %d\n",
  99                 a[0], a[1], a[2], a[3], a[4], a[5],
 100                         a[6], a[7], a[8], a[9]);
 101 </pre>
 102 We could write them on 10 separate lines:
 103 <pre>
 104         for(i = 0; i &lt; 10; i++)
 105                 fprintf(ofp, "%d\n", a[i]);
 106 </pre>
 107 Realizing that the loop is easier and more flexible,
 108 we could go back to writing them all on one line, using a loop:
 109 <pre>
 110         for(i = 0; i &lt; 10; i++)
 111                 fprintf(ofp, "%d ", a[i]);
 112         fprintf(ofp, "\n");
 113 </pre>
 114 If we were worried about that trailing space at the end of the line,
 115 we could arrange to eliminate it:
 116 <pre>
 117         for(i = 0; i &lt; 10; i++)
 118                 {
 119                 if(i &gt; 0)
 120                         fprintf(ofp, " ");
 121                 fprintf(ofp, "%d", a[i]);
 122                 }
 123         fprintf(ofp, "\n");
 124 </pre>
 125 Recognizing that <TT>fprintf</TT> is overkill
 126 for printing single, fixed characters,
 127 we could replace two of the calls with <TT>putc</TT>:
 128 <pre>
 129         for(i = 0; i &lt; 10; i++)
 130                 {
 131                 if(i &gt; 0)
 132                         putc(' ', ofp);
 133                 fprintf(ofp, "%d", a[i]);
 134                 }
 135         putc('\n', ofp);
 136 </pre>
 137 </p><p>When it came time to read the numbers in,
 138 we would have at least as many choices.
 139 We could read the ten values all at once, using <TT>fscanf</TT>:
 140 <pre>
 141         int r = fscanf(ifp, "%d %d %d %d %d %d %d %d %d %d",
 142                 &amp;a[0], &amp;a[1], &amp;a[2], &amp;a[3], &amp;a[4], &amp;a[5],
 143                         &amp;a[6], &amp;a[7], &amp;a[8], &amp;a[9]);
 144         if(r != 10)
 145                 fprintf(stderr, "error in data file\n");
 146 </pre>
 147 Since the <TT>scanf</TT> family treats all whitespace
 148 (spaces, tabs, and newlines)
 149 the same,
 150 this code would read either the format with all the numbers on one line,
 151 or the format with one number per line.
 152 Notice that we check <TT>fscanf</TT>'s return value,
 153 to make sure that it successfully read in
 154 all the numbers we expected it to.
 155 Since data files come in from the outside world,
 156 it's possible for them to be corrupted,
 157 and programs should not blindly read them assuming that they're perfect.
 158 A program that crashes when it attempts to read a damaged data file
 159 is terribly frustrating;
 160 a program that diagnoses the problem is much more polite.
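To see both points in miniature,
here is a sketch using <TT>sscanf</TT> in place of <TT>fscanf</TT>
(so that the input can be written inline as a string;
the conversion rules are the same)
and three values instead of ten.
The function <TT>scan_three</TT> is hypothetical, for illustration only:

```c
#include <stdio.h>

/* scan_three is a hypothetical helper for illustration only.
 * It applies a three-value version of the format used above to
 * a string, and returns whatever sscanf returns: the number of
 * values converted (or EOF if the input ran out before any
 * conversion). */
int scan_three(const char *text, int a[3])
{
        return sscanf(text, "%d %d %d", &a[0], &a[1], &a[2]);
}
```

Given either <TT>"1 2 3\n"</TT> or <TT>"1\n2\n3\n"</TT>, this returns 3.
Given <TT>"1 2 oops"</TT> it returns 2,
so comparing the return value against the expected count
is what catches the damaged input.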
 161 </p><p>We could also read the data file a line at a time,
 162 converting the text to integers via other means.
 163 If the integers were stored one per line,
 164 we could use code like this:
 165 <pre>
 166         #define MAXLINE 200
 167
 168         char line[MAXLINE];
 169         for(i = 0; i &lt; 10; i++)
 170                 {
 171                 if(fgets(line, MAXLINE, ifp) == NULL)
 172                         {
 173                         fprintf(stderr, "error in data file\n");
 174                         break;
 175                         }
 176                 a[i] = atoi(line);
 177                 }
 178 </pre>
 179 (We could also use
 180 our own <TT>getline</TT> or <TT>fgetline</TT> function
 181 instead of <TT>fgets</TT>.)
 182 If the integers were stored all on one line,
 183 we could use the <TT>getwords</TT> function from chapter 10
 184 to separate the numbers at the whitespace boundaries:
 185 <pre>
 186         char *av[10];
 187
 188         if(fgets(line, MAXLINE, ifp) == NULL)
 189                 fprintf(stderr, "error in data file\n");
 190         else if(getwords(line, av, 10) != 10)
 191                 fprintf(stderr, "error in data file\n");
 192         else    {
 193                 for(i = 0; i &lt; 10; i++)
 194                         a[i] = atoi(av[i]);
 195                 }
 196 </pre>
 197 </p><p>Suppose, now, that
 198 there were not always 10 elements in the array <TT>a</TT>;
 199 suppose we had a separate integer variable <TT>na</TT>
 200 to record how many elements the array <TT>a</TT> currently contains.
 201 When writing the data out,
 202 we would certainly then
 203 use a loop;
 204 we might also want to precede the data by the count,
 205 in case that will make it easier for the reading program:
 206 <pre>
 207         fprintf(ofp, "%d\n", na);
 208         for(i = 0; i &lt; na; i++)
 209                 fprintf(ofp, "%d\n", a[i]);
 210 </pre>
 211 We could also print all
 212 of the numbers
 213 on one line:
 214 <pre>
        fprintf(ofp, "%d", na);
        for(i = 0; i &lt; na; i++)
                fprintf(ofp, " %d", a[i]);
        fprintf(ofp, "\n");
 218 </pre>
 219 (Notice that the presence of the extra value at the beginning of the line
 220 makes the space separator game easier to play.)
 221 </p><p>Now, when reading the data in, we would simply read the count first,
 222 then the data.
 223 Using <TT>fscanf</TT>:
 224 <pre>
 225         if(fscanf(ifp, "%d", &amp;na) != 1)
 226                 {
 227                 fprintf(stderr, "error in data file\n");
 228                 return;
 229                 }
 230
 231         if(na &gt; 10)
 232                 {
 233                 fprintf(stderr, "too many items in data file\n");
 234                 return;
 235                 }
 236
 237         for(i = 0; i &lt; na; i++)
 238                 {
 239                 if(fscanf(ifp, "%d", &amp;a[i]) != 1)
 240                         {
 241                         fprintf(stderr, "error in data file\n");
 242                         return;
 243                         }
 244                 }
 245 </pre>
 246 (Here we assume that
 247 the code to read the array from the data file is part of a function,
 248 and that when we detect an error,
 249 we return early from the function.
 250 In practice,
 251 we would probably return some error code to the caller.)
 252 </p><p>If we chose to use <TT>fgets</TT>
 253 (or <TT>fgetline</TT>),
 254 the code might look like this for data on separate lines:
 255 <pre>
 256         if(fgets(line, MAXLINE, ifp) == NULL)
 257                 {
 258                 fprintf(stderr, "error in data file\n");
 259                 return;
 260                 }
 261         na = atoi(line);
 262         if(na &gt; 10)
 263                 {
 264                 fprintf(stderr, "too many items in data file\n");
 265                 return;
 266                 }
 267
 268         for(i = 0; i &lt; na; i++)
 269                 {
 270                 if(fgets(line, MAXLINE, ifp) == NULL)
 271                         {
 272                         fprintf(stderr, "error in data file\n");
 273                         return;
 274                         }
 275                 a[i] = atoi(line);
 276                 }
 277 </pre>
 278 Or, if the data were all on one line, like this:
 279 <pre>
 280         int ac;
 281         char *av[11];
 282
 283         if(fgets(line, MAXLINE, ifp) == NULL)
 284                 {
 285                 fprintf(stderr, "error in data file\n");
 286                 return;
 287                 }
 288
        ac = getwords(line, av, 11);
 290         if(ac &lt; 1)
 291                 {
 292                 fprintf(stderr, "error in data file\n");
 293                 return;
 294                 }
        na = atoi(av[0]);
 296         if(na &gt; 10)
 297                 {
 298                 fprintf(stderr, "too many items in data file\n");
 299                 return;
 300                 }
 301         if(na != ac - 1)
 302                 {
 303                 fprintf(stderr, "error in data file\n");
 304                 return;
 305                 }
 306         for(i = 0; i &lt; na; i++)
 307                 a[i] = atoi(av[i+1]);
 308 </pre>
 309 </p><p>But sometimes, you don't need to save the count
 310 (<TT>na</TT>)
 311 explicitly;
the reading program can deduce the count
from the number of values it finds in the file.
 314 If the file contains <em>only</em> the integers in this array,
 315 then we can simply read integers until we reach end-of-file.
 316 For example, using <TT>fscanf</TT>:
 317 <pre>
 318         na = 0;
 319         while(na &lt; 10 &amp;&amp; fscanf(ifp, "%d", &amp;a[na]) == 1)
 320                 na++;
 321 </pre>
 322 (This code is deceptively simple;
 323 we haven't carefully dealt with appropriate error messages
 324 for a data file with more than 10 values,
 325 or a data file with a non-numeric ``value''
 326 for which <TT>fscanf</TT> returns 0.)
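One way of telling those cases apart, as a sketch only:
the function name <TT>read_all</TT>
and its return-code convention are assumptions of ours,
not anything defined earlier in these notes.

```c
#include <stdio.h>

#define MAXVALS 10

/* read_all is a hypothetical helper; the return convention is ours.
 * Reads whitespace-separated integers from ifp into a until EOF.
 * Returns the count on success, -1 if a non-numeric token is found,
 * -2 if the file holds more than MAXVALS values. */
int read_all(FILE *ifp, int a[MAXVALS])
{
        int n = 0, r = 0, extra;

        while(n < MAXVALS && (r = fscanf(ifp, "%d", &a[n])) == 1)
                n++;
        if(n < MAXVALS)
                return (r == EOF) ? n : -1;     /* r == 0: bad token */

        /* array is full; see whether anything else remains */
        r = fscanf(ifp, "%d", &extra);
        if(r == EOF)
                return n;
        return (r == 1) ? -2 : -1;
}
```

The caller can then print ``too many items in data file''
or ``error in data file'' as appropriate,
rather than silently stopping at the first surprise.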
 327 </p><p>Again, we could also use <TT>fgets</TT>.
 328 If the data is on separate lines:
 329 <pre>
 330         na = 0;
 331         while(na &lt; 10 &amp;&amp; fgets(line, MAXLINE, ifp) != NULL)
 332                 a[na++] = atoi(line);
 333 </pre>
 334 If the data is all on one line:
 335 <pre>
 336         if(fgets(line, MAXLINE, ifp) == NULL)
 337                 {
 338                 fprintf(stderr, "error in data file\n");
 339                 return;
 340                 }
 341         na = getwords(line, av, 10);
 342         if(na &gt; 10)
 343                 {
 344                 fprintf(stderr, "too many items in data file\n");
 345                 return;
 346                 }
 347         for(i = 0; i &lt; na; i++)
 348                 a[i] = atoi(av[i]);
 349 </pre>
 350 Notice that this last implementation does not require
 351 that the file consist of
 352 only
 353 data for the array <TT>a</TT>.
 354 One <em>line</em> of the file consists of data for the array <TT>a</TT>,
 355 but other lines of the file could contain other data.
 356 </p><p>We could also scatter <TT>a</TT>'s data on multiple lines,
 357 without using an explicit count,
 358 and with the ability for the file to contain other data as well,
 359 if we marked the end of the array data with an explicit marker in the file,
 360 rather than assuming that the array's data continued until end-of-file.
 361 For example, we could write the data out like this:
 362 <pre>
 363         for(i = 0; i &lt; na; i++)
 364                 fprintf(ofp, "%d\n", a[i]);
 365         fprintf(ofp, "end\n");
 366 </pre>
 367 and read it like this:
 368 <pre>
 369         na = 0;
 370         while(fgets(line, MAXLINE, ifp) != NULL)
 371                 {
 372                 if(strncmp(line, "end", 3) == 0)
 373                         break;
                if(na &gt;= 10)
 375                         {
 376                         fprintf(stderr, "too many items in data file\n");
 377                         return;
 378                         }
 379                 a[na++] = atoi(line);
 380                 }
 381 </pre>
 382 (There's just one nuisance here in checking for the ``end'' marker:
 383 <TT>fgets</TT> leaves the <TT>\n</TT> in the line it reads,
 384 so a simple <TT>strcmp</TT> against <TT>"end"</TT> would fail.
 385 Here we use <TT>strncmp</TT>, which compares at most <TT>n</TT> characters,
 386 and we pass the third argument, <TT>n</TT>, as 3.
 387 Other solutions would be
 388 to use <TT>strcmp</TT> against the string <TT>"end\n"</TT>,
 389 or to strip the <TT>\n</TT> somehow,
 390 or to use our old <TT>getline</TT> or <TT>fgetline</TT>
 391 functions,
 392 since they strip the <TT>\n</TT> for us.)
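One caution about the <TT>strncmp</TT> approach:
with <TT>n</TT> equal to 3 it also accepts any line
that merely <em>begins</em> with ``end'', such as ``endive''.
Here is a sketch of the ``strip the <TT>\n</TT>'' alternative,
using the standard function <TT>strcspn</TT>;
the helper name <TT>is_end_marker</TT> is hypothetical.

```c
#include <string.h>

/* is_end_marker is a hypothetical helper.
 * Strips a trailing \n from line (modifying it in place), then
 * reports whether the line is exactly "end". */
int is_end_marker(char *line)
{
        line[strcspn(line, "\n")] = '\0';
        return strcmp(line, "end") == 0;
}
```

Because <TT>strcspn</TT> returns the length of the whole string
when there is no <TT>\n</TT> in it,
the assignment is harmless for a final line that lacks one.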
 393 </p><p>Now that we've seen many
 394 (too many!)
 395 options for writing and reading the array,
 396 how do you decide which to use?
 397 Should you use <TT>fscanf</TT>,
 398 or the slightly more <I>ad hoc</I> methods
involving <TT>fgets</TT>, <TT>getwords</TT>, <TT>atoi</TT>, etc.?
 400 It's largely a matter of personal preference.
 401 In the code fragments we've looked at so far,
 402 the ones using <TT>fscanf</TT> have seemed shorter,
 403 although in some cases that was because
 404 they weren't doing as much error checking
 405 as the ones that used <TT>fgets</TT>.
 406 In general,
 407 the methods using <TT>fgets</TT> will allow somewhat more flexibility,
 408 as we saw when checking for the explicit ``end'' marker,
 409
 410 which would have been difficult or impossible
 411 using <TT>scanf</TT> or <TT>fscanf</TT>.
 412 </p><p>Now let's move to another example,
 413 a user-defined data structure.
 414 Suppose we have this structure:
 415 <pre>
 416         struct s
 417                 {
 418                 int i;
 419                 float f;
 420                 char s[20];
 421                 };
 422 </pre>
 423 To write an instance of this structure out,
 424 we could simply print its fields on one line:
 425 <pre>
 426         struct s x;
 427         ...
 428         fprintf(ofp, "%d %g %s\n", x.i, x.f, x.s);
 429 </pre>
 430 or on several lines:
 431 <pre>
 432         fprintf(ofp, "%d\n", x.i);
 433         fprintf(ofp, "%g\n", x.f);
 434         fprintf(ofp, "%s\n", x.s);
 435 </pre>
 436 or simply
 437 <pre>
 438         fprintf(ofp, "%d\n%g\n%s\n", x.i, x.f, x.s);
 439 </pre>
 440 (We use <TT>%g</TT> format for the <TT>float</TT> field
 441 because <TT>%g</TT> tends to print
 442 the most accurate representation in the smallest space,
 443 e.g. <TT>1.23e6</TT> instead of <TT>1230000</TT>
 444 and <TT>1.23e-6</TT> instead of <TT>0.00000123</TT> or <TT>0.000001</TT>.)
 445 </p><p>To read this structure back in,
 446 we could again either use <TT>fscanf</TT>,
 447 or <TT>fgets</TT> and some other functions.
 448 As before, <TT>fscanf</TT> seems easier:
 449 <pre>
        if(fscanf(ifp, "%d %g %s", &amp;x.i, &amp;x.f, x.s) != 3)
 451                 {
 452                 fprintf(stderr, "error in data file\n");
 453                 return;
 454                 }
 455 </pre>
 456 Here we have a problem, though:
 457 what if the third, string field contains a space?
 458 In the <TT>scanf</TT> family,
 459 the <TT>%s</TT> format stops reading at whitespace,
 460 so if <TT>x.s</TT> had contained the string <TT>"Hello, world!"</TT>,
 461 it would be read back in as <TT>"Hello,"</TT>.
 462 As it happens,
 463 we could fix it by using the less-obvious format string
 464 <TT>"%d %g %[^\n]"</TT>,
 465 where <TT>%[^\n]</TT> means
 466 ``match any string of characters not including <TT>\n</TT>''.
 467 But we also have another problem:
 468 what if the string is longer
 469 than the 20 characters we allocated for the <TT>s</TT> field?
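We could fix this by using <TT>%19s</TT> or <TT>%19[^\n]</TT>
(the width counts the characters stored but not the terminating <TT>\0</TT>,
so it must be one less than the array size),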
 471 although we'd have to remember to change
 472 the <TT>scanf</TT> format string
 473 if we ever changed the size of the array.
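Putting the two fixes together might look like the following sketch.
It works on a line already read by <TT>fgets</TT>
and uses <TT>sscanf</TT> rather than <TT>fscanf</TT>;
the helper name <TT>parse_record</TT> is hypothetical.
Since a <TT>scanf</TT> field width does not include the terminating <TT>\0</TT>,
a 20-element array calls for a width of 19.

```c
#include <stdio.h>

struct s
        {
        int i;
        float f;
        char s[20];
        };

/* parse_record is a hypothetical helper.
 * Parses one line of the form "int float text..." into *x.
 * The width 19 leaves room for the terminating \0 in x->s;
 * it must be changed by hand if the array size changes.
 * Returns 1 on success, 0 if any of the three fields is missing. */
int parse_record(const char *line, struct s *x)
{
        return sscanf(line, "%d %g %19[^\n]", &x->i, &x->f, x->s) == 3;
}
```

Text beyond the 19th character of the string field is simply left unread,
which is a truncation, but at least not an overflow.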
 474 </p><p>Let's leave <TT>fscanf</TT> for a moment
 475 and
 476 look at our other alternatives.
 477 If we'd printed the data all on one line, we could use
 478 <pre>
 479         #include &lt;stdlib.h&gt;       /* for atof() */
 480
 481         char *av[3];
 482
 483         if(fgets(line, MAXLINE, ifp) == NULL)
 484                 {
 485                 fprintf(stderr, "error in data file\n");
 486                 return;
 487                 }
 488         if(getwords(line, av, 3) != 3)
 489                 {
 490                 fprintf(stderr, "error in data file\n");
 491                 return;
 492                 }
 493         x.i = atoi(av[0]);
 494         x.f = atof(av[1]);
 495         strcpy(x.s, av[2]);     /* XXX */
 496 </pre>
 497 Here we luck out
 498 on the question of what happens if the string contains a space,
 499 because it happens that our version of <TT>getwords</TT>
 500 (see chapter 10, p. 13)
 501 leaves
 502
 503 the remaining words
 504 in the last ``word''
 505 if there are more words in the string than we told it to find,
 506 i.e. more than the third argument to <TT>getwords</TT>
 507 which gives the size of the <TT>av</TT> array.
 508 Here, we told it it could only look for 3 words,
 509 so if the string contains spaces,
 510 making the line appear to have 4 or more words,
 511 words 3, 4, etc. will all be pointed to by <TT>av[2]</TT>.
 512 However, we still have the problem
 513 that we haven't guarded against overflow of <TT>x.s</TT>
 514 if the third (plus fourth, etc.) word on the data line
 515 is longer than 20 characters.
 516 (The comment <TT>/* XXX */</TT> is a traditional marker which means
 517 ``this line is inadequate
 518 and definitely won't work reliably in all situations
 519 but for one reason or another
 520 the person writing it is
 521 not going to take the trouble to do it right just yet.'')
 522 </p><p>If the data is written on three lines,
 523 on the other hand,
 524 we obviously have to call <TT>fgets</TT> three times to read it:
 525 <pre>
 526         if(fgets(line, MAXLINE, ifp) == NULL)
 527                 { fprintf(stderr, "error in data file\n"); return; }
 528         x.i = atoi(line);
 529
 530         if(fgets(line, MAXLINE, ifp) == NULL)
 531                 { fprintf(stderr, "error in data file\n"); return; }
 532         x.f = atof(line);
 533
 534         if(fgets(line, MAXLINE, ifp) == NULL)
 535                 { fprintf(stderr, "error in data file\n"); return; }
 536         strcpy(x.s, line);      /* XXX */
 537 </pre>
 538 Now the last line has two problems:
 539 besides the lingering problem of overflow
 540 (if the line is more than 18 characters long),
 541 we have the problem that <TT>fgets</TT> retains the <TT>\n</TT>
 542 (which is why <TT>x.s</TT> will overflow if
 543 the line is longer than 18 characters, not 19).
 544 In this case, one way to fix the overflow problem
 545 would be to have <TT>fgets</TT> read into <TT>x.s</TT> directly:
 546 <pre>
 547         if(fgets(x.s, 20, ifp) == NULL)
 548                 { fprintf(stderr, "error in data file\n"); return; }
 549 </pre>
 550 If we didn't want to have to remember
 551 to change that 20 in the call to <TT>fgets</TT>
 552 if we ever re-sized the array,
 553 we could get clever and write
 554 <TT>fgets(x.s, sizeof(x.s), ifp)</TT>.
 555 Also, we might as well figure out how to get rid of that pesky <TT>\n</TT>.
 556 One way is by calling the standard library function <TT>strchr</TT>,
 557 which searches for a certain character in a string.
 558 This
 559
 560 will require that we <TT>#include &lt;string.h&gt;</TT>,
 561 and declare an extra <TT>char *</TT> variable:
 562 <pre>
 563         #include &lt;string.h&gt;
 564         char *p;
 565         p = strchr(x.s, '\n');
 566         if(p != NULL)
 567                 *p = '\0';
 568 </pre>
 569 <TT>strchr</TT> returns a pointer to the character that it finds,
 570 or a null pointer if it doesn't find the character.
 571 If there's a <TT>\n</TT> in the line at all,
 572 we know it's at the end,
 573 so it's safe to overwrite it with a <TT>\0</TT>,
 574 making the string one character shorter.
 575 (Since we know that the <TT>\n</TT> is at the end,
 576 we could also call
 577 the
 578 function
 579 <TT>strrchr</TT>,
 580 which finds a character starting from the right.)
 581 </p><p>For any of the methods we've been using so far,
 582 what if one day we add a new field to the structure <TT>s</TT>?
 583 Obviously, we'll have to rewrite the code which writes the structure out
 584 and also the code which reads it in.
 585 Also, unless we're careful,
 586 the modified code won't be able to read in
 587 any data files we might happen to have lying around
 588 which were written before the structure was changed.
 589 Depending on the nature of the data file and the way it's used,
 590 this can be a real problem.
 591 (In principle,
 592 it's possible to write a utility program
 593 to convert the old data files to the new format,
 594 but it can be a nuisance to write that program,
 595 and it can be a <em>real</em> nuisance to track down
 596 all of the old data files that need converting.)
 597 </p><p>Therefore,
 598 when a data file format must be changed,
 599 it's often a good idea if the
 600 new, improved
 601 data file reader
 602 can be made to automatically detect and read old-format files as well.
 603 (Automatic detection isn't a strict necessity,
 604 but it's certainly a nicety.)
 605 Furthermore,
 606 it's <em>much</em> easier to write a new &amp; improved data file reader,
 607 that can read both old and new formats,
 608 if the possibility was thought of
 609 back when the original data file format was designed.
 610 </p><p>One thing that helps a lot is if data file formats have version numbers,
 611 and if each data file begins with a number,
 612 in a simple format and known location
 613 which won't change even if the rest of the format changes,
 614 indicating which version of the format this file uses.
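As a rough sketch of what such a check could look like:
the header line ``<TT>version N</TT>'',
the constant <TT>CUR_VERSION</TT>,
and the function <TT>check_version</TT>
are all assumptions made for this example,
not a format defined anywhere in these notes.

```c
#include <stdio.h>

#define CUR_VERSION 2           /* hypothetical: newest format we understand */

/* check_version is a hypothetical helper; the "version N" header
 * line is an assumed layout.
 * Reads the first line of ifp and returns the version number,
 * -1 if the header is missing or malformed, or
 * -2 if the file is newer than this program understands. */
int check_version(FILE *ifp)
{
        char line[200];
        int v;

        if(fgets(line, sizeof(line), ifp) == NULL)
                return -1;
        if(sscanf(line, "version %d", &v) != 1 || v < 1)
                return -1;
        if(v > CUR_VERSION)
                return -2;
        return v;
}
```

The caller would then choose a reading routine based on the value returned,
or print a specific complaint for each of the two negative codes.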
 615 Having a file format version number
 616 at the beginning of each data file leads to two immediate advantages:
 617 <OL><li>Whenever a new program reads a data file,
 618 it can immediately and unambiguously decide how it's going to read it,
 619 whether it can use its new &amp; improved reading routines
 620 or whether it might have to fall back
 621 on its backwards-compatibility, old-style reader.
 622 <li>If there is a suite of several programs,
 623 all of which read the same data files,
 624 and if for some reason
 625 there's an old version of one of the programs still in use,
 626 the old program can print an unambiguous message
 627 along the lines of
 628 ``this is a new data file which I am too old to read'',
 629 rather than printing the
 630 (misleading, in this case)
 631 ``error in data file''
 632 (or crashing).
 633 </OL></p><p>Another technique
 634 which can be immensely useful
 635 and which we'll explore next
 636 is to define a data file format in such a way
 637 that the overall format doesn't change
 638 even if new data is added to it.
 639 </p><p>It's easy to see why
 640 the simple data file fragments we've been looking at so far
 641 are not resilient in the face of newly-introduced data fields.
 642 In the case of <TT>struct s</TT>,
 643 the reader always assumed that
 644 the first field in the data file was <TT>i</TT>,
 645 the second field was <TT>f</TT>,
 646 and the third field was <TT>s</TT>.
 647 If we ever add any new fields,
 648 unless we're careful to add them at the end of the file
 649 (and lucky on top of that),
 650 the simpleminded reader will get confused.
 651 </p><p>One powerful way of getting around this problem
 652 is to <dfn>tag</dfn> each piece of data in the file,
 653 so that the reader knows unambiguously what it is.
 654 For example,
 655 suppose that we wrote instances of our <TT>struct s</TT> out like this:
 656 <pre>
 657         fprintf(ofp, "i %d\n", x.i);
 658         fprintf(ofp, "f %g\n", x.f);
 659         fprintf(ofp, "s %s\n", x.s);
 660 </pre>
 661 Now, each line begins with a little code which identifies it.
 662 (The code in the data file happens to match
 663 the name of the corresponding structure member,
 664 but that's not necessary,
 665 nor is there any way
 666 of getting the compiler to make any correspondence automatically.)
 667 </p><p>If we simply modified one of our previous file-reading code fragments
 668 to read this new, tagged format,
 669 we might quickly end up with a mess.
 670 We'd be continually checking the tag on the line we just read
 671 against the tag we expected to read,
 672 and constantly printing error messages or trying to resynchronize.
 673 But in fact,
 674 there's no reason to expect the lines to come in a certain order,
 675 and
 676 it turns out that it's easier to read such a file a line at a time,
 677 without that assumption,
 678 taking each line as it comes
 679 and
 680 not
 681 worrying what order the lines come in.
 682 Here is how we might do it:
 683 <pre>
 684         x.i = 0; x.f = 0.0; x.s[0] = '\0';
 685
 686         while(fgets(line, MAXLINE, ifp) != NULL)
 687                 {
 688                 if(*line == '#')
 689                         continue;
 690                 ac = getwords(line, av, 2);
 691                 if(ac == 0)
 692                         continue;
 693                 if(strcmp(av[0], "i") == 0)
 694                         x.i = atoi(av[1]);
 695                 else if(strcmp(av[0], "f") == 0)
 696                         x.f = atof(av[1]);
 697                 else if(strcmp(av[0], "s") == 0)
 698                         strcpy(x.s, av[1]);     /* XXX */
 699                 }
 700 </pre>
 701 This example also throws in a few new little features:
 702 a line beginning with <TT>#</TT> is ignored,
 703 so we will be able to place comment lines in data files
 704 by beginning them with <TT>#</TT>.
 705 The code also ignores blank lines
 706 (those for which <TT>getwords</TT> returns 0).
 707 </p><p>We're now treating the ``data file''
 708 almost like a ``command file''--the
 709 first word on each line is almost like a ``command''
 710 telling us to do something:
 711 <TT>i</TT> means store this value in <TT>x.i</TT>;
 712 <TT>f</TT> means store this value in <TT>x.f</TT>,
 713 etc.
 714 Since we don't have any easy way
 715 of telling whether we ever got around to setting a particular field,
 716 we initialize each one to an appropriate default value
 717 before we start.
 718 Notice that we did <em>not</em> have a last line
 719 in the <TT>if</TT>/<TT>else</TT>/<TT>if</TT>/<TT>else</TT> chain
 720 saying
 721 <pre>
 722         else    fprintf(stderr, "error in data file\n");
 723 </pre>
 724 Instead, we quietly <em>ignore</em> lines we don't recognize!
 725 This strategy is admittedly on the simpleminded side,
 726 and it would not be adequate under all circumstances,
 727 but it means that an old program can read a new data file
 728 containing fields it's never heard of.
 729 The old program will still be able to pluck out the data
 730 it does recognize and can use,
 731 while (deliberately) ignoring the (new) data it doesn't know about.
 732 </p><p>This code is not perfect.
 733 We still have the same sorts of problems with that string field, <TT>s</TT>:
 734 it might contain spaces,
 735 which we get around (this time)
 736 by calling <TT>getwords</TT> with a second argument of 2,
 737 so that
 738 all but
 739 the first word on the line
 740 end up ``in'' <TT>av[1]</TT>.
 741 Also, the code does not check
 742 to see that there actually was a second word on the line
 743 before using it to set <TT>x.i</TT>, <TT>x.f</TT>, or <TT>x.s</TT>.
 744 (In this case,
 745 we could fix that by complaining
 746 if <TT>getwords</TT> did not return 2.)
 747 </p><p>Finally, we still have the potential for overflow,
 748 and we might as well grit our teeth now and figure out how to fix it.
 749 Since we already initialized <TT>x.s</TT> to the empty string
 750 with the assignment <TT>x.s[0] = '\0'</TT>,
 751 one way around the problem
 752 is to replace the call to <TT>strcpy</TT> with a call to <TT>strncat</TT>:
 753 <pre>
 754                 ...
 755                 else if(strcmp(av[0], "s") == 0)
 756                         strncat(x.s, av[1], 19);
 757 </pre>
 758 (or, again, perhaps <TT>strncat(x.s, av[1], sizeof(x.s)-1)</TT>).
 759 The <TT>strcat</TT> and <TT>strncat</TT> functions
 760 are slightly misleadingly named:
 761 what they actually do is <em>append</em>
 762 the second string you hand them
 763 (i.e. the second argument)
 764 to the first, in place.
 765 In the case of <TT>strncat</TT>,
 766 it never copies more than <TT>n</TT> characters,
 767 where <TT>n</TT> is its third argument,
 768 although it does always append a <TT>\0</TT>,
 769 which is why we tell it to copy at most 19 characters, not 20.
 770
(Since <TT>x.s</TT> starts out empty,
 772 there's definitely room for 19,
 773 although we would still have to worry about the possibility
 774 of a corrupted data file which contained two <TT>s</TT> lines.
 775 You might wonder why we couldn't simply use <TT>strncpy</TT>,
 776 but it turns out that,
 777 for obscure historical reasons,
 778 <TT>strncpy</TT> does <em>not</em> always append the <TT>\0</TT>.)
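If you do want to use <TT>strncpy</TT>,
the usual remedy is to supply the <TT>\0</TT> yourself.
A minimal sketch, with a hypothetical helper name <TT>copy_field</TT>:

```c
#include <string.h>

/* copy_field is a hypothetical helper.
 * Copies src into dst (of size dstsize > 0), truncating if necessary,
 * and always leaves dst \0-terminated. */
void copy_field(char *dst, size_t dstsize, const char *src)
{
        strncpy(dst, src, dstsize - 1);
        dst[dstsize - 1] = '\0';
}
```

Unlike the <TT>strncat</TT> version, this one overwrites rather than appends,
so a call such as <TT>copy_field(x.s, sizeof(x.s), av[1])</TT>
stays within bounds even if the data file contains two <TT>s</TT> lines.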
 779 </p><p>Although it has a few imperfections
 780 (which are easily remedied, and are left as exercises)
 781
 782 this last example
 783 (using <TT>fgets</TT>,
 784 <TT>getwords</TT>,
 785 and an <TT>if</TT>/<TT>strcmp</TT>/<TT>else</TT>... chain)
 786 is an excellent basis
 787 for a flexible, robust data file reader.
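For reference, here is one possible way of tidying up those imperfections,
offered as a sketch rather than a definitive answer.
To keep it self-contained it does not call <TT>getwords</TT>;
instead it splits each line by hand
with the standard functions <TT>strspn</TT> and <TT>strcspn</TT>.
The function name <TT>read_tagged</TT> and its return convention are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 200

struct s
        {
        int i;
        float f;
        char s[20];
        };

/* read_tagged is a hypothetical function sketching a tagged-line
 * reader.  Instead of getwords, it splits each line by hand
 * (an assumption made to keep the example self-contained).
 * Returns 0 on success, -1 on a tag with no value. */
int read_tagged(FILE *ifp, struct s *x)
{
        char line[MAXLINE];
        char *tag, *val;

        x->i = 0; x->f = 0.0; x->s[0] = '\0';

        while(fgets(line, MAXLINE, ifp) != NULL)
                {
                line[strcspn(line, "\n")] = '\0';       /* strip \n */
                tag = line + strspn(line, " \t");       /* skip leading blanks */
                if(*tag == '\0' || *tag == '#')
                        continue;                       /* blank or comment */
                val = tag + strcspn(tag, " \t");        /* end of tag */
                if(*val == '\0')
                        return -1;                      /* no second word */
                *val++ = '\0';
                val += strspn(val, " \t");
                if(*val == '\0')
                        return -1;                      /* only blanks after tag */

                if(strcmp(tag, "i") == 0)
                        x->i = atoi(val);
                else if(strcmp(tag, "f") == 0)
                        x->f = atof(val);
                else if(strcmp(tag, "s") == 0)
                        {
                        x->s[0] = '\0';         /* a later s line replaces */
                        strncat(x->s, val, sizeof(x->s) - 1);
                        }
                /* unrecognized tags are deliberately ignored */
                }
        return 0;
}
```

Resetting <TT>x-&gt;s</TT> before each <TT>strncat</TT>
is one way to cope with a file that contains two <TT>s</TT> lines;
treating the second one as an error would be an equally reasonable choice.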
 788 </p><p>One footnote about the troublesome string field, <TT>s</TT>:
 789 to get around the problem of fixed-size arrays,
 790 you might one day decide
 791 to declare the <TT>s</TT> field of <TT>struct s</TT>
 792 as a pointer rather than a fixed-size array.
 793 You would have to be careful while reading, however.
 794 It might seem that you could just write,
 795 for example,
 796 <pre>
 797         x.s = av[1];    /* assumes char *s, but also WRONG */
 798 </pre>
 799 but this would <em>not</em> work;
 800 remember that whenever you use pointers
 801 you have to worry about memory allocation.
 802 If you assigned <TT>x.s</TT> in that way,
 803 where would be the memory that it points to?
 804 It would be
 805 wherever <TT>av[1]</TT> points,
 806 which is back into the <TT>line</TT> array.
 807 Not only is that (probably) a local array,
 808 valid only while the file-reading functions are active,
 809 but it's also overwritten with each new line in the data file.
 810 You'll obviously want <TT>x.s</TT>
 811 to retain a useful pointer value
 812 pointing to the text read from the file,
 813 which means that you'll still have to make a copy,
 814 after allocating some memory.
 815 In this case, you might do
 816 <pre>
 817         x.s = malloc(strlen(av[1]) + 1);
 818         if(x.s == NULL)
 819                 { fprintf(stderr, "out of memory\n"); return; }
 820         strcpy(x.s, av[1]);
 821 </pre>
 822 To some extent,
 823 the problems we've been having with field <TT>s</TT> are fundamental.
 824 In particular,
 825 any
 826 time you use text formats
 827 which are based on whitespace-separated ``words,''
 828 string fields which might <em>contain</em> spaces are always
 829 tricky
 830 to handle.
 831
 832 </p><hr>
 833 <p>
 834 Read sequentially:
 835 <a href="sx3.html" rev=precedes>prev</a>
 836 <a href="sx3b.html" rel=precedes>next</a>
 837 <a href="sx3.html" rev=subdocument>up</a>
 838 <a href="top.html">top</a>
 839 </p>
 840 <p>
 841 This page by <a href="http://www.eskimo.com/~scs/">Steve Summit</a>
 842 // <a href="copyright.html">Copyright</a> 1996-1999
 843 // <a href="mailto:scs@eskimo.com">mail feedback</a>
 844 </p>
 845 </body>
 846 </html>