doc/textutils.texi

   1 \input texinfo
   2 @c %**start of header
   3 @setfilename textutils.info
   4 @settitle GNU text utilities
   5 @c %**end of header
   6
   7 @include version.texi
   8
   9 @c Define new indices.
  10 @defcodeindex op
  11
  12 @c Put everything in one index (arbitrarily chosen to be the concept index).
  13 @syncodeindex fn cp
  14 @syncodeindex ky cp
  15 @syncodeindex op cp
  16 @syncodeindex pg cp
  17 @syncodeindex vr cp
  18
  19 @ifinfo
  20 @set Francois Franc,ois
  21 @end ifinfo
  22 @tex
  23 @set Francois Fran\noexpand\ptexc cois
  24 @end tex
  25
  26 @ifinfo
  27 @format
  28 START-INFO-DIR-ENTRY
  29 * Text utilities: (textutils).          GNU text utilities.
  30 * cat: (textutils)cat invocation.               Concatenate and write files.
  31 * cksum: (textutils)cksum invocation.           Print POSIX CRC checksum.
  32 * comm: (textutils)comm invocation.             Compare sorted files by line.
  33 * csplit: (textutils)csplit invocation.         Split by context.
  34 * cut: (textutils)cut invocation.               Print selected parts of lines.
  35 * expand: (textutils)expand invocation.         Convert tabs to spaces.
  36 * fmt: (textutils)fmt invocation.               Reformat paragraph text.
  37 * fold: (textutils)fold invocation.             Wrap long input lines.
  38 * head: (textutils)head invocation.             Output the first part of files.
  39 * join: (textutils)join invocation.             Join lines on a common field.
  40 * md5sum: (textutils)md5sum invocation.         Print or check message-digests.
  41 * nl: (textutils)nl invocation.                 Number lines and write files.
  42 * od: (textutils)od invocation.                 Dump files in octal, etc.
  43 * paste: (textutils)paste invocation.           Merge lines of files.
  44 * pr: (textutils)pr invocation.                 Paginate or columnate files.
  45 * sort: (textutils)sort invocation.             Sort text files.
  46 * split: (textutils)split invocation.           Split into fixed-size pieces.
  47 * sum: (textutils)sum invocation.               Print traditional checksum.
  48 * tac: (textutils)tac invocation.               Reverse files.
  49 * tail: (textutils)tail invocation.             Output the last part of files.
  50 * tr: (textutils)tr invocation.                 Translate characters.
  51 * unexpand: (textutils)unexpand invocation.     Convert spaces to tabs.
  52 * uniq: (textutils)uniq invocation.             Uniqify files.
  53 * wc: (textutils)wc invocation.                 Byte, word, and line counts.
  54 END-INFO-DIR-ENTRY
  55 @end format
  56 @end ifinfo
  57
  58 @ifinfo
  59 This file documents the GNU text utilities.
  60
  61 Copyright (C) 1994, 95 Free Software Foundation, Inc.
  62
  63 Permission is granted to make and distribute verbatim copies of
  64 this manual provided the copyright notice and this permission notice
  65 are preserved on all copies.
  66
  67 @ignore
  68 Permission is granted to process this file through TeX and print the
  69 results, provided the printed document carries copying permission
  70 notice identical to this one except for the removal of this paragraph
  71 (this paragraph not being relevant to the printed manual).
  72
  73 @end ignore
  74 Permission is granted to copy and distribute modified versions of this
  75 manual under the conditions for verbatim copying, provided that the entire
  76 resulting derived work is distributed under the terms of a permission
  77 notice identical to this one.
  78
  79 Permission is granted to copy and distribute translations of this manual
  80 into another language, under the above conditions for modified versions,
  81 except that this permission notice may be stated in a translation approved
  82 by the Foundation.
  83 @end ifinfo
  84
  85 @titlepage
  86 @title GNU @code{textutils}
  87 @subtitle A set of text utilities
  88 @subtitle for version @value{VERSION}, @value{RELEASEDATE}
  89 @author David MacKenzie et al.
  90
  91 @page
  92 @vskip 0pt plus 1filll
  93 Copyright @copyright{} 1994, 95 Free Software Foundation, Inc.
  94
  95 Permission is granted to make and distribute verbatim copies of
  96 this manual provided the copyright notice and this permission notice
  97 are preserved on all copies.
  98
  99 Permission is granted to copy and distribute modified versions of this
 100 manual under the conditions for verbatim copying, provided that the entire
 101 resulting derived work is distributed under the terms of a permission
 102 notice identical to this one.
 103
 104 Permission is granted to copy and distribute translations of this manual
 105 into another language, under the above conditions for modified versions,
 106 except that this permission notice may be stated in a translation approved
 107 by the Foundation.
 108 @end titlepage
 109
 110
 111 @ifinfo
 112 @node Top
 113 @top GNU text utilities
 114
 115 @cindex text utilities
 116 @cindex utilities for text handling
 117
 118 This manual minimally documents version @value{VERSION} of the GNU text
 119 utilities.
 120
 121 @menu
 122 * Introduction::                       Caveats, overview, and authors.
 123 * Common options::                     Common options.
 124 * Output of entire files::             cat tac nl od
 125 * Formatting file contents::           fmt pr fold
 126 * Output of parts of files::           head tail split csplit
 127 * Summarizing files::                  wc sum cksum md5sum
 128 * Operating on sorted files::          sort uniq comm
 129 * Operating on fields within a line::  cut paste join
 130 * Operating on characters::            tr expand unexpand
 131 * Opening the software toolbox::       The software tools philosophy.
 132 * Index::                              General index.
 133 @end menu
 134 @end ifinfo
 135
 136
 137 @node Introduction
 138 @chapter Introduction
 139
 140 @cindex introduction
 141
  142 This manual is incomplete: no attempt is made to explain basic concepts
  143 in a way suitable for novices.  If you are interested, please get
 144 involved in improving this manual.  The entire GNU community will
 145 benefit.
 146
 147 @cindex POSIX.2
 148 The GNU text utilities are mostly compatible with the POSIX.2 standard.
 149
 150 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
 151 @c sh-utils.texi too -- so be sure to keep them consistent.
 152 @cindex bugs, reporting
 153 Please report bugs to @samp{bug-gnu-utils@@prep.ai.mit.edu}.  Remember
 154 to include the version number, machine architecture, input files, and
 155 any other information needed to reproduce the bug: your input, what you
 156 expected, what you got, and why it is wrong.  Diffs are welcome, but
 157 please include a description of the problem as well, since this is
  158 sometimes difficult to infer.  @xref{Bugs, , , gcc, GNU CC}.
 159
 160 This manual is based on the Unix man pages in the distribution, which
 161 were originally written by David MacKenzie and updated by Jim Meyering.
 162 The original @code{fmt} man page was written by Ross Paterson.
 163 @value{Francois} Pinard did the initial conversion to Texinfo format.
 164 Karl Berry did the indexing, some reorganization, and editing of the results.
 165 Richard Stallman contributed his usual invaluable insights to the
 166 overall process.
 167
 168
 169 @node Common options
 170 @chapter Common options
 171
 172 @cindex common options
 173
  174 Certain options are available in all of these programs.  Rather than
  175 repeat identical descriptions for each program, this chapter describes
  176 them once.  (In fact, every GNU program accepts, or should accept,
  177 these options.)
 178
 179 A few of these programs take arbitrary strings as arguments.  In those
 180 cases, @samp{--help} and @samp{--version} are taken as these options
  181 only if there is exactly one command line argument.
 182
 183 @table @samp
 184
 185 @item --help
 186 @opindex --help
 187 @cindex help, online
 188 Print a usage message listing all available options, then exit successfully.
 189
 190 @item --version
 191 @opindex --version
 192 @cindex version number, finding
 193 Print the version number, then exit successfully.
 194
 195 @end table
 196
 197
 198 @node Output of entire files
 199 @chapter Output of entire files
 200
 201 @cindex output of entire files
 202 @cindex entire files, output of
 203
 204 These commands read and write entire files, possibly transforming them
 205 in some way.
 206
 207 @menu
 208 * cat invocation::              Concatenate and write files.
 209 * tac invocation::              Concatenate and write files in reverse.
 210 * nl invocation::               Number lines and write files.
 211 * od invocation::               Write files in octal or other formats.
 212 @end menu
 213
 214 @node cat invocation
 215 @section @code{cat}: Concatenate and write files
 216
 217 @pindex cat
 218 @cindex concatenate and write files
 219 @cindex copying files
 220
 221 @code{cat} copies each @var{file} (@samp{-} means standard input), or
 222 standard input if none are given, to standard output.  Synopsis:
 223
 224 @example
  225 cat [@var{option}]@dots{} [@var{file}]@dots{}
 226 @end example
 227
 228 The program accepts the following options.  Also see @ref{Common options}.
 229
 230 @table @samp
 231
 232 @item -A
 233 @itemx --show-all
 234 @opindex -A
 235 @opindex --show-all
 236 Equivalent to @samp{-vET}.
 237
 238 @item -b
 239 @itemx --number-nonblank
 240 @opindex -b
 241 @opindex --number-nonblank
 242 Number all nonblank output lines, starting with 1.
 243
 244 @item -e
 245 @opindex -e
 246 Equivalent to @samp{-vE}.
 247
 248 @item -E
 249 @itemx --show-ends
 250 @opindex -E
 251 @opindex --show-ends
 252 Display a @samp{$} after the end of each line.
 253
 254 @item -n
 255 @itemx --number
 256 @opindex -n
 257 @opindex --number
 258 Number all output lines, starting with 1.
 259
 260 @item -s
 261 @itemx --squeeze-blank
 262 @opindex -s
 263 @opindex --squeeze-blank
 264 @cindex squeezing blank lines
 265 Replace multiple adjacent blank lines with a single blank line.
 266
 267 @item -t
 268 @opindex -t
 269 Equivalent to @samp{-vT}.
 270
 271 @item -T
 272 @itemx --show-tabs
 273 @opindex -T
 274 @opindex --show-tabs
 275 Display @key{TAB} characters as @samp{^I}.
 276
 277 @item -u
 278 @opindex -u
 279 Ignored; for Unix compatibility.
 280
 281 @item -v
 282 @itemx --show-nonprinting
 283 @opindex -v
 284 @opindex --show-nonprinting
 285 Display control characters except for @key{LFD} and @key{TAB} using
 286 @samp{^} notation and precede characters that have the high bit set
 287 with @samp{M-}.
 288
 289 @end table
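
As a minimal sketch of how the display options combine, the following
command feeds a single line containing a @key{TAB} to @code{cat}.  The
sample input is made up for illustration, and @code{printf} is assumed
to be supplied by the shell; it is not one of the text utilities.

@example
printf 'a\tb\n' | cat -A
@end example

@noindent
With GNU @code{cat} this should print @samp{a^Ib$}: @samp{-T} shows the
@key{TAB} as @samp{^I}, and @samp{-E} marks the end of the line with
@samp{$}.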
 290
 291
 292 @node tac invocation
 293 @section @code{tac}: Concatenate and write files in reverse
 294
 295 @pindex tac
 296 @cindex reversing files
 297
 298 @code{tac} copies each @var{file} (@samp{-} means standard input), or
 299 standard input if none are given, to standard output, reversing the
 300 records (lines by default) in each separately.  Synopsis:
 301
 302 @example
 303 tac [@var{option}]@dots{} [@var{file}]@dots{}
 304 @end example
 305
 306 @dfn{Records} are separated by instances of a string (newline by
 307 default).  By default, this separator string is attached to the end of
 308 the record that it follows in the file.
 309
 310 The program accepts the following options.  Also see @ref{Common options}.
 311
 312 @table @samp
 313
 314 @item -b
 315 @itemx --before
 316 @opindex -b
 317 @opindex --before
 318 The separator is attached to the beginning of the record that it
 319 precedes in the file.
 320
 321 @item -r
 322 @itemx --regex
 323 @opindex -r
 324 @opindex --regex
 325 Treat the separator string as a regular expression.
 326
 327 @item -s @var{separator}
 328 @itemx --separator=@var{separator}
 329 @opindex -s
 330 @opindex --separator
 331 Use @var{separator} as the record separator, instead of newline.
 332
 333 @end table
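
The following sketch illustrates the default record handling and the
@samp{-s} option.  The input strings are invented for the example, and
@code{printf} is assumed to come from the shell.

@example
printf 'one\ntwo\nthree\n' | tac
printf 'a,b,c,' | tac -s ,
@end example

@noindent
The first command should print the three lines in the order
@samp{three}, @samp{two}, @samp{one}.  In the second, each comma stays
attached to the record it follows, so the output should be
@samp{c,b,a,}.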
 334
 335
 336 @node nl invocation
 337 @section @code{nl}: Number lines and write files
 338
 339 @pindex nl
 340 @cindex numbering lines
 341 @cindex line numbering
 342
 343 @code{nl} writes each @var{file} (@samp{-} means standard input), or
 344 standard input if none are given, to standard output, with line numbers
 345 added to some or all of the lines.  Synopsis:
 346
 347 @example
 348 nl [@var{option}]@dots{} [@var{file}]@dots{}
 349 @end example
 350
 351 @cindex logical pages, numbering on
 352 @code{nl} decomposes its input into (logical) pages; by default, the
 353 line number is reset to 1 at the top of each logical page.  @code{nl}
 354 treats all of the input files as a single document; it does not reset
 355 line numbers or logical pages between files.
 356
 357 @cindex headers, numbering
 358 @cindex body, numbering
 359 @cindex footers, numbering
 360 A logical page consists of three sections: header, body, and footer.
 361 Any of the sections can be empty.  Each can be numbered in a different
 362 style from the others.
 363
 364 The beginnings of the sections of logical pages are indicated in the
 365 input file by a line containing exactly one of these delimiter strings:
 366
 367 @table @samp
 368 @item \:\:\:
 369 start of header;
 370 @item \:\:
 371 start of body;
 372 @item \:
 373 start of footer.
 374 @end table
 375
 376 The two characters from which these strings are made can be changed from
 377 @samp{\} and @samp{:} via options (see below), but the pattern and
 378 length of each string cannot be changed.
 379
 380 A section delimiter is replaced by an empty line on output.  Any text
 381 that comes before the first section delimiter string in the input file
 382 is considered to be part of a body section, so @code{nl} treats a
 383 file that contains no section delimiters as a single body section.
 384
 385 The program accepts the following options.  Also see @ref{Common options}.
 386
 387 @table @samp
 388
 389 @item -b @var{style}
 390 @itemx --body-numbering=@var{style}
 391 @opindex -b
 392 @opindex --body-numbering
 393 Select the numbering style for lines in the body section of each
 394 logical page.  When a line is not numbered, the current line number
 395 is not incremented, but the line number separator character is still
 396 prepended to the line.  The styles are:
 397
 398 @table @samp
 399 @item a
 400 number all lines,
 401 @item t
 402 number only nonempty lines (default for body),
 403 @item n
 404 do not number lines (default for header and footer),
 405 @item p@var{regexp}
 406 number only lines that contain a match for @var{regexp}.
 407 @end table
 408
 409 @item -d @var{cd}
 410 @itemx --section-delimiter=@var{cd}
 411 @opindex -d
 412 @opindex --section-delimiter
 413 @cindex section delimiters of pages
 414 Set the section delimiter characters to @var{cd}; default is
  415 @samp{\:}.  If only @var{c} is given, the second remains @samp{:}.
 416 (Remember to protect @samp{\} or other metacharacters from shell
 417 expansion with quotes or extra backslashes.)
 418
 419 @item -f @var{style}
 420 @itemx --footer-numbering=@var{style}
 421 @opindex -f
 422 @opindex --footer-numbering
 423 Analogous to @samp{--body-numbering}.
 424
 425 @item -h @var{style}
 426 @itemx --header-numbering=@var{style}
 427 @opindex -h
 428 @opindex --header-numbering
 429 Analogous to @samp{--body-numbering}.
 430
 431 @item -i @var{number}
 432 @itemx --page-increment=@var{number}
 433 @opindex -i
 434 @opindex --page-increment
 435 Increment line numbers by @var{number} (default 1).
 436
 437 @item -l @var{number}
 438 @itemx --join-blank-lines=@var{number}
 439 @opindex -l
 440 @opindex --join-blank-lines
 441 @cindex empty lines, numbering
 442 @cindex blank lines, numbering
 443 Consider @var{number} (default 1) consecutive empty lines to be one
 444 logical line for numbering, and only number the last one.  Where fewer
 445 than @var{number} consecutive empty lines occur, do not number them.
 446 An empty line is one that contains no characters, not even spaces
 447 or tabs.
 448
 449 @item -n @var{format}
 450 @itemx --number-format=@var{format}
 451 @opindex -n
 452 @opindex --number-format
 453 Select the line numbering format (default is @code{rn}):
 454
 455 @table @samp
 456 @item ln
 457 @opindex ln @r{format for @code{nl}}
 458 left justified, no leading zeros;
 459 @item rn
 460 @opindex rn @r{format for @code{nl}}
 461 right justified, no leading zeros;
 462 @item rz
 463 @opindex rz @r{format for @code{nl}}
 464 right justified, leading zeros.
 465 @end table
 466
 467 @item -p
 468 @itemx --no-renumber
 469 @opindex -p
 470 @opindex --no-renumber
 471 Do not reset the line number at the start of a logical page.
 472
 473 @item -s @var{string}
 474 @itemx --number-separator=@var{string}
 475 @opindex -s
 476 @opindex --number-separator
 477 Separate the line number from the text line in the output with
 478 @var{string} (default is @key{TAB}).
 479
 480 @item -v @var{number}
 481 @itemx --first-page=@var{number}
 482 @opindex -v
 483 @opindex --first-page
 484 Set the initial line number on each logical page to @var{number} (default 1).
 485
 486 @item -w @var{number}
 487 @itemx --number-width=@var{number}
 488 @opindex -w
 489 @opindex --number-width
 490 Use @var{number} characters for line numbers (default 6).
 491
 492 @end table
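
As a rough illustration of the numbering options, the command below
numbers every line of a three-line input, including the empty one.  The
input text, the width of 3, and the separator @samp{: } are arbitrary
choices made for this sketch; @code{printf} is assumed to come from the
shell.

@example
printf 'first\n\nthird\n' | nl -b a -w 3 -s ': '
@end example

@noindent
Each output line should begin with a number, right justified in a
3-character field and followed by @samp{: }, so the lines are numbered
1, 2, and 3.  Without @samp{-b a}, the default body style @samp{t}
would leave the empty line unnumbered.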
 493
 494
 495 @node od invocation
 496 @section @code{od}: Write files in octal or other formats
 497
 498 @pindex od
 499 @cindex octal dump of files
 500 @cindex hex dump of files
 501 @cindex ASCII dump of files
 502 @cindex file contents, dumping unambiguously
 503
 504 @code{od} writes an unambiguous representation of each @var{file}
 505 (@samp{-} means standard input), or standard input if none are given.
 506 Synopsis:
 507
 508 @example
 509 od [@var{option}]@dots{} [@var{file}]@dots{}
 510 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
 511 @end example
 512
 513 Each line of output consists of the offset in the input, followed by
  514 groups of data from the file.  By default, @code{od} prints the offset in
 515 octal, and each group of file data is two bytes of input printed as a
 516 single octal number.
 517
 518 The program accepts the following options.  Also see @ref{Common options}.
 519
 520 @table @samp
 521
 522 @item -A @var{radix}
 523 @itemx --address-radix=@var{radix}
 524 @opindex -A
 525 @opindex --address-radix
 526 @cindex radix for file offsets
 527 @cindex file offset radix
 528 Select the base in which file offsets are printed.  @var{radix} can
 529 be one of the following:
 530
 531 @table @samp
 532 @item d
 533 decimal;
 534 @item o
 535 octal;
 536 @item x
 537 hexadecimal;
 538 @item n
 539 none (do not print offsets).
 540 @end table
 541
 542 The default is octal.
 543
 544 @item -j @var{bytes}
 545 @itemx --skip-bytes=@var{bytes}
 546 @opindex -j
 547 @opindex --skip-bytes
 548 Skip @var{bytes} input bytes before formatting and writing.  If
 549 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
 550 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
 551 in decimal.  Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
 552 by 1024, and @samp{m} by 1048576.
 553
 554 @item -N @var{bytes}
 555 @itemx --read-bytes=@var{bytes}
 556 @opindex -N
 557 @opindex --read-bytes
 558 Output at most @var{bytes} bytes of the input.  Prefixes and suffixes on
  559 @var{bytes} are interpreted as for the @samp{-j} option.
 560
 561 @item -s [@var{n}]
 562 @itemx --strings[=@var{n}]
 563 @opindex -s
 564 @opindex --strings
 565 @cindex string constants, outputting
 566 Instead of the normal output, output only @dfn{string constants}: at
 567 least @var{n} (3 by default) consecutive ASCII graphic characters,
 568 followed by a null (zero) byte.
 569
 570 @item -t @var{type}
 571 @itemx --format=@var{type}
 572 @opindex -t
 573 @opindex --format
 574 Select the format in which to output the file data.  @var{type} is a
 575 string of one or more of the below type indicator characters.  If you
 576 include more than one type indicator character in a single @var{type}
 577 string, or use this option more than once, @code{od} writes one copy
 578 of each output line using each of the data types that you specified,
 579 in the order that you specified.
 580
 581 @table @samp
 582 @item a
 583 named character,
 584 @item c
 585 ASCII character or backslash escape,
 586 @item d
 587 signed decimal,
 588 @item f
 589 floating point,
 590 @item o
 591 octal,
 592 @item u
 593 unsigned decimal,
 594 @item x
 595 hexadecimal.
 596 @end table
 597
 598 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
 599 newline, and @samp{nul} for a null (zero) byte.  Type @code{c} outputs
  600 @samp{ }, @samp{\n}, and @samp{\0}, respectively.
 601
 602 @cindex type size
 603 Except for types @samp{a} and @samp{c}, you can specify the number
 604 of bytes to use in interpreting each number in the given data type
 605 by following the type indicator character with a decimal integer.
  606 Alternatively, you can specify the size of one of the C compiler's
 607 built-in data types by following the type indicator character with
 608 one of the following characters.  For integers (@samp{d}, @samp{o},
 609 @samp{u}, @samp{x}):
 610
 611 @table @samp
 612 @item C
 613 char,
 614 @item S
 615 short,
 616 @item I
 617 int,
 618 @item L
 619 long.
 620 @end table
 621
 622 For floating point (@code{f}):
 623
 624 @table @asis
 625 @item F
 626 float,
 627 @item D
 628 double,
 629 @item L
 630 long double.
 631 @end table
 632
 633 @item -v
 634 @itemx --output-duplicates
 635 @opindex -v
 636 @opindex --output-duplicates
 637 Output consecutive lines that are identical.  By default, when two or
 638 more consecutive output lines would be identical, @code{od} outputs only
 639 the first line, and puts just an asterisk on the following line to
 640 indicate the elision.
 641
 642 @item -w[@var{n}]
 643 @itemx --width[=@var{n}]
 644 @opindex -w
 645 @opindex --width
  646 Dump @var{n} input bytes per output line.  This must be a multiple of
 647 the least common multiple of the sizes associated with the specified
 648 output types.  If @var{n} is omitted, the default is 32.  If this option
 649 is not given at all, the default is 16.
 650
 651 @end table
 652
 653 The next several options map the old, pre-POSIX format specification
 654 options to the corresponding POSIX format specs.  GNU @code{od} accepts
 655 any combination of old- and new-style options.  Format specification
 656 options accumulate.
 657
 658 @table @samp
 659
 660 @item -a
 661 @opindex -a
 662 Output as named characters.  Equivalent to @samp{-ta}.
 663
 664 @item -b
 665 @opindex -b
 666 Output as octal bytes.  Equivalent to @samp{-toC}.
 667
 668 @item -c
 669 @opindex -c
 670 Output as ASCII characters or backslash escapes.  Equivalent to
 671 @samp{-tc}.
 672
 673 @item -d
 674 @opindex -d
 675 Output as unsigned decimal shorts.  Equivalent to @samp{-tu2}.
 676
 677 @item -f
 678 @opindex -f
 679 Output as floats.  Equivalent to @samp{-tfF}.
 680
 681 @item -h
 682 @opindex -h
 683 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 684
 685 @item -i
 686 @opindex -i
 687 Output as decimal shorts.  Equivalent to @samp{-td2}.
 688
 689 @item -l
 690 @opindex -l
 691 Output as decimal longs.  Equivalent to @samp{-td4}.
 692
 693 @item -o
 694 @opindex -o
 695 Output as octal shorts.  Equivalent to @samp{-to2}.
 696
 697 @item -x
 698 @opindex -x
 699 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 700
 701 @item -C
 702 @itemx --traditional
 703 @opindex --traditional
 704 Recognize the pre-POSIX non-option arguments that traditional @code{od}
 705 accepted.  The following syntax:
 706
 707 @example
 708 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
 709 @end example
 710
 711 @noindent
 712 can be used to specify at most one file and optional arguments
 713 specifying an offset and a pseudo-start address, @var{label}.  By
 714 default, @var{offset} is interpreted as an octal number specifying how
 715 many input bytes to skip before formatting and writing.  The optional
 716 trailing decimal point forces the interpretation of @var{offset} as a
  717 decimal number.  If no decimal point is specified and the offset begins
  718 with @samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number.  If
 719 there is a trailing @samp{b}, the number of bytes skipped will be
 720 @var{offset} multiplied by 512.  The @var{label} argument is interpreted
 721 just like @var{offset}, but it specifies an initial pseudo-address.  The
 722 pseudo-addresses are displayed in parentheses following any normal
 723 address.
 724
 725 @end table
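
The following sketch shows the POSIX-style @samp{-A} and @samp{-t}
options on a three-byte input.  The input is invented for the example,
and @code{printf} is assumed to come from the shell.

@example
printf 'AB\n' | od -A n -t x1
printf 'AB\n' | od -A x -t x1 -t c
@end example

@noindent
The first command suppresses offsets and should print the bytes as
@samp{41 42 0a}.  The second prints hexadecimal offsets and, because
@samp{-t} is given twice, writes each output line once as hexadecimal
bytes and once as characters.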
 726
 727
 728 @node Formatting file contents
 729 @chapter Formatting file contents
 730
 731 @cindex formatting file contents
 732
 733 These commands reformat the contents of files.
 734
 735 @menu
 736 * fmt invocation::              Reformat paragraph text.
 737 * pr invocation::               Paginate or columnate files for printing.
 738 * fold invocation::             Wrap input lines to fit in specified width.
 739 @end menu
 740
 741
 742 @node fmt invocation
 743 @section @code{fmt}: Reformat paragraph text
 744
 745 @pindex fmt
 746 @cindex reformatting paragraph text
 747 @cindex paragraphs, reformatting
 748 @cindex text, reformatting
 749
 750 @code{fmt} fills and joins lines to produce output lines of (at most)
 751 a given number of characters (75 by default).  Synopsis:
 752
 753 @example
 754 fmt [@var{option}]@dots{} [@var{file}]@dots{}
 755 @end example
 756
 757 @code{fmt} reads from the specified @var{file} arguments (or standard
 758 input if none are given), and writes to standard output.
 759
 760 By default, blank lines, spaces between words, and indentation are
 761 preserved in the output; successive input lines with different
 762 indentation are not joined; tabs are expanded on input and introduced on
 763 output.
 764
 765 @cindex line-breaking
 766 @cindex sentences and line-breaking
 767 @cindex Knuth, Donald E.
 768 @cindex Plass, Michael F.
 769 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
 770 avoid line breaks after the first word of a sentence or before the last
 771 word of a sentence.  A @dfn{sentence break} is defined as either the end
 772 of a paragraph or a word ending in any of @samp{.?!}, followed by two
 773 spaces or end of line, ignoring any intervening parentheses or quotes.
 774 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
 775 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
 776 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
 777 and Experience}, 11 (1981), 1119--1184).
 778
 779 The program accepts the following options.  Also see @ref{Common options}.
 780
 781 @table @samp
 782
 783 @item -c
 784 @itemx --crown-margin
 785 @opindex -c
 786 @opindex --crown-margin
 787 @cindex crown margin
 788 @dfn{Crown margin} mode: preserve the indentation of the first two
 789 lines within a paragraph, and align the left margin of each subsequent
 790 line with that of the second line.
 791
 792 @item -t
 793 @itemx --tagged-paragraph
 794 @opindex -t
 795 @opindex --tagged-paragraph
 796 @cindex tagged paragraphs
 797 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
 798 indentation of the first line of a paragraph is the same as the
 799 indentation of the second, the first line is treated as a one-line
 800 paragraph.
 801
 802 @item -s
 803 @itemx --split-only
 804 @opindex -s
 805 @opindex --split-only
 806 Split lines only.  Do not join short lines to form longer ones.  This
 807 prevents sample lines of code, and other such ``formatted'' text from
 808 being unduly combined.
 809
 810 @item -u
 811 @itemx --uniform-spacing
 812 @opindex -u
 813 @opindex --uniform-spacing
 814 Uniform spacing.  Reduce spacing between words to one space, and spacing
 815 between sentences to two spaces.
 816
 817 @item -@var{width}
 818 @itemx -w @var{width}
 819 @itemx --width=@var{width}
 820 @opindex -@var{width}
 821 @opindex -w
 822 @opindex --width
 823 Fill output lines up to @var{width} characters (default 75).  @code{fmt}
 824 initially tries to make lines about 7% shorter than this, to give it
 825 room to balance line lengths.
 826
 827 @item -p @var{prefix}
 828 @itemx --prefix=@var{prefix}
 829 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
  830 are subject to formatting.  The prefix and any preceding whitespace are
 831 stripped for the formatting and then re-attached to each formatted output
 832 line.  One use is to format certain kinds of program comments, while
 833 leaving the code unchanged.
 834
 835 @end table
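
As a small sketch of the width option, the command below refills one
long line so that no output line exceeds 30 characters.  The sentence
and the width are arbitrary choices for illustration, and @code{printf}
is assumed to come from the shell.

@example
printf 'The quick brown fox jumps over the lazy dog and keeps running.\n' | fmt -w 30
@end example

@noindent
The exact line breaks depend on how @code{fmt} balances line lengths, so
they are not shown here; only the upper bound of 30 characters per line
is something to rely on.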
 836
 837
 838 @node pr invocation
 839 @section @code{pr}: Paginate or columnate files for printing
 840
 841 @pindex pr
 842 @cindex printing, preparing files for
 843 @cindex multicolumn output, generating
 844
 845 @code{pr} writes each @var{file} (@samp{-} means standard input), or
 846 standard input if none are given, to standard output, paginating and
 847 optionally outputting in multicolumn format.  Synopsis:
 848
 849 @example
 850 pr [@var{option}]@dots{} [@var{file}]@dots{}
 851 @end example
 852
 853 By default, a 5-line header is printed: two blank lines; a line with the
 854 date, the file name, and the page count; and two more blank lines.  A
  855 5-line footer (entirely blank) is also printed.
 856
 857 Form feeds in the input cause page breaks in the output.
 858
 859 The program accepts the following options.  Also see @ref{Common options}.
 860
 861 @table @samp
 862
 863 @item +@var{page}
 864 Begin printing with page @var{page}.
 865
 866 @item -@var{column}
 867 @opindex -@var{column}
 868 Produce @var{column}-column output and print columns down.  The column
 869 width is automatically decreased as @var{column} increases; unless you
 870 use the @samp{-w} option to increase the page width as well, this option
 871 might well cause some input to be truncated.
 872
 873 @item -a
 874 @opindex -a
 875 @cindex across columns
 876 Print columns across rather than down.
 877
 878 @item -b
 879 @opindex -b
 880 @cindex balancing columns
 881 Balance columns on the last page.
 882
 883 @item -c
 884 @opindex -c
 885 Print control characters using hat notation (e.g., @samp{^G}); print
 886 other unprintable characters in octal backslash notation.  By default,
 887 unprintable characters are not changed.
 888
 889 @item -d
 890 @opindex -d
 891 @cindex double spacing
 892 Double space the output.
 893
 894 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
 895 @opindex -e
 896 @cindex input tabs
 897 Expand tabs to spaces on input.  Optional argument @var{in-tabchar} is
 898 the input tab character (default is @key{TAB}).  Second optional
 899 argument @var{in-tabwidth} is the input tab character's width (default
 900 is 8).
 901
 902 @item -f
 903 @itemx -F
 904 @opindex -F
 905 @opindex -f
 906 Use a formfeed instead of newlines to separate output pages.
 907
 908 @item -h @var{header}
 909 @opindex -h
 910 Replace the file name in the header with the string @var{header}.
 911
 912 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
 913 @opindex -i
 914 @cindex output tabs
 915 Replace spaces with tabs on output.  Optional argument @var{out-tabchar}
 916 is the output tab character (default is @key{TAB}).  Second optional
 917 argument @var{out-tabwidth} is the output tab character's width (default
 918 is 8).
 919
 920 @item -l @var{n}
 921 @opindex -l
 922 Set the page length to @var{n} (default 66) lines.  If @var{n} is less
 923 than 10, the headers and footers are omitted, as if the @samp{-t} option
 924 had been given.
 925
 926 @item -m
 927 @opindex -m
 928 Print all files in parallel, one in each column.
 929
 930 @item -n[@var{number-separator}[@var{digits}]]
 931 @opindex -n
 932 Precede each column with a line number; with parallel files (@samp{-m}),
 933 precede each line with a line number.  Optional argument
 934 @var{number-separator} is the character to print after each number
 935 (default is @key{TAB}).  Optional argument @var{digits} is the number of
 936 digits per line number (default is 5).
 937
 938 @item -o @var{n}
 939 @opindex -o
 940 @cindex indenting lines
 941 @cindex left margin
Indent each line with a margin @var{n} spaces wide (default is zero),
i.e., set the left margin.  The total page width is @var{n} plus the
width set with the @samp{-w} option.
 945
 946 @item -r
 947 @opindex -r
 948 Do not print a warning message when an argument @var{file} cannot be
 949 opened.  (The exit status will still be nonzero, however.)
 950
 951 @item -s[@var{c}]
 952 @opindex -s
 953 Separate columns by the single character @var{c}.  If @var{c} is
 954 omitted, the default is space; if this option is omitted altogether, the
 955 default is @key{TAB}.
 956
 957 @item -t
 958 @opindex -t
 959 Do not print the usual 5-line header and the 5-line footer on each page,
 960 and do not fill out the bottoms of pages (with blank lines or
 961 formfeeds).
 962
 963 @item -v
 964 @opindex -v
 965 Print unprintable characters in octal backslash notation.
 966
 967 @item -w @var{n}
 968 @opindex -w
 969 Set the page width to @var{n} (default is 72) columns.
 970
 971 @end table
 972
 973
 974 @node fold invocation
 975 @section @code{fold}: Wrap input lines to fit in specified width
 976
 977 @pindex fold
 978 @cindex wrapping long input lines
 979 @cindex folding long input lines
 980
 981 @code{fold} writes each @var{file} (@samp{-} means standard input), or
 982 standard input if none are given, to standard output, breaking long
 983 lines.  Synopsis:
 984
 985 @example
 986 fold [@var{option}]@dots{} [@var{file}]@dots{}
 987 @end example
 988
By default, @code{fold} breaks lines wider than 80 columns.  The output
 990 is split into as many lines as necessary.
 991
 992 @cindex screen columns
 993 @code{fold} counts screen columns by default; thus, a tab may count more
 994 than one column, backspace decreases the column count, and carriage
 995 return sets the column to zero.
 996
 997 The program accepts the following options.  Also see @ref{Common options}.
 998
 999 @table @samp
1000
1001 @item -b
1002 @itemx --bytes
1003 @opindex -b
1004 @opindex --bytes
1005 Count bytes rather than columns, so that tabs, backspaces, and carriage
1006 returns are each counted as taking up one column, just like other
1007 characters.
1008
1009 @item -s
1010 @itemx --spaces
1011 @opindex -s
1012 @opindex --spaces
1013 Break at word boundaries: the line is broken after the last blank before
1014 the maximum line length.  If the line contains no such blanks, the line
1015 is broken at the maximum line length as usual.
1016
1017 @item -w @var{width}
1018 @itemx --width=@var{width}
1019 @opindex -w
1020 @opindex --width
1021 Use a maximum line length of @var{width} columns instead of 80.
1022
1023 @end table
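
As a small illustration of the difference @samp{-s} makes (a sketch; the
sample text is arbitrary), compare the two commands below.  The first
breaks the line at exactly 12 columns, in the middle of a word; the
second breaks after the last blank that fits.

@example
# Two output lines: "the quick br" and "own fox"
printf 'the quick brown fox\n' | fold -w 12

# Two output lines: "the quick " (trailing blank kept) and "brown fox"
printf 'the quick brown fox\n' | fold -s -w 12
@end example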
1024
1025
1026 @node Output of parts of files
1027 @chapter Output of parts of files
1028
1029 @cindex output of parts of files
1030 @cindex parts of files, output of
1031
1032 These commands output pieces of the input.
1033
1034 @menu
1035 * head invocation::             Output the first part of files.
1036 * tail invocation::             Output the last part of files.
1037 * split invocation::            Split a file into fixed-size pieces.
1038 * csplit invocation::           Split a file into context-determined pieces.
1039 @end menu
1040
1041 @node head invocation
1042 @section @code{head}: Output the first part of files
1043
1044 @pindex head
1045 @cindex initial part of files, outputting
1046 @cindex first part of files, outputting
1047
1048 @code{head} prints the first part (10 lines by default) of each
1049 @var{file}; it reads from standard input if no files are given or
1050 when given a @var{file} of @samp{-}.  Synopses:
1051
1052 @example
1053 head [@var{option}]@dots{} [@var{file}]@dots{}
1054 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1055 @end example
1056
If more than one @var{file} is specified, @code{head} prints a
1058 one-line header consisting of
1059 @example
1060 ==> @var{file name} <==
1061 @end example
1062 @noindent
1063 before the output for each @var{file}.
1064
1065 @code{head} accepts two option formats: the new one, in which numbers
1066 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1067 the number precedes any option letters (@samp{-1q}).
1068
1069 The program accepts the following options.  Also see @ref{Common options}.
1070
1071 @table @samp
1072
1073 @item -@var{count}@var{options}
1074 @opindex -@var{count}
1075 This option is only recognized if it is specified first.  @var{count} is
1076 a decimal number optionally followed by a size letter (@samp{b},
1077 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1078 or other option letters (@samp{cqv}).
1079
1080 @item -c @var{bytes}
1081 @itemx --bytes=@var{bytes}
1082 @opindex -c
1083 @opindex --bytes
1084 Print the first @var{bytes} bytes, instead of initial lines.  Appending
1085 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1086 by 1048576.
1087
@item -n @var{n}
1089 @itemx --lines=@var{n}
1090 @opindex -n
1091 @opindex --lines
1092 Output the first @var{n} lines.
1093
1094 @item -q
1095 @itemx --quiet
1096 @itemx --silent
1097 @opindex -q
1098 @opindex --quiet
1099 @opindex --silent
1100 Never print file name headers.
1101
1102 @item -v
1103 @itemx --verbose
1104 @opindex -v
1105 @opindex --verbose
1106 Always print file name headers.
1107
1108 @end table
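
For instance, the following sketch (with made-up three-line input) uses
both option formats described above; each command should print the
first two lines, @samp{1} and @samp{2}.

@example
printf '1\n2\n3\n' | head -n 2     # new format
printf '1\n2\n3\n' | head -2       # old format
@end example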
1109
1110
1111 @node tail invocation
1112 @section @code{tail}: Output the last part of files
1113
1114 @pindex tail
1115 @cindex last part of files, outputting
1116
1117 @code{tail} prints the last part (10 lines by default) of each
1118 @var{file}; it reads from standard input if no files are given or
1119 when given a @var{file} of @samp{-}.  Synopses:
1120
1121 @example
1122 tail [@var{option}]@dots{} [@var{file}]@dots{}
1123 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1124 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1125 @end example
1126
1127 If more than one @var{file} is specified, @code{tail} prints a
1128 one-line header consisting of
1129 @example
1130 ==> @var{file name} <==
1131 @end example
1132 @noindent
1133 before the output for each @var{file}.
1134
1135 @cindex BSD @code{tail}
1136 GNU @code{tail} can output any amount of data (some other versions of
1137 @code{tail} cannot).  It also has no @samp{-r} option (print in
1138 reverse), since reversing a file is really a different job from printing
1139 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1140 only reverse files that are at most as large as its buffer, which is
1141 typically 32k.  A more reliable and versatile way to reverse files is
1142 the GNU @code{tac} command.
1143
1144 @code{tail} accepts two option formats: the new one, in which numbers
1145 are arguments to the options (@samp{-n 1}), and the old one, in which
1146 the number precedes any option letters (@samp{-1} or @samp{+1}).
1147
1148 If any option-argument is a number @var{n} starting with a @samp{+},
1149 @code{tail} begins printing with the @var{n}th item from the start of
1150 each file, instead of from the end.
1151
1152 The program accepts the following options.  Also see @ref{Common options}.
1153
1154 @table @samp
1155
1156 @item -@var{count}
1157 @itemx +@var{count}
1158 @opindex -@var{count}
1159 @opindex +@var{count}
1160 This option is only recognized if it is specified first.  @var{count} is
1161 a decimal number optionally followed by a size letter (@samp{b},
1162 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1163 or other option letters (@samp{cfqv}).
1164
1165 @item -c @var{bytes}
1166 @itemx --bytes=@var{bytes}
1167 @opindex -c
1168 @opindex --bytes
1169 Output the last @var{bytes} bytes, instead of final lines.  Appending
1170 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1171 by 1048576.
1172
1173 @item -f
1174 @itemx --follow
1175 @opindex -f
1176 @opindex --follow
1177 @cindex growing files
1178 Loop forever trying to read more characters at the end of the file,
1179 presumably because the file is growing.  Ignored if reading from a pipe.
1180 If more than one file is given, @code{tail} prints a header whenever it
1181 gets output from a different file, to indicate which file that output is
1182 from.
1183
@item -n @var{n}
1185 @itemx --lines=@var{n}
1186 @opindex -n
1187 @opindex --lines
1188 Output the last @var{n} lines.
1189
1190 @item -q
@itemx --quiet
1192 @itemx --silent
1193 @opindex -q
1194 @opindex --quiet
1195 @opindex --silent
1196 Never print file name headers.
1197
1198 @item -v
1199 @itemx --verbose
1200 @opindex -v
1201 @opindex --verbose
1202 Always print file name headers.
1203
1204 @end table
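
As a brief illustration (a sketch with arbitrary four-line input), the
first command below counts from the end of the input, while the second,
because of the leading @samp{+}, counts from the start.  With this
input both should print the lines @samp{3} and @samp{4}.

@example
printf '1\n2\n3\n4\n' | tail -n 2      # the last two lines
printf '1\n2\n3\n4\n' | tail -n +3     # from the third line onward
@end example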
1205
1206
1207 @node split invocation
1208 @section @code{split}: Split a file into fixed-size pieces
1209
1210 @pindex split
1211 @cindex splitting a file into pieces
1212 @cindex pieces, splitting a file into
1213
1214 @code{split} creates output files containing consecutive sections of
1215 @var{input} (standard input if none is given or @var{input} is
1216 @samp{-}).  Synopsis:
1217
1218 @example
1219 split [@var{option}] [@var{input} [@var{prefix}]]
1220 @end example
1221
By default, @code{split} puts 1000 lines of @var{input} (or whatever is
left over for the last section) into each output file.
1224
1225 @cindex output file name prefix
1226 The output files' names consist of @var{prefix} (@samp{x} by default)
1227 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1228 that concatenating the output files in sorted order by file name produces
1229 the original input file.  (If more than 676 output files are required,
1230 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
1231
1232 The program accepts the following options.  Also see @ref{Common options}.
1233
1234 @table @samp
1235
1236 @item -@var{lines}
1237 @itemx -l @var{lines}
1238 @itemx --lines=@var{lines}
1239 @opindex -l
1240 @opindex --lines
1241 Put @var{lines} lines of @var{input} into each output file.
1242
1243 @item -b @var{bytes}
1244 @itemx --bytes=@var{bytes}
1245 @opindex -b
1246 @opindex --bytes
Put @var{bytes} bytes of @var{input} into each output file.
1248 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1249 @samp{m} by 1048576.
1250
1251 @item -C @var{bytes}
1252 @itemx --line-bytes=@var{bytes}
1253 @opindex -C
1254 @opindex --line-bytes
1255 Put into each output file as many complete lines of @var{input} as
1256 possible without exceeding @var{bytes} bytes.  For lines longer than
1257 @var{bytes} bytes, put @var{bytes} bytes into each output file until
fewer than @var{bytes} bytes of the line are left, then continue
1259 normally.  @var{bytes} has the same format as for the @samp{--bytes}
1260 option.
1261
1262 @end table
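
A minimal sketch of splitting by lines, assuming a five-line file with
the hypothetical name @file{input.txt} and the hypothetical prefix
@samp{part.}:

@example
printf '1\n2\n3\n4\n5\n' > input.txt
split -l 2 input.txt part.         # creates part.aa, part.ab, part.ac
cat part.aa part.ab part.ac        # reproduces input.txt
@end example

@noindent
Here @file{part.aa} and @file{part.ab} should hold two lines each, and
@file{part.ac} the one line left over.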
1263
1264
1265 @node csplit invocation
1266 @section @code{csplit}: Split a file into context-determined pieces
1267
1268 @pindex csplit
1269 @cindex context splitting
1270 @cindex splitting a file into pieces by context
1271
1272 @code{csplit} creates zero or more output files containing sections of
1273 @var{input} (standard input if @var{input} is @samp{-}).  Synopsis:
1274
1275 @example
1276 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1277 @end example
1278
1279 The contents of the output files are determined by the @var{pattern}
1280 arguments, as detailed below.  An error occurs if a @var{pattern}
1281 argument refers to a nonexistent line of the input file (e.g., if no
1282 remaining line matches a given regular expression).  After every
1283 @var{pattern} has been matched, any remaining input is copied into one
1284 last output file.
1285
1286 By default, @code{csplit} prints the number of bytes written to each
1287 output file after it has been created.
1288
1289 The types of pattern arguments are:
1290
1291 @table @samp
1292
1293 @item @var{n}
1294 Create an output file containing the input up to but not including line
1295 @var{n} (a positive integer).  If followed by a repeat count, also
create an output file containing the next @var{n} lines of the input
1297 file once for each repeat.
1298
1299 @item /@var{regexp}/[@var{offset}]
1300 Create an output file containing the current line up to (but not
1301 including) the next line of the input file that contains a match for
1302 @var{regexp}.  The optional @var{offset} is a @samp{+} or @samp{-}
1303 followed by a positive integer.  If it is given, the input up to the
1304 matching line plus or minus @var{offset} is put into the output file,
1305 and the line after that begins the next section of input.
1306
1307 @item %@var{regexp}%[@var{offset}]
1308 Like the previous type, except that it does not create an output
1309 file, so that section of the input file is effectively ignored.
1310
1311 @item @{@var{repeat-count}@}
1312 Repeat the previous pattern @var{repeat-count} additional
times.  @var{repeat-count} can be either a positive integer or an
asterisk, meaning repeat as many times as necessary until the input is
exhausted.
1316
1317 @end table
1318
1319 The output files' names consist of a prefix (@samp{xx} by default)
1320 followed by a suffix.  By default, the suffix is an ascending sequence
1321 of two-digit decimal numbers from @samp{00} and up to @samp{99}.  In any
1322 case, concatenating the output files in sorted order by filename
1323 produces the original input file.
1324
1325 By default, if @code{csplit} encounters an error or receives a hangup,
1326 interrupt, quit, or terminate signal, it removes any output files
1327 that it has created so far before it exits.
1328
1329 The program accepts the following options.  Also see @ref{Common options}.
1330
1331 @table @samp
1332
1333 @item -f @var{prefix}
1334 @itemx --prefix=@var{prefix}
1335 @opindex -f
1336 @opindex --prefix
1337 @cindex output file name prefix
1338 Use @var{prefix} as the output file name prefix.
1339
1340 @item -b @var{suffix}
1341 @itemx --suffix=@var{suffix}
1342 @opindex -b
1343 @opindex --suffix
1344 @cindex output file name suffix
1345 Use @var{suffix} as the output file name suffix.  When this option is
1346 specified, the suffix string must include exactly one
1347 @code{printf(3)}-style conversion specification, possibly including
format specification flags, a field width, a precision specification,
1349 or all of these kinds of modifiers.  The format letter must convert a
1350 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1351 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed.  The
1352 entire @var{suffix} is given (with the current output file number) to
1353 @code{sprintf(3)} to form the file name suffixes for each of the
1354 individual output files in turn.  If this option is used, the
1355 @samp{--digits} option is ignored.
1356
1357 @item -n @var{digits}
1358 @itemx --digits=@var{digits}
1359 @opindex -n
1360 @opindex --digits
1361 Use output file names containing numbers that are @var{digits} digits
1362 long instead of the default 2.
1363
1364 @item -k
1365 @itemx --keep-files
1366 @opindex -k
1367 @opindex --keep-files
1368 Do not remove output files when errors are encountered.
1369
1370 @item -z
1371 @itemx --elide-empty-files
1372 @opindex -z
1373 @opindex --elide-empty-files
1374 Suppress the generation of zero-length output files.  (In cases where
1375 the section delimiters of the input file are supposed to mark the first
1376 lines of each of the sections, the first output file will generally be a
1377 zero-length file unless you use this option.)  The output file sequence
1378 numbers always run consecutively starting from 0, even when this option
1379 is specified.
1380
1381 @item -s
1382 @itemx -q
1383 @itemx --silent
1384 @itemx --quiet
1385 @opindex -s
1386 @opindex -q
1387 @opindex --silent
1388 @opindex --quiet
1389 Do not print counts of output file sizes.
1390
1391 @end table
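
As an illustrative sketch (the file name @file{notes.txt} and the prefix
@samp{sec} are hypothetical), the following splits a file at each line
that begins with @samp{# }:

@example
printf '# one\na\n# two\nb\n' > notes.txt
csplit -s -z -f sec notes.txt '/^# /' '@{*@}'
cat sec00     # the lines "# one" and "a"
cat sec01     # the lines "# two" and "b"
@end example

@noindent
The first line of the input already matches the pattern, so without
@samp{-z} the first output file would be empty, as noted in the
description of that option.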
1392
1393
1394 @node Summarizing files
1395 @chapter Summarizing files
1396
1397 @cindex summarizing files
1398
1399 These commands generate just a few numbers representing entire
1400 contents of files.
1401
1402 @menu
1403 * wc invocation::               Print byte, word, and line counts.
1404 * sum invocation::              Print checksum and block counts.
1405 * cksum invocation::            Print CRC checksum and byte counts.
1406 * md5sum invocation::           Print or check message-digests.
1407 @end menu
1408
1409
1410 @node wc invocation
1411 @section @code{wc}: Print byte, word, and line counts
1412
1413 @pindex wc
1414 @cindex byte count
1415 @cindex word count
1416 @cindex line count
1417
1418 @code{wc} counts the number of bytes, whitespace-separated words, and
1419 newlines in each given @var{file}, or standard input if none are given
1420 or for a @var{file} of @samp{-}.  Synopsis:
1421
1422 @example
1423 wc [@var{option}]@dots{} [@var{file}]@dots{}
1424 @end example
1425
1426 @cindex total counts
1427 @code{wc} prints one line of counts for each file, and if the file was
1428 given as an argument, it prints the file name following the counts.  If
1429 more than one @var{file} is given, @code{wc} prints a final line
1430 containing the cumulative counts, with the file name @file{total}.  The
1431 counts are printed in this order: newlines, words, bytes.
1432
1433 By default, @code{wc} prints all three counts.  Options can specify
1434 that only certain counts be printed.  Options do not undo others
1435 previously given, so
1436
1437 @example
1438 wc --bytes --words
1439 @end example
1440
1441 @noindent
1442 prints both the byte counts and the word counts.
1443
1444 The program accepts the following options.  Also see @ref{Common options}.
1445
1446 @table @samp
1447
1448 @item -c
1449 @itemx --bytes
1450 @itemx --chars
1451 @opindex -c
1452 @opindex --bytes
1453 @opindex --chars
1454 Print only the byte counts.
1455
1456 @item -w
1457 @itemx --words
1458 @opindex -w
1459 @opindex --words
1460 Print only the word counts.
1461
1462 @item -l
1463 @itemx --lines
1464 @opindex -l
1465 @opindex --lines
1466 Print only the newline counts.
1467
1468 @end table
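
For example (a sketch; the sample input is arbitrary), the input below
contains two newlines, three words, and 14 bytes:

@example
printf 'one two\nthree\n' | wc        # reports 2, 3, and 14, in that order
printf 'one two\nthree\n' | wc -l     # reports only the newline count, 2
@end example

@noindent
The amount of blank padding printed before each count is not shown
here.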
1469
1470
1471 @node sum invocation
1472 @section @code{sum}: Print checksum and block counts
1473
1474 @pindex sum
1475 @cindex 16-bit checksum
1476 @cindex checksum, 16-bit
1477
1478 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1479 standard input if none are given or for a @var{file} of @samp{-}.  Synopsis:
1480
1481 @example
1482 sum [@var{option}]@dots{} [@var{file}]@dots{}
1483 @end example
1484
1485 @code{sum} prints the checksum for each @var{file} followed by the
1486 number of blocks in the file (rounded up).  If more than one @var{file}
1487 is given, file names are also printed (by default).  (With the
@samp{--sysv} option, the corresponding file names are printed when there is
1489 at least one file argument.)
1490
1491 By default, GNU @code{sum} computes checksums using an algorithm
1492 compatible with BSD @code{sum} and prints file sizes in units of
1493 1024-byte blocks.
1494
1495 The program accepts the following options.  Also see @ref{Common options}.
1496
1497 @table @samp
1498
1499 @item -r
1500 @opindex -r
1501 @cindex BSD @code{sum}
1502 Use the default (BSD compatible) algorithm.  This option is included for
1503 compatibility with the System V @code{sum}.  Unless @samp{-s} was also
1504 given, it has no effect.
1505
1506 @item -s
1507 @itemx --sysv
1508 @opindex -s
1509 @opindex --sysv
1510 @cindex System V @code{sum}
1511 Compute checksums using an algorithm compatible with System V
1512 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1513
1514 @end table
1515
1516 @code{sum} is provided for compatibility; the @code{cksum} program (see
1517 next section) is preferable in new applications.
1518
1519
1520 @node cksum invocation
1521 @section @code{cksum}: Print CRC checksum and byte counts
1522
1523 @pindex cksum
1524 @cindex cyclic redundancy check
1525 @cindex CRC checksum
1526
1527 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1528 given @var{file}, or standard input if none are given or for a
1529 @var{file} of @samp{-}.  Synopsis:
1530
1531 @example
1532 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1533 @end example
1534
1535 @code{cksum} prints the CRC checksum for each file along with the number
1536 of bytes in the file, and the filename unless no arguments were given.
1537
@code{cksum} is typically used to ensure that files
transferred by unreliable means (e.g., netnews) have not been corrupted,
1540 by comparing the @code{cksum} output for the received files with the
1541 @code{cksum} output for the original files (typically given in the
1542 distribution).
1543
1544 The CRC algorithm is specified by the POSIX.2 standard.  It is not
1545 compatible with the BSD or System V @code{sum} algorithms (see the
1546 previous section); it is more robust.
1547
1548 The only options are @samp{--help} and @samp{--version}.  @xref{Common
1549 options}.
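
The comparison described above might look like the following sketch,
in which @file{original.dat} and @file{copy.dat} are hypothetical names
standing for the file as distributed and the file as received:

@example
printf 'some data\n' > original.dat
cp original.dat copy.dat
# Reading from standard input keeps the file name out of the output,
# so the two results can be compared directly.
if [ "$(cksum < original.dat)" = "$(cksum < copy.dat)" ]; then
  echo 'checksums match'
else
  echo 'checksums differ'
fi
@end example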
1550
1551
1552 @node md5sum invocation
1553 @section @code{md5sum}: Print or check message-digests
1554
1555 @pindex md5sum
1556 @cindex 128-bit checksum
1557 @cindex checksum, 128-bit
1558 @cindex fingerprint, 128-bit
1559 @cindex message-digest, 128-bit
1560
1561 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
@dfn{message-digest}) for each given @var{file}, or standard input if
none are given or for a @var{file} of @samp{-}.  It can also check if the
checksum has changed.  Synopsis:
1565
1566 @example
1567 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1568 @end example
1569
1570 For each @var{file}, @samp{md5sum} outputs the MD5 checksum, a flag
1571 indicating a binary or text input file, and the filename.
1572
1573 The program accepts the following options.  Also see @ref{Common options}.
1574
1575 @table @samp
1576
1577 @item -b
1578 @itemx --binary
1579 @opindex -b
1580 @opindex --binary
1581 @cindex binary input files
1582 Treat input files as binary.  This makes no difference on Unix systems,
1583 but other systems have different internal and external character
1584 representations, notably to mark end-of-line.
1585
1586 @item -c
1587 @itemx --check=@var{file}
1588 @var{file} is taken as the output of a former run of @samp{md5sum}: each
1589 line consists of an MD5 checksum, a binary/text flag, and a filename.
The file will be opened (with each possible relative path) and its
1591 message-digest computed.  If this computed message digest is not the
1592 same as that given in the line, the file will be marked as failed.
1593
1594 @item -s
1595 @itemx --string=@var{string}
1596 @opindex -s
1597 @opindex --string
1598 Compute the message digest for @var{string}, instead of for a file.  The
result is the same as for a file which contains exactly @var{string}.
1600
1601 @item -t
1602 @itemx --text
1603 @opindex -t
1604 @opindex --text
1605 @cindex text input files
1606 Treat all input files as text files.  This is the reverse of
@samp{--binary}.

@item -v
1609 @itemx --verbose
1610 @opindex -v
1611 @opindex --verbose
1612 Print progress information.
1613
1614 @end table
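
A sketch of recording digests and checking them later follows.  The
file names @file{greeting.txt} and @file{sums.md5} are hypothetical, and
the example assumes that the list file may be given as a separate
argument after @samp{-c}:

@example
printf 'hello\n' > greeting.txt
md5sum greeting.txt > sums.md5     # one line: digest, flag, file name
md5sum -c sums.md5                 # recompute the digest and compare
@end example

@noindent
If @file{greeting.txt} were changed between the two commands, the check
should mark it as failed; the exact wording of the report is not shown
here.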
1615
1616
1617 @node Operating on sorted files
1618 @chapter Operating on sorted files
1619
1620 @cindex operating on sorted files
1621 @cindex sorted files, operations on
1622
1623 These commands work with (or produce) sorted files.
1624
1625 @menu
1626 * sort invocation::             Sort text files.
1627 * uniq invocation::             Uniqify files.
1628 * comm invocation::             Compare two sorted files line by line.
1629 @end menu
1630
1631
1632 @node sort invocation
1633 @section @code{sort}: Sort text files
1634
1635 @pindex sort
1636 @cindex sorting files
1637
1638 @code{sort} sorts, merges, or compares all the lines from the given
1639 files, or standard input if none are given or for a @var{file} of
1640 @samp{-}.  By default, @code{sort} writes the results to standard
1641 output.  Synopsis:
1642
1643 @example
1644 sort [@var{option}]@dots{} [@var{file}]@dots{}
1645 @end example
1646
1647 @code{sort} has three modes of operation: sort (the default), merge,
1648 and check for sortedness.  The following options change the operation
1649 mode:
1650
1651 @table @samp
1652
1653 @item -c
1654 @opindex -c
1655 @cindex checking for sortedness
1656 Check whether the given files are already sorted: if they are not all
1657 sorted, print an error message and exit with a status of 1.
1658
1659 @item -m
1660 @opindex -m
1661 @cindex merging sorted files
Merge the given files by sorting them as a group.  Each input file must
already be individually sorted.  Sorting always works in place of
merging; merging is provided because it is faster when the inputs are
already sorted.
1666
1667 @end table
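
As a sketch of the merge and check modes (the file names are
hypothetical):

@example
printf 'a\nc\n' > first.txt
printf 'b\nd\n' > second.txt
sort -m first.txt second.txt       # prints a, b, c, d, one per line
printf 'b\na\n' | sort -c          # prints an error; exit status is 1
@end example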
1668
1669 A pair of lines is compared as follows: if any key fields have been
1670 specified, @code{sort} compares each pair of fields, in the order
1671 specified on the command line, according to the associated ordering
1672 options, until a difference is found or no fields are left.
1673
1674 If any of the global options @samp{Mbdfinr} are given but no key fields
1675 are specified, @code{sort} compares the entire lines according to the
1676 global options.
1677
1678 Finally, as a last resort when all keys compare equal (or if no
1679 ordering options were specified at all), @code{sort} compares the lines
1680 byte by byte in machine collating sequence.  The last resort comparison
1681 honors the @samp{-r} global option.  The @samp{-s} (stable) option
1682 disables this last-resort comparison so that lines in which all fields
1683 compare equal are left in their original relative order.  If no fields
1684 or global options are specified, @samp{-s} has no effect.
1685
1686 GNU @code{sort} (as specified for all GNU utilities) has no limits on
1687 input line length or restrictions on bytes allowed within lines.  In
1688 addition, if the final byte of an input file is not a newline, GNU
1689 @code{sort} silently supplies one.
1690
1691 @vindex TMPDIR
1692 If the environment variable @code{TMPDIR} is set, @code{sort} uses its
1693 value as the directory for temporary files instead of @file{/tmp}.  The
1694 @samp{-T @var{tempdir}} option in turn overrides the environment
1695 variable.
1696
1697 The following options affect the ordering of output lines.  They may be
1698 specified globally or as part of a specific key field.  If no key
1699 fields are specified, global options apply to comparison of entire
1700 lines; otherwise the global options are inherited by key fields that do
1701 not specify any special options of their own.
1702
1703 @table @samp
1704
1705 @item -b
1706 @opindex -b
1707 @cindex blanks, ignoring leading
1708 Ignore leading blanks when finding sort keys in each line.
1709
1710 @item -d
1711 @opindex -d
1712 @cindex phone directory order
1713 @cindex telephone directory order
1714 Sort in @dfn{phone directory} order: ignore all characters except
1715 letters, digits and blanks when sorting.
1716
1717 @item -f
1718 @opindex -f
1719 @cindex case folding
1720 Fold lowercase characters into the equivalent uppercase characters when
1721 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
1722
1723 @item -i
1724 @opindex -i
1725 @cindex unprintable characters, ignoring
1726 Ignore characters outside the printable ASCII range 040-0176 octal
1727 (inclusive) when sorting.
1728
1729 @item -M
1730 @opindex -M
1731 @cindex months, sorting by
1732 An initial string, consisting of any amount of whitespace, followed
1733 by three letters abbreviating a month name, is folded to UPPER case and
1734 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
1735 Invalid names compare low to valid names.
1736
1737 @item -n
1738 @opindex -n
1739 @cindex numeric sort
1740 Sort numerically: the number begins each line; specifically, it consists
1741 of optional whitespace, an optional @samp{-} sign, and zero or more
1742 digits, optionally followed by a decimal point and zero or more digits.
1743
1744 @item -r
1745 @opindex -r
1746 @cindex reverse sorting
1747 Reverse the result of comparison, so that lines with greater key values
1748 appear earlier in the output instead of later.
1749
1750 @end table
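
The effect of the ordering options can be seen in a small sketch (the
input is arbitrary).  Without options the lines are compared byte by
byte, so @samp{100} sorts before @samp{9}:

@example
printf '10\n9\n100\n' | sort           # prints 10, 100, 9
printf '10\n9\n100\n' | sort -n        # prints 9, 10, 100
printf '10\n9\n100\n' | sort -n -r     # prints 100, 10, 9
@end example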
1751
1752 Other options are:
1753
1754 @table @samp
1755
1756 @item -o @var{output-file}
1757 @opindex -o
1758 @cindex overwriting of input, allowed
1759 Write output to @var{output-file} instead of standard output.
1760 If @var{output-file} is one of the input files, @code{sort} copies
1761 it to a temporary file before sorting and writing the output to
1762 @var{output-file}.
1763
1764 @item -t @var{separator}
1765 @opindex -t
1766 @cindex field separator character
1767 Use character @var{separator} as the field separator when finding the
1768 sort keys in each line.  By default, fields are separated by the empty
1769 string between a non-whitespace character and a whitespace character.
1770 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
1771 into fields @w{@samp{ foo}} and @w{@samp{ bar}}.  The field separator is
1772 not considered to be part of either the field preceding or the field
1773 following.
1774
1775 @item -u
1776 @opindex -u
1777 @cindex uniqifying output
1778 For the default case or the @samp{-m} option, only output the first
1779 of a sequence of lines that compare equal.  For the @samp{-c} option,
1780 check that no pair of consecutive lines compares equal.
1781
1782 @item -k @var{pos1}[,@var{pos2}]
1783 @opindex -k
1784 @cindex sort field
1785 The recommended, POSIX, option for specifying a sort field.  The field
1786 consists of the line between @var{pos1} and @var{pos2} (or the end of
1787 the line, if @var{pos2} is omitted), inclusive.  Fields and character
1788 positions are numbered starting with 1.  See below.
1789
1790 @item +@var{pos1}[-@var{pos2}]
1791 The obsolete, traditional option for specifying a sort field.  The field
1792 consists of the line between @var{pos1} and up to but not including
1793 @var{pos2} (or the end of the line if @var{pos2} is omitted).  Fields
1794 and character positions are numbered starting with 0.  See below.
1795
1796 @end table
1797
1798 In addition, when GNU @code{sort} is invoked with exactly one argument,
1799 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
1800 options}.
1801
1802 Historical (BSD and System V) implementations of @code{sort} have
1803 differed in their interpretation of some options, particularly
1804 @samp{-b}, @samp{-f}, and @samp{-n}.  GNU sort follows the POSIX
1805 behavior, which is usually (but not always!) like the System V behavior.
1806 According to POSIX, @samp{-n} no longer implies @samp{-b}.  For
1807 consistency, @samp{-M} has been changed in the same way.  This may
1808 affect the meaning of character positions in field specifications in
1809 obscure cases.  The only fix is to add an explicit @samp{-b}.
1810
1811 A position in a sort field specified with the @samp{-k} or @samp{+}
1812 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
1813 of the field to use and @var{c} is the number of the first character
1814 from the beginning of the field (for @samp{+@var{pos}}) or from the end
1815 of the previous field (for @samp{-@var{pos}}).  If the @samp{.@var{c}}
1816 is omitted, it's taken to be the first character in the field.  If the
1817 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
1818 specification is counted from the first nonblank character of the field
1819 (for @samp{+@var{pos}}) or from the first nonblank character following
1820 the previous field (for @samp{-@var{pos}}).
1821
1822 A sort key option may also have any of the option letters @samp{Mbdfinr}
1823 appended to it, in which case the global ordering options are not used
1824 for that particular field.  The @samp{-b} option may be independently
1825 attached to either or both of the @samp{+@var{pos}} and
1826 @samp{-@var{pos}} parts of a field specification, and if it is inherited
1827 from the global options it will be attached to both.  If a @samp{-n} or
1828 @samp{-M} option is used, thus implying a @samp{-b} option, the
1829 @samp{-b} option is taken to apply to both the @samp{+@var{pos}} and the
1830 @samp{-@var{pos}} parts of a key specification.  Keys may span multiple
1831 fields.
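
As a rough illustration of the @samp{-k} syntax, the following sketch
sorts three made-up colon-separated records.  The data is invented for
this example, and the @samp{-t} option is assumed to set the field
separator to a colon:

@example
# Sort on the third field only, comparing it numerically.
# Expected order: a:y:4, c:x:30, b:x:100
printf 'c:x:30\na:y:4\nb:x:100\n' | sort -t : -k 3,3n

# Sort on the second field; break ties on the first field, reversed.
# Expected order: c:x:30, b:x:100, a:y:4
printf 'c:x:30\na:y:4\nb:x:100\n' | sort -t : -k 2,2 -k 1,1r
@end example

Writing @samp{-k 3,3} rather than @samp{-k 3} keeps the key from
extending to the end of the line.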
1832
1833
1834 @node uniq invocation
1835 @section @code{uniq}: Uniqify files
1836
1837 @pindex uniq
1838 @cindex uniqify files
1839
@code{uniq} writes the unique lines in the given @var{input}, or
1841 standard input if nothing is given or for an @var{input} name of
1842 @samp{-}.  Synopsis:
1843
1844 @example
1845 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
1846 @end example
1847
1848 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
1849 discards all but one of identical successive lines.  Optionally, it can
1850 instead show only lines that appear exactly once, or lines that appear
1851 more than once.
1852
1853 The input must be sorted.  If your input is not sorted, perhaps you want
1854 to use @code{sort -u}.
1855
1856 If no @var{output} file is specified, @code{uniq} writes to standard
1857 output.
1858
1859 The program accepts the following options.  Also see @ref{Common options}.
1860
1861 @table @samp
1862
1863 @item -@var{n}
1864 @itemx -f @var{n}
1865 @itemx --skip-fields=@var{n}
1866 @opindex -@var{n}
1867 @opindex -f
1868 @opindex --skip-fields
1869 Skip @var{n} fields on each line before checking for uniqueness.  Fields
1870 are sequences of non-space non-tab characters that are separated from
each other by at least one space or tab.
1872
1873 @item +@var{n}
1874 @itemx -s @var{n}
1875 @itemx --skip-chars=@var{n}
1876 @opindex +@var{n}
1877 @opindex -s
1878 @opindex --skip-chars
1879 Skip @var{n} characters before checking for uniqueness.  If you use both
1880 the field and character skipping options, fields are skipped over first.
1881
1882 @item -c
1883 @itemx --count
1884 @opindex -c
1885 @opindex --count
1886 Print the number of times each line occurred along with the line.
1887
1888 @item -d
1889 @itemx --repeated
1890 @opindex -d
1891 @opindex --repeated
1892 @cindex duplicate lines, outputting
1893 Print only duplicate lines.
1894
1895 @item -u
1896 @itemx --unique
1897 @opindex -u
1898 @opindex --unique
1899 @cindex unique lines, outputting
1900 Print only unique lines.
1901
1902 @item -w @var{n}
1903 @itemx --check-chars=@var{n}
1904 @opindex -w
1905 @opindex --check-chars
1906 Compare @var{n} characters on each line (after skipping any specified
fields and characters).  By default, the entire remainder of each line
is compared.
1909
1910 @end table
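
The options above can be tried on a small, already sorted input.  This
is only a sketch; the sample lines are invented for illustration:

@example
# Count how often each line occurs.
printf 'apple\napple\nbanana\ncherry\ncherry\n' | uniq -c

# Print one copy of each repeated line: apple, cherry
printf 'apple\napple\nbanana\ncherry\ncherry\n' | uniq -d

# Print only the lines that are not repeated: banana
printf 'apple\napple\nbanana\ncherry\ncherry\n' | uniq -u

# Skip the first field when comparing: prints "1 apple" and "3 pear"
printf '1 apple\n2 apple\n3 pear\n' | uniq -f 1
@end example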
1911
1912
1913 @node comm invocation
1914 @section @code{comm}: Compare two sorted files line by line
1915
1916 @pindex comm
1917 @cindex line-by-line comparison
1918 @cindex comparing sorted files
1919
1920 @code{comm} writes to standard output lines that are common, and lines
1921 that are unique, to two input files; a file name of @samp{-} means
1922 standard input.  Synopsis:
1923
1924 @example
1925 comm [@var{option}]@dots{} @var{file1} @var{file2}
1926 @end example
1927
1928 The input files must be sorted before @code{comm} can be used.
1929
1930 @cindex differing lines
1931 @cindex common lines
With no options, @code{comm} produces three-column output.  Column one
1933 contains lines unique to @var{file1}, column two contains lines unique
1934 to @var{file2}, and column three contains lines common to both files.
1935
1936 @opindex -1
1937 @opindex -2
1938 @opindex -3
1939 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
1940 the corresponding columns.  Also see @ref{Common options}.
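
For instance, combining the column options selects a single column.
The sketch below creates two small sorted files; the names
@file{file1} and @file{file2} and their contents are hypothetical:

@example
printf 'a\nb\nc\n' > file1
printf 'b\nc\nd\n' > file2

# Suppress columns 1 and 2: only lines common to both files (b, c).
comm -12 file1 file2

# Suppress columns 2 and 3: only lines unique to file1 (a).
comm -23 file1 file2
@end example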
1941
1942
1943 @node Operating on fields within a line
1944 @chapter Operating on fields within a line
1945
1946 @menu
1947 * cut invocation::              Print selected parts of lines.
1948 * paste invocation::            Merge lines of files.
1949 * join invocation::             Join lines on a common field.
1950 @end menu
1951
1952
1953 @node cut invocation
1954 @section @code{cut}: Print selected parts of lines
1955
1956 @pindex cut
1957 @code{cut} writes to standard output selected parts of each line of each
1958 input file, or standard input if no files are given or for a file name of
1959 @samp{-}.  Synopsis:
1960
1961 @example
1962 cut [@var{option}]@dots{} [@var{file}]@dots{}
1963 @end example
1964
1965 In the table which follows, the @var{byte-list}, @var{character-list},
1966 and @var{field-list} are one or more numbers or ranges (two numbers
1967 separated by a dash) separated by commas.  Bytes, characters, and
fields are numbered starting at 1.  Incomplete ranges may be
1969 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
1970 @samp{@var{n}} through end of line or last field.
1971
1972 The program accepts the following options.  Also see @ref{Common
1973 options}.
1974
1975 @table @samp
1976
1977 @item -b @var{byte-list}
1978 @itemx --bytes=@var{byte-list}
1979 @opindex -b
1980 @opindex --bytes
1981 Print only the bytes in positions listed in @var{byte-list}.  Tabs and
1982 backspaces are treated like any other character; they take up 1 byte.
1983
1984 @item -c @var{character-list}
1985 @itemx --characters=@var{character-list}
1986 @opindex -c
1987 @opindex --characters
1988 Print only characters in positions listed in @var{character-list}.
1989 The same as @samp{-b} for now, but internationalization will change
1990 that.  Tabs and backspaces are treated like any other character; they
1991 take up 1 character.
1992
1993 @item -f @var{field-list}
1994 @itemx --fields=@var{field-list}
1995 @opindex -f
1996 @opindex --fields
1997 Print only the fields listed in @var{field-list}.  Fields are
1998 separated by a @key{TAB} by default.
1999
2000 @item -d @var{delim}
2001 @itemx --delimiter=@var{delim}
2002 @opindex -d
2003 @opindex --delimiter
2004 For @samp{-f}, fields are separated by the first character in @var{delim}
2005 (default is @key{TAB}).
2006
2007 @item -n
2008 @opindex -n
2009 Do not split multibyte characters (no-op for now).
2010
2011 @item -s
2012 @itemx --only-delimited
2013 @opindex -s
2014 @opindex --only-delimited
2015 For @samp{-f}, do not print lines that do not contain the field separator
2016 character.
2017
2018 @end table
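
The following sketch exercises the list syntax and a few of the options
above on invented one-line inputs:

@example
# Bytes 2 through 4, then byte 7 through the end of the line: bcdgh
printf 'abcdefgh\n' | cut -b 2-4,7-

# Fields 1 and 3 of colon-separated input: one:three
printf 'one:two:three\n' | cut -d : -f 1,3

# With -s, the line that has no delimiter is not printed: two
printf 'one:two\nplain\n' | cut -s -d : -f 2
@end example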
2019
2020
2021 @node paste invocation
2022 @section @code{paste}: Merge lines of files
2023
2024 @pindex paste
2025 @cindex merging files
2026
2027 @code{paste} writes to standard output lines consisting of sequentially
2028 corresponding lines of each given file, separated by @key{TAB}.
2029 Standard input is used for a file name of @samp{-} or if no input files
2030 are given.
2031
2032 Synopsis:
2033
2034 @example
2035 paste [@var{option}]@dots{} [@var{file}]@dots{}
2036 @end example
2037
2038 The program accepts the following options.  Also see @ref{Common options}.
2039
2040 @table @samp
2041
2042 @item -s
2043 @itemx --serial
2044 @opindex -s
2045 @opindex --serial
2046 Paste the lines of one file at a time rather than one line from each
2047 file.
2048
2049 @item -d @var{delim-list}
@itemx --delimiters=@var{delim-list}
2051 @opindex -d
2052 @opindex --delimiters
2053 Consecutively use the characters in @var{delim-list} instead of
2054 @key{TAB} to separate merged lines.  When @var{delim-list} is
2055 exhausted, start again at its beginning.
2056
2057 @end table
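
As a small sketch of these options, consider two three-line files; the
names @file{nums} and @file{lets} are made up for the example:

@example
printf '1\n2\n3\n' > nums
printf 'a\nb\nc\n' > lets

# One line from each file, joined by a comma: 1,a  2,b  3,c
paste -d , nums lets

# One file at a time: 1,2,3 then a,b,c
paste -s -d , nums lets

# The delimiter list is reused from its beginning: a,b;c,d
printf 'a\nb\nc\nd\n' | paste -s -d ',;' -
@end example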
2058
2059
2060 @node join invocation
2061 @section @code{join}: Join lines on a common field
2062
2063 @pindex join
2064 @cindex common field, joining on
2065
2066 @code{join} writes to standard output a line for each pair of input
2067 lines that have identical join fields.  Synopsis:
2068
2069 @example
2070 join [@var{option}]@dots{} @var{file1} @var{file2}
2071 @end example
2072
2073 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
2074 meaning standard input.  @var{file1} and @var{file2} should be already
2075 sorted in increasing order (not numerically) on the join fields; unless
2076 the @samp{-t} option is given, they should be sorted ignoring blanks at
2077 the start of the line, as in @code{sort -b}.
2078
2079 The defaults are: the join field is the first field in each line;
2080 fields in the input are separated by one or more blanks, with leading
2081 blanks on the line ignored; fields in the output are separated by a
2082 space; each output line consists of the join field, the remaining
2083 fields from @var{file1}, then the remaining fields from @var{file2}.
2084
2085 The program accepts the following options.  Also see @ref{Common options}.
2086
2087 @table @samp
2088
2089 @item -a @var{file-number}
2090 @opindex -a
2091 Print a line for each unpairable line in file @var{file-number} (either
2092 @samp{1} or @samp{2}), in addition to the normal output.
2093
2094 @item -e @var{string}
2095 @opindex -e
2096 Replace those output fields that are missing in the input with
2097 @var{string}.
2098
2099 @item -1 @var{field}
2100 @itemx -j1 @var{field}
2101 @opindex -1
2102 @opindex -j1
2103 Join on field @var{field} (a positive integer) of file 1.
2104
2105 @item -2 @var{field}
2106 @itemx -j2 @var{field}
2107 @opindex -2
2108 @opindex -j2
2109 Join on field @var{field} (a positive integer) of file 2.
2110
2111 @item -j @var{field}
2112 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
2113
2114 @item -o @var{field-list}@dots{}
2115 Construct each output line according to the format in @var{field-list}.
2116 Each element in @var{field-list} consists of a file number (either 1 or
2117 2), a period, and a field number (a positive integer).  The elements in
2118 the list are separated by commas or blanks.  Multiple @var{field-list}
2119 arguments can be given after a single @samp{-o} option; the values
2120 of all lists given with @samp{-o} are concatenated together.
2121
2122 @item -t @var{char}
2123 Use character @var{char} as the input and output field separator.
2124
2125 @item -v @var{file-number}
2126 Print a line for each unpairable line in file @var{file-number}
2127 (either 1 or 2), instead of the normal output.
2128
2129 @end table
2130
2131 In addition, when GNU @code{join} is invoked with exactly one argument,
2132 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
2133 options}.
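
A short sketch may help to show the defaults and two of the options.
The files @file{fruits} and @file{colors} are hypothetical, and both
are already sorted on their first field:

@example
printf '1 apple\n2 banana\n' > fruits
printf '1 red\n2 yellow\n3 green\n' > colors

# Default output: "1 apple red" and "2 banana yellow"
join fruits colors

# Also print the unpairable line from file 2: adds "3 green"
join -a 2 fruits colors

# Choose the output fields: "apple red" and "banana yellow"
join -o 1.2,2.2 fruits colors
@end example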
2134
2135
2136 @node Operating on characters
2137 @chapter Operating on characters
2138
2139 @cindex operating on characters
2140
These commands operate on individual characters.
2142
2143 @menu
2144 * tr invocation::               Translate, squeeze, and/or delete characters.
2145 * expand invocation::           Convert tabs to spaces.
2146 * unexpand invocation::         Convert spaces to tabs.
2147 @end menu
2148
2149
2150 @node tr invocation
2151 @section @code{tr}: Translate, squeeze, and/or delete characters
2152
2153 @pindex tr
2154
2155 Synopsis:
2156
2157 @example
2158 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
2159 @end example
2160
2161 @code{tr} copies standard input to standard output, performing
2162 one of the following operations:
2163
2164 @itemize @bullet
2165 @item
2166 translate, and optionally squeeze repeated characters in the result,
2167 @item
2168 squeeze repeated characters,
2169 @item
2170 delete characters,
2171 @item
2172 delete characters, then squeeze repeated characters from the result.
2173 @end itemize
2174
2175 The @var{set1} and (if given) @var{set2} arguments define ordered
2176 sets of characters, referred to below as @var{set1} and @var{set2}.  These
2177 sets are the characters of the input that @code{tr} operates on.
2178 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
2179 complement (all of the characters that are not in @var{set1}).
2180
2181 @menu
2182 * Character sets::              Specifying sets of characters.
* Translating::                 Changing one character to another.
2184 * Squeezing::                   Squeezing repeats and deleting.
2185 * Warnings in tr::              Warning messages.
2186 @end menu
2187
2188
2189 @node Character sets
2190 @subsection Specifying sets of characters
2191
2192 @cindex specifying sets of characters
2193
2194 The format of the @var{set1} and @var{set2} arguments resembles
2195 the format of regular expressions; however, they are not regular
2196 expressions, only lists of characters.  Most characters simply
2197 represent themselves in these strings, but the strings can contain
2198 the shorthands listed below, for convenience.  Some of them can be
2199 used only in @var{set1} or @var{set2}, as noted below.
2200
2201 @table @asis
2202
2203 @item Backslash escapes.
2204 @cindex backslash escapes
2205
2206 A backslash followed by a character not listed below causes an error
2207 message.
2208
2209 @table @samp
2210 @item \a
2211 Control-G,
2212 @item \b
2213 Control-H,
2214 @item \f
2215 Control-L,
2216 @item \n
2217 Control-J,
2218 @item \r
2219 Control-M,
2220 @item \t
2221 Control-I,
2222 @item \v
2223 Control-K,
2224 @item \@var{ooo}
2225 The character with the value given by @var{ooo}, which is 1 to 3
2226 octal digits,
2227 @item \\
2228 A backslash.
2229 @end table
2230
2231 @item Ranges.
2232 @cindex ranges
2233
2234 The notation @samp{@var{m}-@var{n}} expands to all of the characters
2235 from @var{m} through @var{n}, in ascending order.  @var{m} should
2236 collate before @var{n}; if it doesn't, an error results.  As an example,
2237 @samp{0-9} is the same as @samp{0123456789}.  Although GNU @code{tr}
2238 does not support the System V syntax that uses square brackets to
2239 enclose ranges, translations specified in that format will still work as
long as the brackets in @var{set1} correspond to identical brackets
in @var{set2}.
2242
2243 @item Repeated characters.
2244 @cindex repeated characters
2245
2246 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
2247 copies of character @var{c}.  Thus, @samp{[y*6]} is the same as
@samp{yyyyyy}.  The notation @samp{[@var{c}*]} in @var{set2} expands
2249 to as many copies of @var{c} as are needed to make @var{set2} as long as
2250 @var{set1}.  If @var{n} begins with @samp{0}, it is interpreted in
2251 octal, otherwise in decimal.
2252
2253 @item Character classes.
@cindex character classes
2255
2256 The notation @samp{[:@var{class}:]} expands to all of the characters in
2257 the (predefined) class @var{class}.  The characters expand in no
2258 particular order, except for the @code{upper} and @code{lower} classes,
2259 which expand in ascending order.  When the @samp{--delete} (@samp{-d})
2260 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
2261 character class can be used in @var{set2}.  Otherwise, only the
2262 character classes @code{lower} and @code{upper} are accepted in
2263 @var{set2}, and then only if the corresponding character class
2264 (@code{upper} and @code{lower}, respectively) is specified in the same
2265 relative position in @var{set1}.  Doing this specifies case conversion.
2266 The class names are given below; an error results when an invalid class
2267 name is given.
2268
2269 @table @code
2270 @item alnum
2271 @opindex alnum
2272 Letters and digits.
2273 @item alpha
2274 @opindex alpha
2275 Letters.
2276 @item blank
2277 @opindex blank
2278 Horizontal whitespace.
2279 @item cntrl
2280 @opindex cntrl
2281 Control characters.
2282 @item digit
2283 @opindex digit
2284 Digits.
2285 @item graph
2286 @opindex graph
2287 Printable characters, not including space.
2288 @item lower
2289 @opindex lower
2290 Lowercase letters.
2291 @item print
2292 @opindex print
2293 Printable characters, including space.
2294 @item punct
2295 @opindex punct
2296 Punctuation characters.
2297 @item space
2298 @opindex space
2299 Horizontal or vertical whitespace.
2300 @item upper
2301 @opindex upper
2302 Uppercase letters.
2303 @item xdigit
2304 @opindex xdigit
2305 Hexadecimal digits.
2306 @end table
2307
2308 @item Equivalence classes.
2309 @cindex equivalence classes
2310
2311 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
2312 equivalent to @var{c}, in no particular order.  Equivalence classes are
2313 a relatively recent invention intended to support non-English alphabets.
2314 But there seems to be no standard way to define them or determine their
2315 contents.  Therefore, they are not fully implemented in GNU @code{tr};
2316 each character's equivalence class consists only of that character,
2317 which is of no particular use.
2318
2319 @end table
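
To make some of these shorthands concrete, here is a brief sketch using
invented input.  It assumes GNU @code{tr}; other implementations may
treat the repeat construct differently:

@example
# A range in set1 and a repeat construct in set2: prints abc###
printf 'abc123\n' | tr 0-9 '[#*]'

# A character class with --delete: prints "Hello world"
printf 'Hello, world!\n' | tr -d '[:punct:]'
@end example

The quotes keep the shell from interpreting the brackets and the
asterisk before @code{tr} sees them.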
2320
2321
2322 @node Translating
2323 @subsection Translating
2324
2325 @cindex translating characters
2326
2327 @code{tr} performs translation when @var{set1} and @var{set2} are
2328 both given and the @samp{--delete} (@samp{-d}) option is not given.
2329 @code{tr} translates each character of its input that is in @var{set1}
2330 to the corresponding character in @var{set2}.  Characters not in
2331 @var{set1} are passed through unchanged.  When a character appears more
2332 than once in @var{set1} and the corresponding characters in @var{set2}
2333 are not all the same, only the final one is used.  For example, these
2334 two commands are equivalent:
2335
2336 @example
2337 tr aaa xyz
2338 tr a z
2339 @end example
2340
2341 A common use of @code{tr} is to convert lowercase characters to
2342 uppercase.  This can be done in many ways.  Here are three of them:
2343
2344 @example
2345 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
2346 tr a-z A-Z
2347 tr '[:lower:]' '[:upper:]'
2348 @end example
2349
2350 When @code{tr} is performing translation, @var{set1} and @var{set2}
2351 typically have the same length.  If @var{set1} is shorter than
2352 @var{set2}, the extra characters at the end of @var{set2} are ignored.
2353
2354 On the other hand, making @var{set1} longer than @var{set2} is not
2355 portable; POSIX.2 says that the result is undefined.  In this situation,
2356 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
2357 the last character of @var{set2} as many times as necessary.  System V
2358 @code{tr} truncates @var{set1} to the length of @var{set2}.
2359
2360 By default, GNU @code{tr} handles this case like BSD @code{tr}.  When
2361 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
2362 handles this case like the System V @code{tr} instead.  This option is
2363 ignored for operations other than translation.
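
The difference can be seen with a @var{set1} that is longer than
@var{set2}.  This sketch assumes GNU @code{tr} and uses made-up input:

@example
# Default (BSD-like): set2 is padded with its last character.
# Prints xyyyy
printf 'abcde\n' | tr abcde xy

# With -t (System V-like): set1 is truncated to the length of set2.
# Prints xycde
printf 'abcde\n' | tr -t abcde xy
@end example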
2364
2365 Acting like System V @code{tr} in this case breaks the relatively common
2366 BSD idiom:
2367
2368 @example
2369 tr -cs A-Za-z0-9 '\012'
2370 @end example
2371
2372 @noindent
2373 because it converts only zero bytes (the first element in the
2374 complement of @var{set1}), rather than all non-alphanumerics, to
2375 newlines.
2376
2377
2378 @node Squeezing
2379 @subsection Squeezing repeats and deleting
2380
2381 @cindex squeezing repeat characters
2382 @cindex deleting characters
2383
2384 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
2385 removes any input characters that are in @var{set1}.
2386
2387 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
2388 @code{tr} replaces each input sequence of a repeated character that
2389 is in @var{set1} with a single occurrence of that character.
2390
2391 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
2392 first performs any deletions using @var{set1}, then squeezes repeats
2393 from any remaining characters using @var{set2}.
2394
2395 The @samp{--squeeze-repeats} option may also be used when translating,
2396 in which case @code{tr} first performs translation, then squeezes
2397 repeats from any remaining characters using @var{set2}.
2398
2399 Here are some examples to illustrate various combinations of options:
2400
2401 @itemize @bullet
2402
2403 @item
2404 Remove all zero bytes:
2405
2406 @example
2407 tr -d '\000'
2408 @end example
2409
2410 @item
2411 Put all words on lines by themselves.  This converts all
2412 non-alphanumeric characters to newlines, then squeezes each string
2413 of repeated newlines into a single newline:
2414
2415 @example
2416 tr -cs '[a-zA-Z0-9]' '[\n*]'
2417 @end example
2418
2419 @item
2420 Convert each sequence of repeated newlines to a single newline:
2421
2422 @example
2423 tr -s '\n'
2424 @end example
2425
2426 @end itemize
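
The two combined modes described earlier in this section can be
sketched the same way; the input strings here are invented:

@example
# Delete every `a' (set1), then squeeze repeated `c' (set2).
# Prints bbcdd
printf 'aabbccdd\n' | tr -ds a c

# Translate a to x and b to y, then squeeze repeats of x and y.
# Prints xycc
printf 'aabbcc\n' | tr -s ab xy
@end example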
2427
2428
2429 @node Warnings in tr
2430 @subsection Warning messages
2431
2432 @vindex POSIXLY_CORRECT
2433 Setting the environment variable @code{POSIXLY_CORRECT} turns off the
2434 following warning and error messages, for strict compliance with
2435 POSIX.2.  Otherwise, the following diagnostics are issued:
2436
2437 @enumerate
2438
2439 @item
2440 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
2441 is not, and @var{set2} is given, GNU @code{tr} by default prints
2442 a usage message and exits, because @var{set2} would not be used.
2443 The POSIX specification says that @var{set2} must be ignored in
2444 this case. Silently ignoring arguments is a bad idea.
2445
2446 @item
2447 When an ambiguous octal escape is given.  For example, @samp{\400}
2448 is actually @samp{\40} followed by the digit @samp{0}, because the
2449 value 400 octal does not fit into a single byte.
2450
2451 @end enumerate
2452
2453 GNU @code{tr} does not provide complete BSD or System V compatibility.
2454 For example, it is impossible to disable interpretation of the POSIX
2455 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}.  Also, GNU
2456 @code{tr} does not delete zero bytes automatically, unlike traditional
2457 Unix versions, which provide no way to preserve zero bytes.
2458
2459
2460 @node expand invocation
2461 @section @code{expand}: Convert tabs to spaces
2462
2463 @pindex expand
2464 @cindex tabs to spaces, converting
2465 @cindex converting tabs to spaces
2466
2467 @code{expand} writes the contents of each given @var{file}, or standard
2468 input if none are given or for a @var{file} of @samp{-}, to standard
2469 output, with tab characters converted to the appropriate number of
2470 spaces.  Synopsis:
2471
2472 @example
2473 expand [@var{option}]@dots{} [@var{file}]@dots{}
2474 @end example
2475
2476 By default, @code{expand} converts all tabs to spaces.  It preserves
2477 backspace characters in the output; they decrement the column count for
2478 tab calculations.  The default action is equivalent to @samp{-8} (set
2479 tabs every 8 columns).
2480
2481 The program accepts the following options.  Also see @ref{Common options}.
2482
2483 @table @samp
2484
2485 @item -@var{tab1}[,@var{tab2}]@dots{}
2486 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2487 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2488 @opindex -@var{tab}
2489 @opindex -t
2490 @opindex --tabs
2491 @cindex tabstops, setting
2492 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2493 (default is 8).  Otherwise, set the tabs at columns @var{tab1},
2494 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
2495 last tabstop given with single spaces.  If the tabstops are specified
2496 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2497 blanks as well as by commas.
2498
2499 @item -i
2500 @itemx --initial
2501 @opindex -i
2502 @opindex --initial
2503 @cindex initial tabs, converting
Convert only initial tabs (those that precede every character that is
neither a space nor a tab) on each line to spaces.
2506
2507 @end table
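
As a minimal sketch with invented input, the following commands set
tab stops every 4 columns:

@example
# The tab after `a' is replaced by three spaces, so `b' lands in
# column 4 (counting from 0).
printf 'a\tb\n' | expand -t 4

# With -i, only the leading tab is converted; the tab after `x'
# is passed through unchanged.
printf '\tx\ty\n' | expand -i -t 4
@end example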
2508
2509
2510 @node unexpand invocation
2511 @section @code{unexpand}: Convert spaces to tabs
2512
2513 @pindex unexpand
2514
2515 @code{unexpand} writes the contents of each given @var{file}, or
2516 standard input if none are given or for a @var{file} of @samp{-}, to
2517 standard output, with strings of two or more space or tab characters
2518 converted to as many tabs as possible followed by as many spaces as are
2519 needed.  Synopsis:
2520
2521 @example
2522 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
2523 @end example
2524
By default, @code{unexpand} converts only initial spaces and tabs (those
that precede every character that is neither a space nor a tab) on each
line.  It
2527 preserves backspace characters in the output; they decrement the column
2528 count for tab calculations.  By default, tabs are set at every 8th
2529 column.
2530
2531 The program accepts the following options.  Also see @ref{Common options}.
2532
2533 @table @samp
2534
2535 @item -@var{tab1}[,@var{tab2}]@dots{}
2536 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2537 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2538 @opindex -@var{tab}
2539 @opindex -t
2540 @opindex --tabs
2541 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2542 instead of the default 8.  Otherwise, set the tabs at columns
2543 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
2544 tabs beyond the tabstops given unchanged.  If the tabstops are specified
2545 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2546 blanks as well as by commas.  This option implies the @samp{-a} option.
2547
2548 @item -a
2549 @itemx --all
2550 @opindex -a
2551 @opindex --all
2552 Convert all strings of two or more spaces or tabs, not just initial
2553 ones, to tabs.
2554
2555 @end table
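
A small sketch, using invented input and piping through @code{od -c}
only to make the resulting tabs visible:

@example
# Eight leading spaces become a single tab.
printf '        x\n' | unexpand | od -c

# With -a, the seven spaces that reach the tab stop at column 8
# are converted as well.
printf 'a       b\n' | unexpand -a | od -c
@end example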
2556
2557
2558 @c              What's GNU?
2559 @c              Arnold Robbins
2560 @node Opening the software toolbox
2561 @chapter Opening the software toolbox
2562
2563 This chapter originally appeared in @cite{Linux Journal}, volume 1,
2564 number 2, in the @cite{What's GNU?} column. It was written by Arnold
2565 Robbins.
2566
2567 @menu
2568 * Toolbox introduction::
2569 * I/O redirection::
2570 * The @code{who} command::
2571 * The @code{cut} command::
2572 * The @code{sort} command::
2573 * The @code{uniq} command::
2574 * Putting the tools together::
2575 @end menu
2576
2577
2578 @node Toolbox introduction
2579 @unnumberedsec Toolbox introduction
2580
2581 This month's column is only peripherally related to the GNU Project, in
2582 that it describes a number of the GNU tools on your Linux system and how they
2583 might be used.  What it's really about is the ``Software Tools'' philosophy
2584 of program development and usage.
2585
2586 The software tools philosophy was an important and integral concept
2587 in the initial design and development of Unix (of which Linux and GNU are
2588 essentially clones).  Unfortunately, in the modern day press of
2589 Internetworking and flashy GUIs, it seems to have fallen by the
2590 wayside.  This is a shame, since it provides a powerful mental model
2591 for solving many kinds of problems.
2592
2593 Many people carry a Swiss Army knife around in their pants pockets (or
2594 purse).  A Swiss Army knife is a handy tool to have: it has several knife
2595 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
2596 a number of other things on it.  For the everyday, small miscellaneous jobs
2597 where you need a simple, general purpose tool, it's just the thing.
2598
2599 On the other hand, an experienced carpenter doesn't build a house using
2600 a Swiss Army knife.  Instead, he has a toolbox chock full of specialized
2601 tools---a saw, a hammer, a screwdriver, a plane, and so on.  And he knows
2602 exactly when and where to use each tool; you won't catch him hammering nails
2603 with the handle of his screwdriver.
2604
2605 The Unix developers at Bell Labs were all professional programmers and trained
2606 computer scientists.  They had found that while a one-size-fits-all program
2607 might appeal to a user because there's only one program to use, in practice
2608 such programs are
2609
2610 @enumerate a
2611 @item
2612 difficult to write,
2613
2614 @item
2615 difficult to maintain and
2616 debug, and
2617
2618 @item
2619 difficult to extend to meet new situations.
2620 @end enumerate
2621
2622 Instead, they felt that programs should be specialized tools.  In short, each
2623 program ``should do one thing well.''  No more and no less.  Such programs are
2624 simpler to design, write, and get right---they only do one thing.
2625
Furthermore, they found that, with the right machinery for hooking programs
together, the whole was greater than the sum of the parts.  By combining
2628 several special purpose programs, you could accomplish a specific task
2629 that none of the programs was designed for, and accomplish it much more
2630 quickly and easily than if you had to write a special purpose program.
2631 We will see some (classic) examples of this further on in the column.
(An important additional point was that, if necessary, you should take a
detour and build any software tools you may need first, if you don't
already have something appropriate in the toolbox.)
2635
2636 @node I/O redirection
2637 @unnumberedsec I/O redirection
2638
2639 Hopefully, you are familiar with the basics of I/O redirection in the
2640 shell, in particular the concepts of ``standard input,'' ``standard output,''
2641 and ``standard error''.  Briefly, ``standard input'' is a data source, where
2642 data comes from.  A program should not need to either know or care if the
2643 data source is a disk file, a keyboard, a magnetic tape, or even a punched
2644 card reader.  Similarly, ``standard output'' is a data sink, where data goes
2645 to.  The program should neither know nor care where this might be.
2646 Programs that only read their standard input, do something to the data,
2647 and then send it on, are called ``filters'', by analogy to filters in a
2648 water pipeline.
2649
2650 With the Unix shell, it's very easy to set up data pipelines:
2651
2652 @example
2653 program_to_create_data | filter1 | .... | filterN > final.pretty.data
2654 @end example
2655
2656 We start out by creating the raw data; each filter applies some successive
2657 transformation to the data, until by the time it comes out of the pipeline,
2658 it is in the desired form.
2659
2660 This is fine and good for standard input and standard output.  Where does the
standard error come into play?  Well, think about @code{filter1} in
2662 the pipeline above.  What happens if it encounters an error in the data it
2663 sees?  If it writes an error message to standard output, it will just
2664 disappear down the pipeline into @code{filter2}'s input, and the
2665 user will probably never see it.  So programs need a place where they can send
2666 error messages so that the user will notice them.  This is standard error,
2667 and it is usually connected to your console or window, even if you have
2668 redirected standard output of your program away from your screen.
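
To see this for yourself, here is a small sketch.  The two @code{echo}
commands merely stand in for a real filter; they are made up for the
illustration:

@example
# The first echo writes to standard output, which goes down the pipe
# and comes out in uppercase.  The second writes to standard error,
# which bypasses the pipe and shows up on the screen unchanged.
(echo data; echo 'something went wrong' 1>&2) | tr a-z A-Z
@end example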
2669
2670 For filter programs to work together, the format of the data has to be
2671 agreed upon.  The most straightforward and easiest format to use is simply
2672 lines of text.  Unix data files are generally just streams of bytes, with
2673 lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
2674 conventionally called a ``newline'' in the Unix literature. (This is
2675 @code{'\n'} if you're a C programmer.)  This is the format used by all
2676 the traditional filtering programs.  (Many earlier operating systems
2677 had elaborate facilities and special purpose programs for managing
2678 binary data.  Unix has always shied away from such things, under the
2679 philosophy that it's easiest to simply be able to view and edit your
2680 data with a text editor.)
2681
2682 OK, enough introduction. Let's take a look at some of the tools, and then
2683 we'll see how to hook them together in interesting ways.   In the following
2684 discussion, we will only present those command line options that interest
2685 us.  As you should always do, double check your system documentation
2686 for the full story.
2687
2688 @node The @code{who} command
2689 @unnumberedsec The @code{who} command
2690
2691 The first program is the @code{who} command.  By itself, it generates a
2692 list of the users who are currently logged in.  Although I'm writing
2693 this on a single-user system, we'll pretend that several people are
2694 logged in:
2695
2696 @example
2697 $ who
2698 arnold   console Jan 22 19:57
2699 miriam   ttyp0   Jan 23 14:19(:0.0)
2700 bill     ttyp1   Jan 21 09:32(:0.0)
2701 arnold   ttyp2   Jan 23 20:48(:0.0)
2702 @end example
2703
2704 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
2705 There are three people logged in, and I am logged in twice.  On traditional
2706 Unix systems, user names are never more than eight characters long.  This
2707 little bit of trivia will be useful later.  The output of @code{who} is nice,
2708 but the data is not all that exciting.
2709
2710 @node The @code{cut} command
2711 @unnumberedsec The @code{cut} command
2712
2713 The next program we'll look at is the @code{cut} command.  This program
2714 cuts out columns or fields of input data.  For example, we can tell it
to print just the login name and full name from the @file{/etc/passwd}
file.  The @file{/etc/passwd} file has seven fields, separated by
2717 colons:
2718
2719 @example
2720 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
2721 @end example
2722
To get the first and fifth fields, we would use @code{cut} like this:
2724
2725 @example
2726 $ cut -d: -f1,5 /etc/passwd
2727 root:Operator
2728 @dots{}
2729 arnold:Arnold D. Robbins
2730 miriam:Miriam A. Robbins
2731 @dots{}
2732 @end example
2733
2734 With the @samp{-c} option, @code{cut} will cut out specific characters
2735 (i.e., columns) in the input lines.  This command looks like it might be
2736 useful for data filtering.
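
As a small illustration of both modes, the sketch below feeds @code{cut}
one line in the style of the @code{who} output shown earlier and the
sample @file{/etc/passwd} entry.  The variable names are made up for the
example, and the column positions assume the exact spacing of the sample
line.

@example
who_line='arnold   console Jan 22 19:57'
passwd_line='arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh'

# Characters 10 through 16 hold the terminal name in the sample line.
tty=$(printf '%s\n' "$who_line" | cut -c10-16)

# Fields 1 and 7 of a colon-separated line: login name and shell.
login_shell=$(printf '%s\n' "$passwd_line" | cut -d: -f1,7)

echo "$tty"
echo "$login_shell"
@end example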
2737
2738
2739 @node The @code{sort} command
2740 @unnumberedsec The @code{sort} command
2741
2742 Next we'll look at the @code{sort} command.  This is one of the most
2743 powerful commands on a Unix-style system; one that you will often find
2744 yourself using when setting up fancy data plumbing. The @code{sort}
2745 command reads and sorts each file named on the command line.  It then
2746 merges the sorted data and writes it to standard output.  It will read
2747 standard input if no files are given on the command line (thus
making it into a filter).  The sort is based either on the machine
collating sequence (@sc{ASCII}) or on user-supplied ordering criteria.
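
The difference between the default ordering and a user-supplied one is
easy to see with numbers.  This sketch assumes the @samp{C} locale, so
that the default comparison is by character code, and uses the @samp{-n}
option (numeric sort) as one example of an ordering criterion.

@example
LC_ALL=C
export LC_ALL

# Default: compared character by character, so "100" sorts before "9".
by_text=$(printf '10\n9\n100\n' | sort)

# With -n: compared as numbers.
by_number=$(printf '10\n9\n100\n' | sort -n)

echo "$by_text"
echo "$by_number"
@end example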
2750
2751
2752 @node The @code{uniq} command
2753 @unnumberedsec The @code{uniq} command
2754
2755 Finally (at least for now), we'll look at the @code{uniq} program.  When
2756 sorting data, you will often end up with duplicate lines, lines that
2757 are identical.  Usually, all you need is one instance of each line.
2758 This is where @code{uniq} comes in. The @code{uniq} program reads its
2759 standard input, which it expects to be sorted.  It only prints out one
2760 copy of each duplicated line.  It does have several options.  Later on,
2761 we'll use the @samp{-c} option, which prints each unique line, preceded
2762 by a count of the number of times that line occurred in the input.
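
Because @code{uniq} only compares adjacent lines, the sorting step
matters.  The following sketch, using made-up input, shows the same data
with and without @code{sort} in front.

@example
# Unsorted: the two "bill" lines are not adjacent, so both survive.
unsorted=$(printf 'bill\narnold\nbill\n' | uniq)

# Sorted first: the duplicates become adjacent and collapse to one line.
sorted=$(printf 'bill\narnold\nbill\n' | sort | uniq)

# With -c, each line is preceded by its count.
counted=$(printf 'bill\narnold\nbill\n' | sort | uniq -c)

echo "$unsorted"
echo "$sorted"
echo "$counted"
@end example

The width of the count column printed by @samp{-c} may differ from one
implementation to another, so it is best not to depend on it.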
2763
2764
2765 @node Putting the tools together
2766 @unnumberedsec Putting the tools together
2767
2768 Now, let's suppose this is a large BBS system with dozens of users
2769 logged in.  The management wants the SysOp to write a program that will
2770 generate a sorted list of logged in users.  Furthermore, even if a user
2771 is logged in multiple times, his or her name should only show up in the
2772 output once.
2773
2774 The SysOp could sit down with the system documentation and write a C
2775 program that did this. It would take perhaps a couple of hundred lines
2776 of code and about two hours to write it, test it, and debug it.
2777 However, knowing the software toolbox, the SysOp can instead start out
2778 by generating just a list of logged on users:
2779
2780 @example
2781 $ who | cut -c1-8
2782 arnold
2783 miriam
2784 bill
2785 arnold
2786 @end example
2787
2788 Next, sort the list:
2789
2790 @example
2791 $ who | cut -c1-8 | sort
2792 arnold
2793 arnold
2794 bill
2795 miriam
2796 @end example
2797
2798 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
2799
2800 @example
2801 $ who | cut -c1-8 | sort | uniq
2802 arnold
2803 bill
2804 miriam
2805 @end example
2806
2807 The @code{sort} command actually has a @samp{-u} option that does what
2808 @code{uniq} does. However, @code{uniq} has other uses for which one
2809 cannot substitute @samp{sort -u}.
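
One such use is @samp{uniq -d}, which prints only the lines that occur
more than once.  As a sketch, with @code{printf} standing in for the
@samp{who | cut -c1-8} stage so that the result does not depend on who is
actually logged in:

@example
# Sample login names, one per session (stand-in for: who | cut -c1-8).
sessions=$(printf 'arnold\nmiriam\nbill\narnold\n')

# Same result either way: one line per user.
with_uniq=$(printf '%s\n' "$sessions" | sort | uniq)
with_sort_u=$(printf '%s\n' "$sessions" | sort -u)

# No sort option does this: list only users with more than one session.
repeated=$(printf '%s\n' "$sessions" | sort | uniq -d)

echo "$repeated"
@end example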
2810
2811 The SysOp puts this pipeline into a shell script, and makes it available for
2812 all the users on the system:
2813
2814 @example
2815 # cat > /usr/local/bin/listusers
2816 who | cut -c1-8 | sort | uniq
2817 ^D
2818 # chmod +x /usr/local/bin/listusers
2819 @end example
2820
2821 There are four major points to note here.  First, with just four
2822 programs, on one command line, the SysOp was able to save about two
hours' worth of work.  Furthermore, the shell pipeline is just about as
2824 efficient as the C program would be, and it is much more efficient in
2825 terms of programmer time.  People time is much more expensive than
2826 computer time, and in our modern ``there's never enough time to do
2827 everything'' society, saving two hours of programmer time is no mean
2828 feat.
2829
2830 Second, it is also important to emphasize that with the
2831 @emph{combination} of the tools, it is possible to do a special
2832 purpose job never imagined by the authors of the individual programs.
2833
2834 Third, it is also valuable to build up your pipeline in stages, as we did here.
2835 This allows you to view the data at each stage in the pipeline, which helps
2836 you acquire the confidence that you are indeed using these tools correctly.
2837
2838 Finally, by bundling the pipeline in a shell script, other users can use
2839 your command, without having to remember the fancy plumbing you set up for
2840 them. In terms of how you run them, shell scripts and compiled programs are
2841 indistinguishable.
2842
2843 After the previous warm-up exercise, we'll look at two additional, more
2844 complicated pipelines.  For them, we need to introduce two more tools.
2845
2846 The first is the @code{tr} command, which stands for ``transliterate.''
2847 The @code{tr} command works on a character-by-character basis, changing
2848 characters. Normally it is used for things like mapping upper case to
2849 lower case:
2850
2851 @example
2852 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
2853 this example has mixed case!
2854 @end example
2855
2856 There are several options of interest:
2857
2858 @table @samp
2859 @item -c
2860 work on the complement of the listed characters, i.e.,
2861 operations apply to characters not in the given set
2862
2863 @item -d
2864 delete characters in the first set from the output
2865
2866 @item -s
2867 squeeze repeated characters in the output into just one character.
2868 @end table
2869
2870 We will be using all three options in a moment.
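
Before combining them, here is each option on its own.  The input
strings are invented for the example, and the range is written without
surrounding brackets, a form that GNU @code{tr} accepts.

@example
# -s: squeeze runs of the listed character into one.
squeezed=$(printf 'too    many   blanks\n' | tr -s ' ')

# -d: delete the listed characters.
deleted=$(printf 'hello, world!\n' | tr -d ',!')

# -c with -d: delete everything NOT listed (keep digits and newline).
digits=$(printf 'tel: 555-0100\n' | tr -cd '0-9\012')

echo "$squeezed"
echo "$deleted"
echo "$digits"
@end example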
2871
2872 The other command we'll look at is @code{comm}.  The @code{comm}
2873 command takes two sorted input files as input data, and prints out the
2874 files' lines in three columns.  The output columns are the data lines
2875 unique to the first file, the data lines unique to the second file, and
2876 the data lines that are common to both.  The @samp{-1}, @samp{-2}, and
2877 @samp{-3} command line options omit the respective columns. (This is
2878 non-intuitive and takes a little getting used to.)  For example:
2879
2880 @example
2881 $ cat f1
2882 11111
2883 22222
2884 33333
2885 44444
2886 $ cat f2
2887 00000
2888 22222
2889 33333
2890 55555
2891 $ comm f1 f2
2892         00000
2893 11111
2894                 22222
2895                 33333
2896 44444
2897         55555
2898 @end example
2899
A single dash (@samp{-}) given as a filename tells @code{comm} to read
standard input instead of a regular file.
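
To see the column options at work, this sketch recreates @file{f1} and
@file{f2} in a scratch directory (created with @samp{mktemp -d}, an
assumption about your system) and asks @code{comm} for one column at a
time.

@example
dir=$(mktemp -d)
printf '11111\n22222\n33333\n44444\n' > "$dir/f1"
printf '00000\n22222\n33333\n55555\n' > "$dir/f2"

# Suppress columns 1 and 2: what is left is the lines common to both.
common=$(comm -12 "$dir/f1" "$dir/f2")

# Suppress columns 2 and 3: lines found only in f1.
only_f1=$(comm -23 "$dir/f1" "$dir/f2")

# A dash reads the first "file" from standard input instead.
only_stdin=$(printf '22222\n99999\n' | comm -23 - "$dir/f2")

rm -rf "$dir"
echo "$common"
echo "$only_f1"
echo "$only_stdin"
@end example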
2902
2903 Now we're ready to build a fancy pipeline.  The first application is a word
2904 frequency counter.  This helps an author determine if he or she is over-using
2905 certain words.
2906
2907 The first step is to change the case of all the letters in our input file
to one case.  ``The'' and ``the'' are the same word for counting purposes.
2909
2910 @example
2911 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
2912 @end example
2913
2914 The next step is to get rid of punctuation.  Quoted words and unquoted words
2915 should be treated identically; it's easiest to just get the punctuation out of
2916 the way.
2917
2918 @example
2919 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
2920 @end example
2921
2922 The second @code{tr} command operates on the complement of the listed
2923 characters, which are all the letters, the digits, the underscore, and
2924 the blank.  The @samp{\012} represents the newline character; it has to
2925 be left alone.  (The ASCII TAB character should also be included for
2926 good measure in a production script.)
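
A variant that also keeps the TAB character might look like the sketch
below.  It uses the octal escape @samp{\011} for TAB, by analogy with
@samp{\012} for newline, and an invented input line; the brackets are
left out of the set, which GNU @code{tr} permits.

@example
# Keep letters, digits, underscore, blank, TAB (\011), and newline (\012);
# delete everything else.
cleaned=$(printf 'Hello,\t"World"!\n' | tr -cd 'A-Za-z0-9_ \011\012')

echo "$cleaned"
@end example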
2927
2928 At this point, we have data consisting of words separated by blank space.
2929 The words only contain alphanumeric characters (and the underscore).  The
next step is to break the data apart so that we have one word per line.  This
2931 makes the counting operation much easier, as we will see shortly.
2932
2933 @example
2934 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
2935 > tr -s '[ ]' '\012' | ...
2936 @end example
2937
2938 This command turns blanks into newlines.  The @samp{-s} option squeezes
2939 multiple newline characters in the output into just one.  This helps us
2940 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
2941 This is what the shell prints when it notices you haven't finished
2942 typing in all of a command.)
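
In isolation, with a made-up line containing uneven spacing, the step
behaves as in this sketch:

@example
# Every blank becomes a newline; -s then squeezes runs of newlines.
words=$(printf 'one  word   per line\n' | tr -s ' ' '\012')

echo "$words"
@end example

Without @samp{-s}, each extra blank in the input would show up as an
empty line in the output.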
2943
2944 We now have data consisting of one word per line, no punctuation, all one
2945 case.  We're ready to count each word:
2946
2947 @example
2948 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
2949 > tr -s '[ ]' '\012' | sort | uniq -c | ...
2950 @end example
2951
2952 At this point, the data might look something like this:
2953
2954 @example
2955   60 a
2956    2 able
2957    6 about
2958    1 above
2959    2 accomplish
2960    1 acquire
2961    1 actually
2962    2 additional
2963 @end example
2964
2965 The output is sorted by word, not by count!  What we want is the most
2966 frequently used words first.  Fortunately, this is easy to accomplish,
2967 with the help of two more @code{sort} options:
2968
2969 @table @samp
2970 @item -n
2971 do a numeric sort, not an ASCII one
2972
2973 @item -r
2974 reverse the order of the sort
2975 @end table
2976
2977 The final pipeline looks like this:
2978
2979 @example
2980 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
2981 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
2982  156 the
2983   60 a
2984   58 to
2985   51 of
2986   51 and
2987  ...
2988 @end example
2989
2990 Whew!  That's a lot to digest.  Yet, the same principles apply. With six
2991 commands, on two lines (really one long one split for convenience), we've
2992 created a program that does something interesting and useful, in much
2993 less time than we could have written a C program to do the same thing.
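
The same pipeline can be tried on a couple of lines of sample text,
which makes it easy to check the counts by hand.  The text and the
variable names in this sketch are invented, the character sets are
written without brackets (as GNU @code{tr} allows), and @code{head} is
added only to trim the output to the most frequent word.

@example
text='The cat and the dog.
The dog and THE bird!'

# Fold to lower case, drop punctuation, put one word per line,
# count each word, and list the most frequent first.
top=$(printf '%s\n' "$text" |
      tr 'A-Z' 'a-z' |
      tr -cd 'a-z0-9_ \012' |
      tr -s ' ' '\012' |
      sort | uniq -c | sort -nr |
      head -n 1)

echo "$top"
@end example

Here ``the'' appears four times once case is ignored, so it comes out on
top.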
2994
2995 A minor modification to the above pipeline can give us a simple spelling
2996 checker!  To determine if you've spelled a word correctly, all you have to
2997 do is look it up in a dictionary.  If it is not there, then chances are
2998 that your spelling is incorrect.  So, we need a dictionary.  If you
2999 have the Slackware Linux distribution, you have the file
3000 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
3001 dictionary.
3002
3003 Now, how to compare our file with the dictionary?  As before, we generate
3004 a sorted list of words, one per line:
3005
3006 @example
3007 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3008 > tr -s '[ ]' '\012' | sort -u | ...
3009 @end example
3010
3011 Now, all we need is a list of words that are @emph{not} in the
3012 dictionary.  Here is where the @code{comm} command comes in.
3013
3014 @example
3015 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3016 > tr -s '[ ]' '\012' | sort -u |
3017 > comm -23 - /usr/lib/ispell/ispell.words
3018 @end example
3019
3020 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
3021 dictionary (the second file), and lines that are in both files.  Lines
3022 only in the first file (standard input, our stream of words), are
3023 words that are not in the dictionary.  These are likely candidates for
3024 spelling errors.  This pipeline was the first cut at a production
3025 spelling checker on Unix.
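
To experiment without a system dictionary, the sketch below builds a
tiny sorted word list in a scratch file and runs the same pipeline
against it.  The word list, the sample sentence with its deliberate
typo, and the use of @code{mktemp} are all assumptions for the sake of
the example; the @samp{C} locale is forced so that @code{sort} and the
word list agree on ordering.

@example
LC_ALL=C
export LC_ALL

dict=$(mktemp)
printf 'and\ncat\ndog\nthe\n' > "$dict"    # must already be sorted

suspects=$(printf 'The cat and teh dog.\n' |
           tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \012' |
           tr -s ' ' '\012' | sort -u |
           comm -23 - "$dict")
rm -f "$dict"

echo "$suspects"
@end example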
3026
3027 There are some other tools that deserve brief mention.
3028
3029 @table @code
3030 @item grep
3031 search files for text that matches a regular expression
3032
3033 @item egrep
3034 like @code{grep}, but with more powerful regular expressions
3035
3036 @item wc
3037 count lines, words, characters
3038
3039 @item tee
3040 a T-fitting for data pipes, copies data to files and to standard output
3041
3042 @item sed
3043 the stream editor, an advanced tool
3044
3045 @item awk
3046 a data manipulation language, another advanced tool
3047 @end table
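
As a brief taste of three of these, the following sketch uses @code{tee}
to save a copy of an intermediate stage, @code{wc} to count lines, and
@code{grep} to count matching lines.  The login names and the scratch
file (from @code{mktemp}) are invented for the example.

@example
saved=$(mktemp)

# Sort the names, keep a copy of the sorted list, then count the unique ones.
unique_count=$(printf 'bill\narnold\nmiriam\narnold\n' |
               sort | tee "$saved" | uniq | wc -l)

# The saved copy still has the duplicates; count the lines matching "arnold".
arnold_sessions=$(grep -c '^arnold$' "$saved")
rm -f "$saved"

echo "$unique_count"
echo "$arnold_sessions"
@end example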
3048
3049 The software tools philosophy also espoused the following bit of
3050 advice: ``Let someone else do the hard part.'' This means, take
3051 something that gives you most of what you need, and then massage it the
3052 rest of the way until it's in the form that you want.
3053
3054 To summarize:
3055
3056 @enumerate 1
3057 @item
3058 Each program should do one thing well. No more, no less.
3059
3060 @item
3061 Combining programs with appropriate plumbing leads to results where
3062 the whole is greater than the sum of the parts.  It also leads to novel
3063 uses of programs that the authors might never have imagined.
3064
3065 @item
3066 Programs should never print extraneous header or trailer data, since these
3067 could get sent on down a pipeline. (A point we didn't mention earlier.)
3068
3069 @item
3070 Let someone else do the hard part.
3071
3072 @item
3073 Know your toolbox! Use each program appropriately. If you don't have an
3074 appropriate tool, build one.
3075 @end enumerate
3076
As of this writing, all the programs we've discussed are available via
anonymous @code{ftp} from @code{prep.ai.mit.edu} as
@file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was
current when this column was written.  Check the nearest GNU archive for
the current version.}
3082
3083 None of what I have presented in this column is new. The Software Tools
3084 philosophy was first introduced in the book @cite{Software Tools},
3085 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
3086 0-201-03669-X).   This book showed how to write and use software
3087 tools.   It was written in 1976, using a preprocessor for FORTRAN named
3088 @code{ratfor} (RATional FORtran).  At the time, C was not as ubiquitous
3089 as it is now; FORTRAN was.  The last chapter presented a @code{ratfor}
3090 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
3091 awful lot like C; if you know C, you won't have any problem following
3092 the code.
3093
3094 In 1981, the book was updated and made available as @cite{Software
3095 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7).  Both books
3096 remain in print, and are well worth reading if you're a programmer.
3097 They certainly made a major change in how I view programming.
3098
3099 Initially, the programs in both books were available (on 9-track tape)
3100 from Addison-Wesley.  Unfortunately, this is no longer the case,
3101 although you might be able to find copies floating around the Internet.
3102 For a number of years, there was an active Software Tools Users Group,
3103 whose members had ported the original @code{ratfor} programs to essentially
3104 every computer system with a FORTRAN compiler.  The popularity of the
group waned in the mid-1980s as Unix began to spread beyond universities.
3106
3107 With the current proliferation of GNU code and other clones of Unix programs,
3108 these programs now receive little attention; modern C versions are
3109 much more efficient and do more than these programs do.  Nevertheless, as
an exposition of good programming style, and as evangelism for a still-valuable
3111 philosophy, these books are unparalleled, and I recommend them highly.
3112
3113 Acknowledgement: I would like to express my gratitude to Brian Kernighan
3114 of Bell Labs, the original Software Toolsmith, for reviewing this column.
3115
3116
3117 @node Index
3118 @unnumbered Index
3119
3120 @printindex cp
3121
3122 @contents
3123 @bye
3124
3125 @c Local variables:
3126 @c texinfo-column-for-description: 32
3127 @c End: