doc/textutils.texi

   1 \input texinfo
   2 @c %**start of header
   3 @setfilename textutils.info
   4 @settitle GNU text utilities
   5 @c %**end of header
   6
   7 @include version.texi
   8
   9 @c Define new indices.
  10 @defcodeindex op
  11
  12 @c Put everything in one index (arbitrarily chosen to be the concept index).
  13 @syncodeindex fn cp
  14 @syncodeindex ky cp
  15 @syncodeindex op cp
  16 @syncodeindex pg cp
  17 @syncodeindex vr cp
  18
  19 @ifinfo
  20 @set Francois Franc,ois
  21 @end ifinfo
  22 @tex
  23 @set Francois Fran\noexpand\ptexc cois
  24 @end tex
  25
  26 @ifinfo
  27 @format
  28 START-INFO-DIR-ENTRY
  29 * Text utilities: (textutils).          GNU text utilities.
  30 * cat: (textutils)cat invocation.               Concatenate and write files.
  31 * cksum: (textutils)cksum invocation.           Print @sc{POSIX} CRC checksum.
  32 * comm: (textutils)comm invocation.             Compare sorted files by line.
  33 * csplit: (textutils)csplit invocation.         Split by context.
  34 * cut: (textutils)cut invocation.               Print selected parts of lines.
  35 * expand: (textutils)expand invocation.         Convert tabs to spaces.
  36 * fmt: (textutils)fmt invocation.               Reformat paragraph text.
  37 * fold: (textutils)fold invocation.             Wrap long input lines.
  38 * head: (textutils)head invocation.             Output the first part of files.
  39 * join: (textutils)join invocation.             Join lines on a common field.
  40 * md5sum: (textutils)md5sum invocation.         Print or check message-digests.
  41 * nl: (textutils)nl invocation.                 Number lines and write files.
  42 * od: (textutils)od invocation.                 Dump files in octal, etc.
  43 * paste: (textutils)paste invocation.           Merge lines of files.
  44 * pr: (textutils)pr invocation.                 Paginate or columnate files.
  45 * sort: (textutils)sort invocation.             Sort text files.
  46 * split: (textutils)split invocation.           Split into fixed-size pieces.
  47 * sum: (textutils)sum invocation.               Print traditional checksum.
  48 * tac: (textutils)tac invocation.               Reverse files.
  49 * tail: (textutils)tail invocation.             Output the last part of files.
  50 * tr: (textutils)tr invocation.                 Translate characters.
  51 * unexpand: (textutils)unexpand invocation.     Convert spaces to tabs.
  52 * uniq: (textutils)uniq invocation.             Uniqify files.
  53 * wc: (textutils)wc invocation.                 Byte, word, and line counts.
  54 END-INFO-DIR-ENTRY
  55 @end format
  56 @end ifinfo
  57
  58 @ifinfo
  59 This file documents the GNU text utilities.
  60
  61 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
  62
  63 Permission is granted to make and distribute verbatim copies of
  64 this manual provided the copyright notice and this permission notice
  65 are preserved on all copies.
  66
  67 @ignore
  68 Permission is granted to process this file through TeX and print the
  69 results, provided the printed document carries copying permission
  70 notice identical to this one except for the removal of this paragraph
  71 (this paragraph not being relevant to the printed manual).
  72
  73 @end ignore
  74 Permission is granted to copy and distribute modified versions of this
  75 manual under the conditions for verbatim copying, provided that the entire
  76 resulting derived work is distributed under the terms of a permission
  77 notice identical to this one.
  78
  79 Permission is granted to copy and distribute translations of this manual
  80 into another language, under the above conditions for modified versions,
  81 except that this permission notice may be stated in a translation approved
  82 by the Foundation.
  83 @end ifinfo
  84
  85 @titlepage
  86 @title GNU @code{textutils}
  87 @subtitle A set of text utilities
  88 @subtitle for version @value{VERSION}, @value{UPDATED}
  89 @author David MacKenzie et al.
  90
  91 @page
  92 @vskip 0pt plus 1filll
  93 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
  94
  95 Permission is granted to make and distribute verbatim copies of
  96 this manual provided the copyright notice and this permission notice
  97 are preserved on all copies.
  98
  99 Permission is granted to copy and distribute modified versions of this
 100 manual under the conditions for verbatim copying, provided that the entire
 101 resulting derived work is distributed under the terms of a permission
 102 notice identical to this one.
 103
 104 Permission is granted to copy and distribute translations of this manual
 105 into another language, under the above conditions for modified versions,
 106 except that this permission notice may be stated in a translation approved
 107 by the Foundation.
 108 @end titlepage
 109
 110
 111 @ifinfo
 112 @node Top
 113 @top GNU text utilities
 114
 115 @cindex text utilities
 116 @cindex utilities for text handling
 117
 118 This manual minimally documents version @value{VERSION} of the GNU text
 119 utilities.
 120
 121 @menu
 122 * Introduction::                       Caveats, overview, and authors.
 123 * Common options::                     Common options.
 124 * Output of entire files::             cat tac nl od
 125 * Formatting file contents::           fmt pr fold
 126 * Output of parts of files::           head tail split csplit
 127 * Summarizing files::                  wc sum cksum md5sum
 128 * Operating on sorted files::          sort uniq comm
 129 * Operating on fields within a line::  cut paste join
 130 * Operating on characters::            tr expand unexpand
 131 * Opening the software toolbox::       The software tools philosophy.
 132 * Index::                              General index.
 133 @end menu
 134 @end ifinfo
 135
 136
 137 @node Introduction
 138 @chapter Introduction
 139
 140 @cindex introduction
 141
 142 This manual is incomplete: No attempt is made to explain basic concepts
 143 in a way suitable for novices.  Thus, if you are interested, please get
 144 involved in improving this manual.  The entire GNU community will
 145 benefit.
 146
 147 @cindex POSIX.2
 148 The GNU text utilities are mostly compatible with the @sc{POSIX.2} standard.
 149
 150 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
 151 @c sh-utils.texi too -- so be sure to keep them consistent.
 152 @cindex bugs, reporting
 153 Please report bugs to @samp{bug-gnu-utils@@prep.ai.mit.edu}.  Remember
 154 to include the version number, machine architecture, input files, and
 155 any other information needed to reproduce the bug: your input, what you
 156 expected, what you got, and why it is wrong.  Diffs are welcome, but
 157 please include a description of the problem as well, since this is
 158 sometimes difficult to infer.  @xref{Bugs, , , gcc, GNU CC}.
 159
 160 This manual is based on the Unix man pages in the distribution, which
 161 were originally written by David MacKenzie and updated by Jim Meyering.
 162 The original @code{fmt} man page was written by Ross Paterson.
 163 @value{Francois} Pinard did the initial conversion to Texinfo format.
 164 Karl Berry did the indexing, some reorganization, and editing of the results.
 165 Richard Stallman contributed his usual invaluable insights to the
 166 overall process.
 167
 168
 169 @node Common options
 170 @chapter Common options
 171
 172 @cindex common options
 173
 174 Certain options are available in all these programs.  Rather than
 175 repeating identical descriptions for each program, we describe them
 176 here.  (In fact, every GNU program accepts (or should accept)
 177 these options.)
 178
 179 A few of these programs take arbitrary strings as arguments.  In those
 180 cases, @samp{--help} and @samp{--version} are taken as these options
 181 only if there is exactly one command line argument.
 182
 183 @table @samp
 184
 185 @item --help
 186 @opindex --help
 187 @cindex help, online
 188 Print a usage message listing all available options, then exit successfully.
 189
 190 @item --version
 191 @opindex --version
 192 @cindex version number, finding
 193 Print the version number, then exit successfully.
 194
 195 @end table
 196
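As a quick check of the behavior described above, the following sketch
confirms that both options exit successfully.  It assumes a POSIX shell
and a GNU @code{cat} on the search path; the variable names are purely
illustrative.

```shell
# Both common options should exit with status 0.
cat --version >/dev/null
version_status=$?
cat --help >/dev/null
help_status=$?
echo "version=$version_status help=$help_status"
```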
 197
 198 @node Output of entire files
 199 @chapter Output of entire files
 200
 201 @cindex output of entire files
 202 @cindex entire files, output of
 203
 204 These commands read and write entire files, possibly transforming them
 205 in some way.
 206
 207 @menu
 208 * cat invocation::              Concatenate and write files.
 209 * tac invocation::              Concatenate and write files in reverse.
 210 * nl invocation::               Number lines and write files.
 211 * od invocation::               Write files in octal or other formats.
 212 @end menu
 213
 214 @node cat invocation
 215 @section @code{cat}: Concatenate and write files
 216
 217 @pindex cat
 218 @cindex concatenate and write files
 219 @cindex copying files
 220
 221 @code{cat} copies each @var{file} (@samp{-} means standard input), or
 222 standard input if none are given, to standard output.  Synopsis:
 223
 224 @example
 225 cat [@var{option}]@dots{} [@var{file}]@dots{}
 226 @end example
 227
 228 The program accepts the following options.  Also see @ref{Common options}.
 229
 230 @table @samp
 231
 232 @item -A
 233 @itemx --show-all
 234 @opindex -A
 235 @opindex --show-all
 236 Equivalent to @samp{-vET}.
 237
 238 @item -b
 239 @itemx --number-nonblank
 240 @opindex -b
 241 @opindex --number-nonblank
 242 Number all nonblank output lines, starting with 1.
 243
 244 @item -e
 245 @opindex -e
 246 Equivalent to @samp{-vE}.
 247
 248 @item -E
 249 @itemx --show-ends
 250 @opindex -E
 251 @opindex --show-ends
 252 Display a @samp{$} after the end of each line.
 253
 254 @item -n
 255 @itemx --number
 256 @opindex -n
 257 @opindex --number
 258 Number all output lines, starting with 1.
 259
 260 @item -s
 261 @itemx --squeeze-blank
 262 @opindex -s
 263 @opindex --squeeze-blank
 264 @cindex squeezing blank lines
 265 Replace multiple adjacent blank lines with a single blank line.
 266
 267 @item -t
 268 @opindex -t
 269 Equivalent to @samp{-vT}.
 270
 271 @item -T
 272 @itemx --show-tabs
 273 @opindex -T
 274 @opindex --show-tabs
 275 Display @key{TAB} characters as @samp{^I}.
 276
 277 @item -u
 278 @opindex -u
 279 Ignored; for Unix compatibility.
 280
 281 @item -v
 282 @itemx --show-nonprinting
 283 @opindex -v
 284 @opindex --show-nonprinting
 285 Display control characters except for @key{LFD} and @key{TAB} using
 286 @samp{^} notation and precede characters that have the high bit set
 287 with @samp{M-}.
 288
 289 @end table
 290
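A short sketch of two of these options, assuming a POSIX shell with
@code{printf} and a GNU @code{cat}; the variable names are hypothetical
and exist only to hold the output.

```shell
# -A makes tabs and line ends visible: a TAB prints as ^I and each
# line is terminated with $.
shown=$(printf 'a\tb\n' | cat -A)
echo "$shown"        # a^Ib$

# -s squeezes a run of blank lines down to a single blank line.
squeezed=$(printf 'x\n\n\n\ny\n' | cat -s)
echo "$squeezed"
```

The second command should print @samp{x}, one empty line, and @samp{y}.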
 291
 292 @node tac invocation
 293 @section @code{tac}: Concatenate and write files in reverse
 294
 295 @pindex tac
 296 @cindex reversing files
 297
 298 @code{tac} copies each @var{file} (@samp{-} means standard input), or
 299 standard input if none are given, to standard output, reversing the
 300 records (lines by default) in each separately.  Synopsis:
 301
 302 @example
 303 tac [@var{option}]@dots{} [@var{file}]@dots{}
 304 @end example
 305
 306 @dfn{Records} are separated by instances of a string (newline by
 307 default).  By default, this separator string is attached to the end of
 308 the record that it follows in the file.
 309
 310 The program accepts the following options.  Also see @ref{Common options}.
 311
 312 @table @samp
 313
 314 @item -b
 315 @itemx --before
 316 @opindex -b
 317 @opindex --before
 318 The separator is attached to the beginning of the record that it
 319 precedes in the file.
 320
 321 @item -r
 322 @itemx --regex
 323 @opindex -r
 324 @opindex --regex
 325 Treat the separator string as a regular expression.
 326
 327 @item -s @var{separator}
 328 @itemx --separator=@var{separator}
 329 @opindex -s
 330 @opindex --separator
 331 Use @var{separator} as the record separator, instead of newline.
 332
 333 @end table
 334
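The record and separator handling can be tried out as follows.  This is
only a sketch, assuming a POSIX shell and a GNU @code{tac}; the variable
names are illustrative.

```shell
# Default: records are lines, written last to first.
reversed=$(printf 'one\ntwo\nthree\n' | tac)
echo "$reversed"

# -s: use a comma as the record separator.  The separator stays
# attached to the end of the record it follows.
by_comma=$(printf 'a,b,c,' | tac -s ,)
echo "$by_comma"     # c,b,a,
```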
 335
 336 @node nl invocation
 337 @section @code{nl}: Number lines and write files
 338
 339 @pindex nl
 340 @cindex numbering lines
 341 @cindex line numbering
 342
 343 @code{nl} writes each @var{file} (@samp{-} means standard input), or
 344 standard input if none are given, to standard output, with line numbers
 345 added to some or all of the lines.  Synopsis:
 346
 347 @example
 348 nl [@var{option}]@dots{} [@var{file}]@dots{}
 349 @end example
 350
 351 @cindex logical pages, numbering on
 352 @code{nl} decomposes its input into (logical) pages; by default, the
 353 line number is reset to 1 at the top of each logical page.  @code{nl}
 354 treats all of the input files as a single document; it does not reset
 355 line numbers or logical pages between files.
 356
 357 @cindex headers, numbering
 358 @cindex body, numbering
 359 @cindex footers, numbering
 360 A logical page consists of three sections: header, body, and footer.
 361 Any of the sections can be empty.  Each can be numbered in a different
 362 style from the others.
 363
 364 The beginnings of the sections of logical pages are indicated in the
 365 input file by a line containing exactly one of these delimiter strings:
 366
 367 @table @samp
 368 @item \:\:\:
 369 start of header;
 370 @item \:\:
 371 start of body;
 372 @item \:
 373 start of footer.
 374 @end table
 375
 376 The two characters from which these strings are made can be changed from
 377 @samp{\} and @samp{:} via options (see below), but the pattern and
 378 length of each string cannot be changed.
 379
 380 A section delimiter is replaced by an empty line on output.  Any text
 381 that comes before the first section delimiter string in the input file
 382 is considered to be part of a body section, so @code{nl} treats a
 383 file that contains no section delimiters as a single body section.
 384
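The section delimiters described above can be exercised with a small
input.  The sketch below assumes a POSIX shell and a GNU @code{nl}; the
sample text and the variable name are made up for illustration.

```shell
# One logical page: a header, a two-line body, and a footer.
# In the printf format, \\ produces a single backslash, so '\\:\\:'
# becomes the delimiter \:\: that starts a body section.
page=$(printf '\\:\\:\\:\nTitle\n\\:\\:\nbody one\nbody two\n\\:\nfooter\n' | nl)
printf '%s\n' "$page"
```

With the default styles, only the two body lines should carry numbers,
and each delimiter line should appear as an empty line in the output.
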
 385 The program accepts the following options.  Also see @ref{Common options}.
 386
 387 @table @samp
 388
 389 @item -b @var{style}
 390 @itemx --body-numbering=@var{style}
 391 @opindex -b
 392 @opindex --body-numbering
 393 Select the numbering style for lines in the body section of each
 394 logical page.  When a line is not numbered, the current line number
 395 is not incremented, but the line number separator character is still
 396 prepended to the line.  The styles are:
 397
 398 @table @samp
 399 @item a
 400 number all lines,
 401 @item t
 402 number only nonempty lines (default for body),
 403 @item n
 404 do not number lines (default for header and footer),
 405 @item p@var{regexp}
 406 number only lines that contain a match for @var{regexp}.
 407 @end table
 408
 409 @item -d @var{cd}
 410 @itemx --section-delimiter=@var{cd}
 411 @opindex -d
 412 @opindex --section-delimiter
 413 @cindex section delimiters of pages
 414 Set the section delimiter characters to @var{cd}; default is
 415 @samp{\:}.  If only @var{c} is given, the second character remains @samp{:}.
 416 (Remember to protect @samp{\} or other metacharacters from shell
 417 expansion with quotes or extra backslashes.)
 418
 419 @item -f @var{style}
 420 @itemx --footer-numbering=@var{style}
 421 @opindex -f
 422 @opindex --footer-numbering
 423 Analogous to @samp{--body-numbering}.
 424
 425 @item -h @var{style}
 426 @itemx --header-numbering=@var{style}
 427 @opindex -h
 428 @opindex --header-numbering
 429 Analogous to @samp{--body-numbering}.
 430
 431 @item -i @var{number}
 432 @itemx --page-increment=@var{number}
 433 @opindex -i
 434 @opindex --page-increment
 435 Increment line numbers by @var{number} (default 1).
 436
 437 @item -l @var{number}
 438 @itemx --join-blank-lines=@var{number}
 439 @opindex -l
 440 @opindex --join-blank-lines
 441 @cindex empty lines, numbering
 442 @cindex blank lines, numbering
 443 Consider @var{number} (default 1) consecutive empty lines to be one
 444 logical line for numbering, and only number the last one.  Where fewer
 445 than @var{number} consecutive empty lines occur, do not number them.
 446 An empty line is one that contains no characters, not even spaces
 447 or tabs.
 448
 449 @item -n @var{format}
 450 @itemx --number-format=@var{format}
 451 @opindex -n
 452 @opindex --number-format
 453 Select the line numbering format (default is @code{rn}):
 454
 455 @table @samp
 456 @item ln
 457 @opindex ln @r{format for @code{nl}}
 458 left justified, no leading zeros;
 459 @item rn
 460 @opindex rn @r{format for @code{nl}}
 461 right justified, no leading zeros;
 462 @item rz
 463 @opindex rz @r{format for @code{nl}}
 464 right justified, leading zeros.
 465 @end table
 466
 467 @item -p
 468 @itemx --no-renumber
 469 @opindex -p
 470 @opindex --no-renumber
 471 Do not reset the line number at the start of a logical page.
 472
 473 @item -s @var{string}
 474 @itemx --number-separator=@var{string}
 475 @opindex -s
 476 @opindex --number-separator
 477 Separate the line number from the text line in the output with
 478 @var{string} (default is @key{TAB}).
 479
 480 @item -v @var{number}
 481 @itemx --starting-line-number=@var{number}
 482 @opindex -v
 483 @opindex --starting-line-number
 484 Set the initial line number on each logical page to @var{number} (default 1).
 485
 486 @item -w @var{number}
 487 @itemx --number-width=@var{number}
 488 @opindex -w
 489 @opindex --number-width
 490 Use @var{number} characters for line numbers (default 6).
 491
 492 @end table
 493
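Several of the options above combine naturally.  A minimal sketch,
assuming a POSIX shell and a GNU @code{nl} (the variable name is
illustrative):

```shell
# Number every line (-b a), zero-padded (-n rz) to three digits (-w 3),
# with a colon and a space between the number and the text (-s).
numbered=$(printf 'x\n\ny\n' | nl -b a -n rz -w 3 -s ': ')
printf '%s\n' "$numbered"
```

The three output lines should be numbered 001, 002, and 003; the empty
input line is numbered too, because style @samp{a} numbers all lines.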
 494
 495 @node od invocation
 496 @section @code{od}: Write files in octal or other formats
 497
 498 @pindex od
 499 @cindex octal dump of files
 500 @cindex hex dump of files
 501 @cindex ASCII dump of files
 502 @cindex file contents, dumping unambiguously
 503
 504 @code{od} writes an unambiguous representation of each @var{file}
 505 (@samp{-} means standard input), or standard input if none are given.
 506 Synopsis:
 507
 508 @example
 509 od [@var{option}]@dots{} [@var{file}]@dots{}
 510 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
 511 @end example
 512
 513 Each line of output consists of the offset in the input, followed by
 514 groups of data from the file.  By default, @code{od} prints the offset in
 515 octal, and each group of file data is two bytes of input printed as a
 516 single octal number.
 517
 518 The program accepts the following options.  Also see @ref{Common options}.
 519
 520 @table @samp
 521
 522 @item -A @var{radix}
 523 @itemx --address-radix=@var{radix}
 524 @opindex -A
 525 @opindex --address-radix
 526 @cindex radix for file offsets
 527 @cindex file offset radix
 528 Select the base in which file offsets are printed.  @var{radix} can
 529 be one of the following:
 530
 531 @table @samp
 532 @item d
 533 decimal;
 534 @item o
 535 octal;
 536 @item x
 537 hexadecimal;
 538 @item n
 539 none (do not print offsets).
 540 @end table
 541
 542 The default is octal.
 543
 544 @item -j @var{bytes}
 545 @itemx --skip-bytes=@var{bytes}
 546 @opindex -j
 547 @opindex --skip-bytes
 548 Skip @var{bytes} input bytes before formatting and writing.  If
 549 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
 550 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
 551 in decimal.  Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
 552 by 1024, and @samp{m} by 1048576.
 553
 554 @item -N @var{bytes}
 555 @itemx --read-bytes=@var{bytes}
 556 @opindex -N
 557 @opindex --read-bytes
 558 Output at most @var{bytes} bytes of the input.  Prefixes and suffixes on
 559 @var{bytes} are interpreted as for the @samp{-j} option.
 560
 561 @item -s [@var{n}]
 562 @itemx --strings[=@var{n}]
 563 @opindex -s
 564 @opindex --strings
 565 @cindex string constants, outputting
 566 Instead of the normal output, output only @dfn{string constants}: at
 567 least @var{n} (3 by default) consecutive ASCII graphic characters,
 568 followed by a null (zero) byte.
 569
 570 @item -t @var{type}
 571 @itemx --format=@var{type}
 572 @opindex -t
 573 @opindex --format
 574 Select the format in which to output the file data.  @var{type} is a
 575 string of one or more of the below type indicator characters.  If you
 576 include more than one type indicator character in a single @var{type}
 577 string, or use this option more than once, @code{od} writes one copy
 578 of each output line using each of the data types that you specified,
 579 in the order that you specified.
 580
 581 @table @samp
 582 @item a
 583 named character,
 584 @item c
 585 ASCII character or backslash escape,
 586 @item d
 587 signed decimal,
 588 @item f
 589 floating point,
 590 @item o
 591 octal,
 592 @item u
 593 unsigned decimal,
 594 @item x
 595 hexadecimal.
 596 @end table
 597
 598 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
 599 newline, and @samp{nul} for a null (zero) byte.  Type @code{c} outputs
 600 @samp{ }, @samp{\n}, and @samp{\0}, respectively.
 601
 602 @cindex type size
 603 Except for types @samp{a} and @samp{c}, you can specify the number
 604 of bytes to use in interpreting each number in the given data type
 605 by following the type indicator character with a decimal integer.
 606 Alternatively, you can specify the size of one of the C compiler's
 607 built-in data types by following the type indicator character with
 608 one of the following characters.  For integers (@samp{d}, @samp{o},
 609 @samp{u}, @samp{x}):
 610
 611 @table @samp
 612 @item C
 613 char,
 614 @item S
 615 short,
 616 @item I
 617 int,
 618 @item L
 619 long.
 620 @end table
 621
 622 For floating point (@samp{f}):
 623
 624 @table @samp
 625 @item F
 626 float,
 627 @item D
 628 double,
 629 @item L
 630 long double.
 631 @end table
 632
 633 @item -v
 634 @itemx --output-duplicates
 635 @opindex -v
 636 @opindex --output-duplicates
 637 Output consecutive lines that are identical.  By default, when two or
 638 more consecutive output lines would be identical, @code{od} outputs only
 639 the first line, and puts just an asterisk on the following line to
 640 indicate the elision.
 641
 642 @item -w[@var{n}]
 643 @itemx --width[=@var{n}]
 644 @opindex -w
 645 @opindex --width
 646 Dump @var{n} input bytes per output line.  This must be a multiple of
 647 the least common multiple of the sizes associated with the specified
 648 output types.  If @var{n} is omitted, the default is 32.  If this option
 649 is not given at all, the default is 16.
 650
 651 @end table
 652
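The following sketch shows a few of these options together.  It assumes
a POSIX shell and a GNU @code{od}; the variable names are illustrative.
The unquoted @code{echo} arguments are deliberate: they collapse the
column padding that @code{od} emits.

```shell
# One hexadecimal byte per group (-t x1), no offset column (-A n).
hex=$(printf 'ABC\n' | od -A n -t x1)
echo $hex            # 41 42 43 0a

# Skip two bytes (-j 2), then dump at most three (-N 3).
window=$(printf 'abcdef' | od -A n -t x1 -j 2 -N 3)
echo $window         # 63 64 65
```
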
 653 The next several options map the old, pre-@sc{POSIX} format specification
 654 options to the corresponding @sc{POSIX} format specs.  GNU @code{od} accepts
 655 any combination of old- and new-style options.  Format specification
 656 options accumulate.
 657
 658 @table @samp
 659
 660 @item -a
 661 @opindex -a
 662 Output as named characters.  Equivalent to @samp{-ta}.
 663
 664 @item -b
 665 @opindex -b
 666 Output as octal bytes.  Equivalent to @samp{-toC}.
 667
 668 @item -c
 669 @opindex -c
 670 Output as ASCII characters or backslash escapes.  Equivalent to
 671 @samp{-tc}.
 672
 673 @item -d
 674 @opindex -d
 675 Output as unsigned decimal shorts.  Equivalent to @samp{-tu2}.
 676
 677 @item -f
 678 @opindex -f
 679 Output as floats.  Equivalent to @samp{-tfF}.
 680
 681 @item -h
 682 @opindex -h
 683 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 684
 685 @item -i
 686 @opindex -i
 687 Output as decimal shorts.  Equivalent to @samp{-td2}.
 688
 689 @item -l
 690 @opindex -l
 691 Output as decimal longs.  Equivalent to @samp{-td4}.
 692
 693 @item -o
 694 @opindex -o
 695 Output as octal shorts.  Equivalent to @samp{-to2}.
 696
 697 @item -x
 698 @opindex -x
 699 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 700
 701 @item -C
 702 @itemx --traditional
 703 @opindex --traditional
 704 Recognize the pre-@sc{POSIX} non-option arguments that traditional @code{od}
 705 accepted.  The following syntax:
 706
 707 @example
 708 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
 709 @end example
 710
 711 @noindent
 712 can be used to specify at most one file and optional arguments
 713 specifying an offset and a pseudo-start address, @var{label}.  By
 714 default, @var{offset} is interpreted as an octal number specifying how
 715 many input bytes to skip before formatting and writing.  The optional
 716 trailing decimal point forces the interpretation of @var{offset} as a
 717 decimal number.  If no decimal point is specified and the offset begins with
 718 @samp{0x} or @samp{0X} it is interpreted as a hexadecimal number.  If
 719 there is a trailing @samp{b}, the number of bytes skipped will be
 720 @var{offset} multiplied by 512.  The @var{label} argument is interpreted
 721 just like @var{offset}, but it specifies an initial pseudo-address.  The
 722 pseudo-addresses are displayed in parentheses following any normal
 723 address.
 724
 725 @end table
 726
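The equivalences listed above can be checked directly.  This sketch
assumes a POSIX shell and a GNU @code{od}; the variable names are
illustrative.

```shell
# Old-style -b should match -t oC, and old-style -c should match -t c.
old_bytes=$(printf 'hello\n' | od -b)
new_bytes=$(printf 'hello\n' | od -t oC)
old_chars=$(printf 'hello\n' | od -c)
new_chars=$(printf 'hello\n' | od -t c)

[ "$old_bytes" = "$new_bytes" ] && [ "$old_chars" = "$new_chars" ] && echo same
```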
 727
 728 @node Formatting file contents
 729 @chapter Formatting file contents
 730
 731 @cindex formatting file contents
 732
 733 These commands reformat the contents of files.
 734
 735 @menu
 736 * fmt invocation::              Reformat paragraph text.
 737 * pr invocation::               Paginate or columnate files for printing.
 738 * fold invocation::             Wrap input lines to fit in specified width.
 739 @end menu
 740
 741
 742 @node fmt invocation
 743 @section @code{fmt}: Reformat paragraph text
 744
 745 @pindex fmt
 746 @cindex reformatting paragraph text
 747 @cindex paragraphs, reformatting
 748 @cindex text, reformatting
 749
 750 @code{fmt} fills and joins lines to produce output lines of (at most)
 751 a given number of characters (75 by default).  Synopsis:
 752
 753 @example
 754 fmt [@var{option}]@dots{} [@var{file}]@dots{}
 755 @end example
 756
 757 @code{fmt} reads from the specified @var{file} arguments (or standard
 758 input if none are given), and writes to standard output.
 759
 760 By default, blank lines, spaces between words, and indentation are
 761 preserved in the output; successive input lines with different
 762 indentation are not joined; tabs are expanded on input and introduced on
 763 output.
 764
 765 @cindex line-breaking
 766 @cindex sentences and line-breaking
 767 @cindex Knuth, Donald E.
 768 @cindex Plass, Michael F.
 769 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
 770 avoid line breaks after the first word of a sentence or before the last
 771 word of a sentence.  A @dfn{sentence break} is defined as either the end
 772 of a paragraph or a word ending in any of @samp{.?!}, followed by two
 773 spaces or end of line, ignoring any intervening parentheses or quotes.
 774 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
 775 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
 776 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
 777 and Experience}, 11 (1981), 1119--1184).
 778
 779 The program accepts the following options.  Also see @ref{Common options}.
 780
 781 @table @samp
 782
 783 @item -c
 784 @itemx --crown-margin
 785 @opindex -c
 786 @opindex --crown-margin
 787 @cindex crown margin
 788 @dfn{Crown margin} mode: preserve the indentation of the first two
 789 lines within a paragraph, and align the left margin of each subsequent
 790 line with that of the second line.
 791
 792 @item -t
 793 @itemx --tagged-paragraph
 794 @opindex -t
 795 @opindex --tagged-paragraph
 796 @cindex tagged paragraphs
 797 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
 798 indentation of the first line of a paragraph is the same as the
 799 indentation of the second, the first line is treated as a one-line
 800 paragraph.
 801
 802 @item -s
 803 @itemx --split-only
 804 @opindex -s
 805 @opindex --split-only
 806 Split lines only.  Do not join short lines to form longer ones.  This
 807 prevents sample lines of code, and other such ``formatted'' text from
 808 being unduly combined.
 809
 810 @item -u
 811 @itemx --uniform-spacing
 812 @opindex -u
 813 @opindex --uniform-spacing
 814 Uniform spacing.  Reduce spacing between words to one space, and spacing
 815 between sentences to two spaces.
 816
 817 @item -@var{width}
 818 @itemx -w @var{width}
 819 @itemx --width=@var{width}
 820 @opindex -@var{width}
 821 @opindex -w
 822 @opindex --width
 823 Fill output lines up to @var{width} characters (default 75).  @code{fmt}
 824 initially tries to make lines about 7% shorter than this, to give it
 825 room to balance line lengths.
 826
 827 @item -p @var{prefix}
 828 @itemx --prefix=@var{prefix}
 829 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
 830 are subject to formatting.  The prefix and any preceding whitespace are
 831 stripped for the formatting and then re-attached to each formatted output
 832 line.  One use is to format certain kinds of program comments, while
 833 leaving the code unchanged.
 834
 835 @end table
 836
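A brief sketch of filling versus splitting, assuming a POSIX shell and a
GNU @code{fmt}; the sample text and variable names are made up.  Where
exactly @code{fmt} breaks the lines depends on its algorithm, so no
particular layout is claimed here, only the width limit.

```shell
# Refill one long line so that no output line exceeds 20 characters.
filled=$(printf 'the quick brown fox jumps over the lazy dog\n' | fmt -w 20)
printf '%s\n' "$filled"

# -s only splits long lines; short lines are never joined.
kept=$(printf 'short\nlines\n' | fmt -s)
printf '%s\n' "$kept"
```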
 837
 838 @node pr invocation
 839 @section @code{pr}: Paginate or columnate files for printing
 840
 841 @pindex pr
 842 @cindex printing, preparing files for
 843 @cindex multicolumn output, generating
 844
 845 @code{pr} writes each @var{file} (@samp{-} means standard input), or
 846 standard input if none are given, to standard output, paginating and
 847 optionally outputting in multicolumn format.  Synopsis:
 848
 849 @example
 850 pr [@var{option}]@dots{} [@var{file}]@dots{}
 851 @end example
 852
 853 By default, a 5-line header is printed: two blank lines; a line with the
 854 date, the file name, and the page count; and two more blank lines.  A
 855 5-line footer (entirely blank) is also printed.
 856
 857 Form feeds in the input cause page breaks in the output.
 858
 859 The program accepts the following options.  Also see @ref{Common options}.
 860
 861 @table @samp
 862
 863 @item +@var{page}
 864 Begin printing with page @var{page}.
 865
 866 @item -@var{column}
 867 @opindex -@var{column}
 868 Produce @var{column}-column output and print columns down.  The column
 869 width is automatically decreased as @var{column} increases; unless you
 870 use the @samp{-w} option to increase the page width as well, this option
 871 might well cause some input to be truncated.
 872
 873 @item -a
 874 @opindex -a
 875 @cindex across columns
 876 Print columns across rather than down.
 877
 878 @item -b
 879 @opindex -b
 880 @cindex balancing columns
 881 Balance columns on the last page.
 882
 883 @item -c
 884 @opindex -c
 885 Print control characters using hat notation (e.g., @samp{^G}); print
 886 other unprintable characters in octal backslash notation.  By default,
 887 unprintable characters are not changed.
 888
 889 @item -d
 890 @opindex -d
 891 @cindex double spacing
 892 Double space the output.
 893
 894 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
 895 @opindex -e
 896 @cindex input tabs
 897 Expand tabs to spaces on input.  Optional argument @var{in-tabchar} is
 898 the input tab character (default is @key{TAB}).  Second optional
 899 argument @var{in-tabwidth} is the input tab character's width (default
 900 is 8).
 901
 902 @item -f
 903 @itemx -F
 904 @opindex -F
 905 @opindex -f
 906 Use a formfeed instead of newlines to separate output pages.
 907
 908 @item -h @var{header}
 909 @opindex -h
 910 Replace the file name in the header with the string @var{header}.
 911
 912 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
 913 @opindex -i
 914 @cindex output tabs
 915 Replace spaces with tabs on output.  Optional argument @var{out-tabchar}
 916 is the output tab character (default is @key{TAB}).  Second optional
 917 argument @var{out-tabwidth} is the output tab character's width (default
 918 is 8).
 919
 920 @item -l @var{n}
 921 @opindex -l
 922 Set the page length to @var{n} (default 66) lines.  If @var{n} is less
 923 than 10, the headers and footers are omitted, as if the @samp{-t} option
 924 had been given.
 925
 926 @item -m
 927 @opindex -m
 928 Print all files in parallel, one in each column.
 929
 930 @item -n[@var{number-separator}[@var{digits}]]
 931 @opindex -n
 932 Precede each column with a line number; with parallel files (@samp{-m}),
 933 precede each line with a line number.  Optional argument
 934 @var{number-separator} is the character to print after each number
 935 (default is @key{TAB}).  Optional argument @var{digits} is the number of
 936 digits per line number (default is 5).
 937
 938 @item -o @var{n}
 939 @opindex -o
 940 @cindex indenting lines
 941 @cindex left margin
 942 Indent each line with a margin @var{n} spaces wide (default is zero),
 943 i.e., set the left margin.  The total page width is @var{n} plus the
 944 width set with the @samp{-w} option.
 945
 946 @item -r
 947 @opindex -r
 948 Do not print a warning message when an argument @var{file} cannot be
 949 opened.  (The exit status will still be nonzero, however.)
 950
 951 @item -s[@var{c}]
 952 @opindex -s
 953 Separate columns by the single character @var{c}.  If @var{c} is
 954 omitted, the default is space; if this option is omitted altogether, the
 955 default is @key{TAB}.
 956
 957 @item -t
 958 @opindex -t
 959 Do not print the usual 5-line header and the 5-line footer on each page,
 960 and do not fill out the bottoms of pages (with blank lines or
 961 formfeeds).
 962
 963 @item -v
 964 @opindex -v
 965 Print unprintable characters in octal backslash notation.
 966
 967 @item -w @var{n}
 968 @opindex -w
 969 Set the page width to @var{n} (default is 72) columns.
 970
 971 @end table
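
As a small illustration of two of the options above, the following
sketch suppresses the page header and footer with @samp{-t} and sets a
four-space left margin with @samp{-o}.  The sample text and the variable
name are invented for the example, and a POSIX shell is assumed.

```shell
# Indent two lines by four spaces, with no page header or footer.
indented=$(printf 'alpha\nbeta\n' | pr -t -o 4)
printf '%s\n' "$indented"
```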
 972
 973
 974 @node fold invocation
 975 @section @code{fold}: Wrap input lines to fit in specified width
 976
 977 @pindex fold
 978 @cindex wrapping long input lines
 979 @cindex folding long input lines
 980
 981 @code{fold} writes each @var{file} (@samp{-} means standard input), or
 982 standard input if none are given, to standard output, breaking long
 983 lines.  Synopsis:
 984
 985 @example
 986 fold [@var{option}]@dots{} [@var{file}]@dots{}
 987 @end example
 988
 989 By default, @code{fold} breaks lines wider than 80 columns.  The output
 990 is split into as many lines as necessary.
 991
 992 @cindex screen columns
 993 @code{fold} counts screen columns by default; thus, a tab may count more
 994 than one column, backspace decreases the column count, and carriage
 995 return sets the column to zero.
 996
 997 The program accepts the following options.  Also see @ref{Common options}.
 998
 999 @table @samp
1000
1001 @item -b
1002 @itemx --bytes
1003 @opindex -b
1004 @opindex --bytes
1005 Count bytes rather than columns, so that tabs, backspaces, and carriage
1006 returns are each counted as taking up one column, just like other
1007 characters.
1008
1009 @item -s
1010 @itemx --spaces
1011 @opindex -s
1012 @opindex --spaces
1013 Break at word boundaries: the line is broken after the last blank before
1014 the maximum line length.  If the line contains no such blanks, the line
1015 is broken at the maximum line length as usual.
1016
1017 @item -w @var{width}
1018 @itemx --width=@var{width}
1019 @opindex -w
1020 @opindex --width
1021 Use a maximum line length of @var{width} columns instead of 80.
1022
1023 @end table
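
The difference between the default behavior and @samp{-s} is easiest to
see on a short line.  The sketch below folds one sample sentence at 8
columns, first without and then with @samp{-s}; the sentence and the
variable names are made up for illustration.

```shell
line='wrap long input lines'

# Break at exactly 8 columns, even in the middle of a word.
hard=$(printf '%s\n' "$line" | fold -w 8)

# Break after the last blank that fits within 8 columns.
soft=$(printf '%s\n' "$line" | fold -s -w 8)

printf '%s\n' "$hard" '---' "$soft"
```

With @samp{-s} the blank that precedes each break stays at the end of
the output line, so some lines carry a trailing space.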
1024
1025
1026 @node Output of parts of files
1027 @chapter Output of parts of files
1028
1029 @cindex output of parts of files
1030 @cindex parts of files, output of
1031
1032 These commands output pieces of the input.
1033
1034 @menu
1035 * head invocation::             Output the first part of files.
1036 * tail invocation::             Output the last part of files.
1037 * split invocation::            Split a file into fixed-size pieces.
1038 * csplit invocation::           Split a file into context-determined pieces.
1039 @end menu
1040
1041 @node head invocation
1042 @section @code{head}: Output the first part of files
1043
1044 @pindex head
1045 @cindex initial part of files, outputting
1046 @cindex first part of files, outputting
1047
1048 @code{head} prints the first part (10 lines by default) of each
1049 @var{file}; it reads from standard input if no files are given or
1050 when given a @var{file} of @samp{-}.  Synopses:
1051
1052 @example
1053 head [@var{option}]@dots{} [@var{file}]@dots{}
1054 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1055 @end example
1056
1057 If more than one @var{file} is specified, @code{head} prints a
1058 one-line header consisting of
1059 @example
1060 ==> @var{file name} <==
1061 @end example
1062 @noindent
1063 before the output for each @var{file}.
1064
1065 @code{head} accepts two option formats: the new one, in which numbers
1066 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1067 the number precedes any option letters (@samp{-1q}).
1068
1069 The program accepts the following options.  Also see @ref{Common options}.
1070
1071 @table @samp
1072
1073 @item -@var{count}@var{options}
1074 @opindex -@var{count}
1075 This option is only recognized if it is specified first.  @var{count} is
1076 a decimal number optionally followed by a size letter (@samp{b},
1077 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1078 or other option letters (@samp{cqv}).
1079
1080 @item -c @var{bytes}
1081 @itemx --bytes=@var{bytes}
1082 @opindex -c
1083 @opindex --bytes
1084 Print the first @var{bytes} bytes, instead of initial lines.  Appending
1085 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1086 by 1048576.
1087
1088 @item -n @var{n}
1089 @itemx --lines=@var{n}
1090 @opindex -n
1091 @opindex --lines
1092 Output the first @var{n} lines.
1093
1094 @item -q
1095 @itemx --quiet
1096 @itemx --silent
1097 @opindex -q
1098 @opindex --quiet
1099 @opindex --silent
1100 Never print file name headers.
1101
1102 @item -v
1103 @itemx --verbose
1104 @opindex -v
1105 @opindex --verbose
1106 Always print file name headers.
1107
1108 @end table
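
A minimal sketch of the options above, assuming a POSIX shell and the
new option format.  The file names and contents are hypothetical and
live in a scratch directory that the script removes again.

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/a.txt"
printf 'uno\ndos\ntres\n' > "$tmp/b.txt"

# First two lines of one file.
first_two=$(head -n 2 "$tmp/a.txt")

# First five bytes instead of lines (the newline counts as a byte).
first_bytes=$(head -c 5 "$tmp/a.txt")

# With several files a header names each one; -q suppresses it.
quiet=$(head -q -n 1 "$tmp/a.txt" "$tmp/b.txt")

printf '%s\n' "$first_two" "$first_bytes" "$quiet"
rm -rf "$tmp"
```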
1109
1110
1111 @node tail invocation
1112 @section @code{tail}: Output the last part of files
1113
1114 @pindex tail
1115 @cindex last part of files, outputting
1116
1117 @code{tail} prints the last part (10 lines by default) of each
1118 @var{file}; it reads from standard input if no files are given or
1119 when given a @var{file} of @samp{-}.  Synopses:
1120
1121 @example
1122 tail [@var{option}]@dots{} [@var{file}]@dots{}
1123 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1124 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1125 @end example
1126
1127 If more than one @var{file} is specified, @code{tail} prints a
1128 one-line header consisting of
1129 @example
1130 ==> @var{file name} <==
1131 @end example
1132 @noindent
1133 before the output for each @var{file}.
1134
1135 @cindex BSD @code{tail}
1136 GNU @code{tail} can output any amount of data (some other versions of
1137 @code{tail} cannot).  It also has no @samp{-r} option (print in
1138 reverse), since reversing a file is really a different job from printing
1139 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1140 only reverse files that are at most as large as its buffer, which is
1141 typically 32k.  A more reliable and versatile way to reverse files is
1142 the GNU @code{tac} command.
1143
1144 @code{tail} accepts two option formats: the new one, in which numbers
1145 are arguments to the options (@samp{-n 1}), and the old one, in which
1146 the number precedes any option letters (@samp{-1} or @samp{+1}).
1147
1148 If any option-argument is a number @var{n} starting with a @samp{+},
1149 @code{tail} begins printing with the @var{n}th item from the start of
1150 each file, instead of from the end.
1151
1152 The program accepts the following options.  Also see @ref{Common options}.
1153
1154 @table @samp
1155
1156 @item -@var{count}
1157 @itemx +@var{count}
1158 @opindex -@var{count}
1159 @opindex +@var{count}
1160 This option is only recognized if it is specified first.  @var{count} is
1161 a decimal number optionally followed by a size letter (@samp{b},
1162 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1163 or other option letters (@samp{cfqv}).
1164
1165 @item -c @var{bytes}
1166 @itemx --bytes=@var{bytes}
1167 @opindex -c
1168 @opindex --bytes
1169 Output the last @var{bytes} bytes, instead of final lines.  Appending
1170 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1171 by 1048576.
1172
1173 @item -f
1174 @itemx --follow
1175 @opindex -f
1176 @opindex --follow
1177 @cindex growing files
1178 Loop forever trying to read more characters at the end of the file,
1179 presumably because the file is growing.  Ignored if reading from a pipe.
1180 If more than one file is given, @code{tail} prints a header whenever it
1181 gets output from a different file, to indicate which file that output is
1182 from.
1183
1184 @item -n @var{n}
1185 @itemx --lines=@var{n}
1186 @opindex -n
1187 @opindex --lines
1188 Output the last @var{n} lines.
1189
1190 @item -q
1191 @itemx --quiet
1192 @itemx --silent
1193 @opindex -q
1194 @opindex --quiet
1195 @opindex --silent
1196 Never print file name headers.
1197
1198 @item -v
1199 @itemx --verbose
1200 @opindex -v
1201 @opindex --verbose
1202 Always print file name headers.
1203
1204 @end table
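
The following sketch contrasts counting from the end of a file with
counting from its start (a number that begins with @samp{+}).  It
assumes a POSIX shell and the new option format; the five-line sample
file is invented for the example.

```shell
tmp=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$tmp/numbers"

# Last two lines.
last_two=$(tail -n 2 "$tmp/numbers")

# Everything from the third line onward.
from_third=$(tail -n +3 "$tmp/numbers")

# Last four bytes; each newline counts as one byte.
last_bytes=$(tail -c 4 "$tmp/numbers")

printf '%s\n' "$last_two" '---' "$from_third" '---' "$last_bytes"
rm -rf "$tmp"
```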
1205
1206
1207 @node split invocation
1208 @section @code{split}: Split a file into fixed-size pieces
1209
1210 @pindex split
1211 @cindex splitting a file into pieces
1212 @cindex pieces, splitting a file into
1213
1214 @code{split} creates output files containing consecutive sections of
1215 @var{input} (standard input if none is given or @var{input} is
1216 @samp{-}).  Synopsis:
1217
1218 @example
1219 split [@var{option}] [@var{input} [@var{prefix}]]
1220 @end example
1221
1222 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1223 left over for the last section) into each output file.
1224
1225 @cindex output file name prefix
1226 The output files' names consist of @var{prefix} (@samp{x} by default)
1227 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1228 that concatenating the output files in sorted order by file name produces
1229 the original input file.  (If more than 676 output files are required,
1230 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
1231
1232 The program accepts the following options.  Also see @ref{Common options}.
1233
1234 @table @samp
1235
1236 @item -@var{lines}
1237 @itemx -l @var{lines}
1238 @itemx --lines=@var{lines}
1239 @opindex -l
1240 @opindex --lines
1241 Put @var{lines} lines of @var{input} into each output file.
1242
1243 @item -b @var{bytes}
1244 @itemx --bytes=@var{bytes}
1245 @opindex -b
1246 @opindex --bytes
1247 Put @var{bytes} bytes of @var{input} into each output file.
1248 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1249 @samp{m} by 1048576.
1250
1251 @item -C @var{bytes}
1252 @itemx --line-bytes=@var{bytes}
1253 @opindex -C
1254 @opindex --line-bytes
1255 Put into each output file as many complete lines of @var{input} as
1256 possible without exceeding @var{bytes} bytes.  For lines longer than
1257 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1258 less than @var{bytes} bytes of the line are left, then continue
1259 normally.  @var{bytes} has the same format as for the @samp{--bytes}
1260 option.
1261
1262 @item --verbose
1263 @opindex --verbose
1264 Write a diagnostic to standard error just before each output file is opened.
1265
1266 @end table
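
To make the naming scheme concrete, this sketch splits a five-line file
into two-line pieces and then checks that concatenating the pieces in
name order gives back the input.  The prefix @samp{part.} and the file
names are assumptions chosen for the example.

```shell
tmp=$(mktemp -d)
printf 'l1\nl2\nl3\nl4\nl5\n' > "$tmp/input"

# Two lines per piece; output names start with the prefix "part.".
(cd "$tmp" && split -l 2 input part.)

pieces=$(cd "$tmp" && ls part.*)
rejoined=$(cd "$tmp" && cat part.*)
original=$(cat "$tmp/input")

printf '%s\n' "$pieces"
if [ "$rejoined" = "$original" ]; then echo 'pieces rejoin cleanly'; fi
rm -rf "$tmp"
```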
1267
1268
1269 @node csplit invocation
1270 @section @code{csplit}: Split a file into context-determined pieces
1271
1272 @pindex csplit
1273 @cindex context splitting
1274 @cindex splitting a file into pieces by context
1275
1276 @code{csplit} creates zero or more output files containing sections of
1277 @var{input} (standard input if @var{input} is @samp{-}).  Synopsis:
1278
1279 @example
1280 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1281 @end example
1282
1283 The contents of the output files are determined by the @var{pattern}
1284 arguments, as detailed below.  An error occurs if a @var{pattern}
1285 argument refers to a nonexistent line of the input file (e.g., if no
1286 remaining line matches a given regular expression).  After every
1287 @var{pattern} has been matched, any remaining input is copied into one
1288 last output file.
1289
1290 By default, @code{csplit} prints the number of bytes written to each
1291 output file after it has been created.
1292
1293 The types of pattern arguments are:
1294
1295 @table @samp
1296
1297 @item @var{n}
1298 Create an output file containing the input up to but not including line
1299 @var{n} (a positive integer).  If followed by a repeat count, also
1300 create an output file containing the next @var{n} lines of the input
1301 file once for each repeat.
1302
1303 @item /@var{regexp}/[@var{offset}]
1304 Create an output file containing the current line up to (but not
1305 including) the next line of the input file that contains a match for
1306 @var{regexp}.  The optional @var{offset} is a @samp{+} or @samp{-}
1307 followed by a positive integer.  If it is given, the input up to the
1308 matching line plus or minus @var{offset} is put into the output file,
1309 and the line after that begins the next section of input.
1310
1311 @item %@var{regexp}%[@var{offset}]
1312 Like the previous type, except that it does not create an output
1313 file, so that section of the input file is effectively ignored.
1314
1315 @item @{@var{repeat-count}@}
1316 Repeat the previous pattern @var{repeat-count} additional
1317 times.  @var{repeat-count} can be either a positive integer or an
1318 asterisk, meaning repeat as many times as necessary until the input is
1319 exhausted.
1320
1321 @end table
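
As a sketch of how a regular-expression pattern and a repeat count
combine, the script below splits a small file before each line that
starts with @samp{#}.  The input text is hypothetical, and @samp{-s}
(described with the options) is used only to keep the run quiet.

```shell
tmp=$(mktemp -d)
printf 'intro\n# A\na1\n# B\nb1\n' > "$tmp/input"

# Split before each line that starts with "#".  The repeat count in
# braces applies the pattern one additional time; -s hides byte counts.
(cd "$tmp" && csplit -s input '/^#/' '{1}')

names=$(cd "$tmp" && ls xx*)
second=$(cat "$tmp/xx01")

printf '%s\n' "$names" '---' "$second"
rm -rf "$tmp"
```

The text before the first match goes to the first piece, each matching
line starts a new piece, and the input left after the last match lands
in one final piece.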
1322
1323 The output files' names consist of a prefix (@samp{xx} by default)
1324 followed by a suffix.  By default, the suffix is an ascending sequence
1325 of two-digit decimal numbers from @samp{00} up to @samp{99}.  In any
1326 case, concatenating the output files in sorted order by filename
1327 produces the original input file.
1328
1329 By default, if @code{csplit} encounters an error or receives a hangup,
1330 interrupt, quit, or terminate signal, it removes any output files
1331 that it has created so far before it exits.
1332
1333 The program accepts the following options.  Also see @ref{Common options}.
1334
1335 @table @samp
1336
1337 @item -f @var{prefix}
1338 @itemx --prefix=@var{prefix}
1339 @opindex -f
1340 @opindex --prefix
1341 @cindex output file name prefix
1342 Use @var{prefix} as the output file name prefix.
1343
1344 @item -b @var{suffix}
1345 @itemx --suffix=@var{suffix}
1346 @opindex -b
1347 @opindex --suffix
1348 @cindex output file name suffix
1349 Use @var{suffix} as the output file name suffix.  When this option is
1350 specified, the suffix string must include exactly one
1351 @code{printf(3)}-style conversion specification, possibly including
1352 format specification flags, a field width, a precision specification,
1353 or all of these kinds of modifiers.  The format letter must convert a
1354 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1355 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed.  The
1356 entire @var{suffix} is given (with the current output file number) to
1357 @code{sprintf(3)} to form the file name suffixes for each of the
1358 individual output files in turn.  If this option is used, the
1359 @samp{--digits} option is ignored.
1360
1361 @item -n @var{digits}
1362 @itemx --digits=@var{digits}
1363 @opindex -n
1364 @opindex --digits
1365 Use output file names containing numbers that are @var{digits} digits
1366 long instead of the default 2.
1367
1368 @item -k
1369 @itemx --keep-files
1370 @opindex -k
1371 @opindex --keep-files
1372 Do not remove output files when errors are encountered.
1373
1374 @item -z
1375 @itemx --elide-empty-files
1376 @opindex -z
1377 @opindex --elide-empty-files
1378 Suppress the generation of zero-length output files.  (In cases where
1379 the section delimiters of the input file are supposed to mark the first
1380 lines of each of the sections, the first output file will generally be a
1381 zero-length file unless you use this option.)  The output file sequence
1382 numbers always run consecutively starting from 0, even when this option
1383 is specified.
1384
1385 @item -s
1386 @itemx -q
1387 @itemx --silent
1388 @itemx --quiet
1389 @opindex -s
1390 @opindex -q
1391 @opindex --silent
1392 @opindex --quiet
1393 Do not print counts of output file sizes.
1394
1395 @end table
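
The options can be combined, as in this sketch: a custom prefix, a
@code{printf}-style suffix, and @samp{-z} to drop the empty piece that
would otherwise precede a delimiter on the first line.  The prefix
@samp{sec-}, the suffix format, and the input are assumptions made for
the example.

```shell
tmp=$(mktemp -d)
printf '# A\na1\n# B\nb1\n' > "$tmp/input"

# -f sets the prefix, -b the printf-style suffix, and -z drops the
# zero-length piece before the first "#" line.
(cd "$tmp" && csplit -s -z -f sec- -b '%03d.txt' input '/^#/' '{*}')

names=$(cd "$tmp" && ls sec-*)
first=$(cat "$tmp/sec-000.txt")

printf '%s\n' "$names" '---' "$first"
rm -rf "$tmp"
```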
1396
1397
1398 @node Summarizing files
1399 @chapter Summarizing files
1400
1401 @cindex summarizing files
1402
1403 These commands generate just a few numbers representing entire
1404 contents of files.
1405
1406 @menu
1407 * wc invocation::               Print byte, word, and line counts.
1408 * sum invocation::              Print checksum and block counts.
1409 * cksum invocation::            Print CRC checksum and byte counts.
1410 * md5sum invocation::           Print or check message-digests.
1411 @end menu
1412
1413
1414 @node wc invocation
1415 @section @code{wc}: Print byte, word, and line counts
1416
1417 @pindex wc
1418 @cindex byte count
1419 @cindex word count
1420 @cindex line count
1421
1422 @code{wc} counts the number of bytes, whitespace-separated words, and
1423 newlines in each given @var{file}, or standard input if none are given
1424 or for a @var{file} of @samp{-}.  Synopsis:
1425
1426 @example
1427 wc [@var{option}]@dots{} [@var{file}]@dots{}
1428 @end example
1429
1430 @cindex total counts
1431 @code{wc} prints one line of counts for each file, and if the file was
1432 given as an argument, it prints the file name following the counts.  If
1433 more than one @var{file} is given, @code{wc} prints a final line
1434 containing the cumulative counts, with the file name @file{total}.  The
1435 counts are printed in this order: newlines, words, bytes.
1436
1437 By default, @code{wc} prints all three counts.  Options can specify
1438 that only certain counts be printed.  Options do not undo others
1439 previously given, so
1440
1441 @example
1442 wc --bytes --words
1443 @end example
1444
1445 @noindent
1446 prints both the byte counts and the word counts.
1447
1448 The program accepts the following options.  Also see @ref{Common options}.
1449
1450 @table @samp
1451
1452 @item -c
1453 @itemx --bytes
1454 @itemx --chars
1455 @opindex -c
1456 @opindex --bytes
1457 @opindex --chars
1458 Print only the byte counts.
1459
1460 @item -w
1461 @itemx --words
1462 @opindex -w
1463 @opindex --words
1464 Print only the word counts.
1465
1466 @item -l
1467 @itemx --lines
1468 @opindex -l
1469 @opindex --lines
1470 Print only the newline counts.
1471
1472 @end table
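
A brief sketch of the individual counts and of combined options, using
a two-line sample file invented for the example.  Reading from standard
input keeps the file name out of the output.

```shell
tmp=$(mktemp -d)
printf 'one two three\nfour five\n' > "$tmp/sample"

lines=$(wc -l < "$tmp/sample")
words=$(wc -w < "$tmp/sample")
bytes=$(wc -c < "$tmp/sample")

# Options accumulate, and the column order is fixed (newlines, words,
# bytes) regardless of the order in which the options are given.
set -- $(wc --bytes --words < "$tmp/sample")
words_then_bytes="$1 $2"

printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"
printf 'combined: %s\n' "$words_then_bytes"
rm -rf "$tmp"
```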
1473
1474
1475 @node sum invocation
1476 @section @code{sum}: Print checksum and block counts
1477
1478 @pindex sum
1479 @cindex 16-bit checksum
1480 @cindex checksum, 16-bit
1481
1482 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1483 standard input if none are given or for a @var{file} of @samp{-}.  Synopsis:
1484
1485 @example
1486 sum [@var{option}]@dots{} [@var{file}]@dots{}
1487 @end example
1488
1489 @code{sum} prints the checksum for each @var{file} followed by the
1490 number of blocks in the file (rounded up).  If more than one @var{file}
1491 is given, file names are also printed (by default).  (With the
1492 @samp{--sysv} option, the corresponding file names are printed when there is
1493 at least one file argument.)
1494
1495 By default, GNU @code{sum} computes checksums using an algorithm
1496 compatible with BSD @code{sum} and prints file sizes in units of
1497 1024-byte blocks.
1498
1499 The program accepts the following options.  Also see @ref{Common options}.
1500
1501 @table @samp
1502
1503 @item -r
1504 @opindex -r
1505 @cindex BSD @code{sum}
1506 Use the default (BSD compatible) algorithm.  This option is included for
1507 compatibility with the System V @code{sum}.  Unless @samp{-s} was also
1508 given, it has no effect.
1509
1510 @item -s
1511 @itemx --sysv
1512 @opindex -s
1513 @opindex --sysv
1514 @cindex System V @code{sum}
1515 Compute checksums using an algorithm compatible with System V
1516 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1517
1518 @end table
1519
1520 @code{sum} is provided for compatibility; the @code{cksum} program (see
1521 next section) is preferable in new applications.
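
The two block sizes can be compared on a file that is one byte larger
than 1024 bytes, as in the following sketch.  The file is filled with
padding invented for the example, and only the block counts (the second
output field) are examined.

```shell
tmp=$(mktemp -d)
# 1025 bytes of padding: one byte more than a single 1024-byte block.
printf '%1025s' '' > "$tmp/data"

# Each output line is "checksum blocks"; keep only the block count.
set -- $(sum < "$tmp/data")
bsd_blocks=$2
set -- $(sum -s < "$tmp/data")
sysv_blocks=$2

printf 'BSD: %s blocks, System V: %s blocks\n' "$bsd_blocks" "$sysv_blocks"
rm -rf "$tmp"
```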
1522
1523
1524 @node cksum invocation
1525 @section @code{cksum}: Print CRC checksum and byte counts
1526
1527 @pindex cksum
1528 @cindex cyclic redundancy check
1529 @cindex CRC checksum
1530
1531 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1532 given @var{file}, or standard input if none are given or for a
1533 @var{file} of @samp{-}.  Synopsis:
1534
1535 @example
1536 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1537 @end example
1538
1539 @code{cksum} prints the CRC checksum for each file along with the number
1540 of bytes in the file, and the filename unless no arguments were given.
1541
1542 @code{cksum} is typically used to ensure that files
1543 transferred by unreliable means (e.g., netnews) have not been corrupted,
1544 by comparing the @code{cksum} output for the received files with the
1545 @code{cksum} output for the original files (typically given in the
1546 distribution).
1547
1548 The CRC algorithm is specified by the @sc{POSIX.2} standard.  It is not
1549 compatible with the BSD or System V @code{sum} algorithms (see the
1550 previous section); it is more robust.
1551
1552 The only options are @samp{--help} and @samp{--version}.  @xref{Common
1553 options}.
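
The comparison described above might be scripted as follows.  This is
only a sketch: the three files stand in for an original, a faithful
copy, and a copy with one altered byte, and all names are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'payload\n' > "$tmp/original"
cp "$tmp/original" "$tmp/received"
printf 'payloaD\n' > "$tmp/damaged"

# Reading from standard input keeps file names out of the output,
# so the "checksum bytes" lines can be compared directly.
sum_original=$(cksum < "$tmp/original")
sum_received=$(cksum < "$tmp/received")
sum_damaged=$(cksum < "$tmp/damaged")

if [ "$sum_original" = "$sum_received" ]; then echo 'received: intact'; fi
if [ "$sum_original" != "$sum_damaged" ]; then echo 'damaged: corrupted'; fi
rm -rf "$tmp"
```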
1554
1555
1556 @node md5sum invocation
1557 @section @code{md5sum}: Print or check message-digests
1558
1559 @pindex md5sum
1560 @cindex 128-bit checksum
1561 @cindex checksum, 128-bit
1562 @cindex fingerprint, 128-bit
1563 @cindex message-digest, 128-bit
1564
1565 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1566 @dfn{message-digest}) for each specified @var{file}.
1567 If a @var{file} is specified as @samp{-}, or if no files are given,
1568 @code{md5sum} computes the checksum for the standard input.
1569 @code{md5sum} can also determine whether a file and checksum are
1570 consistent.  Synopses:
1571
1572 @example
1573 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1574 md5sum [@var{option}]@dots{} --check [@var{file}]
1575 md5sum [@var{option}]@dots{} --string=@var{string} @dots{}
1576 @end example
1577
1578 For each @var{file}, @samp{md5sum} outputs the MD5 checksum, a flag
1579 indicating a binary or text input file, and the filename.
1580 If @var{file} is omitted or specified as @samp{-}, standard input is read.
1581
1582 The program accepts the following options.  Also see @ref{Common options}.
1583
1584 @table @samp
1585
1586 @item -b
1587 @itemx --binary
1588 @opindex -b
1589 @opindex --binary
1590 @cindex binary input files
1591 Treat all input files as binary.  This option has no effect on Unix
1592 systems, since they don't distinguish between binary and text files.
1593 This option is useful on systems that have different internal and
1594 external character representations.
1595
1596 @item -c
1597 @itemx --check
1598 Read filenames and checksum information from the single @var{file}
1599 (or from stdin if no @var{file} was specified) and report whether
1600 each named file and the corresponding checksum data are consistent.
1601 The input to this mode of @code{md5sum} is usually the output of
1602 a prior, checksum-generating run of @samp{md5sum}.
1603 Each valid line of input consists of an MD5 checksum, a binary/text
1604 flag, and then a filename.
1605 Binary files are marked with @samp{*}, text with @samp{ }.
1606 For each such line, @code{md5sum} reads the named file and computes its
1607 MD5 checksum.  Then, if the computed message digest does not match the
1608 one on the line with the filename, the file is noted as having
1609 failed the test.  Otherwise, the file passes the test.
1610 By default, for each valid line, one line is written to standard
1611 output indicating whether the named file passed the test.
1612 After all checks have been performed, if there were any failures,
1613 a warning is issued to standard error.
1614 Use the @samp{--status} option to inhibit that output.
1615 If any listed file cannot be opened or read, if any valid line has
1616 an MD5 checksum inconsistent with the associated file, or if no valid
1617 line is found, @code{md5sum} exits with nonzero status.  Otherwise,
1618 it exits successfully.
1619
1620 @item --status
1621 @opindex --status
1622 @cindex verifying MD5 checksums
1623 This option is useful only when verifying checksums.
1624 When verifying checksums, don't generate the default one-line-per-file
1625 diagnostic and don't output the warning summarizing any failures.
1626 Failures to open or read a file still evoke individual diagnostics to
1627 standard error.
1628 If all listed files are readable and are consistent with the associated
1629 MD5 checksums, exit successfully.  Otherwise exit with a status code
1630 indicating there was a failure.
1631
1632 @item --string=@var{string}
1633 @opindex --string
1634 Compute the message digest for @var{string}, instead of for a file.  The
1635 result is the same as for a file that contains exactly @var{string}.
1636
1637 @item -t
1638 @itemx --text
1639 @opindex -t
1640 @opindex --text
1641 @cindex text input files
1642 Treat all input files as text files.  This is the reverse of
1643 @samp{--binary}.
1644
1645 @item -w
1646 @itemx --warn
1647 @opindex -w
1648 @opindex --warn
1649 @cindex verifying MD5 checksums
1650 When verifying checksums, warn about improperly formatted MD5 checksum lines.
1651 This option is useful only if all but a few lines in the checked input
1652 are valid.
1653
1654 @end table
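
A typical generate-then-verify cycle might look like the sketch below.
The file name @file{greeting}, the list name @file{MD5SUMS}, and the
variable names are assumptions made for the example.

```shell
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/greeting"

# Generate a checksum list, then verify it with --check.
(cd "$tmp" && md5sum greeting > MD5SUMS)
digest=$(cut -d ' ' -f 1 "$tmp/MD5SUMS")
report=$(cd "$tmp" && md5sum --check MD5SUMS)

# After the file changes, the check fails; --status keeps it silent
# and leaves only the exit status.
printf 'changed\n' > "$tmp/greeting"
if (cd "$tmp" && md5sum --check --status MD5SUMS); then
  verdict=ok
else
  verdict=mismatch
fi

printf '%s\n' "$digest" "$report" "$verdict"
rm -rf "$tmp"
```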
1655
1656
1657 @node Operating on sorted files
1658 @chapter Operating on sorted files
1659
1660 @cindex operating on sorted files
1661 @cindex sorted files, operations on
1662
1663 These commands work with (or produce) sorted files.
1664
1665 @menu
1666 * sort invocation::             Sort text files.
1667 * uniq invocation::             Uniqify files.
1668 * comm invocation::             Compare two sorted files line by line.
1669 @end menu
1670
1671
1672 @node sort invocation
1673 @section @code{sort}: Sort text files
1674
1675 @pindex sort
1676 @cindex sorting files
1677
1678 @code{sort} sorts, merges, or compares all the lines from the given
1679 files, or standard input if none are given or for a @var{file} of
1680 @samp{-}.  By default, @code{sort} writes the results to standard
1681 output.  Synopsis:
1682
1683 @example
1684 sort [@var{option}]@dots{} [@var{file}]@dots{}
1685 @end example
1686
1687 @code{sort} has three modes of operation: sort (the default), merge,
1688 and check for sortedness.  The following options change the operation
1689 mode:
1690
1691 @table @samp
1692
1693 @item -c
1694 @opindex -c
1695 @cindex checking for sortedness
1696 Check whether the given files are already sorted: if they are not all
1697 sorted, print an error message and exit with a status of 1.
1698 Otherwise, exit successfully.
1699
1700 @item -m
1701 @opindex -m
1702 @cindex merging sorted files
1703 Merge the given files by sorting them as a group.  Each input file must
1704 already be individually sorted.  Sorting instead of merging always
1705 works; merging is provided because it is faster when its precondition
1706 holds.
1707
1708 @end table
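
The two non-default modes can be tried on small files, as in this
sketch.  The file names and contents are hypothetical, and
@code{LC_ALL=C} is set only to make the ordering predictable.

```shell
tmp=$(mktemp -d)
printf 'apple\ncherry\n' > "$tmp/a"
printf 'banana\ndate\n' > "$tmp/b"
printf 'pear\nfig\n' > "$tmp/unsorted"

# Merge two files that are each already sorted.
merged=$(LC_ALL=C sort -m "$tmp/a" "$tmp/b")

# Check mode prints a diagnostic and exits with status 1 on disorder.
check_status=0
LC_ALL=C sort -c "$tmp/unsorted" 2>/dev/null || check_status=$?

printf '%s\n' "$merged"
printf 'status for unsorted input: %s\n' "$check_status"
rm -rf "$tmp"
```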
1709
1710 A pair of lines is compared as follows: if any key fields have been
1711 specified, @code{sort} compares each pair of fields, in the order
1712 specified on the command line, according to the associated ordering
1713 options, until a difference is found or no fields are left.
1714
1715 If any of the global options @samp{Mbdfinr} are given but no key fields
1716 are specified, @code{sort} compares the entire lines according to the
1717 global options.
1718
1719 Finally, as a last resort when all keys compare equal (or if no
1720 ordering options were specified at all), @code{sort} compares the lines
1721 byte by byte in machine collating sequence.  The last resort comparison
1722 honors the @samp{-r} global option.  The @samp{-s} (stable) option
1723 disables this last-resort comparison so that lines in which all fields
1724 compare equal are left in their original relative order.  If no fields
1725 or global options are specified, @samp{-s} has no effect.
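
The effect of the last-resort comparison shows up when several lines
share a key.  In this sketch the key is the first field only (given
with the @samp{-k} option), and the four sample records are invented
for illustration.

```shell
records=$(printf 'b 2\na 2\nb 1\na 1')

# Ties on the key fall back to a whole-line byte comparison,
# so "a 1" ends up before "a 2".
default_order=$(printf '%s\n' "$records" | LC_ALL=C sort -k 1,1)

# With -s, ties keep their input order: "a 2" stays before "a 1".
stable_order=$(printf '%s\n' "$records" | LC_ALL=C sort -s -k 1,1)

printf '%s\n' "$default_order" '---' "$stable_order"
```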
1726
1727 GNU @code{sort} (as specified for all GNU utilities) has no limits on
1728 input line length or restrictions on bytes allowed within lines.  In
1729 addition, if the final byte of an input file is not a newline, GNU
1730 @code{sort} silently supplies one.
1731
1732 Upon any error, @code{sort} exits with a status of @samp{2}.
1733
1734 @vindex TMPDIR
1735 If the environment variable @code{TMPDIR} is set, @code{sort} uses its
1736 value as the directory for temporary files instead of @file{/tmp}.  The
1737 @samp{-T @var{tempdir}} option in turn overrides the environment
1738 variable.
1739
1740 The following options affect the ordering of output lines.  They may be
1741 specified globally or as part of a specific key field.  If no key
1742 fields are specified, global options apply to comparison of entire
1743 lines; otherwise the global options are inherited by key fields that do
1744 not specify any special options of their own.
1745
1746 @table @samp
1747
1748 @item -b
1749 @opindex -b
1750 @cindex blanks, ignoring leading
1751 Ignore leading blanks when finding sort keys in each line.
1752
1753 @item -d
1754 @opindex -d
1755 @cindex phone directory order
1756 @cindex telephone directory order
1757 Sort in @dfn{phone directory} order: ignore all characters except
1758 letters, digits and blanks when sorting.
1759
1760 @item -f
1761 @opindex -f
1762 @cindex case folding
1763 Fold lowercase characters into the equivalent uppercase characters when
1764 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
1765
1766 @item -g
1767 @opindex -g
1768 @cindex general numeric sort
1769 Sort numerically, but use @code{strtod(3)} to arrive at the numeric values.
1770 This allows floating point numbers to be specified in scientific notation,
1771 like @code{1.0e-34} and @code{10e100}.  Use this option only if there
1772 is no alternative;  it is much slower than @samp{-n} and numbers with
1773 too many significant digits will be compared as if they had been
1774 truncated.  In addition, numbers outside the range of representable
1775 double precision floating point numbers are treated as if they were
1776 zeroes; overflow and underflow are not reported.
1777
1778 @item -i
1779 @opindex -i
1780 @cindex unprintable characters, ignoring
1781 Ignore characters outside the printable ASCII range 040-0176 octal
1782 (inclusive) when sorting.
1783
1784 @item -M
1785 @opindex -M
1786 @cindex months, sorting by
1787 An initial string, consisting of any amount of whitespace, followed
1788 by three letters abbreviating a month name, is folded to UPPER case and
1789 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
1790 Invalid names compare low to valid names.
1791
1792 @item -n
1793 @opindex -n
1794 @cindex numeric sort
1795 Sort numerically: the number begins each line; specifically, it consists
1796 of optional whitespace, an optional @samp{-} sign, and zero or more
1797 digits, optionally followed by a decimal point and zero or more digits.
1798
1799 @code{sort -n} uses what might be considered an unconventional method
1800 to compare strings representing floating point numbers.  Rather than
1801 first converting each string to the C @code{double} type and then
1802 comparing those values, sort aligns the decimal points in the two
1803 strings and compares the strings a character at a time.  One benefit
1804 of using this approach is its speed.  In practice this is much more
1805 efficient than performing the two corresponding string-to-double (or even
1806 string-to-integer) conversions and then comparing doubles.  In addition,
1807 there is no corresponding loss of precision.  Converting each string to
1808 @code{double} before comparison would limit precision to about 16 digits
1809 on most systems.
1810
1811 Neither a leading @samp{+} nor exponential notation is recognized.
1812 To compare such strings numerically, use the @samp{-g} option.
1813
1814 @item -r
1815 @opindex -r
1816 @cindex reverse sorting
1817 Reverse the result of comparison, so that lines with greater key values
1818 appear earlier in the output instead of later.
1819
1820 @end table
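
The difference between @samp{-n} and @samp{-g} shows up as soon as the
input contains exponential notation.  The following is a small sketch,
assuming a current GNU @code{sort} and the C locale; the sample values
are made up for illustration.

@example
# -n stops reading at the letter e, so 1e3 is compared as 1.
printf '%s\n' 1e3 20 3.5 | LC_ALL=C sort -n
# prints 1e3, 3.5, 20 (one per line)

# -g converts the whole string, so 1e3 is compared as 1000.
printf '%s\n' 1e3 20 3.5 | LC_ALL=C sort -g
# prints 3.5, 20, 1e3 (one per line)
@end example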
1821
1822 Other options are:
1823
1824 @table @samp
1825
1826 @item -o @var{output-file}
1827 @opindex -o
1828 @cindex overwriting of input, allowed
1829 Write output to @var{output-file} instead of standard output.
1830 If @var{output-file} is one of the input files, @code{sort} copies
1831 it to a temporary file before sorting and writing the output to
1832 @var{output-file}.
1833
1834 @item -t @var{separator}
1835 @opindex -t
1836 @cindex field separator character
1837 Use character @var{separator} as the field separator when finding the
1838 sort keys in each line.  By default, fields are separated by the empty
1839 string between a non-whitespace character and a whitespace character.
1840 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
1841 into fields @w{@samp{ foo}} and @w{@samp{ bar}}.  The field separator is
1842 not considered to be part of either the field preceding or the field
1843 following.
1844
1845 @item -u
1846 @opindex -u
1847 @cindex uniqifying output
1848 For the default case or the @samp{-m} option, only output the first
1849 of a sequence of lines that compare equal.  For the @samp{-c} option,
1850 check that no pair of consecutive lines compares equal.
1851
1852 @item -k @var{pos1}[,@var{pos2}]
1853 @opindex -k
1854 @cindex sort field
1855 The recommended, @sc{POSIX}, option for specifying a sort field.  The field
1856 consists of the line between @var{pos1} and @var{pos2} (or the end of
1857 the line, if @var{pos2} is omitted), inclusive.  Fields and character
1858 positions are numbered starting with 1.  See below.
1859
1860 @item -z
1861 @opindex -z
1862 @cindex sort zero-terminated lines
Treat the input as a set of lines, each terminated by a zero byte (@sc{ASCII}
@sc{NUL} (Null) character) instead of an @sc{ASCII} @sc{LF} (Line Feed).
This option can be useful in conjunction with @samp{perl -0} or
@samp{find -print0} and @samp{xargs -0}, which do the same in order to
reliably handle arbitrary pathnames (even those which contain Line Feed
characters).
1869
1870 @item +@var{pos1}[-@var{pos2}]
1871 The obsolete, traditional option for specifying a sort field.  The field
1872 consists of the line between @var{pos1} and up to but @emph{not including}
1873 @var{pos2} (or the end of the line if @var{pos2} is omitted).  Fields
1874 and character positions are numbered starting with 0.  See below.
1875
1876 @end table
1877
1878 In addition, when GNU @code{sort} is invoked with exactly one argument,
1879 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
1880 options}.
1881
1882 Historical (BSD and System V) implementations of @code{sort} have
1883 differed in their interpretation of some options, particularly
@samp{-b}, @samp{-f}, and @samp{-n}.  GNU @code{sort} follows the @sc{POSIX}
1885 behavior, which is usually (but not always!) like the System V behavior.
1886 According to @sc{POSIX}, @samp{-n} no longer implies @samp{-b}.  For
1887 consistency, @samp{-M} has been changed in the same way.  This may
1888 affect the meaning of character positions in field specifications in
1889 obscure cases.  The only fix is to add an explicit @samp{-b}.
1890
1891 A position in a sort field specified with the @samp{-k} or @samp{+}
1892 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
1893 of the field to use and @var{c} is the number of the first character
1894 from the beginning of the field (for @samp{+@var{pos}}) or from the end
1895 of the previous field (for @samp{-@var{pos}}).  If the @samp{.@var{c}}
1896 is omitted, it is taken to be the first character in the field.  If the
1897 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
1898 specification is counted from the first nonblank character of the field
1899 (for @samp{+@var{pos}}) or from the first nonblank character following
1900 the previous field (for @samp{-@var{pos}}).
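
The effect of @samp{-b} on where a key starts can be seen with two
short lines.  This is only a sketch, assuming a current GNU @code{sort},
the default field separation, and the C locale; the input lines are
invented.

@example
# Field 2 of the first line is "  b" (blanks included), which
# sorts before " a" because a blank precedes any letter.
printf '%s\n' 'x  b' 'y a' | LC_ALL=C sort -k 2
# prints "x  b" then "y a"

# With the b modifier the key starts at the first nonblank.
printf '%s\n' 'x  b' 'y a' | LC_ALL=C sort -k 2b
# prints "y a" then "x  b"
@end example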
1901
A sort key option may also have any of the option letters @samp{Mbdfinr}
appended to it, in which case the global ordering options are not used
for that particular field.  The @samp{-b} option may be independently
attached to either or both of the @samp{+@var{pos}} and
@samp{-@var{pos}} parts of a field specification, and if it is inherited
from the global options it will be attached to both.  Keys may span
multiple fields.
1912
1913 Here are some examples to illustrate various combinations of options.
1914 In them, the @sc{POSIX} @samp{-k} option is used to specify sort keys rather
1915 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
1916
1917 @itemize @bullet
1918
1919 @item
1920 Sort in descending (reverse) numeric order.
1921
1922 @example
1923 sort -nr
1924 @end example
1925
@item
Sort alphabetically, omitting the first and second fields.
1927 This uses a single key composed of the characters beginning
1928 at the start of field three and extending to the end of each line.
1929
1930 @example
1931 sort -k3
1932 @end example
1933
1934 @item
1935 Sort numerically on the second field and resolve ties by sorting
1936 alphabetically on the third and fourth characters of field five.
1937 Use @samp{:} as the field delimiter.
1938
1939 @example
1940 sort -t : -k 2,2n -k 5.3,5.4
1941 @end example
1942
Note that if you had written @samp{-k 2} instead of @samp{-k 2,2},
@code{sort} would have used all characters beginning in the second field
and extending to the end of the line as the primary @emph{numeric}
key.  For the large majority of applications, treating keys spanning
more than one field as numeric will not do what you expect.
1948
1949 Also note that the @samp{n} modifier was applied to the field-end
1950 specifier for the first key.  It would have been equivalent to
1951 specify @samp{-k 2n,2} or @samp{-k 2n,2n}.  All modifiers except
1952 @samp{b} apply to the associated @emph{field}, regardless of whether
1953 the modifier character is attached to the field-start and/or the
1954 field-end part of the key specifier.
1955
1956 @item
1957 Sort the password file on the fifth field and ignore any
1958 leading white space.  Sort lines with equal values in field five
1959 on the numeric user ID in field three.
1960
1961 @example
1962 sort -t : -k 5b,5 -k 3,3n /etc/passwd
1963 @end example
1964
1965 An alternative is to use the global numeric modifier @samp{-n}.
1966
1967 @example
1968 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
1969 @end example
1970
Finally, to ignore both leading and trailing white space in the
fifth field, you could have applied the @samp{b} modifier to the
field-end specifier for the first key,

@example
sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
@end example

or used the global @samp{-b} modifier instead of @samp{-n}
and an explicit @samp{n} with the second key specifier.

@example
sort -t : -b -k 5,5 -k 3,3n /etc/passwd
@end example

@item
Generate a tags file in case-insensitive sorted order.

@example
find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
@end example

The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
that pathnames that contain Line Feed characters will not get broken up
by the sort operation.
1995
1996 @end itemize
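
To see a two-key specification at work, it can help to feed
@code{sort} a few lines by hand.  The sketch below assumes a current GNU
@code{sort} and the C locale; the three records are invented and have
no meaning beyond the illustration.

@example
# Primary key: field 2, numeric.  Ties are resolved on the
# third and fourth characters of field 5.
printf '%s\n' 'a:10:x:y:zzab' 'b:2:x:y:zzcd' 'c:10:x:y:zzaa' |
  LC_ALL=C sort -t : -k 2,2n -k 5.3,5.4
# prints:
#   b:2:x:y:zzcd
#   c:10:x:y:zzaa
#   a:10:x:y:zzab
@end example

The two lines with @samp{10} in field two tie on the first key, so the
second key (@samp{aa} versus @samp{ab}) decides their order.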
1997
1998
1999 @node uniq invocation
2000 @section @code{uniq}: Uniqify files
2001
2002 @pindex uniq
2003 @cindex uniqify files
2004
@code{uniq} writes the unique lines in the given @var{input}, or
standard input if nothing is given or for an @var{input} name of
@samp{-}.  Synopsis:
2008
2009 @example
2010 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2011 @end example
2012
2013 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2014 discards all but one of identical successive lines.  Optionally, it can
2015 instead show only lines that appear exactly once, or lines that appear
2016 more than once.
2017
@code{uniq} compares only adjacent lines, so the input must be sorted
for all repeated lines to be detected.  If your input is not sorted,
perhaps you want to use @code{sort -u}.
2020
2021 If no @var{output} file is specified, @code{uniq} writes to standard
2022 output.
2023
2024 The program accepts the following options.  Also see @ref{Common options}.
2025
2026 @table @samp
2027
2028 @item -@var{n}
2029 @itemx -f @var{n}
2030 @itemx --skip-fields=@var{n}
2031 @opindex -@var{n}
2032 @opindex -f
2033 @opindex --skip-fields
Skip @var{n} fields on each line before checking for uniqueness.  Fields
are sequences of non-space non-tab characters that are separated from
each other by at least one space or tab.
2037
2038 @item +@var{n}
2039 @itemx -s @var{n}
2040 @itemx --skip-chars=@var{n}
2041 @opindex +@var{n}
2042 @opindex -s
2043 @opindex --skip-chars
2044 Skip @var{n} characters before checking for uniqueness.  If you use both
2045 the field and character skipping options, fields are skipped over first.
2046
2047 @item -c
2048 @itemx --count
2049 @opindex -c
2050 @opindex --count
2051 Print the number of times each line occurred along with the line.
2052
2053 @item -i
2054 @itemx --ignore-case
2055 @opindex -i
2056 @opindex --ignore-case
2057 Ignore differences in case when comparing lines.
2058
2059 @item -d
2060 @itemx --repeated
2061 @opindex -d
2062 @opindex --repeated
2063 @cindex duplicate lines, outputting
Print only lines that are repeated in the input, one copy of each.
2065
2066 @item -u
2067 @itemx --unique
2068 @opindex -u
2069 @opindex --unique
2070 @cindex unique lines, outputting
Print only lines that are not repeated in the input.
2072
2073 @item -w @var{n}
2074 @itemx --check-chars=@var{n}
2075 @opindex -w
2076 @opindex --check-chars
2077 Compare @var{n} characters on each line (after skipping any specified
2078 fields and characters).  By default the entire rest of the lines are
2079 compared.
2080
2081 @end table
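
A few one-line experiments make the behavior of the default mode,
@samp{-d}, and @samp{-u} concrete.  This is a sketch with made-up input,
assuming a current GNU @code{uniq} and @code{sort} in the C locale.

@example
# Only adjacent repeats are merged, so the second a survives.
printf '%s\n' a b a | uniq
# prints a, b, a (one per line)

# After sorting, -d keeps one copy of each repeated line
# and -u keeps the lines that occur exactly once.
printf '%s\n' a c b c a c | LC_ALL=C sort | uniq -d
# prints a, c (one per line)
printf '%s\n' a c b c a c | LC_ALL=C sort | uniq -u
# prints b
@end example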
2082
2083
2084 @node comm invocation
2085 @section @code{comm}: Compare two sorted files line by line
2086
2087 @pindex comm
2088 @cindex line-by-line comparison
2089 @cindex comparing sorted files
2090
2091 @code{comm} writes to standard output lines that are common, and lines
2092 that are unique, to two input files; a file name of @samp{-} means
2093 standard input.  Synopsis:
2094
2095 @example
2096 comm [@var{option}]@dots{} @var{file1} @var{file2}
2097 @end example
2098
2099 The input files must be sorted before @code{comm} can be used.
2100
2101 @cindex differing lines
2102 @cindex common lines
With no options, @code{comm} produces three-column output.  Column one
2104 contains lines unique to @var{file1}, column two contains lines unique
2105 to @var{file2}, and column three contains lines common to both files.
2106 Columns are separated by @key{TAB}.
2107 @c FIXME: when there's an option to supply an alternative separator
2108 @c string, append `by default' to the above sentence.
2109
2110 @opindex -1
2111 @opindex -2
2112 @opindex -3
2113 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2114 the corresponding columns.  Also see @ref{Common options}.
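
Suppressing columns is the usual way to ask @code{comm} for an
intersection or a difference.  The following sketch assumes a current
GNU @code{comm}, the C locale, and a @code{mktemp} that accepts
@samp{-d}; the file names @file{f1} and @file{f2} and their contents are
invented.

@example
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/f1"
printf '%s\n' banana cherry date > "$tmp/f2"

# Suppress columns 1 and 2: only lines common to both files.
LC_ALL=C comm -12 "$tmp/f1" "$tmp/f2"
# prints banana, cherry (one per line)

# Suppress columns 2 and 3: only lines unique to the first file.
LC_ALL=C comm -23 "$tmp/f1" "$tmp/f2"
# prints apple

rm -r "$tmp"
@end example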
2115
2116
2117 @node Operating on fields within a line
2118 @chapter Operating on fields within a line
2119
2120 @menu
2121 * cut invocation::              Print selected parts of lines.
2122 * paste invocation::            Merge lines of files.
2123 * join invocation::             Join lines on a common field.
2124 @end menu
2125
2126
2127 @node cut invocation
2128 @section @code{cut}: Print selected parts of lines
2129
2130 @pindex cut
2131 @code{cut} writes to standard output selected parts of each line of each
2132 input file, or standard input if no files are given or for a file name of
2133 @samp{-}.  Synopsis:
2134
2135 @example
2136 cut [@var{option}]@dots{} [@var{file}]@dots{}
2137 @end example
2138
In the table which follows, the @var{byte-list}, @var{character-list},
and @var{field-list} are one or more numbers or ranges (two numbers
separated by a dash) separated by commas.  Bytes, characters, and
fields are numbered starting at 1.  Incomplete ranges may be
given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
@samp{@var{n}} through end of line or last field.
2145
2146 The program accepts the following options.  Also see @ref{Common
2147 options}.
2148
2149 @table @samp
2150
2151 @item -b @var{byte-list}
2152 @itemx --bytes=@var{byte-list}
2153 @opindex -b
2154 @opindex --bytes
2155 Print only the bytes in positions listed in @var{byte-list}.  Tabs and
2156 backspaces are treated like any other character; they take up 1 byte.
2157
2158 @item -c @var{character-list}
2159 @itemx --characters=@var{character-list}
2160 @opindex -c
2161 @opindex --characters
2162 Print only characters in positions listed in @var{character-list}.
2163 The same as @samp{-b} for now, but internationalization will change
2164 that.  Tabs and backspaces are treated like any other character; they
2165 take up 1 character.
2166
2167 @item -f @var{field-list}
2168 @itemx --fields=@var{field-list}
2169 @opindex -f
2170 @opindex --fields
2171 Print only the fields listed in @var{field-list}.  Fields are
2172 separated by a @key{TAB} by default.
2173
2174 @item -d @var{delim}
2175 @itemx --delimiter=@var{delim}
2176 @opindex -d
2177 @opindex --delimiter
2178 For @samp{-f}, fields are separated by the first character in @var{delim}
2179 (default is @key{TAB}).
2180
2181 @item -n
2182 @opindex -n
2183 Do not split multi-byte characters (no-op for now).
2184
2185 @item -s
2186 @itemx --only-delimited
2187 @opindex -s
2188 @opindex --only-delimited
2189 For @samp{-f}, do not print lines that do not contain the field separator
2190 character.
2191
2192 @end table
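
The list syntax and the effect of @samp{-s} can be tried on a single
line.  This is a sketch with an invented password-style record,
assuming a current GNU @code{cut}.

@example
# Fields 1 and 7 of a colon-delimited line.
printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' | cut -d : -f 1,7
# prints root:/bin/sh

# The incomplete range -4 means bytes 1 through 4.
printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' | cut -b -4
# prints root

# A line without the delimiter is passed through whole,
# unless -s is given, in which case it is dropped.
printf '%s\n' 'no delimiter' | cut -d : -f 1
# prints no delimiter
printf '%s\n' 'no delimiter' | cut -d : -f 1 -s
# prints nothing
@end example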
2193
2194
2195 @node paste invocation
2196 @section @code{paste}: Merge lines of files
2197
2198 @pindex paste
2199 @cindex merging files
2200
2201 @code{paste} writes to standard output lines consisting of sequentially
2202 corresponding lines of each given file, separated by @key{TAB}.
2203 Standard input is used for a file name of @samp{-} or if no input files
2204 are given.
2205
2206 Synopsis:
2207
2208 @example
2209 paste [@var{option}]@dots{} [@var{file}]@dots{}
2210 @end example
2211
2212 The program accepts the following options.  Also see @ref{Common options}.
2213
2214 @table @samp
2215
2216 @item -s
2217 @itemx --serial
2218 @opindex -s
2219 @opindex --serial
2220 Paste the lines of one file at a time rather than one line from each
2221 file.
2222
2223 @item -d @var{delim-list}
@itemx --delimiters=@var{delim-list}
2225 @opindex -d
2226 @opindex --delimiters
2227 Consecutively use the characters in @var{delim-list} instead of
2228 @key{TAB} to separate merged lines.  When @var{delim-list} is
2229 exhausted, start again at its beginning.
2230
2231 @end table
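
The following sketch shows parallel and serial merging on standard
input.  It assumes a current GNU @code{paste}; reading the same stream
through several @samp{-} operands is a common idiom rather than
something described above, and the input is made up.

@example
# Two columns from one stream: each - reads the next line.
printf '%s\n' a b c d | paste -d , - -
# prints a,b then c,d

# Serial mode joins all lines of one file.
printf '%s\n' 1 2 3 | paste -s -d , -
# prints 1,2,3

# The delimiter list is used in turn and restarted when exhausted.
printf '%s\n' 1 2 3 4 | paste -s -d ':,' -
# prints 1:2,3:4
@end example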
2232
2233
2234 @node join invocation
2235 @section @code{join}: Join lines on a common field
2236
2237 @pindex join
2238 @cindex common field, joining on
2239
2240 @code{join} writes to standard output a line for each pair of input
2241 lines that have identical join fields.  Synopsis:
2242
2243 @example
2244 join [@var{option}]@dots{} @var{file1} @var{file2}
2245 @end example
2246
2247 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
2248 meaning standard input.  @var{file1} and @var{file2} should be already
2249 sorted in increasing order (not numerically) on the join fields; unless
2250 the @samp{-t} option is given, they should be sorted ignoring blanks at
2251 the start of the join field, as in @code{sort -b}.  If the
2252 @samp{--ignore-case} option is given, lines should be sorted without
2253 regard to the case of characters in the join field, as in @code{sort -f}.
2254
2255 The defaults are: the join field is the first field in each line;
2256 fields in the input are separated by one or more blanks, with leading
2257 blanks on the line ignored; fields in the output are separated by a
2258 space; each output line consists of the join field, the remaining
2259 fields from @var{file1}, then the remaining fields from @var{file2}.
2260
2261 The program accepts the following options.  Also see @ref{Common options}.
2262
2263 @table @samp
2264
2265 @item -a @var{file-number}
2266 @opindex -a
2267 Print a line for each unpairable line in file @var{file-number} (either
2268 @samp{1} or @samp{2}), in addition to the normal output.
2269
2270 @item -e @var{string}
2271 @opindex -e
2272 Replace those output fields that are missing in the input with
2273 @var{string}.
2274
2275 @item -i
2276 @itemx --ignore-case
2277 @opindex -i
2278 @opindex --ignore-case
2279 Ignore differences in case when comparing keys.
2280 With this option, the lines of the input files must be ordered in the same way.
2281 Use @samp{sort -f} to produce this ordering.
2282
2283 @item -1 @var{field}
2284 @itemx -j1 @var{field}
2285 @opindex -1
2286 @opindex -j1
2287 Join on field @var{field} (a positive integer) of file 1.
2288
2289 @item -2 @var{field}
2290 @itemx -j2 @var{field}
2291 @opindex -2
2292 @opindex -j2
2293 Join on field @var{field} (a positive integer) of file 2.
2294
2295 @item -j @var{field}
2296 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
2297
2298 @item -o @var{field-list}@dots{}
Construct each output line according to the format in @var{field-list}.
Each element in @var{field-list} is either the single character @samp{0} or
has the form @samp{@var{m}.@var{n}} where the file number, @var{m}, is
@samp{1} or @samp{2} and @var{n} is a positive field number.

A field specification of @samp{0} denotes the join field.
In most cases, the functionality of the @samp{0} field spec
may be reproduced using the explicit @samp{@var{m}.@var{n}} that
corresponds to the join field.  However, when printing unpairable lines
(using either of the @samp{-a} or @samp{-v} options), there is no way
to specify the join field using @samp{@var{m}.@var{n}} in @var{field-list}
if there are unpairable lines in both files.
To give @code{join} that functionality, @sc{POSIX} invented the @samp{0}
field specification notation.

The elements in @var{field-list}
are separated by commas or blanks.  Multiple @var{field-list}
arguments can be given after a single @samp{-o} option; the values
of all lists given with @samp{-o} are concatenated together.
All output lines, including those printed because of any @samp{-a} or
@samp{-v} option, are subject to the specified @var{field-list}.
2320
2321 @item -t @var{char}
2322 Use character @var{char} as the input and output field separator.
2323
2324 @item -v @var{file-number}
2325 Print a line for each unpairable line in file @var{file-number}
2326 (either @samp{1} or @samp{2}), instead of the normal output.
2327
2328 @end table
2329
2330 In addition, when GNU @code{join} is invoked with exactly one argument,
2331 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
2332 options}.
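
The interplay of @samp{-a}, @samp{-e}, and @samp{-o} is easiest to see
on two tiny files.  This sketch assumes a current GNU @code{join}, the
C locale, and a @code{mktemp} that accepts @samp{-d}; the files
@file{f1} and @file{f2} and their contents are invented.

@example
tmp=$(mktemp -d)
printf '%s\n' '1 alice' '2 bob' '3 carol' > "$tmp/f1"
printf '%s\n' '1 admin' '3 user' > "$tmp/f2"

# Default: join field, rest of file 1, rest of file 2.
LC_ALL=C join "$tmp/f1" "$tmp/f2"
# prints:
#   1 alice admin
#   3 carol user

# Also print unpairable lines of file 1, filling the gap.
LC_ALL=C join -a 1 -e NONE -o 0,1.2,2.2 "$tmp/f1" "$tmp/f2"
# prints:
#   1 alice admin
#   2 bob NONE
#   3 carol user

rm -r "$tmp"
@end example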
2333
2334
2335 @node Operating on characters
2336 @chapter Operating on characters
2337
2338 @cindex operating on characters
2339
These commands operate on individual characters.
2341
2342 @menu
2343 * tr invocation::               Translate, squeeze, and/or delete characters.
2344 * expand invocation::           Convert tabs to spaces.
2345 * unexpand invocation::         Convert spaces to tabs.
2346 @end menu
2347
2348
2349 @node tr invocation
2350 @section @code{tr}: Translate, squeeze, and/or delete characters
2351
2352 @pindex tr
2353
2354 Synopsis:
2355
2356 @example
2357 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
2358 @end example
2359
2360 @code{tr} copies standard input to standard output, performing
2361 one of the following operations:
2362
2363 @itemize @bullet
2364 @item
2365 translate, and optionally squeeze repeated characters in the result,
2366 @item
2367 squeeze repeated characters,
2368 @item
2369 delete characters,
2370 @item
2371 delete characters, then squeeze repeated characters from the result.
2372 @end itemize
2373
2374 The @var{set1} and (if given) @var{set2} arguments define ordered
2375 sets of characters, referred to below as @var{set1} and @var{set2}.  These
2376 sets are the characters of the input that @code{tr} operates on.
2377 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
2378 complement (all of the characters that are not in @var{set1}).
2379
2380 @menu
2381 * Character sets::              Specifying sets of characters.
* Translating::                 Changing one set of characters to another.
2383 * Squeezing::                   Squeezing repeats and deleting.
2384 * Warnings in tr::              Warning messages.
2385 @end menu
2386
2387
2388 @node Character sets
2389 @subsection Specifying sets of characters
2390
2391 @cindex specifying sets of characters
2392
2393 The format of the @var{set1} and @var{set2} arguments resembles
2394 the format of regular expressions; however, they are not regular
2395 expressions, only lists of characters.  Most characters simply
2396 represent themselves in these strings, but the strings can contain
2397 the shorthands listed below, for convenience.  Some of them can be
2398 used only in @var{set1} or @var{set2}, as noted below.
2399
2400 @table @asis
2401
2402 @item Backslash escapes.
2403 @cindex backslash escapes
2404
2405 A backslash followed by a character not listed below causes an error
2406 message.
2407
2408 @table @samp
2409 @item \a
2410 Control-G,
2411 @item \b
2412 Control-H,
2413 @item \f
2414 Control-L,
2415 @item \n
2416 Control-J,
2417 @item \r
2418 Control-M,
2419 @item \t
2420 Control-I,
2421 @item \v
2422 Control-K,
2423 @item \@var{ooo}
2424 The character with the value given by @var{ooo}, which is 1 to 3
2425 octal digits,
2426 @item \\
2427 A backslash.
2428 @end table
2429
2430 @item Ranges.
2431 @cindex ranges
2432
2433 The notation @samp{@var{m}-@var{n}} expands to all of the characters
2434 from @var{m} through @var{n}, in ascending order.  @var{m} should
2435 collate before @var{n}; if it doesn't, an error results.  As an example,
2436 @samp{0-9} is the same as @samp{0123456789}.  Although GNU @code{tr}
2437 does not support the System V syntax that uses square brackets to
2438 enclose ranges, translations specified in that format will still work as
long as the brackets in @var{set1} correspond to identical brackets
in @var{set2}.
2441
2442 @item Repeated characters.
2443 @cindex repeated characters
2444
2445 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
2446 copies of character @var{c}.  Thus, @samp{[y*6]} is the same as
@samp{yyyyyy}.  The notation @samp{[@var{c}*]} in @var{set2} expands
2448 to as many copies of @var{c} as are needed to make @var{set2} as long as
2449 @var{set1}.  If @var{n} begins with @samp{0}, it is interpreted in
2450 octal, otherwise in decimal.
2451
2452 @item Character classes.
@cindex character classes
2454
2455 The notation @samp{[:@var{class}:]} expands to all of the characters in
2456 the (predefined) class @var{class}.  The characters expand in no
2457 particular order, except for the @code{upper} and @code{lower} classes,
2458 which expand in ascending order.  When the @samp{--delete} (@samp{-d})
2459 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
2460 character class can be used in @var{set2}.  Otherwise, only the
2461 character classes @code{lower} and @code{upper} are accepted in
2462 @var{set2}, and then only if the corresponding character class
2463 (@code{upper} and @code{lower}, respectively) is specified in the same
2464 relative position in @var{set1}.  Doing this specifies case conversion.
2465 The class names are given below; an error results when an invalid class
2466 name is given.
2467
2468 @table @code
2469 @item alnum
2470 @opindex alnum
2471 Letters and digits.
2472 @item alpha
2473 @opindex alpha
2474 Letters.
2475 @item blank
2476 @opindex blank
2477 Horizontal whitespace.
2478 @item cntrl
2479 @opindex cntrl
2480 Control characters.
2481 @item digit
2482 @opindex digit
2483 Digits.
2484 @item graph
2485 @opindex graph
2486 Printable characters, not including space.
2487 @item lower
2488 @opindex lower
2489 Lowercase letters.
2490 @item print
2491 @opindex print
2492 Printable characters, including space.
2493 @item punct
2494 @opindex punct
2495 Punctuation characters.
2496 @item space
2497 @opindex space
2498 Horizontal or vertical whitespace.
2499 @item upper
2500 @opindex upper
2501 Uppercase letters.
2502 @item xdigit
2503 @opindex xdigit
2504 Hexadecimal digits.
2505 @end table
2506
2507 @item Equivalence classes.
2508 @cindex equivalence classes
2509
2510 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
2511 equivalent to @var{c}, in no particular order.  Equivalence classes are
2512 a relatively recent invention intended to support non-English alphabets.
2513 But there seems to be no standard way to define them or determine their
2514 contents.  Therefore, they are not fully implemented in GNU @code{tr};
2515 each character's equivalence class consists only of that character,
2516 which is of no particular use.
2517
2518 @end table
2519
2520
2521 @node Translating
2522 @subsection Translating
2523
2524 @cindex translating characters
2525
2526 @code{tr} performs translation when @var{set1} and @var{set2} are
2527 both given and the @samp{--delete} (@samp{-d}) option is not given.
2528 @code{tr} translates each character of its input that is in @var{set1}
2529 to the corresponding character in @var{set2}.  Characters not in
2530 @var{set1} are passed through unchanged.  When a character appears more
2531 than once in @var{set1} and the corresponding characters in @var{set2}
2532 are not all the same, only the final one is used.  For example, these
2533 two commands are equivalent:
2534
2535 @example
2536 tr aaa xyz
2537 tr a z
2538 @end example
2539
2540 A common use of @code{tr} is to convert lowercase characters to
2541 uppercase.  This can be done in many ways.  Here are three of them:
2542
2543 @example
2544 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
2545 tr a-z A-Z
2546 tr '[:lower:]' '[:upper:]'
2547 @end example
2548
2549 When @code{tr} is performing translation, @var{set1} and @var{set2}
2550 typically have the same length.  If @var{set1} is shorter than
2551 @var{set2}, the extra characters at the end of @var{set2} are ignored.
2552
2553 On the other hand, making @var{set1} longer than @var{set2} is not
2554 portable; @sc{POSIX.2} says that the result is undefined.  In this situation,
2555 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
2556 the last character of @var{set2} as many times as necessary.  System V
2557 @code{tr} truncates @var{set1} to the length of @var{set2}.
2558
2559 By default, GNU @code{tr} handles this case like BSD @code{tr}.  When
2560 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
2561 handles this case like the System V @code{tr} instead.  This option is
2562 ignored for operations other than translation.
2563
2564 Acting like System V @code{tr} in this case breaks the relatively common
2565 BSD idiom:
2566
2567 @example
2568 tr -cs A-Za-z0-9 '\012'
2569 @end example
2570
2571 @noindent
2572 because it converts only zero bytes (the first element in the
2573 complement of @var{set1}), rather than all non-alphanumerics, to
2574 newlines.
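
The rules for repeated characters and for sets of unequal length can
be checked directly.  This is a sketch with invented input, assuming a
current GNU @code{tr} with @code{POSIXLY_CORRECT} unset.

@example
# A repeated set1 character takes its last mapping.
printf '%s\n' banana | tr aaa xyz
# prints bznznz

# set1 longer than set2: the last character of set2 is repeated.
printf '%s\n' abcd | tr abcd xy
# prints xyyy

# With -t, set1 is truncated to the length of set2 instead.
printf '%s\n' abcd | tr -t abcd xy
# prints xycd
@end example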
2575
2576
2577 @node Squeezing
2578 @subsection Squeezing repeats and deleting
2579
2580 @cindex squeezing repeat characters
2581 @cindex deleting characters
2582
2583 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
2584 removes any input characters that are in @var{set1}.
2585
2586 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
2587 @code{tr} replaces each input sequence of a repeated character that
2588 is in @var{set1} with a single occurrence of that character.
2589
2590 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
2591 first performs any deletions using @var{set1}, then squeezes repeats
2592 from any remaining characters using @var{set2}.
2593
2594 The @samp{--squeeze-repeats} option may also be used when translating,
2595 in which case @code{tr} first performs translation, then squeezes
2596 repeats from any remaining characters using @var{set2}.
2597
2598 Here are some examples to illustrate various combinations of options:
2599
2600 @itemize @bullet
2601
2602 @item
2603 Remove all zero bytes:
2604
2605 @example
2606 tr -d '\000'
2607 @end example
2608
2609 @item
2610 Put all words on lines by themselves.  This converts all
2611 non-alphanumeric characters to newlines, then squeezes each string
2612 of repeated newlines into a single newline:
2613
2614 @example
tr -cs 'a-zA-Z0-9' '[\n*]'
2616 @end example
2617
2618 @item
2619 Convert each sequence of repeated newlines to a single newline:
2620
2621 @example
2622 tr -s '\n'
2623 @end example
2624
2625 @end itemize
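
The same combinations can be tried on short strings to see which set
each step uses.  This is a sketch with made-up input, assuming a
current GNU @code{tr} and the C locale; @var{set1} is written without
System V brackets, which GNU @code{tr} would treat as ordinary
characters.

@example
# Squeeze runs of a and b only.
printf '%s\n' aabbcc | tr -s ab
# prints abcc

# Delete every a (set1), then squeeze runs of b (set2).
printf '%s\n' aabbcc | tr -ds a b
# prints bcc

# One word per line.
printf '%s\n' 'one, two;;three' | LC_ALL=C tr -cs 'a-zA-Z0-9' '[\n*]'
# prints one, two, three (one per line)
@end example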
2626
2627
2628 @node Warnings in tr
2629 @subsection Warning messages
2630
2631 @vindex POSIXLY_CORRECT
2632 Setting the environment variable @code{POSIXLY_CORRECT} turns off the
2633 following warning and error messages, for strict compliance with
2634 @sc{POSIX.2}.  Otherwise, the following diagnostics are issued:
2635
2636 @enumerate
2637
2638 @item
2639 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
2640 is not, and @var{set2} is given, GNU @code{tr} by default prints
2641 a usage message and exits, because @var{set2} would not be used.
2642 The @sc{POSIX} specification says that @var{set2} must be ignored in
2643 this case. Silently ignoring arguments is a bad idea.
2644
2645 @item
2646 When an ambiguous octal escape is given.  For example, @samp{\400}
2647 is actually @samp{\40} followed by the digit @samp{0}, because the
2648 value 400 octal does not fit into a single byte.
2649
2650 @end enumerate
2651
2652 GNU @code{tr} does not provide complete BSD or System V compatibility.
2653 For example, it is impossible to disable interpretation of the @sc{POSIX}
2654 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}.  Also, GNU
2655 @code{tr} does not delete zero bytes automatically, unlike traditional
2656 Unix versions, which provide no way to preserve zero bytes.
2657
2658
2659 @node expand invocation
2660 @section @code{expand}: Convert tabs to spaces
2661
2662 @pindex expand
2663 @cindex tabs to spaces, converting
2664 @cindex converting tabs to spaces
2665
2666 @code{expand} writes the contents of each given @var{file}, or standard
2667 input if none are given or for a @var{file} of @samp{-}, to standard
2668 output, with tab characters converted to the appropriate number of
2669 spaces.  Synopsis:
2670
2671 @example
2672 expand [@var{option}]@dots{} [@var{file}]@dots{}
2673 @end example
2674
2675 By default, @code{expand} converts all tabs to spaces.  It preserves
2676 backspace characters in the output; they decrement the column count for
2677 tab calculations.  The default action is equivalent to @samp{-8} (set
2678 tabs every 8 columns).
2679
2680 The program accepts the following options.  Also see @ref{Common options}.
2681
2682 @table @samp
2683
2684 @item -@var{tab1}[,@var{tab2}]@dots{}
2685 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2686 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2687 @opindex -@var{tab}
2688 @opindex -t
2689 @opindex --tabs
2690 @cindex tabstops, setting
2691 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2692 (default is 8).  Otherwise, set the tabs at columns @var{tab1},
2693 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
2694 last tabstop given with single spaces.  If the tabstops are specified
2695 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2696 blanks as well as by commas.
2697
2698 @item -i
2699 @itemx --initial
2700 @opindex -i
2701 @opindex --initial
2702 @cindex initial tabs, converting
2703 Only convert initial tabs (those that precede every character that is
2704 neither a space nor a tab) on each line to spaces.
2705
2706 @end table
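
As a rough illustration of the tab-stop rules above (the sample strings are
made up for this sketch, and @code{printf} is used only to produce literal
tab characters), the following commands could be tried:

@example
# Tabs every 4 columns: the tab after "a" becomes three spaces.
printf 'a\tb\n' | expand -t 4

# Explicit stops at columns 4 and 10; the third tab lies beyond the
# last stop, so it is replaced by a single space.
printf 'a\tb\tc\td\n' | expand -t 4,10

# With -i, only the leading tab is converted; the tab after "a" stays.
printf '\ta\tb\n' | expand -i -t 4
@end example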
2707
2708
2709 @node unexpand invocation
2710 @section @code{unexpand}: Convert spaces to tabs
2711
2712 @pindex unexpand
2713
2714 @code{unexpand} writes the contents of each given @var{file}, or
2715 standard input if none are given or for a @var{file} of @samp{-}, to
2716 standard output, with strings of two or more space or tab characters
2717 converted to as many tabs as possible followed by as many spaces as are
2718 needed.  Synopsis:
2719
2720 @example
2721 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
2722 @end example
2723
2724 By default, @code{unexpand} converts only initial spaces and tabs (those
2725 that precede all non-space, non-tab characters) on each line.  It
2726 preserves backspace characters in the output; they decrement the column
2727 count for tab calculations.  By default, tabs are set at every 8th
2728 column.
2729
2730 The program accepts the following options.  Also see @ref{Common options}.
2731
2732 @table @samp
2733
2734 @item -@var{tab1}[,@var{tab2}]@dots{}
2735 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2736 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2737 @opindex -@var{tab}
2738 @opindex -t
2739 @opindex --tabs
2740 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2741 instead of the default 8.  Otherwise, set the tabs at columns
2742 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
2743 tabs beyond the tabstops given unchanged.  If the tabstops are specified
2744 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2745 blanks as well as by commas.  This option implies the @samp{-a} option.
2746
2747 @item -a
2748 @itemx --all
2749 @opindex -a
2750 @opindex --all
2751 Convert all strings of two or more spaces or tabs, not just initial
2752 ones, to tabs.
2753
2754 @end table
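
A small sketch of the difference between the default behavior and
@samp{-a}, using made-up input lines; @code{od -c} is used here only to
make the resulting tab characters visible:

@example
# Eight leading spaces reach the first tab stop, so they become one tab.
printf '        x\n' | unexpand | od -c

# The seven spaces after "a" are not initial, so they are left alone...
printf 'a       b\n' | unexpand | od -c

# ...unless -a is given, in which case they are replaced by a tab.
printf 'a       b\n' | unexpand -a | od -c
@end example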
2755
2756
2757 @c              What's GNU?
2758 @c              Arnold Robbins
2759 @node Opening the software toolbox
2760 @chapter Opening the software toolbox
2761
2762 This chapter originally appeared in @cite{Linux Journal}, volume 1,
2763 number 2, in the @cite{What's GNU?} column. It was written by Arnold
2764 Robbins.
2765
2766 @menu
2767 * Toolbox introduction::
2768 * I/O redirection::
2769 * The @code{who} command::
2770 * The @code{cut} command::
2771 * The @code{sort} command::
2772 * The @code{uniq} command::
2773 * Putting the tools together::
2774 @end menu
2775
2776
2777 @node Toolbox introduction
2778 @unnumberedsec Toolbox introduction
2779
2780 This month's column is only peripherally related to the GNU Project, in
2781 that it describes a number of the GNU tools on your Linux system and how they
2782 might be used.  What it's really about is the ``Software Tools'' philosophy
2783 of program development and usage.
2784
2785 The software tools philosophy was an important and integral concept
2786 in the initial design and development of Unix (of which Linux and GNU are
2787 essentially clones).  Unfortunately, in the modern day press of
2788 Internetworking and flashy GUIs, it seems to have fallen by the
2789 wayside.  This is a shame, since it provides a powerful mental model
2790 for solving many kinds of problems.
2791
2792 Many people carry a Swiss Army knife around in their pants pockets (or
2793 purse).  A Swiss Army knife is a handy tool to have: it has several knife
2794 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
2795 a number of other things on it.  For the everyday, small miscellaneous jobs
2796 where you need a simple, general purpose tool, it's just the thing.
2797
2798 On the other hand, an experienced carpenter doesn't build a house using
2799 a Swiss Army knife.  Instead, he has a toolbox chock full of specialized
2800 tools---a saw, a hammer, a screwdriver, a plane, and so on.  And he knows
2801 exactly when and where to use each tool; you won't catch him hammering nails
2802 with the handle of his screwdriver.
2803
2804 The Unix developers at Bell Labs were all professional programmers and trained
2805 computer scientists.  They had found that while a one-size-fits-all program
2806 might appeal to a user because there's only one program to use, in practice
2807 such programs are
2808
2809 @enumerate a
2810 @item
2811 difficult to write,
2812
2813 @item
2814 difficult to maintain and
2815 debug, and
2816
2817 @item
2818 difficult to extend to meet new situations.
2819 @end enumerate
2820
2821 Instead, they felt that programs should be specialized tools.  In short, each
2822 program ``should do one thing well.''  No more and no less.  Such programs are
2823 simpler to design, write, and get right---they only do one thing.
2824
2825 Furthermore, they found that with the right machinery for hooking programs
2826 together, the whole was greater than the sum of the parts.  By combining
2827 several special purpose programs, you could accomplish a specific task
2828 that none of the programs was designed for, and accomplish it much more
2829 quickly and easily than if you had to write a special purpose program.
2830 We will see some (classic) examples of this further on in the column.
2831 (An important additional point was that, if necessary, you should take a
2832 detour and build any software tools you need first, if you don't already
2833 have something appropriate in the toolbox.)
2834
2835 @node I/O redirection
2836 @unnumberedsec I/O redirection
2837
2838 Hopefully, you are familiar with the basics of I/O redirection in the
2839 shell, in particular the concepts of ``standard input,'' ``standard output,''
2840 and ``standard error''.  Briefly, ``standard input'' is a data source, where
2841 data comes from.  A program should not need to either know or care if the
2842 data source is a disk file, a keyboard, a magnetic tape, or even a punched
2843 card reader.  Similarly, ``standard output'' is a data sink, where data goes
2844 to.  The program should neither know nor care where this might be.
2845 Programs that only read their standard input, do something to the data,
2846 and then send it on, are called ``filters'', by analogy to filters in a
2847 water pipeline.
2848
2849 With the Unix shell, it's very easy to set up data pipelines:
2850
2851 @example
2852 program_to_create_data | filter1 | .... | filterN > final.pretty.data
2853 @end example
2854
2855 We start out by creating the raw data; each filter applies some successive
2856 transformation to the data, until by the time it comes out of the pipeline,
2857 it is in the desired form.
2858
2859 This is fine and good for standard input and standard output.  Where does the
2860 standard error come into play?  Well, think about @code{filter1} in
2861 the pipeline above.  What happens if it encounters an error in the data it
2862 sees?  If it writes an error message to standard output, it will just
2863 disappear down the pipeline into @code{filter2}'s input, and the
2864 user will probably never see it.  So programs need a place where they can send
2865 error messages so that the user will notice them.  This is standard error,
2866 and it is usually connected to your console or window, even if you have
2867 redirected standard output of your program away from your screen.
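
To see this separation in action, here is a small sketch; the
@code{sh -c} command is only a stand-in for a filter that emits both data
and a diagnostic, and @code{mktemp} is assumed to be available:

@example
# "data" travels down the pipeline and is upper-cased by tr;
# "oops" goes to standard error and reaches the terminal untouched.
sh -c 'echo data; echo oops >&2' | tr a-z A-Z

# Standard error can be redirected separately, here to a hypothetical
# file named errors.log in a scratch directory.
dir=$(mktemp -d)
sh -c 'echo data; echo oops >&2' 2> "$dir/errors.log" | tr a-z A-Z
cat "$dir/errors.log"
@end example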
2868
2869 For filter programs to work together, the format of the data has to be
2870 agreed upon.  The most straightforward and easiest format to use is simply
2871 lines of text.  Unix data files are generally just streams of bytes, with
2872 lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
2873 conventionally called a ``newline'' in the Unix literature. (This is
2874 @code{'\n'} if you're a C programmer.)  This is the format used by all
2875 the traditional filtering programs.  (Many earlier operating systems
2876 had elaborate facilities and special purpose programs for managing
2877 binary data.  Unix has always shied away from such things, under the
2878 philosophy that it's easiest to simply be able to view and edit your
2879 data with a text editor.)
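
If you want to see the newline bytes for yourself, one way (a sketch, not
the only way) is to dump a couple of made-up lines with @code{od}, which is
part of this package:

@example
# Each line ends in a single LF byte, shown by od -c as \n.
printf 'one\ntwo\n' | od -c

# wc -l counts exactly those newline characters.
printf 'one\ntwo\n' | wc -l
@end example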
2880
2881 OK, enough introduction. Let's take a look at some of the tools, and then
2882 we'll see how to hook them together in interesting ways.   In the following
2883 discussion, we will only present those command line options that interest
2884 us.  As you should always do, double check your system documentation
2885 for the full story.
2886
2887 @node The @code{who} command
2888 @unnumberedsec The @code{who} command
2889
2890 The first program is the @code{who} command.  By itself, it generates a
2891 list of the users who are currently logged in.  Although I'm writing
2892 this on a single-user system, we'll pretend that several people are
2893 logged in:
2894
2895 @example
2896 $ who
2897 arnold   console Jan 22 19:57
2898 miriam   ttyp0   Jan 23 14:19(:0.0)
2899 bill     ttyp1   Jan 21 09:32(:0.0)
2900 arnold   ttyp2   Jan 23 20:48(:0.0)
2901 @end example
2902
2903 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
2904 There are three people logged in, and I am logged in twice.  On traditional
2905 Unix systems, user names are never more than eight characters long.  This
2906 little bit of trivia will be useful later.  The output of @code{who} is nice,
2907 but the data is not all that exciting.
2908
2909 @node The @code{cut} command
2910 @unnumberedsec The @code{cut} command
2911
2912 The next program we'll look at is the @code{cut} command.  This program
2913 cuts out columns or fields of input data.  For example, we can tell it
2914 to print just the login name and full name from the @file{/etc/passwd}
2915 file.  The @file{/etc/passwd} file has seven fields, separated by
2916 colons:
2917
2918 @example
2919 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
2920 @end example
2921
2922 To get the first and fifth fields, we would use @code{cut} like this:
2923
2924 @example
2925 $ cut -d: -f1,5 /etc/passwd
2926 root:Operator
2927 @dots{}
2928 arnold:Arnold D. Robbins
2929 miriam:Miriam A. Robbins
2930 @dots{}
2931 @end example
2932
2933 With the @samp{-c} option, @code{cut} will cut out specific characters
2934 (i.e., columns) in the input lines.  This command looks like it might be
2935 useful for data filtering.
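
As a quick check of both modes, here is a sketch that feeds @code{cut} the
sample lines shown earlier instead of the real files:

@example
# Field mode: fields 1 and 5 of a colon-separated line.
printf 'arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh\n' |
  cut -d: -f1,5
# prints: arnold:Arnold D. Robbins

# Character mode: the first eight columns of a line of who output.
# Note that the two padding blanks after "arnold" are kept.
printf 'arnold   console Jan 22 19:57\n' | cut -c1-8
@end example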
2936
2937
2938 @node The @code{sort} command
2939 @unnumberedsec The @code{sort} command
2940
2941 Next we'll look at the @code{sort} command.  This is one of the most
2942 powerful commands on a Unix-style system; one that you will often find
2943 yourself using when setting up fancy data plumbing. The @code{sort}
2944 command reads and sorts each file named on the command line.  It then
2945 merges the sorted data and writes it to standard output.  It will read
2946 standard input if no files are given on the command line (thus
2947 making it into a filter).  The sort is based on the machine collating
2948 sequence (@sc{ASCII}) or based on  user-supplied ordering criteria.
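
A brief sketch of both kinds of ordering, using made-up input; setting
@code{LC_ALL=C} is an assumption made here to force plain @sc{ASCII}
collation on systems where a locale is configured:

@example
# In ASCII, upper case letters sort before lower case ones.
printf 'bill\nMiriam\narnold\n' | LC_ALL=C sort
# prints Miriam, arnold, bill (one per line)

# A user-supplied criterion: sort passwd-style lines numerically
# on the third colon-separated field.
printf 'b:x:30\na:x:200\nc:x:4\n' | sort -t: -k3,3n
# prints c:x:4, b:x:30, a:x:200 (one per line)
@end example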
2949
2950
2951 @node The @code{uniq} command
2952 @unnumberedsec The @code{uniq} command
2953
2954 Finally (at least for now), we'll look at the @code{uniq} program.  When
2955 sorting data, you will often end up with duplicate lines, lines that
2956 are identical.  Usually, all you need is one instance of each line.
2957 This is where @code{uniq} comes in. The @code{uniq} program reads its
2958 standard input, which it expects to be sorted.  It only prints out one
2959 copy of each duplicated line.  It does have several options.  Later on,
2960 we'll use the @samp{-c} option, which prints each unique line, preceded
2961 by a count of the number of times that line occurred in the input.
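
The reason the input should be sorted is that @code{uniq} only compares
adjacent lines.  A small sketch with made-up data:

@example
# Unsorted: the two "arnold" lines are not adjacent, so both survive.
printf 'arnold\nbill\narnold\n' | uniq

# Sorted first: the duplicates are adjacent and collapse into one.
printf 'arnold\nbill\narnold\n' | sort | uniq

# With -c, each line is preceded by its count (2 arnold, 1 bill).
printf 'arnold\nbill\narnold\n' | sort | uniq -c
@end example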
2962
2963
2964 @node Putting the tools together
2965 @unnumberedsec Putting the tools together
2966
2967 Now, let's suppose this is a large BBS system with dozens of users
2968 logged in.  The management wants the SysOp to write a program that will
2969 generate a sorted list of logged in users.  Furthermore, even if a user
2970 is logged in multiple times, his or her name should only show up in the
2971 output once.
2972
2973 The SysOp could sit down with the system documentation and write a C
2974 program that did this. It would take perhaps a couple of hundred lines
2975 of code and about two hours to write it, test it, and debug it.
2976 However, knowing the software toolbox, the SysOp can instead start out
2977 by generating just a list of logged on users:
2978
2979 @example
2980 $ who | cut -c1-8
2981 arnold
2982 miriam
2983 bill
2984 arnold
2985 @end example
2986
2987 Next, sort the list:
2988
2989 @example
2990 $ who | cut -c1-8 | sort
2991 arnold
2992 arnold
2993 bill
2994 miriam
2995 @end example
2996
2997 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
2998
2999 @example
3000 $ who | cut -c1-8 | sort | uniq
3001 arnold
3002 bill
3003 miriam
3004 @end example
3005
3006 The @code{sort} command actually has a @samp{-u} option that does what
3007 @code{uniq} does. However, @code{uniq} has other uses for which one
3008 cannot substitute @samp{sort -u}.
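
To make the comparison concrete, here is a sketch using the login names
from the example above; @samp{-d} is one of the @code{uniq} options that
@samp{sort -u} cannot imitate:

@example
# Same result as sort | uniq.
printf 'arnold\nmiriam\nbill\narnold\n' | sort -u

# Print only the names that occur more than once (here, arnold).
printf 'arnold\nmiriam\nbill\narnold\n' | sort | uniq -d
@end example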
3009
3010 The SysOp puts this pipeline into a shell script, and makes it available for
3011 all the users on the system:
3012
3013 @example
3014 # cat > /usr/local/bin/listusers
3015 who | cut -c1-8 | sort | uniq
3016 ^D
3017 # chmod +x /usr/local/bin/listusers
3018 @end example
3019
3020 There are four major points to note here.  First, with just four
3021 programs, on one command line, the SysOp was able to save about two
3022 hours worth of work.  Furthermore, the shell pipeline is just about as
3023 efficient as the C program would be, and it is much more efficient in
3024 terms of programmer time.  People time is much more expensive than
3025 computer time, and in our modern ``there's never enough time to do
3026 everything'' society, saving two hours of programmer time is no mean
3027 feat.
3028
3029 Second, it is also important to emphasize that with the
3030 @emph{combination} of the tools, it is possible to do a special
3031 purpose job never imagined by the authors of the individual programs.
3032
3033 Third, it is also valuable to build up your pipeline in stages, as we did here.
3034 This allows you to view the data at each stage in the pipeline, which helps
3035 you acquire the confidence that you are indeed using these tools correctly.
3036
3037 Finally, by bundling the pipeline in a shell script, other users can use
3038 your command, without having to remember the fancy plumbing you set up for
3039 them. In terms of how you run them, shell scripts and compiled programs are
3040 indistinguishable.
3041
3042 After the previous warm-up exercise, we'll look at two additional, more
3043 complicated pipelines.  For them, we need to introduce two more tools.
3044
3045 The first is the @code{tr} command, which stands for ``transliterate.''
3046 The @code{tr} command works on a character-by-character basis, changing
3047 characters. Normally it is used for things like mapping upper case to
3048 lower case:
3049
3050 @example
3051 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
3052 this example has mixed case!
3053 @end example
3054
3055 There are several options of interest:
3056
3057 @table @samp
3058 @item -c
3059 work on the complement of the listed characters, i.e.,
3060 operations apply to characters not in the given set
3061
3062 @item -d
3063 delete characters in the first set from the output
3064
3065 @item -s
3066 squeeze repeated characters in the output into just one character.
3067 @end table
3068
3069 We will be using all three options in a moment.
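
Before doing so, here is a small sketch of each option on its own, using a
made-up input string:

@example
# -d: delete every "l".
echo 'Hello,   World!' | tr -d 'l'
# prints: Heo,   Word!

# -s: squeeze each run of blanks into a single blank.
echo 'Hello,   World!' | tr -s ' '
# prints: Hello, World!

# -c with -d: delete everything that is NOT a letter or a newline.
echo 'Hello,   World!' | tr -cd 'A-Za-z\012'
# prints: HelloWorld
@end example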
3070
3071 The other command we'll look at is @code{comm}.  The @code{comm}
3072 command takes two sorted input files as input data, and prints out the
3073 files' lines in three columns.  The output columns are the data lines
3074 unique to the first file, the data lines unique to the second file, and
3075 the data lines that are common to both.  The @samp{-1}, @samp{-2}, and
3076 @samp{-3} command line options omit the respective columns. (This is
3077 non-intuitive and takes a little getting used to.)  For example:
3078
3079 @example
3080 $ cat f1
3081 11111
3082 22222
3083 33333
3084 44444
3085 $ cat f2
3086 00000
3087 22222
3088 33333
3089 55555
3090 $ comm f1 f2
3091         00000
3092 11111
3093                 22222
3094                 33333
3095 44444
3096         55555
3097 @end example
3098
3099 A single dash given as a file name tells @code{comm} to read standard
3100 input instead of a regular file; we will use this later.
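
Here is a sketch of the column-suppressing options and of the dash
convention; it recreates @file{f1} and @file{f2} in a scratch directory
(the use of @code{mktemp} is an assumption of this sketch):

@example
dir=$(mktemp -d)
printf '11111\n22222\n33333\n44444\n' > "$dir/f1"
printf '00000\n22222\n33333\n55555\n' > "$dir/f2"

# Suppress columns 1 and 2: only lines common to both files remain.
comm -12 "$dir/f1" "$dir/f2"
# prints 22222 and 33333

# Suppress columns 2 and 3: only lines unique to f1 remain.
comm -23 "$dir/f1" "$dir/f2"
# prints 11111 and 44444

# The same, with f1 arriving on standard input via the dash.
comm -23 - "$dir/f2" < "$dir/f1"
@end example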
3101
3102 Now we're ready to build a fancy pipeline.  The first application is a word
3103 frequency counter.  This helps an author determine if he or she is over-using
3104 certain words.
3105
3106 The first step is to change the case of all the letters in our input file
3107 to one case.  ``The'' and ``the'' are the same word when doing counting.
3108
3109 @example
3110 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
3111 @end example
3112
3113 The next step is to get rid of punctuation.  Quoted words and unquoted words
3114 should be treated identically; it's easiest to just get the punctuation out of
3115 the way.
3116
3117 @example
3118 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
3119 @end example
3120
3121 The second @code{tr} command operates on the complement of the listed
3122 characters, which are all the letters, the digits, the underscore, and
3123 the blank.  The @samp{\012} represents the newline character; it has to
3124 be left alone.  (The ASCII TAB character should also be included for
3125 good measure in a production script.)
3126
3127 At this point, we have data consisting of words separated by blank space.
3128 The words only contain alphanumeric characters (and the underscore).  The
3129 next step is to break the data apart so that we have one word per line.  This
3130 makes the counting operation much easier, as we will see shortly.
3131
3132 @example
3133 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3134 > tr -s '[ ]' '\012' | ...
3135 @end example
3136
3137 This command turns blanks into newlines.  The @samp{-s} option squeezes
3138 multiple newline characters in the output into just one.  This helps us
3139 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
3140 This is what the shell prints when it notices you haven't finished
3141 typing in all of a command.)
3142
3143 We now have data consisting of one word per line, no punctuation, all one
3144 case.  We're ready to count each word:
3145
3146 @example
3147 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3148 > tr -s '[ ]' '\012' | sort | uniq -c | ...
3149 @end example
3150
3151 At this point, the data might look something like this:
3152
3153 @example
3154   60 a
3155    2 able
3156    6 about
3157    1 above
3158    2 accomplish
3159    1 acquire
3160    1 actually
3161    2 additional
3162 @end example
3163
3164 The output is sorted by word, not by count!  What we want is the most
3165 frequently used words first.  Fortunately, this is easy to accomplish,
3166 with the help of two more @code{sort} options:
3167
3168 @table @samp
3169 @item -n
3170 do a numeric sort, not an ASCII one
3171
3172 @item -r
3173 reverse the order of the sort
3174 @end table
3175
3176 The final pipeline looks like this:
3177
3178 @example
3179 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3180 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
3181  156 the
3182   60 a
3183   58 to
3184   51 of
3185   51 and
3186  ...
3187 @end example
3188
3189 Whew!  That's a lot to digest.  Yet, the same principles apply. With six
3190 commands, on two lines (really one long one split for convenience), we've
3191 created a program that does something interesting and useful, in much
3192 less time than we could have written a C program to do the same thing.
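
To try the pipeline without a copy of @file{whats.gnu}, here is a sketch
that feeds it a single made-up sentence instead:

@example
printf 'The cat saw the dog.  "The dog" ran!\n' |
  tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
  tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
# The first two output lines count "the" 3 times and "dog" twice;
# the remaining words each occur once.
@end example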
3193
3194 A minor modification to the above pipeline can give us a simple spelling
3195 checker!  To determine if you've spelled a word correctly, all you have to
3196 do is look it up in a dictionary.  If it is not there, then chances are
3197 that your spelling is incorrect.  So, we need a dictionary.  If you
3198 have the Slackware Linux distribution, you have the file
3199 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
3200 dictionary.
3201
3202 Now, how to compare our file with the dictionary?  As before, we generate
3203 a sorted list of words, one per line:
3204
3205 @example
3206 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3207 > tr -s '[ ]' '\012' | sort -u | ...
3208 @end example
3209
3210 Now, all we need is a list of words that are @emph{not} in the
3211 dictionary.  Here is where the @code{comm} command comes in.
3212
3213 @example
3214 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3215 > tr -s '[ ]' '\012' | sort -u |
3216 > comm -23 - /usr/lib/ispell/ispell.words
3217 @end example
3218
3219 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
3220 dictionary (the second file), and lines that are in both files.  Lines
3221 only in the first file (standard input, our stream of words), are
3222 words that are not in the dictionary.  These are likely candidates for
3223 spelling errors.  This pipeline was the first cut at a production
3224 spelling checker on Unix.
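
The same idea can be tried on a toy scale.  In this sketch, the five-word
dictionary and the misspelled sentence are made up, and @code{mktemp} is
assumed to be available for the scratch directory:

@example
dir=$(mktemp -d)
# A tiny dictionary, already sorted, one word per line.
printf 'a\nis\nsimple\ntest\nthis\n' > "$dir/words"

printf 'This is a simpel test.\n' |
  tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
  tr -s '[ ]' '\012' | sort -u |
  comm -23 - "$dir/words"
# prints: simpel
@end example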
3225
3226 There are some other tools that deserve brief mention.
3227
3228 @table @code
3229 @item grep
3230 search files for text that matches a regular expression
3231
3232 @item egrep
3233 like @code{grep}, but with more powerful regular expressions
3234
3235 @item wc
3236 count lines, words, characters
3237
3238 @item tee
3239 a T-fitting for data pipes, copies data to files and to standard output
3240
3241 @item sed
3242 the stream editor, an advanced tool
3243
3244 @item awk
3245 a data manipulation language, another advanced tool
3246 @end table
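
As a taste of three of these, here is a sketch on made-up data (the file
name @file{all-logins} and the use of @code{mktemp} are assumptions of the
example):

@example
dir=$(mktemp -d)

# tee keeps a copy of the raw list while the data flows on to grep,
# which selects the lines that are exactly "arnold"; wc counts them.
printf 'arnold\nmiriam\nbill\narnold\n' |
  tee "$dir/all-logins" | grep '^arnold$' | wc -l
# prints 2

# The copy saved by tee still holds all four lines.
wc -l < "$dir/all-logins"
# prints 4
@end example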
3247
3248 The software tools philosophy also espoused the following bit of
3249 advice: ``Let someone else do the hard part.'' This means, take
3250 something that gives you most of what you need, and then massage it the
3251 rest of the way until it's in the form that you want.
3252
3253 To summarize:
3254
3255 @enumerate 1
3256 @item
3257 Each program should do one thing well. No more, no less.
3258
3259 @item
3260 Combining programs with appropriate plumbing leads to results where
3261 the whole is greater than the sum of the parts.  It also leads to novel
3262 uses of programs that the authors might never have imagined.
3263
3264 @item
3265 Programs should never print extraneous header or trailer data, since these
3266 could get sent on down a pipeline. (A point we didn't mention earlier.)
3267
3268 @item
3269 Let someone else do the hard part.
3270
3271 @item
3272 Know your toolbox! Use each program appropriately. If you don't have an
3273 appropriate tool, build one.
3274 @end enumerate
3275
3276 As of this writing, all the programs we've discussed are available via
3277 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
3278 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was
3279 current when this column was written. Check the nearest GNU archive for
3280 the current version.}
3281
3282 None of what I have presented in this column is new. The Software Tools
3283 philosophy was first introduced in the book @cite{Software Tools},
3284 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
3285 0-201-03669-X).   This book showed how to write and use software
3286 tools.   It was written in 1976, using a preprocessor for FORTRAN named
3287 @code{ratfor} (RATional FORtran).  At the time, C was not as ubiquitous
3288 as it is now; FORTRAN was.  The last chapter presented a @code{ratfor}
3289 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
3290 awful lot like C; if you know C, you won't have any problem following
3291 the code.
3292
3293 In 1981, the book was updated and made available as @cite{Software
3294 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7).  Both books
3295 remain in print, and are well worth reading if you're a programmer.
3296 They certainly made a major change in how I view programming.
3297
3298 Initially, the programs in both books were available (on 9-track tape)
3299 from Addison-Wesley.  Unfortunately, this is no longer the case,
3300 although you might be able to find copies floating around the Internet.
3301 For a number of years, there was an active Software Tools Users Group,
3302 whose members had ported the original @code{ratfor} programs to essentially
3303 every computer system with a FORTRAN compiler.  The popularity of the
3304 group waned in the middle '80s as Unix began to spread beyond universities.
3305
3306 With the current proliferation of GNU code and other clones of Unix programs,
3307 these programs now receive little attention; modern C versions are
3308 much more efficient and do more than these programs do.  Nevertheless, as
3309 an exposition of good programming style, and evangelism for a still-valuable
3310 philosophy, these books are unparalleled, and I recommend them highly.
3311
3312 Acknowledgment: I would like to express my gratitude to Brian Kernighan
3313 of Bell Labs, the original Software Toolsmith, for reviewing this column.
3314
3315
3316 @node Index
3317 @unnumbered Index
3318
3319 @printindex cp
3320
3321 @contents
3322 @bye
3323
3324 @c Local variables:
3325 @c texinfo-column-for-description: 32
3326 @c End: